# 6.1 Remove duplicate rows

To eliminate duplicate rows in a data frame, Spark provides the following methods:

- distinct(): returns a new data frame containing only the distinct rows of this data frame. It takes no arguments.
- dropDuplicates(subset): drops duplicate rows, comparing only the selected (one or more) columns. It takes an optional
      list of column names and returns a new data frame. Without an argument it compares all columns, like distinct().
- drop_duplicates(subset): an alias of dropDuplicates().

In [5]:
from pyspark.sql import SparkSession, DataFrame
import os

In [12]:
local = True  # set to False to run on the Kubernetes cluster
if local:
    spark=SparkSession.builder.master("local[4]").appName("RemoveDuplicates").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("RemoveDuplicates") \
                      .config("spark.kubernetes.container.image","inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1') \
                      .getOrCreate()                   
                            

In [11]:
! kubectl get pods

I1009 18:44:19.898071    1303 request.go:655] Throttling request took 1.034906341s, request: GET:https://kubernetes.default/apis/rbac.authorization.k8s.io/v1?timeout=32s
NAME                               READY   STATUS    RESTARTS   AGE
flume-test-agent-df8c5b944-vtjbx   1/1     Running   0          20d
jupyter-371471-7b5b79fc7f-7k4jj    1/1     Running   0          3d6h
kafka-client1                      1/1     Running   0          10d
kafka-server-0                     1/1     Running   0          20d
kafka-server-1                     1/1     Running   0          20d
kafka-server-2                     1/1     Running   0          20d
kafka-server-zookeeper-0           1/1     Running   0          20d


In [10]:
# Stop the session only when you are done: every cell below needs an active session.
# spark.stop()

In [17]:
data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]
columns = ["name", "department", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
print("Source data frame: ")
df.printSchema()
df.show(truncate=False)
df.count()

Source data frame: 
root
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------+----------+------+
|name   |department|salary|
+-------+----------+------+
|James  |Sales     |3000  |
|Michael|Sales     |4600  |
|Robert |Sales     |4100  |
|Maria  |Finance   |3000  |
|James  |Sales     |3000  |
|Scott  |Finance   |3300  |
|Jen    |Finance   |3900  |
|Jeff   |Marketing |3000  |
|Kumar  |Marketing |2000  |
|Saif   |Sales     |4100  |
+-------+----------+------+



10

## 6.1.1 Use distinct() to remove duplicates

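After calling distinct(), the result contains 9 rows instead of 10: one of the two identical James|Sales|3000 rows has been removed. Note that distinct() does not preserve the original row order.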


In [16]:
df_dedup=df.distinct()
df_dedup.show()
df_dedup.count()

+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|    Jen|   Finance|  3900|
|Michael|     Sales|  4600|
|  Scott|   Finance|  3300|
|  Kumar| Marketing|  2000|
|  James|     Sales|  3000|
| Robert|     Sales|  4100|
|   Jeff| Marketing|  3000|
|   Saif|     Sales|  4100|
|  Maria|   Finance|  3000|
+-------+----------+------+



9
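To see why the count drops from 10 to 9, the same rows can be checked in plain Python, without Spark. This is only a small model of the result of distinct(), not how Spark computes it; the names `rows`, `counts` and `duplicated` are introduced here for illustration.

```python
from collections import Counter

# Same rows as the source data frame above.
rows = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]

# Count how many times each complete row occurs.
counts = Counter(rows)

# Rows that occur more than once are the ones distinct() collapses.
duplicated = {row: n for row, n in counts.items() if n > 1}

print(duplicated)               # {('James', 'Sales', 3000): 2}
print(len(rows), len(counts))   # 10 9
```

Only one complete row is repeated, so exactly one row disappears. Robert and Saif share a department and a salary but differ in name, so distinct() keeps both.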

To deduplicate on only some columns with distinct(), call select() first. The result then contains only the selected columns.

In [19]:
df_dedup_part=df.select("department").distinct()
df_dedup_part.show()

+----------+
|department|
+----------+
|     Sales|
|   Finance|
| Marketing|
+----------+



## 6.1.2 Use dropDuplicates and drop_duplicates to remove duplicates

drop_duplicates() is just an alias of dropDuplicates(); they do exactly the same thing. Called without arguments, both behave like distinct().

In [21]:
df_drop=df.dropDuplicates()
df_drop.show()
df_drop.count()

+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|    Jen|   Finance|  3900|
|Michael|     Sales|  4600|
|  Scott|   Finance|  3300|
|  Kumar| Marketing|  2000|
|  James|     Sales|  3000|
| Robert|     Sales|  4100|
|   Jeff| Marketing|  3000|
|   Saif|     Sales|  4100|
|  Maria|   Finance|  3000|
+-------+----------+------+



9

In [22]:
df_drop=df.drop_duplicates()
df_drop.show()
df_drop.count()

+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|    Jen|   Finance|  3900|
|Michael|     Sales|  4600|
|  Scott|   Finance|  3300|
|  Kumar| Marketing|  2000|
|  James|     Sales|  3000|
| Robert|     Sales|  4100|
|   Jeff| Marketing|  3000|
|   Saif|     Sales|  4100|
|  Maria|   Finance|  3000|
+-------+----------+------+



9

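We can also pass a list of column names to remove rows that are duplicates with respect to those columns only. In the first example below, Robert and Saif share the same department and salary, so only one of them is kept.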

In [23]:
df_drop1=df.drop_duplicates(["department","salary"])
df_drop1.show()

+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|Michael|     Sales|  4600|
| Robert|     Sales|  4100|
|    Jen|   Finance|  3900|
|  Maria|   Finance|  3000|
|  Scott|   Finance|  3300|
|  Kumar| Marketing|  2000|
|  James|     Sales|  3000|
|   Jeff| Marketing|  3000|
+-------+----------+------+



In [24]:
df_drop2=df.drop_duplicates(["department"])
df_drop2.show()

+-----+----------+------+
| name|department|salary|
+-----+----------+------+
|James|     Sales|  3000|
|Maria|   Finance|  3000|
| Jeff| Marketing|  3000|
+-----+----------+------+



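Compared to distinct(), drop_duplicates() keeps all columns when removing duplicates based on a subset of columns, which makes it more flexible. Which row of each duplicate group survives depends on how the data is partitioned and ordered, so do not rely on it.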
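The difference between the two approaches can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's implementation: the helper names `select_distinct` and `drop_duplicates_model` are hypothetical, and the model keeps the first row seen for each key, which Spark does not promise.

```python
# Same rows and column names as the source data frame above.
COLUMNS = ("name", "department", "salary")
rows = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]


def select_distinct(rows, columns, subset):
    """Model of df.select(*subset).distinct(): project first, then deduplicate."""
    idx = [columns.index(c) for c in subset]
    return {tuple(row[i] for i in idx) for row in rows}


def drop_duplicates_model(rows, columns, subset):
    """Model of df.drop_duplicates(subset): one complete row per distinct key.

    Assumption of this sketch: the first row seen for a key is the one kept.
    """
    idx = [columns.index(c) for c in subset]
    kept = {}
    for row in rows:
        key = tuple(row[i] for i in idx)
        kept.setdefault(key, row)  # later rows with the same key are ignored
    return list(kept.values())


projected = select_distinct(rows, COLUMNS, ["department"])
full_rows = drop_duplicates_model(rows, COLUMNS, ["department"])

print(projected)   # three 1-tuples: only the department column survives
print(full_rows)   # three complete rows, one per department
```

With `["department", "salary"]` as the subset, the model returns 8 rows, matching the row count of df_drop1 above.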