## PySpark distinct() and dropDuplicates() Usage

The PySpark `distinct()` method removes duplicate rows from a DataFrame, comparing all columns.  
`dropDuplicates()` removes duplicates based on a selected subset of one or more columns; called without arguments, it considers all columns, just like `distinct()`.
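
As a rough mental model, the difference between the two calls can be sketched in plain Python with tuples. This is not how Spark implements them: the helper names `distinct_rows` and `drop_duplicates_on` are hypothetical, and the sketch always keeps the first row it sees for each key, which is an assumption of this example rather than documented Spark behavior.

```python
# Plain-Python model of the two operations; illustrative only.
rows = [
    ("James", "Sales", 3200),
    ("Maria", "Finance", 3160),
    ("James", "Sales", 3200),    # exact duplicate of the first row
    ("Jeff", "Marketing", 3200),
    ("Susie", "Sales", 3200),    # same department and salary as James
]

def distinct_rows(rows):
    """Keep one copy of each row, comparing all columns."""
    return list(dict.fromkeys(rows))

def drop_duplicates_on(rows, key_indexes):
    """Keep one row per combination of the selected columns."""
    kept = {}
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        kept.setdefault(key, row)  # first row seen wins in this sketch
    return list(kept.values())

all_column_result = distinct_rows(rows)
subset_result = drop_duplicates_on(rows, key_indexes=(1, 2))

print(len(all_column_result))  # 4: only the repeated James row is dropped
print(len(subset_result))      # 3: Susie shares (Sales, 3200) with James
```

Comparing all columns removes only exact copies, while comparing a subset can also discard rows that differ in the columns left out of the key.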

In [0]:
dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this command.

#### Load libraries

In [0]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField, ArrayType, MapType, DoubleType
from pyspark.sql.functions import lit, col, expr, when

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('PySpark distinct() and dropDuplicates() Usage').getOrCreate()

In [0]:
data = [
  ('James', 'Sales', 3200),
  ('Michael', 'Sales', 4650),
  ('Robert', 'Sales', 4300),
  ('Maria', 'Finance', 3160),
  ('James', 'Sales', 3200),
  ('Scott', 'Finance', 3300),
  ('Jen', 'Finance', 3900),
  ('Jeff', 'Marketing', 3200),
  ('Curtis', 'Marketing', 2000),
  ('Susie', 'Sales', 4300)
]

columns = ['employee_name', 'department', 'salary']

df = spark.createDataFrame(data=data, schema=columns)

df.printSchema()
df.show(truncate=False)
print(f'Row count: {df.count()}')

#### Get distinct rows

In [0]:
distinctDF = df.distinct()
print(f'Distinct count: {distinctDF.count()}')
distinctDF.show(truncate=False)

In [0]:
distinctDF2 = df.dropDuplicates()
print(f'Distinct count: {distinctDF2.count()}')
distinctDF2.show(truncate=False)

#### Drop duplicates based on selected columns

In [0]:
dropDisDupDF = df.dropDuplicates(['department','salary'])
print(f'Distinct count of department & salary: {dropDisDupDF.count()}')
dropDisDupDF.show(truncate=False)

#### The end of the notebook