## Removing Duplicate Rows
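`dropDuplicates()` in PySpark returns a new DataFrame with duplicate rows removed; the original DataFrame is unchanged. Pass a list of column names to compare rows on only those columns, so that one row is kept for each unique combination of values. Spark does not guarantee which of the matching rows is kept, so the surviving row is not necessarily the first occurrence in the input.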

### Links and Resources
- [dropDuplicates()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html)

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define Schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True)
])

# Sample Data (with Duplicates)
data = [
    (1, "Alice", "HR"),
    (2, "Bob", "IT"),
    (3, "Charlie", "Finance"),
    (1, "Alice", "HR"),  # Duplicate row
    (2, "Bob", "IT"),    # Duplicate row
    (4, "David", "HR"),
    (3, "Charlie", "Finance"),  # Duplicate row
    (5, "Alice", "Finance"),  # Same name, different department
    (6, "Bob", "HR")  # Same name, different department
]

# Create DataFrame
df = spark.createDataFrame(data, schema)

df.show()

In [0]:
# Remove rows that are identical across all columns, keeping one copy of each unique row.

df.dropDuplicates().show()

In [0]:
# Remove duplicates based on the "name" column only, keeping one row per unique name.
# Which row is kept for each name is not guaranteed.

df.dropDuplicates(["name"]).show()

In [0]:
# Remove duplicates based on the "name" and "department" columns,
# keeping one row per unique (name, department) combination.

df.dropDuplicates(["name", "department"]).show()