# 🧹 DATA CLEANING WITH PYSPARK

---

## 📋 **OBJECTIVES**

1. Handle missing values (null, NaN)
2. Remove duplicates
3. Data type conversions
4. String cleaning & transformations
5. Outlier detection

---

## 🔧 **SETUP SPARK SESSION**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd

spark = SparkSession.builder \
    .appName("DataCleaning") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

print("✅ Spark Session Created")
print(f"Spark Version: {spark.version}")
print(f"Master: {spark.sparkContext.master}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/04 17:50:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark Session Created
Spark Version: 3.5.1
Master: spark://spark-master:7077


---

## 📊 **1. CREATE DIRTY DATASET**

Create a dataset with many problems to practice cleaning on.

In [2]:
# Create dirty data with many problems
dirty_data = [
    ("CUST001", "John Doe", "john@email.com", 25, 50000.0, "2024-01-01", "USA"),
    ("CUST002", "Jane Smith", "JANE@EMAIL.COM", 30, 60000.0, "2024-01-02", "UK"),
    ("CUST003", "  Bob Johnson  ", "bob@email.com", None, 55000.0, "2024-01-03", "Canada"),  # Missing age
    ("CUST004", "Alice Brown", None, 28, 70000.0, "2024-01-04", "USA"),  # Missing email
    ("CUST005", "Charlie Wilson", "charlie@email.com", 35, None, "2024-01-05", "UK"),  # Missing salary
    ("CUST001", "John Doe", "john@email.com", 25, 50000.0, "2024-01-01", "USA"),  # Duplicate
    ("CUST006", None, "david@email.com", 40, 80000.0, "2024-01-06", "Canada"),  # Missing name
    ("CUST007", "Eve Davis", "eve@email.com", -5, 90000.0, "2024-01-07", "USA"),  # Invalid age
    ("CUST008", "Frank Miller", "frank@email.com", 150, 100000.0, "2024-01-08", "UK"),  # Outlier age
    ("CUST009", "Grace Lee", "grace@email.com", 32, -10000.0, "2024-01-09", "Canada"),  # Invalid salary
    ("CUST010", "Henry Taylor", "HENRY@EMAIL.COM", 29, 65000.0, "invalid-date", "USA"),  # Invalid date
    ("CUST011", "Ivy Anderson", "ivy@email.com", 27, 58000.0, "2024-01-11", None),  # Missing country
    ("CUST012", "Jack Thomas", "jack@email.com", 33, 72000.0, "2024-01-12", "  UK  "),  # Whitespace
    ("CUST013", "KAREN JACKSON", "karen@email.com", 31, 68000.0, "2024-01-13", "usa"),  # Case inconsistency
    ("CUST014", "Leo White", "leo@email.com", 26, 54000.0, "2024-01-14", "Canada"),
    ("CUST015", "Mia Harris", "mia@email.com", None, None, None, None),  # All nulls except ID
]

schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True),
    StructField("registration_date", StringType(), True),
    StructField("country", StringType(), True)
])

df = spark.createDataFrame(dirty_data, schema)

print("📊 Original Dirty Data:")
df.show(20, truncate=False)
print(f"\nTotal rows: {df.count()}")

📊 Original Dirty Data:


                                                                                

+-----------+---------------+-----------------+----+--------+-----------------+-------+
|customer_id|name           |email            |age |salary  |registration_date|country|
+-----------+---------------+-----------------+----+--------+-----------------+-------+
|CUST001    |John Doe       |john@email.com   |25  |50000.0 |2024-01-01       |USA    |
|CUST002    |Jane Smith     |JANE@EMAIL.COM   |30  |60000.0 |2024-01-02       |UK     |
|CUST003    |  Bob Johnson  |bob@email.com    |NULL|55000.0 |2024-01-03       |Canada |
|CUST004    |Alice Brown    |NULL             |28  |70000.0 |2024-01-04       |USA    |
|CUST005    |Charlie Wilson |charlie@email.com|35  |NULL    |2024-01-05       |UK     |
|CUST001    |John Doe       |john@email.com   |25  |50000.0 |2024-01-01       |USA    |
|CUST006    |NULL           |david@email.com  |40  |80000.0 |2024-01-06       |Canada |
|CUST007    |Eve Davis      |eve@email.com    |-5  |90000.0 |2024-01-07       |USA    |
|CUST008    |Frank Miller   |fra




Total rows: 16


                                                                                

---

## 🔍 **2. DATA PROFILING**

Analyze the data to understand its problems.

In [3]:
# 2.1 Schema & Data Types
print("📋 SCHEMA:")
df.printSchema()

# 2.2 Summary Statistics
print("\n📊 SUMMARY STATISTICS:")
df.describe().show()

# 2.3 Count nulls per column
print("\n❌ NULL COUNTS:")
null_counts = df.select([
    count(when(col(c).isNull(), c)).alias(c)
    for c in df.columns
])
null_counts.show()

# 2.4 Duplicate check
print("\n🔄 DUPLICATE CHECK:")
total_rows = df.count()
distinct_rows = df.distinct().count()
duplicates = total_rows - distinct_rows
print(f"Total rows: {total_rows}")
print(f"Distinct rows: {distinct_rows}")
print(f"Duplicates: {duplicates}")

# 2.5 Value counts per column
print("\n📈 VALUE COUNTS (Country):")
df.groupBy("country").count().orderBy(desc("count")).show()

📋 SCHEMA:
root
 |-- customer_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: double (nullable = true)
 |-- registration_date: string (nullable = true)
 |-- country: string (nullable = true)


📊 SUMMARY STATISTICS:


26/01/04 17:50:45 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+-------+-----------+---------------+---------------+------------------+-----------------+-----------------+-------+
|summary|customer_id|           name|          email|               age|           salary|registration_date|country|
+-------+-----------+---------------+---------------+------------------+-----------------+-----------------+-------+
|  count|         16|             15|             15|                14|               14|               15|     14|
|   mean|       NULL|           NULL|           NULL|36.142857142857146|61571.42857142857|             NULL|   NULL|
| stddev|       NULL|           NULL|           NULL| 34.32392559326319| 25364.1609232527|             NULL|   NULL|
|    min|    CUST001|  Bob Johnson  |HENRY@EMAIL.COM|                -5|         -10000.0|       2024-01-01|   UK  |
|    max|    CUST015|     Mia Harris|  mia@email.com|               150|         100000.0|     invalid-date|    usa|
+-------+-----------+---------------+---------------+-----------

---

## 🧹 **3. HANDLE MISSING VALUES**

### **3.1 Identify Missing Values**

In [4]:
# Show rows with a null in any checked column (registration_date is not checked here)
print("❌ ROWS WITH NULL VALUES:")
df_with_nulls = df.filter(
    col("name").isNull() | 
    col("email").isNull() | 
    col("age").isNull() | 
    col("salary").isNull() | 
    col("country").isNull()
)
df_with_nulls.show(truncate=False)
print(f"Rows with nulls: {df_with_nulls.count()}")

❌ ROWS WITH NULL VALUES:
+-----------+---------------+-----------------+----+-------+-----------------+-------+
|customer_id|name           |email            |age |salary |registration_date|country|
+-----------+---------------+-----------------+----+-------+-----------------+-------+
|CUST003    |  Bob Johnson  |bob@email.com    |NULL|55000.0|2024-01-03       |Canada |
|CUST004    |Alice Brown    |NULL             |28  |70000.0|2024-01-04       |USA    |
|CUST005    |Charlie Wilson |charlie@email.com|35  |NULL   |2024-01-05       |UK     |
|CUST006    |NULL           |david@email.com  |40  |80000.0|2024-01-06       |Canada |
|CUST011    |Ivy Anderson   |ivy@email.com    |27  |58000.0|2024-01-11       |NULL   |
|CUST015    |Mia Harris     |mia@email.com    |NULL|NULL   |NULL             |NULL   |
+-----------+---------------+-----------------+----+-------+-----------------+-------+

Rows with nulls: 6


### **3.2 Drop Rows with Nulls**

In [5]:
# Strategy 1: Drop rows with ANY null
df_drop_any = df.dropna(how="any")
print(f"✅ Drop ANY null: {df.count()} → {df_drop_any.count()} rows")

# Strategy 2: Drop rows with ALL nulls
df_drop_all = df.dropna(how="all")
print(f"✅ Drop ALL null: {df.count()} → {df_drop_all.count()} rows")

# Strategy 3: Drop rows with nulls in specific columns
df_drop_subset = df.dropna(subset=["customer_id", "email"])
print(f"✅ Drop null in [customer_id, email]: {df.count()} → {df_drop_subset.count()} rows")

# Strategy 4: Drop rows with fewer than N non-null values
df_drop_thresh = df.dropna(thresh=5)  # Keep rows with at least 5 non-null values
print(f"✅ Drop rows with < 5 non-nulls: {df.count()} → {df_drop_thresh.count()} rows")

✅ Drop ANY null: 16 → 10 rows
✅ Drop ALL null: 16 → 16 rows
✅ Drop null in [customer_id, email]: 16 → 15 rows
✅ Drop rows with < 5 non-nulls: 16 → 15 rows
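
The `thresh` rule is easy to misread, so here is a minimal plain-Python sketch of the same counting logic (not Spark code). The helper names `non_null_count` and `keep_with_thresh` are hypothetical and exist only for this illustration; the two rows are copied from the dirty dataset.

```python
from typing import Optional, Sequence


def non_null_count(row: Sequence[Optional[object]]) -> int:
    """Count the values in a row that are not None."""
    return sum(value is not None for value in row)


def keep_with_thresh(row: Sequence[Optional[object]], thresh: int) -> bool:
    """Mirror the thresh rule: keep a row with at least `thresh` non-null values."""
    return non_null_count(row) >= thresh


# Two rows copied from the dirty dataset
cust003 = ("CUST003", "  Bob Johnson  ", "bob@email.com", None, 55000.0, "2024-01-03", "Canada")
cust015 = ("CUST015", "Mia Harris", "mia@email.com", None, None, None, None)

print(non_null_count(cust003), keep_with_thresh(cust003, 5))  # 6 True
print(non_null_count(cust015), keep_with_thresh(cust015, 5))  # 3 False
```

Only CUST015 falls below five non-null values, which is consistent with the 16 → 15 result for Strategy 4.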


### **3.3 Fill Missing Values**

In [6]:
# Strategy 1: Fill with constant values
df_fill_const = df.fillna({
    "name": "Unknown",
    "email": "no-email@example.com",
    "age": 0,
    "salary": 0.0,
    "country": "Unknown"
})

print("✅ FILL WITH CONSTANTS:")
df_fill_const.show(truncate=False)

# Strategy 2: Fill with mean/median
from pyspark.sql.functions import mean, median

# Calculate mean age and salary
stats = df.select(
    mean("age").alias("mean_age"),
    mean("salary").alias("mean_salary")
).collect()[0]

mean_age = stats["mean_age"]
mean_salary = stats["mean_salary"]

print(f"\n📊 Mean age: {mean_age:.2f}")
print(f"📊 Mean salary: {mean_salary:.2f}")

df_fill_mean = df.fillna({
    "age": int(mean_age),
    "salary": mean_salary
})

print("\n✅ FILL WITH MEAN:")
df_fill_mean.show(truncate=False)

# Strategy 3: Fill with mode (most frequent non-null value)
# Exclude nulls first; otherwise the null group could be picked as the mode
mode_country = df.filter(col("country").isNotNull()) \
    .groupBy("country").count() \
    .orderBy(desc("count")) \
    .first()["country"]

print(f"\n📊 Mode country: {mode_country}")

df_fill_mode = df.fillna({"country": mode_country})

print("\n✅ FILL WITH MODE:")
df_fill_mode.show(truncate=False)

✅ FILL WITH CONSTANTS:
+-----------+---------------+--------------------+---+--------+-----------------+-------+
|customer_id|name           |email               |age|salary  |registration_date|country|
+-----------+---------------+--------------------+---+--------+-----------------+-------+
|CUST001    |John Doe       |john@email.com      |25 |50000.0 |2024-01-01       |USA    |
|CUST002    |Jane Smith     |JANE@EMAIL.COM      |30 |60000.0 |2024-01-02       |UK     |
|CUST003    |  Bob Johnson  |bob@email.com       |0  |55000.0 |2024-01-03       |Canada |
|CUST004    |Alice Brown    |no-email@example.com|28 |70000.0 |2024-01-04       |USA    |
|CUST005    |Charlie Wilson |charlie@email.com   |35 |0.0     |2024-01-05       |UK     |
|CUST001    |John Doe       |john@email.com      |25 |50000.0 |2024-01-01       |USA    |
|CUST006    |Unknown        |david@email.com     |40 |80000.0 |2024-01-06       |Canada |
|CUST007    |Eve Davis      |eve@email.com       |-5 |90000.0 |2024-01-07  
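
The mean used for filling is computed over every non-null age, including the invalid values -5 and 150. The following plain-Python sketch (standard library only, not Spark) reproduces the mean reported by `describe()` and compares it with the median. The 0 to 120 range used at the end is an assumed plausibility rule for illustration.

```python
from statistics import mean, median

# Non-null ages from the dirty dataset (the duplicate CUST001 row is included)
ages = [25, 30, 28, 35, 25, 40, -5, 150, 32, 29, 27, 33, 31, 26]

mean_age = mean(ages)
median_age = median(ages)

print(f"mean:   {mean_age:.2f}")  # 36.14, the same value describe() reports
print(f"median: {median_age}")    # 29.5

# Same calculation restricted to an assumed plausible range of 0 to 120
plausible = [a for a in ages if 0 <= a <= 120]
print(f"mean of plausible ages: {mean(plausible):.2f}")  # 30.08
```

The median barely moves when the two invalid ages are present, which is one reason to consider it over the mean when a column still contains outliers.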

### **3.4 Advanced: Fill with Forward/Backward Fill**

In [7]:
from pyspark.sql.window import Window

# Forward fill (fill with previous non-null value)
windowSpec = Window.orderBy("customer_id").rowsBetween(Window.unboundedPreceding, 0)

df_ffill = df.withColumn(
    "age_filled",
    last("age", ignorenulls=True).over(windowSpec)
)

print("✅ FORWARD FILL (age):")
df_ffill.select("customer_id", "age", "age_filled").show()

✅ FORWARD FILL (age):


26/01/04 17:50:58 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+-----------+----+----------+
|customer_id| age|age_filled|
+-----------+----+----------+
|    CUST001|  25|        25|
|    CUST001|  25|        25|
|    CUST002|  30|        30|
|    CUST003|NULL|        30|
|    CUST004|  28|        28|
|    CUST005|  35|        35|
|    CUST006|  40|        40|
|    CUST007|  -5|        -5|
|    CUST008| 150|       150|
|    CUST009|  32|        32|
|    CUST010|  29|        29|
|    CUST011|  27|        27|
|    CUST012|  33|        33|
|    CUST013|  31|        31|
|    CUST014|  26|        26|
|    CUST015|NULL|        26|
+-----------+----+----------+
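
What `last("age", ignorenulls=True)` does over a running window can be pictured with a plain-Python loop. This is a sketch of the logic only, not Spark code, and `forward_fill` is a hypothetical helper name.

```python
from typing import List, Optional


def forward_fill(values: List[Optional[int]]) -> List[Optional[int]]:
    """Replace each None with the most recent non-None value seen so far."""
    filled: List[Optional[int]] = []
    last_seen: Optional[int] = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# First five ages in customer_id order, as in the table above
print(forward_fill([25, 25, 30, None, 28]))  # [25, 25, 30, 30, 28]

# A leading None has no earlier value to copy, so it stays None
print(forward_fill([None, 40]))  # [None, 40]
```

The result depends entirely on row order, so forward fill only makes sense when the ordering column carries meaning, such as a timestamp.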



---

## 🔄 **4. REMOVE DUPLICATES**

In [8]:
# 4.1 Remove exact duplicates (all columns)
df_dedup_all = df.dropDuplicates()
print(f"✅ Remove exact duplicates: {df.count()} → {df_dedup_all.count()} rows")

# 4.2 Remove duplicates based on specific columns
df_dedup_id = df.dropDuplicates(["customer_id"])
print(f"✅ Remove duplicates by customer_id: {df.count()} → {df_dedup_id.count()} rows")

# 4.3 Keep first/last occurrence
from pyspark.sql.window import Window

# Keep first occurrence (earliest registration_date)
windowSpec = Window.partitionBy("customer_id").orderBy("registration_date")

df_keep_first = df.withColumn("row_num", row_number().over(windowSpec)) \
    .filter(col("row_num") == 1) \
    .drop("row_num")

print(f"\n✅ Keep first occurrence: {df.count()} → {df_keep_first.count()} rows")
df_keep_first.show(truncate=False)

# 4.4 Identify duplicates
print("\n🔍 IDENTIFY DUPLICATES:")
df_with_dup_flag = df.withColumn(
    "is_duplicate",
    count("*").over(Window.partitionBy("customer_id")) > 1
)

df_with_dup_flag.filter(col("is_duplicate")).show(truncate=False)

✅ Remove exact duplicates: 16 → 15 rows
✅ Remove duplicates by customer_id: 16 → 15 rows

✅ Keep first occurrence: 16 → 15 rows
+-----------+---------------+-----------------+----+--------+-----------------+-------+
|customer_id|name           |email            |age |salary  |registration_date|country|
+-----------+---------------+-----------------+----+--------+-----------------+-------+
|CUST001    |John Doe       |john@email.com   |25  |50000.0 |2024-01-01       |USA    |
|CUST002    |Jane Smith     |JANE@EMAIL.COM   |30  |60000.0 |2024-01-02       |UK     |
|CUST003    |  Bob Johnson  |bob@email.com    |NULL|55000.0 |2024-01-03       |Canada |
|CUST004    |Alice Brown    |NULL             |28  |70000.0 |2024-01-04       |USA    |
|CUST005    |Charlie Wilson |charlie@email.com|35  |NULL    |2024-01-05       |UK     |
|CUST006    |NULL           |david@email.com  |40  |80000.0 |2024-01-06       |Canada |
|CUST007    |Eve Davis      |eve@email.com    |-5  |90000.0 |2024-01
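
The `row_number()` pattern in 4.3 keeps, for each `customer_id`, the row that sorts first on `registration_date`. A rough plain-Python equivalent is sketched below. `keep_first` is a hypothetical helper, and the third sample row (an earlier CUST001 record) is invented purely to show the choice being made; it is not part of the dataset.

```python
from typing import Dict, List


def keep_first(rows: List[Dict[str, str]], key: str, order_by: str) -> List[Dict[str, str]]:
    """Keep one row per key value: the row with the smallest order_by value."""
    best: Dict[str, Dict[str, str]] = {}
    for row in rows:
        k = row[key]
        if k not in best or row[order_by] < best[k][order_by]:
            best[k] = row
    return list(best.values())


rows = [
    {"customer_id": "CUST001", "registration_date": "2024-01-01"},
    {"customer_id": "CUST002", "registration_date": "2024-01-02"},
    # Hypothetical earlier record for CUST001, added only for this illustration
    {"customer_id": "CUST001", "registration_date": "2023-12-15"},
]

deduped = keep_first(rows, "customer_id", "registration_date")
for row in deduped:
    print(row["customer_id"], row["registration_date"])
```

Comparing `yyyy-MM-dd` strings works here because that format sorts chronologically. Values in any other format would need to be parsed to dates before ordering.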

---

## 🔤 **5. STRING CLEANING**

In [9]:
# 5.1 Trim whitespace
df_trim = df.withColumn("name", trim(col("name"))) \
    .withColumn("country", trim(col("country")))

print("✅ TRIM WHITESPACE:")
df_trim.select("name", "country").show(truncate=False)

# 5.2 Convert to lowercase/uppercase
df_case = df_trim.withColumn("email", lower(col("email"))) \
    .withColumn("country", upper(col("country")))

print("\n✅ CASE CONVERSION:")
df_case.select("email", "country").show(truncate=False)

# 5.3 Title case for names
df_title = df_case.withColumn("name", initcap(col("name")))

print("\n✅ TITLE CASE:")
df_title.select("name").show(truncate=False)

# 5.4 Remove special characters
df_clean = df_title.withColumn(
    "name_clean",
    regexp_replace(col("name"), "[^a-zA-Z\\s]", "")
)

print("\n✅ REMOVE SPECIAL CHARACTERS:")
df_clean.select("name", "name_clean").show(truncate=False)

# 5.5 Extract parts of string
df_extract = df_clean.withColumn(
    "first_name",
    split(col("name"), " ").getItem(0)
).withColumn(
    "last_name",
    split(col("name"), " ").getItem(1)
)

print("\n✅ EXTRACT FIRST/LAST NAME:")
df_extract.select("name", "first_name", "last_name").show(truncate=False)

✅ TRIM WHITESPACE:
+--------------+-------+
|name          |country|
+--------------+-------+
|John Doe      |USA    |
|Jane Smith    |UK     |
|Bob Johnson   |Canada |
|Alice Brown   |USA    |
|Charlie Wilson|UK     |
|John Doe      |USA    |
|NULL          |Canada |
|Eve Davis     |USA    |
|Frank Miller  |UK     |
|Grace Lee     |Canada |
|Henry Taylor  |USA    |
|Ivy Anderson  |NULL   |
|Jack Thomas   |UK     |
|KAREN JACKSON |usa    |
|Leo White     |Canada |
|Mia Harris    |NULL   |
+--------------+-------+


✅ CASE CONVERSION:
+-----------------+-------+
|email            |country|
+-----------------+-------+
|john@email.com   |USA    |
|jane@email.com   |UK     |
|bob@email.com    |CANADA |
|NULL             |USA    |
|charlie@email.com|UK     |
|john@email.com   |USA    |
|david@email.com  |CANADA |
|eve@email.com    |USA    |
|frank@email.com  |UK     |
|grace@email.com  |CANADA |
|henry@email.com  |USA    |
|ivy@email.com    |NULL   |
|jack@email.com   |UK     |
|karen@e
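
Trimming and case conversion matter because they merge values that only look different. The plain-Python sketch below (not Spark code) applies the same trim-then-uppercase idea to the `country` column of the dirty dataset and counts the distinct values before and after. `normalize_country` is a hypothetical helper name.

```python
from collections import Counter
from typing import Optional


def normalize_country(value: Optional[str]) -> Optional[str]:
    """Trim surrounding whitespace and uppercase; leave missing values as None."""
    if value is None:
        return None
    return value.strip().upper()


# country values of the 16 rows, in dataset order
countries = ["USA", "UK", "Canada", "USA", "UK", "USA", "Canada", "USA", "UK",
             "Canada", "USA", None, "  UK  ", "usa", "Canada", None]

before = {c for c in countries if c is not None}
cleaned = [normalize_country(c) for c in countries]
after = Counter(c for c in cleaned if c is not None)

print(len(before))  # 5 distinct spellings before cleaning
print(len(after))   # 3 distinct countries after cleaning
print(after)
```

Any group-by or mode calculation run before this step treats `"  UK  "` and `"usa"` as separate countries.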

---

## 🔢 **6. DATA TYPE CONVERSIONS**

In [10]:
# 6.1 Convert string to date
df_date = df.withColumn(
    "registration_date_parsed",
    to_date(col("registration_date"), "yyyy-MM-dd")
)

print("✅ STRING TO DATE:")
df_date.select("registration_date", "registration_date_parsed").show()

# 6.2 Handle invalid dates
df_date_safe = df.withColumn(
    "registration_date_safe",
    when(
        to_date(col("registration_date"), "yyyy-MM-dd").isNotNull(),
        to_date(col("registration_date"), "yyyy-MM-dd")
    ).otherwise(lit(None).cast("date"))
)

print("\n✅ SAFE DATE CONVERSION:")
df_date_safe.select("registration_date", "registration_date_safe").show()

# 6.3 Convert to timestamp
df_timestamp = df_date_safe.withColumn(
    "registration_timestamp",
    to_timestamp(col("registration_date_safe"))
)

print("\n✅ DATE TO TIMESTAMP:")
df_timestamp.select("registration_date_safe", "registration_timestamp").show()

# 6.4 Cast numeric types
df_cast = df.withColumn("age_double", col("age").cast("double")) \
    .withColumn("salary_int", col("salary").cast("int"))

print("\n✅ NUMERIC CASTING:")
df_cast.select("age", "age_double", "salary", "salary_int").show()

✅ STRING TO DATE:
+-----------------+------------------------+
|registration_date|registration_date_parsed|
+-----------------+------------------------+
|       2024-01-01|              2024-01-01|
|       2024-01-02|              2024-01-02|
|       2024-01-03|              2024-01-03|
|       2024-01-04|              2024-01-04|
|       2024-01-05|              2024-01-05|
|       2024-01-01|              2024-01-01|
|       2024-01-06|              2024-01-06|
|       2024-01-07|              2024-01-07|
|       2024-01-08|              2024-01-08|
|       2024-01-09|              2024-01-09|
|     invalid-date|                    NULL|
|       2024-01-11|              2024-01-11|
|       2024-01-12|              2024-01-12|
|       2024-01-13|              2024-01-13|
|       2024-01-14|              2024-01-14|
|             NULL|                    NULL|
+-----------------+------------------------+


✅ SAFE DATE CONVERSION:
+-----------------+----------------------+
|registra
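
In the STRING TO DATE output above, `invalid-date` and the missing value both became NULL rather than raising an error. The same "parse or return nothing" behavior can be sketched in plain Python with the standard library. `parse_date_or_none` is a hypothetical helper, and `%Y-%m-%d` is the `strptime` counterpart of the `yyyy-MM-dd` pattern.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_or_none(text: Optional[str], fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Return a date when the text matches fmt, otherwise None."""
    if text is None:
        return None
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        # Unparseable input is reported as missing instead of stopping the run
        return None


print(parse_date_or_none("2024-01-01"))    # 2024-01-01
print(parse_date_or_none("invalid-date"))  # None
print(parse_date_or_none(None))            # None
```

Returning a missing value keeps the bad record visible to later null checks, which is usually easier to audit than substituting a made-up date.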

---

## 🚨 **7. HANDLE OUTLIERS**

In [11]:
# 7.1 Identify outliers using IQR method
from pyspark.sql.functions import percentile_approx

# Calculate Q1, Q3, IQR for age
quantiles = df.select(
    percentile_approx("age", 0.25).alias("Q1"),
    percentile_approx("age", 0.75).alias("Q3")
).collect()[0]

Q1 = quantiles["Q1"]
Q3 = quantiles["Q3"]
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("📊 AGE STATISTICS:")
print(f"Q1: {Q1}")
print(f"Q3: {Q3}")
print(f"IQR: {IQR}")
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")

# Flag outliers
df_outliers = df.withColumn(
    "is_age_outlier",
    (col("age") < lower_bound) | (col("age") > upper_bound)
)

print("\n🚨 OUTLIERS DETECTED:")
df_outliers.filter(col("is_age_outlier")).show(truncate=False)

# 7.2 Remove outliers
# Note: comparisons against NULL evaluate to NULL, so rows with a null age are dropped too
df_no_outliers = df.filter(
    (col("age") >= lower_bound) & (col("age") <= upper_bound)
)

print(f"\n✅ Remove outliers: {df.count()} → {df_no_outliers.count()} rows")

# 7.3 Cap outliers (winsorization)
df_capped = df.withColumn(
    "age_capped",
    when(col("age") < lower_bound, lower_bound)
    .when(col("age") > upper_bound, upper_bound)
    .otherwise(col("age"))
)

print("\n✅ CAP OUTLIERS:")
df_capped.select("customer_id", "age", "age_capped").show()

📊 AGE STATISTICS:
Q1: 26
Q3: 33
IQR: 7
Lower bound: 15.5
Upper bound: 43.5

🚨 OUTLIERS DETECTED:
+-----------+------------+---------------+---+--------+-----------------+-------+--------------+
|customer_id|name        |email          |age|salary  |registration_date|country|is_age_outlier|
+-----------+------------+---------------+---+--------+-----------------+-------+--------------+
|CUST007    |Eve Davis   |eve@email.com  |-5 |90000.0 |2024-01-07       |USA    |true          |
|CUST008    |Frank Miller|frank@email.com|150|100000.0|2024-01-08       |UK     |true          |
+-----------+------------+---------------+---+--------+-----------------+-------+--------------+


✅ Remove outliers: 16 → 12 rows

✅ CAP OUTLIERS:
+-----------+----+----------+
|customer_id| age|age_capped|
+-----------+----+----------+
|    CUST001|  25|      25.0|
|    CUST002|  30|      30.0|
|    CUST003|NULL|      NULL|
|    CUST004|  28|      28.0|
|    CUST005|  35|      35.0|
|    CUST001|  25|
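
The bounds printed above follow directly from Q1 = 26 and Q3 = 33. The plain-Python sketch below (not Spark code) redoes that arithmetic and also explains why 7.2 goes from 16 to 12 rows although only two ages are flagged. `iqr_bounds` is a hypothetical helper name.

```python
from typing import List, Optional, Tuple


def iqr_bounds(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_bounds(26, 33)
print(lower, upper)  # 15.5 43.5

# age values of the 16 rows, in dataset order (None marks a missing age)
ages: List[Optional[int]] = [25, 30, None, 28, 35, 25, 40, -5,
                             150, 32, 29, 27, 33, 31, 26, None]

outliers = [a for a in ages if a is not None and (a < lower or a > upper)]
kept = [a for a in ages if a is not None and lower <= a <= upper]
missing = [a for a in ages if a is None]

print(outliers)                 # [-5, 150]
print(len(kept), len(missing))  # 12 2
```

Two rows are outliers and two more have no age at all. A range filter drops both groups, so keep the missing ages explicitly if they should survive this step.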

---

## 🎯 **8. COMPLETE CLEANING PIPELINE**

In [12]:
# Complete cleaning pipeline
def clean_customer_data(df):
    """
    Complete data cleaning pipeline
    """
    
    # 1. Remove duplicates by customer_id
    df = df.dropDuplicates(["customer_id"])
    
    # 2. String cleaning
    df = df.withColumn("name", trim(col("name"))) \
        .withColumn("name", initcap(col("name"))) \
        .withColumn("email", lower(trim(col("email")))) \
        .withColumn("country", upper(trim(col("country"))))
    
    # 3. Handle missing values
    # Calculate mean for numeric columns
    stats = df.select(
        mean("age").alias("mean_age"),
        mean("salary").alias("mean_salary")
    ).collect()[0]
    
    # Get mode for country (ignore nulls so the null group cannot be picked)
    mode_country = df.filter(col("country").isNotNull()) \
        .groupBy("country").count() \
        .orderBy(desc("count")) \
        .first()["country"]
    
    df = df.fillna({
        "name": "Unknown",
        "email": "no-email@example.com",
        "age": int(stats["mean_age"]),
        "salary": stats["mean_salary"],
        "country": mode_country
    })
    
    # 4. Data type conversions
    df = df.withColumn(
        "registration_date",
        when(
            to_date(col("registration_date"), "yyyy-MM-dd").isNotNull(),
            to_date(col("registration_date"), "yyyy-MM-dd")
        ).otherwise(current_date())
    )
    
    # 5. Handle outliers (cap age)
    df = df.withColumn(
        "age",
        when(col("age") < 0, 0)
        .when(col("age") > 120, 120)
        .otherwise(col("age"))
    )
    
    # 6. Handle negative salary
    df = df.withColumn(
        "salary",
        when(col("salary") < 0, 0)
        .otherwise(col("salary"))
    )
    
    return df

# Apply cleaning pipeline
print("🧹 BEFORE CLEANING:")
print(f"Rows: {df.count()}")
df.show(truncate=False)

df_clean = clean_customer_data(df)

print("\n✅ AFTER CLEANING:")
print(f"Rows: {df_clean.count()}")
df_clean.show(truncate=False)

# Verify no nulls
print("\n✅ NULL CHECK AFTER CLEANING:")
null_counts_after = df_clean.select([
    count(when(col(c).isNull(), c)).alias(c) 
    for c in df_clean.columns
])
null_counts_after.show()

🧹 BEFORE CLEANING:
Rows: 16
+-----------+---------------+-----------------+----+--------+-----------------+-------+
|customer_id|name           |email            |age |salary  |registration_date|country|
+-----------+---------------+-----------------+----+--------+-----------------+-------+
|CUST001    |John Doe       |john@email.com   |25  |50000.0 |2024-01-01       |USA    |
|CUST002    |Jane Smith     |JANE@EMAIL.COM   |30  |60000.0 |2024-01-02       |UK     |
|CUST003    |  Bob Johnson  |bob@email.com    |NULL|55000.0 |2024-01-03       |Canada |
|CUST004    |Alice Brown    |NULL             |28  |70000.0 |2024-01-04       |USA    |
|CUST005    |Charlie Wilson |charlie@email.com|35  |NULL    |2024-01-05       |UK     |
|CUST001    |John Doe       |john@email.com   |25  |50000.0 |2024-01-01       |USA    |
|CUST006    |NULL           |david@email.com  |40  |80000.0 |2024-01-06       |Canada |
|CUST007    |Eve Davis      |eve@email.com    |-5  |90000.0 |2024-01-07       |USA    |
|
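
The pipeline fills missing ages with the mean in step 3 and only caps invalid ages in step 5, so the fill value is computed while -5 and 150 are still in the column. The plain-Python arithmetic below (not Spark code) shows how much that ordering matters. The 0 to 120 range mirrors the cap used in step 5, and validating before imputing is an alternative offered for comparison, not what the notebook does.

```python
# Non-null ages after dropDuplicates(["customer_id"]) removes one CUST001 row
ages = [25, 30, 28, 35, 40, -5, 150, 32, 29, 27, 33, 31, 26]

# What the pipeline does: mean over all values, truncated with int()
fill_as_in_pipeline = int(sum(ages) / len(ages))

# Alternative order: discard implausible ages first, then take the mean
valid = [a for a in ages if 0 <= a <= 120]
fill_after_validation = int(sum(valid) / len(valid))

print(fill_as_in_pipeline)    # 37
print(fill_after_validation)  # 30
```

The value 37 is the age that CUST015 receives in the saved output of section 9. Whether 37 or 30 is the better fill depends on the domain, but the difference comes entirely from the order of the steps.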

---

## 💾 **9. SAVE CLEANED DATA**

In [13]:
# Save to MinIO
output_path = "s3a://warehouse/cleaned_customers/"

df_clean.write \
    .mode("overwrite") \
    .partitionBy("country") \
    .parquet(output_path)

print(f"✅ Cleaned data saved to: {output_path}")

# Verify
df_verify = spark.read.parquet(output_path)
print(f"\n✅ Verification: {df_verify.count()} rows loaded")
df_verify.show()

26/01/04 17:51:10 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                

✅ Cleaned data saved to: s3a://warehouse/cleaned_customers/


                                                                                


✅ Verification: 15 rows loaded
+-----------+--------------+--------------------+---+-----------------+-----------------+-------+
|customer_id|          name|               email|age|           salary|registration_date|country|
+-----------+--------------+--------------------+---+-----------------+-----------------+-------+
|    CUST001|      John Doe|      john@email.com| 25|          50000.0|       2024-01-01|    USA|
|    CUST004|   Alice Brown|no-email@example.com| 28|          70000.0|       2024-01-04|    USA|
|    CUST007|     Eve Davis|       eve@email.com|  0|          90000.0|       2024-01-07|    USA|
|    CUST010|  Henry Taylor|     henry@email.com| 29|          65000.0|       2026-01-04|    USA|
|    CUST011|  Ivy Anderson|       ivy@email.com| 27|          58000.0|       2024-01-11|    USA|
|    CUST013| Karen Jackson|     karen@email.com| 31|          68000.0|       2024-01-13|    USA|
|    CUST015|    Mia Harris|       mia@email.com| 37|62461.53846153846|       2026-0

---

## 📊 **10. BEFORE/AFTER COMPARISON**

In [14]:
# Create comparison report
print("📊 DATA CLEANING REPORT")
print("=" * 60)

# Row count
print(f"\n1️⃣ ROW COUNT:")
print(f"   Before: {df.count()}")
print(f"   After:  {df_clean.count()}")
print(f"   Removed: {df.count() - df_clean.count()}")

# Null count
print(f"\n2️⃣ NULL VALUES:")
null_before = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0]
null_after = df_clean.select([count(when(col(c).isNull(), c)).alias(c) for c in df_clean.columns]).collect()[0]

for col_name in df.columns:
    before = null_before[col_name]
    after = null_after[col_name]
    if before > 0 or after > 0:
        print(f"   {col_name}: {before} → {after}")

# Duplicates
print(f"\n3️⃣ DUPLICATES:")
dup_before = df.count() - df.dropDuplicates(["customer_id"]).count()
dup_after = df_clean.count() - df_clean.dropDuplicates(["customer_id"]).count()
print(f"   Before: {dup_before}")
print(f"   After:  {dup_after}")

# Outliers
print(f"\n4️⃣ OUTLIERS (age < 0 or > 120):")
outliers_before = df.filter((col("age") < 0) | (col("age") > 120)).count()
outliers_after = df_clean.filter((col("age") < 0) | (col("age") > 120)).count()
print(f"   Before: {outliers_before}")
print(f"   After:  {outliers_after}")

print("\n" + "=" * 60)
print("✅ CLEANING COMPLETED!")

📊 DATA CLEANING REPORT

1️⃣ ROW COUNT:
   Before: 16
   After:  15
   Removed: 1

2️⃣ NULL VALUES:
   name: 1 → 0
   email: 1 → 0
   age: 2 → 0
   salary: 2 → 0
   registration_date: 1 → 0
   country: 2 → 0

3️⃣ DUPLICATES:
   Before: 1
   After:  0

4️⃣ OUTLIERS (age < 0 or > 120):
   Before: 2
   After:  0

✅ CLEANING COMPLETED!


---

## 🎓 **KEY TAKEAWAYS**

### **✅ Data Cleaning Best Practices:**

1. **Always profile data first** - Understand the problems before fixing them
2. **Handle nulls strategically** - Drop, fill, or flag based on context
3. **Remove duplicates early** - Prevents skewed analysis
4. **Standardize strings** - Trim, case conversion, remove special chars
5. **Validate data types** - Convert and handle invalid values
6. **Handle outliers carefully** - Remove, cap, or flag based on domain
7. **Create reusable pipelines** - Encapsulate cleaning logic in functions
8. **Document transformations** - Track what was changed and why
9. **Validate results** - Compare before/after metrics
10. **Save cleaned data** - Separate raw and cleaned layers

### **🚀 Next Steps:**
- **Day 2 - Notebook 4:** Data Quality & Validation
- Learn data profiling, validation rules, and quality metrics

---

In [15]:
# Cleanup
spark.stop()
print("✅ Spark session stopped")

✅ Spark session stopped
