# 🧪 OPTION 1: DATA QUALITY & VALIDATION (TASKS ONLY)

You now have:

* **Source table**: `chocolate_sales` (detailed)
* **Gold table**: `daily_chocolate_sales` (aggregated)

For this option, focus on the **source table** (`chocolate_sales`), because:

> *Bad input = bad output, even if your aggregation logic is perfect.*

In [0]:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
delta = spark.read.table("chocolate_sales")

## 🔹 LEVEL 1: Basic Data Health Checks

1. Find **total number of rows** in `chocolate_sales`.

2. Count how many rows have:

   * `Shipdate` = NULL
   * `Amount` = NULL
   * `Boxes` = NULL

3. Check whether there are **duplicate rows** based on:

   * `ShipmentID`

In [0]:
# 1. Find total number of rows in chocolate_sales.

print("No. of rows in chocolate_sales:", delta.count())

# 2. Count how many rows have NULL in Shipdate, Amount, or Boxes.

null_check_columns = ["Shipdate", "Amount", "Boxes"]
null_counts = [
    F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(f"null_count_{c}")
    for c in null_check_columns
]
delta.select(*null_counts).show()

# 3. Check whether there are duplicate rows based on: ShipmentID

duplicate_shipment_id = delta.select("ShipmentID").distinct().count() < delta.count()
print("Duplicate ShipmentID:", duplicate_shipment_id)



## 🔹 LEVEL 2: Business Rule Validation

Assume these business rules:

* `Amount` must be **greater than 0**
* `Boxes` must be **greater than or equal to 0**
* `Shipdate` must be **present**

Your tasks:

4. Count rows that violate each rule individually.

5. Count rows that violate **any** of the rules.

6. Count rows that violate **all** the rules.

In [0]:
# 4. Count rows that violate each rule individually.
# Note: a comparison against NULL yields NULL, so rows with a NULL Amount or Boxes
# are not counted by the first two checks.

delta.agg(
    F.sum(F.when(F.col("Amount") <= 0, 1).otherwise(0)).alias("amount_not_greater_than_zero"),
    F.sum(F.when(F.col("Boxes") < 0, 1).otherwise(0)).alias("boxes_less_than_zero"),
    F.sum(F.when(F.col("Shipdate").isNull(), 1).otherwise(0)).alias("null_in_shipdate")
).show()

# 5. Count rows that violate any of the rules.
delta.agg(
    F.sum(
        F.when(
            (F.col("Amount") <= 0) | 
            (F.col("Boxes") < 0) |
            (F.col("Shipdate").isNull()), 1
            ).otherwise(0)
    ).alias("any_rule_violated")
).show()

# 6. Count rows that violate all the rules.

delta.agg(
    F.sum(
        F.when(
            (F.col("Amount") <= 0) &
            (F.col("Boxes") < 0) &
            (F.col("Shipdate").isNull()), 1
            ).otherwise(0)
    ).alias("all_rule_violated")
).show()

## 🔹 LEVEL 3: Quality Summary Table (Important)

Create a **single summary result** that answers:

* Total rows
* Rows with null Shipdate
* Rows with invalid Amount
* Rows with invalid Boxes
* Rows that pass **all** checks
* Rows that fail **at least one** check

👉 This summary should itself be **a DataFrame**
👉 One row is fine (metrics-style output)

In [0]:
# A NULL or NaN value counts as invalid, alongside the range rule itself.
invalid_amount = (
    (F.col("Amount") <= 0) |
    (F.col("Amount").isNull()) |
    (F.isnan("Amount"))
)

invalid_boxes = (
    (F.col("Boxes") < 0) |
    (F.col("Boxes").isNull()) |
    (F.isnan("Boxes"))
)

invalid_shipdate = F.col("Shipdate").isNull()

# True when a row breaks at least one rule. Each condition includes an isNull()
# term, so this is never NULL and ~any_invalid is safe to sum.
any_invalid = invalid_amount | invalid_boxes | invalid_shipdate

# One-row, metrics-style summary DataFrame.
level_3_df = delta.agg(
    F.count("*").alias("total_count"),
    F.sum(invalid_amount.cast("int")).alias("rows_with_invalid_amount"),
    F.sum(invalid_boxes.cast("int")).alias("rows_with_invalid_boxes"),
    F.sum(invalid_shipdate.cast("int")).alias("rows_with_invalid_shipdate"),
    F.sum((~any_invalid).cast("int")).alias("all_checks_passed"),
    F.sum(any_invalid.cast("int")).alias("failed_checks")
)

level_3_df.show()
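
A one-row summary like this can be sanity-checked, because its counts must agree with each other. The sketch below is one way to do that in plain Python. The helper name `check_summary` and the numbers in `example` are hypothetical; the dictionary keys are the column aliases used above.

```python
def check_summary(summary: dict) -> None:
    """Raise AssertionError if the one-row quality summary is internally inconsistent."""
    total = summary["total_count"]
    passed = summary["all_checks_passed"]
    failed = summary["failed_checks"]
    per_rule = [
        summary["rows_with_invalid_amount"],
        summary["rows_with_invalid_boxes"],
        summary["rows_with_invalid_shipdate"],
    ]

    # Every row either passes all checks or fails at least one.
    assert passed + failed == total, "passed + failed must equal total"
    # A row failing any single rule is also counted in failed_checks.
    assert max(per_rule) <= failed, "a per-rule count exceeds failed_checks"
    # Rows may break several rules, so per-rule counts can only over-count failures.
    assert failed <= sum(per_rule), "failed_checks exceeds the sum of per-rule counts"


# Hypothetical numbers, for illustration only; not taken from chocolate_sales.
example = {
    "total_count": 100,
    "rows_with_invalid_amount": 4,
    "rows_with_invalid_boxes": 3,
    "rows_with_invalid_shipdate": 2,
    "all_checks_passed": 93,
    "failed_checks": 7,
}
check_summary(example)
```

In the notebook, the same helper could be called as `check_summary(level_3_df.first().asDict())`. The sketch assumes the table has at least one row, since `F.sum` returns NULL over an empty input.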