# PySpark: Zero to Hero
## Module 13: Reading CSVs, Spark UI Internals, and Bad Record Handling

Reading a CSV file seems simple, but Spark does a lot of work under the hood. In this module, we will:
1.  Read CSV files and understand why Spark triggers background jobs.
2.  Learn about **Schema Inference** and why it can be expensive.
3.  Handle **Bad Records** (Corrupt Data) using three different modes:
    *   `PERMISSIVE` (Default)
    *   `DROPMALFORMED`
    *   `FAILFAST`

### Prerequisites
Ensure that `emp.csv` and `emp_new.csv` (the file with bad records) are in your `data/input` folder, as described in the video.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("CSV_Read_And_Bad_Records") \
    .master("local[*]") \
    .getOrCreate()

# The UI is usually at localhost:4040; Spark picks the next free port if 4040 is taken.
print(f"Spark session active. Spark UI: {spark.sparkContext.uiWebUrl}")

In [None]:
# Reading a CSV without specifying a schema.
# We set 'inferSchema' to true.
# Note: This triggers Spark jobs at load time! Spark has to scan the file to guess the data types.

file_path = "data/input/emp.csv" # Ensure this path is correct for your setup

df_inferred = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(file_path)

print("--- Schema Inferred ---")
df_inferred.printSchema()
# Go to Spark UI (localhost:4040) -> Jobs tab. You will see a job created just for reading!
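
To see why inference needs a pass over the data, here is a toy version in plain Python. It is only a conceptual sketch, not Spark's implementation: `infer_column_types` and the sample text are made up for illustration, and Spark's real inference covers more types and edge cases.

```python
import csv
import io

def infer_column_types(csv_text):
    """Toy schema inference: visit every value, widening int -> double -> string."""
    order = ["int", "double", "string"]

    def type_of(value):
        # Try the narrowest type first and fall back to string.
        for name, cast in (("int", int), ("double", float)):
            try:
                cast(value)
                return name
            except (ValueError, TypeError):
                continue
        return "string"

    inferred = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            current = inferred.get(column, "int")
            # Keep the wider of the type seen so far and the type of this value.
            inferred[column] = max(current, type_of(value), key=order.index)
    return inferred

sample = "employee_id,salary,name\n1,50000,Ann\n2,61000.5,Bob\n"
print(infer_column_types(sample))
```

A single late value such as `61000.5` changes the column type, which is why the types cannot be decided from the header or the first row alone.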

In [None]:
# In production, we define schema explicitly to avoid the extra read job (performance) 
# and to handle bad data correctly.

# Define Schema
emp_schema = StructType([
    StructField("employee_id", IntegerType(), True),
    StructField("department_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("hire_date", DateType(), True)
])

# Read with Schema
df_schema = spark.read \
    .format("csv") \
    .option("header", "true") \
    .schema(emp_schema) \
    .load(file_path)

print("--- Schema Explicitly Defined ---")
df_schema.printSchema()

# Note: If you check the Spark UI now, NO job was triggered by this cell.
# Spark takes the schema as given and does not touch the file until an action runs (lazy evaluation).
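
As an alternative to `StructType`, `DataFrameReader.schema` also accepts a DDL-formatted string. The sketch below mirrors the columns of `emp_schema`; the helper name `read_emp_with_ddl` is hypothetical and it expects an existing `SparkSession` to be passed in.

```python
# DDL-formatted schema string with the same columns and types as emp_schema.
emp_schema_ddl = (
    "employee_id INT, department_id INT, name STRING, age INT, "
    "gender STRING, salary DOUBLE, hire_date DATE"
)

def read_emp_with_ddl(spark, path):
    """Hypothetical helper: read a CSV using the DDL schema string."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .schema(emp_schema_ddl)
        .load(path)
    )
```

With the session from the first cell, this would be called as `read_emp_with_ddl(spark, file_path)`. The string form is compact for short schemas; `StructType` is easier to build or modify programmatically.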

## Handling Bad Records
We will now read a file (`emp_new.csv`) that contains corrupt data:
*   A string "Low" in the `salary` column (which expects Double).
*   A string "No Date" in the `hire_date` column (which expects Date).

Spark provides 3 modes to handle this:
1.  **PERMISSIVE (Default):** Sets fields that cannot be parsed to `null` and can store the raw bad record in a separate string column, provided the schema declares that column.
2.  **DROPMALFORMED:** Completely ignores (drops) the row containing bad data.
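3.  **FAILFAST:** Throws an exception as soon as a malformed record is encountered, which fails the job. Because reads are lazy, the error surfaces when an action runs.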
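
Before running the modes in Spark, the differences can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's parser: `parse_rows` is a made-up function, and it handles only a two-column `name,salary` file.

```python
import csv

def parse_rows(csv_text, mode="PERMISSIVE"):
    """Toy model of the three modes for a (name: str, salary: float) schema."""
    rows = []
    lines = csv_text.strip().splitlines()
    for line in lines[1:]:  # skip the header line
        name, salary = next(csv.reader([line]))
        try:
            rows.append({"name": name, "salary": float(salary), "_corrupt_record": None})
        except ValueError:
            if mode == "FAILFAST":
                # Stop at the first bad record.
                raise ValueError(f"Malformed record: {line}")
            if mode == "DROPMALFORMED":
                # Skip the bad record entirely.
                continue
            # PERMISSIVE: null out the bad field and keep the raw line.
            rows.append({"name": name, "salary": None, "_corrupt_record": line})
    return rows

data = "name,salary\nAnn,50000\nBob,Low\n"
print(parse_rows(data, "PERMISSIVE"))
print(parse_rows(data, "DROPMALFORMED"))
```

Calling `parse_rows(data, "FAILFAST")` raises on the `Bob,Low` line instead of returning anything.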

In [None]:
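# 1. PERMISSIVE
# The option 'columnNameOfCorruptRecord' names the column that receives the raw text of a bad row.
# Spark only fills it if the schema declares that column as a nullable string,
# so we extend emp_schema with it first.

bad_data_path = "data/input/emp_new.csv"

emp_schema_with_corrupt = StructType(
    emp_schema.fields + [StructField("_corrupt_record", StringType(), True)]
)

df_permissive = spark.read \
    .format("csv") \
    .option("header", "true") \
    .schema(emp_schema_with_corrupt) \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .load(bad_data_path)

print("--- Permissive Mode (Nulls inserted for bad data) ---")
df_permissive.show(truncate=False)

# Notice:
# 1. Fields that fail to parse ('salary', 'hire_date') are null in the bad rows.
# 2. The '_corrupt_record' column holds the raw text line of each bad row and is null for good rows.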

In [None]:
# 2. DROPMALFORMED
# Spark will silently drop the rows that don't match the schema.

df_drop = spark.read \
    .format("csv") \
    .option("header", "true") \
    .schema(emp_schema) \
    .option("mode", "DROPMALFORMED") \
    .load(bad_data_path)

print("--- Drop Malformed Mode (Bad rows removed) ---")
df_drop.show()

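# Notice: show() prints fewer rows than the file contains, because the bad rows are gone.
# Caveat: since Spark 2.4 the CSV parser only checks the columns a query actually needs,
# so df_drop.count() may still report every row.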

In [None]:
# 3. FAILFAST
# Useful for critical data pipelines where data quality is paramount.

print("--- Fail Fast Mode (Expect Error) ---")
try:
    df_fail = spark.read \
        .format("csv") \
        .option("header", "true") \
        .schema(emp_schema) \
        .option("mode", "FAILFAST") \
        .load(bad_data_path)
    
    df_fail.show() # Action triggers the read
except Exception as e:
    print("Error Encountered: Data quality check failed!")
    print(e)

In [None]:
# Instead of chaining .option().option().option(), use a dictionary.

read_options = {
    "header": "true",
    "inferSchema": "false",
    "mode": "PERMISSIVE",
    "sep": ","
}

df_dict = spark.read \
    .format("csv") \
    .options(**read_options) \
    .schema(emp_schema) \
    .load(file_path)

print("--- Read using Dictionary Options ---")
df_dict.show(2)
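
Keeping options in a dictionary also makes it easy to share defaults and override one or two values per read. A minimal sketch, where `BASE_CSV_OPTIONS` and `csv_options` are hypothetical names and the default values mirror the cell above:

```python
# Shared defaults for every CSV source.
BASE_CSV_OPTIONS = {
    "header": "true",
    "inferSchema": "false",
    "mode": "PERMISSIVE",
    "sep": ",",
}

def csv_options(**overrides):
    """Return a copy of the defaults with per-read overrides applied."""
    return {**BASE_CSV_OPTIONS, **overrides}

strict_options = csv_options(mode="FAILFAST")
pipe_options = csv_options(sep="|")
print(strict_options)
print(pipe_options)
```

The result plugs into the reader the same way as before, for example `spark.read.format("csv").options(**strict_options).schema(emp_schema).load(bad_data_path)`.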

## Summary

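1.  **Inference vs. Schema:** Provide an explicit schema in production. It avoids the inference scan and pins down the types you expect, so bad data shows up as bad records instead of silently changing a column's type.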
2.  **Bad Records:**
    *   Use `PERMISSIVE` to load data but flag errors (set `columnNameOfCorruptRecord` and declare that column in the schema).
    *   Use `DROPMALFORMED` to ignore bad data.
    *   Use `FAILFAST` to stop the pipeline on error.
3.  **Options:** Use dictionaries to manage configurations cleanly.

**Next Steps:**
In the next module, we will explore reading **JSON** and **Parquet** files, two formats you will meet constantly in Big Data pipelines.