# 📚 DAY 2 - NOTEBOOK 1: READING DATA

## 🎯 Objectives:
- Read data from multiple formats (CSV, JSON, Parquet)
- Understand schema inference vs explicit schema
- Use read options effectively
- Handle corrupted/malformed data
- Compare performance of different formats
- Work with both local filesystem and MinIO (S3)

## 📂 Data Pipeline:
```
raw/ → staging/ → production/
```

---
## 🔧 PART 1: SETUP SPARK SESSION

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, expr, count, isnan, isnull
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    DoubleType, DateType, TimestampType
)
import time
import os

print("📦 Importing libraries...")
print("✅ All imports successful!")

📦 Importing libraries...
✅ All imports successful!


In [2]:
# Stop any existing Spark session so the new configuration takes effect
try:
    spark.stop()
    print("🛑 Stopped existing Spark session")
    time.sleep(2)
except NameError:
    # `spark` is not defined yet in this kernel
    print("ℹ️  No existing Spark session to stop")

ℹ️  No existing Spark session to stop


In [3]:
print("🚀 Creating Spark Session with MinIO support...")
print("="*70)

spark = SparkSession.builder \
    .appName("Day2-ReadingData") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "512m") \
    .config("spark.executor.cores", "1") \
    .config("spark.cores.max", "2") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

print("✅ Spark Session Created Successfully!")
print(f"   App Name: {spark.sparkContext.appName}")
print(f"   App ID: {spark.sparkContext.applicationId}")
print(f"   Master: {spark.sparkContext.master}")
print(f"   Spark Version: {spark.version}")
print(f"   Python Version: {spark.sparkContext.pythonVer}")

🚀 Creating Spark Session with MinIO support...


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/03 10:35:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark Session Created Successfully!
   App Name: Day2-ReadingData
   App ID: app-20260103103547-0000
   Master: spark://spark-master:7077
   Spark Version: 3.5.1
   Python Version: 3.8
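
The session above hard-codes the MinIO access key and secret. One way to keep them out of the notebook is to read them from environment variables and apply them to the builder in a loop. This is a minimal sketch: the variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, and `MINIO_SECRET_KEY` and both helper functions are assumptions made for illustration, not part of the original setup.

```python
import os


def s3a_conf_from_env(env=None):
    """Build the fs.s3a.* settings from environment variables.

    The variable names are hypothetical; adapt them to your deployment.
    """
    env = os.environ if env is None else env
    return {
        "spark.hadoop.fs.s3a.endpoint": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "spark.hadoop.fs.s3a.access.key": env["MINIO_ACCESS_KEY"],  # KeyError if unset
        "spark.hadoop.fs.s3a.secret.key": env["MINIO_SECRET_KEY"],  # KeyError if unset
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def apply_conf(builder, conf):
    """Call builder.config(key, value) for every setting and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With these in place, `apply_conf(SparkSession.builder.appName("Day2-ReadingData"), s3a_conf_from_env()).getOrCreate()` would create the session without credentials appearing in the notebook source.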


---
## 📂 PART 2: DEFINE DATA PATHS

In [4]:
# ============================================================================
# DEFINE DATA PATHS
# ============================================================================

# Option 1: Local filesystem (shared volume)
DATA_RAW_LOCAL = "/opt/spark-data/raw"
DATA_STAGING_LOCAL = "/opt/spark-data/staging"
DATA_PRODUCTION_LOCAL = "/opt/spark-data/production"

# Option 2: MinIO (S3-compatible)
DATA_RAW_S3 = "s3a://raw"
DATA_STAGING_S3 = "s3a://staging"
DATA_PRODUCTION_S3 = "s3a://production"

print("📂 Available data paths:")
print("="*70)
print("\n🗂️  LOCAL FILESYSTEM (Shared Volume):")
print(f"   Raw:        {DATA_RAW_LOCAL}")
print(f"   Staging:    {DATA_STAGING_LOCAL}")
print(f"   Production: {DATA_PRODUCTION_LOCAL}")

print("\n☁️  MINIO (S3-Compatible):")
print(f"   Raw:        {DATA_RAW_S3}")
print(f"   Staging:    {DATA_STAGING_S3}")
print(f"   Production: {DATA_PRODUCTION_S3}")

print("\n💡 TIP: Use local filesystem for development, MinIO for production")

📂 Available data paths:

🗂️  LOCAL FILESYSTEM (Shared Volume):
   Raw:        /opt/spark-data/raw
   Staging:    /opt/spark-data/staging
   Production: /opt/spark-data/production

☁️  MINIO (S3-Compatible):
   Raw:        s3a://raw
   Staging:    s3a://staging
   Production: s3a://production

💡 TIP: Use local filesystem for development, MinIO for production


---
## 📝 PART 3: CREATE SAMPLE DATA

In [5]:
from faker import Faker
import random
import pandas as pd

# Set seeds for reproducibility
fake = Faker()
Faker.seed(42)
random.seed(42)

print("📝 Creating sample employee data...")
print("="*70)

# Create employees data
employees_data = []
departments = ["Engineering", "Sales", "HR", "Marketing", "Finance"]

for i in range(1, 1001):
    employees_data.append({
        "id": i,
        "name": fake.name(),
        "email": fake.email(),
        "department": random.choice(departments),
        "salary": random.randint(40000, 150000),
        "age": random.randint(22, 65),
        "hire_date": fake.date_between(start_date='-10y', end_date='today').strftime('%Y-%m-%d')
    })

print(f"✅ Created {len(employees_data)} employee records")
print(f"   Departments: {', '.join(departments)}")
print("   Salary range: $40,000 - $150,000")
print("   Age range: 22 - 65")
print("   Hire date range: Last 10 years")

📝 Creating sample employee data...
✅ Created 1000 employee records
   Departments: Engineering, Sales, HR, Marketing, Finance
   Salary range: $40,000 - $150,000
   Age range: 22 - 65
   Hire date range: Last 10 years


In [6]:
# Preview sample data
print("\n📊 Sample data preview:")
print("="*70)
df_preview = pd.DataFrame(employees_data[:5])
print(df_preview.to_string(index=False))


📊 Sample data preview:
 id              name                    email  department  salary  age  hire_date
  1      Allison Hill donaldgarcia@example.net Engineering   43278   39 2024-12-05
  2    Javier Johnson  jesseguzman@example.net       Sales   69256   30 2017-12-29
  3 Kimberly Robinson       lisa02@example.net Engineering  128696   56 2020-03-25
  4  Daniel Gallagher   daviscolin@example.com Engineering  117397   49 2022-01-16
  5    Monica Herrera      smiller@example.net Engineering   43905   27 2022-03-11


In [7]:
# Create directory structure
print("\n📁 Creating directory structure...")
print("="*70)

os.makedirs(f"{DATA_RAW_LOCAL}/employees", exist_ok=True)
print(f"✅ Created: {DATA_RAW_LOCAL}/employees/")

# Save to different formats
print("\n💾 Saving data to multiple formats...")

# 1. CSV
df_pandas = pd.DataFrame(employees_data)
csv_path = f'{DATA_RAW_LOCAL}/employees/employees.csv'
df_pandas.to_csv(csv_path, index=False)
csv_size = os.path.getsize(csv_path) / 1024  # KB
print(f"✅ CSV saved: {csv_path}")
print(f"   Size: {csv_size:.2f} KB")

# 2. JSON (line-delimited)
json_path = f'{DATA_RAW_LOCAL}/employees/employees.json'
df_pandas.to_json(json_path, orient='records', lines=True)
json_size = os.path.getsize(json_path) / 1024  # KB
print(f"✅ JSON saved: {json_path}")
print(f"   Size: {json_size:.2f} KB")

# 3. Parquet (using Spark)
# The schema is inferred from the Python dicts: ints become long and hire_date stays a string.
df_spark = spark.createDataFrame(employees_data)
parquet_path = f'{DATA_RAW_LOCAL}/employees/employees.parquet'
# Spark writes a directory of part files here, so its size is not measured with os.path.getsize().
df_spark.write.mode('overwrite').parquet(parquet_path)
print(f"✅ Parquet saved: {parquet_path}")

print("\n📊 File size comparison:")
print(f"   CSV:     {csv_size:.2f} KB")
print(f"   JSON:    {json_size:.2f} KB")
print("   Parquet: (columnar format - typically smallest)")


📁 Creating directory structure...
✅ Created: /opt/spark-data/raw/employees/

💾 Saving data to multiple formats...
✅ CSV saved: /opt/spark-data/raw/employees/employees.csv
   Size: 67.83 KB
✅ JSON saved: /opt/spark-data/raw/employees/employees.json
   Size: 136.14 KB

✅ Parquet saved: /opt/spark-data/raw/employees/employees.parquet

📊 File size comparison:
   CSV:     67.83 KB
   JSON:    136.14 KB
   Parquet: (columnar format - typically smallest)
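
The size comparison above reports no number for Parquet because Spark writes `employees.parquet` as a directory of part files rather than a single file. A small helper that sums every file under a path could fill that gap. This is a sketch under the assumption that the driver can see the output directory (true for the shared volume used here); `path_size_kb` is a name introduced for illustration.

```python
import os


def path_size_kb(path):
    """Return the size in KB of a file, or of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path) / 1024
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1024
```

Calling `path_size_kb(parquet_path)` would include Spark's `_SUCCESS` marker and checksum files, so the result slightly overstates the data size.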


---
## 📖 PART 4: READING CSV FILES

In [8]:
print("📌 READING CSV - Basic (with schema inference)")
print("="*70)

# Read CSV with schema inference.
# inferSchema makes Spark scan the file, so the read call itself triggers a Spark job.
start_time = time.time()

df_csv = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(f"{DATA_RAW_LOCAL}/employees/employees.csv")

# This times only the read call; count() and show() below run as separate jobs.
read_time = time.time() - start_time

print("✅ CSV loaded successfully")
print(f"   Rows: {df_csv.count()}")
print(f"   Columns: {len(df_csv.columns)}")
print(f"   Read time: {read_time:.3f} seconds")

print("\n📋 Inferred Schema:")
df_csv.printSchema()

print("\n📊 Sample data:")
df_csv.show(5, truncate=False)

📌 READING CSV - Basic (with schema inference)

✅ CSV loaded successfully
   Rows: 1000
   Columns: 7
   Read time: 4.460 seconds

📋 Inferred Schema:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- hire_date: date (nullable = true)


📊 Sample data:
+---+-----------------+------------------------+-----------+------+---+----------+
|id |name             |email                   |department |salary|age|hire_date |
+---+-----------------+------------------------+-----------+------+---+----------+
|1  |Allison Hill     |donaldgarcia@example.net|Engineering|43278 |39 |2024-12-05|
|2  |Javier Johnson   |jesseguzman@example.net |Sales      |69256 |30 |2017-12-29|
|3  |Kimberly Robinson|lisa02@example.net      |Engineering|128696|56 |2020-03-25|
|4  |Daniel Gallagher |daviscolin@example.com  |Engineering|117397|49 |2022-01-16|
|5  |Monica Herrera   |

In [9]:
print("📌 READING CSV - With Explicit Schema (RECOMMENDED)")
print("="*70)

# Define explicit schema.
# Spark treats columns read from files as nullable, so the False flags below
# are not enforced (printSchema still reports nullable = true).
employee_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("email", StringType(), True),
    StructField("department", StringType(), False),
    StructField("salary", IntegerType(), False),
    StructField("age", IntegerType(), False),
    StructField("hire_date", DateType(), False)
])

print("📋 Defined Schema:")
print(employee_schema.simpleString())

# Read with explicit schema.
# No inference pass is needed, so this call only defines the DataFrame;
# the file is not scanned until an action such as show() runs.
start_time = time.time()

df_csv_explicit = spark.read \
    .option("header", "true") \
    .option("dateFormat", "yyyy-MM-dd") \
    .schema(employee_schema) \
    .csv(f"{DATA_RAW_LOCAL}/employees/employees.csv")

read_time = time.time() - start_time

print("\n✅ CSV loaded with explicit schema")
print(f"   Read time: {read_time:.3f} seconds")
print("   ⚡ Faster than schema inference!")

print("\n📋 Schema:")
df_csv_explicit.printSchema()

print("\n📊 Sample data:")
df_csv_explicit.show(5)

📌 READING CSV - With Explicit Schema (RECOMMENDED)
📋 Defined Schema:
struct<id:int,name:string,email:string,department:string,salary:int,age:int,hire_date:date>

✅ CSV loaded with explicit schema
   Read time: 0.068 seconds
   ⚡ Faster than schema inference!

📋 Schema:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- hire_date: date (nullable = true)


📊 Sample data:
+---+-----------------+--------------------+-----------+------+---+----------+
| id|             name|               email| department|salary|age| hire_date|
+---+-----------------+--------------------+-----------+------+---+----------+
|  1|     Allison Hill|donaldgarcia@exam...|Engineering| 43278| 39|2024-12-05|
|  2|   Javier Johnson|jesseguzman@examp...|      Sales| 69256| 30|2017-12-29|
|  3|Kimberly Robinson|  lisa02@e

In [10]:
print("📌 CSV READ OPTIONS - Advanced")
print("="*70)

# Common CSV options.
# Note: columnNameOfCorruptRecord only takes effect when the schema contains
# a string column with that name; employee_schema does not have one yet.
df_csv_advanced = spark.read \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .option("sep", ",") \
    .option("quote", '"') \
    .option("escape", "\\") \
    .option("nullValue", "NULL") \
    .option("dateFormat", "yyyy-MM-dd") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(employee_schema) \
    .csv(f"{DATA_RAW_LOCAL}/employees/employees.csv")

print("✅ CSV loaded with advanced options")
print("\n📝 Options explained:")
print("   • header: First row is column names")
print("   • sep: Field delimiter (comma)")
print("   • quote: Quote character for strings")
print("   • escape: Escape character")
print("   • nullValue: String to treat as NULL")
print("   • dateFormat: Date parsing format")
print("   • mode: How to handle corrupt records")
print("   • columnNameOfCorruptRecord: Column for bad records")

df_csv_advanced.show(5)

📌 CSV READ OPTIONS - Advanced
✅ CSV loaded with advanced options

📝 Options explained:
   • header: First row is column names
   • sep: Field delimiter (comma)
   • quote: Quote character for strings
   • escape: Escape character
   • nullValue: String to treat as NULL
   • dateFormat: Date parsing format
   • mode: How to handle corrupt records
   • columnNameOfCorruptRecord: Column for bad records
+---+-----------------+--------------------+-----------+------+---+----------+
| id|             name|               email| department|salary|age| hire_date|
+---+-----------------+--------------------+-----------+------+---+----------+
|  1|     Allison Hill|donaldgarcia@exam...|Engineering| 43278| 39|2024-12-05|
|  2|   Javier Johnson|jesseguzman@examp...|      Sales| 69256| 30|2017-12-29|
|  3|Kimberly Robinson|  lisa02@example.net|Engineering|128696| 56|2020-03-25|
|  4| Daniel Gallagher|daviscolin@exampl...|Engineering|117397| 49|2022-01-16|
|  5|   Monica Herre
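
The long `.option(...)` chain above is repeated, with small variations, in several later cells. One way to keep those reads consistent is to hold the shared options in a dictionary and pass them with `DataFrameReader.options(**kwargs)`. The sketch below assumes that approach; `CSV_OPTIONS` and `read_csv` are names introduced here for illustration, and the option values are copied from the cell above.

```python
# Options shared by the CSV reads in this notebook (values copied from the cell above)
CSV_OPTIONS = {
    "header": "true",
    "sep": ",",
    "quote": '"',
    "escape": "\\",
    "nullValue": "NULL",
    "dateFormat": "yyyy-MM-dd",
    "mode": "PERMISSIVE",
}


def read_csv(spark, path, schema, **overrides):
    """Read a CSV with the shared options; keyword arguments override them."""
    options = {**CSV_OPTIONS, **overrides}
    return spark.read.options(**options).schema(schema).csv(path)
```

A stricter read then becomes `read_csv(spark, csv_path, employee_schema, mode="FAILFAST")`, with every other option unchanged.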

---
## 📖 PART 5: READING JSON FILES

In [11]:
print("📌 READING JSON - Line-delimited (JSONL)")
print("="*70)

# Without a schema Spark scans the file to infer types: integers become long,
# hire_date stays a string, and columns are sorted alphabetically.
start_time = time.time()

df_json = spark.read.json(f"{DATA_RAW_LOCAL}/employees/employees.json")

read_time = time.time() - start_time

print("✅ JSON loaded successfully")
print(f"   Rows: {df_json.count()}")
print(f"   Columns: {len(df_json.columns)}")
print(f"   Read time: {read_time:.3f} seconds")

print("\n📋 Schema (auto-inferred):")
df_json.printSchema()

print("\n📊 Sample data:")
df_json.show(5)

📌 READING JSON - Line-delimited (JSONL)
✅ JSON loaded successfully
   Rows: 1000
   Columns: 7
   Read time: 0.403 seconds

📋 Schema (auto-inferred):
root
 |-- age: long (nullable = true)
 |-- department: string (nullable = true)
 |-- email: string (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)


📊 Sample data:
+---+-----------+--------------------+----------+---+-----------------+------+
|age| department|               email| hire_date| id|             name|salary|
+---+-----------+--------------------+----------+---+-----------------+------+
| 39|Engineering|donaldgarcia@exam...|2024-12-05|  1|     Allison Hill| 43278|
| 30|      Sales|jesseguzman@examp...|2017-12-29|  2|   Javier Johnson| 69256|
| 56|Engineering|  lisa02@example.net|2020-03-25|  3|Kimberly Robinson|128696|
| 49|Engineering|daviscolin@exampl...|2022-01-16|  4| Daniel Gallagher|117397|
| 27|En

In [12]:
print("📌 READING JSON - With Explicit Schema")
print("="*70)

# With an explicit schema Spark skips the inference pass over the file,
# and the columns come back in schema order with int and date types.
start_time = time.time()

df_json_explicit = spark.read \
    .schema(employee_schema) \
    .json(f"{DATA_RAW_LOCAL}/employees/employees.json")

read_time = time.time() - start_time

print("✅ JSON loaded with explicit schema")
print(f"   Read time: {read_time:.3f} seconds")
print("   ⚡ Faster than schema inference!")

df_json_explicit.show(5)

📌 READING JSON - With Explicit Schema
✅ JSON loaded with explicit schema
   Read time: 0.051 seconds
   ⚡ Faster than schema inference!
+---+-----------------+--------------------+-----------+------+---+----------+
| id|             name|               email| department|salary|age| hire_date|
+---+-----------------+--------------------+-----------+------+---+----------+
|  1|     Allison Hill|donaldgarcia@exam...|Engineering| 43278| 39|2024-12-05|
|  2|   Javier Johnson|jesseguzman@examp...|      Sales| 69256| 30|2017-12-29|
|  3|Kimberly Robinson|  lisa02@example.net|Engineering|128696| 56|2020-03-25|
|  4| Daniel Gallagher|daviscolin@exampl...|Engineering|117397| 49|2022-01-16|
|  5|   Monica Herrera| smiller@example.net|Engineering| 43905| 27|2022-03-11|
+---+-----------------+--------------------+-----------+------+---+----------+
only showing top 5 rows



---
## 📖 PART 6: READING PARQUET FILES

In [13]:
print("📌 READING PARQUET - Columnar Format")
print("="*70)

start_time = time.time()

df_parquet = spark.read.parquet(f"{DATA_RAW_LOCAL}/employees/employees.parquet")

read_time = time.time() - start_time

print("✅ Parquet loaded successfully")
print(f"   Rows: {df_parquet.count()}")
print(f"   Columns: {len(df_parquet.columns)}")
print(f"   Read time: {read_time:.3f} seconds")
# The message below is hard-coded; Part 7 measures the formats side by side.
print("   ⚡ Fastest format!")

print("\n📋 Schema (embedded in Parquet):")
df_parquet.printSchema()

print("\n📊 Sample data:")
df_parquet.show(5)

print("\n💡 Parquet advantages:")
print("   • Schema is embedded (no inference needed)")
print("   • Columnar storage (efficient for analytics)")
print("   • Compressed by default")
print("   • Supports predicate pushdown")
print("   • Best for production workloads")

📌 READING PARQUET - Columnar Format
✅ Parquet loaded successfully
   Rows: 1000
   Columns: 7
   Read time: 0.314 seconds
   ⚡ Fastest format!

📋 Schema (embedded in Parquet):
root
 |-- age: long (nullable = true)
 |-- department: string (nullable = true)
 |-- email: string (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)


📊 Sample data:
+---+-----------+--------------------+----------+---+-----------------+------+
|age| department|               email| hire_date| id|             name|salary|
+---+-----------+--------------------+----------+---+-----------------+------+
| 39|Engineering|donaldgarcia@exam...|2024-12-05|  1|     Allison Hill| 43278|
| 30|      Sales|jesseguzman@examp...|2017-12-29|  2|   Javier Johnson| 69256|
| 56|Engineering|  lisa02@example.net|2020-03-25|  3|Kimberly Robinson|128696|
| 49|Engineering|daviscolin@exampl...|2022-01-16|  4| Dani
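
The embedded schema above shows `long` and `string` columns because the Parquet file was written from a DataFrame whose types Spark inferred from Python dictionaries. Parquet stores whatever types it is given, so one option is to cast the columns before writing. The sketch below builds `selectExpr` strings for that purpose; `TARGET_TYPES` and `cast_exprs` are hypothetical names, and the target types simply mirror `employee_schema`.

```python
# Desired SQL types for the columns that were inferred too loosely (assumption: mirrors employee_schema)
TARGET_TYPES = {"id": "INT", "salary": "INT", "age": "INT", "hire_date": "DATE"}


def cast_exprs(columns, target_types):
    """Return selectExpr strings that cast the listed columns and pass the others through."""
    return [
        f"CAST({c} AS {target_types[c]}) AS {c}" if c in target_types else c
        for c in columns
    ]
```

Writing typed Parquet would then look like `df_spark.selectExpr(*cast_exprs(df_spark.columns, TARGET_TYPES)).write.mode("overwrite").parquet(parquet_path)`.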

---
## 📊 PART 7: PERFORMANCE COMPARISON

In [14]:
print("📊 PERFORMANCE COMPARISON")
print("="*70)

results = []

# Test CSV
print("\n🔄 Testing CSV...")
start = time.time()
df_csv_test = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{DATA_RAW_LOCAL}/employees/employees.csv")
count_csv = df_csv_test.count()
time_csv = time.time() - start
results.append(("CSV (inferred)", time_csv, count_csv))
print(f"   Time: {time_csv:.3f}s")

# Test CSV with explicit schema
print("\n🔄 Testing CSV (explicit schema)...")
start = time.time()
df_csv_explicit_test = spark.read.option("header", "true").schema(employee_schema).csv(f"{DATA_RAW_LOCAL}/employees/employees.csv")
count_csv_explicit = df_csv_explicit_test.count()
time_csv_explicit = time.time() - start
results.append(("CSV (explicit)", time_csv_explicit, count_csv_explicit))
print(f"   Time: {time_csv_explicit:.3f}s")

# Test JSON
print("\n🔄 Testing JSON...")
start = time.time()
df_json_test = spark.read.json(f"{DATA_RAW_LOCAL}/employees/employees.json")
count_json = df_json_test.count()
time_json = time.time() - start
results.append(("JSON (inferred)", time_json, count_json))
print(f"   Time: {time_json:.3f}s")

# Test Parquet
print("\n🔄 Testing Parquet...")
start = time.time()
df_parquet_test = spark.read.parquet(f"{DATA_RAW_LOCAL}/employees/employees.parquet")
count_parquet = df_parquet_test.count()
time_parquet = time.time() - start
results.append(("Parquet", time_parquet, count_parquet))
print(f"   Time: {time_parquet:.3f}s")

# Display results
print("\n" + "="*70)
print("📊 RESULTS SUMMARY")
print("="*70)
print(f"{'Format':<20} {'Time (s)':<15} {'Rows':<10} {'Speedup'}")
print("-"*70)

baseline = time_csv
for format_name, time_taken, row_count in results:
    speedup = baseline / time_taken
    print(f"{format_name:<20} {time_taken:<15.3f} {row_count:<10} {speedup:.2f}x")

# Report the measured winner instead of a hard-coded one.
# With 1,000 rows the timings are dominated by job startup, and the first test
# also pays warm-up cost, so treat these numbers as illustrative.
fastest_name, fastest_time, _ = min(results, key=lambda r: r[1])
print(f"\n🏆 Fastest in this run: {fastest_name} ({fastest_time:.3f}s)")
print("💡 Recommendation: Use Parquet for production workloads")

📊 PERFORMANCE COMPARISON

🔄 Testing CSV...
   Time: 4.224s

🔄 Testing CSV (explicit schema)...
   Time: 0.314s

🔄 Testing JSON...
   Time: 0.551s

🔄 Testing Parquet...
   Time: 0.574s

📊 RESULTS SUMMARY
Format               Time (s)        Rows       Speedup
----------------------------------------------------------------------
CSV (inferred)       4.224           1000       1.00x
CSV (explicit)       0.314           1000       13.46x
JSON (inferred)      0.551           1000       7.67x
Parquet              0.574           1000       7.36x

🏆 Fastest in this run: CSV (explicit) (0.314s)
💡 Recommendation: Use Parquet for production workloads
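
Each format above is timed once, and the first test absorbs Spark's warm-up cost, so a single run says little about the formats themselves. A slightly more robust approach is to repeat each action and report the median. The helper below is a minimal sketch of that idea using only the standard library; `benchmark` and its parameters are names introduced here, not part of the notebook.

```python
import statistics
import time


def benchmark(action, repeats=5, warmup=1):
    """Run `action` several times and return (median, fastest, slowest) in seconds."""
    for _ in range(warmup):
        action()  # warm-up runs are executed but not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings), max(timings)
```

For example, `benchmark(lambda: spark.read.parquet(parquet_path).count())` would time the read together with an action, which matters because `spark.read` on its own is lazy.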


---
## 🛡️ PART 8: HANDLING CORRUPTED DATA

In [15]:
print("📝 Creating corrupted CSV file for testing...")
print("="*70)

# Create a CSV with some corrupted records
corrupted_csv = f"{DATA_RAW_LOCAL}/employees/employees_corrupted.csv"

with open(corrupted_csv, 'w') as f:
    f.write("id,name,email,department,salary,age,hire_date\n")
    f.write("1,John Doe,john@example.com,Engineering,75000,30,2020-01-15\n")
    f.write("2,Jane Smith,jane@example.com,Sales,65000,28,2021-03-20\n")
    f.write("3,Bad Record,missing_fields\n")  # ❌ Corrupted: missing fields
    f.write("4,Bob Johnson,bob@example.com,HR,55000,35,2019-06-10\n")
    f.write("5,Alice,alice@example.com,Marketing,INVALID,29,2022-01-05\n")  # ❌ Corrupted: invalid salary
    f.write("6,Charlie Brown,charlie@example.com,Finance,70000,32,2020-09-15\n")

print(f"✅ Created corrupted CSV: {corrupted_csv}")
print("   Contains 2 corrupted records out of 6 total")

📝 Creating corrupted CSV file for testing...
✅ Created corrupted CSV: /opt/spark-data/raw/employees/employees_corrupted.csv
   Contains 2 corrupted records out of 6 total
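
The two bad records above fail in different ways: one has too few fields, the other has a non-numeric salary. The first kind can be spotted before Spark is involved at all, using Python's `csv` module to compare each row's field count with the header. The function below is a sketch of such a pre-check; `find_width_mismatches` is a name introduced for illustration, and it deliberately does not catch type errors such as `INVALID` in a numeric column.

```python
import csv


def find_width_mismatches(path):
    """Return (line_number, row) for data rows whose field count differs from the header."""
    mismatches = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            if len(row) != len(header):
                mismatches.append((reader.line_num, row))
    return mismatches
```

Run against `corrupted_csv`, this would flag only the `3,Bad Record,missing_fields` line; the invalid salary still needs a schema-aware reader to detect.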


In [16]:
print("📌 MODE: PERMISSIVE (default) - Keep corrupted records")
print("="*70)

# Caution: StructType.add() modifies employee_schema in place, so later cells
# that reuse employee_schema also get the _corrupt_record column.
df_permissive = spark.read \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(employee_schema.add("_corrupt_record", StringType(), True)) \
    .csv(corrupted_csv)

print("✅ Data loaded with PERMISSIVE mode")
print(f"   Total rows: {df_permissive.count()}")

print("\n📊 All records (including corrupted):")
df_permissive.show(10, truncate=False)

print("\n🔍 Corrupted records only:")
df_corrupted = df_permissive.filter(col("_corrupt_record").isNotNull())
print(f"   Found {df_corrupted.count()} corrupted records")
df_corrupted.show(truncate=False)

📌 MODE: PERMISSIVE (default) - Keep corrupted records
✅ Data loaded with PERMISSIVE mode
   Total rows: 6

📊 All records (including corrupted):
+---+-------------+-------------------+-----------+------+----+----------+---------------------------------------------------------+
|id |name         |email              |department |salary|age |hire_date |_corrupt_record                                          |
+---+-------------+-------------------+-----------+------+----+----------+---------------------------------------------------------+
|1  |John Doe     |john@example.com   |Engineering|75000 |30  |2020-01-15|NULL                                                     |
|2  |Jane Smith   |jane@example.com   |Sales      |65000 |28  |2021-03-20|NULL                                                     |
|3  |Bad Record   |missing_fields     |NULL       |NULL  |NULL|NULL      |3,Bad Record,missing_fields                              |
|4  |Bob Johnson  |bob@example.com    |HR         
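
The PERMISSIVE cell extends the schema with `employee_schema.add(...)`. In PySpark, `StructType.add` appends to the existing object and returns it, so the shared `employee_schema` keeps the extra column afterwards. A non-mutating alternative is to build a new schema from the old field list. The helper below is a sketch of that idea; `with_extra_field` is a hypothetical name, and it is written against any schema class that exposes `.fields` and accepts a list of fields in its constructor, which is how `StructType` behaves.

```python
def with_extra_field(schema, extra_field):
    """Return a new schema with `extra_field` appended; `schema` itself is not modified.

    Assumes the schema class accepts a list of fields in its constructor
    and exposes them as `.fields`, as pyspark's StructType does.
    """
    return type(schema)(list(schema.fields) + [extra_field])
```

In the notebook this would be used as `with_extra_field(employee_schema, StructField("_corrupt_record", StringType(), True))`, leaving `employee_schema` with its original seven columns.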

In [17]:
print("📌 MODE: DROPMALFORMED - Drop corrupted records")
print("="*70)

df_dropmalformed = spark.read \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .schema(employee_schema) \
    .csv(corrupted_csv)

print("✅ Data loaded with DROPMALFORMED mode")
# count() can still report all 6 rows: it needs no column values, so with CSV
# column pruning (the default) rows are not parsed and not flagged as malformed.
# show() below parses every column and drops the 2 bad rows.
print(f"   Total rows: {df_dropmalformed.count()}")
print("   ⚠️  Corrupted records were dropped")

print("\n📊 Valid records only:")
df_dropmalformed.show(truncate=False)

📌 MODE: DROPMALFORMED - Drop corrupted records
✅ Data loaded with DROPMALFORMED mode
   Total rows: 6
   ⚠️  Corrupted records were dropped

📊 Valid records only:
+---+-------------+-------------------+-----------+------+---+----------+---------------+
|id |name         |email              |department |salary|age|hire_date |_corrupt_record|
+---+-------------+-------------------+-----------+------+---+----------+---------------+
|1  |John Doe     |john@example.com   |Engineering|75000 |30 |2020-01-15|NULL           |
|2  |Jane Smith   |jane@example.com   |Sales      |65000 |28 |2021-03-20|NULL           |
|4  |Bob Johnson  |bob@example.com    |HR         |55000 |35 |2019-06-10|NULL           |
|6  |Charlie Brown|charlie@example.com|Finance    |70000 |32 |2020-09-15|NULL           |
+---+-------------+-------------------+-----------+------+---+----------+---------------+



In [18]:
print("📌 MODE: FAILFAST - Fail on corrupted records")
print("="*70)

try:
    df_failfast = spark.read \
        .option("header", "true") \
        .option("mode", "FAILFAST") \
        .schema(employee_schema) \
        .csv(corrupted_csv)

    # The read is lazy; the error is raised when an action parses the file
    df_failfast.show()

except Exception as e:
    print("❌ FAILFAST mode detected corrupted data!")
    print(f"   Error: {str(e)[:200]}...")
    print("\n💡 Use FAILFAST in production to catch data quality issues early")

📌 MODE: FAILFAST - Fail on corrupted records


26/01/03 10:36:16 WARN TaskSetManager: Lost task 0.0 in stage 48.0 (TID 41) (172.18.0.7 executor 1): org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION] Malformed records are detected in record parsing: [3,Bad Record,missing_fields,null,null,null,null,3,Bad Record,missing_fields].
Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'. 
	at org.apache.spark.sql.errors.QueryExecutionErrors$.malformedRecordsDetectedInRecordParsingError(QueryExecutionErrors.scala:1610)
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:79)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:456)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.execution.datasou

❌ FAILFAST mode detected corrupted data!
   Error: An error occurred while calling o199.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 48.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4...

💡 Use FAILFAST in production to catch data quality issues early


---
## 📤 PART 9: READING MULTIPLE FILES

In [19]:
print("📝 Creating multiple CSV files...")
print("="*70)

# Create directory for multiple files
multi_dir = f"{DATA_RAW_LOCAL}/employees_multi"
os.makedirs(multi_dir, exist_ok=True)

# Split data into 3 files; the last file takes the remainder
chunk_size = len(employees_data) // 3

for i in range(3):
    start_idx = i * chunk_size
    end_idx = start_idx + chunk_size if i < 2 else len(employees_data)
    chunk_data = employees_data[start_idx:end_idx]

    df_chunk = pd.DataFrame(chunk_data)
    df_chunk.to_csv(f"{multi_dir}/employees_part_{i+1}.csv", index=False)
    print(f"✅ Created: employees_part_{i+1}.csv ({len(chunk_data)} records)")

print(f"\n📁 Directory: {multi_dir}")

📝 Creating multiple CSV files...
✅ Created: employees_part_1.csv (333 records)
✅ Created: employees_part_2.csv (333 records)
✅ Created: employees_part_3.csv (334 records)

📁 Directory: /opt/spark-data/raw/employees_multi


In [20]:
print("📌 Reading multiple CSV files at once")
print("="*70)

# Read all CSV files in directory.
# employee_schema now includes _corrupt_record (added in the PERMISSIVE cell),
# so that column appears in the output below.
df_multi = spark.read \
    .option("header", "true") \
    .schema(employee_schema) \
    .csv(f"{multi_dir}/*.csv")

print("✅ All files loaded successfully")
print(f"   Total rows: {df_multi.count()}")
print("   Files read: 3")  # hard-coded; matches the three files written above

print("\n📊 Sample from combined data:")
df_multi.show(10)

print("\n💡 Spark automatically combines all matching files!")

📌 Reading multiple CSV files at once
✅ All files loaded successfully
   Total rows: 1000
   Files read: 3

📊 Sample from combined data:
+---+-----------------+--------------------+-----------+------+---+----------+---------------+
| id|             name|               email| department|salary|age| hire_date|_corrupt_record|
+---+-----------------+--------------------+-----------+------+---+----------+---------------+
|667|      Susan Myers|kristina57@exampl...|    Finance| 71711| 31|2016-04-22|           NULL|
|668|   Gregory Murray|justinbailey@exam...|         HR| 96261| 22|2017-04-12|           NULL|
|669|    Robert Robles|angela14@example.net|    Finance| 86244| 37|2020-08-12|           NULL|
|670|  Cynthia Sanchez|terrireyes@exampl...|    Finance| 94600| 33|2018-02-01|           NULL|
|671|    Jessica Baird|   amy32@example.net|Engineering|108614| 45|2022-09-14|           NULL|
|672|Michelle Copeland|vargassteven@exam...|Engineering|108951| 56|2016-12-24|           NULL|
|
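
The cell above prints a fixed file count. When the directory contents are not known in advance, the files matched by the glob can be listed from Python before the read. This is a minimal sketch that assumes the driver sees the same filesystem as the executors (true for the shared volume used here); `matching_files` is a name introduced for illustration.

```python
import glob
import os


def matching_files(directory, pattern="*.csv"):
    """Return the sorted list of files in `directory` that match `pattern`."""
    return sorted(glob.glob(os.path.join(directory, pattern)))
```

The hard-coded line could then become `print(f"   Files read: {len(matching_files(multi_dir))}")`.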

                                                                                

---
## ☁️ PART 10: READING FROM MINIO (S3)

In [21]:
print("📤 Writing data to MinIO (S3)...")
print("="*70)

try:
    # Write Parquet to MinIO
    s3_path = f"{DATA_RAW_S3}/employees/employees.parquet"

    df_parquet.write \
        .mode("overwrite") \
        .parquet(s3_path)

    print(f"✅ Data written to MinIO: {s3_path}")

except Exception as e:
    print(f"❌ Failed to write to MinIO: {str(e)[:200]}")
    print("\n💡 Make sure:")
    print("   1. MinIO is running (docker-compose ps)")
    print("   2. Buckets are created (check MinIO console)")
    print("   3. Credentials are correct")

📤 Writing data to MinIO (S3)...


26/01/03 10:36:17 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                

‚úÖ Data written to MinIO: s3a://raw/employees/employees.parquet


In [22]:
print("📥 Reading data from MinIO (S3)...")
print("="*70)

try:
    # Read from MinIO
    df_from_s3 = spark.read.parquet(f"{DATA_RAW_S3}/employees/employees.parquet")

    print("✅ Data read from MinIO successfully")
    print(f"   Rows: {df_from_s3.count()}")
    print(f"   Columns: {len(df_from_s3.columns)}")

    print("\n📊 Sample data from S3:")
    df_from_s3.show(5)

    print("\n✅ MinIO (S3) integration working!")

except Exception as e:
    print(f"❌ Failed to read from MinIO: {str(e)[:200]}")

📥 Reading data from MinIO (S3)...
✅ Data read from MinIO successfully
   Rows: 1000
   Columns: 7

📊 Sample data from S3:
+---+-----------+--------------------+----------+---+-----------------+------+
|age| department|               email| hire_date| id|             name|salary|
+---+-----------+--------------------+----------+---+-----------------+------+
| 39|Engineering|donaldgarcia@exam...|2024-12-05|  1|     Allison Hill| 43278|
| 30|      Sales|jesseguzman@examp...|2017-12-29|  2|   Javier Johnson| 69256|
| 56|Engineering|  lisa02@example.net|2020-03-25|  3|Kimberly Robinson|128696|
| 49|Engineering|daviscolin@exampl...|2022-01-16|  4| Daniel Gallagher|117397|
| 27|Engineering| smiller@example.net|2022-03-11|  5|   Monica Herrera| 43905|
+---+-----------+--------------------+----------+---+-----------------+------+
only showing top 5 rows


✅ MinIO (S3) integration working!
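
The session created at the top of this notebook hard-codes the MinIO endpoint and keys, and "Credentials are correct" is one of the troubleshooting hints above. The following is a minimal sketch of building the same `spark.hadoop.fs.s3a.*` settings from environment variables instead. The variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY` and the helper `s3a_conf_from_env` are assumptions made for illustration; they are not part of this notebook.

```python
import os


def s3a_conf_from_env(env=None):
    """Build spark.hadoop.fs.s3a.* settings from environment variables.

    The variable names used here are hypothetical; adapt them to your setup.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("MINIO_ACCESS_KEY", "MINIO_SECRET_KEY") if not env.get(k)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        # Fall back to the endpoint used elsewhere in this notebook
        "spark.hadoop.fs.s3a.endpoint": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "spark.hadoop.fs.s3a.access.key": env["MINIO_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": env["MINIO_SECRET_KEY"],
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Example with an explicit mapping instead of the real environment
conf = s3a_conf_from_env({"MINIO_ACCESS_KEY": "demo", "MINIO_SECRET_KEY": "demo-secret"})
for key in sorted(conf):
    # Mask the secret so it never lands in notebook output
    shown = "***" if key.endswith("secret.key") else conf[key]
    print(f"{key} = {shown}")
```

Each key/value pair can then be passed to `SparkSession.builder.config(key, value)` before `getOrCreate()`.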


---
## 📊 PART 11: DATA QUALITY CHECKS

In [23]:
print("🔍 DATA QUALITY CHECKS")
print("="*70)

# Use the clean parquet data
df = df_parquet

print("\n1️⃣ Basic Statistics:")
print(f"   Total rows: {df.count()}")
print(f"   Total columns: {len(df.columns)}")

print("\n2️⃣ Column Names:")
print(f"   {', '.join(df.columns)}")

print("\n3️⃣ Data Types:")
for field in df.schema.fields:
    print(f"   {field.name:<15} {str(field.dataType):<20} nullable={field.nullable}")

print("\n4️⃣ Null Counts:")
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()

print("\n5️⃣ Numeric Statistics:")
df.select("salary", "age").describe().show()

print("\n6️⃣ Department Distribution:")
df.groupBy("department").count().orderBy("count", ascending=False).show()

🔍 DATA QUALITY CHECKS

1️⃣ Basic Statistics:
   Total rows: 1000
   Total columns: 7

2️⃣ Column Names:
   age, department, email, hire_date, id, name, salary

3️⃣ Data Types:
   age             LongType()           nullable=True
   department      StringType()         nullable=True
   email           StringType()         nullable=True
   hire_date       StringType()         nullable=True
   id              LongType()           nullable=True
   name            StringType()         nullable=True
   salary          LongType()           nullable=True

4️⃣ Null Counts:
+---+----------+-----+---------+---+----+------+
|age|department|email|hire_date| id|name|salary|
+---+----------+-----+---------+---+----+------+
|  0|         0|    0|        0|  0|   0|     0|
+---+----------+-----+---------+---+----+------+


5️⃣ Numeric Statistics:
+-------+------------------+------------------+
|summary|            salary|               age|
+-------+------------------+----------
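
The type listing in this part shows `hire_date` stored as a string, which is easy to miss when reading the output by eye. Below is a small sketch that compares a DataFrame's `dtypes` against a set of expected types. `EXPECTED_TYPES` and `dtype_mismatches` are hypothetical names, the expectations are assumptions about what this dataset should look like, and the sample list mirrors what `df.dtypes` would return for the schema printed above (Spark reports `LongType` as `bigint`).

```python
def dtype_mismatches(actual_dtypes, expected):
    """Return {column: (actual, expected)} for columns whose type differs.

    actual_dtypes: list of (name, type) pairs, the shape of DataFrame.dtypes.
    expected: dict mapping column name to the expected Spark type string.
    A column that is absent from the DataFrame is reported with actual=None.
    """
    actual = dict(actual_dtypes)
    return {
        name: (actual.get(name), want)
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Hypothetical expectations for the employees dataset
EXPECTED_TYPES = {"id": "bigint", "age": "bigint", "salary": "bigint", "hire_date": "date"}

# Mirrors the data types printed in this part
sample_dtypes = [
    ("age", "bigint"), ("department", "string"), ("email", "string"),
    ("hire_date", "string"), ("id", "bigint"), ("name", "string"), ("salary", "bigint"),
]

print(dtype_mismatches(sample_dtypes, EXPECTED_TYPES))
# {'hire_date': ('string', 'date')}
```

In the notebook itself the call would be `dtype_mismatches(df.dtypes, EXPECTED_TYPES)`.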

---
## 📝 PART 12: BEST PRACTICES SUMMARY

In [24]:
print("📚 BEST PRACTICES FOR READING DATA IN SPARK")
print("="*70)

best_practices = """
✅ DO:
1. Use an explicit schema in production (avoid inferSchema)
2. Use Parquet format for production workloads
3. Use PERMISSIVE mode with a _corrupt_record column to track bad data
4. Partition large datasets appropriately
5. Use MinIO/S3 for scalable storage
6. Set appropriate read options (dateFormat, nullValue, etc.)
7. Validate data quality after reading
8. Use columnar formats (Parquet, ORC) for analytics

❌ DON'T:
1. Don't use inferSchema on large datasets (it needs an extra pass over the data)
2. Don't use CSV/JSON for large production tables (prefer Parquet)
3. Don't ignore corrupted records (use PERMISSIVE to track them)
4. Don't read all columns if you only need a few (use select)
5. Don't use FAILFAST without proper error handling
6. Don't store data on the local filesystem in production (use S3/HDFS)

🎯 PERFORMANCE TIPS:
1. Columnar formats (Parquet, ORC) typically read faster than text formats (JSON, CSV)
2. Explicit schema > Schema inference (2-3x faster)
3. Predicate pushdown works best with Parquet
4. Use partitioning for large datasets
5. Coalesce/repartition after reading if needed

📊 FORMAT RECOMMENDATIONS:
• Development/Testing: CSV (easy to inspect)
• Production: Parquet (fast, compressed, schema-embedded)
• Streaming: JSON (flexible schema)
• Data Lake: Parquet with partitioning
"""

print(best_practices)
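
Speed-ups like the ones listed under the performance tips depend on data size and cluster setup, so they are worth measuring rather than assuming. The following is a minimal timing sketch; `time_action` is a hypothetical helper, and the workload at the bottom is only a stand-in so the block runs without a cluster.

```python
import time


def time_action(label, action, repeats=3):
    """Run a zero-argument callable several times and report the best wall time.

    Spark reads are lazy, so `action` should trigger real work,
    for example `lambda: spark.read.parquet(path).count()`.
    """
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        timings.append(time.perf_counter() - start)
    best = min(timings)
    print(f"{label:<25} best of {repeats}: {best:.4f}s")
    return best, result


# Stand-in workload so the sketch runs without a cluster
best, total = time_action("sum of 0..99999", lambda: sum(range(100_000)))
```

Taking the best of several runs reduces the effect of JVM warm-up and caching on the first read, although it does not remove it.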

---
## 🎓 PART 13: EXERCISES

In [25]:
print("🎓 EXERCISES")
print("="*70)

exercises = """
📝 Exercise 1: Schema Definition
Define an explicit schema for a transactions dataset with:
- transaction_id (integer)
- customer_id (integer)
- amount (double)
- transaction_date (date)
- status (string)

📝 Exercise 2: Read with Options
Read a CSV file with:
- Custom delimiter (|)
- Custom null value ("N/A")
- Custom date format ("dd/MM/yyyy")
- DROPMALFORMED mode

📝 Exercise 3: Performance Test
Compare read performance of:
- CSV with inferSchema
- CSV with explicit schema
- Parquet
Measure time for reading and counting rows.

📝 Exercise 4: Corrupted Data Handling
Create a CSV with intentional errors and:
- Read with PERMISSIVE mode
- Identify corrupted records
- Save clean records to staging
- Save corrupted records to error log

📝 Exercise 5: Multi-file Reading
Create 5 CSV files with different date ranges and:
- Read all files at once
- Filter by date range
- Aggregate by month
- Write result to Parquet

📝 Exercise 6: S3 Integration
- Write data to MinIO in Parquet format
- Read it back
- Perform transformations
- Write result to different S3 bucket
"""

print(exercises)
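
For Exercise 4, a test file with a known number of defects makes it easy to check how many rows the reader flags. Here is one possible way to generate such a file using only the standard library. The file name, the column layout (borrowed from the transactions dataset in Exercise 1) and the particular kinds of errors are assumptions chosen for illustration.

```python
import csv
import os
import tempfile

HEADER = ["transaction_id", "customer_id", "amount", "transaction_date", "status"]

GOOD_ROWS = [
    [1, 101, "25.50", "2024-01-05", "completed"],
    [2, 102, "310.00", "2024-01-06", "pending"],
]

# Each row breaks the expected layout in a different way
BAD_ROWS = [
    ["three", 103, "12.00", "2024-01-07", "completed"],  # non-numeric id
    [4, 104, "abc", "2024-01-08", "completed"],          # non-numeric amount
    [5, 105, "40.00", "2024-01-09"],                     # missing column
    [6, 106, "18.75", "2024-01-10", "failed", "extra"],  # extra column
]


def write_dirty_csv(path):
    """Write a CSV mixing valid and malformed rows; return (good, bad) counts."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(GOOD_ROWS + BAD_ROWS)
    return len(GOOD_ROWS), len(BAD_ROWS)


out_path = os.path.join(tempfile.mkdtemp(), "transactions_dirty.csv")
good, bad = write_dirty_csv(out_path)
print(f"Wrote {good} valid and {bad} malformed rows to {out_path}")
```

Whether each defect ends up in `_corrupt_record` depends on the schema you supply and on the Spark version, so compare the reader's counts with the numbers returned here instead of assuming they match.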

---
## 🧹 CLEANUP

In [26]:
print("🧹 Cleanup (optional)")
print("="*70)
print("\nTo clean up test data, uncomment and run:")
print("""\n# import shutil
# shutil.rmtree(f"{DATA_RAW_LOCAL}/employees", ignore_errors=True)
# shutil.rmtree(f"{DATA_RAW_LOCAL}/employees_multi", ignore_errors=True)
# print("✅ Test data cleaned up")
""")

print("\n✅ Notebook completed successfully!")
print("\n📚 Next: Notebook 2 - Writing Data")

🧹 Cleanup (optional)

To clean up test data, uncomment and run:

# import shutil
# shutil.rmtree(f"{DATA_RAW_LOCAL}/employees", ignore_errors=True)
# shutil.rmtree(f"{DATA_RAW_LOCAL}/employees_multi", ignore_errors=True)
# print("✅ Test data cleaned up")


✅ Notebook completed successfully!

📚 Next: Notebook 2 - Writing Data
