# 📊 Week 2: Ingest & Explore with PySpark
### Building Reliable Data Pipelines in Databricks  

---

## **Goals for This Session**  
By the end of this session, you will:

- Understand how Spark ingests structured & semi-structured data  
- Explore datasets using DataFrames  
- Detect schema issues and data quality problems  
- Compare formats: CSV, Parquet, and Delta  
- Understand Delta features (ACID, schema evolution, time travel)  
- See how this fits into real data pipelines

## **Setup and Imports**

In [None]:
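# Note: the wildcard import from pyspark.sql.functions shadows Python built-ins
# such as sum, min, max, and round with their Spark column versions.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from delta.tables import *
import time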

## **1. Data Ingestion**

### Common Sources
- Databricks FileStore / DBFS  
- Local file uploads  
- Cloud storage (S3 / ADLS / GCS)  
- Databases (JDBC connectors)

In [None]:
# Sample data ingestion with schema inference
df_inferred = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("/FileStore/orders.csv")

print("Schema with inference:")
df_inferred.printSchema()

## **2. Explicit Schema Definition**

### Why Use Explicit Schema?
✔ Predictable and reliable  
✔ Enforces consistent data types  
✔ Prevents data quality issues from type changes

In [None]:
# Define explicit schema
schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("price", DoubleType(), True),
    StructField("order_date", StringType(), True)
])

# Read with explicit schema
df = spark.read.csv("/FileStore/orders.csv", schema=schema, header=True)

print("Schema with explicit definition:")
df.printSchema()

## **3. Data Exploration**

### Basic DataFrame Operations

In [None]:
# Display the first 5 rows
print("First 5 rows:")
df.show(5)

In [None]:
# Get summary statistics
print("Summary statistics:")
df.describe().show()

In [None]:
# Extended summary including quartiles
print("Extended summary:")
df.summary().show()

## **4. Data Profiling and Quality Checks**

In [None]:
# Check for null values
print("Null value counts:")
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

In [None]:
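# Check for duplicates (compute each count once; every count() triggers a Spark job)
total_rows = df.count()
distinct_rows = df.distinct().count()
print(f"Total rows: {total_rows}")
print(f"Distinct rows: {distinct_rows}")
print(f"Duplicate rows: {total_rows - distinct_rows}")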

## **5. Deeper Data Analysis**

In [None]:
# Aggregate per product: total quantity, order count, and average price
product_sales = df.groupBy("product_id") \
                  .agg(sum("quantity").alias("total_quantity"),
                       count("*").alias("order_count"),
                       avg("price").alias("avg_price"))

print("Product sales summary:")
product_sales.show()

In [None]:
# Find top products by quantity
top_products = product_sales.orderBy("total_quantity", ascending=False)
print("Top 10 products by quantity:")
top_products.show(10)

## **6. Format Conversion: CSV → Parquet**

In [None]:
# Convert to Parquet
print("Converting to Parquet...")
start_time = time.time()

df.write.mode("overwrite").parquet("/FileStore/orders_parquet")

parquet_time = time.time() - start_time
print(f"Parquet write completed in {parquet_time:.2f} seconds")

In [None]:
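# Read from Parquet and time it
# spark.read is lazy: without an action such as count(), the timer would only cover schema resolution
print("Reading from Parquet...")
start_time = time.time()

df_parquet = spark.read.parquet("/FileStore/orders_parquet")
parquet_rows = df_parquet.count()

parquet_read_time = time.time() - start_time
print(f"Parquet read of {parquet_rows} rows completed in {parquet_read_time:.2f} seconds")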

df_parquet.show(5)

## **7. Delta Lake Implementation**

In [None]:
# Convert to Delta format
print("Converting to Delta...")
start_time = time.time()

df.write.format("delta").mode("overwrite").save("/FileStore/orders_delta")

delta_time = time.time() - start_time
print(f"Delta write completed in {delta_time:.2f} seconds")

In [None]:
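# Read from Delta and time it
# spark.read is lazy: without an action such as count(), the timer would only cover table metadata resolution
print("Reading from Delta...")
start_time = time.time()

df_delta = spark.read.format("delta").load("/FileStore/orders_delta")
delta_rows = df_delta.count()

delta_read_time = time.time() - start_time
print(f"Delta read of {delta_rows} rows completed in {delta_read_time:.2f} seconds")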

df_delta.show(5)

## **8. Delta Lake Features**

### Schema Evolution

In [None]:
# Add a new column for schema evolution
df_with_region = df.withColumn("region", lit("US"))

print("DataFrame with new region column:")
df_with_region.show(5)

In [None]:
# Write with schema evolution
df_with_region.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("overwrite") \
    .save("/FileStore/orders_delta")

print("Schema evolution completed")

### Time Travel

In [None]:
# View Delta table history
delta_table = DeltaTable.forPath(spark, "/FileStore/orders_delta")
print("Delta table history:")
delta_table.history().show()

In [None]:
# Query a previous version (version 0 is the table's first commit)
try:
    df_v0 = spark.read.format("delta") \
                  .option("versionAsOf", 0) \
                  .load("/FileStore/orders_delta")

    print("Version 0 data:")
    df_v0.show(5)
    print("Version 0 columns:", df_v0.columns)
except Exception as e:
    # Raised if version 0 does not exist or its files are no longer available (e.g., after VACUUM)
    print(f"Could not read version 0: {e}")

## **9. Data Quality Validation**

In [None]:
# Check for data quality issues
print("Data Quality Checks:")
print("=" * 30)

# Check for negative quantities
negative_qty = df_delta.filter(col("quantity") < 0).count()
print(f"Records with negative quantity: {negative_qty}")

# Check for zero or negative prices
invalid_prices = df_delta.filter(col("price") <= 0).count()
print(f"Records with invalid prices: {invalid_prices}")

# Check for reasonable quantity ranges
high_qty = df_delta.filter(col("quantity") > 1000).count()
print(f"Records with unusually high quantity (>1000): {high_qty}")

## **10. Summary and Best Practices**

In [None]:
print("Week 2 Summary:")
print("=" * 40)
print("✅ Data ingestion with explicit schema")
print("✅ Data exploration and profiling")
print("✅ Format conversion (CSV → Parquet → Delta)")
print("✅ Delta Lake features (schema evolution, time travel)")
print("✅ Data quality validation")
print("✅ Performance comparison")

print("\nBest Practices:")
print("- Use explicit schemas in production")
print("- Profile data for quality issues")
print("- Choose appropriate formats for your use case")
print("- Leverage Delta Lake for reliable pipelines")
print("- Implement data quality checks")

## **Format Comparison Table**

| Feature | CSV | Parquet | Delta |
|---------|-----|---------|-------|
| Structure | Row-based | Columnar | Columnar with transaction log |
| Compression | None built in (files can be gzipped) | Built in, per column | Built in, per column (data stored as Parquet) |
| ACID Transactions | ❌ | ❌ | ✅ |
| Schema Evolution | ❌ | Limited | ✅ |
| Time Travel | ❌ | ❌ | ✅ |
| Use Case | Simple data sharing | Analytics & BI | Reliable data pipelines |