### Task 1: Handling Schema Mismatches using Spark
**Description**: Use Apache Spark to address schema mismatches by transforming data to match
the expected schema.

**Steps**:
1. Create Spark session
2. Load dataframe
3. Define the expected schema
4. Handle schema mismatches
5. Show corrected data

In [6]:
# Write your code from here

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col
import pandas as pd

# --------------------- Task 1: Handle Schema Mismatch (Spark) ---------------------

# Step 1: Create Spark Session
spark = SparkSession.builder.appName("SchemaMismatchHandler").getOrCreate()

# Step 2: Simulated Spark DataFrame with wrong schema (e.g., age as string)
raw_data = [
    {"id": "1", "name": "Alice", "age": "25"},
    {"id": "2", "name": "Bob", "age": "thirty"},  # invalid numeric value
    {"id": "3", "name": "Charlie", "age": "40"}
]
raw_df = spark.createDataFrame(raw_data)

# Step 3: Define expected schema
expected_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

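# Step 4: Handle mismatches by casting each column to its expected type.
# With ANSI mode disabled (the default before Spark 4.0), values that cannot
# be cast (e.g. "thirty") become null; those rows are then filtered out.
casted_df = raw_df.select(
    [col(f.name).cast(f.dataType).alias(f.name) for f in expected_schema.fields]
)
corrected_df = casted_df.dropna(subset=["id", "age"])

print("✅ Task 1: Corrected Spark DataFrame with expected schema:")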
corrected_df.show()

# --------------------- Task 2: Handle Incomplete Data (Pandas) ---------------------

# Step 1: Simulated Pandas DataFrame with missing values
data = {
    "CustomerID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "PurchaseAmount": [250.5, None, 300.0, 150.0, None]
}
df = pd.DataFrame(data)

# Step 2: Detect incomplete data
print("\n🔍 Task 2: Missing Value Report:")
print(df.isnull().sum())

# Step 3: Fill missing values
# - Fill 'Name' with placeholder
# - Fill 'PurchaseAmount' with median estimate
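# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in pandas 2.x and has no effect under copy-on-write.
df['Name'] = df['Name'].fillna("Unknown")
df['PurchaseAmount'] = df['PurchaseAmount'].fillna(df['PurchaseAmount'].median())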

# Step 4: Report after changes
print("\n✅ Task 2: Cleaned Pandas DataFrame:")
print(df)

# Optional: Summary of changes
missing_after = df.isnull().sum().sum()
print(f"\n📊 Total missing values after imputation: {missing_after}")


JAVA_HOME is not set


PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
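
The error above suggests that PySpark could not launch a JVM because `JAVA_HOME` is unset and, presumably, no `java` executable was reachable. A minimal pre-flight check, using only the standard library, might look like the sketch below. The helper name `find_java` is hypothetical and is not part of PySpark; it only approximates the lookup (first `JAVA_HOME`, then `PATH`) and does not verify the Java version.

```python
import os
import shutil


def find_java(env=None):
    """Return the path to a Java executable, or None if none is found.

    Hypothetical helper: looks in JAVA_HOME/bin first, then falls back to PATH.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        # On Windows the executable is named java.exe
        for path in (candidate, candidate + ".exe"):
            if os.path.isfile(path):
                return path
    # Search the directories listed in PATH (returns None if PATH is empty)
    return shutil.which("java", path=env.get("PATH", ""))


java_path = find_java()
if java_path is None:
    print("No Java runtime found: install a JDK and set JAVA_HOME "
          "before creating a SparkSession.")
else:
    print(f"Java found at: {java_path}")
```

Running this in a cell before `SparkSession.builder...getOrCreate()` gives a clearer message than the gateway error when Java is missing.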

### Task 2: Detect and Correct Incomplete Data in ETL
**Description**: Use Python and Pandas to detect incomplete data in an ETL process and fill
missing values with estimates.

**Steps**:
1. Detect incomplete data
2. Fill missing values
3. Report changes
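
The three steps above can be sketched as a single helper that records missing-value counts before and after filling, so the report reflects what actually changed. This is one possible approach, not a prescribed solution; the function name `fill_and_report`, its `fill_values` argument, and the sample `orders` data are assumptions made for illustration.

```python
import pandas as pd


def fill_and_report(df, fill_values):
    """Fill missing values per column and return (cleaned_df, report).

    `fill_values` maps column name -> replacement value (hypothetical interface).
    The input DataFrame is left unmodified.
    """
    # Step 1: detect incomplete data
    missing_before = df.isnull().sum()
    # Step 2: fill missing values; fillna returns a new DataFrame
    cleaned = df.fillna(value=fill_values)
    # Step 3: report changes per column
    missing_after = cleaned.isnull().sum()
    report = pd.DataFrame({
        "missing_before": missing_before,
        "missing_after": missing_after,
        "filled": missing_before - missing_after,
    })
    return cleaned, report


# Illustrative sample data
orders = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "PurchaseAmount": [250.5, None, 300.0, 150.0, None],
})

# Placeholder for names, median as the estimate for amounts
fills = {
    "Name": "Unknown",
    "PurchaseAmount": orders["PurchaseAmount"].median(),
}
cleaned, report = fill_and_report(orders, fills)
print(report)
print(cleaned)
```

Returning a new DataFrame rather than modifying the input keeps the original data available for comparison, which makes the "report changes" step straightforward.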

In [None]:
# Write your code from here