
# Module 1: PySpark Basics - Healthcare Data Ingestion
**Scenario:** Working for a Service Company (e.g., Cognizant/Accenture) for a Healthcare Client (e.g., UnitedHealth/CVS).

**Objective:** Understand how to ingest raw patient data, validate its structure (Schema), and perform basic cleanup.

**Why this matters:**
In a real project, you rarely start with clean data. You receive large CSV/JSON files (often terabytes in size) exported from legacy systems such as mainframes. Your first job is to ingest this data into a Data Lake (HDFS/S3).

**Data Volume Context:**
*   **Small Data (Excel/Pandas):** < 1-2 GB. Fits in your laptop's RAM.
*   **Big Data (Spark):** > 100 GB - Petabytes. Distributed across many servers.
*   *In this notebook, we simulate Big Data using small examples for learning.*

---
## 1. Setup Environment (Google Colab / Local)

In [2]:
# In Real Project: PySpark is pre-installed on the cluster (EMR/Databricks/Cloudera).
# In Google Colab or Local Machine: We must install it first.
try:
    import pyspark
    print("PySpark is already installed")
except ImportError:
    print("Installing PySpark...")
    !pip install pyspark findspark

PySpark is already installed


## 2. Initialize Spark Session (Entry Point)
Every PySpark application starts with a `SparkSession`. It connects your program to the Cluster Manager (YARN/Kubernetes).

*   **Master:** `local[*]` means run locally on your machine using *all* available CPU cores. In production, this would be `yarn` or `k8s://...`.
*   **AppName:** Name visible in the Spark UI / YARN logs. Important for debugging.
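
One way to avoid hard-coding the master is to read it from the environment and fall back to `local[*]`. The sketch below assumes a hypothetical `SPARK_MASTER` environment variable (not a name Spark itself defines) and hypothetical `resolve_master` / `build_session` helpers; on many clusters the master is instead supplied through `spark-submit --master`.

```python
import os


def resolve_master(env=None):
    """Return the master URL from SPARK_MASTER (hypothetical variable), else local[*]."""
    env = os.environ if env is None else env
    return env.get("SPARK_MASTER", "local[*]")


def build_session(app_name, env=None):
    """Create (or reuse) a SparkSession for the resolved master."""
    # Imported lazily so resolve_master() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .master(resolve_master(env))
        .getOrCreate()
    )
```

With the variable unset, `build_session("Healthcare_Data_Ingestion_Dev")` should behave like the explicit `local[*]` session created in the next cell.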

In [3]:
from pyspark.sql import SparkSession
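# Wildcard imports keep notebook cells short, but note that functions.*
# shadows Python built-ins such as sum, min, max, and round.
from pyspark.sql.types import *
from pyspark.sql.functions import *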

# Create Spark Session
spark = SparkSession.builder \
    .appName("Healthcare_Data_Ingestion_Dev") \
    .master("local[*]") \
    .getOrCreate()

print(f"Spark Version: {spark.version}")
print("Spark Session Created Successfully!")

Spark Version: 4.0.2
Spark Session Created Successfully!


## 3. Data Ingestion Scenario: Raw Patient CSV
**The Context:**
*   A hospital's patient registration system exports daily data as CSV.
*   The data is stored in the "Raw Zone" or "Landing Zone" of our Data Lake.
*   **Problem:** The CSV has no type information (everything is a string). Dates might be messy.

### Approach 1: InferSchema (The "Lazy" Way)
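Spark can infer the data types by making an extra pass over the file (one pass to detect types, one to load the data).
*   **Pros:** Quick for development.
*   **Cons:** Slow for Big Data (reading 1 TB twice is expensive), and the guess can be wrong: a single literal `NULL` in `bill_amount` is enough to make the whole column a string. **Avoid in production.**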
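
If inference is still convenient during development, its cost can be reduced by inferring types from a fraction of the rows. The sketch below relies on the CSV reader's `samplingRatio` option; the `inference_options` and `infer_on_sample` helpers and the 10% default are illustrative assumptions, not part of the original notebook. Sampling can miss rare values, so the printed schema should be reviewed rather than trusted.

```python
def inference_options(sampling_ratio=0.1):
    """Build CSV reader options that infer types from a fraction of the rows."""
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    return {
        "header": "true",
        "inferSchema": "true",
        "samplingRatio": str(sampling_ratio),
    }


def infer_on_sample(spark, path, sampling_ratio=0.1):
    """Read a CSV with sampled inference and print the schema for review."""
    df = spark.read.options(**inference_options(sampling_ratio)).csv(path)
    df.printSchema()  # review this, then write it out as an explicit schema
    return df
```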

### Approach 2: Define Schema (The "Professional" Way)
We tell Spark exactly what the columns are.
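*   **Pros:** Fast (reads once). Values that do not match the schema are handled according to the read `mode`: the default `PERMISSIVE` silently turns them into nulls, while `FAILFAST` aborts the read on the first bad record (Data Quality).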
*   **Cons:** Need to type out the schema.
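
The verbosity can be reduced somewhat, because `DataFrameReader.schema` also accepts a DDL-formatted string. A minimal sketch, assuming the column layout of the dummy file created in the next cell; `PATIENT_COLUMNS`, `to_ddl`, and `read_patients_strict` are hypothetical names. It also sets `mode` to `FAILFAST` so that bad records raise an error, and `nullValue` so that the literal text `NULL` is read as a null rather than as a malformed number.

```python
# Column names and Spark SQL types for the patient export (assumed layout).
PATIENT_COLUMNS = [
    ("patient_id", "STRING"),
    ("full_name", "STRING"),
    ("dob", "DATE"),
    ("gender", "STRING"),
    ("contact_number", "STRING"),
    ("last_visit_date", "DATE"),
    ("diagnosis", "STRING"),
    ("bill_amount", "DOUBLE"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_patients_strict(spark, path):
    """Read the CSV with an explicit schema, failing on malformed records."""
    return (
        spark.read
        .option("header", "true")
        .option("nullValue", "NULL")  # treat the literal text NULL as a null
        .option("mode", "FAILFAST")   # raise instead of silently nulling bad values
        .schema(to_ddl(PATIENT_COLUMNS))
        .csv(path)
    )
```

Because Spark evaluates lazily, a `FAILFAST` error surfaces when an action such as `show()` or `count()` runs, not when the reader is configured.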

---
Let's first create some dummy CSV data.

In [4]:
# --- Create Dummy CSV Data ---
import os

# Create folder for raw data
os.makedirs("raw_data", exist_ok=True)

# Write dummy data to a CSV file
raw_csv_path = "raw_data/patients_daily_20230101.csv"
csv_content = """patient_id,full_name,dob,gender,contact_number,last_visit_date,diagnosis,bill_amount
P001,John Doe,1980-05-15,M,555-0101,2023-01-01,Hypertension,150.50
P002,Jane Smith,1992-08-22,F,555-0102,2023-01-02,Fracture-Arm,2500.00
P003,Robert Brown,1975-12-10,M,,2022-12-30,Diabetes_Type2,NULL
P004,Emily Davis,2001-03-30,F,555-0104,2023-01-03,Flu,85.00
P005,Michael Wilson,1988-11-11,M,555-0105,2023-01-01,Hypertension,150.50
"""

with open(raw_csv_path, "w") as f:
    f.write(csv_content)

print(f"Created dummy file at: {raw_csv_path}")

# --- Approach 1: Infer Schema (The Easy Way) ---
print("\n--- Reading with inferSchema=True (Scans file twice) ---")
df_inferred = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(raw_csv_path)

df_inferred.printSchema()
df_inferred.show()

# --- Approach 2: Explicit Schema (The Production Way) ---
# Using StructTypes is slightly verbose but crucial for large pipelines.
print("\n--- Reading with Defined Schema (Scans file once) ---")

patient_schema = StructType([
    StructField("patient_id", StringType(), False),  # Declared non-nullable; file readers typically relax this to nullable on load
    StructField("full_name", StringType(), True),
    StructField("dob", DateType(), True),
    StructField("gender", StringType(), True),
    StructField("contact_number", StringType(), True),
    StructField("last_visit_date", DateType(), True),
    StructField("diagnosis", StringType(), True),
    StructField("bill_amount", DoubleType(), True) # Handle decimals
])

df_patients = spark.read \
    .option("header", "true") \
    .schema(patient_schema) \
    .csv(raw_csv_path)

df_patients.printSchema()
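# Compare with the inferred schema: 'bill_amount' is now a double instead of a string.
# On this Spark version inferSchema also detected the date columns; older versions may leave them as strings.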
df_patients.show()

Created dummy file at: raw_data/patients_daily_20230101.csv

--- Reading with inferSchema=True (Scans file twice) ---
root
 |-- patient_id: string (nullable = true)
 |-- full_name: string (nullable = true)
 |-- dob: date (nullable = true)
 |-- gender: string (nullable = true)
 |-- contact_number: string (nullable = true)
 |-- last_visit_date: date (nullable = true)
 |-- diagnosis: string (nullable = true)
 |-- bill_amount: string (nullable = true)

+----------+--------------+----------+------+--------------+---------------+--------------+-----------+
|patient_id|     full_name|       dob|gender|contact_number|last_visit_date|     diagnosis|bill_amount|
+----------+--------------+----------+------+--------------+---------------+--------------+-----------+
|      P001|      John Doe|1980-05-15|     M|      555-0101|     2023-01-01|  Hypertension|     150.50|
|      P002|    Jane Smith|1992-08-22|     F|      555-0102|     2023-01-02|  Fracture-Arm|    2500.00|
|      P003|  Robert Brown|

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 4. Renaming & Selecting Columns (Standardization)
In large organizations, Data Architects define "Naming Standards".
*   Example: Raw data has `dob`, but the target data warehouse (Hive/Snowflake) requires `date_of_birth`.
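*   Example: A specific report needs only `patient_id`, `full_name`, and `diagnosis` (PII Compliance: don't load unnecessary data).

**Task:**
1.  Rename `dob` to `date_of_birth`.
2.  Rename `contact_number` to `phone_number`.
3.  Rename `bill_amount` to `total_billed_amount`.
4.  Select only `patient_id`, `full_name`, and `diagnosis` for the Pharmacy report.
5.  Filter out patients with `NULL` or empty contact numbers (Data Quality).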
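
Because `withColumnRenamed` silently does nothing when the source column is absent, a naming standard can be easier to audit when it is kept as a mapping and checked before it is applied. The following is a sketch under that assumption; `RENAME_MAP` and `apply_renames` are hypothetical names, and the mapping mirrors the renames used in the cell below.

```python
from functools import reduce

# Raw column name -> standardized warehouse name (illustrative mapping).
RENAME_MAP = {
    "dob": "date_of_birth",
    "contact_number": "phone_number",
    "bill_amount": "total_billed_amount",
}


def apply_renames(df, mapping):
    """Rename columns per mapping, raising if a source column is missing."""
    missing = [old for old in mapping if old not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in DataFrame: {missing}")
    # Chain one withColumnRenamed call per (old, new) pair.
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        mapping.items(),
        df,
    )
```

It would be called as `df_renamed = apply_renames(df_patients, RENAME_MAP)`; a typo in a source column name then fails loudly instead of producing an unchanged DataFrame.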

In [6]:
# 1. Renaming Columns: Use `withColumnRenamed`
df_renamed = df_patients \
    .withColumnRenamed("dob", "date_of_birth") \
    .withColumnRenamed("contact_number", "phone_number") \
    .withColumnRenamed("bill_amount", "total_billed_amount")

# 2. Selecting Specific Columns: Use `select`
# Business Requirement: Provide a list for the Pharmacy system (only ID, name & diagnosis needed)
df_pharmacy = df_renamed.select("patient_id", "full_name", "diagnosis")

# 3. Filtering: Use `filter` or `where`
# Data Quality Check: Remove records with NULL or empty (blank) phone numbers
df_valid_contacts = df_renamed.filter(
    col("phone_number").isNotNull() & (trim(col("phone_number")) != "")
)

print("--- Data with Renamed Columns ---")
df_renamed.show()

print("--- Pharmacy Report (Selection) ---")
df_pharmacy.show()

print("--- Valid Contacts Only (Null Phones Removed) ---")
df_valid_contacts.show()
# Notice patient P003 (Robert Brown) is removed because contact_number was missing.

--- Data with Renamed Columns ---
+----------+--------------+-------------+------+------------+---------------+--------------+-------------------+
|patient_id|     full_name|date_of_birth|gender|phone_number|last_visit_date|     diagnosis|total_billed_amount|
+----------+--------------+-------------+------+------------+---------------+--------------+-------------------+
|      P001|      John Doe|   1980-05-15|     M|    555-0101|     2023-01-01|  Hypertension|              150.5|
|      P002|    Jane Smith|   1992-08-22|     F|    555-0102|     2023-01-02|  Fracture-Arm|             2500.0|
|      P003|  Robert Brown|   1975-12-10|     M|        NULL|     2022-12-30|Diabetes_Type2|               NULL|
|      P004|   Emily Davis|   2001-03-30|     F|    555-0104|     2023-01-03|           Flu|               85.0|
|      P005|Michael Wilson|   1988-11-11|     M|    555-0105|     2023-01-01|  Hypertension|              150.5|
+----------+--------------+-------------+------+------------+-