# PySpark: Zero to Hero
## Module 6: Hands-on - Creating DataFrames & Transformations

In this notebook, we will write our first actual PySpark code. We will simulate a real-world scenario: generating employee data, filtering it, and saving the results.

### Agenda:
1.  **SparkSession:** Creating the entry point.
2.  **Create DataFrame:** Making a distributed dataset from a list.
3.  **Inspecting Data:** Using Actions like `.show()`.
4.  **Transformations:** Filtering data.
5.  **Immutability:** Understanding how new DataFrames are created.
6.  **Bonus Tip:** Accessing the active session.

In [None]:
from pyspark.sql import SparkSession

# Create the SparkSession Object
# appName: Identifying your application in the Spark UI
# master: "local[*]" means run locally using all available cores
spark = SparkSession.builder \
    .appName("Spark_Introduction") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Created Successfully")

In [None]:
# 1. Define the Data (List of Lists)
# Each inner list represents a Row: [ID, DeptID, Name, Age, Gender, Salary, HireDate]
data = [
    ["001", "101", "John Doe", "30", "Male", "50000", "2015-01-01"],
    ["002", "101", "Jane Smith", "25", "Female", "45000", "2016-04-15"],
    ["003", "102", "Bob Brown", "35", "Male", "55000", "2014-05-01"],
    ["004", "102", "Alice Lee", "28", "Female", "48000", "2017-09-30"],
    ["005", "103", "Jack Chan", "40", "Male", "60000", "2013-04-01"],
    ["006", "103", "Jill Wong", "32", "Female", "52000", "2018-07-01"],
    ["007", "101", "James Johnson", "42", "Male", "70000", "2012-03-15"],
    ["008", "102", "Kate Kim", "29", "Female", "51000", "2019-10-01"],
    ["009", "103", "Tom Tan", "33", "Male", "58000", "2016-06-01"],
    ["010", "104", "Lisa Lee", "27", "Female", "47000", "2018-08-01"]
]

# 2. Define the Schema (Column Names)
# Note: We pass only column names, so Spark infers each column's type from the data.
# Every value above is a string, so every column becomes StringType. In real scenarios,
# you would declare Integer/Date types explicitly.
columns = ["employee_id", "department_id", "name", "age", "gender", "salary", "hire_date"]

print("Data and Schema defined.")

In [None]:
# Create the DataFrame using the createDataFrame method
emp = spark.createDataFrame(data=data, schema=columns)

# Validate: Check the number of partitions
# This tells us how many chunks the data is split into
print(f"Number of Partitions: {emp.rdd.getNumPartitions()}")

In [None]:
# So far, the DataFrame created above is only a plan.
# Nothing executes until we call an Action such as .show()
print("--- Employee Data ---")
emp.show()

## Transformations & Immutability

We now want to keep only the employees whose **Salary > 50,000**.

Since DataFrames are **Immutable**, we cannot modify the `emp` DataFrame directly. Instead, we apply a transformation which returns a **new** DataFrame (`emp_final`).

In [None]:
# Apply Transformation: keep only rows where salary > 50000
# Note: salary is a string in our schema. Spark SQL implicitly casts it to a numeric type
# for this comparison; an explicit cast is the safer choice in production code.
emp_final = emp.where("salary > 50000")

# Verify: a filter does not shuffle data, so the partition count stays the same
print(f"Partitions in filtered DF: {emp_final.rdd.getNumPartitions()}")

# Note: If you check the Spark UI now (localhost:4040), you will NOT see a job for this filter yet.
# That is Lazy Evaluation.
print("Transformation defined (Lazy Evaluation). No Job triggered yet.")

In [None]:
# Now we call an Action to see the result
print("--- Employees with Salary > 50,000 ---")
emp_final.show()

In [None]:
# Writing data is also an Action.
# We write the filtered data out in CSV format.
# Note: Spark creates a directory at this path containing part files, not a single CSV file.

# 'overwrite' mode replaces any existing output, so this cell can be re-run without error.
emp_final.write.mode("overwrite").csv("data/output/high_salary_employees.csv")

print("Data written successfully to data/output/high_salary_employees.csv")

In [None]:
# Bonus Tip:
# Sometimes you are in a function where the 'spark' variable isn't available.
# You can grab the existing active SparkSession without creating a new one.
# getActiveSession() returns None if no session is active in the current thread.

new_spark_ref = SparkSession.getActiveSession()

print(f"Original Object: {spark}")
print(f"New Reference:   {new_spark_ref}")

# Both names refer to the same Python object
print(f"Are they the same object? {spark is new_spark_ref}")

## Summary

1.  **SparkSession** is the entry point to the application.
2.  **DataFrames** organize data into rows and named columns, distributed across partitions.
3.  **Lazy Evaluation:** `emp.where(...)` did not run immediately. It waited for an Action such as `.show()` or a `.write` call.
4.  **Immutability:** We created `emp_final` rather than modifying `emp`.

**Next Steps:**
In the next video, we will dive deeper into **Read and Write modes** (CSV, Parquet, JSON) and understand schemas in depth.