## Imports

In [0]:
from pyspark.sql import SparkSession

## Create a Spark Session

In [0]:
spark = SparkSession.builder \
        .appName("Spark Introduction") \
        .getOrCreate()

## What is a PySpark Session and Why is it Required?

**What is a PySpark Session?** A **PySpark Session**, represented by the `SparkSession` class, is the entry point to programming with Spark using the Python API. It represents a connection to a Spark cluster and lets you create DataFrames, execute SQL queries, and manage configuration.

The `SparkSession` class was introduced in **Spark 2.0** to consolidate the separate entry points that earlier versions required:
- `SQLContext`
- `HiveContext`
- `SparkContext` (still available through `spark.sparkContext`)

Now, `SparkSession` serves as the **unified gateway** to all Spark functionality.

**Why is a PySpark Session Required?** The PySpark Session is essential because:
- It **initializes the Spark environment** and provides access to the cluster's resources. 
- It allows you to **read data** from various sources like CSV, Parquet, JSON, Hive, JDBC, etc. 
- It enables **DataFrame creation and manipulation**, including SQL-like queries. 
- It **manages configuration** such as memory settings, application name, and master URL.
- It behaves like a **singleton**: `getOrCreate()` returns the already active session if one exists, so the application reuses the same resources instead of starting a new session each time.

**Basic Example** 

```python
from pyspark.sql import SparkSession

# Create or get existing Spark Session
spark = SparkSession.builder \
        .appName("ExampleApp") \
        .master("local[*]") \
        .getOrCreate()

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
```

**Key Points** 
- An application normally uses **one Spark Session**; `getOrCreate()` returns the existing session instead of creating another.
- Internally, it manages the `SparkContext` and provides high-level APIs.
- In Spark 2.0 and later, `SparkSession` is the standard way to create DataFrames and execute Spark SQL commands.

**Conclusion** 

The `SparkSession` is a mandatory component in any PySpark application as it:

- Acts as the main entry point.
- Provides access to all Spark features.
- Manages the connection to the Spark cluster.

Always start your PySpark code by creating a `SparkSession`.


## Create DataFrame

In [0]:
# Emp Data & Schema

emp_data = [
    ["001","101","John Doe","30","Male","50000","2015-01-01"],
    ["002","101","Jane Smith","25","Female","45000","2016-02-15"],
    ["003","102","Bob Brown","35","Male","55000","2014-05-01"],
    ["004","102","Alice Lee","28","Female","48000","2017-09-30"],
    ["005","103","Jack Chan","40","Male","60000","2013-04-01"],
    ["006","103","Jill Wong","32","Female","52000","2018-07-01"],
    ["007","101","James Johnson","42","Male","70000","2012-03-15"],
    ["008","102","Kate Kim","29","Female","51000","2019-10-01"],
    ["009","103","Tom Tan","33","Male","58000","2016-06-01"],
    ["010","104","Lisa Lee","27","Female","47000","2018-08-01"],
    ["011","104","David Park","38","Male","65000","2015-11-01"],
    ["012","105","Susan Chen","31","Female","54000","2017-02-15"],
    ["013","106","Brian Kim","45","Male","75000","2011-07-01"],
    ["014","107","Emily Lee","26","Female","46000","2019-01-01"],
    ["015","106","Michael Lee","37","Male","63000","2014-09-30"],
    ["016","107","Kelly Zhang","30","Female","49000","2018-04-01"],
    ["017","105","George Wang","34","Male","57000","2016-03-15"],
    ["018","104","Nancy Liu","29","Female","50000","2017-06-01"],
    ["019","103","Steven Chen","36","Male","62000","2015-08-01"],
    ["020","102","Grace Kim","32","Female","53000","2018-11-01"]
]

emp_schema = "employee_id string, department_id string, name string, age string, gender string, salary string, hire_date string"

In [0]:
# Create emp DataFrame

emp = spark.createDataFrame(data=emp_data, schema=emp_schema)

In [0]:
# Show data (ACTION)
# display() renders a table in Databricks notebooks; elsewhere, emp.show() prints the rows

display(emp)

## Filter DataFrames

In [0]:

# Write our first Transformation (EMP salary > 50000)
# salary is a string column, so this comparison relies on Spark casting it implicitly

emp_final = emp.where("salary > 50000")

In [0]:
# Write data as CSV output (ACTION)
# The path becomes a directory of part files; by default the write fails if it already exists

emp_final.write.format("csv").save("data/output/1/emp.csv")