# Set ENV Variable to Project Path

In [1]:
# Automatically reload modules when they change
%load_ext autoreload
%autoreload 2

Add the project root folder to `sys.path` so that project modules can be imported

In [2]:
import os
import sys

def find_project_root(start_path=None, markers=(".git", "pyproject.toml", "requirements.txt")):
    """
    Walks up from start_path until it finds one of the marker files/folders.
    Returns the path of the project root.
    """
    if start_path is None:
        start_path = os.getcwd()

    current_path = os.path.abspath(start_path)

    while True:
        # check if any marker exists in current path
        if any(os.path.exists(os.path.join(current_path, marker)) for marker in markers):
            return current_path

        new_path = os.path.dirname(current_path)  # parent folder
        if new_path == current_path:  # reached root of filesystem
            raise FileNotFoundError(f"None of the markers {markers} found above {start_path}")
        current_path = new_path

project_root = find_project_root()
print("Project root:", project_root)

if project_root not in sys.path:
    sys.path.insert(0, project_root)


Project root: c:\ds_analytics_projects\darshil_course\apache-pyspark\darshil-pyspark
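
The same walk-up search can also be written with `pathlib`. This is a minimal sketch, not a replacement for the cell above; the function name `find_root_pathlib` and the temporary folder layout are made up for illustration.

```python
import tempfile
from pathlib import Path


def find_root_pathlib(start, markers=(".git", "pyproject.toml", "requirements.txt")):
    """Return the nearest directory at or above `start` that contains a marker."""
    start = Path(start).resolve()
    # Check `start` itself first, then each ancestor up to the filesystem root.
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"None of the markers {markers} found at or above {start}")


# Hypothetical layout: <tmp>/demo_project/pyproject.toml and <tmp>/demo_project/notebooks/eda
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "demo_project"
    nested = project / "notebooks" / "eda"
    nested.mkdir(parents=True)
    (project / "pyproject.toml").touch()
    found = find_root_pathlib(nested)
    found_matches = found == project
    print(found_matches)
```

Because the search starts at the deepest folder and stops at the first hit, a notebook stored several levels below the project root still resolves to the same root.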


# Import Libraries

Import packages

In [3]:
import pandas as pd
import numpy as np

Import a project helper (resolvable because the project root is now on `sys.path`)

In [4]:
from utils.file_utils import get_project_path

Import PySpark and create a Spark session

In [5]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("FlightDataExample") \
    .getOrCreate()


# 📒 Section 1: Reading a DataFrame

### 🔎 Step 1: What does this code do?

1. **`spark.read`**
    - Uses the active **SparkSession** (`spark`) to create a DataFrameReader.
    - Think of it as: *“I want Spark to read some data.”*
2. **`.format("csv")`**
    - Tells Spark the file format is **CSV**.
    - Other formats supported: `"json"`, `"parquet"`, `"orc"`, `"jdbc"`, etc.
    - Default (if omitted) = `"parquet"`.
3. **`.option("header", "true")`**
    - Indicates the first row of the CSV file contains column names.
    - If `"false"`, Spark will assign generic column names like `_c0, _c1, _c2...`.
4. **`.option("inferSchema", "true")`**
    - Spark will automatically detect column data types (string, integer, double, etc.).
    - If `"false"`, all columns are read as **string** by default.
5. **`.load("path")`**
    - Specifies the location of the data file.
    - In the cell below, the path is `data_path`, which `get_project_path` builds from the segments `data/darshil-data/retail-data/by-day/2010-12-01.csv`.
    - Spark reads this file and returns a **DataFrame**.
6. **`df`**
    - The DataFrame that holds your data.
    - Think of it like a **distributed table**: each row = record, each column = field.

In [6]:
data_path = get_project_path('data', 'darshil-data', 'retail-data', 'by-day', '2010-12-01.csv')

df = (spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(data_path))


### 🔎 Step 2: Inspecting the DataFrame

👉 View schema:

In [7]:
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



👉 Show sample rows:

In [8]:
df.show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows



👉 Register as a SQL temporary view for queries:

In [9]:
df.createOrReplaceTempView('dfTable')

Now you can query from temp view:

In [10]:
spark.sql("SELECT * FROM dfTable LIMIT 5").show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+



# 📒 Section 2: Literals in Spark

### 🔎 Step 1: What does this code do?

1. **`lit()` function**
    - Comes from `pyspark.sql.functions`.
    - It creates a **literal column** in Spark (i.e., a constant value treated as a column).
2. **Why is it needed?**
    - Spark DataFrames operate on **columns** (not raw Python values).
    - If you want to add a constant value (e.g., `5`) to every row, you need to wrap it inside `lit()`.
3. **In this example:**
    - `lit(5)` → creates a constant integer column with value `5`.
    - `lit("five")` → creates a constant string column `"five"`.
    - `lit(5.0)` → creates a constant double (floating-point) column `5.0`.

In [11]:
from pyspark.sql.functions import lit

df.select(lit(5), lit("five"), lit(5.0)).show(5)


+---+----+---+
|  5|five|5.0|
+---+----+---+
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
|  5|five|5.0|
+---+----+---+
only showing top 5 rows



### 🔎 Step 2: Example outputs

In [12]:
df.select(lit(5).alias("int_col"),
          lit("five").alias("string_col"),
          lit(5.0).alias("float_col")).show(5)

+-------+----------+---------+
|int_col|string_col|float_col|
+-------+----------+---------+
|      5|      five|      5.0|
|      5|      five|      5.0|
|      5|      five|      5.0|
|      5|      five|      5.0|
|      5|      five|      5.0|
+-------+----------+---------+
only showing top 5 rows



**Notice**: all rows have the same constant values.

**These are now Spark columns (not Python variables), so you can use them in transformations.**

### 🔎 Step 3: Why is this important?

- Literals let you:
    - Add **constant columns** for tagging data (`lit("2025")` for year, etc.).
    - Use constants inside expressions (`col("Quantity") + lit(10)`).
    - Make DataFrames behave more like SQL (where constants are allowed in queries).

In [13]:
df.withColumn("adjustedQuantity", df.Quantity + lit(10)).show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+----------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|adjustedQuantity|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+----------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|              16|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|              16|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|              18|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|              16|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|              16|


✅ In simple words:<br>
`lit()` lets you insert fixed values into a DataFrame as if they were columns. It’s the bridge between normal Python constants and Spark DataFrame columns.

# 📒 Section 3: Working with Booleans

Boolean expressions (`True` / `False`) are the foundation of filtering in Spark.
We use them to decide which rows to keep or discard.

### 🔎 Step 1: Filtering with `col`

We can build conditions using `col` from `pyspark.sql.functions`.

In [14]:
from pyspark.sql.functions import col

df.where(col("InvoiceNo") != 536365) \
  .select("InvoiceNo", "Description") \
  .show(5, False)

+---------+-----------------------------+
|InvoiceNo|Description                  |
+---------+-----------------------------+
|536366   |HAND WARMER UNION JACK       |
|536366   |HAND WARMER RED POLKA DOT    |
|536367   |ASSORTED COLOUR BIRD ORNAMENT|
|536367   |POPPY'S PLAYHOUSE BEDROOM    |
|536367   |POPPY'S PLAYHOUSE KITCHEN    |
+---------+-----------------------------+
only showing top 5 rows



👉 Explanation:

- `col("InvoiceNo") != 536365` → creates a Boolean expression (`True`/`False`) for each row.
- `where(...)` (or `.filter(...)`) keeps only rows where the condition is `True`.

### 🔎 Step 2: Filtering with SQL-style strings

Instead of using `col`, you can directly write the condition as a string expression (like SQL).

In [15]:
df.where("InvoiceNo = 536365").show(5, False)

df.where("InvoiceNo <> 536365").show(5, False)

+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                        |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER |6       |2010-12-01 08:26:00|2.55     |17850.0   |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN                |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84406B   |CREAM CUPID HEARTS COAT HANGER     |8       |2010-12-01 08:26:00|2.75     |17850.0   |United Kingdom|
|536365   |84029G   |KNITTED UNION FLAG HOT WATER BOTTLE|6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84029E   |RED WOOLLY HOTTIE WHITE HEART.     |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
+---------+-----

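### 🔎 Step 3: Combining Boolean Expressions

We can combine conditions with AND / OR logic.

⚠️ In Spark, you can’t use plain Python `and` / `or` / `not` on `col` expressions. Instead, use `&` (and), `|` (or), and `~` (not), and wrap each comparison in parentheses, because these operators bind more tightly than `>` or `==`.

Chaining `.where()` calls is often the **cleanest way** to express AND, since each call narrows the rows further. OR conditions still need `|` inside a single expression.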

In [16]:
from pyspark.sql.functions import instr

priceFilter = col("UnitPrice") > 600
descripFilter = instr(df.Description, "POSTAGE") >= 1

df.where(df.StockCode.isin("DOT")) \
  .where(priceFilter | descripFilter) \
  .show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      NULL|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      NULL|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+



👉 Explanation:

- `isin("DOT")` → checks membership (like SQL `IN`).
- `instr(df.Description, "POSTAGE") >= 1` → checks whether the substring “POSTAGE” occurs in the `Description` column. `instr` returns the 1-based position of the first match, or `0` when there is none.

Equivalent SQL:

```sql
SELECT *
FROM dfTable
WHERE StockCode IN ('DOT')
  AND (UnitPrice > 600 OR instr(Description, 'POSTAGE') >= 1);

```
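
The restriction on `and` / `or` has a simple cause: Python evaluates `a and b` by asking each operand for a single truth value, and a column expression stands for a whole column of rows, so it has none. PySpark's `Column` raises an error when converted to `bool` and overloads `&`, `|`, and `~` instead. The toy class below imitates that behaviour in plain Python; `Expr` is a hypothetical stand-in for illustration, not a PySpark class.

```python
class Expr:
    """Toy stand-in for a column expression: it only records the expression text."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):
        return Expr(f"(NOT {self.text})")

    def __bool__(self):
        # A per-row expression has no single True/False value.
        raise ValueError("use '&', '|' or '~' to combine expressions")


price_filter = Expr("UnitPrice > 600")
descrip_filter = Expr("instr(Description, 'POSTAGE') >= 1")
is_dot = Expr("StockCode = 'DOT'")

# Operator overloading builds a combined expression instead of evaluating anything.
combined = is_dot & (price_filter | descrip_filter)
print(combined.text)

try:
    price_filter and descrip_filter  # plain `and` calls bool() on the left operand
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)
```

With real columns, remember the parentheses: `(col("UnitPrice") > 600) | (col("Quantity") > 5)`. Without them, `|` is applied before `>`.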

### 🔎 Step 4: Why is this important?

- Booleans let you **filter data** → one of the most common tasks.
- You can combine multiple conditions to replicate complex SQL `WHERE` clauses.
- Spark's optimizer can apply these filters early, in some cases at the data source itself, so fewer rows have to be read and moved across the cluster.

---

✅ **In simple words**:

Boolean expressions (`==`, `!=`, `>`, `isin`, `instr`) let you build filters to pick only the rows you care about. Think of them as Spark’s version of SQL `WHERE` clauses.

# 📒 Section 4: Working with Numbers

When analyzing data, numerical operations are everywhere: sums, multiplications, powers, rounding, etc.
In Spark, you can directly use columns in arithmetic expressions just like variables in math.

### 🔎 Step 1: Arithmetic Expressions

Let’s say we want to create a new quantity:

$$\text{realQuantity} = (\text{Quantity} \times \text{UnitPrice})^2 + 5$$

We can express this using both **Pythonic expressions** and **SQL-style expressions**.

In [17]:
from pyspark.sql.functions import col, expr, pow

# Pythonic way
fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5
df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2)

# SQL-style expression
df.selectExpr("CustomerId", "(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity").show(2)


+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows
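
As a sanity check, the two displayed values can be reproduced by hand from the first two sample rows shown earlier (Quantity 6 at UnitPrice 2.55, and Quantity 6 at UnitPrice 3.39). This is a plain-Python sketch of the same formula; `real_quantity` is a hypothetical helper name.

```python
import math


def real_quantity(quantity, unit_price):
    """(Quantity * UnitPrice)^2 + 5, the formula used in the cell above."""
    return math.pow(quantity * unit_price, 2) + 5


row1 = real_quantity(6, 2.55)  # first sample row
row2 = real_quantity(6, 3.39)  # second sample row
print(row1, row2)
```

Trailing digits such as `239.08999999999997` come from binary floating-point arithmetic, not from Spark; the exact decimal result is 239.09.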



### 🔎 Step 2: Rounding

Sometimes you need to **round numbers** for reporting.

In [18]:
from pyspark.sql.functions import lit, round, bround

# Pass a numeric literal; a string such as "2.5" only works through an implicit cast
df.select(round(lit(2.5)), bround(lit(2.5))).show(2)


+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|          3.0|           2.0|
|          3.0|           2.0|
+-------------+--------------+
only showing top 2 rows



👉 Explanation:

- `round()` → rounds halves up (2.5 → 3).
- `bround()` → **banker’s rounding**: halves go to the nearest even number (2.5 → 2, 3.5 → 4).

This matters in financial datasets where precise rounding rules apply.
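
The two tie-breaking rules can be compared outside Spark with Python's `decimal` module. This sketch only mirrors the half-up and half-even rules described above; the helper names are hypothetical and it does not call Spark.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_up(value):
    """Ties go away from zero, as in the `round` output above (2.5 -> 3)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


def round_half_even(value):
    """Ties go to the nearest even number, as in the `bround` output above (2.5 -> 2)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)


ties = ["0.5", "1.5", "2.5", "3.5"]
half_up = [int(round_half_up(v)) for v in ties]
half_even = [int(round_half_even(v)) for v in ties]
print(half_up)    # [1, 2, 3, 4]
print(half_even)  # [0, 2, 2, 4]
```

The half-up results sum to 10 while the inputs sum to 8, which is the upward bias that banker's rounding avoids. Note that Python's built-in `round(2.5)` returns `2`, so it behaves like `bround`, not like Spark's `round`.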

### 🔎 Step 3: Why is this important?

- Arithmetic with columns lets you **derive new metrics** (like revenue = price × quantity).
- `pow` and `expr` allow complex formulas.
- Rounding gives **clean reporting**: it hides floating-point noise such as `239.08999999999997` in displayed values, although it does not remove the underlying imprecision.

---

✅ **In simple words**:

Spark treats numeric columns like math variables. You can multiply, add, take powers, and round them, just like in Excel or SQL, but at **distributed scale**.

# 📒 Section 5: Working with Strings

String manipulation is super common in data analysis:

- Cleaning messy text
- Formatting case (upper/lower)
- Padding or trimming spaces
- Replacing characters
- Checking if a substring exists

Spark provides many built-in functions for string handling.

### 🔎 Step 1: Capitalization

Convert strings into **Title Case**: the first letter of each whitespace-separated word is capitalized and the rest is lowercased.

In [19]:
from pyspark.sql.functions import initcap, col

df.select(initcap(col("Description"))).show(5, False)


+-----------------------------------+
|initcap(Description)               |
+-----------------------------------+
|White Hanging Heart T-light Holder |
|White Metal Lantern                |
|Cream Cupid Hearts Coat Hanger     |
|Knitted Union Flag Hot Water Bottle|
|Red Woolly Hottie White Heart.     |
+-----------------------------------+
only showing top 5 rows
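
Note `T-light` in the first row of the output: only the letter after a space is capitalized, not the letter after a hyphen. The plain-Python sketch below imitates that rule for space-separated words; `initcap_like` is a hypothetical helper, and Python's own `str.title()` is shown for contrast because it treats the hyphen as a word boundary.

```python
def initcap_like(text):
    """Capitalize the first letter of each space-separated word and lowercase the rest."""
    return " ".join(word.capitalize() for word in text.split(" "))


sample = "WHITE HANGING HEART T-LIGHT HOLDER"
print(initcap_like(sample))  # White Hanging Heart T-light Holder
print(sample.title())        # White Hanging Heart T-Light Holder
```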



### 🔎 Step 2: Uppercase & Lowercase

Convert strings fully to **upper** or **lower** case.

In [20]:
from pyspark.sql.functions import lower, upper

df.select(
    col("Description"),
    lower(col("Description")).alias("lowercase"),
    upper(lower(col("Description"))).alias("upper_after_lower")
).show(5, False)


+-----------------------------------+-----------------------------------+-----------------------------------+
|Description                        |lowercase                          |upper_after_lower                  |
+-----------------------------------+-----------------------------------+-----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER |white hanging heart t-light holder |WHITE HANGING HEART T-LIGHT HOLDER |
|WHITE METAL LANTERN                |white metal lantern                |WHITE METAL LANTERN                |
|CREAM CUPID HEARTS COAT HANGER     |cream cupid hearts coat hanger     |CREAM CUPID HEARTS COAT HANGER     |
|KNITTED UNION FLAG HOT WATER BOTTLE|knitted union flag hot water bottle|KNITTED UNION FLAG HOT WATER BOTTLE|
|RED WOOLLY HOTTIE WHITE HEART.     |red woolly hottie white heart.     |RED WOOLLY HOTTIE WHITE HEART.     |
+-----------------------------------+-----------------------------------+-----------------------------------+
only showi

### 🔎 Step 3: Trimming & Padding

Remove spaces, or add padding to strings.

In [21]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim

df.select(
    ltrim(lit("   HELLO   ")).alias("ltrim"),
    rtrim(lit("   HELLO   ")).alias("rtrim"),
    trim(lit("   HELLO   ")).alias("trim"),
    lpad(lit("HELLO"), 10, "-").alias("lpad"),
    rpad(lit("HELLO"), 10, ".").alias("rpad")
).show(2, False)


+--------+--------+-----+----------+----------+
|ltrim   |rtrim   |trim |lpad      |rpad      |
+--------+--------+-----+----------+----------+
|HELLO   |   HELLO|HELLO|-----HELLO|HELLO.....|
|HELLO   |   HELLO|HELLO|-----HELLO|HELLO.....|
+--------+--------+-----+----------+----------+
only showing top 2 rows



👉 Explanation:

- `ltrim` → removes spaces from left.
- `rtrim` → removes spaces from right.
- `trim` → removes spaces from both sides.
- `lpad("HELLO", 10, "-")` → `"-----HELLO"`.
- `rpad("HELLO", 10, ".")` → `"HELLO....."`.
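
The same results can be reproduced with Python string methods. This is a sketch for single-character padding only; `lpad_like` and `rpad_like` are hypothetical helpers. The final slice models the documented Spark behaviour that a string longer than the target length is shortened to that length.

```python
def lpad_like(text, length, pad):
    """Left-pad `text` with the single character `pad`, then cut to `length` characters."""
    return text.rjust(length, pad)[:length]


def rpad_like(text, length, pad):
    """Right-pad `text` with the single character `pad`, then cut to `length` characters."""
    return text.ljust(length, pad)[:length]


raw = "   HELLO   "
print(repr(raw.lstrip(" ")))        # 'HELLO   '
print(repr(raw.rstrip(" ")))        # '   HELLO'
print(repr(raw.strip(" ")))         # 'HELLO'
print(lpad_like("HELLO", 10, "-"))  # -----HELLO
print(rpad_like("HELLO", 10, "."))  # HELLO.....
print(lpad_like("HELLO", 3, "-"))   # HEL
```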

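### 🔎 Step 4: Replace/Translate Characters

Replace multiple characters at once: `translate` swaps each character of the second argument for the character at the same position in the third argument.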

In [22]:
from pyspark.sql.functions import translate

df.select(
    translate(col("Description"), "LEET", "1337").alias("leet_text"),
    col("Description")
).show(5, False)


+-----------------------------------+-----------------------------------+
|leet_text                          |Description                        |
+-----------------------------------+-----------------------------------+
|WHI73 HANGING H3AR7 7-1IGH7 HO1D3R |WHITE HANGING HEART T-LIGHT HOLDER |
|WHI73 M37A1 1AN73RN                |WHITE METAL LANTERN                |
|CR3AM CUPID H3AR7S COA7 HANG3R     |CREAM CUPID HEARTS COAT HANGER     |
|KNI773D UNION F1AG HO7 WA73R BO7713|KNITTED UNION FLAG HOT WATER BOTTLE|
|R3D WOO11Y HO77I3 WHI73 H3AR7.     |RED WOOLLY HOTTIE WHITE HEART.     |
+-----------------------------------+-----------------------------------+
only showing top 5 rows
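
Python's `str.translate` follows the same character-for-character idea, which makes the mapping easy to inspect. A minimal sketch that reproduces the second row of the output above without Spark:

```python
# Character-for-character mapping: L -> 1, E -> 3, T -> 7.
# "E" appears twice in "LEET" and maps to "3" both times, so there is no conflict.
leet_table = str.maketrans("LEET", "1337")

description = "WHITE METAL LANTERN"
leet_text = description.translate(leet_table)
print(leet_text)  # WHI73 M37A1 1AN73RN
```

Characters that are not listed (here `W`, `H`, `I`, and so on) pass through unchanged.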



✅ In simple words:
String functions let you clean and format text (trimming spaces, changing case, replacing values), which is crucial for making messy data usable.

# 📒 Section 6: Working with Dates & Timestamps

Dates and times are tricky in programming. Spark provides built-in functions to:

- Get current dates and timestamps
- Add or subtract days
- Find differences between dates
- Convert strings into dates with custom formats

### 🔎 Step 1: Create a DataFrame with Date & Timestamp

👉 Explanation:

- `current_date()` → current system date (no time).

- `current_timestamp()` → current system date + time.

In [23]:
from pyspark.sql.functions import current_date, current_timestamp

dateDF = spark.range(5) \
    .withColumn("today", current_date()) \
    .withColumn("now", current_timestamp())

dateDF.printSchema()
dateDF.show(truncate=False)

root
 |-- id: long (nullable = false)
 |-- today: date (nullable = false)
 |-- now: timestamp (nullable = false)

+---+----------+--------------------------+
|id |today     |now                       |
+---+----------+--------------------------+
|0  |2025-09-12|2025-09-12 21:17:22.185137|
|1  |2025-09-12|2025-09-12 21:17:22.185137|
|2  |2025-09-12|2025-09-12 21:17:22.185137|
|3  |2025-09-12|2025-09-12 21:17:22.185137|
|4  |2025-09-12|2025-09-12 21:17:22.185137|
+---+----------+--------------------------+



### 🔎 Step 2: Add & Subtract Days

👉 Explanation:

- `date_sub(..., 5)` → subtracts 5 days.

- `date_add(..., 5)` → adds 5 days.

In [24]:
from pyspark.sql.functions import date_add, date_sub, col

dateDF.select(
    date_sub(col("today"), 5).alias("5_days_ago"),
    date_add(col("today"), 5).alias("5_days_later")
).show(1)


+----------+------------+
|5_days_ago|5_days_later|
+----------+------------+
|2025-09-07|  2025-09-17|
+----------+------------+
only showing top 1 row



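### 🔎 Step 3: Date Differences

👉 Explanation:

- `datediff(end, start)` → number of days from `start` to `end`. The cell below passes the earlier date first, so the result is negative.

- `months_between(date1, date2)` → fractional number of months from `date2` to `date1`.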

In [25]:
from pyspark.sql.functions import datediff, months_between

dateDF.withColumn("week_ago", date_sub(col("today"), 7)) \
    .select(datediff(col("week_ago"), col("today")).alias("days_diff")) \
    .show(1)

dateDF.select(
    months_between(lit("2017-05-22"), lit("2016-01-01")).alias("months_diff")
).show(1)


+---------+
|days_diff|
+---------+
|       -7|
+---------+
only showing top 1 row

+-----------+
|months_diff|
+-----------+
|16.67741935|
+-----------+
only showing top 1 row
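
Both results can be checked by hand. The sketch below reproduces them with Python's `datetime`, under the assumption that the fractional part of `months_between` is the day difference divided by 31 and that the result is rounded to 8 decimal places. It handles plain dates only and ignores the special case where both dates are the last day of their month; the helper names are hypothetical.

```python
from datetime import date, timedelta


def datediff_like(end, start):
    """Whole days from `start` to `end`; negative when `end` is the earlier date."""
    return (end - start).days


def months_between_like(date1, date2):
    """Months from `date2` to `date1`, counting leftover days as fractions of 31."""
    whole_months = (date1.year - date2.year) * 12 + (date1.month - date2.month)
    if date1.day == date2.day:
        return float(whole_months)
    return round(whole_months + (date1.day - date2.day) / 31, 8)


today = date(2025, 9, 12)
week_ago = today - timedelta(days=7)
days_diff = datediff_like(week_ago, today)
months_diff = months_between_like(date(2017, 5, 22), date(2016, 1, 1))
print(days_diff)    # -7
print(months_diff)  # 16.67741935
```

For the example dates this is 16 whole months plus 21/31 of a month, which matches the `16.67741935` shown in the output.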



### 🔎 Step 4: Converting Strings to Dates

👉 **Note**: If Spark cannot parse the string, `to_date` returns `null` (with ANSI mode enabled it raises an error instead).

In [26]:
from pyspark.sql.functions import to_date, lit

# Default format: yyyy-MM-dd
spark.range(1).withColumn("date_str", lit("2017-01-01")) \
    .select(to_date(col("date_str")).alias("parsed_date")) \
    .show()


+-----------+
|parsed_date|
+-----------+
| 2017-01-01|
+-----------+



### 🔎 Step 5: Handling Custom Formats

👉 Explanation:

- Format `"yyyy-dd-MM"` expects `year-day-month`.
- `"2017-12-11"` parses fine as day 12, month 11 (12 November 2017), which is why the output shows `2017-11-12`.
- `"2017-20-13"` fails because there is no month 13 → returns `null`.

In [29]:
dateFormat = "yyyy-dd-MM"
cleanDateDF = spark.range(1).select(
    to_date(lit("2017-12-11"), dateFormat).alias("valid_date"),
    to_date(lit("2017-20-13"), dateFormat).alias("invalid_date")
)

cleanDateDF.show()

# You can also convert to timestamp:

from pyspark.sql.functions import to_timestamp

cleanDateDF.select(to_timestamp(col("valid_date"), dateFormat)).show()


+----------+------------+
|valid_date|invalid_date|
+----------+------------+
|2017-11-12|        NULL|
+----------+------------+

+------------------------------------+
|to_timestamp(valid_date, yyyy-dd-MM)|
+------------------------------------+
|                 2017-11-12 00:00:00|
+------------------------------------+
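
The day/month swap is easy to misread, so here is the same parse in plain Python. Spark's pattern `yyyy-dd-MM` corresponds to `%Y-%d-%m` in `strptime`; the helper `parse_or_none` is hypothetical and returns `None` where the output above shows `NULL`.

```python
from datetime import datetime


def parse_or_none(text, fmt="%Y-%d-%m"):
    """Parse `text` as year-day-month; return None instead of raising on bad input."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


valid = parse_or_none("2017-12-11")    # day 12, month 11
invalid = parse_or_none("2017-20-13")  # day 20, month 13 does not exist
print(valid, invalid)  # 2017-11-12 None
```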



### 🔎 Step 6: Why is this important?

- Dates are **central in business data** (sales dates, order deadlines, retention analysis).
- Spark gives flexibility to **parse different formats**, calculate differences, and adjust for time.
- Knowing how Spark silently returns `null` on bad formats helps avoid hidden data issues.

---

✅ **In simple words**:

Spark makes it easy to work with dates and timestamps: you can get today’s date, add/subtract days, calculate differences, and convert string dates into real date objects (with custom formats when needed).

# 📒 Section 7: Working with Nulls

In real-world datasets, missing values are very common.

Spark represents missing data as **`null`** (not `NaN` or empty string).

Spark exposes null-handling helpers through the **`.na` attribute** of a DataFrame:

- `drop()` → remove rows with nulls
- `fill()` → replace nulls with values
- `replace()` → swap specific values for others, such as placeholder strings

### 🔎 Step 1: Dropping Nulls

Remove rows that contain null values.

👉 Explanation:

- `"any"` → drop row if **any column** is null.
- `"all"` → drop row only if **all columns** are null.

In [31]:
# Drop rows with any null values
df.na.drop().show(5)

# Equivalent (default is "any")
df.na.drop("any").show(5)

# Drop only rows where all values are null
df.na.drop("all").show(5)


+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows

+--
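
The difference between `"any"` and `"all"` is easiest to see on a tiny example. The plain-Python sketch below models rows as dictionaries with `None` for null; the third row and the helper `drop_nulls` are hypothetical.

```python
rows = [
    {"InvoiceNo": "536365", "Description": "WHITE METAL LANTERN", "CustomerID": 17850.0},
    {"InvoiceNo": "536544", "Description": "DOTCOM POSTAGE", "CustomerID": None},
    {"InvoiceNo": None, "Description": None, "CustomerID": None},  # hypothetical empty row
]


def drop_nulls(records, how="any"):
    """Drop a row if any ("any") or all ("all") of its values are None."""
    if how == "any":
        return [r for r in records if all(v is not None for v in r.values())]
    if how == "all":
        return [r for r in records if any(v is not None for v in r.values())]
    raise ValueError("how must be 'any' or 'all'")


kept_any = drop_nulls(rows, "any")
kept_all = drop_nulls(rows, "all")
print(len(kept_any))  # 1
print(len(kept_all))  # 2
```

`"any"` is the stricter setting: it also removes the `DOTCOM POSTAGE` row, whose only missing value is `CustomerID`.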

### 🔎 Step 2: Filling Nulls

Replace nulls with specified values.

👉 Explanation:

- A single string fill value applies only to string columns.
- A single numeric fill value applies only to numeric columns.

In [34]:
# Fill all nulls with a single string
df.na.fill("All Null values become this string").show(5)

# Fill with different values for specific columns
fill_cols_vals = {"StockCode": 5, "Description": "No Value"}
df.na.fill(fill_cols_vals).show(5)


+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows

+--
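
The per-column form can be pictured as "use the default only where the value is missing". A plain-Python sketch of that idea; the sample rows and the helper `fill_nulls` are hypothetical and do not model Spark's type handling.

```python
def fill_nulls(records, defaults):
    """Replace None with the per-column default; columns without a default are left alone."""
    return [
        {col: (defaults[col] if val is None and col in defaults else val)
         for col, val in row.items()}
        for row in records
    ]


rows = [
    {"StockCode": "DOT", "Description": None, "CustomerID": None},
    {"StockCode": "71053", "Description": "WHITE METAL LANTERN", "CustomerID": 17850.0},
]
filled = fill_nulls(rows, {"Description": "No Value"})
print(filled[0])
```

Only `Description` receives a default here, so the missing `CustomerID` in the first row stays `None`.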

### 🔎 Step 3: Replacing Values

`replace()` targets specific values rather than nulls, for example placeholder strings such as `?` or `NA`.

👉 Explanation:

- All occurrences of `"?"` and `"NA"` in string columns are replaced with `"Unknown"`.

In [35]:
df.na.replace(["?", "NA"], "Unknown").show(5)


+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows



### 🔎 Step 4: Why is this important?

- **Nulls affect calculations** (e.g., averages, joins, filters).
- Cleaning nulls ensures consistent results.
- `.na.drop()` and `.na.fill()` are ordinary DataFrame transformations, so they run distributed like any other Spark operation.

---

✅ **In simple words**:

Null handling in Spark is done with `.na`. You can **drop rows, fill them with defaults, or replace specific values**. Always handle nulls before analysis to avoid incorrect results.

# 📒 Section 8: Working with Complex Types

Spark supports **complex data types** that let you organize nested or multi-valued data:

- **Structs** → like a nested row (columns inside a column)
- **Arrays** → ordered lists of values
- (Later we’ll also touch JSON, which combines these ideas)

## 🔹 8.1 Structs

A **struct** is like a row nested inside a column: it groups multiple named fields into one column.

In [36]:
from pyspark.sql.functions import struct, col

# Create a struct column
complexDF = df.select(struct("Description", "InvoiceNo").alias("complex"))
complexDF.show(2, truncate=False)


+--------------------------------------------+
|complex                                     |
+--------------------------------------------+
|{WHITE HANGING HEART T-LIGHT HOLDER, 536365}|
|{WHITE METAL LANTERN, 536365}               |
+--------------------------------------------+
only showing top 2 rows



**Accessing struct fields**

In [37]:
# Dot syntax
complexDF.select("complex.Description").show(2)

# Using getField
complexDF.select(col("complex").getField("InvoiceNo")).show(2)

# Expanding all fields
complexDF.select("complex.*").show(2)


+--------------------+
|         Description|
+--------------------+
|WHITE HANGING HEA...|
| WHITE METAL LANTERN|
+--------------------+
only showing top 2 rows

+-----------------+
|complex.InvoiceNo|
+-----------------+
|           536365|
|           536365|
+-----------------+
only showing top 2 rows

+--------------------+---------+
|         Description|InvoiceNo|
+--------------------+---------+
|WHITE HANGING HEA...|   536365|
| WHITE METAL LANTERN|   536365|
+--------------------+---------+
only showing top 2 rows



## 🔹 8.2 Arrays

An **array** is a list of values stored in a column.

**Splitting strings into arrays**

In [38]:
from pyspark.sql.functions import split

df.select(split(col("Description"), " ").alias("array_col")).show(2, False)


+----------------------------------------+
|array_col                               |
+----------------------------------------+
|[WHITE, HANGING, HEART, T-LIGHT, HOLDER]|
|[WHITE, METAL, LANTERN]                 |
+----------------------------------------+
only showing top 2 rows



**Accessing array elements**

In [39]:
df.select(split(col("Description"), " ").alias("array_col")) \
  .selectExpr("array_col[0]").show(2)


+------------+
|array_col[0]|
+------------+
|       WHITE|
|       WHITE|
+------------+
only showing top 2 rows



**Array length**

In [40]:
from pyspark.sql.functions import size

df.select(size(split(col("Description"), " ")).alias("array_size")).show(2)


+----------+
|array_size|
+----------+
|         5|
|         3|
+----------+
only showing top 2 rows



**Checking if an array contains a value**

👉 Returns `true`/`false` depending on whether "WHITE" exists in the array.

In [41]:
from pyspark.sql.functions import array_contains

df.select(array_contains(split(col("Description"), " "), "WHITE")).show(2)


+------------------------------------------------+
|array_contains(split(Description,  , -1), WHITE)|
+------------------------------------------------+
|                                            true|
|                                            true|
+------------------------------------------------+
only showing top 2 rows
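
To make the four array operations above concrete without a Spark session, here is a plain-Python analogue applied to the two sample descriptions from the output. It is only a sketch of the semantics, not how Spark evaluates these functions. The variable names are illustrative. One assumption to keep in mind: Python's `str.split(" ")` splits on a literal space, whereas Spark's `split` takes a regular-expression pattern; the two agree for a single space.

```python
# Plain-Python analogue of split / [0] / size / array_contains (illustration only).
descriptions = [
    "WHITE HANGING HEART T-LIGHT HOLDER",
    "WHITE METAL LANTERN",
]

# split(col("Description"), " ")      -> one list of words per row
array_col = [d.split(" ") for d in descriptions]

# array_col[0]                        -> first element of each list (0-based index)
first_words = [words[0] for words in array_col]

# size(array_col)                     -> number of elements in each list
array_sizes = [len(words) for words in array_col]

# array_contains(array_col, "WHITE")  -> membership test per row
has_white = ["WHITE" in words for words in array_col]

print(first_words)
print(array_sizes)
print(has_white)
```

The three printed lists line up with the `array_col[0]`, `array_size`, and `array_contains` outputs shown above.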



## 🔹 8.3 Exploding Arrays

`explode()` turns **one row with an array** into **multiple rows (one per element)**.

👉 Example: `"WHITE METAL LANTERN"` → becomes 3 rows: `"WHITE"`, `"METAL"`, `"LANTERN"`

In [43]:
from pyspark.sql.functions import explode

df.withColumn("splitted", split(col("Description"), " ")) \
  .withColumn("exploded", explode(col("splitted"))) \
  .select("Description", "InvoiceNo", "exploded").show(5, False)


+----------------------------------+---------+--------+
|Description                       |InvoiceNo|exploded|
+----------------------------------+---------+--------+
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |WHITE   |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |HANGING |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |HEART   |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |T-LIGHT |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |HOLDER  |
+----------------------------------+---------+--------+
only showing top 5 rows
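
As a rough mental model of what `explode` does, the sketch below reproduces the same row expansion in plain Python and then counts words, which is the kind of analysis exploding makes easy. The helper `explode_rows` and the sample `rows` are hypothetical names for illustration, and storing `InvoiceNo` as a string is an assumption. Like Spark's `explode`, the sketch emits no rows for a null or empty array (Spark offers `explode_outer` when those rows should be kept).

```python
from collections import Counter

# Two sample rows taken from the output above (InvoiceNo type is assumed).
rows = [
    {"Description": "WHITE HANGING HEART T-LIGHT HOLDER", "InvoiceNo": "536365"},
    {"Description": "WHITE METAL LANTERN", "InvoiceNo": "536365"},
]

def explode_rows(rows, array_key, out_key):
    """Yield one output row per array element, repeating the other columns."""
    for row in rows:
        for element in row[array_key] or []:  # None or [] produces no rows
            exploded = {k: v for k, v in row.items() if k != array_key}
            exploded[out_key] = element
            yield exploded

# Step 1: add the array column (the withColumn("splitted", ...) step).
with_arrays = [dict(r, splitted=r["Description"].split(" ")) for r in rows]

# Step 2: expand each array into one row per element.
exploded = list(explode_rows(with_arrays, "splitted", "exploded"))
for row in exploded[:5]:
    print(row)

# Once every word has its own row, counting is a simple aggregation.
word_counts = Counter(row["exploded"] for row in exploded)
print(word_counts.most_common(1))
```

The first five printed rows correspond to the five rows in the Spark output above: the original columns are repeated and only `exploded` changes.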



### 🔎 Why is this important?

- **Structs** → Useful for grouping related columns together (like address: {city, state, zip}).
- **Arrays** → Useful for handling multi-valued fields (like tags, categories, words).
- **Explode** → Lets you normalize arrays into rows for analysis.

---

✅ **In simple words**:

Structs are “columns within a column”, arrays are “lists inside a column”, and explode helps turn arrays into rows so you can analyze them easily.

# 📒 Section 9: Working with JSON

JSON is a very common data format in real-world datasets (APIs, logs, configs).

Spark has **built-in functions** to parse JSON strings and extract values from them.

### 🔎 Step 1: Create a JSON Column

Let’s make a small DataFrame containing a JSON string.

In [44]:
jsonDF = spark.range(1).selectExpr(
    """'{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}' as jsonString"""
)

jsonDF.show(truncate=False)


+-------------------------------------------+
|jsonString                                 |
+-------------------------------------------+
|{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}|
+-------------------------------------------+



### 🔎 Step 2: Extract Nested JSON Values with `get_json_object`

`get_json_object` lets you query JSON using **JSONPath syntax** (`$.path.to.value`).

👉 Explanation:

- `$.myJSONKey.myJSONValue[1]` → gets the **2nd element** (index 1) of the array `[1,2,3]`.
- Returns: `2` (as a string; `get_json_object` returns a string column, or null when the path does not match).

In [45]:
from pyspark.sql.functions import get_json_object, col

jsonDF.select(
    get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]").alias("second_element")
).show()


+--------------+
|second_element|
+--------------+
|             2|
+--------------+
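
To see how a path such as `$.myJSONKey.myJSONValue[1]` is walked step by step, here is a minimal plain-Python sketch. The helper `resolve_path` is hypothetical and supports only a simplified subset of the syntax (`.key` and `[index]` steps); it is not Spark's implementation. It also returns a Python value, whereas `get_json_object` returns a string.

```python
import json
import re

json_string = '{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}'

def resolve_path(json_string, path):
    """Resolve a simplified path: '$' followed by '.key' and '[index]' steps."""
    try:
        current = json.loads(json_string)
    except json.JSONDecodeError:
        return None  # invalid JSON -> no value
    if not path.startswith("$"):
        return None
    # Each match is a (key, index) pair; exactly one of the two is non-empty.
    for key, index in re.findall(r"\.([^.\[\]]+)|\[(\d+)\]", path[1:]):
        try:
            current = current[key] if key else current[int(index)]
        except (KeyError, IndexError, TypeError):
            return None  # path does not match the document
    return current

second_element = resolve_path(json_string, "$.myJSONKey.myJSONValue[1]")
print(second_element)

# A path that does not exist yields None (Spark would show null).
print(resolve_path(json_string, "$.myJSONKey.missing"))
```

Each step narrows the document: first the object under `myJSONKey`, then the array under `myJSONValue`, then the element at index 1.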



### 🔎 Step 3: Extract Top-Level Fields with `json_tuple`

If JSON has a **single-level structure**, `json_tuple` is more convenient.

👉 Explanation:

- `json_tuple` extracts the field `"myJSONKey"` from the JSON string.
- Output is still a JSON object string: `{"myJSONValue":[1,2,3]}`

In [46]:
from pyspark.sql.functions import json_tuple

jsonDF.select(
    json_tuple(col("jsonString"), "myJSONKey").alias("parsed_key")
).show(truncate=False)


+-----------------------+
|parsed_key             |
+-----------------------+
|{"myJSONValue":[1,2,3]}|
+-----------------------+
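
The behaviour described above (top-level lookup only, nested values returned as JSON text) can be mimicked in plain Python. This is a sketch under the assumption that the input is a JSON object; `top_level_fields` is a hypothetical helper, not a Spark function.

```python
import json

json_string = '{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}'

def top_level_fields(json_string, *fields):
    """Return each requested top-level field as text, or None if it is absent."""
    parsed = json.loads(json_string)
    values = []
    for field in fields:
        value = parsed.get(field)
        if value is None:
            values.append(None)
        elif isinstance(value, str):
            values.append(value)
        else:
            # Nested objects and arrays come back as compact JSON text.
            values.append(json.dumps(value, separators=(",", ":")))
    return tuple(values)

(parsed_key,) = top_level_fields(json_string, "myJSONKey")
print(parsed_key)

# The result is still JSON text, so it can be parsed again to go one level deeper.
print(json.loads(parsed_key)["myJSONValue"])
```

The first printed line has the same shape as the `parsed_key` column above; reaching `myJSONValue` takes a second parsing step, which is why `get_json_object` is the better fit for deeply nested values.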



### 🔎 Step 4: Why is this important?

- JSON is everywhere (web APIs, IoT logs, nested event data).
- With `get_json_object`, you can extract **deeply nested values** without converting the whole JSON into columns.
- With `json_tuple`, you can quickly grab **top-level fields**.

---

✅ **In simple words**:

Spark lets you query JSON strings inside columns. Use `get_json_object` for nested fields, and `json_tuple` for flat JSON.

# 📒 Section 9 (Bonus): Reading JSON Files into DataFrames

Instead of working with JSON strings inside a column, Spark can **directly read JSON files** into a DataFrame.

This is very common in pipelines where logs or data dumps are stored in JSON format.

### 🔎 Step 1: Reading a JSON File

👉 Explanation:

- `format("json")` tells Spark the file is JSON.
- Spark automatically **infers the schema** by scanning the JSON keys and the types of their values.
- Each JSON object = one row in the DataFrame. By default Spark expects line-delimited JSON, with one object per line.

In [49]:
# Load JSON file
json_path = get_project_path('data', 'darshil-data', 'flight-data', 'json', '2015-summary.json')
df_json = spark.read.format("json").load(json_path)

# Inspect schema and data
df_json.printSchema()
df_json.show(5)


root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows
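
The two ideas behind this read ("one JSON object per row" and "schema inferred from the keys") can be sketched in plain Python. This assumes the file is line-delimited JSON; the `json_lines` text below is reconstructed from the first three rows of the output above rather than copied from the file, and `infer_schema` is a hypothetical helper that reports Python type names, not Spark types.

```python
import json

# Assumed file content: one JSON object per line (values from the output above).
json_lines = """\
{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15}
{"ORIGIN_COUNTRY_NAME":"Croatia","DEST_COUNTRY_NAME":"United States","count":1}
{"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":344}
"""

# Each non-empty line is one JSON object, which becomes one row.
rows = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

def infer_schema(rows):
    """Collect every key seen and the Python type names of its values."""
    schema = {}
    for row in rows:
        for key, value in row.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    # Sort keys alphabetically for a stable, readable listing.
    return {key: sorted(types) for key, types in sorted(schema.items())}

schema = infer_schema(rows)
for key, types in schema.items():
    print(f"{key}: {', '.join(types)}")
```

Inference has to look at the data itself, which is why reading with an inferred schema costs an extra pass compared with supplying the schema up front.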



### 🔎 Step 2: Querying the JSON DataFrame

Now you can work with it just like any other DataFrame:

In [50]:
from pyspark.sql.functions import col

# Filter
df_json.filter(col("count") > 10).show(5)

# Select specific columns
df_json.select("DEST_COUNTRY_NAME", "count").show(5)

# Aggregate
df_json.groupBy("DEST_COUNTRY_NAME").sum("count").show(5)


+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|            Grenada|   62|
+-----------------+-------------------+-----+
only showing top 5 rows

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|    United States|   15|
|    United States|    1|
|    United States|  344|
|            Egypt|   15|
|    United States|   62|
+-----------------+-----+
only showing top 5 rows

+-----------------+----------+
|DEST_COUNTRY_NAME|sum(count)|
+-----------------+----------+
|         Anguilla|        41|
|           Russia|       176|
|         Paraguay|        60|
|          Senegal|        40|
|           Sweden|       118|
+-----------------+----------+
only showing top 5 rows

### 🔎 Step 3: Why is this important?

- JSON is a **native file format** Spark can parse directly (like CSV, Parquet).
- No need to manually parse strings → Spark maps JSON keys to DataFrame columns.
- Schema inference makes it quick, but you can also define schemas manually for strict control.

---

✅ **In simple words**:

You can either **parse JSON strings inside a DataFrame** using `get_json_object`/`json_tuple`, or **load JSON files directly into a DataFrame** with `spark.read.format("json").load(...)` (or the shorthand `spark.read.json(...)`). Both are essential depending on your data source.