# Add Project Root to `sys.path`

In [111]:
# Automatically reload modules when they change
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Find the project root folder and insert it into `sys.path` so that project modules can be imported from any notebook location

In [112]:
import os
import sys

def find_project_root(start_path=None, markers=(".git", "pyproject.toml", "requirements.txt")):
    """
    Walks up from start_path until it finds one of the marker files/folders.
    Returns the path of the project root.
    """
    if start_path is None:
        start_path = os.getcwd()

    current_path = os.path.abspath(start_path)

    while True:
        # check if any marker exists in current path
        if any(os.path.exists(os.path.join(current_path, marker)) for marker in markers):
            return current_path

        new_path = os.path.dirname(current_path)  # parent folder
        if new_path == current_path:  # reached root of filesystem
            raise FileNotFoundError(f"None of the markers {markers} found above {start_path}")
        current_path = new_path

project_root = find_project_root()
print("Project root:", project_root)

if project_root not in sys.path:
    sys.path.insert(0, project_root)


Project root: c:\ds_analytics_projects\darshil_course\apache-pyspark\darshil-pyspark
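To see the upward walk in isolation, here is a minimal sketch that uses `pathlib` instead of `os.path` and runs against a throwaway folder. The name `find_root` and the `demo_project` layout are made up for this illustration; they are not part of the project.

```python
import tempfile
from pathlib import Path

def find_root(start: Path, markers=(".git", "pyproject.toml", "requirements.txt")) -> Path:
    """Return the first directory at or above `start` that contains a marker."""
    start = start.resolve()
    # Check `start` itself first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"None of the markers {markers} found above {start}")

# Hypothetical layout: <tmp>/demo_project/pyproject.toml plus a nested notebooks folder.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "demo_project"
    nested = project / "notebooks" / "chapter_05"
    nested.mkdir(parents=True)
    (project / "pyproject.toml").touch()

    found = find_root(nested)
    found_name = found.name
    found_matches = (found == project)
    print(found_name, found_matches)
```

Starting two levels below the marker file, the search stops at the first folder that contains a marker, which is the behaviour the notebook relies on.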


# Import Libraries

Import packages

In [113]:
import pandas as pd
import numpy as np

Import a project module (this works because the project root is now on `sys.path`)

In [114]:
from utils.file_utils import get_project_path

Import the PySpark package and create a Spark session

In [115]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("FlightDataExample") \
    .getOrCreate()


# Create Dataframe (Load Data)

🔹 1. Creating a DataFrame by reading data

In [116]:
# Path
flight_data_path_2015 = get_project_path("data", "darshil-data", "flight-data", "json", "2015-summary.json")

df = spark.read.format("json").load(flight_data_path_2015)

- Spark reads the JSON file and infers the schema automatically.

- This is the most common way of creating DataFrames in real projects (from files, tables, databases, etc.).

**Next, register the DataFrame as a temporary SQL view:**

In [117]:
df.createOrReplaceTempView("dfTable")

Now you can query it using SQL:

In [118]:
spark.sql("SELECT DEST_COUNTRY_NAME, count FROM dfTable LIMIT 5").show()

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|    United States|   15|
|    United States|    1|
|    United States|  344|
|            Egypt|   15|
|    United States|   62|
+-----------------+-----+



🔹 2. Creating a DataFrame manually

This is useful when you want full control over schema (columns + datatypes).

*Step 1: Define schema*

In [119]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
    StructField("name", StringType(), True),   # nullable = True
    StructField("city", StringType(), True),
    StructField("age", LongType(), False)    # nullable = False
])


- StructType = the schema (table structure).

- StructField = each column: name, datatype, nullable.

*Step 2: Create a Row*

🔹 1. What is a Row?

- A Row is Spark’s way of representing a single record (like one line in a spreadsheet or a table).

- A DataFrame = collection of Rows + schema.

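- A Row created from positional values (as in the manual example below) does not carry column names → just values in order. Rows that come out of a DataFrame do carry the column names.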

Example from your DataFrame (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count):

In [120]:
df.show(3, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
+-----------------+-------------------+-----+
only showing top 3 rows



Each of these lines is internally a Row object:

*Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15)*

In [121]:
from pyspark.sql import Row

myRow = Row("Vinay", None, 30)

⚠️ Order matters here → the values must match schema order:

- "Vinay" → goes to column "name"

- None → goes to "city"

- 30 → goes to "age"

**Accessing Rows**

By index (positional):

In [122]:
print(myRow[0])   # "Vinay"
print(myRow[2])   # 30

Vinay
30


By attribute name (only if Row was created with named fields):

In [123]:
Person = Row("name", "age")
p1 = Person("Vinay", 28)

print(p1.name)   # "Vinay"
print(p1.age)    # 28

Vinay
28


*Step 3: Create DataFrame*

In [124]:
myDf = spark.createDataFrame([myRow], myManualSchema)
myDf.show()

+-----+----+---+
| name|city|age|
+-----+----+---+
|Vinay|NULL| 30|
+-----+----+---+



🔹 3. Difference between the two approaches

| Method                               | When to use                                                                                                                           |
| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| **Read from file (inferred schema)** | Most common — fast, Spark figures out schema itself.                                                                                  |
| **Manual schema + Rows**             | When: <br>• You’re creating test DataFrames <br>• You want strict schema (avoid inference mistakes) <br>• You’re combining with RDDs. |


✅ Key takeaways:

- Row = record (like a tuple).

- When creating Rows from positional values, order matters: the values carry no column names and are matched to the schema by position.

- Use Row("col1", "col2") style if you want name-based access.

- DataFrames = schema + collection of Rows.

- You can either read data (auto schema) or build manually (defined schema).

- Manual schemas are especially useful in testing & ensuring data types match exactly.

# Columns and Expressions

**Columns in Spark are similar to columns in a spreadsheet or pandas DataFrame. You can select, manipulate, and remove columns from DataFrames and these operations are represented as expressions.**

🔹 1. Using col and column

Both col and column do the same thing — they create a column expression by name.

In [125]:
from pyspark.sql.functions import col, column

# Both are equivalent
df.select(col("count")).show(5)
df.select(column("count")).show(5)

+-----+
|count|
+-----+
|   15|
|    1|
|  344|
|   15|
|   62|
+-----+
only showing top 5 rows

+-----+
|count|
+-----+
|   15|
|    1|
|  344|
|   15|
|   62|
+-----+
only showing top 5 rows



🔹 2. Using expr for expressions

An expression is just a SQL string translated into a column object.

In [126]:
from pyspark.sql.functions import expr

# For a bare column name, expr() is equivalent to col()
df.select(expr("count")).show(5)  
df.select(col("count")).show(5)    # both identical

+-----+
|count|
+-----+
|   15|
|    1|
|  344|
|   15|
|   62|
+-----+
only showing top 5 rows

+-----+
|count|
+-----+
|   15|
|    1|
|  344|
|   15|
|   62|
+-----+
only showing top 5 rows



But expr becomes powerful when chaining logic:

In [127]:
# Arithmetic expression
df.select(expr("count + 10 as count_plus_10")).show(5)

# Logical expression
df.filter(expr("count > 500 AND DEST_COUNTRY_NAME = 'United States'")).show(5)

# Complex nested expression
df.select(expr("(((count + 5) * 2) - 6) < 10000 as is_small")).show(5)

+-------------+
|count_plus_10|
+-------------+
|           25|
|           11|
|          354|
|           25|
|           72|
+-------------+
only showing top 5 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|        Netherlands|  660|
|    United States|         Costa Rica|  608|
|    United States|            Jamaica|  712|
|    United States|        The Bahamas|  986|
|    United States|              China|  920|
+-----------------+-------------------+-----+
only showing top 5 rows

+--------+
|is_small|
+--------+
|    true|
|    true|
|    true|
|    true|
|    true|
+--------+
only showing top 5 rows



✅ Key takeaway:

- col("x") and expr("x") both represent a column expression.

- Expressions are lazy recipes, not values. Spark builds a plan with them and executes later.

- Use col when writing Pythonic transformations, and expr when SQL syntax is easier.

Accessing a DataFrame’s columns to see all columns on a DataFrame

In [128]:
df.columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

# Spark Transformations: ```select()```, ```expr()```, and ```selectExpr()```

🔹 1. ```select``` (DataFrame-style, column-based)

This is like saying “give me these columns”.

Equivalent SQL:

```SQL
SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME
FROM dfTable
LIMIT 2;
```


In [129]:
# One column
df.select("DEST_COUNTRY_NAME").show(2)

# Multiple columns
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)


+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows



🔹 2. Using ```col```, ```column```, and ```expr``` inside ```select```

Spark lets you build column expressions instead of just plain names:

In [130]:
from pyspark.sql.functions import expr, col, column

df.select(
    expr("DEST_COUNTRY_NAME"),     # SQL-style
    col("DEST_COUNTRY_NAME"),      # Python-friendly
    column("DEST_COUNTRY_NAME")    # same as col
).show(2)

+-----------------+-----------------+-----------------+
|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|
+-----------------+-----------------+-----------------+
|    United States|    United States|    United States|
|    United States|    United States|    United States|
+-----------------+-----------------+-----------------+
only showing top 2 rows



🔹 3. Renaming columns

- With SQL ```AS```:

In [131]:
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows



- Or with ```.alias()```:

In [132]:
df.select(col("DEST_COUNTRY_NAME").alias("destination")).show(2)

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows



- Or even combining both (the outer ```.alias()``` is applied last, so it decides the final column name):

In [133]:
df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME")).show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



🔹 4. ```selectExpr``` (SQL-style in DataFrames)

```selectExpr``` is a shortcut for when you want to use SQL expressions directly:

In [134]:
df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)

+-------------+-----------------+
|newColumnName|DEST_COUNTRY_NAME|
+-------------+-----------------+
|United States|    United States|
|United States|    United States|
+-------------+-----------------+
only showing top 2 rows



👉 Same as writing ```spark.sql("SELECT DEST_COUNTRY_NAME as newColumnName, DEST_COUNTRY_NAME FROM dfTable")```.

🔹 5. Add new derived columns with ```selectExpr```

You can write expressions directly in SQL style:

In [135]:
df.selectExpr(
    "*",   # keep all existing columns
    "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry"  # new boolean col
).show(2)


+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



👉 This adds a new column ```withinCountry``` that checks if ```DEST_COUNTRY_NAME``` equals ```ORIGIN_COUNTRY_NAME```.

🔹 6. Aggregations inside ```selectExpr```

You can run aggregate functions directly:

Equivalent SQL:

```SQL
SELECT avg(count), count(distinct(DEST_COUNTRY_NAME))
FROM dfTable;
```


In [136]:
df.selectExpr("avg(count) as avg_count", "count(distinct(DEST_COUNTRY_NAME)) as distinct_country_count").show()

+-----------+----------------------+
|  avg_count|distinct_country_count|
+-----------+----------------------+
|1770.765625|                   132|
+-----------+----------------------+



✅ Key takeaways:

- ```select``` → Pythonic, works well with ```col``` & functions.

- `expr` → lets you write SQL inside `select`.

- `selectExpr` → shortcut for writing multiple SQL expressions in one go.

- If you know SQL, `selectExpr` feels very natural.

# Spark Column Transformations

🔹 1. Converting to Spark Types (Literals)

Sometimes you want to insert a constant value (not from a column) into your DataFrame.

In [137]:
from pyspark.sql.functions import lit, expr

# Add a constant column with value 1
df.select(expr("*"), lit(1).alias("numberOne")).show(2)

+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows



👉 `lit(1)` means "literal 1" → Spark knows it's not a column, just a fixed value.

🔹 2. Adding Columns (`withColumn`)

The formal way to add or replace a column.

In [138]:
# Add a new column with a constant
df.withColumn("numberOne", lit(1)).show(2)

# Add a new column with an expression
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME")).show(2)


+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



👉 `withColumn(name, expression)`

`name` = new/existing column name

`expression` = column transformation / literal / condition

🔹 3. Renaming Columns

Rename one column at a time:

In [139]:
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns

['dest', 'ORIGIN_COUNTRY_NAME', 'count']

🔹 4. Removing Columns (drop)

In [140]:
# Drop one column
df.drop("ORIGIN_COUNTRY_NAME").columns

# Drop multiple columns
df.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").columns


['count']

🔹 5. Changing Column Type (Casting)

Convert one column’s data type to another.

In [141]:
from pyspark.sql.functions import col

df.withColumn("count2", col("count").cast("long")).printSchema()


root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
 |-- count2: long (nullable = true)



👉 Useful when Spark reads data as `string` but you need it as `integer/long/float`.

🔹 6. Putting it all together (mini example)

Imagine you want to:

- Keep all columns

- Add a constant column

- Add a boolean flag

- Rename a column

- Drop another column

- Cast a column

In [142]:
df_transformed = (df
    .withColumn("numberOne", lit(1))   # constant col
    .withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))  # boolean col
    .withColumnRenamed("DEST_COUNTRY_NAME", "dest")  # rename
    .drop("ORIGIN_COUNTRY_NAME")  # remove col
    .withColumn("count_long", col("count").cast("long"))  # cast
)

df_transformed.show(5)
df_transformed.printSchema()


+-------------+-----+---------+-------------+----------+
|         dest|count|numberOne|withinCountry|count_long|
+-------------+-----+---------+-------------+----------+
|United States|   15|        1|        false|        15|
|United States|    1|        1|        false|         1|
|United States|  344|        1|        false|       344|
|        Egypt|   15|        1|        false|        15|
|United States|   62|        1|        false|        62|
+-------------+-----+---------+-------------+----------+
only showing top 5 rows

root
 |-- dest: string (nullable = true)
 |-- count: long (nullable = true)
 |-- numberOne: integer (nullable = false)
 |-- withinCountry: boolean (nullable = true)
 |-- count_long: long (nullable = true)



✅ Key takeaways:

- `lit()` → add constants

- `withColumn()` → add or overwrite columns

- `withColumnRenamed()` → rename columns

- `drop()` → remove columns

- `cast()` → change column types

# Spark Row Transformations

🔹 1. Filtering Rows

You can filter using:

- `filter`

- `where` (an alias for `filter`, named after the SQL clause)

Both accept either a column expression or a SQL string.

In [143]:
from pyspark.sql.functions import col

# Filter with col
df.filter(col("count") < 2).show(2)

# Same with where (SQL-style string)
df.where("count < 2").show(2)


+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



👉 Both are equivalent. <br>
👉 Use `col(...)` if you want Python code completion; use strings if you prefer SQL style.

**Multiple Filters**

You can chain multiple conditions:

Equivalent in SQL:
```SQL
SELECT *
FROM dfTable
WHERE count < 2 AND ORIGIN_COUNTRY_NAME != 'Croatia'
LIMIT 2;
```

In [144]:
df.where(col("count") < 2) \
  .where(col("ORIGIN_COUNTRY_NAME") != "Croatia") \
  .show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|          Singapore|    1|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



🔹 2. Getting Unique Rows

👉 This counts unique pairs of (ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME).<br>
👉 distinct() works like SELECT DISTINCT in SQL.

In [145]:
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()

256

🔹 3. Concatenating & Appending Rows (Union)

To add new rows to a DataFrame → use `union`.
⚠️ `union` matches columns by position, not by name, so both DataFrames must have the same number of columns, in the same order, with compatible types.

In [146]:
from pyspark.sql import Row

schema = df.schema  # reuse schema

newRows = [
    Row("New Country", "Other Country", 5),   # note: plain int works
    Row("New Country 2", "Other Country 3", 1)
]

parallelizedRows = spark.sparkContext.parallelize(newRows)
newDF = spark.createDataFrame(parallelizedRows, schema)

# Append and filter
df.union(newDF) \
  .where("count = 1") \
  .where(col("ORIGIN_COUNTRY_NAME") != "United States") \
  .show()


+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
|    United States|          Gibraltar|    1|
|    United States|             Cyprus|    1|
|    United States|            Estonia|    1|
|    United States|          Lithuania|    1|
|    United States|           Bulgaria|    1|
|    United States|            Georgia|    1|
|    United States|            Bahrain|    1|
|    United States|   Papua New Guinea|    1|
|    United States|         Montenegro|    1|
|    United States|            Namibia|    1|
|    New Country 2|    Other Country 3|    1|
+-----------------+-------------------+-----+



🔹 4. Sorting Rows

Two equivalent methods:

- `sort`

- `orderBy`

In [147]:
from pyspark.sql.functions import desc, asc

# Sort by one column
df.sort("count").show(5)

# Sort by multiple columns
df.orderBy("count", "DEST_COUNTRY_NAME").show(5)

# Explicit with col
df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5)

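# 👉 To specify ascending/descending explicitly, use desc()/asc() or the Column methods.
# Caution: expr("count desc") is parsed as "count AS desc" (an alias), so it sorts ascending.

df.orderBy(desc("count")).show(2)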

df.orderBy(col("count").desc(), col("DEST_COUNTRY_NAME").asc()).show(2)


+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
|       United States|          Singapore|    1|
+--------------------+-------------------+-----+
only showing top 5 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
|           Cyprus|      United States|    1|
|         Djibouti|      United States|    1|
|        Indonesia|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--

🔹 5. Limiting Rows

Like SQL `LIMIT`.

In [148]:
df.limit(5).show()

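# Intended: top 6 largest counts. expr("count desc") is parsed as "count AS desc" (an alias),
# so this actually sorts ascending, as the output shows. Use desc("count") or col("count").desc() instead.
df.orderBy(expr("count desc")).limit(6).show()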


+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
|       United States|          Singapore|    1|
|             Moldova|      United States|    1|
+--------------------+-------------------+-----+



✅ Quick Recap

- `filter / where` → keep rows that satisfy conditions.

- `distinct()` → unique rows.

- `union()` → append DataFrames (schema must match).

- `sort() / orderBy()` → order rows.

- `limit()` → restrict number of rows.

# Spark Optimisation

## 🔹 1. Partitions in Spark
- Spark breaks up your DataFrame into partitions (chunks of data distributed across executors).

- By default, when you load small files, you often get 1 partition.

- More partitions = more parallelism (but too many small partitions = overhead).

In [149]:
df.rdd.getNumPartitions()  
# often 1 for small JSON/CSV

1

## 🔹 2. Repartition

- `repartition(n)` → creates exactly `n` partitions.

- Always triggers a full shuffle (data is redistributed across executors).

- Useful when:

    - You need more parallelism (increase partitions).

    - You want to partition by column for better query performance.

In [150]:
# Repartition into 5 partitions
df2 = df.repartition(5)
print(df2.rdd.getNumPartitions())

# Repartition based on column (common filter key)
df3 = df.repartition(col("DEST_COUNTRY_NAME"))

# Repartition into 5 partitions *by column*
df4 = df.repartition(5, col("DEST_COUNTRY_NAME"))


5


👉 Repartitioning by `DEST_COUNTRY_NAME` puts rows with the same country in the same partition, which can reduce shuffling in later joins or aggregations on that column.

## 🔹 3. Coalesce

- `coalesce(n)` → reduces the number of partitions without a full shuffle (faster).

- It merges partitions together instead of redistributing.

- Useful when you need fewer partitions, e.g., before writing out to disk.

In [151]:
# Start with 5 partitions, reduce to 2
df5 = df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)
print(df5.rdd.getNumPartitions())


2


👉 Rule of thumb:

- Need more partitions or shuffle by column → use `repartition`.

- Need fewer partitions → use `coalesce`.

## 🔹 4. Collecting Rows

Collecting moves data from executors → driver program.<br> **⚠️ Dangerous for large datasets**.

In [153]:
collectDF = df.limit(10)

# Get first N rows as list of Row objects
collectDF.take(5)    

# Print nicely in tabular form
collectDF.show()     
collectDF.show(5, False)  # don't truncate

# Collect all rows into Python list (⚠️ expensive!)
rows = collectDF.collect()
print(rows)


+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India         

**⚠️ Important warnings:**

- `collect()` brings *all data* to the driver. If your dataset is big, this can crash the driver.

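- `take(n)` is safer → only retrieves `n` rows.

- `toLocalIterator()` streams the data to the driver one partition at a time, so the driver only needs memory for the largest partition. It can still be slow, and it fails if a single partition is too large.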

## **✅ Quick Recap**

- `repartition(n)` → set the partition count to `n` (usually to increase it), full shuffle.

- `repartition(n, col)` → distribute by column, shuffle.

- `coalesce(n)` → decrease partitions, no full shuffle.

- `collect()` → brings all data to driver (**⚠️ use carefully**).

- `take(n)` / `show(n)` → safer, preview data.