<a href="https://colab.research.google.com/github/gvikas79/Spark-Tutorials/blob/main/spark_class3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**What is collect() in PySpark?**

*   collect() is an action in PySpark (not a transformation), so calling it triggers execution of the query.
*   It retrieves all rows of a DataFrame to the driver node as a list of Row objects (on an RDD, it returns a list of the RDD's elements).
*   Since Spark works in a distributed environment, data is spread across executors.
*   collect() brings everything back to your local Python process.


Warning: Don’t use collect() on very large datasets (it can cause OutOfMemoryError). Use it only for small results, testing, or debugging.

Syntax

DataFrame.collect()

Returns:
A list of Row objects, where each row represents a record from the DataFrame.




In [None]:
#Example 1: Basic Usage
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CollectExample").getOrCreate()

# Sample DataFrame
data = [("Alice", 25, 3000), ("Bob", 30, 4000), ("Charlie", 28, 5000)]
df = spark.createDataFrame(data, ["name", "age", "salary"])

df.show()

# Collect all rows
result = df.collect()

print(result)


+-------+---+------+
|   name|age|salary|
+-------+---+------+
|  Alice| 25|  3000|
|    Bob| 30|  4000|
|Charlie| 28|  5000|
+-------+---+------+

[Row(name='Alice', age=25, salary=3000), Row(name='Bob', age=30, salary=4000), Row(name='Charlie', age=28, salary=5000)]


In [None]:
#Example 2: Iterating Over Collected Data
rows = df.collect()

for row in rows:
    print(row["name"], row["age"], row["salary"])


Alice 25 3000
Bob 30 4000
Charlie 28 5000


In [None]:
#Example 3: Convert to Pandas
#toPandas() collects the whole DataFrame to the driver as a Pandas DataFrame for local analysis,
#so the same memory warning as collect() applies:

pandas_df = df.toPandas()
print(pandas_df)


      name  age  salary
0    Alice   25    3000
1      Bob   30    4000
2  Charlie   28    5000


In [None]:
#Example 4: Collect with Filtering

result = df.filter(df["age"] > 26).collect()
for row in result:
    print(row.name, row.salary)


Bob 4000
Charlie 5000


When to Use collect()?

Use collect() when:

•	Dataset is small enough to fit in memory.

•	You want to debug, print, or inspect results locally.

•	You’re passing results to an external Python library (like Pandas, NumPy, Matplotlib).

Avoid collect() when:
•	Dataset is large (millions of rows, GBs of data), because pulling all of it into the driver can cause out-of-memory errors.

Tip: For safer alternatives use:

•	show(n) → prints first n rows nicely.

•	take(n) → returns first n rows as list.

•	limit(n).collect() → collects only a subset.
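One way to make the "small results only" rule concrete is a guard that refuses to collect more than a fixed number of rows. The sketch below is a hypothetical helper (collect_at_most is not part of PySpark); it assumes a PySpark DataFrame and relies only on its limit() and collect() methods.

```python
def collect_at_most(df, max_rows):
    """Collect df to the driver only if it has at most max_rows rows.

    collect_at_most is a hypothetical helper, not a PySpark API.
    df is assumed to be a PySpark DataFrame (anything exposing
    limit() and collect() behaves the same way here).
    """
    # Ask for one row more than allowed: at most max_rows + 1 rows ever
    # reach the driver, which is enough to detect an oversized result.
    rows = df.limit(max_rows + 1).collect()
    if len(rows) > max_rows:
        raise ValueError(
            f"result has more than {max_rows} rows; refusing to collect"
        )
    return rows
```

A call such as collect_at_most(df.filter(df["age"] > 26), 1000) would then either return the rows or fail fast instead of exhausting driver memory.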



What is transform() in PySpark?

•	transform() is available on DataFrame objects.

•	It allows you to apply a function (transformation) to a DataFrame in a clean, reusable, and chainable way.

•	Instead of writing complex transformations inline, you can wrap them in functions and pass them to transform().

•	It improves readability and reusability of your PySpark code.
________________________________________
Syntax
DataFrame.transform(func)

•	func → a Python function that takes a DataFrame as input and returns a DataFrame.

•	Returns → the transformed DataFrame.


In [None]:
#Example 1: Basic Usage
from pyspark.sql.functions import col
# Sample DataFrame
data = [("Alice", 25, 3000), ("Bob", 30, 4000), ("Charlie", 28, 5000)]
df = spark.createDataFrame(data, ["name", "age", "salary"])

df.show()


+-------+---+------+
|   name|age|salary|
+-------+---+------+
|  Alice| 25|  3000|
|    Bob| 30|  4000|
|Charlie| 28|  5000|
+-------+---+------+



In [None]:
# Define a function to add a 10% bonus to salary

def add_bonus(dataframe):
    return dataframe.withColumn("salary_with_bonus", col("salary") * 1.1)

# Apply using transform
df_transformed = df.transform(add_bonus)
df_transformed.show()


+-------+---+------+------------------+
|   name|age|salary| salary_with_bonus|
+-------+---+------+------------------+
|  Alice| 25|  3000|3300.0000000000005|
|    Bob| 30|  4000|            4400.0|
|Charlie| 28|  5000|            5500.0|
+-------+---+------+------------------+
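The trailing digits in 3300.0000000000005 are ordinary binary floating-point rounding rather than a Spark problem: 1.1 has no exact double-precision representation. A plain-Python sketch (no Spark involved) reproduces the effect and shows two common remedies. In Spark, the analogous tools would be pyspark.sql.functions.round() or a DecimalType column.

```python
from decimal import Decimal

# Same arithmetic as col("salary") * 1.1 for Alice's row, in double precision.
raw = 3000 * 1.1
print(raw)

# Remedy 1: round to the precision you actually need (two decimals here).
rounded = round(raw, 2)
print(rounded)

# Remedy 2: use exact decimal arithmetic instead of binary floats.
exact = Decimal("3000") * Decimal("1.1")
print(exact)
```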



In [None]:
#Example 2: Chaining Multiple transform() Calls
from pyspark.sql.functions import upper

# Function to uppercase name
def uppercase_name(df):
    return df.withColumn("name_upper", upper(col("name")))

# Function to flag salaries above 4000 (stored as the string "true" or "false")
def categorize_salary(df):
    return df.withColumn("salary_level",
                         (col("salary") > 4000).cast("string"))

# Apply multiple transformations
df_chain = df.transform(uppercase_name).transform(categorize_salary)
df_chain.show()


+-------+---+------+----------+------------+
|   name|age|salary|name_upper|salary_level|
+-------+---+------+----------+------------+
|  Alice| 25|  3000|     ALICE|       false|
|    Bob| 30|  4000|       BOB|       false|
|Charlie| 28|  5000|   CHARLIE|        true|
+-------+---+------+----------+------------+



In [None]:
#Example 3: Inline Transformations with lambda
#You can also pass a lambda directly for quick one-off steps:

df_lambda = df.transform(lambda d: d.withColumn("age_plus_5", col("age") + 5))
df_lambda.show()


+-------+---+------+----------+
|   name|age|salary|age_plus_5|
+-------+---+------+----------+
|  Alice| 25|  3000|        30|
|    Bob| 30|  4000|        35|
|Charlie| 28|  5000|        33|
+-------+---+------+----------+
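The lambda in Example 3 hard-codes the value 5. When that value should be a parameter, one common pattern is a small factory function that returns the function handed to transform(). The sketch below assumes a DataFrame with a numeric age column; add_years is a hypothetical name, and it uses selectExpr() with SQL expression strings so that no extra imports are needed.

```python
def add_years(n):
    """Return a transformation that adds an age_plus_<n> column.

    add_years is a hypothetical helper for illustration. The returned
    function takes a DataFrame and returns a DataFrame, which is the
    shape transform() expects. n is assumed to be a non-negative integer
    so that age_plus_<n> is a valid column name.
    """
    def _apply(df):
        # "*" keeps every existing column; the second expression adds the new one.
        return df.selectExpr("*", f"age + {n} AS age_plus_{n}")
    return _apply
```

With this in place, df.transform(add_years(5)) should give the same columns as Example 3, and df.transform(add_years(10)) reuses the same logic with a different value.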



In [None]:
#Example 4: Reusable Pipeline with transform()

def pipeline(df):
    return (df
            .transform(add_bonus)         # Step 1: Add bonus
            .transform(uppercase_name)    # Step 2: Uppercase name
            .transform(lambda d: d.filter(col("age") > 26))  # Step 3: Filter
           )

df_pipeline = df.transform(pipeline)
df_pipeline.show()


+-------+---+------+-----------------+----------+
|   name|age|salary|salary_with_bonus|name_upper|
+-------+---+------+-----------------+----------+
|    Bob| 30|  4000|           4400.0|       BOB|
|Charlie| 28|  5000|           5500.0|   CHARLIE|
+-------+---+------+-----------------+----------+



Why Use transform()?

•	Makes code cleaner and more modular (define transformations once, reuse many times).

•	Useful when building pipelines of transformations.

•	Works well with a functional programming style in PySpark.


In [None]:
#Direct Method Chaining
#You can chain transformations directly on the DataFrame:

spark = SparkSession.builder.appName("TransformVsChaining").getOrCreate()

data = [("Alice", 25, 3000), ("Bob", 30, 4000), ("Charlie", 28, 5000)]
df = spark.createDataFrame(data, ["name", "age", "salary"])

# Direct chaining
df_chain = (
    df.withColumn("salary_with_bonus", col("salary") * 1.1)
      .withColumn("name_upper", upper(col("name")))
      .filter(col("age") > 26)
)

df_chain.show()


+-------+---+------+-----------------+----------+
|   name|age|salary|salary_with_bonus|name_upper|
+-------+---+------+-----------------+----------+
|    Bob| 30|  4000|           4400.0|       BOB|
|Charlie| 28|  5000|           5500.0|   CHARLIE|
+-------+---+------+-----------------+----------+



In [None]:
#Using transform()
#Instead of repeating logic, you define reusable functions:
# Define reusable transformations

def add_bonus(df):
    return df.withColumn("salary_with_bonus", col("salary") * 1.1)

def uppercase_name(df):
    return df.withColumn("name_upper", upper(col("name")))

def filter_age(df):
    return df.filter(col("age") > 26)

# Apply with transform
df_transformed = (
    df.transform(add_bonus)
      .transform(uppercase_name)
      .transform(filter_age)
)

df_transformed.show()

+-------+---+------+-----------------+----------+
|   name|age|salary|salary_with_bonus|name_upper|
+-------+---+------+-----------------+----------+
|    Bob| 30|  4000|           4400.0|       BOB|
|Charlie| 28|  5000|           5500.0|   CHARLIE|
+-------+---+------+-----------------+----------+



Output is the same, but advantages:

•	Modular: Each transformation is a function.

•	Reusable: You can apply add_bonus() or filter_age() to other DataFrames easily.

•	Readable: Clearly separates logical steps.


In [None]:
#Mixing lambda with transform()

#For quick one-off transformations:
df_lambda = (
    df.transform(add_bonus)
      .transform(lambda d: d.withColumn("age_plus_5", col("age") + 5))
)
df_lambda.show()


+-------+---+------+------------------+----------+
|   name|age|salary| salary_with_bonus|age_plus_5|
+-------+---+------+------------------+----------+
|  Alice| 25|  3000|3300.0000000000005|        30|
|    Bob| 30|  4000|            4400.0|        35|
|Charlie| 28|  5000|            5500.0|        33|
+-------+---+------+------------------+----------+



When to Use What?

| Approach | Best for |
|---|---|
| Method chaining | Quick scripts, small transformations, throwaway code |
| transform() | Reusable pipelines, production code, when the same transformations must be applied to multiple DataFrames |
________________________________________
In real-world projects, transform() shines when:

•	You build data pipelines.

•	You want clean, testable, reusable code.

•	You’re working in a team (easier to understand functions like add_bonus than inline chains).


Unlike Pandas, a PySpark DataFrame does not have an apply() method.
Instead, apply-style logic shows up in several different places in PySpark:

1.	apply() on a Pandas Series inside a Pandas UDF (with pandas_udf)

2.	applyInPandas() on grouped DataFrames, called as df.groupBy(...).applyInPandas(...)

3.	apply() in grouped operations (GroupedData), an older API for which applyInPandas() is the preferred replacement

4.	RDD map() / mapPartitions() (the lower-level equivalent to apply logic)

Let’s go through them one by one with code examples:


In [None]:
#apply() in Pandas UDFs
#PySpark integrates with Pandas via vectorized UDFs.
#Here, apply() is used inside Pandas UDF functions.

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


# Sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]
df = spark.createDataFrame(data, ["name", "age"])

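# Define a Pandas UDF that applies a custom transformation.
# Series.apply() runs the lambda element by element; the vectorized
# expression age_series + 5 gives the same result and is usually faster.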
@pandas_udf("int")
def add_five(age_series: pd.Series) -> pd.Series:
    return age_series.apply(lambda x: x + 5)

# Use it in a DataFrame
df_with_new = df.withColumn("age_plus_5", add_five(df["age"]))
df_with_new.show()


+-------+---+----------+
|   name|age|age_plus_5|
+-------+---+----------+
|  Alice| 25|        30|
|    Bob| 30|        35|
|Charlie| 28|        33|
+-------+---+----------+



In [None]:
#applyInPandas()
#This is a method on grouped data (the result of df.groupBy(...)) that applies a function to each group as a Pandas DataFrame.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema for output
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age_plus_10", IntegerType(), True)
])

# Function to apply on each Pandas DataFrame
def add_ten(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["age_plus_10"] = pdf["age"] + 10
    return pdf[["name", "age_plus_10"]]

# Use applyInPandas
df_applied = df.groupBy("name").applyInPandas(add_ten, schema=schema)
df_applied.show()

#applyInPandas() is grouped: Spark splits data into groups → converts each group to Pandas → applies the function → merges results back.

+-------+-----------+
|   name|age_plus_10|
+-------+-----------+
|  Alice|         35|
|    Bob|         40|
|Charlie|         38|
+-------+-----------+



In [None]:
#Equivalent of apply() on RDDs
#If you want Pandas-style apply() for row-wise operations, you can use map() on RDDs:

# Convert DataFrame to RDD
rdd = df.rdd

# Apply transformation (similar to row-wise apply)
rdd_applied = rdd.map(lambda row: (row["name"], row["age"] + 2))
print(rdd_applied.collect())


[('Alice', 27), ('Bob', 32), ('Charlie', 30)]


Summary

•	No direct apply() on PySpark DataFrames like Pandas.

•	You use:

  o	pandas_udf with .apply() → for row/column ops inside Pandas.

  o	applyInPandas() → for grouped transformations.

  o	GroupedData + Pandas UDF → for custom aggregations.

  o	RDD .map() → as a lower-level apply.


map() and flatMap() in PySpark

These two are RDD (Resilient Distributed Dataset) methods, not DataFrame methods. They're used for low-level transformations in Spark.


map() in PySpark

  •	Applies a function to each element of the RDD.

  •	Returns a new RDD where each input element produces exactly one output element.

  •	Output count = Input count (1 → 1 mapping).



Good for element-wise transformations.


In [None]:
#Example: map()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MapFlatMapExample").getOrCreate()

# Create an RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Multiply each element by 2
mapped_rdd = rdd.map(lambda x: x * 2)

print(mapped_rdd.collect())

#Each element produced one output.


[2, 4, 6, 8, 10]


flatMap() in PySpark

•	Similar to map(), but flattens the results.

•	Each input element can produce zero, one, or many output elements.

•	Output count can differ from input count (1 → 0..n mapping).


Good for splitting, expanding, or filtering data.
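As a plain-Python analogy (no Spark involved, so this only mirrors the semantics, not the distributed execution), map() corresponds to a list comprehension that keeps one result per input, while flatMap() corresponds to concatenating the per-input results:

```python
from itertools import chain

lines = ["hello world", "spark map flatmap"]

# map(): one output per input, so each line becomes one list of words.
mapped = [line.split(" ") for line in lines]

# flatMap(): the per-line lists are concatenated into a single sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```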


In [None]:
#Example: flatMap()

# RDD of sentences
rdd2 = spark.sparkContext.parallelize(["hello world", "spark map flatmap", "pyspark example"])

# Split each sentence into words
flatmapped_rdd = rdd2.flatMap(lambda line: line.split(" "))

print(flatmapped_rdd.collect())

#Each sentence produced multiple words, and flatMap() flattened them into a single list.

['hello', 'world', 'spark', 'map', 'flatmap', 'pyspark', 'example']


In [None]:
#Comparison Between map() and flatMap()

#Using map() for word splitting:

mapped_words = rdd2.map(lambda line: line.split(" "))
print(mapped_words.collect())

#Result is a list of lists (not flattened).


[['hello', 'world'], ['spark', 'map', 'flatmap'], ['pyspark', 'example']]


In [None]:
#Using flatMap() for word splitting:

flatmapped_words = rdd2.flatMap(lambda line: line.split(" "))
print(flatmapped_words.collect())

#Result is a flat list of words.


['hello', 'world', 'spark', 'map', 'flatmap', 'pyspark', 'example']


In [None]:
#Another Example: Filtering with flatMap()

# RDD with numbers
nums = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# flatMap returns [] for odd numbers, [x] for even numbers
evens = nums.flatMap(lambda x: [x] if x % 2 == 0 else [])

print(evens.collect())

#Unlike map() which always returns one value, flatMap() can return zero elements (filtering effect).


[2, 4]


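foreach() in PySpark

foreach() is an RDD action that runs a function on each element for its side effects.

Syntax

rdd.foreach(f)

f → a function to be executed on each element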


In [None]:
#Example 1: Simple Print

# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Apply foreach
def print_element(x):
    print(f"Value: {x}")

rdd.foreach(print_element)
#Note: You might not always see output in the driver logs because the printing happens on worker nodes.


In [None]:
#Example 2: Writing to an External File

def write_to_file(x):
    with open("output.txt", "a") as f:
        f.write(str(x) + "\n")

rdd.foreach(write_to_file)

#Each worker appends to output.txt on its own local filesystem, not on the driver.
#On a real cluster, the function should therefore write to shared storage or an external database instead.


In [None]:
#Example 3: Using foreach for Database Insert

def insert_to_db(x):
    # Example: mock DB insert
    print(f"Inserting {x} into database...")

rdd.foreach(insert_to_db)


🔹 Difference Between foreach() and map()

map() → transformation, returns a new RDD.

foreach() → action, returns nothing.


In [None]:
#Using map

mapped = rdd.map(lambda x: x*2)
print(mapped.collect())  # [2,4,6,8,10]


[2, 4, 6, 8, 10]


In [None]:
# Using foreach

rdd.foreach(lambda x: print(x*2))  # Prints values but no return

Key Takeaways:

foreach() is an action.

Executes function on each element of RDD.

Typically used for side effects (DB updates, external API calls, logging).

Doesn’t return an RDD or DataFrame.

What is partitionBy()?

partitionBy() is used when writing data (especially in formats like Parquet, ORC, Avro, CSV) to organize the output files into separate folders based on one or more columns.

This helps with:

•	Efficient querying (only the partitions a query needs are scanned).

•	Reducing the amount of data read from storage.

•	Better performance with tools like Spark SQL, Hive, Presto, etc.


Syntax
DataFrameWriter.partitionBy(col1, col2, ...).format("...").save(path)


col1, col2 → columns to partition the data by.

format("...") → output format (Parquet, CSV, JSON, etc.).

save(path) → output location (HDFS, local filesystem, S3).


In [None]:
#Example 1: Partition by one column

# Sample data
data = [
    (1, "Alice", "HR", 3000),
    (2, "Bob", "IT", 4000),
    (3, "Charlie", "IT", 4500),
    (4, "David", "Finance", 3500),
    (5, "Eve", "HR", 3200)
]

columns = ["id", "name", "dept", "salary"]

df = spark.createDataFrame(data, columns)

# Write data partitioned by 'dept'
df.write.partitionBy("dept").mode("overwrite").parquet("output/employee_partitioned")

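#This creates a folder structure like:
#  output/employee_partitioned/
#    dept=Finance/
#    dept=HR/
#    dept=IT/
#with the Parquet part files for each department inside its dept=... folder.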


In [None]:
#Example 2: Partition by multiple columns
# Write data partitioned by 'dept' and 'salary'

df.write.partitionBy("dept", "salary").mode("overwrite").parquet("output/employee_multi_partitioned")


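#Folders are nested one level per partition column, for example:
#  output/employee_multi_partitioned/dept=IT/salary=4000/
#A column with many distinct values (like salary) produces many small folders, so it is rarely a good partition key.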

In [None]:
#Example 3: Reading partitioned data
# Read the partitioned Parquet back

df_read = spark.read.parquet("output/employee_partitioned")

df_read.show()

+---+-------+------+-------+
| id|   name|salary|   dept|
+---+-------+------+-------+
|  3|Charlie|  4500|     IT|
|  1|  Alice|  3000|     HR|
|  4|  David|  3500|Finance|
|  2|    Bob|  4000|     IT|
|  5|    Eve|  3200|     HR|
+---+-------+------+-------+



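Key Notes:

partitionBy() does not change the rows themselves, only how they are laid out on disk: the partition column values are stored in the directory names rather than inside the files, which is why dept comes back as the last column in Example 3.

It’s mainly useful for big data optimization.

It works with any file-based format, but columnar formats like Parquet and ORC usually benefit more than CSV/JSON.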
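Spark names partition directories column=value (the Hive-style layout). The plain-Python sketch below does not write any files; it only mimics that naming for the sample data from Example 1 to show which directories each partitionBy() call would produce. The helper partition_dir is a hypothetical name used for illustration.

```python
data = [
    (1, "Alice", "HR", 3000),
    (2, "Bob", "IT", 4000),
    (3, "Charlie", "IT", 4500),
    (4, "David", "Finance", 3500),
    (5, "Eve", "HR", 3200),
]
columns = ["id", "name", "dept", "salary"]


def partition_dir(row, partition_cols):
    """Build the column=value/... directory suffix for one row."""
    record = dict(zip(columns, row))
    return "/".join(f"{c}={record[c]}" for c in partition_cols)


# One directory per distinct dept value.
by_dept = sorted({partition_dir(r, ["dept"]) for r in data})
# One nested directory per distinct (dept, salary) combination.
by_dept_salary = sorted({partition_dir(r, ["dept", "salary"]) for r in data})

print(by_dept)
print(by_dept_salary)
```

Partitioning by dept gives three directories for five rows, while adding salary gives one directory per row, which illustrates why high-cardinality columns make poor partition keys.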


What is MapType?

MapType is a Spark SQL data type that stores key-value pairs (like a Python dictionary).

Both keys and values have fixed data types (e.g., StringType for keys and IntegerType for values).

Keys are always non-null, but values can be nullable depending on schema definition.

Syntax
from pyspark.sql.types import MapType, StringType, IntegerType

MapType(keyType, valueType, valueContainsNull=True)

keyType → Data type of keys (e.g., StringType(), IntegerType()).

valueType → Data type of values.

valueContainsNull → Boolean (default = True). Whether map values can contain null.


In [None]:
#Example 1: Creating a MapType Column

from pyspark.sql.types import MapType, StringType, IntegerType, StructType, StructField

data = [
    (1, {"math": 80, "english": 90}),
    (2, {"math": 85, "science": 95}),
    (3, None)  # null map
]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("scores", MapType(StringType(), IntegerType()), True)
])

df = spark.createDataFrame(data, schema)
df.show(truncate=False)
df.printSchema()


+---+---------------------------+
|id |scores                     |
+---+---------------------------+
|1  |{english -> 90, math -> 80}|
|2  |{science -> 95, math -> 85}|
|3  |NULL                       |
+---+---------------------------+

root
 |-- id: integer (nullable = true)
 |-- scores: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)



In [None]:
#Example 2: Accessing Map Values

#You can use col("mapField")["key"] syntax.

from pyspark.sql.functions import col

df.select(
    col("id"),
    col("scores")["math"].alias("math_score"),
    col("scores")["english"].alias("english_score")
).show()

+---+----------+-------------+
| id|math_score|english_score|
+---+----------+-------------+
|  1|        80|           90|
|  2|        85|         NULL|
|  3|      NULL|         NULL|
+---+----------+-------------+
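The NULL cells in this output follow the same rule as a dictionary lookup with a default: a missing key or a null map yields no value rather than an error. The plain-Python sketch below mirrors that behavior for the sample rows; it reflects the output shown here under default settings, and stricter SQL modes may behave differently. The helper lookup is a hypothetical name.

```python
rows = [
    (1, {"math": 80, "english": 90}),
    (2, {"math": 85, "science": 95}),
    (3, None),  # null map
]


def lookup(scores, key):
    """Mimic col("scores")[key]: None for a null map or a missing key."""
    if scores is None:
        return None
    return scores.get(key)


result = [(id_, lookup(s, "math"), lookup(s, "english")) for id_, s in rows]
print(result)
```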



In [None]:
#Example 3: Creating Map Column Dynamically
#You can create a map column using create_map():

from pyspark.sql.functions import create_map, lit

df2 = df.withColumn("extra", create_map(lit("physics"), lit(88), lit("chemistry"), lit(77)))
df2.show(truncate=False)

+---+---------------------------+--------------------------------+
|id |scores                     |extra                           |
+---+---------------------------+--------------------------------+
|1  |{english -> 90, math -> 80}|{physics -> 88, chemistry -> 77}|
|2  |{science -> 95, math -> 85}|{physics -> 88, chemistry -> 77}|
|3  |NULL                       |{physics -> 88, chemistry -> 77}|
+---+---------------------------+--------------------------------+



Key Takeaways

MapType is like a dictionary in PySpark DataFrames.

Keys are always non-null, values can be null.

Use col("mapField")["key"] to extract values.

Use create_map() to create new map columns.


In [34]:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("emp.csv")

In [35]:
df.show()

+-----------+-------------+-------------+---+------+------+----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|
+-----------+-------------+-------------+---+------+------+----------+
|          1|          101|     John Doe| 30|  Male| 50000|2015-01-01|
|          2|          101|   Jane Smith| 25|Female| 45000|2016-02-15|
|          3|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|
|          4|          102|    Alice Lee| 28|Female| 48000|2017-09-30|
|          5|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|
|          6|          103|    Jill Wong| 32|Female| 52000|2018-07-01|
|          7|          101|James Johnson| 42|  Male| 70000|2012-03-15|
|          8|          102|     Kate Kim| 29|Female| 51000|2019-10-01|
|          9|          103|      Tom Tan| 33|  Male| 58000|2016-06-01|
|         10|          104|     Lisa Lee| 27|Female| 47000|2018-08-01|
|         11|          104|   David Park| 38|  Male| 65000|2015-11-01|
|     

In [36]:
def bonus(salary):
  return int(salary) * 0.1
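Spark passes a SQL NULL to a Python UDF as None, and int(None) raises a TypeError, so the function above would fail on a row with a missing salary. The rows shown here all have a salary, so this is a precaution rather than a fix for an observed error. A null-safe variant (bonus_null_safe is a hypothetical name) might look like this:

```python
def bonus_null_safe(salary):
    """Return 10% of salary, or None when the salary is missing."""
    if salary is None:
        return None
    return int(salary) * 0.1


print(bonus_null_safe(50000))
print(bonus_null_safe(None))
```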

In [37]:
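from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# Without an explicit return type, udf() defaults to StringType
bonus_udf = udf(bonus, DoubleType())
# Register the same function so it can be called from Spark SQL queries
spark.udf.register("bonus_sql_udf", bonus, "double")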

In [38]:
df.withColumn("bonus", bonus_udf(col("salary"))).show()

+-----------+-------------+-------------+---+------+------+----------+------+
|employee_id|department_id|         name|age|gender|salary| hire_date| bonus|
+-----------+-------------+-------------+---+------+------+----------+------+
|          1|          101|     John Doe| 30|  Male| 50000|2015-01-01|5000.0|
|          2|          101|   Jane Smith| 25|Female| 45000|2016-02-15|4500.0|
|          3|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|5500.0|
|          4|          102|    Alice Lee| 28|Female| 48000|2017-09-30|4800.0|
|          5|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|6000.0|
|          6|          103|    Jill Wong| 32|Female| 52000|2018-07-01|5200.0|
|          7|          101|James Johnson| 42|  Male| 70000|2012-03-15|7000.0|
|          8|          102|     Kate Kim| 29|Female| 51000|2019-10-01|5100.0|
|          9|          103|      Tom Tan| 33|  Male| 58000|2016-06-01|5800.0|
|         10|          104|     Lisa Lee| 27|Female| 47000|2018-

In [39]:
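# Built-in column arithmetic needs no UDF. Note that salary * 1.1 is the salary
# plus a 10% bonus, not the bonus itself (that would be salary * 0.1).
df.withColumn("bonus", col("salary")*1.1).show()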

+-----------+-------------+-------------+---+------+------+----------+-----------------+
|employee_id|department_id|         name|age|gender|salary| hire_date|            bonus|
+-----------+-------------+-------------+---+------+------+----------+-----------------+
|          1|          101|     John Doe| 30|  Male| 50000|2015-01-01|55000.00000000001|
|          2|          101|   Jane Smith| 25|Female| 45000|2016-02-15|49500.00000000001|
|          3|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|60500.00000000001|
|          4|          102|    Alice Lee| 28|Female| 48000|2017-09-30|52800.00000000001|
|          5|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|          66000.0|
|          6|          103|    Jill Wong| 32|Female| 52000|2018-07-01|57200.00000000001|
|          7|          101|James Johnson| 42|  Male| 70000|2012-03-15|          77000.0|
|          8|          102|     Kate Kim| 29|Female| 51000|2019-10-01|56100.00000000001|
|          9|        

In [40]:
# Example: Using a UDF with transform()

# Define a function that uses the UDF
def add_bonus_udf_transform(dataframe):
    return dataframe.withColumn("bonus_from_udf", bonus_udf(col("salary")))

# Apply the transformation using transform()
df_with_udf_transform = df.transform(add_bonus_udf_transform)

df_with_udf_transform.show()

+-----------+-------------+-------------+---+------+------+----------+--------------+
|employee_id|department_id|         name|age|gender|salary| hire_date|bonus_from_udf|
+-----------+-------------+-------------+---+------+------+----------+--------------+
|          1|          101|     John Doe| 30|  Male| 50000|2015-01-01|        5000.0|
|          2|          101|   Jane Smith| 25|Female| 45000|2016-02-15|        4500.0|
|          3|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|        5500.0|
|          4|          102|    Alice Lee| 28|Female| 48000|2017-09-30|        4800.0|
|          5|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|        6000.0|
|          6|          103|    Jill Wong| 32|Female| 52000|2018-07-01|        5200.0|
|          7|          101|James Johnson| 42|  Male| 70000|2012-03-15|        7000.0|
|          8|          102|     Kate Kim| 29|Female| 51000|2019-10-01|        5100.0|
|          9|          103|      Tom Tan| 33|  Male| 5

Advantages of using UDFs over simple Python functions in PySpark:

*   **Integration with Spark SQL:** Registered UDFs can be directly used in Spark SQL queries, making your logic accessible from both DataFrame API and SQL.
*   **Serialization and Distribution:** Spark handles the serialization and distribution of UDFs to worker nodes, allowing your custom logic to be executed in a distributed manner.
*   **Performance (Pandas UDFs):** Pandas UDFs (vectorized UDFs) can offer significant performance improvements for certain operations by leveraging Apache Arrow and processing data in batches within Pandas DataFrames.
*   **Reusability:** Once defined and registered, a UDF can be easily reused across different parts of your Spark application.
*   **Handling Complex Logic:** UDFs are useful when the required transformation logic is too complex to express using built-in Spark functions.

## `explode()` in PySpark

The `explode()` function is used to create a new row for each element in an array or map column. It essentially transforms a single row with an array/map into multiple rows, with each new row containing one element from the original array/map.

This is particularly useful when you have nested data structures (arrays or maps) in your DataFrame and you want to flatten them for further processing or analysis.

**Syntax**

explode(col)

col → the array or map column to expand into rows.

In [41]:
# Example 1: explode() with an array column

from pyspark.sql.functions import explode, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExplodeExample").getOrCreate()

# Sample DataFrame with an array column
data = [
    ("Alice", ["Math", "Science"]),
    ("Bob", ["History"]),
    ("Charlie", []), # Empty array
    ("David", None) # Null array
]
columns = ["name", "subjects"]
df_array = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_array.show(truncate=False)

# Use explode() on the 'subjects' array column
df_exploded_array = df_array.select(col("name"), explode(col("subjects")).alias("subject"))

print("DataFrame after explode() on array column:")
df_exploded_array.show(truncate=False)

# Note: Rows with empty or null arrays are dropped by default.

Original DataFrame:
+-------+---------------+
|name   |subjects       |
+-------+---------------+
|Alice  |[Math, Science]|
|Bob    |[History]      |
|Charlie|[]             |
|David  |NULL           |
+-------+---------------+

DataFrame after explode() on array column:
+-----+-------+
|name |subject|
+-----+-------+
|Alice|Math   |
|Alice|Science|
|Bob  |History|
+-----+-------+



In [42]:
# Example 2: explode() with a map column

from pyspark.sql.functions import explode, col, create_map, lit
from pyspark.sql import SparkSession
from pyspark.sql.types import MapType, StringType, IntegerType, StructType, StructField

# Sample DataFrame with a map column
data = [
    ("Alice", {"Math": 90, "Science": 85}),
    ("Bob", {"History": 75}),
    ("Charlie", {}), # Empty map
    ("David", None) # Null map
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("scores", MapType(StringType(), IntegerType()), True)
])

df_map = spark.createDataFrame(data, schema)

print("Original DataFrame:")
df_map.show(truncate=False)

# Use explode() on the 'scores' map column
# explode() on a map results in two columns: 'key' and 'value'
df_exploded_map = df_map.select(col("name"), explode(col("scores")))

print("DataFrame after explode() on map column:")
df_exploded_map.show(truncate=False)

# You can rename the resulting columns
df_exploded_map_renamed = df_map.select(col("name"), explode(col("scores")).alias("course", "score"))

print("DataFrame after explode() on map column (renamed columns):")
df_exploded_map_renamed.show(truncate=False)

Original DataFrame:
+-------+---------------------------+
|name   |scores                     |
+-------+---------------------------+
|Alice  |{Science -> 85, Math -> 90}|
|Bob    |{History -> 75}            |
|Charlie|{}                         |
|David  |NULL                       |
+-------+---------------------------+

DataFrame after explode() on map column:
+-----+-------+-----+
|name |key    |value|
+-----+-------+-----+
|Alice|Science|85   |
|Alice|Math   |90   |
|Bob  |History|75   |
+-----+-------+-----+

DataFrame after explode() on map column (renamed columns):
+-----+-------+-----+
|name |course |score|
+-----+-------+-----+
|Alice|Science|85   |
|Alice|Math   |90   |
|Bob  |History|75   |
+-----+-------+-----+



In [43]:
# Example 3: explode_outer()

from pyspark.sql.functions import explode_outer, col

# Assume df_array is already created from Example 1

print("Original DataFrame:")
df_array.show(truncate=False)

# Use explode_outer() on the 'subjects' array column
df_exploded_outer_array = df_array.select(col("name"), explode_outer(col("subjects")).alias("subject"))

print("DataFrame after explode_outer() on array column:")
df_exploded_outer_array.show(truncate=False)

# Assume df_map is already created from Example 2

print("Original DataFrame:")
df_map.show(truncate=False)

# Use explode_outer() on the 'scores' map column
df_exploded_outer_map = df_map.select(col("name"), explode_outer(col("scores")).alias("course", "score"))

print("DataFrame after explode_outer() on map column:")
df_exploded_outer_map.show(truncate=False)

Original DataFrame:
+-------+---------------+
|name   |subjects       |
+-------+---------------+
|Alice  |[Math, Science]|
|Bob    |[History]      |
|Charlie|[]             |
|David  |NULL           |
+-------+---------------+

DataFrame after explode_outer() on array column:
+-------+-------+
|name   |subject|
+-------+-------+
|Alice  |Math   |
|Alice  |Science|
|Bob    |History|
|Charlie|NULL   |
|David  |NULL   |
+-------+-------+

Original DataFrame:
+-------+---------------------------+
|name   |scores                     |
+-------+---------------------------+
|Alice  |{Science -> 85, Math -> 90}|
|Bob    |{History -> 75}            |
|Charlie|{}                         |
|David  |NULL                       |
+-------+---------------------------+

DataFrame after explode_outer() on map column:
+-------+-------+-----+
|name   |course |score|
+-------+-------+-----+
|Alice  |Science|85   |
|Alice  |Math   |90   |
|Bob    |History|75   |
|Charlie|NULL   |NULL |
|David  |NULL   |NU

In [44]:
df_exploded_array.printSchema()

root
 |-- name: string (nullable = true)
 |-- subject: string (nullable = true)



**Handling Nulls and Empty Arrays/Maps:**

By default, `explode()` drops rows where the array or map column is null or empty.

If you want to keep these rows and have nulls in the exploded columns, you can use `explode_outer()`.

The explode_outer() function is used on the original DataFrame containing the array (or map) column, not on a DataFrame that has already been exploded.

You use explode_outer() in the same way you would use explode(), but it will include rows where the array or map is NULL or empty, resulting in NULL values in the new exploded column(s).

This is demonstrated in Example 3, where `explode_outer(col("subjects"))` is applied to the original `df_array` DataFrame.
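The difference between the two functions can be sketched without a Spark session. The block below is only a plain-Python model of the row-level behavior described above; the helper names `explode_model` and `explode_outer_model` are hypothetical and are not part of PySpark.

```python
# Plain-Python model of explode() vs explode_outer(); this is not Spark code.
rows = [
    ("Alice", ["Math", "Science"]),
    ("Bob", ["History"]),
    ("Charlie", []),   # empty array
    ("David", None),   # null array
]

def explode_model(rows):
    # One output row per element; empty and null arrays produce no rows.
    return [(name, item) for name, items in rows for item in (items or [])]

def explode_outer_model(rows):
    # Same as above, but empty and null arrays produce one (name, None) row.
    out = []
    for name, items in rows:
        if items:
            out.extend((name, item) for item in items)
        else:
            out.append((name, None))
    return out

inner = explode_model(rows)
outer = explode_outer_model(rows)
print(inner)
print(outer)
```

In this model Charlie and David disappear from `inner` but survive in `outer` with a `None` subject, which mirrors the NULL rows in the `explode_outer()` output shown earlier.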

## `create_map()` in PySpark

The `create_map()` function in PySpark is used to create a new map column (key-value pairs) from existing columns or literal values. It's a function available in `pyspark.sql.functions`.

This function is useful for structuring data into a map format, which can then be used for various operations, including working with `MapType` columns or preparing data for nested structures.

**Syntax**

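create_map(key1, value1, key2, value2, ...)

Arguments alternate between keys and values. Each argument is a column expression: either an existing column such as `col("subject1_name")` or a literal wrapped in `lit()`. Keys should share a single type, and so should values.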
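The alternating key/value argument order is the main thing to get right. As a rough illustration, here is a plain-Python model of how the arguments pair up for a single row; `create_map_model` is a hypothetical helper, not a PySpark function, and the even-argument check is an assumption made for this sketch.

```python
# Plain-Python model of how create_map() pairs its arguments; not Spark code.
def create_map_model(*args):
    # Arguments are read as key1, value1, key2, value2, ...
    if len(args) % 2 != 0:
        raise ValueError("expected key/value pairs (an even number of arguments)")
    keys, values = args[0::2], args[1::2]
    return dict(zip(keys, values))

# One row of a DataFrame, represented as a dict for illustration.
row = {
    "subject1_name": "Math", "subject1_score": 90,
    "subject2_name": "Science", "subject2_score": 85,
}

scores_map = create_map_model(
    row["subject1_name"], row["subject1_score"],
    row["subject2_name"], row["subject2_score"],
)
print(scores_map)
```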

In [45]:
# Example 1: create_map() from existing columns

from pyspark.sql.functions import create_map, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreateMapExample").getOrCreate()

# Sample DataFrame
data = [
    ("Alice", "Math", 90, "Science", 85),
    ("Bob", "History", 75, "Art", 88),
    ("Charlie", "Physics", 92, "Chemistry", 80)
]
columns = ["name", "subject1_name", "subject1_score", "subject2_name", "subject2_score"]
df_subjects = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_subjects.show()

# Create a map column from subject name and score pairs
df_with_map = df_subjects.withColumn("scores_map",
                                     create_map(
                                         col("subject1_name"), col("subject1_score"),
                                         col("subject2_name"), col("subject2_score")
                                     ))

print("DataFrame after creating a map column:")
df_with_map.show(truncate=False)

df_with_map.printSchema()

Original DataFrame:
+-------+-------------+--------------+-------------+--------------+
|   name|subject1_name|subject1_score|subject2_name|subject2_score|
+-------+-------------+--------------+-------------+--------------+
|  Alice|         Math|            90|      Science|            85|
|    Bob|      History|            75|          Art|            88|
|Charlie|      Physics|            92|    Chemistry|            80|
+-------+-------------+--------------+-------------+--------------+

DataFrame after creating a map column:
+-------+-------------+--------------+-------------+--------------+--------------------------------+
|name   |subject1_name|subject1_score|subject2_name|subject2_score|scores_map                      |
+-------+-------------+--------------+-------------+--------------+--------------------------------+
|Alice  |Math         |90            |Science      |85            |{Math -> 90, Science -> 85}     |
|Bob    |History      |75            |Art          |88      

In [46]:
# Example 2: create_map() from literal values

from pyspark.sql.functions import create_map, lit
from pyspark.sql import SparkSession

# Assume spark is already created

# Sample DataFrame
data = [("Alice", 25), ("Bob", 30)]
columns = ["name", "age"]
df_lit = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_lit.show()

# Add a map column with fixed literal values
df_with_literal_map = df_lit.withColumn("info", create_map(
    lit("city"), lit("New York"),
    lit("country"), lit("USA")
))

print("DataFrame after adding a map column with literal values:")
df_with_literal_map.show(truncate=False)

df_with_literal_map.printSchema()

Original DataFrame:
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+

DataFrame after adding a map column with literal values:
+-----+---+----------------------------------+
|name |age|info                              |
+-----+---+----------------------------------+
|Alice|25 |{city -> New York, country -> USA}|
|Bob  |30 |{city -> New York, country -> USA}|
+-----+---+----------------------------------+

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- info: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = false)



In [47]:
# Example 3: create_map() with mixed columns and literals

from pyspark.sql.functions import create_map, col, lit
from pyspark.sql import SparkSession

# Assume spark is already created

# Sample DataFrame
data = [("Alice", 25, "Engineer"), ("Bob", 30, "Doctor")]
columns = ["name", "age", "occupation"]
df_mixed = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_mixed.show()

# Create a map column using a mix of columns and literals
df_with_mixed_map = df_mixed.withColumn("details", create_map(
    lit("age"), col("age").cast("string"), # Cast age to string for consistency
    lit("occupation"), col("occupation"),
    lit("status"), lit("active")
))

print("DataFrame after creating a map column with mixed types:")
df_with_mixed_map.show(truncate=False)

df_with_mixed_map.printSchema()

Original DataFrame:
+-----+---+----------+
| name|age|occupation|
+-----+---+----------+
|Alice| 25|  Engineer|
|  Bob| 30|    Doctor|
+-----+---+----------+

DataFrame after creating a map column with mixed types:
+-----+---+----------+-----------------------------------------------------+
|name |age|occupation|details                                              |
+-----+---+----------+-----------------------------------------------------+
|Alice|25 |Engineer  |{age -> 25, occupation -> Engineer, status -> active}|
|Bob  |30 |Doctor    |{age -> 30, occupation -> Doctor, status -> active}  |
+-----+---+----------+-----------------------------------------------------+

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- occupation: string (nullable = true)
 |-- details: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



## `map_keys()` and `map_values()` in PySpark

`map_keys()` and `map_values()` are PySpark SQL functions used to extract the keys and values, respectively, from a MapType column in a DataFrame.

*   **`map_keys(col)`**: Returns an array containing all the keys in the MapType column. The order of keys in the array is not guaranteed.
*   **`map_values(col)`**: Returns an array containing all the values in the MapType column. The order of values in the array corresponds to the order of keys returned by `map_keys()`.
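To make the key/value correspondence concrete, the following plain-Python sketch models both functions on a single map value. The helpers `map_keys_model` and `map_values_model` are hypothetical names used only for illustration; a Python `dict` stands in for a `MapType` value and `None` stands in for NULL.

```python
# Plain-Python model of map_keys() / map_values(); not Spark code.
def map_keys_model(m):
    # A NULL map stays NULL; otherwise return the keys as a list.
    return None if m is None else list(m.keys())

def map_values_model(m):
    # Values come back in the same order as the keys above.
    return None if m is None else list(m.values())

scores = {"Science": 85, "Math": 90}
keys = map_keys_model(scores)
values = map_values_model(scores)

print(keys, values)
# Zipping the two lists back together rebuilds the original map.
print(dict(zip(keys, values)) == scores)
print(map_keys_model({}), map_keys_model(None))
```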

**Syntax**

map_keys(col)

map_values(col)

In [48]:
from pyspark.sql.functions import map_keys, map_values, col
from pyspark.sql import SparkSession
from pyspark.sql.types import MapType, StringType, IntegerType, StructType, StructField

spark = SparkSession.builder.appName("MapKeysValuesExample").getOrCreate()

# Sample DataFrame with a MapType column (using the df_map from a previous example)
data = [
    ("Alice", {"Math": 90, "Science": 85}),
    ("Bob", {"History": 75}),
    ("Charlie", {}), # Empty map
    ("David", None) # Null map
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("scores", MapType(StringType(), IntegerType()), True)
])

df_map = spark.createDataFrame(data, schema)

print("Original DataFrame:")
df_map.show(truncate=False)
df_map.printSchema()

# Example 1: Using map_keys()
df_keys = df_map.select(col("name"), map_keys(col("scores")).alias("score_keys"))

print("DataFrame with score keys:")
df_keys.show(truncate=False)
df_keys.printSchema()

# Example 2: Using map_values()
df_values = df_map.select(col("name"), map_values(col("scores")).alias("score_values"))

print("DataFrame with score values:")
df_values.show(truncate=False)
df_values.printSchema()

Original DataFrame:
+-------+---------------------------+
|name   |scores                     |
+-------+---------------------------+
|Alice  |{Science -> 85, Math -> 90}|
|Bob    |{History -> 75}            |
|Charlie|{}                         |
|David  |NULL                       |
+-------+---------------------------+

root
 |-- name: string (nullable = true)
 |-- scores: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

DataFrame with score keys:
+-------+---------------+
|name   |score_keys     |
+-------+---------------+
|Alice  |[Science, Math]|
|Bob    |[History]      |
|Charlie|[]             |
|David  |NULL           |
+-------+---------------+

root
 |-- name: string (nullable = true)
 |-- score_keys: array (nullable = true)
 |    |-- element: string (containsNull = true)

DataFrame with score values:
+-------+------------+
|name   |score_values|
+-------+------------+
|Alice  |[85, 90]    |
|Bob    |[75]        |
|Charlie|[]  

In [49]:
from pyspark.sql.functions import map_keys, col

# Assume df_with_literal_map is already created (from the create_map examples)

# Use map_keys() on the 'info' column
df_literal_map_keys = df_with_literal_map.select(col("name"), map_keys(col("info")).alias("info_keys"))

print("DataFrame with keys from the literal map:")
df_literal_map_keys.show(truncate=False)
df_literal_map_keys.printSchema()

DataFrame with keys from the literal map:
+-----+---------------+
|name |info_keys      |
+-----+---------------+
|Alice|[city, country]|
|Bob  |[city, country]|
+-----+---------------+

root
 |-- name: string (nullable = true)
 |-- info_keys: array (nullable = false)
 |    |-- element: string (containsNull = true)



## `collect_list()` and `collect_set()` in PySpark

`collect_list()` and `collect_set()` are aggregation functions in PySpark that are used to gather elements from a column into a list or a set, respectively, within each group. They are often used after a `groupBy()` operation.

*   **`collect_list(col)`**: Aggregates the values of the specified column into an array (an `ArrayType` column). Duplicate values are kept, and the order of elements is not guaranteed.
*   **`collect_set(col)`**: Aggregates the values of the specified column into an array of distinct values. Duplicates are removed, and the order of elements is not guaranteed. Despite the name, the result is an `ArrayType` column, not a Python `set`.

Both functions skip null values.
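As a rough mental model, grouping and collecting can be sketched in plain Python. The helpers `collect_list_model` and `collect_set_model` below are hypothetical and only imitate the per-group behavior; unlike Spark, this sketch happens to preserve input order.

```python
# Plain-Python model of groupBy().agg(collect_list / collect_set); not Spark code.
from collections import defaultdict

data = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 2), ("A", 1)]

def collect_list_model(rows):
    groups = defaultdict(list)
    for category, value in rows:
        if value is not None:        # nulls are skipped
            groups[category].append(value)
    return dict(groups)

def collect_set_model(rows):
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return {k: list(dict.fromkeys(v)) for k, v in collect_list_model(rows).items()}

lists = collect_list_model(data)
sets = collect_set_model(data)
print(lists)
print(sets)
```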

**Syntax**

collect_list(col)

collect_set(col)

In [50]:
# Example 1: Basic Usage with groupBy()

from pyspark.sql.functions import collect_list, collect_set, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CollectListSetExample").getOrCreate()

# Sample DataFrame
data = [
    ("A", 1),
    ("B", 2),
    ("A", 3),
    ("C", 4),
    ("B", 2),
    ("A", 1)
]
columns = ["category", "value"]
df_agg = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_agg.show()

# Group by 'category' and collect values into a list
df_list = df_agg.groupBy("category").agg(collect_list("value").alias("list_of_values"))

print("DataFrame after groupBy() and collect_list():")
df_list.show()

# Group by 'category' and collect unique values into a set
df_set = df_agg.groupBy("category").agg(collect_set("value").alias("set_of_values"))

print("DataFrame after groupBy() and collect_set():")
df_set.show()

Original DataFrame:
+--------+-----+
|category|value|
+--------+-----+
|       A|    1|
|       B|    2|
|       A|    3|
|       C|    4|
|       B|    2|
|       A|    1|
+--------+-----+

DataFrame after groupBy() and collect_list():
+--------+--------------+
|category|list_of_values|
+--------+--------------+
|       B|        [2, 2]|
|       A|     [1, 3, 1]|
|       C|           [4]|
+--------+--------------+

DataFrame after groupBy() and collect_set():
+--------+-------------+
|category|set_of_values|
+--------+-------------+
|       B|          [2]|
|       A|       [1, 3]|
|       C|          [4]|
+--------+-------------+



In [51]:
# Example 2: Using collect_list() and collect_set() without groupBy()

# When used without groupBy(), these functions will collect all values from the entire DataFrame into a single list or set.
df_all_list = df_agg.agg(collect_list("value").alias("all_values_list"))
print("DataFrame after collect_list() on entire DataFrame:")
df_all_list.show(truncate=False)

df_all_set = df_agg.agg(collect_set("value").alias("all_values_set"))
print("DataFrame after collect_set() on entire DataFrame:")
df_all_set.show(truncate=False)

DataFrame after collect_list() on entire DataFrame:
+------------------+
|all_values_list   |
+------------------+
|[1, 2, 3, 4, 2, 1]|
+------------------+

DataFrame after collect_set() on entire DataFrame:
+--------------+
|all_values_set|
+--------------+
|[1, 2, 3, 4]  |
+--------------+



In [52]:
# Example 3: Collecting multiple columns or complex types

from pyspark.sql.functions import struct

# Collect 'category' and 'value' as structs into a list
df_struct_list = df_agg.groupBy("category").agg(collect_list(struct("category", "value")).alias("list_of_structs"))
print("DataFrame after collecting structs:")
df_struct_list.show(truncate=False)

DataFrame after collecting structs:
+--------+------------------------+
|category|list_of_structs         |
+--------+------------------------+
|B       |[{B, 2}, {B, 2}]        |
|A       |[{A, 1}, {A, 3}, {A, 1}]|
|C       |[{C, 4}]                |
+--------+------------------------+



In [53]:
# Sample() and sampleBy() in PySpark

## `sample()` in PySpark

`sample()` is used for simple random sampling. It allows you to randomly select a fraction of rows from your DataFrame.

You can perform sampling with or without replacement.
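Sampling without replacement can be pictured as an independent keep-or-drop decision per row, which is why the sample size is only approximately `fraction * total_rows`. The sketch below is a plain-Python model under that assumption; `sample_model` is a hypothetical helper and does not reproduce Spark's actual random number generation, so the rows it picks will differ from Spark's for the same seed.

```python
# Plain-Python model of sampling without replacement; not Spark code.
import random

def sample_model(rows, fraction, seed=None):
    rng = random.Random(seed)
    # Each row is kept independently with probability `fraction`.
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1, 21))
picked = sample_model(rows, fraction=0.3, seed=123)

# The count is close to 0.3 * 20 = 6 on average, but not guaranteed to equal it.
print(len(picked), picked)
```

Reusing the same seed reproduces the same sample, which is the reason the examples in this notebook pass `seed`.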

**Syntax**

DataFrame.sample(withReplacement=None, fraction=None, seed=None)

In [54]:
# Example: sample()

# Assume df is already created from the previous examples (e.g., the employee dataframe)

# Simple random sampling with replacement (sample 30% of data)
sampled_df_with_replacement = df.sample(withReplacement=True, fraction=0.3, seed=123)

print("Sampled DataFrame with Replacement:")
sampled_df_with_replacement.show()

# Simple random sampling without replacement (sample 30% of data)
sampled_df_without_replacement = df.sample(withReplacement=False, fraction=0.3, seed=123)

print("Sampled DataFrame without Replacement:")
sampled_df_without_replacement.show()

# Note: The exact number of rows in the sampled DataFrame might vary slightly
# from fraction * total_rows due to the probabilistic nature of sampling.

Sampled DataFrame with Replacement:
+-----------+-------------+-----------+---+------+------+----------+
|employee_id|department_id|       name|age|gender|salary| hire_date|
+-----------+-------------+-----------+---+------+------+----------+
|          1|          101|   John Doe| 30|  Male| 50000|2015-01-01|
|          6|          103|  Jill Wong| 32|Female| 52000|2018-07-01|
|         10|          104|   Lisa Lee| 27|Female| 47000|2018-08-01|
|         12|          105| Susan Chen| 31|Female| 54000|2017-02-15|
|         15|          106|Michael Lee| 37|  Male| 63000|2014-09-30|
|         15|          106|Michael Lee| 37|  Male| 63000|2014-09-30|
|         17|          105|George Wang| 34|  Male| 57000|2016-03-15|
|         18|          104|  Nancy Liu| 29|Female| 50000|2017-06-01|
+-----------+-------------+-----------+---+------+------+----------+

Sampled DataFrame without Replacement:
+-----------+-------------+---------+---+------+------+----------+
|employee_id|department_id|  

## `sampleBy()` in PySpark

`sampleBy()` allows you to perform stratified sampling. This means you can sample different fractions of data from different categories (strata) within a column.

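It's useful when you have an imbalanced dataset and want to control how heavily each category is sampled. Each fraction is the probability of keeping a row from that stratum, so the sample sizes are approximate. A stratum that is missing from the fractions dictionary is treated as having a fraction of zero and is excluded from the result.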
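A plain-Python sketch of stratified sampling may help here. It assumes each row is kept independently with the fraction assigned to its stratum, and that strata without an entry get a fraction of zero. `sample_by_model` and the tiny `employees` list are hypothetical and exist only for this illustration.

```python
# Plain-Python model of stratified sampling; not Spark code.
import random

def sample_by_model(rows, key, fractions, seed=None):
    rng = random.Random(seed)
    # Look up the fraction for each row's stratum; unknown strata get 0.0.
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]

employees = [
    {"id": i, "gender": "Male" if i % 2 else "Female"} for i in range(1, 21)
]

# Keep about half of the Male rows and every Female row.
sampled = sample_by_model(employees, "gender", {"Male": 0.5, "Female": 1.0}, seed=42)

# Leaving "Male" out of the dictionary drops every Male row.
female_only = sample_by_model(employees, "gender", {"Female": 1.0}, seed=42)

print(len(sampled), len(female_only))
```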

**Syntax**

DataFrame.sampleBy(col, fractions, seed=None)

In [55]:
# Example: sampleBy()

# Assume df is already created from the previous examples (e.g., the employee dataframe)

# Define fractions for stratified sampling by 'gender'
# Sample 50% of 'Male' and 100% of 'Female'
gender_fractions = {"Male": 0.5, "Female": 1.0}

# Perform stratified sampling
sampled_df_by_gender = df.sampleBy("gender", gender_fractions, seed=42)

print("Sampled DataFrame by Gender:")
sampled_df_by_gender.show()

# Example: sampleBy() by 'department_id'
# Sample 80% from department 101, 50% from 102, and 100% from 103
# Departments missing from the dictionary are treated as fraction 0 and dropped
dept_fractions = {101: 0.8, 102: 0.5, 103: 1.0}

# Perform stratified sampling
sampled_df_by_dept = df.sampleBy("department_id", dept_fractions, seed=42)

print("Sampled DataFrame by Department ID:")
sampled_df_by_dept.show()

Sampled DataFrame by Gender:
+-----------+-------------+-----------+---+------+------+----------+
|employee_id|department_id|       name|age|gender|salary| hire_date|
+-----------+-------------+-----------+---+------+------+----------+
|          2|          101| Jane Smith| 25|Female| 45000|2016-02-15|
|          4|          102|  Alice Lee| 28|Female| 48000|2017-09-30|
|          6|          103|  Jill Wong| 32|Female| 52000|2018-07-01|
|          8|          102|   Kate Kim| 29|Female| 51000|2019-10-01|
|         10|          104|   Lisa Lee| 27|Female| 47000|2018-08-01|
|         11|          104| David Park| 38|  Male| 65000|2015-11-01|
|         12|          105| Susan Chen| 31|Female| 54000|2017-02-15|
|         13|          106|  Brian Kim| 45|  Male| 75000|2011-07-01|
|         14|          107|  Emily Lee| 26|Female| 46000|2019-01-01|
|         16|          107|Kelly Zhang| 30|Female| 49000|2018-04-01|
|         17|          105|George Wang| 34|  Male| 57000|2016-03-15|
|    

Here are some additional examples to further illustrate `sample()` and `sampleBy()`.

In [56]:
# More Examples for sample()

# Sample 50% of data with replacement
sampled_df_with_replacement_50 = df.sample(withReplacement=True, fraction=0.5, seed=456)
print("Sampled DataFrame with Replacement (50%):")
sampled_df_with_replacement_50.show()

# Sample 20% of data without replacement
sampled_df_without_replacement_20 = df.sample(withReplacement=False, fraction=0.2, seed=789)
print("Sampled DataFrame without Replacement (20%):")
sampled_df_without_replacement_20.show()

# Sample 100% of data without replacement (should return the original DataFrame approximately)
sampled_df_without_replacement_100 = df.sample(withReplacement=False, fraction=1.0, seed=1011)
print("Sampled DataFrame without Replacement (100%):")
sampled_df_without_replacement_100.show()

Sampled DataFrame with Replacement (50%):
+-----------+-------------+-----------+---+------+------+----------+
|employee_id|department_id|       name|age|gender|salary| hire_date|
+-----------+-------------+-----------+---+------+------+----------+
|          2|          101| Jane Smith| 25|Female| 45000|2016-02-15|
|          2|          101| Jane Smith| 25|Female| 45000|2016-02-15|
|          3|          102|  Bob Brown| 35|  Male| 55000|2014-05-01|
|          6|          103|  Jill Wong| 32|Female| 52000|2018-07-01|
|          6|          103|  Jill Wong| 32|Female| 52000|2018-07-01|
|          9|          103|    Tom Tan| 33|  Male| 58000|2016-06-01|
|         12|          105| Susan Chen| 31|Female| 54000|2017-02-15|
|         14|          107|  Emily Lee| 26|Female| 46000|2019-01-01|
|         16|          107|Kelly Zhang| 30|Female| 49000|2018-04-01|
|         16|          107|Kelly Zhang| 30|Female| 49000|2018-04-01|
|         16|          107|Kelly Zhang| 30|Female| 49000|2018

In [57]:
# More Examples for sampleBy()

# Sample different fractions based on 'age' groups
# For simplicity, let's create age groups
from pyspark.sql.functions import when, col

df_with_age_group = df.withColumn("age_group",
    when(col("age") < 30, "young")
    .when((col("age") >= 30) & (col("age") < 40), "middle_aged")
    .otherwise("senior")
)

print("DataFrame with Age Group:")
df_with_age_group.show()

# Define fractions for sampling by 'age_group'
age_group_fractions = {"young": 0.7, "middle_aged": 0.4, "senior": 1.0}

# Perform stratified sampling by 'age_group'
sampled_df_by_age_group = df_with_age_group.sampleBy("age_group", age_group_fractions, seed=1213)

print("Sampled DataFrame by Age Group:")
sampled_df_by_age_group.show()

# Another example: sample by 'gender' with different seeds
gender_fractions_2 = {"Male": 0.6, "Female": 0.9}
sampled_df_by_gender_2 = df.sampleBy("gender", gender_fractions_2, seed=1415)

print("Sampled DataFrame by Gender (different seed):")
sampled_df_by_gender_2.show()

DataFrame with Age Group:
+-----------+-------------+-------------+---+------+------+----------+-----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|  age_group|
+-----------+-------------+-------------+---+------+------+----------+-----------+
|          1|          101|     John Doe| 30|  Male| 50000|2015-01-01|middle_aged|
|          2|          101|   Jane Smith| 25|Female| 45000|2016-02-15|      young|
|          3|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|middle_aged|
|          4|          102|    Alice Lee| 28|Female| 48000|2017-09-30|      young|
|          5|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|     senior|
|          6|          103|    Jill Wong| 32|Female| 52000|2018-07-01|middle_aged|
|          7|          101|James Johnson| 42|  Male| 70000|2012-03-15|     senior|
|          8|          102|     Kate Kim| 29|Female| 51000|2019-10-01|      young|
|          9|          103|      Tom Tan| 33|  Male| 58000|20

In [58]:
# pivot() in PySpark

## `split()` in PySpark

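The `split()` function in PySpark splits a string column into an array of strings around matches of a specified pattern. The pattern is interpreted as a regular expression, so characters such as `.` or `|` must be escaped to be matched literally. An optional `limit` argument caps the number of elements in the resulting array. It's a function available in `pyspark.sql.functions`.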
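The pattern and limit behavior can be approximated with Python's `re.split`. The sketch below is only a model: `split_model` is a hypothetical helper, Python and Java regular expressions differ in some details, and the mapping from Spark's `limit` to Python's `maxsplit` is an assumption made for this illustration.

```python
# Plain-Python model of split(str, pattern, limit); not Spark code.
import re

def split_model(text, pattern, limit=-1):
    if limit == 1:
        # A limit of 1 means "at most one element": no splitting at all.
        return [text]
    # limit > 1 allows limit - 1 splits; limit <= 0 means split everywhere.
    maxsplit = limit - 1 if limit > 1 else 0
    return re.split(pattern, text, maxsplit=maxsplit)

print(split_model("apple,banana;kiwi", "[,;]"))   # character class matches either delimiter
print(split_model("a_b_c_d_e", "_", 2))           # last element keeps the remainder
print(split_model("a_b_c_d_e", "_", 0))           # behaves like the default
print(split_model("a.b", r"\."))                  # a literal dot must be escaped
```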

**Syntax**

split(str, pattern, limit=-1)

In [59]:
# Example 1: Basic split()

from pyspark.sql.functions import split, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SplitExample").getOrCreate()

# Sample DataFrame
data = [("apple,banana,orange",), ("grape;kiwi",), ("mango",)]
columns = ["fruits"]
df_fruits = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_fruits.show(truncate=False)

# Split the 'fruits' column by comma
df_split_comma = df_fruits.withColumn("fruit_list_comma", split(col("fruits"), ","))

print("DataFrame after splitting by comma:")
df_split_comma.show(truncate=False)

# Split the 'fruits' column by semicolon
df_split_semicolon = df_fruits.withColumn("fruit_list_semicolon", split(col("fruits"), ";"))

print("DataFrame after splitting by semicolon:")
df_split_semicolon.show(truncate=False)

# Split by both comma and semicolon using regex
df_split_regex = df_fruits.withColumn("fruit_list_regex", split(col("fruits"), "[,;]"))

print("DataFrame after splitting by comma or semicolon (regex):")
df_split_regex.show(truncate=False)

Original DataFrame:
+-------------------+
|fruits             |
+-------------------+
|apple,banana,orange|
|grape;kiwi         |
|mango              |
+-------------------+

DataFrame after splitting by comma:
+-------------------+-----------------------+
|fruits             |fruit_list_comma       |
+-------------------+-----------------------+
|apple,banana,orange|[apple, banana, orange]|
|grape;kiwi         |[grape;kiwi]           |
|mango              |[mango]                |
+-------------------+-----------------------+

DataFrame after splitting by semicolon:
+-------------------+---------------------+
|fruits             |fruit_list_semicolon |
+-------------------+---------------------+
|apple,banana,orange|[apple,banana,orange]|
|grape;kiwi         |[grape, kiwi]        |
|mango              |[mango]              |
+-------------------+---------------------+

DataFrame after splitting by comma or semicolon (regex):
+-------------------+-----------------------+
|fruits       

In [60]:
# Example 2: Using the limit parameter

from pyspark.sql.functions import split, col
from pyspark.sql import SparkSession

# Sample DataFrame
data = [("a_b_c_d_e",), ("x_y",), ("z",)]
columns = ["text"]
df_limit = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_limit.show()

# Split with limit = 2
df_split_limit_2 = df_limit.withColumn("split_limit_2", split(col("text"), "_", 2))

print("DataFrame after splitting with limit = 2:")
df_split_limit_2.show(truncate=False)

# Split with limit = 0 (same as -1)
df_split_limit_0 = df_limit.withColumn("split_limit_0", split(col("text"), "_", 0))

print("DataFrame after splitting with limit = 0:")
df_split_limit_0.show(truncate=False)

# Split with limit = -1 (default)
df_split_limit_neg1 = df_limit.withColumn("split_limit_neg1", split(col("text"), "_", -1))

print("DataFrame after splitting with limit = -1:")
df_split_limit_neg1.show(truncate=False)

Original DataFrame:
+---------+
|     text|
+---------+
|a_b_c_d_e|
|      x_y|
|        z|
+---------+

DataFrame after splitting with limit = 2:
+---------+-------------+
|text     |split_limit_2|
+---------+-------------+
|a_b_c_d_e|[a, b_c_d_e] |
|x_y      |[x, y]       |
|z        |[z]          |
+---------+-------------+

DataFrame after splitting with limit = 0:
+---------+---------------+
|text     |split_limit_0  |
+---------+---------------+
|a_b_c_d_e|[a, b, c, d, e]|
|x_y      |[x, y]         |
|z        |[z]            |
+---------+---------------+

DataFrame after splitting with limit = -1:
+---------+----------------+
|text     |split_limit_neg1|
+---------+----------------+
|a_b_c_d_e|[a, b, c, d, e] |
|x_y      |[x, y]          |
|z        |[z]             |
+---------+----------------+



## `concat_ws()` in PySpark

The `concat_ws()` function (concatenate with separator) is used to concatenate multiple string columns together into a single string column, with a specified separator placed between each concatenated value. It's a function available in `pyspark.sql.functions`.

This function is useful for combining information from different columns into a more readable format or preparing data for output.

**Syntax**

concat_ws(sep, *cols)

In [61]:
# Example 1: Basic concat_ws()

from pyspark.sql.functions import concat_ws, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ConcatWSExample").getOrCreate()

# Sample DataFrame
data = [
    ("John", "Doe", "USA"),
    ("Jane", "Smith", "Canada"),
    ("Peter", "Jones", "UK"),
    (None, "Brown", "Germany"), # Example with a null value
    ("Alice", None, "France")  # Example with a null value
]
columns = ["first_name", "last_name", "country"]
df_names = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_names.show()

# Concatenate first_name and last_name with a space
df_full_name = df_names.withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))

print("DataFrame with full_name:")
df_full_name.show()

# Concatenate first_name, last_name, and country with a comma and space
df_full_info = df_names.withColumn("full_info", concat_ws(", ", col("first_name"), col("last_name"), col("country")))

print("DataFrame with full_info:")
df_full_info.show()

Original DataFrame:
+----------+---------+-------+
|first_name|last_name|country|
+----------+---------+-------+
|      John|      Doe|    USA|
|      Jane|    Smith| Canada|
|     Peter|    Jones|     UK|
|      NULL|    Brown|Germany|
|     Alice|     NULL| France|
+----------+---------+-------+

DataFrame with full_name:
+----------+---------+-------+-----------+
|first_name|last_name|country|  full_name|
+----------+---------+-------+-----------+
|      John|      Doe|    USA|   John Doe|
|      Jane|    Smith| Canada| Jane Smith|
|     Peter|    Jones|     UK|Peter Jones|
|      NULL|    Brown|Germany|      Brown|
|     Alice|     NULL| France|      Alice|
+----------+---------+-------+-----------+

DataFrame with full_info:
+----------+---------+-------+-------------------+
|first_name|last_name|country|          full_info|
+----------+---------+-------+-------------------+
|      John|      Doe|    USA|     John, Doe, USA|
|      Jane|    Smith| Canada|Jane, Smith, Canada|
|    

**Handling NULLs:**

`concat_ws()` gracefully handles NULL values. If a column value is NULL, it is simply skipped, and the separator is not added for that specific value.
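The NULL-skipping rule can be modeled in a few lines of plain Python. `concat_ws_model` is a hypothetical helper, with `None` standing in for NULL and a Python list standing in for an array column; it is a sketch of the behavior described here, not Spark's implementation.

```python
# Plain-Python model of concat_ws() NULL handling; not Spark code.
def concat_ws_model(sep, *values):
    parts = []
    for value in values:
        if value is None:
            continue                      # NULL inputs are skipped entirely
        if isinstance(value, (list, tuple)):
            # Array inputs contribute their non-null elements.
            parts.extend(v for v in value if v is not None)
        else:
            parts.append(value)
    return sep.join(parts)

print(concat_ws_model(" ", None, "Brown"))
print(concat_ws_model(", ", None, "Brown", "Germany"))
print(concat_ws_model("-", ["red", "green"]))
print(repr(concat_ws_model("-", None)))   # empty string, not None
```

Because the result is an empty string rather than NULL when every input is NULL, a nested call such as the outer `concat_ws(":", ...)` in Example 2 still emits its separator.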

In [62]:
# Example 2: concat_ws() with array column

from pyspark.sql.functions import concat_ws, col
from pyspark.sql import SparkSession

# Sample DataFrame with an array column
data = [
    ("apple", ["red", "green"]),
    ("banana", ["yellow"]),
    ("orange", ["orange", "sweet", "citrus"]),
    ("grape", []), # Empty array
    ("kiwi", None) # Null array
]
columns = ["fruit", "properties"]
df_fruits_props = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_fruits_props.show(truncate=False)

# Concatenate elements of the 'properties' array with a hyphen
df_props_string = df_fruits_props.withColumn("properties_string", concat_ws("-", col("properties")))

print("DataFrame with properties_string (concatenated array):")
df_props_string.show(truncate=False)

# Concatenate fruit name and properties array elements
df_combined = df_fruits_props.withColumn("fruit_and_props", concat_ws(":", col("fruit"), concat_ws(",", col("properties"))))

print("DataFrame with fruit_and_props:")
df_combined.show(truncate=False)

Original DataFrame:
+------+-----------------------+
|fruit |properties             |
+------+-----------------------+
|apple |[red, green]           |
|banana|[yellow]               |
|orange|[orange, sweet, citrus]|
|grape |[]                     |
|kiwi  |NULL                   |
+------+-----------------------+

DataFrame with properties_string (concatenated array):
+------+-----------------------+-------------------+
|fruit |properties             |properties_string  |
+------+-----------------------+-------------------+
|apple |[red, green]           |red-green          |
|banana|[yellow]               |yellow             |
|orange|[orange, sweet, citrus]|orange-sweet-citrus|
|grape |[]                     |                   |
|kiwi  |NULL                   |                   |
+------+-----------------------+-------------------+

DataFrame with fruit_and_props:
+------+-----------------------+--------------------------+
|fruit |properties             |fruit_and_props          

**Example 1: Basic concat_ws()**

This example demonstrates the basic usage of `concat_ws()` to combine string columns with a specified separator.

*   Original DataFrame: the initial data has `first_name`, `last_name`, and `country` columns, and some rows contain NULL values.
*   Concatenate first_name and last_name with a space: `df_names.withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))`
*   This line adds a new column named `full_name`. The first argument, `" "`, is the separator (a space); `col("first_name")` and `col("last_name")` are the columns to concatenate.
*   In the output, the row with NULL in `first_name` shows only the last name ("Brown"), and the row with NULL in `last_name` shows only the first name ("Alice"). `concat_ws()` skips NULL values and adds no separator for them.
*   Concatenate first_name, last_name, and country with a comma and space: `df_names.withColumn("full_info", concat_ws(", ", col("first_name"), col("last_name"), col("country")))`
*   This line adds another new column named `full_info`, using `", "` as the separator between the three columns.
*   NULL values are handled the same way: for the row with a NULL first name, the output is "Brown, Germany", with neither the NULL value nor a separator for it.

**Example 2: concat_ws() with array column**

This example shows how `concat_ws()` can be used with an array column.

*   Original DataFrame: this DataFrame has a `fruit` column and a `properties` column, which is an array of strings. It includes rows with multiple elements, a single element, an empty array, and a NULL array.
*   Concatenate elements of the 'properties' array with a hyphen: `df_fruits_props.withColumn("properties_string", concat_ws("-", col("properties")))`
*   This adds a new column `properties_string` whose value is the elements of the `properties` array joined with a hyphen `-`.
*   In the output, the empty array results in an empty string. The NULL array also results in an empty string, because `concat_ws()` skips NULL inputs instead of returning NULL.
*   Concatenate fruit name and properties array elements: `df_combined = df_fruits_props.withColumn("fruit_and_props", concat_ws(":", col("fruit"), concat_ws(",", col("properties"))))`
*   This is a more complex example in which `concat_ws()` is nested. The inner `concat_ws(",", col("properties"))` joins the array elements with a comma, and the outer `concat_ws(":", ...)` joins the `fruit` column and the inner result with a colon.
*   The output column `fruit_and_props` shows the fruit name, a colon, and then the properties joined by commas. Rows with empty or NULL arrays for `properties` still have the fruit name and the colon, followed by nothing, because the inner call returns an empty string rather than NULL.

In summary, `concat_ws()` combines string columns or the elements of a string array with a separator of your choice, and it skips NULL values instead of propagating them.
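
The NULL-skipping rule described above can be checked without a Spark session. The following is a plain-Python sketch of that rule, not PySpark code: `concat_ws_like` is a hypothetical helper written only for this illustration, NULL is represented by `None`, an array column by a Python list, and the sample values ("kiwi", "red", "sweet") are made up.

```python
def concat_ws_like(sep, *values):
    """Plain-Python model of concat_ws(): join inputs with sep, skipping None."""
    parts = []
    for value in values:
        if value is None:
            # NULL inputs are skipped, and no separator is emitted for them
            continue
        if isinstance(value, list):
            # Array inputs contribute their non-NULL elements one by one
            parts.extend(item for item in value if item is not None)
        else:
            parts.append(value)
    return sep.join(parts)


print(concat_ws_like(" ", None, "Brown"))
print(concat_ws_like(", ", None, "Brown", "Germany"))
print(concat_ws_like("-", ["red", "sweet"]))
# Nested call: the inner result is an empty string, not None, so the colon stays
print(concat_ws_like(":", "kiwi", concat_ws_like(",", None)))
```

The last call shows why the nested example keeps the trailing colon: an empty string is a real value, so the outer join still emits the separator before it.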




## `translate()` in PySpark

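The `translate()` function in PySpark replaces individual characters in a string column according to a character-by-character mapping: each character of the `matching` string is replaced by the character at the same position in the `replace` string. It does not search for `matching` as a whole substring.

**Syntax**

`translate(srcCol, matching, replace)`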

In [63]:
# Example 1: Basic translate()

from pyspark.sql.functions import translate, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TranslateExample").getOrCreate()

# Sample DataFrame
data = [
    ("abcdefg",),
    ("12345",),
    ("hello world",),
    ("PySpark",),
    (None,) # Example with a null value
]
columns = ["text"]
df_text = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_text.show()

# Replace 'abc' with 'xyz'
# 'a' is replaced by 'x', 'b' by 'y', 'c' by 'z'
df_translated_basic = df_text.withColumn("translated_text", translate(col("text"), "abc", "xyz"))

print("DataFrame after translate(col('text'), 'abc', 'xyz'):")
df_translated_basic.show()

# Replace digits with asterisks
df_translated_digits = df_text.withColumn("translated_digits", translate(col("text"), "0123456789", "**********"))

print("DataFrame after translate(col('text'), '0123456789', '**********'):")
df_translated_digits.show()

Original DataFrame:
+-----------+
|       text|
+-----------+
|    abcdefg|
|      12345|
|hello world|
|    PySpark|
|       NULL|
+-----------+

DataFrame after translate(col('text'), 'abc', 'xyz'):
+-----------+---------------+
|       text|translated_text|
+-----------+---------------+
|    abcdefg|        xyzdefg|
|      12345|          12345|
|hello world|    hello world|
|    PySpark|        PySpxrk|
|       NULL|           NULL|
+-----------+---------------+

DataFrame after translate(col('text'), '0123456789', '**********'):
+-----------+-----------------+
|       text|translated_digits|
+-----------+-----------------+
|    abcdefg|          abcdefg|
|      12345|            *****|
|hello world|      hello world|
|    PySpark|          PySpark|
|       NULL|             NULL|
+-----------+-----------------+



In [64]:
# Example 2: Unequal lengths of 'from' and 'to' characters

from pyspark.sql.functions import translate, col

# Assume df_text is already created from Example 1

print("Original DataFrame:")
df_text.show()

# Replace 'aeiou' with '123'
# 'a' -> '1', 'e' -> '2', 'i' -> '3'. 'o' and 'u' are removed.
df_translated_unequal = df_text.withColumn("translated_unequal", translate(col("text"), "aeiou", "123"))

print("DataFrame after translate(col('text'), 'aeiou', '123'):")
df_translated_unequal.show()

# Replace 'xyz' with '12345'
# 'x' -> '1', 'y' -> '2', 'z' -> '3'. The extra '4' and '5' in 'to' have no counterpart in 'from' and are ignored.
df_translated_unequal_2 = df_text.withColumn("translated_unequal_2", translate(col("text"), "xyz", "12345"))

print("DataFrame after translate(col('text'), 'xyz', '12345'):")
df_translated_unequal_2.show()

Original DataFrame:
+-----------+
|       text|
+-----------+
|    abcdefg|
|      12345|
|hello world|
|    PySpark|
|       NULL|
+-----------+

DataFrame after translate(col('text'), 'aeiou', '123'):
+-----------+------------------+
|       text|translated_unequal|
+-----------+------------------+
|    abcdefg|           1bcd2fg|
|      12345|             12345|
|hello world|         h2ll wrld|
|    PySpark|           PySp1rk|
|       NULL|              NULL|
+-----------+------------------+

DataFrame after translate(col('text'), 'xyz', '12345'):
+-----------+--------------------+
|       text|translated_unequal_2|
+-----------+--------------------+
|    abcdefg|             abcdefg|
|      12345|               12345|
|hello world|         hello world|
|    PySpark|             P2Spark|
|       NULL|                NULL|
+-----------+--------------------+



In [65]:
# Example 3: Removing characters

from pyspark.sql.functions import translate, col

# Assume df_text is already created from Example 1

print("Original DataFrame:")
df_text.show()

# Remove all vowels
# 'from' contains vowels, 'to' is an empty string
df_translated_remove_vowels = df_text.withColumn("no_vowels", translate(col("text"), "aeiouAEIOU", ""))

print("DataFrame after removing vowels:")
df_translated_remove_vowels.show()

# Remove spaces and commas
df_translated_remove_chars = df_text.withColumn("no_spaces_commas", translate(col("text"), " ,", ""))

print("DataFrame after removing spaces and commas:")
df_translated_remove_chars.show()

Original DataFrame:
+-----------+
|       text|
+-----------+
|    abcdefg|
|      12345|
|hello world|
|    PySpark|
|       NULL|
+-----------+

DataFrame after removing vowels:
+-----------+---------+
|       text|no_vowels|
+-----------+---------+
|    abcdefg|    bcdfg|
|      12345|    12345|
|hello world| hll wrld|
|    PySpark|   PySprk|
|       NULL|     NULL|
+-----------+---------+

DataFrame after removing spaces and commas:
+-----------+----------------+
|       text|no_spaces_commas|
+-----------+----------------+
|    abcdefg|         abcdefg|
|      12345|           12345|
|hello world|      helloworld|
|    PySpark|         PySpark|
|       NULL|            NULL|
+-----------+----------------+



When the `from` and `to` strings passed to `translate()` have different lengths, the translation is still done character by character, based on position:

*   If the `from` string is longer than the `to` string, the characters in `from` that have no counterpart at the same position in `to` are removed from the input string.
*   If the `to` string is longer than the `from` string, the extra characters in `to` are ignored.

Example 2 shows both cases. When translating 'aeiou' to '123', 'a' becomes '1', 'e' becomes '2', and 'i' becomes '3', but 'o' and 'u' are removed because '123' has no 4th or 5th character.
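
The positional rule above can be sketched in plain Python. This is a model for building intuition, not PySpark code: `translate_like` is a hypothetical helper, and its handling of a character repeated in `matching` (the first occurrence wins) is an assumption.

```python
def translate_like(text, matching, replace):
    """Plain-Python model of translate(): positional, character-by-character mapping."""
    if text is None:
        return None  # NULL in, NULL out
    mapping = {}
    for i, ch in enumerate(matching):
        if ch in mapping:
            continue  # assumption: the first occurrence of a repeated character wins
        # Characters with no counterpart in `replace` map to None, meaning "delete"
        mapping[ch] = replace[i] if i < len(replace) else None
    out = []
    for ch in text:
        if ch not in mapping:
            out.append(ch)           # unmapped characters pass through unchanged
        elif mapping[ch] is not None:
            out.append(mapping[ch])  # mapped characters are replaced
        # characters mapped to None are dropped
    return "".join(out)


print(translate_like("hello world", "aeiou", "123"))
print(translate_like("PySpark", "xyz", "12345"))
print(translate_like("hello world", "aeiouAEIOU", ""))
```

The three calls mirror the unequal-length and removal examples above, so their results can be compared with the corresponding rows of the Spark output.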

## `substring()` in PySpark

The `substring()` function in PySpark extracts part of a string column. It takes a starting position, which is 1-based (a negative position counts back from the end of the string), and the length of the substring to extract.

**Syntax**

`substring(str, pos, len)`

In [67]:
# Example 1: Basic substring()

from pyspark.sql.functions import substring, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SubstringExample").getOrCreate()

# Sample DataFrame
data = [
    ("abcdefg",),
    ("PySpark",),
    ("Data Science",),
    (None,) # Example with a null value
]
columns = ["text"]
df_text = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_text.show()

# Extract substring starting from position 3 with length 4
df_substring_basic = df_text.withColumn("substring_example", substring(col("text"), 3, 4))

print("DataFrame after substring(col('text'), 3, 4):")
df_substring_basic.show()

# Extract substring from the beginning (position 1) with length 3
df_substring_start = df_text.withColumn("substring_from_start", substring(col("text"), 1, 3))

print("DataFrame after substring(col('text'), 1, 3):")
df_substring_start.show()

Original DataFrame:
+------------+
|        text|
+------------+
|     abcdefg|
|     PySpark|
|Data Science|
|        NULL|
+------------+

DataFrame after substring(col('text'), 3, 4):
+------------+-----------------+
|        text|substring_example|
+------------+-----------------+
|     abcdefg|             cdef|
|     PySpark|             Spar|
|Data Science|             ta S|
|        NULL|             NULL|
+------------+-----------------+

DataFrame after substring(col('text'), 1, 3):
+------------+--------------------+
|        text|substring_from_start|
+------------+--------------------+
|     abcdefg|                 abc|
|     PySpark|                 PyS|
|Data Science|                 Dat|
|        NULL|                NULL|
+------------+--------------------+



In [68]:
# Example 2: Using negative position

from pyspark.sql.functions import substring, col

# Assume df_text is already created from Example 1

print("Original DataFrame:")
df_text.show()

# Extract the last 3 characters (start 3 characters from the end, length 3)
df_substring_negative_pos = df_text.withColumn("substring_negative", substring(col("text"), -3, 3))

print("DataFrame after substring(col('text'), -3, 3):")
df_substring_negative_pos.show()

# Extract 2 characters, starting 5 characters from the end
df_substring_negative_pos_2 = df_text.withColumn("substring_negative_2", substring(col("text"), -5, 2))

print("DataFrame after substring(col('text'), -5, 2):")
df_substring_negative_pos_2.show()

Original DataFrame:
+------------+
|        text|
+------------+
|     abcdefg|
|     PySpark|
|Data Science|
|        NULL|
+------------+

DataFrame after substring(col('text'), -3, 3):
+------------+------------------+
|        text|substring_negative|
+------------+------------------+
|     abcdefg|               efg|
|     PySpark|               ark|
|Data Science|               nce|
|        NULL|              NULL|
+------------+------------------+

DataFrame after substring(col('text'), -5, 2):
+------------+--------------------+
|        text|substring_negative_2|
+------------+--------------------+
|     abcdefg|                  cd|
|     PySpark|                  Sp|
|Data Science|                  ie|
|        NULL|                NULL|
+------------+--------------------+



In [None]:
# Example 3: Handling lengths longer than the remaining string

from pyspark.sql.functions import substring, col

# Assume df_text is already created from Example 1

print("Original DataFrame:")
df_text.show()

# Extract substring starting from position 5 with a length longer than remaining
df_substring_long_len = df_text.withColumn("substring_long", substring(col("text"), 5, 10))

print("DataFrame after substring(col('text'), 5, 10):")
df_substring_long_len.show()

# Extract substring starting from a position beyond the string length
df_substring_invalid_pos = df_text.withColumn("substring_invalid", substring(col("text"), 10, 3))

print("DataFrame after substring(col('text'), 10, 3):")
df_substring_invalid_pos.show()
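
Example 3 above has no recorded output, so a plain-Python sketch of the position and length rules may help to predict what it prints. This is not PySpark code: `substring_like` is a hypothetical helper, it assumes that Spark clips the requested range to the string and treats position 0 like position 1, and it has only been thought through for plain ASCII text. Treat its results as predictions to check against Spark, not as notebook output.

```python
def substring_like(text, pos, length):
    """Plain-Python model of substring(): 1-based pos, negative pos counts from the end."""
    if text is None:
        return None  # NULL in, NULL out
    n = len(text)
    if pos > 0:
        start = pos - 1  # convert 1-based position to 0-based index
    elif pos < 0:
        start = n + pos  # count back from the end of the string
    else:
        start = 0        # assumption: position 0 behaves like position 1
    end = start + length
    if end <= start or start >= n:
        return ""        # nothing to extract
    return text[max(start, 0):max(end, 0)]


samples = ("abcdefg", "PySpark", "Data Science", None)
for pos, length in [(3, 4), (-5, 2), (5, 10), (10, 3)]:
    print((pos, length), [substring_like(s, pos, length) for s in samples])
```

The first two argument pairs reproduce Examples 1 and 2, which gives a quick way to validate the model before trusting it for the last two pairs.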

## Regular Expression Methods in PySpark

PySpark provides functions in `pyspark.sql.functions` for working with regular expressions on string columns. Two common ones are `regexp_extract()` and `regexp_replace()`.

## `regexp_extract()`

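`regexp_extract()` extracts the part of a string matched by a chosen capture group of a regular expression pattern. Group index 0 refers to the whole match. If the pattern does not match, the result is an empty string rather than NULL.

**Syntax**

`regexp_extract(str, pattern, idx)`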

In [70]:
# Example 1: regexp_extract()

from pyspark.sql.functions import regexp_extract, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RegexExample").getOrCreate()

# Sample DataFrame
data = [
    ("user_123_abc",),
    ("another_user_456_xyz",),
    ("id_789",),
    ("no_match",)
]
columns = ["text"]
df_regex = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_regex.show()

# Extract the numbers after "user_"
# Pattern: "user_" followed by one or more digits (\d+)
# Group 1: the digits captured by (\d+)
df_extracted = df_regex.withColumn("extracted_number", regexp_extract(col("text"), r"user_(\d+)", 1))

print("DataFrame after extracting numbers:")
df_extracted.show()

# Extract text after "user_"
# Pattern: "user_" followed by anything (.*)
# Group 1: the text captured by (.*)
df_extracted_text = df_regex.withColumn("extracted_text", regexp_extract(col("text"), r"user_(.*)", 1))

print("DataFrame after extracting text:")
df_extracted_text.show()

Original DataFrame:
+--------------------+
|                text|
+--------------------+
|        user_123_abc|
|another_user_456_xyz|
|              id_789|
|            no_match|
+--------------------+

DataFrame after extracting numbers:
+--------------------+----------------+
|                text|extracted_number|
+--------------------+----------------+
|        user_123_abc|             123|
|another_user_456_xyz|             456|
|              id_789|                |
|            no_match|                |
+--------------------+----------------+

DataFrame after extracting text:
+--------------------+--------------+
|                text|extracted_text|
+--------------------+--------------+
|        user_123_abc|       123_abc|
|another_user_456_xyz|       456_xyz|
|              id_789|              |
|            no_match|              |
+--------------------+--------------+
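
Patterns like the ones above can be tried out locally before running them on a DataFrame. The sketch below uses Python's `re` module as a stand-in; Spark evaluates patterns with the Java regex engine, so agreement is an assumption that holds for simple patterns such as these but not necessarily for every construct. `regexp_extract_like` is a hypothetical helper, not a PySpark function.

```python
import re


def regexp_extract_like(text, pattern, idx):
    """Plain-Python model of regexp_extract(): first match, chosen group, '' when no match."""
    if text is None:
        return None  # NULL in, NULL out
    match = re.search(pattern, text)
    if match is None:
        return ""  # no match gives an empty string, not None
    group = match.group(idx)
    # An optional group that did not take part in the match also yields ''
    return group if group is not None else ""


for s in ["user_123_abc", "another_user_456_xyz", "id_789", "no_match"]:
    print(s, "->", repr(regexp_extract_like(s, r"user_(\d+)", 1)))
```

Because a failed match yields an empty string, rows without a match cannot be found with a NULL check; comparing the extracted column with `""` is one way to identify them.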



## `regexp_replace()`

`regexp_replace()` replaces every substring that matches a regular expression pattern with a replacement string.

**Syntax**

`regexp_replace(string, pattern, replacement)`

In [71]:
# Example 2: regexp_replace()

from pyspark.sql.functions import regexp_replace, col

# Assume df_regex is already created from the previous example

print("Original DataFrame:")
df_regex.show()

# Replace each run of digits with a single 'X' (\d+ matches the whole run)
df_replaced_digits = df_regex.withColumn("replaced_digits", regexp_replace(col("text"), r"\d+", "X"))

print("DataFrame after replacing digits:")
df_replaced_digits.show()

# Replace "user_" with "id_"
df_replaced_user = df_regex.withColumn("replaced_user", regexp_replace(col("text"), "user_", "id_"))

print("DataFrame after replacing 'user_':")
df_replaced_user.show()

# Remove everything from the first "_" onward, including the underscore itself
df_removed_after_underscore = df_regex.withColumn("removed_after_underscore", regexp_replace(col("text"), r"\_.*", ""))

print("DataFrame after removing text after underscore:")
df_removed_after_underscore.show()

Original DataFrame:
+--------------------+
|                text|
+--------------------+
|        user_123_abc|
|another_user_456_xyz|
|              id_789|
|            no_match|
+--------------------+

DataFrame after replacing digits:
+--------------------+------------------+
|                text|   replaced_digits|
+--------------------+------------------+
|        user_123_abc|        user_X_abc|
|another_user_456_xyz|another_user_X_xyz|
|              id_789|              id_X|
|            no_match|          no_match|
+--------------------+------------------+

DataFrame after replacing 'user_':
+--------------------+------------------+
|                text|     replaced_user|
+--------------------+------------------+
|        user_123_abc|        id_123_abc|
|another_user_456_xyz|another_id_456_xyz|
|              id_789|            id_789|
|            no_match|          no_match|
+--------------------+------------------+

DataFrame after removing text after underscore:
+---
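
The effect of each pattern is easier to see side by side. The sketch below applies the same ideas with Python's `re.sub`, as a local stand-in for `regexp_replace()`; Spark uses the Java regex engine, so treating the two as equivalent is an assumption that is reasonable only for simple patterns like these. The variable names are chosen for this illustration.

```python
import re

samples = ["user_123_abc", "another_user_456_xyz", "id_789", "no_match"]

# \d+ matches a whole run of digits, so each run collapses to a single X
collapsed = [re.sub(r"\d+", "X", s) for s in samples]
print(collapsed)

# \d without + matches one digit at a time, so every digit becomes its own X
per_digit = [re.sub(r"\d", "X", s) for s in samples]
print(per_digit)

# _.* starts at the first underscore and runs to the end of the string
prefix_only = [re.sub(r"_.*", "", s) for s in samples]
print(prefix_only)
```

The difference between `\d+` and `\d` matters when the replacement should preserve the length of the original value, for example when masking digits.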

In [72]:
# Example 3: Accessing elements of the resulting array

from pyspark.sql.functions import split, col

# Assume df_fruits is already created from Example 1

df_split_comma = df_fruits.withColumn("fruit_list_comma", split(col("fruits"), ","))

# Access the first element (index 0)
df_first_fruit = df_split_comma.withColumn("first_fruit", col("fruit_list_comma")[0])

print("DataFrame with the first fruit:")
df_first_fruit.show(truncate=False)

# Access the second element (index 1)
df_second_fruit = df_split_comma.withColumn("second_fruit", col("fruit_list_comma")[1])

print("DataFrame with the second fruit:")
df_second_fruit.show(truncate=False)

DataFrame with the first fruit:
+-------------------+-----------------------+-----------+
|fruits             |fruit_list_comma       |first_fruit|
+-------------------+-----------------------+-----------+
|apple,banana,orange|[apple, banana, orange]|apple      |
|grape;kiwi         |[grape;kiwi]           |grape;kiwi |
|mango              |[mango]                |mango      |
+-------------------+-----------------------+-----------+

DataFrame with the second fruit:
+-------------------+-----------------------+------------+
|fruits             |fruit_list_comma       |second_fruit|
+-------------------+-----------------------+------------+
|apple,banana,orange|[apple, banana, orange]|banana      |
|grape;kiwi         |[grape;kiwi]           |NULL        |
|mango              |[mango]                |NULL        |
+-------------------+-----------------------+------------+



## `pivot()` in PySpark

`pivot()` reshapes grouped data by turning the unique values of one column into separate columns, with an aggregation filling each cell. It is called on the result of `groupBy()` and is commonly used for aggregation and reshaping, similar to a pivot table in spreadsheet software.

**Syntax**

`df.groupBy(*cols).pivot(pivot_col, values=None).agg(...)`

In [73]:
# Example 1: Basic pivot()

# Sample data
data = [
    ("USA", "ProductA", 100),
    ("USA", "ProductB", 150),
    ("Canada", "ProductA", 120),
    ("Canada", "ProductC", 200),
    ("USA", "ProductB", 180),
    ("Canada", "ProductA", 130)
]

columns = ["country", "product", "amount"]

df_sales = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df_sales.show()

# Pivot the data to show total amount by country and product
pivot_df = df_sales.groupBy("country").pivot("product").sum("amount")

print("Pivoted DataFrame:")
pivot_df.show()

# Note: NULL values appear where a combination of grouping column and pivot column value does not exist in the original data.

Original DataFrame:
+-------+--------+------+
|country| product|amount|
+-------+--------+------+
|    USA|ProductA|   100|
|    USA|ProductB|   150|
| Canada|ProductA|   120|
| Canada|ProductC|   200|
|    USA|ProductB|   180|
| Canada|ProductA|   130|
+-------+--------+------+

Pivoted DataFrame:
+-------+--------+--------+--------+
|country|ProductA|ProductB|ProductC|
+-------+--------+--------+--------+
|    USA|     100|     330|    NULL|
| Canada|     250|    NULL|     200|
+-------+--------+--------+--------+
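
To make the reshaping step concrete, here is a plain-Python sketch of what `groupBy("country").pivot("product").sum("amount")` computes for the sample rows. It is a conceptual model only, not how Spark executes the query; `pivot_sum` is a hypothetical helper and `None` stands in for NULL.

```python
from collections import defaultdict

rows = [
    ("USA", "ProductA", 100),
    ("USA", "ProductB", 150),
    ("Canada", "ProductA", 120),
    ("Canada", "ProductC", 200),
    ("USA", "ProductB", 180),
    ("Canada", "ProductA", 130),
]


def pivot_sum(rows):
    """Plain-Python model of groupBy(country).pivot(product).sum(amount)."""
    # One output column per distinct pivot value
    products = sorted({product for _, product, _ in rows})
    totals = defaultdict(dict)
    for country, product, amount in rows:
        totals[country][product] = totals[country].get(product, 0) + amount
    # Combinations that never occur become None, like NULL in the Spark output
    return {
        country: {p: cells.get(p) for p in products}
        for country, cells in totals.items()
    }


result = pivot_sum(rows)
for country, cells in result.items():
    print(country, cells)
```

The `products` line is the step that requires a look at all the data to discover the distinct values, which is why supplying the list of values to `pivot()` up front can save work on large datasets.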



In [74]:
# Example 2: pivot() with specified values

# It's generally recommended to provide a list of values to the pivot function
# to avoid collecting all unique values from a large dataset.

product_values = ["ProductA", "ProductB", "ProductC"]

pivot_df_specified = df_sales.groupBy("country").pivot("product", product_values).sum("amount")

print("Pivoted DataFrame with specified values:")
pivot_df_specified.show()

Pivoted DataFrame with specified values:
+-------+--------+--------+--------+
|country|ProductA|ProductB|ProductC|
+-------+--------+--------+--------+
|    USA|     100|     330|    NULL|
| Canada|     250|    NULL|     200|
+-------+--------+--------+--------+



In [75]:
# Example 3: pivot() with multiple aggregations

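# Import Spark's sum under an alias so it is not confused with Python's built-in sum()
from pyspark.sql.functions import avg, count, sum as spark_sum

pivot_df_multi_agg = df_sales.groupBy("country").pivot("product", product_values).agg(spark_sum("amount").alias("total_amount"), count("amount").alias("count"))

print("Pivoted DataFrame with multiple aggregations:")
pivot_df_multi_agg.show()

# Note: calling Python's built-in sum("amount") here instead raises
# TypeError: unsupported operand type(s) for +: 'int' and 'str'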

In [76]:
# Example 4: pivot() on a different dataset (using the employee df from earlier)

# Pivot the employee data to show average salary by department and gender
pivot_employee_df = df.groupBy("department_id").pivot("gender").agg(avg("salary"))

print("Pivoted Employee DataFrame (Average Salary by Department and Gender):")
pivot_employee_df.show()

Pivoted Employee DataFrame (Average Salary by Department and Gender):
+-------------+------------------+-------+
|department_id|            Female|   Male|
+-------------+------------------+-------+
|          101|           45000.0|60000.0|
|          103|           52000.0|60000.0|
|          107|           47500.0|   NULL|
|          102|50666.666666666664|55000.0|
|          105|           54000.0|57000.0|
|          106|              NULL|69000.0|
|          104|           48500.0|65000.0|
+-------------+------------------+-------+



### Sample Dataset for Practice

Here's a sample PySpark DataFrame you can use to practice the methods discussed above. It contains information about orders, including `order_id`, `customer_id`, `product`, `quantity`, `price`, and `order_date`. This dataset includes various data types and potential scenarios for applying transformations and actions like `collect()`, `transform()`, `map()`, `flatMap()`, `foreach()`, `partitionBy()`, `MapType`, `explode()`, `create_map()`, `map_keys()`, `map_values()`, `collect_list()`, `collect_set()`, `sample()`, `sampleBy()`, `split()`, `concat_ws()`, `translate()`, `substring()`, `regexp_extract()`, `regexp_replace()`, and `pivot()`.

In [77]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, DateType, MapType
from datetime import date

# Assume spark is already created
# spark = SparkSession.builder.appName("PracticeDataset").getOrCreate()

# Define the schema
schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("product", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("price", DoubleType(), True),
    StructField("order_date", DateType(), True),
    StructField("tags", StringType(), True), # For split() and regex examples
    StructField("details", StringType(), True), # For translate(), substring() examples
    StructField("features", StringType(), True), # For regexp_extract(), regexp_replace() examples
    StructField("product_attributes", MapType(StringType(), StringType()), True), # For MapType, explode, map_keys, map_values
    StructField("related_products", StringType(), True) # For collect_list, collect_set
])

# Sample data
data = [
    (1, 101, "Laptop", 1, 1200.00, date(2023, 1, 15), "electronics,office", "Serial: A1B2C3D4", "user_123_laptop", {"color": "silver", "brand": "XYZ"}, "Keyboard,Mouse,Monitor"),
    (2, 102, "Mouse", 2, 25.50, date(2023, 1, 15), "electronics,accessory", "Model: M500", "user_456_mouse", {"color": "black", "wireless": "true"}, "Laptop,Monitor"),
    (3, 101, "Keyboard", 1, 75.00, date(2023, 1, 16), "electronics,accessory", "SKU: KB-101", "id_789_keyboard", {"layout": "US", "mechanical": "false"}, "Mouse,Monitor"),
    (4, 103, "Monitor", 1, 300.00, date(2023, 1, 16), "electronics,display", "DisplaySize: 27inch", "user_123_monitor", {"size": "27", "resolution": "1080p"}, "Laptop,Keyboard"),
    (5, 102, "Desk Chair", 1, 150.00, date(2023, 1, 17), "furniture,office", "Weight: 20kg", "user_456_chair", {"material": "mesh", "adjustable": "true"}, "Desk,Lamp"),
    (6, 104, "Lamp", 2, 35.00, date(2023, 1, 17), "furniture,lighting", "BulbType: LED", "id_abc_lamp", None, "Desk Chair,Desk"), # Example with None map
    (7, 101, "Laptop", 1, 1200.00, date(2023, 1, 18), "electronics,office", "Serial: E5F6G7H8", "user_123_laptop", {"color": "silver", "brand": "XYZ"}, "Keyboard,Mouse,Monitor"), # Duplicate order
    (8, 105, "Notebook", 5, 3.00, date(2023, 1, 18), "office,stationery", "Pages: 100", "user_xyz_notebook", {}, "Pen,Pencil"), # Example with empty map
    (9, 103, "Desk", 1, 250.00, date(2023, 1, 19), "furniture,office", "Material: Wood", "user_789_desk", {"size": "medium"}, "Desk Chair,Lamp"),
    (10, 104, "Pen", 10, 1.50, date(2023, 1, 19), "office,stationery", "InkColor: Blue", "id_def_pen", {"color": "blue"}, "Notebook,Pencil"),
    (11, 105, "Pencil", 12, 0.50, date(2023, 1, 20), "office,stationery", "LeadSize: 0.7mm", "user_xyz_pencil", {"lead": "0.7"}, "Notebook,Pen"),
    (12, 106, "Tablet", 1, 400.00, date(2023, 1, 20), "electronics", "Model: Tab-Pro", "user_abc_tablet", {"os": "Android"}, None), # Example with None related_products
    (13, 106, "Protector", 1, 15.00, date(2023, 1, 20), "electronics,accessory", "Type: Screen", "user_abc_protector", {}, ""), # Example with empty related_products
    # Added new rows
    (14, 101, "Mouse", 1, 25.50, date(2023, 1, 21), "electronics,accessory", "Model: M600", "user_101_mouse", {"color": "white", "wireless": "false"}, "Keyboard"),
    (15, 102, "Laptop", 1, 1100.00, date(2023, 1, 21), "electronics,office", "Serial: I9J10K11L12", "user_102_laptop", {"color": "black", "brand": "UVW"}, "Mouse,Monitor"),
    (16, 103, "Keyboard", 2, 70.00, date(2023, 1, 22), "electronics,accessory", "SKU: KB-202", "user_103_keyboard", {"layout": "UK", "mechanical": "true"}, "Mouse"),
    (17, 104, "Desk", 1, 220.00, date(2023, 1, 22), "furniture,office", "Material: Metal", "user_104_desk", {"size": "large"}, "Chair"),
    (18, 105, "Lamp", 1, 30.00, date(2023, 1, 23), "furniture,lighting", "BulbType: Incandescent", "user_105_lamp", None, "Desk"),
    (19, 106, "Notebook", 3, 2.50, date(2023, 1, 23), "office,stationery", "Pages: 150", "user_106_notebook", {}, "Pen,Pencil"),
    (20, 101, "Monitor", 1, 280.00, date(2023, 1, 24), "electronics,display", "DisplaySize: 24inch", "user_101_monitor", {"size": "24", "resolution": "1080p"}, "Laptop")
]

practice_df = spark.createDataFrame(data, schema)

# Show the DataFrame and its schema
print("Sample Practice DataFrame:")
practice_df.show(truncate=False)
practice_df.printSchema()

Sample Practice DataFrame:
+--------+-----------+----------+--------+------+----------+---------------------+----------------------+------------------+--------------------------------------+----------------------+
|order_id|customer_id|product   |quantity|price |order_date|tags                 |details               |features          |product_attributes                    |related_products      |
+--------+-----------+----------+--------+------+----------+---------------------+----------------------+------------------+--------------------------------------+----------------------+
|1       |101        |Laptop    |1       |1200.0|2023-01-15|electronics,office   |Serial: A1B2C3D4      |user_123_laptop   |{color -> silver, brand -> XYZ}       |Keyboard,Mouse,Monitor|
|2       |102        |Mouse     |2       |25.5  |2023-01-15|electronics,accessory|Model: M500           |user_456_mouse    |{color -> black, wireless -> true}    |Laptop,Monitor        |
|3       |101        |Keyboard  |1    

**Using split() and explode():**

*   Extract each individual tag from the `tags` column and create a new row for each tag, keeping the `order_id`.
*   Count how many orders belong to each unique tag.

**Using regexp_extract():**

*   From the `details` column, extract the serial number (e.g., "A1B2C3D4") for products where the detail starts with "Serial: ".
*   From the `features` column, extract the numeric ID (e.g., "123") that appears after "user_".

**Using translate() and substring():**

*   Create a new column by removing all vowels (both uppercase and lowercase) from the `details` column.
*   Extract the first 5 characters of the `features` column.

**Using concat_ws():**

*   Create a new column that combines the `product` and `quantity` columns into a single string like "Laptop (1)".
*   Split the `related_products` string on commas into an array, then use `concat_ws()` to join the elements back into a single string (you will need `split()` first).

**Using MapType, explode(), map_keys(), and map_values():**

*   Explode the `product_attributes` map to have one row per attribute key-value pair, keeping the `order_id` and `product`.
*   Get a list of all unique attribute keys present in the `product_attributes` column across all orders.
*   Extract the value associated with the key "color" from the `product_attributes` map.

**Using collect_list() and collect_set():**

*   For each `customer_id`, create a list of all products they have ordered (`collect_list`).
*   For each `customer_id`, create a set of unique products they have ordered (`collect_set`).

**Using pivot():**

*   Pivot the data to show the total quantity of each product ordered by each `customer_id`.
*   Pivot the data to show the average price of each product for each `customer_id`.

**Using sample() and sampleBy():**

*   Take a random sample of 20% of the rows from the DataFrame.
*   Perform a stratified sample on the `product` column, taking 50% of "Laptop" orders and 100% of "Mouse" orders.

**Combining split(), explode(), and Aggregation:**

*   Find the total quantity of products ordered for each unique tag.

**Using substring() and concat_ws():**

*   Create a new column that takes the first 3 characters of the product name and concatenates it with the `order_id`, separated by a hyphen (e.g., "Lap-1").

**Using regexp_replace() and translate():**

*   Remove all non-digit characters from the `details` column.
*   Replace all occurrences of the letter 'e' (case-insensitive) in the `product` column with the character '@'.

**Working with MapType and Filtering:**

*   Filter the DataFrame to show only the orders where the `product_attributes` map contains the key "color" and its value is "silver".

**Using collect_list() with struct():**

*   For each `order_date`, collect a list of structs containing the `product` and `quantity` for all orders placed on that date.

**Advanced pivot():**

*   Pivot the data to show the total price and total quantity for each product across different `customer_id` values.

**Combining sample() and groupBy():**

*   Take a random sample of 50% of the data, then group the sampled data by `customer_id` to find the total number of orders for each customer in the sample.




