
## `explode()` in PySpark

The `explode()` function is used to create a new row for each element in an array or map column. It essentially transforms a single row with an array/map into multiple rows, with each new row containing one element from the original array/map.

This is particularly useful when you have nested data structures (arrays or maps) in your DataFrame and you want to flatten them for further processing or analysis.

**Syntax**

`explode(col)`

In [None]:
# Example 1: explode() with an array column

from pyspark.sql.functions import explode, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExplodeExample").getOrCreate()

# Sample DataFrame with an array column
data = [
    ("Alice", ["Math", "Science"]),
    ("Bob", ["History"]),
    ("Charlie", []), # Empty array
    ("David", None) # Null array
]
columns = ["name", "subjects"]
df_array = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_array.show(truncate=False)

# Use explode() on the 'subjects' array column
df_exploded_array = df_array.select(col("name"), explode(col("subjects")).alias("subject"))

print("DataFrame after explode() on array column:")
df_exploded_array.show(truncate=False)

# Note: Rows with empty or null arrays are dropped by default.

Original DataFrame:
+-------+---------------+
|name   |subjects       |
+-------+---------------+
|Alice  |[Math, Science]|
|Bob    |[History]      |
|Charlie|[]             |
|David  |NULL           |
+-------+---------------+

DataFrame after explode() on array column:
+-----+-------+
|name |subject|
+-----+-------+
|Alice|Math   |
|Alice|Science|
|Bob  |History|
+-----+-------+



In [None]:
# Example 2: explode() with a map column

from pyspark.sql.functions import explode, col
from pyspark.sql import SparkSession
from pyspark.sql.types import MapType, StringType, IntegerType, StructType, StructField

# Sample DataFrame with a map column
data = [
    ("Alice", {"Math": 90, "Science": 85}),
    ("Bob", {"History": 75}),
    ("Charlie", {}), # Empty map
    ("David", None) # Null map
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("scores", MapType(StringType(), IntegerType()), True)
])

df_map = spark.createDataFrame(data, schema)

print("Original DataFrame:")
df_map.show(truncate=False)

# Use explode() on the 'scores' map column
# explode() on a map results in two columns: 'key' and 'value'
df_exploded_map = df_map.select(col("name"), explode(col("scores")))

print("DataFrame after explode() on map column:")
df_exploded_map.show(truncate=False)

# You can rename the resulting columns
df_exploded_map_renamed = df_map.select(col("name"), explode(col("scores")).alias("course", "score"))

print("DataFrame after explode() on map column (renamed columns):")
df_exploded_map_renamed.show(truncate=False)

Original DataFrame:
+-------+---------------------------+
|name   |scores                     |
+-------+---------------------------+
|Alice  |{Science -> 85, Math -> 90}|
|Bob    |{History -> 75}            |
|Charlie|{}                         |
|David  |NULL                       |
+-------+---------------------------+

DataFrame after explode() on map column:
+-----+-------+-----+
|name |key    |value|
+-----+-------+-----+
|Alice|Science|85   |
|Alice|Math   |90   |
|Bob  |History|75   |
+-----+-------+-----+

DataFrame after explode() on map column (renamed columns):
+-----+-------+-----+
|name |course |score|
+-----+-------+-----+
|Alice|Science|85   |
|Alice|Math   |90   |
|Bob  |History|75   |
+-----+-------+-----+



In [None]:
# Example 3: explode_outer()

from pyspark.sql.functions import explode_outer, col

# Assume df_array is already created from Example 1

print("Original DataFrame:")
df_array.show(truncate=False)

# Use explode_outer() on the 'subjects' array column
df_exploded_outer_array = df_array.select(col("name"), explode_outer(col("subjects")).alias("subject"))

print("DataFrame after explode_outer() on array column:")
df_exploded_outer_array.show(truncate=False)

# Assume df_map is already created from Example 2

print("Original DataFrame:")
df_map.show(truncate=False)

# Use explode_outer() on the 'scores' map column
df_exploded_outer_map = df_map.select(col("name"), explode_outer(col("scores")).alias("course", "score"))

print("DataFrame after explode_outer() on map column:")
df_exploded_outer_map.show(truncate=False)

Original DataFrame:
+-------+---------------+
|name   |subjects       |
+-------+---------------+
|Alice  |[Math, Science]|
|Bob    |[History]      |
|Charlie|[]             |
|David  |NULL           |
+-------+---------------+

DataFrame after explode_outer() on array column:
+-------+-------+
|name   |subject|
+-------+-------+
|Alice  |Math   |
|Alice  |Science|
|Bob    |History|
|Charlie|NULL   |
|David  |NULL   |
+-------+-------+

Original DataFrame:
+-------+---------------------------+
|name   |scores                     |
+-------+---------------------------+
|Alice  |{Science -> 85, Math -> 90}|
|Bob    |{History -> 75}            |
|Charlie|{}                         |
|David  |NULL                       |
+-------+---------------------------+

DataFrame after explode_outer() on map column:
+-------+-------+-----+
|name   |course |score|
+-------+-------+-----+
|Alice  |Science|85   |
|Alice  |Math   |90   |
|Bob    |History|75   |
|Charlie|NULL   |NULL |
|David  |NULL   |NULL |
+-------+-------+-----+

In [None]:
df_exploded_array.printSchema()

root
 |-- name: string (nullable = true)
 |-- subject: string (nullable = true)



**Handling Nulls and Empty Arrays/Maps:**

By default, `explode()` drops rows where the array or map column is null or empty.

To keep those rows, use `explode_outer()` instead. It is called exactly like `explode()` and is applied to the original DataFrame that holds the array or map column, not to one that has already been exploded. Rows whose array or map is NULL or empty are kept, with NULL in the exploded column(s).

Example 3 demonstrates this: `explode_outer(col("subjects"))` is applied to the original `df_array` DataFrame, and the rows for Charlie (empty array) and David (NULL array) each appear once with a NULL `subject`.
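The difference between the two functions can be sketched in plain Python. This is only a model of the row-level behavior described above, not how Spark implements it; the helper name `explode_rows` is made up for this illustration.

```python
def explode_rows(rows, outer=False):
    """Plain-Python model of explode() / explode_outer() on (name, array) rows."""
    result = []
    for name, items in rows:
        if items:
            # Non-empty array: one output row per element.
            result.extend((name, item) for item in items)
        elif outer:
            # None or empty array: explode_outer() keeps the row with a null element.
            result.append((name, None))
        # Otherwise (plain explode): the row is dropped.
    return result


rows = [
    ("Alice", ["Math", "Science"]),
    ("Bob", ["History"]),
    ("Charlie", []),   # empty array
    ("David", None),   # null array
]

inner = explode_rows(rows)
outer = explode_rows(rows, outer=True)
print(inner)
print(outer)
```

The `inner` result has no rows for Charlie or David, while `outer` keeps one row for each of them with `None` in place of the subject.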

## `create_map()` in PySpark

The `create_map()` function in PySpark is used to create a new map column (key-value pairs) from existing columns or literal values. It's a function available in `pyspark.sql.functions`.

This function is useful for structuring data into a map format, which can then be used for various operations, including working with `MapType` columns or preparing data for nested structures.

**Syntax**

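`create_map(key1, value1, key2, value2, ...)`

Arguments alternate between keys and values. Each one is a column name, a `Column` expression, or a literal wrapped in `lit()`.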
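As a rough mental model, the alternating key/value arguments are paired up into one map per row. The plain-Python sketch below illustrates only that pairing; `create_map_model` is a hypothetical helper, and details such as how Spark treats duplicate keys are not modeled.

```python
def create_map_model(*args):
    """Pair alternating (key, value, key, value, ...) arguments into a dict."""
    if len(args) % 2 != 0:
        raise ValueError("expected an even number of arguments: key, value, ...")
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(args[0::2], args[1::2]))


# One row of a DataFrame like df_subjects, represented as a dict.
row = {
    "subject1_name": "Math", "subject1_score": 90,
    "subject2_name": "Science", "subject2_score": 85,
}

scores_map = create_map_model(
    row["subject1_name"], row["subject1_score"],
    row["subject2_name"], row["subject2_score"],
)
print(scores_map)
```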

In [None]:
# Example 1: create_map() from existing columns

from pyspark.sql.functions import create_map, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreateMapExample").getOrCreate()

# Sample DataFrame
data = [
    ("Alice", "Math", 90, "Science", 85),
    ("Bob", "History", 75, "Art", 88),
    ("Charlie", "Physics", 92, "Chemistry", 80)
]
columns = ["name", "subject1_name", "subject1_score", "subject2_name", "subject2_score"]
df_subjects = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_subjects.show()

# Create a map column from subject name and score pairs
df_with_map = df_subjects.withColumn("scores_map",
                                     create_map(
                                         col("subject1_name"), col("subject1_score"),
                                         col("subject2_name"), col("subject2_score")
                                     ))

print("DataFrame after creating a map column:")
df_with_map.show(truncate=False)

df_with_map.printSchema()

Original DataFrame:
+-------+-------------+--------------+-------------+--------------+
|   name|subject1_name|subject1_score|subject2_name|subject2_score|
+-------+-------------+--------------+-------------+--------------+
|  Alice|         Math|            90|      Science|            85|
|    Bob|      History|            75|          Art|            88|
|Charlie|      Physics|            92|    Chemistry|            80|
+-------+-------------+--------------+-------------+--------------+

DataFrame after creating a map column:
+-------+-------------+--------------+-------------+--------------+--------------------------------+
|name   |subject1_name|subject1_score|subject2_name|subject2_score|scores_map                      |
+-------+-------------+--------------+-------------+--------------+--------------------------------+
|Alice  |Math         |90            |Science      |85            |{Math -> 90, Science -> 85}     |
|Bob    |History      |75            |Art          |88      

In [None]:
# Example 2: create_map() from literal values

from pyspark.sql.functions import create_map, lit
from pyspark.sql import SparkSession

# Assume spark is already created

# Sample DataFrame
data = [("Alice", 25), ("Bob", 30)]
columns = ["name", "age"]
df_lit = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_lit.show()

# Add a map column with fixed literal values
df_with_literal_map = df_lit.withColumn("info", create_map(
    lit("city"), lit("New York"),
    lit("country"), lit("USA")
))

print("DataFrame after adding a map column with literal values:")
df_with_literal_map.show(truncate=False)

df_with_literal_map.printSchema()

Original DataFrame:
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+

DataFrame after adding a map column with literal values:
+-----+---+----------------------------------+
|name |age|info                              |
+-----+---+----------------------------------+
|Alice|25 |{city -> New York, country -> USA}|
|Bob  |30 |{city -> New York, country -> USA}|
+-----+---+----------------------------------+

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- info: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = false)



In [None]:
# Example 3: create_map() with mixed columns and literals

from pyspark.sql.functions import create_map, col, lit
from pyspark.sql import SparkSession

# Assume spark is already created

# Sample DataFrame
data = [("Alice", 25, "Engineer"), ("Bob", 30, "Doctor")]
columns = ["name", "age", "occupation"]
df_mixed = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_mixed.show()

# Create a map column using a mix of columns and literals
df_with_mixed_map = df_mixed.withColumn("details", create_map(
    lit("age"), col("age").cast("string"), # Cast age to string for consistency
    lit("occupation"), col("occupation"),
    lit("status"), lit("active")
))

print("DataFrame after creating a map column with mixed types:")
df_with_mixed_map.show(truncate=False)

df_with_mixed_map.printSchema()

Original DataFrame:
+-----+---+----------+
| name|age|occupation|
+-----+---+----------+
|Alice| 25|  Engineer|
|  Bob| 30|    Doctor|
+-----+---+----------+

DataFrame after creating a map column with mixed types:
+-----+---+----------+-----------------------------------------------------+
|name |age|occupation|details                                              |
+-----+---+----------+-----------------------------------------------------+
|Alice|25 |Engineer  |{age -> 25, occupation -> Engineer, status -> active}|
|Bob  |30 |Doctor    |{age -> 30, occupation -> Doctor, status -> active}  |
+-----+---+----------+-----------------------------------------------------+

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- occupation: string (nullable = true)
 |-- details: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



## `map_keys()` and `map_values()` in PySpark

`map_keys()` and `map_values()` are PySpark SQL functions used to extract the keys and values, respectively, from a MapType column in a DataFrame.

*   **`map_keys(col)`**: Returns an array containing all the keys in the MapType column. The order of keys in the array is not guaranteed.
*   **`map_values(col)`**: Returns an array containing all the values in the MapType column. The order of values in the array corresponds to the order of keys returned by `map_keys()`.
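The positional correspondence between the two arrays can be illustrated with a plain-Python sketch. The helpers `map_keys_model` and `map_values_model` are hypothetical names for this illustration; they mimic the null and empty-map behavior shown in the outputs of this section rather than Spark's implementation.

```python
def map_keys_model(m):
    """Return the keys as a list; a null map stays null."""
    return None if m is None else list(m.keys())


def map_values_model(m):
    """Return the values as a list, in the same order as the keys."""
    return None if m is None else list(m.values())


scores = {"Math": 90, "Science": 85}
keys = map_keys_model(scores)
values = map_values_model(scores)

# Because position i of `values` belongs to position i of `keys`,
# zipping the two arrays rebuilds the original map.
rebuilt = dict(zip(keys, values))
print(keys, values, rebuilt)
```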

**Syntax**

`map_keys(col)` and `map_values(col)`

In [None]:
from pyspark.sql.functions import map_keys, map_values, col
from pyspark.sql import SparkSession
from pyspark.sql.types import MapType, StringType, IntegerType, StructType, StructField

spark = SparkSession.builder.appName("MapKeysValuesExample").getOrCreate()

# Sample DataFrame with a MapType column (using the df_map from a previous example)
data = [
    ("Alice", {"Math": 90, "Science": 85}),
    ("Bob", {"History": 75}),
    ("Charlie", {}), # Empty map
    ("David", None) # Null map
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("scores", MapType(StringType(), IntegerType()), True)
])

df_map = spark.createDataFrame(data, schema)

print("Original DataFrame:")
df_map.show(truncate=False)
df_map.printSchema()

# Example 1: Using map_keys()
df_keys = df_map.select(col("name"), map_keys(col("scores")).alias("score_keys"))

print("DataFrame with score keys:")
df_keys.show(truncate=False)
df_keys.printSchema()

# Example 2: Using map_values()
df_values = df_map.select(col("name"), map_values(col("scores")).alias("score_values"))

print("DataFrame with score values:")
df_values.show(truncate=False)
df_values.printSchema()

Original DataFrame:
+-------+---------------------------+
|name   |scores                     |
+-------+---------------------------+
|Alice  |{Science -> 85, Math -> 90}|
|Bob    |{History -> 75}            |
|Charlie|{}                         |
|David  |NULL                       |
+-------+---------------------------+

root
 |-- name: string (nullable = true)
 |-- scores: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

DataFrame with score keys:
+-------+---------------+
|name   |score_keys     |
+-------+---------------+
|Alice  |[Science, Math]|
|Bob    |[History]      |
|Charlie|[]             |
|David  |NULL           |
+-------+---------------+

root
 |-- name: string (nullable = true)
 |-- score_keys: array (nullable = true)
 |    |-- element: string (containsNull = true)

DataFrame with score values:
+-------+------------+
|name   |score_values|
+-------+------------+
|Alice  |[85, 90]    |
|Bob    |[75]        |
|Charlie|[]  

In [None]:
from pyspark.sql.functions import map_keys, col

# Assume df_with_literal_map is already created (from the create_map examples)

# Use map_keys() on the 'info' column
df_literal_map_keys = df_with_literal_map.select(col("name"), map_keys(col("info")).alias("info_keys"))

print("DataFrame with keys from the literal map:")
df_literal_map_keys.show(truncate=False)
df_literal_map_keys.printSchema()

DataFrame with keys from the literal map:
+-----+---------------+
|name |info_keys      |
+-----+---------------+
|Alice|[city, country]|
|Bob  |[city, country]|
+-----+---------------+

root
 |-- name: string (nullable = true)
 |-- info_keys: array (nullable = false)
 |    |-- element: string (containsNull = true)



## `collect_list()` and `collect_set()` in PySpark

`collect_list()` and `collect_set()` are aggregation functions in PySpark that are used to gather elements from a column into a list or a set, respectively, within each group. They are often used after a `groupBy()` operation.

*   **`collect_list(col)`**: Aggregates the values of the specified column into an array column. Duplicate values are kept, and the order of elements in the array is not guaranteed.
*   **`collect_set(col)`**: Aggregates the values of the specified column into an array column with duplicates removed. The order of elements in the array is not guaranteed.
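A plain-Python sketch of the two aggregations, assuming rows of `(category, value)` pairs. The helper `collect_by_group` is hypothetical, and unlike Spark this model happens to preserve input order, which should not be relied on in a real job.

```python
from collections import defaultdict


def collect_by_group(rows, unique=False):
    """Group (category, value) rows; keep duplicates unless unique=True."""
    groups = defaultdict(list)
    for category, value in rows:
        if unique and value in groups[category]:
            continue  # collect_set-style: skip values already collected
        groups[category].append(value)
    return dict(groups)


rows = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 2), ("A", 1)]

as_list = collect_by_group(rows)               # models collect_list()
as_set = collect_by_group(rows, unique=True)   # models collect_set()
print(as_list)
print(as_set)
```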

**Syntax**

`collect_list(col)` and `collect_set(col)`

In [None]:
# Example 1: Basic Usage with groupBy()

from pyspark.sql.functions import collect_list, collect_set, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CollectListSetExample").getOrCreate()

# Sample DataFrame
data = [
    ("A", 1),
    ("B", 2),
    ("A", 3),
    ("C", 4),
    ("B", 2),
    ("A", 1)
]
columns = ["category", "value"]
df_agg = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_agg.show()

# Group by 'category' and collect values into a list
df_list = df_agg.groupBy("category").agg(collect_list("value").alias("list_of_values"))

print("DataFrame after groupBy() and collect_list():")
df_list.show()

# Group by 'category' and collect unique values into a set
df_set = df_agg.groupBy("category").agg(collect_set("value").alias("set_of_values"))

print("DataFrame after groupBy() and collect_set():")
df_set.show()

Original DataFrame:
+--------+-----+
|category|value|
+--------+-----+
|       A|    1|
|       B|    2|
|       A|    3|
|       C|    4|
|       B|    2|
|       A|    1|
+--------+-----+

DataFrame after groupBy() and collect_list():
+--------+--------------+
|category|list_of_values|
+--------+--------------+
|       B|        [2, 2]|
|       A|     [1, 3, 1]|
|       C|           [4]|
+--------+--------------+

DataFrame after groupBy() and collect_set():
+--------+-------------+
|category|set_of_values|
+--------+-------------+
|       B|          [2]|
|       A|       [1, 3]|
|       C|          [4]|
+--------+-------------+



In [None]:
# Example 2: Using collect_list() and collect_set() without groupBy()

# When used without groupBy(), these functions will collect all values from the entire DataFrame into a single list or set.
df_all_list = df_agg.agg(collect_list("value").alias("all_values_list"))
print("DataFrame after collect_list() on entire DataFrame:")
df_all_list.show(truncate=False)

df_all_set = df_agg.agg(collect_set("value").alias("all_values_set"))
print("DataFrame after collect_set() on entire DataFrame:")
df_all_set.show(truncate=False)

DataFrame after collect_list() on entire DataFrame:
+------------------+
|all_values_list   |
+------------------+
|[1, 2, 3, 4, 2, 1]|
+------------------+

DataFrame after collect_set() on entire DataFrame:
+--------------+
|all_values_set|
+--------------+
|[1, 2, 3, 4]  |
+--------------+



In [None]:
# Example 3: Collecting multiple columns or complex types

from pyspark.sql.functions import struct

# Collect 'category' and 'value' as structs into a list
df_struct_list = df_agg.groupBy("category").agg(collect_list(struct("category", "value")).alias("list_of_structs"))
print("DataFrame after collecting structs:")
df_struct_list.show(truncate=False)

DataFrame after collecting structs:
+--------+------------------------+
|category|list_of_structs         |
+--------+------------------------+
|B       |[{B, 2}, {B, 2}]        |
|A       |[{A, 1}, {A, 3}, {A, 1}]|
|C       |[{C, 4}]                |
+--------+------------------------+



In [None]:
# sample() and sampleBy() in PySpark

## `sample()` in PySpark

`sample()` is used for simple random sampling. It allows you to randomly select a fraction of rows from your DataFrame.

You can perform sampling with or without replacement.
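One way to see why the sample size is only approximately `fraction * total_rows` is to model sampling without replacement as an independent coin flip per row. The sketch below is a simplified plain-Python model under that assumption; `bernoulli_sample` is a hypothetical helper, and Spark's own sampler differs in its details.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))

sample_a = bernoulli_sample(rows, 0.3, seed=123)
sample_b = bernoulli_sample(rows, 0.3, seed=123)

# The size is close to 300 but not exactly 300; the same seed repeats the same sample.
print(len(sample_a), sample_a == sample_b)
```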

**Syntax**

`DataFrame.sample(withReplacement=None, fraction=None, seed=None)`

In [None]:
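# Example: sample()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SampleExample").getOrCreate()

# Hypothetical employee data (not from the original notebook), defined here so that
# this cell and the sampleBy() cells below have a `df` to work with.
data = [
    ("Alice", 25, "Female", 101),
    ("Bob", 32, "Male", 102),
    ("Charlie", 41, "Male", 101),
    ("Diana", 29, "Female", 103),
    ("Ethan", 36, "Male", 102),
    ("Fiona", 45, "Female", 101),
]
columns = ["name", "age", "gender", "department_id"]
df = spark.createDataFrame(data, columns)

# Simple random sampling with replacement (sample about 30% of the data)
sampled_df_with_replacement = df.sample(withReplacement=True, fraction=0.3, seed=123)

print("Sampled DataFrame with Replacement:")
sampled_df_with_replacement.show()

# Simple random sampling without replacement (sample about 30% of the data)
sampled_df_without_replacement = df.sample(withReplacement=False, fraction=0.3, seed=123)

print("Sampled DataFrame without Replacement:")
sampled_df_without_replacement.show()

# Note: The number of rows in the sampled DataFrame is not exactly
# fraction * total_rows because sampling is probabilistic. With only a
# handful of rows, a sample can even come back empty.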

## `sampleBy()` in PySpark

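`sampleBy()` performs stratified sampling: you give a sampling fraction for each value (stratum) of a column, and rows are sampled from each stratum at that rate. A stratum that is missing from the fractions dictionary is treated as having fraction 0, so none of its rows are returned.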

It's useful when you have an imbalanced dataset and want to ensure that each category is represented in your sample according to a specified proportion.
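A plain-Python sketch of stratified sampling, assuming each row is a dict and each stratum has its own keep-probability. The helper `sample_by` and the sample rows are hypothetical; the sketch models the behavior described above, not Spark's implementation.

```python
import random


def sample_by(rows, key, fractions, seed=None):
    """Keep each row with the probability assigned to its stratum."""
    rng = random.Random(seed)
    kept = []
    for row in rows:
        # Strata missing from `fractions` get probability 0 and are never kept.
        fraction = fractions.get(row[key], 0.0)
        if rng.random() < fraction:
            kept.append(row)
    return kept


rows = (
    [{"id": i, "gender": "Male"} for i in range(100)]
    + [{"id": i, "gender": "Female"} for i in range(100, 120)]
    + [{"id": i, "gender": "Other"} for i in range(120, 130)]
)

sampled = sample_by(rows, "gender", {"Male": 0.5, "Female": 1.0}, seed=42)
counts = {}
for row in sampled:
    counts[row["gender"]] = counts.get(row["gender"], 0) + 1
print(counts)
```

Every "Female" row is kept (fraction 1.0), roughly half of the "Male" rows are kept, and the "Other" rows disappear because that stratum has no entry in the dictionary.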

**Syntax**

`DataFrame.sampleBy(col, fractions, seed=None)`

In [None]:
# Example: sampleBy()

# Assume df is an employee DataFrame with 'gender' and 'department_id' columns

# Define fractions for stratified sampling by 'gender'
# Sample 50% of 'Male' and 100% of 'Female'
gender_fractions = {"Male": 0.5, "Female": 1.0}

# Perform stratified sampling
sampled_df_by_gender = df.sampleBy("gender", gender_fractions, seed=42)

print("Sampled DataFrame by Gender:")
sampled_df_by_gender.show()

# Example: sampleBy() by 'department_id'
# Sample 80% from department 101, 50% from 102, and 100% from 103
dept_fractions = {101: 0.8, 102: 0.5, 103: 1.0}

# Perform stratified sampling
sampled_df_by_dept = df.sampleBy("department_id", dept_fractions, seed=42)

print("Sampled DataFrame by Department ID:")
sampled_df_by_dept.show()

Here are some additional examples to further illustrate `sample()` and `sampleBy()`.

In [None]:
# More Examples for sample()

# Sample 50% of data with replacement
sampled_df_with_replacement_50 = df.sample(withReplacement=True, fraction=0.5, seed=456)
print("Sampled DataFrame with Replacement (50%):")
sampled_df_with_replacement_50.show()

# Sample 20% of data without replacement
sampled_df_without_replacement_20 = df.sample(withReplacement=False, fraction=0.2, seed=789)
print("Sampled DataFrame without Replacement (20%):")
sampled_df_without_replacement_20.show()

# Sample 100% of data without replacement (should return the original DataFrame approximately)
sampled_df_without_replacement_100 = df.sample(withReplacement=False, fraction=1.0, seed=1011)
print("Sampled DataFrame without Replacement (100%):")
sampled_df_without_replacement_100.show()

In [None]:
# More Examples for sampleBy()

# Sample different fractions based on 'age' groups
# For simplicity, let's create age groups
from pyspark.sql.functions import when, col

df_with_age_group = df.withColumn("age_group",
    when(col("age") < 30, "young")
    .when((col("age") >= 30) & (col("age") < 40), "middle_aged")
    .otherwise("senior")
)

print("DataFrame with Age Group:")
df_with_age_group.show()

# Define fractions for sampling by 'age_group'
age_group_fractions = {"young": 0.7, "middle_aged": 0.4, "senior": 1.0}

# Perform stratified sampling by 'age_group'
sampled_df_by_age_group = df_with_age_group.sampleBy("age_group", age_group_fractions, seed=1213)

print("Sampled DataFrame by Age Group:")
sampled_df_by_age_group.show()

# Another example: sample by 'gender' with different fractions and a different seed
gender_fractions_2 = {"Male": 0.6, "Female": 0.9}
sampled_df_by_gender_2 = df.sampleBy("gender", gender_fractions_2, seed=1415)

print("Sampled DataFrame by Gender (different seed):")
sampled_df_by_gender_2.show()

## `split()` in PySpark

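The `split()` function in PySpark splits a string column into an array of strings. The delimiter is interpreted as a regular expression (Java regex syntax), so characters such as `.` or `|` must be escaped to be matched literally. An optional `limit` argument caps the number of elements in the resulting array. It's a function available in `pyspark.sql.functions`.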
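The regex delimiter and the `limit` argument can be approximated with Python's `re.split`. This is a sketch under the assumption that a positive `limit` caps the array length and a non-positive `limit` means "no cap"; `split_model` is a hypothetical helper, and Java and Python regex dialects differ in some details.

```python
import re


def split_model(text, pattern, limit=-1):
    """Approximate split(str, pattern, limit) using Python's re module."""
    if limit == 1:
        return [text]  # at most one element: nothing is split off
    if limit > 1:
        # At most `limit` elements; the last one holds the unsplit remainder.
        return re.split(pattern, text, maxsplit=limit - 1)
    return re.split(pattern, text)  # limit <= 0: split as often as possible


print(split_model("apple,banana,orange", ","))
print(split_model("grape;kiwi", "[,;]"))     # regex character class
print(split_model("a_b_c_d_e", "_", 2))      # remainder stays in the last element
print(split_model("a.b", r"\."))             # escaped dot matches a literal "."
```

An unescaped `.` would match every character, which is why the last call escapes it.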

**Syntax**

`split(str, pattern, limit=-1)`

In [None]:
# Example 1: Basic split()

from pyspark.sql.functions import split, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SplitExample").getOrCreate()

# Sample DataFrame
data = [("apple,banana,orange",), ("grape;kiwi",), ("mango",)]
columns = ["fruits"]
df_fruits = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_fruits.show(truncate=False)

# Split the 'fruits' column by comma
df_split_comma = df_fruits.withColumn("fruit_list_comma", split(col("fruits"), ","))

print("DataFrame after splitting by comma:")
df_split_comma.show(truncate=False)

# Split the 'fruits' column by semicolon
df_split_semicolon = df_fruits.withColumn("fruit_list_semicolon", split(col("fruits"), ";"))

print("DataFrame after splitting by semicolon:")
df_split_semicolon.show(truncate=False)

# Split by both comma and semicolon using regex
df_split_regex = df_fruits.withColumn("fruit_list_regex", split(col("fruits"), "[,;]"))

print("DataFrame after splitting by comma or semicolon (regex):")
df_split_regex.show(truncate=False)

In [None]:
# Example 2: Using the limit parameter

from pyspark.sql.functions import split, col
from pyspark.sql import SparkSession

# Sample DataFrame
data = [("a_b_c_d_e",), ("x_y",), ("z",)]
columns = ["text"]
df_limit = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_limit.show()

# Split with limit = 2
df_split_limit_2 = df_limit.withColumn("split_limit_2", split(col("text"), "_", 2))

print("DataFrame after splitting with limit = 2:")
df_split_limit_2.show(truncate=False)

# Split with limit = 0 (same as -1)
df_split_limit_0 = df_limit.withColumn("split_limit_0", split(col("text"), "_", 0))

print("DataFrame after splitting with limit = 0:")
df_split_limit_0.show(truncate=False)

# Split with limit = -1 (default)
df_split_limit_neg1 = df_limit.withColumn("split_limit_neg1", split(col("text"), "_", -1))

print("DataFrame after splitting with limit = -1:")
df_split_limit_neg1.show(truncate=False)

## `concat_ws()` in PySpark

The `concat_ws()` function (concatenate with separator) is used to concatenate multiple string columns together into a single string column, with a specified separator placed between each concatenated value. It's a function available in `pyspark.sql.functions`.

This function is useful for combining information from different columns into a more readable format or preparing data for output.

**Syntax**

`concat_ws(sep, *cols)`

In [None]:
# Example 1: Basic concat_ws()

from pyspark.sql.functions import concat_ws, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ConcatWSExample").getOrCreate()

# Sample DataFrame
data = [
    ("John", "Doe", "USA"),
    ("Jane", "Smith", "Canada"),
    ("Peter", "Jones", "UK"),
    (None, "Brown", "Germany"), # Example with a null value
    ("Alice", None, "France")  # Example with a null value
]
columns = ["first_name", "last_name", "country"]
df_names = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_names.show()

# Concatenate first_name and last_name with a space
df_full_name = df_names.withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))

print("DataFrame with full_name:")
df_full_name.show()

# Concatenate first_name, last_name, and country with a comma and space
df_full_info = df_names.withColumn("full_info", concat_ws(", ", col("first_name"), col("last_name"), col("country")))

print("DataFrame with full_info:")
df_full_info.show()

**Handling NULLs:**

`concat_ws()` gracefully handles NULL values. If a column value is NULL, it is simply skipped, and the separator is not added for that specific value.
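The NULL-skipping rule can be sketched in plain Python. `concat_ws_model` is a hypothetical helper that models the behavior described here (NULL inputs are skipped, array inputs are flattened); it is not Spark's implementation.

```python
def concat_ws_model(sep, *values):
    """Join values with `sep`, skipping None and flattening lists."""
    if sep is None:
        return None  # a null separator yields a null result
    parts = []
    for value in values:
        if value is None:
            continue  # null inputs are skipped, with no separator added
        if isinstance(value, (list, tuple)):
            parts.extend(item for item in value if item is not None)
        else:
            parts.append(value)
    return sep.join(parts)


print(concat_ws_model(" ", None, "Brown"))
print(concat_ws_model(", ", None, "Brown", "Germany"))
print(concat_ws_model("-", ["red", "green"]))
print(repr(concat_ws_model("-", None)))  # a null array gives an empty string
```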

In [None]:
# Example 2: concat_ws() with array column

from pyspark.sql.functions import concat_ws, col
from pyspark.sql import SparkSession

# Sample DataFrame with an array column
data = [
    ("apple", ["red", "green"]),
    ("banana", ["yellow"]),
    ("orange", ["orange", "sweet", "citrus"]),
    ("grape", []), # Empty array
    ("kiwi", None) # Null array
]
columns = ["fruit", "properties"]
df_fruits_props = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_fruits_props.show(truncate=False)

# Concatenate elements of the 'properties' array with a hyphen
df_props_string = df_fruits_props.withColumn("properties_string", concat_ws("-", col("properties")))

print("DataFrame with properties_string (concatenated array):")
df_props_string.show(truncate=False)

# Concatenate fruit name and properties array elements
df_combined = df_fruits_props.withColumn("fruit_and_props", concat_ws(":", col("fruit"), concat_ws(",", col("properties"))))

print("DataFrame with fruit_and_props:")
df_combined.show(truncate=False)

**Example 1: Basic `concat_ws()`**

This example combines string columns with a specified separator.

*   **Original DataFrame**: holds the `first_name`, `last_name`, and `country` columns, including some rows with NULL values.
*   **`concat_ws(" ", col("first_name"), col("last_name"))`**: adds the `full_name` column. The first argument, `" "`, is the separator; the remaining arguments are the columns to join. The row with a NULL `first_name` yields just "Brown", and the row with a NULL `last_name` yields just "Alice". NULL values are skipped and no separator is added for them.
*   **`concat_ws(", ", col("first_name"), col("last_name"), col("country"))`**: adds the `full_info` column, using `", "` as the separator. For the row with a NULL first name the result is "Brown, Germany": the NULL and the separator that would have followed it are both left out.

**Example 2: `concat_ws()` with an array column**

This example shows that `concat_ws()` also accepts an array of strings.

*   **Original DataFrame**: has a `fruit` column and a `properties` column, which is an array of strings. The rows cover multiple elements, a single element, an empty array, and a NULL array.
*   **`concat_ws("-", col("properties"))`**: adds the `properties_string` column, joining the array elements with a hyphen. Both the empty array and the NULL array produce an empty string, because `concat_ws()` skips NULL inputs and returns NULL only when the separator itself is NULL.
*   **`concat_ws(":", col("fruit"), concat_ws(",", col("properties")))`**: nests two calls. The inner call joins the array elements with a comma; the outer call joins the `fruit` column and that result with a colon. Rows with an empty or NULL `properties` array still show the fruit name and the colon, followed by nothing (for example, "grape:").

In summary, `concat_ws()` combines string columns or the elements of a string array, gives you control over the separator, and skips NULL values.



## Regular Expression Methods in PySpark

PySpark provides functions in `pyspark.sql.functions` for working with regular expressions on string columns. Two common ones are `regexp_extract()` and `regexp_replace()`.
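The behavior of the two functions can be approximated with Python's `re` module. This is a sketch only: the helpers `regexp_extract_model` and `regexp_replace_model` are hypothetical, Spark uses Java regex syntax (for example `$1` rather than `\1` for group references in replacements), and the empty-string result for a non-match reflects how `regexp_extract()` is documented to behave.

```python
import re


def regexp_extract_model(text, pattern, idx):
    """Return capture group `idx` of the first match, or "" if nothing matches."""
    if text is None:
        return None
    match = re.search(pattern, text)
    if match is None:
        return ""
    return match.group(idx) or ""


def regexp_replace_model(text, pattern, replacement):
    """Replace every match of `pattern` with `replacement`."""
    if text is None:
        return None
    return re.sub(pattern, replacement, text)


print(regexp_extract_model("order-1234-us", r"(\d+)", 1))
print(repr(regexp_extract_model("no digits here", r"(\d+)", 1)))
print(regexp_replace_model("2024-01-15", r"\d", "*"))
```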

## `translate()` in PySpark

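The `translate()` function in PySpark performs character-by-character replacement on a string column. Each character of the input that appears in the `matching` string is replaced by the character at the same position in the `replace` string. It does not match `matching` as a substring. If `replace` is shorter than `matching`, the characters without a counterpart are removed.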
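The per-character mapping can be sketched in plain Python. `translate_model` is a hypothetical helper that mirrors the rules stated above; keeping the first mapping when a character repeats in `matching` is an assumption of this sketch.

```python
def translate_model(text, matching, replace):
    """Map each character in `matching` to the character at the same index in `replace`."""
    if text is None:
        return None
    table = {}
    for i, ch in enumerate(matching):
        if ch not in table:  # assumption: the first mapping for a character wins
            # Characters with no counterpart in `replace` map to "" (removed).
            table[ch] = replace[i] if i < len(replace) else ""
    return "".join(table.get(ch, ch) for ch in text)


print(translate_model("abcdefg", "abc", "xyz"))
print(translate_model("hello world", "aeiou", "123"))   # 'o' and 'u' are removed
print(translate_model("PySpark", "xyz", "12345"))       # extra '4' and '5' are unused
```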

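**Syntax**

`translate(srcCol, matching, replace)`, where `matching` holds the characters to look for (the 'from' string) and `replace` holds their replacements (the 'to' string).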

In [None]:
# Example 1: Basic translate()

from pyspark.sql.functions import translate, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TranslateExample").getOrCreate()

# Sample DataFrame
data = [
    ("abcdefg",),
    ("12345",),
    ("hello world",),
    ("PySpark",),
    (None,) # Example with a null value
]
columns = ["text"]
df_text = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_text.show()

# Map each character in 'abc' to the character at the same position in 'xyz'
# 'a' is replaced by 'x', 'b' by 'y', 'c' by 'z' (the substring 'abc' does not need to appear)
df_translated_basic = df_text.withColumn("translated_text", translate(col("text"), "abc", "xyz"))

print("DataFrame after translate(col('text'), 'abc', 'xyz'):")
df_translated_basic.show()

# Replace digits with asterisks
df_translated_digits = df_text.withColumn("translated_digits", translate(col("text"), "0123456789", "**********"))

print("DataFrame after translate(col('text'), '0123456789', '**********'):")
df_translated_digits.show()

In [None]:
# Example 2: Unequal lengths of 'from' and 'to' characters

from pyspark.sql.functions import translate, col

# Assume df_text is already created from Example 1

print("Original DataFrame:")
df_text.show()

# Replace 'aeiou' with '123'
# 'a' -> '1', 'e' -> '2', 'i' -> '3'. 'o' and 'u' are removed.
df_translated_unequal = df_text.withColumn("translated_unequal", translate(col("text"), "aeiou", "123"))

print("DataFrame after translate(col('text'), 'aeiou', '123'):")
df_translated_unequal.show()

# Replace 'xyz' with '12345'
# 'x' -> '1', 'y' -> '2', 'z' -> '3'. The extra '4' and '5' in 'to' have no counterpart in 'from' and are ignored.
df_translated_unequal_2 = df_text.withColumn("translated_unequal_2", translate(col("text"), "xyz", "12345"))

print("DataFrame after translate(col('text'), 'xyz', '12345'):")
df_translated_unequal_2.show()

In [None]:
# Example 3: Removing characters

from pyspark.sql.functions import translate, col

# Assume df_text is already created from Example 1

print("Original DataFrame:")
df_text.show()

# Remove all vowels
# 'from' contains vowels, 'to' is an empty string
df_translated_remove_vowels = df_text.withColumn("no_vowels", translate(col("text"), "aeiouAEIOU", ""))

print("DataFrame after removing vowels:")
df_translated_remove_vowels.show()

# Remove spaces and commas
df_translated_remove_chars = df_text.withColumn("no_spaces_commas", translate(col("text"), " ,", ""))

print("DataFrame after removing spaces and commas:")
df_translated_remove_chars.show()

When the 'from' and 'to' strings passed to `translate()` differ in length, the translation is still done character by character, based on position in the two strings:

- If the 'from' string is longer than the 'to' string, the characters in 'from' that have no counterpart at the same position in 'to' are removed from the input string.
- If the 'to' string is longer than the 'from' string, the extra characters in 'to' are ignored.

Example 2 shows both cases. When translating 'aeiou' to '123', 'a' becomes '1', 'e' becomes '2', and 'i' becomes '3', but 'o' and 'u' are removed because '123' has no 4th or 5th character.
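
The positional rule for unequal lengths can be modeled in a few lines of plain Python, which makes it easy to predict what a `translate()` call will return before running it on a cluster. This is a sketch of the semantics only, not Spark's implementation; `translate_model` is a hypothetical helper, and keeping the first mapping for a repeated 'from' character is a choice made in this sketch.

```python
def translate_model(text, from_chars, to_chars):
    """Plain-Python model of translate(): positional, character-by-character mapping."""
    if text is None:
        return None  # a NULL input stays NULL
    mapping = {}
    for i, ch in enumerate(from_chars):
        # Characters past the end of to_chars map to "" (they are deleted).
        # setdefault keeps the first mapping if a character repeats in from_chars.
        mapping.setdefault(ch, to_chars[i] if i < len(to_chars) else "")
    # Characters that are not in from_chars pass through unchanged.
    return "".join(mapping.get(ch, ch) for ch in text)


print(translate_model("hello world", "aeiou", "123"))  # h2ll wrld
print(translate_model("PySpark", "xyz", "12345"))      # P2Spark
print(translate_model("hello world", " ,", ""))        # helloworld
```

The second call shows the other direction: only 'y' occurs in "PySpark", so it becomes '2', and the unused '4' and '5' have no effect.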

## `substring()` in PySpark

The `substring()` function in PySpark extracts part of a string column. It takes a starting position and the length of the substring to extract. Positions are 1-based, and a negative position counts back from the end of the string.

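**Syntax**

`substring(str, pos, len)`, where `str` is the string column, `pos` is the starting position, and `len` is the number of characters to extract.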

In [None]:
# Example 1: Basic substring()

from pyspark.sql.functions import substring, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SubstringExample").getOrCreate()

# Sample DataFrame
data = [
    ("abcdefg",),
    ("PySpark",),
    ("Data Science",),
    (None,) # Example with a null value
]
columns = ["text"]
df_text = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_text.show()

# Extract substring starting from position 3 with length 4
df_substring_basic = df_text.withColumn("substring_example", substring(col("text"), 3, 4))

print("DataFrame after substring(col('text'), 3, 4):")
df_substring_basic.show()

# Extract substring from the beginning (position 1) with length 3
df_substring_start = df_text.withColumn("substring_from_start", substring(col("text"), 1, 3))

print("DataFrame after substring(col('text'), 1, 3):")
df_substring_start.show()

In [None]:
# Example 2: Using negative position

from pyspark.sql.functions import substring, col

# Assume df_text is already created from Example 1

print("Original DataFrame:")
df_text.show()

# Extract the last 3 characters: start 3 characters from the end, length 3
df_substring_negative_pos = df_text.withColumn("substring_negative", substring(col("text"), -3, 3))

print("DataFrame after substring(col('text'), -3, 3):")
df_substring_negative_pos.show()

# Extract 2 characters, starting 5 characters from the end
df_substring_negative_pos_2 = df_text.withColumn("substring_negative_2", substring(col("text"), -5, 2))

print("DataFrame after substring(col('text'), -5, 2):")
df_substring_negative_pos_2.show()

In [None]:
# Example 3: Handling lengths longer than the remaining string

from pyspark.sql.functions import substring, col

# Assume df_text is already created from Example 1

print("Original DataFrame:")
df_text.show()

# Extract a substring starting at position 5 with a length longer than the remaining string;
# the result simply stops at the end of the string
df_substring_long_len = df_text.withColumn("substring_long", substring(col("text"), 5, 10))

print("DataFrame after substring(col('text'), 5, 10):")
df_substring_long_len.show()

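# Extract a substring starting at position 10, which is beyond the end of the 7-character strings
# (they return an empty string); "Data Science" is long enough and returns "nce"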
df_substring_invalid_pos = df_text.withColumn("substring_invalid", substring(col("text"), 10, 3))

print("DataFrame after substring(col('text'), 10, 3):")
df_substring_invalid_pos.show()
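
To predict what the three `substring()` examples return, the indexing rules (1-based positions, negative positions counted from the end, results cut off at the end of the string) can be modeled in plain Python. This is a sketch under those assumptions rather than Spark's implementation; `substring_model` is a hypothetical helper, and its handling of unusual inputs such as position 0 is an assumption.

```python
def substring_model(text, pos, length):
    """Plain-Python model of substring(col, pos, length) with 1-based positions."""
    if text is None:
        return None  # a NULL input stays NULL
    n = len(text)
    if pos > 0:
        start = pos - 1   # 1-based position -> 0-based index
    elif pos < 0:
        start = n + pos   # negative position counts from the end
    else:
        start = 0         # assumption: position 0 behaves like position 1
    end = start + length
    if end <= start or end <= 0:
        return ""
    # Slicing past the end of a Python string stops at the last character.
    return text[max(start, 0):end]


print(substring_model("abcdefg", 3, 4))        # cdef
print(substring_model("Data Science", -5, 2))  # ie
print(substring_model("PySpark", 5, 10))       # ark
print(repr(substring_model("abcdefg", 10, 3))) # ''
```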

## `regexp_extract()`

`regexp_extract()` extracts the part of a string matched by a specific capture group of a regular expression pattern. If the pattern does not match, the result is an empty string rather than NULL.

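**Syntax**

`regexp_extract(str, pattern, idx)`, where `idx` is the number of the capture group to return (0 returns the whole match).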

In [None]:
# Example 1: regexp_extract()

from pyspark.sql.functions import regexp_extract, col
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RegexExample").getOrCreate()

# Sample DataFrame
data = [
    ("user_123_abc",),
    ("another_user_456_xyz",),
    ("id_789",),
    ("no_match",)
]
columns = ["text"]
df_regex = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df_regex.show()

# Extract the numbers after "user_"
# Pattern: "user_" followed by one or more digits (\d+)
# Group 1: the digits captured by (\d+)
df_extracted = df_regex.withColumn("extracted_number", regexp_extract(col("text"), r"user_(\d+)", 1))

print("DataFrame after extracting numbers:")
df_extracted.show()

# Extract text after "user_"
# Pattern: "user_" followed by anything (.*)
# Group 1: the text captured by (.*)
df_extracted_text = df_regex.withColumn("extracted_text", regexp_extract(col("text"), r"user_(.*)", 1))

print("DataFrame after extracting text:")
df_extracted_text.show()
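
The two extractions above can be previewed with Python's built-in `re` module. Spark uses Java regular expressions, so this is only an approximation that happens to agree for simple patterns such as these on ASCII input; `regexp_extract_model` is a hypothetical helper, not part of PySpark.

```python
import re


def regexp_extract_model(text, pattern, idx):
    """Plain-Python model of regexp_extract(): first match, chosen capture group."""
    if text is None:
        return None  # a NULL input stays NULL
    match = re.search(pattern, text)
    if match is None:
        return ""  # no match gives an empty string, not None
    return match.group(idx) or ""


texts = ["user_123_abc", "another_user_456_xyz", "id_789", "no_match"]

numbers = [regexp_extract_model(t, r"user_(\d+)", 1) for t in texts]
print(numbers)  # ['123', '456', '', '']

rest = [regexp_extract_model(t, r"user_(.*)", 1) for t in texts]
print(rest)     # ['123_abc', '456_xyz', '', '']
```

Rows that do not contain "user_" produce an empty string, which is worth keeping in mind when filtering on the extracted column.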

## `regexp_replace()`

`regexp_replace()` is used to replace all occurrences of a substring that matches a regular expression pattern with another string.

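**Syntax**

`regexp_replace(column, pattern, replacement)`: the string column, the regular expression to match, and the replacement text, in that order.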

In [None]:
# Example 2: regexp_replace()

from pyspark.sql.functions import regexp_replace, col

# Assume df_regex is already created from the previous example

print("Original DataFrame:")
df_regex.show()

# Replace each run of consecutive digits with a single 'X' (\d+ matches the whole run)
df_replaced_digits = df_regex.withColumn("replaced_digits", regexp_replace(col("text"), r"\d+", "X"))

print("DataFrame after replacing digits:")
df_replaced_digits.show()

# Replace "user_" with "id_"
df_replaced_user = df_regex.withColumn("replaced_user", regexp_replace(col("text"), "user_", "id_"))

print("DataFrame after replacing 'user_':")
df_replaced_user.show()

# Remove the first underscore and everything after it (the underscore needs no escaping)
df_removed_after_underscore = df_regex.withColumn("removed_after_underscore", regexp_replace(col("text"), r"_.*", ""))

print("DataFrame after removing text after underscore:")
df_removed_after_underscore.show()
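
The effect of these patterns can be checked quickly with Python's `re.sub`, which replaces every match just as `regexp_replace()` does. Spark uses Java regular expressions, so treat this as an approximation that agrees for these simple patterns on ASCII input; the variable names are introduced here only for illustration.

```python
import re

texts = ["user_123_abc", "another_user_456_xyz", "id_789", "no_match"]

# \d+ matches a whole run of digits, so each run becomes a single 'X'
runs_replaced = [re.sub(r"\d+", "X", t) for t in texts]
print(runs_replaced)    # ['user_X_abc', 'another_user_X_xyz', 'id_X', 'no_match']

# \d matches one digit at a time, so every digit becomes its own 'X'
digits_replaced = [re.sub(r"\d", "X", t) for t in texts]
print(digits_replaced)  # ['user_XXX_abc', 'another_user_XXX_xyz', 'id_XXX', 'no_match']

# _.* matches from the first underscore to the end of the string
prefix_only = [re.sub(r"_.*", "", t) for t in texts]
print(prefix_only)      # ['user', 'another', 'id', 'no']
```

Choosing between `\d+` and `\d` decides whether the length of the original number is still visible in the masked output.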

In [None]:
# Example 3: Accessing elements of the resulting array

from pyspark.sql.functions import split, col

# Assume df_fruits is already created from Example 1

df_split_comma = df_fruits.withColumn("fruit_list_comma", split(col("fruits"), ","))

# Access the first element (index 0)
df_first_fruit = df_split_comma.withColumn("first_fruit", col("fruit_list_comma")[0])

print("DataFrame with the first fruit:")
df_first_fruit.show(truncate=False)

# Access the second element (index 1)
df_second_fruit = df_split_comma.withColumn("second_fruit", col("fruit_list_comma")[1])

print("DataFrame with the second fruit:")
df_second_fruit.show(truncate=False)

## `pivot()` in PySpark

`pivot()` is a transformation used to rotate a table-valued expression by turning the unique values from one column into multiple columns. It's commonly used for data aggregation and reshaping, similar to a pivot table in spreadsheet software.

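**Syntax**

`df.groupBy(grouping_cols).pivot(pivot_col, values=None).agg(aggregation)`, where `values` is an optional list of the pivot column values to turn into columns.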

In [None]:
# Example 1: Basic pivot()

# Sample data
data = [
    ("USA", "ProductA", 100),
    ("USA", "ProductB", 150),
    ("Canada", "ProductA", 120),
    ("Canada", "ProductC", 200),
    ("USA", "ProductB", 180),
    ("Canada", "ProductA", 130)
]

columns = ["country", "product", "amount"]

df_sales = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df_sales.show()

# Pivot the data to show total amount by country and product
pivot_df = df_sales.groupBy("country").pivot("product").sum("amount")

print("Pivoted DataFrame:")
pivot_df.show()

# Note: NULL values appear where a combination of grouping column and pivot column value does not exist in the original data.

In [None]:
# Example 2: pivot() with specified values

# It's generally recommended to provide a list of values to the pivot function
# to avoid collecting all unique values from a large dataset.

product_values = ["ProductA", "ProductB", "ProductC"]

pivot_df_specified = df_sales.groupBy("country").pivot("product", product_values).sum("amount")

print("Pivoted DataFrame with specified values:")
pivot_df_specified.show()
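
To see what the pivoted table should contain for this sample data, the group-pivot-sum logic can be modeled with plain Python dictionaries. This is a sketch of the reshaping only, not how Spark executes it; `pivot_sum_model` is a hypothetical helper, and the rows are copied from Example 1.

```python
from collections import defaultdict

rows = [
    ("USA", "ProductA", 100),
    ("USA", "ProductB", 150),
    ("Canada", "ProductA", 120),
    ("Canada", "ProductC", 200),
    ("USA", "ProductB", 180),
    ("Canada", "ProductA", 130),
]


def pivot_sum_model(rows, values=None):
    """Group by country, turn product values into columns, and sum the amounts."""
    totals = defaultdict(dict)
    for country, product, amount in rows:
        totals[country][product] = totals[country].get(product, 0) + amount
    if values is None:
        # Without an explicit list, the distinct pivot values must be found first
        values = sorted({product for _, product, _ in rows})
    # Missing country/product combinations become None (NULL in Spark)
    return {
        country: {value: per_product.get(value) for value in values}
        for country, per_product in totals.items()
    }


result = pivot_sum_model(rows)
for country in sorted(result):
    print(country, result[country])
# Canada {'ProductA': 250, 'ProductB': None, 'ProductC': 200}
# USA {'ProductA': 100, 'ProductB': 330, 'ProductC': None}
```

Passing `values` explicitly skips the extra pass that finds the distinct products, which is the same reason Example 2 supplies `product_values`.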

In [None]:
# Example 3: pivot() with multiple aggregations

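from pyspark.sql.functions import avg, count, sum as spark_sum

# Use Spark's sum (imported as spark_sum); Python's built-in sum() cannot aggregate a column
pivot_df_multi_agg = df_sales.groupBy("country").pivot("product", product_values).agg(spark_sum("amount").alias("total_amount"), count("amount").alias("count"))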

print("Pivoted DataFrame with multiple aggregations:")
pivot_df_multi_agg.show()

In [None]:
# Example 4: pivot() on a different dataset (using the employee df from earlier)

# Pivot the employee data to show average salary by department and gender
pivot_employee_df = df.groupBy("department_id").pivot("gender").agg(avg("salary"))

print("Pivoted Employee DataFrame (Average Salary by Department and Gender):")
pivot_employee_df.show()