**What is collect() in PySpark?**

*   collect() is an action in PySpark (not a transformation).
*   It retrieves all rows of a DataFrame to the driver node as a list of Row objects (on an RDD, it returns a list of the RDD's elements).
*   Since Spark works in a distributed environment, data is spread across executors.
*   collect() brings everything back to your local Python process.


Warning: Don’t use collect() on very large datasets (it can cause OutOfMemoryError). Use it only for small results, testing, or debugging.

Syntax

DataFrame.collect()

Returns:
A list of Row objects, where each row represents a record from the DataFrame.




In [2]:
#Example 1: Basic Usage
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CollectExample").getOrCreate()

# Sample DataFrame
data = [("Alice", 25, 3000), ("Bob", 30, 4000), ("Charlie", 28, 5000)]
df = spark.createDataFrame(data, ["name", "age", "salary"])

df.show()

# Collect all rows
result = df.collect()

print(result)


+-------+---+------+
|   name|age|salary|
+-------+---+------+
|  Alice| 25|  3000|
|    Bob| 30|  4000|
|Charlie| 28|  5000|
+-------+---+------+

[Row(name='Alice', age=25, salary=3000), Row(name='Bob', age=30, salary=4000), Row(name='Charlie', age=28, salary=5000)]


In [3]:
#Example 2: Iterating Over Collected Data
rows = df.collect()

for row in rows:
    print(row["name"], row["age"], row["salary"])


Alice 25 3000
Bob 30 4000
Charlie 28 5000


In [4]:
#Example 3: Convert to Pandas
#For local analysis, toPandas() converts the DataFrame directly. Like collect(), it loads every row onto the driver:

pandas_df = df.toPandas()
print(pandas_df)


      name  age  salary
0    Alice   25    3000
1      Bob   30    4000
2  Charlie   28    5000


In [5]:
#Example 4: Collect with Filtering

result = df.filter(df["age"] > 26).collect()
for row in result:
    print(row.name, row.salary)


Bob 4000
Charlie 5000


When to Use collect()?

Use collect() when:

•	Dataset is small enough to fit in memory.

•	You want to debug, print, or inspect results locally.

•	You’re passing results to an external Python library (like Pandas, NumPy, Matplotlib).

Avoid collect() when:
•	Dataset is large (millions of rows, GBs of data), since pulling it all onto the driver can cause out-of-memory errors.

Tip: For safer alternatives use:

•	show(n) → prints first n rows nicely.

•	take(n) → returns first n rows as list.

•	limit(n).collect() → collects only a subset.
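
The limit(n).collect() idea above can be wrapped in a small guard that refuses to collect a DataFrame that turns out to be larger than expected. This is a minimal sketch: `collect_at_most` and `max_rows` are hypothetical names chosen for illustration, while limit() and collect() are the standard DataFrame methods.

```python
def collect_at_most(df, max_rows=1000):
    """Collect df to the driver, but refuse if it has more than max_rows rows.

    collect_at_most and max_rows are hypothetical names for this sketch.
    """
    # Ask for one extra row so "exactly max_rows" can be told apart from "too many"
    rows = df.limit(max_rows + 1).collect()
    if len(rows) > max_rows:
        raise ValueError(
            f"DataFrame has more than {max_rows} rows; refusing to collect it"
        )
    return rows

# Usage with a DataFrame such as the df from the examples above:
#   rows = collect_at_most(df, max_rows=100)
```

Because the limit is applied before collect(), at most max_rows + 1 rows ever travel to the driver, even when the guard trips.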



What is transform() in PySpark?

•	transform() is available on DataFrame objects (Spark 3.0 and later).

•	It allows you to apply a function (transformation) to a DataFrame in a clean, reusable, and chainable way.

•	Instead of writing complex transformations inline, you can wrap them in functions and pass them to transform().

•	It improves readability and reusability of your PySpark code.
________________________________________
Syntax
DataFrame.transform(func)

•	func → a Python function that takes a DataFrame as input and returns a DataFrame.

•	Returns → the transformed DataFrame.


In [6]:
#Example 1: Basic Usage
from pyspark.sql.functions import col
# Sample DataFrame
data = [("Alice", 25, 3000), ("Bob", 30, 4000), ("Charlie", 28, 5000)]
df = spark.createDataFrame(data, ["name", "age", "salary"])

df.show()


+-------+---+------+
|   name|age|salary|
+-------+---+------+
|  Alice| 25|  3000|
|    Bob| 30|  4000|
|Charlie| 28|  5000|
+-------+---+------+



In [7]:
# Define a function to add a 10% bonus to salary

def add_bonus(dataframe):
    return dataframe.withColumn("salary_with_bonus", col("salary") * 1.1)

# Apply using transform
df_transformed = df.transform(add_bonus)
df_transformed.show()


+-------+---+------+------------------+
|   name|age|salary| salary_with_bonus|
+-------+---+------+------------------+
|  Alice| 25|  3000|3300.0000000000005|
|    Bob| 30|  4000|            4400.0|
|Charlie| 28|  5000|            5500.0|
+-------+---+------+------------------+



In [8]:
#Example 2: Chaining Multiple transform() Calls
from pyspark.sql.functions import upper

# Function to uppercase name
def uppercase_name(df):
    return df.withColumn("name_upper", upper(col("name")))

# Function to flag salaries above 4000 (stored as the string "true" or "false")
def categorize_salary(df):
    return df.withColumn("salary_level",
                         (col("salary") > 4000).cast("string"))

# Apply multiple transformations
df_chain = df.transform(uppercase_name).transform(categorize_salary)
df_chain.show()


+-------+---+------+----------+------------+
|   name|age|salary|name_upper|salary_level|
+-------+---+------+----------+------------+
|  Alice| 25|  3000|     ALICE|       false|
|    Bob| 30|  4000|       BOB|       false|
|Charlie| 28|  5000|   CHARLIE|        true|
+-------+---+------+----------+------------+



In [9]:
#Example 3: Inline Transformations with lambda
#You can also pass a lambda function directly:

df_lambda = df.transform(lambda d: d.withColumn("age_plus_5", col("age") + 5))
df_lambda.show()


+-------+---+------+----------+
|   name|age|salary|age_plus_5|
+-------+---+------+----------+
|  Alice| 25|  3000|        30|
|    Bob| 30|  4000|        35|
|Charlie| 28|  5000|        33|
+-------+---+------+----------+
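
transform() calls its function with the DataFrame as the only required argument, so a transformation that needs extra parameters has to have them bound first. One way to do that, sketched here with functools.partial, is shown below. The names `add_scaled_column` and `add_bonus_10pct` are hypothetical; the column names match the sample data used in these examples.

```python
from functools import partial

def add_scaled_column(df, source, target, factor):
    """Return df with a new column: target = source * factor."""
    return df.withColumn(target, df[source] * factor)

# Bind the parameters up front. The result takes only a DataFrame,
# which is the shape transform() expects.
add_bonus_10pct = partial(
    add_scaled_column,
    source="salary",
    target="salary_with_bonus",
    factor=1.1,
)

# Usage with the df from the examples above:
#   df.transform(add_bonus_10pct).show()
```

On PySpark 3.3 and later, transform() also forwards extra positional and keyword arguments to the function, so `df.transform(add_scaled_column, "salary", "salary_with_bonus", 1.1)` should work there without partial.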



In [10]:
#Example 4: Reusable Pipeline with transform()

def pipeline(df):
    return (df
            .transform(add_bonus)         # Step 1: Add bonus
            .transform(uppercase_name)    # Step 2: Uppercase name
            .transform(lambda d: d.filter(col("age") > 26))  # Step 3: Filter
           )

df_pipeline = df.transform(pipeline)
df_pipeline.show()


+-------+---+------+-----------------+----------+
|   name|age|salary|salary_with_bonus|name_upper|
+-------+---+------+-----------------+----------+
|    Bob| 30|  4000|           4400.0|       BOB|
|Charlie| 28|  5000|           5500.0|   CHARLIE|
+-------+---+------+-----------------+----------+



Why Use transform()?

•	Makes code cleaner and modular (define transformations once, reuse many times).

•	Useful when building pipelines of transformations.

•	Works well with a functional programming style in PySpark.


In [11]:
#Direct Method Chaining
#You can chain transformations directly on the DataFrame:

spark = SparkSession.builder.appName("TransformVsChaining").getOrCreate()

data = [("Alice", 25, 3000), ("Bob", 30, 4000), ("Charlie", 28, 5000)]
df = spark.createDataFrame(data, ["name", "age", "salary"])

# Direct chaining
df_chain = (
    df.withColumn("salary_with_bonus", col("salary") * 1.1)
      .withColumn("name_upper", upper(col("name")))
      .filter(col("age") > 26)
)

df_chain.show()


+-------+---+------+-----------------+----------+
|   name|age|salary|salary_with_bonus|name_upper|
+-------+---+------+-----------------+----------+
|    Bob| 30|  4000|           4400.0|       BOB|
|Charlie| 28|  5000|           5500.0|   CHARLIE|
+-------+---+------+-----------------+----------+



In [12]:
#Using transform()
#Instead of repeating logic, you define reusable functions:
# Define reusable transformations

def add_bonus(df):
    return df.withColumn("salary_with_bonus", col("salary") * 1.1)

def uppercase_name(df):
    return df.withColumn("name_upper", upper(col("name")))

def filter_age(df):
    return df.filter(col("age") > 26)

# Apply with transform
df_transformed = (
    df.transform(add_bonus)
      .transform(uppercase_name)
      .transform(filter_age)
)

df_transformed.show()

+-------+---+------+-----------------+----------+
|   name|age|salary|salary_with_bonus|name_upper|
+-------+---+------+-----------------+----------+
|    Bob| 30|  4000|           4400.0|       BOB|
|Charlie| 28|  5000|           5500.0|   CHARLIE|
+-------+---+------+-----------------+----------+



The output is the same, but transform() has these advantages:

•	Modular: Each transformation is a function.

•	Reusable: You can apply add_bonus() or filter_age() to other DataFrames easily.

•	Readable: Clearly separates logical steps.


In [13]:
#Mixing lambda with transform()

#For quick one-off transformations:
df_lambda = (
    df.transform(add_bonus)
      .transform(lambda d: d.withColumn("age_plus_5", col("age") + 5))
)
df_lambda.show()


+-------+---+------+------------------+----------+
|   name|age|salary| salary_with_bonus|age_plus_5|
+-------+---+------+------------------+----------+
|  Alice| 25|  3000|3300.0000000000005|        30|
|    Bob| 30|  4000|            4400.0|        35|
|Charlie| 28|  5000|            5500.0|        33|
+-------+---+------+------------------+----------+



When to Use What?

Approach	Best for
Method chaining	Quick scripts, small transformations, throwaway code
transform()	Reusable pipelines, production code, when the same transformations must be applied to multiple DataFrames
________________________________________
In real-world projects, transform() shines when:

•	You build data pipelines.

•	You want clean, testable, reusable code.

•	You’re working in a team (easier to understand functions like add_bonus than inline chains).


Unlike Pandas, a PySpark DataFrame does not have an apply() method.
Instead, apply-style logic appears in several places in PySpark:

1.	apply() in Pandas UDFs (with pandas_udf)

2.	applyInPandas() on grouped DataFrames

3.	apply() in grouped operations (GroupedData); applyInPandas() is the preferred form

4.	RDD map() / mapPartitions() (the lower-level equivalent to apply logic)

Let’s go through them one by one with code examples:


In [14]:
#apply() in Pandas UDFs
#PySpark integrates with Pandas via vectorized UDFs.
#Here, apply() is used inside Pandas UDF functions.

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


# Sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]
df = spark.createDataFrame(data, ["name", "age"])

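# Define a Pandas UDF that applies a custom transformation
# (Series.apply calls the lambda once per value; age_series + 5 would be the vectorized equivalent)
@pandas_udf("int")
def add_five(age_series: pd.Series) -> pd.Series:
    return age_series.apply(lambda x: x + 5)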

# Use it in a DataFrame
df_with_new = df.withColumn("age_plus_5", add_five(df["age"]))
df_with_new.show()


+-------+---+----------+
|   name|age|age_plus_5|
+-------+---+----------+
|  Alice| 25|        30|
|    Bob| 30|        35|
|Charlie| 28|        33|
+-------+---+----------+



In [15]:
#applyInPandas()
#This method is called on grouped data (df.groupBy(...)) and applies a function to each group as a Pandas DataFrame.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema for output
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age_plus_10", IntegerType(), True)
])

# Function to apply on each Pandas DataFrame
def add_ten(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["age_plus_10"] = pdf["age"] + 10
    return pdf[["name", "age_plus_10"]]

# Use applyInPandas
df_applied = df.groupBy("name").applyInPandas(add_ten, schema=schema)
df_applied.show()

#applyInPandas() is grouped: Spark splits data into groups → converts each group to Pandas → applies the function → merges results back.

+-------+-----------+
|   name|age_plus_10|
+-------+-----------+
|  Alice|         35|
|    Bob|         40|
|Charlie|         38|
+-------+-----------+



In [16]:
#Equivalent of apply() on RDDs
#If you want Pandas-style apply() for row-wise operations, you can use map() on RDDs:

# Convert DataFrame to RDD
rdd = df.rdd

# Apply transformation (similar to row-wise apply)
rdd_applied = rdd.map(lambda row: (row["name"], row["age"] + 2))
print(rdd_applied.collect())


[('Alice', 27), ('Bob', 32), ('Charlie', 30)]


Summary

•	No direct apply() on PySpark DataFrames like Pandas.

•	You use:

  o	pandas_udf with .apply() → for row/column ops inside Pandas.

  o	applyInPandas() → for grouped transformations.

  o	GroupedData + Pandas UDF → for custom aggregations.

  o	RDD .map() → as a lower-level apply.


map() and flatMap() are RDD (Resilient Distributed Dataset) methods, not DataFrame methods. They're used for low-level transformations in Spark.


map() in PySpark

  •	Applies a function to each element of the RDD.

  •	Returns a new RDD where each input element produces exactly one output element.

  •	Output count = Input count (1 → 1 mapping).



Good for element-wise transformations.


In [17]:
#Example: map()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MapFlatMapExample").getOrCreate()

# Create an RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Multiply each element by 2
mapped_rdd = rdd.map(lambda x: x * 2)

print(mapped_rdd.collect())

#Each element produced one output.


[2, 4, 6, 8, 10]


flatMap() in PySpark

•	Similar to map(), but flattens the results.

•	Each input element can produce zero, one, or many output elements.

•	Output count can differ from input count (1 → 0..n mapping).


Good for splitting, expanding, or filtering data.
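
The one-to-one versus one-to-many contrast can be seen without a cluster. The following is a plain-Python analogy, not Spark code: a list comprehension stands in for map(), and itertools.chain.from_iterable stands in for the flattening step of flatMap().

```python
from itertools import chain

lines = ["hello world", "spark map flatmap", "pyspark example"]

# map(): one output per input, so splitting yields a list of lists
mapped = [line.split(" ") for line in lines]

# flatMap(): same function, but the per-element results are concatenated
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(len(lines), len(mapped), len(flat_mapped))  # 3 3 7
```

The input has three elements; the map-style result also has three (each a list), while the flatMap-style result has seven individual words.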


In [18]:
#Example: flatMap()

# RDD of sentences
rdd2 = spark.sparkContext.parallelize(["hello world", "spark map flatmap", "pyspark example"])

# Split each sentence into words
flatmapped_rdd = rdd2.flatMap(lambda line: line.split(" "))

print(flatmapped_rdd.collect())

#Each sentence produced multiple words, and flatMap() flattened them into a single list.

['hello', 'world', 'spark', 'map', 'flatmap', 'pyspark', 'example']


In [19]:
#Comparison Between map() and flatMap()

#Using map() for word splitting:

mapped_words = rdd2.map(lambda line: line.split(" "))
print(mapped_words.collect())

#Result is a list of lists (not flattened).


[['hello', 'world'], ['spark', 'map', 'flatmap'], ['pyspark', 'example']]


In [20]:
#Using flatMap() for word splitting:

flatmapped_words = rdd2.flatMap(lambda line: line.split(" "))
print(flatmapped_words.collect())

#Result is a flat list of words.


['hello', 'world', 'spark', 'map', 'flatmap', 'pyspark', 'example']


In [21]:
#Another Example: Filtering with flatMap()

# RDD with numbers
nums = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# flatMap returns [] for odd numbers, [x] for even numbers
evens = nums.flatMap(lambda x: [x] if x % 2 == 0 else [])

print(evens.collect())

#Unlike map() which always returns one value, flatMap() can return zero elements (filtering effect).


[2, 4]


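foreach() in PySpark

foreach() is an RDD action that runs a function on each element, typically for side effects.

Syntax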

rdd.foreach(f)

f → a function to be executed on each element


In [22]:
#Example 1: Simple Print

# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Apply foreach
def print_element(x):
    print(f"Value: {x}")

rdd.foreach(print_element)
#Note: print() runs on the executors, so on a cluster the output appears in the executor logs rather than in the driver console or notebook.


In [23]:
#Example 2: Writing to an External File

def write_to_file(x):
    # Append mode, because the function runs once per element
    with open("output.txt", "a") as f:
        f.write(str(x) + "\n")

rdd.foreach(write_to_file)

#Each executor writes to the local disk of the machine it runs on, not to the driver.
#On a real cluster, write to shared storage or an external database instead.


In [24]:
#Example 3: Using foreach for Database Insert

def insert_to_db(x):
    # Example: mock DB insert
    print(f"Inserting {x} into database...")

rdd.foreach(insert_to_db)
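
With a real database, opening a connection for every element is usually too expensive. A common refinement is the RDD method foreachPartition(), which calls the function once per partition with an iterator of that partition's elements. The sketch below uses SQLite only so that it runs anywhere; `make_partition_writer`, the `numbers` table, and the file path are hypothetical, and a file-based SQLite database only makes sense when Spark runs in local mode.

```python
import sqlite3

def make_partition_writer(db_path):
    """Build a function suitable for rdd.foreachPartition()."""
    def write_partition(rows):
        # One connection per partition instead of one per element
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("CREATE TABLE IF NOT EXISTS numbers (value INTEGER)")
            conn.executemany(
                "INSERT INTO numbers (value) VALUES (?)",
                ((x,) for x in rows),
            )
            conn.commit()
        finally:
            conn.close()
    return write_partition

# Usage with the rdd from the examples above (path is illustrative):
#   rdd.foreachPartition(make_partition_writer("/tmp/numbers.db"))
```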


Difference Between foreach() and map()

map() → transformation, returns a new RDD.

foreach() → action, returns nothing.
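
One practical consequence of this split is timing: a transformation is lazy and only describes work, while an action triggers it. Python's built-in map() is lazy too, so it makes a rough stand-in. This is a plain-Python analogy, not Spark code; the `calls` list exists only to record when the function actually runs.

```python
calls = []

def double(x):
    calls.append(x)  # side effect, so we can see when the function runs
    return x * 2

data = [1, 2, 3]

lazy = map(double, data)   # like a transformation: nothing has run yet
ran_before = len(calls)

result = list(lazy)        # like an action: forces evaluation
ran_after = len(calls)

print(ran_before, ran_after, result)  # 0 3 [2, 4, 6]
```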


In [25]:
#Using map

mapped = rdd.map(lambda x: x*2)
print(mapped.collect())  # [2,4,6,8,10]


[2, 4, 6, 8, 10]


In [26]:
# Using foreach

rdd.foreach(lambda x: print(x*2))  # Prints values but no return

Key Takeaways:

•	foreach() is an action.

•	Executes a function on each element of an RDD.

•	Typically used for side effects (DB updates, external API calls, logging).

•	Doesn’t return an RDD or DataFrame.

What is partitionBy()?

partitionBy() is used when writing data (especially in formats like Parquet, ORC, Avro, CSV) to organize the output files into separate folders based on one or more columns.

This helps with:

Efficient querying (only scanning partitions needed).

Reducing the amount of data read from storage.

Better performance with tools like Spark SQL, Hive, Presto, etc.
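
The folder layout follows the Hive-style convention of one `column=value` directory level per partition column. The sketch below models that naming in plain Python to make the structure concrete. `partition_dir` is a hypothetical helper, not a Spark API, and it ignores details such as the escaping Spark applies to special characters in values.

```python
def partition_dir(base, row, partition_cols):
    """Build the Hive-style directory for one row: base/col1=v1/col2=v2."""
    parts = [f"{c}={row[c]}" for c in partition_cols]
    return "/".join([base] + parts)

rows = [
    {"id": 1, "name": "Alice", "dept": "HR", "salary": 3000},
    {"id": 2, "name": "Bob", "dept": "IT", "salary": 4000},
    {"id": 5, "name": "Eve", "dept": "HR", "salary": 3200},
]

# Rows sharing a dept value land in the same directory
dirs = sorted({partition_dir("output/employees", r, ["dept"]) for r in rows})
print(dirs)  # ['output/employees/dept=HR', 'output/employees/dept=IT']
```

A query that filters on `dept` only needs to open the matching directory, which is where the read savings come from.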


Syntax
DataFrameWriter.partitionBy(col1, col2, ...).format("...").save(path)


col1, col2 → columns to partition the data by.

format("...") → output format, such as "parquet", "csv", or "json".

save(path) → output location (local disk, HDFS, S3, and so on).


In [27]:
#Example 1: Partition by one column

# Sample data
data = [
    (1, "Alice", "HR", 3000),
    (2, "Bob", "IT", 4000),
    (3, "Charlie", "IT", 4500),
    (4, "David", "Finance", 3500),
    (5, "Eve", "HR", 3200)
]

columns = ["id", "name", "dept", "salary"]

df = spark.createDataFrame(data, columns)

# Write data partitioned by 'dept'
df.write.partitionBy("dept").mode("overwrite").parquet("output/employee_partitioned")

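#This creates one subfolder per distinct 'dept' value:
#  output/employee_partitioned/dept=Finance/
#  output/employee_partitioned/dept=HR/
#  output/employee_partitioned/dept=IT/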


In [28]:
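#Example 2: Partition by multiple columns
# Write data partitioned by 'dept' and 'salary'
# (for illustration only: a high-cardinality column like salary produces many tiny partitions)

df.write.partitionBy("dept", "salary").mode("overwrite").parquet("output/employee_multi_partitioned")


#Folders are nested in the order the columns are given, for example:
#  output/employee_multi_partitioned/dept=HR/salary=3000/
#  output/employee_multi_partitioned/dept=HR/salary=3200/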

In [29]:
#Example 3: Reading partitioned data
# Read the partitioned Parquet back

df_read = spark.read.parquet("output/employee_partitioned")

df_read.show()

+---+-------+------+-------+
| id|   name|salary|   dept|
+---+-------+------+-------+
|  3|Charlie|  4500|     IT|
|  4|  David|  3500|Finance|
|  1|  Alice|  3000|     HR|
|  2|    Bob|  4000|     IT|
|  5|    Eve|  3200|     HR|
+---+-------+------+-------+




Key Notes:

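partitionBy() does not change the rows themselves, only how they are laid out on disk: partition column values are stored in the directory names rather than inside the data files.

It’s mainly useful for big data optimization.

Works best with columnar formats like Parquet and ORC; CSV and JSON can be partitioned too, but gain less from it.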


What is MapType?

MapType is a Spark SQL data type that stores key-value pairs (like a Python dictionary).

Both keys and values have fixed data types (e.g., StringType for keys and IntegerType for values).

Keys are always non-null, but values can be nullable depending on schema definition.

Syntax
from pyspark.sql.types import MapType, StringType, IntegerType

MapType(keyType, valueType, valueContainsNull=True)
keyType → Data type of keys (e.g., StringType(), IntegerType()).

valueType → Data type of values.

valueContainsNull → Boolean (default = True). Whether map values can contain null.


In [31]:
#Example 1: Creating a MapType Column

from pyspark.sql.types import MapType, StringType, IntegerType, StructType, StructField

data = [
    (1, {"math": 80, "english": 90}),
    (2, {"math": 85, "science": 95}),
    (3, None)  # null map
]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("scores", MapType(StringType(), IntegerType()), True)
])

df = spark.createDataFrame(data, schema)
df.show(truncate=False)
df.printSchema()


+---+---------------------------+
|id |scores                     |
+---+---------------------------+
|1  |{english -> 90, math -> 80}|
|2  |{science -> 95, math -> 85}|
|3  |NULL                       |
+---+---------------------------+

root
 |-- id: integer (nullable = true)
 |-- scores: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)



In [33]:
#Example 2: Accessing Map Values

#You can use col("mapField")["key"] syntax.

from pyspark.sql.functions import col

df.select(
    col("id"),
    col("scores")["math"].alias("math_score"),
    col("scores")["english"].alias("english_score")
).show()

+---+----------+-------------+
| id|math_score|english_score|
+---+----------+-------------+
|  1|        80|           90|
|  2|        85|         NULL|
|  3|      NULL|         NULL|
+---+----------+-------------+



In [None]:
#Example 3: Creating Map Column Dynamically
#You can create a map column using create_map():

from pyspark.sql.functions import create_map, lit

df2 = df.withColumn("extra", create_map(lit("physics"), lit(88), lit("chemistry"), lit(77)))
df2.show(truncate=False)

Key Takeaways

MapType is like a dictionary in PySpark DataFrames.

Keys are always non-null, values can be null.

Use col("mapField")["key"] to extract values.

Use create_map() to create new map columns.


In [45]:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("emp.csv")

In [46]:
df.show()

+-----------+-------------+-------------+---+------+------+----------+
|employee_id|department_id|         name|age|gender|salary| hire_date|
+-----------+-------------+-------------+---+------+------+----------+
|          1|          101|     John Doe| 30|  Male| 50000|2015-01-01|
|          2|          101|   Jane Smith| 25|Female| 45000|2016-02-15|
|          3|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|
|          4|          102|    Alice Lee| 28|Female| 48000|2017-09-30|
|          5|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|
|          6|          103|    Jill Wong| 32|Female| 52000|2018-07-01|
|          7|          101|James Johnson| 42|  Male| 70000|2012-03-15|
|          8|          102|     Kate Kim| 29|Female| 51000|2019-10-01|
|          9|          103|      Tom Tan| 33|  Male| 58000|2016-06-01|
|         10|          104|     Lisa Lee| 27|Female| 47000|2018-08-01|
|         11|          104|   David Park| 38|  Male| 65000|2015-11-01|
|     

In [47]:
def bonus(salary):
  return int(salary) * 0.1
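
Spark passes SQL NULL values to a Python UDF as None, and int(None) raises a TypeError, so a function like the one above fails as soon as the salary column contains a missing value. A null-safe variant might look like the sketch below; `bonus_or_none` is a hypothetical name.

```python
def bonus_or_none(salary):
    """Return a 10% bonus as a float, or None when salary is missing."""
    if salary is None:
        return None
    return float(salary) * 0.1

print(bonus_or_none(50000))  # 5000.0
print(bonus_or_none(None))   # None
```

It can be wrapped the same way as any other function, for example `udf(bonus_or_none, DoubleType())`; a returned None becomes NULL in the resulting column.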

In [50]:
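from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# Without an explicit return type, udf() defaults to StringType
bonus_udf = udf(bonus, DoubleType())
# Register the same function for use in SQL queries
spark.udf.register("bonus_sql_udf", bonus, "double")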

In [51]:
df.withColumn("bonus", bonus_udf(col("salary"))).show()

+-----------+-------------+-------------+---+------+------+----------+------+
|employee_id|department_id|         name|age|gender|salary| hire_date| bonus|
+-----------+-------------+-------------+---+------+------+----------+------+
|          1|          101|     John Doe| 30|  Male| 50000|2015-01-01|5000.0|
|          2|          101|   Jane Smith| 25|Female| 45000|2016-02-15|4500.0|
|          3|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|5500.0|
|          4|          102|    Alice Lee| 28|Female| 48000|2017-09-30|4800.0|
|          5|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|6000.0|
|          6|          103|    Jill Wong| 32|Female| 52000|2018-07-01|5200.0|
|          7|          101|James Johnson| 42|  Male| 70000|2012-03-15|7000.0|
|          8|          102|     Kate Kim| 29|Female| 51000|2019-10-01|5100.0|
|          9|          103|      Tom Tan| 33|  Male| 58000|2016-06-01|5800.0|
|         10|          104|     Lisa Lee| 27|Female| 47000|2018-

In [53]:
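# Built-in column arithmetic needs no UDF. Note that salary * 1.1 is the salary plus a 10% bonus, not the bonus alone (that would be salary * 0.1).
df.withColumn("bonus", col("salary")*1.1).show()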

+-----------+-------------+-------------+---+------+------+----------+-----------------+
|employee_id|department_id|         name|age|gender|salary| hire_date|            bonus|
+-----------+-------------+-------------+---+------+------+----------+-----------------+
|          1|          101|     John Doe| 30|  Male| 50000|2015-01-01|55000.00000000001|
|          2|          101|   Jane Smith| 25|Female| 45000|2016-02-15|49500.00000000001|
|          3|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|60500.00000000001|
|          4|          102|    Alice Lee| 28|Female| 48000|2017-09-30|52800.00000000001|
|          5|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|          66000.0|
|          6|          103|    Jill Wong| 32|Female| 52000|2018-07-01|57200.00000000001|
|          7|          101|James Johnson| 42|  Male| 70000|2012-03-15|          77000.0|
|          8|          102|     Kate Kim| 29|Female| 51000|2019-10-01|56100.00000000001|
|          9|        

In [54]:
# Example: Using a UDF with transform()

# Define a function that uses the UDF
def add_bonus_udf_transform(dataframe):
    return dataframe.withColumn("bonus_from_udf", bonus_udf(col("salary")))

# Apply the transformation using transform()
df_with_udf_transform = df.transform(add_bonus_udf_transform)

df_with_udf_transform.show()

+-----------+-------------+-------------+---+------+------+----------+--------------+
|employee_id|department_id|         name|age|gender|salary| hire_date|bonus_from_udf|
+-----------+-------------+-------------+---+------+------+----------+--------------+
|          1|          101|     John Doe| 30|  Male| 50000|2015-01-01|        5000.0|
|          2|          101|   Jane Smith| 25|Female| 45000|2016-02-15|        4500.0|
|          3|          102|    Bob Brown| 35|  Male| 55000|2014-05-01|        5500.0|
|          4|          102|    Alice Lee| 28|Female| 48000|2017-09-30|        4800.0|
|          5|          103|    Jack Chan| 40|  Male| 60000|2013-04-01|        6000.0|
|          6|          103|    Jill Wong| 32|Female| 52000|2018-07-01|        5200.0|
|          7|          101|James Johnson| 42|  Male| 70000|2012-03-15|        7000.0|
|          8|          102|     Kate Kim| 29|Female| 51000|2019-10-01|        5100.0|
|          9|          103|      Tom Tan| 33|  Male| 5

Advantages of using UDFs over simple Python functions in PySpark:

*   **Integration with Spark SQL:** Registered UDFs can be directly used in Spark SQL queries, making your logic accessible from both DataFrame API and SQL.
*   **Serialization and Distribution:** Spark handles the serialization and distribution of UDFs to worker nodes, allowing your custom logic to be executed in a distributed manner.
*   **Performance (Pandas UDFs):** Pandas UDFs (vectorized UDFs) can offer significant performance improvements for certain operations by leveraging Apache Arrow and processing data in batches within Pandas DataFrames.
*   **Reusability:** Once defined and registered, a UDF can be easily reused across different parts of your Spark application.
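*   **Handling Complex Logic:** UDFs are useful when the required transformation logic is too complex to express using built-in Spark functions. When a built-in function exists, prefer it: Python UDFs add serialization overhead and are opaque to Spark's optimizer.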