### Objective

The objective of this notebook is to provide hands-on experience with **PySpark** for basic data processing tasks. Learners will:

- Load structured data from a **CSV file** using PySpark  
- Perform transformations using custom Python functions and register them as **User Defined Functions (UDFs)**  
- Classify flower characteristics based on **petal length** and **sepal width**  
- Save transformed data into multiple formats: **JSON** and **Parquet**  
- Create and export a new DataFrame to **CSV** and **TSV** formats  
- Understand the differences and benefits of using various **data storage formats**  

This exercise helps learners gain practical skills in working with **PySpark's DataFrame API**, **UDFs**, and **file I/O operations**.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("PySpark CSV Transformation") \
    .getOrCreate()

In [None]:
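# Load the Iris CSV: header=True takes column names from the first row,
# inferSchema=True lets Spark detect each column's type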
iris_csv_spark = spark.read.csv("/content/drive/MyDrive/Datasets/Iris.csv", header=True, inferSchema=True)

# Show a few rows
print("CSV Sample (Spark):")
iris_csv_spark.show(5)

CSV Sample (Spark):
+------------+-----------+------------+-----------+-------+
|sepal.length|sepal.width|petal.length|petal.width|variety|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| Setosa|
|         4.9|        3.0|         1.4|        0.2| Setosa|
|         4.7|        3.2|         1.3|        0.2| Setosa|
|         4.6|        3.1|         1.5|        0.2| Setosa|
|         5.0|        3.6|         1.4|        0.2| Setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows



In [None]:
print("\nSpark CSV Schema:")
iris_csv_spark.printSchema()


Spark CSV Schema:
root
 |-- sepal.length: double (nullable = true)
 |-- sepal.width: double (nullable = true)
 |-- petal.length: double (nullable = true)
 |-- petal.width: double (nullable = true)
 |-- variety: string (nullable = true)



In [None]:
for col_name in iris_csv_spark.columns:
    print(col_name)

sepal.length
sepal.width
petal.length
petal.width
variety


In [None]:
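# Replace dots with underscores: Spark reads a dot in a column reference as
# struct-field access, so dotted names would otherwise need backtick quoting
iris_csv_spark = iris_csv_spark.withColumnRenamed("sepal.width", "sepal_width")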

In [None]:
iris_csv_spark = iris_csv_spark.withColumnRenamed("sepal.length", "sepal_length")
iris_csv_spark = iris_csv_spark.withColumnRenamed("petal.length", "petal_length")
iris_csv_spark = iris_csv_spark.withColumnRenamed("petal.width", "petal_width")

In [None]:
# Alias the renamed DataFrame and confirm the new column names
df = iris_csv_spark
df.show(5)

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|variety|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| Setosa|
|         4.9|        3.0|         1.4|        0.2| Setosa|
|         4.7|        3.2|         1.3|        0.2| Setosa|
|         4.6|        3.1|         1.5|        0.2| Setosa|
|         5.0|        3.6|         1.4|        0.2| Setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows



In [None]:
# Classify flower size by petal length
def classify_by_petal_length(petal_length):
    # The column is nullable; pass missing values through instead of raising TypeError
    if petal_length is None:
        return None
    if petal_length < 2.0:
        return "Small"
    elif petal_length < 5.0:
        return "Medium"
    else:
        return "Large"

In [None]:
# Classify sepal shape by sepal width
def classify_by_sepal_width(sepal_width):
    # The column is nullable; pass missing values through instead of raising TypeError
    if sepal_width is None:
        return None
    if sepal_width < 3.0:
        return "Narrow"
    elif sepal_width < 3.5:
        return "Moderate"
    else:
        return "Wide"
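
Before wrapping these functions as UDFs, it can help to check the threshold boundaries in plain Python, since errors inside a UDF only surface when Spark runs the job. The sketch below restates both classifiers so it runs on its own; the `None` guard is an assumption about how missing values should be treated, and the test values are chosen only to sit on either side of each threshold.

```python
def classify_by_petal_length(petal_length):
    # Assumption: missing values are passed through as None
    if petal_length is None:
        return None
    if petal_length < 2.0:
        return "Small"
    if petal_length < 5.0:
        return "Medium"
    return "Large"


def classify_by_sepal_width(sepal_width):
    # Assumption: missing values are passed through as None
    if sepal_width is None:
        return None
    if sepal_width < 3.0:
        return "Narrow"
    if sepal_width < 3.5:
        return "Moderate"
    return "Wide"


# Values on either side of each threshold, mapped to the expected label
petal_checks = {1.9: "Small", 2.0: "Medium", 4.9: "Medium", 5.0: "Large"}
sepal_checks = {2.9: "Narrow", 3.0: "Moderate", 3.4: "Moderate", 3.5: "Wide"}

for value, expected in petal_checks.items():
    assert classify_by_petal_length(value) == expected, (value, expected)

for value, expected in sepal_checks.items():
    assert classify_by_sepal_width(value) == expected, (value, expected)

print("all boundary checks passed")
```

Each lower bound is inclusive, so a petal length of exactly 2.0 is "Medium" and a sepal width of exactly 3.5 is "Wide", which matches the sample output for the first Iris row.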


In [None]:
# Register the functions as UDFs
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

petal_size_udf = udf(classify_by_petal_length, StringType())
sepal_size_udf = udf(classify_by_sepal_width, StringType())

In [None]:
# Apply UDFs to CSV DataFrame
iris_classified = iris_csv_spark.withColumn("petal_size", petal_size_udf("petal_length")) \
                          .withColumn("sepal_size", sepal_size_udf("sepal_width"))

iris_classified.show(10)

+------------+-----------+------------+-----------+-------+----------+----------+
|sepal_length|sepal_width|petal_length|petal_width|variety|petal_size|sepal_size|
+------------+-----------+------------+-----------+-------+----------+----------+
|         5.1|        3.5|         1.4|        0.2| Setosa|     Small|      Wide|
|         4.9|        3.0|         1.4|        0.2| Setosa|     Small|  Moderate|
|         4.7|        3.2|         1.3|        0.2| Setosa|     Small|  Moderate|
|         4.6|        3.1|         1.5|        0.2| Setosa|     Small|  Moderate|
|         5.0|        3.6|         1.4|        0.2| Setosa|     Small|      Wide|
|         5.4|        3.9|         1.7|        0.4| Setosa|     Small|      Wide|
|         4.6|        3.4|         1.4|        0.3| Setosa|     Small|  Moderate|
|         5.0|        3.4|         1.5|        0.2| Setosa|     Small|  Moderate|
|         4.4|        2.9|         1.4|        0.2| Setosa|     Small|    Narrow|
|         4.9|  

In [None]:
# Save to JSON
iris_classified.write.mode("overwrite").json("output/iris_transformed.json")

# Save to Parquet
iris_classified.write.mode("overwrite").parquet("output/iris_transformed.parquet")
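
Note that Spark treats each path passed to `write` as a directory, not a single file: the data lands in one or more `part-*` files next to a `_SUCCESS` marker. The sketch below shows one way to pick out the data files with the standard library. The helper name `list_part_files` and the file names in the temporary directory are hypothetical stand-ins, since real part-file names contain a generated ID.

```python
import tempfile
from pathlib import Path


def list_part_files(output_dir):
    """Return the names of the data files in a Spark output directory.

    Marker files such as _SUCCESS and hidden checksum files (names starting
    with a dot) do not match the "part-*" pattern and are skipped.
    """
    return sorted(p.name for p in Path(output_dir).glob("part-*"))


# Mimic the layout of a Spark output directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "iris_transformed.json"
    out_dir.mkdir()
    for name in ("_SUCCESS", "part-00000-demo.json", ".part-00000-demo.json.crc"):
        (out_dir / name).touch()
    demo_files = list_part_files(out_dir)

print(demo_files)
```

After the write cell has run, calling `list_part_files("output/iris_transformed.json")` should list the actual JSON part files.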

In [None]:
# Create and Show New DataFrame
data = [
    ("Alice", "Mumbai", 28),
    ("Bob", "Delhi", 34),
    ("Charlie", "Bangalore", 25),
    ("Debra", "Hyderabad", 31)
]

columns = ["name", "city", "age"]

df_people = spark.createDataFrame(data, columns)
df_people.show()

+-------+---------+---+
|   name|     city|age|
+-------+---------+---+
|  Alice|   Mumbai| 28|
|    Bob|    Delhi| 34|
|Charlie|Bangalore| 25|
|  Debra|Hyderabad| 31|
+-------+---------+---+



In [None]:
# Save to CSV
df_people.write.mode("overwrite").option("header", True).csv("output/people.csv")

# Save to TSV
df_people.write.mode("overwrite").option("header", True).option("delimiter", "\t").csv("output/people.tsv")

### CSV vs TSV

- **CSV (Comma Separated Values)**: Fields are separated using commas.
  - Commonly used for spreadsheets and simple data interchange.
  - Can cause issues if commas are part of field values (needs quoting).

- **TSV (Tab Separated Values)**: Fields are separated using tabs (`\t`).
  - Commas inside text fields need no quoting, although embedded tabs or newlines still do.
  - Better for datasets containing natural language or punctuation.
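
The quoting difference can be seen with a small sketch that uses Python's built-in `csv` module instead of Spark's writer, which is an assumption made only to keep the example self-contained. The city value with an embedded comma is a made-up illustration, not part of the dataset above.

```python
import csv
import io

# Hypothetical row whose second field contains a comma
row = ["Alice", "Mumbai, Maharashtra", 28]


def render(row, delimiter):
    """Serialize one row with the given delimiter and return it as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(row)
    return buffer.getvalue()


csv_line = render(row, ",")
tsv_line = render(row, "\t")

print(repr(csv_line))  # 'Alice,"Mumbai, Maharashtra",28\n'
print(repr(tsv_line))  # 'Alice\tMumbai, Maharashtra\t28\n'
```

With a comma delimiter the writer has to wrap the city in quotes, while the tab-delimited line keeps the field as-is.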

**Why Save to Multiple Formats?**
- **CSV**: Human-readable and widely supported.
- **JSON**: Semi-structured, useful for hierarchical data and APIs.
- **Parquet**: Columnar format; optimized for performance and analytics.

Each format serves different use cases; it's good practice to understand and use them accordingly.
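
One practical difference is how each format carries type information. The sketch below round-trips two of the `df_people` rows through CSV and through JSON Lines (one JSON object per line, the layout Spark's JSON writer produces) using only the standard library. It is an illustration of the formats themselves, not of Spark's readers.

```python
import csv
import io
import json

people = [
    {"name": "Alice", "city": "Mumbai", "age": 28},
    {"name": "Bob", "city": "Delhi", "age": 34},
]

# JSON Lines: one JSON object per line; numbers stay numbers
json_lines = "\n".join(json.dumps(person) for person in people)
from_json = [json.loads(line) for line in json_lines.splitlines()]

# CSV: every value is written as text and read back as a string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "city", "age"], lineterminator="\n")
writer.writeheader()
writer.writerows(people)
from_csv = list(csv.DictReader(io.StringIO(buffer.getvalue())))

print(type(from_json[0]["age"]).__name__)  # int
print(type(from_csv[0]["age"]).__name__)   # str
```

This is why the CSV load earlier in the notebook needs `inferSchema=True` to get `double` columns. Parquet goes a step further and stores the schema inside the file.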