# Spark DataFrames API


### DataFrame

- DataFrames are distributed collections of records that share a predefined schema (the names and data types of all columns).
- DataFrames are built on Spark's core concepts (RDDs) and add structure, query optimization, and SQL-like operations for data manipulation.
- DataFrames track their schema and provide native support for many common SQL functions and relational operators.
- DataFrames are evaluated lazily as DAGs: transformations are only recorded, and they run when an action is called. The recorded lineage also provides fault tolerance.
- DataFrames are immutable; every transformation returns a new DataFrame.

### SparkContext vs SparkSession

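- SparkSession is the entry point of a Spark application.
- It was introduced in Spark 2.0 as a unified entry point for functionality that previously required separately instantiated contexts (SparkContext, SQLContext, HiveContext). The underlying SparkContext is still available as `spark.sparkContext`. The legacy DStream API keeps its own StreamingContext, while Structured Streaming goes through SparkSession.

<i>Note: In Databricks, the session is created for you automatically and exposed as `spark`.</i>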

### DataFrame API Optimizations

- **Adaptive Query Execution:** Adjusts the query plan at runtime based on statistics gathered during execution, such as actual partition sizes.
- **In-Memory Columnar Storage (Tungsten):** Cached DataFrames are stored in an in-memory columnar format, and Tungsten's compact binary representation reduces the memory footprint and speeds up analytical queries.
- **Built-in Statistics:** Statistics are collected automatically when saving to optimized formats (Parquet, or Delta in Databricks), which enables smarter query planning and execution.
- **Catalyst Optimizer:** Query optimization engine that converts DataFrame operations into an optimized execution plan.


<i>**Note:** Databricks also ships Photon, a native vectorized query engine that accelerates query execution.</i>

**DataFrame Query Planning:** 

- When a DataFrame is evaluated, the driver builds an optimized execution plan through a series of plan transformations.
- Catalyst converts the logical plan into a physical plan that minimizes resource usage and execution time (unresolved logical plan -> analyzed logical plan -> optimized logical plan -> physical plan).



In [1]:
# Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CutomerDFExample").getOrCreate()




Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport -XX:ActiveProcessorCount=1
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/06/30 18:13:56 WARN Utils: Your hostname, krishnagopi-trng2224dat-g3q9nc1wf47, resolves to a loopback address: 127.0.0.1; using 10.0.5.2 instead (on interface eth0)
25/06/30 18:13:56 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/30 18:13:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# Creating DataFrames with DataFrameReader (spark.read)
# Supports multiple formats such as JSON, CSV, Parquet, ORC, text, and binary files, as well as existing RDDs and external databases


df_customers = spark.read.csv("file:///workspace/TRNG-2224-data-engineering/week2/datasets/customer_data.csv", header=True, inferSchema=True)

df_customers.show(4)




+--------------------+---------------+--------------------+---+------+--------------------+-----------+-------------------+---------+-----------+
|         customer_id|           name|               email|age|gender|             country|signup_date|         last_login|is_active|total_spent|
+--------------------+---------------+--------------------+---+------+--------------------+-----------+-------------------+---------+-----------+
|20780d38-901f-450...| Michael Malone|    dhart@haynes.com| 58|  Male|    Saint Barthelemy| 2021-04-29|2024-10-20 15:56:26|     true|     3733.6|
|a2c56b05-acdc-4a7...|     Edwin Wall| bradley08@yahoo.com| 33|  Male|United Arab Emirates| 2025-01-02|2025-06-19 22:44:59|     true|    3708.71|
|2fe8ff2e-19ea-493...|  Rachel Strong|heather15@schmidt...| 61| Other|              Israel| 2023-02-13|2025-04-12 21:14:26|     true|    2993.41|
|5fd9f4a6-2134-41b...|Eddie Rodriguez|mitchell49@hotmai...| 20|  Male|             Nigeria| 2024-07-06|2025-03-06 17:09:20| 

### DataFrame Data Types

#### Primitive data types

Every Spark SQL type, including those below, subclasses **`pyspark.sql.types.DataType`**.

- `ByteType`
- `ShortType`
- `IntegerType`
- `LongType`
- `FloatType`
- `DoubleType`
- `BooleanType`
- `StringType`
- `BinaryType`
- `TimestampType`
- `DateType`

#### Complex data types

- `ArrayType`
- `MapType`
- `StructType`



In [6]:
# DataFrame Schema

df_customers.schema

StructType([StructField('customer_id', StringType(), True), StructField('name', StringType(), True), StructField('email', StringType(), True), StructField('age', IntegerType(), True), StructField('gender', StringType(), True), StructField('country', StringType(), True), StructField('signup_date', DateType(), True), StructField('last_login', TimestampType(), True), StructField('is_active', BooleanType(), True), StructField('total_spent', DoubleType(), True)])

In [7]:
# custom schema definition

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType, BooleanType, TimestampType, DateType

custom_schema = StructType([
    StructField("customer_id", StringType(), True ),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("country", StringType(), True),
    StructField("signup_date", DateType(), True),
    StructField("last_login", TimestampType(), True),
    StructField("is_active", BooleanType(), True),
    StructField("total_spent", DoubleType(), True)
])

df_customers = spark.read.csv("file:///workspace/TRNG-2224-data-engineering/week2/datasets/customer_data.csv", header=True, schema=custom_schema)

df_customers.printSchema()




root
 |-- customer_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- country: string (nullable = true)
 |-- signup_date: date (nullable = true)
 |-- last_login: timestamp (nullable = true)
 |-- is_active: boolean (nullable = true)
 |-- total_spent: double (nullable = true)



In [10]:
# DDL Schema

ddl_schema = """
    customer_id STRING,
    name STRING,
    email STRING,
    age INT,
    gender STRING,
    country STRING,
    signup_date DATE,
    last_login TIMESTAMP,
    is_active BOOLEAN,
    total_spent DOUBLE
"""

df_customers = spark.read.csv("file:///workspace/TRNG-2224-data-engineering/week2/datasets/customer_data.csv", header=True, schema=ddl_schema)

df_customers.printSchema()

root
 |-- customer_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- country: string (nullable = true)
 |-- signup_date: date (nullable = true)
 |-- last_login: timestamp (nullable = true)
 |-- is_active: boolean (nullable = true)
 |-- total_spent: double (nullable = true)



### Common DataFrame API methods

#### Transformations

##### Narrow Transformations

- Narrow transformations process data within each partition independently, without needing data from other partitions.
- They are faster and more efficient because they avoid shuffling data between partitions.

1. `select()`: Selecting specific columns.
2. `filter()`: Applying a filter condition to rows.
3. `map()`: Applying a function to each row. In PySpark this is an RDD method reached through `df.rdd.map(...)`; a DataFrame itself has no `map()`.
4. `union()`: Combining two DataFrames with identical schemas (columns are matched by position).
5. `withColumn()`: Adding or replacing a column, usually derived from existing ones.
6. `drop()`: Removing one or more columns.

##### Wide Transformations

- Wide transformations require data to be redistributed across partitions, often involving shuffling data based on keys.

1. `groupBy()`: Grouping data based on a column, which requires shuffling when rows with the same key sit in different partitions.
2. `join()`: Joining two DataFrames, which usually requires shuffling so that rows with the same join key meet in the same partition. Spark may broadcast a small table instead and avoid the shuffle.
3. `distinct()`: Removing duplicate rows, which requires shuffling to compare rows across partitions.

#### Actions

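1. `count()`: Returns the number of rows in a DataFrame.
2. `show()`: Displays the DataFrame content.
3. `take(n)`: Returns the first n rows as a list of `Row` objects.
4. `first()`: Returns the first row.
5. `write`: Saves the DataFrame to storage. `write` is a property that returns a `DataFrameWriter`; the job runs when a save method such as `parquet()` or `save()` is called.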

In [13]:
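# Map, Shuffle and Reduce

# Import under an alias so Spark's sum() does not shadow Python's built-in sum()
from pyspark.sql import functions as F

# filter/select run per partition (map side), groupBy shuffles by key, agg reduces each group
(df_customers
    .filter(df_customers.age > 30)
    .select("country", "total_spent")
    .groupBy("country")
    .agg(F.sum("total_spent").alias("revenue"))
    .show())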


+-----------------+------------------+
|          country|           revenue|
+-----------------+------------------+
|            Macao|            3022.4|
|            Yemen|           1407.35|
|         Kiribati|           1833.03|
|           Guyana|           1149.76|
|           Jersey|           1398.78|
|   Norfolk Island|           2929.03|
|         Djibouti|           4291.06|
|            Tonga|           2167.22|
|           Malawi|           4418.94|
|          Germany|2965.7599999999998|
|           Jordan|           2175.54|
|            Sudan|            800.07|
|           Greece|           2734.41|
|             Togo|           3097.22|
|          Ecuador|           3340.19|
|            Qatar|           5966.42|
|          Lesotho|            1977.1|
|       Madagascar|           3372.53|
|Brunei Darussalam|           4201.99|
|             Peru|           2080.31|
+-----------------+------------------+
only showing top 20 rows


In [16]:
# Select Specific Columns

df_customers.select("name", "email", "country").show(4)

+---------------+--------------------+--------------------+
|           name|               email|             country|
+---------------+--------------------+--------------------+
| Michael Malone|    dhart@haynes.com|    Saint Barthelemy|
|     Edwin Wall| bradley08@yahoo.com|United Arab Emirates|
|  Rachel Strong|heather15@schmidt...|              Israel|
|Eddie Rodriguez|mitchell49@hotmai...|             Nigeria|
+---------------+--------------------+--------------------+
only showing top 4 rows


In [19]:
# Filter Active Customers Over 30

df_customers.filter((df_customers.age>30) & (df_customers.is_active == True)).show()

+--------------------+-------------------+--------------------+---+------+--------------------+-----------+-------------------+---------+-----------+
|         customer_id|               name|               email|age|gender|             country|signup_date|         last_login|is_active|total_spent|
+--------------------+-------------------+--------------------+---+------+--------------------+-----------+-------------------+---------+-----------+
|20780d38-901f-450...|     Michael Malone|    dhart@haynes.com| 58|  Male|    Saint Barthelemy| 2021-04-29|2024-10-20 15:56:26|     true|     3733.6|
|a2c56b05-acdc-4a7...|         Edwin Wall| bradley08@yahoo.com| 33|  Male|United Arab Emirates| 2025-01-02|2025-06-19 22:44:59|     true|    3708.71|
|2fe8ff2e-19ea-493...|      Rachel Strong|heather15@schmidt...| 61| Other|              Israel| 2023-02-13|2025-04-12 21:14:26|     true|    2993.41|
|b290eed5-e70c-48d...|       Kayla Powell|johnnash@hotmail.com| 50| Other|        Burkina Faso| 2021

In [20]:
# Group by Country and Get Average Spend

df_customers.groupBy("country").avg("total_spent").show()

+-----------------+------------------+
|          country|  avg(total_spent)|
+-----------------+------------------+
|            Macao|            3022.4|
|            Yemen|           1407.35|
|         Kiribati|           1833.03|
|           Guyana|           1149.76|
|           Jersey|           1398.78|
|   Norfolk Island|           2929.03|
|         Djibouti|           4291.06|
|            Tonga|           2167.22|
|           Malawi|           4418.94|
|          Germany|1482.8799999999999|
|           Jordan|           2175.54|
|     Saint Helena|           4639.94|
|            Sudan|            800.07|
|           Greece|           2734.41|
|             Togo|           3097.22|
|Equatorial Guinea|           4459.59|
|          Ecuador|           3340.19|
|            Qatar|           2983.21|
|          Lesotho|            988.55|
|       Madagascar|           3372.53|
+-----------------+------------------+
only showing top 20 rows


In [21]:
# Add a New Column for Spend Category

from pyspark.sql.functions import when

df_customer_with_catgory = df_customers.withColumn(
    "spend_category",
    when(df_customers.total_spent >3000, "High")
    .when(df_customers.total_spent> 1000, "Medium")
    .otherwise("Low")
)

df_customer_with_catgory.select("name", "email","total_spent" ,"spend_category").show()

+----------------+--------------------+-----------+--------------+
|            name|               email|total_spent|spend_category|
+----------------+--------------------+-----------+--------------+
|  Michael Malone|    dhart@haynes.com|     3733.6|          High|
|      Edwin Wall| bradley08@yahoo.com|    3708.71|          High|
|   Rachel Strong|heather15@schmidt...|    2993.41|        Medium|
| Eddie Rodriguez|mitchell49@hotmai...|    1171.33|        Medium|
|    Kayla Powell|johnnash@hotmail.com|     2850.1|        Medium|
| Kathleen Nelson|     qpena@gmail.com|    1180.16|        Medium|
| Nicolas Kennedy|catherineblack@wa...|    3818.16|          High|
| Jacqueline Reid|smithjoshua@baker...|    1767.23|        Medium|
|  Matthew Mendez|jeremymontoya@har...|     600.83|           Low|
|      Jay Little|brogers@wright-wo...|    4252.72|          High|
|       Amber Ray|   alan33@taylor.net|    3675.69|          High|
| Elizabeth Ellis|jjohnson@smith-th...|     486.12|           

In [22]:
# DataFrameWriter: flexible output formats and partitioning; supports several save modes (overwrite, append, ignore, error)

df_customer_with_catgory.write.mode("overwrite").parquet("cutomer_oputput.parquet")

                                                                                