
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# 2.2 Lab - Modularize PySpark Code

### Estimated Duration: 15-20 minutes

By the end of this lab, you will have practiced breaking a PySpark script into smaller, reusable functions and assessing how well your changes improve the code's clarity and maintainability.

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

  - In the drop-down, select **More**.

  - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course. 

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically reference the information needed to run the course.

##### The notebook "2.1 - Modularizing PySpark Code - Required" sets up the catalogs for this course. If you have not run this notebook, the catalogs will not be available.

In [0]:
%run ../Includes/Classroom-Setup-2.2L

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


0,1
Lab catalog reference: DA.catalog_name:,


Run the cell below to view your current default catalog and schema. 

Confirm the following:
- The default catalog is your unique catalog name (shown above).
- The current schema is **default**.

In [0]:
%sql
SELECT current_catalog(), current_schema()

current_catalog(),current_schema()
labuser9989464_1744809149,default


## B. Review the Provided PySpark Code

1. Run the cell below to preview the **samples.nyctaxi.trips** table. Confirm the table exists and view the data.

    Notice the following:
    - All column names are lowercase
    - **trip_distance** is currently in miles

In [0]:
%sql
SELECT * 
FROM samples.nyctaxi.trips 
LIMIT 10;

tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,fare_amount,pickup_zip,dropoff_zip
2016-02-13T21:47:53Z,2016-02-13T21:57:15Z,1.4,8.0,10103,10110
2016-02-13T18:29:09Z,2016-02-13T18:37:23Z,1.31,7.5,10023,10023
2016-02-06T19:40:58Z,2016-02-06T19:52:32Z,1.8,9.5,10001,10018
2016-02-12T19:06:43Z,2016-02-12T19:20:54Z,2.3,11.5,10044,10111
2016-02-23T10:27:56Z,2016-02-23T10:58:33Z,2.6,18.5,10199,10022
2016-02-13T00:41:43Z,2016-02-13T00:46:52Z,1.4,6.5,10023,10069
2016-02-18T23:49:53Z,2016-02-19T00:12:53Z,10.4,31.0,11371,10003
2016-02-18T20:21:45Z,2016-02-18T20:38:23Z,10.15,28.5,11371,11201
2016-02-03T10:47:50Z,2016-02-03T11:07:06Z,3.27,15.0,10014,10023
2016-02-19T01:26:39Z,2016-02-19T01:40:01Z,4.42,15.0,10003,11222


2. You have been provided with a PySpark script that does the following:

   a. Reads from the **samples.nyctaxi.trips** table.

   b. Creates a new column named **trip_distance_km** that converts **trip_distance** to kilometers and rounds it to two decimal places.

   c. Converts all of the column names to uppercase.

   d. Saves the DataFrame as a table named **nyc_lab_solution_table** in your specific catalog (`DA.catalog_name`).

Run the cell below and confirm that the **nyc_lab_solution_table** table was created with all uppercase column names and the new **trip_distance_km** column.

In [0]:
## Run the code to view the default catalog the table is being written to.
print(DA.catalog_name)

labuser9989464_1744809149


In [0]:
# Import necessary libraries
from pyspark.sql import functions as F

# Load the data and create a new column named trip_distance_km
new_taxi = (spark
            .read
            .table("samples.nyctaxi.trips")
            .withColumn("trip_distance_km", F.round(F.col("trip_distance") * 1.60934, 2))
        )


## Upper case all columns
new_taxi = new_taxi.select([F.col(col).alias(col.upper()) for col in new_taxi.columns])


## Save the table to your catalog
(new_taxi
 .write
 .mode('overwrite')
 .saveAsTable(f'{DA.catalog_name}.default.nyc_lab_solution_table')
)

## View the final table
display(spark.table(f'{DA.catalog_name}.default.nyc_lab_solution_table'))

TPEP_PICKUP_DATETIME,TPEP_DROPOFF_DATETIME,TRIP_DISTANCE,FARE_AMOUNT,PICKUP_ZIP,DROPOFF_ZIP,TRIP_DISTANCE_KM
2016-02-13T21:47:53Z,2016-02-13T21:57:15Z,1.4,8.0,10103,10110,2.25
2016-02-13T18:29:09Z,2016-02-13T18:37:23Z,1.31,7.5,10023,10023,2.11
2016-02-06T19:40:58Z,2016-02-06T19:52:32Z,1.8,9.5,10001,10018,2.9
2016-02-12T19:06:43Z,2016-02-12T19:20:54Z,2.3,11.5,10044,10111,3.7
2016-02-23T10:27:56Z,2016-02-23T10:58:33Z,2.6,18.5,10199,10022,4.18
2016-02-13T00:41:43Z,2016-02-13T00:46:52Z,1.4,6.5,10023,10069,2.25
2016-02-18T23:49:53Z,2016-02-19T00:12:53Z,10.4,31.0,11371,10003,16.74
2016-02-18T20:21:45Z,2016-02-18T20:38:23Z,10.15,28.5,11371,11201,16.33
2016-02-03T10:47:50Z,2016-02-03T11:07:06Z,3.27,15.0,10014,10023,5.26
2016-02-19T01:26:39Z,2016-02-19T01:40:01Z,4.42,15.0,10003,11222,7.11
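
The uppercasing step in the script above quietly assumes that every column name is still unique after it is uppercased. The following is a minimal plain-Python sketch of that check; `find_uppercase_collisions` is a hypothetical helper (not part of the lab files), and the column list is copied from the table preview plus the new **trip_distance_km** column. In the notebook you could pass `new_taxi.columns` instead of the hard-coded list.

```python
from collections import defaultdict


def find_uppercase_collisions(columns):
    """Return {UPPERCASED_NAME: [original names]} for names that collide once uppercased.

    Hypothetical helper for illustration; it only inspects a list of strings.
    """
    grouped = defaultdict(list)
    for name in columns:
        grouped[name.upper()].append(name)
    return {upper: names for upper, names in grouped.items() if len(names) > 1}


# Column names from samples.nyctaxi.trips, plus the column added by the script
trips_columns = [
    "tpep_pickup_datetime", "tpep_dropoff_datetime", "trip_distance",
    "fare_amount", "pickup_zip", "dropoff_zip", "trip_distance_km",
]

# An empty dict means uppercasing is safe for these columns
collisions = find_uppercase_collisions(trips_columns)
print(collisions)
```

For this table the result is empty, so the rename is safe. A table with columns such as `id` and `Id` would be reported, because both would map to `ID`.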


## C. Modularize the PySpark Code

1. Your task is to take the provided Spark code from above and break it down into modular functions. Each function should perform a specific part of the task, making it easier to test, reuse, and maintain.

    There are a variety of ways to solve this problem. For consistency in this example, create the following functions:

    - `convert_miles_to_km`: Converts a column from miles to kilometers and rounds the result to two decimal places.

    - `uppercase_column_names`: Converts all column names in the DataFrame to uppercase.
    
    - \*`load_data`: Reads the table.

    - \*`save_to_catalog`: Saves the DataFrame as a new table in your catalog.

    \***NOTE:** The `load_data` and `save_to_catalog` functions have already been created for you. 

    **TO DO:** Create the `convert_miles_to_km` and `uppercase_column_names` functions in the cell below.
<br></br>
    **HINT:** The solution functions can be found in **[./src_lab/lab_functions/transforms.py]($./src_lab/lab_functions/transforms.py)**.

In [0]:

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

## load_data has already been created for you
def load_data(table_name):
    return spark.read.table(table_name)


## save_to_catalog has already been created for you
def save_to_catalog(df, catalog_name, schema_name, table_name):
    (df
     .write
     .mode('overwrite')
     .saveAsTable(f'{catalog_name}.{schema_name}.{table_name}')
    )

In [0]:
## convert_miles_to_km
def convert_miles_to_km(df, new_column_name, miles_column):
    return df.withColumn(new_column_name, F.round(F.col(miles_column) * 1.60934, 2))
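
One benefit of isolating the conversion in its own function is that its arithmetic can be checked without a cluster. The sketch below is a plain-Python stand-in, not the lab's function: `miles_to_km` is a hypothetical name, and it uses `ROUND_HALF_UP` because Spark's `round` is documented as using HALF_UP rounding. The expected values are copied from the **trip_distance** and **TRIP_DISTANCE_KM** columns shown in section B.

```python
from decimal import Decimal, ROUND_HALF_UP

KM_PER_MILE = 1.60934  # same conversion factor as the lab code


def miles_to_km(miles: float) -> float:
    """Plain-Python approximation of F.round(F.col(...) * 1.60934, 2).

    Hypothetical helper for illustration only; it works on a single float,
    not on a DataFrame column.
    """
    product = Decimal(str(miles * KM_PER_MILE))
    return float(product.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


# (trip_distance, TRIP_DISTANCE_KM) pairs copied from the section B output
sample_rows = [(1.4, 2.25), (1.31, 2.11), (1.8, 2.9), (10.4, 16.74), (4.42, 7.11)]

for miles, expected_km in sample_rows:
    assert miles_to_km(miles) == expected_km, (miles, expected_km)

print("All sample rows match the expected kilometer values.")
```

If a check like this fails after you change the conversion factor or the number of decimal places, the table comparison later in the lab would be expected to fail as well.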

In [0]:
## uppercase_column_names
def uppercase_column_names(df):
    return df.select([F.col(col).alias(col.upper()) for col in df.columns])

2. Run your functions to obtain the same results as the original PySpark code. The `save_to_catalog` call below names your new table **my_lab_table**. 

**NOTE:** If you receive a schema mismatch error, you are trying to overwrite a table that was created with a different schema. Delete the table and then recreate it.

In [0]:
## Load table
df = load_data("samples.nyctaxi.trips")

## Convert miles to km
df = convert_miles_to_km(df, new_column_name = "trip_distance_km", miles_column = "trip_distance")

## Uppercase the column names
df = uppercase_column_names(df)

## Save DataFrame as a table in your catalog
save_to_catalog(df, catalog_name = DA.catalog_name, schema_name="default", table_name = "my_lab_table")
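
If the save step raises a schema mismatch error, the note above suggests deleting the table before rerunning. The following sketch only builds the SQL statement for that; `drop_table_statement` is a hypothetical helper and `my_catalog` is a placeholder for your own catalog name. In the notebook, you would pass the resulting string to `spark.sql(...)`, using `DA.catalog_name` as the catalog.

```python
def drop_table_statement(catalog_name, schema_name, table_name):
    """Build a DROP TABLE IF EXISTS statement with backtick-quoted identifiers.

    Hypothetical helper for illustration; it returns a string and does not
    run anything against a catalog.
    """
    parts = (catalog_name, schema_name, table_name)
    # Escape any backtick inside an identifier by doubling it
    quoted = ".".join("`" + part.replace("`", "``") + "`" for part in parts)
    return f"DROP TABLE IF EXISTS {quoted}"


# "my_catalog" is a placeholder; in the lab this would be DA.catalog_name
statement = drop_table_statement("my_catalog", "default", "my_lab_table")
print(statement)

# In the notebook (assumes the `spark` session provided by Databricks):
# spark.sql(drop_table_statement(DA.catalog_name, "default", "my_lab_table"))
```

After dropping the table, rerun the cell above so that `save_to_catalog` recreates **my_lab_table** with the current schema.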

3. Run the following cell to test that the original table created in section B (**nyc_lab_solution_table**) is the same as your new table created by the functions above (**my_lab_table**). The test uses the PySpark `assertDataFrameEqual` function.

    If there is an error, it means the original table is not the same as your new table, and you need to fix your functions.

In [0]:
from pyspark.testing.utils import assertDataFrameEqual

# Read the tables (solution and your created table)
solution_df = spark.read.table(f"{DA.catalog_name}.default.nyc_lab_solution_table")
user_df = spark.read.table(f"{DA.catalog_name}.default.my_lab_table")

# Use assertDataFrameEqual to compare the two tables. It raises an error if the tables are different.
assertDataFrameEqual(solution_df, user_df)

print("The tables are identical! Functions were created correctly.")

The tables are identical! Functions were created correctly.



&copy; 2025 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>