In [22]:
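import os

# `!export` only affects a throwaway subshell; os.environ changes the kernel's environment.
os.environ["JAVA_HOME"] = os.path.expanduser("~/spark/jdk-11.0.2")
os.environ["SPARK_HOME"] = os.path.expanduser("~/spark/spark-3.4.2-bin-hadoop3-scala2.13")
os.environ["PATH"] = f'{os.environ["SPARK_HOME"]}/bin:{os.environ["JAVA_HOME"]}/bin:{os.environ["PATH"]}'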

# **1. Introduction to Spark**

Apache Spark is an engine for parallel data processing across clusters of machines. It supports Java, Python, R, and Scala, and it ships with libraries for different tasks such as SQL, streaming, and machine learning. It also scales from a single laptop to clusters of thousands of servers, which makes it easy to get started with, although mastering it takes time.

The Apache Spark framework uses a master-slave architecture with two key components: a **driver program** and a set of **executors** running on worker nodes. A "node" is an individual computer or server that is part of a larger network or cluster. Each node is an independent unit with its own CPU, memory, and storage, capable of executing tasks. The following diagram gives a high-level overview of the architecture:

<center>
<img src="figures/architecture-spark.png">
</center>

- **Driver program:** This is the heart of the Spark application, running the main function and creating the **SparkContext**. The **SparkContext** is essential for establishing communication with the **cluster manager** and creating Resilient Distributed Datasets (RDDs), which are distributed across the cluster and operated on in parallel. The Spark driver also contains other components, such as the DAG Scheduler, Task Scheduler, Scheduler Backend, and Block Manager, which are responsible for translating the user-written code into jobs that are actually executed on the cluster.

- **Cluster Manager:** It's a service that can run on a node or across multiple nodes. The **cluster manager** is an external service for acquiring resources on the cluster (e.g., Apache Mesos, Hadoop YARN, or Spark's own standalone **cluster manager**). It allocates resources to Spark applications based on the requirements sent by the **Driver Program** through the **SparkContext**. It's responsible for allocating executor processes across the available worker nodes within the cluster as requested by the **Driver Program**.

- **Worker Node:** This is a node where Spark executors run **tasks**. Each **worker node** is a physical or virtual machine that is part of the Spark cluster. 
    
    - **Executors:** Processes that run computations and store data for your application on the **worker node**. Each executor can run multiple tasks concurrently using threads.
    - **Tasks:** Tasks are individual units of work sent to the **executor** by the driver. Each task corresponds to a combination of data and computation.
    - **Cache:** This is used by executors to store data that can be reused in other **tasks**, reducing the need to fetch this data from disk or over the network for subsequent **tasks**.

The **SparkContext** in the **driver program** requests resources from the **cluster manager**, which then assigns **worker nodes**. Executors on the **worker nodes** then process **tasks** and store results. These results are sent back to the **driver program**, which may then initiate further **tasks** or return results to the user.

We can run Spark in three different modes: local mode, cluster mode, and client mode. The mode in which Spark runs determines the location of the **driver program** and the executors. The mode also affects how the **driver program** communicates with the **cluster manager** and the **worker nodes**. Consider each mode in more detail:

### **Local Mode**

Local mode simplifies the execution environment by running the entire Spark application on a single machine. Unlike cluster and client modes, which distribute **tasks** across multiple nodes in a cluster, local mode simulates a distributed environment using threads on one machine. This mode is ideal for development, testing, or experimentation, as it does not require a cluster and simplifies the setup process.

### **Cluster Mode**

In cluster mode, the Spark application is submitted to a **cluster manager** (such as Apache Mesos, Hadoop YARN, or Spark's own standalone **cluster manager**), and both the **driver program** and the **executor** processes are launched within the cluster. Specifically, the driver process runs on one of the **worker nodes** designated by the **cluster manager**, separate from the machine where the submission occurred.

### **Client Mode**

Client mode operates similarly to cluster mode, with the key difference being the location of the Spark driver. In client mode, the driver runs on the client machine, that is, the machine from which the application is submitted, often referred to as the gateway machine or edge node. This setup allows direct interaction between the application's user and the driver, making it easier to debug or monitor the application's progress. The **executor** processes, however, are still managed by the **cluster manager** and run within the cluster. Client mode is particularly useful during the development and debugging phases of an application, as it allows for immediate access to the driver's logs and prompts.



## **1.1 Batch Processing**

Batch processing in Spark executes jobs that handle large volumes of data in batches, contrasting with stream processing where data is processed continuously as it arrives. Each job processes a complete dataset at once.

The workflow of batch processing in Spark begins by reading data in large batches from diverse data sources. This data is then processed in parallel across the nodes in the cluster, using Spark's core abstractions such as RDDs (Resilient Distributed Datasets), DataFrames, or Datasets, depending on the requirements of the task. The data then goes through a series of transformations, which Spark organizes as a Directed Acyclic Graph (DAG) and uses to schedule the job's execution across the cluster.

RDDs (Resilient Distributed Datasets) are a fundamental concept in Apache Spark: each one represents a distributed collection of objects. Internally, Spark DataFrames are built upon RDDs, adding a layer of abstraction that enables more efficient data handling and processing. This is crucial for batch processing, where large datasets are divided into smaller partitions that are processed concurrently across multiple nodes, significantly speeding up processing times.

Spark's approach to batch processing involves constructing a DAG of RDD transformations for each job, optimizing the execution plan to minimize data shuffling and enhance job execution times. The scheduler divides the DAG into stages that can be executed concurrently, further optimizing resource usage and processing time for batch jobs.

The execution stages are divided into tasks, each assigned to an executor. Executors are JVM (Java Virtual Machine) processes allocated to Spark applications by the cluster manager, running on worker nodes. The task scheduler distributes these tasks among the available executors, optimizing for data locality and balancing the workload so that batch processing jobs run efficiently.
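
As a rough mental model of the partition, task, and executor flow described above, the following sketch splits a dataset into partitions and runs one task per partition on a thread pool. It is plain Python rather than Spark, and the helper names (`split_into_partitions`, `run_task`, `run_job`) are made up for illustration; Spark's real scheduler additionally handles stages, data locality, and failures.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition):
    """One task: compute a partial result for a single partition."""
    return sum(partition)


def run_job(data, num_partitions=4):
    """Toy 'driver': one task per partition, then combine the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(run_task, partitions))
    return sum(partial_sums)


print(run_job(list(range(1, 101))))  # 5050
```

The threads here only illustrate the one-task-per-partition idea; they say nothing about how fast Spark itself would run the same job.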


# **2. Virtual Machine in Google Cloud**

The idea is to run a Spark application on a virtual machine in Google Cloud, in both local and cluster mode. The first step is to create the virtual machine. Before configuring it, we need to generate an SSH key pair, which will allow us to connect to the VM instance using the SSH protocol. To generate the key pair, open a terminal, go to the `.ssh` directory, and run the following command:

```bash
ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USERNAME -b 2048
```

Here `KEY_FILENAME` is the name of the file that will store the key pair and `USERNAME` is the username that will be used to connect to the VM instance. The `-b` flag specifies the number of bits in the key, here 2048. The command creates a private key file `KEY_FILENAME` and a public key file `KEY_FILENAME.pub`. The public key file will be used to configure the VM instance.

Now, in the Google Cloud Platform, open the left navigation bar and go to `Compute Engine > Metadata`, click on `SSH Keys`, and then on `ADD SSH KEY` to add the public key to the list of SSH keys. The public key to copy is the content of the `KEY_FILENAME.pub` file. After adding the key, click on `Save` to save the changes. At the end we should have the following:

<center>
<img src="figures/ssh-key.png">
</center>

Now any virtual machine that is created will use this SSH key, which allows us to connect to the VM instance using the private key. In the Google Cloud Platform, go to the navigation bar on the left and select `Compute Engine > VM instances`, then click on `Create` to create a new VM instance.

**Machine configuration**

The following image shows the configuration of the machine that will be used in this project.


<center>
<img src="figures/vm-setup.png">
</center>

**Boot Disk**

This will create a machine with 4 vCPUs and 16 GB of memory. The boot disk is an Ubuntu 22.04 LTS image with a disk size of 40 GB.

<center>
<img src="figures/boot-disk.png">
</center>

Then we can click on create to create the VM instance. We should see the following:

<center>
<img src="figures/vm-instance-ip.png">
</center>

Now copy the external IP address (`EXTERNAL_IP`) of the VM instance and go to the terminal to connect using the following command:

```bash
ssh -i ~/.ssh/KEY_FILENAME USERNAME@EXTERNAL_IP
```

The `-i` flag is used to specify the private key file. After running the command, we should be connected to the VM instance. In my case, the command is `ssh -i ~/.ssh/gcp_key marcos@34.138.143.219`. We can check the machine configuration using the `htop` command; the following image shows its output.

<center>
<img src="figures/htop-vm.png">
</center>

Instead of typing the entire ssh command each time, we can create a `config` file in the `.ssh` directory. The file should contain the following entry for each SSH connection:

```bash
Host VM_NAME
  HostName EXTERNAL_IP
  User USERNAME
  IdentityFile ~/.ssh/KEY_FILENAME
```

In my case, to connect to Google Cloud, this would be:

```bash
Host de-bootcamp
  HostName 34.138.143.219
  User marcos
  IdentityFile ~/.ssh/gcp_key
```

Then we can connect to the VM instance simply using:

```bash
ssh VM_NAME
```

We can also configure VS Code to connect to the VM instance via SSH. To do this, we need to install the `Remote - SSH` extension. After installing the extension, click on the green icon at the bottom left of VS Code and select `Open a Remote Window:Connect to Host...`. After selecting the correct connection, VS Code will connect to the VM instance and we can start coding there. For more details, check the [VS Code Remote SSH documentation](https://code.visualstudio.com/docs/remote/ssh). We should have something like:

<center>
<img src="figures/vscode-ssh.png">
</center>

## **2.1 Spark Installation**

Installing Spark on the virtual machine is similar to installing it on a local Ubuntu computer via the terminal. We first install Java, then Spark and PySpark.

### **Java Installation**

To install Java, download OpenJDK 11 from the [OpenJDK archive](https://jdk.java.net/archive/) into the `~/spark` folder and unpack it with:

```bash
wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz

tar xzfv openjdk-11.0.2_linux-x64_bin.tar.gz
```
The flags `xzfv` mean, respectively: extract, decompress the archive with gzip before extracting, treat the next argument as the archive file name, and print the names of files as they are extracted. In `.bashrc` we need to define `JAVA_HOME` and add it to `PATH` so that it is available whenever a terminal is opened. Open the `~/.bashrc` file and add the following:

```bash
export JAVA_HOME="${HOME}/spark/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"
```

Check that it works:

```bash
java --version
```

### **Spark Installation**


To install Spark, download [Spark](https://spark.apache.org/downloads.html) to the same folder `~/spark` and unpack it with:

```bash
wget https://dlcdn.apache.org/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3-scala2.13.tgz

tar xzfv spark-3.4.2-bin-hadoop3-scala2.13.tgz
```

Again, in the `.bashrc` add a new path for spark to `PATH`:

```bash
export SPARK_HOME="${HOME}/spark/spark-3.4.2-bin-hadoop3-scala2.13"
export PATH="${SPARK_HOME}/bin:${PATH}"
```

To check if it works, open a new terminal, execute `spark-shell` and run the following:

```scala
val data = 1 to 10000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect()
```

<center>
<img src="figures/spark-test.png" >
</center>

### **Pyspark Installation**

To install `pyspark` we can use the following command:

```bash
pip install pyspark==3.4.2
```
Make sure that the version matches the installed Spark version; otherwise, this can cause conflicts.
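
One way to compare the two versions is sketched below. It assumes the Spark directory name follows the `spark-X.Y.Z-...` pattern used in this guide; the helper names are made up for illustration, and the `pyspark` version is read from the installed package metadata.

```python
import re
from importlib import metadata


def spark_version_from_home(spark_home):
    """Extract 'X.Y.Z' from a directory name such as 'spark-3.4.2-bin-hadoop3'."""
    match = re.search(r"spark-(\d+\.\d+\.\d+)", spark_home)
    return match.group(1) if match else None


def installed_pyspark_version():
    """Return the pip-installed pyspark version, or None if it is not installed."""
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return None


spark_home = "~/spark/spark-3.4.2-bin-hadoop3-scala2.13"  # path used in this guide
print("Spark:", spark_version_from_home(spark_home))
print("PySpark:", installed_pyspark_version())
```

If the two printed versions differ, reinstalling `pyspark` with the matching version number is the simplest fix.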

### 2.1.1 Testing Spark Setup

When setting up a development environment with VS Code to work with Jupyter notebooks, we might find that VS Code does not automatically recognize environment variables such as `JAVA_HOME` and `SPARK_HOME`. If we are running on our local machine, we can open VS Code from the terminal with the `code` command, which lets VS Code recognize the environment variables. If instead we are running on a VM instance from Google Cloud and connecting VS Code through the Remote - SSH extension, a practical solution is to use port forwarding and access the Jupyter notebook through a web browser. Here I forward port 8888 for the Jupyter notebook and port 4040 for the Spark UI, as shown below:

<center>
<img src="figures/port-vscode.png">
</center>

After this, make sure Jupyter is installed on the VM instance. To open the Jupyter notebook, go to the terminal, connect to the VM instance, and run the following:

```bash
jupyter notebook
```
We should have something like:

<center>
<img src="figures/run-jupyter-terminal.png">
</center>

The server will be running at `http://localhost:8888/`. It provides a URL that includes a token-based authentication part, which looks like `?token=<some_long_string>`. This token is a security measure to prevent unauthorized access to our Jupyter notebooks; in this case the full address is `http://localhost:8888/?token=f4c5f14db0b2934c18e2a911c85270067f6f7ba79f677510`. For the Spark UI we just need to access the address `http://localhost:4040`.
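
If one of the pages does not load, a quick check that the forwarded ports are reachable can help narrow down the problem. The sketch below is only an illustration (the helper name `is_port_open` is made up); note that the Spark UI on port 4040 is only served while a Spark application is running.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports forwarded in this guide: 8888 for Jupyter, 4040 for the Spark UI.
for name, port in {"Jupyter": 8888, "Spark UI": 4040}.items():
    status = "reachable" if is_port_open("localhost", port) else "not reachable"
    print(f"{name} (port {port}): {status}")
```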

Another approach, which runs the notebook with Spark directly in VS Code without opening the Jupyter UI in a browser, is to define the environment variables at the start of the notebook as follows:

In [23]:
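import os

# `!export ...` runs in a throwaway subshell, so the variables never reach the
# notebook kernel; setting os.environ changes the kernel's own environment.
os.environ["JAVA_HOME"] = os.path.expanduser("~/spark/jdk-11.0.2")
os.environ["SPARK_HOME"] = os.path.expanduser("~/spark/spark-3.4.2-bin-hadoop3-scala2.13")
os.environ["PATH"] = f'{os.environ["SPARK_HOME"]}/bin:{os.environ["JAVA_HOME"]}/bin:{os.environ["PATH"]}'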




Now download the Taxi Zone Lookup CSV file from [NYC taxi](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and save it inside the `data/raw` directory, which is where the code in this notebook reads it from:

```bash
wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
```

In [24]:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd

The following code initiates a `SparkSession`, an entry point for programming Spark with the Dataset and DataFrame API. It is used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files. As we have seen before, Spark has three modes: cluster, client, and local. For testing, let's use the local mode by setting `.master("local[*]")`. This means that Spark will run locally on the virtual machine instance from Google Cloud, using as many worker threads as there are logical cores on the machine. The method `.appName("test")` sets the name of the application to be shown in the Spark web UI, and the method `.getOrCreate()` creates the SparkSession with all the parameters set previously.

In [25]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

# Suppress warnings
spark.sparkContext.setLogLevel("ERROR")


In [26]:
# Read the taxi zone lookup and show
# Set header to true to consider first row as header
df = spark.read \
    .option("header", "true") \
    .csv('data/raw/taxi+_zone_lookup.csv')

df.show()

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
|         6|Staten Island|Arrochar/Fort Wad...|   Boro Zone|
|         7|       Queens|             Astoria|   Boro Zone|
|         8|       Queens|        Astoria Park|   Boro Zone|
|         9|       Queens|          Auburndale|   Boro Zone|
|        10|       Queens|        Baisley Park|   Boro Zone|
|        11|     Brooklyn|          Bath Beach|   Boro Zone|
|        12|    Manhattan|        Battery Park| Yellow Zone|
|        13|    Manhattan|   Battery Park City| Yellow Zone|
|        14|     Brookly

When we run the `df.write.parquet('data/zones')` command in PySpark, the resulting `zones` directory will contain several files because of how Spark stores data in Parquet format. The `_SUCCESS` file is a marker indicating that the write operation completed successfully. The accompanying CRC (Cyclic Redundancy Check) files, like `._SUCCESS.crc`, hold checksums used to verify data integrity. The data itself is stored in `.snappy.parquet` files, whose names include a unique identifier and a partition index, indicating that each file contains a portion of the DataFrame's data compressed with Snappy. Each `.snappy.parquet` file also has a corresponding `.crc` file for integrity checks. The presence of these files reflects Spark's distributed approach, where data can be partitioned across multiple files for scalable processing and storage.

In [27]:
# writes the data in df to the zones directory in Parquet format
df.write.parquet('data/zones', mode='overwrite')
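
To see the file roles described above for an actual output directory, a small helper like the following could group the file names. The function name and the role labels are made up for illustration, and the classification relies only on the naming patterns mentioned in the text.

```python
from pathlib import Path


def classify_parquet_output(names):
    """Group file names from a Spark Parquet output directory by their role."""
    groups = {"marker": [], "checksum": [], "data": [], "other": []}
    for name in sorted(names):
        if name == "_SUCCESS":
            groups["marker"].append(name)
        elif name.endswith(".crc"):
            groups["checksum"].append(name)
        elif name.endswith(".parquet"):
            groups["data"].append(name)
        else:
            groups["other"].append(name)
    return groups


output_dir = Path("data/zones")  # directory written by the cell above
if output_dir.is_dir():
    names = (p.name for p in output_dir.iterdir())
    for role, files in classify_parquet_output(names).items():
        print(f"{role}: {files}")
```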

# **3. Spark DataFrames**

## **3.1 Transformation vs Action**

In Spark, operations on DataFrames are categorized into transformations and actions, which are fundamental to understanding how Spark processes data, especially in a distributed manner. Transformations are operations that create a new DataFrame from an existing one but do not trigger computation by themselves. Instead, they build up internally as a Directed Acyclic Graph (DAG) that Spark uses to compute the result lazily. Actions, in contrast, trigger the execution of the computations specified by the DAG of transformations. They are operations that produce a result.

### **Transformation (Lazy)**
Transformations create a new DataFrame or RDD from an existing one. 

1. **`filter(condition)`**: Returns a new DataFrame containing only the rows that meet the condition.
   
2. **`select(*cols)`**: Projects a set of expressions and returns a new DataFrame with the selected columns.
   
3. **`groupBy(*cols)`**: Groups the DataFrame using the specified columns, returning a GroupedData object that can be further aggregated.
   
4. **`orderBy(*cols, **kwargs)`**: Returns a new DataFrame sorted by the specified column(s).
   
5. **`join(other, on=None, how=None)`**: Joins with another DataFrame, using the given join expression and method.
   
6. **`withColumn(colName, col)`**: Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
   

### **Actions (Eager)**
Actions trigger the execution of the DAG of transformations and return a result.

1. **`show(n=20)`**: Displays the first `n` rows of the DataFrame.
   
2. **`count()`**: Returns the number of rows in the DataFrame.
   
3. **`collect()`**: Returns all the rows as a list of Row objects.
   
4. **`take(n)`**: Returns the first `n` rows as a list of Row objects.
   
5. **`first()`**: Returns the first row as a Row object.
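
The difference between lazy transformations and eager actions can be illustrated with a toy model in plain Python. The `LazyRows` class below is not part of Spark; it is a made-up stand-in that records transformations and only runs them when an action is called.

```python
class LazyRows:
    """Toy stand-in for a DataFrame: transformations are recorded, actions run them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded steps, each a (kind, function) pair

    # --- transformations: return a new object, compute nothing ---
    def filter(self, predicate):
        return LazyRows(self._rows, self._plan + (("filter", predicate),))

    def select(self, *cols):
        def project(row):
            return {c: row[c] for c in cols}
        return LazyRows(self._rows, self._plan + (("map", project),))

    # --- actions: execute the recorded plan and return a result ---
    def collect(self):
        out = iter(self._rows)
        for kind, fn in self._plan:
            out = filter(fn, out) if kind == "filter" else map(fn, out)
        return list(out)

    def count(self):
        return len(self.collect())


rows = [
    {"id": 1, "zone": "Astoria"},
    {"id": 2, "zone": "Auburndale"},
    {"id": 3, "zone": "Bath Beach"},
]
evaluated = []


def starts_with_a(row):
    evaluated.append(row["id"])  # record when the predicate actually runs
    return row["zone"].startswith("A")


pipeline = LazyRows(rows).filter(starts_with_a).select("zone")
print(evaluated)           # [] because nothing has run yet
print(pipeline.collect())  # [{'zone': 'Astoria'}, {'zone': 'Auburndale'}]
print(evaluated)           # [1, 2, 3]
```

Building `pipeline` touches no rows; only `collect()` walks the recorded plan, which mirrors how Spark defers work until an action is called.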


To better understand these concepts, download the High Volume For-Hire Vehicle Trip Records (PARQUET) for January 2023 from [NYC taxi](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and save the file inside the `data/raw` directory:

```bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet
```

Let's start the SparkSession if not yet started and print the schema from parquet file:

In [28]:
# Start Session if not
#spark = SparkSession.builder \
#    .master("local[*]") \
#    .appName('test') \
#    .getOrCreate()

df = spark.read.parquet('data/raw/fhvhv_tripdata_2023-01.parquet')
df.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- originating_base_num: string (nullable = true)
 |-- request_datetime: timestamp_ntz (nullable = true)
 |-- on_scene_datetime: timestamp_ntz (nullable = true)
 |-- pickup_datetime: timestamp_ntz (nullable = true)
 |-- dropoff_datetime: timestamp_ntz (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- trip_time: long (nullable = true)
 |-- base_passenger_fare: double (nullable = true)
 |-- tolls: double (nullable = true)
 |-- bcf: double (nullable = true)
 |-- sales_tax: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)
 |-- tips: double (nullable = true)
 |-- driver_pay: double (nullable = true)
 |-- shared_request_flag: string (nullable = true)
 |-- shared_match_flag: string (nullable = true)
 |-- access_a_ride_f

To see these transformations and actions in practice, let's select the `pickup_datetime`, `dropoff_datetime`, `PULocationID`, and `DOLocationID` columns, filter by `hvfhs_license_num == 'HV0003'`, and show the result. The following diagram shows the DAG of transformations and actions:

<center>
<img src="figures/transf-actions.png">
</center>

In [29]:
df.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID' )\
    .filter(df.hvfhs_license_num == 'HV0003')\
    .show()

+-------------------+-------------------+------------+------------+
|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|
+-------------------+-------------------+------------+------------+
|2023-01-01 00:19:38|2023-01-01 00:48:07|          48|          68|
|2023-01-01 00:58:39|2023-01-01 01:33:08|         246|         163|
|2023-01-01 00:20:27|2023-01-01 00:37:54|           9|         129|
|2023-01-01 00:41:05|2023-01-01 00:48:16|         129|         129|
|2023-01-01 00:52:47|2023-01-01 01:04:51|         129|          92|
|2023-01-01 00:10:29|2023-01-01 00:18:22|          90|         231|
|2023-01-01 00:22:10|2023-01-01 00:33:14|         125|         246|
|2023-01-01 00:39:09|2023-01-01 01:03:50|          68|         231|
|2023-01-01 00:14:35|2023-01-01 00:49:13|          79|          50|
|2023-01-01 00:52:15|2023-01-01 01:31:11|         143|         223|
|2023-01-01 00:24:48|2023-01-01 00:37:39|          49|         181|
|2023-01-01 00:46:20|2023-01-01 00:52:51|       

# **4. Preparing Yellow and Green Taxi Data**

## 4.1 Download Data via Bash Script

## 4.2 Redefining Parquet Schema and Using SQL

Even though our data is already in Parquet files, it's important to standardize the schema across all datasets to ensure that the data types are exactly as expected for all periods.

In [30]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types

# Start Session if not
spark = SparkSession.builder\
    .master("local[*]")\
    .appName('test')\
    .getOrCreate()

Let's load a sample of the data to check the schema that comes with the taxi trip data:

In [31]:
df_yellow_2022_01 = spark.read.parquet('data/raw/yellow/2022/01/')
print("yellow schema\n")
display(set(df_yellow_2022_01.schema))

df_green_2022_01 = spark.read.parquet('data/raw/green/2022/01/')
print("green schema\n")
display(set(df_green_2022_01.schema))

yellow schema



{StructField('DOLocationID', LongType(), True),
 StructField('PULocationID', LongType(), True),
 StructField('RatecodeID', DoubleType(), True),
 StructField('VendorID', LongType(), True),
 StructField('airport_fee', DoubleType(), True),
 StructField('congestion_surcharge', DoubleType(), True),
 StructField('extra', DoubleType(), True),
 StructField('fare_amount', DoubleType(), True),
 StructField('improvement_surcharge', DoubleType(), True),
 StructField('mta_tax', DoubleType(), True),
 StructField('passenger_count', DoubleType(), True),
 StructField('payment_type', LongType(), True),
 StructField('store_and_fwd_flag', StringType(), True),
 StructField('tip_amount', DoubleType(), True),
 StructField('tolls_amount', DoubleType(), True),
 StructField('total_amount', DoubleType(), True),
 StructField('tpep_dropoff_datetime', TimestampNTZType(), True),
 StructField('tpep_pickup_datetime', TimestampNTZType(), True),
 StructField('trip_distance', DoubleType(), True)}

green schema



{StructField('DOLocationID', LongType(), True),
 StructField('PULocationID', LongType(), True),
 StructField('RatecodeID', DoubleType(), True),
 StructField('VendorID', LongType(), True),
 StructField('congestion_surcharge', DoubleType(), True),
 StructField('ehail_fee', IntegerType(), True),
 StructField('extra', DoubleType(), True),
 StructField('fare_amount', DoubleType(), True),
 StructField('improvement_surcharge', DoubleType(), True),
 StructField('lpep_dropoff_datetime', TimestampNTZType(), True),
 StructField('lpep_pickup_datetime', TimestampNTZType(), True),
 StructField('mta_tax', DoubleType(), True),
 StructField('passenger_count', DoubleType(), True),
 StructField('payment_type', DoubleType(), True),
 StructField('store_and_fwd_flag', StringType(), True),
 StructField('tip_amount', DoubleType(), True),
 StructField('tolls_amount', DoubleType(), True),
 StructField('total_amount', DoubleType(), True),
 StructField('trip_distance', DoubleType(), True),
 StructField('trip_type
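
Differences between two schemas are easier to spot when they are compared programmatically. The sketch below works on plain `{column: type}` dictionaries; the function name `diff_schemas` is made up for illustration, and the sample entries are a few fields copied from the January 2022 schemas printed above, written with Spark's short type names.

```python
def diff_schemas(left, right):
    """Compare two {column: type} mappings.

    Returns (only_left, only_right, mismatched), where mismatched maps a shared
    column name to its (left_type, right_type) pair.
    """
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    mismatched = {
        name: (left[name], right[name])
        for name in sorted(set(left) & set(right))
        if left[name] != right[name]
    }
    return only_left, only_right, mismatched


# A few fields copied from the January 2022 schemas printed above.
yellow = {"VendorID": "bigint", "payment_type": "bigint", "airport_fee": "double"}
green = {"VendorID": "bigint", "payment_type": "double", "ehail_fee": "int"}

print(diff_schemas(yellow, green))
# (['airport_fee'], ['ehail_fee'], {'payment_type': ('bigint', 'double')})
```

With real DataFrames, the dictionaries could be built as `{f.name: f.dataType.simpleString() for f in df.schema.fields}`.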

Here we change the types of `VendorID`, `DOLocationID`, `PULocationID`, `RatecodeID`, `passenger_count`, and `payment_type` to `IntegerType`, and `tpep_pickup_datetime` and `tpep_dropoff_datetime` to `TimestampType`. The script iterates over two years (2022 and 2023) and all months within those years, processing the datasets stored in Parquet format. The data is then redistributed across the Spark cluster using `.repartition()`. This helps the performance of the Spark job by spreading the data evenly across the available nodes in the cluster so that it can be processed in parallel. Finally, the data is saved in Parquet format with the new schema.
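
The same casting idea can be sketched more compactly by driving it from a mapping of column names to target types. This is only one possible variant, not the code used in this notebook: `YELLOW_TYPES`, `month_paths`, and `cast_columns` are names made up for illustration, only the columns whose type actually changes are listed, and the sketch relies on `cast` accepting type names as strings.

```python
from itertools import product

# Target types as Spark SQL type names; Column.cast also accepts such strings.
YELLOW_TYPES = {
    "VendorID": "int",
    "PULocationID": "int",
    "DOLocationID": "int",
    "RatecodeID": "int",
    "passenger_count": "int",
    "payment_type": "int",
    "tpep_pickup_datetime": "timestamp",
    "tpep_dropoff_datetime": "timestamp",
}


def month_paths(taxi, years, root="data"):
    """Yield (input_path, output_path) for every month of the given years."""
    for year, month in product(years, range(1, 13)):
        suffix = f"{taxi}/{year}/{month:02d}/"
        yield f"{root}/raw/{suffix}", f"{root}/processed/new-schema/{suffix}"


def cast_columns(df, type_map):
    """Cast every column named in type_map that exists in df; leave the rest untouched."""
    for name, type_name in type_map.items():
        if name in df.columns:
            df = df.withColumn(name, df[name].cast(type_name))
    return df


# Hypothetical usage with an active SparkSession named `spark`:
# for src, dst in month_paths("yellow", [2022, 2023]):
#     cast_columns(spark.read.parquet(src), YELLOW_TYPES) \
#         .repartition(4).write.mode("overwrite").parquet(dst)
print(next(month_paths("yellow", [2022, 2023])))
# ('data/raw/yellow/2022/01/', 'data/processed/new-schema/yellow/2022/01/')
```

Keeping the types in one dictionary makes it easier to reuse the same loop for the green dataset with a different mapping.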

In [51]:
for year in [2022, 2023]:
    for month in range(1, 13):
        print(f'processing data for {year}/{month}')

        input_path = f'data/raw/yellow/{year}/{month:02d}/'
        output_path = f'data/processed/new-schema/yellow/{year}/{month:02d}/'

        df_yellow = spark.read.parquet(input_path)
        df_yellow = df_yellow\
                    .withColumn("DOLocationID",df_yellow["DOLocationID"].cast(types.IntegerType()) )\
                    .withColumn("PULocationID",df_yellow["PULocationID"].cast(types.IntegerType()) )\
                    .withColumn("RatecodeID",df_yellow["RatecodeID"].cast(types.IntegerType()) )\
                    .withColumn("VendorID",df_yellow["VendorID"].cast(types.IntegerType()) )\
                    .withColumn("airport_fee",df_yellow["airport_fee"].cast(types.DoubleType()) )\
                    .withColumn("congestion_surcharge",df_yellow["congestion_surcharge"].cast(types.DoubleType()) )\
                    .withColumn("extra",df_yellow["extra"].cast(types.DoubleType()) )\
                    .withColumn("fare_amount",df_yellow["fare_amount"].cast(types.DoubleType()) )\
                    .withColumn("improvement_surcharge",df_yellow["improvement_surcharge"].cast(types.DoubleType()) )\
                    .withColumn("mta_tax",df_yellow["mta_tax"].cast(types.DoubleType()) )\
                    .withColumn("passenger_count",df_yellow["passenger_count"].cast(types.IntegerType()) )\
                    .withColumn("payment_type",df_yellow["payment_type"].cast(types.IntegerType()) )\
                    .withColumn("store_and_fwd_flag",df_yellow["store_and_fwd_flag"].cast(types.StringType()) )\
                    .withColumn("tip_amount",df_yellow["tip_amount"].cast(types.DoubleType()) )\
                    .withColumn("tolls_amount",df_yellow["tolls_amount"].cast(types.DoubleType()) )\
                    .withColumn("total_amount",df_yellow["total_amount"].cast(types.DoubleType()) )\
                    .withColumn("tpep_dropoff_datetime",df_yellow["tpep_dropoff_datetime"].cast(types.TimestampType()) )\
                    .withColumn("tpep_pickup_datetime",df_yellow["tpep_pickup_datetime"].cast(types.TimestampType()) )\
                    .withColumn("trip_distance",df_yellow["trip_distance"].cast(types.DoubleType()))            
                
        df_yellow\
            .repartition(4) \
            .write.mode('overwrite')\
            .parquet(output_path)

processing data for 2022/1
processing data for 2022/2
processing data for 2022/3
processing data for 2022/4
processing data for 2022/5
processing data for 2022/6
processing data for 2022/7
processing data for 2022/8
processing data for 2022/9
processing data for 2022/10
processing data for 2022/11
processing data for 2022/12
processing data for 2023/1
processing data for 2023/2
processing data for 2023/3
processing data for 2023/4
processing data for 2023/5
processing data for 2023/6
processing data for 2023/7
processing data for 2023/8
processing data for 2023/9
processing data for 2023/10
processing data for 2023/11
processing data for 2023/12

In [34]:
for year in [2022, 2023]:
    for month in range(1, 13):
        print(f'processing data for {year}/{month}')

        input_path = f'data/raw/green/{year}/{month:02d}/'
        output_path = f'data/processed/new-schema/green/{year}/{month:02d}/'
        
        df_green = spark.read.parquet(input_path)
        df_green = df_green\
            .withColumn("DOLocationID", df_green["DOLocationID"].cast(types.IntegerType()))\
            .withColumn("PULocationID", df_green["PULocationID"].cast(types.IntegerType()))\
            .withColumn("RatecodeID", df_green["RatecodeID"].cast(types.IntegerType()))\
            .withColumn("VendorID", df_green["VendorID"].cast(types.IntegerType()))\
            .withColumn("congestion_surcharge", df_green["congestion_surcharge"].cast(types.DoubleType()))\
            .withColumn("ehail_fee", df_green["ehail_fee"].cast(types.IntegerType()))\
            .withColumn("extra", df_green["extra"].cast(types.DoubleType()))\
            .withColumn("fare_amount", df_green["fare_amount"].cast(types.DoubleType()))\
            .withColumn("improvement_surcharge", df_green["improvement_surcharge"].cast(types.DoubleType()))\
            .withColumn("lpep_dropoff_datetime", df_green["lpep_dropoff_datetime"].cast(types.TimestampType()))\
            .withColumn("lpep_pickup_datetime", df_green["lpep_pickup_datetime"].cast(types.TimestampType()))\
            .withColumn("mta_tax", df_green["mta_tax"].cast(types.DoubleType()))\
            .withColumn("passenger_count", df_green["passenger_count"].cast(types.IntegerType()))\
            .withColumn("payment_type", df_green["payment_type"].cast(types.IntegerType()))\
            .withColumn("store_and_fwd_flag", df_green["store_and_fwd_flag"].cast(types.StringType()))\
            .withColumn("tip_amount", df_green["tip_amount"].cast(types.DoubleType()))\
            .withColumn("tolls_amount", df_green["tolls_amount"].cast(types.DoubleType()))\
            .withColumn("total_amount", df_green["total_amount"].cast(types.DoubleType()))\
            .withColumn("trip_distance", df_green["trip_distance"].cast(types.DoubleType()))\
            .withColumn("trip_type", df_green["trip_type"].cast(types.DoubleType()))
        
        df_green\
            .repartition(4) \
            .write.mode('overwrite')\
            .parquet(output_path)

processing data for 2022/1


processing data for 2022/2
processing data for 2022/3
processing data for 2022/4
processing data for 2022/5
processing data for 2022/6
processing data for 2022/7
processing data for 2022/8
processing data for 2022/9
processing data for 2022/10
processing data for 2022/11
processing data for 2022/12
processing data for 2023/1
processing data for 2023/2
processing data for 2023/3
processing data for 2023/4
processing data for 2023/5
processing data for 2023/6
processing data for 2023/7
processing data for 2023/8
processing data for 2023/9
processing data for 2023/10
processing data for 2023/11
processing data for 2023/12
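
The long chain of `.withColumn(...).cast(...)` calls in the cell above can also be driven by a mapping from column name to target type. The following is a minimal sketch, not the approach used in this notebook: `GREEN_TYPES` and `cast_columns` are hypothetical names, and the type strings (`"int"`, `"double"`, `"timestamp"`, `"string"`) are assumed to be equivalent to the `types.*` classes used above.

```python
# Target type for each green-taxi column, copied from the casts in the cell above.
GREEN_TYPES = {
    "DOLocationID": "int",
    "PULocationID": "int",
    "RatecodeID": "int",
    "VendorID": "int",
    "congestion_surcharge": "double",
    "ehail_fee": "int",
    "extra": "double",
    "fare_amount": "double",
    "improvement_surcharge": "double",
    "lpep_dropoff_datetime": "timestamp",
    "lpep_pickup_datetime": "timestamp",
    "mta_tax": "double",
    "passenger_count": "int",
    "payment_type": "int",
    "store_and_fwd_flag": "string",
    "tip_amount": "double",
    "tolls_amount": "double",
    "total_amount": "double",
    "trip_distance": "double",
    "trip_type": "double",
}


def cast_columns(df, column_types):
    """Return df with every column listed in column_types cast to its target type.

    Hypothetical helper: columns missing from the mapping pass through unchanged,
    and the original column order is preserved.
    """
    # Imported here so the mapping itself can be inspected without Spark installed.
    from pyspark.sql import functions as F

    return df.select([
        F.col(name).cast(column_types[name]).alias(name)
        if name in column_types
        else F.col(name)
        for name in df.columns
    ])
```

Inside the loop, the cast would then reduce to `df_green = cast_columns(spark.read.parquet(input_path), GREEN_TYPES)`, and the yellow dataset would only need its own mapping.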


We can now load the entire dataset, covering all years for each taxi service. The schema transformation ensures that all data is consistent and can be used for further analysis.


In [40]:
pd.set_option('display.max_columns', None)


# All years and months from green and yellow taxi
df_yellow = spark.read.parquet('data/processed/new-schema/yellow/*/*')
df_green = spark.read.parquet('data/processed/new-schema/green/*/*/')


# Rename columns so the green and yellow taxi match
df_yellow = df_yellow \
    .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')
    
df_green = df_green \
    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')


    
print(f'number of rows for yellow taxi: {df_yellow.count()}\n')
df_yellow.select("pickup_datetime", "dropoff_datetime", "passenger_count", "trip_distance").show(5)
print(f'number of rows for green taxi: {df_green.count()}\n')
df_green.select("pickup_datetime", "dropoff_datetime", "passenger_count", "trip_distance").show(5)



number of rows for yellow taxi: 77966324

+-------------------+-------------------+---------------+-------------+
|    pickup_datetime|   dropoff_datetime|passenger_count|trip_distance|
+-------------------+-------------------+---------------+-------------+
|2022-10-30 13:20:54|2022-10-30 13:35:28|           null|          0.0|
|2022-10-20 11:28:50|2022-10-20 11:43:51|              2|         2.32|
|2022-10-22 17:05:36|2022-10-22 17:17:40|              1|          1.8|
|2022-10-20 13:36:58|2022-10-20 13:47:52|              3|          0.9|
|2022-10-13 09:49:44|2022-10-13 10:43:58|              2|         7.83|
+-------------------+-------------------+---------------+-------------+
only showing top 5 rows

number of rows for green taxi: 1627462

+-------------------+-------------------+---------------+-------------+
|    pickup_datetime|   dropoff_datetime|passenger_count|trip_distance|
+-------------------+-------------------+---------------+-------------+
|2022-03-22 20:32:55|2022-03-22 20:35:40|              1|         0.54|
|2022-03-24 20:52:32|2022-03-24 20:56:31|              1|         0.86|
|2022-03-22 18:41:31|2022-03-22 19:05:08|              1|          8.2|
|2022-03-29 09:26:54|2022-03-29 09:43:17|              1|          3.5|
|2022-03-07 16:23:16|2022-03-07 16:30:23|              1|          1.0|
+-------------------+-------------------+---------------+-------------+
only showing top 5 rows

Let's select only the columns common to the yellow and green taxi datasets and union them, adding a new column called `service_type` to identify the taxi service. We can then compute some summary statistics using SQL.

In [41]:
# To maintain the order of the columns
common_columns = []

yellow_columns = set(df_yellow.columns)
for col in df_green.columns:
    if col in yellow_columns:
        common_columns.append(col)

#print(f'\n{common_columns}\n')

# Select common columns and
# Add service_type column (yellow or green)
df_green_select = df_green \
    .select(common_columns) \
    .withColumn('service_type', F.lit('green'))

df_yellow_select = df_yellow \
    .select(common_columns) \
    .withColumn('service_type', F.lit('yellow'))
    
df_trips_data = df_green_select.unionAll(df_yellow_select)

df_trips_data.createOrReplaceTempView('trips_data')
spark.sql("""
SELECT
    service_type,
    count(1)
FROM
    trips_data
GROUP BY 
    service_type
""").show()




+------------+--------+
|service_type|count(1)|
+------------+--------+
|       green| 1627462|
|      yellow|77966324|
+------------+--------+



                                                                                

In [42]:
df_trips_data.columns

['VendorID',
 'pickup_datetime',
 'dropoff_datetime',
 'store_and_fwd_flag',
 'RatecodeID',
 'PULocationID',
 'DOLocationID',
 'passenger_count',
 'trip_distance',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount',
 'payment_type',
 'congestion_surcharge',
 'service_type']

In [43]:
df_result = spark.sql("""
SELECT 
    -- Revenue grouping
    PULocationID AS revenue_zone,
    date_trunc('month', pickup_datetime) AS revenue_month, 
    service_type, 

    -- Revenue calculation 
    SUM(fare_amount) AS revenue_monthly_fare,
    SUM(extra) AS revenue_monthly_extra,
    SUM(mta_tax) AS revenue_monthly_mta_tax,
    SUM(tip_amount) AS revenue_monthly_tip_amount,
    SUM(tolls_amount) AS revenue_monthly_tolls_amount,
    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,
    SUM(total_amount) AS revenue_monthly_total_amount,
    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,

    -- Additional calculations
    AVG(passenger_count) AS avg_monthly_passenger_count,
    AVG(trip_distance) AS avg_monthly_trip_distance
FROM
    trips_data
GROUP BY
    1, 2, 3
""")

df_result.select("revenue_zone", "revenue_month", "service_type", "revenue_monthly_fare").show(5)

df_result.coalesce(1).write.parquet('data/report/revenue/', mode='overwrite')



+------------+-------------------+------------+--------------------+
|revenue_zone|      revenue_month|service_type|revenue_monthly_fare|
+------------+-------------------+------------+--------------------+
|          92|2022-03-01 00:00:00|       green|            10347.98|
|          80|2022-04-01 00:00:00|       green|             5414.24|
|         109|2022-04-01 00:00:00|       green|                25.0|
|          82|2022-06-01 00:00:00|       green|  32203.450000000004|
|          72|2022-06-01 00:00:00|       green|  1744.9600000000007|
+------------+-------------------+------------+--------------------+
only showing top 5 rows
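
To make the grouping in the query above concrete, here is a small plain-Python sketch of what `date_trunc('month', ...)` combined with `GROUP BY 1, 2, 3` computes. The trips are hypothetical values, `truncate_to_month` is an illustrative helper, and only one of the sums is reproduced.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical trips: (PULocationID, pickup_datetime, service_type, total_amount)
trips = [
    (92, datetime(2022, 3, 5, 8, 15), "green", 12.5),
    (92, datetime(2022, 3, 28, 19, 40), "green", 7.5),
    (92, datetime(2022, 4, 2, 10, 0), "green", 20.0),
    (92, datetime(2022, 3, 9, 23, 59), "yellow", 30.0),
]


def truncate_to_month(ts):
    """Mimic date_trunc('month', ts): keep year and month, reset everything else."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


# GROUP BY 1, 2, 3 groups on the first three SELECT expressions:
# revenue_zone, revenue_month and service_type.
revenue = defaultdict(float)
for zone, pickup, service, total in trips:
    revenue[(zone, truncate_to_month(pickup), service)] += total

for key, total in sorted(revenue.items()):
    print(key, total)
```

The two green trips from March collapse into a single row, while the April trip and the yellow trip each form their own group.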

                                                                                

# **5. Group By and Joins**

Understanding the mechanics of a GROUP BY in Spark is crucial. Consider the following:

In [46]:
df_green = spark.read.parquet('data/processed/new-schema/green/*/*/')
df_green.createOrReplaceTempView('green')

With the data loaded, we perform a GROUP BY query that aggregates total revenue and counts the records by hour and zone:

In [50]:

df_green_revenue = spark.sql("""
SELECT 
    date_trunc('hour', lpep_pickup_datetime) AS hour,
    PULocationID AS zone,

    SUM(total_amount) AS amount,
    COUNT(1) AS number_records
FROM
    green
WHERE
    lpep_pickup_datetime >= '2022-01-01 00:00:00'
GROUP BY
    1, 2
ORDER BY
    1, 2
""")

df_green_revenue.show(5)

df_green_revenue.write.parquet('data/report/revenue/green', mode='overwrite')



+-------------------+----+------+--------------+
|               hour|zone|amount|number_records|
+-------------------+----+------+--------------+
|2022-01-01 00:00:00|   7|   9.8|             1|
|2022-01-01 00:00:00|  25| 25.05|             1|
|2022-01-01 00:00:00|  33|158.72|             4|
|2022-01-01 00:00:00|  37| 37.95|             2|
|2022-01-01 00:00:00|  40|   7.3|             1|
+-------------------+----+------+--------------+
only showing top 5 rows

                                                                                

<center>
<img src="figures/groupby-dag.png">
</center>


In the DAG visualization, `Stage 10` covers the initial read of the Parquet files. Spark marked this stage as skipped because its output was already available from an earlier run, so the work did not need to be repeated. `WholeStageCodegen` is an optimization technique that compiles an entire query stage into a single Java function, which reduces virtual function calls and CPU instruction cache misses. Here it wraps the Scan Parquet operation. Finally, `Exchange` is a shuffle operation, which redistributes data across executors according to a partitioning scheme. Operations like `GROUP BY` require it because rows with the same key may initially sit in different partitions.

`Stage 11` represents an adaptive query execution (AQE) shuffle read: it reads the shuffled data produced by the previous stage. This stage was also skipped because we had previously run a GROUP BY query, so the shuffle data was already available.

Finally, `Stage 12` is similar to the previous shuffle read, but it belongs to the stage that wasn't skipped: this shuffle read prepares the data for the final ORDER BY and write operations. `WriteFiles` is the final step, where the processed data is written out as Parquet files to the specified location.
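
The role of the shuffle in a GROUP BY can be illustrated without Spark. The sketch below is a plain-Python simulation under simplifying assumptions, not Spark's actual implementation: each input partition first aggregates its own rows, the partial results are then routed to a target partition by hashing the key (the `Exchange` step), and each target partition merges what it received. The records and the names `partial_aggregate`, `exchange` and `final_aggregate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (hour, zone, total_amount) records.
partitions = [
    [("2022-01-01 00:00:00", 7, 9.8),
     ("2022-01-01 00:00:00", 33, 50.0),
     ("2022-01-01 00:00:00", 33, 20.0)],
    [("2022-01-01 00:00:00", 33, 48.72),
     ("2022-01-01 00:00:00", 33, 40.0),
     ("2022-01-01 01:00:00", 7, 15.0)],
]
NUM_SHUFFLE_PARTITIONS = 2


def partial_aggregate(records):
    """Before the shuffle: each partition pre-aggregates its own rows by key."""
    acc = defaultdict(lambda: [0.0, 0])  # key -> [amount, number_records]
    for hour, zone, amount in records:
        acc[(hour, zone)][0] += amount
        acc[(hour, zone)][1] += 1
    return acc


def exchange(partials, num_partitions):
    """Shuffle: route every key to exactly one target partition by hashing it."""
    targets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, value in partial.items():
            targets[hash(key) % num_partitions].append((key, value))
    return targets


def final_aggregate(shuffled):
    """After the shuffle: merge the partial results that landed in one partition."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, (amount, count) in shuffled:
        acc[key][0] += amount
        acc[key][1] += count
    return acc


shuffled = exchange([partial_aggregate(p) for p in partitions], NUM_SHUFFLE_PARTITIONS)

result = {}
for part in shuffled:
    result.update(final_aggregate(part))

for key in sorted(result):
    print(key, result[key])
```

Which target partition a key lands in may differ between runs, but all partial results for the same key always meet in the same partition, which is what makes the final aggregation correct.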

For a JOIN, there are two cases to consider: joining two large tables, and joining one large table with a small one. Consider the following:

In [59]:
df_yellow = spark.read.parquet('data/processed/new-schema/yellow/*/*/')
df_yellow.createOrReplaceTempView('yellow')

In [60]:
df_yellow_revenue = spark.sql("""
SELECT 
    date_trunc('hour', tpep_pickup_datetime) AS hour,
    PULocationID AS zone,

    SUM(total_amount) AS amount,
    COUNT(1) AS number_records
FROM
    yellow
WHERE
    tpep_pickup_datetime >= '2022-01-01 00:00:00'
GROUP BY
    1, 2
""")

In [61]:
df_yellow_revenue\
    .repartition(20)\
    .write.parquet('data/report/revenue/yellow', mode='overwrite')

                                                                                

In [64]:
# to make clear which column is from each table
df_green_revenue_tmp = df_green_revenue\
    .withColumnRenamed('amount', 'green_amount')\
    .withColumnRenamed('number_records', 'green_number_records')

df_yellow_revenue_tmp = df_yellow_revenue\
    .withColumnRenamed('amount', 'yellow_amount')\
    .withColumnRenamed('number_records', 'yellow_number_records')

In [None]:
# Full outer join: keep every (hour, zone) pair present in either dataset
df_join = df_green_revenue_tmp.join(df_yellow_revenue_tmp, on=['hour', 'zone'], how='outer')

In [65]:
df_join.show(5)

+-------------------+----+------------+--------------------+------------------+---------------------+
|               hour|zone|green_amount|green_number_records|     yellow_amount|yellow_number_records|
+-------------------+----+------------+--------------------+------------------+---------------------+
|2022-01-01 00:00:00|   4|        null|                null|203.73000000000002|                   11|
|2022-01-01 00:00:00|   7|         9.8|                   1|             92.12|                    6|
|2022-01-01 00:00:00|  33|      158.72|                   4|             89.55|                    5|
|2022-01-01 00:00:00|  40|         7.3|                   1|              null|                 null|
|2022-01-01 00:00:00|  45|        null|                null|            205.46|                   11|
+-------------------+----+------------+--------------------+------------------+---------------------+
only showing top 5 rows
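
The nulls in the output above follow directly from the semantics of a full outer join. As a minimal plain-Python sketch (not Spark code), the five rows shown can be reproduced from two dictionaries keyed by `(hour, zone)`. The values are copied from the output above, and `full_outer_join` and `MISSING` are hypothetical names.

```python
# (hour, zone) -> (amount, number_records), copied from the five rows shown above.
green = {
    ("2022-01-01 00:00:00", 7): (9.8, 1),
    ("2022-01-01 00:00:00", 33): (158.72, 4),
    ("2022-01-01 00:00:00", 40): (7.3, 1),
}
yellow = {
    ("2022-01-01 00:00:00", 4): (203.73000000000002, 11),
    ("2022-01-01 00:00:00", 7): (92.12, 6),
    ("2022-01-01 00:00:00", 33): (89.55, 5),
    ("2022-01-01 00:00:00", 45): (205.46, 11),
}

MISSING = (None, None)  # stands in for the nulls Spark emits


def full_outer_join(left, right):
    """Keep every key found on either side; fill the absent side with None."""
    return {
        key: left.get(key, MISSING) + right.get(key, MISSING)
        for key in sorted(left.keys() | right.keys())
    }


joined = full_outer_join(green, yellow)
for (hour, zone), row in joined.items():
    print(hour, zone, *row)
```

Zones 4 and 45 have no green records in that hour and zone 40 has no yellow records, so the corresponding side is filled with `None`.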

In [None]:
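# Note: this reuses the yellow report path, so it replaces the files written earlier
df_join.write.parquet('data/report/revenue/yellow', mode='overwrite')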

<center>
<img src="figures/join-dag1.png">
</center>
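
For the second case mentioned earlier, a large table joined with a small one, Spark can avoid shuffling the large table by sending a full copy of the small table to every executor (a broadcast join). Whether Spark actually chooses this strategy depends on its size estimates and configuration. The following plain-Python sketch only illustrates the idea: the zone names, the revenue rows and the `broadcast_join` helper are all hypothetical.

```python
# Hypothetical small lookup table: zone id -> zone name.
# It is small enough that every executor can hold a full copy.
zones = {7: "zone-a", 33: "zone-b", 40: "zone-c"}

# Hypothetical large table, split across partitions: (zone, amount) rows.
revenue_partitions = [
    [(7, 9.8), (33, 158.72)],
    [(40, 7.3), (33, 89.55), (99, 1.0)],
]


def broadcast_join(partition, lookup):
    """Join one partition against the full copy of the small table.

    Each partition is processed on its own, so rows of the large table never
    move between partitions. Rows without a match are dropped (inner join).
    """
    return [
        (zone, amount, lookup[zone])
        for zone, amount in partition
        if zone in lookup
    ]


joined_partitions = [broadcast_join(p, zones) for p in revenue_partitions]
for part in joined_partitions:
    print(part)
```

Compare this with the join of the two revenue tables above, where both sides are large and rows with the same `(hour, zone)` key first have to be brought together by a shuffle.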


# **6. Running Spark in the Cloud**