# Joining the cleaned combined data with location data

In this data processing project, our objective was to enrich a very large taxi trip dataset with location information from a separate lookup dataset, by joining on the pickup and dropoff location IDs. The joins ran into several challenges along the way, and the result was an incomplete join between the datasets.


The steps below join the cleaned trip data with the location lookup (pickup and dropoff locations), write the result to Parquet on Databricks DBFS, and read it back to check the row count:

In [0]:
combined_data = spark.read.parquet("/CombinedCleanedData/cleaned_combined.parquet")
combined_data.count()

Out[1]: 709262227

In [0]:
combined_data.printSchema()

root
 |-- VendorID: long (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)
 |-- color: string (nullable = true)



In [0]:
lookup_df = spark.read.csv("dbfs:/FileStore/taxi_zone_lookup.csv", header=True, inferSchema=True)


In [0]:
lookup_df.printSchema()

root
 |-- LocationID: integer (nullable = true)
 |-- Borough: string (nullable = true)
 |-- Zone: string (nullable = true)
 |-- service_zone: string (nullable = true)
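
Since the lookup file has only four columns, its schema could also be declared up front instead of inferred. This skips the extra pass over the file that `inferSchema=True` needs, and it lets `LocationID` be read as `long` so that it matches `PULocationID` and `DOLocationID` in the trip data. This is a minimal sketch: the helper `read_zone_lookup` and the constant `ZONE_LOOKUP_SCHEMA` are illustrative names, not part of the original notebook.

```python
# Hypothetical helper: declares the lookup schema as a DDL string instead of inferring it.
ZONE_LOOKUP_SCHEMA = "LocationID BIGINT, Borough STRING, Zone STRING, service_zone STRING"


def read_zone_lookup(spark, path="dbfs:/FileStore/taxi_zone_lookup.csv"):
    """Read the taxi zone lookup CSV with an explicit schema (no inference pass)."""
    return spark.read.csv(path, header=True, schema=ZONE_LOOKUP_SCHEMA)
```

In a notebook cell this would be called as `lookup_df = read_zone_lookup(spark)`.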



In [0]:
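# Clear a previous output directory before re-running (left commented out; the
# output below was recorded when the line was active)
#dbutils.fs.rm("/MergedData/2016_parquet", True)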


Out[2]: True

In [0]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Merge and Save").getOrCreate()

# Read the combined data
combined_data = spark.read.parquet("/CombinedCleanedData/cleaned_combined.parquet")

# Read the lookup data
lookup_df = spark.read.csv("dbfs:/FileStore/taxi_zone_lookup.csv", header=True, inferSchema=True)

# Select the relevant columns from the lookup data and rename them
lookup_df = lookup_df.selectExpr("LocationID as PULocationID", "Borough as PULocation_Borough", "Zone as PULocation_Zone")

# Rename PULocationID in combined_data so it does not clash with the lookup column of the same name
combined_data = combined_data.withColumnRenamed("PULocationID", "Combined_PULocationID")

# Join the combined_data with the lookup_df on PULocationID
merged_data = combined_data.join(lookup_df, combined_data["Combined_PULocationID"] == lookup_df["PULocationID"], "left")

# Save the merged data as Parquet (Spark writes a directory of part files at this path)
merged_data.write.parquet("/MergedData/merged_pulocation.parquet")

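# The session is deliberately not stopped here: later cells still use `spark`,
# and in a Databricks notebook the session is managed by the cluster.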


In [0]:
merged_data = spark.read.parquet("/MergedData/merged_pulocation.parquet/*")
merged_data.count()

Out[3]: 605365170
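
A left join keeps every row of its left-hand input, so the merged output should contain at least the 709,262,227 rows of `combined_data`. The lower count here suggests that the write did not finish or that not all part files were read back, rather than that the join itself dropped rows. A small check like the one below would make that visible straight away. It is a sketch only, and the function name `assert_left_join_kept_rows` is illustrative.

```python
def assert_left_join_kept_rows(left_rows: int, joined_rows: int) -> None:
    """Raise ValueError if a left join result has fewer rows than its left input."""
    if joined_rows < left_rows:
        missing = left_rows - joined_rows
        raise ValueError(
            f"left join lost {missing:,} of {left_rows:,} rows "
            f"({missing / left_rows:.1%})"
        )


# Intended use in the notebook (assumes both DataFrames are loaded):
#   assert_left_join_kept_rows(combined_data.count(), merged_data.count())
```

With the counts recorded in this notebook, the check would report 103,897,057 missing rows, about 14.6% of the input.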

In [0]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Merge and Save").getOrCreate()

# Read the output of the pickup-location join
combined_data = spark.read.parquet("/MergedData/merged_pulocation.parquet/*")

# Read the lookup data
lookup_df = spark.read.csv("dbfs:/FileStore/taxi_zone_lookup.csv", header=True, inferSchema=True)

# Select the relevant columns from the lookup data and rename them
lookup_df = lookup_df.selectExpr("LocationID as DOLocationID", "Borough as DOLocation_Borough")

# Rename DOLocationID in combined_data so it does not clash with the lookup column of the same name
combined_data = combined_data.withColumnRenamed("DOLocationID", "Combined_DOLocationID")

# Join the combined_data with the lookup_df on DOLocationID
merged_data = combined_data.join(lookup_df, combined_data["Combined_DOLocationID"] == lookup_df["DOLocationID"], "left")

# Save the merged data as Parquet (Spark writes a directory of part files at this path)
merged_data.write.parquet("/MergedData/merged_both_locations.parquet")

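# The session is deliberately not stopped here: later cells still use `spark`,
# and in a Databricks notebook the session is managed by the cluster.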
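
The pickup and dropoff joins could also be expressed in a single pass, without writing and re-reading the intermediate `merged_pulocation.parquet` output. The sketch below assumes the zone lookup is small enough to broadcast to every executor, which has not been verified here. The helper names `zone_columns` and `add_zone_info` are illustrative, and the output column naming differs slightly from the cells above because the join key is kept as a single `PULocationID` / `DOLocationID` column.

```python
def zone_columns(prefix, fields=("Borough", "Zone")) -> list:
    """Build selectExpr strings that rename lookup columns for one side of a trip."""
    # Cast the key to bigint so it matches the long-typed IDs in the trip data.
    exprs = [f"cast(LocationID as bigint) as {prefix}LocationID"]
    exprs += [f"{field} as {prefix}Location_{field}" for field in fields]
    return exprs


def add_zone_info(trips, lookup):
    """Left-join pickup and dropoff zone details onto `trips` in one pass."""
    # Imported lazily so zone_columns() can be used without a Spark installation.
    from pyspark.sql.functions import broadcast

    for prefix in ("PU", "DO"):
        # Assumption: the lookup is small, so broadcasting it avoids shuffling the trips.
        side = broadcast(lookup.selectExpr(*zone_columns(prefix)))
        # Joining on the shared column name keeps a single <prefix>LocationID column.
        trips = trips.join(side, on=f"{prefix}LocationID", how="left")
    return trips
```

A call such as `merged_data = add_zone_info(combined_data, lookup_df)` followed by one `write.parquet(...)` would then replace both join cells.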


In [0]:
merged_data = spark.read.parquet("/MergedData/merged_both_locations.parquet")
merged_data.count()

Out[1]: 605365170

# Some Highlights:


We set out to combine a very large taxi trip dataset with location information, and ran into three main challenges:

1. **Data Size:** The combined dataset had 709,262,227 rows, which made every full pass over it (counting, joining, writing) expensive.

2. **Limited Resources:** Our computing power was limited. The compute available to us struggled to process the full dataset, which made the work slow.

3. **Data Growth:** The dataset kept growing during the project, which made the first two problems worse over time.
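
One common way to work within limited resources is to process the data in smaller slices, for example one pickup year at a time, so that a failed write affects only one slice. The sketch below assumes that approach. The helper names `year_bounds` and `write_year` are illustrative, and the `<year>_parquet` directory naming is modeled on the `/MergedData/2016_parquet` path that appears earlier in the notebook rather than taken from the project's actual code.

```python
from datetime import datetime


def year_bounds(year: int):
    """Return the half-open [start, end) timestamp range covering one calendar year."""
    return datetime(year, 1, 1), datetime(year + 1, 1, 1)


def write_year(trips, year: int, out_root: str = "/MergedData") -> None:
    """Filter one pickup year out of `trips` and write it to its own Parquet directory."""
    # Imported lazily so year_bounds() can be used without a Spark installation.
    from pyspark.sql.functions import col

    start, end = year_bounds(year)
    pickup = col("tpep_pickup_datetime")
    (
        trips.filter((pickup >= start) & (pickup < end))
        .write.mode("overwrite")  # re-running a year replaces only that year's output
        .parquet(f"{out_root}/{year}_parquet")
    )
```

Each year can then be written and verified independently, for example with `write_year(merged_data, 2016)`.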

In short, the problem was not the quality of the location data but the sheer size of the taxi dataset. The project taught us that working with big data is hard, even with good tools.

In the end, we could not completely combine the location data: the merged output holds 605,365,170 rows against 709,262,227 in the input. Still, we learned a lot about managing data, planning, and making the most of the resources we have, and the project was a reminder that big data is both challenging and full of possibilities.


