# Big Data Analysis Case: NYC Taxi Trips

**Example Workflow**

![](https://miro.medium.com/max/700/1*WrRz33pZkGEU8q50kxSf8Q.jpeg)

_[Source](https://towardsdatascience.com/if-taxi-trips-were-fireflies-1-3-billion-nyc-taxi-trips-plotted-b34e89f96cfa)_

## Analysis Challenge

**How did the COVID-19 pandemic impact taxi traffic in New York City?**

- _Conduct an **exploratory data analysis**: Use descriptive statistics and data visualization._
- _Then compare the traffic patterns during the 2020/2021 pandemic with those of several previous years._
- _Implement your data processing with PySpark and other tools from the Python data science ecosystem._

## Workflow

### Preamble

In [None]:
import findspark
findspark.init()

In [None]:
import pyspark

### Data 

_Point this path to the location of the taxi trip data_

In [None]:
data_dir = "../.assets/data/taxi/raw"

In [None]:
!ls -lah {data_dir}

_Initiate a `SparkSession`_

In [None]:
spark = pyspark.sql.SparkSession \
                    .builder \
                    .appName("Spark SQL First Example") \
                    .getOrCreate()

_Take a look at the first lines of one CSV file to see the header_

In [None]:
!head {data_dir}/yellow_tripdata_2020-04.csv

_Define the schema of the dataframe_

In [None]:
# One "name TYPE" entry per CSV column, in the same order as the file header
yellow_schema = [
    "VendorID INT",
    "tpep_pickup_datetime TIMESTAMP",
    "tpep_dropoff_datetime TIMESTAMP",
    "passenger_count INT",
    "trip_distance DOUBLE",
    "RatecodeID STRING",
    "store_and_fwd_flag STRING",
    "PULocationID STRING",
    "DOLocationID STRING",
    "payment_type STRING",
    "fare_amount DOUBLE",
    "extra DOUBLE",  # monetary surcharge, so numeric like the other amounts
    "mta_tax DOUBLE",
    "tip_amount DOUBLE",
    "tolls_amount DOUBLE",
    "improvement_surcharge DOUBLE",
    "total_amount DOUBLE",
    "congestion_surcharge DOUBLE"
]

In [None]:
yellow_schema_str = ", ".join(yellow_schema)
yellow_schema_str
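
_Optional check: does the schema match the CSV header?_

When a schema is supplied, Spark's CSV reader by default applies it by position rather than by name (the `enforceSchema` option), so a column that is missing or out of order would be mislabeled without an error. The following is a minimal sketch of a position-by-position comparison using only the standard library. The helper names `schema_column_names` and `header_mismatches` and the three-column example are hypothetical and not part of the original notebook.

```python
import csv
import io


def schema_column_names(schema):
    """Return the column names from a list of "name TYPE" entries."""
    return [entry.split()[0] for entry in schema]


def header_mismatches(header_line, schema):
    """Compare a CSV header line with the schema, position by position.

    Returns (position, header_name, schema_name) tuples for every position
    where the two disagree; None marks a column missing on one side.
    """
    header = next(csv.reader(io.StringIO(header_line)))
    expected = schema_column_names(schema)
    mismatches = []
    for i in range(max(len(header), len(expected))):
        h = header[i].strip() if i < len(header) else None
        e = expected[i] if i < len(expected) else None
        if h != e:
            mismatches.append((i, h, e))
    return mismatches


# Hypothetical three-column example, not the real taxi header
example_schema = ["VendorID INT", "trip_distance DOUBLE", "total_amount DOUBLE"]
example_header = "VendorID,trip_distance,total_amount"
print(header_mismatches(example_header, example_schema))
```

In this notebook the check could be run against the first line of one data file, for example `open(f"{data_dir}/yellow_tripdata_2020-04.csv").readline()`, together with `yellow_schema`. An empty list means names and order agree.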

_Read the CSV file into a Spark DataFrame_

In [None]:
yellow_data = spark.read.csv(
    f"{data_dir}/yellow*.csv",
    schema=yellow_schema_str,
    header=True
)

In [None]:
yellow_data.printSchema()

## Analysis

In [None]:
columns_of_interest = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "total_amount", "trip_distance"]

In [None]:
yellow_data[columns_of_interest].show()

_Analysis task: Calculate number of trips per day_

In [None]:
from pyspark.sql.functions import window

In [None]:
%%time
trips_per_day_spark = (
    yellow_data
    .groupBy(window("tpep_pickup_datetime", "1 day"))
    .count()
)
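
_What the aggregation computes, in plain Python_

Grouping by a one-day `window` and counting amounts to sorting every pickup timestamp into a fixed, non-overlapping daily bucket and counting the entries per bucket. The sketch below illustrates the idea on a handful of made-up timestamps using only the standard library. The function name `count_trips_per_day` and the toy values are assumptions for illustration. Note that Spark aligns its window boundaries to the Unix epoch in UTC, so depending on the session time zone its buckets may not start exactly at local midnight, whereas this sketch simply uses the calendar date.

```python
from collections import Counter
from datetime import datetime


def count_trips_per_day(pickup_times):
    """Count pickups per calendar day (a one-day tumbling window)."""
    counts = Counter(ts.date() for ts in pickup_times)
    # Sort by date so the result reads like a time series
    return dict(sorted(counts.items()))


# Toy timestamps for illustration only, not real taxi data
toy_pickups = [
    datetime(2020, 4, 2, 8, 30),
    datetime(2020, 4, 1, 0, 15),
    datetime(2020, 4, 1, 23, 59),
]

for day, n_trips in count_trips_per_day(toy_pickups).items():
    print(day, n_trips)
```

This only works for data that fits into memory; the Spark version distributes the same grouping across partitions.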

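_The aggregated result is small (one row per day), so continue with pandas. Spark evaluates lazily: the cell above only builds the query plan, and the actual computation runs when `toPandas()` collects the result._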

In [None]:
%%time
trips_per_day = trips_per_day_spark.toPandas()

In [None]:
import pandas

In [None]:
trips_per_day.head()

_Extract a `DatetimeIndex` from the window start_

In [None]:
# Each "window" value is a struct with "start" and "end"; keep the start date
trips_per_day["date"] = trips_per_day["window"].apply(
    lambda w: w["start"].date()
)
trips_per_day["date"] = pandas.to_datetime(trips_per_day["date"])
trips_per_day = trips_per_day.set_index("date")

In [None]:
trips_per_day.head()

In [None]:
trips_per_day.index

In [None]:
trips_per_day = trips_per_day.sort_index()
trips_per_day

In [None]:
trips_per_day.index

_Plot number of trips per time interval_

In [None]:
import seaborn
import matplotlib.pyplot as plt


seaborn.set(style="whitegrid")
plt.style.use("dark_background")

In [None]:
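# Select rows by year with .loc (plain [] looks up columns) and plot the counts
trips_per_day.loc["2020", "count"].plot(
    kind="line",
    figsize=(20, 5),
    title="Number of trips per day in 2020"
)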
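
_Possible next step: compare the pandemic year with a baseline year_

The analysis challenge asks for a comparison with previous years. One simple way to approach it is to sum the daily counts per month and compute the relative change against a baseline year. The sketch below does this with the standard library only. The function names `monthly_totals` and `relative_change` and all counts are made up for illustration and are not results from the taxi data.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily_counts, year):
    """Sum daily trip counts into a month -> total mapping for one year."""
    totals = defaultdict(int)
    for day, count in daily_counts.items():
        if day.year == year:
            totals[day.month] += count
    return dict(totals)


def relative_change(daily_counts, baseline_year, target_year):
    """Per-month relative change of target_year versus baseline_year."""
    base = monthly_totals(daily_counts, baseline_year)
    target = monthly_totals(daily_counts, target_year)
    return {
        month: (target[month] - base[month]) / base[month]
        for month in sorted(base)
        # Only months present in both years with a non-zero baseline
        if month in target and base[month] > 0
    }


# Made-up counts for illustration only, not real taxi data
toy_counts = {
    date(2019, 4, 1): 100,
    date(2019, 4, 2): 100,
    date(2020, 4, 1): 10,
    date(2020, 4, 2): 30,
}

print(relative_change(toy_counts, 2019, 2020))
```

In this notebook the input could presumably be built with `trips_per_day["count"].to_dict()`, since the resulting timestamp keys also expose `.year` and `.month`. A negative value means fewer trips than in the baseline year.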

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_