## BIG DATA ANALYSIS Project 2024 (Graph Analytics) 

Group: 

- Dinis Fernandes 20221848
- Dinis Gaspar 20221869
- Inês Santos 20221916
- Luis Davila 20221949
- Sara Ferrer 20221947

## Yellow taxi movements during the most crowded month in NYC (December) in 2023.
We will perform an analysis of NYC yellow taxi movements during December.

## Goals
- Identify popular and important locations:
  - For the whole month, in general.
  - For Christmas Day.
  - For New Year's Eve.
- Find the most common trip paths.
- Optimize taxi distribution (areas with high degree centrality indicate regions requiring more taxis).
- Identify communities and human patterns.

## IMPORTS

In [0]:
# Explicit imports avoid shadowing Python built-ins such as sum, min and max
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, count, to_date, when

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

from graphframes import GraphFrame

import matplotlib.pyplot as plt
import networkx as nx
import seaborn as sns

## The Data
The dataset chosen for the graph analysis is the NYC Yellow Taxi trip data for the most crowded month of the year (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

## Content
- VendorID: ID of taximeter company.
- tpep_pickup_datetime: Date and time when the meter was engaged.
- tpep_dropoff_datetime: Date and time when the meter was disengaged.
- passenger_count: Number of passengers in the vehicle.
- trip_distance: Trip distance in miles reported by the taximeter.
- RatecodeID: Final rate code in effect at the end of the trip.
- store_and_fwd_flag: whether the trip record was held in vehicle memory before sending to the vendor.
- PULocationID: TLC Taxi Zone in which the taximeter was engaged.
- DOLocationID: TLC Taxi Zone in which the taximeter was disengaged.
- payment_type: A numeric code signifying how the passenger paid for the trip.
- fare_amount: The time-and-distance fare calculated by the meter.
- extra: Miscellaneous extras and surcharges.
- mta_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
- tip_amount: Total amount of tips.
- tolls_amount: Total amount of all tolls paid in trip.
- improvement_surcharge: $0.30 improvement surcharge assessed on trips at the flag drop.
- total_amount: Total amount charged to passengers.
- congestion_surcharge: Total amount collected in trip for NYS congestion surcharge.
- Airport_fee: $1.75 if pick up at LaGuardia or John F. Kennedy Airports.

**Trip Data:**

In [0]:
spark = SparkSession.builder.appName("Read Parquet").getOrCreate()
yellow_taxis_dec = spark.read.parquet("/FileStore/tables/yellow_taxis_Dec.parquet")
display(yellow_taxis_dec)


**Taxi zones:**

In [0]:
file_location = "/FileStore/tables/taxi_zone_lookup.csv"
taxi_zone_lookup = spark.read.load(
    file_location, format="csv", sep=",", inferSchema="true", header="true"
)
display(taxi_zone_lookup)

**Join the tables on LocationID (to get the names of the boroughs, zones and service zones):**

In [0]:
# Join yellow_taxis_dec with the zone lookup to get the pickup names (PUZone, PUBorough and PUservice_zone)
yellow_taxis_dec = (
    yellow_taxis_dec.join(
        taxi_zone_lookup,
        yellow_taxis_dec["PULocationID"] == taxi_zone_lookup["LocationID"],
        "left",
    )
    .withColumnRenamed("Borough", "PUBorough")
    .withColumnRenamed("service_zone", "PUservice_zone")
    .withColumnRenamed("Zone", "PUZone")
)

# Drop the redundant LocationID column from the first join
yellow_taxis_dec = yellow_taxis_dec.drop("LocationID")

# Join yellow_taxis_dec with locations again to get DOzone, DOBorough and DOservice_zone
yellow_taxis_dec = (
    yellow_taxis_dec.join(
        taxi_zone_lookup,
        yellow_taxis_dec["DOLocationID"] == taxi_zone_lookup["LocationID"],
        "left",
    )
    .withColumnRenamed("Borough", "DOBorough")
    .withColumnRenamed("service_zone", "DOservice_zone")
    .withColumnRenamed("Zone", "DOZone")
)

# Drop the redundant LocationID column from the second join
yellow_taxis_dec = yellow_taxis_dec.drop("LocationID")

display(yellow_taxis_dec)

## Exploratory Data Analysis (EDA)

In [0]:
yellow_taxis_dec.printSchema()

In [0]:
taxi_zone_lookup.printSchema()

## Summary Statistics

In [0]:
# Check column statistics
yellow_taxis_dec.select(yellow_taxis_dec.columns[:5]).describe().show()

yellow_taxis_dec.select(yellow_taxis_dec.columns[5:11]).describe().show()

yellow_taxis_dec.select(yellow_taxis_dec.columns[11:16]).describe().show()

yellow_taxis_dec.select(yellow_taxis_dec.columns[16:19]).describe().show()

yellow_taxis_dec.select(yellow_taxis_dec.columns[19:]).describe().show()

## Missing Values

In [0]:
# Get the total number of rows in the DataFrame
nrows = yellow_taxis_dec.count()

# Count the number of missing (null) values for each column
n_missing = yellow_taxis_dec.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in yellow_taxis_dec.columns]
)

# Calculate and display the percentage of missing values for each column
display(
    n_missing.select([((col(c) / nrows) * 100).alias(c) for c in n_missing.columns])
)

**As we can see this dataset doesn't have missing values.**

## Correlation Between Numeric Variables

In [0]:
# Define a list of numeric columns
num_cols = [
    "passenger_count",
    "trip_distance",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "total_amount",
    "congestion_surcharge",
    "Airport_fee",
]

# Create a VectorAssembler to combine the selected columns into a single feature vector
assembler = VectorAssembler(
    inputCols=num_cols, outputCol="features", handleInvalid="skip"
)

# Apply the assembler to the DataFrame and select the 'features' column
df = assembler.transform(yellow_taxis_dec).select("features")

In [0]:
# Calculate the Pearson correlation matrix
pearson_correlation = Correlation.corr(df, "features", "pearson").collect()[0][0]

# Convert the correlation matrix to an array and then to a list
corr_matrix = pearson_correlation.toArray().tolist()

# Create a new DataFrame from the correlation matrix
df_s = spark.createDataFrame(corr_matrix, num_cols)

display(df_s)

In [0]:
# Create a heatmap to visualize the correlation matrix using seaborn
sns.heatmap(
    corr_matrix, cmap=sns.cm.rocket_r, xticklabels=num_cols, yticklabels=num_cols
)

plt.show()

**From the heatmap we can conclude that only the monetary variables show some correlation with each other.**
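Each cell of the correlation matrix is a Pearson coefficient. As a reference for how that number is obtained, here is a minimal plain-Python sketch of the formula. The `distance` and `fare` lists are made-up values for illustration only, not taken from the dataset, and `pearson` is a hypothetical helper, not a Spark function.

```python
from math import fsum, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n
    # Covariance and variances (the 1/n factors cancel out in the ratio)
    cov = fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = fsum((x - mean_x) ** 2 for x in xs)
    var_y = fsum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values: the fare grows linearly with the distance
distance = [1.0, 2.0, 3.0, 4.0]
fare = [5.0, 7.5, 10.0, 12.5]
r = pearson(distance, fare)
```

A perfectly linear relationship such as this one gives a coefficient of 1, which is why fare-related columns that are built from each other stand out in the heatmap.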

## Transformations and Data Cleaning

In [0]:
# Format the pickup and dropoff datetime columns as "yyyy-MM-dd HH:mm:ss" strings
yellow_taxis_dec = yellow_taxis_dec.withColumn(
    "tpep_pickup_datetime", F.date_format("tpep_pickup_datetime", "yyyy-MM-dd HH:mm:ss")
).withColumn(
    "tpep_dropoff_datetime",
    F.date_format("tpep_dropoff_datetime", "yyyy-MM-dd HH:mm:ss"),
)

display(yellow_taxis_dec)

In [0]:
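# Keep only trips that start and end within December 2023

# Define the date range
start_date = "2023-12-01"
end_date = "2024-01-01"

# The datetime columns are strings at this point, so between() compares text:
# "2024-01-01" sorts before every "2024-01-01 HH:mm:ss" value, which means trips
# that start or end after midnight on January 1st are dropped as well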
yellow_taxis_dec = yellow_taxis_dec.filter(
    (F.col("tpep_pickup_datetime").between(start_date, end_date))
    & (F.col("tpep_dropoff_datetime").between(start_date, end_date))
)

display(yellow_taxis_dec)

In [0]:
# Drop rows where dropoff is before pickup
yellow_taxis_dec = yellow_taxis_dec.filter(
    F.col("tpep_dropoff_datetime") > F.col("tpep_pickup_datetime")
)

# Show the result
display(yellow_taxis_dec)

In [0]:
# Some trips have a distance of 0.
# If such a trip also has the same PULocationID and DOLocationID and lasted less than a minute,
# we consider it a cancelled trip and drop it.

df = yellow_taxis_dec.withColumn(
    "time_diff",
    F.unix_timestamp("tpep_dropoff_datetime")
    - F.unix_timestamp("tpep_pickup_datetime"),
)  # Trip duration in seconds

# Keep rows where the trip lasted at least 60 seconds, or the pickup and dropoff locations differ, or the trip distance is not zero
yellow_taxis_dec = df.filter(
    (F.col("time_diff") >= 60)
    | (F.col("PULocationID") != F.col("DOLocationID"))
    | (F.col("trip_distance") != 0)
)

# Drop the 'time_diff' column after filtering
yellow_taxis_dec = yellow_taxis_dec.drop("time_diff")

display(yellow_taxis_dec)
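The filter above keeps a row when at least one of three conditions holds. By De Morgan's law this is the same as dropping only the rows where the trip lasted under 60 seconds, started and ended in the same zone, and covered zero distance. A minimal plain-Python sketch of that equivalence follows; the trip tuples and both helper functions are hypothetical and only mirror the Spark logic.

```python
def is_cancelled(time_diff, pu, do, distance):
    """A trip counts as cancelled only when all three conditions hold."""
    return time_diff < 60 and pu == do and distance == 0


def is_kept(time_diff, pu, do, distance):
    """Mirror of the Spark filter: keep the row when any one condition holds."""
    return time_diff >= 60 or pu != do or distance != 0


# Hypothetical rows: (seconds, PULocationID, DOLocationID, miles)
trips = [
    (20, 132, 132, 0.0),   # short, same zone, no distance: cancelled
    (20, 132, 138, 0.0),   # different zones: kept
    (900, 132, 132, 0.0),  # long enough: kept
    (20, 132, 132, 0.4),   # some distance: kept
]
kept = [t for t in trips if is_kept(*t)]
```

Only the first tuple is removed, so a genuine round trip that returns to its pickup zone survives as long as it has a duration or a distance.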

In [0]:
# Drop rows with negative values in the numeric columns, since these quantities cannot be negative
# (rows with nulls in these columns fail the comparison and are dropped too)
num_cols = [
    "passenger_count",
    "trip_distance",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "total_amount",
    "congestion_surcharge",
    "Airport_fee",
]

for column in num_cols:
    yellow_taxis_dec = yellow_taxis_dec.filter(col(column) >= 0)

display(yellow_taxis_dec)

## Visualizations using SparkSQL and MapReduce
- Distribution of payment_type.
- Whether passengers tipped or not.
- Average tip per pickup zone.
- Average tip per dropoff zone.
- Number of trips per day.
- Number of trips per hour.
- Trip paths with the highest costs.


`payment_type` is a numeric code signifying how the passenger paid for the trip:
- 1 = Credit card.
- 2 = Cash.
- 3 = No charge.
- 4 = Dispute.
- 5 = Unknown.
- 6 = Voided trip (trip that was canceled).


In [0]:
# Create Temporary View of the dataset for an efficient way to leverage SQL queries
yellow_taxis_dec.createOrReplaceTempView("yellow_taxis")

# SQL query to count the number of entries for each payment type
sorted_distribution_with_window = spark.sql(
    """
    SELECT DISTINCT payment_type,
           COUNT(*) OVER (PARTITION BY payment_type) AS count
    FROM yellow_taxis
    ORDER BY count DESC
"""
)

display(sorted_distribution_with_window)

**Using MapReduce:**

In [0]:
# Convert DataFrame to a Resilient Distributed Dataset
yellow_taxis_rdd = yellow_taxis_dec.rdd

# MapReduce for payment type distribution
payment_type_counts_rdd = (
    yellow_taxis_rdd.map(lambda row: (row["payment_type"], 1))  # Map: create key-value pairs (payment_type, 1)
    .reduceByKey(lambda x, y: x + y)  # Reduce: sum the counts for each payment type
)

# Collect results
display(payment_type_counts_rdd.collect())


In [0]:
# Create Temporary View of the dataset for an efficient way to leverage SQL queries
yellow_taxis_dec.createOrReplaceTempView("yellow_taxis")

# SQL query to categorize trips into "Tipped" and "Didn't Tip", and count occurrences for each category
tip_distribution = spark.sql(
    """
    SELECT 
        CASE 
            WHEN tip_amount = 0 THEN "Didn't Tip"
            ELSE "Tipped"
        END AS tip_status,
        COUNT(*) AS count
    FROM yellow_taxis
    GROUP BY tip_status
"""
)

display(tip_distribution)

**Using MapReduce:**

In [0]:
# Convert DataFrame to a Resilient Distributed Dataset
yellow_taxis_rdd = yellow_taxis_dec.rdd

# MapReduce for tip distribution
# Map step: for each row, emit "Didn't Tip" if tip_amount is 0, or "Tipped" if it is non-zero
# Reduce step: for each unique key ("Didn't Tip" or "Tipped"), sum the values
tip_distribution_rdd = (
    yellow_taxis_rdd.map(
        lambda row: ("Didn't Tip" if row["tip_amount"] == 0 else "Tipped", 1)
    )
    .reduceByKey(lambda x, y: x + y)
)

display(tip_distribution_rdd.collect())


In [0]:
# Create Temporary View of the dataset for an efficient way to leverage SQL queries
yellow_taxis_dec.createOrReplaceTempView("yellow_taxis")

# SQL query to calculate the average tip amount per PUZone
avg_tip_per_zone_with_window = spark.sql(
    """
    SELECT DISTINCT 
        PUZone,
        AVG(tip_amount) OVER (PARTITION BY PUZone) AS avg_tip
    FROM yellow_taxis
    ORDER BY avg_tip DESC
"""
)

display(avg_tip_per_zone_with_window)

**Using MapReduce:**

In [0]:
# Convert DataFrame to a Resilient Distributed Dataset
yellow_taxis_rdd = yellow_taxis_dec.rdd

# Map to (PUZone, (tip, 1)), reduce to (sum of tips, number of trips), then divide to get the average tip
avg_tip_per_zone_rdd = (
    yellow_taxis_rdd.map(lambda row: (row["PUZone"], (row["tip_amount"], 1)))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    .mapValues(lambda x: x[0] / x[1])
)

# Sort by average tip in descending order
sorted_avg_tip_per_zone_rdd = avg_tip_per_zone_rdd.sortBy(
    lambda x: x[1], ascending=False
)

display(sorted_avg_tip_per_zone_rdd.collect())

**There is a large tip in West Brighton, which is plausible.**
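The MapReduce average carries a (sum, count) pair per zone because averages cannot be merged directly: the reduce step has to combine partial results in any order, and only sums and counts add up correctly. A minimal plain-Python sketch of the same three steps, using made-up `(zone, tip)` records rather than the real RDD:

```python
from collections import defaultdict

# Hypothetical (zone, tip) records standing in for the RDD rows
records = [("A", 2.0), ("B", 0.0), ("A", 4.0), ("B", 3.0), ("A", 0.0)]

# Map: emit (zone, (tip, 1)) pairs
mapped = [(zone, (tip, 1)) for zone, tip in records]

# Reduce by key: add the tip sums and the trip counts component-wise
totals = defaultdict(lambda: (0.0, 0))
for zone, (tip, n) in mapped:
    acc_tip, acc_n = totals[zone]
    totals[zone] = (acc_tip + tip, acc_n + n)

# mapValues: divide the summed tips by the number of trips
avg_tip = {zone: tip_sum / n for zone, (tip_sum, n) in totals.items()}
```

It also shows why a low-traffic zone can top the ranking: with very few trips in the count, a single large tip dominates the average.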

In [0]:
# Create Temporary View of the dataset for an efficient way to leverage SQL queries
yellow_taxis_dec.createOrReplaceTempView("yellow_taxis")

# SQL query to calculate the average tip amount per DOZone
avg_tip_per_zone = spark.sql(
    """
    SELECT 
        DOZone, 
        AVG(tip_amount) AS avg_tip
    FROM yellow_taxis
    GROUP BY DOZone
    ORDER BY avg_tip DESC
"""
)

display(avg_tip_per_zone)

**Using MapReduce:**

In [0]:
# Convert DataFrame to a Resilient Distributed Dataset
yellow_taxis_rdd = yellow_taxis_dec.rdd

# Map to (DOZone, (tip, 1)), reduce to (sum of tips, number of trips), then divide to get the average tip
avg_tip_per_zone_rdd = (
    yellow_taxis_rdd.map(lambda row: (row["DOZone"], (row["tip_amount"], 1)))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    .mapValues(lambda x: x[0] / x[1])
)

# Sort by average tip in descending order
sorted_avg_tip_per_zone_rdd = avg_tip_per_zone_rdd.sortBy(
    lambda x: x[1], ascending=False
)

display(sorted_avg_tip_per_zone_rdd.collect())

In [0]:
# Create Temporary View of the dataset for an efficient way to leverage SQL queries
yellow_taxis_dec.createOrReplaceTempView("yellow_taxis")

# SQL query to get the distribution of trips by day
trip_distribution_by_day_with_window = spark.sql(
    """
    SELECT DISTINCT 
        DATE(tpep_pickup_datetime) AS trip_date, 
        COUNT(*) OVER (PARTITION BY DATE(tpep_pickup_datetime)) AS trip_count
    FROM yellow_taxis
    ORDER BY trip_count DESC
"""
)

display(trip_distribution_by_day_with_window)

In [0]:
# Create Temporary View of the dataset for an efficient way to leverage SQL queries
yellow_taxis_dec.createOrReplaceTempView("yellow_taxis")

# SQL query to get the distribution of trips by pick up hour
trip_distribution_by_hour = spark.sql(
    """
    SELECT 
        HOUR(tpep_pickup_datetime) AS pickup_hour,  
        COUNT(*) AS trip_count
    FROM yellow_taxis
    GROUP BY pickup_hour
    ORDER BY trip_count DESC
"""
)

display(trip_distribution_by_hour)

**From the visualizations we can conclude that:**
1. The large majority of users pay by credit card.
2. The large majority of users tip.
3. Suburban zones with older demographics and airports are the zones that tip the most on average.
4. There is always a small decrease in usage on Sundays and Mondays.
5. There is a clear decline in the week before the 25th, with Christmas Day having the lowest usage.
6. There is a clear rise from 5 AM to 6 PM (working hours).

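**Spark SQL is generally faster than the RDD-based MapReduce code: both process data in memory, but Spark SQL queries go through the Catalyst query optimizer, while the Python lambdas used in the RDD version are not optimized.**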

-----

## **Graph Analysis for the whole month of December**

In [0]:
zone_Vertices = taxi_zone_lookup.withColumnRenamed(
    "Zone", "id"
).distinct()  # GraphFrames expects the vertex identifier column to be named "id"
display(zone_Vertices)

In [0]:
tripEdges = yellow_taxis_dec.withColumnRenamed("PUZone", "src").withColumnRenamed(
    "DOZone", "dst"
)  # GraphFrames expects the edge endpoint columns to be named "src" and "dst"
display(tripEdges)

In [0]:
gtd = GraphFrame(zone_Vertices, tripEdges) # Create a graphframe for the whole month of december

**Importance of a zone accounting for the quantity and quality of connections pointing to it using PageRank**

Explaining PageRank for Taxi Trips: Imagine a passenger taking taxi rides in a city. At each location, they either follow one of the trips leaving that location or, with a small probability (`resetProbability`, 0.15 below), jump to a random location. The PageRank of a location represents the probability of finding the passenger there after many trips. Locations that receive many trips from important places, such as an airport, gain importance, but the airport's importance is split among all the places it connects to. PageRank helps identify key hotspots and transit points in the taxi trip network.
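To make the random-passenger picture concrete, here is a minimal sketch of textbook PageRank by power iteration in plain Python. It is not the GraphFrames implementation, which may scale the scores differently and stops based on `tol`. The sketch assumes every zone has at least one outgoing trip, and the zone names, the `pagerank` helper and its `iterations` parameter are hypothetical.

```python
def pagerank(edges, reset_probability=0.15, iterations=50):
    """Textbook PageRank by power iteration on a list of (src, dst) trips."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every zone receives the random-jump share ...
        new_rank = {n: reset_probability / len(nodes) for n in nodes}
        # ... plus a share of the rank of each zone that sends trips to it
        for src, dst in edges:
            new_rank[dst] += (1 - reset_probability) * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical toy network of three zones
edges = [
    ("Midtown", "Airport"),
    ("Uptown", "Airport"),
    ("Airport", "Midtown"),
    ("Midtown", "Uptown"),
    ("Uptown", "Midtown"),
]
ranks = pagerank(edges)
```

In this toy network Midtown ends up ranked above the Airport even though both receive two trips, because the Airport passes all of its importance to Midtown while Midtown and Uptown split theirs between two destinations.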

In [0]:
pageRanks = gtd.pageRank(
    resetProbability=0.15, tol=0.01
)  # Apply pagerank to the graphframe with a bound on the error in the computation

In [0]:
display(pageRanks.vertices.orderBy("pagerank", ascending=False))

**Identify clusters of areas that are strongly interconnected based on trip data**


Label Propagation for Taxi Trips: Imagine a city where each location (e.g., pickup or drop-off point) starts with a unique label. During each iteration, a location updates its label to match the most frequent label among its connected locations (other places linked to it by taxi trips). Over time, locations with frequent and shared trip patterns form communities, representing clusters of areas with strong connectivity. The process continues for a set number of iterations (10 in the cell below) or until labels stabilize, helping uncover regions with closely linked taxi traffic flows.
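The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the GraphFrames algorithm: it treats trips as undirected, weights each connection by a made-up trip count, updates zones one at a time in alphabetical order, and breaks ties by taking the alphabetically first label. The zone names `A1` to `B3` and the `label_propagation` helper are hypothetical.

```python
from collections import defaultdict


def label_propagation(weighted_edges, max_iter=10):
    """Asynchronous label propagation on an undirected weighted graph."""
    neighbours = defaultdict(dict)
    for u, v, w in weighted_edges:
        neighbours[u][v] = w
        neighbours[v][u] = w

    labels = {node: node for node in neighbours}  # every zone starts alone
    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbours):
            # Add up the trip weight behind each neighbouring label
            votes = defaultdict(float)
            for other, w in neighbours[node].items():
                votes[labels[other]] += w
            # Heaviest label wins; ties go to the alphabetically first label
            best = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # labels have stabilized
    return labels


# Two tightly linked groups of zones joined by one weak connection
edges = [
    ("A1", "A2", 10), ("A2", "A3", 10), ("A1", "A3", 10),
    ("B1", "B2", 10), ("B2", "B3", 10), ("B1", "B3", 10),
    ("A3", "B1", 1),
]
communities = label_propagation(edges)
```

The three `A` zones end up sharing one label and the three `B` zones another, because the single weak link between the groups is outvoted by the heavy traffic inside each group.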

In [0]:
label_propagation_results = gtd.labelPropagation(
    maxIter=10
)  # Apply label propagation to the graphframe

In [0]:
display(label_propagation_results)

**All zones are strongly interconnected with each other, with the exception of:**
- Rikers Island (correctional facility).
- Great Kills Park (wildlife park).
- Saint Michaels Cemetery/Woodside.

**Strongly Connected Zones:**

In [0]:
strong_components_td = gtd.stronglyConnectedComponents(maxIter=10)
display(strong_components_td.select("id", "component").orderBy("component"))
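A strongly connected component is a set of zones in which every zone can reach every other zone by following trips in their recorded direction. The plain-Python sketch below illustrates the definition by intersecting forward and backward reachability; it is a simple illustration rather than the algorithm GraphFrames runs, and the trips and helper names are hypothetical.

```python
from collections import defaultdict


def reachable(adjacency, start):
    """All vertices reachable from `start` by following edges."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def strongly_connected_components(edges):
    forward, backward = defaultdict(set), defaultdict(set)
    nodes = set()
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)
        nodes.update((src, dst))

    components, assigned = [], set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        # Zones that this zone can reach AND that can reach it back
        component = reachable(forward, node) & reachable(backward, node)
        components.append(component)
        assigned |= component
    return components


# Hypothetical trips: nobody takes a taxi out of "Rikers Island"
edges = [("Midtown", "SoHo"), ("SoHo", "Midtown"), ("Midtown", "Rikers Island")]
components = strongly_connected_components(edges)
```

A zone that only receives trips, or only sends them, ends up in a component of its own, which is how isolated zones show up in the result table.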

**Most common trip paths:**

In [0]:
tripEdgesSum = tripEdges.groupBy(
    tripEdges.src, tripEdges.dst
).count()  # Number of trips for each (src, dst) pair

tripEdgesSum = tripEdgesSum.withColumnRenamed(
    "count", "value"
)  # Renamed to "value" for use in visualizations such as a Sankey diagram

display(tripEdgesSum.orderBy("value", ascending=False).limit(50))

**Searching for structural patterns in a graph**

Ex. Trips from JFK Airport to Staten Island 

In [0]:
# Find all edges where there is a directed relationship from node 'a' to node 'b'
motiftd = gtd.find("(a)-[]->(b)")
filteredtd = motiftd.filter("a.id = 'JFK Airport' AND b.Borough = 'Staten Island'")
display(filteredtd)

**Shortest Paths:**

Ex. From every location in NYC to JFK Airport

In [0]:
# Define the landmark vertex and compute the shortest paths from every vertex to it
# (GraphFrames measures the distance from each vertex to the landmarks, following edge direction)
shortest_paths_td = gtd.shortestPaths(landmarks=["JFK Airport"])

# Display the vertices and their shortest-path distances to the landmark
display(shortest_paths_td.select("id", "distances").orderBy("id"))

* {} means that there is no path from the location to JFK Airport
* {1} means that the location has a direct trip to JFK Airport
* {2} means that the location is 2 hops away from JFK Airport
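The GraphFrames documentation describes `shortestPaths` as computing the distance from each vertex to the given landmarks, taking edge direction into account. A minimal plain-Python sketch of that hop count uses a breadth-first search over reversed edges; the trips and the `hops_to_landmark` helper are hypothetical.

```python
from collections import defaultdict, deque


def hops_to_landmark(edges, landmark):
    """Directed hop count from every vertex to `landmark` (BFS on reversed edges)."""
    incoming = defaultdict(set)
    for src, dst in edges:
        incoming[dst].add(src)

    distances = {landmark: 0}
    queue = deque([landmark])
    while queue:
        current = queue.popleft()
        for previous in incoming[current]:
            if previous not in distances:
                distances[previous] = distances[current] + 1
                queue.append(previous)
    # Vertices missing from the result cannot reach the landmark
    return distances


# Hypothetical toy trips
edges = [
    ("Midtown", "JFK Airport"),
    ("Harlem", "Midtown"),
    ("JFK Airport", "Rikers Island"),
]
distances = hops_to_landmark(edges, "JFK Airport")
```

Here `Rikers Island` receives a trip from the airport but has no trip leading back, so it does not appear in the result, which corresponds to an empty `{}` entry in the table.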

**Identify Popular/Important Locations using in-degree and out-degree:**

In [0]:
gtdOut = gtd.outDegrees  # Number of outgoing edges for each vertex in the graph
gtdIn = gtd.inDegrees  # Number of ingoing edges for each vertex in the graph

gtdJoin = gtdIn.join(gtdOut, ["id"])

# Calculate the ratio of in-degree to out-degree for each vertex (zone)
gtdDeg = gtdJoin.withColumn("Ratio", col("inDegree") / col("outDegree"))

In [0]:
display(gtdIn.orderBy("inDegree", ascending=False))

In [0]:
display(gtdOut.orderBy("outDegree", ascending=False))

In [0]:
display(gtdDeg.orderBy("Ratio", ascending=False))
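As a reading aid for the three tables, the sketch below reproduces the in-degree, out-degree and ratio computation in plain Python on a handful of made-up trips. A ratio above 1 marks a zone that receives more trips than it sends (a net destination), and a ratio below 1 marks a net origin. The zone names are hypothetical.

```python
from collections import Counter

# Hypothetical (pickup, dropoff) trips
trips = [
    ("JFK", "Midtown"),
    ("JFK", "Midtown"),
    ("JFK", "SoHo"),
    ("Midtown", "SoHo"),
    ("SoHo", "JFK"),
]

out_degree = Counter(src for src, _ in trips)  # trips leaving each zone
in_degree = Counter(dst for _, dst in trips)   # trips arriving at each zone

# Keep only zones that have both degrees, as the inner join on "id" does
ratio = {
    zone: in_degree[zone] / out_degree[zone]
    for zone in in_degree
    if zone in out_degree
}
```

In this toy data `JFK` has a ratio of one third, which is the pattern expected from a transportation hub where many more trips start than end.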

**Trips between Boroughs:**

In [0]:
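tripEdges_borough = yellow_taxis_dec.withColumnRenamed(
    "PUBorough", "src"
).withColumnRenamed(
    "DOBorough", "dst"
)  # Use the pickup borough as "src" and the dropoff borough as "dst"

# Vertices must be boroughs here so that their ids match the src/dst values of the edges
borough_Vertices = taxi_zone_lookup.select(col("Borough").alias("id")).distinct()

gtd_boroughs = GraphFrame(
    borough_Vertices, tripEdges_borough
)  # Create a GraphFrame of borough-to-borough trips for the whole month of December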

In [0]:
# Group by 'src' and 'dst' and count the number of trips between each source and destination with the alias of weight
tripEdges_weighted = tripEdges_borough.groupBy("src", "dst").agg(
    count("*").alias("weight")
)

# Calculate the number of trips leaving and entering each borough
out_degrees = tripEdges_borough.groupBy("src").agg(count("dst").alias("outDegree"))
in_degrees = tripEdges_borough.groupBy("dst").agg(count("src").alias("inDegree"))

# Combine in-degree and out-degree; coalesce keeps boroughs that appear only as src or only as dst
degrees = (
    in_degrees.join(out_degrees, in_degrees.dst == out_degrees.src, "full_outer")
    .withColumn("id", F.coalesce(col("dst"), col("src")))
    .fillna(0)
    .withColumn("totalDegree", col("inDegree") + col("outDegree"))
)

# Create a GraphFrame with one vertex per borough, so vertex ids match the src/dst values
borough_Vertices = taxi_zone_lookup.select(col("Borough").alias("id")).distinct()
graph = GraphFrame(borough_Vertices, tripEdges_weighted)

# Convert vertices and edges to Pandas DataFrames
nodes_df = degrees.select("id", "totalDegree").toPandas()
edges_df = tripEdges_weighted.toPandas()

# Create a NetworkX graph
G = nx.DiGraph()

# Filter out any rows where 'id' is None or NaN
nodes_df = nodes_df.dropna(subset=["id"])

# Add nodes with their total degree
for _, row in nodes_df.iterrows():
    # Ensure the ID is not None
    if row["id"] is not None:
        G.add_node(row["id"], label=f'{row["id"]}\n{int(row["totalDegree"])}')

# Add edges
for _, row in edges_df.iterrows():
    # Ensure the nodes exist before adding the edge
    if row["src"] in G and row["dst"] in G:
        G.add_edge(row["src"], row["dst"])

nodes_df["totalDegree"] = nodes_df["totalDegree"].astype(
    float
)  # Convert to float for proportional sizing

# Normalize node sizes to a reasonable range
min_size = 600  # Minimum size for nodes
max_size = 6000  # Maximum size for nodes

# Calculate normalized sizes
degrees = nodes_df["totalDegree"]
normalized_sizes = min_size + (
    (degrees - degrees.min()) / (degrees.max() - degrees.min())
) * (
    max_size - min_size
)  # Scale 'degrees' to fit within the range [min_size, max_size]
node_sizes = dict(
    zip(nodes_df["id"], normalized_sizes)
)  # Create a mapping of node IDs to sizes

# Visualize the graph
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(G, k=0.9, iterations=200)  # Force-directed layout

# Draw nodes with normalized sizes
nx.draw_networkx_nodes(
    G,
    pos,
    node_size=[
        node_sizes[node] if node in node_sizes else min_size for node in G.nodes
    ],
    node_color="lightblue",
    alpha=0.9,
)

# Draw edges
nx.draw_networkx_edges(
    G, pos, arrowstyle="->", arrowsize=10, edge_color="gray", alpha=0.7
)

# Add labels (ID and degree inside the circle)
labels = nx.get_node_attributes(G, "label")
nx.draw_networkx_labels(
    G, pos, labels, font_size=10, font_color="black", font_weight="bold"
)

plt.title("Taxi Zone Connections: Boroughs and Total Connections")
plt.axis("off")
plt.show()

*EWR = Newark Airport

**From the GraphFrame of the whole month we can conclude that:**
1. Most taxi users are high-income individuals who travel between high-cost-of-living areas and shopping areas.
2. High-cost-of-living areas and shopping zones are the most important areas (PageRank).
3. All zones are strongly interconnected, with the exception of a few very specific zones (Label Propagation).
4. High-cost-of-living areas and tourist areas have the highest centrality (in-degree).
5. Transportation hubs appear in the top 10 (out-degree).
6. There are no direct trips between Staten Island and Newark Airport (probably due to the use of private transportation).
7. There are no direct trips between the Bronx and Newark Airport (due to the Bronx's proximity to LaGuardia Airport).
8. There are no direct trips between Staten Island and the Bronx (probably due to the long distance).

-----

## Graph Analysis for Christmas Eve (the 25th has low usage)


In [0]:
# Filter rows where the pickup date is Christmas Eve (2023-12-24)
christmas_yellow_taxis = yellow_taxis_dec.filter(
    to_date(col("tpep_pickup_datetime")) == "2023-12-24"
)

# Write the filtered Christmas Eve data to a Parquet file named 'christmas_day_rows.parquet'
christmas_yellow_taxis.write.mode("overwrite").parquet("christmas_day_rows.parquet")

display(christmas_yellow_taxis)

In [0]:
tripEdges_christmas = christmas_yellow_taxis.withColumnRenamed(
    "PUZone", "src"
).withColumnRenamed(
    "DOZone", "dst"
)  # GraphFrames expects the edge endpoint columns to be named "src" and "dst"
display(tripEdges_christmas)

**GraphFrame of Christmas Eve using a subgraph:**

In [0]:
gtd_christmas = gtd.filterEdges(
    to_date(col("tpep_pickup_datetime")) == "2023-12-24"
)  # Filter the original graphframe to just have rows of 24th

**Importance of a zone accounting for the quantity and quality of connections pointing to it using PageRank:**

In [0]:
pageRanks_christmas = gtd_christmas.pageRank(
    resetProbability=0.15, tol=0.01
)  # Apply pagerank to the graphframe with a bound on the error in the computation

In [0]:
display(pageRanks_christmas.vertices.orderBy("pagerank", ascending=False))

**Identify clusters of areas that are strongly interconnected based on trip data:**

In [0]:
label_propagation_results_christmas = gtd_christmas.labelPropagation(
    maxIter=10
)  # Apply label propagation to the graphframe

In [0]:
display(label_propagation_results_christmas)

**Most common trip paths:**

In [0]:
tripEdgesSumChristmas = tripEdges_christmas.groupBy(
    tripEdges_christmas.src, tripEdges_christmas.dst
).count()  # DataFrame for trips with the total number of that specific trip
tripEdgesSumChristmas = tripEdgesSumChristmas.withColumnRenamed(
    "count", "value"
)  # When using it for visualizations like a Sankey diagram
display(tripEdgesSumChristmas.orderBy("value", ascending=False).limit(50))

**Identify Popular/Important Locations using in-degree and out-degree:**

In [0]:
gtdOut_christmas = (
    gtd_christmas.outDegrees
)  # Number of outgoing edges for each vertex in the graph
gtdIn_christmas = (
    gtd_christmas.inDegrees
)  # Number of ingoing edges for each vertex in the graph

gtdJoin_christmas = gtdIn_christmas.join(gtdOut_christmas, ["id"])

# Calculate the ratio of in-degree to out-degree for each vertex (zone)
gtdDeg_christmas = gtdJoin_christmas.withColumn(
    "Ratio", col("inDegree") / col("outDegree")
)

In [0]:
display(gtdIn_christmas.orderBy("inDegree", ascending=False))

In [0]:
display(gtdOut_christmas.orderBy("outDegree", ascending=False))

In [0]:
display(gtdDeg_christmas.orderBy("Ratio", ascending=False))

**From the GraphFrame of the 24th of December we can conclude that:**
1. Newark Airport appears in the top 5 because of the high volume of holiday travel, as airlines tend to schedule more flights around Christmas (PageRank).
2. Zones are strongly interconnected, with the exception of Staten Island, which is the borough with the most unique label values.
3. High-income individuals travel between high-cost-of-living areas.
4. JFK has a strong trip path to Times Sq/Theatre District (tourist and hotel areas).
5. Midtown Center and Times Square appear in 3rd and 4th place, probably because of late Christmas shopping and Rockefeller Center (in-degree).
6. JFK appears in the top 4, reflecting arrivals for Christmas celebrations (out-degree).

-----

## Graph Analysis for New Year's Eve

In [0]:
# Filter rows where the pickup date is New Year's Eve (2023-12-31)
newyear_yellow_taxis = yellow_taxis_dec.filter(
    to_date(col("tpep_pickup_datetime")) == "2023-12-31"
)

# Write the filtered data to its own Parquet file so the Christmas Eve file is not overwritten
newyear_yellow_taxis.write.mode("overwrite").parquet("new_years_eve_rows.parquet")

display(newyear_yellow_taxis)

In [0]:
tripEdges_newyear = newyear_yellow_taxis.withColumnRenamed(
    "PUZone", "src"
).withColumnRenamed(
    "DOZone", "dst"
)  # GraphFrames expects the edge endpoint columns to be named "src" and "dst"
display(tripEdges_newyear)

**GraphFrame of New Year's Eve using a subgraph:**

In [0]:
gtd_newyear = gtd.filterEdges(
    to_date(col("tpep_pickup_datetime")) == "2023-12-31"
)  # Filter the original graphframe to just have rows of 31st

**Importance of a zone accounting for the quantity and quality of connections pointing to it using PageRank:**

In [0]:
pageRanks_newyear = gtd_newyear.pageRank(
    resetProbability=0.15, tol=0.01
)  # Apply pagerank to the graphframe with a bound on the error in the computation

In [0]:
display(pageRanks_newyear.vertices.orderBy("pagerank", ascending=False))

**Identify clusters of areas that are strongly interconnected based on trip data:**

In [0]:
label_propagation_results_newyear = gtd_newyear.labelPropagation(
    maxIter=10
)  # Apply label propagation to the graphframe

In [0]:
display(label_propagation_results_newyear)

**Most common trip paths:**

In [0]:
tripEdgesSumnewyear = tripEdges_newyear.groupBy(
    tripEdges_newyear.src, tripEdges_newyear.dst
).count()  # DataFrame for trips with the total number of that specific trip
tripEdgesSumnewyear = tripEdgesSumnewyear.withColumnRenamed(
    "count", "value"
)  # When using it for visualizations like a Sankey diagram
display(tripEdgesSumnewyear.orderBy("value", ascending=False).limit(50))

**Identify Popular/Important Locations using in-degree and out-degree:**

In [0]:
gtdOut_newyear = (
    gtd_newyear.outDegrees
)  # Number of outgoing edges for each vertex in the graph
gtdIn_newyear = (
    gtd_newyear.inDegrees
)  # Number of ingoing edges for each vertex in the graph

gtdJoin_newyear = gtdIn_newyear.join(gtdOut_newyear, ["id"])
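# Use the New Year's Eve join (not the whole-month gtdJoin) to compute the in/out-degree ratio
gtdDeg_newyear = gtdJoin_newyear.withColumn(
    "Ratio", col("inDegree") / col("outDegree")
)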

In [0]:
display(gtdIn_newyear.orderBy("inDegree", ascending=False))

In [0]:
display(gtdOut_newyear.orderBy("outDegree", ascending=False))

In [0]:
display(gtdDeg_newyear.orderBy("Ratio", ascending=False))

**From the GraphFrame analysis of New Year's Eve we can conclude that:**
1. Midtown appears first, probably because of its proximity to Times Square for the ball drop (PageRank).
2. The top 10 trip paths only contain connections between high-cost-of-living zones (private parties).
3. JFK appears first, LaGuardia 8th and Penn Station 9th, due to arrivals for New Year's celebrations (out-degree).
4. Zones are strongly interconnected, with the exception of Staten Island, which is the borough with the most unique label values.

-----

**General Insights:**
- The Upper East Side is clearly the area that needs the most taxis.
- Tourist areas also need more taxis.
- On holidays, tourist areas and transportation hubs need reinforced taxi numbers.
- We can identify zones with different characteristics from the taxi trips alone.
- We can identify the trip paths of high-income individuals.
- Staten Island differs from the other boroughs because of its family-oriented residential demographics.


## Conclusion
PySpark is an effective tool for analyzing large-scale taxi trip data due to its ability to process vast amounts of information efficiently. When combined with GraphFrames, it provides valuable insights into patterns and connections within the trip data.

This approach facilitates the identification of high-traffic zones, analysis of airport connectivity, important zones and detection of underserved areas. Together, PySpark and GraphFrames offer a robust framework for analyzing millions of trips, making them highly useful for urban mobility studies and transportation planning.

## References


https://graphframes.github.io/graphframes/docs/_site/user-guide.html#motif-finding

https://medium.com/tomtalkspython/exploring-graphframes-in-pyspark-f948c9a39844