# NYC Yellow Taxi Trips

The New York City Taxi & Limousine Commission provides data on trips taken via yellow, green, and for-hire vehicles such as Uber and Lyft. This project focuses on the Yellow Taxi industry. The data is stored in parquet files and will be read into pandas for data manipulation. The goal of this project is to analyze how the industry's network of trips (edges) flows throughout the city based on pickup and dropoff locations (nodes). Utilizing centrality measures such as degree and eigenvector centrality, we aim to identify potential differences in trip patterns based on payment methods, specifically cash versus credit card.

Our hypothesis is that the most important nodes will differ depending on the payment method. For credit card transactions, longer-distance trips, such as those between the city and airports, will likely highlight airport nodes as the most important. For cash transactions, intra-city trips, such as those between Times Square and Wall Street, will be more significant, as these trips tend to be shorter, making cash payments more convenient.

It is important to note that the records used are **only for January 2024**, due to the size of each data file. A data dictionary describing each column can be found [here](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).

## Import Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import random

## Import Trips Data

We can see we have a lot of data that needs to be pared down for this network analysis. The most important features will be pickup location (**PULocationID**), dropoff location (**DOLocationID**), and payment type (**payment_type**).

In [None]:
trips_df = pd.read_parquet('yellow_tripdata_2024-01.parquet')

In [None]:
trips_df.head()

## Import Location Data
We also have a lookup table that maps each LocationID to the borough and zone it represents.

In [None]:
location_df = pd.read_csv('taxi_zone_lookup.csv')
location_df = location_df.drop(columns='service_zone')
location_df.head()

## Data Exploration

### Missing Values

We can see that some columns have missing values, such as passenger_count, RatecodeID, and Airport_fee. Fortunately, these columns are not needed for this analysis.

In [None]:
trips_df.isna().sum()

### Total Trips

We have a total of almost 3 million trips. We can also see that pickup location **node 132** has the highest number of pickups.

In [None]:
trips_df['PULocationID'].count()

In [None]:
top_pickup_trips = trips_df.groupby('PULocationID')['DOLocationID'].agg(total_trips = 'count').reset_index(drop=False).sort_values(by='total_trips', ascending=False)
top_pickup_trips

To make the top 10 pickup locations easier to interpret, we will join them with our location data.

Not surprisingly, JFK Airport has the highest number of pickups. However, NYC's other airport, LaGuardia Airport, falls to 8th place, with Manhattan's Midtown Center coming in 2nd instead. One plausible reason is how central a hub Midtown is to all of NYC, but it raises the question of whether Midtown is truly as important in the network.

In [None]:
merged_df = pd.merge(top_pickup_trips, location_df, left_on='PULocationID', right_on='LocationID')
merged_df.head(10)

Looking at the dropoff locations, we find that the Upper East Side North/South have the highest number of dropoffs in our network. This again raises the question as to which node is the most important in our network. 

In [None]:
# Count trips per dropoff location (named separately so the pickup table is not overwritten)
top_dropoff_trips = trips_df.groupby('DOLocationID')['PULocationID'].agg(total_trips = 'count').reset_index(drop=False).sort_values(by='total_trips', ascending=False)
merged_df = pd.merge(top_dropoff_trips, location_df, left_on='DOLocationID', right_on='LocationID')
merged_df.head(10)

### Payment Methods

We discover that the most common payment method is 1 (**Credit Card**), with over 2.3 million transactions, followed by 2 (**Cash**), with under 0.5 million transactions. We also see other categories, such as 3 (**No Charge**) and 4 (**Dispute**). Because our analysis compares only credit card and cash trips, trips that did not end with one of these two payment types will be removed.

In [None]:
payment_trips = trips_df.groupby('payment_type')['payment_type'].agg(total = 'count').reset_index(drop=False).sort_values(by='total', ascending=False)
payment_trips

## Data Wrangling

We will remove trips that did not end with a credit card or cash transaction. Columns not needed in this analysis will also be dropped.

In [None]:
drop_columns = ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 
                'RatecodeID', 'store_and_fwd_flag', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
               'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'Airport_fee']
drop_trips_df = trips_df[trips_df['payment_type'].isin([1, 2])].drop(columns=drop_columns)
drop_trips_df.head()

Next, we will replace the numeric payment type with its corresponding string value.

In [None]:
drop_trips_df['payment_type'] = drop_trips_df['payment_type'].apply(lambda x: 'Credit Card' if x==1 else 'Cash')
drop_trips_df.head()

Let's add a column that counts trips for each combination of pickup location, dropoff location, and payment type. This will be important later on, as we will use it as the edge weight in our network.

In [None]:
drop_trips_df['trip_count'] = 1
cleaned_trips_df = drop_trips_df.groupby(['PULocationID', 'DOLocationID', 'payment_type'])['trip_count'].agg('count').reset_index()
cleaned_trips_df.head()
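
As a small illustration of what this aggregation produces, the sketch below runs the same kind of groupby on a made-up five-trip frame. The zone IDs and payment values are invented for illustration, and `size()` is used here as an equivalent way to count rows without a helper column; it is not the approach used in the cell above.

```python
import pandas as pd

# Toy trips (invented values, not taken from the TLC data)
toy_trips = pd.DataFrame({
    'PULocationID': [1, 1, 1, 2, 2],
    'DOLocationID': [2, 2, 2, 1, 1],
    'payment_type': ['Cash', 'Cash', 'Credit Card', 'Cash', 'Credit Card'],
})

# One row per (pickup, dropoff, payment type), with the number of trips
# in that group stored as trip_count (the future edge weight)
toy_edges = (
    toy_trips
    .groupby(['PULocationID', 'DOLocationID', 'payment_type'])
    .size()
    .reset_index(name='trip_count')
)
print(toy_edges)
```

Each resulting row is one directed edge for one payment type, so filtering on `payment_type` afterwards yields a ready-made weighted edge list.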

## Yellow Taxi Trips (Credit Card)

We will begin our first network graph with credit card transactions. We want to identify the network's most important nodes using both degree and eigenvector centrality. The graph will be directed and weighted, with each edge pointing from **PULocationID** to **DOLocationID** and **trip_count** as its weight.

In [None]:
cc_trips = cleaned_trips_df[cleaned_trips_df['payment_type'] == 'Credit Card']
cc_trips.head()

Most of our nodes cluster in the center of the layout, with a few outside it. These are probably outliers, as few trips occur between them and other zones.

In [None]:
G_credit = nx.from_pandas_edgelist(cc_trips, source='PULocationID', target='DOLocationID', 
                            edge_attr='trip_count', create_using=nx.DiGraph())

pos = nx.spring_layout(G_credit, seed=123)  

fig, ax = plt.subplots(figsize=(30, 18))

nx.draw_networkx_nodes(G_credit, pos, node_size=300, node_color="skyblue")
nx.draw_networkx_edges(G_credit, pos, width=0.15)
nx.draw_networkx_labels(G_credit, pos, font_size=8, font_weight="bold")

plt.title("NYC Credit Card Taxi Trips Network (Directed, Weighted)")
plt.show()

### Degree Centrality

Looking at the node with the highest degree centrality, we find that **node 140** is the most important node by this measure and it had 402 edges connected to it. 

In [None]:
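# Compute degree centrality once and pick the node with the highest score
degree_centrality = nx.degree_centrality(G_credit)
top_degree_centrality = max(degree_centrality, key=degree_centrality.get)
print(f"Highest Degree Centrality: {top_degree_centrality}")
# Report the degree of the top node itself rather than a hard-coded node ID
print("Number of Edges:", G_credit.degree(top_degree_centrality))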
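
To make the degree centrality number easier to interpret, here is a minimal sketch on a toy directed graph (the node names and weights are invented). It checks that NetworkX's `degree_centrality` on a `DiGraph` is the sum of in-degree and out-degree divided by `n - 1`, and that edge weights play no role in it.

```python
import networkx as nx

# Toy directed graph with invented zones and trip counts
G_toy = nx.DiGraph()
G_toy.add_weighted_edges_from(
    [('A', 'B', 50), ('B', 'A', 40), ('A', 'C', 1), ('C', 'A', 2), ('B', 'C', 3)],
    weight='trip_count',
)

n = G_toy.number_of_nodes()
dc = nx.degree_centrality(G_toy)

# Manual version: (in-degree + out-degree) / (n - 1); weights are ignored
manual = {node: G_toy.degree(node) / (n - 1) for node in G_toy}

# The directed variants split the same score into its two parts
in_dc = nx.in_degree_centrality(G_toy)
out_dc = nx.out_degree_centrality(G_toy)

for node in G_toy:
    print(node, dc[node], manual[node], in_dc[node] + out_dc[node])
```

Because weights are ignored, this measure rewards zones connected to many distinct other zones, not zones with many trips. A score above 1 is possible in a directed graph, since a node can have both an incoming and an outgoing edge with every other node.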

We can see that **node 140** lies within the central cluster of nodes in our graph.

In [None]:
pos = nx.spring_layout(G_credit, seed=123)  

fig, ax = plt.subplots(figsize=(30, 18))

nx.draw(G_credit, pos, with_labels=True, node_color="lightgray", edge_color="gray", node_size=300, font_size=10)

nx.draw_networkx_nodes(G_credit, pos, nodelist=[top_degree_centrality], node_color="red", node_size=600)

plt.title(f"Top Node (Degree Centrality): {top_degree_centrality}")
plt.axis('off')
plt.show()

In [None]:
top_cc_trips = cc_trips.copy()
top_cc_trips = top_cc_trips[(top_cc_trips['PULocationID']==140) | (top_cc_trips['DOLocationID']==140)]

As for the top 10 trip counts involving **node 140**, this zone is called Lenox Hill East, and most of its rides go to or from the Upper East Side North/South and Yorkville East/West zones.

In [None]:
top_10 = top_cc_trips.sort_values(by='trip_count', ascending=False).head(10)
top_10 = top_10.merge(location_df, left_on='PULocationID', right_on='LocationID', 
             how='left').rename(columns={'Borough':'PU_Borough', 'Zone':'PU_Zone'}).drop(columns=['LocationID'])
top_10.merge(location_df, left_on='DOLocationID', right_on='LocationID', 
             how='left').rename(columns={'Borough':'DO_Borough', 'Zone':'DO_Zone'}).drop(columns=['LocationID'])

### Eigenvector Centrality

Looking at the node with the highest eigenvector centrality, we find that **node 236** is the most important node by this measure, and it has 339 edges connected to it. However, **node 237** follows closely in eigenvector score, before a steep drop-off to the next node.

In [None]:
sorted_eigenvector = sorted(nx.eigenvector_centrality(G_credit, weight='trip_count', max_iter=1000).items(), key=lambda x: x[1], reverse=True)

for node, centrality in sorted_eigenvector[:5]:
    print(f"Node {node}: {centrality}")
    print("Number of Edges:", G_credit.degree(node))  # use the credit card graph (G was undefined)
    print("------")
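
One detail worth keeping in mind when reading these scores: for directed graphs, NetworkX's `eigenvector_centrality` is based on in-edges, so a zone scores highly when it receives many trips from other high-scoring zones. The sketch below uses a toy graph with invented names and weights to show the effect; reversing the graph gives the out-edge version.

```python
import networkx as nx

# Toy network (invented): the airport sends many trips out but receives few
G_toy = nx.DiGraph()
G_toy.add_weighted_edges_from(
    [
        ('airport', 'uptown', 20), ('airport', 'midtown', 20),
        ('uptown', 'midtown', 10), ('midtown', 'uptown', 10),
        ('uptown', 'airport', 1), ('midtown', 'airport', 1),
    ],
    weight='trip_count',
)

# In-edge based scores: importance comes from where trips end
ec_in = nx.eigenvector_centrality(G_toy, weight='trip_count', max_iter=1000)

# Out-edge based scores: reverse the graph first
ec_out = nx.eigenvector_centrality(G_toy.reverse(), weight='trip_count', max_iter=1000)

print("in-edge based: ", ec_in)
print("out-edge based:", ec_out)
```

In this toy example the airport ranks last on the in-edge scores and first on the out-edge scores, which suggests one possible reason a pickup-heavy zone can look less important under this measure than its raw trip volume implies.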

Looking deeper into our top node, we see that **node 236** is the Upper East Side North. **Node 237** is the Upper East Side South, and these two areas combined account for a majority of the trips happening in Manhattan.

In [None]:
top_cc_trips = cc_trips.copy()
top_cc_trips = top_cc_trips[(top_cc_trips['PULocationID']==237) | (top_cc_trips['DOLocationID']==237)]

top_10 = top_cc_trips.sort_values(by='trip_count', ascending=False).head(10)
top_10 = top_10.merge(location_df, left_on='PULocationID', right_on='LocationID', 
             how='left').rename(columns={'Borough':'PU_Borough', 'Zone':'PU_Zone'}).drop(columns=['LocationID'])
top_10.merge(location_df, left_on='DOLocationID', right_on='LocationID', 
             how='left').rename(columns={'Borough':'DO_Borough', 'Zone':'DO_Zone'}).drop(columns=['LocationID'])

### Total Trip Counts by Location

Let's look at total trip counts regardless of direction, so a location counts whether it was the pickup or the dropoff. When we do, we immediately see that **nodes 236 and 237** have the highest total trips within our credit card network. This is a useful finding: in our exploratory analysis we looked at one direction at a time, which suggested that either JFK Airport (pickups) or the Upper East Side (dropoffs) was the busiest location. We can now see that it is the Upper East Side zones.

Note also that our top 4 zones here match the eigenvector score rankings. However, having the highest number of trips does not by itself guarantee a high rank: 5th place here is **node 162** (Midtown East), while the 5th-highest eigenvector score belongs to **node 239** (Upper West Side South).

In [None]:
locations_1 = cc_trips[['PULocationID', 'trip_count']].rename(columns={'PULocationID':'location_id'})
locations_2 = cc_trips[['DOLocationID', 'trip_count']].rename(columns={'DOLocationID':'location_id'})
pd.concat([locations_1, locations_2]).groupby('location_id').sum().reset_index().sort_values(by='trip_count', ascending=False).head(5)
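
The total computed above is the same quantity as a node's weighted degree in the graph. A minimal sketch on an invented four-edge table, assuming the same column names as our edge list, compares the pandas route with `G.degree(weight=...)`:

```python
import pandas as pd
import networkx as nx

# Toy edge list (invented values); the last row is a trip within one zone
toy_edges = pd.DataFrame({
    'PULocationID': [1, 2, 1, 3],
    'DOLocationID': [2, 1, 3, 3],
    'trip_count':   [5, 4, 2, 1],
})

# pandas route: stack pickups and dropoffs, then sum per location
pu = toy_edges[['PULocationID', 'trip_count']].rename(columns={'PULocationID': 'location_id'})
do = toy_edges[['DOLocationID', 'trip_count']].rename(columns={'DOLocationID': 'location_id'})
totals = pd.concat([pu, do]).groupby('location_id')['trip_count'].sum()

# graph route: weighted degree = weighted in-degree + weighted out-degree
G_toy = nx.from_pandas_edgelist(
    toy_edges, source='PULocationID', target='DOLocationID',
    edge_attr='trip_count', create_using=nx.DiGraph(),
)
strength = {int(node): int(deg) for node, deg in G_toy.degree(weight='trip_count')}

print(totals.to_dict())
print(strength)
```

Both routes count a same-zone trip twice, once as a pickup and once as a dropoff, so they agree even when the edge list contains self-loops.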

## Yellow Taxi Trips (Cash)

Our second network graph looks at cash transactions. We again want to identify the network's most important nodes using both degree and eigenvector centrality. The graph will be directed and weighted, with each edge pointing from **PULocationID** to **DOLocationID** and **trip_count** as its weight.

In [None]:
cash_trips = cleaned_trips_df[cleaned_trips_df['payment_type'] == 'Cash']
cash_trips.head()

The shape of our network is similar to that of the credit card network: a majority of our nodes are centralized, with some nodes on the outskirts of the plot, corresponding to locations traveled to less often.

In [None]:
G_cash = nx.from_pandas_edgelist(cash_trips, source='PULocationID', target='DOLocationID', 
                            edge_attr='trip_count', create_using=nx.DiGraph())

pos = nx.spring_layout(G_cash, seed=123)  

fig, ax = plt.subplots(figsize=(30, 18))

nx.draw_networkx_nodes(G_cash, pos, node_size=300, node_color="skyblue")
nx.draw_networkx_edges(G_cash, pos, width=0.15)
nx.draw_networkx_labels(G_cash, pos, font_size=8, font_weight="bold")

plt.title("NYC Cash Taxi Trips Network (Directed, Weighted)")
plt.show()

### Degree Centrality

Looking at the node with the highest degree centrality, we find that **node 132** is the most important node by this measure and it had 380 edges connected to it. 

In [None]:
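# Compute degree centrality on the cash graph (G was undefined) and pick the top node
degree_centrality = nx.degree_centrality(G_cash)
top_degree_centrality = max(degree_centrality, key=degree_centrality.get)
print(f"Highest Degree Centrality: {top_degree_centrality}")
# Report the degree of the top node itself rather than a hard-coded node ID
print("Number of Edges:", G_cash.degree(top_degree_centrality))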

Highlighting **node 132**, we see it near the center of our graph.

In [None]:
pos = nx.spring_layout(G_cash, seed=123)  

fig, ax = plt.subplots(figsize=(30, 18))

nx.draw(G_cash, pos, with_labels=True, node_color="lightgray", edge_color="gray", node_size=300, font_size=10)

nx.draw_networkx_nodes(G_cash, pos, nodelist=[top_degree_centrality], node_color="red", node_size=600)

plt.title(f"Top Node (Degree Centrality): {top_degree_centrality}")
plt.axis('off')
plt.show()

In [None]:
top_cash_trips = cash_trips.copy()
top_cash_trips = top_cash_trips[(top_cash_trips['PULocationID']==132) | (top_cash_trips['DOLocationID']==132)]

Interestingly enough, **node 132** is JFK Airport, and most of its trips are pickups from the airport.

In [None]:
top_10 = top_cash_trips.sort_values(by='trip_count', ascending=False).head(10)
top_10 = top_10.merge(location_df, left_on='PULocationID', right_on='LocationID', 
             how='left').rename(columns={'Borough':'PU_Borough', 'Zone':'PU_Zone'}).drop(columns=['LocationID'])
top_10.merge(location_df, left_on='DOLocationID', right_on='LocationID', 
             how='left').rename(columns={'Borough':'DO_Borough', 'Zone':'DO_Zone'}).drop(columns=['LocationID'])

### Eigenvector Centrality

Looking at the node with the highest eigenvector centrality, we find that **node 237** is the most important node by this measure, and it has 245 edges connected to it. Once again, **node 236** is close in score. These are the same two top nodes as in our credit card network, with their order swapped.

In [None]:
sorted_eigenvector = sorted(nx.eigenvector_centrality(G_cash, weight='trip_count', max_iter=1000).items(), key=lambda x: x[1], reverse=True)

for node, centrality in sorted_eigenvector[:5]:
    print(f"Node {node}: {centrality}")
    print("Number of Edges:", G_cash.degree(node))  # use the cash graph (G was undefined)
    print("------")

In [None]:
top_cash_trips = cash_trips.copy()
top_cash_trips = top_cash_trips[(top_cash_trips['PULocationID']==237) | (top_cash_trips['DOLocationID']==237)]

top_10 = top_cash_trips.sort_values(by='trip_count', ascending=False).head(10)
top_10 = top_10.merge(location_df, left_on='PULocationID', right_on='LocationID', 
             how='left').rename(columns={'Borough':'PU_Borough', 'Zone':'PU_Zone'}).drop(columns=['LocationID'])
top_10.merge(location_df, left_on='DOLocationID', right_on='LocationID', 
             how='left').rename(columns={'Borough':'DO_Borough', 'Zone':'DO_Zone'}).drop(columns=['LocationID'])

### Total Trip Counts by Location

As we look at total trips in either direction for each node, we see that JFK Airport clearly has the most trips overall, which is consistent with our degree centrality result.

In [None]:
locations_1 = cash_trips[['PULocationID', 'trip_count']].rename(columns={'PULocationID':'location_id'})
locations_2 = cash_trips[['DOLocationID', 'trip_count']].rename(columns={'DOLocationID':'location_id'})
pd.concat([locations_1, locations_2]).groupby('location_id').sum().reset_index().sort_values(by='trip_count', ascending=False).head(5).merge(location_df, left_on='location_id', right_on='LocationID').drop(columns=['LocationID'])
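
One possible way to put the cash and credit card rankings side by side is to compare their top-k node sets. The sketch below is only an illustration: `top_k_nodes` and `jaccard` are hypothetical helpers, and the scores are invented placeholders rather than results from our networks.

```python
def top_k_nodes(scores, k=5):
    """Return the k highest-scoring node IDs, best first (hypothetical helper)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [node for node, _ in ranked[:k]]


def jaccard(a, b):
    """Share of nodes common to both collections, from 0.0 to 1.0 (hypothetical helper)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Invented centrality scores for two networks (placeholders, not real output)
credit_scores = {'A': 0.40, 'B': 0.39, 'C': 0.20, 'D': 0.05}
cash_scores = {'B': 0.41, 'A': 0.38, 'D': 0.22, 'C': 0.10}

top_credit = top_k_nodes(credit_scores, k=3)
top_cash = top_k_nodes(cash_scores, k=3)
overlap = jaccard(top_credit, top_cash)

print(top_credit, top_cash, overlap)
```

In practice, the dictionaries returned by `nx.degree_centrality` or `nx.eigenvector_centrality` for `G_credit` and `G_cash` could be passed in place of the placeholder scores.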

## Conclusion
