# Project 1

Authors: Lucas and Ari

## Introduction

In this project, we analyze a flight route network to identify which airlines serve which markets. To do this, we build a network in which airports are the nodes, routes are the edges, and the airlines flying into or out of an airport are stored as a node feature. Two airports $i$ and $j$ are joined by an edge if a route connects them.

## Load Packages and Data

In [17]:
import pandas as pd
import numpy as np 
import networkx as nx
from collections import defaultdict, Counter
import matplotlib.pyplot as plt
from scipy import stats

routes = pd.read_csv('https://raw.githubusercontent.com/lucasweyrich958/DATA620/refs/heads/main/routes.csv')

## Build Graph

After loading the data set, we build the network graph. Since the data distinguish source and destination airports, we build a directed graph. The nodes are all unique airport codes from both columns, and a directed edge is added from the source to the destination airport of every route. We also record, for each airport, the set of airlines that fly into or out of it.

In [None]:
#Build directed graph
G = nx.DiGraph()

#Add nodes (unique airports)
all_airports = pd.concat([routes['Source airport'], routes['Destination airport']]).unique()
G.add_nodes_from(all_airports)

#Using set to avoid duplicates
airport_serving_airlines = defaultdict(set)

#Add edges (flight routes) and record the airlines serving each airport in a single pass
for index, row in routes.iterrows():
    source_id = row['Source airport']
    destination_id = row['Destination airport']
    airline = row['Airline']
    G.add_edge(source_id, destination_id)
    airport_serving_airlines[source_id].add(airline) #Add the airline to the set of airlines serving the source airport
    airport_serving_airlines[destination_id].add(airline) #Add the airline to the set of airlines serving the destination airport

#Assign the set of airlines as a node attribute
for airport_id, airlines_set in airport_serving_airlines.items():
    G.nodes[airport_id]['serving_airlines'] = list(airlines_set)
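
The same construction can probably be written without row-by-row iteration. The following is a minimal sketch under stated assumptions: `toy_routes` is a hypothetical, made-up table that only borrows the column names of `routes.csv`, and `G_toy` and `serving` are illustrative names that do not appear in the notebook.

```python
import pandas as pd
import networkx as nx

# Hypothetical toy rows; only the column names match routes.csv
toy_routes = pd.DataFrame({
    "Airline": ["DL", "DL", "AA", "UA"],
    "Source airport": ["ATL", "JFK", "ATL", "OMA"],
    "Destination airport": ["JFK", "ATL", "JFK", "ATL"],
})

# Build the directed graph in one call; a repeated (source, destination)
# pair is stored as a single edge in a DiGraph
G_toy = nx.from_pandas_edgelist(
    toy_routes,
    source="Source airport",
    target="Destination airport",
    create_using=nx.DiGraph,
)

# Stack both endpoint columns so every row pairs one airport with one airline
endpoints = pd.concat([
    toy_routes[["Source airport", "Airline"]].rename(columns={"Source airport": "airport"}),
    toy_routes[["Destination airport", "Airline"]].rename(columns={"Destination airport": "airport"}),
])

# Collect the distinct airlines per airport and attach them as a node attribute
serving = {
    airport: sorted(set(group))
    for airport, group in endpoints.groupby("airport")["Airline"]
}
nx.set_node_attributes(G_toy, serving, "serving_airlines")

print(G_toy.number_of_edges())
print(G_toy.nodes["ATL"]["serving_airlines"])
```

In this toy table, the two ATL to JFK rows (one per airline) collapse into one edge, while both airlines are still recorded on the airports.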


## Network Measures

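We're going to calculate degree centrality and eigenvector centrality. Because the graph is directed, we compute degree as the sum of each airport's inbound and outbound edges. Note that this is a raw edge count, not the normalized value returned by `nx.degree_centrality`. For directed graphs, `nx.eigenvector_centrality` is based on inbound edges.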

We can then inspect these measures for selected airports. Here we select one small, one medium-sized, and two large airports: Omaha (OMA), St. Louis (STL), Atlanta (ATL), and New York (JFK), respectively.
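
To make the two measures concrete, here is a small sketch on a hypothetical four-airport network. The node names `A` to `D` and the variable names are made up for illustration and are not taken from the data set.

```python
import networkx as nx

# Hypothetical toy network, not taken from routes.csv
T = nx.DiGraph([
    ("A", "B"), ("B", "A"),
    ("A", "C"), ("C", "A"),
    ("B", "C"),
    ("D", "A"),
])

# Total degree as the sum of inbound and outbound edges
total_degree = {n: T.in_degree(n) + T.out_degree(n) for n in T}

# For a DiGraph, T.degree(n) already returns in-degree plus out-degree
assert all(total_degree[n] == T.degree(n) for n in T)

# nx.degree_centrality divides the total degree by (number of nodes - 1),
# so it can exceed 1 in a directed graph
normalized = nx.degree_centrality(T)

# On directed graphs, NetworkX derives eigenvector centrality from in-edges,
# so a node with no inbound routes (D) ends up with a score near zero
eig = nx.eigenvector_centrality(T, max_iter=1000)

for n in T:
    print(n, total_degree[n], round(normalized[n], 3), round(eig[n], 3))
```

The raw count is easier to read as a number of routes, while the normalized version is more convenient when comparing networks of different sizes.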

In [14]:
#Degree centrality as a raw count (in-degree + out-degree), not normalized
total_degree_centrality = {node: G.in_degree(node) + G.out_degree(node) for node in G.nodes()}

#Eigenvector Centrality
eigenvector_centrality = nx.eigenvector_centrality(G)

#Store centrality measures as node attributes
for node in G.nodes():
    G.nodes[node]['total_degree_centrality'] = total_degree_centrality.get(node, 0)
    G.nodes[node]['eigenvector_centrality'] = eigenvector_centrality.get(node, 0)

#Print measures for select airports
selected_airports = ['OMA', 'STL', 'ATL', 'JFK']

for airport_code in selected_airports:
    if airport_code in G:
        node_attributes = G.nodes[airport_code]
        print(f"Airport: {airport_code}")
        print(f"  Serving Airlines: {node_attributes.get('serving_airlines', 'N/A')}")
        print(f"  Total Degree Centrality: {node_attributes['total_degree_centrality']:.4f}")
        print(f"  Eigenvector Centrality: {node_attributes['eigenvector_centrality']:.4f}")
        print("-" * 30)

Airport: OMA
  Serving Airlines: ['UA', 'KL', 'WN', 'AF', 'AS', 'AZ', 'US', 'AA', 'DL', 'F9', 'FL']
  Total Degree Centrality: 38.0000
  Eigenvector Centrality: 0.0147
------------------------------
Airport: STL
  Serving Airlines: ['AM', '9K', 'UA', 'KL', 'WN', 'VS', 'AF', 'AS', 'AZ', 'US', 'AA', 'DL', 'F9', '3E', 'AC', 'FL']
  Total Degree Centrality: 121.0000
  Eigenvector Centrality: 0.0313
------------------------------
Airport: ATL
  Serving Airlines: ['NH', 'AY', 'QF', 'UA', 'SU', '3M', 'AZ', 'TK', 'CA', 'DL', 'CX', 'F9', 'FL', 'MH', 'WN', 'JL', 'VS', 'NK', 'OZ', 'WS', 'EY', 'AC', 'KE', 'BA', 'QR', 'AM', 'EI', '9E', 'IB', 'VA', 'CI', 'LH', 'KL', 'AF', 'AS', 'NZ', 'US', 'AA']
  Total Degree Centrality: 433.0000
  Eigenvector Centrality: 0.0813
------------------------------
Airport: JFK
  Serving Airlines: ['NH', 'OJ', 'AY', 'AT', 'QF', 'UA', 'SU', 'AZ', 'HA', 'SY', 'TK', 'BW', 'CA', 'DL', 'CX', 'MU', 'LA', '4M', 'LY', 'SV', 'B6', 'XL', 'MH', 'CM', 'JL', 'VS', 'PK', 'VX', 'OZ', '

As we can see, total degree centrality is lowest for the smallest airport, Omaha, followed by St. Louis as the medium-sized airport. However, even though more airlines serve JFK, Atlanta has a higher degree centrality. A likely explanation is that ATL is the main hub for Delta, one of the biggest airlines in the world, whereas JFK functions more as a gateway shared by many carriers. As a result, ATL is connected to more nodes than JFK is.
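
Part of the reason the airline count and the degree can diverge is structural: a `nx.DiGraph` stores each (source, destination) pair only once, no matter how many airlines fly it. The sketch below illustrates this with hypothetical routes. The airports `HUB` and `GATE` and all variable names are made up, and the `nx.MultiDiGraph` is shown only for contrast; it is not what the notebook uses.

```python
import networkx as nx

# Hypothetical routes as (airline, source, destination); not real data
toy = [
    ("DL", "HUB", "X"), ("DL", "HUB", "Y"), ("DL", "HUB", "Z"),
    ("AA", "GATE", "X"), ("UA", "GATE", "X"), ("B6", "GATE", "X"),
]

D = nx.DiGraph()       # one edge per (source, destination) pair
M = nx.MultiDiGraph()  # one parallel edge per airline on the same pair
airlines = {}

for airline, src, dst in toy:
    D.add_edge(src, dst)
    M.add_edge(src, dst, key=airline)
    airlines.setdefault(src, set()).add(airline)

for airport in ("HUB", "GATE"):
    print(
        airport,
        "airlines:", len(airlines[airport]),
        "DiGraph out-degree:", D.out_degree(airport),
        "MultiDiGraph out-degree:", M.out_degree(airport),
    )
```

Here `GATE` has three airlines but only one distinct route, while `HUB` has one airline and three routes, so the degree in the `DiGraph` measures distinct routes rather than airline presence.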

This raises the question of how the airlines themselves compare. As an example, we compare the three major US airlines: Delta (DL), American Airlines (AA), and United (UA). We compute the average degree centrality of the airports each airline serves, and then use a one-way ANOVA to compare the three averages.

In [19]:
#Compare Degree Centrality for DL, AA, and UA
selected_airlines = ['DL', 'AA', 'UA']
airline_groups_degree_centrality = defaultdict(list)

for node, attributes in G.nodes(data=True):
    if 'serving_airlines' in attributes:
        for airline in attributes['serving_airlines']:
            if airline in selected_airlines:
                airline_groups_degree_centrality[airline].append(attributes['total_degree_centrality'])

print(f"Total Degree Centrality for {', '.join(selected_airlines)}")

for airline in selected_airlines:
    avg_degree = np.mean(airline_groups_degree_centrality[airline])
    print(f"Airline {airline}: Average Total Degree Centrality of Served Airports = {avg_degree:.4f} (N={len(airline_groups_degree_centrality[airline])})")

#Compute ANOVA: one list of airport degrees per airline, in the order of selected_airlines
anova_data = [airline_groups_degree_centrality[airline] for airline in selected_airlines]

f_statistic, p_value = stats.f_oneway(*anova_data)
print(f"\nOne-Way ANOVA for: {', '.join(selected_airlines)}")
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

Total Degree Centrality for DL, AA, UA
Airline DL: Average Total Degree Centrality of Served Airports = 79.5706 (N=354)
Airline AA: Average Total Degree Centrality of Served Airports = 79.6567 (N=434)
Airline UA: Average Total Degree Centrality of Served Airports = 70.5972 (N=432)

One-Way ANOVA for: DL, AA, UA
F-statistic: 1.1673
P-value: 0.3116


The average degree centrality of the airports served by the three airlines is quite similar, and the ANOVA confirms that the difference is not statistically significant ($p = 0.31$). The three major airlines therefore appear to serve similarly well-connected airports, with United's airports being slightly less connected on average.

This leads to the conclusion that, on average, it should matter little which of these three airlines one takes to reach a destination, since the airports they serve are similarly well connected.
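
Because airport degree in route networks tends to be heavily skewed, a rank-based test may be a useful cross-check on the ANOVA. The following is a minimal sketch, not part of the original analysis: `compare_groups` is a hypothetical helper, and the values in `toy_groups` are made up purely for illustration.

```python
from scipy import stats

def compare_groups(groups):
    """Run one-way ANOVA and Kruskal-Wallis on a dict of label -> list of values."""
    samples = [groups[label] for label in groups]
    f_stat, p_anova = stats.f_oneway(*samples)
    h_stat, p_kruskal = stats.kruskal(*samples)
    return {
        "anova_F": f_stat,
        "anova_p": p_anova,
        "kruskal_H": h_stat,
        "kruskal_p": p_kruskal,
    }

# Hypothetical, made-up degree values for illustration only
toy_groups = {
    "X1": [2, 4, 4, 6, 300],
    "X2": [3, 5, 5, 7, 250],
    "X3": [2, 3, 6, 8, 280],
}

result = compare_groups(toy_groups)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

In the notebook, the helper could be called as `compare_groups(airline_groups_degree_centrality)`. Both tests assume independent groups, which holds only approximately here, because an airport served by more than one of the three airlines is counted in each of their groups.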