# Optimizing Airline Routes using Graph Anomaly Detection

### 553.602 Research and Design in Applied Mathematics: Data Mining

Team Members: Krutal Patel, Mansi Goel, Chenyu Xie, Nihaar Thakkar   

Subject Area: Transportation Optimization, Transportation Science 

<img src="Images/airline.jpeg" alt="network" width="500" height="250">


## Research Goals
1. Analyze Southwest route data
2. Gather metrics and statistics over time for Node (Airport) and Edge (Route) attributes 
3. Develop 
4. Apply 
5. Apply 
6. Determine a new route model (by removing or adding )

This project aims to optimize Southwest Airlines routes by examining the route patterns of airlines that use hub-and-spoke models, such as United and American, and by performing graph anomaly detection to identify potential hubs and essential routes for Southwest Airlines, with the goal of higher profits and fewer delays.

In [18]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings("ignore")

# Phase I: Querying and Cleaning Data

1. Database Schema
    - Fleet Data: information on the Southwest fleet: Unit Cost ($Millions USD) | Total Cost ($Millions USD)
    - full_quarterly_data_set: Southwest routes over time (we analyze the most recent year of data, 2019)
    - Airports2: routes between two airports data, including metrics such as seats, 


2. Filtering and Cleaning Fleet_Data
    - for now...

3. Filtering and Cleaning 
    - for now...

#### Node (Airport) Attributes / Properties
1. Airport Host City population
2. 

#### Edge (Trip X -> Y) Attributes / Properties 
1. Aircraft Type: Unit Cost of Aircraft, Average Age for Aircraft on service, 
2. 
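
One way these node and edge properties could be stored is as attributes on a `networkx` directed graph. The snippet below is a minimal sketch, not the project's final schema: the attribute names (`population`, `distance`, `unit_cost_musd`, `average_age`) are assumptions chosen for illustration, and the values are sample figures for a single LAX to RNO route.

```python
import networkx as nx

# Illustrative graph with one route; attribute names are hypothetical.
G = nx.DiGraph()

# Node (airport) attributes: host city population
G.add_node("LAX", population=22585772)
G.add_node("RNO", population=258859)

# Edge (route) attributes: distance plus aircraft properties
G.add_edge(
    "LAX", "RNO",
    distance=390,
    aircraft_type="Boeing 737",
    unit_cost_musd=74.0,   # unit cost in $ millions (assumed unit)
    average_age=11.7,
)

# Attributes are read back through the node and edge views
print(G.nodes["LAX"]["population"])
print(G.edges["LAX", "RNO"]["aircraft_type"])
```

Keeping attributes on the graph itself means later centrality or anomaly scores can be computed and stored alongside the raw properties.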

In [3]:
fleet_data = pd.read_csv("Database/Fleet Data.csv")
# compiled data on 737 statistics for southwest fleets 
sw_737 = fleet_data.loc[fleet_data['Airline'] == 'Southwest Airlines']   # fixed sw_737 data - no changes should be applied here

Unnamed: 0,Parent Airline,Airline,Aircraft Type,Current,Future,Historic,Total,Orders,Unit Cost,Total Cost (Current),Average Age
1362,Southwest Airlines,Southwest Airlines,Boeing 737,718.0,38.0,195.0,952.0,284.0,$74,"$53,118",11.7
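
The cost columns in the row above are stored as strings such as `$74` and `$53,118`, so they would need to be converted before any arithmetic. The following is a small sketch of one possible cleaning step on a toy copy of that row; the helper name `currency_to_float` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frame mirroring the Southwest row shown in the output above
fleet = pd.DataFrame({
    "Airline": ["Southwest Airlines"],
    "Unit Cost": ["$74"],
    "Total Cost (Current)": ["$53,118"],
})

def currency_to_float(series):
    """Strip '$' and thousands separators, then convert to numeric.

    Hypothetical helper; unparseable entries become NaN.
    """
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

for col in ["Unit Cost", "Total Cost (Current)"]:
    fleet[col] = currency_to_float(fleet[col])

print(fleet.dtypes)
```

Using `errors="coerce"` is a design choice that surfaces malformed entries as missing values instead of raising.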


In [27]:
sw_routes = pd.read_csv("Database/full_quarterly_data_set.csv")
# focus on year 2019 for network generation 
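# .copy() makes this an independent frame so the column assignments below do not act on a view of sw_routes
sw_routes_2019 = sw_routes.loc[sw_routes['year'] == 2019].copy()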
# compile statistics between the two quarters 
quarters = set(sw_routes_2019['quarter'].to_list())
# split citypair column to origin and destination key id columns 
sw_routes_2019[['Origin_Airport_Code','Destination_Airport_Code']] = sw_routes_2019['citypair'].str.split('-',expand=True)
# In the following code, we extract all of the unique airports serviced by southwest airlines - this will eventually produce a node table for us 
# To compute this, we extract origin and destination airport lists
SW_all_destinations = list(set(sw_routes_2019['Origin_Airport_Code'].to_list() + sw_routes_2019['Destination_Airport_Code'].to_list()))
print('Total no. of Airports Serviced by Southwest Airlines in 2019', len(SW_all_destinations))
# Compile data for the citypair into a list to be able to compare/query from the other Airports2 dataset
citypair_list = list(set(sw_routes_2019['citypair'].to_list()))

Total no. of Airports Serviced by Southwest Airlines in 2019 85


### Quick Statistics for Southwest Network in 2019

# 85 
Airports Serviced       

# 1240 
Total Flight routes    
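
Counts like these can be cross-checked once the city pairs are loaded into a graph. Below is a rough sketch on a handful of made-up city pairs (the list `toy_citypairs` is hypothetical, in the same `ORIGIN-DESTINATION` format as the `citypair` column). It assumes each city pair is treated as a directed route, which may or may not match how the 1240 figure was counted, and it uses degree centrality only as a simple first indicator of hub-like airports.

```python
import networkx as nx

# Hypothetical city pairs in "ORIGIN-DESTINATION" format
toy_citypairs = ["BWI-MDW", "MDW-BWI", "BWI-DEN", "DEN-LAS", "LAS-BWI"]

G = nx.DiGraph()
for pair in toy_citypairs:
    origin, destination = pair.split("-")
    G.add_edge(origin, destination)

n_airports = G.number_of_nodes()   # airports serviced
n_routes = G.number_of_edges()     # directed routes
print(n_airports, n_routes)

# Degree centrality (in-degree plus out-degree, normalized) as a crude hub score
centrality = nx.degree_centrality(G)
top_airport = max(centrality, key=centrality.get)
print(top_airport)
```

On the real data, `toy_citypairs` would be replaced by `citypair_list` from the cell above.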

In [46]:
# Read in the airports dataset  AIRPORTS2
# use a forward slash so the path works on every platform and avoids the invalid '\A' escape
airports_data = pd.read_csv('Database/Airports2.csv')
# This dataset contains A LOT of data records about flight routes between any two destinations both within and outside the mainland United States
# Based on the queried total flight route data from above, we must filter this dataframe so we only look at those specific routes and the corresponding information from them
# Note, the route data we are looking at from the above dataset is 2019-based, we are compiling the data below as initial predictors and attributes 
airports_data['Route_Combined'] = airports_data['Origin_airport'].astype(str) + '-' + airports_data['Destination_airport']
# .copy() avoids chained-assignment problems when new columns are added below
filtered_Airports_Data = airports_data[airports_data['Route_Combined'].isin(citypair_list)].copy()
# Next step is to break apart the dates into date - month - year extractions
# to do this, we first apply a data type transformation to convert col to python dateTime object
filtered_Airports_Data['Fly_date'] =  pd.to_datetime(filtered_Airports_Data['Fly_date'], format='%m/%d/%Y')
# extract the calendar year from the parsed date into its own column
filtered_Airports_Data['year'] = filtered_Airports_Data['Fly_date'].dt.year


In [49]:
filtered_Airports_Data

Unnamed: 0,Origin_airport,Destination_airport,Passengers,Seats,Flights,Distance,Fly_date,Origin_population,Destination_population,Org_airport_lat,Org_airport_long,Dest_airport_lat,Dest_airport_long,Route_Combined,year
15158,LAX,RNO,1780,2430,30,390,1990-09-01,22585772,258859,33.942501,-118.407997,39.499100,-119.767998,LAX-RNO,1990
15159,LAX,RNO,4773,7808,61,390,1990-01-01,22585772,258859,33.942501,-118.407997,39.499100,-119.767998,LAX-RNO,1990
15160,LAX,RNO,996,2610,29,390,1990-01-01,22585772,258859,33.942501,-118.407997,39.499100,-119.767998,LAX-RNO,1990
15161,LAX,RNO,1802,3584,28,390,1990-11-01,22585772,258859,33.942501,-118.407997,39.499100,-119.767998,LAX-RNO,1990
15162,LAX,RNO,971,1890,21,390,1990-02-01,22585772,258859,33.942501,-118.407997,39.499100,-119.767998,LAX-RNO,1990
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1048551,CRP,HOU,8191,14659,107,187,1992-06-01,379471,3983375,27.770399,-97.501198,29.645399,-95.278900,CRP-HOU,1992
1048552,CRP,HOU,2104,3416,28,187,1992-06-01,379471,3983375,27.770399,-97.501198,29.645399,-95.278900,CRP-HOU,1992
1048559,CRP,HOU,1272,2806,23,187,1992-01-01,379471,3983375,27.770399,-97.501198,29.645399,-95.278900,CRP-HOU,1992
1048563,CRP,HOU,6820,14933,109,187,1992-02-01,379471,3983375,27.770399,-97.501198,29.645399,-95.278900,CRP-HOU,1992
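
Because each route appears in many monthly records, the rows would likely need to be aggregated before they can serve as edge attributes. The sketch below shows one possible aggregation on four rows copied from the output above; the derived `load_factor` column (passengers divided by seats) is an assumption about a useful metric, not something the notebook defines.

```python
import pandas as pd

# Four records copied from the filtered output above
records = pd.DataFrame({
    "Route_Combined": ["LAX-RNO", "LAX-RNO", "CRP-HOU", "CRP-HOU"],
    "Passengers": [1780, 4773, 8191, 2104],
    "Seats": [2430, 7808, 14659, 3416],
    "Flights": [30, 61, 107, 28],
    "year": [1990, 1990, 1992, 1992],
})

# Sum the traffic metrics per route and year
edge_stats = (
    records
    .groupby(["Route_Combined", "year"], as_index=False)[["Passengers", "Seats", "Flights"]]
    .sum()
)

# Hypothetical derived metric: share of offered seats that were filled
edge_stats["load_factor"] = edge_stats["Passengers"] / edge_stats["Seats"]
print(edge_stats)
```

The resulting frame has one row per route and year, which maps directly onto one edge of the route graph.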
