CMPT 2400: Exploratory Data Analysis Data Project
Prepared by Laura Brin, Sandra Alex & Annabell Rodriguez

Loading libraries

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split

# Show up to 100 rows when displaying a DataFrame
pd.set_option("display.max_rows", 100)

Loading datasets

In [16]:
pass_df = pd.read_csv("dataset/International_Report_Passengers.csv")
depart_df = pd.read_csv("dataset/International_Report_Departures.csv")

Observing Departures Dataset

This dataset contains data on all flights between US gateways and non-US gateways. It is a record of international flights departing the US and can be used to highlight the busiest airports and the peak times for flight volume.
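As a rough illustration of the "busiest airports" and "peak times" idea, the sketch below aggregates the Total column by US gateway and by month. The small `sample` frame is made up for illustration and only borrows the column names from the departures data. With the real data, `depart_df` would take its place.

```python
import pandas as pd

# Illustrative rows only (values are invented); column names follow depart_df.
sample = pd.DataFrame({
    "usg_apt": ["MIA", "MIA", "BOS", "ORD"],
    "Month":   [5, 7, 3, 7],
    "Total":   [20, 8, 1, 15],
})

# Busiest US gateways: total departures per airport, largest first.
busiest = sample.groupby("usg_apt")["Total"].sum().sort_values(ascending=False)

# Peak times: total departures per calendar month, then the month with the most.
by_month = sample.groupby("Month")["Total"].sum()
peak_month = by_month.idxmax()

print(busiest)
print("Peak month:", peak_month)
```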

    ###Laura

Posed Problem: Flight delay propagation

International travel connects a web of airports across many countries every day of the year.
A single cancelled or delayed flight can ripple outward, causing missed connections, missing baggage, carrier fines, customer reimbursements and staffing problems.

According to the Federal Aviation Administration, Delay Propagation occurs when three conditions are met simultaneously (https://aspm.faa.gov/aspmhelp/index/Delay_Propagation.html#:~:text=Delay%20propagation%20occurs%20when%20a,identified%20by%20a%20tail%20number.):
    A flight arrives late at an airport.
    A flight departs late in subsequent stages.
    A flight arrives late at the next destination.


We would like to pitch an ML solution using this dataset that would assist with real-time decisions for domestic flights. When domestic flight centers experience multiple delays, air traffic decision makers could use the model to help predict which flights should be prioritized for take-off, reducing flight delay propagation into connecting international flights. The model would do this by looking at the relationship between a flight's intended landing airport, the number of international airports that airport connects with, the region of the airport and the time of year.
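One of the inputs described above, the number of international airports a US gateway connects with, could be derived roughly as follows. This is only a sketch under assumptions: the `sample` rows are illustrative, the feature name `n_foreign_apts` is hypothetical, and using usg_wac as the "region" is our reading of the pitch. It covers the feature side only and does not model delays.

```python
import pandas as pd

# Illustrative rows only; column names follow depart_df.
sample = pd.DataFrame({
    "usg_apt": ["MIA", "MIA", "MIA", "BOS", "ORD"],
    "usg_wac": [33, 33, 33, 13, 41],
    "fg_apt":  ["CMW", "CUN", "CMW", "KEF", "YEG"],
    "Month":   [5, 5, 6, 3, 11],
    "Total":   [20, 4, 6, 1, 1],
})

# Hypothetical connectivity feature: distinct foreign gateways per US gateway.
n_foreign = (
    sample.groupby("usg_apt")["fg_apt"]
    .nunique()
    .rename("n_foreign_apts")
    .reset_index()
)

# One row per US gateway, region (usg_wac) and month, with departures summed
# and the connectivity count attached.
features = (
    sample.groupby(["usg_apt", "usg_wac", "Month"], as_index=False)["Total"]
    .sum()
    .merge(n_foreign, on="usg_apt")
)

print(features)
```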



Features

data_dte: Date, in MM/DD/YYYY format

Year

Month

usg_apt_id: US Gateway Airport ID- assigned by US DOT to identify airport

usg_apt: US Gateway Airport Code- usually assigned by IATA but in absence of IATA designation, may show FAA-assigned code

        These two features are related. They represent the numeric ID and the three-letter code for the same location, respectively.

usg_wac: US Gateway World Area code- assigned by US DOT to represent a geographic territory

fg_apt_id: Foreign Gateway Airport ID-assigned by US DOT to identify an airport

fg_apt: Foreign Gateway Airport Code- usually assigned by IATA but in absence of IATA designation, may show FAA assigned code

fg_wac: Foreign Gateway World Area Code- assigned by US DOT to represent a geographic territory

airlineid: Airline ID assigned by US DOT to identify an air carrier




Notes: 
need to set the date format correctly
Year/Month are numerical; is there any need for Month to be categorical?
need to check where multiple usg_apt codes are assigned to one usg_apt_id and relabel
important pieces: US gateway / foreign gateway and US world area code / foreign world area code
Year, Month, apt_ids and airlineid are all stored as numeric but are actually categorical
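Two of the notes above (the date format and the ID-to-code check) could be approached as in the sketch below. The `sample` frame is illustrative: the code "BOX" is invented purely so the check has something to find, and treating "more than one code per ID" as the condition to flag is an assumption about what the note intends.

```python
import pandas as pd

# Illustrative rows only; "BOX" is a made-up code used to trigger the check.
sample = pd.DataFrame({
    "data_dte":   ["05/01/2006", "05/01/2003", "03/01/2007", "03/01/2008"],
    "usg_apt_id": [12016, 10299, 10721, 10721],
    "usg_apt":    ["GUM", "ANC", "BOS", "BOX"],
})

# Parse the MM/DD/YYYY strings into proper datetimes.
sample["data_dte"] = pd.to_datetime(sample["data_dte"], format="%m/%d/%Y")

# Count distinct airport codes per airport ID and keep the IDs with more than one.
codes_per_id = sample.groupby("usg_apt_id")["usg_apt"].nunique()
ambiguous_ids = codes_per_id[codes_per_id > 1].index.tolist()

print(sample.dtypes)
print("IDs with more than one code:", ambiguous_ids)
```

How the flagged IDs should be relabelled is left open here, since that depends on which code is considered canonical.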

In [17]:
depart_df.head(20)

Unnamed: 0,data_dte,Year,Month,usg_apt_id,usg_apt,usg_wac,fg_apt_id,fg_apt,fg_wac,airlineid,carrier,carriergroup,type,Scheduled,Charter,Total
0,05/01/2006,2006,5,12016,GUM,5,13162,MAJ,844,20177,PFQ,1,Departures,0,10,10
1,05/01/2003,2003,5,10299,ANC,1,13856,OKO,736,20007,5Y,1,Departures,0,15,15
2,03/01/2007,2007,3,10721,BOS,13,12651,KEF,439,20402,GL,1,Departures,0,1,1
3,12/01/2004,2004,12,11259,DAL,74,16271,YYZ,936,20201,AMQ,1,Departures,0,1,1
4,05/01/2009,2009,5,13303,MIA,33,11075,CMW,219,21323,5L,0,Departures,0,20,20
5,10/01/2007,2007,10,14761,SFB,33,11928,GLA,493,20444,JN,0,Departures,0,8,8
6,02/01/2002,2002,2,14100,PHL,23,11032,CUN,148,20402,MMQ,1,Departures,0,1,1
7,02/01/2008,2008,2,16091,YIP,43,16166,YQG,936,20201,AMQ,1,Departures,0,3,3
8,11/01/2001,2001,11,13930,ORD,41,16042,YEG,916,19531,AC,0,Departures,0,1,1
9,07/01/2003,2003,7,13198,MCI,64,13514,MTY,148,20201,AMQ,1,Departures,0,1,1


In [18]:
depart_df.describe()

Unnamed: 0,Year,Month,usg_apt_id,usg_wac,fg_apt_id,fg_wac,airlineid,carriergroup,Scheduled,Charter,Total
count,930808.0,930808.0,930808.0,930808.0,930808.0,930808.0,930808.0,930808.0,930808.0,930808.0,930808.0
mean,2006.021361,6.414783,12809.473781,42.51174,13484.676238,466.910479,20057.217505,0.599361,40.003181,2.005483,42.008665
std,8.558831,3.47107,2716.223845,27.571338,1932.601107,288.005971,479.071456,0.490028,60.948973,8.278403,60.340835
min,1990.0,1.0,10010.0,1.0,10119.0,106.0,19386.0,0.0,0.0,0.0,1.0
25%,1999.0,3.0,11618.0,22.0,11868.0,205.0,19704.0,0.0,0.0,0.0,3.0
50%,2007.0,6.0,12892.0,33.0,13408.0,427.0,19991.0,1.0,17.0,0.0,20.0
75%,2014.0,9.0,13487.0,72.0,15084.0,736.0,20312.0,1.0,60.0,1.0,60.0
max,2020.0,12.0,99999.0,93.0,16881.0,975.0,22067.0,1.0,2019.0,1092.0,2019.0


In [19]:
depart_df.shape

(930808, 16)

In [20]:
depart_df.dtypes

data_dte        object
Year             int64
Month            int64
usg_apt_id       int64
usg_apt         object
usg_wac          int64
fg_apt_id        int64
fg_apt          object
fg_wac           int64
airlineid        int64
carrier         object
carriergroup     int64
type            object
Scheduled        int64
Charter          int64
Total            int64
dtype: object
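The dtypes above show the ID columns stored as int64, although the notes treat them as categorical labels. A minimal sketch of the conversion, using a small illustrative frame in place of `depart_df` and assuming these three columns are the ones to convert:

```python
import pandas as pd

# Illustrative rows only; column names follow depart_df.
sample = pd.DataFrame({
    "usg_apt_id": [12016, 10299, 10721],
    "fg_apt_id":  [13162, 13856, 12651],
    "airlineid":  [20177, 20007, 20402],
    "Total":      [10, 15, 1],
})

# Columns that are identifiers rather than quantities (assumed selection).
id_cols = ["usg_apt_id", "fg_apt_id", "airlineid"]

# Cast identifiers to the pandas category dtype; counts such as Total stay numeric.
sample = sample.astype({col: "category" for col in id_cols})

print(sample.dtypes)
```

After this step, `describe()` would no longer report means and quartiles for the ID columns, which are not meaningful for identifiers.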


    ###Annabell

    ###Sandra