CMPT 2400: Exploratory Data Analysis Data Project
Prepared by Laura Brin, Sandra Alex & Annabell Rodriguez

loading libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split

# Show up to 100 rows when displaying a DataFrame
pd.set_option("display.max_rows", 100)

loading datasets

In [None]:
pass_df = pd.read_csv("dataset/International_Report_Passengers.csv")
depart_df = pd.read_csv("dataset/International_Report_Departures.csv")

### Posed Problem: Flight Delay Propagation Mitigation

International travel involves a web of interconnected airports in hundreds of countries every day of the year.
A cancelled or delayed flight can ripple outward, causing missed connections, missing baggage, carrier fines, customer reimbursements and staffing issues.

According to the Federal Aviation Administration, Delay Propagation occurs when three conditions are met simultaneously (https://aspm.faa.gov/aspmhelp/index/Delay_Propagation.html#:~:text=Delay%20propagation%20occurs%20when%20a,identified%20by%20a%20tail%20number.):

- A flight arrives late at an airport.
- A flight departs late in subsequent stages.
- A flight arrives late at the next destination.


### Posed Solution

We would like to pitch an ML solution using this dataset that would assist with real-time decisions about domestic flight delays. When domestic flight centers experience multiple delays, air traffic decision makers can use the model to help predict which flights should be prioritized for take-off to reduce flight delay propagation into connecting international flights. The model would do this by looking at the relationship between the flight's intended landing airport, the number of international airports that airport connects with, how many flights leave that airport, the region of the airport and the time of year.
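A minimal sketch of how two of the inputs described above (international connections per airport and flights leaving it) might be derived from the departures data. The toy rows and airport codes are invented for illustration, and the output names `n_foreign_airports` and `total_departures` are hypothetical; the column names `usg_apt`, `fg_apt` and `Total` follow the feature list in this notebook.

```python
import pandas as pd

# Toy rows standing in for depart_df (values are made up)
toy_departures = pd.DataFrame({
    "usg_apt": ["AAA", "AAA", "AAA", "BBB"],
    "fg_apt": ["XXX", "YYY", "XXX", "XXX"],
    "Total": [10, 5, 3, 7],
})

# One row per US gateway: distinct foreign airports served and total departures
gateway_features = (
    toy_departures.groupby("usg_apt")
    .agg(
        n_foreign_airports=("fg_apt", "nunique"),
        total_departures=("Total", "sum"),
    )
    .reset_index()
)

print(gateway_features)
```

On the real data the same aggregation could be run on `depart_df` directly, possibly restricted to a range of years.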

### Observing Departures Dataset

This dataset contains data on all flights between US gateways and non-US gateways. It is a record of international flights departing the US and can be used to highlight the busiest airports and peak times for flight volume.
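As a rough illustration of the "busiest airports" and "peak times" idea, the sketch below ranks gateways by total departures and finds the month with the highest volume. The rows are invented toy data; only the column names `usg_apt`, `Month` and `Total` come from the feature list in this notebook.

```python
import pandas as pd

# Toy rows standing in for depart_df (values are made up)
toy_volume = pd.DataFrame({
    "usg_apt": ["AAA", "AAA", "BBB", "BBB"],
    "Month": [1, 7, 7, 12],
    "Total": [20, 45, 30, 10],
})

# Gateways ranked by total departures, busiest first
busiest = toy_volume.groupby("usg_apt")["Total"].sum().sort_values(ascending=False)

# Total departures per calendar month, and the month with the highest volume
by_month = toy_volume.groupby("Month")["Total"].sum()
peak_month = by_month.idxmax()

print(busiest)
print("Peak month:", peak_month)
```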

There are multiple abbreviations used in this section:
* DOT: Department of Transportation
* FAA: Federal Aviation Administration
* IATA: International Air Transport Association
* ICAO: International Civil Aviation Organization

    #Laura

#### Features

Date- in MM/DD/YYYY format

Year

Month

> usg_apt_id: US Gateway Airport ID- assigned by US DOT to identify airport

> usg_apt: US Gateway Airport Code- usually assigned by IATA but in absence of IATA designation, may show FAA-assigned code. For full list of World Airport codes see the Bureau of Transportation Statistics: https://www.bts.gov/topics/airlines-and-airports/world-airport-codes 

        These two features are related. They represent the numerical location code (US) and three letter code for location identification, respectively. These should correlate 1:1 except where FAA coding was used in the absence of IATA coding. 
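One way to test the expected 1:1 relationship is to count the distinct codes seen for each ID and list any ID with more than one. This is only a sketch: the helper name `find_multi_code_ids` is hypothetical, and the IDs and codes in the toy frame are invented.

```python
import pandas as pd

def find_multi_code_ids(df, id_col="usg_apt_id", code_col="usg_apt"):
    """Return the distinct (id, code) pairs for IDs that map to more than one code."""
    n_codes = df.groupby(id_col)[code_col].nunique()
    bad_ids = n_codes[n_codes > 1].index
    pairs = df.loc[df[id_col].isin(bad_ids), [id_col, code_col]]
    return pairs.drop_duplicates().sort_values([id_col, code_col])

# Toy rows standing in for depart_df (IDs and codes are made up)
toy_airports = pd.DataFrame({
    "usg_apt_id": [10001, 10001, 10002, 10002],
    "usg_apt": ["AAA", "AAA", "BBB", "BBX"],
})

conflicts = find_multi_code_ids(toy_airports)
print(conflicts)
```

The same helper could be pointed at `fg_apt_id`/`fg_apt` or `airlineid`/`carrier` by changing `id_col` and `code_col`.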


usg_wac: US Gateway World Area code- assigned by US DOT to represent a geographic territory. 
* 1-99 USA, 
* 100-199 Central America, 
* 200-299 Caribbean, Bahamas and Bermuda, 
* 300-399 South America, 
* 400-499 Europe, 
* 500-599 Africa, 
* 600-699 Middle East, 
* 700-800 Far East/Asia, 
* 801-899 Antarctica, Australasia and Oceania, 
* 900-999 Canada and Greenland

Code groupings from https://en.wikipedia.org/wiki/World_Area_Codes
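The code ranges listed above could be turned into a region label with `pd.cut`, which may be handy for the "region of the airport" feature. This is a sketch under the assumption that the ranges are exactly as listed; the names `WAC_BINS`, `WAC_LABELS` and `wac_to_region` are hypothetical, and the sample codes are arbitrary values chosen to hit several ranges.

```python
import pandas as pd

# Upper edge of each range listed above; intervals are (lower, upper]
WAC_BINS = [0, 99, 199, 299, 399, 499, 599, 699, 800, 899, 999]
WAC_LABELS = [
    "USA",
    "Central America",
    "Caribbean, Bahamas and Bermuda",
    "South America",
    "Europe",
    "Africa",
    "Middle East",
    "Far East/Asia",
    "Antarctica, Australasia and Oceania",
    "Canada and Greenland",
]

def wac_to_region(wac):
    """Map World Area Codes to the region groupings listed above."""
    return pd.cut(wac, bins=WAC_BINS, labels=WAC_LABELS)

# Arbitrary sample codes, including the 800/801 boundary
toy_wac = pd.Series([22, 148, 205, 427, 736, 800, 801, 906])
regions = wac_to_region(toy_wac)
print(regions.tolist())
```

On the real data this might be applied as `wac_to_region(depart_df["fg_wac"])`.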



> fg_apt_id: Foreign Gateway Airport ID-assigned by US DOT to identify an airport

> fg_apt: Foreign Gateway Airport Code- usually assigned by IATA but in absence of IATA designation, may show FAA assigned code

> fg_wac: Foreign Gateway World Area Code- Assigned by US DOT to represent territory. For code groups see above in usg_wac comments

        These three features are related. They represent the five digit numerical location code (US), three letter code for location identification (International), and three digit numerical location code (international), respectively.

> airlineid: Airline ID assigned by US DOT to identify an air carrier

> carrier: IATA-assigned air carrier code. If a carrier has no IATA code, an ICAO- or FAA-assigned code may be used. These are mixed letter/number codes. For the full list of air carrier codes, see the Bureau of Transportation Statistics: https://www.bts.gov/topics/airlines-and-airports/airline-codes

        These two features are related. They represent the five digit numerical airline ID (US) and the two or three character air carrier id (international). These should correlate 1:1 except where IATA coding was absent. 

carriergroup: group code. 1=US domestic air carriers, 0=foreign air carriers

type: type of the metrics- this is a single code for this dataset= "Departures"

> Scheduled: metric flown by scheduled service operations. Scheduled flights are those commercially available for individual purchase.

> Charter: metric flown by charter operations. Charter flights are booked by a group or consortium responsible for all seats on the flight. These are commonly referred to as private flights.

        These two features are related. Flights are listed as either scheduled or charter. Flights on the same day, from the same airline id, with the same take-off and landing sites are recorded as a count metric.

Total: scheduled+charter flight counts
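A quick sanity check on this definition is to flag any rows where `Total` does not equal `Scheduled` plus `Charter`. The rows below are invented toy data, with one deliberately inconsistent row.

```python
import pandas as pd

# Toy rows standing in for depart_df (values are made up; the last row is wrong on purpose)
toy_counts = pd.DataFrame({
    "Scheduled": [10, 0, 4],
    "Charter": [0, 2, 1],
    "Total": [10, 2, 6],
})

# Rows where the recorded total disagrees with the sum of its parts
mismatch = toy_counts[toy_counts["Scheduled"] + toy_counts["Charter"] != toy_counts["Total"]]
print(mismatch)
```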



        Notes:

* Need to set the date format correctly.
* Year/month are numerical. Is there any need for month to be categorical?
* Need to check where more than 2 usg_apt codes are assigned to a usg_apt_id and relabel.
* Important pieces: US gateway and foreign gateway, and US world area code and foreign world area code.
* Year, month, apt_ids and airlineid are all stored as numeric but are actually categorical.
* Should discuss whether to include charter flights, as the problem statement is directed towards scheduled flights.
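The type-related notes above could be handled along the following lines. This is a sketch, not a settled cleaning step: the helper name `tidy_types` is hypothetical, the date column name `data_dte` is an assumption (the feature list only calls it Date), and the toy values are invented.

```python
import pandas as pd

def tidy_types(df, date_col="data_dte",
               categorical_cols=("usg_apt_id", "fg_apt_id", "airlineid")):
    """Parse MM/DD/YYYY dates and mark numeric ID columns as categorical."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col], format="%m/%d/%Y")
    for col in categorical_cols:
        if col in out.columns:
            out[col] = out[col].astype("category")
    return out

# Toy rows standing in for depart_df (values are made up)
toy_raw = pd.DataFrame({
    "data_dte": ["05/01/2006", "12/01/2019"],
    "usg_apt_id": [10001, 10002],
    "fg_apt_id": [20001, 20002],
    "airlineid": [30001, 30001],
})

tidy = tidy_types(toy_raw)
print(tidy.dtypes)
```

Working on a copy keeps the raw frame available for comparison while the cleaning choices are still under discussion.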

In [None]:
depart_df.head(20)

In [None]:
depart_df.describe()

In [None]:
depart_df.shape

In [None]:
depart_df.dtypes

### Annabell


### Sandra