### Open Flights Data Wrangling

To practice, you are going to wrangle data from OpenFlights.  You can read about it here: 

http://openflights.org/data.html

This includes three main files: one for airports, one for airlines, and one for routes.  They can be merged or joined on the appropriate fields.  I have modified the files slightly to include a header row in the .dat files, which makes them a bit easier for you to work with.  

You are required to work through the problems below.  This may take some time.  Be persistent, and ask questions or seek help as needed.  

In [85]:
import pandas as pd
import numpy as np

In [86]:
# These files use \N as a missing value indicator.  When reading the CSVs, we tell
# pandas to treat that value as missing (NA).  The double backslash is required because
# otherwise Python would try to interpret \N as the start of an escape sequence.
airports = pd.read_csv('data/airports.txt', na_values=['\\N'])
airlines = pd.read_csv('data/airlines.txt', na_values=['\\N'])
routes = pd.read_csv('data/routes.txt', na_values=['\\N'])
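
If you want to convince yourself that the `na_values` argument is doing its job, here is a minimal sketch on an in-memory sample. The sample rows and the name `demo` are made up for illustration and are not taken from the real files.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the OpenFlights style; not real data.
sample = io.StringIO(
    "AirportID,Name,IATA\n"
    "1,Goroka Airport,GKA\n"
    "2,Example Field,\\N\n"
)

# In Python source, '\\N' is the two-character string: backslash followed by N.
demo = pd.read_csv(sample, na_values=['\\N'])

# Count how many IATA codes were read as missing.
missing_iata = int(demo['IATA'].isna().sum())
print(missing_iata)
```

Without `na_values`, the literal text `\N` would be kept as an ordinary string and would not be counted by `.isna()`.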


1) Start by seeing what's in the data.  What columns are there?  What data types are the columns?  

Remember, 'object' usually means the column holds strings, while numerical values can be floats or ints.  Sometimes numeric data gets read in as strings, which causes problems.  If that happens, you can use the .astype() method to convert it.  Look it up in the pandas API documentation to get more details.
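
As a rough sketch of that conversion, assuming a made-up column of numbers stored as text (the frame `demo` and the series `messy` are hypothetical, not part of the OpenFlights files):

```python
import pandas as pd

# Hypothetical frame where the numbers were read in as strings.
demo = pd.DataFrame({'Altitude': ['5282', '20', '5388']})

# .astype() converts the whole column; it raises if any value cannot be parsed.
demo['Altitude'] = demo['Altitude'].astype('int64')

# pd.to_numeric with errors='coerce' is a more forgiving alternative:
# entries that cannot be parsed become NaN instead of raising an error.
messy = pd.Series(['146', 'n/a', '239'])
cleaned = pd.to_numeric(messy, errors='coerce')

print(demo.dtypes)
print(cleaned)
```

Which one you reach for depends on whether a bad value should stop you (`.astype()`) or be flagged as missing (`pd.to_numeric`).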

In [87]:
airports.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7184 entries, 0 to 7183
Data columns (total 15 columns):
AirportID           7184 non-null int64
Name                7184 non-null object
City                7140 non-null object
Country             7184 non-null object
IATA                5652 non-null object
ICAO                7184 non-null object
Latitude            7184 non-null float64
Longitude           7184 non-null float64
Altitude            7184 non-null int64
Timezone            6874 non-null float64
DST                 6874 non-null object
Tz database time    6591 non-null object
zone                7184 non-null object
Type                7184 non-null object
Source              0 non-null float64
dtypes: float64(4), int64(2), object(9)
memory usage: 589.4+ KB


In [88]:
airports.head()

Unnamed: 0,AirportID,Name,City,Country,IATA,ICAO,Latitude,Longitude,Altitude,Timezone,DST,Tz database time,zone,Type,Source
0,1,Goroka Airport,Goroka,Papua New Guinea,GKA,AYGA,-6.08169,145.391998,5282,10.0,U,Pacific/Port_Moresby,airport,OurAirports,
1,2,Madang Airport,Madang,Papua New Guinea,MAG,AYMD,-5.20708,145.789001,20,10.0,U,Pacific/Port_Moresby,airport,OurAirports,
2,3,Mount Hagen Kagamuga Airport,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.82679,144.296005,5388,10.0,U,Pacific/Port_Moresby,airport,OurAirports,
3,4,Nadzab Airport,Nadzab,Papua New Guinea,LAE,AYNZ,-6.569803,146.725977,239,10.0,U,Pacific/Port_Moresby,airport,OurAirports,
4,5,Port Moresby Jacksons International Airport,Port Moresby,Papua New Guinea,POM,AYPY,-9.44338,147.220001,146,10.0,U,Pacific/Port_Moresby,airport,OurAirports,


In [89]:
airlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6162 entries, 0 to 6161
Data columns (total 8 columns):
Airline    6162 non-null int64
ID         6162 non-null object
Name       179 non-null object
Alias      1534 non-null object
IATA       5887 non-null object
ICAO       5351 non-null object
Country    6144 non-null object
Active     6162 non-null object
dtypes: int64(1), object(7)
memory usage: 216.7+ KB


In [90]:
airlines.head()

Unnamed: 0,Airline,ID,Name,Alias,IATA,ICAO,Country,Active
0,-1,Unknown,,-,,,,Y
1,1,Private flight,,-,,,,Y
2,2,135 Airways,,,GNL,GENERAL,United States,N
3,3,1Time Airline,,1T,RNX,NEXTIME,South Africa,Y
4,4,2 Sqn No 1 Elementary Flying Training School,,,WYT,,United Kingdom,N


In [91]:
routes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67663 entries, 0 to 67662
Data columns (total 9 columns):
Airline                   67663 non-null object
AirlineID                 67184 non-null float64
Source Airport            67663 non-null object
Source airport ID         67443 non-null float64
Destination airport       67663 non-null object
Destination airport ID    67442 non-null float64
Codeshare                 14597 non-null object
Stops                     67663 non-null int64
Equipment                 67645 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 3.4+ MB


In [92]:
print(len(routes))
routes.head()


67663


Unnamed: 0,Airline,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment
0,2B,410.0,AER,2965.0,KZN,2990.0,,0,CR2
1,2B,410.0,ASF,2966.0,KZN,2990.0,,0,CR2
2,2B,410.0,ASF,2966.0,MRV,2962.0,,0,CR2
3,2B,410.0,CEK,2968.0,KZN,2990.0,,0,CR2
4,2B,410.0,CEK,2968.0,OVB,4078.0,,0,CR2


2) Select just the routes that go to or from Lexington Bluegrass Airport, and store them in their own dataframe.  

The airport code is LEX.  You should have a much smaller dataframe.  How many inbound routes and how many outbound routes are there? 
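
One possible way to approach this is with boolean masks. The sketch below uses a tiny made-up table (`demo_routes`) that borrows the column names from this notebook; the rows themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature routes table; column names follow this notebook.
demo_routes = pd.DataFrame({
    'Airline': ['DL', 'DL', 'AA', 'UA'],
    'Source Airport': ['LEX', 'ATL', 'ORD', 'IAH'],
    'Destination airport': ['ATL', 'LEX', 'DFW', 'LEX'],
})

outbound_mask = demo_routes['Source Airport'] == 'LEX'
inbound_mask = demo_routes['Destination airport'] == 'LEX'

# .copy() gives an independent frame, so adding columns to it later
# does not touch the original table.
lex_routes = demo_routes[outbound_mask | inbound_mask].copy()

# Summing a boolean mask counts the True values.
n_outbound = int(outbound_mask.sum())
n_inbound = int(inbound_mask.sum())
print(len(lex_routes), n_outbound, n_inbound)
```

Keeping the two masks separate makes it easy to count inbound and outbound routes without filtering twice.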

In [93]:
source = routes[routes['Source Airport'] == 'LEX']
print(len(source))
source

20


Unnamed: 0,Airline,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment
3588,9E,3976.0,LEX,4017.0,ATL,3682.0,,0,CRJ
5763,AA,24.0,LEX,4017.0,CLT,3876.0,Y,0,CR7 CRJ
5764,AA,24.0,LEX,4017.0,DFW,3670.0,Y,0,ERD ER4
5765,AA,24.0,LEX,4017.0,ORD,3830.0,Y,0,ERD ER4
9641,AF,137.0,LEX,4017.0,ATL,3682.0,Y,0,CRJ CR9
21095,DL,2009.0,LEX,4017.0,ATL,3682.0,,0,M88 717
21096,DL,2009.0,LEX,4017.0,DCA,3520.0,Y,0,CRJ
21097,DL,2009.0,LEX,4017.0,DTW,3645.0,Y,0,CR7 CRJ CR9
21098,DL,2009.0,LEX,4017.0,LGA,3697.0,,0,ERJ
21099,DL,2009.0,LEX,4017.0,MSP,3858.0,Y,0,CRJ


In [94]:
destination = routes[routes['Destination airport'] == 'LEX']
print(len(destination))
destination

20


Unnamed: 0,Airline,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment
3569,9E,3976.0,ATL,3682.0,LEX,4017.0,,0,CRJ
4953,AA,24.0,CLT,3876.0,LEX,4017.0,,0,CR7
5247,AA,24.0,DFW,3670.0,LEX,4017.0,Y,0,ERD ER4
6283,AA,24.0,ORD,3830.0,LEX,4017.0,Y,0,ERD ER4
9097,AF,137.0,ATL,3682.0,LEX,4017.0,Y,0,CR9 M88 CRJ 717
20164,DL,2009.0,ATL,3682.0,LEX,4017.0,,0,M88 717
20534,DL,2009.0,DCA,3520.0,LEX,4017.0,Y,0,CRJ
20638,DL,2009.0,DTW,3645.0,LEX,4017.0,,0,717
21131,DL,2009.0,LGA,3697.0,LEX,4017.0,Y,0,ERJ
21402,DL,2009.0,MSP,3858.0,LEX,4017.0,Y,0,CRJ


3) Now let's look at which airlines operate in and out of Lexington.  To do this, you need to merge the airline dataframe to the route dataframe.  

How many routes does each airline have?  The value_counts() method may be useful for answering this question.  
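
Here is a minimal sketch of the merge-then-count idea on made-up data. The frames `demo_routes` and `demo_airlines` and their column names (`ID`, `Name`) are assumptions for illustration; check the actual column names in your own dataframes before copying this.

```python
import pandas as pd

# Hypothetical stand-ins for the routes and airlines tables.
demo_routes = pd.DataFrame({
    'AirlineID': [2009, 2009, 24, 35],
    'Destination airport': ['ATL', 'DTW', 'ORD', 'SFB'],
})
demo_airlines = pd.DataFrame({
    'ID': [2009, 24, 35],
    'Name': ['Delta Air Lines', 'American Airlines', 'Allegiant Air'],
})

# how='left' keeps every route, even if its airline is missing from the lookup table.
merged = pd.merge(demo_routes, demo_airlines, how='left',
                  left_on='AirlineID', right_on='ID')

# Count routes per airline name.
routes_per_airline = merged['Name'].value_counts()
print(routes_per_airline)
```

Counting on the airline name rather than the numeric ID makes the result much easier to read.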

In [95]:
destination['AirlineID'].value_counts()

2009.0    5
35.0      4
24.0      3
5265.0    3
5209.0    2
3090.0    1
137.0     1
3976.0    1
Name: AirlineID, dtype: int64

In [96]:
source['AirlineID'].value_counts()

2009.0    5
35.0      4
24.0      3
5265.0    3
5209.0    2
3090.0    1
137.0     1
3976.0    1
Name: AirlineID, dtype: int64

In [110]:
df_dep = pd.merge(source,airlines, left_on='AirlineID', right_on='Airline')
df_dep

Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active
0,9E,3976.0,LEX,4017.0,ATL,3682.0,,0,CRJ,3976,Pinnacle Airlines,,9E,FLG,FLAGSHIP,United States,Y
1,AA,24.0,LEX,4017.0,CLT,3876.0,Y,0,CR7 CRJ,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
2,AA,24.0,LEX,4017.0,DFW,3670.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
3,AA,24.0,LEX,4017.0,ORD,3830.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
4,AF,137.0,LEX,4017.0,ATL,3682.0,Y,0,CRJ CR9,137,Air France,,AF,AFR,AIRFRANS,France,Y
5,DL,2009.0,LEX,4017.0,ATL,3682.0,,0,M88 717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
6,DL,2009.0,LEX,4017.0,DCA,3520.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
7,DL,2009.0,LEX,4017.0,DTW,3645.0,Y,0,CR7 CRJ CR9,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
8,DL,2009.0,LEX,4017.0,LGA,3697.0,,0,ERJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
9,DL,2009.0,LEX,4017.0,MSP,3858.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y


In [112]:
df_arrive = pd.merge(destination,airlines, left_on='AirlineID', right_on='Airline')
df_arrive

Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active
0,9E,3976.0,ATL,3682.0,LEX,4017.0,,0,CRJ,3976,Pinnacle Airlines,,9E,FLG,FLAGSHIP,United States,Y
1,AA,24.0,CLT,3876.0,LEX,4017.0,,0,CR7,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
2,AA,24.0,DFW,3670.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
3,AA,24.0,ORD,3830.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
4,AF,137.0,ATL,3682.0,LEX,4017.0,Y,0,CR9 M88 CRJ 717,137,Air France,,AF,AFR,AIRFRANS,France,Y
5,DL,2009.0,ATL,3682.0,LEX,4017.0,,0,M88 717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
6,DL,2009.0,DCA,3520.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
7,DL,2009.0,DTW,3645.0,LEX,4017.0,,0,717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
8,DL,2009.0,LGA,3697.0,LEX,4017.0,Y,0,ERJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
9,DL,2009.0,MSP,3858.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y


4) It looks like there are some international airlines with Lexington routes.  To look at how many routes they have, create a new column in your dataframe called 'International', which is set to Y for an overseas airline and N for a domestic airline.  Calculate the percent of routes with an overseas airline.  
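
A vectorized sketch of this step, assuming that "domestic" simply means the airline's country is 'United States'. The frame `demo` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical merged table with one row per route.
demo = pd.DataFrame({
    'Name': ['Delta Air Lines', 'Air France', 'American Airlines', 'Allegiant Air'],
    'Country': ['United States', 'France', 'United States', 'United States'],
})

# Assumption: an airline is domestic when its Country is 'United States'.
demo['International'] = np.where(demo['Country'] == 'United States', 'N', 'Y')

# The mean of a boolean Series is the fraction of True values.
pct_international = (demo['International'] == 'Y').mean() * 100
print(pct_international)
```

Assigning the whole column at once, rather than one cell at a time in a loop, also sidesteps the chained-indexing warning that pandas raises for `df['col'][label] = value`.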

In [99]:
df_arrive['International'] = ''
df_arrive.head()

Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active,International
0,9E,3976.0,ATL,3682.0,LEX,4017.0,,0,CRJ,3976,Pinnacle Airlines,,9E,FLG,FLAGSHIP,United States,Y,
1,AA,24.0,CLT,3876.0,LEX,4017.0,,0,CR7,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,
2,AA,24.0,DFW,3670.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,
3,AA,24.0,ORD,3830.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,
4,AF,137.0,ATL,3682.0,LEX,4017.0,Y,0,CR9 M88 CRJ 717,137,Air France,,AF,AFR,AIRFRANS,France,Y,


In [100]:
# Assign the whole column at once; this avoids the chained-indexing
# SettingWithCopyWarning that df['col'][label] = value triggers.
df_arrive['International'] = np.where(df_arrive['Country'] == 'United States', 'N', 'Y')

df_arrive


Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active,International
0,9E,3976.0,ATL,3682.0,LEX,4017.0,,0,CRJ,3976,Pinnacle Airlines,,9E,FLG,FLAGSHIP,United States,Y,N
1,AA,24.0,CLT,3876.0,LEX,4017.0,,0,CR7,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,N
2,AA,24.0,DFW,3670.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,N
3,AA,24.0,ORD,3830.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,N
4,AF,137.0,ATL,3682.0,LEX,4017.0,Y,0,CR9 M88 CRJ 717,137,Air France,,AF,AFR,AIRFRANS,France,Y,Y
5,DL,2009.0,ATL,3682.0,LEX,4017.0,,0,M88 717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N
6,DL,2009.0,DCA,3520.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N
7,DL,2009.0,DTW,3645.0,LEX,4017.0,,0,717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N
8,DL,2009.0,LGA,3697.0,LEX,4017.0,Y,0,ERJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N
9,DL,2009.0,MSP,3858.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N


In [101]:
df_dep['International'] = ''
df_dep.head()

Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active,International
0,9E,3976.0,LEX,4017.0,ATL,3682.0,,0,CRJ,3976,Pinnacle Airlines,,9E,FLG,FLAGSHIP,United States,Y,
1,AA,24.0,LEX,4017.0,CLT,3876.0,Y,0,CR7 CRJ,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,
2,AA,24.0,LEX,4017.0,DFW,3670.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,
3,AA,24.0,LEX,4017.0,ORD,3830.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,
4,AF,137.0,LEX,4017.0,ATL,3682.0,Y,0,CRJ CR9,137,Air France,,AF,AFR,AIRFRANS,France,Y,


In [102]:
# Assign the whole column at once; this avoids the chained-indexing
# SettingWithCopyWarning that df['col'][label] = value triggers.
df_dep['International'] = np.where(df_dep['Country'] == 'United States', 'N', 'Y')

df_dep


Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active,International
0,9E,3976.0,LEX,4017.0,ATL,3682.0,,0,CRJ,3976,Pinnacle Airlines,,9E,FLG,FLAGSHIP,United States,Y,N
1,AA,24.0,LEX,4017.0,CLT,3876.0,Y,0,CR7 CRJ,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,N
2,AA,24.0,LEX,4017.0,DFW,3670.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,N
3,AA,24.0,LEX,4017.0,ORD,3830.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y,N
4,AF,137.0,LEX,4017.0,ATL,3682.0,Y,0,CRJ CR9,137,Air France,,AF,AFR,AIRFRANS,France,Y,Y
5,DL,2009.0,LEX,4017.0,ATL,3682.0,,0,M88 717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N
6,DL,2009.0,LEX,4017.0,DCA,3520.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N
7,DL,2009.0,LEX,4017.0,DTW,3645.0,Y,0,CR7 CRJ CR9,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N
8,DL,2009.0,LEX,4017.0,LGA,3697.0,,0,ERJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N
9,DL,2009.0,LEX,4017.0,MSP,3858.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y,N


In [103]:
percent_int = len(df_arrive[df_arrive['International'] == 'Y']) / len(df_arrive)
print(percent_int*100,'%')

10.0 %


In [104]:
df_arrive['Country']

0     United States
1     United States
2     United States
3     United States
4            France
5     United States
6     United States
7     United States
8     United States
9     United States
10    United States
11    United States
12    United States
13    United States
14      Netherlands
15    United States
16    United States
17    United States
18    United States
19    United States
Name: Country, dtype: object

In [105]:
percent_int = len(df_arrive[df_arrive['Country'] != 'United States']) / len(df_arrive['Country'])
print(percent_int*100,'%')

10.0 %


In [106]:
percent_int = len(df_dep[df_dep['Country'] != 'United States']) / len(df_dep)
print(percent_int*100,'%')

10.0 %


In [107]:
percent_int = len(df_dep[df_dep['International'] != 'N']) / len(df_dep)
print(percent_int*100, '%')

10.0 %


5) Actually, it looks like a bunch of these routes are codeshares.  That means they are marketed by this airline, but operated by a different airline.  See the note in the data documentation on openflights.org/data.  The implication of this is that there are duplicates.

Can you figure out which ones are duplicates?  Can you then create a dataframe with only the unique routes?  How many unique inbound and outbound routes are there? 

Remember, someone has to operate the flight, so if all the routes to/from a particular airport are listed as codeshares, then something is funny...
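
One way to check for that "something is funny" situation is to group by airport and ask whether every row is a codeshare. This is only a sketch on invented rows; the frame `demo` is hypothetical.

```python
import pandas as pd

# Hypothetical inbound routes: ATL has an operated flight, ORD has only codeshares.
demo = pd.DataFrame({
    'Airline': ['DL', 'AF', 'AA', 'US'],
    'Source Airport': ['ATL', 'ATL', 'ORD', 'ORD'],
    'Codeshare': [None, 'Y', 'Y', 'Y'],
})

is_codeshare = demo['Codeshare'].eq('Y')

# For each origin airport, True only if every listed route is a codeshare.
all_codeshare = is_codeshare.groupby(demo['Source Airport']).all()

# Airports where nobody appears to operate the flight.
suspicious = all_codeshare[all_codeshare].index.tolist()
print(suspicious)
```

Any airport that shows up in `suspicious` deserves a closer look, since the operating carrier may be listed under a different code or be missing from the data.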

It is also possible that more than one airline actually operates a route between the same two airports. (Having this sort of competition generally means that you will get better fares as a traveler.)  It may not be obvious what is actually in the data set, so dig or do external research as needed.  
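
There is more than one reasonable definition of "unique route" here, so treat the following as a sketch under stated assumptions rather than the answer. It assumes that a row with no codeshare flag is operated by the listed airline, and it uses invented rows in a hypothetical frame `demo`.

```python
import pandas as pd

# Hypothetical inbound routes with a mix of codeshare and operated rows.
demo = pd.DataFrame({
    'Airline': ['AF', 'DL', '9E', 'KL', 'G4'],
    'Source Airport': ['ATL', 'ATL', 'ATL', 'ATL', 'SFB'],
    'Destination airport': ['LEX', 'LEX', 'LEX', 'LEX', 'LEX'],
    'Codeshare': ['Y', None, None, 'Y', None],
})

pair = ['Source Airport', 'Destination airport']

# Rows whose airport pair appears more than once are candidate duplicates.
candidates = demo[demo.duplicated(subset=pair, keep=False)]

# Assumption: rows without a codeshare flag are operated by the listed airline.
operated = demo[demo['Codeshare'].isna()]

# One row per airport pair, regardless of how many airlines serve it.
unique_pairs = operated.drop_duplicates(subset=pair)

print(len(candidates), len(operated), len(unique_pairs))
```

Whether two different operating airlines on the same airport pair count as one route or two is a judgment call; `operated` keeps them separate, while `unique_pairs` collapses them.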

In [118]:
df_arrive

Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active
0,9E,3976.0,ATL,3682.0,LEX,4017.0,,0,CRJ,3976,Pinnacle Airlines,,9E,FLG,FLAGSHIP,United States,Y
1,AA,24.0,CLT,3876.0,LEX,4017.0,,0,CR7,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
2,AA,24.0,DFW,3670.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
3,AA,24.0,ORD,3830.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
4,AF,137.0,ATL,3682.0,LEX,4017.0,Y,0,CR9 M88 CRJ 717,137,Air France,,AF,AFR,AIRFRANS,France,Y
5,DL,2009.0,ATL,3682.0,LEX,4017.0,,0,M88 717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
6,DL,2009.0,DCA,3520.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
7,DL,2009.0,DTW,3645.0,LEX,4017.0,,0,717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
8,DL,2009.0,LGA,3697.0,LEX,4017.0,Y,0,ERJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
9,DL,2009.0,MSP,3858.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y


In [119]:
pd.DataFrame.dropna?

In [138]:
# Keep only the inbound routes flagged as codeshares
df2 = df_arrive[df_arrive['Codeshare'] == 'Y']
print(len(df2))
df2

10


Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active
2,AA,24.0,DFW,3670.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
3,AA,24.0,ORD,3830.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
4,AF,137.0,ATL,3682.0,LEX,4017.0,Y,0,CR9 M88 CRJ 717,137,Air France,,AF,AFR,AIRFRANS,France,Y
6,DL,2009.0,DCA,3520.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
8,DL,2009.0,LGA,3697.0,LEX,4017.0,Y,0,ERJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
9,DL,2009.0,MSP,3858.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
14,KL,3090.0,ATL,3682.0,LEX,4017.0,Y,0,CR9 M88 CRJ,3090,KLM Royal Dutch Airlines,,KL,KLM,KLM,Netherlands,Y
15,UA,5209.0,IAH,3550.0,LEX,4017.0,Y,0,ERJ,5209,United Airlines,,UA,UAL,UNITED,United States,Y
16,UA,5209.0,ORD,3830.0,LEX,4017.0,Y,0,ERJ CRJ,5209,United Airlines,,UA,UAL,UNITED,United States,Y
19,US,5265.0,ORD,3830.0,LEX,4017.0,Y,0,ERD ER4,5265,US Airways,,US,USA,U S AIR,United States,Y


In [137]:
df_arrive2 = df2[df2['Airline_x'].isin(['AA', 'DL','G4','UA'])]
print(len(df_arrive2))
df_arrive2

7


Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active
2,AA,24.0,DFW,3670.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
3,AA,24.0,ORD,3830.0,LEX,4017.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
6,DL,2009.0,DCA,3520.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
8,DL,2009.0,LGA,3697.0,LEX,4017.0,Y,0,ERJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
9,DL,2009.0,MSP,3858.0,LEX,4017.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
15,UA,5209.0,IAH,3550.0,LEX,4017.0,Y,0,ERJ,5209,United Airlines,,UA,UAL,UNITED,United States,Y
16,UA,5209.0,ORD,3830.0,LEX,4017.0,Y,0,ERJ CRJ,5209,United Airlines,,UA,UAL,UNITED,United States,Y


In [136]:
# Inbound routes with no codeshare flag
dfY = df_arrive[df_arrive['Codeshare'].isnull()]
print(len(dfY))
dfY

10


Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active
0,9E,3976.0,ATL,3682.0,LEX,4017.0,,0,CRJ,3976,Pinnacle Airlines,,9E,FLG,FLAGSHIP,United States,Y
1,AA,24.0,CLT,3876.0,LEX,4017.0,,0,CR7,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
5,DL,2009.0,ATL,3682.0,LEX,4017.0,,0,M88 717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
7,DL,2009.0,DTW,3645.0,LEX,4017.0,,0,717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
10,G4,35.0,FLL,3533.0,LEX,4017.0,,0,M80,35,Allegiant Air,,G4,AAY,ALLEGIANT,United States,Y
11,G4,35.0,PGD,7056.0,LEX,4017.0,,0,M80,35,Allegiant Air,,G4,AAY,ALLEGIANT,United States,Y
12,G4,35.0,PIE,3617.0,LEX,4017.0,,0,320,35,Allegiant Air,,G4,AAY,ALLEGIANT,United States,Y
13,G4,35.0,SFB,4167.0,LEX,4017.0,,0,M80 320,35,Allegiant Air,,G4,AAY,ALLEGIANT,United States,Y
17,US,5265.0,CLT,3876.0,LEX,4017.0,,0,CR7,5265,US Airways,,US,USA,U S AIR,United States,Y
18,US,5265.0,DFW,3670.0,LEX,4017.0,,0,ERD,5265,US Airways,,US,USA,U S AIR,United States,Y


In [135]:
# Similar process for departures: keep only the outbound routes flagged as codeshares
df3 = df_dep[df_dep['Codeshare'] == 'Y']
print(len(df3))
df3

11


Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active
1,AA,24.0,LEX,4017.0,CLT,3876.0,Y,0,CR7 CRJ,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
2,AA,24.0,LEX,4017.0,DFW,3670.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
3,AA,24.0,LEX,4017.0,ORD,3830.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
4,AF,137.0,LEX,4017.0,ATL,3682.0,Y,0,CRJ CR9,137,Air France,,AF,AFR,AIRFRANS,France,Y
6,DL,2009.0,LEX,4017.0,DCA,3520.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
7,DL,2009.0,LEX,4017.0,DTW,3645.0,Y,0,CR7 CRJ CR9,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
9,DL,2009.0,LEX,4017.0,MSP,3858.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
14,KL,3090.0,LEX,4017.0,ATL,3682.0,Y,0,CR9 CRJ,3090,KLM Royal Dutch Airlines,,KL,KLM,KLM,Netherlands,Y
15,UA,5209.0,LEX,4017.0,IAH,3550.0,Y,0,ERJ,5209,United Airlines,,UA,UAL,UNITED,United States,Y
16,UA,5209.0,LEX,4017.0,ORD,3830.0,Y,0,ERJ CRJ,5209,United Airlines,,UA,UAL,UNITED,United States,Y


In [133]:
df_dep2 = df3[df3['Airline_x'].isin(['AA', 'DL', 'G4', 'UA'])]
print(len(df_dep2))
df_dep2

8


Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active
1,AA,24.0,LEX,4017.0,CLT,3876.0,Y,0,CR7 CRJ,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
2,AA,24.0,LEX,4017.0,DFW,3670.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
3,AA,24.0,LEX,4017.0,ORD,3830.0,Y,0,ERD ER4,24,American Airlines,,AA,AAL,AMERICAN,United States,Y
6,DL,2009.0,LEX,4017.0,DCA,3520.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
7,DL,2009.0,LEX,4017.0,DTW,3645.0,Y,0,CR7 CRJ CR9,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
9,DL,2009.0,LEX,4017.0,MSP,3858.0,Y,0,CRJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
15,UA,5209.0,LEX,4017.0,IAH,3550.0,Y,0,ERJ,5209,United Airlines,,UA,UAL,UNITED,United States,Y
16,UA,5209.0,LEX,4017.0,ORD,3830.0,Y,0,ERJ CRJ,5209,United Airlines,,UA,UAL,UNITED,United States,Y


In [134]:
# Outbound routes with no codeshare flag
dfY2 = df_dep[df_dep['Codeshare'].isnull()]
print(len(dfY2))
dfY2

9


Unnamed: 0,Airline_x,AirlineID,Source Airport,Source airport ID,Destination airport,Destination airport ID,Codeshare,Stops,Equipment,Airline_y,ID,Name,Alias,IATA,ICAO,Country,Active
0,9E,3976.0,LEX,4017.0,ATL,3682.0,,0,CRJ,3976,Pinnacle Airlines,,9E,FLG,FLAGSHIP,United States,Y
5,DL,2009.0,LEX,4017.0,ATL,3682.0,,0,M88 717,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
8,DL,2009.0,LEX,4017.0,LGA,3697.0,,0,ERJ,2009,Delta Air Lines,,DL,DAL,DELTA,United States,Y
10,G4,35.0,LEX,4017.0,FLL,3533.0,,0,M80,35,Allegiant Air,,G4,AAY,ALLEGIANT,United States,Y
11,G4,35.0,LEX,4017.0,PGD,7056.0,,0,M80,35,Allegiant Air,,G4,AAY,ALLEGIANT,United States,Y
12,G4,35.0,LEX,4017.0,PIE,3617.0,,0,320,35,Allegiant Air,,G4,AAY,ALLEGIANT,United States,Y
13,G4,35.0,LEX,4017.0,SFB,4167.0,,0,M80 320,35,Allegiant Air,,G4,AAY,ALLEGIANT,United States,Y
17,US,5265.0,LEX,4017.0,CLT,3876.0,,0,CR7,5265,US Airways,,US,USA,U S AIR,United States,Y
18,US,5265.0,LEX,4017.0,DFW,3670.0,,0,ERD,5265,US Airways,,US,USA,U S AIR,United States,Y
