In [1]:
import pandas as pd

In [2]:
df = pd.read_pickle('Data19.pkl')

After reopening the file in a new notebook, we were able to run all the commands we wanted without any issues. First, we check the types of the variables.

In [3]:
df.dtypes

Route       object
Month        int64
Carrier     object
From        object
FCity       object
FST         object
To          object
TCity       object
TST         object
Delay      float64
Flights    float64
Dist       float64
dtype: object

As we can see, Delay, Flights and Dist are stored as float64, which uses 8 bytes per value. Because we are working with a very large dataset (over 7 million rows), it is in our best interest to keep memory usage low throughout the process. Since these columns hold whole numbers, converting them to int32 (4 bytes per value) halves their footprint without losing any information, so this is our next step.
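Before casting, it may be worth confirming that the conversion really is lossless. The following is a minimal sketch on a made-up toy dataframe, not the real data; the helper `can_cast_losslessly` is a hypothetical name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "Delay": [-12.0, -20.0, 7.0],
    "Flights": [1.0, 1.0, 1.0],
    "Dist": [83.0, 83.0, 1576.0],
})

def can_cast_losslessly(series: pd.Series, dtype: str = "int32") -> bool:
    """Return True if every value is present, whole, and within the dtype's range."""
    if series.isna().any():
        return False  # NaN cannot be stored in a NumPy integer dtype
    if not (series == series.round()).all():
        return False  # a fractional part would be truncated by the cast
    info = np.iinfo(np.dtype(dtype))
    return bool(series.min() >= info.min and series.max() <= info.max)

# Only cast the columns that pass the check.
for col in ["Delay", "Flights", "Dist"]:
    if can_cast_losslessly(toy[col]):
        toy[col] = toy[col].astype("int32")

print(toy.dtypes)
```

If a column failed the check (for example because of missing delays), the cast would either raise or silently drop decimals, so the check is a cheap safeguard.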

In [4]:
df['Flights'] = df['Flights'].astype('int32')

In [5]:
df['Dist'] = df['Dist'].astype('int32')

In [6]:
df['Delay'] = df['Delay'].astype('int32')

In [7]:
df

,Route,Month,Carrier,From,FCity,FST,To,TCity,TST,Delay,Flights,Dist
0,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-12,1,83
1,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-20,1,83
2,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-13,1,83
3,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-15,1,83
4,ATL-CSG,1,9E,ATL,"Atlanta, GA",GA,CSG,"Columbus, GA",GA,-11,1,83
...,...,...,...,...,...,...,...,...,...,...,...,...
7422028,JFK-BQN,12,B6,JFK,"New York, NY",NY,BQN,"Aguadilla, PR",PR,-3,1,1576
7422029,JFK-SAV,12,B6,JFK,"New York, NY",NY,SAV,"Savannah, GA",GA,-2,1,718
7422030,SAV-JFK,12,B6,SAV,"Savannah, GA",GA,JFK,"New York, NY",NY,-13,1,718
7422031,BOS-SYR,12,B6,BOS,"Boston, MA",MA,SYR,"Syracuse, NY",NY,-40,1,265


In [8]:
df.dtypes

Route      object
Month       int64
Carrier    object
From       object
FCity      object
FST        object
To         object
TCity      object
TST        object
Delay       int32
Flights     int32
Dist        int32
dtype: object
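To see where the saving comes from, the sketch below compares the size of a single numeric column under different dtypes. It uses a small synthetic column (1,000 made-up values) rather than the real dataframe; on the real data, `df.memory_usage(deep=True)` should give the per-column figures.

```python
import numpy as np
import pandas as pd

n = 1_000
toy = pd.DataFrame({"Dist": np.full(n, 83.0)})  # float64 by default

# Bytes used by the column values alone (index excluded).
bytes_float64 = toy["Dist"].memory_usage(index=False)
bytes_int64 = toy["Dist"].astype("int64").memory_usage(index=False)
bytes_int32 = toy["Dist"].astype("int32").memory_usage(index=False)

print(bytes_float64, bytes_int64, bytes_int32)  # 8000 8000 4000
```

Note that float64 and int64 occupy the same space; the reduction comes from choosing the 32-bit integer type.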

### Grouping by Route, Month and Carrier.
Now that the dataframe has been joined and cleaned and contains all the information we want to use, we can perform the grouping.

Grouping by the aforementioned categories gives us, in a single observation, the number of flights on each route in each month by each carrier. We also want to keep the rest of the columns, so we have to choose an aggregation method for each of them. Most of the features hold the same value within a group: e.g. for all flights departing from ABE airport, FCity will be 'Allentown/Bethlehem/Easton, PA', FST will be the departure state abbreviation, 'PA', and so on.

Because of this, we choose the aggregation method 'first' for these columns, as it is the cheapest option: instead of computing anything over all the values in a group, it just takes the first non-null value and moves on to the next group.
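The assumption that the descriptive columns are constant within each group can be checked before relying on 'first'. The following is a sketch on a made-up toy dataframe with only two descriptive columns; the variable names are illustrative.

```python
import pandas as pd

# Toy rows invented for illustration; the real dataframe has more columns.
toy = pd.DataFrame({
    "Route": ["ABE-ATL", "ABE-ATL", "ABE-ATL", "YUM-PHX"],
    "Month": [1, 1, 2, 10],
    "Carrier": ["9E", "9E", "DL", "YV"],
    "FCity": ["Allentown/Bethlehem/Easton, PA"] * 3 + ["Yuma, AZ"],
    "FST": ["PA", "PA", "PA", "AZ"],
})

keys = ["Route", "Month", "Carrier"]
descriptive = ["FCity", "FST"]

# Number of distinct values per group and column.
# If no group has more than one, 'first' discards nothing.
distinct = toy.groupby(keys)[descriptive].nunique()
is_constant = bool((distinct <= 1).all().all())
print(is_constant)
```

If this printed `False`, some group would contain conflicting values and 'first' would silently keep only one of them.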

For the rest of the variables, which are the numeric ones, we take the mean, except for Flights, where we are interested in the number of flights. In the rows shown above each record has Flights equal to 1, so summing the column gives the flight count for each group.

After we perform the grouping, the grouping keys (Route, Month and Carrier) become a hierarchical index (MultiIndex) on the resulting dataframe. To turn them back into ordinary columns and get a flat dataframe, we need to call reset_index().

In [9]:
dfg = df.groupby(by=['Route', 'Month', 'Carrier']).agg({'From': 'first', 'FCity': 'first', 'FST': 'first', 
                                                        'To': 'first', 'TCity': 'first', 'TST': 'first',
                                                        'Delay': 'mean', 'Flights': 'sum', 'Dist': 'mean'})
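As an alternative formulation (not the one used in this notebook), the same kind of result can be obtained in one step with named aggregation and `as_index=False`, which keeps the grouping keys as regular columns. The sketch below runs on a made-up toy dataframe with a reduced set of columns.

```python
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    "Route": ["ABE-ATL", "ABE-ATL", "ABE-ATL", "YUM-PHX"],
    "Month": [1, 1, 1, 10],
    "Carrier": ["9E", "9E", "DL", "YV"],
    "From": ["ABE", "ABE", "ABE", "YUM"],
    "Delay": [10, -4, 3, 6],
    "Flights": [1, 1, 1, 1],
    "Dist": [692, 692, 692, 160],
})

# as_index=False leaves Route, Month and Carrier as columns,
# so no reset_index() call is needed afterwards.
grouped = toy.groupby(["Route", "Month", "Carrier"], as_index=False).agg(
    From=("From", "first"),
    Delay=("Delay", "mean"),
    Flights=("Flights", "sum"),
    Dist=("Dist", "mean"),
)
print(grouped)
```

Here the two 9E rows collapse into one observation with a mean delay of 3.0 and a flight total of 2, which matches the row count because every input row has Flights equal to 1.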

In [10]:
dfg

Route,Month,Carrier,From,FCity,FST,To,TCity,TST,Delay,Flights,Dist
ABE-ATL,1,9E,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,7.073171,41,692
ABE-ATL,1,DL,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,-0.923077,26,692
ABE-ATL,2,9E,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,20.382353,34,692
ABE-ATL,2,DL,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,1.964286,28,692
ABE-ATL,3,9E,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,4.929825,57,692
...,...,...,...,...,...,...,...,...,...,...,...
YUM-PHX,10,YV,YUM,"Yuma, AZ",AZ,PHX,"Phoenix, AZ",AZ,5.543860,57,160
YUM-PHX,11,OO,YUM,"Yuma, AZ",AZ,PHX,"Phoenix, AZ",AZ,4.470588,102,160
YUM-PHX,11,YV,YUM,"Yuma, AZ",AZ,PHX,"Phoenix, AZ",AZ,5.375000,40,160
YUM-PHX,12,OO,YUM,"Yuma, AZ",AZ,PHX,"Phoenix, AZ",AZ,-1.022727,88,160


In [11]:
dfg = dfg.reset_index()

In [12]:
dfg

,Route,Month,Carrier,From,FCity,FST,To,TCity,TST,Delay,Flights,Dist
0,ABE-ATL,1,9E,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,7.073171,41,692
1,ABE-ATL,1,DL,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,-0.923077,26,692
2,ABE-ATL,2,9E,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,20.382353,34,692
3,ABE-ATL,2,DL,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,1.964286,28,692
4,ABE-ATL,3,9E,ABE,"Allentown/Bethlehem/Easton, PA",PA,ATL,"Atlanta, GA",GA,4.929825,57,692
...,...,...,...,...,...,...,...,...,...,...,...,...
118314,YUM-PHX,10,YV,YUM,"Yuma, AZ",AZ,PHX,"Phoenix, AZ",AZ,5.543860,57,160
118315,YUM-PHX,11,OO,YUM,"Yuma, AZ",AZ,PHX,"Phoenix, AZ",AZ,4.470588,102,160
118316,YUM-PHX,11,YV,YUM,"Yuma, AZ",AZ,PHX,"Phoenix, AZ",AZ,5.375000,40,160
118317,YUM-PHX,12,OO,YUM,"Yuma, AZ",AZ,PHX,"Phoenix, AZ",AZ,-1.022727,88,160


Once again, due to technical difficulties, we export the dataframe as a .pkl file so that we can continue working in another notebook.

In [13]:
dfg.to_pickle('GroupedData.pkl')
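When handing data over to another notebook, a quick round-trip check can confirm that the pickle preserves both values and dtypes. The sketch below uses a tiny made-up dataframe and a temporary directory; the file name `toy_grouped.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame invented for illustration, with an int32 column as in the notebook.
toy = pd.DataFrame({"Route": ["ABE-ATL", "YUM-PHX"], "Flights": [41, 57]})
toy["Flights"] = toy["Flights"].astype("int32")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_grouped.pkl")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises an AssertionError if values, dtypes or index differ.
pd.testing.assert_frame_equal(toy, restored)
print(restored.dtypes)
```

Unlike a CSV export, the pickle keeps the int32 dtype, so the conversion does not need to be repeated after loading.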