<a href="https://colab.research.google.com/github/ivihernandez/data_science_tutorials/blob/main/The_last_pandas_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Value Proposition of the tutorial
To the best of my knowledge, this is the only tutorial that simultaneously meets all of these criteria:
* Can be used for interactive learning (through Google's Colab)
* You can verify all your answers (since the answers are provided)
* Whenever possible, more than one way for obtaining the solution is presented
* For every problem, at least one of the answers is obtained in a single step (e.g. no for loops, no assignments to intermediate datasets)
* Based on real life datasets (flights)
* Based on existing tutorials so you can get more information and verify my answers (data.table)

# Summary of the tutorial

Using the [NYC flights](https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv) you will learn how to manipulate data through:
* Subsetting (filter to find records of interest)
* Summarizing (computing math functions on aggregate or per group)
* Sorting (ordering the records)
* Creating new fields (columns)
* Getting rid of certain fields (columns)
* Combinations of one or more of the above

Note: the tutorial does not cover join operations, see my [pandas join tutorial](https://colab.research.google.com/drive/1L6eTL9IhNoSgCRcvgpTf5LCmBPy9jYZl)

# Resources
* [Pandas](https://pandas.pydata.org/)
* R's [data.table tutorial](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html)

# More information
* Solving time: an hour
* Keywords: pandas, python, data wrangling
* Author: Ivan Hernandez

##Set-up

In [109]:
## Import the pandas library and print its version
import pandas as pd
import numpy as np
from copy import *
import typing

In [2]:
print(pd.__version__)

2.2.2


## Load the dataset

In [3]:
flights = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv')

In [4]:
# explore a couple of rows
flights

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11
2,2014,1,1,2,9,AA,JFK,LAX,351,2475,19
3,2014,1,1,-8,-26,AA,LGA,PBI,157,1035,7
4,2014,1,1,2,1,AA,JFK,LAX,350,2475,13
...,...,...,...,...,...,...,...,...,...,...,...
253311,2014,10,31,1,-30,UA,LGA,IAH,201,1416,14
253312,2014,10,31,-5,-14,UA,EWR,IAH,189,1400,8
253313,2014,10,31,-8,16,MQ,LGA,RDU,83,431,11
253314,2014,10,31,-4,15,MQ,LGA,DTW,75,502,11


We noticed the data frame has 253316 rows and 11 columns. We can also see the name of the columns such as year, month, origin, dest, etc.

In [5]:
# We can obtain the dimensions explicitly by:
print('columns and rows:', flights.shape)
print('columns:', len(flights.columns))
print('columns (another way):', flights.shape[1])


columns and rows: (253316, 11)
columns: 11
columns (another way): 11


## 1 Basics
**a) What is a data frame**

A pandas data frame is a "Two-dimensional, size-mutable, potentially heterogeneous tabular data" (see [definition](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)). It can be "thought of as a dict-like container for Series objects".

In [6]:
"""
You can create a pandas dataframe by passing a dictionary of (key, value) pairs,
where:
  - the keys are the column names and
  - the value is a list for the column values
"""
df = pd.DataFrame(data=
                      {'id': ['b', 'b', 'b', 'a', 'a', 'c'],
                        'a': np.arange(1, 6 + 1),
                        'b': np.arange(7, 12 + 1),
                        'c': np.arange(13, 18 + 1)
                        })
df

Unnamed: 0,id,a,b,c
0,b,1,7,13
1,b,2,8,14
2,b,3,9,15
3,a,4,10,16
4,a,5,11,17
5,c,6,12,18


In [7]:
# Obtain the type of the data frame
df.dtypes

Unnamed: 0,0
id,object
a,int64
b,int64
c,int64


In [8]:
df.describe()

Unnamed: 0,a,b,c
count,6.0,6.0,6.0
mean,3.5,9.5,15.5
std,1.870829,1.870829,1.870829
min,1.0,7.0,13.0
25%,2.25,8.25,14.25
50%,3.5,9.5,15.5
75%,4.75,10.75,16.75
max,6.0,12.0,18.0


**b) How can I access elements of the data frame?**

The general form is data_frame_name[row_operator, column_operator]

**c) Subset rows**

**Select all flights starting on JFK on the 6th month, display 10**

In [9]:
# V1: Select all flights starting on JFK on the 6th month, display 6 records
flights[ (flights.origin == 'JFK') & (flights.month == 6)].head(6)

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
121142,2014,6,1,-9,-5,AA,JFK,LAX,324,2475,8
121143,2014,6,1,-10,-13,AA,JFK,LAX,329,2475,12
121144,2014,6,1,18,-1,AA,JFK,LAX,326,2475,7
121145,2014,6,1,-6,-16,AA,JFK,LAX,320,2475,10
121146,2014,6,1,-4,-45,AA,JFK,LAX,326,2475,18
121147,2014,6,1,-6,-23,AA,JFK,LAX,329,2475,14


In [10]:
# V2: Select all flights starting on JFK on the 6th month, display 10
flights[ (flights['origin'] == 'JFK') & (flights['month'] == 6)].head(10)


Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
121142,2014,6,1,-9,-5,AA,JFK,LAX,324,2475,8
121143,2014,6,1,-10,-13,AA,JFK,LAX,329,2475,12
121144,2014,6,1,18,-1,AA,JFK,LAX,326,2475,7
121145,2014,6,1,-6,-16,AA,JFK,LAX,320,2475,10
121146,2014,6,1,-4,-45,AA,JFK,LAX,326,2475,18
121147,2014,6,1,-6,-23,AA,JFK,LAX,329,2475,14
121148,2014,6,1,-1,-24,AA,JFK,DFW,177,1391,8
121149,2014,6,1,2,1,AA,JFK,LAX,322,2475,8
121150,2014,6,1,11,4,AA,JFK,LAS,305,2248,17
121151,2014,6,1,12,-4,AA,JFK,SFO,351,2586,8


In [11]:
# V3: Select all flights starting on JFK on the 6th month, display 10
# create an array of True/False for each row
mask = (flights['origin'] == 'JFK') & (flights['month'] == 6)
flights[mask][:10]


Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
121142,2014,6,1,-9,-5,AA,JFK,LAX,324,2475,8
121143,2014,6,1,-10,-13,AA,JFK,LAX,329,2475,12
121144,2014,6,1,18,-1,AA,JFK,LAX,326,2475,7
121145,2014,6,1,-6,-16,AA,JFK,LAX,320,2475,10
121146,2014,6,1,-4,-45,AA,JFK,LAX,326,2475,18
121147,2014,6,1,-6,-23,AA,JFK,LAX,329,2475,14
121148,2014,6,1,-1,-24,AA,JFK,DFW,177,1391,8
121149,2014,6,1,2,1,AA,JFK,LAX,322,2475,8
121150,2014,6,1,11,4,AA,JFK,LAS,305,2248,17
121151,2014,6,1,12,-4,AA,JFK,SFO,351,2586,8


In [12]:
# V4: Select all flights starting on JFK on the 6th month
flights[(
    flights.
    apply(axis='columns',
          func=lambda row:
          (row['origin'] == 'JFK') & (row['month'] == 6)
          )
)].head(10)

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
121142,2014,6,1,-9,-5,AA,JFK,LAX,324,2475,8
121143,2014,6,1,-10,-13,AA,JFK,LAX,329,2475,12
121144,2014,6,1,18,-1,AA,JFK,LAX,326,2475,7
121145,2014,6,1,-6,-16,AA,JFK,LAX,320,2475,10
121146,2014,6,1,-4,-45,AA,JFK,LAX,326,2475,18
121147,2014,6,1,-6,-23,AA,JFK,LAX,329,2475,14
121148,2014,6,1,-1,-24,AA,JFK,DFW,177,1391,8
121149,2014,6,1,2,1,AA,JFK,LAX,322,2475,8
121150,2014,6,1,11,4,AA,JFK,LAS,305,2248,17
121151,2014,6,1,12,-4,AA,JFK,SFO,351,2586,8


In [13]:
# V5: Select all flights starting on JFK on the 6th month
(
    flights.apply(
      func=lambda row:
      row[(flights['origin'] == 'JFK') & (flights['month'] == 6)]
    )
).head(10)

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
121142,2014,6,1,-9,-5,AA,JFK,LAX,324,2475,8
121143,2014,6,1,-10,-13,AA,JFK,LAX,329,2475,12
121144,2014,6,1,18,-1,AA,JFK,LAX,326,2475,7
121145,2014,6,1,-6,-16,AA,JFK,LAX,320,2475,10
121146,2014,6,1,-4,-45,AA,JFK,LAX,326,2475,18
121147,2014,6,1,-6,-23,AA,JFK,LAX,329,2475,14
121148,2014,6,1,-1,-24,AA,JFK,DFW,177,1391,8
121149,2014,6,1,2,1,AA,JFK,LAX,322,2475,8
121150,2014,6,1,11,4,AA,JFK,LAS,305,2248,17
121151,2014,6,1,12,-4,AA,JFK,SFO,351,2586,8


**Get the first two rows of flights**

In [14]:
# V1: Get the first two rows of flights
flights[:2]

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11


In [15]:
# V2: Get the first two rows of flights
flights[0:2]

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11


In [16]:
# V3: Get the first two rows of flights
flights.head(2)

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11


In [17]:
# V4: Get the first two rows of flights
flights.iloc[0:2]

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11


In [18]:
# V5: Get the first two rows of flights
flights.loc[0:1, ]

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11


In [19]:
# V6: Get the first two rows of flights
flights.loc[ [0,1],]

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11


In [20]:
# V7: Get the first two rows of flights
flights[
    flights.apply(axis='columns', func=lambda row: row.name <= 1)
]

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11


**Sort flights first by column origin in ascending order, and then by dest in descending order**

In [21]:
# V1: Sort by origin in ascending order, then by dest in descending order
flights.sort_values(by=['origin', 'dest'], ascending=[True, False]).head(6)

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
2790,2014,1,5,6,49,EV,EWR,XNA,195,1131,8
3351,2014,1,6,7,13,EV,EWR,XNA,190,1131,8
4066,2014,1,7,-6,-13,EV,EWR,XNA,179,1131,8
4825,2014,1,8,-7,-12,EV,EWR,XNA,184,1131,8
5658,2014,1,9,16,7,EV,EWR,XNA,181,1131,8
8665,2014,1,13,66,66,EV,EWR,XNA,188,1131,9


**d) Select columns**

**Select columns, return as a vector**

In [22]:
# V1: select columns, return as a vector
flights['arr_delay']

Unnamed: 0,arr_delay
0,13
1,13
2,9
3,-26
4,1
...,...
253311,-30
253312,-14
253313,16
253314,15


In [23]:
# V2: select columns, return as a vector
flights.arr_delay

Unnamed: 0,arr_delay
0,13
1,13
2,9
3,-26
4,1
...,...
253311,-30
253312,-14
253313,16
253314,15


In [24]:
# V3: select columns, return as a vector
flights.apply(axis='columns', func=lambda row: row['arr_delay'])


Unnamed: 0,0
0,13
1,13
2,9
3,-26
4,1
...,...
253311,-30
253312,-14
253313,16
253314,15


**Select columns, return as a data frame**

In [25]:
# V1: select columns, return as data frame
flights[ ['arr_delay']]

Unnamed: 0,arr_delay
0,13
1,13
2,9
3,-26
4,1
...,...
253311,-30
253312,-14
253313,16
253314,15


In [26]:
# V2: select columns, return as data frame
flights.loc[:, ['arr_delay']]

Unnamed: 0,arr_delay
0,13
1,13
2,9
3,-26
4,1
...,...
253311,-30
253312,-14
253313,16
253314,15


In [27]:
# V3: select columns, return as a vector
(
  pd.DataFrame({'arr_delay':
                flights.
                apply(axis='columns', func=lambda row: row['arr_delay'])
                })
)


Unnamed: 0,arr_delay
0,13
1,13
2,9
3,-26
4,1
...,...
253311,-30
253312,-14
253313,16
253314,15


**Select both arr_delay and dep_delay columns**

In [28]:
# V1: select columns, return as data frame
flights[ ['arr_delay', 'dep_delay']]

Unnamed: 0,arr_delay,dep_delay
0,13,14
1,13,-3
2,9,2
3,-26,-8
4,1,2
...,...,...
253311,-30,1
253312,-14,-5
253313,16,-8
253314,15,-4


In [29]:
# V2: select columns, return as data frame
flights.loc[:, ['arr_delay', 'dep_delay']]

Unnamed: 0,arr_delay,dep_delay
0,13,14
1,13,-3
2,9,2
3,-26,-8
4,1,2
...,...,...
253311,-30,1
253312,-14,-5
253313,16,-8
253314,15,-4


**e) Compute per column**

**How many trips have had total delay < 0**

In [30]:
# V1: How many trips have had total delay < 0
len(flights[flights.arr_delay + flights.dep_delay < 0])

141814

In [31]:
# V2: How many trips have had total delay < 0
flights[flights.arr_delay + flights.dep_delay < 0].shape[0]

141814

In [32]:
# V3: How many trips have had total delay < 0
flights[
    flights.
    apply(axis='columns', func=lambda row: row['arr_delay'] + row['dep_delay'] < 0)
].shape[0]

141814

In [33]:
# V4: How many trips have had total delay < 0
flights[ flights['arr_delay'] + flights['dep_delay'] < 0 ].shape[0]

141814

**f) Subset in rows and columns**

**Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June**

In [34]:
# V1: Calculate the average arrival and departure delay for
# all flights with “JFK” as the origin airport in the month of June
(
  flights[ (flights.origin == 'JFK') & (flights.month == 6)]
  [['arr_delay', 'dep_delay']].
  apply(axis='index', func=lambda column: pd.Series.mean(column))
)


Unnamed: 0,0
arr_delay,5.839349
dep_delay,9.807884


In [76]:
# V2: Calculate the average arrival and departure delay for
# all flights with “JFK” as the origin airport in the month of June
(
  flights[ (flights.origin == 'JFK') & (flights.month == 6)]
  [['arr_delay', 'dep_delay']].
  agg(axis='index', func=['mean'])
)


Unnamed: 0,arr_delay,dep_delay
mean,5.839349,9.807884


In [36]:
# V3: Calculate the average arrival and departure delay for
# all flights with “JFK” as the origin airport in the month of June
(
    flights[
      flights.
      apply(axis='columns', func=lambda row:
        (row['origin'] == 'JFK') & (row['month'] == 6)
        )
      ]
)[['arr_delay', 'dep_delay']].mean()

Unnamed: 0,0
arr_delay,5.839349
dep_delay,9.807884


In [40]:
# V4: Calculate the average arrival and departure delay for
# all flights with “JFK” as the origin airport in the month of June
(
  flights[ (flights['origin'] == 'JFK') & (flights['month'] == 6)].
  groupby(by=['origin', 'month']).
  apply(func=lambda group:
        pd.Series({
            'mean_arr_delay': np.mean(group['arr_delay']),
            'mean_dep_delay': np.mean(group['dep_delay'])
        })
    )
)


  apply(func=lambda group:


Unnamed: 0_level_0,Unnamed: 1_level_0,mean_arr_delay,mean_dep_delay
origin,month,Unnamed: 2_level_1,Unnamed: 3_level_1
JFK,6,5.839349,9.807884


In [39]:
# V5: Calculate the average arrival and departure delay for
# all flights with “JFK” as the origin airport in the month of June
(
  flights[ (flights['origin'] == 'JFK') & (flights['month'] == 6)]
  [['arr_delay', 'dep_delay']].
  apply(axis='index', func=lambda column:
        column.mean()
  )
)


Unnamed: 0,0
arr_delay,5.839349
dep_delay,9.807884


**How many trips have been made in 2014 from “JFK” airport in the month of June?**

In [77]:
# V1: How many trips have been made in
# 2014 from “JFK” airport in the month of June?
(
  flights[ (flights.origin == 'JFK') &
    (flights.month == 6) &
    (flights.year == 2014)].shape[0]
)


8422

In [78]:
# V2: How many trips have been made in
# 2014 from “JFK” airport in the month of June?
flights.query(' (origin=="JFK") & (month==6) & (year==2014) ').shape[0]

8422

In [82]:
# V3: How many trips have been made in
# 2014 from “JFK” airport in the month of June?
flights[flights.apply(
      axis=1,
      func=lambda row:
      (row['origin'] == 'JFK') &
      (row['month'] == 6) &
      (row['year'] == 2014))].shape[0]


8422

In the previous example, the apply function returns a mask (a series of Booleans with True and False), which we then use to index the original data frame. The axis=1 parameter indicates we apply the function to each row
of the data frame

**g) Column selection**

**Select columns named in a variable**



In [83]:
# V1: Select columns named in a variable (requires two steps by definition)
columns_of_interest = ['arr_delay', 'dep_delay']
flights[columns_of_interest]

Unnamed: 0,arr_delay,dep_delay
0,13,14
1,13,-3
2,9,2
3,-26,-8
4,1,2
...,...,...
253311,-30,1
253312,-14,-5
253313,16,-8
253314,15,-4


**De-select columns**

In [84]:
# V1: De-select arr_delay, dep_delay
flights.drop(axis=1, columns=['arr_delay', 'dep_delay'])

Unnamed: 0,year,month,day,carrier,origin,dest,air_time,distance,hour,rank
0,2014,1,1,AA,JFK,LAX,359,2475,9,1.0
1,2014,1,1,AA,JFK,LAX,363,2475,11,2.0
2,2014,1,1,AA,JFK,LAX,351,2475,19,3.0
3,2014,1,1,AA,LGA,PBI,157,1035,7,4.0
4,2014,1,1,AA,JFK,LAX,350,2475,13,5.0
...,...,...,...,...,...,...,...,...,...,...
253311,2014,10,31,UA,LGA,IAH,201,1416,14,26039.0
253312,2014,10,31,UA,EWR,IAH,189,1400,8,26040.0
253313,2014,10,31,MQ,LGA,RDU,83,431,11,26041.0
253314,2014,10,31,MQ,LGA,DTW,75,502,11,26042.0


In [85]:
# V2: De-select arr_delay, dep_delay
flights[flights.columns[~flights.columns.isin(['arr_delay', 'dep_delay'])]]

Unnamed: 0,year,month,day,carrier,origin,dest,air_time,distance,hour,rank
0,2014,1,1,AA,JFK,LAX,359,2475,9,1.0
1,2014,1,1,AA,JFK,LAX,363,2475,11,2.0
2,2014,1,1,AA,JFK,LAX,351,2475,19,3.0
3,2014,1,1,AA,LGA,PBI,157,1035,7,4.0
4,2014,1,1,AA,JFK,LAX,350,2475,13,5.0
...,...,...,...,...,...,...,...,...,...,...
253311,2014,10,31,UA,LGA,IAH,201,1416,14,26039.0
253312,2014,10,31,UA,EWR,IAH,189,1400,8,26040.0
253313,2014,10,31,MQ,LGA,RDU,83,431,11,26041.0
253314,2014,10,31,MQ,LGA,DTW,75,502,11,26042.0


## 2) Aggregations
Combining the previous lesson with `group by`

**How can we get the number of trips corresponding to each origin airport?**

In [87]:
# V1: number of trips corresponding to each origin airport
(
  flights.
  groupby(by='origin').
  apply(func=lambda group: group.shape[0])

)


  apply(func=lambda group: group.shape[0])


Unnamed: 0_level_0,0
origin,Unnamed: 1_level_1
EWR,87400
JFK,81483
LGA,84433


In [88]:
# V2: number of trips corresponding to each origin airport
(
  flights.
  groupby(by='origin').
  size()
)


Unnamed: 0_level_0,0
origin,Unnamed: 1_level_1
EWR,87400
JFK,81483
LGA,84433


In [89]:
# V3: number of trips corresponding to each origin airport
(
  flights.
  groupby(by='origin')['origin'].
  count()
)


Unnamed: 0_level_0,origin
origin,Unnamed: 1_level_1
EWR,87400
JFK,81483
LGA,84433


**How can we calculate the number of trips for each origin airport for carrier code "AA"**

In [90]:
# V1: Number of trips for each origin airport for carrier code "AA"
(
  flights[flights['carrier'] == 'AA'].
  groupby(by='origin').
  size()
)


Unnamed: 0_level_0,0
origin,Unnamed: 1_level_1
EWR,2649
JFK,11923
LGA,11730


In [115]:
# V2: Number of trips for each origin airport for carrier code "AA"
# (returns a dataframe)
(
  flights[flights.apply(axis=1, func=lambda row: row['carrier'] == 'AA')].
  groupby(by='origin', as_index=False).
  apply(func=lambda group: len(group), include_groups=False).
  rename(columns=str).rename(columns={'None':'number_of_trips'})

)


Unnamed: 0,origin,number_of_trips
0,EWR,2649
1,JFK,11923
2,LGA,11730


In [114]:
# V3: Number of trips for each origin airport for carrier code "AA"
# (returns a dataframe)
(
  flights[flights.apply(axis=1, func=lambda row: row['carrier'] == 'AA')].
  groupby(by='origin', as_index=False).
  apply(func=lambda group:
        pd.Series(
            {
            'number_of_trips': len(group)
            }
        )
        ,
        include_groups=False
  )
)




Unnamed: 0,origin,number_of_trips
0,EWR,2649
1,JFK,11923
2,LGA,11730


**How can we get the total number of trips for each origin, dest pair for carrier code "AA"?**

In [113]:
# V1: Number of trips for each origin, dest airport for carrier code "AA"
(
  flights[flights.carrier == 'AA'].
  groupby(by=['origin', 'dest'], as_index=False).
  apply(func=lambda my_group: my_group.shape[0], include_groups=False).
  head() # display the first 5
)


Unnamed: 0,origin,dest,None
0,EWR,DFW,1618
1,EWR,LAX,62
2,EWR,MIA,848
3,EWR,PHX,121
4,JFK,AUS,297


In [93]:
# V2: Number of trips for each origin, dest airport for carrier code "AA"
(
    flights.
    query('carrier == "AA"').
    groupby(by=['origin', 'dest'], as_index=False).
    size().
    head() # display the first 5
)


Unnamed: 0,origin,dest,size
0,EWR,DFW,1618
1,EWR,LAX,62
2,EWR,MIA,848
3,EWR,PHX,121
4,JFK,AUS,297


**How can we get the average arrival and departure delay for each orig, dest pair for each month for carrier code "AA"?**

In [94]:
# V1: average arrival and departure delay for
# each origin,dest pair for each month for carrier code "AA"?
(
  flights[flights.carrier == 'AA']
  [['arr_delay', 'dep_delay', 'month', 'origin', 'dest']].
  groupby(by=['origin', 'dest', 'month'], as_index=False).
  agg(func='mean')
)


Unnamed: 0,origin,dest,month,arr_delay,dep_delay
0,EWR,DFW,1,6.427673,10.012579
1,EWR,DFW,2,10.536765,11.345588
2,EWR,DFW,3,12.865031,8.079755
3,EWR,DFW,4,17.792683,12.920732
4,EWR,DFW,5,18.487805,18.682927
...,...,...,...,...,...
195,LGA,PBI,1,-7.758621,0.310345
196,LGA,PBI,2,-7.865385,2.403846
197,LGA,PBI,3,-5.754098,3.032787
198,LGA,PBI,4,-13.966667,-4.733333


In [98]:
# V2: average arrival and departure delay for
# each origin,dest pair for each month for carrier code "AA"?
(
  flights[flights['carrier']=='AA'].
  groupby(by=['origin', 'dest', 'month'], as_index=False).
  apply(func=lambda group:
      pd.Series({
          'mean_arr_delay': np.mean(group['arr_delay']),
          'mean_dep_delay': np.mean(group['dep_delay']),
        }
      ),
        include_groups=False
    )
)


Unnamed: 0,origin,dest,month,mean_arr_delay,mean_dep_delay
0,EWR,DFW,1,6.427673,10.012579
1,EWR,DFW,2,10.536765,11.345588
2,EWR,DFW,3,12.865031,8.079755
3,EWR,DFW,4,17.792683,12.920732
4,EWR,DFW,5,18.487805,18.682927
...,...,...,...,...,...
195,LGA,PBI,1,-7.758621,0.310345
196,LGA,PBI,2,-7.865385,2.403846
197,LGA,PBI,3,-5.754098,3.032787
198,LGA,PBI,4,-13.966667,-4.733333


In [60]:
# Verify with the original R data.table tutorial
# The R tutorial value reports 14.2289157 for
# JFK, LAX, month=1, dep_delay
# With pandas we obtain 14.228916
(
  flights[
      (flights.carrier == 'AA') &
      (flights.origin == 'JFK') &
      (flights.dest == 'LAX') &
      (flights.month == 1)]
      [['dep_delay', 'month', 'origin', 'dest']].
  groupby(by=['origin', 'dest', 'month'], as_index=False).
  agg(func='mean')
)


Unnamed: 0,origin,dest,month,dep_delay
0,JFK,LAX,1,14.228916


**Sorting: get mean arr_delay and dep_delay and sort by origin, dest and month for carrier AA**

In [61]:
# V1: Compute mean delays and sort by origin, dest, month for carrier "AA"
(
  flights[flights['carrier']=='AA']
  [['origin', 'dest', 'month', 'arr_delay', 'dep_delay']].
  groupby(by=['origin', 'dest', 'month'], as_index=False).
  agg(func='mean').
  sort_values(by=['origin', 'dest', 'month'], ascending=[True, True, True])
)


Unnamed: 0,origin,dest,month,arr_delay,dep_delay
0,EWR,DFW,1,6.427673,10.012579
1,EWR,DFW,2,10.536765,11.345588
2,EWR,DFW,3,12.865031,8.079755
3,EWR,DFW,4,17.792683,12.920732
4,EWR,DFW,5,18.487805,18.682927
...,...,...,...,...,...
195,LGA,PBI,1,-7.758621,0.310345
196,LGA,PBI,2,-7.865385,2.403846
197,LGA,PBI,3,-5.754098,3.032787
198,LGA,PBI,4,-13.966667,-4.733333


In [99]:
# V2: Compute mean delays and sort by origin, dest, month for carrier "AA"
(
    flights[flights['carrier'] == 'AA'].
    sort_values(
        by=['origin', 'dest', 'month'],
        ascending=[True, True, True]).
    groupby(by=['origin', 'dest', 'month'], as_index=False).
    apply(func=lambda group:
        pd.Series({
            'mean_arr_delay': np.mean(group['arr_delay']),
            'mean_dep_delay': np.mean(group['dep_delay'])
        })
        ,
        include_groups=False
    )
).head()


Unnamed: 0,origin,dest,month,mean_arr_delay,mean_dep_delay
0,EWR,DFW,1,6.427673,10.012579
1,EWR,DFW,2,10.536765,11.345588
2,EWR,DFW,3,12.865031,8.079755
3,EWR,DFW,4,17.792683,12.920732
4,EWR,DFW,5,18.487805,18.682927


**Sorting: obtain total number of trips for carrier AA by origin and destination, sorted by origin in ascending order and dst in descending order**

In [63]:
# V1: Number of trips for each origin airport for carrier code "AA"
(
  flights[flights['carrier'] == 'AA'].
  groupby(by=['origin', 'dest'], as_index=False).
  size().
  sort_values(by=['origin', 'dest'], ascending=[True, False])
).head()

Unnamed: 0,origin,dest,size
3,EWR,PHX,121
2,EWR,MIA,848
1,EWR,LAX,62
0,EWR,DFW,1618
19,JFK,STT,229


**Obtain total number of flights by the combination of arriving and departing status (late or on time)**

In [64]:
# V1: obtain number of flights by combination of arriving and departing status
# Note: performs very slow
(
  flights.apply(axis=1, result_type='expand', func=lambda row:
    pd.Series(
        {
        'arr_on_time': row.arr_delay>0,
        'dep_on_time': row.dep_delay>0
        }
    )
  )
).groupby(by=['arr_on_time', 'dep_on_time'], as_index=False).size()


Unnamed: 0,arr_on_time,dep_on_time,size
0,False,False,119304
1,False,True,26593
2,True,False,34583
3,True,True,72836


In [65]:
# V2: obtain number of flights by combination of arriving and departing status
# WARNING: very slow, ~ 1 min
(
  flights.
  assign(arr_on_time=lambda row: row.arr_delay > 0 ).
  assign(dep_on_time=lambda row: row.dep_delay > 0 ).
  groupby(by=['arr_on_time', 'dep_on_time'], as_index=False).
  size()
)


Unnamed: 0,arr_on_time,dep_on_time,size
0,False,False,119304
1,False,True,26593
2,True,False,34583
3,True,True,72836


**Obtain the mean of all columns of the df dataset without typing the mean formula for each**

In [66]:
df.groupby(by='id').agg('mean')

Unnamed: 0_level_0,a,b,c
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,4.5,10.5,16.5
b,2.0,8.0,14.0
c,6.0,12.0,18.0


**Obtain the mean by specifying the columns (arr_delay and dep_delay) for carrier AA by origin, dest and month**

In [68]:
# V1: obtain the mean by specifying the column names (arr_delay and dep_delay)
(
  flights[flights['carrier'] == 'AA'].
  groupby(by=['origin', 'dest', 'month'], as_index=False)
  [['arr_delay', 'dep_delay']].
  agg('mean')
)



Unnamed: 0,origin,dest,month,arr_delay,dep_delay
0,EWR,DFW,1,6.427673,10.012579
1,EWR,DFW,2,10.536765,11.345588
2,EWR,DFW,3,12.865031,8.079755
3,EWR,DFW,4,17.792683,12.920732
4,EWR,DFW,5,18.487805,18.682927
...,...,...,...,...,...
195,LGA,PBI,1,-7.758621,0.310345
196,LGA,PBI,2,-7.865385,2.403846
197,LGA,PBI,3,-5.754098,3.032787
198,LGA,PBI,4,-13.966667,-4.733333


In [69]:
# V2: obtain the mean by specifying the column names (arr_delay and dep_delay)
# throws warnings
(
  flights[flights['carrier'] == 'AA']
  [['origin', 'dest', 'month', 'arr_delay', 'dep_delay' ]].
  groupby(by=['origin', 'dest', 'month'], as_index=False).
  apply(axis='index', func=np.mean)
)


Unnamed: 0,origin,dest,month,arr_delay,dep_delay
0,EWR,DFW,1,6.427673,10.012579
1,EWR,DFW,2,10.536765,11.345588
2,EWR,DFW,3,12.865031,8.079755
3,EWR,DFW,4,17.792683,12.920732
4,EWR,DFW,5,18.487805,18.682927
...,...,...,...,...,...
195,LGA,PBI,1,-7.758621,0.310345
196,LGA,PBI,2,-7.865385,2.403846
197,LGA,PBI,3,-5.754098,3.032787
198,LGA,PBI,4,-13.966667,-4.733333


**Return the first two rows of each month**

In [100]:
# V1: Return the first two rows of each month
(
  flights.
  groupby(by=['month'], as_index=False).
  apply(lambda group: group.head(2), include_groups=False )
).reset_index(drop=True).head()


Unnamed: 0,year,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour,rank
0,2014,1,14,13,AA,JFK,LAX,359,2475,9,1.0
1,2014,1,-3,13,AA,JFK,LAX,363,2475,11,2.0
2,2014,1,-1,1,AA,JFK,LAX,358,2475,8,1.0
3,2014,1,-5,3,AA,JFK,LAX,358,2475,11,2.0
4,2014,1,-11,36,AA,JFK,LAX,375,2475,8,1.0


In [72]:
# V2: Return the first two rows of each month
(
  flights.
  assign(rank=flights.
    groupby(by=['month'], as_index=False)['month'].
         rank(method='first')).
  query('rank <= 2').
  drop(columns=['rank'])
).head(6)

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11
22796,2014,2,1,-1,1,AA,JFK,LAX,358,2475,8
22797,2014,2,1,-5,3,AA,JFK,LAX,358,2475,11
43609,2014,3,1,-11,36,AA,JFK,LAX,375,2475,8
43610,2014,3,1,-3,14,AA,JFK,LAX,368,2475,11


In [101]:
# V3: Return the first two rows of each month
(
    flights.
    groupby(by=['month'], as_index=False).
    apply(lambda group: group.iloc[:2], include_groups=False)
).head()

Unnamed: 0,Unnamed: 1,year,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour,rank
0,0,2014,1,14,13,AA,JFK,LAX,359,2475,9,1.0
0,1,2014,1,-3,13,AA,JFK,LAX,363,2475,11,2.0
1,22796,2014,1,-1,1,AA,JFK,LAX,358,2475,8,1.0
1,22797,2014,1,-5,3,AA,JFK,LAX,358,2475,11,2.0
2,43609,2014,1,-11,36,AA,JFK,LAX,375,2475,8,1.0


In [102]:
# V4: Return the first two rows of each month
flights.assign(index=flights.index)
flights['rank'] = (
    flights.
    groupby(by=['month'], as_index=False)['month'].
    rank(method='first')
)
flights[flights['rank'] <= 2].head()

Unnamed: 0,year,month,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour,rank
0,2014,1,1,14,13,AA,JFK,LAX,359,2475,9,1.0
1,2014,1,1,-3,13,AA,JFK,LAX,363,2475,11,2.0
22796,2014,2,1,-1,1,AA,JFK,LAX,358,2475,8,1.0
22797,2014,2,1,-5,3,AA,JFK,LAX,358,2475,11,2.0
43609,2014,3,1,-11,36,AA,JFK,LAX,375,2475,8,1.0


In [111]:
# V5: Return the first two rows of each month,
# but we use a helper function
def get_first_two_rows(group: pd.DataFrame) -> pd.DataFrame:
  """Returns the first two rows of a group.

  Args:
    group: group onto which we want to extract the first two rows.
  Returns:
    The first two rows of the group.
  """
  assert group.empty is False, 'Group is empty'
  return group.head(2)

(
    flights.
    groupby(by=['month'], as_index=False).
    apply(lambda group: get_first_two_rows(group), include_groups=False)
).head()

Unnamed: 0,Unnamed: 1,year,day,dep_delay,arr_delay,carrier,origin,dest,air_time,distance,hour,rank
0,0,2014,1,14,13,AA,JFK,LAX,359,2475,9,1.0
0,1,2014,1,-3,13,AA,JFK,LAX,363,2475,11,2.0
1,22796,2014,1,-1,1,AA,JFK,LAX,358,2475,8,1.0
1,22797,2014,1,-5,3,AA,JFK,LAX,358,2475,11,2.0
2,43609,2014,1,-11,36,AA,JFK,LAX,375,2475,8,1.0
