<a href="https://colab.research.google.com/github/SoIllEconomist/ds4b/blob/master/python_ds4b/01_exploration/02_data_transformation/02_data_transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Transformation

## Introduction

Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.
### Prerequisites
In this chapter we’re going to focus on how to use the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the NYC Flight data, and use `seaborn` to help us understand the data.

In [0]:
import pandas as pd
import numpy as np

flights = pd.read_csv("flights.csv")

### NYC Flights Dataset

To explore the basic data manipulation with `pandas`. The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations. The data comes from the [US Bureau of Transportation Statistics](https://www.kaggle.com/usdot/flight-delays#flights.csv).

You might notice that this data frame prints a differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `flights` which will open the dataset instead of `flights.head()`. 

In [4]:
flights.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 24 columns):
Unnamed: 0        150000 non-null int64
year              150000 non-null int64
month             150000 non-null int64
day               150000 non-null int64
dep_time          146274 non-null float64
sched_dep_time    150000 non-null int64
dep_delay         146274 non-null float64
arr_time          146138 non-null float64
sched_arr_time    150000 non-null int64
arr_delay         145890 non-null float64
carrier           150000 non-null object
flight            150000 non-null int64
tail_num          148864 non-null object
origin            150000 non-null object
dest              150000 non-null object
air_time          145890 non-null float64
distance          150000 non-null int64
hour              150000 non-null int64
minute            150000 non-null int64
time_hour         150000 non-null object
gain              145890 non-null float64
speed             146138 non-nul

You might have noticed that `.info()` prints a concise summary of a DataFrame.

This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage. 

## Pandas Basics

In this chapter you are going to learn the key pandas methods and funcgtions that allow you to solve the vast majority of your data manipulation challenges:

1. Pick observations by their values.
1. Reorder the rows.
1. Pick variables by their names.
1. Create new variables with functions of existing variables.
1. Collapse many values down to a single summary.

These can all be used in conjunction with `groupby()` which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.

### Query

`query()` allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:

In [5]:
flights.query("month ==1 & day == 1")

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tail_num,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed,hours,gain_per_hour
72,291,2013,1,1,1153.0,1123,30.0,1454.0,1425,29.0,B6,1,N552JB,JFK,FLL,167.0,1069,11,23,1/1/2013 11:00,1.0,44.112792,2.783333,0.359281
255,666,2013,1,1,1832.0,1828,4.0,2144.0,2144,0.0,UA,1075,N18220,EWR,SNA,342.0,2434,18,28,1/1/2013 18:00,4.0,68.115672,5.700000,0.701754
742,185,2013,1,1,917.0,915,2.0,1206.0,1211,-5.0,B6,41,N568JB,JFK,MCO,145.0,944,9,15,1/1/2013 9:00,7.0,46.965174,2.416667,2.896552
845,67,2013,1,1,659.0,700,-1.0,959.0,1008,-9.0,UA,960,N838UA,EWR,RSW,164.0,1068,7,0,1/1/2013 7:00,8.0,66.819604,2.733333,2.926829
1807,108,2013,1,1,803.0,810,-7.0,903.0,925,-22.0,AA,1838,N3GEAA,JFK,BOS,38.0,187,8,10,1/1/2013 8:00,15.0,12.425249,0.633333,23.684211
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149308,687,2013,1,1,1855.0,1859,-4.0,2140.0,2145,-5.0,DL,947,N339NW,LGA,ATL,135.0,762,18,59,1/1/2013 18:00,1.0,21.364486,2.250000,0.444444
149424,299,2013,1,1,1157.0,1158,-1.0,1310.0,1315,-5.0,EV,4511,N16546,EWR,ROC,50.0,246,11,58,1/1/2013 11:00,4.0,11.267176,0.833333,4.800000
149521,31,2013,1,1,623.0,610,13.0,920.0,915,5.0,AA,1837,N3EMAA,LGA,MIA,153.0,1096,6,10,1/1/2013 6:00,8.0,71.478261,2.550000,3.137255
149635,198,2013,1,1,931.0,930,1.0,1237.0,1238,-1.0,B6,375,N508JB,LGA,FLL,161.0,1076,9,30,1/1/2013 9:00,2.0,52.190784,2.683333,0.745342


When you run that line of code, pandas executes the querying operation and returns a new data frame. pandas functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `=`:

In [0]:
jan1 = flights.query("month ==1 & day == 1")

### Comparisons

To use querying effectively, you have to know how to select the observations that you want using the comparison operators. Python provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

When you’re starting out with Python, the easiest mistake to make is to use = instead of == when testing for equality. When this happens you’ll get an informative error:

In [7]:
flights.query("month =1")

ValueError: ignored

There’s another common problem you might encounter when using ==: floating point numbers. These results might surprise you!

In [0]:
from math import sqrt

In [9]:
sqrt(2) ** 2 == 2

False

In [10]:
1/49 * 49 == 1

False

Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation.

## Logical Operators

Multiple arguments to `query()` are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: `&` is “and”, `|` is “or”, and `~` is “not”. The figure below shows the complete set of Boolean operations.

![Complete set of boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded region show which parts each operator selects.](https://github.com/SoIllEconomist/ds4b/blob/master/python_ds4b/01_exploration/02_data_transformation/transform_logical.png?raw=1)



In [11]:
flights.query("month in [11, 12]")

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tail_num,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed,hours,gain_per_hour
1,69716,2013,11,15,1854.0,1905,-11.0,2146.0,2205,-19.0,AA,1691,N501AA,EWR,DFW,204.0,1372,19,5,15-11-2013 19:00,8.0,38.359739,3.400000,2.352941
5,83149,2013,11,30,2235.0,1950,165.0,126.0,2302,144.0,UA,250,N413UA,EWR,SEA,331.0,2402,19,50,30-11-2013 19:00,21.0,1143.809524,5.516667,3.806647
9,59523,2013,11,5,645.0,645,0.0,852.0,907,-15.0,B6,675,N283JB,JFK,MSY,167.0,1182,6,45,5/11/2013 6:00,15.0,83.239437,2.783333,5.389222
19,91707,2013,12,10,825.0,829,-4.0,1053.0,1028,25.0,B6,219,N274JB,JFK,CLT,110.0,541,8,29,10/12/2013 8:00,-29.0,30.826211,1.833333,-15.818182
23,61079,2013,11,6,1638.0,1645,-7.0,1758.0,1820,-22.0,MQ,3216,N673MQ,JFK,ORF,56.0,290,16,45,6/11/2013 16:00,15.0,9.897611,0.933333,16.071429
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149989,93287,2013,12,11,1904.0,1910,-6.0,2210.0,2220,-10.0,AA,1193,N3EMAA,LGA,DFW,218.0,1389,19,10,11/12/2013 19:00,4.0,37.710407,3.633333,1.100917
149993,104629,2013,12,24,827.0,756,31.0,1043.0,959,44.0,US,1733,N554UW,LGA,CLT,113.0,544,7,56,24-12-2013 07:00,-13.0,31.294343,1.883333,-6.902655
149994,70680,2013,11,17,742.0,745,-3.0,1007.0,1012,-5.0,DL,807,N779NC,EWR,ATL,119.0,746,7,45,17-11-2013 07:00,2.0,44.448858,1.983333,1.008403
149995,91738,2013,12,10,857.0,900,-3.0,1257.0,1220,37.0,AA,2335,N3AXAA,LGA,MIA,167.0,1096,9,0,10/12/2013 9:00,-40.0,52.315036,2.783333,-14.371257


Sometimes you can simplify complicated subsetting by remembering De Morgan’s law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`. For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

In [12]:
flights.query("~(arr_delay > 120 | dep_delay > 120)")

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tail_num,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed,hours,gain_per_hour
0,155090,2013,3,21,824.0,825,-1.0,1118.0,1133,-15.0,B6,181,N705JB,JFK,SAN,334.0,2446,8,25,21-03-2013 08:00,14.0,131.270125,5.566667,2.514970
1,69716,2013,11,15,1854.0,1905,-11.0,2146.0,2205,-19.0,AA,1691,N501AA,EWR,DFW,204.0,1372,19,5,15-11-2013 19:00,8.0,38.359739,3.400000,2.352941
2,159787,2013,3,26,949.0,1000,-11.0,1238.0,1251,-13.0,UA,684,N478UA,EWR,MCO,143.0,937,10,0,26-03-2013 10:00,2.0,45.411955,2.383333,0.839161
3,122785,2013,2,14,854.0,900,-6.0,1104.0,1116,-12.0,DL,181,N350NA,LGA,DTW,93.0,502,9,0,14-02-2013 09:00,6.0,27.282609,1.550000,3.870968
4,167006,2013,4,2,2151.0,2145,6.0,53.0,48,5.0,B6,11,N633JB,JFK,FLL,153.0,1069,21,45,2/4/2013 21:00,1.0,1210.188679,2.550000,0.392157
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149994,70680,2013,11,17,742.0,745,-3.0,1007.0,1012,-5.0,DL,807,N779NC,EWR,ATL,119.0,746,7,45,17-11-2013 07:00,2.0,44.448858,1.983333,1.008403
149995,91738,2013,12,10,857.0,900,-3.0,1257.0,1220,37.0,AA,2335,N3AXAA,LGA,MIA,167.0,1096,9,0,10/12/2013 9:00,-40.0,52.315036,2.783333,-14.371257
149996,141274,2013,3,6,1445.0,1450,-5.0,1623.0,1640,-17.0,MQ,4403,N834MQ,JFK,RDU,71.0,427,14,50,6/3/2013 14:00,12.0,15.785582,1.183333,10.140845
149997,141586,2013,3,6,2143.0,2145,-2.0,240.0,232,8.0,B6,701,N292JB,JFK,SJU,203.0,1598,21,45,6/3/2013 21:00,-10.0,399.500000,3.383333,-2.955665


In [13]:
flights.query("arr_delay <= 120 | dep_delay <= 120") # Double Check

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tail_num,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed,hours,gain_per_hour
0,155090,2013,3,21,824.0,825,-1.0,1118.0,1133,-15.0,B6,181,N705JB,JFK,SAN,334.0,2446,8,25,21-03-2013 08:00,14.0,131.270125,5.566667,2.514970
1,69716,2013,11,15,1854.0,1905,-11.0,2146.0,2205,-19.0,AA,1691,N501AA,EWR,DFW,204.0,1372,19,5,15-11-2013 19:00,8.0,38.359739,3.400000,2.352941
2,159787,2013,3,26,949.0,1000,-11.0,1238.0,1251,-13.0,UA,684,N478UA,EWR,MCO,143.0,937,10,0,26-03-2013 10:00,2.0,45.411955,2.383333,0.839161
3,122785,2013,2,14,854.0,900,-6.0,1104.0,1116,-12.0,DL,181,N350NA,LGA,DTW,93.0,502,9,0,14-02-2013 09:00,6.0,27.282609,1.550000,3.870968
4,167006,2013,4,2,2151.0,2145,6.0,53.0,48,5.0,B6,11,N633JB,JFK,FLL,153.0,1069,21,45,2/4/2013 21:00,1.0,1210.188679,2.550000,0.392157
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149994,70680,2013,11,17,742.0,745,-3.0,1007.0,1012,-5.0,DL,807,N779NC,EWR,ATL,119.0,746,7,45,17-11-2013 07:00,2.0,44.448858,1.983333,1.008403
149995,91738,2013,12,10,857.0,900,-3.0,1257.0,1220,37.0,AA,2335,N3AXAA,LGA,MIA,167.0,1096,9,0,10/12/2013 9:00,-40.0,52.315036,2.783333,-14.371257
149996,141274,2013,3,6,1445.0,1450,-5.0,1623.0,1640,-17.0,MQ,4403,N834MQ,JFK,RDU,71.0,427,14,50,6/3/2013 14:00,12.0,15.785582,1.183333,10.140845
149997,141586,2013,3,6,2143.0,2145,-2.0,240.0,232,8.0,B6,701,N292JB,JFK,SJU,203.0,1598,21,45,6/3/2013 21:00,-10.0,399.500000,3.383333,-2.955665


Whenever you start using complicated, multipart expressions in query(), consider making them explicit variables instead. That makes it much easier to check your work. You’ll learn how to create new variables shortly.

### Missing Values

One important feature of R that can make comparison tricky are missing values, or `NA`s (“not availables”). `NA` represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown or False.

In [14]:
np.nan

nan

In [15]:
np.nan > 5

False

In [16]:
np.nan < 5

False

In [17]:
np.nan + 10

nan

In [18]:
np.nan / 2

nan

The most confusing result is this one:

In [19]:
np.nan == np.nan

False

It’s easiest to understand why this is true with a bit more context:

In [20]:
# Let x be Mary's age. We don't know how old she is.
x = np.nan

# Let y be John's age. We don't know how old he is.
y = np.nan

# Are John and Mary the same age?
x == y
# We don't know!

False

If you want to determine if a value is missing, use `pd.isna()`

In [21]:
pd.isna(x)

True

`query()` only includes rows where the condition is TRUE; it excludes both FALSE.

### Exercises

1. Find all the flights that
  1. Had an arrival delay of two or more hours
  1. Flew to Houston (IAH or HOU)
  1. Were operated by United, American, or Delta
  1. Departed in summer (July, August, and September)
  1. Arrived more than two hours late, but didn’t leave late
  1. Were delayed by at least an hour, but made up over 30 minutes in flight
  1. Departed between midnight and 6am (inclusive)6am

1. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

## Arrange rows with `sort_values()`
`sort_values()` works similarly to `query()` except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:


In [22]:
flights.sort_values(by=["year", "month", "day"])

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tail_num,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed,hours,gain_per_hour
72,291,2013,1,1,1153.0,1123,30.0,1454.0,1425,29.0,B6,1,N552JB,JFK,FLL,167.0,1069,11,23,1/1/2013 11:00,1.0,44.112792,2.783333,0.359281
255,666,2013,1,1,1832.0,1828,4.0,2144.0,2144,0.0,UA,1075,N18220,EWR,SNA,342.0,2434,18,28,1/1/2013 18:00,4.0,68.115672,5.700000,0.701754
742,185,2013,1,1,917.0,915,2.0,1206.0,1211,-5.0,B6,41,N568JB,JFK,MCO,145.0,944,9,15,1/1/2013 9:00,7.0,46.965174,2.416667,2.896552
845,67,2013,1,1,659.0,700,-1.0,959.0,1008,-9.0,UA,960,N838UA,EWR,RSW,164.0,1068,7,0,1/1/2013 7:00,8.0,66.819604,2.733333,2.926829
1807,108,2013,1,1,803.0,810,-7.0,903.0,925,-22.0,AA,1838,N3GEAA,JFK,BOS,38.0,187,8,10,1/1/2013 8:00,15.0,12.425249,0.633333,23.684211
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149126,110889,2013,12,31,1258.0,1259,-1.0,1450.0,1428,22.0,EV,4104,N21537,EWR,BNA,144.0,748,12,59,31-12-2013 12:00,-23.0,30.951724,2.400000,-9.583333
149143,111237,2013,12,31,2022.0,1940,42.0,2323.0,2243,40.0,B6,1271,N661JB,LGA,FLL,166.0,1076,19,40,31-12-2013 19:00,2.0,27.791649,2.766667,0.722892
149203,111187,2013,12,31,1854.0,1859,-5.0,2135.0,2150,-15.0,B6,527,N564JB,EWR,MCO,146.0,937,18,59,31-12-2013 18:00,10.0,26.332553,2.433333,4.109589
149395,111073,2013,12,31,1639.0,1630,9.0,1921.0,1925,-4.0,UA,1232,N14121,EWR,IAH,201.0,1400,16,30,31-12-2013 16:00,13.0,43.727225,3.350000,3.880597


Use `ascending=False` to re-order by a column in descending order:

In [23]:
flights.sort_values(by=["year", "month", "day"], ascending=False).head()

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tail_num,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed,hours,gain_per_hour
123,111192,2013,12,31,1857.0,1900,-3.0,2246.0,2244,2.0,DL,435,N706TW,JFK,SFO,368.0,2586,19,0,31-12-2013 19:00,-5.0,69.082814,6.133333,-0.815217
348,111108,2013,12,31,1714.0,1629,45.0,2003.0,1932,31.0,B6,1161,N529JB,LGA,PBI,154.0,1035,16,29,31-12-2013 16:00,14.0,31.003495,2.566667,5.454545
356,110963,2013,12,31,1437.0,1437,0.0,1812.0,1743,29.0,B6,581,N334JB,JFK,HOU,249.0,1428,14,37,31-12-2013 14:00,-29.0,47.284768,4.15,-6.987952
392,111040,2013,12,31,1556.0,1501,55.0,1647.0,1610,37.0,UA,1146,N33294,EWR,BOS,34.0,200,15,1,31-12-2013 15:00,18.0,7.285974,0.566667,31.764706
1099,110753,2013,12,31,948.0,950,-2.0,1123.0,1123,0.0,FL,160,N947AT,LGA,CAK,72.0,397,9,50,31-12-2013 09:00,-2.0,21.211042,1.2,-1.666667


Missing values are sorted at the end:

In [24]:
flights.sort_values(by=["year", "month", "dep_time"]).tail()

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tail_num,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed,hours,gain_per_hour
149437,96010,2013,12,14,,1620,,,1829,,EV,4352,N12569,EWR,CVG,,569,16,20,14-12-2013 16:00,,,,
149609,87965,2013,12,5,,1059,,,1313,,EV,4880,N722EV,LGA,MEM,,963,10,59,5/12/2013 10:00,,,,
149650,87904,2013,12,5,,2038,,,2259,,9E,3681,,LGA,GSP,,610,20,38,5/12/2013 20:00,,,,
149791,98776,2013,12,17,,1859,,,2150,,B6,327,N592JB,EWR,MCO,,937,18,59,17-12-2013 18:00,,,,
149961,88966,2013,12,6,,600,,,915,,AA,1103,N3FKAA,LGA,DFW,,1389,6,0,6/12/2013 6:00,,,,


Puttingi NAs first:

In [25]:
flights.sort_values(by=["year", "month", "dep_time"], na_position='first')

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tail_num,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed,hours,gain_per_hour
367,24250,2013,1,28,,2159,,,2300,,EV,4519,N14198,EWR,BWI,,169,21,59,28-01-2013 21:00,,,,
442,13964,2013,1,16,,1628,,,1735,,EV,4588,N17159,EWR,MHT,,209,16,28,16-01-2013 16:00,,,,
601,13980,2013,1,16,,1815,,,2037,,9E,3424,,JFK,DTW,,509,18,15,16-01-2013 18:00,,,,
630,22535,2013,1,26,,605,,,745,,WN,1681,N7704B,EWR,MDW,,711,6,5,26-01-2013 06:00,,,,
1940,24226,2013,1,28,,1555,,,1810,,EV,3820,N16951,EWR,SDF,,642,15,55,28-01-2013 15:00,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145417,109531,2013,12,29,2359.0,2359,0.0,502.0,437,25.0,B6,839,N665JB,JFK,BQN,200.0,1576,23,59,29-12-2013 23:00,-25.0,188.366534,3.333333,-7.500000
30948,95379,2013,12,13,2400.0,2359,1.0,432.0,440,-8.0,B6,1503,N587JB,JFK,SJU,192.0,1598,23,59,13-12-2013 23:00,9.0,221.944444,3.200000,2.812500
82972,100795,2013,12,19,2400.0,2359,1.0,434.0,440,-6.0,B6,1503,N561JB,JFK,SJU,193.0,1598,23,59,19-12-2013 23:00,7.0,220.921659,3.216667,2.176166
120577,91492,2013,12,9,2400.0,2359,1.0,432.0,440,-8.0,B6,1503,N705JB,JFK,SJU,195.0,1598,23,59,9/12/2013 23:00,9.0,221.944444,3.250000,2.769231


### Exercises
1. How could you use `sort_values()` to sort all missing values to the start?.

1. Sort flights to find the most delayed flights. Find the flights that left earliest.

1. Sort flights to find the fastest flights.

1. Which flights travelled the longest? Which travelled the shortest?



## Select columns with `[]`, `loc`, and `filter()`

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. Bracket notation `[]` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

Bracket notation is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea:

In [26]:
flights[['year','month','day']]

Unnamed: 0,year,month,day
0,2013,3,21
1,2013,11,15
2,2013,3,26
3,2013,2,14
4,2013,4,2
...,...,...,...
149995,2013,12,10
149996,2013,3,6
149997,2013,3,6
149998,2013,12,17


In [27]:
flights.filter(["year", "month", "day"])

Unnamed: 0,year,month,day
0,2013,3,21
1,2013,11,15
2,2013,3,26
3,2013,2,14
4,2013,4,2
...,...,...,...
149995,2013,12,10
149996,2013,3,6
149997,2013,3,6
149998,2013,12,17


In [28]:
flights.loc[:,'year':'day']

Unnamed: 0,year,month,day
0,2013,3,21
1,2013,11,15
2,2013,3,26
3,2013,2,14
4,2013,4,2
...,...,...,...
149995,2013,12,10
149996,2013,3,6
149997,2013,3,6
149998,2013,12,17


There are a number of helper functions you can use:

```python
df.filter(regex="^abc") # starts with abc
df.filter(regex="$xyz") # ends with xyz
df.filter(regex="example") # contains example
```

## rename

In [29]:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


Rename columns using a mapping:

In [30]:
df.rename(columns={"A": "a", "B": "c"})

Unnamed: 0,a,c
0,1,4
1,2,5
2,3,6


Rename index using a mapping:

In [31]:
df.rename(index={0: "x", 1: "y", 2: "z"})

Unnamed: 0,A,B
x,1,4
y,2,5
z,3,6


Using axis-style parameters

In [32]:
df.rename(str.lower, axis='columns')

Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6


In [33]:
df.rename({1: 2, 2: 4}, axis='index')

Unnamed: 0,A,B
0,1,4
2,2,5
4,3,6


If you want any changes to stick, you need to use the `inplace=True` arguments.

In [0]:
df.rename(str.lower, axis="columns", inplace=True)

In [35]:
df # Now we see that everything is changed.

Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6


In [0]:
flights.rename({"tailnum":"tail_num"}, axis="columns", inplace=True)

In [37]:
flights.columns

Index(['Unnamed: 0', 'year', 'month', 'day', 'dep_time', 'sched_dep_time',
       'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier',
       'flight', 'tail_num', 'origin', 'dest', 'air_time', 'distance', 'hour',
       'minute', 'time_hour', 'gain', 'speed', 'hours', 'gain_per_hour'],
      dtype='object')

### Exercises
1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

1. What happens if you include the name of a variable multiple times in a selection statement call?

1. What does the filter and loc methods do? Why might it be helpful in conjunction with this list?
```python
vars = ["year", "month", "day", "dep_delay", "arr_delay"]
```

## Add new variables

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns.

In [0]:
flights["gain"] = flights.dep_delay - flights.arr_delay
flights["speed"] = flights.distance / flights.arr_time * 60
flights['hours'] = flights.air_time / 60
flights["gain_per_hour"] = flights.gain / flights.hours

### Exercises
1. Currently dep_time and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

1. Compare air_time with `arr_time` - `dep_time`. What do you expect to see? What do you see? What do you need to do to fix it?

1. Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?

1. Find the 10 most delayed flights using a ranking function. How do you want to handle ties?

## Grouped summaries: split-apply-combine

In this section, you'll learn how to use the pandas groupby operation, which draws from the well-known split-apply-combine strategy.

Intuitively, you want to split the dataset into groups, one for each year, and then to compute a summary statistic, such as the mean or the median, and then to see whether this statistic increases over the years (after this, you may want to perform a statistical test).

The framework is known as **split-apply-combine** because we:

1. **split** the data into groups by creating a groupby object from the original DataFrame;
1. **apply** a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step);
1. **combine** the results into a new Dataframe

This is the conceptual framework for the analysis at hand. We'll also necessarily delve into groupby objects, which are not the most intuitive objects. The process of **split-apply-combine** with groupby objects is a pattern that we all perform intuitively, as we'll see, but it took Hadley Wickham to formalize the procedure in 2011 with his paper The Split-Apply-Combine Strategy for Data Analysis.

Here you'll use pandas, groupby objects and the principles of split-apply-combine to check out the flights dataset.

### Summarising your data with plots and statistics

The pandas DataFrame .info() method is invaluable. Applying it below shows that you have 150000 rows and 23 columns of data, but also that the column of interest, `dep_delay`, has only 146274 non-null values. This means that there are 3726 missing values:

In [40]:
flights.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 24 columns):
Unnamed: 0        150000 non-null int64
year              150000 non-null int64
month             150000 non-null int64
day               150000 non-null int64
dep_time          146274 non-null float64
sched_dep_time    150000 non-null int64
dep_delay         146274 non-null float64
arr_time          146138 non-null float64
sched_arr_time    150000 non-null int64
arr_delay         145890 non-null float64
carrier           150000 non-null object
flight            150000 non-null int64
tail_num          148864 non-null object
origin            150000 non-null object
dest              150000 non-null object
air_time          145890 non-null float64
distance          150000 non-null int64
hour              150000 non-null int64
minute            150000 non-null int64
time_hour         150000 non-null object
gain              145890 non-null float64
speed             146138 non-nul

If you'd like to check out several summary statistics of the DataFrame, you can also do this using the `.describe()` method:

In [41]:
flights.describe()

Unnamed: 0.1,Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,flight,air_time,distance,hour,minute,gain,speed,hours,gain_per_hour
count,150000.0,150000.0,150000.0,150000.0,146274.0,150000.0,146274.0,146138.0,150000.0,145890.0,150000.0,145890.0,150000.0,150000.0,150000.0,145890.0,146138.0,145890.0,145890.0
mean,88738.232053,2013.0,6.41614,15.029547,1348.838399,1343.373047,10.427622,1515.100651,1543.1258,5.569381,1970.752127,153.790637,1031.369967,13.176367,25.73638,4.807828,195.804721,2.563177,2.821006
std,51217.980586,0.0,4.44729,8.795769,482.96049,466.710235,35.939307,523.525356,492.911003,39.840628,1631.025932,94.536153,722.851652,4.655999,19.262609,16.536792,2449.897456,1.575603,10.060148
min,0.0,2013.0,1.0,1.0,1.0,500.0,-43.0,1.0,1.0,-70.0,1.0,20.0,80.0,5.0,0.0,-196.0,2.0,0.333333,-169.230769
25%,44364.75,2013.0,2.0,7.0,908.0,905.0,-5.0,1114.0,1129.0,-16.0,537.0,85.0,502.0,9.0,6.0,-4.0,20.009966,1.416667,-1.682243
50%,88874.5,2013.0,4.0,14.0,1404.0,1359.0,-2.0,1547.0,1604.0,-4.0,1491.0,134.0,888.0,13.0,29.0,6.0,34.854309,2.233333,2.666667
75%,133126.25,2013.0,11.0,23.0,1741.0,1729.0,9.0,1944.0,1950.0,13.0,3478.0,195.0,1389.0,17.0,43.0,15.0,62.32816,3.25,7.328244
max,177350.0,2013.0,12.0,31.0,2400.0,2359.0,1301.0,2400.0,2359.0,1272.0,8500.0,695.0,4983.0,23.0,59.0,87.0,155160.0,11.583333,85.384615


### Groupbys and split-apply-combine to answer the question

#### Step 1. Split

Now that you've checked out out data, it's time for the fun part. You'll first use a `.groupby()` method to split the data into groups, where each group is the date. This is the split in split-apply-combine:

In [0]:
flights_by_month = flights[["year", "month", "gain_per_hour", "dep_delay"]].groupby(by=["year", "month"])

This creates a *groupby* object:

In [44]:
type(flights_by_day)

pandas.core.groupby.generic.DataFrameGroupBy

#### Step 2. Apply
Such groupby objects are very useful. Remember that the `.describe()` method for a DataFrame returns summary statistics for numeric columns? Well, the `.describe()` method for DataFrameGroupBy objects returns summary statistics for each numeric column, but computed for each group in the split. In your case, it's for each release_year. This is an example of the apply in split-apply-combine: you're applying the `.describe()` method to each group in the groupby. Do this and print the first 5 rows of the result:

In [67]:
flights_by_month.describe().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,gain_per_hour,gain_per_hour,gain_per_hour,gain_per_hour,gain_per_hour,gain_per_hour,gain_per_hour,gain_per_hour,dep_delay,dep_delay,dep_delay,dep_delay,dep_delay,dep_delay,dep_delay,dep_delay
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
year,month,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2
2013,1,22354.0,2.407132,10.188676,-161.428571,-2.117647,2.201835,6.774194,77.419355,22419.0,9.936438,36.162099,-30.0,-5.0,-2.0,8.0,1301.0
2013,2,19987.0,2.768578,9.998214,-120.0,-1.904762,2.654867,7.18832,69.677419,20047.0,10.660548,35.758378,-33.0,-5.0,-2.0,9.0,853.0
2013,3,23626.0,3.991077,10.779158,-169.230769,-0.722349,3.797468,8.863112,85.384615,23688.0,13.276638,40.151364,-25.0,-5.0,-1.0,12.0,911.0
2013,4,10186.0,2.282663,10.473078,-132.272727,-2.0,2.337662,6.666667,80.0,10223.0,11.966057,39.230396,-21.0,-5.0,-2.0,10.0,639.0
2013,10,24156.0,3.776325,9.650829,-103.2,-1.016949,3.290177,8.323954,78.461538,24183.0,6.230699,29.386134,-25.0,-6.0,-3.0,4.0,390.0


If you want to see what the grouping looks like, you can pass the groupby object to the function list():

In [63]:
list(flights_by_month)[5] # Cast grouping as a list and check out one month

((2013, 11),
         Unnamed: 0  year  month  ...        speed     hours  gain_per_hour
 1            69716  2013     11  ...    38.359739  3.400000       2.352941
 5            83149  2013     11  ...  1143.809524  5.516667       3.806647
 9            59523  2013     11  ...    83.239437  2.783333       5.389222
 23           61079  2013     11  ...     9.897611  0.933333      16.071429
 25           61471  2013     11  ...    61.093248  2.333333       5.142857
 ...            ...   ...    ...  ...          ...       ...            ...
 149980       76392  2013     11  ...  1435.714286  2.400000       3.333333
 149981       72730  2013     11  ...    53.955409  2.500000       5.200000
 149982       58576  2013     11  ...    61.192982  2.383333       0.419580
 149983       69658  2013     11  ...    29.194729  2.383333       2.097902
 149994       70680  2013     11  ...    44.448858  1.983333       1.008403
 
 [22974 rows x 24 columns])

### Step 3. Combine
Let's say that you wanted the mean or median *dep_delay* and *gain_per_hour* for each month. Then you can apply the `.mean()` or `.median()` method, respectively, to the groupby object and 'combine' these into a new DataFrame.

In [0]:
flights_med_by_month = flights_by_month.median()

In [74]:
flights_med_by_month.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,gain_per_hour,dep_delay
year,month,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,2.201835,-2.0
2013,2,2.654867,-2.0
2013,3,3.797468,-1.0
2013,4,2.337662,-2.0
2013,10,3.290177,-3.0


In [75]:
flights_med_by_month.unstack()

Unnamed: 0_level_0,gain_per_hour,gain_per_hour,gain_per_hour,gain_per_hour,gain_per_hour,gain_per_hour,gain_per_hour,dep_delay,dep_delay,dep_delay,dep_delay,dep_delay,dep_delay,dep_delay
month,1,2,3,4,10,11,12,1,2,3,4,10,11,12
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2
2013,2.201835,2.654867,3.797468,2.337662,3.290177,2.637363,1.556196,-2.0,-2.0,-1.0,-2.0,-3.0,-3.0,0.0


### Groupbys and split-apply-comnine in Daily Use

Groupby objects are not intuitive. They do, however, correspond to a natural the act of splitting a dataset with respect to one its columns (or more than one, but let's save that for another post about grouping by multiple columns and hierarchical indexes).

The split-apply-combine principle is not only elegant and practical, it's something that Data Scientists use daily, as in the above example.

## Exercises
1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

  1. A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.

  1. A flight is always 10 minutes late.

  1. A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.

  1. 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

Which is more important: arrival delay or departure delay?

1. Our definition of cancelled flights `(flights.dep_delay.isna() | flights.arr_delay.isna())` is slightly suboptimal. Why? Which is the most important column?

1. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

1. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights.groupby(['carrier','dest']).count()`

1. What does the sort argument to count() do. When might you use it?

1. Which plane (tailnum) has the worst on-time record?

1. What time of day should you fly if you want to avoid delays as much as possible?

1. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

1. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

1. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

1. For each plane, count the number of flights before the first delay of greater than 1 hour.