# Data Manipulation in Pandas

Pandas is a package built on top of NumPy that provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

The fundamental Pandas data structures:

* **Series**: a "one-dimensional array" with flexible indices
* **DataFrame**: a "two-dimensional array" with both flexible row indices and flexible column names
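
As a minimal sketch of these two structures, the example below builds a small Series and DataFrame by hand. The labels and values are made up for illustration and are not taken from the flights data.

```python
import pandas as pd

# Toy data, made up for illustration.
delays = pd.Series([2.0, 4.0, -1.0], index=["UA1545", "UA1714", "B6725"])

toy = pd.DataFrame(
    {"carrier": ["UA", "UA", "B6"], "dep_delay": [2.0, 4.0, -1.0]},
    index=["UA1545", "UA1714", "B6725"],
)

print(delays["UA1714"])             # look up a Series value by its label
print(toy.loc["B6725", "carrier"])  # look up a DataFrame cell by row and column label
```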

# Introduction

When you get a dataset to analyze, it is rarely clean or in exactly the form you need. Often you’ll need to perform some data preprocessing or wrangling, e.g., creating new variables or summaries, filtering out rows based on certain search criteria, renaming variables, or reordering the observations by some column.

In this notebook, you will learn how to perform a variety of data preprocessing tasks. Here, we will use a dataset on flights departing New York City in 2013. 

In [1]:
import pandas as pd
import numpy as np 

In [2]:
# Install the package 'nycflights13' before you can run this
from nycflights13 import flights
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z


In [10]:
flights.shape

(336776, 19)

In [11]:
list(flights.columns) 

['year',
 'month',
 'day',
 'dep_time',
 'sched_dep_time',
 'dep_delay',
 'arr_time',
 'sched_arr_time',
 'arr_delay',
 'carrier',
 'flight',
 'tailnum',
 'origin',
 'dest',
 'air_time',
 'distance',
 'hour',
 'minute',
 'time_hour']

## Data frame with columns

- `year`, `month`, `day`: Date of departure.
- `dep_time`, `arr_time`: Actual departure and arrival times (format HHMM or HMM), local time zone.
- `sched_dep_time`, `sched_arr_time`: Scheduled departure and arrival times (format HHMM or HMM), local time zone.
- `dep_delay`, `arr_delay`: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- `hour`, `minute`: Time of scheduled departure broken into hour and minutes.
- `carrier`: Two-letter carrier abbreviation. See the `airlines` table to get the name.
- `tailnum`: Plane tail number.
- `flight`: Flight number.
- `origin`, `dest`: Origin and destination. See the `airports` table for additional metadata.
- `air_time`: Amount of time spent in the air, in minutes.
- `distance`: Distance between airports, in miles.
- `time_hour`: Scheduled date and hour of the flight as a date. Along with `origin`, can be used to join flights data to weather data.
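
Because the time columns store clock times as HHMM (or HMM) integers, ordinary arithmetic on them is misleading: 600 minus 559 is 41, not one minute. A rough sketch of converting such values to minutes since midnight is shown below. The sample values are toy data and the variable names are illustrative.

```python
import pandas as pd

# Toy values in HHMM/HMM format.
sched_dep_time = pd.Series([515, 529, 600, 2359])

# Integer-divide by 100 to get the hour; the remainder is the minute.
hour = sched_dep_time // 100
minute = sched_dep_time % 100
minutes_since_midnight = hour * 60 + minute

print(minutes_since_midnight.tolist())
```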

In [12]:
flights.dtypes

year                int64
month               int64
day                 int64
dep_time          float64
sched_dep_time      int64
dep_delay         float64
arr_time          float64
sched_arr_time      int64
arr_delay         float64
carrier            object
flight              int64
tailnum            object
origin             object
dest               object
air_time          float64
distance            int64
hour                int64
minute              int64
time_hour          object
dtype: object
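
Note that `time_hour` is stored as `object`, i.e. as plain strings. One possible way to turn such strings into real timestamps is `pd.to_datetime`. The sketch below uses two strings copied from the output above, and assumes the trailing `Z` denotes UTC.

```python
import pandas as pd

# Two timestamps copied from the `time_hour` column shown earlier.
time_hour = pd.Series(["2013-01-01T10:00:00Z", "2013-01-01T11:00:00Z"])

# utc=True yields timezone-aware datetime values in UTC.
parsed = pd.to_datetime(time_hour, utc=True)

print(parsed.dtype)
print(parsed.dt.hour.tolist())
```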

In [13]:
flights.describe(include='all')

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
count,336776.0,336776.0,336776.0,328521.0,336776.0,328521.0,328063.0,336776.0,327346.0,336776,336776.0,334264,336776,336776,327346.0,336776.0,336776.0,336776.0,336776
unique,,,,,,,,,,16,,4043,3,105,,,,,6936
top,,,,,,,,,,UA,,N725MQ,EWR,ORD,,,,,2013-09-20T12:00:00Z
freq,,,,,,,,,,58665,,575,120835,17283,,,,,94
mean,2013.0,6.54851,15.710787,1349.109947,1344.25484,12.63907,1502.054999,1536.38022,6.895377,,1971.92362,,,,150.68646,1039.912604,13.180247,26.2301,
std,0.0,3.414457,8.768607,488.281791,467.335756,40.210061,533.264132,497.457142,44.633292,,1632.471938,,,,93.688305,733.233033,4.661316,19.300846,
min,2013.0,1.0,1.0,1.0,106.0,-43.0,1.0,1.0,-86.0,,1.0,,,,20.0,17.0,1.0,0.0,
25%,2013.0,4.0,8.0,907.0,906.0,-5.0,1104.0,1124.0,-17.0,,553.0,,,,82.0,502.0,9.0,8.0,
50%,2013.0,7.0,16.0,1401.0,1359.0,-2.0,1535.0,1556.0,-5.0,,1496.0,,,,129.0,872.0,13.0,29.0,
75%,2013.0,10.0,23.0,1744.0,1729.0,11.0,1940.0,1945.0,14.0,,3465.0,,,,192.0,1389.0,17.0,44.0,


## Basic Operations of Data Manipulations

You will learn the five key operations that allow you to solve the vast majority of your data manipulation challenges:

* Pick observations by their values.
* Reorder the rows.
* Pick variables by their names.
* Create new variables with functions of existing variables.
* Collapse many values down to a single summary.

These can all be used in conjunction with ``groupby()``, which changes the scope of each operation from the entire dataset to one group at a time. Together with grouping, these six operations provide the verbs for a language of data manipulation.
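
To illustrate how grouping changes the scope of an operation, the sketch below computes the same summary once for a whole table and once per group. In pandas the grouping method is spelled `groupby()`. The data is a made-up toy table.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "carrier": ["UA", "UA", "AA", "AA"],
    "dep_delay": [2.0, 4.0, 10.0, 20.0],
})

overall_mean = toy["dep_delay"].mean()                    # one summary for the whole table
per_carrier = toy.groupby("carrier")["dep_delay"].mean()  # one summary per carrier

print(overall_mean)
print(per_carrier)
```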

## Select Rows

In [14]:
# Filter rows 
# Select all flights in January: 
flights.loc[flights['month']==1]
# flights.loc[flights.month==1] 

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26999,2013,1,31,,1325,,,1505,,MQ,4475,N730MQ,LGA,RDU,,431,13,25,2013-01-31T18:00:00Z
27000,2013,1,31,,1200,,,1430,,MQ,4658,N505MQ,LGA,ATL,,762,12,0,2013-01-31T17:00:00Z
27001,2013,1,31,,1410,,,1555,,MQ,4491,N734MQ,LGA,CLE,,419,14,10,2013-01-31T19:00:00Z
27002,2013,1,31,,1446,,,1757,,UA,337,,LGA,IAH,,1416,14,46,2013-01-31T19:00:00Z


In [15]:
flights[flights['month']==1]
#flights[flights.month==1]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26999,2013,1,31,,1325,,,1505,,MQ,4475,N730MQ,LGA,RDU,,431,13,25,2013-01-31T18:00:00Z
27000,2013,1,31,,1200,,,1430,,MQ,4658,N505MQ,LGA,ATL,,762,12,0,2013-01-31T17:00:00Z
27001,2013,1,31,,1410,,,1555,,MQ,4491,N734MQ,LGA,CLE,,419,14,10,2013-01-31T19:00:00Z
27002,2013,1,31,,1446,,,1757,,UA,337,,LGA,IAH,,1416,14,46,2013-01-31T19:00:00Z


In [16]:
# Select all flights on January 1st: 
flights[(flights.month==1) & (flights.day==1)]
#flights[(flights['month']==1) & (flights['day']==1)]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
837,2013,1,1,2356.0,2359,-3.0,425.0,437,-12.0,B6,727,N588JB,JFK,BQN,186.0,1576,23,59,2013-01-02T04:00:00Z
838,2013,1,1,,1630,,,1815,,EV,4308,N18120,EWR,RDU,,416,16,30,2013-01-01T21:00:00Z
839,2013,1,1,,1935,,,2240,,AA,791,N3EHAA,LGA,DFW,,1389,19,35,2013-01-02T00:00:00Z
840,2013,1,1,,1500,,,1825,,AA,1925,N3EVAA,LGA,MIA,,1096,15,0,2013-01-01T20:00:00Z


In [17]:
# Save the subset to a new dataframe
flights_0101 = flights[(flights.month==1) & (flights.day==1)]
flights_0101

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
837,2013,1,1,2356.0,2359,-3.0,425.0,437,-12.0,B6,727,N588JB,JFK,BQN,186.0,1576,23,59,2013-01-02T04:00:00Z
838,2013,1,1,,1630,,,1815,,EV,4308,N18120,EWR,RDU,,416,16,30,2013-01-01T21:00:00Z
839,2013,1,1,,1935,,,2240,,AA,791,N3EHAA,LGA,DFW,,1389,19,35,2013-01-02T00:00:00Z
840,2013,1,1,,1500,,,1825,,AA,1925,N3EVAA,LGA,MIA,,1096,15,0,2013-01-01T20:00:00Z


In [18]:
# Select all flights scheduled to depart at or before 6:00 am.
flights[flights.sched_dep_time<=600]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
335808,2013,9,30,601.0,600,1.0,839.0,905,-26.0,AA,1175,N3FEAA,LGA,MIA,140.0,1096,6,0,2013-09-30T10:00:00Z
335810,2013,9,30,603.0,600,3.0,705.0,730,-25.0,UA,279,N457UA,EWR,ORD,103.0,719,6,0,2013-09-30T10:00:00Z
335814,2013,9,30,609.0,600,9.0,834.0,815,19.0,FL,345,N261AT,LGA,ATL,111.0,762,6,0,2013-09-30T10:00:00Z
335842,2013,9,30,632.0,600,32.0,734.0,701,33.0,US,2134,N748UW,LGA,BOS,35.0,184,6,0,2013-09-30T10:00:00Z


In [19]:
# Use query() function
flights.query('sched_dep_time<=600')

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
335808,2013,9,30,601.0,600,1.0,839.0,905,-26.0,AA,1175,N3FEAA,LGA,MIA,140.0,1096,6,0,2013-09-30T10:00:00Z
335810,2013,9,30,603.0,600,3.0,705.0,730,-25.0,UA,279,N457UA,EWR,ORD,103.0,719,6,0,2013-09-30T10:00:00Z
335814,2013,9,30,609.0,600,9.0,834.0,815,19.0,FL,345,N261AT,LGA,ATL,111.0,762,6,0,2013-09-30T10:00:00Z
335842,2013,9,30,632.0,600,32.0,734.0,701,33.0,US,2134,N748UW,LGA,BOS,35.0,184,6,0,2013-09-30T10:00:00Z
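
A `query()` expression can also refer to ordinary Python variables by prefixing them with `@`. The sketch below shows this on a toy table. The variable name `cutoff` and the data are hypothetical.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "sched_dep_time": [515, 600, 630],
    "origin": ["EWR", "LGA", "JFK"],
})

cutoff = 600  # hypothetical threshold defined in regular Python code

early = toy.query("sched_dep_time <= @cutoff")
early_lga = toy.query("sched_dep_time <= @cutoff and origin == 'LGA'")

print(early)
print(early_lga)
```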


## Logical operators

As shown above, multiple filtering conditions are combined with “&”: every condition must be true in order for a row to be included in the output. 

For other types of combinations, you’ll need to use Boolean operators yourself: ``&`` is “and”, ``|`` is “or”, and ``~`` is “not”. 
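
The three operators can be sketched on a toy table as follows. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `==` in Python, so leaving the parentheses out usually raises an error or selects the wrong rows.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"month": [1, 1, 2, 3], "day": [1, 2, 1, 1]})

jan_first = toy[(toy.month == 1) & (toy.day == 1)]     # and: both conditions hold
jan_or_feb = toy[(toy.month == 1) | (toy.month == 2)]  # or: at least one holds
not_jan = toy[~(toy.month == 1)]                       # not: condition is negated

print(len(jan_first), len(jan_or_feb), len(not_jan))
```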

In [20]:
# Select flights in either January or February
flights[(flights.month==1) | (flights.month==2)]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136242,2013,2,28,,850,,,1035,,MQ,4558,N737MQ,LGA,CLE,,419,8,50,2013-02-28T13:00:00Z
136243,2013,2,28,,905,,,1115,,MQ,4478,N722MQ,LGA,DTW,,502,9,5,2013-02-28T14:00:00Z
136244,2013,2,28,,1115,,,1310,,MQ,4485,N725MQ,LGA,CMH,,479,11,15,2013-02-28T16:00:00Z
136245,2013,2,28,,830,,,1205,,UA,1480,,EWR,SFO,,2565,8,30,2013-02-28T13:00:00Z


In [21]:
# Select flights in the second quarter
flights[flights.month.isin([4,5,6])]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
165081,2013,4,1,454.0,500,-6.0,636.0,640,-4.0,US,1843,N566UW,EWR,CLT,84.0,529,5,0,2013-04-01T09:00:00Z
165082,2013,4,1,509.0,515,-6.0,743.0,814,-31.0,UA,1545,N76288,EWR,IAH,194.0,1400,5,15,2013-04-01T09:00:00Z
165083,2013,4,1,526.0,530,-4.0,812.0,827,-15.0,UA,1714,N76517,LGA,IAH,206.0,1416,5,30,2013-04-01T09:00:00Z
165084,2013,4,1,534.0,540,-6.0,833.0,850,-17.0,AA,1141,N5DSAA,JFK,MIA,152.0,1089,5,40,2013-04-01T09:00:00Z
165085,2013,4,1,542.0,545,-3.0,914.0,920,-6.0,B6,725,N784JB,JFK,BQN,191.0,1576,5,45,2013-04-01T09:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
250445,2013,6,30,,1945,,,2104,,EV,5714,N836AS,JFK,IAD,,228,19,45,2013-06-30T23:00:00Z
250446,2013,6,30,,1610,,,1805,,EV,4092,N16147,EWR,DAY,,533,16,10,2013-06-30T20:00:00Z
250447,2013,6,30,,1709,,,1856,,EV,4662,N16911,EWR,RDU,,416,17,9,2013-06-30T21:00:00Z
250448,2013,6,30,,2059,,,2307,,EV,5254,N760EV,LGA,DSM,,1031,20,59,2013-07-01T00:00:00Z


In [22]:
# Select flights that are not in January
flights[flights.month!=1]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
27004,2013,10,1,447.0,500,-13.0,614.0,648,-34.0,US,1877,N538UW,EWR,CLT,69.0,529,5,0,2013-10-01T09:00:00Z
27005,2013,10,1,522.0,517,5.0,735.0,757,-22.0,UA,252,N556UA,EWR,IAH,174.0,1400,5,17,2013-10-01T09:00:00Z
27006,2013,10,1,536.0,545,-9.0,809.0,855,-46.0,AA,2243,N630AA,JFK,MIA,132.0,1089,5,45,2013-10-01T09:00:00Z
27007,2013,10,1,539.0,545,-6.0,801.0,827,-26.0,UA,1714,N37252,LGA,IAH,172.0,1416,5,45,2013-10-01T09:00:00Z
27008,2013,10,1,539.0,545,-6.0,917.0,933,-16.0,B6,1403,N789JB,JFK,SJU,186.0,1598,5,45,2013-10-01T09:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213,14,55,2013-09-30T18:00:00Z
336772,2013,9,30,,2200,,,2312,,9E,3525,,LGA,SYR,,198,22,0,2013-10-01T02:00:00Z
336773,2013,9,30,,1210,,,1330,,MQ,3461,N535MQ,LGA,BNA,,764,12,10,2013-09-30T16:00:00Z
336774,2013,9,30,,1159,,,1344,,MQ,3572,N511MQ,LGA,CLE,,419,11,59,2013-09-30T15:00:00Z


In [23]:
# Select flights that are not in January, February, or March
flights[(flights.month!=1) & (flights.month!=2) & (flights.month!=3)]
#flights[~flights.month.isin([1,2,3])]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
27004,2013,10,1,447.0,500,-13.0,614.0,648,-34.0,US,1877,N538UW,EWR,CLT,69.0,529,5,0,2013-10-01T09:00:00Z
27005,2013,10,1,522.0,517,5.0,735.0,757,-22.0,UA,252,N556UA,EWR,IAH,174.0,1400,5,17,2013-10-01T09:00:00Z
27006,2013,10,1,536.0,545,-9.0,809.0,855,-46.0,AA,2243,N630AA,JFK,MIA,132.0,1089,5,45,2013-10-01T09:00:00Z
27007,2013,10,1,539.0,545,-6.0,801.0,827,-26.0,UA,1714,N37252,LGA,IAH,172.0,1416,5,45,2013-10-01T09:00:00Z
27008,2013,10,1,539.0,545,-6.0,917.0,933,-16.0,B6,1403,N789JB,JFK,SJU,186.0,1598,5,45,2013-10-01T09:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213,14,55,2013-09-30T18:00:00Z
336772,2013,9,30,,2200,,,2312,,9E,3525,,LGA,SYR,,198,22,0,2013-10-01T02:00:00Z
336773,2013,9,30,,1210,,,1330,,MQ,3461,N535MQ,LGA,BNA,,764,12,10,2013-09-30T16:00:00Z
336774,2013,9,30,,1159,,,1344,,MQ,3572,N511MQ,LGA,CLE,,419,11,59,2013-09-30T15:00:00Z


In [24]:
# Use query() to select flights in January, February, or March
flights.query('month>=1 and month<=3')

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165076,2013,3,31,2349.0,2355,-6.0,333.0,338,-5.0,B6,707,N657JB,JFK,SJU,202.0,1598,23,55,2013-04-01T03:00:00Z
165077,2013,3,31,2358.0,2359,-1.0,332.0,339,-7.0,B6,727,N608JB,JFK,BQN,195.0,1576,23,59,2013-04-01T03:00:00Z
165078,2013,3,31,,1627,,,1734,,EV,4299,N17560,EWR,DCA,,199,16,27,2013-03-31T20:00:00Z
165079,2013,3,31,,600,,,725,,EV,5689,N829AS,LGA,IAD,,229,6,0,2013-03-31T10:00:00Z


In [25]:
# Get rows with 'dest' column containing 'A'
flights[flights.dest.str.contains('A')]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
7,2013,1,1,557.0,600,-3.0,709.0,723,-14.0,EV,5708,N829AS,LGA,IAD,53.0,229,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336751,2013,9,30,2140.0,2140,0.0,10.0,40,-30.0,AA,185,N335AA,JFK,LAX,298.0,2475,21,40,2013-10-01T01:00:00Z
336759,2013,9,30,2207.0,2140,27.0,2257.0,2250,7.0,MQ,3660,N532MQ,LGA,BNA,97.0,764,21,40,2013-10-01T01:00:00Z
336770,2013,9,30,,1842,,,2019,,EV,5274,N740EV,LGA,BNA,,764,18,42,2013-09-30T22:00:00Z
336771,2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213,14,55,2013-09-30T18:00:00Z


## Missing values

It is quite common to have missing values or NaN's in data frames. NaN represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown. 
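
A small sketch of this behavior on a toy Series is given below. It also shows two cases covered by the "almost": comparisons with NaN evaluate to `False` rather than NaN, and pandas reductions such as `sum()` skip NaN by default.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value.
delays = pd.Series([2.0, np.nan, -6.0])

shifted = delays + 10                   # arithmetic: NaN stays NaN
positive = delays > 0                   # comparison: NaN compares as False
total = delays.sum()                    # reductions skip NaN by default
total_strict = delays.sum(skipna=False) # NaN when missing values are not skipped

print(shifted.tolist())
print(positive.tolist())
print(total, total_strict)
```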

In pandas, to determine whether a value is missing, use ``.isnull()`` (or its alias ``.isna()``):

In [26]:
flights[flights.arr_time.isnull()]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
754,2013,1,1,2016.0,1930,46.0,,2220,,EV,4204,N14168,EWR,OKC,,1325,19,30,2013-01-02T00:00:00Z
838,2013,1,1,,1630,,,1815,,EV,4308,N18120,EWR,RDU,,416,16,30,2013-01-01T21:00:00Z
839,2013,1,1,,1935,,,2240,,AA,791,N3EHAA,LGA,DFW,,1389,19,35,2013-01-02T00:00:00Z
840,2013,1,1,,1500,,,1825,,AA,1925,N3EVAA,LGA,MIA,,1096,15,0,2013-01-01T20:00:00Z
841,2013,1,1,,600,,,901,,B6,125,N618JB,JFK,FLL,,1069,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213,14,55,2013-09-30T18:00:00Z
336772,2013,9,30,,2200,,,2312,,9E,3525,,LGA,SYR,,198,22,0,2013-10-01T02:00:00Z
336773,2013,9,30,,1210,,,1330,,MQ,3461,N535MQ,LGA,BNA,,764,12,10,2013-09-30T16:00:00Z
336774,2013,9,30,,1159,,,1344,,MQ,3572,N511MQ,LGA,CLE,,419,11,59,2013-09-30T15:00:00Z


## Sorting

Given a data frame, we often want to sort the rows by a column name, or a set of column names, or more complicated expressions. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. 
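
`sort_values()` takes column names rather than expressions. One possible way to sort by a more complicated expression is to compute it as a column first and then sort by that column. The sketch below does this on toy data. The `gain` column is only an illustration.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "dep_delay": [2.0, 4.0, -1.0],
    "arr_delay": [11.0, 20.0, -18.0],
})

# Compute the expression as a temporary column, then sort by it.
ordered = (
    toy.assign(gain=lambda x: x.dep_delay - x.arr_delay)
       .sort_values("gain")
)

print(ordered)
```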

In [27]:
# Order rows by month
flights.sort_values('month')

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
18009,2013,1,21,1754.0,1800,-6.0,1903.0,1915,-12.0,B6,1016,N184JB,JFK,BOS,44.0,187,18,0,2013-01-21T23:00:00Z
18008,2013,1,21,1753.0,1800,-7.0,1859.0,1913,-14.0,US,2185,N737US,LGA,DCA,54.0,214,18,0,2013-01-21T23:00:00Z
18007,2013,1,21,1752.0,1800,-8.0,1850.0,1913,-23.0,US,2138,N952UW,LGA,BOS,42.0,184,18,0,2013-01-21T23:00:00Z
18006,2013,1,21,1751.0,1753,-2.0,2052.0,2105,-13.0,UA,535,N554UA,JFK,LAX,336.0,2475,17,53,2013-01-21T22:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92533,2013,12,11,622.0,630,-8.0,814.0,815,-1.0,AA,303,N3DEAA,LGA,ORD,136.0,733,6,30,2013-12-11T11:00:00Z
92532,2013,12,11,621.0,625,-4.0,805.0,750,15.0,WN,1360,N8321D,LGA,MDW,134.0,725,6,25,2013-12-11T11:00:00Z
92531,2013,12,11,620.0,630,-10.0,940.0,938,2.0,B6,929,N595JB,JFK,RSW,179.0,1074,6,30,2013-12-11T11:00:00Z
92542,2013,12,11,631.0,635,-4.0,948.0,943,5.0,UA,1299,N17229,EWR,RSW,179.0,1068,6,35,2013-12-11T11:00:00Z


In [28]:
# Order rows by month in descending order
flights.sort_values('month', ascending=False)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
84862,2013,12,2,1713.0,1715,-2.0,1856.0,1915,-19.0,AA,199,N3FWAA,JFK,ORD,120.0,740,17,15,2013-12-02T22:00:00Z
93115,2013,12,11,1629.0,1459,90.0,1731.0,1625,66.0,9E,2903,N297PQ,JFK,BOS,39.0,187,14,59,2013-12-11T19:00:00Z
93104,2013,12,11,1622.0,1620,2.0,1848.0,1829,19.0,EV,4352,N14953,EWR,CVG,122.0,569,16,20,2013-12-11T21:00:00Z
93105,2013,12,11,1623.0,1630,-7.0,1842.0,1845,-3.0,DL,2231,N944DL,LGA,DTW,102.0,502,16,30,2013-12-11T21:00:00Z
93106,2013,12,11,1623.0,1630,-7.0,1756.0,1805,-9.0,EV,5293,N712EV,LGA,ORF,63.0,296,16,30,2013-12-11T21:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18007,2013,1,21,1752.0,1800,-8.0,1850.0,1913,-23.0,US,2138,N952UW,LGA,BOS,42.0,184,18,0,2013-01-21T23:00:00Z
18008,2013,1,21,1753.0,1800,-7.0,1859.0,1913,-14.0,US,2185,N737US,LGA,DCA,54.0,214,18,0,2013-01-21T23:00:00Z
18009,2013,1,21,1754.0,1800,-6.0,1903.0,1915,-12.0,B6,1016,N184JB,JFK,BOS,44.0,187,18,0,2013-01-21T23:00:00Z
18010,2013,1,21,1755.0,1800,-5.0,2015.0,2006,9.0,US,373,N657AW,JFK,CLT,99.0,541,18,0,2013-01-21T23:00:00Z


In [29]:
# Order rows by year, month, day
flights.sort_values(by=['year','month','day'])
# Or simply: 
#flights.sort_values(['year','month','day'])

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111291,2013,12,31,,705,,,931,,UA,1729,,EWR,DEN,,1605,7,5,2013-12-31T12:00:00Z
111292,2013,12,31,,825,,,1029,,US,1831,,JFK,CLT,,541,8,25,2013-12-31T13:00:00Z
111293,2013,12,31,,1615,,,1800,,MQ,3301,N844MQ,LGA,RDU,,431,16,15,2013-12-31T21:00:00Z
111294,2013,12,31,,600,,,735,,UA,219,,EWR,ORD,,719,6,0,2013-12-31T11:00:00Z


In [30]:
# You can specify different ascending arguments for different column names
flights.sort_values(['month', 'day'], ascending=[True, False])

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
26076,2013,1,31,1.0,2100,181.0,124.0,2225,179.0,WN,530,N550WN,LGA,MDW,127.0,725,21,0,2013-02-01T02:00:00Z
26077,2013,1,31,4.0,2359,5.0,455.0,444,11.0,B6,739,N599JB,JFK,PSE,206.0,1617,23,59,2013-02-01T04:00:00Z
26078,2013,1,31,7.0,2359,8.0,453.0,437,16.0,B6,727,N505JB,JFK,BQN,197.0,1576,23,59,2013-02-01T04:00:00Z
26079,2013,1,31,12.0,2250,82.0,132.0,7,85.0,B6,30,N178JB,JFK,ROC,60.0,264,22,50,2013-02-01T03:00:00Z
26080,2013,1,31,26.0,2154,152.0,328.0,50,158.0,B6,515,N663JB,EWR,FLL,161.0,1065,21,54,2013-02-01T02:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84143,2013,12,1,,830,,,1039,,9E,3385,,EWR,MSP,,1008,8,30,2013-12-01T13:00:00Z
84144,2013,12,1,,2229,,,2343,,B6,234,N192JB,JFK,BTV,,266,22,29,2013-12-02T03:00:00Z
84145,2013,12,1,,631,,,742,,EV,4194,N13975,EWR,DCA,,199,6,31,2013-12-01T11:00:00Z
84146,2013,12,1,,620,,,826,,EV,5178,N614QX,EWR,MSP,,1008,6,20,2013-12-01T11:00:00Z


In [31]:
# By default, missing values (NaN) are sorted at the end (na_position='last').
flights.sort_values('dep_delay')

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
89673,2013,12,7,2040.0,2123,-43.0,40.0,2352,48.0,B6,97,N592JB,JFK,DEN,265.0,1626,21,23,2013-12-08T02:00:00Z
113633,2013,2,3,2022.0,2055,-33.0,2240.0,2338,-58.0,DL,1715,N612DL,LGA,MSY,162.0,1183,20,55,2013-02-04T01:00:00Z
64501,2013,11,10,1408.0,1440,-32.0,1549.0,1559,-10.0,EV,5713,N825AS,LGA,IAD,52.0,229,14,40,2013-11-10T19:00:00Z
9619,2013,1,11,1900.0,1930,-30.0,2233.0,2243,-10.0,DL,1435,N934DL,LGA,TPA,139.0,1010,19,30,2013-01-12T00:00:00Z
24915,2013,1,29,1703.0,1730,-27.0,1947.0,1957,-10.0,F9,837,N208FR,LGA,DEN,250.0,1620,17,30,2013-01-29T22:00:00Z
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213,14,55,2013-09-30T18:00:00Z
336772,2013,9,30,,2200,,,2312,,9E,3525,,LGA,SYR,,198,22,0,2013-10-01T02:00:00Z
336773,2013,9,30,,1210,,,1330,,MQ,3461,N535MQ,LGA,BNA,,764,12,10,2013-09-30T16:00:00Z
336774,2013,9,30,,1159,,,1344,,MQ,3572,N511MQ,LGA,CLE,,419,11,59,2013-09-30T15:00:00Z
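
The placement of missing values is controlled by the `na_position` argument of `sort_values()`. The sketch below compares the options on a toy column.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value.
toy = pd.DataFrame({"dep_delay": [4.0, np.nan, -6.0]})

last = toy.sort_values("dep_delay")                        # default: na_position="last"
first = toy.sort_values("dep_delay", na_position="first")  # NaN rows come first
desc = toy.sort_values("dep_delay", ascending=False)       # NaN is still last when descending

print(last.index.tolist())
print(first.index.tolist())
print(desc.index.tolist())
```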


## Select Columns

When you work with a dataset with hundreds or even thousands of variables, which is not uncommon, the first challenge is often narrowing in on the variables you’re actually interested in. 

In [32]:
# Select one column
#flights['carrier']
flights.carrier

0         UA
1         UA
2         AA
3         B6
4         DL
          ..
336771    9E
336772    9E
336773    MQ
336774    MQ
336775    MQ
Name: carrier, Length: 336776, dtype: object

In [33]:
# Select multiple columns
flights[['year','month','day']]

Unnamed: 0,year,month,day
0,2013,1,1
1,2013,1,1
2,2013,1,1
3,2013,1,1
4,2013,1,1
...,...,...,...
336771,2013,9,30
336772,2013,9,30
336773,2013,9,30
336774,2013,9,30


Select columns whose names match a regular expression ``regex``:

``df.filter(regex='regex')``
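
Besides `regex`, `DataFrame.filter()` also accepts `items` (exact names) and `like` (substring match). The sketch below compares the three on an empty toy frame that has only column names.

```python
import pandas as pd

# Empty toy frame: only the column names matter here.
toy = pd.DataFrame(columns=["dep_time", "sched_dep_time", "dep_delay", "carrier"])

by_items = list(toy.filter(items=["dep_time", "carrier"]).columns)  # exact names
by_like = list(toy.filter(like="dep").columns)                      # substring match
by_regex = list(toy.filter(regex="^dep").columns)                   # regular-expression search

print(by_items)
print(by_like)
print(by_regex)
```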

In [34]:
# Select all columns containing a '_' in the name.
flights.filter(regex='_')

Unnamed: 0,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,air_time,time_hour
0,517.0,515,2.0,830.0,819,11.0,227.0,2013-01-01T10:00:00Z
1,533.0,529,4.0,850.0,830,20.0,227.0,2013-01-01T10:00:00Z
2,542.0,540,2.0,923.0,850,33.0,160.0,2013-01-01T10:00:00Z
3,544.0,545,-1.0,1004.0,1022,-18.0,183.0,2013-01-01T10:00:00Z
4,554.0,600,-6.0,812.0,837,-25.0,116.0,2013-01-01T11:00:00Z
...,...,...,...,...,...,...,...,...
336771,,1455,,,1634,,,2013-09-30T18:00:00Z
336772,,2200,,,2312,,,2013-10-01T02:00:00Z
336773,,1210,,,1330,,,2013-09-30T16:00:00Z
336774,,1159,,,1344,,,2013-09-30T15:00:00Z


In [35]:
# Select all columns beginning with word 'dep'
flights.filter(regex='^dep')

Unnamed: 0,dep_time,dep_delay
0,517.0,2.0
1,533.0,4.0
2,542.0,2.0
3,544.0,-1.0
4,554.0,-6.0
...,...,...
336771,,
336772,,
336773,,
336774,,


In [36]:
# Select all columns ending with the word 'time'
flights.filter(regex='time$')

Unnamed: 0,dep_time,sched_dep_time,arr_time,sched_arr_time,air_time
0,517.0,515,830.0,819,227.0
1,533.0,529,850.0,830,227.0
2,542.0,540,923.0,850,160.0
3,544.0,545,1004.0,1022,183.0
4,554.0,600,812.0,837,116.0
...,...,...,...,...,...
336771,,1455,,1634,
336772,,2200,,2312,
336773,,1210,,1330,
336774,,1159,,1344,


In [37]:
# Select all columns beginning with 'a', ending with 'e', with any string in between.
flights.filter(regex='^a.*e$')

Unnamed: 0,arr_time,air_time
0,830.0,227.0
1,850.0,227.0
2,923.0,160.0
3,1004.0,183.0
4,812.0,116.0
...,...,...
336771,,
336772,,
336773,,
336774,,


In [38]:
# Select all columns between 'carrier' and 'dest' (inclusive).
flights.loc[:,'carrier':'dest']

Unnamed: 0,carrier,flight,tailnum,origin,dest
0,UA,1545,N14228,EWR,IAH
1,UA,1714,N24211,LGA,IAH
2,AA,1141,N619AA,JFK,MIA
3,B6,725,N804JB,JFK,BQN
4,DL,461,N668DN,LGA,ATL
...,...,...,...,...,...
336771,9E,3393,,JFK,DCA
336772,9E,3525,,LGA,SYR
336773,MQ,3461,N535MQ,LGA,BNA
336774,MQ,3572,N511MQ,LGA,CLE


In [39]:
# Select by column indexes: 
# Select columns in positions 1, 2 and 5 (first column is 0).
flights.iloc[:,[1,2,5]]

Unnamed: 0,month,day,dep_delay
0,1,1,2.0
1,1,1,4.0
2,1,1,2.0
3,1,1,-1.0
4,1,1,-6.0
...,...,...,...
336771,9,30,
336772,9,30,
336773,9,30,
336774,9,30,


In [40]:
# Select rows meeting logical condition, and only the specific columns.
# Select all flights in January, display the day, carrier, and flight: 
flights.loc[flights['month']==1, ['day','carrier', 'flight']]

Unnamed: 0,day,carrier,flight
0,1,UA,1545
1,1,UA,1714
2,1,AA,1141
3,1,B6,725
4,1,DL,461
...,...,...,...
26999,31,MQ,4475
27000,31,MQ,4658
27001,31,MQ,4491
27002,31,UA,337
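One detail that is easy to miss in the cells above: a label slice with `loc` includes its end label, while a positional slice with `iloc` stops before its end position, as ordinary Python slicing does. The sketch below uses a small made-up frame (`toy` is hypothetical) to show the difference, along with the row-mask-plus-column-list pattern:

```python
import pandas as pd

# Hypothetical toy frame with a few flights-like columns
toy = pd.DataFrame({
    'month': [1, 1, 2],
    'day': [1, 2, 1],
    'carrier': ['UA', 'AA', 'DL'],
    'flight': [1545, 1141, 461],
    'dest': ['IAH', 'MIA', 'ATL'],
})

# Label slice: both endpoints are included
by_label = toy.loc[:, 'day':'flight']

# Positional slice: position 3 ('flight') is excluded
by_position = toy.iloc[:, 1:3]

# Boolean row mask combined with a list of column labels
january = toy.loc[toy['month'] == 1, ['day', 'carrier']]

print(list(by_label.columns))
print(list(by_position.columns))
print(january.shape)
```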


## Add new variables

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. 

In [41]:
# First, let's create a small dataframe to work with
flights_sml = flights.filter(['year','month','day','dep_delay','arr_delay','distance','air_time'])
flights_sml

Unnamed: 0,year,month,day,dep_delay,arr_delay,distance,air_time
0,2013,1,1,2.0,11.0,1400,227.0
1,2013,1,1,4.0,20.0,1416,227.0
2,2013,1,1,2.0,33.0,1089,160.0
3,2013,1,1,-1.0,-18.0,1576,183.0
4,2013,1,1,-6.0,-25.0,762,116.0
...,...,...,...,...,...,...,...
336771,2013,9,30,,,213,
336772,2013,9,30,,,198,
336773,2013,9,30,,,764,
336774,2013,9,30,,,419,


In [42]:
# Create two new variables one at a time
flights_sml['gain'] = flights_sml.dep_delay - flights_sml.arr_delay
flights_sml['speed'] = flights_sml.distance / flights_sml.air_time * 60
flights_sml.head()

Unnamed: 0,year,month,day,dep_delay,arr_delay,distance,air_time,gain,speed
0,2013,1,1,2.0,11.0,1400,227.0,-9.0,370.044053
1,2013,1,1,4.0,20.0,1416,227.0,-16.0,374.273128
2,2013,1,1,2.0,33.0,1089,160.0,-31.0,408.375
3,2013,1,1,-1.0,-18.0,1576,183.0,17.0,516.721311
4,2013,1,1,-6.0,-25.0,762,116.0,19.0,394.137931


In [43]:
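# Remove existing columns from a dataframe.
# drop() returns a new DataFrame; flights_sml keeps 'gain' and 'speed' unless you assign the result back.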
flights_sml.drop(columns=['gain','speed'])

Unnamed: 0,year,month,day,dep_delay,arr_delay,distance,air_time
0,2013,1,1,2.0,11.0,1400,227.0
1,2013,1,1,4.0,20.0,1416,227.0
2,2013,1,1,2.0,33.0,1089,160.0
3,2013,1,1,-1.0,-18.0,1576,183.0
4,2013,1,1,-6.0,-25.0,762,116.0
...,...,...,...,...,...,...,...
336771,2013,9,30,,,213,
336772,2013,9,30,,,198,
336773,2013,9,30,,,764,
336774,2013,9,30,,,419,


In [44]:
# Create multiple new columns 
flights_sml.assign(
    gain = lambda x: x.dep_delay - x.arr_delay,
    hours = lambda x: x.air_time / 60,
    gain_per_hour = lambda x: x.gain / x.hours # Note that you can refer to columns that you’ve just created
)

Unnamed: 0,year,month,day,dep_delay,arr_delay,distance,air_time,gain,speed,hours,gain_per_hour
0,2013,1,1,2.0,11.0,1400,227.0,-9.0,370.044053,3.783333,-2.378855
1,2013,1,1,4.0,20.0,1416,227.0,-16.0,374.273128,3.783333,-4.229075
2,2013,1,1,2.0,33.0,1089,160.0,-31.0,408.375000,2.666667,-11.625000
3,2013,1,1,-1.0,-18.0,1576,183.0,17.0,516.721311,3.050000,5.573770
4,2013,1,1,-6.0,-25.0,762,116.0,19.0,394.137931,1.933333,9.827586
...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,9,30,,,213,,,,,
336772,2013,9,30,,,198,,,,,
336773,2013,9,30,,,764,,,,,
336774,2013,9,30,,,419,,,,,
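Both `assign()` and `drop()` return a new DataFrame and leave the original untouched, which is why `gain` and `speed` are still present in the output above. The following is a minimal sketch on a made-up two-row frame (`toy`, `result` and `trimmed` are hypothetical names) that reuses the values from the first and fifth rows of the flights data:

```python
import pandas as pd

# Hypothetical toy frame with values copied from two flights rows
toy = pd.DataFrame({
    'dep_delay': [2.0, -6.0],
    'arr_delay': [11.0, -25.0],
    'air_time': [227.0, 116.0],
})

# Each lambda receives the frame as built so far, so later
# keywords can use columns created by earlier ones
result = toy.assign(
    gain=lambda x: x.dep_delay - x.arr_delay,
    hours=lambda x: x.air_time / 60,
    gain_per_hour=lambda x: x.gain / x.hours,
)

# drop() also returns a new frame; result itself keeps 'hours'
trimmed = result.drop(columns=['hours'])

print(list(toy.columns))      # the original is unchanged
print(list(result.columns))
print(list(trimmed.columns))
```

To keep the new columns, assign the result back to a name, for example `toy = toy.assign(...)`.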


## Useful creation functions

There are many functions and operators for creating new variables:
- Arithmetic operators: +, -, *, /, and ** (exponentiation). Note that ^ is bitwise XOR in Python, not a power operator.
- Modular arithmetic: // (floor division) and % (remainder), where x == y * (x // y) + (x % y).
- Logs: np.log(), np.log2(), np.log10(). Logarithms are a useful transformation for data that ranges across multiple orders of magnitude.
- Logical comparisons: <, <=, >, >=, !=, and ==, which you learned about earlier. If you’re doing a complex sequence of logical operations, it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.
- Cumulative and rolling aggregates: pandas provides methods for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(). Rolling (windowed) aggregates are available through rolling().
- Ranking: get the rank of each row's value with the rank() method.
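As a small illustration of the modular arithmetic identity, the sketch below splits a clock-style integer into hours and minutes. It assumes that `sched_dep_time` is encoded as HHMM, which is what the `hour` and `minute` columns in the first rows of the flights data suggest; the Series here is a hand-copied sample, not the full column. It also shows why `**` and not `^` is the power operator in Python.

```python
import numpy as np
import pandas as pd

# First five sched_dep_time values from flights (assumed to be HHMM integers)
sched_dep_time = pd.Series([515, 529, 540, 545, 600])

hour = sched_dep_time // 100     # floor division keeps the hours
minute = sched_dep_time % 100    # remainder keeps the minutes

# The identity x == y * (x // y) + (x % y), with y = 100
reconstructed = 100 * hour + minute

power = 2 ** 10    # exponentiation
xor = 2 ^ 10       # bitwise XOR, not a power

# NumPy log functions work element-wise on a Series
log2_values = np.log2(pd.Series([1, 8, 1024]))

print(hour.tolist(), minute.tolist())
print(power, xor)
```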

In [49]:
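# Whole hours in the air (floor division drops the leftover minutes)
flights_sml['air_time_hours'] = flights_sml.air_time // 60
# Base-2 log of the distance
flights_sml['log2_dist'] = np.log2(flights_sml.distance)
# True where the flight made up time in the air
flights_sml['gain_pos'] = flights_sml.gain > 0
# Running total of gain
flights_sml['gain_cumsum'] = flights_sml.gain.cumsum()
# Rank by distance, shortest first
flights_sml['dist_rank'] = flights_sml['distance'].rank(method='min',ascending=True)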

flights_sml.head()

Unnamed: 0,year,month,day,dep_delay,arr_delay,distance,air_time,gain,speed,air_time_hours,log2_dist,gain_pos,gain_cumsum,dist_rank
0,2013,1,1,2.0,11.0,1400,227.0,-9.0,370.044053,3.0,10.451211,False,-9.0,254751.0
1,2013,1,1,4.0,20.0,1416,227.0,-16.0,374.273128,3.0,10.467606,False,-25.0,259700.0
2,2013,1,1,2.0,33.0,1089,160.0,-31.0,408.375,2.0,10.088788,False,-56.0,228548.0
3,2013,1,1,-1.0,-18.0,1576,183.0,17.0,516.721311,3.0,10.622052,True,-39.0,266833.0
4,2013,1,1,-6.0,-25.0,762,116.0,19.0,394.137931,1.0,9.573647,True,-20.0,149279.0


In [50]:
# Sort by distance rank, so the shortest flights come first
flights_sml.sort_values(['dist_rank'])

Unnamed: 0,year,month,day,dep_delay,arr_delay,distance,air_time,gain,speed,air_time_hours,log2_dist,gain_pos,gain_cumsum,dist_rank
275945,2013,7,27,,,17,,,,,4.087463,False,,1.0
3083,2013,1,4,40.0,27.0,80,30.0,13.0,160.000000,0.0,6.321928,True,8403.0,2.0
16328,2013,1,19,0.0,0.0,80,34.0,0.0,141.176471,0.0,6.321928,False,69247.0,2.0
112178,2013,2,1,-1.0,-8.0,80,24.0,7.0,200.000000,0.0,6.321928,True,465440.0,2.0
19983,2013,1,23,-1.0,-3.0,80,23.0,2.0,208.695652,0.0,6.321928,True,79295.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99112,2013,12,18,-2.0,8.0,4983,641.0,-10.0,466.427457,10.0,12.282799,False,394818.0,336435.0
223207,2013,6,2,-4.0,7.0,4983,617.0,-11.0,484.570502,10.0,12.282799,False,1139965.0,336435.0
151311,2013,3,17,6.0,37.0,4983,686.0,-31.0,435.830904,11.0,12.282799,False,707506.0,336435.0
218562,2013,5,28,-7.0,-13.0,4983,631.0,6.0,473.819334,10.0,12.282799,True,1090590.0,336435.0
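Two things in the sorted output above are worth spelling out: rows with the same distance share one rank, and the row with a missing `gain` gets `gain_pos` of False and an empty `gain_cumsum`. The sketch below reproduces both behaviours on small hand-made Series (the values are illustrative, not taken from a specific set of rows):

```python
import numpy as np
import pandas as pd

distance = pd.Series([80, 17, 80, 4983])

# method='min': tied values all receive the lowest rank in the group
rank_min = distance.rank(method='min')

# The default method='average' gives ties the mean of their ranks
rank_avg = distance.rank()

gain = pd.Series([13.0, np.nan, 7.0])

# Any comparison with NaN evaluates to False
gain_pos = gain > 0

# cumsum() skips NaN by default: the NaN position stays NaN,
# and the running total continues after it
gain_cumsum = gain.cumsum()

print(rank_min.tolist())
print(rank_avg.tolist())
print(gain_pos.tolist())
print(gain_cumsum.tolist())
```

Because `gain > 0` is False for missing values, a False in `gain_pos` can mean either "no time gained" or "gain unknown"; `gain.isna()` distinguishes the two.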
