<DIV ALIGN=CENTER>

# Introduction to Machine Learning Pre-Processing
## Professor Robert J. Brunner
  
</DIV>  
-----
-----


## Introduction

In this IPython Notebook, we explore data pre-processing. This is a
large topic, so in this notebook we will focus on some basic concepts.
You should feel free to try completing additional tasks.

-----



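In this notebook we read in the airline data and explore it with `describe`. We then drop rows that are missing a departure time (`dropna`), fill the remaining missing values (`fillna`), copy the cleaned columns to a new DataFrame, convert the airport codes to categoricals, and build a proper datetime index.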


In [1]:
# Suppress all warnings; Pandas raises several during this notebook
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd 

# Change this to read a different file, for example
# /home/data_scientist/data/2001.csv on the JupyterHub Server
#
# Note that the JupyterHub server has data from other years in the raw
# subdirectory.
#
filename = '/home/data_scientist/rppdm/data/2001.csv'

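# Read only selected columns (all rows).
#
# ucs: zero-based positions of the columns to read from the file.
# cnms: names to assign to those columns; header=0 discards the
#       file's own header row in favor of these names.
# na_values=['NA'] marks the string 'NA' as a missing value.

ucs = (1, 2, 4, 14, 15, 16, 17, 18)
cnms = ['Month', 'Day', 'dTime', 'aDelay', 'dDelay', 'Depart', 'Arrive', 'Distance']

alldata = pd.read_csv(filename, header=0, na_values=['NA'], usecols=ucs, names=cnms)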


-----

Note that the summary produced by `describe` is missing the `Depart` and `Arrive` columns, since they are not numeric. The counts also differ from column to column, because NA values are explicitly skipped by the `count` method. Finally, we do not yet have an actual datetime.


In [3]:
alldata.describe()

Unnamed: 0,Month,Day,dTime,aDelay,dDelay,Distance
count,5967780.0,5967780.0,5736582.0,5723673.0,5736582.0,5967780.0
mean,6.306294,15.683204,1348.704605,5.528249,8.154837,733.029305
std,3.371688,8.775346,482.686013,31.429291,28.348469,574.071625
min,1.0,1.0,1.0,-1116.0,-204.0,21.0
25%,3.0,8.0,930.0,-9.0,-3.0,313.0
50%,6.0,16.0,1333.0,-2.0,0.0,571.0
75%,9.0,23.0,1740.0,10.0,6.0,980.0
max,12.0,31.0,2400.0,1688.0,1692.0,4962.0
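
The differing counts above can also be checked directly. The following is a minimal sketch, assuming a small made-up DataFrame (`demo`) in place of the full airline data: `isna().sum()` counts the missing values per column, and `describe(include='all')` also summarizes non-numeric columns such as `Depart`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the airline data (values are made up for illustration).
demo = pd.DataFrame({
    'dTime': [930.0, np.nan, 1740.0, 2400.0],
    'aDelay': [4.0, np.nan, np.nan, -8.0],
    'Depart': ['BWI', 'PHL', 'BWI', 'ONT'],
})

# Number of missing values in each column.
missing = demo.isna().sum()
print(missing)

# include='all' adds non-numeric columns (count, unique, top, freq).
summary = demo.describe(include='all')
print(summary)
```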


In [4]:
# Drop any row that is missing a departure time
#
# axis = 0 means drop rows
# subset=['dTime'] means drop if the departure time is missing
# inplace=True means modify the existing DataFrame

alldata.dropna(axis=0, subset=['dTime'], inplace=True)

In [5]:
# Now display number of 'good' values for each column.
alldata.count()

Month       5736582
Day         5736582
dTime       5736582
aDelay      5723673
dDelay      5736582
Depart      5736582
Arrive      5736582
Distance    5736582
dtype: int64

In [6]:
# Now replace missing values (which are all in Arrival Delay column)
# with 0, note we could use another value, such as the departure delay.

alldata.fillna(value=0, axis=0, inplace=True)
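
As the comment above suggests, the departure delay is one possible alternative fill value. Here is a minimal sketch of that idea on a made-up DataFrame (`demo` is an assumption, not the notebook's data): passing a Series to `fillna` fills each missing entry from the row with the same index label.

```python
import numpy as np
import pandas as pd

# Made-up delays; two arrival delays are missing.
demo = pd.DataFrame({
    'aDelay': [4.0, np.nan, -8.0, np.nan],
    'dDelay': [4.0, 12.0, -4.0, 0.0],
})

# Fill each missing arrival delay with the departure delay of the same row.
demo['aDelay'] = demo['aDelay'].fillna(demo['dDelay'])
print(demo)
```

Whether this is a better choice than 0 depends on the analysis; it assumes a flight that leaves late tends to arrive about as late.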

In [7]:
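# First create a new DataFrame
# For now, simply copy over the columns, but we could
# convert them to integers (for number of minutes delay)
# and change data type to save memory.
#
# The explicit .copy() makes newdata independent of alldata, so
# adding columns later does not operate on a slice of alldata.

newdata = alldata[['aDelay', 'dDelay', 'Distance']].copy()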
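
The memory saving mentioned in the comment above could look roughly like the following. This is a sketch on a made-up DataFrame (`demo`), assuming the missing values have already been filled, since integer columns cannot hold NaN. The choice of `int16` is an assumption based on the ranges reported by `describe` (delays between -1116 and 1692, distances up to 4962), all of which fit in 16 bits.

```python
import pandas as pd

# Made-up rows stored as 64-bit floats, as in the notebook's data.
demo = pd.DataFrame({
    'aDelay': [4.0, 3.0, -8.0],
    'dDelay': [4.0, 8.0, -4.0],
    'Distance': [1189.0, 1189.0, 1189.0],
})

before = demo.memory_usage(deep=True).sum()

# Convert every column to a 16-bit integer (range -32768 to 32767).
small = demo.astype('int16')

after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(before, after)
```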

Next, we turn to categoricals.

The arrival and departure airports take only a limited set of values, so we will convert both columns to the Pandas categorical data type.

http://pandas.pydata.org/pandas-docs/stable/categorical.html


In [8]:
alldata['Depart'].unique()

array(['BWI', 'PHL', 'CLT', 'CLE', 'MKE', 'PIT', 'TPA', 'LGA', 'DFW',
       'BUF', 'BOS', 'ROC', 'DCA', 'MHT', 'CRW', 'MEM', 'DTW', 'CMH',
       'GSO', 'IAD', 'IAH', 'BHM', 'HPN', 'CHS', 'STL', 'AVP', 'ATL',
       'GSP', 'RDU', 'MCI', 'ORF', 'MYR', 'ALB', 'RIC', 'BNA', 'PVD',
       'BGM', 'TOL', 'SAV', 'ROA', 'IND', 'MDT', 'BTV', 'ELM', 'ITH',
       'JAX', 'PBI', 'SRQ', 'MSY', 'FLL', 'EWR', 'GRR', 'MSP', 'ORD',
       'ABE', 'LAS', 'BDL', 'MIA', 'MCO', 'CAE', 'ILM', 'SFO', 'SJU',
       'SDF', 'RSW', 'DAY', 'SYR', 'PNS', 'AVL', 'CAK', 'SBN', 'ERI',
       'HSV', 'CHA', 'SEA', 'TYS', 'FAY', 'LEX', 'STT', 'PWM', 'TRI',
       'STX', 'MDW', 'DAL', 'HOU', 'OKC', 'HRL', 'LIT', 'CRP', 'TUL',
       'LBB', 'MAF', 'ABQ', 'AMA', 'PHX', 'SAN', 'SMF', 'ISP', 'LAX',
       'AUS', 'PDX', 'ELP', 'OAK', 'ONT', 'SJC', 'JAN', 'RNO', 'BOI',
       'TUS', 'BUR', 'SAT', 'SLC', 'GEG', 'OMA', 'SNA', 'MSN', 'MBS',
       'FAR', 'ICT', 'GPT', 'VPS', 'DEN', 'TVC', 'FNT', 'GRB', 'GFK',
       'MOT', 'BIS',

In [9]:
# Now copy over the departure and Arrival columns, 
# but change data type to categoricals. 

newdata['Depart'] = alldata['Depart'].astype('category')
newdata['Arrive'] = alldata['Arrive'].astype('category')
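
To see what the conversion does, here is a small sketch on a made-up Series of airport codes (`codes` is an assumption for illustration). A categorical stores each distinct label once and keeps a small integer code per row, which usually takes less memory than repeating the strings.

```python
import pandas as pd

# Made-up airport codes, repeated to make the memory difference visible.
codes = pd.Series(['BWI', 'PHL', 'BWI', 'ONT', 'PHL', 'BWI'] * 1000)
as_cat = codes.astype('category')

# The distinct labels, stored once.
print(as_cat.cat.categories)

# The integer code stored for each row.
print(as_cat.cat.codes.head())

# Memory used by the string version and the categorical version.
print(codes.memory_usage(deep=True), as_cat.memory_usage(deep=True))
```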

Now we build a datetime index from the date and departure time columns.

http://pandas.pydata.org/pandas-docs/stable/timeseries.html

http://strftime.org

In [10]:
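# dTime stores the departure time as HHMM (for example, 704.0 is 07:04),
# so integer division by 100 gives the hour and the remainder the minute.
newdata['Year'] = 2001
newdata['Month'] = alldata.Month
newdata['Day'] = alldata.Day
newdata['Hour'] = (alldata.dTime // 100).astype(int)
newdata['Min'] = (alldata.dTime % 100).astype(int)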

In [11]:
newdata.tail()


Unnamed: 0,aDelay,dDelay,Distance,Depart,Arrive,Year,Month,Day,Hour,Min
5967775,4,4,1189,ONT,DFW,2001,12,14,7,4
5967776,3,8,1189,ONT,DFW,2001,12,15,7,8
5967777,-8,-4,1189,ONT,DFW,2001,12,16,6,56
5967778,-4,-4,1189,ONT,DFW,2001,12,17,6,56
5967779,3,9,1189,ONT,DFW,2001,12,18,7,9


In [12]:
newdata.describe()

Unnamed: 0,aDelay,dDelay,Distance,Year,Month,Day,Hour,Min
count,5736582.0,5736582.0,5736582.0,5736582,5736582.0,5736582.0,5736582.0,5736582.0
mean,5.515809,8.154837,735.611691,2001,6.291336,15.712031,13.191086,29.596014
std,31.395001,28.348469,574.963899,0,3.3813,8.827161,4.830028,17.81198
min,-1116.0,-204.0,21.0,2001,1.0,1.0,0.0,0.0
25%,-9.0,-3.0,314.0,2001,3.0,8.0,9.0,14.0
50%,-2.0,0.0,576.0,2001,6.0,16.0,13.0,30.0
75%,10.0,6.0,984.0,2001,9.0,23.0,17.0,45.0
max,1688.0,1692.0,4962.0,2001,12.0,31.0,24.0,59.0


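The maximum value of `Hour` is 24, which cannot be converted to a datetime in Python, so we need to fix these rows. For now we will simply subtract one minute, turning 24:00 into 23:59. A better approach might be to roll these times over to 00:00 on the next day.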


In [13]:
newdata.loc[newdata.Hour == 24, 'Min'] = 59
newdata.loc[newdata.Hour == 24, 'Hour'] = 23
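
The alternative of rolling 24:00 over to the next day can be sketched as follows. This is not what the notebook does; it is a minimal illustration on a made-up DataFrame (`demo`), which builds the calendar date first and then adds the hours and minutes as time offsets, so that 24 hours naturally lands on the following day.

```python
import pandas as pd

# Made-up rows; the second one departs at 24:00 on the last day of the year.
demo = pd.DataFrame({
    'Year': [2001, 2001],
    'Month': [12, 12],
    'Day': [14, 31],
    'dTime': [704.0, 2400.0],
})

# Midnight of the flight date; to_datetime assembles a date from
# columns named year, month and day.
date = pd.to_datetime(
    demo[['Year', 'Month', 'Day']].rename(columns=str.lower)
)

# Hours and minutes as offsets; an offset of 24 hours rolls to the next day.
hours = pd.to_timedelta(demo['dTime'] // 100, unit='h')
minutes = pd.to_timedelta(demo['dTime'] % 100, unit='min')

demo['DTI'] = date + hours + minutes
print(demo['DTI'])
```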

In [14]:
newdata['DTI'] = pd.to_datetime(newdata.Year * 100000000 + 
                                newdata.Month * 1000000 + 
                                newdata.Day * 10000 + 
                                newdata.Hour * 100 + 
                                newdata.Min, format="%Y%m%d%H%M")

In [15]:
newdata.set_index('DTI', inplace=True)

In [16]:
newdata.tail()

DTI,aDelay,dDelay,Distance,Depart,Arrive,Year,Month,Day,Hour,Min
2001-12-14 07:04:00,4,4,1189,ONT,DFW,2001,12,14,7,4
2001-12-15 07:08:00,3,8,1189,ONT,DFW,2001,12,15,7,8
2001-12-16 06:56:00,-8,-4,1189,ONT,DFW,2001,12,16,6,56
2001-12-17 06:56:00,-4,-4,1189,ONT,DFW,2001,12,17,6,56
2001-12-18 07:09:00,3,9,1189,ONT,DFW,2001,12,18,7,9


At this point we have a DataFrame that has a datetime index, contains no missing values, and represents the airport codes as categoricals.
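
These three properties can be verified programmatically. The following is a minimal sketch; `is_preprocessed` is a hypothetical helper written for this illustration, and `demo` is a small DataFrame modeled on the last rows shown above.

```python
import pandas as pd

def is_preprocessed(df):
    """Check for a datetime index, no missing values, categorical airports."""
    return (
        isinstance(df.index, pd.DatetimeIndex)
        and not df.isna().any().any()
        and isinstance(df['Depart'].dtype, pd.CategoricalDtype)
        and isinstance(df['Arrive'].dtype, pd.CategoricalDtype)
    )

# Small stand-in shaped like the final DataFrame.
demo = pd.DataFrame(
    {
        'aDelay': [4, 3, -8],
        'Depart': pd.Categorical(['ONT', 'ONT', 'ONT']),
        'Arrive': pd.Categorical(['DFW', 'DFW', 'DFW']),
    },
    index=pd.to_datetime(
        ['2001-12-14 07:04', '2001-12-15 07:08', '2001-12-16 06:56']
    ),
)

print(is_preprocessed(demo))
```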
