2018.01.18
Grouping again

This notebook uses the `%store` magic function to save variables for use in other notebooks.

In [1]:
import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt

from pygeocoder import Geocoder
from mpl_toolkits.basemap import Basemap

%matplotlib inline

## Import the CSV

In [2]:
# where this data came from: https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?pn=1
orig_data = pd.read_csv('./project_datasets/682464398_102017_1651_airline_delay_causes.csv', skipinitialspace=True)

## Clean the data

### drop the extraneous column that has no information

In [3]:
data_right_columns = orig_data.drop('Unnamed: 21', axis=1)

Some of the column names contain leading whitespace, so we use a regular expression to strip it and make the names easier to work with programmatically.

In [4]:
# strip all whitespace from each column name
# (no `global` statement is needed at the top level of a notebook cell)
columns = [re.sub(r'\s+', '', col) for col in data_right_columns.columns]
data_right_columns.columns = columns
data_right_columns.columns

Index([u'year', u'month', u'carrier', u'carrier_name', u'airport',
       u'airport_name', u'arr_flights', u'arr_del15', u'carrier_ct',
       u'weather_ct', u'nas_ct', u'security_ct', u'late_aircraft_ct',
       u'arr_cancelled', u'arr_diverted', u'arr_delay', u'carrier_delay',
       u'weather_delay', u'nas_delay', u'security_delay',
       u'late_aircraft_delay'],
      dtype='object')
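
As a minimal sketch of what the pattern `r'\s+'` does, the example below applies the same substitution to a few made-up header strings. The padded names in `raw_columns` are hypothetical and only imitate the kind of spacing described above; they are not taken from the CSV itself.

```python
import re

# Hypothetical raw headers, padded the way an exported CSV might pad them.
raw_columns = [" year", " month", "carrier", " arr_del15", "late_aircraft_delay "]

# Remove every whitespace character, using the same pattern as the cell above.
clean_columns = [re.sub(r"\s+", "", name) for name in raw_columns]

print(clean_columns)
```

Note that this pattern removes whitespace anywhere in the name, not only at the start, so a header such as `"arr flights"` would become `"arrflights"`.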

### Drop the NaN rows as invalid data; they are a small percentage of the entire dataset

In [5]:
data_right_columns.shape

(11193, 21)

In [6]:
data_NoNaN = data_right_columns.dropna(axis=0)

So, the NaNs have been dropped...

### We will now assign the first variable, `data_NoNaN`, to `data` ...

The next four NaN-handling strategies are commented out, but each of them would also be assigned to `data`

In [7]:
data = data_NoNaN

Looking at the shape of our data, the 20 dropped rows (11193 - 11173) leave us with 11173 observations and 21 features

In [8]:
data.shape

(11173, 21)
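
To check how much data `dropna` removes, the row counts before and after can be compared directly. The sketch below does this on a small made-up frame; `toy` and its values are assumptions for illustration, standing in for `data_right_columns`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data_right_columns`; the values are made up.
toy = pd.DataFrame({
    "arr_flights": [100.0, np.nan, 250.0, 80.0],
    "arr_del15": [20.0, 5.0, np.nan, 10.0],
})

rows_before = len(toy)
toy_no_nan = toy.dropna(axis=0)  # drop any row containing a NaN

rows_dropped = rows_before - len(toy_no_nan)
share_dropped = rows_dropped / float(rows_before)

print(rows_dropped)
print(share_dropped)
```

Applied to the shapes shown above, the same arithmetic gives 20 dropped rows out of 11193, which is roughly 0.18% of the dataset.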

# Target
We have determined that the goal of this data analysis is to predict delayed flights, given certain conditions in the dataset. Therefore, the target variable is `arr_del15`.




Let's make the target variable binary.

In [10]:
data['target'] = [1 if targ >= 15.0 else 0 for targ in data.arr_del15]
data['target'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


1    6651
0    4522
Name: target, dtype: int64
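
The warning printed above appears because `data` came from `dropna`, which pandas may treat as a slice of the original frame. One common way to avoid it is to take an explicit `.copy()` before adding a column. The sketch below shows this on a made-up frame, together with a vectorized form of the same `>= 15.0` rule. The names `toy` and `subset` are hypothetical, and this is only one possible approach, not necessarily how the notebook should be changed.

```python
import pandas as pd

# Toy frame with one missing value; the numbers are made up.
toy = pd.DataFrame({"arr_del15": [3.0, 15.0, 40.0, None]})

# .copy() returns an independent frame, so assigning a new column
# should not raise the "copy of a slice" warning.
subset = toy.dropna(axis=0).copy()

# Vectorized equivalent of: [1 if targ >= 15.0 else 0 for targ in ...]
subset["target"] = (subset["arr_del15"] >= 15.0).astype(int)

print(subset["target"].tolist())
```

The original frame `toy` is left untouched, which is the behaviour the warning is trying to make explicit.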

#### Now that `target` has been created and added to the dataframe `data`, `arr_del15` needs to be dropped since it is now redundant

In [11]:
data = data.drop('arr_del15', axis=1)
data.columns

Index([u'year', u'month', u'carrier', u'carrier_name', u'airport',
       u'airport_name', u'arr_flights', u'carrier_ct', u'weather_ct',
       u'nas_ct', u'security_ct', u'late_aircraft_ct', u'arr_cancelled',
       u'arr_diverted', u'arr_delay', u'carrier_delay', u'weather_delay',
       u'nas_delay', u'security_delay', u'late_aircraft_delay', u'target'],
      dtype='object')

`data.columns` lists all 21 columns, including the target variable

# Store variables 1 - 6:
### `data`, `col_year_month`, `col_categorical`, `col_target`, `col_numeric`, `combined_list`

In [12]:
%store data

Stored 'data' (DataFrame)


## Group by Categorical / Numeric

In [13]:
col_numeric = data.select_dtypes(include=[np.number]).columns.tolist()
# 'object' replaces np.object, which has been removed from recent NumPy releases
col_categorical = data.select_dtypes(include=['object']).columns.tolist()

`select_dtypes()` classified `year` and `month` as numeric. We separate them here so that the purely numeric columns can be plotted on their own.


Keep `year` and `month` separately in the `col_year_month` variable

In [14]:
# `col_year_month` is the first variable listed in the Quick Summary below
col_year_month = col_numeric[0:2]
col_year_month

['year', 'month']

In [15]:
%store col_year_month

Stored 'col_year_month' (list)


### Remove redundant columns
Of the four columns in `col_categorical`, only two are relevant since the other two contain the same information

In [16]:
# remove redundant columns by keeping every other entry
# `col_categorical` is the second variable listed in the Quick Summary below
col_categorical = col_categorical[0::2]
col_categorical

['carrier', 'airport']

In [17]:
%store col_categorical

Stored 'col_categorical' (list)


### Separate the target so it can be treated separately

In [18]:
# `col_target` is the third variable listed in the Quick Summary below
col_target = col_numeric[16:17]
col_target

['target']

In [19]:
%store col_target

Stored 'col_target' (list)


### Numeric Columns

In [20]:
# reassign `col_numeric` so it excludes 'year', 'month', and 'target'
# `col_numeric` is the fourth variable listed in the Quick Summary below
col_numeric = col_numeric[2:16]
# col_numeric

In [21]:
%store col_numeric

Stored 'col_numeric' (list)
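
The positional slices used above (`[0:2]`, `[16:17]`, `[2:16]`) depend on the exact column order. As a hedged alternative, the same three groups can be built by name, which keeps working if columns are added or reordered. In the sketch below, `all_numeric` is typed out by hand to mirror the numeric columns shown earlier, and `excluded` and `numeric_only` are hypothetical names used only for this illustration.

```python
# Numeric columns in the order reported by `data.columns` above.
all_numeric = [
    'year', 'month', 'arr_flights', 'carrier_ct', 'weather_ct', 'nas_ct',
    'security_ct', 'late_aircraft_ct', 'arr_cancelled', 'arr_diverted',
    'arr_delay', 'carrier_delay', 'weather_delay', 'nas_delay',
    'security_delay', 'late_aircraft_delay', 'target',
]

date_cols = ['year', 'month']
target_cols = ['target']

# Everything that is neither a date column nor the target.
excluded = set(date_cols + target_cols)
numeric_only = [col for col in all_numeric if col not in excluded]

print(len(numeric_only))
```

For this column order, `numeric_only` matches the result of `col_numeric[2:16]`.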


### Combine the lists

In [22]:
pre_combine = [col_year_month, col_categorical, col_numeric, col_target]
combined_list = [item for sublist in pre_combine for item in sublist]
# combined_list
# ['year', 'month', 'carrier', 'airport', 'arr_flights', 'carrier_ct', 'weather_ct', 
#  'nas_ct', 'security_ct', 'late_aircraft_ct', 'arr_cancelled', 'arr_diverted', 
#  'arr_delay', 'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 
#  'late_aircraft_delay', 'target']

In [23]:
%store combined_list

Stored 'combined_list' (list)


In [24]:
# data[combined_list].columns == data.columns
# this raises an error because the combined list does not include 'carrier_name' or 'airport_name'
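
Rather than comparing the two column sets element by element, the missing names can be listed explicitly. The sketch below hard-codes the column names shown earlier in this notebook, so it runs without the dataset. `all_columns`, `combined`, and `missing` are hypothetical names for this illustration.

```python
# The 21 columns reported by `data.columns` after dropping `arr_del15`.
all_columns = [
    'year', 'month', 'carrier', 'carrier_name', 'airport', 'airport_name',
    'arr_flights', 'carrier_ct', 'weather_ct', 'nas_ct', 'security_ct',
    'late_aircraft_ct', 'arr_cancelled', 'arr_diverted', 'arr_delay',
    'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay',
    'late_aircraft_delay', 'target',
]

# The 19 names in `combined_list`, as listed in the comment above.
combined = [
    'year', 'month', 'carrier', 'airport', 'arr_flights', 'carrier_ct',
    'weather_ct', 'nas_ct', 'security_ct', 'late_aircraft_ct',
    'arr_cancelled', 'arr_diverted', 'arr_delay', 'carrier_delay',
    'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay',
    'target',
]

# Columns present in the frame but absent from the combined list.
missing = [col for col in all_columns if col not in combined]

print(missing)
```

With a real frame, the same check could be written as `[c for c in data.columns if c not in combined_list]`.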

In [25]:
# commands to get these variables from another notebook
# %store -r data
# %store -r col_year_month
# %store -r col_categorical
# %store -r col_target
# %store -r col_numeric
# %store -r combined_list