# Data Preparation Demonstration
#### Noah Rippner | May 2016 ####

![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/maris_zpsl444g4so.jpg)

My hope is that this demonstration helps you to:

1.  identify more clearly with the process of data preparation in data mining, and
2.  further empathize with the data scientist's perspective.

I didn't begin to know what data preparation was about until I walked through it. I want to offer that experience to you. Let's get started! 

> _I drew heavily from and owe credit to this superb [analysis](http://www.analyticsvidhya.com/blog/2015/06/solution-kaggle-competition-bike-sharing-demand/ "Sunil Ray") by Sunil Ray at Analytics Vidhya, wherein he has laid out a fantastic exploration/preparation process for us to go through. Aside from reconstructing Sunil's R code into Python, the exploration, preparation, and modeling process is largely Sunil's creation._



## Summary

Tasks:

1. Exploration
    - Attempt to understand the data's origin, content, and meaning. Form preliminary hypotheses for further exploration
    - Prepare Python environment and load datasets as dataframe objects
    - Examine summary statistics and plots in order to assess data quality
    - Check for duplicates and missing values
    - "Test" hypotheses (multivariate interrelationships) with exploratory data analysis (plots)
2. Preparation
    - Feature Engineering
        * Extract time variables
        * Bin 'hour' and 'temp' predictor variables using decision tree regression to establish partition thresholds
        * Extract day type variables
3. Model building

## Fundamentals

### Data preparation:
1. is iterative throughout the data mining process
2. can require concrete and/or subjective judgment
3. is usually mandatory and crucial to building a good model
4. is often time consuming and exhausting
5. is usually completed using R, Python, or SAS code
6. is context specific -- must be attuned to the business/research scenario as well as the mechanics of the intended model(s)
7. draws on [data preprocessing techniques](https://drive.google.com/open?id=0Byud_5Mue3EYZDgzMklyUXNtYzQ) that have been researched extensively
8. most commonly involves these exploration/preparation tasks in applied data science [(Sunil Ray)](http://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/):
    * Variable Identification
    * Univariate Analysis
    * Multivariate Analysis
    * Missing Value Treatment
    * Variable Transformation (centering, normalization, scaling, etc.)
    * Variable Creation ([Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering))

\***these tasks may be iterated over a large number of times in order to refine a good model**\*
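
As a small illustration of the "Variable Transformation" task in the list above, here is a minimal sketch of centering, standardizing, and min-max scaling a numeric column with pandas. The toy values and the derived column names are made up for the example; they are not from the bike sharing data.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
df = pd.DataFrame({"temp": [10.0, 20.0, 30.0, 40.0]})

# Centering: subtract the mean so the column averages to zero
df["temp_centered"] = df["temp"] - df["temp"].mean()

# Standardizing (z-score): center, then divide by the standard deviation
df["temp_z"] = df["temp_centered"] / df["temp"].std()

# Min-max scaling: squeeze the values into the range [0, 1]
span = df["temp"].max() - df["temp"].min()
df["temp_minmax"] = (df["temp"] - df["temp"].min()) / span

print(df)
```

Which of these (if any) is appropriate depends on the model: tree-based models are largely indifferent to scaling, while distance-based models usually are not.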

The Data Mining Cycle:
![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/crisp-dm_zps065t3ltv.png)

### Data Exploration: Exploratory Data Analysis
![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/tukey_quotes_zps5wuw4qei.jpg)
1. EDA, in contrast with traditional statistics orthodoxy, is a frame of mind that values creative investigation over rigorous adherence to theoretical assumptions.
2. It relies heavily on statistical graphs.
3. > “Most EDA techniques are graphical in nature with a few quantitative techniques. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to open-mindedly explore, and graphics gives the analysts unparalleled power to do so, enticing the data to reveal its structural secrets, and being always ready to gain some new, often unsuspected, insight into the data. In combination with the natural pattern-recognition capabilities that we all possess, graphics provides, of course, unparalleled power to carry this out.” [(Garcia, et al, 2014)](http://www.springer.com/us/book/9783319102467)
4. EDA serves to [(NIST)](http://itl.nist.gov/div898/handbook/) :
    * maximize insight into a data set
    * uncover underlying structure
    * extract important variables
    * detect outliers and anomalies
    * test underlying assumptions
    * develop parsimonious models
5. As I see it, EDA allows the data scientist to:    
    * quickly discover relationships between variables and 'test' hypotheses
    * make informed decisions about data preparation
    * make informed decisions about model building

# Exploration

## Business Understanding
[Kaggle Bike Sharing Competition](https://www.kaggle.com/c/bike-sharing-demand)
> You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
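
To make the split described above concrete, here is a rough sketch that builds a small synthetic hourly timeline and separates it the same way: days 1-19 of each month for training, day 20 onward for testing. The timeline is generated purely for illustration; the real competition files arrive already split.

```python
import pandas as pd

# Synthetic hourly timestamps covering two months (illustration only)
stamps = pd.date_range("2011-01-01", "2011-02-28 23:00", freq=pd.Timedelta(hours=1))
frame = pd.DataFrame({"datetime": stamps})

# Days 1-19 of every month go to training; day 20 onward goes to test
is_train = frame["datetime"].dt.day <= 19
train_part = frame[is_train]
test_part = frame[~is_train]

print(len(train_part), len(test_part))
```

Because the test hours sit at the end of every month, the two sets interleave in time rather than forming one "past" block and one "future" block.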

### Data Dictionary

Feature         | Description    
--------------- |----------
datetime        | timestamp
season          | 1 = spring, 2 = summer, 3 = fall, 4 = winter 
holiday         | whether the day is considered a holiday
workingday      | whether the day is neither a weekend nor holiday
weather         | 1: Clear, Few clouds, Partly cloudy, Partly cloudy 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain Scattered clouds 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog 
temp            | temperature in Celsius
atemp           | 'feels like' temperature in Celsius
humidity        | relative humidity
windspeed       | wind speed in kilometers per hour
casual          | number of non-registered user rentals 
registered      | number of registered user rentals
count           | total number of rentals

## Data Understanding

First, import modules and load data:

In [1]:
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix  # was pandas.tools.plotting before pandas 0.20
from scipy.stats import boxcox
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

with open('bike_share_train.csv') as f:
    train_ = pd.read_csv(f)
with open('bike_share_test.csv') as f:
    test_= pd.read_csv(f)

Inspect data:

In [2]:
train_.head()

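|   | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |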


In [3]:
train_.describe()

|   | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 | 10886.0 |
| mean | 2.506614 | 0.028569 | 0.680875 | 1.418427 | 20.23086 | 23.655084 | 61.88646 | 12.799395 | 36.021955 | 155.552177 | 191.574132 |
| std | 1.116174 | 0.166599 | 0.466159 | 0.633839 | 7.79159 | 8.474601 | 19.245033 | 8.164537 | 49.960477 | 151.039033 | 181.144454 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 0.82 | 0.76 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2.0 | 0.0 | 0.0 | 1.0 | 13.94 | 16.665 | 47.0 | 7.0015 | 4.0 | 36.0 | 42.0 |
| 50% | 3.0 | 0.0 | 1.0 | 1.0 | 20.5 | 24.24 | 62.0 | 12.998 | 17.0 | 118.0 | 145.0 |
| 75% | 4.0 | 0.0 | 1.0 | 2.0 | 26.24 | 31.06 | 77.0 | 16.9979 | 49.0 | 222.0 | 284.0 |
| max | 4.0 | 1.0 | 1.0 | 4.0 | 41.0 | 45.455 | 100.0 | 56.9969 | 367.0 | 886.0 | 977.0 |


In [4]:
train_.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB


Check for duplicates:

In [5]:
data = pd.concat([train_, test_])
# positions of rows that exactly repeat an earlier row
dup = [i for i, is_dup in enumerate(data.duplicated()) if is_dup]
if dup:
    print(dup)
else:
    print("no duplicates")

no duplicates
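
A more compact way to run the same kind of check, sketched on a toy frame: `DataFrame.duplicated()` returns a boolean Series, so summing it counts the repeated rows and indexing with it shows them. The toy rows are invented for illustration.

```python
import pandas as pd

# Toy frame with one exact repeat (rows 0 and 2 are identical)
toy = pd.DataFrame({
    "datetime": ["2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 00:00"],
    "count": [16, 40, 16],
})

# duplicated() marks every row that repeats an earlier row
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())

if n_dups:
    print(toy[dup_mask])
else:
    print("no duplicates")

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()
```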


Check for missing values:

In [6]:
def count_missing(data):
    names = list(data.columns)
    dtypes = list(data.dtypes)
    features = zip(names, dtypes)
    # number of null entries in each column, in column order
    values = [data.iloc[:, i].isnull().sum() for i in range(len(names))]
    # keep only the (feature, count) pairs with at least one missing value
    missing = [i for i in zip(features, values) if i[1]]
    if not missing:
        return "No missing values"
    else:
        return pd.DataFrame(missing, columns=['Feature', '#Miss'])

print(count_missing(train_))
print(count_missing(test_))

No missing values
No missing values
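
For comparison, here is a shorter sketch of the same idea using `isnull().sum()`, which counts missing cells per column in one step. The toy frame and its gaps are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values invented for illustration)
toy = pd.DataFrame({
    "temp": [9.84, np.nan, 9.02],
    "humidity": [81, 80, 80],
    "windspeed": [np.nan, np.nan, 0.0],
})

# isnull() flags each missing cell; summing by column counts them
n_missing = toy.isnull().sum()

# keep only the columns that actually have gaps
report = n_missing[n_missing > 0]
print(report if len(report) else "No missing values")
```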


Let's look at some graphs.

![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/hist_train_zps0yc6jlxb.png)

![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/box_train_zpsw8gpyai2.png)
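
The plotting code isn't shown in this write-up, so here is a rough sketch of how histograms and boxplots like the ones above could be drawn with pandas' matplotlib wrappers. The data is randomly generated, and this is not the code that produced the figures.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data (not the bike sharing data)
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "temp": rng.normal(20, 8, 500),
    "count": rng.exponential(190, 500),  # right-skewed, like a count of rentals
})

# One histogram per numeric column
hist_axes = toy.hist(bins=30, figsize=(8, 3))

# One boxplot per column, in separate panels so their scales don't clash
box_axes = toy.plot(kind="box", subplots=True, figsize=(8, 3))

plt.close("all")  # in a notebook, skip this and let the figures display
```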

Scatter plot matrix of correlations:

![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/scatter_matrix_zps9askcqsw.png)

In [3]:
train_.corr()

|   | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| season | 1.0 | 0.029368 | -0.008126 | 0.008879 | 0.258689 | 0.264744 | 0.19061 | -0.147121 | 0.096758 | 0.164011 | 0.163439 |
| holiday | 0.029368 | 1.0 | -0.250491 | -0.007074 | 0.000295 | -0.005215 | 0.001929 | 0.008409 | 0.043799 | -0.020956 | -0.005393 |
| workingday | -0.008126 | -0.250491 | 1.0 | 0.033772 | 0.029966 | 0.02466 | -0.01088 | 0.013373 | -0.319111 | 0.11946 | 0.011594 |
| weather | 0.008879 | -0.007074 | 0.033772 | 1.0 | -0.055035 | -0.055376 | 0.406244 | 0.007261 | -0.135918 | -0.10934 | -0.128655 |
| temp | 0.258689 | 0.000295 | 0.029966 | -0.055035 | 1.0 | 0.984948 | -0.064949 | -0.017852 | 0.467097 | 0.318571 | 0.394454 |
| atemp | 0.264744 | -0.005215 | 0.02466 | -0.055376 | 0.984948 | 1.0 | -0.043536 | -0.057473 | 0.462067 | 0.314635 | 0.389784 |
| humidity | 0.19061 | 0.001929 | -0.01088 | 0.406244 | -0.064949 | -0.043536 | 1.0 | -0.318607 | -0.348187 | -0.265458 | -0.317371 |
| windspeed | -0.147121 | 0.008409 | 0.013373 | 0.007261 | -0.017852 | -0.057473 | -0.318607 | 1.0 | 0.092276 | 0.091052 | 0.101369 |
| casual | 0.096758 | 0.043799 | -0.319111 | -0.135918 | 0.467097 | 0.462067 | -0.348187 | 0.092276 | 1.0 | 0.49725 | 0.690414 |
| registered | 0.164011 | -0.020956 | 0.11946 | -0.10934 | 0.318571 | 0.314635 | -0.265458 | 0.091052 | 0.49725 | 1.0 | 0.970948 |
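
A full correlation matrix is easier to read once it is reduced to its strongest pairs. The sketch below does that for a small synthetic frame; the column names mimic the bike data but the numbers are generated, and the 0.5 cutoff is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: atemp is built to track temp closely, humidity is independent
rng = np.random.RandomState(0)
temp = rng.normal(20, 8, 1000)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 1, 1000),
    "humidity": rng.uniform(0, 100, 1000),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Reshape to one row per pair, then keep the pairs above the cutoff
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.5].sort_values(ascending=False)
print(strong)
```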


Initial observations:
- our target variables (count, registered, casual) are severely skewed right and outliers are present
- holiday, season, workingday, and weather are categorical.
- no missing values, no duplicates
- atemp, temp, and humidity are fairly normally distributed
- predictors don't appear to be strongly intercorrelated, with the exception of temp/atemp (0.98), humidity/weather (0.41), and temp/season (0.26)
- temp is positively correlated with count (bike rentals) -- higher temperatures are associated with more rentals
- workingday is negatively correlated with casual rentals (-0.32) -- more non-registered people rent bikes on non-working days than on working days
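
The right skew noted in the first observation can be quantified, and a log transform is one common way to reduce it (the tree-fitting code later in this walkthrough fits on `np.log` of the target). A minimal sketch on generated data, not the bike counts:

```python
import numpy as np
import pandas as pd

# Right-skewed stand-in for a count of rentals (generated, illustration only)
rng = np.random.RandomState(0)
counts = pd.Series(np.round(rng.lognormal(mean=4.5, sigma=1.0, size=5000)) + 1)

# Skewness near 0 means roughly symmetric; large positive values mean a long right tail
skew_raw = counts.skew()

# Every value here is at least 1, so a plain log is safe;
# np.log1p is the usual choice when zeros can occur
skew_log = np.log(counts).skew()

print("raw: %.2f, log: %.2f" % (skew_raw, skew_log))
```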

Impressions/hypotheses (what are your hunches about which variables predict/explain the number of bike rentals?)

My initial hypotheses:
1. more rentals when temp is warmer
2. more rentals when it's not raining (lower humidity, weather=1)
3. more rentals when workingday is false
4. more rentals when holiday is true
5. rentals will increase during certain times of day, e.g., commutes to and from work

Let's examine these hypotheses using the exploratory data analysis perspective. But first, let's do some feature engineering.

## Feature Engineering

Extract time variables:

In [10]:
def extract_date(data):
    # parse the timestamp strings once, then pull out each component with the .dt accessor
    data['datetime'] = pd.to_datetime(data['datetime'])
    data['hour'] = data['datetime'].dt.hour
    data['month'] = data['datetime'].dt.month
    data['day'] = data['datetime'].dt.day
    data['weekday'] = data['datetime'].dt.weekday  # Monday=0 ... Sunday=6
extract_date(train_)
extract_date(test_)

weekday variable:

code   |   weekday
-------|----------
0      |   Monday
1      |   Tuesday
2      |   Wednesday
3      |   Thursday
4      |   Friday
5      |   Saturday
6      |   Sunday

Discretize (bin) continuous variables, using decision tree regression to determine the thresholds (scikit-learn's `DecisionTreeRegressor` here, where Sunil's R code uses rpart):

In [8]:
from sklearn import tree
import patsy
import pydot
from IPython.display import Image

In [9]:
def create_tree(dep_var, ind_var, data, pdf=True):
    # build the response and predictor matrices; "+ 0" drops the intercept column
    y, X = patsy.dmatrices("%s ~ %s + 0" % (dep_var, ind_var), data=data)

    # fit on log(y) to tame the right skew; 7 leaves means at most 7 bins
    clf = tree.DecisionTreeRegressor(max_leaf_nodes=7)
    clf = clf.fit(X, np.log(np.ravel(y)))

    # out_file=None makes export_graphviz return the DOT source as a string
    dot_data = tree.export_graphviz(clf, out_file=None)
    graph = pydot.graph_from_dot_data(dot_data)
    if isinstance(graph, list):  # newer pydot releases return a list of graphs
        graph = graph[0]

    if pdf:
        graph.write_pdf('tree_%s.pdf' % ind_var)
    else:
        return Image(graph.create_png())


create_tree('count', 'hour', train_)
create_tree('count', 'temp', train_)

Decision tree results: y=count, x=hour
![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/tree_hour_zpsnrq3epkx.png)
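
Rather than reading the split points off the rendered tree, they can also be pulled from the fitted estimator. The sketch below assumes scikit-learn is installed and uses a generated hour/count pattern (not the bike data); `tree_.threshold` and `tree_.children_left` are attributes of scikit-learn's fitted tree object.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Generated data: rentals jump at 7:00 and drop again at 20:00 (illustration only)
rng = np.random.RandomState(0)
hour = rng.randint(0, 24, size=2000)
busy = (hour >= 7) & (hour < 20)
count = np.where(busy, 200, 20) + rng.randint(0, 10, size=2000)

# Three leaves means two split points
clf = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0)
clf.fit(hour.reshape(-1, 1), np.log(count))

# Leaves have no children (children_left == -1); internal nodes carry a split threshold
t = clf.tree_
is_split = t.children_left != -1
thresholds = sorted(float(x) for x in t.threshold[is_split])
print(thresholds)
```

The sorted thresholds are exactly the bin edges that would otherwise be copied by hand from the tree diagram.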

Discretize the hour and temp variables based on the decision tree boundaries, and bin the dates into consecutive quarters (year_binned):

In [11]:
def bin_hour(data):
    # thresholds are the split points from the hour decision tree
    data['hour_binned'] = 0
    for i, j in data.hour.items():
        if 0 <= j < 1.5:
            data.loc[i,'hour_binned'] = 1
        elif 1.5 < j < 4.5:
            data.loc[i,'hour_binned'] = 2
        elif 4.5 < j < 6.5:
            data.loc[i,'hour_binned'] = 3
        elif 6.5 < j < 10.5:
            data.loc[i,'hour_binned'] = 4
        elif 10.5 < j < 12.5:
            data.loc[i,'hour_binned'] = 5
        elif 12.5 < j < 21.5:
            data.loc[i,'hour_binned'] = 6
        elif 21.5 < j <= 24:
            data.loc[i,'hour_binned'] = 7
        else:
            # value fell on a boundary or outside 0-24
            print(i, j, "problem")
            
bin_hour(train_)
bin_hour(test_)

hour_binned variable:

code    |  hour range
--------|-------------
1       |   12:00 AM - 1:59 AM
2       |    2:00 AM - 4:59 AM
3       |    5:00 AM - 6:59 AM
4       |    7:00 AM - 10:59 AM
5       |   11:00 AM - 12:59 PM
6       |    1:00 PM -  9:59 PM
7       |   10:00 PM - 11:59 PM
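
The same binning can be expressed without a row-by-row loop. This sketch uses `pandas.cut` with the decision-tree split points used in `bin_hour`; the function name `bin_hour_cut` and the constant `HOUR_EDGES` are made up here, and the approach is an alternative to the loop, not the code used in this walkthrough. The temperature bins could be handled the same way with their own edges.

```python
import pandas as pd

# Split points taken from the hour tree; -0.5 and 23.5 just enclose hours 0-23
HOUR_EDGES = [-0.5, 1.5, 4.5, 6.5, 10.5, 12.5, 21.5, 23.5]

def bin_hour_cut(hours):
    """Return bin codes 1-7 for a Series of integer hours (0-23)."""
    # labels=False yields 0-based bin positions; add 1 to match the 1-7 codes
    return pd.cut(hours, bins=HOUR_EDGES, labels=False) + 1

hours = pd.Series(range(24))
binned = bin_hour_cut(hours)
print(binned.value_counts().sort_index())
```

On a frame prepared as in this walkthrough, a call such as `train_['hour_binned'] = bin_hour_cut(train_['hour'])` would fill the column in one step.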

In [12]:
def bin_temp(data):
    data['temp_binned'] = 0
    for i, j in data.temp.items():
        if j <= 6.15:
            data.loc[i,'temp_binned'] = 1
        elif 6.15 < j <= 11.1:
            data.loc[i,'temp_binned'] = 2
        elif 11.1 < j <= 12.7:
            data.loc[i,'temp_binned'] = 3
        elif 12.7 < j <= 19.3:
            data.loc[i,'temp_binned'] = 4
        elif 19.3 < j <= 29.1:
            data.loc[i,'temp_binned'] = 5
        elif 29.1 < j <= 30.8:
            data.loc[i,'temp_binned'] = 6
        elif j > 30.8:
            data.loc[i,'temp_binned'] = 7
        else:
            print(i, j, "problem")

bin_temp(train_)
bin_temp(test_)

temp_binned variable:

code    |  temp range (°C)
--------|-------------
1       |   ≤ 6.15
2       |   (6.15, 11.1]
3       |   (11.1, 12.7]
4       |   (12.7, 19.3]
5       |   (19.3, 29.1]
6       |   (29.1, 30.8]
7       |   > 30.8

In [13]:
def bin_year(data):
    data['year_binned'] = 0
    for i, j in data.datetime.items():
        if (j.year == 2011 and 0 <j.month<= 3):
            data.loc[i,'year_binned'] = 1
        elif (j.year == 2011 and 3<j.month<=6):
            data.loc[i, 'year_binned'] = 2
        elif (j.year == 2011 and 6<j.month<=9):
            data.loc[i, 'year_binned'] = 3
        elif (j.year == 2011 and 9<j.month<=12):
            data.loc[i, 'year_binned'] = 4
        elif (j.year == 2012 and 0<j.month<= 3):
            data.loc[i,'year_binned'] = 5
        elif (j.year == 2012 and 3<j.month<=6):
            data.loc[i, 'year_binned'] = 6
        elif (j.year == 2012 and 6<j.month<=9):
            data.loc[i, 'year_binned'] = 7
        elif (j.year == 2012 and 9<j.month<=12):
            data.loc[i, 'year_binned'] = 8
        elif (j.year > 2012):
            data.loc[i, 'year_binned'] = 8
        else:
            print(i, j, "problem")
            
bin_year(train_)
bin_year(test_)
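
`bin_year` numbers the quarters consecutively from 2011 Q1 (code 1) to 2012 Q4 (code 8). As a sketch, the same codes can be computed arithmetically with the `.dt` accessor; `quarter_index` is a hypothetical helper name, and unlike `bin_year` it does not cap later years at 8.

```python
import pandas as pd

def quarter_index(stamps, first_year=2011):
    """Number quarters consecutively: first_year Q1 -> 1, ..., the next year's Q1 -> 5."""
    return (stamps.dt.year - first_year) * 4 + stamps.dt.quarter

# A few sample dates (chosen for illustration)
stamps = pd.Series(pd.to_datetime([
    "2011-01-15", "2011-06-30", "2011-12-01", "2012-02-29", "2012-11-11",
]))
codes = quarter_index(stamps)
print(codes.tolist())
```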

Create a 'day_type' variable:

In [14]:
def daytype(data):
    data['day_type'] = 0
    for i, j in data.holiday.items():
        if (j==0 and data.loc[i,'workingday']==0):
            data.loc[i, 'day_type'] = 1
        elif j == 1:
            data.loc[i, 'day_type'] = 2
        elif (j==0 and data.loc[i, 'workingday']==1):
            data.loc[i, 'day_type'] = 3
        else:
            print(i, j, "problem")

daytype(train_)
daytype(test_)

day_type variable:

code     |   day_type
---------|------------
1        |  weekend
2        |   holiday
3        |   work day
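
The three-way rule behind `day_type` can also be written as a single vectorized expression. This is a sketch with `numpy.select` on invented rows, offered as an alternative to the loop in `daytype`, and it assumes `holiday` and `workingday` only take the values 0 and 1.

```python
import numpy as np
import pandas as pd

# Toy rows covering the three cases (values invented for illustration)
toy = pd.DataFrame({
    "holiday":    [0, 1, 0, 0],
    "workingday": [0, 0, 1, 1],
})

# Conditions are checked in order; rows matching none fall through to the default
toy["day_type"] = np.select(
    [toy["holiday"] == 1, toy["workingday"] == 1],
    [2, 3],      # 2 = holiday, 3 = work day
    default=1,   # 1 = weekend
)
print(toy)
```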

Now, let's use the Exploratory Data Analysis (EDA) framework to look for relationships within the data:

Boxplots by hour:

![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/box_by_hour_zpssznl15ev.png)

- registered users rent more bikes than casual (non-registered)
    - they rent more around the morning and evening commutes -- clearly they're using the bike share for their transportation to and from work
- this does not account for weather, temp, weekends, or holidays, though.
- there are a lot of outliers, especially for casual users
    - what might account for these outliers?
    
Let's try to learn more:

![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/box_by_day_type_zpsfqgbmard.png)

Interesting. 
- day type (1:weekend, 2:holiday, 3:work day) on its own doesn't really predict number of rentals
- on work days there are a lot of outliers for registered users...?
- we see outliers for casual users on weekends and work days, but fewer on holidays
- clearly, there are additional factors in play...maybe weather or temperature?
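
Under the usual boxplot convention (matplotlib's default), the points drawn beyond the whiskers are values more than 1.5 times the interquartile range outside the quartiles. Here is a sketch that counts such points per group, using generated data with column names borrowed from this walkthrough; `count_outliers` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Generated stand-in data: 200 rows for each of the three day types
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "day_type": np.repeat([1, 2, 3], 200),
    "casual": rng.exponential(scale=30, size=600),  # skewed, so high outliers are expected
})

def count_outliers(values):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

outliers_by_group = toy.groupby("day_type")["casual"].apply(count_outliers)
print(outliers_by_group)
```

With a right-skewed variable, a fair number of points will land beyond the upper whisker by this rule alone, so "outlier" here does not necessarily mean "bad data".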

Let's look at one or two more. We remember from our correlation matrix that, among the predictors, temp had the highest correlation with rentals:

![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/box_by_temp_zpsidefgp0o.png)

Hmmmmm. Usage increases along with temperature, but it doesn't explain away our outliers.

Let's look at one more -- grouped by both day_type and temperature:

![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/multi_day_temp_zps0k3xpzki.png)

This is cluttered, but we see some patterns.
Let's try zooming in for a closer look:

![missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/day_temp_zoom_zpsntqiou97.png)

Stopping our exploration here, so far we have:
- been unable to completely account for our outliers -- something else is in play
- seen that registered users appear to be using the bike share for their daily commutes
- found that, interestingly, day type doesn't directly predict usage rates
- found that temperature is a direct predictor of usage

We could definitely continue to explore. 

# Summary:

- this data exploration and preparation example included:
    - summary statistics
    - visualizations
    - transformations
    - feature engineering
- we actually used machine learning to help prepare the data
- our exploration of summary statistics and graphs was essential to knowing how to prepare the data
- this is a very simple example -- the number of dimensions is small, we had no missing values or duplicates, there was little redundancy in the variables (temp and atemp aside), and the data was overall very clean. Nonetheless, a fairly intensive preparation process was involved just to get our data ready for a first iteration of the modeling process. Refining a *good* model will require multiple iterations, using practical judgment and trial and error to try different configurations of data preparation and machine learning/modeling options.
- I think of data preparation as inextricable from the larger data analysis process -- with business understanding, data exploration, and data modeling -- it's not just a preliminary step. It's an iterative, if not ongoing, activity during data mining.

## Thoughts
1. a minimally useful collection of data preparation tools (if defined as we have in this example) would require a huge number of diverse functions
2. when people have tried to create tools to simplify the process (there are a lot, such as Knime, Trifacta, etc.) they haven't really caught on
    - in my opinion, data scientists prefer R, Python or SAS to these GUI-based tools due to the difficulty navigating via point-and-click through such complex visuo-spatial software environments
3. Is there a minimal set of preparation/exploration tasks to allow useful and satisfying "pre-preparation/pre-exploration" while maintaining the parsimony of the data.world product?
4. data exploration is so much easier when you can produce visualizations as well as numeric summaries
5. the Python Pandas library is, in my opinion, the pinnacle of data manipulation tools at this time. It's so amazing. I hear that in R plyr and dplyr are also good.
6. The 'business understanding' component is huge in the real world. 
    - a data dictionary is a must
    - where did the data come from? can its quality be validated? is there a "story" to the data's measurement/collection that can help the data scientist derive insight?
7. Finally, I hope this has been relevant and useful! I'll leave you with this quote from George Clason (whoever he was)

![Missing](http://i882.photobucket.com/albums/ac24/Noah_Rippner/clason_zps4qdtsfb1.jpg)