In [None]:
import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin

%matplotlib inline

# Data Pipelines

In this exercise, you will practice pasting together multiple feature processing steps into a single pipeline that allows for easy cross-validation and model selection.

## Data

We will use the crime rate data that we have used in previous weeks. This time, we will not drop the first few columns or the rows with missing values in them.

In [None]:
from sklearn.model_selection import train_test_split

# Load some crime data
headers = pd.read_csv('comm_names.txt').squeeze('columns')  # the squeeze= keyword was removed in pandas 2.0
headers = headers.apply(lambda s: s.split()[1])
crime = (pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', 
                    header=None, na_values=['?'], names=headers)
#          .iloc[:, 5:]
#          .dropna()
         )

# Set target and predictors
target = 'ViolentCrimesPerPop'
predictors = [c for c in crime.columns if not c == target]

# Train/test split
X = crime[predictors]
y = crime[[target]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
# train_df, test_df = train_test_split(crime, random_state=2)

Always start by taking a look at the first few rows of your data.

In [None]:
X_train.head()

## EDA

It's always a good idea to start by asking yourself a few questions about the data. For example:
- What types of features are there?
- Are there missing values?
- What is the distribution of the target?

#### What types of features are there?

In [None]:
X_train.dtypes.value_counts()

In [None]:
X_train.dtypes

#### Are there missing values?

In [None]:
X_train.isnull().sum()
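
Raw counts are often easier to read as a fraction of rows. The following is a minimal sketch on a toy frame; the column names `a`, `b`, and `c` are made up for illustration and are not columns of the crime data.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, 3.0, 4.0],
    'c': [np.nan, 2.0, 3.0, 4.0],
})

# Fraction of missing values per column, largest first
missing_frac = toy.isnull().mean().sort_values(ascending=False)

# Show only the columns that actually have missing values
print(missing_frac[missing_frac > 0])
```

The same two lines should work unchanged on `X_train`.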

#### What is the distribution of the target?

In [None]:
y_train.hist()

#### What are the distributions of the features?

It looks like there are both continuous and categorical features. It is usually a good idea to separate them.

In [None]:
numeric_predictors = X_train.columns[4:]
categorical_predictors = ['county', 'community', 'fold', 'communityname']

##### Numeric

In [None]:
X_train[numeric_predictors].describe().T

##### Categorical

In [None]:
for col in categorical_predictors:
    print(col)
    print(X_train[col].value_counts().head())

Both `community` and `communityname` look like they are sliced too thin to be useful. `fold` is probably an index that was added for k-fold cross-validation. So it looks like the only real categorical variable is `county`.
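
One way to quantify "sliced too thin" is the ratio of distinct levels to rows. This is only a sketch on invented toy values, not the real data; a ratio close to 1 means nearly every row has its own level.

```python
import pandas as pd

# Toy data standing in for the categorical columns (values are invented)
toy = pd.DataFrame({
    'county': ['1', '1', '2', '2', '2', '3'],
    'communityname': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Distinct levels divided by number of rows, per column
cardinality = toy.nunique() / len(toy)
print(cardinality)
```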

In [None]:
categorical_predictors = ['county']

## Processing

There are a few obvious things we would like to do with this data before we start trying different models.

1. Impute missing values. For categorical variables this is easy: a good strategy is to add a new level, '?'. For the continuous variables, we need to be a little more careful.
2. Convert the categorical column to numeric. All of our sklearn learning algorithms only work with numeric data, so we need either one-hot encoding or feature hashing.
3. Normalize the numeric features. Some learning algorithms are sensitive to scaling.
4. Reduce dimensionality. This dataset has a relatively large number of features compared to a small number of examples, so we might want to try some dimensionality reduction (will be discussed in future classes).

There are different strategies for the two feature types (numeric and categorical), so we will treat them individually.

### Categorical Features

In [None]:
X_train[categorical_predictors].head()

Even though county is being represented with floating point numbers, we don't want the learning algorithm to treat it that way, so we should explicitly change it to a string.

In [None]:
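# Change to strings. Assigning the whole column replaces its dtype;
# recent pandas versions try to keep the old float dtype with .loc[:, col] = ...
for col in categorical_predictors:
    X_train[col] = X_train[col].astype(str)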

In [None]:
X_train.county.value_counts()

This also takes care of the missing values in the categorical column. The NaN (not a number) entries have been changed to the string 'nan', which is treated the same as any other category level.

Now, the approach above is fine, but sklearn encourages us to treat feature processing and engineering in a very principled way. Feature processing steps in sklearn always have a .fit() method and a .transform() method, which has several advantages:
1. It is easy to combine and/or chain together multiple processing steps.
2. It helps keep the training and testing data separate, since .fit() only deals with training data and .transform() deals with both training and test data.
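
To make the fit/transform split concrete, here is a small sketch. `MeanCenterer` is a hypothetical transformer written for this illustration (it is not part of sklearn): it learns column means in .fit() and reuses them in .transform().

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtract the column means learned in fit()."""

    def fit(self, X, y=None):
        # Statistics are learned from whatever data is passed to fit()
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Only the stored statistics are used here, never the new data's own mean
        return np.asarray(X, dtype=float) - self.means_

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

mc = MeanCenterer()
train_centered = mc.fit_transform(train)  # learns the mean from the training rows
test_centered = mc.transform(test)        # reuses the training mean on the test row
print(mc.means_, test_centered)
```

The test row is shifted by the training mean, so nothing about the test data leaks into the learned statistics.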

For example, we could write the feature processing step above 'the sklearn way':

In [None]:
class CategoricalImputer(BaseEstimator, TransformerMixin):
    
    def __init__(self, cols):
        self.cols = cols
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified in place
        transformed_df = X.copy()
        for col in self.cols:
            transformed_df[col] = transformed_df[col].astype(str)
        return transformed_df
            
        

In [None]:
ci = CategoricalImputer(categorical_predictors)
transformed_train = ci.fit_transform(X_train)
transformed_test = ci.transform(X_test)

## Categorical --> Numeric

Now let's consider two different ways of transforming categorical features to numeric features.

1: Feature Hashing

In [None]:
from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(n_features=5)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
fh.fit(feature_dict)
out = pd.DataFrame(fh.transform(feature_dict).toarray())
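
A toy run can make the hashing behaviour easier to see. The records below are invented; the sketch assumes string-valued features, which FeatureHasher hashes as `name=value` pairs into a fixed number of columns.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Invented records with a single categorical feature
records = [{'county': '1.0'}, {'county': '35.0'}, {'county': '1.0'}]

# The hasher is stateless, so transform() can be called without fitting
fh_toy = FeatureHasher(n_features=5)
hashed = fh_toy.transform(records).toarray()

# The width is fixed by n_features, no matter how many levels exist
print(hashed.shape)
print(hashed)
```

Identical levels always land in the same column, but two different levels may collide, which is the price paid for a fixed output width.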

2: One-hot encoding

In [None]:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
dv.fit(feature_dict)
out = pd.DataFrame(
    dv.transform(feature_dict),
    columns = dv.feature_names_
)
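
One detail worth checking is what happens to a level that appears only in the test data. A small sketch with invented records: DictVectorizer learns its columns in .fit() and silently ignores feature names it has not seen, so an unseen level becomes a row of zeros.

```python
from sklearn.feature_extraction import DictVectorizer

# Invented records; '99.0' appears only at transform time
train_records = [{'county': '1.0'}, {'county': '35.0'}]
test_records = [{'county': '35.0'}, {'county': '99.0'}]

dv_toy = DictVectorizer(sparse=False)
dv_toy.fit(train_records)

# Columns are fixed by the levels seen during fit
print(dv_toy.feature_names_)

encoded_test = dv_toy.transform(test_records)
print(encoded_test)
```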

**Exercise**: Examine all of the objects in the above two code cells and make sure you understand what is happening. Next, write a class MyVectorizer. The goal is to write a single transformation step that can perform either one-hot encoding or feature hashing, depending on the argument.

In [None]:
from sklearn.feature_extraction import FeatureHasher, DictVectorizer

class MyVectorizer(BaseEstimator, TransformerMixin):
    """
    Vectorize a set of categorical variables
    """
    
    def __init__(self, cols, hashing=None):
        """
        args:
            cols: a list of column names of the categorical variables
            hashing: 
                If None, then vectorization is a simple one-hot-encoding.
                If an integer, then hashing is the number of features in the output.
        """
        self.cols = cols
        self.hashing = hashing
        
    def fit(self, X, y=None):
        ### Your code goes here
        return self
            
    def transform(self, X):
        
        pass
            



**Test:**

One-hot encoding

In [None]:
mv = MyVectorizer(cols=categorical_predictors, hashing=None)
transformed_train = mv.fit_transform(X_train)
transformed_test = mv.transform(X_test)

Feature hashing

In [None]:
mv = MyVectorizer(cols=categorical_predictors, hashing=5)
transformed_train = mv.fit_transform(X_train)
transformed_test = mv.transform(X_test)

### Numeric Features

For the continuous features, there are two main feature processing steps:
1. Impute missing values
2. Scale features to normalized z-scores.

One can imagine other feature processing steps, e.g. dealing with outliers, discretization, etc., but we will stick with these for now.

### Impute Missing Values

**Exercise:** Write your own class, MyImputer, that takes as an argument the columns whose missing values you would like to impute.

In [None]:
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22

class MyImputer(BaseEstimator, TransformerMixin):
    
    def __init__(self, cols):
        self.cols = cols
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        pass
        

**Test:**

In [None]:
imp = MyImputer(numeric_predictors)
transformed_train = imp.fit_transform(X_train)
transformed_test = imp.transform(X_test)
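
For comparison, this is how sklearn's built-in SimpleImputer behaves on a toy array (the numbers are invented). It is a sketch of the library class on its own, not a solution to the exercise, which still needs to handle column selection.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented one-column arrays with missing entries
train = np.array([[1.0], [np.nan], [3.0]])
test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(train)                      # the mean is computed from non-missing training values
filled_test = imputer.transform(test)   # missing test values get the training mean

print(imputer.statistics_)
print(filled_test.ravel())
```

Switching `strategy` to `'median'` is a one-word change.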

### Scale

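In addition to imputing missing values, we also want to scale the numeric columns. We can do this using StandardScaler, but only after the missing values have been imputed: depending on the scikit-learn version, NaNs either raise an error or are passed through untouched, and most downstream estimators cannot handle them.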

So it makes sense that imputation and scaling are preprocessing steps that happen in sequence. This is what sklearn's Pipeline() is for. Since each processing step (a transformer, in sklearn nomenclature) has a .fit() and a .transform() method, the steps can easily be linked together.
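
To see what chaining means in practice, here is a sketch using two unrelated built-in transformers on an invented array. It checks that the pipeline gives the same result as feeding the output of the first step into the second by hand.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Invented toy data
X_toy = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])

# A pipeline is a list of (name, transformer) pairs applied in order
toy_pipe = Pipeline([
    ('rescale', MinMaxScaler()),
    ('expand', PolynomialFeatures(degree=2, include_bias=False)),
])
piped = toy_pipe.fit_transform(X_toy)

# The same computation, done step by step
step1 = MinMaxScaler().fit_transform(X_toy)
manual = PolynomialFeatures(degree=2, include_bias=False).fit_transform(step1)

print(np.allclose(piped, manual))
```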

**Exercise:** Define a pipeline that first imputes missing values and then scales all the continuous variables to have mean=0 and variance=1.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    ### Your code goes here
)

**Test:**

In [None]:
transformed_train = pipe.fit_transform(X_train)
transformed_test = pipe.transform(X_test)

### Combine Features

At this point, we have two 'threads' going. We have a couple of transformations that make sense for categorical variables, and a pipeline of transformations that make sense for the continuous variables. Now, let's put it all together into one big preprocessing object.

**Exercise**: Combine the categorical steps into a single pipeline.

In [None]:
categorical_pipe = Pipeline(
    ### Your code goes here
)

**Test:**

In [None]:
transformed_train = categorical_pipe.fit_transform(X_train)
transformed_test = categorical_pipe.transform(X_test)

**Exercise:** Use sklearn's FeatureUnion to combine both of your pipelines (one continuous and one categorical) into a single step.

In [None]:
from sklearn.pipeline import FeatureUnion

fu = FeatureUnion( 
    ### Your code goes here  
)

In [None]:
transformed_train = fu.fit_transform(X_train)
transformed_test = fu.transform(X_test)
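
If the output shape is surprising, it may help to look at FeatureUnion on a toy array first. In this sketch (invented data, built-in transformers only) each transformer sees the same input, and the outputs are stacked side by side.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Invented toy data: 4 rows, 2 columns
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])

toy_union = FeatureUnion([
    ('scaled', StandardScaler()),   # contributes 2 columns
    ('pc1', PCA(n_components=1)),   # contributes 1 column
])
combined = toy_union.fit_transform(X_toy)

# Columns from each transformer are concatenated in the order listed
print(combined.shape)
```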

## Try some different models

The great thing about this paradigm is that you can write a whole data processing and modeling pipeline 'in the abstract' without doing anything to your data. Scikit-learn then lets you treat the entire pipeline as one 'model', which allows you to do things like cross-validation and model selection without ever contaminating your test data.
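
As a small illustration of treating a pipeline as one model, the sketch below cross-validates a scaler plus a regressor on synthetic data from `make_regression` (not the crime data). In each fold, the whole pipeline is refit on that fold's training portion only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)

toy_model = Pipeline([
    ('scale', StandardScaler()),
    ('predict', LinearRegression()),
])

# The pipeline is passed where a plain estimator would normally go
scores = cross_val_score(toy_model, X_toy, y_toy, cv=5)
print(scores.mean())
```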

### Linear Regression

Here is an example using linear regression.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA  # needed for the 'reduce_dim' step in the pipeline

#### Define the Pipeline

In [None]:
# Despite the variable name, the final step here is plain LinearRegression
ridge_pipeline = Pipeline([
    ('preprocess', FeatureUnion([
        ('numeric', Pipeline([
            ('impute', MyImputer(cols=numeric_predictors)),
            ('scale', StandardScaler()),
            ('reduce_dim', PCA())
        ])
        ),
        ('categorical', Pipeline([
            ('impute', CategoricalImputer(cols=['county'])),
            ('vectorize', MyVectorizer(cols=['county']))
        ])
        )
    ])),
    ('predict', LinearRegression())
])

#### Define some hyper-parameters to search over

In [None]:
search_params = {
    'preprocess__categorical__vectorize__hashing': [None, 20, 40, 80],
    'preprocess__numeric__reduce_dim__n_components': [10, 20, 40, 80, 100]
}
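
The keys above follow sklearn's `step__parameter` naming convention, with one double underscore per level of nesting. A minimal sketch on a smaller, made-up pipeline shows how to list the available names and that the same names work with set_params().

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A made-up two-level pipeline, smaller than the one in this notebook
toy_pipe = Pipeline([
    ('preprocess', Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dim', PCA()),
    ])),
    ('predict', LinearRegression()),
])

# Every tunable parameter name that belongs to the PCA step
pca_params = sorted(k for k in toy_pipe.get_params() if 'reduce_dim__' in k)
print(pca_params)

# The same nested name can be used to set a value directly
toy_pipe.set_params(preprocess__reduce_dim__n_components=2)
```

Calling `get_params()` on your own pipeline is a quick way to catch a misspelled key before starting a long grid search.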


#### Grid Search

In [None]:
grid_search = GridSearchCV(ridge_pipeline, search_params)
grid_search.fit(X_train, y_train)
grid_search.best_params_

In [None]:
grid_search.best_score_
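
`best_score_` is the mean cross-validated score of the best parameter combination; for a regressor with the default scorer, that score is R^2. To compare all combinations rather than only the winner, `cv_results_` can be loaded into a DataFrame. The sketch below uses synthetic data and a Ridge model chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

toy_search = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]}, cv=3)
toy_search.fit(X_toy, y_toy)

# One row per parameter combination
results = pd.DataFrame(toy_search.cv_results_)
print(results[['param_alpha', 'mean_test_score', 'rank_test_score']])
```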

In [None]:
grid_search.score(X_test, y_test)

**Exercise**: Try to build your own pipeline. You can use a different estimator (e.g. Ridge(), RandomForestRegressor(), GradientBoostingRegressor(), SVR(), ...), and you can also add additional variables to the steps in the pipeline (e.g., what happens if you impute missing values based on median instead of mean?)

How high can you get your R^2 on the test set?