In [None]:
import pandas as pd
import numpy as np

%matplotlib inline

# Data Pipelines

In this exercise, you will practice chaining multiple feature processing steps into a single pipeline, which makes cross-validation and model selection easy.

## Data

We will use the crime rate data that we have used in previous weeks. This time, we will not drop the first few columns or the rows with missing values in them.

In [None]:
from sklearn.model_selection import train_test_split

# Load some crime data
# read_csv's `squeeze=True` argument was removed in pandas 2.0; squeeze the result instead
headers = pd.read_csv('comm_names.txt').squeeze('columns')
headers = headers.apply(lambda s: s.split()[1])
crime = (pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', 
                    header=None, na_values=['?'], names=headers)
#          .iloc[:, 5:]
#          .dropna()
         )

# Set target and predictors
target = 'ViolentCrimesPerPop'
predictors = [c for c in crime.columns if not c == target]

# Train/test split
X = crime[predictors]
y = crime[[target]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
# train_df, test_df = train_test_split(crime, random_state=2)

Always start by taking a look at the first few rows of your data.

## EDA

It's always a good idea to start by asking yourself a few questions about the data. For example:
- What types of features are there?
- Are there missing values?
- What is the distribution of the target?

#### What types of features are there?

#### Are there missing values?

#### What is the distribution of the target?

#### What are the distributions of the features?

It looks like there are both continuous and categorical features. It is usually a good idea to separate them.
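
As a first pass, the split can be made by dtype. The sketch below uses a small made-up frame (the values are hypothetical stand-ins, not rows from the crime data) and pandas' `select_dtypes`. Note that this is only a heuristic: a categorical code that is stored as a number would land in the numeric group and would have to be moved by hand.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crime data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "county": ["39", np.nan, "5", "39"],
    "communityname": ["A", "B", "C", "D"],
    "population": [0.19, 0.0, np.nan, 0.04],
    "ViolentCrimesPerPop": [0.2, 0.67, 0.43, 0.12],
})

# Split column names by dtype: numbers vs. everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Share of missing values per column
missing_share = toy.isna().mean()
```

The same three lines should work on `X_train`, after which the two lists can be adjusted manually.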

##### Numeric

##### Categorical

Both `community` and `communityname` look like they are sliced too thin to be useful. `fold` is probably an index that was added for k-fold cross-validation. So it looks like the only real categorical variable is `county`. We'll leave `communityname` for now, just so we have more than one categorical variable.

## Processing

There are a few obvious things we would like to do with this data before we start trying different models.

1. Impute missing values. For categorical variables this is easy: a good strategy is to add a new level, '?'. For the continuous variables, we need to be a little more careful.
2. Convert the categorical columns to numeric. The sklearn learning algorithms only work with numeric data, so we need to encode the categorical columns using either one-hot encoding or feature hashing.
3. Normalize the numeric features. Some learning algorithms are sensitive to scaling.
4. Consider dimensionality reduction. This dataset has a relatively large number of features compared to a small number of examples (this will be discussed in future classes).
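
For a categorical column, the first two items might look like the sketch below. It is one possible approach, not the notebook's solution: the toy `county` values are made up, and `SimpleImputer` with `strategy="constant"` is used here to add the '?' level before one-hot encoding.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with one missing value (hypothetical data)
X_cat = pd.DataFrame({"county": ["39", np.nan, "5", "39"]}, dtype=object)

categorical_pipeline = Pipeline([
    # Replace missing values with a new level, '?'
    ("impute", SimpleImputer(strategy="constant", fill_value="?")),
    # One column per level; levels unseen during fit are encoded as all zeros
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

encoded = categorical_pipeline.fit_transform(X_cat).toarray()
categories = list(categorical_pipeline.named_steps["encode"].categories_[0])
```

The missing row ends up with its own indicator column, so the model can learn whether "missing" itself carries information.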

There are different strategies for the two feature types (numeric and categorical), so we will treat them individually.

## Numeric Feature Processing Pipeline

For the continuous features, there are two main feature processing steps:
1. Impute missing values
2. Scale features to normalized z-scores.

One can imagine other feature processing steps, e.g. dealing with outliers, discretization, etc., but we will stick with these for now.
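
A minimal sketch of these two steps chained in a `Pipeline` is shown here on a small made-up array. The median strategy and the step names `impute` and `scale` are illustrative choices, not requirements of the exercise.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric matrix with missing entries (hypothetical data)
X_num = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [5.0, 30.0]])

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scale", StandardScaler()),                   # convert each column to z-scores
])

X_scaled = numeric_pipeline.fit_transform(X_num)
```

Because both steps are fitted inside the pipeline, the medians, means, and standard deviations are learned from whatever data is passed to `fit`, which is what keeps test data out of the preprocessing.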

In [None]:
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed in scikit-learn 0.22
from sklearn.pipeline import Pipeline

### Step 1: Select Numeric Features

### Step 2: Impute missing values

### Step 3: Scale

### Put it all in a pipeline

## Categorical Feature Processing Pipeline

### Step 1: Select columns that correspond to categorical features

#### Start a Pipeline

### Step 2: Ensure that each feature has the correct data type

#### Continue the pipeline

### Step 3: Encode as numeric features

#### Continue the pipeline

#### Let's do this with feature hashing instead

In [None]:
from sklearn.feature_extraction import FeatureHasher
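
As a rough illustration of feature hashing, the sketch below hashes two made-up records into a fixed number of columns. The record values and `n_features=8` are arbitrary assumptions; with dict input, a string value is hashed as the combined token `key=value`.

```python
from sklearn.feature_extraction import FeatureHasher

# Each row is a dict of categorical feature -> level (hypothetical values)
rows = [
    {"county": "39", "communityname": "Alphatown"},
    {"county": "?", "communityname": "Betaville"},
]

# The output width is fixed up front, regardless of how many levels exist
hasher = FeatureHasher(n_features=8, input_type="dict")
hashed = hasher.transform(rows)  # sparse matrix with one row per record
```

Unlike one-hot encoding, the hasher is stateless: it does not need to see every level during fitting, at the cost of possible collisions between levels.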



## Combine the two Pipelines

## Try some different models

The great thing about this paradigm is that you can write a whole data processing and modeling pipeline 'in the abstract' without doing anything to your data. Scikit-learn then lets you treat the entire pipeline as one 'model', which allows you to do things like cross-validation and model selection without ever contaminating your test data.
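
To make the idea concrete, here is a sketch that treats preprocessing plus a regressor as a single estimator and cross-validates it. It runs on synthetic data with hypothetical column names, and it uses `ColumnTransformer` to route columns, which is one option among several (a `FeatureUnion` of column-selecting pipelines is another).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with hypothetical column names
rng = np.random.default_rng(0)
n = 60
toy_X = pd.DataFrame({
    "income": rng.normal(size=n),
    "density": rng.normal(size=n),
    "region": rng.choice(["a", "b", "c"], size=n),
})
toy_X["region"] = toy_X["region"].astype(object)
toy_y = 2.0 * toy_X["income"] + rng.normal(scale=0.1, size=n)
toy_X.loc[::10, "income"] = np.nan  # knock out a few values

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="?")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([
    ("num", numeric, ["income", "density"]),
    ("cat", categorical, ["region"]),
])

# The full pipeline behaves like a single estimator
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

# Every preprocessing step is re-fitted on the training part of each fold
scores = cross_val_score(model, toy_X, toy_y, cv=5, scoring="r2")
```

Nothing is computed until `cross_val_score` calls `fit`, so the held-out fold never influences the imputation values or the scaling.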

### Linear Regression

Here is an example using linear regression.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge

#### Define the Pipeline

#### Define some hyper-parameters to search over

#### Grid Search
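
One way a grid search over a pipeline could look is sketched below, on synthetic numeric data with `Ridge` as the regressor. The step names and grid values are arbitrary assumptions; the key convention is that a hyper-parameter is addressed as `<step name>__<parameter>`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a few missing values (hypothetical)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(80, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=80)
X_toy[::9, 0] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("regress", Ridge()),
])

# Keys follow the '<step>__<parameter>' naming convention
param_grid = {
    "impute__strategy": ["mean", "median"],
    "regress__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_toy, y_toy)
best = search.best_params_
```

After fitting, `search` itself can be used as the final model: `search.predict` and `search.score` use the best pipeline refitted on all of the training data.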

**Exercise**: Try to build your own pipeline. You can use a different estimator (e.g. `Ridge()`, `RandomForestRegressor()`, `GradientBoostingRegressor()`, `SVR()`, ...), and you can also add extra steps to the pipeline (e.g., what happens if you binarize the numeric variables? Take the logs? Perform dimensionality reduction with PCA?).

How high can you get your R^2 on the test set?