# Lab 06 - Data Transformation

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
%matplotlib inline
sns.set_style("darkgrid")

import sys
sys.path.append('../')
from lib.processing_functions import convert_to_pandas

## Exercise goals:

- practice with building data pipelines
- perform feature extraction from text data


---
## Exercise 1: Building a pipeline

We revisit the Boston dataset to see if adding extra polynomial terms to the data improves the prediction performance of a `Lasso` model. We will build a pipeline for figuring this out.

In [None]:
# load the Boston dataset
X, y = convert_to_pandas(datasets.load_boston())

### 1.1 Initialize estimator objects

Two preprocessing steps will be performed before fitting the data: standardization of data and addition of polynomial terms.

Let's start by creating the `StandardScaler` and `PolynomialFeatures` transformers.
Initialize the `PolynomialFeatures` transformer with `include_bias=False`:

```python
# TODO: Replace <FILL IN> with appropriate code
# import the StandardScaler class
<FILL IN>

# initialize standard scaler
scaler = <FILL IN>


# TODO: Replace <FILL IN> with appropriate code
# import the PolynomialFeatures class
<FILL IN>

# initialize polynomial feature creator
polynomial = <FILL IN>
```

In [None]:
%load ../answers/06_01_initialize.py
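
To get a feel for how quickly a polynomial expansion grows the feature space, the following is a minimal sketch. It assumes a dummy all-zeros array with 13 columns (the number of features in the Boston data) purely to count output columns. The names `X_dummy` and `feature_counts` are illustrative and not part of the lab's answer files.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# dummy input: 1 sample with 13 features (the values do not matter for counting)
X_dummy = np.zeros((1, 13))

feature_counts = {}
for degree in [1, 2, 3]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    # the number of output columns depends only on the input width and the degree
    feature_counts[degree] = poly.fit_transform(X_dummy).shape[1]
    print("degree {}: {} features".format(degree, feature_counts[degree]))
```

With `include_bias=False` the constant column of ones is left out, which is reasonable here because the linear model fits its own intercept.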

Initialize the `Lasso` estimator. In contrast to the previous lab, we will not use the variant with built-in cross-validation. Note that normalization is not necessary because the data is already standardized in the preprocessing step:

In [None]:
from sklearn.linear_model import Lasso

# Initialize the lasso estimator
lasso = Lasso()

### 1.2 Create the pipeline

Combine the preprocessing transformers and estimator in a pipeline:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.pipeline import Pipeline

# create the pipeline object
pipeline = Pipeline([('scaler', <FILL IN>),
                     ('polynomial', <FILL IN>),
                     ('lasso', <FILL IN>)])
```

In [None]:
%load ../answers/06_02_pipeline.py
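
As a rough illustration of what a pipeline does when it is fitted, the sketch below compares a pipeline with the same steps written out by hand. It assumes synthetic data from `make_regression` instead of the Boston data, and all `toy_*` names as well as the chosen `alpha=0.1` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# synthetic stand-in data (assumption: not the Boston data)
X_toy, y_toy = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=0)

# pipeline: each transformer is fitted and applied in order, then the estimator is fitted
toy_pipeline = Pipeline([('scaler', StandardScaler()),
                         ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
                         ('lasso', Lasso(alpha=0.1))])
toy_pipeline.fit(X_toy, y_toy)

# the same steps written out by hand
toy_scaler = StandardScaler()
toy_polynomial = PolynomialFeatures(degree=2, include_bias=False)
toy_lasso = Lasso(alpha=0.1)
X_scaled = toy_scaler.fit_transform(X_toy)
X_poly = toy_polynomial.fit_transform(X_scaled)
toy_lasso.fit(X_poly, y_toy)

# both routes should give the same predictions
same_predictions = np.allclose(toy_pipeline.predict(X_toy), toy_lasso.predict(X_poly))
print("identical predictions:", same_predictions)
```

The main benefit of the pipeline form is that the whole chain behaves like a single estimator, so it can be passed to cross-validation tools without leaking information from held-out folds into the preprocessing steps.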

### 1.3 Optimize with grid search

Now perform the grid search over the specified parameters of objects in the pipeline:

In [None]:
from sklearn.model_selection import GridSearchCV

# choose hyperparameter grid
param_grid = {'lasso__alpha': np.logspace(-.25,1,10),
              'polynomial__degree': [1,2,3]}

# perform cross-validated gridsearch 
grid_pipeline = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')

# fit the data
grid_pipeline.fit(X, y)

# determine best parameters
grid_pipeline.best_params_
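
The keys of `param_grid` follow the `<step name>__<parameter name>` convention, and the `neg_mean_squared_error` scorer reports the negated error so that larger is always better. The sketch below illustrates both points under the assumption of a small synthetic dataset and a shortened two-step pipeline; the `toy_*` names and the alpha values are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption: not the Boston data)
X_toy, y_toy = make_regression(n_samples=60, n_features=4, noise=10.0, random_state=0)

toy_pipeline = Pipeline([('scaler', StandardScaler()), ('lasso', Lasso())])

# parameters of a step are addressed as '<step name>__<parameter name>'
lasso_params = sorted(name for name in toy_pipeline.get_params()
                      if name.startswith('lasso__'))
print(lasso_params)

toy_grid = GridSearchCV(toy_pipeline, param_grid={'lasso__alpha': [0.1, 1.0, 10.0]},
                        cv=5, scoring='neg_mean_squared_error')
toy_grid.fit(X_toy, y_toy)

# the score is the negated MSE, so flip the sign to read it as an error
best_mse = -toy_grid.best_score_
print("best alpha: {}, cross-validated MSE: {:.2f}".format(
    toy_grid.best_params_['lasso__alpha'], best_mse))
```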

**Question**: For which objects in the pipeline did we optimize hyperparameters, and what cross-validation scoring strategies were used? Did adding the polynomial features to our data make sense?

### 1.4 Estimated model parameters

Let's have a look at the estimated model parameters of the best Lasso estimator resulting from the grid search.

Extract the `Lasso` estimator from the best estimator of the `grid_pipeline` object (hint: look up how to get an estimator from a pipeline by naming the step):

```python
# TODO: Replace <FILL IN> with appropriate code
best_pipeline = grid_pipeline.<FILL IN>
best_lasso = best_pipeline.<FILL IN>
```

In [None]:
%load ../answers/06_03_best.py

Adding the second degree polynomial features to the data increases the number of features. Determine the number of features from the shape of the `coef_` vector:

In [None]:
n_features = best_lasso.coef_.shape
print("number of features: {}".format(n_features[0]))

With this many features it is important to prevent overfitting. Lasso protects against overfitting by applying 'l1' regularization. A nice property of this type of regularization is that the resulting models are sparse, meaning many coefficients are exactly zero.
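
The sparsity effect can be seen on a small example. The following is a minimal sketch, assuming synthetic data from `make_regression` in which only 3 of 20 features influence the target; the names `toy_lasso` and `n_nonzero` and the choice `alpha=1.0` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# synthetic data: 20 features, only 3 of which influence the target
X_toy, y_toy = make_regression(n_samples=100, n_features=20, n_informative=3,
                               noise=1.0, random_state=0)

toy_lasso = Lasso(alpha=1.0).fit(X_toy, y_toy)

# count the coefficients that the l1 penalty did not shrink to exactly zero
n_nonzero = int(np.sum(toy_lasso.coef_ != 0))
print("non-zero coefficients: {} of {}".format(n_nonzero, toy_lasso.coef_.size))
```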

Let's visualize the estimated model coefficients:

In [None]:
pd.DataFrame(best_lasso.coef_, columns=['coef']).plot(kind='bar', figsize=(15,4))

**Question**: The visualization shows that around 20% of the 104 coefficients are non-zero. What would be an advantage of such a sparse model? Does 'l2' regularization (used in Ridge regression) also result in sparse models?

---
## Exercise 2: Text classification

We will train a classifier on text data to determine whether a comment contains an insult or not.

The labeled data is split into train and test datasets:

In [None]:
# load the insult datasets
train_data = pd.read_csv("../data/insult_train.csv") 
test_data = pd.read_csv("../data/insult_test.csv") 

test_data.head()

### 2.1 Extract features and targets

Extract the targets from the 'Insult' column and the features from the 'Comment' column of the dataframe:

```python
# TODO: Replace <FILL IN> with appropriate code
# create X, y data sets for both testing and training
text_train = <FILL IN>
y_train = <FILL IN>
text_test = <FILL IN>
y_test = <FILL IN>
```

In [None]:
%load ../answers/06_04_extract.py

We will use the `TfidfVectorizer` transformer to convert the text to feature vectors that can be fed to a classification algorithm. 

This transformer outputs sparse feature vectors based on token counts (word counts). The counts are reweighted by how often each token occurs across the different documents (comments); this weighting is called tf-idf.
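
To make the output of the transformer concrete, here is a small sketch on three made-up comments. The documents and the `toy_*` names are assumptions for illustration only and do not come from the insult data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# three made-up "comments" (assumption: illustrative only)
toy_docs = ["you are great", "you are not great", "great day"]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_docs)

# the learned vocabulary maps each token to a column index
toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)

# one row per document, one column per token, stored as a sparse matrix
print(X_toy.shape)

# with the default norm='l2', every row has unit Euclidean length
row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
print(row_norms)
```

Tokens that appear in every document, such as "great" here, receive a lower weight than tokens that appear in only one document, such as "day".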

Initialize the `TfidfVectorizer` transformer. 
Fit it on the train text and check the size of the resulting vocabulary:

```python
# TODO: Replace <FILL IN> with appropriate code
# import transformer class
<FILL IN>

# initialize the TfidfVectorizer
vectorizer = <FILL IN>

# TODO: Replace <FILL IN> with appropriate code
# fit the train text
vectorizer.<FILL IN>

# print vocabulary size
vocabulary_size = len(vectorizer.<FILL IN>)

print('vocabulary size: {}'.format(vocabulary_size))

# TODO: Replace <FILL IN> with appropriate code
# transform train text
X_train = vectorizer.<FILL IN>
#transform test text
X_test = vectorizer.<FILL IN>
```

In [None]:
%load ../answers/06_05_vectorizer.py

### 2.2 Fit a linear SVC model

Fit a `LinearSVC` estimator to the train data.
Compute the score on the test set using the estimator's default scoring method:

```python
# TODO: Replace <FILL IN> with appropriate code
# import the estimator class
<FILL IN>

# initialize the classifier
clf = <FILL IN>

# fit the train data
clf.<FILL IN>

# TODO: Replace <FILL IN> with appropriate code
# compute accuracy on test set
accuracy = clf.<FILL IN>

print("accuracy: {}".format(accuracy))
```

In [None]:
%load ../answers/06_06_svc.py
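
For classifiers, the default scoring method is the mean accuracy. The sketch below checks this on synthetic data, assuming a dataset from `make_classification` in place of the insult data; the `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# synthetic stand-in data (assumption: not the insult data)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_clf = LinearSVC(random_state=0).fit(X_toy, y_toy)

# score() of a classifier returns the fraction of correctly predicted labels
default_score = toy_clf.score(X_toy, y_toy)
manual_accuracy = accuracy_score(y_toy, toy_clf.predict(X_toy))
print(default_score, manual_accuracy)
```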

### 2.3 Estimated model parameters

Use the function below to visualize the top features used by the fitted model:

In [None]:
def visualize_coefficients(classifier, feature_names, n_top_features=20):
    # get coefficients with large absolute values 
    coef = classifier.coef_.ravel()
    positive_coefficients = np.argsort(coef)[-n_top_features:]
    negative_coefficients = np.argsort(coef)[:n_top_features]
    interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])
    # plot them
    plt.figure(figsize=(15, 8))
    colors = ["red" if c < 0 else "blue" for c in coef[interesting_coefficients]]
    plt.bar(np.arange(2*n_top_features), coef[interesting_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, (2*n_top_features)+1), feature_names[interesting_coefficients], 
               rotation=60, ha="right", fontsize=16)
    plt.ylabel('insult contribution')
    max_coef = np.abs(coef[interesting_coefficients]).max()
    plt.ylim(-max_coef*1.1, max_coef*1.1)
    
visualize_coefficients(clf, vectorizer.get_feature_names())

**Question**: Does the insult contribution of the coefficients make sense?

### 2.4 Pipeline with grid search 

- Combine the vectorizer and classifier into a data pipeline
- Perform grid search over the predefined parameters (note this can take a while!)
- Compute the score on the test set using the estimator's default scoring method:

```python
# TODO: Replace <FILL IN> with appropriate code
from sklearn.pipeline import Pipeline

# create pipeline out of the TfidfVectorizer and LinearSVC
pipeline = Pipeline([('tv', <FILL IN>), ('svc', <FILL IN>)])

# TODO: Replace <FILL IN> with appropriate code
from sklearn.model_selection import GridSearchCV

# define hyperparameter grid
param_grid = {'svc__C': [0.5,1,2],
              'tv__ngram_range': [(1,1),(1,2)]}

# create gridsearch object
grid_pipeline = <FILL IN>

# fit the text and y train data
grid_pipeline.<FILL IN>

# display best hyperparameters
grid_pipeline.<FILL IN>

# TODO: Replace <FILL IN> with appropriate code
# compute accuracy on test text (note: not the transformed test set, but the raw test text!)
accuracy = grid_pipeline.<FILL IN>

print("accuracy: {}".format(accuracy))
```

In [None]:
%load ../answers/06_07_pipeline.py
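
The `tv__ngram_range` entry in the grid controls which token sequences end up in the vocabulary. As a minimal sketch, assuming two made-up comments, the following compares unigrams only with unigrams plus bigrams; the `toy_*` names and `vocabulary_sizes` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# two made-up "comments" (assumption: illustrative only)
toy_docs = ["you are great", "you are not great"]

vocabulary_sizes = {}
for ngram_range in [(1, 1), (1, 2)]:
    toy_vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    toy_vectorizer.fit(toy_docs)
    # (1, 1) keeps single tokens only; (1, 2) also adds pairs of adjacent tokens
    vocabulary_sizes[ngram_range] = len(toy_vectorizer.vocabulary_)
    print(ngram_range, sorted(toy_vectorizer.vocabulary_))
```

Bigrams such as "not great" can capture negation that single tokens miss, at the cost of a much larger vocabulary and a longer grid search.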

**Question**: Did our grid search result in performance gain on the test set?

In [None]:
%load ../answers/06_questions.py