# Logistic Regression Lab with Pipelines

In this lab you will try out pipelines with what you've learned so far and practice logistic regression on news headlines.

In [None]:
# import the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

%matplotlib inline

In [None]:
# read in the data
df = pd.read_csv('../../assets/datasets/train.tsv', sep='\t')

## Cleaning
The dataset has many columns. For a detailed description of each one, see the data dictionary: https://www.kaggle.com/c/stumbleupon/data

For the purposes of this exercise, we're interested in 'boilerplate' (our predictor) and 'label' (our target). In this case, the target is binary, indicating whether something is evergreen - read over and over again - or not. 

You may want to clean up the 'boilerplate' column.
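One possible cleaning step is sketched below. It assumes each 'boilerplate' entry is a JSON string with a 'title' field, which you should check against the data dictionary before relying on it. The helper name `extract_title` and the two-row `toy` frame are hypothetical and only stand in for the real `df`.

```python
import json

import pandas as pd


def extract_title(raw):
    """Return the 'title' field of a JSON string, or '' if it is missing or unparsable."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Non-string input (e.g. NaN) or malformed JSON
        return ''
    if not isinstance(parsed, dict):
        return ''
    return parsed.get('title') or ''


# Hypothetical stand-in for df, for illustration only
toy = pd.DataFrame({
    'boilerplate': [
        '{"title": "10 Easy Pasta Recipes", "body": "Boil water"}',
        '{"title": null, "body": "Scores from last night"}',
    ],
    'label': [1, 0],
})

# Keep the title only and lowercase it
toy['title'] = toy['boilerplate'].apply(extract_title).str.lower()
print(toy[['title', 'label']])
```

On the real data the same idea would be `df['boilerplate'].apply(extract_title)`, provided the assumption about the column format holds.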

## Set up a Train/Test Split

In [None]:
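# one reasonable choice (an assumption, adjust as needed): 70/30 split, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(df['boilerplate'], df['label'], test_size=0.3, random_state=42, stratify=df['label'])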

## 1. Model Pipeline

Try out making pipelines with different transformations (look at the scikit-learn documentation for some that you think would be good) with a LogisticRegression instance. 

Note that a `sklearn.pipeline.Pipeline` can have an arbitrary number of transformation steps, but at most one estimator step, which is optional and must come last in the chain.

In [None]:
from sklearn.pipeline import Pipeline

# pipeline = Pipeline([
#         ('transformer_1', transformer_1()),
#         ('transformer_2', transformer_2()),
#         ('estimator', estimator())
#         ])
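As one concrete instance of the template above, the following sketch chains two text transformers with a logistic regression. The step names ('counts', 'tfidf', 'logreg') and the `stop_words='english'` setting are choices made for this example, not requirements of the lab.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('counts', CountVectorizer(stop_words='english')),  # text -> token counts
    ('tfidf', TfidfTransformer()),                      # counts -> tf-idf weights
    ('logreg', LogisticRegression()),                   # final estimator
])

# The steps run in the order listed
print([name for name, _ in pipeline.steps])
```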

## 2. Train the model
Use `X_train` and `y_train` to fit the model.
Use `X_test` to generate predicted values for the target variable and save those in a new variable called `y_pred`.
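A minimal sketch of the fit and predict calls follows. The `toy_train`, `toy_labels` and `toy_test` lists are made up for illustration. In the lab you would pass `X_train`, `y_train` and `X_test` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up data, standing in for X_train, y_train and X_test
toy_train = [
    'easy pasta recipe',
    'quick bread recipe',
    'election results tonight',
    'stock market news today',
]
toy_labels = [1, 1, 0, 0]
toy_test = ['simple soup recipe', 'market news update']

model = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])

# fit() runs every transformer's fit_transform, then fits the estimator
model.fit(toy_train, toy_labels)

# predict() applies the already-fitted transformers, then the estimator
y_pred = model.predict(toy_test)
print(y_pred)
```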

## 3. Evaluate the model accuracy

1. Use the `confusion_matrix` and `classification_report` functions to assess the quality of the model.
1. Are there more false positives or false negatives? (Remember that we are trying to predict evergreen-ness.)
1. How does that relate to what the `classification_report` is showing?
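The sketch below shows how the confusion matrix cells map to false positives and false negatives, using made-up label vectors (`y_test_toy` and `y_pred_toy` are hypothetical). For binary labels 0/1, scikit-learn lays the matrix out as `[[tn, fp], [fn, tp]]`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and predictions, for illustration only
y_test_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 0]

cm = confusion_matrix(y_test_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()
print(cm)
print('false positives:', fp, 'false negatives:', fn)

# Precision and recall for class 1 follow directly from these counts
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
print(classification_report(y_test_toy, y_pred_toy))
```

Here there are more false negatives than false positives, which shows up in the report as recall for class 1 being lower than precision for class 1.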

## 4. Improving the model

Can we improve the accuracy of the model?

One way to do this is to tune the parameters controlling it.

You can get a list of all the model parameters using `model.get_params().keys()`.
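When the model is a pipeline, each step's parameters appear under names of the form `<step name>__<parameter name>`. The short sketch below illustrates this. The step names 'vect' and 'clf' are arbitrary labels chosen for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])

# Collect all tunable parameter names, including those of nested steps
param_names = sorted(model.get_params().keys())

# Show only the parameters that belong to the logistic regression step
print([name for name in param_names if name.startswith('clf__')])
```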

Discuss with your team which parameters you could try to change.

You can systematically probe parameter combinations with the `GridSearchCV` class. Implement a new classifier that searches for the best parameter combination. (Remember that the 'CV' stands for 'cross-validation': the search makes its own validation splits from the data it is fitted on, so you don't need a separate validation set. Fit it on the training data and keep the test set from earlier for the final assessment.)

1. How will you choose the grid granularity?
1. How can you prevent the grid from growing exponentially?
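To get a feel for how fast a grid grows, the sketch below counts the combinations in a hypothetical grid with `ParameterGrid`. The parameter names assume a pipeline with steps called 'vect' and 'clf', and the values are illustrative rather than recommended.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: the size is the product of the list lengths
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],   # 2 values
    'vect__min_df': [1, 2, 5],               # 3 values
    'clf__C': [0.01, 0.1, 1, 10, 100],       # 5 values, log-spaced
}

n_combinations = len(ParameterGrid(param_grid))  # 2 * 3 * 5
n_folds = 5
n_fits = n_combinations * n_folds  # one fit per combination per fold

print(n_combinations, n_fits)
```

Every extra parameter multiplies the total, so one common approach is to start with a few coarse, log-spaced values per parameter and refine only around the best region.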

## 5. Assess the tuned model

A fitted grid search object stores the best parameter combination and the best estimator as the attributes `best_params_` and `best_estimator_`.

1. Use these to generate a new prediction vector `y_pred`.
1. Use the `confusion_matrix` and `classification_report` functions to assess the accuracy of the new model.
1. How does the new model compare with the old one?
1. What else could you do to improve the accuracy?
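The workflow above can be sketched end to end on made-up data. The documents, labels, step names and grid values are all assumptions for illustration. With the lab data you would fit on `X_train` and `y_train` and predict on `X_test`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up data, standing in for the real train/test split
toy_train = [
    'easy pasta recipe for dinner',
    'how to bake bread at home',
    'simple soup recipe for winter',
    'guide to growing tomatoes at home',
    'election results announced tonight',
    'stock market falls today',
    'team wins game last night',
    'storm warning issued today',
]
toy_y_train = [1, 1, 1, 1, 0, 0, 0, 0]
toy_test = ['quick recipe for bread', 'market results today']
toy_y_test = [1, 0]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
}

# cv=2 only because the toy data is tiny; use more folds on real data
grid = GridSearchCV(pipeline, param_grid, cv=2)
grid.fit(toy_train, toy_y_train)

print(grid.best_params_)   # best parameter combination found
print(grid.best_score_)    # its mean cross-validated score

# best_estimator_ is the pipeline refitted on all training data
y_pred = grid.best_estimator_.predict(toy_test)
print(confusion_matrix(toy_y_test, y_pred))
print(classification_report(toy_y_test, y_pred, zero_division=0))
```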

## Bonus

What would happen if we used a different scoring function? Would our results change?
Choose one or two classification metrics from the [sklearn provided metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) and repeat the grid search. Do your results change?
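One way to try this is the `scoring` argument of `GridSearchCV`, as in the sketch below. The data, step names and grid are made up for illustration, and 'accuracy', 'f1' and 'roc_auc' are just three of the scorer names scikit-learn accepts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up data, standing in for the training set
docs = [
    'easy pasta recipe for dinner',
    'how to bake bread at home',
    'simple soup recipe for winter',
    'guide to growing tomatoes at home',
    'election results announced tonight',
    'stock market falls today',
    'team wins game last night',
    'storm warning issued today',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])
param_grid = {'clf__C': [0.1, 1, 10]}

# Repeat the same search under different scoring functions
results = {}
for scoring in ['accuracy', 'f1', 'roc_auc']:
    grid = GridSearchCV(pipeline, param_grid, cv=2, scoring=scoring)
    grid.fit(docs, labels)
    results[scoring] = (grid.best_params_, grid.best_score_)

for scoring, (params, score) in results.items():
    print(scoring, params, round(score, 3))
```

Comparing the selected parameters across metrics shows whether the choice of scoring function changes which model the search prefers.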