# Logistic Regression Lab with pipelines

In this lab you will try out pipelines with what you've learned so far and practice logistic regression on news headlines.

In [77]:
# import the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import json
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report
%matplotlib inline

In [31]:
# read in the data
df = pd.read_csv('../assets/dataset/train.tsv', sep='\t')

## Cleaning
The dataset has many columns. For a description of each one, see the data dictionary: https://www.kaggle.com/c/stumbleupon/data

For the purposes of this exercise, we're interested in 'boilerplate' (our predictor) and 'label' (our target). The target is binary: it indicates whether a page is evergreen (read over and over again) or not.

You may want to clean up the 'boilerplate' column.
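
One way to clean the column is sketched below. It assumes each 'boilerplate' entry is a JSON string and that some records may lack a title or store it as null; the toy rows and the helper name `extract_field` are made up for illustration.

```python
import json
import pandas as pd

# Toy rows standing in for the 'boilerplate' column (made-up values)
toy = pd.DataFrame({
    'boilerplate': [
        '{"title": "10 easy soup recipes", "body": "Chop the onions"}',
        '{"title": null, "body": "Scores from last night"}',
        '{"body": "No title key at all"}',
    ]
})

def extract_field(raw, field):
    """Return one field of a JSON string, or '' when it is missing or null."""
    value = json.loads(raw).get(field)
    return value if value is not None else ''

toy['title'] = toy['boilerplate'].map(lambda raw: extract_field(raw, 'title'))
toy['body'] = toy['boilerplate'].map(lambda raw: extract_field(raw, 'body'))
print(toy[['title', 'body']])
```

Replacing nulls with an empty string keeps every row, whereas dropping them shrinks the dataset; which is better depends on how many titles are missing.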

In [18]:
# show up to 200 characters per cell so the long text columns are readable
pd.set_option('display.max_colwidth', 200)

In [59]:
# Extract the title and body from the boilerplate JSON text
df['title'] = df.boilerplate.map(lambda x: json.loads(x).get('title', ''))
df['body'] = df.boilerplate.map(lambda x: json.loads(x).get('body', ''))


# drop every row that has a missing value in any column
df.dropna(axis=0, inplace=True)

## Set up a Train/Test Split

In [60]:
X = df['title']
y = df['label']

# Note: train_size=0.33 trains on one third of the rows and holds out the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.33)
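
The difference between `train_size` and `test_size` is easy to mix up. The sketch below uses made-up arrays (`X_toy`, `y_toy`) to show how each argument divides 100 rows; it is only an illustration, not a recommendation for this lab's split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with alternating labels
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0, 1] * 50)

# train_size=0.33: about a third of the rows are used for training
X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, train_size=0.33, random_state=0)
print(len(X_a), len(X_b))

# test_size=0.33: about a third of the rows are held out instead
X_c, X_d, y_c, y_d = train_test_split(X_toy, y_toy, test_size=0.33, random_state=0)
print(len(X_c), len(X_d))
```

Passing `random_state` makes the split reproducible between runs.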

## 1. Model Pipeline

Try building pipelines that combine different transformations (look at the scikit-learn documentation for some that you think would be good) with a LogisticRegression instance.

Note that a `sklearn.pipeline.Pipeline` can have any number of transformation steps, but only the last step in the chain can be an estimator, and that estimator is optional.
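
As a minimal sketch of how the steps fit together, the example below builds a small pipeline on made-up headlines (`toy_titles`, `toy_labels`) and sets parameters of individual steps with the `<step name>__<parameter>` naming scheme, the same scheme a parameter grid uses.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up headlines: 1 = evergreen, 0 = not evergreen
toy_titles = [
    'easy soup recipe for winter', 'how to bake bread at home',
    'ten tips for a tidy kitchen', 'simple chicken recipe ideas',
    'team wins final game last night', 'stock market falls on monday',
    'election results announced today', 'storm hits coast this morning',
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),      # transformer: text -> token counts
    ('tfidf', TfidfTransformer()),    # transformer: counts -> tf-idf weights
    ('clf', LogisticRegression()),    # final estimator
])

# Parameters of a step are addressed as '<step name>__<parameter>'
toy_pipeline.set_params(vect__ngram_range=(1, 2), clf__C=10.0)
toy_pipeline.fit(toy_titles, toy_labels)

print(toy_pipeline.named_steps['vect'].ngram_range)
print(toy_pipeline.predict(['easy soup recipe']))
```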

In [39]:
pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', LogisticRegression())
    ])

In [48]:
parameters = {
    'vect__analyzer': ('word', 'char'),
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 2), (3, 3), (4, 4)),  # individually checking uni- through tetragrams
}

In [62]:
X_train.isnull().sum()

0

In [69]:
grid_search = GridSearchCV(pipeline, parameters, scoring='precision_macro', n_jobs=-1, verbose=1) # precision-- minimize false-positives

## 2. Train the model
Use `X_train` and `y_train` to fit the model.
Use `X_test` to generate predicted values for the target variable and save those in a new variable called `y_pred`.

In [66]:
model = grid_search.fit(X_train,y_train)

Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   13.0s finished


In [71]:
print(model.best_params_)
print(model.best_score_)

{'vect__analyzer': 'char', 'vect__ngram_range': (4, 4), 'vect__max_df': 0.5}
0.767165459613


In [72]:
y_pred = model.predict(X_test)

## 3. Evaluate the model accuracy

1. Use the `confusion_matrix` and `classification_report` functions to assess the quality of the model.
- Are there more false positives or false negatives? (remember we are trying to predict evergreen-ness)
- How does that relate to what the `classification_report` is showing?
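
To read the matrix, it helps to know that `confusion_matrix` puts true labels on the rows and predictions on the columns, with labels in sorted order. The sketch below uses made-up label vectors (`y_true_toy`, `y_pred_toy`) to unpack the four cells and relate them to precision and recall for the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = evergreen, 0 = not evergreen
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

# With labels=[0, 1], the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)

# Precision and recall for class 1 follow directly from the cells
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
```

False positives lower precision and false negatives lower recall, which is the link between the matrix and the per-class rows of `classification_report`.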

In [85]:
conmat = confusion_matrix(y_test, y_pred)

# confusion_matrix sorts the labels, so row/column 0 is label 0 (not evergreen)
# and row/column 1 is label 1 (evergreen); rows are true labels, columns are predictions
confusion = pd.DataFrame(conmat, index=['not_evergreen', 'evergreen'],
                         columns=['predicted_not_evergreen', 'predicted_evergreen'])
print(confusion)


# In order to do this, we


               predicted_not_evergreen  predicted_evergreen
not_evergreen                     2033                  378
evergreen                          779                 1720


In [80]:
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.72      0.84      0.78      2411
          1       0.82      0.69      0.75      2499

avg / total       0.77      0.76      0.76      4910



## 4. Improving the model

Can we improve the accuracy of the model?

One way to do this is to tune the parameters that control it.

You can get a list of all the model parameters using `model.get_params().keys()`.

Discuss with your team which parameters you could try to change.
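
As a rough sketch, the example below lists the tunable parameter names of a pipeline like the one above. The variable names are placeholders. Note that on a `GridSearchCV` object the same names appear with an extra `estimator__` prefix.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline_sketch = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])

# Every key is a valid name for a parameter grid
param_names = sorted(pipeline_sketch.get_params().keys())
clf_params = [name for name in param_names if name.startswith('clf__')]
print(clf_params)

# Wrapped in a grid search, the pipeline's parameters gain an 'estimator__' prefix
search_sketch = GridSearchCV(pipeline_sketch, {'clf__C': [1.0]})
search_names = search_sketch.get_params().keys()
print('estimator__clf__C' in search_names)
```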

In [83]:
# `classes_` (with a trailing underscore) exists only on a fitted estimator,
# so look it up on the fitted classifier inside the best pipeline
model.best_estimator_.named_steps['clf'].classes_

You can systematically probe parameter combinations with the `GridSearchCV` class. Implement a new classifier that searches for the best parameter combination. (Remember that the 'CV' stands for 'cross-validation' so you don't need to use the train-test splits that you set up earlier.)

1. How will you choose the grid granularity?
1. How can you prevent the grid from growing exponentially?
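
One possible way to reason about grid size is to count the candidates before fitting anything. The sketch below uses `ParameterGrid` for that; the specific values are illustrative assumptions, not recommended settings. It compares a single full grid with a list of smaller grids that only pairs each analyzer with the n-gram ranges that make sense for it.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Log-spaced values cover several orders of magnitude with few points
c_values = np.logspace(-2, 2, 5).tolist()   # 0.01, 0.1, 1, 10, 100

# One full grid: every combination is tried (2 * 4 * 5 candidates)
full_grid = {
    'vect__analyzer': ['word', 'char'],
    'vect__ngram_range': [(1, 1), (1, 2), (3, 3), (4, 4)],
    'clf__C': c_values,
}

# A list of grids: combinations are only formed inside each dict
split_grid = [
    {'vect__analyzer': ['word'], 'vect__ngram_range': [(1, 1), (1, 2)], 'clf__C': c_values},
    {'vect__analyzer': ['char'], 'vect__ngram_range': [(3, 3), (4, 4)], 'clf__C': c_values},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

The total number of fits is the candidate count multiplied by the number of cross-validation folds, so halving the grid halves the run time.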

## 5. Assess the tuned model

A tuned grid search model stores the best parameter combination and the best estimator as attributes.

1. Use these to generate a new prediction vector `y_pred`.
- Use `confusion_matrix` and `classification_report` to assess the accuracy of the new model.
- How does the new model compare with the old one?
- What else could you do to improve the accuracy?
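
The attributes mentioned above can be used roughly as follows. This is a sketch on made-up headlines with a tiny grid over `clf__C` (an assumed choice), so the scores themselves mean nothing; the point is where `best_params_` and `best_estimator_` come from and how to predict with them.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up headlines: 1 = evergreen, 0 = not evergreen
titles = [
    'easy soup recipe for winter', 'how to bake bread at home',
    'ten tips for a tidy kitchen', 'simple chicken recipe ideas',
    'team wins final game last night', 'stock market falls on monday',
    'election results announced today', 'storm hits coast this morning',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([('vect', CountVectorizer()), ('clf', LogisticRegression())])
search = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, scoring='accuracy', cv=2)
search.fit(titles, labels)

print(search.best_params_)

# With the default refit=True, best_estimator_ is refitted on all the data
# passed to fit, and search.predict delegates to it
best_model = search.best_estimator_
y_pred_toy = best_model.predict(titles)

print(confusion_matrix(labels, y_pred_toy))
print(classification_report(labels, y_pred_toy))
```

In the lab itself you would predict on held-out data rather than on the rows used for fitting, as is done here only to keep the sketch short.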

## Bonus

What would happen if we used a different scoring function? Would our results change?
Choose one or two classification metrics from the [sklearn provided metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) and repeat the grid search. Do your results change?
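
One way to compare metrics without rerunning the search for each is to pass several scorers at once. The sketch below does this on made-up headlines, with `precision_macro` and `recall_macro` as example metrics and a small `clf__C` grid as an assumed search space.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up headlines: 1 = evergreen, 0 = not evergreen
titles = [
    'easy soup recipe for winter', 'how to bake bread at home',
    'ten tips for a tidy kitchen', 'simple chicken recipe ideas',
    'team wins final game last night', 'stock market falls on monday',
    'election results announced today', 'storm hits coast this morning',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([('vect', CountVectorizer()), ('clf', LogisticRegression())])
metrics = ['precision_macro', 'recall_macro']

# With several scorers, refit names the one used to pick best_estimator_
search = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]},
                      scoring=metrics, refit='precision_macro', cv=2)
search.fit(titles, labels)

# cv_results_ holds one 'mean_test_<metric>' array per scorer
for metric in metrics:
    scores = search.cv_results_['mean_test_' + metric]
    best_c = search.cv_results_['param_clf__C'][scores.argmax()]
    print(metric, 'prefers C =', best_c)
```

If the metrics prefer different candidates, the choice of scoring function does change which model the search returns.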