## Overview

In Unit 2, we worked extensively with pipelines. We used them to create a consistent workflow of preprocessing and model-fitting steps. Whenever we wanted to use any type of cross-validation, a pipeline was also necessary to ensure that the data was processed in the same way during each fold.

To start off this module, we're going to focus on two tasks: extracting features from text and then classifying the text with a simple logistic regression. We'll put these two tasks together in a pipeline and then use a grid search with cross-validation to find the best parameters.


## Follow Along

For this example, we'll be using the sentiment-labeled sentences from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). To make this exercise a little simpler and the dataset a little smaller, we'll use only the reviews from Yelp.

First, let's vectorize the text data and then fit a classifier model.

In [1]:
# Imports
import pandas as pd

# Read in the locally saved file from the link above
df_yelp = pd.read_csv('yelp_labelled.txt', names=['sentence', 'label'], sep='\t')
df_yelp.head()

| | sentence | label |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [2]:
# Import train-test split
from sklearn.model_selection import train_test_split

# Create the feature and target variables
sentences = df_yelp['sentence']
y = df_yelp['label']

# Train-test split
sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.25, random_state=42)

We now have training and testing sets of raw sentences. Note that we did the train-test split before vectorizing. If we instead vectorized the whole dataset and then split it into training and testing sets, the training set would contain information about the testing set, such as its vocabulary and term frequencies.
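To make the leakage point concrete, here is a minimal sketch on hypothetical toy sentences (not taken from the Yelp data; the `toy_*` names are made up for illustration). When the vectorizer is fit on the training sentences only, a word that appears only in the test sentence never enters the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sentences, not from the Yelp dataset
toy_train = ["great food", "terrible service"]
toy_test = ["great dessert"]

# Fit on the training sentences only
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_train)

# The learned vocabulary contains training words only
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)

# 'dessert' is unseen, so only 'great' is stored for the test sentence
toy_test_matrix = toy_vectorizer.transform(toy_test)
print(toy_test_matrix.shape, toy_test_matrix.nnz)
```

Had the vectorizer been fit on both lists together, `dessert` would have become a feature and the inverse document frequencies would have been computed with the test sentence included.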

In [3]:
# Import the tf-idf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate and fit the tf-idf vectorizer
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(2, 2))
vectorizer.fit(sentences_train)

# Vectorize the training and testing data
X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)

# Display the properties of the vectorized text
X_train

<750x2864 sparse matrix of type '<class 'numpy.float64'>'
	with 3051 stored elements in Compressed Sparse Row format>
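It can help to look at what these settings actually produce. The sketch below applies the same vectorizer settings to two sentences from the preview above; the `toy_*` names are hypothetical and only for illustration. With `ngram_range=(2, 2)` every feature is a pair of adjacent words, and the pairs are formed after the English stop words have been removed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two sentences from the dataset preview
toy_sentences = ["Wow... Loved this place.", "Crust is not good."]

# Same settings as the vectorizer used in this lesson
toy_bigram_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(2, 2))
toy_matrix = toy_bigram_vectorizer.fit_transform(toy_sentences)

# Each feature is a bigram built from the words left after stop-word removal
toy_vocab = sorted(toy_bigram_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)
```

Note that the built-in stop-word list drops `is` and `not`, so "Crust is not good." is reduced to the single bigram `crust good`. Losing the negation is one plausible reason a bigram-only, stop-word-filtered representation is a weak starting point for sentiment data.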

In [4]:
# Import the classifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Instantiate and fit a model
classifier = LogisticRegression(solver='lbfgs')

classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)

print("Accuracy:", score)

Accuracy: 0.588


We have a baseline accuracy of about 0.59 with a logistic regression model. Now, if we want to optimize our model and use cross-validation, the vectorizer and classifier steps both need to be fit inside each fold of the cross-validation. As we learned in Unit 2, we need to apply the same transformation within each fold; otherwise we could accidentally introduce data leakage (where we give the model more information about the data than it should have).

Our pipeline will have two steps: the vectorizer and the classifier.
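As a rough picture of what a pipeline automates during cross-validation, the sketch below writes the per-fold work out by hand. It uses hypothetical toy sentences and labels rather than the Yelp reviews, and default vectorizer and classifier settings that are assumptions made for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data, not the Yelp reviews
toy_sentences = [
    "loved the food", "great service", "really tasty pizza", "amazing place",
    "awful food", "terrible service", "bland pizza", "dirty place",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

fold_scores = []
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for train_idx, val_idx in folds.split(toy_sentences, toy_labels):
    train_text = [toy_sentences[i] for i in train_idx]
    val_text = [toy_sentences[i] for i in val_idx]
    train_y = [toy_labels[i] for i in train_idx]
    val_y = [toy_labels[i] for i in val_idx]

    # Fit the vectorizer on this fold's training sentences only
    fold_vectorizer = TfidfVectorizer()
    X_tr = fold_vectorizer.fit_transform(train_text)
    # The validation sentences are only transformed, never fit
    X_val = fold_vectorizer.transform(val_text)

    fold_model = LogisticRegression()
    fold_model.fit(X_tr, train_y)
    fold_scores.append(fold_model.score(X_val, val_y))

print(fold_scores)
```

Passing a `Pipeline` to a cross-validation routine performs the same fit-on-train, transform-on-validation sequence for every fold without the manual bookkeeping.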

In [5]:
from sklearn.pipeline import Pipeline

# Define the Pipeline
pipe = Pipeline([('vect', vectorizer), # vectorizer
                 ('clf', classifier) # classifier
                ])

# Define the parameter space for the grid search
parameters = {'clf__C': [1, 10, 1000000]} # C: inverse of regularization strength


# Implement a grid search with cross-validation
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(sentences, y);

# Print out the best score
grid_search.best_score_

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of  15 | elapsed:    2.3s remaining:    2.0s
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:    2.4s finished


0.611

The cross-validated accuracy (0.611) is a bit higher than the hold-out accuracy we got without a grid search, although the two numbers are computed on different splits and so are not strictly comparable. It's relatively straightforward to adjust the parameters you would like to search over. In this case, the only somewhat useful classifier parameter is `C`, the inverse of the regularization strength. For classifiers with more parameters, you just add another key:value pair to the `parameters` dictionary.
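The same dictionary can also hold parameters for the vectorizer step. Here is a minimal sketch, assuming a pipeline with steps named `vect` and `clf` as in this lesson; the toy data, the `toy_*` names, and the candidate values are hypothetical choices for illustration only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the Yelp reviews
toy_sentences = [
    "loved the food", "great service", "really tasty pizza", "amazing place",
    "awful food", "terrible service", "bland pizza", "dirty place",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', LogisticRegression())])

# Keys follow the pattern '<step name>__<parameter name>'
toy_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    'clf__C': [1, 10],                      # inverse of regularization strength
}

toy_search = GridSearchCV(toy_pipe, toy_parameters, cv=2)
toy_search.fit(toy_sentences, toy_labels)

# 2 x 2 = 4 parameter combinations are evaluated
print(len(toy_search.cv_results_['params']))
print(toy_search.best_params_)
```

Each added key multiplies the number of candidates, so the number of fits grows quickly as the grid gets larger.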

## Challenge

For this challenge, try using a different classifier in place of the logistic regression. Make sure to adjust the `parameters` dictionary to be consistent with the classifier you choose. Some suggested classifiers to begin with are a decision tree or a random forest.
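One way to check which keys are valid for a new classifier is to list the pipeline's parameter names. This is only a sketch: the decision tree and the `tree_pipe` name are illustrative assumptions, not a solution to the challenge.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical pipeline with a decision tree as the 'clf' step
tree_pipe = Pipeline([('vect', TfidfVectorizer()),
                      ('clf', DecisionTreeClassifier())])

# Every tunable name has the form '<step>__<parameter>'
clf_param_names = sorted(name for name in tree_pipe.get_params()
                         if name.startswith('clf__'))
print(clf_param_names)
```

Any name in this list can be used as a key in the `parameters` dictionary; a key such as `clf__C` would raise an error here because a decision tree has no `C` parameter.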

## Additional Resources

* [Scikit-learn: Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [Scikit-learn: Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
* [UCI: Sentiment Labelled Sentences](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)