# Assignment 7: Using a Pipeline for Text Transformation and Classification

In [None]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import plot_roc_curve, accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In this assignment, you will practice text vectorization to transform text into numerical feature vectors that can be used to train a classifier. You will then see how to use scikit-learn pipelines to chain together these processes into one step. You will:

1. Load the book reviews data set.
2. Use a single text column as a feature. 
3. Transform features using a TF-IDF vectorizer. 
4. Fit a logistic regression model to the transformed features. 
5. Evaluate the performance of the model using AUC.
6. Set up a scikit-learn pipeline to perform the same tasks above. 
7. Execute the pipeline and verify that the performance is the same.
8. Add a grid search to the pipeline to find the optimal hyperparameter configuration.
9. Evaluate the performance of the optimal configuration using ROC-AUC.

**<font color='red'>Note: some of the code cells in this notebook may take a while to run</font>**

## Part 1: Load the Data Set

We will work with the book review dataset that you worked with in the sentiment analysis demo.

In [None]:
filename = os.path.join(os.getcwd(), "data", "bookReviews.csv")
df = pd.read_csv(filename, header=0)

In [None]:
df.head()

## Part 2: Create Training and Test Data Sets

### Create Labeled Examples 

<b>Task</b>: Create labeled examples from DataFrame `df`. We will have one text feature and one label.  

In the code cell below carry out the following steps:

* Get the `Positive Review` column from DataFrame `df` and assign it to the variable `y`. This will be our label.
* Get the column `Review` from DataFrame `df` and assign it to the variable `X`. This will be our feature.


In [None]:
y = df['Positive Review']
X = df['Review']

In [None]:
X.head()

In [None]:
X.shape

### Split Labeled Examples into Training and Test Sets


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1234)

## Part 3: Implement TF-IDF Vectorizer to Transform Text

<b>Task</b>: Complete the code in the cell below to implement a TF-IDF transformation on the training and test data.
Use the "Transforming Text For a Classifier" demo as a guide. Follow these steps:

1. Create a `TfidfVectorizer` object and save it to the variable `tfidf_vectorizer`.
2. Call `tfidf_vectorizer.fit()` to fit the vectorizer to the training data `X_train`.
3. Call the `tfidf_vectorizer.transform()` method to use the fitted vectorizer to transform the training data `X_train`. Save the result to `X_train_tfidf`.
4. Call the `tfidf_vectorizer.transform()` method to use the fitted vectorizer to transform the test data `X_test`. Save the result to `X_test_tfidf`.

In [None]:
# 1. Create a TfidfVectorizer object and save it to the variable 'tfidf_vectorizer'

tfidf_vectorizer = TfidfVectorizer()

# 2. Fit the vectorizer 

tfidf_vectorizer.fit(X_train)

# 3. Using the fitted vectorizer, transform the training data 

X_train_tfidf = tfidf_vectorizer.transform(X_train)

# 4. Using the fitted vectorizer, transform the test data

X_test_tfidf = tfidf_vectorizer.transform(X_test)


In [None]:
print(X_test_tfidf)
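
The fit-then-transform steps above can be illustrated on a tiny corpus. The following is a minimal sketch, assuming made-up documents (`train_docs` and `test_docs` are hypothetical and not part of the book review data). It shows that the vocabulary is learned from the training text only, so words that appear only in the test text are ignored when the test data is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, used only for illustration
train_docs = ["great book", "boring book", "great story"]
test_docs = ["great unknown plot"]

vectorizer = TfidfVectorizer()

# fit() learns the vocabulary and IDF weights from the training text only
vectorizer.fit(train_docs)

# transform() maps documents onto the vocabulary learned above
train_matrix = vectorizer.transform(train_docs)
test_matrix = vectorizer.transform(test_docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_matrix.shape, test_matrix.shape)

# Only "great" is known to the vectorizer, so the test row has one non-zero entry
print(test_matrix.nnz)
```

Both matrices have the same number of columns because they share the fitted vocabulary, which is what allows a model trained on one to score the other.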

## Part 4: Fit a Logistic Regression Model to the Transformed Training Data and Evaluate the Model
<b>Task</b>: Complete the code cell below to train a logistic regression model using the TF-IDF features, and compute the AUC on the test set.


In [None]:
# 1. Create the LogisticRegression model object 

model = LogisticRegression(max_iter=200)

# 2. Fit the model to the transformed training data

model.fit(X_train_tfidf, y_train)

# 3. Use the predict_proba() method to make predictions on the test data 

probability_predictions = model.predict_proba(X_test_tfidf)[:,1]

# 4. Compute the area under the ROC curve for the test data. 

auc = roc_auc_score(y_test, probability_predictions)
print('AUC on the test data: {:.4f}'.format(auc))


# 5. Compute the size of the resulting feature space 

len_feature_space = len(tfidf_vectorizer.vocabulary_)
print('The size of the feature space: {0}'.format(len_feature_space))
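
AUC is computed from the predicted probabilities rather than from hard class labels, which is why the second column of `predict_proba()` is used above. As a rough illustration, the sketch below computes AUC by hand on four made-up scores as the fraction of (positive, negative) pairs in which the positive example receives the higher score, and compares it with `roc_auc_score()`. The helper `pairwise_auc` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


manual_auc = pairwise_auc(y_true, scores)
sklearn_auc = roc_auc_score(y_true, scores)
print(manual_auc, sklearn_auc)
```

Three of the four positive/negative pairs are ordered correctly here, so both values come out to 0.75.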




## Part 5: Experiment with Different Document Frequency Values and Analyze the Results

<b>Task</b>: The cell below will loop over a range of 'document frequency' values. For each value, it will fit a vectorizer specifying `ngram_range=(1,2)`. It will then fit a logistic regression model to the transformed data and evaluate the results.   

Complete the loop in the cell below by 

1. adding a list containing four document frequency values that you would like to use (e.g. `[1, 10, 100, 1000]`)
2. adding the code you wrote above inside the loop. 

Note: This may take a short while to run.

In [None]:
for min_df in [1, 10, 100, 1000]: 
    
    print('\nDocument Frequency Value: {0}'.format(min_df))

    # 1. Create a TfidfVectorizer object   
    
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=min_df)

    # 2. Fit the vectorizer to X_train  
    
    tfidf_vectorizer.fit(X_train)

    # 3. Using the fitted vectorizer, transform the training data.
    
    
    X_train_tfidf = tfidf_vectorizer.transform(X_train)

    # 4. Using the fitted vectorizer, transform the test data.
    
    
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    
    # 5. Create the LogisticRegression model object 
    
    model = LogisticRegression(max_iter=200)
    
    # 6. Fit the model to the transformed training data
    
    model.fit(X_train_tfidf, y_train)

    # 7. Use the predict_proba() method to make predictions 
    probability_predictions = model.predict_proba(X_test_tfidf)[:,1]

    # 8. Using roc_auc_score() function to compute the AUC. 
    auc = roc_auc_score(y_test, probability_predictions)
    
    print('AUC on the test data: {:.4f}'.format(auc))
    

    # 9. Compute the size of the resulting feature space. 
    len_feature_space = len(tfidf_vectorizer.vocabulary_)

    print('The size of the feature space: {0}'.format(len_feature_space))
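
To see how `min_df` interacts with the size of the feature space, here is a small sketch on a made-up corpus (the `docs` list is hypothetical). A term, whether a single word or a two-word n-gram, is kept only if it appears in at least `min_df` documents, so raising the value removes rare terms and shrinks the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
docs = [
    "good book good plot",
    "good book bad ending",
    "bad plot",
    "good ending",
]

sizes = {}
for min_df in [1, 2, 3]:
    # Same n-gram setting as the loop above: unigrams and bigrams
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=min_df)
    vec.fit(docs)
    sizes[min_df] = len(vec.vocabulary_)
    print(min_df, sizes[min_df], sorted(vec.vocabulary_))
```

With `min_df=3` only the word "good" survives in this toy corpus, which mirrors how a very large document frequency value can leave a real model with few features to learn from.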



<b>Task</b>: Which document frequency value and feature space produced the best performing model? Do you notice any patterns relating the document frequency value, the size of the feature space, and the AUC? Record your findings in the cell below.

All four document frequency values produced the same AUC, so AUC alone does not single out a best model. With performance tied, the fastest-running model is preferable, which favors a larger document frequency value and a smaller feature space. However, with a value of 1000 we would not retain any useful keywords, even though the resulting AUC is the same. A value of 100 is therefore the safest choice.

## Part 6: Set up a TF-IDF + Logistic Regression Pipeline

We will look at a new way to chain together various methods to automate the machine learning workflow. We will use the scikit-learn `Pipeline` utility. For more information, consult the online [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). First, let's import `Pipeline`.

In [None]:
from sklearn.pipeline import Pipeline

The code cell below will use a scikit-learn pipeline to perform TF-IDF vectorization and the fitting of a logistic regression model to the transformed data.



<b>Task:</b> In the code cell below, complete the code that defines, fits, and uses the pipeline.

In [None]:
print('Begin ML pipeline...')

# 1. Define the list of steps:
s = [
        ("vectorizer", TfidfVectorizer(ngram_range=(1,2), min_df=10)),
        ("model", LogisticRegression(max_iter=200))
    ]

# 2. Define the pipeline:
model_pipeline = Pipeline(steps=s)

# We can use the pipeline the way we would use a model object 
# when fitting the model on the training data and testing on the test data:

# 3. Fit the pipeline to the training data

model_pipeline.fit(X_train, y_train)

# 4. Make predictions on the test data
# Save the second column to the variable 'probability_predictions'

probability_predictions = model_pipeline.predict_proba(X_test)[:,1]

print('End pipeline')

Let's compare the performance of our model. 

<b>Task</b>: In the code cell below, call the function `roc_auc_score()` with arguments `y_test` and `probability_predictions`. Save the results to the variable `auc_score`.


In [None]:
# Evaluate the performance by computing the AUC

auc_score = roc_auc_score(y_test, probability_predictions)

print('AUC on the test data: {:.4f}'.format(auc_score))
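
As a sanity check on the idea that a pipeline performs the same work as the manual steps, the sketch below runs both approaches on a small made-up data set and compares the predicted probabilities. The documents and labels are hypothetical, and both versions use default `TfidfVectorizer` settings so that the comparison is like for like.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, used only for illustration
train_docs = ["loved it", "great read", "terrible plot", "boring and dull",
              "great characters", "dull and terrible"]
train_labels = [1, 1, 0, 0, 1, 0]
test_docs = ["great plot", "boring read"]

# Manual approach: vectorize first, then fit and predict
vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=200)
clf.fit(vec.fit_transform(train_docs), train_labels)
manual_probs = clf.predict_proba(vec.transform(test_docs))[:, 1]

# Pipeline approach: the same two steps chained into one object
pipe = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=200)),
])
pipe.fit(train_docs, train_labels)
pipeline_probs = pipe.predict_proba(test_docs)[:, 1]

same = np.allclose(manual_probs, pipeline_probs)
print(same)
```

When comparing AUC values in this notebook, keep in mind that the results only match when the vectorizer and model settings are the same in both approaches.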

In some cases, scikit-learn gives you the ability to provide a pipeline object as an argument to a function. One such function is `plot_roc_curve()`. You'll see in the online [documentation](https://scikit-learn.org/0.23/modules/generated/sklearn.metrics.plot_roc_curve.html) that this function can take a pipeline (estimator) as an argument. Calling `plot_roc_curve()` with the pipeline and the test data will accomplish the same tasks as steps 3 and 4 in the code cell above.

Let's import the function and try it out.

<b>Task:</b> Call `plot_roc_curve()` with the following three arguments: the pipeline `model_pipeline`, the test features `X_test`, and the test labels `y_test`.


In [None]:
from sklearn.metrics import plot_roc_curve

plot_roc_curve(model_pipeline, X_test, y_test)

Note that in newer versions of scikit-learn, this function has been replaced by [RocCurveDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html).
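
For readers on a scikit-learn version where `plot_roc_curve()` is no longer available, the points of the ROC curve and its area can still be computed directly with `roc_curve()` and `auc()`. The sketch below uses made-up labels and scores. The commented-out line shows the `RocCurveDisplay.from_estimator()` call that plays the role of `plot_roc_curve()` in newer releases; it assumes a fitted `model_pipeline` and the test data from this notebook.

```python
from sklearn.metrics import auc, roc_curve

# Hypothetical labels and predicted probabilities for the positive class
y_true = [0, 1, 0, 1, 1]
scores = [0.2, 0.3, 0.6, 0.7, 0.9]

# Each threshold yields one (false positive rate, true positive rate) point
fpr, tpr, thresholds = roc_curve(y_true, scores)
area = auc(fpr, tpr)

for f, t in zip(fpr, tpr):
    print(f, t)
print(area)

# With a fitted estimator, newer scikit-learn versions can plot the curve with:
# from sklearn.metrics import RocCurveDisplay
# RocCurveDisplay.from_estimator(model_pipeline, X_test, y_test)
```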

## Part 7: Perform a GridSearchCV on the Pipeline to Find the Best Hyperparameters 


You will perform a grid search on the pipeline object `model_pipeline` to find the hyperparameter configuration for hyperparameter $C$ (for the logistic regression) and for the $ngram\_range$ (for the TF-IDF vectorizer) that result in the best cross-validation score.

<b>Task:</b> Define a parameter grid to pass to `GridSearchCV()`. Recall that the parameter grid is a dictionary. Name the dictionary `param_grid`.



Note the following:

When running a grid search on a pipeline, the hyperparameter names you specify in the parameter grid are the names of the pipeline items (the descriptive names you provided to the items in the pipeline), followed by two underscores, followed by the actual hyperparameter names. 


We named the classifier `model` and the vectorizer `vectorizer`. 

Since we named our classifier `model`, the hyperparameter name for $C$ that you would specify as the key in `param_grid` is `model__C`. You can find a list of the pipeline hyperparameter names you can use by running the code in the cell below.

In [None]:
model_pipeline.get_params().keys()
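
The same `step__parameter` naming works outside of a grid search. As a small sketch, the code below builds a fresh pipeline with the same step names used in this notebook (the variable `pipe` is introduced here only for illustration) and uses `set_params()` to change one hyperparameter on each step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=200)),
])

# get_params() exposes nested hyperparameters as '<step name>__<parameter name>'
params = pipe.get_params()
c_before = params["model__C"]

# set_params() accepts the same names as keyword arguments
pipe.set_params(model__C=10, vectorizer__ngram_range=(1, 2))

c_after = pipe.named_steps["model"].C
ngram_after = pipe.named_steps["vectorizer"].ngram_range
print(c_before, c_after, ngram_after)
```

A grid search does essentially this for every combination in `param_grid`, which is why the keys must match the names returned by `get_params()`.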

In [None]:
param_grid = {'vectorizer__ngram_range':[(1,1), (1,2)], 'model__C':[0.1,1,10]}

param_grid

<b>Task:</b> Run a grid search on the pipeline.





In [None]:
print('Running Grid Search...')

# 1. Run a grid search with 3-fold cross-validation and assign the output to the variable 'grid'


grid = GridSearchCV(model_pipeline, param_grid, cv=3, scoring='roc_auc', verbose=2)

# 2. Fit the grid search object (grid) on the training data and assign the fitted object to the variable 'grid_search'


grid_search = grid.fit(X_train, y_train)

print('Done')

Run the code below to see the best pipeline configuration that was determined by the grid search.

In [None]:
grid_search.best_estimator_

<b>Task</b>: Print the best hyperparameters by accessing them by using the `best_params_` attribute.

In [None]:
grid_search.best_params_

Recall that in the past, after we obtained the best hyperparameter values from a grid search, we re-trained a model with these values in order to evaluate the performance. This time we will do something different. Just as we can pass a pipeline object directly to `plot_roc_curve()` to evaluate the model, we can pass `grid_search.best_estimator_` to the function `plot_roc_curve()` to evaluate the model. We also pass in the test data (`X_test` and `y_test`). This allows the test data to be passed through the entire pipeline, using the best hyperparameter values.


<b>Task</b>: In the code cell below, plot the ROC curve and compute the AUC by calling the function `plot_roc_curve()` with the arguments `grid_search.best_estimator_` and the test data (`X_test` and `y_test`). Note that you can also pass `grid_search` itself to the function.

In [None]:
plot_roc_curve(grid_search.best_estimator_, X_test, y_test)
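
The reason no manual re-training is needed is that `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`). The sketch below illustrates this on a small made-up data set (the documents, labels, and variable names are hypothetical): predictions made through the fitted grid search object match those made through `best_estimator_`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six positive and six negative reviews
docs = ["great book", "loved the story", "wonderful characters",
        "great plot", "loved it", "wonderful read",
        "boring book", "terrible story", "dull characters",
        "boring plot", "terrible read", "dull ending"]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=200)),
])

# refit=True is the default: the best configuration is re-trained on all of the data
grid = GridSearchCV(pipe, {"model__C": [0.1, 1, 10]}, cv=3, scoring="roc_auc")
grid.fit(docs, labels)

new_docs = ["great plot", "dull story"]
from_grid = grid.predict_proba(new_docs)[:, 1]
from_best = grid.best_estimator_.predict_proba(new_docs)[:, 1]

same = np.allclose(from_grid, from_best)
print(grid.best_params_, same)
```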