## Logistic Regression Model and Threshold Calibration

In this notebook, we use Logistic Regression to predict the __isPositive__ field of our final dataset, and look at how probability threshold calibration can help improve a classifier's performance.

1. Reading the dataset
2. Exploratory data analysis and missing value imputation
3. Stop word removal and stemming
4. Splitting the training dataset into training and validation
5. Computing Bag of Words features
6. Fitting LogisticRegression and checking model performance
    * Find more details on the __LogisticRegression__ here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
7. Ideas for improvement: Probability threshold calibration (optional) 

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Rating of the review (1 for a positive review, 0 otherwise); this is the target we predict


### 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

In [None]:
import pandas as pd

df = pd.read_csv('../../data/examples/NLP-REVIEW-DATA-CLASSIFICATION.csv')

Let's look at the first five rows of the dataset.

In [None]:
df.head()

### 2. Exploratory data analysis and missing value imputation

Let's look at the target distribution for our dataset.

In [None]:
df["isPositive"].value_counts()

Checking the number of missing values:    

In [None]:
print(df.isna().sum())

Let's fill in a placeholder for the missing __reviewText__ values:

In [None]:
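# Assign the result back; fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df["reviewText"] = df["reviewText"].fillna("Missing")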
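
As a small illustration of the imputation step, the sketch below builds a made-up three-row frame and fills the missing text by assignment. The `toy` frame and its contents are assumptions for demonstration only and are not part of the review dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the review data
toy = pd.DataFrame({"reviewText": ["great product", np.nan, "not good"]})

# Count missing values before imputation
missing_before = int(toy["reviewText"].isna().sum())

# Fill the gaps with a placeholder string and assign the column back
toy["reviewText"] = toy["reviewText"].fillna("Missing")

# Count missing values after imputation
missing_after = int(toy["reviewText"].isna().sum())
print(missing_before, missing_after)
```

A placeholder string keeps every row usable by the text processing steps that follow, which expect a string for each review.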

### 3. Stop word removal and stemming

We will apply the text processing methods discussed in class: lowercasing, whitespace and HTML tag cleanup, stop word removal, and stemming.

In [None]:
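# Download the tokenizer models and the stop word list used below
import nltk

nltk.download('punkt')
nltk.download('punkt_tab') # Assumption: needed by word_tokenize in newer NLTK releases (3.9+)
nltk.download('stopwords')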

In [None]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# A set makes the membership test inside process_text fast
stop_words = {word for word in stop if word not in excluding}

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of whitespace (spaces, tabs, newlines)
        sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
    
    return final_text_list

In [None]:
print("Pre-processing training reviewText")
df["reviewText"] = process_text(df["reviewText"].tolist())
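
To make the cleaning steps concrete without downloading any NLTK resources, here is a minimal sketch that applies only the regex-based parts to one made-up sentence. The helper `basic_clean`, the sample string, and the three-word `tiny_stop` set are hypothetical; the sketch does no tokenization or stemming, so it is a simplification of `process_text`, not a replacement for it.

```python
import re

def basic_clean(text):
    """Lowercase, drop HTML-like tags, and collapse whitespace (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)   # remove HTML tags
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    return text.strip()

sample = "  This is <b>NOT</b>   a good\tproduct.<br/> "
cleaned = basic_clean(sample)
print(cleaned)

# Hypothetical three-word stop list (NOT the NLTK list); "not" is deliberately kept
tiny_stop = {"this", "is", "a"}
tokens = [w for w in cleaned.split() if w not in tiny_stop]
print(tokens)
```

Keeping "not" is the point of the `excluding` list above: dropping it would turn "not a good product" into something that looks positive.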

### 4. Splitting the training dataset into training and validation

The sklearn library has a useful function for splitting datasets: __train_test_split()__. In the example below, 90% of the data is used for training and 10% is held out for validation.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df["reviewText"].tolist(), # Input
                                                  df["isPositive"].tolist(), # Target field
                                                  test_size=0.10, # 10% validation, 90% training
                                                  shuffle=True) # Shuffle the whole dataset
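
The split above is random, so results change between runs. The sketch below shows two optional arguments of `train_test_split`, `random_state` and `stratify`, on made-up data; the toy `texts` and `labels` and the chosen seed are assumptions, and the notebook's own call does not use these arguments.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 texts, 25% of them positive
texts = [f"review {i}" for i in range(20)]
labels = [1] * 5 + [0] * 15

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels,
    test_size=0.2,      # 4 validation samples, 16 training samples
    shuffle=True,
    random_state=42,    # fixed seed so the split is repeatable
    stratify=labels,    # keep the class ratio similar in both parts
)
print(len(X_tr), len(X_va), sum(y_tr), sum(y_va))
```

Stratifying can matter when one class is rare, because an unlucky random split could otherwise leave very few examples of that class in the validation set.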

### 5. Computing Bag of Words Features

We are using binary features here (1 if a word appears in a review, 0 otherwise). Term frequency (TF) and TF-IDF are other options.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Initialize the binary count vectorizer
# (TfidfVectorizer is imported only as the TF-IDF alternative mentioned above)
count_vectorizer = CountVectorizer(binary=True,
                                   max_features=50 # Limit the vocabulary size
                                  )
# Fit on the training text and transform it
X_train_text_vectors = count_vectorizer.fit_transform(X_train)
# Only transform the validation text, reusing the training vocabulary
X_val_text_vectors = count_vectorizer.transform(X_val)

### 6. Fitting LogisticRegression and checking model performance

Let's fit __LogisticRegression__ from the sklearn library and check its performance on the validation dataset.

Find more details on __LogisticRegression__ here:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import make_scorer, accuracy_score, f1_score

# To improve the performance of LogisticRegression we can tune its parameters, for example:
# * regularization type: penalty = {l1, l2, elasticnet}
#   (the default lbfgs solver supports only l2; l1 needs liblinear or saga, elasticnet needs saga)
# * regularization strength: C = {smaller values specify stronger regularization}
#   LogisticRegression regularized cost function: C*Cost(w) + penalty(w),
#   where w is the weights vector
# * addressing class imbalance:
#   class_weight = 'balanced' or a dict {class_label: weight, ...}
lrClassifier = LogisticRegression(penalty = 'l2',
                                  C = 0.1,
                                  class_weight = 'balanced')
lrClassifier.fit(X_train_text_vectors, y_train)
lrClassifier_val_predictions = lrClassifier.predict(X_val_text_vectors)

print("LogisticRegression on Validation: Accuracy Score: %f, F1-score: %f" % \
      (accuracy_score(y_val, lrClassifier_val_predictions), f1_score(y_val, lrClassifier_val_predictions)))
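
The `class_weight = 'balanced'` setting used above weights each class inversely to its frequency, using the heuristic `n_samples / (n_classes * np.bincount(y))` described in the scikit-learn documentation. The sketch below reproduces that arithmetic with NumPy on made-up labels; the label vector is an assumption for illustration.

```python
import numpy as np

# Hypothetical imbalanced labels: 80 positives, 20 negatives
y = np.array([1] * 80 + [0] * 20)

n_samples = len(y)
counts = np.bincount(y)      # samples per class: [20, 80]
n_classes = len(counts)

# Weight per class: rarer classes get larger weights
weights = n_samples / (n_classes * counts)
print(dict(enumerate(weights)))
```

With these counts, each example of the minority class 0 counts four times as much in the loss as an example of class 1.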

### 7. Ideas for improvement: Probability threshold calibration (optional)

Besides tuning __LogisticRegression__ hyperparameter values, one other path to improve a classifier's performance is to dig deeper into how the classifier actually assigns class membership.

**Binary predictions versus probability predictions.** We often use __classifier.predict()__ to examine a classifier's binary predictions, while in fact the outputs of most classifiers are real-valued, not binary. For most classifiers in sklearn, the method __classifier.predict_proba()__ returns class probabilities as a two-dimensional numpy array of shape (n_samples, n_classes), where the columns follow the sorted order of the class labels.

For our example, let's look at the first 5 predictions we made, in binary format and in real-valued probability format:

In [None]:
lrClassifier.predict(X_val_text_vectors)[0:5]

In [None]:
lrClassifier.predict_proba(X_val_text_vectors)[0:5]

**How are the predicted probabilities used to decide class membership?** On each row of the predict_proba output, the probability values sum to 1. There are two columns, one for each response class: column 0 holds the predicted probability that an observation is a member of class 0, and column 1 holds the predicted probability that it is a member of class 1. The predicted class is the one with the highest probability.

The key here is that a **threshold of 0.5** is used by default (for binary problems) to convert predicted probabilities into class predictions: class 0, if predicted probability is less than 0.5; class 1, if predicted probability is greater than 0.5.
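
The equivalence between "pick the most probable class" and "threshold the class 1 column at 0.5" can be checked with a few lines of NumPy. The `proba` array below is a made-up stand-in for a predict_proba output, not output from the model in this notebook.

```python
import numpy as np

# Hypothetical predict_proba output: column 0 = P(class 0), column 1 = P(class 1)
proba = np.array([[0.90, 0.10],
                  [0.40, 0.60],
                  [0.55, 0.45],
                  [0.20, 0.80]])

by_argmax = proba.argmax(axis=1)                   # pick the most probable class
by_threshold = (proba[:, 1] > 0.5).astype(int)     # threshold the class-1 column at 0.5
print(by_argmax, by_threshold)

# Lowering the threshold to 0.4 flips the third sample to class 1
lower_threshold = (proba[:, 1] > 0.4).astype(int)
print(lower_threshold)
```

The third row shows why the threshold matters: its class 1 probability of 0.45 is a negative prediction at 0.5 and a positive one at 0.4.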

**Can we improve classifier performance by changing the classification threshold?** Let's **adjust** the classification threshold to influence the performance of the classifier. 


#### 7.1 Threshold calibration to improve model accuracy

We calculate the accuracy using different values for the classification threshold, and pick the threshold that resulted in the highest accuracy.

In [None]:
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt

# Calculate the accuracy using different values for the classification threshold, 
# and pick the threshold that resulted in the highest accuracy.
highest_accuracy = 0
threshold_highest_accuracy = 0

# Predicted probabilities for class 1; computed once, outside the loop
val_probs_class1 = lrClassifier.predict_proba(X_val_text_vectors)[:, 1]

thresholds = np.arange(0, 1, 0.01)
scores = []
for t in thresholds:
    # set threshold to 't' instead of 0.5
    y_val_other = (val_probs_class1 >= t).astype(int)
    score = accuracy_score(y_val, y_val_other)
    scores.append(score)
    if score > highest_accuracy:
        highest_accuracy = score
        threshold_highest_accuracy = t
print("Highest Accuracy on Validation:", highest_accuracy,
      ", Threshold for the highest Accuracy:", threshold_highest_accuracy)

# Let's plot the accuracy versus different choices of thresholds
plt.plot([0.5, 0.5], [np.min(scores), np.max(scores)], linestyle='--')
plt.plot(thresholds, scores, marker='.')
plt.title('Accuracy versus different choices of thresholds')
plt.xlabel('Threshold')
plt.ylabel('Accuracy')
plt.show()
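
The same threshold sweep can be written without an explicit loop by using NumPy broadcasting. This is a minimal sketch on made-up labels and probabilities; `y_true`, `p1`, and `grid` are hypothetical names, and the numbers are not taken from the review dataset.

```python
import numpy as np

# Hypothetical validation labels and class-1 probabilities
y_true = np.array([0, 0, 1, 1, 1, 0])
p1 = np.array([0.23, 0.62, 0.57, 0.91, 0.38, 0.12])

grid = np.arange(0, 1, 0.01)            # candidate thresholds

# Row i holds the predictions obtained with threshold grid[i]; shape (100, 6)
preds = p1[None, :] >= grid[:, None]

# Accuracy per threshold: fraction of predictions that match the labels
acc = (preds == (y_true == 1)).mean(axis=1)

best = grid[acc.argmax()]               # first threshold reaching the top accuracy
print(best, acc.max())
```

On this toy data the best accuracy is 5 out of 6, reached for thresholds between 0.23 and 0.38, so the default of 0.5 is not the best choice here.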

#### 7.2 Threshold calibration to improve model F1 score

Similarly, the choice of classification threshold affects the Precision and Recall metrics. Precision and Recall usually trade off against each other: raising the threshold tends to increase Precision while lowering Recall. The F1 score, their harmonic mean, is high only when both are high. To choose a threshold that balances Precision and Recall, we can compute the Precision-Recall curve and pick the point with the highest F1 score.

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Calculate the precision and recall using different values for the classification threshold
val_predictions_probs = lrClassifier.predict_proba(X_val_text_vectors)
precisions, recalls, thresholds = precision_recall_curve(y_val, val_predictions_probs[:, 1])

Using the Precision and Recall values computed above, we calculate the F1 scores using:

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

and pick the threshold that gives the highest F1 score.

In [None]:
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt

# Calculate the F1 score using different values for the classification threshold, 
# and pick the threshold that resulted in the highest F1 score.
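highest_f1 = 0
threshold_highest_f1 = 0

f1_scores = []
# precisions and recalls have one more entry than thresholds; the last pair is skipped
for idx, threshold in enumerate(thresholds):
    denominator = precisions[idx] + recalls[idx]
    # Guard against division by zero when precision and recall are both 0.
    # The local name is f1 so that sklearn's f1_score function is not overwritten.
    f1 = 2 * precisions[idx] * recalls[idx] / denominator if denominator > 0 else 0.0
    f1_scores.append(f1)
    if f1 > highest_f1:
        highest_f1 = f1
        threshold_highest_f1 = threshold
print("Highest F1 score on Validation:", highest_f1,
      ", Threshold for the highest F1 score:", threshold_highest_f1)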

# Let's plot the F1 score versus different choices of thresholds
plt.plot([0.5, 0.5], [np.min(f1_scores), np.max(f1_scores)], linestyle='--')
plt.plot(thresholds, f1_scores, marker='.')
plt.title('F1 Score versus different choices of thresholds')
plt.xlabel('Threshold')
plt.ylabel('F1 Score')
plt.show()
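
Once a threshold has been chosen, it still has to be applied to produce class predictions. The sketch below runs the whole procedure on made-up data: it computes F1 for every threshold returned by `precision_recall_curve`, picks the best one, and checks the result with sklearn's `f1_score`. The arrays and variable names are hypothetical and the numbers are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

# Hypothetical validation labels and class-1 probabilities
y_true = np.array([0, 0, 1, 1, 1, 0])
p1 = np.array([0.23, 0.62, 0.57, 0.91, 0.38, 0.12])

prec, rec, thr = precision_recall_curve(y_true, p1)

# prec and rec have one more entry than thr; drop the final (1, 0) point
prec, rec = prec[:-1], rec[:-1]

# F1 per threshold, returning 0 where precision + recall is 0
denom = prec + rec
f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(denom), where=denom > 0)

best_thr = thr[f1.argmax()]

# Apply the chosen threshold to get the final class predictions
final_preds = (p1 >= best_thr).astype(int)
print(best_thr, f1.max(), f1_score(y_true, final_preds))
```

Because the threshold is picked on the same data used to score it, the reported F1 is likely optimistic; checking it on a separate held-out set would give a fairer estimate.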