## Tree-Based Models for a Regression Problem, and Hyperparameter Tuning

We continue working with our review dataset to see how tree-based regressors (Decision Tree, Random Forest), combined with hyperparameter search techniques (GridSearch, RandomizedSearch), perform at predicting the __log_votes__ field of our review dataset, which is very similar to the final project dataset.

1. Reading the dataset
2. Exploratory data analysis and missing value imputation
3. Stop word removal and stemming
4. Scaling numerical fields
5. Splitting the training dataset into training and validation
6. Computing Bag of Words features
7. Fitting tree-based regressors and checking the validation performance
    * Find more details on the __DecisionTreeRegressor__ here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
    * Find more details on the __RandomForestRegressor__ here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
8. Hyperparameter Tuning
    * Find more details on the __GridSearchCV__ here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
    * Find more details on the __RandomizedSearchCV__ here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
9. Ideas for improvement

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __rating:__ Rating of the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
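
The __log_votes__ transform can be reproduced and inverted with NumPy. This is a minimal sketch that assumes the natural logarithm (the schema does not state the base) and uses made-up vote counts rather than values from the dataset.

```python
import numpy as np

# Hypothetical raw vote counts (not taken from the dataset)
votes = np.array([0, 1, 9, 99])

# log(1 + votes); log1p stays accurate for small values
log_votes = np.log1p(votes)

# Invert the transform to recover the raw vote counts
recovered = np.expm1(log_votes)

print(log_votes)
print(recovered)
```

Inverting the transform this way can be handy when a prediction needs to be reported as an approximate vote count.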


### 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

In [None]:
import pandas as pd

df = pd.read_csv('../../data/examples/NLP-REVIEW-DATA-REGRESSION.csv')

print('The shape of the dataset is:', df.shape)

Let's look at the first 10 rows of the dataset. As we can see, the __log_votes__ field is numeric, so we build a regression model.

In [None]:
df.head(10)

### 2. Exploratory data analysis and missing value imputation

Let's look at the range and distribution of __log_votes__.

In [None]:
df["log_votes"].min()

In [None]:
df["log_votes"].max()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

df["log_votes"].plot.hist()
plt.show()

We can check the number of missing values for each column below.

In [None]:
print(df.isna().sum())

Let's fill in the missing values for __reviewText__ below. We will just use the placeholder "Missing" here.

In [None]:
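# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df["reviewText"] = df["reviewText"].fillna("Missing")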

### 3. Stop word removal and stemming

In [None]:
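# Download the NLTK resources we need (tokenizer models and stop word list)
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')  # Newer NLTK releases look up the tokenizer under this name
nltk.download('stopwords')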

We will create the stop word removal and text cleaning processes below. The NLTK library provides a list of common stop words. We will use that list, but remove some words from it because they are actually useful for understanding the sentiment of a sentence.

In [None]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of whitespace (spaces, tabs, newlines)
        sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
    
    return final_text_list

In [None]:
print("Pre-processing the reviewText field")
df["reviewText"] = process_text(df["reviewText"].tolist()) 

### 4. Scaling numerical fields

We will apply min-max scaling to our __rating__ and __time__ fields so that their values lie between 0 and 1.

In [None]:
df["rating"] = (df["rating"] - df["rating"].min())/(df["rating"].max()-df["rating"].min())
df["time"] = (df["time"] - df["time"].min())/(df["time"].max()-df["time"].min())
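
The cell above scales using the minimum and maximum of the full dataset. An alternative, sketched here under the assumption that a train/validation split is already available, is to learn the scaling range from the training rows only with Sklearn's __MinMaxScaler__ and reuse it on the validation rows. The toy frames and their values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frames standing in for a train/validation split (values are made up)
train = pd.DataFrame({"rating": [1.0, 3.0, 5.0], "time": [100.0, 200.0, 300.0]})
val = pd.DataFrame({"rating": [2.0, 4.0], "time": [150.0, 350.0]})

scaler = MinMaxScaler()
# Learn min and max from the training rows only
train_scaled = scaler.fit_transform(train[["rating", "time"]])
# Reuse the training min and max on the validation rows
val_scaled = scaler.transform(val[["rating", "time"]])

print(train_scaled)
print(val_scaled)
```

With this approach, validation values outside the training range can fall slightly below 0 or above 1, which tree-based models handle without trouble.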

### 5. Splitting the training dataset into training and validation

The Sklearn library has a useful function to split datasets. We will use the __train_test_split()__ function. In the example below, we use 90% of the data for training and leave 10% for validation.

In [None]:
from sklearn.model_selection import train_test_split

# Input: "reviewText", "rating" and "time"
# Target: "log_votes"
X_train, X_val, y_train, y_val = train_test_split(df[["reviewText", "rating", "time"]],
                                                  df["log_votes"].tolist(),
                                                  test_size=0.10,
                                                  shuffle=True
                                                 )
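
Because the split above is shuffled without a fixed seed, each run produces a different validation set, which makes scores hard to compare between runs. A minimal sketch of a reproducible split follows; the toy dataframe and the seed value 42 are arbitrary choices for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the review dataframe (values are made up)
toy = pd.DataFrame({"rating": range(20), "log_votes": range(20)})

# Passing the same random_state twice yields the same split twice
X_train_a, X_val_a, y_train_a, y_val_a = train_test_split(
    toy[["rating"]], toy["log_votes"], test_size=0.10, shuffle=True, random_state=42)
X_train_b, X_val_b, y_train_b, y_val_b = train_test_split(
    toy[["rating"]], toy["log_votes"], test_size=0.10, shuffle=True, random_state=42)

print(len(X_train_a), len(X_val_a))
print(list(X_val_a.index) == list(X_val_b.index))
```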

### 6. Computing Bag of Words Features

We are using binary features here. TF and TF-IDF are other options.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Initialize the binary count vectorizer
# (despite its name, tfidf_vectorizer holds a binary CountVectorizer, not TF-IDF)
tfidf_vectorizer = CountVectorizer(binary=True,
                                   max_features=50 # Limit the vocabulary size
                                  )
# Fit and transform
X_train_text_vectors = tfidf_vectorizer.fit_transform(X_train["reviewText"].tolist())
# Only transform
X_val_text_vectors = tfidf_vectorizer.transform(X_val["reviewText"].tolist())
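
To make the difference between binary and TF-IDF features concrete, here is a small sketch on a made-up three-sentence corpus. The corpus and variable names are illustrative only; the vectorizers are used with their default settings apart from `binary=True`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus for illustration
corpus = ["great product great price", "not great", "poor product"]

binary_vec = CountVectorizer(binary=True)  # 1 if the word occurs, else 0
tfidf_vec = TfidfVectorizer()              # term frequency weighted by inverse document frequency

binary_features = binary_vec.fit_transform(corpus)
tfidf_features = tfidf_vec.fit_transform(corpus)

# Vocabulary words in column order
print(sorted(binary_vec.vocabulary_, key=binary_vec.vocabulary_.get))
print(binary_features.toarray())
print(tfidf_features.toarray().round(2))
```

Binary features ignore that "great" appears twice in the first sentence, while TF-IDF weights reflect both the repetition and how rare each word is across the corpus.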

Let's print our vocabulary below. The number next to the word is its index in the vocabulary.

In [None]:
print(tfidf_vectorizer.vocabulary_)

Let's merge our features to train a model.

In [None]:
import numpy as np
X_train_features = np.column_stack((X_train_text_vectors.toarray(), 
                                    X_train["rating"].values, 
                                    X_train["time"].values))
X_val_features = np.column_stack((X_val_text_vectors.toarray(), 
                                  X_val["rating"].values,
                                  X_val["time"].values))
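
Calling `.toarray()` is fine for a 50-word vocabulary, but a dense matrix can become large as the vocabulary grows. One possible alternative, sketched here with toy stand-ins for the real matrices, is to keep everything sparse with `scipy.sparse.hstack`; Sklearn's tree-based regressors accept sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins: a sparse bag-of-words matrix and two numeric columns (made-up values)
text_vectors = csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))
rating = np.array([0.5, 1.0])
time = np.array([0.1, 0.9])

# Stack the numeric columns next to the text features without densifying the text part
numeric = csr_matrix(np.column_stack((rating, time)))
features = hstack([text_vectors, numeric]).tocsr()

print(features.shape)
print(features.toarray())
```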

### 7. Fitting tree-based regressors and checking the validation performance

#### 7.1 DecisionTreeRegressor
Let's first fit a __DecisionTreeRegressor__ from the Sklearn library, and check the performance on the validation dataset.

Find more details on the __DecisionTreeRegressor__ here:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

dtRegressor = DecisionTreeRegressor(max_depth = 10,
                                    min_samples_leaf = 15)
dtRegressor.fit(X_train_features, y_train)
dtRegressor_val_predictions = dtRegressor.predict(X_val_features)
print("DecisionTreeRegressor on Validation: Mean_squared_error: %f, R_square_score: %f" % \
      (mean_squared_error(y_val, dtRegressor_val_predictions), r2_score(y_val, dtRegressor_val_predictions)))

#### 7.2 RandomForestRegressor
Let's now fit a __RandomForestRegressor__ from the Sklearn library, and check the performance on the validation dataset.

Find more details on the __RandomForestRegressor__ here:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

__Warning__: When experimenting with different sizes of random forests, keep in mind that training time grows with the number of trees, so larger forests can take considerably longer to fit!

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

rfRegressor = RandomForestRegressor(n_estimators = 200,
                                    max_depth = 10,
                                    min_samples_leaf = 15)
rfRegressor.fit(X_train_features, y_train)
rfRegressor_val_predictions = rfRegressor.predict(X_val_features)
print("RandomForestRegressor on Validation: Mean_squared_error: %f, R_square_score: %f" % \
      (mean_squared_error(y_val, rfRegressor_val_predictions), r2_score(y_val, rfRegressor_val_predictions)))
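
MSE and R² values are easier to judge next to a trivial baseline. The sketch below uses Sklearn's __DummyRegressor__, which always predicts the training mean, on synthetic data standing in for the real features and target. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data standing in for the features and the log_votes target
rng = np.random.default_rng(0)
X_train_toy = rng.random((100, 3))
y_train_toy = rng.random(100)
X_val_toy = rng.random((20, 3))
y_val_toy = rng.random(20)

# Baseline that ignores the features and predicts the training mean
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train_toy, y_train_toy)
baseline_val_predictions = baseline.predict(X_val_toy)

baseline_mse = mean_squared_error(y_val_toy, baseline_val_predictions)
baseline_r2 = r2_score(y_val_toy, baseline_val_predictions)
print("Baseline on Validation: Mean_squared_error: %f, R_square_score: %f" % (baseline_mse, baseline_r2))
```

A constant prediction can never score an R² above 0 on the validation set, so a tree-based model is only adding value if it clearly beats this baseline.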

### 8. Hyperparameter Tuning

Let's try different parameter values and see how the __DecisionTreeRegressor__ model performs under some combinations of parameters.

__Warning__: The number of hyperparameters tuned, together with the number of cross-validation folds, can greatly increase training time! This is especially true when tuning the __RandomForestRegressor__ instead of the lower-performing __DecisionTreeRegressor__ that we showcase below for speed. Similar tuning on a __RandomForestRegressor__ model can take anywhere from several minutes to hours!
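
To get a feel for the cost before launching a search, the number of model fits can be counted up front. This sketch uses the same parameter grid and fold count as the GridSearchCV cell in section 8.1; `ParameterGrid` is Sklearn's helper for enumerating grid combinations.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'max_depth': [10, 20, 30, 40],
              'min_samples_leaf': [5, 15, 25, 35]}
cv = 5

# Every combination is trained once per cross-validation fold
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * cv

print("Candidates:", n_candidates)
print("Fits during the search:", n_fits)
```

With the default `refit=True`, the search also trains one final model on the full training set using the best combination.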

#### 8.1 GridSearchCV

Find more details on the __GridSearchCV__ here:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

dt = DecisionTreeRegressor()
parameters = {'max_depth': [10, 20, 30, 40],
              'min_samples_leaf': [5, 15, 25, 35]}
                     
# NOTE: By default, GridSearchCV evaluates with the estimator's own score function
# (r2_score for regression; accuracy_score for classification). Here we override
# that via the 'scoring' parameter and use negative mean squared error, so
# best_score_ is negative and values closer to 0 are better.
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

# NOTE: You can experiment with different cv numbers, default = 5
regressor_grid = GridSearchCV(dt,
                              parameters,
                              cv=5,
                              verbose=1,
                              n_jobs=-1,
                              scoring = 'neg_mean_squared_error')
regressor_grid.fit(X_train_features, y_train)

print("Best parameters: ", regressor_grid.best_params_)
print("Best score (negative MSE): ", regressor_grid.best_score_)

regressor_grid_val_predictions = regressor_grid.best_estimator_.predict(X_val_features)

print("DecisionTreeRegressor with GridSearchCV on Validation: Mean_squared_error: %f, R_square_score: %f" % \
      (mean_squared_error(y_val, regressor_grid_val_predictions), r2_score(y_val, regressor_grid_val_predictions)))
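
Beyond the single best combination, the fitted search object keeps per-candidate results in `cv_results_`, which is convenient to inspect as a dataframe. The sketch below runs a small grid on synthetic data; the data, the reduced grid, and the `mean_mse` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the review features
rng = np.random.default_rng(0)
X_toy = rng.random((200, 4))
y_toy = X_toy[:, 0] * 2 + rng.normal(scale=0.1, size=200)

grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    {'max_depth': [2, 4], 'min_samples_leaf': [5, 15]},
                    cv=5,
                    scoring='neg_mean_squared_error')
grid.fit(X_toy, y_toy)

results = pd.DataFrame(grid.cv_results_)
# Flip the sign to report plain MSE (lower is better)
results["mean_mse"] = -results["mean_test_score"]
print(results[["params", "mean_mse", "std_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))
print("Best cross-validated MSE:", -grid.best_score_)
```

Looking at the whole table shows how sensitive the score is to each parameter, which helps when choosing the next range to try.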

#### 8.2 RandomizedSearchCV

Find more details on the __RandomizedSearchCV__ here:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
  
dt = DecisionTreeRegressor()
parameters = {'max_depth': [10, 20, 30, 40],
              'min_samples_leaf': [5, 15, 25, 35]}
                      
# NOTE: By default, RandomizedSearchCV evaluates with the estimator's own score function
# (r2_score for regression; accuracy_score for classification). Here we override
# that via the 'scoring' parameter and use negative mean squared error, so
# best_score_ is negative and values closer to 0 are better.
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

# NOTE: You can experiment with different cv numbers, default = 5
# NOTE: You can also experiment with different n_iter 
# (number of parameter settings that are sampled by the RandomizedSearch), default = 10
regressor_rand = RandomizedSearchCV(dt,
                                    parameters,
                                    cv=5,
                                    verbose=1,
                                    n_jobs=-1,
                                    scoring = 'neg_mean_squared_error')
regressor_rand.fit(X_train_features, y_train)

print("Best parameters: ", regressor_rand.best_params_)
print("Best score (negative MSE): ", regressor_rand.best_score_)

regressor_rand_val_predictions = regressor_rand.best_estimator_.predict(X_val_features)

print("DecisionTreeRegressor with RandomizedSearchCV on Validation: Mean_squared_error: %f, R_square_score: %f" % \
      (mean_squared_error(y_val, regressor_rand_val_predictions), r2_score(y_val, regressor_rand_val_predictions)))
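
__RandomizedSearchCV__ can also sample from distributions instead of fixed lists, which lets it explore values between the grid points. This is a minimal sketch on synthetic data; the integer ranges, `n_iter=8`, and the seeds are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the review features
rng = np.random.default_rng(0)
X_toy = rng.random((200, 4))
y_toy = X_toy[:, 0] * 2 + rng.normal(scale=0.1, size=200)

# randint(low, high) samples integers from low up to high - 1
param_distributions = {'max_depth': randint(2, 41),
                       'min_samples_leaf': randint(5, 36)}

search = RandomizedSearchCV(DecisionTreeRegressor(random_state=0),
                            param_distributions,
                            n_iter=8,         # number of sampled parameter settings
                            cv=5,
                            scoring='neg_mean_squared_error',
                            random_state=0)   # makes the sampled settings repeatable
search.fit(X_toy, y_toy)

print("Best parameters: ", search.best_params_)
print("Best cross-validated MSE: ", -search.best_score_)
```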

### 9. Ideas for improvement

**Preprocessing**: We can usually improve performance with some additional work. You can try the following:
* Change the feature extractor to TF or TF-IDF. Also experiment with different vocabulary sizes.
* Add the other text field __summary__ to the model and get bag of words features of it.
* Come up with other features, such as the presence of certain punctuation marks, all-capitalized words, or specific words that might be useful in this problem.
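
As one way to approach the last idea, simple counts can be derived from the raw text with pandas string methods. The reviews and the feature names below are made up for illustration, and such features would need to be computed before the text is lowercased and cleaned.

```python
import pandas as pd

# Made-up raw reviews (before any lowercasing or cleaning)
reviews = pd.DataFrame({"reviewText": ["GREAT product!!! Works well.",
                                       "not good",
                                       "Why is this SO expensive?"]})

text = reviews["reviewText"]
reviews["n_exclamations"] = text.str.count("!")
reviews["n_questions"] = text.str.count(r"\?")
# Words of two or more letters written entirely in capitals
reviews["n_caps_words"] = text.str.count(r"\b[A-Z]{2,}\b")
reviews["n_chars"] = text.str.len()

print(reviews)
```

These columns could then be stacked next to the bag of words features in the same way as __rating__ and __time__.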

**Hyperparameter Tuning**: It is always a good idea to try other parameter ranges and/or combinations of parameters. If training time is a priority, try __RandomizedSearchCV__ instead of __GridSearchCV__: it evaluates only a sample of the parameter combinations (controlled by n_iter), so it is usually faster and often gives results that are almost as good.