## K Nearest Neighbors Model for a Regression Problem

In this notebook, we use the K Nearest Neighbors (KNN) method to build a regressor that predicts the __log_votes__ field of our review dataset, which is very similar to the final project dataset.


1. Reading the dataset
2. Exploratory data analysis and missing value imputation
3. Stop word removal and stemming
4. Scaling numerical fields
5. Splitting the training dataset into training and validation
6. Computing Bag of Words features
7. Fitting the regression model
    * Find more details on the KNN Regressor here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
8. Checking model performance on the validation dataset
9. Trying different K values
10. Ideas for improvement

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __rating:__ Rating of the review
* __log_votes:__ Logarithm-adjusted votes, log(1+votes). *This field is a processed version of the votes field. People can click the "helpful" button when they find a customer review helpful, which increases the vote count by 1. __log_votes__ is calculated as log(1+votes), which compresses the range of the vote counts.*
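
As a small illustration of the log(1+votes) transform, the sketch below applies it to a few made-up vote counts. It assumes the natural logarithm (the dataset description does not state the base), and the `votes` array is hypothetical.

```python
import numpy as np

# Hypothetical raw vote counts, for illustration only
votes = np.array([0, 1, 9, 99, 999])

# log(1 + votes); np.log1p computes this directly (natural log assumed)
log_votes = np.log1p(votes)
print(log_votes)

# np.expm1 inverts the transform: exp(x) - 1
recovered = np.expm1(log_votes)
print(np.round(recovered))
```

Note that zero votes map to zero, and counts spanning three orders of magnitude end up in a range of roughly 0 to 7.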


### 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

In [None]:
import pandas as pd

df = pd.read_csv('../../data/examples/NLP-REVIEW-DATA-REGRESSION.csv')

print('The shape of the dataset is:', df.shape)

Let's look at the first 10 rows of the dataset. Because the __log_votes__ field is numeric, we build a regression model.

In [None]:
df.head(10)

### 2. Exploratory data analysis and missing value imputation

Let's look at the range and distribution of __log_votes__.

In [None]:
df["log_votes"].min()

In [None]:
df["log_votes"].max()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

df["log_votes"].plot.hist()
plt.show()

We can check the number of missing values for each column below.

In [None]:
print(df.isna().sum())

Let's fill in the missing values for __reviewText__ below, using the placeholder "Missing".

In [None]:
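# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df["reviewText"] = df["reviewText"].fillna("Missing")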

### 3. Stop word removal and stemming

In [None]:
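# Download the NLTK resources used for tokenization and stop word removal
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')  # Recent NLTK releases load the tokenizer data from 'punkt_tab'
nltk.download('stopwords')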

We will create the stop word removal and text cleaning processes below. The NLTK library provides a list of common stop words. We will use that list but remove some words from it, because those words (mostly negations such as "not") are useful for understanding the sentiment of a sentence.

In [None]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of whitespace into a single space
        sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
    
    return final_text_list

In [None]:
print("Pre-processing the reviewText field")
df["reviewText"] = process_text(df["reviewText"].tolist()) 

### 4. Scaling numerical fields

We will apply min-max scaling to our __rating__ and __time__ fields so that their values fall between 0 and 1.

In [None]:
df["rating"] = (df["rating"] - df["rating"].min())/(df["rating"].max()-df["rating"].min())
df["time"] = (df["time"] - df["time"].min())/(df["time"].max()-df["time"].min())
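
The same computation is available through scikit-learn's `MinMaxScaler`. The sketch below checks, on a small made-up frame (`toy` and its values are hypothetical), that the manual formula and the scaler agree.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with hypothetical values, standing in for the rating and time columns
toy = pd.DataFrame({"rating": [1, 2, 3, 4, 5],
                    "time": [1.50e9, 1.51e9, 1.52e9, 1.53e9, 1.54e9]})

# Manual min-max scaling, column by column: (x - min) / (max - min)
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Equivalent scaling with scikit-learn
scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)

print(manual)
print(np.allclose(manual.values, scaled))
```

One possible advantage of the scaler object is that it can be fit on the training split only and then reused to transform the validation split, so that validation data does not influence the minimum and maximum. This notebook scales the whole dataset before splitting, which is simpler.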

### 5. Splitting the training dataset into training and validation

The Sklearn library has a useful function for splitting datasets: __train_test_split()__. In the example below, we keep 90% of the data for training and leave 10% for validation.

In [None]:
from sklearn.model_selection import train_test_split

# Input: "reviewText", "rating" and "time"
# Target: "log_votes"
X_train, X_val, y_train, y_val = train_test_split(df[["reviewText", "rating", "time"]],
                                                  df["log_votes"].tolist(),
                                                  test_size=0.10,
                                                  shuffle=True
                                                 )

### 6. Computing Bag of Words features

We use binary features here: each feature is 1 if the word appears in the review and 0 otherwise. Term frequency (TF) and TF-IDF are other options.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Initialize the binary count vectorizer (despite the variable name, no TF-IDF weighting is applied)
tfidf_vectorizer = CountVectorizer(binary=True,
                                   max_features=50    # Limit the vocabulary size
                                  )
# Fit and transform
X_train_text_vectors = tfidf_vectorizer.fit_transform(X_train["reviewText"].tolist())
# Only transform
X_val_text_vectors = tfidf_vectorizer.transform(X_val["reviewText"].tolist())

Let's print our vocabulary below. The number next to the word is its index in the vocabulary.

In [None]:
print(tfidf_vectorizer.vocabulary_)
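
To make the difference between binary, TF (count), and TF-IDF features concrete, here is a minimal sketch on a two-document toy corpus. The `docs` list is made up for illustration, and all vectorizers use their default settings apart from `binary=True`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus; "good" appears twice in the first document
docs = ["good good product", "bad product"]

binary_vec = CountVectorizer(binary=True)   # 1 if the word is present, else 0
count_vec = CountVectorizer()               # raw term counts (TF)
tfidf_vec = TfidfVectorizer()               # counts reweighted by inverse document frequency

binary_X = binary_vec.fit_transform(docs).toarray()
count_X = count_vec.fit_transform(docs).toarray()
tfidf_X = tfidf_vec.fit_transform(docs).toarray()

print(binary_vec.vocabulary_)   # word -> column index
print(binary_X)
print(count_X)
print(tfidf_X.round(2))
```

The binary matrix records only presence, the count matrix keeps the repeated "good", and the TF-IDF matrix downweights "product" because it occurs in both documents.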

### 7. Fitting the regression model

We will use __KNeighborsRegressor__ from the Sklearn library with __n_neighbors__ = 5. We will try different values in section 9.

Documentation for KNeighborsRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

from time import time
start = time()

# Let's merge our features
X_train_features = np.column_stack((X_train_text_vectors.toarray(), 
                                    X_train["rating"].values, 
                                    X_train["time"].values)
                                  )

# Using the default KNN with 5 nearest neighbors
knnRegressor = KNeighborsRegressor(n_neighbors=5)
knnRegressor.fit(X_train_features, y_train)
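
To see what the fitted regressor does at prediction time, the sketch below uses a tiny synthetic dataset (all values are made up). With the default uniform weights, the prediction for a query point is the mean target of its K nearest training points, which we can verify with `kneighbors()`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Tiny synthetic dataset (hypothetical values): one feature, target equal to the feature
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_toy, y_toy)

query = np.array([[1.2]])

# Distances to, and indices of, the 3 nearest training points
distances, indices = knn.kneighbors(query)

# Average the targets of those neighbors by hand
manual_prediction = y_toy[indices[0]].mean()
model_prediction = knn.predict(query)[0]

print(indices[0], manual_prediction, model_prediction)
```

Here the three nearest neighbors of 1.2 are the points at 1.0, 2.0, and 0.0, so both predictions equal 1.0.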

### 8. Checking model performance on the validation dataset

We kept some of our data as validation data. Let's check model performance on this validation dataset. 

One evaluation metric for regression problems is the Mean Squared Error ($\mathrm{MSE}$), defined as:
$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2,
$$
measuring the mean of all squared differences between the data values $y_i$ and the predicted values $\hat{y_i}$, where $n$ is the number of data records.

Another regression evaluation metric is $\mathrm{R^2}$, measuring the fraction of the variance in the dataset our model can explain:
$$
\mathrm{R^2} = 1- \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
$$
where $\bar y = \frac{1}{n}\sum_{i = 1}^n y_i$ is the mean value of the data values $y_i$. 
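
As a quick check of the two formulas, the sketch below computes both metrics by hand with NumPy on a few made-up values and compares them with the Sklearn functions. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data values and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

# MSE: mean of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

For these values the MSE is 0.125 and $\mathrm{R^2}$ is 0.9, from both the manual formulas and the library functions.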

In [None]:
from sklearn.metrics import r2_score, mean_squared_error

X_val_features = np.column_stack((X_val_text_vectors.toarray(), 
                                  X_val["rating"].values, 
                                  X_val["time"].values))

val_predictions = knnRegressor.predict(X_val_features)

end = time()
print('KNN Training and validation time for one value of K (in seconds):', end-start)

print("Mean_squared_error: %f,  R_square_score: %f" % (mean_squared_error(y_val, val_predictions), r2_score(y_val, val_predictions)))

### 9. Trying different K values

Let's try different K values and see how the model performs with each one.

*Note: When experimenting with different values of K, keep in mind that KNN training and validation for one value of K can take around 1 minute.*

In [None]:
K_values = [5, 10, 20, 25, 30, 40, 50]

for K in K_values:
    knnRegressor = KNeighborsRegressor(n_neighbors=K)
    knnRegressor.fit(X_train_features, y_train)
    val_predictions = knnRegressor.predict(X_val_features)
    print("K=%d, Mean_squared_error: %f,  R_square_score: %f" % (K, mean_squared_error(y_val, val_predictions), r2_score(y_val, val_predictions)))
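
Rather than reading the best K off the printed output, the scores can be stored and compared programmatically. The sketch below does this on synthetic data so that it runs on its own; `X_demo`, `y_demo`, and the fixed random seeds are assumptions for illustration, not part of the review dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: 300 rows, 3 features, noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = X_demo.sum(axis=1) + rng.normal(0, 0.1, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.10, random_state=0)

# Validation MSE for each candidate K
results = {}
for K in [5, 10, 20, 25, 30, 40, 50]:
    model = KNeighborsRegressor(n_neighbors=K)
    model.fit(X_tr, y_tr)
    results[K] = mean_squared_error(y_va, model.predict(X_va))

# Pick the K with the lowest validation error
best_K = min(results, key=results.get)
print("Best K: %d, validation MSE: %f" % (best_K, results[best_K]))
```

Because the choice is made on a single validation split, the selected K can vary from split to split.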

### 10. Ideas for improvement

We can usually improve performance with some additional work. You can try the following:
* Change the feature extractor to TF or TF-IDF. Also experiment with different vocabulary sizes.
* Add the other text field, __summary__, to the model and compute Bag of Words features for it.
* Come up with other features, such as the presence of certain punctuation marks, all-capitalized words, or specific words that might be useful for this problem.
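
As one possible starting point for the hand-crafted features idea, the sketch below counts exclamation marks, question marks, and all-capitalized words. The function name `extra_text_features`, the choice of features, and the sample sentences are all hypothetical.

```python
import re
import numpy as np

def extra_text_features(texts):
    """Return one row per text: [exclamation marks, question marks, all-caps words]."""
    rows = []
    for text in texts:
        n_exclaim = text.count("!")
        n_question = text.count("?")
        # Words made of two or more uppercase letters, e.g. "GREAT"
        n_caps = len(re.findall(r"\b[A-Z]{2,}\b", text))
        rows.append([n_exclaim, n_question, n_caps])
    return np.array(rows)

# Made-up example reviews
sample = ["GREAT product!!! Would buy again.", "Is this a joke? NOT worth it."]
features = extra_text_features(sample)
print(features)
```

Features like these would need to be computed on the raw review text, before the lowercasing in `process_text()`, and could then be added as extra columns with `np.column_stack` alongside the Bag of Words features.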