# 0. Set Up


In [1]:
# Pin pandas to 1.3.4, the version the pickle objects were saved with
!pip install pandas==1.3.4

Collecting pandas==1.3.4
  Downloading pandas-1.3.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
     |████████████████████████████████| 11.3 MB 8.0 MB/s 
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.1.5
    Uninstalling pandas-1.1.5:
      Successfully uninstalled pandas-1.1.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas~=1.1.0; python_version >= "3.0", but you have pandas 1.3.4 which is incompatible.
Successfully installed pandas-1.3.4


In [1]:
# Import required modules
import pandas as pd
import numpy as np
import pickle

In [2]:
# Mount Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Set directory
%cd /content/drive/My Drive/GR5067 NLP Project/data/

/content/drive/My Drive/GR5067 NLP Project/data


In [5]:
# Import data (context managers close each file handle after loading)
with open('Data Cleaning/data_pet.pkl', 'rb') as f:
    data_pet = pickle.load(f)
with open('Vectorization/my_vec_data_1_1.pkl', 'rb') as f:
    data_countvec = pickle.load(f)
with open('Vectorization/my_tf_idf_data_1_1.pkl', 'rb') as f:
    data_tfidf = pickle.load(f)

In [6]:
# Splitting data into train and test set
from sklearn.model_selection import train_test_split
y = data_pet.pro_sales_num
X_c = data_countvec
X_t = data_tfidf
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_c, y, random_state=0)
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(X_t, y, random_state=1)
print(X_train_c.shape, X_test_c.shape, X_train_t.shape, X_test_t.shape, '\n',
      y_train_c.shape, y_test_c.shape, y_train_t.shape, y_test_t.shape)

(8756, 9563) (2919, 9563) (8756, 9563) (2919, 9563) 
 (8756,) (2919,) (8756,) (2919,)


# 1. LinearRegression Model

## 1.1 Training using CountVectorizer data

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lr = LinearRegression().fit(X_train_c, y_train_c) 

print("LINEAR REGRESSION (CountVectorizer data)")
print("Mean k-fold CV score: {:.3f}".format(np.mean(cross_val_score(lr, X_train_c, y_train_c)))) # default scorer is r2
print("Training set score: {:.3f}".format(lr.score(X_train_c, y_train_c))) 
print("Test set score: {:.3f}".format(lr.score(X_test_c, y_test_c)))

LINEAR REGRESSION (CountVectorizer data)
Mean k-fold CV score: -2346653148224874348544.000
Training set score: 0.998
Test set score: -3861608762896474963968.000


## 1.2 Training using TF-IDF data

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lr = LinearRegression().fit(X_train_t, y_train_t) 

print("LINEAR REGRESSION (TF-IDF data)")
print("Mean k-fold CV score: {:.3f}".format(np.mean(cross_val_score(lr, X_train_t, y_train_t))))
print("Training set score: {:.3f}".format(lr.score(X_train_t, y_train_t)))
print("Test set score: {:.3f}".format(lr.score(X_test_t, y_test_t)))

LINEAR REGRESSION (TF-IDF data)
Mean k-fold CV score: -1544445352530982800982016.000
Training set score: 0.996
Test set score: -2674059826798730240065536.000


<u> Interpreting results </u>    
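From the results above, both types of data give a training set score close to 1 (0.998 and 0.996) and a hugely negative test set score, so it is clear that the Linear Regression model is heavily overfitting on the training data. This is expected here: the feature matrix has more columns (9,563) than training rows (8,756), so unregularized least squares can fit the training set almost exactly. Hence there is a need to explore the use of **regularization** (e.g. Ridge or Lasso regression) to restrict the model and prevent overfitting.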

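Comparing CountVectorizer data to TF-IDF data, the training scores are almost identical (0.998 vs. 0.996), and the test and CV scores are so negative for both that neither is meaningful; the TF-IDF scores are in fact the more negative of the two. The Linear Regression results therefore cannot tell us which representation to prefer. Note also that the two train/test splits use different `random_state` values, so the two sets of scores are not computed on the same held-out rows. The choice between CountVectorizer and TF-IDF data is instead made from the regularized models in the following sections.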
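
The overfitting pattern described above can be reproduced on a small scale. The following is a minimal sketch on synthetic data, not the project data; the sample and feature counts, the seed, and the five "true" coefficients are all arbitrary assumptions chosen only to put more features than rows in the training set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 120 rows, 300 features,
# of which only the first 5 actually influence the target.
rng = np.random.default_rng(0)
n_samples, n_features = 120, 300
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Default split keeps 75% for training, i.e. 90 rows against 300 features.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
ols_train_r2 = ols.score(X_tr, y_tr)   # R-squared on the rows used for fitting
ols_test_r2 = ols.score(X_te, y_te)    # R-squared on held-out rows

print("Training set score: {:.3f}".format(ols_train_r2))
print("Test set score: {:.3f}".format(ols_test_r2))
```

With fewer training rows than features, the unregularized fit can pass through every training point, so the training score says little about how the model generalizes.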

<br>
<br>

# 2. Ridge Regression Model

## 2.1 Training using CountVectorizer data

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
ridge_param_grid = {'alpha':[0.01, 0.1, 1, 10]} 
ridge_grid = GridSearchCV(Ridge(random_state=2), param_grid=ridge_param_grid) 
ridge_grid.fit(X_train_c, y_train_c)

print("RIDGE REGRESSION (CountVectorizer data)")
print("Best mean CV score: {:.3f}".format(ridge_grid.best_score_))
print("Best parameters: {}".format(ridge_grid.best_params_))
print("Training set score: {:.3f}".format(ridge_grid.score(X_train_c, y_train_c)))
print("Test Set Score: {:.3f}".format(ridge_grid.score(X_test_c, y_test_c)))

RIDGE REGRESSION (CountVectorizer data)
Best mean CV score: 0.413
Best parameters: {'alpha': 1}
Training set score: 0.889
Test Set Score: 0.554


## 2.2 Training using TF-IDF data

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
ridge_param_grid = {'alpha':[0.01, 0.1, 1, 10]} 
ridge_grid = GridSearchCV(Ridge(random_state=3), param_grid=ridge_param_grid) 
ridge_grid.fit(X_train_t, y_train_t)

print("RIDGE REGRESSION (TF-IDF data)")
print("Best mean CV score: {:.3f}".format(ridge_grid.best_score_))
print("Best parameters: {}".format(ridge_grid.best_params_))
print("Training set score: {:.3f}".format(ridge_grid.score(X_train_t, y_train_t)))
print("Test Set Score: {:.3f}".format(ridge_grid.score(X_test_t, y_test_t)))

RIDGE REGRESSION (TF-IDF data)
Best mean CV score: 0.428
Best parameters: {'alpha': 0.1}
Training set score: 0.922
Test Set Score: 0.707


<u> Interpreting results </u>    
From the results above, we can see that **Ridge Regression has overall much better performance than Linear Regression** since the test set scores are positive and much higher for Ridge regression. For CountVectorizer data, the test set score is 0.554 and for TF-IDF data, the test set score is even higher at 0.707. Nonetheless, as the TF-IDF test score (0.707) is still well below the corresponding training set score (0.922), **there could be overfitting** using this model and alternative models should still be explored.
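
To make the three reported numbers concrete: `best_score_` is the mean cross-validated R-squared of the winning `alpha` (computed only on folds of the training set), while the two `score` calls evaluate the estimator that `GridSearchCV` refits on the full training set. The sketch below assumes synthetic data from `make_regression` as a stand-in for the project data; the sizes and noise level are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), not the product-title features.
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

alphas = [0.01, 0.1, 1, 10]
grid = GridSearchCV(Ridge(), param_grid={'alpha': alphas})  # default 5-fold CV, R-squared scorer
grid.fit(X_tr, y_tr)

mean_cv = grid.cv_results_['mean_test_score']  # one mean CV score per alpha
best_alpha = grid.best_params_['alpha']
train_r2 = grid.score(X_tr, y_tr)              # uses the refit best estimator
test_r2 = grid.score(X_te, y_te)

for alpha, score in zip(alphas, mean_cv):
    print("alpha={:<5} mean CV score: {:.3f}".format(alpha, score))
print("Best alpha:", best_alpha)
print("Training set score: {:.3f}".format(train_r2))
print("Test set score: {:.3f}".format(test_r2))
```

Because the CV score is averaged over models trained on only part of the training set, it need not match the test set score, which is one reason the two can differ in the outputs above.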

<br>
<br>

# 3. Lasso Regression Model

## 3.1 Training using CountVectorizer data

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
lasso_param_grid = {'alpha':[0.01, 0.1, 1, 10]} 
lasso_grid = GridSearchCV(Lasso(max_iter=100000, random_state=4), param_grid=lasso_param_grid) 
lasso_grid.fit(X_train_c, y_train_c)

print("LASSO REGRESSION (CountVectorizer data)")
print("Best mean CV score: {:.3f}".format(lasso_grid.best_score_))
print("Best parameters: {}".format(lasso_grid.best_params_))
print("Training set score: {:.3f}".format(lasso_grid.score(X_train_c, y_train_c)))
print("Test Set Score: {:.3f}".format(lasso_grid.score(X_test_c, y_test_c)))

LASSO REGRESSION (CountVectorizer data)
Best mean CV score: 0.376
Best parameters: {'alpha': 0.1}
Training set score: 0.799
Test Set Score: 0.510


## 3.2 Training using TF-IDF data

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
lasso_param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10]} 
lasso_grid = GridSearchCV(Lasso(max_iter=100000, random_state=5), param_grid=lasso_param_grid) 
lasso_grid.fit(X_train_t, y_train_t)

print("LASSO REGRESSION (TF-IDF data)")
print("Best mean CV score: {:.3f}".format(lasso_grid.best_score_))
print("Best parameters: {}".format(lasso_grid.best_params_))
print("Training set score: {:.3f}".format(lasso_grid.score(X_train_t, y_train_t)))
print("Test set Score: {:.3f}".format(lasso_grid.score(X_test_t, y_test_t)))

<u> Interpreting results </u>    
From the results above, we can see that **Lasso Regression has comparable performance to Ridge Regression** since the test set scores are in the same range. For CountVectorizer data, the Lasso test score of 0.510 is lower than that of Ridge (0.554). However, for TF-IDF data, the Lasso test score is higher at 0.735 compared to Ridge (0.707). Nonetheless, as this test score is still lower than the training set score of 0.913, **there could be overfitting** using this model and alternative models should still be explored.
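
One practical difference between the two regularizers, relevant when the coefficients are inspected in section 5.2, is that Lasso's L1 penalty can set coefficients exactly to zero, whereas Ridge only shrinks them. A minimal sketch on synthetic data (the sizes, seed, and `alpha` values are assumptions, not the tuned values above):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data (assumption): 50 features, only 5 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_coef = np.zeros(50)
true_coef[:5] = [4.0, -3.0, 2.0, 1.5, -1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1, max_iter=100000).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count the coefficients each model keeps away from exactly zero.
n_nonzero_lasso = int(np.sum(lasso.coef_ != 0))
n_nonzero_ridge = int(np.sum(ridge.coef_ != 0))

print("Non-zero coefficients (Lasso):", n_nonzero_lasso)
print("Non-zero coefficients (Ridge):", n_nonzero_ridge)
```

On the project data the same count could be obtained from `lasso_grid.best_estimator_.coef_`, which would show how many of the 9,563 terms the selected model actually uses.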

<br>
<br>

# 4. Random Forest Model


## 4.1 Training using CountVectorizer data

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
rfr_param_grid = {'n_estimators': [100, 200],
                  'max_depth': [10, 20, 30],
                  'max_features': ['log2', 'sqrt'],
                  'n_jobs': [-1]} 
rfr_grid = GridSearchCV(RandomForestRegressor(random_state=6), param_grid=rfr_param_grid)
rfr_grid.fit(X_train_c, y_train_c)

print("RANDOM FOREST REGRESSION (CountVectorizer data)")
print("Best mean CV score: {:.3f}".format(rfr_grid.best_score_))
print("Best parameters: {}".format(rfr_grid.best_params_))
print("Training set score: {:.3f}".format(rfr_grid.score(X_train_c, y_train_c)))
print("Test set Score: {:.3f}".format(rfr_grid.score(X_test_c, y_test_c)))

RANDOM FOREST REGRESSION (CountVectorizer data)
Best mean CV score: 0.443
Best parameters: {'max_depth': 30, 'max_features': 'sqrt', 'n_estimators': 200, 'n_jobs': -1}
Training set score: 0.696
Test set Score: 0.536


## 4.2 Training using TF-IDF data

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
rfr_param_grid = {'n_estimators': [100, 200],
                  'max_depth': [10, 20, 30],
                  'max_features': ['log2', 'sqrt'],
                  'n_jobs': [-1]} 
rfr_grid = GridSearchCV(RandomForestRegressor(random_state=7), param_grid=rfr_param_grid)
rfr_grid.fit(X_train_t, y_train_t)

print("RANDOM FOREST REGRESSION (TF-IDF data)")
print("Best mean CV score: {:.3f}".format(rfr_grid.best_score_))
print("Best parameters: {}".format(rfr_grid.best_params_))
print("Training set score: {:.3f}".format(rfr_grid.score(X_train_t, y_train_t)))
print("Test set Score: {:.3f}".format(rfr_grid.score(X_test_t, y_test_t)))

RANDOM FOREST REGRESSION (TF-IDF data)
Best mean CV score: 0.434
Best parameters: {'max_depth': 30, 'max_features': 'sqrt', 'n_estimators': 200, 'n_jobs': -1}
Training set score: 0.731
Test set Score: 0.583


<u> Interpreting results </u>    
From the results above, we can see that **Random Forest Regression performs worse on the test set than the best-performing models above**. For CountVectorizer data, the Random Forest test score of 0.536 is lower than that of Ridge (0.554). For TF-IDF data, the Random Forest test score of 0.583 is also lower than that of Lasso (0.735). For Random Forest, **there could also be overfitting** as the training set score of 0.731 is higher than the test set score of 0.583. Hence overall, Lasso Regression should still be preferred to Random Forest Regression.

<br>
<br>

# 5. Model Selection and Evaluation

## 5.1 Model Selection

To select the best model to predict sales volume based on product titles, we will choose the model with the highest test set score as this score represents the model's ability to generalize to new unseen data. We will also consider the model's training set score to see if there could be any overfitting or underfitting present.

Among the models above, the best-performing model on the test set was the **Lasso Regression** model using TF-IDF data, with a test set score of 0.735. Among the models trained on TF-IDF data, it also has the second-smallest difference between training and test score (0.913 - 0.735 = 0.178), which suggests it overfits less than Ridge Regression (0.922 - 0.707 = 0.215).

While Random Forest had the smallest difference between training and test score (0.731 - 0.583 = 0.148), its test performance is considerably lower than that of Lasso Regression. We have only used a limited set of parameters to tune the Random Forest Regression model (e.g. using fewer `n_estimators` and only choosing between 'sqrt' and 'log2' for `max_features`) due to computational limitations, and we acknowledge that with more computational resources and time, we could possibly tune this model sufficiently to produce better test performance. Nonetheless, based on our output with the current resources, we will still choose Lasso Regression as our best model as it has the best balance between test set performance and overfitting.
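
The comparison above can be checked by tabulating the reported scores. The sketch below hard-codes the training and test scores from the outputs in sections 2 to 4; the Lasso TF-IDF pair (0.913, 0.735) is taken from the text, since that cell's output is not shown. The dictionary layout and variable names are illustrative only.

```python
# (training score, test score) for each model and data type, as reported above
scores = {
    ('Ridge', 'CountVectorizer'): (0.889, 0.554),
    ('Ridge', 'TF-IDF'): (0.922, 0.707),
    ('Lasso', 'CountVectorizer'): (0.799, 0.510),
    ('Lasso', 'TF-IDF'): (0.913, 0.735),
    ('Random Forest', 'CountVectorizer'): (0.696, 0.536),
    ('Random Forest', 'TF-IDF'): (0.731, 0.583),
}

# Train-test gap per configuration, rounded to the precision of the inputs
gaps = {key: round(train - test, 3) for key, (train, test) in scores.items()}

# Configuration with the highest test set score
best_test = max(scores, key=lambda key: scores[key][1])

for (model, data), gap in sorted(gaps.items(), key=lambda item: item[1]):
    print("{:<14} {:<16} test={:.3f} gap={:.3f}".format(
        model, data, scores[(model, data)][1], gap))
print("Highest test score:", best_test)
```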

In [None]:
# Export best-performing model (the context manager closes the file after writing)
with open('Modelling/bestmodel_lasso.pkl', 'wb') as f:
    pickle.dump(lasso_grid, f)

In [None]:
# # Import best-performing model
# lasso_grid = pickle.load(open('Modelling/bestmodel_lasso.pkl', 'rb'))

## 5.2 Interpreting Lasso Model

In [None]:
# Extracting Lasso coefficients into pandas dataframe
d = {'coef': lasso_grid.best_estimator_.coef_}
coefficients = pd.DataFrame(data=d, index=X_train_t.columns) 
coefficients = coefficients.sort_values(['coef'], ascending=False)

In [None]:
# Creating dataframe of top 40 coefficients
coefficients_top40 = coefficients.head(40)
coefficients_top40

| term | coef |
|---|---|
| 粘狗 | 16042.117798 |
| 干无盐 | 10896.410174 |
| ststtstst | 9335.263959 |
| 多种 | 9004.675974 |
| 宠盒 | 8895.959781 |
| 品特 | 8514.601694 |
| 内特 | 8507.977667 |
| 包邮 | 8387.789171 |
| 梳柯 | 7070.127679 |
| 指挥棒 | 7062.761661 |


In [None]:
# Exporting dataframe of top 40 coefficients
# (`path` is never defined in this notebook, so write relative to the working directory set in section 0)
coefficients_top40.to_csv('coefficients_top40.csv', index=True)

## 5.3 Using Lasso Model to predict new data