# Best model and conclusions

## Environment setup

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
WORKING_PATH = '/content/drive/MyDrive/KeepCoding/NLP/exercise'


In [None]:
%cd {WORKING_PATH}

/content/drive/MyDrive/KeepCoding/NLP/exercise


In [None]:
!pip install -r requirements.txt

In [None]:
import sys
import os
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC
import joblib


In [None]:
# to load custom libraries
sys.path.append(WORKING_PATH)

# load custom libraries

## Train and test best model

Two models were considered: a Support Vector Machine (SVM) and a Random Forest. Cross-validation was used to select the best model and its hyperparameters. The best performer was an SVM with a linear kernel and regularization parameter C = 1. That configuration is retrained below on the full training set and then evaluated on the test set.
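
The selection step itself is not shown in this notebook. As a minimal sketch of how such a comparison could look, the block below runs a cross-validated grid search over both model families with scikit-learn's `GridSearchCV`. The synthetic data, the parameter grids, the 5-fold split, and the `f1_macro` scoring are all assumptions made for illustration, not the settings actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the real review features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.2, 0.8], random_state=0
)

# Hypothetical grids: the grids actually searched are not part of this notebook.
candidates = {
    "svm": (SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}),
    "random_forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [None, 10]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validation over every parameter combination in the grid
    search = GridSearchCV(estimator, grid, cv=5, scoring="f1_macro")
    search.fit(X_demo, y_demo)
    results[name] = (search.best_score_, search.best_params_)

# Keep the model family with the highest mean cross-validated score
best_name = max(results, key=lambda n: results[n][0])
print(best_name, results[best_name])
```

Scoring on macro F1 rather than accuracy is one possible choice here, since accuracy alone can look good on imbalanced data even when the minority class is poorly recognized.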

In [None]:
# read data for training the model
cache_dir = "cache"
cache_file_train = "train_model_data.pkl"
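try:
    with open(os.path.join(cache_dir, cache_file_train), "rb") as f:
        cache_data = joblib.load(f)
    print("Read data to train model from cache file:", cache_file_train)
except OSError as err:
    # cache_data is required below, so stop here instead of failing later with a NameError
    raise RuntimeError(f"Could not read cache file {cache_file_train!r}") from err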

X_train = cache_data['X_train']
X_test = cache_data['X_test']
y_train = cache_data['y_train']
y_test = cache_data['y_test']

Read data to train model from cache file: train_model_data.pkl


In [None]:
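# train the final model on the full training set
final_model = SVC(C=1, kernel="linear").fit(X_train, y_train)
# predict on the training data to get a baseline for comparison with the test set
predict_train = final_model.predict(X_train)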

print(confusion_matrix(y_train, predict_train))
print(classification_report(y_train, predict_train))

[[ 558 1573]
 [  89 7734]]
              precision    recall  f1-score   support

           0       0.86      0.26      0.40      2131
           1       0.83      0.99      0.90      7823

    accuracy                           0.83      9954
   macro avg       0.85      0.63      0.65      9954
weighted avg       0.84      0.83      0.80      9954



In [None]:
# test model
predict_test = final_model.predict(X_test)

print(confusion_matrix(y_test, predict_test))
print(classification_report(y_test, predict_test))

[[ 161  559]
 [  31 2567]]
              precision    recall  f1-score   support

           0       0.84      0.22      0.35       720
           1       0.82      0.99      0.90      2598

    accuracy                           0.82      3318
   macro avg       0.83      0.61      0.62      3318
weighted avg       0.82      0.82      0.78      3318



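Precision is similar for both classes and exceeds 0.80 on the test set (0.84 for class 0, 0.82 for class 1). Recall, however, differs sharply: 0.99 for the majority class but only 0.22 for negative reviews (class 0, the minority class). The confusion matrix shows why: 559 of the 720 negative test reviews are predicted as class 1. A likely cause is class imbalance, since negative reviews make up only about 21% of the training data (2131 of 9954 samples).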
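
As a worked check, the per-class figures can be recomputed directly from the test confusion matrix printed above. This is a small sketch in plain Python; the helper names `per_class_recall` and `per_class_precision` are introduced here for illustration only.

```python
# Test-set confusion matrix from the output above.
# Rows are true classes, columns are predicted classes.
test_cm = [[161, 559],
           [31, 2567]]

def per_class_recall(cm):
    """Recall of class i = correct predictions of i / all true samples of i (row sum)."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def per_class_precision(cm):
    """Precision of class i = correct predictions of i / all predictions of i (column sum)."""
    columns = list(zip(*cm))
    return [col[i] / sum(col) for i, col in enumerate(columns)]

recall = per_class_recall(test_cm)
precision = per_class_precision(test_cm)

print([round(r, 2) for r in recall])      # [0.22, 0.99]
print([round(p, 2) for p in precision])   # [0.84, 0.82]
```

Class 0 recall is 161 / 720, so roughly four out of five negative reviews are missed, while precision stays high because the model rarely predicts class 0 at all (192 predictions in total).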

To improve this, the model needs features that are more predictive of negative reviews. One option is to relax the p-value threshold used during feature selection so that more features are retained. This might give the classifier additional discriminative information and raise recall for the minority class.
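
The feature selection code is not part of this notebook, so the following is only a sketch of the idea under stated assumptions: it uses scikit-learn's `SelectFpr`, which keeps features whose p-value is below `alpha`, with the `f_classif` score function on synthetic data. The score function, the two `alpha` values, and the helper `n_selected` are illustrative choices, not the ones used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif

# Synthetic stand-in for the real feature matrix (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=40, n_informative=5, random_state=0
)

def n_selected(alpha):
    """Count the features whose p-value falls below the given threshold."""
    selector = SelectFpr(f_classif, alpha=alpha).fit(X_demo, y_demo)
    return int(selector.get_support().sum())

strict = n_selected(0.01)    # hypothetical stricter threshold
relaxed = n_selected(0.10)   # hypothetical relaxed threshold

print("features kept at alpha=0.01:", strict)
print("features kept at alpha=0.10:", relaxed)
```

A relaxed threshold never keeps fewer features than a strict one, but the extra features may also be noise, so any change of this kind would need to be validated with cross-validation on minority-class recall.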

Training and test metrics are close (accuracy 0.83 vs 0.82, macro F1 0.65 vs 0.62), so there is no sign of meaningful overfitting.
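
The train/test comparison can be made explicit by recomputing accuracy and macro F1 from the two confusion matrices printed above and looking at the gap. This is a plain-Python sketch; the helper names `accuracy` and `macro_f1` are introduced here for illustration.

```python
# Confusion matrices from the outputs above (rows: true class, columns: predicted class).
train_cm = [[558, 1573],
            [89, 7734]]
test_cm = [[161, 559],
           [31, 2567]]

def accuracy(cm):
    """Share of samples on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def macro_f1(cm):
    """Unweighted mean of per-class F1, using F1 = 2*TP / (predicted + actual)."""
    columns = list(zip(*cm))
    scores = []
    for i in range(len(cm)):
        predicted = sum(columns[i])  # samples predicted as class i
        actual = sum(cm[i])          # samples that truly are class i
        scores.append(2 * cm[i][i] / (predicted + actual))
    return sum(scores) / len(scores)

acc_gap = accuracy(train_cm) - accuracy(test_cm)
f1_gap = macro_f1(train_cm) - macro_f1(test_cm)

print(f"accuracy: train={accuracy(train_cm):.3f} test={accuracy(test_cm):.3f} gap={acc_gap:.3f}")
print(f"macro F1: train={macro_f1(train_cm):.3f} test={macro_f1(test_cm):.3f} gap={f1_gap:.3f}")
```

Both gaps are small, which is consistent with the conclusion above. The low minority-class recall appears on the training set as well, so it is not something the model only fails to generalize.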