# Modeling and Evaluation of Sentiment Prediction

This stage of the project prepares the review text for machine learning algorithms, splits the dataset into train and test sets, and then trains and evaluates several classifiers. The best models are then tested again.

Contents of this notebook:

<ul>
    <li>1. Imports</li>
    <li>2. Data</li>
    <li>3. Preparing Text</li>
        <ul>
            <li>3.1 Removing Missing values</li>
            <li>3.2 Creating three categories of labels from ratings</li>
            <li>3.3 Train/Test Split</li>
            <li>3.4 Vectorizing the text</li>
            <li>Problem faced and how I solved it</li>
            <li>3.5 Feature Scaling</li>
        </ul>
    <li>4. Classification</li>
        <ul>
            <li>4.1 Further splitting data into a train and validation set</li>
            <li>4.2 Logistic Regression</li>
            <li>4.3 Multinomial Naive Bayes</li>
            <li>4.4 Random Forest</li>
            <li>4.5 Decision Tree</li>
            <li>4.6 K Neighbors</li>
            <li>4.7 AdaBoost</li>
            <li>4.8 XGBoost</li>
        </ul>
    <li>5. Evaluation</li>
        <ul>
            <li>5.1 Comparing scores from all models</li>
            <li>5.2 Fitting the best model with test data</li>
            <li>5.3 Further model metrics and tuning</li>
        </ul>
</ul>

# 1. Imports

In [1]:
#basic libraries for linear algebra and data processing
import numpy as np
import pandas as pd

#visualization
import matplotlib.pyplot as plt
import seaborn as sns

!pip install xgboost
!pip install yellowbrick
#data preparation tools
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
import nltk
from nltk.corpus import stopwords

import pickle

#classification class
from Classification_py import Classification

#time and warnings
import time
import warnings

#settings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set_context('poster', font_scale=0.5)



# 2. Data

In [2]:
#loading the review dataset
file_path = r"C:\Users\besid\OneDrive\Desktop\on campus internship\machine learning\yelpproject\data\review_prepared.csv"
review = pd.read_csv(file_path)

In [3]:
#keeping only the text and stars columns
reviews = review[['text', 'stars']].reset_index(drop=True)

In [4]:
print(reviews.shape)
reviews.head()

(229130, 2)


|   | text | stars |
|---|------|-------|
| 0 | My wife took me here on my birthday for breakf... | 5 |
| 1 | I have no idea why some people give bad review... | 5 |
| 2 | love the gyro plate. Rice is so good and I als... | 4 |
| 3 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | 5 |
| 4 | General Manager Scott Petello is a good egg!!!... | 5 |


# 3. Preparing Text

## 3.1 Removing Missing values

Checking whether there are missing values I didn't catch the first time.

In [5]:
reviews.isnull().sum()

text     6
stars    0
dtype: int64

In [6]:
reviews.dropna(inplace = True)
print(reviews.shape)

(229124, 2)


## 3.2 Creating three categories of labels from ratings

Creating labels of positive, negative, and neutral from the corresponding star ratings: 4 or 5 stars is positive, 1 or 2 stars is negative, and 3 stars is neutral.

In [7]:
reviews['stars'].value_counts()

stars
4    79702
5    75911
3    35266
2    20897
1    17348
Name: count, dtype: int64

In [8]:
#creating labels from stars
reviews['label'] = reviews['stars'].apply(lambda s: 'positive' if s >= 4 else ('negative' if s <= 2 else 'neutral'))

reviews.head()

|   | text | stars | label |
|---|------|-------|-------|
| 0 | My wife took me here on my birthday for breakf... | 5 | positive |
| 1 | I have no idea why some people give bad review... | 5 | positive |
| 2 | love the gyro plate. Rice is so good and I als... | 4 | positive |
| 3 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | 5 | positive |
| 4 | General Manager Scott Petello is a good egg!!!... | 5 | positive |


In [9]:
reviews['label'].value_counts()

label
positive    155613
negative     38245
neutral      35266
Name: count, dtype: int64
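As a quick check on the labelling rule, the sketch below re-derives the label counts from the star counts printed earlier and works out each label's share of the dataset. The helper name `stars_to_label` is introduced here for illustration only; the rule itself is the one used in the cell above. The positive share (about 68%) is a useful reference point: a model that always predicts "positive" would reach roughly that accuracy on this data.

```python
from collections import Counter


def stars_to_label(stars):
    """Map a 1-5 star rating to a sentiment label (same rule as the lambda above)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


# Star counts copied from the value_counts() output in this section.
star_counts = {4: 79702, 5: 75911, 3: 35266, 2: 20897, 1: 17348}

# Aggregate the star counts into label counts.
label_counts = Counter()
for stars, count in star_counts.items():
    label_counts[stars_to_label(stars)] += count

total = sum(label_counts.values())
label_shares = {label: count / total for label, count in label_counts.items()}

# Print labels from most to least frequent.
for label, share in sorted(label_shares.items(), key=lambda kv: -kv[1]):
    print(f"{label:<9}{label_counts[label]:>8}  {share:.1%}")
```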

## 3.3 Train/Test Split

In [10]:
X = reviews['text']
y = reviews['label']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [12]:
print('Shape of the X train set: ', X_train.shape)
print('Shape of the X test set: ', X_test.shape)
print('Shape of the y train set: ', y_train.shape)
print('Shape of the y test set: ', y_test.shape)

Shape of the X train set:  (153513,)
Shape of the X test set:  (75611,)
Shape of the y train set:  (153513,)
Shape of the y test set:  (75611,)


In [13]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

#fit the encoder on the training labels only, then reuse the same mapping for the test labels
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
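The reports later in this notebook refer to classes 0, 1, and 2. `LabelEncoder` assigns codes in sorted order of the labels, so here 0 is negative, 1 is neutral, and 2 is positive. The toy example below (the label lists are made up for illustration) shows that mapping, and why the test labels are better encoded with `transform` than with a second `fit_transform`: refitting on a split that happens to lack a class would silently shift the codes.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration; the test split has no "neutral" example.
train_labels = ["positive", "neutral", "negative", "positive"]
test_labels = ["negative", "positive"]

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # learn the mapping here only
test_encoded = encoder.transform(test_labels)        # reuse it, do not refit

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'negative': 0, 'neutral': 1, 'positive': 2}

# Refitting on the test labels alone gives "positive" a different code.
refit_encoded = LabelEncoder().fit_transform(test_labels)
print(list(test_encoded), list(refit_encoded))  # [0, 2] [0, 1]
```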

## 3.4 Vectorizing the text

In [14]:
print(type(stopwords))

<class 'nltk.corpus.reader.wordlist.WordListCorpusReader'>


## Problem faced and how I solved it

The line that caused the problem was `stopwords = set(stopwords.words('english'))`.

This line assumes that `stopwords` is the NLTK stopwords corpus reader, which has a `words()` method. The first run works, but it also rebinds the name `stopwords` to the resulting set.

If the line is run again, `stopwords` is already a set (`type(stopwords)` then reports `<class 'set'>`), and sets in Python have no `words()` method. Calling `stopwords.words('english')` therefore raises an `AttributeError`. The fix in the next cell imports the corpus reader under a different name, so the reader and the set never share a name.
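The failure can be reproduced without NLTK. In the sketch below, `FakeCorpusReader` is a hypothetical stand-in for `nltk.corpus.stopwords` (it only mimics the `words()` method), so no download is needed; everything else follows the line discussed above.

```python
class FakeCorpusReader:
    """Hypothetical stand-in for nltk.corpus.stopwords; mimics words() only."""

    def words(self, language):
        return ["the", "a", "and"] if language == "english" else []


stopwords = FakeCorpusReader()

# First run: works, but rebinds the name `stopwords` to a plain set.
stopwords = set(stopwords.words("english"))

# Second run of the same line: the name now points at a set, which has no words().
try:
    stopwords = set(stopwords.words("english"))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(type(stopwords).__name__, "->", error_message)
```

Keeping the reader and the derived set under separate names, as the next cell does, makes the cell safe to re-run.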

In [15]:
import nltk
from nltk.corpus import stopwords as nltk_stopwords  # Import with a different name

# Download stopwords data (if not already downloaded)
nltk.download('stopwords')

# Create the stopwords set
stopwords_set = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\besid\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
# Convert your set of stopwords to a list
custom_stopwords = list(stopwords_set)

In [19]:
#building the vectorizer
vectorizer = TfidfVectorizer(lowercase = True, 
                             stop_words = custom_stopwords, 
                             ngram_range = (1,2), 
                             min_df = 0.01)

In [20]:
#vectorizing the training set
X_train_vect = vectorizer.fit_transform(X_train)
print('Shape of X_train vectorized: ',X_train_vect.shape)

Shape of X_train vectorized:  (153513, 1095)


## Learning from TfidfVectorizer Stopwords Issue

### Problem:
I encountered an error when trying to use a set of stopwords with TfidfVectorizer.

### What I Learned:
1. TfidfVectorizer's `stop_words` parameter is picky about input types.
2. It accepts 'english' (str), None, or a list of words, but not a set.
3. Error messages can give clues about what's wrong and how to fix it.

### How I Solved It:
1. I converted my set of stopwords to a list.
2. Updated my code to use this list:
   ```python
   custom_stopwords = list(stopwords_set)
   vectorizer = TfidfVectorizer(stop_words=custom_stopwords)
   ```

In [21]:
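#vectorizing the test set with the vocabulary learned from the training set
#(transform, not fit_transform, so the columns match X_train_vect)
X_test_vect = vectorizer.transform(X_test)
print('Shape of X_test vectorized: ',X_test_vect.shape)

Shape of X_test vectorized:  (75611, 1095)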
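A small illustration of why the test set should go through `transform` rather than `fit_transform`. The documents are made up; the point is only that a vectorizer refitted on the test documents learns a different vocabulary, so its columns no longer line up with the features the model was trained on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration.
train_docs = ["great food and great service", "terrible food", "service was okay"]
test_docs = ["great pizza", "slow service"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

# Refitting on the test documents builds a separate, incompatible vocabulary.
refit_matrix = TfidfVectorizer().fit_transform(test_docs)

print("train:", train_matrix.shape)
print("test (transform):", test_matrix.shape)
print("test (refit):", refit_matrix.shape)
```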


## 3.5 Feature Scaling

In [22]:
#initializing StandardScaler
scaler = StandardScaler(with_mean = False)

# StandardScaler standardizes the features so that they are all on a similar scale.
# Why do this? Because some machine learning algorithms perform better
# or converge faster when features are on a similar scale.

# By default it calculates the mean and standard deviation of each feature
# and transforms each value as (x - mean) / standard deviation.

# with_mean=False skips the centering step, so each value is only divided by
# the standard deviation: x / standard deviation. This is needed because the
# TF-IDF output is a sparse matrix, and subtracting the mean would turn its
# zeros into non-zeros and make it dense.
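The effect of `with_mean=False` can be sketched with plain NumPy. The matrix below is a made-up stand-in for a few TF-IDF rows, and the arithmetic mirrors what `StandardScaler` computes (it uses the population standard deviation, `ddof=0`); it is an illustration, not the scaler's actual implementation.

```python
import numpy as np

# Tiny made-up stand-in for a TF-IDF matrix: rows are documents, columns are terms.
X = np.array([[0.0, 2.0, 0.0],
              [0.0, 4.0, 1.0],
              [3.0, 0.0, 0.0],
              [0.0, 6.0, 4.0]])

# Population standard deviation per column (ddof=0).
std = X.std(axis=0)

scaled_no_centering = X / std                 # what with_mean=False amounts to
scaled_centered = (X - X.mean(axis=0)) / std  # the usual z-score

# Count zero entries before and after each kind of scaling.
zeros_before = int((X == 0).sum())
zeros_kept = int((scaled_no_centering == 0).sum())
zeros_after_centering = int((scaled_centered == 0).sum())
print(zeros_before, zeros_kept, zeros_after_centering)  # 6 6 0
```

Dividing by the standard deviation leaves every zero in place, which is why the sparse matrix stays sparse; centering fills them all in.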

In [23]:
#scaling X_train_vect
X_train_scaled = scaler.fit_transform(X_train_vect)

In [24]:
#scaling X_test_vect with the statistics learned from the training set
X_test_scaled = scaler.transform(X_test_vect)

In [25]:
print(X_train_scaled.shape)
print(y_train.shape)
print(f"Number of samples in y_train: {len(y_train)}")

(153513, 1095)
(153513,)
Number of samples in y_train: 153513


# 4. Classification

## 4.1 Further splitting data into a train and validation set

To tune each model, I used GridSearchCV to find the best hyperparameters, together with Stratified k-fold Cross Validation, because the dataset is rather imbalanced (about 68% of the reviews are positive).


In [26]:
# GridSearchCV is a class provided by scikit-learn for hyperparameter tuning. It systematically works through every combination of the parameter values you give it, cross-validating each one to determine which combination performs best.
# Hyperparameters: These are the parameters that are set before the learning process begins (e.g., learning rate, number of trees in a random forest).
# Example: Imagine you're using a Support Vector Machine (SVM) and want to find the best values for the parameters C and kernel. GridSearchCV will try different combinations of these parameters (e.g., C=0.1, kernel='linear', C=1, kernel='rbf', etc.) and find the combination that results in the best model performance.
# Stratified K-Fold Cross Validation is an improved version of K-Fold Cross Validation. It ensures that each fold (subset of data) has the same proportion of class labels as the original dataset. This is particularly important for imbalanced datasets, where one class might be underrepresented.

# In a classification problem, your target variable (the label you want to predict) can have different categories or classes.
# Class 'A' could represent non-spam (ham) emails.
# Class 'B' could represent spam emails.
# Imagine you have a dataset with 1000 samples:
# 900 of these samples are non-spam emails (Class 'A').
# 100 of these samples are spam emails (Class 'B').
# This means your dataset is imbalanced because the majority of the samples belong to Class 'A' (non-spam), and only a small portion belong to Class 'B' (spam).

# Importance of Stratified K-Fold Cross Validation
# When you split your data into training and validation sets (or folds in cross-validation), you want to ensure that each set has a similar distribution of classes as the original dataset. This helps in evaluating the model's performance more accurately.

# Without Stratification
# If you split the data randomly without stratification (say, into 10 folds of 100 samples each), some folds may not reflect the true class distribution.
# One fold might have 95 non-spam emails (Class 'A') and only 5 spam emails (Class 'B').
# Another fold might have 85 non-spam emails (Class 'A') and 15 spam emails (Class 'B').
# This inconsistency can produce misleading performance metrics, because some folds contain too few minority-class samples (Class 'B') to show how well the model performs on that class.

# With Stratification
# Stratified K-Fold Cross Validation ensures that each fold has a similar class distribution to the original dataset.
# For example, with the same 10 folds:
# Each fold will have around 90 non-spam emails (Class 'A') and 10 spam emails (Class 'B').
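The spam example can be checked directly. The sketch below builds the 900/100 label array described above and counts the spam samples that land in each validation fold, once with plain `KFold` and once with `StratifiedKFold`. The 10-fold setting and the dummy feature matrix are assumptions made for this illustration; the notebook itself uses 5 folds.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 900 non-spam (0) and 100 spam (1) labels, as in the example above.
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((len(y), 1))  # dummy features; only the labels matter for splitting


def spam_per_fold(splitter):
    """Count spam samples in each validation fold produced by `splitter`."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


plain = spam_per_fold(KFold(n_splits=10, shuffle=True, random_state=42))
stratified = spam_per_fold(StratifiedKFold(n_splits=10, shuffle=True, random_state=42))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With stratification every fold should hold 10 spam samples, while the plain split typically scatters them unevenly.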



In [27]:
# Check lengths using shape attribute for sparse matrix
print(f"Number of samples in X_train_scaled: {X_train_scaled.shape[0]}")
print(f"Number of samples in y_train: {len(y_train)}")

# If lengths are inconsistent, find the issue in the preprocessing steps


Number of samples in X_train_scaled: 153513
Number of samples in y_train: 153513


In [28]:
#splitting the train dataset into a train and validation dataset
X_train, X_val, y_train, y_val = train_test_split(X_train_scaled, y_train, 
                                                  test_size = 0.3, random_state = 42)

In [29]:
#initializing the Stratified K-fold CV
skf = StratifiedKFold(n_splits = 5, random_state = 42, shuffle = True)
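The `Classification` helper imported from `Classification_py` is not shown in this notebook, so how it combines the parameter grid with `skf` is not visible here. As a rough sketch of the general pattern only, the block below wires `GridSearchCV` to a `StratifiedKFold` splitter on synthetic data. The data, the names `X_demo`, `y_demo`, `skf_demo`, and the choice to search over `C` alone are all assumptions made for illustration, not the helper's actual internals.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced 3-class data standing in for the scaled TF-IDF features.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=6,
                                     n_classes=3, weights=[0.2, 0.15, 0.65],
                                     random_state=42)

# Same splitter settings as the notebook's skf.
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"C": [0.01, 0.05, 0.5, 5]}

# Each value of C is fitted and scored on all 5 stratified folds.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=skf_demo, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the winning setting, which is a different quantity from the accuracy on a held-out validation set.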

## 4.2 Logistic Regression

In [30]:
#establishing parameters for GridSearch
parameters = {'penalty':['l1','l2'],
              'C':[0.01,0.05,0.5,5]}

In [31]:
#fitting the model
log_reg = Classification('Logistic Regression', X_train, X_val, y_train, y_val)

In [32]:
%%time

#getting scores
log_reg.get_scores(parameters, skf)

| Model Name          | Train Accuracy | Validation Accuracy | Accuracy Difference |
|---------------------|----------------|---------------------|---------------------|
| Logistic Regression | 0.801329       | 0.788683            | 0.012646            |


The best hyperparameters are:  {'C': 0.01, 'penalty': 'l2'} 

CPU times: total: 7.69 s
Wall time: 1min 10s


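|            | 0        | 1        | 2        | accuracy | macro avg |
|------------|----------|----------|----------|----------|-----------|
| precision  | 0.694542 | 0.51226  | 0.838515 | 0.788683 | 0.681772  |
| recall     | 0.678042 | 0.268962 | 0.934322 | 0.788683 | 0.627108  |
| f1-score   | 0.686192 | 0.352725 | 0.88383  | 0.788683 | 0.640916  |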
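To read this report: `LabelEncoder` assigns codes in sorted label order, so class 0 is negative, 1 is neutral, and 2 is positive. The `macro avg` column is the unweighted mean of the three per-class values, which is why it sits well below the accuracy: the weak recall on the neutral class (about 0.27) counts as much as the strong recall on the much larger positive class. The short check below recomputes the column from the numbers in the table.

```python
# Per-class validation metrics copied from the Logistic Regression report above.
# Class order: 0 = negative, 1 = neutral, 2 = positive.
report = {
    "precision": [0.694542, 0.512260, 0.838515],
    "recall":    [0.678042, 0.268962, 0.934322],
    "f1-score":  [0.686192, 0.352725, 0.883830],
}

# Macro average: plain mean over classes, ignoring how many samples each class has.
macro_avg = {metric: sum(values) / len(values) for metric, values in report.items()}

for metric, value in macro_avg.items():
    print(f"{metric:<10}{value:.6f}")
```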


## 4.3 Multinomial Naive Bayes

In [33]:
#establishing parameters for GridSearch
parameters = {'alpha': [0.001, 0.01, 0.5, 1.0]}

In [34]:
#fitting the model
mnb = Classification('Multinomial Naive Bayes', X_train, X_val, y_train, y_val)

In [35]:
%%time

#getting scores
mnb.get_scores(parameters, skf)

| Model Name              | Train Accuracy | Validation Accuracy | Accuracy Difference |
|-------------------------|----------------|---------------------|---------------------|
| Multinomial Naive Bayes | 0.696545       | 0.687671            | 0.008874            |


The best hyperparameters are:  {'alpha': 0.001} 

CPU times: total: 109 ms
Wall time: 1.85 s


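|            | 0        | 1        | 2        | accuracy | macro avg |
|------------|----------|----------|----------|----------|-----------|
| precision  | 0.526862 | 0.346822 | 0.910156 | 0.687671 | 0.594613  |
| recall     | 0.673185 | 0.561293 | 0.720061 | 0.687671 | 0.651513  |
| f1-score   | 0.591103 | 0.428732 | 0.804025 | 0.687671 | 0.607953  |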


## 4.4 Random Forest

In [36]:
#establishing parameters for GridSearch
parameters = {'min_samples_leaf':[1,3,15,50],
          'max_depth':[5,10,15,20]}

In [37]:
#fitting the model
rf = Classification('Random Forest', X_train, X_val, y_train, y_val)

In [39]:
# %%time

# #getting scores
# rf.get_scores(parameters, skf)

Out[1]:
| Model Name    | Train Accuracy | Validation Accuracy | Accuracy Difference |
|---------------|----------------|---------------------|---------------------|
| Random Forest | 0.760727       | 0.717962            | 0.042766            |

The best hyperparameters are:  {'max_depth': 20, 'min_samples_leaf': 1}

CPU times: user 3min 36s, sys: 500 ms, total: 3min 36s
Wall time: 4min 23s

|            | 0       | 1       | 2       | accuracy | macro avg |
|------------|---------|---------|---------|----------|-----------|
| precision  | 0.789629| 0.523810| 0.714804| 0.717962 | 0.676081  |
| recall     | 0.245833| 0.012315| 0.994087| 0.717962 | 0.417412  |
| f1-score   | 0.374937| 0.024063| 0.831624| 0.717962 | 0.410208  |

## 4.5 Decision Tree

In [40]:
#establishing parameters for GridSearch
parameters = {'min_samples_leaf':[3,15,50,100],
              'max_depth':[3,5,7,10]}

In [41]:
#fitting the model
tree = Classification('Decision Tree', X_train, X_val, y_train, y_val)

In [None]:
# %%time
# 
# #getting scores
# tree.get_scores(parameters, skf)

Out[2]:
| Model Name    | Train Accuracy | Validation Accuracy | Accuracy Difference |
|---------------|----------------|---------------------|---------------------|
| Decision Tree | 0.711174       | 0.708625            | 0.002549            |

The best hyperparameters are:  {'max_depth': 10, 'min_samples_leaf': 100}

CPU times: user 3min 9s, sys: 122 ms, total: 3min 9s
Wall time: 5min 27s

|            | 0       | 1       | 2       | accuracy | macro avg |
|------------|---------|---------|---------|----------|-----------|
| precision  | 0.694600| 0.400000| 0.720327| 0.708625 | 0.604976  |
| recall     | 0.195826| 0.084523| 0.976030| 0.708625 | 0.418793  |
| f1-score   | 0.305519| 0.139556| 0.828907| 0.708625 | 0.424661  |

## 4.6 K Neighbors 

In [42]:
#establishing parameters for GridSearch
parameters = {'n_neighbors':[5,10,50,150,300]}

In [43]:
#fitting the model
knn = Classification('KNN', X_train, X_val, y_train, y_val)

In [45]:
# %%time

# #getting scores
# knn.get_scores(parameters, skf)

Out[3]:
| Model Name | Train Accuracy | Validation Accuracy | Accuracy Difference |
|------------|----------------|---------------------|---------------------|
| KNN        | 0.680715       | 0.680636            | 0.00008             |

The best hyperparameters are:  {'n_neighbors': 150}

CPU times: user 44min 6s, sys: 11min 51s, total: 55min 57s
Wall time: 2h 59min 58s

|            | 0       | 1   | 2       | accuracy | macro avg |
|------------|---------|-----|---------|----------|-----------|
| precision  | 0.755319| 0.0 | 0.680483| 0.680636 | 0.478601  |
| recall     | 0.009319| 0.0 | 0.999553| 0.680636 | 0.336290  |
| f1-score   | 0.018410| 0.0 | 0.809719| 0.680636 | 0.276043  |


## 4.7 AdaBoost

In [47]:
#establishing parameters for GridSearch
parameters = {'learning_rate':[0.1,1,10]}

In [48]:
#fitting the model
ada = Classification('AdaBoost', X_train, X_val, y_train, y_val)

In [49]:
# %%time

# #getting scores
# ada.get_scores(parameters, skf)

Out[4]:
| Model Name | Train Accuracy | Validation Accuracy | Accuracy Difference |
|------------|----------------|---------------------|---------------------|
| AdaBoost   | 0.690552       | 0.690515            | 0.000036            |

The best hyperparameters are:  {'learning_rate': 1}

CPU times: user 1min 56s, sys: 2.36 s, total: 1min 59s
Wall time: 5min 5s

|            | 0       | 1       | 2       | accuracy | macro avg |
|------------|---------|---------|---------|----------|-----------|
| precision  | 0.868885| 0.454880| 0.691333| 0.690515 | 0.671699  |
| recall     | 0.058275| 0.034565| 0.994279| 0.690515 | 0.362373  |
| f1-score   | 0.109225| 0.064248| 0.815583| 0.690515 | 0.329685  |

## 4.8 XGBoost

In [50]:
#establishing parameters for GridSearch
parameters = {'eta':[0.001,0.005,0.1,0.5],
              'min_child_weight':[1,5,10]}

In [51]:
#fitting the model
xgb = Classification('XGBoost', X_train, X_val, y_train, y_val)

In [52]:
# %%time

# #getting scores
# xgb.get_scores(parameters, skf)

Out[5]:
| Model Name | Train Accuracy | Validation Accuracy | Accuracy Difference |
|------------|----------------|---------------------|---------------------|
| XGBoost    | 0.825757       | 0.775546            | 0.050211            |

The best hyperparameters are:  {'eta': 0.001, 'min_child_weight': 10}

CPU times: user 33min 30s, sys: 1.38 s, total: 33min 32s
Wall time: 3h 30min 42s

|            | 0       | 1       | 2       | accuracy | macro avg |
|------------|---------|---------|---------|----------|-----------|
| precision  | 0.717170| 0.508607| 0.807231| 0.775546 | 0.677669  |
| recall     | 0.564116| 0.219144| 0.954105| 0.775546 | 0.579122  |
| f1-score   | 0.631502| 0.306308| 0.874544| 0.775546 | 0.604118  |

# 5. Evaluation

## 5.1 Comparing scores from all models

In [53]:
models = pd.concat([log_reg.scores_table,
                    mnb.scores_table,
                    rf.scores_table,
                    tree.scores_table,
                    knn.scores_table,
                    ada.scores_table,
                    xgb.scores_table],
                    axis=0)

In [54]:
models

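| Model Name              | Train Accuracy | Validation Accuracy | Accuracy Difference |
|-------------------------|----------------|---------------------|---------------------|
| Logistic Regression     | 0.801329       | 0.788683            | 0.012646            |
| Multinomial Naive Bayes | 0.696545       | 0.687671            | 0.008874            |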
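The table above lists only the two models whose scoring cells were run in this session. For a fuller picture, the sketch below gathers the validation accuracies reported in sections 4.2 to 4.8 and ranks them against a majority-class baseline. The baseline is the share of positive reviews in the full dataset (155613 of 229124), used here as an approximation because the class mix of the validation split itself is not printed in this notebook.

```python
# Validation accuracies copied from the outputs in sections 4.2 to 4.8.
validation_accuracy = {
    "Logistic Regression": 0.788683,
    "Multinomial Naive Bayes": 0.687671,
    "Random Forest": 0.717962,
    "Decision Tree": 0.708625,
    "KNN": 0.680636,
    "AdaBoost": 0.690515,
    "XGBoost": 0.775546,
}

# Share of the majority class ("positive") in the full labelled dataset.
majority_baseline = 155613 / 229124

# Rank models from best to worst and show the margin over the baseline.
ranking = sorted(validation_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranking:
    print(f"{name:<24}{accuracy:.4f}  {accuracy - majority_baseline:+.4f}")
```

Several models sit only a point or so above the baseline, which matches their reports: near-perfect recall on the positive class and very low recall on the other two.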


In [55]:
#saving models results as a csv
models.to_csv('./models_results.csv',index=False)

In [56]:
#saving all models to a pickle
for model in [log_reg, mnb, rf, tree, knn, ada, xgb]:
    with open(f'./{model.model_type}.pkl', 'wb') as f:
        pickle.dump(model, f)
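For completeness, a minimal round trip showing how a saved object can be read back. `model_stub` is a hypothetical stand-in for a fitted `Classification` object, and a temporary directory is used so nothing is written next to the notebook. Loading a real pickled `Classification` instance would also need `Classification_py` to be importable in the session doing the loading.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted Classification object.
model_stub = {"model_type": "Logistic Regression",
              "best_params": {"C": 0.01, "penalty": "l2"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / f"{model_stub['model_type']}.pkl"

    # The context manager closes the file even if dump() raises.
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stub)  # True
```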

## 5.2 Fitting the best model with test data

## 5.3 Further model metrics and tuning

In [64]:
print(type(xgb))

<class 'Classification_py.Classification'>


In [65]:
print(dir(xgb))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slotnames__', '__str__', '__subclasshook__', '__weakref__', 'conf_matrix', 'feature_importances', 'get_feature_importances', 'get_scores', 'get_test_scores', 'model_type', 'name', 'scores', 'scores_table', 'technique', 'test_conf_matrix', 'x_train', 'x_val', 'y_train', 'y_val']


In [58]:
log_reg.classification_report

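|            | 0        | 1        | 2        | accuracy | macro avg |
|------------|----------|----------|----------|----------|-----------|
| precision  | 0.694542 | 0.51226  | 0.838515 | 0.788683 | 0.681772  |
| recall     | 0.678042 | 0.268962 | 0.934322 | 0.788683 | 0.627108  |
| f1-score   | 0.686192 | 0.352725 | 0.88383  | 0.788683 | 0.640916  |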
