# MMA 865, Individual Assignment 1

Last Updated December 11, 2023.

- [Jose, Chua]
- [Student number]
- [Date]

# Part 1: Sentiment Analysis via the ML-based approach

Download the “Product Sentiment” dataset from the course portal: sentiment_train.csv and sentiment_test.csv.

### Part 1.a. Loading and Prep

Load, clean, and preprocess the data as you find necessary.

In [40]:
!pip install textblob
!pip install xgboost

Collecting xgboost
  Downloading xgboost-2.1.3-py3-none-win_amd64.whl (124.9 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.1.3


In [69]:
# Libraries

# Standard library
import re
import string

# Third-party: data handling, plotting, text processing
import matplotlib.pyplot as plt
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from textblob import TextBlob
from unidecode import unidecode

# Third-party: modeling and evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC
from xgboost import XGBClassifier

In [10]:
#importing test and train data

df_train = pd.read_csv("sentiment_train.csv")

print(df_train.info())
print(df_train.head())

df_test = pd.read_csv("sentiment_test.csv")

# Inspect the test set (not the training set a second time)
print(df_test.info())
print(df_test.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Sentence  2400 non-null   object
 1   Polarity  2400 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 37.6+ KB
None
                                            Sentence  Polarity
0                           Wow... Loved this place.         1
1                                 Crust is not good.         0
2          Not tasty and the texture was just nasty.         0
3  Stopped by during the late May bank holiday of...         1
4  The selection on the menu was great and so wer...         1

**Cleaning**
1. Removing duplicates
2. Case normalization
3. Removing punctuation
4. Replacing special characters
5. Removing extra whitespace
6. Removing numbers
7. Spell checking
8. Tokenization
9. Stop-word removal

In [11]:
#checking for duplicates (train)

duplicates = df_train[df_train['Sentence'].duplicated()]

print("Duplicate Rows:")
print(duplicates)

Duplicate Rows:
                                               Sentence  Polarity
71                                               #NAME?         1
219                                              #NAME?         1
814                                  I love this place.         1
816                              The food was terrible.         0
843                                    I won't be back.         0
846                   I would not recommend this place.         0
904                                              #NAME?         0
1285                                      Great phone!.         1
1407                                       Works great.         1
1524                                      Works great!.         1
1543                            Don't buy this product.         0
1744  If you like a loud buzzing to override all you...         0
1748                                      Does not fit.         0
1778                              This is a great deal.     

In [12]:
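# Drop the rows whose text is the spreadsheet error placeholder '#NAME?'
# (the other repeated sentences listed above are kept)

df_train = df_train[df_train['Sentence'] != '#NAME?']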
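The cell above removes only the `#NAME?` rows, so repeated sentences such as "Works great." stay in the training data. If the intent of step 1 is to drop those as well, a minimal sketch with `drop_duplicates` could look like the following. The toy frame is hypothetical and only mirrors the `Sentence` and `Polarity` columns; applying this to `df_train` would change the row count reported later in the notebook.

```python
import pandas as pd

# Hypothetical toy frame mirroring the structure of df_train
toy = pd.DataFrame({
    "Sentence": ["Works great.", "#NAME?", "Works great.", "Does not fit.", "#NAME?"],
    "Polarity": [1, 1, 1, 0, 0],
})

# Step 1: drop the spreadsheet error placeholder rows
cleaned = toy[toy["Sentence"] != "#NAME?"]

# Step 2: keep the first occurrence of every repeated sentence
deduped = cleaned.drop_duplicates(subset="Sentence", keep="first").reset_index(drop=True)

print(deduped)
```

Whether to deduplicate is a judgment call: short reviews can legitimately repeat, and removing them slightly changes the class balance.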

In [13]:
#checking for duplicates (test)

duplicates = df_test[df_test['Sentence'].duplicated()]

print("Duplicate Rows:")
print(duplicates)

Duplicate Rows:
               Sentence  Polarity
185  Not recommended.           0


In [14]:
def preprocess_text(sentence):
  
    # Case normalization
    sentence = sentence.lower()

    # Remove punctuation (re.escape keeps characters such as ']' and '\' literal inside the class)
    sentence = re.sub(f"[{re.escape(string.punctuation)}]", "", sentence)

    # Replace special characters
    sentence = unidecode(sentence)

    # Remove extra whitespace
    sentence = re.sub(r"\s+", " ", sentence).strip()

    # Remove numbers
    sentence = re.sub(r"\d+", "", sentence)

    # Spell checking
    sentence = str(TextBlob(sentence).correct())

    # Tokenization
    tokens = word_tokenize(sentence)

    # Stopword removal
    stop_words = set(stopwords.words('english'))
    critical_stopwords = {
        "not", "no", "nor", "never",
        "but", "however", "yet", "though", "although",
        "very", "too", "much", "more", "most", "few", "less", "least",
        "all", "some", "any", "only", "every", "each", "none",
        "if", "then", "when", "while"
    }
    stop_words -= critical_stopwords
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Return the token list; the tokens are joined back into a string before vectorization
    return tokens
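To see what the punctuation step does to individual sentences, here is a small standalone sketch. The helper name `strip_punctuation` is an assumption made for illustration and is not part of the notebook; it isolates the one regular expression so its effect can be checked without NLTK or TextBlob.

```python
import re
import string

# re.escape makes every punctuation character literal inside the [...] class
PUNCT_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")

def strip_punctuation(text):
    """Delete ASCII punctuation characters from text."""
    return PUNCT_PATTERN.sub("", text)

examples = ["Wow... Loved this place.", "I won't be back.", "Great phone!."]
for sentence in examples:
    print(repr(sentence), "->", repr(strip_punctuation(sentence.lower())))
```

Note that the apostrophe is removed too, so "won't" becomes "wont". After that, the token may no longer match contraction entries in a stop-word list, and the spell checker may rewrite it.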

In [15]:
# Apply to df_train
df_train['Sentence'] = df_train['Sentence'].apply(preprocess_text)

# Apply to df_test
df_test['Sentence'] = df_test['Sentence'].apply(preprocess_text)

# Print processed DataFrames
print(df_train)
print(df_test)

                                               Sentence  Polarity
0                                        [loved, place]         1
1                                    [crust, not, good]         0
2                          [not, taste, texture, nasty]         0
3     [stopped, late, may, bank, holiday, rich, stev...         1
4                      [selection, menu, great, prices]         1
...                                                 ...       ...
2395  [almost, all, songs, cover, girl, oldfashioned...         0
2396  [most, annoying, thing, cover, girl, way, rite...         0
2397  [unfortunately, cover, girl, example, hollywoo...         0
2398  [nonlinear, narration, thus, many, flashbacks,...         1
2399  [good, cinematography, also, makes, monica, be...         1

[2396 rows x 2 columns]
                                              Sentence  Polarity
0    [good, commentary, today, love, undoubtedly, f...         1
1    [people, first, times, film, making, think, ex..

### Part 1.b. Modeling

Use your favorite ML algorithm to train a classification model.  Don’t forget everything that we’ve learned in our ML course: hyperparameter tuning, cross validation, handling imbalanced data, etc. Make reasonable decisions and try to create the best-performing classifier that you can.

In [31]:
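# Function to join the tokens back into strings and make a stratified train/validation split
def preprocess_and_split(df, text_column, target_column, test_size=0.2):
    # Convert tokenized lists into space-separated strings without modifying df,
    # so the function can safely be called more than once
    X = df[text_column].apply(' '.join)
    y = df[target_column]

    # Split data, keeping the class balance the same in both parts
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    return X_train, X_test, y_train, y_test

# Create the hold-out split used by the model cells below
X_train, X_test, y_train, y_test = preprocess_and_split(df_train, 'Sentence', 'Polarity')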

***SVM***

In [32]:
# Function to train and evaluate SVM

def train_svm(X_train, X_test, y_train, y_test):
    # TF-IDF Vectorization
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
    X_train_tfidf = tfidf.fit_transform(X_train)
    X_test_tfidf = tfidf.transform(X_test)

    # Train SVM model
    svm_model = SVC(kernel='linear', C=1, random_state=42)
    svm_model.fit(X_train_tfidf, y_train)

    # Evaluate the model
    y_pred = svm_model.predict(X_test_tfidf)
    print("SVM Accuracy:", accuracy_score(y_test, y_pred))
    print("\nSVM Classification Report:\n", classification_report(y_test, y_pred))

    return svm_model, tfidf
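One interaction worth checking: `preprocess_text` deliberately keeps negations such as "not" and "never", but `TfidfVectorizer(stop_words='english')` applies scikit-learn's own English stop-word list afterwards. The sketch below, on two hypothetical toy documents, shows how to test whether the kept words survive vectorization.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# A few of the words that preprocess_text keeps on purpose
kept_on_purpose = {"not", "no", "nor", "never"}
print("Also in scikit-learn's list:", sorted(kept_on_purpose & ENGLISH_STOP_WORDS))

# Two toy documents that differ only by the negation
docs = ["crust not good", "crust good"]

with_list = TfidfVectorizer(stop_words="english").fit(docs)
without_list = TfidfVectorizer().fit(docs)

print("stop_words='english':", sorted(with_list.vocabulary_))
print("stop_words=None:     ", sorted(without_list.vocabulary_))
```

If the negations are meant to reach the model, dropping the `stop_words` argument (the text has already been filtered once) is one option. Doing so would change the scores reported in this notebook, so the models would need to be re-run.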


In [59]:
train_svm(X_train, X_test, y_train, y_test)

SVM Accuracy: 0.8020833333333334

SVM Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.84      0.81       243
           1       0.82      0.76      0.79       237

    accuracy                           0.80       480
   macro avg       0.80      0.80      0.80       480
weighted avg       0.80      0.80      0.80       480



(SVC(C=1, kernel='linear', random_state=42),
 TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english'))

In [61]:
def tune_svm(X_train, y_train):
    # TF-IDF Vectorization
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
    X_train_tfidf = tfidf.fit_transform(X_train)

    # Define the parameter grid
    param_grid = {
        'C': [0.1, 1, 10, 100],
        'kernel': ['linear', 'rbf', 'poly'],
        'gamma': ['scale', 'auto']
    }

    # Perform GridSearchCV
    grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', verbose=1)
    grid.fit(X_train_tfidf, y_train)

    print("Best Parameters:", grid.best_params_)
    print("Best Score:", grid.best_score_)

    return grid.best_estimator_, tfidf
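In `tune_svm` the vectorizer is fitted on all of `X_train` before cross-validation, so each validation fold has already influenced the vocabulary and IDF weights. A common alternative is to wrap both steps in a `Pipeline`, which refits TF-IDF inside every fold. The following is a minimal sketch on a hypothetical toy corpus with a reduced grid; the step names `tfidf` and `svm` are arbitrary labels chosen here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus; in the notebook this would be X_train / y_train
texts = ["great phone", "works great", "love this place", "very good food",
         "excellent value", "really nice sound",
         "not good", "does not fit", "terrible food", "would not recommend",
         "very bad battery", "poor quality"]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier are tuned and cross-validated as one estimator
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", SVC()),
])

# Parameters are addressed as <step name>__<parameter>
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
```

A side benefit is that `search.best_estimator_` accepts raw strings directly, so the fitted vectorizer does not have to be passed around separately.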

In [62]:
best_svm, tfidf_svm = tune_svm(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best Parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
Best Score: 0.8047840513489991


***Logistic Regression***

In [37]:
# Function to train and evaluate Logistic Regression
def train_logistic_regression(X_train, X_test, y_train, y_test):
    # TF-IDF Vectorization
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
    X_train_tfidf = tfidf.fit_transform(X_train)
    X_test_tfidf = tfidf.transform(X_test)

    # Train Logistic Regression model
    logreg_model = LogisticRegression(max_iter=1000, random_state=42)
    logreg_model.fit(X_train_tfidf, y_train)

    # Evaluate the model
    y_pred = logreg_model.predict(X_test_tfidf)
    print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
    print("\nLogistic Regression Classification Report:\n", classification_report(y_test, y_pred))

    return logreg_model, tfidf

In [38]:
train_logistic_regression(X_train, X_test, y_train, y_test)

Logistic Regression Accuracy: 0.7979166666666667

Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.85      0.81       243
           1       0.83      0.74      0.78       237

    accuracy                           0.80       480
   macro avg       0.80      0.80      0.80       480
weighted avg       0.80      0.80      0.80       480



(LogisticRegression(max_iter=1000, random_state=42),
 TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english'))

In [65]:
def tune_logistic_regression(X_train, y_train):
    # TF-IDF Vectorization
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
    X_train_tfidf = tfidf.fit_transform(X_train)

    # Define the parameter grid, one dict per solver.
    # Note: l1_ratio only takes effect with penalty='elasticnet'; for the saga
    # entries with 'l1' or 'l2' it is ignored, so those candidates are repeated.
    param_grid = [
        {'penalty': ['l2'], 'C': [0.01, 0.1, 1, 10, 100], 'solver': ['lbfgs']},
        {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100], 'solver': ['liblinear']},
        {'penalty': ['l1', 'l2', 'elasticnet'], 'C': [0.01, 0.1, 1, 10], 'solver': ['saga'], 'l1_ratio': [0.1, 0.5, 0.9]},
    ]

    # Perform GridSearchCV
    grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy', verbose=1)
    grid.fit(X_train_tfidf, y_train)

    print("Best Parameters:", grid.best_params_)
    print("Best Score:", grid.best_score_)

    return grid.best_estimator_, tfidf
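The size of the search follows directly from the grid, and `ParameterGrid` can count the candidates without fitting anything. The sketch below compares the grid used above with a hypothetical revised grid (an assumption, not the notebook's method) that only varies `l1_ratio` where it has an effect, namely with `penalty='elasticnet'`.

```python
from sklearn.model_selection import ParameterGrid

C_small = [0.01, 0.1, 1, 10]
C_full = C_small + [100]

# The grid as written in tune_logistic_regression
original_grid = [
    {'penalty': ['l2'], 'C': C_full, 'solver': ['lbfgs']},
    {'penalty': ['l1', 'l2'], 'C': C_full, 'solver': ['liblinear']},
    {'penalty': ['l1', 'l2', 'elasticnet'], 'C': C_small, 'solver': ['saga'],
     'l1_ratio': [0.1, 0.5, 0.9]},
]

# Hypothetical revision: l1_ratio is varied only for the elastic-net penalty
revised_grid = [
    {'penalty': ['l2'], 'C': C_full, 'solver': ['lbfgs']},
    {'penalty': ['l1', 'l2'], 'C': C_full, 'solver': ['liblinear']},
    {'penalty': ['l1', 'l2'], 'C': C_small, 'solver': ['saga']},
    {'penalty': ['elasticnet'], 'C': C_small, 'solver': ['saga'],
     'l1_ratio': [0.1, 0.5, 0.9]},
]

print("Original candidates:", len(ParameterGrid(original_grid)))  # 5 + 10 + 36 = 51
print("Revised candidates: ", len(ParameterGrid(revised_grid)))   # 5 + 10 + 8 + 12 = 35
```

The 16 candidates saved are saga fits with `l1` or `l2` that differ only in an unused `l1_ratio`, so the revised grid explores the same models with fewer fits.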

In [66]:
best_logreg, tfidf_logreg = tune_logistic_regression(X_train, y_train)

Fitting 5 folds for each of 51 candidates, totalling 255 fits
Best Parameters: {'C': 10, 'l1_ratio': 0.9, 'penalty': 'l2', 'solver': 'saga'}
Best Score: 0.8026925587467364


***XGBoost***

In [55]:
def train_xgboost(X_train, X_test, y_train, y_test):
    # TF-IDF Vectorization
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
    X_train_tfidf = tfidf.fit_transform(X_train)
    X_test_tfidf = tfidf.transform(X_test)

    # Train XGBoost model
    xgb_model = XGBClassifier(eval_metric='logloss', verbosity=0, random_state=42)
    xgb_model.fit(X_train_tfidf, y_train)

    # Evaluate the model
    y_pred = xgb_model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    # Print results
    print("XGBoost Accuracy:", accuracy)
    print("\nXGBoost Classification Report:\n", report)

    return xgb_model, tfidf, accuracy, report

In [56]:
xgb_model, xgb_tfidf, xgb_accuracy, xgb_report = train_xgboost(X_train, X_test, y_train, y_test)

XGBoost Accuracy: 0.7520833333333333

XGBoost Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.86      0.78       243
           1       0.81      0.65      0.72       237

    accuracy                           0.75       480
   macro avg       0.76      0.75      0.75       480
weighted avg       0.76      0.75      0.75       480



In [67]:
def tune_xgboost(X_train, y_train):
    # TF-IDF Vectorization
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
    X_train_tfidf = tfidf.fit_transform(X_train)

    # Define the parameter grid
    param_grid = {
        'n_estimators': [50, 100, 200, 300],
        'learning_rate': [0.01, 0.1, 0.2, 0.3],
        'max_depth': [3, 5, 7, 10],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0]
    }

    # Perform RandomizedSearchCV
    random_search = RandomizedSearchCV(XGBClassifier(eval_metric='logloss', random_state=42),
                                       param_distributions=param_grid, n_iter=50, cv=5, scoring='accuracy', verbose=1)
    random_search.fit(X_train_tfidf, y_train)

    print("Best Parameters:", random_search.best_params_)
    print("Best Score:", random_search.best_score_)

    return random_search.best_estimator_, tfidf
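`RandomizedSearchCV` draws `n_iter` combinations from the grid rather than trying all of them, and without a `random_state` on the search itself the draw differs between runs. The sketch below uses `ParameterSampler`, the sampling utility in scikit-learn, to show how much of the grid 50 draws cover and that a fixed seed makes the draw repeatable. It does not fit any model.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in tune_xgboost
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'max_depth': [3, 5, 7, 10],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

total = len(ParameterGrid(param_grid))  # 4 * 4 * 4 * 3 * 3 = 576

# Two draws with the same seed select the same 50 combinations
draw_a = list(ParameterSampler(param_grid, n_iter=50, random_state=42))
draw_b = list(ParameterSampler(param_grid, n_iter=50, random_state=42))

print(f"{len(draw_a)} of {total} combinations sampled ({len(draw_a) / total:.1%})")
print("Same draw with the same seed:", draw_a == draw_b)
```

Passing `random_state=42` to `RandomizedSearchCV` as well (it is currently set only on the classifier) would make the reported best parameters reproducible, though the numbers may differ from the run shown here.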

In [70]:
best_xgb, tfidf_xgb = tune_xgboost(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.3, 'colsample_bytree': 1.0}
Best Score: 0.7562581592689295


***Random Forest***

In [50]:
def train_random_forest(X_train, X_test, y_train, y_test):
    # TF-IDF Vectorization
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
    X_train_tfidf = tfidf.fit_transform(X_train)
    X_test_tfidf = tfidf.transform(X_test)

    # Train Random Forest model
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(X_train_tfidf, y_train)

    # Evaluate the model
    y_pred = rf_model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    # Print results
    print("Random Forest Accuracy:", accuracy)
    print("\nRandom Forest Classification Report:\n", report)

    return rf_model, tfidf, accuracy, report

In [57]:
rf_model, rf_tfidf, rf_accuracy, rf_report = train_random_forest(X_train, X_test, y_train, y_test)

Random Forest Accuracy: 0.76875

Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.86      0.79       243
           1       0.83      0.67      0.74       237

    accuracy                           0.77       480
   macro avg       0.78      0.77      0.77       480
weighted avg       0.78      0.77      0.77       480



In [71]:
def tune_random_forest(X_train, y_train):
    # TF-IDF Vectorization
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
    X_train_tfidf = tfidf.fit_transform(X_train)

    # Define the parameter grid
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'bootstrap': [True, False]
    }

    # Perform GridSearchCV
    grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', verbose=1)
    grid.fit(X_train_tfidf, y_train)

    print("Best Parameters:", grid.best_params_)
    print("Best Score:", grid.best_score_)

    return grid.best_estimator_, tfidf

In [72]:
best_rf, tfidf_rf = tune_random_forest(X_train, y_train)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best Parameters: {'bootstrap': False, 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}
Best Score: 0.7912301457789381


### Part 1.c. Assessing

Use the testing data to measure the accuracy and F1-score of your model.  

In [17]:
# TODO
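One possible way to fill in this step is sketched below. The helper name `evaluate_on_test` and the toy sentences are assumptions for illustration, not results from this notebook. The key point is that the vectorizer fitted on the training data is only used to transform the test text and is never refitted on it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate_on_test(model, vectorizer, test_texts, test_labels):
    """Return (accuracy, F1) of an already fitted model on held-out text."""
    X_test_tfidf = vectorizer.transform(test_texts)  # transform only, no fitting
    y_pred = model.predict(X_test_tfidf)
    return accuracy_score(test_labels, y_pred), f1_score(test_labels, y_pred)

# Hypothetical toy data standing in for the cleaned train and test sentences
train_texts = ["loved place", "great phone", "works great", "good food",
               "not good", "terrible food", "not fit", "not recommend"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great food", "not great", "terrible phone"]
test_labels = [1, 0, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(train_texts), train_labels)

accuracy, f1 = evaluate_on_test(model, vectorizer, test_texts, test_labels)
print(f"Accuracy: {accuracy:.3f}  F1: {f1:.3f}")
```

In this notebook the call might look like `evaluate_on_test(best_svm, tfidf_svm, df_test['Sentence'].apply(' '.join), df_test['Polarity'])`, since `df_test['Sentence']` holds token lists after preprocessing.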

### Part 2. Given the accuracy and F1-score of your model, are you satisfied with the results, from a business point of view? Explain.

TODO: Insert your answer here.

### Part 3. Show five example instances in which your model’s predictions were incorrect. Describe why you think the model was wrong. Don’t just guess: dig deep to figure out the root cause.

TODO: Insert your answer here.

In [18]:
# TODO: Feel free to use code as well to answer this question. Or not. Up to you.
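A starting point for finding the incorrect predictions is to put the text, the true label and the predicted label side by side and filter the rows where they disagree. The sketch below uses hypothetical toy values and a hypothetical helper name, `misclassified`; the predictions would come from the chosen model in practice.

```python
import pandas as pd

def misclassified(texts, y_true, y_pred):
    """Return the rows where the predicted label differs from the actual one."""
    results = pd.DataFrame({
        "text": list(texts),
        "actual": list(y_true),
        "predicted": list(y_pred),
    })
    return results[results["actual"] != results["predicted"]]

# Hypothetical toy values for illustration only
toy_texts = ["loved place", "not bad at all", "crust not good", "could be better"]
toy_actual = [1, 1, 0, 0]
toy_predicted = [1, 0, 0, 1]

errors = misclassified(toy_texts, toy_actual, toy_predicted)
print(errors.head(5))
```

Keeping the original, uncleaned sentence alongside the cleaned tokens makes it easier to tell whether an error comes from preprocessing (for example a dropped negation) or from the model itself.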