# Sentiment analysis using Bag-of-Words representation
The goal is to develop a model that accurately classifies text into positive and negative sentiment categories. The workflow comprises text preprocessing, feature extraction with CountVectorizer, and model selection. Hyperparameter tuning was performed using grid search with cross-validation to optimise the classifier's performance. The final model was evaluated on both validation and test sets to estimate how well it generalises to unseen reviews.

## Task 1: Data Loading and Preparation

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import column_or_1d
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

### Load Datasets

The dataset consists of several thousand single-sentence reviews collected from three domains:
imdb.com, amazon.com and yelp.com.

In [2]:
# Load training data
x_train = pd.read_csv('Dataset/x_train.csv', names=['website_name', 'text'], header=None)
y_train = pd.read_csv('Dataset/y_train.csv', names=['is_positive_sentiment'], header=None)
# Load test data
x_test = pd.read_csv('Dataset/x_test.csv', names=['website_name', 'text'], header=None)
y_test = pd.read_csv('Dataset/y_test.csv', names=['is_positive_sentiment'], header=None)

### Pre-processing

In [3]:
# Clean text function
def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
    text = text.lower() # Convert upper case to lowercase
    text = re.sub(r'\s+', ' ', text) # Remove multiple consecutive spaces
    text = text.strip() # Remove leading and trailing spaces
    return text

In [4]:
# Clean data
x_train['text'] = x_train['text'].apply(clean_text)
x_test['text'] = x_test['text'].apply(clean_text)
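
As a quick check of what the cleaning step does, the sketch below repeats the same four steps as `clean_text` and applies them to a made-up review. The sample sentence is illustrative and is not taken from the dataset.

```python
import re

def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()                   # Convert to lowercase
    text = re.sub(r'\s+', ' ', text)      # Collapse runs of whitespace
    return text.strip()                   # Trim leading and trailing spaces

sample = "  The food was GREAT!!!   Loved it.  "  # hypothetical review
cleaned = clean_text(sample)
print(cleaned)  # the food was great loved it
```

Note that the pattern `[^\w\s]` removes every character that is not a word character or whitespace, so apostrophes inside contractions are dropped as well.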

### Splitting the training data into a training set and a validation set

In [5]:
# Splitting data
X_train, X_val, y_train, y_val = train_test_split(x_train['text'], y_train['is_positive_sentiment'], test_size=0.2, random_state=42)

In [6]:
# Convert the label Series to 1-D NumPy arrays
# (Series.ravel() is deprecated in recent pandas versions)
y_train = y_train.to_numpy()
y_val = y_val.to_numpy()

## Task 2: Feature Representation

In [7]:
# Define the classifiers
classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVC': SVC(),
    'Logistic Regression': LogisticRegression()
}

In [8]:
# Dictionary to store mean cross-validation scores for each classifier
cv_scores = {}

In [9]:
# Loop over classifiers
for clf_name, clf in classifiers.items():
    print(f"Testing {clf_name}:")
    
    # Define the pipeline for each classifier
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('clf', clf)
    ])
    
    # Define the parameters for grid search
    parameters = {
        'vect__lowercase': [True, False],
        'vect__stop_words': [None, 'english'],
        'vect__max_df': [0.5, 0.7, 1.0],
        'vect__min_df': [1, 5, 10],
        'vect__ngram_range': [(1, 1), (1, 2)],
        'vect__binary': [True, False],
    }
    
    # Perform grid search
    grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, error_score='raise')
    grid_search.fit(X_train, y_train)
    
    # Store mean cross-validation score
    cv_scores[clf_name] = np.mean(cross_val_score(grid_search.best_estimator_, X_train, y_train, cv=5))
    
    # Print best parameters
    print("Best parameters:")
    print(grid_search.best_params_)
    print()

Testing Random Forest:
Best parameters:
{'vect__binary': True, 'vect__lowercase': True, 'vect__max_df': 0.5, 'vect__min_df': 1, 'vect__ngram_range': (1, 2), 'vect__stop_words': None}

Testing Decision Tree:
Best parameters:
{'vect__binary': True, 'vect__lowercase': True, 'vect__max_df': 0.7, 'vect__min_df': 1, 'vect__ngram_range': (1, 2), 'vect__stop_words': None}

Testing SVC:
Best parameters:
{'vect__binary': True, 'vect__lowercase': True, 'vect__max_df': 0.5, 'vect__min_df': 1, 'vect__ngram_range': (1, 1), 'vect__stop_words': None}

Testing Logistic Regression:
Best parameters:
{'vect__binary': True, 'vect__lowercase': True, 'vect__max_df': 0.5, 'vect__min_df': 1, 'vect__ngram_range': (1, 2), 'vect__stop_words': None}



In [10]:
# Pick the best classifier based on mean cross-validation score
best_classifier = max(cv_scores, key=cv_scores.get)
print(f"The best classifier is: {best_classifier}")

The best classifier is: Logistic Regression


In [11]:
# Get the fitted CountVectorizer from the best pipeline.
# Note: grid_search still refers to the last classifier in the loop (Logistic Regression),
# which is also the best classifier here.
vectorizer = grid_search.best_estimator_.named_steps['vect']
print(f"The best parameters for CountVectorizer are: {vectorizer}")

The best parameters for CountVectorizer are: CountVectorizer(binary=True, max_df=0.5, ngram_range=(1, 2))
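
The loop above reuses a single `grid_search` variable, so once it finishes the variable holds the search for whichever classifier ran last. The following is a minimal sketch, on invented toy data, of one way to keep every fitted search and select the winner explicitly. The `searches` dictionary, the toy sentences and the reduced parameter grid are assumptions made for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration only (not the review dataset)
texts = ["good film", "great phone", "loved it", "really good", "great food", "good value",
         "bad film", "awful phone", "hated it", "really bad", "awful food", "bad value"]
labels = [1] * 6 + [0] * 6

classifiers = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Logistic Regression': LogisticRegression(),
}
parameters = {'vect__binary': [True, False], 'vect__ngram_range': [(1, 1), (1, 2)]}

searches = {}  # hypothetical helper: one fitted GridSearchCV per classifier name
for clf_name, clf in classifiers.items():
    pipeline = Pipeline([('vect', CountVectorizer()), ('clf', clf)])
    searches[clf_name] = GridSearchCV(pipeline, parameters, cv=3).fit(texts, labels)

# best_score_ is the mean cross-validated score of the best parameter setting
best_name = max(searches, key=lambda name: searches[name].best_score_)
best_vectorizer = searches[best_name].best_estimator_.named_steps['vect']
print(best_name, best_vectorizer)
```

Because `best_score_` is already the mean cross-validation score of the best setting, selecting on it avoids a second `cross_val_score` pass and does not depend on the order in which the classifiers were tested.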


In [12]:
# Vectorize the data: fit on the training set only, then transform the validation and test sets
X_train = vectorizer.fit_transform(X_train)
X_val = vectorizer.transform(X_val)
X_test = vectorizer.transform(x_test['text'])

## Task 3: Classification and Evaluation

### Performing the classification

In [13]:
# Classifier Selection and Training
model = LogisticRegression()
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga'],
    'max_iter': [800, 1000, 2000, 4000, 10000]
}
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='f1', verbose=1)
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
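
The figures in this message follow directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A short sketch of the arithmetic, using only the standard library:

```python
from itertools import product

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga'],
    'max_iter': [800, 1000, 2000, 4000, 10000],
}
cv_folds = 3

# One candidate per combination of parameter values: 4 * 5 * 5
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 100 300
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training data; that final fit is not included in the 300.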


### Evaluation

In [14]:
# Best classifier
best_classifier = grid_search.best_estimator_
print(best_classifier)

LogisticRegression(C=10, max_iter=10000, solver='saga')


In [15]:
# Perform predictions
y_pred_train = best_classifier.predict(X_train) # Training set
y_pred_val = best_classifier.predict(X_val) # Validation set
y_pred_test = best_classifier.predict(X_test) # Test set

In [16]:
# Display classification reports 
print("Training set")
print(classification_report(y_train, y_pred_train))
print("Validation set")
print(classification_report(y_val, y_pred_val))
print("Test set")
print(classification_report(y_test, y_pred_test))

Training set
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       963
           1       1.00      1.00      1.00       957

    accuracy                           1.00      1920
   macro avg       1.00      1.00      1.00      1920
weighted avg       1.00      1.00      1.00      1920

Validation set
              precision    recall  f1-score   support

           0       0.83      0.87      0.85       237
           1       0.87      0.82      0.84       243

    accuracy                           0.85       480
   macro avg       0.85      0.85      0.85       480
weighted avg       0.85      0.85      0.85       480

Test set
              precision    recall  f1-score   support

           0       0.79      0.85      0.82       300
           1       0.84      0.77      0.81       300

    accuracy                           0.81       600
   macro avg       0.82      0.81      0.81       600
weighted avg       0.82      0.81      0.81       600

#### Evaluation Metrics

On the training set, the LR classifier achieved precision, recall and F1-score (the harmonic mean
of precision and recall) of 1.00 for both negative sentiment reviews (label 0) and positive
sentiment reviews (label 1), giving an overall accuracy of 100%.

On the validation set, precision was 83% for negative sentiment and 87% for positive sentiment,
with recall scores of 87% and 82%, respectively. The F1-scores were 0.85 for negative sentiment and
0.84 for positive sentiment, yielding an accuracy of 85%.

On the test set, precision was 79% for negative sentiment and 84% for positive sentiment, with
recall scores of 85% and 77% and F1-scores of 0.82 and 0.81, for an accuracy of 81%. The gap between
the perfect training scores and the 81% to 85% accuracy on held-out data indicates that the model
overfits the training set.
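
The reported F1-scores can be reproduced from the precision and recall values with the harmonic-mean formula. The helper name `f1_from` below is hypothetical; the inputs are the rounded validation-set figures from the classification report.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Validation-set (precision, recall) per label, as printed in the report
validation = {0: (0.83, 0.87), 1: (0.87, 0.82)}
f1_by_class = {label: round(f1_from(p, r), 2) for label, (p, r) in validation.items()}
print(f1_by_class)  # {0: 0.85, 1: 0.84}
```

Since the inputs are already rounded to two decimals, a recomputed value can in general differ from the report in the last digit; here both labels match.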

In [17]:
# Create new dataframe for test set analysis
test_analysis = x_test.copy()
test_analysis['label'] = y_test
test_analysis['predict'] = y_pred_test

In [18]:
# Function to analyze sentence length
def analyze_sentence_length(df):
    """
    Analyzes the performance of the model based on sentence length.
    
    Args:
    df (DataFrame): DataFrame containing text data.
    
    Returns:
    tuple: Tuple containing confusion matrices for short, medium, and long sentences.
    """
    df['sentence_length'] = df['text'].apply(lambda x: len(x.split()))
    # Compare performance based on sentence length
    performance_short = df[df['sentence_length'] < 10]
    performance_medium = df[(df['sentence_length'] >= 10) & (df['sentence_length'] <= 20)]
    performance_long = df[df['sentence_length'] > 20]
    
    # Calculate confusion matrices
    confusion_matrix_short = confusion_matrix(performance_short['label'], performance_short['predict'])
    confusion_matrix_medium = confusion_matrix(performance_medium['label'], performance_medium['predict'])
    confusion_matrix_long = confusion_matrix(performance_long['label'], performance_long['predict'])
    
    return confusion_matrix_short, confusion_matrix_medium, confusion_matrix_long

# Function to analyze review type
def analyze_review_type(df, website):
    """
    Analyzes the performance of the model based on the website type.
    
    Args:
    df (DataFrame): DataFrame containing text data.
    website (str): Name of the website to analyze.
    
    Returns:
    array-like: Confusion matrix for the specified website.
    """
    performance = df[df['website_name'] == website]
    
    # Calculate confusion matrix
    confusion_matrix_review = confusion_matrix(performance['label'], performance['predict'])
    
    return confusion_matrix_review

# Function to analyze sentences with negation words
def analyze_negation_words(df):
    """
    Analyzes the performance of the model based on the presence of negation words in sentences.
    
    Args:
    df (DataFrame): DataFrame containing text data.
    
    Returns:
    tuple: Tuple containing confusion matrices for sentences with and without negation words.
    """
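    # Note: clean_text strips apostrophes, so "didn't" and "shouldn't" appear in the cleaned
    # text as "didnt" and "shouldnt" and never match; in practice only 'not' is detected.
    negation_words = ['not', "didn't", "shouldn't"]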
    df['contains_negation'] = df['text'].apply(lambda x: any(word in x.split() for word in negation_words))
    performance_with_negation = df[df['contains_negation']]
    performance_without_negation = df[~df['contains_negation']]
    
    # Calculate confusion matrices
    confusion_matrix_with_negation = confusion_matrix(performance_with_negation['label'], performance_with_negation['predict'])
    confusion_matrix_without_negation = confusion_matrix(performance_without_negation['label'], performance_without_negation['predict'])
    
    return confusion_matrix_with_negation, confusion_matrix_without_negation

In [19]:
# Analyze sentence length
short_cm, medium_cm, long_cm = analyze_sentence_length(test_analysis)
print("Confusion Matrix for Short Sentences:")
print(short_cm)
print("\nConfusion Matrix for Medium Sentences:")
print(medium_cm)
print("\nConfusion Matrix for Long Sentences:")
print(long_cm)

# Analyze review type
amazon_cm = analyze_review_type(test_analysis, 'amazon')
imdb_cm = analyze_review_type(test_analysis, 'imdb')
print("\nConfusion Matrix for Amazon Reviews:")
print(amazon_cm)
print("\nConfusion Matrix for IMDb Reviews:")
print(imdb_cm)

# Analyze sentences with and without negation words
with_negation_cm, without_negation_cm = analyze_negation_words(test_analysis)
print("\nConfusion Matrix for Sentences with Negation Words:")
print(with_negation_cm)
print("\nConfusion Matrix for Sentences without Negation Words:")
print(without_negation_cm)

Confusion Matrix for Short Sentences:
[[120  19]
 [ 27 120]]

Confusion Matrix for Medium Sentences:
[[108  19]
 [ 20  88]]

Confusion Matrix for Long Sentences:
[[28  6]
 [21 24]]

Confusion Matrix for Amazon Reviews:
[[89 11]
 [21 79]]

Confusion Matrix for IMDb Reviews:
[[83 17]
 [25 75]]

Confusion Matrix for Sentences with Negation Words:
[[47  1]
 [ 8  3]]

Confusion Matrix for Sentences without Negation Words:
[[209  43]
 [ 60 229]]


#### Sentence Length Analysis:

- Short Sentences: For short sentences (fewer than 10 words), the classifier is correct on 240 of
286 sentences (84%). False negatives (27) slightly outnumber false positives (19).
- Medium Sentences: For medium-length sentences (10 to 20 words), accuracy is similar at 196 of
235 (83%), with false negatives (20) and false positives (19) almost balanced.
- Long Sentences: For long sentences (more than 20 words), accuracy drops to 52 of 79 (66%).
Errors are dominated by false negatives: 21 of the 45 positive reviews are predicted as negative,
against 6 false positives.

The classifier's performance therefore depends on sentence length. It is most accurate on short and
medium sentences, where the two error types are close to balanced. On long sentences it is markedly
less accurate and skews towards the negative class, misclassifying almost half of the positive
reviews as negative. The long-sentence group is small (79 sentences), so its rates are less
reliable than those of the other two groups.
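
The rates behind this comparison can be read off the confusion matrices directly. The sketch below assumes scikit-learn's layout, in which rows are true labels and columns are predicted labels with label 0 first, so each matrix is `[[TN, FP], [FN, TP]]`. The helper name `rates` is hypothetical.

```python
def rates(cm):
    """Return accuracy, false-positive rate and false-negative rate for a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'fpr': fp / (fp + tn),  # share of negative reviews predicted positive
        'fnr': fn / (fn + tp),  # share of positive reviews predicted negative
    }

# Confusion matrices as printed above
matrices = {
    'short': [[120, 19], [27, 120]],
    'medium': [[108, 19], [20, 88]],
    'long': [[28, 6], [21, 24]],
}
summary = {name: rates(cm) for name, cm in matrices.items()}
for name, r in summary.items():
    print(f"{name:<6} accuracy={r['accuracy']:.3f} FPR={r['fpr']:.3f} FNR={r['fnr']:.3f}")
```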

#### Review Type Analysis:
- Amazon Reviews: The classifier performs better on Amazon reviews, with 168 of 200 correct
(84%). It recovers 79 of the 100 positive reviews, with 11 false positives and 21 false negatives.
- IMDb Reviews: Performance on IMDb reviews is lower, with 158 of 200 correct (79%). It recovers
75 of the 100 positive reviews, with 17 false positives and 25 false negatives.

The classifier is more accurate on Amazon reviews than on IMDb reviews. On both sites false
negatives outnumber false positives, so positive reviews are missed more often than negative
reviews are mislabelled. The lower IMDb accuracy suggests that the classifier may struggle with
certain types of content or sentiment expressions inherent to specific review platforms. Yelp
reviews were not analysed separately.
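
The same breakdown could be produced for every site in one pass instead of naming each site by hand. The following is a sketch on a tiny invented frame that mimics the columns of `test_analysis`. The helper `confusion_by_website` is hypothetical, and the value `'yelp'` is an assumption about how that site is spelled in `website_name`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def confusion_by_website(df):
    """Map each value of 'website_name' to its 2x2 confusion matrix."""
    return {
        site: confusion_matrix(group['label'], group['predict'], labels=[0, 1])
        for site, group in df.groupby('website_name')
    }

# Tiny stand-in for test_analysis; all rows are invented for illustration
toy = pd.DataFrame({
    'website_name': ['amazon', 'amazon', 'imdb', 'imdb', 'yelp', 'yelp'],
    'label':        [1, 0, 1, 0, 1, 1],
    'predict':      [1, 0, 0, 0, 1, 0],
})
per_site = confusion_by_website(toy)
for site, cm in per_site.items():
    print(site, cm.tolist())
```

Passing `labels=[0, 1]` keeps every matrix 2x2 even when a subset happens to contain only one class, as the toy `yelp` rows do.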

#### Analysis of Negation Words:
- Sentences with Negation Words: Of the 59 sentences containing a negation word, 48 are
negative. The classifier labels 47 of them correctly but recovers only 3 of the 11 positive
sentences, indicating a tendency to misclassify positive sentiments as negative when a negation
word is present.
- Sentences without Negation Words: In the 541 sentences without negation words, the errors are
more balanced, with 43 false positives and 60 false negatives, and 229 of the 289 positive
sentences (79%) are recovered.

Negation words pose a specific challenge to the classifier. In sentences containing negation words,
it predicts the negative class almost exclusively (55 of 59 predictions), so accuracy in this group
remains high (50 of 59, 85%) while recall on positive sentences falls to 27%. When negation words
are absent, accuracy is 81% (438 of 541) and the two classes are handled more evenly. This pattern
is consistent with a Bag-of-Words model treating a negation word as evidence of negative sentiment
in itself, regardless of which word it negates.
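
One detail affects which sentences fall into the negation group: the text is cleaned before this analysis, and the cleaning step removes apostrophes. The sketch below, on a made-up sentence, shows the effect on the word list used above. The helper `contains_negation` and the `apostrophe_free` list are hypothetical and are not part of the original notebook.

```python
import re

def clean_text(text):
    """Same cleaning steps as the notebook's clean_text."""
    text = re.sub(r'[^\w\s]', '', text).lower()
    return re.sub(r'\s+', ' ', text).strip()

def contains_negation(text, negation_words):
    """True if any whitespace-separated token of text is a negation word."""
    return not set(text.split()).isdisjoint(negation_words)

original_list = ['not', "didn't", "shouldn't"]   # list used in the analysis
apostrophe_free = ['not', 'didnt', 'shouldnt']   # hypothetical alternative list

cleaned = clean_text("I didn't like it.")  # made-up sentence
print(cleaned)                                       # i didnt like it
print(contains_negation(cleaned, original_list))     # False
print(contains_negation(cleaned, apostrophe_free))   # True
```

On cleaned text, the contractions in the original list can never match, so the negation group reported above is in effect the set of sentences containing the word `not`. Matching apostrophe-free forms would likely enlarge that group and change the confusion matrices.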