# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dimensions are shown below:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Additional web scraping of online reviews

During our EDA, we noticed two main trends in the distribution of our dataset:
1. Fewer than 10% of our reviews were published from 2022 to 2024, making it hard for us to capture recent trends in sentiment.
2. Most of the reviews were highly positive. This may simply reflect that SIA receives mostly positive reviews, but we wanted more negative reviews to improve the robustness of our model.

### TripAdvisor

We scraped additional airline reviews from [TripAdvisor](https://www.tripadvisor.com.sg/Airline_Review-d8729151-Reviews-Singapore-Airlines), specifically for the years 2022 to 2024.

The dimensions are shown below:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 5)


### Skytrax

We also scraped reviews from [Skytrax](https://www.airlinequality.com/airline-reviews/singapore-airlines/?sortby=post_date%3ADesc&pagesize=100), another source of online airline reviews.

The dimensions are shown below:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 10)
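
Because TripAdvisor rates on a 1 to 5 scale and Skytrax on a 1 to 10 scale, the two sources need a common scale before ratings can be turned into sentiment labels. The sketch below shows one way this could be done. The linear rescaling, the helper names `rescale_rating` and `rating_to_sentiment`, and the thresholds are all illustrative assumptions, not necessarily the mapping used in this project.

```python
def rescale_rating(rating, old_max=10, new_max=5):
    """Linearly map a rating from [1, old_max] onto [1, new_max]."""
    return 1 + (rating - 1) * (new_max - 1) / (old_max - 1)


def rating_to_sentiment(rating):
    """Bucket a 1-5 rating into a sentiment label (hypothetical thresholds)."""
    if rating <= 2:
        return "Negative"
    if rating < 4:
        return "Neutral"
    return "Positive"


# Convert a few Skytrax-style ratings (1-10) to the 1-5 scale and label them
for skytrax_rating in (1, 4, 7, 10):
    rescaled = rescale_rating(skytrax_rating)
    print(skytrax_rating, round(rescaled, 2), rating_to_sentiment(rescaled))
```

Whatever thresholds are chosen, applying the same rule to all three sources keeps the labels comparable across platforms.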

## Importing Libraries

Run the code cell below to pip install the dependencies for this notebook.

In [1]:
!pip3 install -r requirements.txt



In [1]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime 

# Statistical functions
from scipy.stats import zscore

# Text Preprocessing and NLP
import nltk
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

# WordNet lexical database (used for lemmatization)
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer


# For generating n-grams
from nltk.util import ngrams
from collections import Counter

## Data Preparation (Loading CSV)

Load the combined, preprocessed reviews from `final_df.csv` into a pandas DataFrame `data`.

In [2]:
data = pd.read_csv('final_df.csv')

In [3]:
data.head()

Unnamed: 0,year,month,sentiment,processed_full_review
0,2024,3,Neutral,ok use airlin go singapor london heathrow issu...
1,2024,3,Negative,don give money book paid receiv email confirm ...
2,2024,3,Positive,best airlin world best airlin world seat food ...
3,2024,3,Negative,premium economi seat singapor airlin not worth...
4,2024,3,Negative,imposs get promis refund book flight full mont...


In [4]:
data['sentiment'].value_counts()

sentiment
Positive    7913
Negative    2441
Neutral     1164
Name: count, dtype: int64
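
The counts above show a strong class imbalance, which is why the training cells later in this notebook pass class weights to `model.fit`. As a worked example, scikit-learn's `'balanced'` heuristic weights each class by `n_samples / (n_classes * class_count)`. The snippet below applies that formula by hand to the printed counts (the variable names are illustrative).

```python
# Class counts copied from the value_counts() output above
counts = {"Negative": 2441, "Neutral": 1164, "Positive": 7913}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, w in weights.items():
    print(f"{label}: {w:.4f}")
```

The rare Neutral class receives the largest weight (about 3.3), while the dominant Positive class is down-weighted to below 0.5.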

In [5]:
data['year'].value_counts()

year
2019    5129
2018    2596
2022    1184
2023    1111
2020     888
2024     514
2021      96
Name: count, dtype: int64
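
From these counts, the share of reviews published from 2022 to 2024 in the combined data can be computed directly. The snippet below is a small arithmetic check on the printed output (variable names are illustrative).

```python
# Review counts per year, copied from the value_counts() output above
year_counts = {2019: 5129, 2018: 2596, 2022: 1184, 2023: 1111,
               2020: 888, 2024: 514, 2021: 96}

total = sum(year_counts.values())
recent = sum(n for year, n in year_counts.items() if year >= 2022)

print(f"{recent} of {total} reviews ({recent / total:.1%}) are from 2022-2024")
```

After adding the scraped reviews, roughly a quarter of the combined data comes from 2022 to 2024.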

## Basic LSTM

LSTM Model Explanation:

Tokenization and Padding: A Keras Tokenizer with a vocabulary size of 10,000 converts each review into a sequence of word indices, and pad_sequences pads or truncates every sequence to 300 tokens.

Model Initialization: A Sequential model is used to stack the layers in order.

Embedding Layer: The first layer is a trainable Embedding layer with input_dim=10000 (vocabulary size) and output_dim=128 (embedding dimension). This layer converts word indices into dense vector representations that the LSTM can process.

LSTM Layer: The model has a single LSTM layer with 64 units and tanh activation. Because return_sequences is left at its default of False, it outputs only the last hidden state to the next layer. An L2 penalty (0.01) is applied to its kernel weights.

Dropout Layer: A Dropout layer with a rate of 0.5 follows the LSTM layer to help prevent overfitting by randomly setting half of its input units to 0 during training.

Dense Layer: The final Dense layer has 3 units (corresponding to the three sentiment classes: Negative, Neutral, Positive) and a softmax activation function for multi-class classification. It uses the same L2 regularization.

Compilation: The model is compiled using the adam optimizer, sparse_categorical_crossentropy as the loss function (suitable for integer-encoded classes), and accuracy as a performance metric.

Training and Evaluation: The model is evaluated with stratified 5-fold cross-validation. Within each fold, 20% of the training data is held out for validation, early stopping (patience of 3 epochs on val_loss) restores the best weights, and balanced class weights compensate for the class imbalance.
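
As a sanity check on the model size, the trainable parameter counts can be worked out by hand from the hyperparameters in the training cell below (vocabulary size 10,000, embedding dimension 128, one 64-unit LSTM layer, 3 output classes). This sketch uses the standard parameter formulas for Keras `Embedding`, `LSTM`, and `Dense` layers; the variable names are illustrative.

```python
vocab_size, embedding_dim, lstm_units, n_classes = 10000, 128, 64, 3

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim

# LSTM: 4 weight blocks (input, forget and output gates plus the candidate
# cell state), each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense softmax output: weights plus bias
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

Almost all of the parameters sit in the embedding table, which is one reason the later sections try initializing it from pretrained-style word vectors.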


In [6]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, accuracy_score
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
import numpy as np
import random
import os

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Parameters
vocab_size = 10000  # Limit vocabulary size to 10,000 words
embedding_dim = 128  # Dimension of embeddings
max_sequence_length = 300  # Max number of words in each sequence
l2_lambda = 0.01

# Step 1: Tokenize and pad text data using Keras Tokenizer
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(data['processed_full_review'])
sequences = tokenizer.texts_to_sequences(data['processed_full_review'])
X_padded = pad_sequences(sequences, maxlen=max_sequence_length)

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

# Calculate class weights
class_weights_values = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
class_weights = {i: class_weights_values[i] for i in range(len(class_weights_values))}

# Define stratified 5-fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
f1_scores = []

# Cross-validation loop
for fold, (train_index, test_index) in enumerate(skf.split(X_padded, y)):
    print(f"\nTraining fold {fold + 1}...\n")
    
    X_train, X_test = X_padded[train_index], X_padded[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Define the model architecture with a trainable Embedding layer
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length, trainable=True))
    model.add(LSTM(64, activation='tanh', kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)))
    model.add(Dropout(0.5))
    model.add(Dense(3, activation='softmax', kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)))
    
    # Compile the model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    # Early stopping callback
    early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
    
    # Train the model with early stopping and class weights
    model.fit(
        X_train, y_train, 
        epochs=10, 
        batch_size=128,  
        validation_split=0.2, 
        verbose=1,
        callbacks=[early_stopping],
        class_weight=class_weights
    )
    
    # Predictions and evaluation for the current fold
    y_pred_prob = model.predict(X_test)
    y_pred = np.argmax(y_pred_prob, axis=1)
    
    # Calculate metrics for the current fold
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, output_dict=True)
    f1 = report['weighted avg']['f1-score']
    
    accuracy_scores.append(accuracy)
    f1_scores.append(f1)
    
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")
    print(f"Fold {fold + 1} F1 Score: {f1:.4f}")
    print(f"Fold {fold + 1} Classification Report:\n", classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4))

# Print average metrics across all folds
print("\nAverage Metrics across folds:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.4f}")
print(f"Average F1 Score: {np.mean(f1_scores):.4f}")



Training fold 1...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Fold 1 Accuracy: 0.8403
Fold 1 F1 Score: 0.8432
Fold 1 Classification Report:
               precision    recall  f1-score   support

    Negative     0.8410    0.6721    0.7472       488
     Neutral     0.3763    0.4764    0.4205       233
    Positive     0.9246    0.9457    0.9350      1583

    accuracy                         0.8403      2304
   macro avg     0.7140    0.6981    0.7009      2304
weighted avg     0.8515    0.8403    0.8432      2304


Training fold 2...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Fold 2 Accuracy: 0.8212
Fold 2 F1 Score: 0.8393
Fold 2 Classification Report:
               precision    recall  f1-score   support

    Negative     0.8761    0.6373    0.7378       488
     Neutral     0.3503    0.7082    0.4688       233
    Positive     0.9581    0.8945    0.9252      1583

    accuracy                         0.8212      2304
   m
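
The epoch logs above end before epoch 10 because training uses `EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)`. The following is a simplified pure-Python sketch of that stopping rule, run on made-up validation losses. It illustrates the logic only and is not the Keras implementation; `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses: best at epoch 4, then no improvement
losses = [0.90, 0.75, 0.70, 0.66, 0.68, 0.69, 0.71, 0.72, 0.73, 0.74]
stopped, best = early_stopping_epoch(losses)
print(stopped, best)  # 7 4
```

Under this rule, a run that stops after epoch 7, as in fold 1, implies that the best validation loss was reached at epoch 4, and `restore_best_weights=True` rolls the model back to the weights from that epoch.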

## LSTM + Word2Vec
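
The cell below trains skip-gram Word2Vec vectors (`sg=1`) on the tokenized reviews, copies them into an embedding matrix whose row 0 is reserved for padding, and converts each review into a padded sequence of row indices. The sketch below mimics the indexing and padding steps in plain Python on a toy vocabulary. `build_word_index` and `pad_pre` are illustrative helpers, and `pad_pre` only imitates the default `pad_sequences` behaviour (pad and truncate at the front).

```python
def build_word_index(vocabulary):
    """Map each word to an index starting at 1; index 0 is kept for padding."""
    return {word: idx + 1 for idx, word in enumerate(vocabulary)}


def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


word_index = build_word_index(["seat", "food", "crew", "delay"])
review = ["food", "crew", "unknown", "seat"]

# Unknown tokens fall back to index 0, as in word_index.get(word, 0)
sequence = [word_index.get(token, 0) for token in review]

print(sequence)                      # [2, 3, 0, 1]
print(pad_pre(sequence, maxlen=6))   # [0, 0, 2, 3, 0, 1]
print(pad_pre(sequence, maxlen=3))   # [3, 0, 1]
```

One consequence of this scheme is that out-of-vocabulary tokens and padding share index 0. With `min_count=1` every token in the training corpus is in the vocabulary, so this only matters for unseen text.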

In [4]:
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, accuracy_score
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
import numpy as np
import random
import os
import nltk
from nltk.tokenize import word_tokenize

# Ensure NLTK's punkt tokenizer is downloaded
# nltk.download('punkt')

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Parameters
embedding_dim = 128       # Dimension of Word2Vec embeddings
max_sequence_length = 300 # Max number of words in each sequence
l2_lambda = 0.01 

# Step 1: Tokenize the text data
tokenized_reviews = [word_tokenize(review.lower()) for review in data['processed_full_review']]

# Step 2: Train Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_reviews, vector_size=embedding_dim, window=5, min_count=1, sg=1, seed=42)

# Step 3: Prepare embedding matrix
vocab_size = len(word2vec_model.wv.key_to_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Map Word2Vec vectors to the embedding matrix
word_index = {word: idx + 1 for idx, word in enumerate(word2vec_model.wv.key_to_index)}
for word, idx in word_index.items():
    embedding_matrix[idx] = word2vec_model.wv[word]

# Step 4: Convert reviews to sequences of word indices
sequences = [[word_index.get(word, 0) for word in review] for review in tokenized_reviews]
X_padded = pad_sequences(sequences, maxlen=max_sequence_length)

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

# Calculate class weights
class_weights_values = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
class_weights = {i: class_weights_values[i] for i in range(len(class_weights_values))}

# Define stratified 5-fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
f1_scores = []

# Cross-validation loop
for fold, (train_index, test_index) in enumerate(skf.split(X_padded, y)):
    print(f"\nTraining fold {fold + 1}...\n")
    
    X_train, X_test = X_padded[train_index], X_padded[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Define the model architecture with one LSTM layer
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, 
                        weights=[embedding_matrix], input_length=max_sequence_length, trainable=True))
    model.add(LSTM(64, activation='tanh', kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)))
    model.add(Dropout(0.5))
    model.add(Dense(3, activation='softmax', kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)))
    
    # Compile the model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    # Early stopping callback
    early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
    
    # Train the model with early stopping and class weights
    model.fit(
        X_train, y_train, 
        epochs=10, 
        batch_size=128,  
        validation_split=0.2, 
        verbose=1,
        callbacks=[early_stopping],
        class_weight=class_weights
    )
    
    # Predictions and evaluation for the current fold
    y_pred_prob = model.predict(X_test)
    y_pred = np.argmax(y_pred_prob, axis=1)
    
    # Calculate metrics for the current fold
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, output_dict=True)
    f1 = report['weighted avg']['f1-score']
    
    accuracy_scores.append(accuracy)
    f1_scores.append(f1)
    
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")
    print(f"Fold {fold + 1} F1 Score: {f1:.4f}")
    print(f"Fold {fold + 1} Classification Report:\n", classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4))

# Print average metrics across all folds
print("\nAverage Metrics across folds:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.4f}")
print(f"Average F1 Score: {np.mean(f1_scores):.4f}")



Training fold 1...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Fold 1 Accuracy: 0.8251
Fold 1 F1 Score: 0.8415
Fold 1 Classification Report:
               precision    recall  f1-score   support

    Negative     0.8371    0.7582    0.7957       488
     Neutral     0.3608    0.6395    0.4613       233
    Positive     0.9538    0.8730    0.9116      1583

    accuracy                         0.8251      2304
   macro avg     0.7172    0.7569    0.7229      2304
weighted avg     0.8691    0.8251    0.8415      2304


Training fold 2...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Fold 2 Accuracy: 0.8242
Fold 2 F1 Score: 0.8392
Fold 2 Classification Report:
               precision    recall  f1-score   support

    Negative     0.7840    0.8033    0.7935       488
     Neutral     0.3450    0.5494    0.4238       233
    Positive     0.9623    0.8711    0.9145      1583

    accuracy             

## LSTM + FastText
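
FastText differs from Word2Vec in that each word vector is built from the vectors of its character n-grams, so related word forms share parameters. The sketch below lists the character n-grams of a word with the `<` and `>` boundary markers. `char_ngrams` is an illustrative helper, not a gensim function; its default range of 3 to 6 characters mirrors gensim's `min_n` and `max_n` defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


print(char_ngrams("seat", min_n=3, max_n=3))  # ['<se', 'sea', 'eat', 'at>']

# A stemmed form and its full form share many n-grams
shared = set(char_ngrams("airlin")) & set(char_ngrams("airlines"))
print(sorted(shared))
```

Because the reviews in `processed_full_review` are already stemmed, the benefit of subword sharing may be smaller here than on raw text.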

In [5]:
from gensim.models import FastText
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, accuracy_score
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
import numpy as np
import random
import os
import nltk
from nltk.tokenize import word_tokenize

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Parameters
embedding_dim = 128       # Dimension of FastText embeddings
max_sequence_length = 300 # Max number of words in each sequence
l2_lambda = 0.01 

# Step 1: Tokenize the text data
tokenized_reviews = [word_tokenize(review.lower()) for review in data['processed_full_review']]

# Step 2: Train FastText model
fasttext_model = FastText(sentences=tokenized_reviews, vector_size=embedding_dim, window=5, min_count=1, sg=1, seed=42)

# Step 3: Prepare embedding matrix
vocab_size = len(fasttext_model.wv.key_to_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Map FastText vectors to the embedding matrix
word_index = {word: idx + 1 for idx, word in enumerate(fasttext_model.wv.key_to_index)}
for word, idx in word_index.items():
    embedding_matrix[idx] = fasttext_model.wv[word]

# Step 4: Convert reviews to sequences of word indices
sequences = [[word_index.get(word, 0) for word in review] for review in tokenized_reviews]
X_padded = pad_sequences(sequences, maxlen=max_sequence_length)

# Labels
sentiment_dict = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['sentiment'].map(sentiment_dict).values

# Calculate class weights
class_weights_values = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
class_weights = {i: class_weights_values[i] for i in range(len(class_weights_values))}

# Define stratified 5-fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
f1_scores = []

# Cross-validation loop
for fold, (train_index, test_index) in enumerate(skf.split(X_padded, y)):
    print(f"\nTraining fold {fold + 1}...\n")
    
    X_train, X_test = X_padded[train_index], X_padded[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Define the model architecture with one LSTM layer
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, 
                        weights=[embedding_matrix], input_length=max_sequence_length, trainable=True))
    model.add(LSTM(64, activation='tanh', kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)))
    model.add(Dropout(0.5))
    model.add(Dense(3, activation='softmax', kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)))
    
    # Compile the model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    # Early stopping callback
    early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
    
    # Train the model with early stopping and class weights
    model.fit(
        X_train, y_train, 
        epochs=10, 
        batch_size=128,  
        validation_split=0.2, 
        verbose=1,
        callbacks=[early_stopping],
        class_weight=class_weights
    )
    
    # Predictions and evaluation for the current fold
    y_pred_prob = model.predict(X_test)
    y_pred = np.argmax(y_pred_prob, axis=1)
    
    # Calculate metrics for the current fold
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, output_dict=True)
    f1 = report['weighted avg']['f1-score']
    
    accuracy_scores.append(accuracy)
    f1_scores.append(f1)
    
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")
    print(f"Fold {fold + 1} F1 Score: {f1:.4f}")
    print(f"Fold {fold + 1} Classification Report:\n", classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive'], zero_division=0, digits=4))

# Print average metrics across all folds
print("\nAverage Metrics across folds:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.4f}")
print(f"Average F1 Score: {np.mean(f1_scores):.4f}")



Training fold 1...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Fold 1 Accuracy: 0.8134
Fold 1 F1 Score: 0.8339
Fold 1 Classification Report:
               precision    recall  f1-score   support

    Negative     0.7992    0.8156    0.8073       488
     Neutral     0.3224    0.5880    0.4164       233
    Positive     0.9696    0.8459    0.9035      1583

    accuracy                         0.8134      2304
   macro avg     0.6970    0.7498    0.7091      2304
weighted avg     0.8680    0.8134    0.8339      2304


Training fold 2...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Fold 2 Accuracy: 0.8446
Fold 2 F1 Score: 0.8549
Fold 2 Classification Report:
               precision    recall  f1-score   support

    Negative     0.8209    0.7418    0.7793       488
     Neutral     0.3868    0.5794    0.4639       233
    Positive     0.9571    0.9154    0.9357   

## LSTM with Hashing Vectorization

The text data (`texts`) is transformed using `HashingVectorizer` with `n_features=5000`, meaning each document is represented as a vector of 5000 features.
The `toarray()` method converts the sparse matrix to a dense format for compatibility with the model.
The transformed data (`X`) is then reshaped into a 3D array suitable for input into the LSTM (samples, timesteps, features). Each review becomes a single timestep of 5000 features, so the LSTM processes one bag-of-words vector per review rather than a sequence of words.

Hashing Vectorizer requires less preprocessing than the Tokenizer and Embedding approach used in the code above.

Hashing Vectorizer directly transforms text into fixed-length numerical vectors by hashing the terms and mapping them to a specified number of features. This eliminates the need to build a vocabulary or convert tokens into embeddings. The Tokenizer approach, in contrast, builds a vocabulary and converts the text into sequences of integers, which are then mapped to dense vectors by an `Embedding` layer. This two-step process is more computationally intensive than direct hashing.

The model using Hashing Vectorization reaches a test accuracy of 0.8351 (weighted F1 of 0.8363), which is in the same range as the per-fold scores of the tokenization and embedding models above. The figures are not directly comparable, because this model is evaluated on a single 80/20 split rather than with 5-fold cross-validation. The hashing approach creates fixed-size representations without the need for a vocabulary, which keeps preprocessing simple, but it discards word order.
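
The hashing trick described above can be sketched in a few lines. This is a simplified illustration: it uses `hashlib.md5` for a deterministic hash and raw counts, whereas scikit-learn's `HashingVectorizer` uses MurmurHash3 and L2-normalizes each row by default. `hash_vectorize` is a hypothetical helper.

```python
import hashlib


def hash_vectorize(text, n_features=16):
    """Map whitespace tokens to a fixed-length count vector via hashing."""
    vector = [0] * n_features
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        # The bucket index depends only on the token, not on any vocabulary
        vector[int(digest, 16) % n_features] += 1
    return vector


a = hash_vectorize("best airlin world best seat food")
b = hash_vectorize("food seat best world airlin best")

print(a)
print(a == b)  # True: word order is discarded
```

Because the vector is a bag of hashed words, word order is lost, and distinct words can collide in the same bucket when `n_features` is small.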

In [7]:
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Reshape
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import classification_report
import tensorflow as tf
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Load the dataset
data = pd.read_csv('final_df.csv')

# Preprocess text and labels
texts = data['processed_full_review'].astype(str)
labels = data['sentiment']

# Encode labels (e.g., Positive=2, Negative=0, Neutral=1)
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

# Use Hashing Vectorizer
vectorizer = HashingVectorizer(n_features=5000, alternate_sign=False)  # Set n_features as needed
X = vectorizer.transform(texts).toarray()

# Reshape to 3D array as expected by LSTM input (samples, timesteps, features)
X = np.reshape(X, (X.shape[0], 1, X.shape[1]))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels_encoded, test_size=0.2, random_state=42)

# Define the LSTM model
model = Sequential()
model.add(LSTM(64, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(32, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))  # 3 classes for Positive, Negative, Neutral

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")

# Generate predictions for the test set
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_classes, target_names=label_encoder.classes_, digits=4))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.8351

Classification Report:
              precision    recall  f1-score   support

    Negative     0.7457    0.8234    0.7826       470
     Neutral     0.3673    0.3640    0.3656       228
    Positive     0.9326    0.9054    0.9188      1606

    accuracy                         0.8351      2304
   macro avg     0.6819    0.6976    0.6890      2304
weighted avg     0.8386    0.8351    0.8363      2304
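
The `macro avg` and `weighted avg` rows follow directly from the per-class rows: the macro average is the plain mean over classes, and the weighted average weights each class by its support. The snippet below recomputes both F1 averages from the report above (variable names are illustrative).

```python
# (f1-score, support) per class, copied from the classification report above
per_class = {
    "Negative": (0.7826, 470),
    "Neutral": (0.3656, 228),
    "Positive": (0.9188, 1606),
}

total_support = sum(support for _, support in per_class.values())
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(f"macro F1:    {macro_f1:.4f}")     # 0.6890
print(f"weighted F1: {weighted_f1:.4f}")  # 0.8363
```

The gap between the two shows how strongly the weighted figure is driven by the large Positive class; the macro average is the more telling number for the weak Neutral class.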



## LSTM + Hashing Vectorizer + GridSearch CV

### 1. **Defining the Model**:
- A function `create_model()` is defined that builds and compiles an LSTM model with configurable hyperparameters (`units`, `dropout_rate`, and `optimizer`).
- The model consists of:
  - Two LSTM layers, each with a specified number of units.
  - Dropout layers to prevent overfitting.
  - A Dense layer with `tanh` activation.
  - A final Dense output layer with `softmax` activation for multi-class classification.

### 2. **Model Wrapping**:
- The LSTM model is wrapped with `KerasClassifier` (from `scikeras`) to make it compatible with `GridSearchCV`. This wrapper allows the custom LSTM model to behave like a scikit-learn classifier, enabling hyperparameter tuning.

### 3. **Hyperparameter Grid**:
- The `param_grid` dictionary specifies the hyperparameters to be tuned and their possible values:
  - `'model__units'`: Number of units in the LSTM layers (e.g., [32, 64]).
  - `'model__dropout_rate'`: Dropout rate to apply after LSTM and Dense layers (e.g., [0.3, 0.5]).
  - `'optimizer'`: Optimization algorithm for training the model (e.g., ['adam', 'rmsprop']).
  - `'epochs'`: Number of training epochs (e.g., [5, 10]).

### 4. **Grid Search Setup**:
- `GridSearchCV` is initialized with:
  - `estimator=model`: The wrapped LSTM model.
  - `param_grid=param_grid`: The defined grid of hyperparameters.
  - `cv=3`: Specifies 3-fold cross-validation, meaning the training data is split into 3 parts, and the model is trained and validated 3 times, each with a different fold as the validation set.
- This means for each combination of hyperparameters, the model is trained and evaluated three times, and the average performance score across the folds is recorded.
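
To get a feel for the cost of this search, the number of model fits can be counted by enumerating the grid. The sketch below uses the same values as `param_grid` in the code cell of this section and assumes scikit-learn's default `refit=True`, which adds one final fit on the full training set.

```python
from itertools import product

# Same grid values as in the notebook's param_grid
param_grid = {
    "model__units": [32, 64],
    "model__dropout_rate": [0.3, 0.5],
    "optimizer": ["adam", "rmsprop"],
    "epochs": [5, 10],
}
cv_folds = 3

# Every combination of one value per hyperparameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
cv_fits = len(combinations) * cv_folds

print(len(combinations))  # 16
print(cv_fits + 1)        # 49 fits including the final refit
print(combinations[0])
```

Each added hyperparameter value multiplies the number of fits, so the grid is best kept small for models that are slow to train.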

### 5. **Performing Grid Search**:
- `grid.fit(X_train, y_train)` performs the grid search. For each hyperparameter combination, the model is:
  - Trained on the training set (with cross-validation applied).
  - Evaluated using cross-validation to find the average accuracy for that combination.
- The process continues until all combinations in `param_grid` are tested.

### 6. **Output**:
- `grid_result.best_params_` displays the combination of hyperparameters that achieved the best average cross-validation score.
- `grid_result.best_score_` shows the highest cross-validation accuracy achieved.
- The best model (`best_model = grid_result.best_estimator_`) is used to make predictions on the test set (`X_test`), and a classification report is printed to show performance metrics such as precision, recall, and F1-score.





Using GridSearchCV with the LSTM model and Hashing Vectorization can improve performance because it tunes the model's hyperparameters systematically. GridSearchCV performs an exhaustive search over a specified parameter grid, testing every combination of hyperparameters such as the number of LSTM units, dropout rate, and optimizer type, and selects the configuration with the best average cross-validation score. Compared with default or manually chosen hyperparameters, the selected configuration is more likely to generalize well to the test set, although an improvement is not guaranteed.

In [8]:
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Embedding
from scikeras.wrappers import KerasClassifier  # Updated import
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import classification_report
import tensorflow as tf
import random

tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Load the dataset
data = pd.read_csv('final_df.csv')

# Preprocess text and labels
texts = data['processed_full_review'].astype(str)
labels = data['sentiment']

# Encode labels
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

# Use Hashing Vectorizer
vectorizer = HashingVectorizer(n_features=5000, alternate_sign=False)  # Set n_features as needed
X = vectorizer.transform(texts).toarray()
X = np.reshape(X, (X.shape[0], 1, X.shape[1]))  # Reshape for LSTM

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels_encoded, test_size=0.2, random_state=42)

# Define a function to create the model (for use in KerasClassifier)
def create_model(units=64, dropout_rate=0.5, optimizer='adam'):
    model = Sequential()
    model.add(LSTM(units, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(units, return_sequences=False))
    model.add(Dense(units // 2, activation='tanh'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(3, activation='softmax'))  # 3 classes for Positive, Negative, Neutral
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# Wrap the model using KerasClassifier from scikeras
model = KerasClassifier(model=create_model, verbose=1, batch_size=128)  # scikeras syntax

# Define the grid of hyperparameters
param_grid = {
    'model__units': [32, 64],
    'model__dropout_rate': [0.3, 0.5],
    'optimizer': ['adam', 'rmsprop'],
    'epochs': [5, 10],  # Reduced for demo; increase as needed
}

# Set up GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)

# Perform grid search
grid_result = grid.fit(X_train, y_train)

# Display the best parameters and accuracy
print("Best Parameters:", grid_result.best_params_)
print("Best Score:", grid_result.best_score_)

# Evaluate the best model on the test set
best_model = grid_result.best_estimator_
y_pred = best_model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_, digits=4))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
