# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dataset has the following columns:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Additional web scraping of online reviews

During our EDA, we noticed two main trends in the distribution of our dataset:
1. Fewer than 10% of the reviews were published between 2022 and 2024, making it hard to capture recent trends in sentiment.
2. Most of the reviews were highly positive. This may simply reflect that SIA receives mostly positive reviews, but we wanted more negative reviews to improve the robustness of our model.

### TripAdvisor

We scraped additional Singapore Airlines reviews from [TripAdvisor](https://www.tripadvisor.com.sg/Airline_Review-d8729151-Reviews-Singapore-Airlines), specifically for the years 2022 to 2024.

The scraped data has the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 5)


### Skytrax

We also scraped reviews from [Skytrax](https://www.airlinequality.com/airline-reviews/singapore-airlines/?sortby=post_date%3ADesc&pagesize=100), another source of online airline reviews.

The scraped data has the following columns:
- **`Year`**: Year of review publication.
- **`Month`**: Month of review publication.
- **`Title`**: Title of review publication.
- **`Review Text`**: Main text content of review publication.
- **`Rating`**: Numerical rating provided by reviewer (Scale: 1 to 10)
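TripAdvisor rates on a 1 to 5 scale and Skytrax on a 1 to 10 scale, so the two sources need a common scale before they can be combined. The sketch below shows one possible linear rescaling. The function name `rescale_rating` and the choice of a linear map are assumptions made for illustration, not necessarily the method used to build the final dataset.

```python
def rescale_rating(rating, src_min=1, src_max=10, dst_min=1, dst_max=5):
    """Linearly map a rating from [src_min, src_max] onto [dst_min, dst_max].

    Hypothetical helper: the linear mapping is an assumption for illustration.
    """
    if not src_min <= rating <= src_max:
        raise ValueError(f"rating {rating} outside [{src_min}, {src_max}]")
    fraction = (rating - src_min) / (src_max - src_min)
    return dst_min + fraction * (dst_max - dst_min)


# Example Skytrax-style ratings on the 1 to 10 scale
skytrax_ratings = [1, 4, 7, 10]
rescaled = [round(rescale_rating(r), 2) for r in skytrax_ratings]
print(rescaled)  # [1.0, 2.33, 3.67, 5.0]
```

A linear map keeps the endpoints aligned (1 stays 1, 10 becomes 5) but produces fractional values, so any later binning into sentiment classes would need explicit thresholds.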

## Importing Libraries

Please uncomment the code box below to pip install relevant dependencies for this notebook.

In [None]:
# !pip3 install -r requirements.txt



In [1]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime 

# Statistical functions
from scipy.stats import zscore

# Text Preprocessing and NLP
import nltk
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

# WordNet lexical database (used by the lemmatizer)
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer


# For generating n-grams
from nltk.util import ngrams
from collections import Counter

## Data Preparation (Loading CSV)

Load the merged review data from `final_df.csv` into a pandas DataFrame `data`.

In [2]:
data = pd.read_csv('final_df.csv')

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11518 entries, 0 to 11517
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   year                   11518 non-null  int64 
 1   month                  11518 non-null  int64 
 2   sentiment              11518 non-null  object
 3   processed_full_review  11518 non-null  object
dtypes: int64(2), object(2)
memory usage: 360.1+ KB


In [4]:
data['sentiment'].value_counts()

sentiment
Positive    7913
Negative    2441
Neutral     1164
Name: count, dtype: int64

In [5]:
data['year'].value_counts()

year
2019    5129
2018    2596
2022    1184
2023    1111
2020     888
2024     514
2021      96
Name: count, dtype: int64
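As a quick sanity check, the counts printed above can be converted into shares to see how much of the merged data is recent and how imbalanced the sentiment classes are. The snippet below is a minimal sketch that hard-codes those printed counts rather than reading `final_df.csv`; the variable names are illustrative.

```python
# Counts copied from the value_counts() outputs above
year_counts = {2019: 5129, 2018: 2596, 2022: 1184, 2023: 1111,
               2020: 888, 2024: 514, 2021: 96}
sentiment_counts = {"Positive": 7913, "Negative": 2441, "Neutral": 1164}

total = sum(year_counts.values())

# Share of reviews published in 2022-2024
recent = sum(n for year, n in year_counts.items() if 2022 <= year <= 2024)
recent_share = recent / total
print(f"{recent} of {total} reviews ({recent_share:.1%}) are from 2022-2024")

# Share of each sentiment class
sentiment_share = {label: n / total for label, n in sentiment_counts.items()}
for label, share in sentiment_share.items():
    print(f"{label}: {share:.1%}")
```

With these counts, roughly 24% of the merged reviews fall in 2022 to 2024, and about 69% of all reviews are Positive, which is the imbalance the later sections try to address.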

## Simple Neural Network

A Simple Neural Network, or fully connected neural network (FCNN), is a basic deep learning model ideal for straightforward classification tasks. It consists mainly of fully connected layers that process flattened data inputs, making it versatile for many types of data, including text.

Below is an explanation of how a simple NN works:

1. Embedding layer (for text data):
    - For text inputs, an embedding layer maps each word index to a dense numerical vector that is learned during training to capture meaning.

2. Flattening:
    - The sequence of embeddings is flattened into a single long vector so that the dense layers can process it as one input.

3. Dense (fully connected) layers:
    - Dense layers are the core of an FCNN. Each neuron connects to every neuron in the previous layer, which lets the network learn complex relationships.
    - Activation functions such as ReLU introduce non-linearity, helping the network capture more intricate patterns.

4. Output layer:
    - The final layer outputs class probabilities using a softmax activation (for multi-class classification) or a sigmoid (for binary classification).
    - These probabilities give the predicted likelihood of each class for an input.

5. Training:
    - During training, backpropagation computes the gradient of the loss with respect to each weight, and the optimizer adjusts the weights to reduce prediction error.
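To make steps 1 to 4 concrete, here is a minimal NumPy sketch of a single forward pass through such a network. It uses toy sizes and random, untrained weights (all assumed for illustration), so it shows only the shapes and operations, not the Keras model trained below.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy sizes, assumed for illustration only
vocab_size, embed_dim, seq_len, hidden_units, n_classes = 50, 4, 6, 8, 3

# Random parameters standing in for trained weights
embedding = rng.normal(size=(vocab_size, embed_dim))
W1 = rng.normal(size=(seq_len * embed_dim, hidden_units))
b1 = np.zeros(hidden_units)
W2 = rng.normal(size=(hidden_units, n_classes))
b2 = np.zeros(n_classes)


def forward(token_ids):
    """Embedding lookup -> flatten -> dense + ReLU -> softmax output."""
    embedded = embedding[token_ids]            # shape (seq_len, embed_dim)
    flat = embedded.reshape(-1)                # shape (seq_len * embed_dim,)
    hidden = np.maximum(0.0, flat @ W1 + b1)   # ReLU activation
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()


token_ids = np.array([0, 0, 12, 7, 33, 5])     # a padded toy review
probs = forward(token_ids)
print(probs.shape, probs.sum())
```

In the model below, the same flattening turns a 300-token review with 128-dimensional embeddings into a vector of 300 × 128 = 38,400 values, so the first dense layer alone has 38,400 × 64 + 64 = 2,457,664 parameters.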


In [None]:
import numpy as np
import tensorflow as tf
import random
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Assuming 'data' is your DataFrame with 'processed_full_review' and 'sentiment' columns

# Step 1: Tokenization and Padding
max_words = 10000  # Maximum vocabulary size
max_sequence_length = 300  # Maximum length of sequences

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data['processed_full_review'])
sequences = tokenizer.texts_to_sequences(data['processed_full_review'])

# Pad sequences to ensure uniform length
X = pad_sequences(sequences, maxlen=max_sequence_length)

# One-hot encode the sentiment labels
onehot_encoder = OneHotEncoder(sparse_output=False)
y = onehot_encoder.fit_transform(data[['sentiment']])

# Define the Simple Neural Network Model with L2 Regularization
def create_simple_nn():
    model = Sequential()
    
    # Embedding layer
    model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_sequence_length))
    
    # Flatten the embeddings to feed into dense layers
    model.add(Flatten())
    
    # Fully connected layers with L2 regularization
    model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))  # Dropout for regularization
    
    model.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))
    
    # Output layer for three-class classification using softmax
    model.add(Dense(3, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Early Stopping Callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Stratified 5-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

y_labels = np.argmax(y, axis=1)  # Convert one-hot to single class labels for stratification

for fold, (train_index, test_index) in enumerate(skf.split(X, y_labels)):
    print(f"\nTraining fold {fold + 1}...\n")
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Initialize and train the model
    model = create_simple_nn()
    model.fit(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_test, y_test), 
              callbacks=[early_stopping], verbose=1)
    
    # Evaluate the model
    y_pred = np.argmax(model.predict(X_test), axis=1)
    y_true = np.argmax(y_test, axis=1)
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    accuracy_scores.append(accuracy)
    
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    precision_scores.append(report["weighted avg"]["precision"])
    recall_scores.append(report["weighted avg"]["recall"])
    f1_scores.append(report["weighted avg"]["f1-score"])
    
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")
    print(f"Fold {fold + 1} Classification Report:\n", classification_report(y_true, y_pred, digits=4, zero_division=0))

# Print average scores across all folds
print("\nAverage Metrics across folds:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.4f}")
print(f"Average Precision: {np.mean(precision_scores):.4f}")
print(f"Average Recall: {np.mean(recall_scores):.4f}")
print(f"Average F1 Score: {np.mean(f1_scores):.4f}")


Training fold 1...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Fold 1 Accuracy: 0.8520
Fold 1 Classification Report:
               precision    recall  f1-score   support

           0     0.7133    0.8770    0.7868       488
           1     0.4000    0.0086    0.0168       233
           2     0.9023    0.9684    0.9342      1583

    accuracy                         0.8520      2304
   macro avg     0.6719    0.6180    0.5793      2304
weighted avg     0.8115    0.8520    0.8102      2304


Training fold 2...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Fold 2 Accuracy: 0.8555
Fold 2 Classification Report:
               precision    recall  f1-score   support

           0     0.6778    0.9180    0.7798       488
           1     0.0000    0.0000    0.0000       233
           2     0.9270    0.9621    0.9442      1583

    accuracy                         0.8555      2304
[output truncated]

## Simple Neural Network Accounting for Imbalanced Classes

1. Class weights calculation: `compute_class_weight` with `class_weight='balanced'` assigns each class a weight inversely proportional to its frequency, so underrepresented classes receive higher weights. In the code below, the weights are computed once from the labels of the full dataset, before the folds are split.

2. Passing `class_weight` in `fit`: Adding `class_weight=class_weights_dict` to `model.fit` scales each sample's contribution to the loss by the weight of its class, making the model more sensitive to the underrepresented classes.
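As a worked example, the `'balanced'` heuristic sets each weight to `n_samples / (n_classes * class_count)`. The sketch below applies that formula in plain Python to the sentiment counts printed earlier; it reproduces the arithmetic only and does not call scikit-learn.

```python
# Label counts from data['sentiment'].value_counts() above
counts = {"Negative": 2441, "Neutral": 1164, "Positive": 7913}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
class_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.4f}")
# Negative: 1.5729
# Neutral: 3.2984
# Positive: 0.4852
```

Because `OneHotEncoder` orders categories alphabetically, class indices 0, 1 and 2 in the classification reports correspond to Negative, Neutral and Positive. A misclassified Neutral review therefore costs roughly seven times as much as a misclassified Positive one (3.2984 / 0.4852 ≈ 6.8).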

In [12]:
import numpy as np
import tensorflow as tf
import random
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.class_weight import compute_class_weight

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Assuming 'data' is your DataFrame with 'processed_full_review' and 'sentiment' columns

# Step 1: Tokenization and Padding
max_words = 10000  # Maximum vocabulary size
max_sequence_length = 300  # Maximum length of sequences

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data['processed_full_review'])
sequences = tokenizer.texts_to_sequences(data['processed_full_review'])

# Pad sequences to ensure uniform length
X = pad_sequences(sequences, maxlen=max_sequence_length)

# One-hot encode the sentiment labels
onehot_encoder = OneHotEncoder(sparse_output=False)
y = onehot_encoder.fit_transform(data[['sentiment']])

# Convert one-hot encoded labels to single class labels for class weight calculation
y_labels = np.argmax(y, axis=1)

# Calculate class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_labels), y=y_labels)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

# Define the Simple Neural Network Model with L2 Regularization
def create_simple_nn():
    model = Sequential()
    
    # Embedding layer
    model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_sequence_length))
    
    # Flatten the embeddings to feed into dense layers
    model.add(Flatten())
    
    # Fully connected layers with L2 regularization
    model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))  # Dropout for regularization
    
    model.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))
    
    # Output layer for three-class classification using softmax
    model.add(Dense(3, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Early Stopping Callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Stratified 5-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for fold, (train_index, test_index) in enumerate(skf.split(X, y_labels)):
    print(f"\nTraining fold {fold + 1}...\n")
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Initialize and train the model with class weights
    model = create_simple_nn()
    model.fit(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_test, y_test), 
              callbacks=[early_stopping], class_weight=class_weights_dict, verbose=1)
    
    # Evaluate the model
    y_pred = np.argmax(model.predict(X_test), axis=1)
    y_true = np.argmax(y_test, axis=1)
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    accuracy_scores.append(accuracy)
    
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    precision_scores.append(report["weighted avg"]["precision"])
    recall_scores.append(report["weighted avg"]["recall"])
    f1_scores.append(report["weighted avg"]["f1-score"])
    
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")
    print(f"Fold {fold + 1} Classification Report:\n", classification_report(y_true, y_pred, digits=4, zero_division=0))

# Print average scores across all folds
print("\nAverage Metrics across folds:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.4f}")
print(f"Average Precision: {np.mean(precision_scores):.4f}")
print(f"Average Recall: {np.mean(recall_scores):.4f}")
print(f"Average F1 Score: {np.mean(f1_scores):.4f}")



Training fold 1...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Fold 1 Accuracy: 0.8351
Fold 1 Classification Report:
               precision    recall  f1-score   support

           0     0.8590    0.6865    0.7631       488
           1     0.3663    0.5880    0.4514       233
           2     0.9429    0.9172    0.9299      1583

    accuracy                         0.8351      2304
   macro avg     0.7227    0.7306    0.7148      2304
weighted avg     0.8668    0.8351    0.8462      2304


Training fold 2...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Fold 2 Accuracy: 0.8312
Fold 2 Classification Report:
               precision    recall  f1-score   support

           0     0.8329    0.6639    0.7389       488
           1     0.3410    0.5107    0.4089       233
           2     0.9400    0.9299    0.9349      1583

    accuracy                         0.8312      2304
[output truncated]

## Simple Neural Network + CountVectorizer

In [None]:
import numpy as np
import tensorflow as tf
import random
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.class_weight import compute_class_weight

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Step 1: Use CountVectorizer for Bag-of-Words Representation
max_features = 10000  # Maximum vocabulary size
vectorizer = CountVectorizer(max_features=max_features)
X = vectorizer.fit_transform(data['processed_full_review']).toarray()  # Convert to dense array

# One-hot encode the sentiment labels
onehot_encoder = OneHotEncoder(sparse_output=False)
y = onehot_encoder.fit_transform(data[['sentiment']])

# Convert one-hot encoded labels to single class labels for class weight calculation
y_labels = np.argmax(y, axis=1)

# Calculate class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_labels), y=y_labels)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

# Define Neural Network Model
def create_simple_nn():
    model = Sequential()
    
    # Fully connected layers with L2 regularization
    model.add(Dense(64, activation='relu', input_shape=(max_features,), kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))
    
    model.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))
    
    # Output layer for three-class classification using softmax
    model.add(Dense(3, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Early Stopping Callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Stratified 5-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for fold, (train_index, test_index) in enumerate(skf.split(X, y_labels)):
    print(f"\nTraining fold {fold + 1}...\n")
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Initialize and train the model with class weights
    model = create_simple_nn()
    model.fit(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_test, y_test), 
              callbacks=[early_stopping], class_weight=class_weights_dict, verbose=1)
    
    # Evaluate the model
    y_pred = np.argmax(model.predict(X_test), axis=1)
    y_true = np.argmax(y_test, axis=1)
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    accuracy_scores.append(accuracy)
    
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    precision_scores.append(report["weighted avg"]["precision"])
    recall_scores.append(report["weighted avg"]["recall"])
    f1_scores.append(report["weighted avg"]["f1-score"])
    
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")
    print(f"Fold {fold + 1} Classification Report:\n", classification_report(y_true, y_pred, digits=4, zero_division=0))

# Print average scores across all folds
print("\nAverage Metrics across folds:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.4f}")
print(f"Average Precision: {np.mean(precision_scores):.4f}")
print(f"Average Recall: {np.mean(recall_scores):.4f}")
print(f"Average F1 Score: {np.mean(f1_scores):.4f}")


Training fold 1...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Fold 1 Accuracy: 0.8481
Fold 1 Classification Report:
               precision    recall  f1-score   support

           0     0.8069    0.8135    0.8102       488
           1     0.4294    0.6137    0.5053       233
           2     0.9561    0.8932    0.9236      1583

    accuracy                         0.8481      2304
   macro avg     0.7308    0.7735    0.7464      2304
weighted avg     0.8712    0.8481    0.8573      2304


Training fold 2...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Fold 2 Accuracy: 0.8529
Fold 2 Classification Report:
               precision    recall  f1-score   support

           0     0.8422    0.7766    0.8081       488
           1     0.4082    0.6395    0.4983       233
           2     0.9651    0.9078    0.9355      1583

    accuracy                         0.8529      2304
[output truncated]

## Simple Neural Network + Word2Vec

1. Train custom Word2Vec embeddings on the reviews and use them to initialize the embedding layer.

2. Define a simple neural network that flattens the embeddings and feeds them into dense layers.

3. Pass `class_weight` to `model.fit` to handle the imbalanced classes.
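Step 1 relies on an embedding matrix whose row `i` holds the vector for the word with index `i`. The toy sketch below illustrates that layout with a hand-written dictionary of vectors (hypothetical values standing in for a trained gensim model) and a small `encode` helper, also hypothetical, that mimics the default pre-padding and pre-truncation of `pad_sequences`.

```python
import numpy as np

# Toy stand-in for trained word vectors (hypothetical values);
# in the notebook these come from a gensim Word2Vec model.
word_vectors = {
    "flight": np.array([0.1, 0.3]),
    "crew": np.array([0.5, -0.2]),
    "delay": np.array([-0.4, 0.9]),
}
embedding_dim = 2

# Index 0 is reserved for padding and out-of-vocabulary words
word_index = {word: i for i, word in enumerate(word_vectors, start=1)}

# Row i of the matrix holds the vector of the word with index i; row 0 stays zero
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_matrix[i] = word_vectors[word]


def encode(review, max_len=5):
    """Map words to indices, keep the last max_len tokens, and left-pad with 0."""
    ids = [word_index.get(word, 0) for word in review.split()][-max_len:]
    return [0] * (max_len - len(ids)) + ids


print(encode("crew handled delay well"))  # [0, 2, 0, 3, 0]
```

Note that unknown words and padding share index 0 in this layout, so the network cannot tell them apart; a separate out-of-vocabulary index would be one way to distinguish them.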



In [13]:
import numpy as np
import tensorflow as tf
import random
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.class_weight import compute_class_weight

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Step 1: Train Word2Vec Model
tokenized_reviews = [review.split() for review in data['processed_full_review']]
word2vec_model = Word2Vec(sentences=tokenized_reviews, vector_size=128, window=5, min_count=1, sg=1, workers=4, seed=42)

# Step 2: Prepare Embedding Matrix
max_words = 10000
embedding_dim = 128

# Map Word2Vec vocabulary words to integer indices 1..max_words-1
# (index 0 is reserved for padding and out-of-vocabulary words)
word_index = {word: i for i, word in enumerate(word2vec_model.wv.index_to_key, start=1) if i < max_words}
embedding_matrix = np.zeros((max_words, embedding_dim))

for word, i in word_index.items():
    if i < max_words and word in word2vec_model.wv:
        embedding_matrix[i] = word2vec_model.wv[word]

# Convert the texts to sequences based on the Word2Vec vocabulary
sequences = [[word_index.get(word, 0) for word in review.split()] for review in data['processed_full_review']]
X = pad_sequences(sequences, maxlen=300)

# One-hot encode the sentiment labels
onehot_encoder = OneHotEncoder(sparse_output=False)
y = onehot_encoder.fit_transform(data[['sentiment']])

# Convert one-hot encoded labels to single class labels for class weight calculation
y_labels = np.argmax(y, axis=1)

# Calculate class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_labels), y=y_labels)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

# Define Neural Network Model with Word2Vec Embeddings
def create_simple_nn():
    model = Sequential()
    model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, weights=[embedding_matrix],
                        input_length=300, trainable=True)) 
    model.add(Flatten())
    
    # Fully connected layers with L2 regularization
    model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))
    
    model.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))
    
    # Output layer for three-class classification using softmax
    model.add(Dense(3, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Early Stopping Callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Stratified 5-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for fold, (train_index, test_index) in enumerate(skf.split(X, y_labels)):
    print(f"\nTraining fold {fold + 1}...\n")
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Initialize and train the model with class weights
    model = create_simple_nn()
    model.fit(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_test, y_test), 
              callbacks=[early_stopping], class_weight=class_weights_dict, verbose=1)
    
    # Evaluate the model
    y_pred = np.argmax(model.predict(X_test), axis=1)
    y_true = np.argmax(y_test, axis=1)
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    accuracy_scores.append(accuracy)
    
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    precision_scores.append(report["weighted avg"]["precision"])
    recall_scores.append(report["weighted avg"]["recall"])
    f1_scores.append(report["weighted avg"]["f1-score"])
    
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")
    print(f"Fold {fold + 1} Classification Report:\n", classification_report(y_true, y_pred, digits=4, zero_division=0))

# Print average scores across all folds
print("\nAverage Metrics across folds:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.4f}")
print(f"Average Precision: {np.mean(precision_scores):.4f}")
print(f"Average Recall: {np.mean(recall_scores):.4f}")
print(f"Average F1 Score: {np.mean(f1_scores):.4f}")



Training fold 1...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Fold 1 Accuracy: 0.8299
Fold 1 Classification Report:
               precision    recall  f1-score   support

           0     0.7829    0.8053    0.7939       488
           1     0.3657    0.5494    0.4391       233
           2     0.9580    0.8787    0.9166      1583

    accuracy                         0.8299      2304
   macro avg     0.7022    0.7445    0.7166      2304
weighted avg     0.8610    0.8299    0.8424      2304


Training fold 2...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Fold 2 Accuracy: 0.8377
Fold 2 Classification Report:
               precision    recall  f1-score   support

           0     0.8233    0.7254    0.7712       488
           1     0.3777    0.5236    0.4388       233
           2     0.9375    0.9185    0.9279      1583

    accuracy                         0.8377      2304
[output truncated]

## Simple Neural Network + FastText

In [15]:
import numpy as np
import tensorflow as tf
import random
from gensim.models import FastText
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.class_weight import compute_class_weight

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

# Step 1: Train FastText Model
tokenized_reviews = [review.split() for review in data['processed_full_review']]
fasttext_model = FastText(sentences=tokenized_reviews, vector_size=128, window=5, min_count=1, sg=1, workers=4, seed=42)

# Step 2: Prepare Embedding Matrix
max_words = 10000
embedding_dim = 128

# Map FastText vocabulary words to integer indices 1..max_words-1
# (index 0 is reserved for padding and out-of-vocabulary words)
word_index = {word: i for i, word in enumerate(fasttext_model.wv.index_to_key, start=1) if i < max_words}
embedding_matrix = np.zeros((max_words, embedding_dim))

for word, i in word_index.items():
    if i < max_words and word in fasttext_model.wv:
        embedding_matrix[i] = fasttext_model.wv[word]

# Convert the texts to sequences based on the FastText vocabulary
sequences = [[word_index.get(word, 0) for word in review.split()] for review in data['processed_full_review']]
X = pad_sequences(sequences, maxlen=300)

# One-hot encode the sentiment labels
onehot_encoder = OneHotEncoder(sparse_output=False)
y = onehot_encoder.fit_transform(data[['sentiment']])

# Convert one-hot encoded labels to single class labels for class weight calculation
y_labels = np.argmax(y, axis=1)

# Calculate class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_labels), y=y_labels)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

# Define Neural Network Model with FastText Embeddings
def create_simple_nn():
    model = Sequential()
    model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, weights=[embedding_matrix],
                        input_length=300, trainable=True)) 
    model.add(Flatten())
    
    # Fully connected layers with L2 regularization
    model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))
    
    model.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dropout(0.5))
    
    # Output layer for three-class classification using softmax
    model.add(Dense(3, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Early Stopping Callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Stratified 5-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for fold, (train_index, test_index) in enumerate(skf.split(X, y_labels)):
    print(f"\nTraining fold {fold + 1}...\n")
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Initialize and train the model with class weights
    model = create_simple_nn()
    model.fit(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_test, y_test), 
              callbacks=[early_stopping], class_weight=class_weights_dict, verbose=1)
    
    # Evaluate the model
    y_pred = np.argmax(model.predict(X_test), axis=1)
    y_true = np.argmax(y_test, axis=1)
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    accuracy_scores.append(accuracy)
    
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    precision_scores.append(report["weighted avg"]["precision"])
    recall_scores.append(report["weighted avg"]["recall"])
    f1_scores.append(report["weighted avg"]["f1-score"])
    
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")
    print(f"Fold {fold + 1} Classification Report:\n", classification_report(y_true, y_pred, digits=4, zero_division=0))

# Print average scores across all folds
print("\nAverage Metrics across folds:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.4f}")
print(f"Average Precision: {np.mean(precision_scores):.4f}")
print(f"Average Recall: {np.mean(recall_scores):.4f}")
print(f"Average F1 Score: {np.mean(f1_scores):.4f}")



Training fold 1...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Fold 1 Accuracy: 0.8173
Fold 1 Classification Report:
               precision    recall  f1-score   support

           0     0.8595    0.6516    0.7413       488
           1     0.3440    0.6009    0.4375       233
           2     0.9332    0.9002    0.9164      1583

    accuracy                         0.8173      2304
   macro avg     0.7122    0.7176    0.6984      2304
weighted avg     0.8580    0.8173    0.8309      2304


Training fold 2...

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Fold 2 Accuracy: 0.8368
Fold 2 Classification Report:
               precision    recall  f1-score   support

           0     0.8288    0.7541    0.7897       488
           1     0.3704    0.6009    0.4583       233
           2     0.9582    0.8970    0.9266      1583

    accuracy                         0.8368      2