# Enhancing Singapore Airlines' Service Through Automated Sentiment Analysis of Customer Reviews



**Motivation**



## Singapore Airlines Customer Reviews Dataset Information

The [Singapore Airlines Customer Reviews Dataset](https://www.kaggle.com/datasets/kanchana1990/singapore-airlines-reviews) aggregates 10,000 anonymized customer reviews, providing a broad perspective on the passenger experience with Singapore Airlines. 

The dimensions are shown below:
- **`published_date`**: Date and time of review publication.
- **`published_platform`**: Platform where the review was posted.
- **`rating`**: Customer satisfaction rating, from 1 (lowest) to 5 (highest).
- **`type`**: Specifies the content as a review.
- **`text`**: Detailed customer feedback.
- **`title`**: Summary of the review.
- **`helpful_votes`**: Number of users finding the review helpful.

## Importing Libraries

Please uncomment the code box below to pip install relevant dependencies for this notebook.

In [1]:
# !pip3 install -r requirements.txt

In [4]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np

# Statistical functions
from scipy.stats import zscore

# For concurrency (running functions in parallel)
from concurrent.futures import ThreadPoolExecutor

# For caching (to speed up repeated function calls)
from functools import lru_cache

# For progress tracking
from tqdm import tqdm

# Plotting and Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Language Detection packages
# `langdetect` for detecting language
from langdetect import detect as langdetect_detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
# `langid` for an alternative language detection method
from langid import classify as langid_classify

# Text Preprocessing and NLP
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords

# Tokenizing sentences/words
from nltk.corpus import wordnet

# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer
import nltk
# Regular expressions for text pattern matching
import re

# Word Cloud generation
from wordcloud import WordCloud

# For generating n-grams
from nltk.util import ngrams
from collections import Counter

# Libraries for Word2Vec and Logistic Regression
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Data Preparation (Loading CSV)

Load the `singapore_airline_reviews.csv` file into a pandas DataFrame `data`.

In [3]:
data = pd.read_csv("singapore_airlines_reviews.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   published_date      10000 non-null  object
 1   published_platform  10000 non-null  object
 2   rating              10000 non-null  int64 
 3   type                10000 non-null  object
 4   text                10000 non-null  object
 5   title               10000 non-null  object
 6   helpful_votes       10000 non-null  int64 
dtypes: int64(2), object(5)
memory usage: 547.0+ KB


In [4]:
data.head()

Unnamed: 0,published_date,published_platform,rating,type,text,title,helpful_votes
0,2024-03-12T14:41:14-04:00,Desktop,3,review,We used this airline to go from Singapore to L...,Ok,0
1,2024-03-11T19:39:13-04:00,Desktop,5,review,The service on Singapore Airlines Suites Class...,The service in Suites Class makes one feel lik...,0
2,2024-03-11T12:20:23-04:00,Desktop,1,review,"Booked, paid and received email confirmation f...",Don’t give them your money,0
3,2024-03-11T07:12:27-04:00,Desktop,5,review,"Best airline in the world, seats, food, servic...",Best Airline in the World,0
4,2024-03-10T05:34:18-04:00,Desktop,2,review,Premium Economy Seating on Singapore Airlines ...,Premium Economy Seating on Singapore Airlines ...,0


# Data Cleaning

## Remove Duplicate Rows

- Drop duplicate rows from the dataframe (`data`) and reset the index.

In [5]:
data = data.drop_duplicates().reset_index(drop=True)

# Display the new dataframe shape
print("The new shape is: ", data.shape)

# Make sure no more duplicates are present
print("Remaining duplicate rows:", data.duplicated().sum())

The new shape is:  (10000, 7)
Remaining duplicate rows: 0


## Check for Null Values

- Here we check which features have null values using the `isnull()` function.

In [6]:
# In this case only `title` feature has one null value, will fill it with empty string " "
data.isnull().sum()

published_date        0
published_platform    0
rating                0
type                  0
text                  0
title                 0
helpful_votes         0
dtype: int64

In [7]:
# Fill missing values with empty string
data = data.fillna("")

In [8]:
# Verify that there are no missing values
data.isnull().sum()

published_date        0
published_platform    0
rating                0
type                  0
text                  0
title                 0
helpful_votes         0
dtype: int64

## Convert data types

Since the column `published_date` is in data type (`str`), we will
- Convert `published_date` to a standard timezone (UTC) format as a new column `date`.
- Drop the original `published_date` column after conversion and reset the index.

In [9]:
# Set `utc=True` to convert the date to common timezone (UTC)
data["date"] = pd.to_datetime(data["published_date"], utc=True)
print(data["date"].dtype)

datetime64[ns, UTC]


In [10]:
# Drop `published_date` column and reset the index
data = data.drop(columns=["published_date"]).reset_index(drop=True)
data.head()

Unnamed: 0,published_platform,rating,type,text,title,helpful_votes,date
0,Desktop,3,review,We used this airline to go from Singapore to L...,Ok,0,2024-03-12 18:41:14+00:00
1,Desktop,5,review,The service on Singapore Airlines Suites Class...,The service in Suites Class makes one feel lik...,0,2024-03-11 23:39:13+00:00
2,Desktop,1,review,"Booked, paid and received email confirmation f...",Don’t give them your money,0,2024-03-11 16:20:23+00:00
3,Desktop,5,review,"Best airline in the world, seats, food, servic...",Best Airline in the World,0,2024-03-11 11:12:27+00:00
4,Desktop,2,review,Premium Economy Seating on Singapore Airlines ...,Premium Economy Seating on Singapore Airlines ...,0,2024-03-10 09:34:18+00:00


## Remove Outliers

### `text`

The `text` column of `data`, which is of string (`str`) type, may contain values with unusually long lengths, indicating the presence of outliers. We will identify the outliers using [Z-score method].

1. Create a new column `text_length` in the DataFrame `data` by calculating the length of each review. (Set the value as 0 if the correponding `text` column has NaN values.)

2. Check the statistics of `text_length` using `describe()` method.

3. Calculate the mean and standard deviation of the `text_length` column.

4. Set the Z-score threshold for identifying outliers to 3.

5. Identify outliers of the `text_length` column and set the corresponding `text` to np.nan.

6. Drop the `text_length` column from the DataFrame.

In [11]:
data['text_length'] = data['text'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["text_length"]
stats_TL = TL.describe()
print(stats_TL)

  published_platform  rating    type  \
0            Desktop       3  review   
1            Desktop       5  review   
2            Desktop       1  review   

                                                text  \
0  We used this airline to go from Singapore to L...   
1  The service on Singapore Airlines Suites Class...   
2  Booked, paid and received email confirmation f...   

                                               title  helpful_votes  \
0                                                 Ok              0   
1  The service in Suites Class makes one feel lik...              0   
2                         Don’t give them your money              0   

                       date  text_length  
0 2024-03-12 18:41:14+00:00         1352  
1 2024-03-11 23:39:13+00:00         4666  
2 2024-03-11 16:20:23+00:00          420  
count    10000.000000
mean       556.526800
std        640.290638
min        100.000000
25%        228.000000
50%        380.000000
75%        665.000000
max

In [12]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Remove 'text' of lengths that are greater than 3 standard deviations above the mean
data.loc[abs(z_score) > threshold, 'text'] = np.nan
# print(data.head(3))

data = data.drop("text_length", axis=1)

data.head()
data.shape

(10000, 7)

### `title`

Similarly, the `title` column of `data` (of type `str`) may also contain values with unusually long lengths, indicating the presence of outliers.

1. Create a new column `title_length` in the DataFrame `data` by calculating the length of each price value. (Set the value as 0 if the correponding `title` column has NaN values.)

2. Check the statistics of `title_length` using `describe()` method and display its unique values.

3. Identify the outlier values by inspecting the content in `title` corresponding to the abnormal value in `title_length` and set the corresponding value of `title` to np.nan.

4. Drop the `title_length` column from the DataFrame.

In [13]:
data['title_length'] = data['title'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["title_length"]
stats_TL = TL.describe()
print(stats_TL)

  published_platform  rating    type  \
0            Desktop       3  review   
1            Desktop       5  review   
2            Desktop       1  review   

                                                text  \
0  We used this airline to go from Singapore to L...   
1                                                NaN   
2  Booked, paid and received email confirmation f...   

                                               title  helpful_votes  \
0                                                 Ok              0   
1  The service in Suites Class makes one feel lik...              0   
2                         Don’t give them your money              0   

                       date  title_length  
0 2024-03-12 18:41:14+00:00             2  
1 2024-03-11 23:39:13+00:00            51  
2 2024-03-11 16:20:23+00:00            26  
count    10000.00000
mean        28.40990
std         17.27913
min          2.00000
25%         16.00000
50%         24.00000
75%         36.00000
max   

In [14]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Remove 'title' of lengths that are greater than 3 standard deviations above the mean
data.loc[abs(z_score) > threshold, 'title'] = np.nan
# print(data.head(3))

data = data.drop("title_length", axis=1)
data.head()

Unnamed: 0,published_platform,rating,type,text,title,helpful_votes,date
0,Desktop,3,review,We used this airline to go from Singapore to L...,Ok,0,2024-03-12 18:41:14+00:00
1,Desktop,5,review,,The service in Suites Class makes one feel lik...,0,2024-03-11 23:39:13+00:00
2,Desktop,1,review,"Booked, paid and received email confirmation f...",Don’t give them your money,0,2024-03-11 16:20:23+00:00
3,Desktop,5,review,"Best airline in the world, seats, food, servic...",Best Airline in the World,0,2024-03-11 11:12:27+00:00
4,Desktop,2,review,Premium Economy Seating on Singapore Airlines ...,Premium Economy Seating on Singapore Airlines ...,0,2024-03-10 09:34:18+00:00


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   published_platform  10000 non-null  object             
 1   rating              10000 non-null  int64              
 2   type                10000 non-null  object             
 3   text                9846 non-null   object             
 4   title               9834 non-null   object             
 5   helpful_votes       10000 non-null  int64              
 6   date                10000 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), int64(2), object(4)
memory usage: 547.0+ KB


In [16]:
#check data types of each column, make sure they are correct
print(data.dtypes)

# Make sure no more duplicates are present
print("Remaining duplicate rows:", data.duplicated().sum())

# Check for outliers in ratings
print("Unique ratings:", data['rating'].unique())

published_platform                 object
rating                              int64
type                               object
text                               object
title                              object
helpful_votes                       int64
date                  datetime64[ns, UTC]
dtype: object
Remaining duplicate rows: 0
Unique ratings: [3 5 1 2 4]


In [17]:
data.isnull().sum()

published_platform      0
rating                  0
type                    0
text                  154
title                 166
helpful_votes           0
date                    0
dtype: int64

# Feature Engineering

### Create new column `full_review`
Since there are some rows with empty `text` and `title`, we will concatenate both columns (`text` and `title`) to form a new column `full_review`.
1. Replace `NaN` values in `text` and `title` with an empty string.

2. Combine `text` and `title` into `full_review`.

3. Strip any leading/trailing whitespaces in `full_review`.

4. Drop `text` and `title` columns.

In [18]:
# 1) Fill NaN values in 'text' and 'title' with an empty string
data['title'] = data['title'].fillna('')
data['text'] = data['text'].fillna('')

# 2) Combine 'text' and 'title' into 'full_review'
data['full_review'] = data['text'] + " " + data['title']

# 3) Strip any leading/trailing whitespace
data['full_review'] = data['full_review'].str.strip()

# 4) Drop `text` and `title` columns
data = data.drop(columns = ['text', 'title'])

# Check if the 'full_review' column was added and if 'text' and 'title' columns has been dropped
print(data.head())
print("\nThe old shape is:",data.shape)

  published_platform  rating    type  helpful_votes                      date  \
0            Desktop       3  review              0 2024-03-12 18:41:14+00:00   
1            Desktop       5  review              0 2024-03-11 23:39:13+00:00   
2            Desktop       1  review              0 2024-03-11 16:20:23+00:00   
3            Desktop       5  review              0 2024-03-11 11:12:27+00:00   
4            Desktop       2  review              0 2024-03-10 09:34:18+00:00   

                                         full_review  
0  We used this airline to go from Singapore to L...  
1  The service in Suites Class makes one feel lik...  
2  Booked, paid and received email confirmation f...  
3  Best airline in the world, seats, food, servic...  
4  Premium Economy Seating on Singapore Airlines ...  

The old shape is: (10000, 6)


### Remove empty strings
1. Drop rows where `full_review` are empty strings and reset the index.

2. Check if there are no more null values in `data`.

In [19]:
# 1) Drop rows where `full_review` are empty strings and reset the index
data = data[data['full_review'] != ""].reset_index(drop=True)
print("The new shape is:",data.shape)

# 2) Check if there are no more null values in `data`
data.isnull().sum()

The new shape is: (9989, 6)


published_platform    0
rating                0
type                  0
helpful_votes         0
date                  0
full_review           0
dtype: int64

### Create new column `language`
In the case where there are rows where `full_review` are in different languages (e.g., French, Russian, etc.) other than English. We decided to use 2 different language detector libraries (`langdetect`, `langid`) on the `full_review` column and combined the predictions of all 2 libraries and selecting the most frequent predicted language.

**Reason**: `langdetect` might perform well on longer texts while `langid` is more reliable on short texts, using multiple detectors reduces the likelihood of misclassification and mitigates individual detector errors, leading to more accurate overall predictions. Also, even if one detector fails or throws an error, the other can still provide predictions, therefore improving the robustness of the language detection.

1. Set a seed for `langdetect` to ensure reproducibility.

2. Preprocess the text in `full_review`:
    - a\) Function to remove non-alphabetic characters and normalise whitespaces in  `full_review`.
    - b\) Function to determine if the text is non-language (e.g., numbers, symbols only).

3. Two functions for language detection:
    - a\) Using `langdetect`.
    - b\) Using `langid`.

4. Function for calculating majority vote for each language.

5. Function for parallel processing for efficiency.

6. Caching function for repeated inputs

7. Function for choosing language based on combined majority voting.

8. Applying the combined function on `full_review` column.

9. Display the resulting `data` DataFrame.

### <span style="color:red">The code below will take approximately 1 minute to run!</span>

In [20]:
# 1) Set a seed for langdetect to ensure reproducibility
DetectorFactory.seed = 0

# 2a) Simplified preprocessing: only remove non-alphabetic characters
def preprocess_text_simple(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    return text.strip()

# 2b) Check if the text is non-language (e.g., numbers, symbols only)
def is_non_language_text(text):
    if re.match(r'^[^a-zA-Z]*$', text):  # Check if text has no alphabetic characters
        return True
    return False

# 3a) Function to get langdetect prediction
def get_langdetect_prediction(text):
    try:
        # Directly use text without preprocessing for efficiency
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang = langdetect_detect(text)
        return lang
    except LangDetectException:
        return "unknown"

# 3b) Function to get langid prediction
def get_langid_prediction(text):
    try:
        lang, _ = langid_classify(text)
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        return lang
    except Exception:
        return "unknown"

# 4) Function to calculate majority vote for each language
def calculate_majority_vote(predictions):
    vote_counts = {}
    for lang in predictions:
        if lang in vote_counts:
            vote_counts[lang] += 1
        else:
            vote_counts[lang] = 1
    return vote_counts

# 5) Parallel processing for efficiency with limited workers
def parallel_detection(text):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(lambda func: func(text), 
                                    [get_langdetect_prediction, get_langid_prediction]))
    return results

# 6) Caching function for repeated inputs
@lru_cache(maxsize=500)
def get_cached_language(text):
    return combined_language_detection(text)

# 7) Combined majority voting language detection function
def combined_language_detection(text):
    # Check if the text is non-language (e.g., numbers, symbols only)
    if is_non_language_text(text):
        return "unknown"
    
    # Run the detectors in parallel for efficiency
    predictions = parallel_detection(text)
    
    # Calculate majority vote for each language based on predictions
    vote_counts = calculate_majority_vote(predictions)
    
    # Determine the language with the highest majority vote
    final_language = max(vote_counts, key=vote_counts.get)
    
    # If "unknown" is the most common or if all detectors fail, return "unknown"
    if final_language == "unknown" or vote_counts[final_language] <= 1:
        return "unknown"
    
    return final_language

# 8) Apply the cached function to each text in the DataFrame with a progress bar
data['language'] = [get_cached_language(text) for text in tqdm(data['full_review'], desc="Language Detection")]

# 9) Display the DataFrame with detected languages
data

Language Detection: 100%|██████████| 9989/9989 [08:02<00:00, 20.69it/s]  


Unnamed: 0,published_platform,rating,type,helpful_votes,date,full_review,language
0,Desktop,3,review,0,2024-03-12 18:41:14+00:00,We used this airline to go from Singapore to L...,en
1,Desktop,5,review,0,2024-03-11 23:39:13+00:00,The service in Suites Class makes one feel lik...,en
2,Desktop,1,review,0,2024-03-11 16:20:23+00:00,"Booked, paid and received email confirmation f...",en
3,Desktop,5,review,0,2024-03-11 11:12:27+00:00,"Best airline in the world, seats, food, servic...",en
4,Desktop,2,review,0,2024-03-10 09:34:18+00:00,Premium Economy Seating on Singapore Airlines ...,en
...,...,...,...,...,...,...,...
9984,Desktop,5,review,1,2018-08-06 07:48:21+00:00,First part done with Singapore Airlines - acce...,en
9985,Mobile,5,review,1,2018-08-06 02:50:29+00:00,And again a great Flight with Singapore Air. G...,en
9986,Desktop,5,review,1,2018-08-06 02:47:06+00:00,"We flew business class from Frankfurt, via Sin...",en
9987,Desktop,4,review,2,2018-08-06 00:32:03+00:00,"As always, the A380 aircraft was spotlessly pr...",en


In [21]:
# See distribution of languages
data["language"].value_counts()

en         9952
unknown      32
es            1
de            1
th            1
fr            1
sv            1
Name: language, dtype: int64

In [22]:
# Drop rows where language is NOT in english and reset the index
data = data[data['language'] == 'en'].reset_index(drop=True)
print(data.shape)

(9952, 7)


We will drop the `language` column since all values of `language` are `en` and all `full_review` are in the English language.

In [23]:
data.info()
data.drop(columns=["language"], inplace=True)
print("The new shape is:", data.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9952 entries, 0 to 9951
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   published_platform  9952 non-null   object             
 1   rating              9952 non-null   int64              
 2   type                9952 non-null   object             
 3   helpful_votes       9952 non-null   int64              
 4   date                9952 non-null   datetime64[ns, UTC]
 5   full_review         9952 non-null   object             
 6   language            9952 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(2), object(4)
memory usage: 544.4+ KB
The new shape is: (9952, 6)


In [24]:
data.head()

Unnamed: 0,published_platform,rating,type,helpful_votes,date,full_review
0,Desktop,3,review,0,2024-03-12 18:41:14+00:00,We used this airline to go from Singapore to L...
1,Desktop,5,review,0,2024-03-11 23:39:13+00:00,The service in Suites Class makes one feel lik...
2,Desktop,1,review,0,2024-03-11 16:20:23+00:00,"Booked, paid and received email confirmation f..."
3,Desktop,5,review,0,2024-03-11 11:12:27+00:00,"Best airline in the world, seats, food, servic..."
4,Desktop,2,review,0,2024-03-10 09:34:18+00:00,Premium Economy Seating on Singapore Airlines ...


# Text Preprocessing for NLP

# count vectorisation


In [28]:
## for Mac users, might have to install this manually

# Ensure require NLTK data is downloaded
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package punkt to /Users/hyinki/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/hyinki/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hyinki/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/hyinki/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/hyinki/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/hyinki/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


In [29]:
# Define X (features) and y (target) based on multiclass sentiment
X = data['full_review']  # Review text

# Multiclass sentiment: 0 for negative, 1 for neutral, 2 for positive
def sentiment_label(rating):
    if rating < 2:
        return 0  # Negative
    elif rating == 3:
        return 1  # Neutral
    else:
        return 2  # Positive

y = data['rating'].apply(sentiment_label)

# Text preprocessing: vectorizing the text data
vectorizer = CountVectorizer(stop_words='english')
X_vectorized = vectorizer.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.3, random_state=42)

# Logistic regression model (for multiclass classification)
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model.fit(X_train, y_train)

# Predicting and evaluating
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output the accuracy
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Optional: Output predictions and true values for comparison
print("Predicted Sentiments:", y_pred)
print("True Sentiments:", y_test.values)

Model Accuracy: 82.95%
Predicted Sentiments: [2 2 2 ... 2 2 1]
True Sentiments: [2 2 0 ... 2 2 2]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Word2Vec + log regression


In [1]:
pip install gensim


Collecting gensim
  Downloading gensim-4.3.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (8.1 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp312-cp312-macosx_12_0_arm64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp312-cp312-macosx_11_0_arm64.whl (24.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.0/24.0 MB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading scipy-1.13.1-cp312-cp312-macosx_12_0_arm64.whl (30.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.4/30.4 MB[0m [31m13.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hInstalling collected packages: scipy, gensim
  Attempting uninstall: scipy
    Found existing installation: scipy 1.14.1
    Uninstalling scipy-1.14.1:
      Successfully uninstalled scipy-1.14.1
Successfully installed gensim-4.3.3 scipy-1.13.1
Note: you may need to restart the kernel to use updated packages.


In [19]:
data = pd.read_csv("final_df.csv")

In [25]:
from nltk.tokenize import word_tokenize

from gensim.models import Word2Vec

# Define X (features) and y (target) based on multiclass sentiment
X = data['processed_full_review']  # Review text

y= data['sentiment']

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in X]

# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=2)

# Function to convert each sentence into an average word2vec vector
def sentence_to_avg_vector(sentence, model):
    words = [word for word in sentence if word in model.wv]  # Keep only words present in the Word2Vec vocabulary
    if len(words) > 0:
        return np.mean([model.wv[word] for word in words], axis=0)
    else:
        return np.zeros(model.vector_size)  # Return a zero vector if no words from the sentence are in the vocabulary

# Convert all tokenized sentences to their corresponding average word2vec vectors
X_word2vec = np.array([sentence_to_avg_vector(sentence, word2vec_model) for sentence in tokenized_sentences])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_word2vec, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_vectors, y, cv=kf, scoring=make_scorer(accuracy_score))

# Output cross-validation results
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Average Cross-Validation Accuracy: {np.mean(cv_scores) * 100:.2f}%")


Model Accuracy: 83.28%
Cross-Validation Scores: [0.82421875 0.83072917 0.82552083 0.82587929 0.81676075]
Average Cross-Validation Accuracy: 82.46%


# Word2Vec + SVM


In [27]:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, make_scorer
import numpy as np

# Define X (features) and y (target) based on multiclass sentiment
X = data['processed_full_review']  # Review text
y = data['sentiment']

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in X]

# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=2)

# Function to convert each sentence into an average word2vec vector
def sentence_to_avg_vector(sentence, model):
    words = [word for word in sentence if word in model.wv]  # Keep only words present in the Word2Vec vocabulary
    if len(words) > 0:
        return np.mean([model.wv[word] for word in words], axis=0)
    else:
        return np.zeros(model.vector_size)  # Return a zero vector if no words from the sentence are in the vocabulary

# Convert all tokenized sentences to their corresponding average word2vec vectors
X_word2vec = np.array([sentence_to_avg_vector(sentence, word2vec_model) for sentence in tokenized_sentences])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_word2vec, y, test_size=0.3, random_state=42)

# Train an SVM model
model = SVC(kernel='linear', max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_word2vec, y, cv=kf, scoring=make_scorer(accuracy_score))

# Output cross-validation results
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Average Cross-Validation Accuracy: {np.mean(cv_scores) * 100:.2f}%")




Model Accuracy: 82.32%




Cross-Validation Scores: [0.82725694 0.82074653 0.83637153 0.82023448 0.82978723]
Average Cross-Validation Accuracy: 82.69%


# Word2Vec + random forest

In [28]:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
import numpy as np

# Define X (features) and y (target) based on multiclass sentiment
X = data['processed_full_review']  # Review text
y = data['sentiment']

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in X]

# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=2)

# Function to convert each sentence into an average word2vec vector
def sentence_to_avg_vector(sentence, model):
    words = [word for word in sentence if word in model.wv]  # Keep only words present in the Word2Vec vocabulary
    if len(words) > 0:
        return np.mean([model.wv[word] for word in words], axis=0)
    else:
        return np.zeros(model.vector_size)  # Return a zero vector if no words from the sentence are in the vocabulary

# Convert all tokenized sentences to their corresponding average word2vec vectors
X_word2vec = np.array([sentence_to_avg_vector(sentence, word2vec_model) for sentence in tokenized_sentences])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_word2vec, y, test_size=0.3, random_state=42)

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_word2vec, y, cv=kf, scoring=make_scorer(accuracy_score))

# Output cross-validation results
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Average Cross-Validation Accuracy: {np.mean(cv_scores) * 100:.2f}%")


Model Accuracy: 82.81%
Cross-Validation Scores: [0.81987847 0.82074653 0.82638889 0.82023448 0.81676075]
Average Cross-Validation Accuracy: 82.08%


# FastText Embeddings + log regression
FastText is an extension of Word2Vec developed by Facebook’s AI Research (FAIR). While Word2Vec treats each word as a unique token, FastText breaks words into character n-grams (subword information). This means that it can generate vectors for words that were not seen during training, as long as their subwords were seen.


In [21]:
from gensim.models import FastText
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score, make_scorer

In [26]:
# Ensure you have the required NLTK package
nltk.download('punkt')


# Define X (features) and y (target) based on multiclass sentiment
X = data['processed_full_review']  # Review text

y = data['sentiment']

# Step 1: Tokenize the sentences
tokenized_sentences = [nltk.word_tokenize(sentence.lower()) for sentence in X]

# Step 2: Train the FastText model
fasttext_model = FastText(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Step 3: Function to average word vectors for each sentence
def get_sentence_vector(sentence, model, vector_size):
    words = nltk.word_tokenize(sentence.lower())
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    if len(word_vectors) == 0:
        return np.zeros(vector_size)  # Return a zero vector if no words are found
    return np.mean(word_vectors, axis=0)

# Step 4: Convert each sentence to its FastText vector representation
X_vectors = np.array([get_sentence_vector(sentence, fasttext_model, 100) for sentence in X])

# Step 5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, test_size=0.3, random_state=42)

# Step 6: Logistic regression model (for multiclass classification)
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Step 7: Predicting and evaluating
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output the accuracy
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Step 8 (Optional): Implement Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_vectors, y, cv=kf, scoring=make_scorer(accuracy_score))

# Output cross-validation results
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Average Cross-Validation Accuracy: {np.mean(cv_scores) * 100:.2f}%")

[nltk_data] Downloading package punkt to /Users/hyinki/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Model Accuracy: 82.32%
Cross-Validation Scores: [0.82291667 0.82508681 0.82986111 0.82023448 0.8154581 ]
Average Cross-Validation Accuracy: 82.27%


# FastText Embeddings + SVM

In [29]:
# Ensure you have the required NLTK package
nltk.download('punkt')

# Define X (features) and y (target) based on multiclass sentiment
X = data['processed_full_review']  # Review text
y = data['sentiment']

# Step 1: Tokenize the sentences
tokenized_sentences = [nltk.word_tokenize(sentence.lower()) for sentence in X]

# Step 2: Train the FastText model
fasttext_model = FastText(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Step 3: Function to average word vectors for each sentence
def get_sentence_vector(sentence, model, vector_size):
    words = nltk.word_tokenize(sentence.lower())
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    if len(word_vectors) == 0:
        return np.zeros(vector_size)  # Return a zero vector if no words are found
    return np.mean(word_vectors, axis=0)

# Step 4: Convert each sentence to its FastText vector representation
X_vectors = np.array([get_sentence_vector(sentence, fasttext_model, 100) for sentence in X])

# Step 5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, test_size=0.3, random_state=42)

# Step 6: SVM model (for multiclass classification)
model = SVC(kernel='linear', max_iter=1000)
model.fit(X_train, y_train)

# Step 7: Predicting and evaluating
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output the accuracy
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Step 8 (Optional): Implement Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_vectors, y, cv=kf, scoring=make_scorer(accuracy_score))

# Output cross-validation results
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Average Cross-Validation Accuracy: {np.mean(cv_scores) * 100:.2f}%")


[nltk_data] Downloading package punkt to /Users/hyinki/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Model Accuracy: 81.25%




Cross-Validation Scores: [0.81380208 0.82682292 0.81467014 0.81849761 0.7915762 ]
Average Cross-Validation Accuracy: 81.31%


# FastText Embeddings + random forest

In [30]:
# Ensure you have the required NLTK package
nltk.download('punkt')

# Define X (features) and y (target) based on multiclass sentiment
X = data['processed_full_review']  # Review text
y = data['sentiment']

# Step 1: Tokenize the sentences
tokenized_sentences = [nltk.word_tokenize(sentence.lower()) for sentence in X]

# Step 2: Train the FastText model
fasttext_model = FastText(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Step 3: Function to average word vectors for each sentence
def get_sentence_vector(sentence, model, vector_size):
    words = nltk.word_tokenize(sentence.lower())
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    if len(word_vectors) == 0:
        return np.zeros(vector_size)  # Return a zero vector if no words are found
    return np.mean(word_vectors, axis=0)

# Step 4: Convert each sentence to its FastText vector representation
X_vectors = np.array([get_sentence_vector(sentence, fasttext_model, 100) for sentence in X])

# Step 5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, test_size=0.3, random_state=42)

# Step 6: Random Forest model (for multiclass classification)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 7: Predicting and evaluating
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output the accuracy
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Step 8 (Optional): Implement Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_vectors, y, cv=kf, scoring=make_scorer(accuracy_score))

# Output cross-validation results
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Average Cross-Validation Accuracy: {np.mean(cv_scores) * 100:.2f}%")


[nltk_data] Downloading package punkt to /Users/hyinki/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Model Accuracy: 81.92%
Cross-Validation Scores: [0.82378472 0.82248264 0.81770833 0.81415545 0.80677377]
Average Cross-Validation Accuracy: 81.70%


# NN