# **Preprocessing**

1. Import necessary libraries
2. Load dataset and store in dataframe
3. Encode the ratings as negative (1-2), neutral (3), or positive (higher ratings) sentiment labels
4. Clean the text with regular expressions, removing HTML tags, URLs, digits, and punctuation
5. Tokenize, remove stopwords, and apply stemming
6. Split the data 80%/20% into training and test sets, stratified by sentiment

In [7]:
# Install third-party packages that may be missing from the runtime
%pip install emoji
%pip install lime
%pip install sentence_transformers

# Import necessary libraries
from IPython import get_ipython
from IPython.display import display

import gc
import re
import string

import emoji
import nltk
import numpy as np
import pandas as pd
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
from sentence_transformers import SentenceTransformer



In [8]:
# Ensure dependencies are available
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
tqdm.pandas()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [9]:
# Load data
df = pd.read_csv("tripadvisor_hotel_reviews.csv.zip")

In [10]:
# Sentiment encoding
def label_encode(rating):
    if rating in [1, 2]:
        return 0
    elif rating == 3:
        return 1
    else:
        return 2

def label_name(rating):
    return ["Negative", "Neutral", "Positive"][rating]

df["Sentiment"] = df["Rating"].apply(label_encode)
df["Sentiment_Name"] = df["Sentiment"].apply(label_name)

In [11]:
# Preprocess text
stops = set(stopwords.words('english'))  # a set gives fast membership tests
ps = PorterStemmer()
exclude = string.punctuation

def preprocess_text(text):
    text = re.sub('<.*?>', '', text)                       # strip HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)      # strip URLs
    text = emoji.demojize(text)                            # replace emoji with text names
    text = re.sub(r'\d+', '', text)                        # drop digits
    text = text.translate(str.maketrans('', '', exclude))  # drop punctuation
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.lower() not in stops]  # drop stopwords
    tokens = [ps.stem(word) for word in tokens]            # stem each remaining token
    return re.sub(r'\s+', ' ', ' '.join(tokens)).strip()

# Apply preprocessing
df['Review'] = df['Review'].progress_apply(preprocess_text)

100%|██████████| 20491/20491 [01:15<00:00, 271.31it/s]
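
To make the regex steps concrete, the following minimal sketch applies only the standard-library part of `preprocess_text` to a made-up review. The helper name `clean_text` and the sample string are assumptions for illustration; emoji conversion, tokenization, stopword removal, and stemming are left out.

```python
import re
import string

def clean_text(text):
    """Apply the regex and punctuation steps of the cleaning pipeline."""
    text = re.sub(r'<.*?>', '', text)                   # strip HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # strip URLs
    text = re.sub(r'\d+', '', text)                     # drop digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip()            # collapse whitespace

# Hypothetical sample review, not taken from the dataset
sample = "Great <b>hotel</b>! Visit https://example.com for 20% off :)"
cleaned = clean_text(sample)
print(cleaned)  # Great hotel Visit for off
```

Note that removing digits and punctuation also discards details such as "20%", which is acceptable here because only the overall sentiment is modeled.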


In [12]:
# Train/test split
X = df[['Review']]
y = df['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=1)
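
As a sanity check on the split, the arithmetic below estimates how many rows land in each set. It assumes the test size is rounded up to a whole number of rows, which agrees with the 4099 test rows shown in the classification reports; the total of 20491 comes from the preprocessing progress bar.

```python
import math

n_reviews = 20491     # rows processed, per the progress bar output
test_fraction = 0.2   # test_size passed to train_test_split

# Assumption: the test set size is rounded up to a whole row
n_test = math.ceil(n_reviews * test_fraction)
n_train = n_reviews - n_test

print(n_train, n_test)  # 16392 4099
```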

# **Vectorization using Sentence Transformers**

In [13]:
# Sentence Transformers
model_bert = SentenceTransformer('distilbert-base-uncased')
X_train_bert = model_bert.encode(X_train['Review'].tolist(), show_progress_bar=True)
X_test_bert = model_bert.encode(X_test['Review'].tolist(), show_progress_bar=True)
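
`distilbert-base-uncased` is a plain transformer checkpoint, not a model fine-tuned for sentence embeddings. To our understanding, sentence-transformers handles such a checkpoint by averaging its token embeddings (mean pooling) to get one vector per review. The sketch below illustrates that idea in pure Python on made-up numbers; `mean_pool` is a hypothetical helper, not the library's implementation.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens and one padding token, each with a 2-dimensional embedding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padding vector does not affect the result because its mask entry is 0.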


# **Model Training**

We train two classifiers on the DistilBERT embeddings: a Random Forest and a K-Nearest Neighbors (KNN) classifier.

In [14]:
# Classifier setup
rf = RandomForestClassifier()
knn = KNeighborsClassifier(n_neighbors=30, metric='minkowski', p=2)
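
With `metric='minkowski'` and `p=2`, the distance between two embeddings is the ordinary Euclidean distance. The small sketch below checks this on a made-up pair of points; `minkowski` is an illustrative helper, not scikit-learn's code.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)

d2 = minkowski(a, b, p=2)   # Euclidean distance
d1 = minkowski(a, b, p=1)   # Manhattan distance, for comparison

print(d2, math.dist(a, b))  # 5.0 5.0
print(d1)                   # 7.0
```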

In [15]:
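# Training and evaluation function
def train_and_evaluate(model, X_train_vec, X_test_vec, label=""):
    """Fit the model, print its test accuracy and classification report, and return it.

    Uses the global y_train and y_test created by the train/test split.
    """
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)
    acc = accuracy_score(y_test, y_pred)
    print(f"[{label}] Accuracy: {acc:.4f}")
    print(classification_report(y_test, y_pred))
    return model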

# **Testing and Evaluation**

Below are the classification reports for each model trained on the Sentence Transformers embeddings. Labels 0, 1, and 2 correspond to Negative, Neutral, and Positive.

In [16]:
print("=== Random Forest Models ===")
rf_bert = train_and_evaluate(rf, X_train_bert, X_test_bert, "DistilBERT + RF")

print("=== KNN Models ===")
knn_bert = train_and_evaluate(knn, X_train_bert, X_test_bert, "DistilBERT + KNN")


=== Random Forest Models ===
[DistilBERT + RF] Accuracy: 0.7892
              precision    recall  f1-score   support

           0       0.78      0.38      0.51       643
           1       0.50      0.00      0.00       437
           2       0.79      0.99      0.88      3019

    accuracy                           0.79      4099
   macro avg       0.69      0.46      0.46      4099
weighted avg       0.76      0.79      0.73      4099

=== KNN Models ===
[DistilBERT + KNN] Accuracy: 0.7756
              precision    recall  f1-score   support

           0       0.85      0.26      0.40       643
           1       0.25      0.00      0.00       437
           2       0.77      1.00      0.87      3019

    accuracy                           0.78      4099
   macro avg       0.62      0.42      0.42      4099
weighted avg       0.73      0.78      0.70      4099
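
The averages in the reports can be reproduced from the per-class rows, which helps in reading them. The sketch below recomputes the macro and weighted recall for the Random Forest model from the rounded values printed above, along with the accuracy of always predicting the majority class (label 2). Because the inputs are rounded to two decimals, the results only match the report approximately.

```python
# Per-class recall and support copied from the Random Forest report
recall = {0: 0.38, 1: 0.00, 2: 0.99}
support = {0: 643, 1: 437, 2: 3019}

total = sum(support.values())

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

# Baseline: accuracy of always predicting the largest class
majority_baseline = max(support.values()) / total

print(f"macro recall:      {macro_recall:.2f}")       # 0.46
print(f"weighted recall:   {weighted_recall:.2f}")    # 0.79
print(f"majority baseline: {majority_baseline:.4f}")  # 0.7365
```

The majority baseline of about 0.74 puts the reported accuracies of 0.7892 and 0.7756 in context: both models gain only a few points over always predicting Positive, and recall for the Neutral class is 0.00 in both reports.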

