This notebook performs sentiment analysis.
For preprocessing, I first tried Spark, using the PretrainedPipeline from sparknlp.pretrained for language detection. I did not see much improvement from Spark, possibly because of the UDFs and the conversions between data structures.

The rest of the notebook does not use Spark, for the following reasons:
1. Converting between pandas and Spark DataFrames is computationally expensive, and the transformer models used here do not accept a Spark DataFrame as input.
2. Transformer models are computationally intensive and benefit significantly from the parallel processing capabilities of GPUs. PyTorch already has built-in support for distributed training on GPUs, which can be more effective for deep learning than distributing the workload across a Spark cluster that has no native support for the underlying transformer models.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993225 sha256=7c19b934d0a33d1a2ab8d248b72e0e7d98bb14543bcea9dc3016f4e610127c2b
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [None]:
import re
import json
import time
import datetime

import numpy as np
import pandas as pd

from langdetect import detect
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

import spacy
sp = spacy.load('en_core_web_sm')

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

import torch
from torch import nn, optim
import torch.nn.functional as F
# transformers' own AdamW is deprecated in favor of the PyTorch one
# (note that torch.optim.AdamW defaults to weight_decay=0.01)
from torch.optim import AdamW
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader, TensorDataset, RandomSampler, SequentialSampler
from transformers import pipeline, BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Load Data

In [None]:
file_path = '/content/drive/MyDrive/Data_Preprocessing/googleData.json'
with open(file_path, 'r') as file:
    df = json.load(file)

# Normalize the data
df = pd.json_normalize(df, record_path=['train'])

## Data Preprocessing

In [None]:
# Remove columns pics and history_reviews
df = df.drop(['pics', 'history_reviews'], axis=1)

In [None]:
## Detect the language of each review; we focus on English reviews only

# Function to detect language or return 'unknown' on exception
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return 'unknown'

# Apply language detection and filter only English reviews
df['detect'] = df['review_text'].apply(detect_language)
df = df[df['detect'] == 'en'].reset_index(drop=True)

In [None]:
## Clean the text while being careful with stopwords. Negations such as 'no' and 'not' must be kept,
## because they can reverse the meaning of the text (e.g., 'not delicious' should remain 'not delicious').

# Load the spaCy model for English
sp = spacy.load('en_core_web_sm')

# Work on a copy of spaCy's default stopwords so the library defaults are not mutated,
# and keep the negation words out of the removal list
exclude_stopwords = {'no', 'not'}
stopwords = set(sp.Defaults.stop_words) - exclude_stopwords

def text_preprocessing(raw_review):
    """
    Preprocess the text of a review.
    """
    # Replace non-letters with spaces and convert to lower case
    letters_only = re.sub("[^a-zA-Z]", " ", raw_review).lower()

    # Tokenize the words
    tokens = word_tokenize(letters_only)

    # Filter against the customized set; sp.vocab[word].is_stop may not reflect
    # the removal of 'no' and 'not' above
    filtered_tokens = [word for word in tokens if word not in stopwords]

    # Join the filtered tokens back into a string
    return " ".join(filtered_tokens)

df['cleaned_reviews'] = df['review_text'].apply(text_preprocessing)
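
To see why keeping negations matters, the filtering step can be tried on a single review. This is a minimal standard-library sketch, not the notebook's pipeline: the tiny `TOY_STOPWORDS` set and the `clean_review` helper are hypothetical stand-ins for spaCy's stopword list and `text_preprocessing`, and whitespace splitting replaces `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stopword list (spaCy's real list is much larger)
TOY_STOPWORDS = {"the", "was", "is", "a", "no", "not", "and"}
KEEP = {"no", "not"}  # negations we refuse to drop

def clean_review(raw_review, stopwords=frozenset(TOY_STOPWORDS - KEEP)):
    """Lowercase, strip non-letters, and drop stopwords except negations."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw_review).lower()
    tokens = letters_only.split()  # stand-in for word_tokenize
    return " ".join(t for t in tokens if t not in stopwords)

print(clean_review("The soup was NOT delicious!"))  # soup not delicious
```

Had 'not' stayed in the stopword set, the same review would reduce to "soup delicious" and read as positive.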


In [None]:
## Label reviews as positive or negative. For simplicity, we drop all 3-star reviews from the analysis, because those reviews
## most likely mix positive and negative feelings.


# First remove all rows whose rating is not a number
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
df.dropna(subset=['rating'], inplace=True)

df = df[(df["rating"]!=3)].copy()

df.loc[df['rating'] < 3, 'sentiment'] = 0
df.loc[df['rating'] > 3, 'sentiment'] = 1
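
The labeling rule above can also be written as a small pure-Python function, which makes the three outcomes explicit. This is only an illustrative sketch; `rating_to_sentiment` is a hypothetical helper, not part of the notebook, and the sample ratings are made up.

```python
def rating_to_sentiment(rating):
    """Return 1 for ratings above 3, 0 for ratings below 3, None otherwise."""
    try:
        value = float(rating)
    except (TypeError, ValueError):
        return None  # non-numeric rating, dropped like the NaN rows above
    if value != value:  # NaN is the only value not equal to itself
        return None
    if value == 3:
        return None  # neutral reviews are excluded from the analysis
    return 1 if value > 3 else 0

ratings = [5, "4", 3, 1, "n/a", None]
print([rating_to_sentiment(r) for r in ratings])  # [1, 1, None, 0, None, None]
```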

In [None]:
# split the train, test set for training
df_features = df.drop(['sentiment'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(df_features, df['sentiment'], test_size=0.2, random_state=42)

In [None]:
y_train.value_counts().sort_index()

0.0     2438
1.0    59551
Name: sentiment, dtype: int64

The data is heavily imbalanced: 59,551 positive reviews against 2,438 negative ones. People tend to give restaurants high ratings (>3). To help the model learn the negative class, we handle the imbalance by simply oversampling the minority class.

In [None]:
# Separate majority and minority classes in both features and labels
majority_X = X_train[y_train == 1]
minority_X = X_train[y_train == 0]

majority_y = y_train[y_train == 1]
minority_y = y_train[y_train == 0]

# Upsample minority class
minority_X_upsampled, minority_y_upsampled = resample(minority_X,
                                                      minority_y,
                                                      replace=True,
                                                      n_samples=len(majority_X),
                                                      random_state=42)

# Combine majority class with upsampled minority class
X_train = pd.concat([majority_X, minority_X_upsampled])
y_train = pd.concat([majority_y, minority_y_upsampled])

In [None]:
y_train.value_counts()

1.0    59551
0.0    59551
Name: sentiment, dtype: int64

## Sentiment Analysis Using BERT

In [None]:
# Check if GPU is available and set PyTorch to use GPU
device = "cuda" if torch.cuda.is_available() else "cpu"


# Load a pre-trained tokenizer for sentiment analysis
pre_trained_model = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(pre_trained_model)

# Calculate the length of tokens for each review
token_lens = []
for txt in X_train.cleaned_reviews:
    # Encode the text using the tokenizer
    tokens = tokenizer.encode(txt, max_length=512, truncation=True)
    token_lens.append(len(tokens))


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [None]:
# Calculate the 99th percentile of token lengths

percentile_99 = np.percentile(token_lens, 99)

print("99th Percentile:", percentile_99)

99th Percentile: 85.0


Define the maximum sequence length for reviews. 99% of the training reviews have at most 85 tokens; this count already includes the special tokens that tokenizer.encode adds by default, [CLS] at the beginning of each sequence (used for classification) and [SEP] at the end. We set MAX_SEQ_LENGTH to 121, which leaves room for up to 119 content tokens plus the two special tokens. This covers nearly all reviews and provides a buffer for longer ones, balancing efficiency and coverage.

In [None]:
MAX_SEQ_LENGTH = 121
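
How the length budget is spent can be illustrated without the real tokenizer. The sketch below assumes the usual single-sequence BERT layout of `[CLS] ... [SEP]`; `encode_single` is a hypothetical helper that works on plain token strings rather than ids.

```python
CLS, SEP = "[CLS]", "[SEP]"

def encode_single(tokens, max_len):
    """Wrap a token list as [CLS] ... [SEP], truncating the content to fit max_len."""
    budget = max_len - 2  # two slots are reserved for [CLS] and [SEP]
    return [CLS] + tokens[:budget] + [SEP]

long_review = ["tok"] * 200   # a review longer than the limit
short_review = ["tok"] * 10   # a review well under the limit

print(len(encode_single(long_review, max_len=121)))   # 121
print(len(encode_single(short_review, max_len=121)))  # 12
```

With a limit of 121, a long review therefore keeps 119 of its own tokens.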

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train['cleaned_reviews'], y_train, test_size=0.15, random_state = 42, stratify= y_train)
X_test = X_test['cleaned_reviews']
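
One caveat with this ordering: the validation split is drawn from the already oversampled training set, so copies of the same negative review can land in both the training and validation parts, which may make validation accuracy look optimistic. A possible alternative is to hold out the validation rows first and oversample only what remains. The sketch below shows that idea on index lists using the standard library; `split_then_oversample` and the toy labels are hypothetical, and the split is random rather than stratified.

```python
import random
from collections import Counter

def split_then_oversample(labels, val_fraction=0.15, seed=42):
    """Return (train_indices, val_indices); only the training part is oversampled."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    # Group the remaining training indices by class
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Upsample every class (with replacement) to the size of the largest one
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced, val_idx

# Hypothetical toy labels: 90 positive reviews, 10 negative ones
toy_labels = [1] * 90 + [0] * 10
train_idx, val_idx = split_then_oversample(toy_labels)
train_counts = Counter(toy_labels[i] for i in train_idx)
print(train_counts)
print(set(train_idx).isdisjoint(val_idx))  # True: no row appears in both sets
```

The validation set then keeps the original class proportions, so its metrics are closer to what the untouched test set will show.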

In [None]:
# Function to encode reviews
import torch
from torch.nn.utils.rnn import pad_sequence

def encode_reviews(reviews, tokenizer, max_len):
    input_ids = [tokenizer.encode(review, add_special_tokens=True, max_length=max_len, truncation=True) for review in reviews]

    # Convert lists to PyTorch tensors
    input_ids = [torch.tensor(seq, dtype=torch.long) for seq in input_ids]

    # Pad sequences using PyTorch
    input_ids_padded = pad_sequence(input_ids, batch_first=True, padding_value=0)

    # Create attention masks (1 for tokens, 0 for padding)
    attention_masks = (input_ids_padded != 0).long()

    return input_ids_padded, attention_masks

# Encode reviews for train, validation, and test sets
train_input_ids, train_attention_masks = encode_reviews(X_train.tolist(), tokenizer, MAX_SEQ_LENGTH)
val_input_ids, val_attention_masks = encode_reviews(X_val.tolist(), tokenizer, MAX_SEQ_LENGTH)
test_input_ids, test_attention_masks = encode_reviews(X_test.tolist(), tokenizer, MAX_SEQ_LENGTH)
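
The padding and masking that `encode_reviews` performs can be shown on plain Python lists. This is a minimal sketch of the same idea, assuming a pad id of 0 as in the function above; `pad_and_mask` is a hypothetical helper and the token ids are made up.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad id lists to the longest length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    masks = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return padded, masks

# Made-up token ids for two reviews of different lengths
padded, masks = pad_and_mask([[101, 2204, 2833, 102], [101, 102]])
print(padded)  # [[101, 2204, 2833, 102], [101, 102, 0, 0]]
print(masks)   # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Note that padding goes only as far as the longest sequence in the list, not up to the maximum length, which is also how `pad_sequence` behaves.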


In [None]:
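# Convert labels to PyTorch tensors (integer class ids, as expected by the classification loss)
train_labels = torch.tensor(y_train.values, dtype=torch.long)
val_labels = torch.tensor(y_val.values, dtype=torch.long)
test_labels = torch.tensor(y_test.values, dtype=torch.long)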

In [None]:
# Define batch size
batch_size = 64

# Function to create DataLoader
def create_dataloader(inputs, masks, labels, sampler_class, batch_size):
    data = TensorDataset(inputs, masks, labels)
    sampler = sampler_class(data)
    return DataLoader(data, sampler=sampler, batch_size=batch_size)

# Create DataLoader for train, validation, and test sets
train_dataloader = create_dataloader(train_input_ids, train_attention_masks, train_labels, RandomSampler, batch_size)
val_dataloader = create_dataloader(val_input_ids, val_attention_masks, val_labels, SequentialSampler, batch_size)
test_dataloader = create_dataloader(test_input_ids, test_attention_masks, test_labels, SequentialSampler, batch_size)

In [None]:
# Model Initialization
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    output_attentions=False,
    output_hidden_states=False,
)
model.to(device)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [None]:
epochs=2

# Setting up the optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=3e-5)
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

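# Define loss function and move it to GPU (kept for reference; the loops below
# use the loss that the model itself returns when labels are passed)
loss_fn=nn.CrossEntropyLoss().to(device)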

# Helper functions
def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round(elapsed))))

def accuracy(preds, labels):
    return np.mean(np.argmax(preds, axis=1).flatten() == labels.flatten())
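
For intuition about the scheduler, the learning-rate multiplier of a linear schedule with warmup can be written in a few lines. This sketch mirrors my understanding of what `get_linear_schedule_with_warmup` computes; `linear_schedule_multiplier` is a hypothetical name, and the 1,345 batches per epoch are taken from the training log in this notebook.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 3e-5
total_steps = 1345 * 2  # batches per epoch times epochs
for step in (0, 1345, 2690):
    print(step, base_lr * linear_schedule_multiplier(step, 0, total_steps))
```

With `num_warmup_steps=0`, as used here, training starts at the full learning rate, reaches half of it at the end of the first epoch, and decays to zero at the last step.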



In [None]:
# Initialize storage for performance metrics
metrics = {
    "train_loss": [], "train_acc": [],
    "val_loss": [], "val_acc": []
}

In [None]:
# Training and Evaluation Loop
for epoch in range(epochs):
    print(f"\nEpoch {epoch+1}/{epochs}")

    # Training Phase
    model.train()
    total_train_loss, total_train_acc = 0, 0
    start_time = time.time()

    for step, batch in enumerate(train_dataloader):
        if step % 100 == 0 and not step == 0:
            elapsed = format_time(time.time() - start_time)
            print(f'Batch {step:>5,} of {len(train_dataloader):>5,}. Elapsed: {elapsed}.')

        b_input_ids, b_attention_mask, b_labels = (item.to(device) for item in batch)
        model.zero_grad()
        outputs = model(b_input_ids, attention_mask=b_attention_mask, labels=b_labels)
        loss = outputs.loss
        logits = outputs.logits

        total_train_loss += loss.item()
        total_train_acc += accuracy(logits.detach().cpu().numpy(), b_labels.cpu().numpy())

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_train_loss / len(train_dataloader)
    avg_train_acc = total_train_acc / len(train_dataloader)
    metrics["train_loss"].append(avg_train_loss)
    metrics["train_acc"].append(avg_train_acc)

    print(f"\nAverage Training Accuracy: {avg_train_acc:.2f}")
    print(f'Average Training Loss: {avg_train_loss:.2f}')
    print(f'Training Epoch Time: {format_time(time.time() - start_time)}')

    # Validation Phase
    model.eval()
    total_val_loss, total_val_acc = 0, 0
    start_time = time.time()

    for batch in val_dataloader:
        b_input_ids, b_attention_mask, b_labels = (item.to(device) for item in batch)

        with torch.no_grad():
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_attention_mask, labels=b_labels)
            loss = outputs.loss
            logits = outputs.logits

            total_val_loss += loss.item()
            total_val_acc += accuracy(logits.detach().cpu().numpy(), b_labels.cpu().numpy())

    avg_val_loss = total_val_loss / len(val_dataloader)
    avg_val_acc = total_val_acc / len(val_dataloader)
    metrics["val_loss"].append(avg_val_loss)
    metrics["val_acc"].append(avg_val_acc)

    print(f"\nValidation Accuracy: {avg_val_acc:.2f}")
    print(f'Validation Loss: {avg_val_loss:.2f}')
    print(f'Validation Time: {format_time(time.time() - start_time)}')


Epoch 1/2
Batch   100 of 1,345. Elapsed: 0:01:59.
Batch   200 of 1,345. Elapsed: 0:03:57.
Batch   300 of 1,345. Elapsed: 0:05:56.
Batch   400 of 1,345. Elapsed: 0:07:54.
Batch   500 of 1,345. Elapsed: 0:09:52.
Batch   600 of 1,345. Elapsed: 0:11:51.
Batch   700 of 1,345. Elapsed: 0:13:49.
Batch   800 of 1,345. Elapsed: 0:15:48.
Batch   900 of 1,345. Elapsed: 0:17:46.
Batch 1,000 of 1,345. Elapsed: 0:19:45.
Batch 1,100 of 1,345. Elapsed: 0:21:43.
Batch 1,200 of 1,345. Elapsed: 0:23:42.
Batch 1,300 of 1,345. Elapsed: 0:25:40.

Average Training Accuracy: 0.99
Average Training Loss: 0.03
Training Epoch Time: 0:26:33

Validation Accuracy: 0.99
Validation Loss: 0.05
Validation Time: 0:01:46

Epoch 2/2
Batch   100 of 1,345. Elapsed: 0:01:59.
Batch   200 of 1,345. Elapsed: 0:03:57.
Batch   300 of 1,345. Elapsed: 0:05:55.
Batch   400 of 1,345. Elapsed: 0:07:53.
Batch   500 of 1,345. Elapsed: 0:09:51.
Batch   600 of 1,345. Elapsed: 0:11:49.
Batch   700 of 1,345. Elapsed: 0:13:47.
Batch   800 of

In [None]:
model.eval()
predictions, true_labels = [], []

for batch in test_dataloader:
    b_input_ids, b_attention_mask, b_labels = (item.to(device) for item in batch)

    with torch.no_grad():
        outputs = model(input_ids=b_input_ids, attention_mask=b_attention_mask)

    logits = outputs.logits
    predictions.extend(logits.tolist())
    true_labels.extend(b_labels.tolist())

print('Done with predictions')

# Convert predictions to softmax probabilities
preds = torch.tensor(predictions)
preds = F.softmax(preds, dim=1)
preds = np.array(preds)

# Convert true_labels to numpy array
true_labels = np.array(true_labels)

# Function to evaluate the model
def evaluate(y_test, predictions):
    # Generate classification report
    class_report = classification_report(y_test, predictions, digits=3)
    print(class_report)

# Evaluating the model
evaluate(true_labels, preds.argmax(1))

Done with predictions
              precision    recall  f1-score   support

           0      0.492     0.511     0.501       583
           1      0.981     0.979     0.980     14915

    accuracy                          0.962     15498
   macro avg      0.736     0.745     0.741     15498
weighted avg      0.962     0.962     0.962     15498
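
The report shows why overall accuracy is a weak summary on imbalanced data: accuracy is 0.962, yet the F1 score for the negative class is only 0.501. The per-class numbers come from the standard precision, recall, and F1 definitions, sketched below on made-up labels. `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, target):
    """Precision, recall, and F1 for one class, from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: 2 negative and 8 positive reviews, one mistake in each direction
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 1, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                              # 0.8
print(per_class_metrics(y_true, y_pred, 0))  # (0.5, 0.5, 0.5)
print(per_class_metrics(y_true, y_pred, 1))  # (0.875, 0.875, 0.875)
```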



In [None]:
# Define the directory where you want to save the model
save_directory = "/content/drive/MyDrive/cs631models"

# Save the model and tokenizer
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

('/content/drive/MyDrive/cs631models/tokenizer_config.json',
 '/content/drive/MyDrive/cs631models/special_tokens_map.json',
 '/content/drive/MyDrive/cs631models/vocab.txt',
 '/content/drive/MyDrive/cs631models/added_tokens.json')

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Apply the model to every review remaining in df (English reviews, 3-star reviews excluded)
input_ids, attention_masks = encode_reviews(df['cleaned_reviews'].tolist(), tokenizer, MAX_SEQ_LENGTH)

data = TensorDataset(input_ids, attention_masks)
data_sampler = SequentialSampler(data)
data_loader = DataLoader(data, sampler=data_sampler, batch_size=128)

# Predict and collect probabilities
predictions = []
for batch in data_loader:
    # Add batch to the device
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_attention_mask = batch

    # Forward pass, calculate logit predictions
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_attention_mask)

    logits = outputs.logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    predictions.extend(probabilities[:, 1].tolist())  # Column 1 is the positive class (label 1 during training)

# Add predictions to the DataFrame
df['sentiment_score'] = predictions
threshold = 0.5  # Example threshold
df['sentiment'] = df['sentiment_score'].apply(lambda x: 'positive' if x > threshold else 'negative')

# Save the updated DataFrame
df.to_csv('/content/drive/MyDrive/Data_Preprocessing/final_training_dataset.csv', index=False)
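
The scoring step above turns two logits into a probability and then into a label. A minimal standard-library sketch of that conversion follows; `positive_probability` and `label_from_score` are hypothetical helpers, and the logits are made up.

```python
import math

def positive_probability(logits):
    """Softmax over two logits; returns the probability of class 1 (positive)."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    return exps[1] / sum(exps)

def label_from_score(score, threshold=0.5):
    """Strictly greater than the threshold counts as positive, as in the cell above."""
    return "positive" if score > threshold else "negative"

for logits in ([-2.0, 2.0], [0.0, 0.0], [1.5, -0.5]):
    score = positive_probability(logits)
    print(logits, round(score, 3), label_from_score(score))
```

A score of exactly 0.5 is labeled negative because the comparison is strict; raising the threshold trades recall on the positive class for precision.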

Reference: https://github.com/josepaulosa/NLP_Sentiment_Analysis/blob/main/BERT.ipynb