## 4A. Aspect Based Sentiment Analysis (ABSA)

Last Updated: 4 Oct 2025 <br>
Description: Both a one-step and a two-step ABSA pipeline were tested. The two-step pipeline was eventually adopted because it gave better aspect relevance and more accurate sentiment results. Note that this script was run on Google Colab to leverage the T4 GPU for efficient processing. In addition, we use the cleaned_reviews_polars.csv dataset, since only the raw text from "review_text" is needed.

#### Mount Drive for T4 GPU on Google Colab

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Import Libraries

In [2]:
# pip install tiktoken

In [3]:
# pip install sentencepiece

In [15]:
import polars as pl
from transformers import pipeline
import torch
import warnings
import os
from tqdm import tqdm
import pandas as pd

# Ignore warnings that are cluttering the output
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

In [12]:
os.getcwd()

'/content'

#### File Path Config

In [16]:
# Data directory
DATA_DIR = "data/"

FILE_PATH = "/content/drive/MyDrive/CI/data/cleaned_reviews_polars.csv"

#### Aspect Definition

- The aspects for this analysis were defined using a deductive, top-down methodology to ensure relevance and comprehensive coverage of the guest experience.
- Nine distinct categories were predefined based on a three-pronged rationale:
1) Industry-Standard Criteria, mirroring common review categories found on major travel platforms (e.g., 'Location & Accessibility', 'Pricing & Value for Money')
2) Domain-Specific Features, identifying unique attributes of Marina Bay Sands (MBS) (e.g., 'Casino Experience', 'Shopping & Retail')
3) General Service Aspects, covering universal elements of hospitality (e.g., 'Room Quality & Comfort', 'Customer Service & Staff').

This structured approach ensures that the subsequent sentiment analysis is grounded in categories that are both widely understood and directly pertinent to MBS.

In [17]:
ASPECTS = [
    "Room Quality & Comfort",
    "Facilities & Amenities",
    "Customer Service & Staff",
    "Dining Experience",
    "Casino Experience",
    "Shopping & Retail",
    "Location & Accessibility",
    "Pricing & Value for Money",
    "Atmosphere & Ambience"
]

#### Load Data
- Note: the dataset already contains a native review_id column.

In [1]:
# Not required for the two-step pipeline (used only for Experiment 1)

print("Loading cleaned dataset...")
# Load the cleaned dataset produced by the Polars cleaning pipeline
try:
    df = pl.read_csv("data/imputed_reviews_polars.csv")
    # Add a unique ID to each review for easier joining later
    df = df.with_row_index(name="review_id")
    print("Dataset loaded successfully.")
except Exception as e:
    print(f"Error loading dataset: {e}")
    exit()

Loading cleaned dataset...
Error loading dataset: name 'pl' is not defined


#### Exclude Reviews with low word count
- To improve the quality of the dataset, reviews containing five or fewer words were excluded, as these entries typically lack the detail and context required for meaningful Aspect-Based Sentiment Analysis.
- A threshold of five words is a practical heuristic in text analysis, chosen to balance data retention against data quality.

#### Rationale
The primary goal is to filter out extremely low-effort, "zero-context" reviews. Phrases like:
- "nice comfortable" (2 words)
- "A satisfying trip" (3 words)
- "It was so good" (4 words)

While these provide an overall sentiment, they are not meaningful for Aspect-Based Sentiment Analysis because they do not say which aspect was great or horrible.

At the same time, the threshold needs to be low enough to keep short but valuable reviews that do contain a clear aspect and sentiment, such as:
- "Very good service and convenient transportation" (6 words)
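
As a rough illustration of the rule above, the sketch below applies the same more-than-five-words test to the example phrases in plain Python. The helper names `word_count` and `keep_review` are hypothetical and not part of the notebook; splitting on a single space is assumed here to mirror the `str.split(" ")` expression used in the Polars cell.

```python
MIN_WORDS = 5  # reviews with this many words or fewer are excluded


def word_count(text: str) -> int:
    """Count space-separated tokens (hypothetical helper for illustration)."""
    return len(text.split(" "))


def keep_review(text: str) -> bool:
    """Return True when the review has more than MIN_WORDS tokens."""
    return word_count(text) > MIN_WORDS


examples = [
    "nice comfortable",
    "A satisfying trip",
    "It was so good",
    "Very good service and convenient transportation",
]

for text in examples:
    print(f"{word_count(text)} words, keep={keep_review(text)}: {text}")
```

One caveat with this style of counting: splitting on a single space treats standalone punctuation and the empty strings produced by repeated spaces as words, so counts can be slightly inflated for untidy text.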

In [None]:
# Create a new column 'word_count'
# This splits the review text by spaces, creating a list of words, and then counts the length of that list.
df = df.with_columns(
    pl.col("review_text").str.split(" ").list.len().alias("word_count")
)

# Filter the DataFrame to keep only rows where word_count is greater than 5
filtered_df = df.filter(
    pl.col("word_count") > 5
)

print(f"\nOriginal row count: {len(df)}")
print(f"Row count after filtering out short reviews: {len(filtered_df)}")
print(f"Number of reviews excluded: {len(df) - len(filtered_df)}")


Original row count: 27392
Row count after filtering out short reviews: 17255
Number of reviews excluded: 10137


#### Experiment 1: Single Step Pipeline for Sentiment Analysis
- We use the yangheng/deberta-v3-base-absa-v1.1 model, which is specifically trained to take a sentence and an aspect and then determine the sentiment towards that aspect within the sentence. For aspect-level analysis this is generally more accurate than a general-purpose sentiment model, which scores the sentence as a whole.

In [None]:
# classifier = pipeline("text-classification", model="yangheng/deberta-v3-base-absa-v1.1")
# sentence = "The food was exceptional, although the service was a bit slow."
# result_food = classifier(sentence, text_pair="food")
# print(result_food)

# [{'label': 'Positive', 'score': 0.9969689249992371}]

In [None]:
# --- LOAD MODEL ---

print("\nLoading ABSA model...")
absa_classifier = pipeline(
    "text-classification",
    model="yangheng/deberta-v3-base-absa-v1.1"
)
print("ABSA model loaded.")


Loading ABSA model...


Device set to use mps:0


ABSA model loaded.


In [None]:
# --- RUN THE ABSA PIPELINE ---

# For development, run on a small sample.
# To run on the full filtered dataset, change this to: sample_df = filtered_df
sample_df = filtered_df.sample(n=100, seed=1)
print(f"\nProcessing a sample of {len(sample_df)} reviews...")

results = []
# Iterate through each review in our sample
for row in sample_df.iter_rows(named=True):
    review_id = row['review_id']
    review_text = row['review_text']

    if not review_text:
        continue

    for aspect in ASPECTS:
        input_text = f"[CLS] {review_text} [SEP] {aspect} [SEP]"
        prediction = absa_classifier(input_text)[0]

        # Capture both the sentiment label and the probability score
        sentiment = prediction['label'].capitalize()
        score = prediction['score']

        results.append({
            "review_id": review_id,
            "aspect": aspect,
            "sentiment": sentiment,
            "score": score  # Add score to our results
        })

print("ABSA processing complete.")


Processing a sample of 100 reviews...
ABSA processing complete.


In [None]:
# --- EXPORT LONG-FORMAT DATA FOR VERIFICATION ---

print("\nExporting long-format data for verification...")
# Create the initial long-format DataFrame from the results
long_format_df = pl.DataFrame(results)

# Join with the original DataFrame to include the review text for context
long_verification_df = long_format_df.join(df, on="review_id", how="left")

# Export the long-format verification file
verification_output_path = "output/absa_onestep_results_long_verification.csv"
long_verification_df.write_csv(verification_output_path, include_bom=True)
print(f"Exported verification file to: {verification_output_path}")


Exporting long-format data for verification...
Exported verification file to: output/absa_onestep_results_long_verification.csv


In [None]:
# --- PIVOT FROM LONG TO WIDE FORMAT ---

print("\nPivoting data to wide format...")
# Use the .pivot() method to transform the data
# - index: The column that will become the unique row identifier.
# - columns: The column whose values will become the new column names.
# - values: The columns whose data will fill the new pivoted columns.
wide_format_df = long_format_df.pivot(
    index="review_id",
    columns="aspect",
    values=["sentiment", "score"]
)

# The pivot creates columns like 'sentiment_Room Quality & Comfort'.
# We will rename them to a cleaner format like 'Room Quality & Comfort_sentiment'.
rename_mapping = {}
for aspect in ASPECTS:
    rename_mapping[f'sentiment_{aspect}'] = f'{aspect}_sentiment'
    rename_mapping[f'score_{aspect}'] = f'{aspect}_score'

wide_format_df = wide_format_df.rename(rename_mapping)

print("Pivoting complete.")


Pivoting data to wide format...
Pivoting complete.


In [None]:
# --- JOIN AND EXPORT FINAL WIDE DATAFRAME ---

# Join the new wide-format results back to the ORIGINAL full DataFrame.
# A 'left' join ensures we keep all original reviews, even those not processed by ABSA.
final_analysis_df = df.join(wide_format_df, on="review_id", how="left")

# Export the final combined DataFrame
output_path = "output/absa_analysis_onestep_results_wide.csv"
final_analysis_df.write_csv(output_path, include_bom=True)

print(f"\nSuccessfully exported the final wide-format results to: {output_path}")


Successfully exported the final wide-format results to: output/absa_analysis_onestep_results_wide.csv


#### Experiment 2: Two Step Pipeline for Sentiment Analysis
- The initial methodology for Aspect-Based Sentiment Analysis (ABSA) applied a single, specialised model, yangheng/deberta-v3-base-absa-v1.1, to each review for every one of the nine predefined aspects. However, this single-step approach revealed a significant limitation: it was prone to generating spurious sentiment attributions, assigning a score for an aspect even when the review text was not relevant to it.
- For instance, the review "Amazing time , very clean and friendly staff" received a positive score for the "Casino Experience" aspect. This not only introduced noise into the results but was also computationally inefficient, as it required running a resource-intensive model unnecessarily.
- To address these limitations, a more sophisticated two-step pipeline was implemented:
   1. A high-speed, zero-shot classification model, facebook/bart-large-mnli, performs a relevance check to identify which, if any, aspects are actually discussed in a review.
   2. Only if an aspect’s relevance score from this initial check surpasses a predefined confidence threshold of 80% is the review then passed to the specialised yangheng/deberta-v3-base-absa-v1.1 model for fine-grained sentiment analysis.
- This conditional, two-step approach ensures that sentiment is only evaluated where contextually relevant, thereby improving the accuracy of the insights and significantly reducing the computational load of the pipeline.
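
The gating logic described above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's implementation: `two_step_absa`, `fake_relevance` and `fake_sentiment` are hypothetical names, and the two stand-in functions return made-up scores in place of the real models.

```python
from typing import Callable, Dict, List, Tuple

RELEVANCE_THRESHOLD = 0.80  # confidence threshold stated in the text


def two_step_absa(
    review: str,
    aspects: List[str],
    relevance_fn: Callable[[str, List[str]], Dict[str, float]],
    sentiment_fn: Callable[[str, str], Tuple[str, float]],
    threshold: float = RELEVANCE_THRESHOLD,
) -> List[dict]:
    """Run sentiment only for aspects whose relevance exceeds the threshold."""
    results = []
    relevance = relevance_fn(review, aspects)  # step 1: one score per aspect
    for aspect in aspects:
        score = relevance.get(aspect, 0.0)
        if score > threshold:  # step 2 runs only for relevant aspects
            label, confidence = sentiment_fn(review, aspect)
            results.append({
                "aspect": aspect,
                "relevance_score": score,
                "sentiment": label,
                "sentiment_score": confidence,
            })
    return results


# Hypothetical stand-ins for the two models, returning made-up scores.
def fake_relevance(review: str, aspects: List[str]) -> Dict[str, float]:
    made_up = {"Customer Service & Staff": 0.95, "Casino Experience": 0.03}
    return {aspect: made_up.get(aspect, 0.0) for aspect in aspects}


def fake_sentiment(review: str, aspect: str) -> Tuple[str, float]:
    return ("Positive", 0.99)


rows = two_step_absa(
    "Amazing time , very clean and friendly staff",
    ["Customer Service & Staff", "Casino Experience"],
    fake_relevance,
    fake_sentiment,
)
print(rows)
```

Passing the two models in as callables keeps the gating rule separate from model loading, so the threshold behaviour can be checked without a GPU.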

====

The models for the two-step pipeline were selected for their distinct and complementary strengths:

- For the initial relevance check, the facebook/bart-large-mnli model was chosen. This model is a zero-shot classifier trained on Natural Language Inference (NLI): each candidate label (here, each aspect) is turned into a hypothesis and scored against the review text, one premise-hypothesis pass per label, so custom labels can be checked without any task-specific training. Its primary role is to serve as a relevance filter ahead of the sentiment model.
- For example, given the review "The pool was amazing, but the check-in queue was a nightmare," this model efficiently determines that the most relevant aspects are 'Facilities & Amenities' and 'Customer Service & Staff', while assigning a near-zero relevance score to 'Casino Experience'.

- For the subsequent sentiment analysis, the specialised yangheng/deberta-v3-base-absa-v1.1 model was used. This model has been specifically fine-tuned on ABSA datasets, making it an expert at pinpointing the sentiment directed towards a specific target within a sentence. Its role is that of a precise but computationally expensive analyst.
- Continuing the example, this model is called twice: first, with the aspect 'Facilities & Amenities', it focuses on the phrase "The pool was amazing" to return a Positive sentiment. Second, with 'Customer Service & Staff', it homes in on "the check-in queue was a nightmare" to return a Negative sentiment, correctly distinguishing the two conflicting opinions.
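
To make the relevance check more concrete, the sketch below shows roughly how an NLI-based zero-shot classifier frames the task: the review is the premise, and each candidate aspect becomes one hypothesis. The template string is the default used by the transformers zero-shot pipeline; the helper names and the logits are made up for illustration and do not come from the model. In multi-label mode each aspect is scored independently from its own contradiction and entailment logits, so the scores need not sum to 1.

```python
import math
from typing import List, Tuple

# Default hypothesis template of the transformers zero-shot pipeline.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_nli_pairs(review: str, labels: List[str]) -> List[Tuple[str, str]]:
    """Pair the review (premise) with one hypothesis per candidate label."""
    return [(review, HYPOTHESIS_TEMPLATE.format(label)) for label in labels]


def multi_label_score(contradiction_logit: float, entailment_logit: float) -> float:
    """Softmax over (contradiction, entailment) only; return P(entailment)."""
    m = max(contradiction_logit, entailment_logit)  # subtract max for stability
    e_con = math.exp(contradiction_logit - m)
    e_ent = math.exp(entailment_logit - m)
    return e_ent / (e_con + e_ent)


review = "The pool was amazing, but the check-in queue was a nightmare"
labels = ["Facilities & Amenities", "Customer Service & Staff", "Casino Experience"]
pairs = build_nli_pairs(review, labels)

# Made-up (contradiction, entailment) logits; real values come from the model.
made_up_logits = [(-2.0, 3.0), (-1.5, 2.5), (4.0, -3.0)]
scores = [multi_label_score(c, e) for c, e in made_up_logits]

for (premise, hypothesis), score in zip(pairs, scores):
    print(f"{score:.3f}  {hypothesis}")
```

Because there is one premise-hypothesis pair per aspect, the cost of the relevance step grows with the number of aspects being checked.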

Sources:
1. https://huggingface.co/facebook/bart-large-mnli
2. https://huggingface.co/yangheng/deberta-v3-base-absa-v1.1

In [None]:
# # https://huggingface.co/facebook/bart-large-mnli

# from transformers import pipeline
# classifier = pipeline("zero-shot-classification",
#                       model="facebook/bart-large-mnli")

# sequence_to_classify = "one day I will see the world"
# candidate_labels = ['travel', 'cooking', 'dancing']
# classifier(sequence_to_classify, candidate_labels)

# #{'labels': ['travel', 'dancing', 'cooking'],
# # 'scores': [0.9938651323318481, 0.0032737774308770895, 0.002861034357920289],
# # 'sequence': 'one day I will see the world'}

In [18]:
# TWO STEP PROCESSING

# --- LOAD MODELS ---
print("\nLoading models (this may take a few minutes)...")
# Model 1: Fast Zero-Shot classifier for Relevance Identification
relevance_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# Model 2: Specialized ABSA classifier for Sentiment Analysis
sentiment_classifier = pipeline("text-classification", model="yangheng/deberta-v3-base-absa-v1.1")
print("Models loaded.")


Loading models (this may take a few minutes)...


Device set to use cuda:0
Device set to use cuda:0


Models loaded.


In [21]:
# --- RUN THE ABSA PIPELINE (IN BATCHES TO PREVENT CRASHES) ---

# Define the size of each batch
BATCH_SIZE = 500  # You can adjust this if needed

# Use pandas to read the CSV in chunks for memory-efficient processing
try:
    # We use Polars to quickly scan the file and get an accurate count.
    total_rows = pl.scan_csv(FILE_PATH).collect().height

    # Now, read the file in chunks with pandas for the batch processing loop.
    csv_chunks = pd.read_csv(FILE_PATH, chunksize=BATCH_SIZE)

    print(f"\nProcessing {total_rows} reviews in batches of {BATCH_SIZE}...")
except FileNotFoundError:
    print(f"ERROR: File not found at path: {FILE_PATH}")
    print("Please make sure the path is correct and your Google Drive is mounted.")
    raise  # re-raise rather than calling exit(), which would shut down the notebook kernel


results = []
row_offset = 0 # Initialize a counter to create a continuous review_id
# Wrap the csv_chunks iterator with tqdm for a progress bar
for batch_pd in tqdm(csv_chunks, total=-(total_rows // -BATCH_SIZE), desc="Processing Batches"):

    # Manually add a continuous review_id to each batch
    batch_pd['review_id'] = range(row_offset, row_offset + len(batch_pd))
    row_offset += len(batch_pd)

    # Convert the pandas batch to a Polars DataFrame for processing
    batch_pl = pl.from_pandas(batch_pd)

    # Calculate word count for the current batch
    batch_pl = batch_pl.with_columns(
        pl.col("review_text").str.split(" ").list.len().alias("word_count")
    )

    # Iterate through each review in the current batch
    for row in batch_pl.iter_rows(named=True):

        # We only process reviews with more than 5 words
        if not row['review_text'] or row['word_count'] <= 5:
            continue

        review_id = row['review_id']
        review_text = row['review_text']

        try:
            # --- STEP 1: RELEVANCE IDENTIFICATION ---
            relevance_prediction = relevance_classifier(review_text, candidate_labels=ASPECTS, multi_label=True)

            for aspect, relevance_score in zip(relevance_prediction['labels'], relevance_prediction['scores']):
                # --- STEP 2: CONDITIONAL SENTIMENT ANALYSIS ---
                if relevance_score > 0.80:
                    input_text = f"[CLS] {review_text} [SEP] {aspect} [SEP]"
                    sentiment_prediction = sentiment_classifier(input_text)[0]

                    results.append({
                        "review_id": review_id, "aspect": aspect, "relevance_score": relevance_score,
                        "sentiment": sentiment_prediction['label'].capitalize(),
                        "sentiment_score": sentiment_prediction['score']
                    })
        except Exception as e:
            # This catch block will handle any errors during the model inference for a single review
            print(f"Skipping review {review_id} due to an error: {e}")

print("ABSA processing complete.")


Processing 27392 reviews in batches of 500...


Processing Batches: 100%|██████████| 55/55 [1:30:37<00:00, 98.87s/it]

ABSA processing complete.





In [24]:
# --- CREATE FINAL DATAFRAME ---
if results:
    # Create the long-format DataFrame from the results list.
    long_format_df = pl.DataFrame(results)

    # Load the original full DataFrame again. We will add the word_count
    # to this DataFrame before joining.
    full_df = pl.read_csv(FILE_PATH)
    if "review_id" not in full_df.columns:
        full_df = full_df.with_row_index(name="review_id")

    # This ensures the 'word_count' column will be in the final exported CSV.
    print("\nCalculating word count for the full dataset...")
    full_df = full_df.with_columns(
        pl.col("review_text").str.split(" ").list.len().alias("word_count")
    )

    # Pivot the long-format results into a wide format for the dashboard.
    print("Pivoting data to wide format for dashboard...")
    wide_format_df = long_format_df.pivot(
        index="review_id",
        columns="aspect",
        values=["sentiment", "sentiment_score", "relevance_score"]
    )
    rename_mapping = {}
    for aspect in ASPECTS:
        if f'sentiment_{aspect}' in wide_format_df.columns: # Check if column exists before renaming
            rename_mapping[f'sentiment_{aspect}'] = f'{aspect}_sentiment'
            rename_mapping[f'sentiment_score_{aspect}'] = f'{aspect}_sentiment_score'
            rename_mapping[f'relevance_score_{aspect}'] = f'{aspect}_relevance_score'
    wide_format_df = wide_format_df.rename(rename_mapping)
    print("Pivoting complete.")

    # Join the wide results back to the full original DataFrame (which now has word_count).
    final_analysis_df = full_df.join(wide_format_df, on="review_id", how="left")

    # Create the output directory if it doesn't exist.
    output_dir = "/content/drive/MyDrive/CI/output"
    os.makedirs(output_dir, exist_ok=True)

    # Export the final wide-format DataFrame.
    output_path = os.path.join(output_dir, "absa_analysis_twostep_results_wide.csv")
    final_analysis_df.write_csv(output_path, include_bom=True)
    print(f"\nSuccessfully exported the final wide-format results to: {output_path}")

else:
    print("\nNo results were generated. This could be due to the sample size or relevance threshold.")



Calculating word count for the full dataset...
Pivoting data to wide format for dashboard...
Pivoting complete.

Successfully exported the final wide-format results to: /content/drive/MyDrive/CI/output/absa_analysis_twostep_results_wide.csv


In [26]:

# Export the same DataFrame to Parquet for efficient storage and faster loading
parquet_output_path = os.path.join(output_dir, "absa_analysis_twostep_results_wide.parquet")
final_analysis_df.write_parquet(parquet_output_path)
print(f"Successfully exported the final results to Parquet: {parquet_output_path}")

Successfully exported the final results to Parquet: /content/drive/MyDrive/CI/output/absa_analysis_twostep_results_wide.parquet
