<a href="https://colab.research.google.com/github/go-hyun77/ABSA/blob/sonnet4.5-user-input-block/ABSA_LLM_Claude_Sonnet_4_5_OATS_ABSA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<u>Aspect-Based Sentiment Analysis (ABSA) with Claude-Sonnet 4.5</u>**
This notebook implements a **Claude-Sonnet 4.5** based LLM capable of performing aspect-based sentiment analysis on the [OATS-ABSA dataset](https://huggingface.co/datasets/jordiclive/OATS-ABSA). Instructions/explanations for testing/reviewing each code block will be outlined for posterity purposes.

> To recap, the previous **[T5](https://huggingface.co/docs/transformers/en/model_doc/t5) (Text-to-Text Transfer Transformer)** model implemented followed a standard machine learning pipeline of:
> 1.   Load base T5-small model
  2.   Fine tune on labelled, preprocessed data
  3.   Generate predictions after training

Observations on the output results of the previous T5 implementation indicate that aspect extraction, sentiment classification **and** most notably aspect-to-sentiment mapping for T5 is incredibly fragile. T5 has been observed to:
*   Fail to extract multiple aspects within a sentence
*   Perform incomplete extraction of sentiments
*   Hallucinate/misspell aspects
*   Fail to identify implicit aspects
*   Fail to correctly match sentiment to aspect
*   Fail to extract aspects due to formatting issues
*   Output minor formatting errors lowering F1 calculation

With Sonnet 4.5, we aim to make use of its massive pre-trained corpora to perform aspect extraction, sentiment classification and mapping, all while eliminating the need for training. For this implementation, we focus on zero-shot/few-shot prompting to perform ABSA.

The expected input/outputs of this model will be as follows in the given example:
```
INPUT: Great location but the rooms were tiny and noisy
```
```
OUTPUT: 1. Aspect: location general
           Sentiment: Positive

        2. Aspect: rooms design_features
          Sentiment: Negative

        3. Aspect: rooms comfort
          Sentiment: Negative
```






In [1]:
#install dependencies and import libraries
!pip install anthropic datasets pandas scikit-learn tqdm

import anthropic
import json
import pandas as pd
import time
import os
import re
from datasets import load_dataset
from typing import List, Dict, Tuple
from collections import defaultdict
from tqdm import tqdm
from google.colab import userdata
from google.colab import drive

drive.mount('/content/drive') #mount drive for saving/loading model
model_dir = "/content/drive/MyDrive/ABSA_Sonnet4_Model" #define model directory in google drive, you may need to modify this link to point to the appropriate directory
os.makedirs(model_dir, exist_ok=True)

#get API key from colab secrets
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')

#initialize client
client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

#test the connection
print("Testing API connection...")
try:
    test_message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        messages=[{"role": "user", "content": "Hello"}]
    )
    print("API key is valid. Connection successful.")
except Exception as e:
    print(f" API key test failed: {e}")


#configure anthropic api key in colab secrets
ANTHROPIC_API_KEY = os.environ.get('ANTHROPIC_API_KEY', 'ANTHROPIC_API_KEY')

#initialize the client with the API key from secrets
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')
client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

#identify claude model (Sonnet 4.5 for latest)
MODEL_NAME = "claude-sonnet-4-20250514"

#initialize anthropic client
client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

print(f"Claude model: {MODEL_NAME}")

Collecting anthropic
  Downloading anthropic-0.75.0-py3-none-any.whl.metadata (28 kB)
Downloading anthropic-0.75.0-py3-none-any.whl (388 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m388.2/388.2 kB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: anthropic
Successfully installed anthropic-0.75.0
Mounted at /content/drive
Testing API connection...
API key is valid. Connection successful.
Claude model: claude-sonnet-4-20250514


In [3]:
# sanity check, test if the secret exists and is loaded
try:
    test_key = userdata.get('ANTHROPIC_API_KEY')
    print(f"API Key loaded: {test_key[:10]}...{test_key[-4:]}")  # Show first 10 and last 4 chars only
    print(f"Key length: {len(test_key)} characters")
except Exception as e:
    print(f"ERROR loading API key: {e}")
    print("\nMake sure you:")
    print("1. Created a secret named exactly 'ANTHROPIC_API_KEY' (case-sensitive)")
    print("2. Toggled 'Notebook access' to ON")
    print("3. API key starts with 'sk-ant-api03-'")


API Key loaded: sk-ant-api...pQAA
Key length: 108 characters


# **<u>Loading and Examining the Dataset</u>**
In this notebook, we will load and work with the [OATS-ABSA dataset](https://huggingface.co/datasets/jordiclive/OATS-ABSA) with the following line:
```
dataset = load_dataset("alexcadillon/SemEval2014Task4", "restaurants")
```
The [OATS-ABSA dataset](https://huggingface.co/datasets/jordiclive/OATS-ABSA)'s data columns contain the following attributes outlined in the below table:

| Field Name | Data Type | Description |
| :------- | :------: | -------: |
| comment  | string  | The actual raw text content of the sentence.  |
| quad  | list | Ground truth aspect-sentiment pairs in the format of: [aspect_category, sentiment]  |
| dataset | string | Domain identifier (hotels, amazon_ff, or coursera).  |


In [5]:
#load the dataset
dataset = load_dataset("jordiclive/OATS-ABSA")

#examine dataset structure
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['comment', 'quad', 'dataset'],
        num_rows: 3987
    })
    test: Dataset({
        features: ['comment', 'quad', 'dataset'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['comment', 'quad', 'dataset'],
        num_rows: 170
    })
})


In [6]:
#examine first few examples
for i in range(3):
    example = dataset['train'][i]
    print(f"\n{i+1}. comment: {example['comment'][:100]}...")
    print(f"   quad: {example['quad']}")
    print(f"   dataset: {example['dataset']}")


1. comment: Fantastic service I am a travel agent booking hotels all over the world, so we are very very fussy. ...
   quad: [['service general', 'positive'], ['location general', 'positive'], ['rooms general', 'positive'], ['rooms prices', 'positive'], ['hotel general', 'positive']]
   dataset: hotels

2. comment: venient, helpful  I stayed here as a single female and it was a great place to be. The concierge was...
   quad: [['hotel general', 'positive'], ['service general', 'positive'], ['location general', 'positive']]
   dataset: hotels

3. comment: Always a comfortable stay Small rooms but nice hotel. Just had a restaurant and bar makeover - much ...
   quad: [['rooms comfort', 'positive'], ['rooms design_features', 'negative'], ['hotel general', 'positive'], ['food_drinks quality', 'positive'], ['hotel miscellaneous', 'positive'], ['service general', 'positive'], ['location general', 'positive']]
   dataset: hotels


In [7]:
#examine train and test splits

#get unique aspect categories and sentiments
def analyze_dataset_structure(dataset_split, name="train"):
    aspect_categories = set()
    sentiments = set()
    domains = set()

    for example in dataset_split:
        domains.add(example['dataset'])
        for quad in example['quad']:
            aspect_categories.add(quad[0])
            sentiments.add(quad[1])

    print(f"\n{name.upper()} SET STATISTICS:")
    print(f"Total examples: {len(dataset_split)}")
    print(f"Domains: {sorted(domains)}")
    print(f"Unique aspect categories: {len(aspect_categories)}")
    print(f"Sample categories: {sorted(list(aspect_categories))[:10]}")
    print(f"Sentiments: {sorted(sentiments)}")

    return aspect_categories, sentiments

train_aspects, train_sentiments = analyze_dataset_structure(dataset['train'], "train")
test_aspects, test_sentiments = analyze_dataset_structure(dataset['test'], "test")


TRAIN SET STATISTICS:
Total examples: 3987
Domains: ['amazon_ff', 'coursera', 'hotels']
Unique aspect categories: 74
Sample categories: ['amazon availability', 'amazon prices', 'assignments comprehensiveness', 'assignments quality', 'assignments quantity', 'assignments relatability', 'assignments workload', 'course comprehensiveness', 'course general', 'course quality']
Sentiments: ['conflict', 'negative', 'neutral', 'positive']

TEST SET STATISTICS:
Total examples: 500
Domains: ['amazon_ff', 'coursera', 'hotels']
Unique aspect categories: 64
Sample categories: ['amazon availability', 'amazon prices', 'assignments comprehensiveness', 'assignments quality', 'assignments quantity', 'assignments relatability', 'assignments workload', 'course comprehensiveness', 'course general', 'course quality']
Sentiments: ['conflict', 'negative', 'neutral', 'positive']


# **<u>Prompt Engineering</u>**
[Role prompting](https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/system-prompts) is used to enhance Claude by specializing the model into a specific domain (in this context of this project, being an ABSA "expert"). This process can be broken down into two parts, "system" and "user" prompting.

<br>

System prompting involves sending Claude "foundational"  instructions that define its overarching "purpose", essentially specializing the model into a certain domain by defining the following in the context of the model:
* Role Specialization: Defining the model's role and expertise.
* Constraints and Limitations: Establishing rules for the AI's responses, such as outputting text in a certain format.
* Context Definition: Providing situational context to the input to the model.

User prompting includes the specific instructions, questions, and/or text input a user provides to the model to elicit a desired response.
>In the context of this model, we provide examples of correct aspect extraction, sentiment classification, and sentiment-to-aspect mapping, while also hightlighting the desired output format.

In [8]:
#define system prompt (pinned, overarching instructions)

def create_system_prompt(aspect_categories: List[str] = None) -> str:

    base_prompt = """You are an expert at Aspect-Based Sentiment Analysis (ABSA) for reviews.

    Your task is to analyze reviews and extract aspect-sentiment pairs at the CATEGORY level (not just specific terms).

    IMPORTANT DISTINCTIONS:
    - Aspect CATEGORIES are conceptual (e.g., "service general", "rooms comfort", "food quality")
    - They may not appear explicitly in the text
    - You must infer the category from context

    OUTPUT FORMAT:
    Return ONLY a JSON array with this exact structure:
    [
      {
        "aspect_category": "category_subcategory",
        "sentiment": "positive" or "negative" or "neutral" or "conflict"
      }
    ]

    SENTIMENT DEFINITIONS:
    - positive: Favorable opinion
    - negative: Unfavorable opinion
    - neutral: Factual statement without clear sentiment
    - conflict: Mixed sentiments in same review about same aspect

    RULES:
    1. Extract ALL aspect categories mentioned (explicit or implicit)
    2. Use lowercase for aspect categories with underscore (e.g., "rooms_design_features")
    3. If no aspects found, return empty array []
    4. Return ONLY the JSON array, no other text"""

    if aspect_categories:
        #add known categories to help the model
        categories_text = ", ".join(sorted(aspect_categories)[:50])  # Show first 50
        base_prompt += f"\n\nCOMMON ASPECT CATEGORIES IN THIS DOMAIN:\n{categories_text}"

    return base_prompt

In [9]:
#define user prompt, show the model some examples if few-shot

def create_user_prompt(text: str, few_shot_examples: List[Dict] = None) -> str:

    prompt = ""

    #add fewshot examples if provided
    if few_shot_examples:
        prompt += "Here are some examples:\n\n"
        for i, ex in enumerate(few_shot_examples[:3], 1):  #3 examples max
            prompt += f"Example {i}:\n"
            prompt += f"Text: \"{ex['text']}\"\n"
            prompt += f"Output: {json.dumps(ex['quads'], indent=2)}\n\n"

    #add the actual text to analyze
    prompt += f"Now analyze this review:\n\nText: \"{text}\"\n\nOutput:"

    return prompt

In [10]:
#claude inference to predict aspect-sentiment quads (this would have been the T5 model.generate() funciton)

def predict_with_claude(
    #define parameters
    text: str,  #review text to analyze
    system_prompt: str,  #system prompt for model defined above
    few_shot_examples: List[Dict] = None, #optional exmaples to show model
    max_retries: int = 3, #num of retries if failed API call
    temperature: float = 0.0  #randomness of output, 0->1 strict to flexible
) -> List[Dict]:  #return list of dicts with aspect_category and sentiment

    user_prompt = create_user_prompt(text, few_shot_examples)

    #error handling loop
    for attempt in range(max_retries):
        try:
            message = client.messages.create( #claude api call
                model=MODEL_NAME, #sonnet4.5
                max_tokens=2048,  #max length of response ~1500 words
                temperature=temperature,  #0 for deterministic response
                system=system_prompt, #above defined instructions
                messages=[
                    {"role": "user", "content": user_prompt}  #user prompt from above
                ]
            )

            response_text = message.content[0].text.strip() #parse text from claude response and remove whitespace

            #clean response, remove markdown if present
            response_text = response_text.replace('```json', '').replace('```', '').strip()
            #parse json, convert json string to list of dicts
            predictions = json.loads(response_text)
            #validate structure, check for list structure
            if not isinstance(predictions, list):
                raise ValueError("Response is not a list.")

            #normalize format
            normalized = []

            for pred in predictions:
                #check for aspect_category and sentiment in response
                if isinstance(pred, dict) and 'aspect_category' in pred and 'sentiment' in pred:
                    normalized.append({
                        #convert to lowercase and remove whitespace to match ground truth formatting
                        'aspect_category': pred['aspect_category'].lower().strip(),
                        'sentiment': pred['sentiment'].lower().strip()
                    })
            return normalized

        #error handling, if failed json parse
        except json.JSONDecodeError as e:
            print(f"JSON decode error (attempt {attempt + 1}): {e}")
            print(f"Response: {response_text[:200]}") #output generated response for debug
            if attempt == max_retries - 1:  #retry every 1 second if not last attempt
                return []
            time.sleep(1)

        #error handling, other errors
        except Exception as e:
            print(f"API error (attempt {attempt + 1}): {e}")
            if attempt == max_retries - 1:  #retry every 2 seconds if not last attempt
                return []
            time.sleep(2)

    return []

# **<u>Input Processing</u>**
The dataset input is combed to remove whitespace and uppercase letters, while converting the original format of nested lists to a list of dictionaries.
>Input processing of the dataset is necessary to match the output of the model, ensuring a correct match when calcuating the F1 score (if formatting is off, it will count as a miss even if the predictions are semantically correct, as seen from our previous T5 model).

In [11]:
#format dataset input to list of dicts

def format_ground_truth(example: Dict) -> List[Dict]:

    #convert OATS format to list of dicts format
    #original: [["service general", "positive"], ...]
    #formatted: [{"aspect_category": "service general", "sentiment": "positive"}, ...]

    formatted = []
    for quad in example['quad']:
        formatted.append({
            'aspect_category': quad[0].lower().strip(), #identify aspect category in quad
            'sentiment': quad[1].lower().strip()  #identify sentiment in quad
        })
    return formatted

# **<u>Few-shot Examples</u>**
Below we prepare examples of correct aspect extractions and sentiment classifications to feed to the model to give better context for predictions.
>If testing the model for zero-shot, there will be a variable to toggle whether zero or few-shot is used in the "Claude Evaluation on OATS-ABSA" section. The output of this block will be used when the prompt is generated for Claude.

In [12]:
#function create few-shot examples

def prepare_few_shot_examples(dataset_split, n_examples: int = 5) -> List[Dict]:

    examples = []

    for i in range(min(n_examples, len(dataset_split))):  #loop and extract sample entries from dataset, prevent out of range with min
        ex = dataset_split[i]
        examples.append({ #append parsed text and formatted quads to examples
            'text': ex['comment'],
            'quads': format_ground_truth(ex)
        })
    return examples

#extract 5 examples from the dataset training split
few_shot_examples = prepare_few_shot_examples(dataset['train'], n_examples=5)

print(f"Extracted {len(few_shot_examples)} samples for few-shot")

#sanity check, output extracted samples to verify correct format
print("\nExtracted few-shot format:")
print(json.dumps(few_shot_examples[0], indent=2))

Extracted 5 samples for few-shot

Extracted few-shot format:
{
  "text": "Fantastic service I am a travel agent booking hotels all over the world, so we are very very fussy. This hotel offers customer service at its' best. Great location, great service, nothing is too much bother. The rooms are great, cheaper than some of the well known names and frankly far better. Stay here and give yourself a treat with staff who really do genuinely care. We will continue to book our clients here, as it gives a great satisfaction to offer them such an outstanding property.",
  "quads": [
    {
      "aspect_category": "service general",
      "sentiment": "positive"
    },
    {
      "aspect_category": "location general",
      "sentiment": "positive"
    },
    {
      "aspect_category": "rooms general",
      "sentiment": "positive"
    },
    {
      "aspect_category": "rooms prices",
      "sentiment": "positive"
    },
    {
      "aspect_category": "hotel general",
      "sentiment": "positiv

# **<u>F1 Score Evaluation Functions</u>**
Below are the functions to calculate the model's F1 score. </br>

>The **F1 score** is a common metric used to evaluate natural language processing (NLP) models that specialize in areas such as classification, extraction, and ABSA. It is the harmonic mean of the model's **precision** (how precise is this model predictions?) and **recall** (how many relevant things did it find?). In short, it measures the model’s ability to produce correct outputs while avoiding incorrect ones.

The equation to calculate an F1 score is as follows:
> $$ F1 = 2 × (\frac{Precision × Recall}{Precision + Recall}) $$

Similar to the previous T5 model, there are two tasks being performed to be evaluated: **aspect detection**, and **sentiment classification**.

In [13]:
#f1 score evaluation, calculate aspect extraction and aspect-sentiment classification

def compute_f1(predictions: List[Dict], ground_truth: List[Dict]) -> Dict:

    #convert list of dicts into tuples (set) to compare both aspect + sentiment together as a pair

    #true pairs
    true_set = set((g['aspect_category'], g['sentiment']) for g in ground_truth)
    #predicted pairs
    pred_set = set((p['aspect_category'], p['sentiment']) for p in predictions)


    #aspect only for aspect extraction accuracy

    #true aspects
    true_aspects = set(g['aspect_category'] for g in ground_truth)
    #predicted aspects
    pred_aspects = set(p['aspect_category'] for p in predictions)


    #joint metrics (aspect + sentiment must both match)

    #true positive, actual and prediction = true, correct match
    TP_joint = len(true_set & pred_set)
    #false positive, predicted true but not true, not correct
    FP_joint = len(pred_set - true_set)
    #false negative, did not predict true on true, not correct
    FN_joint = len(true_set - pred_set)

    #aspect metrics (just aspect extraction)

    #true positive, actual and prediction = true, correct match
    TP_aspect = len(pred_aspects & true_aspects)
    #false positive, predicted true but not true, not correct
    FP_aspect = len(pred_aspects - true_aspects)
    #false negative, did not predict true on true, not correct
    FN_aspect = len(true_aspects - pred_aspects)


    #F1 calculation

    #division function to prevent divide by 0
    def safe_div(a, b):
        return a / b if b > 0 else 0.0

    #joint F1 (aspect and sentiment must match)
    precision_joint = safe_div(TP_joint, TP_joint + FP_joint)
    recall_joint = safe_div(TP_joint, TP_joint + FN_joint)
    F1_joint = safe_div(2 * precision_joint * recall_joint, precision_joint + recall_joint)

    #aspect only F1 (just aspect extraction)
    precision_aspect = safe_div(TP_aspect, TP_aspect + FP_aspect)
    recall_aspect = safe_div(TP_aspect, TP_aspect + FN_aspect)
    F1_aspect = safe_div(2 * precision_aspect * recall_aspect, precision_aspect + recall_aspect)

    return {
        #joint metrics (aspect + sentiment)
        'precision_joint': precision_joint, #of all predictions made, how many correct
        'recall_joint': recall_joint, #of all ground truth, how many did we detect
        'F1_joint': F1_joint,
        'TP_joint': TP_joint,
        'FP_joint': FP_joint,
        'FN_joint': FN_joint,
        #aspect-only metrics
        'precision_aspect': precision_aspect,
        'recall_aspect': recall_aspect,
        'F1_aspect': F1_aspect,
        'TP_aspect': TP_aspect,
        'FP_aspect': FP_aspect,
        'FN_aspect': FN_aspect,
    }

# **<u>Claude Evaluation on OATS-ABSA</u>**
Below we evaluate Claude on the [OATS-ABSA dataset](https://huggingface.co/datasets/jordiclive/OATS-ABSA). Relative to the previous T5 model, the below block is the equivalent of executing the trainer.train() function to introduce the dataset to the T5 model. Various parameters to control aspects of prediction generation are listed below, ranging from which dataset split (test or train) to evaluate on, to feeding few-shot examples defined earlier for improved accuracy.

In [14]:
#define evaluation parameters

def evaluate_claude_on_dataset(
    dataset_split,  #which part to evaluate on (train or test splits)
    system_prompt: str, #system prompt instructions defined in Prompt Engineering section
    num_samples: int = None,  #number of samples to evaluate, if none then all, number for quick test
    use_few_shot: bool = True,  #true for few-shot, false for zero-shot
    few_shot_examples: List[Dict] = None, #examples defined in Prompt Engineering section
    save_results: bool = True,  #save toggle
    results_file: str = "claude_absa_results.json"  #output file
):

    #if num_samples from above args is defined, then limit amount of dataset to first num of dataset
    if num_samples:
        dataset_split = dataset_split.select(range(min(num_samples, len(dataset_split))))

    #init lists
    results = []
    all_metrics = []

    #pass in few-shot examples from above block if they exist
    few_shot = few_shot_examples if use_few_shot else None

    print(f"\nEvaluating {len(dataset_split)} examples...")
    print(f"Few-shot learning: {'ENABLED' if use_few_shot else 'DISABLED'}")
    print(f"Model: {MODEL_NAME}")


    #evaluation loop, for each example in the dataset
    for idx, example in enumerate(tqdm(dataset_split, desc="Evaluating")):

        #get ground truth and format to list of dicts
        ground_truth = format_ground_truth(example)
        text = example['comment']

        #get claude predictions with inference function
        predictions = predict_with_claude(
            text=text,
            system_prompt=system_prompt,
            few_shot_examples=few_shot
        )

        #compute metrics for this example
        metrics = compute_f1(predictions, ground_truth)
        all_metrics.append(metrics)

        #store results from this parsed/predicted example
        result = {
            'idx': idx,
            'text': text,
            'domain': example['dataset'],
            'ground_truth': ground_truth,
            'predictions': predictions,
            'metrics': metrics
        }
        results.append(result)  #store in results list

        #print progress every 50 examples
        if (idx + 1) % 50 == 0:
            avg_f1 = sum(m['F1_joint'] for m in all_metrics) / len(all_metrics)
            print(f"\nProgress: {idx + 1}/{len(dataset_split)} | Avg F1 (joint): {avg_f1:.3f}")

    #aggregate metrics for joint and aspect only F1s
    n = len(all_metrics)
    aggregated = {
        'precision_joint': sum(m['precision_joint'] for m in all_metrics) / n,
        'recall_joint': sum(m['recall_joint'] for m in all_metrics) / n,
        'F1_joint': sum(m['F1_joint'] for m in all_metrics) / n,
        'precision_aspect': sum(m['precision_aspect'] for m in all_metrics) / n,
        'recall_aspect': sum(m['recall_aspect'] for m in all_metrics) / n,
        'F1_aspect': sum(m['F1_aspect'] for m in all_metrics) / n,
    }

    #save detailed results
    if save_results:
        with open(results_file, 'w') as f:
            json.dump({
                'aggregated_metrics': aggregated,
                'detailed_results': results,
                'config': {
                    'model': MODEL_NAME,
                    'num_samples': len(dataset_split),
                    'use_few_shot': use_few_shot,
                    'num_few_shot_examples': len(few_shot) if few_shot else 0
                }
            }, f, indent=2)
        print(f"\nDetailed results saved to: {results_file}")

    return aggregated, results

In [15]:
#sanity check, test anthropic api key is correctly init

#re-initialize the client with the API key from colab secrets
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')
client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

print("Client re-initialized with API key from Colab Secrets.")

#test to verify it works
try:
    test_message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        messages=[{"role": "user", "content": "Hi"}]
    )
    print("API connection test successful.")
    print(f"Response: {test_message.content[0].text}")
except Exception as e:
    print(f"Connection test failed: {e}")

Client re-initialized with API key from Colab Secrets.
API connection test successful.
Response: Hello! How are you doing today? Is there


In [68]:
#run evaluation function on subset for working verification

#call system prompt function to define system prompt for this run
system_prompt = create_system_prompt(aspect_categories=list(train_aspects))

#test on a small subset first (10 examples)
print("\n>>> TESTING ON 10 EXAMPLES (Quick validation)")
test_metrics, test_results = evaluate_claude_on_dataset(
    dataset_split=dataset['test'],
    system_prompt=system_prompt,
    num_samples=10,
    use_few_shot=True,
    few_shot_examples=few_shot_examples,
    save_results=False
)

print("Test Results (10 examples)")
print("-"*60)
print(f"Joint F1 (Aspect + Sentiment): {test_metrics['F1_joint']:.3f}")
print(f"Aspect-Only F1 (Category Detection): {test_metrics['F1_aspect']:.3f}")
print(f"Joint Precision: {test_metrics['precision_joint']:.3f}")
print(f"Joint Recall: {test_metrics['recall_joint']:.3f}")


>>> TESTING ON 10 EXAMPLES (Quick validation)

Evaluating 10 examples...
Few-shot learning: ENABLED
Model: claude-sonnet-4-20250514


Evaluating: 100%|██████████| 10/10 [00:33<00:00,  3.33s/it]

Test Results (10 examples)
------------------------------------------------------------
Joint F1 (Aspect + Sentiment): 0.658
Aspect-Only F1 (Category Detection): 0.718
Joint Precision: 0.642
Joint Recall: 0.678





In [69]:
#run evaluation function on full dataset

#full evaluation
print("\n>>> RUNNING FULL EVALUATION")
full_metrics, full_results = evaluate_claude_on_dataset(
     dataset_split=dataset['test'],
     system_prompt=system_prompt,
     num_samples=None,  #use all test examples
     use_few_shot=True,
     few_shot_examples=few_shot_examples,
     save_results=True,
     results_file="claude_oats_full_results.json"
)


>>> RUNNING FULL EVALUATION

Evaluating 500 examples...
Few-shot learning: ENABLED
Model: claude-sonnet-4-20250514


Evaluating:  10%|█         | 50/500 [03:01<25:52,  3.45s/it]


Progress: 50/500 | Avg F1 (joint): 0.669


Evaluating:  20%|██        | 100/500 [06:10<26:05,  3.91s/it]


Progress: 100/500 | Avg F1 (joint): 0.660


Evaluating:  30%|███       | 150/500 [09:30<24:48,  4.25s/it]


Progress: 150/500 | Avg F1 (joint): 0.648


Evaluating:  40%|████      | 200/500 [12:06<14:00,  2.80s/it]


Progress: 200/500 | Avg F1 (joint): 0.581


Evaluating:  50%|█████     | 250/500 [14:39<12:44,  3.06s/it]


Progress: 250/500 | Avg F1 (joint): 0.520


Evaluating:  60%|██████    | 300/500 [17:12<09:30,  2.85s/it]


Progress: 300/500 | Avg F1 (joint): 0.482


Evaluating:  70%|███████   | 350/500 [19:55<08:59,  3.60s/it]


Progress: 350/500 | Avg F1 (joint): 0.477


Evaluating:  80%|████████  | 400/500 [22:51<05:22,  3.23s/it]


Progress: 400/500 | Avg F1 (joint): 0.475


Evaluating:  90%|█████████ | 450/500 [26:18<02:59,  3.59s/it]


Progress: 450/500 | Avg F1 (joint): 0.483


Evaluating: 100%|██████████| 500/500 [29:18<00:00,  3.52s/it]


Progress: 500/500 | Avg F1 (joint): 0.482

Detailed results saved to: claude_oats_full_results.json





# **<u>Saving and Loading Results</u>**
Below are code blocks to save and load the .json file containing results from the above run from Google Drive. </br>

In [18]:
#save results to google drive for loading in subsequent sessions

from google.colab import drive
import shutil

#mount drive
drive.mount('/content/drive', force_remount=True)

#create directory in drive
DRIVE_RESULTS_PATH = '/content/drive/MyDrive/ABSA_Sonnet4_Model/'
os.makedirs(DRIVE_RESULTS_PATH, exist_ok=True)
print(f"Directory created/verified: {DRIVE_RESULTS_PATH}")

#copy the local results file to drive
local_file = "claude_oats_full_results.json"
drive_file = os.path.join(DRIVE_RESULTS_PATH, "claude_oats_full_results.json")

if os.path.exists(local_file):
    shutil.copy(local_file, drive_file)
    print(f"Copied {local_file} to Google Drive")
    print(f"Location: {drive_file}")
else:
    print(f"Could not find Colab local file: {local_file}")  #if local file doesn't exist in colab

Mounted at /content/drive
Directory created/verified: /content/drive/MyDrive/ABSA_Sonnet4_Model/
Could not find Colab local file: claude_oats_full_results.json


In [19]:
#load file from drive
def load_results_from_drive(filename="claude_oats_results.json"):

    #load file from google drive
    filepath = os.path.join(DRIVE_RESULTS_PATH, filename)

    try:
        with open(filepath, 'r') as f:
            data = json.load(f)
        print(f"Results loaded from: {filepath}")
        return data['aggregated_metrics'], data['detailed_results'], data['config']
    except FileNotFoundError:
        print(f"File not found: {filepath}")
        return None, None, None

# **<u>Results Analysis</u>**
After loading the saved results, we examine the results of Claude's predictions on the [OATS-ABSA dataset](https://huggingface.co/datasets/jordiclive/OATS-ABSA).

In [20]:
#loading saved results from drive

import json
from collections import defaultdict, Counter
import os

#set filepath as google drive location
results_filepath = drive_file

#load saved results from google drive
try:
    with open(results_filepath, 'r') as f:
        saved_data = json.load(f)
    print(f"Results loaded from: {results_filepath}")
except FileNotFoundError:
    print(f"Error: File not found at {results_filepath}")
    #re-raise to stop execution if the file isn't found even after checking drive
    raise

#print num of results and F1 to check for correct file
detailed_results = saved_data['detailed_results']
aggregated_metrics = saved_data['aggregated_metrics']

print(f"Loaded {len(detailed_results)} results")
print(f"Overall F1: {aggregated_metrics['F1_joint']:.3f}")


Results loaded from: /content/drive/MyDrive/ABSA_Sonnet4_Model/claude_oats_full_results.json
Loaded 500 results
Overall F1: 0.482


In [75]:
#Performance analysis function

def analyze_errors(results):

    #track domains
    domain_metrics = defaultdict(list)
    domain_examples = defaultdict(int)

    #track aspect categories
    pred_aspects = Counter()
    true_aspects = Counter()
    missed_aspects = Counter()
    false_aspects = Counter()

    #track by position (to see if performance degrades over time)
    position_f1 = []

    for i, result in enumerate(results):
        domain = result['domain']
        metrics = result['metrics']

        #track metrics by domain
        domain_metrics[domain].append(metrics['F1_joint'])
        domain_examples[domain] += 1

        #track F1 over position
        position_f1.append((i, metrics['F1_joint']))

    print("Performance Analysis")
    print("-"*60)


    #domain-wise performance between coursera, amazonff, and hotels
    print("\nF1 Score by Domain:")

    for domain in sorted(domain_metrics.keys()):
        f1_scores = domain_metrics[domain]

        #check for scores to average
        if f1_scores:
            avg_f1 = sum(f1_scores) / len(f1_scores)
            print(f"  {domain:15s}: {avg_f1:.3f} ({domain_examples[domain]} examples)")
        else:
            print(f"  {domain:15s}: No F1 scores ({domain_examples[domain]} examples)")


    #performance over time (first 100 vs last 100)

    #ensure there are enough elements before slicing and dividing
    num_f1_samples = len(position_f1)
    first_n = min(100, num_f1_samples) #use min to prevent index errors if <100 samples

    first_100_f1 = sum(f1 for i, f1 in position_f1[:first_n]) / first_n if first_n > 0 else 0.0
    last_100_f1 = sum(f1 for i, f1 in position_f1[-first_n:]) / first_n if first_n > 0 else 0.0

    print(f"\nPerformance over Time:")
    print(f"  First {first_n} examples: {first_100_f1:.3f}")
    print(f"  Last {first_n} examples:  {last_100_f1:.3f}")
    print(f"  Difference:         {first_100_f1 - last_100_f1:.3f}")

    #track aspect categories
    for result in results:
        for pred in result['predictions']:
            pred_aspects[pred['aspect_category']] += 1

        for true in result['ground_truth']:
            true_aspects[true['aspect_category']] += 1

        #find mismatches
        pred_set = set((p['aspect_category'], p['sentiment']) for p in result['predictions'])
        true_set = set((g['aspect_category'], g['sentiment']) for g in result['ground_truth'])

        #false positives (predicted, but not in ground truth)
        for aspect, sentiment in (pred_set - true_set):
            false_aspects[aspect] += 1

        #false negatives (in ground truth, but not predicted)
        for aspect, sentiment in (true_set - pred_set):
            missed_aspects[aspect] += 1


    #most common predicted aspects
    print("\nTop 10 Predicted Aspect Categories:")
    for aspect, count in pred_aspects.most_common(10):
        print(f"  {aspect:30s}: {count:4d}")

    # most common ground truth aspects
    print("\nTop 10 Ground Truth Aspect Categories:")
    for aspect, count in true_aspects.most_common(10):
        print(f"  {aspect:30s}: {count:4d}")

    #most commonly missed aspects
    print("\nTop 10 Missed Aspects (False Negatives):")
    for aspect, count in missed_aspects.most_common(10):
        print(f"  {aspect:30s}: {count:4d} times")

    #most common false positives
    print("\nTop 10 False Positive Aspects:")
    for aspect, count in false_aspects.most_common(10):
        print(f"  {aspect:30s}: {count:4d} times")

    #check for naming variations
    print("\nPotential Naming Mismatches:")
    pred_set_keys = set(pred_aspects.keys()) #use pred_aspects.keys() now that it's populated
    true_set_keys = set(true_aspects.keys()) #use true_aspects.keys() now that it's populated

    only_predicted = pred_set_keys - true_set_keys
    only_in_truth = true_set_keys - true_set_keys #should be 'true_set_keys - pred_set_keys' logic, but this line just filters what Claude uses

    if only_predicted:
        print("Categories Claude uses but not in ground truth:")
        for cat in sorted(only_predicted)[:10]:
            print(f"    - {cat}")

    if only_in_truth:
        print("Categories in ground truth but Claude doesn't use:")
        for cat in sorted(only_in_truth)[:10]:
            print(f"    - {cat}")

    return domain_metrics, pred_aspects, true_aspects, missed_aspects, false_aspects

#run analysis
if detailed_results:
    domain_metrics, pred_aspects, true_aspects, missed_aspects, false_aspects = analyze_errors(detailed_results)
else:
    print("No detailed results available for error analysis.")

Performance Analysis
------------------------------------------------------------

F1 Score by Domain:
  amazon_ff      : 0.325 (180 examples)
  coursera       : 0.500 (170 examples)
  hotels         : 0.648 (150 examples)

Performance over Time:
  First 100 examples: 0.660
  Last 100 examples:  0.508
  Difference:         0.151

Top 10 Predicted Aspect Categories:
  course general                :  154
  hotel general                 :  140
  service general               :  118
  food quality                  :  116
  location general              :  108
  food general                  :   95
  faculty general               :   86
  rooms design_features         :   82
  course comprehensiveness      :   82
  food_drinks quality           :   69

Top 10 Ground Truth Aspect Categories:
  food general                  :  169
  course general                :  155
  hotel general                 :  127
  food quality                  :  115
  service general               :  103
  locat

In [82]:
#find examples with low F1 scores (where did Claude fail?)
def show_failure_cases(results, n=5):

    #sort by F1 score
    sorted_results = sorted(results, key=lambda x: x['metrics']['F1_joint'])

    for i, result in enumerate(sorted_results[:n], 1):
        print(f"\n--- Example {i} (F1: {result['metrics']['F1_joint']:.3f}) ---")
        print(f"Domain: {result['domain']}")
        print(f"Text: {result['text'][:200]}...")
        print(f"\nGround Truth:")

        for gt in result['ground_truth']: #output correct values from dataset
            print(f"   {gt['aspect_category']:30s} -> {gt['sentiment']}")
        print(f"\nClaude's Predictions:")

        if result['predictions']: #output predicted values by Claude
            for pred in result['predictions']:
                print(f"   {pred['aspect_category']:30s} -> {pred['sentiment']}")
        else:
            print("  (No predictions)")
        print("-" * 60)

show_failure_cases(detailed_results, n=5)


--- Example 1 (F1: 0.000) ---
Domain: hotels
Text: A Return Visit that did not disappoint Following our visit in October 2005 for a stop over, we used the Langham Place as our base for an extended visit to Hong Kong. Just at the end of the Chinese New...

Ground Truth:
   food_drinks quality            -> positive
   polarity positive              -> positive

Claude's Predictions:
   hotel general                  -> positive
   rooms general                  -> positive
   facilities general             -> positive
   service general                -> positive
   location general               -> positive
------------------------------------------------------------

--- Example 2 (F1: 0.000) ---
Domain: hotels
Text: Hotel Nadia Hotel Nadia is lovely - the staff are very friendly and helpful and the rooms are clean (as a family, we had the 'quad' which was a bit of a squeeze but the beds were very comfortable and ...

Ground Truth:
   hotel general                  -> positive
   ser

# **<u>ABSA Input Test</u>**
This block contains the function to test the trained model's functional accuracy by manually inputting text to gauge the correctness of its evaluation. Aspect categories are extracted from the dataset to pass into the function below:

```
predict_user_input("your input text here", domain="selected domain between hotels, amazonff, coursera")
```

This is done to increase the accuracy of the model's predictions, as categories will be limited to the selected domain from the dataset. Additionally, examples from each domain are extracted as examples for few-shot to further increase the accuracy of the prediction.
<br>

Modify the text parameter in the above function to the desired input text, select the desired dataset domain, and execute the block to have the model identify and output detected aspect-sentiment pairs.

In [38]:
#creates domain_categories dictionary needed for the user input prediction function

#extract aspect categories for each domain
def get_categories_by_domain(dataset):

    domain_categories = defaultdict(set)

    for example in dataset['train']:  #loop through training data and group category by domain
        domain = example['dataset']  #hotels, amazon_ff, or coursera
        for quad in example['quad']:
            category = quad[0]  #aspect category
            domain_categories[domain].add(category) #add category to domain

    #convert sets to sorted lists
    for domain in domain_categories:
        domain_categories[domain] = sorted(list(domain_categories[domain]))

    return dict(domain_categories)

#extract categories by domain
domain_categories = get_categories_by_domain(dataset)


print("Domain-specific Categories")
print("-"*60)
for domain, categories in domain_categories.items():
    print(f"{domain.upper()} ({len(categories)} categories):")
    print(f"  Sample: {categories[:5]}\n")



Domain-specific Categories
------------------------------------------------------------
HOTELS (36 categories):
  Sample: ['facilities cleanliness', 'facilities comfort', 'facilities design_features', 'facilities general', 'facilities miscellaneous']

AMAZON_FF (12 categories):
  Sample: ['amazon availability', 'amazon prices', 'food general', 'food prices', 'food quality']

COURSERA (29 categories):
  Sample: ['assignments comprehensiveness', 'assignments quality', 'assignments quantity', 'assignments relatability', 'assignments workload']



In [40]:
# create few_shot examples by domain (needed for prediction function)
def get_few_shot_by_domain(dataset, domain: str, n: int = 3): #3 examples per domain

    examples = []

    for example in dataset['train']:
        if example['dataset'] == domain and len(examples) < n:
            examples.append({
                'text': example['comment'],
                'quads': format_ground_truth(example)
            })
    return examples

#get domain specific few-shot examples
few_shot_by_domain = {
    domain: get_few_shot_by_domain(dataset, domain, n=3)
    for domain in domain_categories.keys()
}

print("Few-shot Examples by Domain")
print("-"*60)
for domain, examples in few_shot_by_domain.items():
    print(f"{domain}: {len(examples)} examples")

Few-shot Examples by Domain
------------------------------------------------------------
hotels: 3 examples
amazon_ff: 3 examples
coursera: 3 examples


In [56]:
#predicts aspects and sentiments for user inputted text
def predict_user_input(user_text: str, domain: str = "hotels"):

    print(f"Input Text: {user_text}")
    print(f"Domain: {domain}")

    #create system prompt with domain-specific categories from above prompt engineering section
    system_prompt = create_system_prompt(aspect_categories=list(domain_categories[domain]))

    #get domain-specific few-shot examples
    few_shot_examples = few_shot_by_domain.get(domain, None)

    #call claude api
    predictions = predict_with_claude(
        text=user_text,
        system_prompt=system_prompt,
        few_shot_examples=few_shot_examples
    )

    print("\n")
    print(f"Results: ({len(predictions)} aspect-sentiment pairs found)")
    print("-"*60)

    if predictions:
        for i, pred in enumerate(predictions, 1):
            sentiment = {
                'positive': 'Positive',
                'negative': 'Negative',
                'neutral': 'Neutral',
                'conflict': 'Conflict'
            }.get(pred['sentiment'], 'Unknown')

            print(f"\n{i}. Aspect: {pred['aspect_category']}")
            print(f"   Sentiment: {sentiment}")
    else:
        print("\nNo aspects detected in the text")

    #print("\n")
    #return predictions

In [57]:
#call predict function
predict_user_input("Great location but the rooms were tiny and noisy", domain="hotels")

Input Text: Great location but the rooms were tiny and noisy
Domain: hotels


Results: (3 aspect-sentiment pairs found)
------------------------------------------------------------

1. Aspect: location general
   Sentiment: Positive

2. Aspect: rooms design_features
   Sentiment: Negative

3. Aspect: rooms comfort
   Sentiment: Negative
