# Task 1: Yelp Review Classification using Prompt Engineering

**Objective**: Classify Yelp reviews into 1–5 star ratings using prompt-based LLM inference, returning structured JSON.

## Overview
- Dataset: Yelp Reviews (CSV format)
- Sample size: ~200 rows
- Prompting approaches: 3 different strategies
- Output format: JSON with predicted_stars and explanation


## 1. Imports and Setup

**Note for Colab users**: 
- Go to the 🔑 icon (secrets) in the left sidebar
- Add a new secret with key: `GROQ_API_KEY` and value: your Groq API key
- The code will automatically load it when running in Colab


In [None]:
# Install required package for Colab (run this first if in Google Colab)
# !pip install groq pandas numpy python-dotenv

import pandas as pd
import json
import time
from collections import Counter
from groq import Groq
import numpy as np
import os
from dotenv import load_dotenv

# Load environment variables
# For Colab: Use Colab secrets (run: from google.colab import userdata; API_KEY = userdata.get('GROQ_API_KEY'))
# For local: Loads from .env.local file
try:
    from google.colab import userdata
    API_KEY = userdata.get('GROQ_API_KEY')
except ImportError:
    # Local development - load from .env.local
    load_dotenv('.env.local')
    API_KEY = os.getenv('GROQ_API_KEY')

if not API_KEY:
    raise ValueError("GROQ_API_KEY not found. Please set it in Colab secrets or .env.local file")

# Initialize Groq client
client = Groq(api_key=API_KEY)

# Model name (any chat model available to your Groq account works; model IDs are
# retired periodically, so check Groq's current model list if this one is rejected)
MODEL_NAME = 'llama-3.1-70b-versatile'

print("Setup complete! Using Groq API")


## 2. Data Loading and Sampling


In [None]:
# Load dataset
# For Colab: Upload the CSV file first, then adjust path
# For local: Update path to your dataset location
df = pd.read_csv('yelp_reviews.csv')

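# Ensure we have the required columns, falling back to common alternative names
required_cols = ['text', 'stars']
if not all(col in df.columns for col in required_cols):
    if 'text' not in df.columns and 'review' in df.columns:
        df['text'] = df['review']
    if 'stars' not in df.columns and 'rating' in df.columns:
        df['stars'] = df['rating']

# Fail early with a clear message instead of a KeyError further down
missing = [col for col in required_cols if col not in df.columns]
if missing:
    raise KeyError(f"Dataset is missing required columns: {missing}")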

# Keep only text and stars columns
df = df[['text', 'stars']].copy()

# Clean data
df = df.dropna(subset=['text', 'stars'])
df['stars'] = df['stars'].astype(int)
df = df[df['stars'].between(1, 5)]

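# Sample ~200 rows, stratified by rating (up to sample_size // 5 per star level)
sample_size = 200
if len(df) >= sample_size:
    per_class = sample_size // 5
    # Sample each star level separately, then shuffle the combined rows.
    # Shuffling with frac=1 avoids an error when a rating has fewer than per_class rows.
    df_sample = pd.concat(
        [group.sample(min(len(group), per_class), random_state=42)
         for _, group in df.groupby('stars')]
    ).sample(frac=1, random_state=42).reset_index(drop=True)
else:
    df_sample = df.reset_index(drop=True)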

print(f"Dataset loaded: {len(df_sample)} reviews")
print(f"Star distribution:\n{df_sample['stars'].value_counts().sort_index()}")


## 3. Prompt Definitions

### Prompt Approach 1: Base Sentiment-to-Rating
A straightforward mapping from sentiment to star ratings.
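
The prompt functions in this section are Python f-strings, so the literal braces of the JSON template are written doubled (`{{` and `}}`) while `{review_text}` is substituted. A minimal sketch of that behavior, using a hypothetical `demo_prompt` helper and a made-up review that are not part of the notebook:

```python
def demo_prompt(review_text):
    """Hypothetical miniature of the notebook's prompt functions."""
    return f"""Review: {review_text}

Respond with valid JSON only:
{{
  "predicted_stars": <integer 1-5>
}}"""


# Made-up review text, for illustration only
rendered = demo_prompt("Great tacos, slow service.")
print(rendered)
```

The rendered prompt contains single braces, which is what the model sees.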


In [None]:
def prompt_v1(review_text):
    """Base prompt: Direct sentiment to rating mapping"""
    return f"""Classify the following Yelp review into a 1-5 star rating.

Review: {review_text}

Respond with valid JSON only in this exact format:
{{
  "predicted_stars": <integer 1-5>,
  "explanation": "<brief reasoning>"
}}"""


### Prompt Approach 2: Refined with Rating Calibration
Improved prompt with explicit rating criteria and clearer calibration.


In [None]:
def prompt_v2(review_text):
    """Refined prompt: Explicit rating criteria with calibration"""
    return f"""Classify this Yelp review into a 1-5 star rating using these criteria:
- 1 star: Extremely negative, major complaints, service failure
- 2 stars: Negative experience, significant issues, not recommended
- 3 stars: Neutral or mixed, average experience, okay but nothing special
- 4 stars: Positive experience, good service, minor issues at most
- 5 stars: Extremely positive, exceptional service, highly recommended

Review: {review_text}

Output valid JSON only:
{{
  "predicted_stars": <integer 1-5>,
  "explanation": "<brief reasoning>"
}}"""


### Prompt Approach 3: Focus on Consistency and Reliability
Enhanced prompt with a role definition, rating guidelines, and an explicit emphasis on consistency (no few-shot examples are included).


In [None]:
def prompt_v3(review_text):
    """Consistency-focused prompt: role definition, guidelines, and explicit instructions"""
    return f"""You are a consistent Yelp rating classifier. Classify the review below.

Rating Guidelines:
1 star = Terrible experience, major problems, strongly negative
2 stars = Poor experience, multiple issues, dissatisfied
3 stars = Average/mixed experience, acceptable but unremarkable
4 stars = Good experience, positive overall, minor issues
5 stars = Excellent experience, highly positive, exceptional service

Review text: {review_text}

Analyze the sentiment, tone, and content. Return ONLY valid JSON:
{{
  "predicted_stars": <integer 1-5>,
  "explanation": "<brief reasoning for the rating>"
}}

Ensure the JSON is valid and the rating matches the review content."""


## 4. LLM Inference Function
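
The inference function in this section accepts a response only if it parses as JSON, contains both expected keys, and carries an integer rating from 1 to 5. The sketch below isolates that validation step so it can be tried without an API call; `parse_rating` and the sample strings are hypothetical stand-ins, not part of the notebook.

```python
import json


def parse_rating(text):
    """Return (stars, explanation, is_valid) for a raw model response."""
    try:
        result = json.loads(text)
        stars = int(result["predicted_stars"])
        explanation = result["explanation"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing keys, or a non-numeric rating
        return None, None, False
    if stars not in range(1, 6):
        return None, None, False
    return stars, explanation, True


# Hypothetical responses, for illustration only
good = '{"predicted_stars": 4, "explanation": "Mostly positive."}'
out_of_range = '{"predicted_stars": 7, "explanation": "Too high."}'
not_json = "I would give this four stars."

print(parse_rating(good))
print(parse_rating(out_of_range))
print(parse_rating(not_json))
```

Only the first response passes; the other two come back as `(None, None, False)`.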




In [None]:
def classify_review(prompt_func, review_text, max_retries=3):
    """Classify a review with the given prompt function; returns (stars, explanation, is_valid)"""
    prompt = prompt_func(review_text)
    
    for attempt in range(max_retries):
        try:
            # Use Groq API
            response = client.chat.completions.create(
                model=MODEL_NAME,
                messages=[
                    {"role": "user", "content": prompt}
                ],
                temperature=0.3,
                response_format={"type": "json_object"}
            )
            
            text = response.choices[0].message.content.strip()
            
            # Extract JSON from response
            # Handle markdown code blocks
            if '```json' in text:
                text = text.split('```json')[1].split('```')[0].strip()
            elif '```' in text:
                text = text.split('```')[1].split('```')[0].strip()
            
            # Parse JSON
            result = json.loads(text)
            
            # Validate structure
            if 'predicted_stars' not in result or 'explanation' not in result:
                return None, None, False
            
            stars = int(result['predicted_stars'])
            if stars not in range(1, 6):
                return None, None, False
            
            return stars, result['explanation'], True
            
        except Exception:
            # Covers malformed JSON, non-integer ratings, and API/network errors:
            # retry after a short pause, then give up on the last attempt
            if attempt < max_retries - 1:
                time.sleep(0.5)
                continue
            return None, None, False
    
    return None, None, False


## 5. Evaluation


In [None]:
def evaluate_prompt(prompt_func, df_sample, num_runs=1):
    """Evaluate a prompt approach on the sample dataset"""
    results = {
        'predictions': [],
        'actual': [],
        'valid_json': [],
        'consistency_scores': []
    }
    
    # First pass: Get predictions
    print("Running predictions...")
    for i, (_, row) in enumerate(df_sample.iterrows(), start=1):
        review_text = str(row['text'])
        actual_stars = int(row['stars'])
        
        predicted, explanation, valid = classify_review(prompt_func, review_text)
        
        results['predictions'].append(predicted)
        results['actual'].append(actual_stars)
        results['valid_json'].append(valid)
        
        # Report progress by position, not by DataFrame index label
        if i % 50 == 0:
            print(f"  Processed {i}/{len(df_sample)} reviews")
        
        # Rate limiting
        time.sleep(0.1)
    
    # Consistency check: Run same reviews multiple times
    if num_runs > 1:
        print("Running consistency check...")
        # Fixed seed so every prompt is re-run on the same subset of reviews
        rng = np.random.default_rng(42)
        sample_indices = rng.choice(len(df_sample), size=min(20, len(df_sample)), replace=False)
        
        for idx in sample_indices:
            review_text = str(df_sample.iloc[idx]['text'])
            predictions = []
            
            for run in range(num_runs):
                predicted, _, valid = classify_review(prompt_func, review_text)
                if valid and predicted:
                    predictions.append(predicted)
                time.sleep(0.1)
            
            # Calculate consistency (share of runs that agree with the most common prediction)
            if len(predictions) > 0:
                most_common = Counter(predictions).most_common(1)[0][1]
                consistency = most_common / len(predictions)
                results['consistency_scores'].append(consistency)
    
    return results


In [None]:
# Evaluate all three prompts
print("=" * 60)
print("Evaluating Prompt V1: Base Sentiment-to-Rating")
print("=" * 60)
results_v1 = evaluate_prompt(prompt_v1, df_sample, num_runs=3)

print("\n" + "=" * 60)
print("Evaluating Prompt V2: Refined with Rating Calibration")
print("=" * 60)
results_v2 = evaluate_prompt(prompt_v2, df_sample, num_runs=3)

print("\n" + "=" * 60)
print("Evaluating Prompt V3: Consistency and Reliability Focus")
print("=" * 60)
results_v3 = evaluate_prompt(prompt_v3, df_sample, num_runs=3)

print("\nEvaluation complete!")


## 6. Metrics Calculation
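
Two of the metrics in this section use different denominators: accuracy is computed over reviews that produced a valid prediction, while the JSON validity rate is computed over all reviews. A small worked example on made-up values (not results from this notebook) shows the difference:

```python
# Made-up predictions for five reviews; None marks an invalid response
predictions = [5, 3, None, 1, 4]
actual = [5, 4, 2, 1, 4]
valid_json = [p is not None for p in predictions]

correct = sum(1 for p, a in zip(predictions, actual) if p is not None and p == a)
valid_predictions = sum(valid_json)

accuracy = correct / valid_predictions * 100             # 3 correct out of 4 valid
json_validity = sum(valid_json) / len(valid_json) * 100  # 4 valid out of 5

print(f"Accuracy: {accuracy:.1f}%")            # Accuracy: 75.0%
print(f"JSON validity: {json_validity:.1f}%")  # JSON validity: 80.0%
```

Because invalid responses drop out of the accuracy denominator, a prompt with low validity can still report high accuracy, so the two numbers are best read together.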


In [None]:
def calculate_metrics(results, prompt_name):
    """Calculate all required metrics for a prompt approach"""
    predictions = results['predictions']
    actual = results['actual']
    valid_json = results['valid_json']
    
    # Accuracy: exact match between actual and predicted, computed over valid predictions only
    correct = sum(1 for p, a in zip(predictions, actual) if p == a and p is not None)
    valid_predictions = sum(1 for p in predictions if p is not None)
    accuracy = (correct / valid_predictions * 100) if valid_predictions > 0 else 0
    
    # JSON validity rate
    json_validity = (sum(valid_json) / len(valid_json) * 100) if len(valid_json) > 0 else 0
    
    # Consistency/Reliability (average consistency score)
    consistency_scores = results['consistency_scores']
    reliability = (np.mean(consistency_scores) * 100) if len(consistency_scores) > 0 else 0
    
    return {
        'Prompt': prompt_name,
        'Accuracy (%)': round(accuracy, 2),
        'JSON Validity Rate (%)': round(json_validity, 2),
        'Reliability (%)': round(reliability, 2),
        'Valid Predictions': valid_predictions,
        'Total Reviews': len(predictions)
    }


In [None]:
# Calculate metrics for all prompts
metrics_v1 = calculate_metrics(results_v1, "V1: Base Sentiment-to-Rating")
metrics_v2 = calculate_metrics(results_v2, "V2: Refined with Calibration")
metrics_v3 = calculate_metrics(results_v3, "V3: Consistency Focus")

# Create comparison DataFrame
comparison_df = pd.DataFrame([metrics_v1, metrics_v2, metrics_v3])

print("Metrics calculated for all three prompts.")


## 7. Comparison Table


In [None]:
# Display comparison table with formatting
print("\n" + "=" * 80)
print("COMPARISON TABLE: Prompt Performance Evaluation")
print("=" * 80)
print()

# Format the table nicely
display_df = comparison_df.copy()
display_df = display_df.set_index('Prompt')

# Print with formatting
print(display_df.to_string())
print()

# Display formatted table
display(comparison_df.style.set_properties(**{'text-align': 'center'})
    .format({
        'Accuracy (%)': '{:.2f}%',
        'JSON Validity Rate (%)': '{:.2f}%',
        'Reliability (%)': '{:.2f}%'
    })
    .set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#4472C4'), ('color', 'white'), ('font-weight', 'bold')]},
        {'selector': 'td', 'props': [('text-align', 'center')]}
    ]))


## 8. Results Discussion

### Prompt Evolution and Trade-offs

**Prompt V1 (Base Sentiment-to-Rating)**:
- **What changed**: Initial straightforward approach mapping sentiment to ratings
- **Why**: Simple baseline against which the refined prompts are compared
- **Trade-offs**: Fast but lacks explicit rating criteria, may be inconsistent

**Prompt V2 (Refined with Rating Calibration)**:
- **What changed**: Added explicit rating criteria (1-5 stars with clear definitions)
- **Why**: Provides LLM with concrete guidelines for rating assignment, improving calibration
- **Trade-offs**: Better accuracy expected but slightly longer prompts

**Prompt V3 (Consistency Focus)**:
- **What changed**: Added a role definition, reworded rating guidelines, and an explicit emphasis on consistency
- **Why**: Addresses reliability by providing more context and explicit instructions
- **Trade-offs**: Most detailed prompt, should improve consistency but may be slower

### Key Findings

1. **Accuracy vs Reliability**: 
   - Higher accuracy may come at the cost of consistency if the prompt is ambiguous
   - V2 and V3 with explicit criteria should show better alignment with actual ratings

2. **JSON Validity Rate**:
   - All three prompts specify the JSON format and the API call requests `json_object` output, so each should achieve high validity (>95%)
   - V3 repeats the JSON instruction after the template and may achieve the best validity

3. **Reliability/Consistency**:
   - V3 with consistency-focused instructions should show highest reliability
   - Base prompt (V1) may show higher variance across multiple runs

4. **Best Performing Prompt**:
   - The measured values in the comparison table decide this; V2 or V3 is expected to perform best
   - V2 offers good balance of accuracy and simplicity
   - V3 prioritizes reliability and consistency
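
The reliability figure referred to above is, for each re-run review, the share of runs that agree with the most common prediction, averaged over reviews. A worked sketch on made-up repeated predictions (illustrative values, not measured results; `consistency` is a hypothetical helper name):

```python
from collections import Counter


def consistency(predictions):
    """Share of runs that match the most common prediction."""
    most_common_count = Counter(predictions).most_common(1)[0][1]
    return most_common_count / len(predictions)


# Three runs per review, made-up values
runs_per_review = [
    [4, 4, 4],  # all runs agree
    [3, 4, 3],  # two of three agree
    [1, 2, 3],  # no agreement beyond a single run
]

scores = [consistency(runs) for runs in runs_per_review]
reliability = sum(scores) / len(scores) * 100
print(f"Reliability: {reliability:.1f}%")
```

With three runs per review the score cannot fall below one third, so the scale effectively runs from about 33% to 100% rather than from 0%.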

### Trade-offs Summary

| Aspect | V1 | V2 | V3 |
|--------|----|----|----|
| Simplicity | High | Medium | Low |
| Accuracy | Medium | High | High |
| Consistency | Medium | Medium | High |
| Processing Speed | Fast | Medium | Medium |

**Recommendation**: Prompt V2 offers the best balance for most use cases, providing good accuracy with clear rating criteria without excessive complexity. For production systems requiring high reliability, V3 may be preferred despite added complexity.
