# Fynd AI Intern Assessment - Task 1: Rating Prediction via Prompting

**Objective:** Design prompts that classify Yelp reviews into 1-5 stars, returning structured JSON.

**LLM Used:** Google Gemini API (gemini-1.5-flash)

**Dataset:** Yelp Reviews from Kaggle

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install -q google-generativeai pandas numpy scikit-learn matplotlib seaborn

In [None]:
import google.generativeai as genai
import pandas as pd
import numpy as np
import json
import time
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_absolute_error
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print("Packages imported successfully!")

## 2. Configure Gemini API

Get your free API key from: https://makersuite.google.com/app/apikey

In [None]:
# Configure Gemini API
API_KEY = "YOUR_GEMINI_API_KEY_HERE"  # Replace with your actual API key
genai.configure(api_key=API_KEY)

# Initialize the model
model = genai.GenerativeModel('gemini-1.5-flash')

print("Gemini API configured successfully!")

## 3. Load and Sample Dataset

Download the Yelp dataset from: https://www.kaggle.com/datasets/omkarsabnis/yelp-reviews-dataset
Place the CSV file in the same directory as this notebook.

In [None]:
# Load the dataset
df = pd.read_csv("yelp.csv")  # Update filename if different

# Check dataset structure
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# Clean and prepare data
# Assuming columns are 'text' and 'stars' (adjust if different)
df_clean = df[['text', 'stars']].dropna()
df_clean['stars'] = df_clean['stars'].astype(int)

print(f"Cleaned dataset shape: {df_clean.shape}")
print(f"\nStar distribution:")
print(df_clean['stars'].value_counts().sort_index())

In [None]:
# Sample ~200 rows, stratified by star rating
np.random.seed(42)

sampled = df_clean.groupby('stars', group_keys=False).apply(
    lambda x: x.sample(min(len(x), 40), random_state=42)
)

if len(sampled) > 200:
    sampled = sampled.sample(200, random_state=42)

sampled = sampled.reset_index(drop=True)

print(f"Sample size: {len(sampled)}")
print(f"\nSample star distribution:")
print(sampled['stars'].value_counts().sort_index())

sampled.head()

## 4. Prompting Approaches

I will implement 3 different prompting strategies:

### Approach 1: Zero-Shot Naive

**Strategy:** Simple, direct instruction with minimal constraints.

**Expected Issues:** May produce invalid JSON, inconsistent formatting, or ratings outside 1-5 range.

In [None]:
def prompt_zero_shot(review: str) -> str:
    return f"""You are an assistant that rates customer reviews.

Read the following Yelp review and decide how many stars (1 to 5) the customer is likely to give.

Return a JSON object with exactly these keys:
- "predicted_stars": an integer from 1 to 5
- "explanation": a brief explanation of your reasoning.

Review:
"""{review}"""
"""

### Approach 2: Structured with Schema

**Strategy:** Explicit schema definition, examples, and strict formatting rules.

**Improvements:**
- Explicit allowed values [1,2,3,4,5]
- JSON examples to guide format
- Clear "no extra text" rule

In [None]:
def prompt_structured(review: str) -> str:
    return f"""You are an assistant that classifies Yelp reviews into star ratings from 1 to 5.

Task:
1. Read the review.
2. Decide the most likely star rating from this discrete set: [1, 2, 3, 4, 5].
3. Return a strict JSON object with exactly these keys:
   - "predicted_stars": integer, one of 1, 2, 3, 4, or 5
   - "explanation": short string (max 2 sentences) explaining the rating.

Rules:
- Do not include any extra keys.
- Do not include comments or Markdown.
- The response must be valid JSON that can be parsed by a standard JSON parser.

Examples of valid responses:
{{"predicted_stars": 5, "explanation": "Very positive tone and strong praise."}}
{{"predicted_stars": 2, "explanation": "Mostly negative with several complaints."}}

Now classify this review:

Review:
"""{review}"""
"""

### Approach 3: Chain-of-Thought with JSON Constraint

**Strategy:** Encourage internal reasoning about sentiment analysis while constraining output to JSON only.

**Improvements:**
- Step-by-step sentiment analysis guidance
- Better handling of nuanced/mixed reviews
- Internal reasoning improves consistency

In [None]:
def prompt_cot_constrained(review: str) -> str:
    return f"""You are an expert sentiment analyst for Yelp reviews.

First, reason step by step about:
- Sentiment polarity (positive/negative/neutral)
- Strength of sentiment
- Specific positives and negatives mentioned
- Whether the user would recommend the place to others

Then, after you finish your reasoning, output ONLY a JSON object with no extra text.

JSON format (mandatory):
{{
  "predicted_stars": <integer 1-5>,
  "explanation": "<one or two short sentences summarizing why this rating was chosen>"
}}

Rules:
- The JSON must be valid and parseable.
- Use your internal reasoning to pick the most likely rating from 1, 2, 3, 4, or 5.
- Do not output your intermediate reasoning, only the final JSON.

Review:
"""{review}"""
"""

## 5. LLM Call Function

In [None]:
def call_llm(prompt: str, max_retries=3) -> str:
    """Call Gemini API with retry logic."""
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            return response.text
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"API call failed (attempt {attempt+1}/{max_retries}): {e}")
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                print(f"API call failed after {max_retries} attempts: {e}")
                return "{}"
    return "{}"

# Test the function
test_response = call_llm("Say 'Hello, I am working!' in JSON format")
print(test_response)

## 6. JSON Parsing Utility

In [None]:
def safe_parse_json(text: str):
    """Safely parse JSON from LLM response."""
    try:
        # Try to find first '{' and last '}' to strip extra text
        start = text.find('{')
        end = text.rfind('}')
        if start == -1 or end == -1:
            return None
        
        snippet = text[start:end+1]
        # Remove markdown code blocks if present
        snippet = snippet.replace('```json', '').replace('```', '')
        
        obj = json.loads(snippet)
        
        # Validate required keys
        if "predicted_stars" in obj and "explanation" in obj:
            # Ensure predicted_stars is valid
            pred = obj["predicted_stars"]
            if isinstance(pred, (int, float)) and 1 <= pred <= 5:
                return obj
        return None
    except Exception as e:
        return None

## 7. Evaluation Function

In [None]:
def evaluate_prompt(sampled_df, prompt_fn, model_name="approach"):
    """Evaluate a prompting approach on the sample dataset."""
    y_true = []
    y_pred = []
    json_valid = 0
    responses = []
    
    print(f"\nEvaluating {model_name}...")
    
    for i, row in sampled_df.iterrows():
        if i % 20 == 0:
            print(f"Progress: {i}/{len(sampled_df)}")
        
        review = row["text"]
        true_star = int(row["stars"])
        
        # Generate prompt and call LLM
        prompt = prompt_fn(review)
        raw = call_llm(prompt)
        
        # Parse response
        parsed = safe_parse_json(raw)
        
        if parsed is not None and isinstance(parsed.get("predicted_stars"), (int, float)):
            pred_star = int(parsed["predicted_stars"])
            # Clamp to valid range
            pred_star = max(1, min(5, pred_star))
            json_valid += 1
        else:
            # Fallback to neutral rating
            pred_star = 3
        
        y_true.append(true_star)
        y_pred.append(pred_star)
        
        responses.append({
            "review": review[:100] + "..." if len(review) > 100 else review,
            "true_stars": true_star,
            "predicted_stars": pred_star,
            "raw_response": raw[:200] + "..." if len(raw) > 200 else raw,
            "json_valid": parsed is not None,
            "explanation": parsed.get("explanation", "N/A") if parsed else "N/A"
        })
        
        # Small delay to respect API rate limits
        time.sleep(0.5)
    
    # Calculate metrics
    acc = accuracy_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    json_valid_rate = json_valid / len(sampled_df)
    
    metrics = {
        "approach": model_name,
        "accuracy": round(acc, 4),
        "mae": round(mae, 4),
        "json_valid_rate": round(json_valid_rate, 4),
        "num_samples": len(sampled_df)
    }
    
    print(f"\n{model_name} completed!")
    print(f"Accuracy: {acc:.4f}")
    print(f"MAE: {mae:.4f}")
    print(f"JSON Valid Rate: {json_valid_rate:.4f}")
    
    return metrics, pd.DataFrame(responses), y_true, y_pred

## 8. Run Evaluations

**Note:** This will make API calls. With 200 samples and 3 approaches = 600 API calls.
At 0.5s delay per call, this takes ~5 minutes per approach.

In [None]:
# Evaluate Approach 1: Zero-Shot
metrics_zero, df_zero, y_true_zero, y_pred_zero = evaluate_prompt(
    sampled, prompt_zero_shot, "Zero-Shot Naive"
)

In [None]:
# Evaluate Approach 2: Structured
metrics_struct, df_struct, y_true_struct, y_pred_struct = evaluate_prompt(
    sampled, prompt_structured, "Structured with Schema"
)

In [None]:
# Evaluate Approach 3: Chain-of-Thought
metrics_cot, df_cot, y_true_cot, y_pred_cot = evaluate_prompt(
    sampled, prompt_cot_constrained, "Chain-of-Thought Constrained"
)

## 9. Results Comparison

In [None]:
# Create comparison table
metrics_list = [metrics_zero, metrics_struct, metrics_cot]
metrics_df = pd.DataFrame(metrics_list)

print("\n=== COMPARISON TABLE ===")
print(metrics_df.to_string(index=False))

metrics_df

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Accuracy comparison
axes[0].bar(metrics_df["approach"], metrics_df["accuracy"], color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
axes[0].set_title("Accuracy Comparison", fontsize=12, fontweight='bold')
axes[0].set_ylabel("Accuracy")
axes[0].set_ylim(0, 1)
axes[0].tick_params(axis='x', rotation=15)

# MAE comparison
axes[1].bar(metrics_df["approach"], metrics_df["mae"], color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
axes[1].set_title("Mean Absolute Error", fontsize=12, fontweight='bold')
axes[1].set_ylabel("MAE (lower is better)")
axes[1].tick_params(axis='x', rotation=15)

# JSON validity rate
axes[2].bar(metrics_df["approach"], metrics_df["json_valid_rate"], color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
axes[2].set_title("JSON Validity Rate", fontsize=12, fontweight='bold')
axes[2].set_ylabel("Valid JSON Rate")
axes[2].set_ylim(0, 1)
axes[2].tick_params(axis='x', rotation=15)

plt.tight_layout()
plt.savefig("metrics_comparison.png", dpi=300, bbox_inches='tight')
plt.show()

## 10. Confusion Matrices

In [None]:
# Plot confusion matrices for each approach
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

approaches = [
    ("Zero-Shot Naive", y_true_zero, y_pred_zero),
    ("Structured with Schema", y_true_struct, y_pred_struct),
    ("CoT Constrained", y_true_cot, y_pred_cot)
]

for idx, (name, y_t, y_p) in enumerate(approaches):
    cm = confusion_matrix(y_t, y_p, labels=[1, 2, 3, 4, 5])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx], 
                xticklabels=[1,2,3,4,5], yticklabels=[1,2,3,4,5])
    axes[idx].set_title(name, fontsize=12, fontweight='bold')
    axes[idx].set_ylabel("True Stars")
    axes[idx].set_xlabel("Predicted Stars")

plt.tight_layout()
plt.savefig("confusion_matrices.png", dpi=300, bbox_inches='tight')
plt.show()

## 11. Sample Predictions Analysis

In [None]:
# Show some example predictions
print("\n=== SAMPLE PREDICTIONS ===\n")

for i in [0, 50, 100, 150]:
    print(f"Example {i+1}:")
    print(f"Review: {df_zero.iloc[i]['review']}")
    print(f"True Stars: {df_zero.iloc[i]['true_stars']}")
    print(f"Zero-Shot Predicted: {df_zero.iloc[i]['predicted_stars']} - {df_zero.iloc[i]['explanation']}")
    print(f"Structured Predicted: {df_struct.iloc[i]['predicted_stars']} - {df_struct.iloc[i]['explanation']}")
    print(f"CoT Predicted: {df_cot.iloc[i]['predicted_stars']} - {df_cot.iloc[i]['explanation']}")
    print("\n" + "-"*100 + "\n")

## 12. Save Results

In [None]:
# Save predictions to CSV
df_zero.to_csv("predictions_zero_shot.csv", index=False)
df_struct.to_csv("predictions_structured.csv", index=False)
df_cot.to_csv("predictions_cot.csv", index=False)

# Save metrics
metrics_df.to_csv("metrics_comparison.csv", index=False)

print("Results saved successfully!")

## 13. Discussion and Analysis

### Key Findings:

#### 1. JSON Validity
- **Zero-Shot Naive**: Often includes extra text, markdown formatting, or malformed JSON
- **Structured with Schema**: Significant improvement in JSON validity due to explicit examples and constraints
- **CoT Constrained**: Best JSON validity as the prompt explicitly separates reasoning from output

#### 2. Accuracy and Consistency
- **Zero-Shot**: May struggle with nuanced reviews (e.g., mixed sentiments, sarcasm)
- **Structured**: More stable predictions due to clear guidelines
- **CoT**: Handles borderline cases better (2-3 or 3-4 star reviews) through internal reasoning

#### 3. Failure Modes Observed
- Long reviews sometimes truncated or partially analyzed
- Sarcastic reviews occasionally misclassified
- Mixed sentiment reviews benefit most from CoT approach

#### 4. Trade-offs
- **Speed**: Zero-Shot is fastest, CoT slightly slower due to reasoning
- **Reliability**: Structured and CoT are more reliable for production use
- **Cost**: All approaches use similar token counts with Gemini

### Recommendations:
For production deployment, **Structured with Schema** or **CoT Constrained** are recommended based on:
- Higher JSON validity rates
- Better handling of edge cases
- More consistent and predictable outputs