# LLM Comparison: Llama 4 Scout, Llama 4 Maverick, Kimi K2

This notebook mirrors the Claude classification experiments using the **Groq API**.

It runs the same 4 prompt strategies:
- Zero-Shot
- Zero-Shot Chain-of-Thought (CoT)
- Few-Shot
- Few-Shot CoT

Across 3 models:
- `meta-llama/llama-4-scout-17b-16e-instruct`
- `meta-llama/llama-4-maverick-17b-128e-instruct`
- `moonshotai/kimi-k2-instruct-0905`

Results are saved to individual CSVs and one combined CSV.

---
**Before running:** Upload `final_merged_dataset.csv` to this Colab session. The data-preparation cells under *Run Experiments* regenerate `fewshot_examples.csv`, `test_data_unlabeled.csv`, and `test_data_with_labels.csv` from it; these should match the files generated in the Claude notebook.

In [1]:
!pip install groq -q

In [2]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Running on GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("Running on CPU")

Running on CPU


In [3]:
import pandas as pd
import time
from tqdm import tqdm
from groq import Groq

# ── API KEY ──────────────────────────────────────────────────────────────────
# Option A: paste key directly (fine for a private Colab session)
# GROQ_API_KEY = "YOUR_GROQ_API_KEY_HERE"  # <-- replace with your key

# Option B: load from Colab secrets (recommended)
from google.colab import userdata
GROQ_API_KEY = userdata.get('GROQ_API_KEY')

client = Groq(api_key=GROQ_API_KEY)
print("Groq client ready.")  # avoid echoing any part of the key into saved output

Groq client ready.


In [4]:
# ── MODEL NAMES ───────────────────────────────────────────────────────────────
MODEL_SCOUT    = "meta-llama/llama-4-scout-17b-16e-instruct"
MODEL_MAVERICK = "meta-llama/llama-4-maverick-17b-128e-instruct"
MODEL_KIMI     = "moonshotai/kimi-k2-instruct-0905"

# ── SHARED PARAMS ────────────────────────────────────────────────────────────
# Temperature=0 for deterministic classification (same as Claude setup)
TEMPERATURE    = 0
MAX_TOKENS     = 200   # same as Claude notebook

# Column names - match your CSV
TEXT_COLUMN    = "text"
LABEL_COLUMN   = "label"

## Prompt Templates
Identical to the Claude notebook — same wording, same structure.

In [5]:
# ── ZERO-SHOT ─────────────────────────────────────────────────────────────────
ZERO_SHOT_PROMPT = """Title: "Classification of Work-Related Burnout and Stress in Cybersecurity Professionals"

Definition: In this task, we ask you to classify the input text into two options:

1: Work-related burnout/stress: The poster discussed burnout or work-related stress related to their own mental health in the past or present.
The context of burnout can be related to work, career, job responsibilities, workplace environment, or professional life in cybersecurity.

0: No work-related burnout/stress: Burnout or stress used in a context unrelated to work or mental health. Or work-related burnout/stress
in hypothetical situations when the poster is not discussing their own experience in the past and present.

Emphasis & Caution: Discussions of hypothetical situations such as fear of burnout or future/imaginary circumstances should NOT be labeled as 1.

Things to avoid: All input must be classified into one of the options. If you cannot pick then choose the option with higher probability. The output
must be either 1 or 0 but not both.

Input: {text}

Output:"""

# ── ZERO-SHOT CoT ─────────────────────────────────────────────────────────────
ZERO_SHOT_COT_PROMPT = """Title: "Classification of Work-Related Burnout and Stress in Cybersecurity Professionals"

Definition: In this task, we ask you to classify the input text into two options:

1 : Work-related burnout/stress: The poster discussed burnout or work-related stress related to their own mental health in the past or present. The context of burnout can be related to work, career, job responsibilities, workplace environment, or professional life in cybersecurity.
0 : No work-related burnout/stress: Burnout or stress used in a context unrelated to work or mental health. Or work-related burnout/stress in hypothetical situations when the poster is not discussing their own experience in the past and present.

Emphasis & Caution: Discussions of hypothetical situations such as fear of burnout or future/imaginary circumstances should NOT be labeled as 1.

Things to avoid: All input must be classified into one of the options. If you cannot pick then choose the option with higher probability. The output must be either 1 or 0 but not both.

Input: {text}

Let's think about it step by step.

Output:"""

print("Prompt templates defined.")

Prompt templates defined.


In [6]:
# ── FEW-SHOT HELPERS ──────────────────────────────────────────────────────────
def create_fewshot_examples(fewshot_df, text_column):
    """Create formatted few-shot examples from the dataframe"""
    examples = []
    for idx, row in fewshot_df.iterrows():
        label = 1 if row['label'] == 1 else 0
        example = f"Input: {row[text_column]}\nOutput: {label}"
        examples.append(example)
    return "\n\n".join(examples)


def create_fewshot_prompt(fewshot_examples_text, is_cot=False):
    """Create few-shot prompt template"""
    cot_instruction = "\n\nLet's think about it step by step." if is_cot else ""

    prompt = f"""Title: "Classification of Work-Related Burnout and Stress in Cybersecurity Professionals"

Definition: In this task, we ask you to classify the input text into two options:

1: Work-related burnout/stress: The poster discussed burnout or work-related stress related to their own mental health in the past or present. The context of burnout can be related to work, career, job responsibilities, workplace environment, or professional life in cybersecurity.

0: No work-related burnout/stress: Burnout or stress used in a context unrelated to work or mental health. Or work-related burnout/stress in hypothetical situations when the poster is not discussing their own experience in the past and present.

Emphasis & Caution: Discussions of hypothetical situations such as fear of burnout or future/imaginary circumstances should NOT be labeled as 1.

Things to avoid: All input must be classified into one of the options. If you cannot pick then choose the option with higher probability. The output must be either 0 or 1 but not both.

Here are some examples:

{fewshot_examples_text}

Now classify the following:

Input: {{text}}{cot_instruction}

Output:"""

    return prompt

print("Few-shot helpers defined.")

Few-shot helpers defined.
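
`create_fewshot_prompt` fills its template in two stages: the f-string inserts the few-shot examples and leaves `{text}` behind (written as `{{text}}`), and `classify_dataset` later calls `.format(text=...)`. One side effect worth checking is that any literal `{` or `}` inside a few-shot example would be parsed by that second `.format` call. The sketch below illustrates the issue with a made-up example string; `escape_braces` is a hypothetical helper, not part of this notebook.

```python
def escape_braces(s: str) -> str:
    """Double literal braces so a later str.format() call leaves them alone."""
    return s.replace("{", "{{").replace("}", "}}")


# Made-up few-shot example that happens to contain literal braces
example = 'Input: my config is {"oncall": true}\nOutput: 1'

# Stage 1: the f-string inserts the examples; {{text}} survives as {text}
unsafe_template = f"Examples:\n{example}\n\nInput: {{text}}\nOutput:"
safe_template = f"Examples:\n{escape_braces(example)}\n\nInput: {{text}}\nOutput:"

# Stage 2: .format() fills in the text to classify
try:
    unsafe_template.format(text="I am exhausted")
    unsafe_failed = False
except (KeyError, ValueError, IndexError):
    # The braces from the example were treated as a replacement field
    unsafe_failed = True

prompt = safe_template.format(text="I am exhausted")
print("unescaped template failed:", unsafe_failed)
print(prompt)
```

If the few-shot texts never contain braces, the existing helper behaves the same with or without this escaping.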


In [7]:
# ── UTILITY FUNCTIONS ─────────────────────────────────────────────────────────
def truncate_text(text, max_chars=10000):
    """Truncate text to fit context window"""
    if len(text) > max_chars:
        return text[:max_chars] + "..."
    return text


def parse_output(output):
    """Parse model output to extract 0 or 1"""
    output = output.strip()
    # Look for 0 or 1 in the output
    if "1" in output:
        return 1
    elif "0" in output:
        return 0
    else:
        return 0  # default on parse error


def call_groq_api(prompt, model):
    """
    Call Groq API with the given prompt and model.
    Uses non-streaming for clean output collection.
    Includes retry logic for rate limits.
    """
    max_retries = 3
    retry_delay = 2

    # Kimi K2 may work better with a slightly higher temperature, but every
    # model here runs at TEMPERATURE (0) for reproducibility.
    # Change `temp` below if a per-model override is ever needed.
    temp = TEMPERATURE

    for attempt in range(max_retries):
        try:
            completion = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "user", "content": prompt}
                ],
                temperature=temp,
                max_completion_tokens=MAX_TOKENS,
                top_p=1,
                stream=False,   # non-streaming for clean output collection
                stop=None
            )
            return completion.choices[0].message.content.strip()

        except Exception as e:
            error_str = str(e)
            print(f"Attempt {attempt + 1} failed for {model}: {error_str[:100]}")
            if attempt < max_retries - 1:
                # Extra wait if it's a rate limit error
                wait = retry_delay * 3 if "rate" in error_str.lower() else retry_delay
                print(f"Waiting {wait}s before retry...")
                time.sleep(wait)
                retry_delay *= 2
            else:
                return f"ERROR: {error_str}"

    return "ERROR"


print("Utility functions defined.")

Utility functions defined.
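
`parse_output` returns 1 whenever the character "1" appears anywhere in the reply. A chain-of-thought reply that mentions "option 1" while concluding 0 is therefore read as 1, and an `ERROR: ...` string still receives a label. A stricter variant, sketched below under the assumption that the model states its final answer last, takes the last standalone 0 or 1 and returns `None` when nothing usable is found. `parse_output_strict` is a hypothetical name and was not used to produce the results in this notebook.

```python
import re
from typing import Optional

# A 0 or 1 that is not part of a longer number such as 10, 503, or 0.1
_LABEL_RE = re.compile(r"(?<![\d.])[01](?!\d|\.\d)")


def parse_output_strict(output: str) -> Optional[int]:
    """Return the last standalone 0/1 in the reply, or None if unparseable."""
    output = output.strip()
    if output.startswith("ERROR"):
        return None  # API failure string, not a model answer
    matches = _LABEL_RE.findall(output)
    if not matches:
        return None
    return int(matches[-1])


print(parse_output_strict("1"))
print(parse_output_strict("It mentions option 1, but it is hypothetical, so the answer is 0."))
print(parse_output_strict("ERROR: Error code: 503"))
```

Rows that come back as `None` can then be counted and reported separately instead of being folded into class 0. With `MAX_TOKENS = 200`, a chain-of-thought reply may be cut off before its final answer, in which case the last digit found is not necessarily the intended label.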


In [8]:
# ── MAIN CLASSIFICATION LOOP ──────────────────────────────────────────────────
def classify_dataset(test_df, text_column, prompt_template, model, approach_name):
    """Classify entire dataset and return results — mirrors Claude notebook exactly"""
    results = []

    print(f"\n{'='*60}")
    print(f"Running: {approach_name}")
    print(f"Model:   {model}")
    print(f"{'='*60}")

    for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Classifying"):
        text = truncate_text(str(row[text_column]))
        prompt = prompt_template.format(text=text)

        raw_output = call_groq_api(prompt, model)
        predicted_label = parse_output(raw_output)

        results.append({
            'id': row['id'], # Include the original 'id' for correct merging later
            'text': row[text_column],
            'raw_output': raw_output,
            'predicted_label': predicted_label,
            'approach': approach_name,
            'model': model
        })

        # Small delay to avoid hitting rate limits
        time.sleep(0.5)

    return pd.DataFrame(results)

print("classify_dataset() defined.")

classify_dataset() defined.


## Run Experiments

The cells below prepare the data splits, load them, and run all model and prompt combinations.

**To run only specific models or prompts**, comment out the lines in the `experiments` list.
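
As an alternative to commenting rows out, the same 4 × 3 grid can be generated and filtered by name. This is only a sketch: `PROMPTS`, `MODELS`, and `build_experiments` are hypothetical names, and placeholder strings stand in for the real prompt templates so the snippet runs on its own.

```python
from itertools import product

# Placeholder strings stand in for the real templates defined in this notebook
PROMPTS = {
    "Zero-Shot": "<zero-shot template>",
    "Zero-Shot-CoT": "<zero-shot CoT template>",
    "Few-Shot": "<few-shot template>",
    "Few-Shot-CoT": "<few-shot CoT template>",
}
MODELS = {
    "Scout": "meta-llama/llama-4-scout-17b-16e-instruct",
    "Maverick": "meta-llama/llama-4-maverick-17b-128e-instruct",
    "KimiK2": "moonshotai/kimi-k2-instruct-0905",
}


def build_experiments(prompts=None, models=None):
    """Return (template, model_id, name) tuples for the selected subset.

    Passing None for either argument selects every prompt or every model.
    """
    prompt_keys = [p for p in PROMPTS if prompts is None or p in prompts]
    model_keys = [m for m in MODELS if models is None or m in models]
    # Model-major order, matching the hand-written experiments list
    return [
        (PROMPTS[p], MODELS[m], f"{p}_{m}")
        for m, p in product(model_keys, prompt_keys)
    ]


selected = build_experiments(models={"KimiK2"})
print([name for _, _, name in selected])
```

The generated names follow the `<prompt>_<model>` pattern used for the output files, so the per-experiment CSV names stay the same.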

In [9]:
import pandas as pd

# Load the full labeled dataset that the few-shot and test splits are drawn from
df = pd.read_csv("final_merged_dataset.csv")

print(f"Original dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())

Original dataset shape: (309, 4)
Columns: ['id', 'text', 'similarity_score', 'label']

Label distribution:
label
0    173
1    136
Name: count, dtype: int64


In [10]:
# Step 2: Create a copy and save the full labeled dataset as backup
print("\nSaving full labeled dataset as backup...")
df_backup = df.copy()
df_backup.to_csv('labeled_data_backup.csv', index=False)


Saving full labeled dataset as backup...


In [11]:
# Step 3: Extract 40 examples where label = 1 for few-shot learning
print("\nExtracting 40 examples where label = 1 for few-shot prompts")
burnout_examples = df[df['label'] == 1].sample(n=40, random_state=42)
fewshot_examples = burnout_examples.copy()


Extracting 40 examples where label = 1 for few-shot prompts


In [12]:
# Save few-shot examples (with labels)
fewshot_examples.to_csv('fewshot_examples.csv', index=False)
print(f"Few-shot examples saved: {fewshot_examples.shape}")

Few-shot examples saved: (40, 4)


In [13]:
# Step 4: Create test set (remaining data after removing few-shot examples)
# Get indices of few-shot examples
fewshot_indices = burnout_examples.index

# Remove few-shot examples from the dataset to create test set
test_data = df.drop(fewshot_indices)

print(f"Test set shape: {test_data.shape}")
print(f"Test set label distribution:")
print(test_data['label'].value_counts())

Test set shape: (269, 4)
Test set label distribution:
label
0    173
1     96
Name: count, dtype: int64


In [14]:
# Step 5: Save test set WITH labels (for validation later)
test_data.to_csv('test_data_with_labels.csv', index=False)
print("\nTest data with labels saved: test_data_with_labels.csv")



Test data with labels saved: test_data_with_labels.csv


In [15]:
# Step 6: Create and save test set WITHOUT labels (for actual inference)
test_data_unlabeled = test_data.drop(columns=['label'])
test_data_unlabeled.to_csv('test_data_unlabeled.csv', index=False)
print("Test data without labels saved: test_data_unlabeled.csv")

Test data without labels saved: test_data_unlabeled.csv


In [16]:
# ── LOAD DATA ─────────────────────────────────────────────────────────────────
print("Loading data...")
fewshot_df   = pd.read_csv('fewshot_examples.csv')
test_df      = pd.read_csv('test_data_unlabeled.csv')
test_labels  = pd.read_csv('test_data_with_labels.csv')

print(f"Few-shot examples : {len(fewshot_df)}")
print(f"Test samples      : {len(test_df)}")
print(f"Test label dist   :\n{test_labels[LABEL_COLUMN].value_counts()}")

# Build few-shot prompt templates (filled with examples, {{text}} still a placeholder)
fewshot_examples_text = create_fewshot_examples(fewshot_df, TEXT_COLUMN)
fewshot_prompt        = create_fewshot_prompt(fewshot_examples_text, is_cot=False)
fewshot_cot_prompt    = create_fewshot_prompt(fewshot_examples_text, is_cot=True)

print("\nData loaded and prompt templates ready.")

Loading data...
Few-shot examples : 40
Test samples      : 269
Test label dist   :
label
0    173
1     96
Name: count, dtype: int64

Data loaded and prompt templates ready.


In [17]:
# ── EXPERIMENT MATRIX ─────────────────────────────────────────────────────────
# Each tuple: (prompt_template, model_id, name_for_output_files)
# Comment out any rows you don't want to run.

experiments = [

    # ── Llama 4 Scout ─────────────────────────────────────────────────────────
    (ZERO_SHOT_PROMPT,     MODEL_SCOUT,    "Zero-Shot_Scout"),
    (ZERO_SHOT_COT_PROMPT, MODEL_SCOUT,    "Zero-Shot-CoT_Scout"),
    (fewshot_prompt,       MODEL_SCOUT,    "Few-Shot_Scout"),
    (fewshot_cot_prompt,   MODEL_SCOUT,    "Few-Shot-CoT_Scout"),

    # ── Llama 4 Maverick ──────────────────────────────────────────────────────
    (ZERO_SHOT_PROMPT,     MODEL_MAVERICK, "Zero-Shot_Maverick"),
    (ZERO_SHOT_COT_PROMPT, MODEL_MAVERICK, "Zero-Shot-CoT_Maverick"),
    (fewshot_prompt,       MODEL_MAVERICK, "Few-Shot_Maverick"),
    (fewshot_cot_prompt,   MODEL_MAVERICK, "Few-Shot-CoT_Maverick"),

    # ── Kimi K2 ───────────────────────────────────────────────────────────────
    (ZERO_SHOT_PROMPT,     MODEL_KIMI,     "Zero-Shot_KimiK2"),
    (ZERO_SHOT_COT_PROMPT, MODEL_KIMI,     "Zero-Shot-CoT_KimiK2"),
    (fewshot_prompt,       MODEL_KIMI,     "Few-Shot_KimiK2"),
    (fewshot_cot_prompt,   MODEL_KIMI,     "Few-Shot-CoT_KimiK2"),

]

# ── RUN ALL EXPERIMENTS ───────────────────────────────────────────────────────
all_results = []

for prompt_template, model, approach_name in experiments:
    results_df = classify_dataset(test_df, TEXT_COLUMN, prompt_template, model, approach_name)
    all_results.append(results_df)

    # Save individual result CSV
    fname = f'results_{approach_name}.csv'
    results_df.to_csv(fname, index=False)
    print(f"Saved: {fname}")

# ── SAVE COMBINED ─────────────────────────────────────────────────────────────
if all_results:
    combined = pd.concat(all_results, ignore_index=True)
    combined.to_csv('all_results_groq_combined.csv', index=False)
    print("\nAll results saved to: all_results_groq_combined.csv")


Running: Zero-Shot_Scout
Model:   meta-llama/llama-4-scout-17b-16e-instruct


Classifying:  67%|██████▋   | 180/269 [04:26<03:29,  2.36s/it]

Attempt 1 failed for meta-llama/llama-4-scout-17b-16e-instruct: Error code: 503 - {'error': {'message': 'meta-llama/llama-4-scout-17b-16e-instruct is currently over
Waiting 2s before retry...


Classifying:  83%|████████▎ | 224/269 [06:05<01:38,  2.20s/it]

Attempt 1 failed for meta-llama/llama-4-scout-17b-16e-instruct: Error code: 503 - {'error': {'message': 'meta-llama/llama-4-scout-17b-16e-instruct is currently over
Waiting 2s before retry...


Classifying: 100%|██████████| 269/269 [07:49<00:00,  1.74s/it]


Saved: results_Zero-Shot_Scout.csv

Running: Zero-Shot-CoT_Scout
Model:   meta-llama/llama-4-scout-17b-16e-instruct


Classifying:  72%|███████▏  | 195/269 [08:05<03:12,  2.60s/it]

Attempt 1 failed for meta-llama/llama-4-scout-17b-16e-instruct: Error code: 503 - {'error': {'message': 'meta-llama/llama-4-scout-17b-16e-instruct is currently over
Waiting 2s before retry...


Classifying:  87%|████████▋ | 233/269 [09:36<01:26,  2.40s/it]

Attempt 1 failed for meta-llama/llama-4-scout-17b-16e-instruct: Error code: 503 - {'error': {'message': 'meta-llama/llama-4-scout-17b-16e-instruct is currently over
Waiting 2s before retry...


Classifying: 100%|██████████| 269/269 [10:33<00:00,  2.36s/it]


Saved: results_Zero-Shot-CoT_Scout.csv

Running: Few-Shot_Scout
Model:   meta-llama/llama-4-scout-17b-16e-instruct


Classifying: 100%|██████████| 269/269 [08:41<00:00,  1.94s/it]


Saved: results_Few-Shot_Scout.csv

Running: Few-Shot-CoT_Scout
Model:   meta-llama/llama-4-scout-17b-16e-instruct


Classifying: 100%|██████████| 269/269 [09:42<00:00,  2.17s/it]


Saved: results_Few-Shot-CoT_Scout.csv

Running: Zero-Shot_Maverick
Model:   meta-llama/llama-4-maverick-17b-128e-instruct


Classifying:  62%|██████▏   | 166/269 [04:08<04:49,  2.81s/it]

Attempt 1 failed for meta-llama/llama-4-maverick-17b-128e-instruct: Error code: 503 - {'error': {'message': 'meta-llama/llama-4-maverick-17b-128e-instruct is currently 
Waiting 2s before retry...


Classifying:  62%|██████▏   | 167/269 [04:16<07:37,  4.49s/it]

Attempt 1 failed for meta-llama/llama-4-maverick-17b-128e-instruct: Error code: 503 - {'error': {'message': 'meta-llama/llama-4-maverick-17b-128e-instruct is currently 
Waiting 2s before retry...


Classifying: 100%|██████████| 269/269 [07:12<00:00,  1.61s/it]


Saved: results_Zero-Shot_Maverick.csv

Running: Zero-Shot-CoT_Maverick
Model:   meta-llama/llama-4-maverick-17b-128e-instruct


Classifying: 100%|██████████| 269/269 [07:46<00:00,  1.73s/it]


Saved: results_Zero-Shot-CoT_Maverick.csv

Running: Few-Shot_Maverick
Model:   meta-llama/llama-4-maverick-17b-128e-instruct


Classifying:  14%|█▍        | 38/269 [02:16<20:37,  5.36s/it]

Attempt 1 failed for meta-llama/llama-4-maverick-17b-128e-instruct: Error code: 503 - {'error': {'message': 'meta-llama/llama-4-maverick-17b-128e-instruct is currently 
Waiting 2s before retry...


Classifying: 100%|██████████| 269/269 [11:28<00:00,  2.56s/it]


Saved: results_Few-Shot_Maverick.csv

Running: Few-Shot-CoT_Maverick
Model:   meta-llama/llama-4-maverick-17b-128e-instruct


Classifying: 100%|██████████| 269/269 [11:36<00:00,  2.59s/it]


Saved: results_Few-Shot-CoT_Maverick.csv

Running: Zero-Shot_KimiK2
Model:   moonshotai/kimi-k2-instruct-0905


Classifying: 100%|██████████| 269/269 [04:02<00:00,  1.11it/s]


Saved: results_Zero-Shot_KimiK2.csv

Running: Zero-Shot-CoT_KimiK2
Model:   moonshotai/kimi-k2-instruct-0905


Classifying: 100%|██████████| 269/269 [04:10<00:00,  1.07it/s]


Saved: results_Zero-Shot-CoT_KimiK2.csv

Running: Few-Shot_KimiK2
Model:   moonshotai/kimi-k2-instruct-0905


Classifying: 100%|██████████| 269/269 [04:27<00:00,  1.00it/s]


Saved: results_Few-Shot_KimiK2.csv

Running: Few-Shot-CoT_KimiK2
Model:   moonshotai/kimi-k2-instruct-0905


Classifying: 100%|██████████| 269/269 [04:22<00:00,  1.02it/s]

Saved: results_Few-Shot-CoT_KimiK2.csv

All results saved to: all_results_groq_combined.csv





## Evaluation
Compare predicted labels against ground truth and compute accuracy, precision, recall, and F1 for every model+prompt combination.
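
As a worked check of the metric definitions, the sketch below recomputes the four scores from confusion counts. The counts (TP = 66, FP = 33, FN = 33, TN = 141) are not logged anywhere in this notebook; they are reconstructed from the `Few-Shot-CoT_KimiK2` figures it reports (99 positive and 174 negative rows, precision and recall of 0.67 on the positive class) and should be read as an inference. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    # Undefined ratios fall back to 0.0, mirroring zero_division=0 in sklearn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": round(accuracy, 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
    }


# Counts reconstructed from the reported Few-Shot-CoT_KimiK2 metrics (an inference)
metrics = binary_metrics(tp=66, fp=33, fn=33, tn=141)
print(metrics)
```

With these counts, accuracy is 207/273 and precision, recall, and F1 all equal 66/99, which agrees with the 0.7582 and 0.6667 values in the summary table.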

In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Load combined results
combined = pd.read_csv('all_results_groq_combined.csv')

# ── DEBUG: see what columns you actually have ──────────────────────────────
print("combined columns  :", combined.columns.tolist())
print("test_labels columns:", test_labels.columns.tolist())

# ── MERGE ─────────────────────────────────────────────────────────────────
# classify_dataset() copies each row's 'id' value into the results, and
# test_labels carries the same 'id' column, so 'id' is the join key.
# The label column is renamed to 'true_label' to keep it distinct from
# 'predicted_label'.

ground_truth = test_labels[['id', LABEL_COLUMN]].rename(columns={LABEL_COLUMN: 'true_label'})

combined = combined.merge(
    ground_truth,
    on='id',      # shared key in both frames
    how='left'
)

# Sanity check: true_label should have no NaNs, and the row count should equal
# len(experiments) * len(test_df); a larger count means 'id' is duplicated in test_labels.
print(f"\nRows after merge : {len(combined)}")
print(f"NaN in true_label: {combined['true_label'].isna().sum()}")

# ── METRICS ───────────────────────────────────────────────────────────────
summary_rows = []
for approach, group in combined.groupby('approach'):
    y_true = group['true_label']
    y_pred = group['predicted_label']
    summary_rows.append({
        'Approach':  approach,
        'Model':     group['model'].iloc[0],
        'Accuracy':  round(accuracy_score(y_true, y_pred), 4),
        'Precision': round(precision_score(y_true, y_pred, zero_division=0), 4),
        'Recall':    round(recall_score(y_true, y_pred, zero_division=0), 4),
        'F1':        round(f1_score(y_true, y_pred, zero_division=0), 4),
        'N':         len(group)
    })

summary_df = pd.DataFrame(summary_rows).sort_values('F1', ascending=False).reset_index(drop=True)
print(summary_df.to_string(index=False))

summary_df.to_csv('groq_evaluation_summary.csv', index=False)
print("\nEvaluation summary saved to: groq_evaluation_summary.csv")

combined columns  : ['id', 'text', 'raw_output', 'predicted_label', 'approach', 'model']
test_labels columns: ['id', 'text', 'similarity_score', 'label']

Rows after merge : 3276
NaN in true_label: 0
              Approach                                         Model  Accuracy  Precision  Recall     F1   N
        Few-Shot_Scout     meta-llama/llama-4-scout-17b-16e-instruct    0.7399     0.6148  0.7576 0.6787 273
       Few-Shot_KimiK2              moonshotai/kimi-k2-instruct-0905    0.7729     0.6989  0.6566 0.6771 273
     Few-Shot_Maverick meta-llama/llama-4-maverick-17b-128e-instruct    0.6996     0.5563  0.8485 0.6720 273
   Few-Shot-CoT_KimiK2              moonshotai/kimi-k2-instruct-0905    0.7582     0.6667  0.6667 0.6667 273
    Few-Shot-CoT_Scout     meta-llama/llama-4-scout-17b-16e-instruct    0.6703     0.5287  0.8384 0.6484 273
      Zero-Shot_KimiK2              moonshotai/kimi-k2-instruct-0905    0.7692     0.8103  0.4747 0.5987 273
 Few-Shot-CoT_Maverick meta-llama/lla
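
The merge reports 3276 rows, which is 12 × 273, while the test set has 269 rows (12 × 269 = 3228). A left merge only gains rows when the right-hand key is duplicated, so some `id` values in `test_labels` evidently occur more than once, and the metrics are computed on 273 rows per approach instead of 269. The sketch below shows, on toy data, how to list such duplicates and how the `validate` argument of `DataFrame.merge` turns the silent fan-out into an error. The frames and the `row_pos` column are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for `combined` and `test_labels`; the values are illustrative only
preds = pd.DataFrame({"id": [1, 2, 2, 3], "predicted_label": [1, 0, 1, 1]})
truth = pd.DataFrame({"id": [1, 2, 2, 3], "true_label": [1, 0, 1, 1]})

# 1. List the key values that occur more than once on the right-hand side
dupes = truth[truth["id"].duplicated(keep=False)]
print(dupes)

# 2. An unchecked left merge silently multiplies the rows that share id 2
fanned = preds.merge(truth, on="id", how="left")
print(len(preds), "->", len(fanned))

# 3. validate="many_to_one" makes pandas raise instead of fanning out
try:
    preds.merge(truth, on="id", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("MergeError raised:", raised)

# 4. One possible workaround (assumes both frames are in the same row order):
#    join on row position instead of the non-unique id
preds_pos = preds.assign(row_pos=range(len(preds)))
truth_pos = truth.assign(row_pos=range(len(truth)))
aligned = preds_pos.merge(
    truth_pos[["row_pos", "true_label"]], on="row_pos", validate="one_to_one"
)
print(len(aligned))
```

For the stacked `combined` frame, a per-run position could be derived with `combined.groupby('approach').cumcount()`, assuming each run preserved the row order of `test_df`.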

In [21]:
# ── DETAILED CLASSIFICATION REPORT PER APPROACH ───────────────────────────────
for approach, group in combined.groupby('approach'):
    print(f"\n{'='*60}")
    print(f"Approach: {approach}")
    print(classification_report(group['true_label'], group['predicted_label'],
                                target_names=['No Burnout (0)', 'Burnout (1)'],
                                zero_division=0))


Approach: Few-Shot-CoT_KimiK2
                precision    recall  f1-score   support

No Burnout (0)       0.81      0.81      0.81       174
   Burnout (1)       0.67      0.67      0.67        99

      accuracy                           0.76       273
     macro avg       0.74      0.74      0.74       273
  weighted avg       0.76      0.76      0.76       273


Approach: Few-Shot-CoT_Maverick
                precision    recall  f1-score   support

No Burnout (0)       0.85      0.32      0.46       174
   Burnout (1)       0.43      0.90      0.58        99

      accuracy                           0.53       273
     macro avg       0.64      0.61      0.52       273
  weighted avg       0.69      0.53      0.50       273


Approach: Few-Shot-CoT_Scout
                precision    recall  f1-score   support

No Burnout (0)       0.86      0.57      0.69       174
   Burnout (1)       0.53      0.84      0.65        99

      accuracy                           0.67       273
  

In [22]:
"""
Download all output files from Google Colab to your local machine.
Paste this entire cell into a NEW code cell at the END of your Colab notebook and run it.
"""

import os
from google.colab import files

# ── FILES TO DOWNLOAD ─────────────────────────────────────────────────────────
# Add or remove filenames as needed.
files_to_download = [

    # ── From the DATA PREP notebook ───────────────────────────────────────────
    "labeled_data_backup.csv",
    "fewshot_examples.csv",
    "test_data_with_labels.csv",
    "test_data_unlabeled.csv",

    # ── From the CLAUDE notebook ──────────────────────────────────────────────
    # "results_Zero-Shot_Claude-Sonnet.csv",
    # "results_Zero-Shot-CoT_Claude-Sonnet.csv",
    # "results_Few-Shot_Claude-Sonnet.csv",
    # "results_Few-Shot-CoT_Claude-Sonnet.csv",
    # "all_results_combined.csv",

    # ── From the GROQ notebook ────────────────────────────────────────────────
    "results_Zero-Shot_Scout.csv",
    "results_Zero-Shot-CoT_Scout.csv",
    "results_Few-Shot_Scout.csv",
    "results_Few-Shot-CoT_Scout.csv",

    "results_Zero-Shot_Maverick.csv",
    "results_Zero-Shot-CoT_Maverick.csv",
    "results_Few-Shot_Maverick.csv",
    "results_Few-Shot-CoT_Maverick.csv",

    "results_Zero-Shot_KimiK2.csv",
    "results_Zero-Shot-CoT_KimiK2.csv",
    "results_Few-Shot_KimiK2.csv",
    "results_Few-Shot-CoT_KimiK2.csv",

    "all_results_groq_combined.csv",
    "groq_evaluation_summary.csv",

]

# ── DOWNLOAD LOOP ─────────────────────────────────────────────────────────────
print("Starting download...\n")

downloaded = []
skipped    = []

for filename in files_to_download:
    if os.path.exists(filename):
        print(f"⬇  Downloading: {filename}")
        files.download(filename)
        downloaded.append(filename)
    else:
        print(f"  Skipped (not found): {filename}")
        skipped.append(filename)

# ── SUMMARY ───────────────────────────────────────────────────────────────────
print(f"\n{'='*50}")
print(f"Downloaded : {len(downloaded)} file(s)")
print(f" Skipped   : {len(skipped)} file(s)")

if skipped:
    print("\nSkipped files (run the relevant notebook cells first):")
    for f in skipped:
        print(f"  - {f}")


Starting download...

⬇  Downloading: labeled_data_backup.csv
⬇  Downloading: fewshot_examples.csv
⬇  Downloading: test_data_with_labels.csv
⬇  Downloading: test_data_unlabeled.csv
⬇  Downloading: results_Zero-Shot_Scout.csv
⬇  Downloading: results_Zero-Shot-CoT_Scout.csv
⬇  Downloading: results_Few-Shot_Scout.csv
⬇  Downloading: results_Few-Shot-CoT_Scout.csv
⬇  Downloading: results_Zero-Shot_Maverick.csv
⬇  Downloading: results_Zero-Shot-CoT_Maverick.csv
⬇  Downloading: results_Few-Shot_Maverick.csv
⬇  Downloading: results_Few-Shot-CoT_Maverick.csv
⬇  Downloading: results_Zero-Shot_KimiK2.csv
⬇  Downloading: results_Zero-Shot-CoT_KimiK2.csv
⬇  Downloading: results_Few-Shot_KimiK2.csv
⬇  Downloading: results_Few-Shot-CoT_KimiK2.csv
⬇  Downloading: all_results_groq_combined.csv
⬇  Downloading: groq_evaluation_summary.csv


Downloaded : 18 file(s)
 Skipped   : 0 file(s)
