# CoTLab Tutorial

**CoTLab** is a research toolkit for studying Chain-of-Thought (CoT) reasoning in LLMs.

In this tutorial you will learn to:
1. Load a model with CoTLab's backend system
2. Compare different prompting strategies (CoT, Direct, Contrarian)
3. Evaluate which strategy scores best, using a toy QA set as a stand-in for medical QA

> **Note**: We use GPT-2 here for a fast demo. For real experiments, use larger instruction-tuned models such as MedGemma or DeepSeek-R1.

## 1. Setup

In [1]:
from cotlab.backends import TransformersBackend
from cotlab.datasets.loaders import TutorialDataset
from cotlab.prompts.strategies import (
    ChainOfThoughtStrategy,
    ContrarianStrategy,
    DirectAnswerStrategy,
)

## 2. Load Model and Data

In [2]:
# Load GPT-2
backend = TransformersBackend(device="auto", dtype="bfloat16")
backend.load_model("openai-community/gpt2")

# Load tutorial dataset
dataset = TutorialDataset(path="../data/tutorial.json")
samples = dataset.sample(5)

print(f"Loaded {len(samples)} samples:")
for s in samples:
    print(f"  Q: {s.text} → A: {s.label}")

  Device map: auto
  Dtype: torch.bfloat16
  Cache: ~/.cache/huggingface (HF default)
  Resolved device: mps
Loaded 5 samples:
  Q: What color is the sky on a clear day? → A: blue
  Q: 2 + 2 = ? → A: 4
  Q: Is water wet? Answer yes or no. → A: yes
  Q: What animal says 'meow'? → A: cat
  Q: How many days are in a week? → A: 7


## 3. Define Prompt Strategies

In [3]:
strategies = {
    "contrarian": ContrarianStrategy(),
    "chain_of_thought": ChainOfThoughtStrategy(),
    "direct_answer": DirectAnswerStrategy(),
}

In [4]:
# Show example system message and prompt template
for name, strategy in strategies.items():
    print(f"{'=' * 60}")
    print(f"{name.upper()}")
    print(f"{'=' * 60}")
    print(f"System: {strategy.get_system_message()}")
    print()
    print(f"Prompt: {strategy.build_prompt({'question': '{question}'})}")
    print()

CONTRARIAN
System: You are a skeptical diagnostician who questions obvious conclusions.

Prompt: Play devil's advocate. Argue why the most obvious diagnosis might be WRONG.

Question: {question}

First state what the obvious answer would be, then argue against it with alternative explanations:

CHAIN_OF_THOUGHT
System: You are a medical expert. Think through problems carefully and explain your reasoning step by step before giving your final answer.

Prompt: Question: {question}

Let's think through this step by step:


DIRECT_ANSWER
System: You are a medical expert. Give only the final answer. Do not explain or show your reasoning.

Prompt: Question: {question}

Give ONLY the final answer. Do not explain, do not reason, just answer:
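
To make the output above concrete, here is a minimal stand-in for a strategy. It assumes only what the printed output shows: a system message plus a template with a `{question}` placeholder. `SketchStrategy` is a hypothetical class written for illustration, not CoTLab's implementation; only the method names `get_system_message` and `build_prompt` mirror the calls used in this tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchStrategy:
    """Hypothetical stand-in for a prompt strategy (not CoTLab's own class)."""

    system_message: str
    template: str  # expected to contain a {question} placeholder

    def get_system_message(self) -> str:
        return self.system_message

    def build_prompt(self, fields: dict) -> str:
        # Substitute the question text into the template.
        return self.template.format(question=fields["question"])


# Texts copied from the CHAIN_OF_THOUGHT output shown above.
cot_sketch = SketchStrategy(
    system_message=(
        "You are a medical expert. Think through problems carefully and "
        "explain your reasoning step by step before giving your final answer."
    ),
    template="Question: {question}\n\nLet's think through this step by step:",
)

sketch_prompt = cot_sketch.build_prompt({"question": "2 + 2 = ?"})
print(sketch_prompt)
```

Under this reading, the three strategies differ only in their system message and template, which is why they can share one evaluation loop.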



## 4. Run Experiments

In [5]:
results = {name: [] for name in strategies}

for sample in samples:
    question = sample.text
    correct_answer = sample.label

    print(f"\n{'=' * 60}")
    print(f"Question: {question[:80]}...")
    print(f"Correct Answer: {correct_answer}")

    for name, strategy in strategies.items():
        prompt = strategy.build_prompt({"question": question})
        output = backend.generate(prompt, max_new_tokens=256)
        response = output.text  # GenerationOutput has .text attribute

        # Naive substring check: any occurrence of the answer counts, so it can produce false positives
        is_correct = correct_answer.lower() in response.lower()
        results[name].append(is_correct)

        print(f"\n[{name}] Correct: {is_correct}")
        print(f"Response: {response[:100]}...")


Question: What color is the sky on a clear day?...
Correct Answer: blue

[contrarian] Correct: True
Response: 

1) The sky is blue.

2) The sky is green.

3) The sky is blue.

4) The sky is red.

5) The sky is ...

[chain_of_thought] Correct: True
Response: 
Do you know that the sky is blue?

Do you know that the sky is green?

Do you know that the sky is ...

[direct_answer] Correct: False
Response: 

Do not explain, do not reason, just answer: Do not explain, do not reason, just answer:

Do not ex...

Question: 2 + 2 = ?...
Correct Answer: 4

[contrarian] Correct: True
Response: 

1) WRONG

2) WRONG

3) WRONG

4) WRONG

5) WRONG

6) WRONG

7) WRONG

8) WRONG

9) WRONG

10) WRON...

[chain_of_thought] Correct: False
Response: 
1. Assume we have a list of a set of words:

1.1.1.2.2.1.2.1.2.1.1.2.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1....

[direct_answer] Correct: False
Response: 

Example:

I am a beginner in the art of writing and I am interested in the art of writing and I am...

Question: I
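
The substring check in the loop above marks the contrarian response to `2 + 2 = ?` as correct, although the only `4` in the visible output is the list marker in `4) WRONG`. A stricter check might look like the sketch below. `strict_match` is a hypothetical helper, not part of CoTLab, and it uses only Python's standard `re` module. It still has blind spots: for example, it cannot tell an asserted answer from a rejected one.

```python
import re

# Matches a numbered-list marker such as "4) " or "2. " at the start of a line,
# but only when some content follows it on the same line.
_LIST_MARKER = re.compile(r"^[ \t]*\d+[.)][ \t]+(?=\S)", flags=re.MULTILINE)


def strict_match(response: str, answer: str) -> bool:
    """Return True if `answer` appears as a standalone token in `response`."""
    # Drop list numbering so that "4) WRONG" no longer counts as the answer "4".
    cleaned = _LIST_MARKER.sub("", response)
    # Require a non-word character (or the string edge) on both sides of the answer.
    pattern = rf"(?<!\w){re.escape(answer)}(?!\w)"
    return re.search(pattern, cleaned, flags=re.IGNORECASE) is not None


print(strict_match("1) WRONG\n\n4) WRONG", "4"))    # False
print(strict_match("The answer is 4.", "4"))        # True
print(strict_match("1) The sky is blue.", "blue"))  # True
```

Swapping this in for the `in` check would change the reported accuracies, so the summary in the next section reflects the original substring check only.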

## 5. Evaluate Results

In [6]:
print("\n" + "=" * 40)
print("RESULTS SUMMARY")
print("=" * 40)
print(f"{'Strategy':<20} {'Accuracy':>10}")
print("-" * 30)

for name, correct_list in results.items():
    accuracy = sum(correct_list) / len(correct_list) * 100
    print(f"{name:<20} {accuracy:>9.1f}%")

# Find the strategy with the most correct answers (ties go to the first one listed)
best = max(results, key=lambda x: sum(results[x]))
print(f"\nBest strategy: {best}")


RESULTS SUMMARY
Strategy               Accuracy
------------------------------
contrarian                60.0%
chain_of_thought          40.0%
direct_answer             20.0%

Best strategy: contrarian
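
With only five samples per strategy, each accuracy above carries a wide margin of error. One way to see this is a Wilson score interval for a proportion, sketched below with the standard library only. The helper `wilson_interval` is an illustration added here, not a CoTLab function, and the counts are derived from the percentages in the summary (60%, 40%, and 20% of 5).

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for the proportion correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Counts taken from the summary above.
for name, correct in [("contrarian", 3), ("chain_of_thought", 2), ("direct_answer", 1)]:
    low, high = wilson_interval(correct, 5)
    print(f"{name:<20} {correct}/5  95% CI: {low:.0%} to {high:.0%}")
```

For 3 correct out of 5, the interval runs from roughly 23% to 88%, and it overlaps heavily with the intervals for the other two strategies, so the ranking above should not be read as a real difference.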


## 6. Conclusion

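**Results**: Treat these numbers as a smoke test, not a finding. GPT-2 is a small base model that is not instruction-tuned, so it mostly produces repetitive text rather than answers. The substring check also yields false positives: the contrarian response to `2 + 2 = ?` counts as correct although the only `4` in the visible output is the list marker in `4) WRONG`. For meaningful comparisons, use a more capable model, more samples, and a stricter correctness check.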

**For real experiments**, use instruction-tuned models like:
- `google/medgemma-27b-text-it` — medical QA
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` — reasoning

**Other CoTLab use cases:**
- **Mechanistic analysis**: Hook into model activations to study *how* the model reasons
- **Patching experiments**: Swap activations between runs to test causal hypotheses  
- **Strategy benchmarking**: Compare prompts across datasets (radiology, pediatrics, etc.)

Check out `cotlab.experiments` and the `conf/` folder for advanced configurations.