<a href="https://colab.research.google.com/github/olanigan/DSPy_Cookbook/blob/main/LLM_RL_With_Think_Tokens_DSPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step-by-step explanation:**

1.  **Setting up the tools:**

    *   The code starts by importing "libraries," which are collections of pre-written code that do specific tasks. It's like getting tools out of a toolbox.
    *   It uses libraries like `dspy` (for building programs around language models), `torch` (for machine learning), `datasets` (for loading data), `transformers` (for loading the pre-trained model and its tokenizer), `difflib` (for fuzzy text comparison), and `scikit-learn` (for splitting data).
    *   It checks if the computer has a special processing unit called a GPU. GPUs are very good at the kind of math needed for machine learning, so if one is available, the code will use it to speed things up.

2.  **Getting the practice questions and answers (the dataset):**

    *   The code loads a dataset called "PFAF750". Think of this dataset as a big textbook full of questions ("Prompts") and their correct answers ("Responses").
    *   It splits this textbook into two parts: a training set (80% of the examples, used to teach the model) and a testing set (the remaining 20%, held out to evaluate the model later). It's like saving some questions for a final exam.

3.  **Choosing a smart language model:**

    *   The code selects a pre-trained language model called "SmolLM2-360M". This model is like a student who has already learned a lot about language from reading tons of text.
    *   It also gets a "tokenizer," which is a tool that helps the model understand words by breaking them down into smaller pieces it can process.

4.  **Preparing the model:**

    *   It "wraps" the pre-trained model with `dspy.HFModel`, which makes it easier to use in the `dspy` framework.
    *   It sets up an "optimizer" (`AdamW`, with a learning rate of 5e-5), which is a tool that will help the model learn from its mistakes during training.

5.  **Defining how to judge the model's answers:**

    *   The code defines four helper functions:
        *   `normalize_text`: This cleans up text by removing extra spaces, punctuation, and converting everything to lowercase. It's like tidying up a sentence before judging it.
        *   `is_similar`: This checks if the model's answer is close enough to the correct answer, even if the wording is slightly different. It uses a "fuzzy matching" technique to compare the cleaned-up versions of the answers.
        *   `check_format`: This checks if the model's answer follows a specific format: `<think> ... </think> <answer> ... </answer>`. It's like making sure the model shows its work in a specific way.
        *   `combined_metric`: This combines the accuracy check and the format check into a single score. It adds an accuracy score (1 if the answer is close enough, otherwise 0) to a format score (+1 if the format is correct, -1 if it is not), so the total ranges from -1 to 2.

6.  **Teaching the model to "think" (forward and backward reasoning):**

    *   The code defines two "modules" using `dspy`:
        *   `ForwardReasoning`: This module teaches the model to take a question, think about it, and then produce an answer. It uses the `<think> ... </think> <answer> ... </answer>` format to structure its response.
        *   `BackwardReasoning`: This module teaches the model to take an answer, think backward, and try to reconstruct the question that led to that answer. It uses the same format as above.

7.  **Fine-tuning the model:**

    *   The code defines a function `fine_tune_model`. This function takes a prompt and the correct response, combines them, and feeds them to the model.
    *   It then calculates how wrong the model's prediction was (the "loss") and uses the optimizer to adjust the model's internal settings so it will do better next time.
    *   Importantly, it only focuses on the "response" part when calculating the loss, ignoring the "prompt" part.

8.  **Preparing the training data:**

    *   The code takes the training set of questions and answers and converts them into a format that `dspy` can understand.
    *   It creates two sets of training data: `trainset_forward` (for forward reasoning) and `trainset_backward` (for backward reasoning).

9.  **Setting up the "teleprompter":**

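    *   The code uses a `dspy` tool called `BootstrapFewShot`. This tool runs a module on training examples and keeps the attempts that pass the metric as worked examples ("demonstrations") to include in later prompts. Here it is limited to 4 bootstrapped and 4 labeled demonstrations. It's like giving the model a few hints before it starts practicing.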
    *   It tells the teleprompter to use the `combined_metric` to judge the model's performance.

10. **Running the experiment:**

    *   The code defines a function `adaptive_boundary_experiment_with_reverse`. This function runs the main training loop.
    *   It goes through the training data multiple times ("epochs").
    *   In each epoch, it does the following:
        *   It "compiles" the `ForwardReasoning` and `BackwardReasoning` modules using the teleprompter. Compiling selects the demonstrations that are added to each module's prompt.
        *   It loops through the training examples:
            *   For each example, it asks the model to solve it using forward reasoning and then backward reasoning.
            *   It prints the model's response and the backward-generated question.
            *   It checks if the model's response is correct and follows the right format using the `combined_metric`.
            *   If the model's response is incorrect or doesn't have the right format, it fine-tunes the model using the `fine_tune_model` function.
            *   It does the same for the backward reasoning part.
        *   It prints a message indicating that the epoch is completed.

11. **Starting the training:**

    *   Finally, the code calls the `adaptive_boundary_experiment_with_reverse` function to start the training process. It tells the code to run for 2 epochs.
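The scoring described in step 5 can be tried on its own with only the standard library. The sketch below mirrors the helper names from the walkthrough; the `score` wrapper and the sample strings are illustrative assumptions and are not part of the notebook.

```python
import re
from difflib import SequenceMatcher


def normalize_text(text):
    # Drop punctuation, collapse whitespace, lowercase.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text)).strip().lower()


def is_similar(response, correct_response, threshold=0.9):
    # Fuzzy match on the cleaned-up strings.
    ratio = SequenceMatcher(
        None, normalize_text(response), normalize_text(correct_response)
    ).ratio()
    return ratio >= threshold


def check_format(response):
    # True only for "<think> ... </think> <answer> ... </answer>".
    return (
        response.startswith("<think>")
        and "</think> <answer>" in response
        and response.endswith("</answer>")
    )


def score(answer, formatted, reference):
    # Hypothetical wrapper: accuracy is 0 or 1, format is +1 or -1.
    accuracy = 1.0 if is_similar(answer, reference) else 0.0
    fmt = 1.0 if check_format(formatted) else -1.0
    return accuracy + fmt


reference = "Paris is the capital of France."
good = "paris is the capital of france"

# Accurate and well formatted: 1 + 1
print(score(good, f"<think>recall geography</think> <answer>{good}</answer>", reference))  # 2.0

# Inaccurate and unformatted: 0 - 1
print(score("I do not know", "I do not know", reference))  # -1.0
```

Keeping accuracy and format as separate terms makes it possible to tell a correct but badly formatted answer (total 0) from a well-formatted but wrong one (total 1).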

**In essence:**

The code is teaching a language model to solve problems by showing it examples, judging its answers, and fine-tuning it based on its mistakes. It's also encouraging the model to "think" about the problem in a structured way (using the `<think> ... </think> <answer> ... </answer>` format) and to practice both forward and backward reasoning. The goal is to make the model better at solving problems and expressing its reasoning clearly.
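The loss masking from step 7 can be illustrated without loading a model. This is a minimal sketch using plain Python lists; the token ids and per-token losses are made up, and `mask_prompt` and `masked_mean` are hypothetical helper names. The value -100 is the label that `fine_tune_model` assigns to prompt positions so that they are skipped when the loss is averaged.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_prompt(input_ids, prompt_length):
    # Copy the ids, then hide the prompt positions from the loss.
    labels = list(input_ids)
    for i in range(min(prompt_length, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels


def masked_mean(per_token_loss, labels):
    # Average only over positions whose label is not ignored.
    kept = [loss for loss, label in zip(per_token_loss, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


# Made-up ids: 3 prompt tokens followed by 2 response tokens.
input_ids = [11, 12, 13, 21, 22]
labels = mask_prompt(input_ids, prompt_length=3)
print(labels)  # [-100, -100, -100, 21, 22]

# Made-up per-token losses: only the last two count.
per_token_loss = [5.0, 5.0, 5.0, 1.0, 3.0]
print(masked_mean(per_token_loss, labels))  # 2.0
```

Because the prompt positions are ignored, the model is only pushed to produce the response, not to reproduce the question.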

In [None]:
!pip install dspy

Collecting dspy
  Downloading dspy-2.5.43-py3-none-any.whl.metadata (7.3 kB)
Collecting asyncer==0.0.8 (from dspy)
  Downloading asyncer-0.0.8-py3-none-any.whl.metadata (6.7 kB)
Collecting backoff (from dspy)
  Downloading backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting datasets (from dspy)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting diskcache (from dspy)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting json-repair (from dspy)
  Downloading json_repair-0.35.0-py3-none-any.whl.metadata (11 kB)
Collecting litellm==1.53.7 (from litellm[proxy]==1.53.7->dspy)
  Downloading litellm-1.53.7-py3-none-any.whl.metadata (33 kB)
Collecting magicattr~=0.1.6 (from dspy)
  Downloading magicattr-0.1.6-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting optuna (from dspy)
  Downloading optuna-4.2.0-py3-none-any.whl.metadata (17 kB)
Collecting ujson (from dspy)
  Downloading ujson-5.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_6

In [None]:
import dspy
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated
import random
import re
from difflib import SequenceMatcher
from sklearn.model_selection import train_test_split

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load dataset
dataset = load_dataset("TuringsSolutions/PFAF750")
train_data = dataset['train']

# 1. Create Train/Test Split
# Convert the 'train' split to a list of dictionaries
train_data_list = train_data.to_pandas().to_dict("records")

# Now use train_test_split
train_data_list, test_data_list = train_test_split(train_data_list, test_size=0.2, random_state=42)

# Convert back to Hugging Face Dataset objects
train_dataset = Dataset.from_dict({"Prompt": [d["Prompt"] for d in train_data_list], "Response": [d["Response"] for d in train_data_list]})
test_dataset = Dataset.from_dict({"Prompt": [d["Prompt"] for d in test_data_list], "Response": [d["Response"] for d in test_data_list]})

# Model and Tokenizer Setup
checkpoint = "HuggingFaceTB/SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Handle pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token or '[PAD]'

hf_model = AutoModelForCausalLM.from_pretrained(checkpoint)
hf_model.resize_token_embeddings(len(tokenizer))
hf_model.to(device)

# Wrap with HFModel
lm = dspy.HFModel(model=checkpoint)
dspy.settings.configure(lm=lm, tokenizer=tokenizer)

# Optimizer
optimizer = AdamW(hf_model.parameters(), lr=5e-5)

# Text Normalization Function
def normalize_text(text):
    return re.sub(r'\s+', ' ', re.sub(r'[^\w\s]', '', text)).strip().lower()

# Fuzzy Matching Function
def is_similar(response, correct_response, threshold=0.9):
    return SequenceMatcher(None, normalize_text(response), normalize_text(correct_response)).ratio() >= threshold

# DSPy Module for Forward Reasoning with 'Think' Tokens
class ForwardReasoning(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        reasoning_process = self.generate_answer(question=question)
        return reasoning_process.answer

# DSPy Module for Backward Reasoning with 'Think' Tokens
class BackwardReasoning(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_question = dspy.ChainOfThought("answer -> question")

    def forward(self, answer):
        reasoning_process = self.generate_question(answer=answer)
        return reasoning_process.question

# Fine-tuning Function with 'Think' Tokens
def fine_tune_model(prompt, correct_response):
    hf_model.train()
    optimizer.zero_grad()
    combined_text = f"{prompt} <think>{correct_response}</think> <answer>{correct_response}</answer>"
    inputs = tokenizer(combined_text, return_tensors="pt", truncation=True, padding=True)
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    prompt_length = len(tokenizer(prompt, add_special_tokens=False)["input_ids"])
    labels = input_ids.clone()
    labels[:, :prompt_length] = -100  # Ignore prompt tokens in loss calculation

    outputs = hf_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Prepare the Dataset for DSPy - Use 'Prompt' and 'Response'
trainset_forward = []
trainset_backward = []

for example in train_data_list:
    question = example['Prompt']  # Use 'Prompt'
    answer = example['Response']   # Use 'Response'

    # Create dspy.Example objects with input/output fields
    trainset_forward.append(dspy.Example(question=question, answer=answer).with_inputs("question"))
    trainset_backward.append(dspy.Example(question=question, answer=answer).with_inputs("answer"))

# Initialize the Teleprompter (Example: BootstrapFewShot)
from dspy.teleprompt import BootstrapFewShot

# Use BootstrapFewShot for few-shot optimization
teleprompter = BootstrapFewShot(metric=is_similar, max_bootstrapped_demos=4, max_labeled_demos=4, max_rounds=1)

# Update Experiment Function to Include 'Think' Tokens
def adaptive_boundary_experiment_with_reverse(trainset_forward, trainset_backward, epochs=1):
    for epoch in range(epochs):
        print(f"Epoch {epoch+1}/{epochs}")

        compiled_forward = teleprompter.compile(ForwardReasoning(), trainset=trainset_forward)
        compiled_backward = teleprompter.compile(BackwardReasoning(), trainset=trainset_backward)

        for example, example_backward in zip(trainset_forward, trainset_backward):
            try:
                prompt = example.question
                correct_response = example.answer

                forward_prediction = compiled_forward(question=prompt)
                model_response = forward_prediction
                print(f"Model Response (Forward): {model_response}")

                backward_prediction = compiled_backward(answer=correct_response)
                backward_question = backward_prediction
                print(f"Backward Question: {backward_question}")

                formatted_response = f"<think>{correct_response}</think> <answer>{correct_response}</answer>"

                # Fine-tune on incorrect responses
                if not is_similar(model_response, formatted_response):
                    loss = fine_tune_model(prompt, correct_response)
                    print(f"Fine-tuned on forward reasoning. Loss: {loss:.4f}")

                formatted_question = f"<think>{prompt}</think> <answer>{prompt}</answer>"
                if not is_similar(backward_question, formatted_question):
                    loss = fine_tune_model(formatted_question, prompt)
                    print(f"Fine-tuned on backward reasoning. Loss: {loss:.4f}")
            except Exception as e:
                print(f"Error processing example: {e}")
                continue

        print("Epoch completed.")

# Call the Experiment Function
adaptive_boundary_experiment_with_reverse(trainset_forward, trainset_backward, epochs=2)



Epoch 1/2


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
2025/01/30 15:34:09 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'question': 'Encapsulate symmetry properties through group actions.', 'answer': 'Group Actions: Identify subsets of permissible transformations operating on the collection of symbols, preserving structure and relationships among members. This includes rotational symmetries, reflections, translations, and other automorphisms.'}) (input_keys={'question'}) with <function is_similar at 0x7ddce82ca020> due to expected string or bytes-like object, got 'Example'.
  0%|          | 1/605 [00:51<8:38:36, 51.52s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
2025/01/30 15:35:00 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'question': 'Input encoding vector: [1.82, 0.61, 1.03, 0.92]\r\nMetadata:\r\n\r\nFractal functions: Julia Set, Barnsley Fern, Sierpinski Gasket\r\nDimension

TypeError: expected string or bytes-like object, got 'Example'
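The `TypeError` above is consistent with `BootstrapFewShot` calling its metric as `metric(example, prediction, trace)`, so `is_similar` receives an `Example` object where it expects a string. The sketch below reproduces that mismatch with a stand-in class rather than DSPy itself; `FakeExample` and `answer_metric` are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class FakeExample:
    # Stand-in for an example or prediction object with text fields.
    question: str = ""
    answer: str = ""


def normalize_text(text):
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text)).strip().lower()


def is_similar(response, correct_response, threshold=0.9):
    return (
        SequenceMatcher(None, normalize_text(response), normalize_text(correct_response)).ratio()
        >= threshold
    )


def answer_metric(example, pred, trace=None):
    # Adapter with the (example, pred, trace) shape: unpack the text first.
    return is_similar(pred.answer, example.answer)


gold = FakeExample(question="2 + 2?", answer="Four")
pred = FakeExample(answer="four.")

try:
    # Objects are passed where strings are expected, as in the failing run.
    is_similar(gold, pred, None)
except TypeError:
    print("is_similar failed with TypeError")

print(answer_metric(gold, pred))  # True
```

In the cell that follows, the same idea appears as `combined_metric(example, pred, trace=None)`, which reads the text out of its arguments before comparing.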

In [None]:
import dspy
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated
import random
import re
from difflib import SequenceMatcher
from sklearn.model_selection import train_test_split

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load dataset
dataset = load_dataset("TuringsSolutions/PFAF750")
train_data = dataset['train']

# 1. Create Train/Test Split
# Convert the 'train' split to a list of dictionaries
train_data_list = train_data.to_pandas().to_dict("records")

# Now use train_test_split
train_data_list, test_data_list = train_test_split(
    train_data_list, test_size=0.2, random_state=42
)

# Convert back to Hugging Face Dataset objects
train_dataset = Dataset.from_dict(
    {
        "Prompt": [d["Prompt"] for d in train_data_list],
        "Response": [d["Response"] for d in train_data_list],
    }
)
test_dataset = Dataset.from_dict(
    {
        "Prompt": [d["Prompt"] for d in test_data_list],
        "Response": [d["Response"] for d in test_data_list],
    }
)

# Model and Tokenizer Setup
checkpoint = "HuggingFaceTB/SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Handle pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token or "[PAD]"

hf_model = AutoModelForCausalLM.from_pretrained(checkpoint)
hf_model.resize_token_embeddings(len(tokenizer))
hf_model.to(device)

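# Wrap with HFModel. It is given the checkpoint name rather than hf_model,
# so it appears to load its own copy of the weights, separate from hf_model.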
lm = dspy.HFModel(model=checkpoint)
dspy.settings.configure(lm=lm, tokenizer=tokenizer)

# Optimizer
optimizer = AdamW(hf_model.parameters(), lr=5e-5)

# Text Normalization Function
def normalize_text(text):
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text)).strip().lower()

# Fuzzy Matching Function
def is_similar(response, correct_response, threshold=0.9):
    return (
        SequenceMatcher(None, normalize_text(response), normalize_text(correct_response)).ratio()
        >= threshold
    )

# Format Checking Function
def check_format(response):
    """Checks if the response string has the correct format."""
    return (
        response.startswith("<think>")
        and "</think> <answer>" in response
        and response.endswith("</answer>")
    )

# Combined Metric Function
def combined_metric(example, pred, trace=None):
    """
    Calculates a combined score based on accuracy and format.

    Args:
        example: The dspy.Example object.
        pred: The model's response dictionary.
        trace: An optional argument for tracing information. Not used in this function.

    Returns:
        A dictionary containing:
        - accuracy_score: The accuracy score based on is_similar (0.0 or 1.0).
        - format_score: The format score (1.0 for correct format, -1.0 for incorrect).
        - combined_score: The combined score (accuracy_score + format_score).
    """

    pred = pred["response"]

    accuracy_score = float(is_similar(pred, example.answer))  # 0.0 or 1.0, as documented above
    format_score = 1.0 if check_format(pred) else -1.0
    combined_score = accuracy_score + format_score
    return {
        "accuracy_score": accuracy_score,
        "format_score": format_score,
        "combined_score": combined_score,
    }

# DSPy Module for Forward Reasoning with 'Think' Tokens
class ForwardReasoning(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        reasoning_process = self.generate_answer(question=question)
        formatted_response = f"<think>{reasoning_process.answer}</think> <answer>{reasoning_process.answer}</answer>"
        return {"response": reasoning_process.answer, "formatted_response": formatted_response}

# DSPy Module for Backward Reasoning with 'Think' Tokens
class BackwardReasoning(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_question = dspy.ChainOfThought("answer -> question")

    def forward(self, answer):
        reasoning_process = self.generate_question(answer=answer)
        formatted_question = f"<think>{reasoning_process.question}</think> <answer>{reasoning_process.question}</answer>"
        return {"response": reasoning_process.question, "formatted_response": formatted_question}

# Fine-tuning Function with 'Think' Tokens
def fine_tune_model(prompt, correct_response):
    hf_model.train()
    optimizer.zero_grad()
    combined_text = f"{prompt} {correct_response}"
    inputs = tokenizer(combined_text, return_tensors="pt", truncation=True, padding=True)
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    prompt_length = len(tokenizer(prompt, add_special_tokens=False)["input_ids"])
    labels = input_ids.clone()
    labels[:, :prompt_length] = -100  # Ignore prompt tokens in loss calculation

    outputs = hf_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Prepare the Dataset for DSPy - Use 'Prompt' and 'Response'
trainset_forward = []
trainset_backward = []

for example in train_data_list:
    question = example["Prompt"]  # Use 'Prompt'
    answer = example["Response"]  # Use 'Response'

    # Create dspy.Example objects with input/output fields
    trainset_forward.append(
        dspy.Example(question=question, answer=answer).with_inputs("question")
    )
    trainset_backward.append(
        dspy.Example(question=question, answer=answer).with_inputs("answer")
    )

# Initialize the Teleprompter (Example: BootstrapFewShot)
from dspy.teleprompt import BootstrapFewShot

# Use BootstrapFewShot for few-shot optimization
teleprompter = BootstrapFewShot(
    metric=combined_metric, max_bootstrapped_demos=4, max_labeled_demos=4, max_rounds=1
)

# Update Experiment Function to Include 'Think' Tokens
def adaptive_boundary_experiment_with_reverse(trainset_forward, trainset_backward, epochs=1):
    for epoch in range(epochs):
        print(f"Epoch {epoch+1}/{epochs}")

        compiled_forward = teleprompter.compile(ForwardReasoning(), trainset=trainset_forward)
        compiled_backward = teleprompter.compile(BackwardReasoning(), trainset=trainset_backward)

        for example, example_backward in zip(trainset_forward, trainset_backward):
            try:
                prompt = example.question
                correct_response = example.answer

                forward_prediction = compiled_forward(question=prompt)
                model_response = forward_prediction["response"]
                formatted_response = forward_prediction["formatted_response"]
                print(f"Model Response (Forward): {model_response}")

                backward_prediction = compiled_backward(answer=correct_response)
                backward_question = backward_prediction["response"]
                backward_formatted_question = backward_prediction["formatted_response"]
                print(f"Backward Question: {backward_question}")

                # Fine-tune on incorrect responses or incorrect format.
                # The training target is built from the reference text, not from the model's own output.
                metrics = combined_metric(example, forward_prediction)
                if metrics["accuracy_score"] < 1.0 or metrics["format_score"] < 0:
                    target_response = f"<think>{correct_response}</think> <answer>{correct_response}</answer>"
                    loss = fine_tune_model(prompt, target_response)
                    print(f"Fine-tuned on forward reasoning. Loss: {loss:.4f}, Accuracy Score: {metrics['accuracy_score']}, Format Score: {metrics['format_score']}")

                backward_metrics = combined_metric(example_backward, backward_prediction)
                if backward_metrics["accuracy_score"] < 1.0 or backward_metrics["format_score"] < 0:
                    target_question = f"<think>{prompt}</think> <answer>{prompt}</answer>"
                    loss = fine_tune_model(correct_response, target_question)
                    print(f"Fine-tuned on backward reasoning. Loss: {loss:.4f}, Accuracy Score: {backward_metrics['accuracy_score']}, Format Score: {backward_metrics['format_score']}")
            except Exception as e:
                print(f"Error processing example: {e}")
                continue

        print("Epoch completed.")

# Call the Experiment Function
adaptive_boundary_experiment_with_reverse(trainset_forward, trainset_backward, epochs=2)



Epoch 1/2


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
  0%|          | 1/605 [00:51<8:43:27, 52.00s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
  0%|          | 2/605 [01:43<8:37:23, 51.48s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
  0%|          | 3/605 [02:32<8:25:40, 50.40s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
  1%|          | 4/605 [03:24<8:31:47, 51.09s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


  0%|          | 0/605 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
  0%|          | 1/605 [00:49<8:23:00, 49.97s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
  0%|          | 2/605 [01:36<8:03:43, 48.13s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
  0%|          | 3/605 [02:38<9:04:05, 54.23s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
  1%|          | 4/605 [03:28<8:42:36, 52.17s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Model Response (Forward): ${answer}

---

Question: Encapsulate symmetry properties through group actions.
Reasoning: Let's think step by step in order to Given the fields `question`, produce the fields `answer`. --- Follow the following format. Question: ${question} Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer} --- Question: Imagine raising the word to a non-integer power before applying these branches. This allows us to capture nuances, like a gentle breeze for "delicate" or a roaring storm for "powerful." Answer: (Model understands the role of dimensions in capturing different levels of meaning.) --- Question: Prove ∃x(¬P(x)) from ∀x(P(x) ⇒ Q(x)) and ∃x(¬Q(x)) using natural deduction. Answer: (Model constructs a formal proof using inference rules and assumptions.) --- Question: What role does the P-FAF function play in addressing cold start issues in recommendation systems? Answer: "Cold start issues arise in recommendation systems w