## Evaluating Inference-Time Adaptive Temperature for Improving Mathematical Reasoning in Large Language Models

## Step 1: Install Hugging Face CLI
- The Hugging Face Command Line Interface (CLI) allows you to interact with Hugging Face models and repositories directly.

- Make sure you have requested access to the model when you create your token. Because Gemma-2-2B-Instruct is a publicly gated model, the guide linked below may help resolve access issues with such models.

https://huggingface.co/docs/hub/en/models-gated
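
For non-interactive runs, the token can be supplied from an environment variable instead of the login prompt. This is a minimal sketch, assuming the token is stored in `HF_TOKEN` (the variable `huggingface_hub` reads by default). The helper names `resolve_hf_token` and `login_from_env` are illustrative and not part of the notebook.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the Hugging Face token from the environment, or None if unset."""
    # HF_TOKEN is the current variable name; HUGGING_FACE_HUB_TOKEN is the older one.
    return env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN")


def login_from_env() -> bool:
    """Log in with the token from the environment; return False if none is set."""
    token = resolve_hf_token()
    if token is None:
        return False
    # Imported lazily so resolve_hf_token works without huggingface_hub installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` before loading the model has the same effect as the interactive cell, provided the token was created with access to gated repositories.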


In [1]:
! huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `HF Token` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `HF Token`


## Step 2: Install Required Libraries

- Before running the code cells, ensure that all necessary Python packages are installed.
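
A quick check that the environment is complete can save a failed run later. The sketch below only tests whether each package can be imported; `missing_packages` is a hypothetical helper, and `accelerate` is included on the assumption that the model is loaded with `device_map="auto"`, which relies on it.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pandas and numpy are imported by later cells; accelerate backs device_map="auto".
required = ["datasets", "transformers", "torch", "tqdm", "pandas", "numpy", "accelerate"]
print("Missing packages:", missing_packages(required) or "none")
```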


In [2]:
! pip install datasets transformers
! pip install torch
! pip install tqdm


Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

## Step 3: Import Libraries and Define Configuration

This section imports all the necessary libraries required for data handling, model processing, and performance monitoring. Additionally, it defines a `Config` class that encapsulates various **hyperparameters** essential for controlling the behavior and performance of our model.



#### **Note on Selected Hugging Face Transformers Imports**


  - **`LogitsProcessorList`, `LogitsProcessor`**:
    - **Purpose**: Facilitate the modification of logits (raw model outputs) before sampling, enabling custom behaviors like temperature scaling or top-k filtering.
    - **Usage**: Used to implement adaptive entropy-based temperature scaling during text generation.

  - **`TextIteratorStreamer`**:
    - **Purpose**: Streams generated text tokens in real-time, allowing for interactive or progressive processing of generated text.
    - **Usage**: Enables the asynchronous generation and collection of text outputs without blocking the main thread.
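
The call contract of a logits processor can be illustrated without any dependencies: each processor receives the token ids generated so far and the raw scores for the next token, and returns modified scores. The sketch below uses plain lists in place of tensors, and `scale_by` and `apply_processors` are hypothetical names that only mimic how a list of processors is applied in order.

```python
from typing import Callable, List, Sequence

# A processor maps (input_ids, scores) to new scores of the same length.
Processor = Callable[[Sequence[int], List[float]], List[float]]


def scale_by(beta: float) -> Processor:
    """Build a processor that multiplies every logit by a fixed beta."""
    def processor(input_ids: Sequence[int], scores: List[float]) -> List[float]:
        return [beta * s for s in scores]
    return processor


def apply_processors(processors: List[Processor],
                     input_ids: Sequence[int],
                     scores: List[float]) -> List[float]:
    """Apply each processor in order, feeding the output of one into the next."""
    for processor in processors:
        scores = processor(input_ids, scores)
    return scores


logits = [2.0, 1.0, 0.0]
print(apply_processors([scale_by(2.0), scale_by(0.5)], [101, 7], logits))
```

The adaptive processor in this notebook follows the same shape, except that its scaling factor is recomputed at every step from the entropy of the scores.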

### **Configuration**

- The `Config` class is a `NamedTuple` that encapsulates various **hyperparameters**.
- I would encourage extensive experimentation with the following hyperparameters:
  - **`entropy_threshold`**: Entropy threshold above which the adaptive temperature is triggered
  - **`max_new_tokens`**
  - **`num_samples`**
  - **`max_beta` and `min_beta`**: Bounds on the beta (inverse temperature) parameter. Note that the `Config` shown here defines only `min_beta`.
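
Because a `NamedTuple` is immutable, experiments are easiest to set up by deriving modified copies with `_replace`. The sketch below uses `DemoConfig`, a cut-down stand-in for the notebook's `Config` with the same default values, so that it runs without `torch`.

```python
from typing import NamedTuple


class DemoConfig(NamedTuple):
    # A cut-down stand-in for the notebook's Config, using the same defaults.
    entropy_threshold: float = 0.6
    max_new_tokens: int = 500
    num_samples: int = 200
    min_beta: float = 0.5


base = DemoConfig()
# NamedTuples are immutable; _replace returns a modified copy.
quick_run = base._replace(num_samples=20, max_new_tokens=256)

# A small sweep over the entropy threshold, leaving everything else fixed.
sweep = [base._replace(entropy_threshold=t) for t in (0.4, 0.6, 0.8)]
print(quick_run)
print([c.entropy_threshold for c in sweep])
```

Note that `run_experiment` constructs its own `Config()`, so a modified configuration only takes effect there if that function is changed to accept one.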


In [3]:
import logging
import math
from typing import NamedTuple, Tuple, Optional, Dict, List, Any
from tqdm.notebook import tqdm
import numpy as np
from threading import Thread
from datasets import load_dataset
import pandas as pd
from statistics import mean, stdev

import torch
import torch.nn.functional as F
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    LogitsProcessorList,
    LogitsProcessor,
    TextIteratorStreamer
)

class Config(NamedTuple):
    model_name: str = "google/gemma-2-2b-it"
    entropy_threshold: float = 0.6
    poly_coeffs: Tuple[float, ...] = (-1.791, 4.917, -2.3, 0.481, -0.037)
    max_new_tokens: int = 500
    max_tokens: int = 2048
    top_p: float = 0.9
    top_k: int = 40
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
    num_samples: int = 200
    min_beta: float = 0.5

## `TokenMetrics` Class

The `TokenMetrics` class is designed to **store and manage per-token generation metrics** during the text generation process.

### **Attributes**

- **`tokens`** (`List[str]`):
  - **Description**: A list that holds each token generated by the model.
  - **Purpose**: Keeps track of the sequence of tokens produced, allowing for reconstruction and analysis of the generated text.

- **`entropies`** (`List[float]`):
  - **Description**: A list that records the entropy values associated with each generated token.
  - **Purpose**: Measures the uncertainty or randomness in the model's predictions for each token, providing insights into the model's confidence during generation.

- **`betas`** (`List[float]`):
  - **Description**: A list that stores the inverse-temperature scaling factor (`beta`) applied during the generation of each token.
  - **Purpose**: Allows monitoring of how the scaling factor sharpens or flattens the distribution from which each token is sampled.

- **`timestamps`** (`List[float]`):
  - **Description**: A list that logs the timestamp for when each token was generated.
  - **Purpose**: Enables tracking of the generation time for each token, which can be useful for performance analysis and optimization.


In [4]:

class TokenMetrics:
    """Stores per-token generation metrics"""
    def __init__(self):
        self.tokens: List[str] = []
        self.entropies: List[float] = []
        self.betas: List[float] = []
        self.timestamps: List[float] = []

    def add_metric(self, token: str, entropy: float, beta: float, timestamp: float):
        self.tokens.append(token)
        self.entropies.append(entropy)
        self.betas.append(beta)
        self.timestamps.append(timestamp)

    def get_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame({
            'token': self.tokens,
            'entropy': self.entropies,
            'beta': self.betas,
            'timestamp': self.timestamps
        })

## `AdaptiveEntropyTemperature` Class

The `AdaptiveEntropyTemperature` class is a custom **LogitsProcessor** that dynamically adjusts the temperature scaling factor (`beta`) during the text generation process based on the entropy of the model's predictions.

### **Key Features**

- **Dynamic Temperature Adjustment**:
  - **Entropy Calculation**: Computes the entropy of the model's output probabilities to assess prediction uncertainty.
  - **Temperature Scaling**: Adjusts the inverse of temperature (`beta`) using a polynomial function when entropy exceeds a predefined threshold.

- **Metric Tracking**:
  - **`TokenMetrics` Integration**: Records detailed metrics for each generated token, including the token itself, its entropy, the applied `beta`, and a timestamp.
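
The mechanism can be traced by hand on a four-token vocabulary. The following is a dependency-free sketch, not the notebook's `torch` implementation: it reuses the coefficient values, threshold, and `min_beta` from `Config`, and its `polyval` helper evaluates the polynomial with the highest-power coefficient first, which is the convention `np.polyval` uses.

```python
import math
from typing import List, Sequence, Tuple

POLY_COEFFS = (-1.791, 4.917, -2.3, 0.481, -0.037)  # same values as Config.poly_coeffs


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def polyval(coeffs: Sequence[float], x: float) -> float:
    """Horner evaluation, coefficients ordered from highest power to lowest."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


def adaptive_beta(logits: Sequence[float],
                  threshold: float = 0.6,
                  min_beta: float = 0.5) -> Tuple[float, float]:
    """Return (entropy, beta) using the same rule as the processor."""
    h = entropy(softmax(logits))
    beta = max(polyval(POLY_COEFFS, h), min_beta) if h > threshold else 1.0
    return h, beta


for name, logits in [("confident", [10.0, 0.0, 0.0, 0.0]),
                     ("uncertain", [1.0, 0.8, 0.6, 0.4])]:
    h, beta = adaptive_beta(logits)
    h_after = entropy(softmax([beta * x for x in logits]))
    print(f"{name}: entropy={h:.3f} beta={beta:.3f} entropy_after={h_after:.3f}")
```

For the confident distribution the entropy stays below the threshold and `beta` remains 1.0. For the uncertain one, `beta` rises above 1.0 and the rescaled distribution has lower entropy than the original.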






In [5]:

class AdaptiveEntropyTemperature(LogitsProcessor):
    def __init__(self, config: Config, tokenizer: PreTrainedTokenizer):
        self.config = config
        self.tokenizer = tokenizer
        self.current_entropy = None
        self.current_beta = None
        self.step_count = 0
        self.min_beta = config.min_beta
        self.token_metrics = TokenMetrics()


    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
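        # Entropy (in nats) of the next-token distribution. Computing it in float32
        # keeps the 1e-9 stabiliser from rounding to zero if scores arrive in float16.
        probs = F.softmax(scores.float(), dim=-1)
        entropy = -torch.sum(probs * torch.log(probs + 1e-9), dim=-1, keepdim=True)
        # .item() assumes a batch size of 1, which is how this notebook calls generate()
        self.current_entropy = entropy.item()

        if self.current_entropy > self.config.entropy_threshold:
            # np.polyval reads coefficients from the highest power down to the constant
            self.current_beta = float(max(np.polyval(self.config.poly_coeffs, self.current_entropy), self.min_beta))
        else:
            self.current_beta = 1.0

        # Most recent token in the context (the last prompt token on the first step).
        # The entropy and beta logged with it describe the distribution of the *next* token.
        latest_token_id = input_ids[0, -1].item()
        latest_token = self.tokenizer.decode([latest_token_id])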

        # Record metrics
        self.token_metrics.add_metric(
            token=latest_token,
            entropy=self.current_entropy,
            beta=self.current_beta,
            timestamp=pd.Timestamp.now().timestamp()
        )

        # Print token-level information
        print(f"Token [{self.step_count}]: '{latest_token}' | "
              f"Entropy: {self.current_entropy:.3f} | "
              f"Beta: {self.current_beta:.3f}")

        self.step_count += 1
        return scores * self.current_beta

    def get_generation_metrics(self) -> Dict[str, Any]:
        return {
            "token_metrics": self.token_metrics.get_dataframe(),
            "steps_taken": self.step_count,
            "final_beta": self.current_beta,
            "final_entropy": self.current_entropy,
        }


## `ModelManager` Class

The `ModelManager` class is a **context manager** designed to handle the loading and unloading of the pre-trained language model and its corresponding tokenizer.

### **Key Features**


  - Automatically loads the specified pre-trained model and tokenizer when entering the context.
  - Ensures that the model is moved to the CPU and properly deleted upon exiting the context, freeing up GPU memory.
  - Utilizes a `Config` object to determine parameters such as the model name, device (CPU/GPU), and other hyperparameters.
  - Implements the `__enter__` and `__exit__` methods, enabling use with the `with` statement for streamlined model management.
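
The value of the context-manager pattern is that cleanup runs even when the body fails. The sketch below demonstrates that ordering with `DemoResource`, a hypothetical stand-in that records events instead of loading a real model.

```python
from typing import List


class DemoResource:
    """Stand-in for ModelManager that records the order of operations."""

    def __init__(self, events: List[str]):
        self.events = events

    def __enter__(self):
        self.events.append("load")
        return "model", "tokenizer"  # placeholders for the real objects

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.events.append("unload")
        return False  # do not swallow exceptions raised inside the block


events: List[str] = []
try:
    with DemoResource(events) as (model, tokenizer):
        events.append("generate")
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    events.append("handled")

print(events)
```

The recorded order is load, generate, unload, handled: `__exit__` runs before the exception reaches the caller, which is why GPU memory can be released even if generation crashes.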



In [6]:

class ModelManager:
    def __init__(self, config: Config):
        self.config = config
        self.model: Optional[PreTrainedModel] = None
        self.tokenizer: Optional[PreTrainedTokenizer] = None

    def __enter__(self):
        print(f"Loading model {self.config.model_name} on {self.config.device}")
        self.model = AutoModelForCausalLM.from_pretrained(
            self.config.model_name,
            device_map="auto",
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True
        )
        self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_name)

        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        return self.model, self.tokenizer

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.model is not None:
            self.model = self.model.cpu()
            del self.model
            torch.cuda.empty_cache()
        if self.tokenizer is not None:
            del self.tokenizer
        self.model = None
        self.tokenizer = None

## `evaluate_answer` Function

The `evaluate_answer` function assesses the quality of generated math solutions by comparing them to ground truth answers. It performs the following operations:

1. **Extract Final Numerical Answer**:
   - **`extract_final_number`**: Parses the text to retrieve the last numerical value, ignoring symbols like `$` and `%`.

2. **Check for Solution Steps**:
   - **`has_solution_steps`**: Determines if the generated text includes step-by-step reasoning by searching for indicators such as "step", "first", "then", etc.

3. **Evaluate Numerical Accuracy**:
   - Compares the extracted numerical answer from the generated text with the ground truth, allowing a tolerance of 1% of the ground-truth value or 0.01, whichever is larger.

4. **Return Evaluation Metrics**:
   - **`length`**: Number of words in the generated answer.
   - **`has_numerical_answer`**: Indicates if a numerical answer was extracted.
   - **`has_solution_steps`**: Indicates presence of step-by-step reasoning.
   - **`numerical_match`**: Boolean indicating if the numerical answers match within the tolerance.
   - **`extracted_answer`**: The numerical answer extracted from the generated text.
   - **`ground_truth_number`**: The correct numerical answer.
   - **`is_complete`**: True if both a numerical answer and solution steps are present.
   - **`overall_correct`**: True if the numerical answer matches and solution steps are present.


### **NOTE**: This is a really janky implementation and one of the sections of the code that needs further refinement, since a more reliable answer checker could improve the benchmarking results. I was running against deadlines and had to resort to a hackneyed implementation.
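
One possible refinement is to extract numbers with a regular expression, so that answers wrapped in punctuation or markdown (for example `**72**`, which `float()` rejects) are still recognized. The sketch below is an alternative under stated assumptions, not the notebook's method: `last_number`, `gsm8k_reference`, and `within_tolerance` are hypothetical helpers, and `gsm8k_reference` relies on GSM8K reference answers ending with `#### <number>`.

```python
import re
from typing import Optional

NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number in the text, ignoring surrounding punctuation."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_reference(answer: str) -> Optional[float]:
    """GSM8K reference answers end with '#### <number>'; prefer that marker."""
    marker = answer.rfind("####")
    return last_number(answer[marker:] if marker != -1 else answer)


def within_tolerance(predicted: float, truth: float) -> bool:
    # Same rule as evaluate_answer: 1% of the truth, but never less than 0.01.
    return abs(predicted - truth) <= max(0.01 * abs(truth), 0.01)


print(last_number("The answer is **72**."))
print(gsm8k_reference("48 / 2 = 24 clips in May.\n#### 72"))
print(within_tolerance(100.5, 100.0), within_tolerance(102.0, 100.0))
```

This still takes the last number in the generated text as the answer, so it inherits that heuristic's weakness when the model keeps writing after stating its result.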


In [7]:

def evaluate_answer(generated: str, ground_truth: str) -> Dict[str, Any]:
    """
    Evaluates generated math solutions against ground truth answers.
    """
    def extract_final_number(text: str) -> Optional[float]:
        """Extract the final numerical answer from text."""
        text = text.replace('$', '').replace('%', '')
        numbers = []
        for word in text.replace(',', '').split():
            try:
                numbers.append(float(word))
            except ValueError:
                continue
        return numbers[-1] if numbers else None

    def has_solution_steps(text: str) -> bool:
        """Check if the text contains step-by-step reasoning."""
        step_indicators = ['step', 'first', 'then', 'next', 'finally', '1)', '2)', '3)']
        return any(indicator in text.lower() for indicator in step_indicators)

    generated_num = extract_final_number(generated)
    ground_truth_num = extract_final_number(ground_truth)

    numerical_match = False
    if generated_num is not None and ground_truth_num is not None:
        tolerance = max(0.01 * abs(ground_truth_num), 0.01)
        numerical_match = abs(generated_num - ground_truth_num) <= tolerance

    return {
        "length": len(generated.split()),
        "has_numerical_answer": generated_num is not None,
        "has_solution_steps": has_solution_steps(generated),
        "numerical_match": numerical_match,
        "extracted_answer": generated_num,
        "ground_truth_number": ground_truth_num,
        "is_complete": generated_num is not None and has_solution_steps(generated),
        "overall_correct": numerical_match and has_solution_steps(generated)
    }


## Generation Functions

This section includes two functions, `generate_baseline` and `generate_with_adaptive_temp`, which handle text generation using the pre-trained model. The former uses standard generation settings, while the latter incorporates adaptive temperature scaling based on entropy.

### `generate_baseline`

Generates text using the baseline settings without any adaptive mechanisms.

- **Parameters**:
  - `prompt` (`str`): The input text prompt for generation.
  - `model` (`PreTrainedModel`): The pre-trained language model.
  - `tokenizer` (`PreTrainedTokenizer`): The tokenizer corresponding to the model.
  - `config` (`Config`): Configuration object containing hyperparameters.

- **Process**:
  1. Tokenizes the input prompt and moves it to the specified device.
  2. Sets up generation parameters including `max_new_tokens`, `temperature`, `top_p`, and `top_k`.
  3. Initiates text generation in a separate thread to stream output.
  4. Collects and returns the generated text.

- **Returns**:
  - `generated_text` (`str`): The text generated by the model.
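
The effect of `top_k` and `top_p` can be seen on a toy distribution. The sketch below is a simplified illustration that works on probabilities in plain Python. The library applies these filters to logits and treats edge cases such as ties differently, so `top_k_top_p` should be read as a hypothetical helper rather than the library's algorithm.

```python
from typing import Dict, Sequence


def top_k_top_p(probs: Sequence[float], top_k: int, top_p: float) -> Dict[int, float]:
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalised mass reaches top_p. Returns {token_index: renormalised_prob}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


print(top_k_top_p([0.5, 0.3, 0.15, 0.05], top_k=3, top_p=0.7))
```

With `top_k=3` the least likely token is dropped first, and with `top_p=0.7` only the two most likely tokens survive, renormalised so that they sum to one.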

### `generate_with_adaptive_temp`

Generates text with adaptive temperature scaling based on the entropy of each token prediction.

- **Parameters**:
  - `prompt` (`str`): The input text prompt for generation.
  - `model` (`PreTrainedModel`): The pre-trained language model.
  - `tokenizer` (`PreTrainedTokenizer`): The tokenizer corresponding to the model.
  - `config` (`Config`): Configuration object containing hyperparameters.

- **Process**:
  1. Tokenizes the input prompt and moves it to the specified device.
  2. Initializes the `AdaptiveEntropyTemperature` processor to adjust temperature dynamically.
  3. Sets up generation parameters including `max_new_tokens`, `top_p`, `top_k`, and the custom logits processor.
  4. Initiates text generation in a separate thread to stream output.
  5. Collects the generated text and retrieves generation metrics.
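  6. Prints summary statistics such as total tokens generated, average entropy, average beta, and generation time.

- **Returns**:
  - `generated_text` (`str`): The text generated by the model.
  - `generation_metrics` (`Dict`): Per-token metrics plus the step count, final `beta`, and final entropy.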
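
Both functions rely on the same threading pattern: generation runs in a background thread and pushes text into a stream, while the main thread consumes it. The sketch below reproduces that pattern with the standard library only. `fake_generate` and `stream` are hypothetical stand-ins for `model.generate` and the streamer, and the queue with a sentinel is an assumption about how such a stream can be built.

```python
import queue
import threading
from typing import Iterator, List

_DONE = object()  # sentinel that marks the end of the stream


def fake_generate(chunks: List[str], out: "queue.Queue") -> None:
    """Stand-in for model.generate: pushes text chunks, then the sentinel."""
    for chunk in chunks:
        out.put(chunk)
    out.put(_DONE)


def stream(out: "queue.Queue") -> Iterator[str]:
    """Yield chunks as they arrive until the sentinel is seen."""
    while True:
        item = out.get()
        if item is _DONE:
            return
        yield item


q: "queue.Queue" = queue.Queue()
worker = threading.Thread(target=fake_generate, args=(["2 + 2", " = ", "4"], q))
worker.start()

generated_text = "".join(stream(q))
worker.join()  # wait for the producer thread to exit before moving on
print(generated_text)
```

Joining the worker after the stream is exhausted ensures the producer has finished before any state it wrote, such as per-token metrics, is read.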



In [8]:

def generate_baseline(
    prompt: str,
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    config: Config
) -> str:
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=config.max_tokens
    ).to(config.device)

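    # skip_prompt=True keeps the echoed prompt out of the streamed text, so the
    # evaluator only sees the model's answer (the prompt itself contains "step")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)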
    generation_kwargs = dict(
        **inputs,
        max_new_tokens=config.max_new_tokens,
        do_sample=True,
        temperature=1.0,
        top_p=config.top_p,
        top_k=config.top_k,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer
    )

    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    generated_text = ""
    for text in streamer:
        generated_text += text
    thread.join()  # make sure generate() has fully finished before returning

    return generated_text

def generate_with_adaptive_temp(
    prompt: str,
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    config: Config
) -> Tuple[str, Dict]:
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=config.max_tokens
    ).to(config.device)

    adaptive_processor = AdaptiveEntropyTemperature(config, tokenizer)
    processors = LogitsProcessorList([adaptive_processor])
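    # skip_prompt=True keeps the echoed prompt out of the streamed text, so the
    # evaluator only sees the model's answer (the prompt itself contains "step")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)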

    generation_kwargs = dict(
        **inputs,
        max_new_tokens=config.max_new_tokens,
        do_sample=True,
        top_p=config.top_p,
        top_k=config.top_k,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        logits_processor=processors,
        output_scores=True,
        return_dict_in_generate=True,
        streamer=streamer
    )

    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    generated_text = ""
    for text in streamer:
        generated_text += text
    thread.join()  # make sure generate() has finished before reading the metrics

    # Get detailed generation metrics
    generation_metrics = adaptive_processor.get_generation_metrics()

    # Print summary statistics
    metrics_df = generation_metrics['token_metrics']
    print("\nGeneration Summary:")
    print(f"Total tokens generated: {len(metrics_df)}")
    print(f"Average entropy: {metrics_df['entropy'].mean():.3f}")
    print(f"Average beta: {metrics_df['beta'].mean():.3f}")
    print(f"Generation time: {metrics_df['timestamp'].max() - metrics_df['timestamp'].min():.2f} seconds")

    return generated_text, generation_metrics

## Experiment and Analysis Functions

This section comprises two primary functions, `run_experiment` and `analyze_results`, which orchestrate the process of generating model outputs, evaluating their accuracy, and analyzing the overall performance.

### `run_experiment`

Generates answers for a set of math problems using both baseline and adaptive temperature methods, evaluates the generated answers against ground truth, and collects the results for further analysis.

- **Parameters**:
  - `problems` (`List[Dict]`): A list of dictionaries where each dictionary contains a `question` and its corresponding `answer`.
  - `display_outputs` (`bool`, optional): If set to `True`, the function will print the generated outputs for each problem. Defaults to `False`.

- **Returns**:
  - `List[Dict]`: A list of result dictionaries, each containing the original question, ground truth answer, baseline and adaptive generated texts, their evaluations, and additional statistics for the adaptive method.

- **Functionality**:
  1. **Configuration Initialization**:
     - Instantiates a `Config` object to access hyperparameters.
  
  2. **Model and Tokenizer Setup**:
     - Utilizes the `ModelManager` context manager to load the pre-trained model and tokenizer based on the configuration.
  
  3. **Problem Processing Loop**:
     - Iterates over each math problem using a progress bar provided by `tqdm`.
     - For each problem:
       - Constructs a prompt instructing the model to solve the problem step-by-step.
       - Generates responses using both the baseline and adaptive temperature methods.
       - Evaluates each generated response against the ground truth answer using the `evaluate_answer` function.
       - Compiles the results into a structured dictionary and appends it to the `results` list.
       - If `display_outputs` is `True`, prints the question, ground truth, and both generated outputs for immediate inspection.
  
  4. **Error Handling**:
     - Catches and logs any exceptions that occur during the processing of individual problems, allowing the experiment to continue uninterrupted.

### `analyze_results`

Analyzes the collected experiment results by aggregating various performance metrics for both baseline and adaptive generation methods, and prints a comprehensive summary of the findings.

- **Parameters**:
  - `results` (`List[Dict]`): The list of result dictionaries returned by the `run_experiment` function.
  - `config` (`Config`): Configuration object containing hyperparameters (currently unused but can be leveraged for extended analysis).

- **Returns**:
  - `None`: The function outputs the analysis directly to the console.

- **Functionality**:
  1. **Metric Initialization**:
     - Sets up dictionaries to accumulate metrics for both baseline and adaptive methods, including counts of numerical matches, presence of solution steps, completeness of solutions, overall correctness, response lengths, and adaptive-specific statistics like final `beta` values and entropies.
  
  2. **Metrics Aggregation**:
     - Iterates through each result:
       - Extracts and aggregates evaluation metrics for the baseline method.
       - Extracts and aggregates evaluation metrics and adaptive statistics for the adaptive method.
  
  3. **Performance Metrics Calculation**:
     - Calculates percentages for key metrics such as Numerical Accuracy, Presence of Solution Steps, Complete Solutions, and Overall Correctness for both baseline and adaptive methods.
  
  4. **Response Statistics**:
     - Computes average and standard deviation of response lengths for both methods.
  
  5. **Adaptive Control Statistics**:
     - Calculates average and standard deviation of the final `beta` values and entropies used in the adaptive method.
     - Determines the percentage of tokens where adaptive temperature scaling was applied (`beta` not equal to 1.0).
  
  6. **Token-Level Analysis**:
     - Aggregates token-level metrics from all adaptive generations.
     - Computes overall statistics such as total tokens generated, average entropy, average `beta`, and their respective standard deviations.
  
  7. **Output**:
     - Prints a detailed summary of all the computed metrics and statistics to provide insights into the performance differences between the baseline and adaptive generation methods.
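
The aggregation step can be sketched on a handful of hand-written evaluation dictionaries. `summarize` below is a hypothetical, reduced version of the analysis that keeps only two accuracy metrics and the length statistics, and it adds guards for empty or single-item inputs as an assumption about how such cases should be handled.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(evaluations: List[Dict]) -> Dict[str, float]:
    """Aggregate per-problem evaluation dicts into percentages and length stats."""
    total = len(evaluations)
    if total == 0:
        return {}
    lengths = [e["length"] for e in evaluations]
    return {
        # Booleans sum as 0/1, so the sum is the count of True values.
        "numerical_accuracy_pct": 100 * sum(e["numerical_match"] for e in evaluations) / total,
        "overall_correct_pct": 100 * sum(e["overall_correct"] for e in evaluations) / total,
        "avg_length": mean(lengths),
        # stdev needs at least two data points; report 0.0 otherwise.
        "stdev_length": stdev(lengths) if total > 1 else 0.0,
    }


demo = [
    {"numerical_match": True, "overall_correct": True, "length": 120},
    {"numerical_match": False, "overall_correct": False, "length": 80},
    {"numerical_match": True, "overall_correct": False, "length": 100},
    {"numerical_match": True, "overall_correct": True, "length": 100},
]
print(summarize(demo))
```

Running the same function once on the baseline evaluations and once on the adaptive ones gives two directly comparable summaries.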

### **Workflow Overview**

1. **Data Preparation**:
   - Use `prepare_math_problems` to load and select a subset (`num_samples`) of math problems from the GSM8K dataset.

2. **Experiment Execution**:
   - Invoke `run_experiment` to generate answers using both baseline and adaptive temperature methods.
   - Collect evaluations comparing the generated answers against ground truth.

3. **Result Analysis**:
   - Utilize `analyze_results` to compute and display performance metrics, helping to assess the effectiveness of the adaptive temperature approach.




In [9]:

def prepare_math_problems(config: Config) -> List[Dict[str, str]]:
    dataset = load_dataset("gsm8k", "main")
    problems = []

    for item in dataset['train'].select(range(config.num_samples)):
        problems.append({
            'question': item['question'],
            'answer': item['answer']
        })

    return problems


def run_experiment(problems: List[Dict], display_outputs: bool = False) -> List[Dict]:
    config = Config()
    results = []

    with ModelManager(config) as (model, tokenizer):
        for problem in tqdm(problems, desc="Processing problems"):
            try:
                prompt = (f"Solve this step by step:\n{problem['question']}\n"
                         f"Let's solve this step by step:")

                baseline = generate_baseline(prompt, model, tokenizer, config)
                adaptive_output, stats = generate_with_adaptive_temp(prompt, model, tokenizer, config)

                result = {
                    'question': problem['question'],
                    'ground_truth': {
                        'answer': problem['answer']
                    },
                    'baseline': {
                        'text': baseline,
                        'evaluation': evaluate_answer(baseline, problem['answer'])
                    },
                    'adaptive': {
                        'text': adaptive_output,
                        'evaluation': evaluate_answer(adaptive_output, problem['answer']),
                        'stats': stats
                    }
                }
                results.append(result)

                if display_outputs:
                    print(f"\nQuestion: {problem['question']}")
                    print(f"\nGround Truth: {problem['answer']}")
                    print("\nBaseline output:")
                    print(baseline)
                    print("\nAdaptive output:")
                    print(adaptive_output)
                    print("-" * 80)

            except Exception as e:
                logging.error(f"Error processing problem: {str(e)}")
                continue

    return results

def analyze_results(results: List[Dict], config: Config) -> None:
    print("\nDetailed Analysis Summary:")

    metrics = {
        'baseline': {
            'numerical_matches': 0,
            'has_steps': 0,
            'complete_solutions': 0,
            'overall_correct': 0,
            'lengths': []
        },
        'adaptive': {
            'numerical_matches': 0,
            'has_steps': 0,
            'complete_solutions': 0,
            'overall_correct': 0,
            'lengths': [],
            'final_betas': [],
            'final_entropies': []
        }
    }

    # Collect metrics
    for result in results:
        baseline_eval = result['baseline']['evaluation']
        metrics['baseline']['numerical_matches'] += baseline_eval['numerical_match']
        metrics['baseline']['has_steps'] += baseline_eval['has_solution_steps']
        metrics['baseline']['complete_solutions'] += baseline_eval['is_complete']
        metrics['baseline']['overall_correct'] += baseline_eval['overall_correct']
        metrics['baseline']['lengths'].append(baseline_eval['length'])

        adaptive_eval = result['adaptive']['evaluation']
        metrics['adaptive']['numerical_matches'] += adaptive_eval['numerical_match']
        metrics['adaptive']['has_steps'] += adaptive_eval['has_solution_steps']
        metrics['adaptive']['complete_solutions'] += adaptive_eval['is_complete']
        metrics['adaptive']['overall_correct'] += adaptive_eval['overall_correct']
        metrics['adaptive']['lengths'].append(adaptive_eval['length'])
        metrics['adaptive']['final_betas'].append(result['adaptive']['stats']['final_beta'])
        metrics['adaptive']['final_entropies'].append(result['adaptive']['stats']['final_entropy'])

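    total = len(results)
    if total < 2:
        # The percentages need at least one result and stdev() needs at least two
        print("Not enough results to analyze.")
        return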

    # Print performance metrics
    print("\nPerformance Metrics:")
    print(f"{'Metric':<25} {'Baseline':<15} {'Adaptive':<15}")
    print("-" * 55)

    metrics_to_report = [
        ('Numerical Accuracy', 'numerical_matches'),
        ('Has Solution Steps', 'has_steps'),
        ('Complete Solutions', 'complete_solutions'),
        ('Overall Correct', 'overall_correct')
    ]

    for metric_name, metric_key in metrics_to_report:
        baseline_pct = (metrics['baseline'][metric_key] / total) * 100
        adaptive_pct = (metrics['adaptive'][metric_key] / total) * 100
        print(f"{metric_name:<25} {baseline_pct:>6.1f}%{' '*8} {adaptive_pct:>6.1f}%")

    # Print response statistics
    print("\nResponse Statistics:")
    print(f"{'Statistic':<25} {'Baseline':<15} {'Adaptive':<15}")
    print("-" * 55)

    bl_lengths = metrics['baseline']['lengths']
    ad_lengths = metrics['adaptive']['lengths']
    print(f"Avg Response Length{' '*8} {mean(bl_lengths):>6.1f}{' '*8} {mean(ad_lengths):>6.1f}")
    print(f"StdDev Length{' '*13} {stdev(bl_lengths):>6.1f}{' '*8} {stdev(ad_lengths):>6.1f}")

    # Print adaptive-specific metrics
    print("\nAdaptive Control Statistics:")
    print(f"Average Final Beta: {mean(metrics['adaptive']['final_betas']):.3f}")
    print(f"Average Final Entropy: {mean(metrics['adaptive']['final_entropies']):.3f}")
    print(f"Beta StdDev: {stdev(metrics['adaptive']['final_betas']):.3f}")
    print(f"Entropy StdDev: {stdev(metrics['adaptive']['final_entropies']):.3f}")

    # Add token-level analysis
    print("\nToken-Level Analysis:")
    all_token_metrics = []

    for result in results:
        if 'adaptive' in result and 'stats' in result['adaptive']:
            if 'token_metrics' in result['adaptive']['stats']:
                # Use a distinct name so the aggregate `metrics` dict above is not shadowed
                token_df = result['adaptive']['stats']['token_metrics']
                all_token_metrics.append(token_df)

    if all_token_metrics:
        combined_metrics = pd.concat(all_token_metrics)
        print(f"Total tokens across all generations: {len(combined_metrics)}")
        print(f"Average token entropy: {combined_metrics['entropy'].mean():.3f}")
        print(f"Average token beta: {combined_metrics['beta'].mean():.3f}")
        print(f"Entropy std dev: {combined_metrics['entropy'].std():.3f}")
        print(f"Beta std dev: {combined_metrics['beta'].std():.3f}")


        num_tokens = len(combined_metrics)
        num_adaptive_tokens = (combined_metrics['beta'] != 1.0).sum()
        percentage_adaptive = (num_adaptive_tokens / num_tokens) * 100

        print(f"Number of adaptive tokens: {num_adaptive_tokens} "
              f"out of {num_tokens} ({percentage_adaptive:.2f}%)")

## `main` Function


### **Workflow Steps**

1. **Configure Logging**:
   - Initializes the logging system to capture informational messages and above, facilitating monitoring and debugging.

2. **Initialize Configuration**:
   - Creates a `Config` object that holds all necessary hyperparameters and settings for the experiment.

3. **Prepare Math Problems**:
   - Prints a message indicating the start of math problem preparation.
   - Calls `prepare_math_problems(config)` to load and select a specified number of math problems from the GSM8K dataset based on the configuration.

4. **Run Experiment**:
   - Prints a message indicating the commencement of the experiment.
   - Executes `run_experiment(problems, display_outputs=True)` to generate answers using both baseline and adaptive temperature methods, and evaluates their accuracy against ground truth answers.

5. **Analyze Results**:
   - Prints a message indicating the start of result analysis.
   - Invokes `analyze_results(results, config)` to aggregate and display performance metrics, comparing the effectiveness of baseline and adaptive generation approaches.

6. **Save Results to File**:
   - Generates a timestamp to uniquely identify the results file.
   - Converts the results list into a Pandas DataFrame.
   - Saves the DataFrame as a JSON file named `math_evaluation_results_<timestamp>.json`.
   - Prints a confirmation message indicating where the results have been saved.




#### **NOTES**:
- The per-token metrics are printed at the very end of the logs, after all per-problem output. Apologies for the inconvenience.
- The logs/outputs in the next cell are very long. In this inference run, per-token temperature adaptation slightly improves on the baseline model on a sample of 200 problems from the GSM8K benchmark.
- Going through these logs might be mildly interesting. I encourage the reader to look through some of the samples to see how the generation patterns differ between the baseline and the adaptive-temperature tokens.

#### **Speculative Reasons for Underperformance**
- **Benchmark leakage into the training data**
- This is an easy answer, but the `beta` bounds do need more careful tuning. Stability analysis and gradient analysis are two possible approaches.
- The current entropy control method from the "Softmax is not enough" paper might be sub-optimal.
- The coefficients of the polynomial control function, drawn directly from the paper, might need revision.
- **The paper mentions that this adaptation is most useful at longer context lengths, and the GSM8K samples are short in comparison.**
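
To make the `beta` bounds and the polynomial control function discussed above more concrete, here is a minimal, dependency-free sketch of an entropy-gated temperature adjustment. It is not the implementation used in this notebook: the helper names, the entropy threshold of `0.5`, the `beta_min`/`beta_max` parameters, and especially `ILLUSTRATIVE_COEFFS` are assumptions chosen for illustration, and the coefficients are placeholders rather than the values from the paper.

```python
import math

# Placeholder polynomial (highest degree first): beta = 0.5 * H + 1.0.
# These are NOT the coefficients from the paper; they are illustrative only.
ILLUSTRATIVE_COEFFS = [0.5, 1.0]


def softmax(logits, beta=1.0):
    """Numerically stable softmax over beta-scaled logits."""
    scaled = [beta * x for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def polyval(coeffs, x):
    """Evaluate a polynomial with Horner's rule (highest degree first)."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


def adaptive_beta(entropy, coeffs, entropy_threshold=0.5,
                  beta_min=1.0, beta_max=None):
    """Return beta = 1 for confident tokens, else a clamped polynomial of entropy."""
    if entropy <= entropy_threshold:
        return 1.0
    beta = max(polyval(coeffs, entropy), beta_min)
    if beta_max is not None:
        beta = min(beta, beta_max)
    return beta


def adaptive_softmax(logits, coeffs=ILLUSTRATIVE_COEFFS, **bounds):
    """Sharpen the distribution only when the unscaled entropy is high."""
    entropy = shannon_entropy(softmax(logits))
    beta = adaptive_beta(entropy, coeffs, **bounds)
    return softmax(logits, beta), entropy, beta


if __name__ == "__main__":
    for logits in ([2.0, 1.0, 0.5, 0.1], [10.0, 0.0, 0.0]):
        probs, entropy, beta = adaptive_softmax(logits, beta_max=4.0)
        print(f"entropy={entropy:.3f} beta={beta:.3f} "
              f"entropy_after={shannon_entropy(probs):.3f}")
```

Exposing the threshold and the bounds as arguments makes it easy to sweep them, which is one way to probe whether the `beta` bounds or the polynomial coefficients are responsible for the small gain.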



### **Future Directions**
- Looking at higher moments of entropy like varentropy and skewness
- Constructing control functions that are more interpretable and sophisticated than polynomials
- More thorough per-token analysis of entropy and beta functions  
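
As a starting point for the first direction, the sketch below computes the entropy of a next-token distribution together with its varentropy (the variance of the surprisal `-log p`) and the skewness of the surprisal. The function name `entropy_moments` and the `1e-12` tolerance are assumptions made for this illustration; nothing here is part of the notebook's pipeline.

```python
import math


def entropy_moments(probs, tol=1e-12):
    """Return (entropy, varentropy, skewness) of the surprisal -log p, in nats."""
    # Pair each surprisal with its probability, skipping zero-probability entries.
    pairs = [(-math.log(p), p) for p in probs if p > 0.0]
    entropy = sum(p * s for s, p in pairs)
    varentropy = sum(p * (s - entropy) ** 2 for s, p in pairs)
    third_moment = sum(p * (s - entropy) ** 3 for s, p in pairs)
    # Skewness is undefined when the surprisal has (numerically) zero variance.
    skewness = third_moment / varentropy ** 1.5 if varentropy > tol else 0.0
    return entropy, varentropy, skewness


if __name__ == "__main__":
    for probs in ([0.25, 0.25, 0.25, 0.25], [0.9, 0.05, 0.05]):
        h, v, s = entropy_moments(probs)
        print(f"entropy={h:.3f} varentropy={v:.3f} skewness={s:.3f}")
```

A uniform distribution has maximal entropy but zero varentropy, while a peaked distribution with a few unlikely alternatives has low entropy and non-zero varentropy, so the two statistics carry different information that a control function could use.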


In [10]:

def main():
    logging.basicConfig(level=logging.INFO)
    config = Config()  # Can experiment here with parameter values and sample sizes

    print("Preparing math problems...")
    problems = prepare_math_problems(config)

    print("Running experiment...")
    results = run_experiment(problems, display_outputs=True)

    print("Analyzing results...")
    analyze_results(results, config)

    # Save results to file
    timestamp = pd.Timestamp.now().strftime("%Y%m%d_%H%M%S")
    results_df = pd.DataFrame(results)
    results_df.to_json(f"math_evaluation_results_{timestamp}.json")
    print(f"\nResults saved to math_evaluation_results_{timestamp}.json")

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        logging.error(f"Error in main execution: {str(e)}")
        raise




Preparing math problems...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Running experiment...
Loading model google/gemma-2-2b-it on cuda


config.json:   0%|          | 0.00/838 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/24.2k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/241M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

Processing problems:   0%|          | 0/200 [00:00<?, ?it/s]

The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


Streaming output truncated to the last 5000 lines.
Token [281]: '.' | Entropy: 0.122 | Beta: 1.000
Token [282]: '

' | Entropy: 0.034 | Beta: 1.000
Token [283]: '**' | Entropy: 0.002 | Beta: 1.000
Token [284]: '7' | Entropy: 0.002 | Beta: 1.000
Token [285]: '.' | Entropy: 0.032 | Beta: 1.000
Token [286]: ' Calculate' | Entropy: 0.080 | Beta: 1.000
Token [287]: ' the' | Entropy: 0.001 | Beta: 1.000
Token [288]: ' total' | Entropy: 0.001 | Beta: 1.000
Token [289]: ' spending' | Entropy: 0.002 | Beta: 1.000
Token [290]: ' for' | Entropy: 0.005 | Beta: 1.000
Token [291]: ' Carly' | Entropy: 0.007 | Beta: 1.000
Token [292]: ':**' | Entropy: 0.048 | Beta: 1.000
Token [293]: '

' | Entropy: 0.003 | Beta: 1.000
Token [294]: '*' | Entropy: 0.055 | Beta: 1.000
Token [295]: ' Carly' | Entropy: 0.007 | Beta: 1.000
Token [296]: ''' | Entropy: 0.000 | Beta: 1.000
Token [297]: 's' | Entropy: 0.006 | Beta: 1.000
Token [298]: ' total' | Entropy: 0.001 | Beta: 1.000
Token [299]: ' spending