# Validation - repeated inference

To address your questions about modifying vLLM to include special tokens in the output, here's a detailed guide:

### 1. How do I modify the vLLM code?

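**In recent vLLM releases you usually do not need to modify the code: pass `skip_special_tokens=False` to `SamplingParams`.**

vLLM detokenizes the generated token IDs internally and, by default, does so with `skip_special_tokens=True`, which drops special tokens from the returned text. `SamplingParams` accepts a `skip_special_tokens` argument that controls this per request. The steps below describe patching the source instead. They assume a `_postprocess` method on the `LLM` class, which may not exist in your installed version, so treat them as a fallback and first locate the actual `decode` call in your copy of the code.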

**Steps:**

1. **Locate the `_postprocess` method:**

   - Open the file that defines the `LLM` class. In current vLLM releases this is `vllm/entrypoints/llm.py`; the location may differ in older versions.

2. **Modify the decoding parameters:**

   - Find the line in the `_postprocess` method where the decoding happens. It usually looks like this:

     ```python
     output_text = self.tokenizer.decode(
         output_ids,
         skip_special_tokens=True,
         clean_up_tokenization_spaces=True)
     ```

   - Change `skip_special_tokens` to `False`:

     ```python
     output_text = self.tokenizer.decode(
         output_ids,
         skip_special_tokens=False,
         clean_up_tokenization_spaces=True)
     ```

3. **Save the changes and restart your application:**

   - After modifying the code, ensure you restart any running instances that use vLLM so the changes take effect.

**Alternative Approach:**

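If you prefer not to modify the vLLM source code directly, you can subclass the `LLM` class and re-decode the returned token IDs after generation. The sketch below overrides `generate` rather than a private method, since private method names differ between vLLM versions:

```python
from vllm import LLM

class CustomLLM(LLM):
    def generate(self, *args, **kwargs):
        # Run the stock generation, then re-decode every completion
        # with special tokens kept.
        request_outputs = super().generate(*args, **kwargs)
        tokenizer = self.get_tokenizer()
        for request_output in request_outputs:
            for completion in request_output.outputs:
                completion.text = tokenizer.decode(
                    list(completion.token_ids),
                    skip_special_tokens=False,
                    clean_up_tokenization_spaces=True)
        return request_outputs

# Use your custom LLM class
llm = CustomLLM(model="facebook/opt-125m")
```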

### 2. Should I use a vLLM engine?

**Usually not. The `LLM` class already wraps the engine (`LLMEngine`), and `SamplingParams` gives you control over decoding.**

Driving `LLMEngine` directly (through `add_request` and `step`) is mainly useful when you need to schedule requests yourself, for example in a server. For offline generation, `LLM.generate` is enough, and it also returns the generated token IDs in case you want to decode them yourself.

**Example:**

```python
from vllm import LLM, SamplingParams

llm = LLM(model="your-model")
# The tokenizer vLLM loaded for this model. Any extra special tokens must
# already be part of the saved tokenizer and model checkpoint.
tokenizer = llm.get_tokenizer()

# Prepare prompts and sampling parameters
prompts = ["Your prompt here"]
# skip_special_tokens=False keeps special tokens in the returned text
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, skip_special_tokens=False)

# Generate; each RequestOutput holds one CompletionOutput per sampled sequence
request_outputs = llm.generate(prompts, sampling_params)

for request_output in request_outputs:
    completion = request_output.outputs[0]
    print(completion.text)
    # Manual decoding from the raw token IDs, including special tokens
    print(tokenizer.decode(
        list(completion.token_ids),
        skip_special_tokens=False,
        clean_up_tokenization_spaces=True))
```

### 3. How can I avoid skipping special tokens?

**Ensure that `skip_special_tokens` is set to `False` during the decoding step.**

vLLM exposes this as the `skip_special_tokens` argument of `SamplingParams` (default `True`), so passing `SamplingParams(skip_special_tokens=False)` is normally sufficient. Patching the source or decoding the returned token IDs yourself is only needed if your version lacks the argument or you want full control over decoding.
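
To make the effect of the flag concrete, here is a toy decoder in plain Python. The vocabulary, the token IDs, and the `toy_decode` helper are all made up for illustration. This is not how Hugging Face or vLLM implement decoding, only a sketch of what the `skip_special_tokens` switch does to the output text.

```python
# Hypothetical mini-vocabulary: IDs 0 and 1 play the role of special tokens.
VOCAB = {0: "<s>", 1: "</s>", 2: "hello", 3: "world"}
SPECIAL_IDS = {0, 1}


def toy_decode(token_ids, skip_special_tokens=True):
    """Join token strings, optionally dropping the special ones."""
    kept = [
        VOCAB[token_id]
        for token_id in token_ids
        if not (skip_special_tokens and token_id in SPECIAL_IDS)
    ]
    return " ".join(kept)


ids = [0, 2, 3, 1]
print(toy_decode(ids, skip_special_tokens=True))   # special tokens dropped
print(toy_decode(ids, skip_special_tokens=False))  # special tokens kept
```

With the flag set to `False`, the start and end markers survive in the decoded string, which is what the grid parser in this notebook relies on.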

**Additional Tips:**

- **Access Token IDs Directly:**

  vLLM returns the generated token IDs alongside the text, so you can decode them manually:

  ```python
  for output in outputs:
      # Each RequestOutput holds one CompletionOutput per sampled sequence
      token_ids = list(output.outputs[0].token_ids)
      decoded_text = tokenizer.decode(
          token_ids,
          skip_special_tokens=False,
          clean_up_tokenization_spaces=True)
      print(decoded_text)
  ```

- **Update Tokenizer Configuration:**

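  Make sure the special tokens are added to the tokenizer and the embedding matrix is resized before fine-tuning, then save both so that vLLM loads a consistent checkpoint. Here `model` is a Hugging Face `transformers` model, not the vLLM `LLM` object:

  ```python
  tokenizer.add_special_tokens({'additional_special_tokens': ['<SPECIAL_TOKEN>']})
  model.resize_token_embeddings(len(tokenizer))
  ```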

**Summary:**

- Pass `skip_special_tokens=False` to `SamplingParams`; patch the vLLM source only if your version does not support it.
- Driving the engine directly is optional; `LLM.generate` already returns the token IDs, which you can decode yourself.
- Always set `skip_special_tokens=False` when decoding to include special tokens in the output.

---

By following these steps, you should be able to modify vLLM to include special tokens in your generated outputs successfully.

```python
import asyncio

from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# NOTE: the AsyncLLMEngine API has changed between vLLM releases; treat this as a
# sketch and check it against the version you have installed.

# The special tokens must already be part of the saved tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("your-model")

# Initialize the engine
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="your-model"))

# Prepare prompts and tokenize them without padding (one ID list per request)
prompts = ["Your prompt here"]
prompt_token_ids = tokenizer(prompts)["input_ids"]

# Create sampling parameters; keep special tokens in the returned text
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, skip_special_tokens=False)


async def generate_one(request_id, token_ids):
    # engine.generate is an async generator yielding partial RequestOutputs;
    # the last one holds the finished completion.
    final_output = None
    async for request_output in engine.generate(
        {"prompt_token_ids": token_ids},
        sampling_params,
        request_id=request_id,
    ):
        final_output = request_output
    return final_output


async def generate_outputs():
    return await asyncio.gather(
        *(generate_one(str(i), ids) for i, ids in enumerate(prompt_token_ids))
    )


# Run the asynchronous generation
outputs = asyncio.run(generate_outputs())

# Decode and print prompt plus completion, including special tokens
for output in outputs:
    all_token_ids = list(output.prompt_token_ids) + list(output.outputs[0].token_ids)
    print(tokenizer.decode(all_token_ids, skip_special_tokens=False))
```

1. Test Llama 3.2 3B fine-tuned on re-ARC 400x200 for a few epochs (lr=1e-4)

***

## Import

In [None]:
import os
import logging
import time

from llm_prompts.logs import get_named_logger
from llm_prompts.reader import ReaderMany
from llm_prompts.causal_lm.models import CausalLMWrapper
from llm_prompts.wrapper import EvaluationConfig


# Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# vLLM
from vllm import SamplingParams, LLM

# New imports for experiments with Lars code
from collections import defaultdict
from types import SimpleNamespace
import json
import numpy as np

import torch
from torch import Tensor
from torch.utils.data import DataLoader

from transformers import GenerationConfig
from transformers.generation import GenerateDecoderOnlyOutput

from llm_prompts.type_aliases import Attempts, Grid, JSONTask, OAIMessage
from llm_prompts.data import Dataset
from llm_prompts.utils import RepeatSampler
from llm_prompts.prompts.grid_formatter import GridFormatter
from llm_prompts.transforms import Transforms, _BackTransformTestOutput, backtransform_test_output

## Config

In [2]:
MODEL_NAME = "llama_3B"

MAX_NUM_TASKS = 1

data_config = {
    "dataset_dir": "../../kaggle/input",
    "dataset_type": "evaluation",
}

model_config = {
    "wrapper": CausalLMWrapper,
    "wrapper_kwargs": {"model_id": "models/llama/ID002_best_text_24_10_25_merged_pretrained_llama_1B_short_re_arc_400x200"},
    "evaluation_config": {
        "batch_size": 2,
    },
}

assert os.path.exists(data_config["dataset_dir"])

## Tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_config["wrapper_kwargs"]["model_id"])
print(f"{len(tokenizer)=}")

## Data

In [None]:
tasks = ReaderMany(
    dataset_dir=data_config["dataset_dir"],
    dataset_type=data_config["dataset_type"],
    read_test_output=True,
).read_tasks()

tasks = {key: value for key, value in tasks.items() if key in ("070dd51e",)}

print(f">>> {len(tasks)}")
print(f">>> {tasks.keys()}")

## Model wrapper

In [5]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [6]:
wrapper = model_config["wrapper"](**model_config["wrapper_kwargs"])

In [7]:
COLLATE_TOKENIZER = AutoTokenizer.from_pretrained(model_config["wrapper_kwargs"]["model_id"])
# collate of CausalLMWrapper
def collate_fn_eval(
    examples: list[tuple[OAIMessage, int, _BackTransformTestOutput]],
) -> dict[str, dict[str, Tensor] | list[int] | list[_BackTransformTestOutput]]:
    """The collate function."""

    conversation = [example[0][:2] for example in examples]
    encoded_conversation = COLLATE_TOKENIZER.apply_chat_template(
        conversation=conversation,
        tokenize=False,
        add_generation_prompt=True,
    )
    batch_inputs = COLLATE_TOKENIZER(
        encoded_conversation,
        padding=True,
        truncation=False,
        split_special_tokens=False,
        return_tensors="pt",
    )
    batch_indices = [example[1] for example in examples]
    backtransforms = [example[2] for example in examples]
    return {
        "batch_inputs": batch_inputs,
        "batch_indices": batch_indices,
        "backtransforms": backtransforms,
    }

# collate_fn_eval = partial(_collate_fn_eval)

In [None]:
data_config = SimpleNamespace(
    batch_size = 1,
    num_dataloader_workers = 2,
    n_transforms = 2,
)

config = SimpleNamespace(
    n_attempts = 2,
    transforms = Transforms(order="reorder", color="foreground", limit_colors=False, rigid=True),

    image_resize_factor = None,
)

print(f"{data_config.n_transforms=}")

In [9]:
grid_formatter = GridFormatter()

In [10]:
def _generate(
    model,
    tokenizer,
    batch_inputs: dict[str, Tensor],
    generation_config: GenerationConfig,
):
    output = model.generate(
        **batch_inputs,
        generation_config=generation_config,
        tokenizer=tokenizer,
    )
    return output

def _create_results(
    task_attempts: dict[str, list[Grid]],
    task_log_likelihoods: dict[str, list[float]],
    n_attempts: int | None,
) -> dict[str, Attempts]:
    """Sort attempts by log-likelihood, merge and combine test examples from same tasks"""
    results: dict[str, Attempts] = defaultdict(lambda: defaultdict(list))
    for split_task_id in sorted(task_attempts.keys()):
        if "-|-" in split_task_id:
            tokens = split_task_id.split("-|-")
            task_id = tokens[0]
            test_idx = int(tokens[1])
        else:
            task_id = split_task_id
            test_idx = 0

        attempts = task_attempts[split_task_id]
        # There can be duplicate attempts, so average the log-likelihoods of duplicates
        attempt_log_likelihoods: dict[str, list[float]] = defaultdict(list)
        for i, attempt in enumerate(attempts):
            attempt_log_likelihoods[str(attempt)].append(task_log_likelihoods[split_task_id][i])

        grids = [json.loads(attempt) for attempt in attempt_log_likelihoods.keys()]
        log_likelihoods = [np.mean(ll) for ll in attempt_log_likelihoods.values()]

        idx = np.argsort(log_likelihoods)[::-1]
        if n_attempts is not None:
            idx = idx[:n_attempts]
        results[task_id][test_idx] = [grids[i] for i in idx]
    return results

def _decode(tokenizer, output_ids: Tensor, input_size: int) -> list[str]:
    # Decode only the generated part of each sequence, keeping special tokens
    responses: list[str] = tokenizer.batch_decode(
        output_ids[:, input_size:],
        skip_special_tokens=False,
    )
    return responses

def _get_log_likelihoods(tokenizer, output: GenerateDecoderOnlyOutput, input_size: int) -> Tensor:
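    # Keep only the generated tokens (drop the prompt).
    # NOTE: with num_beams > 1 the rows of output.scores follow the beam order at each
    # step, which need not match the final sequences; model.compute_transition_scores
    # with output.beam_indices accounts for that.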
    generated_tokens = output.sequences[:, input_size:]
    # Stack logits to get shape [batch_size, sequence_length, vocab_size]
    logits = torch.stack(output.scores, dim=1)
    # Compute log probabilities
    log_probs = torch.log_softmax(logits, dim=-1)
    # Get attention mask (1s for real tokens, 0s for padding).
    # NOTE: if pad_token_id equals eos_token_id, the EOS token is masked out as well.
    attention_mask = (generated_tokens != tokenizer.pad_token_id).long()
    # Select log probabilities for the generated tokens
    log_likelihoods = log_probs.gather(2, generated_tokens.unsqueeze(-1)).squeeze(-1)
    # Apply attention mask to ignore padding in log likelihood
    masked_log_likelihoods = log_likelihoods * attention_mask
    # Compute total log likelihood (sum across all tokens in the sequence)
    total_log_likelihood: Tensor = masked_log_likelihoods.sum(dim=-1)

    print(f"{generated_tokens.shape=}")
    print(f"{logits.shape=}")
    print(f"{log_probs.shape=}")
    print(f"{attention_mask.shape=}")
    print(f"{log_likelihoods.shape=}")
    print(f"{total_log_likelihood.shape=}")

    return total_log_likelihood
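
The arithmetic inside `_get_log_likelihoods` can be checked by hand on a tiny case. The sketch below redoes the same steps (log-softmax per step, pick the generated token, skip padding, sum) in plain Python for a single sequence. The logits, the three-token vocabulary, and `PAD_ID` are invented for the example. It skips padded positions with a conditional instead of multiplying by a mask, which sidesteps `-inf * 0` turning into NaN.

```python
import math

PAD_ID = 0  # hypothetical padding token ID


def log_softmax(logits):
    """Numerically stable log-softmax over one step's logits."""
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return [x - log_norm for x in logits]


def sequence_log_likelihood(step_logits, token_ids):
    """Sum the log-probabilities of the generated tokens, skipping padding."""
    total = 0.0
    for logits, token_id in zip(step_logits, token_ids):
        if token_id == PAD_ID:
            continue  # padded position contributes nothing
        total += log_softmax(logits)[token_id]
    return total


# Two real steps followed by one padded step, over a 3-token vocabulary.
step_logits = [[0.0, 0.0, 0.0], [0.0, math.log(3.0), 0.0], [5.0, 1.0, 1.0]]
token_ids = [2, 1, PAD_ID]
print(sequence_log_likelihood(step_logits, token_ids))
```

Here the first step contributes log(1/3) and the second log(3/5), so the total is log(1/5); the padded third step is ignored.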

In [None]:
wrapper.model.eval()


task_grids: dict[str, list[Grid]] = defaultdict(list)
task_log_likelihoods: dict[str, list[float]] = defaultdict(list)
dataset = Dataset(
    tasks=tasks,
    prompt_type="prompt_solve_short",
    model_type="text-to-text",
    transforms=config.transforms,
    image_resize_factor=config.image_resize_factor,
)

dataloader = DataLoader(
    dataset,
    batch_size=data_config.batch_size,
    shuffle=False,
    sampler=RepeatSampler(data_config.n_transforms, len(dataset)),
    num_workers=data_config.num_dataloader_workers,
    collate_fn=collate_fn_eval,
    pin_memory=True,
    drop_last=False,
)

print(f"{len(dataset)=}")
print(f"{len(dataloader)=}")

total_start_time = time.time()

generation_config = GenerationConfig(
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=512,
    num_return_sequences=2,
    num_beams=2,
)

print(f">>> Start loop. {len(dataloader)=}")
for i, batch in enumerate(dataloader):
    print(f"Generating batch {i+1} of {len(dataloader)}")
    batch_indices = batch["batch_indices"]
    batch_inputs = batch["batch_inputs"]
    input_ids = batch_inputs["input_ids"]
    backtransforms = batch["backtransforms"]
    if generation_config.num_return_sequences > 1:
        batch_indices = [
            idx for idx in batch_indices for _ in range(generation_config.num_return_sequences)
        ]
        backtransforms = [
            t for t in backtransforms for _ in range(generation_config.num_return_sequences)
        ]
    output = _generate(
        model=wrapper.model,
        tokenizer=tokenizer,
        batch_inputs=batch_inputs,
        generation_config=generation_config,
    )
    responses = _decode(
        tokenizer=tokenizer,
        output_ids=output.sequences,
        input_size=input_ids.shape[1]
    )
    log_likelihoods = _get_log_likelihoods(
        tokenizer=tokenizer,
        output=output,
        input_size=input_ids.shape[1]
    )
    attempts = [
        grid_formatter.decode_grid(
            str_containing_grid=response,
            input_or_output="output",
            logger=None,
        )
        for response in responses
    ]
    for attempt, log_likelihood, idx, backtransform in zip(
        attempts, log_likelihoods, batch_indices, backtransforms, strict=True
    ):
        if attempt is None:
            continue
        task_id = dataset.keys[idx]
        task_grids[task_id].append(
            backtransform_test_output(grid=attempt, backtransform=backtransform)
        )
        task_log_likelihoods[task_id].append(log_likelihood.item())

results: dict[str, Attempts] = _create_results(
    task_attempts=task_grids,
    task_log_likelihoods=task_log_likelihoods,
    n_attempts=100,
)

total_end_time = time.time()

print(f"Total time: {total_end_time - total_start_time:.2f} seconds")
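
The ranking step in `_create_results` is easier to follow on toy data. The sketch below mirrors its logic for a single test example: identical grids are grouped, their log-likelihoods are averaged, and the groups are returned best first. The helper name `rank_attempts` and the sample numbers are illustrative only, and using `json.dumps` as the grouping key assumes that grids are plain nested lists of integers.

```python
import json
from collections import defaultdict


def rank_attempts(grids, log_likelihoods, n_attempts=None):
    """Deduplicate grids, average duplicate scores, sort by score descending."""
    scores_by_grid = defaultdict(list)
    for grid, log_likelihood in zip(grids, log_likelihoods):
        scores_by_grid[json.dumps(grid)].append(log_likelihood)

    ranked = sorted(
        scores_by_grid.items(),
        key=lambda item: sum(item[1]) / len(item[1]),
        reverse=True,
    )
    if n_attempts is not None:
        ranked = ranked[:n_attempts]
    return [json.loads(key) for key, _ in ranked]


# [[1, 2]] appears twice with mean -3.0, so [[3, 4]] at -2.5 ranks first.
grids = [[[1, 2]], [[3, 4]], [[1, 2]]]
log_likelihoods = [-4.0, -2.5, -2.0]
print(rank_attempts(grids, log_likelihoods, n_attempts=2))
```

Averaging means a grid that was generated many times with mediocre scores can rank below a grid generated once with a high score; whether that is the desired behaviour depends on how the attempts are meant to be voted on.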

In [None]:
TASK_ID_TO_TEST = "070dd51e"

exp_grid = tasks[TASK_ID_TO_TEST]["test"][0]["output"]
num_attempts = len(results[TASK_ID_TO_TEST][0])
first_grid = results[TASK_ID_TO_TEST][0][0]
second_grid = results[TASK_ID_TO_TEST][0][0]
if num_attempts > 1:
    second_grid = results[TASK_ID_TO_TEST][0][1]

print(f"Returned {num_attempts} possible grids to check")
print()

print(f"The first is solving? {first_grid == exp_grid}")
print(f"The second is solving? {second_grid == exp_grid}")
print(f"Any is solving? {any(grid == exp_grid for grid in results[TASK_ID_TO_TEST][0])}")

In [None]:
np.sum(np.array(first_grid) == np.array(exp_grid))
first_grid[1]
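
Counting matching cells only makes sense when both grids have the same shape; for ragged or differently sized grids, `np.array(a) == np.array(b)` does not give a per-cell result. A small helper that checks the shape first could look like the sketch below, written in plain Python. The name `count_matching_cells` is hypothetical and not part of `llm_prompts`.

```python
def count_matching_cells(predicted, expected):
    """Return (matching, total) cell counts, or None if the shapes differ."""
    if len(predicted) != len(expected):
        return None
    matching = 0
    total = 0
    for predicted_row, expected_row in zip(predicted, expected):
        if len(predicted_row) != len(expected_row):
            return None
        matching += sum(p == e for p, e in zip(predicted_row, expected_row))
        total += len(expected_row)
    return matching, total


print(count_matching_cells([[1, 0], [0, 1]], [[1, 0], [1, 1]]))  # same shape
print(count_matching_cells([[1, 0]], [[1, 0], [1, 1]]))          # shape mismatch
```

In this notebook it would be called as `count_matching_cells(first_grid, exp_grid)`.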

In [None]:
print(exp_grid)
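print(results[TASK_ID_TO_TEST][0][0])
print()
# A second attempt exists only if more than one distinct grid was generated
if len(results[TASK_ID_TO_TEST][0]) > 1:
    print(results[TASK_ID_TO_TEST][0][1])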
print("---------")
print()


In [7]:
logger = get_named_logger(
    name="validation_test",
    log_level=logging.INFO,
    enable_log_to_file=True,
    project_root="../../",
    output_dir="logs",
)

In [None]:
s = time.time()
results = wrapper.evaluate(
    tasks=tasks,
    logger=logger,
    config=EvaluationConfig(**model_config["evaluation_config"]),
)
e = time.time()

print(f">>> {results=}")
print(f"Time: {e - s:.2f} seconds")