---

#### $Load$ $Libraries$

---

In [None]:
# Uncomment to install the required packages:
# !pip install -U transformers accelerate bitsandbytes datasets

# Standard library
import json
import os
import re
import textwrap

# Third-party
import torch
from accelerate import infer_auto_device_map, dispatch_model
from huggingface_hub import notebook_login
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, AutoModel

# Local modules
from generate import generate
from manage_folders import save_results_to_json, dataset_folders
from retriever import DynamicRetriever
from shots_type import run_llada_experiment

---

#### $Load$ $Model$

---

##### $Model$ $Access$

To access the model:
1. Visit the model's GitHub repository: https://github.com/ML-GSAI/LLaDA?tab=readme-ov-file
2. Clone it.
3. Place this notebook inside the cloned folder and run it from there, so that imports such as `from generate import generate` resolve.

In [2]:
# Initialize the model name 
model_name = 'GSAI-ML/LLaDA-8B-Base'

##### $Bnb$ $Configuration$


The `bnb_config` object tells `bitsandbytes` how to quantize the language model while it is being loaded.

* *Compresses the Model:* `load_in_4bit=True` stores the weights in 4-bit form instead of 16-bit, using the NF4 data type (`bnb_4bit_quant_type="nf4"`).
* *Saves Memory:* The quantized weights take roughly a quarter of the 16-bit footprint, so the model can run on machines with less memory (RAM and VRAM). `bnb_4bit_use_double_quant=True` also quantizes the quantization constants, which saves a little more.
* *Maintains Performance:* The weights are dequantized on the fly and the actual math runs in 16-bit (`bnb_4bit_compute_dtype=torch.bfloat16`), which limits the accuracy lost to quantization.

Essentially, it is a set of instructions that makes a large model fit on the available hardware with only a small loss in quality.

In [3]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
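As a rough sanity check of the "about a quarter of the memory" claim, the arithmetic can be sketched as follows. The parameter count of 8 billion is an assumption taken from the model name, and the estimate covers the weights only: it ignores activations, quantization constants, and any layers that stay unquantized, so real usage will differ.

```python
def approx_weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 8 billion parameters, as suggested by the model name.
num_params = 8e9

full_16bit = approx_weight_memory_gib(num_params, 16)
nf4_4bit = approx_weight_memory_gib(num_params, 4)

print(f"16-bit weights: ~{full_16bit:.1f} GiB")
print(f" 4-bit weights: ~{nf4_4bit:.1f} GiB")
print(f"ratio: {full_16bit / nf4_4bit:.0f}x")
```
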

##### $Tokenizer$ $Set-up$


The next cell loads the pre-trained tokenizer that matches `model_name`.
`trust_remote_code=True` allows the custom code shipped with the model repository to run.

In [4]:
# Load the tokenizer. The library will handle the chat template automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

print(f"Tokenizer for {model_name} loaded successfully.")


Tokenizer for GSAI-ML/LLaDA-8B-Base loaded successfully.


In [5]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,  # Load the weights in 4-bit as defined in bnb_config
    device_map="auto"                # Let accelerate place the layers on the available devices
)

The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [6]:
# Tie weights manually (critical for LLaDA)
model.tie_weights()

In [None]:
print(chat_template := tokenizer.chat_template)

{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}
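To make the template printed above easier to read, here is a minimal sketch that reproduces its logic with plain string operations. The function name `render_prompt`, the `"<BOS>"` placeholder, and the example messages are illustrative assumptions; in practice `tokenizer.apply_chat_template` does this work and supplies the real `bos_token`.

```python
def render_prompt(messages, bos_token="<BOS>"):
    """Mimic the Jinja chat template using plain string operations."""
    parts = []
    for index, message in enumerate(messages):
        # Header with the role, a blank line, the trimmed content, then the end marker.
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
        if index == 0:
            # Only the first message is prefixed with the BOS token.
            content = bos_token + content
        parts.append(content)
    # The template always ends by opening the assistant turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Made-up messages, for illustration only.
example_messages = [
    {"role": "system", "content": "You decompose questions into sub-questions."},
    {"role": "user", "content": "  Which country hosted the first modern Olympic Games?  "},
]
rendered = render_prompt(example_messages)
print(rendered)
```

Note that the surrounding whitespace of each message is removed, which matches the `trim` filter in the template.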


##### $Parameters$ $for$ $the$ $model$ $configuration:$ 

The values shown here are the ones set in `model_config` in the next cell.

*  **`model=model`**: The LLaDA model we generate text from.


* **`prompt=input_ids`**: The tokenized prompt. Same idea as `input_ids` in Hugging Face: the numerical version of the input string, shaped like `(1, length)`.


* **`steps=64`**: The total number of refinement steps. LLaDA performs **masked iterative decoding**, not left-to-right greedy generation. The response starts fully masked; in each step the model predicts the masked tokens and keeps only some of them, until no masks remain.
   * More steps → better but slower.


* **`gen_length=256`**: The number of tokens to generate, similar to `max_new_tokens` in Hugging Face. This controls the length of the generated response, after the prompt.


* **`block_length=32`**: LLaDA fills the response block by block, from left to right, not token by token. Each block of `32` tokens is refined over several steps before the next block starts. Smaller blocks give more fine-grained, more sequential generation; larger blocks decide more tokens in parallel and can give rougher outputs.


* **`temperature=0.0`**: Controls randomness.
   * 0.0 = **fully deterministic**: the most likely token is always chosen at each position. Since the task is **analytical**, accuracy matters more than creativity.


*  **`cfg_scale=10.0`**: The classifier-free guidance strength. It biases the generation toward the prompt; higher values make the output follow the instructions more closely, and 10.0 is a strong setting.


*  **`remasking='low_confidence'`**: Tells the sampler which positions to mask again during iterative decoding.
   * "low_confidence" means: in each step, the predictions the model is least confident about are masked again and re-generated in a later step. This is key to LLaDA's power: instead of continuing generation linearly, it revisits its own weak spots.
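The `low_confidence` strategy described above can be illustrated with a toy, list-based step. This is only a sketch of the idea, not the sampler used in this notebook: the real `generate` function works on tensors of logits, and the names `low_confidence_step`, `MASK`, and the example tokens and confidences are made up for illustration.

```python
MASK = "<mask>"


def low_confidence_step(sequence, predicted, confidence, num_to_commit):
    """Commit the most confident predictions and leave the rest masked.

    sequence:      current tokens, with MASK at undecided positions
    predicted:     the model's proposed token for every position
    confidence:    probability the model assigns to each proposed token
    num_to_commit: how many masked positions to fill in this step
    """
    masked_positions = [i for i, tok in enumerate(sequence) if tok == MASK]
    # Rank the masked positions from most to least confident.
    ranked = sorted(masked_positions, key=lambda i: confidence[i], reverse=True)
    commit = set(ranked[:num_to_commit])
    # Low-confidence positions stay masked and are predicted again next step.
    return [predicted[i] if i in commit else tok for i, tok in enumerate(sequence)]


sequence = ["return", MASK, MASK, MASK, MASK]
predicted = ["return", "the", "largest", "of", "#1"]
confidence = [1.0, 0.9, 0.2, 0.7, 0.4]

after_step = low_confidence_step(sequence, predicted, confidence, num_to_commit=2)
print(after_step)  # ['return', 'the', '<mask>', 'of', '<mask>']
```

Repeating this step until no `MASK` remains gives the iterative refinement loop; in later steps the model may propose different tokens for the positions that were left masked.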

In [8]:
# Define the model specific configuration

model_config = {
    "model_id": model_name,
    "system_prompt": "You are an expert assistant that decomposes complex user questions into a numbered list of simple, sequential sub-questions. Each sub-question should be a direct, answerable query that contributes to a logical plan. Your output should ONLY be the sub-questions. Do not provide any explanations or other answers.",
    "generation_params": {
        "steps": 64,
        "gen_length": 256,
        "block_length": 32,
        "temperature": 0.0,
        "cfg_scale": 10.0,
        "remasking": 'low_confidence'
    }
}
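The three length-related parameters interact, so it can help to check them before a long run. The sketch below assumes that the sampler splits `steps` evenly across blocks, which in turn requires `gen_length` to be a multiple of `block_length` and `steps` to be a multiple of the number of blocks. That behaviour is an assumption about `generate`, whose code is not shown in this notebook, and `decoding_schedule` is a hypothetical helper.

```python
def decoding_schedule(steps, gen_length, block_length):
    """Derive how the refinement steps are spread over the blocks.

    Assumes the steps are divided evenly between blocks.
    """
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    steps_per_block = steps // num_blocks
    return {
        "num_blocks": num_blocks,
        "steps_per_block": steps_per_block,
        # Average number of tokens that must be committed in each step.
        "avg_tokens_per_step": block_length / steps_per_block,
    }


# Values taken from generation_params above.
schedule = decoding_schedule(steps=64, gen_length=256, block_length=32)
print(schedule)
# {'num_blocks': 8, 'steps_per_block': 8, 'avg_tokens_per_step': 4.0}
```

Under this assumption, the configuration above commits about four tokens per step; raising `steps` to 128 would halve that to two.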


In [9]:
# Define the name of the model for our file names
model_file_name = "LLaDA-8B-Base_results.json"

In [None]:
qdmr_base_folder = '../QDMR/llm_predictions/static'
qdmr_dataset_file = "../QDMR/QDMR_examples/qdmr_evaluation.json" # Define the path to the qdmr dataset
qdmr_fewshot_file = "../QDMR/QDMR_examples/qdmr_few_shot.json"

# Call the function to get everything we need in one go
data_assets = dataset_folders(qdmr_base_folder, qdmr_dataset_file, qdmr_fewshot_file)

# Unpack the dictionary into these variables for easy access
qdmr_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


---

#### $QDMR$ $Dataset$ $Predictions$

---

##### $Zero$ $Shot$ $Experiment$

In [None]:
# Run the zero-shot experiment
print("\nStarting Zero-Shot Experiment")

zero_shot_results = run_llada_experiment(
    model=model,
    tokenizer=tokenizer,
    data=qdmr_data,
    shot_examples=shot_examples,
    model_config=model_config,  
    num_shots=0
)
save_results_to_json(zero_shot_results, zero_shot_folder, model_file_name)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Starting Zero-Shot Experiment
Processing (0-Shot) ID: CWQ_dev_WebQTest-1011_c0be4f76a5397ba6d0d06f53905e504b


Processing (0-Shot) ID: CWQ_dev_WebQTest-1011_edc922a0faa1e47614eb7e6effe2d1a1


Processing (0-Shot) ID: CWQ_dev_WebQTest-1036_0b5333d98ef87008aa02d1fbc1554b05


Processing (0-Shot) ID: CWQ_dev_WebQTest-1036_4e73509d14bda62590480b655eee8751


Processing (0-Shot) ID: CWQ_dev_WebQTest-1081_1ecabf57357cb4abd089a4af52154854
Results saved successfully to '../QDMR/llm_predictions/static/zero_shot/Llama-3.2-3B_results.json'


##### $Few$ $Shot$ $Experiment$ $-$ $Static$ $Shots$

In [None]:
# Run the few-shot experiment with 3 shots
print("\nStarting 3-Shot Experiment")
# Run a 3-shot experiment
three_shot_results = run_llada_experiment(
    model=model,
    tokenizer=tokenizer,
    data=qdmr_data,
    shot_examples=shot_examples,
    model_config=model_config, # Use the same config
    num_shots=3,
    few_shot_type="static",  # "static" | "random" | "dynamic"
    retriever=None,          # Required if few_shot_type="dynamic"
    seed=42
)

# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Starting 3-Shot Experiment


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Results saved successfully to '../QDMR/llm_predictions/static/few_shot/3shot_Llama-3.2-3B_results.json'


##### $Few$ $Shot$ $Experiment$ $-$ $Random$ $Shots$

In [None]:
qdmr_base_folder = '../QDMR/llm_predictions/random'
qdmr_dataset_file = "../QDMR/QDMR_examples/qdmr_evaluation.json" # Define the path to the qdmr dataset
qdmr_fewshot_file = "../QDMR/QDMR_examples/qdmr_few_shot.json"

# Call the function to get everything we need in one go
data_assets = dataset_folders(qdmr_base_folder, qdmr_dataset_file, qdmr_fewshot_file)

# Unpack the dictionary into these variables for easy access
qdmr_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


In [None]:
# Run the few-shot experiment with 3 shots
print("\nStarting Random-3-Shot Experiment")
# Run a 3-shot experiment
three_shot_results = run_llada_experiment(
    model=model,
    tokenizer=tokenizer,
    data=qdmr_data,
    shot_examples=shot_examples,
    model_config=model_config, # Use the same config
    num_shots=3,
    few_shot_type="random",  # "static" | "random" | "dynamic"
    retriever=None,          # Required if few_shot_type="dynamic"
    seed=42
)

# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Starting Random-3-Shot Experiment


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Results saved successfully to '../QDMR/llm_predictions/random/few_shot/3shot_Llama-3.2-3B_results.json'


##### $Few$ $Shot$ $Experiment$ $-$ $Dynamic$ $Shots$

In [None]:
qdmr_base_folder = '../QDMR/llm_predictions/dynamic'
qdmr_dataset_file = "../QDMR/QDMR_examples/qdmr_evaluation.json" # Define the path to the qdmr dataset
qdmr_fewshot_file = "../QDMR/QDMR_examples/qdmr_few_shot.json"

# Call the function to get everything we need in one go
data_assets = dataset_folders(qdmr_base_folder, qdmr_dataset_file, qdmr_fewshot_file)

# Unpack the dictionary into these variables for easy access
qdmr_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


In [None]:
# Run the few-shot experiment with 3 shots
print("\nStarting Dynamic-3-Shot Experiment")

retriever_instance = DynamicRetriever(shot_examples)

# Run a 3-shot experiment
three_shot_results = run_llada_experiment(
    model=model,
    tokenizer=tokenizer,
    data=qdmr_data,
    shot_examples=shot_examples,
    model_config=model_config, # Use the same config
    num_shots=3,
    few_shot_type="dynamic",  # "static" | "random" | "dynamic"
    retriever=retriever_instance,          # Required if few_shot_type="dynamic"
    seed=42
)

# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")


Starting Dynamic-3-Shot Experiment
Initializing DynamicRetriever...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Retriever initialized and example embeddings are pre-computed.


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Results saved successfully to '../QDMR/llm_predictions/dynamic/few_shot/3shot_Llama-3.2-3B_results.json'
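The three values of `few_shot_type` used above differ only in how the examples are picked for each question. The following is a minimal sketch of that difference, not the code in `shots_type` or `retriever`: `select_shots` is a hypothetical function, the `"question"` key is an assumed schema, the example questions are made up, and a toy bag-of-words similarity stands in for the sentence-embedding model that `DynamicRetriever` appears to download.

```python
import math
import random
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real retriever would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_shots(question, examples, num_shots, few_shot_type="static", seed=42):
    if few_shot_type == "static":
        # Always the same first examples, regardless of the question.
        return examples[:num_shots]
    if few_shot_type == "random":
        # A seeded sample, so the run is reproducible.
        return random.Random(seed).sample(examples, num_shots)
    if few_shot_type == "dynamic":
        # The examples most similar to the current question.
        query = embed(question)
        ranked = sorted(
            examples,
            key=lambda ex: cosine(query, embed(ex["question"])),
            reverse=True,
        )
        return ranked[:num_shots]
    raise ValueError(f"Unknown few_shot_type: {few_shot_type!r}")


# Made-up examples, for illustration only.
examples = [
    {"question": "What is the capital of France?"},
    {"question": "Which river flows through Paris?"},
    {"question": "Who wrote the novel Dracula?"},
    {"question": "Who wrote the play Hamlet?"},
]
question = "Who wrote the novel Frankenstein?"

static_shots = select_shots(question, examples, 2, "static")
random_shots = select_shots(question, examples, 2, "random")
dynamic_shots = select_shots(question, examples, 2, "dynamic")
print([ex["question"] for ex in dynamic_shots])
```

With these toy inputs, the dynamic mode returns the two "Who wrote ..." examples, while the static mode returns the first two examples whatever the question is.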


---

#### $HotpotQA$ $Dataset$ $Predictions$

---

We use a helper function `dataset_folders(base_results_folder, dataset_path, few_shot_examples_path)` that streamlines the setup process for running experiments on a new dataset. It handles three key tasks:

1.  **Creates Output Directories:** It takes a base folder path and automatically creates the `zero_shot` and `few_shot` subdirectories where the model's predictions will be saved.
2.  **Loads Evaluation Data:** It reads the main dataset file (e.g., `hotpot_evaluation.json`) containing the questions to be evaluated.
3.  **Loads Few-Shot Examples:** It reads the corresponding file containing the high-quality examples for few-shot prompting.

The function returns a single, convenient dictionary containing all these assets (the loaded data and the output folder paths), which can then be easily used by the main experiment functions.
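For readers without access to `manage_folders`, the three tasks above could be implemented roughly as follows. This is a sketch under assumptions, not the project's actual helper: the name `dataset_folders_sketch` is hypothetical, both input files are assumed to be JSON lists, and only the dictionary keys are taken from how the result is unpacked in this notebook.

```python
import json
import os
import tempfile


def dataset_folders_sketch(base_results_folder, dataset_path, few_shot_examples_path):
    """Create the output folders, load both JSON files, and return everything."""
    zero_shot_folder = os.path.join(base_results_folder, "zero_shot")
    few_shot_folder = os.path.join(base_results_folder, "few_shot")
    os.makedirs(zero_shot_folder, exist_ok=True)
    os.makedirs(few_shot_folder, exist_ok=True)

    with open(dataset_path, encoding="utf-8") as f:
        data = json.load(f)
    with open(few_shot_examples_path, encoding="utf-8") as f:
        shot_examples = json.load(f)

    print(f"Loaded {len(data)} evaluation questions.")
    print(f"Loaded {len(shot_examples)} shot examples.")

    return {
        "data": data,
        "shot_examples": shot_examples,
        "zero_shot_folder": zero_shot_folder,
        "few_shot_folder": few_shot_folder,
    }


# Usage with throwaway files, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    dataset_path = os.path.join(tmp, "evaluation.json")
    few_shot_path = os.path.join(tmp, "few_shot.json")
    with open(dataset_path, "w", encoding="utf-8") as f:
        json.dump([{"id": "q1"}, {"id": "q2"}], f)
    with open(few_shot_path, "w", encoding="utf-8") as f:
        json.dump([{"id": "s1"}], f)

    assets = dataset_folders_sketch(os.path.join(tmp, "predictions"), dataset_path, few_shot_path)
    folders_exist = (
        os.path.isdir(assets["zero_shot_folder"])
        and os.path.isdir(assets["few_shot_folder"])
    )
```

Returning a single dictionary keeps the call sites short, at the cost of the unpacking lines that follow each call in this notebook.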

In [None]:
# Define the paths for the dataset we want to use
hotpot_base_folder = '../HotpotQA/llm_predictions/static/'
hotpot_dataset_file = '../HotpotQA/HotpotQA_examples/hotpot_evaluation.json'
hotpot_fewshot_file = '../HotpotQA/HotpotQA_examples/hotpot_few_shot.json'

# Call the function to get everything we need in one go
data_assets = dataset_folders(hotpot_base_folder, hotpot_dataset_file, hotpot_fewshot_file)

# Unpack the dictionary into these variables for easy access
hotpot_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


##### $Zero$ $Shot$ $Experiment$

In [None]:
# Run the zero-shot experiment
print("\nStarting Zero-Shot Experiment")

zero_shot_results = run_llada_experiment(
    generate_func=generate, # Pass the custom function
    model=model,
    tokenizer=tokenizer,
    data=hotpot_data,
    shot_examples=shot_examples,
    model_config=model_config,  
    num_shots=0
)

save_results_to_json(zero_shot_results, zero_shot_folder, model_file_name)


Starting Zero-Shot Experiment
Results saved successfully to '../HotpotQA/llm_predictions/static/zero_shot/LLaDA-8B-Base_results.json'


##### $Few$ $Shot$ $Experiment$

In [None]:
# Run the few-shot experiment with 3 shots
print("\nStarting 3-Shot Experiment")
# Run a 3-shot experiment
three_shot_results = run_llada_experiment(
    generate_func=generate, # Pass the custom function
    model=model,
    tokenizer=tokenizer,
    data=hotpot_data,
    shot_examples=shot_examples,
    model_config=model_config,  
    num_shots=3
)


# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")


Starting 3-Shot Experiment


KeyboardInterrupt: 

---

#### $StrategyQA$ $Dataset$ $Predictions$

---

In [None]:
# Create the folder paths for results and the folders if they do not exist
strategyqa_base_folder = '../StrategyQA/llm_predictions/static'
strategyqa_dataset_file = "../StrategyQA/StrategyQA_examples/strategyqa_evaluation.json" # Define the path to the strategyqa dataset
strategyqa_fewshot_file = "../StrategyQA/StrategyQA_examples/strategyqa_few_shot.json"

# Call the function to get everything we need in one go
data_assets = dataset_folders(strategyqa_base_folder, strategyqa_dataset_file, strategyqa_fewshot_file)

# Unpack the dictionary into these variables for easy access
strategyqa_data = data_assets["data"]
shot_examples = data_assets["shot_examples"]
zero_shot_folder = data_assets["zero_shot_folder"]
few_shot_folder = data_assets["few_shot_folder"]
    

Loaded 5 evaluation questions.
Loaded 5 shot examples.


##### $Zero$ $Shot$ $Experiment$

In [None]:
# Run the zero-shot experiment
print("\nStarting Zero-Shot Experiment")

zero_shot_results = run_llada_experiment(
    generate_func=generate,
    model=model,
    tokenizer=tokenizer,
    data=strategyqa_data,
    shot_examples=shot_examples,
    model_config=model_config,  
    num_shots=0
)

save_results_to_json(zero_shot_results, zero_shot_folder, model_file_name)


Starting Zero-Shot Experiment
Processing (0-Shot) ID: 7a0e419ffb6009156828
Processing (0-Shot) ID: 427fe3968e32005479b9
Processing (0-Shot) ID: 06b9ed3f803e3d5796ed
Processing (0-Shot) ID: 5090f573b09ac3050824
Processing (0-Shot) ID: d912709b7341dd86ba39
Results saved successfully to '../StrategyQA/llm_predictions/zero_shot/Llama-3.2-3B_results.json'


##### $Few$ $Shot$ $Experiment$

In [None]:
# Run the few-shot experiment with 3 shots
print("\nStarting 3-Shot Experiment")
# Run a 3-shot experiment
three_shot_results = run_llada_experiment(
    generate_func=generate,
    model=model,
    tokenizer=tokenizer,
    data=strategyqa_data,
    shot_examples=shot_examples,
    model_config=model_config, # Use the same config
    num_shots=3
)

# Save the results with a different filename
save_results_to_json(three_shot_results, few_shot_folder, f"3shot_{model_file_name}")


Starting 3-Shot Experiment
Processing (3-Shot) ID: 7a0e419ffb6009156828
Processing (3-Shot) ID: 427fe3968e32005479b9
Processing (3-Shot) ID: 06b9ed3f803e3d5796ed
Processing (3-Shot) ID: 5090f573b09ac3050824
Processing (3-Shot) ID: d912709b7341dd86ba39
Results saved successfully to '../StrategyQA/llm_predictions/few_shot/3shot_Llama-3.2-3B_results.json'
