## Demo running the Llama-3.2-3B-ARChitects-ReArc-bnb-4bit model 

### Prerequisites 

1. The `Llama-3.2-3B-ARChitects-ReArc-bnb-4bit` folder should be at the top level of this project. It is a git submodule and is checked out when you run `git submodule update --init`.
2. `Llama-3.2-3B-ARChitects-ReArc-bnb-4bit/model.safetensors` is a large file that is pulled from huggingface.co rather than github.com. This means that you need to set up either an [ssh key](https://huggingface.co/docs/hub/security-git-ssh) or a personal access token for your Hugging Face account. Once this is done, you can pull the `model.safetensors` file with `cd Llama-3.2-3B-ARChitects-ReArc-bnb-4bit` followed by `git lfs pull`. Note that Git LFS must be installed and initialised as per the *Getting Started* section in the main README.
3. All the required packages (except for the `bitsandbytes` non-CUDA backend) are managed by uv. Run `uv sync` to make sure they are installed.
4. Manually install the `bitsandbytes` non-CUDA backend with this [guide](https://huggingface.co/docs/bitsandbytes/main/en/installation?backend=Intel+CPU+%2B+GPU#multi-backend) by Hugging Face. Availability is hardware dependent. I suspect Mac hardware is not supported; in that case we can move this repository onto a hosted platform with cloud compute.
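
A quick way to confirm that step 2 worked is to check whether `model.safetensors` is still a Git LFS pointer stub. The sketch below relies on the fact that LFS pointer files are small text files starting with a fixed `version` line; the helper name `is_lfs_pointer` and the 1 KiB size threshold are illustrative choices, not part of this project.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like an LFS pointer stub rather than the real file."""
    path = Path(path)
    # Pointer files are tiny; the real weights are far larger (threshold is an assumption).
    if path.stat().st_size > 1024:
        return False
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


# Path relative to the top level of the project, as described in step 1.
weights = Path("Llama-3.2-3B-ARChitects-ReArc-bnb-4bit/model.safetensors")
if weights.exists() and is_lfs_pointer(weights):
    print("model.safetensors is still an LFS pointer; run `git lfs pull`.")
```

If the message is printed, the submodule was checked out but the weights were not downloaded yet.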

*Before starting this notebook, always make sure you have run:* `uv pip install -e "../bitsandbytes/"`. *If you have not, run it now and restart the kernel.*

##### Notes/recommendations:

If you are compiling the `bitsandbytes` package with a non-CUDA backend from source, clone the repo adjacent to this one and follow the build instructions. You can then install the package into the venv associated with this repo by running `uv pip install -e "../bitsandbytes/"` in your terminal.

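In general, you can prefix any `pip install` command with `uv` to install the package into the venv that uv manages. Note that packages installed this way are not recorded in `pyproject.toml` or `uv.lock`, so a later `uv sync` may remove them; this is why the editable install above has to be repeated.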
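
As a sanity check before running the cells below, the following sketch reports which packages cannot be imported from the current kernel. The helper name `missing_packages` is hypothetical; the package list is taken from the imports in this notebook plus `bitsandbytes`.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by this notebook, plus the manually installed bitsandbytes.
required = ["diskcache", "transformers", "tqdm", "numpy", "torch", "bitsandbytes"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
```

If `bitsandbytes` is listed, repeat the `uv pip install -e "../bitsandbytes/"` step and restart the kernel.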

In [None]:
from diskcache import Cache
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import tqdm


import numpy as np
import sys
import torch

# Local methods
from mol_arc_agi.io_helpers import load_all_json_files_concurrently


In [None]:
# Input paths
# Path to model and tokenizer repository (submodule)   
model_directory = Path('../../Llama-3.2-3B-ARChitects-ReArc-bnb-4bit')
# Path to training data (original ARC data)
training_data_directory = Path('../../ARC-AGI/data/training/')

# Output paths
base_output_directory = Path('output/')
inference_cache = base_output_directory.joinpath('inference_cache/')
submission = base_output_directory.joinpath('submission/')

# Check paths are as expected
print(model_directory.resolve()) 
print(training_data_directory.resolve())  
print(base_output_directory.resolve())
print(inference_cache.resolve())
print(submission.resolve())
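
Printing the resolved paths shows where they point but not whether they exist. A minimal sketch of a stricter check follows, assuming we want to report missing input directories and create the output directories up front; `prepare_paths` is a hypothetical helper, not part of `mol_arc_agi`.

```python
from pathlib import Path


def prepare_paths(input_paths, output_paths):
    """Create the output directories and return the input paths that do not exist."""
    missing = [Path(p) for p in input_paths if not Path(p).exists()]
    for p in output_paths:
        # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
        Path(p).mkdir(parents=True, exist_ok=True)
    return missing


# Example call with the variables defined in the cell above:
# missing = prepare_paths([model_directory, training_data_directory],
#                         [inference_cache, submission])
# assert not missing, f"Missing input paths: {missing}"
```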

### Load the Data 

In [None]:
data = load_all_json_files_concurrently(training_data_directory)
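
The implementation of `load_all_json_files_concurrently` is not shown in this notebook. For readers who want to see the general idea, here is a rough sketch of loading a directory of JSON files with a thread pool. The function name `load_json_directory`, the `max_workers` parameter and the returned dictionary layout are assumptions and may differ from the real helper.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def _read_json(path):
    """Parse a single JSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def load_json_directory(directory, max_workers=8):
    """Load every *.json file in `directory` and key the contents by file stem."""
    paths = sorted(Path(directory).glob("*.json"))
    # Threads suit this job because the work is dominated by file I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        contents = list(pool.map(_read_json, paths))
    return {path.stem: content for path, content in zip(paths, contents)}
```

Each ARC task file holds a `train` list and a `test` list of input/output grid pairs, so a loader of this shape would return one such dictionary per task id.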

### Load the model

In [None]:
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_directory)
model = AutoModelForCausalLM.from_pretrained(model_directory)

# Check if the tokenizer and model are loaded correctly
assert tokenizer is not None, "Tokenizer not loaded correctly."
assert model is not None, "Model not loaded correctly."

In [None]:
# set prompt formatting options
prompt_fmt_opts = dict(
    preprompt='ABCDEFGHJKLMNPQRSTUVWXYZabcdefghjklmnpqrstuvwxyz',
    query_begin='I',
    reply_begin='\n+/-=O',
    reply_end='\n' + tokenizer.eos_token,
    lines_sep='\n',
    max_tokens=128000,
)

# set inference time data augmentation options
inference_data_augmentation_opts = dict(
    tp='all',
    rt='all',
    perm=True,  # Permute the order of the tasks
    shfl_ex=True,
    seed=10000,
)
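
To make the formatting options more concrete, the sketch below shows one way they could be combined into a prompt string. This layout is an assumption for illustration only: the functions `format_grid` and `format_task` are hypothetical, the real formatter used with this model may differ, and `<eos>` stands in for `tokenizer.eos_token`.

```python
def format_grid(grid, lines_sep="\n"):
    """Render a grid of integers as one string of digits per row."""
    return lines_sep.join("".join(str(cell) for cell in row) for row in grid)


def format_task(examples, query_grid, opts):
    """Hypothetical layout: preprompt, solved examples, then an open query."""
    parts = [opts["preprompt"]]
    for example in examples:
        parts.append(opts["query_begin"] + format_grid(example["input"], opts["lines_sep"]))
        parts.append(opts["reply_begin"] + format_grid(example["output"], opts["lines_sep"])
                     + opts["reply_end"])
    # The prompt ends with reply_begin so the model continues with the answer grid.
    parts.append(opts["query_begin"] + format_grid(query_grid, opts["lines_sep"])
                 + opts["reply_begin"])
    return "".join(parts)


# Shortened options; "<eos>" is a placeholder for the tokenizer's end-of-sequence token.
demo_opts = dict(preprompt="ABC", query_begin="I", reply_begin="\n+/-=O",
                 reply_end="\n<eos>", lines_sep="\n")
examples = [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}]
prompt = format_task(examples, [[0, 0], [1, 1]], demo_opts)
print(repr(prompt))
```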


In [None]:
# Setup the inference cache
model_cache = Cache(inference_cache).memoize(typed=True, ignore=set(['model_tok', 'guess']))


In [None]:
# start refactor of the inference tools

def dfs(model,                     # formerly the explore function
        current_generated_tokens,  # originally named path
        gen_start_position,        # start
        max_gen_tokens,            # goal
        min_probability,           # early stopping condition
        cache,
        current_score=0.0):
    # Not implemented yet. Reference for generation settings:
    # https://huggingface.co/docs/transformers/v4.48.0/en/main_classes/text_generation#transformers.GenerationConfig
    raise NotImplementedError("dfs is still being refactored.")

def init_dfs(model, 
        input_text_token_ids,
        eos_token_id,
        max_gen_tokens, #TODO: consider moving this
        min_probability,
        attention_mask=None):
    
    assert not torch.is_grad_enabled(), "Gradient computation should be disabled."
    assert attention_mask is None or attention_mask.all(), "Attention mask not fully implemented."
    
    # Set recursion limit to avoid stack overflows
    sys.setrecursionlimit(1000 + max_gen_tokens)

    # prepare inputs (named token_ids to avoid shadowing the built-in `input`)
    token_ids = torch.as_tensor(input_text_token_ids, device=model.device, dtype=torch.long)
    if token_ids.ndim == 2:
        token_ids = token_ids.squeeze(0)
    assert token_ids.ndim == 1, 'Batching not supported.'

    # Remove end of string token if it is in the input
    if token_ids[-1] == eos_token_id:
        token_ids = token_ids[:-1]

    gen_start_position = len(token_ids)

    # run DFS
    # NOTE: `cache=None` is a placeholder assumption; this function does not build a cache yet.
    result = dfs(model,
                 token_ids,
                 gen_start_position,
                 max_gen_tokens,
                 min_probability,
                 cache=None)

    # return results sorted by scores
    return sorted([(np.array(suffix[::-1]), score_val) for suffix, score_val in result], key=lambda x: x[1])
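
Since the search itself is still a stub in this notebook, the following toy example illustrates the general idea behind a probability-bounded depth-first search over token sequences. It is not the notebook's `dfs`: the model is replaced by a hypothetical callable `next_token_probs` that returns a token-to-probability dictionary, there is no key/value cache, and sequences that do not reach the end-of-sequence token within `max_gen_tokens` are simply dropped.

```python
import math


def toy_dfs(next_token_probs, prefix, eos_token_id, max_gen_tokens, min_probability):
    """Enumerate completions whose total probability is at least `min_probability`.

    Returns (tokens, score) pairs sorted by score = -log(probability), best first.
    """
    max_score = -math.log(min_probability)
    results = []

    def explore(path, score):
        if len(path) - len(prefix) >= max_gen_tokens:
            return  # generation budget exhausted without reaching EOS
        for token, prob in next_token_probs(path).items():
            if prob <= 0.0:
                continue
            new_score = score - math.log(prob)
            if new_score > max_score:
                continue  # prune: every extension of this branch is even less likely
            if token == eos_token_id:
                results.append((path[len(prefix):], new_score))
            else:
                explore(path + [token], new_score)

    explore(list(prefix), 0.0)
    return sorted(results, key=lambda item: item[1])


def toy_model(path):
    """Hypothetical stand-in for a language model: a fixed distribution, token 0 is EOS."""
    return {0: 0.5, 1: 0.4, 2: 0.1}


completions = toy_dfs(toy_model, prefix=[7], eos_token_id=0,
                      max_gen_tokens=3, min_probability=0.15)
for tokens, score in completions:
    print(tokens, round(math.exp(-score), 3))
```

With a real model, a search like this would run under `torch.no_grad()`, which is what the assertion at the top of `init_dfs` checks for.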
