# Introduction

***WARNING: THIS EXAMPLE IS INCOMPLETE.*** See `models.py` for how to add new models and `pipeline.py` for how to get rationales, skills, and skill slices.

In this notebook, we show how to evaluate a new model on our suite of benchmarks, as a first step toward skill-slice analyses.

We will first show how to download the datasets in our initial 12-dataset corpus. 

Then, we will add a model and evaluate it, via the following steps:
1. Implementing the model under our general model class.
2. Feeding this model to our standard inference code.
3. Retrieving our pre-computed skill slices, and using existing code to compute accuracy per slice.
4. Running some example analyses with skill slices, e.g. comparing to frontier models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet.
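
To make step 3 concrete, here is a minimal sketch of what "accuracy per slice" means. It is not the repo's implementation: the function `accuracy_per_slice`, the instance ids, and the skill names are all hypothetical, and we assume a skill slice is simply the set of instances that involve a given skill.

```python
def accuracy_per_slice(is_correct, skill_slices):
    """Average correctness over the instances belonging to each skill slice.

    is_correct: dict mapping instance id -> bool (hypothetical format)
    skill_slices: dict mapping skill name -> list of instance ids (hypothetical format)
    """
    accs = {}
    for skill, ids in skill_slices.items():
        # Only score instances the model was actually evaluated on
        scored = [is_correct[i] for i in ids if i in is_correct]
        if scored:
            accs[skill] = sum(scored) / len(scored)
    return accs


# Toy data for illustration only
is_correct = {"q1": True, "q2": False, "q3": True, "q4": True}
skill_slices = {"counting": ["q1", "q2"], "chart reading": ["q2", "q3", "q4"]}
slice_accs = accuracy_per_slice(is_correct, skill_slices)
print(slice_accs)
```

Note that one instance can belong to several slices (`q2` above), so slice accuracies are not a partition of overall accuracy.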

## Downloading datasets

Our original 12-dataset corpus consists of benchmarks available on Hugging Face, which makes downloading easy. Our dataset classes also download lazily: the first time an instance is accessed, it is downloaded and saved to the path defined by `_DATA_ROOT` in `constants.py`. Iterating over a dataset once is therefore enough to download all of it.
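
For intuition, lazy downloading with an on-disk cache can be sketched as follows. This is an illustrative pattern under our own assumptions, not the code in `dsets`: the class `LazyCachedDataset`, the `fetch_fn` argument, and the JSON file layout are hypothetical, and `fake_fetch` stands in for a real remote download.

```python
import json
import tempfile
from pathlib import Path


class LazyCachedDataset:
    """Hypothetical sketch: fetch each instance on first access, then reuse the saved copy."""

    def __init__(self, cache_dir, fetch_fn, size):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch_fn = fetch_fn  # stand-in for a remote download
        self.size = size
        self.fetch_count = 0  # number of (simulated) downloads performed

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        path = self.cache_dir / f"{idx}.json"
        if path.exists():
            # Already cached: read from disk instead of downloading again
            return json.loads(path.read_text())
        instance = self.fetch_fn(idx)
        self.fetch_count += 1
        path.write_text(json.dumps(instance))
        return instance


def fake_fetch(idx):
    # Simulated remote instance
    return {"prompt": f"Question {idx}?", "answer": str(idx)}


lazy_dset = LazyCachedDataset(tempfile.mkdtemp(), fake_fetch, size=3)
first_pass = [x for x in lazy_dset]   # triggers one fetch per instance
second_pass = [x for x in lazy_dset]  # served entirely from the cache
```

After the first pass, `fetch_count` stops growing, which is the behavior the download loop in this section relies on.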



In [None]:
from dsets import _DSET_DICT
from tqdm import tqdm 

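for dsetname in _DSET_DICT:
    dset = _DSET_DICT[dsetname]()
    # Accessing every instance once triggers the lazy download
    for x in tqdm(dset):
        pass
    # Print the last instance as a quick sanity check
    print(x)
    print(f'Downloaded and saved all instances from {dsetname}.')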

## Implementing a new model under appropriate model class

Each model must implement an `answer_question(question, system_message, image)` method:
* Inputs:
    * `question`: string, the question / prompt to the model for the specific evaluation instance
    * `system_message`: string, the system message specifying the output format. Typically set to `_SYS_MSGS['standard_prompt']` (see constants.py).
    * `image`: PIL Image, the image relevant to a given eval instance, or None in the case of a language only query.
* Output: string response to the given question

It must also have a field called `modelname`. This name is used for the directory where the model's outputs are cached.
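
As a minimal sketch of this interface, the class below satisfies both requirements without loading any weights. `ConstantAnswerModel` is a hypothetical stand-in that is not part of the repo; something like it can be handy for checking that the surrounding plumbing runs before plugging in a real model.

```python
class ConstantAnswerModel:
    """Hypothetical stand-in model: always returns the same answer."""

    def __init__(self, answer="A"):
        self.answer = answer
        # Required field: names the directory where outputs are cached
        self.modelname = "constant-answer-baseline"

    def answer_question(self, question, system_message, image):
        # `image` may be None for language-only queries; this stub ignores all inputs
        return self.answer


stub = ConstantAnswerModel()
response = stub.answer_question("What is 2 + 2?", "Answer with a single letter.", None)
print(response)
```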

We'll now show how to go from a pretrained model hosted on Hugging Face to something we can feed to the existing code in our repo. We begin with the [example inference code provided on Hugging Face](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct). Then, we move the model and processor loading into the constructor of our model object, and place inference under `answer_question`.

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

class QwenVL:
    def __init__(self):
        # default: Load the model on the available device(s)
        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
            "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
        )

        self.processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
        self.modelname = 'Qwen2-VL-7B-Instruct'
    
    def answer_question(self, question, system_message, image):
        ### The only change needed to the example inference code: build `messages` from our inputs
        if image is not None:
            messages = [{"role":"system", "content": [{"type":"text", "text": system_message}]},
                    {"role": "user", "content": [{"type":"image", "image": image}, {"type":"text", "text":question}]}]
            image_inputs, video_inputs = process_vision_info(messages)
        else:
            messages = [{"role":"system", "content": [{"type":"text", "text": system_message}]},
                    {"role": "user", "content": [{"type":"text", "text":question}]}] 
            image_inputs, video_inputs = None, None

        ### Standard code from huggingface example inference docs: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        )
        # Move inputs to the model's device rather than hard-coding "cuda"
        inputs = inputs.to(self.model.device)

        # Inference: Generation of the output
        generated_ids = self.model.generate(**inputs, max_new_tokens=128)
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        return output_text[0]

Let's test this code by loading the model and evaluating it on a test sample.

In [None]:
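from constants import _SYS_MSGS

model = QwenVL()
dset = _DSET_DICT['mmc']()
q = dset[0]
print("Test Question: ", q['prompt'])
q['image'].show()

# Standard system message specifying the expected output format (see constants.py)
system_message = _SYS_MSGS['standard_prompt']
ans = model.answer_question(q['prompt'], system_message, q['image'])
print("Model Answer: ", ans)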