# Introduction
In this tutorial we will walk through how to send prompts to, and receive generated text back from, our base gemma-2b model and its LoRA fine-tuned counterpart.  We will also demonstrate how to run evaluation benchmarks to compare the performance of our model with that of others.

# 1. Prompting the model
The "generate" recipe in torchtune will let you send text prompts to a model.  In the sections below we will send the same prompt to the base model and the fine-tuned one to get an idea of what our fine-tuning accomplished.

<h2><center>1.1 Prompting the base model</center></h2>


In [None]:
!tune run generate --config ./config_files/custom_generate_gemma2-2b.yaml

Running InferenceRecipe with resolved config:

chat_format: null
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/models/gemma-2b/
  checkpoint_files:
  - model-00001-of-00002.safetensors
  - model-00002-of-00002.safetensors
  model_type: GEMMA
  output_dir: ./models/
  recipe_checkpoint: null
device: cuda
dtype: bf16
enable_kv_cache: true
max_new_tokens: 300
model:
  _component_: torchtune.models.gemma.gemma_2b
prompt:
  system: ''
  user: 'What are the three primary colors?


    '
quantizer: null
seed: 2
temperature: 0.95
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/models/gemma-2b/tokenizer.model
  prompt_template:
    system:
    - Below is an instruction that describes a task. Write a response that appropriately
      completes the request.
    - '


      '
    user:
    - '### Instruction:

      '
    - '### Response:

      '
top_k: 300

Setting manual seed to local seed 2. Local seed is seed + ran

Well... it's a little unhinged, and the response to our instruction was neither direct nor succinct. If you want to change the prompt being sent to the model, you can add the command line override

`prompt.user="Your new prompt" `

to the command above. You can get different responses to the same prompt by changing the seed value, and higher temperature values will make the model output more random.  Now let's see how the fine-tuned model performs with the same prompt.
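
To make the roles of `seed`, `temperature` and `top_k` more concrete, here is a minimal sketch of temperature-scaled top-k sampling in plain Python. It is an illustration of the general technique, not torchtune's implementation: the function name `sample_next_token` and the toy logits are made up for this example, and only the default values (0.95 and 300) are taken from the config above.

```python
import math
import random


def sample_next_token(logits, temperature=0.95, top_k=300, rng=None):
    """Sample one token index from raw logits using temperature and top-k."""
    if rng is None:
        rng = random.Random()
    # Keep only the indices of the top_k highest-scoring tokens.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by the temperature flattens (T > 1) or sharpens (T < 1) the distribution.
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary

# The same seed reproduces the same draw, so changing the seed changes the output.
first = sample_next_token(toy_logits, rng=random.Random(2))
again = sample_next_token(toy_logits, rng=random.Random(2))
print(first == again)  # True

# With top_k=1 only the highest-scoring token can be chosen.
print(sample_next_token(toy_logits, top_k=1, rng=random.Random(7)))  # 0
```

A real model repeats this step once per generated token, up to `max_new_tokens`, over a vocabulary of many thousands of tokens.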

<h2><center>1.2 Prompting the fine-tuned model</center></h2>
The command below will send the same prompt to the fine-tuned model. We can reuse the configuration file that we used for the base model, and just override the checkpoint directory and file names.

In [49]:
!tune run generate --config ./config_files/custom_generate_gemma2-2b.yaml \
checkpointer.checkpoint_dir=/tmp/models/gemma-2b-lora-finetuned-alpaca/ \
checkpointer.checkpoint_files=[hf_model_0001_0.pt,hf_model_0002_0.pt]

2024-10-22:09:35:19,138 INFO     [_logging.py:101] Running InferenceRecipe with resolved config:

chat_format: null
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /work2/10156/gj3385/frontera/ml_institute/models/gemma-2b-lora-finetuned-alpaca/
  checkpoint_files:
  - hf_model_0001_0.pt
  - hf_model_0002_0.pt
  model_type: GEMMA
  output_dir: ./models/
  recipe_checkpoint: null
device: cuda
dtype: bf16
enable_kv_cache: true
max_new_tokens: 300
model:
  _component_: torchtune.models.gemma.gemma_2b
prompt:
  system: ''
  user: 'What are the three primary colors?


    '
quantizer: null
seed: 2
temperature: 0.95
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /work2/10156/gj3385/frontera/ml_institute/models/gemma-2b/tokenizer.model
  prompt_template:
    system:
    - Below is an instruction that describes a task. Write a response that appropriately
      completes the request.
    - '


      '
    user:
    - '### Instru

The fine-tuned model is *almost* correct: it gave us at least 2/3 correct colors and appears to be more succinct in its responses. The brevity of the fine-tuned model's response is perhaps not surprising, given that the *alpaca* training dataset consists of relatively short question-and-answer pairs, so our LoRA fine-tuning has likely conditioned the model towards shorter responses.
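
The `prompt_template` entry in the resolved config printed in section 1.1 wraps our prompt in the same Alpaca-style instruction format the model was fine-tuned on. The sketch below shows roughly what text the model ends up seeing. The tag strings are copied from that config as I read the YAML; the helper `render_prompt` and the assumption that each message is simply wrapped in its (prefix, suffix) pair are mine, not torchtune's actual code.

```python
# Tags copied from the `prompt_template` entry in the resolved config (section 1.1).
ALPACA_STYLE_TEMPLATE = {
    "system": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.",
        "\n\n",
    ),
    "user": ("### Instruction:\n", "### Response:\n"),
}


def render_prompt(system_text, user_text, template=ALPACA_STYLE_TEMPLATE):
    """Hypothetical helper: wrap each message in its (prefix, suffix) pair."""
    parts = []
    for role, text in (("system", system_text), ("user", user_text)):
        prefix, suffix = template[role]
        parts.append(prefix + text + suffix)
    return "".join(parts)


# Same system and user strings as in the config: an empty system message
# and the user question followed by two newlines.
print(render_prompt("", "What are the three primary colors?\n\n"))
```

Because the rendered prompt ends with `### Response:`, a model trained on examples in this format has a clear cue to answer the instruction directly, whereas the base model has no particular reason to treat it as anything other than text to continue.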


# 2. Benchmarking the model output
Manually coming up with prompts for the model is fun, but benchmarks will provide us with a more quantitative way to gauge any improvement in our model and compare its capabilities to other models.  Torchtune integrates with <a href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI’s evaluation harness</a>, which provides access to many different generation benchmarks.  As an example, we can use the <a href="https://github.com/sylinrl/TruthfulQA">TruthfulQA</a> benchmark, which evaluates the model's ability to generate factually correct answers to a question. There are several different testing methods offered in this benchmark. In our case we will be running MC2, in which each question comes with several reference answers, some true and some false, and the model is scored on how much of its probability it assigns to the true ones. Note that the benchmark takes ~20 minutes to complete.
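
As a rough illustration of what an MC2 score measures for a single question, the sketch below converts the model's log-likelihood for each reference answer into a probability and reports the share that falls on the true answers. The function name `mc2_score` and the log-likelihood values are invented for this example; the evaluation harness computes the real numbers from the model and averages the per-question scores over the whole benchmark.

```python
import math


def mc2_score(true_logprobs, false_logprobs):
    """Normalized probability mass placed on the true reference answers."""
    p_true = [math.exp(lp) for lp in true_logprobs]
    p_false = [math.exp(lp) for lp in false_logprobs]
    return sum(p_true) / (sum(p_true) + sum(p_false))


# Made-up log-likelihoods for one question with two true and two false answers.
# The model rates both groups equally, so half the mass lands on the true set.
score = mc2_score(true_logprobs=[-1.0, -2.0], false_logprobs=[-1.0, -2.0])
print(score)  # 0.5
```

A score near 1 means the model strongly prefers the true answers, while a score near 0 means it prefers the false ones.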

In [50]:
!tune run eleuther_eval --config ./config_files/custom_eval_gemma2-2b.yaml

2024-10-22:09:35:50,279 INFO     [_logging.py:101] Running EleutherEvalRecipe with resolved config:

batch_size: 8
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /work2/10156/gj3385/frontera/ml_institute/models/gemma-2b/
  checkpoint_files:
  - model-00001-of-00002.safetensors
  - model-00002-of-00002.safetensors
  model_type: GEMMA
  output_dir: ./models/
device: cuda
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
  _component_: torchtune.models.gemma.gemma_2b
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  max_seq_len: null
  path: /work2/10156/gj3385/frontera/ml_institute/models/gemma-2b/tokenizer.model

2024-10-22:09:35:50,397 DEBUG    [seed.py:60] Setting manual seed to local seed 1234. Local seed is seed + rank = 1234 + 0
2024-10-22:09:35:53,658 INFO     [eleuther_eval.py:240] Model is initialized with precision torch.bfloat16.
2024-10-22:09:

In my test, the base model got a score of 0.3993 ± 0.0152. We can run the same evaluation on the fine-tuned model using the command below. Again we'll reuse the configuration file and just point to our fine-tuned model checkpoint directory and files.

In [1]:
!tune run eleuther_eval --config ./config_files/custom_eval_gemma2-2b.yaml \
checkpointer.checkpoint_dir=/tmp/models/gemma-2b-lora-finetuned-alpaca/ \
checkpointer.checkpoint_files=[hf_model_0001_0.pt,hf_model_0002_0.pt]

2024-10-22:10:21:10,199 INFO     [_logging.py:101] Running EleutherEvalRecipe with resolved config:

batch_size: 8
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /work2/10156/gj3385/frontera/ml_institute/models/gemma-2b-lora-finetuned-alpaca/
  checkpoint_files:
  - hf_model_0001_0.pt
  - hf_model_0002_0.pt
  model_type: GEMMA
  output_dir: ./models/
device: cuda
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
  _component_: torchtune.models.gemma.gemma_2b
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  max_seq_len: null
  path: /work2/10156/gj3385/frontera/ml_institute/models/gemma-2b/tokenizer.model

2024-10-22:10:21:10,352 DEBUG    [seed.py:60] Setting manual seed to local seed 1234. Local seed is seed + rank = 1234 + 0
2024-10-22:10:21:22,643 INFO     [eleuther_eval.py:240] Model is initialized with precision torch.bfloat16.
2024-10-22:10:21:22,

The fine-tuned model gets a score of 0.3953 ± 0.0152, which is not statistically different from the base model's score. There are several possible explanations for this.  First, the fine-tuned model was only trained on the dataset for 1 epoch, so training for longer *might* further influence its score.  The second and more likely explanation is given in the paper published along with the benchmark (https://arxiv.org/abs/2109.07958):


"Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web."

The alpaca dataset we trained our model on was generated by OpenAI's text-davinci-003 model (see https://huggingface.co/datasets/tatsu-lab/alpaca), which means the dataset itself is likely to contain false and incorrect answers.
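
As a rough check on the claim that the two scores are not statistically different, we can compare the difference between them to its standard error. This is only a back-of-the-envelope sketch: it treats the reported ± values as standard errors and assumes the two estimates are independent, which is not strictly true because both models were evaluated on the same questions.

```python
import math

# Scores and reported uncertainties from the two evaluation runs above.
base_score, base_stderr = 0.3993, 0.0152
tuned_score, tuned_stderr = 0.3953, 0.0152

diff = base_score - tuned_score
# Standard error of the difference, assuming the two estimates are independent.
diff_stderr = math.sqrt(base_stderr**2 + tuned_stderr**2)
z = diff / diff_stderr

print(f"difference = {diff:.4f}, z = {z:.2f}")
print("within ~2 standard errors" if abs(z) < 1.96 else "outside ~2 standard errors")
```

The difference works out to roughly 0.19 standard errors, far below the usual threshold of about 2, so under these assumptions the benchmark gives no evidence that fine-tuning changed the model's truthfulness either way.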