# Uncertainty LoRA with Granite Uncertainty 3.0 8b
*Using IBM Granite Models*

## In this notebook

This notebook shows how to use a LoRA adapter for IBM's Granite model. The adapter is designed to provide a calibrated certainty score for the model's answers, while retaining the full abilities of the original granite-3.0-8b-instruct model.

There are a few use cases where having the certainty score would be helpful:
- Human usage: Certainty scores give human users an indication of when to trust answers from the model (which should be augmented by their own knowledge).
- Model routing/guards: If the model has low certainty (below a chosen threshold), it may be worth sending the request to a larger, more capable model or simply choosing not to show the response to the user.
- RAG: Granite Uncertainty 3.0 8b is calibrated on document-based question answering datasets, so it can provide certainty scores for answers generated with RAG. The certainty is a prediction of overall correctness based on both the supplied documents and the model's own knowledge (e.g. if the model is correct but the answer is not in the documents, the certainty can still be high).
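
The model routing/guard use case above can be sketched in a few lines. This is only an illustration: the function name `route_by_certainty`, the action labels, and the 65% threshold are assumptions made for this example, not part of the adapter or any Granite library. The certainty is expected as a percentage, like the one computed at the end of this notebook.

```python
def route_by_certainty(answer: str, certainty_pct: int, threshold_pct: int = 65) -> dict:
    """Decide what to do with an answer based on its certainty score.

    The threshold is a hypothetical default; choose one that fits your use case.
    """
    if certainty_pct >= threshold_pct:
        # Confident enough: show the answer to the user.
        return {"action": "show", "answer": answer}
    # Low certainty: withhold the answer and escalate, e.g. to a larger model.
    return {"action": "escalate", "answer": None}


print(route_by_certainty("The capital of France is Paris.", 85))
print(route_by_certainty("It might be Lyon.", 35))
```

A higher threshold trades coverage for reliability: fewer answers are shown, but those that are shown carry higher certainty.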

In this notebook, we will walk through how to set up and test the LoRA adapter's capabilities so that you can apply it to your own use case.

## Setting up your environment

First, ensure you are running Python 3.10 or 3.11 in a freshly created virtual environment.

In [None]:
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 12), "Use Python 3.10 or 3.11 to run this notebook."

## Install dependencies

Next, we'll install some dependencies. Granite utils comes with some helpful functions.

In [None]:
! pip install "git+https://github.com/ibm-granite-community/utils.git" \
    transformers \
    peft

## Set up your model and adapter

Next, we will create a model object for the Granite Uncertainty model. Loading it can take quite a bit of memory (>16 GB). First, let's set up the device and tokenizer.

In [None]:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Prefer a CUDA GPU, then Apple Silicon (MPS), and fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')

# The tokenizer is loaded from the base instruct model.
tokenizer = AutoTokenizer.from_pretrained('ibm-granite/granite-3.0-8b-instruct', padding_side='left', trust_remote_code=True)

From here, we will load the base model, apply the LoRA adapter on top of it, and move the combined model object to the device.

In [None]:
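# Load the base model, then wrap it with the uncertainty LoRA adapter.
model_base = AutoModelForCausalLM.from_pretrained('ibm-granite/granite-3.0-8b-instruct')
model_lora = PeftModel.from_pretrained(model_base, 'ibm-granite/granite-uncertainty-3.0-8b-lora')
# Move the combined model to the selected device.
model = model_lora.to(device)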
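
To put the memory note above into numbers, here is a rough back-of-the-envelope estimate of the space taken by the model weights alone. It assumes roughly 8 billion parameters (inferred from the "8b" in the model name, not an exact count) and ignores activations, the LoRA weights, and framework overhead, so treat the result as a lower bound. The helper `estimate_weight_memory_gb` is hypothetical and written for this example.

```python
# Bytes needed to store one parameter in common floating-point formats.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def estimate_weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


approx_params = 8e9  # assumption: "8b" means roughly 8 billion parameters
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{estimate_weight_memory_gb(approx_params, dtype):.0f} GB")
```

The precision actually used depends on how the model is loaded, so the real footprint may differ from these figures.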

## Create your prompt

Next, let's set up the system prompt and the user prompt. The Granite model was calibrated using a specific system prompt, stored as "system_prompt" below. We combine it with the user's "question" to form the "question_chat" prompt.

In [None]:
system_prompt = "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior."
question = input("Please enter your question: ")
print("Question: " + question)
question_chat = [
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": question
    },
]


## Generating the answer

In [None]:
# Render the chat as text, ending with the assistant role marker so the model starts its answer.
input_text = tokenizer.apply_chat_template(question_chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
output = model.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=80)
output_text = tokenizer.decode(output[0])
# The decoded text contains the whole prompt; keep only what follows the assistant role marker.
answer = output_text.split("assistant<|end_of_role|>")[1]
print("Answer: " + answer)
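
The extraction step above relies on plain string handling, so it can be illustrated without loading the model. The sketch below runs on a hand-written mock of decoded output. It assumes that `<|end_of_text|>` is the end-of-sequence marker and that the mock string resembles the real template output; both are assumptions for illustration, and `extract_answer` is a hypothetical helper. Unlike the cell above, it also removes the end marker, which is convenient when displaying the answer.

```python
ASSISTANT_MARKER = "assistant<|end_of_role|>"
END_OF_TEXT = "<|end_of_text|>"  # assumption: end-of-sequence marker in the decoded text


def extract_answer(decoded: str) -> str:
    """Return the text after the last assistant role marker, without the end marker."""
    # rpartition returns the whole string as the tail if the marker is absent.
    _, _, tail = decoded.rpartition(ASSISTANT_MARKER)
    return tail.replace(END_OF_TEXT, "").strip()


# Hand-written stand-in for tokenizer.decode(output[0]); not real model output.
mock_decoded = (
    "<|start_of_role|>user<|end_of_role|>What is the capital of France?<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>The capital of France is Paris.<|end_of_text|>"
)
print(extract_answer(mock_decoded))
```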

## Generating certainty score
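Once we have generated the answer, we pass the full conversation back through the LoRA adapter to get a certainty score. The adapter responds with a single digit, which the code below maps to a percentage (5%, 15%, ..., 95%).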

In [None]:
uq_generation_prompt = "<|start_of_role|>certainty<|end_of_role|>"
uq_chat = [
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": question
    },
    {
        "role": "assistant",
        "content": answer
    },
]

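# Append the certainty role marker in place of the usual assistant generation prompt.
uq_text = tokenizer.apply_chat_template(uq_chat, tokenize=False) + uq_generation_prompt
inputs = tokenizer(uq_text, return_tensors="pt")
# One new token is enough: the adapter answers with a single digit.
output = model.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=1)
output_text = tokenizer.decode(output[0])
# Map the digit 0-9 to a percentage: 5%, 15%, ..., 95%.
uq_score = int(output_text[-1])
print("Certainty: " + str(5 + uq_score * 10) + "%")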
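
The last two lines above assume that the final character of the decoded text is a digit, and `int(...)` raises a `ValueError` otherwise. A slightly more defensive version of the same mapping might look like the sketch below. The helper name `certainty_percent` is hypothetical, and returning `None` for unexpected output is a design choice made for this example.

```python
from typing import Optional


def certainty_percent(decoded: str) -> Optional[int]:
    """Map the trailing digit d of the decoded text to 5 + 10 * d percent.

    Returns None when the text is empty or does not end in an ASCII digit.
    """
    last = decoded[-1:]  # empty string if decoded is empty
    if len(last) != 1 or last not in "0123456789":
        return None
    return 5 + int(last) * 10


print(certainty_percent("<|start_of_role|>certainty<|end_of_role|>7"))
print(certainty_percent("<|start_of_role|>certainty<|end_of_role|>?"))
```

Returning `None` lets the caller decide how to treat a missing score, for example by handling it like a low-certainty answer.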