<a href="https://colab.research.google.com/github/adnaen/ai-dev-toolbox/blob/main/python-tools/transformers/day_4_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Inference

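Inference means running a trained model on new input to get its output. For an LLM, that means sending a prompt and generating a response.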

## Inference Techniques

### **temperature**

- It controls how random (or "creative") the model's token choices are.
- With a high temperature, the model is more likely to pick less probable tokens.
- But sometimes that can cause hallucination.
- Usually between 0.1 and 2.0.
- 1.0 = default, no change.
- Less than 1.0 = more deterministic, confident answers.
- Greater than 1.0 = more creative, but sometimes generates answers that are not meaningful.
- `best range : 0.6 - 1.0`
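
As a rough illustration of the points above, temperature can be pictured as a divisor applied to the model's raw scores (logits) before they are turned into probabilities. The sketch below uses only the standard library; `softmax_with_temperature` and the three logits are made up for the example and are not part of `transformers`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the top token, while a higher one flattens the distribution so unlikely tokens get picked more often.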


### **do_sample**

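- It tells the model to pick the next token at random according to the probability distribution, rather than always picking the highest-probability token.
- It is a boolean (default `False`). Sampling settings such as `temperature`, `top_k` and `top_p` only take effect when it is `True`.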
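
The difference between the two modes can be sketched in a few lines. This is a simplified picture, not the library's code; `pick_token` and the probabilities are hypothetical.

```python
import random


def pick_token(probs, do_sample=False, rng=None):
    """Return the index of the chosen token."""
    if not do_sample:
        # Greedy decoding: always take the most probable token.
        return max(range(len(probs)), key=lambda i: probs[i])
    if rng is None:
        rng = random.Random()
    # Sampling: draw one index at random, weighted by probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.6, 0.3, 0.1]  # hypothetical next-token probabilities
print("greedy :", pick_token(probs))
print("sampled:", [pick_token(probs, do_sample=True) for _ in range(10)])
```

Greedy decoding returns the same index every time, while the sampled list usually contains a mix of indices.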

### **top_k**

- It restricts sampling to the `top_k` most probable tokens.
- If we set it to 20, the tokens are sorted from highest to lowest probability and only the top 20 are kept as candidates for the next token.
- It is a positive whole number.
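
A minimal sketch of the idea, working on probabilities directly. Real implementations usually mask the logits instead, which has the same effect after renormalizing. `top_k_filter` is a hypothetical helper.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Indices sorted from highest to lowest probability, truncated to k.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token probabilities
print(top_k_filter(probs, k=2))  # only the first two tokens stay eligible
```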

### **top_p**

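- When it is set, the tokens are sorted from highest to lowest probability and their probabilities are added up until the cumulative sum reaches the `top_p` value; only those tokens are kept as candidates for the next token (this is also called nucleus sampling).
- It is a value between 0 and 1.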
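
The cumulative-sum step can be sketched as follows. This is a simplified version on plain probabilities; `top_p_filter` is a hypothetical helper, not a `transformers` function.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token probabilities
print(top_p_filter(probs, p=0.9))  # the least likely token is dropped
```

Unlike `top_k`, the number of surviving tokens changes with the shape of the distribution: a confident distribution keeps few tokens, a flat one keeps many.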

# Inference with LLM

- We can run inference on a model using `transformers` in 2 ways:

    1. `pipeline()` :
        - This is a high-level API function that automatically loads the model and tokenizer and decodes the output tokens back to text.
        - `cons` : this is not ideal for production, as it gives less control over the model.

    2. `Manual Loading`
        - Recommended approach for production.
        - Here, we need to load the model and its tokenizer manually, and encode/decode tokens ourselves as needed.
        - In return, we can tune the generation settings (hyperparameters) to our preference.
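
For comparison with the manual approach used in the rest of this notebook, here is a minimal sketch of the `pipeline()` route. `generate_with_pipeline` is a hypothetical wrapper. It assumes `transformers` is installed and that the model can be downloaded, and it imports the library lazily so the function can be defined without it.

```python
def generate_with_pipeline(
    prompt,
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_new_tokens=20,
):
    """Generate text with the high-level pipeline API (downloads the model on first use)."""
    from transformers import pipeline  # lazy import: only needed when called

    # pipeline() loads the model and tokenizer and handles encoding/decoding.
    generator = pipeline("text-generation", model=model_name)
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    # For a plain string prompt, each result is a dict with the generated text.
    return outputs[0]["generated_text"]
```

Calling `generate_with_pipeline("what is the capital of India?")` should return the prompt followed by the model's continuation. The convenience comes at the cost of the fine-grained control that manual loading gives.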

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
import torch

torch.cuda.is_available()

In [None]:
model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

## Inference with Manual Loading

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

`As we know, an LLM returns one token per generation step. Some models expect properly structured input from the user, in a format defined by the chat template in the model's tokenizer_config.json file on the HF Hub, so when running inference with such a model we need to apply the chat template.`

In [None]:
message = [
    {
        "role"    : "system",
        "content" : "you are a helpfull assistant and kindness."
    },
    {
        "role"    : "user",
        "content" : "what is the capital of India."
    }
]

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path = model_name,
        torch_dtype                   = "auto",
        device_map                    = "auto"
    )

In [16]:
ip_ids = tokenizer.apply_chat_template(message, return_tensors="pt", tokenize=True)
ip_ids

tensor([[  529, 29989,  5205, 29989, 29958,    13,  6293,   526,   263,  1371,
          8159, 20255,   322,  2924,  2264, 29889,     2, 29871,    13, 29966,
         29989,  1792, 29989, 29958,    13,  5816,   338,   278,  7483,   310,
          7513, 29889,     2, 29871,    13]])
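
The token IDs above encode the messages after the chat template has been applied. As a rough picture of what the template does, the sketch below mimics the layout visible in the decoded output later in this notebook (`<|system|>`, `<|user|>`, `<|assistant|>` tags). `render_chat` is a hypothetical stand-in: the real template is a Jinja string in `tokenizer_config.json` and may differ in details.

```python
def render_chat(messages, add_generation_prompt=False, eos_token="</s>"):
    """Approximate a Zephyr-style chat template with plain string formatting."""
    parts = []
    for msg in messages:
        # Each turn: a role tag on its own line, the content, then the end-of-sequence token.
        parts.append(f"<|{msg['role']}|>\n{msg['content']}{eos_token}\n")
    if add_generation_prompt:
        # Cue the model to answer as the assistant.
        parts.append("<|assistant|>\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "you are a helpful assistant."},
    {"role": "user", "content": "what is the capital of India."},
]
print(render_chat(example, add_generation_prompt=True))
```

In `apply_chat_template`, passing `add_generation_prompt=True` plays the same role as the flag here: it ends the prompt with the assistant tag so the model starts its reply directly.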

In [17]:
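# Note: temperature only applies when do_sample=True, so it is ignored here (see the warning below).
output_ids = model.generate(ip_ids, max_new_tokens=20, temperature=0.8)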
output_ids

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


tensor([[  529, 29989,  5205, 29989, 29958,    13,  6293,   526,   263,  1371,
          8159, 20255,   322,  2924,  2264, 29889,     2, 29871,    13, 29966,
         29989,  1792, 29989, 29958,    13,  5816,   338,   278,  7483,   310,
          7513, 29889,     2, 29871,    13, 29966, 29989,   465, 22137, 29989,
         29958,    13,  1576,  7483,   310,  7513,   338,  1570,  5556,  2918,
         29889,     2]])

In [18]:
tokenizer.decode(output_ids[0], skip_special_tokens=True)

'<|system|>\nyou are a helpfull assistant and kindness. \n<|user|>\nwhat is the capital of India. \n<|assistant|>\nThe capital of India is New Delhi.'
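
The decoded string above contains the prompt as well as the answer, because `generate` returns the input tokens followed by the new ones. A common way to keep only the reply is to slice off the prompt length before decoding. The sketch below shows the slicing on plain lists, with shortened token lists standing in for the tensors; `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the tokens that were generated after the prompt."""
    return output_ids[prompt_length:]


# Shortened stand-ins for the prompt and output token IDs.
prompt_ids = [529, 29989, 5205]
generated = [1576, 7483, 310]
output_ids = prompt_ids + generated

new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)

# With the tensors from this notebook, the equivalent would be:
#   reply_ids = output_ids[0][ip_ids.shape[-1]:]
#   tokenizer.decode(reply_ids, skip_special_tokens=True)
```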

## Streaming Output

- In applications like ChatGPT, the LLM response appears word by word as it is generated; this is called 'Streaming Output'.
- Let's look at how we can do the same.

In [19]:
from transformers import TextIteratorStreamer
from threading import Thread

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

text = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)

input_ids = tokenizer(text, return_tensors="pt")

model_kwargs = dict(
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    top_k=40,
    max_new_tokens=100,
    input_ids=input_ids["input_ids"],
    streamer=streamer
)

infr_thread = Thread(
    target=model.generate,
    kwargs=model_kwargs,
)

infr_thread.start()

for token in streamer:
    if token != "":
        print(token, flush=True, end="")

Yes, the capital of India is New Delhi.
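
The cell above runs `model.generate` in a background thread and reads text from the streamer in the main thread. The same producer/consumer pattern can be sketched with the standard library alone. Here `fake_generate` is a hypothetical stand-in for the model, and the queue plays the role of the streamer; this is an illustration of the pattern, not the library's implementation.

```python
import queue
import threading
import time

_STOP = object()  # marker that tells the consumer the stream is finished


def fake_generate(words, out_queue, delay=0.0):
    """Stand-in for model.generate: push one chunk at a time, then the stop marker."""
    for word in words:
        time.sleep(delay)  # pretend each token takes a moment to compute
        out_queue.put(word + " ")
    out_queue.put(_STOP)


def stream(words, delay=0.0):
    """Yield chunks as soon as the background thread produces them."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(words, out_queue, delay))
    worker.start()
    while True:
        chunk = out_queue.get()  # blocks until the next chunk arrives
        if chunk is _STOP:
            break
        yield chunk
    worker.join()


for chunk in stream("The capital of India is New Delhi.".split(), delay=0.05):
    print(chunk, end="", flush=True)
print()
```

Running generation in a separate thread is what lets the main loop print each chunk immediately instead of waiting for the full response.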

### My Understanding

- Most models that can be loaded with the `transformers` library need a GPU for good inference performance. On a system without a GPU, only a few small models run well with `transformers`.

- When using this library, loading the model manually and running inference is better than the `pipeline()` approach.

- When we don't have a GPU or powerful hardware, that is where `llama.cpp` comes into play. With `llama.cpp` we can load LLMs on low-end (CPU-only) systems. It was originally developed for the LLaMA architecture, so it is most compatible with LLaMA models.

- Prompt engineering alone can improve the model output considerably (by my rough estimate, around 50%).
