<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_doc/blob/master/notebooks/03_local_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting a local LLM

How do you choose an LLM for text generation?

Plenty of models are regularly updated and open-sourced. You can compare them on the [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

However, the choice of LLM also depends on the hardware available on the local machine.

These models also occupy a lot of disk space, so it is recommended to look for [quantized versions of these models](https://huggingface.co/TheBloke) as well.
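
As a rough back-of-the-envelope check, the size of the weights alone is approximately the parameter count times the bytes per parameter. The sketch below uses a hypothetical helper, `estimate_weight_memory_gb`, and nominal model sizes chosen only for illustration. It ignores activations, the KV cache and quantization metadata, so real usage will be higher.

```python
# Rough weights-only memory estimate. Real usage is higher because of
# activations, the KV cache and quantization metadata.
BITS_PER_PARAM = {"float32": 32, "float16": 16, "int8": 8, "4bit": 4}


def estimate_weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the approximate size of the weights in GiB."""
    bits = BITS_PER_PARAM[precision]
    return num_params * bits / 8 / 2**30


# Nominal sizes used for illustration only (assumed, not measured).
for name, n_params in [("2B model", 2e9), ("7B model", 7e9)]:
    sizes = {p: round(estimate_weight_memory_gb(n_params, p), 1) for p in BITS_PER_PARAM}
    print(name, sizes)
```

Halving the number of bits per parameter halves the estimate, which is why quantized checkpoints are attractive on small GPUs.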

## Checking our local GPU memory availability


In [2]:
## Install dependencies
!pip install -q bitsandbytes
!pip install -q accelerate

In [3]:
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available
from transformers import BitsAndBytesConfig

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [5]:
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 15 GB


## Loading an LLM Locally

We can load open-source LLMs from [Hugging Face](https://huggingface.co/).

The model that we are going to use is `google/gemma-2b-it`.

Gemma can run on a CPU, GPU or TPU. For GPUs, 8 GB+ of GPU RAM is recommended for the 2B checkpoint and 24 GB+ for the 7B checkpoint.

To get a model running locally, we need a few things:
1. A quantization config (optional): specifies the precision to load the model in (e.g. 8-bit, 4-bit)
2. A model ID: tells transformers which model/tokenizer to load
3. A tokenizer: turns the text into numbers ready for the LLM (note: a tokenizer is different from an embedding model)
4. An LLM: the model we use to generate text from the input prompt

**Note:** There are many tips and tricks on loading/making LLMs work faster. One of the best ones is flash_attn (Flash Attention 2). See the [github repo](https://github.com/Dao-AILab/flash-attention)



## HuggingFace API login

In [6]:
from huggingface_hub import notebook_login
notebook_login()


In [7]:
from transformers import pipeline

gemma_pipeline = pipeline(
    model="google/gemma-2b-it",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map='auto'
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [9]:
## 1. Create a quantization config (load the weights in 4-bit precision)
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

## 2. Pick the model you want from Hugging Face
model_id = "google/gemma-2b-it"

## 3. Instantiate the tokenizer (the tokenizer turns text into tokens)
tokenizer = AutoTokenizer.from_pretrained(model_id)

## 4. Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    low_cpu_mem_usage=False,
    attn_implementation='sdpa'  ## 'flash_attention_2' can be faster if flash-attn is installed and the GPU supports it
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
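
To give an intuition for what loading in 4-bit means, here is a deliberately simplified sketch of absmax quantization in plain Python. It is not the scheme bitsandbytes actually uses (which works block-wise with its own 4-bit data types); the helper names `quantize_absmax_4bit` and `dequantize` are hypothetical and exist only for this illustration.

```python
def quantize_absmax_4bit(weights):
    """Map floats to integer codes in [-7, 7] using one scale for the block.

    Simplified illustration only; not the bitsandbytes algorithm.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.82, -0.31, 0.05, -0.77, 0.40]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
```

Each weight is stored as a small integer plus one shared scale, so the restored values are close to, but not exactly, the originals. That loss of precision is the price paid for the smaller memory footprint.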

In [10]:
torch.backends.cuda.enable_flash_sdp(True)

Check the CUDA compute capability of the GPU.

Resource: https://developer.nvidia.com/cuda-gpus

In [11]:
torch.cuda.get_device_capability(0)

(7, 5)
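
The compute capability matters when choosing the attention implementation. The sketch below assumes that Flash Attention 2 needs a compute capability of 8.0 or higher, so a GPU reporting `(7, 5)` would fall back to `sdpa`. The helper `pick_attn_implementation` is hypothetical; in the notebook, its second argument could come from `is_flash_attn_2_available()`, which is imported above.

```python
def pick_attn_implementation(capability, flash_attn_installed):
    """Choose a value for the `attn_implementation` argument.

    capability: (major, minor) tuple, as returned by
        torch.cuda.get_device_capability(0).
    flash_attn_installed: whether the flash-attn package is usable,
        e.g. the result of is_flash_attn_2_available().
    """
    major, _minor = capability
    # Assumption: Flash Attention 2 requires compute capability >= 8.0.
    if flash_attn_installed and major >= 8:
        return "flash_attention_2"
    # Otherwise use PyTorch's built-in scaled dot-product attention.
    return "sdpa"


print(pick_attn_implementation((7, 5), flash_attn_installed=False))
print(pick_attn_implementation((8, 0), flash_attn_installed=True))
```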

## Model Description

In [18]:
## Parameter count (in billions) and memory used by parameters and buffers (in MB)
num_params = sum(param.numel() for param in llm_model.parameters()) / 10**9
mem_params = sum(param.nelement() * param.element_size() for param in llm_model.parameters()) / 1024**2
mem_buffers = sum(buf.nelement() * buf.element_size() for buf in llm_model.buffers()) / 1024**2  ## buffers, not parameters
model_mem_mb = mem_params + mem_buffers

print(f"Number of parameters in the model {num_params:.2f}")
print(f"model memory: {mem_params:.2f} MB")

Number of parameters in the model 1.52
model memory: 1945.14 MB


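The model above was loaded with 4-bit quantization, so these figures describe the quantized weights: they occupy roughly 2 GB of VRAM. The reported parameter count (in billions) is likely lower than the model's nominal size because packed 4-bit weights store more than one value per tensor element. Loading the same model in float16 without quantization would need noticeably more memory. In either case, we need to keep in mind that extra memory is required for the forward pass when generating text.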


## Generating text with our LLM




In [21]:
input_text = "List down some healthy food for breakfast"

print(f"Input text:\n{input_text}")

## Create the prompt template for instruction-tuned model

dialogue_template = [
    {"role": "user",
     "content": input_text}
]

prompt = tokenizer.apply_chat_template(
    conversation=dialogue_template,
    tokenize=False,
    add_generation_prompt=True
)

prompt

Input text:
List down some healthy food for breakfast


'<bos><start_of_turn>user\nList down some healthy food for breakfast<end_of_turn>\n<start_of_turn>model\n'
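
To make the template less of a black box, here is a hand-rolled approximation that reproduces the string shown above. The function `format_gemma_prompt` is hypothetical and only mirrors the format visible in this output; in practice `tokenizer.apply_chat_template` should remain the source of truth, since the real template may handle cases this sketch does not.

```python
def format_gemma_prompt(messages, add_generation_prompt=True):
    """Approximate the Gemma chat format seen in the output above."""
    prompt = "<bos>"
    for message in messages:
        # Assumption: Gemma labels the assistant turn "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        prompt += f"<start_of_turn>{role}\n{content}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so that generation continues from here.
        prompt += "<start_of_turn>model\n"
    return prompt


dialogue = [{"role": "user", "content": "List down some healthy food for breakfast"}]
print(repr(format_gemma_prompt(dialogue)))
```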

Tokenize the prompt and move it to the device

In [22]:
tokenizer

GemmaTokenizerFast(name_or_path='google/gemma-2b-it', vocab_size=256000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<bos>', 'eos_token': '<eos>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<start_of_turn>', '<end_of_turn>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<eos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<bos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	106: AddedToken("<start_of_turn>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	107: AddedToken("<end_of_turn>", rstrip=False, lstr

In [23]:
input_ids = tokenizer(prompt, return_tensors='pt').to(device)
input_ids

{'input_ids': tensor([[    2,     2,   106,  1645,   108,  1268,  1706,  1009,  9606,  2960,
           604, 14457,   107,   108,   106,  2516,   108]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
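
Note that the ids start with `2, 2`: the templated prompt already contains `<bos>`, and the tokenizer prepends another one. A common way to avoid this is to pass `add_special_tokens=False` when tokenizing a prompt that has already been templated. The sketch below shows the effect on a plain list of ids, using a hypothetical helper `drop_duplicate_bos`.

```python
BOS_ID = 2  # id of <bos> in the tokenizer printout above


def drop_duplicate_bos(token_ids, bos_id=BOS_ID):
    """Collapse a run of leading BOS tokens into a single one."""
    start = 0
    while (
        start + 1 < len(token_ids)
        and token_ids[start] == bos_id
        and token_ids[start + 1] == bos_id
    ):
        start += 1
    return token_ids[start:]


ids = [2, 2, 106, 1645, 108, 1268, 1706, 1009, 9606, 2960,
       604, 14457, 107, 108, 106, 2516, 108]
print(drop_duplicate_bos(ids)[:4])  # [2, 106, 1645, 108]

# In the notebook, the equivalent fix would be (not run here):
#   input_ids = tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to(device)
```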

Generate the output from the local LLM


In [24]:
outputs = llm_model.generate(**input_ids, max_new_tokens=256)
print(f"Model output (tokens):\n{outputs[0]}\n")

Model output (tokens):
tensor([     2,      2,    106,   1645,    108,   1268,   1706,   1009,   9606,
          2960,    604,  14457,    107,    108,    106,   2516,    108,  21404,
        235269,   1517,    708,   1009,   9606,  14457,   2960,   5793, 235292,
           109, 235274, 235265, 186986,    675,  46051,    578,  22606,    108,
        235284, 235265,  41326, 235290,  78346,  33611,    675,  54154,  10605,
           578,  31985,    108, 235304, 235265,  15556,  50162,    675,   9471,
           578, 145197,    108, 235310, 235265, 169685,    675,  16803, 235269,
         19574, 235269,    578,  50162,    108, 235308, 235265,  41326, 235290,
         78346,  71531,    675,  61449,  10605,    578,   9471,    108, 235318,
        235265, 186986,    675,  54269,  15741, 235269,  22606, 235269,    578,
          9471,    108, 235324, 235265, 217675, 122149,    675,   9471,    578,
         22606,    108, 235321, 235265,  41326, 235290,  78346,  57289,    675,
          9512,  

Decode the output tokens to text

In [25]:
text_gen = tokenizer.decode(outputs[0])
print(f"Text generated:\n{text_gen}")

Text generated:
<bos><bos><start_of_turn>user
List down some healthy food for breakfast<end_of_turn>
<start_of_turn>model
Sure, here are some healthy breakfast food ideas:

1. Oatmeal with berries and nuts
2. Whole-wheat toast with peanut butter and banana
3. Greek yogurt with fruit and granola
4. Smoothie with fruits, vegetables, and yogurt
5. Whole-wheat pancakes with almond butter and fruit
6. Oatmeal with chia seeds, nuts, and fruit
7. Quinoa porridge with fruit and nuts
8. Whole-wheat cereal with milk and fruit
9. Greek yogurt with cottage cheese and fruit
10. Whole-wheat muffins with fruit and nuts<eos>
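
The decoded text above still contains the prompt and the special tokens. One way to keep only the answer is to slice off the prompt ids before decoding and to pass `skip_special_tokens=True` to `tokenizer.decode`. The sketch below shows the slicing step on plain lists, with a hypothetical helper `new_tokens_only`; the ids are copied from the outputs above.

```python
def new_tokens_only(output_ids, prompt_length):
    """Return only the ids generated after the prompt."""
    return output_ids[prompt_length:]


prompt_ids = [2, 2, 106, 1645, 108, 1268, 1706, 1009, 9606, 2960,
              604, 14457, 107, 108, 106, 2516, 108]
# First few generated ids, taken from the model output shown earlier.
output_ids = prompt_ids + [21404, 235269, 1517, 708]

print(new_tokens_only(output_ids, len(prompt_ids)))  # [21404, 235269, 1517, 708]

# In the notebook this would look like (not run here):
#   prompt_len = input_ids["input_ids"].shape[1]
#   answer = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
```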


## Augmenting our prompt with context items