<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_doc/blob/master/notebooks/03_local_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting a local LLM

How do you choose an LLM for text generation?

Plenty of open-source models are released and updated regularly. You can compare them on the [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

However, the choice of LLM also depends on the hardware available on the local machine.

These models also occupy a lot of disk space, so it is recommended to look for [quantized versions of these models](https://huggingface.co/TheBloke).
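To make the effect of quantization concrete, here is a rough back-of-the-envelope sketch. It only estimates the size of the weights from the parameter count and the bits per weight; the model sizes in the loop are hypothetical round numbers chosen for illustration, and real checkpoints carry extra overhead.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical parameter counts, used only to illustrate the scaling
for name, num_params in [("2B model", 2e9), ("7B model", 7e9)]:
    for bits in (16, 8, 4):
        size = weight_memory_gib(num_params, bits)
        print(f"{name} at {bits}-bit: ~{size:.1f} GiB")
```

Halving the bits per weight halves the storage for the weights, which is why a 4-bit version of a model can fit on hardware where the 16-bit version does not.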

## Checking our local GPU memory availability


In [1]:
## Install dependencies
!pip install -q bitsandbytes
!pip install -q accelerate
!pip install -U -q sentence_transformers


In [2]:
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available
from transformers import BitsAndBytesConfig

from sentence_transformers import SentenceTransformer, util

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [4]:
if torch.cuda.is_available():
    gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
    gpu_memory_gb = round(gpu_memory_bytes / (2**30))
    print(f"Available GPU memory: {gpu_memory_gb} GB")
else:
    print("No GPU available at the moment, Running on CPU")

Available GPU memory: 15 GB


## Loading an LLM Locally

We can load open-source LLMs from [Hugging Face](https://huggingface.co/).

The model that we are going to use is `google/gemma-2b-it`.

Gemma can run on a CPU, GPU, or TPU. On a GPU, the recommendation is 8 GB+ of GPU memory for the 2B checkpoint and 24 GB+ for the 7B checkpoint.
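The guidance above can be turned into a small helper. This is only a sketch: the thresholds come from the sentence above, while the function name and the fallback message for smaller GPUs are hypothetical.

```python
def recommend_gemma_checkpoint(gpu_memory_gb: float) -> str:
    """Map GPU memory to a Gemma checkpoint using the 8 GB / 24 GB guidance."""
    if gpu_memory_gb >= 24:
        return "7b"
    if gpu_memory_gb >= 8:
        return "2b"
    # Hypothetical fallback: below 8 GB, consider a quantized 2B model or the CPU
    return "2b (quantized or on CPU)"


print(recommend_gemma_checkpoint(15))
```

With the 15 GB GPU detected earlier, this lands on the 2B checkpoint, which is the one used in this notebook.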

To get a model running locally, we need a few things:
1. A quantization config (optional): a config that sets the precision to load the model in (e.g. 8-bit, 4-bit)
2. A model ID: this tells `transformers` which model/tokenizer to load
3. A tokenizer: this turns the text into numbers ready for the LLM (note: a tokenizer is different from an embedding model)
4. An LLM: this is what we use to generate text from the input prompt

**Note:** There are many tips and tricks for loading LLMs and making them run faster. One of the best is `flash_attn` (Flash Attention 2). See the [GitHub repo](https://github.com/Dao-AILab/flash-attention).



## HuggingFace API login

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
from transformers import pipeline

gemma_pipeline = pipeline(
                    model="google/gemma-2b-it",
                    torch_dtype=torch.bfloat16,
                    trust_remote_code=True,
                    device_map='auto'
                )

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
## 1. Create a quantized version of the model
## create a quantization config
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

## 2. pick the model you want from hugging face
model_id = "google/gemma-2b-it"

## 3. instantiate the tokenizer (tokenizer turns text into tokens)
tokenizer = AutoTokenizer.from_pretrained(model_id)

## 4. Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path = model_id,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    low_cpu_mem_usage=False,
    attn_implementation = 'sdpa' ## 'flash_attention_2' can be faster if flash_attn is installed and the GPU supports it
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
torch.backends.cuda.enable_flash_sdp(True)

Check the CUDA compute capability of the GPU:

Resource: https://developer.nvidia.com/cuda-gpus

In [None]:
torch.cuda.get_device_capability(0)

(7, 5)
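The compute capability matters when choosing an attention implementation. The sketch below assumes that Flash Attention 2 needs compute capability 8.0 or higher (Ampere or newer), as stated in the flash-attention repository's requirements; the helper name is hypothetical. In this notebook the two inputs would come from `is_flash_attn_2_available()` and `torch.cuda.get_device_capability(0)`.

```python
from typing import Tuple


def pick_attn_implementation(flash_attn_installed: bool,
                             compute_capability: Tuple[int, int]) -> str:
    """Return a value for `attn_implementation` (hypothetical helper)."""
    major, _minor = compute_capability
    # Assumption: Flash Attention 2 requires compute capability >= 8.0
    if flash_attn_installed and major >= 8:
        return "flash_attention_2"
    # Fall back to PyTorch's scaled dot-product attention
    return "sdpa"


# The GPU in this notebook reports (7, 5), so the fallback is selected
print(pick_attn_implementation(False, (7, 5)))
```

Under that assumption, a GPU with capability (7, 5) stays on `'sdpa'`, which matches the setting used when loading the model above.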

## Model Description

In [None]:
## Parameter count in billions (counts stored elements, so packed 4-bit weights count as fewer)
num_params = sum(param.numel() for param in llm_model.parameters()) / 10**9
## Memory held by parameters and by buffers, in MB
mem_params = sum(param.nelement() * param.element_size() for param in llm_model.parameters()) / 1024**2
mem_buffers = sum(buf.nelement() * buf.element_size() for buf in llm_model.buffers()) / 1024**2
model_mem_mb = mem_params + mem_buffers

print(f"Number of parameters in the model {num_params:.2f}")
print(f"model memory: {mem_params:.2f} MB")

Number of parameters in the model 1.52
model memory: 1945.14 MB


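This means that loading gemma-2b-it with the 4-bit quantization config above takes about 2 GB of VRAM for the weights. The reported count of 1.52 billion is lower than the model's nominal size because 4-bit weights are stored packed, two per element. Keep in mind that the forward pass needs additional memory on top of this to generate text.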
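The numbers printed above can be roughly reproduced by hand. The sketch below rests on several assumptions that are not stated in this notebook: a total of 2,506,172,416 parameters and a hidden size of 2,048 for Gemma 2B, an embedding matrix that stays in float16, and all remaining weights packed as two 4-bit values per byte. It ignores quantization metadata and the small normalization weights.

```python
VOCAB_SIZE = 256_000           # from the tokenizer shown in this notebook
HIDDEN_SIZE = 2_048            # assumption: Gemma 2B hidden size
TOTAL_PARAMS = 2_506_172_416   # assumption: nominal Gemma 2B parameter count

# Assumption: the embedding matrix is left in float16 (2 bytes per weight)
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
# Assumption: everything else is 4-bit, packed two weights per stored byte
quantized_params = TOTAL_PARAMS - embedding_params

stored_elements = embedding_params + quantized_params // 2
stored_bytes = embedding_params * 2 + quantized_params // 2

print(f"stored elements: {stored_elements / 1e9:.2f} billion")
print(f"weight memory:   {stored_bytes / 1024**2:.0f} MB")
```

If these assumptions hold, the estimate lands close to the element count and memory reported by the cell above, which suggests that the figures describe the quantized model rather than a float16 one.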


## Generating text with our LLM




In [5]:
input_text = "List down some healthy food for breakfast"

print(f"Input text:\n{input_text}")

## Create the prompt template for instruction-tuned model

dialogue_template = [
    {"role": "user",
     "content": input_text}
]

prompt = tokenizer.apply_chat_template(
    conversation=dialogue_template,
    tokenize=False,
    add_generation_prompt=True
)

prompt

Input text:
List down some healthy food for breakfast



Tokenize the prompt and move the tensors to the device

In [None]:
tokenizer

GemmaTokenizerFast(name_or_path='google/gemma-2b-it', vocab_size=256000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<bos>', 'eos_token': '<eos>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<start_of_turn>', '<end_of_turn>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<eos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<bos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	106: AddedToken("<start_of_turn>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	107: AddedToken("<end_of_turn>", rstrip=False, lstr

In [None]:
input_ids = tokenizer(prompt, return_tensors='pt').to(device)
input_ids

{'input_ids': tensor([[    2,     2,   106,  1645,   108,  1268,  1706,  1009,  9606,  2960,
           604, 14457,   107,   108,   106,  2516,   108]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
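The ids above start with two `2`s, and `2` is the id of `<bos>` in the tokenizer listing. A likely explanation is that the chat template already writes `<bos>` into the prompt string and the tokenizer call then adds another one by default; passing `add_special_tokens=False` to the tokenizer call is the usual way to avoid the duplicate. The hand-written function below mirrors the layout visible in the decoded output later in this notebook. It is an illustration only, not the library's template code.

```python
BOS = "<bos>"


def gemma_style_prompt(user_message: str) -> str:
    """Hand-written imitation of the single-turn chat layout (illustrative)."""
    return (
        f"{BOS}<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt_text = gemma_style_prompt("List down some healthy food for breakfast")
print(repr(prompt_text))
# The string already carries one <bos>; a tokenizer that adds its own yields two
print(prompt_text.count(BOS))
```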

Generate the outputs from local LLM


In [None]:
outputs = llm_model.generate(**input_ids, max_new_tokens=256)
print(f"Model output (tokens):\n{outputs[0]}\n")

Model output (tokens):
tensor([     2,      2,    106,   1645,    108,   1268,   1706,   1009,   9606,
          2960,    604,  14457,    107,    108,    106,   2516,    108,  21404,
        235269,   1517,    708,   1009,   9606,  14457,   2960,   5793, 235292,
           109, 235274, 235265, 186986,    675,  46051,    578,  22606,    108,
        235284, 235265,  41326, 235290,  78346,  33611,    675,  54154,  10605,
           578,  31985,    108, 235304, 235265,  15556,  50162,    675,   9471,
           578, 145197,    108, 235310, 235265, 169685,    675,  16803, 235269,
         19574, 235269,    578,  50162,    108, 235308, 235265,  41326, 235290,
         78346,  71531,    675,  61449,  10605,    578,   9471,    108, 235318,
        235265, 186986,    675,  54269,  15741, 235269,  22606, 235269,    578,
          9471,    108, 235324, 235265, 217675, 122149,    675,   9471,    578,
         22606,    108, 235321, 235265,  41326, 235290,  78346,  57289,    675,
          9512,  

Decode the output tokens to text

In [None]:
text_gen = tokenizer.decode(outputs[0])
print(f"Text generated:\n{text_gen}")

Text generated:
<bos><bos><start_of_turn>user
List down some healthy food for breakfast<end_of_turn>
<start_of_turn>model
Sure, here are some healthy breakfast food ideas:

1. Oatmeal with berries and nuts
2. Whole-wheat toast with peanut butter and banana
3. Greek yogurt with fruit and granola
4. Smoothie with fruits, vegetables, and yogurt
5. Whole-wheat pancakes with almond butter and fruit
6. Oatmeal with chia seeds, nuts, and fruit
7. Quinoa porridge with fruit and nuts
8. Whole-wheat cereal with milk and fruit
9. Greek yogurt with cottage cheese and fruit
10. Whole-wheat muffins with fruit and nuts<eos>
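The decoded text still contains the prompt and the special tokens. A minimal sketch for keeping only the model's answer is shown below; the helper name and the default marker are assumptions based on the output above. Within `transformers`, `tokenizer.decode(..., skip_special_tokens=True)` removes the special tokens, and slicing the generated ids past the prompt length is another common approach.

```python
def extract_answer(decoded: str,
                   model_marker: str = "<start_of_turn>model\n") -> str:
    """Keep the text after the last model marker and drop special tokens."""
    # rpartition returns the whole string as the last item if the marker is absent
    _, _, answer = decoded.rpartition(model_marker)
    for token in ("<eos>", "<end_of_turn>"):
        answer = answer.replace(token, "")
    return answer.strip()


# Shortened copy of the decoded output above
decoded = (
    "<bos><bos><start_of_turn>user\n"
    "List down some healthy food for breakfast<end_of_turn>\n"
    "<start_of_turn>model\n"
    "Sure, here are some healthy breakfast food ideas:\n\n"
    "1. Oatmeal with berries and nuts<eos>"
)
print(extract_answer(decoded))
```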


With the local LLM set up, the next step is to augment the prompt with relevant passages, which is a form of prompt engineering. Before building the augmentation step, let's set up the retrieval pipeline.

## Retrieval Pipeline

Let's wrap the retrieval pipeline from the previous notebook into a reusable class.

In [6]:
# Download the embeddings of the pdf
!wget https://github.com/kavyajeetbora/nlp_doc/raw/master/data/doc_embeddings.parquet -O "doc_embeddings.parquet"

## Load the pandas dataframe from the parquet file
df = pd.read_parquet('doc_embeddings.parquet')

# Extract the text embeddings from the dataframe
a = np.stack(df['embedding'].to_list(), axis=0)
embedding  = torch.tensor(a).to(device)

--2024-04-16 14:56:50--  https://github.com/kavyajeetbora/nlp_doc/raw/master/data/doc_embeddings.parquet
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/kavyajeetbora/nlp_doc/master/data/doc_embeddings.parquet [following]
--2024-04-16 14:56:51--  https://raw.githubusercontent.com/kavyajeetbora/nlp_doc/master/data/doc_embeddings.parquet
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3922167 (3.7M) [application/octet-stream]
Saving to: ‘doc_embeddings.parquet’


2024-04-16 14:56:51 (72.2 MB/s) - ‘doc_embeddings.parquet’ saved [3922167/3922167]



Retrieval: Define the semantic search pipeline
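Before the full class, here is a dependency-free sketch of the core idea: score every stored vector against the query with a dot product and keep the `k` best. The two-dimensional vectors are made-up toy data, and all function names are hypothetical. Normalizing the vectors first makes the dot product equal to cosine similarity.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_k_by_dot(query: Sequence[float],
                 corpus: Sequence[Sequence[float]],
                 k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k highest dot-product scores."""
    scores = [(i, dot(query, vec)) for i, vec in enumerate(corpus)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy 2-D "embeddings", invented for illustration
corpus = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = normalize([0.9, 0.1])
print(top_k_by_dot(query, corpus, k=2))
```

The class below follows the same pattern on the GPU, with `util.dot_score` for the scores and `torch.topk` for the selection.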



In [43]:
class RetrievalPipeline():

    def __init__(self, data_frame:pd.DataFrame, device):
        self.df = data_frame
        a = np.stack(self.df['embedding'].to_list(), axis=0)
        self.embeddings = torch.tensor(a).to(device)
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
        self.device = device

    def retrieve_relevant_text(self, query,k=5):

        ## Encode the query with the same embedding model
        query_encode = self.embedding_model.encode(query, convert_to_tensor=True).to(self.device)

        ## Get similarity scores
        sim_score = util.dot_score(a=query_encode, b=self.embeddings)
        ## get the top k values
        vals, indices = torch.topk(sim_score, k=k)

        ## flatten, move to the CPU, and convert to numpy arrays
        vals, indices = vals.cpu().numpy().flatten(), indices.cpu().numpy().flatten()

        ## After retrieving the indices, get the text from the dataframe
        context_items = [self.df.loc[i, 'sentence_chunk'] for i in indices]

        return context_items

    def prompt_formatter(self, query:str, k=5) -> str:

        context_items = self.retrieve_relevant_text(query, k=k)
        context = "These are some context:\n - "
        context += "\n - ".join(context_items)
        prompt = f'''
{context}
Referring to the context above, please answer the following query:
{query}

Answer:
'''
        return prompt

In [44]:
rp = RetrievalPipeline(df, device)

In [45]:
query = "What are the benefits of having oats ?"
context_items = rp.retrieve_relevant_text(query, k=2)
context_items

['Journal concluded that all diets, (independent of carbohydrate, fat, and protein content) that incorporated an exercise regimen significantly decreased weight and waist circumference in obese women.6 Some studies do provide evidence that in comparison to other diets, low-carbohydrate diets improve insulin levels and other risk factors for Type 2 diabetes and cardiovascular disease. The overall scientific consensus is that consuming fewer calories in a balanced diet will promote health and stimulate weight loss, with significantly better results achieved when combined with regular exercise. Health Benefits of Whole Grains in the Diet While excessive consumption of simple carbohydrates is potentially bad for your health, consuming more complex carbohydrates is extremely beneficial to health. There is a wealth of scientific evidence supporting that replacing refined grains with whole grains decreases the risk for obesity, Type 2 diabetes, and cardiovascular disease. Whole grains are gre

Now augment the prompt with context:

In [46]:
query = 'What are some good sources of dietary fibre ?'
prompt = rp.prompt_formatter(query,k=2)
print(prompt)


These are some context:
 - Dietary fiber is categorized as either water-soluble or insoluble. Some examples of soluble fibers are inulin, pectin, and guar gum and they are found in peas, beans, oats, barley, and rye. Cellulose and lignin are insoluble fibers and a few dietary sources of them are whole-grain foods, flax, cauliflower, and avocados. Cellulose is the most abundant fiber in plants, making up the cell walls and providing structure. Soluble fibers are more easily accessible to bacterial enzymes in the large intestine so they can be broken down to a greater extent than insoluble fibers, but even some breakdown of cellulose and other insoluble fibers occurs. The last class of fiber is functional fiber. Functional fibers have been added to foods and have been shown to provide health benefits to humans. Functional fibers may be extracted from plants and purified or synthetically made. An example of a functional fiber is psyllium-seed husk. Scientific studies show that consuming 

## Augmenting our prompt with context items

Text generation was covered in the previous section; now it is time to augment the prompt.

Augmenting a prompt with context items is one form of prompt engineering.

Prompt engineering is an active field of research, and new styles and techniques keep emerging. A few resources that cover techniques that work well:

1. [Prompt Engineering Guide](https://www.promptingguide.ai/)
2. [Brex's Prompt Engineering Guide](https://github.com/brexhq/prompt-engineering)
3. [Prompt engineering for business performance](https://www.anthropic.com/news/prompt-engineering-for-business-performance)


In this section, we are going to use a few prompting techniques:
1. Give clear instructions
2. Give a few examples of input/output (for example, "this is my input and I want output like this")
3. Give room to think
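As a rough illustration of how the three techniques can show up in a single prompt, consider the sketch below. The function name, the wording of the instructions, and the example pair are all hypothetical; this is not the formatter used later in this notebook.

```python
from typing import List, Tuple


def build_prompt(query: str,
                 context_items: List[str],
                 examples: List[Tuple[str, str]]) -> str:
    """Combine instructions, few-shot examples, and context into one prompt."""
    context = "\n".join(f"- {item}" for item in context_items)
    shots = "\n\n".join(
        f"Example query: {q}\nExample answer: {a}" for q, a in examples
    )
    return (
        # 1. Clear instructions
        "Answer the query using only the context items below.\n"
        # 3. Room to think
        "First pick out the relevant passages, then write the answer.\n\n"
        f"Context:\n{context}\n\n"
        # 2. A few input/output examples
        f"{shots}\n\n"
        f"Query: {query}\nAnswer:"
    )


# Hypothetical inputs, for illustration only
print(build_prompt(
    query="What are the macronutrients and what do they do?",
    context_items=["<context passage 1>", "<context passage 2>"],
    examples=[("<example query>", "<example answer>")],
))
```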

Let's create a function to format a prompt with context items

Example:

> Based on the following context:
> - asncacakca
> - ascnankca
> - jqfopqopa
> - aknsqopwq

> Please answer the following query: What are the macronutrients and what do they do?

> Answer:




In [36]:
def prompt_formatter(query:str, rp:RetrievalPipeline) -> str:

    context_items = rp.retrieve_relevant_text(query, k=5)
    context = "These are some context:\n - "
    context += "\n - ".join(context_items)
    prompt = f'''
{context}
Please answer the following query:
{query}

Answer:
'''
    return prompt


query = "List down some foods which are high in protein"
print(prompt_formatter(query, rp))


These are some context:
 - Dietary Sources of Protein The protein food group consists of foods made from meat, seafood, poultry, eggs, soy, dry beans, peas, and seeds. According to the Harvard School of Public Health, “animal protein and vegetable protein probably have the same effects on health. It’s the protein package that’s likely to make a difference.”1 1. Protein: The Bottom Line. Harvard School of Public Proteins, Diet, and Personal Choices | 411
 - 18.9 5.4 200 454 Tuna 3 oz. (canned) 21.7 0.2 26 99 Soybeans 1 c. (boiled) 29.0 2.2 0 298 Lentils 1 c. (boiled) 17.9 0.1 0 226 Kidney beans 1 c. (canned) 13.5 0.2 0 215 Sunflower seeds 1 c. 9.6 2.0 0 269 The USDA provides some tips for choosing your dietary protein sources. Their motto is, “Go Lean with Protein”. The overall suggestion is to eat a variety of protein-rich foods to benefit health. The USDA recommends lean meats, such as round steaks, top sirloin, extra lean ground beef, pork loin, and skinless chicken. Proteins, Diet,