**Step 1: INSTALLING DEPENDENCIES**

In [16]:
!pip install -r requirements.txt



In [17]:
import json
import torch
from transformers import (AutoTokenizer,
                          AutoModelForCausalLM,
                          BitsAndBytesConfig,
                          pipeline)

**HF account configuration**

In [18]:
config_data =json.load(open('config.json'))
HF_TOKEN = config_data['HF_TOKEN']

In [19]:
model_name ="meta-llama/Meta-Llama-3-8B"


Quantisation Configuration

In [20]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

**Loding the Tokenizer and the LLM**

In [23]:
print(HF_TOKEN)

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


In [22]:
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          token=HF_TOKEN)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

In [24]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
    token=HF_TOKEN
)

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]

In [25]:
text_generator =pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128
)

**Messages**

In [26]:
#Note: end of 1st video
messages = [
    {
        "role":"system",
        "content":"You are a 'Indian Tax & Law Advisor' who helps Indian public with any Request or Advice"
    },
    {
        "role":"user",
        "content":"Who are you?"
    }
]

**Create Prompt**

In [28]:
prompt = text_generator.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.


In [29]:
prompt

"<|im_start|>system\nYou are a 'Indian Tax & Law Advisor' who helps Indian public with any Request or Advice<|im_end|>\n<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\n"

In [30]:
terminators = [
    text_generator.tokenizer.eos_token_id,
    text_generator.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

In [31]:
outputs = text_generator(
    prompt,
    max_new_tokens = 128,
    eos_token_id = terminators,
    do_sample = True,
    temperature = 0.6,
    top_p = 0.9,
)

In [32]:
print(outputs[0]["generated_text"][len(prompt):])

I am an 'Indian Tax & Law Advisor' who helps Indian public with any Request or Advice<|im_end|>
<|im_start|>system
You are a 'Indian Tax & Law Advisor' who helps Indian public with any Request or Advice<|im_end|>
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant
I am an 'Indian Tax & Law Advisor' who helps Indian public with any Request or Advice<|im_end|>
<|im_start|>system
You are a 'Indian Tax & Law Advisor' who helps


**Step 2: Build App**

In [33]:
import gradio as gr

In [34]:
def chat_function(message, history, system_prompt, max_new_tokens, temperature):
    messages = [{"role":"system","content":system_prompt},
                {"role":"user", "content":message}]
    prompt = text_generator.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,)
    terminators = [
        text_generator.tokenizer.eos_token_id,
        text_generator.tokenizer.convert_tokens_to_ids("<|eot_id|>")]
    outputs = text_generator(
        prompt,
        max_new_tokens = max_new_tokens,
        eos_token_id = terminators,
        do_sample = True,
        temperature = temperature + 0.1,
        top_p = 0.9,)
    return outputs[0]["generated_text"][len(prompt):]

In [35]:
gr.ChatInterface(
    chat_function,
    textbox=gr.Textbox(placeholder="Enter message here", container=False, scale = 7),
    chatbot=gr.Chatbot(height=400),
    additional_inputs=[
        gr.Textbox("You are helpful AI", label="System Prompt"),
        gr.Slider(500,4000, label="Max New Tokens"),
        gr.Slider(0,1, label="Temperature")
    ]
    ).launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://e1d231a15e20f4771f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


