<a href="https://colab.research.google.com/github/jchen8000/DemystifyingLLMs/blob/main/6_Deployment/Chatbot_FLAN_T5_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 6. Deployment of LLMs

## 6.11 Chatbot, Example of LLM-Powered Application

### **Build a Chatbot with a Local LLM and a LoRA Adapter**

This notebook demonstrates how to load and run a local LLM entirely on your machine (or a Google Colab GPU runtime), and enhance its capabilities using a LoRA fine‑tuned adapter. The LoRA adapter used here was created earlier in **Section 5.3**, where we fine‑tuned ```google/flan-t5-base``` on the TweetSum summarization dataset.

To ensure reproducibility, the notebook downloads the fine‑tuned LoRA checkpoints directly from our GitHub repository. 

---

### **Key Objectives**

- **Local Model Execution**  
  Load the ```google/flan-t5-base``` model directly from Hugging Face and run it locally on GPU using PyTorch.

- **Apply a Fine‑Tuned LoRA Adapter**  
  Attach the fine‑tuned LoRA weights (created in Section 5.3) to the base model using the ```PEFT``` library.

- **Text Generation Pipeline**  
  Generate responses, such as Tweet‑style summaries or general conversational output, on the local runtime without relying on external inference APIs.
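
To make the second objective more concrete, here is a minimal sketch of the arithmetic behind a LoRA adapter: the frozen weight `W` is combined with a low-rank update `B @ A` scaled by `alpha / r`. It uses plain Python lists rather than PyTorch or PEFT, and the function names, matrix values, rank, and `alpha` are all hypothetical illustrations, not the settings used in Section 5.3. The only model-specific number is 768, the hidden size of `flan-t5-base`.

```python
# Illustrative LoRA arithmetic in plain Python (no PyTorch or PEFT required).
# All values, ranks, and helper names are hypothetical, chosen for readability.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight a LoRA-adapted layer applies."""
    scale = alpha / r
    delta = matmul(b, a)  # shape (d_out, d_in), rank at most r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_count(d_in, d_out, r):
    """Extra parameters contributed by one adapted weight matrix."""
    return r * (d_in + d_out)

# Tiny example: a 2x3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # shape (r, d_in) = (1, 3)
B = [[0.5], [0.0]]      # shape (d_out, r) = (2, 1)
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)
print(W_eff)

# One 768x768 projection with a hypothetical rank of 8.
print(lora_param_count(768, 768, r=8), "adapter parameters vs", 768 * 768, "frozen")
```

The parameter comparison shows why adapters are small enough to ship in a Git repository while the base model is downloaded separately.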

---

### ⚠️ **Hardware & Compatibility Requirements**

This notebook is designed for a GPU runtime (such as the Google Colab T4 GPU), where PyTorch can use CUDA acceleration. Running on CPU is possible but noticeably slower.

This notebook is confirmed working as of **January 2026**. 

While the workflow reflects current best practices, deep‑learning libraries evolve rapidly, and future updates to Transformers, PEFT, or CUDA tooling may require adjustments to this notebook.
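
Because version drift is the most likely cause of breakage, it can help to record which versions are actually installed before debugging anything else. The following is a small sketch using only the standard library; the helper name `installed_versions` is an assumption made for this example, and the package list simply mirrors the libraries this notebook depends on.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Packages this notebook relies on; extend the list as needed.
report = installed_versions(["torch", "transformers", "peft", "sentencepiece"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing this report against the pinned versions in the install cell is a quick first check when a later cell fails.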

---


In [None]:
%pip install -q \
    torch==2.9.0+cu126 \
    torchvision==0.24.0+cu126 \
    torchaudio==2.9.0+cu126 \
    transformers==4.57.6 \
    peft==0.18.1 \
    sentencepiece==0.2.1

In [None]:
import os
from google.colab import userdata
import torch
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration
from transformers import GenerationConfig
from peft import PeftModel, PeftConfig

#### Download the LoRA Model from the GitHub Repository

In Section 5.3, we fine‑tuned the ```google/flan-t5-base``` model on the TweetSum summarization dataset and saved the resulting LoRA adapter as ```flan-t5-base_lora```. The cell below downloads that adapter from the GitHub repository using a sparse checkout, so only the adapter directory is fetched.

In [None]:
!rm -rf lora_model temp_repo

!mkdir -p lora_model
!git clone --depth 1 --filter=blob:none --sparse \
    https://github.com/jchen8000/DemystifyingLLMs.git temp_repo

%cd temp_repo
!git sparse-checkout set "5_Fine-Tuning/outputs/flan-t5-base_lora"
%cd ..

!mv temp_repo/5_Fine-Tuning/outputs/flan-t5-base_lora lora_model/
!rm -rf temp_repo
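
After the download, a quick sanity check can confirm that the adapter directory is usable before loading it. The sketch below assumes the directory follows the layout PEFT normally writes when saving an adapter (an `adapter_config.json` plus an `adapter_model.safetensors` or, in older releases, `adapter_model.bin`); the actual contents of the repository folder are not verified here, and `check_adapter_dir` is a hypothetical helper.

```python
from pathlib import Path

# File names PEFT typically writes when saving an adapter (assumed layout).
CONFIG_FILE = "adapter_config.json"
WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")

def check_adapter_dir(path):
    """Return a list of problems found in an adapter directory (empty if none)."""
    adapter_dir = Path(path)
    if not adapter_dir.is_dir():
        return [f"directory not found: {adapter_dir}"]
    problems = []
    if not (adapter_dir / CONFIG_FILE).is_file():
        problems.append(f"missing {CONFIG_FILE}")
    if not any((adapter_dir / name).is_file() for name in WEIGHT_FILES):
        problems.append("missing adapter weights file")
    return problems

problems = check_adapter_dir("./lora_model/flan-t5-base_lora")
print("Adapter looks complete." if not problems else problems)
```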

#### Load the Tokenizer, Base Model, and LoRA Adapter

In [None]:
base_model_name = 'google/flan-t5-base'
lora_model_name = './lora_model/flan-t5-base_lora'

def load_model(base_model_name, lora_model_name):

    # Loading the base model and tokenizer into memory
    tokenizer = T5Tokenizer.from_pretrained(base_model_name, legacy=True)
    base_model = T5ForConditionalGeneration.from_pretrained(
        base_model_name,
        dtype=torch.bfloat16
    )

    # Applying LoRA adapter on top of the local base model
    model = PeftModel.from_pretrained(
        base_model,
        lora_model_name,
        dtype=torch.bfloat16,
        is_trainable=False  # inference only; adapter weights stay frozen
    )
    return tokenizer, model

os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get('HF_TOKEN')
tokenizer, model = load_model(base_model_name, lora_model_name)

total_parameter = sum(p.numel() for p in model.parameters())
print(f"Parameters of the model: {total_parameter:,}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Parameters of the model: 249,347,328
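
The parameter count printed above can be turned into a rough estimate of how much memory the weights occupy. This is a back-of-the-envelope sketch that assumes every parameter is stored at the same precision (2 bytes for bfloat16, 4 bytes for float32); in practice some weights may be held at a different precision, and activations and generation caches add further overhead. The helper name is hypothetical.

```python
def weight_memory_bytes(num_parameters, bytes_per_parameter):
    """Approximate memory needed to hold the weights alone."""
    return num_parameters * bytes_per_parameter

total_parameters = 249_347_328  # value printed by the loading cell

bf16_bytes = weight_memory_bytes(total_parameters, 2)  # bfloat16: 2 bytes each
fp32_bytes = weight_memory_bytes(total_parameters, 4)  # float32: 4 bytes each

print(f"bfloat16: {bf16_bytes / 1024**2:,.1f} MiB")
print(f"float32:  {fp32_bytes / 1024**2:,.1f} MiB")
```

Either figure fits comfortably within the memory of a Colab T4 GPU, which is one reason a base-sized model is a convenient choice for this example.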


#### Define a Helper for Text Generation

In [None]:
def generate_output(tokenizer, model, input_text, max_length=200):
    # Tokenize the input text and place the tensor on the model's device
    device = next(model.parameters()).device
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

    # Greedy decoding (num_beams=1), generating at most max_length new tokens
    outputs = model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(max_new_tokens=max_length, num_beams=1)
    )

    # Decode the generated tokens to a string
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
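
The helper passes the full input to the model, so long prompts can exceed the 512-token maximum that the tokenizer reports for this model (the sample run at the end of this notebook triggers exactly that warning with 641 tokens). One option is to truncate the token ids before generation. The sketch below shows the idea on a plain list of ids; `truncate_ids` is a hypothetical helper, and the ids are placeholders rather than real tokenizer output. With a Hugging Face tokenizer the same effect is usually obtained by passing `truncation=True, max_length=512` when tokenizing.

```python
T5_EOS_ID = 1  # T5 tokenizers use id 1 for the </s> end-of-sequence token

def truncate_ids(token_ids, max_tokens=512, eos_id=T5_EOS_ID):
    """Cut a token id list to max_tokens, keeping a final EOS token."""
    if len(token_ids) <= max_tokens:
        return list(token_ids)
    # Reserve the last slot for EOS so the sequence still ends cleanly.
    return list(token_ids[:max_tokens - 1]) + [eos_id]

ids = list(range(2, 643))  # 641 placeholder ids, matching the length in the warning
short = truncate_ids(ids)
print(len(ids), "->", len(short))
```

Truncation drops the tail of a long conversation, so for summarization it trades completeness for staying within the length the model was configured for.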

#### Define and Run the Chatbot

In [None]:
def chatbot():

    print("Chatbot initialized. You can start chatting now (type 'quit' to stop)!\n")

    while True:
        # Get user input
        user_input = input("You: ")

        # Check if the user wants to quit
        if user_input.lower() == "quit":
            break

        response = generate_output(tokenizer, model, user_input)

        # Print the generated text
        print(f"Chatbot: {response}\n")

# Run the chatbot
chatbot()

Chatbot initialized. You can start chatting now (type 'quit' to stop)!

You: How are you doing today?
Chatbot: You are doing well today.

You: Summarize:  saludos,es que tengo un problema con mi iPhone,y lo que ocurre es que no puedo llamar ni recibir llamadas,y no s qu ms hacer,ya que acud a Apple Support y aunque intenten llamarme,no entran llamadas,quiero saber cual es su recomendacin.Gracias. <BR> We offer support via Twitter in English. Get help in Spanish here <LINK> or join <LINK> <BR> AppleSupport I can speak English so so,can you help me?i have an iPhone 7,and 1 week ago,I cant make calls or receive them,what I do? <BR> Being able to make and receive calls is important and wed like to help. Which exact iOS version is installed on it and have you tried any steps yet? <BR> AppleSupport I have iOS 11.1.2 and I have tried everything what the page said.im Colombian. <BR> Is this the page youve tried all the steps from <LINK> yes, what did you find out when you contacted your carrie

Token indices sequence length is longer than the specified maximum sequence length for this model (641 > 512). Running this sequence through the model will result in indexing errors


Chatbot: Advice: I have a problem with my iPhone and I can't make calls or receive them,what I do?

You: quit
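
The loop above is tied to `input()` and `print()`, which makes it hard to exercise without a live session. As a sketch of one possible variation, the loop can take its input, output, and response functions as parameters so it can be driven by a script. Everything here (`chat_loop`, the `stop_words` tuple, and the echo-style stand-in for the model) is a hypothetical illustration rather than part of the original notebook.

```python
def chat_loop(respond, read=input, write=print, stop_words=("quit", "exit")):
    """Run a read-respond loop until the user types a stop word or input ends."""
    turns = 0
    while True:
        try:
            user_input = read("You: ").strip()
        except EOFError:
            break
        if user_input.lower() in stop_words:
            break
        if not user_input:
            continue  # ignore blank lines instead of sending them to the model
        write(f"Chatbot: {respond(user_input)}\n")
        turns += 1
    return turns

# Scripted demo: an upper-casing function stands in for the model.
scripted = iter(["hello", "   ", "QUIT"])
replies = []
turns = chat_loop(lambda text: text.upper(),
                  read=lambda prompt: next(scripted),
                  write=replies.append)
print(turns, replies)
```

In the notebook, the same loop could be started with `chat_loop(lambda text: generate_output(tokenizer, model, text))`, keeping the default `input` and `print`.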
