<a href="https://colab.research.google.com/github/jchen8000/DemystifyingLLMs/blob/main/6_Deployment/Chatbot_FLAN_T5_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 6. Deployment of LLMs

## 6.11 Chatbot, Example of LLM-Powered Application

### **Build a Chatbot with Local LLM plus a LoRA Adapter**

This notebook demonstrates how to load and run a local LLM entirely on your machine (or a Google Colab GPU runtime), and enhance its capabilities using a LoRA fine‑tuned adapter. The LoRA adapter used here was created earlier in **Section 5.3**, where we fine‑tuned ```google/flan-t5-base``` on the TweetSum summarization dataset.

To ensure reproducibility, the notebook downloads the fine‑tuned LoRA checkpoints directly from our GitHub repository.

---

### **Key Objectives**

- **Local Model Execution**  
  Load the ```google/flan-t5-base``` model directly from Hugging Face and run it locally on GPU using PyTorch.

- **Apply a Fine‑Tuned LoRA Adapter**  
  Attach the fine‑tuned LoRA weights (created in Section 5.3) to the base model using the ```PEFT``` library.

- **Text Generation Pipeline**  
  Generate responses—such as Tweet‑style summaries or general conversational output—fully offline without relying on external inference APIs.

---

### ⚠️ **Hardware & Compatibility Requirements**

This notebook is designed for a GPU runtime (such as the Google Colab T4 GPU) because the model is executed locally using PyTorch with CUDA acceleration. Running on CPU is possible but not recommended because generation is much slower.

This notebook is confirmed working as of **January 2026**.

While the workflow reflects current best practices, deep‑learning libraries evolve rapidly, and future updates to Transformers, PEFT, or CUDA tooling may require adjustments to this notebook.

---


In [9]:
%pip install -q \
    torch==2.9.0+cu126 \
    torchvision==0.24.0+cu126 \
    torchaudio==2.9.0+cu126 \
    transformers==4.57.6 \
    peft==0.18.1 \
    sentencepiece==0.2.1
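
Because the install cell pins exact versions, it can help to confirm what is actually installed before going further (on Colab, reinstalling a core package may require a runtime restart before the new version is picked up). The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages pinned in the install cell above
PINNED = ["torch", "torchvision", "torchaudio", "transformers", "peft", "sentencepiece"]

def installed_versions(packages):
    """Return {package: installed version, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Print one line per package so mismatches with the pins are easy to spot
for name, ver in installed_versions(PINNED).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

If a version differs from the pin, rerunning the install cell and restarting the runtime is usually the first thing to try.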

In [10]:
import os
from google.colab import userdata
import torch
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration
from transformers import GenerationConfig
from peft import PeftModel, PeftConfig

#### Download the LoRA Adapter from the GitHub Repository

In Section 5.3, we fine‑tuned the ```google/flan-t5-base``` model on the TweetSum summarization dataset and saved the resulting LoRA adapter as ```flan-t5-base_lora```. The next cell downloads that adapter from the GitHub repository using a sparse checkout, so only the adapter folder is fetched.

In [11]:
!rm -rf lora_model

!mkdir -p lora_model
!git clone --depth 1 --filter=blob:none --sparse \
    https://github.com/jchen8000/DemystifyingLLMs.git temp_repo

%cd temp_repo
!git sparse-checkout set "5_Fine-Tuning/outputs/flan-t5-base_lora"
%cd ..

!mv temp_repo/5_Fine-Tuning/outputs/flan-t5-base_lora lora_model/
!rm -rf temp_repo

Cloning into 'temp_repo'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 14 (delta 0), reused 9 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (14/14), done.
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (2/2), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 2 (delta 0), reused 1 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (2/2), 2.22 KiB | 2.22 MiB/s, done.
/content/temp_repo
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (4/4), 2.66 MiB | 11.46 MiB/s, done.
/content
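
Before loading the adapter, it may be worth checking that the download produced a usable folder. The sketch below assumes the folder follows PEFT's usual layout (an `adapter_config.json` plus an `adapter_model.safetensors` or `adapter_model.bin` weights file); the actual contents of the repository folder are not listed in this notebook, and `check_adapter_dir` is a hypothetical helper name.

```python
from pathlib import Path

def check_adapter_dir(path):
    """Return a list of problems found in a PEFT adapter folder (empty if none)."""
    adapter_dir = Path(path)
    if not adapter_dir.is_dir():
        return [f"{adapter_dir} is not a directory"]

    problems = []
    if not (adapter_dir / "adapter_config.json").is_file():
        problems.append("missing adapter_config.json")

    # Weights are usually stored as safetensors; older checkpoints may use .bin
    weight_files = ["adapter_model.safetensors", "adapter_model.bin"]
    if not any((adapter_dir / name).is_file() for name in weight_files):
        problems.append("missing adapter weights (adapter_model.safetensors or .bin)")
    return problems

# Path matches the folder created by the download cell above
print(check_adapter_dir("./lora_model/flan-t5-base_lora") or "Adapter folder looks complete")
```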


#### Load the Tokenizer, Base Model, and LoRA Adapter

In [12]:
base_model_name = 'google/flan-t5-base'
lora_model_name = './lora_model/flan-t5-base_lora'

def load_model(base_model_name, lora_model_name):

    # Loading the base model and tokenizer into memory
    tokenizer = T5Tokenizer.from_pretrained(base_model_name, legacy=True)
    base_model = T5ForConditionalGeneration.from_pretrained(
        base_model_name,
        dtype=torch.bfloat16
    )

    # Applying LoRA adapter on top of the local base model
    model = PeftModel.from_pretrained(
        base_model,
        lora_model_name,
        dtype=torch.bfloat16,
        is_trainable=False  # inference only: keep the adapter weights frozen
    )
    return tokenizer, model

os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get('HF_TOKEN')
tokenizer, model = load_model(base_model_name, lora_model_name)

total_parameter = sum(p.numel() for p in model.parameters())
print(f"Parameters of the model: {total_parameter:,}")

Parameters of the model: 249,347,328
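
The printed total covers the base model plus the LoRA matrices. As a rough sanity check, the adapter's share can be estimated by hand. The sketch below relies on figures this notebook does not state and that should be treated as assumptions: a base parameter count of 247,577,856 for `google/flan-t5-base`, a hidden size of 768, 12 encoder and 12 decoder layers, and a rank-16 adapter applied to the query and value projections of every attention block.

```python
# All values below are assumptions for illustration, not read from the checkpoint.
BASE_PARAMS = 247_577_856   # commonly reported size of google/flan-t5-base
D_MODEL = 768               # hidden size of T5-base
ENCODER_LAYERS = 12         # one self-attention block per encoder layer
DECODER_LAYERS = 12         # self-attention + cross-attention per decoder layer
RANK = 16                   # assumed LoRA rank
TARGETS_PER_ATTENTION = 2   # assumed target modules: query and value projections

# Each adapted d x d weight gains two low-rank matrices: A (r x d) and B (d x r)
params_per_matrix = RANK * (D_MODEL + D_MODEL)

attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_matrices = attention_blocks * TARGETS_PER_ATTENTION

lora_params = adapted_matrices * params_per_matrix
total_params = BASE_PARAMS + lora_params

print(f"LoRA parameters:  {lora_params:,}")
print(f"Total parameters: {total_params:,}")
print(f"Adapter share:    {lora_params / total_params:.2%}")
```

Under these assumptions the total comes to 249,347,328, which matches the count printed above, with the adapter contributing well under 1% of the parameters. The real rank and target modules are recorded in the adapter's configuration file from Section 5.3.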


#### Define a Helper for Text Generation

In [13]:
def generate_output(tokenizer, model, input_text, max_length=200):
    # Tokenize the input text and move it to the same device as the model
    device = next(model.parameters()).device
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

    # Greedy decoding (num_beams=1), generating at most `max_length` new tokens
    outputs = model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(max_new_tokens=max_length, num_beams=1)
    )

    # Decode the generated tokens to a string
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
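
The sample session at the end of this notebook prefixes the conversation with `Summarize the text:` by hand. If that prefix should be applied automatically, a small wrapper can build the prompt before it is passed to `generate_output`. This is only a sketch: `build_prompt` is a hypothetical helper, and the prefix wording is copied from the sample input rather than confirmed as the training prompt used in Section 5.3.

```python
def build_prompt(text, prefix="Summarize the text: "):
    """Collapse whitespace and add the instruction prefix unless already present."""
    cleaned = " ".join(text.split())
    if cleaned.lower().startswith(prefix.strip().lower()):
        return cleaned
    return prefix + cleaned

print(build_prompt("AmazonHelp ok, I need to change\n the shipping address"))
```

With this in place, a call such as `generate_output(tokenizer, model, build_prompt(user_input))` would let users paste a conversation without typing the instruction themselves.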

#### Define and Run the Chatbot

In [15]:
def chatbot():

    print("Chatbot: All set!")
    print("I'm a text‑summarizing chatbot trained on the TweetSum dataset.")
    print("I'm ready to summarize any text you provide.")
    print("(Type 'quit' to stop)\n")

    while True:
        # Get user input, ignoring surrounding whitespace
        user_input = input("You: ").strip()

        # Check if the user wants to quit
        if user_input.lower() == "quit":
            break

        # Skip empty input instead of sending it to the model
        if not user_input:
            continue

        response = generate_output(tokenizer, model, user_input)

        # Print the generated text
        print(f"Chatbot: {response}\n\n")

# Run the chatbot
chatbot()

Chatbot: All set!
I'm a text‑summarizing chatbot trained on the TweetSum dataset.
I'm ready to summarize any text you provide.
(Type 'quit' to stop)

You: Summarize the text: AmazonHelp ok, I need to change the shipping address on some shipments, I just realized theyre going to the wrong place. <BR> If the order hasnt entered the shipping process then you can edit the <LINK> KM <BR> AmazonHelp I successfully did it with 2 items, but someone named murfbooks is being difficult. I told them I wanted a refund if they wont cooperate. <BR> Hi, glad to hear you were able to do this for both items. Have they submitted the refund ? CR <BR> AmazonHelp No. I only put in the initial order an hour ago. They were quick to respond the first time, but now have not replied for about 15 minutes. <BR> Ah okay Has the order been marked as shipped? NV <BR> AmazonHelp I am not sure. It says edit order, but it wont let me cuz its 3rd party. <LINK> <BR> Please allow the seller a bit longer to respond. AT <BR>