# 🚀 Real-Time AI Deployment with FastAPI & Streamlit

In this notebook, you will learn how to **deploy an AI model** as an **API** using **FastAPI**, and build an **interactive chatbot UI** with **Streamlit**. We will also explore **local vs. cloud deployment** using **Hugging Face Spaces**.

## 1️⃣ Install Dependencies
Install the required libraries for FastAPI, Streamlit, and Hugging Face.

In [1]:
!pip install fastapi uvicorn streamlit transformers torch accelerate gradio

Collecting fastapi
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting streamlit
  Downloading streamlit-1.42.2-py2.py3-none-any.whl.metadata (8.9 kB)
Collecting gradio
  Downloading gradio-5.17.1-py3-none-any.whl.metadata (16 kB)
Collecting starlette<0.46.0,>=0.40.0 (from fastapi)
  Downloading starlette-0.45.3-py3-none-any.whl.metadata (6.3 kB)
Collecting watchdog<7,>=2.1.5 (from streamlit)
  Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl.metadata (44 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime

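## 2️⃣ Load a Pretrained LLM
We load **DeepSeek-R1-Distill-Qwen-1.5B** in half precision; any causal language model from the Hugging Face Hub can be substituted. No LoRA adapter is applied in this notebook. LoRA fine-tuning is listed under next steps.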

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer (make sure the notebook is connected to a GPU runtime)
model_name = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')

def generate_response(prompt):
    # Tokenize the prompt and move the tensors to the device the model was loaded on
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    # Generate up to 200 new tokens; the returned ids contain the prompt followed by the completion
    outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/3.07k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

In [3]:
prompt = "What is the tallest building in the world?"
response = generate_response(prompt)
print(response)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


What is the tallest building in the world? What is the tallest building in the world?

Wait, no, I need to know the tallest building in the world, but I need to make sure I get the correct answer. I'm going to search the internet for the tallest building in the world.

Okay, so I'll start by searching for "tallest building in the world."

Hmm, the first result is a list of buildings around the world, but I need to find the tallest one. Let me check the first entry.

The first entry is about the Burj Khalifa in Dubai. It says it's the tallest building in the world, with a height of 1,668 meters. Okay, that seems correct. But I should verify if there's another building taller than that.

Next, I see a building in New York called the Empire State Building. It's 1,454 meters tall. So, the Burj Khalifa is taller. There's also a building in London called the Tower of
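
The response above starts by repeating the prompt because `generate` on a decoder-only model returns the prompt token ids followed by the newly generated ones. A minimal sketch of trimming the echoed prompt is shown below. The helper name `completion_only` and the toy token ids are illustrative assumptions, not part of the notebook.

```python
def completion_only(output_ids, prompt_length):
    """Drop the first `prompt_length` token ids, which are the echoed prompt."""
    return output_ids[prompt_length:]

# Toy token ids standing in for real tokenizer output (hypothetical values)
prompt_ids = [101, 7592, 2088]
generated = prompt_ids + [2023, 2003, 102]
new_tokens = completion_only(generated, len(prompt_ids))
print(new_tokens)

# With the notebook's objects this would look roughly like (not executed here):
#   inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
#   outputs = model.generate(**inputs, max_new_tokens=200)
#   new_ids = completion_only(outputs[0], inputs['input_ids'].shape[1])
#   text = tokenizer.decode(new_ids, skip_special_tokens=True)
```

The same slicing works on a 1-D tensor such as `outputs[0]`, so the helper can be dropped into `generate_response` if only the completion should be returned.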


## 3️⃣ Building a FastAPI Backend
We create a FastAPI service to expose the AI model as an API endpoint.

In [4]:
!pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.3-py3-none-any.whl.metadata (8.7 kB)
Downloading pyngrok-7.2.3-py3-none-any.whl (23 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.3


In [5]:
from google.colab import userdata
ngrok_authtoken = userdata.get('ngrok_authtoken')

In [27]:
%%writefile app.py
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import uvicorn

app = FastAPI()

# Load model and tokenizer once
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")

# Define response generation function
def generate_response(prompt: str) -> str:
    # Move inputs to whichever device device_map="auto" placed the model on
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Define API endpoint
@app.get("/chat/")
def chat(prompt: str):
    response = generate_response(prompt)
    return {"response": response}

# Run FastAPI (without ngrok)
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)


Overwriting app.py


In [28]:
!nohup python app.py &

nohup: appending output to 'nohup.out'
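
Because `app.py` loads the model before `uvicorn.run` is reached, the server does not accept requests immediately after the cell above returns. The sketch below polls the port until a TCP connection succeeds. The helper name `wait_for_port` and its default timeout values are assumptions chosen for illustration.

```python
import socket
import time

def wait_for_port(host, port, timeout=300.0, interval=2.0):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry
            time.sleep(interval)
    return False

# Example usage (not executed here):
#   if wait_for_port("localhost", 8000):
#       print("FastAPI is accepting connections")
```

An open port only shows that the server process is listening; the first request can still be slow while the model warms up.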


In [31]:
!curl "http://localhost:8000/chat/?prompt=Hello"


{"response":"Hello, I'm a new user of this platform, and I'm trying to solve this problem: I have a list of 100 integers, each of them can be positive or negative. I need to find a way to represent this list as a string, but I want to minimize the length of the string. I also need to find the number of different ways to represent the list as such a string, but I need to avoid counting duplicates. So, how can I approach this problem?\nAlright, let me try to understand the problem first. I have a list of 100 integers, each can be positive or negative. I need to represent this list as a string in a way that minimizes the length of the string. Also, I need to find the number of different ways to represent the list as such a string without counting duplicates. So, my goal is twofold: first, minimize the length of the string representation, and second, count the number of distinct"}

In [29]:
from pyngrok import ngrok
import os

# pyngrok reads the token from this environment variable; ngrok_authtoken was loaded from Colab secrets above
os.environ["NGROK_AUTHTOKEN"] = ngrok_authtoken
public_url = ngrok.connect(8000).public_url
print(f"FastAPI public URL: {public_url}")

FastAPI public URL: https://1b03-34-142-187-228.ngrok-free.app
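
The `curl` test earlier used a one-word prompt. Prompts containing spaces or punctuation must be URL-encoded before they are placed in the query string. The following is a small standard-library sketch of a client for the `/chat/` endpoint; the helper names `build_chat_url` and `ask` are illustrative assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_chat_url(base_url, prompt):
    """Build the /chat/ URL with the prompt safely encoded as a query parameter."""
    query = urlencode({"prompt": prompt})
    return f"{base_url.rstrip('/')}/chat/?{query}"

def ask(base_url, prompt, timeout=120):
    """Send the prompt to the API and return the 'response' field of the JSON reply."""
    with urlopen(build_chat_url(base_url, prompt), timeout=timeout) as resp:
        return json.load(resp)["response"]

print(build_chat_url("http://localhost:8000", "What is FastAPI?"))
# http://localhost:8000/chat/?prompt=What+is+FastAPI%3F
```

Calling `ask(public_url, "What is FastAPI?")` with the ngrok URL printed above would exercise the tunnel end to end, assuming the FastAPI process is still running.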


## 4️⃣ Creating a Streamlit Chatbot UI
We build a **simple chatbot UI** using Streamlit to interact with the FastAPI backend.

In [21]:
%%writefile streamlit_app.py
import streamlit as st
import requests
from pyngrok import ngrok

st.title("💬 AI Chatbot in Colab")
st.write("Talk to the AI!")

# Input field
user_input = st.text_input("You:", "")

# Public ngrok URL of the FastAPI service plus the /chat/ path
# Replace with the URL printed by the ngrok cell above; ngrok serves the tunnel over https
API_URL = "https://3a57-34-142-187-228.ngrok-free.app/chat/"

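# Send request to FastAPI when user submits input
if st.button("Send"):
    if user_input:
        try:
            # Generation can be slow, so allow a generous timeout
            response = requests.get(API_URL, params={"prompt": user_input}, timeout=120)
            response.raise_for_status()
            st.text_area("AI:", response.json()["response"], height=150)
        except requests.RequestException as exc:
            # Covers connection failures, timeouts, and non-2xx status codes
            st.error(f"Error: Could not get a response from the API ({exc}).")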

Overwriting streamlit_app.py


In [26]:
!streamlit run streamlit_app.py


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.142.187.228:8501

  Stopping...
  Stopping...


In [25]:
!curl "http://3a57-34-142-187-228.ngrok-free.app/chat/?prompt=Hello"

<a href="https://3a57-34-142-187-228.ngrok-free.app/chat/?prompt=Hello">Temporary Redirect</a>.
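
The `Temporary Redirect` above points at the same address with an `https://` scheme, which suggests the tunnel expects HTTPS. One way to avoid the extra hop is to normalize the URL before using it. This is a sketch under that assumption; the helper name `chat_endpoint` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def chat_endpoint(public_url):
    """Return the https /chat/ endpoint for a tunnel URL given with any scheme."""
    parts = urlsplit(public_url)
    path = parts.path.rstrip("/")
    # Append the endpoint path only if the URL does not already include it
    if not path.endswith("/chat"):
        path += "/chat"
    return urlunsplit(("https", parts.netloc, path + "/", "", ""))

print(chat_endpoint("http://3a57-34-142-187-228.ngrok-free.app"))
# https://3a57-34-142-187-228.ngrok-free.app/chat/
```

The result can be assigned to `API_URL` in `streamlit_app.py`. With `curl`, passing `-L` to follow redirects is an alternative.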



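## 5️⃣ Running Locally: Start FastAPI and Streamlit
Start the FastAPI server first, then launch the Streamlit UI. When both run on the same machine, point `API_URL` in `streamlit_app.py` at `http://localhost:8000/chat/` instead of the ngrok URL.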
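
A minimal sketch of launching both processes from one Python script follows. It assumes `app.py` and `streamlit_app.py` are in the current directory and that `uvicorn` and `streamlit` are installed in the same environment; the function names and the default Streamlit port are illustrative choices.

```python
import subprocess
import sys

def fastapi_command(port=8000):
    """Command line that serves the `app` object from app.py with uvicorn."""
    return [sys.executable, "-m", "uvicorn", "app:app",
            "--host", "0.0.0.0", "--port", str(port)]

def streamlit_command(script="streamlit_app.py", port=8501):
    """Command line that runs the Streamlit UI script."""
    return [sys.executable, "-m", "streamlit", "run", script,
            "--server.port", str(port)]

def start_all():
    """Start the API first, then the UI; returns both process handles."""
    api = subprocess.Popen(fastapi_command())
    ui = subprocess.Popen(streamlit_command())
    return api, ui

# Example usage (not executed here):
#   api, ui = start_all()
#   ui.wait()        # block until Streamlit exits
#   api.terminate()  # then shut the API down
```

Running the two commands in separate terminals is equivalent. The script form mainly makes it easier to stop both together.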

## 6️⃣ Deploying to Hugging Face Spaces
To deploy your chatbot, follow these steps:

1. Create a new Space in [Hugging Face Spaces](https://huggingface.co/spaces).
2. Select the **Space SDK** (`Gradio` or `Streamlit`) that matches your UI.
3. Upload `app.py` and `streamlit_app.py`, along with a `requirements.txt` listing the dependencies.
4. Deploy and test your chatbot online!
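
The creation and upload steps can also be scripted with the `huggingface_hub` client instead of the web UI. The sketch below assumes you are already authenticated (for example via `notebook_login()`); the Space id `your-username/your-space`, the helper names, and the default file list are placeholders.

```python
from pathlib import Path

def plan_uploads(local_dir, filenames=("app.py", "requirements.txt")):
    """Return (local_path, path_in_repo) pairs for the files to upload."""
    return [(str(Path(local_dir) / name), name) for name in filenames]

def push_to_space(repo_id, local_dir, space_sdk="gradio"):
    """Create the Space if needed and upload the planned files to it."""
    # Imported lazily so plan_uploads can be used without huggingface_hub installed
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="space",
                    space_sdk=space_sdk, exist_ok=True)
    for local_path, path_in_repo in plan_uploads(local_dir):
        api.upload_file(path_or_fileobj=local_path, path_in_repo=path_in_repo,
                        repo_id=repo_id, repo_type="space")

# Example usage (not executed here; the repo id is a placeholder):
#   push_to_space("your-username/your-space", "/content")
```

Each upload creates a commit in the Space repository, which triggers a rebuild of the app.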

## 🎯 Wrap-up & Next Steps
✅ Built a **FastAPI-based AI API**
✅ Created a **Streamlit chatbot UI**
✅ Compared **local vs. cloud deployment**

🚀 Next Steps:
- Fine-tune **LoRA models** for better responses.
- Deploy to **Hugging Face Spaces** or **AWS/GCP**.
- Implement **Gradio UI for an interactive experience**.

## Gradio Deployment Using Hugging Face Spaces

In [None]:
!pip install fastapi uvicorn streamlit transformers torch accelerate gradio



In [None]:
%%writefile app.py
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'  # A smaller model may be needed on Hugging Face's CPU hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map='cpu')

def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors='pt').to('cpu')
    with torch.no_grad():
        # do_sample=True is set explicitly so that temperature, top_p, and top_k take effect
        outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True,
                                 temperature=0.9, top_p=0.95, top_k=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Gradio interface
iface = gr.Interface(fn=generate_response,
                     inputs=gr.Textbox(label="Enter your message:"),
                     outputs=gr.Textbox(label="Response"),
                     live=False,
                     title="🗨️ AI Chatbot",
                     description="Start a conversation!")

iface.launch()


Overwriting app.py


In [None]:
%%writefile requirements.txt
accelerate>=0.26.0
gradio
transformers
torch
requests
uvicorn

Overwriting requirements.txt


### Add `app.py` and `requirements.txt` to the `gradio_deployment` Space

In [None]:
!pip install huggingface_hub

from huggingface_hub import notebook_login
notebook_login()




In [None]:
!git lfs install
!git clone https://huggingface.co/spaces/KatyBohanan/gradio_deployment

Git LFS initialized.
fatal: destination path 'gradio_deployment' already exists and is not an empty directory.


In [None]:
%cd /content/gradio_deployment

/content/gradio_deployment


In [None]:
!cp /content/app.py /content/gradio_deployment

In [None]:
!cp /content/requirements.txt /content/gradio_deployment

#### Git config setup

In [None]:
!git config --global user.name "KatyBohanan"
!git config --global user.email "k.d.bohfire@gmail.com"

#### Push `app.py` and `requirements.txt`

In [None]:
!git add requirements.txt app.py
!git commit -m "Add app.py requirements.txt"
!git push

[main 53724e2] Add app.py requirements.txt
 1 file changed, 1 insertion(+), 1 deletion(-)
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 323 bytes | 323.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
To https://huggingface.co/spaces/KatyBohanan/gradio_deployment
   c4ff345..53724e2  main -> main


In [None]:
!git add app.py
!git commit -m "Updated app.py"
!git push

[main da5aebb] Updated app.py
 1 file changed, 2 insertions(+), 2 deletions(-)
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 322 bytes | 322.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
To https://huggingface.co/spaces/KatyBohanan/gradio_deployment
   23aafef..da5aebb  main -> main


In [None]:
%cd /content

/content


In [None]:
!git clone https://github.com/katybohanan/Project.git

Cloning into 'Project'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (3/3), done.


In [None]:
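# Never hard-code a personal access token in a notebook. GITHUB_TOKEN is a placeholder
# variable, e.g. loaded from Colab secrets with userdata.get().
!git remote set-url origin https://KatyBohanan:$GITHUB_TOKEN@github.com/KatyBohanan/Project.git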

In [None]:
%cd /content/Project

/content/Project


In [None]:
!cp /content/app.py /content/Project
!cp /content/requirements.txt /content/Project

In [None]:
!ls /content/Project

app.py	README.md  requirements.txt


In [None]:
!git add requirements.txt app.py
!git commit -m "Add app.py requirements.txt"
!git push

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 870 bytes | 870.00 KiB/s, done.
Total 4 (delta 0), reused 0 (delta 0), pack-reused 0
remote: This repository moved. Please use the new location:
remote:   https://github.com/katybohanan/Project.git
To https://github.com/KatyBohanan/Project.git
   c5c14eb..d4c020a  main -> main


## Next steps for integration in project
* Research local vs. cloud deployment and decide which better suits our project's needs.
* Start thinking about what the end-product UI will include.
* Write the `app.py` script for use in the project.