<a href="https://colab.research.google.com/github/hamza13-10/Colab-Notebooks/blob/main/FastAPI_Mistral_Quantized.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install fastapi uvicorn
!pip install bitsandbytes
!pip install pyngrok



In [2]:
import os
os.environ["HF_TOKEN"] = "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"

In [3]:
from huggingface_hub import login

# Login with the token
login(token=os.getenv("HF_TOKEN"))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [4]:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [5]:
from pyngrok import ngrok
import threading
import uvicorn
import nest_asyncio
from typing import Dict, List
import time

In [6]:
nest_asyncio.apply() #allows nested event loops

In [7]:
app = FastAPI()

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [8]:
device = next(model.parameters()).device

In [9]:
print(device)

cuda:0


In [10]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
4.293 GB of memory reserved.


**Testing the Quantized Model**

In [11]:
def test_model(prompt):
    # Provide an alternating sequence of user/assistant messages
    sample_messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": ""}
    ]

    try:
        # Tokenize and prepare the input
        input_ids = tokenizer.apply_chat_template(sample_messages, return_tensors="pt").to(device)

        # Generate output from the model
        outputs = model.generate(input_ids, max_new_tokens=200)

        # Decode and print the output
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print("Model Output:", response)
        print(type(response))
    except Exception as e:
        print("An error occurred during model testing:", str(e))

In [12]:
print("Testing the model with a custom prompt...")
test_model("Can you generate a tweet for me about cats?")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Testing the model with a custom prompt...




Model Output: Can you generate a tweet for me about cats?  "A room without a cat is like a garden without a sunflower. 🐾🌻 Cats, the sunflowers of our homes. 🐈🏠 #CatLovers #PurrfectCompanions"
<class 'str'>


**Inference Endpoint**

In [13]:
# In-memory store for conversation history
session_history: Dict[str, List[Dict[str, str]]] = {}

class ChatInput(BaseModel):
    session_id: str  # Unique identifier for the session
    prompt: str
    clear_history: bool = False  # Flag to clear the chat history

class ChatOutput(BaseModel):
    response: str

@app.post("/bot", response_model=ChatOutput)
def chat_bot(input_data: ChatInput):
    try:
        session_id = input_data.session_id
        prompt = input_data.prompt
        clear_history = input_data.clear_history

        if clear_history:
            session_history.pop(session_id, None)

        if session_id not in session_history:
            session_history[session_id] = []

        chat_history = session_history[session_id]

        if prompt:
            chat_history.append({"role": "user", "content": prompt})
            chat_history.append({"role": "assistant", "content": ""})

            input_ids = tokenizer.apply_chat_template(chat_history, return_tensors="pt").to(device)

            # Generate output from the model
            outputs = model.generate(
                input_ids,
                max_new_tokens=200,
            )

            generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)

            # Update the last assistant message in the chat history with the generated response
            chat_history[-1]["content"] = generated_text

            # Return the response to the client
            return {"response": generated_text}
        else:
            return {"response": "Chat history cleared." if clear_history else "No prompt provided."}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

In [14]:
def run_uvicorn():
    uvicorn.run(app, host="0.0.0.0", port=8000)

In [15]:
uvicorn_thread = threading.Thread(target=run_uvicorn)
uvicorn_thread.start()

In [16]:
ngrok.set_auth_token("2nsMPO39cVSe3K4MuK3oEzfWQnC_5nj6Lh246McuLtR6KFwZe")

INFO:     Started server process [3855]
INFO:     Waiting for application startup.
INFO:     Application startup complete.


In [17]:
ngrok.kill()

INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


In [18]:
public_url = ngrok.connect(8000)
print(f"Public URL: {public_url}")

Public URL: NgrokTunnel: "https://6a68-34-16-210-35.ngrok-free.app" -> "http://localhost:8000"


In [None]:
# Keep the main thread alive while the Uvicorn server is running
uvicorn_thread.join()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


INFO:     203.130.22.163:0 - "POST /bot HTTP/1.1" 200 OK


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


INFO:     203.130.22.163:0 - "POST /bot HTTP/1.1" 200 OK
INFO:     203.130.22.163:0 - "POST /bot HTTP/1.1" 200 OK
INFO:     203.130.22.163:0 - "POST /bot HTTP/1.1" 200 OK


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


INFO:     203.130.22.163:0 - "POST /bot HTTP/1.1" 200 OK
