# LLM API Notebook using FastAPI + ngrok

This notebook demonstrates how to build and publish a simple chat API using a Japanese LLaMA model provided by ELYZA.  
It includes the following features:

- Load a Japanese LLM with Hugging Face Transformers  
- Run inference with PyTorch (GPU supported)  
- Serve a chat API endpoint `/predict` with FastAPI  
- Temporarily expose the local API through ngrok  

> Note: On the Google Colab free tier, the 8B model may be too heavy to run. In that case, you can switch `/predict` to the dummy function `generate_response_dummy()` (the commented-out line in the endpoint) and test the API without loading the model.

## How to Use This Notebook

1. Register your ngrok auth token as a Colab notebook secret named `NGROK_TOKEN` (the notebook reads it via Colab's `userdata`)  
2. (Optional) Select an L4 GPU runtime for better performance  
3. Run all the cells in the notebook sequentially  
4. Copy the `LLM_URL` printed by the final `main()` cell  
5. Follow the "How to use" section in the [GitHub repository](https://github.com/kenta-hirahara/Speech-Recognition) README.  
   Add the URL to your `.env` file as instructed, and the web application will launch accordingly.

In [None]:
!pip install transformers accelerate bitsandbytes fastapi uvicorn pyngrok sentencepiece

In [None]:
from pyngrok import ngrok
import torch
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import nest_asyncio
from pydantic import BaseModel
import uvicorn
from transformers import AutoModelForCausalLM, AutoTokenizer
from google.colab import userdata

## Model initialization
- Load the pretrained weights
- This notebook uses the Japanese LLM `elyza/Llama-3-ELYZA-JP-8B`

In [None]:

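model_name = "elyza/Llama-3-ELYZA-JP-8B"
# System prompt (Japanese): "You are a sincere and excellent assistant.
# Please answer intelligently and concisely."
DEFAULT_SYSTEM_PROMPT = "あなたは誠実で優秀なアシスタントです。知性的かつ簡潔に回答してください。"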
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
model.eval()

## Inference

In [None]:
def build_prompt(user_message: str) -> str:
    """
    Generate prompt from system message and user input
    """
    messages = [
        {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
        {"role": "user", "content": user_message}
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return prompt

def generate_response(user_message: str) -> str:
    """
    Generate a reply to the user message with the LLM
    """
    prompt = build_prompt(user_message)
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            token_ids.to(model.device),
            max_new_tokens=1024,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    output = tokenizer.decode(output_ids[0][token_ids.size(1):], skip_special_tokens=True)
    return output.strip()

def generate_response_dummy(user_message: str) -> str:
    """
    Return a fixed reply so the API can be tested without the model
    """
    return "dummy: Rutileaのみなさんお久しぶりです！お元気ですか？"
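
The switch between the real and dummy functions mentioned in the note at the top can be sketched as a small selector, so the endpoint code does not need to be edited by hand. This is only an illustration: the helper `select_backend` and the environment variable name `USE_DUMMY_LLM` are hypothetical and not part of this notebook.

```python
import os
from typing import Callable

# A response function takes the user message and returns the reply text.
ResponseFn = Callable[[str], str]


def select_backend(
    real: ResponseFn,
    dummy: ResponseFn,
    env_var: str = "USE_DUMMY_LLM",  # hypothetical variable name
) -> ResponseFn:
    """Return `dummy` when the environment variable is "1", otherwise `real`."""
    use_dummy = os.environ.get(env_var, "0") == "1"
    return dummy if use_dummy else real
```

With this in place, something like `respond = select_backend(generate_response, generate_response_dummy)` could be evaluated once, and the endpoint would call `respond(req.message)`.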

## FastAPI Application

In [None]:
app = FastAPI()

# Configure CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)

# Input schema
class RequestBody(BaseModel):
    message: str

# Endpoint
@app.post("/predict")
async def predict(req: RequestBody):
    response_text = generate_response(req.message)
    # response_text = generate_response_dummy(req.message)
    return {"response": response_text}
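
Once the server is running (see the final cells), the endpoint accepts a JSON body with a `message` field and returns a JSON object with a `response` field. A minimal client sketch using only the Python standard library might look like the following. The helper names `build_predict_request` and `call_predict` are hypothetical, the base URL is whatever the notebook prints as `LLM_URL`, and the 120-second timeout is an arbitrary assumption.

```python
import json
import urllib.request


def build_predict_request(base_url: str, message: str) -> urllib.request.Request:
    """Build a POST request for the /predict endpoint without sending it."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(base_url: str, message: str, timeout: float = 120.0) -> str:
    """Send the request and return the "response" field of the JSON reply."""
    request = build_predict_request(base_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]


# Example (needs the running server; replace <LLM_URL> with the printed URL):
# print(call_predict("<LLM_URL>", "こんにちは"))
```

Separating request construction from sending keeps the payload format easy to inspect without a live server.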

## Exposing the Endpoint with ngrok

In [None]:
def start_ngrok():
    """
    Start an ngrok tunnel and print its public URL
    """
    token = userdata.get("NGROK_TOKEN")
    ngrok.set_auth_token(token)
    tunnel = ngrok.connect(8000)
    print("LLM_URL:", tunnel.public_url)

## Run Application


In [None]:
def main():
    nest_asyncio.apply()
    start_ngrok()
    uvicorn.run(app, port=8000)

In [None]:
main()