# 🇯🇵 Japanese LLM (ELYZA) API Notebook using FastAPI + ngrok

This notebook demonstrates how to build and publish a simple chat API using a Japanese LLaMA model provided by ELYZA.  
It includes the following features:

- Load a Japanese LLM with Hugging Face Transformers
- Run inference with PyTorch (GPU supported)
- Serve a chat endpoint, `/predict`, with FastAPI
- Temporarily expose the local API to the internet with ngrok

> ⚠️ Note: On the Google Colab free tier, loading and running the 8B model may be too heavy for the available resources. In that case, switch the `/predict` endpoint to the dummy function `generate_response_dummy()` to test the API without the model.
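The notebook switches between the real and dummy generators by commenting a line in the endpoint. As a minimal sketch of the same idea driven by a flag, the snippet below picks a generator function at startup. `USE_DUMMY_LLM`, `pick_generator`, and `generate_response_stub` are hypothetical names introduced only for this illustration; the stub stands in for the notebook's real `generate_response()`.

```python
import os


def generate_response_dummy(user_message: str) -> str:
    """Stand-in mirroring the notebook's dummy: a fixed reply, no model needed."""
    return "dummy: ok"


def generate_response_stub(user_message: str) -> str:
    """Placeholder for the real generate_response(), which needs the loaded model."""
    raise RuntimeError("model is not loaded in this sketch")


def pick_generator(use_dummy: bool):
    """Return the function the /predict endpoint should call."""
    return generate_response_dummy if use_dummy else generate_response_stub


# USE_DUMMY_LLM is a hypothetical variable name; the notebook does not read it.
use_dummy = os.environ.get("USE_DUMMY_LLM", "1") == "1"
generate = pick_generator(use_dummy)
```

With this shape, the endpoint body would call `generate(req.message)` and never needs editing when you move between a free-tier and a GPU runtime.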

## How to Use This Notebook

1. Store your ngrok auth token as a Colab notebook secret named `NGROK_TOKEN` (the notebook reads it via Colab's `userdata`)  
2. (Optional) Select an L4 GPU runtime for better performance  
3. Run all the cells in the notebook sequentially  
4. Copy the `LLM_URL` printed by the final `main()` cell  
5. Follow the "How to use" section in the [GitHub repository](https://github.com/kenta-hirahara/Speech-Recognition) README.  
   Add the URL to your `.env` file as instructed, and the web application will launch accordingly.
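
Before wiring the URL into the web application, it can help to check the endpoint directly. The sketch below builds the JSON request that `/predict` expects (`{"message": ...}`) and reads the `response` field from the reply, using only the standard library. The helper names `build_predict_request` and `ask`, and the 120-second timeout, are assumptions made for this example.

```python
import json
import urllib.request


def build_predict_request(llm_url: str, message: str) -> urllib.request.Request:
    """Build a POST request for the notebook's /predict endpoint."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url=llm_url.rstrip("/") + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(llm_url: str, message: str, timeout: float = 120.0) -> str:
    """Send the message and return the text under the "response" key."""
    request = build_predict_request(llm_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

Usage, with the URL copied in step 4: `print(ask("<your LLM_URL>", "こんにちは"))`. Generation on an 8B model can take a while, so a generous timeout is a reasonable default.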

In [1]:
!pip install transformers accelerate bitsandbytes fastapi uvicorn pyngrok sentencepiece


In [2]:
from pyngrok import ngrok
import torch
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import nest_asyncio
from pydantic import BaseModel
import uvicorn
from transformers import AutoModelForCausalLM, AutoTokenizer
from google.colab import userdata

## Model initialization
- Load the pre-trained weights
- This notebook uses the Japanese LLM `elyza/Llama-3-ELYZA-JP-8B`

In [3]:

model_name = "elyza/Llama-3-ELYZA-JP-8B"
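# System prompt (Japanese): "You are a sincere and excellent assistant. Answer intelligently and concisely."
DEFAULT_SYSTEM_PROMPT = "あなたは誠実で優秀なアシスタントです。知性的かつ簡潔に回答してください。"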
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
model.eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_
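
The layer shapes in the model summary above are enough to estimate how much memory the weights need, which is why a small runtime may struggle. The sketch below counts parameters from those shapes. Two things are assumptions rather than facts from the notebook: that the model has a separate output projection (`lm_head`) of the same size as the embedding, since the printed summary is cut off before that point, and that `torch_dtype="auto"` resolves to a 2-byte type such as bfloat16.

```python
# Shapes copied from the printed LlamaForCausalLM summary.
VOCAB, HIDDEN, KV_DIM, MLP_DIM, LAYERS = 128256, 4096, 1024, 14336, 32


def count_parameters(separate_lm_head: bool = True) -> int:
    """Count weights from the layer shapes (no biases, as printed)."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * HIDDEN * MLP_DIM                             # gate_proj, up_proj, down_proj
    norms = 2 * HIDDEN                                     # two RMSNorm weights per layer
    total = VOCAB * HIDDEN + LAYERS * (attention + mlp + norms) + HIDDEN  # + final norm
    if separate_lm_head:
        total += VOCAB * HIDDEN  # assumption: untied output projection
    return total


def weight_gigabytes(params: int, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    n = count_parameters()
    print(f"parameters: {n:,}")
    print(f"2 bytes/param: {weight_gigabytes(n, 2):.2f} GB")
    print(f"0.5 bytes/param (4-bit, excluding overhead): {weight_gigabytes(n, 0.5):.2f} GB")
```

Under these assumptions the weights alone come to roughly 16 GB, before activations and the KV cache. The 4-bit line only illustrates what quantization (for example with the installed but unused `bitsandbytes`) would roughly save; it is not something this notebook does.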

# Inference

In [4]:
def build_prompt(user_message: str) -> str:
    """
    Generate prompt from system message and user input
    """
    messages = [
        {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
        {"role": "user", "content": user_message}
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return prompt

def generate_response(user_message: str) -> str:
    """
    Generate a reply with the LLM and return only the newly generated text
    """
    prompt = build_prompt(user_message)
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            token_ids.to(model.device),
            max_new_tokens=1024,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    output = tokenizer.decode(output_ids[0][token_ids.size(1):], skip_special_tokens=True)
    return output.strip()

def generate_response_dummy(user_message: str) -> str:
    """
    Return a fixed reply so the API can be tested without running the model
    """
    return "dummy: Rutileaのみなさんお久しぶりです！お元気ですか？"
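
To make `build_prompt()` less of a black box, here is a rough, model-free sketch of what a Llama 3 style chat template renders. It assumes the ELYZA tokenizer ships the standard Llama 3 template, which this notebook does not confirm; the authoritative string is whatever `tokenizer.apply_chat_template(...)` returns. `render_llama3_style` is a hypothetical helper written for this illustration.

```python
def render_llama3_style(messages, add_generation_prompt=True):
    """Approximate a Llama 3 style chat template as a plain string."""
    parts = ["<|begin_of_text|>"]  # BOS token is already part of the rendered text
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to write the reply next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_llama3_style([
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "こんにちは"},
    ]))
```

If the real template has this shape, it explains two choices in `generate_response()`: `add_special_tokens=False` avoids adding a second beginning-of-text token, and slicing `output_ids[0][token_ids.size(1):]` drops the prompt so only the assistant's reply is decoded.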

# FastAPI Application

In [5]:
app = FastAPI()

# Configure CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)

# Input schema
class RequestBody(BaseModel):
    message: str

# Endpoint
@app.post("/predict")
async def predict(req: RequestBody):
    response_text = generate_response(req.message)
    # To test the API without the model, swap in the dummy instead:
    # response_text = generate_response_dummy(req.message)
    return {"response": response_text}
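
One thing to be aware of: `predict` is declared `async def` but calls the blocking `generate_response()` directly, so the event loop cannot serve anything else while a reply is being generated. As an optional alternative, not what the notebook does, the sketch below moves the blocking call to a worker thread with `asyncio.to_thread` and uses a lock so only one generation runs at a time. `generate_response_stub`, `predict_in_thread`, and `demo` are hypothetical names, and the stub merely sleeps in place of the model.

```python
import asyncio
import time


def generate_response_stub(user_message: str) -> str:
    """Blocking stand-in for the notebook's generate_response()."""
    time.sleep(0.2)  # pretend the model is generating
    return f"echo: {user_message}"


async def predict_in_thread(user_message: str, gpu_lock: asyncio.Lock) -> str:
    """Run the blocking call off the event loop, one generation at a time."""
    async with gpu_lock:
        return await asyncio.to_thread(generate_response_stub, user_message)


async def demo() -> tuple[str, int]:
    """Show that the loop keeps running other tasks during generation."""
    gpu_lock = asyncio.Lock()
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    reply = await predict_in_thread("hello", gpu_lock)
    beat.cancel()
    return reply, ticks


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Inside a notebook cell, where an event loop is already running, use `await demo()` instead of `asyncio.run(demo())`. The heartbeat counter ends above zero, which shows the loop stayed responsive while the stub was "generating".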

# Exposing the Endpoint with ngrok

In [15]:
def start_ngrok():
    """
    Start an ngrok tunnel to port 8000 and print its public URL
    """
    token = userdata.get("NGROK_TOKEN")
    ngrok.set_auth_token(token)
    tunnel = ngrok.connect(8000)
    print("LLM_URL:", tunnel.public_url)

# Run Application


In [16]:
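def main():
    # Allow uvicorn to start its event loop inside the notebook's already-running loop
    nest_asyncio.apply()
    start_ngrok()
    # Blocks until the cell is interrupted
    uvicorn.run(app, port=8000)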

In [17]:
main()

LLM_URL: https://bad1-35-240-195-228.ngrok-free.app


INFO:     Started server process [1028]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

INFO:     2a02:3032:a:a818:1de5:72be:c899:c251:0 - "POST /predict HTTP/1.1" 200 OK
INFO:     2a02:3032:a:a818:1de5:72be:c899:c251:0 - "POST /predict HTTP/1.1" 200 OK
INFO:     2a02:3032:a:a818:1de5:72be:c899:c251:0 - "POST /predict HTTP/1.1" 200 OK


INFO:     Shutting down
INFO:     Finished server process [1028]
