# LLM Server with pyngrok

### Server: run this .ipynb (on Colab T4 GPU)

### Client: run post_text.py (on your PC)

After starting this .ipynb on Colab, switch to your PC and do the following:

`cd ~/GenAI/Text-to-Text`

`nano post_text.py` (set `url` to the public ngrok URL printed by the server notebook)

`python post_text.py`
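
`post_text.py` itself is not shown in this notebook. The following is a minimal sketch of what such a client could look like, assuming it POSTs JSON of the form `{"text": ...}` to the `/text` route defined in the server cell. The helper names `build_request` and `post_text` and the placeholder URL are hypothetical, and the standard-library `urllib` stands in for whatever HTTP library the real script uses.

```python
import json
import urllib.request


def build_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the server's /text route (hypothetical helper)."""
    url = base_url.rstrip("/") + "/text"
    # The server's UserData model expects a JSON object with a single "text" field.
    body = json.dumps({"text": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_text(base_url: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text as a string."""
    with urllib.request.urlopen(build_request(base_url, prompt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (replace the placeholder with the URL printed by the ngrok cell):
#   url = "https://YOUR-TUNNEL.ngrok-free.app"
#   print(post_text(url, "Why is the sky blue?"))
```

The generous timeout is a guess: beam-search generation on a T4 can take several seconds per request.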

In [1]:
!pip install fastapi uvicorn
!pip install pyngrok
!pip install accelerate
!pip install nest-asyncio


## LLM model

In [2]:
import torch
import transformers
from transformers import AutoModelForCausalLM , AutoTokenizer

# get huggingface access token from https://huggingface.co/settings/tokens, and set a Secret on Colab
from google.colab import userdata
access_token = userdata.get('hugging')

### https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

#model_name = "Q-bert/Mamba-130M"
#LLM = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype="auto", device_map="cuda") # for Mamba

#model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#model_name = "Q-bert/Mamba-3B"
#model_name = "microsoft/phi-2" # Phi-2.7B
#model_name = "openlm-research/open_llama_3b_v2"
#model_name = "google/gemma-1.1-2b-it"
model_name = "meta-llama/Llama-3.2-3B-Instruct"

print(model_name)

LLM = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda", token=access_token) # for all other models

tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)



google/gemma-1.1-2b-it


config.json:   0%|          | 0.00/618 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/34.2k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]
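
The load call above passes `torch.bfloat16`. As a hedged aside, native bfloat16 arithmetic generally needs a GPU with CUDA compute capability 8.0 or higher, while the T4 reports 7.5, so `torch.float16` may be the faster choice on a T4. A small sketch of picking the dtype from the capability tuple; `pick_dtype_name` is a hypothetical helper, not part of the notebook:

```python
def pick_dtype_name(capability: tuple) -> str:
    """Return a torch dtype name for a (major, minor) CUDA compute capability.

    Assumption: bfloat16 is only worthwhile with native support,
    which starts at compute capability 8.0.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


# Usage inside the notebook (requires torch and a CUDA device):
#   import torch
#   dtype = getattr(torch, pick_dtype_name(torch.cuda.get_device_capability()))
#   LLM = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype, ...)
```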

## HTTP Server with Ngrok

In [3]:
import getpass
import os
import threading

from fastapi import FastAPI, Request
from fastapi.responses import Response
import requests
import uvicorn

from pydantic import BaseModel

from pyngrok import ngrok, conf

## set ngrok authtoken
print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/get-started/your-authtoken")
conf.get_default().auth_token = getpass.getpass()

Enter your authtoken, which can be copied from https://dashboard.ngrok.com/get-started/your-authtoken
··········


In [4]:
# Open an ngrok tunnel to the HTTP server
public_url = ngrok.connect(5000).public_url
print(" * ngrok tunnel \"{}\" -> \"http://127.0.0.1:{}/\"".format(public_url, 5000))

# ... Update inbound traffic via APIs to use the public-facing ngrok URL

 * ngrok tunnel "https://8f62-34-105-42-193.ngrok-free.app" -> "http://127.0.0.1:5000/"


In [5]:
import nest_asyncio
nest_asyncio.apply()
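
The `/text` handler in the next cell decodes the whole output sequence, so each reply starts with the prompt itself (visible in the server log, where every `LLM:` line repeats the question). If the client should receive only the answer, one option is to trim that echo before returning. A minimal sketch, with `strip_prompt_echo` as a hypothetical helper:

```python
def strip_prompt_echo(generated: str, prompt: str) -> str:
    """Remove a leading copy of the prompt from the decoded text, if present."""
    if generated.startswith(prompt):
        # Drop the echoed prompt and the blank lines that usually follow it.
        return generated[len(prompt):].lstrip()
    return generated
```

An alternative, assuming the usual layout of `generate` output for decoder-only models, is to decode only the new tokens with `tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)`.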

In [None]:
app = FastAPI()

class UserData(BaseModel):
    text: str

@app.get("/")
def root():
    return Response("Hello World!")

@app.post("/text")
def text(user_data: UserData):
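    prompt = user_data.text
    print(prompt)

    # Tokenize the prompt and move the token ids to the GPU
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    # max_length caps prompt + generated tokens combined; beam search uses 5 beams
    output = LLM.generate(input_ids, max_length=128, num_beams=5, no_repeat_ngram_size=2)
    # output[0] holds the prompt tokens followed by the continuation
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print("LLM: "+generated_text)
    return Response(generated_text)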

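# Start the server. uvicorn.run() blocks this cell until it is interrupted,
# so it is called directly; nest_asyncio (applied above) lets it run inside
# the notebook's already-running event loop.
uvicorn.run(app, host="127.0.0.1", port=5000, log_level="info")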

INFO:     Started server process [167]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:5000 (Press CTRL+C to quit)


Hello, How are you?
LLM: Hello, How are you?

I am doing well, thank you for asking. I am happy to be of assistance in any way I can. How can I help you today?
INFO:     2407:4d00:8d00::f8:0 - "POST /text HTTP/1.1" 200 OK
Could you make me a coffee?
LLM: Could you make me a coffee?

I am unable to make physical objects, including coffee. I am a language model and I can provide information and assist with tasks, but I do not have personal experiences or the ability to create physical items.
INFO:     2407:4d00:8d00::f8:0 - "POST /text HTTP/1.1" 200 OK
Why is the sky blue?
LLM: Why is the sky blue?

The sky appears blue due to a phenomenon known as Rayleigh scattering. When sunlight enters the atmosphere, it is scattered by molecules in the air. The blue wavelengths of sunlight are scattered more strongly than the other colors of the spectrum. This is because the blue light waves have a shorter wavelength and are more likely to be deflected by individual molecules.
INFO:     2407:4d00:8d