### FastAPI

Start the app defined in `fast_api.py` with uvicorn:

```
uvicorn fast_api:app --host 0.0.0.0 --port 8000 --workers 4
```

This serves the FastAPI app on port 8000 with four worker processes.

Open the interactive docs at http://localhost:8000/docs and send a request to the `/ask` endpoint in the following format:

```
{
  "symptoms": "fever and sore throat",
  "question": "what should I do?"
}
```
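The same request can also be sent from a script rather than the docs page. The following is a minimal sketch using only the standard library. It assumes the endpoint is `/ask` (the URL used by the load tests in this README) and that the app replies with a JSON object; `build_request` and `ask` are hypothetical helper names, not part of `fast_api.py`.

```python
import json
import urllib.request

ASK_URL = "http://localhost:8000/ask"  # endpoint used by the load tests in this README


def build_request(symptoms: str, question: str, url: str = ASK_URL) -> urllib.request.Request:
    """Build a POST request carrying the JSON body shown above."""
    body = json.dumps({"symptoms": symptoms, "question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(symptoms: str, question: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the reply (assumption: the app returns a JSON object)."""
    with urllib.request.urlopen(build_request(symptoms, question), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `ask("fever and sore throat", "what should I do?")` should return the decoded response.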


### Testing Single User Load

The script below sends 100 requests one after another to `/ask` and reports latency percentiles and throughput.

```
import time

import numpy as np
import requests

FASTAPI_URL = "http://localhost:8000/ask"
payload = {
    "symptoms": "fever and cough",
    "question": "what should I do?"
}

# Send 100 sequential requests and record the latency of each successful one.
times = []
for _ in range(100):
    start = time.perf_counter()
    res = requests.post(FASTAPI_URL, json=payload)
    elapsed = time.perf_counter() - start

    if res.status_code == 200:
        times.append(elapsed)
    else:
        print(f"Error: {res.status_code}, Response: {res.text}")

times = np.array(times)
print(f"Median: {np.median(times)*1000:.2f} ms")
print(f"95th Percentile: {np.percentile(times,95)*1000:.2f} ms")
print(f"99th Percentile: {np.percentile(times,99)*1000:.2f} ms")
# Requests are sequential, so the total time is the sum of the latencies.
print(f"Throughput: {len(times)/np.sum(times):.2f} req/s")
```

Output:

```
📊 Median: 682.74 ms
📊 95th Percentile: 732.81 ms
📊 99th Percentile: 862.46 ms
🚀 Throughput: 1.45 req/s
```


### Multi User Load

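The script below sends 200 requests to `/ask` from 16 concurrent threads.

```
import concurrent.futures
import time

import numpy as np
import requests

FASTAPI_URL = "http://localhost:8000/ask"
payload = {
    "symptoms": "headache and dizziness",
    "question": "should I take medicine?"
}

def send_request(_=None):
    """Send one request; return its latency in seconds, or None on a non-200 response."""
    start = time.perf_counter()
    res = requests.post(FASTAPI_URL, json=payload)
    return time.perf_counter() - start if res.status_code == 200 else None

num_threads = 16
num_requests = 200

# Time the whole run so throughput reflects wall-clock time, not summed latencies.
wall_start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
    results = list(executor.map(send_request, range(num_requests)))
wall_time = time.perf_counter() - wall_start

# Drop failed requests
results = np.array([r for r in results if r is not None])
print(f"Threads: {num_threads}, Requests: {len(results)}")
print(f"Median: {np.median(results)*1000:.2f} ms")
print(f"95th: {np.percentile(results,95)*1000:.2f} ms")
print(f"99th: {np.percentile(results,99)*1000:.2f} ms")
print(f"Throughput: {len(results)/wall_time:.2f} req/s")
```

Output from the recorded run:

```
Threads: 16, Requests: 200
Median: 7180.39 ms
95th: 9958.40 ms
99th: 11571.84 ms
Throughput: 0.14 req/s
```

The throughput in this recorded run was computed as successful requests divided by the sum of per-request latencies. With 16 requests in flight at once, that sum is roughly 16 times the elapsed time, so 0.14 req/s understates the real throughput by about that factor. The script above divides by wall-clock time instead.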
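To see how far apart the two ways of computing throughput can be under concurrency (requests divided by summed latencies versus requests divided by wall-clock time), here is a small sketch that needs no server. It swaps the HTTP call for a hypothetical `fake_request` stub that sleeps for a fixed 50 ms, so the numbers say nothing about the FastAPI app itself.

```python
import concurrent.futures
import time


def fake_request(_=None, latency_s: float = 0.05) -> float:
    """Stand-in for one HTTP call: sleep for a fixed time and return the measured latency."""
    start = time.perf_counter()
    time.sleep(latency_s)
    return time.perf_counter() - start


def run_load(num_requests: int, num_threads: int):
    """Return (latency-sum throughput, wall-clock throughput) in req/s."""
    wall_start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as pool:
        latencies = list(pool.map(fake_request, range(num_requests)))
    wall_time = time.perf_counter() - wall_start
    return len(latencies) / sum(latencies), len(latencies) / wall_time


by_latency_sum, by_wall_clock = run_load(num_requests=16, num_threads=4)
print(f"requests / sum(latencies): {by_latency_sum:.1f} req/s")
print(f"requests / wall-clock time: {by_wall_clock:.1f} req/s")
```

With four threads the wall-clock figure should come out close to four times the latency-sum figure, because the sleeps overlap.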


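### Triton Inference Server

Run the following from the `Optimisations` directory to create the model repository and copy in the quantized ONNX model:

`mkdir -p triton_model_repo/gpt2_quantized/1`

`cp models/gpt2_quantized_static.onnx triton_model_repo/gpt2_quantized/1/model.onnx`

Check that `config.pbtxt` exists in `triton_model_repo/gpt2_quantized/`, next to the version directory `1`.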
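The layout check can be scripted. The sketch below assumes the repository layout created by the commands above, with `config.pbtxt` in the model directory and `model.onnx` under version `1`; `check_model_repo` is a hypothetical helper name.

```python
from pathlib import Path


def check_model_repo(repo: Path, model: str, version: str = "1") -> list[str]:
    """Return the expected files that are missing; an empty list means the layout looks complete."""
    model_dir = repo / model
    expected = [
        model_dir / "config.pbtxt",
        model_dir / version / "model.onnx",
    ]
    return [str(p) for p in expected if not p.is_file()]


# Run from the Optimisations directory.
missing = check_model_repo(Path("triton_model_repo"), "gpt2_quantized")
print("missing:", missing if missing else "nothing")
```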

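Create a `docker-compose-triton.yaml` file. It maps Triton's HTTP endpoint to host port 8000, the same port the FastAPI app uses, so stop the uvicorn server first or change the host port.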

```
version: "3.8"
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:23.10-py3
    restart: always
    shm_size: 1g
    ports:
      - "8000:8000"   # HTTP
      - "8001:8001"   # gRPC
      - "8002:8002"   # Metrics
    volumes:
      - ./triton_model_repo:/models
    command: tritonserver --model-repository=/models
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]  # Remove the whole deploy block when running on CPU
```
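Before running any benchmark it helps to confirm that the server and the model have finished loading. The sketch below polls Triton's HTTP readiness endpoints (`/v2/health/ready` and `/v2/models/<name>/ready`). It assumes the port mapping from the compose file above; the helper names, attempt count, and delay are illustrative choices.

```python
import time
import urllib.request
from typing import Optional


def ready_url(base: str = "http://localhost:8000", model: Optional[str] = None) -> str:
    """Server readiness URL, or the per-model readiness URL when a model name is given."""
    if model is None:
        return f"{base}/v2/health/ready"
    return f"{base}/v2/models/{model}/ready"


def wait_until_ready(url: str, attempts: int = 30, delay_s: float = 2.0) -> bool:
    """Poll the URL until it answers 200 or the attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused or HTTP error: not ready yet, retry after the delay
        time.sleep(delay_s)
    return False
```

After `docker compose -f docker-compose-triton.yaml up -d`, calling `wait_until_ready(ready_url(model="gpt2_quantized"))` should return `True` once the model is loaded.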

Run `perf_analyzer` against the model with batch size 1:

```
perf_analyzer -m gpt2_quantized \
  -u localhost:8000 \
  -b 1 \
  --shape input_ids:1,20
```

To sweep concurrency from 1 to 16 with batch size 8 and report 95th-percentile latency:

```
perf_analyzer -m gpt2_quantized \
  -u localhost:8000 \
  -b 8 \
  --shape input_ids:8,20 \
  --concurrency-range 1:16 \
  --percentile 95
```

### Dynamic Batching

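Add the following to `config.pbtxt`, then restart Triton so the change takes effect. Dynamic batching only applies when the model's `max_batch_size` is greater than 0.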

```
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
}
```

Rerun `perf_analyzer` with batch size 4 and export the results to `perf_batch.csv`:

```
perf_analyzer -m gpt2_quantized \
  -u localhost:8000 \
  -b 4 \
  --shape input_ids:4,20 \
  --concurrency-range 1:16 \
  --percentile 95 \
  --export-csv-path perf_batch.csv
```
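The exported CSV can be inspected with a few lines of Python to find the concurrency level with the highest throughput. This is a sketch only: the column names `Inferences/Second` and `Concurrency` are assumptions about the `perf_analyzer` CSV header and should be checked against the first line of `perf_batch.csv`; `best_row` is a hypothetical helper name.

```python
import csv


def best_row(csv_path: str, throughput_col: str = "Inferences/Second") -> dict:
    """Return the CSV row with the highest value in the throughput column."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"no data rows in {csv_path}")
    # throughput_col is an assumed header name; pass the real one if it differs.
    return max(rows, key=lambda r: float(r[throughput_col]))
```

For example, `best_row("perf_batch.csv")["Concurrency"]` would give the concurrency level of the best run, provided those column names match the file.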



### Scaling

To run multiple instances of the model on a single GPU, add this to `config.pbtxt`:

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]
```

To spread instances across multiple GPUs, list one group per GPU:

```
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [1]
  }
]

```

Test the scaled configuration at higher concurrency:

```
perf_analyzer -m gpt2_quantized \
  -u localhost:8000 \
  -b 8 \
  --shape input_ids:8,20 \
  --concurrency-range 8:32 \
  --percentile 95
```