# Running an API endpoint locally

This notebook is a simple walkthrough of how to serve a local API endpoint with FastAPI, using a publicly available Graphcore model on the IPU, from a Docker image. 

Here, we'll cover:

* Pulling the example image for [Stable Diffusion 2 Text-to-image]() for inference 
* Starting the FastAPI service on your machine from the image to create a locally hosted endpoint
* Sending requests to the endpoint and receiving model output

The public model inference images available on Graphcore's Docker Hub have all of the necessary dependencies 'baked in', including executables and model binaries, to make serving an endpoint as smooth as possible. The internals of the image are based on the model-serving architecture of the [api-deployment]() repository, which is designed to be a straightforward example of serving a model with FastAPI and starting a local endpoint. Once you've tested your local endpoint, you can use the same container to launch a deployment in [Paperspace]()!


Before starting the notebook, install the necessary dependencies to run the endpoint demo. We will use [Gradio]() to create a basic in-notebook GUI using the Stable Diffusion model endpoint served by the image.

In [None]:
! pip install gradio
! pip install matplotlib

Specify the address of the Docker image to pull using an environment variable. The address is simply the Docker Hub account name (`gcapidev` in the cell below) followed by the name of the image. In notebooks, the `!` prefix runs shell commands, which we will use to get the image into the notebook environment:

In [None]:
%env DOCKERHUB_IMAGE=gcapidev/stable-diffusion-2-512-deployment
%env LOCAL_IP_ADDRESS=localhost

## Pull the image from Docker Hub

To check that the environment variable is set correctly, print it with `echo`.

In [None]:
! echo $DOCKERHUB_IMAGE

Next, use the `docker pull` command to get the image from the Graphcore image repository.

In [None]:
! docker pull $DOCKERHUB_IMAGE

To check the image has been pulled into the environment, we can check the list of locally available docker images and use `grep` to search for the name of our image:

In [None]:
! docker image list | grep $DOCKERHUB_IMAGE

## Run the Docker image

The image we pulled has the model, executables and FastAPI endpoint built in. When we run the image, it prepares the model, loads the executables and starts the endpoint. Before sending requests, we need to wait until the model is ready to receive them. Once all the binaries have been loaded and the model graphs compiled, the console output shows the IP address that the endpoint is being served at.

In this case, there is only one model present in the image, but some images may contain multiple models, serving up multiple endpoints for use. You can modify which models the image should prepare by specifying the `SERVER_MODELS` environment variable within the `docker run` command.

Ensure the environment variables `IPUOF_VIPU_API_HOST` and `IPUOF_VIPU_API_PARTITION_ID` are set on the host machine.

In [None]:
! echo $IPUOF_VIPU_API_HOST
! echo $IPUOF_VIPU_API_PARTITION_ID
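
An empty line in the `echo` output is easy to overlook. As a minimal sketch, the same check can be done in Python so that any missing variable is named explicitly. The helper `missing_env_vars` is a hypothetical name introduced here for illustration; the two variable names are the ones given in the text above.

```python
import os

# Variable names taken from the text above.
REQUIRED_VARS = ("IPUOF_VIPU_API_HOST", "IPUOF_VIPU_API_PARTITION_ID")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All required IPU variables are set.")
```

The `environ` argument is optional and only exists so the helper can be tried against a plain dictionary instead of the real environment.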

`-e` sets environment variables inside the container, including `POPTORCH_CACHE_DIR`, `HUGGINGFACE_HUB_CACHE` and `HF_HOME`. These are the model cache directory variables recognised by the PopTorch framework and the Hugging Face model used in the container.

The last line of the command specifies the name of the image we want to run, in this case the image we downloaded from Docker Hub. In this notebook, we will run the image in a **detached** state using `-d`. The model endpoint images do not 'finish' running, so their output to `stdout` never ends, and a cell running the command attached to the terminal would never stop. The terminal output is still useful to view, however, because it shows when `uvicorn` has started and the image is ready to receive requests.

You can run this step on the notebook terminal by clicking the small notebook symbol on the left navigation bar in your Paperspace console, running the command in the attached state to view the live output:

```
docker run \
    -e POPTORCH_CACHE_DIR=/src/model_cache \
    -e HUGGINGFACE_HUB_CACHE=/src/model_cache/ \
    -e HF_HOME=/src/model_cache/ \
    --env-file <(env | grep IPU) \
    --network host \
    --device=/dev/infiniband/ \
    --cap-add=IPC_LOCK \
    $DOCKERHUB_IMAGE
```

If you run the image in the terminal, skip the next cell so that you don't start two containers at once.

In [None]:
# Run the image in the notebook in a detached state. You will not be able to view the terminal output.

! docker run -d \
    -e POPTORCH_CACHE_DIR=/src/model_cache \
    -e HUGGINGFACE_HUB_CACHE=/src/model_cache/ \
    -e HF_HOME=/src/model_cache/ \
    --env-file <(env | grep IPU) \
    --network host \
    --device=/dev/infiniband/ \
    --cap-add=IPC_LOCK \
    $DOCKERHUB_IMAGE

You can now view the running containers and container IDs, including the above one, by running:

In [None]:
! docker ps

When you are finished with the container, remember to stop and then delete it, so that it detaches from any devices it is attached to.

From the containers listed by the `docker ps` command, find your container using the `IMAGE` column. Then uncomment the following lines and replace `<container ID>` with the corresponding value from the `CONTAINER ID` column.

In [None]:
# ! docker stop <container ID>
# ! docker rm <container ID>
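
Finding the right row by eye works for one container. As a rough sketch, the lookup can also be scripted by asking `docker ps` for just the ID and image of each container and filtering on the image name. The helper names `find_container_ids` and `list_running` are hypothetical and introduced here for illustration; the `--format` template uses Docker's standard `.ID` and `.Image` placeholders.

```python
import subprocess


def find_container_ids(ps_output, image):
    """Return IDs of containers started from `image` (with or without a tag).

    Expects one "<id> <image>" pair per line.
    """
    ids = []
    for line in ps_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        container_id, container_image = parts
        if container_image == image or container_image.startswith(image + ":"):
            ids.append(container_id)
    return ids


def list_running(image):
    """Ask Docker for running containers and filter them by image name."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.ID}} {{.Image}}"],
        capture_output=True, text=True, check=True,
    )
    return find_container_ids(result.stdout, image)
```

For example, `list_running("gcapidev/stable-diffusion-2-512-deployment")` would return the IDs to pass to `docker stop` and `docker rm`, assuming Docker is available in the environment where the notebook runs.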

## Using the endpoint

First, import the necessary packages for this stage. We will use the `requests` package to send HTTP requests to the endpoint.

In [None]:
import requests
import json
import random

Once the model has been initialised and compiled, the endpoint is running. We can confirm this by sending a `GET` request to the `/readiness` route. This is a health check that returns the endpoint's state when called.

The server is not immediately ready to use, so after running `docker run` in a detached state, we need to wait for the endpoint to be ready before performing inference. For this, we use a simple function that loops until the `GET` request to the health check returns the passing message.

In [None]:
import time

def wait_for_readiness(url):
    """Poll the /readiness health check every 2 seconds until it passes."""
    while True:
        try:
            message = requests.get(f"{url}/readiness").json()["message"]
        except (requests.exceptions.RequestException, ValueError, KeyError, TypeError):
            # Server not reachable yet, or the reply was not the expected JSON.
            message = None
        if message == "Readiness check succeeded.":
            print(f"Server ready - {message}")
            return True
        if message is not None:
            print(f"Server waiting - {message}")
        time.sleep(2)

print("Waiting for readiness...")

warmup_start = time.perf_counter()
ready = wait_for_readiness("http://localhost:8100")

print(f"Warm up time: {time.perf_counter() - warmup_start}s")
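
The loop above waits indefinitely, which is awkward if the container failed to start. As a minimal sketch, a variant with a timeout could look like the following. The name `wait_until` and the default of 600 seconds are assumptions chosen for illustration, not values taken from the image or its documentation.

```python
import time


def wait_until(check, timeout_s=600.0, interval_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `check()` until it returns True or `timeout_s` elapses.

    Exceptions raised by `check` are treated as "not ready yet".
    Returns True on success and False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # e.g. connection refused while the server is starting
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With `requests` imported as above, the readiness check could be passed in as `lambda: requests.get("http://localhost:8100/readiness").json()["message"] == "Readiness check succeeded."`. The `sleep` and `clock` arguments are only there so the function can be exercised without real waiting.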

The message should say 'Readiness check succeeded', which means we are ready to start generating images with the model using the live endpoint.

Let's create a dictionary of the parameters to send to the model. These are specific to, and defined by, the model endpoint that has been created. For Stable Diffusion, we must pass:
* `prompt`: Main body of text describing the image we want to create.
* `random_seed`: A fixed value gives a deterministic image output from the same prompt each time (we set this to a random value to observe variation in the image).
* `guidance_scale`: Specific to Stable Diffusion, it controls how strongly the generated image follows the text prompt.
* `return_json`: Defines whether to return a JSON object in the response or not. To receive an encoded image, we set this to `True`.
* `negative_prompt`: Defines any aspects we don't want to see in the image.
* `num_inference_steps`: The number of sampling steps undertaken by the model. Increasing this should improve the quality of the generated image up to a point; 25-50 steps is a reasonable range.

In [None]:
model_params = {
      "prompt": "big red dog",
      "random_seed": random.randint(0,99999999),
      "guidance_scale": 9,
      "return_json": True,
      "negative_prompt": "string",
      "num_inference_steps": 25
}
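
A misspelled key is easy to miss in a hand-written dictionary. As a rough sketch, a small check run before sending can catch this early. The helper `check_model_params` is hypothetical; the key names are the ones listed above, and the rule that `num_inference_steps` must be a positive integer is an assumption rather than something confirmed by the endpoint.

```python
# Keys listed in the parameter description above.
REQUIRED_KEYS = {
    "prompt", "random_seed", "guidance_scale",
    "return_json", "negative_prompt", "num_inference_steps",
}


def check_model_params(params):
    """Return a list of readable problems; an empty list means the dict looks fine."""
    problems = []
    missing = sorted(REQUIRED_KEYS - set(params))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    unknown = sorted(set(params) - REQUIRED_KEYS)
    if unknown:
        problems.append("unexpected keys: " + ", ".join(unknown))
    steps = params.get("num_inference_steps")
    # Assumption: the step count should be a positive integer.
    if steps is not None and (not isinstance(steps, int) or steps < 1):
        problems.append("num_inference_steps should be a positive integer")
    return problems
```

Calling `check_model_params(model_params)` on the dictionary above should return an empty list, while a key typed with a space instead of an underscore would be reported as unexpected.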

Next, we can use `requests` to send a POST call to the REST endpoint at the IP address that the endpoint is running on. This will return an image in the response JSON body.

In [None]:
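# The container runs with --network host, so the endpoint is reachable on localhost.
response = requests.post("http://localhost:8100/stable_diffusion_2_txt2img_512", json=model_params)

# Stop with a clear error if the request failed, rather than parsing an error body.
response.raise_for_status()

response = response.json()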

The image is returned in Base64-encoded form within the JSON. We can decode it using the `base64` and `io` libraries to visualise it. First, we decode the images returned by the model and open them as PIL images. In this case there is only one image.

In [None]:
from PIL import Image
import base64
import io

images_b64 = [i for i in response['images']]

pil_images = []
for b64_img in images_b64:
    base64bytes = base64.b64decode(b64_img)
    bytesObj = io.BytesIO(base64bytes)
    img = Image.open(bytesObj)
    
    pil_images.append(img)
    
print("Number of images returned: ", len(pil_images))
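
If `Image.open` fails, it helps to know what the decoded bytes actually contain. The following is a minimal sketch using only the standard library that decodes a Base64 string and guesses the format from its leading bytes. The helper `sniff_image_format` is hypothetical, and which format the endpoint returns is not stated in this notebook, so both PNG and JPEG are checked.

```python
import base64

# Leading "magic" bytes of two common image formats.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def sniff_image_format(b64_string):
    """Decode a Base64 string and guess the image format from its first bytes."""
    raw = base64.b64decode(b64_string)
    if raw.startswith(PNG_MAGIC):
        return "png"
    if raw.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"
```

For example, `sniff_image_format(images_b64[0])` reports the format of the first returned image. A result of `"unknown"` suggests the field holds something other than raw image data, such as an error message.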

Once the images have been converted, they can be viewed:

In [None]:
import matplotlib.pyplot as plt

plt.axis('off')
plt.imshow(pil_images[0])
plt.show()

Having tested the endpoint as above, we have everything needed to build a basic GUI that runs inference on the model through the endpoint. Let's create a simple Gradio app for this, wrapping the POST request and the image decoding into a short function that serves as the core of the app:

In [None]:
import gradio as gr
import numpy as np

def stable_diffusion_2_inference(prompt, guidance_scale, num_inference_steps):
    model_params = {
      "prompt": prompt,
      "random_seed": random.randint(0,99999999),
      "guidance_scale": guidance_scale,
      "return_json": True,
      "negative_prompt": "string",
      "num_inference_steps": num_inference_steps
    }
    
    response = requests.post("http://localhost:8100/stable_diffusion_2_txt2img_512", json=model_params)
    response = response.json()
    
    images_b64 = [i for i in response['images']]
    pil_images = []
    for b64_img in images_b64:
        base64bytes = base64.b64decode(b64_img)
        bytesObj = io.BytesIO(base64bytes)
        img = Image.open(bytesObj)

        pil_images.append(img)
    
    return np.array(pil_images[0])

In [None]:
gr.close_all()
demo = gr.Interface(
    fn=stable_diffusion_2_inference, 
    inputs=[gr.Textbox(value="Ice skating on the moon"),
            gr.Slider(1,50,value=9, step=1, label='Guidance scale'),
            gr.Slider(1,100,value=25, step=1, label='Number of steps')
           ], 
    outputs=gr.Image()  # the model already returns 512x512 images
    )

demo.launch(share=True)

And that's it!

To recap, this notebook has covered how to pull a public Docker image containing a Graphcore model endpoint, serve the endpoint on your local machine, and check its readiness. Finally, we sent prompts to the API and received the model's output, first directly in the workspace and then through a simple Gradio frontend.