## Serving ML Models Using Web Servers

### Model Serving

- Sharing results with others (team members, other business orgs, customers, web services, applications)
- Batch approach: dump predictions to a database (quite popular)
- Real-time approach: send a test feature vector and get back the prediction (the inference step happens just-in-time)
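
The batch approach can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `predict` is a hypothetical stand-in for a trained model, and the in-memory SQLite database and the `predictions` table name are assumptions made for the example.

```python
import sqlite3


def predict(x):
    # Hypothetical stand-in for a trained model's inference step
    return 2.0 * x + 2.0


def dump_predictions(conn, inputs):
    """Score all inputs at once and store the results in a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions (input REAL, prediction REAL)"
    )
    rows = [(x, predict(x)) for x in inputs]
    conn.executemany("INSERT INTO predictions VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# An in-memory database keeps the sketch self-contained
conn = sqlite3.connect(":memory:")
n_written = dump_predictions(conn, [1.0, 2.0, 3.0])

# Consumers later read predictions from the table instead of calling the model
stored = conn.execute(
    "SELECT input, prediction FROM predictions ORDER BY input"
).fetchall()
print(n_written, stored)
```

In the real-time approach, by contrast, `predict` would be called once per incoming request rather than over a whole batch.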

### How to consume from prediction services?

- Using web requests (e.g., using a JSON payload)
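
As a rough illustration of what a JSON payload looks like, the sketch below serializes a feature vector to text and decodes it again. The key name `features` and the values are assumptions made for the example, not a format required by any particular service.

```python
import json

# Hypothetical feature vector; the key name "features" is an assumption
feature_vector = [5.1, 3.5, 1.4, 0.2]

# Serialize to a JSON string that can travel in the body of a web request
payload = json.dumps({"features": feature_vector})
print(payload)

# The receiving service decodes the text back into native objects
decoded = json.loads(payload)
print(decoded["features"])
```

Because the payload is plain text, the client and the service do not need to share a language or runtime. With the `requests` library, passing `json=...` to `requests.post` performs this serialization for you.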

### How to output predictions?

- We will set up a server to serve predictions
  - It will respond to web requests (GET, POST)
    - We pass some inputs (image, text, vector of numbers), and get some outputs (just like a function).
    - The environment from which we pass inputs may be very different from the environment where the prediction happens (e.g., different hardware).
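
The "server as a function" idea can be sketched with the Python standard library alone. Note the swap: this uses `http.server` and `http.client` rather than Flask and `requests`, so that it runs without extra installs. The `predict` function is a hypothetical stand-in for a model, and the parameter names `x` and `y` are assumptions.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def predict(x):
    # Hypothetical stand-in for a real model's inference step
    return 2.0 * x + 2.0


class PredictionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read inputs from a query string such as /?x=1&y=4
        query = parse_qs(urlparse(self.path).query)
        try:
            x = float(query["x"][0])
            y = float(query["y"][0])
        except (KeyError, ValueError):
            self.send_error(400, "expected numeric query parameters x and y")
            return
        body = json.dumps(
            {"prediction1": predict(x), "prediction2": predict(y)}
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), PredictionHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client only sees inputs going in and outputs coming back
conn = HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/?x=1&y=4")
resp = conn.getresponse()
status = resp.status
result = json.loads(resp.read())
conn.close()

server.shutdown()
server.server_close()
print(status, result)
```

Here the client and server share one process for convenience. In practice they would typically run on different machines, which is why the two environments can differ.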



### Our Objective

We will learn how to serve model predictions via the following steps:

1. We will understand the key idea behind the `flask` web framework (mapping URL routes to functions) through an example Flask app.
2. We will use the `requests` module from a Jupyter notebook, which is one programmatic way to get information from other machines on the internet. Alternatively, one can use command-line tools such as `curl` or commercial/GUI tools such as Postman (these serve different needs of end users).
3. We will integrate a PyTorch model with Flask to set up a prediction server (see the exercises section for how to use `gunicorn` and the Heroku PaaS). Integrating the model with the app is relatively easy if the model can be read from disk.
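
The idea of mapping URL routes to functions can be illustrated without any web framework. The toy dispatcher below is only a sketch of the concept and is not how Flask is implemented. All names in it (`routes`, `route`, `dispatch`, and the two view functions) are hypothetical.

```python
# Registry that maps a URL path to the function that handles it
routes = {}


def route(path):
    """Decorator that registers a function as the handler for a path."""
    def decorator(func):
        routes[path] = func
        return func
    return decorator


@route("/")
def index():
    return "hello"


@route("/predict")
def predict_view():
    return "prediction goes here"


def dispatch(path):
    """Look up the handler for a path and return (status code, body)."""
    handler = routes.get(path)
    if handler is None:
        return 404, "Not Found"
    return 200, handler()


print(dispatch("/"))
print(dispatch("/predict"))
print(dispatch("/missing"))
```

A web framework adds the networking, request parsing, and response formatting around this lookup, but the decorator-based registration has the same general shape.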

### Making API calls

 - Most of the internet works via HTTP requests.
 - The key idea is that a requester (a client) sends some information to a server located at a unique address (the IP address).
 - The server in turn processes the request and sends back a response to the client, wherever it is.
 - There are various types of requests:
   - GET: mostly used to access read-only data from the server
   - POST: mostly used to modify some information on the server (e.g., new user registration)
   - PUT: mostly used to create or replace a resource at a specific URL

Below is our first example of making a GET request:

In [13]:
import requests
for x in range(1,5):
    print(x)
    res = requests.get(f'http://127.0.0.1:5000/?x={x}&y={x + 3}')

    print('Response code:',res)
    print('Returned text: ',res.text[:300])

1
Response code: <Response [200]>
Returned text:  {"input1":"1","input2":"4","prediction1":4.0,"prediction2":10.0}

2
Response code: <Response [200]>
Returned text:  {"input1":"2","input2":"5","prediction1":6.0,"prediction2":12.0}

3
Response code: <Response [200]>
Returned text:  {"input1":"3","input2":"6","prediction1":8.0,"prediction2":14.0}

4
Response code: <Response [200]>
Returned text:  {"input1":"4","input2":"7","prediction1":10.0,"prediction2":16.0}



In [8]:
res.text

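'{"input1":"5","input2":"3","prediction1":12.0,"prediction2":8.0}\n'

The `text` attribute holds the raw response body as a string. Note that this particular output comes from a separate request with `x=5` and `y=3`, not from the last iteration of the loop above.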
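
Since the body is JSON, it can be decoded into a dictionary before use. The sketch below does this with the standard `json` module, using the response text copied from the output above, so it runs without a server.

```python
import json

# Response body copied from the output above (note the trailing newline)
text = '{"input1":"5","input2":"3","prediction1":12.0,"prediction2":8.0}\n'

# Decode the JSON text into a Python dictionary
data = json.loads(text)
print(data["prediction1"], data["prediction2"])

# The inputs are echoed back as strings, while the predictions are floats
print(type(data["input1"]).__name__, type(data["prediction1"]).__name__)
```

With a live `requests` response, calling `res.json()` performs the same decoding.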

A status code of 200 means the request succeeded. A 4xx code indicates a problem with the client's request (e.g., 404 Not Found), and a 5xx code indicates a problem on the server (e.g., 500 Internal Server Error). We will likely run into many 5xx codes when we deploy our models, so we should learn to debug them properly (more on this later).
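
The status code ranges can be made concrete with a small helper. This is a minimal sketch: `status_class` is a hypothetical function written for this example, while `HTTPStatus` comes from the Python standard library.

```python
from http import HTTPStatus


def status_class(code):
    """Return the broad class of an HTTP status code."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 404, 500):
    # HTTPStatus supplies the standard reason phrase for each code
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

With `requests`, the numeric code is available as `res.status_code`, and `res.raise_for_status()` raises an exception for 4xx and 5xx responses.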

Similar GET requests can be made with `curl`, a command-line utility available on Ubuntu/Debian and other Linux distros. The `-o` flag writes the response body to a file:
```bash
curl -o output.json https://httpbin.org/get
curl -o temp.html https://theja.org
```

Finally, requests can also be made using GUI programs such as [Postman](https://www.postman.com/).

In [16]:
!curl http://127.0.0.1:5000/?x=5&y=3

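zsh:1: no matches found: http://127.0.0.1:5000/?x=5

The command fails because the URL is unquoted: zsh treats `?` as a glob wildcard and `&` as the background operator, so the request is never sent. Wrapping the URL in single quotes, as in `!curl 'http://127.0.0.1:5000/?x=5&y=3'`, avoids both problems.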
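
Query strings can also be assembled programmatically rather than by hand. In the sketch below, which uses only the standard library, `urlencode` builds the query string and `shlex.quote` produces a shell-safe form of the resulting URL. The base URL and parameter values mirror the ones used above.

```python
import shlex
from urllib.parse import urlencode

base_url = "http://127.0.0.1:5000/"

# urlencode joins the parameters with & and escapes special characters
query = urlencode({"x": 5, "y": 3})
url = f"{base_url}?{query}"
print(url)

# shlex.quote wraps the URL in quotes so the shell passes it through unchanged
command = f"curl {shlex.quote(url)}"
print(command)
```

When using `requests`, the same effect is achieved by passing a dictionary through the `params` argument, for example `requests.get(base_url, params={"x": 5, "y": 3})`.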
