In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

from sklearn.ensemble import RandomForestClassifier

import xgboost as xgb

### Data preparation

In [2]:
data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-06-trees/CreditScoring.csv'
df = pd.read_csv(data)

In [3]:
df.columns = df.columns.str.lower()

status_values = {
    1: 'ok',
    2: 'default',
    0: 'unk'
}

df.status = df.status.map(status_values)

home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}

df.home = df.home.map(home_values)

marital_values = {
    1: 'single',
    2: 'married',
    3: 'widow',
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}

df.marital = df.marital.map(marital_values)

records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}

df.records = df.records.map(records_values)

job_values = {
    1: 'fixed',
    2: 'partime',
    3: 'freelance',
    4: 'others',
    0: 'unk'
}

df.job = df.job.map(job_values)

for c in ['income', 'assets', 'debt']:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)

df = df[df.status != 'unk'].reset_index(drop=True)

In [4]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=11)

df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = (df_train.status == 'default').astype('int').values
y_test = (df_test.status == 'default').astype('int').values

del df_train['status']
del df_test['status']

In [5]:
dv = DictVectorizer(sparse=False)

train_dicts = df_train.fillna(0).to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

test_dicts = df_test.fillna(0).to_dict(orient='records')
X_test = dv.transform(test_dicts)

### Random forest

In [6]:
rf = RandomForestClassifier(n_estimators=200,
                            max_depth=10,
                            min_samples_leaf=3,
                            random_state=1)
rf.fit(X_train, y_train)

### XGBoost

Note: the feature names are no longer passed when building the `DMatrix`.

Previously:

```python
features = dv.get_feature_names_out()
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
```

Now:

```python
dtrain = xgb.DMatrix(X_train, label=y_train)
```

In [7]:
dtrain = xgb.DMatrix(X_train, label=y_train)

In [8]:
xgb_params = {
    'eta': 0.1, 
    'max_depth': 3,
    'min_child_weight': 1,

    'objective': 'binary:logistic',
    'eval_metric': 'auc',

    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=175)

### BentoML

In [9]:
import bentoml

In [10]:
bentoml.xgboost.save_model(
    'credit_risk_model',
    model,
    custom_objects={
        'dictVectorizer': dv
    })

Model(tag="credit_risk_model:4kir6mcr6265t4os", path="/home/martin/bentoml/models/credit_risk_model/4kir6mcr6265t4os/")

Test

In [11]:
import json

In [12]:
request = df_test.iloc[0].to_dict()
print(json.dumps(request, indent=2))

{
  "seniority": 3,
  "home": "owner",
  "time": 36,
  "age": 26,
  "marital": "single",
  "records": "no",
  "job": "freelance",
  "expenses": 35,
  "income": 0.0,
  "assets": 60000.0,
  "debt": 3000.0,
  "amount": 800,
  "price": 1000
}


## Question 1

* Install BentoML
* What's the version of BentoML you installed?
* Use `--version` to find out

In [13]:
!bentoml --version

bentoml, version 1.0.7


## Question 2

Run the notebook from the previous module (module 6), which contains the random forest model, and save the model with BentoML. To make it easier, we have prepared this [notebook](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/07-bentoml-production/code/train.ipynb).


Approximately how big is the saved BentoML model? The size can vary slightly depending on your local development environment.
Choose the size closest to your model.

* 924kb
* 724kb
* 114kb
* 8kb

In [14]:
!bentoml models list

[1m [0m[1mTag                         [0m[1m [0m[1m [0m[1mModule         [0m[1m [0m[1m [0m[1mSize      [0m[1m [0m[1m [0m[1mCreation Time      [0m[1m [0m
 credit_risk_model:4kir6mcr6…  bentoml.xgboost  197.77 KiB  2022-10-22 06:47:12 
 credit_risk_model:uofn5nsr6…  bentoml.xgboost  197.77 KiB  2022-10-22 06:45:26 
 credit_risk_model:rbjdalcr6…  bentoml.xgboost  197.77 KiB  2022-10-22 06:44:40 
 credit_risk_model:qhepj6sr6…  bentoml.xgboost  197.77 KiB  2022-10-22 06:23:01 
 credit_risk_model:qujzyzsr6…  bentoml.xgboost  197.77 KiB  2022-10-22 06:08:47 
 credit_risk_model:hpsemlcr4…  bentoml.xgboost  197.77 KiB  2022-10-22 04:12:13 
 credit_risk_model:gzgt2pcr4…  bentoml.xgboost  197.77 KiB  2022-10-22 04:04:54 
 credit_risk_model:zrmzncsqw…  bentoml.xgboost  197.77 KiB  2022-10-20 16:28:46 
 iris_clf:6iv5s4cnikh76b2m     bentoml.sklearn  5.98 KiB    2022-10-16 07:09:04 
 iris_clf:r7p573snh66r5ye5     bentoml.sklearn  5.98 KiB    2022-10-16 06:44:51 
 m
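As a cross-check on the `Size` column, the on-disk size of a saved model can also be summed directly. The following is a minimal sketch, assuming the models live in BentoML's default local store under `~/bentoml/models` (the location shown in the `Model(...)` output earlier). The helper name `directory_size_kib` is made up for this example, and the result may differ slightly from what BentoML reports.

```python
from pathlib import Path


def directory_size_kib(model_dir):
    """Sum the sizes of all regular files under model_dir, in KiB."""
    root = Path(model_dir)
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total_bytes / 1024


# Assumption: default local model store, as in the path printed by save_model.
model_root = Path.home() / "bentoml" / "models" / "credit_risk_model"

if model_root.exists():
    # Each subdirectory corresponds to one saved version of the model.
    for version_dir in sorted(model_root.iterdir()):
        if version_dir.is_dir():
            print(version_dir.name, f"{directory_size_kib(version_dir):.2f} KiB")
```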

## Question 3

Say you have the following data that you're sending to your service:

```json
{
  "name": "Tim",
  "age": 37,
  "country": "US",
  "rating": 3.14
}
```

What would the pydantic class look like? You can name the class `UserProfile`.

In [15]:
from pydantic import BaseModel

class UserProfile(BaseModel):
    name: str = "Tim"
    age: int = 37
    country: str = "US"
    rating: float = 3.14

## Question 4

We've prepared a model for you that you can import using:

```bash
curl -O https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel.bentomodel
bentoml models import coolmodel.bentomodel
```

What version of scikit-learn was this model trained with?

* 1.1.1
* 1.1.2
* 1.1.3
* 1.1.4
* 1.1.5

In [16]:
!curl -O https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel.bentomodel

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1724  100  1724    0     0   3528      0 --:--:-- --:--:-- --:--:--  3525


In [17]:
!bentoml models import coolmodel.bentomodel

Error: [31m[models] `import` failed: Item 'mlzoomcamp_homework:qtzdz3slg6mwwdu5' already exists in the store <osfs '/home/martin/bentoml/models'>[0m


In [18]:
!grep scikit-learn ~/bentoml/models/mlzoomcamp_homework/qtzdz3slg6mwwdu5/model.yaml

    scikit-learn: 1.1.1
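The same lookup can be done from Python without shelling out to `grep`. This is a minimal sketch that scans `model.yaml` line by line instead of parsing the YAML. The helper name `find_version_lines` is made up for this example, and the path assumes the default model store used in the cell above.

```python
from pathlib import Path


def find_version_lines(yaml_path, package):
    """Return the stripped lines of the file that mention the package (like grep)."""
    text = Path(yaml_path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if package in line]


# Assumption: same location as in the grep command above.
model_yaml = (
    Path.home()
    / "bentoml"
    / "models"
    / "mlzoomcamp_homework"
    / "qtzdz3slg6mwwdu5"
    / "model.yaml"
)

if model_yaml.exists():
    print(find_version_lines(model_yaml, "scikit-learn"))
```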


## Question 5 

Create a bento out of this scikit-learn model. The output type for this endpoint should be `NumpyNdarray()`.

Send this array to the Bento:

```
[[6.4,3.5,4.5,1.2]]
```

You can use curl or the Swagger UI. What value does it return? 

* 0
* 1
* 2
* 3

In [19]:
from bentoml.io import NumpyNdarray

runner = bentoml.sklearn.get("mlzoomcamp_homework:qtzdz3slg6mwwdu5").to_runner()

svc = bentoml.Service("mlzoomcamp_homework", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series):
    result = runner.predict.run(input_series)
    return result

In [23]:
runner.destroy()
runner.init_local()

runner.predict.run([[6.4,3.5,4.5,1.2]])

'Runner.init_local' is for debugging and testing only.


array([1])

In [29]:
runner.destroy()

In [59]:
'''
bentoml serve service:svc --reload
'''

'\nbentoml serve service:svc --reload\n'

In [62]:
!curl -X 'POST' 'http://0.0.0.0:3000/classify' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '[[6.4,3.5,4.5,1.2]]'

[1]
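The curl call above can also be issued from Python using only the standard library. The sketch below assumes the service is running locally on port 3000 and exposes the `classify` endpoint defined earlier. The helper names `build_classify_request` and `classify` are made up for this example, and the actual network call is left commented out so the snippet runs without a live service.

```python
import json
import urllib.request


def build_classify_request(rows, base_url="http://localhost:3000"):
    """Build a POST request carrying the rows as a JSON array."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/classify",
        data=body,
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )


def classify(rows, base_url="http://localhost:3000"):
    """Send the rows to the service and decode the JSON response."""
    with urllib.request.urlopen(build_classify_request(rows, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_classify_request([[6.4, 3.5, 4.5, 1.2]])
print(request.get_method(), request.full_url)

# Requires `bentoml serve service:svc` to be running:
# print(classify([[6.4, 3.5, 4.5, 1.2]]))
```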

## Question 6

Make sure to serve your bento with `--production` for this question.

Install locust using:

```bash
pip install locust
```

Use the following locust file: [locustfile.py](locustfile.py)

Make sure it points at your bento's endpoint (in case you didn't name your endpoint "classify").

<img src="resources/classify-endpoint.png">

Configure 100 users with a spawn rate of 10 users per second. Click "Start Swarming" and make sure it is working.

Now download a second model with this command:

```bash
curl -O https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel
```

Or you can download with this link as well:
[https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel](https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel)

In [63]:
!curl -O https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1728  100  1728    0     0   3146      0 --:--:-- --:--:-- --:--:--  3147


Now import the model:

```bash
bentoml models import coolmodel2.bentomodel
```

In [64]:
!bentoml models import coolmodel2.bentomodel

Model(tag="mlzoomcamp_homework:jsi67fslz6txydu5") imported


Update your bento's runner tag and test with both models. Which model allows more traffic (more throughput) as you ramp up the traffic?

**Hint 1**: Remember to stop and restart your bento service whenever you change the model tag. Use Ctrl-C to stop the service between trials.

**Hint 2**: Increase the number of concurrent users to see which one has higher throughput

Which model has better performance at higher volumes?

* The first model
* The second model

In [65]:
!firefox HW7_report_model1.html HW7_report_model2.html

The second model
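For a quick sanity check without Locust, throughput can be approximated by firing requests from a thread pool and dividing the number of completed requests by the elapsed time. This is a rough stand-in for the Locust run, not a replacement for it. The names `measure_throughput` and `fake_request` are made up for this example, and `fake_request` only sleeps, so it would need to be swapped for a real call to the service's endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(send_request, total_requests=200, concurrent_users=20):
    """Call send_request total_requests times from a thread pool.

    Returns the observed rate in requests per second.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(lambda _: send_request(), range(total_requests)))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed


def fake_request():
    # Stand-in for an HTTP call to the service; it only simulates latency.
    time.sleep(0.01)


rate = measure_throughput(fake_request, total_requests=200, concurrent_users=20)
print(f"{rate:.1f} requests/second")
```

Running the same measurement against each model tag, with the service restarted in between, gives two numbers that can be compared directly.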