In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import catboost
import datetime
import warnings
warnings.filterwarnings('ignore')

# Applied Machine Learning

## Models and production

Let's train a model on the UCI Heart Disease data (https://archive.ics.uci.edu/ml/datasets/Heart+Disease)

In [2]:
df = pd.read_csv('heart.csv')
df.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
model = catboost.CatBoostClassifier()
model.fit(X=df.drop(columns='target'), y=df['target'], verbose=False)

<catboost.core.CatBoostClassifier at 0x7fa0cdcde128>

In [4]:
# In production, you apply the trained model to a new feature vector:

model.predict_proba([63, 1, 3, 145, 233, 1, 0, 160, 0, 2.5, 0, 0, 1])

array([0.1016251, 0.8983749])

### How do we put models into production?

- Batch jobs
- Microservices
- Integration into application code

### Batch jobs

- Something that is executed every now and then (e.g. daily)
- Might be a simple Python script, MapReduce job, Spark job, ...
- The easiest way, as you don't need to change much

### Batch job example

In [5]:
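# Executed daily at 0:00, e.g. by a scheduler such as cron.

batch = pd.read_csv('heart.csv')
# Score every row; the label column is not a model input.
batch['prediction'] = model.predict(batch.drop(columns='target'))
# Timestamp the file name (no spaces or colons) so runs do not overwrite each other.
filename = 'result-%s.csv' % datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
batch.to_csv(filename)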

! head -5 "$filename"

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,prediction
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,1.0
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,1.0
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,1.0
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,1.0


### Microservices

### Microservice example

In [6]:
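from flask import Flask, request, jsonify
app = Flask(__name__)

@app.route('/')
def apply_model():
    # The request body is a JSON list of feature values;
    # the response is the list of class probabilities.
    return jsonify(list(model.predict_proba(request.json)))

# Starts Flask's development server and blocks until interrupted.
app.run()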

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)


```
curl --header "Content-Type: application/json" -XGET --data "[0,63,1,3,145,233,1,0,150,0,2.3,0,0,1]" localhost:5000
[0.11089046083115017,0.8891095391688498]
```

### Integration into application code

- Most of ML happens in Python or R
- Most of the applications are Java-based
- You're lucky if your app runs Python
- Most of the time your code diverges: how can you know that a random forest from sklearn in Python behaves the same as whateverlibrary.RandomForest in Java?
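
One way to detect such divergence is a parity check: score the same fixed inputs with both implementations and compare the results within a tolerance. The following is only a sketch, assuming the predictions from the other language have already been exported as plain lists; the function names, the tolerance and the numbers are hypothetical.

```python
def max_abs_difference(reference, candidate):
    """Largest absolute gap between two equally long lists of probabilities."""
    if len(reference) != len(candidate):
        raise ValueError("prediction lists differ in length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def predictions_match(reference, candidate, tolerance=1e-6):
    """True when every candidate prediction is within `tolerance` of the reference."""
    return max_abs_difference(reference, candidate) <= tolerance


# Hypothetical probabilities for the same three inputs,
# scored once in Python and once in the Java application.
python_scores = [0.8984, 0.1016, 0.5000]
java_scores = [0.8984, 0.1016, 0.5003]

print(predictions_match(python_scores, java_scores))        # strict tolerance
print(predictions_match(python_scores, java_scores, 1e-3))  # looser tolerance
```

Which tolerance is acceptable depends on the application; a small gap in a probability may still flip a thresholded decision.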

### CoreML, TensorFlow Serving + ONNX

- CoreML was one of the early efforts to provide a unified model format
- TensorFlow Serving and ONNX pursue a similar goal: applying a trained model outside the training environment
- The workflow is to train (Python) -> export (standard format) -> apply (common languages)
- Expect similar libraries and tools to appear in the coming years

In [7]:
model.save_model(format='coreml', fname='model.coreml')

open('model.coreml', mode='rb').read(300)

b'\x08\x01\x12\x81\x1d\n\x0f\n\tfeature_0\x1a\x02\x12\x00\n\x0f\n\tfeature_1\x1a\x02\x12\x00\n\x0f\n\tfeature_2\x1a\x02\x12\x00\n\x0f\n\tfeature_3\x1a\x02\x12\x00\n\x0f\n\tfeature_4\x1a\x02\x12\x00\n\x0f\n\tfeature_5\x1a\x02\x12\x00\n\x0f\n\tfeature_6\x1a\x02\x12\x00\n\x0f\n\tfeature_7\x1a\x02\x12\x00\n\x0f\n\tfeature_8\x1a\x02\x12\x00\n\x0f\n\tfeature_9\x1a\x02\x12\x00\n\x10\n\nfeature_10\x1a\x02\x12\x00\n\x10\n\nfeature_11\x1a\x02\x12\x00\n\x10\n\nfeature_12\x1a\x02\x12\x00R\x17\n\nprediction\x1a\t*\x07\n\x01\x01\x10\xc0\x80\x04Z\npredictionb\nprediction\xa2\x06\xec\x1a\n\x0eCatboost model\x12\x05'

### The Rules of Machine Learning

Based on the paper by Martin Zinkevich (http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf)

### Your first product could use no machine learning at all

- Most of the time you don't even have training data at the start
- If you're recommending stuff, just recommend something popular
- Very simple rules might be enough to bootstrap the product
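
A "recommend something popular" baseline needs no model at all. A minimal sketch, assuming the only data available is a log of (user, item) views; the log contents and the helper name `most_popular` are made up for illustration.

```python
from collections import Counter


def most_popular(interactions, k=3):
    """Return the k most frequently seen item ids from (user, item) pairs."""
    counts = Counter(item for _user, item in interactions)
    return [item for item, _count in counts.most_common(k)]


# Hypothetical view log: (user_id, item_id)
views = [("u1", "a"), ("u2", "a"), ("u3", "b"),
         ("u1", "c"), ("u2", "b"), ("u4", "a")]

recommendations = most_popular(views, k=2)
print(recommendations)  # ['a', 'b']
```

Every user gets the same list, which is exactly the point: it is a reference that any later model has to beat.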

### First thing to do is to measure the quality of your model

- You've got to define a metric and track it
- The earlier you start, the longer your metric history becomes
- Nobody likes to implement that, but you *have to*
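
Tracking a metric can start very small: compute it after every run and append it with a timestamp to a log. The sketch below assumes accuracy as the metric and uses an in-memory buffer where a real setup would have a file or a metrics database; `log_metric` is a hypothetical helper.

```python
import csv
import datetime
import io


def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_metric(stream, name, value, when=None):
    """Append one (timestamp, metric name, value) row to a CSV stream."""
    when = when or datetime.datetime.now()
    csv.writer(stream).writerow(
        [when.isoformat(timespec="seconds"), name, f"{value:.4f}"]
    )


# The buffer stands in for a log file that grows with every run.
history = io.StringIO()
value = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
log_metric(history, "accuracy", value, when=datetime.datetime(2020, 1, 1))
print(history.getvalue())
```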

### Your heuristics are features

- You will get a lot of requests from product managers and other people
- Heuristic: "I believe you should use the time of the day and exclude any scary videos at night"
- Let the model decide, but turn the heuristic into a feature it can use
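
The scary-videos heuristic could be encoded roughly as follows. This is a sketch under assumptions: the field names `hour` and `is_scary` and the definition of "night" (22:00 to 06:00) are invented for the example.

```python
def add_heuristic_features(row):
    """Return a copy of `row` with the product heuristic encoded as features."""
    features = dict(row)
    hour = row["hour"]
    # Assumed definition of "night": 22:00 up to (not including) 06:00.
    features["is_night"] = int(hour >= 22 or hour < 6)
    # The heuristic itself becomes a feature instead of a hard filter.
    features["scary_at_night"] = int(features["is_night"] and row["is_scary"])
    return features


example = add_heuristic_features({"hour": 23, "is_scary": 1, "duration": 310})
print(example["is_night"], example["scary_at_night"])  # 1 1
```

The model can then learn from data how much weight `scary_at_night` deserves, instead of the rule being applied unconditionally.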

### Watch for silent failures

- Machine learning algorithms fail silently
- Check metrics regularly and check for changes in your data and/or environment
- The worst way to find bugs is to get bug reports from users
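
A very simple check for changes in the data is to compare each feature's mean in a new batch against its training distribution. The sketch below is one possible approach, not a complete monitoring solution; the threshold of three standard deviations and the sample rows are assumptions.

```python
import statistics


def drifted_features(train_rows, batch_rows, threshold=3.0):
    """Names of features whose batch mean lies more than `threshold`
    training standard deviations away from the training mean."""
    flagged = []
    for name in train_rows[0]:
        train_values = [row[name] for row in train_rows]
        batch_values = [row[name] for row in batch_rows]
        mean = statistics.mean(train_values)
        std = statistics.pstdev(train_values) or 1e-12  # avoid division by zero
        if abs(statistics.mean(batch_values) - mean) / std > threshold:
            flagged.append(name)
    return flagged


train = [{"age": 50, "chol": 200}, {"age": 60, "chol": 240}, {"age": 55, "chol": 220}]
# Hypothetical upstream change: cholesterol now arrives in different units.
batch = [{"age": 54, "chol": 5.2}, {"age": 58, "chol": 6.1}]

print(drifted_features(train, batch))  # ['chol']
```

The model would still return predictions for such a batch without any error, which is what makes the failure silent.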

### Design and plan to iterate

- You will change the algorithm and features sooner or later
- Make it easy to add new features and try new algorithms
- Do not fall in love with specific algorithms and features
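
One way to keep features easy to add and remove is a small registry, so that a new feature is a new function and nothing else changes. A sketch only; the decorator, the feature names and the cholesterol threshold are illustrative assumptions.

```python
FEATURES = {}


def feature(func):
    """Register a function as a feature under its own name."""
    FEATURES[func.__name__] = func
    return func


@feature
def age_decade(row):
    return row["age"] // 10


@feature
def high_chol(row):
    return int(row["chol"] > 240)  # threshold chosen only for illustration


def build_features(row, names=None):
    """Compute the selected features (all registered ones by default)."""
    names = list(FEATURES) if names is None else names
    return {name: FEATURES[name](row) for name in names}


print(build_features({"age": 63, "chol": 233}))  # {'age_decade': 6, 'high_chol': 0}
```

Passing an explicit `names` list makes it cheap to try a model with a subset of the features.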

### Reuse the code as much as possible

- If you have training in Python and inference (applying the model) in Java, you're in danger
- Write tests and check metrics to verify that your predictions are still consistent
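
When training and inference can share a language, the safest option is a single preprocessing function that both paths call. A minimal sketch, with hypothetical field names and an echo function standing in for the real model:

```python
def preprocess(raw):
    """The single place where a raw record becomes model features."""
    return [
        float(raw["age"]),
        float(raw["chol"]) / 100.0,
        1.0 if raw["sex"] == "male" else 0.0,
    ]


def make_training_matrix(records):
    """Training path: reuses preprocess for every record."""
    return [preprocess(record) for record in records]


def handle_request(raw, predict):
    """Serving path: reuses the same preprocess before calling the model."""
    return predict(preprocess(raw))


record = {"age": "63", "chol": "233", "sex": "male"}
training_row = make_training_matrix([record])[0]
# The echo "model" returns its input, exposing what the serving path computed.
served_row = handle_request(record, predict=lambda features: features)

print(training_row == served_row)  # True
```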

### Avoid feedback loops

- It is easy to get trapped in a loop where the model's output shapes its own future training data, so think about the architecture
- The worst thing about a feedback loop is that once you are in it, there is little you can do
- Filter bubbles and similar effects are worrisome, too

### Keep things simple

- Having too many models is not only CPU-consuming
- One day a model might fail to train or start to produce gibberish
- The more parts you have, the less reliable the system becomes

## Reporting to executives (and your thesis)

- Propose using different algorithms
- Provide a baseline: what is the simplest possible solution to your problem?
- Provide error bars: it is not 97% but 97±0.2%
- Explain your key metrics first, e.g. an accuracy of 55% sounds horrible, but what if there are 10000 classes?
- Communicate the most important features (search for "feature importance")
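
A baseline and error bars can both be computed in a few lines. The sketch below uses the majority class as the simplest possible baseline and a bootstrap over the test set for the error bar; the labels and predictions are made up, and the bootstrap is only one of several ways to estimate the uncertainty.

```python
import random
import statistics


def majority_baseline(y_true):
    """Accuracy of always predicting the most common label."""
    most_common = max(set(y_true), key=y_true.count)
    return y_true.count(most_common) / len(y_true)


def bootstrap_accuracy(y_true, y_pred, n_resamples=1000, seed=0):
    """Mean and standard deviation of accuracy over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample test examples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical test labels and model predictions (100 examples).
y_true = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1] * 10
y_pred = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1] * 10

mean, std = bootstrap_accuracy(y_true, y_pred)
print(f"baseline accuracy: {majority_baseline(y_true):.2f}")
print(f"model accuracy: {mean:.3f} ± {std:.3f}")
```

Reporting the model next to the baseline shows how much of the accuracy comes from the class balance alone.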

## Questions I can ask

- Into how many parts should you split your data, and why?
- When would you use a linear model?
- What is common between gradient boosting and random forest?
- What is the typical data you apply gradient boosting to?
- What would you use neural networks for?
- What is the difference between metric and loss?
- What are the metrics for regression problems?
- What are the metrics for classification problems?
- What is the bias/variance tradeoff?
- What is the idea of bootstrapping?
- What is the role of regularization? What ways of regularizing do you know?
- How do we find the coefficients of a linear model?
- How do we create the splits in decision trees?
- What is overfitting and underfitting?
- What is the idea of the kNN classifier?
- What is the clustering problem?