# Evaluation Metrics

## 4.1 Overview

The fourth week of Machine Learning Zoomcamp is about different metrics to evaluate a binary classifier. These measures include accuracy, confusion table, precision, recall, ROC curves(TPR, FRP, random model, and ideal model), AUROC, and cross-validation.

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [32]:
df = pd.read_csv('customer_data.csv')
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for cell in categorical_columns:
    df[cell] = df[cell].str.lower().str.replace(' ', '_')
    
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)

df.churn = (df.churn == 'yes').astype(int)

In [33]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_val.churn.values

del df_train['churn']
del df_val['churn']
del df_test['churn']

In [34]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']

categorical = ['gender', 'seniorcitizen', 'partner', 'dependents', 
        'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 
        'techsupport','streamingtv', 'streamingmovies', 
        'contract', 'paperlessbilling',
       'paymentmethod']

In [35]:
from sklearn.metrics import roc_auc_score

In [36]:
# to implement, we have to create a train function

def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')
    
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)
    
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    
    return dv, model

In [37]:
def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')
    
    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]
    
    return y_pred

In [38]:
# k-fold cross-validation

from sklearn.model_selection import KFold

In [39]:
n_splits = 5
C = 1.0
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)

scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    y_train = df_train.churn.values
    y_val = df_val.churn.values

    dv, model = train(df_train, y_train, C=C)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))

C=1.0 0.841 +- 0.008


In [40]:
scores

[0.8435682269548084,
 0.8458337471858032,
 0.8311780052177403,
 0.8301724275756219,
 0.8517774779580183]

In [41]:
# we train our model once more on C=1.0


dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)
y_pred = predict(df_test, dv, model)

auc = roc_auc_score(y_test, y_pred)
auc

# the result ought to be 0.8572386167896259

0.5035238731962783

# Model Deployment

## 5.1 Intro/Session Overview

In This session we talked about the earlier model we made in chapter 3 for churn prediction. This chapter contains the deployment of the model. If we want to use the model to predict new values without running the code, There's a way to do this. The way to use the model in different machines without running the code, is to deploy the model in a server (run the code and make the model). After deploying the code in a machine used as server we can make some endpoints (using api's) to connect from another machine to the server and predict values.

To deploy the model in a server there are some steps:

- After training the model save it, to use it for making predictions in future (session 02-pickle).
- Make the API endpoints in order to request predictions. (session 03-flask-intro and 04-flask-deployment)
- Some other server deployment options (sessions 5 to 9)

We train a model in jupyter notebook, then we use the model by saving it to a file(model.bin). We want to load the file from a service e.g. churn service and the model will be inside the service. 

Say we have another service such as a marketing service that contain all the users. The marketing service can send a request to the churn service with info about the user, then the churn service sends back prediciton to the marketing service and based on the prediction received, the marketing service can decide whether they want to send promotional email with discount prices.

We put the churn prediction model into a web service using flask (a framework for creating web services in python). We want to wrap the service in such a way that it does not interfere with other services that we have in our machine. What we want to do is to create a special environment for our python dependencies - for this, we will use `pipenv`. We have another layer for system dependencies, then deploy the container to the cloud

![deployment_image](deployment.png)

## 5.2 Saving and Loading the Model

- Saving the model to pickle
- Loading the model to pickle
- Turning our notebook into a python script


### Saving the Model
To save the model we made before there is an option using the pickle library:
- First install the library with the command pip install pickle-mixin if you don't have it.
- After training the model and being the model ready for prediction process use this code to save the model for later.

In [42]:
# pickle for saving python objects

import pickle

In [43]:
# output_file = 'model_C=%s.bin' % C
output_file = f'model_C={C}.bin'
output_file

'model_C=1.0.bin'

In [44]:
# # create a file and write to it
# f_out = open(output_file, 'wb')

# # save our model
# pickle.dump((dv, model), f_out)

# # close the file
# f_out.close()

In [45]:
# using a methodology that ensures that the file is always closed
# instead of the cell above, we can rewrite:

with open(output_file, 'wb') as f_out:
    pickle.dump((dv, model), f_out)
    # do stuff
# here, once outside the 'with' statement, the file closes
# and we can do other things.

- In the code above we'll making a binary file named model.bin and writing the dict_vectorizer for one hot encoding and model as array in it. (We will save it as binary in case it wouldn't be readable by humans)
- To be able to use the model in future without running the code, We need to open the binary file we saved before.

### Loading the Model

We restarted the kernel so that we can mimick loading the model

In [1]:
import pickle

In [2]:
model_file = 'model_C=10.bin'

In [3]:
# we read the file here

with open(model_file, 'rb') as f_in:
    dv, model = pickle.load(f_in)

With unpacking the model and the dict_vectorizer, We're able to again predict for new input values without training a new model by re-running the code.

In [4]:
dv, model

(DictVectorizer(sparse=False), LogisticRegression(max_iter=1000))

In [5]:
## a new customer informations
customer = {
  'customerid': '8879-zkjof',
  'gender': 'female',
  'seniorcitizen': 0,
  'partner': 'no',
  'dependents': 'no',
  'tenure': 41,
  'phoneservice': 'yes',
  'multiplelines': 'no',
  'internetservice': 'dsl',
  'onlinesecurity': 'yes',
  'onlinebackup': 'no',
  'deviceprotection': 'yes',
  'techsupport': 'yes',
  'streamingtv': 'yes',
  'streamingmovies': 'yes',
  'contract': 'one_year',
  'paperlessbilling': 'yes',
  'paymentmethod': 'bank_transfer_(automatic)',
  'monthlycharges': 79.85,
  'totalcharges': 3320.75
}


In [6]:
# we want to turn the customer above into a feature matrix
X = dv.transform([customer])
model.predict_proba(X)[0, 1]
# the result shows that the customer is going to churn for the value of
# 0.6363584152715288 cos the result from the cell is wrong.

0.06224295541445037

It is not convenient to train a model from jupyter notebook. The best way is to create a python file/script that does the training. We do that by downloading the notebook as a python file.

## 5.3 Web Services: Introduction to Flask

Web services are services you communicate with over a network using some protocol. We can use flask for implementing it.

In this session, we create a simple service where user sends a `ping` address and it responds with a `pong`.

## 5.3 Web services: introduction to Flask
- A web service is a method used to communicate between electronic devices.
- There are some methods in web services we can use it to satisfy our problems. Here below we would list some.
    - GET: GET is a method used to retrieve files, For example when we are searching for a cat image in google we are actually requesting cat images with GET method.
    - POST: POST is the second common method used in web services. For example in a sign up process, when we are submiting our name, username, passwords, etc we are posting our data to a server that is using the web service. (Note that there is no specification where the data goes)
    - PUT: PUT is same as POST but we are specifying where the data is going to.
    - DELETE: DELETE is a method that is used to request to delete some data from the server.
- To create a simple web service, there are plenty libraries available in every language. Here we would like to introduce Flask library in python.
    - If you haven't installed the library just try installing it with the code pip install Flask
    - To create a simple web service just create the code in a python file, `ping.py` inside the parent folder of this ml zoomcamp notes.

## 5.4 Serving the Churn Model with Flask

In this session we talked about implementing the functionality of prediction to our churn web service and how to make it usable in development environment.

- To make the web service predict the churn value for each customer we must modify the code in session 3 with the code we had in previous chapters. Below we can see how the code works in order to predict the churn value.
- In order to predict we need to first load the previous saved model and use a prediction function in a special route.
To load the previous saved model we use the code below:

In [3]:
# import pickle

# with open('/churn-model.bin', 'rb') as f_in:
#   dv, model = pickle.load(f_in)

As we had earlier to predict a value for a customer we need a function like below:

In [4]:
# def predict_single(customer, dv, model):
#   X = dv.transform([customer])  ## apply the one-hot encoding feature to the customer data 
#   y_pred = model.predict_proba(X)[:, 1]
#   return y_pred[0]

Then at last we make the final function used for creating the web service.

In [5]:
# @app.route('/predict', methods=['POST'])  ## in order to send the customer information we need to post its data.
# def predict():
# customer = request.get_json()  ## web services work best with json frame, So after the user post its data in json format we need to access the body of json.

# prediction = predict_single(customer, dv, model)
# churn = prediction >= 0.5

# result = {
#     'churn_probability': float(prediction), ## we need to cast numpy float type to python native float type
#     'churn': bool(churn),  ## same as the line above, casting the value using bool method
# }

# return jsonify(result)  ## send back the data in json format to the user

At last run your code. To see the result we can't use a simple request in web browser, because we are expecting a POST request in our app. We can run the code below to post customer data as json and see the response

In [6]:
# ## a new customer informations
# customer = {
#   'customerid': '8879-zkjof',
#   'gender': 'female',
#   'seniorcitizen': 0,
#   'partner': 'no',
#   'dependents': 'no',
#   'tenure': 41,
#   'phoneservice': 'yes',
#   'multiplelines': 'no',
#   'internetservice': 'dsl',
#   'onlinesecurity': 'yes',
#   'onlinebackup': 'no',
#   'deviceprotection': 'yes',
#   'techsupport': 'yes',
#   'streamingtv': 'yes',
#   'streamingmovies': 'yes',
#   'contract': 'one_year',
#   'paperlessbilling': 'yes',
#   'paymentmethod': 'bank_transfer_(automatic)',
#   'monthlycharges': 79.85,
#   'totalcharges': 3320.75
# }
# import requests ## to use the POST method we use a library named requests
# url = 'http://localhost:9696/predict' ## this is the route we made for prediction
# response = requests.post(url, json=customer) ## post the customer information in json format
# result = response.json() ## get the server response
# print(result)

- Until here we saw how we made a simple web server that predicts the churn value for every user. When you run your app you will see a warning that it is not a WGSI server and not suitable for production environmnets. To fix this issue and run this as a production server there are plenty of ways available.

    - One way to create a WSGI server is to use gunicorn. To install it use the command `pip install gunicorn`, And to run the WGSI server you can simply run it with the command `gunicorn --bind 0.0.0.0:9696 churn:app`. Note that in churn:app the name churn is the name we set for our the file containing the code `app = Flask('churn')`(for example: churn.py), You may need to change it to whatever you named your Flask app file.
    - Windows users may not be able to use gunicorn library because windows system do not support some dependecies of the library. So to be able to run this on a windows machine, there is an alternative library waitress and to install it just use the command `pip install waitress`.
    - to run the waitress wgsi server use the command `waitress-serve --listen=0.0.0.0:9696 churn:app`.
    - To test it just you can run the code above and the results is the same.
- So until here you were able to make a production server that predict the churn value for new customers.

## 5.5 Python virtual environment: Pipenv

In this session we're going to make virtual environment for our project. So Let's start this session to get to know what is a virtual environment and how to make it.

- Every time we're running a file from a directory we're using the executive files from a global directory. For example when we install python on our machine the executable files that are able to run our codes will go to somewhere like */home/username/python/bin/* for example the pip command may go to */home/username/python/bin/pip*.
- Sometimes the versions of libraries conflict (the project may not run or get into massive errors). For example we have an old project that uses sklearn library with the version of 0.24.1 and now we want to run it using sklearn version 1.0.0. We may get into errors because of the version conflict.
    - To solve the conflict we can make virtual environments. Virtual environment is something that can seperate the libraries installed in our system and the libraries with specified version we want our project to run with. There are a lot of ways to create a virtual environments. One way we are going to use is using a library named pipenv.
    - pipenv is a library that can create a virutal environment. To install this library just use the classic method `pip install pipenv`.
    - After installing pipenv we must to install the libraries we want for our project in the new virtual environment. It's really easy, Just use the command pipenv instead of pip. `pipenv install numpy sklearn==0.24.1 flask`. With this command we installed the libraries we want for our project.
    - Note that using the pipenv command we made two files named *Pipfile* and *Pipfile.lock*. If we look at this files closely we can see that in Pipfile the libraries we installed are named. If we specified the library name, it's also specified in Pipfile.
    - In *Pipfile.lock* we can see that each library with it's installed version is named and a hash file is there to reproduce if we move the environment to another machine.
    - If we want to run the project in another machine, we can easily installed the libraries we want with the command `pipenv install`. This command will look into *Pipfile* and *Pipfile*.lock to install the libraries with specified version.
    - After installing the required libraries we can run the project in the virtual environment with `pipenv shell` command. This will go to the virtual environment's shell and then any command we execute will use the virtual environment's libraries.
Installing and using the libraries such as gunicorn is the same as the last session.
Until here we made a virtual environment for our libraries with a required specified version. To seperate this environment more, such as making gunicorn be able to run in windows machines we need another way. The other way is using Docker. Docker allows us to seperate everything more than before and make any project able to run on any machine that support Docker smoothly.

## 5.6 Environment management: Docker

### Installing Docker

To isolate more our project file from our system machine, there is an option named Docker. With Docker you are able to pack all your project is a system that you want and run it in any system machine. For example if you want Ubuntu 20.4 you can have it in a mac or windows machine or other operating systems.
To get started with Docker for the churn prediction project you can follow the instructions below.

### Ubuntu

`sudo apt-get install docker.io`

To run docker without sudo, follow [this instruction](https://docs.docker.com/engine/install/linux-postinstall/).

### Windows
To install the Docker you can just follow the instruction by Andrew Lock in this link: (https://andrewlock.net/installing-docker-desktop-for-windows/)

### MacOS
Follow the steps in the [Docker docs](https://docs.docker.com/desktop/install/mac-install/).

### Notes
- Once our project was packed in a Docker container, we're able to run our project on any machine.
- First we have to make a Docker image. In Docker image file there are settings and dependecies we have in our project. To find Docker images that you need you can simply search the [Docker](https://hub.docker.com/search?type=image&q=) website.

Here a Dockerfile (There should be no comments in Dockerfile, so remove the comments when you copy)

        # First install the python 3.8, the slim version uses less space
        FROM python:3.8.12-slim - `docker run -it --rm python:3.8.18-slim`. If you want to be able to access the terminal, `docker run -it --rm --entrypoint=bash python:3.8.18-slim`

        # Install pipenv library in Docker 
        RUN pip install pipenv

        # create a directory in Docker named app and we're using it as work directory 
        WORKDIR /app                                                                
        # Copy the Pip files into our working derectory 
        COPY ["Pipfile", "Pipfile.lock", "./"]

        # Build the image - e.g. `docker build -t zoomcamp-test .`, then run it - `docker run -it --rm --entrypoint=bash zoomcamp-test`

        # install the pipenv dependencies for the project and deploy them.
        RUN pipenv install --deploy --system

        # Copy any python files and the model we had to the working directory of Docker 
        COPY ["*.py", "churn-model.bin", "./"]

        # We need to expose the 9696 port because we're not able to communicate with Docker outside it
        EXPOSE 9696

        # If we run the Docker image, we want our churn app to be running
        ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:9696", "churn_serving:app"]

#### Summary
* Running a python image with docker
* Dockerfile
* Building a docker image
* Runnning a docker image





