# Week 4 Capstone Project Peer Review

The following are key points for peer review submissions:

* Unit tests of the model
* Unit testing for logging
* Performance monitoring
* Data capture automation
* Model comparison
* Visualization

In [7]:
import os
import sys
import csv
import requests
from collections import Counter
from datetime import date
from ast import literal_eval
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Getting started

``workflow-template.zip`` contains a workflow template. Unpack it wherever you would like the source code to live. Leaving aside the ``static`` directory, which holds the CSS and JavaScript used to render a landing page, the important pieces are shown in the following tree.

```
├── app.py
├── Dockerfile
├── model.py
├── README.rst
├── requirements.txt
├── run-tests.py
├── templates
│   ├── base.html
│   ├── dashboard.html
│   ├── index.html
│   └── running.html
└── unittests
    ├── ApiTests.py
    ├── __init__.py
    └── ModelTests.py
```

If you plan to modify the HTML website, edit the files in ``templates``. You should already be familiar with the rest of the files.

## Project

In [41]:
!pip install statsmodels


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting statsmodels
  Downloading statsmodels-0.13.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB)
Collecting patsy>=0.5.2 (from statsmodels)
  Downloading patsy-0.5.3-py2.py3-none-any.whl (233 kB)
Installing collected packages: patsy, statsmodels
Successfully installed patsy-0.5.3 statsmodels-0.13.5


In [44]:
import pandas as pd
import numpy as np
import os
import datetime

import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

from sklearn.metrics import mean_squared_error

plt.style.use('seaborn')

from application.utils.ingestion import fetch_data, fetch_ts
from application.utils.processing import convert_to_ts, engineer_features

from application.utils.plot import ts_plot, ts_plot_pred

%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [43]:
import pandas as pd
import numpy as np
import os
import datetime

import matplotlib.pyplot as plt
from fbprophet import Prophet

from sklearn.metrics import mean_squared_error

plt.style.use('seaborn')

from application.utils.ingestion import fetch_data,fetch_ts
from application.utils.processing import convert_to_ts, engineer_features

from application.utils.plot import ts_plot, ts_plot_pred

%load_ext autoreload
%autoreload 2

ModuleNotFoundError: No module named 'fbprophet'

In [None]:
data_dir = os.path.join("data","cs-train")
ts_data = fetch_ts(data_dir)

... processing data for loading


In [21]:
ts_plot(ts_data['all'].date.values, ts_data['all'].revenue.values,figx=14,figy=6, title="revenue over time")

NameError: name 'ts_plot' is not defined

## TASK 1: Write unit tests for a logger and a logging API endpoint

1. Using `model.py` and `./unittests/ModelTests.py` as examples, create `logger.py` and `./unittests/LoggerTests.py`.
2. Modify the files so that there are at a minimum the following tests:

    * ensure predict log is automatically created
    * ensure train log is automatically created
    * ensure that the train function in model archives last used training data
    * ensure that 'n' predictions result in 'n' predict log entries
    * ensure that predict gracefully handles NaNs
    
> IMPORTANT: when writing to a log file from a unit test you will want to ensure that you do not modify or delete existing 'production' logs.  You can test your function with the following code (although it is likely easier to work directly in a terminal).

In [8]:
!python ./unittests/LoggerTests.py

Traceback (most recent call last):
  File "./unittests/LoggerTests.py", line 13, in <module>
    from logger import update_train_log, update_predict_log
ModuleNotFoundError: No module named 'logger'
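
The traceback above means that `logger.py` has not been created yet. As a starting point, here is a minimal sketch of one of the two functions that the test file imports. Only the name `update_predict_log`, the `predict-{year}-{month}.log` file pattern, and the `unique_id`, `y_pred`, `y_proba`, and `query` columns come from this notebook. The argument list, the `timestamp` and `runtime` columns, the `log_dir` parameter, the `test-` prefix, and the `count_log_entries` helper are assumptions made for illustration.

```python
import csv
import os
import time
import uuid
from datetime import date


def update_predict_log(y_pred, y_proba, query, runtime, log_dir="logs", test=False):
    """Append one prediction to the current month's predict log.

    Hypothetical signature: only the function name is taken from LoggerTests.py.
    """
    today = date.today()
    # Assumption: test runs write to their own file so production logs stay untouched.
    prefix = "test-" if test else ""
    logfile = os.path.join(
        log_dir, "{}predict-{}-{}.log".format(prefix, today.year, today.month)
    )
    os.makedirs(log_dir, exist_ok=True)

    # Write the header only when the file is first created.
    write_header = not os.path.exists(logfile)
    with open(logfile, "a", newline="") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(
                ["unique_id", "timestamp", "y_pred", "y_proba", "query", "runtime"]
            )
        writer.writerow([uuid.uuid4(), time.time(), y_pred, y_proba, query, runtime])
    return logfile


def count_log_entries(logfile):
    """Return the number of logged predictions (rows after the header)."""
    with open(logfile, newline="") as handle:
        return max(sum(1 for _ in csv.reader(handle)) - 1, 0)
```

A test for "'n' predictions result in 'n' predict log entries" could then call `update_predict_log(..., test=True)` n times against a temporary directory and assert that `count_log_entries` returns n.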


## TASK 2: Add an API endpoint for logging

In addition to the `predict` and `train` endpoints, create a third endpoint that returns 
logs.  Remember that there are `train` and `predict` log files and that they are set up 
to create new files each month.  You will need to ensure that your endpoint can accommodate this and the best way to ensure this is to **first write the unit tests** then write the code.

Flask has several functions to help with the sending of files. One example is [send_from_directory](https://flask.palletsprojects.com/en/1.1.x/api/#flask.send_from_directory).
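
As a rough sketch of such an endpoint, the route below serves a single log file by name with `send_from_directory`. The route path `/logs/<filename>`, the `LOG_DIR` config key, the function name `get_log`, and the choice of status codes are assumptions for illustration, not part of the template.

```python
import os

from flask import Flask, abort, send_from_directory

app = Flask(__name__)
# Assumption: logs live in ./logs and are named like predict-2023-7.log or train-2023-7.log.
app.config["LOG_DIR"] = os.path.abspath("logs")


@app.route("/logs/<filename>", methods=["GET"])
def get_log(filename):
    """Return one monthly train or predict log as a file download."""
    log_dir = app.config["LOG_DIR"]
    if not filename.endswith(".log"):
        abort(400)  # only log files may be requested
    if not os.path.isfile(os.path.join(log_dir, filename)):
        abort(404)  # no log was written under that name
    # send_from_directory refuses paths that would escape log_dir.
    return send_from_directory(log_dir, filename, as_attachment=True)
```

Because the file name carries the year and month, a client asks for one specific log, for example `GET /logs/predict-2023-7.log`. Writing the tests first would mean asserting the 200, 400, and 404 cases with Flask's test client before filling in the route.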

In [9]:
!python ./unittests/ApiTests.py

ssss
----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK (skipped=4)


## TASK 3: Make sure all tests pass

You have been working on specific suites of unit tests.  It is a best practice to double-check that all tests pass after making major changes like the ones you have just completed.

> Make sure you modify `./unittests/__init__.py` so that the LoggerTest suite is also included when running all tests.
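
How the suites are combined depends on how the template's `__init__.py` is written, which is not shown here. As one hedged illustration using only the standard library, several `TestCase` classes can be loaded into a single suite. The two test classes below are placeholders standing in for the real `ModelTest` and `LoggerTest`, and `build_suite` is a hypothetical helper.

```python
import unittest


class ModelTest(unittest.TestCase):
    """Placeholder for the suite in unittests/ModelTests.py."""

    def test_placeholder(self):
        self.assertTrue(True)


class LoggerTest(unittest.TestCase):
    """Placeholder for the suite in unittests/LoggerTests.py."""

    def test_placeholder(self):
        self.assertTrue(True)


def build_suite(test_cases):
    """Collect every test method from the given TestCase classes."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in test_cases:
        suite.addTests(loader.loadTestsFromTestCase(case))
    return suite


# Run both suites in a single pass and keep the result for inspection.
result = unittest.TextTestRunner(verbosity=1).run(build_suite([ModelTest, LoggerTest]))
```

Checking `result.wasSuccessful()` afterwards gives a single pass/fail signal for the whole collection.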

In [10]:
!python run-tests.py

ssss....... grid searching
EEE
ERROR: test_01_train (ModelTests.ModelTest)
test the train functionality
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/ai-workflow-capstone-soln/unittests/ModelTests.py", line 23, in test_01_train
    model_train(test=True)
  File "/home/ec2-user/SageMaker/ai-workflow-capstone-soln/model.py", line 103, in model_train
    grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, iid=False, n_jobs=-1)
TypeError: __init__() got an unexpected keyword argument 'iid'

ERROR: test_02_load (ModelTests.ModelTest)
test the train functionality
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/ai-workflow-capstone-soln/unittests/ModelTests.py", line 32, in test_02_load
    model = model_load()
  File "/home/ec2-user/SageMaker/ai-workflow-capstone-soln/model.py", line 193, in model_load
    r
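
The `TypeError` above is raised because the `iid` argument of `GridSearchCV` was deprecated in scikit-learn 0.22 and removed in 0.24, so `model.py` needs to drop `iid=False` (or the environment needs an older scikit-learn). A minimal sketch of the corrected call follows. The pipeline steps, the parameter grid, and the use of the bundled iris data are assumptions, since the contents of `model.py` are not shown in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Hypothetical pipeline and grid; the real ones are defined in model.py.
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("svm", SVC(gamma="scale"))])
param_grid = {"svm__C": [0.1, 1.0, 10.0]}

# Same call as in the traceback, minus the removed iid argument.
# n_jobs is lowered to 1 here only to keep the sketch lightweight.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=1)
grid.fit(X_iris, y_iris)
print(grid.best_params_)
```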

## TASK 4: Create model performance investigative tooling

There are a lot of convenience functions you could create here.  Create them directly in this notebook or create them as scripts that you may call from this notebook.  

Essentially, you will need to create tools that compare the most recently used training data to the most recent predictions. In practice, your tooling should also let you compare the predictions of one model with those of another. The predictions come from log files.

For this task let's focus on the comparison of predictions to established data.

1. Use bootstrap samples from the original data to ping the `predict` endpoint. Also, add a couple of outliers, as you did a few courses back with outlier detection.

2. Pull the predictions from the log file and summarize the investigation visually.

In [11]:
## YOUR CODE HERE




from monitoring import get_latest_train_data, get_monitoring_tools

## load latest data
data = get_latest_train_data()
y = data['y']
X = data['X']

Exception: cannot find models/latest-train.pickle-- did you train the model?

In [12]:
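## generate some data
bs_samples = 60
## draw row positions with replacement; the mask then keeps each selected row once,
## so X_bs can contain fewer than bs_samples rows
subset_indices = np.random.choice(np.arange(X.shape[0]),
                                  bs_samples,replace=True).astype(int)
mask = np.in1d(np.arange(X.shape[0]),subset_indices)
X_bs=X[mask]
## copy five rows and overwrite two features with extreme values to act as outliers
X_outliers = X[:5].copy()
X_outliers['age'] = [88,90,76,80,68]
X_outliers['num_streams'] = [111,100,80,90,150]
## combine the sampled rows and the outliers into a single query
X_query = pd.concat([X_bs,X_outliers])

print(X_query.shape)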

NameError: name 'X' is not defined

In [13]:
## delete the log file so we are starting fresh
today = date.today() 
logfile = os.path.join("logs","predict-{}-{}.log".format(today.year, today.month)) 
print(logfile)
if os.path.exists(logfile):
    os.remove(logfile)

## ping the API
request_json = {'query':X_query.to_dict(),'type':'dict'}
port = 8080
r = requests.post('http://0.0.0.0:{}/predict'.format(port),json=request_json)
response = literal_eval(r.text)
print(list(sorted(Counter(response['y_pred']).items())))

logs/predict-2023-7.log


NameError: name 'X_query' is not defined

In [14]:
pm_tools = get_monitoring_tools(X,y)

NameError: name 'X' is not defined

In [15]:
## read in the logged data
df = pd.read_csv(logfile)
df.drop(columns=["unique_id","y_proba"], inplace=True)
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'logs/predict-2023-7.log'

In [16]:
## reconstruct a data frame from the logged queries
queries = [literal_eval(q) for q in df['query'].values]
queries = pd.DataFrame(queries)
queries.columns = ['country', 'age', 'subscriber_type', 'num_streams']
print(queries.shape)
queries.head()

NameError: name 'df' is not defined

In [17]:
from scipy.stats import wasserstein_distance
X_target = pm_tools['preprocessor'].transform(queries)

outlier_test = pm_tools['clf_X'].predict(X_target)
outliers_X = 100 * (1.0 - (outlier_test[outlier_test==1].size / outlier_test.size))
wasserstein_X = wasserstein_distance(pm_tools['X_source'].flatten(),X_target.flatten()) 
wasserstein_y = wasserstein_distance(pm_tools['y_source'],df['y_pred'].values)

if outliers_X >= pm_tools['outlier_X']:
    print("OUTLIER TEST FAILED: {} >= {}".format(round(outliers_X,2),
                                                 pm_tools['outlier_X']))
else:
    print("OUTLIER TEST PASSED: {} < {}".format(round(outliers_X,2),
                                                pm_tools['outlier_X']))
    
if wasserstein_X >= pm_tools['wasserstein_X']:
    print("DISTRIBUTION X TEST FAILED: {} >= {}".format(round(wasserstein_X,2),
                                                        pm_tools['wasserstein_X']))
else:
    print("DISTRIBUTION X TEST PASSED: {} < {}".format(round(wasserstein_X,2),
                                                       pm_tools['wasserstein_X']))
    
if wasserstein_y >= pm_tools['wasserstein_y']:
    print("DISTRIBUTION y TEST FAILED: {} >= {}".format(round(wasserstein_y,2),
                                                        pm_tools['wasserstein_y']))
else:
    print("DISTRIBUTION y TEST PASSED: {} < {}".format(round(wasserstein_y,2),
                                                       pm_tools['wasserstein_y']))

fig = plt.figure(figsize=(10,6),dpi=400)
ax = fig.add_subplot(111)

x_range = np.arange(outlier_test.size)
labels = ['outlier','normal']
markerline, stemlines, baseline = ax.stem(x_range, outlier_test, '-.',
                                          use_line_collection=True)
plt.setp(baseline, 'color', 'r', 'linewidth', 2)
ax.set_title("outlier visualization")
ax.set_ylabel("outlier=-1, normal=1")
ax.set_xlabel("queries (in time order)");

NameError: name 'pm_tools' is not defined

### SOLUTION NOTE

The tests we choose to run are reasonable given the size of the data.  We are saving each query and with the reconstructed queries we can test for both outliers and distributional changes in the data.  All of this code would be better organized under `monitoring.py` in a production environment, but we walked through the process here with the hope that it provides some insight.  Be cautioned that the bootstrap and disk read/write portions of this code will take much longer with large data sets and some optimization will be required.  For example, you could pre-train and serialize the outlier model(s).
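
To give some intuition for the distribution tests, the one-dimensional Wasserstein distance between two samples of equal size is the mean absolute difference between their sorted values. The helper below is a standard-library sketch of that special case only. Its name is hypothetical, and the cells above rely on `scipy.stats.wasserstein_distance`, which also handles samples of different sizes.

```python
def wasserstein_equal_size(u_values, v_values):
    """1-D Wasserstein distance for two samples with the same number of points."""
    if len(u_values) != len(v_values):
        raise ValueError("this sketch only handles samples of equal size")
    if len(u_values) == 0:
        raise ValueError("samples must not be empty")
    # Pair the i-th smallest value of one sample with the i-th smallest of the other.
    pairs = zip(sorted(u_values), sorted(v_values))
    return sum(abs(u - v) for u, v in pairs) / len(u_values)


# Identical samples are at distance 0; shifting every value by 5 gives distance 5.
print(wasserstein_equal_size([0, 1, 3], [0, 1, 3]))  # 0.0
print(wasserstein_equal_size([0, 1, 3], [5, 6, 8]))  # 5.0
```

A larger distance between the training source and the logged queries means the incoming data has moved further from what the model was trained on, which is what the threshold comparisons in the cell above check.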

## TASK 5: Swap out the iris data for the AAVAIL churn data

We suggest that you copy the iris example folder to another directory, then re-create the template to work with the AAVAIL data. Changing the dataset mirrors real-world practice, since you will often modify workflow templates to meet the needs of a particular business opportunity.

In [18]:
!python run-tests.py

ssss....... grid searching
EEE
ERROR: test_01_train (ModelTests.ModelTest)
test the train functionality
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/ai-workflow-capstone-soln/unittests/ModelTests.py", line 23, in test_01_train
    model_train(test=True)
  File "/home/ec2-user/SageMaker/ai-workflow-capstone-soln/model.py", line 103, in model_train
    grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, iid=False, n_jobs=-1)
TypeError: __init__() got an unexpected keyword argument 'iid'

ERROR: test_02_load (ModelTests.ModelTest)
test the train functionality
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/ai-workflow-capstone-soln/unittests/ModelTests.py", line 32, in test_02_load
    model = model_load()
  File "/home/ec2-user/SageMaker/ai-workflow-capstone-soln/model.py", line 193, in model_load
    r

In [19]:
## this solution uses the AAVAIL data