# ml_deploy

The ml_deploy package provides a production framework for machine learning models. The platform's emphasis is on defining an objective for a model, tracking the data used for training and predicting, and tracking performance against that goal as your model evolves. This cheat sheet shows how to set up a job using ml_deploy. 

ml_deploy has a dedicated data model (see [link](https://docs.google.com/spreadsheets/d/1O52-SKVCe_zf_kHttzB7OMM9FeTCGbp1YGck-EzK6SE/edit?usp=sharing)). 

## Process
![ml_deploy Process](ml_deploy_process.PNG)

## Data Model Objects
### Model
At the highest level is the __model__ object, stored in the __ml_deploy_models__ table. A model is the abstract thing you are modeling. The underlying model parameters and coefficients might change, but this fundamental concept should be immutable.

### Model Version
As our approach to modeling something evolves, we might want to change the literal model being used for predictions or scoring. For example, we might change the optimization algorithm (solver) for an `sklearn.linear_model.LogisticRegression` model, or switch to an entirely new model, while still modeling the same outcome. These changes are referred to as __model_versions__, and are stored in the __ml_deploy_model_versions__ table. One __model__ may have one or more __model_versions__.

#### Serialized Model Versions
When a new __model_version__ is defined, it is saved as a serialized object for easy reference in the future. This is currently handled by the `S3_StoredModelUtils` class.
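
As a rough illustration of what storing a serialized model version involves, the sketch below pickles an object into a local directory under a generated UUID and loads it back. `LocalModelStore`, `save`, and `load` are hypothetical names made up for this example; they are not part of ml_deploy, and the real storage class may serialize and name files differently.

```python
import pickle
import uuid
from pathlib import Path
from tempfile import mkdtemp


class LocalModelStore:
    """Hypothetical file-based store; not the ml_deploy implementation."""

    def __init__(self, directory):
        self.directory = Path(directory)

    def save(self, model_obj):
        """Pickle model_obj under a new UUID and return that UUID."""
        model_uuid = str(uuid.uuid4())
        with open(self.directory / f"{model_uuid}.pkl", "wb") as f:
            pickle.dump(model_obj, f)
        return model_uuid

    def load(self, model_uuid):
        """Load the object previously saved under model_uuid."""
        with open(self.directory / f"{model_uuid}.pkl", "rb") as f:
            return pickle.load(f)


# Any picklable object works; a dict of parameters stands in for a model here.
store = LocalModelStore(mkdtemp())
stored_uuid = store.save({"solver": "lbfgs", "C": 1.0})
restored = store.load(stored_uuid)
```

Keying the file by UUID means a stored object can be located from the database row alone, without depending on the model name staying unchanged.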

### Fitted Model
An essential part of putting a model into production is training and retraining your __model_version__. Each time a __model_version__ is trained, a new __fitted_model__ is created. One __model_version__ may have one or more __fitted_models__. Each __fitted_model__ is stored with a mutable set of performance scores.

#### Serialized Fitted Models
When a new __fitted_model__ is defined, it is saved as a serialized object for easy reference in the future. This is currently handled by the `S3_StoredModelUtils` class.
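
The one-to-many relationships described so far (one __model__ to many __model_versions__, one __model_version__ to many __fitted_models__) can be pictured with a few plain dataclasses. This is only a sketch: the class and field names are illustrative assumptions rather than the ml_deploy ORM classes, and the parameter and accuracy values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FittedModel:
    fitted_model_id: int
    # Performance scores can be updated after the fit is stored.
    performance: Dict[str, float] = field(default_factory=dict)


@dataclass
class ModelVersion:
    model_version_id: int
    parameters: Dict[str, object]
    fitted_models: List[FittedModel] = field(default_factory=list)


@dataclass
class Model:
    model_id: int
    name: str
    model_versions: List[ModelVersion] = field(default_factory=list)


# One model with two versions; only the second version has been trained (twice).
model = Model(model_id=1, name="Test Engagement Model")
model.model_versions.append(ModelVersion(1, {"solver": "liblinear"}))
model.model_versions.append(ModelVersion(2, {"solver": "lbfgs"}))
model.model_versions[1].fitted_models.append(FittedModel(1, {"accuracy": 0.81}))
model.model_versions[1].fitted_models.append(FittedModel(2, {"accuracy": 0.84}))

# Number of fitted models per version.
fit_counts = [len(v.fitted_models) for v in model.model_versions]
```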

### Training Data
The data used to train every __fitted_model__ is stored in the __ml_deploy_training_data__ table. There should be one set of training data per __fitted_model__. The raw data, preprocessed data, testing_date, and test indicator are captured.

### Predictions
When a __fitted_model__ is used to generate predictions and probabilities, the results are stored in the __ml_deploy_predictions__ table. The raw data, preprocessed data, prediction date, prediction, probability, and target id of the scored record are captured. 

### Results
For every prediction that is made, an actual result is captured in the __ml_deploy_results__ table. The provided result data set is associated with the last prediction set via a target id field.
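
To make the target id link concrete, here is a minimal sketch that pairs each actual outcome with the most recent prediction for the same target. It uses plain dictionaries and made-up rows; the helper names are hypothetical, and the way ml_deploy performs this join internally may differ.

```python
from datetime import date

# Made-up prediction rows: the same target can be scored more than once.
predictions = [
    {"prediction_id": 1, "target_id": "a", "prediction_date": date(2018, 8, 1), "prediction": 0},
    {"prediction_id": 2, "target_id": "b", "prediction_date": date(2018, 8, 1), "prediction": 1},
    {"prediction_id": 3, "target_id": "a", "prediction_date": date(2018, 8, 15), "prediction": 1},
]

# Made-up actual outcomes keyed by target id.
results = {"a": 1, "b": 0}


def latest_prediction_by_target(rows):
    """Keep only the most recent prediction row for each target_id."""
    latest = {}
    for row in rows:
        current = latest.get(row["target_id"])
        if current is None or row["prediction_date"] > current["prediction_date"]:
            latest[row["target_id"]] = row
    return latest


def link_results(rows, outcomes):
    """Pair each outcome with the latest prediction made for the same target."""
    latest = latest_prediction_by_target(rows)
    return [
        {
            "prediction_id": latest[target_id]["prediction_id"],
            "result_prediction": latest[target_id]["prediction"],
            "result": outcome,
        }
        for target_id, outcome in outcomes.items()
        if target_id in latest
    ]


linked = link_results(predictions, results)
```

Target `a` was scored twice, so its outcome is attached to the later prediction (id 3) rather than the earlier one.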

### Result Snapshots
Result sets can be snapshotted over time. This is useful for showing how your model improved over time with additional training and tweaks. 

## To Do
* Add features to automatically generate reporting and data visualization datasets.
* Add additional table integrity validation.

## Setup

The `MLDeploy` class takes the following inputs:
* sqlalchemy_engine - A SQLAlchemy `Engine` object. This is used to run queries against the database.
* schema - Specifies the schema in which the ml_deploy data model has been initialized. Probably going to become an optional keyword argument; the example below omits it.
* stored_model_utils_obj - An instance of `StoredModelUtils` or one of its subclasses (currently only `S3_StoredModelUtils` exists).

In [1]:
from ml_deploy.ml_deploy import MLDeploy, TestUtils, Evaluator
from ml_deploy.stored_model_utils import S3_StoredModelUtils, StoredModelUtils
from ml_deploy.models import create_data_model
from sqlalchemy import create_engine
from sklearn.linear_model import LogisticRegression
from pandas import read_sql
import os
from tempfile import mkdtemp

'''
# -= Postgres example =-
# create sql_alchemy_engine object
username = 'analytics'
password = os.getenv('red-pw')
host = 'dw-production.ucoachapp.com'
port = 5439
database = 'itkdw'

template = 'postgresql+psycopg2://{username}:{password}@{host}:{port}/{database}'
conn_string = template.format(username=username,
                              password=password,
                              host=host,
                              port=port,
                              database=database)

sa_engine = create_engine(conn_string)

s3_access = os.getenv('s3-access')
s3_secret = os.getenv('s3-secret')
bucket = 'ml-deploy'
s3_smu = S3_StoredModelUtils(s3_access, s3_secret, bucket)
'''

sa_engine = create_engine('sqlite:///:memory:')
create_data_model(sa_engine)
smu = StoredModelUtils(mkdtemp())

# create instance of MLDeploy
ml_deploy = MLDeploy(sqlalchemy_engine=sa_engine,
                     stored_model_utils_obj=smu)


## Creating a new model
After defining the name, type, model object, and model parameters, use the `ml_deploy.store_new_model()` method to create a new model.

In [2]:
model_name = 'Test Engagement Model'
model_type = 'engagement'
model_obj = LogisticRegression()
model_params = model_obj.get_params()

# store new model
model = ml_deploy.store_new_model(model_name=model_name,
                                  model_type=model_type,
                                  model_obj=model_obj,
                                  model_params=model_params)

### ml_deploy.Model attributes
The `ml_deploy.store_new_model()` method returns a saved version of the object. It contains the following useful attributes:

* `model.model` - A reference to the actual model object.
* `model.model_id` - The __id__ value for the model stored in the __ml_deploy_models__ table.
* `model.model_name` - The given name of the model. Must be unique!
* `model.model_type` - The generic type of the model. Used for comparing models.
* `model.model_version_id` - The __id__ of the ml_deploy.Model object in the __ml_deploy_model_versions__ table.
* `model.model_version_uuid` - A __uuid__ for the ml_deploy.Model object in the __ml_deploy_model_versions__ table.
* `model.model_version` - A useful indicator of the __model_version__ iteration of the current __model__.

If your model is trained, the following attributes will be set:
* `model.fitted_model_id` - The __id__ of the ml_deploy.Model object in the __ml_deploy_fitted_models__ table.
* `model.fitted_model_uuid` - A __uuid__ for the ml_deploy.Model object in the __ml_deploy_fitted_models__ table.
* `model.fitted_model_version` - A useful indicator of the __fitted_model__ iteration of the current __model_version__.


In [3]:
print('model_id >', model.model_id)
print('model_version >', model.model_version)
print('fitted_model_uuid >', model.fitted_model_uuid) # 'None', since this model hasn't been fitted

model_id > 1
model_version > 1
fitted_model_uuid > None


## Updating a Model Version
If you ever want to change some fundamental parameter of your model, you can make changes directly to the `model.model` attribute, and save them with the `ml_deploy.store_model_version()` method.

In [4]:
updated_model_obj = LogisticRegression(solver='lbfgs')
model.model = updated_model_obj
updated_model_params = model.model.get_params()
model = ml_deploy.store_model_version(model, model_params=updated_model_params)

## Utility functions
### Querying the data model
It is possible to query the data models via the __ml_deploy.query_utils__ class. The methods in this class are straightforward, and it's probably best to stick to the ones that start with _get_. Some useful starters are:
* `ml_deploy.query_utils.get_models()` - returns all stored __models__ as a pandas DataFrame.
* `ml_deploy.query_utils.get_model_versions(model_id=None)` - returns all stored __model_versions__ as a pandas DataFrame. If the optional __model_id__ is given, only versions of that model are returned.
* `ml_deploy.query_utils.get_fitted_models(model_version_id=None)` - returns all stored __fitted_models__ as a pandas DataFrame. If the optional __model_version_id__ is given, only fitted models of that version are returned.

In [5]:
model_df = ml_deploy.query_utils.get_models()

model_version_df = ml_deploy.query_utils.get_model_versions(model.model_id)
model_version_df.head()

|   | id | uuid | model_id | version | parameters | production_version | created_at |
|---|----|------|----------|---------|------------|--------------------|------------|
| 0 | 1 | 5d9041d8-aef6-425b-9def-8d03ad279110 | 1 | 1 | {"C": 1.0, "class_weight": null, "dual": false... | False | 2018-08-29 00:49:55.252226 |
| 1 | 2 | 8595f86e-afd8-4ad5-89b4-ff9ba1364f6a | 1 | 2 | {"C": 1.0, "class_weight": null, "dual": false... | False | 2018-08-29 00:50:07.138556 |


### Retrieving existing models
Stored models can be retrieved via the following methods:
* `ml_deploy.retrieve_model_version(model_version_id)` - Given a model_version_id, this method will return the associated unfit MLDeployModel object.
* `ml_deploy.retrieve_fitted_model(fitted_model_id)` - Given a fitted_model_id, this method will return the associated fitted MLDeployModel object.
* `ml_deploy.retrieve_production_model(model_id)` - Given a model_id, this method will return the associated fitted production MLDeployModel object. A given __model__ can only have one production __fitted_model__ at a time. By default, this will be the most recently created fitted model. 

In [6]:
old_model = ml_deploy.retrieve_model_version(1)
print('old model version id>', old_model.model_version_id)
print('old model fitted id >', old_model.fitted_model_id) # None, since the model has not yet been fitted

new_model = ml_deploy.retrieve_model_version(2)
print('current unfitted model version id >', new_model.model_version_id)

old model version id> 1
old model fitted id > None
current unfitted model version id > 2


# Training

ml_deploy does not handle fetching or preprocessing data. The only requirements are that you have a DataFrame for your training data, and that your preprocessing function returns exactly one DataFrame.

In [7]:
from oa_jobs.lib.oa.query import AWS_Query
from os import getenv

aws = AWS_Query(None, 'itkdw', getenv('red-pw'))

def get_training_data():
    sql = '''
    select student_uuid
        ,case
          when first_contact_date is null then total_comm_outgoing
          else total_comm_outgoing_pre_contact
        end as contact_outreaches
        ,case
          when first_contact_date is null then datediff(day, first_outgoing_comm_date, getdate())
          else days_outreach_to_contact
        end as days_to_contact
        ,case
            when first_contact_date is not null then 1
            else 0
        end as contacted
              
    from vresearch_engagement_summary
    
    where first_comm_sender_role = 'coach'
        and sdp = 'PSC'
        
    limit 1000
    '''
    return aws.read(sql)

def preprocess(input_df):
    input_df['c'] = 1
    input_df['days_to_contact'] = input_df['days_to_contact'].fillna(0)
    return input_df

# get training data
training_df = get_training_data()

# preprocess data
preproc_df = preprocess(training_df.copy())

## ml_deploy.TestUtils
ml_deploy provides a utility class, `TestUtils`, for flagging a test subset of your data (`TestUtils.get_test_ind()`), along with its own version of `train_test_split` (`TestUtils.train_test_split()`). The test indicator these methods produce is stored by the ml_deploy data model.

In [8]:
# get __test__ field
test_utils = TestUtils()
preproc_df = test_utils.get_test_ind(preproc_df)
preproc_df.head()

# get train / test split
feature_cols = ['c', 'contact_outreaches', 'days_to_contact']
label_col = 'contacted'
X_train, X_test, y_train, y_test = test_utils.train_test_split(preproc_df,
                                                               feature_cols=feature_cols,
                                                               label_col=label_col)
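
The `# get __test__ field` comment above suggests that `get_test_ind()` adds a `__test__` column to the DataFrame. The sketch below shows, under that assumption, why a stored indicator is useful: the split can be reproduced later from the saved rows alone. The helper names, the 25% test size, and the seed are all hypothetical, and plain dictionaries stand in for DataFrame rows.

```python
import random


def add_test_indicator(rows, test_size=0.25, seed=0):
    """Return copies of rows with a boolean '__test__' key (hypothetical helper)."""
    rng = random.Random(seed)
    n_test = round(len(rows) * test_size)
    test_positions = set(rng.sample(range(len(rows)), n_test))
    return [dict(row, __test__=(i in test_positions)) for i, row in enumerate(rows)]


def split_on_indicator(rows, feature_cols, label_col):
    """Split rows into train/test features and labels using the stored flag."""
    train = [r for r in rows if not r["__test__"]]
    test = [r for r in rows if r["__test__"]]

    def features(subset):
        return [[r[c] for c in feature_cols] for r in subset]

    def labels(subset):
        return [r[label_col] for r in subset]

    return features(train), features(test), labels(train), labels(test)


# Made-up rows using the column names from the example above.
rows = [
    {"c": 1, "contact_outreaches": i, "days_to_contact": 2 * i, "contacted": i % 2}
    for i in range(8)
]
flagged = add_test_indicator(rows)
X_tr, X_te, y_tr, y_te = split_on_indicator(
    flagged, ["c", "contact_outreaches", "days_to_contact"], "contacted"
)
```

Because the flag travels with each row, anyone reading the stored training data can tell which records were held out when the model was evaluated.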

## Training your model
Because your model object is exposed directly as the `model.model` attribute, you can use whatever method is required to train it once you have your train & test data sets.

In [9]:
model.model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

## Evaluating Model Performance
The ml_deploy.Evaluator class is a simple class for evaluating model performance. Pass it your predicted and test label values, and it will return a dictionary of performance results. By default, it will evaluate the Precision, Recall, and Accuracy of your model. This class can be inherited and modified to include different measures.

Because the emphasis is on providing reporting and data visualization on the performance of your models over time, one assumption is that the evaluation metrics have scores bounded between 0 and 1.

In [10]:
y_pred = model.model.predict(X_test)
evaluator = Evaluator(y_pred, y_test)
performance = evaluator.get_performance()
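
For reference, the three default measures and the subclassing pattern can be sketched in plain Python. `SketchEvaluator` and `F1Evaluator` are hypothetical classes written for this example; they only mirror the `Evaluator(y_pred, y_test)` and `get_performance()` calls shown above, and the real class may compute or name its scores differently.

```python
class SketchEvaluator:
    """Hypothetical stand-in for an evaluator; not the ml_deploy class."""

    def __init__(self, y_pred, y_true):
        self.pairs = list(zip(y_pred, y_true))

    def _count(self, pred, true):
        """Count (prediction, actual) pairs matching the given values."""
        return sum(1 for p, t in self.pairs if p == pred and t == true)

    def get_performance(self):
        tp, fp = self._count(1, 1), self._count(1, 0)
        fn, tn = self._count(0, 1), self._count(0, 0)
        # Every score is a ratio of counts, so it stays between 0 and 1.
        return {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "accuracy": (tp + tn) / len(self.pairs) if self.pairs else 0.0,
        }


class F1Evaluator(SketchEvaluator):
    """Adds an F1 score on top of the default measures."""

    def get_performance(self):
        scores = super().get_performance()
        p, r = scores["precision"], scores["recall"]
        scores["f1"] = 2 * p * r / (p + r) if p + r else 0.0
        return scores


# Made-up predictions and labels.
scores = F1Evaluator([1, 0, 1, 1], [1, 0, 0, 1]).get_performance()
```

F1 is a natural addition here because, like the default measures, it is bounded between 0 and 1.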

## Storing a fitted model
Once your model has been fitted, you can save it to the __ml_deploy_fitted_models__ table and save a serialized version via the `ml_deploy.store_fitted_model()` method.

In [11]:
model = ml_deploy.store_fitted_model(model, performance)

## Storing training data
Every __fitted_model__ object should have corresponding data in the __ml_deploy_training_data__ table. To store the data used to train your model, use the `ml_deploy.store_training_data()` method.

In [12]:
ml_deploy.store_training_data(model_obj=model,
                              training_df=training_df,
                              preproc_df=preproc_df,
                              target_col='student_uuid')

# Making predictions
Once you have a trained model that has been stored in the __ml_deploy_fitted_models__ table and as a serialized file, you can retrieve it and make predictions.

## Retrieving your model
The first step will typically be to retrieve your model. The `ml_deploy.retrieve_model()` method requires a _model_id_ value, but can also be supplied with _model_version_id_ and _fitted_model_id_ values. If those values are not provided, it will default to the most recent __model_version__ and / or __fitted_model__. The example below instead uses `ml_deploy.retrieve_production_model()`, which returns the production __fitted_model__ for a given _model_id_.

In [13]:
model_id = 1
model = ml_deploy.retrieve_production_model(model_id)

## Get your prediction data and apply your preprocessing function
The query to get the set of data that you want to score will likely be different from the one used to train your model. It may or may not use the same preprocessing method.

### Important note
Be sure to include a unique identifier for each record for which you are deriving a prediction or probability. This identifier populates the 'target_id' column in __ml_deploy_predictions__ and is used for updating the __ml_deploy_results__ table.

In [14]:
def get_prediction_data():
    sql = '''
    select stu_uuid
      ,sum(case when comm.sender_role = 'coach' then 1 else 0 end) as contact_outreaches
      ,datediff(day, coalesce(ustu.first_outgoing_comm_date, ustu.sf_createddate), getdate()) as days_to_contact
      ,contacted
        
    from ucoach_students_all ustu
        
      left outer join ucoach_communications comm
      on ustu.stu_uuid = comm.student_uuid
        
    where ustu.sdp='PSC'
      and
        (
          (
            ustu.first_outgoing_comm_date is null
            and datediff(day, sf_createddate, getdate()) < 90
          )
          or datediff(day, ustu.last_communication_date, getdate()) < 45
        )
        
    group by 1, 3, 4
    order by stu_uuid
    limit 100;
    '''
    return aws.read(sql)



In [15]:
prediction_data_df = get_prediction_data()
proc_prediction_data_df = preprocess(prediction_data_df.copy())
feature_cols = ['c', 'contact_outreaches', 'days_to_contact']

## Make predictions and get probability scores
__ml_deploy__ is set up to capture both binary predictions and their corresponding probability scores. Since your model object is exposed directly as the `model.model` attribute, you can use the appropriate method to return the probability and prediction values.

In [16]:
pred = model.model.predict(proc_prediction_data_df[feature_cols])
prob = model.model.predict_proba(proc_prediction_data_df[feature_cols])[:,1]

## Storing predictions
Once you've created your predictions, you'll want to store them in the __ml_deploy_predictions__ table. This can be done via the `ml_deploy.store_prediction_data()` method. 

In [17]:
ml_deploy.store_prediction_data(model_obj=model,
                                input_df=prediction_data_df,
                                preproc_df=proc_prediction_data_df,
                                predictions=pred,
                                probabilities=prob,
                                target_col='stu_uuid')

# Storing your results
After you've made your predictions, you'll want to capture some actual results to evaluate your model's performance.

## Fetch result data
The query used to fetch your actual result data may or may not be the same as the one used to get your prediction dataset. 

### Important note
The result DataFrame should have two columns:
* `target_id`: the unique id of the record that was scored. This should align with the field passed to the 'target_col' keyword argument used in the `ml_deploy.store_prediction_data` method.
* `result`: this should be the actual 1 or 0 outcome that you are attempting to predict with your model. 

If your result dataframe is organized or typed differently, it cannot be processed.
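
A quick pre-flight check along these lines can catch a malformed result set before it is submitted. This is a sketch, not an ml_deploy function: `validate_result_rows` is a hypothetical helper, and plain dictionaries stand in for DataFrame rows (for example, the output of `DataFrame.to_dict('records')`).

```python
def validate_result_rows(rows):
    """Raise ValueError unless every row has exactly target_id and a 0/1 result."""
    for i, row in enumerate(rows):
        if set(row) != {"target_id", "result"}:
            raise ValueError(
                f"row {i}: expected columns target_id and result, got {sorted(row)}"
            )
        if row["result"] not in (0, 1):
            raise ValueError(f"row {i}: result must be 0 or 1, got {row['result']!r}")
    return rows


# Made-up rows: the first set is well formed, the second has a non-binary result.
good = [{"target_id": "a", "result": 1}, {"target_id": "b", "result": 0}]
validate_result_rows(good)

try:
    validate_result_rows([{"target_id": "c", "result": "yes"}])
    error = None
except ValueError as exc:
    error = str(exc)
```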

## Storing results
Pass the model object used to make the last round of predictions, along with your results DataFrame, to the `ml_deploy.update_results()` method. In the example below, the third and fourth arguments give the names of the target id and result columns in that DataFrame. This will update the predictions and results in the __ml_deploy_results__ table, which will serve as the primary source for evaluating your model's performance.

In [18]:
ml_deploy.update_results(model, prediction_data_df, 'stu_uuid', 'contacted')

results_df = ml_deploy.query_utils.get_results(model_id=1)
results_df = results_df.drop(columns='target_id')
results_df.head()

INFO:ml_deploy.ml_deploy:starting result processing
INFO:ml_deploy.ml_deploy:processing input results dataframe
INFO:ml_deploy.ml_deploy:getting latest predictions
INFO:ml_deploy.ml_deploy:initializing new result dataframe
INFO:ml_deploy.ml_deploy:generating insert orms
INFO:ml_deploy.ml_deploy:deleting existing results for updated records
INFO:ml_deploy.ml_deploy:inserting new and updated records
INFO:ml_deploy.ml_deploy:complete


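|   | id | model_id | prediction_id | result_prediction | result_probability | result | created_at |
|---|----|----------|---------------|-------------------|--------------------|--------|------------|
| 0 | 1 | 1 | 1 | 0 | 2.540014e-14 | 1 | 2018-08-29 00:54:16.221279 |
| 1 | 2 | 1 | 2 | 1 | 0.6331906 | 0 | 2018-08-29 00:54:16.221279 |
| 2 | 3 | 1 | 3 | 1 | 0.9926459 | 0 | 2018-08-29 00:54:16.221279 |
| 3 | 4 | 1 | 4 | 0 | 0.2674052 | 0 | 2018-08-29 00:54:16.221279 |
| 4 | 5 | 1 | 5 | 1 | 0.7766961 | 1 | 2018-08-29 00:54:16.221279 |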


## Taking a Result Snapshot
You can take a snapshot of your data by running the `ml_deploy.create_results_snapshot()` method. This will save the current result set along with a timestamp and snapshot version in the __ml_deploy_result_snapshots__ table.

In [19]:
ml_deploy.create_results_snapshot(model)
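
Conceptually, a snapshot is a frozen copy of the current results tagged with a version and a timestamp. The sketch below illustrates that idea with in-memory lists; `take_snapshot` and its field names are assumptions made for this example and do not describe how ml_deploy writes to its snapshot table.

```python
from copy import deepcopy
from datetime import datetime, timezone


def take_snapshot(current_results, snapshots):
    """Append a timestamped, versioned copy of current_results to snapshots."""
    version = max((s["snapshot_version"] for s in snapshots), default=0) + 1
    snapshots.append(
        {
            "snapshot_version": version,
            "created_at": datetime.now(timezone.utc),
            "results": deepcopy(current_results),
        }
    )
    return version


# Made-up result rows.
current = [{"prediction_id": 1, "result_prediction": 0, "result": 1}]
history = []

first = take_snapshot(current, history)
current[0]["result_prediction"] = 1  # a later prediction after retraining
second = take_snapshot(current, history)
```

Copying the rows rather than referencing them keeps earlier snapshots unchanged when the live results are updated, which is what makes comparisons over time meaningful.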