# Keeping code maintainable and extensible for small teams

## Introduction

For ten years, I've been working on small teams where people have different areas of expertise. The world expert in domain X may not be comfortable doing anything more than building models in their favorite language, while the bright, polyglot programmer may have no interest in analytics. When you need to build a new model, do you need your data scientists to be comfortable with your programming frameworks, distributed systems, or microservices? When you need to modify your infrastructure, do you need your engineers to understand how data is cleaned and formatted, or how models are trained and crossvalidated? How do you build a system that is maintainable and extensible?

As the first engineer at my last startup, I was responsible for creating both the engineering and data science components of our data analytics pipeline. I chose to keep these components separated so that team members with strengths in either area would feel comfortable maintaining and extending the codebase, and I'm going to demonstrate how with some sample code.

## Project structure

First, this was a Django project, so I created an ```analytics``` module in the ```main``` folder beside ```commands```, ```views```, and other standard Django modules. The project looks somewhat like this, with more descriptive names for the generic "model a", "model b", and "model c" labels:

```
├──  analytics
│   ├──  model_base.py
│   ├──  model_a
│   │   ├──  base.py
│   │   ├──  dataset.py
│   │   ├──  model.py
│   ├──  model_b
│   │   ├──  base.py
│   │   ├──  dataset.py
│   │   ├──  model.py
│   ├──  model_c
│   │   ├──  base.py
│   │   ├──  dataset.py
│   │   ├──  model.py
```

(Thanks ```brew install tree```!)

The general idea is that the engineering code is restricted to ```model_base.py``` and the ```base.py``` scripts, while the data science code is found only in the ```dataset.py``` and ```model.py``` scripts. Do you need to do something with AWS EFS? Do you need to work with scikit-learn? You already know where you should be looking.

## Engineering components

In ```model_base.py```, I've created a few resources that are used by each of the models:

```python
class Sample(object):
    """
    I'm a class for uniform dataset generation. Engineering scripts know how to assign
    cleaned and formatted data to me as attributes, while data science scripts know how
    to access attributes to generate Datasets.
    """
    def __init__(self, sample_id, sample_data, sample_label):
        self.id = sample_id
        self.data = sample_data
        self.label = sample_label


class Dataset(object):
    """
    I'm a class for uniform model fitting and predicting, or training and testing. Data
    science scripts know how to assign features and labels to me as attributes, while models
    know how to fit and predict using those attributes.
    """
    def __init__(self, sample_ids, features, labels):
        self.sample_ids = sample_ids
        self.features = features
        self.labels = labels


class Prediction(object):
    """
    I'm a class for uniform model output. Data science scripts know how to assign predictions
    and probabilities, while engineering scripts know how to access those values for business
    logic use-cases.
    """
    def __init__(self, sample_id, model_label, model_probability):
        self.id = sample_id
        self.label = model_label
        self.probability = model_probability


class Model(object):
    """
    I'm a class to make it simple to fit and predict.
    """
    def fit(self, dataset_fit):
        self.dataset_fit = dataset_fit
        self.model = self._get_model()
        self._fit(dataset_fit, self.model)

    def _get_model(self):
        """
        I need to be implemented for each derived model class, and I need to return
        an unfitted model object.
        """
        raise NotImplementedError

    def _fit(self, dataset_fit, model):
        """
        I need to be implemented for each derived model class, and I need to accept Datasets
        generated from all Samples.
        """
        raise NotImplementedError

    def predict(self, dataset_predict):
        return self._predict(dataset_predict, self.model)

    def _predict(self, dataset_predict, model):
        """
        I need to be implemented for each derived model class, and I need to return
        Predictions for each sample.
        """
        raise NotImplementedError


def save_model(model):
    """
    I save models in a standardized way.
    """
    # Not shown:  code to save models


def load_model(model_identifier):
    """
    I load models in a standardized way.
    """
    # Not shown:  code to load models
```

A nice side-effect of this partitioning? Every internal model will now have the same pipeline and API:

1. Generate data
2. Create Samples from raw data
3. Create a Dataset from Samples
4. Create a Model from a Dataset
5. Create Predictions from a Model and new Samples/Datasets
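
As a rough illustration, the five steps above might be wired together as follows. This is a minimal, dependency-free sketch: the ```MajorityModel``` class, the ```to_dataset()``` helper, and the raw data rows are hypothetical stand-ins invented for the example, and the trimmed-down ```Sample```, ```Dataset```, and ```Prediction``` classes only mirror the ones in ```model_base.py```.

```python
from collections import Counter


class Sample(object):
    def __init__(self, sample_id, sample_data, sample_label=None):
        self.id = sample_id
        self.data = sample_data
        self.label = sample_label


class Dataset(object):
    def __init__(self, sample_ids, features, labels=None):
        self.sample_ids = sample_ids
        self.features = features
        self.labels = labels


class Prediction(object):
    def __init__(self, sample_id, model_label, model_probability):
        self.id = sample_id
        self.label = model_label
        self.probability = model_probability


class MajorityModel(object):
    """Hypothetical toy model: always predicts the most common training label."""

    def fit(self, dataset_fit):
        counts = Counter(dataset_fit.labels)
        self.label, count = counts.most_common(1)[0]
        self.probability = count / float(len(dataset_fit.labels))

    def predict(self, dataset_predict):
        return [Prediction(id_, self.label, self.probability)
                for id_ in dataset_predict.sample_ids]


def to_dataset(samples):
    """Hypothetical helper standing in for the dataset.py functions."""
    return Dataset(sample_ids=[s.id for s in samples],
                   features=[s.data for s in samples],
                   labels=[s.label for s in samples])


# 1. Generate data (made-up rows: id, feature values, label)
raw_rows = [(1, [0.2, 1.0], "spam"), (2, [0.9, 0.1], "ham"), (3, [0.3, 0.8], "spam")]
# 2. Create Samples from raw data
samples = [Sample(id_, data, label) for id_, data, label in raw_rows]
# 3. Create a Dataset from Samples
dataset_fit = to_dataset(samples)
# 4. Create a Model from a Dataset
model = MajorityModel()
model.fit(dataset_fit)
# 5. Create Predictions from a Model and new Samples
new_samples = [Sample(4, [0.5, 0.5])]
predictions = model.predict(to_dataset(new_samples))
```

Nothing in steps 1, 2, or 5 depends on what the model does internally, which is the point of the partitioning.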

Sure, there are lower-level details that I'm glossing over, like the fact that ```Dataset.features``` can be things like ```pandas.DataFrame``` objects or ```scipy.sparse``` matrices, but we're going for a high-level overview.

Next, the ```base.py``` scripts create models, predict with existing models, and crossvalidate models. The pattern generally looks like the following, and could definitely be refactored and generalized in the future.

```python
from main.analytics import model_base
import main.analytics.model_a.dataset as dataset_a
import main.analytics.model_a.model as model_a


def create_model(samples):
    dataset_fit = dataset_a.get_dataset_fit(samples)
    model = model_a.DerivedModel()
    model.fit(dataset_fit)
    model_base.save_model(model)


def predict_with_model(samples):
    dataset_predict = dataset_a.get_dataset_predict(samples)
    model = model_base.load_model(model_id)  # We set model ID elsewhere, not important
    return model.predict(dataset_predict)


def crossvalidate_model(samples):
    """
    This function has code for training and testing, generates performance reports
    """
    # Not shown
```
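
The body of ```crossvalidate_model()``` is not shown here. One way it could be structured is sketched below with plain k-fold splitting. The ```k_fold_splits()``` helper, the ```fit_fn```, ```predict_fn```, and ```get_label``` callables, and the choice of accuracy as the metric are all assumptions made for illustration, not the original implementation.

```python
def k_fold_splits(samples, n_folds=3):
    """Yield (train, test) lists so that each sample is tested exactly once."""
    folds = [samples[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        if not test:
            continue  # Fewer samples than folds: skip empty test folds
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


def crossvalidate(samples, fit_fn, predict_fn, get_label, n_folds=3):
    """
    Hypothetical sketch: fit_fn(train) returns a fitted model, and
    predict_fn(model, test) returns one predicted label per test sample.
    Returns the accuracy measured on each fold.
    """
    accuracies = []
    for train, test in k_fold_splits(samples, n_folds):
        model = fit_fn(train)
        predicted = predict_fn(model, test)
        correct = sum(1 for p, s in zip(predicted, test) if p == get_label(s))
        accuracies.append(correct / float(len(test)))
    return accuracies


# Toy usage: samples are (value, label) pairs and the "model" is a mean threshold.
samples = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1), (0.3, 0), (0.7, 1)]
accuracies = crossvalidate(
    samples,
    fit_fn=lambda train: sum(v for v, _ in train) / float(len(train)),
    predict_fn=lambda threshold, test: [int(v > threshold) for v, _ in test],
    get_label=lambda sample: sample[1],
)
```

In the real pipeline, ```fit_fn``` would presumably wrap ```get_dataset_fit()``` and ```DerivedModel.fit()```, and ```predict_fn``` would wrap ```get_dataset_predict()``` and ```predict()```.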

## Data science components

Does the code involve math, statistics, or machine learning? Does it use numpy, pandas, scipy, or scikit-learn? Does it generate or format features, labels, or models? Then it'll be found in either a ```dataset.py``` or ```model.py``` script. The former generally looks something like this:

```python
import numpy
import pandas
import scipy

from main.analytics.model_base import Dataset


def get_dataset_fit(samples):
    # Simplified example:
    sample_ids = [sample.id for sample in samples]
    features = parse_features_from_samples(samples)
    labels = parse_labels_from_samples(samples)  # Analogous to the features parser, not shown
    return Dataset(sample_ids=sample_ids, features=features, labels=labels)


def get_dataset_predict(samples):
    # Simplified example: there are no labels at prediction time, but the sample IDs
    # are still needed to build Predictions
    sample_ids = [sample.id for sample in samples]
    features = parse_features_from_samples(samples)
    return Dataset(sample_ids=sample_ids, features=features, labels=None)


def parse_features_from_samples(samples):
    # foo, bar, and baz are placeholders for the real numpy/pandas/scipy calls
    stuff = numpy.foo(samples)
    more_stuff = pandas.bar(stuff)
    return scipy.baz(more_stuff)
```

The latter, ```model.py```, generally follows this format:

```python
import sklearn.model_module  # Placeholder for the real scikit-learn module

from main.analytics.model_base import Model, Prediction


class DerivedModel(Model):

    def _get_model(self):
        # model_class is a placeholder for the real scikit-learn estimator
        return sklearn.model_module.model_class()

    def _fit(self, dataset_fit, model):
        model.fit(dataset_fit.features, dataset_fit.labels)

    def _predict(self, dataset_predict, model):
        predictions = model.predict(dataset_predict.features)
        # Probabilities are computed from the features, just like the predictions
        probabilities = model.predict_proba(dataset_predict.features)
        return [Prediction(id_, label, prob) for id_, label, prob
                in zip(dataset_predict.sample_ids, predictions, probabilities)]
```

## Back to the bigger picture

You need to modify how or where models are being stored, or how often they're being updated? Piece of cake! Have an engineer modify the base ```save_model()``` and ```load_model()``` functions. They won't need to know how raw data is being used or where model predictions are coming from, whether that's linear models, neural networks, or random number generators.

You need to create a brand-new model for a new business goal, or improve the performance of an existing model? No problem! Have a data scientist create a new ```Dataset``` and ```DerivedModel``` object, or modify the methods on existing objects. They won't need to care whether their model state is stored locally, on S3 or EFS, or on the CEO's personal computer.

This isn't the perfect solution -- there are definitely improvements to be made -- but it's a good starting point for a small team.