<h1><center>Developing Python Backends for Machine Learning Applications</center></h1>
<h2><center>by Jaidev Deshpande</center></h2>
<h3><center>Data Scientist @ Cube26 Software Pvt Ltd</center></h3>

<div style="text-align: center">
<div id="social">
<div id="social_twitter">
    <a href="http://twitter.com/jaidevd"><img src="images/twitter-128.png" width="32" height="32">
    </a>
</div>
<div id="social_medium">
    <a href="http://medium.com/@jaidevd"><img src="images/medium.ico" width="32" height="32"></a>
</div>
<div id="social_github">
    <a href="http://github.com/jaidevd"><img src="images/mark-github-128.png" width="32" height="32"></a>
</div>
</div>
</div>

<div style="text-align: center"><font size="4"><strong>@jaidevd</strong></font></div>

This is a follow-up to my talk at PyCon a few months ago. That talk was about continuous integration for data scientists. This talk is going to be a lot more specific about the scientific computing aspects. I am going to go into a reasonable amount of detail about the data processing and machine learning side of it.

# From Peter Norvig's Q&A session on Quora:
> I think that it will be important for machine learning experts and software engineers to come together to develop best practices for software development of machine learning systems. Currently we have a software testing regime where you define unit tests... We will need new testing processes that involve running experiments, analyzing the results... This is a great area for software engineers and machine learning people to work together to build something new and better.

Recently Peter Norvig was asked in a Q&A what he thinks about the place of software developers in machine learning research, and this is his answer: "It's important for machine learning experts and software engineers to come together to develop best practices for the development of ML systems. We need new testing processes that involve running experiments and analyzing results" and "this is a great area for software developers and ML experts to work together". It is broadly in this spirit that I will be talking. I choose to make this talk more about engineering than about machine learning, since even minimal machine learning can go a long way with good engineering.

<div style="text-align: center">
<img src="images/reos.png">
</div>

Last August, my company released a suite of apps, called the ReOs suite, all of which have a very heavy data-driven component. These are the three flagship apps of the campaign - reos message, reos music and reos camera. You can find all of them on the Google Play Store. Each of these apps has features that are driven by machine learning. For example, the music app has features like automatic lyrics synchronization, identification of tunes from recordings, etc. The camera app has beautification / styling filters which are applied by a neural network. For the purpose of this talk, I will be focusing on the SMS app, simply because the machine learning built into it is easy to understand. Still, it is representative of what most Android apps do when it comes to using machine learning.

# Example: How we started building ReosMessage

<ul>
    <li><h3>Classification of SMS into personal, transactional and spam</h3></li>
    <li><h3>Get a dump of a table from postgres</h3></li>
    <li><h3>Use simple Python scripts and some pandas magic to clean the data</h3></li>
    <li><h3>Use regular expressions to label the data</h3></li>
    <li><h3>Train sklearn estimators on the labeled data</h3></li>
    <li><h3>Crowdsource the evaluation of the predictions</h3></li>
    <li><h3>Dump the model coefficients to JSON</h3></li>
    <li><h3>Hand over the JSON to Android developers</h3></li>
</ul>

I'll start with an example, and at the end of the talk I'll come back to how it evolved. So the SMS app is called ReosMessage - it classifies your messages into personal, transactional (or notification) messages and spam. To build and deploy the classifier, all we had initially was a massive Python script with a single input, which was a CSV file. This file was just a dump of a table from a postgres database. We would then run some simple pandas functions to clean the data of unicode, emojis and anything else that was irrelevant. We didn't have any labels, so we used regular expressions to label a subset of the dataset into those three classes. On this data we trained some scikit-learn models. We used to send the model results to everyone in the office. They would all quickly glance over them and let us know if there were any glaring classification errors. When we were somewhat satisfied with the results, the model coefficients were dumped into a JSON file and sent to the Android developers, who could use them in their port of the classifier. It's easy to see how this could not have scaled. It's extremely monolithic, it's quite redundant in places, it wasn't modular at all, and it was becoming increasingly difficult to debug. So I'll come back to this example at the end of the talk and I'll tell you what we did to fix these things.
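To make the labeling step concrete, here is a minimal sketch of regex-based labeling. The patterns and sample messages are hypothetical and only illustrate the idea; the actual rule set is not shown in this talk. Messages that match no rule simply stay unlabeled, which is why only a subset of the data ends up with labels.

```python
import re

# Hypothetical patterns, for illustration only; a real rule set would be much larger.
LABEL_PATTERNS = [
    ("transactional", re.compile(r"\b(otp|debited|credited|a/c|txn|order id)\b", re.I)),
    ("spam", re.compile(r"\b(offer|discount|free|win|cashback)\b", re.I)),
]


def label_message(text):
    """Return the label of the first rule that matches, or None if no rule fires."""
    for label, pattern in LABEL_PATTERNS:
        if pattern.search(text):
            return label
    return None


messages = [
    "Your a/c XX1234 was debited with Rs 500",
    "Flat 50% discount on all shoes, today only",
    "Running late, see you at 8",
]
# Pair every message with its label; None marks a message left unlabeled.
labeled = [(m, label_message(m)) for m in messages]
```

The order of the rules matters here: the first matching pattern wins, so more specific rules should come before more general ones.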

# Typical Data Processing Pipeline
![](images/flowchart.png)

So this here is a very broadly representative, very general data processing pipeline. That box you see to the left is known as the ETL part of the pipeline. ETL stands for extract, transform and load. You start with a source datastore, and end up in a sink datastore. The ETL part of the process is the most data-intensive one - by that I don't mean that the others are not, it's just that this is the part where you treat your data for what it is - raw data. After it has left this box, it's no longer raw data: it's features, it's model coefficients, it's statistics, and so on. And it is in this box that your developer side has to be the most creative, as against your data scientist side, which has to show off outside the box. So after you've extracted, transformed and loaded your data, you train some model on it, and you do some validation which might allow you to do better model selection. Now, these steps - training, validation and model selection - are not necessarily well separated in time. You could be cross-validating one model while you are already training another, and a third one could actually be in production, writing its output to whatever your sink is. But the thing is that we tend to think about a pipeline as something sequential. You can't think of this as a pipeline. How do you bend the spoon? Think that there is no spoon. Think that there is no such pipeline, and that each of these blocks is an independent system - these are microservices - which only happen to be loosely coupled.

# Using Data Abstractions
## And Data Ingest as an abstraction

<ul>
    <li><h3>Raw data is an integral part of ETL and therefore of your software</h3></li>
    <li><h3>Working off local flat files is <em>bad</em>!</h3></li>
    <li><h3>Learn to work from remote storage to remote storage. Use the "cloud".</h3></li>
    <li><h3>What about experimental / exploratory work? Use things like sqlite!</h3></li>
    <li><h3>Only use local files when:</h3></li>
    <ul>
        <li><h3>doing EDA or any interactive studies.</h3></li>
        <li><h3>debugging a larger application.</h3></li>
        <li><h3>prototyping (resist the temptation to deploy).</h3></li>
    </ul>
</ul>
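As a rough illustration of the sqlite suggestion above, the sketch below uses Python's built-in `sqlite3` module and reads a table in chunks instead of loading a flat file in one go. The table name, columns and rows are made up for the example, and the in-memory database is only there to keep it self-contained.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; exploratory work
# would normally point at a database file or a remote store instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, sender TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO messages (sender, body) VALUES (?, ?)",
    [
        ("BANK", "Your account was debited"),
        ("SHOP", "Big sale this weekend"),
        ("+911234567890", "See you at 8"),
    ],
)
conn.commit()


def iter_chunks(connection, query, chunksize):
    """Yield lists of rows so the whole table never has to fit in memory."""
    cursor = connection.execute(query)
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield rows


chunks = list(iter_chunks(conn, "SELECT sender, body FROM messages ORDER BY id", chunksize=2))
```

Because the code only talks to a database connection, swapping sqlite for a remote database later changes the connection line and little else.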

# Using Pandas for Data Ingest

<ul>
<li><h3>A few million PSQL rows randomly sampled from over 15M rows</h3></li>
<li><h3>Preprocessing with:</h3></li>
<ul>
<li><h3>Removing unicode and emoji</h3></li>
<li><h3>Converting to lowercase</h3></li>
<li><h3>Dropping stopwords</h3></li>
<li><h3>Cleaning any other malformed input</h3></li>
</ul>
<li><h3>Using a few hundred regular expressions (on each column) to produce a labeled dataset</h3></li>
</ul>
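The preprocessing steps listed above can be sketched as a single plain-Python function. This is an illustrative stand-in, not the code we actually ran: the stopword list is a tiny placeholder, and the tokenizing regex is an assumption about what counts as a word. With pandas, a function like this could be applied to a text column with `Series.apply`.

```python
import re

# A tiny stand-in stopword list; a real one would come from a library or a file.
STOPWORDS = {"a", "an", "the", "is", "at", "to", "of", "and"}


def clean_message(text):
    """Drop non-ASCII characters (including emoji) and punctuation, lowercase, remove stopwords."""
    # Encoding to ASCII with errors="ignore" silently discards everything non-ASCII.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Keep runs of lowercase letters, digits and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", ascii_only.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


cleaned = clean_message("Meet me at the Station \U0001F600 at 8!")
```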


# Using PySemantic to Wrap Pandas
```yaml
smsdata:
  source: psql
  table_name: messages_db
  config:
    hostname: 127.0.0.1
    db_name: foo
    username: bar
  chunksize: 100000
  sampling:
    factor: 0.1
    kind: random
  dtypes:
    Message: &string !!python/name:__builtin__.str
    Number: *string
    person: *string
  postprocessors:
    - !!python/name:jeeves.preprocessing.text.remove_unicode
    - !!python/name:jeeves.preprocessing.text.remove_tabs
    - !!python/name:jeeves.preprocessing.text.remove_digits
    - !!python/name:jeeves.preprocessing.text.remove_stopwords
    - !!python/name:jeeves.preprocessing.text.to_lowercase
    - !!python/name:jeeves.feature_extraction.text.get_regex_features
    - !!python/name:jeeves.feature_extraction.text.get_tfidf_features
```
```python
>>> from pysemantic import Project
>>> smartsms = Project("smartsms")
>>> X = smartsms.load_dataset("smsdata")
```

# A Note about the AutoML project
![](images/automl.png)
<h4><i>"If machine learning is so ubiquitous, one should just be able to use it without understanding libraries."</i></h4>
<h4>- Andreas Mueller @ SciPy US 2016</h4>

<ul>
    <li><h3>sklearn philosophy: explicit is better than implicit. DevOps philosophy: just build and run stuff!</h3></li>
</ul>

# Automating Model Selection
## Automating Cross Validation and Hyperparameter Tuning

```python
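import joblib
import luigi
from sklearn.model_selection import GridSearchCV


class CrossValidationTask(luigi.Task):

    estimator = luigi.Parameter()  # or luigi.Target

    def run(self):
        # Run CV loop
        # Export metrics for each iteration
        ...


class GridSearchTask(luigi.Task):

    grid = luigi.Parameter()  # or Target
    estimator = luigi.Parameter()  # or Target
    ...

    def run(self):
        # Placeholder: in practice X and y are loaded from the input targets
        X, y = self.input()
        # Other GridSearchCV options are elided here
        clf = GridSearchCV(self.estimator, param_grid=self.grid)
        clf.fit(X, y)
        ...
        # joblib.dump expects a filename or file object, so pass the target's path
        joblib.dump(clf.best_estimator_, self.output().path)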
```
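Independently of Luigi, the "run CV loop, export metrics for each iteration" part of such a task might look like the sketch below. It is written in plain Python so that it stands on its own: the majority-class baseline is a hypothetical stand-in for a real estimator, and the folds are contiguous and unshuffled, which a real setup would probably not want.

```python
import json
from collections import Counter


def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def majority_class_fit(labels):
    """Hypothetical baseline 'estimator': always predict the most common label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate(y, n_folds):
    """Run the CV loop and return one metrics record per fold."""
    metrics = []
    for fold, (train_idx, test_idx) in enumerate(kfold_indices(len(y), n_folds)):
        prediction = majority_class_fit([y[i] for i in train_idx])
        correct = sum(1 for i in test_idx if y[i] == prediction)
        metrics.append({"fold": fold, "accuracy": correct / len(test_idx)})
    return metrics


y = ["spam", "spam", "personal", "spam", "spam", "transactional"]
fold_metrics = cross_validate(y, n_folds=3)
exported = json.dumps(fold_metrics)  # what a task could write to its output target
```

Exporting one record per fold, rather than a single averaged score, makes it possible to track how stable a model is across runs.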

# Data and Model _Quality_

## (Tools from psychometrics for) data quality evaluation
### - Katie Malone @ SciPy US 2016

<ul>
    <li><h3>Predictive modeling != building a model</h3></li>
    <li><h3>Iterative model selection involves going all the way back to data quality (not simply changing the pipeline)</h3></li>
    <li><h3>Develop the same intuition for your data as that for your model</h3></li>
    <ul>
        <li><h3>complex model + mediocre dataset = fair predictive accuracy</h3></li>
        <li><h3>simple model + great dataset = high predictive accuracy</h3></li>
    </ul>
</ul>

# Visualizing (Data & ML performance)

<ul>
        <li><h3>Bokeh server for dashboards</h3></li>
        <li><h3>Chaco / Traits-based visualizations - for interactive exploration</h3></li>
        <li><h3>Use libs like Seaborn for stats - resist the temptation to write them yourself</h3></li>
    
</ul>

# Exposing Trained Models

<ul>
        <li><h3>Simple serialization methods</h3></li>
        <li><h3>sklearn-compiledtrees</h3></li>
        <li><h3>Don't use nonlinear models where linear models will do</h3></li>
    
</ul>
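One reason linear models are easy to expose is that the whole model is a short list of numbers. The sketch below shows what handing over coefficients as JSON could look like, and how the receiving side can score a message from the JSON alone. The feature names, coefficient values and the choice of a binary logistic model are all assumptions made for the example.

```python
import json
import math

# Hypothetical coefficients for a binary linear model (spam vs. not spam);
# in practice these would be read off a trained estimator.
model = {
    "features": ["has_offer", "has_url", "n_digits"],
    "coef": [1.5, 0.8, -0.1],
    "intercept": -1.0,
}

payload = json.dumps(model)  # the JSON that gets handed over


def predict_proba(payload, x):
    """Re-implement logistic-regression scoring using only the JSON payload."""
    params = json.loads(payload)
    score = params["intercept"] + sum(w * v for w, v in zip(params["coef"], x))
    return 1.0 / (1.0 + math.exp(-score))


p_spam = predict_proba(payload, [1, 1, 0])
```

The scoring function is a dot product and a sigmoid, so porting it to another language is a few lines of code; a nonlinear model would need far more than a JSON file to reproduce.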

# Example: How we built & scaled ReosMessage

<ul>
    <li><h3><strike>Get a dump of a table from postgres</strike></h3></li>
    <li><h3><strike>Use simple Python scripts and some pandas magic to clean the data</strike></h3></li>
    <li><h3>Spark streaming API connected to Kafka consumers</h3></li>
    <li><h3>Use <strike>regular expressions</strike> user feedback to label the data</h3></li>
    <li><h3>Use Luigi to:</h3></li>
        <ul>
        <li><h3>Continuously run grid search and cross validation benchmarks</h3></li>
        <li><h3>Train sklearn estimators on the labeled data</h3></li>
        <li><h3>Dump the model coefficients to JSON</h3></li>
        <li><h3>Hand over the JSON to Android developers</h3></li>
        </ul>
    <li><h3>Use Jenkins to drive everything</h3></li>
</ul>

![](images/no_data.jpg)