<h1><center>Developing Python Backends for Machine Learning Applications</center></h1>
<h2><center>by Jaidev Deshpande</center></h2>
<h3><center>Data Scientist @ Cube26 Software Pvt Ltd</center></h3>

<div style="text-align: center">
<div id="social">
<div id="social_twitter">
    <a href="http://twitter.com/jaidevd"><img src="images/twitter-128.png" width="32" height="32">
    </a>
</div>
<div id="social_medium">
    <a href="http://medium.com/@jaidevd"><img src="images/medium.ico" width="32" height="32"></a>
</div>
<div id="social_github">
    <a href="http://github.com/jaidevd"><img src="images/mark-github-128.png" width="32" height="32"></a>
</div>
</div>
</div>

<div style="text-align: center"><font size="4"><strong>@jaidevd</strong></font></div>

Good afternoon. To begin with, I'd like to thank the organizers for having me. I'm speaking at SciPy after a gap of three years and it's really good to be back. So this is a follow-up to my talk at PyCon a few months ago. That talk was about continuous integration for data scientists. It was a very coarse overview of how data scientists or machine learning engineers can use continuous integration to make their lives easier. This time I am going to be a lot more specific about the scientific computing aspects, and go into some detail about the data processing and machine learning aspects of the same problem.

# From Peter Norvig's Q&A session on Quora:
> I think that it will be important for machine learning experts and software engineers to come together to develop best practices for software development of machine learning systems. Currently we have a software testing regime where you define unit tests... We will need new testing processes that involve running experiments, analyzing the results... This is a great area for software engineers and machine learning people to work together to build something new and better.

Recently Peter Norvig was asked in a Q&A what he thinks about the place of software developers in machine learning research. And this is his answer. "It's important for machine learning experts and software engineers to come together to develop best practices for the development of ML systems. We need new testing processes that involve running experiments and analyzing results" and that "this is a great area for software developers and ML experts to work together". So it is in this spirit that I'll be speaking. I've chosen to keep this talk more about engineering than machine learning, because I think that even the most elementary machine learning can go a long way if it is supported by good backend systems, and even the deepest of neural networks will fail if your systems are not well maintained.

<div style="text-align: center">
<img src="images/reos.png">
</div>

Last August, my company released a bunch of Android apps, called the ReOs suite of apps, all of which are heavily data driven. These are the three flagship apps of the campaign - reos message, reos music and reos camera. You can find all of them on the Google Play Store, by the way. All of these apps use machine learning to drive their most attractive features. For example, the music app has features like automatic lyrics synchronization and identification of tunes from recordings, like Shazam or Soundcloud. The camera app has beautification / styling filters which are applied by a neural network. For the purpose of this talk, however, I will be focusing on the SMS app - simply because the data science built into it is very easy to understand. But this is still representative of how most Android apps do machine learning.

# Example: How we started building ReosMessage

<ul>
    <li><h3>Classification of SMS into personal, notifications and spam</h3></li>
    <li><h3>Get a dump of a table from postgres</h3></li>
    <li><h3>Use simple Python scripts and some pandas magic to clean the data</h3></li>
    <li><h3>Use regular expressions to label the data</h3></li>
    <li><h3>Train sklearn estimators on the labeled data</h3></li>
    <li><h3>Crowdsource the evaluation of the predictions</h3></li>
    <li><h3>Dump the model coefficients to JSON</h3></li>
    <li><h3>Hand over the JSON to Android developers</h3></li>
</ul>

Now I'll start with an example, and at the end of the talk I'll come back to how it evolved. So the SMS app is supposed to classify your messages into three categories - personal messages, notifications and spam. So to build and deploy the classifier, all we had initially was a massive Python script - with a single entry point, which used to take a single input, which was a CSV file. This file was just a dump of a table from a postgres database. We then would run some simple pandas functions to clean the data of anything that was irrelevant. We didn't have labeled data, so we used regular expressions to label a subset of the dataset into those three classes. On this data we trained some scikit-learn models. And then we used to crowdsource the evaluation of these models. We would send the predictions to everyone in the office. They would all quickly glance over them and let us know if there were any glaring classification errors. When we were somewhat satisfied with the results, the model coefficients were dumped into JSON and were sent to the Android developers who could use them in the Android port of the classifier.

Now it's easy to see how this could not have scaled. It's totally monolithic, it's quite redundant in places, it wasn't modular at all, and also it was becoming increasingly difficult to debug. So I'll come back to this example at the end of the talk and I'll tell you what we did to fix these things.
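To give a feel for the regex labeling step in that first script, here is a minimal sketch. The patterns and the sample messages are made up for illustration, they are not the expressions we actually used, and anything that matches no rule is simply left unlabeled.

```python
import json
import re

# Hypothetical patterns, for illustration only. Order matters: first match wins.
LABEL_PATTERNS = [
    ("spam", re.compile(r"\b(win|free|offer|discount|lottery)\b", re.IGNORECASE)),
    ("notification", re.compile(r"\b(otp|debited|credited|delivered|a/c)\b", re.IGNORECASE)),
]


def label_message(text):
    """Return the label of the first matching rule, or None if no rule fires."""
    for label, pattern in LABEL_PATTERNS:
        if pattern.search(text):
            return label
    return None  # left unlabeled


messages = [
    "Congratulations! You win a FREE holiday",
    "Your a/c XX1234 was debited with Rs 500",
    "Hey, are we still on for dinner tonight?",
]

labeled = [{"text": m, "label": label_message(m)} for m in messages]
print(json.dumps(labeled, indent=2))
```

Only the messages that some rule catches end up in the training set, which is why this produces a labeled subset rather than a fully labeled dataset.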

# Typical Data Processing Pipeline
![](images/flowchart.png)

So this here is a very broadly representative, a very general data processing pipeline. That box you see to the left is known as the ETL part of the pipeline. ETL stands for extract, transform and load. You start with a source datastore, and end up in a sink datastore. The ETL part of the process is the most data-intensive one - by that I don't mean that others are not, it's just that this is the part where your data is literally raw data. After it's left this box, it's no longer raw data, it's features, it's model coefficients, it's statistics, and so on. And it is in this box that your data engineering has to be the most creative. So after you've extracted, transformed and loaded your data, you train some model on it, and you do some validation which might allow you to do better model selection. Now, these steps - training, validation and model selection - are not necessarily sequential in time. You could be cross-validating one model while you are already training another model, and a third one could actually be in production, writing its output to whatever your sink is. But the thing is that we tend to think about a pipeline as something sequential. It's much better to think of each of these blocks as independent systems - these are microservices - which only happen to be loosely coupled. So here are a few lessons about how to best manage the ETL box.

# Managing Raw Data
## And Data Ingest as a Service

<ul>
    <li><h3>Raw data is an integral part of ETL and therefore of your software</h3></li>
    <li><h3>Working off local flat files is <em>bad</em>!</h3></li>
    <li><h3>Learn to work from remote storage to remote storage. Use the "cloud".</h3></li>
    <li><h3>What about experimental / exploratory work? Use things like sqlite!</h3></li>
    <li><h3>Only use local files when:</h3></li>
    <ul>
        <li><h3>doing EDA or any interactive studies.</h3></li>
        <li><h3>debugging a larger application.</h3></li>
        <li><h3>prototyping (resist the temptation to deploy).</h3></li>
    </ul>
</ul>

Abstract away almost everything that has to do with ETL. We, as a community, the data science community, haven't paid much attention to it. Machine learning systems have grown super sophisticated while the data management systems that these same machine learning systems use have trudged behind. Let's stop thinking of raw data as an inanimate entity that you only have to dig through before you can get the gold. The digging itself is part of your software, and is therefore subject to pretty much the same dangers as any other kind of software development. So build abstraction layers, services and all kinds of tooling that you would need around ETL. Now this might sound very obvious, but you'd be surprised at how infrequently we practice this. For example, almost all the development data scientists do is based off local flat files - most of it actually happens within the ipython notebook. The excuse is that they're just building prototypes. But we know how fast the boundaries between data scientists and other kinds of developers are thinning. In that light, we have to learn to work with larger, more integrated data sources. Even if you're just building a prototype, this is still not very healthy because you have no idea how long the prototyping is going to take, or how many intermediate files you might end up producing. So, it's best to learn to use larger integrated systems. For example, even if you're using the Iris dataset, try to use sqlite instead of CSV. The cloud is your friend. The sooner you become comfortable with remote or cloud based distributed storages, the faster you can deploy your apps.
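As a small illustration of the sqlite suggestion, here is a sketch using only the standard library. The handful of rows is Iris-like sample data typed in by hand, and the table and column names are just my choice for this example.

```python
import sqlite3

# A few Iris-like rows; the values are illustrative, not the full dataset.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
]

# ":memory:" keeps the example self-contained; a database file path
# would take its place in real work.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iris ("
    "sepal_length REAL, sepal_width REAL, "
    "petal_length REAL, petal_width REAL, species TEXT)"
)
conn.executemany("INSERT INTO iris VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Filtering and aggregation happen in the datastore, not in ad-hoc scripts.
query = (
    "SELECT species, COUNT(*), AVG(petal_length) "
    "FROM iris GROUP BY species ORDER BY species"
)
summary = conn.execute(query).fetchall()
for species, count, mean_petal_length in summary:
    print(species, count, round(mean_petal_length, 2))
conn.close()
```

The same query string can be handed to `pandas.read_sql_query` along with a connection, so moving from a local sqlite file to a remote database later mostly means swapping the connection.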

# Using Pandas for Data Ingest

<ul>
<li><h3>A few million PSQL rows randomly sampled from over 15M rows</h3></li>
<li><h3>Preprocessing with:</h3></li>
<ul>
<li><h3>Removing unicode and emoji</h3></li>
<li><h3>Converting to lowercase</h3></li>
<li><h3>Dropping stopwords</h3></li>
<li><h3>Cleaning any other malformed input</h3></li>
</ul>
<li><h3>Using a few hundred regular expressions to produce a labeled dataset</h3></li>
</ul>


So coming back to the example, this is what ETL translates into for the SMS classification application. We needed to randomly sample a few million records from over 15 million rows. These records contained the SMS text field with some other metadata, and those text fields needed to be preprocessed. Particularly if you're dealing with SMS on Android, things are not standardized. There's a message database on Android devices to which your SMS app needs full access. And different SMS apps will write to the database in different ways, so the literal text of the same message can look very different on different devices. It also varies between brands. Most brands will make their own SMS apps, to give the phone a different look and feel, instead of using the stock Android messenger, which means they'll add their own bells and whistles to how they handle the message database. Anyway, what this means is that you can't really fully automate the preprocessing. You have to keep coming back to your preprocessing rules once in a while. So, it makes sense to have the preprocessing as a separate self-contained service. Now, remember that the goal of the ETL stage is to make your data ready for machine learning - so if your ETL is taken care of, you can just plug and play sklearn estimators on the data - that way you spend the bulk of your time doing model selection and evaluation, instead of feature engineering or data cleaning.
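Here is a minimal sketch of what keeping the preprocessing rules in one self-contained place can look like. The function names and the tiny stopword list are stand-ins I made up for this example, they are not our production code.

```python
import re

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "for"}


def remove_non_ascii(text):
    """Drop emoji and any other non-ASCII characters."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def to_lowercase(text):
    return text.lower()


def remove_digits(text):
    return re.sub(r"\d+", " ", text)


def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)


# The rules live in one ordered list, so revisiting them means editing
# this list rather than hunting through a monolithic script.
PIPELINE = [remove_non_ascii, to_lowercase, remove_digits, remove_stopwords]


def preprocess(text, steps=PIPELINE):
    for step in steps:
        text = step(text)
    return " ".join(text.split())  # collapse leftover whitespace


print(preprocess("Your OTP is 482913 \U0001F600 for the ORDER"))
```

When a new brand's SMS app turns up with its own quirks, the fix is one more small function appended to the list, and every consumer of `preprocess` picks it up.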

# Using PySemantic to Wrap Pandas
```yaml
smsdata:
  source: psql
  table_name: messages_db
  config:
    hostname: 127.0.0.1
    db_name: foo
    username: bar
  chunksize: 100000
  sampling:
    factor: 0.1
    kind: random
  dtypes:
    Message: &string !!python/name:__builtin__.str
    Number: *string
    person: *string
  postprocessors:
    - !!python/name:jeeves.preprocessing.text.remove_unicode
    - !!python/name:jeeves.preprocessing.text.remove_tabs
    - !!python/name:jeeves.preprocessing.text.remove_digits
    - !!python/name:jeeves.preprocessing.text.remove_stopwords
    - !!python/name:jeeves.preprocessing.text.to_lowercase
    - !!python/name:jeeves.feature_extraction.text.get_regex_features
    - !!python/name:jeeves.feature_extraction.text.get_tfidf_features
```
```python
>>> from pysemantic import Project
>>> smartsms = Project("smartsms")
>>> X = smartsms.load_dataset("smsdata")
```

As a solution to that, we use this library called pysemantic. Practically it is just a wrapper around pandas. The motivation is that you want to be able to move from interactive data analysis to production as seamlessly as possible. Even if you're restricted only to the prototyping stage, your colleagues may not always be able to reproduce your results, since they may not have copied your ETL script word for word. The obvious solution is just to share your notebook or your script, but then that inevitably leads to a massive clutter of arbitrary scripts and datasets in your repository. And then when you want to move these prototypes to production, refactoring them to get a modular Python package out of them becomes very inconvenient. So pysemantic has this concept of a data dictionary. It is just a yaml file with the description of how you want to load and preprocess your dataset. It is based on traits. How many of you were at the ETS tutorial yesterday? So all these keys you see here are traits of the data dictionary. This allows pysemantic to be somewhat smart, in that it recursively eliminates ETL errors in pretty much the same way a human would. Ultimately it guarantees that whoever has a copy of a dictionary like this will end up with the exact same dataset. So you have team wide reproducibility of the features of your dataset. And it's very easy to use. Just three lines of Python will load your dataset. The details go into the dictionary.

# A Note about the AutoML project
![](images/automl.png)
<h4><i>"If machine learning is so ubiquitous, one should just be able to use it without understanding libraries."</i></h4>
<h4>- Andreas Mueller @ SciPy US 2016</h4>


So that was all about data ingest. Now let's talk about automating machine learning. This is a very widely studied problem, and one of its most popular forms is the AutoML problem - also known as the CASH problem. CASH stands for Combined Algorithm Selection and Hyperparameter Optimization. The motivation for formally studying automation of machine learning is simple - no algorithm works best on all datasets, not even neural networks - and most algorithms are highly sensitive to hyperparameters. This here is a concise flowchart of the AutoML framework. The _xTrain, xTest, yTrain_ variables are as usual, L is the loss. The interesting thing here is b which denotes your budget - which could be time or processing power. The ML framework box here is as usual. There's a data preprocessor, which feeds into a feature extractor and then finally a classifier. The _Bayesian optimizer_ block is a process that fits a model between hyperparameters and performance. Actually take a moment to think about that. You want to automate the training of a model, and for that you're fitting a model which, given another model and its hyperparameters, predicts the performance. The meta-learning block here also learns the performance, but the input to that is the dataset. So given a dataset, it predicts performance, you know, like a domain expert. So when we say that decision trees are likely to perform well for categorical variables - that is the sort of thing that the meta-learning block learns. Finally, we build ensembles from the different models that have been performing well. Thankfully, there's a convenient sklearn implementation of this - called auto-sklearn. This is what Andreas Mueller spoke about in his talk at SciPy this year. The idea is that if machine learning is so ubiquitous then one should be able to just use it without understanding the details. Which is another reason why projects like AutoML are useful. They allow you to run _a lot_ of models easily and in a hands-off manner. But unfortunately auto-sklearn isn't a very stable project right now, it has too many low level dependencies, which makes it very painful to install. So again, not a one-size-fits-all solution.
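To make the CASH formulation concrete, here is a toy sketch of searching jointly over algorithms and their hyperparameters under a budget b. To keep it short I have swapped the Bayesian optimizer and the meta-learning for plain random search, so this is not what auto-sklearn does internally. The data, the two toy classifiers and all the names are made up for the example.

```python
import random

# Toy 1-D binary classification data, made up for this sketch.
X_TRAIN = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
Y_TRAIN = [0, 0, 0, 0, 1, 1, 1, 1]
X_VALID = [0.15, 0.35, 0.65, 0.85]
Y_VALID = [0, 0, 1, 1]


def threshold_predict(x, params):
    """Predict class 1 when x exceeds a fixed cut-off."""
    return int(x > params["cutoff"])


def knn_predict(x, params):
    """Majority vote among the k nearest training points."""
    order = sorted(range(len(X_TRAIN)), key=lambda i: abs(X_TRAIN[i] - x))
    votes = [Y_TRAIN[i] for i in order[: params["k"]]]
    return int(2 * sum(votes) > len(votes))


# Joint search space: each algorithm comes with its own hyperparameter sampler.
SPACE = {
    "threshold": (threshold_predict, lambda rng: {"cutoff": rng.uniform(0, 1)}),
    "knn": (knn_predict, lambda rng: {"k": rng.choice([1, 3, 5])}),
}


def validation_loss(predict, params):
    """L: fraction of misclassified validation points."""
    wrong = sum(predict(x, params) != y for x, y in zip(X_VALID, Y_VALID))
    return wrong / len(X_VALID)


def cash_random_search(budget, seed=0):
    """Spend `budget` evaluations sampling (algorithm, hyperparameters) pairs."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        name = rng.choice(sorted(SPACE))
        predict, sample = SPACE[name]
        params = sample(rng)
        history.append((validation_loss(predict, params), name, params))
    best = min(history, key=lambda trial: trial[0])
    return best, history


best, history = cash_random_search(budget=20)
print("best:", best)
```

The point of the sketch is the shape of the problem: the algorithm is just one more dimension of the search space, and the budget caps how many configurations you get to try. A Bayesian optimizer would use the `history` to decide what to sample next instead of sampling blindly.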

# Automating Model Selection

![](https://raw.githubusercontent.com/spotify/luigi/master/doc/luigi.png)

```python
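import joblib
import luigi
from sklearn.model_selection import GridSearchCV, cross_val_score


class CrossValidationTask(luigi.Task):

    estimator = luigi.Parameter()  # or luigi.Target

    def run(self):
        # Run the CV loop. requires() and output() are omitted on this slide;
        # the upstream task is assumed to have dumped an (X, y) pair with joblib.
        X, y = joblib.load(self.input().path)
        scores = cross_val_score(self.estimator, X, y)
        joblib.dump(scores, self.output().path)


class GridSearchTask(luigi.Task):

    grid = luigi.Parameter()  # or Target
    estimator = luigi.Parameter()  # or Target

    def run(self):
        X, y = joblib.load(self.input().path)
        # Any other GridSearchCV options (cv, scoring, n_jobs) would go here.
        clf = GridSearchCV(self.estimator, param_grid=self.grid)
        clf.fit(X, y)
        joblib.dump(clf.best_estimator_, self.output().path)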
```

However, some part of what auto-sklearn does can be easily implemented with something like luigi. Luigi is a library that helps you build complex pipelines of batch processing operations. Each operation is called a task, and each task has an input and an output. Each task can also be parametrized, with arbitrary Python objects. So these parameters can be the parameters of your model, in the simplest case, or in more extreme cases they can be an entire grid of hyperparameters which you want to exhaustively search for the best predictive performance. These tasks themselves can be interdependent, and luigi will solve the dependency graph. You can think of luigi as a build tool like make, only it is Pythonic and it can be distributed. It directly supports a great number of data sources and data sinks, like we saw in the flowchart. It has its limitations: it can only do batch processing, and its design patterns don't stay easy as dependency graphs get more and more complex. So if your pipeline is reasonably straightforward, luigi will do very well. Then there are always more tools like kafka and spark that are outside the Python ecosystem, if you're really dealing with Big Data in the true sense.
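If the make analogy is unfamiliar, the core idea of solving a dependency graph can be sketched with the standard library's `graphlib` (Python 3.9+). This is only an illustration of the idea, not luigi's scheduler, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each key depends on the tasks in its set.
DEPENDENCIES = {
    "load_raw": set(),
    "clean_text": {"load_raw"},
    "extract_features": {"clean_text"},
    "grid_search": {"extract_features"},
    "cross_validate": {"extract_features"},
    "export_model": {"grid_search", "cross_validate"},
}

completed = []


def run_task(name):
    # A real task would read its inputs and write an output here.
    completed.append(name)


sorter = TopologicalSorter(DEPENDENCIES)
sorter.prepare()
while sorter.is_active():
    # Everything returned by get_ready() has all its dependencies met,
    # so these tasks could be handed to parallel workers.
    for name in sorter.get_ready():
        run_task(name)
        sorter.done(name)

print(completed)
```

Notice that `grid_search` and `cross_validate` become ready at the same time, which is exactly the kind of parallelism a scheduler can exploit.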

# Data Processing Pipeline as a Luigi Graph
![](images/reosmessage_flowchart.png)

Oh, and Luigi also provides a visualization of the graphs and the tasks. This is what the flowchart looks like in the luigi dashboard. The screenshot was taken when the ETL had finished and the ML was about to start. Each block is a task. The ones to the left are the ETL tasks, so you can see a lot of things happening in parallel. The tasks to the right are the grid search, cross validation and benchmarking tasks.

# Visualizing (Data & ML performance)

<ul>
        <li><h3>Bokeh server for dashboards</h3></li>
        <li><h3>Chaco / Traits-based visualizations - for interactive exploration</h3></li>
        <li><h3>Use libs like Seaborn for stats - resist the temptation to write them yourself</h3></li>
    
</ul>

That was just the visualization of the pipeline, but ultimately the front-facing side of the pipeline would be a very different dashboard. And these are a few things that could help you with the visualization beyond matplotlib. Bokeh is wonderful for dynamic plotting. You don't have to know HTML or js, and it runs as a server and you can keep sending it data to have it plotted dynamically. Instant dashboards. Chaco is another library that provides a lot more interactivity than matplotlib, but is somewhat more difficult to hack - you need to know the Enthought tool suite - and I'm not sure if you can get it working in the browser. There are many other libraries like seaborn, geopandas, networkx that have domain specific data visualization. Use these domain specific visualization tools, and try not to write your own.

# Exposing Trained Models

<ul>
        <li><h3>Simple serialization methods</h3></li>
        <li><h3>sklearn-compiledtrees</h3></li>
        <li><h3>Don't use nonlinear models where linear models will do</h3></li>
        <li><h3>The serverless paradigm - AWS Lambda / Amazon API Gateway, etc.</h3></li>
    
</ul>

Finally we come to deployment. The easiest way to deploy models is to use simple serialization methods like pickle or json - sklearn's joblib will do great - and then write wrappers around the serialized models on the client side - especially if it's not a Python client. sklearn-compiledtrees is a project that compiles tree based models from sklearn into object code, which can then be read by the client. Now you _could_ port almost any machine learning algorithm to any platform, but if it's a linear model, then the advantage you have is that the client needs to compute only a dot product. So evaluate linear models first, and if you can't work with them, make absolutely sure that you can't.
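Here is a minimal sketch of that linear-model handoff. The coefficients are hypothetical numbers I typed in; in practice they would come from a fitted estimator's `coef_` and `intercept_`. The `predict` function is written in Python only to show how little the client has to do.

```python
import json

# Hypothetical coefficients for a 3-class linear model over 4 features.
# In practice these would come from a fitted estimator's coef_ and intercept_.
model = {
    "classes": ["personal", "notification", "spam"],
    "coef": [
        [0.9, -0.4, -0.6, 0.1],
        [-0.3, 1.2, -0.2, 0.0],
        [-0.5, -0.1, 1.1, 0.3],
    ],
    "intercept": [0.1, -0.2, -0.1],
}

payload = json.dumps(model)  # this string is what gets handed to the client


def predict(payload, features):
    """Client-side scoring: one dot product per class, then an argmax."""
    m = json.loads(payload)
    scores = [
        sum(w * x for w, x in zip(weights, features)) + bias
        for weights, bias in zip(m["coef"], m["intercept"])
    ]
    return m["classes"][scores.index(max(scores))]


print(predict(payload, [0.0, 0.0, 1.0, 1.0]))
```

Porting `predict` to Java on Android is a few lines of loops, which is the whole appeal. The client still has to reproduce the feature extraction exactly, and that is usually the harder part to keep in sync.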

An interesting deployment solution that has emerged recently is the serverless paradigm. Companies like Cloudera and Cloudacademy are pioneering the serverless paradigm, especially in the Python/scikit-learn ecosystem. The motivation here is that the data science community is stuck at the level of designing ad-hoc models - which are mostly just prototypes with very thin layers of customization - even Azure ML, which is one of the biggest farms of plug-and-play machine learning algorithms, suffers from this problem. These tools cannot be embedded or extended by other developers very easily. So the serverless paradigm is one of the ways in which you can make your prototypes production-ready without too much effort.

Now even if we want to keep things simple with simple serialization - it's a lot of effort to scale it, and to make it distributed. And you will also have to put a lot of thought into how you expose the API and how you authenticate it. One of the easiest things you could do is write simple flask or django wrappers around your model and make the estimator methods available through HTTP requests - but even that doesn't provide enough elasticity, and scaling is still nontrivial. AWS Lambda is a service that allows you to just deploy scripts onto AWS and it takes care of the scaling automatically. A combination of this and the Amazon API Gateway is all you need for deploying and scaling your scikit-learn code. But unsurprisingly, that's more on the expensive side.
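For a sense of how small the serverless version can be, here is a sketch of a Lambda-style handler. The `handler(event, context)` shape is what AWS Lambda calls for Python functions, and I am assuming an API Gateway proxy-style event whose `body` is a JSON string. The model parameters and the `features` field name are made up for the example.

```python
import json

# Hypothetical model parameters bundled with the function; in a real
# deployment they would be loaded from the serialized model.
CLASSES = ["personal", "notification", "spam"]
COEF = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
INTERCEPT = [0.0, 0.0, 0.5]


def score(features):
    """Linear scoring: one dot product per class, then an argmax."""
    scores = [
        sum(w * x for w, x in zip(weights, features)) + bias
        for weights, bias in zip(COEF, INTERCEPT)
    ]
    return CLASSES[scores.index(max(scores))]


def handler(event, context):
    """Entry point in the shape AWS Lambda expects: handler(event, context)."""
    try:
        # Assumes an API Gateway proxy-style event with a JSON string body.
        features = json.loads(event["body"])["features"]
        body, status = {"label": score(features)}, 200
    except (KeyError, TypeError, ValueError):
        body, status = {"error": "expected a JSON body with 'features'"}, 400
    return {"statusCode": status, "body": json.dumps(body)}


# Local smoke test; no AWS account is needed to exercise the function.
response = handler({"body": json.dumps({"features": [0.2, 0.9]})}, None)
print(response)
```

Because the handler is a plain function, it can be unit tested locally like any other code before it ever goes near AWS. Authentication and throttling are left to the gateway in this setup.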

# Example: How we built & scaled ReosMessage

<ul>
    <li><h3><strike>Get a dump of a table from postgres</strike></h3></li>
    <li><h3><strike>Use simple Python scripts and some pandas magic to clean the data</strike></h3></li>
    <li><h3>Spark streaming API connected to Kafka consumers</h3></li>
    <li><h3>Use <strike>regular expressions</strike> user feedback to label the data</h3></li>
    <li><h3>Use Luigi to:</h3></li>
        <ul>
        <li><h3>Continuously run grid search and cross validation benchmarks</h3></li>
        <li><h3>Train sklearn estimators on the labeled data</h3></li>
        <li><h3>Dump the model coefficients to JSON</h3></li>
        <li><h3>Hand over the JSON to Android developers</h3></li>
        </ul>
    <li><h3>Use Jenkins to drive everything</h3></li>
</ul>

So coming back to the example from the beginning. Our postgres database was being populated by kafka consumers which in turn were reading data from the devices on which our apps are running. So instead of taking dumps from postgres, as we were earlier doing, we set up a streaming application in spark that served as a data ingest system. There is an experimental branch in pysemantic which supports Spark as a backend instead of the pandas default. Spark itself has great data cleaning functionality - so we could manage to get the streaming application to absorb even the preprocessing layer we earlier had. The data then goes to a set of luigi tasks which run in parallel to produce newer benchmarks on the classification problems and finally export the model configuration, which then gets deployed on the apps. And all of this is triggered by Jenkins - primarily, builds are triggered by cron jobs and whenever commits are made to the staging branches in git.

In summary I'd like to direct your attention towards how Python spans the majority of the pipeline. This shows how it's a wonderful glue language. And what I said about linear models in machine learning is also true for backend engineering. Doing simple things well is going to be significantly better than complicating things. And I think that's one of the things that the next talk is going to be about.

![](images/no_data.jpg)

So thanks for your time. Thank you very much.