<h1><center>Developing Python Backends for Machine Learning Applications</center></h1>
<h2><center>by Jaidev Deshpande</center></h2>
<h3><center>Data Scientist @ Cube26 Software Pvt Ltd</center></h3>

<div style="text-align: center">
<div id="social">
<div id="social_twitter">
    <a href="http://twitter.com/jaidevd"><img src="images/twitter-128.png" width="32" height="32">
    </a>
</div>
<div id="social_medium">
    <a href="http://medium.com/@jaidevd"><img src="images/medium.ico" width="32" height="32"></a>
</div>
<div id="social_github">
    <a href="http://github.com/jaidevd"><img src="images/mark-github-128.png" width="32" height="32"></a>
</div>
</div>
</div>

<div style="text-align: center"><font size="4"><strong>@jaidevd</strong></font></div>

# From Peter Norvig's Q&A session on Quora:
> I think that it will be important for machine learning experts and software engineers to come together to develop best practices for software development of machine learning systems. Currently we have a software testing regime where you define unit tests... We will need new testing processes that involve running experiments, analyzing the results... This is a great area for software engineers and machine learning people to work together to build something new and better.

<div style="text-align: center">
<img src="images/reos.png">
</div>

# Example: How we started building ReosMessage

<ul>
    <li><h3>Classification of SMS into personal, notifications and spam</h3></li>
    <li><h3>Get a dump of a table from postgres</h3></li>
    <li><h3>Use simple Python scripts and some pandas magic to clean the data</h3></li>
    <li><h3>Use regular expressions to label the data</h3></li>
    <li><h3>Train sklearn estimators on the labeled data</h3></li>
    <li><h3>Crowdsource the evaluation of the predictions</h3></li>
    <li><h3>Dump the model coefficients to JSON</h3></li>
    <li><h3>Hand over the JSON to Android developers</h3></li>
</ul>
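The early workflow above can be sketched end to end in a few lines. This is a minimal illustration, not the ReosMessage code: the two regex rules, the sample messages and the JSON layout are all hypothetical stand-ins.

```python
import json
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeling rules (assumption); checked in order, first match wins.
RULES = [
    ("spam", re.compile(r"\b(win|free|offer|prize)\b", re.I)),
    ("notification", re.compile(r"\b(otp|debited|credited|delivered)\b", re.I)),
]


def label_message(text):
    """Return the label of the first matching rule, else 'personal'."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "personal"


# Invented sample data standing in for the cleaned SMS dump.
messages = [
    "Win a free prize today",
    "Limited offer just for you",
    "Your OTP is 482913",
    "Rs 500 debited from your account",
    "Are we still meeting for lunch",
    "Call me when you reach home",
]
labels = [label_message(m) for m in messages]

# Bag-of-words features and a linear classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Everything a client needs to score a message, as plain JSON types.
payload = {
    "classes": clf.classes_.tolist(),
    "vocabulary": {term: int(i) for term, i in vectorizer.vocabulary_.items()},
    "coef": clf.coef_.tolist(),
    "intercept": clf.intercept_.tolist(),
}
model_json = json.dumps(payload)
```

Because the model is linear, the JSON is all the Android side needs: a vocabulary lookup, a dot product and an argmax.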

# Typical Data Processing Pipeline
![](images/flowchart.png)

# Managing Raw Data
## And Data Ingest as a Service

<ul>
    <li><h3>Raw data is an integral part of ETL and therefore of your software</h3></li>
    <li><h3>Working off local flat files is <em>bad</em>!</h3></li>
    <li><h3>Learn to work from remote storage to remote storage. Use the "cloud".</h3></li>
    <li><h3>What about experimental / exploratory work? Use things like sqlite!</h3></li>
    <li><h3>Only use local files when:</h3></li>
    <ul>
        <li><h3>doing EDA or any interactive studies.</h3></li>
        <li><h3>debugging a larger application.</h3></li>
        <li><h3>prototyping (resist the temptation to deploy).</h3></li>
    </ul>
</ul>
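For exploratory work, an in-memory SQLite database keeps the query-based workflow without needing a server. A minimal sketch using only the standard library; the table layout and rows are invented for illustration.

```python
import sqlite3

# An in-memory database: nothing touches the local filesystem.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, sender TEXT, body TEXT)"
)

# Hypothetical sample rows (assumption).
rows = [
    ("BANK", "Rs 500 debited from your account"),
    ("BANK", "Your OTP is 482913"),
    ("+911234567890", "Call me when you reach home"),
]
conn.executemany("INSERT INTO messages (sender, body) VALUES (?, ?)", rows)
conn.commit()


def count_by_sender(connection):
    """Return {sender: number of messages} using a SQL aggregate."""
    cursor = connection.execute(
        "SELECT sender, COUNT(*) FROM messages GROUP BY sender ORDER BY sender"
    )
    return dict(cursor.fetchall())
```

The same queries can later be pointed at a remote database by swapping the connection, which is harder to do with code written against flat files.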

# Using Pandas for Data Ingest

<ul>
<li><h3>A few million PostgreSQL rows, randomly sampled from over 15M rows</h3></li>
<li><h3>Preprocessing with:</h3></li>
<ul>
<li><h3>Removing non-ASCII (unicode) characters and emoji</h3></li>
<li><h3>Converting to lowercase</h3></li>
<li><h3>Dropping stopwords</h3></li>
<li><h3>Cleaning any other malformed input</h3></li>
</ul>
<li><h3>Using a few hundred regular expressions to produce a labeled dataset</h3></li>
</ul>
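The cleaning steps listed above might look like this for a single message. A minimal sketch: the stopword set is a tiny illustrative stand-in and `clean_message` is a hypothetical helper, not a function from the project.

```python
import re

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = frozenset({"a", "an", "the", "is", "to", "of", "and", "your"})


def clean_message(text):
    """Strip non-ASCII characters, lowercase, and drop stopwords."""
    # Encoding with errors="ignore" silently drops emoji and other non-ASCII.
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    # Keep runs of lowercase letters, digits and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", ascii_only.lower())
    return " ".join(token for token in tokens if token not in STOPWORDS)
```

Keeping each step a small pure function makes it easy to apply the same cleaning to a whole pandas column and, later, to live traffic.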


# Using PySemantic to Wrap Pandas
```yaml
smsdata:
  source: psql
  table_name: messages_db
  config:
    hostname: 127.0.0.1
    db_name: foo
    username: bar
  chunksize: 100000
  sampling:
    factor: 0.1
    kind: random
  dtypes:
    Message: &string !!python/name:__builtin__.str
    Number: *string
    person: *string
  postprocessors:
    - !!python/name:jeeves.preprocessing.text.remove_unicode
    - !!python/name:jeeves.preprocessing.text.remove_tabs
    - !!python/name:jeeves.preprocessing.text.remove_digits
    - !!python/name:jeeves.preprocessing.text.remove_stopwords
    - !!python/name:jeeves.preprocessing.text.to_lowercase
    - !!python/name:jeeves.feature_extraction.text.get_regex_features
    - !!python/name:jeeves.feature_extraction.text.get_tfidf_features
```
```python
>>> from pysemantic import Project
>>> smartsms = Project("smartsms")
>>> X = smartsms.load_dataset("smsdata")
```
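In plain pandas, the kind of steps such a config describes (chunked reads, random sampling, a chain of postprocessors) could be sketched as follows. This is not PySemantic's implementation: `load_dataset` and the two postprocessors are hypothetical, and an in-memory SQLite table stands in for PostgreSQL.

```python
import sqlite3

import pandas as pd


def remove_digits(df):
    """Hypothetical postprocessor: strip digits from the Message column."""
    return df.assign(Message=df["Message"].str.replace(r"\d+", "", regex=True))


def to_lowercase(df):
    """Hypothetical postprocessor: lowercase the Message column."""
    return df.assign(Message=df["Message"].str.lower())


def load_dataset(conn, table, chunksize, frac, postprocessors, seed=0):
    """Read a table in chunks, sample each chunk, then apply postprocessors."""
    query = f"SELECT * FROM {table}"  # table name is trusted in this sketch
    pieces = []
    for chunk in pd.read_sql_query(query, conn, chunksize=chunksize):
        sampled = chunk.sample(frac=frac, random_state=seed)
        for func in postprocessors:
            sampled = func(sampled)
        pieces.append(sampled)
    return pd.concat(pieces, ignore_index=True)


# Invented data standing in for the real messages table.
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {
        "Message": [f"Hello {i}" for i in range(100)],
        "Number": [str(i) for i in range(100)],
    }
).to_sql("messages_db", conn, index=False)

X = load_dataset(
    conn, "messages_db", chunksize=20, frac=0.5,
    postprocessors=[remove_digits, to_lowercase],
)
```

Declaring these steps in a config file instead keeps them out of notebooks and scripts, so every consumer of the dataset sees the same cleaning.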

# A Note about the AutoML project
![](images/automl.png)
<h4><i>"If machine learning is so ubiquitous, one should just be able to use it without understanding libraries."</i></h4>
<h4>- Andreas Mueller @ SciPy US 2016</h4>


# Automating Model Selection

![](https://raw.githubusercontent.com/spotify/luigi/master/doc/luigi.png)

```python
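import joblib
import luigi
from sklearn.model_selection import GridSearchCV, cross_val_score


class CrossValidationTask(luigi.Task):

    estimator = luigi.Parameter()  # or luigi.Target

    def run(self):
        # Assumes requires() yields two LocalTargets holding joblib-dumped X and y
        X, y = (joblib.load(target.path) for target in self.input())
        # Run CV loop; persist or report the scores as needed
        scores = cross_val_score(self.estimator, X, y)


class GridSearchTask(luigi.Task):

    grid = luigi.Parameter()  # or luigi.Target
    estimator = luigi.Parameter()  # or luigi.Target

    def run(self):
        # Same assumption as above about the upstream targets
        X, y = (joblib.load(target.path) for target in self.input())
        # Pass any other GridSearchCV options here as well
        clf = GridSearchCV(self.estimator, param_grid=self.grid)
        clf.fit(X, y)
        # Assumes output() returns a luigi.LocalTarget
        joblib.dump(clf.best_estimator_, self.output().path)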
```
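The grid search and cross-validation that such tasks wrap can be tried standalone with scikit-learn. A minimal sketch on synthetic data; the estimator and the parameter grid are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic binary classification data standing in for the SMS features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate regularization strengths (illustrative values).
grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Exhaustive search over the grid with 5-fold cross-validation.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=grid, cv=5)
search.fit(X, y)

# Re-score the winning estimator with a fresh 5-fold CV loop.
scores = cross_val_score(search.best_estimator_, X, y, cv=5)
```

Wrapping each of these steps in a scheduler task means they rerun automatically whenever the upstream data changes.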

# Data Processing Pipeline as a Luigi Graph
![](images/reosmessage_flowchart.png)

# Visualizing (Data & ML performance)

<ul>
        <li><h3>Bokeh server for dashboards</h3></li>
        <li><h3>Chaco / Traits-based visualizations - for interactive exploration</h3></li>
        <li><h3>Use libs like Seaborn for stats - resist the temptation to write them yourself</h3></li>
    
</ul>

# Exposing Trained Models

<ul>
        <li><h3>Simple serialization methods</h3></li>
        <li><h3>sklearn-compiledtrees</h3></li>
        <li><h3>Don't use nonlinear models where linear models will do</h3></li>
        <li><h3>The serverless paradigm - AWS Lambda / Amazon API Gateway, etc.</h3></li>
    
</ul>
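One reason to prefer linear models: a client can score them from a JSON file with no ML library at all. A minimal sketch with the standard library only; the JSON layout and the weights are hypothetical, for a binary spam-vs-not model.

```python
import json
import math

# Hypothetical exported model (assumption): one weight per vocabulary term.
model_json = json.dumps(
    {
        "vocabulary": {"free": 0, "prize": 1, "otp": 2},
        "coef": [1.5, 2.0, -3.0],
        "intercept": -0.5,
    }
)


def predict_spam_probability(model, tokens):
    """Logistic regression score: sigmoid(intercept + sum of token weights)."""
    vocabulary = model["vocabulary"]
    score = model["intercept"]
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:  # tokens outside the vocabulary are ignored
            score += model["coef"][index]
    return 1.0 / (1.0 + math.exp(-score))


model = json.loads(model_json)
```

The same dozen lines port directly to Java or Kotlin, which is what makes the JSON hand-off to mobile developers practical.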

# Example: How we built & scaled ReosMessage

<ul>
    <li><h3><strike>Get a dump of a table from postgres</strike></h3></li>
    <li><h3><strike>Use simple Python scripts and some pandas magic to clean the data</strike></h3></li>
    <li><h3>Spark streaming API connected to Kafka consumers</h3></li>
    <li><h3>Use <strike>regular expressions</strike> user feedback to label the data</h3></li>
    <li><h3>Use Luigi to:</h3></li>
        <ul>
        <li><h3>Continuously run grid search and cross validation benchmarks</h3></li>
        <li><h3>Train sklearn estimators on the labeled data</h3></li>
        <li><h3>Dump the model coefficients to JSON</h3></li>
        <li><h3>Hand over the JSON to Android developers</h3></li>
        </ul>
    <li><h3>Use Jenkins to drive everything</h3></li>
</ul>

![](images/no_data.jpg)