In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
import matplotlib.pyplot as plt
import numpy as np
import os
import seaborn as sns
import sklearn 
import pandas as pd

import shelter
from shelter.config import data_dir

%matplotlib inline

# Machine Learning in Production

## 0. Introduction (10 minutes)

Just being able to create a predictive model is not enough for a Data Scientist. 
You also need to be able to implement it in such a way that other people can understand and use it.
Putting your models into production means knowing and using software best practices.

Read the blog [Software development skills for data scientists](http://treycausey.com/software_dev_skills.html) by Trey Causey.

We'll touch on some of his topics:

* Writing modular, reusable code
* Testing
* Logging
* Version control

## 1. Code quality (20 minutes)

Writing modular, reusable code has to do with code quality.
Code is a means to communicate not only with machines but also with other developers.
High quality code is good communication.

Code of high quality is correct, human readable, consistent, modular and reusable.
On one hand this involves fundamentals like code styling; on the other hand it also concerns naming, code structure and principles like [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) (don't repeat yourself), the [rule of three](https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)) and the [single responsibility principle](https://en.wikipedia.org/wiki/Single_responsibility_principle).

We'll first focus on style.
Style guides dictate how you should write your code so that everyone uses a single, consistent style.
There's [PEP8](https://www.python.org/dev/peps/pep-0008/) for Python; [Google's Style Guide](https://google.github.io/styleguide/Rguide.xml) or [Advanced R](http://adv-r.had.co.nz/Style.html) for R; and the official [Guide](https://docs.scala-lang.org/style/) for Scala.

> #### Exercise 1

> There's a handy function called `add_features()` in `data.py` of our Python package `shelter`.
Unfortunately, it doesn't follow the PEP8 standards.
Most violations are whitespace problems and variable names, so this should be pretty easy to fix.
>
Open the project folder in PyCharm and navigate to the file `shelter/data.py`.
Make all the curly yellow lines go away.
>
If you don't have PyCharm, change the code in your favourite editor until the following command doesn't return errors:
>
```bash
(ml-production) $ flake8 shelter/data.py --show-source
```

The code in `add_features()` now has the right format, but it's not good code yet.
The function is doing multiple things and that is [not OK](https://blog.codinghorror.com/curlys-law-do-one-thing/).
There's more to improve than just code style!

> #### Exercise 2
>
Move the sub-logic from `add_features()` to the appropriate functions:
>
- `check_has_name()`
- `get_sex()`
- `get_neutered()`
- `get_hair_type()`
- `compute_days_upon_outcome()`    
>
The function `check_is_dog()` is already filled in for you.
>
After this exercise `add_features()` should look something like:
>


```python
def add_features(df):
    df['is_dog'] = check_is_dog(df['animal_type'])
    df['has_name'] = check_has_name(df['name'])
    # ...
    return df
```


It already looks better and more structured, but there are still things that should be improved. 
What would you do to improve these functions?

For instance, the function `add_features()` has an unexpected [side effect](https://softwareengineering.stackexchange.com/questions/15269/why-are-side-effects-considered-evil-in-functional-programming): input `df` gets changed when the function is called.
Generally, you want to avoid this kind of unexpected behaviour.
How could you avoid this?
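
One possible answer is to work on a copy of the input. The sketch below assumes simplified, hypothetical versions of `check_is_dog()` and `check_has_name()`; the real functions in `shelter/data.py` may behave differently.

```python
import pandas as pd


# Hypothetical stand-ins for the helpers in shelter/data.py;
# the real implementations may differ.
def check_is_dog(animal_type):
    return animal_type.str.lower() == 'dog'


def check_has_name(name):
    return name.notnull()


def add_features(df):
    # Work on a copy so the caller's DataFrame stays untouched.
    df = df.copy()
    df['is_dog'] = check_is_dog(df['animal_type'])
    df['has_name'] = check_has_name(df['name'])
    return df


raw = pd.DataFrame({'animal_type': ['Dog', 'Cat'], 'name': ['Max', None]})
features = add_features(raw)

print(list(raw.columns))       # ['animal_type', 'name']
print(list(features.columns))  # ['animal_type', 'name', 'is_dog', 'has_name']
```

Copying costs memory for large DataFrames, so whether this is the right trade-off depends on your data size.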

## 2. Testing (10 minutes)

Tests help you determine if your code does what you expected it to do.

There are different types of tests.
The [most important tests](http://slides.com/treycausey/pydata2015#/) for Data Scientists are:
- unit tests that focus on small units of code like functions;
- integration tests for whole systems;
- regression tests to test if software performs the same after changes.

In addition, you probably want to have systems checking data quality and monitoring if your model is still performing as expected.
Those tests won't be discussed here: we'll only show unit tests.

[Unit testing](https://jeffknupp.com/blog/2013/12/09/improve-your-python-understanding-unit-testing/) is as easy as calling your function and `assert`-ing that the function behaves as expected:

In [2]:
from shelter.data import convert_camel_case

result = convert_camel_case('CamelCase')
expected = 'camel_case'  # TODO: Adjust this to see what happens.

assert result == expected

AssertionError: 

Python unit tests generally go in a folder called `tests/` and contain modules starting with `test_`.
These modules in turn contain functions starting with `test_` and classes starting with `Test`.
It's tests all the way down.
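
As a rough illustration of that layout, a test module could look like the sketch below. The file name `tests/test_text.py` and the function `to_snake_case()` are made up for this example and are not part of `shelter`.

```python
# tests/test_text.py (hypothetical file name)
import re


def to_snake_case(text):
    """Hypothetical function under test: 'CamelCase' -> 'camel_case'."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', text).lower()


def test_to_snake_case():
    # pytest collects plain functions whose names start with `test_`.
    assert to_snake_case('CamelCase') == 'camel_case'


class TestToSnakeCase:
    # pytest also collects `test_` methods on classes starting with `Test`.
    def test_single_word(self):
        assert to_snake_case('Camel') == 'camel'

    def test_already_snake_case(self):
        assert to_snake_case('camel_case') == 'camel_case'
```

In a real project the function under test would live in the package and be imported by the test module, rather than being defined next to the tests.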

Our project also has a folder called `tests/`.
`tests/test_data.py` contains unit tests to check the functions that you've made. 
Check them out!

Most functions in `test_data.py` don't use `assert`, but use the `pandas` utility function `assert_series_equal()` to check if `Series` are the same.
Many libraries have utility functions to make writing tests easier.
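
A minimal sketch of such a test, assuming a simplified `check_has_name()` that only checks for missing values (the real function in `shelter/data.py` may differ):

```python
import pandas as pd
from pandas.testing import assert_series_equal


def check_has_name(name):
    """Hypothetical stand-in; the real function in shelter/data.py may differ."""
    return name.notnull()


def test_check_has_name():
    names = pd.Series(['Max', None, 'Bella'])
    expected = pd.Series([True, False, True])
    # Compares values, dtype, index and name, and reports what differs.
    assert_series_equal(check_has_name(names), expected)
```

A plain `assert result == expected` would not work here, because comparing two `Series` yields another `Series` rather than a single boolean.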

Run the unit tests using [`pytest`](https://docs.pytest.org/en/latest/):

```bash
(ml-production) $ python -m pytest tests/
```

You'll get some error messages because `test_is_dog()` has not been implemented yet!

> #### Exercise 3
> 
> Create a test case to check if `check_is_dog()` is implemented correctly. 
Make sure that `pytest` doesn't return any errors.

## 3. Logging (10 minutes)

Logging helps you understand what your code did when it was run.

Many people start with `print()` statements to check what's going on, but it's better to use the official `logging` module.
`logging` is made for logging.

`logging` isn't the most clearly documented feature in Python, but you should be OK if you follow these guidelines.


#### `logging` in modules 

In modules use the `logger` like this:

```python
# data.py
import logging

# This logger variable is used by all functions in the module.
logger = logging.getLogger(__name__)

def load_data(path):
    logger.info('Reading data from %s', path)
    # ...


def check_is_dog(animal_type):
    is_cat_dog = animal_type.str.lower().isin(['dog', 'cat'])
    if not is_cat_dog.all():
        logger.error('Found something else but dogs and cats: %s',
                     animal_type[~is_cat_dog])
    # ...
```

You can create logs with different importance levels with:
- `logger.critical()`: most important
- `logger.error()`
- `logger.warning()`
- `logger.info()`
- `logger.debug()`: least important
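
Which of these calls actually produce output depends on the level the logger is set to. The sketch below uses a made-up logger name (`levels_demo`) and writes to an in-memory buffer so the result is easy to inspect; it is an illustration, not how `shelter` configures its logging.

```python
import io
import logging

# Hypothetical logger with its own handler writing to an in-memory buffer.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s:%(message)s'))

logger = logging.getLogger('levels_demo')
logger.addHandler(handler)
logger.propagate = False  # Keep the demo output away from the root logger.
logger.setLevel(logging.WARNING)  # Show WARNING and anything more important.

logger.debug('not shown')
logger.info('not shown either')
logger.warning('shown')
logger.error('shown')
logger.critical('shown')

print(stream.getvalue())
# WARNING:shown
# ERROR:shown
# CRITICAL:shown
```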


#### `logging` in Notebooks

Your Notebook is not a module, it's your main application.
Because of that `logging` has to be configured:

```python
# Your Notebook
import logging

# Configure logging.
logging.basicConfig(level=logging.INFO)

# Get the logger as you'd do normally.
logger = logging.getLogger(__name__)
```

There are many options you can configure, but the most important setting is the logging level.
The logging level is the minimum importance a message needs in order to be shown.
Options are `logging.CRITICAL`, `logging.ERROR`, `logging.WARNING`, etc.

Note that calling `logging.basicConfig()` again has no effect once logging has been configured, so restart your kernel if you want to change the logging level of your Notebook.
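
Restarting is the simplest route. The underlying reason is that `logging.basicConfig()` does nothing when the root logger already has a handler. The sketch below, which uses only the standard library, shows that behaviour and one way around it by setting the level on the root logger directly:

```python
import logging

logging.basicConfig(level=logging.INFO)
root = logging.getLogger()
level_before = root.level

# Ignored: the root logger already has a handler from the first call.
logging.basicConfig(level=logging.DEBUG)
unchanged = root.level == level_before

# Setting the level on the root logger itself does take effect.
root.setLevel(logging.DEBUG)
```

On Python 3.8 and later, `logging.basicConfig(level=..., force=True)` is another option, since it replaces the existing handlers before reconfiguring.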

In [13]:
# Your Notebook
import logging

# Configure logging.
logging.basicConfig(level=logging.INFO)

# Get the logger as you'd do normally.
logger = logging.getLogger(__name__)

# Import the shelter package.
import shelter

import os
import pandas as pd

> #### Exercise 4
>
Play around with the `level` argument below. 
Which function outputs what kind of logging messages?
>
(Don't forget to restart the kernel and run all cells when changing the log levels.)



In [21]:
logging.basicConfig(level=logging.INFO)

In [22]:
data_dir = "/Users/janellezoutkamp/Documents/practice/accelerator/ml-production/data"
train = shelter.data.load_data(os.path.join(data_dir, 'train.csv'))

INFO:shelter.data:Reading data from /Users/janellezoutkamp/Documents/practice/accelerator/ml-production/data/train.csv
INFO:shelter.data:Read 26729 rows


In [23]:
animal_type = pd.Series(['mouse', 'cat'])
train = shelter.data.check_is_dog(animal_type)

ERROR:root:Found something else but dogs and cats:
0    mouse
dtype: object


## 4. Version control (15 minutes)

Git is one of the most fundamental but also one of the harder tools to learn for Data Scientists.
One aspect of Git is versioning your code by committing, pulling and pushing.
Other important aspects are collaborating, code review, and automated testing and deploying of your code.

All these aspects are part of a mature Data Science workflow.
They're also vital if you'd like to improve open-source tools like [`pandas`](https://github.com/pandas-dev/pandas) or [`sklearn`](https://github.com/scikit-learn/scikit-learn).

Code review is done with Pull Requests (or Merge Requests).
With these requests you ask the owner of a repository to pull (or merge) your changes in their code base.
The owner can then discuss your code and suggest improvements.
Check for instance the [Pull Requests for `pandas`](https://github.com/pandas-dev/pandas/pulls).

Many repositories have systems that test your code for style and correctness.
For instance, new `pandas` code is automatically tested and executed on different systems for various Python versions.

Once your code is tested and approved, automated (CI/CD) pipelines pick up your code and put it into production.

> #### Exercise 5
>
> The [original repository](https://github.com/hgrif/ml-production) (the original is often called _upstream_) doesn't have your improvements yet.
>
Push the changes to your repository and open a Pull Request at the [upstream](https://github.com/hgrif/ml-production).
Once you've opened a Pull Request, [my Travis account](https://travis-ci.org/hgrif/ml-production/pull_requests) will automatically check if you've done exercises 1, 2 and 3 correctly.
Keep correcting, committing and pushing until the build passes and I give you an [LGTM](http://livedoor.blogimg.jp/bluesignal/imgs/9/5/95d3a71e.jpg) on your Pull Request.