<table style="width:100%; border: 0px solid black;">
    <tr style="width: 100%; border: 0px solid black;">
        <td style="width:75%; border: 0px solid black;">
            <a href="http://www.drivendata.org">
                <img src="https://s3.amazonaws.com/drivendata-public-assets/logo-white-blue.png" />
            </a>
        </td>
    </tr>
</table>

# Data Science is Software
## Developer #lifehacks for the Jupyter Data Scientist

### Section 4: Don't let other people break your toys

### Motivation

> "Many machine learning algorithms have a curious property: they are robust against bugs. Since they’re designed to deal with noisy data, they can often deal pretty well with noise caused by math mistakes as well. If you make a math mistake in your implementation, the algorithm might still make sensible-looking predictions. This is bad news, not good news. It means bugs are subtle and hard to detect. Your algorithm might work well in some situations, such as small toy datasets you use for validation, and completely fail in other situations -- high dimensions, large numbers of training examples, noisy observations, etc." -- Roger Grosse, "[Testing MCMC code, part 1: unit tests](https://hips.seas.harvard.edu/blog/2013/05/20/testing-mcmc-code-part-1-unit-tests/)", Harvard Intelligent Probabilistic Systems group

In [6]:
from __future__ import print_function

import os

import numpy as np
import pandas as pd

PROJ_ROOT = os.path.abspath(os.path.join(os.pardir, os.pardir))

In [7]:
print(pd.__version__)

2.2.3


### `numpy.testing`
Provides assertion functions for comparing floating-point values that are close but not exactly equal, and for comparing numpy arrays element by element.

In [2]:
data = np.random.normal(0.0, 1.0, 1000000)
assert np.mean(data) == 0.0

AssertionError: 

In [4]:
np.testing.assert_almost_equal(np.mean(data), 0.0, decimal=2)

In [3]:
a = np.random.normal(0, 0.0001, 10000)
b = np.random.normal(0, 0.0001, 10000)

np.testing.assert_array_equal(a, b)

AssertionError: 
Arrays are not equal

Mismatched elements: 10000 / 10000 (100%)
Max absolute difference among violations: 0.00058303
Max relative difference among violations: 6016.67618149
 ACTUAL: array([-1.832408e-04,  1.096928e-04, -9.149495e-05, ..., -1.943389e-04,
       -8.819834e-05, -1.013134e-04])
 DESIRED: array([ 4.531290e-05,  1.890777e-04, -1.005158e-04, ..., -9.453659e-05,
        2.248399e-04, -1.184803e-04])

In [5]:
np.testing.assert_array_almost_equal(a, b, decimal=3)
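
The `decimal` argument used above is a coarse way to set a tolerance. The NumPy documentation recommends `np.testing.assert_allclose`, which takes explicit absolute (`atol`) and relative (`rtol`) tolerances. The following is a minimal sketch of the same kind of check; the seed and the tolerance values are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np

# Seed chosen arbitrarily so the check is repeatable from run to run.
rng = np.random.default_rng(42)
data = rng.normal(0.0, 1.0, 1_000_000)

# The standard error of the mean is sigma / sqrt(n) = 1 / 1000 = 0.001,
# so an absolute tolerance of 0.01 is ten standard errors wide.
np.testing.assert_allclose(np.mean(data), 0.0, atol=1e-2)

# For a nonzero target, a relative tolerance is usually the better fit.
np.testing.assert_allclose(np.std(data), 1.0, rtol=1e-2)
```

Choosing the tolerance from the expected sampling error, rather than by trial and error, keeps the test from failing at random while still catching real mistakes.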

### `engarde` decorators

A library that lets you practice defensive programming, specifically with pandas `DataFrame` objects. It provides a set of decorators that check the return value of any function that returns a `DataFrame` and confirm that it conforms to the rule each decorator declares.

In [4]:
import engarde.decorators as ed

ModuleNotFoundError: No module named 'pandas.util.testing'

In [None]:
test_data = pd.DataFrame({'a': np.random.normal(0, 1, 100),
                          'b': np.random.normal(0, 1, 100)})

@ed.none_missing()
def process(dataframe):
    dataframe.loc[10, 'a'] = 1.0
    return dataframe

process(test_data).head()

`engarde` has an awesome set of decorators:

 - `none_missing` - no NaNs (great for machine learning--sklearn does not care for NaNs)
 - `has_dtypes` - make sure the dtypes are what you expect
 - `verify` - runs an arbitrary function on the dataframe
 - `verify_all` - makes sure every element returns true for a given function

More can be found [in the docs](http://engarde.readthedocs.org/en/latest/api.html).
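
The `engarde` import shown earlier fails under pandas 2.2.3 because it depends on `pandas.util.testing`, which no longer exists in pandas 2.x. The idea behind the decorators is easy to reproduce in plain pandas. The following is a rough sketch, not `engarde`'s implementation; `require_none_missing` and `break_it` are hypothetical names introduced here for illustration.

```python
import functools

import numpy as np
import pandas as pd


def require_none_missing(func):
    """Hypothetical stand-in for a none_missing check: fail if the result has NaNs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        missing = result.isna().sum()  # count of NaNs per column
        assert not missing.any(), f"missing values per column:\n{missing[missing > 0]}"
        return result
    return wrapper


@require_none_missing
def process(dataframe):
    dataframe = dataframe.copy()  # avoid mutating the caller's frame
    dataframe.loc[10, 'a'] = 1.0
    return dataframe


@require_none_missing
def break_it(dataframe):
    dataframe = dataframe.copy()
    dataframe.loc[10, 'a'] = np.nan  # introduce a missing value on purpose
    return dataframe


test_data = pd.DataFrame({'a': np.random.normal(0, 1, 100),
                          'b': np.random.normal(0, 1, 100)})

clean = process(test_data)  # passes the check and returns the frame

try:
    break_it(test_data)
except AssertionError as err:
    print("check failed:", err)
```

Because the check wraps the function, every caller gets the guarantee without having to remember to validate the result.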

**`#lifehack`: test your _data science_ code.**

### Code coverage

What are those tests getting up to? Sometimes you think you wrote test cases that cover everything that might be interesting. But sometimes you're wrong.

`coverage.py` is an _amazing_ tool for seeing what code gets executed when you run your test suite. You can run these commands to generate a code coverage report:

    coverage run --source src -m pytest
    coverage html
    coverage report
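
To illustrate the kind of gap a coverage report exposes, here is a small sketch. `clip_outliers` and both tests are hypothetical and not part of this project's `src` directory. If only the first test existed, the report would flag the `return values` line in the constant-input branch as never executed.

```python
import numpy as np


def clip_outliers(values, limit=3.0):
    """Hypothetical function under test: clip values to mean +/- limit * std."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # Constant input: nothing to clip. A test suite that only uses
        # random data never executes this branch.
        return values
    mean = values.mean()
    return np.clip(values, mean - limit * std, mean + limit * std)


def test_clip_outliers_random():
    rng = np.random.default_rng(0)  # arbitrary seed for repeatability
    values = rng.normal(0, 1, 1000)
    clipped = clip_outliers(values, limit=2.0)
    upper = values.mean() + 2.0 * values.std()
    lower = values.mean() - 2.0 * values.std()
    # Small slack guards against floating-point rounding in the bounds.
    assert clipped.max() <= upper + 1e-12
    assert clipped.min() >= lower - 1e-12


def test_clip_outliers_constant():
    # Added after reading the coverage report: exercises the std == 0 branch.
    np.testing.assert_array_equal(clip_outliers([5, 5, 5]), [5.0, 5.0, 5.0])
```

Line coverage only shows that code ran, not that its result was checked, so a fully covered file can still hide bugs.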
