To play the slideshow, run this command in the terminal:
```
$ jupyter nbconvert intro-to-testing.ipynb --to slides --post serve --SlidesExporter.reveal_theme=night --SlidesExporter.reveal_transition=none
```

# The Role of Testing in Data Science

Jes Ford, PhD

Data Scientist

<img src="img/pytest.png">

# About Me

- Originally from Alaska, have followed the snow all around the western US/Canada
- PhD in Astrophysics from UBC, Vancouver
- Postdoc in Data Science at UW, Seattle
- Moved to UT to snowboard and be a Data Scientist at Backcountry.com $\rightarrow$ now at Recursion Pharma
- I love teaching and learning about things including Python
- I organize this PyLadies chapter

<table><tr>
<td> <img src="https://jesford.github.io/photos/cfht_and_me.jpg" alt="Drawing" style="width: 200px;"/> </td>
<td> <img src="https://jesford.github.io/photos/Jess_Ford_SSS_cbox_BSlide.jpg" alt="Drawing" style="width: 142px;"/> </td>
<td> <img src="https://jesford.github.io/photos/toki_lions.jpg" alt="Drawing" style="width: 315px;"/> </td>
</tr></table>

## Why talk about testing?

I care about code quality...

meaning clean code\* that gives *correct results*

\* topic for another day: PEP8, code is read much more often than it is written!

## Why Test?

- Tests can give you evidence that your code is working as expected
- Tests give you confidence to make changes without fear of breaking something
- Tests make other people trust your code more
- Bad tests can give you false confidence

## In this talk...


- I will *not* insist that you always write tests

- I will describe different scenarios I find myself in as a data scientist and how I try to be confident that my results are correct

- I will show you how to get started testing 

## Disclaimer

- I am not a testing expert or a software engineer

- These *opinions* are based on my own experience as a data scientist

- "data science" covers a huge range of job duties, and formal testing is less important in some of them (one-off analyses vs. committing to a production code base)

## How do you know if your code is correct??
- manual sanity checks
- assertions
- formal tests

In [30]:
# assertion example
def hello_to_all(list_of_names):
    assert len(list_of_names) > 0, 'There is no one here'
    print('Hello {}!'.format(', '.join(list_of_names)))

In [25]:
hello_to_all(['Ayla', 'Missy', 'Evan'])

Hello Ayla, Missy, Evan!


In [28]:
hello_to_all([])

AssertionError: There is no one here
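
For comparison, here is a rough sketch of what a formal test of the same function could look like with pytest. The test names are made up for illustration; `pytest.raises` and the built-in `capsys` fixture are part of pytest.

```python
import pytest


def hello_to_all(list_of_names):
    assert len(list_of_names) > 0, 'There is no one here'
    print('Hello {}!'.format(', '.join(list_of_names)))


def test_hello_to_all_rejects_empty_list():
    # The test passes only if the assertion inside the function fires.
    with pytest.raises(AssertionError):
        hello_to_all([])


def test_hello_to_all_prints_greeting(capsys):
    # capsys captures anything printed while the test runs.
    hello_to_all(['Ayla', 'Missy', 'Evan'])
    assert capsys.readouterr().out == 'Hello Ayla, Missy, Evan!\n'
```

The assertion lives inside the code and runs every time the function is called; the tests live outside the code and run only when you ask pytest to run them.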

## Types of tests:
- unit tests
- integration tests
- system tests
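
A small sketch of the difference between the first two types, using two hypothetical helper functions (`parse_price` and `total_revenue` are invented for this example). A unit test checks one function in isolation, while an integration test checks that the pieces work together. A system test would exercise the whole pipeline end to end and is not shown here.

```python
def parse_price(text):
    """Convert a price string like '$1,250.50' to a float."""
    return float(text.replace('$', '').replace(',', ''))


def total_revenue(prices):
    """Add up an iterable of numeric prices."""
    return sum(prices)


# Unit tests: each one checks a single function on its own.
def test_parse_price():
    assert parse_price('$1,250.50') == 1250.5


def test_total_revenue():
    assert total_revenue([1.5, 2.5]) == 4.0


# Integration test: checks that the two functions work together.
def test_parse_then_total():
    raw = ['$10.00', '$2.50']
    assert total_revenue(parse_price(r) for r in raw) == 12.5
```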

## When to write tests

- when you write new code
- when you make a change to your code
- when you find a bug
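
As an illustration of the last point, suppose a hypothetical `safe_mean` helper used to crash with a `ZeroDivisionError` on an empty list. After fixing it, a regression test can pin the fix in place so the bug cannot quietly come back. The function and its "return `None` for empty input" behavior are assumptions made up for this sketch.

```python
def safe_mean(values):
    """Return the mean of values, or None if there are no values."""
    if not values:
        # This guard is the bug fix: dividing by len([]) used to crash.
        return None
    return sum(values) / len(values)


def test_safe_mean_empty_input_regression():
    # Written when the bug was found; fails if the guard is ever removed.
    assert safe_mean([]) is None


def test_safe_mean_typical_input():
    assert safe_mean([1, 2, 3]) == 2
```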

## Test Driven Development

TDD: write the test *before* you write the code it is testing.

- reasons to use TDD:
  - makes you define your code requirements up front
  - helps frame how you'll write the code
  - is more fun
- reasons to not use TDD:
  - not always possible to define requirements up front (data exploration)
  - time constraints (need quick bugfix)
  - legacy code
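
A minimal sketch of the TDD rhythm, using a made-up `normalize_column_names` function. The test is written first and states the requirement; the implementation comes second and is only as complicated as the test demands.

```python
# Step 1: write the test first. Running it at this point would fail,
# because normalize_column_names does not exist yet.
def test_normalize_column_names():
    messy = ['Order ID', ' Price ']
    assert normalize_column_names(messy) == ['order_id', 'price']


# Step 2: write just enough code to make the test pass.
def normalize_column_names(names):
    """Strip whitespace, lowercase, and replace spaces with underscores."""
    return [name.strip().lower().replace(' ', '_') for name in names]
```

Once the test passes, you can refactor the function freely and rerun the test to check that the behavior has not changed.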

## Data Science Scenarios
1. boss asks for a quick analysis of X
2. you start an exploratory investigation of an idea 
3. a new data source is available that needs to be analyzed and tracked in the database
4. you inherit a large amount of legacy code written by a predecessor that will need to be maintained and potentially updated over time.

## pytest is great

- less boilerplate, easier/faster test writing
- supports unit tests, integration tests, and system tests
- automatically handles finding\* & collecting your tests, running, and reporting on evaluation status
- when tests fail you can get a lot of useful info about what happened and why
- fixtures give you a lot of power and flexibility (more on this later)
- check out the xdist plugin for running your tests in parallel

vs unittest: unittest requires tests to be wrapped inside classes that subclass `unittest.TestCase`; with pytest you just write functions with plain `assert` statements, which are easier to read and write
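
To make the comparison concrete, here is the same check written both ways. The `add` function is a stand-in invented for this example.

```python
import unittest


def add(a, b):
    return a + b


# unittest style: a class that subclasses unittest.TestCase,
# using the assert* helper methods.
class TestAdd(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add(2, 3), 5)


# pytest style: a plain function with a plain assert statement.
def test_add():
    assert add(2, 3) == 5
```

pytest can also collect and run the `unittest.TestCase` class, so existing unittest suites do not need to be rewritten to switch runners.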


## Where do tests go?

By default, pytest searches all directories below the current directory for files whose names start or end with "test" (`test_*.py`, `*_test.py`). In those files it collects functions whose names start with `test` and classes whose names start with `Test`, like `def test_the_things()` and `class TestStuff()`.

```
myproject/
    myproject/
        myproject.py
        utils.py
        __init__.py
    tests/
        test_myproject.py
    setup.py
    README.md
    LICENSE.txt
```

Above is a typical directory layout, but it's not required. You can tell pytest where to look for tests, so really you *can* put them pretty much anywhere.

## Fixtures

Examples here...

- can have scope (e.g. session), so you can create a database with schemas once per run of the test suite, and then tear it down afterward
- keeps execution of a test separate from setup/teardown of anything that's needed for the test, so the test function itself is as simple and clear as possible.
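
A sketch of both points, assuming an in-memory SQLite database stands in for a real one. The `orders` table, the `db_connection` fixture name, and the `count_orders` function are all invented for this example; `scope="session"` and the `yield`-based teardown are standard pytest fixture features.

```python
import sqlite3

import pytest


@pytest.fixture(scope="session")
def db_connection():
    # Setup: runs once for the whole test session.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (item TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("board", 400.0), ("boots", 250.0)],
    )
    conn.commit()
    yield conn  # tests that request this fixture receive the connection
    # Teardown: runs after the last test that uses the fixture.
    conn.close()


def count_orders(conn):
    """The code under test: count rows in the orders table."""
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


def test_count_orders(db_connection):
    # No setup or teardown here; the test only states what should be true.
    assert count_orders(db_connection) == 2
```

pytest matches the test's argument name to the fixture name, so any test that lists `db_connection` as a parameter gets the same connection.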

### Cool related projects to be aware of
- [Hypothesis](https://hypothesis.readthedocs.io/en/latest/) for property-based testing and testing your code with many inputs and edge cases
- [engarde](https://engarde.readthedocs.io/en/latest/index.html) for defensive data analysis with pandas
- [pytest-xdist](https://docs.pytest.org/en/3.0.1/xdist.html) plugin for pytest so you can run tests faster in parallel

### Resources for learning
- Ned Batchelder's [Getting Started Testing](https://www.youtube.com/watch?v=FxSsnHeWQBY) from PyCon 2014
- Eric Ma's [Best Testing Practices for Data Science](https://www.youtube.com/watch?v=yACtdj1_IxE) from PyCon 2017, with GitHub tutorial notebooks [here](https://github.com/ericmjl/data-testing-tutorial)

In [11]:
import pytest

# sample code
@pytest.fixture
def make_data():
    pass

print('Hello World')

Hello World
