# Breaking Stuff

In [None]:
import pandas as pd
import numpy as np

In [None]:
url = 'http://koaning.io/old/theme/data/chickweight.csv'
chickweight_df = (pd.read_csv(url).rename(str.lower, axis='columns'))

Let's assume our chick weight dataset is a huge production database that changes continuously. We're writing some analysis pipelines based on a small extract from this database:

In [None]:
chickweight_test_df = chickweight_df.loc[1:10]
chickweight_test_df

## Weak Links in a Pipeline

The functions below are used to select "overweight" chickens, given the time:

In [None]:
def weight_time_z_value(chick_weight_df):
    return (
        chick_weight_df
        .assign(weight_time=lambda x: x['weight'] / x['time'])
        .assign(weight_time_z=lambda x: (x['weight_time'] - x['weight_time'].mean()) / x['weight_time'].std())
        .drop('weight_time', axis=1)
    )

In [None]:
def select_overweight(chick_weight_df, z_threshold=0):
    return (
        chick_weight_df
        .pipe(weight_time_z_value)
        .loc[lambda x: x['weight_time_z'] > z_threshold]
    )

In [None]:
select_overweight(chickweight_test_df)

**Q:** Besides the meaninglessness of this analysis, what can go wrong when performing this selection? I.e. how can we (obviously) manipulate the input dataframe to make it crash?

In [None]:
%load answers/breaking_with_exception_missing_column.py

**Q:** There's a more subtle bug in this code. While it will not lead to exceptions, it will surely lead to unexpected results. What input data can be added to our test dataframe to make this analysis even more useless than it already is?

In [None]:
%load answers/chickweight_test_unexpected_results_call.py

These are examples of ad-hoc, informal testing of our functions. We intuitively feel that the last result is unexpected. Regardless of how we intend to fix our problem, we encountered an *undocumented assumption* about our code. Instead of documenting the functionality in prose, let's do it in code!

In [None]:
import unittest

In [None]:
class TestSelectOverweight(unittest.TestCase):
    def test_select_overweight_default(self):
        test_df = pd.DataFrame({
            'chick': [1, 2],
            'weight': [100, 200],
            'time': [1, 1]
        })
        result_df = select_overweight(test_df)
        self.assertEqual(1, len(result_df))
        self.assertEqual(2, result_df.chick.values[0])
    
    def test_select_overweight_zero_time(self):
        test_df = pd.DataFrame({
            'chick': [1, 2, 3],
            'weight': [100, 200, 50],
            'time': [1, 1, 0]
        })
        result_df = select_overweight(test_df)
        self.assertEqual(1, len(result_df))
        self.assertEqual(2, result_df.chick.values[0])

unittest.main(argv=['ignored', '-v', 'TestSelectOverweight'], exit=False)
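The zero-time test above is not expected to pass with the current implementation. A minimal sketch of why, repeating the z-value computation by hand on the same toy data (the variable names are only for illustration):

```python
import pandas as pd

# Same toy data as in test_select_overweight_zero_time above.
zero_time_df = pd.DataFrame({
    'chick': [1, 2, 3],
    'weight': [100, 200, 50],
    'time': [1, 1, 0],
})

# Dividing by a time of 0 yields inf instead of raising an exception.
weight_time = zero_time_df['weight'] / zero_time_df['time']
print(weight_time.tolist())

# The mean of a series containing inf is inf as well, so the
# z-values can no longer be computed and all turn into NaN.
weight_time_z = (weight_time - weight_time.mean()) / weight_time.std()
print(weight_time_z.isna().all())

# Comparisons with NaN are always False, so no chick is selected at all.
print(int((weight_time_z > 0).sum()))
```

A single row with `time == 0` therefore silently empties the whole selection, rather than only affecting that one chick.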

The documentation for Python's standard unit testing framework can be found [here](https://docs.python.org/3/library/unittest.html). It offers many more assertions and checks than the `assertEqual()` used in the above example. In addition, pandas and numpy have their own specialized assertion functions in the [`pandas.testing`](https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#testing-functions) and [`numpy.testing`](https://docs.scipy.org/doc/numpy/reference/routines.testing.html) modules.
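As a rough illustration of those specialized assertions, the sketch below compares a whole dataframe and a floating point value. The dataframes are made up for this example and are not part of the chick weight data:

```python
import numpy as np
import pandas as pd

expected_df = pd.DataFrame({'chick': [2], 'weight': [200]})

# Hypothetical result of a pipeline step, built by hand for this sketch.
result_df = (
    pd.DataFrame({'chick': [1, 2], 'weight': [100, 200]})
    .loc[lambda x: x['weight'] > 150]
)

# assert_frame_equal compares values, dtypes, column names and the index.
# Resetting the index makes the comparison ignore the original row labels.
pd.testing.assert_frame_equal(expected_df, result_df.reset_index(drop=True))

# assert_allclose compares floating point values up to a tolerance.
np.testing.assert_allclose([0.1 + 0.2], [0.3], rtol=1e-9)
```

Both functions raise an `AssertionError` on a mismatch, so they can be called directly inside a `unittest.TestCase` method.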

**Q:** We discover a failure in our (production) code, and find its root cause. What's (often) the best first action to take?

## What to Test?

Let's assume that the chicks' health was monitored during the experiment, and recorded in a new `sick` feature. According to the experiment manual, a value of `0` means that the chick was healthy. However, values for a non-healthy state were not clearly defined and may range from the number of treatments given to a textual description of any such treatment.

In [None]:
chickweight_test_df = chickweight_test_df.assign(sick=[0, 0, 0, 0, 0, 0, 0, 1, 'antibiotics', 0])

In [None]:
chickweight_test_df

If we want to know the average weight gain for a chick only when it is healthy, the following might be a solution:

In [None]:
def weight_gain_when_healthy(chickweight_df):
    return (
        chickweight_df
        .assign(weight_diff=lambda x: x['weight'].diff())
        .loc[lambda x: ~(x['sick'].astype(bool)), 'weight_diff']
        .mean()
    )

In [None]:
weight_gain_when_healthy(chickweight_test_df)

In [None]:
class TestWeightGainWhenHealthy(unittest.TestCase):
    def test_weight_gain_when_healthy_default(self):
        test_df = pd.DataFrame({
            'weight': [100, 105, 120, 120, 120],
            'sick': [0, 0, 0, 1, 1],
            'time': [0, 1, 2, 3, 4]
        })
        result = weight_gain_when_healthy(test_df)
        self.assertEqual(10, result)

unittest.main(argv=['ignored', '-v', 'TestWeightGainWhenHealthy'], exit=False)

So far, so good. However, it turns out that sometimes the experimenters didn't bother to fill out the health checks in the case that chicks were healthy. Will this impact our existing function (and tests)?

**Assignment:** Extend the above test case by adding tests that break our `weight_gain_when_healthy()` function (i.e. make it return unexpected results), and improve that function such that it can handle these edge cases.

In [None]:
%load answers/weight_gain.py

Most often, failures (ranging from obvious exceptions to very subtle, but equally damaging, unexpected outcomes) are caused by not properly dealing with:

- Missing values
- Numerical values of 0
- Empty strings
- Complex branching
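To make the first three items concrete, here is a small sketch of how `astype(bool)`, as used in `weight_gain_when_healthy()`, treats such values. The series below is a hypothetical `sick` column, not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical values for a 'sick' column: healthy, one treatment,
# not filled out, empty text, and a textual description.
sick = pd.Series([0, 1, np.nan, '', 'antibiotics'], dtype=object)

# astype(bool) follows Python truthiness: NaN counts as True ("sick"),
# while 0 and the empty string count as False ("healthy").
as_bool = sick.astype(bool)
print(as_bool.tolist())

# Checking for missing values explicitly avoids relying on that behavior.
print(sick.isna().tolist())
```

Whether a missing or empty value should count as healthy is a decision about the data, so it is worth writing it down as a test case.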

## Checking Common Assumptions

The [first problem we ran into](#Weak-Links-in-a-Pipeline) was caused by a crucial column unexpectedly missing from a dataframe.

**Q:** How can this issue be handled? Is the function `weight_time_z_value()` responsible for gracefully returning anything (e.g. an empty dataframe or a copy of the input)? What makes sense?

Similarly, the function `weight_gain_when_healthy()` assumes that the input data is sorted on the `time` column. Such assumptions occur often.
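A minimal sketch of why the row order matters for `diff()`, using a made-up dataframe with measurements of a single chick recorded out of order:

```python
import pandas as pd

# Hypothetical measurements of one chick, recorded out of order.
unsorted_df = pd.DataFrame({
    'time': [0, 2, 1],
    'weight': [100, 120, 105],
})

# diff() subtracts the previous row, whatever its time stamp is.
print(unsorted_df['weight'].diff().tolist())

# is_monotonic_increasing is a cheap way to verify the assumption.
print(unsorted_df['time'].is_monotonic_increasing)

# After sorting on time, the differences are the actual weight gains.
sorted_df = unsorted_df.sort_values('time')
print(sorted_df['weight'].diff().tolist())
```

On the unsorted data the chick appears to lose weight between two rows, although its weight never decreases over time.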

**Q:** Should we clutter our pipeline functions (+ their corresponding unit tests!) with code for checking existence of certain columns, or sorted-ness of other columns? What are the alternatives?

**Assignment:** Create a function that checks if a dataframe passed to any of our pipeline functions contains a given column and raises an exception if it doesn't. Hints:

1. How did we previously create a single function that could log the size of dataframes passed to pipeline functions?
2. Similar to unit testing assertions, Python has the [`assert`](https://docs.python.org/3/reference/simple_stmts.html#the-assert-statement) statement.
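For the second hint, this is roughly how the `assert` statement behaves. The helper `first_weight()` is hypothetical and only serves as an illustration:

```python
def first_weight(weights):
    # Hypothetical helper, only meant to illustrate the assert statement.
    # If the condition is False, an AssertionError with the message is raised.
    assert len(weights) > 0, 'expected at least one weight'
    return weights[0]


print(first_weight([42, 51]))

try:
    first_weight([])
except AssertionError as error:
    print(f'AssertionError: {error}')
```

Keep in mind that `assert` statements are skipped when Python runs with the `-O` option, so explicitly raising an exception can be the safer choice for checks that must always run.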

**Assignment:** Extend the above function with an argument that indicates whether a more graceful way should be used to deal with missing columns, e.g. returning the input dataframe as-is.

**Assignment:** Extend the above function with an argument that indicates whether a given column is expected to be sorted.

In [None]:
%load answers/common_assumptions.py