# Integrating Hypothesis into a Data Analysis Script

In this notebook, the Hypothesis library is used to test a data analysis script: an astronaut analysis script provided by the German Aerospace Center (DLR). More specific information is available in the reference section.

## Prerequisites

This notebook builds on `tutorial.ipynb`, so basic knowledge of property-based testing and the Hypothesis Python library is assumed. It is also assumed that the latest version of Hypothesis (at the time of writing) is correctly installed, along with the packages required by the data analysis script. In brief, assuming the latest versions of Python and pip are installed, Hypothesis and the required packages can be installed using:

```
pip install hypothesis pandas
```

## Code Examples of Using Hypothesis to Test the Data Analysis Script

### Example 1 - Prepare Data Set

This example tests `prepare_data_set()`, arguably one of the most important functions in the script.

The code below is taken from the script, with one change: the line `df["time_in_space_D"] = df["time_in_space"].astype("timedelta64[D]")` has been replaced by `df["time_in_space_D"] = df["time_in_space"] / pd.Timedelta(days=1)`.
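
As a side note on that change, dividing a duration by a one-day duration yields the length in days as a float. The sketch below shows the same arithmetic with the standard library only. `minutes_to_days` is a hypothetical helper for illustration and is not part of the script; the script applies the equivalent division to a whole pandas column.

```python
from datetime import timedelta

ONE_DAY = timedelta(days=1)


def minutes_to_days(minutes):
    """Hypothetical helper: convert a duration in minutes to fractional days."""
    # timedelta / timedelta returns a plain float
    return timedelta(minutes=minutes) / ONE_DAY


print(minutes_to_days(1440))  # 1.0
print(minutes_to_days(720))   # 0.5
print(minutes_to_days(900))   # 0.625
```

Because the result is fractional, a stay of 900 minutes shows up as 0.625 days rather than being rounded to a whole number of days.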

In [60]:
import pandas as pd
from datetime import date, timedelta
from hypothesis import given, strategies as st
from pandas import DataFrame

def prepare_data_set(df):
    df = rename_columns(df)
    df = df.set_index("astronaut_id")

    # Set pandas dtypes for columns with date or time
    df = df.dropna(subset=["time_in_space"])
    df["time_in_space"] = df["time_in_space"].astype(int)
    df["time_in_space"] = pd.to_timedelta(df["time_in_space"], unit="m")
    df["birthdate"] = pd.to_datetime(df["birthdate"])
    df["date_of_death"] = pd.to_datetime(df["date_of_death"])
    df.sort_values("birthdate", inplace=True)

    # Calculate extra columns from the original data
    # Altered from the original script, which used:
    #   df["time_in_space_D"] = df["time_in_space"].astype("timedelta64[D]")
    df["time_in_space_D"] = df["time_in_space"] / pd.Timedelta(days=1)
    df["alive"] = df["date_of_death"].apply(is_alive)
    df["age"] = df["birthdate"].apply(calculate_age)
    df["died_with_age"] = df.apply(died_with_age, axis=1)
    return df

def is_alive(date_of_death):
    # A missing date of death (None/NaT) means the astronaut is alive
    return bool(pd.isnull(date_of_death))

def rename_columns(df):
    """
    The original column naming in the data set is not useful
    for programming with pandas. So we rename it.
    """

    name_mapping = {
        "astronaut": "astronaut_id",
        "astronautLabel": "name",
        "birthplaceLabel": "birthplace",
        "sex_or_genderLabel": "sex_or_gender",
    }
    df = df.rename(index=str, columns=name_mapping)
    return df


def calculate_age(born):
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))


def died_with_age(row):
    if pd.isnull(row["date_of_death"]):
        return None
    born = row["birthdate"]
    died = row["date_of_death"]
    # Subtract one year if the birthday had not yet occurred in the year of death
    return died.year - born.year - ((died.month, died.day) < (born.month, born.day))
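
The age arithmetic in `calculate_age` and `died_with_age` relies on a tuple comparison: `(month, day) < (month, day)` evaluates to `True` (counted as 1) when the birthday has not yet occurred in the reference year. The following sketch uses a fixed reference date so the results are reproducible; `age_on` is a hypothetical helper that mirrors the formula and is not part of the script.

```python
from datetime import date


def age_on(born, today):
    """Hypothetical helper mirroring the age formula used in the script."""
    # The comparison is True (== 1) if the birthday is still ahead this year
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))


reference = date(2024, 6, 15)
print(age_on(date(1990, 6, 15), reference))  # 34, birthday is today
print(age_on(date(1990, 6, 16), reference))  # 33, birthday is tomorrow
print(age_on(date(1990, 1, 1), reference))   # 34, birthday already passed
print(age_on(date(2030, 1, 1), reference))   # -6, birth date in the future
```

The last line is worth keeping in mind when reading test results: a strategy that draws birth dates as late as 2030 can produce negative values in the `age` column.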

In [61]:
@st.composite
def astronaut_data(draw):
    astronaut = draw(st.from_regex(r"http://www\.wikidata\.org/entity/Q\d+", fullmatch=True))
    astronautLabel = draw(st.from_regex(r"[A-Z][a-z]+ [A-Z][a-z]+", fullmatch=True))
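    # The latest birthdate is one day before the latest date of death; otherwise
    # the date_of_death interval below could be empty (min_value > max_value),
    # which Hypothesis rejects with an InvalidArgument error.
    birthdate = draw(st.dates(min_value=date(1920, 1, 1),
                              max_value=date(2030, 12, 30)))
    birthplaceLabel = draw(st.from_regex(r"[A-Z][a-z]+", fullmatch=True))
    sex_or_genderLabel = draw(st.sampled_from(["male", "female"]))
    time_in_space = draw(st.integers(min_value=1, max_value=900))
    # Either still alive (None) or died at least one day after birth
    date_of_death = draw(st.one_of(
        st.none(),
        st.dates(min_value=birthdate + timedelta(days=1),
                 max_value=date(2030, 12, 31)),
    ))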

    birthdate_str = birthdate.strftime("%Y-%m-%dT00:00:00Z")
    date_of_death_str = date_of_death.strftime("%Y-%m-%dT00:00:00Z") if date_of_death else None

    return {
        "astronaut": astronaut,
        "astronautLabel": astronautLabel,
        "birthdate": birthdate_str,
        "birthplaceLabel": birthplaceLabel,
        "sex_or_genderLabel": sex_or_genderLabel,
        "time_in_space": time_in_space,
        "date_of_death": date_of_death_str
    }

@given(st.lists(astronaut_data(), min_size=1))
def test_prepare_data_set(data):
    df = DataFrame(data)
    prepared_df = prepare_data_set(df)
    assert 'age' in prepared_df.columns
    assert 'alive' in prepared_df.columns

test_prepare_data_set()
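
The test above only checks that two columns exist. A stricter property relates the generated inputs to the computed values. The sketch below is one possible direction, not the notebook's own test: it checks the age-at-death formula on its own, without pandas. The names `age_at`, `birth_and_death` and `test_age_at_death_is_plausible` are hypothetical, and the upper bound for the birth date is set one day below the death-date bound so that the second interval is never empty.

```python
from datetime import date, timedelta
from hypothesis import given, settings, strategies as st


def age_at(born, on):
    """Hypothetical helper mirroring the age formula used in the script."""
    return on.year - born.year - ((on.month, on.day) < (born.month, born.day))


@st.composite
def birth_and_death(draw):
    # The death date depends on the birth date, so both are drawn together
    born = draw(st.dates(min_value=date(1920, 1, 1), max_value=date(2030, 12, 30)))
    died = draw(st.dates(min_value=born + timedelta(days=1),
                         max_value=date(2030, 12, 31)))
    return born, died


@settings(max_examples=200, database=None)
@given(birth_and_death())
def test_age_at_death_is_plausible(pair):
    born, died = pair
    age = age_at(born, died)
    year_gap = died.year - born.year
    # Nobody dies at a negative age, and the age differs from the
    # plain year difference by at most one year
    assert age >= 0
    assert year_gap - 1 <= age <= year_gap


test_age_at_death_is_plausible()
```

The same idea could be carried over to `prepare_data_set()` by asserting, for example, that `died_with_age` is never negative for rows with a date of death.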