# 1.0 An end-to-end classification problem (Data Check)



## 1.1 Dataset description

We'll be looking at individual income in the United States. The **data** is from the **1994 census** and contains information on each individual's **marital status**, **age**, **type of work**, and more. The **target column**, the value we want to predict, indicates whether an individual makes at most 50k a year or more than **50k a year**.

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).

Let's take the following steps:

1. ETL (done!!!)
4. Data Checks

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Install, load libraries and set up wandb

In [None]:
!pip install wandb

In [None]:
!pip install pytest pytest-sugar

In [None]:
import wandb

In [None]:
# Log in to Weights & Biases
!wandb login --relogin

## 1.3 Pytest


### 1.3.1 How pytest discovers tests



pytest uses the following [conventions](https://docs.pytest.org/en/latest/goodpractices.html#conventions-for-python-test-discovery) to discover tests automatically:
  1. files containing tests should be named `test_*.py` or `*_test.py`
  2. test function names should start with `test_`
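
To make the two naming rules concrete, here is a minimal sketch that applies them to a few made-up names. It is only an illustration of the conventions as stated above, not pytest's actual collection code; the helper names `looks_like_test_file` and `looks_like_test_function` and the file names are hypothetical.

```python
from fnmatch import fnmatch


def looks_like_test_file(filename: str) -> bool:
    # Rule 1: the file is named test_*.py or *_test.py
    return fnmatch(filename, "test_*.py") or fnmatch(filename, "*_test.py")


def looks_like_test_function(name: str) -> bool:
    # Rule 2: the function name starts with test_
    return name.startswith("test_")


# Hypothetical file names, used only to exercise the rules
candidates = ["test_data.py", "data_test.py", "data_checks.py", "testdata.py"]
collected = [f for f in candidates if looks_like_test_file(f)]
print(collected)  # prints ['test_data.py', 'data_test.py']
```

Under these rules a file called `data_checks.py` would be skipped, even if it contained functions whose names start with `test_`.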




### 1.3.2 Fixture


An important aspect of using `pytest` is understanding how a fixture's scope works.

The scope of a fixture can take a few legal values, described [here](https://docs.pytest.org/en/6.2.x/fixture.html#fixture-scopes). We are going to consider only **session** and **function**. With the former, the fixture is executed only once per pytest session and the value it returns is shared by all the tests that need it. With the latter, the fixture is executed again for every test function, so each test gets a fresh copy of the data. This is useful, for example, if a test modifies its input in a way that would make other tests fail.
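
The difference between the two scopes can be sketched with a small, self-contained test module. Everything here is hypothetical and chosen for illustration: `make_rows` stands in for an expensive load such as reading a CSV, and the fixture and test names are not part of the notebook's own test file.

```python
import pytest


def make_rows():
    """Hypothetical stand-in for an expensive load such as reading a CSV."""
    return [1, 2, 3]


@pytest.fixture(scope="session")
def shared_rows():
    # Executed once per pytest session; every test receives the same list object
    return make_rows()


@pytest.fixture(scope="function")
def fresh_rows():
    # Executed again for every test that requests it, so each test gets a new list
    return make_rows()


def test_mutates_fresh_rows(fresh_rows):
    # Changing a function-scoped value cannot leak into other tests
    fresh_rows.append(4)
    assert fresh_rows == [1, 2, 3, 4]


def test_sees_untouched_fresh_rows(fresh_rows):
    # Passes regardless of test order because the fixture ran again
    assert fresh_rows == [1, 2, 3]


def test_reads_shared_rows(shared_rows):
    # Read-only use is safe with session scope; an append here would be
    # visible to every later test that uses shared_rows
    assert len(shared_rows) == 3
```

A reasonable rule of thumb is to use session scope for data that is costly to load and only read by the tests, and function scope whenever a test needs to modify its input.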

### 1.3.3 Create and run a test file


In [None]:
%%file test_data.py
import pytest
import wandb
import pandas as pd

# This is global so all tests are collected under the same run
run = wandb.init(project="decision_tree", job_type="data_checks")

@pytest.fixture(scope="session")
def data():

    local_path = run.use_artifact("decision_tree/preprocessed_data.csv:latest").file()
    df = pd.read_csv(local_path)

    return df

def test_data_length(data):
    """
    We test that we have enough data to continue
    """
    assert len(data) > 1000


def test_number_of_columns(data):
    """
    We test that the dataset has the expected number of columns
    """
    assert data.shape[1] == 15

def test_column_presence_and_type(data):

    required_columns = {
        "age": pd.api.types.is_int64_dtype,
        "workclass": pd.api.types.is_object_dtype,
        "fnlwgt": pd.api.types.is_int64_dtype,
        "education": pd.api.types.is_object_dtype,
        "education_num": pd.api.types.is_int64_dtype,
        "marital_status": pd.api.types.is_object_dtype,
        "occupation": pd.api.types.is_object_dtype,
        "relationship": pd.api.types.is_object_dtype,
        "race": pd.api.types.is_object_dtype,
        "sex": pd.api.types.is_object_dtype,
        "capital_gain": pd.api.types.is_int64_dtype,
        "capital_loss": pd.api.types.is_int64_dtype,
        "hours_per_week": pd.api.types.is_int64_dtype,
        "native_country": pd.api.types.is_object_dtype,
        "high_income": pd.api.types.is_object_dtype
    }

    # Check column presence
    assert set(data.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_funct in required_columns.items():

        assert format_verification_funct(data[col_name]), f"Column {col_name} failed test {format_verification_funct}"


def test_class_names(data):

    # Check that only the known classes are present
    known_classes = [
        " <=50K",
        " >50K"
    ]

    assert data["high_income"].isin(known_classes).all()


def test_column_ranges(data):

    ranges = {
        "age": (17, 90),
        "fnlwgt": (1.228500e+04, 1.484705e+06),
        "education_num": (1, 16),
        "capital_gain": (0, 99999),
        "capital_loss": (0, 4356),
        "hours_per_week": (1, 99)
    }

    for col_name, (minimum, maximum) in ranges.items():

        assert data[col_name].dropna().between(minimum, maximum).all(), (
            f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
            f"instead min={data[col_name].min()} and max={data[col_name].max()}"
        )
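
The range check in `test_column_ranges` relies on two pandas behaviours: `Series.between` includes both bounds by default, and it returns `False` for missing values, which is why the test drops them first. The sketch below shows this on a toy frame. The helper `find_out_of_range` and the toy values are hypothetical, introduced only for illustration; they are not part of the test file or the census data.

```python
import pandas as pd


def find_out_of_range(df, ranges):
    """Return {column: [offending values]} for values outside [minimum, maximum]."""
    problems = {}
    for col_name, (minimum, maximum) in ranges.items():
        values = df[col_name].dropna()  # missing values are not range-checked
        outside = values[~values.between(minimum, maximum)]  # bounds are inclusive
        if not outside.empty:
            problems[col_name] = outside.tolist()
    return problems


# Hypothetical toy data, not the census dataset
toy = pd.DataFrame({
    "age": [17, 45, 90, 120, None],
    "hours_per_week": [1, 40, 99, 60, 35],
})

problems = find_out_of_range(toy, {"age": (17, 90), "hours_per_week": (1, 99)})
print(problems)  # prints {'age': [120.0]}
```

The boundary values 17 and 90 pass the check, the value 120 is reported, and the missing age is ignored rather than counted as a failure.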

Now let's run pytest.

In [None]:
!pytest . -vv

In [None]:
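# Close the run
# Wait a while after running the previous cell before executing this one.
# `run` is created inside test_data.py, which pytest executes in a separate
# process, so it is not defined in this notebook; wandb.finish() closes
# whichever run (if any) is active here.
wandb.finish()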