# 1.0 Classification Problem using decision tree classifier

## 1.1 Dataset description

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

We build a machine learning model to predict whether or not the patients in the dataset have diabetes.

**Pregnancies**: Number of times pregnant

**Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

**BloodPressure**: Diastolic blood pressure (mm Hg)

**SkinThickness**: Triceps skin fold thickness (mm)

**Insulin**: 2-Hour serum insulin (mu U/ml)

**BMI**: Body mass index (weight in kg/(height in m)^2)

**DiabetesPedigreeFunction**: Diabetes pedigree function

**Age**: Age (years)

**Outcome**: Class variable (0 or 1); 268 of 768 are 1, the others are 0

<center>
  <figure>
    <img width="500" src="https://cdn.britannica.com/42/93542-050-E2B32DAB/women-Pima-shinny-game-field-hockey.jpg">
  </figure>
  <figcaption>Fig.1 - Pima Indian</figcaption>
</center>


### 1.1.1 Glucose Tolerance Test

The glucose tolerance test is a blood test that involves taking multiple blood samples over time, usually 2 hours. It is used to diagnose diabetes. The results can be classified as normal, impaired, or abnormal.

**Normal Results for Diabetes**: Two-hour glucose level less than 140 mg/dL

**Impaired Results for Diabetes**: Two-hour glucose level 140 to 200 mg/dL

**Abnormal (Diagnostic) Results for Diabetes**: Two-hour glucose level greater than 200 mg/dL
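
The thresholds above can be expressed as a small helper. This is only an illustrative sketch: the function name `classify_two_hour_glucose` is hypothetical, and it follows the ranges exactly as listed here, so a reading of exactly 200 mg/dL is treated as impaired.

```python
def classify_two_hour_glucose(glucose_mg_dl):
    """Classify a two-hour glucose reading (mg/dL) using the ranges listed above.

    Hypothetical helper for illustration only, not a diagnostic tool.
    """
    if glucose_mg_dl < 140:
        return "normal"
    if glucose_mg_dl <= 200:
        # "140 to 200" is read here as inclusive on both ends (an assumption).
        return "impaired"
    return "abnormal"


if __name__ == "__main__":
    for reading in (120, 140, 200, 201):
        print(reading, classify_two_hour_glucose(reading))
```

Note that the `Glucose` column in this dataset holds the two-hour plasma glucose concentration, so a helper like this could be applied to it for exploratory purposes.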


### 1.1.2 Blood Pressure

The diastolic reading, or the bottom number, is the pressure in the arteries when the heart rests between beats. This is the time when the heart fills with blood and gets oxygen. A normal diastolic blood pressure is lower than 80. A reading of 90 or higher means you have high blood pressure.

**Normal**: Systolic below 120 and diastolic below 80

**Elevated**: Systolic 120–129 and diastolic under 80

**Hypertension stage 1**: Systolic 130–139 or diastolic 80–89

**Hypertension stage 2**: Systolic 140 or higher or diastolic 90 or higher

**Hypertensive crisis**: Systolic higher than 180 and/or diastolic higher than 120


### 1.1.3 BMI (Body Mass Index)

<script src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=default'></script>

The BMI value is found by:
$$ \mathrm{BMI} = \frac{\text{weight (kg)}}{\left(\text{height (m)}\right)^2} $$

The standard weight status categories associated with BMI ranges for adults are shown below.

**Below 18.5**: Underweight

**18.5 – 24.9**: Normal or Healthy Weight

**25.0 – 29.9**: Overweight

**30.0 and Above**: Obese
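
As a quick illustration, the formula and the categories can be combined in a few lines of Python. The function names `bmi` and `bmi_category` are hypothetical, and the sketch assumes half-open intervals (for example, a value of 24.95 counts as normal), since the ranges as listed leave small gaps between categories.

```python
def bmi(weight_kg, height_m):
    """Return the body mass index: weight in kg divided by height in m, squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Map a BMI value to the adult weight status categories listed above.

    Assumption: boundaries are half-open, e.g. [18.5, 25.0) is normal.
    """
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal or Healthy Weight"
    if value < 30.0:
        return "Overweight"
    return "Obese"


if __name__ == "__main__":
    value = bmi(70, 1.75)  # 70 kg, 1.75 m
    print(round(value, 1), bmi_category(value))
```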


### 1.1.4 Triceps Skinfolds

For an adult woman, the standard normal value for the triceps skinfold is 18.0 mm.

## 1.2 Install and load libraries

In [1]:
!pip install wandb


In [2]:
!pip install pytest pytest-sugar



In [3]:
import wandb

In [4]:
# Login to Weights & Biases
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 1.3 Pytest


### 1.3.1 How pytest discovers tests

pytest uses the following conventions to discover tests automatically:

1. Files containing tests should be named `test_*.py` or `*_test.py`.

2. Test function names should start with `test_`.
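
A minimal sketch of a file that follows both conventions is shown below. The file name `test_example.py` and the functions `add`, `test_add`, and `check_add` are hypothetical and serve only to illustrate the naming rules.

```python
# Contents of a hypothetical file named test_example.py.
# The file name matches the test_*.py pattern, so pytest will look inside it.


def add(a, b):
    """Plain helper; not collected because its name does not start with test_."""
    return a + b


def test_add():
    """Collected by pytest: the function name starts with test_."""
    assert add(2, 3) == 5


def check_add():
    """Not collected by default: the name lacks the test_ prefix."""
    assert add(1, 1) == 2
```

Running `pytest` in the directory containing this file would be expected to collect only `test_add`.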

### 1.3.2 Fixture

A fixture is a function decorated with `@pytest.fixture` whose return value pytest passes to any test that lists the fixture's name as an argument. An important aspect of using pytest is understanding how a fixture's scope works.

The scope of a fixture can take a few legal values, listed in the pytest documentation (function, class, module, package, and session). We consider only **session** and **function**: with the former, the fixture is executed only once per pytest session and the value it returns is shared by all the tests that request it; with the latter, every test function gets a fresh copy of the data. Function scope is useful when, for example, a test modifies its input in a way that would make other tests fail.
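
The difference between the two scopes can be sketched as follows. All names here (`build_rows`, `shared_rows`, `fresh_rows`, and the test functions) are hypothetical; the example assumes it is saved in a file that pytest can discover and is not part of the data checks used later.

```python
import pytest


def build_rows():
    """Hypothetical stand-in for loading data; returns a new list each time."""
    return [1, 2, 3]


@pytest.fixture(scope="session")
def shared_rows():
    # Runs once per pytest session; every test receives the same object.
    return build_rows()


@pytest.fixture(scope="function")
def fresh_rows():
    # Runs again for every test; each test receives its own copy.
    return build_rows()


def test_mutates_its_copy(fresh_rows):
    fresh_rows.append(4)
    assert fresh_rows == [1, 2, 3, 4]


def test_sees_untouched_copy(fresh_rows):
    # Passes regardless of test order because the fixture is function-scoped.
    assert fresh_rows == [1, 2, 3]


def test_reads_shared_rows(shared_rows):
    # Read-only use is safe with a session-scoped fixture.
    assert len(shared_rows) == 3
```

If `test_mutates_its_copy` requested `shared_rows` instead, its `append` would be visible to every later test that uses the same fixture, which is the kind of cross-test interference that function scope avoids.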

### 1.3.3 Create and run a test file

In [5]:
%%file test_data.py
import pytest
import wandb
import pandas as pd

# This is global so all tests are collected under the same run
run = wandb.init(project="diabetes_decision_tree", job_type="data_checks")

@pytest.fixture(scope="session")
def data():

    local_path = run.use_artifact("diabetes_decision_tree/preprocessed_data.csv:latest").file()
    df = pd.read_csv(local_path)

    return df

def test_data_length(data):
    """
    We test that we have enough data to continue
    """
    assert len(data) > 500


def test_number_of_columns(data):
    """
    We test that we have enough data to continue
    """
    assert data.shape[1] == 12

def test_column_presence_and_type(data):

    required_columns = {
        "Pregnancies": pd.api.types.is_int64_dtype,
        "Glucose": pd.api.types.is_int64_dtype,
        "BloodPressure": pd.api.types.is_int64_dtype,
        "SkinThickness": pd.api.types.is_int64_dtype,
        "Insulin": pd.api.types.is_int64_dtype,
        "BMI": pd.api.types.is_float_dtype,
        "DiabetesPedigreeFunction": pd.api.types.is_float_dtype,
        "Age": pd.api.types.is_int64_dtype,
        "Outcome": pd.api.types.is_int64_dtype
    }

    # Check column presence
    assert set(data.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_funct in required_columns.items():

        assert format_verification_funct(data[col_name]), f"Column {col_name} failed test {format_verification_funct}"


def test_class_names(data):

    # Check that only the known classes are present
    known_classes = [0,1]

    assert data["Outcome"].isin(known_classes).all()


def test_column_ranges(data):

    ranges = {
        "Age": (18, 90)
    }

    for col_name, (minimum, maximum) in ranges.items():

        assert data[col_name].dropna().between(minimum, maximum).all(), (
            f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
            f"instead min={data[col_name].min()} and max={data[col_name].max()}"
        )

Writing test_data.py


Run the test file

In [6]:
!pytest . -vv

Test session starts (platform: linux, Python 3.7.13, pytest 3.6.4, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /content, inifile:
plugins: typeguard-2.7.1, sugar-0.9.4

 test_data.py::test_data_length ✓                                 20% ██

―――――――――――――――――――――――――――― test_number_of_columns ――――――――――――――――――――――――――――

data =      Pregnancies  ...  Outcome
0              6  ...        1
1              1  ...        0
2              8  ...    ... 0
765            5  ...        0
766            1  ...        1
767            1  ...        0

[768 rows x 9 columns]

    def test_number_of_columns(data):
        """
        We test that we have enough data to continue
        """
>       assert data.shape[1] == 12
E       assert 9 == 12

test_data.py:27: AssertionError

 test_data.py::test_number_of_columns ⨯