# 1.0 Classification Problem using decision tree classifier

##1.1 Dataset description

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

We build a machine learning model to predict whether the patients in the dataset have diabetes.

**Pregnancies**: Number of times pregnant

**Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

**BloodPressure**: Diastolic blood pressure (mm Hg)

**SkinThickness**: Triceps skin fold thickness (mm)

**Insulin**: 2-Hour serum insulin (mu U/ml)

**BMI**: Body mass index (weight in kg/(height in m)^2)

**DiabetesPedigreeFunction**: Diabetes pedigree function

**Age**: Age (years)

**Outcome**: Class variable (0 or 1); 268 of the 768 instances are 1 and the others are 0

<center>
  <figure>
    <img width="500" src="https://cdn.britannica.com/42/93542-050-E2B32DAB/women-Pima-shinny-game-field-hockey.jpg">
  </figure>
  <figcaption>Fig. 1 - Pima Indians</figcaption>
</center>


###1.1.1 Glucose Tolerance Test

It is a blood test that involves taking multiple blood samples over time, usually 2 hours. It is used to diagnose diabetes. The results can be classified as normal, impaired, or abnormal.

**Normal Results for Diabetes**: Two-hour glucose level less than 140 mg/dL

**Impaired Results for Diabetes**: Two-hour glucose level 140 to 199 mg/dL

**Abnormal (Diagnostic) Results for Diabetes**: Two-hour glucose level of 200 mg/dL or greater
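
The thresholds above can be sketched as a small helper. This is a minimal illustration, not code from the notebook: the function name `classify_ogtt` is hypothetical, and treating a value of exactly 200 mg/dL as abnormal (diagnostic) is an assumption about the boundary.

```python
def classify_ogtt(two_hour_glucose_mg_dl: float) -> str:
    """Classify a two-hour oral glucose tolerance test result given in mg/dL."""
    if two_hour_glucose_mg_dl < 0:
        raise ValueError("glucose concentration cannot be negative")
    if two_hour_glucose_mg_dl < 140:
        return "normal"
    if two_hour_glucose_mg_dl < 200:  # assumed boundary: 200 counts as abnormal
        return "impaired"
    return "abnormal"


# Example values in mg/dL, one per category boundary region.
for glucose in (85, 148, 183, 200):
    print(glucose, classify_ogtt(glucose))
```

Applied to the `Glucose` column, a helper like this would give a rough clinical reading of each row, although the model itself works on the raw values.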


###1.1.2 Blood Pressure

The diastolic reading, or the bottom number, is the pressure in the arteries when the heart rests between beats. This is the time when the heart fills with blood and gets oxygen. A normal diastolic blood pressure is lower than 80. A reading of 90 or higher means you have high blood pressure.

**Normal**: Systolic below 120 and diastolic below 80

**Elevated**: Systolic 120–129 and diastolic under 80

**Hypertension stage 1**: Systolic 130–139 or diastolic 80–89

**Hypertension stage 2**: Systolic 140 or higher or diastolic 90 or higher

**Hypertensive crisis**: Systolic higher than 180 and/or diastolic higher than 120
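
As a rough sketch, the categories above could be encoded as follows. The function name `classify_blood_pressure` is hypothetical, and the rule that a reading falls into the highest category reached by either number is an assumption about readings whose two numbers land in different rows. Note that the dataset itself only records the diastolic value.

```python
def classify_blood_pressure(systolic: float, diastolic: float) -> str:
    """Return the blood pressure category for a reading in mm Hg."""
    # Check the most severe category first so that either number can trigger it.
    if systolic > 180 or diastolic > 120:
        return "hypertensive crisis"
    if systolic >= 140 or diastolic >= 90:
        return "hypertension stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "hypertension stage 1"
    # From here on the diastolic value is below 80 and the systolic below 130.
    if systolic >= 120:
        return "elevated"
    return "normal"


for reading in ((118, 72), (125, 72), (135, 72), (150, 95), (190, 125)):
    print(reading, classify_blood_pressure(*reading))
```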


###1.1.3 BMI (Body Mass Index)

<script src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=default'></script>

The BMI value is found by:
$$ \mathrm{BMI} = \frac{\text{weight (kg)}}{\left(\text{height (m)}\right)^{2}} $$

The standard weight status categories associated with BMI ranges for adults are shown below.

**Below 18.5**: Underweight

**18.5 – 24.9**: Normal or Healthy Weight

**25.0 – 29.9**: Overweight

**30.0 and Above**: Obese
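
The formula and the categories can be combined in a short sketch. The function names `bmi` and `bmi_category` are hypothetical, and assigning values that fall between the listed ranges (for example 24.95) to the lower category is an assumption.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the adult weight status categories listed above."""
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal or healthy weight"
    if value < 30.0:
        return "overweight"
    return "obese"


# A person weighing 70 kg with a height of 1.75 m.
value = bmi(70, 1.75)
print(round(value, 1), bmi_category(value))
```

The dataset already stores the computed value in the `BMI` column, so only `bmi_category` would be needed to label its rows.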


###1.1.4 Triceps Skinfolds

For an adult woman, the standard normal value for the triceps skinfold is 18.0 mm.

##1.2 Install and load libraries

In [1]:
!pip install wandb

Collecting wandb
  Downloading wandb-0.12.16-py2.py3-none-any.whl (1.8 MB)

In [2]:
import pandas as pd
import seaborn as sns
import missingno as msno
from google.colab import drive

##1.3 Fetch data

###1.3.1 Create raw_data artifact

In [3]:
#mount drive to import dataset from google drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
#load dataset
path = "/content/gdrive/MyDrive/ML-project1/Dataset/"
df = pd.read_csv(path+"diabetes.csv")

In [5]:
df.head()

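|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |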
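
One quick sanity check after loading is to compare the class balance with the dataset description. The sketch below is not part of the original notebook: the helper `outcome_summary` is hypothetical, and the five rows printed above are rebuilt by hand so that it runs without the CSV file.

```python
import pandas as pd

# The five rows shown by df.head(), rebuilt by hand for illustration.
sample = pd.DataFrame(
    {
        "Pregnancies": [6, 1, 8, 1, 0],
        "Glucose": [148, 85, 183, 89, 137],
        "BloodPressure": [72, 66, 64, 66, 40],
        "SkinThickness": [35, 29, 0, 23, 35],
        "Insulin": [0, 0, 0, 94, 168],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
        "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
        "Age": [50, 31, 32, 21, 33],
        "Outcome": [1, 0, 1, 0, 1],
    }
)


def outcome_summary(frame: pd.DataFrame) -> dict:
    """Count negative (0) and positive (1) rows in the Outcome column."""
    counts = frame["Outcome"].value_counts()
    positive = int(counts.get(1, 0))
    negative = int(counts.get(0, 0))
    total = positive + negative
    share = positive / total if total else 0.0
    return {"negative": negative, "positive": positive, "positive_share": share}


print(outcome_summary(sample))
```

Calling `outcome_summary(df)` on the full DataFrame should report 268 positive rows out of 768 if the file matches the description in section 1.1.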


In [6]:
#export dataframe as raw_data
df.to_csv("raw_data.csv",index=False)

In [7]:
#login to wandb
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [8]:
# Send raw_data.csv to W&B, storing it as an artifact
!wandb artifact put \
      --name diabetes_decision_tree/raw_data.csv \
      --type raw_data \
      --description "The raw data from Pima Indians" raw_data.csv

wandb: Uploading file raw_data.csv to: "mgoldbarg/diabetes_decision_tree/raw_data.csv:latest" (raw_data)
wandb: Currently logged in as: mgoldbarg. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.16
wandb: Run data is saved locally in /content/wandb/run-20220523_105605-22slpuxt
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fast-energy-1
wandb: ⭐️ View project at https://wandb.ai/mgoldbarg/diabetes_decision_tree
wandb: 🚀 View run at https://wandb.ai/mgoldbarg/diabetes_decision_tree/runs/22slpuxt
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("mgoldbarg/diabetes_decision_tree/raw_data.csv:latest")

wandb: Waiting for W&B process to finish... (success).