# Understanding the diabetes data set

This notebook aims to understand the content of the diabetes data set.


## Acknowledgments

- https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

- The **sklearn.datasets** Python module

- Image below from https://www.gutmicrobiotaforhealth.com/es/como-contribuye-la-microbiota-intestinal-a-la-diabetes-de-tipo-2-lo-que-ya-sabemos/


# Diabetes data set

![Diabetes.jpg](datasets/diabetes/diabetes.jpg)

1. The dataset description
    - Each observation (measurement, record) holds the characteristics (attributes, variables) of one person
    - Variables: age, sex, bmi, bp, tc, ... (10 variables)
    - Total number of observations: 442


2. Description of the predictors/variables/features/attributes (independent variables)
    - AGE: in years
    - SEX: gender
    - BMI: body mass index
    - BP: average blood pressure
    - S1: tc, total serum cholesterol
    - S2: ldl, low-density lipoproteins
    - S3: hdl, high-density lipoproteins
    - S4: tch, total cholesterol / HDL
    - S5: ltg, possibly log of serum triglycerides level
    - S6: glu, blood sugar level


3. Description of the response (dependent variable)
    - Y: quantitative measure of disease progression one year after baseline


# Option 1: Importing (as a numpy array) and inspecting the data from a file on the hard disk (raw data)

In [1]:
# Import the packages that we will be using
import numpy as np                  # For arrays, matrices, and functions to operate on them
import matplotlib.pyplot as plt     # For showing plots

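# Path to the dataset file (a local path, relative to the notebook)
url = "datasets/diabetes/diabetes.txt"

# Load the dataset, skipping the header line with the column names
data = np.loadtxt(url, skiprows=1)

X = data[:, :-1]    # Predictors: every column except the last
y = data[:, -1]     # Response: the last column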

In [2]:
X.shape

(442, 10)

In [12]:
y.shape

(442,)
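The loading and splitting steps above can be tried without the file on disk. The following is a minimal sketch, assuming the raw file is whitespace-delimited with a single header line; it uses only the first five rows shown in the dataframe printout of this notebook, and the names `sample_text`, `X_sample`, and `y_sample` are illustrative.

```python
import io
import numpy as np

# First five observations, copied from the dataframe printout in this notebook.
# Assumption: the raw file is whitespace-delimited with one header line.
sample_text = """AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y
59 2 32.1 101 157 93.2 38 4 4.8598 87 151
48 1 21.6 87 183 103.2 70 3 3.8918 69 75
72 2 30.5 93 156 93.6 41 4 4.6728 85 141
24 1 25.3 84 198 131.4 40 5 4.8903 89 206
50 1 23.0 101 192 125.4 52 4 4.2905 80 135
"""

# np.loadtxt accepts a file-like object as well as a path
sample = np.loadtxt(io.StringIO(sample_text), skiprows=1)

X_sample = sample[:, :-1]   # Predictors: all columns except the last
y_sample = sample[:, -1]    # Response: the last column

print(X_sample.shape, y_sample.shape)
print(X_sample.mean(axis=0))   # Per-column mean of the predictors
```

Because the values are raw, the per-column means stay in the original units (years for AGE, and so on), unlike the normalized version shipped with sklearn.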

# Option 2: Importing (as a pandas dataframe) and inspecting the data from a file on the hard disk (raw data)

In [15]:
# Import the packages that we will be using
import pandas as pd                 # For data handling

# Construct the dataframe from the numpy array `data` loaded in Option 1
ColumnNames = ['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6','Y']
df = pd.DataFrame(data, columns=ColumnNames)


In [16]:
df

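|     | AGE  | SEX | BMI  | BP     | S1    | S2    | S3   | S4   | S5     | S6    | Y     |
|-----|------|-----|------|--------|-------|-------|------|------|--------|-------|-------|
| 0   | 59.0 | 2.0 | 32.1 | 101.00 | 157.0 | 93.2  | 38.0 | 4.00 | 4.8598 | 87.0  | 151.0 |
| 1   | 48.0 | 1.0 | 21.6 | 87.00  | 183.0 | 103.2 | 70.0 | 3.00 | 3.8918 | 69.0  | 75.0  |
| 2   | 72.0 | 2.0 | 30.5 | 93.00  | 156.0 | 93.6  | 41.0 | 4.00 | 4.6728 | 85.0  | 141.0 |
| 3   | 24.0 | 1.0 | 25.3 | 84.00  | 198.0 | 131.4 | 40.0 | 5.00 | 4.8903 | 89.0  | 206.0 |
| 4   | 50.0 | 1.0 | 23.0 | 101.00 | 192.0 | 125.4 | 52.0 | 4.00 | 4.2905 | 80.0  | 135.0 |
| ... | ...  | ... | ...  | ...    | ...   | ...   | ...  | ...  | ...    | ...   | ...   |
| 437 | 60.0 | 2.0 | 28.2 | 112.00 | 185.0 | 113.8 | 42.0 | 4.00 | 4.9836 | 93.0  | 178.0 |
| 438 | 47.0 | 2.0 | 24.9 | 75.00  | 225.0 | 166.0 | 42.0 | 5.00 | 4.4427 | 102.0 | 104.0 |
| 439 | 60.0 | 2.0 | 24.9 | 99.67  | 162.0 | 106.6 | 43.0 | 3.77 | 4.1271 | 95.0  | 132.0 |
| 440 | 36.0 | 1.0 | 30.0 | 95.00  | 201.0 | 125.2 | 42.0 | 4.79 | 5.1299 | 85.0  | 220.0 |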
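The dataframe can also be built straight from text with `pd.read_csv`, without going through a numpy array first. This is a minimal sketch under the assumption that the raw file is whitespace-delimited with the column names on its first line; it uses only the first five rows printed in this notebook, and `sample_text` and `df_sample` are illustrative names.

```python
import io
import pandas as pd

# First five observations, copied from the printout in this notebook.
# Assumption: whitespace-delimited text with the column names on the first line.
sample_text = """AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y
59 2 32.1 101 157 93.2 38 4 4.8598 87 151
48 1 21.6 87 183 103.2 70 3 3.8918 69 75
72 2 30.5 93 156 93.6 41 4 4.6728 85 141
24 1 25.3 84 198 131.4 40 5 4.8903 89 206
50 1 23.0 101 192 125.4 52 4 4.2905 80 135
"""

# sep=r"\s+" splits on any run of whitespace; the header row gives the column names
df_sample = pd.read_csv(io.StringIO(sample_text), sep=r"\s+")

print(df_sample.dtypes)       # Data type inferred for each column
print(df_sample.describe())   # Count, mean, std, min, quartiles, max per column
```

For the real file, the `io.StringIO(...)` argument would be replaced by the file path, provided the layout assumption holds.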


# Option 3: Importing and inspecting the data from sklearn (normalized data)

In [14]:
# Import the packages that we will be using
from sklearn import datasets

# Load the dataset
X, y = datasets.load_diabetes(return_X_y=True)


In [39]:
X.shape

(442, 10)

Note that each of the 10 variables has been mean centered and scaled by the standard deviation times the square root of n_samples (i.e. the sum of squares of each column totals 1).

In [40]:
y.shape

(442,)
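The normalization mentioned in the note above can be reproduced with numpy. The following is a minimal sketch on synthetic data (the random array `raw` is a stand-in, not the diabetes measurements): each column is mean centered and divided by its population standard deviation times the square root of the number of samples, after which every column has mean 0 and sum of squares 1.

```python
import numpy as np

# Synthetic stand-in with the same shape as the diabetes predictors (assumption:
# any numeric array with non-constant columns behaves the same way)
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=(442, 10))

n_samples = raw.shape[0]

# Mean center each column
centered = raw - raw.mean(axis=0)

# Scale by the (population) standard deviation times sqrt(n_samples)
scaled = centered / (raw.std(axis=0) * np.sqrt(n_samples))

col_means = scaled.mean(axis=0)
col_sum_sq = (scaled ** 2).sum(axis=0)

print(np.allclose(col_means, 0.0))    # Columns are centered
print(np.allclose(col_sum_sq, 1.0))   # Sum of squares of each column is 1
```

The same two checks, `X.mean(axis=0)` and `(X ** 2).sum(axis=0)`, can be run on the `X` returned by `load_diabetes` to confirm the note.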

# Activity: work with the iris dataset

1. Load the iris.csv file from your computer and inspect it in the same way as the diabetes data above.


2. How many observations (rows) are in total?


3. How many variables (columns) are in total? What do they represent?


4. How many observations are there for each type of flower?


5. What is the type of data for each variable?


6. What are the units of each variable?