### PIMA INDIANS DIABETES DATASET

The Pima Indians Diabetes Dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases, contains information on 768 women from a population near Phoenix, Arizona, USA. The outcome tested was diabetes: 268 women tested positive and 500 tested negative. There is therefore one target (dependent) variable, and the attributes include the following <sup>1</sup>: 


	Pregnancies (number of times pregnant), 

	Oral glucose tolerance test (plasma glucose concentration at 2 h), 

	Blood Pressure (Diastolic Blood Pressure in mmHg), 

	Skin Thickness (Triceps skin fold thickness in mm), 

	Insulin (2 h serum insulin in µU/ml), 

	BMI (Body Mass Index in kg/m²), 

	Age (years).

### PIMA INDIANS AND DIABETES

The Pima are descendants of people who inhabited the Sonoran Desert and Sierra Madre areas for centuries. Around 300 B.C. they moved to the Gila River Valley, a region that later became part of Mexico and was acquired by the United States in 1853. A Pima reservation was created in Arizona in 1859, and the Pima adapted to their desert homeland by directing water to support subsistence agriculture. Around 1900 the population of white settlers increased and the water supply was diverted, which affected the Pima's food intake and way of life. A community that had sustained itself by farming, with heavy physical labour, shifted to little physical labour and a scarcity of food. As a consequence, their diet became high in fat and their lifestyle mainly sedentary, which led to the development of diabetes among the Arizona Pima. By the 1950s the prevalence of diabetes among Pima Indians had risen markedly <sup>2</sup>.
The Pima population has been under study by the National Institute of Diabetes and Digestive and Kidney Diseases since 1965.

### TRICEPS SKIN FOLD THICKNESS

Triceps skinfold thickness in millimeters for females aged 20 and over and number of examined persons, mean, standard error of the mean, and selected percentiles, by race and ethnicity and age: United States, 2007–2010 <sup>3</sup>.

<a href="https://ibb.co/4NDNNx1"><img src="https://i.ibb.co/KyQyyHK/table-triceps-skin-fold.png" alt="table-triceps-skin-fold" border="0"></a>

### BODY WEIGHT AND DIABETES

Obesity and diabetes are intimately linked <sup>4,5</sup>. In fact, most individuals with type 2 diabetes mellitus (T2DM) are overweight or obese <sup>5</sup>. Despite this link, not all obese people develop diabetes and not all people with diabetes are obese. Lean people with diabetes probably have a stronger genetic component for T2DM than overweight and obese individuals <sup>5</sup>.

## OBJECTIVE

The objective of this project is to make diagnosis easier for health professionals by applying machine learning techniques that bridge the gap between datasets and human knowledge. In this project I will apply machine learning techniques to the Pima Indians Diabetes Dataset. 

In [932]:
import pandas as pd
import io
import requests
import numpy as np

In [933]:
url="https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv"

In [934]:
s=requests.get(url).content

In [935]:
pima = pd.read_csv(io.StringIO(s.decode('utf-8')))

In [936]:
pima.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [937]:
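# Empty Series that will hold text labels; an explicit dtype avoids the default-dtype warning
Nutritional_status = pd.Series([], dtype=object)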

In [938]:
# Nutritional status based on BMI (kg/m2); a BMI of 0.0 marks a missing value

for i in range(len(pima)):
    bmi = pima["BMI"][i]
    if bmi == 0.0:
        Nutritional_status[i] = "NA"
    elif bmi < 25:
        # Note: this label also covers any BMI below 18.5
        Nutritional_status[i] = "Normal"
    elif bmi < 30:
        Nutritional_status[i] = "Overweight"
    elif bmi >= 30:
        Nutritional_status[i] = "Obese"
    else:
        # Only reached when BMI is NaN; keep the raw value
        Nutritional_status[i] = bmi

In [939]:
# Insert new column - Nutritional Status
pima.insert(6, "Nutritional Status", Nutritional_status)
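
The row-by-row labelling above can also be expressed as a single vectorised step. The following is a minimal sketch, assuming the same cutoffs as the loop; the `sample` frame and the `nutritional_status` helper are hypothetical names used only for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a few BMI values, including a zero (missing) and two boundary values.
sample = pd.DataFrame({"BMI": [33.6, 26.6, 23.3, 0.0, 30.0, 25.0]})


def nutritional_status(bmi: pd.Series) -> pd.Series:
    """Label BMI values with the same cutoffs as the loop (0 -> NA, <25, <30, >=30)."""
    values = bmi.to_numpy()
    conditions = [values == 0, values < 25, values < 30]
    labels = ["NA", "Normal", "Overweight"]
    # np.select takes the label of the first condition that is true;
    # anything left over (BMI >= 30, and also NaN in this sketch) gets the default.
    return pd.Series(np.select(conditions, labels, default="Obese"), index=bmi.index)


status = nutritional_status(sample["BMI"])
print(status.tolist())
```

The order of the conditions matters: the zero check comes first so that missing values are not labelled "Normal".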

In [940]:
# Check df containing new column
pima.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,Nutritional Status,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,Obese,0.627,50,1
1,1,85,66,29,0,26.6,Overweight,0.351,31,0
2,8,183,64,0,0,23.3,Normal,0.672,32,1
3,1,89,66,23,94,28.1,Overweight,0.167,21,0
4,0,137,40,35,168,43.1,Obese,2.288,33,1
5,5,116,74,0,0,25.6,Overweight,0.201,30,0
6,3,78,50,32,88,31.0,Obese,0.248,26,1
7,10,115,0,0,0,35.3,Obese,0.134,29,0
8,2,197,70,45,543,30.5,Obese,0.158,53,1
9,8,125,96,0,0,0.0,,0.232,54,1


In [941]:
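# Empty Series that will hold text labels; an explicit dtype avoids the default-dtype warning
High_glucose = pd.Series([], dtype=object)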

In [942]:
# According to the World Health Organization criteria, diabetes is diagnosed when the 2-hour post-load glucose is at least 200 mg/dl

for i in range(len(pima)):
    glucose = pima["Glucose"][i]
    if glucose == 0.0:
        # A glucose of 0 marks a missing value
        High_glucose[i] = "NA"
    elif glucose >= 200:
        High_glucose[i] = "High glucose"
    else:
        High_glucose[i] = "Normal"

In [943]:
# Insert new column - High glucose (>=200)
pima.insert(2, "Glucose level", High_glucose)

In [944]:
pima.head(20)

Unnamed: 0,Pregnancies,Glucose,Glucose level,BloodPressure,SkinThickness,Insulin,BMI,Nutritional Status,DiabetesPedigreeFunction,Age,Outcome
0,6,148,Normal,72,35,0,33.6,Obese,0.627,50,1
1,1,85,Normal,66,29,0,26.6,Overweight,0.351,31,0
2,8,183,Normal,64,0,0,23.3,Normal,0.672,32,1
3,1,89,Normal,66,23,94,28.1,Overweight,0.167,21,0
4,0,137,Normal,40,35,168,43.1,Obese,2.288,33,1
5,5,116,Normal,74,0,0,25.6,Overweight,0.201,30,0
6,3,78,Normal,50,32,88,31.0,Obese,0.248,26,1
7,10,115,Normal,0,0,0,35.3,Obese,0.134,29,0
8,2,197,Normal,70,45,543,30.5,Obese,0.158,53,1
9,8,125,Normal,96,0,0,0.0,,0.232,54,1


In [945]:
# Minimum

In [946]:
pima.min()

Pregnancies                     0
Glucose                         0
Glucose level                  NA
BloodPressure                   0
SkinThickness                   0
Insulin                         0
BMI                             0
Nutritional Status             NA
DiabetesPedigreeFunction    0.078
Age                            21
Outcome                         0
dtype: object

In [947]:
# Some women have no information for some attributes, such as glucose and blood pressure, because the minimum value is zero. In addition, only adult women were included: the minimum age is 21 years.
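
To see how many values are affected, the zeros can be counted per column. This is a small sketch on a hypothetical `sample` frame that uses the notebook's column names but made-up rows; running the same expression on `pima` would give the real counts.

```python
import pandas as pd

# Hypothetical rows in the same layout as the dataset (not the real data).
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 0, 64, 0],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Comparing with 0 gives a boolean frame; summing it counts the True values per column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```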

In [948]:
# Maximum

In [949]:
pima.max()

Pregnancies                         17
Glucose                            199
Glucose level                   Normal
BloodPressure                      122
SkinThickness                       99
Insulin                            846
BMI                               67.1
Nutritional Status          Overweight
DiabetesPedigreeFunction          2.42
Age                                 81
Outcome                              1
dtype: object

In [950]:
# The maximum glucose value is 199, so every woman with a recorded glucose value is labelled 'Normal' (< 200).

In [951]:
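# numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
pima.mean(numeric_only=True)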

Pregnancies                   3.845052
Glucose                     120.894531
BloodPressure                69.105469
SkinThickness                20.536458
Insulin                      79.799479
BMI                          31.992578
DiabetesPedigreeFunction      0.471876
Age                          33.240885
Outcome                       0.348958
dtype: float64
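
Because `Outcome` is coded as 0 or 1, its mean is the proportion of women who tested positive, so the class counts can be recovered from it. A short arithmetic check, using only the figures printed in this notebook (the variable names are illustrative):

```python
total_women = 768          # number of rows in the dataset
outcome_mean = 0.348958    # mean of the Outcome column printed above

# The mean of a 0/1 column times the number of rows gives the number of ones.
positives = round(outcome_mean * total_women)
negatives = total_women - positives
print(positives, negatives)
```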

In [952]:
# Because some women have no information for attributes such as glucose, blood pressure and BMI, the averages above might not be correct. To get the correct value, only women with a value greater than zero should be included in the average.
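
An alternative to filtering each column separately is to mark the zeros as missing once and let pandas skip them. The sketch below assumes that a zero always means "not recorded" in these columns; the `sample` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (not the real data); zeros stand for missing measurements.
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Replace zeros with NaN so that mean() ignores them (NaN is skipped by default).
cleaned = sample.replace(0, np.nan)

means_raw = sample.mean()      # zeros pull the averages down
means_clean = cleaned.mean()   # averages over recorded values only
print(pd.DataFrame({"with zeros": means_raw, "zeros as missing": means_clean}))
```

This should not be applied to columns where zero is a valid value, such as `Pregnancies` or `Outcome`.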

In [953]:
# Select the women that have information about glucose

pima_glucose = pima.loc[pima['Glucose'] != 0]

In [954]:
pima_glucose

Unnamed: 0,Pregnancies,Glucose,Glucose level,BloodPressure,SkinThickness,Insulin,BMI,Nutritional Status,DiabetesPedigreeFunction,Age,Outcome
0,6,148,Normal,72,35,0,33.6,Obese,0.627,50,1
1,1,85,Normal,66,29,0,26.6,Overweight,0.351,31,0
2,8,183,Normal,64,0,0,23.3,Normal,0.672,32,1
3,1,89,Normal,66,23,94,28.1,Overweight,0.167,21,0
4,0,137,Normal,40,35,168,43.1,Obese,2.288,33,1
5,5,116,Normal,74,0,0,25.6,Overweight,0.201,30,0
6,3,78,Normal,50,32,88,31.0,Obese,0.248,26,1
7,10,115,Normal,0,0,0,35.3,Obese,0.134,29,0
8,2,197,Normal,70,45,543,30.5,Obese,0.158,53,1
9,8,125,Normal,96,0,0,0.0,,0.232,54,1


In [955]:
# Check average of glucose from women that don't have zero value of glucose

In [956]:
pima_glucose['Glucose'].mean()

121.6867627785059

In [957]:
# Check minimum and maximum values of glucose from women that don't have zero value of glucose

In [958]:
pima_glucose['Glucose'].min()

44

In [959]:
pima_glucose['Glucose'].max()

199

In [960]:
# As shown above, some women have a glucose value but no insulin value. The result has 763 rows, so 5 women have no glucose information.
# Furthermore, the maximum glucose value is 199, so even the women diagnosed with diabetes have no glucose value at or above the 200 mg/dl cutoff. The dataset would benefit from information about drug therapy.
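
One way to examine this observation is to summarise glucose separately for each outcome. The sketch below uses the glucose and outcome values of the first six rows shown earlier; the `sample` name is illustrative, and the same calls could be run on `pima_glucose`.

```python
import pandas as pd

# Glucose and Outcome of the first six rows displayed in this notebook.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Mean and maximum glucose per outcome group (0 = negative, 1 = positive).
summary = sample.groupby("Outcome")["Glucose"].agg(["mean", "max"])

# Number of rows that reach the 200 mg/dl cutoff.
at_or_above_cutoff = int((sample["Glucose"] >= 200).sum())

print(summary)
print(at_or_above_cutoff)
```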

In [961]:
# Shows women that have information about Blood pressure

pima_BloodPressure = pima.loc[pima['BloodPressure'] != 0]

In [962]:
pima_BloodPressure

Unnamed: 0,Pregnancies,Glucose,Glucose level,BloodPressure,SkinThickness,Insulin,BMI,Nutritional Status,DiabetesPedigreeFunction,Age,Outcome
0,6,148,Normal,72,35,0,33.6,Obese,0.627,50,1
1,1,85,Normal,66,29,0,26.6,Overweight,0.351,31,0
2,8,183,Normal,64,0,0,23.3,Normal,0.672,32,1
3,1,89,Normal,66,23,94,28.1,Overweight,0.167,21,0
4,0,137,Normal,40,35,168,43.1,Obese,2.288,33,1
5,5,116,Normal,74,0,0,25.6,Overweight,0.201,30,0
6,3,78,Normal,50,32,88,31.0,Obese,0.248,26,1
8,2,197,Normal,70,45,543,30.5,Obese,0.158,53,1
9,8,125,Normal,96,0,0,0.0,,0.232,54,1
10,4,110,Normal,92,0,0,37.6,Obese,0.191,30,0


In [963]:
# Check the average of blood pressure (only from women that don't have zero value of Blood Pressure)

pima_BloodPressure['BloodPressure'].mean()

72.40518417462484

In [964]:
# Minimum and maximum of Blood Pressure from women that don't have zero value of Blood Pressure

In [965]:
pima_BloodPressure['BloodPressure'].min()

24

In [966]:
pima_BloodPressure['BloodPressure'].max()

122

In [967]:
# 35 women don't have information about blood pressure, as the result shows 733 rows. 

In [968]:
pima_insulin = pima.loc[pima['Insulin'] != 0]

In [969]:
pima_insulin

Unnamed: 0,Pregnancies,Glucose,Glucose level,BloodPressure,SkinThickness,Insulin,BMI,Nutritional Status,DiabetesPedigreeFunction,Age,Outcome
3,1,89,Normal,66,23,94,28.1,Overweight,0.167,21,0
4,0,137,Normal,40,35,168,43.1,Obese,2.288,33,1
6,3,78,Normal,50,32,88,31.0,Obese,0.248,26,1
8,2,197,Normal,70,45,543,30.5,Obese,0.158,53,1
13,1,189,Normal,60,23,846,30.1,Obese,0.398,59,1
14,5,166,Normal,72,19,175,25.8,Overweight,0.587,51,1
16,0,118,Normal,84,47,230,45.8,Obese,0.551,31,1
18,1,103,Normal,30,38,83,43.3,Obese,0.183,33,0
19,1,115,Normal,70,30,96,34.6,Obese,0.529,32,1
20,3,126,Normal,88,41,235,39.3,Obese,0.704,27,0


In [970]:
# Check average value of insulin from women that don't have zero value of insulin

pima_insulin['Insulin'].mean()

155.5482233502538

In [971]:
# Check minimum and maximum of Insulin value from women that don't have zero value of insulin

In [972]:
pima_insulin['Insulin'].min()

14

In [973]:
pima_insulin['Insulin'].max()

846

In [974]:
# Check women that don't have zero value of BMI

pima_BMI = pima.loc[pima['BMI'] != 0]

In [975]:
pima_BMI

Unnamed: 0,Pregnancies,Glucose,Glucose level,BloodPressure,SkinThickness,Insulin,BMI,Nutritional Status,DiabetesPedigreeFunction,Age,Outcome
0,6,148,Normal,72,35,0,33.6,Obese,0.627,50,1
1,1,85,Normal,66,29,0,26.6,Overweight,0.351,31,0
2,8,183,Normal,64,0,0,23.3,Normal,0.672,32,1
3,1,89,Normal,66,23,94,28.1,Overweight,0.167,21,0
4,0,137,Normal,40,35,168,43.1,Obese,2.288,33,1
5,5,116,Normal,74,0,0,25.6,Overweight,0.201,30,0
6,3,78,Normal,50,32,88,31.0,Obese,0.248,26,1
7,10,115,Normal,0,0,0,35.3,Obese,0.134,29,0
8,2,197,Normal,70,45,543,30.5,Obese,0.158,53,1
10,4,110,Normal,92,0,0,37.6,Obese,0.191,30,0


In [976]:
# Check average of BMI from women that don't have zero value of BMI

pima_BMI['BMI'].mean()

32.45746367239099

In [None]:
# The average value of BMI indicates obesity (BMI >= 30 kg/m2)

In [977]:
# Check minimum and maximum value of BMI from women that don't have zero value of BMI

In [978]:
pima_BMI['BMI'].min()

18.199999999999999

In [979]:
pima_BMI['BMI'].max()

67.099999999999994

In [None]:
# The minimum BMI (18.2) is just below the 18.5 kg/m2 underweight cutoff, so at least one woman is slightly underweight, while the maximum (67.1) shows a case of morbid obesity.
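
The BMI values can be placed into finer classes with `pd.cut`. This is a sketch that assumes the standard WHO adult cutoffs (underweight below 18.5, normal 18.5 to 24.9, overweight 25 to 29.9, then obesity classes I, II and III from 30, 35 and 40); the class labels and the `bmi` series are illustrative and use BMI values that appear in this notebook.

```python
import numpy as np
import pandas as pd

# BMI values taken from this notebook: the minimum, four table rows and the maximum.
bmi = pd.Series([18.2, 23.3, 26.6, 33.6, 43.1, 67.1])

# Assumed WHO adult cutoffs; right=False makes each interval closed on the left, e.g. [25, 30).
bins = [0, 18.5, 25, 30, 35, 40, np.inf]
labels = ["Underweight", "Normal", "Overweight",
          "Obesity class I", "Obesity class II", "Obesity class III"]

classes = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(classes.astype(str).tolist())
```

Zeros would fall into the first interval, so missing BMI values need to be excluded (as in `pima_BMI`) before classifying.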

In [980]:
# Check only the women that have all the values of BMI, Glucose, Insulin and Blood Pressure

pima_all = pima.loc[(pima['BMI'] != 0) & (pima['Insulin'] != 0) & (pima['BloodPressure'] != 0) & (pima['Glucose'] != 0)]
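
The same filter can be written against a list of columns, which is easier to extend if more attributes need checking. A minimal sketch on a hypothetical `sample` frame; `required` and `complete` are illustrative names.

```python
import pandas as pd

# Hypothetical rows (not the real data); zeros stand for missing measurements.
sample = pd.DataFrame({
    "Glucose": [148, 89, 137, 0],
    "BloodPressure": [72, 66, 40, 64],
    "Insulin": [0, 94, 168, 0],
    "BMI": [33.6, 28.1, 43.1, 23.3],
})

required = ["BMI", "Insulin", "BloodPressure", "Glucose"]

# Keep a row only if every required column is non-zero.
complete = sample.loc[(sample[required] != 0).all(axis=1)]
print(complete)
```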

In [981]:
pima_all

Unnamed: 0,Pregnancies,Glucose,Glucose level,BloodPressure,SkinThickness,Insulin,BMI,Nutritional Status,DiabetesPedigreeFunction,Age,Outcome
3,1,89,Normal,66,23,94,28.1,Overweight,0.167,21,0
4,0,137,Normal,40,35,168,43.1,Obese,2.288,33,1
6,3,78,Normal,50,32,88,31.0,Obese,0.248,26,1
8,2,197,Normal,70,45,543,30.5,Obese,0.158,53,1
13,1,189,Normal,60,23,846,30.1,Obese,0.398,59,1
14,5,166,Normal,72,19,175,25.8,Overweight,0.587,51,1
16,0,118,Normal,84,47,230,45.8,Obese,0.551,31,1
18,1,103,Normal,30,38,83,43.3,Obese,0.183,33,0
19,1,115,Normal,70,30,96,34.6,Obese,0.529,32,1
20,3,126,Normal,88,41,235,39.3,Obese,0.704,27,0


In [982]:
# Only 392 women have non-zero values for all four attributes (BMI, glucose, insulin and blood pressure). That is slightly more than half of the sample (around 51% of the 768 women).
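
The share of complete rows follows directly from the two counts reported in this notebook; the variable names below are illustrative.

```python
complete_rows = 392   # rows with non-zero BMI, glucose, insulin and blood pressure
total_rows = 768      # rows in the full dataset

share = complete_rows / total_rows
print(f"{share:.1%}")
```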

In [983]:
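# numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
pima_all.mean(numeric_only=True)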

Pregnancies                   3.301020
Glucose                     122.627551
BloodPressure                70.663265
SkinThickness                29.145408
Insulin                     156.056122
BMI                          33.086224
DiabetesPedigreeFunction      0.523046
Age                          30.864796
Outcome                       0.331633
dtype: float64

## REFERENCES

1. TYNECKI P. Predict diabetes diagnosis for Pima Female Indians with Logistic Regression. Available on: https://www.kaggle.com/ptynecki/pima-indians-diabetes-prediction-with-lr-84
2. SCHULZ LO, CHAUDHARI LS. High-Risk Populations: The Pimas of Arizona and Mexico. Curr Obes Rep. 2015 Mar 1; 4(1): 92–98. Available on: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4418458/
3. FRYAR CD, GU Q, OGDEN CL. Anthropometric reference data for children and adults: United States, 2007–2010. National Center for Health Statistics. Vital Health Stat 11(252). 2012.
4. VAN GAAL L, SCHEEN A. Weight Management in Type 2 Diabetes: Current and Emerging Approaches to Treatment. Diabetes Care. 2015; 38(6): 1161–1172. Available on: http://care.diabetesjournals.org/content/38/6/1161
5. WILDING JPH. The importance of weight management in type 2 diabetes mellitus. Int J Clin Pract. 2014 Jun; 68(6): 682–691. Available on: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4238418/


# END