## Dataset Source
[Diabetes Dataset](https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset/data)   

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to predict diagnostically whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The .csv file contains several independent variables (the medical predictor variables) and a single dependent target variable (`Outcome`).

In [2]:
import pandas as pd

### Preliminary Data Analysis

In [4]:
df = pd.read_csv("diabetes.csv")

In [5]:
df.head()

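|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |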


In [6]:
df.shape

(768, 9)

In [7]:
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [8]:
df.columns #returns an Index object with the names of all columns in the df

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [9]:
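df_columns = df.columns.tolist() # converts df.columns to a plain Python list
features = df_columns # note: an alias, not a copy, so df_columns changes too
features.remove("Outcome") # drop the target, leaving only the predictor columns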
print(features)

['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']


In [10]:
df.isnull().sum() # Missing Values per Column

df.isnull().sum().sum() # Total Missing Values

0

In [11]:
df.duplicated().sum() # number of duplicate rows

0

In [12]:
df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [13]:
for column in features:
    column_range = df[column].max() - df[column].min()
    print(f"{column}: range - {column_range}, min - {df[column].min()}, max - {df[column].max()}")

Pregnancies: range - 17, min - 0, max - 17
Glucose: range - 199, min - 0, max - 199
BloodPressure: range - 122, min - 0, max - 122
SkinThickness: range - 99, min - 0, max - 99
Insulin: range - 846, min - 0, max - 846
BMI: range - 67.1, min - 0.0, max - 67.1
DiabetesPedigreeFunction: range - 2.342, min - 0.078, max - 2.42
Age: range - 60, min - 21, max - 81


In [30]:
values = df['Insulin']

percentile_25 = values.quantile(0.25) # calculates the 25th percentile
percentile_75 = values.quantile(0.75) # calculates the 75th percentile

iqr = percentile_75 - percentile_25

# calculate the upper and lower thresholds
upper = percentile_75 + 1.5 * iqr
lower = percentile_25 - 1.5 * iqr # 1.5 is a value chosen out of convention
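
The threshold arithmetic above can be checked by hand. Using the Insulin quartiles reported by `df.describe()` (25% = 0.0, 75% = 127.25), the IQR is 127.25, so the thresholds work out to 0.0 - 1.5 × 127.25 = -190.875 and 127.25 + 1.5 × 127.25 = 318.125. As a minimal sketch, the same rule can be wrapped in a small helper. The function name `iqr_bounds`, the parameter `k`, and the toy series are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier thresholds using the k * IQR rule.

    Hypothetical helper for illustration; k = 1.5 follows the usual convention.
    """
    q1 = values.quantile(0.25)  # 25th percentile
    q3 = values.quantile(0.75)  # 75th percentile
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Toy data (not taken from diabetes.csv): one value sits far from the rest.
toy = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_bounds(toy)

# Boolean mask marking the values that fall outside the thresholds.
is_outlier = (toy < lower) | (toy > upper)

print(lower, upper)              # -1.0 7.0
print(toy[is_outlier].tolist())  # [100]
```

A boolean mask like `is_outlier` can be summed to count outliers in a column without iterating over rows, which may be convenient when the same check is repeated for every feature.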

In [46]:
outlier_count = 0


for col in features:
    col_values = df[col]
    percentile_25 = col_values.quantile(0.25) # calculates the 25th percentile
    percentile_75 = col_values.quantile(0.75) # calculates the 75th percentile

    iqr = percentile_75 - percentile_25

    # calculate the upper and lower thresholds
    upper = percentile_75 + 1.5 * iqr
    lower = percentile_25 - 1.5 * iqr # 1.5 is chosen out of convention

    # identify outliers
    lower_outliers = df[col_values < lower]
    upper_outliers = df[col_values > upper]

    print(f"{col}: Upper Threshold ({round(upper,5)}), Lower Threshold ({round(lower,5)})")
    
    print(f"Is Lower Outliers Empty: {lower_outliers.empty}")
    print(f"Is Upper Outliers Empty: {upper_outliers.empty}")

    # display the outliers if present
    for row in lower_outliers.itertuples():
        print(row)
        outlier_count += 1

    for row in upper_outliers.itertuples():
        print(row)
        outlier_count += 1

print(outlier_count)

Pregnancies: Upper Threshold (13.5), Lower Threshold (-6.5)
Is Lower Outliers Empty: True
Is Upper Outliers Empty: False
Pandas(Index=88, Pregnancies=15, Glucose=136, BloodPressure=70, SkinThickness=32, Insulin=110, BMI=37.1, DiabetesPedigreeFunction=0.153, Age=43, Outcome=1)
Pandas(Index=159, Pregnancies=17, Glucose=163, BloodPressure=72, SkinThickness=41, Insulin=114, BMI=40.9, DiabetesPedigreeFunction=0.817, Age=47, Outcome=1)
Pandas(Index=298, Pregnancies=14, Glucose=100, BloodPressure=78, SkinThickness=25, Insulin=184, BMI=36.6, DiabetesPedigreeFunction=0.412, Age=46, Outcome=1)
Pandas(Index=455, Pregnancies=14, Glucose=175, BloodPressure=62, SkinThickness=30, Insulin=0, BMI=33.6, DiabetesPedigreeFunction=0.212, Age=38, Outcome=1)
Glucose: Upper Threshold (202.125), Lower Threshold (37.125)
Is Lower Outliers Empty: False
Is Upper Outliers Empty: True
Pandas(Index=75, Pregnancies=1, Glucose=0, BloodPressure=48, SkinThickness=20, Insulin=0, BMI=24.7, DiabetesPedigreeFunction=0.14, A

There are 146 Outliers in the dataset. I need to check if any of the outliers are in the sa