In [1]:
"""
Objective: The objective of this assignment is to build a predictive model to predict the likelihood of a patient having diabetes based on certain features.

Dataset: You will use the "diabetes" dataset provided. The dataset contains information about the medical history of patients, including features like Glucose level, Blood Pressure, BMI, etc., and a target variable indicating whether the patient has diabetes (1) or not (0).

Tasks:

Explore the dataset to understand its structure and contents.
Perform any necessary data preprocessing steps such as handling missing values, encoding categorical variables, and scaling numerical features.
Split the dataset into training and testing sets (e.g., 80% training and 20% testing).
Build a Logistic Regression model to predict the likelihood of diabetes based on the features provided.
Evaluate the model using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
Interpret the model coefficients to understand the impact of different features on the likelihood of diabetes.

Deliverables:

Jupyter Notebook (or Python script) containing the code for data preprocessing, model building, and evaluation.
Report summarizing the key findings, model performance metrics, and insights from the model coefficients.
Upload the notebook to Moodle and GitHub and share the link.

Submission:

Submit the Jupyter Notebook (or Python script) and the report summarizing your analysis.
Include any additional insights, visualizations, or improvements you made to the model.

Resources:

You can refer to Python libraries such as Pandas, NumPy, Scikit-learn for data manipulation, model building, and evaluation.
Feel free to reach out for any clarifications or assistance during the assignment.

Deadline: Complete the assignment and submit it within [specified deadline].
"""


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Classification metrics (the regression metrics and housing dataset imported before were unrelated to this task)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

Explore the dataset to understand its structure and contents.

In [3]:
import pandas as pd

diabetes_data = pd.read_csv("datasets_228_482_diabetes.csv")
print(diabetes_data.head())


   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


In [4]:
print(diabetes_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None


In [5]:
print(diabetes_data.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000000                  

Perform any necessary data preprocessing steps such as handling missing values, encoding categorical variables, and scaling numerical features.

In [6]:
diabetes_data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
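
The all-zero counts above do not mean the data is complete: describe() shows minimum values of 0 for columns such as Glucose and BloodPressure, where a zero reading is not plausible. A minimal sketch of how such placeholder zeros can be counted and exposed to isnull(), using only the five rows printed by head() earlier (the names sample, zero_counts, cleaned and nan_counts are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# First five rows of the affected columns, copied from the head() output.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count zeros per column: these act as placeholder "missing" values.
zero_counts = (sample == 0).sum()

# Convert zeros to NaN so that isnull() can detect them.
cleaned = sample.replace(0, np.nan)
nan_counts = cleaned.isnull().sum()

print(zero_counts)
print(nan_counts)
```

Running the same count on the full diabetes_data frame before the replacement step would show how much of each column is affected.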

In [7]:
# Zeros in these columns are physiologically implausible, so treat them as missing values
zero_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data[zero_features] = diabetes_data[zero_features].replace(0, np.nan)
# Fill each missing value with its column mean (computed on the full dataset, before the split)
diabetes_data = diabetes_data.fillna(diabetes_data.mean())


# Separating features and dependent variable
X = diabetes_data.drop(columns='Outcome')
y = diabetes_data['Outcome']
# Scaling is applied after the train/test split so that the scaler is fit on the training data only

Split the dataset into training and testing sets (e.g., 80% training and 20% testing).

In [8]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=38)

In [9]:
scaler = StandardScaler()
XTr_scaled = scaler.fit_transform(X_train)
XTe_scaled = scaler.transform(X_test)
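
The scaler above is fit on the training split only, which keeps test-set information out of preprocessing. The same idea can be extended to imputation by chaining the steps in a scikit-learn Pipeline. The sketch below is one possible arrangement, not the notebook's method: it runs on synthetic data (X_demo, y_demo are made up for illustration) and additionally stratifies the split, which the notebook does not do.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, used so the example is self-contained).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out about 10% of the values

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=38, stratify=y_demo
)

# Imputer means and scaler statistics are learned from the training split only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

With the real data, the zero-to-NaN replacement would come before the split and the pipeline would be fit on X_train and y_train.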

Build a Logistic Regression model to predict the likelihood of diabetes based on the features provided.

In [10]:
from sklearn.linear_model import LogisticRegression

# Initialize the model
log_reg_model = LogisticRegression()

# Train the model
log_reg_model.fit(XTr_scaled, y_train)


Evaluate the model using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Evaluate the model
y_pred = log_reg_model.predict(XTe_scaled)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
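# Note: roc_auc_score receives hard class labels here, so the result equals balanced accuracy.
# Passing log_reg_model.predict_proba(XTe_scaled)[:, 1] instead would give the usual threshold-free ROC AUC.
roc_auc = roc_auc_score(y_test, y_pred)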
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC Score:", roc_auc)

Accuracy: 0.7857142857142857
Precision: 0.7631578947368421
Recall: 0.5471698113207547
F1 Score: 0.6373626373626373
ROC AUC Score: 0.7290304502148329
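
The ROC AUC above was computed from predicted class labels rather than predicted probabilities. With hard labels the ROC curve has a single operating point, and the area reduces to the mean of the true-positive and true-negative rates. The sketch below illustrates the difference on a small made-up example (y_true and y_score are invented values, not notebook results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.45, 0.7, 0.9])

# Hard labels at the default 0.5 threshold, as predict() would return.
y_label = (y_score >= 0.5).astype(int)

# AUC from probabilities uses the full ranking of the scores.
auc_from_scores = roc_auc_score(y_true, y_score)

# AUC from hard labels collapses to balanced accuracy.
auc_from_labels = roc_auc_score(y_true, y_label)
tn, fp, fn, tp = confusion_matrix(y_true, y_label).ravel()
balanced_acc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))

print("AUC from scores:", auc_from_scores)   # 0.875
print("AUC from labels:", auc_from_labels)   # 0.625
print("Balanced accuracy:", balanced_acc)    # 0.625
```

In the notebook, the probability scores would come from log_reg_model.predict_proba(XTe_scaled)[:, 1].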


Interpret the model coefficients to understand the impact of different features on the likelihood of diabetes.

In [13]:
# Get feature coefficients
feature_names = X.columns
coefficients = log_reg_model.coef_[0]

# Create a dataframe to display coefficients
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
print(coefficients_df)


                    Feature  Coefficient
0               Pregnancies     0.431803
1                   Glucose     1.200129
2             BloodPressure    -0.209639
3             SkinThickness     0.008732
4                   Insulin    -0.092984
5                       BMI     0.585371
6  DiabetesPedigreeFunction     0.213775
7                       Age     0.170007


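Because the features were standardized before fitting, each coefficient is the change in the log-odds of diabetes per one-standard-deviation increase in that feature, with the other features held fixed. The magnitudes can therefore be compared directly.

Glucose: The largest coefficient (1.200129) makes glucose the strongest predictor in this model: higher glucose levels are associated with a markedly higher predicted likelihood of diabetes.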

BMI (Body Mass Index): With the second-largest coefficient (0.585371), higher BMI values are associated with a higher predicted likelihood of diabetes.

Pregnancies: The coefficient (0.431803) indicates a positive association between the number of pregnancies and the likelihood of diabetes.

Age: The positive coefficient (0.170007) indicates that, with the other features held fixed, older patients have a somewhat higher predicted likelihood of diabetes.

Diabetes Pedigree Function: The coefficient (0.213775) indicates a modest positive association between the diabetes pedigree function and the likelihood of diabetes.

Skin Thickness: The coefficient (0.008732) is close to zero, so skin thickness contributes almost nothing to the prediction once the other features are accounted for.

Blood Pressure: The coefficient (-0.209639) is negative, so, with the other features held fixed, higher blood pressure is associated with a slightly lower predicted likelihood of diabetes in this model. This is a conditional association within the fitted model, not evidence that high blood pressure is protective.

Insulin: The coefficient (-0.092984) is also negative but small. At least a quarter of the insulin readings were zero (the 25th percentile is 0) and were replaced with the column mean, which may weaken or distort this estimate.
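
Coefficients on the log-odds scale can be easier to read as odds ratios, obtained by exponentiating each one. The sketch below applies this to the coefficients printed above; it assumes, as in this notebook, that the features were standardized, so each ratio refers to a one-standard-deviation increase. The names coef, odds_ratio and ranked are illustrative.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table printed by the notebook.
coef = pd.Series({
    "Pregnancies": 0.431803,
    "Glucose": 1.200129,
    "BloodPressure": -0.209639,
    "SkinThickness": 0.008732,
    "Insulin": -0.092984,
    "BMI": 0.585371,
    "DiabetesPedigreeFunction": 0.213775,
    "Age": 0.170007,
})

# exp(coefficient) is the multiplicative change in the odds of diabetes
# for a one-standard-deviation increase in the feature.
odds_ratio = np.exp(coef)

# Rank features by absolute coefficient size, largest first.
order = coef.abs().sort_values(ascending=False).index
ranked = pd.DataFrame({"Coefficient": coef, "OddsRatio": odds_ratio}).reindex(order)

print(ranked.round(3))
```

For example, the glucose coefficient corresponds to an odds ratio of about 3.3, while ratios below 1 (blood pressure, insulin) correspond to the negative coefficients.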