Assignment: Build a Logistic Regression Model for Diabetes Prediction
Instructions:
Objective: The objective of this assignment is to build a model that predicts the likelihood of a patient having diabetes based on certain features.

Dataset: You will use the "diabetes" dataset provided. The dataset contains information about the medical history of patients, including features like Glucose level, Blood Pressure, BMI, etc., and a target variable indicating whether the patient has diabetes (1) or not (0).

Tasks:

Explore the dataset to understand its structure and contents.
Perform any necessary data preprocessing steps such as handling missing values, encoding categorical variables, and scaling numerical features.
Split the dataset into training and testing sets (e.g., 80% training and 20% testing).
Build a Logistic Regression model to predict the likelihood of diabetes based on the features provided.
Evaluate the model using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
Interpret the model coefficients to understand the impact of different features on the likelihood of diabetes.
Deliverables:

Jupyter Notebook (or Python script) containing the code for data preprocessing, model building, and evaluation.
Report summarizing the key findings, model performance metrics, and insights from the model coefficients.
Upload the notebook to Moodle and GitHub and share the link.
Submission:

Submit the Jupyter Notebook (or Python script) and the report summarizing your analysis.
Include any additional insights, visualizations, or improvements you made to the model.
Resources:

You can refer to Python libraries such as Pandas, NumPy, and Scikit-learn for data manipulation, model building, and evaluation.
Feel free to reach out for any clarifications or assistance during the assignment.

In [12]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

In [13]:
# Load the dataset from a CSV file
df = pd.read_csv("datasets_228_482_diabetes.csv")

# Display the initial rows of the dataset to get a glimpse
print("First few rows of the dataset:")
print(df.head())

# Check the structure and summary statistics of the dataset
print("\nDataset Information:")
print(df.info())
print("\nSummary Statistics:")
print(df.describe())

First few rows of the dataset:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 n

The summary statistics show that Glucose, BloodPressure, SkinThickness, Insulin, and BMI have minimum values of 0. A value of 0 is not physiologically plausible for these measurements, so these zeros most likely mark missing or invalid readings.

To handle this, the zeros in these columns are first replaced with NaN so that they are explicitly marked as missing. The NaN values are then filled with the column median, which is more robust to outliers than the mean; physiological measurements often contain a few extreme values. Median imputation does not preserve the shape of the distribution, however: every imputed row receives the same value, so the more zeros a column contained, the more its values cluster at the median and the smaller its variance becomes.

In [14]:
# Columns in which a value of 0 marks a missing measurement
cols_with_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Substitute zero values with NaN in those columns
df[cols_with_missing] = df[cols_with_missing].replace(0, np.nan)

# Fill NaN values with the median of their respective columns
df = df.fillna(df.median())

# Verify if any missing values persist after the imputation
print("Remaining Missing Values:")
print(df.isnull().sum())

Remaining Missing Values:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
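The cell above computes the medians over all 768 rows before the train/test split, so the test rows influence the fill values. A minimal sketch of a leak-free variant follows, assuming scikit-learn's `SimpleImputer` and a small made-up frame (`toy` and its values are illustrative, not the assignment data): the medians are learned from the training rows only and then applied to both splits.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame (illustrative values); 0 stands for "not recorded"
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116, 78, 0, 197, 125],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})
toy = toy.replace(0, np.nan)

# Split first, so the test rows cannot influence the imputation values
train, test = train_test_split(toy, test_size=0.3, random_state=42)

# Learn the medians from the training rows only, then apply them to both splits
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=toy.columns, index=train.index)
test_filled = pd.DataFrame(imputer.transform(test), columns=toy.columns, index=test.index)

print("Medians learned from the training rows:")
print(dict(zip(toy.columns, imputer.statistics_)))
```

With a dataset of this size the difference in fill values is likely small, but fitting preprocessing on the training split alone keeps the test set a fair estimate of performance on unseen patients.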


In [15]:
# Divide the dataset into features and the target variable
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Segment the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
# Standardize features by subtracting the mean and scaling to unit variance
scaler = StandardScaler()

# Transform the training set using the scaler
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing set using the same scaler
X_test_scaled = scaler.transform(X_test)

In [17]:
# Construct a Logistic Regression model with a specified maximum number of iterations
model = LogisticRegression(max_iter=1000)

# Train the model using the standardized training set
model.fit(X_train_scaled, y_train)

In [18]:
# Make predictions on the standardized testing set
y_pred = model.predict(X_test_scaled)

In [19]:
# Assess the model's performance using various metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Display the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC Score:", roc_auc)

Accuracy: 0.7532467532467533
Precision: 0.6666666666666666
Recall: 0.6181818181818182
F1 Score: 0.6415094339622642
ROC AUC Score: 0.7232323232323232


The logistic regression model reaches an accuracy of about 75.32%, meaning it classifies roughly three of every four test patients correctly. Precision is about 66.67%: of the patients the model flags as diabetic, two thirds actually have diabetes. Recall is about 61.82%: the model identifies about 62% of the patients who actually have diabetes and misses the rest.
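The three figures above can be reproduced from confusion-matrix counts. The counts in this sketch are not printed anywhere in the notebook; they are inferred from the reported metrics and the 154-row test set (20% of 768 rows), so treat them as a reconstruction rather than notebook output.

```python
# Confusion-matrix counts inferred from the reported metrics (154 test rows)
tp, fn = 34, 21  # 55 patients who have diabetes
fp, tn = 17, 82  # 99 patients who do not

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that are found

print(f"Accuracy:  {accuracy:.4f}")   # 0.7532
print(f"Precision: {precision:.4f}")  # 0.6667
print(f"Recall:    {recall:.4f}")     # 0.6182
```

Read this way, the recall figure means that 21 of the 55 diabetic patients in the test set were predicted as non-diabetic.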

The F1 score, the harmonic mean of precision and recall, is about 64.15%. The ROC AUC score, which measures how well the model separates positive from negative instances, is about 72.32%; higher is better, and 0.5 corresponds to random guessing. Because roc_auc_score receives the hard 0/1 predictions here rather than predicted probabilities, this value equals the average of recall and specificity (balanced accuracy), not the area under the full ROC curve.
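The evaluation cells in this notebook pass `y_pred` (hard class labels) to `roc_auc_score`. The sketch below illustrates the difference between that call and a probability-based one. It uses synthetic data from `make_classification` as a stand-in for the diabetes file, which is an assumption made so that the block runs on its own; the numbers it prints are therefore not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 numeric features)
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)              # hard 0/1 labels
y_prob = model.predict_proba(X_test_scaled)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_test, y_pred)    # a single operating point
balanced_acc = balanced_accuracy_score(y_test, y_pred)
auc_from_probs = roc_auc_score(y_test, y_prob)     # threshold-free ranking quality

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", balanced_acc)
print("AUC from probabilities:", auc_from_probs)
```

The first two printed values coincide, which is why the probability-based call is usually the one intended when ROC AUC is reported.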

In conclusion, the model performs reasonably across these metrics, but there is room for improvement, especially in recall: it misses about 38% of the patients who have diabetes. In a healthcare setting a missed case can have serious consequences for the patient, so reducing false negatives matters.

Possible improvements include feature engineering (transforming existing features or creating new ones), feature selection with L1 regularization (Lasso) or recursive feature elimination (RFE), and hyperparameter tuning with grid search or randomized search.

To guide the next steps, the logistic regression coefficients are examined. Because the features were standardized, each coefficient is the change in the log-odds of diabetes per one-standard-deviation increase in that feature, which makes the magnitudes comparable across features.

In [20]:
# Retrieve the model coefficients
coefficients = model.coef_[0]

# Construct a DataFrame to showcase feature names and their respective coefficients
coef_df = pd.DataFrame({'Feature': X.columns, 'Coefficient': coefficients})
coef_df.sort_values(by='Coefficient', ascending=False, inplace=True)

# Print the DataFrame
print(coef_df)

                    Feature  Coefficient
1                   Glucose     1.102551
5                       BMI     0.688767
7                       Age     0.392364
0               Pregnancies     0.222844
6  DiabetesPedigreeFunction     0.203586
3             SkinThickness     0.068664
4                   Insulin    -0.138304
2             BloodPressure    -0.151521


The coefficients indicate that Glucose and BMI are the strongest predictors, followed by Age, Pregnancies, and DiabetesPedigreeFunction (a score summarizing family history of diabetes). Higher values of each of these raise the predicted likelihood of diabetes. SkinThickness has the smallest coefficient, and Insulin and BloodPressure have small negative coefficients, so these three features contribute little to the model's predictions.
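Coefficients are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios: because the model was fit on standardized features, each ratio is the factor by which the odds of diabetes change for a one-standard-deviation increase in that feature, with the others held fixed. A short sketch using the coefficient values printed above:

```python
import math

# Coefficients copied from the table above (model fit on standardized features)
coefficients = {
    "Glucose": 1.102551,
    "BMI": 0.688767,
    "Age": 0.392364,
    "Pregnancies": 0.222844,
    "DiabetesPedigreeFunction": 0.203586,
    "SkinThickness": 0.068664,
    "Insulin": -0.138304,
    "BloodPressure": -0.151521,
}

# exp(coefficient) = multiplicative change in the odds per +1 standard deviation
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<26} {ratio:.2f}")
```

A one-standard-deviation rise in Glucose roughly triples the odds of a positive prediction, while ratios below 1 (Insulin, BloodPressure) indicate slightly reduced odds.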

In [21]:
# Select features whose absolute coefficient exceeds 0.2
# (a magnitude threshold, not a test of statistical significance)
important_features = coef_df[coef_df['Coefficient'].abs() > 0.2]['Feature'].tolist()
print("Significant Features:", important_features)

# Generate pairwise interaction terms between the selected features
for feature in important_features:
    # Example: Generating polynomial features
    #df[feature + '_squared'] = df[feature] ** 2
    # Example: Generating interaction terms with other significant features
    # Note: each pair is visited in both orders, so A_B and B_A hold identical values
    for other_feature in important_features:
        if other_feature != feature:
            df[feature + '_' + other_feature] = df[feature] * df[other_feature]

# Verify if new features have been successfully incorporated
print(df.head())

Significant Features: ['Glucose', 'BMI', 'Age', 'Pregnancies', 'DiabetesPedigreeFunction']
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6    148.0           72.0           35.0    125.0  33.6   
1            1     85.0           66.0           29.0    125.0  26.6   
2            8    183.0           64.0           29.0    125.0  23.3   
3            1     89.0           66.0           23.0     94.0  28.1   
4            0    137.0           40.0           35.0    168.0  43.1   

   DiabetesPedigreeFunction  Age  Outcome  Glucose_BMI  ...  Age_Pregnancies  \
0                     0.627   50        1       4972.8  ...              300   
1                     0.351   31        0       2261.0  ...               31   
2                     0.672   32        1       4263.9  ...              256   
3                     0.167   21        0       2500.9  ...               21   
4                     2.288   33        1       5904.7  ...                0

In [22]:
# Divide the dataset into features and the target variable
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Segment the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [23]:
# Standardize features by centering on the mean and scaling to unit variance
scaler = StandardScaler()

# Transform the training set using the scaler
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing set using the same scaler
X_test_scaled = scaler.transform(X_test)

In [24]:
# Construct a Logistic Regression model with a specified maximum number of iterations
model = LogisticRegression(max_iter=1000)

# Train the model using the standardized training set
model.fit(X_train_scaled, y_train)

In [14]:
# Make predictions on the standardized testing set
y_pred = model.predict(X_test_scaled)

In [15]:
# Assess the model's performance using various metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Display the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC Score:", roc_auc)

Accuracy: 0.7532467532467533
Precision: 0.660377358490566
Recall: 0.6363636363636364
F1 Score: 0.6481481481481481
ROC AUC Score: 0.7272727272727272


After feature engineering, performance changed only marginally relative to the initial model. Accuracy was unchanged at 75.32%. Recall rose from 61.82% to 63.64%, the F1 score from 64.15% to 64.81%, and the ROC AUC score from 72.32% to 72.73%, while precision fell slightly from 66.67% to 66.04%. On the 154-row test set these changes correspond to one additional true positive and one additional false positive, so the interaction terms did not meaningfully improve the model's ability to predict diabetes. Other approaches need to be explored to achieve a more significant improvement.
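One detail of the interaction loop in cell In [21] is worth noting: it visits every pair of features in both orders, so each product is stored twice under two names (for example Glucose_BMI and BMI_Glucose). Duplicate columns add no information to a logistic regression. A minimal sketch of a variant that builds each pair once, assuming `itertools.combinations` and the five rows shown in the output above:

```python
from itertools import combinations

import pandas as pd

# First five rows of the selected columns, copied from the output above
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
    "Pregnancies": [6, 1, 8, 1, 0],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
})
important_features = list(df.columns)

# combinations() yields each unordered pair exactly once
for first, second in combinations(important_features, 2):
    df[first + "_" + second] = df[first] * df[second]

interaction_columns = [c for c in df.columns if c not in important_features]
print(len(interaction_columns), "interaction columns")  # 10 for five features
print(df[["Glucose_BMI", "Age_Pregnancies"]])
```

Five selected features give 10 interaction columns this way, compared with 20 from the original loop.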

In [25]:
# Import necessary libraries for model selection and logistic regression
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Instantiate the logistic regression model with a specified maximum number of iterations
logreg = LogisticRegression(max_iter=1000)

# Define the hyperparameter grid for grid search
param_grid = {
    'penalty': ['l2'],  # Regularization penalty
    'C': [0.001, 0.01, 0.1, 1, 10, 100]  # Inverse regularization strength
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the grid search to the training data
grid_search.fit(X_train_scaled, y_train)

# Retrieve the best hyperparameters determined by the grid search
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Retrieve the best cross-validation score achieved during grid search
best_score = grid_search.best_score_
print("Best Cross-validation Score:", best_score)

# Retrieve the best model based on cross-validation performance
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Print evaluation metrics for the best model
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC Score:", roc_auc)

Best Hyperparameters: {'C': 0.1, 'penalty': 'l2'}
Best Cross-validation Score: 0.7671331467413035
Accuracy: 0.7597402597402597
Precision: 0.6730769230769231
Recall: 0.6363636363636364
F1 Score: 0.6542056074766355
ROC AUC Score: 0.7323232323232323


After tuning the regularization strength with grid search, the performance metrics improved slightly over the preceding model. Accuracy rose from 75.32% to 75.97% and precision from 66.04% to 67.31%. Recall stayed at 63.64%, the F1 score rose from 64.81% to 65.42%, and the ROC AUC score from 72.73% to 73.23%. On the 154-row test set this amounts to one fewer false positive.

In summary, grid search selected C = 0.1, a stronger regularization than the default C = 1, but the resulting improvements in performance metrics were modest.
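The grid search above runs on features that were scaled once using the whole training set, so each cross-validation fold sees scaling statistics computed partly from its own validation rows. A sketch of one way to avoid this is to put the scaler and the model in a scikit-learn `Pipeline` and search over that. Synthetic data stands in for the diabetes file here, and scoring by ROC AUC instead of accuracy is a choice made for this sketch, not something the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 numeric features)
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The scaler is refit inside every cross-validation fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Step name + double underscore addresses a parameter of that pipeline step
param_grid = {"logreg__C": [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)  # unscaled input; the pipeline handles scaling

best_c = search.best_params_["logreg__C"]
test_auc = search.score(X_test, y_test)  # ROC AUC on the held-out rows

print("Best C:", best_c)
print("Cross-validated ROC AUC:", search.best_score_)
print("Test ROC AUC:", test_auc)
```

Because `scoring="roc_auc"` ranks predicted probabilities, the search is not tied to the default 0.5 decision threshold.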

In [26]:
# Import the pickle module for serialization
import pickle

# Store the trained model in a pickle file
with open('diabetes_lr_model.pkl', 'wb') as model_file:
    pickle.dump(best_model, model_file)

# Save the scaler to a pickle file
with open('diabetes_scaler.pkl', 'wb') as scaler_file:
    pickle.dump(scaler, scaler_file)
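The saved scaler was fit on the feature matrix that includes the interaction columns, so any new input must contain the same columns in the same order before it is scaled and passed to the model. One way to reduce the risk of a mismatch between the two files is to store the scaler and the model together as a single pipeline. The sketch below shows a save-and-reload round trip under that assumption; the file name `diabetes_pipeline.pkl` and the synthetic data are hypothetical and used only so that the block runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=42)

# Scaler and model travel together, so they cannot get out of sync
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, C=0.1))
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "diabetes_pipeline.pkl")  # hypothetical file name

    # Serialize the fitted pipeline
    with open(path, "wb") as model_file:
        pickle.dump(pipeline, model_file)

    # Reload it, as a separate prediction script would
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored object should reproduce the original predictions exactly
same_predictions = np.array_equal(pipeline.predict(X), restored.predict(X))
print("Restored pipeline reproduces predictions:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.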