This script imports essential libraries for data handling, preprocessing, model training, evaluation, and saving in a machine learning workflow.
It supports building a Logistic Regression classification model, scaling features, evaluating performance, and storing the trained model for later use.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import joblib

This code loads the diabetes dataset from a CSV file into a Pandas DataFrame and confirms successful loading.
It then displays the first 5 rows of the dataset to quickly inspect its structure and contents.

I selected the Pima Indians Diabetes dataset because it contains important health-related features such as glucose levels, blood pressure, BMI, and age, which are key factors in predicting diabetes. This dataset is widely used in medical machine learning research and is publicly available, making it well suited to training and testing classification models. Note that the classes are not balanced: in the standard release, 500 of the 768 records (about 65%) are negative and 268 (about 35%) are positive, which matters when interpreting accuracy and recall.

In [2]:
data = pd.read_csv('diabetes.csv')
print("Dataset loaded successfully")
print(data.head())  

Dataset loaded successfully
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
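Class proportions can be checked directly once the data is loaded. The sketch below uses a small made-up `Outcome` column (an assumption for illustration, not rows from diabetes.csv); on the notebook's DataFrame the equivalent call would be `data['Outcome'].value_counts(normalize=True)`.

```python
import pandas as pd

# Toy labels for illustration only; these are not rows from diabetes.csv.
toy = pd.DataFrame({"Outcome": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]})

# Absolute counts and relative shares per class.
counts = toy["Outcome"].value_counts()
shares = toy["Outcome"].value_counts(normalize=True)

print(counts)
print(shares)
```

If one class clearly dominates, accuracy alone can look good even when the minority class is predicted poorly, so precision and recall deserve a closer look.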


This code treats invalid zero values in selected medical features as missing data by replacing them with NaN.
It prepares the dataset for proper cleaning and imputation before model training.

In [3]:
cols_with_zero_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[cols_with_zero_missing] = data[cols_with_zero_missing].replace(0, np.nan)
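To see what this replacement does, here is a minimal sketch on a toy DataFrame (the values are made up for illustration). It shows that zeros are converted to NaN only in the listed columns, while a column such as `Pregnancies`, where zero is a legitimate value, is left alone.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; not taken from diabetes.csv.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 0],   # zero is a valid count here
    "Glucose": [148, 0, 85],    # zero is not a plausible measurement
    "BMI": [33.6, 26.6, 0.0],   # zero is not a plausible measurement
})

# Flag zeros as missing only in the columns where zero cannot be real.
cols = ["Glucose", "BMI"]
toy[cols] = toy[cols].replace(0, np.nan)

# Number of missing values per column after the replacement.
missing = toy.isna().sum()
print(missing)
```

Running `data.isna().sum()` right after this cell in the notebook would show how many values each column actually needs imputed.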

This code fills missing values with the column mean and removes duplicate records to clean the dataset.
It then verifies that no missing values remain, ensuring the data is ready for modeling.

In [4]:
data.fillna(data.mean(), inplace=True)
data.drop_duplicates(inplace=True)
print("Missing values after cleaning:")
print(data.isnull().sum())  # Should be 0 now

Missing values after cleaning:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
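The two cleaning steps can be traced by hand on a tiny example. This sketch uses a made-up two-column frame (an assumption for illustration) and the non-inplace forms of the same pandas calls, so each intermediate result stays visible.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Glucose": [100.0, np.nan, 140.0, 100.0],
    "BMI": [30.0, 25.0, np.nan, 30.0],
})

# Column means skip NaN, so each mean uses the three observed values.
means = toy.mean()
filled = toy.fillna(means)

# Rows 0 and 3 are identical, so drop_duplicates keeps only the first.
deduped = filled.drop_duplicates()

print(means)
print(deduped)
```

For strongly skewed columns, `toy.median()` could be passed to `fillna` instead of the mean; that is an alternative, not what the notebook does.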


This code calculates the first quartile (Q1) and third quartile (Q3) for each numerical feature in the dataset.
The Interquartile Range (IQR) is then computed to measure data spread and help identify potential outliers.

In [5]:
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

This code removes rows that contain outliers based on the IQR method: a row is dropped if any of its values lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR for that column.
After cleaning, it prints the new dataset shape to show how many records remain.

In [6]:
data = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
print(f"Dataset shape after outlier removal: {data.shape}")

Dataset shape after outlier removal: (515, 9)
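The IQR rule is easier to follow on a single column. The sketch below applies the same bounds to a short made-up series (an assumption for illustration) in which one value is clearly extreme.

```python
import pandas as pd

# Toy values for illustration only; 95 is the obvious outlier.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1 = values.quantile(0.25)   # 12.75 with pandas' default linear interpolation
q3 = values.quantile(0.75)   # 16.5
iqr = q3 - q1                # 3.75

lower = q1 - 1.5 * iqr       # 7.125
upper = q3 + 1.5 * iqr       # 22.125

# Keep values inside the bounds, matching the strict < and > used for removal.
kept = values[(values >= lower) & (values <= upper)]
print(kept.tolist())
```

In the notebook the same test runs on every column at once, and a single out-of-range value is enough to drop the whole row, which is why the row count falls to 515.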


This code separates the dataset into features (X) by removing the target column and labels (y) containing the prediction outcome.

In [7]:
X = data.drop('Outcome', axis=1)
y = data['Outcome']

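This code standardizes the feature data so that each variable has a mean of 0 and a standard deviation of 1. Logistic Regression applies L2 regularization by default, which penalizes all coefficients on the same scale, so standardized features are treated evenly and the solver converges more reliably. Note that the scaler is fit here on the full dataset before the train/test split, so test-set statistics influence the scaling; fitting the scaler on the training split only would avoid this leakage.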

In [8]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
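Standardization is just z = (x - mean) / std applied per column. The following sketch checks that formula against `StandardScaler` on a small made-up array (an assumption for illustration); note that the scaler uses the population standard deviation, which is NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix for illustration only: 3 rows, 2 columns.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

scaler_toy = StandardScaler()
X_std = scaler_toy.fit_transform(X_toy)

# Manual z-score per column using the population standard deviation.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_std, manual))
```

The fitted means and standard deviations are stored in `scaler_toy.mean_` and `scaler_toy.scale_`, which is what allows the same transformation to be reapplied to new data later.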

This line splits the dataset into training and testing sets, using 80% of the data for training and 20% for testing, with a fixed random state to ensure reproducible results.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
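One way to keep test data out of the scaling step is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. This is a sketch of that alternative, not the notebook's method; it runs on synthetic data from `make_classification` (an assumption standing in for the diabetes features) and also passes `stratify` so both splits keep the same class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 features, roughly 65/35 class split.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.65, 0.35], random_state=42
)

# Split BEFORE scaling; stratify keeps class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses it for X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
scaler_means_match_train = np.allclose(
    pipe.named_steps["standardscaler"].mean_, X_tr.mean(axis=0)
)
print(test_accuracy, scaler_means_match_train)
```

A side benefit is that the whole pipeline can be saved as a single object, so the scaler and model cannot get out of sync.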

This code creates a Logistic Regression model and trains it using the training data (X_train and y_train) to learn the relationship between features and the target variable.

In [10]:
model = LogisticRegression()
model.fit(X_train, y_train)
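What the fitted model has learned is one coefficient per feature plus an intercept; the predicted probability is the sigmoid of their linear combination. The sketch below verifies this on synthetic data (an assumption for illustration) by rebuilding `predict_proba` from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid maps it into (0, 1).
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
p_sklearn = clf.predict_proba(X_demo)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```

Because the notebook's features are standardized, the magnitudes in `model.coef_` can be compared across features as a rough indication of influence.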


This code uses the trained model to predict the target values for the test data (X_test). It then evaluates the model’s performance by calculating and printing key metrics: Accuracy, Precision, Recall, and F1-score, which show how well the model predicts diabetes cases.

In [11]:
y_pred = model.predict(X_test)

print("Model Performance on Test Set:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))

Model Performance on Test Set:
Accuracy: 0.7475728155339806
Precision: 0.8181818181818182
Recall: 0.45
F1-score: 0.5806451612903226
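The four printed numbers can be cross-checked against confusion-matrix counts. The counts below are inferred from the printed metrics and the 103 test rows (20% of 515); they are an assumption, since the notebook never prints a confusion matrix.

```python
# Counts inferred from the printed metrics (assumption, not notebook output).
tp, fp, fn, tn = 18, 4, 22, 59

total = tp + fp + fn + tn              # 103 test rows
accuracy = (tp + tn) / total           # correct predictions over all rows
precision = tp / (tp + fp)             # correct among predicted positives
recall = tp / (tp + fn)                # found among actual positives
f1 = 2 * precision * recall / (precision + recall)

print(total, accuracy, precision, recall, f1)
```

Under these counts the model flags only 22 test patients as positive and misses 22 of the 40 actual positives. Calling `sklearn.metrics.confusion_matrix(y_test, y_pred)` in the notebook would confirm or correct the inferred counts.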


Here's the interpretation of the model performance metrics:

Accuracy (about 74.76%)
This means that, overall, the model correctly predicts whether a patient has diabetes about 75% of the time. It is a general measure of how often the model is right, but it does not distinguish between the two kinds of error.

Precision (about 81.82%)
Of all the cases the model predicted as "diabetes positive," around 82% were actually positive. This means the model raises few false alarms (false positives).

Recall (about 45%)
Out of all the actual diabetes cases, the model correctly identified only 45%. This is low: the model misses 55% of the true diabetes cases (false negatives), which is the more costly kind of error in a screening setting.

F1-score (about 58.06%)
The F1-score is the harmonic mean of precision and recall. Because recall is low, the F1-score is pulled well below the precision, showing that while the model is precise, it needs to catch more of the true diabetes cases.
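One common way to trade precision for recall is to lower the decision threshold applied to the predicted probabilities instead of using the default 0.5. The sketch below illustrates the effect on synthetic data (an assumption standing in for the diabetes features); the threshold 0.3 is an arbitrary illustrative value, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a roughly 65/35 class split.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.65, 0.35], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

clf = LogisticRegression().fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

# Compare the default threshold with a lower, illustrative one.
results = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only add predicted positives, so recall never decreases, usually at some cost in precision. Passing `class_weight='balanced'` to `LogisticRegression` is another option worth comparing.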

This code saves the trained model and the fitted scaler to files named with the registration number (25RP20515_model.joblib and 25RP20515_scaler.joblib). Both can then be reloaded later without retraining; the scaler is saved alongside the model because new inputs must be transformed with the same fitted scaler before prediction. It then prints a confirmation message.

In [12]:
joblib.dump(model, "25RP20515_model.joblib")
joblib.dump(scaler, "25RP20515_scaler.joblib")

print("Model and scaler saved successfully")

Model and scaler saved successfully
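Reusing the saved artifacts means loading both objects and applying the scaler before the model. The sketch below shows the round trip on a small stand-in model trained on random data, written to a temporary directory; the file names `demo_model.joblib` and `demo_scaler.joblib` are hypothetical and chosen so the notebook's real files are not touched.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in training data for illustration only (8 features, like the notebook).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = (X_demo[:, 1] > 0).astype(int)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression().fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "demo_model.joblib")    # hypothetical name
    scaler_path = os.path.join(tmp, "demo_scaler.joblib")  # hypothetical name
    joblib.dump(model_demo, model_path)
    joblib.dump(scaler_demo, scaler_path)

    # Later, or in another process: load both objects back.
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows must pass through the loaded scaler before prediction.
new_rows = rng.normal(size=(5, 8))
pred_loaded = loaded_model.predict(loaded_scaler.transform(new_rows))
pred_original = model_demo.predict(scaler_demo.transform(new_rows))
print(pred_loaded)
```

New rows must have the same eight feature columns in the same order as the training data, and joblib files are best loaded with the same scikit-learn version that wrote them.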
