This project focuses on a classification task: predicting diabetes using the Pima Indians Diabetes Dataset. The goal is to accurately identify individuals who are likely to have diabetes, aiding early diagnosis and preventive healthcare.
Hugging Face Link: Click
- Source: Pima Indians Diabetes Dataset
- Features: Numeric and categorical features including:
- Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
- Target: Outcome (0 = non-diabetic, 1 = diabetic)
- Total samples: ~768
- Imbalance: Fewer positive cases (~268) than negative (~500)
- Missing and zero values handled with median imputation.
- Outliers in numeric features were clipped (1%–99% quantiles).
- Numeric features scaled using StandardScaler.
- Categorical features encoded using OneHotEncoder.
- Pipelines used to integrate preprocessing and ensure reproducibility.
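The preprocessing steps listed above could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: the `QuantileClipper` class, the `zeros_to_nan` helper, the choice of which columns treat zero as missing, and the random demo data are all assumptions made for illustration. The OneHotEncoder branch is left out because every feature listed above is numeric.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Assumption: a zero in these columns is physiologically implausible,
# so it is treated as a missing value.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
OTHER_NUMERIC = ["Pregnancies", "DiabetesPedigreeFunction", "Age"]


def zeros_to_nan(X):
    """Hypothetical helper: mark zeros as NaN so the imputer can fill them."""
    X = np.asarray(X, dtype=float).copy()
    X[X == 0] = np.nan
    return X


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to quantiles learned in fit."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.low_ = np.nanquantile(X, self.lower, axis=0)
        self.high_ = np.nanquantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)


def numeric_steps(treat_zero_as_missing):
    """Build impute -> clip -> scale, optionally preceded by zero masking."""
    steps = []
    if treat_zero_as_missing:
        steps.append(("zeros_to_nan", FunctionTransformer(zeros_to_nan)))
    steps += [
        ("impute", SimpleImputer(strategy="median")),
        ("clip", QuantileClipper(lower=0.01, upper=0.99)),
        ("scale", StandardScaler()),
    ]
    return Pipeline(steps)


preprocessor = ColumnTransformer([
    ("zero_missing", numeric_steps(True), ZERO_AS_MISSING),
    ("numeric", numeric_steps(False), OTHER_NUMERIC),
])

# Random demo data standing in for the real dataset.
rng = np.random.default_rng(0)
n = 50
demo = pd.DataFrame({
    "Pregnancies": rng.integers(0, 10, n),
    "Glucose": rng.integers(70, 200, n),
    "BloodPressure": rng.integers(50, 110, n),
    "SkinThickness": rng.integers(10, 50, n),
    "Insulin": rng.integers(15, 300, n),
    "BMI": rng.uniform(18.0, 45.0, n),
    "DiabetesPedigreeFunction": rng.uniform(0.1, 2.0, n),
    "Age": rng.integers(21, 80, n),
})
demo.loc[0, "Insulin"] = 0  # simulate a zero that stands for "missing"

X_t = preprocessor.fit_transform(demo)
print(X_t.shape)
```

Because the clipping bounds and medians are learned in `fit`, fitting the preprocessor on the training split only keeps test-set statistics from leaking into the model.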
- Primary model: Logistic Regression
- Alternative models evaluated: SVC, KNN
- Evaluation metric: Recall for positive class (Outcome=1) prioritized due to medical significance.
- Pipeline built from a ColumnTransformer preprocessing step followed by the classifier.
- Models trained on a stratified train-test split so that both sets preserve the class ratio.
- Cross-validation (5-fold) applied to assess model robustness.
- Metrics recorded: Accuracy, Precision, Recall, F1-score (positive class).
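The training and evaluation procedure described above might look like the sketch below. It runs on synthetic data from `make_classification` (roughly 35% positives) rather than the real dataset, and the 80/20 split, `random_state=42`, and `max_iter=1000` are illustrative assumptions, not values taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset: 768 rows, 8 features, imbalanced classes.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)

# The stratified split keeps the positive-class share equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# 5-fold cross-validation on the training split, scored on positive-class recall.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_recall = cross_val_score(model, X_train, y_train, cv=cv, scoring="recall")

# Final fit and held-out metrics for the positive class (label 1).
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, pos_label=1),
    "recall": recall_score(y_test, y_pred, pos_label=1),
    "f1": f1_score(y_test, y_pred, pos_label=1),
}

print("CV recall per fold:", cv_recall.round(3))
print({name: round(value, 3) for name, value in metrics.items()})
```

Swapping `LogisticRegression` for `SVC` or `KNeighborsClassifier` in the `clf` step would allow the same comparison across the alternative models.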
- Logistic Regression with class_weight="balanced" achieved highest recall for diabetic cases.
- KNN showed higher F1 but lower recall.
- Weighted metrics can be misleading due to dataset imbalance; positive-class recall is the primary metric.
- Final pipeline saved as a .pkl file.
- Front-end can pass input as a dictionary or DataFrame.
- Column order flexibility maintained; column names must match pipeline requirements.
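As a rough illustration of the save-and-predict flow described above, the sketch below serializes a small pipeline with joblib, reloads it, and predicts from a dictionary whose keys arrive in a different order. The file name `diabetes_pipeline.pkl`, the `predict_one` helper, the stand-in pipeline, and the random training data are all hypothetical. The real project may rely on its ColumnTransformer to select columns by name, whereas this sketch reorders them explicitly.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

# Fit a small stand-in pipeline on random data (not the real dataset).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, len(FEATURES))), columns=FEATURES)
y = (X["Glucose"] > X["Glucose"].median()).astype(int)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced")),
]).fit(X, y)


def predict_one(model, record):
    """Hypothetical helper: predict the Outcome (0 or 1) for one input dict.

    Columns are reordered by name, so the key order in `record` does not
    matter; a missing key raises a KeyError.
    """
    frame = pd.DataFrame([record])[FEATURES]
    return int(model.predict(frame)[0])


# Save and reload the pipeline; the path here is only an example.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "diabetes_pipeline.pkl")
    joblib.dump(pipeline, path)
    loaded = joblib.load(path)

record = {
    "Pregnancies": 2, "Glucose": 1.5, "BloodPressure": 0.1, "SkinThickness": -0.3,
    "Insulin": 0.4, "BMI": 0.8, "DiabetesPedigreeFunction": -0.2, "Age": 0.5,
}
shuffled = dict(reversed(list(record.items())))  # same values, reversed key order

pred_ordered = predict_one(loaded, record)
pred_shuffled = predict_one(loaded, shuffled)
print(pred_ordered, pred_shuffled)
```

A file written with joblib or pickle should be loaded with the same scikit-learn version used to save it, and only from a trusted source.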
- Python 3.x
- Pandas, NumPy
- scikit-learn
- Matplotlib / Seaborn (EDA & visualization)
- Joblib / Pickle (pipeline serialization)
This guide explains how to set up the Python environment and run the project.
Create a new virtual environment for this project:

    python -m venv aiml_env

Activate the newly created environment (Windows):

    aiml_env\Scripts\activate

Install all required packages from requirements.txt:

    pip install -r requirements.txt

Start the project by typing:

    python app.py

- Pima Indians Diabetes Dataset – Kaggle
- scikit-learn documentation: Pipeline & ColumnTransformer
If you face any issues, feel free to report them:
Name: MD Rohan Mulla
🎓University: Rabindra Maitree University
📨E-mail: mdrohanislam444@gmail.com
Facebook: https://www.facebook.com/MullaRohan