Machine learning project that predicts hospital readmissions using XGBoost. The project covers data preprocessing, model training, model evaluation, and deployment.
- Overview
- Getting Started
- Data
- Training
- K-Fold Cross-Validation
- Testing
- Saving the Model
- Usage
- FastAPI Application
- Files for Python Development and Containerization
- Conclusion
- Demonstration
In this project, we use the XGBoost Classifier algorithm to build a predictive model for hospital readmissions. We start by loading and preprocessing the dataset, splitting it into training, validation, and test sets, and encoding categorical and ordinal features. We then train an XGBoost model, evaluate its performance, and save the trained model for future use.
Before you can run this project, make sure you have the required libraries installed. You can install them using the following commands:
pip install pandas
pip install xgboost
pip install scikit-learn
pip install fastapi
pip install "uvicorn[standard]"

The dataset used in this project is loaded from a CSV file named hospital_readmissions.csv. The target variable is readmitted, which indicates whether a patient was readmitted to the hospital. The data preprocessing steps include converting the target variable to a binary format and encoding categorical and ordinal features.
Data Source: The dataset can be obtained from the UCI Machine Learning Repository: Diabetes 130-US Hospitals for Years 1999-2008.
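The target conversion and the train/validation/test split described above might look roughly like the sketch below. The toy data frame, the "yes"/"no" target labels, the 60/20/20 proportions, and `random_state=1` are all assumptions made for illustration; the project itself reads hospital_readmissions.csv and may use different values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for hospital_readmissions.csv (values invented for illustration)
df = pd.DataFrame({
    "time_in_hospital": list(range(1, 11)) * 10,
    "readmitted": ["yes", "no"] * 50,  # assumed label values
})

# Convert the target to a binary format: 1 = readmitted, 0 = not readmitted
df["readmitted"] = (df["readmitted"] == "yes").astype(int)

# Assumed 60/20/20 split: hold out 20% for testing,
# then take 25% of the remaining 80% for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

print(len(df_train), len(df_val), len(df_test))
```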
- Categorical columns: glucose_test, A1Ctest
- Ordinal columns: age, medical_specialty, diag_1, diag_2, diag_3, change, diabetes_med
A column transformer applies the appropriate encoding to each group of features: one-hot encoding for the categorical columns and ordinal encoding for the ordinal columns.
We train an XGBoost model with the following hyperparameters:
- Learning rate (eta): 0.1
- Maximum depth of trees: 4
- Minimum child weight: 5
- Objective: binary:logistic
- Random seed: 1
- Gamma: 1
- Evaluation metric: AUC (area under the receiver operating characteristic curve)

The number of boosting rounds is set to 105. We train the model on the training data and evaluate it on the validation data.
To check how stable the model's performance is, we perform K-Fold cross-validation with K=10, training and scoring the model on ten different train/validation partitions of the data. We report the mean and standard deviation of AUC across the folds.
After cross-validation, we train the final model using the entire training dataset and evaluate it on the test set. The AUC score is reported as the final performance metric.
The trained XGBoost model and the preprocessing transformers are saved to a binary file named xgb_eta01.bin using the pickle module. This allows for reusing the model without the need for retraining.
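A minimal sketch of the save step is shown below. It stores the preprocessor and model together as one tuple, in the same order the loading code unpacks them. The helper names `save_artifacts` and `load_artifacts` are hypothetical, and plain dictionaries stand in for the fitted objects so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(preprocessor, model, path):
    """Pickle the preprocessor and model together as a (preprocessor, model) tuple."""
    with open(path, "wb") as f_out:
        pickle.dump((preprocessor, model), f_out)


def load_artifacts(path):
    """Return the (preprocessor, model) tuple stored by save_artifacts."""
    with open(path, "rb") as f_in:
        return pickle.load(f_in)


# Stand-ins for the fitted ColumnTransformer and trained Booster (assumption)
preprocessor = {"kind": "stand-in preprocessor"}
model = {"kind": "stand-in model"}

# Written to a temporary directory here; the project uses xgb_eta01.bin
out_path = Path(tempfile.mkdtemp()) / "xgb_eta01.bin"
save_artifacts(preprocessor, model, out_path)

loaded_preprocessor, loaded_model = load_artifacts(out_path)
```

Pickle files should only be loaded from trusted sources, and the library versions used for loading should match those used for saving.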
You can use the saved model for making predictions on new data. Here's an example of how to load the model and make predictions:
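import pickle
import xgboost as xgb

# Load the saved preprocessor and model
with open('xgb_eta01.bin', 'rb') as f_in:
    preprocessor, model = pickle.load(f_in)

# X_new is your new data, in the same format (columns and dtypes) as the training data
X_new = preprocessor.transform(X_new)

# feature_names must match the names used at training time;
# here they are assumed to come from the fitted preprocessor
feature_names = list(preprocessor.get_feature_names_out())
X_new_dmat = xgb.DMatrix(X_new, feature_names=feature_names)

# Predicted readmission probabilities (the objective is binary:logistic)
y_pred = model.predict(X_new_dmat)

The FastAPI application includes two endpoints: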
The first endpoint makes a prediction for an individual patient. The input data is provided as a JSON request body in the following format:
{
  "age": "string",
  "time_in_hospital": int,
  "n_lab_procedures": int,
  "n_procedures": int,
  "n_medications": int,
  "n_outpatient": int,
  "n_inpatient": int,
  "n_emergency": int,
  "medical_specialty": "string",
  "diag_1": "string",
  "diag_2": "string",
  "diag_3": "string",
  "glucose_test": "string",
  "A1Ctest": "string",
  "change": "string",
  "diabetes_med": "string"
}

The application preprocesses the input data and makes a prediction. It returns the readmission probability and a binary readmitted status for the provided patient data.
The second endpoint accepts an uploaded CSV file containing multiple patient records. The file should have the same structure as the training data. The application preprocesses the uploaded data, makes predictions, and returns the readmission probability and binary readmitted status for each patient record. It also calculates the proportion of records in the file that are predicted as readmitted.
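The post-processing for a batch of records can be sketched in plain Python as below. It covers only the step that turns predicted probabilities into per-record results and a positive proportion. The function name, the response keys, and the 0.5 threshold are assumptions, and the probabilities are invented inputs rather than model output.

```python
def summarize_predictions(probabilities, threshold=0.5):
    """Build per-record results and the share of records flagged as readmitted."""
    records = [
        {"readmission_probability": float(p), "readmitted": p >= threshold}
        for p in probabilities
    ]
    positives = sum(r["readmitted"] for r in records)
    # Guard against an empty upload to avoid dividing by zero
    positive_rate = positives / len(records) if records else 0.0
    return {"predictions": records, "positive_rate": positive_rate}


# Example with invented probabilities for four patient records
result = summarize_predictions([0.12, 0.67, 0.5, 0.31])
print(result["positive_rate"])
```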
- Dockerfile: Contains instructions to set up the container environment, install software, and configure the application. Ensures consistent application deployment in isolated containers.
    FROM python:3.9-slim
    RUN pip install pipenv
    WORKDIR /app
    COPY ["Pipfile", "Pipfile.lock", "./"]
    RUN pipenv install --system --deploy
    COPY ["main.py", "xgb_eta01.bin", "./"]
    EXPOSE 8000
    ENTRYPOINT ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
- Pipfile: Lists project dependencies and their versions. Used with tools like Pipenv for managing Python project environments.
- Pipfile.lock: Lock file generated by Pipenv that pins exact package versions, so the same versions are installed whenever the environment is recreated. Guarantees reproducible Python environments.
This project demonstrates how to create a FastAPI application for making hospital readmission predictions using a pre-trained XGBoost model and data preprocessing transformers. The application provides endpoints for both individual and group predictions, making it useful for various scenarios in healthcare analytics.
