# Table of Contents
1. [Introduction](#introduction)
2. [Exploratory Data Analysis](#eda)
   1. [Data Dictionary](#data_dictionary)
   2. [Import Libraries & Load Data](#load_data)
   3. [Features](#features)
   4. [Descriptive Analysis](#desc_analysis)
3. [Data Cleaning](#data_cleaning)
   1. [Remove the 'Missing' labels in the Diagnosis features](#remove_missing_diag)
   2. [Remove the 'Missing' labels in the Medical Specialty features](#remove_missing_ms)
4. [Feature Engineering](#feat_eng)
   1. [has_procedure_performed Feature](#has_procedure_performed)
   2. [has_outpatient_visit Feature](#has_outpatient_visit)
   3. [has_inpatient_visit Feature](#has_inpatient_visit)
   4. [has_emergency_visit Feature](#has_emergency_visit)
   5. [lab_procedures_freq Feature](#lab_procedures_freq)
   6. [time_in_hospital_freq Feature](#time_in_hospital_freq)
   7. [medications_freq Feature](#medications_freq)
5. [Model Building](#model_building)
   1. [Transforming Nominal Data](#nominal_data)
   2. [Transforming Target Feature](#target_feature)
   3. [Training & Test Data for Model 1 (12 Features)](#train_test_1)
   4. [Training & Test Data for Model 2 (11 features without the Medical Specialty feature)](#train_test_2)
   5. [Training & Test Data for Model 3 (12 features with records with the 'Missing' labels removed from the Medical Specialty feature)](#train_test_3)
   6. [Training & Test Data for Model 4 (all 16 features)](#train_test_4)
   7. [Train the Predictive Model](#train_pred_model)
6. [Model Evaluation](#model_eval)
   1. [Evaluate Model 1](#eval_model_1)
   2. [Evaluate Model 2](#eval_model_2)
   3. [Evaluate Model 3](#eval_model_3)
   4. [Evaluate Model 4](#eval_model_4)
   5. [Models Evaluation](#eval_model_summary)
   6. [Models' Coefficient](#model_coeff)
7. [Recommendations](#recommendations)
8. [Next Steps](#next_steps)
9. [Appendix](#appendix)

<a id="introduction"></a>
# Introduction
I'm an analyst working for a consulting company helping a hospital group better understand patient readmissions. The hospital gave us access to ten years of information on patients readmitted to the hospital after being discharged. The doctors want to assess if initial diagnoses, number of procedures, or other variables could help them understand the probability of readmission. They want to focus follow-up calls and attention on those patients with a higher probability of readmission.

## Goal
1. Identify the primary diagnosis with the highest readmission rate and analyze its pattern in the dataset.
2. Build a predictive model that predicts whether a patient will be readmitted within 30 days of discharge.
3. Identify the patient groups with a high probability of readmission so that the hospital can focus its follow-up effort on them.


<a id="eda"></a>
# Exploratory Data Analysis (EDA) of Patient Readmission Dataset

<a id="data_dictionary"></a>
## Data Dictionary

|Name|Description|Type|
|----|-----------|----|
|age|age bracket of the patient|string|
|time_in_hospital|days (from 1 to 14)|numeric|
|n_procedures|number of procedures performed during the hospital stay|numeric|
|n_lab_procedures| number of laboratory procedures performed during the hospital stay|numeric|
|n_medications| number of medications administered during the hospital stay|numeric|
|n_outpatient| number of outpatient visits in the year before a hospital stay|numeric|
|n_inpatient| number of inpatient visits in the year before the hospital stay|numeric|
|n_emergency| number of visits to the emergency room in the year before the hospital stay|numeric|
|medical_specialty| the specialty of the admitting physician|string|
|diag_1| primary diagnosis (Circulatory, Respiratory, Digestive, etc.)|string|
|diag_2| secondary diagnosis|string|
|diag_3| additional secondary diagnosis|string|
|glucose_test| whether the glucose serum came out as high (> 200), normal, or not performed|string|
|A1Ctest| whether the A1C level of the patient came out as high (> 7%), normal, or not performed|string|
|change| whether there was a change in the diabetes medication ('yes' or 'no')|string|
|diabetes_med| whether a diabetes medication was prescribed ('yes' or 'no')|string|
|readmitted| if the patient was readmitted at the hospital ('yes' or 'no') |string|

Assumptions:
1. Each record is a unique patient record.

In [None]:
# mount the Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


<a id="load_data"></a>
## Import Libraries & Load Data

In [None]:
# Import the necessary libraries for data handling and viz (if needed)
import pandas as pd
import numpy as np
import matplotlib.pyplot as mpl
%matplotlib inline
import plotly.express as px
import seaborn as sb

# import the train_test_split function
from sklearn.model_selection import train_test_split

# Import the model from the scikit learn library to build / train the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Import the library to save the logistic regression model to a file
import joblib

# Import the libraries for the model's assessment.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [None]:
# Load the CSV into a data frame
patient_readmission = pd.read_csv('/content/drive/MyDrive/01 My Career/GA Data Analytics Bootcamp/DAB-Capstone/02-Data-Prep/data-raw/patient_readmission_data.csv')

<a id="features"></a>
## Features
Based on the code cell below:
1. There are **17 features** in the dataset.
2. **7 features** are **numeric type**.
3. **10 features** are **categorical type**.
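
The numeric and categorical counts above can also be derived programmatically instead of being read off the `info()` output. The snippet below is a minimal sketch on a small stand-in frame with made-up values (`sample` is hypothetical); in the notebook the same two calls would be applied to `patient_readmission`.

```python
import pandas as pd

# Hypothetical stand-in for `patient_readmission` with a few of its columns.
sample = pd.DataFrame({
    "age": ["[70-80)", "[50-60)"],
    "time_in_hospital": [8, 3],
    "n_procedures": [1, 2],
    "readmitted": ["no", "yes"],
})

# Split the column names by dtype: numeric columns vs. everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(f"{len(numeric_cols)} numeric features: {numeric_cols}")
print(f"{len(categorical_cols)} categorical features: {categorical_cols}")
```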


In [None]:
patient_readmission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                25000 non-null  object
 1   time_in_hospital   25000 non-null  int64 
 2   n_lab_procedures   25000 non-null  int64 
 3   n_procedures       25000 non-null  int64 
 4   n_medications      25000 non-null  int64 
 5   n_outpatient       25000 non-null  int64 
 6   n_inpatient        25000 non-null  int64 
 7   n_emergency        25000 non-null  int64 
 8   medical_specialty  25000 non-null  object
 9   diag_1             25000 non-null  object
 10  diag_2             25000 non-null  object
 11  diag_3             25000 non-null  object
 12  glucose_test       25000 non-null  object
 13  A1Ctest            25000 non-null  object
 14  change             25000 non-null  object
 15  diabetes_med       25000 non-null  object
 16  readmitted         25000 non-null  object

<a id="desc_analysis"></a>
## Descriptive Analysis
**time_in_hospital**:
1. A patient spends 4.45 days in the hospital on average.
2. The standard deviation is 3.00 days, so a typical stay differs from the mean by about 3 days. Relative to a 4.45-day mean, this is substantial variability in the length of stay.
3. The median is 4.0 days.

---
**n_lab_procedures**:
1. A patient has 43.24 lab procedures on average during the stay.
2. The standard deviation is 19.82, so a patient's lab procedure count typically differs from the mean by about 20. This is considerable variability in the number of lab procedures performed.
3. The highest number of lab procedures performed for a patient is 113.

---
**n_procedures**:
1. A patient has 1.35 procedures on average during the stay.
2. The standard deviation is 1.72, which is larger than the mean, so the number of procedures varies widely from patient to patient.

---
**n_medications**:
1. A patient is administered 16.25 medications on average during the hospital stay.
2. The standard deviation is 8.06, about half of the mean, so the number of medications administered varies considerably across patients.

---
**n_outpatient**:
1. A patient has 0.37 outpatient visits on average in the year before the hospital admission.
2. The standard deviation is 1.20, more than three times the mean, which points to a right-skewed distribution.
3. The 75th percentile is 0, so at least 75% of the patients in this dataset had no outpatient visits in the year before their hospital stay.
4. The highest number of outpatient visits in the year before the hospital stay is 33.

---
**n_inpatient**:
1. A patient has 0.62 inpatient visits on average in the year before the hospital admission.
2. The standard deviation is 1.18, almost twice the mean. Together with a median of 0, this suggests the distribution is right-skewed.
3. The highest number of inpatient visits in the year before the hospital stay is 15.

---
**n_emergency**:
1. A patient has 0.19 emergency visits on average in the year before the hospital admission.
2. The standard deviation is 0.89, more than four times the mean. Together with a 75th percentile of 0, this suggests the distribution is strongly right-skewed.
3. The highest number of emergency visits by a patient is 64.

---
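
The skew suggested by a standard deviation larger than the mean can be checked directly. The sketch below uses a small made-up series of zero-heavy visit counts (not the real data) and reports the mean, median, share of zeros, and sample skewness. A mean above the median and a positive skewness both indicate a long right tail.

```python
import pandas as pd

# Hypothetical zero-heavy visit counts standing in for a column such as n_emergency.
visits = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 9], name="n_emergency")

summary = {
    "mean": visits.mean(),
    "median": visits.median(),
    "share_zero": (visits == 0).mean(),  # fraction of patients with no visits
    "skew": visits.skew(),               # positive values indicate a right tail
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```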



In [None]:
patient_readmission.describe()

|Statistic|time_in_hospital|n_lab_procedures|n_procedures|n_medications|n_outpatient|n_inpatient|n_emergency|
|----|----|----|----|----|----|----|----|
|count|25000.0|25000.0|25000.0|25000.0|25000.0|25000.0|25000.0|
|mean|4.45332|43.24076|1.35236|16.2524|0.3664|0.61596|0.1866|
|std|3.00147|19.81862|1.715179|8.060532|1.195478|1.177951|0.885873|
|min|1.0|1.0|0.0|1.0|0.0|0.0|0.0|
|25%|2.0|31.0|0.0|11.0|0.0|0.0|0.0|
|50%|4.0|44.0|1.0|15.0|0.0|0.0|0.0|
|75%|6.0|57.0|2.0|20.0|0.0|1.0|0.0|
|max|14.0|113.0|6.0|79.0|33.0|15.0|64.0|


<a id="data_cleaning"></a>
# Data Cleaning
This section performs the data cleaning process on the dataset.

<a id="remove_missing_diag"></a>
## Remove the 'Missing' labels in the Diagnosis features
Based on the EDA, we remove the records labelled `Missing` in the `diag_1`, `diag_2`, and `diag_3` features. Their share of each feature is negligible (0.02%, 0.17%, and 0.78% respectively), so they are unlikely to benefit the prediction model.
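
The share of `Missing` labels per diagnosis column, and the removal of every affected row in a single pass, can be sketched as follows. The frame below is a made-up four-row example (`sample` is hypothetical), so its percentages do not match the real dataset.

```python
import pandas as pd

diag_cols = ["diag_1", "diag_2", "diag_3"]

# Hypothetical rows; the real notebook would use `patient_readmission`.
sample = pd.DataFrame({
    "diag_1": ["Circulatory", "Missing", "Other", "Respiratory"],
    "diag_2": ["Other", "Other", "Missing", "Diabetes"],
    "diag_3": ["Other", "Missing", "Other", "Other"],
})

# Percentage of 'Missing' labels in each diagnosis column.
missing_share = (sample[diag_cols] == "Missing").mean() * 100
print(missing_share.round(2))

# Drop every row that has 'Missing' in at least one diagnosis column.
has_missing = (sample[diag_cols] == "Missing").any(axis=1)
cleaned = sample.loc[~has_missing].copy()
print(f"{len(sample)} rows before, {len(cleaned)} rows after")
```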

In [None]:
# This code cell drops the records where the value is 'Missing' in the diag_1, diag_2, or diag_3 feature.

raw_count = patient_readmission.shape[0]
print(f'There are {raw_count} rows in the dataset.')

# Store the total `Missing` values in the diag_1 feature.
diag1_count = patient_readmission[patient_readmission['diag_1'] == 'Missing']['readmitted'].count()
print(f"There are {diag1_count} rows with 'Missing' value in the diag_1 feature.")

# Drop the rows with 'Missing' value in the diag_1 column. The 'inplace=True' parameter is to replace the original dataframe.
patient_readmission.drop(patient_readmission[patient_readmission['diag_1'] == "Missing"].index, inplace=True)

# Store the total `Missing` values in the diag_2 feature.
diag2_count = patient_readmission[patient_readmission['diag_2'] == 'Missing']['readmitted'].count()
print(f"There are {diag2_count} rows with 'Missing' value in the diag_2 feature.")

# Drop the rows with 'Missing' value in the diag_2 column.
patient_readmission.drop(patient_readmission[patient_readmission['diag_2'] == "Missing"].index, inplace=True)

# Store the total `Missing` values in the diag_3 feature.
diag3_count = patient_readmission[patient_readmission['diag_3'] == 'Missing']['readmitted'].count()
print(f"There are {diag3_count} rows with 'Missing' value in the diag_3 feature.")

# Drop the rows with 'Missing' value in the diag_3 column.
patient_readmission.drop(patient_readmission[patient_readmission['diag_3'] == "Missing"].index, inplace=True)

print(f"After removing the records, there are {patient_readmission[patient_readmission['diag_1'] == 'Missing']['readmitted'].count()} rows with 'Missing' value in the diag_1 feature.")
print(f"After removing the records, there are {patient_readmission[patient_readmission['diag_2'] == 'Missing']['readmitted'].count()} rows with 'Missing' value in the diag_2 feature.")
print(f"After removing the records, there are {patient_readmission[patient_readmission['diag_3'] == 'Missing']['readmitted'].count()} rows with 'Missing' value in the diag_3 feature.")

print(f"There are {patient_readmission.shape[0]} rows after dropping the rows where the value is 'Missing' in the diagnosis feature.")

# patient_readmission.to_csv('pr_cleaned.csv', index=False)
# print('Generated the pr_cleaned.csv.')

There are 25000 rows in the dataset.
There are 4 rows with 'Missing' value in the diag_1 feature.
There are 42 rows with 'Missing' value in the diag_2 feature.
There are 175 rows with 'Missing' value in the diag_3 feature.
After removing the records, there are 0 rows with 'Missing' value in the diag_1 feature.
After removing the records, there are 0 rows with 'Missing' value in the diag_2 feature.
After removing the records, there are 0 rows with 'Missing' value in the diag_3 feature.
There are 24779 rows after dropping the rows where the value is 'Missing' in the diagnosis feature.


<a id="remove_missing_ms"></a>
## Remove the 'Missing' label in the Medical Specialty feature
We found that a large portion of the data (12,382 records or 49.53%) in the `medical_specialty` feature is labeled as `Missing`. Since a `Missing` label is expected in a real-world dataset and may carry its own pattern, we've decided to keep these records when training the prediction model.

However, we will also remove these records to build the 3rd predictive model and evaluate its performance against the 1st predictive model.

In [None]:
# This code cell builds a copy of the data without the records whose medical_specialty is 'Missing'.
# The copy is meant for Model 3; patient_readmission itself keeps these records for the other models.

raw_count = patient_readmission.shape[0]
print(f'There are {raw_count} rows in the dataset.')

# Flag the rows labelled 'Missing' in the medical_specialty feature and count them.
is_missing_ms = patient_readmission['medical_specialty'] == 'Missing'
missing_count = is_missing_ms.sum()
print(f"There are {missing_count} rows with 'Missing' value in the medical_specialty feature.")

# Keep the remaining rows in a separate dataframe (the name patient_readmission_no_missing_ms is introduced here).
patient_readmission_no_missing_ms = patient_readmission[~is_missing_ms].copy()

print(f"After removing the records, there are {(patient_readmission_no_missing_ms['medical_specialty'] == 'Missing').sum()} rows with 'Missing' value in the medical_specialty feature.")

print(f"There are {patient_readmission_no_missing_ms.shape[0]} rows after dropping the rows where the value is 'Missing' in the medical_specialty feature.")

# patient_readmission.to_csv('pr_cleaned.csv', index=False)
# print('Generated the pr_cleaned.csv.')

There are 24779 rows in the dataset.
There are 12314 rows with 'Missing' value in the medical_specialty feature.
After removing the records, there are 0 rows with 'Missing' value in the medical_specialty feature.
There are 12465 rows after dropping the rows where the value is 'Missing' in the medical_specialty feature.


<a id="feat_eng"></a>
# Feature Engineering
Based on the EDA, the `n_emergency`, `n_inpatient`, `n_outpatient`, and `n_procedures` features are highly imbalanced, with a large share of patients (a clear majority for the three visit counts) having zero visits or procedures.

We'll create binary bins for each of them:
1. `n_procedures`:
   - Create a binary bin called `has_procedure_performed`.
   - If a patient has no procedure (0) performed, then its value is `0`, else it is `1` (this means the patient has 1 or more procedures done).
2. `n_outpatient`:
   - Create a binary bin called `has_outpatient_visit`.
   - If a patient has no outpatient visit (0), then its value is `0`, else it is `1` (this means the patient has 1 or more outpatient visits).
3. `n_inpatient`:
   - Create a binary bin called `has_inpatient_visit`.
   - If a patient has no inpatient visit (0), then its value is `0`, else it is `1` (this means the patient has 1 or more inpatient visits).
4. `n_emergency`:
   - Create a binary bin called `has_emergency_visit`.
   - If a patient has no emergency visit (0), then its value is `0`, else it is `1` (this means the patient has 1 or more emergency visits).
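
The four binary bins follow the same rule, so they can be built in one loop. This is a minimal sketch on made-up counts (`sample` and `flag_map` are illustrative names, not part of the notebook); comparing each count with zero and casting the result to `int` yields the 0/1 values described above.

```python
import pandas as pd

# Hypothetical counts for three patients.
sample = pd.DataFrame({
    "n_procedures": [0, 1, 6],
    "n_outpatient": [0, 0, 3],
    "n_inpatient": [1, 0, 0],
    "n_emergency": [0, 0, 2],
})

# Source count column -> name of the binary flag to create.
flag_map = {
    "n_procedures": "has_procedure_performed",
    "n_outpatient": "has_outpatient_visit",
    "n_inpatient": "has_inpatient_visit",
    "n_emergency": "has_emergency_visit",
}

for source_col, flag_col in flag_map.items():
    # True/False for "one or more", then cast to 1/0.
    sample[flag_col] = (sample[source_col] > 0).astype(int)

print(sample[list(flag_map.values())])
```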

<a id="has_procedure_performed"></a>
### has_procedure_performed feature

In [None]:
# This code cell creates the has_procedure_performed feature in the dataframe.

print(
    f"There are {patient_readmission[patient_readmission['n_procedures']>0]['readmitted'].count()} patients with 1 or more procedure(s) performed.\n"
    f"There are {patient_readmission[patient_readmission['n_procedures']==0]['readmitted'].count()} patients with no procedures performed."
)

# Create the new `has_procedure_performed` column: 0 when no procedure was performed, 1 otherwise.
# The '.astype(int)' converts the boolean comparison into 0/1 integers.
patient_readmission['has_procedure_performed'] = (patient_readmission['n_procedures'] > 0).astype(int)

display(patient_readmission['has_procedure_performed'].value_counts())


There are 13485 patients with 1 or more procedure(s) performed.
There are 11294 patients with no procedures performed.


|has_procedure_performed|count|
|----|----|
|1|13485|
|0|11294|


<a id="has_outpatient_visit"></a>
### has_outpatient_visit feature

In [None]:
# This code cell creates the has_outpatient_visit feature in the dataframe.

print(
    f"There are {patient_readmission[patient_readmission['n_outpatient']>0]['readmitted'].count()} patients with one or more outpatient visits.\n"
    f"There are {patient_readmission[patient_readmission['n_outpatient']==0]['readmitted'].count()} patients with no outpatient visit."
)

# Create the new `has_outpatient_visit` column: 0 when there was no outpatient visit, 1 otherwise.
patient_readmission['has_outpatient_visit'] = (patient_readmission['n_outpatient'] > 0).astype(int)

display(patient_readmission['has_outpatient_visit'].value_counts())

There are 4122 patients with one or more outpatient visits.
There are 20657 patients with no outpatient visit.


|has_outpatient_visit|count|
|----|----|
|0|20657|
|1|4122|


<a id="has_inpatient_visit"></a>
### has_inpatient_visit feature

In [None]:
# This code cell creates the has_inpatient_visit feature in the dataframe.

print(
    f"There are {patient_readmission[patient_readmission['n_inpatient']==0]['readmitted'].count()} patients with no inpatient visit.\n"
    f"There are {patient_readmission[patient_readmission['n_inpatient']>0]['readmitted'].count()} patients with one or more inpatient visits."
)

# Create the new `has_inpatient_visit` column: 0 when there was no inpatient visit, 1 otherwise.
patient_readmission['has_inpatient_visit'] = (patient_readmission['n_inpatient'] > 0).astype(int)

display(patient_readmission['has_inpatient_visit'].value_counts())

There are 16354 patients with no inpatient visit.
There are 8425 patients with one or more inpatient visits.


|has_inpatient_visit|count|
|----|----|
|0|16354|
|1|8425|


<a id="has_emergency_visit"></a>
### has_emergency_visit feature

In [None]:
# This code cell creates the has_emergency_visit feature in the dataframe.

print(
    f"There are {patient_readmission[patient_readmission['n_emergency'] == 0]['readmitted'].count()} patients with no emergency visit.\n"
    f"There are {patient_readmission[patient_readmission['n_emergency'] > 0]['readmitted'].count()} patients with one or more emergency visits.\n"
)

# Create the new `has_emergency_visit` column: 0 when there was no emergency visit, 1 otherwise.
patient_readmission['has_emergency_visit'] = (patient_readmission['n_emergency'] > 0).astype(int)

display(patient_readmission['has_emergency_visit'].value_counts())

There are 22063 patients with no emergency visit.
There are 2716 patients with one or more emergency visits.



|has_emergency_visit|count|
|----|----|
|0|22063|
|1|2716|


For the `n_lab_procedures`, `time_in_hospital`, and `n_medications` features, we will create category bins and transform each of them from a numeric variable into a categorical one. This will allow the model to learn from distinct patient groups rather than a continuous, weakly correlated number.

1. `n_lab_procedures`:
   - Create a feature called `lab_procedures_freq` with 3 bins for this feature.
   - `low` - patients who have 30 or fewer lab procedures performed.
   - `med` - patients who have more than 30 and up to 60 lab procedures performed.
   - `high` - patients who have more than 60 lab procedures performed.
2. `time_in_hospital`:
   - Create a feature called `time_in_hospital_freq` with 3 bins for this feature.
   - `short` - patients who stay less than or equal to 3 days.
   - `med` - patients who stay more than 3 days and less than or equal to 6 days.
   - `long` - patients who stay more than 6 days.
3. `n_medications`:
   - Create a feature called `medications_freq` with 3 bins for this feature.
   - `low` - patients who are administered 10 or fewer medications during the hospital stay.
   - `med` - patients who are administered more than 10 and up to 20 medications during the hospital stay.
   - `high` - patients who are administered more than 20 medications during the hospital stay.
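
As an alternative to listing the conditions by hand, the same three-way bins can be expressed with `pd.cut`. This is a sketch on made-up values placed on and around the bin edges (`sample` and `bin_specs` are illustrative names), not the notebook's own implementation. It assumes every count is at least 1, as the descriptive statistics show for these three features, because values of 0 or below would fall outside the first bin.

```python
import pandas as pd

# Hypothetical values chosen to sit on and around the bin edges.
sample = pd.DataFrame({
    "n_lab_procedures": [1, 30, 31, 60, 61, 113],
    "time_in_hospital": [1, 3, 4, 6, 7, 14],
    "n_medications": [1, 10, 11, 20, 21, 79],
})

# (source column, new column, upper edges of the first two bins, labels)
bin_specs = [
    ("n_lab_procedures", "lab_procedures_freq", [30, 60], ["low", "med", "high"]),
    ("time_in_hospital", "time_in_hospital_freq", [3, 6], ["short", "med", "long"]),
    ("n_medications", "medications_freq", [10, 20], ["low", "med", "high"]),
]

for source_col, new_col, edges, labels in bin_specs:
    # right=True (the default) makes each bin include its upper edge, e.g. (0, 30].
    bins = [0, *edges, float("inf")]
    sample[new_col] = pd.cut(sample[source_col], bins=bins, labels=labels).astype(str)

print(sample)
```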

<a id="lab_procedures_freq"></a>
### lab_procedures_freq feature

In [None]:
# This code cell is to create the lab_procedures_freq feature in the dataframe.

print(
    f"There are {patient_readmission[(patient_readmission['n_lab_procedures'] <= 30)]['readmitted'].count()} patients with low lab procedures.\n"
    f"There are {patient_readmission[(patient_readmission['n_lab_procedures'] > 30) & (patient_readmission['n_lab_procedures'] <= 60)]['readmitted'].count()} patients with medium lab procedures.\n"
    f"There are {patient_readmission[(patient_readmission['n_lab_procedures'] > 60)]['readmitted'].count()} patients with high lab procedures.\n"
)

# Create the new `lab_procedures_freq` column.
conditions = [
    (patient_readmission['n_lab_procedures'] <= 30),
    (patient_readmission['n_lab_procedures'] > 30) & (patient_readmission['n_lab_procedures'] <= 60),
    (patient_readmission['n_lab_procedures'] > 60)
]

values = [
    'low', 'med', 'high'
]

patient_readmission['lab_procedures_freq'] = np.select(conditions, values, default=None)

display(patient_readmission['lab_procedures_freq'].value_counts())

There are 5905 patients with low lab procedures.
There are 14009 patients with medium lab procedures.
There are 4865 patients with high lab procedures.



|lab_procedures_freq|count|
|----|----|
|med|14009|
|low|5905|
|high|4865|


<a id="time_in_hospital_freq"></a>
### time_in_hospital_freq feature

In [None]:
# This code cell is to create the time_in_hospital_freq feature in the dataframe.

print(
    f"There are {patient_readmission[(patient_readmission['time_in_hospital'] <= 3)]['readmitted'].count()} patients with short hospital stay.\n"
    f"There are {patient_readmission[(patient_readmission['time_in_hospital'] > 3) & (patient_readmission['time_in_hospital'] <= 6)]['readmitted'].count()} patients with medium hospital stay.\n"
    f"There are {patient_readmission[(patient_readmission['time_in_hospital'] > 6)]['readmitted'].count()} patients with long hospital stay.\n"
)

# Create the new `time_in_hospital_freq` column.
conditions = [
    (patient_readmission['time_in_hospital'] <= 3),
    (patient_readmission['time_in_hospital'] > 3) & (patient_readmission['time_in_hospital'] <= 6),
    (patient_readmission['time_in_hospital'] > 6)
]

values = [
    'short', 'med', 'long'
]

patient_readmission['time_in_hospital_freq'] = np.select(conditions, values, default=None)

display(patient_readmission['time_in_hospital_freq'].value_counts())

There are 11629 patients with short hospital stay.
There are 7848 patients with medium hospital stay.
There are 5302 patients with long hospital stay.



|time_in_hospital_freq|count|
|----|----|
|short|11629|
|med|7848|
|long|5302|


<a id="medications_freq"></a>
### medications_freq feature

In [None]:
# This code cell is to create the medications_freq feature in the dataframe.

print(
    f"There are {patient_readmission[(patient_readmission['n_medications'] <= 10)]['readmitted'].count()} patients with low medications administered.\n"
    f"There are {patient_readmission[(patient_readmission['n_medications'] > 10) & (patient_readmission['n_medications'] <= 20)]['readmitted'].count()} patients with medium medications administered.\n"
    f"There are {patient_readmission[(patient_readmission['n_medications'] > 20)]['readmitted'].count()} patients with high medications administered.\n"
)

# Create the new `medications_freq` column.
conditions = [
    (patient_readmission['n_medications'] <= 10),
    (patient_readmission['n_medications'] > 10) & (patient_readmission['n_medications'] <= 20),
    (patient_readmission['n_medications'] > 20)
]

values = [
    'low', 'med', 'high'
]

patient_readmission['medications_freq'] = np.select(conditions, values, default=None)

display(patient_readmission['medications_freq'].value_counts())

There are 5853 patients with low medications administered.
There are 12896 patients with medium medications administered.
There are 6030 patients with high medications administered.



|medications_freq|count|
|----|----|
|med|12896|
|high|6030|
|low|5853|


<a id="model_building"></a>
# Model Building
## Data Transformation
Before we begin building the **Logistic Regression** predictive model, we need to transform the categorical data to binary columns (also known as one-hot encoding) for the model to understand it.

"[Essentially, all models are wrong, but some are useful.](https://en.wikipedia.org/wiki/All_models_are_wrong)" ~George Box

<a id="nominal_data"></a>
### Transforming Nominal Data

The features below are treated as nominal data (no inherent order). We use the `get_dummies()` function in `Pandas` to encode them.
1. `medical_specialty`
2. `diag_1`
3. `diag_2`
4. `diag_3`
5. `change`
6. `diabetes_med`
7. `age`
8. `glucose_test`
9. `A1Ctest`
10. `lab_procedures_freq`
11. `time_in_hospital_freq`
12. `medications_freq`
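
To make the effect of one-hot encoding with `drop_first=True` concrete, here is a minimal sketch on two made-up columns (`sample` is hypothetical). For each feature the alphabetically first category is dropped and becomes the implicit baseline, so a feature with *k* categories yields *k - 1* columns.

```python
import pandas as pd

# Hypothetical rows using label values that appear in the data dictionary.
sample = pd.DataFrame({
    "change": ["no", "yes", "no"],
    "A1Ctest": ["high", "no", "normal"],
})

# 'change' has 2 categories -> 1 column; 'A1Ctest' has 3 categories -> 2 columns.
encoded = pd.get_dummies(sample, columns=["change", "A1Ctest"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Here `change_no` and `A1Ctest_high` are the dropped baselines: a row with `A1Ctest_no` and `A1Ctest_normal` both false is a `high` result.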


In [None]:
# This code creates binary values for the nominal data columns.

# List of categorical columns to encode
nom_cat_cols = [
    # Add the names of the categorical columns here
    'medical_specialty'
    , 'diag_1', 'diag_2', 'diag_3'
    , 'change'
    , 'diabetes_med'
    , 'age', 'glucose_test', 'A1Ctest', 'lab_procedures_freq', 'time_in_hospital_freq', 'medications_freq'
]

# Apply one-hot encoding to the categorical columns
# Use pd.get_dummies() on your DataFrame, specifying the columns to encode
# and setting drop_first=True to avoid multicollinearity
patient_readmission_encoded = pd.get_dummies(patient_readmission, columns=nom_cat_cols, drop_first=True)

# Optional: Display the first few rows of the new DataFrame to see the encoded columns
display(patient_readmission_encoded.head())

# Optional: Display the shape of the new DataFrame to see how many columns were added
print(f"\nShape of the DataFrame after one-hot encoding: {patient_readmission_encoded.shape}")

Unnamed: 0,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency,readmitted,has_procedure_performed,has_outpatient_visit,...,glucose_test_no,glucose_test_normal,A1Ctest_no,A1Ctest_normal,lab_procedures_freq_low,lab_procedures_freq_med,time_in_hospital_freq_med,time_in_hospital_freq_short,medications_freq_low,medications_freq_med
0,8,72,1,18,2,0,0,no,1,1,...,True,False,True,False,False,False,False,False,False,True
1,3,34,2,13,0,0,0,no,1,0,...,True,False,True,False,False,True,False,True,False,True
2,5,45,0,18,0,0,0,yes,0,0,...,True,False,True,False,False,True,True,False,False,True
3,2,36,0,12,1,0,0,yes,0,1,...,True,False,True,False,False,True,False,True,False,True
4,1,42,0,7,0,0,0,no,0,0,...,True,False,True,False,False,True,False,True,True,False



Shape of the DataFrame after one-hot encoding: (24779, 53)


<a id="target_feature"></a>
### Transforming Target Feature
We need to encode the target feature in numerical format for the model to understand. We will encode the `no` and `yes` values as `0` and `1`.
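
One detail worth guarding against when mapping labels to numbers: `Series.map` turns any label that is absent from the dictionary into `NaN` instead of raising an error. The sketch below, on a made-up series, checks for that before casting to integers.

```python
import pandas as pd

# Hypothetical target labels.
readmitted = pd.Series(["no", "yes", "yes", "no"], name="readmitted")

mapping = {"no": 0, "yes": 1}
encoded = readmitted.map(mapping)

# Labels missing from `mapping` (for example 'Yes' or stray whitespace) would become NaN.
unmapped = readmitted[encoded.isna()].unique().tolist()
if unmapped:
    raise ValueError(f"Unmapped target labels: {unmapped}")

encoded = encoded.astype(int)
print(encoded.tolist())
```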

In [None]:
# This cell creates a binary (0/1) encoding of the 'readmitted' column called 'readmitted_ordinal'

# Create a dictionary to map the values.
readmitted_dict = {
    'no':0
    ,'yes':1
}

# Assign the result back to a new column
patient_readmission_encoded['readmitted_ordinal'] = patient_readmission_encoded['readmitted'].map(readmitted_dict)

# Optional: Display the first few rows of the new DataFrame to see the encoded columns
display(patient_readmission_encoded.head())

# Optional: Display the shape of the new DataFrame to see how many columns were added
print(f"\nShape of the DataFrame after one-hot encoding: {patient_readmission_encoded.shape}")

Unnamed: 0,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency,readmitted,has_procedure_performed,has_outpatient_visit,...,glucose_test_normal,A1Ctest_no,A1Ctest_normal,lab_procedures_freq_low,lab_procedures_freq_med,time_in_hospital_freq_med,time_in_hospital_freq_short,medications_freq_low,medications_freq_med,readmitted_ordinal
0,8,72,1,18,2,0,0,no,1,1,...,False,True,False,False,False,False,False,False,True,0
1,3,34,2,13,0,0,0,no,1,0,...,False,True,False,False,True,False,True,False,True,0
2,5,45,0,18,0,0,0,yes,0,0,...,False,True,False,False,True,True,False,False,True,1
3,2,36,0,12,1,0,0,yes,0,1,...,False,True,False,False,True,False,True,False,True,1
4,1,42,0,7,0,0,0,no,0,0,...,False,True,False,False,True,False,True,True,False,0



Shape of the DataFrame after one-hot encoding: (24779, 54)


<a id="train_test_1"></a>
### Training and Test Data for Model 1 (12 features)
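Create the training and test data for the prediction model with scikit-learn's `train_test_split` function.

This training and test data draws on 12 source features. The three visit counts enter the model through their binary `has_*_visit` flags, and each categorical feature through its one-hot encoded columns.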
1. `n_inpatient`
2. `n_outpatient`
3. `n_emergency`
4. `age`
5. `change`
6. `diag_1`
7. `diag_2`
8. `diag_3`
9. `diabetes_med`
10. `medical_specialty`
11. `A1Ctest`
12. `glucose_test`
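
Because each source feature expands into several encoded columns, the column list can be assembled from name prefixes instead of being typed out by hand. This is a sketch under the assumption that the encoded columns follow the `feature_category` naming produced by `get_dummies()`; the column names below are a small made-up subset, and `flag_cols` and `prefixes` are illustrative names.

```python
import pandas as pd

# Hypothetical subset of the encoded column names.
encoded_columns = pd.Index([
    "n_inpatient", "has_inpatient_visit", "age_[50-60)", "age_[60-70)",
    "diag_1_Diabetes", "diag_2_Other", "medical_specialty_Missing",
    "medications_freq_low", "readmitted_ordinal",
])

flag_cols = ["has_inpatient_visit"]
prefixes = ("age_", "diag_1_", "diag_2_", "medical_specialty_")

# str.startswith accepts a tuple, so one pass covers every prefix.
x_cols = flag_cols + [col for col in encoded_columns if col.startswith(prefixes)]

# Sanity check: every selected name must exist among the encoded columns.
missing = [col for col in x_cols if col not in encoded_columns]
print(x_cols)
print(f"Missing columns: {missing}")
```

Dropping a feature for another model then amounts to removing its prefix from `prefixes`.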

In [None]:
# Define the features (X) and target variable (y)
x_cols = [
    # Strong indicators
    'age_[50-60)', 'age_[60-70)', 'age_[70-80)', 'age_[80-90)', 'age_[90-100)'
    , 'diabetes_med_yes'
    , 'has_outpatient_visit', 'has_inpatient_visit', 'has_emergency_visit'
    , 'diag_1_Diabetes', 'diag_1_Digestive', 'diag_1_Injury', 'diag_1_Musculoskeletal', 'diag_1_Other', 'diag_1_Respiratory'
    , 'diag_2_Diabetes', 'diag_2_Digestive', 'diag_2_Injury', 'diag_2_Musculoskeletal', 'diag_2_Other', 'diag_2_Respiratory'
    , 'diag_3_Diabetes', 'diag_3_Digestive', 'diag_3_Injury', 'diag_3_Musculoskeletal', 'diag_3_Other', 'diag_3_Respiratory'
    , 'medical_specialty_Emergency/Trauma', 'medical_specialty_Family/GeneralPractice'
    , 'medical_specialty_InternalMedicine', 'medical_specialty_Missing'
    , 'medical_specialty_Other', 'medical_specialty_Surgery'
    , 'change_yes'
    , 'A1Ctest_no', 'A1Ctest_normal'
    , 'glucose_test_no', 'glucose_test_normal'
]

X = patient_readmission_encoded[x_cols]
y = patient_readmission_encoded['readmitted_ordinal']

print(f"Shape of X (features): {X.shape}")
print(f"Shape of y (target): {y.shape}")

# Call the train_test_split function - define the test size, random_state
# Assign the outputs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

Shape of X (features): (24779, 38)
Shape of y (target): (24779,)
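As a quick sanity check on the split above, the expected sizes of the training and test sets can be worked out from the row count and `test_size`. This is a minimal sketch, assuming scikit-learn rounds the test count up for a fractional `test_size`; the variable names are illustrative and not part of the notebook.

```python
import math

n_samples = 24779   # rows in X, from the shape printed above
test_size = 0.2     # fraction passed to train_test_split

# Assumption: the test count is rounded up and the remainder goes to training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(f"Training rows: {n_train}")
print(f"Test rows: {n_test}")
```

The test count should equal the sum of the four cells of the Model 1 confusion matrix in the evaluation section (1918 + 706 + 1229 + 1103).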


<a id="train_test_2"></a>
### Training and Test Data for Model 2 (without the Medical Specialty feature)
Create the training and test data for the prediction model using scikit-learn's `train_test_split` function.

This training and test data uses 11 features.
1. `n_inpatient`
2. `n_outpatient`
3. `n_emergency`
4. `age`
5. `change`
6. `diag_1`
7. `diag_2`
8. `diag_3`
9. `diabetes_med`
10. `A1Ctest`
11. `glucose_test`

In [None]:

# Define the features (X) and target variable (y)
x_cols = [
    # Strong indicators
    'age_[50-60)', 'age_[60-70)', 'age_[70-80)', 'age_[80-90)', 'age_[90-100)'
    , 'diabetes_med_yes'
    , 'has_outpatient_visit', 'has_inpatient_visit', 'has_emergency_visit'
    , 'diag_1_Diabetes', 'diag_1_Digestive', 'diag_1_Injury', 'diag_1_Musculoskeletal', 'diag_1_Other', 'diag_1_Respiratory'
    , 'diag_2_Diabetes', 'diag_2_Digestive', 'diag_2_Injury', 'diag_2_Musculoskeletal', 'diag_2_Other', 'diag_2_Respiratory'
    , 'diag_3_Diabetes', 'diag_3_Digestive', 'diag_3_Injury', 'diag_3_Musculoskeletal', 'diag_3_Other', 'diag_3_Respiratory'
    , 'change_yes'
    , 'A1Ctest_no', 'A1Ctest_normal'
    , 'glucose_test_no', 'glucose_test_normal'
]

X = patient_readmission_encoded[x_cols]
y = patient_readmission_encoded['readmitted_ordinal']

print(f"Shape of X (features): {X.shape}")
print(f"Shape of y (target): {y.shape}")

# Call the train_test_split function - define the test size, random_state
# Assign the outputs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

Shape of X (features): (24779, 32)
Shape of y (target): (24779,)


<a id="train_test_3"></a>
### Training and Test Data for Model 3 (12 features with records with the 'Missing' labels removed from the Medical Specialty feature)
Create the training and test data for the prediction model using scikit-learn's `train_test_split` function.

This training and test data uses 12 features.
1. `n_inpatient`
2. `n_outpatient`
3. `n_emergency`
4. `age`
5. `change`
6. `diag_1`
7. `diag_2`
8. `diag_3`
9. `diabetes_med`
10. `medical_specialty` - the `Missing` labels are removed from this feature.
11. `A1Ctest`
12. `glucose_test`

In [None]:

# Define the features (X) and target variable (y)
x_cols = [
    # Strong indicators
    'age_[50-60)', 'age_[60-70)', 'age_[70-80)', 'age_[80-90)', 'age_[90-100)'
    , 'diabetes_med_yes'
    , 'has_outpatient_visit', 'has_inpatient_visit', 'has_emergency_visit'
    , 'diag_1_Diabetes', 'diag_1_Digestive', 'diag_1_Injury', 'diag_1_Musculoskeletal', 'diag_1_Other', 'diag_1_Respiratory'
    , 'diag_2_Diabetes', 'diag_2_Digestive', 'diag_2_Injury', 'diag_2_Musculoskeletal', 'diag_2_Other', 'diag_2_Respiratory'
    , 'diag_3_Diabetes', 'diag_3_Digestive', 'diag_3_Injury', 'diag_3_Musculoskeletal', 'diag_3_Other', 'diag_3_Respiratory'
    , 'medical_specialty_Emergency/Trauma', 'medical_specialty_Family/GeneralPractice'
    , 'medical_specialty_InternalMedicine', 'medical_specialty_Other', 'medical_specialty_Surgery'
    , 'change_yes'
    , 'A1Ctest_no', 'A1Ctest_normal'
    , 'glucose_test_no', 'glucose_test_normal'
]

X = patient_readmission_encoded[x_cols]
y = patient_readmission_encoded['readmitted_ordinal']

print(f"Shape of X (features): {X.shape}")
print(f"Shape of y (target): {y.shape}")

# Call the train_test_split function - define the test size, random_state
# Assign the outputs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

Shape of X (features): (12465, 37)
Shape of y (target): (12465,)


<a id="train_test_4"></a>
### Training and Test Data for Model 4 (all 16 features)
Create the training and test data for the prediction model using scikit-learn's `train_test_split` function.

This training and test data uses 16 features.
1. `age`
2. `n_inpatient`
3. `n_outpatient`
4. `n_emergency`
5. `n_procedures`
6. `diag_1`
7. `diag_2`
8. `diag_3`
9. `diabetes_med`
10. `medical_specialty`
11. `glucose_test`
12. `A1Ctest`
13. `change`
14. `n_lab_procedures`
15. `time_in_hospital`
16. `n_medications`

In [None]:
# Define the features (X) and target variable (y)
x_cols = [
    # Strong indicators
    'age_[50-60)', 'age_[60-70)', 'age_[70-80)', 'age_[80-90)', 'age_[90-100)'
    , 'diabetes_med_yes', 'has_procedure_performed'
    , 'has_outpatient_visit', 'has_inpatient_visit', 'has_emergency_visit'
    , 'diag_1_Diabetes', 'diag_1_Digestive', 'diag_1_Injury', 'diag_1_Musculoskeletal', 'diag_1_Other', 'diag_1_Respiratory'
    , 'diag_2_Diabetes', 'diag_2_Digestive', 'diag_2_Injury', 'diag_2_Musculoskeletal', 'diag_2_Other', 'diag_2_Respiratory'
    , 'diag_3_Diabetes', 'diag_3_Digestive', 'diag_3_Injury', 'diag_3_Musculoskeletal', 'diag_3_Other', 'diag_3_Respiratory'
    , 'medical_specialty_Emergency/Trauma', 'medical_specialty_Family/GeneralPractice'
    , 'medical_specialty_InternalMedicine', 'medical_specialty_Missing'
    , 'medical_specialty_Other', 'medical_specialty_Surgery'
    , 'glucose_test_no', 'glucose_test_normal'
    , 'A1Ctest_no', 'A1Ctest_normal'
    , 'change_yes'

    # Potential useful indicators
    , 'lab_procedures_freq_low', 'lab_procedures_freq_med'
    , 'time_in_hospital_freq_med', 'time_in_hospital_freq_short'
    , 'medications_freq_low', 'medications_freq_med'
]

X = patient_readmission_encoded[x_cols]
y = patient_readmission_encoded['readmitted_ordinal']

print(f"Shape of X (features): {X.shape}")
print(f"Shape of y (target): {y.shape}")

# Call the train_test_split function - define the test size, random_state
# Assign the outputs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

Shape of X (features): (24779, 45)
Shape of y (target): (24779,)


<a id="train_pred_model"></a>
### Train the Predictive Model
Train the prediction model using the code cell below.

To train Model 1, before running the training cell below, run in order each code cell that:
1. loads the data.
2. removes the 'Missing' labels in the Diagnosis features.
3. creates all the 7 new features under the **Feature Engineering** section.
4. transforms the Nominal Data.
5. transforms the Target Feature.
6. creates the training and test data for Model 1.

---
To train Model 2, after completing the Model 1 steps above and before running the training cell below, run the code cell that:
1. creates the training and test data for Model 2.

---
To train Model 3, before running the training cell below, run in order each code cell that:
1. loads the data.
2. removes the 'Missing' labels in the Diagnosis features.
3. removes the 'Missing' labels in the Medical Specialty features.
4. creates all the 7 new features under the **Feature Engineering** section.
5. transforms the Nominal Data.
6. transforms the Target Feature.
7. creates the training and test data for Model 3.

---
To train Model 4, before running the training cell below, run in order each code cell that:
1. loads the data.
2. removes the 'Missing' labels in the Diagnosis features.
3. creates all the 7 new features under the **Feature Engineering** section.
4. transforms the Nominal Data.
5. transforms the Target Feature.
6. creates the training and test data for Model 4.

In [None]:
# Build a Logistic Regression model.

# Instantiate the model; hyperparameters (e.g. random_state) can be set here
lg_reg = LogisticRegression(random_state=40)

# Fit the model to the training data with the .fit() method
lg_reg.fit(X=X_train, y=y_train)

# "For the things we have to learn before we can do them, we learn by doing them." ~Aristotle

In [None]:
# This code cell is to save the trained model to a file.

# Save the trained model to a file
joblib.dump(lg_reg, 'patient_readmission_prediction_model.pkl')

print("Trained model saved to 'patient_readmission_prediction_model.pkl'")

Trained model saved to 'patient_readmission_prediction_model.pkl'


<a id="model_eval"></a>
# Model Evaluation

<a id="eval_model_1"></a>
### Evaluate the Performance of Model 1 (12 features)
We'll make predictions using the test data and use standard metrics to assess how well the model performs.


In [None]:
print("\nEvaluating model performance on the test set...")
y_pred = lg_reg.predict(X_test)

# Calculate key performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Display the confusion matrix to see a detailed breakdown of predictions
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

# False Positive Rate = False Positives / (False Positives + True Negatives)
fpr = conf_matrix[0][1] / (conf_matrix[0][1] + conf_matrix[0][0])
print(f"\nFPR is {fpr:.4f}")

# False Negative Rate = False Negatives / (False Negatives + True Positives)
fnr = conf_matrix[1][0] / (conf_matrix[1][0] + conf_matrix[1][1])
print(f"FNR is {fnr:.4f}")

# --- Interpret Model Coefficients (Feature Importance) ---
# For Logistic Regression, each raw coefficient is a change in the log-odds
# of readmission. The code below exponentiates the coefficients, so the
# 'Coefficient' column holds odds ratios: a value above 1 raises the odds
# of readmission, a value below 1 lowers them, and the further the value
# is from 1, the more influential the feature.

print("\nModel Coefficients (Feature Importance):")
feature_importance = pd.DataFrame({
    'Feature': x_cols,
    'Coefficient': lg_reg.coef_[0]
})
feature_importance['Coefficient'] = np.exp(feature_importance['Coefficient'])
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)

print(feature_importance.to_string(index=False))

# Interpretation:
# The exponentiated coefficient is the factor by which the odds of
# readmission are multiplied when a feature increases by one unit.
# For example, if 'has_inpatient_visit' has a value of 2.5, the odds of
# readmission for a patient with a prior inpatient visit are 2.5 times those
# of a patient with none, holding all other features constant.


Evaluating model performance on the test set...
Accuracy: 0.6096
Precision: 0.6097
Recall: 0.4730
F1-Score: 0.5327

Confusion Matrix:
[[1918  706]
 [1229 1103]]

FPR is 0.2691
FNR is 0.5270

Model Coefficients (Feature Importance):
                                 Feature  Coefficient
                     has_inpatient_visit     2.119694
                     has_emergency_visit     1.521697
                    has_outpatient_visit     1.509608
                        diabetes_med_yes     1.278455
                         diag_1_Diabetes     1.195931
      medical_specialty_Emergency/Trauma     1.189764
                             age_[80-90)     1.167283
                             age_[70-80)     1.160873
                  diag_2_Musculoskeletal     1.108927
               medical_specialty_Missing     1.081536
                             age_[60-70)     1.074715
medical_specialty_Family/GeneralPractice     1.029505
                              change_yes     1.025452
           

<a id="eval_model_2"></a>
### Evaluate the Performance of Model 2 (11 features without the Medical Specialty feature)
We'll make predictions using the test data and use standard metrics to assess how well the model performs.


In [None]:
print("\nEvaluating model performance on the test set...")
y_pred = lg_reg.predict(X_test)

# Calculate key performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Display the confusion matrix to see a detailed breakdown of predictions
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

# False Positive Rate = False Positives / (False Positives + True Negatives)
fpr = conf_matrix[0][1] / (conf_matrix[0][1] + conf_matrix[0][0])
print(f"\nFPR is {fpr:.4f}")

# False Negative Rate = False Negatives / (False Negatives + True Positives)
fnr = conf_matrix[1][0] / (conf_matrix[1][0] + conf_matrix[1][1])
print(f"FNR is {fnr:.4f}")

# --- Interpret Model Coefficients (Feature Importance) ---
# For Logistic Regression, each raw coefficient is a change in the log-odds
# of readmission. The code below exponentiates the coefficients, so the
# 'Coefficient' column holds odds ratios: a value above 1 raises the odds
# of readmission, a value below 1 lowers them, and the further the value
# is from 1, the more influential the feature.

print("\nModel Coefficients (Feature Importance):")
feature_importance = pd.DataFrame({
    'Feature': x_cols,
    'Coefficient': lg_reg.coef_[0]
})
feature_importance['Coefficient'] = np.exp(feature_importance['Coefficient'])
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)

print(feature_importance.to_string(index=False))

# Interpretation:
# The exponentiated coefficient is the factor by which the odds of
# readmission are multiplied when a feature increases by one unit.
# For example, if 'has_inpatient_visit' has a value of 2.5, the odds of
# readmission for a patient with a prior inpatient visit are 2.5 times those
# of a patient with none, holding all other features constant.


Evaluating model performance on the test set...
Accuracy: 0.6118
Precision: 0.6114
Recall: 0.4803
F1-Score: 0.5379

Confusion Matrix:
[[1912  712]
 [1212 1120]]

FPR is 0.2713
FNR is 0.5197

Model Coefficients (Feature Importance):
               Feature  Coefficient
   has_inpatient_visit     2.103327
   has_emergency_visit     1.541418
  has_outpatient_visit     1.530117
      diabetes_med_yes     1.272149
           age_[80-90)     1.182849
       diag_1_Diabetes     1.182800
           age_[70-80)     1.160949
diag_2_Musculoskeletal     1.092285
           age_[60-70)     1.073600
            change_yes     1.038787
    diag_1_Respiratory     1.026980
      diag_3_Digestive     1.017180
       glucose_test_no     0.997148
    diag_3_Respiratory     0.991947
   glucose_test_normal     0.978228
           age_[50-60)     0.974654
            A1Ctest_no     0.971210
    diag_2_Respiratory     0.955510
          diag_2_Other     0.954163
      diag_1_Digestive     0.945298
diag_3_Musc

<a id="eval_model_3"></a>
### Evaluate the Performance of Model 3 (12 features without the 'Missing' label in the Medical Specialty Feature)
We'll make predictions using the test data and use standard metrics to assess how well the model performs.


In [None]:
print("\nEvaluating model performance on the test set...")
y_pred = lg_reg.predict(X_test)

# Calculate key performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Display the confusion matrix to see a detailed breakdown of predictions
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

# Layout of the confusion matrix:
# [[TN  FP]
#  [FN  TP]]

# False Positive Rate = False Positives / (False Positives + True Negatives)
fpr = conf_matrix[0][1] / (conf_matrix[0][1] + conf_matrix[0][0])
print(f"\nFPR is {fpr:.4f}")

# False Negative Rate = False Negatives / (False Negatives + True Positives)
fnr = conf_matrix[1][0] / (conf_matrix[1][0] + conf_matrix[1][1])
print(f"FNR is {fnr:.4f}")

# --- Interpret Model Coefficients (Feature Importance) ---
# For Logistic Regression, each raw coefficient is a change in the log-odds
# of readmission. The code below exponentiates the coefficients, so the
# 'Coefficient' column holds odds ratios: a value above 1 raises the odds
# of readmission, a value below 1 lowers them, and the further the value
# is from 1, the more influential the feature.

print("\nModel Coefficients (Feature Importance):")
feature_importance = pd.DataFrame({
    'Feature': x_cols,
    'Coefficient': lg_reg.coef_[0]
})
feature_importance['Coefficient'] = np.exp(feature_importance['Coefficient'])
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)

print(feature_importance.to_string(index=False))

# Interpretation:
# The exponentiated coefficient is the factor by which the odds of
# readmission are multiplied when a feature increases by one unit.
# For example, if 'has_inpatient_visit' has a value of 2.5, the odds of
# readmission for a patient with a prior inpatient visit are 2.5 times those
# of a patient with none, holding all other features constant.


Evaluating model performance on the test set...
Accuracy: 0.6017
Precision: 0.5894
Recall: 0.4299
F1-Score: 0.4972

Confusion Matrix:
[[1009  342]
 [ 651  491]]

FPR is 0.2531
FNR is 0.5701

Model Coefficients (Feature Importance):
                                 Feature  Coefficient
                     has_inpatient_visit     2.076534
                     has_emergency_visit     1.601723
                    has_outpatient_visit     1.445834
                        diabetes_med_yes     1.247610
                         diag_1_Diabetes     1.216373
                             age_[80-90)     1.136835
      medical_specialty_Emergency/Trauma     1.136129
                             age_[70-80)     1.108507
medical_specialty_Family/GeneralPractice     1.105990
                     glucose_test_normal     1.103231
                              A1Ctest_no     1.086935
                  diag_3_Musculoskeletal     1.080187
                      diag_3_Respiratory     1.078127
           

<a id="eval_model_4"></a>
### Evaluate the Performance of Model 4 (all 16 features)
We'll make predictions using the test data and use standard metrics to assess how well the model performs.


In [None]:
print("\nEvaluating model performance on the test set...")
y_pred = lg_reg.predict(X_test)

# Calculate key performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Display the confusion matrix to see a detailed breakdown of predictions
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

# False Positive Rate = False Positives / (False Positives + True Negatives)
fpr = conf_matrix[0][1] / (conf_matrix[0][1] + conf_matrix[0][0])
print(f"\nFPR is {fpr:.4f}")

# False Negative Rate = False Negatives / (False Negatives + True Positives)
fnr = conf_matrix[1][0] / (conf_matrix[1][0] + conf_matrix[1][1])
print(f"FNR is {fnr:.4f}")

# --- Interpret Model Coefficients (Feature Importance) ---
# For Logistic Regression, each raw coefficient is a change in the log-odds
# of readmission. The code below exponentiates the coefficients, so the
# 'Coefficient' column holds odds ratios: a value above 1 raises the odds
# of readmission, a value below 1 lowers them, and the further the value
# is from 1, the more influential the feature.

print("\nModel Coefficients (Feature Importance):")
feature_importance = pd.DataFrame({
    'Feature': x_cols,
    'Coefficient': lg_reg.coef_[0]
})
feature_importance['Coefficient'] = np.exp(feature_importance['Coefficient'])
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)

print(feature_importance.to_string(index=False))

# Interpretation:
# The exponentiated coefficient is the factor by which the odds of
# readmission are multiplied when a feature increases by one unit.
# For example, if 'has_inpatient_visit' has a value of 2.5, the odds of
# readmission for a patient with a prior inpatient visit are 2.5 times those
# of a patient with none, holding all other features constant.


Evaluating model performance on the test set...
Accuracy: 0.6100
Precision: 0.6097
Recall: 0.4756
F1-Score: 0.5343

Confusion Matrix:
[[1914  710]
 [1223 1109]]

FPR is 0.2706
FNR is 0.5244

Model Coefficients (Feature Importance):
                                 Feature  Coefficient
                     has_inpatient_visit     2.051865
                    has_outpatient_visit     1.518723
                     has_emergency_visit     1.518507
                        diabetes_med_yes     1.243581
                         diag_1_Diabetes     1.199817
      medical_specialty_Emergency/Trauma     1.193207
                             age_[70-80)     1.132428
                             age_[80-90)     1.125072
                  diag_2_Musculoskeletal     1.100373
                             age_[60-70)     1.056427
               medical_specialty_Missing     1.048834
                 lab_procedures_freq_med     1.026412
                    medications_freq_med     1.015645
           

<a id="eval_model_summary"></a>
### Model Performance

|Measures|Model 1|Model 2|Model 3|Model 4|
|--------|----|---|---|---|
|Accuracy|0.6096|0.6118|0.6017|0.6100|
|Precision|0.6097|0.6114|0.5894|0.6097|
|Recall|0.4730|0.4803|0.4299|0.4756|
|F1-Score|0.5327|0.5379|0.4972|0.5343|
|FPR|0.2691|0.2713|0.2531|0.2706|
|FNR|0.5270|0.5197|0.5701|0.5244|

||
|--|
|Model 1 uses 12 features: `age`, `diabetes_med`, `n_inpatient`, `n_outpatient`, `n_emergency`, `A1Ctest`, `glucose_test`, `change`, `medical_specialty`, `diag_1`, `diag_2`, and `diag_3`.|
|Model 2 uses 11 features: `age`, `diabetes_med`, `n_inpatient`, `n_outpatient`, `n_emergency`, `A1Ctest`, `glucose_test`, `change`, `diag_1`, `diag_2`, and `diag_3`.|
|Model 3 uses the same 12 features as Model 1 but drops the records labelled `Missing` in the `medical_specialty` feature, which removes around 12k records (24,779 down to 12,465).|
|Model 4 uses all 16 features.|

The purpose of the predictive model is to identify patients at high risk of readmission so that doctors can focus their follow-up efforts.

Accuracy and precision are similar across the models, with Model 2 performing slightly better at 61.18% and 61.14% respectively.

Accuracy is the share of all test-set predictions (readmitted or not) that are correct. The model predicts correctly whether a patient is readmitted about 6 times out of 10.

Precision is the share of predicted readmissions that were actual readmissions. Among all the patients the model predicted would be readmitted, it was correct about 6 times out of 10.

Because the hospital's priority is to find the patients who are actually readmitted, Recall is the most relevant metric.

Recall is the share of actually readmitted patients that the model identifies. With a recall of 48.03%, Model 2 identifies close to 5 out of 10 patients who were actually readmitted.

Model 2 has the highest False Positive Rate (27.13%) among the models, but the gap to the other models is small. This is the likely cost of its small improvement in Recall.

Model 2 has the lowest False Negative Rate (51.97%) among the models, which means it missed the fewest actual readmissions. The False Negative Rate is the proportion of actual readmissions that the model missed, so it equals 1 - Recall.
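To make the metric definitions concrete, the sketch below recomputes them by hand from the Model 2 confusion matrix printed in the evaluation output. It is an illustration in plain Python rather than the notebook's own code, and the variable names are chosen here for readability.

```python
# Model 2 confusion matrix from the evaluation output, laid out as
# [[TN, FP],
#  [FN, TP]]
tn, fp = 1912, 712
fn, tp = 1212, 1120

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # correct among predicted readmissions
recall = tp / (tp + fn)                      # found among actual readmissions
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)                         # false alarms among non-readmitted
fnr = fn / (fn + tp)                         # misses among actual readmissions

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1-Score", f1),
    ("FPR", fpr),
    ("FNR", fnr),
]:
    print(f"{name}: {value:.4f}")
```

The printed values should agree with the Model 2 column of the performance table, and `fnr` and `recall` sum to 1 because both are measured against the same group of actual readmissions.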


<a id="model_coeff"></a>
### Models' Coefficients

|#|Model 1||Model 2||Model 3||Model 4||
|-|-|-|-|-|-|-|-|-|
||Feature|Coefficient|Feature|Coefficient|Feature|Coefficient|Feature | Coefficient|
|1|has_inpatient_visit     |2.119694|has_inpatient_visit     |2.103327|has_inpatient_visit|2.076534|has_inpatient_visit|2.051865|
|2|has_emergency_visit     |1.521697|has_emergency_visit     |1.541418|has_emergency_visit|1.601723|has_outpatient_visit|1.518723|
|3|has_outpatient_visit     |1.509608|has_outpatient_visit     |1.530117|has_outpatient_visit     |1.445834|has_emergency_visit|     1.518507|
|4|diabetes_med_yes     |1.278455|diabetes_med_yes     |1.272149|diabetes_med_yes     |1.247610|diabetes_med_yes|     1.243581|
|5|diag_1_Diabetes     |1.195931|age_[80-90)     |1.182849|diag_1_Diabetes     |1.216373|diag_1_Diabetes|     1.199817|
|6|medical_specialty_Emergency/Trauma     |1.189764|diag_1_Diabetes     |1.182800|age_[80-90)|1.136835|medical_specialty_Emergency/Trauma|     1.193207|
|7|age_[80-90)     |1.167283 | age_[70-80) |     1.160949|medical_specialty_Emergency/Trauma | 1.136129| age_[70-80)|     1.132428|
|8|age_[70-80)     |1.160873|diag_2_Musculoskeletal |    1.092285|age_[70-80) | 1.108507 | age_[80-90)|     1.125072|
|9|diag_2_Musculoskeletal     |1.108927|age_[60-70) |     1.073600|medical_specialty_Family/GeneralPractice | 1.105990|diag_2_Musculoskeletal|     1.100373|
|10|medical_specialty_Missing     |1.081536|change_yes   |  1.038787|glucose_test_normal     |1.103231|age_[60-70)|     1.056427|
|11|age_[60-70)     |1.074715|diag_1_Respiratory   |  1.026980| A1Ctest_no |    1.086935|medical_specialty_Missing|     1.048834|
|12|medical_specialty_Family/GeneralPractice     |1.029505|diag_3_Digestive  |   1.017180| diag_3_Musculoskeletal  | 1.080187 |lab_procedures_freq_med|     1.026412|
|13|change_yes     |1.025452|glucose_test_no   |  0.997148| diag_3_Respiratory  | 1.078127 |medications_freq_med|     1.015645|
|14|diag_1_Respiratory     |1.015956|diag_3_Respiratory  |    0.991947| age_[60-70)  | 1.068953 |A1Ctest_no|     1.013542|
|15|diag_3_Digestive     |1.012041|glucose_test_normal |    0.978228| change_yes   | 1.046505 |diag_3_Digestive|     1.010112|
|16|diag_3_Respiratory     |0.991012|age_[50-60)  |     0.974654| diag_3_Injury  | 1.018170 |glucose_test_normal|     1.006984|
|17|A1Ctest_no     |0.980323|A1Ctest_no   |   0.971210| diag_3_Digestive  | 1.017537 |medical_specialty_Family/GeneralPractice|     1.004283|
|18|age_[50-60)     |0.977327|diag_2_Respiratory | 0.955510| diag_2_Other  | 0.977836 |change_yes|     0.989937|
|19|glucose_test_normal     |0.973613|diag_2_Other     | 0.954163| diag_1_Digestive  | 0.975885 |time_in_hospital_freq_med|     0.987429|
|20|glucose_test_no     |0.973480|diag_1_Digestive    | 0.945298| diag_2_Injury  | 0.974065 |diag_1_Respiratory|     0.977772|
|21|diag_1_Digestive     |0.948900|diag_3_Musculoskeletal     | 0.939989| age_[50-60)  | 0.964410 |diag_3_Respiratory|     0.973695|
|22|diag_2_Other     |0.947899|diag_2_Diabetes     | 0.931662| diag_2_Respiratory |     0.960585 |age_[50-60)|     0.969238|
|23|diag_2_Respiratory     |0.945935|diag_3_Diabetes     | 0.929861| medical_specialty_InternalMedicine  |    0.954544 |diag_2_Diabetes|     0.968111|
|24|diag_2_Diabetes     |0.943032|diag_3_Injury     | 0.927349| diag_1_Respiratory  |    0.950203 |diag_1_Digestive|     0.965646|
|25|diag_3_Diabetes     |0.937483|diag_3_Other     | 0.925728| age_[90-100)  |    0.940296 |glucose_test_no|     0.964578|
|26|diag_3_Musculoskeletal     |0.934648|diag_2_Digestive     | 0.894780| A1Ctest_normal  |    0.935728 |diag_3_Diabetes|     0.957470|
|27|diag_3_Injury     |0.924315|diag_1_Other     | 0.859580| diag_3_Diabetes  |    0.915632 |diag_2_Other|     0.942926|
|28|diag_3_Other     |0.919039|age_[90-100)     | 0.849391| diag_2_Diabetes  |    0.913405 |diag_3_Musculoskeletal|     0.938605|
|29|medical_specialty_InternalMedicine     |0.889534|A1Ctest_normal     | 0.849274| diag_1_Injury  |    0.898535 |lab_procedures_freq_low|     0.937300|
|30|diag_2_Digestive     |0.886854|diag_2_Injury     | 0.824818| glucose_test_no  |    0.897410 |diag_2_Respiratory|     0.921156|
|31|medical_specialty_Other     |0.868929|diag_1_Injury     | 0.812243| diag_1_Other  |    0.890426 |diag_3_Other|     0.916419|
|32|diag_1_Other     |0.868837|diag_1_Musculoskeletal     | 0.713807| diag_2_Digestive  |    0.887815 |diag_3_Injury|     0.907784|
|33|A1Ctest_normal     |0.851260|-|-| medical_specialty_Other  |    0.880010 |diag_2_Digestive|     0.887722|
|34|diag_1_Injury     |0.832436|-|-| diag_3_Other  |    0.873788 |time_in_hospital_freq_short|     0.870792|
|35|age_[90-100)     |0.831668|-|-| diag_2_Musculoskeletal  |    0.860604 |has_procedure_performed|     0.864049|
|36|diag_2_Injury     |0.824904|-|-| medical_specialty_Surgery  |    0.857457 |medical_specialty_Other|     0.862092|
|37|medical_specialty_Surgery     |0.796772|-|-| diag_1_Musculoskeletal  |    0.800616 |diag_1_Other|     0.860000|
|38|diag_1_Musculoskeletal     |0.763534|-|-|-|-|medical_specialty_InternalMedicine|     0.850367|
|39|-|-|-|-|-|-|A1Ctest_normal|     0.844889|
|40|-|-|-|-|-|-|diag_1_Injury|     0.834479|
|41|-|-|-|-|-|-|diag_2_Injury|     0.826582|
|42|-|-|-|-|-|-|medications_freq_low|     0.815652|
|43|-|-|-|-|-|-|age_[90-100)|     0.799895|
|44|-|-|-|-|-|-|medical_specialty_Surgery|     0.792607|
|45|-|-|-|-|-|-|diag_1_Musculoskeletal|     0.782542|


Model 2’s feature importance indicates which features have the most impact on the readmission probability, which helps doctors focus on the patient groups at higher risk of readmission. The values in the table are exponentiated coefficients (odds ratios): a value above 1 increases the odds of readmission and a value below 1 decreases them. The further the value is from 1, the more influential the feature.

For example, the odds of readmission for a patient who had a prior inpatient visit are more than twice those of a patient who has not had one, all else being equal.
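An odds ratio multiplies the odds, not the probability, so the effect on the probability depends on where the patient starts. The sketch below converts the Model 2 odds ratio for `has_inpatient_visit` into a probability for one baseline. The baseline of 0.30 and the helper `apply_odds_ratio` are hypothetical, chosen only for illustration.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Return the probability after multiplying the baseline odds by odds_ratio."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


odds_ratio = 2.103327   # has_inpatient_visit in Model 2, from the table above
baseline_prob = 0.30    # hypothetical baseline probability, for illustration only

new_prob = apply_odds_ratio(baseline_prob, odds_ratio)
print(f"Baseline probability: {baseline_prob:.2f}")
print(f"Probability with a prior inpatient visit: {new_prob:.2f}")
```

With this baseline the probability rises but does not double, which is why the table values are best read as changes in odds.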

<a id="recommendations"></a>
# Recommendations
Based on the EDA and model evaluation, we suggest that the hospital:
1. pilot the prediction model to support its patient readmission reduction effort. The model currently identifies close to 5 out of 10 patients who were actually readmitted, and these are the highest-risk individuals according to our analysis. Targeted follow-up for these patients could still help reduce overall readmission rates.
2. focus on patients with the following characteristics, as these have the strongest influence on a patient’s readmission probability:
   1. had a hospital visit before the current hospital stay (especially inpatient, emergency, and outpatient visits),
   2. were prescribed diabetes medication during the hospital stay,
   3. have Diabetes as the primary diagnosis,
   4. have a Musculoskeletal condition as the secondary diagnosis,
   5. are in the 70-90 age group.

   Doctors can use the Prediction Dashboard to look up a patient's readmission probability when planning follow-up.


<a id="next_steps"></a>
# Next Steps
For future improvement of the predictive model, we can:
1. explore strategies such as lowering the classification threshold to improve Recall, at the cost of a higher False Positive Rate,
2. explore further feature engineering, such as categorising the intensity of procedures, lab procedures, or medications,
3. train the predictive model on a larger, more diverse dataset.
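The threshold trade-off in the first item can be sketched in plain Python. The labels and probabilities below are made up for illustration and `recall_and_fpr` is a hypothetical helper. In the notebook, the probabilities would presumably come from `lg_reg.predict_proba(X_test)[:, 1]` instead.

```python
def recall_and_fpr(y_true, y_prob, threshold):
    """Classify with the given threshold and return (recall, false positive rate)."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr


# Hypothetical labels (1 = readmitted) and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.72, 0.55, 0.46, 0.38, 0.61, 0.44, 0.33, 0.20]

for threshold in (0.5, 0.4):
    recall, fpr = recall_and_fpr(y_true, y_prob, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, FPR={fpr:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.4 catches one more actual readmission and adds one more false alarm. The same pattern would need to be checked on the real test set before choosing a threshold.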


<a id="appendix"></a>
# Appendix
* **Acknowledgments**: [Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.](https://onlinelibrary.wiley.com/doi/10.1155/2014/781670)