# `PATIENT CHURN PREDICTION`

This project builds models that predict whether a patient will leave or stay with a given healthcare provider.

These predictions will help the provider understand how likely each patient is to leave or stay. The provider can then apply targeted retention strategies to the patients the models flag as likely to churn.

# `ABOUT THE DATASET USED`

This Patient Churn dataset represents synthetic healthcare data of 500 patients designed for churn prediction and machine learning analysis. Each record contains demographic details like Age and Gender, along with service-related features such as Tenure in months, number of hospital Visits in the last year, presence of Chronic Disease, and type of Insurance. It also includes behavioral and financial indicators like Satisfaction Score, Total Bill Amount, and number of Missed Appointments. The target column, Churn, indicates whether a patient has stopped using the hospital's services (1 = churned, 0 = active). Overall, this dataset is suitable for studying patient retention patterns and building predictive models to identify patients at risk of leaving a healthcare provider.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")


In [4]:
# Load dataset (forward slashes work on Windows, macOS and Linux)
df = pd.read_csv("Data/patient_churn_dataset.csv")

df.head()

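| | Patient_ID | Age | Gender | Tenure_Months | Visits_Last_Year | Chronic_Disease | Insurance_Type | Satisfaction_Score | Total_Bill_Amount | Missed_Appointments | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | Female | 2 | 3 | No | Government | 2.5 | 12252.96 | 9 | 1 |
| 1 | 2 | 69 | Male | 10 | 3 | Yes | Government | 2.6 | 25862.01 | 4 | 0 |
| 2 | 3 | 46 | Female | 56 | 10 | No | NaN | 2.8 | 5659.13 | 4 | 0 |
| 3 | 4 | 32 | Male | 30 | 4 | Yes | Government | 4.1 | 19533.31 | 5 | 0 |
| 4 | 5 | 60 | Male | 50 | 19 | No | Private | 4.6 | 24639.52 | 5 | 0 |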


In [5]:
# Check dataset info
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Patient_ID           500 non-null    int64  
 1   Age                  500 non-null    int64  
 2   Gender               500 non-null    str    
 3   Tenure_Months        500 non-null    int64  
 4   Visits_Last_Year     500 non-null    int64  
 5   Chronic_Disease      500 non-null    str    
 6   Insurance_Type       350 non-null    str    
 7   Satisfaction_Score   500 non-null    float64
 8   Total_Bill_Amount    500 non-null    float64
 9   Missed_Appointments  500 non-null    int64  
 10  Churn                500 non-null    int64  
dtypes: float64(2), int64(6), str(3)
memory usage: 43.1 KB


In [10]:
# Check for duplicated entries (rows identical across all columns)
print(f"There are {df.duplicated().sum()} duplicated rows in the dataset")

There are 0 duplicated rows in the dataset
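
`df.duplicated()` only flags rows that match on every column, so a patient entered twice with slightly different values would pass this check. The sketch below shows one way to also check the identifier column on its own. The four-row `sample` frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: Patient_ID 2 appears twice with different ages
sample = pd.DataFrame({
    "Patient_ID": [1, 2, 2, 3],
    "Age": [56, 69, 70, 46],
})

# Rows that are identical across every column
full_row_dupes = sample.duplicated().sum()

# Rows whose Patient_ID has already appeared earlier in the frame
id_dupes = sample.duplicated(subset="Patient_ID").sum()

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Repeated Patient_ID values: {id_dupes}")
```

In the notebook, the same `subset="Patient_ID"` call could be applied to `df` directly.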


In [11]:
# Check for null values
df.isnull().sum()

Patient_ID               0
Age                      0
Gender                   0
Tenure_Months            0
Visits_Last_Year         0
Chronic_Disease          0
Insurance_Type         150
Satisfaction_Score       0
Total_Bill_Amount        0
Missed_Appointments      0
Churn                    0
dtype: int64

## Notes

- The `Insurance_Type` column has 150 missing values (only 350 of its 500 entries are non-null)
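
Raw null counts are easier to judge as a share of the rows. A minimal sketch of that calculation follows, using a made-up five-row `sample` frame in place of the real `df`:

```python
import pandas as pd

# Hypothetical sample; the notebook would use the full 500-row df instead
sample = pd.DataFrame({
    "Patient_ID": [1, 2, 3, 4, 5],
    "Insurance_Type": ["Government", "Government", None, "Government", "Private"],
})

# isnull() gives True/False per cell; the column mean is the missing fraction
missing_share = sample.isnull().mean().mul(100).round(1)

print(missing_share)
```

Applied to `df`, the same expression would report the percentage of missing entries for every column at once.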

In [19]:
# Since Insurance_Type column has some missing values we will explore it further
df["Insurance_Type"].unique()

<StringArray>
['Government', nan, 'Private']
Length: 3, dtype: str

## Notes

- The column has two distinct categories, `Government` and `Private`, plus missing values. This will inform how we handle the nulls later during modelling
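
One possible way to handle these nulls is to treat "missing" as a category of its own rather than dropping rows or guessing a value. This is only a sketch of that option, not necessarily the approach used later in the project. The `sample` frame and the `"Unknown"` label are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the three states seen in Insurance_Type
sample = pd.DataFrame({
    "Insurance_Type": ["Government", "Government", None, "Government", "Private"],
})

# Count every state, including missing values, before deciding what to do
counts = sample["Insurance_Type"].value_counts(dropna=False)
print(counts)

# Option: keep the rows and label missing entries as their own category
filled = sample["Insurance_Type"].fillna("Unknown")
print(filled.value_counts())
```

Keeping a separate category preserves all rows and lets a model learn whether missing insurance information is itself related to churn.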

In [15]:
# Get characteristics of numerical columns
df.describe()

| | Patient_ID | Age | Tenure_Months | Visits_Last_Year | Satisfaction_Score | Total_Bill_Amount | Missed_Appointments | Churn |
|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 250.5 | 49.91 | 29.498 | 9.412 | 3.049 | 24440.01518 | 4.422 | 0.248 |
| std | 144.481833 | 18.221909 | 17.402865 | 5.687086 | 1.148136 | 14264.55432 | 2.908752 | 0.432284 |
| min | 1.0 | 18.0 | 1.0 | 0.0 | 1.0 | 506.67 | 0.0 | 0.0 |
| 25% | 125.75 | 35.0 | 14.0 | 4.0 | 2.0 | 13576.0875 | 2.0 | 0.0 |
| 50% | 250.5 | 50.0 | 30.0 | 9.0 | 3.1 | 23996.56 | 4.0 | 0.0 |
| 75% | 375.25 | 66.0 | 45.0 | 14.0 | 4.1 | 36289.66 | 7.0 | 0.0 |
| max | 500.0 | 79.0 | 59.0 | 19.0 | 5.0 | 49892.13 | 9.0 | 1.0 |
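
Because `Churn` is coded 0/1, its mean in the summary above (0.248) is the share of churned patients, roughly one in four. A small sketch of reading the class balance directly is shown below. The eight-value `churn` series is made up for illustration; in the notebook it would be `df["Churn"]`.

```python
import pandas as pd

# Hypothetical target column with two churned patients out of eight
churn = pd.Series([1, 0, 0, 0, 0, 1, 0, 0], name="Churn")

# For a 0/1 column the mean equals the fraction of 1s
churn_rate = churn.mean()

# Share of each class, indexed by the class label
class_share = churn.value_counts(normalize=True)

print(f"Churn rate: {churn_rate:.1%}")
print(class_share)
```

Knowing the class balance early helps when choosing evaluation metrics, since accuracy alone can look good on an imbalanced target.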


In [16]:
# Get characteristics of categorical columns
df.describe(include="O")

| | Gender | Chronic_Disease | Insurance_Type |
|---|---|---|---|
| count | 500 | 500 | 350 |
| unique | 2 | 2 | 2 |
| top | Female | No | Government |
| freq | 250 | 260 | 180 |
