<a href="https://colab.research.google.com/github/jaynetra/AIForHealthCare_Mimic3/blob/main/ML_DL_mimicData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Insurance Type of the Patient Based on Hospital Admission (Project Notebook)
Author: Jayanthi Suryanarayana
   

Insurance type is a key data point for estimating a provider's revenue. For a payer, it can give insight into product design. In this notebook, the MIMIC-III data is used to predict a patient's insurance type from admission data.
Note: This assignment focuses on the modeling aspect, since prior assignments covered the data. The approach is to bootstrap from the tutorial referenced in class for the data, and to focus on model aspects for the solution.


## Table of Contents

1. [**Project Overview**](#Overview)
1.1 [Problem Statement](#Problem_statement)
1.2 [Metrics](#Metrics)
2. [**Data Exploration and Feature Engineering**](#Data-Exploration)   
2.1 [ADMISSIONS.csv](#ADMISSIONS.csv)   
3. [**Model Implementation**](#Implementation)  
4. [**Results**](#Results)
5. [**Conclusion**](#Conclusion)

<a class="anchor" id="Overview"></a>
## 1. Project Overview

Predictive analytics is an increasingly important tool in the healthcare field since modern machine learning (ML) methods can use large amounts of available data to predict individual outcomes for patients. For example, ML predictions can help healthcare providers determine likelihoods of disease, aid in diagnosis, recommend treatment, and predict future wellness. For this project, I chose to focus on the insurance type (the `INSURANCE` column) of patients in the admissions table.

<a class="anchor" id="Problem_statement"></a>
### 1.1 Problem Statement

The goal of this project is to create a model that predicts the insurance type of each patient at the time of admission. Because each admission carries a single `INSURANCE` label drawn from several categories, a multiclass classifier is a suitable model.
<a class="anchor" id="Metrics"></a>
### 1.2 Metrics
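
The modeling cell later in this notebook scores predictions with `accuracy_score`. As a minimal sketch, assuming toy labels invented for illustration (not MIMIC-III results), accuracy can be computed next to macro-averaged F1. Macro F1 is an added suggestion here, not a metric this notebook reports; it weights every insurance class equally, which may matter if the classes are imbalanced.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only; not taken from the MIMIC-III data.
y_true = ["Medicare", "Medicare", "Private", "Private", "Medicaid", "Medicaid"]
y_hat = ["Medicare", "Medicare", "Private", "Medicare", "Medicaid", "Private"]

# Accuracy: fraction of admissions whose predicted class matches the true class.
acc = accuracy_score(y_true, y_hat)

# Macro F1: unweighted mean of the per-class F1 scores,
# so a rare class counts as much as a common one.
macro_f1 = f1_score(y_true, y_hat, average="macro")

print(f"accuracy={acc:.3f}, macro_f1={macro_f1:.3f}")
```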








In [1]:
# Imports
import pandas as pd


The following CSV file was downloaded from the MIMIC-III database [source](https://mimic.physionet.org/).

In [2]:
# Primary Admissions information
df = pd.read_csv('/content/admisions.csv')



<a class="anchor" id="Data-Exploration"></a>
## 2. Data Exploration and Feature Engineering

In this section, I'll examine the imported MIMIC DataFrame to understand how the data is distributed. Additionally, I need to identify the target insurance type (`INSURANCE`) values and understand which features (independent variables) may be useful in predicting them.

**NOTE**: I have to perform a good amount of feature engineering in this section to enable the visualization of certain data categories. Any additional data 'tidiness' problems will be addressed in the 'Data Preprocessing' section including a comprehensive preprocessing function.
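
One way to look at how the target is distributed is sketched below. It assumes a small toy DataFrame with invented values standing in for the admissions table; only the column names `INSURANCE` and `ADMISSION_TYPE` come from the data used in this notebook.

```python
import pandas as pd

# Toy stand-in for the admissions table; values are invented for illustration.
toy = pd.DataFrame({
    "INSURANCE": ["Medicare", "Medicare", "Private", "Medicaid", "Medicare", "Private"],
    "ADMISSION_TYPE": ["EMERGENCY", "EMERGENCY", "URGENT", "EMERGENCY", "ELECTIVE", "ELECTIVE"],
})

# Share of each target class; the largest share is the accuracy
# a model would reach by always predicting that class.
class_share = toy["INSURANCE"].value_counts(normalize=True)

# Row-normalized cross-tabulation of a candidate feature against the target.
by_admission = pd.crosstab(toy["ADMISSION_TYPE"], toy["INSURANCE"], normalize="index")

print(class_share)
print(by_admission)
```

Running the same two calls on the real `df` would show whether one insurance class dominates and whether admission type separates the classes at all.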

<a class="anchor" id="ADMISSIONS.csv"></a>
### 2.1 ADMISSIONS.csv Exploration

From [MIMIC](https://mimic.physionet.org/mimictables/admissions/): The ADMISSIONS table gives information regarding a patient’s admission to the hospital. Since each unique hospital visit for a patient is assigned a unique HADM_ID, the ADMISSIONS table can be considered as a definition table for HADM_ID. Information available includes timing information for admission and discharge, demographic information, the source of the admission, and so on.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58878 entries, 0 to 58877
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   SUBJECT_ID          58878 non-null  int64  
 1   HADM_ID             58878 non-null  int64  
 2   ADMITTIME           58878 non-null  object 
 3   DEATHTIME           5774 non-null   object 
 4   ADMISSION_TYPE      58878 non-null  object 
 5   ADMISSION_LOCATION  58878 non-null  object 
 6   DISCHARGE_LOCATION  58878 non-null  object 
 7   INSURANCE           58878 non-null  object 
 8   LANGUAGE            33606 non-null  object 
 9   RELIGION            58878 non-null  object 
 10  MARITAL_STATUS      58878 non-null  object 
 11  ETHNICITY           58878 non-null  object 
 12  DIAGNOSIS           58853 non-null  object 
 13  LOS                 58878 non-null  float64
 14  DECEASED            58878 non-null  int64  
dtypes: float64(1), int64(3), object(11)
memory usage: 6.7
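
The non-null counts above imply that `DEATHTIME` is missing for roughly 90% of rows and `LANGUAGE` for roughly 43%, while `DIAGNOSIS` is almost complete. A minimal sketch of computing those shares directly, assuming a toy DataFrame with invented values in place of `df`:

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the values are invented, not MIMIC-III records.
toy = pd.DataFrame({
    "DEATHTIME": [np.nan, np.nan, "2139-02-20 10:00:00", np.nan],
    "LANGUAGE": ["ENGL", np.nan, np.nan, "ENGL"],
    "DIAGNOSIS": ["STAB WOUND", "SUBDURAL HEMATOMA", np.nan, "STATUS EPILEPTICUS"],
    "INSURANCE": ["Medicare", "Private", "Medicare", "Medicaid"],
})

# isna() marks missing cells; the column mean of that boolean mask
# is the fraction of missing values per column.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_share)
```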

In [4]:
# do prediction of insurance just from one table
df.head()

| | SUBJECT_ID | HADM_ID | ADMITTIME | DEATHTIME | ADMISSION_TYPE | ADMISSION_LOCATION | DISCHARGE_LOCATION | INSURANCE | LANGUAGE | RELIGION | MARITAL_STATUS | ETHNICITY | DIAGNOSIS | LOS | DECEASED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3115 | 134067 | 2139-02-13 03:11:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Medicare | NaN | RELIGIOUS | UNKNOWN (DEFAULT) | WHITE | STAB WOUND | 7.181944 | 0 |
| 1 | 7124 | 109129 | 2188-07-11 00:58:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Medicare | NaN | RELIGIOUS | UNKNOWN (DEFAULT) | WHITE | PENILE LACERATION-CELLULITIS | 21.4625 | 0 |
| 2 | 10348 | 121510 | 2133-04-16 21:12:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Medicare | NaN | RELIGIOUS | UNKNOWN (DEFAULT) | OTHER/UNKNOWN | STATUS EPILEPTICUS | 6.777778 | 0 |
| 3 | 9396 | 106469 | 2109-02-16 23:14:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Medicare | NaN | RELIGIOUS | UNKNOWN (DEFAULT) | WHITE | SUBDURAL HEMATOMA | 6.532639 | 0 |
| 4 | 9333 | 133732 | 2167-10-06 18:35:00 | NaN | URGENT | TRANSFER FROM HOSP/EXTRAM | SNF | Private | NaN | RELIGIOUS | UNKNOWN (DEFAULT) | OTHER/UNKNOWN | CORONARY ARTERY DISEASE | 9.776389 | 0 |


In [5]:
!pip install --upgrade numpy
!pip install --upgrade catboost

Collecting numpy
  Using cached numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Using cached numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
catboost 1.2.7 requires numpy<2.0,>=1.16.0, but you have numpy 2.2.4 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.2.4 which is incompatible.
tensorflow 2.18.0 requires numpy<2.1.0,>=1.26.0, but you have numpy 2.2.4 which is incompatible.
Successfully installed numpy-2.2.4
Collecting numpy<2.0,>=1.16.0 (from catboost)
  Using cached numpy-1.26.4-cp311-cp311-manylin

In [7]:
# prompt: use cat boost to predict insurance variable

# Imports for data handling, splitting, modeling, and evaluation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# Load the data (replace with your actual file path if different)
df = pd.read_csv('/content/admisions.csv')

# Define features (X) and target variable (y)
X = df.drop('INSURANCE', axis=1)
y = df['INSURANCE']

# Handle non-numeric columns (categorical features)
# Cast object columns to str so that missing values (NaN) become the string 'nan';
# CatBoost does not accept NaN in categorical features
for col in X.select_dtypes(include=['object']).columns:
    X[col] = X[col].astype(str)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Get indices of categorical features after conversion
categorical_features_indices = np.where(X.dtypes == object)[0]

# Initialize and train the CatBoost classifier
model = CatBoostClassifier(iterations=100,  # Adjust as needed
                           learning_rate=0.1, # Adjust as needed
                           depth=6,  # Adjust as needed
                           loss_function='MultiClass', # Or appropriate loss function
                           random_seed=42, # for reproducibility
                           cat_features=categorical_features_indices) # Specify categorical features
model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=10) #verbose for training updates


# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

0:	learn: 1.4647101	test: 1.4610295	best: 1.4610295 (0)	total: 657ms	remaining: 1m 5s
10:	learn: 0.9844028	test: 0.9714362	best: 0.9714362 (10)	total: 5.26s	remaining: 42.6s
20:	learn: 0.8752587	test: 0.8612793	best: 0.8612793 (20)	total: 11.1s	remaining: 41.7s
30:	learn: 0.8407943	test: 0.8288116	best: 0.8288116 (30)	total: 15.6s	remaining: 34.8s
40:	learn: 0.8265112	test: 0.8163044	best: 0.8163044 (40)	total: 20.7s	remaining: 29.7s
50:	learn: 0.8187409	test: 0.8107909	best: 0.8107909 (50)	total: 25.8s	remaining: 24.7s
60:	learn: 0.8135638	test: 0.8071326	best: 0.8071326 (60)	total: 30.5s	remaining: 19.5s
70:	learn: 0.8092356	test: 0.8046416	best: 0.8046416 (70)	total: 36.3s	remaining: 14.8s
80:	learn: 0.8066868	test: 0.8033818	best: 0.8033818 (80)	total: 41s	remaining: 9.61s
90:	learn: 0.8040812	test: 0.8020054	best: 0.8020054 (90)	total: 45.5s	remaining: 4.5s
99:	learn: 0.8024161	test: 0.8013110	best: 0.8013110 (99)	total: 50.8s	remaining: 0us

bestTest = 0.801311017
bestIteration =
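
The split in the cell above does not pass `stratify`, so class proportions in the test set can drift from those in the full data. A possible variation, sketched on invented toy labels rather than the notebook's data, keeps each insurance class at the same share in both splits. The names `toy_X`, `toy_y`, and `row_id` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with a 2:1:1 class ratio; invented for illustration.
toy_y = pd.Series(["Medicare"] * 10 + ["Private"] * 5 + ["Medicaid"] * 5, name="INSURANCE")
toy_X = pd.DataFrame({"row_id": range(len(toy_y))})

# stratify=toy_y asks the splitter to preserve class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(y_te.value_counts())
```

This is not the method used in the notebook, and applying it there would change the reported losses.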

In [8]:
y_pred[:20]

array([['Medicare'],
       ['Private'],
       ['Medicare'],
       ['Private'],
       ['Private'],
       ['Medicare'],
       ['Private'],
       ['Medicare'],
       ['Medicare'],
       ['Medicaid'],
       ['Medicare'],
       ['Medicaid'],
       ['Private'],
       ['Private'],
       ['Private'],
       ['Medicare'],
       ['Private'],
       ['Private'],
       ['Medicare'],
       ['Private']], dtype=object)
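
The predictions above come back as an `(n, 1)` column of labels. A minimal sketch of flattening that shape and tabulating predicted against actual classes follows; the arrays are invented toy values shaped like the output above, not the notebook's results.

```python
import numpy as np
import pandas as pd

# Toy arrays shaped like the notebook's outputs; labels are invented for illustration.
y_pred_2d = np.array([["Medicare"], ["Private"], ["Medicare"], ["Medicaid"]], dtype=object)
y_actual = pd.Series(["Medicare", "Private", "Private", "Medicaid"], name="actual")

# Collapse the (n, 1) column of predictions into a flat (n,) vector.
y_pred_flat = pd.Series(y_pred_2d.ravel(), name="predicted")

# Rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(y_actual, y_pred_flat)

print(confusion)
```

With the notebook's own `y_test`, whose index is shuffled by the split, passing `y_test.to_numpy()` instead of the Series would likely be needed, because `pd.crosstab` aligns Series inputs on their index.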

## Acknowledgments

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available from: http://www.nature.com/articles/sdata201635

I found these resources particularly helpful for this project:
- https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e
- https://matplotlib.org/examples/api/barchart_demo.html
- https://stackoverflow.com/questions/46168450/replace-specific-range-of-values-in-data-frame-pandas
- https://en.wikipedia.org/wiki/Root-mean-square_deviation
- https://en.wikipedia.org/wiki/Coefficient_of_determination
- https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/
- https://www.healthcatalyst.com/success_stories/reducing-length-of-stay-in-hospital
- http://bok.ahima.org/Pages/Long%20Term%20Care%20Guidelines%20TOC/Practice%20Guidelines/Reporting