# Janatahack: Healthcare Analytics II

## [Janatahack: Healthcare Analytics II](https://datahack.analyticsvidhya.com/contest/janatahack-healthcare-analytics-ii)

The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, staff management & more.

This weekend we invite you to participate in another Janatahack with the theme of healthcare analytics. Stay tuned for the problem statement and datasets this Friday and get a chance to work on a real healthcare case study along with 250 AV points at stake.

## Problem Statement

The recent COVID-19 pandemic has raised alarms over one of the most overlooked areas of focus: healthcare management. While healthcare management offers many use cases for data science, patient length of stay (LOS) is one critical parameter to observe and predict if one wants to improve the efficiency of healthcare management in a hospital.

This parameter helps hospitals identify patients at high LOS risk (patients who will stay longer) at the time of admission. Once identified, patients with high LOS risk can have their treatment plans optimized to minimize LOS and lower the chance of staff/visitor infection. Prior knowledge of LOS can also aid logistics such as room and bed allocation planning.

Suppose you have been hired as a Data Scientist at HealthMan, a not-for-profit organization dedicated to managing the functioning of hospitals in a professional and optimal manner.
The task is to accurately predict the length of stay for each patient on a case-by-case basis so that hospitals can use this information for optimal resource allocation and better functioning. The length of stay is divided into 11 different classes ranging from 0-10 days to more than 100 days.

## Data

| Column | Description |
|---|---|
| case_id | Case ID registered in the hospital |
| Hospital_code | Unique code for the hospital |
| Hospital_type_code | Unique code for the type of hospital |
| City_Code_Hospital | City code of the hospital |
| Hospital_region_code | Region code of the hospital |
| Available Extra Rooms in Hospital | Number of extra rooms available in the hospital |
| Department | Department overseeing the case |
| Ward_Type | Code for the ward type |
| Ward_Facility_Code | Code for the ward facility |
| Bed Grade | Condition of the bed in the ward |
| patientid | Unique patient ID |
| City_Code_Patient | City code for the patient |
| Type of Admission | Admission type registered by the hospital |
| Severity of Illness | Severity of the illness recorded at the time of admission |
| Visitors with Patient | Number of visitors with the patient |
| Age | Age of the patient |
| Admission_Deposit | Deposit at the time of admission |
| Stay | Stay days by the patient |

## Evaluation Metric

The evaluation metric for this hackathon is 100*Accuracy Score.
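As a minimal sketch of this metric, assuming it is plain classification accuracy scaled to the 0-100 range as stated above, the score could be computed as follows. The helper name `accuracy_score_100` and the sample labels are illustrative and not part of the competition code.

```python
def accuracy_score_100(y_true, y_pred):
    """Return 100 * (fraction of predictions that exactly match the labels)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Illustrative labels only (not taken from the dataset).
y_true = ["0-10", "21-30", "11-20", "41-50"]
y_pred = ["0-10", "21-30", "31-40", "41-50"]
score = accuracy_score_100(y_true, y_pred)
print(score)  # 3 of 4 predictions match -> 75.0
```

Because the classes are compared as exact labels, predicting `31-40` for a true `11-20` costs as much as any other miss; the metric gives no partial credit for near bins.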

# Load the Packages

In [88]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

#Basic Packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Data Visualization
import seaborn as sns # Advance Data Visualization
%matplotlib inline

#OS packages
import os

#Encoding Packages
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#Scaling Packages
from sklearn import preprocessing
mm_scaler = preprocessing.MinMaxScaler()

#Multicollinearity (VIF)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Data Modelling Packages
from sklearn.model_selection import train_test_split

#Oversampler for the imbalanced target; named `ros` so it does not shadow statsmodels' `sm`
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=294, sampling_strategy='not majority')

#Model Packages
import lightgbm as lgb


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load the Datasets

## Loading from Kaggle Input Data

In [89]:
df_Train = pd.read_csv('../data/train.csv')
df_Test = pd.read_csv('../data/test.csv')

# Exploratory Data Analysis

In [90]:
#To find the head of the Data
df_Train.head()

| | case_id | Hospital_code | Hospital_type_code | City_Code_Hospital | Hospital_region_code | Available Extra Rooms in Hospital | Department | Ward_Type | Ward_Facility_Code | Bed Grade | patientid | City_Code_Patient | Type of Admission | Severity of Illness | Visitors with Patient | Age | Admission_Deposit | Stay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8 | c | 3 | Z | 3 | radiotherapy | R | F | 2.0 | 31397 | 7.0 | Emergency | Extreme | 2 | 51-60 | 4911.0 | 0-10 |
| 1 | 2 | 2 | c | 5 | Z | 2 | radiotherapy | S | F | 2.0 | 31397 | 7.0 | Trauma | Extreme | 2 | 51-60 | 5954.0 | 41-50 |
| 2 | 3 | 10 | e | 1 | X | 2 | anesthesia | S | E | 2.0 | 31397 | 7.0 | Trauma | Extreme | 2 | 51-60 | 4745.0 | 31-40 |
| 3 | 4 | 26 | b | 2 | Y | 2 | radiotherapy | R | D | 2.0 | 31397 | 7.0 | Trauma | Extreme | 2 | 51-60 | 7272.0 | 41-50 |
| 4 | 5 | 26 | b | 2 | Y | 2 | radiotherapy | S | D | 2.0 | 31397 | 7.0 | Trauma | Extreme | 2 | 51-60 | 5558.0 | 41-50 |


In [91]:
#Information of the Dataset Datatype
df_Train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318438 entries, 0 to 318437
Data columns (total 18 columns):
case_id                              318438 non-null int64
Hospital_code                        318438 non-null int64
Hospital_type_code                   318438 non-null object
City_Code_Hospital                   318438 non-null int64
Hospital_region_code                 318438 non-null object
Available Extra Rooms in Hospital    318438 non-null int64
Department                           318438 non-null object
Ward_Type                            318438 non-null object
Ward_Facility_Code                   318438 non-null object
Bed Grade                            318325 non-null float64
patientid                            318438 non-null int64
City_Code_Patient                    313906 non-null float64
Type of Admission                    318438 non-null object
Severity of Illness                  318438 non-null object
Visitors with Patient                318438 non-null

In [92]:
#Information of the Dataset Continuous Values
df_Train.describe()

| | case_id | Hospital_code | City_Code_Hospital | Available Extra Rooms in Hospital | Bed Grade | patientid | City_Code_Patient | Visitors with Patient | Admission_Deposit |
|---|---|---|---|---|---|---|---|---|---|
| count | 318438.0 | 318438.0 | 318438.0 | 318438.0 | 318325.0 | 318438.0 | 313906.0 | 318438.0 | 318438.0 |
| mean | 159219.5 | 18.318841 | 4.771717 | 3.197627 | 2.625807 | 65747.579472 | 7.251859 | 3.284099 | 4880.749392 |
| std | 91925.276847 | 8.633755 | 3.102535 | 1.168171 | 0.873146 | 37979.93644 | 4.745266 | 1.764061 | 1086.776254 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1800.0 |
| 25% | 79610.25 | 11.0 | 2.0 | 2.0 | 2.0 | 32847.0 | 4.0 | 2.0 | 4186.0 |
| 50% | 159219.5 | 19.0 | 5.0 | 3.0 | 3.0 | 65724.5 | 8.0 | 3.0 | 4741.0 |
| 75% | 238828.75 | 26.0 | 7.0 | 4.0 | 3.0 | 98470.0 | 8.0 | 4.0 | 5409.0 |
| max | 318438.0 | 32.0 | 13.0 | 24.0 | 4.0 | 131624.0 | 38.0 | 32.0 | 11008.0 |


In [93]:
#Columns List
df_Train.columns

Index(['case_id', 'Hospital_code', 'Hospital_type_code', 'City_Code_Hospital',
       'Hospital_region_code', 'Available Extra Rooms in Hospital',
       'Department', 'Ward_Type', 'Ward_Facility_Code', 'Bed Grade',
       'patientid', 'City_Code_Patient', 'Type of Admission',
       'Severity of Illness', 'Visitors with Patient', 'Age',
       'Admission_Deposit', 'Stay'],
      dtype='object')

In [94]:
#Shape of the Train and Test Data
print('Shape of Train Data: ', df_Train.shape)
print('Shape of Test Data: ', df_Test.shape)

Shape of Train Data:  (318438, 18)
Shape of Test Data:  (137057, 17)


In [95]:
#Null values in the Train Dataset
print('Null values in Train Data: \n', df_Train.isnull().sum())

Null values in Train Data: 
 case_id                                 0
Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Department                              0
Ward_Type                               0
Ward_Facility_Code                      0
Bed Grade                             113
patientid                               0
City_Code_Patient                    4532
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
Age                                     0
Admission_Deposit                       0
Stay                                    0
dtype: int64


In [96]:
#Null Values in the Test Dataset
print('Null Values in Test Data: \n', df_Test.isnull().sum())

Null Values in Test Data: 
 case_id                                 0
Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Department                              0
Ward_Type                               0
Ward_Facility_Code                      0
Bed Grade                              35
patientid                               0
City_Code_Patient                    2157
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
Age                                     0
Admission_Deposit                       0
dtype: int64


Both datasets have missing values in only two columns: "Bed Grade" (113 in train, 35 in test) and "City_Code_Patient" (4,532 in train, 2,157 in test).

In [97]:
print('Total Count of the Prediction Output Column Stay Variable: \n', df_Train['Stay'].value_counts())

Total Count of the Prediction Output Column Stay Variable: 
 21-30                 87491
11-20                 78139
31-40                 55159
51-60                 35018
0-10                  23604
41-50                 11743
71-80                 10254
More than 100 Days     6683
81-90                  4838
91-100                 2765
61-70                  2744
Name: Stay, dtype: int64
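The counts above are far from uniform. As a rough sketch of how uneven they are, the snippet below recomputes each class share from those counts and derives the score a trivial model would get by always predicting the most frequent class; that figure is a useful floor when judging any leaderboard score. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
stay_counts = {
    "21-30": 87491,
    "11-20": 78139,
    "31-40": 55159,
    "51-60": 35018,
    "0-10": 23604,
    "41-50": 11743,
    "71-80": 10254,
    "More than 100 Days": 6683,
    "81-90": 4838,
    "91-100": 2765,
    "61-70": 2744,
}

total = sum(stay_counts.values())
majority_class, majority_count = max(stay_counts.items(), key=lambda kv: kv[1])

# Score (100 * accuracy) of a model that always predicts the majority class.
baseline_score = 100.0 * majority_count / total

# Ratio between the largest and the smallest class.
imbalance_ratio = majority_count / min(stay_counts.values())

for label, count in stay_counts.items():
    print(f"{label:>20}: {100.0 * count / total:5.2f}%")
print(f"Majority class: {majority_class}")
print(f"Always-predict-majority score: {baseline_score:.2f}")
print(f"Largest/smallest class ratio: {imbalance_ratio:.1f}")
```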


## Assumptions of the Predictor Variables

Target Variable

Stay - Highly imbalanced; resampling (for example, SMOTE) may be needed to balance it.


Predictor Variables

Hospital Code - Highly imbalanced; might affect the model.

Hospital Type Code - Imbalanced.

City Code Hospital - Imbalanced.

Available Extra Rooms - Positively skewed; needs balancing.

Department - Highly imbalanced.

Ward Type - Highly imbalanced.

Patient ID - Many unique values; might need to be dropped.

City Code Patient - Highly imbalanced.

Severity of Illness - Imbalanced.

Visitors with Patient - Imbalanced.

Age - Imbalanced; could be binned further.

Admission Deposit - Continuous; outliers may need to be removed or the values scaled.

# Basic Feature Engineering

## Remove Duplicate Rows

In [98]:
df_Train.drop_duplicates(keep='first', inplace=True)

No duplicate rows were found.

## Joining the Train and Test Data for Encoding and Filling the Missing Values

In [99]:
# Concatenate the train and test sets, flagging each row's origin
df_Train['is_train'] = 1
df_Test['is_train'] = 0

# sort=True keeps the alphabetical column order shown in the outputs below
df_Total = pd.concat([df_Train, df_Test], sort=True)

## Fill missing Values

In [100]:
#Null values in the Total Dataset
print('Null values in Total Data: \n', df_Total.isnull().sum())

Null values in Total Data: 
 Admission_Deposit                         0
Age                                       0
Available Extra Rooms in Hospital         0
Bed Grade                               148
City_Code_Hospital                        0
City_Code_Patient                      6689
Department                                0
Hospital_code                             0
Hospital_region_code                      0
Hospital_type_code                        0
Severity of Illness                       0
Stay                                 137057
Type of Admission                         0
Visitors with Patient                     0
Ward_Facility_Code                        0
Ward_Type                                 0
case_id                                   0
is_train                                  0
patientid                                 0
dtype: int64


In [101]:
# Forward-fill the missing values (each gap takes the value from the preceding row)
df_Total['Bed Grade'] = df_Total['Bed Grade'].ffill()
df_Total['City_Code_Patient'] = df_Total['City_Code_Patient'].ffill()
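To make the effect of forward filling concrete, here is a small sketch on a toy column standing in for `Bed Grade` (the values are invented). It also shows one possible alternative, filling with the most frequent value; that alternative is an assumption for comparison and is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'Bed Grade'; the values are illustrative only.
bed_grade = pd.Series([2.0, np.nan, 3.0, np.nan, np.nan, 3.0])

# Forward fill: each gap takes the value of the nearest preceding non-null row.
forward_filled = bed_grade.ffill()

# Alternative (an assumption, not used in the notebook): fill with the mode.
mode_value = bed_grade.mode().iloc[0]
mode_filled = bed_grade.fillna(mode_value)

print(forward_filled.tolist())
print(mode_filled.tolist())
```

Forward fill depends on row order, so the imputed value is whatever happens to sit in the row above, and a missing value in the very first row would stay missing. Mode filling ignores row order but pushes every gap toward the dominant category.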

## Feature Engineering

In [102]:
df_Total['Bill_per_patient'] = df_Total.groupby('patientid')['Admission_Deposit'].transform('sum')
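The line above uses `groupby(...).transform('sum')`, which computes one sum per `patientid` and writes it back to every row of that patient, so the result has the same length as the original frame. A minimal sketch on invented data:

```python
import pandas as pd

# Toy frame with the same two column names; the values are illustrative only.
toy = pd.DataFrame({
    "patientid": [1, 1, 2, 1, 2],
    "Admission_Deposit": [100.0, 200.0, 50.0, 300.0, 25.0],
})

# transform('sum') returns a Series aligned with the original rows,
# so each row receives the total deposit of its own patient.
toy["Bill_per_patient"] = (
    toy.groupby("patientid")["Admission_Deposit"].transform("sum")
)
print(toy)
```

Because the notebook computes this on the combined train and test frame, each patient's total includes deposits from both sets. That uses no target information, but the feature's value for a training row does depend on which test rows exist.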

## Encoding of the Columns

In [103]:
df_Total.head()

| | Admission_Deposit | Age | Available Extra Rooms in Hospital | Bed Grade | City_Code_Hospital | City_Code_Patient | Department | Hospital_code | Hospital_region_code | Hospital_type_code | Severity of Illness | Stay | Type of Admission | Visitors with Patient | Ward_Facility_Code | Ward_Type | case_id | is_train | patientid | Bill_per_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4911.0 | 51-60 | 3 | 2.0 | 3 | 7.0 | radiotherapy | 8 | Z | c | Extreme | 0-10 | Emergency | 2 | F | R | 1 | 1 | 31397 | 83314.0 |
| 1 | 5954.0 | 51-60 | 2 | 2.0 | 5 | 7.0 | radiotherapy | 2 | Z | c | Extreme | 41-50 | Trauma | 2 | F | S | 2 | 1 | 31397 | 83314.0 |
| 2 | 4745.0 | 51-60 | 2 | 2.0 | 1 | 7.0 | anesthesia | 10 | X | e | Extreme | 31-40 | Trauma | 2 | E | S | 3 | 1 | 31397 | 83314.0 |
| 3 | 7272.0 | 51-60 | 2 | 2.0 | 2 | 7.0 | radiotherapy | 26 | Y | b | Extreme | 41-50 | Trauma | 2 | D | R | 4 | 1 | 31397 | 83314.0 |
| 4 | 5558.0 | 51-60 | 2 | 2.0 | 2 | 7.0 | radiotherapy | 26 | Y | b | Extreme | 41-50 | Trauma | 2 | D | S | 5 | 1 | 31397 | 83314.0 |


### Label Encoding for Tree-Based Algorithms

In [104]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Columns to label-encode; patientid is deliberately left unencoded
cols_to_encode = [
    'Hospital_code', 'Hospital_type_code', 'City_Code_Hospital',
    'Hospital_region_code', 'Available Extra Rooms in Hospital',
    'Department', 'Ward_Type', 'Ward_Facility_Code', 'Bed Grade',
    'City_Code_Patient', 'Type of Admission', 'Severity of Illness',
    'Visitors with Patient', 'Age',
]

# The encoder is refit for each column, so codes are independent per column
for col in cols_to_encode:
    df_Total[col] = le.fit_transform(df_Total[col])
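`LabelEncoder` assigns integer codes in the sorted order of the distinct values. That works well for the `Age` bins, whose strings already sort from youngest to oldest, but for a column such as `Severity of Illness` alphabetical order does not match the clinical order. The sketch below mimics the sorted-order behaviour in plain Python and contrasts it with a hand-written ordinal mapping. The category names `Minor` and `Moderate` are assumptions (only `Extreme` appears in the output shown), and `severity_order` is a hypothetical alternative rather than part of the notebook.

```python
def sorted_label_codes(values):
    """Map each distinct value to its rank in sorted order (LabelEncoder-style)."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Assumed category names; only 'Extreme' is visible in the notebook output.
severity = ["Extreme", "Minor", "Moderate", "Extreme"]

# Alphabetical codes, as a sorted-order encoder would assign them.
alphabetical = sorted_label_codes(severity)
alphabetical_codes = [alphabetical[s] for s in severity]

# Hypothetical alternative: an explicit mapping that keeps the clinical order.
severity_order = {"Minor": 0, "Moderate": 1, "Extreme": 2}
ordinal_codes = [severity_order[s] for s in severity]

print(alphabetical)
print(alphabetical_codes)
print(ordinal_codes)
```

Tree-based models can still separate arbitrarily ordered codes with additional splits, so the alphabetical coding is not wrong, but an order-preserving mapping may let the model isolate "more severe" cases with a single threshold.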

## For Scaling the Columns

In [105]:
df_Total['Admission_Deposit']

0         4911.0
1         5954.0
2         4745.0
3         7272.0
4         5558.0
5         4449.0
6         6167.0
7         5571.0
8         7223.0
9         6056.0
10        5797.0
11        5993.0
12        5141.0
13        8477.0
14        2685.0
15        9398.0
16        2933.0
17        5342.0
18        7442.0
19        5155.0
20        8181.0
21        6672.0
22        6364.0
23        4664.0
24        4091.0
25        2405.0
26        5164.0
27        5055.0
28        3339.0
29        5757.0
           ...  
137027    4397.0
137028    3906.0
137029    8316.0
137030    3815.0
137031    4050.0
137032    3017.0
137033    4365.0
137034    4587.0
137035    8933.0
137036    5760.0
137037    3020.0
137038    3858.0
137039    3862.0
137040    4749.0
137041    2804.0
137042    5577.0
137043    5675.0
137044    5416.0
137045    3641.0
137046    4800.0
137047    4418.0
137048    3816.0
137049    4406.0
137050    4573.0
137051    5241.0
137052    6313.0
137053    3510.0
137054    7190

In [106]:
df_Total['Admission_Deposit'].describe()

count    455495.000000
mean       4877.434022
std        1084.982089
min        1800.000000
25%        4184.000000
50%        4738.000000
75%        5405.000000
max       11920.000000
Name: Admission_Deposit, dtype: float64
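The notebook creates a `MinMaxScaler` (`mm_scaler`) in the imports cell but never applies it. For a tree-based model such as LightGBM this should not matter, since splits depend only on the ordering of a feature's values and min-max scaling preserves that ordering. For reference, a minimal sketch of what min-max scaling would do to this column, using the minimum and maximum from the summary above (the helper `min_max_scale` is illustrative):

```python
def min_max_scale(value, lo, hi):
    """Linearly rescale value from the range [lo, hi] to [0, 1]."""
    if hi == lo:
        raise ValueError("hi and lo must differ")
    return (value - lo) / (hi - lo)


# Minimum and maximum taken from the describe() output above.
deposit_min, deposit_max = 1800.0, 11920.0

# The smallest deposit, the first training row's deposit, and the largest deposit.
sample_deposits = [1800.0, 4911.0, 11920.0]
scaled = [min_max_scale(v, deposit_min, deposit_max) for v in sample_deposits]
print(scaled)
```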

## Split the Combined Data Back into Train and Test

In [107]:
#Un-Merge code
df_Train_final = df_Total[df_Total['is_train'] == 1]
df_Test_final = df_Total[df_Total['is_train'] == 0]

In [15]:
df_Train_final

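| | Admission_Deposit | Age | Available Extra Rooms in Hospital | Bed Grade | City_Code_Hospital | City_Code_Patient | Department | Hospital_code | Hospital_region_code | Hospital_type_code | Severity of Illness | Stay | Type of Admission | Visitors with Patient | Ward_Facility_Code | Ward_Type | case_id | is_train | patientid | Bill_per_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4911.0 | 5 | 3 | 1 | 2 | 6 | 3 | 7 | 2 | 2 | 0 | 0-10 | 0 | 2 | 5 | 2 | 1 | 1 | 31397 | 83314.0 |
| 1 | 5954.0 | 5 | 2 | 1 | 4 | 6 | 3 | 1 | 2 | 2 | 0 | 41-50 | 1 | 2 | 5 | 3 | 2 | 1 | 31397 | 83314.0 |
| 2 | 4745.0 | 5 | 2 | 1 | 0 | 6 | 1 | 9 | 0 | 4 | 0 | 31-40 | 1 | 2 | 4 | 3 | 3 | 1 | 31397 | 83314.0 |
| 3 | 7272.0 | 5 | 2 | 1 | 1 | 6 | 3 | 25 | 1 | 1 | 0 | 41-50 | 1 | 2 | 3 | 2 | 4 | 1 | 31397 | 83314.0 |
| 4 | 5558.0 | 5 | 2 | 1 | 1 | 6 | 3 | 25 | 1 | 1 | 0 | 41-50 | 1 | 2 | 3 | 3 | 5 | 1 | 31397 | 83314.0 |
| 5 | 4449.0 | 5 | 2 | 1 | 5 | 6 | 1 | 22 | 0 | 0 | 0 | 11-20 | 1 | 2 | 5 | 3 | 6 | 1 | 31397 | 83314.0 |
| 6 | 6167.0 | 5 | 1 | 2 | 7 | 6 | 3 | 31 | 1 | 5 | 0 | 0-10 | 0 | 2 | 1 | 3 | 7 | 1 | 31397 | 83314.0 |
| 7 | 5571.0 | 5 | 4 | 2 | 5 | 6 | 3 | 22 | 0 | 0 | 0 | 41-50 | 1 | 2 | 5 | 1 | 8 | 1 | 31397 | 83314.0 |
| 8 | 7223.0 | 5 | 2 | 3 | 8 | 6 | 2 | 0 | 1 | 3 | 0 | 51-60 | 1 | 2 | 1 | 2 | 9 | 1 | 31397 | 83314.0 |
| 9 | 6056.0 | 5 | 2 | 2 | 0 | 6 | 2 | 9 | 0 | 4 | 0 | 31-40 | 1 | 2 | 4 | 3 | 10 | 1 | 31397 | 83314.0 |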


In [108]:
df_Test_final

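| | Admission_Deposit | Age | Available Extra Rooms in Hospital | Bed Grade | City_Code_Hospital | City_Code_Patient | Department | Hospital_code | Hospital_region_code | Hospital_type_code | Severity of Illness | Stay | Type of Admission | Visitors with Patient | Ward_Facility_Code | Ward_Type | case_id | is_train | patientid | Bill_per_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3095.0 | 7 | 3 | 1 | 2 | 1 | 2 | 20 | 2 | 2 | 2 | NaN | 0 | 2 | 0 | 3 | 318439 | 0 | 17006 | 28765.0 |
| 1 | 4018.0 | 7 | 2 | 1 | 3 | 1 | 2 | 28 | 0 | 0 | 2 | NaN | 1 | 4 | 5 | 3 | 318440 | 0 | 17006 | 28765.0 |
| 2 | 4492.0 | 7 | 3 | 3 | 1 | 1 | 2 | 25 | 1 | 1 | 2 | NaN | 0 | 3 | 3 | 1 | 318441 | 0 | 17006 | 28765.0 |
| 3 | 4173.0 | 7 | 3 | 1 | 5 | 1 | 2 | 5 | 0 | 0 | 2 | NaN | 1 | 3 | 5 | 1 | 318442 | 0 | 17006 | 28765.0 |
| 4 | 4161.0 | 7 | 2 | 1 | 9 | 1 | 2 | 27 | 0 | 1 | 2 | NaN | 1 | 4 | 5 | 2 | 318443 | 0 | 17006 | 28765.0 |
| 5 | 4659.0 | 7 | 3 | 1 | 5 | 1 | 2 | 22 | 0 | 0 | 2 | NaN | 1 | 2 | 5 | 1 | 318444 | 0 | 17006 | 28765.0 |
| 6 | 4167.0 | 7 | 2 | 1 | 1 | 1 | 2 | 25 | 1 | 1 | 2 | NaN | 1 | 2 | 3 | 1 | 318445 | 0 | 17006 | 28765.0 |
| 7 | 4396.0 | 3 | 4 | 2 | 0 | 1 | 2 | 24 | 0 | 4 | 2 | NaN | 0 | 2 | 4 | 3 | 318446 | 0 | 95946 | 28755.0 |
| 8 | 4088.0 | 3 | 4 | 2 | 5 | 1 | 2 | 22 | 0 | 0 | 2 | NaN | 1 | 2 | 5 | 1 | 318447 | 0 | 95946 | 28755.0 |
| 9 | 3925.0 | 3 | 3 | 3 | 5 | 1 | 2 | 22 | 0 | 0 | 2 | NaN | 2 | 2 | 5 | 1 | 318448 | 0 | 95946 | 28755.0 |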


# Data Modelling

## Split the Data to x and y variable

In [109]:
df_Train_final.columns

Index(['Admission_Deposit', 'Age', 'Available Extra Rooms in Hospital',
       'Bed Grade', 'City_Code_Hospital', 'City_Code_Patient', 'Department',
       'Hospital_code', 'Hospital_region_code', 'Hospital_type_code',
       'Severity of Illness', 'Stay', 'Type of Admission',
       'Visitors with Patient', 'Ward_Facility_Code', 'Ward_Type', 'case_id',
       'is_train', 'patientid', 'Bill_per_patient'],
      dtype='object')

In [126]:
# Features: drop the identifier, the train/test flag and the target.
# patientid is deliberately kept as a feature.
drop_cols = ['case_id', 'is_train', 'Stay']
x = df_Train_final.drop(columns=drop_cols)
y = df_Train_final['Stay']
x_pred = df_Test_final.drop(columns=drop_cols)

In [128]:
x.head()

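| | Admission_Deposit | Age | Available Extra Rooms in Hospital | Bed Grade | City_Code_Hospital | City_Code_Patient | Department | Hospital_code | Hospital_region_code | Hospital_type_code | Severity of Illness | Type of Admission | Visitors with Patient | Ward_Facility_Code | Ward_Type | patientid | Bill_per_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4911.0 | 5 | 3 | 1 | 2 | 6 | 3 | 7 | 2 | 2 | 0 | 0 | 2 | 5 | 2 | 31397 | 83314.0 |
| 1 | 5954.0 | 5 | 2 | 1 | 4 | 6 | 3 | 1 | 2 | 2 | 0 | 1 | 2 | 5 | 3 | 31397 | 83314.0 |
| 2 | 4745.0 | 5 | 2 | 1 | 0 | 6 | 1 | 9 | 0 | 4 | 0 | 1 | 2 | 4 | 3 | 31397 | 83314.0 |
| 3 | 7272.0 | 5 | 2 | 1 | 1 | 6 | 3 | 25 | 1 | 1 | 0 | 1 | 2 | 3 | 2 | 31397 | 83314.0 |
| 4 | 5558.0 | 5 | 2 | 1 | 1 | 6 | 3 | 25 | 1 | 1 | 0 | 1 | 2 | 3 | 3 | 31397 | 83314.0 |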


## Boosting Algorithm

### LightGBM Model

In [129]:
params = {}
params['learning_rate'] = 0.09
params['max_depth'] = 50
params['n_estimators'] = 500
params['objective'] = 'multiclass'
params['boosting_type'] = 'gbdt'
params['subsample'] = 0.7
params['random_state'] = 50
params['colsample_bytree']=0.7
params['min_data_in_leaf'] = 55
params['reg_alpha'] = 1.6
params['reg_lambda'] = 1.1

In [133]:
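clf = lgb.LGBMClassifier(**params)
# No eval_set is passed, so eval_metric would have no effect; the `verbose`
# fit argument is also omitted because recent LightGBM releases no longer accept it.
clf.fit(x, np.ravel(y))
# Predictions on the training rows (a sanity check, not an estimate of test accuracy)
predparams = clf.predict(x)
predparams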

array(['11-20', '51-60', '21-30', ..., '21-30', '31-40', '0-10'],
      dtype=object)
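Predictions on the rows the model was trained on say little about how it will score on unseen data. The notebook imports `train_test_split` but never calls it. One possible way to get a local estimate, offered here only as a sketch, is to hold out a share of the data before fitting. Because the same `patientid` appears on many rows, this sketch holds out whole patients so that no patient is split across both sides; the helper `group_holdout_split`, the 20% fraction, and the sample IDs are all assumptions.

```python
import random


def group_holdout_split(group_ids, holdout_fraction=0.2, seed=0):
    """Return (train_idx, valid_idx) so that no group appears on both sides."""
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out at least one whole group.
    n_holdout = max(1, int(len(groups) * holdout_fraction))
    holdout_groups = set(groups[:n_holdout])
    train_idx = [i for i, g in enumerate(group_ids) if g not in holdout_groups]
    valid_idx = [i for i, g in enumerate(group_ids) if g in holdout_groups]
    return train_idx, valid_idx


# Illustrative patient IDs; several rows share the same patient.
patient_ids = [31397, 31397, 31397, 17006, 17006, 95946, 95946, 1, 2, 3]
train_idx, valid_idx = group_holdout_split(patient_ids)

train_patients = {patient_ids[i] for i in train_idx}
valid_patients = {patient_ids[i] for i in valid_idx}
print(len(train_idx), len(valid_idx))
print(train_patients & valid_patients)  # empty set: no patient on both sides
```

The resulting index lists could then be used with `x.iloc[...]` and `y.iloc[...]` to fit on one part and score on the other.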

In [131]:
y_pred= clf.predict(x_pred)
y_pred

array(['0-10', '51-60', '21-30', ..., '21-30', '11-20', '51-60'],
      dtype=object)

In [132]:
submission_df = pd.DataFrame({'case_id':df_Test['case_id'], 'Stay':y_pred})
submission_df.to_csv('Sample Submission LGB Final1.csv', index=False)

Public leaderboard score: 42.88 (100 × accuracy).