## Mortality Rate
The COVID-19 pandemic has changed daily life in many ways. Symptoms alone are reason enough to seek help and get tested, but a dataset that records symptoms can also help predict the mortality rate across regions. Such predictions could help governments and health workers make key decisions.

## Goal: 
Build a machine learning model that predicts the mortality rate from different symptoms and the patient's condition.

## Data Description: 
The data was collected by health workers from a wide range of patients. The dataset consists of a variety of attributes that describe the patients' symptoms and immunity levels.

## Evaluation Metric:
Root Mean Square Error (RMSE)
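
As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between the true and predicted values. The snippet below is a minimal sketch with NumPy; the function name `rmse` and the toy numbers are illustrative assumptions, not part of the competition material.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values (hypothetical), not taken from the competition data
example_rmse = rmse([0.10, 0.20, 0.30], [0.10, 0.25, 0.20])
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.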

In [118]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# display option
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

In [119]:
# read data
train = pd.read_excel('/content/drive/MyDrive/Data-colab/skillenza - symptoms and mortality rate/training_data.xlsx')
test = pd.read_excel('/content/drive/MyDrive/Data-colab/skillenza - symptoms and mortality rate/test_data.xlsx')
sample = pd.read_excel('/content/drive/MyDrive/Data-colab/skillenza - symptoms and mortality rate/sample_submission_mortality_rate.xlsx')

In [120]:
sample.head()


Unnamed: 0,ID,Mortality
0,155,1000
1,156,1000
2,156,1000
3,156,1000
4,157,1000


In [121]:
# 5 rows
train.head(5)

Unnamed: 0,ID,Severity,Ventilation,Mean Age,% Male,Any Comorbidity,Hypertension,Diabetes,Cardiovascular Disease (incl. CAD),Chronic obstructive lung (COPD),Cancer (Any),Liver Disease (any),Cerebrovascular Disease,Chronic kidney/renal disease,Other,Fever (temperature ≥37·3°C),Average temperature (celsius),Max temperature (celsius),Respiratory rate > 24 breaths per min,Cough,Shortness of Breath (dyspnoea),Headache,Sputum (/Expectoration),Myalgia (Muscle Pain),Fatigue,Diarrhoea,Nausea or Vomiting,Loss of Appetite/Anorexia,Disease Severity Asymptomatic,Disease Severity General,Disease Severity Severe,Disease Severity Critical,White Blood Cell Count (10^9/L) - Median,White Blood Cell Count (10^9/L) - LQ,White Blood Cell Count (10^9/L) - UQ,Lymphocyte Count (10^9/L) - Median,Lymphocyte Count (10^9/L) - LQ,Lymphocyte Count (10^9/L) - UQ,Platelet Count (10^9/L) - Median,Platelet Count (10^9/L) - LQ,Platelet Count (10^9/L) - UQ,Hemoglobin (g/L) - Median,Hemoglobin (g/L) - LQ,Hemoglobin (g/L) - UQ,Albumin (g/L),Alanine Aminotransferase (U/L),Aspartate Aminotransferase (U/L),Antibiotic,Antiviral (Any),Heart failure,Acute kidney injury (AKI),Secondary infection/ Bacterial infection,"ICU length of stay, days","Hospital length of stay, days",Mortality,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,Unnamed: 69,Unnamed: 70,Unnamed: 71,Unnamed: 72,Unnamed: 73,Unnamed: 74,Unnamed: 75,Unnamed: 76,Unnamed: 77,Unnamed: 78,Unnamed: 79,Unnamed: 80,Unnamed: 81,Unnamed: 82,Unnamed: 83,Unnamed: 84,Unnamed: 85,Unnamed: 86,Unnamed: 87,Unnamed: 88,Unnamed: 89,Unnamed: 90
0,1,All,Both,,0.623,0.476,0.3,0.19,0.08,0.03,0.0168,,,0.0168,0.12,0.94,,,0.29,0.79,,,0.23,0.15,0.2303,0.0471,0.04,,,0.377,0.3455,0.2775,6.2,4.5,9.5,1.0,0.6,1.3,206.0,155.0,262.0,128.0,119.0,140.0,32.3,30.0,,0.9476,0.2147,0.2304,0.1466,0.1466,8,11.0,0.28,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,1,Severe/Critical Only,Both,,0.7037,0.6667,0.4815,0.3148,0.2407,0.0741,0.0,,,0.037,0.2037,0.9444,,,0.6296,0.7222,,,0.2593,0.1481,0.2778,0.037,0.0556,,,0.0,0.2222,0.7778,9.8,6.9,13.9,0.6,0.5,0.8,165.5,107.0,229.0,126.0,115.0,138.0,29.1,40.0,,0.9815,0.2222,0.5185,0.5,0.5,8,7.5,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,1,All,Both,,0.5912,0.4015,0.2336,0.1387,0.0146,0.0146,0.07,,,0.0,0.0803,0.9416,,,0.1606,0.8175,,,0.219,0.1533,0.2117,0.0511,0.0292,,,0.5255,0.3942,0.0803,5.2,4.3,7.7,1.1,0.8,1.5,220.0,168.0,271.0,128.0,120.0,140.0,33.6,27.0,,0.9343,0.2117,0.1168,0.0073,0.0073,7,12.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2,All,Both,,0.603,,,0.116,,,0.03,,0.065,,,0.915,36.5,,0.188,,,,,,,0.0206,0.0425,0.01,,,,,7.0,5.1,9.4,0.9,0.6,1.2,207.0,158.0,284.0,,,,,33.0,34.0,0.95,0.47,0.0051,0.0468,0.0365,10,15.0,0.16,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2,All,Both,,0.616,,,0.101,,,0.051,,0.051,,,0.899,36.5,,0.216,,,,,,,0.042,0.095,0.021,,,,,7.3,5.3,9.6,0.8,0.6,1.4,201.0,155.0,287.0,,,,,33.0,33.0,0.949,0.949,0.0,0.032,0.011,6,14.0,0.15,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [122]:
# Train & test data - (rows and columns)
print('Train Shape:',train.shape)
print('Test Shape:',test.shape)

Train Shape: (475, 91)
Test Shape: (53, 54)


In [123]:
train.columns

Index(['ID', 'Severity', 'Ventilation', 'Mean Age', '% Male',
       'Any Comorbidity', 'Hypertension', 'Diabetes',
       'Cardiovascular Disease (incl. CAD)', 'Chronic obstructive lung (COPD)',
       'Cancer (Any)', 'Liver Disease (any)', 'Cerebrovascular Disease',
       'Chronic kidney/renal disease', 'Other', 'Fever (temperature ≥37·3°C)',
       'Average temperature (celsius)', 'Max temperature (celsius)',
       'Respiratory rate > 24 breaths per min', 'Cough',
       'Shortness of Breath (dyspnoea)', 'Headache', 'Sputum (/Expectoration)',
       'Myalgia (Muscle Pain)', 'Fatigue', 'Diarrhoea', 'Nausea or Vomiting',
       'Loss of Appetite/Anorexia', 'Disease Severity Asymptomatic',
       'Disease Severity General', 'Disease Severity Severe',
       'Disease Severity Critical', 'White Blood Cell Count (10^9/L) - Median',
       'White Blood Cell Count (10^9/L) - LQ',
       'White Blood Cell Count (10^9/L) - UQ',
       'Lymphocyte Count (10^9/L) - Median', 'Lymphocyte Count 

In [124]:
test.columns

Index(['ID', 'Severity', 'Ventilation', 'Mean Age', '% Male',
       'Any Comorbidity', 'Hypertension', 'Diabetes',
       'Cardiovascular Disease (incl. CAD)', 'Chronic obstructive lung (COPD)',
       'Cancer (Any)', 'Liver Disease (any)', 'Cerebrovascular Disease',
       'Chronic kidney/renal disease', 'Other', 'Fever (temperature ≥37·3°C)',
       'Average temperature (celsius)', 'Max temperature (celsius)',
       'Respiratory rate > 24 breaths per min', 'Cough',
       'Shortness of Breath (dyspnoea)', 'Headache', 'Sputum (/Expectoration)',
       'Myalgia (Muscle Pain)', 'Fatigue', 'Diarrhoea', 'Nausea or Vomiting',
       'Loss of Appetite/Anorexia', 'Disease Severity Asymptomatic',
       'Disease Severity General', 'Disease Severity Severe',
       'Disease Severity Critical', 'White Blood Cell Count (10^9/L) - Median',
       'White Blood Cell Count (10^9/L) - LQ',
       'White Blood Cell Count (10^9/L) - UQ',
       'Lymphocyte Count (10^9/L) - Median', 'Lymphocyte Count 

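The train dataset has extra columns that are absent from the test data (the empty "Unnamed: 55" to "Unnamed: 90" columns). We therefore keep only the columns that also appear in the test data, plus the target column "Mortality".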

In [125]:
train = train[['ID', 'Severity', 'Ventilation', 'Mean Age', '% Male',
       'Any Comorbidity', 'Hypertension', 'Diabetes',
       'Cardiovascular Disease (incl. CAD)', 'Chronic obstructive lung (COPD)',
       'Cancer (Any)', 'Liver Disease (any)', 'Cerebrovascular Disease',
       'Chronic kidney/renal disease', 'Other', 'Fever (temperature ≥37·3°C)',
       'Average temperature (celsius)', 'Max temperature (celsius)',
       'Respiratory rate > 24 breaths per min', 'Cough',
       'Shortness of Breath (dyspnoea)', 'Headache', 'Sputum (/Expectoration)',
       'Myalgia (Muscle Pain)', 'Fatigue', 'Diarrhoea', 'Nausea or Vomiting',
       'Loss of Appetite/Anorexia', 'Disease Severity Asymptomatic',
       'Disease Severity General', 'Disease Severity Severe',
       'Disease Severity Critical', 'White Blood Cell Count (10^9/L) - Median',
       'White Blood Cell Count (10^9/L) - LQ',
       'White Blood Cell Count (10^9/L) - UQ',
       'Lymphocyte Count (10^9/L) - Median', 'Lymphocyte Count (10^9/L) - LQ',
       'Lymphocyte Count (10^9/L) - UQ', 'Platelet Count (10^9/L) - Median',
       'Platelet Count (10^9/L) - LQ', 'Platelet Count (10^9/L) - UQ',
       'Hemoglobin (g/L) - Median', 'Hemoglobin (g/L) - LQ',
       'Hemoglobin (g/L) - UQ', 'Albumin (g/L)',
       'Alanine Aminotransferase (U/L)', 'Aspartate Aminotransferase (U/L)',
       'Antibiotic', 'Antiviral (Any)', 'Heart failure',
       'Acute kidney injury (AKI)', 'Secondary infection/ Bacterial infection',
       'ICU length of stay, days', 'Hospital length of stay, days','Mortality']]
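
The hand-written column list above could also be derived from the two frames, which avoids retyping long column names. The sketch below shows the idea on small toy frames; `train_demo`, `test_demo`, `keep` and `extra` are hypothetical names, and the toy columns only stand in for the real ones.

```python
import pandas as pd

# Toy frames standing in for the notebook's train and test data (assumption)
train_demo = pd.DataFrame({
    'ID': [1, 2],
    'Cough': [0.7, 0.8],
    'Mortality': [0.1, 0.2],
    'Unnamed: 55': [None, None],
})
test_demo = pd.DataFrame({'ID': [3], 'Cough': [0.6]})

target = 'Mortality'

# Keep every train column that also exists in test, plus the target,
# preserving the original column order of the train frame
keep = [c for c in train_demo.columns if c in test_demo.columns or c == target]
extra = [c for c in train_demo.columns if c not in keep]

train_demo = train_demo[keep]
print('Dropped:', extra)
```

Printing the dropped columns makes it easy to confirm that only the unwanted ones were removed.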

In [126]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 475 entries, 0 to 474
Data columns (total 55 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   ID                                        475 non-null    int64  
 1   Severity                                  458 non-null    object 
 2   Ventilation                               301 non-null    object 
 3   Mean Age                                  252 non-null    float64
 4   % Male                                    430 non-null    float64
 5   Any Comorbidity                           222 non-null    float64
 6   Hypertension                              309 non-null    float64
 7   Diabetes                                  325 non-null    float64
 8   Cardiovascular Disease (incl. CAD)        295 non-null    float64
 9   Chronic obstructive lung (COPD)           262 non-null    float64
 10  Cancer (Any)                          

In [127]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 54 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   ID                                        53 non-null     int64  
 1   Severity                                  53 non-null     object 
 2   Ventilation                               46 non-null     object 
 3   Mean Age                                  36 non-null     float64
 4   % Male                                    45 non-null     float64
 5   Any Comorbidity                           14 non-null     float64
 6   Hypertension                              26 non-null     float64
 7   Diabetes                                  30 non-null     float64
 8   Cardiovascular Disease (incl. CAD)        35 non-null     float64
 9   Chronic obstructive lung (COPD)           24 non-null     float64
 10  Cancer (Any)                            


* The "Severity" and "Ventilation" columns in the train and test data have the "object" dtype. These columns need to be label encoded.

* The "ICU length of stay, days", "Hospital length of stay, days" and "Mortality" columns in the train dataset also have the "object" dtype, although the first 5 rows show numerical values. So we need to investigate further.
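
One way to investigate such a column is to attempt a numeric conversion and list the entries that fail. The sketch below uses `pd.to_numeric` with `errors='coerce'` on a toy column; the frame `demo` and its values are assumptions chosen to resemble the strings filtered out later in this notebook.

```python
import pandas as pd

# Toy column mixing numbers with stray strings (hypothetical values)
demo = pd.DataFrame({'Mortality': [0.28, 'na', '0%%', 0.15]})

# Entries that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(demo['Mortality'], errors='coerce')

# Values that were present in the raw column but failed to parse
failed = as_number.isna() & demo['Mortality'].notna()
bad_values = demo.loc[failed, 'Mortality'].unique().tolist()
print(bad_values)
```

Once the offending strings are known, they can be filtered out or replaced before the column is converted to a numeric dtype.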

In [128]:
null_df = pd.DataFrame(train.isnull().sum(),columns=['null_values'])
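# Fraction of missing values per column (row count taken from the frame rather than hard-coded)
null_df['percent_null_values'] = null_df['null_values']/len(train)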
#null_df

In [129]:
train['Severity'] = train['Severity'].map({'All':1, 'Severe/Critical Only':2, 'Mild only':3, 'Both':4, 'Mild':3,'Severe':2,'0':0,'Asymptomatic only':7, 'Severe/critical only':2})
test['Severity'] = test['Severity'].map({'All':1, 'Severe/Critical Only':2, 'Mild only':3, 'Both':4, 'Mild':3,'Severe':2,'0':0,'Asymptomatic only':7, 'Severe/critical only':2,np.nan:0})

In [130]:
train['Ventilation'] = train['Ventilation'].map({'Both':1,'Ventilation only':2, 'Non-ventilation only':3, 'ΝΑ':0, 'Yes':1,'No':2,np.nan:0})
test['Ventilation'] = test['Ventilation'].map({'Both':1,'Ventilation only':2, 'Non-ventilation only':3, 'ΝΑ':0, 'Yes':1,'No':2,np.nan:0})
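
Note that `Series.map` with a dictionary turns every value that is not a key into NaN, so a category missing from the mapping disappears silently. The sketch below shows one way to list such categories before encoding; the shortened `severity_map` and the `raw` series are toy assumptions, not the full mapping used above.

```python
import pandas as pd

# Shortened mapping and toy data for illustration only
severity_map = {'All': 1, 'Severe/Critical Only': 2, 'Mild only': 3}
raw = pd.Series(['All', 'Mild only', 'Moderate', None])

# Values without a dictionary key become NaN after mapping
encoded = raw.map(severity_map)

# Categories present in the data but absent from the mapping
unmapped = sorted(set(raw.dropna()) - set(severity_map))
print('Unmapped categories:', unmapped)
print('NaN after mapping:', int(encoded.isna().sum()))
```

Running a check like this on both train and test data shows whether the later `fillna(0)` is covering real missing values or unexpected labels.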

In [131]:
train = train[train['ICU length of stay, days']!='na']
train = train[train['Hospital length of stay, days']!='na']
train = train[(train['Mortality']!='na')]
train = train[(train['Mortality']!='0%%')]

In [132]:
train.fillna(0,inplace=True)
test.fillna(0,inplace=True)

In [133]:
# No-op after the fillna(0) above: no missing values remain to drop
train.dropna(inplace=True)

In [134]:
train.shape

(469, 55)

In [135]:
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [136]:
#train[['Mortality']] = train[['Mortality']]*100

In [137]:
"""X = train.drop(['ID','Mortality','White Blood Cell Count (10^9/L) - LQ','White Blood Cell Count (10^9/L) - UQ','Lymphocyte Count (10^9/L) - LQ','Lymphocyte Count (10^9/L) - UQ','Platelet Count (10^9/L) - LQ','Platelet Count (10^9/L) - UQ',
                'Hemoglobin (g/L) - LQ','Hemoglobin (g/L) - UQ','Aspartate Aminotransferase (U/L)',
                'Albumin (g/L)','Alanine Aminotransferase (U/L)'],axis=1)"""

X = train.drop(['ID','Mortality'],axis=1)


#X = train[['Any Comorbidity','Heart failure','Cough']]
y = train[['Mortality']]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,)

param_grid={'n_estimators':[100,300,500],'max_depth':[3,5,7]}

reg = xgb.XGBRegressor() # 450
#reg = LinearRegression() #663
#reg = Lasso() # 620
#reg = Ridge() # 771
#reg =DecisionTreeRegressor()#861

model = GridSearchCV(reg,param_grid=param_grid,scoring='neg_root_mean_squared_error',n_jobs=-1,cv=8,return_train_score=True)

model.fit(X_train,y_train)
y_pred = model.predict(X_test)
test['Mortality']=model.predict(test.drop(['ID'],axis=1))
"""test['Mortality']=reg.predict(test.drop(['ID','White Blood Cell Count (10^9/L) - LQ','White Blood Cell Count (10^9/L) - UQ','Lymphocyte Count (10^9/L) - LQ','Lymphocyte Count (10^9/L) - UQ','Platelet Count (10^9/L) - LQ','Platelet Count (10^9/L) - UQ',
                'Hemoglobin (g/L) - LQ','Hemoglobin (g/L) - UQ','Aspartate Aminotransferase (U/L)',
                'Albumin (g/L)','Alanine Aminotransferase (U/L)'],axis=1))"""
# Note: this prints the mean squared error (MSE); the competition metric, RMSE, is its square root
print(mean_squared_error(y_test,y_pred))

0.0433277737371014
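
After a grid search it is often useful to check which parameters were selected and to report the hold-out error in the same units as the evaluation metric. The sketch below illustrates this on synthetic data with a `Ridge` model swapped in for XGBoost so that it runs with scikit-learn alone; the data, the `alpha` grid and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (assumption), not the competition dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=80)

# Fixed random_state so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.1, 1.0, 10.0]},
    scoring='neg_root_mean_squared_error',
    cv=4,
)
search.fit(X_tr, y_tr)

# best_score_ is the negated RMSE, so flip the sign to read it as an error
cv_rmse = -search.best_score_

# Hold-out RMSE: square root of the mean squared error
holdout_rmse = float(np.sqrt(mean_squared_error(y_te, search.predict(X_te))))

print(search.best_params_, cv_rmse, holdout_rmse)
```

Comparing the cross-validated RMSE with the hold-out RMSE gives a rough indication of whether the selected parameters generalize beyond the training folds.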


In [138]:
"""import sklearn.metrics

sklearn.metrics.SCORERS.keys()"""

'import sklearn.metrics\n\nsklearn.metrics.SCORERS.keys()'

In [139]:
test[['ID','Mortality']].to_csv('mortality.csv',index=False)

In [140]:
"""plt.figure(figsize=(20,20))
sns.heatmap(X.corr())
plt.show()"""

'plt.figure(figsize=(20,20))\nsns.heatmap(X.corr())\nplt.show()'

In [141]:
"""reg.feature_importances_"""

'reg.feature_importances_'

In [142]:
"""plt.figure(figsize=(15,15))
plt.barh(X.columns, reg.feature_importances_)"""

'plt.figure(figsize=(15,15))\nplt.barh(X.columns, reg.feature_importances_)'