# Dataset Information

# Background

The World Health Organization (WHO) characterized COVID-19, caused by SARS-CoV-2, as a pandemic on March 11, 2020. At that time, the exponential growth in the number of cases threatened to overwhelm health systems around the world, with demand for ICU beds far above existing capacity; regions of Italy were prominent examples.

Brazil recorded its first case of SARS-CoV-2 on February 26. Transmission progressed very rapidly from imported cases only, to local transmission, and finally to community transmission, and the federal government declared nationwide community transmission on March 20.

As of March 27, the state of São Paulo had recorded 1,223 confirmed cases of COVID-19 and 68 related deaths. The county of São Paulo, which has a population of approximately 12 million and is home to Hospital Israelita Albert Einstein, had 477 confirmed cases and 30 associated deaths as of March 23. Both the state and the county established quarantine and social distancing measures, to be enforced at least until early April, in an effort to slow the spread of the virus.

One motivation for this challenge is that, with an overwhelmed health system and possibly limited capacity to test for SARS-CoV-2, testing every case would be impractical, and test results could be delayed even if only a target subpopulation were tested.

# Dataset

This dataset contains anonymized data from patients seen at the Hospital Israelita Albert Einstein in São Paulo, Brazil, who had samples collected for the SARS-CoV-2 RT-PCR and additional laboratory tests during a visit to the hospital.

All data were anonymized following the best international practices and recommendations. All clinical data were standardized to have a mean of zero and a unit standard deviation.
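
To make the standardization concrete, here is a minimal sketch of a z-score transform in pandas. The raw values and the column name `hematocrit_raw` are made up for illustration (the published dataset contains only the already-standardized numbers), and the choice of the population standard deviation (`ddof=0`) is an assumption, since the source does not say which estimator was used.

```python
import pandas as pd

# Hypothetical raw lab values, invented for illustration only.
raw = pd.DataFrame({"hematocrit_raw": [38.0, 42.0, 45.0, 40.0, 35.0]})

# z-score: subtract the column mean, then divide by the column standard deviation.
# ddof=0 is an assumption; the dataset description does not specify it.
standardized = (raw - raw.mean()) / raw.std(ddof=0)

# After the transform each column has mean 0 and standard deviation 1.
print(standardized.mean().round(6).tolist())
print(standardized.std(ddof=0).round(6).tolist())
```

A value of 0 in the dataset therefore corresponds to the average of that measurement, and a value of 1 to one standard deviation above it.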

# Task

Predict admission to general ward, semi-intensive unit or intensive care unit among confirmed COVID-19 cases.
Based on the results of laboratory tests commonly collected among confirmed COVID-19 cases during a visit to the emergency room, would it be possible to predict which patients will need to be admitted to a general ward, semi-intensive unit or intensive care unit?
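
One possible way to frame this task is to collapse the three admission flags into a single label for confirmed cases. The sketch below is only an illustration under stated assumptions: the helper `build_admission_target` is hypothetical, the value `"positive"` is assumed to be the counterpart of the `"negative"` results visible in the data, and giving the most severe unit priority when several flags are set is a design choice, not something the source prescribes. The column names are copied verbatim from the dataset, including their spelling.

```python
import pandas as pd

# Column names exactly as they appear in the dataset header.
RESULT = "SARS-Cov-2 exam result"
WARD = "Patient addmited to regular ward (1=yes, 0=no)"
SEMI = "Patient addmited to semi-intensive unit (1=yes, 0=no)"
ICU = "Patient addmited to intensive care unit (1=yes, 0=no)"

def build_admission_target(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: one admission label per confirmed case."""
    # Assumption: confirmed cases are marked with the string "positive".
    positive = df[df[RESULT] == "positive"]
    target = pd.Series("none", index=positive.index, name="admission")
    # Later assignments overwrite earlier ones, so the most severe unit wins.
    target.loc[positive[WARD] == 1] = "ward"
    target.loc[positive[SEMI] == 1] = "semi-intensive"
    target.loc[positive[ICU] == 1] = "icu"
    return target

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    RESULT: ["negative", "positive", "positive", "positive"],
    WARD: [0, 1, 0, 0],
    SEMI: [0, 0, 0, 0],
    ICU: [0, 0, 1, 0],
})
labels = build_admission_target(toy)
print(labels.tolist())
```

Keeping the label as a single categorical column makes it straightforward to treat the problem as multi-class classification, although three separate binary targets would be an equally valid framing.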

Dataset link: https://www.kaggle.com/einsteindata4u/covid19

In [26]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import missingno as msno
import ipywidgets as widgets

%config InlineBackend.figure_format = 'svg'
%matplotlib inline

warnings.filterwarnings('ignore')



In [2]:
%%javascript
IPython.OutputArea.prototype._should_scroll=function(lines){
    return false;
}


# 1. Read dataset

In [3]:
covid_df = pd.read_excel("dataset.xlsx")
covid_df.head()

Unnamed: 0,Patient ID,Patient age quantile,SARS-Cov-2 exam result,"Patient addmited to regular ward (1=yes, 0=no)","Patient addmited to semi-intensive unit (1=yes, 0=no)","Patient addmited to intensive care unit (1=yes, 0=no)",Hematocrit,Hemoglobin,Platelets,Mean platelet volume,...,Hb saturation (arterial blood gases),pCO2 (arterial blood gas analysis),Base excess (arterial blood gas analysis),pH (arterial blood gas analysis),Total CO2 (arterial blood gas analysis),HCO3 (arterial blood gas analysis),pO2 (arterial blood gas analysis),Arteiral Fio2,Phosphor,ctO2 (arterial blood gas analysis)
0,44477f75e8169d2,13,negative,0,0,0,,,,,...,,,,,,,,,,
1,126e9dd13932f68,17,negative,0,0,0,0.236515,-0.02234,-0.517413,0.010677,...,,,,,,,,,,
2,a46b4402a0e5696,8,negative,0,0,0,,,,,...,,,,,,,,,,
3,f7d619a94f97c45,5,negative,0,0,0,,,,,...,,,,,,,,,,
4,d9e41465789c2b5,15,negative,0,0,0,,,,,...,,,,,,,,,,


In [4]:
print(f"Total Number of rows are {covid_df.shape[0]} and total number of columns are {covid_df.shape[1]}.")

Total Number of rows are 5644 and total number of columns are 111.


# 2. Initial Exploratory data analysis

**2.1 Missing values**

In [5]:
pd.set_option('display.max_rows', covid_df.shape[0]+1) 
covid_df.isna().sum()

Patient ID                                                  0
Patient age quantile                                        0
SARS-Cov-2 exam result                                      0
Patient addmited to regular ward (1=yes, 0=no)              0
Patient addmited to semi-intensive unit (1=yes, 0=no)       0
Patient addmited to intensive care unit (1=yes, 0=no)       0
Hematocrit                                               5041
Hemoglobin                                               5041
Platelets                                                5042
Mean platelet volume                                     5045
Red blood Cells                                          5042
Lymphocytes                                              5042
Mean corpuscular hemoglobin concentration (MCHC)         5042
Leukocytes                                               5042
Basophils                                                5042
Mean corpuscular hemoglobin (MCH)                        5042
Eosinoph

**Observation 2.1:** Many columns have missing data, and dropping every row that contains a missing value is not feasible because too few rows would remain. Instead, we compute the fraction of missing values in each column (its "weightage") and check whether the columns with a weightage above 0.9 can be removed.

**2.11 Finding the weight of the missing data**

In [6]:
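# Fraction of missing values per column, sorted from most to least sparse
missing_datacol = (covid_df.isna().sum() / covid_df.shape[0]).sort_values(ascending=False).reset_index()

# Give the two resulting columns descriptive names
replace_col_list = ['Column_name', 'Weightage']
missing_datacol.columns = replace_col_list
missing_datacol.info()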

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Column_name  111 non-null    object 
 1   Weightage    111 non-null    float64
dtypes: float64(1), object(1)
memory usage: 1.9+ KB
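
As a side note, the same per-column fraction can be obtained with `isna().mean()`, because the mean of a boolean column is the share of `True` values. The sketch below checks this on a small made-up frame; the `toy` data is an assumption for illustration, while the column names `Column_name` and `Weightage` mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Small invented frame: "a" is 75% missing, "b" 25%, "c" complete.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Two equivalent ways to compute the missing fraction per column.
by_sum = toy.isna().sum() / toy.shape[0]
by_mean = toy.isna().mean()

# Same two-column layout as missing_datacol, built in one chain.
summary = (
    by_mean.sort_values(ascending=False)
    .rename_axis("Column_name")
    .reset_index(name="Weightage")
)
print(summary)
```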


In [30]:
# Interactive view for filtering columns by their missing-data weightage
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
from ipywidgets import IntSlider
from ipywidgets.embed import embed_minimal_html

def get_weightage_info(db, Weightage):
    """Return the rows of db with Weightage >= the given value (all rows if the input is empty)."""
    if Weightage != '':
        Weightage = float(Weightage)
        missing_weight = db.loc[db['Weightage'] >= Weightage]
    else:
        missing_weight = db
    return missing_weight

In [32]:
# interact() returns the function it wraps, not a widget; interactive() returns the widget
def show_weightage_info(Weightage):
    display(get_weightage_info(missing_datacol, Weightage))

slider = interactive(show_weightage_info, Weightage='')
display(slider)
embed_minimal_html('export.html', views=[slider], title='Widgets export')

interactive(children=(Text(value='', description='Weightage'), Output()), _dom_classes=('widget-interact',))

**2.11: Finding the number of columns with more than 90% missing data**

In [10]:
missing_datacol2 = covid_df.isna().sum() / covid_df.shape[0]
missing_90valcount = missing_datacol2.loc[missing_datacol2 > 0.9]

remove_count = missing_90valcount.count()
print("Number of column to be removed :", remove_count)

Number of column to be removed : 72


**Observation 2.11:** Based on the weightage computed above, 72 columns have more than 90% of their values missing. These columns will be removed, since columns this sparse are unlikely to contribute to the prediction.

**Dropping of columns**

In [11]:
pd.DataFrame(data=missing_90valcount)

Unnamed: 0,0
Serum Glucose,0.963147
Mycoplasma pneumoniae,1.0
Neutrophils,0.909107
Urea,0.92966
Proteina C reativa mg/dL,0.910347
Creatinine,0.924876
Potassium,0.934266
Sodium,0.934444
Alanine transaminase,0.960135
Aspartate transaminase,0.959957
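
The cell above only lists the sparse columns. A minimal sketch of the drop itself could look like the following; the helper name `drop_sparse_columns` and the toy frame are assumptions introduced here for illustration, and the 0.9 default mirrors the threshold used in this section.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df without columns whose
    fraction of missing values is strictly greater than threshold."""
    missing_fraction = df.isna().mean()
    sparse_cols = missing_fraction[missing_fraction > threshold].index
    return df.drop(columns=sparse_cols)

# Invented frame with one column above the threshold and two below it.
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 19 + [1.0],  # 95% missing
    "half_missing": [np.nan, 1.0] * 10,       # 50% missing
    "complete": list(range(20)),              # 0% missing
})
reduced = drop_sparse_columns(toy)
print(reduced.columns.tolist())
```

Applied to the notebook's data as `covid_df = drop_sparse_columns(covid_df)`, this should leave 111 - 72 = 39 columns, given the counts reported above. Returning a new frame rather than dropping in place keeps the original available for comparison.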


msno.heatmap(covid_df)

msno.matrix(covid_df)
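
With 111 columns these plots can be hard to read. The quantity that the missingno heatmap visualizes, the nullity correlation between columns, can also be inspected numerically with pandas alone. The sketch below is an approximation under stated assumptions, not missingno's own implementation: `nullity_correlation` is a hypothetical helper and the toy values are invented.

```python
import numpy as np
import pandas as pd

def nullity_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: correlation between the missingness indicators of columns."""
    nulls = df.isna()
    # Columns that are never or always missing have no variance, so their
    # correlation is undefined; keep only columns with both states present.
    varying = nulls.loc[:, nulls.any() & ~nulls.all()]
    return varying.astype(int).corr()

# Invented frame: the two lab values are missing in exactly the same rows.
toy = pd.DataFrame({
    "Hematocrit": [np.nan, 0.2, np.nan, 0.5],
    "Hemoglobin": [np.nan, -0.1, np.nan, 0.3],
    "Patient age quantile": [13, 17, 8, 5],  # never missing, so excluded
})
corr = nullity_correlation(toy)
print(corr)
```

A value close to 1 means two tests tend to be missing together, which is what one would expect for measurements taken from the same blood panel.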