# Introduction

In Part I, we (sort of) cleaned the data that we downloaded. We call it semi-clean since we focused on removing NAs from certain columns only. With the cleaned data, we are now going to do some exploratory data analysis! 

With so many columns, it's good to understand what kind of data we're working with. Visualization helps us build intuition about our data and quickly assess whether any columns show qualitatively significant patterns.

In this notebook, you will do the following:
1. Import pandas and data visualization libraries
2. Visualize the column data
    - univariate analysis
    - bivariate analysis

Useful readings on visualization: 
<a href = "https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed">Introduction to Data Visualization in Python</a> (run it in Incognito Mode if you face the paywall)

It's quite comprehensive and a useful guide for this Part if you're new to visualization.

This Part is not exhaustive, so feel free to go beyond the columns that we've suggested for visualization.

### Step 1: Import the following libraries
- pandas
- matplotlib.pyplot as plt
- seaborn as sns
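
If you're unsure what the aliases above look like in practice, here is a minimal sketch of the import cell. The aliases `pd`, `plt` and `sns` are the ones named in the list; nothing else is assumed.

```python
# The three libraries listed above, with their conventional aliases
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```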

In [None]:
# Step 1: Import the libraries

### Step 2: Read the semi-cleaned CSV from Part I
We will now read the CSV that we got from Part I as a DataFrame.
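
As a rough sketch of what reading a CSV into a DataFrame looks like, the example below feeds `pd.read_csv` a tiny in-memory CSV. The rows are made-up toy values and `toy_csv` is a hypothetical stand-in; in your notebook you would pass the path of the file you saved in Part I instead.

```python
import io
import pandas as pd

# Toy stand-in for the semi-cleaned file (made-up rows, real column names).
# In your notebook, pass your own file path to pd.read_csv instead.
toy_csv = io.StringIO(
    "age,bmi,gender,hospital_death\n"
    "68,22.7,M,0\n"
    "77,27.4,F,0\n"
    "25,31.9,F,1\n"
)

df_toy = pd.read_csv(toy_csv)

# Quick sanity checks after loading: size and column types
print(df_toy.shape)
print(df_toy.dtypes)
```

Checking `shape` and `dtypes` right after loading is a cheap way to confirm that the file you read is the one you expected.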

In [None]:
# Step 2: Read your semi-cleaned CSV as DataFrame

## Univariate analysis (UA)
In this section, we will perform univariate analysis. We'll examine each column with either a histogram or a barplot. 

For categorical values, we will first get a frequency count and then plot it as a barplot.

<strong>Hint: Google "pandas column barplot frequency"</strong>

### Step 3: Perform UA on "hospital_death" with a barplot
Let's plot a barplot of the frequencies of "hospital_death". As you may have noticed, this is the dependent variable that we'll be predicting later on. 

Plotting barplots is awesome because you can plot your data either using matplotlib.pyplot, DataFrame plotting method, or seaborn.

![HospitalDeathBarplot.png](attachment:HospitalDeathBarplot.png)

Here is what we expect to see when we plot the frequency count of the 'hospital_death' column in three different ways. It doesn't matter which plotting method you use. As long as we visualize it successfully, it's all good. 

What we want to see here is the proportion of deaths among all cases in the hospitals.
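
The three approaches mentioned above can be sketched side by side as follows. This is only an illustration on a made-up toy column (the `toy` DataFrame is hypothetical, not the real dataset), and the first panel uses `Axes.bar`, the object-oriented counterpart of `plt.bar`.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for the real column (made-up values)
toy = pd.DataFrame({"hospital_death": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]})

# Frequency count first, sorted by category so 0 comes before 1
counts = toy["hospital_death"].value_counts().sort_index()

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# 1. matplotlib: pass the categories and their counts explicitly
axes[0].bar(counts.index.astype(str), counts.values)
axes[0].set_title("matplotlib")

# 2. pandas plotting method on the counted Series
counts.plot.bar(ax=axes[1], rot=0, title="pandas")

# 3. seaborn: countplot does the counting for us
sns.countplot(x="hospital_death", data=toy, ax=axes[2])
axes[2].set_title("seaborn")

fig.tight_layout()
```

Note that the first two approaches need the frequency count up front, while `sns.countplot` works directly on the raw column.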

In [None]:
# Step 3: Plot a barplot for "hospital_death"

After you plot the barplots, notice the huge imbalance between survival (0) and death (1). It's almost a 10/90 split between 1 and 0 respectively.
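
If you'd like to put a number on the imbalance rather than eyeball it from the bars, one option is `value_counts(normalize=True)`, which returns proportions instead of counts. The sketch below uses a made-up toy column built to have a 90/10 split, so the figures it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy column constructed with nine 0s and one 1 (made-up values)
toy = pd.DataFrame({"hospital_death": [0] * 9 + [1]})

# normalize=True turns the counts into fractions that sum to 1
proportions = toy["hospital_death"].value_counts(normalize=True)
print(proportions)
```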

We should anticipate a challenging modelling session, because of how unbalanced the dataset is. 

While we hope for an even split between the two binary outcomes, sometimes we encounter these imbalances.

Don't sweat the details for now, and let's continue with UA! 

### Step 4: Perform UA on 'age' with a histogram
To look at the distribution of continuous variables, we will use a histogram. 

We can assess the shape of the distribution of the values with a histogram, and see if there are any interesting patterns.
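
Here is a minimal sketch of a histogram using the pandas plotting method on a made-up `age` column (the values and the choice of `bins=8` are illustrative assumptions). `plt.hist` or seaborn's `sns.histplot` would work as well.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy ages (made-up values), standing in for the real 'age' column
toy = pd.DataFrame(
    {"age": [23, 35, 41, 47, 52, 58, 60, 63, 65, 67, 70, 71, 74, 78, 81, 85]}
)

fig, ax = plt.subplots()

# bins controls how finely the value range is sliced
toy["age"].plot.hist(bins=8, ax=ax)
ax.set_xlabel("age")
ax.set_ylabel("count")
```

Try a few different values for `bins`: too few hides the shape of the distribution, too many makes it noisy.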

In [None]:
# Step 4: Plot a histogram with "age"

### Step 5: Perform UA on 'bmi' with a histogram
BMI, or Body Mass Index, is an index used as a rough measure of one's health. It is calculated as your weight (in kilograms) divided by the square of your height (in metres), i.e. BMI = kg/m².
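
As a quick worked example of the formula, the hypothetical helper `bmi` below (our own name, not part of any library) computes the index for a person weighing 70 kg who is 1.75 m tall.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms over height in metres squared."""
    return weight_kg / height_m ** 2


# 70 / (1.75 * 1.75) = 70 / 3.0625
example = bmi(70, 1.75)
print(round(example, 1))  # 22.9
```

If a height is recorded in centimetres, divide it by 100 before applying the formula. Check the data dictionary for the units used in this dataset.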

In [None]:
# Step 5: Plot a histogram with "bmi"

### Step 6: Perform UA on 'elective_surgery' with a barplot
We then assess the proportion of people who had elective surgery.

In [None]:
# Step 6: Plot a barplot with 'elective_surgery'

### Step 7: Perform UA on 'ethnicity' with a barplot
What are the proportions of the ethnicities that come in? We use a barplot to find out.

Don't forget to rotate your x-axis labels! Or use a horizontal barplot.
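
Both options can be sketched like this, using placeholder category names (the "Group ..." labels are hypothetical, not the dataset's actual categories). The `rot` argument of the pandas plotting method sets the tick-label rotation in degrees.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder categories with long labels (hypothetical names)
toy = pd.DataFrame(
    {
        "ethnicity": ["Group A (long label)"] * 5
        + ["Group B (long label)"] * 3
        + ["Group C (long label)"] * 2
    }
)
counts = toy["ethnicity"].value_counts()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: vertical bars with the x-axis labels rotated by 45 degrees
counts.plot.bar(ax=axes[0], rot=45)

# Option 2: horizontal bars, so long labels read naturally
counts.plot.barh(ax=axes[1])

fig.tight_layout()
```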

In [None]:
# Step 7: Plot a barplot with 'ethnicity'

### Step 8: Perform UA on 'gender' with a barplot
Let's see the gender makeup in the dataset.

In [None]:
# Step 8: Plot a barplot with "gender"

### Step 9: Perform UA on 'height' with a histogram
Next, we look at the height of those who are admitted into the hospital.

In [None]:
# Step 9: Plot a histogram with 'height'

### Step 10: Perform UA on 'pre_icu_los_days' with a histogram
According to the data dictionary, this column contains the patient's length of stay between hospital admission and unit admission.

In [None]:
# Step 10: Plot a histogram with 'pre_icu_los_days'

Phew, that was a lot of plotting. There's actually a lot more to explore since we have over 180 columns, but we suggested these columns based on the data dictionary. 

Please feel free to plot other columns that you find interesting! We are stopping at Step 10 so you don't get too fatigued by plotting - let's give bivariate analysis a try instead.

## Bivariate Analysis (BA)
Next up, bivariate analysis. In univariate analysis, we looked at each column on its own. In this section, we will look at the relationships that columns have with each other. Usually these relationships are between an independent variable, e.g., age, and the dependent variable, i.e. hospital_death. 

We'll be looking at the relationship hospital_death has with other patient information, e.g., bmi, weight, etc. 

Boxplots will be used a lot here, so head on to the readings provided up above if you need a refresher. 

In addition, we highly recommend using seaborn to plot boxplots. It's simply easier to plot boxplots with seaborn than with matplotlib.pyplot, but we won't stop you from practising.

### Step 11: Perform BA on 'age' vs 'hospital_death' with boxplot
Let's see if there's a difference in age between those who survive and those who die.
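
A seaborn boxplot for this kind of comparison puts the categorical variable on the x-axis and the continuous one on the y-axis. The sketch below shows the call pattern on a made-up toy DataFrame, so the resulting picture is not what the real data looks like.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data (made-up values): six survivors and six deaths
toy = pd.DataFrame(
    {
        "age": [34, 45, 52, 58, 63, 70, 55, 62, 68, 74, 79, 85],
        "hospital_death": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    }
)

fig, ax = plt.subplots()

# One box per outcome: x is the grouping column, y is the measured value
sns.boxplot(x="hospital_death", y="age", data=toy, ax=ax)
```

The same pattern applies to the other boxplot steps in this section: only the `y` column changes.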

In [None]:
# Step 11: Plot age vs hospital_death with a boxplot

### Step 12: Perform BA on 'bmi' vs 'hospital_death' with boxplot
BMI is closely related to one's health. Let's see if it has anything to do with ICU deaths.

In [None]:
# Step 12: Plot bmi vs hospital_death with boxplot

### Step 13: Perform BA on 'pre_icu_los_days' vs 'hospital_death' with boxplot
Earlier, we looked at 'pre_icu_los_days' alone for its distribution - now we see if there's a relationship between this feature and hospital death. 

Given how widely ranging the values were from our UA of 'pre_icu_los_days', you should consider removing outliers when you plot the boxplot. Otherwise the resultant plot will be weird (try it).
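
One way to do this, sketched below on made-up toy values, is the `showfliers=False` argument, which seaborn passes through to matplotlib's boxplot drawing. Keep in mind that it only hides the outlier points in the plot. The rows themselves stay in your DataFrame.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data (made-up values) with one extreme stay per group
toy = pd.DataFrame(
    {
        "pre_icu_los_days": [0.1, 0.3, 0.5, 0.8, 1.2, 45.0,
                             0.2, 0.6, 1.0, 1.5, 2.0, 60.0],
        "hospital_death": [0] * 6 + [1] * 6,
    }
)

fig, ax = plt.subplots()

# showfliers=False hides the outlier markers so the boxes stay readable
sns.boxplot(
    x="hospital_death",
    y="pre_icu_los_days",
    data=toy,
    showfliers=False,
    ax=ax,
)
```

Run it once with `showfliers=True` as well to see how a handful of extreme values can squash the boxes.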

<strong>Hint: Google "remove outliers boxplot seaborn" or read the documentation</strong>

In [None]:
# Step 13: Plot pre_icu_los_days vs hospital_death with a boxplot (without outliers)

### Step 14: Plot 'gender' vs 'hospital_death' using a grouped countplot
This is a bit interesting, since we're pitting two categorical variables against each other. What we are assessing here is whether there are any differences between genders in the proportion of people who die.
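
A grouped countplot can be sketched with the `hue` argument of `sns.countplot`, which splits each bar by a second categorical column. Since raw counts can be hard to compare when the groups differ in size, the sketch also uses `pd.crosstab` with `normalize="index"` to get the proportion of each outcome within each gender. The toy values (including the "M"/"F" labels) are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data (made-up values): five rows per gender
toy = pd.DataFrame(
    {
        "gender": ["M"] * 5 + ["F"] * 5,
        "hospital_death": [0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
    }
)

fig, ax = plt.subplots()

# hue splits every gender bar into one bar per outcome
sns.countplot(x="gender", hue="hospital_death", data=toy, ax=ax)

# Row-wise proportions: each gender's row sums to 1
rates = pd.crosstab(toy["gender"], toy["hospital_death"], normalize="index")
print(rates)
```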

<strong>Hint: Google "seaborn grouped countplot"</strong>

In [None]:
# Step 14: Plot a grouped countplot with seaborn for gender vs hospital death

### Step 15: Plot 'weight' vs 'hospital_death' using a boxplot
Let's try another feature against hospital death.

In [None]:
# Step 15: Plot a boxplot with weight vs hospital death

### End of Part II
That was a lot of plotting, but data visualization is a good skill to practice, and we hope that you got a lot out of this Part. 

If you're curious, you can try a few more UA and BA for the other 180+ features in the dataset. 

We won't be doing any data exporting so let's head on to Part III where we will do some basic feature engineering. 