# Exploratory Data Analysis with Python

This is the notebook for the O'Reilly Live Training - Exploratory Data Analysis with Python by Pratheerth Padman.

## Introduction to EDA

The dataset used throughout the session can be found at https://www.kaggle.com/fedesoriano/stroke-prediction-dataset

#### Importing the required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import warnings

#warnings.filterwarnings('ignore')
sns.set(style="whitegrid")
%matplotlib inline
#plt.rcParams['figure.figsize'] = [12, 8]

#### First look at the dataset!

Here, we'll use pandas to read the downloaded CSV file. We'll then print the number of rows and columns in the dataset using the shape attribute.

Then we'll get our first look at the dataset using the head method, which by default returns the first 5 rows. To see the last 5 rows, we can use the tail method. Both methods also accept the number of rows to return.

In [None]:
# importing the dataset

data_df = pd.read_csv("../data/healthcare-dataset-stroke-data.csv")

In [None]:
# shape of the dataset

print("The dataset has {} rows and {} columns".format(data_df.shape[0], data_df.shape[1]))

In [None]:
# first few rows of the dataset

data_df.head()

In [None]:
# last few rows

data_df.tail(3)

In [None]:
# viewing the entire dataframe

data_df

In [None]:
# transpose

data_df.head().T

#### Attribute Information

> 1) **id:** unique identifier

> 2) **gender:** "Male", "Female" or "Other"

> 3) **age:** age of the patient

> 4) **hypertension:** 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

> 5) **heart_disease:** 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

> 6) **ever_married:** "No" or "Yes"

> 7) **work_type:** "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

> 8) **Residence_type:** "Rural" or "Urban"

> 9) **avg_glucose_level:** average glucose level in blood

> 10) **bmi:** body mass index

> 11) **smoking_status:** "formerly smoked", "never smoked", "smokes" or "Unknown"*

> 12) **stroke:** 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

#### The Info Function

The info method reports the number of columns, the count of non-null values in each column (which reveals missing values), and the data type of each feature/variable in the dataset.
Here "object" means it's a categorical (string) feature, while "int64" and "float64" mean it's numerical.

In [None]:
data_df.info()

#### Filter the dataset using datatype

In [None]:
data_df.select_dtypes(include=["object"])

In [None]:
data_df.select_dtypes(include=["number"])
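
The same filter can be used to collect column names by type, which is handy when later cells need to loop over only the categorical or only the numerical features. This is a minimal sketch on a made-up frame: `toy_df` and its values are invented for illustration, and `exclude=["number"]` is used here as an assumption-free way to pick up every non-numeric column.

```python
import pandas as pd

# Hypothetical stand-in for the stroke dataset; the values are invented.
toy_df = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Female"],
        "age": [67.0, 61.0, 80.0],
        "hypertension": [0, 0, 1],
    }
)

# Column names of every non-numeric (here: string) feature.
categorical_cols = toy_df.select_dtypes(exclude=["number"]).columns.tolist()

# Column names of every numeric feature (integers and floats).
numerical_cols = toy_df.select_dtypes(include=["number"]).columns.tolist()

print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
```

Note that a 0/1 flag such as hypertension is stored as an integer, so it lands in the numerical list even though it is categorical in meaning.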

#### Which person has the max bmi from the dataset?

In [None]:
data_df["bmi"]

In [None]:
data_df[data_df.bmi == data_df.bmi.max()]

In [None]:
max_bmi_person = data_df[data_df.bmi == data_df.bmi.max()]

In [None]:
max_bmi_person.id
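
An alternative to the boolean mask is `idxmax`, which returns the index label of the largest value directly. The sketch below runs on a made-up frame (`toy_df` and its values are invented for illustration). One difference worth knowing: the mask above returns every row tied for the maximum, while `idxmax` returns only the first.

```python
import pandas as pd

# Hypothetical stand-in for the stroke dataset; the values are invented.
toy_df = pd.DataFrame(
    {
        "id": [101, 102, 103, 104],
        "bmi": [22.5, None, 41.3, 30.1],
    }
)

# idxmax() returns the index label of the first maximum, skipping NaN by default.
max_bmi_label = toy_df["bmi"].idxmax()

# Look up individual values for that row by label.
max_bmi_id = toy_df.loc[max_bmi_label, "id"]
max_bmi_value = toy_df.loc[max_bmi_label, "bmi"]

print(max_bmi_id, max_bmi_value)
```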

#### Questions on initial glance

> 1. Is gender correlated with stroke? Are males or females more likely to get it?

## Exercise

1. What is the value of the 10th observation of the age feature from the top of the dataset?

2. What is the value of the 7th observation of the bmi feature from the bottom of the dataset? 

3. What is the id number and work_type of the person with the lowest avg_glucose_level in the dataset?

4. Print out a filtered dataframe, based on three conditions:

   a) Age is less than 30
   b) Residence_type is Rural
   c) Gender is female
   
   How many rows are there in the filtered dataset?

## Univariate Data Analysis

#### Data Description

The describe method prints basic summary statistics: count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles of each variable. It works for both numerical and categorical features, but reports different statistics for each.

In [None]:
# data description for numerical columns

data_df.describe()
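
To see how the output differs by feature type, here is a small sketch on a made-up frame (`toy_df` and its values are invented for illustration). Numerical columns get count, mean, std, min, the quartiles and max; non-numerical columns get count, unique, top (the most frequent value) and freq (how often it occurs).

```python
import pandas as pd

# Hypothetical stand-in for the stroke dataset; the values are invented.
toy_df = pd.DataFrame(
    {
        "age": [67.0, 61.0, 80.0, 49.0],
        "gender": ["Male", "Female", "Female", "Female"],
    }
)

# Default behaviour: only numerical columns are summarised.
numeric_summary = toy_df.describe()

# Excluding numbers summarises the remaining (categorical) columns instead.
categorical_summary = toy_df.describe(exclude=["number"])

print(numeric_summary)
print(categorical_summary)
```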

#### Target Variable - Stroke

In [None]:
# occurrences of each value

data_df.stroke.value_counts()

In [None]:
# stroke - pie chart

stroke_labels = ["No Stroke", "Stroke"]

# sort by class value (0, then 1) so the sizes line up with stroke_labels
sizes = data_df.stroke.value_counts().sort_index()

plt.pie(x=sizes, labels=stroke_labels, autopct="%1.1f%%")

plt.show()
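
The class balance can also be read off as numbers rather than from a chart. A minimal sketch on a made-up Series (`toy_stroke` and its values are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical target column: 0 = no stroke, 1 = stroke.
toy_stroke = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="stroke")

# Absolute counts per class, ordered by class value.
counts = toy_stroke.value_counts().sort_index()

# normalize=True returns fractions; multiply by 100 for percentages.
shares = toy_stroke.value_counts(normalize=True).sort_index() * 100

print(counts)
print(shares.round(1))
```

A heavily skewed split here is a sign that the target is imbalanced, which is worth keeping in mind when reading the later plots.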

#### Continuous Numerical Features - age, avg_glucose_level, bmi

**Histogram**

A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is plotted as a bar whose height corresponds to how many data points are in that bin.
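
The binning itself can be sketched with NumPy, without drawing anything. The array below is made up for illustration (`toy_ages` is not taken from the dataset); `np.histogram` returns the count per bin and the bin edges, which is the information a histogram plot draws as bars.

```python
import numpy as np

# Hypothetical ages; the values are invented.
toy_ages = np.array([2, 15, 23, 31, 38, 45, 47, 52, 59, 60, 64, 71, 78, 80])

# Split the range [min, max] into 4 bins of equal width and count the points in each.
counts, bin_edges = np.histogram(toy_ages, bins=4)

# The difference between consecutive edges is the width of each bin.
bin_widths = np.diff(bin_edges)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f}) -> {count} points")
```

Every data point falls into exactly one bin, so the counts always add up to the number of observations.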

In [None]:
# single plot - age

sns.histplot(data_df.age, bins=30).set(title="Histogram of Age", xlabel="Age in Years")

plt.show()

In [None]:
# age histogram in people with stroke

sns.histplot(data_df.age[data_df["stroke"]==1], bins=30).set(title="Histogram of Age with Stroke", xlabel="Age in Years")

plt.show()

In [None]:
# subplots - bmi, avg_glucose_level

fig, (ax1, ax2) = plt.subplots(nrows=1,ncols=2, figsize=(15,10))

sns.histplot(data_df.bmi, bins=30, ax=ax1).set(title="Histogram of BMI", xlabel="BMI")
sns.histplot(data_df.avg_glucose_level, bins=30, ax=ax2, color="green").set(title="Histogram of Glucose Level", xlabel="Glucose Level (mg/dl)")

    
plt.tight_layout()
plt.show()

#### Boxplots and Outliers

A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

![title](../assets/boxplot1.png)

Image Credit: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
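
The quantities behind a boxplot can be computed by hand. This is a minimal sketch on made-up values (`toy_bmi` is invented for illustration), assuming the common 1.5 × IQR convention for the whisker limits, which is also seaborn's default `whis` setting. Points outside those limits are the ones drawn as outliers.

```python
import numpy as np

# Hypothetical BMI values; the last one is deliberately extreme.
toy_bmi = np.array([18.0, 21.0, 23.0, 24.0, 26.0, 27.0, 29.0, 31.0, 60.0])

# First quartile, median and third quartile.
q1, median, q3 = np.percentile(toy_bmi, [25, 50, 75])

# Interquartile range: the height of the box.
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = toy_bmi[(toy_bmi < lower_fence) | (toy_bmi > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```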

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(15,8))

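# pass the column as x= (recent seaborn versions treat a positional argument as data=)
# and use color= because palette= without hue= is deprecated
sns.boxplot(x=data_df.age, ax=ax1, color="seagreen").set(title="Boxplot - Age")
sns.boxplot(x=data_df.bmi, ax=ax2, color="steelblue").set(title="Boxplot - BMI")
sns.boxplot(x=data_df.avg_glucose_level, ax=ax3, color="indianred").set(title="Boxplot - Glucose Level")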

plt.tight_layout()
plt.show()

#### Should We Remove or Keep Outliers?

#### Categorical Features - gender, ever_married, work_type, Residence_type, smoking_status

In [None]:
# data description for non-numerical columns

data_df.describe(exclude = ["number"])

#### Countplot

In [None]:
# countplots for gender, ever_married and work_type

fig, ax = plt.subplots(1, 3, figsize=(15,8))

sns.countplot(x=data_df.gender, ax=ax[0]).set(title="Gender")
sns.countplot(x=data_df.ever_married, ax=ax[1]).set(title="Have they ever married?")
sns.countplot(x=data_df.work_type, ax=ax[2]).set(title="What type of work do they do?")

plt.tight_layout()

plt.show()

In [None]:
# same thing as above for people who've had a stroke

fig, ax = plt.subplots(1, 3, figsize=(15,8))

sns.countplot(x=data_df.gender[data_df["stroke"]==1], ax=ax[0]).set(title="Gender")
sns.countplot(x=data_df.ever_married[data_df["stroke"]==1], ax=ax[1]).set(title="Have they ever married?")
sns.countplot(x=data_df.work_type[data_df["stroke"]==1], ax=ax[2]).set(title="What type of work do they do?")

plt.tight_layout()

for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=45)

plt.show()

In [None]:
# countplots for residence_type and smoking_status

fig, ax = plt.subplots(1, 2, figsize=(15,8))

sns.countplot(x=data_df.Residence_type, ax=ax[0]).set(title="Where do they live?")
sns.countplot(x=data_df.smoking_status, ax=ax[1]).set(title="Have they smoked?")


plt.tight_layout()

for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=45)

plt.show()

In [None]:
# same as above for people who've had a stroke

fig, ax = plt.subplots(1, 2, figsize=(15,8))

sns.countplot(x=data_df.Residence_type[data_df["stroke"]==1], ax=ax[0]).set(title="Where do they live?")
sns.countplot(x=data_df.smoking_status[data_df["stroke"]==1], ax=ax[1]).set(title="Have they smoked?")

for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=45)
    
plt.tight_layout()
plt.show()

## Bivariate Data Analysis

#### Do older people tend to get more strokes?

In [None]:
sns.boxplot(x=data_df.stroke, y=data_df.age)
plt.show()

#### Is there a connection between the type of work you do and your bmi?

In [None]:
sns.violinplot(x=data_df.work_type, y=data_df.bmi)
plt.show()

#### Do older people have a higher average glucose level?

In [None]:
sns.scatterplot(x=data_df.age, y=data_df.avg_glucose_level)
plt.show()

In [None]:
# linear regression model - shows trend
sns.regplot(x=data_df.age, y=data_df.avg_glucose_level)
plt.show()

#### Smoking vs Stroke and Work Type vs Hypertension

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15,8))

sns.countplot(x="stroke", hue="smoking_status", data=data_df, ax=ax[0]).set(title="Smoking vs Stroke")
sns.countplot(x="hypertension", hue="work_type", data=data_df, ax=ax[1], palette="Set2").set(title="Work Type vs Hypertension")

plt.tight_layout()
plt.show()
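
Grouped countplots show raw counts, which are hard to compare when the groups differ a lot in size. A cross-tabulation with row-wise normalisation gives the share within each group instead. This is a minimal sketch on a made-up frame (`toy_df` and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the stroke dataset; the values are invented.
toy_df = pd.DataFrame(
    {
        "smoking_status": [
            "smokes", "never smoked", "smokes",
            "never smoked", "formerly smoked", "never smoked",
        ],
        "stroke": [1, 0, 0, 0, 1, 0],
    }
)

# normalize="index" divides each row by its total, so every row sums to 1.
stroke_share = pd.crosstab(
    toy_df["smoking_status"], toy_df["stroke"], normalize="index"
)

print(stroke_share)
```

Each row then reads as "of the people with this smoking status, what fraction had a stroke".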

## Exercise

   Create a 3 plot figure to answer the following questions:
   
   a) What is the percentage of people with hypertension in the dataset?
   
   b) Within each value of the feature "smoking_status", do men outnumber women?
   
   c) What can you tell me about the relationship between the type of work and the age?
   
   All figures should contain titles.

## Missing Data

In [None]:
data_df.isnull()

In [None]:
data_df.bmi.mean()

In [None]:
data_df.bmi.median()

In [None]:
data_df[data_df["bmi"].isnull()]

In [None]:
# create a missing indicator feature for bmi

data_df["bmi_nan"] = np.where(data_df["bmi"].isnull(), 1, 0)

In [None]:
data_df.head()

In [None]:
# fill missing bmi values with mean

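# assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
data_df["bmi"] = data_df["bmi"].fillna(data_df["bmi"].mean())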

data_df.head()
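
Two variations on the steps above, sketched on a made-up frame (`toy_df` and its values are invented for illustration): summing `isnull()` gives a compact count of missing values per column, and the median can be used in place of the mean as the fill value, since it is less sensitive to extreme values. Which one suits the real data is a judgment call this sketch does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the stroke dataset; the values are invented.
toy_df = pd.DataFrame(
    {
        "age": [67, 61, 80, 49],
        "bmi": [36.6, np.nan, 32.5, np.nan],
    }
)

# Number of missing values in each column.
missing_per_column = toy_df.isnull().sum()

# Work on a copy so the original frame stays untouched.
filled_df = toy_df.copy()

# Record which rows were missing before filling them in.
filled_df["bmi_nan"] = filled_df["bmi"].isnull().astype(int)

# Fill the gaps with the median of the observed values.
filled_df["bmi"] = filled_df["bmi"].fillna(filled_df["bmi"].median())

print(missing_per_column)
print(filled_df)
```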

## Correlation Analysis

In [None]:
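# numeric_only=True skips the text columns, which recent pandas versions no longer drop automatically
data_df.corr(numeric_only=True)

In [None]:
data_df.corr(numeric_only=True, method="spearman")

In [None]:
sns.heatmap(data_df.corr(numeric_only=True), annot=True)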
#plt.savefig("heatmap.jpeg")
plt.show()
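
The difference between the default (Pearson) and Spearman methods is easiest to see on a relationship that is monotonic but not linear. In this minimal sketch, `toy_df` is made up for illustration: y always increases with x, but along a curve. Pearson measures linear association, while Spearman works on ranks and so only measures whether the ordering is preserved.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not in a straight line.
toy_df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy_df["y"] = toy_df["x"] ** 3

# Pearson (the default) measures how well a straight line fits.
pearson = toy_df.corr().loc["x", "y"]

# Spearman compares ranks, so any strictly increasing relationship scores 1.
spearman = toy_df.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

When the two coefficients differ noticeably on real features, it can hint at a non-linear relationship or at outliers pulling the Pearson value.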