# VADSTI 2024

# Module 1: Exploratory Data Analysis (EDA)



This notebook provides recipes for exploratory data analysis (EDA), a critical step in any data science project. The goal of this workshop is to learn how to perform initial investigations of the data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

## **I. Dataset**



We will use the diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. 

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. 

**Study population:** Female patients over 21 years old of Pima Indian heritage.

**Data dictionary:** The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

- ***Pregnancies:*** Number of times pregnant
- ***Glucose:*** Plasma glucose concentration over 2 hours in an oral glucose tolerance test
- ***BloodPressure:*** Diastolic blood pressure (mm Hg)
- ***SkinThickness:*** Triceps skin fold thickness (mm)
- ***Insulin:*** 2-Hour serum insulin (mu U/ml)
- ***BMI:*** Body mass index (weight in kg/(height in m)^2)
- ***DiabetesPedigreeFunction:*** Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
- ***Age:*** Age (years)
- ***Outcome:*** Class variable (0 if non-diabetic, 1 if diabetic)




[Link to dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database/download)

[Link to UCI Repository](https://archive.ics.uci.edu/ml/datasets/diabetes)

## **II. Reading and manipulating the data**

In this section we will read the diabetes dataset into a pandas DataFrame. The two primary components of pandas are the Series and the DataFrame. A Series is essentially a single column, and a DataFrame is a two-dimensional table made up of a collection of Series that share the same index.
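As a minimal sketch of that relationship, the toy example below builds a DataFrame from two Series and shows that selecting one column returns a Series again. The values are hypothetical and are not taken from the diabetes dataset.

```python
import pandas as pd

# Two hypothetical Series sharing the same default index (one value per patient)
age = pd.Series([25, 41, 33], name="Age")
bmi = pd.Series([22.4, 31.0, 27.8], name="BMI")

# A DataFrame is a table whose columns are Series aligned on a common index
toy_df = pd.concat([age, bmi], axis=1)

# Selecting a single column gives back a Series
age_column = toy_df["Age"]
print(type(toy_df).__name__, toy_df.shape)
print(type(age_column).__name__, age_column.shape)
```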

### **1. Reading the data**

In [None]:
#install pandas
!sudo -H pip3 install pandas

In [None]:
# Importing pandas
import pandas as pd

In [None]:
# Reading data into a pandas dataframe: diabetes_df. The dataset diabetes.csv is located under the ./datasets folder
diabetes_df = pd.read_csv('./datasets/diabetes.csv')

In [None]:
# View top 10 rows of diabetes_df
diabetes_df.head(10)

### **2. Initial investigation**

Let's perform initial investigations of the dataframe.

In [None]:
#What's the shape of our dataframe? How many rows? How many columns?
diabetes_df.shape

In [None]:
#Dataframe Columns
diabetes_df.columns

In [None]:
#Column types
diabetes_df.dtypes

In [None]:
#Use .describe() on the dataframe to get a description of the dataframe
diabetes_df.describe()

In [None]:
#Use .info() on the dataframe to get a summary of the columns
diabetes_df.info()

In [None]:
#What's the count breakdown of the Outcome variable
diabetes_df.Outcome.value_counts()

### **3. Table 1: Baseline characteristics of the study population**

When working with any patient dataset, it is important to understand the baseline characteristics of your study population. In this section, we will generate ``Table 1``, i.e., patient baseline characteristics table commonly found in biomedical research papers.

We will use the tableone Python package, inspired by the R package of the same name. The tableone package can summarize both continuous and categorical variables within one table. Categorical variables can be summarized as counts and/or percentages. Continuous variables can be summarized in the “normal” way (means and standard deviations) or the “nonnormal” way (medians and interquartile ranges).

The table one package can be found here: https://pypi.org/project/tableone/
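To make the difference between the “normal” and “nonnormal” summaries concrete, here is a small sketch that computes both by hand with pandas. It does not use tableone, and the `toy` values are hypothetical; it only illustrates why a skewed variable is usually better described by its median and interquartile range.

```python
import pandas as pd

# Hypothetical toy data: a binary outcome and one continuous measurement
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
    "Insulin": [40, 60, 80, 100, 90, 110, 130, 600],
})

grouped = toy.groupby("Outcome")["Insulin"]

# "Normal" summary: mean and standard deviation per group
normal_summary = grouped.agg(["mean", "std"])

# "Nonnormal" summary: median and interquartile range (Q1, Q3) per group
nonnormal_summary = pd.DataFrame({
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
})

print(normal_summary.round(1))
print(nonnormal_summary)
```

In the second group, the single extreme value (600) pulls the mean up to 232.5 while the median stays at 120.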

In [None]:
#pip install the package
!sudo -H pip3 install tableone

In [None]:
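#import tableone
from tableone import TableOne
cols = ['Age', 'BMI','Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'DiabetesPedigreeFunction']

#Columns to summarize as median [IQR] instead of mean (SD).
#Assumption: these are treated as skewed here; check their histograms and adjust the list.
nonnormal = ['Pregnancies', 'Insulin', 'DiabetesPedigreeFunction']

#Use TableOne(data,columns,groupby) function to generate Table 1
mytable = TableOne(diabetes_df,columns=cols,groupby='Outcome',nonnormal=nonnormal, pval=True,htest_name=True)
mytable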

### **4. Missing data**

When working with any dataset, you’ll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's ``None`` or NumPy's ``np.nan``, each of which is handled differently in some situations.
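The short sketch below illustrates a few of those differences using hypothetical toy values: how the two markers compare to themselves, how pandas stores them depending on the column type, and how `isna()` detects both.

```python
import numpy as np
import pandas as pd

# NaN is a float that never compares equal to itself; None is a Python singleton
print(np.nan == np.nan)   # False
print(None == None)       # True

# In a numeric Series, pandas converts None to NaN and keeps a float dtype
numeric = pd.Series([1.0, None, np.nan])
print(numeric.dtype)      # float64

# In an object Series, None is stored as-is
text = pd.Series(["a", None, np.nan], dtype="object")

# isna() treats both markers as missing, whatever the dtype
print(numeric.isna().sum(), text.isna().sum())
```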

In [None]:
#Let's print how many missing values per column
diabetes_df.isna().sum()
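In commonly distributed copies of this dataset, missing measurements are often recorded as 0 rather than as NaN, so `isna().sum()` can report no missing values even though some entries are physiologically implausible (for example a BMI of 0). The sketch below shows one way to recode such zeros. It uses hypothetical toy rows, and the `zero_as_missing` list is an assumption that should be checked against your copy of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the layout of the diabetes dataset
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5],        # zero is a valid count here
    "Glucose": [148, 0, 95],
    "BloodPressure": [72, 66, 0],
    "BMI": [33.6, 0.0, 28.1],
})

# Assumption: a zero in these columns means "not recorded", not a real value
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

toy_clean = toy.copy()
toy_clean[zero_as_missing] = toy_clean[zero_as_missing].replace(0, np.nan)

print(toy.isna().sum().sum())   # zeros are not counted as missing
print(toy_clean.isna().sum())
```

The same pattern could be applied to `diabetes_df` before running the missingness visualizations, if the zeros in your copy do stand for missing measurements.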

Let's visualize the missingness in our dataframe. We will use the ``missingno`` package, which provides a small toolset of flexible and easy-to-use missing data visualizations and utilities for getting a quick visual summary of the completeness (or lack thereof) of your dataset.

Pip install missingno to get started. [Github link](https://github.com/ResidentMario/missingno)

In [None]:
#install missingno package
!sudo -H pip3 install missingno

In [None]:
#Import missingno package and use msno as the shorthand
import missingno as msno

**Bar Chart** 
`msno.bar` is a simple visualization of nullity by column

In [None]:
#use the msno.bar to plot a bar chart of nullity by column
msno.bar(diabetes_df)

**Missingness Matrix:** 
The `msno.matrix` nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

In [None]:
#Visualize missingness per column using the msno.matrix function
msno.matrix(diabetes_df,figsize=(18,10))

### **5. Handling missing data**

Understanding the reasons why data are missing is important for handling the remaining data correctly. If values are missing completely at random, the data sample is likely still representative of the population. But if the values are missing systematically, analysis may be biased. There are 3 types of missing data:

- ***Missing completely at random (MCAR)*** The propensity for a data point to be missing is completely random. The missing data are just a random subset of the data.

- ***Missing at random (MAR)*** The propensity for a data point to be missing is not related to the missing value itself, but it is related to some of the observed data. For example, males may be less likely to fill in a depression survey, but after accounting for sex this has nothing to do with their level of depression.

- ***Missing not at random (MNAR)*** (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR (i.e. the value of the variable that's missing is related to the reason it's missing). To extend the previous example, this would occur if men failed to fill in a depression survey because of their level of depression.
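The three mechanisms can be illustrated with a small simulation. The sketch below is built entirely on hypothetical data (a made-up survey with a `male` indicator and a depression `score`, with arbitrary missingness probabilities); it compares the mean of the observed scores with the true mean under each mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical survey: sex and a depression score. The group means differ
# only so that the example has something to bias.
male = rng.random(n) < 0.5
score = rng.normal(loc=np.where(male, 8.0, 12.0), scale=3.0)
survey = pd.DataFrame({"male": male, "score": score})

masks = {
    # MCAR: every response has the same 30% chance of being missing
    "MCAR": rng.random(n) < 0.3,
    # MAR: missingness depends only on an observed variable (sex)
    "MAR": rng.random(n) < np.where(male, 0.5, 0.1),
    # MNAR: missingness depends on the unobserved score itself
    "MNAR": rng.random(n) < np.where(score > 12.0, 0.6, 0.1),
}

true_mean = survey["score"].mean()
observed_means = {name: survey.loc[~mask, "score"].mean() for name, mask in masks.items()}
for name, value in observed_means.items():
    print(f"{name}: observed mean {value:.2f} (true mean {true_mean:.2f})")

# Under MAR, conditioning on the observed variable removes the bias
mar_by_sex = survey.loc[~masks["MAR"]].groupby("male")["score"].mean()
true_by_sex = survey.groupby("male")["score"].mean()
print((mar_by_sex - true_by_sex).abs().round(2))
```

In this setup the MCAR sample stays close to the true mean, the MAR sample is biased overall but not within each sex, and the MNAR sample is biased in a way that the observed variables cannot correct.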

Generally speaking, there are three main approaches to handle missing data: 

- ***Omission:*** where samples with invalid data are discarded from further analysis,
- ***Imputation:*** where values are filled in the place of missing data,
- ***Analysis:*** by directly applying methods unaffected by the missing values.

The choice of which methods to use is based on the use case and how the data was collected.

In [None]:
#Listwise deletion (row deletion)
diabetes_cc = diabetes_df.dropna()
diabetes_cc.shape

In [None]:
#Drop columns with missing data
diabetes_omission = diabetes_df.dropna(axis=1)
diabetes_omission.shape

In [None]:
#Impute using a constant
diabetes_constant = diabetes_df.fillna(0)
diabetes_constant

In [None]:
#Imputing using mean
diabetes_mean = diabetes_df.fillna(diabetes_df.mean())
diabetes_mean

In [None]:
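#Keep one of the versions above as the working dataframe
#(assumption: the mean-imputed one is used here) and check it for missing data
diabetes_final = diabetes_mean.copy()
diabetes_final.isna().sum()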

### **6. Filtering and transforming the data**

Depending on the use case, you might need to transform the data, for example by subsetting or filtering it, creating categorical variables, or applying other manipulations. In this section, we review some examples.

**Creating an age group variable**

In [None]:
#Use the describe() function to review a summary of the Age column
diabetes_final['Age'].describe()

In [None]:
#Histogram of age variable
diabetes_df.Age.hist()

In [None]:
#Binning the Age column using pd.cut function
import numpy as np

diabetes_final['age_group'] = pd.cut(diabetes_final['Age'], bins=[21,45,65,np.inf], labels=['21-44', '45-64', '65>'],right=False)
diabetes_final.age_group.value_counts()
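To see exactly where the bin edges fall, the sketch below applies the same bins and labels to a handful of hypothetical ages chosen to sit on and around the boundaries. With `right=False` each bin includes its left edge and excludes its right edge, and values below the first edge are left unassigned.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([20, 21, 44, 45, 64, 65, 81])

# Same bins and labels as in the cell above; right=False makes each bin [left, right)
groups = pd.cut(ages, bins=[21, 45, 65, np.inf],
                labels=['21-44', '45-64', '65>'], right=False)

print(pd.DataFrame({"Age": ages, "age_group": groups}))
```

Age 20 falls outside every bin and gets a missing value, while ages 45 and 65 start the second and third groups.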

**Filtering data**

Filtering data using conditions

In [None]:
#Uncomment one line at a time to try each filter
#diabetes_final[diabetes_final['age_group'] == "65>"]
#diabetes_final[(diabetes_final['Age'] > 65)]
#diabetes_final[(diabetes_final['Age'] > 65) | (diabetes_final['Age']<45) ]
#diabetes_final[(diabetes_final['Age'] > 65) & (diabetes_final['BMI']>30) ]
#diabetes_final[(diabetes_final['Age'] > diabetes_final['Age'].quantile(0.25))]
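As a runnable illustration of these filters, the sketch below applies two combined conditions to a hypothetical toy dataframe that reuses the `Age` and `BMI` column names. It also shows `DataFrame.query`, which expresses the same condition as a string.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the diabetes dataset
toy = pd.DataFrame({
    "Age": [23, 47, 66, 70, 35],
    "BMI": [21.0, 33.5, 29.0, 31.2, 40.1],
})

# Each comparison yields a boolean Series; wrap each one in parentheses
# because & and | bind more tightly than > and <
older_and_obese = toy[(toy["Age"] > 65) & (toy["BMI"] > 30)]
young_or_old = toy[(toy["Age"] > 65) | (toy["Age"] < 45)]

# DataFrame.query expresses the same condition as a string
older_and_obese_q = toy.query("Age > 65 and BMI > 30")

print(older_and_obese)
print(young_or_old)
```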

**Applying a function with map**

We will write a function that groups BMI values and use `.map()` to generate a new bmi_category column.

In [None]:
# The function takes a BMI value as input and returns a BMI group
# x<18.5 -> Unhealthy low
# x>=18.5 and x<25 -> Healthy
# x>=25 and x<30 -> Overweight
# x>=30 -> Obese

def bmi_groups(x):
    if x <18.5:
        return "Unhealthy Low"
    if x>=18.5 and x<25:
        return "Healthy"
    if x>=25 and x<30:
        return "Overweight"
    if x>=30:
        return "Obese"    

Now let's use the function to add a bmi_category column to our dataframe. We will use the `.map()` method.

In [None]:
diabetes_final["bmi_category"] = diabetes_final["BMI"].map(bmi_groups)
diabetes_final.bmi_category.value_counts()
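The same grouping can also be expressed with `pd.cut`, as was done for the age groups. The sketch below compares the two approaches on hypothetical BMI values that include the thresholds; it redefines `bmi_groups` with the same thresholds so that the block is self-contained.

```python
import numpy as np
import pandas as pd

def bmi_groups(x):
    # Same thresholds as the function defined above
    if x < 18.5:
        return "Unhealthy Low"
    if x < 25:
        return "Healthy"
    if x < 30:
        return "Overweight"
    if x >= 30:
        return "Obese"

# Hypothetical BMI values, including the bin edges
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 45.3])

mapped = bmi.map(bmi_groups)
binned = pd.cut(bmi, bins=[-np.inf, 18.5, 25, 30, np.inf], right=False,
                labels=["Unhealthy Low", "Healthy", "Overweight", "Obese"])

# Both approaches assign the same labels; pd.cut returns an ordered categorical
print((mapped == binned.astype(str)).all())
```

One practical difference is that `pd.cut` keeps the categories in a meaningful order, which helps when plotting or sorting the groups.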

## **III. Data Visualization**


Let’s look at how the patients in the dataset are distributed across the BMI categories. The cells below draw the same bar plot with seaborn, matplotlib, and pandas:

In [None]:
#Use the seaborn package to plot bar plots of the bmi_category variable
import seaborn as sns
ax = sns.countplot(x="bmi_category", data=diabetes_final)

In [None]:
#Use matplotlib package to plot bar plot of the bmi_category variable
import matplotlib.pyplot as plt
from collections import Counter

counter = Counter(diabetes_final.bmi_category)
plt.bar(counter.keys(),counter.values())


In [None]:
#Use pandas visualization to plot bar plot of the bmi_category variable
diabetes_final.bmi_category.value_counts().plot.bar()

Let's plot the relationship between age and BMI, or between age and other variables, using a scatter plot.

In [None]:
#Use pandas visualization to plot scatter plot of Age and BMI
diabetes_final.plot(kind='scatter', x='Age', y='BMI', title='Age vs BMI')

## **IV. Feature Selection**

Feature selection is crucial in enhancing model performance by focusing on relevant attributes. It involves identifying and utilizing the most informative features, streamlining the analysis process.

A related technique, feature extraction, derives new features from the original ones. Both reduce data dimensionality, which can improve model efficiency and performance.

Techniques for feature selection vary, including statistical tests, model-based approaches, and iterative methods, each offering unique advantages in simplifying datasets and improving predictive accuracy. Here are some ways to perform feature selection:

### **a) Correlations**


We measure the correlation between two numerical variables to gain insight into their relationship. For a dataset with many attributes, the correlation values between all pairs of attributes form a matrix called a correlation matrix.

Correlation can be used in feature selection by identifying and retaining features that are highly correlated with the target variable, while removing features that are highly correlated with each other. This reduces redundancy and focuses the model on the features most relevant to predicting the outcome, improving model performance and interpretability.

We can use the pandas `.corr` method to compute the correlation between two columns or a full correlation matrix.

In [None]:
#Calculate the correlation between Age and BMI
column_1 = diabetes_df["BMI"]
column_2 = diabetes_df["Age"]
correlation = column_1.corr(column_2)
print(correlation)

In [None]:
#Using Pearson correlation to generate a correlation matrix
corr = diabetes_df.corr(method='pearson')
corr.style.background_gradient(cmap='coolwarm').format(precision=2)#.to_excel('correlation_matrix_pearson.xlsx')

In [None]:
#Plot the lower triangle of the correlation heatmap using Seaborn
#look at feature correlations
corr = diabetes_df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

#make the heatmap plot
plt.figure(figsize=(16,9))
sns.heatmap(corr, mask=mask, annot = True,cmap='coolwarm', square=True)#, vmax=1, vmin=-1,center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.tight_layout()
plt.savefig('heatmap.png', dpi=250)
plt.show()
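The selection idea described at the start of this subsection (keep features related to the target, drop features that duplicate each other) can be sketched as follows. The data are synthetic and hypothetical, with features `a`, `b`, `c` and a `target` column invented for this example, and the 0.9 cutoff is an illustrative threshold rather than a rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Hypothetical features: a is informative, b nearly duplicates a, c is pure noise
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)
c = rng.normal(size=n)
target = (a + rng.normal(scale=0.5, size=n) > 0).astype(int)
toy = pd.DataFrame({"a": a, "b": b, "c": c, "target": target})

corr = toy.corr(method="pearson")

# Rank features by absolute correlation with the target
relevance = corr["target"].drop("target").abs().sort_values(ascending=False)

# Flag feature pairs that are highly correlated with each other
# (0.9 is an illustrative threshold, not a rule)
features = corr.drop(index="target", columns="target")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
redundant_pairs = upper.stack()
redundant_pairs = redundant_pairs[redundant_pairs.abs() > 0.9]

print(relevance.round(2))
print(redundant_pairs.round(2))
```

Here `a` and `b` are both strongly related to the target but carry almost the same information, so one of them could be dropped, while `c` shows little relation to the target.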

### **b) Variance**

Variance can guide feature selection: a feature whose values barely change across samples (near-zero variance) carries little information and is a candidate for removal, while features with higher variance may be more informative. Because variance depends on each feature's units and scale, compare variances only between features measured on comparable scales, for example after normalization. Removing near-constant features reduces the dimensionality of the dataset, which can improve model performance and computational efficiency.

In [None]:
#Here we check what the variance of each column looks like.
diabetes_df.var()
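Raw variances are only comparable between columns that share a scale. The sketch below uses hypothetical toy data (the same heights stored in metres and in centimetres, plus a nearly constant `flag` column) to show how the unit alone changes the ranking, and how min-max scaling makes the variances comparable.

```python
import pandas as pd

# Hypothetical heights recorded in two units, plus a nearly constant column
toy = pd.DataFrame({
    "height_m": [1.50, 1.62, 1.70, 1.81, 1.95],
    "flag": [1, 1, 1, 1, 0],
})
toy["height_cm"] = toy["height_m"] * 100

variances = toy.var()
print(variances)

# Min-max scaling puts every column on [0, 1], so variances become comparable
scaled = (toy - toy.min()) / (toy.max() - toy.min())
scaled_variances = scaled.var()
print(scaled_variances)
```

Before scaling, height looks less variable than `flag` in metres and far more variable in centimetres; after scaling, the two height columns have the same variance.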

## **V. Data Transformation**

### **a) Data Normalization**

Data normalization rescales numerical columns to a common scale while preserving the relative differences between values within each column. This matters for algorithms that are sensitive to feature scale, such as K-Means clustering or principal component analysis (PCA). You can use `MinMaxScaler` from `sklearn.preprocessing` for min-max normalization (values rescaled to the range 0 to 1) or `StandardScaler` for z-score normalization (mean 0, standard deviation 1), then apply the fitted scaler to your DataFrame columns.

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max Normalization
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(diabetes_df), columns=diabetes_df.columns)

# Z-Score Normalization
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(diabetes_df), columns=diabetes_df.columns)
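For reference, both transformations reduce to simple formulas. The sketch below applies them by hand with pandas to a hypothetical toy column, without using scikit-learn. Note the `ddof=0` argument: `StandardScaler` divides by the population standard deviation, whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"Glucose": [85.0, 100.0, 120.0, 150.0, 195.0]})

# Min-max normalization: (x - min) / (max - min), giving values in [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Z-score normalization: (x - mean) / std.
# ddof=0 (population standard deviation) matches StandardScaler;
# pandas' default for .std() is ddof=1.
z_score = (toy - toy.mean()) / toy.std(ddof=0)

print(min_max["Glucose"].round(3).tolist())
print(z_score["Glucose"].round(3).tolist())
```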

## **VI. Data Splitting**


Data splitting involves dividing a dataset into separate sets to train and evaluate a machine learning model, typically into training and test sets. This approach helps to assess the model's performance and generalization to new, unseen data, ensuring it learns patterns rather than memorizing the dataset.

In [None]:
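# Split features (X) and labels (y) into training and testing sets
from sklearn.model_selection import train_test_split
X, y = diabetes_df.drop(columns='Outcome'), diabetes_df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)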
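When the outcome classes are imbalanced, it can help to keep the same class proportions in both sets (a stratified split). The sketch below shows the idea with pandas only, on hypothetical toy data; it is an illustration of the concept rather than a replacement for `train_test_split`, which offers the same behaviour through its `stratify` argument.

```python
import pandas as pd

# Hypothetical toy data with an imbalanced binary outcome (70% zeros, 30% ones)
toy = pd.DataFrame({
    "Glucose": range(100, 200),
    "Outcome": [0] * 70 + [1] * 30,
})

# Sample 20% of the rows within each outcome class for the test set
test = toy.groupby("Outcome").sample(frac=0.2, random_state=42)
# Everything that was not sampled forms the training set
train = toy.drop(index=test.index)

print(train["Outcome"].value_counts(normalize=True).to_dict())
print(test["Outcome"].value_counts(normalize=True).to_dict())
```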

## **VII. Generate a data exploration report**


Let's use the ydata-profiling library (formerly pandas-profiling) to generate a report for the variables in the dataset.

In [None]:
#install ydata-profiling
!sudo -H pip3 install -U ydata-profiling

In [None]:
import os
import pandas as pd
from ydata_profiling import ProfileReport

# Reading data into a pandas dataframe: diabetes_df
diabetes_df = pd.read_csv('./datasets/diabetes.csv')

# Make sure the output folder exists before writing the report
os.makedirs('./exports', exist_ok=True)
profile = ProfileReport(diabetes_df)
profile.to_file("./exports/profiling_report.html")