[Home](../../README.md)

### Data Preprocessing

This Jupyter Notebook demonstrates different processes you can apply to your data to better understand it before data wrangling. For this demonstration we will use a relatively complex real dataset that compares health measures with the speed of progression of adult-onset type 2 diabetes.

#### Load the required dependencies

Load the three required dependencies:

- [NumPy](https://numpy.org/) is a library for numerical computing with arrays.
- [Pandas](https://pandas.pydata.org/) is a library for data analysis and manipulation.
- [Matplotlib](https://matplotlib.org) is a comprehensive library for creating static, animated, and interactive visualisations in Python. A customised stylesheet for the visualisations is also applied.

In [86]:
# Import frameworks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('../../style_Matplotlib_charts.mplstyle')

#### Store the data as a local variable

The data frame is a Pandas object that structures your tabular data into an appropriate format. It loads the complete data in memory so it is now ready for preprocessing.

In [87]:
data_frame = pd.read_csv("2.1.2.diabeties_sample_data.csv")
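As a rough illustration of what `read_csv` produces, the sketch below reads a tiny CSV from an in-memory string rather than from the sample file. The column names match ones used later in this notebook, but the values and the names `csv_text` and `sample_frame` are invented for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content; the values are made up for illustration only
csv_text = """BMI,BP,Target
32.1,101.0,151
21.6,87.0,75
30.5,93.0,141
"""

# read_csv accepts a file path or any file-like object
sample_frame = pd.read_csv(io.StringIO(csv_text))

print(type(sample_frame))  # the result is a pandas DataFrame
print(sample_frame.shape)  # (number of rows, number of columns)
```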

#### Data Snapshot

It is important to get a high-level look at your dataset to understand what you are working with. Printing the complete data is impractical for large-scale datasets, where the rows can number in the thousands or even millions.

You can use the `head()` and `tail()` methods to inspect the first and last 5 rows of your dataset.

In [None]:
# Target = A measure of disease progression in one year
data_frame.head()
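The cell above only calls `head()`. A minimal sketch of `tail()`, and of passing a row count to `head()`, might look like the following. It uses a small made-up frame (`demo_frame`, a hypothetical name) so that it runs without the sample file.

```python
import pandas as pd

# Hypothetical data: 10 made-up rows, not the real dataset
demo_frame = pd.DataFrame({
    "BMI": [20.0 + i for i in range(10)],
    "Target": [100 + 5 * i for i in range(10)],
})

last_rows = demo_frame.tail()     # last 5 rows by default
first_three = demo_frame.head(3)  # pass n to change the number of rows

print(last_rows)
print(first_three)
```

With the real data, the equivalent calls would be `data_frame.tail()` and `data_frame.head(3)`.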

#### Data Summary

The `info()` method call prints a summary of each column, giving you more information about the specific data types, total number of rows, null values and memory usage.

In [None]:
data_frame.info()
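`info()` prints its summary rather than returning it. If you want the null counts and data types as values you can work with, one possible approach is sketched below using `isna().sum()` and `dtypes`. The frame and its missing values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some deliberately missing values
demo_frame = pd.DataFrame({
    "BMI": [32.1, np.nan, 30.5, 25.3],
    "BP": [101.0, 87.0, np.nan, np.nan],
    "Target": [151, 75, 141, 206],
})

null_counts = demo_frame.isna().sum()  # number of missing values per column
column_types = demo_frame.dtypes       # data type of each column

print(null_counts)
print(column_types)
```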

#### Statistics For Numerical Columns
 
The `describe()` method call provides basic statistical knowledge like the mean and spread of the data.

In [None]:
data_frame.describe()
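`describe()` returns a DataFrame, so individual statistics can be looked up by row label. The sketch below assumes a tiny made-up column so the numbers are easy to check by hand. Note that the `std` row is the sample standard deviation (it divides by n - 1).

```python
import pandas as pd

# Hypothetical data: four made-up BMI values
demo_frame = pd.DataFrame({"BMI": [20.0, 25.0, 30.0, 35.0]})

summary = demo_frame.describe()

bmi_count = summary.loc["count", "BMI"]  # number of non-null values
bmi_mean = summary.loc["mean", "BMI"]    # arithmetic mean
bmi_std = summary.loc["std", "BMI"]      # sample standard deviation
bmi_median = summary.loc["50%", "BMI"]   # the 50th percentile is the median

print(summary)
```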

#### Graphically present the data

At this early stage you just want to evaluate the data. The simple plots below let you look at the data in different ways to inform your model design and data wrangling approaches.

In [None]:
# Plot a line graph of every numeric column against the row index
data_frame.plot()

In [None]:
# Plot a histogram of a column
plt.hist(data_frame['BMI'])
plt.title(f"Histogram of {data_frame['BMI'].name}")
plt.ylabel('Count')
plt.xlabel(f'{data_frame["BMI"].name}')
plt.show()

In [None]:
# Scatter plot 2 columns to see the relationship
plt.scatter(data_frame['BMI'], data_frame['Target'])
plt.title(f"Scatter of {data_frame['BMI'].name} against {data_frame['Target'].name}")
# Use double quotes outside so the f-strings also parse on Python versions before 3.12
plt.ylabel(f"{data_frame['Target'].name} Data")
plt.xlabel(f"{data_frame['BMI'].name} Data")
plt.show()
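A scatter plot shows a relationship visually. If you also want a single number to go with it, one option (not part of the original notebook) is the Pearson correlation coefficient from `Series.corr()`. The sketch below uses made-up values, so the result says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical data: made-up BMI and Target values
demo_frame = pd.DataFrame({
    "BMI": [20.0, 25.0, 30.0, 35.0],
    "Target": [80, 120, 130, 190],
})

# Pearson correlation by default: near +1 or -1 suggests a strong linear relationship
r = demo_frame["BMI"].corr(demo_frame["Target"])

print(f"Correlation between BMI and Target: {r:.3f}")
```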

In [None]:
# Scatter plot multiple columns against the target to compare the relationships
x_plot = ['BMI', 'BP']
for col in x_plot:
    # Label each series so the legend can tell the columns apart
    plt.scatter(data_frame[col], data_frame['Target'], marker='x', label=col)
plt.title(f"Scatter of {', '.join(x_plot)} against {data_frame['Target'].name}")
plt.ylabel(f"{data_frame['Target'].name} Data")
plt.xlabel(f"{', '.join(x_plot)} Data")
plt.legend()
plt.show()

In [None]:
# Scatter plot 2 columns in separate charts with a shared y-axis
fig, (ax1, ax2) = plt.subplots(1,2, sharey=True)
plt.suptitle(f"Scatter of {data_frame['BMI'].name} and {data_frame['BP'].name} against {data_frame['Target'].name}")
ax1.set_ylabel(f"{data_frame['Target'].name} Data")

ax1.scatter(data_frame['BMI'], data_frame['Target'])
ax1.set_xlabel(f"{data_frame['BMI'].name} Data")

ax2.scatter(data_frame['BP'], data_frame['Target'])
ax2.set_xlabel(f"{data_frame['BP'].name} Data")

plt.show()

In [None]:
# 3D scatter plot of two feature columns against the target
x_plot = ['BMI', 'BP']

fig = plt.figure()
plt.suptitle(f"3D Scatter of {', '.join(x_plot)} against {data_frame['Target'].name}")
ax = fig.add_subplot(111, projection='3d')

ax.scatter(data_frame[x_plot[0]], data_frame[x_plot[1]], data_frame['Target'], color='blue')

# Evenly spaced grid spanning both features (computed here but not drawn in this plot)
x1_range = np.linspace(data_frame[x_plot[0]].min(), data_frame[x_plot[0]].max())
x2_range = np.linspace(data_frame[x_plot[1]].min(), data_frame[x_plot[1]].max())
X1_grid, X2_grid = np.meshgrid(x1_range, x2_range)

# Label each axis with the column it shows
ax.set_xlabel(x_plot[0])
ax.set_ylabel(x_plot[1])
ax.set_zlabel(data_frame['Target'].name)

plt.show()