# Introduction to Exploratory Data Analysis

## Introduction

In this session, we will dive into the first of those core steps: exploratory analysis.

This step should not be confused with data visualization or summary statistics. Those are merely tools... means to an end.

Proper exploratory analysis is about answering questions. It's about extracting enough insights from your dataset to course correct before you get lost in the weeds.

In this guide, we explain which insights to look for. Let's get started.

## Why explore your dataset upfront?

The purpose of exploratory analysis is to **"get to know"** the dataset. Doing so upfront will make the rest of the project much smoother, in three main ways:

1. You’ll gain valuable hints for Data Cleaning (which can make or break your models).
2. You’ll think of ideas for Feature Engineering (which can take your models from good to great).
3. You’ll get a "feel" for the dataset, which will help you communicate results and deliver greater impact.

However, exploratory analysis for machine learning should be **quick, efficient, and decisive**... not long and drawn out!

Don’t skip this step, but don’t get stuck on it either.

As you can see [here](https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/), there are countless possible plots, charts, and tables, but you only need a **handful** to "get to know" the data well enough to work with it.

In this session, we'll show you the visualizations and data analysis that provide the biggest bang for your buck.

## Start with Basics

First, you'll want to answer a set of basic questions about the dataset:

- How many observations do I have?
- How many features?
- What are the data types of my features? Are they numeric? Categorical?
- Do I have a target variable?
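
A minimal sketch of how these questions map onto pandas calls. The small DataFrame is a toy stand-in built for illustration (only the column names are borrowed from the house-price data), and treating `SalePrice` as the target is an assumption.

```python
import pandas as pd

# Toy stand-in for a real dataset (illustrative values, not the full data)
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "LotFrontage": [65.0, 80.0, 68.0, 60.0, 84.0],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# How many observations (rows) and features (columns)?
# Note that the column count includes the target, if there is one.
n_observations, n_features = df.shape

# Which features are numeric, and which are categorical?
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Is the (assumed) target variable present?
has_target = "SalePrice" in df.columns

print(f"{n_observations} observations, {n_features} columns")
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("target present:", has_target)
```

On a real dataset, `df.dtypes` or `df.info()` gives the same overview in a single call.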

### Example observations

Then, you'll want to display example observations from the dataset. This will give you a "feel" for the values of each feature, and it's a good way to check if everything makes sense.

Here's an example from the real-estate dataset:

```python
import pandas as pd

# load dataset
data = pd.read_csv('../data/houseprice-train.csv')

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()
```

```
(1460, 81)
```


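|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|----|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----|----------|--------|-------|-------------|---------|--------|--------|----------|---------------|-----------|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |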


The purpose of displaying examples from the dataset is not to perform rigorous analysis. Instead, it's to get a **qualitative "feel"** for the dataset.

- Do the columns make sense?
- Do the values in those columns make sense?
- Are the values on the right scale?
- Is missing data going to be a big problem based on a quick eyeball test?
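
The last two questions can be backed up with a couple of quick numbers. The sketch below uses a toy frame with made-up missing values (the column names come from the house-price data, the values do not) to show the share of missing entries per column and the min/max of each numeric column.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, used only to illustrate the checks
df = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl", np.nan],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Fraction of missing values per column, worst offenders first
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Min and max of each numeric column: a quick check that values are on a sensible scale
value_ranges = df.describe().loc[["min", "max"]]
print(value_ranges)
```

A column that is mostly missing, like `Alley` in this toy frame, is worth a note for the data cleaning step.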

## Plot Numerical Distributions

Next, it can be very enlightening to plot the distributions of your numeric features.

Often, a quick and dirty grid of **histograms** is enough to understand the distributions.

Here are a few things to look out for:

- Distributions that are unexpected
- Potential outliers that don't make sense
- Features that should be binary (i.e. "wannabe indicator variables")
- Boundaries that don't make sense
- Potential measurement errors

At this point, you should start making notes about potential fixes you'd like to make. If something looks out of place, such as a potential outlier in one of your features, now's a good time to ask the client/key stakeholder, or to dig a bit deeper.

![](../imgs/histogram.png)

However, we'll wait until data cleaning to make fixes so that we can keep our steps organized.
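
As a rough sketch, pandas can draw the whole histogram grid in one call with `DataFrame.hist`. The data below is randomly generated for illustration, and `HasPool` is a hypothetical 0/1 column added to show how a "wannabe indicator variable" can be spotted.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; HasPool is a hypothetical column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=200),
    "YrSold": rng.integers(2006, 2011, size=200),
    "HasPool": rng.integers(0, 2, size=200),
})

# One histogram per numeric column, arranged in a grid (requires matplotlib)
axes = df.hist(bins=20, figsize=(9, 6))

# Numeric columns with only two distinct values are candidates for indicator variables
binary_like = [col for col in df.columns if df[col].nunique() == 2]
print("Possible indicator variables:", binary_like)
```

In a notebook the figure renders automatically; in a script, call `matplotlib.pyplot.show()` afterwards.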

## Plot Categorical Distributions

Categorical features cannot be visualized through histograms. Instead, you can use **bar plots**.

In particular, you'll want to look out for **rare classes** (or sparse classes), which are classes that have a very small number of observations.

![](../imgs/barchart.png)

By the way, a **"class"** is simply a unique value for a categorical feature. For example, the bar plot above shows the distribution for a feature called `exterior_walls`. So Wood Siding, Brick, and Stucco are each classes for that feature.

Some of the classes for `exterior_walls` have very short bars. Those are rare classes. They tend to be problematic when building models.

- In the best case, they don't influence the model much.
- In the worst case, they can cause the model to overfit.

Therefore, we recommend making a note to combine or reassign some of these classes later. We prefer saving this until tomorrow's Feature Engineering session.
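
One way to make "very short bars" concrete is to compute each class's share of the observations and flag the ones below a cutoff. In this sketch the class counts are made up, "Concrete" is a hypothetical class, and the 5% threshold is an assumption rather than a fixed rule.

```python
import pandas as pd

# Made-up class counts for illustration; "Concrete" is a hypothetical class
exterior_walls = pd.Series(
    ["Brick"] * 60 + ["Wood Siding"] * 30 + ["Stucco"] * 8 + ["Concrete"] * 2,
    name="exterior_walls",
)

# Share of observations per class (these are the bar heights, normalized)
class_share = exterior_walls.value_counts(normalize=True)

# Flag classes below an assumed cutoff of 5% of observations
RARE_THRESHOLD = 0.05
rare_classes = class_share[class_share < RARE_THRESHOLD].index.tolist()

print(class_share)
print("Rare classes:", rare_classes)
```

Calling `class_share.plot.bar()` draws the corresponding bar plot if matplotlib is installed.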

## Plot Segmentations

Segmentations are powerful ways to observe the _relationship between categorical features and numeric features_.

**Box plots** allow you to do so.

![](../imgs/boxplot.png)

Here are a few insights you could draw from the chart above:

- The median transaction price (middle vertical bar in the box) for Single-Family homes was much higher than that for Apartments / Condos / Townhomes.
- The min and max transaction prices are comparable between the two classes.
- In fact, the round-number min (`$200k`) and max (`$800k`) suggest possible data truncation...
- ...which is very important to remember when assessing the generalizability of your models later!
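
A sketch of how such a segmentation might be produced with pandas. The column names `property_type` and `tx_price` and all values are hypothetical, chosen to mimic the pattern described above (different medians, identical round-number min and max).

```python
import pandas as pd

# Hypothetical column names and made-up prices, for illustration only
df = pd.DataFrame({
    "property_type": ["Single-Family"] * 5 + ["Apartment / Condo / Townhouse"] * 5,
    "tx_price": [200000, 450000, 520000, 610000, 800000,
                 200000, 280000, 330000, 400000, 800000],
})

# The numbers behind a box plot: min, median, and max per class
summary = df.groupby("property_type")["tx_price"].agg(["min", "median", "max"])
print(summary)

# One box per class of the categorical feature (requires matplotlib)
ax = df.boxplot(column="tx_price", by="property_type")
```

Identical minimums and maximums across segments, as in this toy data, are the kind of pattern that hints at truncation.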

## Study Correlations

Finally, correlations allow you to look at the relationships _between numeric features and other numeric features._

Correlation is a value between -1 and 1 that represents how closely two features move in unison. You don't need to remember the math to calculate it. Just know the following intuition:

- **Positive** correlation means that as one feature increases, the other increases. E.g. a child’s age and her height.
- **Negative** correlation means that as one feature increases, the other decreases. E.g. hours spent studying and number of parties attended.
- Correlations near -1 or 1 indicate a **strong relationship**.
- Those closer to 0 indicate a **weak relationship**.
- 0 indicates **no linear relationship**.
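
To make the intuition concrete, here is a small sketch using `Series.corr`, which computes the Pearson correlation by default. The numbers are invented to match the two examples above.

```python
import pandas as pd

# Invented values that follow the two examples in the list above
kids = pd.DataFrame({
    "age": [4, 6, 8, 10, 12],
    "height_cm": [102, 115, 128, 138, 149],
})
students = pd.DataFrame({
    "study_hours": [2, 5, 9, 12, 15],
    "parties": [8, 6, 5, 2, 1],
})

# Positive correlation: height rises with age
r_pos = kids["age"].corr(kids["height_cm"])

# Negative correlation: parties attended fall as study hours rise
r_neg = students["study_hours"].corr(students["parties"])

print(f"age vs. height: {r_pos:.2f}")
print(f"study hours vs. parties: {r_neg:.2f}")
```

Both values land close to the ends of the -1 to 1 range because the toy data was built to be nearly linear.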

Correlation **heatmaps** help you visualize this information. Here's an example (note: all correlations were multiplied by 100):

![](../imgs/heatmap.png)

In general, you should look out for:

- Which features are strongly correlated with the **target variable**?
- Are there interesting or **unexpected** strong correlations between other features?
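
Both questions can be explored from the correlation matrix itself. The sketch below uses synthetic data in which `SalePrice` is constructed to depend on `LotArea` but not on `MoSold`. That dependence is an assumption made for the example, not a finding about the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: SalePrice is built from LotArea plus noise; MoSold is independent
rng = np.random.default_rng(42)
n = 300
lot_area = rng.normal(10000, 3000, n)
df = pd.DataFrame({
    "LotArea": lot_area,
    "MoSold": rng.integers(1, 13, size=n),
    "SalePrice": 50000 + 15 * lot_area + rng.normal(0, 20000, n),
})

# Pairwise Pearson correlations between numeric features
corr = df.corr(numeric_only=True)

# Features ranked by the strength of their correlation with the target
target_corr = corr["SalePrice"].drop("SalePrice").sort_values(key=abs, ascending=False)
print(target_corr)

# A basic heatmap of the correlation matrix
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax)
```

Sorting by absolute value keeps strong negative correlations near the top of the list, where they are as easy to spot as strong positive ones.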

Again, your aim is to gain intuition about the data, which will help you throughout the rest of the workflow.

By the end of your Exploratory Analysis step, you'll have a pretty good understanding of the dataset, some notes for data cleaning, and possibly some ideas for feature engineering.