## Exploratory Data Analysis (EDA)

EDA is an initial assessment of a dataset: you look for relationships between variables, spot outliers, and check statistical assumptions before fitting a model.

The output of this stage is a set of summary statistics and visualizations.

## Methods for EDA

During this phase of analysis, we will learn to:
- Encode categorical data so we can use it in statistical models
- Bin data into more meaningful chunks
- Apply normalization and other transforms to skewed data
- Check for outliers
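
As a rough preview of the last three items, the sketch below applies each one to a small made-up series. The values, bin edges, labels, and the 1.5 × IQR rule are assumptions chosen for illustration; none of them come from the iris data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed measurements, made up for illustration
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

# Binning: group continuous values into labelled ranges (edges are arbitrary)
bins = pd.cut(values, bins=[0, 2, 4, 100], labels=["low", "mid", "high"])

# Transform: log1p compresses the long right tail
log_values = np.log1p(values)

# Outlier check: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(bins.tolist())
print(values[is_outlier].tolist())
```

Only the value 50.0 is flagged here, and the log transform pulls it much closer to the rest of the data.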

## Encoding Categorical Variables 

In simple terms, encoding means converting categorical variables into numeric input.

Categorical data is non-numeric data that falls into groups and usually takes its values from a finite list, e.g. Hair Color, which can be one of ['black', 'brown', 'red', 'green'].

In [5]:
# setup environment 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import sklearn
import scipy
import seaborn as sns

print("Done")

Done


In [7]:
# dataset 
file_path = 'datasets/iris.csv'
iris = pd.read_csv(file_path)
iris.head()

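|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |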


In [9]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [11]:
iris.Species.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [12]:
# we could decide to encode the Species using a simple mapping

map_dict = {
    'Iris-setosa': 1,
    'Iris-versicolor': 2,
    'Iris-virginica': 4,
}

The simple mapping above can lead a model to assume that the codes carry an order or a numeric relationship. For example, with codes 1, 2 and 4, a statistical model would treat Iris-setosa and Iris-versicolor as more closely related to each other than either is to Iris-virginica.

However, looking at the [images](https://miro.medium.com/max/1000/1*Hh53mOF4Xy4eORjLilKOwA.png), this isn't the case.
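
To make the problem concrete, here is a minimal sketch that applies the mapping with `Series.map` and compares the gaps between the resulting codes. The three-row `species` series is a stand-in for the real `Species` column, and the gap variables are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Stand-in for the Species column: one label per class
species = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

map_dict = {
    'Iris-setosa': 1,
    'Iris-versicolor': 2,
    'Iris-virginica': 4,
}

# Series.map replaces each label with its integer code
codes = species.map(map_dict)

# The gaps between codes are what a distance-based model would "see"
gap_setosa_versicolor = abs(codes[0] - codes[1])
gap_setosa_virginica = abs(codes[0] - codes[2])

print(codes.tolist())
print(gap_setosa_versicolor, gap_setosa_virginica)
```

The second gap is three times the first, even though nothing about the flowers justifies that ratio; it comes purely from the arbitrary choice of codes.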


What we can do instead is convert each category into its own binary column, where

- 0 = the row is not in that category
- 1 = the row is in that category

In [15]:
# One Hot Encoding 

In [16]:
feature_df = pd.DataFrame(iris.Species)

In [18]:
feature_df.head(2)

|   | Species |
|---|---------|
| 0 | Iris-setosa |
| 1 | Iris-setosa |


In [19]:
encoded = pd.get_dummies(feature_df['Species'])

In [20]:
encoded

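|   | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|-------------|-----------------|----------------|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 145 | 0 | 0 | 1 |
| 146 | 0 | 0 | 1 |
| 147 | 0 | 0 | 1 |
| 148 | 0 | 0 | 1 |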


## Summary
We can see that pandas get_dummies() labels each row with a 1 for the category it belongs to and a 0 for the categories it doesn't.

Our dataframe now has three new numerical columns that we can use instead of the original class labels.
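
One possible next step is to swap the label column for the indicator columns in a single frame. The sketch below does this with `pd.concat` on a three-row stand-in frame; `iris_sample`, `encoded_sample`, and `model_input` are hypothetical names, and the sepal values are placeholders. Passing `dtype=int` is an assumption made so the output shows 0/1, since recent pandas versions return booleans from `get_dummies` by default.

```python
import pandas as pd

# Stand-in for the iris DataFrame: three rows, one per species
iris_sample = pd.DataFrame({
    'SepalLengthCm': [5.1, 7.0, 6.3],
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

# dtype=int keeps the 0/1 display; recent pandas versions default to bool
encoded_sample = pd.get_dummies(iris_sample['Species'], dtype=int)

# Drop the text labels and attach the indicator columns side by side
model_input = pd.concat(
    [iris_sample.drop(columns='Species'), encoded_sample], axis=1
)

print(model_input)
```

Each row of `model_input` has exactly one indicator column set to 1, so no ordering between species is implied.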