# Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) forms an important part of all data science work (indeed, some argue that exploratory data analysis ***is*** data science under an older name!). Here, we will walk through a very basic example of EDA. There's a whole lot more that could be covered, but ... baby steps, eh? :)

First, let's load in our data:

In [None]:
import pandas as pd

data = pd.read_csv('eda-demo.csv')
data

Before doing anything, it's a good idea to get some basic descriptive statistics of our data:

In [None]:
data.info()  # info() prints its own summary and returns None, so wrapping it in print() adds a stray "None"
print(data.describe())

It's a guess (but a reasonable one) that the purpose of this data is to inform the process of building a model to predict the <code>target</code> variable using the other known variables. Looks like we're dealing solely with numeric data (phew! no need to transform or dummy encode any data here) - normally, this would be the place that you'd perform any obvious transformations and data cleaning prior to using the data. For example, in Lab 4, you could have removed the Make and Model, and cleaned up the Transmission variable at this point.
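As a rough illustration of what that kind of cleaning step might look like, here is a minimal sketch on a made-up table. Only the column names Make, Model and Transmission come from the text above; the `cars` frame, its values, and the `Price` column are hypothetical and are not the Lab 4 data.

```python
import pandas as pd

# Hypothetical stand-in data (NOT the Lab 4 dataset), with a messy text column.
cars = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "Model": ["m1", "m2", "m3", "m4"],
    "Transmission": ["Manual", " manual", "AUTOMATIC", "automatic "],
    "Price": [10.0, 12.5, 20.0, 18.0],
})

# Drop identifier-like columns that we don't intend to model on.
cleaned = cars.drop(columns=["Make", "Model"])

# Normalise whitespace and case so that "Manual" and " manual" become one category.
cleaned["Transmission"] = cleaned["Transmission"].str.strip().str.lower()

# Dummy (one-hot) encode the cleaned categorical column as 0/1 integers.
encoded = pd.get_dummies(cleaned, columns=["Transmission"], dtype=int)
print(encoded)
```

Our demo data is already numeric, so none of this is needed here.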

Seeing as all our data is numeric, let's have a quick look at the correlations between our variables:

In [None]:
corr = data.corr()
corr

There are a couple of interesting things there (e.g., the high correlation between y and z) - perhaps they'll be easier to see/understand when we visualise them.
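If the correlation table gets too big to read by eye, one option is to list the strongest pairs directly. The sketch below is only an illustration: `strongest_pairs` is a hypothetical helper (not part of pandas), and the `demo` frame is synthetic data standing in for our real table.

```python
import numpy as np
import pandas as pd

def strongest_pairs(frame, top=5):
    """Return the `top` feature pairs ranked by absolute correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)

# Synthetic example: b is a noisy copy of a, c is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

top_pairs = strongest_pairs(demo)
print(top_pairs)
```

On the notebook's data, the equivalent call would be `strongest_pairs(data)`.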

## Visualisation

Visualisation plays an important part in EDA - some trends are easier to pick up visually rather than by analysing tables of data. Let's start by visualising our correlations from before:

In [None]:
from statsmodels.graphics.correlation import plot_corr

cp = plot_corr(corr, xnames=corr.columns.values)
cp.set_figwidth(8)
cp.set_figheight(8)

Okay, looks like there are some strong correlations to play with here, and a couple that we may want to look into further (e.g., the strong correlation between y and z, and the weak correlation between d and everything else). Let's examine the scatter plots to get some more ideas:

In [None]:
import seaborn as sns

pairs = sns.pairplot(data)

Okay, looks like we can definitely get rid of d - it doesn't look like it is contributing anything of use in terms of predicting the target.

x, y, and z all look useful, but the interaction between y and z seems pretty strong (almost like one is a noisy copy of the other!). It doesn't look like we get much extra information by keeping both of them. Given that y is slightly stronger in terms of correlation to the target than z, it looks like we should keep y and discard z.
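The keep-one-of-two reasoning above can be written down as a small rule of thumb. This is a sketch only: `redundant_feature` and its 0.9 threshold are assumptions made for illustration, and the data is synthetic (a feature, a noisy copy of it, and a target driven by the original).

```python
import numpy as np
import pandas as pd

def redundant_feature(frame, first, second, target="target", threshold=0.9):
    """If two features are highly correlated, name the one less correlated with the target."""
    corr = frame[[first, second, target]].corr().abs()
    if corr.loc[first, second] < threshold:
        return None  # not similar enough to call either one redundant
    return first if corr.loc[first, target] < corr.loc[second, target] else second

# Synthetic example: z_demo is a noisy copy of y_demo, and the target depends on y_demo.
rng = np.random.default_rng(1)
y_demo = rng.normal(size=500)
demo = pd.DataFrame({
    "y": y_demo,
    "z": y_demo + rng.normal(scale=0.2, size=500),
    "target": 2.0 * y_demo + rng.normal(scale=0.1, size=500),
})

to_drop = redundant_feature(demo, "y", "z")
print("Suggested feature to drop:", to_drop)
```

A rule like this is no substitute for looking at the plots, but it makes the decision reproducible.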

In terms of their relationship to the target, both x and y have fairly strong non-linearities (albeit in opposite directions). We may be able to "linearise" them with a set of simple transformations (see the lecture for some candidates).
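One quick way to shortlist a transformation is to check which candidate makes the feature most linearly related to the target. The sketch below assumes a handful of common candidates (identity, log, square root, square), which may not match the lecture's list. `rank_transforms` is a hypothetical helper, and the data is synthetic with a built-in logarithmic relationship.

```python
import numpy as np
import pandas as pd

def rank_transforms(feature, target, candidates):
    """Rank candidate transforms by |Pearson correlation| between transformed feature and target."""
    scores = {
        name: abs(np.corrcoef(fn(feature), target)[0, 1])
        for name, fn in candidates.items()
    }
    return pd.Series(scores).sort_values(ascending=False)

# Assumed candidate set; log and sqrt need strictly positive / non-negative inputs.
candidates = {
    "identity": lambda v: v,
    "log": np.log,
    "sqrt": np.sqrt,
    "square": np.square,
}

# Synthetic example: the target is (noisily) linear in log(x).
rng = np.random.default_rng(2)
x_demo = rng.uniform(1.0, 100.0, size=400)
target_demo = 3.0 * np.log(x_demo) + rng.normal(scale=0.1, size=400)

ranking = rank_transforms(x_demo, target_demo, candidates)
print(ranking)
```

Correlation only measures linear association, so it is worth confirming the winner with a scatter plot.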

So, our quick venture into EDA has suggested a couple of things:
1. Get rid of the z and d features
2. Perform a transformation on x and y

## Prepping the transformations

Let's apply the transformations suggested by EDA to a copy of our data:

In [None]:
import numpy as np
import seaborn as sns
trans = data.copy(deep=True)
trans['x'] = np.log(trans['x'])  # log transform; assumes every x value is strictly positive
trans['ysqr'] = trans['y']**2  # add y^2 as a new column rather than replacing y; keeping both is sometimes worth exploring
trans.drop(columns=[ 'z', 'd' ], inplace=True)

sns.pairplot(trans[[ 'x', 'y', 'ysqr', 'target' ]]);

There's certainly _some_ improvement in our features (the $y^2$ transformation in particular seems to be useful). The relationship between target and x is a little better, too (we could try a different type of transformation, but this will do for now!).

We're ready for modelling. We'll use a cross validation scheme (see Lecture 10) to assess the generalisation performance of our model. First, let's see how well a linear model would have performed on our "raw" data:

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

X = data.drop(columns=['target']).to_numpy()
y = data['target'].to_numpy()

kf = KFold(n_splits=10, shuffle=True, random_state=1234)
rsqr = cross_val_score(LinearRegression(), X, y, cv=kf)
print("Linear Regression on raw features: R^2={}, std={}".format(np.mean(rsqr), np.std(rsqr)))

_ = plt.figure()
scatter = sns.scatterplot(x=y, y=LinearRegression().fit(X, y).predict(X))
plt.xlabel('y (target)')
plt.ylabel('$\\hat{y}$')
plt.title('Linear Regression Model Fit (Pre EDA)');

There's a clear non-linear relationship between the predictions and the known target values. The $R^{2}$ value of 0.87 is not bad, but it's clear that the linear model is not picking up all the nuance of the relationships in the data. Let's see what happens when we take the EDA into consideration:

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

X = trans.drop(columns=['target']).to_numpy()
y = trans['target'].to_numpy()

kf = KFold(n_splits=10, shuffle=True, random_state=1234)
rsqr = cross_val_score(LinearRegression(), X, y, cv=kf)
print("Linear Regression on post-EDA transformed and removed features: R^2={}, std={}".format(np.mean(rsqr), np.std(rsqr)))

_ = plt.figure()
scatter = sns.scatterplot(x=y, y=LinearRegression().fit(X, y).predict(X))
plt.xlabel('y (target)')
plt.ylabel('$\\hat{y}$')
plt.title('Linear Regression Model Fit (Post EDA)');

The visualisation and $R^{2}$ score of 0.97 clearly show that the modelling process now captures a lot more of the relationship between the target and the input features. Note that exactly the same process of fitting the model was used, so the improvement that we see is due to our analysis of the features and a hand-crafted approach to tuning the process via simple feature transformations.
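To make the scoring less of a black box, here is a rough NumPy-only sketch of the idea behind k-fold cross-validated $R^{2}$ for a linear model. It mirrors the concept rather than scikit-learn's exact implementation (the fold assignments will differ). `kfold_r2` is a hypothetical helper, and the data is synthetic with a logarithmic relationship.

```python
import numpy as np

def kfold_r2(X, y, n_splits=10, seed=1234):
    """Mean and std of held-out R^2 for an ordinary least-squares fit."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then cut them into n_splits roughly equal folds.
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Prepend a column of ones so the fit includes an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # R^2 on the held-out fold: 1 - residual sum of squares / total sum of squares.
        resid = y[test_idx] - A_test @ coef
        total = y[test_idx] - y[test_idx].mean()
        scores.append(1.0 - (resid @ resid) / (total @ total))
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic example: the target is (noisily) linear in log(x), not in x itself.
rng = np.random.default_rng(0)
x_demo = rng.uniform(1.0, 50.0, size=300)
target_demo = 4.0 * np.log(x_demo) + rng.normal(scale=0.2, size=300)

raw_mean, raw_std = kfold_r2(x_demo.reshape(-1, 1), target_demo)
log_mean, log_std = kfold_r2(np.log(x_demo).reshape(-1, 1), target_demo)
print(f"raw x: R^2={raw_mean:.3f} (std {raw_std:.3f})")
print(f"log x: R^2={log_mean:.3f} (std {log_std:.3f})")
```

Because each fold is scored on rows the model never saw during fitting, the averaged score is an estimate of generalisation performance rather than of in-sample fit.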

## The End
Of course, not all EDA processes will end up as successful as this one was. There are going to be times where we need more sophisticated methods to automatically construct new feature transformations for us (see Neural Networks!). However, the EDA process still remains an important part of data science work, and should be the FIRST thing that you conduct prior to going head-first into the "state-of-the-art" modelling techniques!