#### <span style="color:black"><u>**Bivariate Analysis - Numeric vs Numeric**<a name="ba_n"></a></u></span>
    
- Correlation matrix
- Scatter plots with correlation calculations
    
<ins>Correlation</ins>

- [Correlation](https://towardsdatascience.com/what-is-correlation-975ea899aaed) is a statistical measure that expresses the extent to which two variables are linearly related. 
- For example, as one variable increases the other might increase too (positive correlation), or as one variable increases the other might decrease (negative correlation). 
- It's a common tool for describing simple linear relationships that exist within the data, without making a statement about cause and effect. 
- Just because two variables are correlated, an increase in one does not necessarily cause the movement of the other. Correlation does not imply causation.
- The method we will use to see if variables are correlated is Pearson's sample correlation coefficient, given by: 

    
<big><big>$$r = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}
{\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2 \sum_{i=1}^{n} (y_i - \overline{y})^2}}.$$</big></big>

<ins>Some Properties of Correlation</ins> 
* $Corr(X, Y) = Corr(Y, X)$
* $P(Y = aX + b) = 1$ will imply either perfect positive or perfect negative correlation, depending on the slope. In this case, the two variables will move together in a perfectly straight line
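
As a rough illustration of the formula and the two properties above, the sketch below writes Pearson's $r$ out by hand with NumPy and compares it with `np.corrcoef`. The data are synthetic and the helper name `pearson_r` is made up for this example; it is not part of the `ea` module used in this notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's sample correlation coefficient, written out from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Synthetic data (an assumption for illustration, not the notebook's dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)              # symmetry: Corr(X, Y) = Corr(Y, X)
r_numpy = np.corrcoef(x, y)[0, 1]   # NumPy's built-in, for comparison

r_pos = pearson_r(x, 3 * x + 2)     # exact line with positive slope
r_neg = pearson_r(x, -3 * x + 2)    # exact line with negative slope
print(r_xy, r_numpy, r_pos, r_neg)
```

The two exact-line cases show why the sign of the slope decides between perfect positive and perfect negative correlation.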

In [None]:
# Bi-variate analysis
ea.corr_heatmap(train)
for xvar in ea.numeric_variables(train):
    ea.plot_corr(data=train, xvar=xvar)

<u>Some thoughts</u>

* Surprisingly there is minimal correlation between health expenditure and life expectancy
* GDP per capita against life expectancy certainly doesn't follow a linear relationship. This feature might need transforming; perhaps a log transform would work
* Strong positive correlations include (HDI <-> Life Expectancy), (Electricity Access <-> Life Expectancy), (Schooling <-> Life Expectancy)
* Strong negative correlation between Infant Mortality and Life Expectancy

<ins>Bivariate Outlier Detection - Mahalanobis Distance</ins>

* Observing bivariate scatterplots (above) is useful, though the [Mahalanobis Distance](https://www.statisticshowto.com/mahalanobis-distance/) (MD) is also a popular option. 
* The Mahalanobis Distance helps us find the distance between a point and a multivariate distribution and is used for bivariate outlier detection as well as forming a big part of [Hotelling's T Squared Test](https://www.statistics.com/glossary/hotellings-t-square/). The general concepts of the Mahalanobis Distance can also be applied to [Principal Component Analysis](https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c) (PCA)
* MD is useful because Euclidean distance breaks down as an outlier measure when the covariance between two variables is non-zero
* Observing the diagram below, the Euclidean distance from the center of the cluster of points to the bottom red point is the same as the distance to the top right pink point, despite the bottom red point looking like a bivariate outlier. In situations like this the Mahalanobis Distance is useful

![md](MD.png)

To calculate the Mahalanobis Distance we first need the covariance matrix, which holds the [covariance](https://www.investopedia.com/terms/c/covariance.asp) between each possible variable pairing. Covariance, like correlation, can also be used to inform us as to whether two variables move in the same direction or in opposing directions. It does so by telling us the joint variability of the two variables, and is given by the formula 

$$\text{Cov}({x,y})=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1}$$
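
A small sketch of this formula, assuming synthetic data rather than the notebook's `train` frame: the covariance is computed by hand and checked against `np.cov`, which uses the same $n-1$ denominator by default.

```python
import numpy as np

# Synthetic pair of variables (illustrative assumption only)
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

n = len(x)
# Sample covariance written out from the formula above
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# 2x2 covariance matrix: variances on the diagonal, Cov(x, y) off the diagonal
C = np.cov(x, y)
print(cov_xy)
print(C)
```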

Because Euclidean distance tends to fail in detecting bivariate outliers when the two variables in question have substantial positive or negative covariance, we would ideally like to rescale the coordinates to remove the covariance and thus the correlation. 

To do this we calculate the [eigenvalues and eigenvectors](https://www.mathsisfun.com/algebra/eigenvalue.html) of the covariance matrix. An eigenvector of a matrix is a non-zero vector that the matrix only scales: multiplying the eigenvector by the matrix gives the same vector multiplied by a scalar. That scalar is the eigenvalue, often represented as $\lambda$. If we are dealing with two variables $x$ and $y$, the first eigenvector (the one with the largest eigenvalue) points in the direction of greatest variability in the dataset and the second eigenvector is orthogonal to the first.

Now that the eigenvectors have been calculated, we take each of our eigenvectors as a new axis and shrink each axis by the square root of the corresponding eigenvalue. This removes the covariance between the rescaled variables, so we can now calculate Euclidean distance as usual. Essentially, all the data points in bivariate space are compressed in the direction of each eigenvector, by an amount dictated by the corresponding eigenvalue.
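
The rotate-and-shrink step can be sketched with NumPy as follows. This is a minimal illustration on synthetic correlated data (the covariance values are arbitrary assumptions), not the code behind any `ea` helper.

```python
import numpy as np

# Synthetic correlated 2-D data (covariance values chosen arbitrarily)
rng = np.random.default_rng(2)
true_cov = np.array([[2.0, 1.2], [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=500)

Xc = X - X.mean(axis=0)                 # center the data
C = np.cov(Xc, rowvar=False)            # sample covariance matrix

# eigh is meant for symmetric matrices; eigenvalues come back in ascending
# order, so the direction of greatest variability is the LAST column here
eigvals, eigvecs = np.linalg.eigh(C)

# Use the eigenvectors as new axes, then shrink each axis by sqrt(eigenvalue)
Z = (Xc @ eigvecs) / np.sqrt(eigvals)

C_white = np.cov(Z, rowvar=False)       # covariance after rescaling
dist_white = np.linalg.norm(Z, axis=1)  # plain Euclidean distance in the new space
print(np.round(C_white, 6))
```

After rescaling, the covariance matrix is the identity, so ordinary Euclidean distance in the new coordinates no longer ignores the original correlation.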

The formula for the Mahalanobis Distance is given by:

$$MD = \sqrt{(\vec{x} - \vec{m})^T \cdot C^{-1} \cdot (\vec{x} - \vec{m})}$$

Where $(\vec{x} - \vec{m})$ is the vector of differences between a data point $\vec{x}$ and the mean vector $\vec{m}$, and $C^{-1}$ is the inverse of the covariance matrix.
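
To make the formula concrete, here is a hedged sketch that evaluates it directly with NumPy and cross-checks one value against `scipy.spatial.distance.mahalanobis`. The data and the two probe points `along` and `across` are invented for illustration; they mimic the situation in the diagram above, where two points share the same Euclidean distance from the center but only one follows the trend of the data.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Synthetic, positively correlated data (illustrative assumption)
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=300)

m = X.mean(axis=0)                                # mean vector
C_inv = np.linalg.inv(np.cov(X, rowvar=False))    # inverse covariance matrix

# MD for every row: sqrt((x - m)^T C^{-1} (x - m))
diff = X - m
md = np.sqrt(np.einsum('ij,jk,ik->i', diff, C_inv, diff))

# Two probe points at the same Euclidean distance from the mean
along = m + np.array([1.5, 1.5])     # follows the positive trend
across = m + np.array([1.5, -1.5])   # goes against the trend

md_along = mahalanobis(along, m, C_inv)
md_across = mahalanobis(across, m, C_inv)
print(md_along, md_across)
```

The point that goes against the trend gets a much larger Mahalanobis Distance even though both probes are equally far from the mean in the Euclidean sense.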

---

Using the Mahalanobis Distance for outlier detection does assume multivariate normality, though the [Henze-Zirkler Test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3927875/) can test for this.

Null and alternative hypotheses:

$$H_0: \text{The variables follow a multivariate normal distribution}$$

$$H_1: \text{The variables are not multivariate normal}$$


In [None]:
ea.henze_zirkler(train, y='Life_exp', alpha=0.05)

None of the combos follow a multivariate normal distribution, so Mahalanobis Distances for the above scatterplots using the `sklearn.covariance.EllipticEnvelope` class might not be an option. Other techniques like [Density Based Spatial Clustering of Applications with Noise (DBSCAN)](https://en.wikipedia.org/wiki/DBSCAN) (non-parametric), [One-Class Support Vector Machines](https://www.datatechnotes.com/2020/04/anomaly-detection-with-one-class-svm.html) and [Isolation Forests](https://en.wikipedia.org/wiki/Isolation_forest) are useful. A basic One-Class SVM is shown below.

In [None]:
# One class SVM to algorithmically find potential bivariate outliers
for var in ['InfantMortality', 'Health_exp', 'Employment', 'MeanSchooling', 'ElectricityAccess', 'HDI']:
    ea.one_class_svm(train, var, y='Life_exp', kernel='rbf', gamma=0.001, nu=0.01)
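
For readers without the `ea` helper, the following is a rough sketch of the underlying idea using scikit-learn's `OneClassSVM` directly. It is not the implementation of `ea.one_class_svm`: the data are synthetic, the features are standardized first, and `gamma='scale'` is used here instead of the `gamma=0.001` passed above. All of these are assumptions made for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in for one (predictor, Life_exp) pair; not the notebook's data
rng = np.random.default_rng(6)
x = rng.normal(50, 10, size=300)
y = 0.5 * x + rng.normal(0, 3, size=300)
pts = np.column_stack([x, y])

# RBF kernels are scale sensitive, so standardize both columns first
pts_scaled = StandardScaler().fit_transform(pts)

# nu is roughly the share of points the model is allowed to treat as outliers
model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.01)
labels = model.fit_predict(pts_scaled)   # +1 = inlier, -1 = flagged as outlier

print('flagged:', int((labels == -1).sum()), 'of', len(labels))
```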

For now I will trust that no substantial bivariate outliers exist in our data based on what I saw from the initial blue and green scatterplots and what One-Class-SVM predicted to be outliers (pink dots) for some of the variables I was interested in

#### <span style="color:black"><u>**Multivariate Analysis**<a name="ma"></a></u></span>

<u>Variance Inflation Factor</u>
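* A useful plot if you are familiar with the concept: the VIF of a predictor measures how much the variance of its regression coefficient is inflated by its linear relationship with the other predictors, $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the others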

In [None]:
# Compute and display VIFs for the numeric predictors (response dropped, rows with NaNs removed)
X = train.select_dtypes(include = 'number')
X = X.drop(['Life_exp'], axis = 1).dropna(axis = 0)

ea.display_vif(X, threshold=5);

As HDI has a very high VIF, if we were to build an Ordinary Least Squares model, multicollinearity would definitely be present. I won't drop the HDI variable right now, but the assumption that no multicollinearity exists in the model will be violated if we use OLS with these variables as our predictor variables. A high VIF for HDI is very unsurprising to me, as the HDI is itself a composite measure built from life expectancy, education and income, and closely related variables already exist in our dataset.
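
To show why a composite variable ends up with a high VIF, here is a minimal NumPy sketch. The function `vif` and the synthetic columns are hypothetical and are not the `ea.display_vif` implementation; the column `composite` plays the role of an index like HDI that is built from other predictors.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Design matrix: intercept plus every other column
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors (illustrative assumption)
rng = np.random.default_rng(4)
a = rng.normal(size=500)
b = rng.normal(size=500)
composite = a + b + rng.normal(scale=0.1, size=500)  # built from a and b
c = rng.normal(size=500)                             # unrelated to the rest

X_demo = np.column_stack([a, b, composite, c])
vifs = vif(X_demo)
print(np.round(vifs, 2))
```

In this toy setup the composite column and its ingredients all get VIFs far above the threshold of 5, while the unrelated column stays close to 1.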

#### <span style="color:black"><u>**Feature Transformations**</u></span>
    
Transforming features in a non linear way can:
- Help turn skewed data into more symmetrically distributed data
- Make relationships more linear between a predictor and the response variable (helps linearise the data)
- Help minimise heteroskedasticity in linear models
- Make data more evenly spread

    
Some transformations include:
    
* Log transformation: 
     * Reduces skew for right skewed data, though all values must be strictly positive
* Reciprocal transformation: 
     * Taking the reciprocal turns larger values into smaller values and vice versa. 
     * It is not defined when x is zero so we proceed with caution
* Square transformation: 
     * Mostly for left skewed data
     * [This toolkit](http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_03.pdf) makes a nice point about moving up and down the ladder
* Square root transformation: 
     * Reduces skewness of right skewed data, though all values must be non-negative. 
     * Weaker than log transformation.
* Box-Cox and Yeo-Johnson:
    * [Terrific resource](https://towardsdatascience.com/catalog-of-variable-transformations-to-make-your-model-works-better-7b506bf80b97)
    * Both find a lambda parameter that brings your data as close as possible to a normal distribution. Box-Cox needs strictly positive values, while Yeo-Johnson also handles zero and negative values
    * Check out the [PowerTransformer class from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)
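
A quick way to compare these transformations is to look at the skewness before and after. The sketch below does this on synthetic lognormal data (an assumption standing in for a right skewed feature such as `GDP_cap`) and uses `scipy.stats` rather than the sklearn `PowerTransformer`, purely to keep the example short.

```python
import numpy as np
from scipy import stats

# Strictly positive, right skewed synthetic data (illustrative assumption)
rng = np.random.default_rng(5)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

skews = {
    'raw': stats.skew(x),
    'sqrt': stats.skew(np.sqrt(x)),   # milder than the log transform
    'log': stats.skew(np.log(x)),     # needs strictly positive values
}

# With no lambda supplied, both functions estimate it by maximum likelihood
x_bc, lam_bc = stats.boxcox(x)        # strictly positive input only
x_yj, lam_yj = stats.yeojohnson(x)    # also accepts zero and negative values
skews['box-cox'] = stats.skew(x_bc)
skews['yeo-johnson'] = stats.skew(x_yj)

for name, value in skews.items():
    print(f'{name:12s} skew = {value:.3f}')
print('Box-Cox lambda:', round(lam_bc, 3))
```

For lognormal data the estimated Box-Cox lambda lands near zero, which corresponds to the log transform.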
    
    
A log transform for the `GDP_cap` variable might be nice as it was very right skewed to begin with
    
    
Will be doing feature engineering using sklearn rather than pandas once I build models :)

In [None]:
# Preview what happens if we log transform the GDP_cap variable

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2, figsize = (12, 6))
fig.suptitle('GDP_cap vs Life Expectancy (left) and ln(GDP_cap) vs Life Expectancy (right)')

sns.scatterplot(x = train.GDP_cap, y = train.Life_exp, s = 6, hue = train.Status, ax = ax1, palette = 'viridis')
sns.scatterplot(x = np.log(train.GDP_cap), y = train.Life_exp, s = 6, hue = train.Status, ax = ax2, palette = 'viridis')
sns.kdeplot(train.GDP_cap, ax=ax3)
sns.kdeplot(np.log(train.GDP_cap), ax=ax4)

ax1.set_title('Pre Log Transform')
ax2.set_title('Post Log Transform');

# Perhaps a log transform of the GDP feature could be useful 
# ...as it has reduced the skewness and has linearised the relationship with life expectancy

In [None]:
# Save the updates to new csvs
train.to_csv('../datasets/Train_updated.csv', index=False)
test.to_csv('../datasets/Test_updated.csv', index=False)