<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Examples: Regularisation - Data scaling
© ExploreAI Academy

In this notebook, we'll look at what feature scaling is, why it matters for regularisation, and how to implement it in our models.

## Learning objectives

By the end of this notebook, you should be able to:
- Understand what scaling and standardisation are.
- Understand the code required to implement data scaling.

## Introducing Scaling

Scaling data is crucial when preparing it for machine learning models, especially those that involve regularisation. Regularisation techniques, such as L1 (Lasso) and L2 (Ridge) regularisation, control model complexity by penalising the size of the predictor coefficients. The size of each coefficient depends on the units of its predictor: a feature measured on a small scale needs a large coefficient to have the same effect on the response as a feature measured on a large scale. As a result, if the features are on different scales, the penalty falls more heavily on the small-scale features than on the large-scale ones, regardless of how informative they are. To apply regularisation evenly across all features, we must first bring them onto a common scale.
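
As a rough illustration of this effect, the sketch below fits a Ridge model to synthetic data in which two predictors contribute equally to the response but are recorded in very different units. The data, the variable names (`X_toy`, `y_toy`), the helper `fraction_retained` and the penalty strength `alpha=100` are all assumptions made for this example and are not part of the dataset used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up for illustration): both predictors contribute
# equally to y_toy, but x2 is recorded in units 1000 times larger than x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n) * 1000
y_toy = 3 * x1 + 0.003 * x2 + rng.normal(scale=0.1, size=n)
X_toy = np.column_stack([x1, x2])


def fraction_retained(X, y, alpha=100.0):
    """Ridge coefficient divided by the unpenalised (OLS) coefficient."""
    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=alpha).fit(X, y)
    return ridge.coef_ / ols.coef_


# Share of each coefficient that survives the penalty, before and after scaling
retained_raw = fraction_retained(X_toy, y_toy)
retained_scaled = fraction_retained(StandardScaler().fit_transform(X_toy), y_toy)

print("raw features:   ", retained_raw.round(2))
print("scaled features:", retained_scaled.round(2))
```

On the raw features, the small-scale predictor should keep a noticeably smaller share of its unpenalised coefficient than the large-scale one. After standardisation, both should be shrunk by a similar proportion.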

There are two common scaling techniques:

### Normalisation

One way to do this is with $[0,1]$-normalisation, otherwise known as min-max normalisation: squeezing your data into the range $[0,1]$. Through normalisation, the maximum value of a variable becomes one, the minimum becomes zero, and the values in-between become decimals between zero and one.

We implement this transformation by applying the following operation to each of the values of a predictor variable:

$$\hat{x}_{ij} = \frac{x_{ij}-\min(x_j)}{\max(x_j)-\min(x_j)},$$

where $\hat{x}_{ij}$ is the value after normalisation, $x_{ij}$ is the $i^{th}$ item of $x_j$, and $\min()$ and $\max()$ return the smallest and largest values of variable $x_j$ respectively.

Normalisation is useful because it ensures all variables share the same range: $[0,1]$. One problem with normalisation, however, is its sensitivity to outliers: a single extreme value stretches the range, so the bulk of the data is squeezed into a small part of $[0,1]$ and differences between typical values become hard to distinguish.
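
The min-max formula can be sketched directly in NumPy. The helper name `min_max_normalise` and the sample values are assumptions chosen for illustration; the second array adds one extreme value to show how an outlier compresses the remaining data.

```python
import numpy as np


def min_max_normalise(x):
    """Apply (x - min) / (max - min) to a 1-D array of values."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


# Evenly spread values map onto evenly spread points in [0, 1]
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
normalised = min_max_normalise(values)
print(normalised)  # 0, 0.25, 0.5, 0.75, 1

# Replacing the largest value with an outlier squeezes the rest towards zero
with_outlier = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])
squeezed = min_max_normalise(with_outlier)
print(squeezed.round(3))
```

Note that this simple helper would divide by zero for a constant variable, so it is only a sketch of the formula rather than a replacement for a library scaler.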

### Standardisation

Z-score standardisation, or simply standardisation, is less sensitive to this problem because it rescales by the standard deviation rather than by the full range.

We implement Z-score standardisation by applying the following operation to each of our variables: 

$$\hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}.$$

Here, $\mu_j$ represents the mean of variable $x_j$, while $\sigma_j$ is the variable's standard deviation. As can be seen from the above formula, instead of dividing by the full range of our variable, we divide by a more distribution-aware measure: the standard deviation. While this doesn't completely remove the effects of outliers, it does treat them in a more conservative manner. As a trade-off, our variable is no longer contained within the $[0,1]$ range as it was during normalisation (in fact, it can now take on negative values). This means that our variables won't all be bound to the exact same range (i.e. they can have slightly different levels of influence on the learnt regression coefficients during regularisation), but they are far closer to one another than they were before standardisation.
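
The Z-score formula can be sketched in the same way. The helper name `standardise` and the sample values are assumptions for illustration; the sketch uses the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np


def standardise(x):
    """Apply (x - mean) / std to a 1-D array, using the population std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
z = standardise(values)

print(z.round(3))          # approximately -1.414, -0.707, 0, 0.707, 1.414
print(z.mean(), z.std())   # mean close to 0, standard deviation close to 1
```

The result is centred on zero with unit standard deviation, and its values fall outside the $[0,1]$ range, including negative values.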

## Getting started

To begin, let's import a few Python libraries.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


Now we'll load our data as a Pandas DataFrame after fetching it from the GitHub repo.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/SDG_15_Life_on_Land_Dataset.csv', index_col=0)
df.head()

We can take a look at the dimensions of the dataframe to get an idea of the number of rows, _n_, and number of predictors, _p_, which is equal to one less than the number of columns.

In [None]:
df.shape

Our dataset contains various environmental indicators related to SDG 15, such as deforestation rates, protected area coverage, biodiversity indices, and other relevant variables. Our objective is to model the health of biodiversity using these indicators.

The mathematical representation of our model can be described as follows:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p$$

In this formulation, $Y$ represents the response variable, which, in our case, is `BiodiversityHealthIndex`. This response variable is influenced by _p_ predictor variables ($X_1, X_2, \dots, X_p$), each representing different environmental indicators relevant to SDG 15.

We can see in the data above that the variables have different scales. For example, variables such as `ConservationFunding` may involve financial values potentially reaching into high numerical ranges, whereas other variables like `ProtectedAreaCoverage` or `RenewableEnergyUsage` are expressed as percentages. So let's go ahead and implement scaling. 
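
One way to compare scales numerically is to summarise the minimum, maximum, and spread of each column. The sketch below does this on a small toy DataFrame: the column names are borrowed from the text above, but the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy values (invented for this example), not the real SDG 15 data
toy = pd.DataFrame({
    "ConservationFunding": [120_000.0, 450_000.0, 980_000.0, 75_000.0],
    "ProtectedAreaCoverage": [12.5, 30.0, 18.2, 44.9],
})

# One row per variable, with its minimum, maximum, standard deviation and range
scale_summary = toy.agg(["min", "max", "std"]).T
scale_summary["range"] = scale_summary["max"] - scale_summary["min"]
print(scale_summary)
```

The same two lines of summary code could be applied to the predictors of the real DataFrame to see how far apart their ranges are.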

## Implementing Scaling

Let's see how to standardise the features. Scikit-learn makes rescaling easy: we'll use the `StandardScaler` class from `sklearn.preprocessing`.

In [None]:
# split data into predictors and response
X = df.drop('BiodiversityHealthIndex', axis=1)
y = df['BiodiversityHealthIndex']

In [None]:
# import the scaler class from sklearn
from sklearn.preprocessing import StandardScaler

In [None]:
# create scaler object
scaler = StandardScaler()

In [None]:
# create scaled version of the predictors (there is no need to scale the response)
X_scaled = scaler.fit_transform(X)

In [None]:
# convert the scaled predictor values into a dataframe, keeping the original column names and index
X_standardise = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
X_standardise.head()

Taking a look at one of the variables as an example (`SoilQualityIndex`), we can see that standardising the data has caused it to be centred around zero.

In [None]:
plt.hist(X_standardise['SoilQualityIndex'])
plt.show()

Furthermore, the standard deviation of each variable in the data is now approximately equal to one.

In [None]:
X_standardise.describe().loc['std']
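
The values in the `std` row may sit very slightly above one. This is because `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `describe()` reports the sample standard deviation (`ddof=1`). The sketch below shows the difference on a small toy DataFrame whose values are made up for illustration; with only five rows the gap is much larger than it would be on a full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (invented for this example)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 0.0, 30.0, 20.0, 40.0],
})

toy_scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

pop_std = toy_scaled.std(ddof=0)     # what StandardScaler normalises to one
sample_std = toy_scaled.std(ddof=1)  # what describe() reports

print(pop_std)
print(sample_std)  # sqrt(n / (n - 1)) for each column, here sqrt(5 / 4)
```

As the number of rows grows, the factor $\sqrt{n/(n-1)}$ approaches one and the two measures become practically indistinguishable.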

## Implementing min-max normalisation

Let's see how to normalise the features. This time we'll use the `MinMaxScaler` class from `sklearn.preprocessing`.

In [None]:
# import the scaler class from sklearn
from sklearn.preprocessing import MinMaxScaler

In [None]:
# create scaler object
scaler = MinMaxScaler()

In [None]:
# create scaled version of the predictors (there is no need to scale the response)
X_scaled = scaler.fit_transform(X)

In [None]:
# convert the scaled predictor values into a dataframe
X_normalise = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
X_normalise.head()

Taking a look at one of the variables as an example (`SoilQualityIndex`), we can see that normalising the data has placed it neatly between 0 and 1.

In [None]:
plt.hist(X_normalise['SoilQualityIndex'])
plt.show()

Furthermore, the standard deviations of the normalised variables are all relatively similar, at approximately 0.28.

In [None]:
X_normalise.describe().loc['std']
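
As a quick sanity check, the output of `MinMaxScaler` can be compared against the min-max formula from earlier in the notebook. The sketch below does this on a small toy DataFrame with made-up values; the names `toy`, `toy_normalised` and `manual` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (invented for this example)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 0.0, 30.0, 20.0, 40.0],
})

toy_normalised = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

# The same transformation written out by hand, column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_normalised.min())  # every column has a minimum of 0
print(toy_normalised.max())  # every column has a maximum of 1
print(np.allclose(toy_normalised, manual))
```

The same minimum and maximum check could be run on the normalised predictors to confirm that every column lies in $[0,1]$.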

## Conclusion

In this notebook, we have covered:

- The difference between normalising and standardising the predictor variables in our dataset.
- How to apply both scaling techniques to our data using scikit-learn's `StandardScaler` and `MinMaxScaler`.

## Appendix
Links to additional resources to help with understanding the concepts presented in this notebook.

- [Article on standard min-max normalization vs z-score standardisation](https://www.codecademy.com/articles/normalization)


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>