# Regularisation

This notebook focuses on the practical application of various regularisation methods to prevent overfitting in machine learning models.

Here is the link to the dataset to be used: https://raw.githubusercontent.com/Explore-AI/Public-Data/master/SDG_15_Life_on_Land_Dataset.csv

### Step 1: Data scaling

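Before applying regularisation techniques, it's crucial to scale the data: the penalty acts on the size of the coefficients, and that size depends on the units of each feature. We therefore scale the features of the `SDG_15_Life_on_Land_Dataset` to have a mean of `0` and a standard deviation of `1`.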

We start by doing the following:

- Load the dataset and select features for scaling (exclude the `Year` column and the target, `BiodiversityHealthIndex`).
- Implement standard scaling on the selected features.
- Display the first five rows of the scaled features.

In [22]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Importing the CSV file to be used as a pandas DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/SDG_15_Life_on_Land_Dataset.csv')

# Setting the predictors (excluding Year) and the response variable
y = df['BiodiversityHealthIndex']
X_temp = df.drop(columns=['Year', 'BiodiversityHealthIndex'])

# Standardising the predictor variables
scaler = StandardScaler()
X_temp_scaled = pd.DataFrame(scaler.fit_transform(X_temp), columns=X_temp.columns)

# Displaying the first five rows of the standardised DataFrame
X_temp_scaled.head()

| | WaterQualityIndex | ClimateChangeImpactScore | LandUseChange | InvasiveSpeciesCount | ConservationFunding | EcoTourismImpact | ForestCoverChange | SoilQualityIndex | WaterUsage | RenewableEnergyUsage | CarbonEmissionLevels | AgriculturalIntensity | HabitatConnectivity | SpeciesReintroductionEfforts | PollinatorDiversity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.509823 | 0.915895 | 0.532798 | 0.967295 | -0.12943 | -1.297085 | 0.017923 | 0.689812 | -0.641157 | -1.29099 | -0.930835 | -1.237558 | -1.131411 | 1.49466 | -0.811078 |
| 1 | -1.261473 | -1.159761 | 0.479063 | 1.382383 | -1.098165 | 1.226669 | -1.649745 | 0.655167 | 0.539995 | 0.207271 | 0.470716 | -0.67015 | 0.305779 | -0.107952 | 0.797582 |
| 2 | -1.363971 | -1.409483 | 1.389846 | 0.206299 | 0.32034 | -0.529103 | -0.87737 | 0.759101 | 1.165311 | -0.473757 | -0.110415 | 1.006319 | 1.598836 | -1.017291 | -1.518029 |
| 3 | -0.475658 | 0.746916 | 0.684528 | 0.828932 | 1.323673 | 1.653728 | 1.188117 | 0.481943 | 1.165311 | 1.535274 | 0.368164 | -1.360736 | 0.001642 | -0.978326 | -1.249998 |
| 4 | -0.885648 | 1.230038 | -0.213905 | 1.105658 | 1.323673 | 1.445945 | -0.439815 | -1.319584 | -1.78757 | 1.160709 | 0.402348 | 1.675792 | -0.977884 | 0.832482 | 1.077254 |
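A quick sanity check can confirm that the scaling did what was intended. The sketch below uses a small made-up DataFrame (`demo`, with hypothetical columns `feature_a` and `feature_b`) rather than the real dataset, so it runs without a download. It also shows one detail that is easy to miss: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the latter comes out slightly above `1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; in the notebook this would be X_temp.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(50, 10, size=100),
    "feature_b": rng.uniform(0, 1000, size=100),
})

scaled = pd.DataFrame(StandardScaler().fit_transform(demo), columns=demo.columns)

col_means = scaled.mean()         # should be 0 up to floating-point error
pop_std = scaled.std(ddof=0)      # population std: what StandardScaler sets to 1
sample_std = scaled.std()         # pandas default (ddof=1): slightly above 1

print(col_means.round(6))
print(pop_std.round(6))
print(sample_std.round(6))
```

The same three lines of checks could be applied to `X_temp_scaled` from the cell above.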


### Step 2: Ridge regression

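Ridge regression is a form of linear regression suited to data that suffer from multicollinearity. It adds an L2 penalty (the sum of the squared coefficients, weighted by `alpha`) to the least-squares objective. This introduces a small amount of bias into the coefficient estimates in exchange for lower variance, which reduces their standard errors.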
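To make the effect of the penalty concrete, here is a minimal sketch on made-up data (the arrays `X` and `y` and the value `alpha = 5.0` are illustrative assumptions, not taken from the dataset). With no intercept, the ridge coefficients have the closed form `(XᵀX + alpha·I)⁻¹ Xᵀy`, which should agree with scikit-learn's `Ridge(fit_intercept=False)` and have a smaller norm than the ordinary least-squares solution.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative synthetic data (an assumption, not the SDG dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha = 5.0  # illustrative penalty strength

# Closed-form ridge solution without an intercept term.
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# The same model fitted with scikit-learn.
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

# Ordinary least squares for comparison (no penalty).
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("closed form:", w_closed)
print("sklearn:    ", w_sklearn)
print("OLS:        ", w_ols)
```

Comparing the printed vectors shows the shrinkage directly: the ridge coefficients are pulled towards zero relative to OLS, but none of them is set exactly to zero.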

In this step, we do the following:

- Use the scaled features from **Step 1** as predictors and `BiodiversityHealthIndex` as the target variable.
- Split the data into training and test sets.
- Implement a ridge regression model, with cross-validation to find the optimal regularisation parameter.
- Evaluate the model on the test set and report the R-squared value.

In [23]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score

# Splitting the data into testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X_temp_scaled, y, test_size=0.2, random_state=42)

# Ridge regression model with cross-validation
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]}
model_r = GridSearchCV(Ridge(), parameters, scoring='r2', cv=5)
model_r.fit(X_train, y_train)

# Generating predictions from the test set
y_pred_r = model_r.predict(X_test)

# Getting the R squared score of the model
r2 = r2_score(y_test, y_pred_r)

# Printing out the results
print(f"Optimal regularisation parameter: {model_r.best_params_}")
print(f"R-squared value on the test set: {r2}")

Optimal regularisation parameter: {'alpha': 20}
R-squared value on the test set: -0.059838872264527554


Ridge regression is particularly useful when predictors are collinear: the penalty reduces the variance of the coefficient estimates and can improve the model's ability to generalise. Here we split the data into training and test sets, ran a five-fold grid search over the regularisation parameter (`alpha`), and evaluated the best model on the test set using the R-squared metric. The search selected `alpha = 20`, the largest value in the grid, which suggests that the grid could be extended upwards because an even stronger penalty might score better.
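One refinement worth considering, offered as a suggestion rather than a correction of the results above: in Step 1 the scaler was fit on the full dataset before the split, so the test rows influenced the scaling statistics. Wrapping the scaler and the model in a `Pipeline` means the scaler is refit on the training portion of each cross-validation fold only. The sketch below uses synthetic data from `make_regression` and a wider, log-spaced `alpha` grid; both are assumptions chosen for illustration, and the grid is widened because the search above picked the edge of its grid.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled predictors and target (assumption).
X, y = make_regression(n_samples=200, n_features=15, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is part of the estimator, so each CV fold fits it on training rows only.
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"ridge__alpha": np.logspace(-3, 3, 13)}

search = GridSearchCV(pipe, param_grid, scoring="r2", cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
test_r2 = search.score(X_test, y_test)  # uses the 'r2' scorer set above

print(f"Best alpha: {best_alpha}")
print(f"Test R-squared: {test_r2}")
```

In the notebook, the equivalent change would be to pass the unscaled `X_temp` to `train_test_split` and fit this pipeline on the resulting training set.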

**Interpretation**: The R-squared value is the proportion of the variance in the dependent variable that the model explains from the independent variables. In simple terms, it reflects the goodness of fit of the model. A value close to 1 means the model explains almost all of the variance in the target, whereas a value close to 0 means it does no better than always predicting the mean. On held-out data R-squared can also be negative: the value of about -0.06 obtained here means the ridge model predicts the test set slightly worse than a constant equal to the mean of the test targets would. This points to the predictors carrying little linear signal about `BiodiversityHealthIndex`. At the other extreme, an R-squared value of exactly 1 suggests possible overfitting and should be treated with caution. Investigating the individual variables and their coefficients might give more insight into how the model is behaving.
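A tiny worked example may help with reading the R-squared scale. The numbers below are made up for illustration: predicting the mean for every observation scores exactly 0, and predictions that are worse than the mean score below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Baseline: always predict the mean of the observed values.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Deliberately poor predictions: the true values in reverse order.
worse_pred = y_true[::-1]
r2_worse = r2_score(y_true, worse_pred)

print(r2_mean)   # 0.0
print(r2_worse)  # -3.0
```

For the reversed predictions the squared errors sum to 40 while the total sum of squares around the mean is 10, giving `1 - 40/10 = -3`.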

### Step 3: LASSO regression

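LASSO (Least Absolute Shrinkage and Selection Operator) regression is a type of linear regression that uses shrinkage. Shrinkage here means that the coefficient estimates are pulled towards zero by an L1 penalty (the sum of the absolute values of the coefficients). With a strong enough penalty, some coefficients become exactly zero.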
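The following sketch illustrates this behaviour on synthetic data. The data from `make_regression` and the particular `alpha` values are assumptions chosen for illustration; the point is only that the number of non-zero coefficients falls as the penalty grows, down to none at all when `alpha` is very large.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, of which only 3 actually drive the target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Count the non-zero coefficients for increasingly strong penalties.
nonzero_counts = {}
for alpha in [0.01, 1.0, 10.0, 1e6]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

Ridge regression, by contrast, shrinks coefficients without setting them exactly to zero, which is why only LASSO doubles as a feature-selection method.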

In this step, we do the following:

- Re-use the scaled features and target variable from the previous steps.
- Split the data into training and test sets.
- Implement a LASSO regression model, with cross-validation to find the optimal regularisation parameter.
- Evaluate the model on the test set and report the R-squared value and the number of features used.

In [24]:
import numpy as np
from sklearn.linear_model import LassoCV

# Splitting the data into testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X_temp_scaled, y, test_size=0.2, random_state=42)

# Fitting the LASSO regression model on our data with cross validation
model_l = LassoCV(cv=5, random_state=42)
model_l.fit(X_train, y_train)

# Generating predictions from the test set
y_pred_l = model_l.predict(X_test)

# Getting the R-squared score and the number of features with non-zero coefficients
rsq_score = r2_score(y_test, y_pred_l)
features_used = np.sum(model_l.coef_ != 0)

# Printing out the results
print(f'Optimal regularisation parameter: {model_l.alpha_}')
print(f'R squared: {rsq_score}')
print(f'Number of features used: {features_used}')

Optimal regularisation parameter: 0.023673211125955964
R squared: -0.020436824594975977
Number of features used: 1


LASSO regression is useful when predictors are collinear or when the number of features in a model needs to be reduced. It penalises the sum of the absolute values of the coefficients, which shrinks some of them to exactly zero and so excludes the corresponding features from the model. Here we reused the scaled features and target variable, split the dataset, and let `LassoCV` choose the penalty strength by five-fold cross-validation, which selected an `alpha` of about 0.0237. The output reports the test-set R-squared value, which assesses the model's fit, and the number of features with non-zero coefficients, which demonstrates LASSO's capability for feature selection.

**Interpretation**: As with ridge regression, the R-squared value measures the proportion of the total variation in the target that the model explains. At about -0.02 it is again slightly negative, so the LASSO model also does no better on the test set than predicting the mean. The number of features used highlights LASSO's ability to perform feature selection, which reduces the complexity of the model and can improve its interpretability by retaining only the most informative predictors. Only 1 of the 15 features kept a non-zero coefficient, which suggests reassessing whether the features included in the ridge regression model were really adding value. Try running the ridge regression again with only the features retained by the LASSO model to see how the results change.
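A minimal sketch of that follow-up experiment is shown below. It runs on synthetic data from `make_regression` so that it is self-contained; the data, the `alpha` grid, and the use of `RidgeCV` in place of the earlier grid search are all assumptions made for illustration, so the numbers it prints say nothing about the SDG dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 15 features, 3 of them informative (assumption).
X, y = make_regression(
    n_samples=200, n_features=15, n_informative=3, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 1: let LASSO pick the features (non-zero coefficients).
lasso = LassoCV(cv=5, random_state=42).fit(X_train, y_train)
selected = lasso.coef_ != 0  # boolean mask over the feature columns

# Step 2: refit ridge regression on the selected columns only.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train[:, selected], y_train)
r2_selected = r2_score(y_test, ridge.predict(X_test[:, selected]))

print(f"Features kept by LASSO: {int(selected.sum())} of {X.shape[1]}")
print(f"Ridge R-squared on selected features: {r2_selected}")
```

With the notebook's DataFrames, the same mask could be applied as `X_train.loc[:, model_l.coef_ != 0]`. If LASSO keeps very few features and the R-squared stays near zero, that is further evidence that the predictors have little linear relationship with the target.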