# Detecting leakage

We need to make sure the model does not inadvertently use information that would be unavailable at prediction time, for example information from the future when predicting the past. This is known as "data leakage", and it leads to overly optimistic performance estimates during model evaluation.

In [8]:
import pandas as pd

In [9]:
# Load your prepared data from M3L1
df = pd.read_csv("../data/interim/cleaned_data.csv")
target_variable = "volume_per_ha"

print(df.head())

   id  yield_class  age  average_height   dbh  taper  trees_per_ha  \
0   1         15.0   20             5.3  11.5  0.396        2585.0   
1   1         15.0   30            10.6  16.7  0.458        1708.0   
2   1         15.0   40            15.7  21.6  0.460        1266.0   
3   1         15.0   50            20.5  26.1  0.456        1003.0   
4   1         15.0   60            24.6  30.2  0.451         830.0   

   volume_per_ha  
0           54.0  
1          180.0  
2          334.0  
3          499.0  
4          659.0  


## Calculate Pearson Correlation Coefficient

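If the absolute Pearson correlation coefficient between a feature and the target is higher than 0.95, we treat it as a sign of potential data leakage. Here the highest correlation with the target is for `average_height` (0.913), followed by `dbh` (0.775). Both are below the threshold, so we don't need to take any action.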
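
For reference, this is the standard definition of the Pearson coefficient between a feature $x$ and the target $y$ over $n$ rows, where $\bar{x}$ and $\bar{y}$ are the column means:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The coefficient lies in $[-1, 1]$ and measures only linear association, so a value below the threshold is a useful screen rather than proof that a feature is leak-free.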

In [10]:
# Pearson correlation of every column with the target.
# The target's own row is trivially 1.0 and can be ignored.
for col in df.columns:
    corr = df[[col, target_variable]].corr().iloc[0, 1]
    print(f"Correlation between {col} and {target_variable}: {corr}")

Correlation between id and volume_per_ha: 0.1472221319450277
Correlation between yield_class and volume_per_ha: 0.44550909619919404
Correlation between age and volume_per_ha: 0.5168362829761064
Correlation between average_height and volume_per_ha: 0.9130683925944624
Correlation between dbh and volume_per_ha: 0.775357334262694
Correlation between taper and volume_per_ha: -0.03066251233217578
Correlation between trees_per_ha and volume_per_ha: -0.5313685123907324
Correlation between volume_per_ha and volume_per_ha: 1.0
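
The threshold check can also be automated instead of read off the printed list. The following is a minimal sketch, not part of the original notebook: the helper name `flag_leakage_candidates` and the `demo` frame are hypothetical, and the 0.95 default simply mirrors the rule of thumb stated earlier.

```python
import pandas as pd


def flag_leakage_candidates(df, target, threshold=0.95):
    """Return features whose absolute Pearson correlation with `target` exceeds `threshold`.

    Hypothetical helper for illustration; the threshold is a heuristic, not a guarantee.
    """
    # Pearson correlation is only defined for numeric columns
    numeric = df.select_dtypes(include="number")
    # Correlation of each column with the target, excluding the target itself
    corr = numeric.corr()[target].drop(target)
    # Use the absolute value so strong negative correlations are flagged too
    return corr[corr.abs() > threshold]


# Made-up data for illustration only (not the forestry data above)
demo = pd.DataFrame(
    {
        "feature_a": [3.0, 1.0, 4.0, 1.0, 5.0],
        "leaky_copy": [10.1, 20.0, 29.9, 40.2, 50.0],  # near-duplicate of the target
        "target": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

flagged = flag_leakage_candidates(demo, "target")
print(flagged)
```

Applied to the notebook's data as `flag_leakage_candidates(df, target_variable)`, this would be expected to return an empty Series, given that the largest absolute correlation printed above is 0.913.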
