# Detecting leakage

We need to make sure that our model does not inadvertently use information that would be unavailable at prediction time, such as values from the future or features derived from the target itself. This is known as "data leakage", and it leads to overly optimistic performance estimates during model evaluation.

In [27]:
import pandas as pd

In [28]:
# Load your prepared data from M3L1
df = pd.read_csv("../data/interim/cleaned_data.csv")

print(df.head())

          Area  Year  Savanna fires  Forest fires  Crop Residues  \
0  Afghanistan  1990        14.7237        0.0557       205.6077   
1  Afghanistan  1991        14.7237        0.0557       209.4971   
2  Afghanistan  1992        14.7237        0.0557       196.5341   
3  Afghanistan  1993        14.7237        0.0557       230.8175   
4  Afghanistan  1994        14.7237        0.0557       242.0494   

   Rice Cultivation  Drained organic soils (CO2)  Pesticides Manufacturing  \
0            686.00                          0.0                 11.807483   
1            678.16                          0.0                 11.712073   
2            686.00                          0.0                 11.712073   
3            686.00                          0.0                 11.712073   
4            705.60                          0.0                 11.712073   

   Food Transport  Forestland  ...  Manure Management  Fires in organic soils  \
0         63.1152   -2388.803  ...       

## Calculate Pearson Correlation Coefficient

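If the absolute value of the Pearson correlation coefficient between a feature and the target (`Average Temperature °C`) is higher than 0.95, that feature is a potential source of data leakage and should be inspected before training.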
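The threshold rule can also be wrapped in a small helper that returns only the suspicious features instead of printing every coefficient. This is a minimal sketch, not part of the original notebook: the function name `flag_leaky_features`, its `exclude` parameter, and the toy data are all hypothetical and exist only for illustration. It compares the absolute value of the coefficient against the threshold, so strong negative correlations are flagged as well.

```python
import pandas as pd


def flag_leaky_features(df, target, threshold=0.95, exclude=()):
    """Return features whose |Pearson r| with `target` exceeds `threshold`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Keep numeric columns only; Pearson correlation is undefined for strings.
    numeric = df.select_dtypes(include="number")
    features = [c for c in numeric.columns if c != target and c not in exclude]
    # Correlation of every candidate feature with the target, in one call.
    corr = numeric[features].corrwith(numeric[target])
    # Compare on the absolute value so negative correlations are caught too.
    return corr[corr.abs() > threshold]


# Toy data, made up for illustration: "leaky" is a rescaled copy of the target.
toy = pd.DataFrame(
    {
        "Year": [1990, 1991, 1992, 1993, 1994],
        "target": [0.1, 0.4, 0.2, 0.9, 0.5],
        "leaky": [1.2, 1.8, 1.4, 2.8, 2.0],  # 2 * target + 1
        "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
    }
)

suspects = flag_leaky_features(toy, target="target", exclude=("Year",))
print(suspects)
```

On the real data, a call along the lines of `flag_leaky_features(df, "Average Temperature °C", exclude=("Year",))` would apply the same rule. An empty result means no feature crossed the threshold, which rules out only this simple linear form of leakage.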

In [29]:
# Correlate every feature with the target, skipping the identifier columns
target = "Average Temperature °C"
for col in df.columns:
    if col not in ("Area", "Year"):
        corr = df[col].corr(df[target])  # Pearson by default
        print(f"Correlation between {col} and {target}: {corr}")

Correlation between Savanna fires and Average Temperature °C: -0.04658763755189444
Correlation between Forest fires and Average Temperature °C: -0.038103612891848423
Correlation between Crop Residues and Average Temperature °C: 0.01943397938042484
Correlation between Rice Cultivation and Average Temperature °C: -0.02253186470832951
Correlation between Drained organic soils (CO2) and Average Temperature °C: 0.029029549376425948
Correlation between Pesticides Manufacturing and Average Temperature °C: 0.02795991262892865
Correlation between Food Transport and Average Temperature °C: 0.07572424536018706
Correlation between Forestland and Average Temperature °C: -0.0492739839812552
Correlation between Net Forest conversion and Average Temperature °C: -0.031554585486913894
Correlation between Food Household Consumption and Average Temperature °C: 0.05526254631653598
Correlation between Food Retail and Average Temperature °C: 0.07340433859515291
Correlation between On-farm Electricity Use and