# Detecting leakage

It's essential to ensure that our model is not inadvertently using information that would be unavailable at prediction time, such as values from the future or features derived from the target. This is known as "data leakage", and it leads to overly optimistic performance estimates during model evaluation.

In [60]:
import pandas as pd

In [61]:
# Load your prepared data from M3L1
df = pd.read_csv("../data/interim/cleaned_data.csv")

print(df.head())

   squareMeters  numberOfRooms  hasYard  hasPool  floors  cityCode  \
0         75523              3        0        1      63      9373   
1         80771             39        1        1      98     39381   
2         55712             58        0        1      19     34457   
3         32316             47        0        0       6     27939   
4         70429             19        1        1      90     38045   

   cityPartRange  numPrevOwners  made  isNewBuilt  hasStormProtector  \
0              3              8  2005           0                  1   
1              8              6  2015           1                  0   
2              6              8  2021           0                  0   
3             10              4  2012           0                  1   
4              3              7  1990           1                  0   

   basement  attic  garage  hasStorageRoom  hasGuestRoom      price  
0      4313   9005     956               0             7  7559081.5  
1     

## Calculate the Pearson correlation coefficient

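If the absolute value of the Pearson correlation coefficient between a feature and the target (`price`) is higher than 0.95, we are facing potential data leakage, and that feature deserves a closer look.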
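
The threshold rule above can be wrapped in a small helper. This is a minimal sketch rather than part of the original notebook: the function name `flag_leakage_candidates`, its default threshold, and the toy data are hypothetical and only illustrate the check.

```python
import pandas as pd


def flag_leakage_candidates(df, target, threshold=0.95):
    """Return features whose |Pearson r| with `target` exceeds `threshold`.

    Hypothetical helper for illustration only.
    """
    # Restrict to numeric columns, since Pearson correlation needs numbers
    numeric = df.select_dtypes("number")
    # Correlation of every numeric column with the target, minus the target itself
    corr_with_target = numeric.corr()[target].drop(target)
    # Use the absolute value so strong negative correlations are flagged too
    return corr_with_target[corr_with_target.abs() > threshold]


# Toy data (made up): `area` almost determines `price`, `rooms` does not
toy = pd.DataFrame(
    {
        "area": [50, 60, 70, 80, 90, 100],
        "rooms": [3, 1, 4, 1, 5, 2],
        "price": [5005, 5997, 7002, 7996, 9001, 9999],
    }
)

flagged = flag_leakage_candidates(toy, "price")
print(flagged)
```

A flagged column is a candidate for closer inspection, not proof of leakage: a very strong correlation can also reflect a genuine relationship between the feature and the target.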

In [62]:
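# Pearson correlation of each feature with the target.
# The target itself is skipped: its correlation with itself is trivially 1.
for col in df.columns.drop("price"):
    print(
        f"Correlation between {col} and price: {df[[col, 'price']].corr().iloc[0, 1]}"
    )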

Correlation between squareMeters and price: 0.9999993570640745
Correlation between numberOfRooms and price: 0.009590905935479123
Correlation between hasYard and price: -0.006119244882540521
Correlation between hasPool and price: -0.005070340833862509
Correlation between floors and price: 0.0016542562406504835
Correlation between cityCode and price: -0.0015393673485808049
Correlation between cityPartRange and price: 0.008812911660535352
Correlation between numPrevOwners and price: 0.016618826067943373
Correlation between made and price: -0.007209526254690733
Correlation between isNewBuilt and price: -0.010642774359518868
Correlation between hasStormProtector and price: 0.0074959113342807585
Correlation between basement and price: -0.003967482178851138
Correlation between attic and price: -0.0005995140774963296
Correlation between garage and price: -0.017229051207338156
Correlation between hasStorageRoom and price: -0.0034852993013792825
Correlation between hasGuestRoom and price: -0.000