## Overview

We briefly introduced the concept of data leakage in the previous sprint when we discussed using pipelines for preprocessing and model fitting. In general, and especially if you are using cross-validation, it's good to be conscious of when, where, and why data leakage can occur. This module focuses on working with data that isn't already prepared for modeling. Not only do we need to know which features to use and whether our target is appropriate, but we also need to guard against information leaking into our testing data or in through certain features.

The two main types of leakage are leaky features (predictors) and a leaky validation or testing process. 

### Leaky Features

This type of leakage occurs when you have a feature that has access to data that won't be available when you actually use the model on new data (outside of the test set) to make predictions. This could happen if you adjust the values in that feature *after* you determined the values in your target array. 

For example, if we were predicting if someone has heart disease (True/False) and used a feature called "BP_meds" (indicating if the individual is taking blood pressure medication), we might have a problem. If someone is taking this medication, it might be because they have heart disease and are being treated. But the value in this column could have been changed *after* they were diagnosed with heart disease.
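To make the idea concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the data are randomly generated, `age` is a hypothetical column with no relationship to the target, and `BP_meds` is simulated to match the diagnosis about 95% of the time, as it might if it were updated after diagnosis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 1000

# Synthetic target: 1 if the (simulated) person has heart disease
heart_disease = rng.integers(0, 2, size=n)

# 'age' is generated independently of the target, so it carries no signal
age = rng.integers(30, 80, size=n)

# Leaky feature: agrees with the target except for ~5% of rows
flip = rng.random(n) < 0.05
bp_meds = np.where(flip, 1 - heart_disease, heart_disease)

data = pd.DataFrame({'age': age, 'BP_meds': bp_meds})

def holdout_accuracy(columns):
    """Fit a small tree on the given columns and score it on held-out rows."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[columns], heart_disease, test_size=0.25, random_state=0)
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    return model.fit(X_train, y_train).score(X_test, y_test)

acc_with_leak = holdout_accuracy(['age', 'BP_meds'])
acc_without_leak = holdout_accuracy(['age'])

print('Accuracy with BP_meds:   ', acc_with_leak)
print('Accuracy without BP_meds:', acc_without_leak)
```

The high score with `BP_meds` would not carry over to new patients, because for them the column would be recorded before any diagnosis.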

### Leaky Testing Process

The other type of leak happens when information from the validation or testing data finds its way into training. If you are preprocessing the data, such as filling in missing values with an imputer (for example `SimpleImputer`) or standardizing, you might accidentally compute those statistics from the entire data set. To avoid this, fit the preprocessing steps on the training data only and then apply the fitted steps to the testing data, so that nothing about the testing data influences what the model learns.
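The difference between a leaky and a safe preprocessing step can be sketched as follows. The data are synthetic and the variable names (`leaky_scaler`, `safe_scaler`) are hypothetical; the point is only which rows each scaler sees when it is fit.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Leaky: the scaler's mean and standard deviation include the test rows
leaky_scaler = StandardScaler().fit(X)

# Safe: the scaler only ever sees the training rows
safe_scaler = StandardScaler().fit(X_train)
X_train_scaled = safe_scaler.transform(X_train)
X_test_scaled = safe_scaler.transform(X_test)  # reuses the training statistics

print('Mean learned from all rows:     ', leaky_scaler.mean_[0])
print('Mean learned from training rows:', safe_scaler.mean_[0])
```

Placing the preprocessing inside a `Pipeline` gives the safe behavior automatically, because `fit` is only ever called on the rows passed in as training data.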

Now that we are more familiar with these two different types of data leakage, let's explore our real-world weather data from the previous objective.

## Follow Along

We introduced the Australian weather data set earlier and explored it briefly to decide on the prediction target: whether it would rain on the day following the measurements. Let's look more closely at each of the features (predictors) to see if any of them could present leakage problems.

In [1]:
# Import libraries, load data, and view
import pandas as pd
weather = pd.read_csv('weatherAUS.csv')
weather.head()

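| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RISK_MM | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | 0.0 | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | 0.0 | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | 0.0 | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | 1.0 | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | 0.2 | No |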


Before we identify any possible "leaky features" we should decide which features to use and the necessary preprocessing steps. Let's look at each type of variable (numeric and categorical) in more detail.

In [2]:
# Look at the statistics of categorical variables 
weather.describe(include=['object'])

| | Date | Location | WindGustDir | WindDir9am | WindDir3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|
| count | 142193 | 142193 | 132863 | 132180 | 138415 | 140787 | 142193 |
| unique | 3436 | 49 | 16 | 16 | 16 | 2 | 2 |
| top | 2013-10-08 | Canberra | W | N | SE | No | No |
| freq | 49 | 3418 | 9780 | 11393 | 10663 | 109332 | 110316 |


In [3]:
# Look at the statistics of the numeric variables 
weather.describe()

| | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RISK_MM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 141556.0 | 141871.0 | 140787.0 | 81350.0 | 74377.0 | 132923.0 | 140845.0 | 139563.0 | 140419.0 | 138583.0 | 128179.0 | 128212.0 | 88536.0 | 85099.0 | 141289.0 | 139467.0 | 142193.0 |
| mean | 12.1864 | 23.226784 | 2.349974 | 5.469824 | 7.624853 | 39.984292 | 14.001988 | 18.637576 | 68.84381 | 51.482606 | 1017.653758 | 1015.258204 | 4.437189 | 4.503167 | 16.987509 | 21.687235 | 2.360682 |
| std | 6.403283 | 7.117618 | 8.465173 | 4.188537 | 3.781525 | 13.588801 | 8.893337 | 8.803345 | 19.051293 | 20.797772 | 7.105476 | 7.036677 | 2.887016 | 2.720633 | 6.492838 | 6.937594 | 8.477969 |
| min | -8.5 | -4.8 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 980.5 | 977.1 | 0.0 | 0.0 | -7.2 | -5.4 | 0.0 |
| 25% | 7.6 | 17.9 | 0.0 | 2.6 | 4.9 | 31.0 | 7.0 | 13.0 | 57.0 | 37.0 | 1012.9 | 1010.4 | 1.0 | 2.0 | 12.3 | 16.6 | 0.0 |
| 50% | 12.0 | 22.6 | 0.0 | 4.8 | 8.5 | 39.0 | 13.0 | 19.0 | 70.0 | 52.0 | 1017.6 | 1015.2 | 5.0 | 5.0 | 16.7 | 21.1 | 0.0 |
| 75% | 16.8 | 28.2 | 0.8 | 7.4 | 10.6 | 48.0 | 19.0 | 24.0 | 83.0 | 66.0 | 1022.4 | 1020.0 | 7.0 | 7.0 | 21.6 | 26.4 | 0.8 |
| max | 33.9 | 48.1 | 371.0 | 145.0 | 14.5 | 135.0 | 130.0 | 87.0 | 100.0 | 100.0 | 1041.0 | 1039.6 | 9.0 | 9.0 | 40.2 | 46.7 | 371.0 |


### Data Exploration: Null Values

From the above DataFrame descriptions, we can see that there are a lot of null values in some of the columns. We'll take a more detailed look at how many values are missing in which columns. The plot below is created using the `missingno` module, available at [this repository](https://github.com/ResidentMario/missingno).

In [4]:
# Checking for null values
weather.isnull().sum()

Date                 0
Location             0
MinTemp            637
MaxTemp            322
Rainfall          1406
Evaporation      60843
Sunshine         67816
WindGustDir       9330
WindGustSpeed     9270
WindDir9am       10013
WindDir3pm        3778
WindSpeed9am      1348
WindSpeed3pm      2630
Humidity9am       1774
Humidity3pm       3610
Pressure9am      14014
Pressure3pm      13981
Cloud9am         53657
Cloud3pm         57094
Temp9am            904
Temp3pm           2726
RainToday         1406
RISK_MM              0
RainTomorrow         0
dtype: int64

In [5]:
import matplotlib.pyplot as plt
import missingno as msno
msno.matrix(weather)

plt.clf()

![mod1_obj1_missingNA.png](https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_2/sprint_3/mod1_obj1_missingNA.png)

We have four columns with a large number of null values: `Evaporation`, `Sunshine`, `Cloud9am`, and `Cloud3pm`. If we were doing this analysis for a competition (or for an actual data science job!) we would want to explore the missing values more carefully. Since these columns are each missing roughly 38% to 48% of their data, we're going to drop them for this analysis. For the other missing values, we'll use `SimpleImputer` in the preprocessing step.
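Rather than reading the null counts by eye, the fraction of missing values per column can be computed directly. The sketch below uses a small hypothetical DataFrame standing in for the weather data, and the `threshold` of 0.35 is an assumed cutoff, not a rule.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the weather data
df = pd.DataFrame({
    'MinTemp':  [13.4, 7.4, np.nan, 9.2, 17.5],
    'Sunshine': [np.nan, np.nan, 8.5, np.nan, 10.6],
    'Rainfall': [0.6, 0.0, 0.0, 0.0, 1.0],
})

# isnull().mean() gives the fraction of missing values in each column
missing_frac = df.isnull().mean()

threshold = 0.35  # assumed cutoff for "too sparse to keep"
cols_mostly_missing = missing_frac[missing_frac > threshold].index.tolist()

print(missing_frac)
print('Candidates to drop:', cols_mostly_missing)
```

Applied to `weather`, the same two lines would list the columns whose missing fraction exceeds whatever cutoff you choose.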

To simplify the analysis for later, we'll also drop the `Location` column. Again, this information might be important for a more detailed model, but we're trying to keep this process simple so that we can focus on identifying the leaky features.

In [None]:
# Drop 'Location' and the columns with a high percentage of missing values
cols_drop = ['Location', 'Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm']
weather_drop = weather.drop(cols_drop, axis=1)

### Data Cleaning: Datetime

We have a date column which could be converted to a datetime object; we could then use just the 'month' value in our model, as the full date would be too specific.

In [None]:
# Convert the 'Date' column to datetime, extract month
# (infer_datetime_format is deprecated in pandas 2.x; ISO dates parse by default)
weather_drop['Date'] = pd.to_datetime(weather_drop['Date']).dt.month
weather_drop.head()
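As a quick check of what this conversion produces, here is a small sketch on three date strings. The first two come from the data shown earlier and the third is a made-up value. Passing an explicit `format` is an optional alternative to letting pandas infer it.

```python
import pandas as pd

# Two dates from the data set plus one hypothetical date
dates = pd.Series(['2008-12-01', '2008-12-02', '2009-01-15'])

# Parse with an explicit format, then keep only the month number (1-12)
months = pd.to_datetime(dates, format='%Y-%m-%d').dt.month

print(months.tolist())
```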

### Data Processing: Pipeline

We'll separate our features into numeric and categorical types and then perform the following steps.

#### Numeric Features

As some of the numeric features are on very different scales, we'll want to standardize them. We also still have missing values, which `SimpleImputer()` will take care of.

#### Categorical Features

These features will need to be encoded. Inside the pipeline we'll use `OrdinalEncoder()`, which can encode several columns at once. The target is encoded separately with `LabelEncoder()`, which doesn't accept more than one column at a time and can't be used in the pipeline.
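The difference between the two encoders can be sketched on a tiny, made-up sample. The values below are hypothetical; only the column names are taken from the weather data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample of two categorical features and the target
features = pd.DataFrame({
    'WindGustDir': ['W', 'WNW', 'W', 'NE'],
    'RainToday':   ['No', 'No', 'Yes', 'No'],
})
target = pd.Series(['No', 'No', 'Yes', 'No'], name='RainTomorrow')

# OrdinalEncoder works on a 2-D feature matrix, one integer code per category
ordinal = OrdinalEncoder()
X_encoded = ordinal.fit_transform(features)

# LabelEncoder works on a single 1-D array, which suits the target
label_enc = LabelEncoder()
y_encoded = label_enc.fit_transform(target)

print(ordinal.categories_)
print(X_encoded)
print(label_enc.classes_, y_encoded)
```

Both encoders assign codes in sorted category order, so the integers carry no meaning beyond identifying the category.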

In [None]:
# Print the column names
weather_drop.columns

In [None]:
# Imports
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Define the numeric features
numeric_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 
                    'WindSpeed9am','WindSpeed3pm', 'Humidity9am', 
                    'Humidity3pm', 'Pressure9am','Pressure3pm', 
                    'Temp9am', 'Temp3pm', 'RISK_MM']

# Create the transformer (impute, scale)
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define the categorical features
categorical_features = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder())])

# Define how the numeric and categorical features will be transformed
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Define the pipeline steps, including the classifier
clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', DecisionTreeClassifier())])
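Because the preprocessing lives inside the pipeline, the whole object can be passed to cross-validation, and the imputer and scaler are then refit on the training portion of each fold. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, and `demo_clf` are hypothetical names, and the data only borrow two column names from the weather set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 300

# Synthetic features with some values knocked out
X_demo = pd.DataFrame({
    'Humidity3pm': rng.uniform(0, 100, size=n),
    'Pressure3pm': rng.normal(1015, 7, size=n),
})
X_demo.loc[rng.random(n) < 0.1, 'Pressure3pm'] = np.nan

# Synthetic target loosely tied to humidity
y_demo = (X_demo['Humidity3pm'] + rng.normal(0, 15, size=n) > 60).astype(int)

demo_clf = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(max_depth=3, random_state=0)),
])

# Each fold fits the imputer and scaler on that fold's training rows only
scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5)
print('Mean cross-validated accuracy:', scores.mean())
```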

### Create Feature Matrix, Target Array

We have a couple of final steps before we fit the model: create the feature matrix, then create and encode the target array.

In [None]:
# Create the feature matrix 
X = weather_drop.drop('RainTomorrow', axis=1)

# Create and encode the target array
from sklearn.preprocessing import LabelEncoder
label_enc = LabelEncoder()
y=label_enc.fit_transform(weather_drop['RainTomorrow'])

In [None]:
# Import the train_test_split utility
from sklearn.model_selection import train_test_split

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In [None]:
# Fit the model
clf.fit(X_train,y_train)
print('Validation Accuracy', clf.score(X_test, y_test))

Wow! We achieved 100% accuracy. But is this too good to be true? Yes: any time a model scores suspiciously close to perfect, you likely have a problem, and that problem is probably data leakage of some type. Let's look at the feature importances and see where the problem is.

In [None]:
# Features (order in which they were preprocessed)
features_order = numeric_features + categorical_features

# Determine the importances
importances = pd.Series(clf.steps[1][1].feature_importances_, features_order)

In [None]:
# Plot feature importances
import matplotlib.pyplot as plt

n = 7
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey')

plt.clf()

![mod1_obj1_top7feature_leaky.png](https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_2/sprint_3/mod1_obj1_top7feature_leaky.png)

Well, it looks like the model was essentially fit on a single feature, which must be because that predictor is directly related to the target array. Spoiler: one of the features is leaking information to the model. It is the `RISK_MM` column, which is essentially how much rain was recorded the following day.
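A dominant feature like this can also be flagged programmatically. The following is a minimal sketch on synthetic data, where the `leak` column is deliberately derived from the target and the 0.9 cutoff is an arbitrary assumption. A high importance is only a prompt to investigate, not proof of leakage.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
n = 500
y_demo = rng.integers(0, 2, size=n)

X_demo = pd.DataFrame({
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
    'leak': y_demo * 5.0,  # hypothetical column derived directly from the target
})

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)

# Flag any feature that accounts for nearly all of the importance
suspects = importances[importances > 0.9].index.tolist()

print(importances.sort_values(ascending=False))
print('Columns to investigate:', suspects)
```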

We'll remove this column, run the model again, and calculate the feature importances.

In [None]:
# Remove the 'RISK_MM' column
X_noriskmm = X.drop('RISK_MM', axis=1)

# Create the new training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_noriskmm, y, test_size=0.2, random_state=42)

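# Drop 'RISK_MM' from numeric_features. list.remove() modifies the list in
# place and returns None, so the result must not be assigned back. The
# preprocessor holds a reference to this same list, so it sees the change.
numeric_features.remove('RISK_MM')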

# Fit the model
clf.fit(X_train,y_train)
print('Validation Accuracy (with no "RISK_MM")', clf.score(X_test, y_test))

That's better! The accuracy is still high, but much more reasonable.
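One way to judge whether the new accuracy is reasonable is to compare it with a majority-class baseline, meaning the accuracy of always predicting "No". The counts below are taken from the `describe()` output for `RainTomorrow` shown earlier. Note that they describe the full data set rather than the test split, so this is only an approximate reference point.

```python
# Counts taken from the describe() output for RainTomorrow
n_rows = 142193      # total number of rows
n_majority = 110316  # rows where RainTomorrow is 'No'

# Accuracy of a model that always predicts the most common class
baseline_accuracy = n_majority / n_rows
print(f'Majority-class baseline: {baseline_accuracy:.3f}')
```

A model is only adding value to the extent that it beats this baseline of roughly 0.776.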

In [None]:
# Get feature importances

# Features (order in which they were preprocessed)
numeric_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 
                    'WindSpeed9am','WindSpeed3pm', 'Humidity9am', 
                    'Humidity3pm', 'Pressure9am','Pressure3pm', 
                    'Temp9am', 'Temp3pm']

categorical_features = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
features_order = numeric_features + categorical_features

importances = pd.Series(clf.steps[1][1].feature_importances_, features_order)

In [None]:
# Plot feature importances

n = 7
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey')

plt.clf()

![mod1_obj1_top7feature_NOleaky.png](https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_2/sprint_3/mod1_obj1_top7feature_NOleaky.png)

## Challenge

In the above example, we removed the `RISK_MM` column, but this takes away information that we might use in our model. For this challenge, think of a way you could group the values in this column and use the result as the target for the model. Can we predict how much rain is received instead of just making a true/false prediction?
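As a possible starting point, `pd.cut` can group a numeric column into labeled bins. The bin edges and label names below are arbitrary assumptions chosen for illustration, and the sample values are made up; choosing sensible groups for the real data is part of the challenge.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of next-day rainfall amounts (mm)
risk_mm = pd.Series([0.0, 0.2, 1.0, 5.0, 12.5], name='RISK_MM')

# Assumed bin edges: [0, 1], (1, 10], (10, inf)
bins = [0, 1, 10, np.inf]
labels = ['little_or_none', 'moderate', 'heavy']

# include_lowest=True makes the first bin include 0 itself
rain_category = pd.cut(risk_mm, bins=bins, labels=labels, include_lowest=True)

print(rain_category.tolist())
```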

## Additional Resources

* [Kaggle: What is Data Leakage?](https://www.kaggle.com/dansbecker/data-leakage)