Creative Commons CC BY 4.0 Lynd Bacon & Associates, Ltd. Not warranted to be suitable for any particular purpose. (You're on your own!)

# Avoid the Embarrassment of Data Leakage: <br>Rescale or Transform Within CV Folds!

A basic "anti-data-leakage" idea in cross-validation is that a test fold holds data that an algorithm hasn't yet "seen" while it is learning from the training folds.  To be consistent with this idea, any rescaling or transformation that depends on the values of the data must be done separately for training data and for test data.
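
As a rough illustration of why this matters, the sketch below fits a scaler once on all rows and once on the training rows only. The data are synthetic, and the names `X_demo`, `leaky_scaler`, and `fold_scaler` are made up for this example. The point is only that statistics computed on the full data set already contain information from rows that will later be used for testing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (an assumption, not the patient data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
X_demo_train, X_demo_test = X_demo[:80], X_demo[80:]

# Fitting on every row lets the test rows influence the mean and SD
leaky_scaler = StandardScaler().fit(X_demo)

# Fitting on the training rows only keeps the test rows "unseen"
fold_scaler = StandardScaler().fit(X_demo_train)

print("mean from all rows:     ", leaky_scaler.mean_)
print("mean from training rows:", fold_scaler.mean_)
```

The two means differ, which is the leak: the first scaler has "looked at" the test rows.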

# Rescaling Transformations Within Folds

In the following example we'll "standardize" the patient satisfaction data so that every predictor has mean=0 and SD=1.  In the "olden days" of data mining, doing this was sometimes referred to as "sphering" the data.  See:

[scikit-Learn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)
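
Here is a minimal sketch of what standardizing does, using a tiny made-up array (`X_toy` is not part of the patient satisfaction data). Note that `StandardScaler` divides by the population SD (`ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up two-feature array on very different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

X_std = StandardScaler().fit_transform(X_toy)

print(X_std.mean(axis=0))  # column means after standardizing
print(X_std.std(axis=0))   # column SDs after standardizing (ddof=0)
```

Each column should come out with a mean of about 0 and an SD of about 1, up to floating point error.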

There are several other ways of rescaling variables.  Another common method is "MinMax," which linearly rescales each feature so that its minimum and maximum values map onto the ends of a fixed range, [0, 1] by default in scikit-learn.

[scikit-Learn MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
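
A small sketch of MinMax rescaling on a made-up array (`X_toy_mm` is only for illustration). With the default `feature_range=(0, 1)`, each value becomes `(x - min) / (max - min)` within its own column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up two-feature array
X_toy_mm = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [4.0, 40.0]])

X_mm = MinMaxScaler().fit_transform(X_toy_mm)

print(X_mm.min(axis=0))  # column minimums after rescaling
print(X_mm.max(axis=0))  # column maximums after rescaling
print(X_mm[1, 0])        # (2 - 1) / (4 - 1)
```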

You can find other tools for rescaling at [scikit-learn preprocessing API](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).

# Get Some Essential Packages

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn import linear_model  # OLS 
from sklearn.metrics import mean_squared_error, r2_score # Basic metrics
from sklearn.model_selection import KFold
from sklearn import preprocessing

# Get the PT Satisfaction Data

Assuming that they are at the relative path used below:

In [3]:
# Input into a DataFrame, check the column names

ptSatDF=pd.read_csv('../DATA/ML/DECART-patSat.csv')
ptSatDF.columns

Index(['caseID', 'patSat', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9',
       'ptCat'],
      dtype='object')

## Dummy Code the pt Categories

Just for consistency with things elsewhere, we'll dummy code the categories of `ptCat`, leaving out the first (0) category, medical admission.  (The regular one, not the "highfalutin" concierge type.)

In [4]:
ptSatDF2=ptSatDF.drop('caseID',axis=1)  # Get rid of caseID
ptCats=pd.get_dummies(ptSatDF2.ptCat,drop_first=True) # get 0/1 dummies, drop the 0 category
ptSatDF2[["ptCat"]].tail(10)
ptCats.tail(10)

Unnamed: 0,ptCat
1801,1
1802,1
1803,2
1804,2
1805,1
1806,0
1807,0
1808,1
1809,0
1810,2


Unnamed: 0,1,2
1801,1,0
1802,1,0
1803,0,1
1804,0,1
1805,1,0
1806,0,0
1807,0,0
1808,1,0
1809,0,0
1810,0,1


In [5]:
ptSatDF3=ptSatDF2.drop('ptCat',axis=1)
ptSatDF4=pd.concat([ptSatDF3,ptCats],axis=1,sort=False)
ptSatDF4=ptSatDF4.rename(columns={1:'ptCat1',2:'ptCat2'}) ## The original call also passed index=str, which seems unnecessary.
ptSatDF4.columns

Index(['patSat', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9', 'ptCat1',
       'ptCat2'],
      dtype='object')

In [6]:
ptSatDF4.shape

(1811, 11)

# K-Fold CV with Separate X Train and Test Standardization Within Folds


20 folds, using the defaults for the `scikit-learn` `StandardScaler` class.

In [8]:
kf=KFold(n_splits=20,random_state=99,shuffle=True)
X=ptSatDF4.iloc[:,1:].to_numpy()
y=ptSatDF4.iloc[:,0].to_numpy()

cvres=[]  # Holder list for fold results

regr=linear_model.LinearRegression() # define a reg model to use

scaler=preprocessing.StandardScaler() # by default, mean=0, sd=1

for traindx, testdx in kf.split(X):  # loop over folds
    resDict={}                       # Dictionary to hold fold results

    XTrainS=scaler.fit_transform(X[traindx])  # Xtrain rescaled
    yTrain=y[traindx]
    XTestS=scaler.fit_transform(X[testdx])    # Xtest rescaled; must be rescaled separately from the training data
    yTest=y[testdx]

    regModel=regr.fit(XTrainS,yTrain) 
    trainPred=regModel.predict(XTrainS)
    trainR2=r2_score(yTrain,trainPred)
    trainMSE=mean_squared_error(yTrain,trainPred)

    testPred=regModel.predict(XTestS)
    testR2=r2_score(yTest,testPred)
    testMSE=mean_squared_error(yTest,testPred)

    resDict.update({'trainR2':trainR2,
                    'testR2':testR2,
                    'trainMSE':trainMSE,
                    'testMSE':testMSE})

    cvres.append(resDict)





In [9]:
# Rearranging cols to make train vs test comparisons easier

cvresDF=pd.DataFrame(cvres)[['trainMSE','testMSE','trainR2','testR2']]
cvresDF

Unnamed: 0,trainMSE,testMSE,trainR2,testR2
0,1.929971,1.910807,0.704365,0.650944
1,1.877271,2.480409,0.711835,0.602487
2,1.925966,1.579951,0.700704,0.795454
3,1.92501,1.552436,0.706118,0.721352
4,1.923698,1.542167,0.704014,0.764135
5,1.903037,2.024018,0.706371,0.705701
6,1.885127,2.430692,0.709553,0.628048
7,1.891992,2.353795,0.706521,0.681725
8,1.899067,2.113676,0.704958,0.726356
9,1.900755,2.063928,0.708765,0.656303


In [10]:
cvresDF.describe()

Unnamed: 0,trainMSE,testMSE,trainR2,testR2
count,20.0,20.0,20.0,20.0
mean,1.905219,2.053023,0.706933,0.675779
std,0.017148,0.309234,0.002749,0.052867
min,1.871845,1.542167,0.700704,0.561533
25%,1.894593,1.879348,0.705033,0.650602
50%,1.902233,2.088802,0.70727,0.668211
75%,1.923043,2.229367,0.708789,0.702608
max,1.929971,2.599047,0.711835,0.795454
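
For comparison, here is a sketch of a different, widely used convention: fit the scaler on the training fold only, then apply those same training parameters to the test fold. This is not what the loop above does (it refits the scaler on each test fold), so the numbers will generally differ. Putting the scaler and the model in a scikit-learn pipeline and handing it to `cross_validate` takes care of the per-fold fitting. The data below are synthetic, and `X_demo`, `y_demo`, `pipe`, and `kf_demo` are names made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption, not the patient data)
rng = np.random.default_rng(99)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=200)

# Within each fold the pipeline fits the scaler on the training rows,
# then reuses those training means and SDs to transform the test rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
kf_demo = KFold(n_splits=5, shuffle=True, random_state=99)

scores = cross_validate(pipe, X_demo, y_demo, cv=kf_demo,
                        scoring=("r2", "neg_mean_squared_error"),
                        return_train_score=True)

test_r2 = scores["test_r2"]
test_mse = -scores["test_neg_mean_squared_error"]  # flip sign back to MSE
print(test_r2)
print(test_mse)
```

Either way, no test-fold rows are used when fitting the regression, which is the leakage this notebook is about.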


In [18]:
cvres=[]
kf=KFold(n_splits=20,random_state=99,shuffle=True)
scaler=preprocessing.MinMaxScaler() # by default, rescales each feature to the range [0, 1]

for traindx, testdx in kf.split(X):  # loop over folds
    resDict={}                       # Dictionary to hold fold results

    XTrainS=scaler.fit_transform(X[traindx])  # Xtrain rescaled
    yTrain=y[traindx]
    XTestS=scaler.fit_transform(X[testdx])    # Xtest rescaled; must be rescaled separately from the training data
    yTest=y[testdx]

    regModel=regr.fit(XTrainS,yTrain) 
    trainPred=regModel.predict(XTrainS)
    trainR2=r2_score(yTrain,trainPred)
    trainMSE=mean_squared_error(yTrain,trainPred)

    testPred=regModel.predict(XTestS)
    testR2=r2_score(yTest,testPred)
    testMSE=mean_squared_error(yTest,testPred)

    resDict.update({'trainR2':trainR2,
                    'testR2':testR2,
                    'trainMSE':trainMSE,
                    'testMSE':testMSE})

    cvres.append(resDict)



In [20]:
cvresDF = pd.DataFrame(cvres)[["trainMSE","testMSE","trainR2","testR2"]]
cvresDF

Unnamed: 0,trainMSE,testMSE,trainR2,testR2
0,1.929971,1.483412,0.704365,0.729019
1,1.877271,2.465664,0.711835,0.60485
2,1.925966,1.553774,0.700704,0.798843
3,1.92501,1.557776,0.706118,0.720393
4,1.923698,1.59117,0.704014,0.75664
5,1.903037,1.982589,0.706371,0.711725
6,1.885127,2.313917,0.709553,0.645917
7,1.891992,2.207918,0.706521,0.701451
8,1.899067,2.081354,0.704958,0.730541
9,1.900755,2.030082,0.708765,0.661939
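
The StandardScaler and MinMaxScaler loops differ only in the scaler, so they could be folded into one function. The sketch below does that with a hypothetical helper, `cv_with_scaler` (not part of scikit-learn or of the original notebook), and runs it on synthetic data so that it is self-contained. It follows the same scheme as the loops above: the scaler is refit separately on each training fold and each test fold.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model, preprocessing
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold

def cv_with_scaler(X, y, scaler, n_splits=20, random_state=99):
    """Hypothetical helper: k-fold CV with train and test rescaled separately."""
    kf = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    regr = linear_model.LinearRegression()
    rows = []
    for traindx, testdx in kf.split(X):
        XTrainS = scaler.fit_transform(X[traindx])  # refit on training fold
        XTestS = scaler.fit_transform(X[testdx])    # refit on test fold
        regModel = regr.fit(XTrainS, y[traindx])
        trainPred = regModel.predict(XTrainS)
        testPred = regModel.predict(XTestS)
        rows.append({'trainMSE': mean_squared_error(y[traindx], trainPred),
                     'testMSE': mean_squared_error(y[testdx], testPred),
                     'trainR2': r2_score(y[traindx], trainPred),
                     'testR2': r2_score(y[testdx], testPred)})
    return pd.DataFrame(rows)[['trainMSE', 'testMSE', 'trainR2', 'testR2']]

# Synthetic data, only so the sketch runs on its own
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 1.5]) + rng.normal(size=400)

resStd = cv_with_scaler(X_demo, y_demo, preprocessing.StandardScaler())
resMM = cv_with_scaler(X_demo, y_demo, preprocessing.MinMaxScaler())
print(resStd.describe())
print(resMM.describe())
```

The training columns should agree between the two runs, as they do in the tables above: ordinary least squares with an intercept gives the same fitted values under any linear rescaling of the predictors. Only the test columns change, because each test fold is rescaled using its own statistics.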


# A UDU: Radon Regression With MinMax Rescaling

This can be done essentially like what's above. But instead of

`scaler=preprocessing.StandardScaler()`

use

`scaler=preprocessing.MinMaxScaler()`

Use the radon data, of course.  Don't forget that there's an observation with a missing value on `hhincome`. 
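
If it helps, here is a small sketch of how one might spot and drop a row with a missing value. The DataFrame `toy` and its values are made up for illustration; only the column names echo the radon data.

```python
import numpy as np
import pandas as pd

# Made-up values, not the real radon data
toy = pd.DataFrame({"lcanmort": [50.1, 62.3, 48.7],
                    "hhincome": [41.0, np.nan, 38.5]})

print(toy.isnull().sum())  # count of missing values per column

# Keep only the rows where hhincome is present
toy_complete = toy.loc[toy.hhincome.notnull()]
print(toy_complete.shape)
```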

In [21]:
radon=pd.read_csv('../DATA/ML/radon.csv')
radon2=radon.loc[radon.hhincome.notnull(),'lcanmort':'hhincome'].drop('radon',axis=1)
radon2.shape
radon2.columns

(2880, 7)

Index(['lcanmort', 'lnradon', 'obesity', 'over65', 'cursmoke', 'evrsmoke',
       'hhincome'],
      dtype='object')

In [22]:
kf=KFold(n_splits=20,random_state=99,shuffle=True)
X=radon2.iloc[:,1:].to_numpy()
y=radon2.loc[:,'lcanmort'].to_numpy()
scaler=preprocessing.MinMaxScaler() # by default, rescales each feature to the range [0, 1]

cvres=list()  # This will hold cv results; [] also works

regr=linear_model.LinearRegression() # define a reg model to use

for traindx, testdx in kf.split(X):  # loop over folds
    resDict={}                       # Dictionary to hold fold results

    # Split X and y into training and test sets for this fold
    XTrain=X[traindx]
    yTrain=y[traindx]
    XTest=X[testdx]
    yTest=y[testdx]

    XTrainS=scaler.fit_transform(XTrain)  # Xtrain rescaled
    XTestS=scaler.fit_transform(XTest)    # Xtest rescaled; must be rescaled separately from the training data

    # Fit the regression model (uses the regr defined above)
    regModel=regr.fit(XTrainS,yTrain)
    
    # Get predictions, R2, and MSE for the training data
    trainPred=regModel.predict(XTrainS)
    trainR2=r2_score(yTrain,trainPred)
    trainMSE=mean_squared_error(yTrain,trainPred)
    
    # Get predictions, R2, and MSE for the test data
    testPred=regModel.predict(XTestS)
    testR2=r2_score(yTest,testPred)
    testMSE=mean_squared_error(yTest,testPred)
    
    # Store the results in resDict (using the .update method)
    resDict.update({'trainR2':trainR2,
                    'testR2':testR2,
                    'trainMSE':trainMSE,
                    'testMSE':testMSE})
    
    # Collect each fold's result dictionary in the cvres list
    cvres.append(resDict)
    

In [25]:
cvresDF = pd.DataFrame(cvres)
cvresDF[['trainMSE','testMSE','trainR2','testR2']]
cvresDF.describe()

Unnamed: 0,trainMSE,testMSE,trainR2,testR2
0,165.913376,309.356095,0.465619,0.05982
1,167.118174,260.509577,0.464053,0.137373
2,164.126013,231.821872,0.472399,0.271262
3,167.027701,165.877794,0.46022,0.510757
4,167.125752,167.669161,0.461812,0.485368
5,166.501674,357.119988,0.464057,-0.098346
6,167.654218,213.78232,0.466295,0.177876
7,166.1195,338.631115,0.465306,-0.039748
8,168.463209,176.63585,0.465221,0.257529
9,162.079522,268.025326,0.470901,0.342024


Unnamed: 0,testMSE,testR2,trainMSE,trainR2
count,20.0,20.0,20.0,20.0
mean,230.116225,0.248074,166.338463,0.465881
std,58.60017,0.192349,1.504895,0.00349
min,143.359844,-0.098346,162.079522,0.459364
25%,185.177616,0.117985,165.744857,0.464056
50%,224.0185,0.281171,166.595322,0.46575
75%,262.388514,0.360888,167.248859,0.467429
max,357.119988,0.57098,168.463209,0.472399
