# Transform & Rescale

This notebook focusses on transforming and re-scaling the predictors so that differences of scale or skew in one or more dimensions do not bias the results. The objective is to produce two data sets that can be re-loaded and processed as many times as you like in Notebooks 7 and 8 without ever needing to revisit this notebook.

The **only** time you'd need to come back here is if you decide to change the base transformation used on the scoring dimensions, in which case you need to re-run this notebook _once_ to generate new versions of the two predictor data sets. Note, however, that the outputs for different transforms do not overwrite each other, so in Notebooks 7 and 8 you can simply switch between Untransformed, Log-Transformed, and Box-Cox-Transformed data at will.

In [None]:
# Needed on a Mac
import matplotlib as mpl
mpl.use('TkAgg')
%matplotlib inline
import matplotlib.pyplot as plt 

In [None]:
# For reproducibility
import random
import numpy as np
r_state = 42
random.seed(r_state) 
np.random.seed(r_state)

In [None]:
import os
import re
import pandas as pd
import seaborn as sns

import sklearn
print('Your scikit-learn version is {}.'.format(sklearn.__version__))
print('Please check it is at least 0.18.0.')

from sklearn.preprocessing import scale
from sklearn import linear_model
from sklearn import tree
from sklearn import preprocessing
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics  
from sklearn import ensemble

from io import StringIO  # sklearn.externals.six is no longer shipped with recent scikit-learn
#from sklearn.model_selection import GridSearchCV
#from sklearn.feature_selection import SelectKBest 
#from sklearn.feature_selection import f_regression

from timeit import default_timer as timer
import datetime

In [None]:
analytical = os.path.join('data','analytical')

def load_status_scores(dtype):
    status = pd.read_csv(os.path.join(analytical,dtype+'-Scores.csv.gz'), index_col=0)  # SES scores
    
    # Scores
    status.drop(['RANK_01','RANK_11'], axis=1, inplace=True)
    status.rename(columns={
        'SES_01':'SES 2001',
        'SES_11':'SES 2011',
        'SES_ASC':'SES Ascent 2001-2011',
        'SES_PR_01':'SES 2001 Percentile', # 99 = High-status
        'SES_PR_11':'SES 2011 Percentile', # 99 = High-status
        'SES_PR_ASC':'SES Percentile Ascent 2001-2011'
    }, inplace=True)
    return status

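def load_predictors(dtype):
    # Placeholder: the body of this function is missing from the notebook
    raise NotImplementedError("load_predictors() has not been implemented")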

def plot_checks(df, selected_cols, prefix='Test'):
    sns.set(rc={"figure.figsize": (12, 3)})
    for d in selected_cols:
        print("Working on " + d)
        fig = plt.figure(d)
        sns.distplot(df[d], color='green', hist=True, rug=True, norm_hist=False)
        fig = plt.gcf() # *G*et the *C*urrent *F*igure environment so that the next command works
        plt.savefig("{0}-{1}-Check.pdf".format(prefix, d.replace(':',' - ')), bbox_inches="tight")
        plt.close()
    print("Done.")
    return

## Choose Your Transform (if any)

This code enables us to switch between testing different transforms on the data. You would probably want to match what you specified in Notebook 4, though I've added all three basic outputs (Untransformed, Log, and Box-Cox) to GitHub so that it's easy to experiment with the different choices. After you've run this next section once you don't need to run it again _until_ you change the transform used.
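
As a minimal sketch of how the `to_use` label feeds into file names: the helper below (`output_path` is a hypothetical name, not part of this notebook) checks the label against the three options listed in the next cell and builds an output path following the same naming pattern used when the scaled data are written out later on. Because the label is the file-name prefix, each transform gets its own pair of files.

```python
import os

VALID_TRANSFORMS = ('Untransformed', 'Box-Cox', 'Log')  # options listed in this notebook

def output_path(to_use, year, base=os.path.join('data', 'analytical')):
    """Hypothetical helper: build the output file name for a transform/year pair."""
    if to_use not in VALID_TRANSFORMS:
        raise ValueError("Unknown transform: {!r}".format(to_use))
    fname = '{0}-{1}-Data-Transformed_and_Scaled.csv.gz'.format(to_use, year)
    return os.path.join(base, fname)

# One distinct file per transform, so switching never overwrites earlier output
paths = [output_path(t, 2001) for t in VALID_TRANSFORMS]
print(paths)
```

Validating the label up front means a typo in `to_use` fails immediately rather than silently creating a fourth family of files.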

In [None]:
to_use = 'Untransformed' # Options are: ['Untransformed','Box-Cox','Log']

SES = load_status_scores(to_use)  # SES scores for 2001 and 2011

d01input = pd.read_csv(os.path.join('data','canonical','scores',to_use+'-Inputs-2001.csv.gz'), index_col=0)  # SES inputs
d11input = pd.read_csv(os.path.join('data','canonical','scores',to_use+'-Inputs-2011.csv.gz'), index_col=0)  # SES inputs

# Rename to remove confusion
d01input.rename(columns=lambda x: re.sub(' 2001','',x), inplace=True)
d11input.rename(columns=lambda x: re.sub(' 2011','',x), inplace=True)

In [None]:
#  Read in processed datasets
d01 = pd.read_csv(os.path.join(analytical,'Predictor-2001-Data.csv.gz'), compression='gzip', index_col=0)  #  Main dataset for 2001
d11 = pd.read_csv(os.path.join(analytical,'Predictor-2011-Data.csv.gz'), compression='gzip', index_col=0)  #  Main dataset for 2011

d01 = pd.merge(d01input, d01, how='inner', left_index=True, right_index=True)
d11 = pd.merge(d11input, d11, how='inner', left_index=True, right_index=True)

if d01.shape[0] != 4835:
    print("Wrong number of rows in d01: " + str(d01.shape[0]))
if d11.shape[0] != 4835:
    print("Wrong number of rows in d11: " + str(d11.shape[0]))

In [None]:
print("Have " + str(len(d01.columns)+1) + " variables to work with.")
d01.sample(3, random_state=r_state)

In [None]:
# Sanity check
s01 = set(d01.columns)
s11 = set(d11.columns)
print("2001 vs 2011 variable check: " + str(s01.difference(s11)))
print("2011 vs 2001 variable check: " + str(s11.difference(s01)))

In [None]:
SES.describe()

In [None]:
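# One row per variable: 2001 summary statistics followed by the 2011 ones.
# (DataFrame.append was removed in pandas 2.0, so build all rows and combine once.)
descriptives = pd.DataFrame({
    c: pd.concat([d01[c].describe(), d11[c].describe()], axis=0, ignore_index=True)
    for c in d01.columns
}).T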

descriptives.columns = ['2001 Count','2001 Mean','2001 StD','2001 Min','2001 LQ','2001 Median','2001 UQ','2001 Max',
                        '2011 Count','2011 Mean','2011 StD','2011 Min','2011 LQ','2011 Median','2011 UQ','2011 Max']

In [None]:
# This enables us to re-use the same sample below
dsample = descriptives.sample(4, random_state=r_state).index.values
dsample = np.append(dsample,
                    ['Fare_Zone','House Prices',
                     'Percentage with Level 4+ Qualifications','Percentage of Knowledge Workers',
                     'Household Income'])

In [None]:
# Useful, but time-consuming
#plot_checks(d01, dsample, 'Untransformed')
descriptives[descriptives.index.isin(dsample)][
    ['2001 Min','2011 Min','2001 Max','2011 Max','2001 Median','2011 Median']
]

## Re-Scaling Data

In the code below, the 2001 data are scaled by their interquartile range (robust scaling, without centring). The same scaling, fitted on 2001, is then applied to the 2011 data so that both years share a common scale. Finally, both data sets are centred independently by removing each year's median.
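
To make the two steps concrete, here is a toy sketch on synthetic data (the arrays `x01` and `x11` are made up for illustration and stand in for a single predictor in each year). With `with_centering=False`, `RobustScaler` divides each column by the interquartile range of the data it was fitted on; with `with_scaling=False`, it only subtracts the median.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.RandomState(42)
x01 = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))  # synthetic "2001" column
x11 = rng.lognormal(mean=0.5, sigma=1.0, size=(500, 1))  # synthetic "2011" column, shifted

def iqr(a):
    """Interquartile range of an array."""
    return np.percentile(a, 75) - np.percentile(a, 25)

# Step 1: common scale, fitted on 2001 only (no centring)
scaler = RobustScaler(with_centering=False, quantile_range=(25.0, 75.0)).fit(x01)
s01, s11 = scaler.transform(x01), scaler.transform(x11)

# Step 2: centre each year on its own median (no further scaling)
c01 = RobustScaler(with_scaling=False).fit_transform(s01)
c11 = RobustScaler(with_scaling=False).fit_transform(s11)

print("2001 IQR after scaling:", iqr(c01))
print("2011 IQR after scaling:", iqr(c11))
print("Medians after centring:", np.median(c01), np.median(c11))
```

In this sketch the 2001 column ends up with an interquartile range of one, the 2011 column keeps whatever spread it has relative to 2001, and both medians sit at zero.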

In [None]:
# Robust scaling _without_ centering
# and _with_ common scaling. We do this 
# because 2001 and 2011 won't have the 
# same centre but we do want them to use
# a common scale.
rs1 = preprocessing.RobustScaler(with_centering=False, quantile_range=(25.0,75.0))

#  Train on 2001 data set
rs1.fit(d01)

# Apply the same (2001-fitted) IQR scaling to both years
d01_trs1 = pd.DataFrame(data=rs1.transform(d01), index=d01.index, columns=d01.columns)
d11_trs1 = pd.DataFrame(data=rs1.transform(d11), index=d11.index, columns=d11.columns)

# Create new robust scaler for centering 
# _without_ common scaling.
rs2 = preprocessing.RobustScaler(with_scaling=False)  

# Centre independently
d01_trs2 = pd.DataFrame(data=rs2.fit_transform(d01_trs1), index=d01.index, columns=d01.columns)  
d11_trs2 = pd.DataFrame(data=rs2.fit_transform(d11_trs1), index=d11.index, columns=d11.columns)

#  Write the transformed data to csv
d01_trs2.to_csv(os.path.join(analytical,to_use+'-2001-Data-Transformed_and_Scaled.csv.gz'), compression='gzip', index=True)
d11_trs2.to_csv(os.path.join(analytical,to_use+'-2011-Data-Transformed_and_Scaled.csv.gz'), compression='gzip', index=True) 

print("Done.")

### Sanity Checks

In [None]:
# One row per variable (DataFrame.append was removed in pandas 2.0)
descriptives_trs1 = pd.DataFrame({
    c: pd.concat([d01_trs1[c].describe(), d11_trs1[c].describe()], axis=0, ignore_index=True)
    for c in d01_trs1.columns
}).T

descriptives_trs1.columns = ['2001 Count','2001 Mean','2001 StD','2001 Min','2001 LQ','2001 Median','2001 UQ','2001 Max',
                             '2011 Count','2011 Mean','2011 StD','2011 Min','2011 LQ','2011 Median','2011 UQ','2011 Max']

# Useful, but time-consuming
#plot_checks(d01_trs1, dsample, 'First-transform')

descriptives_trs1[descriptives_trs1.index.isin(dsample)][
    ['2001 Min','2011 Min','2001 Max','2011 Max','2001 Median','2011 Median','2001 Mean','2011 Mean']
]

In [None]:
# One row per variable (DataFrame.append was removed in pandas 2.0)
descriptives_trs2 = pd.DataFrame({
    c: pd.concat([d01_trs2[c].describe(), d11_trs2[c].describe()], axis=0, ignore_index=True)
    for c in d01_trs2.columns
}).T

descriptives_trs2.columns = ['2001 Count','2001 Mean','2001 StD','2001 Min','2001 LQ','2001 Median','2001 UQ','2001 Max',
                             '2011 Count','2011 Mean','2011 StD','2011 Min','2011 LQ','2011 Median','2011 UQ','2011 Max']

# Useful, but time-consuming
#plot_checks(d01_trs2, dsample, 'Second-transform')

descriptives_trs2[descriptives_trs2.index.isin(dsample)][
    ['2001 Min','2011 Min','2001 Max','2011 Max','2001 Median','2011 Median','2001 Mean','2011 Mean']
]

In [None]:
# Tidy up
del(s01, s11, d01, d11, d01input, d11input, d01_trs1, d11_trs1, rs1, rs2)
del(dsample, descriptives, descriptives_trs1, descriptives_trs2)

Once the code above has been run, you do _not_ need to run it again, _unless_ you want to change the transform used, as we'll read the transformed data back from CSV in the next notebook. Files produced with different transforms do _not_ overwrite each other, which makes it easier to swap between approaches without re-running this notebook multiple times.
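
A minimal sketch of that round trip, assuming the reader mirrors the writer: `index_col=0` restores the index that `index=True` wrote out, and pandas infers gzip compression from the `.gz` suffix. The toy frame and temporary directory are placeholders so that the example runs on its own; the loading code in the next notebook may differ.

```python
import os
import tempfile
import pandas as pd

to_use = 'Untransformed'  # same label as chosen earlier in this notebook

# Toy stand-in for the scaled 2001 data (values are illustrative only)
toy = pd.DataFrame({'House Prices': [-0.5, 0.0, 0.5]}, index=['A', 'B', 'C'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, to_use + '-2001-Data-Transformed_and_Scaled.csv.gz')
    toy.to_csv(path, compression='gzip', index=True)   # write, as in this notebook
    reloaded = pd.read_csv(path, index_col=0)           # read back, as a later notebook might

print(reloaded)
```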