# Example from last class/ASGN 06- std'zing variables 

In [1]:
#load data

import pandas as pd
url = 'https://github.com/LeDataSciFi/lectures-spr2020/blob/master/assignment_data/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip')

#one line: standardize all numeric vars


In [5]:
from sklearn import preprocessing 
fannie_array = preprocessing.scale(fannie_mae.select_dtypes('number'))

pd.DataFrame(fannie_array).describe().round(2).T

# remaining issues you should fix in assignment...
# - only do this to true continuous variables
# - only keep variables of interest
# - impute missing



Unnamed: 0,count,mean,std,min,25%,50%,75%,max
0,135038.0,-0.0,1.0,-1.74,-0.87,0.0,0.87,1.73
1,135038.0,-0.0,1.0,-2.32,-0.77,0.01,0.69,4.47
2,135038.0,0.0,1.0,-1.66,-0.74,-0.23,0.53,9.02
3,135038.0,0.0,1.0,-3.0,-0.81,0.64,0.64,0.64
4,135038.0,-0.0,1.0,-3.78,-0.57,0.28,0.57,1.54
5,134007.0,0.0,1.0,-3.81,-0.56,0.24,0.52,4.05
6,135007.0,0.0,1.0,-1.16,-1.16,0.81,0.81,12.61
7,132396.0,0.0,1.0,-2.81,-0.72,-0.03,0.76,2.67
8,134481.0,0.0,1.0,-7.14,-0.66,0.24,0.82,2.01
9,135038.0,0.0,1.0,-0.14,-0.14,-0.14,-0.14,12.13


## The Cardinal Sin of data leakage
**Having data in the training sample that you wouldnt have for real world predictions**

Examples
1. y is explicitly in X (yikes)
2. y is a 2018 variable, but there is a 2019 variable in X
3. subtle: y is loan default, but X contains employee ID and some employees are brought in to handle trouble-loans (if you include it, the firm can't use the model to deploy the trouble-loan specialists)
4. if out-of-sample predicted stock movements have R2 above 10%... unlikely! (or: you'll be richer than Bezos soon)
5. this code below

```python
import #a bunch of sklearn stuff
X, y = #load data
X = transform(X) # imputation, encode cat vars, standardize

# or this:
cross_validate(model,X,y)

```

**Q: What's the problem here?**

**A: `transform(X)` used the whole dataset, so the X_training data was altered using info from X_test** 

In [None]:
x1 sample
1 training
1 training 

2 test
1 test

## Avoiding Data Leakage

- Preventing 1-4: Be very familiar with the data and how it was collected and built
- Preventing 5: Do your data prep _**within**_ CV folds and where the transformations are done using only info from the training 

```python

# loop over folds 
for train_index, test_index in StratifiedKFold(n_splits=5).split(X,y):

    # .split() yields the indices in train/test sets. use those to get 
    # the x/y vars for each separated out:
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index] 
From Donald Bowen to Everyone: (01:03 PM)
 ###################################################################
    # NEW: do the data prep inside this fold, only using training data 
    ###################################################################

    # e.g. figure out means/std in Xtrain so we can impute/std
    prep_methods.fit(Xtrain)                 # "fit" the transform means "estimate (like in training a model) what to do"
    Xtrain = prep_methods.transform(Xtrain)  # apply those to Xtrain to impute and std
    
    # fit/estimate, predict OOS, evaluate and store
    model.fit(X_train,y_train)
    
    ###################################################################
    # NEW: transform the test data the same... 
    ###################################################################
    
    X_test = prep_methods.transform(X_test)  # apply TEST data the FIT from the TRAIN data 
    
    y_predict = model.predict(X_test)
    accuracy.append(   accuracy_score(y_test, y_predict)      )

```

## Our first pipeline

Pipe: a sequence of steps, as long as each step has a fit, and a transform

cross_validate(model,X,y)
model in sklearn --> .fit, and .predict() / transform #predict is sort of like a transformation in an abstract sense
prep_method #fictional function

In [9]:
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn import svm

iris = load_iris() # data

# set up the pipeline, which will, given a set of observations 
# 1. fit and apply these steps to the training fold
# 2. in the testing fold, apply the transform and model to predict (no estimation)

classifier_pipeline = make_pipeline(
                                    preprocessing.StandardScaler(),  # clean the data
                                    svm.SVC(C=1)                     # model
                                    )

cross_validate(classifier_pipeline, iris.data, iris.target, cv=5)

{'fit_time': array([0.00620794, 0.00116682, 0.00102401, 0.00115275, 0.00100994]),
 'score_time': array([0.00078297, 0.00034523, 0.00032687, 0.00033402, 0.00032711]),
 'test_score': array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1.        ])}

In [11]:
#question 1: try this with a Nearest Neighbors Classifier 

from sklearn.neighbors import KNeighborsClassifier
classifier_pipeline = make_pipeline(
                                    preprocessing.StandardScaler(),  # clean the data
                                    KNeighborsClassifier()                     # model
                                    )

cross_validate(classifier_pipeline, iris.data, iris.target, cv=5)

{'fit_time': array([0.00467682, 0.001858  , 0.00182033, 0.00150299, 0.00233507]),
 'score_time': array([0.00625205, 0.00345302, 0.00334573, 0.00212717, 0.00292802]),
 'test_score': array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ])}

In [17]:
#question 2: load this altered dataset and add a step to impute the missing values with the column mean

iris2=load_iris()
X2= pd.DataFrame(iris2.data)
X2.columns = [1,2,3,4]
X2[2] = X2[2].sample(frac=0.5,random_state=14)
X2[2].describe()
iris2.data = X2

# print the scores using IRIS2.data (not iris.data)
# this produces an error because of the missing values!
# cross_validate(knn_pipe, iris2.data, iris.target, cv=5)

# so add an imputation step to the pipeline! (5 min, use lecture page!)

from sklearn.impute import SimpleImputer
knn_pipe2 = make_pipeline(
                        SimpleImputer(strategy='mean'), #fill the missing values
                        preprocessing.StandardScaler(),  # clean the data
                        KNeighborsClassifier()          # model
                        )


cross_validate(knn_pipe2, iris2.data, iris.target, cv=5)

{'fit_time': array([0.00640702, 0.00362015, 0.00725293, 0.00743818, 0.00321794]),
 'score_time': array([0.00308275, 0.00278878, 0.00463605, 0.00507283, 0.00239468]),
 'test_score': array([0.9       , 0.96666667, 0.9       , 0.96666667, 1.        ])}

## Optimizing a model- here, `KKN`, with `GridSearchCV`

Let's optimize the model from the last answer... `GridSearchCV` lets you tweak any perameters from any function in the pipeline

Tips:
    - If any perameteres in your grid are optimal at the boundaries, add more points until optimum is inferior
    - After you optimize the model, save it as a model object to use

In [18]:
knn_pipe2

Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=5, p=2,
                                      weights='uniform'))],
         verbose=False)

In [20]:
#grid search will let you specify all the perameters of the mdoel
#you want to tweak, and the values you want to try

from sklearn.model_selection import GridSearchCV

# set up parameter grid to try
# the parameter grid is a dictionary where key:value pairs are built like:
#       stepName<two underlines>paramName : [list of settings to try]
param_grid = {'kneighborsclassifier__n_neighbors':[1,5,6,7,8,9,10]}

# like a normal estimator, this has not yet been applied to any data
grid = GridSearchCV(knn_pipe2, param_grid=param_grid)
grid.fit(iris2.data, iris.target)
grid.best_params_

# now save that pipeline as a model object!
optimal_knn_model = grid.best_estimator_



In [22]:
#question 3: add to the param grid to check if we should change these two params:
#            StandardScaler(with_mean=True, with_std=True)

from sklearn.model_selection import GridSearchCV

# set up parameter grid to try
# the parameter grid is a dictionary where key:value pairs are built like:
#       stepName<two underlines>paramName : [list of settings to try]
param_grid = {'kneighborsclassifier__n_neighbors':[1,5,6,7,8,9,10],
             'standardscaler__with_mean':['True','False'],
             'standardscaler__with_std':['True','False']}

# like a normal estimator, this has not yet been applied to any data
grid = GridSearchCV(knn_pipe2, param_grid=param_grid)
grid.fit(iris2.data, iris.target)
grid.best_params_




{'kneighborsclassifier__n_neighbors': 9,
 'standardscaler__with_mean': 'True',
 'standardscaler__with_std': 'True'}

In [23]:
#you can see the WHOLE set of attempts by GridSearch using
grid.cv_results_

{'mean_fit_time': array([0.00684071, 0.00513458, 0.00877579, 0.00487566, 0.00367864,
        0.00390704, 0.00332054, 0.00471258, 0.00338968, 0.00366871,
        0.00467237, 0.00397142, 0.00372084, 0.00421   , 0.00351389,
        0.00345254, 0.0032088 , 0.00453496, 0.00420078, 0.00446335,
        0.00513442, 0.00386103, 0.00551287, 0.00540789, 0.00541401,
        0.00641306, 0.00536577, 0.00478268]),
 'std_fit_time': array([2.43707652e-03, 1.26350251e-03, 3.60079946e-03, 1.04308567e-03,
        2.92842100e-04, 6.68243579e-04, 3.17222428e-04, 1.89248049e-03,
        1.27768089e-04, 2.85923182e-04, 9.65317194e-04, 1.93247569e-04,
        9.94107955e-05, 5.93172807e-04, 3.16526523e-04, 2.23807454e-04,
        1.33014540e-04, 8.87237917e-04, 2.61693134e-04, 4.30525878e-04,
        1.87675942e-03, 5.50857767e-04, 9.12498481e-04, 7.51196985e-04,
        1.20650765e-03, 2.11197061e-03, 1.50454620e-03, 1.25402195e-03]),
 'mean_score_time': array([0.00467436, 0.00671371, 0.0051233 , 0.00386906, 

### Let's do some post-optimization diagnostics
1. scoring on `kfold`
2. graphical if y is continuous
3. categorical accuracy

In [29]:
# print k-fold scoring (like before)
cross_validate(optimal_knn_model, iris2.data, iris.target, cv=5)

###########################################################
# use classification_report to see which types of Y values 
# your prediction performs better/worse on
###########################################################

# to use class_report, we need some predicted y values, so
# make a fold and generate predicted values

from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(iris2.data, iris.target, random_state=9,train_size=.5)
y_pred = optimal_knn_model.fit(Xtrain, ytrain).predict(Xtest)

from sklearn.metrics import classification_report
print(classification_report(ytest,
                            y_pred,
                            target_names=iris.target_names))
#################################################################
# use confusion_matrix see exactly model gets predictions wrong
#################################################################

from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt

plot_confusion_matrix(optimal_knn_model, Xtest, ytest,   # model and test data
                      display_labels=iris.target_names,  # labels
                      cmap=plt.cm.Blues,                 # colors
                      normalize=None)                    # turns on/off fractions (within row)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        27
  versicolor       0.93      0.96      0.94        26
   virginica       0.95      0.91      0.93        22

    accuracy                           0.96        75
   macro avg       0.96      0.96      0.96        75
weighted avg       0.96      0.96      0.96        75



ImportError: cannot import name 'plot_confusion_matrix' from 'sklearn.metrics' (/Users/Kathrynjaco08/opt/anaconda3/lib/python3.7/site-packages/sklearn/metrics/__init__.py)

## Final Summary

- We've now seen more post model diagnostics 
- We can specify the models in `make_pipeline` alongside data cleaning/preprocessing steps that improve model performance without introducing data leakage. 
- There are many imputation, and scaling methods available in `sklearn`, and which one you use depends on the use-case. (Read about and try several!)
- Your pipeline for the assignment will be more complicated if you want to include categorical vars
- You can optimize all of the parameters throughout your pipeline using `GridSearchCV`
    - `GridSearchCV` also allows you to specify how you create folds
    - Which leads us to...

**LAST BIG POINT:** 
- Must of your projects involve an important time series dimension. (Ex: predicting stock returns) 
- In these cases, `KFold` and `StratifiedKFold` won't work (you can't have 1985 in the test sample)
- See: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html