Machine Learning Preparation
===

* Create training, testing, and validation datasets
* Data preprocessing
* Select the best information to give to the algorithm
* Select a machine learning algorithm for your dataset

Set Up The Notebook
===

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt

%matplotlib inline

dataset = pd.read_csv('Data/CleanedDataset.csv', index_col=0)

dataset.describe()

Unnamed: 0,Interview/Exam,Gender,Age,Education,Marital Status,Weight (kg),Height (cm),Exam Completion
count,5533.0,5533.0,5533.0,5533.0,5533.0,5533.0,5533.0,5533.0
mean,2.0,1.518887,39.959154,1.798663,3.036689,75.864887,166.629966,1.006145
std,0.0,0.499688,21.953308,0.899678,1.911148,19.939066,9.816356,0.078156
min,2.0,1.0,14.0,1.0,1.0,25.6,119.8,1.0
25%,2.0,1.0,19.0,1.0,1.0,61.74,159.5,1.0
50%,2.0,2.0,36.0,1.0,3.0,72.94,166.3,1.0
75%,2.0,2.0,60.0,3.0,5.0,86.4,173.5,1.0
max,2.0,2.0,85.0,5.0,6.0,193.3,201.3,2.0


Choose our problem
===

We need a target value and features.
* Target: Height (cm)
* Features: Interview/Exam, Gender, Age, Education, Marital Status, Weight (kg), Exam Completion

In [2]:
target = dataset['Height (cm)']

colnames = list(dataset.columns)
colnames.remove('Height (cm)')
features = dataset.loc[:,colnames]

display(target.describe())
display(features.describe())
display(features.head())

count    5533.000000
mean      166.629966
std         9.816356
min       119.800000
25%       159.500000
50%       166.300000
75%       173.500000
max       201.300000
Name: Height (cm), dtype: float64

Unnamed: 0,Interview/Exam,Gender,Age,Education,Marital Status,Weight (kg),Exam Completion
count,5533.0,5533.0,5533.0,5533.0,5533.0,5533.0,5533.0
mean,2.0,1.518887,39.959154,1.798663,3.036689,75.864887,1.006145
std,0.0,0.499688,21.953308,0.899678,1.911148,19.939066,0.078156
min,2.0,1.0,14.0,1.0,1.0,25.6,1.0
25%,2.0,1.0,19.0,1.0,1.0,61.74,1.0
50%,2.0,2.0,36.0,1.0,3.0,72.94,1.0
75%,2.0,2.0,60.0,3.0,5.0,86.4,1.0
max,2.0,2.0,85.0,5.0,6.0,193.3,2.0


ID,Interview/Exam,Gender,Age,Education,Marital Status,Weight (kg),Exam Completion
5,2.0,1.0,49.0,3.0,1.0,92.5,1.0
6,2.0,2.0,19.0,3.0,5.0,59.2,1.0
7,2.0,2.0,59.0,1.0,1.0,78.0,1.0
10,2.0,1.0,43.0,2.0,4.0,111.8,1.0
12,2.0,1.0,37.0,3.0,5.0,99.2,1.0


Side Note: Unit Testing
---

Tests help ensure your analysis code is doing what you intended. Some things to check for:

* Array size
* Data type
* Maximum and minimum values

In [3]:
np.testing.assert_equal(features.shape[0], target.shape[0],
                        'Target and feature shape mismatch')
np.testing.assert_equal(features.shape, (5533,7), 'Wrong feature shape')
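
The cell above covers array size. A minimal sketch of the other two checks (data type, and maximum and minimum values) might look like the following. The small `demo` frame and the plausible-range limits are assumptions made up for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the feature table; values are for illustration only
demo = pd.DataFrame({'Age': [49.0, 19.0, 59.0],
                     'Weight (kg)': [92.5, 59.2, 78.0]})

# Data type: every column should be floating point
assert all(np.issubdtype(dt, np.floating) for dt in demo.dtypes)

# Maximum and minimum values: assumed plausible ranges, adjust to your data
assert demo['Age'].between(0, 120).all()
assert demo['Weight (kg)'].min() > 0

# No missing values should survive cleaning
assert not demo.isnull().values.any()
```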

Training, Testing, and Validation
===

![Train/Test/Validation Split](Images/TrainTestValSplit.png)

- **Training Set** - Portion of the data used to train a machine learning algorithm.
- **Testing Set** - Portion of the data (usually 10-30%) not used in training, used to evaluate performance.
- **Validation Set** - (Optional) Portion of data (usually 10-30%) used for testing during parameter tuning or classifier selection.

Training, Testing, and Validation (Expanded Notes)
===

![Train/Test/Validation Split](Images/TrainTestValSplit.png)

In order to evaluate our data properly, we need to divide our dataset into training and testing sets. 

- **Training Set** - A portion of the data, usually a majority, used to train a machine learning classifier. These are the examples that the computer will learn in order to try to predict data labels.
- **Testing Set** - A portion of the data, smaller than the training set (usually 10-30%), used to test the accuracy of the machine learning classifier. The computer does not "see" this data while learning, but tries to guess the data labels. We can then determine the accuracy of our method by determining how many examples it got correct.
- **Validation Set** - (Optional) A third section of data used for parameter tuning or classifier selection. When selecting among many classifiers, or when a classifier parameter must be adjusted (tuned), this data is used like a test set to select the best parameter value(s). The final performance is then evaluated on the remaining, previously unused, testing set.


Why split my dataset?
---

* **The goal of a testing set is not to make the testing accuracy as high as possible!** 
* The goal is to make your "real world/target application" accuracy as high as possible. 
* The more accurately your test set reflects real-world conditions, the better it will reflect the real-world accuracy.


No one cares how good your test-set accuracy is if your product doesn't work when you ship it.

Split Dataset
---

* 30% of data for Test Set
* Remaining 70% will be split between training and validation sets

In [4]:
from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(features, target, test_size=0.3)

print('Original dataset size: ' + str(features.shape))
print('Training dataset size: ' + str(X_trainval.shape))
print('Test dataset size: ' + str(X_test.shape))

Original dataset size: (5533, 7)
Training dataset size: (3873, 7)
Test dataset size: (1660, 7)


Another Good Opportunity for Testing
---

In [5]:
np.testing.assert_equal(X_trainval.shape[0], y_trainval.shape[0],
                        'Target and feature training shape mismatch')
np.testing.assert_equal(X_test.shape[0], y_test.shape[0],
                        'Target and feature test shape mismatch')
np.testing.assert_equal(X_trainval.shape[0] + X_test.shape[0], 
                        features.shape[0], 'Incorrect split')
np.testing.assert_equal(X_trainval.shape[1], features.shape[1], 
                        'Wrong number of features')

Set the Random State (Optional)
---

* Set a random seed so that analysis is reproducible.
* Could accidentally use a seed that produces unusual results.
* Absolutely don't hunt for the best seed!


In [6]:
seed = 42
X_trainval, X_test, y_trainval, y_test = train_test_split(features, target, 
                                                          test_size=0.3,
                                                          random_state=seed)
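
As a quick check of the reproducibility claim, the sketch below splits a small synthetic array twice with the same seed and confirms that both calls return the same partition. The arrays `X_demo` and `y_demo` are placeholders invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # synthetic features
y_demo = np.arange(10)                  # synthetic targets

# Same seed twice: the partitions should be identical
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3,
                                         random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3,
                                         random_state=42)
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)
```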

Evenly Distribute Examples (Stratification)
---

* Proportionally divide target/label values between training and testing data when splitting
  - Especially important in classification when one class has many fewer examples
* For regression tasks, we can create bin labels to help with distribution

In [7]:
bins = np.linspace(np.min(target),np.max(target)+0.1, 5)
labels = np.digitize(target, bins)
X_trainval, X_test, y_trainval, y_test = train_test_split(features, target, 
                                                          test_size=0.3,
                                                          random_state=seed,
                                                          stratify=labels)
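
To see what stratification buys, the following sketch repeats the same binning recipe on synthetic data and compares the share of each bin in the two portions. The uniform target and the `demo_` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y_demo = rng.uniform(150.0, 190.0, size=1000)   # synthetic target
X_demo = rng.normal(size=(1000, 3))             # synthetic features

# Same binning recipe as the cell above: 5 edges give 4 bins
demo_bins = np.linspace(np.min(y_demo), np.max(y_demo) + 0.1, 5)
demo_labels = np.digitize(y_demo, demo_bins)

_, _, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                    random_state=0, stratify=demo_labels)

# Share of each bin in the two portions (index 0 stays empty with these edges)
share_tr = np.bincount(np.digitize(y_tr, demo_bins), minlength=5) / len(y_tr)
share_te = np.bincount(np.digitize(y_te, demo_bins), minlength=5) / len(y_te)
print(np.round(share_tr, 3))
print(np.round(share_te, 3))
```

The two printed rows should be nearly identical, since each bin is divided in the same 70/30 proportion.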

How to Use Validation Data
---

Validation data is used to design your machine learning model. 
The best-performing model on the validation data is then evaluated on the test set to get the final performance estimate. 

Use to select:
* Machine learning algorithm
* Any parameters of the algorithm (e.g. SVM C parameter)
* Number of features
* Feature reduction algorithm

Why can't I test all my algorithms on the test set and report that value?
---

* This is a type of overfitting
* By selecting the best model of many, you are biasing results towards good performance on the particular data used to make that selection
  - Results are likely overly-optimistic about real-world performance
* A test set that played no part in the selection gives a less-biased estimate of model performance
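
The selection effect can be illustrated with a small simulation. In this sketch, every hypothetical "model" guesses binary labels at random, so its true accuracy is 0.5. Picking the best of many on one small set still produces a flattering score, which disappears on fresh data. All names and sizes here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_select, n_fresh = 200, 100, 10000

# Binary labels for a small selection set and a large fresh set
y_select = rng.integers(0, 2, size=n_select)
y_fresh = rng.integers(0, 2, size=n_fresh)

# Each hypothetical "model" guesses at random on the selection set
pred_select = rng.integers(0, 2, size=(n_models, n_select))
acc_select = (pred_select == y_select).mean(axis=1)

# Keep the model that looks best on the selection set
best = int(np.argmax(acc_select))
best_select_acc = acc_select[best]

# The chosen model guesses just as randomly on data it was not selected on
pred_fresh = rng.integers(0, 2, size=n_fresh)
fresh_acc = (pred_fresh == y_fresh).mean()
print(best_select_acc, fresh_acc)
```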

Splitting Training and Validation Data
---

Method 1: Split our training/validation data like we split our full dataset  
Method 2: Use Crossvalidation

Single Split
---

In [8]:
bins = np.linspace(np.min(y_trainval),np.max(y_trainval)+0.1, 5)
labels = np.digitize(y_trainval, bins)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, 
                                                  test_size=0.3, 
                                                  random_state=seed, 
                                                  stratify=labels)

print('X_trainval size: ' + str(X_trainval.shape))
print('X_train size: ' + str(X_train.shape))
print('X_val size: ' + str(X_val.shape))

X_trainval size: (3873, 7)
X_train size: (2711, 7)
X_val size: (1162, 7)


Crossvalidation
---

* Continually splitting our dataset makes it smaller
  - Want to use as much data as possible for training
* Single division makes results likely to change with random state
* Solution: crossvalidation
  - Divide data into multiple equal sections (called folds)
  - Hold one fold out for validation, train on remaining
  - Repeat using each fold as validation


Crossvalidation Visualized
---

![Crossvalidation Visual](Images/Crossvalidation.png)

Crossvalidation in Code
---

`KFold.split()` is an iterator that yields the indices of the training and validation examples for each fold

In [9]:
from sklearn.model_selection import KFold

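# Older versions of scikit-learn used n_folds instead of n_splits.
# random_state only has an effect with shuffle=True; recent versions
# raise an error if it is set without shuffling, so it is omitted here.
kf = KFold(n_splits=5)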
for trainInd, valInd in kf.split(X_trainval):
    print("%s %s" % (trainInd.shape, valInd.shape))


(3098,) (775,)
(3098,) (775,)
(3098,) (775,)
(3099,) (774,)
(3099,) (774,)


Example: Validation for Linear Regression
---

In [10]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error as mse

regr = linear_model.LinearRegression()
foldMSE = []
kf = KFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval):
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    foldMSE.append(mse(y_val,pred))
    
print('Average Mean Squared Error: %s' % np.mean(foldMSE))

Average Mean Squared Error: 42.9976903151


Stratification
---
As with `train_test_split()`, we can use stratification to evenly distribute labels among folds.

In [11]:
from sklearn.model_selection import StratifiedKFold

bins = np.linspace(np.min(y_trainval),np.max(y_trainval)+0.1, 5)
labels = np.digitize(y_trainval, bins)

foldMSE = []
kf = StratifiedKFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval, labels):
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    foldMSE.append(mse(y_val,pred))
    
print('Average Mean Squared Error: %s' % np.mean(foldMSE))

Average Mean Squared Error: 42.9771010149


Feature Scaling and Normalization
===

* Many algorithms work best (or only) when your features are on a similar scale or normalized to a given range
* Helpful for features with unrelated units of different magnitudes
  - e.g. 2 bedrooms & 1,000 sq. feet
* Calculate on training set, then apply to test/validation set

Feature Scaling and Normalization (Continued)
===

* 0-mean and unit-variance
  - Can speed up training convergence (e.g. SVMs)
  - Required for methods that make decisions based on size of variance (e.g. PCA)
* Range [0,1] or [-1,1]
  - Prevents saturation in some algorithms
    - e.g. Neural Networks
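
Both kinds of scaling can be written out in a few lines of NumPy, which makes the "calculate on training set, then apply" rule explicit. The arrays below are synthetic stand-ins (a small integer column and a column in the thousands) chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic columns on very different scales (e.g. bedrooms vs. square feet)
train = np.column_stack([rng.integers(1, 6, 50),
                         rng.normal(1000, 300, 50)]).astype(float)
val = np.column_stack([rng.integers(1, 6, 10),
                       rng.normal(1000, 300, 10)]).astype(float)

# 0-mean and unit-variance: statistics come from the training set only
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_std = (train - mu) / sigma
val_std = (val - mu) / sigma

# Range [0, 1]: again using the training minimum and maximum
lo, hi = train.min(axis=0), train.max(axis=0)
train_01 = (train - lo) / (hi - lo)
val_01 = (val - lo) / (hi - lo)   # may fall slightly outside [0, 1]
```

Because the validation data is transformed with training statistics, its mean is only approximately zero and its range only approximately [0, 1]. That is expected.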

Applying Feature Scaling
---

0-mean and unit-variance on a single training/validation set

In [12]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

Feature Scaling With Crossvalidation
---

In [13]:
from sklearn.preprocessing import StandardScaler

bins = np.linspace(np.min(y_trainval),np.max(y_trainval)+0.1, 5)
labels = np.digitize(y_trainval, bins)

foldMSE = []
kf = StratifiedKFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval, labels):
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Here we scale the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_val = scaler.transform(X_val)
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    foldMSE.append(mse(y_val,pred))
    
print('Average Mean Squared Error: %s' % np.mean(foldMSE))

Average Mean Squared Error: 42.9771010149


Dimensionality Reduction
===

Reduce the number of features to keep only the most relevant data

Why?
* Speed up training
* Store less data
* Can improve accuracy by getting rid of "noise"

Methods
* Feature Selection
* Feature Extraction



Feature selection
---

Choose a subset of the original features to include

**Feature Selection Method Examples:**

* SelectKBest: Selects features based on a scoring function (e.g. mutual information)
* Recursive Feature Elimination (RFE): Uses machine learning algorithm to remove least important features

Examples available in Notebook
  
scikit-learn has a page on [feature selection methods](http://scikit-learn.org/stable/modules/feature_selection.html)

SelectKBest
---

Selects `k` best features based on a scoring function, such as mutual information, chi-squared, or ANOVA F-value
  
[scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)

In [14]:
from sklearn.feature_selection import SelectKBest, mutual_info_regression

selector = SelectKBest(mutual_info_regression, k=4)
selector.fit(X_train, y_train)
X_train = selector.transform(X_train)
X_val = selector.transform(X_val)
print('New shape: ' + str(X_train.shape))

New shape: (3487, 4)


Which Features Were Kept?
---

In [15]:
support = selector.get_support()
kept = features.columns[support]
print('Kept Features: ' + ', '.join(kept))

Kept Features: Gender, Education, Marital Status, Weight (kg)
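
To see how the scoring function drives the choice, here is a small sketch on synthetic data where only two of four columns influence the target. It swaps in `f_regression` (an F-value score) for the mutual information score used above because it is deterministic; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
# Only columns 0 and 2 drive the synthetic target
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

selector_demo = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
print(selector_demo.scores_)            # one score per column; higher is better
kept_idx = selector_demo.get_support(indices=True)
print(kept_idx)                         # indices of the columns that were kept
```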


SelectKBest With Crossvalidation
---

In [16]:
bins = np.linspace(np.min(y_trainval),np.max(y_trainval)+0.1, 5)
labels = np.digitize(y_trainval, bins)

foldMSE = []
kf = StratifiedKFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval, labels):
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Scale the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_val = scaler.transform(X_val)
    
    # Select the features
    selector = SelectKBest(mutual_info_regression, k=4)
    selector.fit(X_train, y_train)
    X_train = selector.transform(X_train)
    X_val = selector.transform(X_val)
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    foldMSE.append(mse(y_val,pred))
    
print('Average Mean Squared Error: %s' % np.mean(foldMSE))

Average Mean Squared Error: 44.5497275082


Recursive Feature Elimination (RFE)
---

Uses a machine learning algorithm to rank feature importance, then recursively removes the least important features until the desired number is left

[scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)

In [17]:
# Get un-reduced training sets 
X_train = X_trainval.iloc[trainInd,:]
y_train = y_trainval.iloc[trainInd]
X_val = X_trainval.iloc[valInd,:]
y_val = y_trainval.iloc[valInd]

In [18]:
from sklearn.feature_selection import RFE

regr = linear_model.LinearRegression()

selector = RFE(regr, n_features_to_select=4)
selector.fit(X_train, y_train)
X_train = selector.transform(X_train)
X_val = selector.transform(X_val)

support = selector.get_support()
kept = features.columns[support]
print('Kept Features: ' + ', '.join(kept))

Kept Features: Gender, Education, Weight (kg), Exam Completion
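
The elimination order is available through the fitted selector's `ranking_` attribute. The sketch below uses synthetic data in which two of four columns carry the signal, so the outcome is easy to predict; the data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
# Columns 0 and 2 carry the signal; columns 1 and 3 are pure noise
y_demo = 5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
print(rfe_demo.ranking_)   # 1 marks a kept feature; larger values were dropped earlier
print(rfe_demo.support_)   # boolean mask of the kept features
```

With a linear model, RFE ranks features by coefficient magnitude, which is one reason to scale the features first when they have different units.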


RFE With Crossvalidation
---

In [19]:
bins = np.linspace(np.min(y_trainval),np.max(y_trainval)+0.1, 5)
labels = np.digitize(y_trainval, bins)

foldMSE = []
kf = StratifiedKFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval, labels):
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Scale the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_val = scaler.transform(X_val)
    
    # Select the features
    regr = linear_model.LinearRegression()
    selector = RFE(regr, n_features_to_select=4)
    selector.fit(X_train, y_train)
    X_train = selector.transform(X_train)
    X_val = selector.transform(X_val)
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    foldMSE.append(mse(y_val,pred))
    
print('Average Mean Squared Error: %s' % np.mean(foldMSE))

Average Mean Squared Error: 42.9115611772


Feature Extraction
---

* Project data into a feature space with fewer important dimensions
* Creates new features through (typically linear) combinations of original features

**Example Feature Extraction Methods:**
* Principal Component Analysis (PCA): Use feature variance for projection
* Linear Discriminant Analysis (LDA): Supervised projection



Principal Component Analysis (PCA)
---
* Does not use label information, just variance of features
* Creates orthogonal feature set that best explains the feature variance
* Make sure to scale/normalize first; otherwise the components will be dominated by the features with the largest variance
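
The effect of scaling on PCA is easy to demonstrate with two independent synthetic features, one with a much larger spread than the other. This is an illustrative sketch; the data is made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features; the second has a far larger spread
X_demo = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 100, 500)])

raw_pca = PCA(n_components=2).fit(X_demo)
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

print(raw_pca.explained_variance_ratio_)     # first component takes nearly everything
print(raw_pca.components_[0])                # and points almost entirely along feature 2
print(scaled_pca.explained_variance_ratio_)  # much closer to an even split
```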

In [20]:
# Reset our example fold
X_train = X_trainval.iloc[trainInd,:]
y_train = y_trainval.iloc[trainInd]
X_val = X_trainval.iloc[valInd,:]
y_val = y_trainval.iloc[valInd]

In [21]:
from sklearn.decomposition import PCA

pca = PCA(n_components=4)
pca.fit(X_train)
X_train = pca.transform(X_train)
X_val = pca.transform(X_val)

In [22]:
# Reset our example fold
X_train = X_trainval.iloc[trainInd,:]
y_train = y_trainval.iloc[trainInd]
print(X_train.columns)

# Scale the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)

Index(['Interview/Exam', 'Gender', 'Age', 'Education', 'Marital Status',
       'Weight (kg)', 'Exam Completion'],
      dtype='object')


PCA With Crossvalidation
---

In [23]:
bins = np.linspace(np.min(y_trainval),np.max(y_trainval)+0.1, 5)
labels = np.digitize(y_trainval, bins)

foldMSE = []
kf = StratifiedKFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval, labels):
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Scale the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_val = scaler.transform(X_val)
    
    # Project onto the principal components
    pca = PCA(n_components=4, random_state=seed)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_val = pca.transform(X_val)
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    foldMSE.append(mse(y_val,pred))
    
print('Average Mean Squared Error: %s' % np.mean(foldMSE))

Average Mean Squared Error: 45.9330535547


In [24]:
# Reset our example fold
X_train = X_trainval.iloc[trainInd,:]
y_train = y_trainval.iloc[trainInd]
labels_train = labels[trainInd]
X_val = X_trainval.iloc[valInd,:]
y_val = y_trainval.iloc[valInd]

Linear Discriminant Analysis (LDA)
---

* LDA can be used as both a classification algorithm and a supervised dimensionality reduction method
* Tries to maximize the separation between classes
* Better for classification problems, but we can use it with binned continuous targets



In [25]:
# Can also use LDA (which can also be a classifier)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
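# The 5 bin edges give at most 4 classes, and LDA can produce at most
# n_classes - 1 = 3 components (recent scikit-learn raises an error beyond that)
lda = LDA(n_components=3)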
lda.fit(X_train, labels_train)
X_train = lda.transform(X_train)
X_val = lda.transform(X_val)
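
One detail worth knowing when using LDA for dimensionality reduction: in scikit-learn, the number of components can be at most `min(n_features, n_classes - 1)`. The sketch below shows this on synthetic data with four classes and seven features; the data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_classes, n_features = 4, 7
y_demo = np.repeat(np.arange(n_classes), 50)

# Synthetic features: each class gets its own randomly placed mean
class_means = rng.normal(scale=3.0, size=(n_classes, n_features))
X_demo = class_means[y_demo] + rng.normal(size=(200, n_features))

# Upper bound on the number of discriminant components
max_components = min(n_features, n_classes - 1)
lda_demo = LinearDiscriminantAnalysis(n_components=max_components)
X_proj = lda_demo.fit_transform(X_demo, y_demo)
print(X_proj.shape)
```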




Side-Note: Collinear Variables
---

* We got a warning about collinear variables
  - 2+ features vary together
* If we really want LDA, we should remove collinear variables
  - But really, are we sure we want to use a classification method for a regression problem?
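
If you do want to screen for problem columns, a rough sketch is shown below. It drops zero-variance columns (note that `Interview/Exam` has a standard deviation of 0.0 in the summary table, which may be one contributor to the warning) and then flags pairs of columns that are perfectly correlated. The `demo` frame and the 0.999 cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical frame: 'constant' never changes, 'double_a' is an exact multiple of 'a'
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                     'constant': [2.0, 2.0, 2.0, 2.0],
                     'double_a': [2.0, 4.0, 6.0, 8.0]})

# Drop zero-variance columns
vt = VarianceThreshold(threshold=0.0).fit(demo)
non_constant = demo.columns[vt.get_support()]

# Flag remaining pairs whose absolute correlation is numerically 1
corr = demo[non_constant].corr().abs()
collinear_pairs = [(c1, c2) for i, c1 in enumerate(corr.columns)
                   for c2 in corr.columns[i + 1:] if corr.loc[c1, c2] > 0.999]
print(list(non_constant), collinear_pairs)
```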

Train/Test/Validation Data Recap
===

* Never Train on Test (or Validation) Data!
  - Easy to do accidentally
    + Load data in twice
    + Split index mistake so an example is in both train and test
    + Doing any transforms or feature selection on the entire dataset before splitting
* Use validation data to select best model
  - Which algorithm to use
  - Which feature reduction technique
  - How many features to keep

Scikit-Learn has methods to automatically do the entire crossvalidation scoring for you, but I recommend not using them unless you aren't doing **any** feature processing or transforms.
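
If you do want the automated route while still applying transforms, one option is to wrap the steps in a scikit-learn `Pipeline`, which refits every step on the training portion of each fold. The sketch below is one possible arrangement on synthetic data, not the approach used in the rest of this notebook; the data, step names, and fold count are assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))   # synthetic features
y_demo = X_demo @ np.array([4.0, 0.0, 2.0, 0.0, 1.0]) + rng.normal(size=300)

# Scaling and feature selection are fitted inside each training fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', RFE(LinearRegression(), n_features_to_select=3)),
    ('regr', LinearRegression()),
])

# scikit-learn reports negated MSE so that larger is always better
scores = cross_val_score(pipe, X_demo, y_demo, cv=KFold(n_splits=5),
                         scoring='neg_mean_squared_error')
fold_mse = -scores
print(fold_mse.mean())
```

Writing the loop by hand, as in the earlier cells, keeps each step visible, which is useful while learning or debugging.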

Algorithm/Model Selection
---
* Highly dependent on data and problem you want to solve
* Use human knowledge to narrow down a range of possibilities
* Use model selection with validation for final decision


Algorithm/Model Selection Continued
---

* Classification vs regression vs clustering
  - Based on labels (or lack of labels)
* Explainable operation vs "black box"
  - Explainable: decision trees
  - Black box: support vector machines (SVMs), neural networks
* Speed of training and/or predictions
  - SVM predicts quickly, decision trees more slowly
  - Neural networks train slowly

Algorithm/Model Selection Continued
---

* How complex is your data?
  - SVMs and neural networks can handle complex/nonlinear classification
* Are there relationships among features?
  - Some neural networks can take relationships between features into account
    + e.g. Convolutional networks and neighboring pixels in an image
  - Some networks have memories
    + e.g. Recurrent neural networks

Example: Compare 4 Models
---

* Recursive Feature Elimination (RFE) vs Principal Component Analysis (PCA)
* 2 vs 3 features

In [26]:
bins = np.linspace(np.min(y_trainval),np.max(y_trainval)+0.1, 5)
labels = np.digitize(y_trainval, bins)

# RFE 3 features
rfe3mse = []
kf = StratifiedKFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval, labels):
    # Get training and validation folds
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Scale the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_val = scaler.transform(X_val)
    
    # Select the features
    regr = linear_model.LinearRegression()
    selector = RFE(regr, n_features_to_select=3)
    selector.fit(X_train, y_train)
    X_train = selector.transform(X_train)
    X_val = selector.transform(X_val)
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    rfe3mse.append(mse(y_val,pred))

In [27]:
# RFE 2 features
rfe2mse = []
kf = StratifiedKFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval, labels):
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Scale the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_val = scaler.transform(X_val)
    
    # Select the features
    regr = linear_model.LinearRegression()
    selector = RFE(regr, n_features_to_select=2)
    selector.fit(X_train, y_train)
    X_train = selector.transform(X_train)
    X_val = selector.transform(X_val)
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    rfe2mse.append(mse(y_val,pred))
    
    
# PCA 3 features
pca3mse = []
kf = StratifiedKFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval, labels):
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Scale the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_val = scaler.transform(X_val)
    
    # Project onto the principal components (PCA is unsupervised, so y is not needed)
    pca = PCA(n_components=3)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_val = pca.transform(X_val)
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    pca3mse.append(mse(y_val,pred))
    
    
# PCA 2 features
pca2mse = []
kf = StratifiedKFold(n_splits=10)  # random_state omitted: it requires shuffle=True
for trainInd, valInd in kf.split(X_trainval, labels):
    # Get training and validation folds
    X_train = X_trainval.iloc[trainInd,:]
    y_train = y_trainval.iloc[trainInd]
    X_val = X_trainval.iloc[valInd,:]
    y_val = y_trainval.iloc[valInd]
    
    # Scale the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_val = scaler.transform(X_val)
    
    # Project onto the principal components (PCA is unsupervised, so y is not needed)
    pca = PCA(n_components=2)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_val = pca.transform(X_val)
    
    # Train the regressor and make predictions
    regr.fit(X_train, y_train)
    pred = regr.predict(X_val)
    pca2mse.append(mse(y_val,pred))
    


Validation Results
---

Best performance by RFE with 3 features

In [28]:
print('RFE 3 MSE: %s' % np.mean(rfe3mse))
print('RFE 2 MSE: %s' % np.mean(rfe2mse))
print('PCA 3 MSE: %s' % np.mean(pca3mse))
print('PCA 2 MSE: %s' % np.mean(pca2mse))

RFE 3 MSE: 45.5156398712
RFE 2 MSE: 47.1810928949
PCA 3 MSE: 48.6948873689
PCA 2 MSE: 48.8546099501


Test Results
---

* Retrain the best model on the combined training and validation data and evaluate on the test set

In [29]:
# Scale the data
scaler = StandardScaler()
scaler.fit(X_trainval)
X_trainval = scaler.transform(X_trainval)
X_test = scaler.transform(X_test)

# Select the features
regr = linear_model.LinearRegression()
selector = RFE(regr, n_features_to_select=3)
selector.fit(X_trainval, y_trainval)
X_trainval = selector.transform(X_trainval)
X_test = selector.transform(X_test)

# Train the regressor and make predictions
regr.fit(X_trainval, y_trainval)
pred = regr.predict(X_test)
testmse = mse(y_test,pred)

print('Validation MSE: %s' % np.mean(rfe3mse))
print('Test MSE: %s' % np.mean(testmse))

Validation MSE: 45.5156398712
Test MSE: 45.6205589445
