Loading the data <br>
Let's use the data snippet from last time to load our data back in. I'll also fix the random seed so the notebook is deterministic.



In [None]:
import pandas as pd
import numpy as np
 
np.random.seed(8675309)
 
eider.s3.download('s3://eider-datasets/mlu/DATA_Training.csv','/tmp/DATA_Training.csv')
eider.s3.download('s3://eider-datasets/mlu/DATA_Public_Test.csv','/tmp/DATA_Public_Test.csv')
train = pd.read_csv('/tmp/DATA_Training.csv', na_values = 'null')
public_test = pd.read_csv('/tmp/DATA_Public_Test.csv', na_values = 'null')


Question 0<br>
Last week we did some data cleaning and feature engineering. It is a good idea to start from that point this week, so begin by bringing over whatever set of cleaned and engineered features you decided upon last week. We will build from there. Note: please leave off any feature scaling for the moment, as we will include it in the next question for practice.

In [None]:
### ANSWER 0 ###
 
#####  Start deal with missing values  #####
# For each score column, use pandas fillna to replace missing values with the column mean.
# The means come from the training set in both cases, so no test-set statistics leak in.
score_columns = ['score1','score2','score3','score4','score5']
train.fillna(train[score_columns].mean(), inplace = True)
public_test.fillna(train[score_columns].mean(), inplace = True)
 
# Deal with missing value for CIL and contact_type
cil_columns_values = {'CIL1' : 'Not Specified', 'CIL2' : 'Not Specified', 'CIL3' : 'Not Specified', 'CLI4' : 'Not Specified', 'IL1' : 'Not Specified', 'IL2' : 'Not Specified', 'IL3' : 'Not Specified', 'IL4' : 'Not Specified'}
contact_type_value = {'contact_type' : 0.0}
# FOR CIL1-CIL4, AND IL1-IL4, fill in with 'Not Specified'
train.fillna(cil_columns_values, inplace = True) 
public_test.fillna(cil_columns_values, inplace = True) 
# FOR contact_type, fill in with most common value 0.0
train.fillna(contact_type_value, inplace = True) 
public_test.fillna(contact_type_value, inplace = True) 
 
#####  End of deal with missing values  #####
 
#####  Start categorical OneHot encoding  #####
issue_features = ['CIL1', 'CIL2', 'CIL3', 'CLI4', 'IL1', 'IL2', 'IL3', 'IL4','device']
# Try to collect all possible values of each column
for feature in issue_features:
    unique_elements = pd.concat([train[feature], public_test[feature]]).unique().tolist()
    train[feature] = train[feature].astype('category').cat.set_categories(unique_elements)
    public_test[feature] = public_test[feature].astype('category').cat.set_categories(unique_elements)
public_test = pd.get_dummies(public_test)
train = pd.get_dummies(train)
#####  End categorical OneHot encoding  #####
 
train.head(5)
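
The category-alignment step above is easy to get wrong, so here is a minimal sketch of the same idea on a toy column. The frames `toy_train` and `toy_test` and their values are made up for illustration and are not part of our dataset. The point is that once both frames share one category list, `pd.get_dummies` emits the same columns for each, even for values a frame never contains.

```python
import pandas as pd

# Hypothetical toy frames: 'tablet' appears only in train, 'laptop' only in test.
toy_train = pd.DataFrame({"device": ["phone", "tablet", "phone"]})
toy_test = pd.DataFrame({"device": ["laptop", "phone"]})

# Collect every value seen in either frame, in order of first appearance.
categories = pd.concat([toy_train["device"], toy_test["device"]]).unique().tolist()

# Give both columns the same category list before encoding.
for frame in (toy_train, toy_test):
    frame["device"] = frame["device"].astype("category").cat.set_categories(categories)

toy_train_encoded = pd.get_dummies(toy_train)
toy_test_encoded = pd.get_dummies(toy_test)

print(list(toy_train_encoded.columns))
print(list(toy_test_encoded.columns))
```

Without the shared category list, the two encoded frames would have different columns and a model fit on one could not be applied to the other.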

Question 1 <br>
Let's combine a few techniques together into a single pipeline for tuning later. For this question we will eventually use a DecisionTreeClassifier to produce our output. As you may recall from our discussion of decision trees, each split makes a decision based on a single variable. However, it is quite reasonable to want to make decisions based on combinations of features. To do so, we can transform our features using a PCA first in an attempt to offer our algorithm decorrelated features to work from.
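
To make "decorrelated" concrete, here is a small sketch on synthetic data (the two features below are invented for illustration, not taken from our dataset). It builds two strongly correlated columns and checks their correlation before and after a PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second feature is mostly a copy of the first.
rng = np.random.RandomState(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)
X_corr = np.column_stack([x1, x2])

# Rotate onto the principal axes.
X_rotated = PCA().fit_transform(X_corr)

corr_before = np.corrcoef(X_corr, rowvar=False)[0, 1]
corr_after = np.corrcoef(X_rotated, rowvar=False)[0, 1]
print(f"correlation before PCA: {corr_before:.3f}")
print(f"correlation after PCA:  {corr_after:.2e}")
```

Each principal component is a linear combination of the original features, so a single-variable split on a component is effectively a split on a combination of the inputs.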

To this end, use Pipeline to take in your data, apply the StandardScaler, transform the data using a PCA, and finally feed the result into a DecisionTreeClassifier. We'll get to grid searching later, but for right now, run cross_val_score with this pipeline as the classifier to see how it does.
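
Before wiring the scaler into a pipeline, it can help to check what it actually does. The sketch below uses a tiny made-up array; it compares the default `StandardScaler` with one where both `with_mean` and `with_std` are switched off, which leaves the data unchanged.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up array with two features on very different scales.
X_small = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Default settings: subtract the column mean and divide by the column std.
X_scaled = StandardScaler().fit_transform(X_small)

# Both flags off: the transform becomes a pass-through.
X_untouched = StandardScaler(with_mean=False, with_std=False).fit_transform(X_small)

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
print(np.array_equal(X_untouched, X_small))
```

Since PCA picks directions by variance, whether the scaler really rescales the columns can change which components come out on top.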

Note: when experimenting on your own, you might want to transform your features in many different ways. To do so you should examine FeatureUnion as a companion to pipeline that allows for parallel computation, not serial computation. If you want to see an example of it being used, take a look here.
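
As a rough illustration of the "parallel" behaviour, the sketch below feeds random data (invented here, not our dataset) through a FeatureUnion of two transformers. Each transformer sees the same input, and their outputs are stacked side by side, so the column counts add up.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

# Random stand-in data: 50 samples, 6 features.
rng = np.random.RandomState(0)
X_random = rng.normal(size=(50, 6))

demo_union = FeatureUnion([
    ("linear_pca", PCA(n_components=2)),
    ("kernel_pca", KernelPCA(n_components=3, kernel="rbf")),
])

# 2 columns from PCA followed by 3 columns from KernelPCA.
X_union = demo_union.fit_transform(X_random)
print(X_union.shape)
```

A Pipeline, by contrast, would pass the output of the first transformer into the second.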

In [None]:
	
### ANSWER 1 ###
# Using Pipeline to help us keep our code organized.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
 
# Split train datasets.
X_train, X_test, y_train, y_test = train_test_split(train.drop('response',axis=1), 
                                                    train['response'], test_size=0.20, 
                                                    random_state=101)
 
# Build a Pipeline with StandardScaler, PCA and DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
scaler = StandardScaler()  # center each feature to zero mean and scale to unit variance
pca = PCA()
print("-----------------------------------Build a Pipeline with StandardScaler, PCA and DecisionTreeClassifier---------------------------------------------- ")
pipe = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('clf', clf)])
print("Create a Pipeline, apply the StandardScaler, transform the data using a PCA, and finally try DecisionTreeClassifier")
print("Let's run cross_val_score with this pipeline and see the score: ")
print(cross_val_score(pipe, X_train, y_train, cv=5))
print("-----------------------------------End---------------------------------------------- ")
 
# Transform our features in different ways by using FeatureUnion
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import KernelPCA
transformers = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(transformers)
pipeline_with_featureUnion = Pipeline([
    # Apply the StandardScaler
    ('scaler', scaler),
    
    # Use FeatureUnion to combine the features from PCA and KernelPCA
    ('union', combined),
 
    # Use DecisionTreeClassifier on the combined features
    ('clf', clf),
])
print("-----------------------------------Experiment: Pipeline with FeatureUnion---------------------------------------------- ")
print("Create a Pipeline with FeatureUnion, apply the StandardScaler, transform the data using a PCA and KernelPCA, and finally try DecisionTreeClassifier")
print("Let's run cross_val_score with this pipeline with FeatureUnion and see the score: ")
pipeline_with_featureUnion.set_params(union__linear_pca__n_components=10, union__kernel_pca__n_components=10)
print(cross_val_score(pipeline_with_featureUnion, X_train, y_train, cv=5))
print("-----------------------------------End---------------------------------------------- ")

Now we'll try to tune this a bit. The first thing you might notice is that the PCA makes the above fit rather slow. We can counteract that by restricting the number of principal components it computes using n_components. This can also help counteract overfitting. However, to select the best possible values, we'd like to do a grid search. Thus the question is, how do you grid search over a parameter buried inside a pipeline?
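
One way to see how nested parameters are addressed is to list them. The sketch below builds a throwaway pipeline (named `demo_pipe` so it does not clash with anything else in the notebook) and shows that each step's parameters are exposed as `<step name>__<parameter>`.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# Every nested parameter shows up under a double-underscore name.
nested_names = sorted(name for name in demo_pipe.get_params() if "__" in name)
print(nested_names)

# The same names work with set_params (and as keys in a parameter grid).
demo_pipe.set_params(pca__n_components=3, clf__class_weight={1: 4})
print(demo_pipe.named_steps["pca"].n_components)
```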
<br>
Question 2 <br>
Take a look at this example. Here we see a similar process being done with GridSearchCV and a pipeline. Perform a similar sweep with n_components in the range {5,10,20} and class_weight associated to class one from {1,4,16,64} for our dataset. Don't forget to set the scoring method to 'f1'. From the fit model, you can get the best score using best_score_, the best model using best_estimator_, and the best parameters using best_params_.
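
If you want to see the mechanics before running the full sweep, here is a minimal sketch on synthetic data from `make_classification`. The data, the smaller grid, and the `demo_` names are all assumptions made for illustration; only the attribute names carry over to the real exercise.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for our data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5,
    weights=[0.8, 0.2], random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# A deliberately tiny grid so the sketch runs quickly.
demo_grid = {
    "pca__n_components": [2, 4],
    "clf__class_weight": [{1: 1}, {1: 4}],
}
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=3, scoring="f1")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)
print(type(demo_search.best_estimator_).__name__)
```

The fitted search object holds the refit pipeline in `best_estimator_`, so it can be used directly for prediction.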

In [None]:
	
### ANSWER 2 ###
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
 
# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
    'pca__n_components': [5,10,20],
    'clf__class_weight': [{1:1},{1:4}, {1:16},{1:64}]
}
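# The iid argument was removed from GridSearchCV in recent scikit-learn releases; leaving it out averages scores across folds.
search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')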
search.fit(X_train, y_train)
 
print("Best Parameter {}:".format(search.best_params_))
print("Best Score {}:".format(search.best_score_))
print("Best Model {}:".format(search.best_estimator_))

Probably not as good as the model you finally arrived at last week, but at least it shows you how you can chain multiple techniques together and easily search over many possible values for the various parameters in each stage.

Since this model starts with a PCA, it is no longer possible to interpret the features. They have become essentially arbitrary linear combinations of the dataset. However, we do know that the top principal components correspond to the directions of maximum variance in the data, so one could hope that these are the most important features for our data.
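
The ordering by variance can be checked directly from a fitted PCA. The sketch below uses synthetic data (an assumption for illustration) and reads `explained_variance_ratio_`, which lists the share of total variance captured by each component, largest first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in data; the labels are not needed for PCA.
X_var, _ = make_classification(n_samples=200, n_features=8, random_state=0)

variance_ratios = PCA().fit(X_var).explained_variance_ratio_
print(np.round(variance_ratios, 3))
```

Note that nothing in this ranking looks at the response, which is why "most variance" and "most useful for prediction" can come apart.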
<br>
Question 3 <br>
After the grid search above finishes, you are given a trained model that contains a decision tree as the final step. Plot the feature importances (.feature_importances_ of the decision tree in the best estimator). Are the importances of the early features (the highest variance PCs) larger than the later ones? If not, what does that tell us?

In [None]:
	
### ANSWER 3 ###
import numpy as np
import matplotlib.pyplot as plt
 
importances = search.best_estimator_.named_steps.clf.feature_importances_
# One bar per principal component kept by the best estimator (5, 10 or 20).
plt.bar(np.arange(len(importances)), importances, width=0.8, align='center')
plt.show()

Answer 3 <br>
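ARE THE IMPORTANCES OF THE EARLY FEATURES (THE HIGHEST VARIANCE PCS) LARGER THAN THE LATER ONES? IF NOT, WHAT DOES THAT TELL US?
The importances of the early features (the highest variance PCs) are smaller than those of the later ones.
This tells us that the later PCs are more important to our model: PCA is unsupervised and ranks directions by variance alone, so a high-variance direction is not necessarily predictive of the response.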
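
A toy case makes this easy to reproduce. In the sketch below (synthetic data, invented for illustration, and deliberately left unscaled so the variances differ), the label depends only on a low-variance feature while a high-variance feature is pure noise. The tree then assigns most of its importance to the second principal component rather than the first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
noise = rng.normal(scale=10.0, size=500)   # large variance, unrelated to the label
signal = rng.normal(scale=1.0, size=500)   # small variance, determines the label
X_toy = np.column_stack([noise, signal])
y_toy = (signal > 0).astype(int)

# PC1 follows the noisy direction, PC2 follows the informative one.
toy_pcs = PCA().fit_transform(X_toy)
toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_pcs, y_toy)

print(toy_tree.feature_importances_)
```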

Question 4 <br>
Finally, let's look into the learning curve of our best-fit model (a nice example here). Set the parameters of your pipeline to the ones identified by the cross-validation. What can you say about our model?

In [None]:
### Answer 4 ###
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
 
 
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.
 
    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
 
    title : string
        Title for the chart.
 
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
 
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
 
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.
 
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.
 
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
 
        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.
 
    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
 
    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
 
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
 
    plt.legend(loc="best")
    return plt
 
title = "Learning Curves of our best_estimator"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = search.best_estimator_
# n_jobs=-1 uses all available processors.
plot_learning_curve(estimator, title, X_train, y_train, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)
plt.show()

Answer 4 <br>
WHAT CAN YOU SAY ABOUT OUR MODEL?
We can see clearly that the training score stays near its maximum, while the validation score does not improve as more training samples are added.
This persistent gap between the two curves suggests that our model is overfitting.
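
The gap can also be read off numerically instead of from the plot. The sketch below is an illustration on synthetic data with an unrestricted decision tree (both are assumptions, not our tuned pipeline). It calls `learning_curve` directly and prints the difference between the mean training and validation scores at each training size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X_curve, y_curve = make_classification(n_samples=300, n_features=10, random_state=0)

# Rows correspond to training sizes, columns to cross-validation folds.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X_curve, y_curve,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

score_gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for n_samples, gap in zip(sizes, score_gap):
    print(f"{n_samples:4d} samples: train - validation = {gap:.3f}")
```

A fully grown tree memorizes its training folds, so the training score sits at 1.0 and any remaining gap reflects how poorly that memorization generalizes.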

Question 5 <br>
As last week, now take the time to try things out! Being able to use Pipeline (and FeatureUnion if needed), along with GridSearchCV, should remove much of the boilerplate code. Try different ways of encoding features, methods of dimension reduction, or learning algorithms. In particular, don't feel constrained to just those topics we've covered in detail! Poke around and just try things!
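
One trick that may save some copy-and-paste when comparing learning algorithms: a pipeline step can itself be a grid-search parameter. The sketch below, on synthetic data with illustrative `demo_` names and small grids chosen for speed, passes a list of grids so that each candidate estimator is searched over its own hyperparameters in a single GridSearchCV run.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_swap, y_swap = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0)

swap_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", DecisionTreeClassifier(random_state=0)),  # placeholder, replaced by the grid
])

# Each dict swaps in one estimator for the 'clf' step and tunes its own parameters.
swap_grid = [
    {"clf": [DecisionTreeClassifier(random_state=0)], "clf__max_depth": [3, None]},
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0]},
]
swap_search = GridSearchCV(swap_pipe, swap_grid, cv=3, scoring="f1")
swap_search.fit(X_swap, y_swap)

print(swap_search.best_params_)
```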

In [None]:
### ANSWER 5 ###
 
# Using Pipeline to help us keep our code organized.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
 
# Split train datasets.
X_train, X_test, y_train, y_test = train_test_split(train.drop('response',axis=1), 
                                                    train['response'], test_size=0.10, 
                                                    random_state=101)
print("-----------------------------------Experiment 1: LogisticRegression with PCA ---------------------------------------------- ")
print("Create a Pipeline with FeatureUnion, apply the StandardScaler, transform the data using a PCA, and finally try LogisticRegression")
print("Run GridSearchCV on this pipeline to search the best parameters")
# Build a Pipeline with StandardScaler, PCA and LogisticRegression
clf = LogisticRegression(random_state=0)
scaler = StandardScaler()  # center each feature to zero mean and scale to unit variance
# Transform our features in different ways by using FeatureUnion
transformers = [('pca', PCA())]
combined = FeatureUnion(transformers)
pipeline_with_featureUnion = Pipeline([
    # Apply the StandardScaler
    ('scaler', scaler),
    
    # FeatureUnion wraps a single PCA transformer here
    ('union', combined),
 
    # Use LogisticRegression on the transformed features
    ('clf', clf),
])
# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
    'union__pca__n_components': [10],
    'clf__C':np.logspace(-3,3,7),
    'clf__penalty':["l2"],
}
# The iid argument was removed from GridSearchCV in recent scikit-learn releases.
search = GridSearchCV(pipeline_with_featureUnion, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)
print("Best Parameter {}:".format(search.best_params_))
print("Best Score {}:".format(search.best_score_))
print("Best Model {}:".format(search.best_estimator_))
print("-----------------------------------End---------------------------------------------- ")

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
 
# Split train datasets.
X_train, X_test, y_train, y_test = train_test_split(train.drop('response',axis=1), 
                                                    train['response'], test_size=0.10, 
                                                    random_state=101)
 
 
print("-----------------------------------Experiment 2: SVC with PCA ---------------------------------------------- ")
print("Create a Pipeline with FeatureUnion, apply the StandardScaler, transform the data using a PCA, and finally try SVC")
print("Run GridSearchCV on this pipeline to search the best parameters")
# Build a Pipeline with StandardScaler, PCA and SVC
clf = SVC()
scaler = StandardScaler()  # center each feature to zero mean and scale to unit variance
# Transform our features in different ways by using FeatureUnion
transformers = [('pca', PCA())]
combined = FeatureUnion(transformers)
pipeline_with_featureUnion = Pipeline([
    # Apply the StandardScaler
    ('scaler', scaler),
    
    # FeatureUnion wraps a single PCA transformer here
    ('union', combined),
 
    # Use SVC on the transformed features
    ('clf', clf),
])
# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
    'union__pca__n_components': [10],
    'clf__kernel': ['rbf'],
    'clf__gamma': [1e-3, 1e-4],
    'clf__C': [1, 10, 100],
}
# The iid argument was removed from GridSearchCV in recent scikit-learn releases.
search = GridSearchCV(pipeline_with_featureUnion, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)
print("Best Parameter {}:".format(search.best_params_))
print("Best Score {}:".format(search.best_score_))
print("Best Model {}:".format(search.best_estimator_))
print("-----------------------------------End---------------------------------------------- ")

In [None]:
	
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
 
print("-----------------------------------Experiment 3: DecisionTreeClassifier with KernelPCA ---------------------------------------------- ")
print("Create a Pipeline with FeatureUnion, apply the StandardScaler, transform the data using KernelPCA, and finally try DecisionTreeClassifier")
print("Run GridSearchCV on this pipeline to search the best parameters")
# Build a Pipeline with StandardScaler, KernelPCA and DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
scaler = StandardScaler()  # center each feature to zero mean and scale to unit variance
# Transform our features in different ways by using FeatureUnion
transformers = [('kernel_pca', KernelPCA())]
combined = FeatureUnion(transformers)
pipeline_with_featureUnion = Pipeline([
    # Apply the StandardScaler
    ('scaler', scaler),
    
    # FeatureUnion wraps a single KernelPCA transformer here
    ('union', combined),
 
    # Use DecisionTreeClassifier on the transformed features
    ('clf', clf),
])
# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
    'union__kernel_pca__n_components': [5,10],
    'clf__class_weight': [{1:1},{1:4}, {1:16},{1:64}]
}
# The iid argument was removed from GridSearchCV in recent scikit-learn releases.
search = GridSearchCV(pipeline_with_featureUnion, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)
print("Best Parameter {}:".format(search.best_params_))
print("Best Score {}:".format(search.best_score_))
print("Best Model {}:".format(search.best_estimator_))
print("-----------------------------------End---------------------------------------------- ")
