In [320]:
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

from operator import itemgetter
import pandas as pd
import numpy as np
from scipy import stats
import scipy
import matplotlib.pyplot as plt

# Chapter 1 and 2 Exercises

## Chapter 1 

1. How would you define Machine Learning?   
The process of programming a computer to execute a process with an outcome where through data the computer learns and performs the process with the desired outcome in a more effective manner relative to some performance metric. 

2. Can you name four types of problems where it shines?   
Where is no closed form solution, adaptive programs rather than rule based, reactive programs that adjust for the environment, help gain insights into data for business decisions.

3. What is a labeled training set?    
A training set where the correct outcome is already specified.  

4. What are the two most common supervised tasks?   
Classification and Regression

5. Can you name four common unsupervised tasks?   
Clustering, anomaly detection, dimensionality reduction, association rule learning.

6. What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?   
Reinforcement learning as the terrain may be of any kind and the robot will need to adapt.

7. What type of algorithm would you use to segment your customers into multiple groups?   
Clustering if one wants to only group customers or a classification algorithm the groups are known 

8. Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?   
This would be a supervised learning problem.

9. What is an online learning system?   
A ml model that receives streaming data and learns from data incrementally and not in batch. Can change quickly to incoming data.

10. What is out-of-core learning?   
ML program that is online in order to train on data that cannot fit into memory

11. What type of learning algorithm relies on a similarity measure to make predictions?   
Instance based learning system contains the data known and then compares new instances to the known based upon a similarity metric to make predictions. Examples are k-nearest neighbor and decision trees.

12. What is the difference between a model parameter and a learning algorithm’s hyperparameter?   
A model parameter represents a feature/paramter of the actual model that is used for prediction based upon new data whereas the hyperparameter determines how the model is trained based upon the algorithm being used. 

13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?   
Search for optimal values for the model parameters that will enable the function to predict new values well. Values are found by approaches that minimize a cost function and potentially apply penalties for complex models. Predictions for new values are made by applying model to new data.  

14. Can you name four of the main challenges in Machine Learning?   
Relative to data: lack of data, messy data, unrepresentative data; uninformative features; underfitting data due to overly simple model; overfitting data due to overly complex model

15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?   
This indicates overfitting the model to the training data. Solutions can be: 1. Applying regularization, 2. Employing a simplier model, 3. Get more data for training and/or reducing the noise in the data.

16. What is a test set, and why would you want to use it?   
A test is a set that is held out of the training set that is used for testing the model is not being overfit. 

17. What is the purpose of a validation set?   
To test different models and hyperparameter tuning. 

18. What is the train-dev set, when do you need it, and how do you use it?   
When there is a suspicion that there is a difference between the training data and the validation and testing data (i.e. the data that will be representative of live, production data) separate out from the training data a train-dev set. Train and test on validation and train-dev set. If model performs poorly on train-dev set, then overfitting is probably occuring. If performs well on train-dev set but not well on validation set, then there is a likely data mismatch between the training data and the representative testing and validation data. Need to make the training data more representative. 

19. What can go wrong if you tune hyperparameters using the test set?   
The model can overfit the data and not generalize well to unknown data. It will most likely show promising results on the test set that are too promising. 

## Chapter 2

1.Try a Support Vector Machine regressor (sklearn.svm.SVR) with various hyperparameters, such as kernel="linear" (with various values for the C hyperparameter) or kernel="rbf" (with various values for the C and gamma hyperparameters). Don’t worry about what these hyperparameters mean for now. How does the best SVR predictor perform?

In [192]:
hs_data = '/home/jonathan/Projects/LearningReferences/handson-ml2/datasets/housing/housing.csv'

In [193]:
housing_df = pd.read_csv(hs_data)

In [194]:
housing_df.head(1)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY


In [200]:
# Get data
housing_df = pd.read_csv(hs_data)

# Create income category for sampling 
housing_df['income_cat'] = pd.cut(housing_df.median_income, bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf], labels=range(1,6))

# Get train/test sample that is representative of income
strat_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=43)
train_idx, test_idx = next(strat_split.split(housing_df, housing_df.income_cat))
hs_train_df = housing_df.loc[train_idx]
hs_test_df = housing_df.loc[test_idx]

# drop income_cat to put data back to original form
hs_train_df.drop(columns='income_cat', inplace=True)
hs_test_df.drop(columns='income_cat', inplace=True)

hs_train_labels_df = hs_train_df.median_house_value.copy()
hs_train_df.drop(columns='median_house_value', inplace=True)

# Create custom features from existing data - can also have this as custom transformer to be used in pipeline
hs_train_df["rooms_per_household"] = hs_train_df["total_rooms"]/hs_train_df["households"]
hs_train_df["bedrooms_per_room"] = hs_train_df["total_bedrooms"]/hs_train_df["total_rooms"]
hs_train_df["population_per_household"]=hs_train_df["population"]/hs_train_df["households"]

# Replace nan values with median
housing_numeric = hs_train_df.drop('ocean_proximity', axis=1)

# Deal with categorical data
housing_cat = hs_train_df.loc[:, ['ocean_proximity']]

In [201]:
# Setup pipeline
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='median')),
                         ('std_scaler', StandardScaler())
                        ])

preprocessing_pipeline = ColumnTransformer([('num', num_pipeline, list(housing_numeric)),
                                            ('cat', OneHotEncoder(), list(housing_cat))
                                           ])

model_pipeline = Pipeline([('preproc', preprocessing_pipeline),
                           ('svm', SVR())
                          ])

In [98]:
# Setup CrossValidation
param_grid = [{'svm__kernel': ['linear','rbf', 'sigmoid'],
               'svm__C': [.1, .5, 1],
               'svm__gamma': ['scale', 'auto']},
              {'svm__kernel': ['poly'],
               'svm__C': [.1, .5, 1],
               'svm__gamma': ['scale', 'auto'],
               'svm__degree': [2, 3, 5]}]
grid_cv= GridSearchCV(model_pipeline, param_grid, cv=3, return_train_score=True, scoring='neg_mean_squared_error')

In [32]:
grid_cv.fit(hs_train_df, hs_train_labels_df);

In [33]:
cv_res = grid_cv.cv_results_
cv_res.keys()
cv_param_scrs = [(np.sqrt(-score), params) for params, score in zip(cv_res['params'], cv_res['mean_test_score'])]
cv_params_scrs = sorted(cv_param_scrs, key=itemgetter(0))
print(f'Best Model: {cv_param_scrs[0]}')
print(f'Worst Model: {cv_param_scrs[-1]}')

Best Model: (116388.94596328198, {'svm__C': 0.1, 'svm__degree': 2, 'svm__gamma': 'scale', 'svm__kernel': 'linear'})
Worst Model: (117293.24626128966, {'svm__C': 1, 'svm__degree': 5, 'svm__gamma': 'auto', 'svm__kernel': 'sigmoid'})


2. Try replacing GridSearchCV with RandomizedSearchCV.


In [61]:
param_grid = {'svm__kernel': ['linear', 'poly'],
              'svm__C': stats.expon(),
              'svm__gamma': ['scale'],
              'svm__epsilon': stats.lognorm(1)
             }
rand_cv= RandomizedSearchCV(model_pipeline, param_grid, cv=3, return_train_score=True, scoring='neg_mean_squared_error')

In [62]:
rand_cv.fit(hs_train_df, hs_train_labels_df);

In [63]:
rcv_res = rand_cv.cv_results_
rcv_res.keys()
rcv_param_scrs = [(np.sqrt(-score), params) for params, score in zip(rcv_res['params'], rcv_res['mean_test_score'])]
rcv_params_scrs = sorted(rcv_param_scrs, key=itemgetter(0))
print(f'Best Model: {rcv_param_scrs[0]}')
print(f'Worst Model: {rcv_param_scrs[-1]}')

Best Model: (108948.09944728023, {'svm__C': 0.5187550357055456, 'svm__epsilon': 1.6838856168104908, 'svm__gamma': 'scale', 'svm__kernel': 'linear'})
Worst Model: (116555.2635423261, {'svm__C': 0.09108922219234733, 'svm__epsilon': 0.38185738940253505, 'svm__gamma': 'scale', 'svm__kernel': 'linear'})


3. Try adding a transformer in the preparation pipeline to select only the most important attributes.


In [400]:
from sklearn.base import BaseEstimator, TransformerMixin

# Choose highest correlation features
import copy
X_clone = dict()
class ChooseHighestCorrAttributes(BaseEstimator, TransformerMixin):
    def __init__(self, top_n_feats=5, is_cat=False):
        self.is_cat = is_cat
        self.top_n_feats = top_n_feats
        
    def fit(self, X, y=None):
        num_feats = self.top_n_feats - 1
        if self.is_cat:
            corrs = np.abs([stats.pearsonr(np.squeeze(np.asarray(v)), y)[0] for v in X.T.todense()])
        else:
            corrs = np.abs([stats.spearmanr(v, y).correlation for v in X.T])
        inds = np.argpartition(corrs, -num_feats)
        self.top_feat_inds = inds[-num_feats:]
        return self

    def transform(self, X):
        return X[:, self.top_feat_inds]
        

In [401]:
# Setup pipeline
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='median')),
                         ('std_scaler', StandardScaler()),
                         ('top_corr_feats', ChooseHighestCorrAttributes())
                        ])

cat_pipeline = Pipeline([('encode', OneHotEncoder()),
                         ('top_corr_cfeats', ChooseHighestCorrAttributes(is_cat=True))
                        ])

preprocessing_pipeline = ColumnTransformer([('num', num_pipeline, list(housing_numeric)),
                                            ('cat', cat_pipeline, list(housing_cat))
                                           ])

model_pipeline = Pipeline([('preproc', preprocessing_pipeline),
                           ('svm', SVR())
                          ])

In [402]:
# Setup CrossValidation
param_grid = [{'svm__kernel': ['linear','rbf', 'sigmoid'],
               'svm__C': [.1, .5, 1],
               'svm__gamma': ['scale', 'auto']},
              {'svm__kernel': ['poly'],
               'svm__C': [.1, .5, 1],
               'svm__gamma': ['scale', 'auto'],
               'svm__degree': [2, 3, 5]}]
grid_cv = GridSearchCV(model_pipeline, param_grid, cv=3, return_train_score=True, scoring='neg_mean_squared_error')

In [403]:
grid_cv.fit(hs_train_df, hs_train_labels_df);

In [404]:
cv_res = grid_cv.cv_results_
cv_res.keys()
cv_param_scrs = [(np.sqrt(-score), params) for params, score in zip(cv_res['params'], cv_res['mean_test_score'])]
cv_params_scrs = sorted(cv_param_scrs, key=itemgetter(0))
print(f'Best Model: {cv_param_scrs[0]}')
print(f'Worst Model: {cv_param_scrs[-1]}')

Best Model: (117686.99910567568, {'svm__C': 0.1, 'svm__gamma': 'scale', 'svm__kernel': 'linear'})
Worst Model: (10991919.088169735, {'svm__C': 1, 'svm__degree': 5, 'svm__gamma': 'auto', 'svm__kernel': 'poly'})
