# Expanding your machine learning toolkit: randomized search,  SVMs and regularized logistic regression

## Introduction

Previously, we wrote about some [common trade-offs](http://blog.cambridgecoding.com/2016/03/24/misleading-modelling-overfitting-cross-validation-and-the-bias-variance-trade-off/) in machine learning and the importance of [tuning models](http://blog.cambridgecoding.com/2016/04/03/scanning-hyperspace-how-to-tune-machine-learning-models/) to your specific dataset. We demonstrated tuning a **random forest** classifier using grid search, and how **cross-validation** can help avoid **overfitting** when tuning **hyperparameters**.

In this short follow-up post, we introduce a different strategy for traversing hyperparameter space - **randomized search**. We also demonstrate the process of tuning and training two other classification algorithms - a **support vector machine** and a **logistic regression classifier**.

![Algorithms we'll use in this tutorial.](hyperparam_algos_logistic.png)

We'll keep working with the [wine dataset](https://archive.ics.uci.edu/ml/datasets/Wine), which contains chemical characteristics of wines of varying quality. As before, our goal is to try to predict a wine's quality form these features.

Here are the things we'll cover in this blog post:

![Tutorial overview.](hyperparam_intro_logistic.png)

In the next blog post, you will learn how to take these three different tuned machine learning algorithms and combine them to build an aggregate **model ensemble**.

## Loading and train/test splitting the dataset

You start off by collecting the dataset. We have covered the data loading, preprocessing conversion and train/test split [previously](http://blog.cambridgecoding.com/2016/03/24/misleading-modelling-overfitting-cross-validation-and-the-bias-variance-trade-off/), so we won't repeat ourselves here. Also check out [this post](http://blog.cambridgecoding.com/2016/02/07/eda-and-interactive-figures-with-plotly/) on using plotly to create exploratory, interactive graphics of the wine dataset features. 

You can fetch and format the data as follows:

In [3]:
import wget
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split

# Import the dataset
data_url = 'https://raw.githubusercontent.com/nslatysheva/data_science_blogging/master/datasets/wine/winequality-red.csv'
dataset = wget.download(data_url)
dataset = pd.read_csv(dataset, sep=";")

# Using a lambda function to bin quality scores
dataset['quality_is_high'] = dataset.quality.apply(lambda x: 1 if x >= 6 else 0)

# Convert the dataframe to a numpy array and split the
# data into an input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-2].astype(float)
y = npArray[:,-1]

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

## Mixing things up: introducing randomized search

You have already built a random forest classifier, tuned using grid search, to predict wine quality ([here](http://localhost:8888/notebooks/polished_prediction/scanning_hyperspace.ipynb#)). Grid search exhaustively searches through some manually prespecified HP values and reports the best option and is quite commonly used. Another way to search through hyperparameter space to find optimums is by using **randomized search**. In randomized search, we sample HP values a certain number of times from some distribution which we prespecify in advance. There is [evidence](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) that randomized search is more efficient than grid search, because not all HPs are as important to tune and grid search effectively wastes time by exhaustively checking each option when it might not be necessary. By contrast, the random experiments utilized by randomized search explore the important dimensions of hyperparameter space with more coverage, while simultaneously not devoting too many trials to dimensions which are not as important. So, randomized search is very useful for high-dimensional feature spaces. 

To use randomized search to tune random forests, we first specify the distributions we want to sample from.

If we were to sample from a uniform distribution and have the same number of n_iter trials, randomized search would be practically equivalent to grid search (but not exactly equivalent - why is this?).

In [11]:
from scipy.stats import uniform
from scipy.stats import norm

from sklearn.grid_search import RandomizedSearchCV
from sklearn import metrics

# Designate distributions to sample hyperparameters from 
n_estimators = np.random.uniform(10, 45, 5).astype(int)
max_features = np.random.normal(5, 3, 5).astype(int)

hyperparameters = {'n_estimators': list(n_estimators),
                   'max_features': list(max_features)}

print hyperparameters

{'n_estimators': [40, 36, 21, 12, 44], 'max_features': [7, 8, 4, 5, 9]}


We then run the random search:

In [12]:
from sklearn.ensemble import RandomForestClassifier

# Run randomized search
randomCV = RandomizedSearchCV(RandomForestClassifier(), param_distributions=hyperparameters, n_iter=10)
randomCV.fit(XTrain, yTrain)

# Identify optimal hyperparameter values
best_n_estim      = randomCV.best_params_['n_estimators']
best_max_features = randomCV.best_params_['max_features']  

print("The best performing n_estimators value is: {:5.1f}".format(best_n_estim))
print("The best performing max_features value is: {:5.1f}".format(best_max_features))

# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from randomCV.best_estimator_
clfRDF = RandomForestClassifier(n_estimators=best_n_estim,
                                max_features=best_max_features)

clfRDF.fit(XTrain, yTrain)
RF_predictions = clfRDF.predict(XTest)

print (metrics.classification_report(yTest, RF_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2))

The best performing n_estimators value is:  36.0
The best performing max_features value is:   4.0
             precision    recall  f1-score   support

        0.0       0.78      0.82      0.80       188
        1.0       0.83      0.79      0.81       212

avg / total       0.81      0.81      0.81       400

('Overall Accuracy:', 0.81)


Either grid search or randomized search is [probably fine](http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html) for tuning random forests.

Let's look at how to tune our two other predictors. For simplicity, let's revert back to grid search.

## Tuning a support vector machine model

Let's train our second algorithm, **support vector machines** (SVMs) to do the same exact prediction task. A great introduction to the theory behind SVMs can be found in Chapter 9 of the [Introduction to Statistical Learning book](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf) or [in this nice blog post](https://www.quantstart.com/articles/Support-Vector-Machines-A-Guide-for-Beginners). Briefly, SVMs search for **separating hyperplanes** in the feature space which best divide the different classes in your dataset. If you had 2 features, SVMs would search for the best dividing line; if you had 3 features, it would search for the best dividing 2d plane, etc. Crucially, SVMs can construct complex, **non-linear decision boundaries** between classes using a process called **kernelling**, which projects the data into a higher-dimensional space and makes finding a good boundary easier. This sounds a bit abstract, but if you've ever fit a straight line to power-transformed variables (e.g. maybe you used x^2, x^3 as features as part of linear regression), you're already familiar with the concept of creating additional dimensions to facilitate modelling.

SVMs can use different types of kernel functions, like linear, polynomial, Gaussian or radial kernels, to throw the data into a different space. Let's use the popular **radial basis kernel**. Of the available [hyperparameters](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), some key radial SVM hyperparameters to tune are:

+ `gamma` (a kernel parameter, controlling how far we 'throw' the data into the new feature space) and
+ `C` (which controls the 'softness' of the classification boundary margin and hence the [bias-variance tradeoff](http://blog.cambridgecoding.com/2016/03/24/misleading-modelling-overfitting-cross-validation-and-the-bias-variance-trade-off/) of the SVM model

Let's examine the default settings and define our own: 

In [20]:
from sklearn import svm

default_SVC = svm.SVC()
print ("Default SVC parameters are:", default_SVC.get_params)

# Search for good hyperparameter values
# Specify values to grid search over
g_range = 2. ** np.arange(-15, 5, step=5)
C_range = 2. ** np.arange(-5, 15, step=5)

hyperparameters = [{'gamma': g_range, 
                    'C': C_range}] 

print (hyperparameters)

('Default SVC parameters are:', <bound method SVC.get_params of SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)>)
[{'C': array([  3.12500000e-02,   1.00000000e+00,   3.20000000e+01,
         1.02400000e+03]), 'gamma': array([  3.05175781e-05,   9.76562500e-04,   3.12500000e-02,
         1.00000000e+00])}]


In [16]:
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

# Grid search using cross-validation
grid = GridSearchCV(SVC(), param_grid=hyperparameters, cv=10)  
grid.fit(XTrain, yTrain)

best_gamma = grid.best_params_['gamma']
best_C = grid.best_params_['C']

print ("The best performing gamma value is: {:5.1f}".format(best_gamma))
print ("The best performing C value is: {:5.1f}".format(best_C))

# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=bestC, gamma=bestG)
rbfSVM.fit(XTrain, yTrain)
SVM_predictions = rbfSVM.predict(XTest)

print (metrics.classification_report(yTest, SVM_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, SVM_predictions),4))

The best performing gamma value is:   0.0
The best performing C value is: 1024.0
             precision    recall  f1-score   support

        0.0       0.69      0.76      0.72       188
        1.0       0.76      0.69      0.73       212

avg / total       0.73      0.72      0.72       400

Overall Accuracy: 0.7225


How does this performance compare an untuned SVM? What about an SVM with especially badly tuned hyperparams?

Similarly, here is how you can tune an SVM using randomized search:

In [None]:
# Randomized search SVM code goes here


## Tuning a logistic regression classifier

The final model you'll tune and apply to predict wine quality is a logistic regression classifier. This is a type of regression model which is used for predicting binary outcomes (like good wine/not good wine).  Logistic regression fits a sigmoidal (S-shaped) curve through the data, but is essentially just a transformed version of linear regression. In fact, a straight line is fit through transformed data, where the x axes remain the same but the dependent variable is the [log odds](https://en.wikipedia.org/wiki/Logit) of data points being one of the two classes. A nice explanation of the theoretical basis for logistic regression can be found here.

One topic you will often encounter in machine learning is **regularization**, which is a class of techniques to reduce overfitting. The idea behind regularization is that you often, it is not good enough to just maximize a model's fit to your data. This is liable to overfitting. Regularization techniques try to cut down on overfitting by penalizing models if e.g. they use too many parameters, or assign coefficients or weights that are "too big". This means that models have to learn from the data under a series of constraints, which often leads to robust representations of the data. You can read more about regularized regression [here]().

You can adjust just how much regularization you want by adjusting **regularization hyperparameters**, and since this is something you might want to do often, scikit-learn comes with some pre-built models that can very efficiently fit data for a range of regulatization hyperparameter values. This is the case for regularized linear regression models like [Lasso regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV) and [ridge regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV), which use l1 and l2 penalties, respectively, to shrink the size of the regression coefficients. These classes offer a shortcut to doing cross-validated selection of the regularization hyperparameter.

But you can also optimize how much regularization you want ourselves, while at the same time tuning other hyperparameters like the choice of l1 and l2 penalties, in the same manner as you've been doing:

In [29]:
# Tuning a regularized logistic regression model with grid search
from sklearn.linear_model import LogisticRegression

# Search for good hyperparameter values
# Specify values to grid search over
penalty = ["l1", "l2"]
C_range = np.arange(0.1, 1.1, 0.1)

hyperparameters = [{'penalty': penalty, 
                    'C': C_range}] 

# Grid search using cross-validation
grid = GridSearchCV(LogisticRegression(), param_grid=hyperparameters, cv=10)  
grid.fit(XTrain, yTrain)

best_penalty = grid.best_params_['penalty']
best_C = grid.best_params_['C']

print ("The best performing penalty is: {}".format(best_penalty))
print ("The best performing C value is: {:5.1f}".format(best_C))

The best performing penalty is: l2
The best performing C value is:   0.7


In [30]:
# Train model and output predictions
classifier_logistic = LogisticRegression(penalty=best_penalty, C=best_C)
classifier_logistic_fit = classifier_logistic.fit(XTrain, yTrain)
logistic_predictions = classifier_logistic_fit.predict(XTest)

print metrics.classification_report(yTest, logistic_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, logistic_predictions),2)

             precision    recall  f1-score   support

        0.0       0.72      0.73      0.73       188
        1.0       0.76      0.75      0.76       212

avg / total       0.74      0.74      0.74       400

Overall Accuracy: 0.74


Similarly, you can use randomized search:

In [32]:
# Randomized search logistic regression code goes here

## Conclusion

In this tutorial, you've X. Quick final comparison of grid vs. randomized search. The important thing is to do some exploration of HP space using either method.

Fancier techniques for hyperparameter optimization include methods based on [gradient descent](http://jmlr.org/proceedings/papers/v37/maclaurin15.pdf), grad student descent (not recommended), and [Bayesian approaches](http://arxiv.org/pdf/1206.2944.pdf) which update prior beliefs about likely values of hyperparameters based on the data (see [Spearmint](https://github.com/JasperSnoek/spearmint) and [hyperopt](http://hyperopt.github.io/hyperopt/)).

In our next post, we will take these different tuned models and build them up into an ensemble model to increase our predictive performance even more.