### Hyperparameter tuning with RandomizedSearchCV

* GridSearchCV can be computationally expensive, especially when you are searching a large hyperparameter space with several hyperparameters. RandomizedSearchCV addresses this by not trying every combination. Instead, it samples a fixed number of hyperparameter settings (set by n_iter, which defaults to 10) from the lists or probability distributions you specify. You'll practice using RandomizedSearchCV in this exercise and see how it works.


* Here, you'll also be introduced to a new model: the Decision Tree. Just like k-NN, linear regression, and logistic regression, decision trees in scikit-learn have .fit() and .predict() methods that you can use in exactly the same way as before. Decision trees have many hyperparameters that can be tuned, such as max_features, max_depth, and min_samples_leaf, which makes them an ideal use case for RandomizedSearchCV.


* Your goal is to use RandomizedSearchCV to find the optimal hyperparameters.
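
To make the sampling idea concrete, here is a minimal sketch that uses scikit-learn's `ParameterSampler` to draw a few candidate settings from a mix of lists and `scipy.stats` distributions. The ranges mirror the ones used in this exercise; `n_iter=5` and `random_state=42` are arbitrary choices made only for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Lists are sampled uniformly; randint(1, 9) draws integers from 1 to 8 inclusive.
param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"],
}

# Draw 5 candidate settings (n_iter and random_state are illustrative choices).
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=42))

for params in samples:
    print(params)
```

Each printed dictionary is one candidate that a randomized search would evaluate with cross-validation, instead of evaluating every possible combination.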

In [1]:
# Import necessary modules
import pandas as pd
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

In [2]:
# Load the data and split it into a feature array X and a target array y
df = pd.read_csv('diabetes.csv')
X = df.drop('diabetes', axis=1).values
y = df['diabetes'].values

In [3]:
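# Set up the parameters and distributions to sample from: param_dist
# Lists are sampled uniformly; randint(1, 9) draws integers from 1 to 8 inclusive.
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}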

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
# Pass the classifier, the parameter distributions, and the number of CV folds.
# n_iter is left at its default of 10, so 10 parameter settings are sampled.
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))

Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 6}


In [4]:
# Print the score
print("Best score is {}".format(tree_cv.best_score_))

Best score is 0.7395833333333334
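
Because the candidates are drawn at random, rerunning the search can give different tuned parameters and a different best score. The sketch below shows one way to make a run repeatable by fixing `random_state`. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the diabetes dataset), and the helper `run_search` is a hypothetical name introduced only for this illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"],
}


def run_search(seed):
    """Hypothetical helper: run one randomized search with a fixed seed."""
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),  # fix the tree's own randomness too
        param_dist,
        n_iter=10,          # number of sampled settings (the default, stated explicitly)
        cv=5,
        random_state=seed,  # fixes which settings are sampled
    )
    search.fit(X_demo, y_demo)
    return search


first = run_search(seed=1)
second = run_search(seed=1)

# The same seed samples the same candidates, so both runs agree.
print(first.best_params_ == second.best_params_)
# One entry per sampled setting.
print(len(first.cv_results_["params"]))
```

Raising `n_iter` trades more computation for a better chance of sampling a strong setting.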


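Note that RandomizedSearchCV evaluates only a sample of the candidate settings, so over the same discrete set of values it can never find a better setting than an exhaustive GridSearchCV. It is valuable because it saves computation time, and because it can sample from distributions, it may try values that a coarse grid would skip. Since no random_state is set here, the tuned parameters and score you get may differ from those shown above.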
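
One way to see this trade-off is to run both searches over the same discrete candidates and compare how many settings each one evaluates. The following is a rough sketch on synthetic data from `make_classification` (an assumption, not the diabetes dataset); the `_demo` variable names and the seeds are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The same discrete candidates for both searches: 2 * 8 * 8 * 2 = 256 combinations.
param_grid = {
    "max_depth": [3, None],
    "max_features": list(range(1, 9)),
    "min_samples_leaf": list(range(1, 9)),
    "criterion": ["gini", "entropy"],
}

tree_demo = DecisionTreeClassifier(random_state=0)

# Exhaustive search: every combination is cross-validated.
grid_cv = GridSearchCV(tree_demo, param_grid, cv=5)
grid_cv.fit(X_demo, y_demo)

# Randomized search: only n_iter combinations are cross-validated.
rand_cv = RandomizedSearchCV(tree_demo, param_grid, n_iter=10, cv=5, random_state=0)
rand_cv.fit(X_demo, y_demo)

n_grid = len(grid_cv.cv_results_["params"])
n_rand = len(rand_cv.cv_results_["params"])
print("Candidates evaluated:", n_grid, "vs", n_rand)
print("Best scores:", grid_cv.best_score_, "vs", rand_cv.best_score_)
```

With identical folds and a fixed tree seed, the randomized search scores a subset of the grid's candidates, so its best score cannot exceed the grid's, but it fits far fewer models.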