# Randomized Search CV
Parameters:
- `estimator`: the model to tune
- `param_distributions`: dict mapping hyperparameter names to the values (or distributions) to sample from
- `n_iter`: number of parameter combinations to sample
- `scoring`: metric used to evaluate each sampled combination

In [1]:
import pandas as pd
candy = pd.read_csv('datasets/candy-data.csv')
candy.head()

Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465


In [2]:
# Note: the 'Unnamed: 0' index column from the CSV is kept as a feature here
X = candy.drop(['competitorname', 'winpercent'], axis=1)
y = candy['winpercent']

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error

# Candidate values for each hyperparameter to sample from
param_dist = {"max_depth": [2,4,6,8],
              "max_features": [2, 4, 6, 8, 10],
              "min_samples_split": [2, 4, 8, 16]}

# Create a random forest regression model
rfr = RandomForestRegressor(n_estimators=10, random_state=1111)

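# Create a scorer from the mean squared error.
# Note: make_scorer assumes greater_is_better=True by default, so the search
# below treats a higher MSE as better; pass greater_is_better=False to minimize it.
scorer = make_scorer(mean_squared_error)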
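
The sign convention of `make_scorer` matters for model selection, so here is a minimal sketch of it on made-up data. `X_demo`, `y_demo`, and the `DummyRegressor` stand-in are assumptions for illustration only and are not part of the notebook above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

# Made-up data: a model that always predicts the mean of y_demo (3.0)
X_demo = np.zeros((4, 1))
y_demo = np.array([1.0, 2.0, 3.0, 6.0])
model = DummyRegressor(strategy="mean").fit(X_demo, y_demo)

# Default: the metric is treated as a score, so larger values rank higher
mse_as_score = make_scorer(mean_squared_error)
# greater_is_better=False: the metric is treated as a loss and negated,
# so a smaller MSE gives a larger (less negative) score
mse_as_loss = make_scorer(mean_squared_error, greater_is_better=False)

raw = mse_as_score(model, X_demo, y_demo)
negated = mse_as_loss(model, X_demo, y_demo)
print(raw, negated)
```

A search that maximizes `raw` prefers the candidate with the largest error, while one that maximizes `negated` prefers the smallest.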

In [5]:
# Import the method for random search
from sklearn.model_selection import RandomizedSearchCV

# Build a random search using param_dist, rfr, and scorer
random_search =\
    RandomizedSearchCV(
        estimator=rfr,
        param_distributions=param_dist,
        n_iter=10,
        cv=5,
        scoring=scorer)

With `n_iter=10`, `random_search` samples 10 parameter combinations at random from `param_distributions` (out of the 4 × 5 × 4 = 80 possible ones). With `cv=5`, each sampled combination is evaluated with 5-fold cross-validation on the data passed to `fit`, so every candidate is fitted and scored 5 times, for 50 fits in total. The combination with the highest mean score is then refit on the full training data to produce the final model, and `cv_results_` records the five per-fold scores for each of the 10 candidates.
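
The counts above can be checked with a short sketch. It assumes synthetic data from `make_regression` in place of the candy dataset (`X_demo`, `y_demo`, and `search` are illustrative names), and it fixes `random_state` on the search only to make the run repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Synthetic stand-in for the candy data (assumption, 12 features)
X_demo, y_demo = make_regression(n_samples=80, n_features=12,
                                 noise=10.0, random_state=0)

param_dist = {"max_depth": [2, 4, 6, 8],
              "max_features": [2, 4, 6, 8, 10],
              "min_samples_split": [2, 4, 8, 16]}

# Size of the full grid that the random search samples from
n_combinations = len(ParameterGrid(param_dist))

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=1111),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring=make_scorer(mean_squared_error),
    random_state=0)
search.fit(X_demo, y_demo)

# One entry per sampled candidate, one score column per fold
n_candidates = len(search.cv_results_["params"])
split_keys = sorted(k for k in search.cv_results_
                    if k.startswith("split") and k.endswith("_test_score"))
print(n_combinations, n_candidates, split_keys)
```

Each `splitK_test_score` array holds one fold's score for all sampled candidates, and `mean_test_score` averages them per candidate.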

In [6]:
# Fitting random search
random_search.fit(X_train,y_train)

In [7]:
# Best parameters selected by RandomizedSearchCV
random_search.best_params_

{'min_samples_split': 2, 'max_features': 2, 'max_depth': 2}

In [8]:
# Mean cross-validated score (here, the MSE from `scorer`) of best_estimator_
random_search.best_score_

141.49616262372115

In [9]:
# Best model
best_model = random_search.best_estimator_
best_model

In [10]:
# Predicting using best model
prediction = best_model.predict(X_test)
print(prediction[:5])

# Checking score: a regressor's .score() returns the R^2 score by default
best_model.score(X_test, y_test)

[44.36262258 61.67867459 42.64504093 49.72676235 41.86567371]


0.1588813864217652

In [11]:
# Predicting using random search model
prediction = random_search.predict(X_test)
print(prediction[:5])

# Checking score: the search object's .score() uses the `scoring` argument, i.e. MSE
random_search.score(X_test, y_test)

[44.36262258 61.67867459 42.64504093 49.72676235 41.86567371]


168.62461945807263
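
The two cells above give identical predictions but different scores because they use different metrics. The sketch below illustrates this on synthetic data, assuming `make_regression` as a stand-in for the candy dataset and a smaller parameter dict; all names in it are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the candy data (assumption)
X_demo, y_demo = make_regression(n_samples=100, n_features=12,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=1111),
    param_distributions={"max_depth": [2, 4, 6, 8],
                         "min_samples_split": [2, 4, 8, 16]},
    n_iter=5,
    cv=5,
    scoring=make_scorer(mean_squared_error),
    random_state=0)
search.fit(X_tr, y_tr)

# The search object delegates predict() to its refit best_estimator_
pred = search.predict(X_te)

search_score = search.score(X_te, y_te)                     # uses `scoring` (MSE)
estimator_score = search.best_estimator_.score(X_te, y_te)  # regressor default (R^2)

mse = mean_squared_error(y_te, pred)
r2 = r2_score(y_te, pred)
print(search_score, mse, estimator_score, r2)
```

Here `search_score` matches `mse` and `estimator_score` matches `r2`, which is the same relationship as between `random_search.score` and `best_model.score` above.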