# Parameter Tuning Using GridSearchCV and RandomizedSearchCV

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [2]:
df=pd.read_csv('heart.csv')

In [4]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [5]:
df.shape

(303, 14)

In [10]:
X=df.iloc[:,0:-1]
y=df.iloc[:,-1]

In [13]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [14]:
print(X_train.shape)
print(X_test.shape)

(242, 13)
(61, 13)


### Comparison Among Different Classifier Models

In [15]:
rf=RandomForestClassifier()
gb=GradientBoostingClassifier()
svm=SVC()
lr=LogisticRegression()

#### 1.Random Forest

In [17]:
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.8688524590163934


#### 2.GradientBoostingClassifier

In [18]:
gb.fit(X_train,y_train)
y_pred=gb.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.7704918032786885


#### 3.SVC

In [19]:
svm.fit(X_train,y_train)
y_pred=svm.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.7049180327868853


#### 4.Logistic_Regression

In [20]:
lr.fit(X_train,y_train)
y_pred=lr.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.8852459016393442


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
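The convergence warning above suggests two remedies: raise `max_iter` or scale the features. The sketch below applies both. It is only an illustration, not part of the original notebook: it uses a synthetic dataset from `make_classification` as a stand-in for `heart.csv`, and the names `X_demo`, `scaled_lr`, and `lr_acc` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in with the same shape as heart.csv (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Standardize the features, then fit logistic regression with a higher iteration cap.
# The scaler is fitted on the training rows only, so no test information leaks in.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(Xtr, ytr)

lr_acc = accuracy_score(yte, scaled_lr.predict(Xte))
print(lr_acc)
```

Scaling would likely help `SVC` as well, since its default RBF kernel is sensitive to feature magnitudes.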


From the results above, Random Forest performs well out of the box and is usually among the top 3-4 algorithms on tabular problems like this one.<br> On this split, GradientBoostingClassifier (0.770) scores lower than Random Forest (0.869), even though gradient boosting is often one of the strongest algorithms, and Logistic Regression (0.885) lands close to Random Forest. With only 61 test rows, a single prediction changes accuracy by about 1.6 percentage points, so small gaps should not be over-interpreted.

<hr>

### Accuracy of Random Forest After Tuning and Cross-Validation

In [23]:
rf=RandomForestClassifier(max_samples=0.75,random_state=42)
# max_samples=0.75: each tree is trained on a bootstrap sample of 75% of the rows
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)
print(accuracy_score(y_test,y_pred))
# On this test split, accuracy rises to about 90% with max_samples=0.75

0.9016393442622951


In [24]:
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(RandomForestClassifier(max_samples=0.75),X,y,cv=10,scoring='accuracy'))

0.8348387096774192

<hr> 

# 1.GridSearchCV

- GridSearchCV is a method for hyperparameter tuning.
- It searches for the hyperparameter combination that gives the best cross-validated score (accuracy by default for classifiers).
- We pass it a grid: a dictionary mapping each hyperparameter to the list of values to try.
- It fits and scores the model for every combination of hyperparameters in the grid.
- Because it evaluates every combination, GridSearchCV becomes slow on large datasets or large grids.
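To make the cost of an exhaustive search concrete, the number of candidates can be counted before fitting anything. The sketch below uses scikit-learn's `ParameterGrid` with the same values this notebook tunes; the names `demo_grid`, `n_candidates`, and `cv_folds` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same hyperparameter values as the grid tuned in this notebook
demo_grid = {
    "n_estimators": [20, 60, 100, 120],
    "max_features": [0.2, 0.6, 1.0],
    "max_depth": [2, 8, None],
    "max_samples": [0.5, 0.75, 1.0],
}

# Number of combinations: 4 * 3 * 3 * 3
n_candidates = len(ParameterGrid(demo_grid))
cv_folds = 5

# Each candidate is fitted once per fold
print(n_candidates, n_candidates * cv_folds)  # 108 540
```

Adding one more hyperparameter with two values would double both numbers, which is why grids grow expensive quickly.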

In [26]:
# Number of trees in random forest
n_estimators = [20,60,100,120]

# Number of features to consider at every split
max_features = [0.2,0.6,1.0]

# Maximum number of levels in tree
max_depth = [2,8,None]

# Number of samples
max_samples = [0.5,0.75,1.0]

# Since we are tuning 4 hyperparameters, the search space is a 4-dimensional grid
# 4 x 3 x 3 x 3 = 108 combinations, so 108 different random forests are evaluated

In [27]:
param_grid={'n_estimators':n_estimators,
            'max_features':max_features,
            'max_depth':max_depth,
            'max_samples':max_samples
}
print(param_grid)

{'n_estimators': [20, 60, 100, 120], 'max_features': [0.2, 0.6, 1.0], 'max_depth': [2, 8, None], 'max_samples': [0.5, 0.75, 1.0]}


In [28]:
rf=RandomForestClassifier()

In [29]:
from sklearn.model_selection import GridSearchCV

rf_grid=GridSearchCV(estimator=rf,
                    param_grid=param_grid,
                    cv=5,       # 5-fold cross-validation
                    verbose=2,  # print progress for each fit
                    n_jobs=-1   # use all available CPU cores
                    )

In [30]:
# This step may take a while: 108 candidate forests x 5 folds = 540 fits, run in parallel

rf_grid.fit(X_train,y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [32]:
rf_grid.best_params_
# Shows the hyperparameter combination with the best mean cross-validated score

{'max_depth': 2, 'max_features': 0.2, 'max_samples': 0.75, 'n_estimators': 100}

##### Getting the best score from GridSearchCV

In [34]:
rf_grid.best_score_

0.8386904761904763
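Note that `best_score_` is the mean cross-validated accuracy on the training folds, not the accuracy on the held-out test set. A minimal sketch of checking the tuned model on the test rows follows. It assumes a synthetic stand-in for `heart.csv` and a deliberately small grid (`small_grid`) so that it runs quickly; neither is from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic stand-in with the same shape as heart.csv
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical reduced grid, kept small for speed
small_grid = {"n_estimators": [20, 60], "max_depth": [2, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(Xtr, ytr)

# Mean CV accuracy of the best candidate (computed on training data only)
cv_score = search.best_score_

# With the default refit=True, best_estimator_ is retrained on all of Xtr
test_score = accuracy_score(yte, search.best_estimator_.predict(Xte))
print(search.best_params_, cv_score, test_score)
```

Comparing `cv_score` and `test_score` gives a rough sense of whether the tuned settings generalize beyond the folds they were selected on.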

<hr>

# 2.RandomizedSearchCV

- Like GridSearchCV, RandomizedSearchCV is a method for hyperparameter tuning.
- It randomly samples combinations of hyperparameters from the search space.
- Unlike GridSearchCV, it does not evaluate every combination; the `n_iter` parameter (default 10) sets how many are tried.
- This makes it much faster than GridSearchCV on large search spaces.
- The trade-off is that it is not guaranteed to find the best combination in the space.
- It is useful for tuning on large datasets, where an exhaustive search would be too expensive.
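The sampling behaviour described above can be previewed with `ParameterSampler`, the helper scikit-learn provides for drawing random combinations from a search space. The sketch below compares the size of the full seven-parameter grid with a sample of 10; `demo_space`, `full_grid_size`, and `sampled` are illustrative names, and `random_state=42` is an arbitrary choice made for reproducibility.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Seven hyperparameters, using the values tuned in this notebook
demo_space = {
    "n_estimators": [20, 60, 100, 120],
    "max_features": [0.2, 0.6, 1.0],
    "max_depth": [2, 8, None],
    "max_samples": [0.5, 0.75, 1.0],
    "bootstrap": [True, False],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# An exhaustive search would have to evaluate every combination
full_grid_size = len(ParameterGrid(demo_space))

# A randomized search only draws n_iter of them
sampled = list(ParameterSampler(demo_space, n_iter=10, random_state=42))

print(full_grid_size, len(sampled))  # 864 10
```

Passing `random_state` to `RandomizedSearchCV` in the same way makes the set of sampled candidates repeatable between runs.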

In [37]:
# Number of trees in random forest
n_estimators = [20,60,100,120]

# Number of features to consider at every split
max_features = [0.2,0.6,1.0]

# Maximum number of levels in tree
max_depth = [2,8,None]

# Number of samples
max_samples = [0.5,0.75,1.0]

# Bootstrap samples
bootstrap = [True,False]

# Minimum number of samples required to split a node
min_samples_split = [2, 5]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]

In [38]:
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
              'max_samples':max_samples,
              'bootstrap':bootstrap,
              'min_samples_split':min_samples_split,
              'min_samples_leaf':min_samples_leaf
             }
print(param_grid)

{'n_estimators': [20, 60, 100, 120], 'max_features': [0.2, 0.6, 1.0], 'max_depth': [2, 8, None], 'max_samples': [0.5, 0.75, 1.0], 'bootstrap': [True, False], 'min_samples_split': [2, 5], 'min_samples_leaf': [1, 2]}


In [39]:
from sklearn.model_selection import RandomizedSearchCV

rf_grid = RandomizedSearchCV(estimator = rf, 
                       param_distributions = param_grid, 
                       cv = 5, 
                       verbose=2, 
                       n_jobs = -1)

In [40]:
rf_grid.fit(X_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\acer\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\acer\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\anaconda3\Lib\site-packages\sklearn\ensemble\_forest.py", line 402, in fit
    raise ValueError(
ValueError: `max_sample` cannot be set if `bootstrap=False`. Either switch to `bootstrap=True` or set `max_s

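From the output above, RandomizedSearchCV evaluated only 10 randomly chosen candidates (combinations), whereas GridSearchCV evaluated all 108. Note that 25 of the 50 fits failed: every sampled candidate with `bootstrap=False` is invalid here, because `max_samples` can only be set when `bootstrap=True`, so those scores were set to NaN.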

In [43]:
rf_grid.best_params_
# Best parameters according to RandomizedSearchCV

{'n_estimators': 120,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_samples': 0.75,
 'max_features': 0.2,
 'max_depth': 2,
 'bootstrap': True}

###### Best Score According to RandomizedSearchCV

In [44]:
rf_grid.best_score_

0.8305272108843538

<hr>

### Note
- When the dataset is small and the algorithm has few hyperparameters to tune, use GridSearchCV.
- When the dataset is large and the algorithm has many hyperparameters to tune, use RandomizedSearchCV.

# Conclude