# Grid Search

In this section, we use the GridSearchCV class provided by Scikit-Learn to find the optimal hyperparameters for a Random Forest model.

Please refer to the following link for an overview of GridSearchCV before proceeding with the exercise.

GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

The following link provides information about the Random Forest classifier and its usage in the Scikit-Learn library.

RandomForest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

The following dataset is used for this task.

Heart Disease Cleveland: https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland

In [30]:
# Load the necessary libraries

# Your code here
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

In [31]:
# suppress warning messages

# Your code here
import warnings
warnings.filterwarnings('ignore')

In [32]:
# Load the dataset as a Pandas dataframe and display the head

# Your code here
df = pd.read_csv('Heart_disease_cleveland_new.csv')
df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   0       145   233    1        2      150      0      2.3      2   0     2       0
1   67    1   3       160   286    0        2      108      1      1.5      1   3     1       1
2   67    1   3       120   229    0        2      129      1      2.6      1   2     3       1
3   37    1   2       130   250    0        0      187      0      3.5      2   0     1       0
4   41    0   1       130   204    0        2      172      0      1.4      0   0     1       0


In [33]:
# Check for the null values

# Your code here
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [34]:
# Separate the feature columns and target using pandas functions

# Your code here
X = df.drop(columns = ['target']) # keep every column except the target
y = df['target']


In [35]:
# Shape of the feature columns
X.shape

(303, 13)

In [36]:
# Shape of the target column
y.shape

(303,)

In [37]:
# Split dataset into train and test sets

# Your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
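
The split above samples rows at random. For a small classification dataset, one optional variation is to pass `stratify` so that the class proportions are similar in both splits. The sketch below uses hypothetical toy arrays (`X_demo`, `y_demo`) instead of the heart disease data, so the names and sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 200 negatives and 100 positives.
X_demo = np.arange(600).reshape(300, 2)
y_demo = np.array([0] * 200 + [1] * 100)

# stratify=y_demo asks for the same class balance in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fraction of positives in each split; both should be close to 1/3.
print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the results noticeably depends on how imbalanced the target is.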

In [38]:
# Print train dataset size

# Your code here
print(X_train.shape, y_train.shape)

(212, 13) (212,)


In [39]:
# Print test dataset size

# Your code here
print(X_test.shape, y_test.shape)

(91, 13) (91,)


In [40]:
# Scale the data using standard scaler

# Your code here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [41]:
# Define the random forest classifier with the default parameters

# Your code here
rfc = RandomForestClassifier()

In [42]:
# Define the parameter grid for the grid search
# Refer to the GridSearchCV Documentation

# Your code here
forest_params = [
    {'max_depth': list(range(10, 15)),
     'max_features': list(range(1, 14)) # must be at least 1; the dataset has 13 features
     }
]
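
Grid search fits one model for every parameter combination on every fold, so the cost grows multiplicatively with the grid. As a rough sketch, the number of fits can be counted with the standard library. The names `param_grid` and `cv_folds` are illustrative, and the grid assumes `max_features` runs from 1 to 13, since an integer `max_features` must be at least 1 and the dataset has 13 feature columns.

```python
from itertools import product

# Illustrative grid: max_depth 10..14 and max_features 1..13.
param_grid = {
    "max_depth": list(range(10, 15)),
    "max_features": list(range(1, 14)),
}
cv_folds = 5  # same number of folds as cv = 5 in the grid search

# Each candidate takes one value per parameter.
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(n_candidates, "candidates,", n_fits, "fits")
```

Scikit-Learn's `ParameterGrid` class in `sklearn.model_selection` enumerates the same combinations.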

In [43]:
# Perform Grid Search to identify optimal parameters
# Use cv = 5

# Your code here
clf = GridSearchCV(rfc, forest_params, cv = 5, scoring = 'accuracy')

In [44]:
clf.fit(X_train_scaled, y_train)
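
Because the training data was scaled before the search, every validation fold was scaled with statistics computed from the whole training set. One common way to avoid this is to wrap the scaler and the model in a `Pipeline`, so the scaler is refit on the training part of each fold. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` in place of the heart disease data, plus a small grid and forest to keep it fast. Tree-based models are largely insensitive to feature scaling, so the practical effect here is likely small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 13 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6, random_state=0
)

# The scaler is fitted inside each cross-validation fold, not beforehand.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
demo_grid = {"rf__max_depth": [3, 5], "rf__max_features": [2, 4]}

search = GridSearchCV(pipe, demo_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
```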

In [45]:
# Print the best hyperparameters found by the grid search

# Your code here
clf.best_params_

{'max_depth': 11, 'max_features': 3}

In [46]:
# Print the mean cross-validated score of the best_estimator

# Your code here
clf.best_score_

0.8345514950166113
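
The best score is the mean of the fold scores for the winning candidate. The scores of every candidate are stored in the `cv_results_` attribute, which loads into a DataFrame for inspection. This is a minimal sketch on synthetic data from `make_classification`, with a deliberately small grid. The names `demo_search` and `summary` are illustrative and not part of the exercise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 200 rows, 13 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6, random_state=0
)

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    {"max_depth": [2, 4], "max_features": [2, 4]},
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, ordered from best to worst mean validation score.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["param_max_depth", "param_max_features",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary)
```

Comparing `std_test_score` across rows gives a sense of whether the gap between the top candidates is larger than the fold-to-fold noise.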

In [47]:
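# Use the best estimator to obtain the accuracy for the training and test sets

# Your code here
print(clf.best_estimator_.score(X_train_scaled, y_train))
print(clf.best_estimator_.score(X_test_scaled, y_test))
# The training accuracy (1.0) is well above the test accuracy (about 0.86), which suggests overfitting.
# A perfect training score is common for random forests with deep trees, so the test score is the better guide.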

1.0
0.8571428571428571
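
Accuracy alone does not show which class the errors fall on. The `metrics` module imported at the top of the notebook provides `confusion_matrix` and `classification_report` for that. The sketch below runs on synthetic data from `make_classification` with a hypothetical fixed-depth forest, so the numbers it prints say nothing about the heart disease results.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 rows, 13 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_te, y_pred)
print(cm)

# Per-class precision, recall and F1.
print(metrics.classification_report(y_te, y_pred))

# Gap between training and test accuracy as a rough overfitting indicator.
train_acc = model.score(X_tr, y_tr)
test_acc = metrics.accuracy_score(y_te, y_pred)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```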
