### Writers :
##### omar el sayed
##### Noura medhat

# Grid-Search
## Grid search is one of the most critical steps in building a model. But what do we mean by grid search?
### Grid search is the process of trying every combination of hyperparameter values from a grid that we specify, in order to find the combination that makes the model most accurate. It builds and evaluates a model for each combination, which helps us pick settings that neither overfit nor underfit and so improves accuracy on unseen data.
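
#### To make "every combination" concrete, here is a minimal sketch that enumerates a small grid by hand. The grid values are hypothetical and chosen only for illustration; it uses plain Python rather than scikit-learn.

```python
from itertools import product

# Hypothetical grid for illustration: two hyperparameters of a decision tree.
param_grid = {
    "max_depth": [1, 2, 3],
    "criterion": ["entropy", "gini"],
}

# Cartesian product of all candidate values: one dict per combination.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

for combo in combinations:
    print(combo)

print(len(combinations), "candidate models")  # 3 depths x 2 criteria = 6
```

#### A grid search would train and score one model for each of these dictionaries, so the cost grows with the product of the list lengths.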

#### Step One: Importing the important libraries

In [1]:
import numpy as np #numpy is an important library in linear algebra
import pandas as pd #I/O from our CSV file 
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

#### Step Two: Loading our data

In [2]:
df = pd.read_csv(r"C:\Users\antoz\Downloads\d\Breast_Cancer_data.csv")  # raw string avoids invalid escape sequences
df.head(3)

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |


### About the data :
#### The data describes breast cancer cases, with features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The features describe characteristics of the cell nuclei present in the image.
#### Attribute Information:
#### 1) ID number
#### 2) Diagnosis (M = malignant, B = benign)
#### 3-32)
<h5>
Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter) <br>
b) texture (standard deviation of gray-scale values) <br>
c) perimeter <br>
d) area<br>
e) smoothness (local variation in radius lengths)<br>
f) compactness (perimeter^2 / area - 1.0)<br>
g) concavity (severity of concave portions of the contour) <br>
h) concave points (number of concave portions of the contour)<br>
i) symmetry<br>
j) fractal dimension ("coastline approximation" - 1)<br>
    </h5>

#### Step Three: Splitting our data into a training set and a test set. We drop irrelevant columns such as the patient's id.

#### NOTE: train_test_split is a scikit-learn function that splits our dataset into two subsets. The first subset, the training dataset, is used to fit the model. The second subset, the test dataset, is not used for training; instead, its input features are given to the fitted model, and the predictions are compared with the expected values. By default it holds out 25% of the rows for testing.
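
#### As a small illustration of how the split behaves, the sketch below runs train_test_split on toy data. The arrays are made up for the example and are not the breast cancer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 samples, 3 features, binary labels.
X = np.arange(300).reshape(100, 3)
y = np.array([0, 1] * 50)

# With no test_size given, scikit-learn holds out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)
```

#### Fixing random_state makes the shuffle reproducible, so the same rows land in the test set on every run.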

In [3]:
train, test = train_test_split(df, random_state=42)
X_train = train[train.columns[2:-1]]
y_train = train['diagnosis']
X_test = test[test.columns[2:-1]] 
y_test = test['diagnosis']

#### Step Four: Building a simple DecisionTree model

In [4]:
# Decision trees are insensitive to feature scaling, so we fit on the raw features

decTree_model = DecisionTreeClassifier(random_state=0).fit(X_train,y_train)

print("train score - " + str(decTree_model.score(X_train, y_train)))
print("test score - " + str(decTree_model.score(X_test, y_test)))

train score - 1.0
test score - 0.9300699300699301


#### The model scores 1.0 on the training set but only about 0.93 on the test set, so it is most probably overfitting. We need a better way to choose and evaluate the parameters.
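
#### One way to look at the train/test gap is to fit the same tree at several depths and compare the two scores. This is only a sketch: it assumes scikit-learn's bundled breast cancer dataset as a stand-in for the local CSV file, and the depth values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset,
# used here so the sketch runs without the local CSV file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for depth in [1, 2, 3, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

#### A training score that keeps rising while the test score stalls or drops is the usual sign of overfitting.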

## K-fold cross-validation
### Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating each on the complementary subset. In k-fold cross-validation, we split the input data into k subsets (also known as folds). We train a model on all but one (k-1) of the folds, and then evaluate it on the fold that was not used for training. This process is repeated k times, with a different fold reserved for evaluation (and excluded from training) each time.
<img src="https://miro.medium.com/max/1400/1*IjKy-Zc9zVOHFzMw2GXaQw.png"  width="900" height="400">
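
#### The fold rotation described above can be sketched with plain NumPy. The helper name k_fold_indices is hypothetical and not part of scikit-learn; it does not shuffle, which a real splitter often would.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

#### Every sample is used for validation exactly once and for training k-1 times.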


#### Step Five: Using a grid search to find the best parameters for our model, with cv as the number of k-folds

In [5]:
from sklearn.metrics import classification_report
params = {'max_depth': [1, 2, 3, 4, 5, 6],
               'criterion': ["entropy" , "gini"]}

# cv=5 runs 5-fold cross-validation on the training data for every parameter combination
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=5, n_jobs=-1) #n_jobs=-1 means use all processors available
grid_search.fit(X_train, y_train)

print(classification_report(y_test,grid_search.predict(X_test)))

              precision    recall  f1-score   support

           B       0.96      0.98      0.97        89
           M       0.96      0.93      0.94        54

    accuracy                           0.96       143
   macro avg       0.96      0.95      0.96       143
weighted avg       0.96      0.96      0.96       143



In [6]:
print("train score - " + str(grid_search.score(X_train, y_train)))
print("test score - " + str(grid_search.score(X_test, y_test)))

train score - 0.9929577464788732
test score - 0.958041958041958


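#### NOTE: best_params_ is an attribute of the fitted GridSearchCV object that holds the parameter combination with the highest mean cross-validated score, and best_score_ is the attribute that holds that score. Because best_score_ is averaged over the validation folds of the training data, it differs from the test score printed above.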
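
#### To see where these values come from, the sketch below prints the full table of cross-validated scores kept in cv_results_. It assumes scikit-learn's bundled breast cancer dataset as a stand-in for the local CSV file, so the numbers need not match the ones in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the bundled dataset instead of the local CSV.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"max_depth": [1, 2, 3, 4, 5, 6], "criterion": ["entropy", "gini"]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=5)
search.fit(X_train, y_train)

# cv_results_ holds one entry per parameter combination.
results = search.cv_results_
for combo, mean_score in zip(results["params"], results["mean_test_score"]):
    print(combo, round(mean_score, 4))

# best_score_ is the highest mean cross-validated score in that table;
# best_estimator_ is the winning model refit on the whole training set.
print(search.best_params_, search.best_score_)
print(search.best_estimator_.score(X_test, y_test))
```

#### Looking at the whole table, not only the winner, shows how close the runner-up combinations are.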

In [7]:
#The best parameters the model used
print(grid_search.best_params_)
print(grid_search.best_score_)

{'criterion': 'entropy', 'max_depth': 5}
0.9341997264021888


## You can also use the KFold class from sklearn to run cross-validation directly

In [8]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import std

cv = KFold(n_splits=10, random_state=1, shuffle=True)
# evaluate model
scores = cross_val_score(decTree_model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Accuracy: 0.913 (0.039)
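
#### A KFold splitter like the one above can also be passed to GridSearchCV through its cv argument. The sketch below assumes scikit-learn's bundled breast cancer dataset as a stand-in for the local CSV file and uses a hypothetical small grid to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the bundled dataset instead of the local CSV.
X, y = load_breast_cancer(return_X_y=True)

# The same kind of splitter as above, reused as the cv argument of the search.
cv = KFold(n_splits=10, shuffle=True, random_state=1)
params = {"max_depth": [2, 4, 6]}  # hypothetical small grid

search = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=cv)
search.fit(X, y)

print(search.n_splits_)    # number of folds actually used
print(search.best_params_)
```

#### Passing a splitter object instead of an integer gives control over shuffling and the random seed of the folds.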
