## Hyperparameter Tuning - Classify Raisins

This dataset contains two raisin grain types, Kecimen and Besni, and seven numerical predictor variables for each of its 900 samples. You’re going to use it to implement two hyperparameter tuning methods:

* Grid Search method to tune a Decision Tree Classifier
* Random Search method to tune a Logistic Regression Classifier

### Import Libraries

In [4]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Evaluating Data
The dataset contains 7 predictor variables. Here's a brief description of them.

* `Area` Integer
* `MajorAxisLength` Continuous
* `MinorAxisLength` Continuous
* `Eccentricity` Continuous
* `ConvexArea` Integer
* `Extent` Continuous
* `Perimeter` Continuous


The outcome variable, `Class`, is binary: Kecimen (encoded as 0) or Besni (encoded as 1), matching the encoding applied when the data is read below.

Missing values: 0

### Read and Convert Datasets

In [5]:
# 1. Read the CSV file into a DataFrame
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

raisins = pd.read_csv('Raisin_Dataset.csv')
# Encode the class labels as integers: Kecimen -> 0, Besni -> 1
raisins['Class'] = raisins['Class'].map({'Kecimen': 0, 'Besni': 1})

raisins.head()

    Area  MajorAxisLength  MinorAxisLength  Eccentricity  ConvexArea    Extent  Perimeter  Class
0  87524       442.246011       253.291155      0.819738       90546  0.758651    1184.04      0
1  75166       406.690687       243.032436      0.801805       78789   0.68413   1121.786      0
2  90856       442.267048       266.328318      0.798354       93717  0.637613   1208.575      0
3  45928       286.540559       208.760042      0.684989       47336  0.699599    844.162      0
4  79408        352.19077       290.827533      0.564011       81463  0.792772   1073.251      0


### Scale/Transform Data

Use `StandardScaler().fit()` to fit the scaler on the feature variables, then use `transform()` on X to get the scaled input to our model.
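
The fit-then-transform step described above can be sketched as follows. This is a minimal illustration on a small made-up array, not the raisin data; the names `X_demo`, `scaler`, and `X_scaled` are assumptions introduced for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for two raisin features (hypothetical values)
X_demo = np.array([[87524.0, 442.2],
                   [75166.0, 406.7],
                   [90856.0, 442.3],
                   [45928.0, 286.5]])

scaler = StandardScaler().fit(X_demo)  # learn each column's mean and standard deviation
X_scaled = scaler.transform(X_demo)    # subtract the mean, divide by the standard deviation

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

When a train/test split is involved, a common practice is to fit the scaler on the training portion only and reuse it to transform the test portion, so that no test-set statistics leak into training.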

In [6]:
# 2. Create predictor and target variables, X and y
X = raisins.drop('Class', axis=1)
y = raisins['Class']

In [7]:
# 3. Examine the dataset
print("Number of features:", X.shape[1])
print("Total number of samples:", len(y))
print("Samples belonging to class '1':", y.sum())

Number of features: 7
Total number of samples: 900
Samples belonging to class '1': 450


In [8]:
# 4. Split the data set into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 19)

### Grid Search with Decision Tree Classifier


A decision tree classifier is a reasonable choice for a balanced binary classification problem like this one. Initialize a decision tree classifier named `tree`.

In [9]:
# 5. Create a Decision Tree model
tree = DecisionTreeClassifier()


The `DecisionTreeClassifier()` implementation in scikit-learn has many parameters.

Create a dictionary `parameters` to set up a grid search that explores three values for each of the following two hyperparameters:

* `max_depth`: the maximum tree depth; explore the values 3, 5, and 7.
* `min_samples_split`: the minimum number of samples required to split an internal node; explore the values 2, 3, and 4.

In [10]:
# 6. Dictionary of parameters for GridSearchCV
parameters = {'min_samples_split': [2,3,4], 'max_depth': [3,5,7]}
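
To see what this dictionary expands to, the sketch below enumerates the grid with scikit-learn's `ParameterGrid`. The variable name `combos` is an assumption made for the example.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'min_samples_split': [2, 3, 4], 'max_depth': [3, 5, 7]}

# Expand the dictionary into every combination of the listed values
combos = list(ParameterGrid(parameters))

print(len(combos))  # 3 values x 3 values = 9 combinations
for combo in combos:
    print(combo)
```

Each of the 9 combinations is evaluated with cross-validation. With `GridSearchCV`'s default of 5 folds, that is 45 model fits, plus one final refit of the best combination on the full training set.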

Create a grid search classifier `grid` with `tree` and `parameters` as inputs. Fit the grid search classifier to the training data.

In [11]:
# 7. Create a GridSearchCV model
grid = GridSearchCV(tree, parameters)

# Fit the GridSearchCV model to the training data
grid.fit(X_train, y_train)

In [12]:
# 8. Print the model and hyperparameters obtained by GridSearchCV
print(grid.best_estimator_)

# Print best score
print(grid.best_score_)
# Print the accuracy of the final model on the test data
print(grid.score(X_test, y_test))

DecisionTreeClassifier(max_depth=5, min_samples_split=4)
0.8785185185185185
0.8133333333333334


Convert `grid.cv_results_['params']` and `grid.cv_results_['mean_test_score']` to DataFrames, concatenate them using `pd.concat`, and print the result to view the score for each hyperparameter combination.

In [13]:
# 9. Print a table summarizing the results of GridSearchCV
df = pd.concat([pd.DataFrame(grid.cv_results_['params']), pd.DataFrame(grid.cv_results_['mean_test_score'], columns=['Score'])], axis=1)
print(df)

   max_depth  min_samples_split     Score
0          3                  2  0.859259
1          3                  3  0.859259
2          3                  4  0.862222
3          5                  2  0.865185
4          5                  3  0.868148
5          5                  4  0.878519
6          7                  2  0.854815
7          7                  3  0.848889
8          7                  4  0.845926
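
As an alternative view of the same kind of results, the sketch below pivots `cv_results_` into a depth-by-split table. It runs on synthetic data from `make_classification` rather than the raisin CSV, so the scores it prints will differ from those above; `X_demo`, `y_demo`, `grid_demo`, and `table` are names assumed for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 900 samples, 7 features
X_demo, y_demo = make_classification(n_samples=900, n_features=7, random_state=0)

parameters = {'min_samples_split': [2, 3, 4], 'max_depth': [3, 5, 7]}
grid_demo = GridSearchCV(DecisionTreeClassifier(random_state=0), parameters)
grid_demo.fit(X_demo, y_demo)

# cv_results_ stores each hyperparameter in a 'param_<name>' column
results = pd.DataFrame(grid_demo.cv_results_)
table = results.pivot(index='param_max_depth',
                      columns='param_min_samples_split',
                      values='mean_test_score')
print(table)
```

Laying the scores out as a two-dimensional table makes it easier to see how accuracy changes along each hyperparameter separately.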


### Random Search with Logistic Regression


To perform random search, we need to specify the hyperparameters and the distributions to draw them from. Define a dictionary `distributions` with the keys:

* `penalty`: the type of regularization to apply. Choose a discrete distribution over `'l1'` and `'l2'`.
* `C`: the inverse of the regularization strength (smaller values mean stronger regularization). Choose a uniform distribution between 0 and 100.

In [14]:
# 10. The logistic regression model
lr = LogisticRegression(solver = 'liblinear', max_iter = 1000)

In [15]:
# 11. Define distributions to choose hyperparameters from
from scipy.stats import uniform
distributions = {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)}
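
The sketch below draws candidates from this dictionary using scikit-learn's `ParameterSampler`, to illustrate the kind of sampling a random search relies on. The `random_state=0` argument and the name `samples` are additions for reproducibility and illustration; the notebook itself does not fix a seed.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# uniform(loc, scale) covers the interval [loc, loc + scale], here [0, 100]
distributions = {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)}

# Draw 8 candidate settings; lists are sampled uniformly, distributions via their rvs method
samples = list(ParameterSampler(distributions, n_iter=8, random_state=0))
for sample in samples:
    print(sample)
```

Unlike a grid, the number of candidates is set directly by `n_iter`, independent of how many hyperparameters are searched.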

In [16]:
# 12. Create a RandomizedSearchCV model
clf = RandomizedSearchCV(lr, distributions, n_iter=8)

# Fit the random search model
clf.fit(X_train, y_train)

Print the best estimator and score from the random search you’ve performed.

In [17]:
# 13. Print best estimator and best score
print(clf.best_estimator_)
print(clf.best_score_)

# Print a table summarizing the results of RandomizedSearchCV
df = pd.concat([pd.DataFrame(clf.cv_results_['params']), pd.DataFrame(clf.cv_results_['mean_test_score'], columns=['Accuracy'])] ,axis=1)
print(df.sort_values('Accuracy', ascending = False))

LogisticRegression(C=51.042228978130225, max_iter=1000, penalty='l1',
                   solver='liblinear')
0.8755555555555556
           C penalty  Accuracy
5  51.042229      l1  0.875556
0  99.512370      l2  0.875556
2  73.350067      l2  0.875556
3  63.849969      l2  0.875556
4  16.549337      l2  0.875556
1  64.251571      l1  0.874074
6  73.371187      l1  0.874074
7  82.422529      l1  0.874074
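
The grid search section also reported accuracy on the held-out test data, while the random search above reports only the cross-validated score. A minimal sketch of the equivalent test-set check follows. It uses synthetic data from `make_classification` in place of the raisin CSV, and the names `X_tr`, `X_te`, `y_tr`, `y_te`, `search`, and `test_accuracy`, as well as the fixed `random_state=0` on the search, are assumptions for the example.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption): 900 samples, 7 features
X_demo, y_demo = make_classification(n_samples=900, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=19)

lr_demo = LogisticRegression(solver='liblinear', max_iter=1000)
distributions = {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)}

# Fit the random search on the training split only
search = RandomizedSearchCV(lr_demo, distributions, n_iter=8, random_state=0)
search.fit(X_tr, y_tr)

# score() uses the refitted best estimator, so this is accuracy on unseen data
test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(test_accuracy)
```

Comparing this test accuracy with `best_score_` gives a rough indication of whether the tuned model generalizes beyond the cross-validation folds.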
