# Hyperparameter Tuning: Raisins

This project explores hyperparameter tuning (GridSearchCV, RandomizedSearchCV) by classifying raisins into one of two varieties. The models used are Logistic Regression and Decision Trees.

NOTE: This project is based on Codecademy's [Hyperparameter Tuning project](https://www.codecademy.com/projects/practice/mle-hyperparameter-tuning-project).

## Dataset
The dataset comes from Kaggle: [Raisin](https://www.kaggle.com/datasets/muratkokludataset/raisin-dataset), by Murat Koklu. Briefly, this dataset contains 7 features describing raisin properties and has two classes: 'Kecimen' and 'Besni'.

## Setup 

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

In [25]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [26]:
# import dataset
raisins = pd.read_excel('Raisin_Dataset.xlsx')
raisins.head()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter | Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 87524 | 442.246011 | 253.291155 | 0.819738 | 90546 | 0.758651 | 1184.04 | Kecimen |
| 1 | 75166 | 406.690687 | 243.032436 | 0.801805 | 78789 | 0.68413 | 1121.786 | Kecimen |
| 2 | 90856 | 442.267048 | 266.328318 | 0.798354 | 93717 | 0.637613 | 1208.575 | Kecimen |
| 3 | 45928 | 286.540559 | 208.760042 | 0.684989 | 47336 | 0.699599 | 844.162 | Kecimen |
| 4 | 79408 | 352.19077 | 290.827533 | 0.564011 | 81463 | 0.792772 | 1073.251 | Kecimen |


## Exploratory Data Analysis
The data is briefly explored to check the number of observations, the datatypes, and some descriptive statistics.

In [27]:
raisins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Area             900 non-null    int64  
 1   MajorAxisLength  900 non-null    float64
 2   MinorAxisLength  900 non-null    float64
 3   Eccentricity     900 non-null    float64
 4   ConvexArea       900 non-null    int64  
 5   Extent           900 non-null    float64
 6   Perimeter        900 non-null    float64
 7   Class            900 non-null    object 
dtypes: float64(5), int64(2), object(1)
memory usage: 56.4+ KB


In [28]:
raisins.describe()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter |
|---|---|---|---|---|---|---|---|
| count | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 |
| mean | 87804.127778 | 430.92995 | 254.488133 | 0.781542 | 91186.09 | 0.699508 | 1165.906636 |
| std | 39002.11139 | 116.035121 | 49.988902 | 0.090318 | 40769.290132 | 0.053468 | 273.764315 |
| min | 25387.0 | 225.629541 | 143.710872 | 0.34873 | 26139.0 | 0.379856 | 619.074 |
| 25% | 59348.0 | 345.442898 | 219.111126 | 0.741766 | 61513.25 | 0.670869 | 966.41075 |
| 50% | 78902.0 | 407.803951 | 247.848409 | 0.798846 | 81651.0 | 0.707367 | 1119.509 |
| 75% | 105028.25 | 494.187014 | 279.888575 | 0.842571 | 108375.75 | 0.734991 | 1308.38975 |
| max | 235047.0 | 997.291941 | 492.275279 | 0.962124 | 278217.0 | 0.835455 | 2697.753 |


The datatypes all make sense. Every column has 900 non-null entries and every feature's minimum is above 0, so there appear to be no missing or 0 values.
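
One way to back up that reading is to count missing and zero entries directly. The sketch below rebuilds only the five rows printed by `raisins.head()` earlier (the name `sample` is introduced here for illustration); on the full dataset, the same two expressions could be applied to `raisins` instead.

```python
import pandas as pd

# The five rows shown by raisins.head(), retyped by hand so the example runs standalone.
sample = pd.DataFrame({
    "Area": [87524, 75166, 90856, 45928, 79408],
    "MajorAxisLength": [442.246011, 406.690687, 442.267048, 286.540559, 352.19077],
    "MinorAxisLength": [253.291155, 243.032436, 266.328318, 208.760042, 290.827533],
    "Eccentricity": [0.819738, 0.801805, 0.798354, 0.684989, 0.564011],
    "ConvexArea": [90546, 78789, 93717, 47336, 81463],
    "Extent": [0.758651, 0.68413, 0.637613, 0.699599, 0.792772],
    "Perimeter": [1184.04, 1121.786, 1208.575, 844.162, 1073.251],
    "Class": ["Kecimen"] * 5,
})

# Total number of missing cells across all columns.
missing_count = int(sample.isna().sum().sum())

# Total number of cells equal to 0 across the numeric columns only.
zero_count = int((sample.select_dtypes("number") == 0).sum().sum())

print(f"missing: {missing_count}, zeros: {zero_count}")
```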

The Class column holds strings (dtype `object`), and should be mapped to 0 or 1 to represent the two types of raisins.

## Preprocessing
The class variable is encoded as a binary 0 or 1, and the data is split into training and test sets.

In [29]:
# map class from string to int (1 or 0)
print("Before mapping:", raisins["Class"].unique())
raisins["Class"] = raisins["Class"].map({"Kecimen": 0, "Besni": 1})
print("After mapping:", raisins["Class"].unique()) # verify success of mapping

Before mapping: ['Kecimen' 'Besni']
After mapping: [0 1]
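
The `unique()` check after mapping matters because `Series.map` with a dictionary does not raise on labels missing from the dictionary; it returns `NaN` for them. A minimal sketch of that behavior, using a made-up third label (`"Sultana"` is hypothetical and does not appear in this dataset):

```python
import pandas as pd

# "Sultana" is an invented label used only to show what an unmapped value looks like.
labels = pd.Series(["Kecimen", "Besni", "Sultana"])
encoded = labels.map({"Kecimen": 0, "Besni": 1})

# Labels absent from the mapping become NaN rather than raising an error.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), "unmapped:", n_unmapped)
```

Seeing only `[0 1]` after mapping, as in the cell above, indicates that every label was covered.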


In [30]:
# create predictor and target variables, X and y
y = raisins["Class"]
X = raisins.drop(columns="Class")

num_features = len(X.columns)
num_samples = len(X)
num_1s = y.sum()

print(f"The dataset has {num_features} features, {num_samples} samples, and {num_1s} samples of 1's.")

The dataset has 7 features, 900 samples, and 450 samples of 1's.


In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 14)
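
Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below checks the resulting sizes on placeholder arrays with the same shape and class balance as this dataset (the zero-filled `X_demo` is a stand-in, not the real features). It also shows the optional `stratify` argument, which the original cell does not use, as one way to keep the class ratio similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 900 rows, 7 features, 450 samples per class (as reported above).
X_demo = np.zeros((900, 7))
y_demo = np.repeat([0, 1], 450)

# Same call as in the notebook: default test_size (0.25), fixed random_state.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=14)
print(len(Xd_train), len(Xd_test))  # 675 225

# Optional variant (not in the original): stratify on the labels.
_, _, _, ys_test = train_test_split(X_demo, y_demo, random_state=14, stratify=y_demo)
print("class-1 share in stratified test set:", ys_test.mean())
```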

## Grid Search with Decision Tree Classifier
A Decision Tree classifier will be trained on the data, with its hyperparameters tuned through GridSearchCV. 

In [32]:
# create a Decision Tree model
tree = DecisionTreeClassifier()

# create parameters for grid search
parameters = {'min_samples_split': [2,3,4], 'max_depth': [3,5,7]}

# create and fit GridSearchCV model
grid = GridSearchCV(estimator=tree, param_grid=parameters)
grid.fit(X_train, y_train) # fit to training data

# best model and hyperparameters obtained by GridSearchCV
print(f"The best estimator achieved a score of {grid.best_score_}, based on the following parameters: ")
print(grid.best_estimator_)

# evaluate the best model on the test data
best_test_score = grid.score(X_test, y_test)
print(f"The best estimator achieved a test score of {best_test_score}.")

The best estimator achieved a score of 0.8696296296296296, based on the following parameters: 
DecisionTreeClassifier(max_depth=3)
The best estimator achieved a test score of 0.8266666666666667.


In [33]:
# table summarizing the results of GridSearchCV
df = pd.concat([pd.DataFrame(grid.cv_results_['params']), pd.DataFrame(grid.cv_results_['mean_test_score'], columns=['Score'])], axis=1)
print(df)

   max_depth  min_samples_split     Score
0          3                  2  0.869630
1          3                  3  0.869630
2          3                  4  0.869630
3          5                  2  0.862222
4          5                  3  0.860741
5          5                  4  0.862222
6          7                  2  0.837037
7          7                  3  0.837037
8          7                  4  0.838519
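
The nine rows in this table are the full Cartesian product of the two parameter lists. As a rough illustration of the cost, the sketch below enumerates the candidates with `ParameterGrid` and multiplies by the number of folds. The fold count of 5 is an assumption based on scikit-learn's default `cv` in recent versions, since the notebook does not set `cv` explicitly.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook.
parameters = {'min_samples_split': [2, 3, 4], 'max_depth': [3, 5, 7]}

# ParameterGrid yields one dict per hyperparameter combination.
candidates = list(ParameterGrid(parameters))

n_folds = 5  # assumption: default cv of GridSearchCV when none is given
total_fits = len(candidates) * n_folds  # excludes the final refit on all training data

print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
for params in candidates[:3]:
    print(params)
```

Adding a third hyperparameter with three values would triple this count, which is the main reason to consider a randomized search for larger spaces.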


## Random Search with Logistic Regression
A logistic regression model will be trained on the data, with its hyperparameters tuned through RandomizedSearchCV.

In [34]:
# create a logistic regression model
lr = LogisticRegression(solver="liblinear", max_iter=1000)

# create parameters for random search 
distributions = {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)}

# create and fit RandomizedSearchCV model
clf = RandomizedSearchCV(estimator=lr, param_distributions=distributions, n_iter=8)
clf.fit(X_train, y_train)

# best model and hyperparameters obtained by RandomizedSearchCV
print(f"The best estimator achieved a score of {clf.best_score_}, based on the following parameters: ")
print(clf.best_estimator_)

# evaluate the best model on the test data
best_test_score = clf.score(X_test, y_test)
print(f"The best estimator achieved a test score of {best_test_score}.")

The best estimator achieved a score of 0.8785185185185185, based on the following parameters: 
LogisticRegression(C=np.float64(15.654268914667568), max_iter=1000,
                   penalty='l2', solver='liblinear')
The best estimator achieved a test score of 0.8266666666666667.


In [35]:
# table summarizing the results of RandomizedSearchCV
df = pd.concat([pd.DataFrame(clf.cv_results_['params']), pd.DataFrame(clf.cv_results_['mean_test_score'], columns=['Accuracy'])] ,axis=1)
print(df.sort_values('Accuracy', ascending = False))

           C penalty  Accuracy
3  15.654269      l2  0.878519
4  37.054163      l2  0.878519
5  42.731019      l2  0.877037
7  36.158898      l2  0.877037
1  40.958283      l1  0.872593
2  76.393876      l1  0.872593
6  76.375788      l1  0.872593
0   5.196958      l1  0.871111
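
Unlike the grid search, each of the eight rows here is an independent draw: `penalty` is picked from the list and `C` is sampled from a uniform distribution on [0, 100]. The sketch below reproduces that sampling step in isolation with `ParameterSampler`. The `random_state=0` seed is an assumption added for repeatability and is not used in the notebook, so the drawn values will differ from the table above.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Same search space as in the notebook.
distributions = {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)}

# Draw 8 candidate settings; the seed is an illustrative assumption.
samples = list(ParameterSampler(distributions, n_iter=8, random_state=0))

for s in samples:
    print(f"penalty={s['penalty']}, C={s['C']:.3f}")
```

Passing a `random_state` to `RandomizedSearchCV` in the same way would likely make the search repeatable from run to run.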
