# Classify Raisins with Hyperparameter Tuning Project

The dataset used in this project is posted on [Kaggle](https://www.kaggle.com/datasets/muratkokludataset/raisin-dataset).

There are two raisin varieties in this dataset, *Kecimen* and *Besni*. The dataset contains 900 samples, each described by seven numerical predictor variables. This project classifies the two raisin varieties with machine learning models whose hyperparameters are tuned by grid search and random search.

### Explore the Dataset

In [25]:
# Setup
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

raisins = pd.read_excel('Raisin_Dataset.xlsx')
raisins.head()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter | Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 87524 | 442.246011 | 253.291155 | 0.819738 | 90546 | 0.758651 | 1184.04 | Kecimen |
| 1 | 75166 | 406.690687 | 243.032436 | 0.801805 | 78789 | 0.68413 | 1121.786 | Kecimen |
| 2 | 90856 | 442.267048 | 266.328318 | 0.798354 | 93717 | 0.637613 | 1208.575 | Kecimen |
| 3 | 45928 | 286.540559 | 208.760042 | 0.684989 | 47336 | 0.699599 | 844.162 | Kecimen |
| 4 | 79408 | 352.19077 | 290.827533 | 0.564011 | 81463 | 0.792772 | 1073.251 | Kecimen |


In [15]:
# Preprocess the data before building a model: encode the target as Kecimen -> 0, Besni -> 1
raisins['Class'] = raisins.Class.apply(lambda x: 0 if x=='Kecimen' else 1 if x=='Besni' else None)

In [16]:
# Create predictor and target variables, X and y
X = raisins.drop(columns=['Class'])
y = raisins['Class']

In [17]:
# Examine the dataset
features_num = len(X.columns)  # count predictors only; raisins.columns would also include the target 'Class'
sam_num = len(X)
class_1_num = y.sum()  # 'Besni' is encoded as 1, so the sum counts the Besni samples
print('total number of features: ', features_num)
print('total number of samples: ', sam_num)
print('Number of samples belonging to class "1": ', class_1_num)

total number of features:  7
total number of samples:  900
Number of samples belonging to class "1":  450


In [18]:
# Split the data set into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=19)
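
The split above does not pass `test_size`, so it relies on scikit-learn's default. As a rough sketch of the resulting set sizes, assuming the default test fraction of 0.25 and that the test count is rounded up (the variable names here are illustrative only):

```python
import math

n_samples = 900        # number of samples reported for the raisin dataset
test_fraction = 0.25   # assumption: the train_test_split default when test_size is omitted

# The test set takes the rounded-up share; the training set gets the remainder
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print('training samples:', n_train)
print('testing samples: ', n_test)
```

Under these assumptions the test set holds 225 samples, so each correctly classified test sample contributes 1/225 to the test accuracies reported later.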

### Use Grid Search with Decision Tree Classifier

In [19]:
# Create a Decision Tree model
tree = DecisionTreeClassifier()

In [20]:
# Dictionary of parameters for GridSearchCV
parameters = {'max_depth': [3,5,7], 'min_samples_split':[2,3,4]}

In [21]:
# Create a GridSearchCV model
grid = GridSearchCV(tree, parameters)

# Fit the GridSearchCV model to the training data
grid.fit(X_train, y_train)
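
To make the size of this search concrete, here is a minimal sketch that enumerates the same parameter combinations with the standard library. It only illustrates which candidates a grid search visits; it is not how `GridSearchCV` is implemented. The fold count of 5 is an assumption based on the default in recent scikit-learn versions, since no `cv` argument is passed above.

```python
from itertools import product

# Same grid as the `parameters` dictionary defined above
parameters = {'max_depth': [3, 5, 7], 'min_samples_split': [2, 3, 4]}

# Build every combination of the listed values (Cartesian product)
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 5  # assumption: default number of cross-validation folds
total_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(len(candidates), 'candidates,', total_fits, 'cross-validation fits')
```

A grid search is exhaustive, so the cost grows multiplicatively with every value added to either list.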


In [22]:
# Print the model and hyperparameters obtained by GridSearchCV
print(grid.best_estimator_)

# Print best score
print(grid.best_score_)

# Print the accuracy of the final model on the test data
print(grid.score(X_test, y_test))

DecisionTreeClassifier(max_depth=5, min_samples_split=4)
0.874074074074074
0.8133333333333334


In [23]:
# Print a table summarizing the results of GridSearchCV
hyperparameter_score = pd.DataFrame(grid.cv_results_['mean_test_score'], columns=['score'])
param = pd.DataFrame(grid.cv_results_['params'])

pd.concat([param, hyperparameter_score], axis=1)

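| | max_depth | min_samples_split | score |
|---|---|---|---|
| 0 | 3 | 2 | 0.860741 |
| 1 | 3 | 3 | 0.860741 |
| 2 | 3 | 4 | 0.859259 |
| 3 | 5 | 2 | 0.863704 |
| 4 | 5 | 3 | 0.871111 |
| 5 | 5 | 4 | 0.874074 |
| 6 | 7 | 2 | 0.856296 |
| 7 | 7 | 3 | 0.853333 |
| 8 | 7 | 4 | 0.845926 |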


### Use Random Search with Logistic Regression

In [24]:
# The logistic regression model
lr = LogisticRegression(solver='liblinear', max_iter=1000)

In [26]:
# Define a dictionary, distributions, to choose hyperparameters of 'penalty' and 'C'. 
# Use uniform distribution between 0 and 100 to choose 'C'
distributions = {'penalty':['l1', 'l2'], 'C':uniform(loc=0, scale=100)}

In [27]:
# Create a RandomizedSearchCV model
clf = RandomizedSearchCV(lr, distributions, n_iter=8)

# Fit the random search model
clf.fit(X_train, y_train)
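
Unlike grid search, random search evaluates only a fixed number of sampled candidates (`n_iter=8` above). The sketch below mimics that sampling with the standard library: `random.uniform(0, 100)` stands in for `uniform(loc=0, scale=100)`, and the fixed seed and the helper `sample_candidate` are hypothetical additions for illustration. It will not reproduce the values that `RandomizedSearchCV` actually drew.

```python
import random

rng = random.Random(0)  # assumption: fixed seed so the sketch is repeatable

def sample_candidate(rng):
    """Draw one hyperparameter setting: a penalty from the list, C from [0, 100]."""
    return {
        'penalty': rng.choice(['l1', 'l2']),
        'C': rng.uniform(0, 100),
    }

n_iter = 8  # same number of candidates as the search above
candidates = [sample_candidate(rng) for _ in range(n_iter)]

for candidate in candidates:
    print(candidate)
```

Passing `random_state` to `RandomizedSearchCV` would make the real search repeatable in the same way; the notebook above does not set it, so its sampled values change from run to run.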

In [30]:
# Print best estimator and best score
print('Best estimator with Random Search:', clf.best_estimator_)
print('Best score with Random Search:', round(clf.best_score_, 4))
print('Best score on test data with Random Search:', clf.score(X_test, y_test))

# Print a table summarizing the results of RandomizedSearchCV
df = pd.concat([pd.DataFrame(clf.cv_results_['params']),
                pd.DataFrame(clf.cv_results_['mean_test_score'], columns=['score'])],
               axis=1)

# Sort the values in descending order
df.sort_values('score', ascending=False)

Best estimator with Random Search: LogisticRegression(C=11.160534984186842, max_iter=1000, penalty='l1',
                   solver='liblinear')
Best score with Random Search: 0.8756
Best score on test data with Random Search: 0.88


| | C | penalty | score |
|---|---|---|---|
| 0 | 11.160535 | l1 | 0.875556 |
| 1 | 0.127746 | l1 | 0.875556 |
| 5 | 83.944471 | l2 | 0.875556 |
| 7 | 99.50936 | l2 | 0.875556 |
| 2 | 3.649944 | l1 | 0.874074 |
| 3 | 93.00908 | l1 | 0.874074 |
| 4 | 83.811966 | l2 | 0.874074 |
| 6 | 64.007786 | l1 | 0.874074 |
