# Iris Data Set
In this notebook, we work with the well-known Iris data set and classify flowers by species based on their measured features. Let's first load the data set.

In [48]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [49]:
column_names = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'class']
iris_data = pd.read_csv('iris.data', names=column_names)
iris_data

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


Let's classify the data using logistic regression. First, let's split the data into training and test sets, stratifying on the class label so that both sets keep the same class proportions.

In [51]:
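# Features are every column except the last; the last column holds the class label
X = iris_data.iloc[:, :-1]
y = iris_data.iloc[:, -1:]

# Stratify on the label so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100, stratify=y)
# Flatten the single-column label frames to 1-D arrays, as scikit-learn expects
y_train = y_train.to_numpy().flatten()
y_test = y_test.to_numpy().flatten()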
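
As a quick check of what the stratified split produces, the sketch below prints the size and class counts of each split. It assumes `sklearn.datasets.load_iris` as a stand-in for the local `iris.data` file, so the labels are integers rather than species names. With the default `test_size` of 0.25, the 150 samples are divided into 112 training and 38 test samples.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: load_iris() stands in for the local 'iris.data' file.
X, y = load_iris(return_X_y=True)

# Same split settings as the cell above (default test_size=0.25).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100, stratify=y
)

# Count how many samples of each class landed in each split.
train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())

print("Train size:", len(y_train), "class counts:", dict(train_counts))
print("Test size:", len(y_test), "class counts:", dict(test_counts))
```

Because the three classes are equally frequent, stratification keeps the per-class counts within one sample of each other in both splits.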

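Now that we have our training data, we can use k-fold cross-validation to find the best-performing logistic regression model. We search over 20 logarithmically spaced values of `C`, the inverse regularization strength, using 10 folds.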

In [77]:
logistic_regression = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
params = {
    'C': np.logspace(-6, 1, 20)
}

clf = GridSearchCV(estimator=logistic_regression, param_grid=params, cv=10)

In [78]:
clf.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': array([1.00000e-06, 2.33572e-06, 5.45559e-06, 1.27427e-05, 2.97635e-05,
       6.95193e-05, 1.62378e-04, 3.79269e-04, 8.85867e-04, 2.06914e-03,
       4.83293e-03, 1.12884e-02, 2.63665e-02, 6.15848e-02, 1.43845e-01,
       3.35982e-01, 7.84760e-01, 1.83298e+00, 4.28133e+00, 1.00000e+01])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
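
To make the cross-validation step concrete, here is a minimal sketch of what the grid search computes for a single candidate: the model is fit on 9 folds and scored on the remaining one, 10 times, and the fold accuracies are averaged. The sketch assumes `load_iris` in place of `iris.data`, fixes `C=1.0` purely for illustration, and omits `multi_class` because recent scikit-learn versions deprecate that argument.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Assumption: load_iris() stands in for the local 'iris.data' file.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100, stratify=y
)

# One candidate from the grid; C=1.0 is chosen only for illustration.
model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=200)

# For a classifier, an integer cv=10 corresponds to stratified 10-fold splitting.
cv = StratifiedKFold(n_splits=10)

# One accuracy value per held-out fold.
fold_scores = cross_val_score(model, X_train, y_train, cv=cv)
mean_cv_accuracy = fold_scores.mean()

print("Fold accuracies:", fold_scores)
print("Mean cross-validated accuracy:", mean_cv_accuracy)
```

`GridSearchCV` repeats this procedure for every value of `C` in the grid and keeps the candidate with the highest mean score.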

In [76]:
clf.best_score_

0.9732142857142857
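
Note that `best_score_` is the mean cross-validated accuracy of the best candidate, not a hyperparameter value. The selected value of `C` is stored in `best_params_`. The sketch below reads both; it assumes `load_iris` in place of `iris.data` and omits `multi_class`, so the exact numbers may differ slightly from the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: load_iris() stands in for the local 'iris.data' file.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100, stratify=y
)

# Same grid as above: 20 values of C between 1e-6 and 10.
param_grid = {'C': np.logspace(-6, 1, 20)}
clf = GridSearchCV(
    estimator=LogisticRegression(solver='lbfgs', max_iter=200),
    param_grid=param_grid,
    cv=10,
)
clf.fit(X_train, y_train)

best_C = clf.best_params_['C']      # selected inverse regularization strength
best_cv_accuracy = clf.best_score_  # mean cross-validated accuracy for that C

print("Best C:", best_C)
print("Best mean CV accuracy:", best_cv_accuracy)
```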

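The best mean cross-validated accuracy is about 0.97. Note that this is a score, not the regularization parameter; the selected value of `C` is available as `clf.best_params_`. Because `GridSearchCV` refits the best model on the full training set (`refit=True`), we can use `clf` directly to compute the training and testing accuracies.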

In [79]:
y_pred_train = clf.predict(X_train)
print("Training accuracy", accuracy_score(y_train, y_pred_train))

y_pred = clf.predict(X_test)
print("Testing accuracy", accuracy_score(y_test, y_pred))

Training accuracy 0.9821428571428571
Testing accuracy 0.9736842105263158
