[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nickdlc/CSc448-Projects/blob/main/Assignment4/Iris-Analysis.ipynb)

In [867]:
from sklearn.datasets import load_iris

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# Load the Dataset

In [868]:
# Load the iris dataset and show the first 15 rows
iris = load_iris(as_frame=True)
df = iris['frame']
df.head(15)

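|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |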


In [869]:
# Show a summary of the dataset
df.describe()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


In [870]:
# Check for null values
df.isna().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

In [871]:
# Get unique values of 'target' column
df['target'].unique()

array([0, 1, 2])

## About the Dataset

The dataset contains 150 flowers and records four different characteristics for each flower:
* Sepal length (cm)
* Sepal width (cm)
* Petal length (cm)
* Petal width (cm)

The dataset also contains a `target` column whose values come from the set {0, 1, 2}. Each flower's four features, ($x_1$, $x_2$, $x_3$, $x_4$), map to one label $y$ in this set, so the features serve as the inputs to a classifier and `target` is the class to predict. The three labels correspond to the three iris species in the dataset (setosa, versicolor, and virginica), so the task is to classify a flower's species from its sepal and petal measurements.

# Split the Dataset

In [872]:
from sklearn.model_selection import train_test_split

# Divide the dataset into features and label(s)
X = df.drop(['target'], axis=1)
y = df['target']

# Split the data into train (90%) and test (10%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=22)

# Split the data into train (67%) and test (33%) sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=22)
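As a quick sanity check on what the two splits mean in sample counts, the sketch below reproduces the arithmetic for a 150-row dataset. It assumes that a fractional test size is rounded up to a whole number of rows and that the training set receives the remainder; `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives whatever rows are left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# The two splits used in this notebook, applied to the 150 iris rows.
for fraction in (0.1, 0.33):
    n_train, n_test = split_sizes(150, fraction)
    print(f"test_size={fraction}: {n_train} train rows, {n_test} test rows")
```

Under that assumption the 90%/10% split leaves 15 rows for testing and the 67%/33% split leaves 50, which is worth keeping in mind when reading the scores later on.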

# Logistic Regression

## Training the Logistic Regression Model

In [873]:
from sklearn.linear_model import LogisticRegression

# Train the model using our training data
logreg = LogisticRegression(random_state=22, max_iter=200) # set max_iter to reach convergence and avoid warning
logreg.fit(X_train, y_train)

# Predict the output for each flower in the test set
y_pred = logreg.predict(X_test)

## Predicting Probabilities for Each Class for Sample Data Points

In [893]:
# Create three new points and predict the probabilities for each class
X0 = [[15.8, 8.8, 13.8, 5], [2.15, 1, 0.5, 0.05], [5.8, 3, 4.35, 1.3]]
X0_pred = pd.DataFrame(X0, columns=X_test.columns)
y0_probs = logreg.predict_proba(X0_pred)

for i in range(len(y0_probs)):
    print('Point X' + str(i))
    for j in range(len(y0_probs[i])):
        print('Probability of class', str(j) + ':', y0_probs[i][j])
    print()

Point X0
Probability of class 0: 9.886265607331051e-25
Probability of class 1: 8.855563560036231e-13
Probability of class 2: 0.9999999999991145

Point X1
Probability of class 0: 0.9953922955320623
Probability of class 1: 0.004607701365240459
Probability of class 2: 3.102697239824551e-09

Point X2
Probability of class 0: 0.013854012075532204
Probability of class 1: 0.8984185571453003
Probability of class 2: 0.08772743077916749



In [901]:
# Verify that the model chooses the class with the highest probability
y0_pred = logreg.predict(X0_pred)

for i in range(len(y0_pred)):
    print('Given a new point X' + str(i), 'the model classifies it as iris', y0_pred[i])

Given a new point X0 the model classifies it as iris 2
Given a new point X1 the model classifies it as iris 0
Given a new point X2 the model classifies it as iris 1
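The check above amounts to taking the index of the largest probability in each row. The following standalone sketch repeats that step on the logistic regression probabilities printed earlier (copied by hand and rounded), without needing the fitted model; `argmax` is a small helper defined here for illustration.

```python
# Probabilities printed above for the three sample points (rounded).
probs = [
    [9.886e-25, 8.856e-13, 0.9999999999991145],  # Point X0
    [0.99539, 0.00461, 3.103e-09],               # Point X1
    [0.01385, 0.89842, 0.08773],                 # Point X2
]


def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=values.__getitem__)


# The predicted class is the column with the highest probability.
predicted = [argmax(row) for row in probs]
print(predicted)  # [2, 0, 1]
```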


## Performance Analysis

In [902]:
# Display the model's score
logreg_score = logreg.score(X_test, y_test)
# logreg_score = logreg.score(X0_pred, y0_pred)
print('The model\'s score is:', logreg_score)

The model's score is: 1.0


The logistic regression model has a score of 1.0 on the test set, meaning it correctly classified all 15 held-out flowers. Because the test set is this small, the score is an optimistic estimate rather than a guarantee of 100% accuracy on new data. The new sample points have no true labels, so they do not contribute to the score.

With a 67%/33% train-test split, the score is 0.96, meaning the model correctly classifies about 96% of the flowers in the larger test set.
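One way to see how much a score from a small test set can be trusted is to put an interval around it. The sketch below computes an approximate 95% Wilson score interval; this is an illustrative addition, not part of the original analysis. The counts 15/15 and 48/50 are assumptions inferred from the reported scores and the two split sizes.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


# Assumed counts: 15 of 15 correct (90/10 split), 48 of 50 correct (67/33 split).
for correct, total in [(15, 15), (48, 50)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total} correct: accuracy plausibly between {low:.2f} and {high:.2f}")
```

Even a perfect score on 15 samples is compatible with a true accuracy well below 100%, which is why the larger test set gives the more informative estimate.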

## Results

In [903]:
logreg_coefs = logreg.coef_
logreg_intercept = logreg.intercept_

print('The coefficients are:', logreg_coefs)
print('The intercepts are:', logreg_intercept)

The coefficients are: [[-0.41663174  0.95122844 -2.44264802 -1.04324609]
 [ 0.4532748  -0.32750687 -0.1992296  -0.80915027]
 [-0.03664306 -0.62372158  2.64187762  1.85239636]]
The intercepts are: [  9.53861062   2.43822503 -11.97683565]
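To connect these numbers to the probabilities printed earlier, the sketch below applies a softmax to the linear scores $w_k \cdot x + b_k$ for the median point `[5.8, 3, 4.35, 1.3]`. It assumes the fitted model uses the multinomial (softmax) formulation, which is consistent with the printed output; the coefficients are copied by hand from the cell above, so the result only matches to the printed precision.

```python
import math

# Coefficients and intercepts copied from the output above (one row per class).
coefs = [
    [-0.41663174, 0.95122844, -2.44264802, -1.04324609],
    [0.4532748, -0.32750687, -0.1992296, -0.80915027],
    [-0.03664306, -0.62372158, 2.64187762, 1.85239636],
]
intercepts = [9.53861062, 2.43822503, -11.97683565]


def softmax_probabilities(x):
    """Class probabilities from linear scores, assuming a softmax link."""
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(coefs, intercepts)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Third sample point (the feature medians); compare with "Point X2" above.
print(softmax_probabilities([5.8, 3, 4.35, 1.3]))
```

The three values should land close to the probabilities reported for Point X2 (roughly 0.014, 0.898, and 0.088).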


# Support Vector Machine

## Training the Support Vector Model

In [878]:
from sklearn.svm import SVC

svclf = SVC(random_state=22, probability=True)
svclf.fit(X_train, y_train)

y_pred_svm = svclf.predict(X_test)

## Predicting Probabilities for Each Class for Sample Data Points

In [904]:
# Use the new point X0 and predict the probabilities for each classification
y0_pred_svm = svclf.predict(X0_pred)
y0_probs_svm = svclf.predict_proba(X0_pred)

for i in range(len(y0_probs_svm)):
    print('Point X' + str(i))
    for j in range(len(y0_probs_svm[i])):
        print('Probability of class', str(j) + ':', y0_probs_svm[i][j])
    print()

Point X0
Probability of class 0: 0.3563909475965388
Probability of class 1: 0.2414443096329036
Probability of class 2: 0.4021647427705573

Point X1
Probability of class 0: 0.8143933713532433
Probability of class 1: 0.09096295563272797
Probability of class 2: 0.094643673014029

Point X2
Probability of class 0: 0.00899296022121033
Probability of class 1: 0.9636723559621025
Probability of class 2: 0.027334683816687192



In [906]:
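# Verify that the model chooses the class with the highest probability
# Note: SVC calibrates predict_proba separately, so its argmax can occasionally differ from predict()
for i in range(len(y0_pred_svm)):
    print('Given a new point X' + str(i), 'the model classifies it as iris', y0_pred_svm[i])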

Given a new point X0 the model classifies it as iris 2
Given a new point X1 the model classifies it as iris 0
Given a new point X2 the model classifies it as iris 1


## Performance Analysis

In [881]:
# Display the model's score
svm_score = svclf.score(X_test, y_test)
# svm_score = svclf.score(X0_pred, y0_pred_svm)
print('The model\'s score is:', svm_score)

The model's score is: 0.9333333333333333


The support vector machine model has a score of 0.93 on the test set, meaning it correctly classified about 93% of the held-out flowers. As before, the score is computed only on the test set; the new sample points have no true labels to score against.

With a 67%/33% train-test split, the score is 0.92, meaning the model correctly classifies about 92% of the flowers in the larger test set.
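With test sets this small, it can be easier to read the scores as error counts. The sketch below converts an accuracy into the number of misclassified flowers, assuming test sets of 15 and 50 rows for the two splits (10% and 33% of 150, rounded up); `misclassified` is a hypothetical helper.

```python
def misclassified(score, n_test):
    """Number of wrong predictions implied by an accuracy score."""
    return round((1 - score) * n_test)


# Assumed test-set sizes: 15 rows (90/10 split) and 50 rows (67/33 split).
print(misclassified(0.9333333333333333, 15))  # 1
print(misclassified(0.92, 50))                # 4
```

Under those assumptions, the gap between this model and a perfect score on the smaller test set is a single flower.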

# Neural Network

## Training the Neural Network

In [882]:
from sklearn.neural_network import MLPClassifier

mlpclf = MLPClassifier(random_state=22, max_iter=600) # set max_iter to reach convergence and avoid warning
mlpclf.fit(X_train, y_train)

y_pred_mlp = mlpclf.predict(X_test)

## Predicting Probabilities for Each Class for Sample Data Points

In [907]:
# Use the new point X0 and predict the probabilities for each classification
y0_pred_mlp = mlpclf.predict(X0_pred)
y0_probs_mlp = mlpclf.predict_proba(X0_pred)

for i in range(len(y0_probs_mlp)):
    print('Point X' + str(i))
    for j in range(len(y0_probs_mlp[i])):
        print('Probability of class', str(j) + ':', y0_probs_mlp[i][j])
    print()

Point X0
Probability of class 0: 1.0251726222892183e-07
Probability of class 1: 0.0030765872326520725
Probability of class 2: 0.9969233102500857

Point X1
Probability of class 0: 0.7307758194364437
Probability of class 1: 0.25328678261147697
Probability of class 2: 0.015937397952079546

Point X2
Probability of class 0: 0.041943962272740824
Probability of class 1: 0.7176115635114697
Probability of class 2: 0.24044447421578938



In [908]:
# Verify that the model chooses the class with the highest probability
for i in range(len(y0_pred_mlp)):
    print('Given a new point X' + str(i), 'the model classifies it as iris', y0_pred_mlp[i])

Given a new point X0 the model classifies it as iris 2
Given a new point X1 the model classifies it as iris 0
Given a new point X2 the model classifies it as iris 1


## Performance Analysis

In [885]:
# Display the model's score
mlp_score = mlpclf.score(X_test, y_test)
# mlp_score = mlpclf.score(X0_pred, y0_pred_mlp)
print('The model\'s score is:', mlp_score)

The model's score is: 1.0


The neural network (MLP classifier) has a score of 1.0, meaning it correctly classified every flower in the test set. The score is computed only on the test set; the new sample points have no true labels to score against.

With a 67%/33% train-test split, the score is also 1.0, so the model classifies every flower in the larger test set correctly as well.

## Experimenting with Different Options

### Activation Options

In [886]:
activation_options = ['identity', 'logistic', 'tanh'] # relu is the default

for option in activation_options:
    print('Using activation option: ' + option.upper())
    mlpclf = MLPClassifier(random_state=22, max_iter=750, activation=option)
    mlpclf.fit(X_train, y_train)
    mlp_score = mlpclf.score(X_test, y_test)
    print('The model\'s score is:', mlp_score, '\n')

Using activation option: IDENTITY
The model's score is: 1.0 

Using activation option: LOGISTIC
The model's score is: 1.0 

Using activation option: TANH
The model's score is: 1.0 



The `activation` option sets the activation function for the hidden layer(s) of the neural network. Here it has no visible effect on the score: the three options tested and the default `relu` all yield a score of 1.0 with the 90%/10% train-test split. A perfect score on held-out data does not by itself indicate overfitting. A more likely explanation is that the test set is too small, and the classes too easy to separate, for differences between activation functions to show up.
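For reference, the four activation functions named above can be written out directly. This is a minimal sketch of the standard scalar definitions, not scikit-learn's vectorized implementation; the `ACTIVATIONS` dictionary is just a convenience for the loop.

```python
import math


def identity(x):
    return x


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


ACTIVATIONS = {"identity": identity, "logistic": logistic, "tanh": tanh, "relu": relu}

# Evaluate each function at a negative, zero, and positive input.
for name, fn in ACTIVATIONS.items():
    print(name, [round(fn(v), 3) for v in (-2.0, 0.0, 2.0)])
```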

### Max Iteration Options

In [887]:
import warnings
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings('ignore', category=ConvergenceWarning) # suppress convergence warnings

for i in range(50, 300, 50):
    print('Max iterations:', i)
    mlpclf = MLPClassifier(random_state=22, max_iter=i)
    mlpclf.fit(X_train, y_train)
    mlp_score = mlpclf.score(X_test, y_test)
    print('The model\'s score is:', mlp_score, '\n')

Max iterations: 50
The model's score is: 0.6 

Max iterations: 100
The model's score is: 0.9333333333333333 

Max iterations: 150
The model's score is: 1.0 

Max iterations: 200
The model's score is: 1.0 

Max iterations: 250
The model's score is: 1.0 



The `max_iter` option caps the number of training iterations if the solver has not converged by then. In this case, the model does not converge until `max_iter` is about 600, so every run in this loop stops early; the resulting convergence warnings were suppressed to avoid cluttering the output. With fewer iterations the model performs worse (0.6 at 50 iterations, 0.93 at 100), which makes sense because it has had less time to learn. The test score nevertheless reaches 1.0 at 150 iterations, long before the model converges.
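The interaction between an iteration cap and a convergence check can be illustrated with a toy optimizer. The sketch below runs plain gradient descent on a one-dimensional quadratic and stops either when the loss improves by less than `tol` or when `max_iter` is reached. It is an analogy only: the function, learning rate, and tolerance are made up for illustration, and this is not the solver `MLPClassifier` uses.

```python
def gradient_descent(max_iter, lr=0.1, tol=1e-4):
    """Minimise (w - 3)^2 starting from w = 0.

    Returns (w, iterations_used, converged).
    """
    w = 0.0
    prev_loss = (w - 3.0) ** 2
    for n_iter in range(1, max_iter + 1):
        w -= lr * 2.0 * (w - 3.0)       # gradient step
        loss = (w - 3.0) ** 2
        if prev_loss - loss < tol:      # improvement too small: converged
            return w, n_iter, True
        prev_loss = loss
    return w, max_iter, False           # hit the cap before converging


for cap in (5, 200):
    w, used, converged = gradient_descent(cap)
    print(f"max_iter={cap}: w={w:.3f}, iterations={used}, converged={converged}")
```

With a small cap the run stops while the estimate is still far from the optimum; with a generous cap the convergence check ends the run well before the cap is reached.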

# K-Nearest Neighbors

## Training the KNN Model

In [888]:
from sklearn.neighbors import KNeighborsClassifier

knnclf = KNeighborsClassifier()
knnclf.fit(X_train, y_train)

y_pred_knn = knnclf.predict(X_test)

## Predicting Probabilities for Each Class for Sample Data Points

In [909]:
# Use the new point X0 and predict the probabilities for each classification
y0_pred_knn = knnclf.predict(X0_pred)
y0_probs_knn = knnclf.predict_proba(X0_pred)

for i in range(len(y0_probs_knn)):
    print('Point X' + str(i))
    for j in range(len(y0_probs_knn[i])):
        print('Probability of class', str(j) + ':', y0_probs_knn[i][j])
    print()

Point X0
Probability of class 0: 0.0
Probability of class 1: 0.0
Probability of class 2: 1.0

Point X1
Probability of class 0: 1.0
Probability of class 1: 0.0
Probability of class 2: 0.0

Point X2
Probability of class 0: 0.0
Probability of class 1: 1.0
Probability of class 2: 0.0
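The probabilities above are all exactly 0.0 or 1.0 because, with uniform weights, a k-nearest-neighbors classifier reports the fraction of the k closest training points that belong to each class. With the default of five neighbors, every probability is a multiple of 0.2. The sketch below shows the idea on made-up two-dimensional data; `knn_proba` and the toy points are hypothetical and are not drawn from the iris dataset.

```python
import math
from collections import Counter


def knn_proba(train_points, train_labels, query, k=5, n_classes=3):
    """Class probabilities as the share of the k nearest neighbours per class."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return [votes[c] / k for c in range(n_classes)]


# Hypothetical toy data: three points of class 0, three of class 1, one of class 2.
points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
          [9.0, 9.0]]
labels = [0, 0, 0, 1, 1, 1, 2]

print(knn_proba(points, labels, [1.0, 1.0]))  # [0.6, 0.4, 0.0]
print(knn_proba(points, labels, [4.0, 4.0]))  # [0.4, 0.6, 0.0]
```

A probability of 1.0 for a sample point therefore means only that all of its nearest neighbors share one class, not that the model is certain in a calibrated sense.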



In [911]:
# Verify that the model chooses the class with the highest probability
for i in range(len(y0_pred_knn)):
    print('Given a new point X' + str(i), 'the model classifies it as iris', y0_pred_knn[i])

Given a new point X0 the model classifies it as iris 2
Given a new point X1 the model classifies it as iris 0
Given a new point X2 the model classifies it as iris 1


## Performance Analysis

In [891]:
# Display the model's score
knn_score = knnclf.score(X_test, y_test)
# knn_score = knnclf.score(X0_pred, y0_pred_knn)
print('The model\'s score is:', knn_score)

The model's score is: 1.0


The KNN classifier has a score of 1.0, meaning it correctly classified every flower in the test set. The score is computed only on the test set; the new sample points have no true labels to score against.

With a 67%/33% train-test split, the score is 0.96, meaning the model correctly classifies about 96% of the flowers in the larger test set.

# Conclusion
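One way to compare the four models is to tabulate the test-set scores reported in the sections above. The sketch below only restates those numbers (the SVM's 90%/10% score is rounded to four decimals) and sorts the models by their score on the larger 67%/33% test set; it does not retrain anything.

```python
# Test-set scores reported above, as (90/10 split, 67/33 split).
scores = {
    "Logistic Regression": (1.0, 0.96),
    "Support Vector Machine": (0.9333, 0.92),
    "Neural Network (MLP)": (1.0, 1.0),
    "K-Nearest Neighbors": (1.0, 0.96),
}

# Print a small comparison table, best 67/33 score first.
print(f"{'Model':<24}{'90/10':>8}{'67/33':>8}")
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
for name, (score_90, score_67) in ranked:
    print(f"{name:<24}{score_90:>8.2f}{score_67:>8.2f}")
```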