# Machine Learning & Statistics - Project

***

Martin Cusack  

Student I.D: G00239124

***

## How to use this repository

Run using Python version 3.9.18

***

 ## Introduction
 
 In this notebook, I will deploy a variety of machine learning algorithms to make classifications and predictions based on the widely-studied Fisher's Iris flower data set.  This data set first appeared in biologist Ronald Fisher's 1936 paper "The Use of Multiple Measurements in Taxonomic Problems".  Fisher's work has been highly influential in the fields of data science and machine learning over the decades since its publication and has been referenced on innumerable occasions by teachers and academics.

 Fisher's data set records measurements for three species of the Iris flower (Iris setosa, Iris virginica and Iris versicolor). The copy used in this notebook contains five columns: sepal_length, sepal_width, petal_length, petal_width and class. I will fit and train a number of machine learning algorithms on this data in order to assess their relative accuracy in determining the class of each sample.


 ### Supervised learning

 There are two main approaches to machine learning: supervised learning and unsupervised learning.  Supervised learning is generally used with labeled datasets. Each training sample pairs a set of inputs with a known output, and the algorithm learns a mapping from one to the other so that it can classify or predict the output for new samples.
 
 By contrast, unsupervised learning is generally used with unlabeled datasets. These algorithms look for structure in the data without being told the correct outputs, hence "unsupervised".  Unsupervised techniques are often used in the preprocessing stage of machine learning, for example to reduce the dimensionality of a given data set. [1]
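
 The difference shows up directly in how an estimator is fitted. The following is a minimal sketch, assuming scikit-learn's bundled copy of the Iris data (`load_iris`) rather than the CSV file used later in this notebook; the choice of KNN and PCA as the two examples is also an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Assumption: use scikit-learn's bundled Iris data instead of the CSV file.
X, y = load_iris(return_X_y=True)

# Supervised: the estimator is fitted on the features AND the labels.
knn = KNeighborsClassifier().fit(X, y)
predicted = knn.predict(X[:3])

# Unsupervised: the estimator only ever sees the features.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(predicted.shape)  # (3,)
print(X_reduced.shape)  # (150, 2)
```

 Here PCA reduces the four measurements to two components without ever being shown the species labels.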
 

 ### Classification algorithms

 Classification algorithms are used in the machine learning process in order to make predictions and categorise data (an example being an email spam detector).  Examples of commonly-used classifiers include K Nearest Neighbors, Support Vector Machines (SVM), Naive Bayes and Random Forest.
 

 [1] *Supervised vs. Unsupervised Learning*, Julianna Delua. https://www.ibm.com/blog/supervised-vs-unsupervised-learning/

 ***

In [1]:
# Data frames.
import pandas as pd

# numpy
import numpy as np

# Machine Learning.
import sklearn as sk

# Nearest neighbors.
import sklearn.neighbors as ne

# Preprocessing.
import sklearn.preprocessing as pre

# Decomposition.
import sklearn.decomposition as dec

# Statistical test.
import scipy.stats as ss

# Plots.
import matplotlib.pyplot as plt

# Statistical plots.
import seaborn as sns

First, I import the Iris data set from a .csv file. I sourced the data set from __[Kaggle](https://www.kaggle.com/datasets/saurabh00007/iriscsv?resource=download)__. [2]

[2] *Iris.csv*, Saurabh Singh. https://www.kaggle.com/datasets/saurabh00007/iriscsv


In [2]:
# import iris data set using Pandas
# (a forward slash avoids the invalid "\i" escape and works on Windows, macOS and Linux)
df = pd.read_csv("Data/iris.csv")

### Summary statistics

In the following cells, I will summarise the data set using some basic Pandas functions to gain a better understanding of the dimensions of the data. It will also be useful to determine whether there are any anomalous features in the data, such as null values, which may affect any calculations later in the project.

In [3]:
# summary statistics - display number of rows and columns
df.shape

(150, 5)

In [4]:
# summary statistics - show first 10 rows
df.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [5]:
# Drop all rows with null values from data set
df_nona = df.dropna()

# show
df_nona

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


As the summary above shows, there are no null values in the data set, so it can be passed to a machine learning algorithm without further cleaning.
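
Null values can also be counted directly rather than inferred from the displayed frame. The sketch below uses a small hypothetical frame (`sample`, with one missing value added on purpose) in place of the Iris data, so the names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Iris data, with one missing value on purpose.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan],
    "petal_length": [1.4, 1.4, 1.3],
    "class": ["setosa", "setosa", "setosa"],
})

# Number of missing values in each column.
null_counts = sample.isna().sum()

# True only when no cell in the frame is missing.
has_no_nulls = not sample.isna().any().any()

# dropna() removes the incomplete row, so one row is lost here.
rows_removed = len(sample) - len(sample.dropna())

print(null_counts)
print(has_no_nulls, rows_removed)
```

Applied to the notebook's `df`, a `rows_removed` of zero would confirm that `dropna()` changed nothing.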
***

### Machine Learning Algorithms - K Nearest Neighbors

K Nearest Neighbors (KNN) is one of the most widely-used machine learning algorithms, in part due to its simplicity. KNN is a supervised learning model which works by identifying a specified number of proximate data points, so that "classification is computed from a simple majority vote of the nearest neighbors of each point".[3]
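
The majority vote can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the function name `knn_predict`, the toy points and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and return their most common label.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy (petal length, petal width) pairs; the values are illustrative only.
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.1), (4.5, 1.5), (4.7, 1.4), (4.9, 1.5)]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

print(knn_predict(points, labels, (1.6, 0.3), k=3))  # setosa
```

The query point sits next to the three setosa samples, so all three votes go to setosa.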

In the cells below, I set up and train an instance of KNN to make predictions based on the Iris data set. This is an example of multiclass classification, where "class" is the target variable and takes one of three values (setosa, versicolor or virginica).

[3] *Nearest Neighbors*, scikit-learn developers. https://scikit-learn.org/stable/modules/neighbors.html
***

To run KNN on the Iris data set, we must fit a new instance of the classifier to the data. To do this, we first define the X values (the four features of each sample: sepal length, sepal width, petal length and petal width) and the y value (the class of each sample).

In [6]:
# define the X values
X = df_nona[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

X

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [7]:
# define y value
y = df_nona['class']

y

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: class, Length: 150, dtype: object

In [8]:
# Create new instance of K Nearest Neighbors classifier (default n_neighbors=5)
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()


# Fit the data.
clf.fit(X, y)

In [9]:
# sanity check: predict the class of the first X sample ('setosa')
# (this sample was part of the training data, so it is not a measure of accuracy)
clf.predict(X.iloc[:1])

array(['setosa'], dtype=object)

In [10]:
# import train_test_split function and split dataset into training set and test set
# (by default 75% of the rows go to the training set and 25% to the test set)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
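
Because the split above is random, the rows in each set change on every run. The following is a sketch of a reproducible alternative, assuming scikit-learn's bundled Iris data; the `random_state` value and the use of `stratify` are illustrative choices, not settings used in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled Iris data instead of the CSV file.
X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle so the split is the same on every run;
# stratify=y keeps the three species in roughly equal proportion in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
```

The 112/38 split matches the sizes produced by the default settings used in this notebook.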

In [11]:
X_train

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
114,5.8,2.8,5.1,2.4
55,5.7,2.8,4.5,1.3
5,5.4,3.9,1.7,0.4
106,4.9,2.5,4.5,1.7
136,6.3,3.4,5.6,2.4
...,...,...,...,...
72,6.3,2.5,4.9,1.5
117,7.7,3.8,6.7,2.2
47,4.6,3.2,1.4,0.2
12,4.8,3.0,1.4,0.1


In [12]:
y_train

114     virginica
55     versicolor
5          setosa
106     virginica
136     virginica
          ...    
72     versicolor
117     virginica
47         setosa
12         setosa
145     virginica
Name: class, Length: 112, dtype: object

In [13]:
# re-initialise classifier
clf = sk.neighbors.KNeighborsClassifier()

In [14]:
# re-train classifier using X_train and y_train
clf.fit(X_train, y_train)

In [15]:
clf.predict(X_test)

array(['versicolor', 'versicolor', 'setosa', 'virginica', 'setosa',
       'virginica', 'setosa', 'virginica', 'virginica', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor',
       'virginica', 'setosa', 'versicolor', 'setosa', 'virginica',
       'virginica', 'virginica', 'versicolor', 'versicolor', 'virginica',
       'virginica', 'versicolor', 'virginica', 'versicolor', 'virginica',
       'versicolor', 'virginica', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'virginica'], dtype=object)

In [16]:
# Proportion of correct classifications on the test set
# (equivalent to clf.score(X_test, y_test))
(clf.predict(X_test) == y_test).sum() / X_test.shape[0]

1.0
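
A score from a single random split of 38 test rows can vary from run to run. Cross-validation averages the accuracy over several splits and gives a steadier estimate. The sketch below assumes scikit-learn's bundled Iris data and five folds, neither of which is used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: use scikit-learn's bundled Iris data instead of the CSV file.
X, y = load_iris(return_X_y=True)

# Fit and score a fresh classifier on each of 5 folds.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

# Mean accuracy across the folds, and how much it varies between them.
print(scores.mean(), scores.std())
```

Each fold is held out once for testing while the other four are used for training, so every sample contributes to the estimate.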

### Machine Learning Algorithms - Support Vector Machines (SVM)
***

### Machine Learning Algorithms - Naive Bayes
***

### References

[1] *Supervised vs. Unsupervised Learning*, Julianna Delua. https://www.ibm.com/blog/supervised-vs-unsupervised-learning/

[2] *Iris.csv*, Saurabh Singh. https://www.kaggle.com/datasets/saurabh00007/iriscsv

[3] *Nearest Neighbors*, scikit-learn developers. https://scikit-learn.org/stable/modules/neighbors.html

***

## End
***