Required activity 8.1: Predicting plant types
The data below are a small subset of the famous Iris database, first used by Sir R. A. Fisher. This is perhaps the best-known database in the pattern recognition literature. Fisher’s paper (R. A. Fisher (1936), The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7(2):179–188) is a classic in the field and is referenced frequently to this day. The full data set contains three classes of 50 instances each, where each class refers to a type of iris plant. There are four features: sepal length, sepal width, petal length and petal width.
For this exercise, we will use only the sepal length and the sepal width.
Iris database subset
Sepal length	Sepal width	Class
6.3	2.9	virginica
5.1	3.4	setosa
5.7	2.5	virginica
5.0	3.5	setosa
4.8	3.4	setosa
6.6	2.9	versicolor
6.3	3.3	versicolor
6.3	3.4	versicolor
6.0	3.4	versicolor
4.7	3.2	???
5.0	3.3	???
6.1	2.9	???
6.8	3.2	???
4.5	2.3	???
7.7	3.0	???


Here’s what you need to do:
(A) Normalise the data using z-score normalisation.
Note: For the purpose of this exercise, please use all the data to compute µ and σ for the normalisation. Keep in mind, however, that in practice you must use only the training data to compute µ and σ, because the inputs for future predictions will typically not be available to you yet.
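The note above separates the simplification used in this exercise from normal practice. As a minimal sketch of the latter, assuming the nine labelled rows play the role of training data, µ and σ could be computed from those rows alone and then reused for the unlabelled inputs. The variable names (`train`, `new`, `mu`, `sigma`) are illustrative and not part of the exercise.

```python
import numpy as np

# Labelled rows (sepal length, sepal width), treated here as the training data
train = np.array([
    [6.3, 2.9], [5.1, 3.4], [5.7, 2.5], [5.0, 3.5], [4.8, 3.4],
    [6.6, 2.9], [6.3, 3.3], [6.3, 3.4], [6.0, 3.4],
])
# Unlabelled rows, standing in for inputs that arrive later
new = np.array([
    [4.7, 3.2], [5.0, 3.3], [6.1, 2.9], [6.8, 3.2], [4.5, 2.3], [7.7, 3.0],
])

# Statistics come from the training rows only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_z = (train - mu) / sigma
new_z = (new - mu) / sigma  # reuse the training mu and sigma; do not recompute

print(train_z.mean(axis=0))  # approximately 0 in each column
print(new_z.mean(axis=0))    # generally not 0, since mu came from other rows
```

The values obtained this way differ slightly from those in the exercise, which deliberately computes µ and σ from all 15 rows.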


In [3]:
import numpy as np
from scipy.stats import zscore

# Your dataset
data = np.array([
    [6.3, 2.9, 'virginica'],
    [5.1, 3.4, 'setosa'],
    [5.7, 2.5, 'virginica'],
    [5.0, 3.5, 'setosa'],
    [4.8, 3.4, 'setosa'],
    [6.6, 2.9, 'versicolor'],
    [6.3, 3.3, 'versicolor'],
    [6.3, 3.4, 'versicolor'],
    [6.0, 3.4, 'versicolor'],
    [4.7, 3.2, None],
    [5.0, 3.3, None],
    [6.1, 2.9, None],
    [6.8, 3.2, None],
    [4.5, 2.3, None],
    [7.7, 3.0, None]
])

# Extracting numeric data for normalization
numeric_data = np.array([x[:2] for x in data]).astype(float)

# Apply z-score normalization column-wise (zscore defaults to the population
# standard deviation, ddof=0)
normalized_data = zscore(numeric_data, axis=0)

# Display normalized data
print("Normalized Data:")
print(normalized_data)


Normalized Data:
[[ 0.5710252  -0.60517333]
 [-0.7814029   0.85895569]
 [-0.10518885 -1.77647654]
 [-0.89410524  1.15178149]
 [-1.11950993  0.85895569]
 [ 0.90913222 -0.60517333]
 [ 0.5710252   0.56612989]
 [ 0.5710252   0.85895569]
 [ 0.23291817  0.85895569]
 [-1.23221227  0.27330408]
 [-0.89410524  0.56612989]
 [ 0.34562051 -0.60517333]
 [ 1.1345369   0.27330408]
 [-1.45761695 -2.36212815]
 [ 2.14885798 -0.31234752]]
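As a quick sanity check on the output above, z-scored columns should have mean 0 and standard deviation 1, and `scipy.stats.zscore` with its default `ddof=0` should agree with the formula z = (x − µ)/σ applied by hand. A small sketch, re-declaring the two feature columns so that it runs on its own (the names `features` and `manual` are illustrative):

```python
import numpy as np
from scipy.stats import zscore

# Sepal length and sepal width for all 15 rows of the subset
features = np.array([
    [6.3, 2.9], [5.1, 3.4], [5.7, 2.5], [5.0, 3.5], [4.8, 3.4],
    [6.6, 2.9], [6.3, 3.3], [6.3, 3.4], [6.0, 3.4], [4.7, 3.2],
    [5.0, 3.3], [6.1, 2.9], [6.8, 3.2], [4.5, 2.3], [7.7, 3.0],
])

z = zscore(features, axis=0)

# The same transformation written out: subtract the column mean,
# divide by the column (population) standard deviation
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(z, manual))          # True if both routes agree
print(z.mean(axis=0).round(6))         # column means, approximately 0
print(z.std(axis=0).round(6))          # column standard deviations, approximately 1
```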


(B) Use the k-nearest neighbours method with k=3 to predict the missing classes. Use the Euclidean norm in your distance calculations.
Note: When making predictions, you need to transform the input data for the predictions using the same µ and σ that were used to transform the data in (A).
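To make the method concrete, here is a minimal from-scratch sketch of k-nearest neighbours with the Euclidean norm: compute the distance from the query to every labelled point, take the k closest, and return the majority class. The function name `knn_predict` and the small example points are hypothetical, chosen only for illustration. Ties are resolved by whichever class `Counter.most_common` returns first, which is an arbitrary choice.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    # Euclidean norm of the difference between the query and each training row
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, already-normalised points used only for illustration
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.2, 0.9], [0.9, 1.1]])
y_train = np.array(["a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # a
print(knn_predict(X_train, y_train, np.array([1.0, 0.9])))  # b
```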

In [2]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Your dataset
data = np.array([
    [6.3, 2.9, 'virginica'],
    [5.1, 3.4, 'setosa'],
    [5.7, 2.5, 'virginica'],
    [5.0, 3.5, 'setosa'],
    [4.8, 3.4, 'setosa'],
    [6.6, 2.9, 'versicolor'],
    [6.3, 3.3, 'versicolor'],
    [6.3, 3.4, 'versicolor'],
    [6.0, 3.4, 'versicolor'],
    [4.7, 3.2, None],
    [5.0, 3.3, None],
    [6.1, 2.9, None],
    [6.8, 3.2, None],
    [4.5, 2.3, None],
    [7.7, 3.0, None]
])

# Features for all rows; per the note in (A), µ and σ are computed from all the data
features = np.array([x[:2] for x in data]).astype(float)
mu = features.mean(axis=0)
sigma = features.std(axis=0)  # population standard deviation, as in scipy.stats.zscore

# Separating the dataset into normalized features (X) and target (y)
X = (np.array([x[:2] for x in data if x[2] is not None]).astype(float) - mu) / sigma
y = np.array([x[2] for x in data if x[2] is not None])

# Data to predict, transformed with the same µ and σ
X_predict = (np.array([x[:2] for x in data if x[2] is None]).astype(float) - mu) / sigma

# Create KNN model (k=3; scikit-learn's default metric is Euclidean distance)
model = KNeighborsClassifier(n_neighbors=3)

# Train the model on the normalized features
model.fit(X, y)

# Make predictions
predictions = model.predict(X_predict)

# Add predictions back to the dataset (features are shown in their original units)
prediction_index = 0  # Counter for the predictions
for i, row in enumerate(data):
    if row[2] is None:
        row[2] = predictions[prediction_index]
        prediction_index += 1

print("Updated dataset with predicted classes:")
print(data)


Updated dataset with predicted classes:
[[6.3 2.9 'virginica']
 [5.1 3.4 'setosa']
 [5.7 2.5 'virginica']
 [5.0 3.5 'setosa']
 [4.8 3.4 'setosa']
 [6.6 2.9 'versicolor']
 [6.3 3.3 'versicolor']
 [6.3 3.4 'versicolor']
 [6.0 3.4 'versicolor']
 [4.7 3.2 'setosa']
 [5.0 3.3 'setosa']
 [6.1 2.9 'versicolor']
 [6.8 3.2 'versicolor']
 [4.5 2.3 'virginica']
 [7.7 3.0 'versicolor']]
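Because k-nearest neighbours depends on distances, feature scaling can change which neighbours count as nearest. As a sketch of that sensitivity, the following fits the same k=3 classifier once on the raw measurements and once on z-scored ones (µ and σ from all 15 rows, as in (A)) and prints both sets of predictions side by side. The names `raw_pred` and `z_pred` are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Labelled rows: features and classes
X_raw = np.array([
    [6.3, 2.9], [5.1, 3.4], [5.7, 2.5], [5.0, 3.5], [4.8, 3.4],
    [6.6, 2.9], [6.3, 3.3], [6.3, 3.4], [6.0, 3.4],
])
y = np.array([
    "virginica", "setosa", "virginica", "setosa", "setosa",
    "versicolor", "versicolor", "versicolor", "versicolor",
])
# Unlabelled rows to classify
X_new_raw = np.array([
    [4.7, 3.2], [5.0, 3.3], [6.1, 2.9], [6.8, 3.2], [4.5, 2.3], [7.7, 3.0],
])

# mu and sigma from all 15 rows, following the exercise's simplification
all_rows = np.vstack([X_raw, X_new_raw])
mu, sigma = all_rows.mean(axis=0), all_rows.std(axis=0)

# Same classifier settings, fitted on raw and on z-scored features
raw_pred = KNeighborsClassifier(n_neighbors=3).fit(X_raw, y).predict(X_new_raw)
z_pred = (
    KNeighborsClassifier(n_neighbors=3)
    .fit((X_raw - mu) / sigma, y)
    .predict((X_new_raw - mu) / sigma)
)

for point, r, z in zip(X_new_raw, raw_pred, z_pred):
    print(point, "raw:", r, "| z-scored:", z)
```

On this data the two runs differ only for the sample (4.5, 2.3): in raw units two of its three nearest labelled neighbours are setosa, while after normalisation two of the three are virginica, because sepal width has a much smaller spread than sepal length and therefore carries more weight once both are scaled.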
