In [1]:
import numpy as np
from scipy.stats import zscore


In [2]:
Required activity 8.1: Predicting plant types
The data below is a small subset of the famous Iris database, first used by Sir R. A. Fisher. This is perhaps the best-known database in the pattern recognition literature. Fisher’s paper (R. A. Fisher (1936), The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7(2):179–188) is a classic in the field and is referenced frequently to this day. The full data set contains three classes of 50 instances each, where each class refers to a type of iris plant. There are four features: sepal length, sepal width, petal length and petal width.
For the exercise, we will only use the sepal length and the sepal width.
Iris database subset
Sepal length	Sepal width	Class
6.3	2.9	virginica
5.1	3.4	setosa
5.7	2.5	virginica
5.0	3.5	setosa
4.8	3.4	setosa
6.6	2.9	versicolor
6.3	3.3	versicolor
6.3	3.4	versicolor
6.0	3.4	versicolor
4.7	3.2	???
5.0	3.3	???
6.1	2.9	???
6.8	3.2	???
4.5	2.3	???
7.7	3.0	???



Here’s what you need to do:
(A) Normalise the data using z-score normalisation.
Note: For the purpose of this exercise, please use all the data to compute µ and σ required for the normalisation. However, keep in mind that in practice you should use only the training data to compute µ and σ, because the inputs for future predictions will typically not be available to you yet.
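The note above can be sketched as follows. The snippet applies z = (x - µ) / σ column by column, first with µ and σ taken from all 15 rows (as this exercise asks) and then with µ and σ taken from the labelled rows only (as one would in practice). The names `labelled`, `unlabelled` and `standardise` are illustrative and not part of the original notebook, and the population standard deviation (ddof=0) is assumed.

```python
import numpy as np

# Sepal length and sepal width, copied from the exercise table
labelled = np.array([
    [6.3, 2.9], [5.1, 3.4], [5.7, 2.5], [5.0, 3.5], [4.8, 3.4],
    [6.6, 2.9], [6.3, 3.3], [6.3, 3.4], [6.0, 3.4],
])
unlabelled = np.array([
    [4.7, 3.2], [5.0, 3.3], [6.1, 2.9], [6.8, 3.2], [4.5, 2.3], [7.7, 3.0],
])


def standardise(x, mu, sigma):
    """Apply z = (x - mu) / sigma to each column (hypothetical helper)."""
    return (x - mu) / sigma


# Variant used in this exercise: mu and sigma from all 15 rows
all_rows = np.vstack([labelled, unlabelled])
mu_all = all_rows.mean(axis=0)
sigma_all = all_rows.std(axis=0)  # ddof=0, population standard deviation
z_all = standardise(all_rows, mu_all, sigma_all)

# Variant for real use: mu and sigma from the labelled (training) rows only,
# then reused unchanged for the rows to be predicted
mu_train = labelled.mean(axis=0)
sigma_train = labelled.std(axis=0)
z_train = standardise(labelled, mu_train, sigma_train)
z_new = standardise(unlabelled, mu_train, sigma_train)

print("mu (all rows):", mu_all)
print("sigma (all rows):", sigma_all)
print("mu (training rows only):", mu_train)
print("sigma (training rows only):", sigma_train)
```

In the second variant the unlabelled rows are transformed with the training statistics, so their z-scores generally do not have mean 0 and standard deviation 1.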


In [3]:
# Your dataset
data = np.array([
    [6.3, 2.9, 'virginica'],
    [5.1, 3.4, 'setosa'],
    [5.7, 2.5, 'virginica'],
    [5.0, 3.5, 'setosa'],
    [4.8, 3.4, 'setosa'],
    [6.6, 2.9, 'versicolor'],
    [6.3, 3.3, 'versicolor'],
    [6.3, 3.4, 'versicolor'],
    [6.0, 3.4, 'versicolor'],
    [4.7, 3.2, None],
    [5.0, 3.3, None],
    [6.1, 2.9, None],
    [6.8, 3.2, None],
    [4.5, 2.3, None],
    [7.7, 3.0, None]
])


In [4]:
# Extracting numeric data
numeric_data = np.array([x[:2] for x in data]).astype(float)

In [5]:
# Calculate mean and standard deviation
mean_values = np.mean(numeric_data, axis=0)
std_dev_values = np.std(numeric_data, axis=0)

# Display the results
print("Mean values (Sepal Length, Sepal Width):", mean_values)
print("Standard Deviation values (Sepal Length, Sepal Width):", std_dev_values)

Mean values (Sepal Length, Sepal Width): [5.79333333 3.10666667]
Standard Deviation values (Sepal Length, Sepal Width): [0.88729301 0.34149996]


In [6]:


# Calculating mean, standard deviation, and variance
mean_values = np.mean(numeric_data, axis=0)
std_dev_values = np.std(numeric_data, axis=0, ddof=0) # Using ddof=0 for population standard deviation
variance_values = np.var(numeric_data, axis=0, ddof=0) # Using ddof=0 for population variance

# Display the results
print("Mean values (Sepal Length, Sepal Width):", mean_values)
print("Standard Deviation values (Sepal Length, Sepal Width):", std_dev_values)
print("Variance values (Sepal Length, Sepal Width):", variance_values)

Mean values (Sepal Length, Sepal Width): [5.79333333 3.10666667]
Standard Deviation values (Sepal Length, Sepal Width): [0.88729301 0.34149996]
Variance values (Sepal Length, Sepal Width): [0.78728889 0.11662222]
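The `ddof` argument used above selects the divisor: `ddof=0` divides by N (population), while `ddof=1` divides by N - 1 (sample). Both `np.std` and `scipy.stats.zscore` default to `ddof=0`. The short sketch below compares the two on the sepal lengths; the variable names are illustrative only.

```python
import numpy as np

# Sepal lengths of all 15 rows in the exercise table
sepal_length = np.array([6.3, 5.1, 5.7, 5.0, 4.8, 6.6, 6.3, 6.3, 6.0,
                         4.7, 5.0, 6.1, 6.8, 4.5, 7.7])
n = sepal_length.size

pop_std = np.std(sepal_length, ddof=0)     # divides by N
sample_std = np.std(sepal_length, ddof=1)  # divides by N - 1

print("Population standard deviation:", pop_std)
print("Sample standard deviation:    ", sample_std)

# The two differ only by the factor sqrt(N / (N - 1))
print("Ratio:", sample_std / pop_std, "expected:", np.sqrt(n / (n - 1)))
```

With only 15 rows the two values differ by a few percent, so it matters that the same convention is used for normalising the data and for transforming later inputs.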


In [10]:
# Apply z-score normalization
normalized_numeric_data = zscore(numeric_data, axis=0)

# Extract class labels
class_labels = np.array([x[2] for x in data]).reshape(-1, 1)

# Combine normalized data with class labels
normalized_data_with_class = np.hstack((normalized_numeric_data, class_labels))

# Display normalized data
print("Normalized Data:")
print(normalized_numeric_data)

print("Normalized Data with Class Labels:")
print(normalized_data_with_class)


Normalized Data:
[[ 0.5710252  -0.60517333]
 [-0.7814029   0.85895569]
 [-0.10518885 -1.77647654]
 [-0.89410524  1.15178149]
 [-1.11950993  0.85895569]
 [ 0.90913222 -0.60517333]
 [ 0.5710252   0.56612989]
 [ 0.5710252   0.85895569]
 [ 0.23291817  0.85895569]
 [-1.23221227  0.27330408]
 [-0.89410524  0.56612989]
 [ 0.34562051 -0.60517333]
 [ 1.1345369   0.27330408]
 [-1.45761695 -2.36212815]
 [ 2.14885798 -0.31234752]]
Normalized Data with Class Labels:
[[0.5710251967461304 -0.6051733273183482 'virginica']
 [-0.781402900810494 0.8589556903873338 'setosa']
 [-0.10518885203218131 -1.7764765414828936 'virginica']
 [-0.8941052422735457 1.1517814939284703 'setosa']
 [-1.1195099251996499 0.8589556903873338 'setosa']
 [0.9091322211352862 -0.6051733273183482 'versicolor']
 [0.5710251967461304 0.5661298868461971 'versicolor']
 [0.5710251967461304 0.8589556903873338 'versicolor']
 [0.23291817235697457 0.8589556903873338 'versicolor']
 [-1.2322122666627016 0.27330408330506173 None]
 [-0.894105242

(B) Use the k-nearest neighbours method with k=3 to predict the missing classes. Use the Euclidean norm in your distance calculations.
Note: When making predictions, you need to transform the input data for the predictions using the same µ and σ that were used for the transformation of the data in (A).
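A minimal sketch of step (B) under the exercise's assumptions: µ and σ come from all 15 rows, distances are Euclidean, and each unlabelled point takes the majority class among its k=3 nearest labelled points. The helper `knn_predict` and the variable names are illustrative and not part of the original notebook; ties, if any occurred, would simply be resolved by `Counter.most_common`.

```python
from collections import Counter

import numpy as np

# Labelled rows (sepal length, sepal width) and their classes
X_train = np.array([
    [6.3, 2.9], [5.1, 3.4], [5.7, 2.5], [5.0, 3.5], [4.8, 3.4],
    [6.6, 2.9], [6.3, 3.3], [6.3, 3.4], [6.0, 3.4],
])
y_train = ["virginica", "setosa", "virginica", "setosa", "setosa",
           "versicolor", "versicolor", "versicolor", "versicolor"]

# Rows whose class is to be predicted
X_query = np.array([
    [4.7, 3.2], [5.0, 3.3], [6.1, 2.9], [6.8, 3.2], [4.5, 2.3], [7.7, 3.0],
])

# Same mu and sigma as in (A): computed over all 15 rows, ddof=0
all_rows = np.vstack([X_train, X_query])
mu, sigma = all_rows.mean(axis=0), all_rows.std(axis=0)
Z_train = (X_train - mu) / sigma
Z_query = (X_query - mu) / sigma


def knn_predict(z_train, labels, z_point, k=3):
    """Majority vote among the k labelled points nearest to z_point."""
    dists = np.linalg.norm(z_train - z_point, axis=1)  # Euclidean norm
    nearest = np.argsort(dists)[:k]                    # indices of k smallest
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


knn_predictions = [knn_predict(Z_train, y_train, z, k=3) for z in Z_query]
for raw, label in zip(X_query, knn_predictions):
    print(raw, "->", label)
```

Only labelled rows are candidates for neighbours here, since an unlabelled row has no class to vote with.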

In [11]:
# Function to calculate Euclidean distance
def euclidean_distance(row1, row2):
    # Both arguments must be numeric feature vectors (the caller passes
    # only the first two columns, without the class label)
    dist = np.sqrt(np.sum((row1 - row2) ** 2))
    return dist

# Compute the Euclidean distance from each row with a missing class label
# to every labelled row (unlabelled rows cannot vote in k-NN)
distances = []
for i, row in enumerate(normalized_data_with_class):
    # Check if the class label is missing
    print(f"i = {i}, row = {row}")
    if row[2] is None:
        # Compute the distance with respect to all labelled data points
        row_distances = []
        for compare_row in normalized_data_with_class:
            if compare_row[2] is not None:  # Skip rows without a label
                dist = euclidean_distance(np.array(row[:2], dtype=float), np.array(compare_row[:2], dtype=float))
                row_distances.append((dist, compare_row[2]))
        distances.append((i, row_distances))

print(len(distances))

# Print all the distances
for index, dist_list in distances:
    print(f"Distances from point {index} with missing class:")
    for dist, cls in dist_list:
        print(f"Distance to class '{cls}': {dist:.4f}")
    print("\n")  # Add a newline for better readability between points

i = 0, row = [0.5710251967461304 -0.6051733273183482 'virginica']
i = 1, row = [-0.781402900810494 0.8589556903873338 'setosa']
i = 2, row = [-0.10518885203218131 -1.7764765414828936 'virginica']
i = 3, row = [-0.8941052422735457 1.1517814939284703 'setosa']
i = 4, row = [-1.1195099251996499 0.8589556903873338 'setosa']
i = 5, row = [0.9091322211352862 -0.6051733273183482 'versicolor']
i = 6, row = [0.5710251967461304 0.5661298868461971 'versicolor']
i = 7, row = [0.5710251967461304 0.8589556903873338 'versicolor']
i = 8, row = [0.23291817235697457 0.8589556903873338 'versicolor']
i = 9, row = [-1.2322122666627016 0.27330408330506173 None]
i = 10, row = [-0.8941052422735457 0.5661298868461971 None]
i = 11, row = [0.3456205138200262 -0.6051733273183482 None]
i = 12, row = [1.1345369040613906 0.27330408330506173 None]
i = 13, row = [-1.4576169495888058 -2.3621281485651666 None]
i = 14, row = [2.148857977228859 -0.31234752377721153 None]
6
Distances from point 9 with missing class:
Distan

In [20]:
from sklearn.neighbors import KNeighborsClassifier

# Separating the dataset into features (X) and target (y)
X = np.array([x[:2] for x in data if x[2] is not None]).astype(float)
y = np.array([x[2] for x in data if x[2] is not None])

# Data to predict
X_predict = np.array([x[:2] for x in data if x[2] is None]).astype(float)

# Create KNN model
model = KNeighborsClassifier(n_neighbors=3)

# Train the model (note: X holds the raw measurements, not the z-scores from (A))
model.fit(X, y)

# Make predictions
predictions = model.predict(X_predict)

# Add predictions back to the dataset
prediction_index = 0  # Counter for the predictions
for i, row in enumerate(data):
    if row[2] is None:
        row[2] = predictions[prediction_index]
        prediction_index += 1

print("Updated dataset with predicted classes:")
print(data)


Updated dataset with predicted classes:
[[6.3 2.9 'virginica']
 [5.1 3.4 'setosa']
 [5.7 2.5 'virginica']
 [5.0 3.5 'setosa']
 [4.8 3.4 'setosa']
 [6.6 2.9 'versicolor']
 [6.3 3.3 'versicolor']
 [6.3 3.4 'versicolor']
 [6.0 3.4 'versicolor']
 [4.7 3.2 'setosa']
 [5.0 3.3 'setosa']
 [6.1 2.9 'versicolor']
 [6.8 3.2 'versicolor']
 [4.5 2.3 'setosa']
 [7.7 3.0 'versicolor']]


In [25]:
print(predictions)

['setosa' 'setosa' 'versicolor' 'versicolor' 'setosa' 'versicolor']
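The predictions above come from a model fitted on the raw measurements. A hedged sketch of the variant the exercise describes, with the features standardised before fitting, could look like this. It assumes scikit-learn's `StandardScaler` (which uses the population standard deviation) fitted on all 15 rows, and `KNeighborsClassifier` with its default Euclidean metric; the variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_labelled = np.array([
    [6.3, 2.9], [5.1, 3.4], [5.7, 2.5], [5.0, 3.5], [4.8, 3.4],
    [6.6, 2.9], [6.3, 3.3], [6.3, 3.4], [6.0, 3.4],
])
y_labelled = np.array(["virginica", "setosa", "virginica", "setosa", "setosa",
                       "versicolor", "versicolor", "versicolor", "versicolor"])
X_unlabelled = np.array([
    [4.7, 3.2], [5.0, 3.3], [6.1, 2.9], [6.8, 3.2], [4.5, 2.3], [7.7, 3.0],
])

# As the exercise asks, mu and sigma are estimated from all 15 rows.
# In practice the scaler would be fitted on the training rows only.
scaler = StandardScaler().fit(np.vstack([X_labelled, X_unlabelled]))

# Fit on standardised training features, predict on standardised queries
scaled_model = KNeighborsClassifier(n_neighbors=3)
scaled_model.fit(scaler.transform(X_labelled), y_labelled)
scaled_predictions = scaled_model.predict(scaler.transform(X_unlabelled))

print(scaled_predictions)
```

Because standardising rescales the two axes differently, the nearest neighbours of a point can change, so these predictions need not match the raw-data output shown above.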


In [34]:
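# Function to calculate Euclidean distance
def euclidean_distance(row1, row2):
    # Both arguments must be numeric feature vectors (the caller passes
    # only the first two columns, without the class label)
    dist = np.sqrt(np.sum((row1 - row2) ** 2))
    return dist

# Compute the Euclidean distance from each row with a missing class label
# to all other rows. Note: if `data` has already been filled in place with
# predictions (as in the KNeighborsClassifier cell), no label is missing
# any more and `distances` stays empty.
distances = []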
for i, row in enumerate(data):
    # Check if the class label is missing
    print(f"i = {i}, row = {row}")
    if row[2] is None:
        # Compute the distance with respect to all other data points
        
        row_distances = []
        for j, compare_row in enumerate(data):
            if i != j:  # Skip the same row
                dist = euclidean_distance(np.array(row[:2], dtype=float), np.array(compare_row[:2], dtype=float))
                row_distances.append((dist, compare_row[2]))
        distances.append((i, row_distances))

print(len(distances))

# Print all the distances
for index, dist_list in distances:
    print(f"Distances from point {index} with missing class:")
    for dist, cls in dist_list:
        print(f"Distance to class '{cls}': {dist:.4f}")
    print("\n")  # Add a newline for better readability between points


i = 0, row = [6.3 2.9 'virginica']
i = 1, row = [5.1 3.4 'setosa']
i = 2, row = [5.7 2.5 'virginica']
i = 3, row = [5.0 3.5 'setosa']
i = 4, row = [4.8 3.4 'setosa']
i = 5, row = [6.6 2.9 'versicolor']
i = 6, row = [6.3 3.3 'versicolor']
i = 7, row = [6.3 3.4 'versicolor']
i = 8, row = [6.0 3.4 'versicolor']
i = 9, row = [4.7 3.2 'setosa']
i = 10, row = [5.0 3.3 'setosa']
i = 11, row = [6.1 2.9 'versicolor']
i = 12, row = [6.8 3.2 'versicolor']
i = 13, row = [4.5 2.3 'setosa']
i = 14, row = [7.7 3.0 'versicolor']
0
