# Practical C Data Similarity and Distance
In this practical we will explore the concepts of data similarity and data distance.

As usual we will be using Jupyter Notebooks, Google Colab, and Python/Pandas. The data for this week can be found on [GitHub](https://github.com/PaulHancock/COMP5009_pracs).

# Q3 from Chapter 3 of [Aggarwal](https://www.springer.com/gp/book/9783319141411)

We will be working with the [*Ionosphere*](http://archive.ics.uci.edu/ml/datasets/Ionosphere) data set from the UCI Machine Learning Repository.

1. Copy the file `ionosphere.data` into the Colaboratory space.
  - Review the file `ionosphere.names` if you want some context for the data
2. Compute the $L_p$ distances between all pairs of the first 10 data points, for p = 1, 2, and $\infty$
3. Compute the contrast measure on the data set for each norm.
  - Repeat the exercise after sampling the first $r$ dimensions, where $r$ varies from 1 to the full dimensionality of the data.
  - Make a plot of contrast vs $r$, compare to figure 3.1 (a) of Aggarwal.

## 1. Copy the file `ionosphere.data` into the Colaboratory space.
Refer to last week's prac for how to use `urllib` to copy a file from the internet into your Colaboratory space. Despite the name of the file (`ionosphere.data`), the format is `.csv`, so save the file with the appropriate extension.

In [1]:
import urllib
import urllib.request

In [2]:
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data'
file_name = 'ionosphere.csv'
urllib.request.urlretrieve(data_url, file_name)
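# now do the same for the ionosphere.names file (at the same location)
names_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.names'
urllib.request.urlretrieve(names_url, 'ionosphere.names')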

('ionosphere.names', <http.client.HTTPMessage at 0x7f9f6896b5b0>)

Once you have copied the files using the above code, navigate to them and have a quick look at the raw data and the description file.

## 2 Compute $L_p$ distances
Compute the $L_p$ distances between all pairs of the first 10 data points, for p = 1, 2, and $\infty$.

Note: As per `ionosphere.names` the final attribute is a class attribute either 'g', or 'b'.
We don't want to include non-numeric data when computing the $L_p$ norms, so we must drop this attribute.

In [3]:
import pandas as pd
import numpy as np

In [4]:
all_data = pd.read_csv('ionosphere.csv',
                       header=None) # this csv file has no header
all_data.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.99539 | -0.05889 | 0.85243 | 0.02306 | 0.83398 | -0.37708 | 1.0 | 0.0376 | ... | -0.51171 | 0.41078 | -0.46168 | 0.21266 | -0.3409 | 0.42267 | -0.54487 | 0.18641 | -0.453 | g |
| 1 | 1 | 0 | 1.0 | -0.18829 | 0.93035 | -0.36156 | -0.10868 | -0.93597 | 1.0 | -0.04549 | ... | -0.26569 | -0.20468 | -0.18401 | -0.1904 | -0.11593 | -0.16626 | -0.06288 | -0.13738 | -0.02447 | b |
| 2 | 1 | 0 | 1.0 | -0.03365 | 1.0 | 0.00485 | 1.0 | -0.12062 | 0.88965 | 0.01198 | ... | -0.4022 | 0.58984 | -0.22145 | 0.431 | -0.17365 | 0.60436 | -0.2418 | 0.56045 | -0.38238 | g |
| 3 | 1 | 0 | 1.0 | -0.45161 | 1.0 | 1.0 | 0.71216 | -1.0 | 0.0 | 0.0 | ... | 0.90695 | 0.51613 | 1.0 | 1.0 | -0.20099 | 0.25682 | 1.0 | -0.32382 | 1.0 | b |
| 4 | 1 | 0 | 1.0 | -0.02401 | 0.9414 | 0.06531 | 0.92106 | -0.23255 | 0.77152 | -0.16399 | ... | -0.65158 | 0.1329 | -0.53206 | 0.02431 | -0.62197 | -0.05707 | -0.59573 | -0.04608 | -0.65697 | g |


In [5]:
# create a subset of the data by excluding the final column/attribute
df = all_data.iloc[:, :-1]

In [6]:
# let's write separate functions for each of the lp norms
def l1(first_row, second_row):
  """
  Compute the $L_1$ distance between two rows of data
  """
  dist = np.sum(np.abs(first_row - second_row))
  return dist

def l2(first_row, second_row):
  """
  Compute the $L_2$ distance between two rows of data
  """
  dist = np.sqrt(np.sum(np.abs(first_row - second_row)**2))
  return dist

def linf(first_row, second_row):
  r"""
  Compute the $L_\infty$ distance between two rows of data
  """
  dist = np.max(np.abs(first_row - second_row))
  return dist
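
As a cross-check on the three hand-written functions, the same distances can be obtained from a single general routine. The sketch below is one possible way to do this with `np.linalg.norm`; the helper name `lp_distance` and the two toy vectors are made up for illustration and are not part of the prac data.

```python
import numpy as np

def lp_distance(x, y, p):
    """Hypothetical helper: L_p distance between two 1-D arrays (p may be np.inf)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # for a 1-D vector, ord=p gives (sum |d_i|^p)^(1/p) and ord=np.inf gives max |d_i|
    return float(np.linalg.norm(diff, ord=p))

# toy vectors (not from the ionosphere data); their difference is [-4, 3, 0]
a = np.array([0.0, 3.0, -1.0])
b = np.array([4.0, 0.0, -1.0])

d1 = lp_distance(a, b, 1)         # 4 + 3 + 0 = 7
d2 = lp_distance(a, b, 2)         # sqrt(16 + 9) = 5
dinf = lp_distance(a, b, np.inf)  # max(4, 3, 0) = 4
print(d1, d2, dinf)
```

For any pair of points the ordering $L_\infty \le L_2 \le L_1$ should hold, which is a quick sanity check for the functions above.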

In [7]:
# test that our function(s) work on the first two rows, using all three norms
# We expect that Linf <= L2 <= L1
print(f"L1 = {l1(df.iloc[0], df.iloc[1])}")
print(f"L2 = {l2(df.iloc[0], df.iloc[1])}")
print(f"Linf = {linf(df.iloc[0], df.iloc[1])}")

L1 = 13.080950000000001
L2 = 2.7763589251571923
Linf = 1.12221


We now have functions which compute the $L_p$ distance for p=1,2,$\infty$, so we must set up a list of all the pairs that can be formed from the first 10 rows. `itertools` has a function for exactly this: `combinations`.

In [8]:
# Generate all pairs of rows from the first 10
from itertools import combinations

In [9]:
# accessing the first 10 rows we use df.iloc[:10]
# our functions want to work on lists of values so we chain the above with .values

pairs = combinations(df.iloc[:10].values, # the items from which we are sampling
                     2) # the number of samples to take at a time
lp1_dist = []
lp2_dist = []
lpinf_dist = []
for r1, r2 in pairs:
  lp1_dist.append(l1(r1,r2))
  lp2_dist.append(l2(r1,r2))
  lpinf_dist.append(linf(r1,r2))
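
The explicit loop over pairs is easy to read, but the same 45 distances can also be computed in one step with NumPy broadcasting. The following is a minimal sketch under the assumption that the first 10 rows are available as a 10 x 34 array; here a random stand-in array `X` is used instead of `df.iloc[:10].values`, so the numbers are not the ionosphere results.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
# made-up stand-in for df.iloc[:10].values (10 rows, 34 numeric attributes)
X = rng.uniform(-1, 1, size=(10, 34))

# diffs[i, j, :] holds |X[i] - X[j]| for every pair of rows
diffs = np.abs(X[:, None, :] - X[None, :, :])
# index pairs with i < j, in the same order that combinations() produces them
i, j = np.triu_indices(len(X), k=1)

l1_all = diffs.sum(axis=2)[i, j]
l2_all = np.sqrt((diffs ** 2).sum(axis=2))[i, j]
linf_all = diffs.max(axis=2)[i, j]

# compare against the loop-based approach for the L1 norm
loop_l1 = [np.sum(np.abs(a - b)) for a, b in combinations(X, 2)]
print(len(l1_all), np.allclose(l1_all, loop_l1))
```

Ten points give $10 \times 9 / 2 = 45$ unordered pairs, which is the length of each distance list.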


In [10]:
# Summarise our data
print(f"Mean of $L_1$ over first 10 rows: {np.mean(lp1_dist):.2f}")
print(f"Mean of $L_2$ over first 10 rows: {np.mean(lp2_dist):.2f}")
print(f"Mean of $L_\\infty$ over first 10 rows: {np.mean(lpinf_dist):.2f}")

Mean of $L_1$ over first 10 rows: 15.58
Mean of $L_2$ over first 10 rows: 3.40
Mean of $L_\infty$ over first 10 rows: 1.20


In [19]:
# check that all of the distances are positive
lp1_dist

[13.080950000000001,
 5.35971,
 21.057290000000002,
 6.213870000000001,
 15.166310000000001,
 7.577039999999999,
 22.28699,
 5.81361,
 17.20169,
 7.06033,
 16.25809,
 6.372240000000001,
 14.074530000000001,
 14.98318,
 18.11757,
 10.993599999999999,
 34.53319,
 12.671619999999999,
 30.185820000000003,
 6.24754,
 30.92729,
 7.7604299999999995,
 27.62719,
 6.2655,
 16.246969999999997,
 8.25909,
 25.77937,
 21.26855,
 27.856350000000003,
 8.44451,
 20.84771,
 4.18114,
 10.570170000000001,
 11.94792,
 19.761910000000004,
 8.783730000000002,
 29.15677,
 10.47754,
 12.92207,
 15.50779,
 26.848529999999997,
 15.406070000000001,
 19.812210000000004,
 6.320250000000001,
 22.58643,
 6.06466,
 15.984620000000001,
 7.74504,
 21.060719999999996,
 10.218549999999999,
 23.031209999999998,
 27.69323,
 31.724500000000003,
 7.15719,
 30.25797,
 9.66097,
 30.16651,
 12.808430000000001,
 24.963989999999995,
 13.2973,
 16.771210000000004,
 14.678249999999998,
 23.149390000000004,
 10.03331,
 13.68606,
 11.

## 3 Compute the contrast measure
Compute the contrast measure on the data set for each norm.
  - Repeat the exercise after sampling the first $r$ dimensions, where $r$ varies from 1 to the full dimensionality of the data.
  - Make a plot of contrast vs $r$, compare to figure 3.1 (a) of Aggarwal.



Recall that the contrast measure is given by

$Contrast(D) = \frac{D_{max} - D_{min}}{\mu}$
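
To make the definition concrete, here is a small worked example on two made-up sets of distances (the values are illustrative only). Both sets have the same spread $D_{max} - D_{min} = 2$, but the second has a much larger mean, so its contrast is lower.

```python
import numpy as np

# toy distances, not taken from the ionosphere data
near = np.array([1.0, 2.0, 3.0])
far = np.array([11.0, 12.0, 13.0])

# contrast = (D_max - D_min) / mean
contrast_near = (near.max() - near.min()) / near.mean()  # (3 - 1) / 2
contrast_far = (far.max() - far.min()) / far.mean()      # (13 - 11) / 12
print(contrast_near, contrast_far)
```

A low contrast means that the nearest and farthest points are at almost the same relative distance, which makes distance-based comparisons less informative.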


In [22]:
# Start by computing the distance from the first row to every other row, for each of the $L_p$ norms.
r1 = df.iloc[0].values
lp1_dist = []
lp2_dist = []
lpinf_dist = []
for r2 in df.iloc[1:].values:
    lp1_dist.append(l1(r1,r2))
    print(r2)
    lp2_dist.append(l2(r1,r2))
    lpinf_dist.append(linf(r1,r2))

[ 1.       0.       1.      -0.18829  0.93035 -0.36156 -0.10868 -0.93597
  1.      -0.04549  0.50874 -0.67743  0.34432 -0.69707 -0.51685 -0.97515
  0.05499 -0.62237  0.33109 -1.      -0.13151 -0.453   -0.18056 -0.35734
 -0.20332 -0.26569 -0.20468 -0.18401 -0.1904  -0.11593 -0.16626 -0.06288
 -0.13738 -0.02447]
[ 1.       0.       1.      -0.03365  1.       0.00485  1.      -0.12062
  0.88965  0.01198  0.73082  0.05346  0.85443  0.00827  0.54591  0.00299
  0.83775 -0.13644  0.75535 -0.0854   0.70887 -0.27502  0.43385 -0.12062
  0.57528 -0.4022   0.58984 -0.22145  0.431   -0.17365  0.60436 -0.2418
  0.56045 -0.38238]
[ 1.       0.       1.      -0.45161  1.       1.       0.71216 -1.
  0.       0.       0.       0.       0.       0.      -1.       0.14516
  0.54094 -0.3933  -1.      -0.54467 -0.69975  1.       0.       0.
  1.       0.90695  0.51613  1.       1.      -0.20099  0.25682  1.
 -0.32382  1.     ]
[ 1.       0.       1.      -0.02401  0.9414   0.06531  0.92106 -0.23255
  0.771

In [23]:
# Use the given definition to create a function which computes contrast from a set of distances
def contrast(D):
  """
  Compute the contrast of a data set with the given distances.
  """
  c = (np.max(D) - np.min(D))/np.mean(D)
  return c

In [16]:
# report the contrast for each lp norm
print(f"Contrast for p=1: {contrast(lp1_dist):.2f}")
print(f"Contrast for p=2: {contrast(lp2_dist):.2f}")
print(f"Contrast for p=inf: {contrast(lpinf_dist):.2f}")

Contrast for p=1: 2.00
Contrast for p=2: 1.68
Contrast for p=inf: 1.28


In [None]:
# As per the question we now compute this for various number of dimensions r
c1 = []
c2 = []
cinf = []
r_values = range(1, df.shape[1] + 1) # r = 1 up to the full dimensionality of the data

# this is brute force and is not optimised for speed so will take a minute or two to complete
for r in r_values:
  pairs = combinations(df.iloc[:,:r].values, # all rows but only r dimensions (columns)
                       2)
  lp1_dist = []
  lp2_dist = []
  lpinf_dist = []
  for r1, r2 in pairs:
    lp1_dist.append(l1(r1,r2))
    lp2_dist.append(l2(r1,r2))
    lpinf_dist.append(linf(r1,r2))
  c1.append(contrast(lp1_dist))
  c2.append(contrast(lp2_dist))
  cinf.append(contrast(lpinf_dist))
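
If the brute-force loop is too slow, the pairwise differences can be computed once and then reused for every value of $r$. The sketch below shows one possible arrangement; the helper name `contrast_by_dimension` is hypothetical and the data are uniform random numbers rather than the ionosphere attributes, so the values will differ from the prac results.

```python
import numpy as np

def contrast_by_dimension(X, p):
    """Hypothetical helper: contrast of all pairwise L_p distances
    using only the first r columns of X, for r = 1 .. number of columns."""
    n, d = X.shape
    i, j = np.triu_indices(n, k=1)       # all unordered pairs of rows
    diffs = np.abs(X[i] - X[j])          # shape (number of pairs, d)
    results = []
    for r in range(1, d + 1):
        sub = diffs[:, :r]               # keep only the first r dimensions
        if np.isinf(p):
            dist = sub.max(axis=1)
        else:
            dist = (sub ** p).sum(axis=1) ** (1.0 / p)
        results.append((dist.max() - dist.min()) / dist.mean())
    return np.array(results)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 20))     # synthetic data: 50 points, 20 dimensions
c1_demo = contrast_by_dimension(X, 1)
cinf_demo = contrast_by_dimension(X, np.inf)
print(c1_demo.shape, cinf_demo.shape)
```

With a single dimension all three norms reduce to the absolute difference, so the curves for the different norms start at the same value and separate as $r$ grows.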


In [None]:
# import matplotlib for plotting
# make it so that the plots occur inline instead of in a pop-up window
%matplotlib inline 
from matplotlib import pyplot as plt

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(r_values, c1, label="$L_1$")
ax.plot(r_values, c2, label="$L_2$")
ax.plot(r_values, cinf, label=r"$L_\infty$")
ax.legend()
ax.set_xlabel("Data dimensions")
ax.set_ylabel("Contrast")
plt.show()

Comparing this to the plot in the textbook we can see the same behavior:
- Higher 'p' values mean lower contrast
- Higher dimensions mean lower contrast

# Q6 from Chapter 3 of [Aggarwal](https://www.springer.com/gp/book/9783319141411)

For this task we will use the KDD Cup 1999 data from last week.

1. The data are available via github as [kddcup.arff](https://raw.githubusercontent.com/PaulHancock/COMP5009_pracs/main/data/kddcup99.arff), load them as a pandas data frame.
2. Remove the numeric attributes and keep only categorical attributes.
3. Remove all duplicate rows.
4. Randomly pick a data point (row) and compute its similarity to all other rows using:
  - Inverse Occurrence Frequency Measure
  - Overlap Measure
5. Find the nearest neighbour for your randomly chosen data point.

## 1 Load the data
We did exactly this last week, so just copy the code across.

In [24]:
import pandas as pd
from scipy.io import arff
import urllib
import urllib.request

In [25]:
data_url = 'https://raw.githubusercontent.com/PaulHancock/COMP5009_pracs/main/data/kddcup99.arff'
file_name = 'kddcup99.arff'
# this will download the file, look in your explorer to confirm
urllib.request.urlretrieve(data_url, file_name)

('kddcup99.arff', <http.client.HTTPMessage at 0x7f9f28d3c070>)

In [26]:
# load the data from arff format
data = arff.loadarff(file_name)
raw_df = pd.DataFrame(data[0]) # note the [0] here.

In [27]:
raw_df.describe()

|  | duration | src_bytes | dst_bytes | wrong_fragment | urgent | hot | num_failed_logins | lnum_compromised | lroot_shell | lsu_attempted | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | ... | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 44.2908 | 1560.068 | 667.6539 | 0.0056 | 0.0 | 0.0443 | 0.0003 | 0.0064 | 0.0001 | 0.0001 | ... | 232.4499 | 189.9626 | 0.758506 | 0.028212 | 0.60088 | 0.006808 | 0.176059 | 0.175725 | 0.055649 | 0.054896 |
| std | 688.058585 | 51592.7 | 10134.22562 | 0.12558 | 0.0 | 0.842501 | 0.02236 | 0.080992 | 0.01 | 0.01 | ... | 64.617617 | 105.242412 | 0.407448 | 0.099844 | 0.481548 | 0.043186 | 0.380065 | 0.380314 | 0.225521 | 0.224883 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 48.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 68.75 | 0.53 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 573.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 255.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 1032.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 255.0 | 1.0 | 0.03 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 23815.0 | 5133876.0 | 954639.0 | 3.0 | 0.0 | 30.0 | 2.0 | 2.0 | 1.0 | 1.0 | ... | 255.0 | 255.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## 2 Remove numeric attributes
Remove the numeric attributes and keep only categorical attributes.

In [28]:
# Determine the data type for each column
raw_df.dtypes

duration                       float64
protocol_type                   object
service                         object
flag                            object
src_bytes                      float64
dst_bytes                      float64
land                            object
wrong_fragment                 float64
urgent                         float64
hot                            float64
num_failed_logins              float64
logged_in                       object
lnum_compromised               float64
lroot_shell                    float64
lsu_attempted                  float64
lnum_root                      float64
lnum_file_creations            float64
lnum_shells                    float64
lnum_access_files              float64
lnum_outbound_cmds             float64
is_host_login                   object
is_guest_login                  object
count                          float64
srv_count                      float64
serror_rate                    float64
srv_serror_rate          

In [29]:
# note that the string columns are of type object so select just those
df = raw_df.select_dtypes(include = 'object')
df

|  | protocol_type | service | flag | land | logged_in | is_host_login | is_guest_login | label |
|---|---|---|---|---|---|---|---|---|
| 0 | b'icmp' | b'ecr_i' | b'SF' | b'0' | b'0' | b'0' | b'0' | b'smurf' |
| 1 | b'icmp' | b'ecr_i' | b'SF' | b'0' | b'0' | b'0' | b'0' | b'smurf' |
| 2 | b'icmp' | b'ecr_i' | b'SF' | b'0' | b'0' | b'0' | b'0' | b'smurf' |
| 3 | b'icmp' | b'ecr_i' | b'SF' | b'0' | b'0' | b'0' | b'0' | b'smurf' |
| 4 | b'icmp' | b'ecr_i' | b'SF' | b'0' | b'0' | b'0' | b'0' | b'smurf' |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | b'tcp' | b'ldap' | b'S0' | b'0' | b'0' | b'0' | b'0' | b'neptune' |
| 9996 | b'tcp' | b'http' | b'SF' | b'0' | b'1' | b'0' | b'0' | b'normal' |
| 9997 | b'icmp' | b'ecr_i' | b'SF' | b'0' | b'0' | b'0' | b'0' | b'smurf' |
| 9998 | b'icmp' | b'ecr_i' | b'SF' | b'0' | b'0' | b'0' | b'0' | b'smurf' |


## 3 Remove duplicates
Remove all duplicate rows.

In [30]:
# no need for anything fancy, just use the drop_duplicates function
cleaned_df = df.drop_duplicates()

In [31]:
# see how many rows remain
cleaned_df

|  | protocol_type | service | flag | land | logged_in | is_host_login | is_guest_login | label |
|---|---|---|---|---|---|---|---|---|
| 0 | b'icmp' | b'ecr_i' | b'SF' | b'0' | b'0' | b'0' | b'0' | b'smurf' |
| 5 | b'tcp' | b'smtp' | b'SF' | b'0' | b'1' | b'0' | b'0' | b'normal' |
| 6 | b'udp' | b'domain_u' | b'SF' | b'0' | b'0' | b'0' | b'0' | b'normal' |
| 9 | b'tcp' | b'private' | b'S0' | b'0' | b'0' | b'0' | b'0' | b'neptune' |
| 19 | b'tcp' | b'ftp_data' | b'SF' | b'0' | b'1' | b'0' | b'0' | b'normal' |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9507 | b'tcp' | b'netbios_dgm' | b'S0' | b'0' | b'0' | b'0' | b'0' | b'neptune' |
| 9521 | b'tcp' | b'daytime' | b'S0' | b'0' | b'0' | b'0' | b'0' | b'neptune' |
| 9586 | b'tcp' | b'sql_net' | b'REJ' | b'0' | b'0' | b'0' | b'0' | b'neptune' |
| 9592 | b'tcp' | b'telnet' | b'RSTO' | b'0' | b'0' | b'0' | b'0' | b'normal' |


## 4 Randomly pick a data point and compute ...
Randomly pick a data point (row) and compute its similarity to all other rows using:
  - Inverse Occurrence Frequency Measure
  - Overlap Measure

Firstly we must choose a row at random.

In [32]:
import random

In [33]:
my_row_number = random.randint(0, # minimum value to choose
                               len(cleaned_df) - 1) # maximum value (randint includes both end points)
my_row = cleaned_df.iloc[my_row_number]

print(f"I chose instance {my_row_number}:")
print( "----------------------------")
print(my_row)

I chose instance 8:
----------------------------
protocol_type          b'udp'
service            b'private'
flag                    b'SF'
land                     b'0'
logged_in                b'0'
is_host_login            b'0'
is_guest_login           b'0'
label             b'teardrop'
Name: 42, dtype: object


### Inverse Occurrence Frequency Measure (IOFM)

For the inverse occurrence frequency measure we need to determine the frequency of each value of each attribute.
The Python builtin `set` type is useful here as it defines an unordered collection of **unique** items.

In [34]:
# use set to reduce all our values to a set of unique values
for attribute in cleaned_df.columns:
  print(f"Attribute {attribute} has values {set(cleaned_df[attribute])}")

Attribute protocol_type has values {b'icmp', b'udp', b'tcp'}
Attribute service has values {b'auth', b'pop_2', b'csnet_ns', b'urp_i', b'login', b'mtp', b'private', b'uucp_path', b'exec', b'echo', b'bgp', b'pop_3', b'shell', b'efs', b'ftp_data', b'nnsp', b'uucp', b'netbios_ns', b'hostnames', b'domain_u', b'discard', b'remote_job', b'ecr_i', b'name', b'eco_i', b'iso_tsap', b'klogin', b'netbios_ssn', b'vmnet', b'smtp', b'supdup', b'imap4', b'nntp', b'kshell', b'whois', b'systat', b'ftp', b'finger', b'http_443', b'ctf', b'Z39_50', b'time', b'ssh', b'sunrpc', b'rje', b'netstat', b'link', b'printer', b'ldap', b'courier', b'netbios_dgm', b'other', b'telnet', b'domain', b'http', b'sql_net', b'X11', b'ntp_u', b'daytime'}
Attribute flag has values {b'OTH', b'SF', b'RSTO', b'REJ', b'S2', b'S1', b'S0', b'RSTR'}
Attribute land has values {b'0'}
Attribute logged_in has values {b'1', b'0'}
Attribute is_host_login has values {b'0'}
Attribute is_guest_login has values {b'1', b'0'}
Attribute label has va

In [38]:
# let's create a function which will create a lookup table of the pk values
def attribute_frequencies(dataframe):
    pk = {} # pk will be a dictionary which we can index using the attribute name
    for attribute in dataframe.columns:
        column = dataframe[attribute]
        categories = set(column)
        frequencies = {}
        for c in categories:
            frequencies[c] = np.sum(column == c)/column.shape[0]
        pk[attribute] = frequencies
    return pk
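
The frequency table for a single attribute can also be built with `collections.Counter`, which counts each distinct value in one pass. This is only a sketch of an alternative; the helper name `value_frequencies` and the short list of protocol values are made up for illustration.

```python
from collections import Counter

def value_frequencies(values):
    """Hypothetical helper: map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# toy column (not the real KDD Cup column)
protocols = [b'tcp', b'tcp', b'udp', b'icmp', b'tcp']
freqs = value_frequencies(protocols)
print(freqs)
```

The frequencies for one attribute should always sum to 1, which is a useful check on the lookup table.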

In [40]:
# test that the function works, and looks to give sensible results
pk = attribute_frequencies(cleaned_df)
print(pk)

{'protocol_type': {b'icmp': 0.05042016806722689, b'udp': 0.05042016806722689, b'tcp': 0.8991596638655462}, 'service': {b'auth': 0.025210084033613446, b'pop_2': 0.008403361344537815, b'csnet_ns': 0.008403361344537815, b'urp_i': 0.008403361344537815, b'login': 0.008403361344537815, b'mtp': 0.008403361344537815, b'private': 0.06722689075630252, b'uucp_path': 0.008403361344537815, b'exec': 0.01680672268907563, b'echo': 0.01680672268907563, b'bgp': 0.008403361344537815, b'pop_3': 0.008403361344537815, b'shell': 0.01680672268907563, b'efs': 0.008403361344537815, b'ftp_data': 0.05042016806722689, b'nnsp': 0.01680672268907563, b'uucp': 0.008403361344537815, b'netbios_ns': 0.01680672268907563, b'hostnames': 0.008403361344537815, b'domain_u': 0.008403361344537815, b'discard': 0.01680672268907563, b'remote_job': 0.008403361344537815, b'ecr_i': 0.025210084033613446, b'name': 0.008403361344537815, b'eco_i': 0.01680672268907563, b'iso_tsap': 0.01680672268907563, b'klogin': 0.008403361344537815, b'ne

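Now we should create another function which computes the similarity $S$. Each attribute $i$ contributes

$ S(x_i,y_i) = 1/p_k(x_i)^2 $ if $x_i = y_i$ and zero otherwise,

and the similarity of two rows is the sum of these contributions over all attributes.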
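
A small worked example may help to show why rare matches dominate this measure. The sketch below applies the definition as written (with the squared frequency) to two made-up attributes; the function name `iof_similarity` and the frequency tables are assumptions for illustration, not values from the KDD Cup data.

```python
def iof_similarity(x, y, freq_tables):
    """Hypothetical helper: sum of 1 / p_k(x_i)^2 over the attributes where x and y match."""
    total = 0.0
    for k, (xi, yi) in enumerate(zip(x, y)):
        if xi == yi:
            total += 1.0 / freq_tables[k][xi] ** 2
    return total

# made-up frequencies for two attributes
freq_tables = [
    {'tcp': 0.5, 'udp': 0.5},   # attribute 0: both values are common
    {'SF': 0.25, 'S0': 0.75},   # attribute 1: 'SF' is the rarer value
]

a = ('tcp', 'SF')
b = ('udp', 'SF')   # matches a on the rare value 'SF'
c = ('tcp', 'S0')   # matches a on the common value 'tcp'

sim_ab = iof_similarity(a, b, freq_tables)  # 1 / 0.25^2 = 16
sim_ac = iof_similarity(a, c, freq_tables)  # 1 / 0.5^2  = 4
print(sim_ab, sim_ac)
```

Both `b` and `c` agree with `a` on exactly one attribute, but agreeing on the rarer value gives a higher similarity.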

In [41]:
def iofm(first_row, second_row):
  """
  Compute the inverse occurrence frequency measure (iofm) for two rows.
  """
  sim = 0
  # a pandas series (row) doesn't have columns or column names, but keys
  for attribute in first_row.keys():
    # we access the rows using these keys the same way we would columns of a dataframe
    if first_row[attribute] == second_row[attribute]:
      # note: this adds 1/p_k; square the frequency to match the 1/p_k^2 definition exactly
      sim += 1/pk[attribute][first_row[attribute]]
  return sim

In [42]:
# test that our similarity measure works
# each row should be very similar to itself!
iofm(my_row, my_row)

161.98421543211467

In [43]:
best_row = None
best_similar = 0
# iterating over rows we have to use the .iterrows() function
# which returns both the row index, as well as the row
for index, row in cleaned_df.iterrows():
  # iterrows() builds a new Series for each row, so `row is not my_row` is always True;
  # compare the index labels instead so that my_row cannot be its own best match
  if index != my_row.name:
    similar = iofm(my_row, row)
    if similar > best_similar:
      best_similar = similar
      best_row = row
print(f"Using Inverse Occurrence Frequency Measure the nearest neighbour for my row is:\n{best_row}\nwith similarity score of {best_similar}")

### Overlap Measure



This is the same as the IOFM but simplified to

$ S(x_i,y_i) = 1$ if $x_i = y_i$ and zero otherwise.
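
In other words, the overlap measure simply counts the attributes on which two rows agree. A minimal sketch on two made-up rows (the function name `overlap_similarity` and the values are illustrative only):

```python
def overlap_similarity(x, y):
    """Hypothetical helper: number of positions at which x and y hold the same value."""
    return sum(1 for xi, yi in zip(x, y) if xi == yi)

# toy rows with four categorical attributes
row_a = ('tcp', 'http', 'SF', '0')
row_b = ('tcp', 'smtp', 'SF', '1')

score = overlap_similarity(row_a, row_b)  # matches on 'tcp' and 'SF'
print(score)
```

Because every match counts equally, the score is an integer between 0 and the number of attributes.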

In [None]:
def overlap(first_row, second_row):
  """
  Compute the overlap measure for two rows.
  """
  sim = 0
  # a pandas series (row) doesn't have columns or column names, but keys
  for attribute in first_row.keys():
    # we access the rows using these keys the same way we would columns of a dataframe
    if first_row[attribute] == second_row[attribute]:
      sim += 1
  return sim

In [None]:
# test that our similarity measure works
# each row should be very similar to itself!
overlap(my_row, my_row)

In [None]:
best_row = None
best_similar = 0
# iterating over rows we have to use the .iterrows() function
# which returns both the row index, as well as the row
for index, row in cleaned_df.iterrows():
  # iterrows() builds a new Series for each row, so `row is not my_row` is always True;
  # compare the index labels instead so that my_row cannot be its own best match
  if index != my_row.name:
    similar = overlap(my_row, row)
    if similar > best_similar:
      best_similar = similar
      best_row = row
print(f"Using Overlap Measure the nearest neighbour for my row is:\n{best_row}\nwith similarity score of {best_similar}")

## 5 Find the nearest neighbour

Do your two nearest neighbour calculations agree?

Do you expect that your nearest neighbour is unique or just one among many?

Depending on your chosen random data point you may have different answers for the first question. For the second question, our algorithm didn't record multiple matches for the largest similarity, so we don't know whether the nearest neighbour is unique. For the overlap measure the similarity score is an integer between 0 and 8, so I would think there is some likelihood that multiple neighbours share the same best score (for example a similarity of 4) and are equally good matches.
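
One way to find out whether the nearest neighbour is unique is to keep every candidate that ties for the best score, rather than only the first one found. The sketch below shows this on made-up rows with a simple overlap count; the names `nearest_neighbours` and `overlap_count` are hypothetical, and the same pattern should carry over to the `iterrows()` loops used earlier.

```python
def overlap_count(x, y):
    """Hypothetical helper: number of matching attribute values."""
    return sum(1 for xi, yi in zip(x, y) if xi == yi)

def nearest_neighbours(query, candidates, similarity):
    """Hypothetical helper: return the best score and the positions of ALL candidates that reach it."""
    best_score = None
    best = []
    for idx, cand in enumerate(candidates):
        score = similarity(query, cand)
        if best_score is None or score > best_score:
            best_score, best = score, [idx]   # new best: start a fresh list
        elif score == best_score:
            best.append(idx)                  # tie: remember this candidate as well
    return best_score, best

# toy data: the first two candidates both match the query on two attributes
query = ('tcp', 'http', 'SF')
candidates = [
    ('tcp', 'smtp', 'SF'),
    ('udp', 'http', 'SF'),
    ('udp', 'dns', 'S0'),
]
best_score, best = nearest_neighbours(query, candidates, overlap_count)
print(best_score, best)
```

If the returned list has more than one entry, the nearest neighbour is not unique under that measure.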