# Hands-on 3

**Execute the cell below. Running it loads a dataset from the `patents_small.csv` file. In this notebook, you are asked to analyze this data in several ways. The dataset contains three NumPy arrays:**
- `patent_number`: a unique identifier for each patent
- `patent_features`: a vector of 16 features describing several properties of each patent
- `patent_category`: the category to which each patent belongs

In [20]:
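import pandas as pd
import numpy as np

df = pd.read_csv('patents.csv')
patent_number = df['publication_number'].to_numpy()
patent_features = df['patent_embedding'].to_numpy()

# Each embedding is stored as a string and has to be parsed into floats.
temp = []
for i in range(patent_features.size):
    s = str(patent_features[i])
    # split() breaks on any whitespace (newlines included); [1:-1] drops the
    # first and last tokens, and [:16] keeps the first 16 features.
    temp.append(
        np.array(s.split()[1:-1], dtype='float')[:16]
    )

patent_features = np.stack(temp)  # shape: (number of patents, 16)
patent_category = df['category']  # kept as a pandas Series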
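
The parsing step above relies on the first and last whitespace-separated tokens being disposable (for example, standalone brackets). As a hedged sketch only, a parser that strips bracket characters explicitly could look like the following. The helper name `parse_embedding` and the sample strings are hypothetical, and the real layout of the `patent_embedding` column is an assumption.

```python
import numpy as np

def parse_embedding(text, n_features=16):
    """Parse a bracketed, whitespace-separated string into a float vector.

    Hypothetical helper: assumes the text looks like "[ 0.1 0.2 ... ]",
    with or without spaces next to the brackets.
    """
    # Turn brackets into spaces so they can never stick to a number.
    tokens = str(text).replace('[', ' ').replace(']', ' ').split()
    return np.array(tokens, dtype=float)[:n_features]

# Made-up strings, not taken from the dataset.
spaced = parse_embedding("[ 0.1 0.2\n 0.3 ]", n_features=2)
tight = parse_embedding("[0.1 0.2 0.3]", n_features=3)
print(spaced, tight)
```

With this approach the first and last numbers are kept even when a bracket is attached to them, which slicing with `[1:-1]` would drop.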

<hr />

1- Which patent has the highest norm? (Euclidean distance from the origin)


In [11]:
dfo = np.sqrt(np.sum(np.power(patent_features,2), axis=1))
max_val = np.max(dfo)
max_val_index = np.argmax(dfo)
max_num = patent_number[max_val_index]
max_num

'CH-527846-A'

2- Find the two patents that are the farthest from each other.

In [18]:
import scipy.spatial.distance as ssd
import matplotlib.pyplot as plt

dist = ssd.cdist(patent_features, patent_features)
max_dist = np.max(dist)

d = np.where(dist == max_dist)
tuple(zip(d[0], d[1]))

((1661, 9236), (9236, 1661))
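
The output above lists row indices, and each pair appears twice because the distance matrix is symmetric. A minimal sketch of turning the position of the maximum into patent numbers, using made-up toy data (`toy_numbers` and `toy_features` are hypothetical stand-ins for `patent_number` and `patent_features`):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical toy data, not taken from the dataset.
toy_numbers = np.array(['P-0', 'P-1', 'P-2', 'P-3'])
toy_features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

toy_dist = cdist(toy_features, toy_features)

# argmax works on the flattened matrix; unravel_index converts the flat
# position back into a (row, column) pair.
i, j = np.unravel_index(np.argmax(toy_dist), toy_dist.shape)
farthest_pair = (str(toy_numbers[i]), str(toy_numbers[j]))
print(farthest_pair)
```

`np.argmax` returns only the first occurrence of the maximum, so the mirrored pair is not reported a second time.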

3- Write a function that, given a patent number, finds its nearest neighbour.


In [35]:
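def find_patent_num_nn(num):
    # num is a row index into dist; position 0 of the sorted row is the patent
    # itself (distance 0), so position 1 is its nearest neighbour.
    nn = np.argsort(dist[num])[1]
    return nn
print(find_patent_num_nn(1))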

2147
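
The function above takes and returns row indices. If the lookup should start from an actual patent number, one possible sketch is shown below. The function name `nearest_neighbour_by_number` and the toy arrays are hypothetical, and the sketch assumes every patent number occurs exactly once.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical toy data, not taken from the dataset.
toy_numbers = np.array(['P-0', 'P-1', 'P-2', 'P-3'])
toy_features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
toy_dist = cdist(toy_features, toy_features)

def nearest_neighbour_by_number(number, numbers, dist):
    """Return the patent number of the closest other patent."""
    idx = np.flatnonzero(numbers == number)[0]  # row of the query patent
    row = dist[idx].copy()
    row[idx] = np.inf                           # exclude the patent itself
    return str(numbers[np.argmin(row)])

nn_of_p2 = nearest_neighbour_by_number('P-2', toy_numbers, toy_dist)
nn_of_p3 = nearest_neighbour_by_number('P-3', toy_numbers, toy_dist)
print(nn_of_p2, nn_of_p3)
```

Masking the diagonal entry with `np.inf` avoids relying on the patent itself always sorting first, which may not hold when two patents have identical feature vectors.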


4- For each patent category, find the cluster center. This quantity is computed by taking the average of the feature vectors of all patents in that category.

In [30]:
centers = []
for i in patent_category.unique():
    clusters = patent_features[np.where(patent_category == i)]
#     print(clusters.shape)
    centers.append(np.mean(clusters, axis=0))
centers

[array([ 0.01021772,  0.0140427 , -0.03571764,  0.05286253, -0.04302765,
        -0.00263517,  0.02233755, -0.04675915,  0.01272022,  0.03165236,
         0.01146286, -0.00024609,  0.01377522,  0.00555212,  0.02024696,
        -0.04467966]),
 array([ 0.01211396, -0.0304879 ,  0.05560378, -0.03702774,  0.00110319,
         0.01892597, -0.04493763,  0.01639101,  0.03405147,  0.01160055,
        -0.0039251 ,  0.01961012,  0.0012078 ,  0.02051051, -0.04779424,
        -0.01136447]),
 array([ 0.01086092, -0.02427292,  0.06917166, -0.04593048, -0.02812299,
        -0.0124727 , -0.04987288,  0.00655626,  0.0098301 , -0.01550384,
         0.00122531,  0.00426678,  0.00017979,  0.02210309, -0.02753392,
        -0.00829946]),
 array([ 0.01844678,  0.00991557, -0.05545595,  0.02615103, -0.07078419,
        -0.0115121 ,  0.04539117, -0.05906673, -0.02173693,  0.00203886,
         0.00052992,  0.02329754, -0.03247548,  0.03103352,  0.0140693 ,
        -0.06104154]),
 array([ 0.01498087,  0.02345642

5- How many patents have a nearest neighbour that is in the same category?

In [None]:
# count = 0
# for i in range(len(patent_features)):
#     neighbor = find_patent_num_nn(i)
#     if patent_category[neighbor] == patent_category[i]:
#         count += 1
# count

In [51]:
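# Column 1 of each sorted row is the nearest neighbour (column 0 is the patent itself).
nn = np.argsort(dist, axis=1)[:, 1]

# Compare plain arrays: to_numpy must be called, since comparing the bound
# methods themselves always sums to 0.
nn_cat = patent_category.to_numpy()[nn]
own_cat = patent_category.to_numpy()

np.sum(nn_cat == own_cat)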

6- What are the mean and standard deviation of the distances between every pair of patents?


In [46]:
print(np.mean(dist), np.std(dist))

0.1774779588870755 0.06172153433074445
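
Statistics taken over the full square matrix include the zero diagonal and count every pair twice. As a sketch on made-up toy data (`toy_features` is hypothetical), `scipy.spatial.distance.pdist` returns each unordered pair exactly once, which may be closer to what "every pair" means:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

# Hypothetical toy data: three collinear points spaced 5 apart.
toy_features = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

full = cdist(toy_features, toy_features)  # 3 x 3 matrix, zeros on the diagonal
pairs = pdist(toy_features)               # the 3 unique pairs only

full_mean = full.mean()
pair_mean = pairs.mean()
pair_std = pairs.std()
print(full_mean, pair_mean, pair_std)
```

For `n` points the two means differ by a factor of `(n - 1) / n`, so the gap shrinks for large datasets, while the standard deviations differ because of the extra zeros.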


7- What are the mean and standard deviation of the distances between every pair of patents within a category?
Using these quantities, which cluster do you think is the most condensed? Which one is the most scattered?

In [29]:
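# Group the feature vectors by category.
g = []
for i in patent_category.unique():
    clusters = patent_features[np.where(patent_category == i)]
    g.append(clusters)
# np.mean and np.std below run over the raw feature values of each group,
# not over pairwise distances. g[i] uses the label as a list index, which
# assumes the labels are the integers 0..n-1 appearing in that order.
mean_list = []
std_list = []
for i in patent_category.unique():
    mean_list.append(np.mean(g[i]))
    std_list.append(np.std(g[i]))
print(mean_list, "\n\n", std_list)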

[0.0009738304874799288, 0.00372157464923952, 0.0013628054733375635, -0.005488455708684333, 0.003100204077764672, -0.008824951548397287, -0.00853175919763606, -0.007498980939436589] 

 [0.03410944535427806, 0.03358637702071674, 0.034015677639882086, 0.0367735426671537, 0.034148831780313436, 0.04437697474271802, 0.040063325421186825, 0.03869983254167759]
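
Question 7 asks about pairwise distances within each category. A minimal sketch of that computation on made-up toy data follows. The names `toy_features`, `toy_category`, and `stats` are hypothetical, and the sketch assumes every category has at least two members.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical toy data: category 0 is tight, category 1 is spread out.
toy_features = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                         [5.0, 5.0], [5.0, 9.0], [9.0, 5.0]])
toy_category = np.array([0, 0, 0, 1, 1, 1])

stats = {}
for label in np.unique(toy_category):
    # Pairwise distances among the members of this category only.
    d = pdist(toy_features[toy_category == label])
    stats[int(label)] = (d.mean(), d.std())

# A smaller mean pairwise distance suggests a more condensed cluster.
most_condensed = min(stats, key=lambda k: stats[k][0])
most_scattered = max(stats, key=lambda k: stats[k][0])
print(stats, most_condensed, most_scattered)
```

Ranking by the mean pairwise distance is one reasonable reading of "condensed"; the standard deviation in `stats` can be inspected alongside it.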
