# INVENTOR COMBINATIONS TO MEASURE INVENTOR SIMILARITY
This notebook measures two distinct types of inventor similarity:
1. **Interpatent** Inventor Similarity - similarity between the inventors of two different patents. The implementation here compares each patent with its direct ancestor in the citation tree.
2. **Intrapatent** Inventor Similarity - similarity among the inventors listed on a single patent.

In [1]:
import neo4j 
import pandas as pd
from credentials import uri, user, pwd
from patent_neo4j.connection import Neo4jConnection
from patent_neo4j.analysis import get_direct_ancestor
from patent_neo4j.analysis import interpatent_inventor_combination
import itertools

Important Patents List

In [2]:
df = pd.read_csv("Data/important_patents.csv")
df.head(8)

Unnamed: 0,id,name
0,4136359,AppleMicrocomputer
1,4237224,MolecularChimeras
2,4371752,DigitalVoiceMailSystems
3,4399216,Co-transformationGeneCoding
4,4683195,PolymeraseChainReaction
5,5061620,StemCell
6,5108388,LaserSurgeryMethod
7,6285999,PageRank


## Inter-Patent Inventor Similarity
**Overall Algorithm**:
1. **Obtain** each **direct ancestor-descendant** pair from the citation tree
1. **Combine inventors** of the two patents into pairs, with repeats allowed <br> 
    *an inventor on both patents yields a pair such as ("wozniak", "wozniak"), which has a similarity score of 1*
1. Compute the **average similarity** over all pairs
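
The internals of `interpatent_inventor_combination` are not shown in this notebook, so the following is only a minimal sketch of one plausible reading of step 2. It assumes the two inventor lists are pooled and every unordered pair is taken, so that an inventor who appears on both patents forms a self-pair. The helper name `pooled_pairs` and the toy inventor names are hypothetical.

```python
import itertools

def pooled_pairs(descendant_inventors, ancestor_inventors):
    """All unordered pairs drawn from the two inventor lists pooled together.

    Hypothetical helper (an assumption, not the library's implementation):
    an inventor listed on both patents is kept twice, so they form a
    self-pair such as ("wozniak", "wozniak").
    """
    pooled = list(descendant_inventors) + list(ancestor_inventors)
    return list(itertools.combinations(pooled, 2))

# Toy example: "wozniak" is on both the descendant and the ancestor patent.
pairs = pooled_pairs(["wozniak", "jobs"], ["wozniak"])
self_pairs = [p for p in pairs if p[0] == p[1]]
```

Here `pairs` contains three tuples, one of which is the self-pair. A cross product of the two lists would be another reasonable interpretation of "repeats allowed".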

Establishing Connection

In [3]:
conn = Neo4jConnection(uri, user, pwd)
result = conn.query_citation_tree(root=df.id[0])

In [4]:
result.head()

Unnamed: 0,id,date,country,claims,kind,lineage,similarity
0,6226017,2001-05-01,US,48,A,[4136359],[0.057604003697633736]
1,9940907,2018-04-10,US,20,B2,"[6226017, 4136359]","[0.13514360785484314, 0.057604003697633736]"
2,7705842,2010-04-27,US,16,B2,"[6226017, 4136359]","[0.1497403383255005, 0.057604003697633736]"
3,9001133,2015-04-07,US,22,B2,"[7705842, 6226017, 4136359]","[0.2110498994588852, 0.1497403383255005, 0.057..."
4,9153179,2015-10-06,US,11,B2,"[7705842, 6226017, 4136359]","[0.3195275664329529, 0.1497403383255005, 0.057..."


In [5]:
# Get Direct Lineage
direct_lineage = get_direct_ancestor(result)

In [6]:
direct_lineage.head()

Unnamed: 0,id,lineage,similarity,hops
0,6226017,4136359,0.057604,1
1,9940907,6226017,0.135144,2
2,7705842,6226017,0.14974,2
3,9001133,7705842,0.21105,3
4,9153179,7705842,0.319528,3


In [7]:
direct_lineage.shape

(18855, 4)

In [8]:
direct_lineage = direct_lineage.drop_duplicates()

In [9]:
direct_lineage.shape

(7965, 4)

Obtain All Inventors from the Built Citation Tree

In [10]:
inventor_tree = conn.query_inventor_tree(root=df.id[0])

In [11]:
inventor_tree.head()

Unnamed: 0,patent,inventor
0,10002407,"[fl:q_ln:zhang-10, fl:m_ln:wollersheim-2, fl:n..."
1,10006610,"[fl:r_ln:cui-5, fl:j_ln:li-228, fl:x_ln:lin-78]"
2,10007679,"[fl:z_ln:guo-135, fl:z_ln:zhang-599]"
3,10007687,"[fl:t_ln:mastronardi-1, fl:h_ln:clement-1, fl:..."
4,10007868,"[fl:h_ln:jin-151, fl:z_ln:wang-256, fl:g_ln:mi..."


Generate Interpatent Combination

In [12]:
interpatent = interpatent_inventor_combination(direct_ancestor=direct_lineage, inventor_tree=inventor_tree)

In [13]:
interpatent.head()

Unnamed: 0,id,lineage,similarity,hops,combination
0,6226017,4136359,0.057604,1,"[(fl:a_ln:godfrey-5, fl:j_ln:goossen-4), (fl:a..."
1,9940907,6226017,0.135144,2,"[(fl:l_ln:blanco-3, fl:r_ln:fink-12), (fl:l_ln..."
2,7705842,6226017,0.14974,2,"[(fl:r_ln:panabaker-1, fl:j_ln:creasey-4), (fl..."
3,9001133,7705842,0.21105,3,"[(fl:p_ln:sacchetto-1, fl:p_ln:nambi-1), (fl:p..."
4,9153179,7705842,0.319528,3,"[(fl:p_ln:sacchetto-1, fl:d_ln:redman-5), (fl:..."


### Calculation of Similarity
This section assumes that the inventor embeddings and the inventor-to-node mapping have already been persisted in the files loaded below. <br>
**Important Notes:**
* **node2vec** uses biased random walks, so re-running the algorithm will likely yield different embeddings. The resulting similarity scores should nevertheless be broadly consistent between runs.
* Some inventors **have no relationships within 3 hops**. Such nodes have **NO embeddings**, so *similarity_score()* returns None for any pair that includes them. These pairs are scored as 0.
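
To make the second note concrete, here is a minimal sketch of a lookup that returns None when an inventor has no embedding. It assumes the score is a cosine similarity, which is a common choice for node2vec vectors but is not confirmed by this notebook. The function `cosine_or_none`, the dictionary layout, and the toy data are all hypothetical and do not reflect the actual signature or storage format used by `similarity_score` and `convert_np`.

```python
import math

def cosine_or_none(mapping, emb, a, b):
    """Cosine similarity of two inventors, or None if either has no embedding.

    mapping: inventor id -> node id (assumed layout)
    emb:     node id -> embedding vector (assumed layout)
    """
    ia, ib = mapping.get(a), mapping.get(b)
    if ia not in emb or ib not in emb:
        return None  # at least one inventor was never embedded
    va, vb = emb[ia], emb[ib]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else None

# Toy data: node 2 exists in the mapping but has no embedding vector.
mapping = {"inv-a": 0, "inv-b": 1, "inv-c": 2}
emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}

same = cosine_or_none(mapping, emb, "inv-a", "inv-a")
orthogonal = cosine_or_none(mapping, emb, "inv-a", "inv-b")
missing = cosine_or_none(mapping, emb, "inv-a", "inv-c")
```

Under this assumption an inventor paired with themselves scores 1, and `missing` is None, which the calling code then has to treat as 0.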

In [14]:
from patent_neo4j.analysis import similarity_score, convert_np
import json

In [15]:
file_emb = "./node2vec/emb/" + df.name[0] + ".emb"
file_map = "./node2vec/map/" + df.name[0] + ".map"
node_emb = convert_np(file_emb)
with open(file_map) as json_file:
    coinventor_mapping = json.load(json_file)

Calculating the **Average Combination Similarity**

In [16]:
inventor_sim = []

for _, row in interpatent.iterrows():
    # Sum the scores of all inventor pairs for this ancestor-descendant row
    sim_score = 0
    length = len(row.combination)

    for pair in row.combination:
        curr_score = similarity_score(coinventor_mapping, node_emb, pair[0], pair[1])
        if curr_score is not None:  # pairs without embeddings contribute 0
            sim_score += curr_score

    # Average over ALL pairs; rows with no pairs get a similarity of 0
    inventor_sim.append(sim_score / length if length != 0 else 0)
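
Note that the loop above divides by the total number of pairs, not by the number of pairs that produced a score, so pairs with missing embeddings pull the average down. The small sketch below isolates that arithmetic. The helper `mean_pair_score` and its `empty_value` parameter are hypothetical and only illustrate the convention.

```python
def mean_pair_score(scores, empty_value=0):
    """Average pair scores, counting None as 0 while still counting the pair."""
    if not scores:
        return empty_value  # no pairs at all: fall back to a fixed value
    return sum(s for s in scores if s is not None) / len(scores)

with_gap = mean_pair_score([0.5, None, 1.0])  # (0.5 + 0 + 1.0) / 3
no_pairs = mean_pair_score([])                # falls back to empty_value
```

Dividing by only the scored pairs would instead give 0.75 for the first call, so the choice of denominator matters when many inventors lack embeddings.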

In [17]:
interpatent = interpatent.assign(inventor_sim=inventor_sim)

Replace the Combination Lists with the Number of Combinations

In [18]:
interpatent['combination'] = interpatent['combination'].apply(len)

In [19]:
interpatent.head()

Unnamed: 0,id,lineage,similarity,hops,combination,inventor_sim
0,6226017,4136359,0.057604,1,3,0.332927
1,9940907,6226017,0.135144,2,21,0.788112
2,7705842,6226017,0.14974,2,10,0.737528
3,9001133,7705842,0.21105,3,15,0.78942
4,9153179,7705842,0.319528,3,15,0.785134


In [20]:
interpatent.to_csv("interpatent.csv")

## Intra-Patent Inventor Similarity

In [21]:
inventor_tree = conn.query_inventor_tree(root=df.id[0])

In [22]:
inventor_tree = inventor_tree.rename({'patent':'id'}, axis=1)

In [23]:
inventor_tree.head()

Unnamed: 0,id,inventor
0,10002407,"[fl:q_ln:zhang-10, fl:m_ln:wollersheim-2, fl:n..."
1,10006610,"[fl:r_ln:cui-5, fl:j_ln:li-228, fl:x_ln:lin-78]"
2,10007679,"[fl:z_ln:guo-135, fl:z_ln:zhang-599]"
3,10007687,"[fl:t_ln:mastronardi-1, fl:h_ln:clement-1, fl:..."
4,10007868,"[fl:h_ln:jin-151, fl:z_ln:wang-256, fl:g_ln:mi..."


In [24]:
result = conn.query_citation_tree(root=df.id[0])
direct_lineage = get_direct_ancestor(result)

In [25]:
direct_lineage = direct_lineage.drop_duplicates()

In [26]:
direct_lineage.head()

Unnamed: 0,id,lineage,similarity,hops
0,6226017,4136359,0.057604,1
1,9940907,6226017,0.135144,2
2,7705842,6226017,0.14974,2
3,9001133,7705842,0.21105,3
4,9153179,7705842,0.319528,3


In [27]:
intrapatent = pd.merge(inventor_tree, direct_lineage, on='id', how='left')

In [28]:
intrapatent['combination'] = intrapatent['inventor'].apply(lambda x: list(itertools.combinations(x, 2)))

In [29]:
intrapatent.head()

Unnamed: 0,id,inventor,lineage,similarity,hops,combination
0,10002407,"[fl:q_ln:zhang-10, fl:m_ln:wollersheim-2, fl:n...",6978050,0.185457,3.0,"[(fl:q_ln:zhang-10, fl:m_ln:wollersheim-2), (f..."
1,10006610,"[fl:r_ln:cui-5, fl:j_ln:li-228, fl:x_ln:lin-78]",7460179,0.119324,3.0,"[(fl:r_ln:cui-5, fl:j_ln:li-228), (fl:r_ln:cui..."
2,10007679,"[fl:z_ln:guo-135, fl:z_ln:zhang-599]",8463035,0.032218,3.0,"[(fl:z_ln:guo-135, fl:z_ln:zhang-599)]"
3,10007687,"[fl:t_ln:mastronardi-1, fl:h_ln:clement-1, fl:...",6920614,0.258572,3.0,"[(fl:t_ln:mastronardi-1, fl:h_ln:clement-1), (..."
4,10007868,"[fl:h_ln:jin-151, fl:z_ln:wang-256, fl:g_ln:mi...",9047511,0.19501,3.0,"[(fl:h_ln:jin-151, fl:z_ln:wang-256), (fl:h_ln..."


In [30]:
intra_sim = []
for _, row in intrapatent.iterrows():
    if len(row.combination) == 0:
        # A single-inventor patent has no pairs; treat it as fully similar
        sim_score = 1
    else:
        sim_score = 0
        for p in row.combination:
            curr_score = similarity_score(coinventor_mapping, node_emb, p[0], p[1])
            if curr_score is not None:  # missing embeddings count as 0
                sim_score += curr_score
        sim_score = sim_score / len(row.combination)

    intra_sim.append(sim_score)
intrapatent['intra_sim'] = intra_sim

In [31]:
intrapatent.head()

Unnamed: 0,id,inventor,lineage,similarity,hops,combination,intra_sim
0,10002407,"[fl:q_ln:zhang-10, fl:m_ln:wollersheim-2, fl:n...",6978050,0.185457,3.0,"[(fl:q_ln:zhang-10, fl:m_ln:wollersheim-2), (f...",0.99938
1,10006610,"[fl:r_ln:cui-5, fl:j_ln:li-228, fl:x_ln:lin-78]",7460179,0.119324,3.0,"[(fl:r_ln:cui-5, fl:j_ln:li-228), (fl:r_ln:cui...",0.999303
2,10007679,"[fl:z_ln:guo-135, fl:z_ln:zhang-599]",8463035,0.032218,3.0,"[(fl:z_ln:guo-135, fl:z_ln:zhang-599)]",0.999001
3,10007687,"[fl:t_ln:mastronardi-1, fl:h_ln:clement-1, fl:...",6920614,0.258572,3.0,"[(fl:t_ln:mastronardi-1, fl:h_ln:clement-1), (...",0.99744
4,10007868,"[fl:h_ln:jin-151, fl:z_ln:wang-256, fl:g_ln:mi...",9047511,0.19501,3.0,"[(fl:h_ln:jin-151, fl:z_ln:wang-256), (fl:h_ln...",0.998527


In [32]:
intrapatent.drop(columns=['inventor','combination']).to_csv("intrapatent.csv")