# Generational Set Difference 
As we study how the characteristics of each generation in a citation tree relate to one another, a natural question is whether the inventors found in different generations are distinct people or the same set of inventors appearing again. The same question applies to:

1. Assignees 
1. Inventors
1. Locations
1. NBER Categories

For the first three, similarity is measured with the Jaccard coefficient, $$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
where **A** and **B** are sets. A value of 0 means the two sets are disjoint and a value of 1 means they are identical.
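
As a small worked illustration of the formula, the sketch below computes the coefficient for two toy inventor sets. The set contents are made up for the example and are not taken from the dataset.

```python
# Toy inventor sets for two generations (illustrative values only)
gen_a = {"tomasiak", "lawson", "highfill"}
gen_b = {"lawson", "highfill", "yemini", "won"}

shared = gen_a & gen_b      # intersection: inventors present in both sets
combined = gen_a | gen_b    # union: inventors present in either set

jaccard = len(shared) / len(combined)
print(jaccard)  # 2 shared out of 5 distinct inventors -> 0.4
```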

In [1]:
import neo4j
import pandas as pd
import random
import numpy as np
import datetime
from credentials import uri, user, pwd
from patent_neo4j.connection import Neo4jConnection
from patent_neo4j.analysis import fixing_na_nber, nber_distribution, js_divergence
import math
from patent_neo4j.utils import get_max_generation

### Problematic Patent Roots
The dataset contains many root patents that have no citations. These are dropped automatically by the code I have written. One root patent is simply too large for me to process, so I am ignoring it for now.

In [2]:
roots = pd.read_csv("./Data/generation.csv")

In [3]:
roots = list(set(roots['root']))

### Getting a Test Tree and a Degenerate Tree

In [4]:
df = pd.read_csv("Data/Mined Data/sample_patents_stats.csv")

In [5]:
root = df.loc[8, 'id']
degen_root = df.loc[643, 'id']
conn = Neo4jConnection(uri, user, pwd)
citation_tree = conn.query_citation_tree(root)
degen_tree = conn.query_citation_tree(degen_root)

In [6]:
citation_tree.head()

Unnamed: 0,id,date,country,claims,kind,assignee,location,inventor,lineage,similarity,nber_lineage
0,5598605,1997-02-04,US,7,A,5bbbc58c-5249-4db5-8188-76480883c35f,06759db6-7920-11eb-bfee-121df0c29c1e,fl:m_ln:tomasiak-1,[5340206],[0.26260286569595337],[55]
1,9572465,2017-02-21,US,19,B2,5bbbc58c-5249-4db5-8188-76480883c35f,06759db6-7920-11eb-bfee-121df0c29c1e,fl:m_ln:tomasiak-1,"[5598605, 5340206]","[0.19024710357189176, 0.26260286569595337]","[59, 55]"
2,6193265,2001-02-27,US,9,A,625e6680-4273-4b0e-88cf-4a712c3ad3b3,a14a1030-791f-11eb-bfee-121df0c29c1e,fl:z_ln:yemini-1,"[5598605, 5340206]","[0.26533082127571106, 0.26260286569595337]","[59, 55]"
3,10259480,2019-04-16,US,5,B2,7d2e1ccd-660b-4c57-91c3-9c4e5db2a9d2,d0f508e3-791d-11eb-bfee-121df0c29c1e,fl:y_ln:bacallao-1,"[6193265, 5598605, 5340206]","[0.19954709708690646, 0.26533082127571106, 0.2...","[55, 59, 55]"
4,10259480,2019-04-16,US,5,B2,7d2e1ccd-660b-4c57-91c3-9c4e5db2a9d2,d0f508e3-791d-11eb-bfee-121df0c29c1e,fl:s_ln:caution-2,"[6193265, 5598605, 5340206]","[0.19954709708690646, 0.26533082127571106, 0.2...","[55, 59, 55]"


In [7]:
degen_tree.head()

Unnamed: 0,id,date,country,claims,kind,assignee,location,inventor,lineage,similarity,nber_lineage
0,8469200,2013-06-25,US,23,B2,05a3efaa-d144-48c7-be32-ccffcab5c1e1,6c9ccb4a-791d-11eb-bfee-121df0c29c1e,fl:f_ln:wendeln-1,[4714397],[0.14073023200035095],[51]


### Fixing Citation Tree
This is equivalent to the section in 14_Generational_Difference 

In [8]:
# Assigning NBER category based on majority vote of parents
citation_tree = fixing_na_nber(citation_tree)
degen_tree = fixing_na_nber(degen_tree)

In [9]:
# Take the max(generation) for a given patent to enforce only single-generation
citation_tree = get_max_generation(citation_tree)
degen_tree = get_max_generation(degen_tree)

In [10]:
citation_tree.head()

Unnamed: 0,id,gen,date,country,claims,kind,assignee,location,inventor,lineage,similarity,nber_lineage,nber,hops
0,5598605,1,1997-02-04,US,7,A,5bbbc58c-5249-4db5-8188-76480883c35f,06759db6-7920-11eb-bfee-121df0c29c1e,fl:m_ln:tomasiak-1,[5340206],0.262603,[55],55,1
1,6668951,1,2003-12-30,US,21,B2,7b729b31-a8a9-4811-bdf0-12e6decc9df6,ec8e4f5c-791d-11eb-bfee-121df0c29c1e,fl:c_ln:won-29,[5340206],0.221906,[55],55,1
2,5829849,1,1998-11-03,US,13,A,7b5d5846-085d-4c04-9692-08c427428316,7086f99e-791f-11eb-bfee-121df0c29c1e,fl:r_ln:lawson-3,[5340206],0.38327,[55],55,1
3,5899541,1,1999-05-04,US,6,A,a5a41860-d237-4ee6-abf4-9444b1e2ea7a,6e90cbbb-791e-11eb-bfee-121df0c29c1e,fl:j_ln:highfill-3,[5340206],0.424273,[55],55,1
4,5899541,1,1999-05-04,US,6,A,a5a41860-d237-4ee6-abf4-9444b1e2ea7a,6e90cbbb-791e-11eb-bfee-121df0c29c1e,fl:j_ln:ying-17,[5340206],0.424273,[55],55,1


In [11]:
degen_tree.head()

Unnamed: 0,id,gen,date,country,claims,kind,assignee,location,inventor,lineage,similarity,nber_lineage,nber,hops
0,8469200,1,2013-06-25,US,23,B2,05a3efaa-d144-48c7-be32-ccffcab5c1e1,6c9ccb4a-791d-11eb-bfee-121df0c29c1e,fl:f_ln:wendeln-1,[4714397],0.14073,[51],51,1


### Creating Sets
For assignees, inventors and locations, we want the sets of assignees, inventors and locations involved in the citation tree. We aggregate them by generation.

In [12]:
def setting_settables(citation_tree, settables=('inventor', 'assignee', 'location')):
    """
    Takes a citation tree, with an optional parameter naming the columns to convert,
    and generates one set per generation for each of those columns.
    Input:
        citation_tree - dataframe with a 'gen' column and the settable columns
        settables - names of the columns to convert into sets
    Output:
        generation - dataframe with sets of settables by generation
    """
    # Distinct values of column x, collected into one set per generation
    setter = lambda x: citation_tree.loc[:, ['gen', x]].drop_duplicates().groupby('gen').agg({x: lambda y: set(y)}).reset_index()

    # Generations 1-3 always get a row; a generation missing from the tree is left as NaN
    generation = pd.DataFrame({'gen': [1, 2, 3]})

    for s in settables:
        generation = pd.merge(generation, setter(s), how='left', on='gen')

    return generation

In [13]:
set_citation = setting_settables(citation_tree)

In [14]:
set_degen = setting_settables(degen_tree)
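
To make the shape of these per-generation sets concrete, here is a minimal sketch on a made-up frame. It uses a plain `groupby` followed by `reindex` rather than the merge used above, and `toy_tree` and its values are assumptions for illustration only. A generation with no rows ends up as NaN, which is what happens for the degenerate tree.

```python
import pandas as pd

# Toy citation tree: one row per (patent, inventor) pair; values are made up
toy_tree = pd.DataFrame({
    "gen": [1, 1, 1, 2, 2],
    "inventor": ["a", "b", "b", "b", "c"],
})

# Collect the distinct inventors seen in each generation
inventor_sets = toy_tree.groupby("gen")["inventor"].apply(set)

# Align to generations 1-3; a generation with no rows becomes NaN
inventor_sets = inventor_sets.reindex([1, 2, 3])
print(inventor_sets)
```

Generations 1 and 2 hold the sets `{'a', 'b'}` and `{'b', 'c'}`, while generation 3 is NaN because the toy tree has no third generation.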

### Set Difference Across Generation
For each citation tree, we compare the sets of inventors, assignees and locations between generations using the Jaccard coefficient to see how much the sets overlap. <br>
We look at every pair of generations, which in this case is only
1. Generation 1 and 2 (labeled `(0, 1)`)
1. Generation 2 and 3 (labeled `(1, 2)`)
1. Generation 1 and 3 (labeled `(0, 2)`)
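
The pairing can be sketched with `itertools.combinations` over row positions, where positions 0, 1 and 2 hold generations 1, 2 and 3. This is only an illustration on made-up sets: the `jaccard` helper and `sets_by_row` are hypothetical names, and `combinations` yields the pairs in a slightly different order from the list above.

```python
from itertools import combinations

# Toy per-generation sets, stored by row position 0, 1, 2 (generations 1, 2, 3)
sets_by_row = [{"a", "b"}, {"b", "c"}, {"c", "d"}]

def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Every unordered pair of row positions: (0, 1), (0, 2), (1, 2)
pairs = list(combinations(range(len(sets_by_row)), 2))
scores = {pair: jaccard(sets_by_row[pair[0]], sets_by_row[pair[1]]) for pair in pairs}
print(scores)
```

Adjacent generations share one element out of three here, while generations 1 and 3 share nothing and score 0.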

In [15]:
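def jaccard_similarity(set_A, set_B):

    # A generation with no rows is stored as NaN (a float), so treat it as no overlap
    if isinstance(set_A, float) or isinstance(set_B, float):
        return 0

    # Cardinality of intersection
    n_cap = len(set_A.intersection(set_B))

    # Cardinality of union
    n_cup = len(set_A.union(set_B))

    # Two empty sets have an empty union; return 0 rather than dividing by zero
    if n_cup == 0:
        return 0

    # Jaccard similarity
    return n_cap / n_cup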

In [16]:
def set_differences(root, set_citation, combination=((0,1),(1,2),(0,2)), settables=('inventor','assignee','location')):

    # Empty list of data
    data = []

    # For each pair of generations, compute the Jaccard similarity of every settable
    for c in combination:
        # Each row starts with the root id and the pair of row positions being compared
        gen_rows = [root, c]
        for s in settables:
            gen_rows.append(jaccard_similarity(set_citation.loc[c[0], s], set_citation.loc[c[1], s]))
        data.append(gen_rows)

    return data

In [17]:
set_one = set_differences(root,set_citation)
set_two = set_differences(degen_root,set_degen)

In [18]:
data = []
data = data + set_one + set_two
print(data)

[[5340206, (0, 1), 0.03, 0.10714285714285714, 0.1111111111111111], [5340206, (1, 2), 0.027898326100433975, 0.05063291139240506, 0.05514705882352941], [5340206, (0, 2), 0.001273074474856779, 0.00974025974025974, 0.007547169811320755], [4714397, (0, 1), 0, 0, 0], [4714397, (1, 2), 0, 0, 0], [4714397, (0, 2), 0, 0, 0]]


In [19]:
df_setdiff = pd.DataFrame(data, columns=['root', 'combination', 'inventor', 'assignee', 'location'])

In [20]:
df_setdiff.head(6)

Unnamed: 0,root,combination,inventor,assignee,location
0,5340206,"(0, 1)",0.03,0.107143,0.111111
1,5340206,"(1, 2)",0.027898,0.050633,0.055147
2,5340206,"(0, 2)",0.001273,0.00974,0.007547
3,4714397,"(0, 1)",0.0,0.0,0.0
4,4714397,"(1, 2)",0.0,0.0,0.0
5,4714397,"(0, 2)",0.0,0.0,0.0


### Sad Awful Loops

In [22]:
data = []

for root in roots[:500]:
    # Get Citation Tree
    citation_tree = conn.query_citation_tree(root)
    
    # Fixing Citation Tree
    citation_tree = fixing_na_nber(citation_tree)
    citation_tree = get_max_generation(citation_tree)
    
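    # Rebuild the per-generation sets for this root; reusing the test tree's
    # set_citation would give every root identical values
    set_citation = setting_settables(citation_tree)

    # Build list
    data = data + set_differences(root, set_citation)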
    
df_setdiff = pd.DataFrame(data, columns=['root', 'combination', 'inventor', 'assignee', 'location'])

In [23]:
df_setdiff.head()

Unnamed: 0,root,combination,inventor,assignee,location
0,4409344,"(0, 1)",0.03,0.107143,0.111111
1,4409344,"(1, 2)",0.027898,0.050633,0.055147
2,4409344,"(0, 2)",0.001273,0.00974,0.007547
3,4722689,"(0, 1)",0.03,0.107143,0.111111
4,4722689,"(1, 2)",0.027898,0.050633,0.055147


In [24]:
df_setdiff.to_csv("gen_set_diff_1.csv", index=False)

In [21]:
data = []

for root in roots[500:]:
    # Get Citation Tree
    citation_tree = conn.query_citation_tree(root)
    
    # Fixing Citation Tree
    citation_tree = fixing_na_nber(citation_tree)
    citation_tree = get_max_generation(citation_tree)
    
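    # Rebuild the per-generation sets for this root; reusing the test tree's
    # set_citation would give every root identical values
    set_citation = setting_settables(citation_tree)

    # Build list
    data = data + set_differences(root, set_citation)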
    
df_setdiff = pd.DataFrame(data, columns=['root', 'combination', 'inventor', 'assignee', 'location'])

In [22]:
df_setdiff.head()

Unnamed: 0,root,combination,inventor,assignee,location
0,4224097,"(0, 1)",0.03,0.107143,0.111111
1,4224097,"(1, 2)",0.027898,0.050633,0.055147
2,4224097,"(0, 2)",0.001273,0.00974,0.007547
3,5729379,"(0, 1)",0.03,0.107143,0.111111
4,5729379,"(1, 2)",0.027898,0.050633,0.055147


In [23]:
df_setdiff.to_csv("gen_set_diff_2.csv", index=False)
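
The two loops above differ only in the slice of `roots` they cover, so the batching could be factored out. The sketch below is one possible shape, not the code that produced the files: `compute_rows` is a hypothetical stand-in for the query, clean, build-sets and compare steps, and `run_in_batches` is an assumed helper name.

```python
import pandas as pd

def compute_rows(root):
    """Hypothetical stand-in for: query the tree, clean it, build sets, compare."""
    return [[root, (0, 1), 0.0, 0.0, 0.0]]

def run_in_batches(roots, batch_size, compute=compute_rows):
    """Yield one DataFrame per batch of roots so each batch can be saved separately."""
    columns = ["root", "combination", "inventor", "assignee", "location"]
    for start in range(0, len(roots), batch_size):
        rows = []
        for root in roots[start:start + batch_size]:
            rows.extend(compute(root))
        yield pd.DataFrame(rows, columns=columns)

# Three toy roots in batches of two give one frame of 2 rows and one of 1 row
batches = list(run_in_batches([101, 102, 103], batch_size=2))
print([len(b) for b in batches])
```

Each yielded frame could then be written out with `to_csv`, which keeps memory use bounded by the batch size.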

### Combining the Datasets

In [3]:
df1 = pd.read_csv("gen_set_diff_1.csv")
df2 = pd.read_csv("gen_set_diff_2.csv")

In [8]:
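# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([df1, df2], ignore_index=True)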

In [9]:
df.to_csv("gen_set_diff.csv", index=False)