# Getting the Low-Hanging Fruit
This part characterizes patents by their citation trees, looking at some easy statistics that can be obtained from them. For now, the following attributes of each citation tree are mined:
* Number of Edges (Citing-Cited Relationship)
* Number of Patents
* Edge Density (number of edges divided by number of patents)
* Number of Assignees
* Number of Inventors
* Number of Locations
* Average Number of Claims
* Average Similarity of Direct Citing-Cited Relationship

A random sample of 1,000 patent IDs is drawn from the range [4136359, 6331415] to control for time (all else equal, older patents tend to have more citations). The measures for the patents on the important patents list are also included.

In [1]:
import neo4j 
import pandas as pd
import random
from functools import reduce
from credentials import uri, user, pwd
from patent_neo4j.connection import Neo4jConnection

**Important Patent List**

In [2]:
df = pd.read_csv("Data/important_patents_list.csv", usecols = ["id", "issue_year"])
df.head(5)

Unnamed: 0,id,issue_year
0,4136359,1979
1,4229761,1980
2,4237224,1980
3,4363877,1982
4,4371752,1983


**Sampling with Imposed Range**

In [3]:
num_sample = 1000
# range() excludes its upper bound, so add 1 to keep the last ID eligible
random_list = random.sample(range(df.loc[0,'id'], df.loc[len(df)-1,'id'] + 1), num_sample)

In [4]:
conn = Neo4jConnection(uri, user, pwd)

**Initializing Dataframe**

In [5]:
dataset = pd.DataFrame(columns = ["id","num_edge","num_patents","edge_density",
                                  "unq_assignees","unq_inventors", "unq_loc",
                                  "avg_claims","avg_sim"])

In [6]:
dataset.head()

Unnamed: 0,id,num_edge,num_patents,edge_density,unq_assignees,unq_inventors,unq_loc,avg_claims,avg_sim


**Omnibus Function** <br>
All measures are implemented as inner functions of this function, with each implementation kept separate for readability.

In [7]:
def omnibus_fx(citation_tree, patent_id):
    """
    Omnibus function that computes every measure for one citation tree.
    The measures are inner functions: each is too simple to have reuse
    value on its own, so encapsulating them here is the cleanest option.
    INPUT:
        citation_tree and patent_id (of root)
    OUTPUT:
        List containing [id (root), num_edge, num_patents, edge_density, unq_assignees, unq_inventors,
        unq_loc, avg_claims, avg_sim]
    """
    def count_num_edge(citation_tree):
        # Copy the two columns needed so the caller's frame is left untouched
        num_edges = citation_tree[['id','lineage']].copy()
        # Keep only the direct ancestor: (id, lineage[0]) is one citing-cited edge
        num_edges.loc[:,'lineage'] = num_edges.loc[:,'lineage'].apply(lambda x: x[0])
        # Drop rows duplicated by assignee/inventor information
        num_edges = num_edges.drop_duplicates()

        # Return |E|
        return len(num_edges)
    
    def count_num_patents(citation_tree):
        patent_set = set(citation_tree['id'])
        num_patent = len(patent_set)

        return num_patent

    def edge_density(num_edges, num_patents):
        d = num_edges/num_patents

        return d
    
    def num_assignees(citation_tree):
        assignee_set = set(citation_tree['assignee'])
        num_assignees = len(assignee_set)

        return num_assignees
    
    def num_locations(citation_tree):
        location_set = set(citation_tree['location'])
        num_locations = len(location_set)

        return num_locations
    
    def num_inventors(citation_tree):
        inventor_set = set(citation_tree['inventor'])
        num_inventors = len(inventor_set)

        return num_inventors
    
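    def avg_similarity(citation_tree):
        # One row per direct edge (id, direct ancestor) with its similarity.
        # Deduplicating on the edge rather than on the value keeps distinct
        # edges that happen to share the same similarity score.
        direct = pd.DataFrame({'id': citation_tree['id'],
                               'parent': citation_tree['lineage'].apply(lambda x: x[0]),
                               'sim': citation_tree['similarity'].apply(lambda x: x[0])})
        direct = direct.drop_duplicates(subset=['id','parent'])
        avg_similarity = direct['sim'].mean()

        return avg_similarity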
    
    def avg_claims(citation_tree):
        claims = citation_tree[['id','claims']]
        claims = claims.drop_duplicates()
        claims = claims.dropna()

        num_claims = list(pd.to_numeric(claims['claims']))
        avg_claims = sum(num_claims)/len(num_claims)

        return avg_claims

    # Using the functions above
    num_edge = count_num_edge(citation_tree)
    num_patents = count_num_patents(citation_tree)
    e_density = edge_density(num_edge, num_patents)
    unq_assignees = num_assignees(citation_tree)
    unq_inventors = num_inventors(citation_tree)
    unq_loc = num_locations(citation_tree)
    # Named mean_claims so the result does not shadow the avg_claims function
    mean_claims = avg_claims(citation_tree)
    avg_sim = avg_similarity(citation_tree)
    
    info_list = [patent_id,num_edge,num_patents,e_density,unq_assignees,unq_inventors,unq_loc,mean_claims,avg_sim]
    
    return info_list

Just for a peek

In [8]:
x = "4296981"
citation_tree = conn.query_citation_tree(x)
citation_tree.head()

Unnamed: 0,id,date,country,claims,kind,assignee,location,inventor,lineage,similarity
0,8887444,2014-11-18,US,8,B2,d77128d3-9a3b-4be9-ab41-8ee8e296abd4,b232b40e-791e-11eb-bfee-121df0c29c1e,fl:a_ln:klien-1,"[4461519, 4296981]","[0.14937755465507507, 0.4043945968151093]"
1,8887444,2014-11-18,US,8,B2,d77128d3-9a3b-4be9-ab41-8ee8e296abd4,b232b40e-791e-11eb-bfee-121df0c29c1e,fl:d_ln:slomski-2,"[4461519, 4296981]","[0.14937755465507507, 0.4043945968151093]"
2,8887444,2014-11-18,US,8,B2,d77128d3-9a3b-4be9-ab41-8ee8e296abd4,b232b40e-791e-11eb-bfee-121df0c29c1e,fl:w_ln:kalempa-1,"[4461519, 4296981]","[0.14937755465507507, 0.4043945968151093]"
3,10450796,2019-10-22,US,19,B2,105ad3b8-c491-4f61-992e-95534b9a2243,0d3c550f-791e-11eb-bfee-121df0c29c1e,fl:d_ln:seuberling-1,"[8887444, 4461519, 4296981]","[0.16314050555229187, 0.14937755465507507, 0.4..."
4,5496105,1996-03-05,US,18,A,a7966fa3-9474-4497-a918-ff3b5fba9043,4f771afb-7920-11eb-bfee-121df0c29c1e,fl:s_ln:strait-3,"[4461519, 4296981]","[0.2783316075801849, 0.4043945968151093]"
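
To make the row layout concrete, here is a minimal sketch on a toy tree. It assumes, based only on the preview above, that each row is one (patent, inventor) pair, that `lineage` lists ancestors starting from the direct parent and ending at the root, and that `similarity[0]` is the similarity between `id` and `lineage[0]`. The root ID 100, the other IDs, the similarity values, and the helper name `direct_edge_stats` are invented for illustration.

```python
import pandas as pd

# Toy citation tree for a hypothetical root patent 100 (all values invented).
# Patent 300 has two inventors, so its row appears twice.
toy_tree = pd.DataFrame({
    "id":         [200, 300, 300, 400],
    "inventor":   ["a", "b", "c", "d"],
    "lineage":    [[100], [200, 100], [200, 100], [100]],
    "similarity": [[0.4], [0.2, 0.4], [0.2, 0.4], [0.6]],
})

def direct_edge_stats(tree):
    """Return (num_edges, num_patents, edge_density, avg_direct_similarity)."""
    # One row per direct citing-cited edge, deduplicated on the edge itself
    edges = pd.DataFrame({
        "child": tree["id"],
        "parent": tree["lineage"].apply(lambda x: x[0]),
        "sim": tree["similarity"].apply(lambda x: x[0]),
    }).drop_duplicates(subset=["child", "parent"])
    num_edges = len(edges)
    num_patents = tree["id"].nunique()
    return num_edges, num_patents, num_edges / num_patents, edges["sim"].mean()

print(direct_edge_stats(toy_tree))
```

The toy tree has three distinct edges (200→100, 300→200, 400→100) and three distinct patents, so the edge density is 1.0 and the average direct similarity is about 0.4. The duplicated row for patent 300 does not inflate any of the counts.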


**Unexpectedly Long Run** <br>
Obtaining Citation Tree and Running Omnibus Function

In [None]:
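rows = []
for patent_id in random_list:
    citation_tree = conn.query_citation_tree(patent_id)
    print(len(citation_tree))
    # Skip patents whose citation tree is empty
    if len(citation_tree) != 0:
        rows.append(omnibus_fx(citation_tree, patent_id))

# Build the frame once from the collected rows: DataFrame.append was removed
# in pandas 2.0, and a plain list is not mapped onto the named columns
dataset = pd.DataFrame(rows, columns=dataset.columns)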

32
13
257916
133
4943
628
612915
12
13606
454423
4690
821
74
6093
36672
1487
39
384
1
13
9336
2396
650
19545
1743
1932
44692
1
0
7175
13488
18193
27113
394326
0
82533
0
401
2194
122225
0
1243
1199
1900
20841
2730
4274
5378
79
19246
3635
2733
1982
34
550
42
32928
279
3915
287
1427
8066
2452
1054
2787
788
16
1875
0
36328
34732
905
47188
2763
35
902
22639
3324
27396
29993
345
9943
23695
196
8581
2
430618
42
885
1909
125
436
82
1372
559
407
4572
5133
146
87
0
2207
9763
391
5764
2064
200
2000
150
2122
1038
8671
26416
40
544
397
10201
6534
101
15
1321
1868
893
4952
182
3291
885
6877
627
0
315523
5906
726
595
2353
83293
8259
120
2474
0
603
10
0
25284
8416
825
52129
15
43
214
19928
2281
3254
292
56
0
3261
13
0
172
1417
4859
244
2
608
703
1946
1011
223
599
0
85
41433
13601
2215
23944
4492
9303
8
4180
1680
260
92
1865
470
3199
2409
285866
2362
9315
108
24651
704
3796
102
0
713
14557
8501
149
2
8078
4712
1579
0
358
185
16235
508
4866
134
2444271
6166
815
550
31
1522
11436
35
0
0
29558
1432
10953
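
Tree sizes in the output above range from 0 to over two million rows, so a run like this can take long enough that losing it midway hurts. One possible way to make it resumable is sketched below, using only the standard library. The function `summarize_with_checkpoints` and the stand-ins `fake_fetch` and `fake_summarize` are hypothetical; in the notebook their roles would be played by `conn.query_citation_tree` and `omnibus_fx`.

```python
import csv
import os
import tempfile

def summarize_with_checkpoints(ids, fetch_tree, summarize, out_path):
    """Append one summary row per non-empty tree, skipping IDs already written."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path, newline="") as f:
            done = {row[0] for row in csv.reader(f) if row}
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for patent_id in ids:
            if str(patent_id) in done:
                continue
            tree = fetch_tree(patent_id)
            if len(tree) == 0:
                continue
            writer.writerow([patent_id, *summarize(tree)])
            f.flush()  # push the row out of Python's buffer before the next query

# Hypothetical stand-ins for the real query and the omnibus function
calls = []
def fake_fetch(patent_id):
    calls.append(patent_id)
    return [] if patent_id == 2 else [patent_id] * patent_id

def fake_summarize(tree):
    return [len(tree)]

out_path = os.path.join(tempfile.mkdtemp(), "measures.csv")
summarize_with_checkpoints([1, 2, 3], fake_fetch, fake_summarize, out_path)
# A second call resumes: IDs 1 and 3 are already on disk and are not re-queried
summarize_with_checkpoints([1, 2, 3, 4], fake_fetch, fake_summarize, out_path)

with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

In this sketch, IDs with an empty tree are not recorded, so they are queried again on resume. Writing a marker row for them would avoid that at the cost of a slightly messier output file.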


In [None]:
dataset.head()