# Centrality measures

To demonstrate centrality measures we download the citation graph of US patents from 1975-1999. The data is available [here](https://www.nber.org/research/data/us-patents-1975-1999), where you can also find detailed information on the data. By the way, yes, I know the dataset is not exactly up-to-date. There is a slightly more recent dataset [available](https://sites.google.com/site/patentdataproject/Home/downloads?authuser=0) on the internet, but that also only covers patents up to 2004. You can, however, get access to current but raw patent data via the US patent office (search for bulk access on https://www.uspto.gov/). As the amount of data becomes huge very quickly, I've decided to work with the smaller and pre-cleaned dataset for the years 1975-1999.

Centrality measures as well as general graph theoretical concepts and methods are implemented in the package [networkx](https://networkx.org/), which we'll use here. 


In [1]:
import networkx as nx 
import pandas as pd # tabular data
import operator

## Read in data

We download and unpack the data. The data comes in three pieces: the actual citation graph, in which the patents are identified by their patent id; context information on the patents; and a file that maps assignee ids to the names of the patent holders (for some reason the name is not found in the context data).

Unfortunately, the download may be slow.

In [105]:
!wget -q --show-progress https://data.nber.org/patents/acite75_99.zip
!unzip -q acite75_99.zip

!wget -q --show-progress https://data.nber.org/patents/apat63_99.zip
!unzip -q apat63_99.zip

!wget -q --show-progress https://data.nber.org/patents/aconame.zip
!unzip -q aconame.zip



We read in the graph. As citations are not symmetric (i.e., if patent A cites patent B, then B does not necessarily cite A), we'll encode the graph as a digraph. The first line of the file reads <code>"CITING","CITED"</code>. To prevent the first line from being interpreted as an edge between a vertex <code>"CITING"</code> and a vertex <code>"CITED"</code>, we tell <code>networkx</code> that a quotation mark starts a comment. (It would be cleaner if we could tell <code>networkx</code> to simply skip the first line, but that is unfortunately not an option.)
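The reader functions have no skip-header parameter, but one possible workaround is to consume the header yourself and hand the remaining lines to `nx.parse_edgelist`, which accepts any iterable of lines. The sketch below uses a tiny in-memory stand-in for the file (the patent ids are made up); with the real file you would pass an open file handle instead of the `StringIO` object.

```python
import io
import networkx as nx

# Hypothetical miniature of cite75_99.txt: a header line, then one citation per line
fake_file = io.StringIO('"CITING","CITED"\n111,222\n333,222\n')

next(fake_file)  # consume the header line so it is never parsed as an edge
H = nx.parse_edgelist(fake_file, delimiter=",", create_using=nx.DiGraph)

print(sorted(H.edges()))
```

As with the comment trick, the vertex labels end up as strings, not integers.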

To get a bit of information on how large the citation graph is, we print out the number of vertices, the number of edges and the density.

In [63]:
G=nx.read_adjlist("cite75_99.txt",delimiter=',',comments='"',create_using=nx.DiGraph)

G.number_of_nodes(),G.number_of_edges(),nx.density(G)
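For a directed graph with n vertices and m edges, the density is m / (n (n - 1)), i.e., the fraction of all possible ordered pairs that actually are edges. A quick check on a made-up toy graph:

```python
import networkx as nx

# Hypothetical toy citation graph: an edge u -> v means "u cites v"
toy = nx.DiGraph([("b", "a"), ("c", "a"), ("d", "c")])

n, m = toy.number_of_nodes(), toy.number_of_edges()
manual_density = m / (n * (n - 1))  # 3 edges out of 4 * 3 possible ordered pairs

print(manual_density, nx.density(toy))
```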

Next, we read in the context information. The data is encoded as a *csv* file (*comma-separated values*), which we can easily read with a <code>pandas</code> method. If you don't know <code>pandas</code>: it's a library that handles tabular data (think Excel but done properly). We need to <code>merge</code> the context data with the company names so that we have access to the company names in the context table. 

In [97]:
pat_data=pd.read_csv("apat63_99.txt")
pat_names=pd.read_csv("aconame.txt")
pat_data=pd.merge(pat_data,pat_names,how='left',left_on="ASSIGNEE",right_on="ASSIGNEE")
pat_names.head()
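If the effect of `how='left'` is unclear, here is a minimal sketch on two made-up tables (the ids and names are invented; only the column names mirror the ones used above). Every row of the left table survives the merge, and rows without a matching name simply get a missing value.

```python
import pandas as pd

# Hypothetical miniature versions of the context table and the name table
toy_data = pd.DataFrame({"PATENT": [1, 2, 3], "ASSIGNEE": [10, 20, 999]})
toy_names = pd.DataFrame({"ASSIGNEE": [10, 20], "COMPNAME": ["ACME", "GLOBEX"]})

# Left merge: keep all patents, attach a company name where one exists
merged = pd.merge(toy_data, toy_names, how="left", on="ASSIGNEE")

print(merged)
```

Here the patent with assignee id 999 stays in the table with a missing `COMPNAME`.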

## Centralities

Now let's try out some centrality measures. We go with the simplest first: degree centrality. As citations are not symmetric, we turn to *in-degree* centrality, i.e., we ask which patents are cited most often. (Clearly, a patent that is cited by a lot of other patents is important; in contrast, a patent that cites many others merely depends on a lot of others.)
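To make the definition concrete: <code>networkx</code> normalises the in-degree of a vertex by the largest possible value, n - 1, where n is the number of vertices. A small sketch on a made-up graph compares the library value with the hand-computed one.

```python
import networkx as nx

# Hypothetical toy citation graph: an edge u -> v means "u cites v"
toy = nx.DiGraph([("B", "A"), ("C", "A"), ("D", "A"), ("D", "C")])

centrality = nx.in_degree_centrality(toy)

# The same values by hand: number of citations received, divided by n - 1
n = toy.number_of_nodes()
manual = {v: d / (n - 1) for v, d in toy.in_degree()}

print(centrality)
print(manual)
```

This also explains why the values for the patent graph are so tiny: n is in the millions, so even a heavily cited patent gets a small normalised score.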

We're only interested in the patents with the largest in-degree centrality. Quite arbitrarily, I've decided to focus on the 10 patents with the largest in-degree centrality. To restrict to these, we sort by centrality value. The centralities themselves are computed by <code>nx.in_degree_centrality(G)</code>, which returns a dictionary. (That is the reason why you see <code>.items()</code> in the code below.)

In [106]:
top_deg_centrality= sorted(nx.in_degree_centrality(G).items(), key=operator.itemgetter(1),reverse=True)[:10]
top_deg_centrality

[('4723129', 0.00020637035345492846),
 ('4463359', 0.00018968058160940792),
 ('4740796', 0.00017961373509941143),
 ('4345262', 0.0001743153948309922),
 ('4558333', 0.00017325572677730837),
 ('4313124', 0.00016769246949546818),
 ('4683195', 0.0001668977184552053),
 ('4459600', 0.00016239412922704896),
 ('4683202', 0.00016027479311968128),
 ('3953566', 0.00010888089251601488)]

That is quite unenlightening output. Let's at least find out the patent holders and the years in which the patents were granted. For that we look up the patent id, the first entry in each pair of the list.

In [107]:
cols=["PATENT","GYEAR","COMPNAME"]
def lookup(patent_ids):
    df=pat_data[pat_data["PATENT"].isin([int(id) for id in patent_ids])]
    return df[cols].copy()

lookup([id for id,_ in top_deg_centrality])

| | PATENT | GYEAR | COMPNAME |
|---|---|---|---|
| 879900 | 3953566 | 1976 | W. L. GORE & ASSSOCIATES, INC. |
| 1239144 | 4313124 | 1982 | CANON KABUSHIKI KAISHA |
| 1271239 | 4345262 | 1982 | CANON KABUSHIKI KAISHA |
| 1385400 | 4459600 | 1984 | CANON KABUSHIKI KAISHA |
| 1389157 | 4463359 | 1984 | CANON KABUSHIKI KAISHA |
| 1484000 | 4558333 | 1985 | CANON KABUSHIKI KAISHA |
| 1608719 | 4683195 | 1987 | CETUS CORPORATION |
| 1608726 | 4683202 | 1987 | CETUS CORPORATION |
| 1648595 | 4723129 | 1988 | CANON KABUSHIKI KAISHA |
| 1666241 | 4740796 | 1988 | CANON KABUSHIKI KAISHA |


Canon definitely makes sense. I did not know Cetus, but apparently it was one of the first biotech companies and is now part of Novartis. The W. L. Gore & Associates patent is actually about polymers. By the way, if you want to know more about these patents, look them up on the [site](https://ppubs.uspto.gov/pubwebapp/static/pages/ppubsbasic.html) of the US patent office.
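Note that the lookup returns the rows in the order of the context table, not in order of centrality. One way to keep the ranking (and the scores) is to turn the sorted list into a small table and left-merge the context data onto it. The sketch below uses made-up stand-ins for `pat_data` and for the top list; the helper name `lookup_ranked` is invented for this illustration.

```python
import pandas as pd

# Hypothetical stand-ins for pat_data and for a sorted list of (id, score) pairs
toy_data = pd.DataFrame({
    "PATENT": [100, 200, 300],
    "GYEAR": [1976, 1982, 1988],
    "COMPNAME": ["ACME", "GLOBEX", "INITECH"],
})
toy_top = [("300", 0.5), ("100", 0.3), ("50", 0.1)]

def lookup_ranked(ranked, table, cols=("PATENT", "GYEAR", "COMPNAME")):
    """Attach context columns to a ranked list while keeping its order."""
    scores = pd.DataFrame(ranked, columns=["PATENT", "score"])
    scores["PATENT"] = scores["PATENT"].astype(int)  # graph ids are strings
    # A left merge preserves the row order of the left (ranked) table
    return scores.merge(table[list(cols)], on="PATENT", how="left")

ranked_table = lookup_ranked(toy_top, toy_data)
print(ranked_table)
```

Patents that are missing from the context table (id 50 in the toy data) stay visible as rows with missing values instead of silently disappearing.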

Next, let's do *eigenvector centrality*. Again, the method is readily implemented in <code>networkx</code>.

In [73]:
top_eigen_centrality= sorted(nx.eigenvector_centrality_numpy(G).items(), key=operator.itemgetter(1),reverse=True)[:10]
top_eigen_centrality

[('943820', 0.31980107453341566),
 ('2054306', 0.3198010745334156),
 ('3971529', 0.31980107453341555),
 ('2197779', 0.31980107453341555),
 ('2130581', 0.3198010745334155),
 ('2859924', 0.2132007163556107),
 ('2076097', 0.21320071635561058),
 ('2129386', 0.21320071635561053),
 ('3226052', 0.21320071635561053),
 ('2573240', 0.2132007163556105)]

In [108]:
lookup([id for id,_ in top_eigen_centrality])

| | PATENT | GYEAR | COMPNAME |
|---|---|---|---|
| 155234 | 3226052 | 1965 | NaN |
| 897862 | 3971529 | 1976 | DEUTSCHE ANGELGERATE MANUFAKTUR DAM HELLMUTH - |


Hmmm, that seems fishy. First, two is clearly not equal to ten -- so what happened to the other eight patents? Judging by their low patent numbers, they were apparently granted before 1963, the first year covered by the context data, so the lookup cannot find them. Like all patents granted before 1975, they only appear as sinks in the citation graph. Also, before 1972 or so, the data do not contain any company names -- that explains the NaN. So what happened here? The citation graph is not connected and certainly not strongly connected. In that case, pure eigenvector centrality may fail because it may concentrate on a small component of the graph. This seems to be what has happened here.
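A small made-up example shows the kind of checks one could run, and why acyclic graphs are especially awkward: if no vertex lies on a directed cycle, some power of the adjacency matrix is the zero matrix, so 0 is its only eigenvalue and there is no meaningful dominant eigenvector. Whether the real graph `G` is acyclic is not verified here. The connectivity checks can be applied to `G` directly, whereas the dense matrix is only feasible for tiny graphs.

```python
import networkx as nx
import numpy as np

# Hypothetical toy graph: two separate citation chains, no cycles
toy = nx.DiGraph([("c", "b"), ("b", "a"), ("z", "y")])

weakly = nx.is_weakly_connected(toy)        # connected when ignoring directions?
strongly = nx.is_strongly_connected(toy)    # can every vertex reach every other?
acyclic = nx.is_directed_acyclic_graph(toy)

# For an acyclic graph, A^n is the zero matrix (n = number of vertices)
A = nx.to_numpy_array(toy)
nilpotent = not np.linalg.matrix_power(A, toy.number_of_nodes()).any()

print(weakly, strongly, acyclic, nilpotent)
```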

Instead of eigenvector centrality, let's turn to *pagerank*.
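Roughly, pagerank models a reader who follows a random citation with probability alpha and jumps to a random patent otherwise, so even a graph that is not strongly connected gets a well-defined score. The following is a minimal pure-Python sketch of that idea on a made-up graph. It is not the `networkx` implementation; the function name `pagerank_sketch` and the fixed number of iterations are choices made for illustration only.

```python
def pagerank_sketch(edges, alpha=0.85, iters=100):
    """Power iteration for pagerank on a list of directed edges (u, v)."""
    nodes = sorted({x for edge in edges for x in edge})
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Vertices without outgoing edges spread their rank over all vertices
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1 - alpha) / n + alpha * dangling / n for u in nodes}
        # Every other vertex passes its rank on in equal shares
        for u in nodes:
            for v in out[u]:
                new[v] += alpha * rank[u] / len(out[u])
        rank = new
    return rank

# Hypothetical toy citation graph: an edge u -> v means "u cites v"
toy_rank = pagerank_sketch([("c", "b"), ("b", "a"), ("d", "a")])
print(toy_rank)
```

In the toy graph, "a" ends up on top: it is cited twice, and one of the citing vertices is itself cited.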

In [109]:
top_pagerank= sorted(nx.pagerank(G).items(), key=operator.itemgetter(1),reverse=True)[:20]
lookup([id for id,_ in top_pagerank])

| | PATENT | GYEAR | COMPNAME |
|---|---|---|---|
| 621120 | 3694412 | 1972 | SHELL OIL COMPANY |
| 629357 | 3702886 | 1972 | MOBIL OIL CORP. |
| 980892 | 4054595 | 1977 | GIST-BROCADES N.V. |
| 1163351 | 4237224 | 1980 | STANFORD UNIVERSITY, LELAND JUNIOR, THE BOARD ... |
| 1239144 | 4313124 | 1982 | CANON KABUSHIKI KAISHA |
| 1271239 | 4345262 | 1982 | CANON KABUSHIKI KAISHA |
| 1284500 | 4358535 | 1982 | UNIVERSITY OF WASHINGTON |
| 1293878 | 4367924 | 1983 | NaN |
| 1320361 | 4394443 | 1983 | YALE UNIVERSITY |
| 1343955 | 4418068 | 1983 | ELI LILLY AND COMPANY |


That is better. I've looked up the Hitachi patent -- it covers a lithographic method for manufacturing computer chips, certainly an important patent.

If you look at the <code>networkx</code> [documentation](https://networkx.org/documentation/stable/reference/algorithms/centrality.html) you'll find many more implemented centrality measures. I did not run more because I was a bit afraid that the running time would explode as the graph is not small. Feel free to experiment, though.
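If you do want to experiment, some measures offer a way to trade accuracy for speed. For instance, `nx.betweenness_centrality` accepts a parameter `k` that restricts the computation to `k` randomly sampled source vertices. The sketch below runs on a small random graph, not on the patent data, and the choice of `k=20` is arbitrary; I have not tried how large `k` can be for the full citation graph.

```python
import networkx as nx

# Small random digraph as a stand-in for a graph that is too big for the exact method
R = nx.gnp_random_graph(200, 0.03, seed=1, directed=True)

# Approximate betweenness from 20 sampled source vertices instead of all 200
approx = nx.betweenness_centrality(R, k=20, seed=1)

top5 = sorted(approx, key=approx.get, reverse=True)[:5]
print(top5)
```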

A final word: What I've done here makes no sense whatsoever. If you're really interested in identifying the most important patents, you'd always restrict to a certain *field*. That is, you'd try to identify the most important patent in microelectronics, or about vaccines, or whatever. The data here comes with a bit of information about the field, but that information is quite coarse. For a serious application we'd need to download the actual contents of the patents and filter for the ones in the field we want to focus on.
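With the coarse field information, such a restriction could look roughly like the sketch below: select the patent ids of one category, take the induced subgraph, and compute the centrality there. The tables are made up, and the column name `CAT` with its numeric codes is an assumption about the context data that you should check against the documentation of the dataset.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-ins; the column "CAT" and its codes are assumptions
toy_data = pd.DataFrame({"PATENT": [1, 2, 3, 4], "CAT": [2, 2, 3, 2]})
toy_graph = nx.DiGraph([("2", "1"), ("3", "1"), ("4", "2"), ("4", "3")])

# Vertex labels in the graph are strings, so convert the ids before selecting
field_ids = {str(p) for p in toy_data.loc[toy_data["CAT"] == 2, "PATENT"]}

# Induced subgraph: only citations between patents of the chosen category
field_graph = toy_graph.subgraph(field_ids)
field_rank = nx.in_degree_centrality(field_graph)

print(field_rank)
```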