# Jensen-Shannon Divergence
This notebook calculates the **Jensen-Shannon divergence** (a symmetric, bounded variant of the Kullback-Leibler divergence whose square root is a true distance metric) over the discrete probability distribution of NBER subcategories. The measure captures how far apart two probability distributions are. The reasoning for using it is as follows: <br>
1. Prior work on patents shows limited "cross-category" citation of patents
1. Significant patents are associated with strong generality
1. "Cross-category" citations can be associated with generality of patents
<br>

From this line of reasoning, framing each citation hop as a discrete probability distribution and measuring the divergence between hops allows us to **capture how different one citing generation is from the next**.
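
For reference, the standard definition of the Jensen-Shannon divergence is given below, where $P$ and $Q$ are the NBER category distributions of two hops and $i$ ranges over categories. The notebook does not state which logarithm base `js_divergence` uses; with base-2 logarithms the value lies in $[0, 1]$, and terms with $p_i = 0$ contribute zero.

$$
\mathrm{JSD}(P \parallel Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \parallel M), \qquad M = \tfrac{1}{2}(P + Q)
$$

$$
D_{\mathrm{KL}}(P \parallel M) = \sum_{i} p_i \log \frac{p_i}{m_i}
$$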

In [1]:
import neo4j 
import pandas as pd
from credentials import uri, user, pwd
from patent_neo4j.connection import Neo4jConnection
from patent_neo4j.analysis import assign_missing_nber, nber_distribution, js_divergence
from random import randint

In [2]:
df = pd.read_csv("Data/important_patents.csv")
df.head(8)

Unnamed: 0,id,name
0,4136359,AppleMicrocomputer
1,4237224,MolecularChimeras
2,4371752,DigitalVoiceMailSystems
3,4399216,Co-transformationGeneCoding
4,4683195,PolymeraseChainReaction
5,5061620,StemCell
6,5108388,LaserSurgeryMethod
7,6285999,PageRank


In [13]:
conn = Neo4jConnection(uri, user, pwd)
nber = conn.query_nber_category(root=df.loc[2,:].id)

In [14]:
nber.shape

(477790, 4)

In [15]:
len(nber['id'].unique())

92491

In [16]:
nber.head()

Unnamed: 0,id,nber,lineage,nber_lineage
0,4468751,24.0,[4371752],[21]
1,10296580,,"[4468751, 4371752]","[24, 21]"
2,7376581,25.0,"[4468751, 4371752]","[24, 21]"
3,8145532,25.0,"[7376581, 4468751, 4371752]","[25, 24, 21]"
4,7792756,25.0,"[7376581, 4468751, 4371752]","[25, 24, 21]"


## Assigning Missing NBER using "Majority Vote"
Overview of Algorithm:
1. For each patent missing an NBER assignment, get the NBER categories of all its direct ancestors
1. Count the ancestors' NBER categories and assign the majority (most-voted) category
1. Repeat until the root is reached (the root is expected to have a non-None category)
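
The steps above can be sketched roughly as follows. This is an illustrative sketch, not the code behind `assign_missing_nber`: the function name `fill_missing_by_majority`, the plain-dict inputs, and the use of `None` for a missing category are all assumptions made for the example. It also assumes the citation graph is acyclic and breaks ties by whichever category was counted first.

```python
from collections import Counter


def fill_missing_by_majority(category_by_id, parents_by_id):
    """Fill missing categories with the most common category among direct ancestors.

    category_by_id: dict of patent id -> category, or None when missing (hypothetical layout).
    parents_by_id:  dict of patent id -> list of direct ancestor ids (hypothetical layout).
    """
    resolved = dict(category_by_id)

    def resolve(pid):
        # Already known: nothing to do.
        if resolved.get(pid) is not None:
            return resolved[pid]
        # Collect one vote per direct ancestor, resolving ancestors first
        # so the search walks up towards the root when needed.
        votes = Counter()
        for parent in parents_by_id.get(pid, []):
            category = resolve(parent)
            if category is not None:
                votes[category] += 1
        if votes:
            resolved[pid] = votes.most_common(1)[0][0]
        return resolved.get(pid)

    for pid in list(resolved):
        resolve(pid)
    return resolved


# Toy data: "x" has three direct ancestors, "y" depends on the unresolved "x".
categories = {"root": 21, "a": 24, "b": 24, "c": 25, "x": None, "y": None}
parents = {"a": ["root"], "b": ["root"], "c": ["root"], "x": ["a", "b", "c"], "y": ["x"]}
filled = fill_missing_by_majority(categories, parents)
print(filled["x"], filled["y"])  # 24 24
```

With real data, pandas represents missing values as `NaN` rather than `None`, so the missing-value check would need adjusting.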

In [17]:
nber = assign_missing_nber(nber)

In [18]:
nber.head()

Unnamed: 0,id,nber,hops
0,4468751,24,1
2,7376581,25,2
3,8145532,25,3
4,7792756,25,3
5,8600196,22,3


## Calculating Distribution
Grouped by hops and NBER categories, the probability of each NBER category within each hop is calculated, for both **main categories** and **subcategories**.
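
A minimal sketch of this grouping is shown below. It is not the implementation of `nber_distribution`; the function name `hop_distributions` and the `(hops, subcategory)` pair input are hypothetical. It assumes the main category is the leading digit of the two-digit NBER subcategory code (for example, subcategory 24 belongs to main category 2).

```python
from collections import Counter, defaultdict


def hop_distributions(rows, level="sub"):
    """Return {hop: {category: probability}} from (hops, subcategory) pairs.

    level="sub" keeps the two-digit subcategory; level="main" keeps only
    its leading digit (an assumption about the NBER coding scheme).
    """
    counts = defaultdict(Counter)
    for hop, sub in rows:
        category = int(sub) // 10 if level == "main" else int(sub)
        counts[hop][category] += 1
    # Normalise the counts within each hop so they sum to 1.
    return {
        hop: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hop, counter in counts.items()
    }


# Toy rows, not taken from the notebook's data.
rows = [(1, 24), (1, 25), (2, 25), (2, 25), (2, 22), (2, 31)]
sub_dist = hop_distributions(rows, level="sub")
main_dist = hop_distributions(rows, level="main")
print(sub_dist[1])   # {24: 0.5, 25: 0.5}
print(main_dist[2])  # {2: 0.75, 3: 0.25}
```

With a DataFrame like the one above, the pairs could be produced with `zip(nber["hops"], nber["nber"])`.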

In [19]:
distribution = nber_distribution(nber)

## Jensen-Shannon Divergence
Overview of Algorithm:
1. For both main categories and subcategories:
1. Generate every pairwise combination of hops
1. Calculate the JS divergence for each pair
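
The pairwise step can be sketched as follows. This is an illustration under stated assumptions rather than the code inside `js_divergence`: the names `js_div` and `pairwise_js` are hypothetical, each hop's distribution is assumed to be a dict mapping category to probability, and base-2 logarithms are used, which may differ from the base used in the notebook.

```python
from itertools import combinations
from math import log2


def js_div(p, q):
    """Jensen-Shannon divergence (base 2) between two {category: probability} dicts."""
    total = 0.0
    for cat in set(p) | set(q):
        pc, qc = p.get(cat, 0.0), q.get(cat, 0.0)
        mc = 0.5 * (pc + qc)  # mixture probability for this category
        # Zero-probability terms contribute nothing to the KL sums.
        if pc > 0:
            total += 0.5 * pc * log2(pc / mc)
        if qc > 0:
            total += 0.5 * qc * log2(qc / mc)
    return total


def pairwise_js(dists):
    """Return {(hop_a, hop_b): divergence} for every pair of hops."""
    return {
        (a, b): js_div(dists[a], dists[b])
        for a, b in combinations(sorted(dists), 2)
    }


# Toy distributions, not taken from the notebook's data.
toy = {0: {21: 1.0}, 1: {21: 0.5, 24: 0.5}, 2: {24: 1.0}}
result = pairwise_js(toy)
print(sorted(result))   # [(0, 1), (0, 2), (1, 2)]
print(result[(0, 2)])   # 1.0 (disjoint distributions reach the base-2 maximum)
```

Because the divergence is symmetric, only unordered hop pairs are needed, which is why `combinations` is used instead of all ordered pairs.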

In [20]:
js = js_divergence(distribution)

In [21]:
for i in js:
    print(i)

{(0, 1): 0.13443992926405784, (0, 2): 0.19199303154145753, (1, 2): 0.07447678739841684}
{(0, 1): 0.19520468636739802, (0, 2): 0.30111996185764184, (1, 2): 0.14092726782084597}
