<h3> National Bureau of Economic Research (NBER) Herfindahl indices of patent 'originality' and 'generality'</h3>

<p> Matt Wilder, University of Toronto <br>
Please address questions and comments to <a href="mailto:matt.wilder@utoronto.ca">matt.wilder@utoronto.ca</a>. </p>

<p>Updated 14 April 2024</p>

<p>These functions calculate Herfindahl indices of patent 'originality' and 'generality' as introduced by Trajtenberg et al. (1997), 'University versus corporate patents: A window on the basicness of invention', <i>Economics of Innovation and New Technology, 5</i>(1): 19-50. Both variables are also included in the <a href="https://www.nber.org/system/files/working_papers/w8498/w8498.pdf">NBER Patent Citation Data File</a> by Hall et al. (2001).</p>

<p>The originality index measures the diversity of a patent's backward citations—citations made by the patent to earlier patents—across different technological fields (i.e., the breadth of a patent's intellectual heritage):</p>

$$
\text{Originality}_{i} = 1 - \sum_{k=1}^{K_{i}} \left( \frac{n_{ik}}{N_{i}} \right)^2
$$

<p>where \( i \) refers to a particular "focal" patent, \( N_i \) is the total number of backward citations contained in patent \( i \), \( k \) indexes the \( K_i \) unique CPC classifications cited by \( i \), and \( n_{ik} \) is the number of backward citations made by patent \( i \) to CPC classification \( k \).</p>
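
<p>To make the formula concrete, here is a minimal sketch that applies it to a single focal patent. The helper name <code>one_minus_hhi</code> and the list of cited classes are hypothetical and serve only as an illustration; they are not taken from the sample data.</p>

```python
from collections import Counter

def one_minus_hhi(cpc_classes):
    """Return 1 minus the Herfindahl index of a list of CPC class labels."""
    counts = Counter(cpc_classes)   # n_ik: citations per class k
    total = sum(counts.values())    # N_i: total citations
    return 1 - sum((n / total) ** 2 for n in counts.values())

# hypothetical focal patent: four backward citations spread over three classes
cited_classes = ["G09G", "G09G", "H04N", "G02F"]
originality = one_minus_hhi(cited_classes)
print(originality)  # 1 - (0.5**2 + 0.25**2 + 0.25**2) = 0.625
```

<p>A patent whose citations all fall in one class scores 0, and the score rises towards 1 as citations spread evenly over more classes.</p>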

<p>The generality index measures the distribution of a patent's forward citations—citations made to the patent by later patents—across different technological fields (i.e., the breadth of a patent's influence):</p>

$$
\text{Generality}_{i} = 1 - \sum_{k=1}^{K_{i}} \left( \frac{n_{ik}}{N_{i}} \right)^2
$$

<p>where \( i \) refers to a particular "focal" patent, \( N_i \) is the total number of forward citations received by patent \( i \), \( k \) indexes the \( K_i \) unique CPC classifications of patents that cite \( i \), and \( n_{ik} \) is the number of citations received by patent \( i \) from CPC classification \( k \).</p>
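
<p>As a worked check, consider patent 9530381 in the forward-citation sample. Assuming the five rows printed for it are all of its rows, the calculation treats each row as one citation-class pair, so \( N_i = 5 \): two rows in G09G and one each in H01L, G02F, and G06F.</p>

$$
\text{Generality}_{9530381} = 1 - \left[ \left( \frac{2}{5} \right)^2 + 3 \left( \frac{1}{5} \right)^2 \right] = 1 - 0.28 = 0.72
$$

<p>This agrees with the <code>generality_hhi</code> value of 0.720000 reported for that patent in the output.</p>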


In [11]:
'''load in sample data'''

import pandas as pd
import numpy as np

fcit_sample_df = pd.read_csv('https://raw.githubusercontent.com/matt-wilder/patent-research/main/forward_citations_by_cpc_sample.csv',sep='\t', dtype='str')
bcit_sample_df = pd.read_csv('https://raw.githubusercontent.com/matt-wilder/patent-research/main/backward_citations_by_cpc_sample.csv',sep='\t', dtype='str')

print('first 20 patents in forward citations sample')
print(fcit_sample_df.head(20))

print('\nfirst 20 patents in backward citations sample')
print(bcit_sample_df.head(20))

first 20 patents in forward citations sample
          to   patent patent_date        name to_cpc forward_citations
0   10062352  9530381  2016-12-01     Bozarth   G09G               3.0
1   10078977  9530381  2016-12-01     Bozarth   G09G               3.0
2   10062352  9530381  2016-12-01     Bozarth   H01L               3.0
3   10078977  9530381  2016-12-01     Bozarth   G02F               3.0
4   10078977  9530381  2016-12-01     Bozarth   G06F               3.0
5   10324237  9458989  2016-10-01  Hsu et al.   G02B               3.0
6    9927616  9458989  2016-10-01  Hsu et al.   G02B               3.0
7    9927616  9458989  2016-10-01  Hsu et al.   F21V               3.0
8    9927616  9458989  2016-10-01  Hsu et al.   B82Y               3.0
9    9927616  9458989  2016-10-01  Hsu et al.   G02F               3.0
10   9927616  9458989  2016-10-01  Hsu et al.   G09F               3.0
11   9927616  9458989  2016-10-01  Hsu et al.   Y10S               3.0
12   9927616  9458989  2016-10-0

In [12]:
''' calculate the Herfindahl index for backward citations (i.e., "originality")'''

def hhi(series):
    """Herfindahl index: sum of squared shares of each unique value in series."""
    _, cnt = np.unique(series, return_counts=True)  # number of rows per CPC class
    return np.square(cnt / cnt.sum()).sum()

# temporarily convert cpc classes to integer codes so that np.unique works on a numeric array
# (pd.factorize codes any missing class as -1, which is then counted as one class)
bcit_sample_df['cpc_code'], levels = pd.factorize(bcit_sample_df['from_cpc'])

# one hhi value per focal patent
df3 = bcit_sample_df.groupby('patent').agg({'cpc_code': hhi}).reset_index()

df3 = df3.rename(columns={'cpc_code': 'originality_hhi'})  # rename the column 'originality_hhi'
df3['originality_hhi'] = 1 - df3['originality_hhi']  # originality is 1 - hhi

# merge back into bcit dataframe; a left merge keeps the original row order
bcit_sample_df = bcit_sample_df.merge(df3, on='patent', how='left')

# drop the temporary column
bcit_sample_df = bcit_sample_df.drop(columns='cpc_code')

bcit_sample_df

Unnamed: 0,from,patent,patent_date,name,from_cpc,backward_citations,originality_hhi
0,6906762,8941691,2015-01-01,Baron,G09G,11.0,0.896552
1,6369830,8941691,2015-01-01,Baron,G09G,11.0,0.896552
2,8436873,8941691,2015-01-01,Baron,G09G,11.0,0.896552
3,8432411,8941691,2015-01-01,Baron,G09G,11.0,0.896552
4,8416149,8941691,2015-01-01,Baron,G09G,11.0,0.896552
...,...,...,...,...,...,...,...
95,7619585,8890771,2014-11-01,Pance,H04N,4.0,0.847222
96,7205959,8890771,2014-11-01,Pance,G02F,4.0,0.847222
97,7205959,8890771,2014-11-01,Pance,G09F,4.0,0.847222
98,7205959,8890771,2014-11-01,Pance,H04M,4.0,0.847222


In [13]:
''' calculate the Herfindahl index for forward citations (i.e., "generality")'''

def hhi(series):
    """Herfindahl index: sum of squared shares of each unique value in series."""
    _, cnt = np.unique(series, return_counts=True)  # number of rows per CPC class
    return np.square(cnt / cnt.sum()).sum()

# temporarily convert cpc classes to integer codes so that np.unique works on a numeric array
# (pd.factorize codes any missing class as -1, which is then counted as one class)
fcit_sample_df['cpc_code'], levels = pd.factorize(fcit_sample_df['to_cpc'])

# one hhi value per focal patent
df2 = fcit_sample_df.groupby('patent').agg({'cpc_code': hhi}).reset_index()

df2 = df2.rename(columns={'cpc_code': 'generality_hhi'})  # rename the column 'generality_hhi'
df2['generality_hhi'] = 1 - df2['generality_hhi']  # generality is 1 - hhi

# merge back into fcit dataframe; a left merge keeps the original row order
fcit_sample_df = fcit_sample_df.merge(df2, on='patent', how='left')

# drop the temporary column
fcit_sample_df = fcit_sample_df.drop(columns='cpc_code')

fcit_sample_df

Unnamed: 0,to,patent,patent_date,name,to_cpc,forward_citations,generality_hhi
0,10062352,9530381,2016-12-01,Bozarth,G09G,3.0,0.720000
1,10078977,9530381,2016-12-01,Bozarth,G09G,3.0,0.720000
2,10062352,9530381,2016-12-01,Bozarth,H01L,3.0,0.720000
3,10078977,9530381,2016-12-01,Bozarth,G02F,3.0,0.720000
4,10078977,9530381,2016-12-01,Bozarth,G06F,3.0,0.720000
...,...,...,...,...,...,...,...
145,9734658,6906762,2005-06-01,Witehira et al.,A63F,154.0,0.249851
146,9566500,6906762,2005-06-01,Witehira et al.,A63F,154.0,0.249851
147,9595159,6906762,2005-06-01,Witehira et al.,A63F,154.0,0.249851
148,9881453,6906762,2005-06-01,Witehira et al.,A63F,154.0,0.249851
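
<p>For comparison, the same scores can be obtained without the temporary integer column or the merge. The sketch below is one possible alternative, not the method used above: it computes class shares per patent with <code>value_counts</code> and maps the result back onto the rows. The small data frame reproduces the five rows printed for patent 9530381; the patent labelled "P2" and the names <code>rows</code>, <code>scores</code>, and <code>one_minus_hhi</code> are hypothetical.</p>

```python
import pandas as pd

# five forward-citation rows as printed for patent 9530381,
# plus a hypothetical patent "P2" cited from a single CPC class
rows = pd.DataFrame({
    "patent": ["9530381"] * 5 + ["P2"] * 2,
    "to_cpc": ["G09G", "G09G", "H01L", "G02F", "G06F", "A63F", "A63F"],
})

def one_minus_hhi(classes: pd.Series) -> float:
    """1 minus the sum of squared class shares within one patent's rows."""
    shares = classes.value_counts(normalize=True)
    return 1 - (shares ** 2).sum()

# one score per patent, indexed by patent number
scores = rows.groupby("patent")["to_cpc"].apply(one_minus_hhi)

# map each patent's score back onto its rows (row order is unchanged)
rows["generality_hhi"] = rows["patent"].map(scores)
print(rows)
```

<p>Under these assumptions, patent 9530381 scores approximately 0.72, in line with the output above, and the single-class patent scores 0. Swapping <code>to_cpc</code> for <code>from_cpc</code> would give the originality score in the same way.</p>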
