# Calculating Innovation Measures

This notebook calculates the atypicality score and technological leap score for each patent in the `final_fwdcitations.csv` dataset. 

## Results note

There are **763,040** patents without any cited CPC subclasses or groups.

## Atypicality Score

This notebook calculates the Atypicality Score for all patents in the `final_fwdcitations.csv` dataset.

### Steps
1. Find the set of CPC subclasses for each patent using the `g_cpc_current.tsv` dataset.

2. Filter the set of CPC subclasses to only include those that occur in the `final_fwdcitations.csv` dataset.

3. For each patent, generate all 2-way combinations of CPC subclasses. Any patent that cites fewer than 2 CPC subclasses is considered to have no 2-way combinations.

4. Count the number of occurrences of each 2-way combination across all patents.

5. Calculate the Atypicality Score for each CPC subclass pair:
$$
\text{CPC Subclass Pair Atypicality Score} = -\ln\left(\frac{n_{pair}}{N}\right)
$$
where $n_{pair}$ is the number of occurrences of the pair and $N$ is the sum of all occurrences of pairs.

6. Calculate the Atypicality Score for each patent:
$$
\text{Patent Atypicality Score} = \begin{cases}
\frac{\sum_{pair} \text{CPC Subclass Pair Atypicality Score}}{\text{Number of CPC Subclass Pairs cited by Patent}} & \text{if Number of CPC Subclass Pairs} > 0 \\
NaN & \text{otherwise}
\end{cases}
$$
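
The six steps above can be traced on a toy example. This is a minimal sketch, not the notebook's implementation: the three patents and their subclass sets are made up, and `toy_patents`, `pair_score`, and `patent_score` are hypothetical names used only for illustration.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical toy data: patent_id -> set of CPC subclasses (not from the real dataset)
toy_patents = {
    "P1": {"B29C", "G05B"},
    "P2": {"B29C", "G05B", "B29K"},
    "P3": {"G01S"},  # fewer than 2 subclasses, so no pairs and a NaN score
}

# Steps 3-4: generate sorted 2-way combinations and count them across patents
patent_pairs = {
    pid: list(combinations(sorted(subclasses), 2))
    for pid, subclasses in toy_patents.items()
}
pair_counts = Counter(pair for pairs in patent_pairs.values() for pair in pairs)
total_pairs = sum(pair_counts.values())  # N in the formula

# Step 5: pair-level score, -ln(n_pair / N)
pair_score = {pair: -math.log(n / total_pairs) for pair, n in pair_counts.items()}

# Step 6: patent-level score is the mean over its pairs, NaN if it has none
patent_score = {
    pid: (sum(pair_score[p] for p in pairs) / len(pairs)) if pairs else math.nan
    for pid, pairs in patent_pairs.items()
}
```

In this toy data the pair (B29C, G05B) accounts for two of the four pair occurrences and scores ln 2, while the two pairs seen only once score ln 4, so rarer combinations receive higher scores.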



In [16]:
import pandas as pd
import numpy as np

In [2]:
dtypes = {
    "patent_id": "string",
    "forward_citations": "int64",
}
fwd_citations = pd.read_csv("./data/final_fwdcitation.csv", dtype=dtypes)

fwd_citations.head()

Unnamed: 0,patent_id,patent_type,patent_date,patent_title,wipo_kind,forward_citations
0,10000000,utility,2018-06-19,Coherent LADAR using intra-pixel quadrature de...,B2,13
1,10000001,utility,2018-06-19,Injection molding machine and mold thickness c...,B2,0
2,10000002,utility,2018-06-19,Method for manufacturing polymer film and co-e...,B2,0
3,10000003,utility,2018-06-19,Method for producing a container from a thermo...,B2,2
4,10000004,utility,2018-06-19,"Process of obtaining a double-oriented film, c...",B2,0


In [21]:
dtypes = {
    "patent_id": "string",
    "cpc_sequence": "int64",
}
cpc = pd.read_csv(
    "./data/g_cpc_current.tsv", 
    sep="\t",
    dtype=dtypes,
    )
cpc.head()

Unnamed: 0,patent_id,cpc_sequence,cpc_section,cpc_class,cpc_subclass,cpc_group,cpc_type
0,3950000,0,A,A63,A63C,A63C9/001,inventional
1,3950000,1,A,A63,A63C,A63C9/00,inventional
2,3950000,2,A,A63,A63C,A63C9/002,inventional
3,3950000,3,A,A63,A63C,A63C9/081,inventional
4,3950001,0,A,A63,A63C,A63C9/086,inventional


In [4]:
citations_dtypes = {
    "patent_id": "string",
    "citation_patent_id": "string",
}
citations = pd.read_csv(
    "./data/g_us_patent_citation.tsv", 
    sep="\t",
    dtype=citations_dtypes,
    )
citations.head()


Unnamed: 0,patent_id,citation_sequence,citation_patent_id,citation_date,record_name,wipo_kind,citation_category
0,10000000,0,5093563,1992-03-01,Small,A,cited by examiner
1,10000000,1,5751830,1998-05-01,Hutchinson,A,cited by applicant
2,10000001,0,7804268,2010-09-01,Park,B2,cited by examiner
3,10000001,1,9022767,2015-05-01,Oono,B2,cited by examiner
4,10000001,2,9090016,2015-07-01,Takeuchi,B2,cited by examiner


In [22]:
fwd_citations = fwd_citations[["patent_id", "forward_citations"]]
cpc = cpc[["patent_id", "cpc_sequence", "cpc_subclass", "cpc_group"]]
citations = citations[["patent_id", "citation_sequence", "citation_patent_id"]]

In [39]:
num_patents = fwd_citations["patent_id"].nunique()
print(f"Number of patents: {num_patents}")

Number of patents: 7507819


In [None]:
num_unique_cpc_groups = cpc["cpc_group"].nunique()
num_unique_cpc_subclasses = cpc["cpc_subclass"].nunique()
print(f"Number of unique CPC groups: {num_unique_cpc_groups}")
print(f"Number of unique CPC subclasses: {num_unique_cpc_subclasses}")

Number of unique CPC groups: 244725
Number of unique CPC subclasses: 676


In [8]:
from itertools import combinations
import os

#1. Group by patent_id and collect all cpc_subclasses for each patent
# This will create a dataframe where each patent_id maps to a set of its cpc_subclasses
cpc_subclasses = cpc.groupby("patent_id")["cpc_subclass"].apply(set)
cpc_subclasses = pd.merge(
    cpc_subclasses,
    fwd_citations.set_index("patent_id"), 
    left_index=True, 
    right_index=True,
    how="right",
    validate="1:1"
)
cpc_subclasses = cpc_subclasses["cpc_subclass"]
cpc_subclasses = cpc_subclasses.rename("cpc_subclasses")
cpc_subclasses.head()


patent_id
10000000                                  {G01S}
10000001                            {G05B, B29C}
10000002    {B32B, B29L, B29K, B60C, B29C, B29D}
10000003                {B29K, B29L, B29D, B29C}
10000004                      {B29K, B29L, B29C}
Name: cpc_subclasses, dtype: object

In [9]:
# Check the number of patents with no cpc subclasses cited
cpc_subclasses.isna().sum()

763040

In [12]:
import ast
import numpy as np
# 2. Generate all unique pairs of cpc_subclasses for each patent
def generate_pairs(cpc_subclass: set) -> list:
    if pd.isna(cpc_subclass):
        return []
    if not isinstance(cpc_subclass, set):
        cpc_subclass = set(cpc_subclass)
    if not cpc_subclass:
        return []
    return list(combinations(sorted(cpc_subclass), 2))

patent_cpc_subclass_pairs = None
if not os.path.exists("./data/cpc_subclass_pairs.csv"):
    patent_cpc_subclass_pairs = cpc_subclasses.apply(generate_pairs)
    patent_cpc_subclass_pairs = patent_cpc_subclass_pairs.rename("cpc_subclass_pairs")
    patent_cpc_subclass_pairs.to_csv("./data/cpc_subclass_pairs.csv")
else:
    patent_cpc_subclass_pairs = pd.read_csv("./data/cpc_subclass_pairs.csv", index_col=0, dtype={"patent_id": "string"})
    patent_cpc_subclass_pairs["cpc_subclass_pairs"] = patent_cpc_subclass_pairs["cpc_subclass_pairs"].apply(ast.literal_eval)
    patent_cpc_subclass_pairs = patent_cpc_subclass_pairs["cpc_subclass_pairs"]
patent_cpc_subclass_pairs.head()

patent_id
10000000                                                   []
10000001                                       [(B29C, G05B)]
10000002    [(B29C, B29D), (B29C, B29K), (B29C, B29L), (B2...
10000003    [(B29C, B29D), (B29C, B29K), (B29C, B29L), (B2...
10000004           [(B29C, B29K), (B29C, B29L), (B29K, B29L)]
Name: cpc_subclass_pairs, dtype: object
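
The cell above caches the pairs as CSV, which stores each list of tuples as text, so the cached branch has to parse the strings back with `ast.literal_eval`. The following sketch illustrates that round trip on made-up data; it writes to an in-memory buffer rather than `./data/`, and the names `toy`, `buffer`, and `restored` are illustrative only.

```python
import ast
import io
import pandas as pd

# Hypothetical toy Series of pair lists, indexed by patent_id
toy = pd.Series(
    [[], [("B29C", "G05B")]],
    index=pd.Index(["P1", "P2"], name="patent_id"),
    name="cpc_subclass_pairs",
)

# Write to an in-memory CSV so the sketch has no side effects on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# After reading back, each cell is a string such as "[('B29C', 'G05B')]"
restored = pd.read_csv(buffer, index_col=0)
as_text = restored.loc["P2", "cpc_subclass_pairs"]

# ast.literal_eval parses the text back into a list of tuples without executing code
restored["cpc_subclass_pairs"] = restored["cpc_subclass_pairs"].apply(ast.literal_eval)
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for cached columns like this one.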

In [13]:
# 3. Flatten the list of pairs into a long-form DataFrame
# Each row maps a patent to one of its cpc subclass pair, or NaN if it has fewer than 2 cpc subclasses cited
patent_cpc_subclass_pairs_long = patent_cpc_subclass_pairs.explode()
patent_cpc_subclass_pairs_long = patent_cpc_subclass_pairs_long.rename("cpc_subclass_pair")
patent_cpc_subclass_pairs_long = patent_cpc_subclass_pairs_long.to_frame().reset_index()
patent_cpc_subclass_pairs_long.head()

Unnamed: 0,patent_id,cpc_subclass_pair
0,10000000,
1,10000001,"(B29C, G05B)"
2,10000002,"(B29C, B29D)"
3,10000002,"(B29C, B29K)"
4,10000002,"(B29C, B29L)"
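
The empty value for patent 10000000 in the output above comes from how `Series.explode` treats empty lists: each one becomes a single NaN row instead of being dropped. A small sketch on made-up data (the IDs and the name `toy` are illustrative only) shows the effect:

```python
import pandas as pd

# Hypothetical toy Series: one patent with no pairs, one with two pairs
toy = pd.Series(
    [[], [("A01B", "A01C"), ("A01B", "A01D")]],
    index=["P1", "P2"],
    name="cpc_subclass_pair",
)

# explode() emits one row per list element; an empty list yields one NaN row
toy_long = toy.explode()

n_rows = len(toy_long)                  # includes the NaN placeholder row for P1
n_pairs = int(toy_long.notna().sum())   # counts only real pairs
```

This is why the long-form frame keeps one row for every patent, and why counts of real pairs should be taken after excluding NaN rows.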


In [14]:
# 4. Count occurrences of each unique pair
if not os.path.exists("./data/cpc_subclass_pair_counts.csv"):
    pair_counts = patent_cpc_subclass_pairs_long.groupby("cpc_subclass_pair").size()
    pair_counts = pair_counts.to_frame(name="count")
    pair_counts.to_csv("./data/cpc_subclass_pair_counts.csv")
else:
    pair_counts = pd.read_csv("./data/cpc_subclass_pair_counts.csv")
    pair_counts["cpc_subclass_pair"] = pair_counts["cpc_subclass_pair"].apply(ast.literal_eval)
    pair_counts = pair_counts.set_index("cpc_subclass_pair")

pair_counts.head()

Unnamed: 0_level_0,count
cpc_subclass_pair,Unnamed: 1_level_1
"(A01B, A01C)",1662
"(A01B, A01D)",1209
"(A01B, A01F)",133
"(A01B, A01G)",468
"(A01B, A01H)",15


In [15]:
# Sum the "count" column so that total_pairs is a scalar rather than a one-element Series
total_pairs = pair_counts["count"].sum()
print(f"Total number of pairs in the dataset: {total_pairs}")

num_unique_pairs = pair_counts.index.nunique()
print(f"Number of unique pairs in the dataset: {num_unique_pairs}")

Total number of pairs in the dataset: 14580344
Number of unique pairs in the dataset: 95614


In [16]:
import numpy as np
# 5. Assign each cpc subclass pair an atypicality score
def atypicality_score(n):
    if n > 0:
        return -np.log(n/total_pairs)
    else:
        # This should not happen since the pairs are generated from existing data. If it does, raise an error.
        raise ValueError("Count must be greater than 0")

if not os.path.exists("./data/cpc_subclass_pair_counts_with_atypicality.csv"):
    pair_counts["atypicality_score"] = pair_counts["count"].apply(atypicality_score)
    pair_counts.to_csv("./data/cpc_subclass_pair_counts_with_atypicality.csv")
else: 
    pair_counts = pd.read_csv("./data/cpc_subclass_pair_counts_with_atypicality.csv")
    pair_counts["cpc_subclass_pair"] = pair_counts["cpc_subclass_pair"].apply(ast.literal_eval)
    pair_counts = pair_counts.set_index("cpc_subclass_pair")
pair_counts.head()




Unnamed: 0_level_0,count,atypicality_score
cpc_subclass_pair,Unnamed: 1_level_1,Unnamed: 2_level_1
"(A01B, A01C)",1662,9.079408
"(A01B, A01D)",1209,9.397636
"(A01B, A01F)",133,11.604836
"(A01B, A01G)",468,10.346717
"(A01B, A01H)",15,13.787135


In [17]:
# Atypicality score per cpc subclass pair summary statistics
pair_counts["atypicality_score"].describe()

count    95614.000000
mean        14.105618
std          1.963772
min          4.489449
25%         12.857599
50%         14.415743
75%         15.802038
max         16.495185
Name: atypicality_score, dtype: float64

In [18]:
pair_counts_dict = pair_counts.to_dict(orient="index")

In [None]:
# Step 6: Calculate the atypicality score for each patent
def calculate_atypicality_score(cpc_subclass_pairs):
    if len(cpc_subclass_pairs) == 0:
        return np.nan
    # Get the atypicality score for each cpc_subclass pair
    scores = [pair_counts_dict.get(tuple(pair)).get("atypicality_score") for pair in cpc_subclass_pairs]
    if len(scores) == 0:
        raise RuntimeError("No matching pairs found for the given cpc_subclass_pairs.")
    
    # Calculate the atypicality score for each patent = average atypicality score for all cpc subclass pairs it cites
    atypicality_score = np.mean(scores)
    
    return atypicality_score

if not os.path.exists("./data/patents_with_atypicality.csv"):
    if not isinstance(patent_cpc_subclass_pairs, pd.DataFrame):
        patent_cpc_subclass_pairs = patent_cpc_subclass_pairs.to_frame()
    # Group by patent_id and calculate the atypicality score for each patent
    patent_cpc_subclass_pairs["atypicality_score"] = patent_cpc_subclass_pairs["cpc_subclass_pairs"].apply(calculate_atypicality_score)
    patent_cpc_subclass_pairs.to_csv("./data/patents_with_atypicality.csv")
else:
    patent_cpc_subclass_pairs = pd.read_csv("./data/patents_with_atypicality.csv", index_col=0)
    patent_cpc_subclass_pairs["cpc_subclass_pairs"] = patent_cpc_subclass_pairs["cpc_subclass_pairs"].apply(ast.literal_eval)


patent_cpc_subclass_pairs.head()

Unnamed: 0_level_0,cpc_subclass_pairs,atypicality_score
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1
10000000,[],
10000001,"[(B29C, G05B)]",8.984207
10000002,"[(B29C, B29D), (B29C, B29K), (B29C, B29L), (B2...",8.37534
10000003,"[(B29C, B29D), (B29C, B29K), (B29C, B29L), (B2...",6.990229
10000004,"[(B29C, B29K), (B29C, B29L), (B29K, B29L)]",6.23967


In [21]:
# Atypicality score summary statistics
patent_cpc_subclass_pairs["atypicality_score"].describe()

count    4.056170e+06
mean     8.093600e+00
std      1.987437e+00
min      4.489449e+00
25%      6.627396e+00
50%      8.040293e+00
75%      9.427694e+00
max      1.649518e+01
Name: atypicality_score, dtype: float64

In [22]:
# 8. Merge the atypicality scores with the list of cpc subclasses to get the full output
patent_cpc_subclass_pairs_with_atypicality = pd.merge(patent_cpc_subclass_pairs, cpc_subclasses, right_index=True, left_index=True, how="inner", validate="1:1")
patent_cpc_subclass_pairs_with_atypicality.head()

Unnamed: 0_level_0,cpc_subclass_pairs,atypicality_score,cpc_subclasses
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10000000,[],,{G01S}
10000001,"[(B29C, G05B)]",8.984207,"{G05B, B29C}"
10000002,"[(B29C, B29D), (B29C, B29K), (B29C, B29L), (B2...",8.37534,"{B32B, B29L, B29K, B60C, B29C, B29D}"
10000003,"[(B29C, B29D), (B29C, B29K), (B29C, B29L), (B2...",6.990229,"{B29K, B29L, B29D, B29C}"
10000004,"[(B29C, B29K), (B29C, B29L), (B29K, B29L)]",6.23967,"{B29K, B29L, B29C}"


In [23]:
# Step 9: Save the final DataFrame to a CSV file
patent_cpc_subclass_pairs_with_atypicality.to_csv("./data/patents_with_atypicality.csv")

In [24]:
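# 10. Validate that the output dataset contains the expected number of patents
# A separate name is used so that the input count stored in num_patents is not overwritten
num_patents_in_output = patent_cpc_subclass_pairs_with_atypicality.shape[0]
print(f"Number of patents in the output dataset: {num_patents_in_output}")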

Number of patents in the output dataset: 7507819


In [25]:
patent_cpc_subclass_pairs_with_atypicality.isna().sum()

cpc_subclass_pairs          0
atypicality_score     3451649
cpc_subclasses         763040
dtype: int64

## Atypicality Score Summary Statistics

| Atypicality Score Statistic | Value     |
|----------------------------|-----------|
| Number of non-null values  | 4,056,170 |
| Number of null values      | 3,451,649 |
| Total Count                | 7,507,819 |
| Mean                       | 8.0936    |
| Standard Deviation         | 1.9874    |
| Min                        | 4.4894    |
| Max                        | 16.4952   |
| 25th-percentile            | 6.6274    |
| 50th-percentile            | 8.0403    |
| 75th-percentile            | 9.4277    |

In [26]:
# 11. Validate that every patent with a NaN atypicality score has no cpc_subclass_pairs
# (i.e. it has fewer than 2 CPC subclasses)
nan_score_rows = patent_cpc_subclass_pairs_with_atypicality[
    patent_cpc_subclass_pairs_with_atypicality["atypicality_score"].isna()
]
assert nan_score_rows["cpc_subclass_pairs"].apply(lambda x: len(x) == 0).all(), "There are patents with NaN atypicality scores that have at least one cpc_subclass_pair."
print("All patents with NaN atypicality scores have no cpc_subclass_pairs.")

All patents with NaN atypicality scores have no cpc_subclass_pairs.


In [27]:
# 12. Validate that no cpc_subclass_pairs have been over-counted, i.e. the sum of counts in pair_counts matches the total number of pairs in patent_cpc_subclass_pairs_long
assert pair_counts["count"].sum() == patent_cpc_subclass_pairs_long.dropna(subset=["cpc_subclass_pair"]).shape[0], f"The sum of counts in pair_counts ({pair_counts['count'].sum()}) does not match the total number of pairs in patent_cpc_subclass_pairs_long ({patent_cpc_subclass_pairs_long.dropna(subset=['cpc_subclass_pair']).shape[0]})."
print("All cpc_subclass_pairs have been counted correctly.")

All cpc_subclass_pairs have been counted correctly.


In [31]:
# 13. Random spot check: Validate that the atypicality score for a random patent matches the expected value
import random
def manual_check(patent_id):
    cpc_subclass_pairs = patent_cpc_subclass_pairs.loc[patent_id, "cpc_subclass_pairs"]
    if isinstance(cpc_subclass_pairs, str):
        cpc_subclass_pairs = ast.literal_eval(cpc_subclass_pairs)
    if len(cpc_subclass_pairs) == 0:
        return np.nan
    counts = [pair_counts_dict.get(tuple(pair)).get("count") for pair in cpc_subclass_pairs]
    if len(counts) == 0 or None in counts:
        raise RuntimeError("No matching pairs found for the given cpc_subclass_pairs.")
    scores = [-np.log(count/total_pairs) if count > 0 else np.nan for count in counts]
    atypicality_score = np.mean(scores)
    return atypicality_score

samples = random.sample(patent_cpc_subclass_pairs_with_atypicality.index.tolist(), 10000)
for sample in samples:
    expected_score = manual_check(sample)
    actual_score = patent_cpc_subclass_pairs_with_atypicality.loc[sample, "atypicality_score"]
    assert np.isclose(expected_score, actual_score, equal_nan=True), f"Expected {expected_score}, but got {actual_score} for patent {sample}."
print("All manual checks passed successfully.")


All manual checks passed successfully.


In [32]:
num_patents_in_output = patent_cpc_subclass_pairs_with_atypicality.shape[0]
assert num_patents_in_output == num_patents, f"Number of patents in the output dataset ({num_patents_in_output}) does not match the number of patents in the input dataset ({num_patents})."
print("Number of patents in the output dataset matches the number of patents in the input dataset.")

Number of patents in the output dataset matches the number of patents in the input dataset.


## Technological Leap Score

This notebook calculates the Technological Leap Score for all patents in the `final_fwdcitations.csv` dataset.

### Steps

1. Find the set of backward citations (the patents cited by a patent) for each patent in the `g_us_patent_citation.tsv` dataset.

2. Filter the set of backward citations to only include those that occur in the `final_fwdcitations.csv` dataset.

3. For each patent, find the set of CPC subclasses that are cited by the patent.

4. For each patent, find the set of CPC subclasses that are cited by its backward citations. 

5. Find the intersection of (3) and (4) to get the overlapping CPC subclasses.

6. Find the union of (3) and (4) to get the total CPC subclasses.

7. Calculate the Jaccard Similarity score:
$$
\text{Similarity Score} = \begin{cases}
\frac{\text{Number of Overlapping CPC subclasses}}{\text{Total CPC subclasses in New Patent and Cited Patents}} & \text{if Total CPC subclasses} > 0 \\
NaN & \text{otherwise}
\end{cases}
$$

8. Calculate the Technological Leap Score:
$$1 - \text{Similarity Score}$$
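
Steps 3 to 8 reduce to a few set operations. The sketch below follows the formulas as written above on made-up inputs; `tech_leap`, `patent_subclasses`, and `cited_subclasses` are hypothetical names, and this is not the notebook's implementation.

```python
# Hypothetical toy inputs (not taken from the real dataset)
patent_subclasses = {"B29C", "B29L", "B29K"}                  # step 3
cited_subclasses = {"B29C", "B29L", "B65D", "B65B", "B32B"}   # step 4, pooled over cited patents


def tech_leap(own: set, cited: set) -> float:
    """Return 1 - Jaccard similarity, or NaN when the union is empty."""
    union = own | cited          # step 6
    if not union:
        return float("nan")
    overlap = own & cited        # step 5
    return 1 - len(overlap) / len(union)


score = tech_leap(patent_subclasses, cited_subclasses)
```

Here 2 of the 6 subclasses in the union overlap, so the score is 1 - 2/6. Note that the notebook's own code is slightly stricter than this sketch: it returns NaN whenever the patent itself has no CPC subclasses, even if its cited patents do.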

In [23]:
patent_cpc_subclass_pairs_with_atypicality = pd.read_csv("./data/patents_with_atypicality.csv", index_col=0, dtype={"patent_id": "string"})
patent_cpc_subclass_pairs_with_atypicality.head()


Unnamed: 0_level_0,cpc_subclass_pairs,atypicality_score,cpc_subclasses
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10000000,[],,{'G01S'}
10000001,"[('B29C', 'G05B')]",8.984207,"{'G05B', 'B29C'}"
10000002,"[('B29C', 'B29D'), ('B29C', 'B29K'), ('B29C', ...",8.37534,"{'B32B', 'B29L', 'B29K', 'B60C', 'B29C', 'B29D'}"
10000003,"[('B29C', 'B29D'), ('B29C', 'B29K'), ('B29C', ...",6.990229,"{'B29K', 'B29L', 'B29D', 'B29C'}"
10000004,"[('B29C', 'B29K'), ('B29C', 'B29L'), ('B29K', ...",6.23967,"{'B29K', 'B29L', 'B29C'}"


In [24]:
# 1. Create a dictionary mapping patent IDs to a list of patent IDs they cite
citations_list = citations.groupby("patent_id")["citation_patent_id"].apply(list)
citations_dict = citations_list.to_dict()

In [25]:
# 2. Create a dictionary mapping patent IDs to a list of their CPC subclasses
cpc_subclasses = cpc.groupby("patent_id")["cpc_subclass"].apply(set)
cpc_subclasses = pd.merge(
    cpc_subclasses,
    fwd_citations.set_index("patent_id"), 
    left_index=True, 
    right_index=True,
    how="right",
    validate="1:1"
)
cpc_subclasses = cpc_subclasses["cpc_subclass"]
cpc_subclasses = cpc_subclasses.rename("cpc_subclasses")
cpc_subclasses.head()

patent_id
10000000                                  {G01S}
10000001                            {B29C, G05B}
10000002    {B29C, B29L, B60C, B29D, B32B, B29K}
10000003                {B29C, B29D, B29L, B29K}
10000004                      {B29C, B29L, B29K}
Name: cpc_subclasses, dtype: object

In [26]:
cpc_subclasses_dict = cpc_subclasses.to_dict()

In [27]:
# 3. Get the intersection of CPC subclasses between a patent and its backward citations
def get_intersect_cpc_subclasses(row):
    patent_id = row.name
    patent_cpc_subclasses = cpc_subclasses_dict.get(patent_id, {})
    if not patent_cpc_subclasses or (isinstance(patent_cpc_subclasses, float) and np.isnan(patent_cpc_subclasses)):
        return []
    backward_citations = citations_dict.get(patent_id, [])
    if not backward_citations:
        return []
    backward_citations_cpc_subclasses = []
    for citation in backward_citations:
        citation_cpc_subclasses = cpc_subclasses_dict.get(citation, {})
        if not citation_cpc_subclasses or (isinstance(citation_cpc_subclasses, float) and np.isnan(citation_cpc_subclasses)):
            continue
        if citation_cpc_subclasses:
            backward_citations_cpc_subclasses.extend(citation_cpc_subclasses)
    if not backward_citations_cpc_subclasses:
        return []
    return list(set(patent_cpc_subclasses) & set(backward_citations_cpc_subclasses))

patent_cpc_subclass_pairs_with_atypicality["intersect_cpc_subclasses"] = patent_cpc_subclass_pairs_with_atypicality.apply(get_intersect_cpc_subclasses, axis=1)
patent_cpc_subclass_pairs_with_atypicality.head()

Unnamed: 0_level_0,cpc_subclass_pairs,atypicality_score,cpc_subclasses,intersect_cpc_subclasses
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
10000000,[],,{'G01S'},[G01S]
10000001,"[('B29C', 'G05B')]",8.984207,"{'G05B', 'B29C'}",[B29C]
10000002,"[('B29C', 'B29D'), ('B29C', 'B29K'), ('B29C', ...",8.37534,"{'B32B', 'B29L', 'B29K', 'B60C', 'B29C', 'B29D'}","[B29C, B32B, B29K]"
10000003,"[('B29C', 'B29D'), ('B29C', 'B29K'), ('B29C', ...",6.990229,"{'B29K', 'B29L', 'B29D', 'B29C'}","[B29C, B29L]"
10000004,"[('B29C', 'B29K'), ('B29C', 'B29L'), ('B29K', ...",6.23967,"{'B29K', 'B29L', 'B29C'}","[B29C, B29L]"


In [28]:
# 4. Get the union of CPC subclasses between a patent and its backward citations
def get_union_cpc_subclasses(row):
    patent_id = row.name
    patent_cpc_subclasses = cpc_subclasses_dict.get(patent_id, {})
    if not patent_cpc_subclasses or (isinstance(patent_cpc_subclasses, float) and np.isnan(patent_cpc_subclasses)):
        return []
    backward_citations = citations_dict.get(patent_id, [])
    if not backward_citations:
        return patent_cpc_subclasses
    backward_citations_cpc_subclasses = []
    for citation in backward_citations:
        citation_cpc_subclasses = cpc_subclasses_dict.get(citation, {})
        if not citation_cpc_subclasses or (isinstance(citation_cpc_subclasses, float) and np.isnan(citation_cpc_subclasses)):
            continue
        if citation_cpc_subclasses:
            backward_citations_cpc_subclasses.extend(citation_cpc_subclasses)
    return list(set(patent_cpc_subclasses) | set(backward_citations_cpc_subclasses))

patent_cpc_subclass_pairs_with_atypicality["union_cpc_subclasses"] = patent_cpc_subclass_pairs_with_atypicality.apply(get_union_cpc_subclasses, axis=1)
patent_cpc_subclass_pairs_with_atypicality.head()

Unnamed: 0_level_0,cpc_subclass_pairs,atypicality_score,cpc_subclasses,intersect_cpc_subclasses,union_cpc_subclasses
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
10000000,[],,{'G01S'},[G01S],"[Y02A, G01S]"
10000001,"[('B29C', 'G05B')]",8.984207,"{'G05B', 'B29C'}",[B29C],"[B29C, G05B]"
10000002,"[('B29C', 'B29D'), ('B29C', 'B29K'), ('B29C', ...",8.37534,"{'B32B', 'B29L', 'B29K', 'B60C', 'B29C', 'B29D'}","[B29C, B32B, B29K]","[B29C, Y10T, H05K, B29L, B60C, B29D, B32B, C09..."
10000003,"[('B29C', 'B29D'), ('B29C', 'B29K'), ('B29C', ...",6.990229,"{'B29K', 'B29L', 'B29D', 'B29C'}","[B29C, B29L]","[B29C, B29D, B60K, B29K, B29L]"
10000004,"[('B29C', 'B29K'), ('B29C', 'B29L'), ('B29K', ...",6.23967,"{'B29K', 'B29L', 'B29C'}","[B29C, B29L]","[B29C, B29L, B65D, B65B, B32B, B29K]"


In [29]:
# 5. Calculate the Technological Leap Score
# Technological Leap Score = 1 - (Jaccard Similarity)
def calculate_tech_leap_score(patent):
    # If the union is empty (the patent itself has no CPC subclasses), the score is undefined.
    if not patent["union_cpc_subclasses"]:
        return np.nan
    # If a patent has no CPC subclasses in common with its backward citations, it has maximum technological leap.
    elif not patent["intersect_cpc_subclasses"]:
        return 1.0
    else:
        return 1 - (len(patent["intersect_cpc_subclasses"]) / len(patent["union_cpc_subclasses"]))
patent_cpc_subclass_pairs_with_atypicality["tech_leap"] = patent_cpc_subclass_pairs_with_atypicality.apply(calculate_tech_leap_score, axis=1)
patent_cpc_subclass_pairs_with_atypicality.head(10)

Unnamed: 0_level_0,cpc_subclass_pairs,atypicality_score,cpc_subclasses,intersect_cpc_subclasses,union_cpc_subclasses,tech_leap
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
10000000,[],,{'G01S'},[G01S],"[Y02A, G01S]",0.5
10000001,"[('B29C', 'G05B')]",8.984207,"{'G05B', 'B29C'}",[B29C],"[B29C, G05B]",0.5
10000002,"[('B29C', 'B29D'), ('B29C', 'B29K'), ('B29C', ...",8.37534,"{'B32B', 'B29L', 'B29K', 'B60C', 'B29C', 'B29D'}","[B29C, B32B, B29K]","[B29C, Y10T, H05K, B29L, B60C, B29D, B32B, C09...",0.666667
10000003,"[('B29C', 'B29D'), ('B29C', 'B29K'), ('B29C', ...",6.990229,"{'B29K', 'B29L', 'B29D', 'B29C'}","[B29C, B29L]","[B29C, B29D, B60K, B29K, B29L]",0.6
10000004,"[('B29C', 'B29K'), ('B29C', 'B29L'), ('B29K', ...",6.23967,"{'B29K', 'B29L', 'B29C'}","[B29C, B29L]","[B29C, B29L, B65D, B65B, B32B, B29K]",0.666667
10000005,"[('B29C', 'Y10T')]",6.483157,"{'Y10T', 'B29C'}",[B29C],"[B29C, Y10S, Y10T, B29L]",0.75
10000006,"[('B29C', 'B29K'), ('B29C', 'B29L'), ('B29C', ...",7.613158,"{'B29L', 'B60R', 'B29K', 'B29C', 'Y10T'}",[],"{B29C, Y10T, B29L, B60R, B29K}",1.0
10000007,"[('B29C', 'B29K'), ('B29C', 'B29L'), ('B29K', ...",6.23967,"{'B29K', 'B29L', 'B29C'}","[B29C, B29L, B29K]","[G01D, B25B, B29C, E02F, Y10T, Y10S, B30B, B21...",0.884615
10000008,"[('A44C', 'B29C'), ('A44C', 'B29K'), ('A44C', ...",9.454074,"{'B29K', 'B29L', 'A44C', 'B29C'}","[B29C, A44C]","[B29C, Y10S, Y10T, B44C, B22D, A44C, B22C, B29...",0.8
10000009,"[('B29C', 'B29L')]",5.978974,"{'B29L', 'B29C'}","[B29C, B29L]","[Y02P, B29C, Y10S, Y10T, B29L, G05B, H02K, B41...",0.777778


In [30]:
# 6. Save the final DataFrame to a CSV file
patent_cpc_subclass_pairs_with_atypicality.to_csv("./data/patents_with_atypicality_and_tech_leap.csv")

In [31]:
patent_cpc_subclass_pairs_with_atypicality["tech_leap"].describe()

count    6.744779e+06
mean     6.776016e-01
std      2.999458e-01
min      0.000000e+00
25%      5.000000e-01
50%      7.500000e-01
75%      9.166667e-01
max      1.000000e+00
Name: tech_leap, dtype: float64

## Technological Leap Score Summary Statistics

| Technological Leap Score Statistic | Value     |
|-----------------------------------|-----------|
| Number of non-null values         | 6,744,779 |
| Number of null values             | 763,040   |
| Total Count                       | 7,507,819 |
| Mean                              | 0.6776    |
| Standard Deviation                | 0.2999    |
| Min                               | 0.0000    |
| Max                               | 1.0000    |
| 25th-percentile                   | 0.5000    |
| 50th-percentile                   | 0.7500    |
| 75th-percentile                   | 0.9167    |

In [32]:
# 7. Validate that all the patents with NaN technological leap scores are indeed patents
# that cite no CPC subclasses and have no backward citations or its backward citations have no CPC subclasses
assert patent_cpc_subclass_pairs_with_atypicality[patent_cpc_subclass_pairs_with_atypicality["tech_leap"].isna()]["union_cpc_subclasses"].apply(lambda x: len(x) == 0).all(), "There are patents with NaN technological leap scores that have CPC subclasses."
print("All patents with NaN technological leap scores have no CPC subclasses.")

All patents with NaN technological leap scores have no CPC subclasses.


In [33]:
# 8. Validate that all technological leap scores are between 0 and 1
assert patent_cpc_subclass_pairs_with_atypicality["tech_leap"].dropna().between(0.0, 1.0, inclusive="both").all(), "There are technological leap scores outside the range [0, 1]."
print("All technological leap scores are between 0 and 1.")

All technological leap scores are between 0 and 1.


In [34]:
# 9. Random spot check: Validate that the technological leap score for a random patent matches the expected value
import random
def manual_tech_leap(patent_id):
    backward_citations = citations_dict.get(patent_id, [])
    patent_cpc_subclasses = cpc_subclasses_dict.get(patent_id, {})
    backward_citations_cpc_subclasses = []
    if backward_citations:
        for citation in backward_citations:
            citation_cpc_subclasses = cpc_subclasses_dict.get(citation, {})
            if not citation_cpc_subclasses or (isinstance(citation_cpc_subclasses, float) and np.isnan(citation_cpc_subclasses)):
                continue
            if citation_cpc_subclasses:
                backward_citations_cpc_subclasses.extend(citation_cpc_subclasses)
    if not patent_cpc_subclasses or (isinstance(patent_cpc_subclasses, float) and np.isnan(patent_cpc_subclasses)):
        return np.nan
    if not backward_citations_cpc_subclasses:
        return 1.0
    union_cpc_subclasses = list(set(patent_cpc_subclasses) | set(backward_citations_cpc_subclasses))
    intersect_cpc_subclasses = list(set(patent_cpc_subclasses) & set(backward_citations_cpc_subclasses))
    if not union_cpc_subclasses:
        return np.nan
    if not intersect_cpc_subclasses:
        return 1.0
    else:
        return 1 - (len(intersect_cpc_subclasses) / len(union_cpc_subclasses))

samples = random.sample(patent_cpc_subclass_pairs_with_atypicality.index.tolist(), 10000)
for sample in samples:
    expected_score = manual_tech_leap(sample)
    actual_score = patent_cpc_subclass_pairs_with_atypicality.loc[sample, "tech_leap"]
    assert np.isclose(expected_score, actual_score, equal_nan=True), f"Expected {expected_score}, but got {actual_score} for patent {sample}."
print("All manual checks for technological leap scores passed successfully.")

All manual checks for technological leap scores passed successfully.


In [None]:
# 10. Validate that the output dataset contains the expected number of patents
# num_patents is recomputed here so that this cell does not depend on an earlier cell having run
num_patents = fwd_citations["patent_id"].nunique()
num_patents_in_output = patent_cpc_subclass_pairs_with_atypicality.shape[0]
assert num_patents_in_output == num_patents, f"Number of patents in the output dataset ({num_patents_in_output}) does not match the number of patents in the input dataset ({num_patents})."
print("Number of patents in the output dataset matches the number of patents in the input dataset.")

## Forward Citation Impact Score

This notebook calculates the Forward Citation Impact Score for all patents in the `final_fwdcitations.csv` dataset.

### Steps

1. For each patent, find the set of forward citations (the patents that cite it) using the `g_us_patent_citation.tsv` dataset.

2. Filter the set of forward citations to only include those that occur in the `final_fwdcitations.csv` dataset.

3. For each patent, find the list of CPC subclasses belonging to all of its forward citations.

4. For each patent, count the number of times each unique CPC subclass is cited by its forward citations.

5. Calculate each CPC subclass's proportion (the share of all CPC subclass occurrences in the forward citations that belong to that subclass) using the formula:
$$
\text{CPC Subclass Proportion} = \frac{\text{Number of times CPC subclass is cited by forward citations}}{\text{Total number of CPC subclasses in forward citations}}
$$

6. For each patent, calculate the entropy of the CPC subclass proportions using the formula:
$$
\text{Entropy} = -\sum_{i=1}^{n} p_i \cdot \ln(p_i)
$$
where $p_i$ is the proportion of the $i$-th CPC subclass.

7. Calculate the Forward Citation Impact Score for each patent:
$$
\text{Forward Citation Impact Score} = \text{Entropy} \cdot \text{Number of Forward Citations}
$$
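
Steps 4 to 7 can be sketched with the standard library alone. This is an illustrative version on made-up input, not the notebook's implementation; the function name `forward_citation_impact` and its arguments are assumptions introduced for the example.

```python
from collections import Counter
import math


def forward_citation_impact(fwd_subclasses: list, n_forward_citations: int) -> float:
    """Entropy (natural log) of the classification proportions times the citation count."""
    if not fwd_subclasses:
        return math.nan  # no classified forward citations, so entropy is undefined
    counts = Counter(fwd_subclasses)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy * n_forward_citations


# Hypothetical patent cited by 2 later patents, each classified in B29C and B60K
score = forward_citation_impact(["B29C", "B60K", "B29C", "B60K"], 2)
```

With two equally frequent classifications the entropy is ln 2, so this toy patent scores 2 ln 2. A patent whose forward citations all fall in a single classification has zero entropy and therefore zero impact under this definition, regardless of how often it is cited.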






In [36]:
patent_fwd_citations = citations.groupby("citation_patent_id")["patent_id"].apply(list)
patent_fwd_citations = patent_fwd_citations.rename("fwd_cited_by_patents")
patent_fwd_citations.index.set_names("patent_id", inplace=True)
patent_fwd_citations = pd.merge(
    patent_fwd_citations,
    fwd_citations.set_index("patent_id"),
    left_index=True,
    right_index=True,
    how="right",
    validate="1:1"
)
patent_fwd_citations.head()

Unnamed: 0_level_0,fwd_cited_by_patents,forward_citations
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1
10000000,"[10753736, 10845468, 10873738, 11092690, 11237...",13
10000001,,0
10000002,,0
10000003,"[10668805, 11318832]",2
10000004,,0


In [37]:
patent_fwd_citations.isna().sum()

fwd_cited_by_patents    2405979
forward_citations             0
dtype: int64

In [38]:
patent_fwd_citations.shape[0]

7507819

In [39]:
patent_fwd_citations["fwd_cited_by_patents"] = patent_fwd_citations["fwd_cited_by_patents"].apply(
    lambda x: [] if isinstance(x, float) and pd.isna(x) else x
)

In [40]:
import numpy as np

def get_fwd_citation_cpc_subclasses(fwd_citation_patent_ids):
    """Collect the CPC subclasses of all citing patents, keeping duplicates."""
    if not fwd_citation_patent_ids:
        return []
    cpc_subclasses = []
    for patent_id in fwd_citation_patent_ids:
        cpc_subclass = cpc_subclasses_dict.get(patent_id)
        # Skip citing patents that are missing from the lookup or map to NaN or an empty value
        if not cpc_subclass or (isinstance(cpc_subclass, float) and np.isnan(cpc_subclass)):
            continue
        cpc_subclasses.extend(cpc_subclass)
    return cpc_subclasses

patent_fwd_citations["fwd_cited_cpc_subclasses"] = patent_fwd_citations["fwd_cited_by_patents"].apply(get_fwd_citation_cpc_subclasses)
patent_fwd_citations.head()

Unnamed: 0_level_0,fwd_cited_by_patents,forward_citations,fwd_cited_cpc_subclasses
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10000000,"[10753736, 10845468, 10873738, 11092690, 11237...",13,"[G01B, G06T, G01S, H04N, G01S, G06N, G01S, G01..."
10000001,[],0,[]
10000002,[],0,[]
10000003,"[10668805, 11318832]",2,"[B29C, B60K, B29C, B60K]"
10000004,[],0,[]


In [41]:
def calculate_entropy(fwd_cited_cpc_subclasses):
    if not fwd_cited_cpc_subclasses:
        return np.nan
    counts = pd.Series(fwd_cited_cpc_subclasses).value_counts()
    probabilities = counts / counts.sum()
    entropy = -np.sum(probabilities * np.log(probabilities))
    return entropy

patent_fwd_citations["fwd_cited_cpc_subclasses_entropy"] = patent_fwd_citations["fwd_cited_cpc_subclasses"].apply(calculate_entropy)
patent_fwd_citations.head()

Unnamed: 0_level_0,fwd_cited_by_patents,forward_citations,fwd_cited_cpc_subclasses,fwd_cited_cpc_subclasses_entropy
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
10000000,"[10753736, 10845468, 10873738, 11092690, 11237...",13,"[G01B, G06T, G01S, H04N, G01S, G06N, G01S, G01...",1.399631
10000001,[],0,[],
10000002,[],0,[],
10000003,"[10668805, 11318832]",2,"[B29C, B60K, B29C, B60K]",0.693147
10000004,[],0,[],


In [42]:
# NaN entropy propagates through the multiplication, so patents without any
# forward-cited CPC subclasses keep a NaN impact score.
patent_fwd_citations["fwd_citation_impact_score"] = (
    patent_fwd_citations["fwd_cited_cpc_subclasses_entropy"] * patent_fwd_citations["forward_citations"]
)
patent_fwd_citations.head(10)

Unnamed: 0_level_0,fwd_cited_by_patents,forward_citations,fwd_cited_cpc_subclasses,fwd_cited_cpc_subclasses_entropy,fwd_citation_impact_score
patent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
10000000,"[10753736, 10845468, 10873738, 11092690, 11237...",13,"[G01B, G06T, G01S, H04N, G01S, G06N, G01S, G01...",1.399631,18.195203
10000001,[],0,[],,
10000002,[],0,[],,
10000003,"[10668805, 11318832]",2,"[B29C, B60K, B29C, B60K]",0.693147,1.386294
10000004,[],0,[],,
10000005,[],0,[],,
10000006,[10343329],1,"[B29C, B29L, B29K]",1.098612,1.098612
10000007,"[10618153, 10926451, 10946576, 11110646, 11596...",9,"[F02B, Y02T, B25C, B29C, B29L, B29K, B29C, B29...",1.861736,16.755626
10000008,[],0,[],,
10000009,"[11177606, 12042988, 12059841, 12076918, 12097...",9,"[B41J, H01R, B33Y, B41J, B29C, B22F, B33Y, B29...",1.977066,17.793597


In [43]:
patent_fwd_citations["fwd_citation_impact_score"].isna().sum()

2921914
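
This null count (2,921,914) is larger than the 2,405,979 patents that had no citing patents in the citation data, so the remaining 515,935 patents have at least one citing patent but still end up with an empty subclass list. A likely cause is that none of their citing patents has an entry in `cpc_subclasses_dict`, though the notebook does not confirm this. The sketch below shows one way the two cases could be separated. It uses the notebook's column names on a made-up three-row frame, so the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for patent_fwd_citations (hypothetical data)
toy = pd.DataFrame(
    {
        "fwd_cited_by_patents": [["A1", "A2"], [], ["A3"]],
        "fwd_cited_cpc_subclasses": [["B29C", "B60K"], [], []],
        "fwd_citation_impact_score": [1.386294, np.nan, np.nan],
    },
    index=pd.Index(["P1", "P2", "P3"], name="patent_id"),
)

is_null = toy["fwd_citation_impact_score"].isna()
has_citing = toy["fwd_cited_by_patents"].map(len) > 0

# Null because no patent cites it at all
null_no_citing = int((is_null & ~has_citing).sum())
# Null although it is cited: no subclasses were found for the citing patents
null_no_subclasses = int((is_null & has_citing).sum())

print(null_no_citing, null_no_subclasses)  # 1 1
```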

In [44]:
patent_fwd_citations["fwd_citation_impact_score"].describe()

count    4.585905e+06
mean     3.674883e+01
std      1.574725e+02
min     -0.000000e+00
25%      1.386294e+00
50%      6.870920e+00
75%      2.527974e+01
max      1.102635e+04
Name: fwd_citation_impact_score, dtype: float64
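
The `-0.000000e+00` minimum is a floating-point negative zero rather than a genuinely negative score. When every forward-cited subclass is the same, the only proportion is 1, `ln(1)` is 0, and negating `0.0` yields `-0.0`. The sketch below reproduces this with the same expression used in `calculate_entropy`. Adding `0.0` to normalise the sign is a suggested tweak, not something the notebook does.

```python
import numpy as np
import pandas as pd

# All forward-cited subclasses are identical, so the single proportion is 1.0
subclasses = ["B29C", "B29C", "B29C"]
counts = pd.Series(subclasses).value_counts()
probabilities = counts / counts.sum()

entropy = -np.sum(probabilities * np.log(probabilities))
is_negative_zero = bool(entropy == 0 and np.signbit(entropy))

# Adding 0.0 turns -0.0 into +0.0 without changing any other value
normalised = entropy + 0.0

print(is_negative_zero, bool(np.signbit(normalised)))  # True False
```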

In [45]:
patent_fwd_citations.to_csv("./data/patent_with_fwd_citation_impact.csv")

## Forward Citation Impact Score summary statistics

| Forward Citation Impact Score Statistic | Value     |
|-----------------------------------------|-----------|
| Number of non-null values               | 4,585,905 |
| Number of null values                   | 2,921,914 |
| Total Count                             | 7,507,819 |
| Mean                                    | 36.7488   |
| Standard Deviation                      | 157.4725  |
| Min                                     | 0.0000    |
| Max                                     | 11,026.35 |
| 25th-percentile                         | 1.3863    |
| 50th-percentile                         | 6.8709    |
| 75th-percentile                         | 25.2797   |