# Comparing Natural Language Processing Approaches to Clustering Patents from Subsidiary Companies
## Peter de Guzman (ped19)
## Lilah DuBoff (lad90)
## Christian Moreira (csm87)

## Problem Statement:

Using a dataset of patents submitted to the U.S. Patent and Trademark Office (USPTO) by subsidiaries of large multinational corporations, we cluster patents into topic categories. The NLP techniques employed include cleaning the patent text, reducing dimensionality with PCA, and applying two machine learning algorithms (Multinomial Naive Bayes and a Support Vector Classifier) to group patent abstracts and titles into a set of comparable topics. The motivation is to track innovation across publicly traded companies, especially where patents are filed under different subsidiary names (e.g. “Google” with patents under “Waymo”, “DeepMind”, and “Nest”). Emerging technological advances often originate in subsidiaries of large corporations but go untracked because of the sheer number of subsidiary firms. This project explores classification methods beyond the traditional Cooperative Patent Classification (CPC) system, offering more flexible and insightful ways for legal specialists, researchers, and investors to explore patent content and similar innovation strategies.


## Solution:

Large publicly traded companies invest millions of dollars in research and development to maintain a competitive edge while developing new products. The patents that result from this work are often filed by subsidiaries rather than by the parent company. Investors and market analysts must track these subsidiaries to understand emerging trends and forecast growth across industries, but doing so manually is time- and resource-intensive.

To address this problem, we tested the ability of two models to effectively cluster patent abstracts and titles into meaningful groups by topic. We selected a multinomial Naive Bayes classifier and a support vector classifier (SVC) model. 

The **multinomial Naive Bayes classifier**….

The **support vector classifier**…


# Evaluation of Training Results

In [17]:
# Load libraries
import math
import re
import warnings

import matplotlib.pyplot as plt
import nltk
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    accuracy_score,
    auc,
    classification_report,
    confusion_matrix,
    roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC
from tabulate import tabulate

warnings.filterwarnings("ignore")

In [18]:
# Preprocessing code

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()


# lowercases the text, replaces every non-letter character (punctuation, digits) with a space,
# drops stopwords, and lemmatizes the remaining words
def clean_text(t):
    t = t.lower()
    t = re.sub(r"[^a-z\s]", " ", t)
    t = " ".join([lemmatizer.lemmatize(word) for word in t.split() if word not in stop])
    return t

# load in new subset
df_subset = pd.read_csv("data/top500_patents.csv")

# combine IPC section and class into a single label
df_subset["Combined_ipc_clean"] = (
    df_subset["ipc_sections"] + "_" + df_subset["ipc_classes"].astype(str)
)

# combine title and abstract into one text field, then clean it;
# fillna("") keeps a missing title or abstract from turning the whole text into "nan"
df_subset["text_clean"] = (
    (df_subset["patent_title"].fillna("") + ": " + df_subset["patent_abstract"].fillna(""))
    .astype(str)
    .apply(clean_text)
)

# Additional data cleaning

# keep only labels with at least 50 observations
df_subset = df_subset.groupby("Combined_ipc_clean").filter(lambda x: len(x) >= 50)

# remove rows carrying the duplicate label variants listed below
dups_to_remove = ["H_4", "G_1", "B_1", "G_6", "C_7"]
df_subset = df_subset[~df_subset["Combined_ipc_clean"].isin(dups_to_remove)]

In [19]:
# Synthetic data

import numpy as np

np.random.seed(42)

# List of IPC class labels from our data
classes = [
    'G_06','H_04','A_61','A;C_61;7','A;C_07;61','C_07','G_16;6','G;H_4;6',
    'H_02','G_02','H_01','A;C_07;12;61','C_07;12','A_63','A;G_6;63',
    'G;H_04;06','G_10','B_65','F_16','G_01','B_26','B_29','A_43','A_46',
    'A_46;61','B_60','H_03','B_67','C_12','G_06;10','B;C_1;10','C_10','B_01',
    'E_21','C_08','C_11','F_02','G_1;6','B;C_1','B;C_1;7','A_23','F_01',
    'G_10;6','B;G_6;60'
]
# Hand-written vocabulary for each class
# (5-6 distinctive words per class)
class_vocab = {
    'G_06': ["network", "algorithm", "compute", "data", "process", "machine"],
    'H_04': ["signal", "communication", "transmit", "channel", "frequency", "modulation"],
    'A_61': ["medical", "device", "surgery", "treatment", "patient", "health"],
    'A;C_61;7': ["chemical", "compound", "reaction", "acid", "solution", "synthesis"],
    'A;C_07;61': ["drug", "therapy", "molecule", "pharma", "treatment", "dose"],
    'C_07': ["organic", "reaction", "synthesis", "compound", "catalyst", "solution"],
    'G_16;6': ["computer", "software", "data", "algorithm", "system", "processing"],
    'G;H_4;6': ["network", "protocol", "signal", "transmission", "error", "coding"],
    'H_02': ["telecom", "signal", "modulation", "channel", "data", "transmit"],
    'G_02': ["imaging", "sensor", "signal", "measurement", "processing", "analysis"],
    'H_01': ["electronics", "circuit", "voltage", "current", "device", "component"],
    'A;C_07;12;61': ["compound", "reaction", "drug", "therapy", "molecule", "pharma"],
    'C_07;12': ["synthesis", "organic", "compound", "reaction", "molecule"],
    'A_63': ["game", "sport", "entertainment", "toy", "device", "play"],
    'A;G_6;63': ["computer", "device", "software", "system", "interface"],
    'G;H_04;06': ["signal", "communication", "network", "channel", "transmission"],
    'G_10': ["mechanical", "machine", "engine", "device", "process"],
    'B_65': ["packaging", "container", "material", "product", "process"],
    'F_16': ["mechanical", "engine", "gear", "device", "machine"],
    'G_01': ["measurement", "sensor", "instrument", "signal", "data"],
    'B_26': ["metal", "alloy", "cutting", "process", "tool"],
    'B_29': ["plastic", "molding", "material", "process", "product"],
    'A_43': ["hair", "cosmetic", "care", "brush", "device"],
    'A_46': ["clothing", "design", "fabric", "pattern", "material"],
    'A_46;61': ["textile", "fabric", "sewing", "material", "design"],
    'B_60': ["vehicle", "engine", "transport", "car", "wheel"],
    'H_03': ["electronics", "circuit", "signal", "power", "device"],
    'B_67': ["container", "tank", "liquid", "fluid", "pipe"],
    'C_12': ["biotech", "enzyme", "cell", "microbe", "reaction"],
    'G_06;10': ["software", "computer", "algorithm", "system", "data"],
    'B;C_1;10': ["chemical", "process", "compound", "reaction", "material"],
    'C_10': ["chemical", "reaction", "compound", "acid", "solution"],
    'B_01': ["process", "material", "equipment", "reaction", "flow"],
    'E_21': ["drilling", "oil", "well", "engine", "pump"],
    'C_08': ["polymer", "material", "compound", "synthesis", "reaction"],
    'C_11': ["oil", "chemical", "process", "refine", "compound"],
    'F_02': ["engine", "turbine", "combustion", "mechanical", "airflow"],
    'G_1;6': ["sensor", "signal", "measurement", "data", "processing"],
    'B;C_1': ["chemical", "compound", "reaction", "process", "material"],
    'B;C_1;7': ["chemical", "reaction", "compound", "process", "catalyst"],
    'A_23': ["medical", "treatment", "therapy", "patient", "drug"],
    'F_01': ["engine", "mechanical", "device", "combustion", "turbine"],
    'G_10;6': ["mechanical", "device", "engine", "gear", "system"],
    'B;G_6;60': ["process", "material", "engine", "chemical", "reaction"]
}

In [20]:
# Base class vocab (5 placeholder tokens per class).
# Note: this reassignment replaces the hand-written class_vocab defined above.
class_vocab = {cls: [f"{cls}_word{i}" for i in range(5)] for cls in classes}

# Shared words across all classes
shared_words = ["data", "system", "device", "process", "method"]

# Noise words (random filler)
noise_words = ["sample", "example", "info", "text", "random"]

# Add shared and noise words to class vocab
for cls in classes:
    class_vocab[cls] += shared_words
    class_vocab[cls] += list(np.random.choice(noise_words, size=3))

def generate_synthetic_text(classes, class_vocab, n_docs_per_class=100):
    texts = []
    labels = []

    all_words = [w for vocab in class_vocab.values() for w in vocab]

    for cls in classes:
        vocab = class_vocab[cls]

        for _ in range(n_docs_per_class):
            doc_length = np.random.randint(20, 40)  # 20-39 words per document

            # 35% of the words come from this class's vocabulary
            n_cls = int(doc_length * 0.35)
            n_other = doc_length - n_cls

            doc_words = list(np.random.choice(vocab, size=n_cls, replace=True))

            # the remaining ~65% come from the pooled vocabulary of all classes
            # (which also contains this class's own words and the shared/noise words)
            doc_words += list(np.random.choice(all_words, size=n_other, replace=True))

            np.random.shuffle(doc_words)
            texts.append(" ".join(doc_words))
            labels.append(cls)

    return pd.Series(texts), pd.Series(labels)



# Generate synthetic dataset
X_syn, y_syn = generate_synthetic_text(classes, class_vocab, n_docs_per_class=100)
#print("Synthetic dataset size:", len(X_syn))

In [21]:
X_syn, y_syn = generate_synthetic_text(classes, class_vocab, n_docs_per_class=100)

X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Naive Bayes pipeline (ComplementNB is a multinomial Naive Bayes variant
# designed for imbalanced classes)
class_model_syn = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("nb", ComplementNB()),
])

# Train and predict on the held-out split
class_model_syn.fit(X_train_syn, y_train_syn)
preds_syn = class_model_syn.predict(X_test_syn)

# print("Synthetic Data Accuracy:", accuracy_score(y_test_syn, preds_syn))
# print(classification_report(y_test_syn, preds_syn))

# Compute report as dict
report_dict = classification_report(y_test_syn, preds_syn, output_dict=True)

# Convert to DataFrame
report_df = pd.DataFrame(report_dict).transpose()

# Round numbers for readability
report_df = report_df.round(4)

print(tabulate(report_df, headers='keys', tablefmt='grid'))


+--------------+-------------+----------+------------+-----------+
|              |   precision |   recall |   f1-score |   support |
+==============+=============+==========+============+===========+
| A;C_07;12;61 |      0.72   |   0.9    |     0.8    |   20      |
+--------------+-------------+----------+------------+-----------+
| A;C_07;61    |      1      |   0.35   |     0.5185 |   20      |
+--------------+-------------+----------+------------+-----------+
| A;C_61;7     |      0.8824 |   0.75   |     0.8108 |   20      |
+--------------+-------------+----------+------------+-----------+
| A;G_6;63     |      0.7391 |   0.85   |     0.7907 |   20      |
+--------------+-------------+----------+------------+-----------+
| A_23         |      0.9    |   0.9    |     0.9    |   20      |
+--------------+-------------+----------+------------+-----------+
| A_43         |      0.8182 |   0.9    |     0.8571 |   20      |
+--------------+-------------+----------+------------+-----------+
| A_46         |      0.9444 |   0.85   |     0.8947 |   20      |
+--------------+-------------+----------+------------+-----------+
| ...          |         ... |      ... |        ... |       ... |
+--------------+-------------+----------+------------+-----------+


**Multinomial Naive Bayes Classifier:**




**Support Vector Classifier:**


# Application of Solution on Real Data

To conduct this experiment, we collected publicly available patent data from the U.S. Patent and Trademark Office (USPTO). First, we identified the thirty companies in the Dow Jones Industrial Average, a stock market index of prominent companies. For each of these companies, we then collected the legally incorporated names of its subsidiaries. Finally, we passed this list of subsidiary names to the USPTO API, which returned the title and abstract of each patent filed under those names.

We applied several pre-processing steps to clean the data and improve accuracy. First, we removed duplicate patent observations. We then combined the “section” and “class” labels of each patent into a single label to produce more meaningful clusters; both labels identify a patent’s industry field and topic. Finally, we restricted the data to combined labels with at least fifty observations. Because patent classifications can be very niche, many patents carry combinations of classifications that appear only once or twice in the data. Classes this sparse give a model too few examples to learn from, so we dropped them.
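To make the effect of the fifty-observation threshold concrete, here is a minimal, dependency-free sketch. The label counts are hypothetical, and the helper `filter_rare_labels` is our own illustration rather than the notebook's pandas code. It also shows roughly how many test rows each surviving class would contribute under a stratified 80/20 split.

```python
from collections import Counter


def filter_rare_labels(labels, min_count=50):
    """Return the indices of rows whose label occurs at least min_count times."""
    counts = Counter(labels)
    return [i for i, lab in enumerate(labels) if counts[lab] >= min_count]


# Hypothetical label column: one common class, one borderline class, one rare class
labels = ["G_06"] * 120 + ["A_43"] * 50 + ["A;G_6;63"] * 2

kept = filter_rare_labels(labels, min_count=50)
kept_counts = Counter(labels[i] for i in kept)

# Under a stratified 80/20 split each surviving class sends about one fifth
# of its rows to the test set, so a class of exactly 50 keeps only 10 there.
test_support = {lab: n // 5 for lab, n in kept_counts.items()}

print(dict(kept_counts))
print(test_support)
```

Even a class that just clears the threshold ends up with very few test examples, which is why per-class metrics for the smallest classes can swing widely.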

We also took several steps to make the text easier to classify. We converted all text to lowercase, replaced non-letter characters such as punctuation and digits with spaces, and used the Python Natural Language Toolkit (“nltk”) package to remove stopwords such as “the” and “and”. We then applied a lemmatizer, which converts each word to its dictionary form (a lemma). This treats inflected forms of the same word (for example, “devices” and “device”) as a single token, which reduces redundancy and makes the text more consistent. Finally, to speed up training and reduce the compute load on our machines, we reduced the dataset to the most recent 500 patents. We then created training and test splits from this smaller dataset.
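The sketch below illustrates how this kind of normalization shrinks the vocabulary. It is a toy under stated assumptions: `STOPWORDS` and `LEMMAS` are tiny hand-written stand-ins for NLTK's English stopword list and WordNet lemmatizer, and the two documents are invented for the example.

```python
import re

# Hypothetical stand-ins for NLTK's stopword list and WordNet lemmatizer;
# the real resources are far larger.
STOPWORDS = {"the", "and", "a", "for", "of", "is", "are"}
LEMMAS = {"devices": "device", "sensors": "sensor", "signals": "signal"}


def normalize(text):
    """Lowercase, replace non-letters with spaces, drop stopwords, map to lemmas."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [LEMMAS.get(tok, tok) for tok in text.split() if tok not in STOPWORDS]


docs = [
    "The devices and the sensors.",
    "A device for sensor signals!",
]

# Vocabulary when splitting on whitespace only versus after normalization
raw_vocab = {tok for doc in docs for tok in doc.lower().split()}
clean_vocab = {tok for doc in docs for tok in normalize(doc)}

print(sorted(raw_vocab))
print(sorted(clean_vocab))
```

After normalization both documents share the tokens `device` and `sensor`, whereas the raw whitespace tokens have no content words in common.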

**Multinomial Naive Bayes Classifier:**

After these preprocessing steps, the multinomial Naive Bayes classifier (implemented with scikit-learn’s `ComplementNB`) achieved an overall accuracy of 69%. Given the 47 classes in our dataset, for which uniform random guessing would score about 2% (1/47), and the imbalanced nature of patent classification, this performance exceeded our expectations.

The macro-average F1 score, which weights every class equally, was 51%. The weighted-average F1 score, which weights each class by its frequency, was 65%. The gap between the two points to class imbalance: the higher weighted average shows that the model performs well on common classes, while the lower macro average shows that it performs poorly on rare ones. The classification report below confirms this pattern. F1 scores were consistently higher for classes with more support, that is, more true samples in the test set, which matches our expectation that the model’s metrics are more stable for such classes. Poor results occurred most often for classes with a support of fewer than 20 observations.
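As a rough illustration of why the two averages diverge, the sketch below recomputes F1 from precision and recall for three rows taken from the excerpt of the classification report in this section, then averages them both ways. Only these three rows are used, so the resulting numbers are not the report's overall averages. The helper names `f1` and `macro_and_weighted_f1` are our own and are not part of scikit-learn.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_and_weighted_f1(rows):
    """rows: list of (precision, recall, support) tuples, one per class."""
    scores = [f1(p, r) for p, r, _ in rows]
    supports = [s for _, _, s in rows]
    macro = sum(scores) / len(scores)  # every class counts equally
    weighted = sum(f * s for f, s in zip(scores, supports)) / sum(supports)
    return macro, weighted


# (precision, recall, support) for three classes from the report excerpt
rows = [
    (0.0, 0.0, 13),      # A;C_07;12;61
    (0.5, 0.835, 103),   # A;C_07;61
    (1.0, 0.0909, 11),   # A;G_6;63
]

macro, weighted = macro_and_weighted_f1(rows)
print(f"macro F1 = {macro:.4f}, weighted F1 = {weighted:.4f}")
```

The large class dominates the weighted average, while each of the two small, poorly predicted classes carries a full third of the macro average and pulls it down.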

In [24]:
# Multinomial Naive Bayes classifier (ComplementNB variant)
# Inputs and labels
X_bayes = df_subset["text_clean"]
y_bayes = df_subset["Combined_ipc_clean"]

X_train_bayes, X_test_bayes, y_train_bayes, y_test_bayes = train_test_split(
    X_bayes, y_bayes, test_size=0.20, stratify=y_bayes, random_state=42
)

# Build model
class_model = Pipeline(
    [
        ("tfidf", TfidfVectorizer(stop_words="english", max_features=50000)),
        ("nb", ComplementNB()),
    ]
)

# Train
class_model.fit(X_train_bayes, y_train_bayes)

# Evaluate
preds = class_model.predict(X_test_bayes)
#print("IPC Class Accuracy:", accuracy_score(y_test_bayes, preds))
#print(classification_report(y_test_bayes, preds))

# Compute report as dict
report_dict_realdata = classification_report(y_test_bayes, preds, output_dict=True)

# Convert to DataFrame
report_df_real = pd.DataFrame(report_dict_realdata).transpose()

# Round numbers for readability
report_df_real = report_df_real.round(4)

print(tabulate(report_df_real, headers='keys', tablefmt='grid'))

+--------------+-------------+----------+------------+-----------+
|              |   precision |   recall |   f1-score |   support |
+==============+=============+==========+============+===========+
| A;C_07;12;61 |      0      |   0      |     0      |   13      |
+--------------+-------------+----------+------------+-----------+
| A;C_07;61    |      0.5    |   0.835  |     0.6255 |  103      |
+--------------+-------------+----------+------------+-----------+
| A;C_61;7     |      0.7286 |   0.6986 |     0.7133 |   73      |
+--------------+-------------+----------+------------+-----------+
| A;G_6;63     |      1      |   0.0909 |     0.1667 |   11      |
+--------------+-------------+----------+------------+-----------+
| A_23         |      0.9091 |   0.8333 |     0.8696 |   12      |
+--------------+-------------+----------+------------+-----------+
| A_43         |      0.8475 |   0.9804 |     0.9091 |   51      |
+--------------+-------------+----------+------------+-----------+
| A_46         |      0.5    |   0.1333 |     0.2105 |   15      |
+--------------+-------------+----------+------------+-----------+
| ...          |         ... |      ... |        ... |       ... |
+--------------+-------------+----------+------------+-----------+

In [23]:
# # Predictions
# y_pred = class_model.predict(X_test_bayes)

# # Confusion matrix
# cm = confusion_matrix(y_test_bayes, y_pred, labels=class_model.classes_)

# plt.figure(figsize=(12, 10))
# sns.heatmap(
#     cm,
#     annot=False,
#     cmap="Blues",
#     xticklabels=class_model.classes_,
#     yticklabels=class_model.classes_,
# )
# plt.xlabel("Predicted")
# plt.ylabel("Actual")
# plt.title("Confusion Matrix - Naive Bayes")
# plt.show()

# Pros and Cons of the Solution: