This repository contains a high-precision, 5-step data matching pipeline designed to align patent assignee names with Compustat entities. It leverages the Gemini 1.5 Pro (gemini-3.1-pro-preview) API for rigorous AI-based auditing, ensuring that only legally and structurally identical entities are matched.
The pipeline is divided into five sequential steps to ensure data integrity and manual reviewability.
### Step 1: Patent Matching (`step1_patent_matching.py`)

- Function: Matches raw patent assignees from a CSV database against a target Excel list (Acquiror/Target).
- Logic: Uses a mixture of local strict matching, fuzzy ratios, and final AI verification.
- Output: Generates "Auto Matched" and "Manual Review" Excel reports.
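The tiered logic above can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the normalization rules, thresholds, and function names (`normalize`, `classify_match`) are all assumptions.

```python
# Hypothetical sketch of tiered matching: strict local match first,
# then a fuzzy-ratio gate; only borderline pairs go to the AI audit.
import re
from difflib import SequenceMatcher

# Illustrative legal-suffix list; the real script may normalize differently.
SUFFIXES = re.compile(r"\b(inc|corp|co|ltd|llc|plc)\b\.?", re.IGNORECASE)

def normalize(name: str) -> str:
    """Lowercase, strip common legal suffixes and punctuation."""
    name = SUFFIXES.sub("", name.lower())
    return re.sub(r"[^a-z0-9 ]", "", name).strip()

def classify_match(assignee: str, target: str,
                   auto_threshold: float = 0.95,
                   review_threshold: float = 0.80) -> str:
    """Return 'auto', 'ai_audit', or 'reject' for a candidate pair."""
    a, b = normalize(assignee), normalize(target)
    if a == b:
        return "auto"                      # strict local match
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio >= auto_threshold:
        return "auto"                      # near-identical after cleanup
    if ratio >= review_threshold:
        return "ai_audit"                  # escalate to Gemini verification
    return "reject"
```

Pairs classified `auto` would populate the "Auto Matched" report, while `ai_audit` candidates that the model cannot confirm would land in "Manual Review".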
### Step 2: Patent Dictionary (`step2_patent_dictionary.py`)

- Function: Consolidates matches from Step 1 into a Master Dictionary (`.pkl`).
- Logic: Aggregates unique mappings and detects potential conflicts.
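The consolidation step might look like the sketch below, assuming the matches arrive as (assignee, matched name) pairs; the function name and conflict-handling policy are illustrative, not taken from the script.

```python
# Hypothetical sketch of step 2: collapse step-1 matches into a
# one-to-one master dictionary and flag assignees that map to more
# than one target (potential conflicts).
import pickle
from collections import defaultdict

def build_master_dictionary(matches, out_path=None):
    """matches: iterable of (assignee, matched_name) pairs."""
    mapping = defaultdict(set)
    for assignee, target in matches:
        mapping[assignee].add(target)
    # Conflicts: one assignee linked to multiple distinct targets.
    conflicts = {a: sorted(t) for a, t in mapping.items() if len(t) > 1}
    # Master dictionary keeps only unambiguous one-to-one mappings.
    master = {a: next(iter(t)) for a, t in mapping.items() if len(t) == 1}
    if out_path:
        with open(out_path, "wb") as fh:
            pickle.dump(master, fh)        # persisted as the .pkl dictionary
    return master, conflicts
```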
### Step 3: Compustat Matching (`step3_compustat_matching.py`)

- Function: Matches the previously identified "M&A" entities against the global Compustat database.
- Logic: Applies the same high-precision AI-audit standards as Step 1.
- Output: Links M&A entities to Compustat `conm` (company names).
### Step 4: Compustat Dictionary (`step4_compustat_dictionary.py`)

- Function: Finalizes the mapping from M&A entities to Compustat names/IDs.
### Step 5: Data Aggregation (`step5_data_aggregation.py`)

- Function: The final engine that assembles everything.
- Logic: Merges Compustat identifiers (`gvkey`, `cusip`, `cik`) and patent statistics (counts, inventor sums) back into the original M&A dataset.
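The final join could be sketched with pandas as below. The Compustat columns (`conm`, `gvkey`, `cusip`, `cik`) are standard, but the function name and the patent-statistics columns are illustrative assumptions.

```python
# Hypothetical sketch of the step-5 assembly: left-join Compustat
# identifiers and patent statistics onto the M&A rows via the matched
# company name (conm).
import pandas as pd

def aggregate(mna_df, compustat_df, patent_stats_df):
    """Attach identifiers and patent statistics to the M&A dataset."""
    out = mna_df.merge(
        compustat_df[["conm", "gvkey", "cusip", "cik"]],
        on="conm", how="left",             # keep every M&A row
    )
    return out.merge(patent_stats_df, on="conm", how="left")
```

Left merges preserve all original M&A rows, so deals without a Compustat match simply carry `NaN` identifiers rather than being dropped.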
To protect your Gemini API key, it is highly recommended to use environment variables:
```bash
export GEMINI_API_KEY="your_actual_key_here"
```

The scripts will automatically prioritize the `GEMINI_API_KEY` environment variable.
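The lookup order might be implemented along these lines; the helper name and the in-script fallback constant are illustrative, not the scripts' actual code.

```python
# Hypothetical sketch of key resolution: the GEMINI_API_KEY
# environment variable takes priority over any in-script constant.
import os

FALLBACK_KEY = ""  # optional hard-coded key for local testing (not recommended)

def resolve_api_key() -> str:
    key = os.environ.get("GEMINI_API_KEY") or FALLBACK_KEY
    if not key:
        raise RuntimeError("Set the GEMINI_API_KEY environment variable.")
    return key
```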
Each matching script (step1 and step3) contains a configuration section at the top:
- `TESTING_ROWS`: Set to a number (e.g., 500) to run a quick test on a sample subset. Set to `None` for full production runs.
- `PROCESS_SHEETS` / `PROCESS_TYPES`: Selectively process just "Acquiror", "Target", or both.
- `COL_*`: Configurable column names to adapt to different dataset headers.
- `START_ROW` & `MAX_RECORDS`: Control the processing range. Useful for resuming interrupted runs or splitting the load.
- `SAVE_INTERVAL`: Frequency of automatic checkpoints (saved in `pipeline_outputs/checkpoints`).
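A configuration block using these settings might look like the following; the values shown (and the `COL_NAME` example of a `COL_*` setting) are illustrative, not the scripts' defaults.

```python
# Illustrative configuration values for a quick test run.
TESTING_ROWS = 500                       # or None for a full production run
PROCESS_SHEETS = ["Acquiror", "Target"]  # process one or both sheets
COL_NAME = "Assignee"                    # hypothetical COL_* column setting
START_ROW = 0                            # first record to process
MAX_RECORDS = None                       # or an int to cap the batch size
SAVE_INTERVAL = 200                      # checkpoint every 200 records
```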
To handle large-scale data (e.g., 2 million rows) and potential API rate limits:
- Batch Checkpoints: Results are periodically saved. If the script stops, check the `pipeline_outputs/checkpoints` folder for the last successful batch.
- Resuming: To resume, set `START_ROW` in the script configuration to the index of the next unprocessed batch.
- Python 3.8+
- Gemini API Key
```bash
pip install -r requirements.txt
```

Run the steps in order:
```bash
python step1_patent_matching.py
python step2_patent_dictionary.py
python step3_compustat_matching.py
python step4_compustat_dictionary.py
python step5_data_aggregation.py
```

- `step1-5_*.py`: Pipeline logic.
- `pipeline_outputs/`: All generated Excel reports, pickle dictionaries, and logs.
- `pipeline_outputs/logs/`: Detailed timestamped execution logs.
[Specify License, e.g., MIT]