l010306/EntityAlign_API

Patent and Compustat Matching Pipeline (AI-Audit)

This repository contains a high-precision, five-step data-matching pipeline designed to align patent assignee names with Compustat entities. It leverages the Google Gemini API (model gemini-3.1-pro-preview) for rigorous AI-based auditing, ensuring that only legally and structurally identical entities are matched.

🚀 Pipeline Architecture

The pipeline is divided into five sequential steps to ensure data integrity and manual reviewability.

[Step 1] Patent Matching (step1_patent_matching.py)

  • Function: Matches raw patent assignees from a CSV database against a target Excel list (Acquiror/Target).
  • Logic: Combines strict local matching, fuzzy-ratio scoring, and a final AI verification pass.
  • Output: Generates "Auto Matched" and "Manual Review" Excel reports.
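As a rough illustration of this tiered logic, a strict normalized comparison can run first, with a fuzzy ratio deciding whether a pair is auto-matched, sent to AI/manual review, or rejected. The function and threshold names below are hypothetical, not the script's actual ones:

```python
# Sketch of tiered matching: strict local match, then fuzzy ratio,
# with borderline pairs deferred to AI audit / manual review.
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for strict comparison."""
    return " ".join(
        "".join(c for c in name.lower() if c.isalnum() or c.isspace()).split()
    )

def classify_match(assignee: str, target: str,
                   auto_threshold: float = 0.95,
                   review_threshold: float = 0.80) -> str:
    """Return 'auto', 'review', or 'reject' for one candidate pair."""
    a, b = normalize(assignee), normalize(target)
    if a == b:                                   # strict local match
        return "auto"
    ratio = SequenceMatcher(None, a, b).ratio()  # fuzzy similarity in [0, 1]
    if ratio >= auto_threshold:
        return "auto"
    if ratio >= review_threshold:                # borderline: escalate
        return "review"
    return "reject"

print(classify_match("IBM Corp.", "ibm corp"))   # strict match after normalization
```

The real pipeline additionally runs the AI audit on "auto" candidates before accepting them; the thresholds here are illustrative.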

[Step 2] Patent Dictionary Builder (step2_patent_dictionary.py)

  • Function: Consolidates matches from Step 1 into a Master Dictionary (.pkl).
  • Logic: Aggregates unique mappings and detects potential conflicts.
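The consolidation step can be sketched as follows (field names are hypothetical): collect assignee-to-match pairs from Step 1, keep unique mappings, and flag any assignee matched to more than one target as a conflict.

```python
# Sketch of the Master Dictionary builder with conflict detection.
import pickle

def build_dictionary(matches, out_path="master_dictionary.pkl"):
    """matches: iterable of (assignee, matched_name) pairs from Step 1."""
    mapping, conflicts = {}, {}
    for assignee, matched in matches:
        if assignee in mapping and mapping[assignee] != matched:
            # Same assignee matched to two different targets: record both.
            conflicts.setdefault(assignee, {mapping[assignee]}).add(matched)
        else:
            mapping[assignee] = matched
    with open(out_path, "wb") as f:
        pickle.dump(mapping, f)   # the Master Dictionary (.pkl)
    return mapping, conflicts

pairs = [("acme corp", "ACME CORPORATION"),
         ("acme corp", "ACME CORPORATION"),   # duplicate mapping: fine
         ("globex", "GLOBEX CORP"),
         ("globex", "GLOBEX INTL")]           # conflicting mapping: flagged
mapping, conflicts = build_dictionary(pairs)
```

Conflicts are kept out of the dictionary rather than silently overwritten, so they can be resolved manually.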

[Step 3] Compustat Matching (step3_compustat_matching.py)

  • Function: Matches the previously identified "M&A" entities against the global Compustat database.
  • Logic: Applies the same high-precision AI-audit standards as Step 1.
  • Output: Links M&A entities to Compustat conm (company names).
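Compustat conm values are typically uppercase and abbreviated, so a common preprocessing step before comparison is stripping trailing legal-form tokens. This is a sketch with an illustrative suffix list; the script's actual normalization rules may differ:

```python
# Strip trailing legal-form tokens from a Compustat company name so that
# candidate pairs can be generated before the AI audit runs.
import re

LEGAL_SUFFIXES = {"INC", "CORP", "CO", "LTD", "LLC", "PLC", "SA", "AG", "NV"}

def strip_legal_suffix(conm: str) -> str:
    """Uppercase, drop punctuation, and remove trailing legal-form tokens."""
    tokens = re.sub(r"[^\w\s]", " ", conm.upper()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

Stripped names serve only for candidate generation; the AI audit still enforces legal and structural identity on the full names.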

[Step 4] Compustat Dictionary Builder (step4_compustat_dictionary.py)

  • Function: Finalizes the mapping from M&A entities to Compustat names/IDs.

[Step 5] Final Data Aggregation (step5_data_aggregation.py)

  • Function: The final engine that assembles everything.
  • Logic: Merges Compustat identifiers (gvkey, cusip, cik) and patent statistics (patent counts, inventor sums) back into the original M&A dataset.
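The final join can be sketched like this, with illustrative keys and column names rather than the script's actual schema: look each M&A record's matched name up in the two dictionaries built in Steps 2 and 4 and attach the identifiers and statistics.

```python
# Sketch of the final aggregation: enrich one M&A record with Compustat
# identifiers and patent statistics (all values below are illustrative).
compustat_ids = {"GLOBEX": {"gvkey": "012345", "cusip": "000000000", "cik": "0000000001"}}
patent_stats  = {"GLOBEX": {"patent_count": 42, "inventor_sum": 110}}

def enrich(record: dict) -> dict:
    """Return a copy of an M&A record merged with identifiers and statistics."""
    out = dict(record)
    name = record["matched_name"]
    out.update(compustat_ids.get(name, {}))   # gvkey, cusip, cik
    out.update(patent_stats.get(name, {}))    # counts, inventor sums
    return out

row = enrich({"deal_id": 1, "matched_name": "GLOBEX"})
```

Unmatched records pass through unchanged, since both lookups fall back to an empty dict.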

⚙️ Configuration & Security

1. API Key Security

To protect your Gemini API key, store it in an environment variable rather than hard-coding it in the scripts:

export GEMINI_API_KEY="your_actual_key_here"

The scripts will automatically prioritize the GEMINI_API_KEY environment variable.
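A minimal sketch of that lookup order (the fallback constant is illustrative; hard-coding a key is discouraged):

```python
# Prefer the GEMINI_API_KEY environment variable over any in-script value.
import os

FALLBACK_API_KEY = ""  # optional hard-coded key (discouraged)

def get_api_key(env=os.environ) -> str:
    """Return the Gemini API key, preferring the environment variable."""
    key = env.get("GEMINI_API_KEY") or FALLBACK_API_KEY
    if not key:
        raise RuntimeError("No Gemini API key: export GEMINI_API_KEY first.")
    return key
```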

2. Global Configuration

Each matching script (step1 and step3) contains a configuration section at the top:

  • TESTING_ROWS: Set to a number (e.g., 500) to run a quick test on a sample subset. Set to None for full production runs.
  • PROCESS_SHEETS / PROCESS_TYPES: Selectively process just "Acquiror", "Target", or both.
  • COL_*: Configurable column names to adapt to different dataset headers.
  • START_ROW & MAX_RECORDS: Control the processing range. Useful for resuming interrupted runs or splitting the load.
  • SAVE_INTERVAL: Frequency of automatic checkpoints (saved in pipeline_outputs/checkpoints).
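Taken together, the configuration block at the top of a matching script might look like this (all values and column names are illustrative; adapt them to your dataset headers):

```python
# Example configuration block for step1/step3 (illustrative values).
TESTING_ROWS   = None             # e.g. 500 for a quick sample run; None = full run
PROCESS_SHEETS = ["Acquiror", "Target"]
COL_ASSIGNEE   = "assignee_name"  # hypothetical header name; match your data
START_ROW      = 0                # resume point for interrupted runs
MAX_RECORDS    = None             # cap on rows processed in this invocation
SAVE_INTERVAL  = 100              # checkpoint every N batches
```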

🛡️ Robustness & Persistence

To handle large-scale data (e.g., 2 million rows) and potential API rate limits:

  • Batch Checkpoints: Results are periodically saved. If the script stops, check the pipeline_outputs/checkpoints folder for the last successful batch.
  • Resuming: To resume, set START_ROW in the script configuration to the index of the next unprocessed batch.
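One way such a resume point can be computed from the checkpoint folder, assuming a hypothetical batch_&lt;start_row&gt;.pkl naming scheme (the actual file names may differ):

```python
# Scan saved checkpoint batches and return the first unprocessed row index.
import os
import re

def next_start_row(checkpoint_dir: str, batch_size: int) -> int:
    """Return max saved start row + batch_size, or 0 if no checkpoints exist."""
    rows = [int(m.group(1))
            for f in os.listdir(checkpoint_dir)
            if (m := re.match(r"batch_(\d+)\.pkl$", f))]
    return max(rows) + batch_size if rows else 0
```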

🛠️ Getting Started

Prerequisites

  • Python 3 and the packages listed in requirements.txt.
  • A Gemini API key (see Configuration & Security above).

Installation

pip install -r requirements.txt

Execution

Run the steps in order:

python step1_patent_matching.py
python step2_patent_dictionary.py
python step3_compustat_matching.py
python step4_compustat_dictionary.py
python step5_data_aggregation.py

📂 Project Structure

  • step1-5_*.py: Pipeline logic.
  • pipeline_outputs/: All generated Excel reports, pickle dictionaries, and logs.
  • pipeline_outputs/logs/: Detailed timestamped execution logs.

📝 License

[Specify License, e.g., MIT]

About

An AI-powered 5-stage pipeline for high-precision entity resolution and synchronization between Patent and Compustat datasets.
