This repository contains a high-precision, 5-step data matching pipeline designed to align patent assignee names with Compustat entities. It leverages the Gemini 1.5 Pro (gemini-3.1-pro-preview) API for rigorous AI-based auditing, ensuring that only legally and structurally identical entities are matched.
The pipeline is divided into five sequential steps to ensure data integrity and manual reviewability.
### Step 1: Patent Matching (`step1_patent_matching.py`)

- Function: Matches raw patent assignees from a CSV database against a target Excel list (Acquiror/Target).
- Logic: Uses a mixture of local strict matching, fuzzy ratios, and final AI verification.
- Output: Generates "Auto Matched" and "Manual Review" Excel reports.
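The tiered logic above can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the normalization rules, thresholds, and function names (`normalize`, `classify_match`) are all assumptions.

```python
# Hypothetical sketch of tiered matching: strict local match first,
# then a fuzzy-ratio gate; only borderline pairs go to the AI audit.
import re
from difflib import SequenceMatcher

# Illustrative legal-suffix list; the real script may normalize differently.
SUFFIXES = re.compile(r"\b(inc|corp|co|ltd|llc|plc)\b\.?", re.IGNORECASE)

def normalize(name: str) -> str:
    """Lowercase, strip common legal suffixes and punctuation."""
    name = SUFFIXES.sub("", name.lower())
    return re.sub(r"[^a-z0-9 ]", "", name).strip()

def classify_match(assignee: str, target: str,
                   auto_threshold: float = 0.95,
                   review_threshold: float = 0.80) -> str:
    """Return 'auto', 'ai_audit', or 'reject' for a candidate pair."""
    a, b = normalize(assignee), normalize(target)
    if a == b:
        return "auto"                      # strict local match
    ratio = SequenceMatcher(None, a, b).ratio()
    if ratio >= auto_threshold:
        return "auto"                      # near-identical after cleanup
    if ratio >= review_threshold:
        return "ai_audit"                  # escalate to Gemini verification
    return "reject"
```

Pairs classified `auto` would populate the "Auto Matched" report, while `ai_audit` candidates that the model cannot confirm would land in "Manual Review".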
### Step 2: Patent Dictionary (`step2_patent_dictionary.py`)

- Function: Consolidates matches from Step 1 into a Master Dictionary (`.pkl`).
- Logic: Aggregates unique mappings and detects potential conflicts.
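The consolidation step might look like the sketch below, assuming the matches arrive as (assignee, matched name) pairs; the function name and conflict-handling policy are illustrative, not taken from the script.

```python
# Hypothetical sketch of step 2: collapse step-1 matches into a
# one-to-one master dictionary and flag assignees that map to more
# than one target (potential conflicts).
import pickle
from collections import defaultdict

def build_master_dictionary(matches, out_path=None):
    """matches: iterable of (assignee, matched_name) pairs."""
    mapping = defaultdict(set)
    for assignee, target in matches:
        mapping[assignee].add(target)
    # Conflicts: one assignee linked to multiple distinct targets.
    conflicts = {a: sorted(t) for a, t in mapping.items() if len(t) > 1}
    # Master dictionary keeps only unambiguous one-to-one mappings.
    master = {a: next(iter(t)) for a, t in mapping.items() if len(t) == 1}
    if out_path:
        with open(out_path, "wb") as fh:
            pickle.dump(master, fh)        # persisted as the .pkl dictionary
    return master, conflicts
```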
### Step 3: Compustat Matching (`step3_compustat_matching.py`)

- Function: Matches the previously identified "M&A" entities against the global Compustat database.
- Logic: Applies the same high-precision AI-audit standards as Step 1.
- Output: Links M&A entities to Compustat `conm` (company names).
### Step 4: Compustat Dictionary (`step4_compustat_dictionary.py`)

- Function: Finalizes the mapping from M&A entities to Compustat names/IDs.
### Step 5: Data Aggregation (`step5_data_aggregation.py`)

- Function: The final engine that assembles everything.
- Logic: Merges Compustat identifiers (`gvkey`, `cusip`, `cik`) and patent statistics (counts, inventor sums) back into the original M&A dataset.
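The final join could be sketched with pandas as below. The Compustat columns (`conm`, `gvkey`, `cusip`, `cik`) are standard, but the function name and the patent-statistics columns are illustrative assumptions.

```python
# Hypothetical sketch of the step-5 assembly: left-join Compustat
# identifiers and patent statistics onto the M&A rows via the matched
# company name (conm).
import pandas as pd

def aggregate(mna_df, compustat_df, patent_stats_df):
    """Attach identifiers and patent statistics to the M&A dataset."""
    out = mna_df.merge(
        compustat_df[["conm", "gvkey", "cusip", "cik"]],
        on="conm", how="left",             # keep every M&A row
    )
    return out.merge(patent_stats_df, on="conm", how="left")
```

Left merges preserve all original M&A rows, so deals without a Compustat match simply carry `NaN` identifiers rather than being dropped.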
To protect your Gemini API key, it is highly recommended to use environment variables:
```bash
export GEMINI_API_KEY="your_actual_key_here"
```

The scripts will automatically prioritize the `GEMINI_API_KEY` environment variable.
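The lookup order might be implemented along these lines; the helper name and the in-script fallback constant are illustrative, not the scripts' actual code.

```python
# Hypothetical sketch of key resolution: the GEMINI_API_KEY
# environment variable takes priority over any in-script constant.
import os

FALLBACK_KEY = ""  # optional hard-coded key for local testing (not recommended)

def resolve_api_key() -> str:
    key = os.environ.get("GEMINI_API_KEY") or FALLBACK_KEY
    if not key:
        raise RuntimeError("Set the GEMINI_API_KEY environment variable.")
    return key
```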
Each matching script (step1 and step3) contains a configuration section at the top:
- `TESTING_ROWS`: Set to a number (e.g., 500) to run a quick test on a sample subset. Set to `None` for full production runs.
- `PROCESS_SHEETS` / `PROCESS_TYPES`: Selectively process just "Acquiror", "Target", or both.
- `COL_*`: Configurable column names to adapt to different dataset headers.
- `START_ROW` & `MAX_RECORDS`: Control the processing range. Useful for resuming interrupted runs or splitting the load.
- `SAVE_INTERVAL`: Frequency of automatic checkpoints (saved in `pipeline_outputs/checkpoints`).
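A configuration block using these settings might look like the following; the values shown (and the `COL_NAME` example of a `COL_*` setting) are illustrative, not the scripts' defaults.

```python
# Illustrative configuration values for a quick test run.
TESTING_ROWS = 500                       # or None for a full production run
PROCESS_SHEETS = ["Acquiror", "Target"]  # process one or both sheets
COL_NAME = "Assignee"                    # hypothetical COL_* column setting
START_ROW = 0                            # first record to process
MAX_RECORDS = None                       # or an int to cap the batch size
SAVE_INTERVAL = 200                      # checkpoint every 200 records
```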
To handle large-scale data (e.g., 2 million rows) and potential API rate limits:
- Batch Checkpoints: Results are periodically saved. If the script stops, check the `pipeline_outputs/checkpoints` folder for the last successful batch.
- Resuming: To resume, set `START_ROW` in the script configuration to the index of the next unprocessed batch.
- Python 3.8+
- Gemini API Key
```bash
pip install -r requirements.txt
```

Run the steps in order:
```bash
python step1_patent_matching.py
python step2_patent_dictionary.py
python step3_compustat_matching.py
python step4_compustat_dictionary.py
python step5_data_aggregation.py
```

- `step1-5_*.py`: Pipeline logic.
- `pipeline_outputs/`: All generated Excel reports, pickle dictionaries, and logs.
- `pipeline_outputs/logs/`: Detailed timestamped execution logs.
[Specify License, e.g., MIT]