Cypris Patent XML Parser

Kevin Singh

A Python CLI tool that cleans malformed patent XML and extracts attributes from the cleaned result.

Overview

This project solves the challenge of extracting doc-number values from patent XML documents stored in Google Cloud Storage. The tool handles:

Malformed XML (missing closing tags, incomplete attributes, missing > brackets)
Priority-based extraction (e.g., EPO format first, then patent-office)
Flexible attribute extraction with custom XPath queries

Features

XML Parsing: Automatically fixes common XML malformations
- Missing closing > brackets
- Attributes without values (e.g., us-series-code becomes us-series-code="")
- Placeholder ... removal
Priority Ordering: Extract attributes in a specified priority order
CLI Support: Command-line interface for easy integration
Unit Tests: Includes unit tests that verify the cleaning and extraction behavior
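
The three fixes listed above can be pictured with a small regex-based sketch. This is only an illustration under stated assumptions, not the code in cypris_xml_parser.py: the names `clean_xml`, `UNCLOSED`, `OPEN_TAG`, and `ATTR` are hypothetical, the sample string is made up, and the sketch assumes attribute values are quoted and never contain `<` or `>`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns; the real parser may work quite differently.
# A "<" followed by tag content that runs into another "<" (or end of input)
# before any ">" is treated as a tag missing its closing bracket.
UNCLOSED = re.compile(r"<([^<>]*?)(\s*)(?=<|$)")
# An opening tag: name, then everything up to ">".
OPEN_TAG = re.compile(r"<([A-Za-z_][\w.:-]*)([^<>]*)>")
# One attribute: a name, optionally followed by ="value" or ='value'.
ATTR = re.compile(r"""([A-Za-z_][\w.:-]*)(?:\s*=\s*("[^"]*"|'[^']*'))?""")


def _quote_bare_attributes(match):
    """Rebuild an opening tag, giving value-less attributes an empty value."""
    name, attrs = match.group(1), match.group(2)
    self_closing = attrs.rstrip().endswith("/")
    fixed = [f"{key}={value}" if value else f'{key}=""'
             for key, value in ATTR.findall(attrs)]
    tail = " /" if self_closing else ""
    return "<" + name + "".join(" " + item for item in fixed) + tail + ">"


def clean_xml(raw):
    text = raw.replace("...", "")                      # 1. drop placeholders
    text = UNCLOSED.sub(r"<\1>\2", text)               # 2. add missing ">"
    return OPEN_TAG.sub(_quote_bare_attributes, text)  # 3. us-series-code -> us-series-code=""


# Made-up sample, not the contents of input_xmls/test_case_1.txt.
raw = '''<application-reference us-series-code ucid="US-1">
  <document-id format="epo"
    <doc-number>999000888</doc-number>
  </document-id>
  ...
</application-reference>'''

cleaned = clean_xml(raw)
root = ET.fromstring(cleaned)  # parses once the three fixes are applied
print(root.get("us-series-code") == "", root.findtext(".//doc-number"))
# prints: True 999000888
```

Running the fixes in this order matters in the sketch: the attribute pass only recognizes tags that already end in `>`, so the bracket repair has to come first.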

Project Structure

.
├── cypris_xml_parser.py           # Core XML cleaning/parsing logic
├── patent_attribute_extraction.py # Main extraction class with CLI
├── input_xmls/
│   └── test_case_1.txt            # Sample malformed XML file
├── tests/
│   ├── __init__.py
│   └── test_patent_attribute_extraction.py  # Unit tests
├── requirements.txt                # Python dependencies
├── setup.sh                        # Automated setup script
└── README.md                       # This file

Installation

# Create virtual environment
uv venv

# Activate virtual environment (Mac)
source .venv/bin/activate
# Activate virtual environment (Windows)
.venv\Scripts\activate.bat

# Install dependencies
uv pip install -r requirements.txt

Usage

Coding challenge solution: run this script to reproduce the solution to the Cypris coding challenge assessment.

# Mac ONLY
python run_assesment_scenario.py 
# Windows ONLY
uv run run_assesment_scenario.py

Other ways to use the tool are below.

Command Line Interface

Basic usage (extract doc-numbers with default settings):

# Mac ONLY
python patent_attribute_extraction.py input_xmls/test_case_1.txt
# Windows ONLY
uv run patent_attribute_extraction.py input_xmls/test_case_1.txt

Extract multiple attributes:

# Mac ONLY
python patent_attribute_extraction.py input_xmls/test_case_1.txt --attributes doc-number country lang
# Windows ONLY
uv run patent_attribute_extraction.py input_xmls/test_case_1.txt --attributes doc-number country lang

Custom XPath:

# Mac ONLY
python patent_attribute_extraction.py input_xmls/test_case_1.txt --xpath './/document-id[@format="epo"]'
# Windows ONLY
uv run patent_attribute_extraction.py input_xmls/test_case_1.txt --xpath './/document-id[@format="epo"]'
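
To see what an expression like the one above selects, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports `[@attribute="value"]` predicates. Whether the tool itself evaluates `--xpath` with ElementTree or another library is an assumption, and the sample XML is illustrative rather than the contents of input_xmls/test_case_1.txt.

```python
import xml.etree.ElementTree as ET

# Illustrative, already well-formed sample (hypothetical data layout).
SAMPLE = """
<application-reference>
  <document-id format="epo" load-source="docdb">
    <country>US</country>
    <doc-number>999000888</doc-number>
  </document-id>
  <document-id format="original" load-source="patent-office">
    <country>US</country>
    <doc-number>66667777</doc-number>
  </document-id>
</application-reference>
"""

root = ET.fromstring(SAMPLE)

# Select only the document-id elements whose format attribute equals "epo".
epo_ids = root.findall('.//document-id[@format="epo"]')

# Read the doc-number child of each selected element.
epo_doc_numbers = [element.findtext("doc-number") for element in epo_ids]
print(epo_doc_numbers)
# prints: ['999000888']
```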

Help:

# Mac ONLY
python patent_attribute_extraction.py --help
# Windows ONLY
uv run patent_attribute_extraction.py --help

Python Code Example

The extractor can also be called directly from Python, for example to apply a custom priority ordering:

from patent_attribute_extraction import PatentAttributeExtractor, DEFAULT_PRIORITY_ORDER

# Read XML file
with open('input_xmls/test_case_1.txt', 'r') as f:
    xml_data = f.read()

# Create extractor
extractor = PatentAttributeExtractor(xml_data)

# Extract doc-numbers with default priority (EPO first, then patent-office)
doc_numbers = extractor.extract_attributes(priority_ordering=DEFAULT_PRIORITY_ORDER)
print(doc_numbers)
# Output: ['999000888', '66667777']

# Extract multiple attributes (returns DataFrame)
result = extractor.extract_attributes(
    attributes=['doc-number', 'country', 'lang'],
    priority_ordering=DEFAULT_PRIORITY_ORDER
)
print(result)

# Custom priority order
custom_priority = {
    1: {
        'attribute': 'load-source',
        'values': ['patent-office']
    },
    2: {
        'attribute': 'format',
        'values': ['epo']
    }
}

doc_numbers = extractor.extract_attributes(priority_ordering=custom_priority)
print(doc_numbers)
# Output: ['66667777', '999000888']  # patent-office first now

Assumptions

The following assumptions are made about the XML structure:

XML may be malformed: Missing closing tags, incomplete tags, missing >
Attributes without values: Set to blank "" (e.g., us-series-code)
document-id elements: Have a format attribute that determines priority
Priority order: epo format first, then patent-office, then original, then docdb
doc-number elements: Nested within document-id elements
Multiple document-id elements: May exist within application-reference
Placeholder data: XML may contain ... representing omitted data (removed during parsing)
Implicit closing: If a new opening tag is encountered without a closing tag for the previous element, the previous element is assumed to be implicitly closed
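
The implicit-closing assumption can be read in more than one way, so the following is only one possible interpretation, sketched with a stack of open elements. The function name `close_implicit` and the exact rules are hypothetical: an element is closed implicitly when an element of the same name opens again, when a closing tag for one of its ancestors arrives, or when the input ends. The sketch also assumes an element never legitimately nests inside another element of the same name.

```python
import re
import xml.etree.ElementTree as ET

# Tags (opening or closing) in groups 1-3; declarations, comments, and text in group 4.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)([^<>]*)>|(<[?!][^>]*>|[^<]+)")


def close_implicit(xml):
    """Insert closing tags for elements that were never closed (hypothetical rules)."""
    out, stack = [], []
    for match in TOKEN.finditer(xml):
        closing, name, attrs, text = match.groups()
        if text is not None:
            out.append(text)                      # text passes through unchanged
        elif closing:
            if name not in stack:
                continue                          # stray closing tag: drop it
            while stack[-1] != name:              # close unclosed descendants first
                out.append(f"</{stack.pop()}>")
            stack.pop()
            out.append(f"</{name}>")
        else:
            if name in stack:                     # same element opens again:
                while True:                       # close the earlier one and its children
                    popped = stack.pop()
                    out.append(f"</{popped}>")
                    if popped == name:
                        break
            out.append(match.group(0))
            if not attrs.rstrip().endswith("/"):  # self-closing tags stay off the stack
                stack.append(name)
    while stack:                                  # end of input closes everything left
        out.append(f"</{stack.pop()}>")
    return "".join(out)


# Made-up input: the first document-id and both doc-number elements are never closed properly.
BROKEN = ('<application-reference><document-id format="epo"><doc-number>999000888'
          '<document-id format="original"><doc-number>66667777</doc-number>'
          '</application-reference>')

root = ET.fromstring(close_implicit(BROKEN))
print([element.text for element in root.iter("doc-number")])
# prints: ['999000888', '66667777']
```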

Testing

# Run all tests
# Mac ONLY
python -m unittest discover tests -v
# Windows ONLY
uv run python -m unittest discover tests -v


# Run specific test file
# Mac ONLY
python -m unittest tests.test_patent_attribute_extraction -v
# Windows ONLY
uv run python -m unittest tests.test_patent_attribute_extraction -v

Example Output

Given the sample XML in input_xmls/test_case_1.txt:

# Mac ONLY
python patent_attribute_extraction.py input_xmls/test_case_1.txt
# Windows ONLY
uv run patent_attribute_extraction.py input_xmls/test_case_1.txt

Output:

Extracted attributes (in priority order):
==================================================
1. 999000888
2. 66667777
==================================================
Total: 2 values extracted

The output shows that 999000888 (EPO format) comes before 66667777 (patent-office format) based on the default priority ordering.
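
The ordering itself can be sketched as a stable sort over document-id elements, ranked by the first priority rule each one matches. This is an illustration, not the implementation in patent_attribute_extraction.py: `EXAMPLE_PRIORITY`, `rank`, and `doc_numbers_by_priority` are hypothetical names, the dictionary only mirrors the shape of the custom priority example (the real DEFAULT_PRIORITY_ORDER may differ), and placing unmatched elements last is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical priority table in the same shape as the custom example.
EXAMPLE_PRIORITY = {
    1: {"attribute": "format", "values": ["epo"]},
    2: {"attribute": "load-source", "values": ["patent-office"]},
}


def rank(element, priority):
    """Return the lowest priority level whose rule the element matches."""
    for level in sorted(priority):
        rule = priority[level]
        if element.get(rule["attribute"]) in rule["values"]:
            return level
    return float("inf")  # assumption: unmatched elements sort last


def doc_numbers_by_priority(xml_text, priority):
    root = ET.fromstring(xml_text)
    document_ids = root.findall(".//document-id")
    # sorted() is stable, so elements with equal rank keep document order.
    ordered = sorted(document_ids, key=lambda element: rank(element, priority))
    numbers = (element.findtext("doc-number") for element in ordered)
    return [number for number in numbers if number]


# Illustrative sample with the patent-office entry listed first in the document.
SAMPLE = """
<application-reference>
  <document-id format="original" load-source="patent-office">
    <doc-number>66667777</doc-number>
  </document-id>
  <document-id format="epo" load-source="docdb">
    <doc-number>999000888</doc-number>
  </document-id>
</application-reference>
"""

print(doc_numbers_by_priority(SAMPLE, EXAMPLE_PRIORITY))
# prints: ['999000888', '66667777']
```

Swapping the two rules in the table reverses the result, which matches the behavior described for the custom priority order.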

Repository: kevinsinghdeveloper/CyprisCodeChallenge
