🇪🇺 Arne Krueger and Chad G. Petey Present 🤓:
# 🦅 Arne Krueger's Fun 🎉 with 📚 IPC ✨

Dear Patent Information Professionals,

Patent classification systems, such as the International Patent Classification (IPC), are vital tools for organizing, searching, and analyzing patent information. IPC data is hierarchical, providing granular insights into technological fields, ranging from broad sections to detailed subgroups. To effectively work with and analyze this data, parsing IPC XML files is a critical skill, especially when large datasets need to be structured for meaningful insights.

The Python script you are about to explore is designed specifically for patent professionals who need to:

1.	Extract Key Information:
- Parse IPC XML files to extract essential elements: kind, symbol, level, and title.
- Interpret the hierarchical structure of IPC classifications, from sections (broadest level) to subgroups (most detailed level).

2.	Organize Data for Analysis:
- Dynamically compute levels and titles using IPC mappings (e.g., “section”, “class”, “sub-class”).
- Generate a structured pandas DataFrame that is ready for statistical analysis, visualization, or integration into downstream workflows.

3.	Track Progress and Summarize Results:
- Follow the script’s progress as it parses large XML files, ensuring transparency and ease of use.
- Receive a detailed summary of parsed data, including counts of entries at each IPC level.
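
The extraction steps listed above can be tried on a tiny sample before running the full notebook. The following is only a sketch under stated assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra installs, the root element name `IPCScheme` and the shortened titles are invented for illustration, and `walk` and `KIND_TITLES` are hypothetical names, not part of the script.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the tag constants used in this notebook's parser cell.
NS = "{http://www.wipo.int/classifications/ipc/masterfiles}"

# Hypothetical, heavily shortened sample; the root element name is an assumption.
SAMPLE_XML = f"""
<IPCScheme xmlns="{NS[1:-1]}">
  <ipcEntry kind="s" symbol="A">
    <textBody>HUMAN NECESSITIES</textBody>
    <ipcEntry kind="c" symbol="A01">
      <textBody>AGRICULTURE</textBody>
      <ipcEntry kind="u" symbol="A01B">
        <textBody>SOIL WORKING</textBody>
      </ipcEntry>
    </ipcEntry>
  </ipcEntry>
</IPCScheme>
"""

# Small subset of the kind-to-title mapping, enough for the sample above.
KIND_TITLES = {"s": "section", "c": "class", "u": "sub-class"}


def walk(node, rows=None):
    """Depth-first walk that records one row per ipcEntry element."""
    if rows is None:
        rows = []
    for child in node:
        if child.tag == NS + "ipcEntry":
            body = child.find(NS + "textBody")
            title = "".join(body.itertext()).strip() if body is not None else None
            rows.append({
                "kind": child.get("kind"),
                "symbol": child.get("symbol"),
                "title": title,
                "leveltitle": KIND_TITLES.get(child.get("kind"), "unknown"),
            })
        # Entries are nested inside each other, so descend into every child.
        walk(child, rows)
    return rows


rows = walk(ET.fromstring(SAMPLE_XML))
for row in rows:
    print(row)
```

Each nested `ipcEntry` becomes one flat row, which is the shape a pandas DataFrame expects.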

### Features of the Script

1.	Dynamic and Recursive XML Parsing:
- The script employs a recursive approach to navigate through the hierarchical structure of IPC XML files.
- Supports filtering by classification types (kind), allowing you to focus on specific levels of interest.

2.	Level Mapping:
- Includes a predefined mapping (KIND_TO_LEVEL_TITLE) to translate IPC kind attributes into intuitive descriptions (e.g., “section”, “class”).
- Automatically computes the level (e.g., 1 for section, 5 for sub-class) and level title for each entry.

3.	Output in a User-Friendly Format:
- The data is output as a pandas DataFrame, suitable for visualization, further processing, or export to external systems.

4.	Progress Feedback and Summaries:
- Prints a status message when parsing starts and the total run time when it ends, so long-running jobs stay transparent.
- Summarizes the count of parsed entries at each classification level.
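
Periodic progress feedback on a very large file could be sketched roughly as follows. This is an illustration, not the script's own implementation: it streams the XML with the standard library's `ET.iterparse`, and the function name `count_entries`, the `report_every` parameter, and the synthetic input are all assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Namespace taken from the tag constants used in this notebook's parser cell.
NS = "{http://www.wipo.int/classifications/ipc/masterfiles}"


def count_entries(source, report_every=1000):
    """Stream through the XML and print a progress line every `report_every` entries."""
    seen = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "ipcEntry":
            seen += 1
            if seen % report_every == 0:
                print(f"... {seen} entries processed")
    return seen


# Hypothetical synthetic input: 2,500 flat entries with made-up symbols.
entries = "".join(f'<ipcEntry kind="m" symbol="X{i:04d}"/>' for i in range(2500))
xml_bytes = f'<root xmlns="{NS[1:-1]}">{entries}</root>'.encode("utf-8")

total = count_entries(io.BytesIO(xml_bytes))
print(f"Done: {total} entries")
```

Streaming keeps the feedback loop independent of the recursive walker, so the interval can be tuned without touching the extraction logic.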

### Example Use Case

Imagine you are tasked with analyzing a recent IPC release to identify trends in patent filings within Section A (Human Necessities).

With this script, you can:
1.	Parse the IPC XML file to extract all entries related to Section A.
2.	Review and analyze the parsed data to focus on specific subclasses or main groups.
3.	Export the structured data for visualization or reporting.
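
The three steps above might look like this on a handful of rows. The sketch uses plain dictionaries and the standard library's `csv` module as a stand-in for the DataFrame so that it runs anywhere; the rows and their abbreviated titles are hand-written placeholders, not real parser output.

```python
import csv
import io

# Hypothetical rows in the shape the parser produces (titles abbreviated).
rows = [
    {"kind": "s", "symbol": "A", "title": "HUMAN NECESSITIES"},
    {"kind": "c", "symbol": "A01", "title": "AGRICULTURE"},
    {"kind": "u", "symbol": "A01B", "title": "SOIL WORKING"},
    {"kind": "s", "symbol": "B", "title": "PERFORMING OPERATIONS"},
    {"kind": "c", "symbol": "B01", "title": "PHYSICAL OR CHEMICAL PROCESSES"},
]

# Step 1: keep everything whose symbol starts with the section letter.
section_a = [r for r in rows if r["symbol"].startswith("A")]

# Step 2: narrow down to one level, here sub-classes (kind "u").
subclasses = [r for r in section_a if r["kind"] == "u"]

# Step 3: export; an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["kind", "symbol", "title"])
writer.writeheader()
writer.writerows(section_a)
csv_text = buffer.getvalue()
print(csv_text)
```

With the DataFrame built in this notebook, the same filter can be written as `df[df["symbol"].str.startswith("A")]` and exported with `to_csv`.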

### Summary of Output

Upon execution, the script generates the following output:
- DataFrame: A structured table with columns:
  - kind: IPC classification type (e.g., “s” for section, “c” for class).
  - symbol: IPC symbol (e.g., “A01”).
  - title: Classification title (e.g., “Agriculture; Forestry”).
  - level: Numeric level of the classification hierarchy (e.g., 1 for section).
  - leveltitle: Descriptive name for the level (e.g., “section”).
- Counts Per Level: A summary of the number of entries parsed at each IPC level (e.g., sections, classes, subclasses).
- Execution Time: The time taken to process the XML file.
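
The per-level counts and the execution time can be reproduced in a few lines. This is a minimal sketch that assumes rows shaped like the parser's output; the five rows here are hand-written for illustration.

```python
import time
from collections import Counter

# Hypothetical rows; only the columns needed for the summary are included.
rows = [
    {"symbol": "A", "leveltitle": "section"},
    {"symbol": "A01", "leveltitle": "class"},
    {"symbol": "A01B", "leveltitle": "sub-class"},
    {"symbol": "A01B0001000000", "leveltitle": "main group"},
    {"symbol": "A01B0003000000", "leveltitle": "main group"},
]

start = time.perf_counter()
counts = Counter(row["leveltitle"] for row in rows)
elapsed = time.perf_counter() - start

for leveltitle, count in counts.most_common():
    print(f"{leveltitle}: {count}")
print(f"Counted {sum(counts.values())} entries in {elapsed:.4f} seconds.")
```

On the DataFrame itself, `df["leveltitle"].value_counts()` yields the same summary.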

### Next Steps

Feel free to customize this script to your specific needs:
- Add additional filters for classification types or symbols.
- Integrate the output into patent analytics workflows.
- Extend the script to parse additional fields from the XML.
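
As one possible extension, the 14-character symbols the parser returns for groups (for example A01B0001020000) can be turned into the familiar printed form. The sketch below assumes the fixed-width layout of sub-class (4 characters), zero-padded main group (4 digits), and subgroup (6 digits, padded with trailing zeros); `format_ipc_symbol` is a hypothetical helper, not part of the script.

```python
def format_ipc_symbol(symbol):
    """Convert a fixed-width symbol such as 'A01B0001020000' to 'A01B 1/02'.

    Assumes the layout: sub-class (4 chars) + main group (4 digits,
    zero-padded on the left) + subgroup (6 digits, zero-padded on the right).
    Shorter symbols (sections, classes, sub-classes) are returned unchanged.
    """
    if len(symbol) != 14:
        return symbol
    subclass = symbol[:4]
    main_group = int(symbol[4:8])
    # Drop the right-hand padding but keep at least two subgroup digits.
    subgroup = symbol[8:].rstrip("0").ljust(2, "0")
    return f"{subclass} {main_group}/{subgroup}"


for raw in ["A", "A01B", "A01B0001000000", "A01B0001020000", "A01B0001100000"]:
    print(f"{raw:>16} -> {format_ipc_symbol(raw)}")
```

Applied to the DataFrame, this could become a new column via `df["symbol"].map(format_ipc_symbol)`.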

With this tool in hand, we hope you will find working with IPC data more streamlined and efficient. Should you require further guidance or enhancements, don’t hesitate to reach out!

**Happy parsing and analyzing!** 😊

Best regards,
Your Python Assistant for Patent Professionals 🚀
🫶 Arne Krueger and Chad G. Petey

In [24]:
# Import the required Python modules
# Make sure they are installed in your container with pip install...

from lxml import etree
import pandas as pd
import time
from datetime import datetime
import os
import requests
import zipfile


In [25]:
import os
import requests
import zipfile
from datetime import datetime

def download_latest_ipc_scheme(destination_folder):
    """
    Downloads the latest IPC scheme ZIP file from WIPO's website if it doesn't already exist
    and extracts the English (EN) XML file.

    Parameters:
    - destination_folder (str): The folder where the ZIP and extracted files will be saved.

    Returns:
    - str: The path to the extracted English XML file.
    """
    # Determine the current year
    current_year = datetime.now().year

    # Construct the file name and path
    base_url = "https://www.wipo.int/ipc/itos4ipc/ITSupport_and_download_area"
    zip_file_name = f"ipc_scheme_{current_year}0101.zip"
    zip_file_path = os.path.join(destination_folder, zip_file_name)

    # Create the destination folder if it doesn't exist
    os.makedirs(destination_folder, exist_ok=True)

    # Check if the ZIP file already exists
    if not os.path.exists(zip_file_path):
        # Construct the URL for the latest IPC scheme file
        url = f"{base_url}/{current_year}0101/MasterFiles/{zip_file_name}"

        try:
            # Download the ZIP file
            print(f"Downloading file from: {url}")
            response = requests.get(url, timeout=60)  # Time out instead of hanging on a stalled connection
            response.raise_for_status()  # Raise an error for bad status codes

            # Save the ZIP file locally
            with open(zip_file_path, 'wb') as file:
                file.write(response.content)

            print(f"Downloaded IPC scheme ZIP file to: {zip_file_path}")
        except requests.exceptions.RequestException as e:
            print(f"Failed to download the IPC scheme file: {e}")
            return None
    else:
        print(f"ZIP file already exists: {zip_file_path}")

    # Extract the ZIP file and find the EN XML file
    try:
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            zip_ref.extractall(destination_folder)

            # Look for the EN XML file in the extracted contents
            en_files = [f for f in os.listdir(destination_folder) if "EN" in f and f.endswith(".xml")]
            if en_files:
                ipc_xml_path = os.path.join(destination_folder, en_files[0])
                print(f"Found English IPC scheme XML file: {ipc_xml_path}")
                return ipc_xml_path
            else:
                print("No English (EN) XML file found in the ZIP archive.")
                return None
    except zipfile.BadZipFile:
        print("Failed to extract the ZIP file. It may be corrupted.")
        return None


# Set the destination folder for IPC files
destination_folder = "./FunWithIPC/ipc_schemes"

# Download and extract the latest IPC scheme
ipc_xml_path = download_latest_ipc_scheme(destination_folder)

if ipc_xml_path:
    print(f"The English IPC scheme XML is ready at: {ipc_xml_path}")
else:
    print("Failed to prepare the English IPC scheme XML.")
    

ZIP file already exists: ./FunWithIPC/ipc_schemes/ipc_scheme_20240101.zip
Found English IPC scheme XML file: ./FunWithIPC/ipc_schemes/EN_ipc_scheme_20240101.xml
The English IPC scheme XML is ready at: ./FunWithIPC/ipc_schemes/EN_ipc_scheme_20240101.xml
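
The download cell finds the English file by listing the destination folder after extraction, which can also pick up files left over from earlier runs. An alternative sketch is to ask the archive itself which members it contains. `find_english_xml` is a hypothetical helper, and the in-memory archive (including the FR file name) is made up for the demonstration.

```python
import io
import zipfile


def find_english_xml(zip_source):
    """Return archive members that are XML files with 'EN' in their file name."""
    with zipfile.ZipFile(zip_source) as archive:
        return [
            name for name in archive.namelist()
            if name.lower().endswith(".xml") and "EN" in name.rsplit("/", 1)[-1]
        ]


# Build a small in-memory archive so the example needs no download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("EN_ipc_scheme_20240101.xml", "<root/>")
    archive.writestr("FR_ipc_scheme_20240101.xml", "<root/>")
    archive.writestr("readme.txt", "not an XML file")
buffer.seek(0)

matches = find_english_xml(buffer)
print(matches)
```

The function accepts a path as well as a file-like object, so it could be pointed at the downloaded ZIP file directly.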


In [32]:
import pandas as pd
from lxml import etree
import time

# Define constants
IPC_ENTRY_TAG = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'
TEXT_BODY_TAG = '{http://www.wipo.int/classifications/ipc/masterfiles}textBody'

# Define level mappings
# Keys are lowercase because parse_ipc_tree() lowercases the kind attribute before lookup
KIND_TO_LEVEL_TITLE = {
    's': 'section', 't': 'sub-section title', 'c': 'class', 'i': 'sub-class index',
    'u': 'sub-class', 'g': 'guidance heading', 'm': 'main group', '1': '.subgroup',
    '2': '..subgroup', '3': '...subgroup', '4': '....subgroup', '5': '.....subgroup',
    '6': '......subgroup', '7': '.......subgroup', '8': '........subgroup',
    '9': '.........subgroup', 'a': '..........subgroup', 'b': '...........subgroup',
    'n': 'note'
}
KIND_TO_LEVEL = {k: i + 1 for i, k in enumerate(KIND_TO_LEVEL_TITLE)}

# Function to extract text from <textBody>
def get_text_body(entry):
    """Extracts and returns the concatenated text from a <textBody> node."""
    for child in entry:
        if child.tag == TEXT_BODY_TAG:
            return "".join(child.itertext()).strip()
    return None

# Recursive walker to parse the XML tree
def parse_ipc_tree(node, kind_filter=None, data=None, level_counts=None):
    """Recursively traverses XML nodes and collects IPC classification data."""
    if data is None:
        data = []
    if level_counts is None:
        level_counts = {level: 0 for level in KIND_TO_LEVEL.values()}

    for child in node:
        if child.tag == IPC_ENTRY_TAG:
            kind = child.attrib.get("kind", "").lower()

            # Skip entries of kind "i" (sub-class index)
            if kind == "i":
                continue

            if kind_filter is None or kind in kind_filter:
                symbol = child.attrib.get("symbol")
                title = get_text_body(child)
                level = KIND_TO_LEVEL.get(kind)
                level_title = KIND_TO_LEVEL_TITLE.get(kind, 'Unknown Title')

                # Append data for the DataFrame
                data.append({
                    "kind": kind,
                    "symbol": symbol,
                    "title": title,
                    "level": level,
                    "leveltitle": level_title
                })

                # Update level counts
                if level:
                    level_counts[level] += 1

        # Recursively process child nodes
        parse_ipc_tree(child, kind_filter, data, level_counts)

    return data, level_counts

# Main execution
if __name__ == "__main__":
    # Start timing
    start_time = time.time()

    # Parse the XML file
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(ipc_xml_path, parser)
    root = tree.getroot()

    print("Starting XML parsing...")

    # Parse the XML tree and collect data
    ipc_data, level_counts = parse_ipc_tree(root)

    # Create a DataFrame from the parsed data
    df = pd.DataFrame(ipc_data)

    # Ensure "level" column is integer and handle NaN values
    df['level'] = df['level'].fillna(-1).astype(int)

    # Stop timing
    execution_time = time.time() - start_time

    # Output results
    print(f"\nExtracted {len(df)} entries in {execution_time:.2f} seconds.\n")
    print("DataFrame Content (first 15 rows):")
    print(df.head(15))
    
    # Print level summary
    print("\nSummary of Counts Per Level:")
    for level, count in sorted(level_counts.items()):
        if count > 0:
            kind = next((k for k, v in KIND_TO_LEVEL.items() if v == level), None)
            title_desc = KIND_TO_LEVEL_TITLE.get(kind, 'Unknown Title')
            print(f"Level: {level} ({title_desc}), Count: {count}")

Starting XML parsing...

Extracted 81033 entries in 0.94 seconds.

DataFrame Content (first 15 rows):
   kind          symbol                                              title  \
0     s               A                                  HUMAN NECESSITIES   
1     t             A01                                        AGRICULTURE   
2     c             A01  AGRICULTUREFORESTRYANIMAL HUSBANDRYHUNTINGTRAP...   
3     u            A01B  SOIL WORKING IN AGRICULTURE OR FORESTRYPARTS, ...   
4     m  A01B0001000000                  Hand toolsedge trimmers for lawns   
5     1  A01B0001020000                                      SpadesShovels   
6     2  A01B0001040000                                         with teeth   
7     1  A01B0001060000                               HoesHand cultivators   
8     2  A01B0001080000                                with a single blade   
9     2  A01B0001100000                            with two or more blades   
10    2  A01B0001120000                 
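
The titles in the output above run together (for example "AGRICULTUREFORESTRYANIMAL HUSBANDRY...") because the text fragments inside each textBody are joined without a separator. One way to keep them apart is sketched below, using the standard library's ElementTree. The element names inside the sample textBody are invented for illustration, and `joined_title` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the inner element names are placeholders, not the real schema.
SAMPLE = (
    "<textBody><title>"
    "<part>AGRICULTURE</part><part>FORESTRY</part><part>ANIMAL HUSBANDRY</part>"
    "</title></textBody>"
)


def joined_title(text_body, separator="; "):
    """Join the non-empty text fragments of a node with a visible separator."""
    parts = (fragment.strip() for fragment in text_body.itertext())
    return separator.join(part for part in parts if part)


node = ET.fromstring(SAMPLE)
plain = "".join(node.itertext()).strip()   # fragments run together
separated = joined_title(node)             # fragments kept apart
print(plain)
print(separated)
```

A separator applied to every fragment may also split titles at inline elements such as references, so the result is worth checking against a few known entries.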


# Let's create a database for the hierarchy
