🇪🇺 Arne Krueger and Chad G. Petey Present 🤓:
# 🦅 Arne Krueger's Fun 🎉 with 📚 Patent Classification ✨

## Welcome, Patent Information Professionals - this is Session 3!

Patent classification systems, like the Cooperative Patent Classification (CPC), are essential tools for organizing, searching, and analyzing patent information. CPC data provides hierarchical insights into technology fields, from broad sections to detailed subgroups. 

This notebook is a hands-on guide to **download, parse, and analyze CPC data**. Whether you're a patent professional, data scientist, or simply curious about the structure of CPC, this is your starting point for transforming CPC text data into meaningful insights.

### What We'll Cover:
1. **Downloading CPC Title Lists**: Fetch the CPC Title List ZIP file (the August 2024 release is used here).
2. **Parsing Text Files into Structured Data**: Create pandas DataFrames from raw CPC files.
3. **Analyzing CPC Data**: Prepare the data for visualization and queries.
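
Before starting, the hierarchy mentioned in the introduction can be made concrete with a minimal sketch that splits a full CPC symbol into section, class, subclass, main group, and subgroup. The helper name `split_cpc_symbol` and the regular expression are illustrative assumptions, not part of any official CPC tooling, and the pattern only covers complete group-level symbols such as `A01B1/02`.

```python
import re

# Full CPC symbol: section letter, two-digit class, subclass letter,
# main group number, "/", subgroup number (optional space before the group).
CPC_SYMBOL = re.compile(
    r"^(?P<section>[A-HY])(?P<cls>\d{2})(?P<subclass>[A-Z])"
    r"\s*(?P<main_group>\d{1,4})/(?P<subgroup>\d{2,})$"
)


def split_cpc_symbol(symbol):
    """Hypothetical helper: return the hierarchy levels of a full CPC symbol."""
    match = CPC_SYMBOL.match(symbol.strip())
    if match is None:
        raise ValueError(f"Not a full CPC symbol: {symbol!r}")
    section = match["section"]
    cpc_class = section + match["cls"]
    subclass = cpc_class + match["subclass"]
    group_number = match["main_group"]
    return {
        "section": section,
        "class": cpc_class,
        "subclass": subclass,
        "main_group": f"{subclass}{group_number}/00",
        "subgroup": f"{subclass}{group_number}/{match['subgroup']}",
    }


levels = split_cpc_symbol("A01B1/02")
print(levels)
```

Each level is a prefix of the next, which is what makes roll-ups (for example, counting entries per subclass) straightforward once the data is in a DataFrame.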

### Step 1: Downloading the CPC Title List

The CPC Title List is a ZIP file that contains text files representing different sections of the CPC classification system. This function downloads the ZIP file, extracts it, and collects all the relevant text files.

```python
import os
import requests
import zipfile
import pandas as pd

def download_and_extract_cpc_title_list(url, destination_folder="./cpc_data"):
    """
    Downloads and extracts the CPC Title List ZIP file.

    Parameters:
        url (str): URL to the CPC Title List ZIP file.
        destination_folder (str): Folder to store the downloaded and extracted files.

    Returns:
        list: Paths to the extracted text files.
    """
    # Ensure the destination folder exists
    os.makedirs(destination_folder, exist_ok=True)

    # Define paths. The cached file name does not include the release,
    # so delete it before downloading a different version.
    zip_file_path = os.path.join(destination_folder, "CPCTitleList.zip")

    # Step 1: Download the ZIP file (skipped if a copy is already on disk)
    if not os.path.exists(zip_file_path):
        print(f"Downloading {url}...")
        # A timeout keeps the notebook from hanging on a stalled connection
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        with open(zip_file_path, "wb") as f:
            f.write(response.content)
        print(f"Downloaded to {zip_file_path}")
    else:
        print(f"ZIP file already exists: {zip_file_path}")

    # Step 2: Unzip the file
    print("Extracting the ZIP file...")
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        zip_ref.extractall(destination_folder)
        extracted_files = zip_ref.namelist()
        print(f"Extracted files: {extracted_files}")

    # Step 3: Collect all text files
    text_files = [os.path.join(destination_folder, file) for file in extracted_files if file.endswith(".txt")]

    if not text_files:
        raise FileNotFoundError("No text files found in the extracted ZIP.")
    return text_files
```
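
The returned paths carry useful metadata in their file names. As a minimal sketch, the section letter and release date can be read back from a name such as `cpc-section-A_20240801.txt`. The naming pattern is an assumption inferred from the extraction log shown in this notebook, and `describe_section_file` is a hypothetical helper, so other releases may need a different pattern.

```python
import os
import re

# Assumed naming pattern: cpc-section-<letter>_<YYYYMMDD>.txt
SECTION_FILE = re.compile(r"^cpc-section-(?P<section>[A-Z])_(?P<date>\d{8})\.txt$")


def describe_section_file(path):
    """Hypothetical helper: return (section, release_date), or None if the name differs."""
    match = SECTION_FILE.match(os.path.basename(path))
    if match is None:
        return None
    return match["section"], match["date"]


info = describe_section_file("./cpc_data/cpc-section-A_20240801.txt")
print(info)
```

Returning `None` for unexpected names lets a caller skip unrelated files instead of failing on them.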

### Step 2: Parsing CPC Text Files

Once the text files are extracted, the function below parses them into a single pandas DataFrame. For each entry it extracts the CPC symbol, the title, and the depth used for hierarchical analysis.

```python
def parse_cpc_text_files(text_files):
    """
    Parses CPC text files into a pandas DataFrame.

    Parameters:
        text_files (list): List of paths to CPC text files.

    Returns:
        pd.DataFrame: Combined DataFrame of all CPC sections.
    """
    data = []

    for file_path in text_files:
        print(f"Parsing file: {file_path}")
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                # Split the tab-separated line into parts
                parts = line.strip().split("\t")
                if len(parts) < 2:
                    continue  # Skip malformed lines

                symbol = parts[0].strip()
                title = parts[-1].strip()
                # The depth column is only trusted when a third column exists
                # and the middle column is numeric; otherwise depth defaults to 0
                depth_field = parts[1].strip()
                depth = int(depth_field) if len(parts) > 2 and depth_field.isdigit() else 0

                data.append({
                    "symbol": symbol,
                    "title": title,
                    "depth": depth
                })

    return pd.DataFrame(data)
```
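
The per-line logic can be tried without downloading anything. The sketch below applies the same splitting rules to a few in-memory lines. The sample lines are illustrative and are not copied from a real release, and the empty middle column on the first line is an assumption about how entries without a depth value look.

```python
import io

# Illustrative tab-separated lines: symbol, optional depth, title
sample = io.StringIO(
    "A\t\tHUMAN NECESSITIES\n"
    "A01B1/00\t0\tHand tools\n"
    "A01B1/02\t1\tSpades; Shovels\n"
    "malformed line without tabs\n"
)

rows = []
for line in sample:
    parts = line.strip().split("\t")
    if len(parts) < 2:
        continue  # Lines without a tab are skipped
    depth_field = parts[1].strip()
    rows.append({
        "symbol": parts[0].strip(),
        "title": parts[-1].strip(),
        # A non-numeric or missing depth column falls back to 0
        "depth": int(depth_field) if len(parts) > 2 and depth_field.isdigit() else 0,
    })

for row in rows:
    print(row)
```

Note that an entry with no depth value and an entry with an explicit depth of `0` both end up as `0`, so the symbol format is needed to tell them apart.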

### Step 3: Main Workflow

This is the main workflow that ties everything together. It downloads the CPC Title List, parses the extracted files, and displays a preview of the structured CPC data.

```python
# Main workflow
if __name__ == "__main__":
    # URL and destination folder
    cpc_zip_url = "https://www.cooperativepatentclassification.org/sites/default/files/cpc/bulk/CPCTitleList202408.zip"
    destination_folder = "./cpc_data"

    # Download and extract
    text_files = download_and_extract_cpc_title_list(cpc_zip_url, destination_folder)

    # Parse text files into a DataFrame
    df_cpc = parse_cpc_text_files(text_files)

    # Display the DataFrame
    print("\nCombined CPC DataFrame:")
    print(df_cpc.head())

    # Optional: Save the DataFrame to SQLite or process further
```

```
ZIP file already exists: ./cpc_data/CPCTitleList.zip
Extracting the ZIP file...
Extracted files: ['cpc-section-A_20240801.txt', 'cpc-section-B_20240801.txt', 'cpc-section-C_20240801.txt', 'cpc-section-D_20240801.txt', 'cpc-section-E_20240801.txt', 'cpc-section-F_20240801.txt', 'cpc-section-G_20240801.txt', 'cpc-section-H_20240801.txt', 'cpc-section-Y_20240801.txt']
Parsing file: ./cpc_data/cpc-section-A_20240801.txt
Parsing file: ./cpc_data/cpc-section-B_20240801.txt
Parsing file: ./cpc_data/cpc-section-C_20240801.txt
Parsing file: ./cpc_data/cpc-section-D_20240801.txt
Parsing file: ./cpc_data/cpc-section-E_20240801.txt
Parsing file: ./cpc_data/cpc-section-F_20240801.txt
Parsing file: ./cpc_data/cpc-section-G_20240801.txt
Parsing file: ./cpc_data/cpc-section-H_20240801.txt
Parsing file: ./cpc_data/cpc-section-Y_20240801.txt

Combined CPC DataFrame:
     symbol                                              title  depth
0         A                                  HUMAN NECESSITIES      0
```
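
The optional SQLite step mentioned in the workflow comment could look roughly like the sketch below. It uses a small stand-in DataFrame so that it runs on its own; in the notebook you would pass `df_cpc` instead. The table name `cpc_titles`, the index name, and the in-memory database are assumptions chosen for illustration.

```python
import sqlite3

import pandas as pd

# Stand-in for df_cpc; the second title is illustrative
df_demo = pd.DataFrame(
    [
        {"symbol": "A", "title": "HUMAN NECESSITIES", "depth": 0},
        {"symbol": "A01B1/02", "title": "Spades; Shovels", "depth": 1},
    ]
)

# ":memory:" keeps the demo self-contained; use a file path to persist the data
conn = sqlite3.connect(":memory:")
try:
    # Write the DataFrame to a table, replacing any earlier copy
    df_demo.to_sql("cpc_titles", conn, if_exists="replace", index=False)
    # An index on the symbol column speeds up lookups by symbol
    conn.execute("CREATE INDEX idx_cpc_symbol ON cpc_titles (symbol)")
    row = conn.execute(
        "SELECT title, depth FROM cpc_titles WHERE symbol = ?", ("A01B1/02",)
    ).fetchone()
finally:
    conn.close()

print(row)
```

The parameterized query (`?`) avoids building SQL strings by hand, which matters once symbols come from user input.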

## 🎉 Congratulations! 🎉

You’ve successfully:
- Downloaded the CPC Title List (August 2024 release) from the official CPC website.
- Parsed the text files into a structured pandas DataFrame.
- Prepared the data for further analysis, visualization, or integration into databases.

### What’s Next?
- Use this data for keyword-based queries or visualization tools.
- Combine the CPC data with patent filing datasets for advanced analytics.
- Explore the hierarchical relationships using the `depth` attribute.
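
Two of these ideas can be sketched in a few lines: a keyword query on titles and a parent lookup derived from `depth`. The stand-in DataFrame and its titles are illustrative. The parent logic assumes that rows are in file order and that `depth` counts nesting levels below a main group, which should be checked against the real data before relying on it.

```python
import pandas as pd

# Stand-in for df_cpc, in file order
df_demo = pd.DataFrame(
    [
        {"symbol": "A01B1/00", "title": "Hand tools", "depth": 0},
        {"symbol": "A01B1/02", "title": "Spades; Shovels", "depth": 1},
        {"symbol": "A01B1/022", "title": "Collapsible spades", "depth": 2},
        {"symbol": "A01B1/06", "title": "Hoes; Hand cultivators", "depth": 1},
    ]
)

# Keyword query: case-insensitive literal match on the title column
hits = df_demo[df_demo["title"].str.contains("shovel", case=False, regex=False)]
print(hits["symbol"].tolist())

# Parent lookup: keep a stack of the current ancestors. Truncating the stack
# to the row's depth leaves the parent (if any) on top.
stack = []
parents = []
for symbol, depth in zip(df_demo["symbol"], df_demo["depth"]):
    stack = stack[:depth]
    parents.append(stack[-1] if stack else None)
    stack.append(symbol)

df_demo["parent"] = parents
print(df_demo[["symbol", "depth", "parent"]])
```

With a `parent` column in place, the tree can be walked upward from any subgroup, or exported to a graph or treemap library for visualization.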

---

We hope you found this guide useful! Share your thoughts, ideas, or any questions with us on LinkedIn or in the community. Let’s continue exploring the exciting world of patent information! 🚀