<p><br><br></p>
<div style="display: flex; gap: 30px; padding: 0px; margin: 0;">
    <!-- Image on the left -->
    <img src="https://tip.epo.org/user/YSEnMLHtRnAwezLvvPk3kt/files/piznet/images/piznet-logo-rgb.png" style="object-fit: contain; max-height: 55px;">
    <div style="text-align: center; margin: 0 70px;">
        <p style="margin-bottom: 5px;"><b>Arne Krueger</b><br></p>
        <p style="margin: 2px 0;">Head of PIZnet.de eV & mtc.berlin<br></p>
        <p style="margin: 2px 0;">arne.krueger@mtc.berlin, +491725119844<br></p>
    </div>
    <!-- Image on the right -->
    <img src="https://tip.epo.org/user/YSEnMLHtRnAwezLvvPk3kt/files/piznet/images/mtc_logo_url_transparent.png" title="mtc.berlin" style="object-fit: contain; max-height: 35px;">
</div>
<p><br><br></p>

<u>
   
# EPO - PATENT KNOWLEDGE FORUM 2024

</u>

## PATLIBs Are Regional, Well-Connected Knowledge and Tech Transfer Centers in Europe

---
This notebook shows:

- How to use the new `Technology Intelligence Platform` (TIP) as a patent information professional, without needing developer or data scientist skills.
  
- A guided workflow in which the `Technology Intelligence Platform` and Generative AI simplify patent analysis with PATSTAT.

- The result: a visualized landscape, a dynamic map of granted applications over the last decade, enriched with technology fields across German states and districts.

---

## Why regional technology and applicant rankings? 
They can support funding decisions and strengthen regional networks and businesses.

---
### Example University Patents in Federal States
The Institut der Deutschen Wirtschaft published this report on patents from German universities, broken down by federal state.


<div style="display: flex; gap: 20px; padding: 0px; margin: 0;">
    <img src="https://tip.epo.org/user/YSEnMLHtRnAwezLvvPk3kt/files/piznet/images/hochschulpatentanmeldungen.png" style="object-fit: contain; max-height: 300px;">
</div>    

[1 Study of Institut der Deutschen Wirtschaft, 2024](https://www.iwkoeln.de/presse/pressemitteilungen/oliver-koppel-ostdeutsche-hochschulen-sind-bei-patenten-besonders-effizient.html)
<p><br></p>

---
### Example Patents in Bavaria

The 5th IHK report by the IHK zu Coburg on patents in the Bavarian Chambers of Commerce:

<div style="display: flex; gap: 20px; padding: 0px; margin: 0;">
    <img src="https://tip.epo.org/user/YSEnMLHtRnAwezLvvPk3kt/files/piznet/images/patentreport_bayern.png" style="object-fit: contain; max-height: 300px;">
</div>
<p><br></p>

[2 Patente in Bayern, 2023](https://www.ihk.de/coburg/beratung-und-service/innovation-technologie/ihk-report-patente-in-bayern-2023--6016150)

---
### Example Patent Map Thuringia

Adam Bartkowski of PATON, the PATLIB in Thuringia, created this publicly available Tableau dashboard:

<div style="display: flex; gap: 20px; padding: 0px; margin: 0;">
    <img src="https://tip.epo.org/user/YSEnMLHtRnAwezLvvPk3kt/files/piznet/images/patentatlas_thuringa.png" style="object-fit: contain; max-height: 300px;">
</div>
<p><br></p>

[3 Patent-Dashboard Thüringen, 2022](https://public.tableau.com/app/profile/adam.bartkowski6425/viz/path-2018/PatentzahlenThringen)

---

These studies and showcases require extensive manual work by patent information professionals and are *not yet* available for all regions or PATLIBs.

---

## Today's Use Case: Applicants and Technology Distribution in Germany by County

With the new `Technology Intelligence Platform` we analyse `PATSTAT` patent applicants and technology distributions across Germany's NUTS Level 3 regions (Landkreise). 

With the help of Generative AI we developed SQL queries for PATSTAT, then added and mapped additional data. We:

* extracted patent data at federal state and district levels (NUTS Level 1 and Level 3)
* mapped NUTS codes to region names available via EUROSTAT
* added CPC subclass titles for better readability
* visualized the results interactively with **Pygwalker**.

The result is a refactored, modular Python class to be presented at the Patent Knowledge Forum 2024.


---

## Step 1: Setup and Import Libraries

This step installs and imports all the necessary libraries for:
- Querying the PATSTAT database.
- Handling data with Pandas.
- Visualizing data using Pygwalker.
- Parsing XML to extract CPC subclass titles.

### Import the Key Libraries
- **Pandas**: For data manipulation.
- **SQLAlchemy**: To handle database queries.
- **Pygwalker**: For interactive visualization.
- **lxml**: For XML parsing (to extract CPC subclass titles).
- **Geopandas**: Optional for geographical mapping.

### Instantiate the PATSTAT client
The EPO's PATSTAT client is instantiated to connect to either the test or the production environment.


In [1]:
#  Install the libraries into this TIP Container
# !pip install pandas sqlalchemy pygwalker geopandas

print("Import the libraries for all our data importing and handling...")

# Import time library for measuring sql execution time
import time

# Import Geopandas for mapping if needed later
import geopandas as gpd
import pandas as pd

# Import pygwalker library for visualization
import pygwalker as pyg

# Import the EPO library module for PATSTAT
from epo.tipdata.patstat import PatstatClient

# Import XML library for parsing the IPC scheme (subclass titles)
from lxml import etree as ET

# Import sql library for easy sql execution
from sqlalchemy import create_engine, func
from sqlalchemy.sql import literal_column

# Instantiate the client object with the reduced data set (TEST) or the full data set (PROD)
patstat = PatstatClient(env="TEST")
#patstat = PatstatClient(env='PROD')

# Instantiate the ORM
db = patstat.orm()

# import all the tables we need
from epo.tipdata.patstat.database.models import (
    TLS201_APPLN,
    TLS202_APPLN_TITLE,
    TLS206_PERSON,
    TLS207_PERS_APPLN,
    TLS224_APPLN_CPC,
    TLS231_INPADOC_LEGAL_EVENT,
)

Import the libraries for all our data importing and handling...


---
## Step 2: Develop Initial SQL Queries

### Test Query 1: Granted Applications Filed at EPO in 2010
This query retrieves a list of granted applications filed at the European Patent Office (EPO) in the year 2010. 

It showcases:
- Filtering by filing year.
- Filtering by application authority (`EP` for European Patent).
- Retrieving only granted applications.

The goal is to ensure the PATSTAT connection and ORM setup are working correctly.

In [2]:
# Test query 1

# Start the timer
start_time = time.time()

q = db.query(
    TLS201_APPLN.appln_id,
    TLS201_APPLN.appln_auth,
    TLS201_APPLN.appln_nr,
    TLS201_APPLN.appln_kind,
    TLS201_APPLN.appln_filing_date,
).filter(
    TLS201_APPLN.appln_filing_year == 2010,
    TLS201_APPLN.appln_auth == "EP",
    TLS201_APPLN.granted == "Y",
)

df = patstat.df(q)

# Stop the timer
end_time = time.time()

# Calculate and print the execution time
execution_time = end_time - start_time
print(f"Query execution time: {execution_time:.2f} seconds")

# Display the first few rows of the DataFrame
print(df.head())

Query execution time: 0.78 seconds
    appln_id appln_auth  appln_nr appln_kind appln_filing_date
0  274222610         EP  10000313         A         2010-01-14
1  274369023         EP  10000849         A         2010-01-28
2  274681480         EP  10001469         A         2010-02-12
3  274720647         EP  10001552         A         2010-02-16
4  274875659         EP  10002051         A         2010-03-01
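
Every query cell in this notebook repeats the same start/stop timing lines. As a minimal sketch, assuming you prefer a reusable helper, the pattern can be wrapped in a context manager. The name `timed` and the `results` dictionary are illustrative and not part of the TIP or PATSTAT libraries.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the value for later comparison
        print(f"{label} execution time: {elapsed:.2f} seconds")


# Usage: the sleep stands in for a call such as patstat.df(q)
timings = {}
with timed("Query", timings):
    time.sleep(0.01)
```

Because the measurement sits in a `finally` clause, the time is still reported when the query raises an exception.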


### Test Query 2: Hitlist of Chinese Applicants at EPO

This query creates a hitlist of Chinese applicants who filed patents at the EPO. It demonstrates:
- Joining tables to link applicants with their filings.
- Filtering for Chinese applicants (`person_ctry_code = 'CN'`).
- Grouping by applicant name to count their applications.
- Ordering by the number of applications filed.

In [3]:
# Test query 2

# Start the timer
start_time = time.time()

q = (
    db.query(
        TLS206_PERSON.psn_name,
        TLS206_PERSON.person_ctry_code,
        func.count(TLS201_APPLN.appln_id).label("APPLICATIONS_AT_EPO"),
    )
    .select_from(TLS206_PERSON)
    .join(TLS207_PERS_APPLN)
    .join(TLS201_APPLN)
    .filter(
        TLS206_PERSON.person_ctry_code == "CN",
        TLS207_PERS_APPLN.applt_seq_nr > 0,
        TLS207_PERS_APPLN.invt_seq_nr == 0,
        TLS201_APPLN.appln_auth == "EP",
    )
    .group_by(TLS206_PERSON.psn_name, TLS206_PERSON.person_ctry_code)
    .order_by(func.count(TLS201_APPLN.appln_id).desc())
    .limit(100)
)
df = patstat.df(q)

# Stop the timer
end_time = time.time()

# Calculate and print the execution time
execution_time = end_time - start_time
print(f"Query execution time: {execution_time:.2f} seconds")

# Display the DataFrame (the notebook truncates long output)
df

Query execution time: 0.67 seconds


Unnamed: 0,psn_name,person_ctry_code,APPLICATIONS_AT_EPO
0,BEIJING GOLDWIND SCIENCE & CREATION WINDPOWER ...,CN,100
1,XINJIANG GOLDWIND SCIENCE & TECHNOLOGY COMPANY,CN,50
2,Beijing Goldwind Science & Creation Windpower ...,CN,21
3,"Huawei Digital Power Technologies Co., Ltd.",CN,16
4,SINOVEL WIND GROUP COMPANY,CN,14
...,...,...,...
95,NANJING YUNENG NEW ENERGY TECHNOLOGY COMPANY,CN,1
96,QINGDAO UNIVERSITY,CN,1
97,BEIJING WEST INDUSTRY COMPANY,CN,1
98,BAIDU ON-LINE NETWORK TECHNOLOGY (BEIJING) COM...,CN,1
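
The hitlist above counts the same applicant in upper-case and mixed-case spelling as two separate rows. The sketch below shows one possible way to merge such variants in Pandas by normalising case and whitespace before summing. The rows are invented for illustration, and the helper column `name_key` is an assumption, not a PATSTAT field.

```python
import pandas as pd

# Invented hitlist rows: the same applicant appears in two spellings
hits = pd.DataFrame(
    {
        "psn_name": ["ACME WIND CO", "Acme  Wind Co", "OTHER ENERGY LTD"],
        "person_ctry_code": ["CN", "CN", "CN"],
        "APPLICATIONS_AT_EPO": [100, 21, 14],
    }
)

# Normalise case and whitespace, then sum the counts per normalised name
hits["name_key"] = (
    hits["psn_name"].str.upper().str.replace(r"\s+", " ", regex=True).str.strip()
)
merged = (
    hits.groupby(["name_key", "person_ctry_code"], as_index=False)["APPLICATIONS_AT_EPO"]
    .sum()
    .sort_values("APPLICATIONS_AT_EPO", ascending=False)
    .reset_index(drop=True)
)
print(merged)
```

This only catches differences in case and spacing. Real name harmonisation (legal forms, abbreviations, subsidiaries) needs more care and manual review.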


---
## Step 3: Co-Develop a Query with Gen AI Chatbot

Using PATSTAT, this step introduces a SQL query to analyze patent applicants and technologies at the district level in Germany (NUTS Level 3). 

### Key Highlights:
1. Group by NUTS Level 3 (`Landkreis`) to map applicant activity across districts.
2. Add **technology fields** by including CPC subclass codes.
3. Visualize the results using Pygwalker for interactive exploration.

The query is refined step-by-step to include:
- Applicant names and NUTS codes.
- Application counts grouped by technology fields.

In [4]:
# Query patent applicants and technology distribution with filing year and grant status
## extract the federal state code (NUTS Level 1)
## extract the CPC subclass (CPC hierarchy level 3, e.g. B66B)

print("Query patent applicant and technology distribution...")

#### Start the timer
start_time = time.time()

q = (
    db.query(
        TLS206_PERSON.person_name.label("applicant"),
        TLS206_PERSON.nuts.label("nuts_code"),
        literal_column("SUBSTR(nuts, 1, 3)").label(
            "federal_state_code"
        ),  # Federal state (NUTS Level 1)
        TLS224_APPLN_CPC.cpc_class_symbol.label("technology_field"),  # Technology field
        literal_column("SUBSTR(cpc_class_symbol, 1, 4)").label("cpc_subclass"),
        TLS201_APPLN.appln_filing_year.label("filing_year"),  # Filing year
        TLS201_APPLN.granted.label("granted"),  # Grant status
        func.count(TLS201_APPLN.appln_id).label("appln_count"),  # Application count
    )
    .select_from(TLS206_PERSON)
    .join(TLS207_PERS_APPLN, TLS206_PERSON.person_id == TLS207_PERS_APPLN.person_id)
    .join(TLS201_APPLN, TLS207_PERS_APPLN.appln_id == TLS201_APPLN.appln_id)
    .join(TLS224_APPLN_CPC, TLS201_APPLN.appln_id == TLS224_APPLN_CPC.appln_id)
    .filter(
        TLS206_PERSON.nuts.startswith("DE"),  # Filter for Germany NUTS code
        TLS206_PERSON.nuts_level == 3,  # Limit to NUTS level 3
    )
    .group_by(
        TLS201_APPLN.appln_filing_year,  # Group by filing year
        TLS206_PERSON.nuts,  # Group by NUTS Level 3 code
        literal_column("SUBSTR(nuts, 1, 3)"),  # Group by federal state code
        TLS224_APPLN_CPC.cpc_class_symbol,  # Group by technology field
        TLS206_PERSON.person_name,  # Group by person name
        TLS201_APPLN.granted,  # Group by grant status
    )
    .order_by(TLS206_PERSON.nuts)
)

# Execute the query
df = patstat.df(q)

### Stop the timer
end_time = time.time()

# Calculate and print the execution time
execution_time = end_time - start_time
print(f"Query execution time: {execution_time:.2f} seconds")

# Display the DataFrame (the notebook truncates long output)
df

Query patent applicant and technology distribution...
Query execution time: 4.19 seconds


Unnamed: 0,applicant,nuts_code,federal_state_code,technology_field,cpc_subclass,filing_year,granted,appln_count
0,Universität Stuttgart,DE111,DE1,F03B 15/00,F03B,2019,Y,1
1,Universität Stuttgart,DE111,DE1,F03B 17/061,F03B,2019,Y,1
2,Universität Stuttgart,DE111,DE1,F03D 7/043,F03D,2019,Y,1
3,Universität Stuttgart,DE111,DE1,F05B2260/845,F05B,2019,Y,1
4,Universität Stuttgart,DE111,DE1,F05B2270/8042,F05B,2019,Y,1
...,...,...,...,...,...,...,...,...
71507,"Fliegl, Helmut",DEG0K,DEG,B60D 1/015,B60D,2014,Y,1
71508,"Fliegl, Helmut",DEG0K,DEG,B60D 1/30,B60D,2014,Y,1
71509,"Fliegl, Helmut",DEG0K,DEG,B60D 1/305,B60D,2014,Y,1
71510,"Fliegl, Helmut",DEG0K,DEG,B60D 1/62,B60D,2014,Y,1
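
The query derives the federal state code with `SUBSTR(nuts, 1, 3)`. This works because NUTS codes are hierarchical: two letters for the country, then one further character per level. A small sketch of the same idea in plain Python follows. The function name `nuts_levels` is illustrative. Note that SQL `SUBSTR` counts from 1, while Python slices count from 0.

```python
def nuts_levels(nuts_code):
    """Split a NUTS level 3 code into its parent codes by prefix length."""
    if len(nuts_code) != 5:
        raise ValueError(f"Expected a 5-character NUTS level 3 code, got {nuts_code!r}")
    return {
        "country": nuts_code[:2],  # e.g. DE
        "nuts1": nuts_code[:3],    # federal state
        "nuts2": nuts_code[:4],    # NUTS level 2 region
        "nuts3": nuts_code,        # district (Landkreis)
    }


levels = nuts_levels("DE111")
print(levels)
```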


---
## Step 4: Add Regional Mappings (NUTS Codes to Names)

To make the data more readable, let a Gen AI Chatbot help you to:
1. Load a mapping CSV file from Eurostat containing NUTS codes and corresponding names.
2. Extract mappings for:
   - **Federal States (NUTS Level 1)**: e.g., "Baden-Württemberg."
   - **Districts (NUTS Level 3)**: e.g., "Stuttgart, Landeshauptstadt".
3. Apply these mappings to the query results using Pandas.

This step ensures that the visualization contains user-friendly names instead of raw codes.

In [5]:
# Add mapping for Bundesland NUTS Code 1 and Landkreis NUTS Code 3 with a mapping CSV from EUROSTAT

## Load the prepared CSV file
nuts_mapping = pd.read_csv("./mappings/nuts_mapping.csv", delimiter=",")

## Create separate mappings for federal states and districts
federal_state_mapping = (
    nuts_mapping[nuts_mapping["LEVEL"] == 1]
    .set_index("NUTS_ID")["NAME_LATIN"]
    .to_dict()
)
landkreis_mapping = (
    nuts_mapping[nuts_mapping["LEVEL"] == 3]
    .set_index("NUTS_ID")["NAME_LATIN"]
    .to_dict()
)

## Map federal states (NUTS Level 1)
df["federal_state_name"] = df["nuts_code"].str[:3].map(federal_state_mapping)

## Map Landkreise (NUTS Level 3)
df["landkreis_name"] = df["nuts_code"].map(landkreis_mapping)

# Display the DataFrame for checking (the notebook truncates long output)
df

Unnamed: 0,applicant,nuts_code,federal_state_code,technology_field,cpc_subclass,filing_year,granted,appln_count,federal_state_name,landkreis_name
0,Universität Stuttgart,DE111,DE1,F03B 15/00,F03B,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt"
1,Universität Stuttgart,DE111,DE1,F03B 17/061,F03B,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt"
2,Universität Stuttgart,DE111,DE1,F03D 7/043,F03D,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt"
3,Universität Stuttgart,DE111,DE1,F05B2260/845,F05B,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt"
4,Universität Stuttgart,DE111,DE1,F05B2270/8042,F05B,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt"
...,...,...,...,...,...,...,...,...,...,...
71507,"Fliegl, Helmut",DEG0K,DEG,B60D 1/015,B60D,2014,Y,1,Thüringen,"Saalburg-Ebersdorf, Stadt"
71508,"Fliegl, Helmut",DEG0K,DEG,B60D 1/30,B60D,2014,Y,1,Thüringen,"Saalburg-Ebersdorf, Stadt"
71509,"Fliegl, Helmut",DEG0K,DEG,B60D 1/305,B60D,2014,Y,1,Thüringen,"Saalburg-Ebersdorf, Stadt"
71510,"Fliegl, Helmut",DEG0K,DEG,B60D 1/62,B60D,2014,Y,1,Thüringen,"Saalburg-Ebersdorf, Stadt"
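
`Series.map` silently leaves `NaN` wherever a code has no entry in the mapping dictionary, for example if the CSV and the PATSTAT data were built on different NUTS versions (an assumption worth checking, not something this notebook verifies). A minimal coverage check could look like the sketch below. The rows and the code `DE999` are invented.

```python
import pandas as pd

# Invented mapping and result rows; "DE999" is deliberately not in the mapping
landkreis_mapping = {"DE111": "Stuttgart, Landeshauptstadt"}
result = pd.DataFrame(
    {"nuts_code": ["DE111", "DE111", "DE999"], "appln_count": [1, 1, 1]}
)

result["landkreis_name"] = result["nuts_code"].map(landkreis_mapping)

# Codes without a match end up as NaN; list them with their row counts
unmapped = result.loc[result["landkreis_name"].isna(), "nuts_code"].value_counts()
coverage = result["landkreis_name"].notna().mean()
print(f"Mapped {coverage:.0%} of rows; unmapped codes: {unmapped.to_dict()}")
```

Running the same two lines on the real `df` after the mapping step shows whether any districts would be missing from the later map.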


---
## Step 5: Map CPC Subclasses to Titles

To enrich the technology field analysis, let a Gen AI Chatbot help with this mapping as well:
1. Use the IPC XML scheme provided by WIPO to extract titles for CPC subclasses.
2. Parse the XML to extract:
   - Subclass symbols (e.g., `B66B`).
   - Corresponding titles (e.g., "Elevators and Lifts").
3. Map these titles to the query results.

The result is a dataset with meaningful technology labels, enabling more insightful analysis. Subclasses that are not part of the IPC scheme (such as `F05B` in the output below) get no title.

In [6]:
# Start measuring time
start = time.time()

# File path to the IPC XML
filename = "./mappings/EN_ipc_scheme_20210101.xml"

# Define the namespace and parser
ipc_namespace = "{http://www.wipo.int/classifications/ipc/masterfiles}"
ipcEntry = f"{ipc_namespace}ipcEntry"
text_body = f"{ipc_namespace}textBody"
title_part = f"{ipc_namespace}titlePart"
text = f"{ipc_namespace}text"
parser = ET.XMLParser(remove_blank_text=True)

# Parse the XML file
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

# Initialize dictionary for sub-class mapping
sub_class_mapping = {}

# Iterate through the XML to extract sub-class information
for element in root.iter(ipcEntry):
    if element.attrib.get("kind") == "u":  # Focus on sub-classes
        symbol = element.attrib.get("symbol")  # Extract sub-class symbol

        # Locate the title text within the nested structure
        text_element = element.find(f".//{text_body}//{title_part}//{text}")
        title = text_element.text.strip() if text_element is not None else "No Title"

        sub_class_mapping[symbol] = title

# Print a sample of the extracted data
# for symbol, title in list(sub_class_mapping.items())[:20]:
#    print(f"{symbol}: {title}")

# Print execution time
print(
    f"Extracted {len(sub_class_mapping)} sub-classes in {(time.time() - start) * 1000:.0f} ms."
)

# execute the mapping
df["cpc_subclass_title"] = df["cpc_subclass"].map(sub_class_mapping)

# Display the DataFrame (the notebook truncates long output)
df

Extracted 646 sub-classes in 253 ms.


Unnamed: 0,applicant,nuts_code,federal_state_code,technology_field,cpc_subclass,filing_year,granted,appln_count,federal_state_name,landkreis_name,cpc_subclass_title
0,Universität Stuttgart,DE111,DE1,F03B 15/00,F03B,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt",MACHINES OR ENGINES FOR LIQUIDS
1,Universität Stuttgart,DE111,DE1,F03B 17/061,F03B,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt",MACHINES OR ENGINES FOR LIQUIDS
2,Universität Stuttgart,DE111,DE1,F03D 7/043,F03D,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt",WIND MOTORS
3,Universität Stuttgart,DE111,DE1,F05B2260/845,F05B,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt",
4,Universität Stuttgart,DE111,DE1,F05B2270/8042,F05B,2019,Y,1,Baden-Württemberg,"Stuttgart, Landeshauptstadt",
...,...,...,...,...,...,...,...,...,...,...,...
71507,"Fliegl, Helmut",DEG0K,DEG,B60D 1/015,B60D,2014,Y,1,Thüringen,"Saalburg-Ebersdorf, Stadt",VEHICLE CONNECTIONS
71508,"Fliegl, Helmut",DEG0K,DEG,B60D 1/30,B60D,2014,Y,1,Thüringen,"Saalburg-Ebersdorf, Stadt",VEHICLE CONNECTIONS
71509,"Fliegl, Helmut",DEG0K,DEG,B60D 1/305,B60D,2014,Y,1,Thüringen,"Saalburg-Ebersdorf, Stadt",VEHICLE CONNECTIONS
71510,"Fliegl, Helmut",DEG0K,DEG,B60D 1/62,B60D,2014,Y,1,Thüringen,"Saalburg-Ebersdorf, Stadt",VEHICLE CONNECTIONS
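
To see what the XML loop does without the full scheme file, here is a self-contained sketch on a tiny inline document. It uses the standard library's `xml.etree.ElementTree` instead of `lxml`, and the sample XML is a simplified assumption about the file layout. Only the namespace, the `ipcEntry` element, and the `kind` and `symbol` attributes are taken from the cell above. The symbol `F03X` is invented to show the fallback.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.wipo.int/classifications/ipc/masterfiles}"

# Simplified stand-in for the scheme file; the root element name is an assumption
SAMPLE = """
<IPCScheme xmlns="http://www.wipo.int/classifications/ipc/masterfiles">
  <ipcEntry kind="c" symbol="F03">
    <ipcEntry kind="u" symbol="F03D">
      <textBody><title><titlePart><text>WIND MOTORS</text></titlePart></title></textBody>
    </ipcEntry>
    <ipcEntry kind="u" symbol="F03X"/>
  </ipcEntry>
</IPCScheme>
"""

root = ET.fromstring(SAMPLE)
titles = {}
for entry in root.iter(f"{NS}ipcEntry"):
    if entry.attrib.get("kind") != "u":  # keep subclass entries only
        continue
    text_element = entry.find(f".//{NS}titlePart/{NS}text")
    has_text = text_element is not None and text_element.text
    titles[entry.attrib["symbol"]] = text_element.text.strip() if has_text else "No Title"

print(titles)
```

The extra check on `text_element.text` guards against a `text` element that exists but carries no direct text, a case in which calling `.strip()` would fail.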


---
## Step 6: Refactor the Notebook into a Python Class

With all components tested, the process is refactored into a modular Python class called `PatentDataProcessor`.

### Key Features:
1. **Class-Based Design**:
   - Encapsulates the entire process (querying, mapping, visualization).
2. **Modular Methods**:
   - `load_nuts_mapping`: Load the NUTS regional mappings.
   - `query_patent_data`: Execute the main SQL query.
   - `load_ipc_scheme`: Parse XML for CPC subclass titles.
   - `process_data`: Integrate mappings into the dataset.
   - `save_data`: Save the processed dataset as CSV, Excel, or JSON.
   - `visualize_data`: Launch Pygwalker for visualization.
3. **Parameterization**:
   - Paths to mapping files and the PATSTAT environment are configurable.

This structure makes the code reusable, maintainable, and easy to execute outside the notebook.

### Define and execute the Class:

In [7]:
import time
import os

import geopandas as gpd
import pandas as pd
import pygwalker as pyg

from epo.tipdata.patstat import PatstatClient
from epo.tipdata.patstat.database.models import (
    TLS201_APPLN,
    TLS206_PERSON,
    TLS207_PERS_APPLN,
    TLS224_APPLN_CPC,
)

from lxml import etree as ET
from sqlalchemy import func, text

class PatentDataProcessor:
    def __init__(
        self,
        patstat_env="TEST",
        nuts_mapping_path="./mappings/nuts_mapping.csv",
        ipc_scheme_path="./mappings/EN_ipc_scheme_20210101.xml",
        ipc_namespace="{http://www.wipo.int/classifications/ipc/masterfiles}",
    ):
        self.patstat = PatstatClient(env=patstat_env)
        self.db = self.patstat.orm()
        self.nuts_mapping_path = nuts_mapping_path
        self.ipc_scheme_path = ipc_scheme_path
        self.ipc_namespace = ipc_namespace
        self.nuts_mapping = None
        self.sub_class_mapping = {}

    def load_nuts_mapping(self):
        print("Loading NUTS mapping...")
        self.nuts_mapping = pd.read_csv(self.nuts_mapping_path, delimiter=",")
        self.federal_state_mapping = (
            self.nuts_mapping[self.nuts_mapping["LEVEL"] == 1]
            .set_index("NUTS_ID")["NAME_LATIN"]
            .to_dict()
        )
        self.landkreis_mapping = (
            self.nuts_mapping[self.nuts_mapping["LEVEL"] == 3]
            .set_index("NUTS_ID")["NAME_LATIN"]
            .to_dict()
        )
        
    def query_patent_data(self):
        # Query patent data using raw SQL.
        print("Querying patent data with raw SQL...")
        start = time.time()
    
        # Define the raw SQL query
        sql_query = """
            SELECT
                tls206_person.person_name AS applicant,
                tls206_person.nuts AS nuts_code,
                tls201_appln.appln_filing_year AS filing_year,
                tls224_appln_cpc.cpc_class_symbol AS cpc_subclass,
                COUNT(DISTINCT tls201_appln.appln_id) AS appln_count
            FROM
                tls201_appln
            INNER JOIN tls207_pers_appln ON tls201_appln.appln_id = tls207_pers_appln.appln_id
            INNER JOIN tls206_person ON tls207_pers_appln.person_id = tls206_person.person_id
            INNER JOIN tls224_appln_cpc ON tls201_appln.appln_id = tls224_appln_cpc.appln_id
            WHERE
                tls206_person.nuts LIKE 'DE%' AND
                tls206_person.nuts_level = 3 AND
                tls201_appln.appln_filing_year >= EXTRACT(YEAR FROM CURRENT_DATE()) - 10
            GROUP BY
                tls206_person.person_name,
                tls206_person.nuts,
                tls201_appln.appln_filing_year,
                tls224_appln_cpc.cpc_class_symbol
            ORDER BY
                tls206_person.nuts, tls201_appln.appln_filing_year, appln_count DESC;
        """
        # Use self.db.bind to access the engine
        engine = self.db.bind
    
        # Execute the raw SQL query and fetch the rows while the connection is still open
        with engine.connect() as connection:
            result = connection.execute(text(sql_query))
            rows = result.fetchall()
            columns = result.keys()
    
        # Convert the result to a DataFrame
        df = pd.DataFrame(rows, columns=columns)
    
        print(f"Query execution time: {time.time() - start:.2f} seconds")
        return df
        
    def load_ipc_scheme(self):
        # Load CPC or IPC scheme for CPC subclass titles.
        # Note:
        # - This method currently uses the IPC scheme.
        # - Some CPC-specific subclasses (e.g., Y02 series) may not be covered.
        # - To achieve full coverage, switch to the CPC scheme from USPTO/EPO resources.
                    
        print("Loading IPC scheme...")
        start = time.time()
    
        ipc_namespace = self.ipc_namespace
        ipcEntry = f"{ipc_namespace}ipcEntry"
        text_body = f"{ipc_namespace}textBody"
        title_part = f"{ipc_namespace}titlePart"
        text = f"{ipc_namespace}text"
    
        parser = ET.XMLParser(remove_blank_text=True)
        tree = ET.parse(self.ipc_scheme_path, parser=parser)
        root = tree.getroot()
    
        for element in root.iter(ipcEntry):
            if element.attrib.get("kind") == "u":  # Subclass kind is "u"
                symbol = element.attrib.get("symbol")
                text_element = element.find(f".//{text_body}//{title_part}//{text}")
                title = text_element.text.strip() if text_element is not None else "No Title"
                self.sub_class_mapping[symbol] = title
    
        # Log a sample of loaded subclass titles
        # print(f"Sample CPC Subclass Titles: {list(self.sub_class_mapping.items())[:5]}")
        
        print(f"Loaded {len(self.sub_class_mapping)} IPC subclasses in {(time.time() - start):.2f} seconds.")

    def process_data(self, df):
        print("Processing data...")
        start = time.time()
    
        # Add federal state codes and names
        df["federal_state_code"] = df["nuts_code"].str[:3]
        df["federal_state_name"] = df["federal_state_code"].map(self.federal_state_mapping)
        df["landkreis_name"] = df["nuts_code"].map(self.landkreis_mapping)
    
        # Normalize CPC subclass
        if "cpc_subclass" in df.columns:
            df["normalized_cpc_subclass"] = df["cpc_subclass"].str[:4]
    
            # Map CPC titles
            df["cpc_subclass_title"] = df["normalized_cpc_subclass"].map(self.sub_class_mapping)
    
            # Combine subclass and title into one column
            df["cpc_combined"] = df.apply(
                lambda row: f"{row['normalized_cpc_subclass']} - {row['cpc_subclass_title']}"
                if pd.notna(row['cpc_subclass_title']) else row['normalized_cpc_subclass'],
                axis=1
            )
        else:
            print("Warning: 'cpc_subclass' column is missing. CPC subclass titles will not be added.")
    
        print(f"Processing time: {time.time() - start:.2f} seconds")
        return df

    def save_data(self, df, file_path, file_format="csv"):
        
        print("Saving data to disk...")
        
        start = time.time()
        
        supported_formats = ["csv", "excel", "json"]
        
        if file_format not in supported_formats:
            raise ValueError(f"Unsupported file format: {file_format}. Supported formats are {supported_formats}.")

        # Create the target directory only if the path contains one
        output_dir = os.path.dirname(file_path)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)
        
        if file_format == "csv":
            df.to_csv(file_path, index=False)
        elif file_format == "excel":
            df.to_excel(file_path, index=False, engine="openpyxl")
        elif file_format == "json":
            df.to_json(file_path, orient="records")
            
        print(f"Data saved successfully to {file_path} in {file_format.upper()} format in: {time.time() - start:.2f} seconds")
    
    def visualize_data(self, df):
        
        print("Launching visualization...")

        # Reuse the chart configuration stored in pygwalker_config.json
        walker = pyg.walk(df, spec="pygwalker_config.json")
        
        return walker

if __name__ == "__main__":
    
    # Instantiate and execute the workflow
    processor = PatentDataProcessor(patstat_env="PROD")
    
    # Step 1: Load mappings
    processor.load_nuts_mapping()
    processor.load_ipc_scheme()
    
    # Step 2: Query data
    patent_data = processor.query_patent_data()
    
    # Step 3: Process data
    processed_data = processor.process_data(patent_data)
    
    # Step 4: Save the processed data
    processor.save_data(processed_data, file_path="./output/patent_data.csv", file_format="csv")
    
    # Step 5: Visualize the data (optional)
    processor.visualize_data(processed_data)


Loading NUTS mapping...
Loading IPC scheme...
Loaded 646 IPC subclasses in 0.27 seconds.
Querying patent data with raw SQL...
Query execution time: 41.81 seconds
Processing data...
Processing time: 18.94 seconds
Saving data to disk...
Data saved successfully to ./output/patent_data.csv in CSV format in: 12.05 seconds
Launching visualization...
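
The log above reports a processing time of about 19 seconds. Part of that may come from the row-wise `df.apply` call in `process_data`, which runs a Python function once per row. The sketch below shows a vectorised alternative that should produce the same `cpc_combined` values and is usually faster on large frames. It has not been timed against PATSTAT data, and the two rows are invented.

```python
import pandas as pd

# Invented rows; the second subclass has no title, as for subclasses missing from the IPC scheme
data = pd.DataFrame(
    {
        "normalized_cpc_subclass": ["F03D", "F05B"],
        "cpc_subclass_title": ["WIND MOTORS", None],
    }
)

# Build "CODE - TITLE" for all rows at once; rows without a title become NaN
combined = data["normalized_cpc_subclass"].str.cat(
    data["cpc_subclass_title"], sep=" - "
)

# Fall back to the bare subclass code where no title is available
data["cpc_combined"] = combined.fillna(data["normalized_cpc_subclass"])
print(data["cpc_combined"].tolist())
```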



---
# The result of our exploring the `Technology Intelligence Platform`
---

## PygWalker Visualization of the queried data

``` 
Loading NUTS mapping...
Loading IPC scheme...
Loaded 646 IPC subclasses in 0.25 seconds.
Querying patent data with raw SQL...
Query execution time: 39.34 seconds
Processing data...
Processing time: 18.96 seconds
Saving data to disk...
Data saved successfully to ./output/patent_data.csv in CSV format in: 11.98 seconds
Launching visualization...
```

Here is the final map of the number of applications in Thuringia over the last 10 years, as queried from PATSTAT.

<div style="display: flex; gap: 20px; padding: 0px; margin: 0;">
    <img src="https://tip.epo.org/user/YSEnMLHtRnAwezLvvPk3kt/files/piznet/images/pygwalker_1.png" style="object-fit: contain; max-height: 500px;">
</div>
<p><br></p>

In PygWalker you need to:
- Change **Coordinate System** to `Geographic`
- Change **Mark Type** to `Choropleth`
- Download the NUTS regions as a GeoJSON file from `Eurostat`
- Load geojson file in `Geography Configuration`
- Enter `NUTS_ID` into `Feature ID`
- Add nuts_code field to `Geometry ID`
- Add Applicant field to `Color`
- Add Federal State field to `Filter`
- Add Landkreis field to `Text`
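
Before loading the GeoJSON file into PygWalker, it can help to reduce it to German districts and to check that every `nuts_code` in the data has a matching shape. The sketch below does this with the standard library on a tiny invented `FeatureCollection`. The only property taken from the steps above is `NUTS_ID`; in practice you would read the downloaded file with `json.load` instead.

```python
import json

# Tiny invented FeatureCollection standing in for the Eurostat download
geojson_text = json.dumps(
    {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature", "properties": {"NUTS_ID": "DE111"}, "geometry": None},
            {"type": "Feature", "properties": {"NUTS_ID": "DE1"}, "geometry": None},
            {"type": "Feature", "properties": {"NUTS_ID": "FR101"}, "geometry": None},
        ],
    }
)

collection = json.loads(geojson_text)

# Keep German level 3 regions: country prefix "DE" plus three further characters
german_districts = [
    feature
    for feature in collection["features"]
    if feature["properties"]["NUTS_ID"].startswith("DE")
    and len(feature["properties"]["NUTS_ID"]) == 5
]
collection["features"] = german_districts

geojson_ids = {feature["properties"]["NUTS_ID"] for feature in german_districts}
data_codes = {"DE111", "DEG0K"}  # invented stand-in for df["nuts_code"].unique()
print("Codes without a shape:", sorted(data_codes - geojson_ids))
```

A smaller file also keeps the choropleth responsive, since PygWalker only has to draw the regions that appear in the data.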

See the PygWalker UI in this screenshot:

<div style="display: flex; gap: 20px; padding: 0px; margin: 0;">
    <img src="https://tip.epo.org/user/YSEnMLHtRnAwezLvvPk3kt/files/piznet/images/pygwalker_full.png" style="object-fit: contain; max-height: 700px;">
</div>
<p><br></p>


---

## Takeaways

### How did we do it

- This `Jupyter` notebook is a step-by-step documentation of the data analysis.
- We used `SQL` and `Python` code and co-developed it with Generative AI.
- We mapped `PATSTAT NUTS Code Level 3` to federal states and counties with `EUROSTAT` data.
- We mapped `CPC Symbols` to `CPC Titles` with WIPO IPC scheme data.
- We used the `Tableau`-like `PygWalker`, which makes choropleth geographical mapping easy.

### What you can do with it

- Current `PATSTAT` data is available.
- More external data can be included: Eurostat, WIPO, CPC, …
- The whole `Python` environment (the data science language!) is available.
- Thousands more open source libraries can be loaded and used.
- `Generative AI Assistants` help you with the project!

DOWNLOAD LINK -> **[Github Repository with this Jupyter Notebook and all the mapping files](https://www.iwkoeln.de/presse/pressemitteilungen/oliver-koppel-ostdeutsche-hochschulen-sind-bei-patenten-besonders-effizient.html)**

---
## Thank you!
...to everyone at the **Patent Knowledge Forum 2024** for your attention and interest in this topic, and to:
- **Carlos Aitor Pérez de Unzueta** (EPO) for all the support with TIP,
- **Adam Bartkowski** (PATON) for the inspiration and all his work on the Thuringia patent map,
- **Sebastian Gabel** (mtc.berlin) for his moral and development support.

---
<p><br></p>

<div style="text-align: right;">
    Arne Krueger, arne.krueger@mtc.berlin, +49 172 5119844
</div>

