# Fun with IPC XML, Python xml, lxml and ElementTree

### Python Sources
* John Shipman's tutorial on [Python XML processing with lxml](https://www.academia.edu/38587906/Python_XML_processing_with_lxml)
* [The ElementTree API](https://docs.python.org/3/library/xml.etree.elementtree.html) on python.org
* Tutorials on [Real Python](https://realpython.com/)
* [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)

### WIPO Links
* Current Edition of IPC Master Files from [WIPO's Download and IT support area](https://www.wipo.int/classifications/ipc/en/ITsupport/), here is the [direct link to the zip file](https://www.wipo.int/ipc/itos4ipc/ITSupport_and_download_area//20210101/MasterFiles/ipc_scheme_images_20210101.zip)
* Documentation and XSDs are [here](https://www.wipo.int/classifications/ipc/en/ITsupport/Version20210101/documentation/IPCfiles.html), esp. the Specification of the Scheme file [here](https://www.wipo.int/ipc/itos4ipc/ITSupport_and_download_area/Documentation/20210101/IPC_scheme_specs_v3_1.docx)
* [Link](https://www.wipo.int/classifications/ipc/ipcpub/?notion=scheme&version=20210101&symbol=none&menulang=en&lang=en&viewmode=f&fipcpc=no&showdeleted=yes&indexes=no&headings=yes&notes=yes&direction=o2n&initial=A&cwid=none&tree=no&searchmode=smart) to the IPC Browser of WIPO


First, we download the IPC XML from WIPO so that we can work with it further down the road. This has to be repeated every time this machine is restarted!

In [1]:
import requests, zipfile, io, os, time

tic = time.perf_counter() * 1000

url = 'https://www.wipo.int/ipc/itos4ipc/ITSupport_and_download_area//20210101/MasterFiles/ipc_scheme_20210101.zip'
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

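# os.listdir() returns entries in arbitrary order, so index 1 may be an unrelated file;
# the archive's own listing names what was actually extracted
filename = z.namelist()[0]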

toc = time.perf_counter() * 1000

print('downloaded and unzipped', filename, f'in: {(toc - tic):0.0f} ms')

downloaded and unzipped patstat_training in: 440 ms
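
The printed name above is whatever happens to sit at index 1 of the working directory, not necessarily the extracted file. A minimal sketch of reading the names from the archive itself follows; the helper `extract_xml` and the stand-in archive `demo_scheme.xml` are made up for illustration, so that the cell runs without network access.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_xml(zip_bytes, target_dir):
    """Unpack a zip archive held in memory and return the paths of its XML members."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target_dir)
        # namelist() reports exactly what the archive contains, in archive order
        return [Path(target_dir) / name
                for name in archive.namelist()
                if name.lower().endswith(".xml")]


# Build a tiny stand-in archive (hypothetical content) instead of downloading one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo_scheme.xml", "<root/>")
    archive.writestr("readme.txt", "not an XML file")

with tempfile.TemporaryDirectory() as tmp:
    extracted = extract_xml(buffer.getvalue(), tmp)
    extracted_names = [path.name for path in extracted]
    all_exist = all(path.exists() for path in extracted)

print(extracted_names)
```

With the real download, the same helper could be called as `extract_xml(r.content, ".")`.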



# First Sample Code

First, we import lxml, open the file (downloaded manually and placed next to the sample data), and print the tag and the attribute dictionary of each top-level element. These are the sections of the IPC tree. The attributes are 'kind', 'symbol' and 'entryType'.

In [2]:
from lxml import etree as ET

filename = "./EN_ipc_scheme_20210101.xml"
parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

for child in root:
    print(child.tag, child.attrib)


{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'A', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'B', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'C', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'D', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'E', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'F', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'G', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'H', 'entryType': 'K'}


Here is a different, shorter way of doing the same thing, this time with xml.etree.ElementTree from the standard library.

In [3]:
import xml.etree.ElementTree as ET
root = ET.parse("./EN_ipc_scheme_20210101.xml").getroot()

for sections in root:
   print(sections.tag, sections.attrib)

{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'A', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'B', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'C', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'D', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'E', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'F', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'G', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'H', 'entryType': 'K'}


# What do we see?

Each line shows the **tag**, which includes the XML namespace (xmlns) the entry belongs to (the file uses only one namespace, by the way), followed by the **attributes**. The attributes are described in the next text box.
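
Since every tag carries the long `{namespace}` prefix, it can help to strip it or to use a prefix map. The following is only a sketch on a hand-written stand-in document: the root element name `IPCScheme`, the helper `local_name` and the prefix `ipc` are assumptions for illustration, and it uses the standard library's ElementTree rather than lxml.

```python
import xml.etree.ElementTree as ET

NS = "http://www.wipo.int/classifications/ipc/masterfiles"

# Hand-written stand-in for the top of the scheme file (root name is a placeholder).
sample = f"""
<IPCScheme xmlns="{NS}">
  <ipcEntry kind="s" symbol="A" entryType="K"/>
  <ipcEntry kind="s" symbol="B" entryType="K"/>
</IPCScheme>
"""
root = ET.fromstring(sample)


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of a tag."""
    return tag.rpartition("}")[2]


first = root[0]
print(first.tag)              # full tag, namespace included
print(local_name(first.tag))  # tag without the namespace

# A prefix map lets find/findall use 'ipc:ipcEntry' instead of the long form.
prefixes = {"ipc": NS}
symbols = [e.get("symbol") for e in root.findall("ipc:ipcEntry", prefixes)]
print(symbols)
```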


## Attributes
### 'kind' with its values:
* s = section
* t = sub-section title
* c = class
* i = sub-class index
* u = sub-class
* g = guidance heading
* m = main group
* 1 to B = 11 levels of subgroup (hexadecimal notation)
* n = note

### 'symbol' with its values:

The IPC symbol! That's the thing...

### 'entryType' with its values:
* K = classification symbol (default, i.e. for classification purpose only)
* I = Indexing symbol  (i.e. for indexing purpose only)
* D = Double purpose classification symbol (i.e. for both classification and indexing purpose) – existed only prior to the IPC reform
* Z = problematic entry (i.e. structure and/or contents have been partially converted from CPC or FI)

Only entries with entryType 'K' are of interest to us.

## Dictionaries mapping kind to level and to the title of the level
```
kind_to_level = {
  's':1,
  'c':2,
  'u':3,
  'g':4,
  'm':4,
  '1':5,
  '2':6,
  '3':7,
  '4':8,
  '5':9,
  '6':10,
  '7':11,
  '8':12,
  '9':13,
  'A':14,
  'B':15}

kind_to_levelTitle = {
  's':'section',
  't':'sub-section title',
  'c':'class',
  'i':'sub-class index',
  'u':'sub-class',
  'g':'guidance heading',
  'm':'main group',
  '1':'.subgroup',
  '2':'..subgroup',
  '3':'...subgroup',
  '4':'....subgroup',
  '5':'.....subgroup',
  '6':'......subgroup',
  '7':'.......subgroup',
  '8':'........subgroup',
  '9':'.........subgroup',
  'A':'..........subgroup',
  'B':'...........subgroup',
  'n':'note'}
```

## Here is the list of all the interesting kinds of entries in our IPC XML

```
whatlevel = ["s","c","u","m","1","2","3","4","5","6","7","8","9","A","B"]
```
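
As a small illustration of what such a mapping is good for, the sketch below turns (kind, symbol) pairs into an indented outline. The mapping is shortened, the helper `outline` is hypothetical, and the entries are typed in by hand rather than read from the XML.

```python
# Same idea as kind_to_level above, restricted to a few kinds to keep the sketch short.
kind_to_level = {"s": 1, "c": 2, "u": 3, "m": 4, "1": 5, "2": 6}

# (kind, symbol) pairs as they might come out of the XML; typed in by hand here.
entries = [
    ("s", "A"),
    ("c", "A01"),
    ("n", "A01"),   # a note: has no level in the mapping
    ("u", "A01B"),
    ("m", "A01B0001000000"),
    ("1", "A01B0001020000"),
]


def outline(entries, indent="  "):
    """Return one indented line per entry whose kind has a level."""
    lines = []
    for kind, symbol in entries:
        level = kind_to_level.get(kind)
        if level is None:  # kinds such as 't' or 'n' are skipped
            continue
        lines.append(indent * (level - 1) + symbol)
    return lines


outline_lines = outline(entries)
print("\n".join(outline_lines))
```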



# Next Sample Code

Now we use lxml again, iterate two levels down, and print a list of sections, classes and sub-classes.

In [4]:
from lxml import etree as ET

filename = "./EN_ipc_scheme_20210101.xml"
parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'

for sections in root:
    print('1st level sections: ', sections.attrib['symbol'], " kind:", sections.attrib['kind'])

    #go one level deeper to classes
    for classes in sections.iterchildren(tag=ipcEntry):
      print('2nd level classes: ', classes.attrib['symbol'], " kind:", classes.attrib['kind'])

      #go one level deeper to sub classes
      for subclasses in classes.iterchildren(tag=ipcEntry):
        print('3rd level sub classes: ', subclasses.attrib['symbol'], " kind:", subclasses.attrib['kind'])

1st level sections:  A  kind: s
2nd level classes:  A01  kind: t
2nd level classes:  A01  kind: c
3rd level sub classes:  A01B  kind: u
3rd level sub classes:  A01C  kind: u
3rd level sub classes:  A01D  kind: u
3rd level sub classes:  A01F  kind: u
3rd level sub classes:  A01G  kind: u
3rd level sub classes:  A01H  kind: u
3rd level sub classes:  A01J  kind: u
3rd level sub classes:  A01K  kind: u
3rd level sub classes:  A01L  kind: u
3rd level sub classes:  A01M  kind: u
3rd level sub classes:  A01N  kind: u
3rd level sub classes:  A01P  kind: u
2nd level classes:  A21  kind: t
2nd level classes:  A21  kind: c
3rd level sub classes:  A21B  kind: u
3rd level sub classes:  A21C  kind: u
3rd level sub classes:  A21D  kind: u
2nd level classes:  A22  kind: c
3rd level sub classes:  A22B  kind: u
3rd level sub classes:  A22C  kind: u
2nd level classes:  A23  kind: c
3rd level sub classes:  A23  kind: n
3rd level sub classes:  A23B  kind: u
3rd level sub classes:  A23C  kind: u
3rd level s

Now we learn some more Python and find other ways to iterate over all descendants, using an if to check for a specific 'kind' of entry.

In [None]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'
count = 0
start = time.time()

whatlevel = "1"

for element in root.iter(ipcEntry):
  if element.attrib['kind'] == whatlevel:
    count = count + 1
    #print(count, element.attrib['symbol'])

print("for kind", whatlevel, "found", count, 'entries in:', time.time() - start, 'sec')



for kind 1 found 23390 entries in: 0.06341743469238281 sec


Now with a tuple of all the "kind" values I want to check...

In [None]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'
count = 0

whatlevel = ("s","c","u","m","1","2","3","4","5","6","7","8","9","A","B")

for level in whatlevel:

  tic = time.perf_counter() * 1000

  for element in root.iter(ipcEntry):
    if element.attrib['kind'] == level:
      count = count + 1
      #print(count, element.attrib['symbol'])

  toc = time.perf_counter() * 1000

  print("for kind", level, "found ", count, f"entries in: {(toc - tic):0.0f} ms")

  count = 0


for kind s found  8 entries in: 146 ms
for kind c found  131 entries in: 58 ms
for kind u found  646 entries in: 61 ms
for kind m found  7523 entries in: 66 ms
for kind 1 found  23390 entries in: 66 ms
for kind 2 found  23048 entries in: 61 ms
for kind 3 found  13661 entries in: 60 ms
for kind 4 found  5934 entries in: 62 ms
for kind 5 found  1987 entries in: 61 ms
for kind 6 found  638 entries in: 62 ms
for kind 7 found  155 entries in: 60 ms
for kind 8 found  68 entries in: 59 ms
for kind 9 found  10 entries in: 59 ms
for kind A found  4 entries in: 68 ms
for kind B found  4 entries in: 57 ms


Now with a dictionary

In [None]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'

whatlevel = {
  's':'section',
  't':'sub-section title',
  'c':'class',
  'I':'sub-class index',
  'u':'sub-class',
  'g':'guidance heading',
  'm':'main group',
  '1':'.subgroup',
  '2':'..subgroup',
  '3':'...subgroup',
  '4':'....subgroup',
  '5':'.....subgroup',
  '6':'......subgroup',
  '7':'.......subgroup',
  '8':'........subgroup',
  '9':'.........subgroup',
  'A':'..........subgroup',
  'B':'...........subgroup',
  'n':'note'}

for level in whatlevel.keys():

  count = 0
  start = time.time()

  for element in root.iter(ipcEntry):
    if element.attrib['kind'] == level:
      count = count + 1
      #print(count, element.attrib['symbol'])
  print("for kind ", whatlevel[level], "found ", count, f'entries (in: {(time.time() - start) * 1000:0.0f} ms)')

  count = 0
  start = time.time()


for kind  section found  8 entries (in: 148 ms)
for kind  sub-section title found  20 entries (in: 62 ms)
for kind  class found  131 entries (in: 66 ms)
for kind  sub-class index found  0 entries (in: 63 ms)
for kind  sub-class found  646 entries (in: 66 ms)
for kind  guidance heading found  547 entries (in: 63 ms)
for kind  main group found  7523 entries (in: 64 ms)
for kind  .subgroup found  23390 entries (in: 63 ms)
for kind  ..subgroup found  23048 entries (in: 68 ms)
for kind  ...subgroup found  13661 entries (in: 62 ms)
for kind  ....subgroup found  5934 entries (in: 60 ms)
for kind  .....subgroup found  1987 entries (in: 65 ms)
for kind  ......subgroup found  638 entries (in: 66 ms)
for kind  .......subgroup found  155 entries (in: 61 ms)
for kind  ........subgroup found  68 entries (in: 60 ms)
for kind  .........subgroup found  10 entries (in: 60 ms)
for kind  ..........subgroup found  4 entries (in: 73 ms)
for kind  ...........subgroup found  4 entries (in: 66 ms)
for kind  no

In [5]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'

whatLevel = ["s","c","u","m","1","2","3","4","5","6","7","8","9","A","B"]

tic = time.perf_counter() * 1000
def iterDurchsXML(level):
  count = 0
  for element in root.iter(ipcEntry):
    if element.attrib['kind'] == level:
      count = count + 1
      #print(count, element.attrib['symbol'])

  print("for kind", level, "found ", count)

for level in whatLevel:
  iterDurchsXML(level)

toc = time.perf_counter() * 1000

print("time needed:", f"entries in: {(toc - tic):0.0f} ms")

for kind s found  8
for kind c found  131
for kind u found  646
for kind m found  7523
for kind 1 found  23390
for kind 2 found  23048
for kind 3 found  13661
for kind 4 found  5934
for kind 5 found  1987
for kind 6 found  638
for kind 7 found  155
for kind 8 found  68
for kind 9 found  10
for kind A found  4
for kind B found  4
time needed: entries in: 578 ms
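
The cells above walk the whole tree once per kind. A single pass can tally all kinds at once; the sketch below shows the idea with `collections.Counter` on a small hand-written document (the root name `IPCScheme` is a placeholder, and the standard library's ElementTree stands in for lxml).

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "http://www.wipo.int/classifications/ipc/masterfiles"
ipcEntry = f"{{{NS}}}ipcEntry"

# Hand-written stand-in for the scheme file.
sample = f"""
<IPCScheme xmlns="{NS}">
  <ipcEntry kind="s" symbol="A" entryType="K">
    <ipcEntry kind="c" symbol="A01" entryType="K">
      <ipcEntry kind="u" symbol="A01B" entryType="K"/>
      <ipcEntry kind="u" symbol="A01C" entryType="K"/>
    </ipcEntry>
  </ipcEntry>
</IPCScheme>
"""
root = ET.fromstring(sample)

# One pass over the tree; Counter keeps a separate tally per 'kind' value.
kind_counts = Counter(e.get("kind") for e in root.iter(ipcEntry))

for kind in ["s", "c", "u", "m"]:
    # a kind that never occurs simply counts as 0
    print("for kind", kind, "found", kind_counts[kind])
```

On the full file this should visit each element once instead of once per kind, at the cost of keeping one small dictionary in memory.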


# Sample with a recursive function

Instead of using iter to go through the whole tree, we recursively walk through the children of each entry. Much more fun!





In [6]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

whatLevel = ["s","c","u","m","1","2","3","4","5","6","7","8","9","A","B"]

### here is my recursive function

tic = time.perf_counter() * 1000
# counters: [children visited, children with attributes, children matching kind]
# (named 'counters' so the built-in list() is not shadowed for later cells)
counters = [0,0,0]

def recWalker(node, counters, kind):
  for child in node:
    counters[0] += 1
    if not child.attrib == {}:
      counters[1] += 1
      attrib = child.attrib['kind']
      if attrib == kind:
        counters[2] += 1
        print("kind", attrib, counters)
      # here the function calls itself; that is what makes it recursive
      recWalker(child, counters, kind)

# this is the first call of the function
# arguments are the XML root, the zeroed counters and the "kind" of entries I want to find
recWalker(root, counters, "A")

toc = time.perf_counter() * 1000

print(f'time needed: {(toc - tic):0.0f} ms')

kind A [144005, 72003, 1]
kind A [144015, 72008, 2]
kind A [144023, 72012, 3]
kind A [144045, 72023, 4]
time needed: 172 ms
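
A variation on the recursive idea, as a sketch only: instead of updating a shared list of counters, the function below returns what it finds, together with the depth at which each match sits. The helper name `collect` and the sample document are made up; the sample leaves out the namespace for brevity, and the walker recurses into every child, not only those with attributes.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in, without namespace to keep it short.
sample = """
<scheme>
  <ipcEntry kind="s" symbol="A">
    <textBody/>
    <ipcEntry kind="c" symbol="A01">
      <ipcEntry kind="u" symbol="A01B"/>
      <ipcEntry kind="u" symbol="A01C"/>
    </ipcEntry>
  </ipcEntry>
</scheme>
"""


def collect(node, kind, depth=0):
    """Return (depth, symbol) pairs for every descendant of node with the given kind."""
    found = []
    for child in node:
        if child.get("kind") == kind:
            found.append((depth, child.get("symbol")))
        # recurse into every child, matching or not
        found.extend(collect(child, kind, depth + 1))
    return found


root = ET.fromstring(sample)
hits = collect(root, "u")
print(hits)
```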


# Next Idea: minidom

Use a different module, here xml.dom.minidom

* as described [here](https://docs.python.org/3/library/xml.dom.minidom.html#module-xml.dom.minidom) at python.org
* and the DOM [here](https://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html) at w3c.org

The interesting part is navigating between nodes with parentNode, childNodes, firstChild, lastChild, previousSibling and nextSibling, and reading their content and attributes.

Loading is very slow! But once the file is loaded, access to the entries is fast.

In [None]:
import xml.dom.minidom
#the module used for parsing the xml file is imported

xml_file = './EN_ipc_scheme_20210101.xml'
tag_name = 'ipcEntry'
attr_name = 'symbol'

# This function takes three arguments (the XML file to parse, the tag name and the attribute name) and prints that attribute for every matching tag
def generic_dom(xml_file,tag_name,attr_name):
  doc = xml.dom.minidom.parse(xml_file)
  tags = doc.getElementsByTagName(tag_name)
  for any_attr in tags:
    attr = any_attr.getAttribute(attr_name)
    print(attr)

generic_dom(xml_file,tag_name,attr_name)
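
The navigation attributes mentioned above can be tried on a tiny document. This is a sketch on a hand-written string with a placeholder root name; the string contains no whitespace between tags, because minidom would otherwise report the whitespace as text nodes between the elements.

```python
import xml.dom.minidom

# No whitespace between the tags, so every child node is an element.
sample = ('<scheme>'
          '<ipcEntry symbol="A01B"/>'
          '<ipcEntry symbol="A01C"/>'
          '<ipcEntry symbol="A01D"/>'
          '</scheme>')
doc = xml.dom.minidom.parseString(sample)

root = doc.documentElement
first = root.firstChild
last = root.lastChild
middle = first.nextSibling

print(len(root.childNodes), "children")
print("first: ", first.getAttribute("symbol"))
print("middle:", middle.getAttribute("symbol"))
print("last:  ", last.getAttribute("symbol"))

# Walking back and up again
print("before middle:", middle.previousSibling.getAttribute("symbol"))
print("parent of middle:", middle.parentNode.tagName)
```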

# Next Sample with a recursive function and an XPath call

In [None]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

whatLevel = ["s","c","u","m","1","2","3","4","5","6","7","8","9","A","B"]

### here is my recursive function

tic = time.perf_counter() * 1000
# counters: [children visited, children with attributes, children matching kind]
# (named 'counters' so the built-in list() is not shadowed for later cells)
counters = [0,0,0]

def recWalker(node, counters, kind):
  for child in node:
    counters[0] += 1
    if not child.attrib == {}:
      counters[1] += 1
      attrib = child.attrib['kind']
      if attrib == kind:
        counters[2] += 1
        # all text nodes of this entry and its descendants; the first one is the title text
        alltext = child.xpath('descendant-or-self::text()')
        print("kind", attrib, counters)
        print(alltext[0])
      # here the function calls itself; that is what makes it recursive
      recWalker(child, counters, kind)

# this is the first call of the function
# arguments are the XML root, the zeroed counters and the "kind" of entries I want to find
recWalker(root, counters, "9")

toc = time.perf_counter() * 1000

print(f'time needed: {(toc - tic):0.0f} ms')

kind 9 [135193, 67597, 1]
the data current flowing through the driving transistor during a setting phase, e.g. by using a switch for connecting the driving transistor to the data driver
kind 9 [143757, 71879, 2]
Dynamic random access memory structures (DRAM)
kind 9 [143759, 71880, 3]
Static random access memory structures (SRAM)
kind 9 [143761, 71881, 4]
Read-only memory structures (ROM)
kind 9 [143997, 71999, 5]
with cell select transistors, e.g. NAND
kind 9 [144001, 72001, 6]
of memory regions comprising cell select transistors, e.g. NAND
kind 9 [144003, 72002, 7]
Simultaneous manufacturing of periphery and memory cells
kind 9 [144021, 72011, 8]
with source and drain on different levels, e.g. with sloping channels
kind 9 [144035, 72018, 9]
with cell select transistors, e.g. NAND
kind 9 [144043, 72022, 10]
with source and drain on different levels, e.g. with sloping channels
time needed: 292 ms
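
The selection by kind can also be left to a path expression instead of a hand-written walker. The sketch below uses the limited XPath subset of the standard library's ElementTree (`findall` with an attribute predicate) and `itertext()` to gather the text; the symbols, titles and root name in the sample are invented. With lxml, `xpath()` accepts a comparable expression together with a `namespaces=` argument.

```python
import xml.etree.ElementTree as ET

NS = "http://www.wipo.int/classifications/ipc/masterfiles"
prefixes = {"ipc": NS}

# Hand-written stand-in; symbols and titles are invented for illustration.
sample = f"""
<IPCScheme xmlns="{NS}">
  <ipcEntry kind="m" symbol="X01A0001000000">
    <textBody><title><titlePart><text>Main group title</text></titlePart></title></textBody>
    <ipcEntry kind="1" symbol="X01A0001020000">
      <textBody><title><titlePart><text>First subgroup title</text></titlePart></title></textBody>
    </ipcEntry>
  </ipcEntry>
</IPCScheme>
"""
root = ET.fromstring(sample)

titles = {}
# './/' searches all descendants; the predicate keeps only entries of kind '1'
for entry in root.findall(".//ipc:ipcEntry[@kind='1']", prefixes):
    body = entry.find("ipc:textBody", prefixes)  # direct child only
    titles[entry.get("symbol")] = "".join(body.itertext()).strip()

print(titles)
```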


# Dictionary Approach
So, let's see if I can create a dictionary with almost all the content of the IPC XML.

Let's have a look at one entry.

Now let's start thinking about how to extract the values in a nice way.

* each ipcEntry has attributes
  * each ipcEntry has a kind (one of the values listed above)
  * each ipcEntry has its symbol
  * each ipcEntry has an entryType

* each ipcEntry has one parent and zero or more children
  * each ipcEntry has one parent ipcEntry, except for the 8 top-level ipcEntries (remember, it is a hierarchical tree).
  * each ipcEntry has zero or more child ipcEntries


* each ipcEntry has a textBody
  * each textBody has a title
    * each title has one or many titlePart
      * each titlePart has one text
      * each titlePart has zero or more entryReference, each with one or more sref to another symbol

Here is an example:
```
<ipcEntry kind="u" symbol="A01B" entryType="K">
  <textBody>
    <title>
      <titlePart>
         <text>SOIL WORKING IN AGRICULTURE OR FORESTRY</text>
      </titlePart>
      <titlePart>
        <text>PARTS, DETAILS, OR ACCESSORIES OF AGRICULTURAL MACHINES OR IMPLEMENTS, IN GENERAL</text>
           <entryReference>making or covering furrows or holes for sowing, planting or manuring <sref ref="A01C0005000000"/>
           </entryReference>
           <entryReference>machines for harvesting root crops <sref ref="A01D"/>
           </entryReference>
           <entryReference>mowers convertible to soil working apparatus or capable of soil working <sref ref="A01D0042040000"/>
           </entryReference>
           <entryReference>mowers combined with soil working implements <sref ref="A01D0043120000"/>
           </entryReference>
           <entryReference>soil working for engineering purposes <sref ref="E01"/>,<sref ref="E02"/>,<sref ref="E21"/>
           </entryReference>
        </titlePart>
      </title>
    </textBody>
```
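
The structure listed above maps quite directly onto a nested dictionary. The following is a sketch under assumptions: the helper `entry_to_dict` and the key names are made up, the sample is a shortened, closed copy of the example entry, and the namespace is left out to keep the paths readable (standard library ElementTree).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example entry, closed properly and without namespace.
sample = """
<ipcEntry kind="u" symbol="A01B" entryType="K">
  <textBody>
    <title>
      <titlePart>
        <text>SOIL WORKING IN AGRICULTURE OR FORESTRY</text>
      </titlePart>
      <titlePart>
        <text>PARTS, DETAILS, OR ACCESSORIES OF AGRICULTURAL MACHINES OR IMPLEMENTS, IN GENERAL</text>
        <entryReference>machines for harvesting root crops <sref ref="A01D"/></entryReference>
        <entryReference>soil working for engineering purposes <sref ref="E01"/>,<sref ref="E02"/></entryReference>
      </titlePart>
    </title>
  </textBody>
</ipcEntry>
"""


def entry_to_dict(entry):
    """Collect attributes, title texts and referenced symbols of one ipcEntry."""
    parts = []
    # the path only follows this entry's own textBody, not those of child entries
    for part in entry.findall("textBody/title/titlePart"):
        parts.append({
            "text": part.findtext("text", default="").strip(),
            "references": [sref.get("ref") for sref in part.iter("sref")],
        })
    return {
        "symbol": entry.get("symbol"),
        "kind": entry.get("kind"),
        "entryType": entry.get("entryType"),
        "titleParts": parts,
    }


record = entry_to_dict(ET.fromstring(sample))
print(record["symbol"], record["kind"], record["entryType"])
for part in record["titleParts"]:
    print(part["text"], part["references"])
```

A dictionary for the whole file could then be keyed by symbol, with one such record per entry.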


In [None]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'
ipcXML = {}

tic = time.perf_counter() * 1000
count = 0

for element in root.iter(ipcEntry):
  if not element.attrib == {}:
    count = count + 1

    elementAttribKind   = element.attrib['kind']
    elementAttribSymbol = element.attrib['symbol']

    elementAttribType   = element.attrib['entryType']

    elementText = element.xpath('descendant-or-self::text()')
    elementText = elementText[0]
    if count < 100:
      print(count, elementAttribKind, elementAttribType, elementAttribSymbol, elementText)

toc = time.perf_counter() * 1000

print("time needed:", f"entries in: {(toc - tic):0.0f} ms")

In [13]:
from lxml import etree as ET
import time

# File path to the IPC XML
filename = "./EN_ipc_scheme_20210101.xml"

# Define the namespace and parser
ipc_namespace = '{http://www.wipo.int/classifications/ipc/masterfiles}'
ipcEntry = f"{ipc_namespace}ipcEntry"
text_body = f"{ipc_namespace}textBody"
title_part = f"{ipc_namespace}titlePart"
text = f"{ipc_namespace}text"
parser = ET.XMLParser(remove_blank_text=True)

# Parse the XML file
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

# Initialize dictionary for sub-class mapping
sub_class_mapping = {}

# Start measuring time
start = time.time()

# Iterate through the XML to extract sub-class information
for element in root.iter(ipcEntry):
    if element.attrib.get("kind") == "u":  # Focus on sub-classes
        symbol = element.attrib.get("symbol")  # Extract sub-class symbol

        # Locate the title text within the nested structure
        text_element = element.find(f".//{text_body}//{title_part}//{text}")
        title = text_element.text.strip() if text_element is not None else "No Title"

        sub_class_mapping[symbol] = title

# Print execution time
print(f"Extracted {len(sub_class_mapping)} sub-classes in {(time.time() - start) * 1000:.0f} ms.")

# Print a sample of the extracted data
for symbol, title in list(sub_class_mapping.items())[:5]:
    print(f"{symbol}: {title}")



Extracted 646 sub-classes in 151 ms.
A01B: SOIL WORKING IN AGRICULTURE OR FORESTRY
A01C: PLANTING
A01D: HARVESTING
A01F: THRESHING
A01G: HORTICULTURE
