<a href="https://colab.research.google.com/github/herrkrueger/funwithipcxml/blob/main/ipcbrowser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Fun with IPC XML, Python xml, lxml and ElementTree

###Python Sources
* John Shipman's tutorial on [Python XML processing with lxml](https://www.academia.edu/38587906/Python_XML_processing_with_lxml)
* [The ElementTree API](https://docs.python.org/3/library/xml.etree.elementtree.html) on python.org
* Tutorials on [Real Python](https://realpython.com/)
* [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)

###WIPO Links
* Current edition of the IPC Master Files from [WIPO's Download and IT support area](https://www.wipo.int/classifications/ipc/en/ITsupport/); here is the [direct link to the zip file](https://www.wipo.int/ipc/itos4ipc/ITSupport_and_download_area//20210101/MasterFiles/ipc_scheme_images_20210101.zip)
* Documentation and XSDs are [here](https://www.wipo.int/classifications/ipc/en/ITsupport/Version20210101/documentation/IPCfiles.html), esp. the Specification of the Scheme file [here](https://www.wipo.int/ipc/itos4ipc/ITSupport_and_download_area/Documentation/20210101/IPC_scheme_specs_v3_1.docx)
* [Link](https://www.wipo.int/classifications/ipc/ipcpub/?notion=scheme&version=20210101&symbol=none&menulang=en&lang=en&viewmode=f&fipcpc=no&showdeleted=yes&indexes=no&headings=yes&notes=yes&direction=o2n&initial=A&cwid=none&tree=no&searchmode=smart) to the IPC Browser of WIPO


#First Sample Code

First, we import lxml, load the file (download it manually and put it next to the sample data), and print the tag and the attribute dictionary of each top-level element. These elements are the sections of the IPC tree. Their attributes are 'kind', 'symbol' and 'entryType'.

In [None]:
from lxml import etree as ET

filename = "./EN_ipc_scheme_20210101.xml"
parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

for child in root:
    print(child.tag, child.attrib)


{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'A', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'B', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'C', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'D', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'E', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'F', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'G', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'H', 'entryType': 'K'}


A different, shorter way of doing the same thing, this time with the standard library's ElementTree:

In [None]:
import xml.etree.ElementTree as ET
root = ET.parse("./EN_ipc_scheme_20210101.xml").getroot()

for sections in root:
   print(sections.tag, sections.attrib)

{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'A', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'B', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'C', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'D', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'E', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'F', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'G', 'entryType': 'K'}
{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry {'kind': 's', 'symbol': 'H', 'entryType': 'K'}


##What do we see? 

The **tag** (including the XML namespace, xmlns, that the entry belongs to; there is only one namespace in this XML, by the way) and the **attributes**. The attributes are:

* 'kind' with its Values:
 * s = section
 * t = sub-section title
 * c = class
 * i = sub-class index
 * u = sub-class
 * g = guidance heading
 * m = main group
 * 1 to B = 11 levels of group (hexadecimal notation)
 * n = note
* 'symbol' with its Values:
 * The IPC symbol! That's the thing...
* 'entryType' with its Values:
 * K = classification symbol (default, i.e. for classification purpose only)
 * I = Indexing symbol  (i.e. for indexing purpose only)
 * D = Double purpose classification symbol (i.e. for both classification and indexing purpose) – existed only prior to the IPC reform
 * Z = problematic entry (i.e. structure and/or contents have been partially converted from CPC or FI)

Only entries with entryType 'K' are of interest to us.
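
To make the tag and attribute notes above concrete, here is a minimal sketch that strips the namespace from a tag and keeps only the entries with entryType 'K'. It uses the standard library's ElementTree so it runs without lxml. The tiny XML sample, its root element name `scheme`, the symbol `A99` and the helper `local_name` are made up for illustration; only the namespace URI and the attribute names come from the file shown above.

```python
import xml.etree.ElementTree as ET

NS = "http://www.wipo.int/classifications/ipc/masterfiles"

# Hand-written stand-in for the real scheme file (illustrative only).
SAMPLE = f"""<scheme xmlns="{NS}">
  <ipcEntry kind="s" symbol="A" entryType="K">
    <ipcEntry kind="c" symbol="A01" entryType="K"/>
    <ipcEntry kind="c" symbol="A99" entryType="I"/>
  </ipcEntry>
</scheme>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of a tag."""
    return tag.rpartition("}")[2]


root = ET.fromstring(SAMPLE)

# Walk all ipcEntry elements at any depth and keep classification symbols only.
classification_symbols = [
    el.attrib["symbol"]
    for el in root.iter(f"{{{NS}}}ipcEntry")
    if el.attrib.get("entryType") == "K"
]

print(local_name(root[0].tag))
print(classification_symbols)
```

With the real file, the same filter should work after replacing `ET.fromstring(SAMPLE)` with `ET.parse(filename).getroot()`.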

Dictionaries that map each kind to its level and to the title of that level:
```
kind_to_level = {
  's':1,
  'c':2,
  'u':3,
  'g':4,
  'm':4,
  '1':5,
  '2':6,
  '3':7,
  '4':8,
  '5':9,
  '6':10,
  '7':11,
  '8':12,
  '9':13,
  'A':14,
  'B':15}

kind_to_levelTitle = {
  's':'section',
  't':'sub-section title',
  'c':'class',
  'i':'sub-class index',
  'u':'sub-class',
  'g':'guidance heading',
  'm':'main group',
  '1':'.subgroup',
  '2':'..subgroup',
  '3':'...subgroup',
  '4':'....subgroup',
  '5':'.....subgroup',
  '6':'......subgroup',
  '7':'.......subgroup',
  '8':'........subgroup',
  '9':'.........subgroup',
  'A':'..........subgroup',
  'B':'...........subgroup',
  'n':'note'}
```
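
Since the subgroup kinds '1' to 'B' are hexadecimal digits, the subgroup part of both dictionaries could also be computed instead of typed out. The following is only a sketch of that idea; the helper names `subgroup_depth`, `level_for` and `title_for` are invented here, and the offset of 4 assumes, as in `kind_to_level`, that a main group sits on level 4.

```python
SUBGROUP_KINDS = "123456789AB"  # the 11 subgroup levels, in hexadecimal notation


def subgroup_depth(kind):
    """Return the number of dots (1 to 11) for a subgroup kind, else None."""
    if len(kind) == 1 and kind in SUBGROUP_KINDS:
        return int(kind, 16)
    return None


def level_for(kind):
    """Level of a subgroup kind, counted like kind_to_level (main group = 4)."""
    depth = subgroup_depth(kind)
    return None if depth is None else 4 + depth


def title_for(kind):
    """Title of a subgroup kind, with one leading dot per subgroup level."""
    depth = subgroup_depth(kind)
    return None if depth is None else "." * depth + "subgroup"


for kind in ("1", "9", "A", "B", "m"):
    print(kind, level_for(kind), title_for(kind))
```

Kinds that are not subgroups, such as 'm' or 'c', return `None`, so the hand-written entries for sections, classes and so on are still needed.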

Here is the list of all the interesting kinds of entries in our IPC XML:

```
whatlevel = ["s","c","u","m","1","2","3","4","5","6","7","8","9","A","B"]
```



#Next Sample Code

Now we use lxml again, iterate two levels down, and print a list of sections, classes and sub-classes.

In [None]:
from lxml import etree as ET

filename = "./EN_ipc_scheme_20210101.xml"
parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'

for sections in root:    
    print('1st level sections: ', sections.attrib['symbol'], " kind:", sections.attrib['kind'])    
    
    #go one level deeper to classes
    for classes in sections.iterchildren(tag=ipcEntry):
      print('2nd level classes: ', classes.attrib['symbol'], " kind:", classes.attrib['kind'])
     
      #go one level deeper to sub classes
      for subclasses in classes.iterchildren(tag=ipcEntry):
        print('3rd level sub classes: ', subclasses.attrib['symbol'], " kind:", subclasses.attrib['kind'])

Now we learn some more Python and find other ways to iterate over all children, using an if to check for a specific 'kind' of entry.

In [None]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'
count = 0
start = time.time()

whatlevel = "s"

for element in root.iter(ipcEntry):
  if element.attrib['kind'] == whatlevel:
    count = count + 1
    #print(count, element.attrib['symbol'])

print("for kind", whatlevel, "found", count, 'entries in:', time.time() - start, 'sec')



for kind s found 8 entries in: 0.14451146125793457 sec


Now with a tuple of all the kinds of entries I want to check:

In [None]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'
count = 0

whatlevel = ("s","c","u","m","1","2","3","4","5","6","7","8","9","A","B")

for level in whatlevel:
  
  tic = time.perf_counter() * 1000

  for element in root.iter(ipcEntry):
    if element.attrib['kind'] == level:
      count = count + 1
      #print(count, element.attrib['symbol'])
  
  toc = time.perf_counter() * 1000

  print("for kind", level, "found ", count, f"entries in: {(toc - tic):0.0f} ms")
  
  count = 0


for kind s found  8 entries in: 142 ms
for kind c found  131 entries in: 69 ms
for kind u found  646 entries in: 70 ms
for kind m found  7523 entries in: 75 ms
for kind 1 found  23390 entries in: 76 ms
for kind 2 found  23048 entries in: 71 ms
for kind 3 found  13661 entries in: 71 ms
for kind 4 found  5934 entries in: 75 ms
for kind 5 found  1987 entries in: 68 ms
for kind 6 found  638 entries in: 67 ms
for kind 7 found  155 entries in: 69 ms
for kind 8 found  68 entries in: 67 ms
for kind 9 found  10 entries in: 76 ms
for kind A found  4 entries in: 76 ms
for kind B found  4 entries in: 72 ms
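
The loop above walks the whole tree once per kind, which is why every line costs roughly the same 70 ms. As a sketch of an alternative, a single pass with `collections.Counter` collects all counts at once. The XML sample and its root name `scheme` are invented so that the snippet runs on its own, and it uses the standard library's ElementTree instead of lxml.

```python
from collections import Counter
import xml.etree.ElementTree as ET

NS = "http://www.wipo.int/classifications/ipc/masterfiles"
IPC_ENTRY = f"{{{NS}}}ipcEntry"

# Hand-written stand-in for the real scheme file (illustrative only).
SAMPLE = f"""<scheme xmlns="{NS}">
  <ipcEntry kind="s" symbol="A" entryType="K">
    <ipcEntry kind="c" symbol="A01" entryType="K">
      <ipcEntry kind="u" symbol="A01B" entryType="K">
        <ipcEntry kind="m" symbol="A01B0001000000" entryType="K"/>
        <ipcEntry kind="m" symbol="A01B0003000000" entryType="K"/>
      </ipcEntry>
    </ipcEntry>
    <ipcEntry kind="c" symbol="A21" entryType="K"/>
  </ipcEntry>
</scheme>"""

root = ET.fromstring(SAMPLE)

# One traversal: count every ipcEntry by its 'kind' attribute.
kind_counts = Counter(el.attrib["kind"] for el in root.iter(IPC_ENTRY))

for kind in ("s", "c", "u", "m", "1"):
    # A Counter returns 0 for kinds that never occurred.
    print("for kind", kind, "found", kind_counts[kind], "entries")
```

On the real file the per-kind timing disappears, of course, because there is only one traversal left to measure.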


Now with a dictionary:

In [None]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'

whatlevel = {
  's':'section',
  't':'sub-section title',
  'c':'class',
  'I':'sub-class index',
  'u':'sub-class',
  'g':'guidance heading',
  'm':'main group',
  '1':'.subgroup',
  '2':'..subgroup',
  '3':'...subgroup',
  '4':'....subgroup',
  '5':'.....subgroup',
  '6':'......subgroup',
  '7':'.......subgroup',
  '8':'........subgroup',
  '9':'.........subgroup',
  'A':'..........subgroup',
  'B':'...........subgroup',
  'n':'note'}

for level in whatlevel.keys():
  
  count = 0
  start = time.time()

  for element in root.iter(ipcEntry):
    if element.attrib['kind'] == level:
      count = count + 1
      #print(count, element.attrib['symbol'])
  print("for kind ", whatlevel[level], "found ", count, f'entries (in: {(time.time() - start) * 1000:0.0f} ms)')
  
  count = 0
  start = time.time()
  

for kind  section found  8 entries (in: 152 ms)
for kind  sub-section title found  20 entries (in: 72 ms)
for kind  class found  131 entries (in: 70 ms)
for kind  sub-class index found  0 entries (in: 72 ms)
for kind  sub-class found  646 entries (in: 73 ms)
for kind  guidance heading found  547 entries (in: 74 ms)
for kind  main group found  7523 entries (in: 69 ms)
for kind  .subgroup found  23390 entries (in: 76 ms)
for kind  ..subgroup found  23048 entries (in: 71 ms)
for kind  ...subgroup found  13661 entries (in: 71 ms)
for kind  ....subgroup found  5934 entries (in: 78 ms)
for kind  .....subgroup found  1987 entries (in: 71 ms)
for kind  ......subgroup found  638 entries (in: 68 ms)
for kind  .......subgroup found  155 entries (in: 70 ms)
for kind  ........subgroup found  68 entries (in: 68 ms)
for kind  .........subgroup found  10 entries (in: 68 ms)
for kind  ..........subgroup found  4 entries (in: 73 ms)
for kind  ...........subgroup found  4 entries (in: 68 ms)
for kind  no

Finally, I try a recursive function (my first attempt failed).

In [None]:
from lxml import etree as ET
import time

filename = "./EN_ipc_scheme_20210101.xml"

parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser=parser)
root = tree.getroot()

ipcEntry = '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'

whatlevel = {
  's':'section',
  't':'sub-section title',
  'c':'class',
  'I':'sub-class index',
  'u':'sub-class',
  'g':'guidance heading',
  'm':'main group',
  '1':'.subgroup',
  '2':'..subgroup',
  '3':'...subgroup',
  '4':'....subgroup',
  '5':'.....subgroup',
  '6':'......subgroup',
  '7':'.......subgroup',
  '8':'........subgroup',
  '9':'.........subgroup',
  'A':'..........subgroup',
  'B':'...........subgroup',
  'n':'note'}

def iterDurchsXML(level):
  count = 0
  tic = time.perf_counter() * 1000
  for element in root.iter(ipcEntry):
    if element.attrib['kind'] == level:
      count = count + 1
      #print(count, element.attrib['symbol'])
  toc = time.perf_counter() * 1000
  print("for kind", level, "found ", count, f"entries in: {(toc - tic):0.0f} ms")    
    
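def countElements(n):
  # Assumed intent: report the first n kinds of whatlevel, one recursive call per kind.
  if n > 0:
    countElements(n - 1)
    iterDurchsXML(list(whatlevel)[n - 1])

countElements(3)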
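
Recursion probably fits the XML tree itself better than the list of kinds, because every ipcEntry can contain further ipcEntry children. Here is a minimal sketch of such a walk, again with the standard library's ElementTree and an invented sample; the generator name `walk` and the root name `scheme` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://www.wipo.int/classifications/ipc/masterfiles"
IPC_ENTRY = f"{{{NS}}}ipcEntry"

# Hand-written stand-in for the real scheme file (illustrative only).
SAMPLE = f"""<scheme xmlns="{NS}">
  <ipcEntry kind="s" symbol="A" entryType="K">
    <ipcEntry kind="c" symbol="A01" entryType="K">
      <ipcEntry kind="u" symbol="A01B" entryType="K">
        <ipcEntry kind="m" symbol="A01B0001000000" entryType="K"/>
      </ipcEntry>
    </ipcEntry>
  </ipcEntry>
</scheme>"""


def walk(entry, depth=0):
    """Yield (depth, kind, symbol) for an entry and, recursively, its children."""
    yield depth, entry.attrib["kind"], entry.attrib["symbol"]
    # findall() with a plain tag only matches direct children.
    for child in entry.findall(IPC_ENTRY):
        yield from walk(child, depth + 1)


root = ET.fromstring(SAMPLE)
rows = [row for section in root.findall(IPC_ENTRY) for row in walk(section)]

for depth, kind, symbol in rows:
    print("  " * depth + symbol, "kind:", kind)
```

The nesting in the scheme is shallow, so Python's recursion limit should not be a concern here.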





```
from lxml import etree

# iterfind() iterates over all Elements that match the path expression
# findall() returns a list of matching Elements
# find() efficiently returns only the first match
# findtext() returns the .text content of the first match

# Illustrative examples:
root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# Find a child of an Element:
print(root.find("b"))                          # None
print(root.find("a").tag)                      # a

# Find an Element anywhere in the tree:
print(root.find(".//b").tag)                   # b
print([b.tag for b in root.iterfind(".//b")])  # ['b', 'b']

# Find Elements with a certain attribute:
print(root.findall(".//a[@x]")[0].tag)         # a
print(root.findall(".//a[@y]"))                # []
```
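
The same find functions can be pointed at the IPC data. Because every tag there carries the WIPO namespace, the path needs a prefix map. The sketch below assumes a prefix `ipc` (my choice, any name works) and an invented sample with root name `scheme`; it uses the standard library's ElementTree, whose `find()` and `findall()` accept a prefix-to-URI mapping as second argument.

```python
import xml.etree.ElementTree as ET

NS = "http://www.wipo.int/classifications/ipc/masterfiles"
PREFIXES = {"ipc": NS}  # prefix name chosen for this example

# Hand-written stand-in for the real scheme file (illustrative only).
SAMPLE = f"""<scheme xmlns="{NS}">
  <ipcEntry kind="s" symbol="A" entryType="K">
    <ipcEntry kind="c" symbol="A01" entryType="K">
      <ipcEntry kind="u" symbol="A01B" entryType="K">
        <ipcEntry kind="m" symbol="A01B0001000000" entryType="K"/>
        <ipcEntry kind="m" symbol="A01B0003000000" entryType="K"/>
      </ipcEntry>
    </ipcEntry>
  </ipcEntry>
</scheme>"""

root = ET.fromstring(SAMPLE)

# All main groups anywhere in the tree, selected by an attribute predicate.
main_groups = root.findall(".//ipc:ipcEntry[@kind='m']", PREFIXES)
main_group_symbols = [el.attrib["symbol"] for el in main_groups]

# Only the first class entry.
first_class = root.find(".//ipc:ipcEntry[@kind='c']", PREFIXES)

print(main_group_symbols)
print(first_class.attrib["symbol"])
```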

