<a href="https://colab.research.google.com/github/harrylloyd-bl/hr-coleridge/blob/hr/combine_xml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Combine XMLs

## Import Packages

Import modules from the standard library to work with the file system, XML files, and regular expressions.


*   glob - finds file paths that match a wildcard pattern
*   ElementTree - parses, edits, and writes XML documents
*   re - regular expressions (patterns used to find specific parts of strings)

[Colab Markdown Cheat Sheet](https://colab.research.google.com/notebooks/markdown_guide.ipynb)


In [None]:
import glob
import xml.etree.ElementTree as ET
import re

Collect the paths of the 1865 page XML files whose names start with `00`


In [None]:
pages = glob.glob("sample_data/data/raw/1865/00*.xml")

Displays the matched paths; note that glob does not return them in page order

In [None]:
pages

['sample_data/data/raw/1865/0005_p013.xml',
 'sample_data/data/raw/1865/0003_1865_page_1.xml',
 'sample_data/data/raw/1865/0021_p029.xml',
 'sample_data/data/raw/1865/0012_p020.xml',
 'sample_data/data/raw/1865/0017_p025.xml',
 'sample_data/data/raw/1865/0013_p021.xml',
 'sample_data/data/raw/1865/0023_p031.xml',
 'sample_data/data/raw/1865/0024_p032.xml',
 'sample_data/data/raw/1865/0022_p030.xml',
 'sample_data/data/raw/1865/0001_1865_cover.xml',
 'sample_data/data/raw/1865/0018_p026.xml',
 'sample_data/data/raw/1865/0014_p022.xml',
 'sample_data/data/raw/1865/0011_p019.xml',
 'sample_data/data/raw/1865/0002_1865_letter.xml',
 'sample_data/data/raw/1865/0015_p023.xml',
 'sample_data/data/raw/1865/0016_p024.xml',
 'sample_data/data/raw/1865/0009_p017.xml',
 'sample_data/data/raw/1865/0006_p014.xml',
 'sample_data/data/raw/1865/0008_p016.xml',
 'sample_data/data/raw/1865/0026_p034.xml',
 'sample_data/data/raw/1865/0019_p027.xml',
 'sample_data/data/raw/1865/0004_p011.xml',
 'sample_dat

Extracts the leading sequence number from one path to check the approach: split on "/" to get the filename, then on "_" to get the number, and convert it to an integer

In [None]:
int(pages[7].split("/")[-1].split("_")[0])

24
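The same extraction can be wrapped in a small helper so it has a name and can be reused as a sort key. This is a minimal sketch: `page_index` is a hypothetical name, not something defined in this notebook, and it assumes every filename begins with digits followed by an underscore, as in the listing above.

```python
from pathlib import PurePosixPath

def page_index(path):
    """Return the leading sequence number of a page filename as an int."""
    # .name keeps only the final path component, e.g. "0024_p032.xml";
    # the text before the first "_" is the sequence number.
    return int(PurePosixPath(path).name.split("_")[0])

example = "sample_data/data/raw/1865/0024_p032.xml"
print(page_index(example))
```

Using `PurePosixPath` rather than splitting on "/" by hand is a design choice for readability; both give the same result for the paths shown here.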

Applies the same split to every path and sorts the list by the resulting number, so the pages are in reading order

In [None]:
ordered_pages = sorted(pages, key=lambda x: int(x.split("/")[-1].split("_")[0]))
ordered_pages

['sample_data/data/raw/1865/0001_1865_cover.xml',
 'sample_data/data/raw/1865/0002_1865_letter.xml',
 'sample_data/data/raw/1865/0003_1865_page_1.xml',
 'sample_data/data/raw/1865/0004_p011.xml',
 'sample_data/data/raw/1865/0005_p013.xml',
 'sample_data/data/raw/1865/0006_p014.xml',
 'sample_data/data/raw/1865/0007_p015.xml',
 'sample_data/data/raw/1865/0008_p016.xml',
 'sample_data/data/raw/1865/0009_p017.xml',
 'sample_data/data/raw/1865/0010_p018.xml',
 'sample_data/data/raw/1865/0011_p019.xml',
 'sample_data/data/raw/1865/0012_p020.xml',
 'sample_data/data/raw/1865/0013_p021.xml',
 'sample_data/data/raw/1865/0014_p022.xml',
 'sample_data/data/raw/1865/0015_p023.xml',
 'sample_data/data/raw/1865/0016_p024.xml',
 'sample_data/data/raw/1865/0017_p025.xml',
 'sample_data/data/raw/1865/0018_p026.xml',
 'sample_data/data/raw/1865/0019_p027.xml',
 'sample_data/data/raw/1865/0020_p028.xml',
 'sample_data/data/raw/1865/0021_p029.xml',
 'sample_data/data/raw/1865/0022_p030.xml',
 'sample_dat
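Sorting on the integer rather than on the raw string matters when filenames are not zero-padded to the same width. The sketch below uses made-up filenames (they are not part of this dataset) to show the difference between a plain string sort and a numeric key.

```python
# Hypothetical filenames without zero padding, for illustration only.
names = ["10_p.xml", "9_p.xml", "2_p.xml"]

# Plain sorted() compares character by character, so "10" sorts before "2".
lexicographic = sorted(names)

# Converting the leading number to int gives true numeric order.
numeric = sorted(names, key=lambda n: int(n.split("_")[0]))

print(lexicographic)
print(numeric)
```

With the zero-padded names in this notebook a plain sort would give the same order, but the numeric key keeps working if the padding ever changes.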

Prints the integers 0 to 4 as a quick check of how a for loop steps through a list

In [None]:
for i in [0,1,2,3,4]:
  print(i)

0
1
2
3
4


Parses every page in order and stores each tree and its root element in two lists

In [None]:
trees, roots = [], []
for p in ordered_pages:
    tree = ET.parse(p)
    root = tree.getroot()

    trees.append(tree)
    roots.append(root)
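If one file is malformed, `ET.parse` raises `ET.ParseError` and the loop stops. A possible variation, sketched here with in-memory strings instead of the real files, collects the failures and carries on. The name `parse_all` and the two sample documents are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def parse_all(documents):
    """Parse a {name: xml_text} mapping; return (parsed_roots, failures)."""
    parsed_roots, failures = [], []
    for name, text in documents.items():
        try:
            parsed_roots.append(ET.fromstring(text))
        except ET.ParseError as err:
            # Keep the name and message so the bad file can be inspected later.
            failures.append((name, str(err)))
    return parsed_roots, failures

docs = {
    "good.xml": "<PcGts><Page/></PcGts>",
    "bad.xml": "<PcGts><Page></PcGts>",  # <Page> is never closed
}
roots_ok, failed = parse_all(docs)
print(len(roots_ok), [name for name, _ in failed])
```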

Uses the first page's tree and root as the base of the combined document

In [None]:
combined_root = roots[0]
combined_tree = trees[0]

Prints children of first root

In [None]:
for child in roots[0]:
    print(child)

<Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Metadata' at 0x79ccf03b1ee0>
<Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Page' at 0x79ccf0218ef0>
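The text in braces at the start of each tag is the XML namespace of the PAGE schema. ElementTree keeps it as part of the tag name, so lookups by name need it too. The sketch below, which uses a tiny in-memory document rather than the real pages, shows two common ways to deal with it: stripping the namespace for display, and passing a prefix map to `find`. The prefix `pc`, the helper `local_name`, and the `imageFilename` value are choices made for this example.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
ns = {"pc": PAGE_NS}  # "pc" is an arbitrary prefix chosen for this sketch

sample = f'<PcGts xmlns="{PAGE_NS}"><Metadata/><Page imageFilename="demo.jpg"/></PcGts>'
root = ET.fromstring(sample)

def local_name(tag):
    """Drop the "{namespace}" part of a tag, leaving the bare element name."""
    return tag.split("}", 1)[-1]

print([local_name(child.tag) for child in root])

# find() resolves "pc:Page" against the prefix map.
page = root.find("pc:Page", ns)
print(page.get("imageFilename"))
```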


Prints one line of text from the third page, reached by indexing down through nested child elements of the root

In [None]:
for child in roots[2][1][5][7][2]:
    print(child.text)

of bringing the whole of the surveys under the Home Department, as proposed in Lieutenant
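Chained indexes such as `[1][5][7][2]` depend on the exact position of each element and break when a page has a different layout. An alternative is to search by element name. The sketch below assumes the usual PAGE nesting of `TextRegion`, `TextLine`, `TextEquiv` and `Unicode`; that structure is not shown in this notebook's output, so treat it as an assumption, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
ns = {"pc": PAGE_NS}

# Invented sample; real pages contain many more elements and attributes.
sample = f"""<PcGts xmlns="{PAGE_NS}">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""
root = ET.fromstring(sample)

# ".//" searches at any depth, so no positional indexes are needed.
lines = [u.text for u in root.findall(".//pc:TextLine/pc:TextEquiv/pc:Unicode", ns)]
print(lines)
```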


Appends the children of every later page's root to the combined root, so all pages sit under one root element

In [None]:
for root in roots[1:]:
    for child in root:
        combined_root.append(child)
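The loop above copies every child, so the combined root ends up with one `Metadata` element per page. If only the page content is wanted, one option is to append just the `Page` elements. This is a sketch with invented in-memory documents; `make_root`, `demo_roots` and `merged` are names made up for the example.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

def make_root(image_name):
    """Build a minimal stand-in for one parsed page (illustration only)."""
    return ET.fromstring(
        f'<PcGts xmlns="{PAGE_NS}"><Metadata/><Page imageFilename="{image_name}"/></PcGts>'
    )

demo_roots = [make_root(name) for name in ("a.jpg", "b.jpg", "c.jpg")]

merged = demo_roots[0]
for other in demo_roots[1:]:
    # "{namespace}Page" is the fully qualified tag; Metadata is skipped.
    for page in other.findall(f"{{{PAGE_NS}}}Page"):
        merged.append(page)

print([child.tag.split("}")[1] for child in merged])
```

Note that `append` does not copy the element: each appended `Page` is still also a child of its original root.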

Prints the tag of every child now attached to the combined root

In [None]:
for child in combined_root:
    print(child.tag)

{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Metadata
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Page
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Metadata
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Page
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Metadata
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Page
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Metadata
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Page
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Metadata
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Page
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Metadata
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Page
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Metadata
{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-

Saves the combined tree into a new file called combined_pages.xml

In [None]:
ET.indent(combined_tree, space="    ")
combined_tree.write("combined_pages.xml", encoding="UTF-8")
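Two details of writing are worth knowing. `ET.indent` is only available from Python 3.9, and by default ElementTree writes namespaced tags with generated prefixes such as `ns0:`. Registering the PAGE namespace as the default before writing keeps the tags unprefixed. The sketch below writes to an in-memory buffer instead of a file so it can be run without touching `combined_pages.xml`; the tiny document is invented.

```python
import io
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

# An empty prefix makes this the default namespace when serialising.
ET.register_namespace("", PAGE_NS)

tree = ET.ElementTree(ET.fromstring(f'<PcGts xmlns="{PAGE_NS}"><Page/></PcGts>'))
ET.indent(tree, space="    ")  # requires Python 3.9 or later

buffer = io.BytesIO()
# xml_declaration=True adds the <?xml ...?> line at the top of the output.
tree.write(buffer, encoding="UTF-8", xml_declaration=True)
output = buffer.getvalue().decode("utf-8")
print(output)
```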

### Parse credit sections

In [None]:
"structure {type:credit;}" in combined_root[1][5].attrib["custom"]

IndexError: child index out of range

In [None]:
credits = []
for child in combined_root:
    if child.tag.split("}")[1] == "Page":
        for region in child:
            # Default to an empty string so the substring test always runs on a str
            if "structure {type:credit;}" in region.attrib.get("custom", ""):
                credits.append(region)
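The same filter can be written as a name-based search, which avoids fixed indexes like `combined_root[1][5]` that raise `IndexError` when a page has fewer children. The sketch assumes the credit regions are `TextRegion` elements directly under `Page`; the sample document and its `id` values are invented.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
ns = {"pc": PAGE_NS}
MARKER = "structure {type:credit;}"

# Doubled braces in the f-string produce literal { and } in the XML.
sample = f"""<PcGts xmlns="{PAGE_NS}">
  <Page>
    <TextRegion id="r1" custom="readingOrder {{index:0;}} structure {{type:heading;}}"/>
    <TextRegion id="r2" custom="readingOrder {{index:1;}} structure {{type:credit;}}"/>
    <TextRegion id="r3"/>
  </Page>
</PcGts>"""
root = ET.fromstring(sample)

credit_ids = [
    region.get("id")
    for region in root.iterfind("pc:Page/pc:TextRegion", ns)
    if MARKER in region.get("custom", "")  # regions without "custom" are skipped
]
print(credit_ids)
```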

In [None]:
def parse_xml_attrib(s):
    # Python named groups use (?P<name>...); each match is a (name, "{...;}") pair
    return re.findall(r"(?P<attr_name>\w*)\s(?P<attr_value>\{[\w:;\s\d]*;\})", s)

In [None]:
s = credits[0][3].attrib["custom"]
s

In [None]:
re.findall(r"(?P<attr_name>\w*)\s(?P<attr_value>\{[\w:;\s\d]*;\})", s)
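`re.findall` returns a list of `(name, value)` tuples, with each value still wrapped in braces. A possible next step is to turn that into nested dictionaries. The sketch below assumes the `custom` attribute looks like `name {key:value;} name {key:value;}`; the example string, `parse_custom` and `ATTR_PATTERN` are invented for illustration, and the pattern uses `\w+` rather than `\w*` so that an attribute name is required.

```python
import re

ATTR_PATTERN = re.compile(r"(?P<attr_name>\w+)\s(?P<attr_value>\{[\w:;\s\d]*;\})")

def parse_custom(s):
    """Turn 'name {k:v;} ...' into {'name': {'k': 'v'}, ...}."""
    parsed = {}
    for match in ATTR_PATTERN.finditer(s):
        body = match.group("attr_value").strip("{}")
        # "k:v;k2:v2;" -> ["k:v", "k2:v2", ""]; the empty tail is dropped.
        pairs = (item.split(":", 1) for item in body.split(";") if item.strip())
        parsed[match.group("attr_name")] = {k.strip(): v.strip() for k, v in pairs}
    return parsed

example = "readingOrder {index:1;} structure {type:credit;}"
print(parse_custom(example))
```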