In [1]:
import re
import glob
from tqdm import tqdm
from datetime import datetime
import xml.etree.ElementTree as ET
from collections import Counter

ns = {'xml': 'http://www.w3.org/XML/1998/namespace',
      'dflt': 'http://www.tei-c.org/ns/1.0',
      'frus':'http://history.state.gov/frus/ns/1.0',
      'xi':'http://www.w3.org/2001/XInclude'
      }

In [None]:
# AIM
# Up to which level are all the documents unified?
# Which tags can hold useful info? List all unique tags at the document level.
# Decide which tags are most important, then inspect their attribute types (e.g. in persName or gloss).
# Judge each tag by whether it can carry ready-to-use info for the KG before any NLP.
# Compose info from frus.odd and narrate it here as a reference.

This notebook was produced to understand the schema of the FRUS corpus. To keep the analysis complete, I relied on the information in the schema file (link below) and consulted the volumes themselves only to examine a particular phenomenon of interest. <br/>

https://github.com/HistoryAtState/frus/blob/master/schema/frus.odd (in the cells below, "ref line xxx" refers to line xxx of this schema file.)<br/>

Full documentation of the tags can be found at tei-c.org, on which the FRUS authors relied. <br/>

I rated the importance of a tag or attribute by whether it can add knowledge to our KG, either in a ready-to-use format or as free text. <br/>
In addition, I examined up to which level the corpus is unified, and where we should be careful when extracting information from the XML. <br/>

### TEI

In [2]:
volume_list = glob.glob('volumes/*')

print(f'volume count in frus: {len(volume_list)}')

volume count in frus: 543


In [219]:
# ref line 123: children tags of TEI (root)

roots_children = ["{http://www.tei-c.org/ns/1.0}teiHeader", "{http://www.tei-c.org/ns/1.0}text"]

flag = True
for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    children = []
    for elem in root.findall('*', ns):
        children.append(elem.tag)
    
    if children != roots_children:
        flag=False
        print(volume)

if flag:
    print(f"all volumes start with these children elements: {', '.join(roots_children)}")
else:
    print('invalid volume schema exists')

100%|██████████| 543/543 [02:17<00:00,  3.94it/s]

all volumes start with these children elements: {http://www.tei-c.org/ns/1.0}teiHeader, {http://www.tei-c.org/ns/1.0}text





### + TEI/teiHeader

In [79]:
#ref line 128: children tags of teiHeader

teiHeader_dict = {}

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    for elem in root.findall("dflt:teiHeader/*",ns):

        val = teiHeader_dict.get(elem.tag, None)
        
        if not val:
            teiHeader_dict[elem.tag]=[volume]
        else:
            teiHeader_dict[elem.tag].append(volume)

print(f'children tags of teiHeader: {", ".join(teiHeader_dict.keys())}')

100%|██████████| 543/543 [01:24<00:00,  6.41it/s]

children tags of teiHeader: {http://www.tei-c.org/ns/1.0}fileDesc, {http://www.tei-c.org/ns/1.0}encodingDesc, {http://www.tei-c.org/ns/1.0}revisionDesc, {http://www.tei-c.org/ns/1.0}xenoData





The fileDesc tag holds the metadata about the corresponding volume, which I deem important. <br/>
The encodingDesc tag holds the paper-to-digital layout and style definitions, which are not important. <br/>
The revisionDesc tag holds the revision history, which is not important. <br/>
The xenoData tag appears in only 1 volume; I inspected it, and it is not important.

In [33]:
#ref line 131: extracting series, sub-series, volume-number, volume info

titleStmt_dict = {}

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    for elem in root.findall("./dflt:teiHeader/dflt:fileDesc/dflt:titleStmt/dflt:title",ns):
        
        val = elem.attrib['type']

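        try:
            # collapse internal whitespace and line breaks in the title text
            text= " ".join(elem.text.split())
        except AttributeError:
            # elem.text is None when the title element has no direct text
            text = None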
        
        if val not in titleStmt_dict:
            titleStmt_dict[val]=[text]
        else:
            titleStmt_dict[val].append(text)


print(titleStmt_dict.keys())

100%|██████████| 543/543 [01:33<00:00,  5.84it/s]


This section should be unified in format, but unfortunately it is not. <br/>
@type="series" is the name of the series, i.e. "FRUS", but many naming variations exist; see the first cell below. <br/>
Also, @type="sub-series" fails to capture presidential periods consistently; see the second cell below. <br/>
In short: volume number, name, year, and presidential period can appear sporadically under any of @type="series", "sub-series", "volume-number", and "volume" in the corpus. <br/>
However, this will not affect the knowledge quality of the KG, because each document still carries the necessary info about itself independently. The only problem is that queries like 'document count histogram per president' need a few extra steps, since "sub-series" is not ready to use.
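
As an illustration of the extra steps mentioned above, here is a minimal sketch that reduces a "sub-series" label to a year range. It treats the en dash and the hyphen alike, so that '1969–1976' and '1969-1976' collapse to the same key. The function name `subseries_years` is hypothetical, and mapping a year range to a president would still need an external lookup table that is not part of this notebook.

```python
import re

def subseries_years(label):
    """Return (start_year, end_year) found in a sub-series label, or None."""
    if label is None:
        return None
    # treat the en dash (U+2013) and the plain hyphen alike
    normalized = label.replace("\u2013", "-")
    match = re.search(r"(\d{4})(?:-(\d{4}))?", normalized)
    if match is None:
        return None
    start = int(match.group(1))
    # a single year such as '1941' is returned as a one-year range
    end = int(match.group(2)) if match.group(2) else start
    return start, end
```

Applied to `titleStmt_dict['sub-series']`, something like `Counter(subseries_years(s) for s in titleStmt_dict['sub-series'])` would merge the dash variants; labels without any year (and `None`) end up under the `None` key.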

In [72]:
Counter(titleStmt_dict['series'])

Counter({'Foreign Relations of the United States': 308,
         'Papers Relating to the Foreign Relations of the United States, 1906': 2,
         'Papers Relating to the Foreign Relations of the United States, 1922': 2,
         'Foreign Relations of the United States Diplomatic Papers': 50,
         'Papers Relating to the Foreign Relations of the United States': 21,
         'Foreign Relations of the United States: Diplomatic Papers': 33,
         'Papers relating to the foreign relations of the United States': 14,
         'Papers relating to Foreign Affairs': 3,
         'Papers Relating to Foreign Affairs': 10,
         'Papers Relating to the Foreign Relations of the United States, 1877': 2,
         'Papers Relating to Foreign Affairs, 1865': 1,
         'Papers Relating to the Foreign Relations of the United States, 1920': 3,
         'Papers Relating to the Foreign Relations of the United States, 1919': 16,
         'Papers Relating to Foreign Affairs, 1864': 1,
         'Pa

In [73]:
Counter(titleStmt_dict['sub-series'])

Counter({'1917–1972': 5,
         None: 97,
         '1941': 7,
         '1946': 11,
         '1872': 6,
         '1932': 5,
         '1943': 8,
         '1964–1968': 35,
         '1949': 10,
         '1961–1963': 26,
         '1930': 3,
         '1863': 2,
         '1958–1960': 20,
         '1952–1954': 30,
         '1969–1976': 49,
         '1969-1976': 13,
         'Foreign Relations of the United States, 1958–1960': 1,
         '1948': 11,
         '1915': 1,
         '1950–1955': 1,
         '1951': 11,
         '1981-1988': 2,
         'Accompanying the Annual Message of the President to the Second Session Thirty-eighth Congress': 1,
         '1973-1976': 1,
         '1875': 2,
         '1873': 4,
         '1867': 2,
         '1865': 2,
         '1914': 1,
         '1945': 9,
         '1955–1957': 28,
         '1870': 1,
         '1977–1980': 19,
         '1936': 5,
         '1942': 7,
         '1928': 3,
         '1947': 8,
         '1931–1941': 2,
         '1918': 5,
         '

### + TEI/text

### ++ TEI/text/front

In [95]:
# ref line 224
tag_dict = {}
tag_examples = []

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    # '/*' selects the direct children of front explicitly
    for temp_tag in root.findall('./dflt:text/dflt:front/*', ns):
        if temp_tag.tag not in tag_dict:
            tag_dict[temp_tag.tag]=1
            tag_examples.append((temp_tag,volume))
        else:
            tag_dict[temp_tag.tag] += 1

print(tag_dict)

100%|██████████| 543/543 [01:27<00:00,  6.17it/s]

{'{http://www.tei-c.org/ns/1.0}pb': 3992, '{http://www.tei-c.org/ns/1.0}titlePage': 540, '{http://www.tei-c.org/ns/1.0}div': 2230, '{http://www.tei-c.org/ns/1.0}p': 39, '{http://www.tei-c.org/ns/1.0}figure': 2, '{http://www.tei-c.org/ns/1.0}docTitle': 1, '{http://www.tei-c.org/ns/1.0}docImprint': 2, '{http://www.tei-c.org/ns/1.0}byline': 2, '{http://www.tei-c.org/ns/1.0}gap': 4}





3 volumes do not have a titlePage tag, contrary to the reference schema's definition. <br/>
The p tag is not supposed to be present; I checked the examples, and they can be ignored. <br/>

In [139]:
#ref line 279

frontdiv_attrb_dict = {}

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    for div in root.findall('./dflt:text/dflt:front/dflt:div', ns):

        val = div.attrib['{http://www.w3.org/XML/1998/namespace}id']

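        # count occurrences of each xml:id (the output shown below came from a
        # run that started counting at 0, so its values are one less than the true counts)
        frontdiv_attrb_dict[val] = frontdiv_attrb_dict.get(val, 0) + 1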


100%|██████████| 543/543 [01:27<00:00,  6.19it/s]


According to ref line 279, the xml:id attribute should be one of toc, persons, terms, preface, or sources. <br/>
In practice it is much messier; see the first cell below. <br/>

In [140]:
print(frontdiv_attrb_dict)

{'pressrelease': 51, 'AbouttheSeries': 13, 'preface': 363, 'acknowledge': 1, 'toc': 469, 'sources': 245, 'terms': 300, 'persons': 269, 'message-of-the-president': 41, 'papers': 109, 'toc-papers': 33, 'editorial': 3, 'subseriesvols': 33, 'message': 18, 'message1': 2, 'actionsstatement': 21, 'summary': 17, 'contents': 6, 'toc-topics': 4, 'toc-countries': 2, 'about': 9, 'source': 2, 'abouttheseries': 9, 'abstract': 1, 'Summary': 0, 'intro1': 0, 'errata': 2, 'introduction': 24, 'papers1': 6, 'shorttitles': 8, 'actionssatement': 0, 'documents': 1, 'foreword': 2, 'intro': 2, 'photographs': 2, 'volumes': 1, 'Contents': 1, 'note': 10, 'convention': 0, 'foreign-relations': 19, 'Preface': 1, 'Notes': 4, 'prefatory-note': 0, 'ch1': 0, 'ch2': 0, 'ch3': 0, 'ch4': 0, 'ch5': 0, 'ch6': 0, 'list-of-illustrations': 1, 'address-of-the-president': 5, 'introductory': 0, 'about-this-digital-edition': 1, 'summaryvii': 0, 'summaryviii': 0, 'summaryix': 0, 'directory': 0, 'recommendations': 0, 'notes': 8, 'Mes

More critically, both the Persons and the Terms and Abbreviations annotations are missing from most of the early volumes (see the first cell below). This means we cannot easily extract institution or person knowledge from the free text without named entity recognition. One solution is to start KG construction with the newest volumes.

In [138]:
print(f"random 20 volumes without person annotations: {set(volume_list)-set(frontdiv_attrb_dict['persons'])}")

random 20 volumes without person annotations: {'volumes/frus1866p2.xml', 'volumes/frus1907p2.xml', 'volumes/frus1940v05.xml', 'volumes/frus1939v02.xml', 'volumes/frus1949v02.xml', 'volumes/frus1943.xml', 'volumes/frus1948v08.xml', 'volumes/frus1928v01.xml', 'volumes/frus1934v04.xml', 'volumes/frus1944v05.xml', 'volumes/frus1930v01.xml', 'volumes/frus1876.xml', 'volumes/frus1942v03.xml', 'volumes/frus1931-41v01.xml', 'volumes/frus1940v01.xml', 'volumes/frus1926v01.xml', 'volumes/frus1880.xml', 'volumes/frus1863p2.xml', 'volumes/frus1952-54v12p2.xml', 'volumes/frus1922v01.xml', 'volumes/frus1920v03.xml', 'volumes/frus1920v02.xml', 'volumes/frus1950v06.xml', 'volumes/frus1919Parisv13.xml', 'volumes/frus1948v05p2.xml', 'volumes/frus1872p2v5.xml', 'volumes/frus1944v01.xml', 'volumes/frus1888p2.xml', 'volumes/frus1944Quebec.xml', 'volumes/frus1913.xml', 'volumes/frus1906p1.xml', 'volumes/frus1942v01.xml', 'volumes/frus1946v01.xml', 'volumes/frus1902app1.xml', 'volumes/frus1941v01.xml', 'volu
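
A list like the one above can be recomputed from the front-matter div ids alone. The following is a minimal sketch on toy XML strings, not on the real volumes; the helper names `front_div_ids` and `volumes_missing` and the toy file names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"dflt": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def front_div_ids(root):
    """Collect the xml:id values of the div elements directly under text/front."""
    divs = root.findall("./dflt:text/dflt:front/dflt:div", TEI_NS)
    return {div.get(XML_ID) for div in divs}

def volumes_missing(front_ids_by_volume, wanted_id):
    """Return the volumes whose front matter has no div with the given xml:id."""
    return sorted(v for v, ids in front_ids_by_volume.items() if wanted_id not in ids)

# toy stand-ins for two volumes (assumed structure, for illustration only)
with_persons = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><front>'
    '<div xml:id="toc"/><div xml:id="persons"/></front><body/></text></TEI>'
)
without_persons = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><front>'
    '<div xml:id="toc"/></front><body/></text></TEI>'
)

front_ids_by_volume = {
    "toy_new.xml": front_div_ids(with_persons),
    "toy_old.xml": front_div_ids(without_persons),
}
missing = volumes_missing(front_ids_by_volume, "persons")
```

On the real corpus the mapping could be built as `{v: front_div_ids(ET.parse(v).getroot()) for v in volume_list}`, and the same call with `"terms"` would list the volumes without term annotations.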

### ++ TEI/text/body

In [160]:
# ref line 333
bodydiv_attrb_dict = {}
bodydiv_attrb_examples = []

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    for div in root.findall('./dflt:text/dflt:body//dflt:div', ns):
        for key in list(div.attrib.keys()):
            if key not in bodydiv_attrb_dict:
                bodydiv_attrb_dict[key]=[div.attrib[key]]
                bodydiv_attrb_examples.append((div,volume))
            else:
                bodydiv_attrb_dict[key].append(div.attrib[key])

100%|██████████| 543/543 [02:03<00:00,  4.40it/s]


This is the most important part of any volume because it contains the free text itself. <br/>
For @type, the values compilation, chapter, document, and subchapter are expected; section and toc also occur in the corpus unexpectedly (first cell below). <br/>
Of these, what we need is @type=document: it is the most fundamental unit, it carries the document content, and thankfully it is unified across the corpus.

In [166]:
set(bodydiv_attrb_dict['type'])

{'chapter', 'compilation', 'document', 'section', 'subchapter', 'toc'}

In [169]:
print(set(bodydiv_attrb_dict['subtype']))

{'map-or-chart', 'subsection', 'referral', 'table-of-contents', 'section', 'historical-document', 'editorial-note', 'index'}


In [196]:
subtype_list = []

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    for div in root.findall('./dflt:text/dflt:body//dflt:div[@type="document"]', ns):
        
        # .get returns None when the div has no subtype attribute
        subtype = div.attrib.get('subtype')

        subtype_list.append(subtype)

100%|██████████| 543/543 [02:15<00:00,  4.02it/s]


In [197]:
Counter(subtype_list)

Counter({'historical-document': 301760,
         'editorial-note': 8329,
         None: 48,
         'index': 6})
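
Since @type=document is the unit we care about, it may help to collect one record per document in a single pass. The sketch below runs on a toy XML string; the function name `document_records` and the record fields are hypothetical choices, and only the xml:id and subtype attributes are read.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"dflt": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def document_records(root, volume):
    """Return one dict per div[@type='document'] found anywhere under text/body."""
    records = []
    path = './dflt:text/dflt:body//dflt:div[@type="document"]'
    for div in root.findall(path, TEI_NS):
        records.append({
            "volume": volume,
            "id": div.get(XML_ID),          # None if the div has no xml:id
            "subtype": div.get("subtype"),  # None if the div has no subtype
        })
    return records

# toy volume (assumed nesting, for illustration only)
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="compilation"><div type="chapter">
<div type="document" subtype="historical-document" xml:id="d1"/>
<div type="document" subtype="editorial-note" xml:id="d2"/>
<div type="document" xml:id="d3"/>
</div></div></body></text></TEI>"""

records = document_records(ET.fromstring(sample), "toy.xml")
subtype_counts = Counter(r["subtype"] for r in records)
```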

In [3]:
# document level (text/body/div type='document') tag variety

tag_dict = {}
tag_examples = []

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    # '//*' selects every descendant of each document div explicitly
    for temp_tag in root.findall('./dflt:text/dflt:body//dflt:div[@type="document"]//*', ns):
        if temp_tag.tag not in tag_dict:
            tag_dict[temp_tag.tag]=1
            tag_examples.append((temp_tag,volume))
        else:
            tag_dict[temp_tag.tag] += 1

100%|██████████| 543/543 [01:21<00:00,  6.66it/s]


In [4]:
print(f"document level tags and counts in FRUS\n'tag' : count -> {tag_dict}")

document level tags and counts in FRUS
'tag' : count -> {'{http://www.tei-c.org/ns/1.0}head': 402718, '{http://www.tei-c.org/ns/1.0}persName': 2075374, '{http://www.tei-c.org/ns/1.0}note': 757195, '{http://www.tei-c.org/ns/1.0}opener': 382059, '{http://www.tei-c.org/ns/1.0}dateline': 338173, '{http://www.tei-c.org/ns/1.0}placeName': 326859, '{http://www.tei-c.org/ns/1.0}date': 338149, '{http://www.tei-c.org/ns/1.0}list': 179636, '{http://www.tei-c.org/ns/1.0}item': 688636, '{http://www.tei-c.org/ns/1.0}p': 2783482, '{http://www.tei-c.org/ns/1.0}gloss': 1458648, '{http://www.tei-c.org/ns/1.0}pb': 429224, '{http://www.tei-c.org/ns/1.0}hi': 2301115, '{http://www.tei-c.org/ns/1.0}ref': 179084, '{http://www.tei-c.org/ns/1.0}quote': 52491, '{http://www.tei-c.org/ns/1.0}closer': 277079, '{http://www.tei-c.org/ns/1.0}signed': 273984, '{http://www.tei-c.org/ns/1.0}salute': 1246, '{http://www.tei-c.org/ns/1.0}seg': 178294, '{http://www.tei-c.org/ns/1.0}idno': 614, '{http://www.tei-c.org/ns/1.0}t

The listing above shows the variety of tags occurring at the document level. Some are important because they provide ready-to-use info for the KG, and some because they hold the free text. I investigated all 42 of the tags above to check whether we lose anything by ignoring a particular tag. <br/>

+important <br/>
list, p -> holds free text <br/>
head -> title of the document, chapter or compilation <br/>
persName, gloss -> see below section <br/>
placeName <br/>
signed -> signed person's name <br/>

+consider again in the future <br/>
note -> footnotes <br/>
ref -> cross ref to other docs <br/>
table, row, cell -> represent tables occurring in the free text; need special treatment <br/>
attachment <br/>

+ok to ignore <br/>
page layout and style: dateline, pb, hi, seg, idno, lb <br/>
opener -> date and place can be extracted without it <br/>
date -> can already be extracted from the div tag's attribute <br/>
label, item -> each bullet and its content inside a list tag, respectively; not needed for free text <br/>
quote, salute, affiliation, postscript -> extract the text if present, regardless of tag type <br/>
closer -> holds the signed tag <br/>
any tag from del onward in the listing above -> no significance, sporadic annotations <br/>




In [5]:
# document level (text/body/div type='document'), direct children only

tag_dict = {}
tag_examples = []

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    for temp_tag in root.findall('./dflt:text/dflt:body//dflt:div[@type="document"]/*', ns):
        if temp_tag.tag not in tag_dict:
            tag_dict[temp_tag.tag]=1
            tag_examples.append((temp_tag,volume))
        else:
            tag_dict[temp_tag.tag] += 1

print(f"document level tags and counts in FRUS\n'tag' : count -> {tag_dict}")

100%|██████████| 543/543 [01:17<00:00,  6.97it/s]

document level tags and counts in FRUS
'tag' : count -> {'{http://www.tei-c.org/ns/1.0}head': 310136, '{http://www.tei-c.org/ns/1.0}opener': 320488, '{http://www.tei-c.org/ns/1.0}list': 77616, '{http://www.tei-c.org/ns/1.0}p': 2138712, '{http://www.tei-c.org/ns/1.0}pb': 104430, '{http://www.tei-c.org/ns/1.0}quote': 9375, '{http://www.tei-c.org/ns/1.0}closer': 242592, '{http://www.tei-c.org/ns/1.0}table': 6981, '{http://history.state.gov/frus/ns/1.0}attachment': 46106, '{http://www.tei-c.org/ns/1.0}postscript': 13489, '{http://www.tei-c.org/ns/1.0}dateline': 14834, '{http://www.tei-c.org/ns/1.0}note': 190783, '{http://www.tei-c.org/ns/1.0}figure': 206, '{http://www.tei-c.org/ns/1.0}lb': 32, '{http://www.tei-c.org/ns/1.0}lg': 63, '{http://www.tei-c.org/ns/1.0}gap': 119, '{http://www.tei-c.org/ns/1.0}cit': 3, '{http://www.tei-c.org/ns/1.0}div': 4, '{http://www.tei-c.org/ns/1.0}label': 4, '{http://www.tei-c.org/ns/1.0}salute': 5, '{http://www.tei-c.org/ns/1.0}signed': 2}





In [6]:
# each document's head tag's child tag varieties (text/body/div type='document'/head)

tag_dict = {}
tag_examples = []

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    for temp_tag in root.findall('./dflt:text/dflt:body//dflt:div[@type="document"]/dflt:head/*', ns):
        if temp_tag.tag not in tag_dict:
            tag_dict[temp_tag.tag]=1
            tag_examples.append((temp_tag,volume))
        else:
            tag_dict[temp_tag.tag] += 1

print(f"'tag' : count -> {tag_dict}")

100%|██████████| 543/543 [01:16<00:00,  7.11it/s]

'tag' : count -> {'{http://www.tei-c.org/ns/1.0}persName': 210387, '{http://www.tei-c.org/ns/1.0}note': 101760, '{http://www.tei-c.org/ns/1.0}gloss': 36339, '{http://www.tei-c.org/ns/1.0}hi': 322968, '{http://www.tei-c.org/ns/1.0}seg': 239, '{http://www.tei-c.org/ns/1.0}lb': 17567, '{http://www.tei-c.org/ns/1.0}date': 1}





In [218]:
# use below to extract free text from given Element
#" ".join(ET.tostring(root, method='text').decode("utf-8").split())
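
As a variation on the one-liner above, here is a minimal sketch of a text extractor that can optionally leave out certain tags, for example footnotes. The name `element_text` and the `skip_tags` parameter are hypothetical. One difference worth knowing: `ET.tostring(elem, method='text')` also emits the tail text that follows `elem`, whereas this helper only returns what is inside it.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def element_text(elem, skip_tags=()):
    """Flatten an element to whitespace-normalised text, omitting skipped tags."""
    parts = []

    def walk(node):
        if node.tag in skip_tags:
            return  # drop this subtree; its tail is handled by the caller
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            # the tail belongs to the parent's running text, so always keep it
            if child.tail:
                parts.append(child.tail)

    walk(elem)
    return " ".join("".join(parts).split())

# toy paragraph (for illustration only)
sample = ET.fromstring(
    '<div xmlns="http://www.tei-c.org/ns/1.0"><p>Signed  the '
    '<hi>treaty</hi><note>Footnote.</note> today.</p></div>'
)
full_text = element_text(sample)
body_text = element_text(sample, skip_tags={TEI + "note"})
```

Whether footnotes should be dropped or kept is still open (see the "consider again in the future" list above), which is why the skipped tags are a parameter rather than hard-coded.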

### + important tags

In [4]:
# <persName>'s attribute variety

persName_attrb_dict = {}
persName_attrb_examples = []

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    for pers in root.findall('.//dflt:persName', ns):
        for key in list(pers.attrib.keys()):
            if key not in persName_attrb_dict:
                persName_attrb_dict[key]=1
                persName_attrb_examples.append(pers)
            else:
                persName_attrb_dict[key] += 1

print(f"<persName> attributes and counts in FRUS\n'attrb' : count -> {persName_attrb_dict}\n\nExamples for each attribute type")
for ex in persName_attrb_examples:
    print(f'attribute: {ex.attrib}, text: {ex.text}')

100%|██████████| 543/543 [01:18<00:00,  6.91it/s]

<persName> attributes and counts in FRUS
'attrb' : count -> {'corresp': 1808037, '{http://www.w3.org/XML/1998/namespace}id': 60824, 'type': 299665}

Examples for each attribute type
attribute: {'corresp': '#p_KJF_1'}, text: John F. Kennedy
attribute: {'{http://www.w3.org/XML/1998/namespace}id': 'p_AJT_1'}, text: Abernethy, John
                                        T.
attribute: {'type': 'from'}, text: Beaupré





persName is important for identifying persons across documents: @corresp can be matched against the entries in front/persons. <br/>
In addition, @type=from or @type=to marks the sender and the receiver. This kind of information is also present in the list tag's @from, @to, or @participants, as well as in the signed tag.
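
The @corresp matching described above can be sketched as follows. This is a minimal sketch on a toy XML string: it assumes that the defining entry is a persName carrying an xml:id (as in the example output above) and that @corresp holds one or more '#'-prefixed ids separated by whitespace. The function names `person_index` and `resolve_mentions` are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def flat_text(elem):
    """Whitespace-normalised text content of an element."""
    return " ".join("".join(elem.itertext()).split())

def person_index(root):
    """Map xml:id -> name for every persName that defines an id."""
    index = {}
    for pers in root.iter(TEI + "persName"):
        pid = pers.get(XML_ID)
        if pid is not None:
            index[pid] = flat_text(pers)
    return index

def resolve_mentions(root, index):
    """Return (mention text, referenced id, resolved name or None) per pointer."""
    resolved = []
    for pers in root.iter(TEI + "persName"):
        for pointer in (pers.get("corresp") or "").split():
            pid = pointer[1:] if pointer.startswith("#") else pointer
            resolved.append((flat_text(pers), pid, index.get(pid)))
    return resolved

# toy volume (assumed structure, for illustration only)
sample = ET.fromstring("""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>
<front><div xml:id="persons"><list>
<item><persName xml:id="p_KJF_1">Kennedy, John F.</persName></item>
</list></div></front>
<body><div type="document">
<p>Meeting with <persName corresp="#p_KJF_1">John F. Kennedy</persName> and
<persName corresp="#p_XX_1">an unlisted person</persName>.</p>
</div></body></text></TEI>""")

index = person_index(sample)
mentions = resolve_mentions(sample, index)
```

Mentions that resolve to `None` point at ids with no defining entry in the same volume; counting them per volume would be one way to gauge how complete the person annotations are.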

In [5]:
# <gloss>'s attribute variety

gloss_attrb_dict = {}
gloss_attrb_examples = []

for volume in tqdm(volume_list):

    tree = ET.parse(volume)
    root = tree.getroot()

    for gloss in root.findall('.//dflt:gloss', ns):
        for key in list(gloss.attrib.keys()):
            if key not in gloss_attrb_dict:
                gloss_attrb_dict[key]=1
                gloss_attrb_examples.append(gloss)
            else:
                gloss_attrb_dict[key] += 1

print(f"<gloss> attributes and counts in FRUS\n'attrb' : count -> {gloss_attrb_dict}\n\nExamples for each attribute type")
for ex in gloss_attrb_examples:
    print(f'attribute: {ex.attrib}, text: {ex.text}')

100%|██████████| 543/543 [01:16<00:00,  7.09it/s]

<gloss> attributes and counts in FRUS
'attrb' : count -> {'target': 1287394, 'type': 220148, 'corresp': 151, 'rend': 2}

Examples for each attribute type
attribute: {'target': '#t_PL_1'}, text: P.L.
attribute: {'type': 'from'}, text: Secretary of State
attribute: {'corresp': '#t_NODIS_1'}, text: Nodis
attribute: {'rend': 'superscript'}, text: 3





The gloss tag is important for extracting institution names: @target can be matched against the ids in front/terms (Terms and Abbreviations).
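
A lookup of this kind can be sketched without knowing which element carries the id: index every element that has an xml:id, then follow each @target. In the toy sample below, the id sits on a `term` element inside an `item`; that layout is an assumption made for illustration, as are the names `resolve_gloss_targets` and `flat_text`. The sketch returns the text of the parent of the id-bearing element, on the assumption that the expansion sits next to the abbreviation.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def flat_text(elem):
    """Whitespace-normalised text content of an element."""
    return " ".join("".join(elem.itertext()).split())

def resolve_gloss_targets(root):
    """Return (gloss text, target id, text of the entry's parent or None)."""
    # ElementTree has no parent pointers, so build a child -> parent map
    parent_of = {child: parent for parent in root.iter() for child in parent}
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID) is not None}
    resolved = []
    for gloss in root.iter(TEI + "gloss"):
        target = gloss.get("target")
        if not target:
            continue
        tid = target[1:] if target.startswith("#") else target
        entry = by_id.get(tid)
        if entry is None:
            resolved.append((flat_text(gloss), tid, None))
            continue
        container = parent_of.get(entry, entry)
        resolved.append((flat_text(gloss), tid, flat_text(container)))
    return resolved

# toy volume (assumed structure, for illustration only)
sample = ET.fromstring("""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>
<front><div xml:id="terms"><list>
<item><term xml:id="t_PL_1">P.L.</term>, Public Law</item>
</list></div></front>
<body><div type="document">
<p>Under <gloss target="#t_PL_1">P.L.</gloss> 480 and
<gloss target="#t_XX_1">XX</gloss>.</p>
</div></body></text></TEI>""")

glosses = resolve_gloss_targets(sample)
```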