# *MunchXMLmuncher*
Developed by research assistant Loke Sjølie for the University of Oslo

This script currently consumes 1 file in eMunch's TEIXML format and converts it to a complete CMIF/TEIXML file. The script is customized to ensure high precision, with fallbacks designed around eMunch's particular TEIXML files. Be aware that CMIF is purely intended to represent correspondence between individuals, and as such there is *significant* (intentional) data loss in converting to the format.

The script targets documents that have been tagged with **"brev"** or **"letter"**, and extracts from these:
1. Document ID, which is extrapolated to form an eMunch URL
2. Document Author, which tends to be Edvard Munch, who is given his customary VIAF ID
3. Document Authored Date, which is converted to YYYY-MM-DD (or YYYY-MM, or YYYY), or to a date range that is either open-ended (from) or closed (from-to)
4. Document Recipient(s), names and IDs

... and then places these in a hierarchy: one correspDesc per document (keyed by document ID), containing a "sent" correspAction (author, date) followed by one "received" correspAction per recipient.
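
As a rough illustration of that hierarchy, here is a minimal sketch using only the standard library (`xml.etree.ElementTree`) rather than the BeautifulSoup approach the script itself takes. The helper name `build_corresp_desc`, its parameters, the sample date and the sample recipient URL are all made up for the example; the eMunch URL pattern and the VIAF ID are the ones used further down in **Program**.

```python
import xml.etree.ElementTree as ET

MUNCH_VIAF = "https://viaf.org/viaf/61624802/"

def build_corresp_desc(document_id, author, author_ref, date_attrs, recipients):
    """Build one correspDesc element (hypothetical helper, not part of the script)."""
    desc = ET.Element("correspDesc", {
        "key": document_id,
        "ref": "https://www.emunch.no/HYBRID" + document_id + ".xhtml",
    })
    # The "sent" action carries the author and, if known, the date
    sent = ET.SubElement(desc, "correspAction", {"type": "sent"})
    ET.SubElement(sent, "persName", {"ref": author_ref}).text = author
    if date_attrs:  # Undated documents get no date element
        ET.SubElement(sent, "date", date_attrs)
    # One "received" action per recipient
    for name, ref in recipients:
        received = ET.SubElement(desc, "correspAction", {"type": "received"})
        ET.SubElement(received, "persName", {"ref": ref}).text = name
    return desc

# Sample values: the date and the recipient are invented for illustration
example = build_corresp_desc(
    "No-MM_N0025", "Edvard Munch", MUNCH_VIAF,
    {"when": "1905-03"},
    [("Example Recipient", "https://example.org/person/1")],
)
print(ET.tostring(example, encoding="unicode"))
```

Passing an empty dict as `date_attrs` leaves the date element out entirely, which mirrors how the script treats undated documents.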

The file I was provided does **not** specify locations, but I'm sure that we'll be able to work out how to add those if such information is available. Further development: use glob.glob<sup>(the yeast of thought and mind)</sup> to consume files by folder. Add location data? Redirect recipient IDs to VIAF?
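
For the folder idea, a minimal sketch of what consuming a folder could look like. `list_tei_files` is a hypothetical helper, and the `INPUT` folder name and the `*.xml` pattern are assumptions based on the `inputfilepath` setting further down.

```python
import glob
import os

def list_tei_files(folder, pattern="*.xml"):
    """Return the matching file paths in `folder`, sorted for a stable processing order."""
    return sorted(glob.glob(os.path.join(folder, pattern)))

# Each path found here could be fed to the same routine that handles a single file today
for path in list_tei_files("INPUT"):
    print("Would process", path)
```

If the folder does not exist or holds no matching files, the list is simply empty and nothing is processed.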

The script can be altered to target all documents with one or more recipients, but many of the documents meeting that criterion are drafts and/or notes. Alternatively, the script can be further restricted to target only documents with "brev"/"letter" type *and* one or more recipients, but the test file has 0 instances where this would have an effect.
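
The three targeting options can be sketched as a single predicate. This is only an illustration: `is_target` and its `mode` names are hypothetical and do not exist in the script, and "notat" is an invented example of a non-letter type.

```python
def is_target(document_type, recipient_count, mode="type"):
    """Decide whether a document should be converted (hypothetical helper).

    mode "type":       the type is "brev" or "letter" (what the script does now)
    mode "recipients": the document has one or more recipients
    mode "both":       both conditions must hold
    """
    is_letter_type = document_type in ("brev", "letter")
    has_recipient = recipient_count > 0
    if mode == "type":
        return is_letter_type
    if mode == "recipients":
        return has_recipient
    if mode == "both":
        return is_letter_type and has_recipient
    raise ValueError("Unknown mode: " + str(mode))

print(is_target("brev", 0))                        # letter type, no recipients
print(is_target("notat", 2, mode="recipients"))    # not a letter type, but has recipients
print(is_target("notat", 2, mode="both"))          # fails the type check
```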
___
Users:
I ask that you do not touch anything below the header **Program** unless you *sort of* know what you're doing. :)

## CMIF Metadata & options

In [None]:
# Users: only edit things that exist WITHIN double quotation marks ("").
version = "0.67" # Describes the "program's" state of completion and versioning. Defined first because cmifTitle uses it.
cmifTitle = "MunchXMLmuncher version "+str(version) # Title of resulting CMIF
editorName = "Loke Sjølie" # The name to issue to the CMIF file as "editor" (responsible for this file)
editorMail = "loke.sjolie@ub.uio.no" # The e-mail associated with the above.

#publishers = 1 # How many publishers? Add later if required.
publisherURL = "eMunch.no" # Website of publisher #1
publisherName = "eMunch" # Name of publisher #1

cmifURL = "eMunch.no" # URL where this file is located
typeOfBibl = "online" # The type of bibliography that is being described
publicationStatementFull = "[Full bibliographical statement of the scholarly edition or repository where this file points to]"

cmifUid = "a403c593-09df-4538-8acf-8d459339fca8" # Unique ID. Used in sourceDesc of CMIF. Don't change it without a good reason.
# cmifUid is also used as "source" for the time being in each object. Read more about this in CMIF docs.

inputfilepath = r"INPUT/tei.xml" # Where is the target TEI source file located?

#inputfolder = r"INPUT/*" # The folder containing the TEI/XML-files to be transformed.

# MODE = "FILE" # Takes FILE or FOLDER.

## Program

### Init, metadata, etc

In [None]:
from bs4 import BeautifulSoup # Import the BeautifulSoup module (https://www.crummy.com/software/BeautifulSoup/) for XML
from bs4 import Comment # BS4 class for handling comments <!-- X -->
from datetime import date # Dates
import time # Time
import re # Regex
#import glob # The yeast of thought and mind
#import os # File system for loading and saving
today = date.today() # Set today's date
today = today.strftime("%Y-%m-%d") # Format date
currVer = version+" "+today

In [None]:
previouslyRun = "Last executed code was version "+str(currVer)+". All OUTPUT files are current to that version on that date.\n"+str(cmifUid)+"."
print("Version",currVer)

with open(r"settings/lastRunVersion.txt", "w") as f: # The settings folder must already exist
    f.write(previouslyRun)
print("\tUpdated version information.")

### Read/process TEI-XML

In [None]:
with open(inputfilepath, "r", encoding="utf-8") as file: # Open a file
    tei = file.read() # Read the whole file into one string
soup = BeautifulSoup(tei) # It is now soup
# The file is already decoded to str above, so from_encoding is not passed (BS4 ignores it for str input and warns).
# No parser is named, so BS4 picks an installed HTML parser, which lowercases tag names. The code below relies on that.

In [None]:
# Create CMIF boilerplate object
CMIFstring = '<?xml-model href="https://raw.githubusercontent.com/TEI-Correspondence-SIG/CMIF/master/schema/cmi-customization.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?><TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt><title>'+str(cmifTitle)+'</title><editor>'+str(editorName)+'<email>'+str(editorMail)+'</email></editor></titleStmt><publicationStmt><publisher><ref target="'+str(publisherURL)+'">'+str(publisherName)+'</ref></publisher><idno type="url">'+str(cmifURL)+'</idno> <date when="'+str(today)+'"/><availability><licence target="https://creativecommons.org/licenses/by/4.0/">This file is licensed under the terms of the Creative-Commons-License CC-BY 4.0</licence></availability></publicationStmt><sourceDesc><bibl type="'+str(typeOfBibl)+'" xml:id="'+str(cmifUid)+'">'+str(publicationStatementFull)+'</bibl></sourceDesc></fileDesc><profileDesc></profileDesc></teiHeader></TEI>'
CMIF = BeautifulSoup(CMIFstring)

# Before handling the data: remove all comments
# Making a list of <!--comments--> to be destroyed...
commentDocs = 0 # Used only in terminating comments
comments = 0 # Used only in terminating comments
start = time.time()

for comment in soup.findAll(string=lambda text: isinstance(text, Comment)):
    if "xml:id=\"" in comment:
        commentDocs+=1
    comment.extract()
    comments+=1
if comments > 0:
    print("Destroyed",comments,"<!--comments-->, of which",commentDocs,"contained an @XML:ID.")
# ... and checking it twice.
comments = soup.findAll(string=lambda text: isinstance(text, Comment))
if comments:
    print("There are still",len(comments),"comments present.")
else:
    print("All comments destroyed.")

# Declare variables
# ---------------------------------------------------------------
errors_found = [] # List of errors found during execution
letterCount = 0 # Number of letter-type documents with at least one recipient
miscCount = 0 # Number of letter-type documents with no recipients
addresseesUnique = [] # List of unique recipients
datetype = 0 # Var for the type of date we're dealing with
noOfRecipients = 0 # Counting non-unique recipients
otherMiscDocCount = 0 # Counting documents whose type is not "brev"/"letter"
authorID = "" # Reserved for VIAF etc.
# ---------------------------------------------------------------

# Limit workspace to individual div (document) here.
profileDescElement = CMIF.find('profiledesc') # Target correspondence wrapper
# For each Div element with an XML:ID (should be each document)
for document in soup.findAll("div", {"xml:id":True}):
    # Get the document ID from the <div> element.
    # Look for the document type assignment.
    documentType = document.find("list", {"type" : "objectType"}).findChild(True, recursive=True)#.attrs['n']
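    if "brev" in documentType or "letter" in documentType: # "in" on a BS4 tag tests its direct children, so this matches when a child of the type element is exactly the string "brev" or "letter"
        # This code applies to letters as directed by the data type.
        documentID = document["xml:id"] # The loop only visits divs that have an xml:id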
        #print(documentType)
        #print("DEBUG Checking",documentID)
        # Check if the document has more than 0 recipients. If there are no recipients, there is no correspAction required.
        recipient = document.find("item", {"n":"recipient"})

        # Check if the document has an author.
        authorName = document.find("item", {"n":"author"})
        if authorName:
            #print(authorName)
            authorName = authorName.contents[0]
        else:
            authorName = "No author"
            print("WARNING:",documentID,"suffered code 201881 no author found!")
            errors_found.append("INFO 201881 in "+str(documentID))
        if authorName == "Edvard Munch":
            authorID = "https://viaf.org/viaf/61624802/"
        else:
            authorID = "Add author ID mechanism."

        # Attempt to divine the date or date range of the document. Assumes that each document only has 1 date (or 1 range).
        isDocumentUndated = document.find("item", {"n":"undated"})
        if isDocumentUndated:
            date = "s.d."
            datetype = "none"
        else:
            isDocumentFromTo = document.find("date", {"from":True}) # Does the date element have a from assignment?
            # Using "from" because PN1350 does not have a fromTo attr despite using fromTo. Uses "from", though. Works fine.
            if isDocumentFromTo: # If it does, it is a range (note: No-MM_T1296 has a FROM attr but not a TO attr)
                fromDate = isDocumentFromTo['from'] # Extract 'from' date. Both range types use it further down.
                if isDocumentFromTo.has_attr('to'): # Check the same element, so that 'from' and 'to' belong together
                    toDate = isDocumentFromTo['to'] # Extract 'to' date.
                    datetype = "range"
                else:
                    datetype = "fromRange" # Open-ended range: only a start date
                
            elif not isDocumentFromTo: # If it doesn't:
                yearSent = document.find("date", {"type":"year","when":True}) # Check for year element
                monthSent = document.find("date", {"type":"month","when":True}) # Check for month element
                daySent = document.find("date", {"type":"day","when":True}) # Check for day element
                if yearSent:
                    datetype = "exact"
                    date = yearSent.attrs["when"]
                    if monthSent: # Only look for a month if there's a year. That 1 letter with just month/day, tho...
                        M = re.sub('[-]', '', monthSent.attrs["when"]) # Strip the random '-' characters in here.
                        date+="-"+str(M) # Join month to year by YYYY-MM.
                        if daySent: # Only applies if there is a month AND a day. No point having a day if you don't have a month.
                            D = re.sub('[-]', '', daySent.attrs["when"]) # Strip the random '-' characters in here, too.
                            date+="-"+str(D) # Join day to year-month by YYYY-MM-DD.
                else:
                    datetype = "none"
                    date = "s.d."
                    print("WARNING:",documentID,"suffered code 301881 - no year found in a specific-year element. Expected in MM_N1071 and MM_N3734.")
                    errors_found.append("INFO 301881 in "+str(documentID))
            else:
                datetype = "Warning 301882"
                print("WARNING:",documentID,"suffered error 301882 - catastrophic date error")
                errors_found.append("CRITICAL ERROR 301882 in "+str(documentID))
        
        # Construct CMIF author ("sent") element
        correspDescElement = CMIF.new_tag("correspDesc", attrs={"key":str(documentID), "ref":"https://www.emunch.no/HYBRID"+str(documentID)+".xhtml", "source":cmifUid})
        profileDescElement.append(correspDescElement)

        # The elements created here are used directly, so there is no need to search CMIF for them again.
        targetElementCorrespDesc = correspDescElement
        correspActionElement = CMIF.new_tag("correspAction", attrs={'type':'sent'})
        targetElementCorrespDesc.append(correspActionElement)
        correspActionTarget = correspActionElement
        persNameElement = CMIF.new_tag("persName", attrs={"ref":authorID})
        persNameElement.string = str(authorName)

        correspActionTarget.append(persNameElement)
        
        dateSentElement = None # Stays None for undated documents and unrecognized date types
        if datetype == "exact":
            dateSentElement = CMIF.new_tag("date", attrs={"when":date})
            #print(datetype,date)
        elif datetype == "range":
            dateSentElement = CMIF.new_tag("date", attrs={"from":fromDate,"to":toDate})
            #print(datetype,fromDate,toDate)
        elif datetype == "fromRange":
            dateSentElement = CMIF.new_tag("date", attrs={"from":fromDate})
            #print(datetype,fromDate)
        elif datetype != "none":
            print("ERROR 2839 - Unrecognized datetype!")
            errors_found.append("2839")
        if dateSentElement is not None:
            # Append date element to correspAction @sent
            correspActionTarget.append(dateSentElement)


        
        if recipient: # If there are more than 0 recipients:
            letterCount += 1
            recipientList = recipient.findChildren(True) # Get ALL children of the recipient item element. Might be 2+!
            for each in recipientList: # For every recipient:
                recipientName = str(each.contents[0]) # Assign a name
                noOfRecipients += 1
                if recipientName not in addresseesUnique:
                    addresseesUnique.append(recipientName)
                recipientID = each.attrs["target"] # Assign an ID

                if "institution" in recipientID:
                    recipientType = "orgName"
                elif "person" in recipientID:
                    recipientType = "persName"
                else:
                    print("WARNING:",documentID,"suffered error 20191. Defaulted to person.")
                    recipientType = "persName"
                    errors_found.append("WARNING 20191 in "+str(documentID))

                # One correspAction @received per recipient, holding a persName or an orgName
                correspActionElement = CMIF.new_tag("correspAction", attrs={'type':'received'})
                targetElementCorrespDesc.append(correspActionElement)
                nameElement = CMIF.new_tag(recipientType, attrs={"ref":recipientID})
                nameElement.string = recipientName
                correspActionElement.append(nameElement)
        else: # If document does not have a recipient, what do we do?
            miscCount+=1
    else:
        otherMiscDocCount += 1
        #print("Skipped item",documentID,"as it is not a letter.")
#print("</profileDesc>")
end = time.time()

In [None]:
totalCount = otherMiscDocCount+letterCount+miscCount # Every document seen, letter or not
letterShare = round(letterCount/totalCount*100) if totalCount else 0 # Guard against an empty input file
avgLetters = round(letterCount/len(addresseesUnique)) if addresseesUnique else 0 # Guard against zero recipients
print("Processed",totalCount,"documents.",str(letterCount)+" ("+str(letterShare)+"%) were letters addressed to "+str(noOfRecipients)+" recipients, of which "+str(len(addresseesUnique))+" were unique (meaning each person received avg. "+str(avgLetters)+" letters), and",miscCount,"letters without recipients (if this > 0, there's a problem) in",round(end - start,1),"seconds.")
if len(errors_found) > 0:
    i = 0
    print("\n"+str(len(errors_found)),"data warnings and errors, listed as INFO, WARNING, and ERROR in order of severity:")
    for error in errors_found:
        i+=1
        if "201881" in error:
            print(i,error,"\n\tDocument has no author. Registered as \"No author\".")
        elif "301881" in error:
            print(i,error,"\n\tDocument has a specific date type, but does not specify or suggest a year (MM-DD/MM). Document has been given \"undated\" status.")
        elif "301882" in error:
            print(i,error,"\n\tCatastrophic failure in date format or harvesting. The script was not designed for this.")
        elif "30190" in error:
            print(i,error,"\n\tCatastrophic failure in recipient list processing. I don't think the script will run to this point with such an error.")
        elif "20191" in error:
            print(i,error,"\n\tThe recipient is not a person or an organization. Suggests error in reference XMLURI. Defaulted to person.")
        else:
            print(i,error,"\n\tThere is an error that is not indexed. :(")
        print("\n")
else:
    print("No warnings or errors found.")
print("Saving to disk.")
start = time.time()
with open(r"OUTPUT/output.xml", "w", encoding="utf-8") as output_file: # Forward slash, like inputfilepath, so the path also works outside Windows
    output_file.write(CMIF.prettify())
end = time.time()
print("Prettified CMIF file created in",round(end - start,1),"seconds.")
print("Process complete.")

In [None]:
raise KeyboardInterrupt # Stops a "Run All" here, before the debug cells below

#### Debug stuff - Tags/attributes
What is this thing? An investigation of the input document's tags.

In [None]:
soup.find("div", {"xml:id" : "No-MM_N0025"})

# MM_N3734 consists of notes on MM_K4982. MM_K4982 does not appear as an object in the XML file I was given. Hmm.
# Right now we are quite liberal about what counts as a letter. Many objects have recipients, but are only drafts.

In [None]:
tagsAttrs = [] # New list
for x in soup.findAll(): # For every tag in the soup
    tag = str(x.name) # Assign name of tag to var tag
    for attribute in x.attrs: # For every attribute belonging to the tag
        tag = tag+" @"+attribute # Append attribute to tag with " @" as separator - results in combination
    if tag not in tagsAttrs: # If this particular combination of tag/attribute(s) has not been seen previously
        tagsAttrs.append(tag) # Register it in our list
# Dict with known tag/attribute pairings and understood meanings
dict = {
    "tei":"The TEI element - is where our file actually begins.",
    "teiheader":"The TEI header contains metadata (titleStmt, publicationStmt, sourceDesc...).",
    "p":"P is a paragraph. This is used in the TEIheader to contain the actual strings for publication & source desc. And a single, random </p> element later.",
    "body":"Body is used as a sub-element of <text> to contain all the metadata for all letters. I am personally offended by this practice. BS4 adds one, too.",
    "text":"Text appears to be a wrapper for the body tag, which contains all the texts' metadata.",
    "div":"Div, with @xml:id, is used to contain the metadata of a single letter.",
    "date":"Date is a date element. It seems to have the @when attribute very often, as well as enclosed text. Often has @type(year/fromTo, etc.)",
    "table":"Table is the primary data structure in which information about each letter is stored. This is a *table*.",
    "row":"Row is a sub-element of the table element. It defines a new X-axis in a table.",
    "seg":"Seg appears to be some kind of ID attached to each letter. The ID is used as an @xml:id attribute in div, and the element appears in references to other letters.",
    "cell":"Cell is a sub-element of the table element also. A single cell appears to be an entry into a row element.",
    "ref":"Ref appears to contain references to other XML items.",
    "item":"Item is a generic element that has multiple @attributes, such as owner, owner signature, author, paper type... This is evidently a very important tag.",
    "list":"List is a list. Often, the list only has one item. The list is used as description tag, containing other lists, and describes anything between dates to material type.",
    "html":"The HTML tag can be ignored. BS4 adds this.",
    "filedesc":"Filedesc contains title, publication, source statements.",
    "sourcedesc":"Sourcedesc describes the source of the whole document.",
    "publicationstmt":"Publication statement for the whole document.",
    "title":"Title for the whole document.",
    "titlestmt":"Titlestmt is a wrapper for the title tag (whole document).",
    "div @xml:id":"Div has an attribute @xml:id. This describes the unique ID of the item in question.",
    "list @type":"List's @type attribute describes whether the list is wrapped around an object/physical description, a date, or other category.",
    "item @n":"Item's @n attribute describes role, library sorting, language, measures, dated, notes and so on. Very... multipurpose.",
    "tei @xmlns @xml:id":"tei @xmlns @xml:id is functionally identical to TEI tag. Just the one.",
    "date @type @from @to":"date @type @from @to describes the sequence type=fromTo, from, to. A date range.",
    "ref @target":"ref's @target attribute describes a URL to another XML.",
    "date @type @when":"Date with attributes type and when. Single date/year.",
    "ref @type @target":"Seems to contain URL to eMunch's web pages for a 'Read More' function.",
    "date @type @from":"date @type @from is an open-ended date.",
    "date @type":"Caution: date @type is a date with just a type. The date itself might be enclosed...? Potentially misleading. Investigate.",
    "ref @target @n":"ref @target @n - like ref @target, but @n tends to be the name of an institution or so.",
    "row @n":"row @n describes parts of the text. Inventory number, paper type, etc.",
    "ref @type @target @n":"ref @type @target @n - Working off of previous information, I'll infer that ref @type @target @n describes a Read More, with URL, with name."
}
print("Listing all unique tags and attribute combinations found with mapped, understood meanings.\n")
tagsAttrs.sort() # We do a little sorting
for x in tagsAttrs: # For every tag/attr combination registered
    if x in dict: # If our dict has the combo
        if "@" in x: # If there's an attribute involved
            print("ATTR ["+str(x)+"]",dict[x]) # Print with attribute focus
        else: # If there is no attribute involved
            print("TAG ["+str(x)+"]",dict[x]) # Print with tag focus
    else: # If our dict does not have the combo
        print("\n"+str(x),"has no description. What is this?\n") # Print error
comments = soup.find_all(string=lambda text: isinstance(text, Comment)) # Find all comments in soup
if comments: # If there are comments
    n = len(comments) # Check how many comments
    print("\n> Detected",n,"comments (<!-- -->, etc). These should be eradicated before tag extraction.") # Print message
else: # If there are no comments
    print("\n> There are no (0) comments to worry about in this document.") # There are no comments

We have notes as well as letters. The notes generally do not have an item with @n="recipient", while the letters generally do.

Every div has an xml:id, and an enclosed ID.

Every item then has a list with items in it.
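
To make the notes-versus-letters observation concrete, here is a small sketch that splits documents by whether they have a recipient item. The `SAMPLE` fragment is an invented, heavily simplified stand-in for the real file (IDs, type values and nesting are assumptions), and it uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented, simplified stand-in for the eMunch structure
SAMPLE = """
<body>
  <div xml:id="doc-1">
    <list type="objectType"><item>brev</item></list>
    <list><item n="recipient"><ref target="person-1.xml">Example Person</ref></item></list>
  </div>
  <div xml:id="doc-2">
    <list type="objectType"><item>notat</item></list>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)
with_recipient, without_recipient = [], []
for div in root.iter("div"):
    # A document counts as addressed if any item below it has n="recipient"
    has_recipient = div.find(".//item[@n='recipient']") is not None
    (with_recipient if has_recipient else without_recipient).append(div.get(XML_ID))

print("With recipient:", with_recipient)
print("Without recipient:", without_recipient)
```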

#### Debug stuff - Types of attributes
There's a whole lot of item *n* tags. What are they? Let's find out. The following extracts list and item tags with unique attribute texts. We have to filter out a loooot of tags that're IDs, dates etc. And look - we got cells, too!

In [None]:
itemNs = []
itemXmlIds = []
itemDates = []
itemTargets = []
lists = itemNs,itemXmlIds,itemDates,itemTargets
for x in soup.findAll(True):
    name = x.name
    if "list" in name:
        children = x.findChildren(True, recursive=True)
        i=0
        for child in children:
            if len(child.attrs) == 0:
                print("Child of",name,i,child.name,"\n")
            else:
                print("Child of",name,i,child.name,child.attrs,"\n")
            i+=1
    for i in x.attrs:
        attribute = i
        value = x.attrs[i]
        try:
            contents = x.contents[0]
            fullTag = str(name)+" @"+str(attribute)+" = "+str(value)+" "+str(contents)
        except IndexError: # The element has no contents
            fullTag = str(name)+" @"+str(attribute)+" = "+str(value)
        if "@xml:id" in fullTag:
            if fullTag not in itemXmlIds:
                itemXmlIds.append(fullTag)
        elif "date" in fullTag:
            if fullTag not in itemDates:
                itemDates.append(fullTag)
        elif "@target" in fullTag:
            if fullTag not in itemTargets:
                itemTargets.append(fullTag)   
        else:
            if fullTag not in itemNs:
                itemNs.append(fullTag)
#for x in lists:
#    x.sort()
#itemNs

### CMIF

https://correspsearch.net/en/documentation.html

/correspAction/@type == correspAction element with attribute type="xyz"

/correspAction/persName == correspAction element with persName child element

@X == attribute of element

*Each letter, postcard - document - that is to be described features its own **correspDesc element**.* There are as many correspDescs as there are items. *A particular correspDesc element in CMI format is more restrictive and reduced with regard to its vocabulary than the TEI Guidelines generally allow. This enables interchange between the respective TEI documents.*

    for each in letters:
        create correspDesc wrapper

    <correspDesc>
        <correspAction type="sent">
            <persName ref="VIAFetc url">NAME</persName>
            <placeName ref="Geonames url">NAME</placeName>
        </correspAction>
        <correspAction type="received">
            <persName ref="url">NAME</persName>
            <placeName ref="Geonames url">NAME</placeName>
        </correspAction>
    </correspDesc>

### Mapping tags
*Italics* == Tag is category/folder only, does not contain text in itself

#### TEI-Header (metadata)
1. *Titlestmt* {Title, Editor(email)}
2. *Publicationstmt* {*Publisher* (Ref @target), idno@url, date@when, *Availability*(licence@target)}
3. *Sourcedesc* {Bibl@type@xml:id} - type="online" xml:id="cmifUid"

The header mostly features direct correlation, or items where the program will directly inject new information.

Now, because nothing is easy, the example file is just all TEI header including the letters it wants to describe. There is a body tag with a random \<p/>, which just serves absolutely no purpose. Why?

#### "profileDesc" (data)
1. correspDesc @key @ref @source {correspAction @type (persName @ref, placeName @ref, *date @when*), correspAction @type (persName @ref, placeName @ref, *date @when*)}

Dates need to be YYYY-MM-DD, dropping DD and/or MM if required. Unknown dates should be skipped as per CMIF documentation. 
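
The date rule can be sketched as a small function. `cmif_date` is a hypothetical helper, not the script's own code; it assumes the month and day arrive with leading hyphens (for example `--03` and `---07`), and the zero-padding is an extra safety step that the script itself does not perform.

```python
def cmif_date(year, month=None, day=None):
    """Combine partial dates into YYYY, YYYY-MM or YYYY-MM-DD (hypothetical helper).

    Returns None when there is no year, so the caller can skip the date element.
    """
    if not year:
        return None
    parts = [year]
    if month:
        parts.append(month.replace("-", "").zfill(2))
        if day:  # A day is only meaningful when there is also a month
            parts.append(day.replace("-", "").zfill(2))
    return "-".join(parts)

print(cmif_date("1905", "--03", "---07"))  # 1905-03-07
print(cmif_date("1905", "--03"))           # 1905-03
print(cmif_date("1905"))                   # 1905
print(cmif_date(None, "--03", "---07"))    # None
```

Returning `None` for a missing year keeps the "skip unknown dates" rule in one place.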
