# XML Updater Tool

Using relational data to automatically update XML code, and vice versa.

[GitHub development notes](https://github.com/pickettj/xml_development_eurasia/issues/10#issuecomment-576038585)

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import os, pickle, glob, shutil

In [16]:
hdir = os.path.expanduser('~')
data_path = ext_corp_path = hdir + r"/Dropbox/Active_Directories/Digital_Humanities/Datasets/exported_database_data/basic_corresondences"
import_xml_path = hdir + r"/Dropbox/Active_Directories/Notes/Primary_Sources/xml_notes_stage2/parser_depository"
export_xml_path = hdir + r"/Dropbox/Active_Directories/Notes/Primary_Sources/xml_notes_stage3_final/updater_repository"

## Read in CSV Database Files, Refine Dataframe

In [17]:
db_locs = pd.read_csv(data_path + '/location_data.csv', names=['id', 'name'])

Split on `\x0b` for IDs with more than one value separated by a line break.

In [18]:
db_locs[db_locs['id']==5].iloc[0]['name'].split('\x0b')

['سمرقند', 'ثمرقند']

Create DataFrame with doubled entries for IDs with multiple values.

In [19]:
db_locs = pd.DataFrame(sum([[(x.id, z) for z in x.name.split('\x0b')] for x in db_locs.fillna('').itertuples()], []), columns=['id', 'name'])
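The same reshaping can also be written with `Series.str.split` followed by `DataFrame.explode`, which may be easier to read than the nested comprehension. This is a minimal sketch on a made-up three-row frame (the `raw` data below is hypothetical, not taken from `location_data.csv`):

```python
import pandas as pd

# Hypothetical miniature of the location table: id 5 holds two spellings
# separated by a vertical tab, and id 7 has no name at all.
raw = pd.DataFrame({
    "id": [3, 5, 7],
    "name": ["بخارا", "سمرقند\x0bثمرقند", None],
})

exploded = (
    raw.fillna("")                                         # missing names become ""
       .assign(name=lambda d: d["name"].str.split("\x0b"))  # each name becomes a list
       .explode("name")                                     # one row per list element
       .reset_index(drop=True)
)
print(exploded)
```

Each spelling ends up on its own row while keeping the shared `id`, which is the shape the later lookups rely on.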

In [20]:
db_locs.head(6)

| | id | name |
|---|---|---|
| 0 | 1 | حصار |
| 1 | 2 | کندرود |
| 2 | 3 | بخارا |
| 3 | 4 | ولایت بلخ |
| 4 | 5 | سمرقند |
| 5 | 5 | ثمرقند |


In [21]:
db_locs.count()

id      1460
name    1460
dtype: int64

## Read in XML

Create a corpus from the XML documents in the Stage 2 folder.

In [22]:
xml_update_files = glob.glob(import_xml_path + r'/*.xml')

xml_update = {}
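for longname in xml_update_files:
    # Read each file as UTF-8 text; the context manager closes it on exit
    with open(longname, encoding='utf-8') as f:
        txt = f.read()
    # Key the corpus by file name without its extension, e.g. 'ser193'
    short = os.path.splitext(os.path.basename(longname))[0]
    xml_update[short] = txt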

xml_update.keys()

dict_keys(['ser193', 'ser187', 'ser811', 'ser596', 'ser970', 'ser958', 'ser179', 'ser812', 'ser621', 'ser972', 'ser967', 'ser973', 'ser813', 'ser817', 'ser963', 'ser988', 'ser989', 'ser816', 'ser814', 'ser960', 'ser237', 'ser961', 'ser626', 'ser183', 'ser815', 'ser906', 'ser537', 'ser898', 'ser1004', 'ser1006', 'ser939', 'ser84', 'ser905', 'ser904', 'ser85', 'ser938', 'ser91', 'ser1003', 'ser81', 'ser80', 'ser929', 'ser108', 'ser877', 'ser903', 'ser97', 'ser902', 'ser876', 'ser106', 'ser105', 'ser72', 'ser501', 'ser110', 'ser706', 'ser842', 'ser937', 'ser843', 'ser857', 'ser818', 'ser944', 'ser993', 'ser561', 'ser212', 'ser560', 'ser945', 'ser979', 'ser990', 'ser991', 'ser952', 'ser215', 'ser809', 'ser808'])

# The Updater

Loop blueprint:
- Create a new DataFrame for the data to be exported.
- Loop through the XML corpus (i.e. a dictionary of XML files).
    - Create a BeautifulSoup object for that XML document.
    - On encountering a location tag with no attributes (e.g. `<location>placename</location>`),
        - check the place name against the database CSV file of location names and ID codes:
            - if there is a unique match (only one UID for the place name string), replace with: `<location id="serial_no" flag="auto">placename</location>`
                - Multiple place name variants with the same UID should be fine.
            - if one place name string has multiple UIDs (e.g. Samarkand province vs. Samarkand city):
                - flag for manual examination without guessing the UID, i.e. replace with: `<location flag="check">placename</location>`
            - if there is no match:
                - tag with an auto-generated ID
                - flag for manual examination, i.e. replace with: `<location id="(auto-generated UID)" flag="check">placename</location>`
                - extract the place name to a CSV file for import into the database:
                    - `(auto-generated new UID)`, `(extracted string data)`, `'extracted'`
    - Archive and rename the originating XML file in the archive folder.
    - Save the updated version of the XML file in a separate file.
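The three-way decision in the blueprint can be sketched as a small helper. This is an illustration under assumptions, not the notebook's own code: the function name `classify_location`, the status labels, and the sample gazetteer are hypothetical, and it uses exact string equality and counts distinct IDs, whereas the test cells below use `str.contains`.

```python
import pandas as pd

def classify_location(name, db_locs):
    """Return (status, uid) for a place name, matching on exact equality."""
    # Distinct IDs attached to this exact spelling
    ids = db_locs.loc[db_locs["name"] == name, "id"].unique()
    if len(ids) == 1:
        return "auto", int(ids[0])   # unique match: safe to tag automatically
    if len(ids) > 1:
        return "check", None         # ambiguous: flag without guessing a UID
    return "new", None               # unknown: caller assigns a fresh UID

# Hypothetical gazetteer: two spellings share id 5, one spelling has two ids
db = pd.DataFrame({
    "id": [3, 5, 5, 40, 41],
    "name": ["بخارا", "سمرقند", "ثمرقند", "ده نو", "ده نو"],
})

print(classify_location("ثمرقند", db))
print(classify_location("ده نو", db))
print(classify_location("فیض اباد", db))
```

Counting distinct IDs rather than matching rows means a spelling that appears twice under the same UID still counts as a unique match.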

### Testing Constituent Parts of the Updater Loop

In [23]:
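with open("test_case.xml", encoding="utf-8") as f:
    txt = f.read()

In [24]:
# Name the parser explicitly; the "xml" parser requires lxml to be installed
tree = BeautifulSoup(txt, "xml")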

*Check if location is in database* 

In [25]:
for loc in tree.find_all("location"):
    if db_locs['name'].str.contains(loc.get_text()).any():
        print ("yes: ", loc.get_text())
    else:
        print ("no: ", loc.get_text())

yes:  بلجوان
yes:  بخارا
yes:  بخارا
yes:  ده نو
yes:  ده نو
yes:  ثمرقند
yes:  ثمرقند
no:  فیض اباد
no:  فیض اباد


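*Check if location is in database* and *there is only one UID*

Note that `value_counts` here counts the distinct boolean values in the mask (`True` and `False`), so this first attempt flags every name that matches at all rather than only names with several UIDs. The following cell counts the matching rows instead.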

In [26]:
for loc in tree.find_all("location"):
    text = loc.get_text()
    match = db_locs['name'].str.contains(text)
    num = len(match.value_counts(True))
    if num > 1:
        print('multiple_ids: ', text)


multiple_ids:  بلجوان
multiple_ids:  بخارا
multiple_ids:  بخارا
multiple_ids:  ده نو
multiple_ids:  ده نو
multiple_ids:  ثمرقند
multiple_ids:  ثمرقند
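To make the difference between the two counts concrete, here is a minimal sketch on a hypothetical three-name series (the names are chosen for illustration only). The number of distinct values in a boolean mask says only whether both matches and non-matches exist, while the sum of the mask is the number of matching rows.

```python
import pandas as pd

# Hypothetical name column
names = pd.Series(["بخارا", "ولایت بخارا", "حصار"])

# Substring match for a name that occurs in exactly one row
mask = names.str.contains("حصار", regex=False)

distinct_values = len(mask.value_counts())  # True and False both occur
matching_rows = int(mask.sum())             # rows where the mask is True

print(distinct_values, matching_rows)

# A name that is a substring of two entries matches two rows
rows_for_bukhara = int(names.str.contains("بخارا", regex=False).sum())
print(rows_for_bukhara)
```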


*Show the various categories*

In [27]:
for loc in tree.find_all("location"):
    text = loc.get_text()
    # First ignore tags that have already been given an attribute ID
    if loc.has_attr("id"):
        print("tag id already entered: ", text)
    # Then look at all tags that lack an ID
    else:
        match = db_locs[db_locs['name'].str.contains(text)]
        num = len(match)
        if num > 1:
            print('multiple_ids: ', text)
        elif num == 1:
            print('unique:', text)
        else:
            print('no: ', text)

multiple_ids:  بلجوان
multiple_ids:  بخارا
tag id already entered:  بخارا
unique: ده نو
tag id already entered:  ده نو
unique: ثمرقند
tag id already entered:  ثمرقند
no:  فیض اباد
no:  فیض اباد


*Export data for updating database*

In [28]:
# Gazetteer
loc_export = pd.DataFrame(columns=['UID', 'Name', 'Tags'])
# Gazetteer-Bibliography relational file
loc_bib_relational = pd.DataFrame(columns=['Loc_ID', 'Doc_ID', 'Type', 'Tags'])

New Unique ID: *highest previous value plus 2001; with the current maximum of 6349, all new imports will be above 8000*

In [29]:
highest_loc_id = db_locs['id'].max()
new_uid = highest_loc_id + 2001
print ("highest current database location UID: ", highest_loc_id, "\n import UIDs will start at: ", new_uid)

highest current database location UID:  6349 
 import UIDs will start at:  8350


*Manipulate the tags based on database entries*

In [226]:
for loc in tree.find_all("location"):
    text = loc.get_text()
    # Serial number of the document that contains the current location tag
    doc_id = int(loc.find_parent('document')['serial'])
    # Only process tags that lack an ID
    if not loc.has_attr("id"):
        match = db_locs[db_locs['name'].str.contains(text)]
        num = len(match)
        # Several candidate database rows: flag for manual review, assign no ID
        if num > 1:
            loc["flag"] = "multiple_ids"
        # Exactly one database row matches: take its ID
        elif num == 1:
            loc["id"] = int(match['id'].iloc[0])
        # String not in the database: assign a new UID and queue a new database entry
        else:
            # Reuse the UID if this new name was already exported earlier in the loop
            already = loc_export[loc_export['Name'] == text]
            if not already.empty:
                loc["id"] = int(already['UID'].iloc[0])
            # Create new database entries
            else:
                loc["id"] = new_uid
                # Row for the main gazetteer table
                # (row-wise .loc assignment replaces DataFrame.append, removed in pandas 2.0)
                loc_export.loc[len(loc_export)] = [new_uid, text, 'updater_import']
                # Row for the gazetteer-bibliography relational table
                loc_bib_relational.loc[len(loc_bib_relational)] = [new_uid, doc_id, 'mentioned', 'updater_import']
                new_uid = new_uid + 1
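The "reuse a UID for a name already seen, otherwise allocate the next one" step can also be kept in a plain dictionary instead of scanning the export frame on every tag. The sketch below is hypothetical (the class name `UidAllocator` and the second place name are made up for illustration) and is offered as one possible simplification, not as the notebook's method.

```python
class UidAllocator:
    """Hand out consecutive UIDs, returning the same UID for a repeated name."""

    def __init__(self, start):
        self.next_uid = start
        self.assigned = {}  # name -> UID already handed out

    def uid_for(self, name):
        if name not in self.assigned:
            self.assigned[name] = self.next_uid
            self.next_uid += 1
        return self.assigned[name]

alloc = UidAllocator(start=8350)
first = alloc.uid_for("فیض اباد")
again = alloc.uid_for("فیض اباد")   # same name, same UID
other = alloc.uid_for("قرشی")       # hypothetical second new name
print(first, again, other)
```

The `assigned` dictionary can be turned into export rows after the loop finishes.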

In [228]:
loc_export

| | UID | Name | Tags |
|---|---|---|---|
| 0 | 8350 | فیض اباد | updater_import |


In [210]:
loc_bib_relational

| | Loc_ID | Doc_ID | Type | Tags |
|---|---|---|---|---|
| 0 | 6349 | 898 | mentioned | updater_import |


## Exporting / Archiving

Write all updated XML files to the Stage 3 import folder.

In [247]:
for doc in xml_update:
    # Export filename
    output_file = doc + ".xml"

    # Export text (the corpus text as read in; use str(tree) to write a modified tree)
    output_text = xml_update[doc]

    # Write the file
    with open(os.path.join(export_xml_path, output_file), 'w', encoding='utf-8') as fout:
        fout.write(output_text)

Move Stage 2 folder contents into the archive folder.

In [244]:
archive_folder = import_xml_path + r"/archived_now_at_stage3_do_not_use"

files = os.listdir(import_xml_path)

for f in files:
    # The loop variable is f, so test f (not an undefined `filename`)
    if f.endswith(".xml"):
        # TODO: figure out a simple way of prepending "archived" to the file name.
        shutil.move(os.path.join(import_xml_path, f), archive_folder)

'/Users/enkidu/Box/Notes/Primary_Sources/xml_notes_stage2/parser_depository/archived_now_at_stage3_do_not_use'
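One possible way to prepend "archived" to each file name while moving it is to build the target path explicitly and pass it to `shutil.move`. The helper name `archive_xml`, the `archived_` prefix, and the throwaway directory layout below are assumptions for illustration; the sketch runs against a temporary folder rather than the real Stage 2 directory.

```python
import shutil
import tempfile
from pathlib import Path

def archive_xml(src_dir, archive_dir, prefix="archived_"):
    """Move every .xml file from src_dir into archive_dir, prefixing its name."""
    src_dir, archive_dir = Path(src_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # glob("*.xml") is non-recursive, so files already in a subfolder are skipped
    for path in sorted(src_dir.glob("*.xml")):
        target = archive_dir / (prefix + path.name)
        shutil.move(str(path), str(target))
        moved.append(target.name)
    return moved

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    stage2 = Path(tmp) / "stage2"
    stage2.mkdir()
    (stage2 / "ser193.xml").write_text("<document serial='193'/>", encoding="utf-8")
    (stage2 / "notes.txt").write_text("not xml", encoding="utf-8")

    moved = archive_xml(stage2, stage2 / "archive")
    remaining = sorted(p.name for p in stage2.iterdir())

print(moved, remaining)
```

Giving `shutil.move` a full target path, rather than only the destination folder, is what allows the rename and the move to happen in one step.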