# XML Updater Tool

Using relational data to automatically update XML code, and vice versa.

[Github development notes](https://github.com/pickettj/xml_development_eurasia/issues/10#issuecomment-576038585)

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import os, pickle

In [3]:
hdir = os.path.expanduser('~')
pickle_path = hdir + "/Box/Notes/Digital_Humanities/Corpora/pickled_tokenized_cleaned_corpora"
data_path = ext_corp_path = hdir + "/Box/Notes/Digital_Humanities/Datasets/exported_database_data/basic_corresondences"

### Read in CSV Database Files

In [4]:
db_locs = pd.read_csv(data_path + '/location_data.csv', names=['id', 'name'])

Some IDs have more than one name, separated by a line break that appears in the export as `\x0b`. Split on that character.

In [5]:
db_locs[db_locs['id']==5].iloc[0]['name'].split('\x0b')

['سمرقند', 'ثمرقند']

Create a DataFrame with one row per name, repeating the ID for entries with multiple values.

In [6]:
db_locs = pd.DataFrame(sum([[(x.id, z) for z in x.name.split('\x0b')] for x in db_locs.fillna('').itertuples()], []), columns=['id', 'name'])
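
The same reshaping can also be written with `Series.str.split` and `DataFrame.explode`. The following is only a sketch of that alternative, not the notebook's method; the `raw` frame is a hypothetical stand-in for the CSV contents.

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents: ID 5 holds two spellings
# separated by '\x0b', and ID 6 has no name at all.
raw = pd.DataFrame({
    "id": [3, 5, 6],
    "name": ["بخارا", "سمرقند\x0bثمرقند", None],
})

# Turn each name into a list of variants, then give every variant its own row.
split_locs = (
    raw.fillna("")
       .assign(name=lambda df: df["name"].str.split("\x0b"))
       .explode("name")
       .reset_index(drop=True)
)
print(split_locs)
```

As in the cell above, a missing name becomes an empty string rather than being dropped.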

In [7]:
db_locs.head(10)

| | id | name |
|---|---|---|
| 0 | 1 | حصار |
| 1 | 2 | کندرود |
| 2 | 3 | بخارا |
| 3 | 4 | ولایت بلخ |
| 4 | 5 | سمرقند |
| 5 | 5 | ثمرقند |
| 6 | 6 | خوقند |
| 7 | 7 | کابل |
| 8 | 8 | قزان |
| 9 | 9 | تاشکند |


In [8]:
db_locs.count()

id      1460
name    1460
dtype: int64

Add columns to differentiate which names are unique and which IDs are unique.

In [24]:
db_locs['name_count'] = db_locs.groupby(['name'])['id'].transform('count')

value = db_locs.loc[db_locs['name'] == "ده نو"]["id"]
# Take the first matching row explicitly; calling int() on a whole Series is deprecated
int(value.iloc[0])

394

In [10]:
db_locs['id_count'] = db_locs.groupby(['id'])['name'].transform('count')

db_locs.loc[db_locs['name'] == "ده نو"]

| | id | name | name_count | id_count |
|---|---|---|---|---|
| 840 | 394 | ده نو | 1 | 1 |
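
To make the two counts concrete, here is a small sketch on invented rows (the `toy` frame and its transliterated names are hypothetical, not database contents). A `name_count` above 1 marks a name shared by several IDs. An `id_count` above 1 marks an ID with several spellings.

```python
import pandas as pd

# Hypothetical rows: ID 5 has two spellings, and one name is shared by IDs 10 and 11.
toy = pd.DataFrame({
    "id": [3, 5, 5, 10, 11],
    "name": ["Bukhara", "Samarqand", "Thamarqand", "Dih-i Naw", "Dih-i Naw"],
})

# Number of rows sharing each name, and number of rows sharing each ID.
toy["name_count"] = toy.groupby("name")["id"].transform("count")
toy["id_count"] = toy.groupby("id")["name"].transform("count")

# Names that cannot be resolved to a single ID automatically.
ambiguous_names = toy.loc[toy["name_count"] > 1, "name"].unique().tolist()
print(toy)
print(ambiguous_names)
```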


In [11]:
db_locs['name'].str.contains("حصار").any()

True

## XML

Unpickle the XML corpora.

In [12]:
with open(pickle_path + "/xml_corpora.pkl", "rb") as f:
    ind_man_docs, hyd_man_docs, trans_man_docs,\
                combo_xml_final, combo_xml_all = pickle.load(f)

In [13]:
combo_xml_all.keys()

dict_keys(['ser818', 'ser179', 'ser183', 'ser187', 'ser212', 'ser215', 'ser237', 'ser537', 'ser561', 'ser596', 'ser626', 'ser706', 'ser72', 'ser91', 'IVANUz_1936_ser185', 'NLR_f-940_ser190', 'RGVIA_400-1-1015_ser143', 'TsGARUz_i126-1-938-2_ser82', 'TsGARUz_i126_1_1160_ser193', 'TsGARUZ_i126_1_1729_101_ser213', 'TsGARUz_i126_1_1730_19_ser218', 'TsGARUz_i126_1_1730_22_ser217', 'TsGARUz_i126_1_1730_2_ser188', 'TsGARUZ_i126_1_1730_81_ser227', 'TsGARUZ_i126_1_1986_1_ser201', 'TsGARUz_i126_1_1990_20_ser186', 'TsGARUZ_i126_1_1990_3_ser192', 'TsGARUz_R-2678_ser184', 'ser560', 'ser808', 'ser809', 'ser811', 'ser812', 'ser813', 'ser814', 'ser815', 'ser816', 'ser817', 'ser842', 'ser843', 'ser857', 'ser876', 'ser877', 'ser898'])

# The Updater

Loop blueprint:
- Create a new DataFrame for the data to be exported.
- Loop through the XML corpus (i.e. a dictionary of XML files).
    - Create a BeautifulSoup object for that XML document.
    - On encountering a tag without attributes (e.g. `<location>placename</location>`):
        - check the place name against the database CSV file of location names and ID codes:
            - if there's a unique match (only one UID for the place name string), replace: `<location id="serial_no" flag="auto">placename</location>`
                - Multiple place name variants with the same UID should be fine.
            - if one place name string has multiple UIDs (e.g. Samarkand province vs. Samarkand city):
                - flag for manual examination without guessing the UID, i.e. replace: `<location flag="check">placename</location>`
            - if there is no match, then:
                - tag with an auto-generated ID
                - flag for manual examination, i.e. replace: `<location id="(auto-generated UID)" flag="check">placename</location>`
                - extract the place name to a CSV file for import into the database:
                    - `(auto-generated new UID)`, `(extracted string data)`, `'extracted'`
    - Archive and rename the originating XML file in the archive folder.
    - Save the updated version of the XML file in a separate file.
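
The tagging part of this blueprint could be sketched roughly as follows. This is an illustration under stated assumptions, not the finished updater: the function name `update_locations`, the `Source` column, the exact-match lookup, and the use of `html.parser` are all choices made for the example, and the archive and save steps are left out.

```python
import pandas as pd
from bs4 import BeautifulSoup


def update_locations(xml_text, db_locs, next_uid):
    """Tag <location> elements per the blueprint; return (soup, export frame)."""
    soup = BeautifulSoup(xml_text, "html.parser")
    new_rows = {}  # place name -> auto-generated UID
    for loc in soup.find_all("location"):
        if loc.has_attr("id"):
            continue  # already tagged by hand or by an earlier run
        name = loc.get_text().strip()
        ids = db_locs.loc[db_locs["name"] == name, "id"].unique()
        if len(ids) == 1:
            # Unique match: fill in the database ID.
            loc["id"] = str(ids[0])
            loc["flag"] = "auto"
        elif len(ids) > 1:
            # Ambiguous name: flag it, but do not guess an ID.
            loc["flag"] = "check"
        else:
            # Unknown name: reuse or create an auto-generated UID.
            if name not in new_rows:
                new_rows[name] = next_uid
                next_uid += 1
            loc["id"] = str(new_rows[name])
            loc["flag"] = "check"
    export = pd.DataFrame(
        [(uid, name, "extracted") for name, uid in new_rows.items()],
        columns=["UID", "Name", "Source"],
    )
    return soup, export


# Hypothetical data: "Hisar" is shared by two IDs, "Fayzabad" is unknown.
db = pd.DataFrame({
    "id": [3, 5, 5, 7, 8],
    "name": ["Bukhara", "Samarqand", "Thamarqand", "Hisar", "Hisar"],
})
xml = (
    "<doc><location>Bukhara</location><location>Hisar</location>"
    "<location>Fayzabad</location><location>Fayzabad</location></doc>"
)
soup, export = update_locations(xml, db, next_uid=2000)
print(soup)
print(export)
```

Counting distinct IDs rather than matching rows keeps spelling variants of a single place from being treated as ambiguous, which is the case the blueprint says should be fine.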

### Testing Constituent Parts of the Updater Loop

In [75]:
with open("test_case.xml") as f:
        txt = f.read()

In [76]:
tree = BeautifulSoup(txt, "lxml")  # name the parser explicitly instead of letting bs4 guess

*Check if location is in database* 

In [16]:
for loc in tree.find_all("location"):
    if db_locs['name'].str.contains(loc.get_text()).any():
        print ("yes: ", loc.get_text())
    else:
        print ("no: ", loc.get_text())

yes:  بلجوان
yes:  بخارا
yes:  بخارا
yes:  ده نو
yes:  ده نو
yes:  ثمرقند
yes:  ثمرقند
no:  فیض اباد
no:  فیض اباد
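
Note that `str.contains` tests for a substring (and by default treats the search text as a regular expression), so a short name also matches any longer database entry that contains it. Whether that is intended here is not stated. The sketch below, on hypothetical transliterated names, only shows how it differs from an exact comparison.

```python
import pandas as pd

# Hypothetical names; the second entry contains the first as a substring.
names = pd.Series(["Bukhara", "Wilayat-i Bukhara", "Dih-i Naw"])

# Substring test: matches both entries containing "Bukhara".
substring_hits = names.str.contains("Bukhara", regex=False)

# Exact comparison: matches only the entry equal to "Bukhara".
exact_hits = names == "Bukhara"

print(int(substring_hits.sum()), int(exact_hits.sum()))
```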


*Check if location is in database* and *there is only one UID* 

In [17]:
for loc in tree.find_all("location"):
    text = loc.get_text()
    match = db_locs['name'].str.contains(text)
    num = len(match.value_counts(True))
    if num > 1:
        print('multiple_ids: ', text)


multiple_ids:  بلجوان
multiple_ids:  بخارا
multiple_ids:  بخارا
multiple_ids:  ده نو
multiple_ids:  ده نو
multiple_ids:  ثمرقند
multiple_ids:  ثمرقند
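
This check reports every matched name as `multiple_ids` because `match.value_counts(True)` counts the distinct outcomes in the boolean mask (`True` and `False`), not the matching rows. A minimal sketch of the difference, using hypothetical names:

```python
import pandas as pd

names = pd.Series(["Bukhara", "Samarqand", "Khoqand", "Kabul"])
match = names.str.contains("Kabul", regex=False)

# One entry for True and one for False, so the length is 2 even for a single hit.
distinct_outcomes = len(match.value_counts(normalize=True))

# Summing the boolean mask counts the rows that actually matched.
matching_rows = int(match.sum())

print(distinct_outcomes, matching_rows)
```

The categorising cell that follows sidesteps this by filtering the DataFrame and taking `len(match)`.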


*Show the various categories*

In [19]:
for loc in tree.find_all("location"):
    text = loc.get_text()
    # First ignore tags that have already been given an attribute ID
    if loc.has_attr("id"):
        print("tag id already entered: ", text)
    # Then look at all tags that lack an ID
    else:
        match = db_locs[db_locs['name'].str.contains(text)]
        num = len(match)
        if num > 1:
            print('multiple_ids: ', text)
        elif num == 1:
            print('unique:', text)
        else:
            print('no: ', text)

multiple_ids:  بلجوان
multiple_ids:  بخارا
tag id already entered:  بخارا
unique: ده نو
tag id already entered:  ده نو
unique: ثمرقند
tag id already entered:  ثمرقند
no:  فیض اباد
no:  فیض اباد


*Export data for updating database*

In [78]:
loc_export = pd.DataFrame(columns=['UID', 'Name'])

*Manipulate the tags based on database entries*

In [80]:
new_uid = 5
for loc in tree.find_all("location"):
    text = loc.get_text()
    # Look at all tags that lack an ID
    if not loc.has_attr("id"):
        match = db_locs[db_locs['name'].str.contains(text)]
        num = len(match)
        # Process tag values with multiple possible ID values
        if num > 1:
            loc["flag"] = "multiple_ids"
        # Add IDs to tags with unique string corresponding to single database entry
        elif num == 1:
            # Read the ID from the matched row; int() on a whole Series is deprecated
            loc["id"] = int(match["id"].iloc[0])
        # For strings not contained in database, add new UID, and create new entry to update database
        else:
            already_exported = loc_export.loc[loc_export['Name'] == text, "UID"]
            if not already_exported.empty:
                # Reuse the UID assigned to an earlier occurrence of this name
                loc["id"] = int(already_exported.iloc[0])
            else:
                loc["id"] = new_uid
                # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
                new_row = pd.DataFrame([{'UID': new_uid, 'Name': text}])
                loc_export = pd.concat([loc_export, new_row], ignore_index=True)
                new_uid = new_uid + 1
    # Drop duplicates in export file
    #loc_export = loc_export.drop_duplicates(subset="Name")

فیض اباد


In [39]:
tree.find_all("location")[2].get_text()

'بخارا'

In [67]:
loc_export

| | UID | Name |
|---|---|---|
| 0 | 5 | فیض اباد |
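
The blueprint calls for writing these new entries to a CSV file for import into the database. A minimal sketch of that step follows; the file name `new_locations.csv`, the temporary directory, and the `Source` column name are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the export frame built above, plus the 'extracted' marker column.
loc_export = pd.DataFrame({"UID": [5], "Name": ["فیض اباد"]})
loc_export["Source"] = "extracted"

# Hypothetical output location; a real run would use the project's data folder.
out_path = os.path.join(tempfile.mkdtemp(), "new_locations.csv")

# No header row, matching how location_data.csv is read with names=[...].
loc_export.to_csv(out_path, index=False, header=False, encoding="utf-8")

# Read the file back to confirm the round trip.
round_trip = pd.read_csv(out_path, names=["UID", "Name", "Source"])
print(round_trip)
```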


In [70]:
int(db_locs.loc[db_locs['name'] == "ده نو"]["id"].iloc[0])

394

## Testing

In [90]:
tree = BeautifulSoup(combo_xml_all["ser898"], "lxml")  # name the parser explicitly instead of letting bs4 guess

In [91]:
first = tree.find_all('location')[0]

In [86]:
'locid' in first.attrs

True

In [87]:
first.attrs['locid'] = 5

In [None]:
first

In [None]:
hits = db_locs[db_locs['name']==first.text]

In [None]:
len(hits)

In [29]:
tree.find_all('location')

[<location flag="multiple_ids">بلجوان</location>,
 <location flag="multiple_ids">بخارا</location>,
 <location id="3">بخارا</location>,
 <location id="394">ده نو</location>,
 <location id="394">ده نو</location>,
 <location id="394">ثمرقند</location>,
 <location id="5">ثمرقند</location>,
 <location id="testing">فیض اباد</location>,
 <location id="testing">فیض اباد</location>]

In [None]:
tree.div

In [81]:
print(tree.prettify())

<?xml-model href="../../../../../Projects/xml_development_eurasia/schemas/persian_documents_schema_basic.rnc" type="application/relax-ng-compact-syntax"?>
<html>
 <body>
  <document serial="898">
   <div type="heading">
    <ts type="inscriptio">
    </ts>
    <lb>
    </lb>
    جناب حضرت وزارت پناهی امیدگاهی و صاحب دولتم سلمه الله تعالی
   </div>
   <div type="section">
    <!-- left column -->
    <ts type="apprecatio">
    </ts>
    <lb>
    </lb>
    عرضه داشت اینغلام
    <honorific type="inferior">
     رضاجوی
    </honorific>
    <honorific type="inferior">
     جانسپار
    </honorific>
    <flag type="meaning">
     خرمان
    </flag>
    کثیر الاخلاص
    <lb>
    </lb>
    وافر الاعتقاد و خبر خواه عقیدت نهاد قلیل الخدمت کثیر الامید بجناب
    <lb>
    </lb>
    ذاة خجسته صفات زیب بخش امارت و زینت افزای بساط
    <lb>
    </lb>
    عزت و حرمت ترازندۀ لوای معدلت و نیک نامی فرازنده اعلام
    <lb>
    </lb>
    <diplo type="orthography">
     حشمت
    </diplo>
    و انتظام ناظم مناظم 

In [None]:
tag = tree.div

In [None]:
type(tag.name)

In [None]:
type(tag.attrs)

In [None]:
tree.div

In [None]:
type(tree.div)

In [None]:
tree.div.contents

In [None]:
for string in tree.stripped_strings:
    print(repr(string))

In [None]:
type(tree.find_all("div"))

In [123]:
tree.find_all("location").get_text()

AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
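
As the error message suggests, `find_all` returns a list-like `ResultSet`, so `get_text()` has to be called on each element, or on the single tag returned by `find`. A short sketch on a hypothetical snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a corpus document.
snippet = "<doc><location>Bukhara</location> and <location id='5'>Samarqand</location></doc>"
soup = BeautifulSoup(snippet, "html.parser")

# Call get_text() on each tag in the ResultSet.
place_names = [loc.get_text() for loc in soup.find_all("location")]

# Or use find() when only the first matching tag is needed.
first_name = soup.find("location").get_text()

print(place_names, first_name)
```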