# XML Updater Tool

Uses relational data to update XML markup automatically, and vice versa.

In [73]:
import pandas as pd
from bs4 import BeautifulSoup
import os, pickle

In [74]:
hdir = os.path.expanduser('~')
pickle_path = hdir + "/Box/Notes/Digital_Humanities/Corpora/pickled_tokenized_cleaned_corpora"
data_path = ext_corp_path = hdir + "/Box/Notes/Digital_Humanities/Datasets/exported_database_data/basic_corresondences"

### Read in CSV Database Files

In [75]:
locs = pd.read_csv(data_path + '/location_data.csv', names=['id', 'name'])

Some IDs have more than one name. In the export these values are separated by `\x0b` (a vertical tab standing in for a line break), so split on that character.

In [76]:
locs[locs['id']==5].iloc[0]['name'].split('\x0b')

['سمرقند', 'ثمرقند']

Create a DataFrame with one row per (ID, name) pair, so that an ID with multiple names appears on multiple rows.

In [77]:
locs = pd.DataFrame(sum([[(x.id, z) for z in x.name.split('\x0b')] for x in locs.fillna('').itertuples()], []), columns=['id', 'name'])
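The same reshaping can also be written with `str.split` and `DataFrame.explode`, which avoids building nested lists by hand. This is only a sketch on made-up rows (the `raw` frame below is a stand-in, not the real `location_data.csv`), and it assumes pandas 1.1 or later for `ignore_index`.

```python
import pandas as pd

# Toy stand-in for the CSV export; the third row has a missing name.
raw = pd.DataFrame({
    "id": [4, 5, 6],
    "name": ["ولایت بلخ", "سمرقند\x0bثمرقند", None],
})

# Split each name field on the vertical tab, then give every variant its own row.
exploded = (
    raw.assign(name=raw["name"].fillna("").str.split("\x0b"))
       .explode("name", ignore_index=True)
)

print(exploded)
```

As in the cell above, missing names become empty strings rather than being dropped.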

In [78]:
locs.head(10)

Unnamed: 0,id,name
0,1,حصار
1,2,کندرود
2,3,بخارا
3,4,ولایت بلخ
4,5,سمرقند
5,5,ثمرقند
6,6,خوقند
7,7,کابل
8,8,قزان
9,9,تاشکند


In [79]:
locs.count()

id      1460
name    1460
dtype: int64

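Add columns that count how many rows share each name (`name_count`) and how many rows share each ID (`id_count`), to distinguish unique names and unique IDs from ambiguous ones.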

In [80]:
locs['name_count'] = locs.groupby(['name'])['id'].transform('count')

locs.loc[locs['name'] == "حصار"]

Unnamed: 0,id,name,name_count
0,1,حصار,2
794,351,حصار,2


In [81]:
locs['id_count'] = locs.groupby(['id'])['name'].transform('count')

locs.loc[locs['name'] == "سمرقند"]

Unnamed: 0,id,name,name_count,id_count
4,5,سمرقند,1,2
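To make the two counts concrete, here is a small sketch on four toy rows modelled on the outputs above. It uses `nunique` instead of `count`, on the assumption that the same (ID, name) pair could appear more than once in an export; with no duplicate rows the two give the same result. The `toy` and `ambiguous` names are hypothetical.

```python
import pandas as pd

# Toy rows: one name shared by two IDs, and one ID with two spellings.
toy = pd.DataFrame({
    "id": [1, 351, 5, 5],
    "name": ["حصار", "حصار", "سمرقند", "ثمرقند"],
})

# Number of distinct IDs per name: values above 1 mean the name is ambiguous.
toy["name_count"] = toy.groupby("name")["id"].transform("nunique")

# Number of distinct names per ID: values above 1 mean spelling variants.
toy["id_count"] = toy.groupby("id")["name"].transform("nunique")

# Names that cannot be resolved to a single ID automatically.
ambiguous = toy[toy["name_count"] > 1]
print(toy)
```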


In [119]:
locs['name'].str.contains("حصار").any()

True

## XML

Unpickle

In [82]:
with open(pickle_path + "/xml_corpora.pkl", "rb") as f:
    ind_man_docs, hyd_man_docs, trans_man_docs,\
                combo_xml_final, combo_xml_all = pickle.load(f)

In [83]:
combo_xml_all.keys()

dict_keys(['ser818', 'ser179', 'ser183', 'ser187', 'ser212', 'ser215', 'ser237', 'ser537', 'ser561', 'ser596', 'ser626', 'ser706', 'ser72', 'ser91', 'IVANUz_1936_ser185', 'NLR_f-940_ser190', 'RGVIA_400-1-1015_ser143', 'TsGARUz_i126-1-938-2_ser82', 'TsGARUz_i126_1_1160_ser193', 'TsGARUZ_i126_1_1729_101_ser213', 'TsGARUz_i126_1_1730_19_ser218', 'TsGARUz_i126_1_1730_22_ser217', 'TsGARUz_i126_1_1730_2_ser188', 'TsGARUZ_i126_1_1730_81_ser227', 'TsGARUZ_i126_1_1986_1_ser201', 'TsGARUz_i126_1_1990_20_ser186', 'TsGARUZ_i126_1_1990_3_ser192', 'TsGARUz_R-2678_ser184', 'ser560', 'ser808', 'ser809', 'ser811', 'ser812', 'ser813', 'ser814', 'ser815', 'ser816', 'ser817', 'ser842', 'ser843', 'ser857', 'ser876', 'ser877', 'ser898'])

# The Updater

Loop blueprint:
- Create a new DataFrame for new data to be exported.
- Loop through the XML corpus (i.e. a dictionary of XML files).
    - Create a BeautifulSoup object for that XML document.
    - On encountering a location tag with no attributes (e.g. `<location>placename</location>`):
        - Check the place name against the database CSV file of location names and ID codes:
            - If there is a unique match (only one UID for the place name string), replace with: `<location id="serial_no" flag="auto">placename</location>`
                - Multiple place name variants with the same UID should be fine.
            - If one place name string has multiple UIDs (e.g. Samarkand province vs. Samarkand city):
                - Flag for manual examination without guessing the UID, i.e. replace with: `<location flag="check">placename</location>`
            - If there is no match:
                - Tag with an auto-generated ID.
                - Flag for manual examination, i.e. replace with: `<location id="(auto-generated UID)" flag="check">placename</location>`
                - Extract the place name to a CSV file for import into the database:
                    - `(auto-generated new UID)`, `(extracted string data)`, `'extracted'`
    - Archive and rename the originating XML file in the archive folder.
    - Save the updated version of the XML file as a separate file.
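The tagging part of the blueprint could be sketched roughly as follows. Everything here is an assumption for illustration rather than the finished updater: the function name `tag_locations`, the running-counter scheme for new IDs, the `source` column label, and the use of `html.parser` are all hypothetical choices, and the archiving and saving steps are left out.

```python
import pandas as pd
from bs4 import BeautifulSoup


def tag_locations(xml_text, locs, next_id):
    """Annotate attribute-less <location> tags following the blueprint.

    Returns (updated_xml, new_rows, next_id). All names here are hypothetical.
    """
    # html.parser ships with Python; the notebook's own parser choice may differ.
    soup = BeautifulSoup(xml_text, "html.parser")
    new_rows = []   # rows destined for the export CSV
    minted = {}     # unmatched name -> ID generated during this call

    for tag in soup.find_all("location"):
        if tag.attrs:
            continue  # already carries attributes, so leave it alone
        name = tag.get_text(strip=True)
        ids = locs.loc[locs["name"] == name, "id"].unique()

        if len(ids) == 1:
            # Unique match: safe to fill in the UID automatically.
            tag["id"] = str(ids[0])
            tag["flag"] = "auto"
        elif len(ids) > 1:
            # Same string, several UIDs: flag without guessing.
            tag["flag"] = "check"
        else:
            # No match: mint an ID (assumed scheme: a running counter).
            if name not in minted:
                minted[name] = next_id
                new_rows.append((next_id, name, "extracted"))
                next_id += 1
            tag["id"] = str(minted[name])
            tag["flag"] = "check"

    return str(soup), new_rows, next_id


# Toy data standing in for the real CSV and corpus.
locs_toy = pd.DataFrame({"id": [1, 351, 5, 5],
                         "name": ["حصار", "حصار", "سمرقند", "ثمرقند"]})
xml_toy = ("<div><location>سمرقند</location> <location>حصار</location> "
           "<location>فیض اباد</location></div>")

updated, new_rows, next_id_after = tag_locations(xml_toy, locs_toy, next_id=1000)
export = pd.DataFrame(new_rows, columns=["id", "name", "source"])
print(updated)
```

Matching on exact equality and counting distinct IDs keeps spelling variants of one place (same UID) in the "unique" branch, as the blueprint intends. In a full loop, `next_id` would be carried from one document to the next so that generated IDs never collide.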

### Testing Constituent Parts of the Updater Loop

*Check if location is in database* (`tree` is the BeautifulSoup object created in the Testing section below)

In [126]:
for loc in tree.find_all("location"):
    if locs['name'].str.contains(loc.get_text()).any():
        print ("yes: ", loc.get_text())
    else:
        print ("no: ", loc.get_text())

yes:  بلجوان
no:  فیض اباد
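One caveat with this check: `str.contains` looks for the text anywhere inside each name (and by default treats it as a regular expression), so a short place name can "match" a longer database entry. A small sketch of the difference, using a made-up two-row series:

```python
import pandas as pd

# Toy names; "بلخ" only occurs as part of a longer entry.
names = pd.Series(["ولایت بلخ", "سمرقند"])

# Substring test: True, because "بلخ" appears inside "ولایت بلخ".
substring_hit = names.str.contains("بلخ", regex=False).any()

# Exact test: False, because no entry is exactly "بلخ".
exact_hit = names.eq("بلخ").any()

print(substring_hit, exact_hit)
```

Whether substring hits are wanted depends on how the database records compound names; for assigning UIDs automatically, exact equality is probably the safer default.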


*Check if location is in database and there is only one UID*

In [150]:
for loc in tree.find_all("location"):
    if locs['name'].str.contains(loc.get_text()).any():
        if locs.loc[locs['name'] == loc]["name_count"] > 1:
            print ("multiple_ids: ", loc.get_text())
        else:
            print ("unique: ", loc.get_text())
    else:
        print ("no: ", loc.get_text())

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [157]:
locs.loc[locs['name'] == "بلجوان"]["name_count"]

101    2
830    2
Name: name_count, dtype: int64
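The `ValueError` comes from using a whole Series (here two rows, both `2`) as an `if` condition. The failing cell also compares `locs['name']` with the tag object `loc` rather than with its text. One possible fix is sketched below on toy data: look up the rows by exact text, then reduce to a single value before comparing. The `classify` helper, the demo frame, and the IDs in it are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Toy stand-ins; the IDs are invented for this example.
locs_demo = pd.DataFrame({"id": [102, 830, 5],
                          "name": ["بلجوان", "بلجوان", "سمرقند"]})
locs_demo["name_count"] = locs_demo.groupby("name")["id"].transform("count")

tree_demo = BeautifulSoup(
    "<div><location>بلجوان</location><location>سمرقند</location>"
    "<location>فیض اباد</location></div>",
    "html.parser",
)


def classify(name, table):
    """Return 'no', 'unique' or 'multiple_ids' for one place-name string."""
    hits = table.loc[table["name"] == name, "name_count"]
    if hits.empty:
        return "no"
    # Every row for a given name holds the same count, so the first is enough.
    return "multiple_ids" if hits.iloc[0] > 1 else "unique"


results = []
for loc in tree_demo.find_all("location"):
    text = loc.get_text(strip=True)
    results.append((classify(text, locs_demo), text))

for label, text in results:
    print(label + ": ", text)
```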

## Testing

In [90]:
tree = BeautifulSoup(combo_xml_all["ser898"])

In [91]:
first = tree.find_all('location')[0]

In [86]:
'locid' in first.attrs

True

In [87]:
first.attrs['locid'] = 5

In [None]:
first

In [None]:
hits = locs[locs['name']==first.text]

In [None]:
len(hits)

In [88]:
tree.find_all('location')

[]

In [None]:
tree.div

In [92]:
print(tree.prettify())

<?xml-model href="../../../../../Projects/xml_development_eurasia/schemas/persian_documents_schema_basic.rnc" type="application/relax-ng-compact-syntax"?>
<html>
 <body>
  <document serial="898">
   <div type="heading">
    <!-- inscriptio -->
    <ts type="inscriptio">
    </ts>
    جناب حضرت وزارت پناهی امیدگاهی و صاحب دولتم سلمه الله تعالی
    <lb>
    </lb>
   </div>
   <div type="section">
    <!-- left column -->
    <ts type="apprecatio">
    </ts>
    عرضه داشت اینغلام
    <honorific type="inferior">
     رضاجوی
    </honorific>
    <honorific type="inferior">
     جانسپار
    </honorific>
    <flag type="meaning">
     خرمان
    </flag>
    کثیر الاخلاص
    <lb>
    </lb>
    وافر الاعتقاد و خبر خواه عقیدت نهاد قلیل الخدمت کثیر الامید بجناب
    <lb>
    </lb>
    ذاة خجسته صفات زیب بخش امارت و زینت افزای بساط
    <lb>
    </lb>
    عزت و حرمت ترازندۀ لوای معدلت و نیک نامی فرازنده اعلام
    <lb>
    </lb>
    <diplo type="orthography">
     حشمت
    </diplo>
    و انتظام ناظم م

In [None]:
tag = tree.div

In [None]:
type(tag.name)

In [None]:
type(tag.attrs)

In [None]:
tree.div

In [None]:
type(tree.div)

In [None]:
tree.div.contents

In [None]:
for string in tree.stripped_strings:
    print(repr(string))

In [None]:
type(tree.find_all("div"))

In [123]:
tree.find_all("location").get_text()

AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
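As the error message suggests, `find_all()` returns a list-like `ResultSet`, so `get_text()` has to be called on each tag in turn (or on the single tag returned by `find()`). A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Toy fragment with two location tags.
demo = BeautifulSoup(
    "<div><location>بلجوان</location> و <location>فیض اباد</location></div>",
    "html.parser",
)

# find_all() gives a ResultSet: iterate and read each tag's text.
names = [loc.get_text(strip=True) for loc in demo.find_all("location")]

# find() gives the first matching tag, which does have get_text().
first_name = demo.find("location").get_text(strip=True)

print(names, first_name)
```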