## Process XMLs

This notebook shows one way to process XML files: extract all paragraphs and save them in a pickle file for later use.

In [1]:
import os 
python_root = '..'
import sys
sys.path.insert(0, python_root)
import data_util
from bs4 import BeautifulSoup
import pickle

### 1. Download the sample data if you don't have it yet

In [2]:
### specify download path and extract path 
download_path = "staff_reports.zip"
download_link = "https://www.dropbox.com/s/wi37fy1apjuiyqt/staff_reports.zip?dl=1"
extract_path = "../data"  # place data in Python project root folder

In [3]:
## the download_data function is defined in the data_util module under python_root
## if you do not have the data yet, run this cell; it sets up a data folder under the ./Python folder
data_util.download_data(download_path,download_link,extract_path)

article iv: 11.0MB [00:05, 2.30KB/s]                                           
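
The internals of `data_util.download_data` are not shown in this notebook. As a rough sketch of what such a helper might do, assuming it downloads the zip once and then unpacks it, the hypothetical `fetch_and_extract` below uses only the standard library. It is an illustration, not the actual implementation.

```python
import os
import urllib.request
import zipfile


def fetch_and_extract(zip_path, url, extract_dir):
    """Hypothetical stand-in for data_util.download_data (not the real code).

    Downloads `url` to `zip_path` unless that file already exists, then
    unpacks the archive into `extract_dir`. Returns the archived names.
    """
    if not os.path.exists(zip_path):
        # Network call; skipped when the zip is already on disk.
        urllib.request.urlretrieve(url, zip_path)
    os.makedirs(extract_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
        return archive.namelist()
```

Skipping the download when the zip already exists makes the cell safe to re-run.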


### 2. Now we can start processing the XMLs

In [7]:
xml_path = "../data/xmls"
files = os.listdir(xml_path)             ## list all files in xml_path
files = [f for f in files if '_' in f]   ## keep only files with "_" in the name;
                                         ## that is how our xmls are named:
                                         ## files without "_" are just headers that don't contain
                                         ## any information


    
Here we use the Beautiful Soup package to read XML files. We first define a function that reads one file:

In [8]:
def read_xml(file_path):
    with open(file_path,'r',encoding='utf8') as f:
        soup = BeautifulSoup(f, 'xml')
    
    return soup

In [9]:
## see what the result looks like
res = read_xml(os.path.join(xml_path,files[0]))
print(res.contents)

['article PUBLIC "-//IMF//IMF DTD//EN" "../../../../IMF_DTDs_XSLs/journal-dtd-3.0/3.0/journalpublishing3.dtd"', 'xml-stylesheet type="text/xsl" href="../../../../IMF_DTDs_XSLs/journal-dtd-3.0/journal_ViewIMF-v1.0.xsl"', <article article-type="001" dtd-version="3.0" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">002</journal-id>
<journal-title-group>
<journal-title>IMF Staff Country Reports</journal-title>
</journal-title-group>
<issn>1934-7685</issn>
<isbn>9781451800203</isbn>
<publisher>
<publisher-name>International Monetary Fund</publisher-name>
<publisher-loc>Washington, D.C.</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5089/9781451800203.002.A001</article-id>
<article-id pub-id-type="publisher-id">002A001</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Afghanistan, Islamic State

</article>]


Some basic operations on the XML file:

In [10]:
## get all paragraphs 
paras = res.body.find_all('p')
print(paras[0])

<p>1. Undertaking a debt sustainability analysis (DSA) for Afghanistan presents unique challenges. The lack of complete debt information, the presence of disputed or unverified claims, and the almost total absence of historical data on key macroeconomic variables precludes use of the Fund’s standard template for assessing external sustainability.<xref id="RA01fn02" ref-type="fn" rid="A01fn02"><sup>2</sup></xref> Even the more nuanced approach to debt sustainability for low income countries presents difficulties, given that standard assessments of institutional capacity (for example, through the World Bank’s Country Policy and Institutional Assessments—CPIA) have yet to be completed for Afghanistan.</p>


In [11]:
## get all tables 
figs = res.body.find_all('table-wrap')
print(figs[0].title)  ## print only the title of the first table

<title>Islamic State of Afghanistan: Debt Sustainability Analysis, 2005–30 <xref id="RA01tab01fn01" ref-type="table-fn" rid="A01tab01fn01">1/</xref></title>


For more Beautiful Soup operations, see the documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
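
The paragraph printed above still contains inline tags such as `<xref>` and `<sup>`, whereas the cleaned paragraphs later in this notebook are plain strings. In Beautiful Soup, `get_text()` on a tag returns its text with the markup removed. The sketch below shows the same idea with the standard library's `xml.etree.ElementTree` (swapped in here only so the example runs without extra packages); the sample XML and the `paragraph_texts` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny made-up document that mimics the structure shown above.
SAMPLE = """<article><body>
<p>External sustainability.<xref ref-type="fn" rid="A01fn02"><sup>2</sup></xref> Even the more nuanced approach.</p>
<p>Second paragraph.</p>
</body></article>"""


def paragraph_texts(xml_string):
    """Return the plain text of every <p> under <body>, with inline tags dropped."""
    root = ET.fromstring(xml_string)
    body = root.find("body")
    if body is None:
        return []
    # itertext() walks the element and its children, yielding only text pieces.
    return ["".join(p.itertext()).strip() for p in body.iter("p")]


texts = paragraph_texts(SAMPLE)
print(texts[0])  # External sustainability.2 Even the more nuanced approach.
```

Note that the footnote marker `2` ends up fused to the preceding sentence, which matches the cleaned paragraphs shown later.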



### 3. To keep the notebook cleaner, I put a document object in the data_util module

- The document object takes 3 arguments: series_id, file_id, xml_path.
- It returns an object with these fields: series_id, file_id, paras, meta.
- Our XMLs are named as follows: [xxxxx]-[xxxxxxxxx]_A[xxx].xml
- The first part is the series id and the second part is the document id. We extract them so that we can use those ids to look up extra metadata in the metadata sheet.

In [12]:
def get_ids(xml):
    """
    input  :xml full name 
    return :series id and file id 
    """
    series_id,xml_name = xml.split('-')
    file_id,_ = xml_name.split('_') 
    return series_id,file_id

In [13]:
ids = get_ids(files[0])
print(ids)

('00160', '9781451800203')
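
`get_ids` relies on the name containing exactly one `-` and one `_`. A stricter alternative, sketched below, validates the whole name with a regular expression and also returns the part label. The pattern is an assumption inferred from the single example name in this notebook (digits for both ids, one letter plus digits for the part), and `parse_xml_name` is a hypothetical helper, not part of `data_util`.

```python
import re

# Assumed layout: <series digits>-<file digits>_<letter><digits>.xml
XML_NAME = re.compile(r"^(?P<series>\d+)-(?P<file>\d+)_(?P<part>[A-Za-z]\d+)\.xml$")


def parse_xml_name(name):
    """Return (series_id, file_id, part), or None if the name does not match."""
    match = XML_NAME.match(name)
    if match is None:
        return None
    return match.group("series"), match.group("file"), match.group("part")


print(parse_xml_name("00160-9781451800203_A001.xml"))  # ('00160', '9781451800203', 'A001')
print(parse_xml_name("00160-9781451800203.xml"))       # None
```

Returning `None` for unexpected names lets the caller decide whether to skip or log the file instead of handling an unpacking error.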


In [14]:
doc_test = data_util.document(ids[0],ids[1],os.path.join(xml_path,files[0]))

In [15]:
doc_test.series_id

'00160'

In [16]:
doc_test.file_id

'9781451800203'

In [17]:
doc_test.paras[:2]

['1. Undertaking a debt sustainability analysis (DSA) for Afghanistan presents unique challenges. The lack of complete debt information, the presence of disputed or unverified claims, and the almost total absence of historical data on key macroeconomic variables precludes use of the Fund’s standard template for assessing external sustainability.2 Even the more nuanced approach to debt sustainability for low income countries presents difficulties, given that standard assessments of institutional capacity (for example, through the World Bank’s Country Policy and Institutional Assessments—CPIA) have yet to be completed for Afghanistan.',
 '2. Data and other constraints notwithstanding, the need for an assessment of Afghanistan’s debt sustainability over the medium‐ and long-term remains. Particularly in the post-election environment, as ministries and other government institutions come to grips with their respective portfolios, a quantitative assessment of how the debt burden might evolve

### 4. Now we loop through all the XML files and turn them into document objects for later use

In [18]:
files[0]

'00160-9781451800203_A001.xml'

In [19]:
doc_dict = dict()
total_length = len(files)
print('converting {} xmls into document objects ......'.format(total_length))
for idx,file_name in enumerate(files):
    f_path = os.path.join(xml_path,file_name)
    try:
        series_id,file_id = get_ids(file_name)             ## get series id and file id from the xml name
    except ValueError:
        print("file name is not consistent: ", file_name)  ## names that don't split into two parts are skipped for now
        continue
    
    doc = data_util.document(series_id,file_id,f_path)
    try:
        if doc.file_id in doc_dict:
            doc_dict[doc.file_id].paras.extend(doc.paras)   ## some documents are divided into multiple xmls,
                                                            ## so we concatenate all their paras together
        else:
            doc_dict[doc.file_id] = doc
    except Exception:
        print(doc.file_id)                                  ## report documents that could not be merged
        
    if (idx+1)%100 == 0:
        print('{} / {} '.format(idx+1,total_length))
    

converting 344 xmls into document objects ......
no paragraph found: 9781451800616
no paragraph found: 9781451800692
no paragraph found: 9781451811346
100 / 344 
no paragraph found: 9781451811414
no paragraph found: 9781451811513
no paragraph found: 9781451811469
no paragraph found: 9781451811544
no paragraph found: 9781451800494
no paragraph found: 9781451801804
no paragraph found: 9781451801828
no paragraph found: 9781451801859
no paragraph found: 9781451801873
file corropted: 9781451801255
200 / 344 
no paragraph found: 9781451801927
no paragraph found: 9781451801958
300 / 344 
no paragraph found: 9781451802245
no paragraph found: 9781451802252
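
The merging step in the loop above can be isolated and tried on toy data. The sketch below uses a hypothetical `Doc` dataclass as a stand-in for `data_util.document` (only the fields needed here) and a made-up `merge_parts` helper; it is meant to illustrate the grouping logic, not to replace the loop.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for data_util.document; only the fields used here."""
    series_id: str
    file_id: str
    paras: list = field(default_factory=list)


def merge_parts(docs):
    """Combine documents sharing a file_id, keeping paragraphs in input order."""
    merged = {}
    for doc in docs:
        if doc.file_id in merged:
            merged[doc.file_id].paras.extend(doc.paras)
        else:
            # Copy the list so the caller's object is not mutated by later parts.
            merged[doc.file_id] = Doc(doc.series_id, doc.file_id, list(doc.paras))
    return merged


parts = [
    Doc("00160", "111", ["a1", "a2"]),
    Doc("00160", "222", ["b1"]),
    Doc("00160", "111", ["a3"]),
]
merged = merge_parts(parts)
print(merged["111"].paras)  # ['a1', 'a2', 'a3']
```

Paragraph order follows input order. Since `os.listdir` returns names in arbitrary order, sorting `files` before the loop may be worth considering if the parts of a document should stay in sequence.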


In [20]:
## get one sample result
print(doc_dict['9781451800203'].file_id)
print(doc_dict['9781451800203'].series_id)
print(doc_dict['9781451800203'].paras[:2])

9781451800203
00160
['1. Undertaking a debt sustainability analysis (DSA) for Afghanistan presents unique challenges. The lack of complete debt information, the presence of disputed or unverified claims, and the almost total absence of historical data on key macroeconomic variables precludes use of the Fund’s standard template for assessing external sustainability.2 Even the more nuanced approach to debt sustainability for low income countries presents difficulties, given that standard assessments of institutional capacity (for example, through the World Bank’s Country Policy and Institutional Assessments—CPIA) have yet to be completed for Afghanistan.', '2. Data and other constraints notwithstanding, the need for an assessment of Afghanistan’s debt sustainability over the medium‐ and long-term remains. Particularly in the post-election environment, as ministries and other government institutions come to grips with their respective portfolios, a quantitative assessment of how the debt 

### 5. Now we save the processed data to a Python pickle file so that we can easily load it again later

In [21]:
## save our doc_dict object into a pickle file
with open(os.path.join(extract_path,'processed_xml.p'), "wb") as f:
    pickle.dump(doc_dict,f)

In [22]:
## we can read it back from the pickle file
with open(os.path.join(extract_path,'processed_xml.p'), "rb") as f:
    doc_dict_2 = pickle.load(f)

In [23]:
doc_dict_2['9781451800203'].paras[0]

'1. Undertaking a debt sustainability analysis (DSA) for Afghanistan presents unique challenges. The lack of complete debt information, the presence of disputed or unverified claims, and the almost total absence of historical data on key macroeconomic variables precludes use of the Fund’s standard template for assessing external sustainability.2 Even the more nuanced approach to debt sustainability for low income countries presents difficulties, given that standard assessments of institutional capacity (for example, through the World Bank’s Country Policy and Institutional Assessments—CPIA) have yet to be completed for Afghanistan.'
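
The save and load steps can be checked in isolation with a small round trip. The sketch below writes a toy dictionary (made-up data, plain dicts instead of document objects) to a temporary directory and reads it back, using context managers so the file handles are closed promptly.

```python
import os
import pickle
import tempfile

# Toy stand-in for doc_dict; the real values are data_util.document objects.
records = {"9781451800203": {"series_id": "00160", "paras": ["first", "second"]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_xml.p")
    with open(path, "wb") as fh:
        pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == records)  # True
```

One thing to keep in mind: pickle stores class instances by reference to their class, so loading the real `processed_xml.p` later requires `data_util` to be importable in that session. Pickle files should also only be loaded from sources you trust.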

### Now with clean text data, you can move on to search and analysis