## Process XML files

This notebook shows one way to process XML files: extract all paragraphs from each file and save them in a pickle file for later use.

In [None]:
import os 
python_root = '..'
import sys
sys.path.insert(0, python_root)
import data_util
from bs4 import BeautifulSoup
import pickle

### 1. Download some sample data if you don't have it yet

In [None]:
### specify download path and extract path 
download_path = "staff_reports.zip"
download_link = "https://www.dropbox.com/s/wi37fy1apjuiyqt/staff_reports.zip?dl=1"
extract_path = "../data"  # place data in Python project root folder

In [None]:
## details of the download_data function are in the data_util module under python_root
data_util.download_data(download_path,download_link,extract_path)
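
The `download_data` helper lives in `data_util` and is not shown in this notebook. As a rough idea of what such a helper could do, here is a minimal sketch using only the standard library. The name `fetch_and_extract` and its behaviour (skip the download when the zip file is already on disk, then unpack it) are assumptions for illustration, not the actual `data_util` implementation.

```python
import os
import urllib.request
import zipfile


def fetch_and_extract(zip_path, url, extract_dir):
    """Hypothetical stand-in for data_util.download_data (assumed behaviour)."""
    # Only hit the network when the archive is not already on disk.
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    os.makedirs(extract_dir, exist_ok=True)
    # Unpack every member of the archive into extract_dir.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
        return archive.namelist()
```

Called as `fetch_and_extract(download_path, download_link, extract_path)`, it would mirror the argument order used in the cell above.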

### 2. Now we can start processing the XML files

In [None]:
xml_path = "../data/xmls"
files = os.listdir(xml_path)             ## list all files in xml_path
files = [f for f in files if '_' in f]   ## keep only files with "_" in their names;
                                         ## that is just how our XML files are named:
                                         ## files without "_" are headers that don't
                                         ## contain any information
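
`os.listdir` returns names in arbitrary order, so `files[0]` may differ between machines. A minimal sketch of a slightly stricter filter, which also requires the `.xml` extension and sorts the result. The helper name `select_xml_files` and the sample names are made up for illustration.

```python
def select_xml_files(names):
    """Keep names that contain '_' and end in .xml, in sorted order."""
    return sorted(n for n in names if '_' in n and n.lower().endswith('.xml'))


# made-up file names following the pattern used in this notebook
sample = ['001-222_A002.xml', '001-222.xml', '001-222_A001.xml', 'notes_old.txt']
selected = select_xml_files(sample)
print(selected)
```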


    
Here we use the Beautiful Soup package to read the XML files. We first define a function that reads one file:

In [None]:
def read_xml(file_path):
    with open(file_path,'r',encoding='utf8') as f:
        soup = BeautifulSoup(f, 'xml')
    
    return soup

In [None]:
## see what the result looks like
res = read_xml(os.path.join(xml_path,files[0]))
print(res.contents)

Some basic operations on the parsed XML file:

In [None]:
## get all paragraphs 
paras = res.body.find_all('p')
print(paras[0])

In [None]:
## get all tables
tables = res.body.find_all('table-wrap')
print(tables[0].title)  ## only print the title of the first table

For more Beautiful Soup operations, see the documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
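
To make the two lookups above concrete without needing the sample data, here is a small self-contained sketch on an inline XML string. It uses the standard library's `xml.etree.ElementTree` in place of Beautiful Soup so that it runs without extra packages. The XML snippet is invented and only mimics the `body`, `p` and `table-wrap` tags used above.

```python
import xml.etree.ElementTree as ET

# invented snippet; real files are much larger
SAMPLE = """<article><body>
  <p>First <italic>paragraph</italic>.</p>
  <p>Second paragraph.</p>
  <table-wrap><title>Table 1. Example</title></table-wrap>
</body></article>"""

root = ET.fromstring(SAMPLE)
body = root.find('body')

# iter() walks all descendants, much like find_all() in Beautiful Soup
para_texts = [''.join(p.itertext()) for p in body.iter('p')]
table_titles = [t.findtext('title') for t in body.iter('table-wrap')]

print(para_texts)
print(table_titles)
```

Joining `itertext()` keeps the text of nested inline tags such as `italic`, which is usually what you want when collecting paragraph text.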



### 3. To keep the notebook cleaner, I made a document object in the data_util module

- The document object takes 3 arguments: series_id, file_id, xml_path.
- It returns an object with a few fields: series_id, file_id, paras, meta.
- Our XML files are named as follows: [xxxxx]-[xxxxxxxxx]_A[xxx].xml
- The first part is the series id and the second part is the document id. We extract both so that we can use them to look up extra metadata in the metadata sheet.

In [None]:
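def get_ids(xml):
    """
    input  : xml file name, e.g. [series id]-[file id]_A[xxx].xml
    return : series id and file id
    raises : ValueError if the name does not split into exactly two
             parts at '-' and then at '_'
    """
    series_id, xml_name = xml.split('-')
    file_id, _ = xml_name.split('_')
    return series_id, file_id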

In [None]:
ids = get_ids(files[0])
print(ids)
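
`get_ids` raises a `ValueError` when a name has no `-` or more than one. If that matters, a regular expression can state the expected pattern explicitly. This is a sketch under the assumption that neither id contains `-` or `_`. The name `parse_ids` and the series id `001` in the sample are hypothetical.

```python
import re

# [series id]-[file id]_[anything].xml
NAME_RE = re.compile(r'^(?P<series>[^_-]+)-(?P<file>[^_-]+)_.*\.xml$')


def parse_ids(file_name):
    """Return (series_id, file_id), or None if the name does not match."""
    match = NAME_RE.match(file_name)
    if match is None:
        return None
    return match.group('series'), match.group('file')


print(parse_ids('001-9781451800203_A001.xml'))
print(parse_ids('header.xml'))
```

Returning `None` instead of raising lets the caller decide whether to skip or report a badly named file.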

In [None]:
doc_test = data_util.document(ids[0],ids[1],os.path.join(xml_path,files[0]))

In [None]:
doc_test.series_id

In [None]:
doc_test.file_id

In [None]:
doc_test.paras[:2]
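
The `document` class is defined in `data_util` and is not reproduced here. Purely as an illustration of the interface described above (three arguments in, four fields out), a class like it could look roughly as follows. Everything in this sketch is an assumption: the name `SimpleDocument`, the use of `xml.etree.ElementTree`, and the empty `meta` dict are not taken from `data_util`.

```python
import xml.etree.ElementTree as ET


class SimpleDocument:
    """Hypothetical stand-in for data_util.document (not the real class)."""

    def __init__(self, series_id, file_id, xml_path):
        self.series_id = series_id
        self.file_id = file_id
        self.meta = {}  # assumed placeholder for metadata looked up later
        root = ET.parse(xml_path).getroot()
        # collect the text of every <p> element, dropping empty ones
        texts = (''.join(p.itertext()).strip() for p in root.iter('p'))
        self.paras = [t for t in texts if t]
```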

### 4. Now we loop through all the XML files and turn them into document objects for later use

In [None]:
files[0]

In [None]:
doc_dict = dict()
total_length = len(files)
print('converting {} xmls into document objects ......'.format(total_length))
for idx, file_name in enumerate(files):
    f_path = os.path.join(xml_path, file_name)
    try:
        series_id, file_id = get_ids(file_name)
    except ValueError:
        ## get_ids could not split the name into the expected parts
        print("file name is not consistent: ", file_name)
        continue

    doc = data_util.document(series_id, file_id, f_path)
    try:
        if doc.file_id in doc_dict:
            ## several xml files can share one file id: merge their paragraphs
            doc_dict[doc.file_id].paras.extend(doc.paras)
        else:
            doc_dict[doc.file_id] = doc
    except Exception:
        print("could not add document: ", doc.file_id)

    if (idx+1) % 100 == 0:
        print('{} / {} '.format(idx+1, total_length))
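
The key step in the loop is the merge: when several XML files share one file id, their paragraphs end up in a single entry. A toy sketch of that grouping, with plain lists in place of document objects. The file names and paragraph strings are invented.

```python
from collections import defaultdict

# (file name, paragraphs) pairs standing in for parsed documents
parsed = [
    ('001-111_A001.xml', ['intro']),
    ('001-111_A002.xml', ['analysis', 'conclusion']),
    ('001-222_A001.xml', ['summary']),
]

paras_by_file_id = defaultdict(list)
for name, paras in parsed:
    # same split as get_ids: the file id sits between '-' and '_'
    file_id = name.split('-')[1].split('_')[0]
    paras_by_file_id[file_id].extend(paras)

merged = dict(paras_by_file_id)
print(merged)
```

Paragraphs are appended in the order the files are visited, so sorting `files` first keeps the merged order stable.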
    

In [None]:
## look at one sample result
print(doc_dict['9781451800203'].file_id)
print(doc_dict['9781451800203'].series_id)
print(doc_dict['9781451800203'].paras[:2])

### 5. Now we save the processed data to a Python pickle file, so that we can easily load it again later

In [None]:
## save our doc_dict object into a pickle file
with open(os.path.join(extract_path,'processed_xml.p'), "wb") as f:
    pickle.dump(doc_dict, f)

In [None]:
## we can read it back from the pickle file
with open(os.path.join(extract_path,'processed_xml.p'), "rb") as f:
    doc_dict_2 = pickle.load(f)

In [None]:
doc_dict_2['9781451800203'].paras[0]
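
One practical note on reloading: pickle stores instances by reference to their class, so the module that defines the class (here `data_util`) must be importable when `pickle.load` runs. The round trip itself can be tried in isolation. The sketch below uses plain dicts with invented content and a temporary directory, so it has no dependency on `data_util` or the sample data.

```python
import os
import pickle
import tempfile

# invented content standing in for doc_dict
docs = {'111': {'series_id': '001', 'paras': ['intro', 'analysis']}}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, 'processed_xml.p')
    # the with blocks close the file handles even if an error occurs
    with open(pickle_path, 'wb') as f:
        pickle.dump(docs, f)
    with open(pickle_path, 'rb') as f:
        restored = pickle.load(f)

print(restored['111']['paras'])
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.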

### Now with clean text data, you can move on to search and analysis