# XML Parser 

Some data arrives as XML and has to be pulled apart into tabular form. This notebook does that for one file, `consolidated.xml`.

In [28]:
from bs4 import BeautifulSoup
import lxml
import os
import pandas as pd

## Read in Data

In [3]:
xml_data = ""
xml_filepath = "consolidated.xml"

with open(xml_filepath, "r") as file:
    xml_data = file.read()
    
print(type(xml_data))
print(len(xml_data))

<class 'str'>
919037


## Parse with BS4
In the future we may be able to recognize the structure of the XML file automatically and fill out distinct tables. For now, we inspect the file manually and decide which element types to extract.

- ROOT: sdnList
    - publshInformation
    - sdnEntry
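
As a first step toward recognizing the structure automatically, the parent-to-child tag relationships can be collected in a single pass. This is a minimal sketch rather than the notebook's method: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs on its own, and `sample_xml` and `child_tag_map` are hypothetical names with made-up values.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature of the file's layout; all values are made up.
sample_xml = """
<sdnList>
  <publshInformation/>
  <sdnEntry>
    <uid>1</uid>
    <lastName>EXAMPLE</lastName>
    <sdnType>Individual</sdnType>
    <programList><program>DEMO</program></programList>
  </sdnEntry>
  <sdnEntry>
    <uid>2</uid>
    <lastName>EXAMPLE ORG</lastName>
    <sdnType>Entity</sdnType>
    <programList><program>DEMO</program><program>DEMO2</program></programList>
  </sdnEntry>
</sdnList>
"""


def child_tag_map(root):
    """Map each tag name to the sorted names of the child tags seen under it."""
    children = defaultdict(set)
    for parent in root.iter():
        for child in parent:
            children[parent.tag].add(child.tag)
    return {tag: sorted(names) for tag, names in children.items()}


structure = child_tag_map(ET.fromstring(sample_xml))
for tag, names in structure.items():
    print(tag, "->", names)
```

Leaf tags such as `uid` never appear as keys, so the keys of `structure` are exactly the container elements that could become separate tables.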

Each `sdnEntry` is one of two entry types, Individual or Entity, and each type carries a different set of child elements.

**Individual**
- uid
- firstName
- lastName
- sdnType
- programList
    - program...
- akaList
    - aka
        - uid
        - type
        - category
        - lastName
- dateOfBirthList
    - dateOfBirthItem
        - uid
        - dateOfBirth
        - mainEntry
- placeOfBirthList
    - placeOfBirthItem
        - uid
        - placeOfBirth
        - mainEntry

**Entity**
- uid
- lastName (Organization Name)
- sdnType
- programList
    - program...
- idList (Registration Numbers)
    - uid
    - idType
    - idNumber   
- akaList
- addressList

We don't necessarily need all this information. We will implement the following mapping to create our output table:

| Output Column | Map |
|-|-|
| uid | uid |
| first_name | firstName |
| last_name | lastName |
| entry_type | sdnType |
| aka_uids | [akaList>aka>uid] |
| birth_date | dateOfBirthList>dateOfBirthItem>dateOfBirth |
| birth_place | placeOfBirthList>placeOfBirthItem>placeOfBirth |
| aliases | [akaList>aka>lastName] |



Complications may arise with `programList` and `akaList`, because they may include multiple subitems. For this reason, we'll deal with them last.  
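
The mapping can also be written down as data and applied by one generic function. The sketch below is an illustration under stated assumptions, not the approach used in the cells that follow: it relies on the standard library's `xml.etree.ElementTree` so that it runs without the data file, it covers the single-valued columns plus `aka_uids`, and `sample_entry`, `column_map`, and `map_entry` are hypothetical names with made-up values.

```python
import xml.etree.ElementTree as ET

# Hypothetical single entry; all values are made up.
sample_entry = ET.fromstring("""
<sdnEntry>
  <uid>1</uid>
  <firstName>Jane</firstName>
  <lastName>EXAMPLE</lastName>
  <sdnType>Individual</sdnType>
  <akaList>
    <aka><uid>10</uid><lastName>SAMPLE</lastName></aka>
    <aka><uid>11</uid><lastName>DEMO</lastName></aka>
  </akaList>
  <dateOfBirthList>
    <dateOfBirthItem><uid>20</uid><dateOfBirth>1970</dateOfBirth></dateOfBirthItem>
  </dateOfBirthList>
</sdnEntry>
""")

# Output column -> (path below sdnEntry, whether the column holds a list).
column_map = {
    "uid": ("uid", False),
    "first_name": ("firstName", False),
    "last_name": ("lastName", False),
    "entry_type": ("sdnType", False),
    "aka_uids": ("akaList/aka/uid", True),
    "birth_date": ("dateOfBirthList/dateOfBirthItem/dateOfBirth", False),
    "birth_place": ("placeOfBirthList/placeOfBirthItem/placeOfBirth", False),
}


def map_entry(entry, column_map):
    """Build one output row; missing single values become empty strings."""
    row = {}
    for column, (path, is_list) in column_map.items():
        if is_list:
            # Collect every match, e.g. one uid per aka
            row[column] = [node.text for node in entry.findall(path)]
        else:
            row[column] = entry.findtext(path, default="")
    return row


row = map_entry(sample_entry, column_map)
print(row)
```

Because a plain path such as `lastName` only matches direct children here, the entry's own `lastName` is not confused with the `lastName` nested inside each `aka`.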




In [25]:
soup = BeautifulSoup(xml_data, features="xml")

entities = soup.select("sdnEntry")
print(len(entities))

test_element = entities[0]

443


The manual approach is to iterate over the entries one by one, collecting each entry's values in a dictionary that can later become a dataframe row.

In [37]:
key_dict = {
    "uid": "",
    "firstName": "",
    "lastName": "",
    "sdnType": ""
}

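def parse_element(element, key_dict):
    """Return the text of each direct child tag named in key_dict ("" if absent)."""
    parse_dict = {}

    for tag_name in key_dict.keys():
        # recursive=False limits the search to direct children, so a nested tag
        # with the same name (e.g. aka > lastName) is never picked up by mistake.
        child = element.find(tag_name, recursive=False)
        parse_dict[tag_name] = child.text if child is not None else ""

    return parse_dict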

function_test = parse_element(test_element, key_dict)
function_test

{'uid': '9639',
 'firstName': 'Ismail Abdul Salah',
 'lastName': 'HANIYA',
 'sdnType': 'Individual'}

In [60]:
entity_data = []

for entity in entities:
    data = parse_element(entity, key_dict)
    entity_data.append(data)

entity_df = pd.DataFrame(entity_data)
entity_df.head()

Unnamed: 0,uid,firstName,lastName,sdnType
0,9639,Ismail Abdul Salah,HANIYA,Individual
1,9640,Mohammed,ABU TEIR,Individual
2,9641,Jamileh Abdullah,AL-SHANTI,Individual
3,9642,Mohammed Jamal,NU'MAN ALAEDDIN,Individual
4,9643,Yasser Daoud,MANSOUR,Individual


Now, let's look at the elements that may occur more than once per entry. Once we know how many subelements each list-type element typically holds, we can decide how to proceed. For now, we'll just count the occurrences of each.

In [40]:
list_dict = {
    "program": "",
    "aka": "",
    "dateOfBirthItem": "",
    "placeOfBirthItem": ""
}

def parse_ele_lists(element, list_dict):
    
    parse_list_dict = {}
    
    for value in list_dict.keys():
        parse_list_dict[value] = len(element.select(value))
    
    return parse_list_dict

    
function_test = parse_ele_lists(test_element, list_dict)
function_test


{'program': 1, 'aka': 2, 'dateOfBirthItem': 1, 'placeOfBirthItem': 1}

In [55]:
counts_data = []

for entity in entities:
    data = parse_ele_lists(entity, list_dict)
    counts_data.append(data)
    
counts_df = pd.DataFrame(counts_data)
counts_df.describe()

Unnamed: 0,program,aka,dateOfBirthItem,placeOfBirthItem
count,443.0,443.0,443.0,443.0
mean,1.155756,2.541761,0.167043,0.045147
std,0.381272,2.015762,0.373436,0.207861
min,1.0,0.0,0.0,0.0
25%,1.0,1.0,0.0,0.0
50%,1.0,2.0,0.0,0.0
75%,1.0,3.0,0.0,0.0
max,4.0,26.0,1.0,1.0


The summary above shows that `dateOfBirthItem` and `placeOfBirthItem` appear at most once per entry. `program` has a maximum of 4 occurrences, while `aka` has a maximum of 26.

To display these values, we will add additional columns for each `program` and `aka`, to a maximum of 5 additional values (for now). 

In [None]:
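def parse_ele_lists_2(element, list_dict):
    """Expand list-type values that occur exactly once; report a count otherwise."""
    parse_list_dict = {}

    for value in list_dict.keys():
        matches = element.select(value)

        if len(matches) == 1:
            item = matches[0]
            # Assumption: this branch was left unfinished. Here the item's direct
            # child tags become keys; a leaf tag such as <program> has no
            # children, so its own text is kept instead.
            item_keys = [child.name for child in item.find_all(recursive=False)]
            if item_keys:
                parse_list_dict[value] = {
                    key: item.find(key, recursive=False).text for key in item_keys
                }
            else:
                parse_list_dict[value] = item.text
        else:
            # Zero or several occurrences: keep only the count for now
            parse_list_dict[value] = len(matches)

    return parse_list_dict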

test = parse_ele_lists_2(test_element, list_dict)
test
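
One possible way to get from counts to the numbered columns described above is to spread each list of values over a fixed set of column names. This is a hedged sketch in plain Python; `spread_values`, the `_count` suffix, and the sample values are hypothetical and not part of the notebook.

```python
def spread_values(values, prefix, max_columns=5):
    """Spread a list over prefix_1 ... prefix_N columns, padding with ""."""
    columns = {}
    for index in range(max_columns):
        value = values[index] if index < len(values) else ""
        columns[f"{prefix}_{index + 1}"] = value
    # Keep the full count so truncation beyond max_columns stays visible
    columns[f"{prefix}_count"] = len(values)
    return columns


# Made-up values: two programs, seven aliases
program_columns = spread_values(["DEMO", "DEMO2"], "program")
aka_columns = spread_values([f"ALIAS {n}" for n in range(1, 8)], "aka")
print(program_columns)
print(aka_columns)
```

With a cap of 5, an entry holding 26 `aka` items would lose 21 of them, which is why the sketch also records the untruncated count.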

## Save Results

In [None]:
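# index=False keeps the row index out of the file, so reading it back
# does not produce a stray "Unnamed: 0" column.
entity_df.to_csv("entity_data.csv", index=False)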

## Code Scraps 

In [57]:
unique_elements = {tag.name for tag in soup.descendants if tag.name}

