<h1>The XML Structure</h1>

<img src="https://www.w3schools.com/xml/nodetree.gif"/>

Source: https://www.w3schools.com/xml/xml_tree.asp

<h1>Importing Libraries</h1>

In [1]:
import pandas as pd
from lxml import etree
pd.set_option('display.max_columns', None)

<h1>Basic Set Up</h1>

In [2]:
tree = etree.parse('data/consolidated_20220616.xml') # returns an ElementTree object

root = tree.getroot() # returns the root element of the tree
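To experiment without the data file, a small document with the same two-level layout can be parsed from a string. This is only a minimal sketch: the sample XML is made up for illustration, and the names `SAMPLE_XML`, `sample_root` and `child_tags` are hypothetical. It uses the standard library's `xml.etree.ElementTree` so it runs without extra installs; for the calls shown here, `from lxml import etree` should behave the same way.

```python
import xml.etree.ElementTree as etree  # stand-in for `from lxml import etree`

# Made-up sample that mimics the INDIVIDUALS / ENTITIES layout of the data file
SAMPLE_XML = """
<CONSOLIDATED_LIST>
  <INDIVIDUALS>
    <INDIVIDUAL><DATAID>1</DATAID><FIRST_NAME>ALPHA</FIRST_NAME></INDIVIDUAL>
    <INDIVIDUAL><DATAID>2</DATAID><FIRST_NAME>BRAVO</FIRST_NAME></INDIVIDUAL>
  </INDIVIDUALS>
  <ENTITIES>
    <ENTITY><DATAID>3</DATAID><FIRST_NAME>CHARLIE LTD</FIRST_NAME></ENTITY>
  </ENTITIES>
</CONSOLIDATED_LIST>
"""

# fromstring() parses text and returns the root element directly,
# whereas parse() reads a file and returns an ElementTree
sample_root = etree.fromstring(SAMPLE_XML.strip())

child_tags = [child.tag for child in sample_root]
print(child_tags)           # tags of the direct children of the root
print(len(sample_root[0]))  # number of records under the first child
```

The sample root is kept in its own variable so that it does not overwrite the `root` parsed from the real file.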

<h1>Basic Navigation</h1>

In [3]:
# "root" behaves like a Python list of its child nodes. To get the number of nodes at the first level:

print(f'Number of nodes in the XML: {len(root)}') # Use len() to get the number of child nodes
print(f'The first node is named "{root[0].tag}"') # Access the first child node --> index = 0
print(f'The second node is named "{root[1].tag}"') # Access the second child node --> index = 1

Number of nodes in the XML: 2
The first node is named "INDIVIDUALS"
The second node is named "ENTITIES"


In [4]:
print(f'The name (element / tag) of node is :"{root[0].tag}"')
print(f'The value (text) of node is :"{root[0].text}"')
print(f'The attribute of node is :"{root[0].attrib}"')

# ".tag", ".text" and ".attrib" are the key attributes (not methods) for extracting data from an XML file. This tutorial uses only ".tag" and ".text".

The name (element / tag) of node is :"INDIVIDUALS"
The value (text) of node is :"
        "
The attribute of node is :"{}"
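The whitespace-only text shown above is just the indentation between the opening tag and the first child node, and the empty dictionary means the node has no attributes. Since `.attrib` is not used again in this tutorial, here is a minimal sketch of how it behaves on an element that does have attributes. The `RECORD` element and its `id` and `source` attributes are made up for illustration and are not part of the data file; the standard library's `xml.etree.ElementTree` is used as a stand-in for lxml.

```python
import xml.etree.ElementTree as etree  # stand-in for `from lxml import etree`

# Hypothetical element with two attributes and one child node
sample_node = etree.fromstring(
    '<RECORD id="42" source="demo">\n    <NAME>Example</NAME>\n</RECORD>'
)

attributes = dict(sample_node.attrib)      # all attributes as a plain dict
record_id = sample_node.get("id")          # a single attribute; values are always strings
missing = sample_node.get("missing", "n/a")  # default returned when the attribute is absent

# .text of a container node is usually only the whitespace before its first child
has_real_text = bool(sample_node.text and sample_node.text.strip())

print(attributes, record_id, missing, has_real_text)
```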


In [5]:
# Iterate over "root" to get its direct child nodes:

for node in root:

    print(node.tag)

INDIVIDUALS
ENTITIES


In [6]:
# Create separate variables for the "INDIVIDUALS" and "ENTITIES" nodes

individuals = root[0]
entities = root[1]

In [7]:
# Use len() to get the number of records under the "INDIVIDUALS" node and "ENTITIES" node

print(f'Number of individuals: {len(individuals)}')
print(f'Number of entities: {len(entities)}')

Number of individuals: 700
Number of entities: 253
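Indexing with `root[0]` and `root[1]` depends on the order of the nodes in the file. A lookup by tag name avoids that dependency. The sketch below shows the idea on a made-up sample (the `LIST` root and the `sample_*` names are hypothetical), again with the standard library's `xml.etree.ElementTree` standing in for lxml.

```python
import xml.etree.ElementTree as etree  # stand-in for `from lxml import etree`

# Made-up sample: three individual records and one entity record
sample_root = etree.fromstring(
    "<LIST>"
    "<INDIVIDUALS><INDIVIDUAL/><INDIVIDUAL/><INDIVIDUAL/></INDIVIDUALS>"
    "<ENTITIES><ENTITY/></ENTITIES>"
    "</LIST>"
)

# find() returns the first direct child with the given tag, or None if absent
sample_individuals = sample_root.find("INDIVIDUALS")
sample_entities = sample_root.find("ENTITIES")

# findall() returns a list of all direct children with the given tag
n_individuals = len(sample_individuals.findall("INDIVIDUAL"))
n_entities = len(sample_entities.findall("ENTITY"))

print(n_individuals, n_entities)
```

Because `find()` returns `None` when the tag is missing, checking the result before using it is a reasonable precaution when the file layout is not guaranteed.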


<h1>.tag and .text</h1>

In [8]:
# The same functions and attributes used on "root" also work on any child node, at any depth

print(individuals[0])
print(individuals[0].tag)
print(individuals[0].text)

<Element INDIVIDUAL at 0x203efa8db80>
INDIVIDUAL
None


In [9]:
print(individuals[0][0]) # The first "node" of the first individual record
print(individuals[0][0].tag) # The name of the first node --> can be transformed to "column-name" in tabular form
print(individuals[0][0].text) # The innertext of the first node --> can be transformed to the "value" in tabular form

<Element DATAID at 0x203efa91a00>
DATAID
6908555


In [10]:
# Let's explore what's inside the first individual record!

for node in individuals[0]:

    print(node.tag, '|', node.text)

DATAID | 6908555
VERSIONNUM | 1
FIRST_NAME |  RI 
SECOND_NAME | WON HO
THIRD_NAME | None
UN_LIST_TYPE | DPRK
REFERENCE_NUMBER | KPi.033
LISTED_ON | 2016-11-30
COMMENTS1 | Ri Won Ho is a DPRK Ministry of State Security Official stationed in Syria supporting KOMID.

DESIGNATION | None
NATIONALITY | None
LIST_TYPE | None
LAST_DAY_UPDATED | None
INDIVIDUAL_ALIAS | None
INDIVIDUAL_ADDRESS | None
INDIVIDUAL_DATE_OF_BIRTH | None
INDIVIDUAL_PLACE_OF_BIRTH | None
INDIVIDUAL_DOCUMENT | None
SORT_KEY | None
SORT_KEY_LAST_MOD | None
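Two things stand out in this output: FIRST_NAME carries surrounding spaces, and many nodes print `None`, either because they are empty or because they may hold child nodes rather than text. A minimal sketch of one way to handle both cases follows. The helper `node_value`, the sample record, and the nested tags `QUALITY` and `ALIAS_NAME` are assumptions for illustration, not taken from the data file; the standard library's `xml.etree.ElementTree` stands in for lxml.

```python
import xml.etree.ElementTree as etree  # stand-in for `from lxml import etree`

# Made-up record: padded text, an empty node, and a node with nested children
sample_record = etree.fromstring(
    "<INDIVIDUAL>"
    "<FIRST_NAME> RI </FIRST_NAME>"
    "<THIRD_NAME/>"
    "<INDIVIDUAL_ALIAS>"
    "<QUALITY>Good</QUALITY><ALIAS_NAME>Example Alias</ALIAS_NAME>"
    "</INDIVIDUAL_ALIAS>"
    "</INDIVIDUAL>"
)

def node_value(node):
    """Return stripped text for a leaf node, or joined leaf texts for a nested node."""
    if len(node) == 0:  # leaf node: no children
        text = (node.text or "").strip()
        return text or None
    # nested node: collect "TAG=text" for every descendant leaf, in document order
    parts = [
        f"{leaf.tag}={leaf.text.strip()}"
        for leaf in node.iter()
        if len(leaf) == 0 and leaf.text and leaf.text.strip()
    ]
    return "; ".join(parts) or None

flat_record = {child.tag: node_value(child) for child in sample_record}
print(flat_record)
```

Joining nested values into one string keeps each record on a single row; a separate table per nested node type would be an alternative design when those values need to be queried individually.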


<h1>Create a DataFrame for individuals[0]</h1>

In [11]:
features = {}

for node in individuals[0]:

    features[node.tag] = node.text
    
df_features = pd.DataFrame(features, index=['0'])

In [12]:
df_features

Unnamed: 0,DATAID,VERSIONNUM,FIRST_NAME,SECOND_NAME,THIRD_NAME,UN_LIST_TYPE,REFERENCE_NUMBER,LISTED_ON,COMMENTS1,DESIGNATION,NATIONALITY,LIST_TYPE,LAST_DAY_UPDATED,INDIVIDUAL_ALIAS,INDIVIDUAL_ADDRESS,INDIVIDUAL_DATE_OF_BIRTH,INDIVIDUAL_PLACE_OF_BIRTH,INDIVIDUAL_DOCUMENT,SORT_KEY,SORT_KEY_LAST_MOD
0,6908555,1,RI,WON HO,,DPRK,KPi.033,2016-11-30,Ri Won Ho is a DPRK Ministry of State Security...,,,,,,,,,,,


<h1>Create a DataFrame for all individuals</h1>

In [13]:
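# DataFrame.append() was removed in pandas 2.0, so collect one dict per record
# and build the DataFrame once at the end (also much faster than appending row by row)
records = []

for individual in individuals:

    features = {}

    for node in individual:

        features[node.tag] = node.text  # tag becomes the column name, text the value

    records.append(features)

df_individuals = pd.DataFrame(records)  # tags missing from a record become NaN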

In [14]:
df_individuals.head(5)

Unnamed: 0,DATAID,VERSIONNUM,FIRST_NAME,SECOND_NAME,THIRD_NAME,UN_LIST_TYPE,REFERENCE_NUMBER,LISTED_ON,COMMENTS1,DESIGNATION,NATIONALITY,LIST_TYPE,LAST_DAY_UPDATED,INDIVIDUAL_ALIAS,INDIVIDUAL_ADDRESS,INDIVIDUAL_DATE_OF_BIRTH,INDIVIDUAL_PLACE_OF_BIRTH,INDIVIDUAL_DOCUMENT,SORT_KEY,SORT_KEY_LAST_MOD,NAME_ORIGINAL_SCRIPT,FOURTH_NAME,GENDER,TITLE,SUBMITTED_BY
0,6908555,1,RI,WON HO,,DPRK,KPi.033,2016-11-30,Ri Won Ho is a DPRK Ministry of State Security...,,,,,,,,,,,,,,,,
1,6908570,1,CHANG,CHANG HA,,DPRK,KPi.037,2016-11-30,,,,,,,,,,,,,,,,,
2,6908571,1,CHO,CHUN RYONG,,DPRK,KPi.038,2016-11-30,,,,,,,,,,,,,,,,,
3,6908858,1,EMRAAN,ALI,,Al-Qaida,QDi.430,2021-11-23,Senior member of Islamic State in Iraq and the...,,,,,,,,,,,,,,,,
4,6908565,1,JO,YONG CHOL,,DPRK,KPi.034,2016-11-30,Jo Yong Chol is a DPRK Ministry of State Secur...,,,,,,,,,,,,,,,,
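Every value read through `.text` is a string, so identifier and date columns may need converting before analysis. A minimal sketch on a small hand-made frame follows; the rows copy a few values from the output above, and the name `df_sample` is hypothetical.

```python
import pandas as pd

# Hand-made frame holding strings, as they would come out of node.text
df_sample = pd.DataFrame({
    "DATAID": ["6908555", "6908570"],
    "LISTED_ON": ["2016-11-30", "2016-11-30"],
    "FIRST_NAME": [" RI ", "CHANG"],
})

df_sample["DATAID"] = pd.to_numeric(df_sample["DATAID"])            # strings -> integers
df_sample["LISTED_ON"] = pd.to_datetime(df_sample["LISTED_ON"],
                                        format="%Y-%m-%d")          # strings -> timestamps
df_sample["FIRST_NAME"] = df_sample["FIRST_NAME"].str.strip()       # drop padding spaces

print(df_sample.dtypes)
```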


In [15]:
df_individuals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   DATAID                     700 non-null    object
 1   VERSIONNUM                 700 non-null    object
 2   FIRST_NAME                 700 non-null    object
 3   SECOND_NAME                691 non-null    object
 4   THIRD_NAME                 341 non-null    object
 5   UN_LIST_TYPE               700 non-null    object
 6   REFERENCE_NUMBER           700 non-null    object
 7   LISTED_ON                  700 non-null    object
 8   COMMENTS1                  612 non-null    object
 9   DESIGNATION                0 non-null      object
 10  NATIONALITY                0 non-null      object
 11  LIST_TYPE                  0 non-null      object
 12  LAST_DAY_UPDATED           0 non-null      object
 13  INDIVIDUAL_ALIAS           0 non-null      object
 14  INDIVIDUAL