# Parsing XML Files
The code below works as follows:
* Package **os** is used to list and locate the files.
* Package **pandas** is used to create dataframes, i.e. structures to store the data.
* Package **ElementTree** reads the nodes/tags of the XML structure and gives us access to these elements.

More detailed comments accompany each cell below.

In [135]:
import os #Access OS functions.

path = '/home/ishu/Challenge_1/Dataset/' #Folder that holds the XML files.

file_list = [] #List to store the full paths of the XML files in the folder.

for filename in os.listdir(path): #Loop over the folder and keep only XML files.
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path, filename)
    print (fullname)
    file_list.append(fullname)
    

/home/ishu/Challenge_1/Dataset/mail_depers_1.xml
/home/ishu/Challenge_1/Dataset/mail_depers_2.xml
/home/ishu/Challenge_1/Dataset/mail_depers_7.xml
/home/ishu/Challenge_1/Dataset/mail_depers_0.xml
/home/ishu/Challenge_1/Dataset/mail_depers_9.xml
/home/ishu/Challenge_1/Dataset/mail_depers_4.xml
/home/ishu/Challenge_1/Dataset/mail_depers_3.xml
/home/ishu/Challenge_1/Dataset/mail_depers_6.xml


In [156]:
file_list #Check the output.

['/home/ishu/Challenge_1/Dataset/mail_depers_1.xml',
 '/home/ishu/Challenge_1/Dataset/mail_depers_2.xml',
 '/home/ishu/Challenge_1/Dataset/mail_depers_7.xml',
 '/home/ishu/Challenge_1/Dataset/mail_depers_0.xml',
 '/home/ishu/Challenge_1/Dataset/mail_depers_9.xml',
 '/home/ishu/Challenge_1/Dataset/mail_depers_4.xml',
 '/home/ishu/Challenge_1/Dataset/mail_depers_3.xml',
 '/home/ishu/Challenge_1/Dataset/mail_depers_6.xml']
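
The listing above comes back in whatever order `os.listdir` returns, which is why the files are not in numeric order. If a stable order matters, one possible alternative is `pathlib` with `sorted`. This is only a sketch: the helper name `list_xml_files` and the demo file names are made up for illustration and are not part of the original notebook.

```python
from pathlib import Path
import tempfile

def list_xml_files(folder):
    """Return the full paths of all .xml files in `folder`, sorted by name."""
    return sorted(str(p) for p in Path(folder).glob('*.xml'))

# Demo on a throwaway folder so the sketch runs anywhere (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ['mail_2.xml', 'mail_0.xml', 'notes.txt']:
        (Path(tmp) / name).write_text('<DOCUMENT/>')
    found = [Path(p).name for p in list_xml_files(tmp)]

print(found)  # ['mail_0.xml', 'mail_2.xml']
```

Sorting by name is plain string sorting, so `mail_10.xml` would come before `mail_2.xml`; for the single-digit file names used here that makes no difference.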

In [152]:
import pandas as pd #Import pandas to work with the data.

df_cols = ["from","to","subject","body"] #Column names for the dataframe.
out_df = pd.DataFrame(columns = df_cols) #Create an empty dataframe with these columns.

In [153]:
out_df #Empty as of now.

Unnamed: 0,from,to,subject,body


In [145]:
import xml.etree.ElementTree as ET #Need this library to parse XML files.

import codecs #Need this library to open XML files (different XML files are saved with different encodings).
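
As a side note on encodings: when `ET.parse` is given a file path instead of an already opened text stream, it reads the raw bytes and the parser uses the encoding declared in the XML header. The sketch below illustrates this with a made-up Latin-1 sample file. Whether the real dataset files declare their encoding is an assumption; if they do not, opening them with `codecs` as this notebook does remains a reasonable choice.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample: a Latin-1 encoded document that declares its own encoding.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<DOCUMENT>caf\u00e9</DOCUMENT>'
).encode('ISO-8859-1')

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.xml')
    with open(sample_path, 'wb') as fh:
        fh.write(xml_bytes)
    # Given a path, ET.parse reads bytes and honours the declared encoding.
    root_text = ET.parse(sample_path).getroot().text

print(root_text)  # café
```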

In [154]:
rows = [] #Collect one row per file; DataFrame.append was removed in pandas 2.0.

for file in file_list: #Go through each XML file and read its contents.
    print (file) #Progress check only; we don't really need this.
    with codecs.open(file, mode='r', encoding='ISO-8859-1') as target_file: #Open the XML file and close it when done.
        xtree = ET.parse(target_file)
    root = xtree.getroot() #Get the root of the XML file. In our case it is "<DOCUMENT>".
    fields = dict.fromkeys(df_cols) #Reset per file so a missing tag gives None, not the previous file's value.
    for child_of_root in root: #Children carry a NAME attribute such as 'FROM', 'TO', 'SUBJECT', 'BODY'.
        s_name = child_of_root.attrib.get("NAME")
        if s_name in ('FROM', 'TO', 'SUBJECT', 'BODY'):
            fields[s_name.lower()] = child_of_root.text
    rows.append(fields) #Keep the relevant information for the 4 tags defined above.
    print ("Done!") #Progress check only; we don't really need this.

out_df = pd.DataFrame(rows, columns = df_cols) #Build the dataframe once from all collected rows.

/home/ishu/Challenge_1/Dataset/mail_depers_1.xml
Done!
/home/ishu/Challenge_1/Dataset/mail_depers_2.xml
Done!
/home/ishu/Challenge_1/Dataset/mail_depers_7.xml
Done!
/home/ishu/Challenge_1/Dataset/mail_depers_0.xml
Done!
/home/ishu/Challenge_1/Dataset/mail_depers_9.xml
Done!
/home/ishu/Challenge_1/Dataset/mail_depers_4.xml
Done!
/home/ishu/Challenge_1/Dataset/mail_depers_3.xml
Done!
/home/ishu/Challenge_1/Dataset/mail_depers_6.xml
Done!
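
The per-file extraction step in the loop above can also be tried on its own with an in-memory document. The following is a minimal sketch under assumptions: the child tag name `FIELD`, the sample addresses, and the helper name `parse_mail` are hypothetical. The notebook only tells us that the root is `<DOCUMENT>` and that each child has a `NAME` attribute.

```python
import xml.etree.ElementTree as ET

FIELDS = ('FROM', 'TO', 'SUBJECT', 'BODY')

def parse_mail(xml_text):
    """Return a dict with the four mail fields; missing fields stay None."""
    root = ET.fromstring(xml_text)
    record = {name.lower(): None for name in FIELDS}
    for child in root:
        name = child.attrib.get('NAME')
        if name in FIELDS:
            record[name.lower()] = child.text
    return record

# Hypothetical sample; the real files may use a different child tag name.
sample = """<DOCUMENT>
  <FIELD NAME="FROM">a@example.com</FIELD>
  <FIELD NAME="TO">b@example.com</FIELD>
  <FIELD NAME="SUBJECT">Hello</FIELD>
</DOCUMENT>"""

record = parse_mail(sample)
print(record)
```

Because the sample has no `BODY` element, `record['body']` is `None`, which makes a missing field visible instead of silently reusing a value from another file.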


In [155]:
out_df #Final output.

Unnamed: 0,from,to,subject,body
0,dummyfrom1@rabobank.nl,dummyto1@rabobank.nl,Dummy Subject for this email 1,\nNieuw-Vennep 24-09-2018.\n\nGoedemorgen ik h...
1,dummyfrom2@rabobank.nl,dummyto2@rabobank.nl,Dummy Subject for this email 2,"\nGoedemorgen heer, mevrouw, \n\nConform het t..."
2,dummyfrom7@rabobank.nl,dummyto7@rabobank.nl,Dummy Subject for this email 7,"\nGeachte heer, mevrouw,\n\nNaar aanleiding va..."
3,dummyfrom0@rabobank.nl,dummyto0@rabobank.nl,Dummy Subject for this email 0,"\nHello, \nMy mother, Frans MARLIER want to tr..."
4,dummyfrom9@rabobank.nl,dummyto9@rabobank.nl,Dummy Subject for this email 9,\nHi again! \nI want to confirm that you can c...
5,dummyfrom4@rabobank.nl,dummyto4@rabobank.nl,Dummy Subject for this email 4,"\nGeachte heer, mevrouw, \nIn verband met de a..."
6,dummyfrom3@rabobank.nl,dummyto3@rabobank.nl,Dummy Subject for this email 3,"\nGoedemorgen, \n\nIk ben de wettelijk vertege..."
7,dummyfrom6@rabobank.nl,dummyto6@rabobank.nl,Dummy Subject for this email 6,\nBeste \n\nIk heb enige tijd geleden een schr...
