Claes Pauline. Master Digital Text Analysis. Student ID: 20163274

## English XML data to TXT
All English data come in XML format. To query them in AntConc, they first have to be converted to plain TXT. I used the BeautifulSoup module to strip the XML tags and keep only the text.

In [1]:
import pandas as pd
import os, glob
from bs4 import BeautifulSoup

In [22]:
def clean_XML_to_txt(path, kind):
    '''
    Function that iterates over the PWD (so you first have to navigate to the directory that you want it 
    applied to), looks for all XML files in that directory, reads them in and removes all XML tags so that 
    only the cleaned text is left. The function then writes this cleaned text to a new file in a target 
    directory that you have to specify. 
    
    Arguments: 
            - path: path of the target directory for the cleaned text files
            - kind: a specification of the text kind (translation, reference, early, later, ...)
            
    '''
    
    os.makedirs(path, exist_ok=True) # create the target directory if it does not exist yet
    
    for file in glob.glob("*.xml"): # iterate over PWD (present working directory) to find each file with the .xml extension
        with open(file, "r", encoding="utf-8") as f: # open each file to read (assumes UTF-8 encoded XML)
            xml = f.read() # read in data
        
        # BeautifulSoup to clean text so it does not have any tags left;
        # the parser is named explicitly so BeautifulSoup does not have to guess one
        soup = BeautifulSoup(xml, "html.parser")
        clean_text = soup.get_text(" ") # join all text nodes with a space
        
        filename = os.path.splitext(file)[0] # file name without the .xml extension
        
        # write to new file in new folder
        with open(f"{path}/{kind}_{filename}_cleaned.txt", "w", encoding="utf-8") as t: 
            t.write(clean_text)
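
The function above depends on the present working directory, which is why every call further down is preceded by a `cd`. As an alternative, the source directory could be passed in explicitly. The following is only a minimal sketch of that idea, not the version used for the thesis data: the name `clean_xml_dir` and its `source_dir` / `target_dir` parameters are hypothetical, and it assumes UTF-8 encoded files and the `html.parser` backend.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def clean_xml_dir(source_dir, target_dir, kind):
    """Strip tags from every .xml file in source_dir and write TXT files to target_dir.

    Hypothetical variant of clean_XML_to_txt that does not rely on the PWD.
    Returns the list of files that were written.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if needed

    written = []
    for xml_path in sorted(source_dir.glob("*.xml")):
        xml = xml_path.read_text(encoding="utf-8")  # assumption: files are UTF-8
        soup = BeautifulSoup(xml, "html.parser")
        clean_text = soup.get_text(" ")  # all text nodes, joined with a space

        # same naming scheme as above: <kind>_<file name without .xml>_cleaned.txt
        out_path = target_dir / f"{kind}_{xml_path.stem}_cleaned.txt"
        out_path.write_text(clean_text, encoding="utf-8")
        written.append(out_path)
    return written
```

A call would then look like `clean_xml_dir("LaterReference", "AddData_cleaned", kind="LaterReference")`, without navigating to the folder first.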

In [None]:

# EXAMPLE ILLUSTRATION OF WHAT BEAUTIFUL SOUP DOES 

with open("A68475.xml", "r", encoding="utf-8") as f: 
    xml = f.read()

soup = BeautifulSoup(xml, "html.parser")    # xml is simply a string holding the contents of your XML file
text = soup.get_text(" ")    # join all text nodes with a space
print(text[:10000])


 
 
 
 Essays vvritten in French by Michael Lord of Montaigne, Knight of the Order of S. Michael, gentleman of the French Kings chamber: done into English, according to the last French edition, by Iohn Florio reader of the Italian tongue vnto the Soueraigne Maiestie of Anna, Queene of England, Scotland, France and Ireland, &c. And one of the gentlemen of hir royall priuie chamber 
 Essais. English 
 Montaigne, Michel de, 1533-1592. 
 
 
 
 1613 
 
 
 Approx. 3150 KB of XML-encoded text transcribed from 322 1-bit group-IV TIFF page images. 
 
 Text Creation Partnership, 
 Ann Arbor, MI ; Oxford (UK) : 
 2006-06 (EEBO-TCP Phase 1). 
 A68475 
 STC 18042 
 ESTC S111840 
 99847105 
 99847105 
 12117 
 
 This keyboarded and encoded edition of the
	       work described above is co-owned by the institutions
	       providing financial support to the Early English Books
	       Online Text Creation Partnership. This Phase I text is
	       available for reuse, according to the terms of  Creat
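
The output above starts with catalogue metadata (title statement, extent, identifiers, availability note) rather than with the essay text itself. If that header material should stay out of the AntConc corpus, one option is to extract only the body element. The snippet below is a minimal sketch on a made-up sample string. It assumes a TEI-like layout with a `teiHeader` element and a `text` element, which should be checked against the actual XML files before relying on it.

```python
from bs4 import BeautifulSoup

# Made-up miniature document; the element names are an assumption about the real files.
sample_xml = """<TEI>
  <teiHeader><idno>A68475</idno><extent>Approx. 3150 KB</extent></teiHeader>
  <text><body><p>Of <hi>sadness</hi> or sorrow.</p></body></text>
</TEI>"""

soup = BeautifulSoup(sample_xml, "html.parser")  # html.parser lower-cases tag names

# Keep only the <text> element; fall back to the whole document if it is missing.
body = soup.find("text")
source = body if body is not None else soup

# strip=True trims each text node and drops the whitespace-only ones.
body_text = source.get_text(" ", strip=True)
print(body_text)
```

Here the header content (`A68475`, `Approx. 3150 KB`) no longer appears in `body_text`, while the running text of the paragraph is kept.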

In [11]:
clean_XML_to_txt("BeautifulSoup_cleaned")

In [3]:
path = "BeautifulSoup_cleaned"
kind = "laterReference"

filename = "A52146"

print(f"{path}/{kind}_{filename}_cleaned.txt")

BeautifulSoup_cleaned/laterReference_A52146_cleaned.txt


In [23]:
cd /Users/paulineclaes/Documents/dta/thesis/Data/XML/AddData/LaterReference

/Users/paulineclaes/Documents/dta/thesis/Data/XML/AddData/LaterReference


In [24]:
clean_XML_to_txt(path = "../AddData_cleaned", kind = "LaterReference")

In [26]:
cd /Users/paulineclaes/Documents/dta/thesis/Data/XML/AddData/LaterTranslation

/Users/paulineclaes/Documents/dta/thesis/Data/XML/AddData/LaterTranslation


In [27]:
clean_XML_to_txt(path = "../AddData_cleaned", kind = "LaterTranslation")

In [28]:
cd /Users/paulineclaes/Documents/dta/thesis/Data/XML/AddData/EarlyReference

/Users/paulineclaes/Documents/dta/thesis/Data/XML/AddData/EarlyReference


In [29]:
clean_XML_to_txt(path = "../AddData_cleaned", kind = "EarlyReference")

In [30]:
cd /Users/paulineclaes/Documents/dta/thesis/Data/XML/AddData/EarlyTranslation

/Users/paulineclaes/Documents/dta/thesis/Data/XML/AddData/EarlyTranslation


In [31]:
clean_XML_to_txt(path = "../AddData_cleaned", kind = "EarlyTranslation")
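
The four `cd` plus function-call pairs above repeat the same step for each subfolder. They could be collapsed into one loop. The following is a sketch under stated assumptions: the helper names `working_directory` and `clean_all` are hypothetical, and the cleaning function is passed in as an argument (in this notebook that would be `clean_XML_to_txt`) so that the sketch stays independent of how the cleaning itself is implemented.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the working directory and restore it afterwards."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


# Subfolder names as used in the cells above; each doubles as the `kind` label.
KINDS = ["LaterReference", "LaterTranslation", "EarlyReference", "EarlyTranslation"]


def clean_all(base_dir, clean_fn, kinds=KINDS, target="../AddData_cleaned"):
    """Call clean_fn(path=target, kind=kind) once inside each subfolder of base_dir."""
    for kind in kinds:
        with working_directory(os.path.join(base_dir, kind)):
            clean_fn(path=target, kind=kind)
```

With the paths used above, the call would be `clean_all("/Users/paulineclaes/Documents/dta/thesis/Data/XML/AddData", clean_XML_to_txt)`. As in the individual calls, `target` is resolved relative to each subfolder.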