In [81]:
import glob
from pathlib import Path 
import re
import pandas as pd

Our goal here is to build a spreadsheet of useful information from ProQuest text exports. For whatever reason, the text exports seem to contain more useful information than the spreadsheet exports.

Each file is a series of sections separated by a line of underscores. A section consists of a title followed by key-value pairs, with blank lines in between. Links take a bit of extra work: they appear as a series of URLs, and a URL on its own line looks like a key-value pair because it contains a colon. The function below turns a section into a dictionary of key-value pairs.
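
To see why links need special handling, here is a minimal sketch of splitting lines on the first colon. The sample lines and the `split_line` helper are made up for illustration; they only mimic the shape of an export and are not taken from a real file.

```python
import re

# Hypothetical lines, invented to mimic the shape of one export section
sample_lines = [
    "Author: Doe, Jane",
    "Links: http://example.com/first",
    "http://example.com/second",
    "",
]

# Non-greedy first group, so the split happens at the first colon on the line
pattern = re.compile(r"^(.*?):(.*)$")

def split_line(line):
    """Return (key, value) for a line containing a colon, or None otherwise."""
    match = pattern.search(line)
    return (match.group(1), match.group(2)) if match else None

pairs = [split_line(line) for line in sample_lines]
for line, pair in zip(sample_lines, pairs):
    print(repr(line), "->", pair)
```

The bare URL comes back with the key `http`, which is why that key has to be treated as a continuation of the links rather than as a field of its own. The blank line has no colon, so it yields `None` and can be skipped.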

In [105]:
def parseProquestDissertation(txt):
    """Turn one section of a ProQuest text export into a dict of key-value pairs."""
    toReturn = {}
    lines = txt.split("\n")
    #toReturn['title'] = lines[2] #this is the title, but we get this again, so ignore
    for l in lines[3:]: #rest of the lines are key-value pairs
        res = re.search(r"^(.*?):(.*)$", l) #split as key:value on the first colon
        if res: #some lines have no colon (e.g. blank lines), so make sure the regex matched
            key = res.group(1)
            value = res.group(2)
            if key=='http': #a bare http URL looks like a key-value pair, but is really just another link
                #.get avoids a KeyError when no 'Links' line came before this URL
                toReturn['Links'] = toReturn.get('Links', '') + "\n" + key+":"+value
            elif key=='https': #this seems to be just a direct reference, add as such
                toReturn['Link'] = key + ":" + value
            else:
                toReturn[key] = value #put in dictionary if it is a valid key-value pair
    return toReturn

This is the main loop. Put all the text files in the data folder. Each one is written out as a CSV with the same file name (filename.txt becomes filename.csv), so the files are best named by search term.

In [106]:
files = glob.glob('data/*.txt')  #get all files with the txt extension
for path in files:
    file_stem = Path(path).stem #the stem (filename without its extension) becomes the name of the csv
    with open(path,mode='r') as fd:
        text_contents = fd.read()
    sections = text_contents.split('____________________________________________________________') #default delimiter...probably is a smarter way to split this
    valid_sections = sections[1:-1] #the chunks before the first and after the last delimiter are not records, so ignore them
    print(f"Parsing { len(valid_sections) } sections")
    parsed_sections = [parseProquestDissertation(section) for section in valid_sections]
    df = pd.DataFrame(parsed_sections) #one row per section, one column per key
    df.to_csv('data/'+file_stem+".csv")

Parsing 92 sections
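
One possibly smarter way to split, sketched here under the assumption that a delimiter is always a line made up only of underscores (ten or more), is to use a regular expression instead of a hard-coded string of a fixed length. The `DELIMITER` name and the sample text are made up for illustration.

```python
import re

# Assumption: a delimiter is a line of 10 or more underscores,
# optionally followed by trailing spaces or tabs.
DELIMITER = re.compile(r"^_{10,}[ \t]*$", re.MULTILINE)

# Hypothetical text standing in for the contents of one export file
sample = (
    "header text\n"
    + "_" * 60 + "\n"
    + "\nTitle one\n\nAuthor: A\n"
    + "_" * 60 + "\n"
    + "\nTitle two\n\nAuthor: B\n"
    + "_" * 60 + "\n"
    + "footer text\n"
)

sections = DELIMITER.split(sample)
valid_sections = sections[1:-1]  # same slicing as before: drop the first and last chunk
print(f"Parsing { len(valid_sections) } sections")
```

Only the delimiter line is consumed, so each section keeps its leading newlines and the `lines[3:]` offset in the parser still lines up. The slicing is unchanged, because the chunks before the first and after the last delimiter are still not records.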
