# CHEM 584 Machine Learning Project
## Marcus Sak

### Preprocessing

This notebook reads a folder of TEI-formatted `xml` files generated from PDFs of research articles and exports four single-column `.csv` files: the abstracts, the body text, the title plus abstract, and the title plus body text of each article.
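
For orientation, the sketch below shows the parts of a TEI document that this notebook reads: the title, the date, the abstract, and the untyped body `div` paragraphs. The sample XML is hand-written and heavily simplified (an assumption for illustration, not a file from the corpus), and it uses the standard-library `xml.etree.ElementTree` instead of the BeautifulSoup parser used in the notebook, purely to stay dependency-free.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified TEI sample (an assumption for illustration only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>An example title</title></titleStmt>
      <publicationStmt><date>1 January 2020</date></publicationStmt>
    </fileDesc>
    <profileDesc>
      <abstract><p>A short abstract.</p></abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div><p>First body paragraph.</p><p>Second body paragraph.</p></div>
      <div type="acknowledgement"><p>We thank our funders.</p></div>
    </body>
  </text>
</TEI>"""

# TEI elements live in this XML namespace, so every lookup needs the prefix.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(SAMPLE_TEI)

title = root.findtext(".//tei:titleStmt/tei:title", namespaces=NS)
date = root.findtext(".//tei:publicationStmt/tei:date", namespaces=NS)
abstract = " ".join(p.text for p in root.findall(".//tei:abstract/tei:p", NS))

# Keep only <div> elements without a "type" attribute, mirroring the
# `if not div.get("type")` check used for the body text in this notebook.
paragraphs = [
    p.text
    for div in root.findall(".//tei:body/tei:div", NS)
    if not div.get("type")
    for p in div.findall("tei:p", NS)
]

print(title, date, abstract, paragraphs, sep="\n")
```

The typed `div` (here an acknowledgement) is skipped, which is the same structural filter the `TEIFile.text` property applies before its keyword filter.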

### Imports

In [9]:
import os
import re
import csv
import glob
import pandas as pd
from pathlib import Path
import multiprocessing as mp
from bs4 import BeautifulSoup
from os.path import basename, splitext

### Defining TEIFile class and utility functions

In [10]:
# The following code is modified from https://komax.github.io/blog/text/python/xml/parsing_tei_xml_python/

def full_path(filename):
    """ Returns the full path of the input filename, assuming that the file is in the current working directory
    """
    # __file__ is not defined inside a notebook, so resolve against the working directory instead
    return os.path.join(os.getcwd(), filename)

def read_tei(tei_file):
    """ Parses the xml file and returns a soup object containing all the data in the file as a tree
    """
    with open(tei_file, 'r') as tei:
        # a failure to open or parse the file raises here, so no explicit fallback error is needed
        return BeautifulSoup(tei, 'lxml')

class TEIFile(object):
    def __init__(self, filename):
        self.filename = filename
        self.soup = read_tei(filename)
        self._text = None
        self._title = ''
        self._abstract = ''
        self._date = None
        # paragraphs containing any of these terms are dropped from the body text (acknowledgements, experimental details, etc.)
        self._filter = ['thank', 'acknowledg', 'nih', 'nsf', 'http', 'experimental procedure',
                        'www.', 'co-wrote', 'authors declare', 'grateful', 'yale', 'silica gel', 'ppm',
                        'free of charge', 'crystallographic data', 'mail', 'experimental details', 'see:']
    
    @property
    def date(self):
        if not self._date:
            self._date = self.soup.date.getText()
        return self._date
    
    @property
    def title(self):
        if not self._title:
            self._title = self.soup.title.getText()
            self._title = self._text_process(self._title)
        return self._title

    @property
    def abstract(self):
        if not self._abstract:
            abstract = self.soup.abstract.getText(separator=' ', strip=True)
            abstract = self._text_process(abstract)
            self._abstract = abstract
        return self._abstract
    
    @property
    def text(self):
        if not self._text:
            divs_text = []
            for div in self.soup.body.find_all("div"):
                # div is neither an appendix nor references, just plain text.
                if not div.get("type"):
                    for ref in div.find_all("ref"): 
                        ref.decompose()  # remove references to citations and figures
                    for p in div.find_all('p'):
                        p_text = p.get_text(separator='', strip=True)  # returns one chunk of text without markup
                        p_text = self._text_process(p_text)
                        
                        if all(term not in p_text.lower() for term in self._filter):
                            # skip paragraphs of acknowledgements, etc
                            divs_text.append(p_text)
                            
            self._text = divs_text
        return self._text
    
    def _text_process(self, string):
        """ Strips everything in parentheses or square brackets, removes symbols outside a small ASCII whitelist,
        and normalises the spacing around periods and commas.
        """
        string = re.sub(r"[\(\[].*?[\)\]]", "", string)  # drop (...) and [...] spans
        string = re.sub(r'[^A-Za-z0-9 ,.\-:]+', '', string)  # keep only letters, digits, spaces and , . - :
        string = re.sub(r" +", " ", string)  # collapse repeated spaces
        string = re.sub(r" \.", ".", string)  # remove space before period
        string = re.sub(r" ,", ",", string)  # remove space before comma
        string = re.sub(r'((?!\w)\.(?!\ |^\w))', '. ', string)  # add a space after a period that is not followed by one
        string = re.sub(r'((?!\w),(?!\ |^\w))', ', ', string)  # add a space after a comma that is not followed by one
        return string
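
To make the cleaning and filtering steps concrete, the sketch below re-implements the first five substitutions of `_text_process` and the filter-term check from the `text` property as standalone functions, then applies them to invented sentences. The helper names `clean` and `keep_paragraph`, the shortened filter list, and the sample strings are assumptions introduced for this illustration.

```python
import re

# Shortened version of the filter list, for illustration only.
FILTER_TERMS = ['thank', 'acknowledg', 'silica gel', 'ppm']

def clean(string):
    """Apply the first five substitutions of _text_process."""
    string = re.sub(r"[\(\[].*?[\)\]]", "", string)       # drop (...) and [...] spans
    string = re.sub(r"[^A-Za-z0-9 ,.\-:]+", "", string)   # drop everything outside the ASCII whitelist
    string = re.sub(r" +", " ", string)                    # collapse repeated spaces
    string = re.sub(r" \.", ".", string)                   # remove space before period
    string = re.sub(r" ,", ",", string)                    # remove space before comma
    return string

def keep_paragraph(paragraph):
    """Return True if no filter term occurs as a substring of the paragraph."""
    return all(term not in paragraph.lower() for term in FILTER_TERMS)

raw = "The α-vinylation (Table 1) gave product [12] in good yield ."
cleaned = clean(raw)
print(cleaned)

print(keep_paragraph("We thank the NSF for funding."))
print(keep_paragraph("The reaction proceeded at room temperature."))
```

Two consequences follow from this design. The ASCII whitelist removes Greek letters and accented characters, which is why titles in the results appear as "-Vinylation" or "Brnsted". The filter works on substrings, so any paragraph that merely mentions a term such as "ppm" is dropped in full.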

In [18]:
def tei_to_csv_entry(tei_file):
    tei = TEIFile(tei_file)
    print(f"Handled {tei_file}")
    return tei.title, tei.date, tei.abstract, tei.text

### Testing on one file

In [None]:
tei_doc = 'tei_papers/364.full.tei.xml'

In [14]:
tei_file = 'tei_papers/364.full.tei.xml'
tei_to_csv_entry(tei_file)

Handled tei_papers/364.full.tei.xml


('Light-driven deracemization enabled by excited-state electron transfer',
 '18 October 2019',
 'Deracemization is an attractive strategy for asymmetric synthesis, but intrinsic energetic challenges have limited its development. Here, we report a deracemization method in which amine derivatives undergo spontaneous optical enrichment upon exposure to visible light in the presence of three distinct molecular catalysts. Initiated by an excited-state iridium chromophore, this reaction proceeds through a sequence of favorable electron, proton, and hydrogen-atom transfer steps that serve to break and reform a stereogenic C-H bond. The enantioselectivity in these reactions is jointly determined by two independent stereoselective steps that occur in sequence within the catalytic cycle, giving rise to a composite selectivity that is higher than that of either step individually. These reactions represent a distinct approach to creating out-of-equilibrium product distributions between substrate e

### Process all files

In [15]:
# collect all TEI files in the folder tei_papers
papers = sorted(Path("./tei_papers").glob('*.tei.xml'))

# use multiprocessing to parse in parallel; the context manager shuts the worker pool down afterwards
with mp.Pool() as pool:
    csv_entries = pool.map(tei_to_csv_entry, papers)

result_csv = pd.DataFrame(csv_entries, columns=['Title', 'Date','Abstract', 'Text'])
result_csv

Handled tei_papers/364.full.tei.xml
Handled tei_papers/a-fluorination.tei.xml
Handled tei_papers/a-trifluoromethylation_iodonium.tei.xml
Handled tei_papers/Alpha-Amination.tei.xml
Handled tei_papers/1-s2.0-S004040200600281X-main.tei.xml
Handled tei_papers/SOMOenolation.tei.xml
Handled tei_papers/CopperVinylation.tei.xml
Handled tei_papers/1-s2.0-S0040402003012808-main.tei.xml
Handled tei_papers/acs.orglett.8b00364.tei.xml
Handled tei_papers/a-nitroalkylation.tei.xml
Handled tei_papers/Direct-arylation-of-strong-aliphatic-C–H-bonds.tei.xml
Handled tei_papers/A-radical-approach-to-the-copper-oxidative-addition-problem.tei.xml
Handled tei_papers/HomoEne.tei.xml
Handled tei_papers/acscentsci.6b00237.tei.xml
Handled tei_papers/aldehyde-aldol.tei.xml
Handled tei_papers/Vincorine.tei.xml
Handled tei_papers/Decarboxylativ-sp3-C-N-coupling.tei.xml
Handled tei_papers/SOMOmechanism.tei.xml
Handled tei_papers/Copper-Catalyzed-Trifluoromethylation-of-Alkyl-Bromides.tei.xml
Handled tei_papers/Tomer-Pape

| | Title | Date | Abstract | Text |
|---|---|---|---|---|
| 0 | Combined Lewis acid and Brnsted acid-mediated ... | 10 October 2013 | Biomimetic conditions for a synthetic glycosyl... | [The structural and functional characterizatio... |
| 1 | Asymmetric Acylation Reactions Catalyzed by Co... | | Octapeptides capable of adopting b-hairpin con... | [Small peptides that promote asymmetric reacti... |
| 2 | Amino acids and peptides as asymmetric organoc... | | | [Use of a short peptide as a catalyst would al... |
| 3 | A peptide-based catalyst approach to regiosele... | | Two small peptide libraries have been subjecte... | [The chiral pool continues to represent a trem... |
| 4 | Synthesis of aziridinomitosenes through base-c... | | Synthesis of an aziridinomitosene core structu... | [Initial studies were performed to determine i... |
| ... | ... | ... | ... | ... |
| 295 | The First Suzuki Cross-Couplings of Aryltrimet... | | | [The transition-metal-catalyzed cross-coupling... |
| 296 | Enantioselective -Vinylation of Aldehydes via ... | May 22, 2012 | The enantioselective -vinylation of aldehydes ... | [Our investigation began with an examination o... |
| 297 | Design of a New Cascade Reaction for the Const... | | | [Tandem or domino reactions have long been est... |
| 298 | Enantioselective organocatalytic aldehyde -ald... | 15 July 2004 | An asymmetric proline catalyzed aldol reaction... | [The proposed organocatalytic cross aldol was ... |
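
`Pool.map` re-raises a worker's exception in the parent process, so a single malformed file (for example one without an `abstract` element) can abort the whole run. One possible safeguard, sketched below, wraps the per-file function so that a failure yields a blank row instead. `safe_entry`, `fake_parse`, and the sample paths are hypothetical names introduced for this illustration: `fake_parse` stands in for `tei_to_csv_entry`, and the built-in `map` stands in for `pool.map` so that the sketch runs without any TEI files.

```python
def fake_parse(path):
    """Hypothetical stand-in for tei_to_csv_entry: fails on one 'malformed' path."""
    if "broken" in str(path):
        raise AttributeError("'NoneType' object has no attribute 'getText'")
    return (f"Title of {path}", "", "An abstract.", ["A paragraph."])

def safe_entry(path, parse=fake_parse):
    """Return the parsed row, or a blank row if parsing raises."""
    try:
        return parse(path)
    except Exception as err:  # broad on purpose: any parse failure should only skip that file
        print(f"Skipped {path}: {err}")
        return ("", "", "", [])

# Invented paths; the middle one triggers the simulated failure.
paths = ["a.tei.xml", "broken.tei.xml", "b.tei.xml"]
rows = list(map(safe_entry, paths))
print(len(rows))
```

In the notebook, the same wrapper would call `tei_to_csv_entry` and be passed to `pool.map` in its place; the blank rows can then be dropped from the DataFrame before export.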


In [17]:
# merge each article's list of paragraphs into a single string
result_csv["Text"] = result_csv["Text"].str.join(" ")

# prefix the title to the body text and to the abstract in a prompt/response format (later found not to be useful)
result_csv["Text with title"] = '[T] ' + result_csv["Title"] + ' [A] ' + result_csv["Text"]
result_csv["Abstract with title"] = '[T] ' + result_csv["Title"] + ' [A] ' + result_csv["Abstract"]
result_csv

| | Title | Date | Abstract | Text | Text with title | Abstract with title |
|---|---|---|---|---|---|---|
| 0 | Combined Lewis acid and Brnsted acid-mediated ... | 10 October 2013 | Biomimetic conditions for a synthetic glycosyl... | The structural and functional characterization... | [T] Combined Lewis acid and Brnsted acid-media... | [T] Combined Lewis acid and Brnsted acid-media... |
| 1 | Asymmetric Acylation Reactions Catalyzed by Co... | | Octapeptides capable of adopting b-hairpin con... | Small peptides that promote asymmetric reactio... | [T] Asymmetric Acylation Reactions Catalyzed b... | [T] Asymmetric Acylation Reactions Catalyzed b... |
| 2 | Amino acids and peptides as asymmetric organoc... | | | Use of a short peptide as a catalyst would all... | [T] Amino acids and peptides as asymmetric org... | [T] Amino acids and peptides as asymmetric org... |
| 3 | A peptide-based catalyst approach to regiosele... | | Two small peptide libraries have been subjecte... | The chiral pool continues to represent a treme... | [T] A peptide-based catalyst approach to regio... | [T] A peptide-based catalyst approach to regio... |
| 4 | Synthesis of aziridinomitosenes through base-c... | | Synthesis of an aziridinomitosene core structu... | Initial studies were performed to determine if... | [T] Synthesis of aziridinomitosenes through ba... | [T] Synthesis of aziridinomitosenes through ba... |
| ... | ... | ... | ... | ... | ... | ... |
| 295 | The First Suzuki Cross-Couplings of Aryltrimet... | | | The transition-metal-catalyzed cross-coupling ... | [T] The First Suzuki Cross-Couplings of Aryltr... | [T] The First Suzuki Cross-Couplings of Aryltr... |
| 296 | Enantioselective -Vinylation of Aldehydes via ... | May 22, 2012 | The enantioselective -vinylation of aldehydes ... | Our investigation began with an examination of... | [T] Enantioselective -Vinylation of Aldehydes ... | [T] Enantioselective -Vinylation of Aldehydes ... |
| 297 | Design of a New Cascade Reaction for the Const... | | | Tandem or domino reactions have long been esta... | [T] Design of a New Cascade Reaction for the C... | [T] Design of a New Cascade Reaction for the C... |
| 298 | Enantioselective organocatalytic aldehyde -ald... | 15 July 2004 | An asymmetric proline catalyzed aldol reaction... | The proposed organocatalytic cross aldol was f... | [T] Enantioselective organocatalytic aldehyde ... | [T] Enantioselective organocatalytic aldehyde ... |


In [70]:
# drop articles with no abstract; chaining avoids in-place edits on a column taken from result_csv
abs_df = result_csv["Abstract"].replace("", float("NaN")).dropna()

# export to csv
result_csv["Text"].to_csv('text_proc.csv', index=False, header=False)
abs_df.to_csv('abs_proc.csv', index=False, header=False)
result_csv["Text with title"].to_csv('text_title_proc.csv', index=False, header=False)
result_csv["Abstract with title"].to_csv('abs_title_proc.csv', index=False, header=False)
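
Because the files are written with `index=False` and `header=False`, each row holds exactly one field, and fields containing commas or quotes are quoted by the CSV writer. The round-trip sketch below uses the standard-library `csv` module in place of pandas so that it runs without dependencies; the sample strings and the in-memory buffer are assumptions for illustration.

```python
import csv
import io

# Invented sample rows; the first contains a comma and must be quoted on disk.
abstracts = [
    "Deracemization is attractive, but challenging.",
    "A single sentence without commas.",
]

# Write one field per row, with no header, into an in-memory "file".
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)  # default minimal quoting: quote only when needed
for abstract in abstracts:
    writer.writerow([abstract])

# Read the rows back; each row is a one-element list.
buffer.seek(0)
read_back = [row[0] for row in csv.reader(buffer)]
print(read_back == abstracts)
```

With pandas, passing `header=None` to `pd.read_csv` serves the same purpose, so that the first article is not consumed as a column name.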