#### Single Document NER

We expect a PDF file as input, which is converted to raw text with *pdfminer*. We selected this text extraction tool because it extracted more information than the alternatives we compared it with, such as *PyPDF2* (used in the word2vec part of the project).

Two processing functions are then applied to the string containing the text extracted from the PDF file. The first identifies the **references** and removes them, since they are not useful for the task at hand. The second detects whether the authors have provided **keywords** for the document, which would complement the topic classification part of this project.

In [1]:
import re
from pdfminer.layout import LTTextContainer
from pdfminer.high_level import extract_text, extract_pages

##### Given a PDF file (a path on disk or an open binary file object), this function extracts all of its text
##### The text is returned both as a single string and as a list with one entry per text container
def extract_text_from_pdf(pdf_file):
    page_counter = 0
    text_as_list = []

    for page_layout in extract_pages(pdf_file):
        page_counter += 1
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                text_as_list.append(element.get_text())
    text_as_str = ''.join(text_as_list)
    print('Text from PDF File ({} pages) extracted successfully.'.format(page_counter))
    return text_as_str, text_as_list

##### Function to locate a possible 'Keywords' line in the string containing the document's text
##### If found, it returns the alphabetic words on that line; the first entry is the word 'Keywords' itself
def try_finding_keywords(document_text_str):
    potential_keywords_index = document_text_str.find('Keywords')
    final_keywords = []

    #### Check if the sub-string "Keywords" is located in the document
    if potential_keywords_index != -1:
        #### Keep only the text from "Keywords" up to the end of that line
        keywords_line = document_text_str[potential_keywords_index:].split('\n', 1)[0]

        keyword = ''
        for character in keywords_line:
            if character.isalpha():
                keyword += character
            elif keyword != '':
                final_keywords.append(keyword)
                keyword = ''
        #### Keep the last keyword when the line ends with a letter
        if keyword != '':
            final_keywords.append(keyword)

    return final_keywords

##### Function to locate the sub-string holding the References title and remove all the text after it
##### This will potentially remove all the references in a document from the NER pipeline
def remove_references(doc_str):
    print('Original document character length: {}'.format(len(doc_str)))

    ##### Locate the first occurrence of the references title sub-string
    ##### 'References' is tried first, then 'REFERENCES'
    potential_references_index = doc_str.find('References')
    if potential_references_index == -1:
        potential_references_index = doc_str.find('REFERENCES')

    if potential_references_index != -1:
        doc_str = doc_str[:potential_references_index]
        print('References removed, new document character length: {}'.format(len(doc_str)))
    else:
        print('References string index not found')
    return doc_str

##### Remove links from a given string using RE (Regular Expressions)
##### Note: the pattern drops everything from the start of a link to the end of its line
def remove_links(review):
    review = re.sub(r'https?:\/\/.*[\r\n]*', '', review)
    return review
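
The two helpers above rely on plain sub-string search, so `remove_references` cuts at the first occurrence of "References" anywhere in the text, and `try_finding_keywords` splits multi-word keywords into single words. The following is a minimal sketch of a stricter variant, not the method used in this notebook. It assumes that the references heading sits alone on its line and that keywords are separated by commas or semicolons. The names `cut_at_references_heading` and `split_keywords_line`, and the sample text, are hypothetical.

```python
import re

# Hypothetical patterns, assuming the heading is alone on its line
# and the keywords line looks like "Keywords: a, b; c".
REFERENCES_HEADING = re.compile(r'^[ \t]*(?:References|REFERENCES)[ \t]*$', re.MULTILINE)
KEYWORDS_LINE = re.compile(r'^[ \t]*Keywords[ \t]*:?[ \t]*(.+)$', re.MULTILINE)


def cut_at_references_heading(text):
    """Return the text before the last standalone references heading, if any."""
    matches = list(REFERENCES_HEADING.finditer(text))
    return text[:matches[-1].start()] if matches else text


def split_keywords_line(text):
    """Return the entries of the first 'Keywords' line, keeping multi-word terms."""
    match = KEYWORDS_LINE.search(text)
    if match is None:
        return []
    parts = re.split(r'[,;]', match.group(1))
    return [part.strip() for part in parts if part.strip()]


# Made-up sample text, for illustration only
sample = (
    'Novel BMI References for children were published.\n'
    'Keywords: Overweight, Body mass index; Migration\n'
    'Main text.\n'
    'References\n'
    '1. Some cited work.\n'
)
trimmed = cut_at_references_heading(sample)
keywords = split_keywords_line(sample)
print(keywords)
```

In this sketch the word "References" inside the first sentence is left alone, because only a line that consists of the heading triggers the cut. Whether that assumption holds depends on how the PDF extraction lays out headings, so it would need checking against the real documents.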

The loop below processes every publication in our test set.

In [2]:
import os

pdfs_directory = 'data/implementome_publications/test_miner/'
#### Iterate over the files in the directory and process every PDF found
for filename in os.listdir(pdfs_directory):
    if filename.endswith('.pdf'):
        print('For Publication: {}'.format(filename[:-4]))
        #### The file is closed automatically once the text has been extracted
        with open(os.path.join(pdfs_directory, filename), 'rb') as pdf_file:
            doc_as_str, doc_as_list = extract_text_from_pdf(pdf_file)

        #### Processing the string containing the entirety of the document's text
        #### The goal is to remove references and detect potential keywords in the document
        print('#### Processing ####')
        doc_as_str = remove_references(doc_as_str)
        doc_keywords = try_finding_keywords(doc_as_str)
        #### The first entry is the word 'Keywords' itself, so it is skipped
        print('Potential Extracted Keywords: {}'.format(doc_keywords[1:] if doc_keywords != [] else 'None Found'))
        print('#####\n')


# pdf_file = open('data/implementome_publications/test_miner/child_obesity_switzerland.pdf', 'rb')
# empty_text_file = open('data/implementome_publications/test_miner/test_text.txt', "w", encoding="utf-8")

# #### The extract_text() function is used to extract any text found in the pdf file
# #### Text in tables, headers, banners and other graphical representations is also extracted
# publications_text = extract_text(pdf_file)
# print('Text from PDF File extracted successfully.')
# # empty_text_file.write(publications_text)

For Publication: ai_and_surgical_decision_making
Text from PDF File (11 pages) extracted successfully.
#### Processing ####
Original document character length: 66451
References removed, new document character length: 50476
Potential Extracted Keywords: None Found
#####

For Publication: brachytherapy_lmic
Text from PDF File (10 pages) extracted successfully.
#### Processing ####
Original document character length: 43351
References removed, new document character length: 38957
Potential Extracted Keywords: None Found
#####

For Publication: child_obesity_switzerland
Text from PDF File (11 pages) extracted successfully.
#### Processing ####
Original document character length: 44197
References removed, new document character length: 36152
Potential Extracted Keywords: ['Overweight', 'Obesity', 'Migration', 'Children']
#####

For Publication: haemophilia_senegal
Text from PDF File (7 pages) extracted successfully.
#### Processing ####
Original document character length: 29549
References st

In [3]:
with open('data/implementome_publications/test_miner/child_obesity_switzerland.pdf', 'rb') as single_pdf:
    doc_as_str, doc_as_list = extract_text_from_pdf(single_pdf)

Text from PDF File (11 pages) extracted successfully.


In [4]:
doc_as_str = remove_references(doc_as_str)

Original document character length: 44197
References removed, new document character length: 36152


In [5]:
#### Print the cleaned text for inspection
print(doc_as_str)

Eiholzer et al. BMC Public Health          (2021) 21:243 
https://doi.org/10.1186/s12889-021-10213-0
R E S E A R C H A R T I C L E
Open Access
The increase in child obesity in Switzerland
is mainly due to migration from Southern
Europe – a cross-sectional study
Urs Eiholzer*, Chris Fritz and Anika Stephan
Abstract
Background: Novel height, weight and body mass index (BMI) references for children in Switzerland reveal an
increase in BMI compared to former percentile curves. This trend may be the result of children with parents
originating from Southern European countries having a higher risk of being overweight compared to their peers
with parents of Swiss origin. We examined the association of generational, migration-related and socioeconomic
factors on BMI in Switzerland and expect the results to lead to more targeted prevention programs.
Methods: From contemporary cross-sectional data, we calculated subgroup-specific BMI percentiles for origin.
Results for children of Swiss origin we

In [6]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

sentences = sent_tokenize(doc_as_str)
sentences

['Eiholzer et al.',
 'BMC Public Health          (2021) 21:243 \nhttps://doi.org/10.1186/s12889-021-10213-0\nR E S E A R C H A R T I C L E\nOpen Access\nThe increase in child obesity in Switzerland\nis mainly due to migration from Southern\nEurope – a cross-sectional study\nUrs Eiholzer*, Chris Fritz and Anika Stephan\nAbstract\nBackground: Novel height, weight and body mass index (BMI) references for children in Switzerland reveal an\nincrease in BMI compared to former percentile curves.',
 'This trend may be the result of children with parents\noriginating from Southern European countries having a higher risk of being overweight compared to their peers\nwith parents of Swiss origin.',
 'We examined the association of generational, migration-related and socioeconomic\nfactors on BMI in Switzerland and expect the results to lead to more targeted prevention programs.',
 'Methods: From contemporary cross-sectional data, we calculated subgroup-specific BMI percentiles for origin.',
 'Resu

In [7]:
for sentence in sentences:
    print('{}'.format(sentence))
    print('')

Eiholzer et al.

BMC Public Health          (2021) 21:243 
https://doi.org/10.1186/s12889-021-10213-0
R E S E A R C H A R T I C L E
Open Access
The increase in child obesity in Switzerland
is mainly due to migration from Southern
Europe – a cross-sectional study
Urs Eiholzer*, Chris Fritz and Anika Stephan
Abstract
Background: Novel height, weight and body mass index (BMI) references for children in Switzerland reveal an
increase in BMI compared to former percentile curves.

This trend may be the result of children with parents
originating from Southern European countries having a higher risk of being overweight compared to their peers
with parents of Swiss origin.

We examined the association of generational, migration-related and socioeconomic
factors on BMI in Switzerland and expect the results to lead to more targeted prevention programs.

Methods: From contemporary cross-sectional data, we calculated subgroup-specific BMI percentiles for origin.

Results for children of Swiss orig

In [8]:
sentences = [sentence.replace('\n', ' ') for sentence in sentences]
sentences[1]

'BMC Public Health          (2021) 21:243  https://doi.org/10.1186/s12889-021-10213-0 R E S E A R C H A R T I C L E Open Access The increase in child obesity in Switzerland is mainly due to migration from Southern Europe – a cross-sectional study Urs Eiholzer*, Chris Fritz and Anika Stephan Abstract Background: Novel height, weight and body mass index (BMI) references for children in Switzerland reveal an increase in BMI compared to former percentile curves.'
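
Replacing only `'\n'` leaves runs of spaces in place, as in `BMC Public Health          (2021)` above. A minimal sketch of one way to collapse them is shown here. It assumes that every run of whitespace can safely become a single space, and the helper name `normalise_whitespace` is hypothetical.

```python
import re


def normalise_whitespace(sentence):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', sentence).strip()


# A shortened excerpt of the sentence shown above
raw = 'BMC Public Health          (2021) 21:243 \nhttps://doi.org/10.1186/s12889-021-10213-0\nOpen Access'
clean = normalise_whitespace(raw)
print(clean)
```

This does not repair letter-spaced headings such as `R E S E A R C H A R T I C L E`; those would need separate handling.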

In [28]:
for i in range(20, 100):
    print(sentences[i], len(sentences[i]), i)
    print('')

2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. 364 20

The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. 173 21

If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. 233 22

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. 82 23

The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies 

In [51]:
import spacy
from spacy import displacy

#### Load the small English pipeline with only the NER component enabled
nlp = spacy.load('en_core_web_sm', enable=['ner'])
document_entities = {}
our_labels = ['GPE', 'LAW', 'LOC', 'ORG', 'PERSON', 'PRODUCT']

for i in range(25, 45):
    doc = nlp(sentences[i])
    #### Keep only the entities whose label is in our_labels
    doc.ents = tuple(ent for ent in doc.ents if ent.label_ in our_labels)
    displacy.render(doc, style='ent')
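
The cell above declares `document_entities` but only renders the entities. One possible way to fill such a dictionary is sketched below, under the assumption that entities should be grouped by label with duplicates dropped. The `(text, label)` pairs are made up for illustration and stand in for `(ent.text, ent.label_)`; they are not output from the model. The helper name `group_entities` is hypothetical.

```python
from collections import defaultdict

# Made-up (text, label) pairs standing in for (ent.text, ent.label_)
sample_entities = [
    ('Switzerland', 'GPE'),
    ('Urs Eiholzer', 'PERSON'),
    ('Switzerland', 'GPE'),
    ('2021', 'DATE'),
]
our_labels = ['GPE', 'LAW', 'LOC', 'ORG', 'PERSON', 'PRODUCT']


def group_entities(entities, allowed_labels):
    """Group entity texts by label, skipping other labels and repeated texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        if label in allowed_labels and text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


document_entities = group_entities(sample_entities, our_labels)
print(document_entities)
```

With spaCy, the same function could be fed `[(ent.text, ent.label_) for ent in doc.ents]` for each processed sentence, merging the results per document.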