## EDA
Although we don't have much data available, we can use some of the supplied PDFs to explore PDF parsing. PDFs are easy for humans to read but tricky for machines: they can have varying layouts, watermarks, links, images, captions, annotations etc., which makes it hard to analyze the actual content.

In the past I have only worked with pypdf to extract or merge simple PDFs. While it worked well for my purpose back then (joining clinical PDFs), I doubt it will work well for extracting a scientific corpus (afaik it doesn't scan the files or analyze the metadata).

In [None]:
import os
from pypdf import PdfReader


PDF_PATH = '../data/enfothelial_dysfunction.pdf'

reader = PdfReader(PDF_PATH)
reader.metadata


{'/Keywords': 'Post-COVID syndrome,Myalgic encephalomyelitis/chronic fatigue syndrome,Endothelial dysfunction,Reactive hyperaemia index,Endothelin-1',
 '/CrossMarkDomains[1]': 'springer.com',
 '/Creator': 'Springer',
 '/ModDate': "D:20220322100752+01'00'",
 '/CreationDate': "D:20220317084812+05'30'",
 '/CrossmarkMajorVersionDate': '2010-04-23',
 '/Subject': 'Journal of Translational Medicine, https://doi.org/10.1186/s12967-022-03346-2',
 '/Author': 'Milan Haffke ',
 '/Title': 'Endothelial dysfunction and altered endothelial biomarkers in patients with post-COVID-19 syndrome and chronic fatigue syndrome (ME/CFS)',
 '/CrossmarkDomainExclusive': 'true',
 '/robots': 'noindex',
 '/Producer': 'Acrobat Distiller 10.1.8 (Windows); modified using iText® 5.3.5 ©2000-2012 1T3XT BVBA (SPRINGER SBM; licensed version)',
 '/doi': '10.1186/s12967-022-03346-2',
 '/CrossMarkDomains[2]': 'springerlink.com'}

In [None]:
#experiment with reading the first page
page_0 = reader.pages[0]

#length
print(len(reader.pages))

#extract only text oriented up
print(page_0.extract_text(0))

11
Haffke  et al. Journal of Translational Medicine          (2022) 20:138  
https://doi.org/10.1186/s12967-022-03346-2
RESEARCH
Endothelial dysfunction and altered 
endothelial biomarkers in patients 
with post -COVID -19 syndrome and chronic 
fatigue syndrome (ME/CFS)
Milan Haffke1* , Helma Freitag1, Gordon Rudolf1, Martina Seifert1,2,3, Wolfram Doehner2,3,4,5, 
Nadja Scherbakov2,3,4,5, Leif Hanitsch1, Kirsten Wittke1, Sandra Bauer1, Frank Konietschke6, Friedemann Paul7,8,9, 
Judith Bellmann‑Strobl7,8,9, Claudia Kedor1, Carmen Scheibenbogen1† and Franziska Sotzny1† 
Abstract 
Background: Fatigue, exertion intolerance and post ‑exertional malaise are among the most frequent symptoms of 
Post ‑COVID Syndrome (PCS), with a subset of patients fulfilling criteria for Myalgic Encephalomyelitis/Chronic Fatigue 
Syndrome (ME/CFS). As SARS‑ CoV‑2 infects endothelial cells, causing endotheliitis and damaging the endothelium, 
we investigated endothelial dysfunction (ED) and endothelial biomark

In [None]:
#extract text & layout
print(page_0.extract_text(extraction_mode='layout'))

Haffke et al. Journal of Translational Medicine          (2022) 20:138
https://doi.org/10.1186/s12967-022-03346-2                                                                                                                                                                                                                        Journal of
                                                                                                                                                                                                           Translational Medicine

   RESEARCH                                                                                                                                                                                                                                                  Open Access
Endothelial dysfunc tion and altered
endothelial biomarkers in patients
with post- COVID -19 syndrome and chronic
fatigue syndrome (ME/CFS)

MilanHaffke                

In [None]:
#experiment with reading the 2nd page
page_1 = reader.pages[1]
print(page_1.extract_text(extraction_mode='layout'))

Haffke et al. Journal of Translational Medicine          (2022) 20:138                                                                           Page 2 of 11




state persisting at least three    months from the onset of                      Counterregulator y ACE2 cleaves angiotensin I into angi-
COVID-19 with common symptoms such as fatigue,                                   otensin 1–9 and metabolises angiotensin II to angioten               -
post-exertional malaise and cognitive dysfunction                                sin 1–7. Peptides generated by ACE2 are ligands of the
impacting ever yday functioning .                                                receptor Mas, which triggers protective, vasodilative sig-
  In our observational longitudinal PA-COVID Fatigue                             nalling [24].
study of PCS patients with persistent moderate to
severe fatigue and exertion intolerance for more than                            Methods
six   months after mild to moderate CO

Although the metadata looks correct, as does the extracted text (especially when the layout is preserved), this might not be so useful to us. Ideally we want a tool that can extract text from specific sections (e.g. we want text from the abstract and main body, but not from the author list, references etc.). Technically we could use spatial cropping to extract only the content of interest, but this would have to be tuned case by case, so it is not very generalizable.

After a very brief search, I stumbled upon PyMuPDF, which seems much more suitable for our purpose: it provides a table of contents with hierarchical levels.

#### PyMuPDF

In [None]:
import pymupdf
doc = pymupdf.open('../data/enfothelial_dysfunction.pdf')

print(doc.metadata)

#table of contents
doc.get_toc()

{'format': 'PDF 1.6', 'title': 'Endothelial dysfunction and altered endothelial biomarkers in patients with post-COVID-19 syndrome and chronic fatigue syndrome (ME/CFS)', 'author': 'Milan Haffke ', 'subject': 'Journal of Translational Medicine, https://doi.org/10.1186/s12967-022-03346-2', 'keywords': 'Post-COVID syndrome,Myalgic encephalomyelitis/chronic fatigue syndrome,Endothelial dysfunction,Reactive hyperaemia index,Endothelin-1', 'creator': 'Springer', 'producer': 'Acrobat Distiller 10.1.8 (Windows); modified using iText® 5.3.5 ©2000-2012 1T3XT BVBA (SPRINGER SBM; licensed version)', 'creationDate': "D:20220317084812+05'30'", 'modDate': "D:20220322100752+01'00'", 'trapped': '', 'encryption': None}


[[1,
  'Endothelial dysfunction and\xa0altered endothelial biomarkers in\xa0patients with\xa0post-COVID-19 syndrome and\xa0chronic fatigue syndrome (MECFS)',
  1],
 [2, 'Abstract ', 1],
 [3, 'Background: ', 1],
 [3, 'Methods: ', 1],
 [3, 'Results: ', 1],
 [3, 'Conclusion: ', 1],
 [2, 'Background', 1],
 [2, 'Methods', 2],
 [3, 'Participants', 2],
 [3, 'Assessment of\xa0endothelial function', 2],
 [3, 'Assessment of\xa0biomarkers', 3],
 [3, 'Assessment of\xa0symptom severity', 3],
 [3, 'Statistical analysis', 4],
 [2, 'Results', 4],
 [3, 'Study population', 4],
 [3, 'Evidence for\xa0peripheral ED in\xa0patients', 4],
 [3,
  'Paradoxical associations of\xa0clinical parameters with\xa0the\xa0RHI',
  4],
 [3, 'Alterations in\xa0endothelial biomarkers in\xa0post-COVID cohorts', 4],
 [2, 'Discussion', 5],
 [2, 'Conclusion', 8],
 [2, 'Acknowledgements', 9],
 [2, 'References', 9]]

This is more like it: we get insight into which pages may not be of interest (e.g. Acknowledgements, References). The metadata is not as extensive as with pypdf, but we are not as interested in metadata.

In [None]:
print(doc.page_count)

#examine first page
print(doc[0].get_text())

11
Haffke et al. Journal of Translational Medicine          (2022) 20:138  
https://doi.org/10.1186/s12967-022-03346-2
RESEARCH
Endothelial dysfunction and altered 
endothelial biomarkers in patients 
with post‑COVID‑19 syndrome and chronic 
fatigue syndrome (ME/CFS)
Milan Haffke1*  , Helma Freitag1, Gordon Rudolf1, Martina Seifert1,2,3, Wolfram Doehner2,3,4,5, 
Nadja Scherbakov2,3,4,5, Leif Hanitsch1, Kirsten Wittke1, Sandra Bauer1, Frank Konietschke6, Friedemann Paul7,8,9, 
Judith Bellmann‑Strobl7,8,9, Claudia Kedor1, Carmen Scheibenbogen1† and Franziska Sotzny1† 
Abstract 
Background:  Fatigue, exertion intolerance and post-exertional malaise are among the most frequent symptoms of 
Post-COVID Syndrome (PCS), with a subset of patients fulfilling criteria for Myalgic Encephalomyelitis/Chronic Fatigue 
Syndrome (ME/CFS). As SARS-CoV-2 infects endothelial cells, causing endotheliitis and damaging the endothelium, 
we investigated endothelial dysfunction (ED) and endothelial biomarkers 

In [None]:
#examine paragraphs
print(doc[0].get_text('blocks'))

[(56.62200164794922, 31.710655212402344, 263.01239013671875, 51.33465576171875, 'Haffke\xa0et\xa0al. Journal of Translational Medicine          (2022) 20:138  \nhttps://doi.org/10.1186/s12967-022-03346-2\n', 0, 0), (61.822898864746094, 89.10063934326172, 129.68161010742188, 105.2076416015625, 'RESEARCH\n', 1, 0), (56.62199401855469, 115.64900970458984, 489.315673828125, 222.4730224609375, 'Endothelial dysfunction and\xa0altered \nendothelial biomarkers in\xa0patients \nwith\xa0post‑COVID‑19 syndrome and\xa0chronic \nfatigue syndrome (ME/CFS)\n', 2, 0), (56.61956787109375, 229.29855346679688, 536.1975708007812, 271.2506103515625, 'Milan\xa0Haffke1*\u200a , Helma\xa0Freitag1, Gordon\xa0Rudolf1, Martina\xa0Seifert1,2,3, Wolfram\xa0Doehner2,3,4,5, \nNadja\xa0Scherbakov2,3,4,5, Leif\xa0Hanitsch1, Kirsten\xa0Wittke1, Sandra\xa0Bauer1, Frank\xa0Konietschke6, Friedemann\xa0Paul7,8,9, \nJudith\xa0Bellmann‑Strobl7,8,9, Claudia\xa0Kedor1, Carmen\xa0Scheibenbogen1†\xa0and Franziska\xa0Sotzny1†\u20
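Each block in the output above is a tuple of the form `(x0, y0, x1, y1, text, block_no, block_type)`. As a minimal sketch of the spatial cropping idea, assuming that layout, we could keep only text blocks whose top edge falls inside a vertical band. The function name `blocks_in_band` and the thresholds are hypothetical; the sample blocks are abbreviated and rounded from the first-page output.

```python
def blocks_in_band(blocks, y_min, y_max):
    """Return the text of text blocks whose top edge (y0) lies in [y_min, y_max).

    Assumes each block is (x0, y0, x1, y1, text, block_no, block_type),
    where block_type 0 marks a text block.
    """
    kept = []
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        if block_type == 0 and y_min <= y0 < y_max:
            # Replace non-breaking spaces and trim trailing newlines
            kept.append(text.replace('\xa0', ' ').strip())
    return kept


# Abbreviated, rounded copies of the first three blocks shown above
sample_blocks = [
    (56.6, 31.7, 263.0, 51.3, 'Haffke\xa0et\xa0al. Journal of Translational Medicine\n', 0, 0),
    (61.8, 89.1, 129.7, 105.2, 'RESEARCH\n', 1, 0),
    (56.6, 115.6, 489.3, 222.5, 'Endothelial dysfunction and\xa0altered \n', 2, 0),
]

# Drop the running header by ignoring everything above y = 80 (threshold picked by eye)
print(blocks_in_band(sample_blocks, 80, 800))
```

The threshold would need to be re-tuned per journal layout, which is exactly why this approach does not generalize well.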

Although I didn't find functionality for extracting specific sections from the doc, the ToC should still help with removing parts of PDFs that we are not so interested in (e.g. references, acknowledgements). I will now analyze the remaining data using this.
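One way the ToC could be used for this is sketched below, under a few assumptions: the ToC is a list of `[level, title, page]` entries with 1-based page numbers (as in the `get_toc()` output above), each entry is treated as running until the page of the next entry, and the helper name `pages_to_keep` and the list of unwanted titles are hypothetical. Filtering happens at page level, so a page shared between a wanted and an unwanted section is kept.

```python
def pages_to_keep(toc, page_count, unwanted=('acknowledgements', 'references')):
    """Return 0-based indices of pages touched by at least one wanted ToC entry.

    toc: list of [level, title, page] entries with 1-based page numbers.
    Documents without a ToC keep all pages.
    Subsections of an unwanted section are NOT detected by this simple version.
    """
    if not toc:
        return list(range(page_count))
    keep = set()
    for i, (level, title, start) in enumerate(toc):
        # An entry runs until the page where the next entry starts
        end = toc[i + 1][2] if i + 1 < len(toc) else page_count
        if title.strip().lower() not in unwanted:
            keep.update(range(start, end + 1))
    return sorted(page - 1 for page in keep)


# Level-2 entries taken from the ToC above (title shortened)
toc = [
    [1, 'Endothelial dysfunction ...', 1],
    [2, 'Abstract ', 1],
    [2, 'Background', 1],
    [2, 'Methods', 2],
    [2, 'Results', 4],
    [2, 'Discussion', 5],
    [2, 'Conclusion', 8],
    [2, 'Acknowledgements', 9],
    [2, 'References', 9],
]
print(pages_to_keep(toc, 11))
```

For this paper the last two pages (references only) are dropped, while page 9 is kept because the Conclusion section still reaches it.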

### Brief EDA of supplied data

In [None]:
pdfs = os.listdir('../data')
for pdf in pdfs:
    print('\n',pdf)
    doc = pymupdf.open(f'../data/{pdf}')
    print(doc.get_toc())
    print(doc.metadata)
    print(doc.page_count)
    


 enfothelial_dysfunction.pdf
[[1, 'Endothelial dysfunction and\xa0altered endothelial biomarkers in\xa0patients with\xa0post-COVID-19 syndrome and\xa0chronic fatigue syndrome (MECFS)', 1], [2, 'Abstract ', 1], [3, 'Background: ', 1], [3, 'Methods: ', 1], [3, 'Results: ', 1], [3, 'Conclusion: ', 1], [2, 'Background', 1], [2, 'Methods', 2], [3, 'Participants', 2], [3, 'Assessment of\xa0endothelial function', 2], [3, 'Assessment of\xa0biomarkers', 3], [3, 'Assessment of\xa0symptom severity', 3], [3, 'Statistical analysis', 4], [2, 'Results', 4], [3, 'Study population', 4], [3, 'Evidence for\xa0peripheral ED in\xa0patients', 4], [3, 'Paradoxical associations of\xa0clinical parameters with\xa0the\xa0RHI', 4], [3, 'Alterations in\xa0endothelial biomarkers in\xa0post-COVID cohorts', 4], [2, 'Discussion', 5], [2, 'Conclusion', 8], [2, 'Acknowledgements', 9], [2, 'References', 9]]
{'format': 'PDF 1.6', 'title': 'Endothelial dysfunction and altered endothelial biomarkers in patients with post-C

The provided PDFs come in different formats; for some of them it is not possible to extract a ToC, so extracting specific sections this way might not generalize well. The metadata for post-exertional_malaise.pdf is also nonexistent, which shows the challenge of the potential data types (and an opportunity to build a more robust solution). We can also notice different encodings and whitespace inconsistencies across the papers, making text extraction yet more challenging.
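A minimal sketch of how the encoding and whitespace inconsistencies could be smoothed out before NER, using only the standard library. The function name `normalize_text` is hypothetical, and the de-hyphenation rule is a heuristic: it also merges words that carry a genuine hyphen at a line break.

```python
import re
import unicodedata


def normalize_text(text):
    """Normalize unicode variants, line-break hyphenation and whitespace."""
    # Map hyphen variants (U+2010 hyphen, U+2011 non-breaking hyphen) to '-'
    text = text.replace('\u2011', '-').replace('\u2010', '-')
    # NFKC folds compatibility characters, e.g. non-breaking space -> space
    text = unicodedata.normalize('NFKC', text)
    # Re-join words split across lines: "hyper -\naemia" -> "hyperaemia"
    text = re.sub(r'(\w)\s*-\s*\n\s*(\w)', r'\1\2', text)
    # Collapse runs of whitespace (including newlines) into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


print(normalize_text('Assessment of\xa0endothelial   function'))
print(normalize_text('post\u2011COVID\u201119 syndrome'))
print(normalize_text('reactive hyper -\naemia peripheral'))
```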

Regardless, although at first I was planning to use only pypdf, PyMuPDF might allow for less noisy solutions, so I will use it for parsing.

## SciSpacy NER BENCHMARKING

Models are taken from [sciSpacy](https://allenai.github.io/scispacy/), a project from the Allen Institute for AI (AllenAI), so they should be reliable. spaCy allows for easy extraction of entities using pre-trained models; sciSpacy has models pretrained on scientific corpora.

When benchmarking the NER models, we need to ask the following questions:
* which entities are we interested in?
* how accurately are those entities captured by the models (i.e. are they capturing noise)?
* is the context suitable?
* how much time does inference take? (not as important for now)
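For the timing question, a small helper along these lines could be enough. This is only a sketch: `time_call` is a hypothetical name, and the demo times `sorted` as a stand-in so that the snippet runs without any model installed.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn several times and return (last_result, best_wall_time_in_seconds)."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload; in the notebook this would be the entity extraction call
result, seconds = time_call(sorted, [3, 1, 2])
print(result, f'{seconds:.6f}s')
```

If the timed function also loads the model on every call, the measurement includes loading time, not just inference.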

Wrapper functions used for extracting entities can be found in the src.data module; they rely on spaCy tokenization. Contextualization is a simple 'sliding window' approach, where we take a fixed number of tokens before and after the entity of interest. While it is not a semantic process, a large enough window should provide enough context.
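The sliding-window idea can be sketched as follows. This is not the src.data implementation: `context_window` is a hypothetical helper, whitespace splitting stands in for spaCy tokenization, and the entity is given as a token span `[ent_start, ent_end)`. The returned keys mirror the ones visible in the outputs further down.

```python
def context_window(tokens, ent_start, ent_end, window):
    """Return the entity plus `window` tokens of context on each side.

    tokens: list of token strings
    ent_start, ent_end: token span of the entity (end exclusive)
    """
    # Clamp the window to the document boundaries
    start = max(0, ent_start - window)
    end = min(len(tokens), ent_end + window)
    return {
        'entity': ' '.join(tokens[ent_start:ent_end]),
        'start_context': start,
        'end_context': end,
        'context': ' '.join(tokens[start:end]),
    }


# Whitespace split as a stand-in for a real tokenizer
tokens = 'we investigated endothelial dysfunction and endothelial biomarkers in patients'.split()
print(context_window(tokens, 2, 4, 2))
```

With this convention a single-token entity in the middle of a text and a window of 10 yields a context of 21 tokens.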

Normally, I would fine-tune a pretrained NER model for our own purposes; however, we don't have enough data to do this, and it is not clear exactly which entities we are interested in: they can be clinical, biological, chemical, technical, or all in one.

I will use data/enfothelial_dysfunction.pdf & data/mecfs_systematic_review.pdf for assessing the performance. For this purpose, I will not do any preprocessing, as I also want to examine how the models handle 'raw' data (i.e. data with all content).

In [88]:

!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_craft_md-0.5.4.tar.gz

!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_jnlpba_md-0.5.4.tar.gz

!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz

!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bionlp13cg_md-0.5.4.tar.gz



In [30]:
import time
import spacy
import sys

sys.path.append('..')
from src.data import extract_txt_from_pdf, extract_entities_with_context
from src import model

In [42]:
#read data - extract only the second and the last page of each PDF
enfo_txt = extract_txt_from_pdf('../data/enfothelial_dysfunction.pdf', [1])
enfo_ref = extract_txt_from_pdf('../data/enfothelial_dysfunction.pdf', [-1])
mecfs_txt = extract_txt_from_pdf('../data/mecfs_systematic_review.pdf', [1])
mecfs_ref = extract_txt_from_pdf('../data/mecfs_systematic_review.pdf', [-1])

### en_ner_craft_md
Following entity types:	GGP, SO, TAXON, CHEBI, GO, CL

In [44]:
#check how the model handles body text pages
output_enfo = extract_entities_with_context('en_ner_craft_md', enfo_txt, 10)
output_mecfs = extract_entities_with_context('en_ner_craft_md', mecfs_txt, 10)

[{'entity': 'virus', 'start_context': 133, 'end_context': 154, 'label': 'TAXON', 'context': 'complex disease frequently triggered by an infection with \nEpstein-Barr virus (EBV) or parvovirus B19, but several \n'}, {'entity': 'viral', 'start_context': 145, 'end_context': 166, 'label': 'TAXON', 'context': 'EBV) or parvovirus B19, but several \nother viral and nonviral triggers have been described \n[5–7]'}, {'entity': 'nonviral', 'start_context': 147, 'end_context': 168, 'label': 'TAXON', 'context': 'or parvovirus B19, but several \nother viral and nonviral triggers have been described \n[5–7]. Postexertional'}, {'entity': 'endothelium-derived \nvasoconstrictors', 'start_context': 261, 'end_context': 284, 'label': 'GGP', 'context': '-\nlators, while on the other hand, endothelium-derived \nvasoconstrictors are increased, leading to impaired \nendothelium-dependent vasodilation ['}, {'entity': 'virus', 'start_context': 133, 'end_context': 154, 'label': 'TAXON', 'context': 'complex diseas

In [47]:
#check how the model handles reference / bibliography pages
output_enfo = extract_entities_with_context('en_ner_craft_md', enfo_ref, 10)
output_mecfs = extract_entities_with_context('en_ner_craft_md', mecfs_ref, 10)

[{'entity': 'Rot A. Duffy antigen receptor', 'start_context': 152, 'end_context': 177, 'label': 'GGP', 'context': 'Hypertens. 2020;38(9):1682–98.\n 48. Novitzky‑Basso I, Rot A. Duffy antigen receptor for chemokines and its \ninvolvement in patterning and control'}, {'entity': 'interleukin 8', 'start_context': 204, 'end_context': 226, 'label': 'GGP', 'context': 'Bont ES. \nEndothelial cells are main producers of interleukin 8 through Toll‑like \nreceptor 2 and 4 signaling during bacterial'}, {'entity': 'Toll‑like \nreceptor 2', 'start_context': 207, 'end_context': 231, 'label': 'GGP', 'context': '\nEndothelial cells are main producers of interleukin 8 through Toll‑like \nreceptor 2 and 4 signaling during bacterial infection in leukopenic cancer \n'}, {'entity': 'bacterial', 'start_context': 215, 'end_context': 236, 'label': 'TAXON', 'context': '8 through Toll‑like \nreceptor 2 and 4 signaling during bacterial infection in leukopenic cancer \npatients. Clin Diagn Lab'}, {'entity': 'tight

Conclusion: the model extracts quite generic entities like virus, house; the entity types are not very domain-specific (biological pathways, gene ontology etc.). Errors are very visible in the outputs for reference texts (e.g. classifying names as CHEBI or GGP entities), and it is difficult to tell errors from actual entities as they are very domain-specific.

### en_ner_jnlpba_md
Following entity types: DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN

In [48]:
#check how the model handles body text pages
output_enfo = extract_entities_with_context('en_ner_jnlpba_md', enfo_txt, 10)
output_mecfs = extract_entities_with_context('en_ner_jnlpba_md', mecfs_txt, 10)

[{'entity': 'PA-COVID', 'start_context': 45, 'end_context': 66, 'label': 'PROTEIN', 'context': '\nimpacting everyday functioning.\nIn our observational longitudinal PA-COVID Fatigue \nstudy of PCS patients with persistent moderate to'}, {'entity': 'Enzyme (ACE) 2 receptor', 'start_context': 317, 'end_context': 343, 'label': 'PROTEIN', 'context': '\nthe endothelium by engaging the Angiotensin-Convert -\ning Enzyme (ACE) 2 receptor [15]. In acute COVID-19, \nthere'}, {'entity': 'peripheral \n', 'start_context': 367, 'end_context': 389, 'label': 'CELL_TYPE', 'context': '.\nIn this study, we aimed to characterise peripheral \nendothelial function using postocclusive reactive hyper -\naemia peripheral'}, {'entity': 'PCS', 'start_context': 385, 'end_context': 406, 'label': 'CELL_LINE', 'context': '-\naemia peripheral arterial tonometry (RH-PAT) in PCS \npatients following mild to moderate COVID-19. In \n'}, {'entity': 'ET-1', 'start_context': 407, 'end_context': 428, 'label': 'PROTEIN', 'con

In [49]:
#check how the model handles reference / bibliography pages
output_enfo = extract_entities_with_context('en_ner_jnlpba_md', enfo_ref, 10)
output_mecfs = extract_entities_with_context('en_ner_jnlpba_md', mecfs_ref, 10)

[{'entity': 'Page 11 of 11\n Haffke', 'start_context': 0, 'end_context': 16, 'label': 'PROTEIN', 'context': 'Page 11 of 11\n Haffke\xa0et\xa0al. Journal of Translational Medicine          ('}, {'entity': 'BMC', 'start_context': 109, 'end_context': 130, 'label': 'CELL_TYPE', 'context': 'to submit y our researc h  ?  Choose BMC and benefit fr om: ?  Choose BMC and'}, {'entity': 'BMC', 'start_context': 109, 'end_context': 130, 'label': 'CELL_TYPE', 'context': 'to submit y our researc h  ?  Choose BMC and benefit fr om: ?  Choose BMC and'}, {'entity': 'Novitzky‑Basso I', 'start_context': 149, 'end_context': 171, 'label': 'PROTEIN', 'context': 'society. J Hypertens. 2020;38(9):1682–98.\n 48. Novitzky‑Basso I, Rot A. Duffy antigen receptor for chemokines and its'}, {'entity': 'Rot A. Duffy antigen receptor', 'start_context': 152, 'end_context': 177, 'label': 'PROTEIN', 'context': 'Hypertens. 2020;38(9):1682–98.\n 48. Novitzky‑Basso I, Rot A. Duffy antigen receptor for chemokines and its \nin

Conclusion: extracted entities are not as generic, but clear errors occur, e.g. classifying PA-COVID and long COVID as PROTEIN or DNA; the entity types seem more specific to biological purposes and are quite self-explanatory. Errors are again visible in the outputs for reference texts (e.g. classifying the page count as PROTEIN).

### en_ner_bc5cdr_md
Following entity types: 
DISEASE, CHEMICAL

In [50]:
#check how the model handles body text pages
output_enfo = extract_entities_with_context('en_ner_bc5cdr_md', enfo_txt, 10)
output_mecfs = extract_entities_with_context('en_ner_bc5cdr_md', mecfs_txt, 10)

[{'entity': 'fatigue', 'start_context': 27, 'end_context': 48, 'label': 'DISEASE', 'context': 'the onset of \nCOVID-19 with common symptoms such as fatigue, \npost-exertional malaise and cognitive dysfunction \nimpacting everyday'}, {'entity': 'post-exertional malaise', 'start_context': 30, 'end_context': 52, 'label': 'DISEASE', 'context': '\nCOVID-19 with common symptoms such as fatigue, \npost-exertional malaise and cognitive dysfunction \nimpacting everyday functioning.\nIn'}, {'entity': 'cognitive dysfunction', 'start_context': 33, 'end_context': 55, 'label': 'DISEASE', 'context': 'common symptoms such as fatigue, \npost-exertional malaise and cognitive dysfunction \nimpacting everyday functioning.\nIn our observational longitudinal'}, {'entity': 'Fatigue', 'start_context': 46, 'end_context': 67, 'label': 'DISEASE', 'context': 'impacting everyday functioning.\nIn our observational longitudinal PA-COVID Fatigue \nstudy of PCS patients with persistent moderate to \n'}, {'entity': 'fa

In [54]:
#check how the model handles reference / bibliography pages
output_enfo = extract_entities_with_context('en_ner_bc5cdr_md', enfo_ref, 10)
output_mecfs = extract_entities_with_context('en_ner_bc5cdr_md', mecfs_ref, 10)

[{'entity': 'Toll‑like', 'start_context': 207, 'end_context': 228, 'label': 'CHEMICAL', 'context': '\nEndothelial cells are main producers of interleukin 8 through Toll‑like \nreceptor 2 and 4 signaling during bacterial infection in'}, {'entity': 'bacterial infection', 'start_context': 215, 'end_context': 237, 'label': 'DISEASE', 'context': '8 through Toll‑like \nreceptor 2 and 4 signaling during bacterial infection in leukopenic cancer \npatients. Clin Diagn Lab Immunol'}, {'entity': 'leukopenic cancer', 'start_context': 218, 'end_context': 240, 'label': 'DISEASE', 'context': '\nreceptor 2 and 4 signaling during bacterial infection in leukopenic cancer \npatients. Clin Diagn Lab Immunol. 2003;10(4):558–63.'}, {'entity': 'Li', 'start_context': 284, 'end_context': 305, 'label': 'CHEMICAL', 'context': 'J Biol Sci. \n2013;9(9):966–79.\n 51. Li L, Li J, Gao M, Fan H'}, {'entity': 'Li', 'start_context': 284, 'end_context': 305, 'label': 'CHEMICAL', 'context': 'J Biol Sci. \n2013;9(9):966–79

Conclusion: extracted entities seem correct at first glance for body text (e.g. fatigue, post-exertional malaise and cognitive dysfunction labelled as DISEASE); the entity types are quite clear but quite broad (e.g. many things can be recognized as CHEMICAL).

Errors are again visible in the outputs for reference texts (e.g. the author surname 'Li' and the fragment 'Toll‑like' labelled as CHEMICAL).

### en_ner_bionlp13cg_md

Following entity types: 
AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE

In [52]:
#check how the model handles body text pages
output_enfo = extract_entities_with_context('en_ner_bionlp13cg_md', enfo_txt, 10)
output_mecfs = extract_entities_with_context('en_ner_bionlp13cg_md', mecfs_txt, 10)

[{'entity': 'COVID-19', 'start_context': 21, 'end_context': 42, 'label': 'SIMPLE_CHEMICAL', 'context': 'at least three\xa0 months from the onset of \nCOVID-19 with common symptoms such as fatigue, \npost-exertional malaise'}, {'entity': 'patients', 'start_context': 51, 'end_context': 72, 'label': 'ORGANISM', 'context': 'In our observational longitudinal PA-COVID Fatigue \nstudy of PCS patients with persistent moderate to \nsevere fatigue and exertion intolerance'}, {'entity': 'patients', 'start_context': 51, 'end_context': 72, 'label': 'ORGANISM', 'context': 'In our observational longitudinal PA-COVID Fatigue \nstudy of PCS patients with persistent moderate to \nsevere fatigue and exertion intolerance'}, {'entity': 'patients', 'start_context': 51, 'end_context': 72, 'label': 'ORGANISM', 'context': 'In our observational longitudinal PA-COVID Fatigue \nstudy of PCS patients with persistent moderate to \nsevere fatigue and exertion intolerance'}, {'entity': '[4].', 'start_context': 115, '

In [55]:
#check how the model handles reference / bibliography pages
output_enfo = extract_entities_with_context('en_ner_bionlp13cg_md', enfo_ref, 10)
output_mecfs = extract_entities_with_context('en_ner_bionlp13cg_md', mecfs_ref, 10)

[{'entity': 'BMC', 'start_context': 80, 'end_context': 101, 'label': 'TISSUE', 'context': 'over 100M website views per year •\n  At BMC, research is always in progress.\nLearn more'}, {'entity': 'biomedcentral.com/submissionsReady', 'start_context': 91, 'end_context': 112, 'label': 'GENE_OR_GENE_PRODUCT', 'context': ', research is always in progress.\nLearn more biomedcentral.com/submissionsReady to submit y our researc h Ready to submit y'}, {'entity': 'BMC', 'start_context': 80, 'end_context': 101, 'label': 'CELL', 'context': 'over 100M website views per year •\n  At BMC, research is always in progress.\nLearn more'}, {'entity': 'BMC', 'start_context': 80, 'end_context': 101, 'label': 'CELL', 'context': 'over 100M website views per year •\n  At BMC, research is always in progress.\nLearn more'}, {'entity': 'ESH', 'start_context': 127, 'end_context': 148, 'label': 'ORGANISM', 'context': 'Choose BMC and benefit fr om: \nfrom the ESH working group on vascular structure and function and 

Conclusion: extracted entities seem quite off (e.g. classifying patients as ORGANISM and COVID-19 as SIMPLE_CHEMICAL); the entity types are quite clear but quite broad (e.g. many things can be recognized as a chemical). Errors are again visible in the outputs for reference texts (e.g. BMC labelled as TISSUE and CELL, a URL labelled as GENE_OR_GENE_PRODUCT).

### conclusions
* None of the provided models handles references / bibliography sections well
* They would all require some post-processing to remove duplicate entities that are repeated throughout the text
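The de-duplication step could look roughly like this. It is a sketch that assumes entities are dicts with `'entity'` and `'label'` keys, as in the outputs above; `deduplicate_entities` is a hypothetical name, and treating entities as equal regardless of case and whitespace is a design choice, not something the models require.

```python
def deduplicate_entities(entities):
    """Keep the first occurrence of each (entity text, label) pair.

    Entity text is compared case-insensitively with whitespace collapsed,
    so 'Fatigue' and 'fatigue' count as the same entity.
    """
    seen = set()
    unique = []
    for ent in entities:
        key = (' '.join(ent['entity'].split()).lower(), ent['label'])
        if key not in seen:
            seen.add(key)
            unique.append(ent)
    return unique


# Entity/label pairs as they appear in the outputs above (context omitted)
entities = [
    {'entity': 'patients', 'label': 'ORGANISM'},
    {'entity': 'patients', 'label': 'ORGANISM'},
    {'entity': 'BMC', 'label': 'TISSUE'},
    {'entity': 'BMC', 'label': 'CELL'},
    {'entity': 'BMC', 'label': 'CELL'},
]
print(deduplicate_entities(entities))
```

Keeping the label in the key preserves cases where the same text receives conflicting labels (here BMC as TISSUE and CELL), which is useful when inspecting model errors.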

I will probably choose **en_ner_bc5cdr_md** as the model of choice, mainly because of the high number of entity types; its extracted entities are not very generic and could be used for tasks such as target or biomarker identification. That being said, it is the largest model, so the tradeoff is the inference & fitting time.