The first section of this notebook shows different libraries for reading PDF files.

It was influenced by chapter 3, `Extracting Data`, of [Blueprints for Text Analytics Using Python](https://www.oreilly.com/library/view/blueprints-for-text/9781492074076/).

The remaining sections are influenced by chapter 4, `Preparing Textual Data for Statistics and Machine Learning`.

## Preparing Textual Data for Statistics and Machine Learning

### Extracting Data from a Laboratory Report

In [10]:
import markdown
import pymupdf
import pymupdf4llm

# Karyotype testing
#file = 'R24-0WH7-1.pdf'

# this is typical of reports coming at present from iGene
#file = 'ORU_R01_R125.1_R0A.txt.pdf'

# output from shire
file = 'SHIRE_ORU_R01_RM3.txt.pdf'

folder = 'Output/PDF/R01/'

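# Read the whole PDF into memory
with open(folder + file, 'rb') as g:
    pdf_stream = g.read()

# Open the in-memory PDF, convert it to Markdown, then render the Markdown as HTML
doc = pymupdf.open(stream=pdf_stream, filetype="pdf")
md_text = pymupdf4llm.to_markdown(doc)
html_text = markdown.markdown(md_text)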
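
Switching between sample reports by commenting lines in and out is easy to get wrong. A minimal sketch of one alternative keeps the file names from the cell above in a dictionary and builds the path with `pathlib`. The dictionary keys (`karyotype`, `igene`, `shire`) and the helper `report_path` are names invented here for illustration.

```python
from pathlib import Path

# File names taken from the cell above; the dictionary keys are hypothetical labels.
REPORTS = {
    "karyotype": "R24-0WH7-1.pdf",
    "igene": "ORU_R01_R125.1_R0A.txt.pdf",
    "shire": "SHIRE_ORU_R01_RM3.txt.pdf",
}
FOLDER = Path("Output/PDF/R01")


def report_path(name: str) -> Path:
    """Return the path of a named sample report; raises KeyError for unknown names."""
    return FOLDER / REPORTS[name]


print(report_path("shire").as_posix())
```

The returned `Path` can be passed straight to `open(..., 'rb')` in place of `folder + file`.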


### Cleaning Text Data

In [11]:
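from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collect the text content of an HTML document, discarding the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    """Return the text of `html` with all markup removed."""
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()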


clean_text = strip_tags(html_text)
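
To see what the stripper does, the following sketch runs the same `HTMLParser` approach on a small hand-written HTML fragment (not taken from the report) and then collapses runs of whitespace with `re`. The class name `TagStripper` and the helper `normalize_whitespace` are assumptions introduced here.

```python
import re
from html.parser import HTMLParser
from io import StringIO


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, data):
        self.text.write(data)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html: str) -> str:
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return stripper.get_data()


def normalize_whitespace(text: str) -> str:
    """Collapse any run of whitespace into one space (hypothetical helper)."""
    return re.sub(r"\s+", " ", text).strip()


# Hand-written sample, not real report content.
sample = "<h1>FISH Result</h1>\n<p>No evidence of <em>MYC</em> &amp; BCL2\nrearrangements.</p>"
print(normalize_whitespace(strip_tags(sample)))
# FISH Result No evidence of MYC & BCL2 rearrangements.
```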


### [spaCy](https://spacy.io/)

In [12]:
import spacy


print('============ Spacy ' + file + ' ==============')
nlp = spacy.load('en_core_web_sm')

#doc = pdf_reader(folder + file, nlp)
doc = nlp(clean_text)

for token in doc:
    print(token,end="|")

Karyotype|:|FISH|ON|PARAFFIN|EMBEDDED|TISSUE|
|FISH|Result|:|IRF4|/|DUSP22|gene|rearrangement|detected|No|evidence|of|MYC|,|IGH::MYC|,|
|BCL2|or|BCL6|rearrangements|.|
|Fluorescence|in|situ|hybridisation|(|FISH|)|studies|were|carried|out|using|the|following|Cytocell|
|gene|probes|;|MYC|Breakapart|,|IGH|-|MYC|LPS|Dual|Fusion|,|BCL2|Breakapart|,|BCL6|
|Breakapart|,|and|MyProbe|IRF4|/|DUSP22|Breakapart|to|detect|rearrangements|of|6p25|found|in|
|B-|and|T|-|cell|lymphomas|.|
|There|was|no|evidence|of|MYC|,|IGH::MYC|,|BCL2|or|BCL6|gene|rearrangements|in|multiple|
|tissue|areas|examined|.|
|A|signal|pattern|consistent|with|IRF4|gene|rearrangement|was|seen|in|multiple|tissue|areas|
|examined|.|
|IRF4|rearrangement|is|recognised|finding|in|Large|B|-|cell|lymphoma|and|is|consistent|with|the|
|diagnosis|of|'|Large|B|-|cell|lymphoma|with|IRF4|rearrangement|'|(|WHO|Classification|2022|;|ICDO|code|9698/3|)|.|
|Please|note|that|the|probe|used|in|this|assay|can|not|distinguish|between|IRF4|and|DUSP22

### Extracting Named Entities

In [13]:

for ent in doc.ents:
    print(ent.text, ent.label_)

MYC ORG
BCL6 ORG
Fluorescence PERSON
Cytocell ORG
MyProbe IRF4 ORG
MYC ORG
BCL6 ORG
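
In the output above, the general-purpose `en_core_web_sm` model tags gene symbols as `ORG` and "Fluorescence" as `PERSON`. One simple way to catch such domain terms, sketched below with the standard library only, is a lexicon lookup. The gene list is a hand-picked assumption based on the symbols visible in the report text, not a complete vocabulary, and `find_genes` is a name invented here.

```python
import re

# Hypothetical lexicon: gene symbols that appear in the report text above.
GENE_SYMBOLS = ["MYC", "IGH", "BCL2", "BCL6", "IRF4", "DUSP22"]

# Word boundaries prevent partial matches inside longer tokens.
GENE_PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, GENE_SYMBOLS)) + r")\b")


def find_genes(text: str) -> list[tuple[str, int, int]]:
    """Return (symbol, start, end) character spans for each lexicon hit."""
    return [(m.group(), m.start(), m.end()) for m in GENE_PATTERN.finditer(text)]


sentence = "No evidence of MYC, IGH::MYC, BCL2 or BCL6 rearrangements."
for symbol, start, end in find_genes(sentence):
    print(symbol, start, end)
```

A lookup like this only finds symbols that are already in the list, so it complements a statistical model and does not replace one.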


In [14]:
from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

## [scispaCy](https://github.com/allenai/scispacy)

A full spaCy pipeline and models for scientific/biomedical documents.

In [15]:
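import scispacy
import pyobo
from scispacy.abbreviation import AbbreviationDetector  # registers "abbreviation_detector"
from scispacy.linking import EntityLinker  # registers "scispacy_linker"

# Small scispaCy model for biomedical text
nlp = spacy.load("en_core_sci_sm")

# Detect abbreviations and their long forms
nlp.add_pipe("abbreviation_detector")

doc = nlp(clean_text)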

print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")


Abbreviation 	 Definition
FISH 	 (8, 9) Fluorescence in situ hybridisation
FISH 	 (234, 235) Fluorescence in situ hybridisation
FISH 	 (2, 3) Fluorescence in situ hybridisation
FISH 	 (34, 35) Fluorescence in situ hybridisation
FISH 	 (248, 249) Fluorescence in situ hybridisation
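
The `AbbreviationDetector` pairs a short form with its long form, as in "Fluorescence in situ hybridisation (FISH)". The sketch below is a deliberately simplified, standard-library illustration of that idea: it only accepts a parenthesised short form whose letters match the initials of the words directly before it. It is not scispaCy's implementation, and the function name `find_initialisms` is invented here.

```python
import re


def find_initialisms(text: str) -> list[tuple[str, str]]:
    """Return (short_form, long_form) pairs where the short form in parentheses
    spells the initials of the words directly before it."""
    pairs = []
    for match in re.finditer(r"\(([A-Za-z]{2,10})\)", text):
        short = match.group(1)
        preceding = text[: match.start()].split()
        candidate = preceding[-len(short):]
        if len(candidate) < len(short):
            continue  # not enough words before the parenthesis
        initials = "".join(word[0] for word in candidate)
        if initials.lower() == short.lower():
            pairs.append((short, " ".join(candidate)))
    return pairs


text = "Fluorescence in situ hybridisation (FISH) studies were carried out."
print(find_initialisms(text))
# [('FISH', 'Fluorescence in situ hybridisation')]
```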


### Named Entities

In [16]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Karyotype ENTITY
EMBEDDED TISSUE
FISH ENTITY
gene rearrangement ENTITY
MYC ENTITY
IGH::MYC ENTITY
BCL2 ENTITY
BCL6 ENTITY
rearrangements ENTITY
Cytocell
gene ENTITY
MYC Breakapart ENTITY
IGH-MYC ENTITY
LPS ENTITY
Dual ENTITY
Fusion ENTITY
BCL2 ENTITY
BCL6
Breakapart ENTITY
MyProbe IRF4/DUSP22 ENTITY
detect ENTITY
rearrangements ENTITY
6p25 ENTITY
T-cell lymphomas ENTITY
MYC ENTITY
IGH::MYC ENTITY
BCL2 ENTITY
BCL6 ENTITY
gene rearrangements ENTITY
multiple ENTITY
examined ENTITY
signal pattern ENTITY
consistent with ENTITY
IRF4 gene ENTITY
rearrangement ENTITY
multiple ENTITY
tissue ENTITY
examined ENTITY
IRF4 ENTITY
rearrangement ENTITY
Large B-cell lymphoma ENTITY
consistent with ENTITY
diagnosis ENTITY
IRF4 ENTITY
ICDO code 9698/3 ENTITY
probe ENTITY
assay ENTITY
IRF4 ENTITY
DUSP22 ENTITY
gene
 ENTITY
result ENTITY
interpreted ENTITY
conjunction ENTITY
clinical factors ENTITY
cases ENTITY
Ig ENTITY
translocations ENTITY
atypical ENTITY
breakpoints ENTITY
cluster
 ENTITY
genetic mecha
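
Several entries in the listing above are split over two lines (for example `Cytocell` followed by `gene ENTITY`), which suggests that the entity text itself contains a line break carried over from the PDF layout. A small sketch of tidying such spans before printing follows. It uses plain `(text, label)` tuples as stand-ins for `ent.text` and `ent.label_`; the sample tuples are reconstructed from the listing and are therefore an assumption, and `tidy` is a hypothetical helper.

```python
def tidy(text: str) -> str:
    """Replace every run of whitespace, including newlines, with one space."""
    return " ".join(text.split())


# Stand-ins for (ent.text, ent.label_), reconstructed from the listing (assumption).
entities = [
    ("EMBEDDED TISSUE\nFISH", "ENTITY"),
    ("Cytocell\ngene", "ENTITY"),
    ("BCL6\nBreakapart", "ENTITY"),
]

for text, label in entities:
    print(f"{tidy(text)}\t{label}")
```

In the notebook loop this would amount to printing `tidy(ent.text)` instead of `ent.text`.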

In [17]:
displacy.render(doc, jupyter=True, style='ent')

## [medspaCy](https://spacy.io/universe/project/medspacy)


In [18]:
from medspacy.visualization import visualize_ent
from medspacy.ner import TargetRule
import medspacy

nlp = medspacy.load()

target_matcher = nlp.get_pipe("medspacy_target_matcher")
target_rules = [
    TargetRule("Large B-cell lymphoma with IRF4 rearrangement", "PROBLEM"),
    TargetRule("structurally abnormal unbalanced karyotype", "PROBLEM"),
    TargetRule("Chromosomal translocation", "PROBLEM"),
    TargetRule("Thoracic aortic aneurysm", "PROBLEM"),
    TargetRule("Type II Diabetes Mellitus", "PROBLEM",
               pattern=[
                   {"LOWER": "type"},
                   {"LOWER": {"IN": ["2", "ii", "two"]}},
                   {"LOWER": {"IN": ["dm", "diabetes"]}},
                   {"LOWER": "mellitus", "OP": "?"}
               ]),
    TargetRule("warfarin", "MEDICATION")
]
target_matcher.add(target_rules)
doc = nlp(clean_text)

visualize_ent(doc)
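
The token pattern in the `Type II Diabetes Mellitus` rule accepts several spellings. As a rough illustration of what it covers, the sketch below expresses the same alternatives as a case-insensitive regular expression. This is only an approximation made for this note: spaCy matches on tokens, not characters, so tokenisation differences are ignored here, and the name `DIABETES_RE` is invented.

```python
import re

# Character-level approximation of the token pattern:
#   "type" + one of 2 / ii / two + one of dm / diabetes + optional "mellitus"
DIABETES_RE = re.compile(
    r"\btype\s+(?:2|ii|two)\s+(?:dm|diabetes)(?:\s+mellitus)?\b",
    re.IGNORECASE,
)

examples = [
    "Type II Diabetes Mellitus",
    "type 2 DM",
    "Type two diabetes",
    "Type 1 diabetes",  # not covered by the rule
]

for phrase in examples:
    print(f"{phrase!r}: {bool(DIABETES_RE.search(phrase))}")
```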

In [19]:
#displacy.render(doc, style='ent', jupyter=True)

### Entity Linking with scispaCy

In [20]:
import spacy

from scispacy.abbreviation import AbbreviationDetector
from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")

# resolve_abbreviations=True (below) requires the AbbreviationDetector pipe,
# so add it to this freshly loaded pipeline before the linker.
nlp.add_pipe("abbreviation_detector")

# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp(clean_text)

# Look at a single entity (here, the second one in the document)
entity = doc.ents[1]

print("Name: ", entity)

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
    print(linker.kb.cui_to_entity[umls_ent[0]])

Name:  EMBEDDED TISSUE
FISH
CUI: C1519524, Name: Paraffin Embedded Tissue
Definition: Tissue that is preserved and embedded in paraffin.
TUI(s): T024
Aliases: (total: 0): 
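
Each item in `entity._.kb_ents` is a `(CUI, score)` pair, which is why the loop above indexes `umls_ent[0]`. A minimal sketch of keeping only the best candidate above a similarity cut-off follows, using plain tuples in place of the linker output. The scores, the placeholder identifier `C0000000`, the `0.85` threshold, and the helper `best_candidate` are all made-up values and names for illustration.

```python
from typing import Optional


def best_candidate(
    kb_ents: list[tuple[str, float]], threshold: float = 0.85
) -> Optional[str]:
    """Return the CUI with the highest score, or None if no score reaches the threshold."""
    if not kb_ents:
        return None
    cui, score = max(kb_ents, key=lambda pair: pair[1])
    return cui if score >= threshold else None


# Stand-in for entity._.kb_ents; the scores and the second identifier are placeholders.
candidates = [("C1519524", 0.97), ("C0000000", 0.71)]
print(best_candidate(candidates))
# C1519524
```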
	 
