## PSA var testing
Here we take a look at some mock PSA data and see whether we can extract the values with NER. We will also try to extract the dates associated with each PSA value.

---



First, let's set up some mock data.

In [None]:
notes = '''
Note 1:
Date: July 10, 2023

Patient: John Smith
Age: 58
PSA Level: 4.2 ng/mL

Subjective: Mr. Smith presented today with concerns about his recent PSA test results. He reported a family history of prostate cancer, which heightened his anxiety. He denies any urinary symptoms, but mentioned occasional difficulty in initiating the stream.

Objective: Physical examination reveals a non-tender prostate with a normal size and texture. No palpable nodules were detected. Digital rectal examination (DRE) showed no abnormalities. Previous PSA levels from the last three years were 3.8 ng/mL (June 2022), 4.0 ng/mL (July 2021), and 3.6 ng/mL (August 2020).

Assessment: Given the gradual increase in PSA levels over the last year, there is a slight cause for concern. However, no significant changes in the DRE findings suggest localized prostate abnormalities.

Plan: To monitor PSA levels closely, a repeat PSA test in six weeks is recommended. Additionally, I have advised Mr. Smith to maintain a healthy lifestyle and to inform us immediately if any urinary symptoms develop.

Note 2:
Date: August 15, 2023

Patient: Robert Johnson
Age: 65
PSA Level: 8.5 ng/mL

Subjective: Mr. Johnson visited today to discuss the recent rise in his PSA level. He has been experiencing increased frequency of urination, especially at night, and mild discomfort in the lower abdomen.

Objective: Physical examination revealed a slightly enlarged prostate with a firm, non-nodular consistency. DRE findings were consistent with benign prostatic hyperplasia (BPH). Previous PSA levels over the past four years were 6.2 ng/mL (July 2022), 5.8 ng/mL (August 2021), and 6.0 ng/mL (June 2020).

Assessment: The progressive increase in PSA levels, along with urinary symptoms and an enlarged prostate, warrant further evaluation. While BPH is a possible explanation, prostate cancer cannot be ruled out at this stage.

Plan: Immediate referral to a urologist for a comprehensive evaluation, including a transrectal ultrasound (TRUS) and possible biopsy. Further investigations will help establish a definitive diagnosis and guide appropriate management.

Note 3:
Date: September 5, 2023

Patient: William Anderson
Age: 70
PSA Level: 1.8 ng/mL

Subjective: Mr. Anderson visited for his annual check-up, and we discussed his latest PSA results. He reported no urinary symptoms or family history of prostate cancer. He remains concerned due to media coverage on prostate cancer.

Objective: Physical examination revealed a normal-sized prostate with a smooth texture and no nodules detected during DRE. Previous PSA levels from the last four years were 1.9 ng/mL (September 2022), 2.1 ng/mL (October 2021), and 1.8 ng/mL (August 2020).

Assessment: Mr. Anderson's PSA levels have remained stable over the past few years, and his DRE findings are unremarkable. These findings are reassuring, suggesting a lower risk of prostate cancer.

Plan: Continue with routine annual PSA screenings and encourage a healthy lifestyle to maintain overall prostate health. Reiterate the importance of staying informed but avoiding unnecessary anxiety related to media reports.

Note 4:
Date: October 20, 2023

Patient: Michael Brown
Age: 62
PSA Level: 15.2 ng/mL

Subjective: Mr. Brown visited today with concerns about his recent PSA results. He has been experiencing persistent lower back pain and occasional blood in his urine. He denies any history of prostate-related issues.

Objective: Physical examination reveals an asymmetrically enlarged prostate with a firm, nodular consistency. DRE findings indicate potential irregularities on the right lobe. Previous PSA levels from the last three years were 8.9 ng/mL (October 2022), 6.4 ng/mL (November 2021), and 6.8 ng/mL (September 2020).

Assessment: The significant rise in PSA levels, along with the presence of urinary symptoms and palpable abnormalities during DRE, raises concerns about possible prostate malignancy.

Plan: Urgent referral to a urologist for further evaluation, including imaging studies and a prostate biopsy. Prompt investigation is crucial to determine the extent of the disease and appropriate management.

Note 5:
Date: November 12, 2023

Patient: David Clark
Age: 55
PSA Level: 2.6 ng/mL

Subjective: Mr. Clark visited today for his regular check-up, and we discussed his recent PSA results. He reported no urinary symptoms or family history of prostate cancer. He is proactive about his health and regularly engages in physical activities.

Objective: Physical examination revealed a normal-sized prostate with a smooth texture and no palpable abnormalities during DRE. Previous PSA levels from the last three years were 2.4 ng/mL (November 2022), 2.1 ng/mL (December 2021), and 2.3 ng/mL (October 2020).

Assessment: Mr. Clark's PSA levels have been consistently within the normal range, and his DRE findings are unremarkable. There are no immediate concerns related to prostate cancer based on his history and current presentation.

Plan: Continue with regular PSA screenings and advise Mr. Clark to maintain his healthy lifestyle habits. Educate him about the importance of ongoing monitoring and being proactive about his health to ensure early detection of any potential issues.'''

Now let's convert it into a dataframe.

In [None]:
import pandas as pd
import re

def parse_notes_to_dataframe(notes):
    note_index = []
    dates = []
    patients = []
    ages = []
    psa_levels = []
    texts = []

    # Use regular expression to split the notes based on the "Note" keyword followed by a digit
    note_sections = re.split(r'\nNote \d+:\n', notes)[1:]

    for idx, note_section in enumerate(note_sections, start=1):
        lines = note_section.strip().split('\n')
        if len(lines) < 8:
            continue

        note_index.append(idx)
        dates.append(lines[0].split(': ')[1])

        # Use regular expression to extract the patient name and age
        patient_info = re.search(r'Patient: (.+?)\nAge: (\d+)', note_section)
        if patient_info:
            patients.append(patient_info.group(1))
            ages.append(int(patient_info.group(2)))
        else:
            patients.append(None)
            ages.append(None)

        # lines[3] is the "Age:" line; the PSA value is on the "PSA Level:" line (lines[4])
        psa_levels.append(float(lines[4].split(': ')[1].split()[0]))
        texts.append('\n'.join(lines[5:]))

    data = {
        'Note index': note_index,
        'Date': dates,
        'Patient': patients,
        'Age': ages,
        'PSA Level': psa_levels,
        'Text': texts
    }

    df = pd.DataFrame(data)
    return df
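
Because every header field in the mock notes carries a label, the fields can also be looked up by name instead of by line position. The following is a minimal sketch under that assumption; `HEADER_PATTERNS` and `parse_header` are hypothetical names introduced here for illustration, and the labels are the ones that appear in the mock data.

```python
import re

# Hypothetical helper: fields are found by label, so blank lines or a
# reordered header do not shift the values.
HEADER_PATTERNS = {
    "Date": re.compile(r"^Date:\s*(.+)$", re.MULTILINE),
    "Patient": re.compile(r"^Patient:\s*(.+)$", re.MULTILINE),
    "Age": re.compile(r"^Age:\s*(\d+)", re.MULTILINE),
    "PSA Level": re.compile(r"^PSA Level:\s*(\d+(?:\.\d+)?)", re.MULTILINE),
}

def parse_header(note_section):
    """Return the labelled header fields of one note (None where a label is missing)."""
    fields = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(note_section)
        fields[name] = match.group(1).strip() if match else None
    # Convert the numeric fields once they have been located
    if fields["Age"] is not None:
        fields["Age"] = int(fields["Age"])
    if fields["PSA Level"] is not None:
        fields["PSA Level"] = float(fields["PSA Level"])
    return fields

sample = "Date: July 10, 2023\n\nPatient: John Smith\nAge: 58\nPSA Level: 4.2 ng/mL\n\nSubjective: ..."
header = parse_header(sample)
print(header)
```

A lookup by label trades a little extra regex work for robustness against layout changes in the note header.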

Now print out the dataframe

In [None]:
df = parse_notes_to_dataframe(notes)
df

Unnamed: 0,Note index,Date,Patient,Age,PSA Level,Text
0,1,"August 15, 2023",Robert Johnson,65,65.0,\nSubjective: Mr. Johnson visited today to dis...
1,2,"September 5, 2023",William Anderson,70,70.0,\nSubjective: Mr. Anderson visited for his ann...
2,3,"October 20, 2023",Michael Brown,62,62.0,\nSubjective: Mr. Brown visited today with con...
3,4,"November 12, 2023",David Clark,55,55.0,\nSubjective: Mr. Clark visited today for his ...


## Build Extraction Methods
Now let's build some extraction methods that identify PSA values and dates in the unstructured text.

## Regex Approach
Before we bring out the large language models, let's first attack this problem with simple regex. In many situations PSA values are easy to detect, so we should try the cheapest compute tactic first before getting into the harder approaches. We begin the investigation here.

In [None]:
import re
import pandas as pd

def extract_psas_with_dates(text):
    # Step 1: Preprocess the text
    text = text.replace('\n', ' ')

    # Step 2: Use Regex to Extract PSA Values and Dates
    psa_matches = re.findall(r'PSA(?:[- ]+)(\d+(?:\.\d+)?)', text, re.IGNORECASE)
    date_matches = re.findall(r'\b(\d{1,2}/\d{1,2}/\d{4})\b|\b(\w+\s+\d{1,2},?\s+\d{4})\b', text)

    # Step 3: Extract PSA Values and Associated Dates
    psa_values = [float(match) for match in psa_matches]
    dates = [date[0] if date[0] else date[1] for date in date_matches]

    # Step 4: Build a DataFrame with Extracted Results
    df = pd.DataFrame({
        'text': [text],
        'extractions': [[{'PSA': psa, 'Date': date} for psa, date in zip(psa_values, dates)]]
    })

    return df


In [None]:
# Example usage with a single doctors' note
text1 = "The patient came in today with PSA-150 on 10/29/1999. This is indicative of cancer. Unlike on 10/2/1999 when PSA was 2, not cancer indicative."
df1 = extract_psas_with_dates(text1)
df1['extractions']


0    [{'PSA': 150.0, 'Date': '10/29/1999'}]
Name: extractions, dtype: object

In [None]:
# Example usage with multiple doctors' notes
text2 = "Patient A's PSA-12 on 5/15/2001, PSA-15 on 6/20/2002. Patient B's PSA-8.5 on 9/30/2003."
df2 = extract_psas_with_dates(text2)
df2
df2['extractions'][0]

[{'PSA': 12.0, 'Date': '5/15/2001'},
 {'PSA': 15.0, 'Date': '6/20/2002'},
 {'PSA': 8.5, 'Date': '9/30/2003'}]
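
The regex function pairs values and dates positionally with `zip`, which assumes one date per PSA value in the same order. A possible alternative, sketched below under the assumption that the relevant date is the one closest to the value in the text, is to pair by character offset. `pair_psa_with_nearest_date` is a hypothetical name, and only the `m/d/yyyy` date shape is handled here.

```python
import re

PSA_RE = re.compile(r"PSA(?:[- ]+)(\d+(?:\.\d+)?)", re.IGNORECASE)
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

def pair_psa_with_nearest_date(text):
    """Pair each PSA match with the date whose start offset is closest to the value."""
    dates = [(m.start(), m.group(0)) for m in DATE_RE.finditer(text)]
    pairs = []
    for m in PSA_RE.finditer(text):
        # m.start(1) is the offset of the numeric value itself
        nearest = min(dates, key=lambda d: abs(d[0] - m.start(1)), default=None)
        pairs.append({"PSA": float(m.group(1)),
                      "Date": nearest[1] if nearest else None})
    return pairs

# The leading visit date has no PSA value, which would shift a zip-based pairing
example = "Seen on 3/1/2020 for follow-up. PSA 4.1 on 4/2/2021, PSA 5.0 on 5/3/2022."
pairs = pair_psa_with_nearest_date(example)
print(pairs)
```

Distance-based pairing is still a heuristic: a date in a neighbouring sentence can be closer than the date that actually belongs to the value.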

## Including More Context - ML Approach
This works well for direct patterns such as 'Patient A's PSA-12 on 5/15/2001', but we need more help for the variety of context-driven phrasings, like this one:
<br>

'Unlike on 10/2/1999 when PSA was 2'

This requires a stronger tool than regex alone, so let's bring a bit of AI into the mix and see whether we can improve our results.
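
A broader pattern can cover a few of these phrasings, although it cannot enumerate all of them, which is the motivation for the NER approach that follows. The sketch below is illustrative only: the connector words (`was`, `is`, `of`, `at`) and optional nouns (`level`, `value`, `result`) are assumptions, not a validated list.

```python
import re

# Assumed connector words; real notes will need a longer, validated list.
CONTEXT_PSA_RE = re.compile(
    r"\bPSA\b(?:\s+(?:level|value|result))?"       # optional noun after "PSA"
    r"(?:\s*[-:=]\s*|\s+(?:was|is|of|at)\s+|\s+)"  # separator or connecting word
    r"(\d+(?:\.\d+)?)",                            # the numeric value
    re.IGNORECASE,
)

def find_psa_values(text):
    """Return every number that follows a PSA mention through an assumed connector."""
    return [float(value) for value in CONTEXT_PSA_RE.findall(text)]

text = ("The patient came in today with PSA-150 on 10/29/1999. "
        "Unlike on 10/2/1999 when PSA was 2, not cancer indicative.")
values = find_psa_values(text)
print(values)
```

Each new phrasing needs another alternative in the pattern, so this approach becomes harder to maintain as the notes get more varied.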

In [None]:
import spacy
import pandas as pd

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

def extract_psas_with_dates(text):
    # Step 1: Preprocess the text (optional)
    text = text.replace('\n', ' ')

    # Step 2: Use spaCy NER to extract entities (dates and numeric values)
    doc = nlp(text)
    psa_values = []
    psa_offsets = []  # character offset of each value in `text`
    dates = []        # (date text, character offset) pairs
    print(doc.ents)
    for ent in doc.ents:
        print(ent)
        print(ent.label_)
        if ent.label_ == 'QUANTITY' or (any(char.isdigit() for char in ent.text) and len(ent.text) < 5):
            try:
                value = float(ent.text)
            except ValueError:
                value = None  # contains digits but is not a bare number
            if value is not None:
                print('hit')
                psa_values.append(value)
                psa_offsets.append(ent.start_char)
        if ent.label_ == 'DATE':
            dates.append((ent.text, ent.start_char))

    # Step 3: Associate each PSA value with the nearest date by character offset.
    # Entity offsets are used instead of text.find(str(value)), because
    # str(12.0) is '12.0', which does not occur in a note that says "PSA 12".
    associations = []
    print(psa_values)
    for psa_value, psa_offset in zip(psa_values, psa_offsets):
        nearest_date = None
        min_distance = float('inf')
        for date_text, date_offset in dates:
            if any(char.isdigit() for char in date_text):  # skip relative dates such as "today"
                distance = abs(date_offset - psa_offset)
                if distance < min_distance:
                    nearest_date = date_text
                    min_distance = distance
        associations.append({'PSA': psa_value, 'Date': nearest_date})

    # Step 4: Build a DataFrame with Extracted Results
    df = pd.DataFrame({
        'text': [text],
        'extractions': [associations]
    })

    return df

In [None]:
# Example usage with a single doctors' note
text1 = "The patient came in today with PSA-150 on 10/29/1999. This is indicative of cancer. Unlike on 10/2/1999 when PSA was 2.0, not cancer indicative."
df1 = extract_psas_with_dates(text1)
df1

(today, 10/29/1999, 10/2/1999, PSA, 2.0)
today
DATE
10/29/1999
DATE
10/2/1999
DATE
PSA
ORG
2.0
CARDINAL
hit
[2.0]


Unnamed: 0,text,extractions
0,The patient came in today with PSA-150 on 10/2...,"[{'PSA': 2.0, 'Date': '10/2/1999'}]"


In [None]:
# Example usage with multiple doctors' notes
text2 = "Patient A's PSA 12 on 5/15/2001, PSA 15 on 6/20/2002, Patient B's PSA 8.5 on 9/30/2003."
df2 = extract_psas_with_dates(text2)
df2

(12, 5/15/2001, 15, 6/20/2002, Patient B's, 9/30/2003)
12
CARDINAL
hit
5/15/2001
DATE
15
CARDINAL
hit
6/20/2002
DATE
Patient B's
PERSON
9/30/2003
DATE
[12.0, 15.0]


Unnamed: 0,text,extractions
0,"Patient A's PSA 12 on 5/15/2001, PSA 15 on 6/2...","[{'PSA': 12.0, 'Date': '5/15/2001'}, {'PSA': 1..."


## Other options
This works, but it could potentially be better if each PSA value were paired with the date that is closest by position in the list of extracted entities, rather than the date whose text is closest to the PSA value in the raw string.

Let's see if we can implement that in a similar code structure:

In [None]:
import datetime

def find_closest_date_index(target, lst):
    target_index = lst.index(target)
    closest_diff = float('inf')
    closest_index = None

    for i, item in enumerate(lst):
        if i == target_index:
            continue

        if isinstance(item, str) and '/' in item:
            try:
                datetime.datetime.strptime(item, '%m/%d/%Y')
                diff = abs(i - target_index)
                if diff < closest_diff:
                    closest_diff = diff
                    closest_index = i
            except ValueError:
                pass

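    # No parseable date in the list: avoid indexing with None
    if closest_index is None:
        return None, None
    return closest_index, lst[closest_index]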

# Example usage
lst = ['today', '10/29/1999', '10/2/1999', 'PSA', '2.0']
target = '2.0'
index, value = find_closest_date_index(target, lst)
print(f"Closest date index: {index}, value: {value}")


Closest date index: 2, value: 10/2/1999


In [None]:
def extract_psas_with_dates2(text):
    # Step 1: Preprocess the text (optional)
    text = text.replace('\n', ' ')

    # Step 2: Use spaCy NER to extract entities (dates and numeric values)
    doc = nlp(text)
    psa_values = []
    psa_texts = []  # entity text for each value, used to locate it in list_e
    print(doc.ents)
    list_e = []
    for ent in doc.ents:
        list_e.append(str(ent))
        print(ent)
        print(ent.label_)
        if ent.label_ == 'QUANTITY' or (any(char.isdigit() for char in ent.text) and len(ent.text) < 5):
            try:
                value = float(ent.text)
            except ValueError:
                value = None  # contains digits but is not a bare number
            if value is not None:
                print('hit')
                psa_values.append(value)
                psa_texts.append(str(ent))

    # Step 3: Associate each PSA value with the closest date in the entity list
    associations = []
    print(psa_values)
    for psa_value, psa_text in zip(psa_values, psa_texts):
        # Look up the original entity text: str(12.0) is '12.0', which would not match '12'
        _, nearest_date = find_closest_date_index(psa_text, list_e)
        associations.append({'PSA': psa_value, 'Date': nearest_date})

    # Step 4: Build a DataFrame with Extracted Results
    df = pd.DataFrame({
        'text': [text],
        'extractions': [associations]
    })
    return df




In [None]:
# Example usage with a single doctors' note
text1 = "The patient came in today with PSA-150 on 10/29/1999. This is indicative of cancer. Unlike on 10/2/1999 when PSA was 2.0, not cancer indicative."
df1 = extract_psas_with_dates2(text1)
df1

(today, 10/29/1999, 10/2/1999, PSA, 2.0)
today
DATE
10/29/1999
DATE
10/2/1999
DATE
PSA
ORG
2.0
CARDINAL
hit
[2.0]


Unnamed: 0,text,extractions
0,The patient came in today with PSA-150 on 10/2...,"[{'PSA': 2.0, 'Date': '10/2/1999'}]"


## Combined Approach
Finally, let's combine the two approaches, regex and spaCy NER. The first part runs the regex extraction, the second part runs the NER, and the last part brings them together and removes any duplicate values.

In [None]:
import spacy
import pandas as pd
import re
import datetime


# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")


def are_dicts_equal(dict1, dict2):
    return dict1.items() == dict2.items()


def extract_psas_with_regex(text):
    # Step 1: Preprocess the text
    text = text.replace('\n', ' ')

    # Step 2: Use Regex to Extract PSA Values and Dates
    psa_matches = re.findall(r'PSA(?:[- ]+)(\d+(?:\.\d+)?)', text, re.IGNORECASE)
    date_matches = re.findall(r'\b(\d{1,2}/\d{1,2}/\d{4})\b|\b(\w+\s+\d{1,2},?\s+\d{4})\b', text)

    # Step 3: Extract PSA Values and Associated Dates
    psa_values = [float(match) for match in psa_matches]
    dates = [date[0] if date[0] else date[1] for date in date_matches]

    # Step 4: Pair values and dates positionally and return them as a list of dicts
    extract = [{'PSA': psa, 'Date': date} for psa, date in zip(psa_values, dates)]
    return extract


def find_closest_date_index(target, lst):
    target_index = lst.index(target)
    closest_diff = float('inf')
    closest_index = None

    for i, item in enumerate(lst):
        if i == target_index:
            continue

        if isinstance(item, str) and '/' in item:
            try:
                datetime.datetime.strptime(item, '%m/%d/%Y')
                diff = abs(i - target_index)
                if diff < closest_diff:
                    closest_diff = diff
                    closest_index = i
            except ValueError:
                pass

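    # No parseable date in the list: avoid indexing with None
    if closest_index is None:
        return None, None
    return closest_index, lst[closest_index]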


def extract_psas_with_dates2(text):
    # Step 1: Preprocess the text and start from the regex-based extractions
    text = text.replace('\n', ' ')
    associations = extract_psas_with_regex(text)

    # Step 2: Use spaCy NER to extract entities (dates and numeric values)
    doc = nlp(text)
    psa_values = []
    psa_texts = []  # entity text for each value, used to locate it in list_e
    print(doc.ents)
    list_e = []
    for ent in doc.ents:
        list_e.append(str(ent))
        print(ent)
        print(ent.label_)
        if ent.label_ == 'QUANTITY' or (any(char.isdigit() for char in ent.text) and len(ent.text) < 5):
            try:
                value = float(ent.text)
            except ValueError:
                value = None  # contains digits but is not a bare number
            if value is not None:
                print('hit')
                psa_values.append(value)
                psa_texts.append(str(ent))

    # Step 3: Add the NER-based associations that the regex pass did not already find
    print(psa_values)
    for psa_value, psa_text in zip(psa_values, psa_texts):
        # Look up the original entity text: str(12.0) is '12.0', which would not match '12'
        _, nearest_date = find_closest_date_index(psa_text, list_e)
        new_dict = {'PSA': psa_value, 'Date': nearest_date}
        if not any(are_dicts_equal(new_dict, existing_dict) for existing_dict in associations):
            associations.append(new_dict)

    # Step 4: Build a DataFrame with Extracted Results
    df = pd.DataFrame({
        'text': [text],
        'extractions': [associations]
    })
    return df




Now let's check the final results.

In [None]:
text1 = "The patient came in today with PSA-150 on 10/29/1999. This is indicative of cancer. Unlike on 10/2/1999 when PSA was 2.0, not cancer indicative."
df1 = extract_psas_with_dates2(text1)
df1['extractions'][0]

(today, 10/29/1999, 10/2/1999, PSA, 2.0)
today
DATE
10/29/1999
DATE
10/2/1999
DATE
PSA
ORG
2.0
CARDINAL
hit
[2.0]


[{'PSA': 150.0, 'Date': '10/29/1999'}, {'PSA': 2.0, 'Date': '10/2/1999'}]
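
The mock notes at the top of this notebook record PSA history in the shape `3.8 ng/mL (June 2022)`, which the PSA and date patterns used so far do not target. A minimal sketch for that shape follows; `HISTORY_RE` and `extract_psa_history` are hypothetical names, and month parsing with `%B` assumes an English locale.

```python
import re
from datetime import datetime

# Matches "<number> ng/mL (<Month> <YYYY>)" as written in the mock notes
HISTORY_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*ng/mL\s*\(\s*([A-Z][a-z]+)\s+(\d{4})\s*\)"
)

def extract_psa_history(text):
    """Return (value, 'YYYY-MM') pairs for 'X ng/mL (Month YYYY)' mentions."""
    history = []
    for value, month, year in HISTORY_RE.findall(text):
        try:
            parsed = datetime.strptime(f"{month} {year}", "%B %Y")
        except ValueError:
            continue  # the parenthesised word was not a month name
        history.append((float(value), parsed.strftime("%Y-%m")))
    return history

objective = ("Previous PSA levels from the last three years were 3.8 ng/mL (June 2022), "
             "4.0 ng/mL (July 2021), and 3.6 ng/mL (August 2020).")
history = extract_psa_history(objective)
print(history)
```

Anchoring on the unit rather than on the word "PSA" works for these notes, but it would also pick up any other lab value reported in ng/mL.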

## Further AI/ML NER options
From here we have a regex and NER baseline that can begin the extraction process, but refined rules and possibly a further-trained NER model will be needed for a full final product for this variable work.
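
One example of such a refined rule: the duplicate check compares date strings literally, so the same day written as `10/2/1999` and `October 2, 1999` would be kept twice. The sketch below normalizes dates before comparing; `DATE_FORMATS`, `normalize_date`, and `dedupe_extractions` are hypothetical names, and the list of formats is an assumption to extend as real notes require.

```python
from datetime import datetime

# Assumed input formats; extend as new date styles appear in real notes.
DATE_FORMATS = ("%m/%d/%Y", "%B %d, %Y", "%B %d %Y")

def normalize_date(raw):
    """Return an ISO date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def dedupe_extractions(extractions):
    """Drop entries whose PSA value and normalized date were already seen."""
    seen = set()
    unique = []
    for item in extractions:
        raw_date = item["Date"]
        # Fall back to the raw string so unparseable dates are not merged together
        date_key = (normalize_date(raw_date) or raw_date) if raw_date else None
        key = (item["PSA"], date_key)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

merged = dedupe_extractions([
    {"PSA": 2.0, "Date": "10/2/1999"},
    {"PSA": 2.0, "Date": "October 2, 1999"},
    {"PSA": 150.0, "Date": "10/29/1999"},
])
print(merged)
```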

## Scispacy Medical Detection
Now we will use a model that builds on spaCy's core NLP pipeline but is pre-trained on large amounts of medical text, to help identify the targets of medical interest inside the text.


scispaCy is an open-source Python package, built on spaCy, that provides models for processing biomedical, scientific, or clinical text.

In [None]:
!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz

Now import the tools we need for biomedical NER:

In [None]:
import scispacy
import spacy
nlp = spacy.load("en_ner_bc5cdr_md")

Now let's see what we can extract when we run it on our target text.


In [None]:
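# `Text` is the single-note example from the regex section
Text = "The patient came in today with PSA-150 on 10/29/1999. This is indicative of cancer. Unlike on 10/2/1999 when PSA was 2, not cancer indicative."
doc = nlp(Text)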
print("TEXT", "START", "END", "ENTITY TYPE")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

TEXT START END ENTITY TYPE
10/29/1999 42 52 CHEMICAL
cancer 76 82 DISEASE
10/2/1999 94 103 CHEMICAL
cancer 124 130 DISEASE


We can see this is not exactly what we need (both dates are tagged `CHEMICAL`), but it may be more useful on full medical data that contains other values we need identified to help the final date-PSA association.

## Stanza Medical Detection
Next we will try a test with the Stanza framework. Stanza is a Python NLP toolkit that supports 60+ human languages. It is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data, and offers pretrained models on 100 treebanks. Additionally, Stanza provides a stable, officially maintained Python interface to the Java Stanford CoreNLP Toolkit.

Note that Stanza only supports Python 3.6 and above. Installing and importing Stanza are as simple as running the following commands:

In [None]:
!pip install Stanza

Collecting Stanza
  Downloading stanza-1.5.0-py3-none-any.whl (802 kB)
Collecting emoji (from Stanza)
  Downloading emoji-2.7.0.tar.gz (361 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (pyproject.toml) ... done
  Created wheel for emoji: filename=emoji-2.7.0-py2.py3-none-any.whl size=356563 sha256=92a52228d14e268d65a95c966f07d3a2f2b822a05fb2380975a2addc17db3fc8
  Stored in directory: /root/.cache/pip/wheels/41/11/48/5df0b9727d5669c9174a141134f10304d1d78a3b89a4676f3d
Successfully built emoji
Installing collected packages: emoj

Next, download the English model.

In [None]:
# Import the package
import stanza

# Download an English model into the default directory
print("Downloading English model...")
stanza.download('en')

Downloading English model...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.


## Processing Text
To process a piece of text, you first need to construct a `Pipeline` with different `Processor` units. The pipeline is language-specific, so you need to specify the language first.

In [None]:
# Build an English pipeline, with all processors by default
print("Building an English pipeline...")
en_nlp = stanza.Pipeline('en',verbose=True)

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Building an English pipeline...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| constituency | wsj       |
| depparse     | combined  |
| sentiment    | sstplus   |
| ner          | ontonotes |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Annotating Text
After a pipeline is successfully constructed, you can get annotations of a piece of text simply by passing the string into the pipeline object. The pipeline returns a `Document` object, which can be used to access detailed annotations. For example:

In [None]:
# Processing English text
en_doc = en_nlp(Text)
print(type(en_doc))

<class 'stanza.models.common.doc.Document'>


We can now have the model perform a full prediction of the structure and annotation of the text:

In [None]:
for i, sent in enumerate(en_doc.sentences):
    print("[Sentence {}]".format(i+1))

    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(\
              word.text, word.lemma, word.pos, word.head, word.deprel))

    print("")

[Sentence 1]
The         	the         	DET   	2	det         
patient     	patient     	NOUN  	3	nsubj       
came        	come        	VERB  	0	root        
in          	in          	ADV   	3	advmod      
today       	today       	NOUN  	3	obl:tmod    
with        	with        	ADP   	7	case        
PSA         	PSA         	PROPN 	3	obl         
-           	-           	SYM   	9	case        
150         	150         	NUM   	7	nmod        
on          	on          	ADP   	11	case        
10/29/1999  	10/29/1999  	NUM   	3	obl         
.           	.           	PUNCT 	3	punct       

[Sentence 2]
This        	this        	PRON  	3	nsubj       
is          	be          	AUX   	3	cop         
indicative  	indicative  	ADJ   	0	root        
of          	of          	ADP   	5	case        
cancer      	cancer      	NOUN  	3	obl         
.           	.           	PUNCT 	3	punct       

[Sentence 3]
Unlike      	unlike      	ADP   	3	case        
on          	on          	ADP   	3	case       

This can help us see whether the things we are looking for can be identified directly by their role in the sentence structure. Here it looks like we may have too many targets for that, so let's move on to more cognitive, entity-based extraction.
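
To make the sentence-structure idea concrete, the sketch below works on rows copied by hand from the first sentence of the printed parse and collects the numbers whose head is the `PSA` token. It does not call Stanza; `SENTENCE_1` and `numbers_governed_by` are hypothetical names, and with a live pipeline the same tuples could presumably be built from each word's `id`, `text`, `pos`, `head`, and `deprel`.

```python
# Rows copied from the first sentence of the printed parse:
# (id, text, pos, head, deprel). Ids are 1-based and head 0 marks the root.
SENTENCE_1 = [
    (1, "The", "DET", 2, "det"),
    (2, "patient", "NOUN", 3, "nsubj"),
    (3, "came", "VERB", 0, "root"),
    (4, "in", "ADV", 3, "advmod"),
    (5, "today", "NOUN", 3, "obl:tmod"),
    (6, "with", "ADP", 7, "case"),
    (7, "PSA", "PROPN", 3, "obl"),
    (8, "-", "SYM", 9, "case"),
    (9, "150", "NUM", 7, "nmod"),
    (10, "on", "ADP", 11, "case"),
    (11, "10/29/1999", "NUM", 3, "obl"),
    (12, ".", "PUNCT", 3, "punct"),
]

def numbers_governed_by(rows, target_text):
    """Return NUM tokens whose syntactic head is a token with text `target_text`."""
    target_ids = {row[0] for row in rows if row[1] == target_text}
    return [row[1] for row in rows if row[2] == "NUM" and row[3] in target_ids]

attached = numbers_governed_by(SENTENCE_1, "PSA")
print(attached)
```

In this sentence the value `150` hangs off `PSA`, while the date attaches to the verb, so the parse links the value to the test but not to its date.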

## Now let's perform basic NER
Running the `NERProcessor` only requires the `TokenizeProcessor`. After the pipeline is run, the `Document` contains a list of `Sentence`s, and each `Sentence` contains a list of `Token`s. Named entities can be accessed through the `entities` or `ents` property of a `Document` or `Sentence`. Alternatively, token-level NER tags can be accessed via the `ner` field of each `Token`.

Here is an example of performing named entity recognition on a piece of text and accessing the named entities in the entire document:

In [None]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

doc = nlp(Text)
print()
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')


INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!



entity: PSA-150	type: PRODUCT
entity: 10/29/1999	type: CARDINAL
entity: 10/2/1999	type: DATE
entity: PSA	type: ORG
entity: 2	type: CARDINAL


This is much better: it gives us dates as well as the target PSA mentions and the numeric values, although the extraction around them is not perfect (for example, `10/29/1999` is tagged `CARDINAL` and `PSA-150` is tagged `PRODUCT`).
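
Mislabels like these can be patched with a small rule-based pass over the `(text, type)` pairs. The sketch below runs on the pairs copied from the output above rather than on a live Stanza document; `relabel` is a hypothetical helper, the shape rules are assumptions, and `TEST` is simply a label name chosen here for PSA mentions.

```python
import re

DATE_SHAPE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")
PSA_WITH_VALUE = re.compile(r"^PSA[- ]+(\d+(?:\.\d+)?)$", re.IGNORECASE)

def relabel(entities):
    """Re-tag (text, type) pairs using simple shape rules (assumed, not exhaustive)."""
    fixed = []
    for text, ent_type in entities:
        fused = PSA_WITH_VALUE.match(text)
        if DATE_SHAPE.match(text):
            fixed.append((text, "DATE"))                # date-shaped, whatever the model said
        elif fused:
            fixed.append(("PSA", "TEST"))               # split "PSA-150" into test name...
            fixed.append((fused.group(1), "CARDINAL"))  # ...and its value
        elif text.upper() == "PSA":
            fixed.append((text, "TEST"))
        else:
            fixed.append((text, ent_type))
    return fixed

# Entities copied from the Stanza output above
stanza_output = [("PSA-150", "PRODUCT"), ("10/29/1999", "CARDINAL"),
                 ("10/2/1999", "DATE"), ("PSA", "ORG"), ("2", "CARDINAL")]
cleaned = relabel(stanza_output)
print(cleaned)
```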

## Stanza Medical Text Extraction
Now let's use Stanza's medical packages to analyze our text further.

In [None]:
stanza.download('en', package='mimic', processors={'ner': 'i2b2'})
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})

doc = nlp(Text)

# print out the entities
print()
for ent in doc.entities:
    print(f'{ent.text}\t{ent.type}')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Downloading these customized packages for language: en (English)...
| Processor       | Package |
-----------------------------
| tokenize        | mimic   |
| pos             | mimic   |
| lemma           | mimic   |
| depparse        | mimic   |
| ner             | i2b2    |
| forward_charlm  | mimic   |
| pretrain        | mimic   |
| backward_charlm | mimic   |



Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/tokenize/mimic.pt:   0%|       …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/pos/mimic.pt:   0%|          | …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/lemma/mimic.pt:   0%|          …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/depparse/mimic.pt:   0%|       …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/ner/i2b2.pt:   0%|          | 0…

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/forward_charlm/mimic.pt:   0%| …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/pretrain/mimic.pt:   0%|       …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/backward_charlm/mimic.pt:   0%|…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | mimic   |
| pos       | mimic   |
| lemma     | mimic   |
| depparse  | mimic   |
| ner       | i2b2    |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!



cancer	PROBLEM
PSA	TEST
cancer	PROBLEM


This identifies PSA and cancer but not much else. It is not exactly what we need, but it is potentially worth testing on fuller data sets.

## BioBERT NER
Now let's try BioBERT, a contextualized language representation model based on BERT and pre-trained on combinations of general and biomedical corpora. A major difficulty with domain-specific text is that it uses vocabulary that is often understood only by domain experts, and that general-purpose models may handle poorly.

In [None]:
!pip install -q transformers
!pip install -q simpletransformers

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m250.7/250.7 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.1/519.1 kB[0m [31m35.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m42.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.1/8.1 MB[0m [31m89.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m188.5/188.5 kB[0m [31m23.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m215.6/215.6 kB[0m [31m25.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━

Now let's import the BioBERT model, along with anything else we would like to have for data transformation or evaluation.

In [None]:
from simpletransformers.ner import NERModel
from transformers import AutoTokenizer
import pandas as pd
import logging

logging.basicConfig(level=logging.DEBUG)
transformers_logger = logging.getLogger('transformers')
transformers_logger.setLevel(logging.WARNING)

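# We use the pre-trained BioBERT model. No custom labels are passed here, so
# simpletransformers falls back to its default CoNLL-style label set.
#model = NERModel('bert', 'dmis-lab/biobert-v1.1',use_cuda=False ,labels=custom_labels, args=train_args)
model = NERModel('bert', 'dmis-lab/biobert-v1.1', use_cuda=True)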



Some weights of BertForTokenClassification were not initialized from the model checkpoint at dmis-lab/biobert-v1.1 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



### Using the Model (Running Inference)
Running the model to do some predictions/inference is as simple as calling model.predict(samples).

In [None]:
samples = [Text]

predictions, _ = model.predict(samples)
for idx, sample in enumerate(samples):
  print('{}: '.format(idx))
  for word in predictions[idx]:
    print('{}'.format(word))


0: 
{'The': 'I-LOC'}
{'patient': 'I-LOC'}
{'came': 'I-LOC'}
{'in': 'B-PER'}
{'today': 'I-LOC'}
{'with': 'I-LOC'}
{'PSA-150': 'I-LOC'}
{'on': 'I-LOC'}
{'10/29/1999.': 'I-LOC'}
{'This': 'B-MISC'}
{'is': 'B-ORG'}
{'indicative': 'B-ORG'}
{'of': 'I-LOC'}
{'cancer.': 'I-LOC'}
{'Unlike': 'I-LOC'}
{'on': 'I-LOC'}
{'10/2/1999': 'I-LOC'}
{'when': 'I-LOC'}
{'PSA': 'I-LOC'}
{'was': 'I-LOC'}
{'2,': 'I-LOC'}
{'not': 'I-LOC'}
{'cancer': 'I-LOC'}
{'indicative.': 'B-ORG'}


As we can see, the model labels every token, but only with generic tags (PER, LOC, ORG, MISC) that carry no meaning for our targets. The warning above also shows that the classification head was newly initialized, so these predictions are essentially untrained. A lot of training would be needed to make this model work for our needs, but it does run well with GPU support, so there are advantages to using it if we have the time and labeled examples to do so.
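
To give a sense of what the labeled examples for such a training job could look like, here is a minimal sketch that turns character-level annotations into BIO-tagged rows. The label names (`PSA`, `DATE`), the helper `to_bio_rows`, and the example sentence are assumptions made for illustration; the column layout (`sentence_id`, `words`, `labels`) follows what simpletransformers expects for NER training data.

```python
import re

# Hypothetical label set for this task; "O" marks tokens outside any entity.
custom_labels = ["O", "B-PSA", "I-PSA", "B-DATE", "I-DATE"]

def to_bio_rows(sentence_id, text, spans):
    """Turn character-level spans [(start, end, label), ...] into BIO-tagged rows."""
    rows = []
    for match in re.finditer(r"\S+", text):  # whitespace tokenization
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if it overlaps the span's character range.
            if match.start() < end and match.end() > start:
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append([sentence_id, match.group(), tag])
    return rows

text = "PSA was 4.2 ng/mL on July 10, 2023."
spans = [(8, 17, "PSA"), (21, 34, "DATE")]  # hand-labeled character offsets
rows = to_bio_rows(0, text, spans)
for row in rows:
    print(row)

# With pandas and simpletransformers installed, training could then look like:
# train_df = pd.DataFrame(rows, columns=["sentence_id", "words", "labels"])
# model = NERModel('bert', 'dmis-lab/biobert-v1.1', labels=custom_labels, use_cuda=True)
# model.train_model(train_df)
```

A single sentence is of course far too little to train on; the point is only the data shape that the labeling effort would need to produce.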

### BioBERT NER - Final Model Score - Low Viability for Use Without Extensive Training

## BERT Question Answer
BERT stands for Bidirectional Encoder Representations from Transformers. It is one of the most popular and widely used NLP models. BERT models can consider the full context of a word by looking at the words that come before and after it, which is particularly useful for understanding the intent behind a query. Because of this bidirectionality, it has a deeper sense of language context and flow, and hence is used in many NLP tasks.

The Transformers library has many different BERT models, so it is easy to find a task-specific model there. Let's get started, but first let's look at our dataset.

### Asking questions directly
CoQA is a Conversational Question Answering dataset released by Stanford NLP in 2019. It is a large-scale dataset for building Conversational Question Answering systems. This dataset aims to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. The distinctive feature of this dataset is that each conversation is collected by pairing two crowd workers to chat about a passage in the form of questions and answers, and hence the questions are conversational. To understand the format of the JSON data, please refer to this link. We will be using the story, question, and answer from the JSON dataset to form our data frame.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

Import the needed libraries.


In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

For question answering tasks, we can use an already trained model and get decent results even when our text is from a completely different domain. Here we use a BERT model that is fine-tuned on the SQuAD benchmark, loaded through the BertForQuestionAnswering class from the transformers library.

In [None]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



From here, let's build a function that takes in a question and a text, tokenizes them, and sends them to the model.

In [None]:
def question_answer(question, text):
    # Tokenize the question and the text as a pair: [CLS] question [SEP] text [SEP]
    input_ids = tokenizer.encode(question, text)

    # String version of the token IDs, used to rebuild the answer
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Segment IDs: segment A (question) ends at the first [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)
    # Number of tokens in segment A (question)
    num_seg_a = sep_idx + 1
    # Number of tokens in segment B (text)
    num_seg_b = len(input_ids) - num_seg_a

    # List of 0s and 1s for segment embeddings
    segment_ids = [0] * num_seg_a + [1] * num_seg_b
    assert len(segment_ids) == len(input_ids)

    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))

    # Most likely start and end token positions of the answer span
    answer_start = int(torch.argmax(output.start_logits))
    answer_end = int(torch.argmax(output.end_logits))

    # Default message covers an invalid span (end before start) or a [CLS] answer
    answer = "Unable to find the answer to your question."
    if answer_end >= answer_start and tokens[answer_start] != "[CLS]":
        answer = tokens[answer_start]
        for i in range(answer_start + 1, answer_end + 1):
            # WordPiece continuation tokens ("##...") attach to the previous token
            if tokens[i].startswith("##"):
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]

    print("\nPredicted answer:\n{}".format(answer.capitalize()))

Now let's put it all together.

In [None]:
Text = '''The patient came in today with PSA-150 on 10/29/1999. This is indicative of cancer. Unlike on 10/2/1999 when PSA was 2, not cancer indicative.'''

In [None]:
question = 'tell me about PSA'

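# Pass the question first, then the passage, to match question_answer(question, text).
# The output shown below was recorded with the two arguments swapped, so a rerun may differ.
question_answer(question, Text)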


Predicted answer:
Indicative of cancer


In [None]:
question = 'give me all dates related to PSA and their values'

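# Pass the question first, then the passage, to match question_answer(question, text).
# The output shown below was recorded with the two arguments swapped, so a rerun may differ.
question_answer(question, Text)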


Predicted answer:
Psa - 150 on 10 / 29 / 1999 . this is indicative of cancer
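
The spaces around the hyphen and slashes in this answer come from the way the answer is rebuilt: tokens are joined with spaces unless they are WordPiece continuations. The sketch below reproduces that merge on a hand-written token list and adds a small cleanup step. The token list is an illustration rather than real tokenizer output, and `merge_wordpieces` and `tidy_punctuation` are hypothetical helper names.

```python
import re

def merge_wordpieces(tokens):
    """Join WordPiece tokens: '##' pieces attach to the previous token, others get a space."""
    answer = tokens[0]
    for token in tokens[1:]:
        answer += token[2:] if token.startswith("##") else " " + token
    return answer

def tidy_punctuation(answer):
    """Remove the spaces the merge leaves around '/' and '-' and before '.'."""
    answer = re.sub(r"\s*([/-])\s*", r"\1", answer)
    return re.sub(r"\s+\.", ".", answer)

# Hand-written tokens for illustration; not actual tokenizer output.
tokens = ["ps", "##a", "-", "150", "on", "10", "/", "29", "/", "1999", "."]
merged = merge_wordpieces(tokens)
print(merged)                    # psa - 150 on 10 / 29 / 1999 .
print(tidy_punctuation(merged))  # psa-150 on 10/29/1999.
```

The cleanup is a heuristic: it would also close up a genuine range such as "2 - 3", so it should be checked against real answers before any downstream date parsing relies on it.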


## Conclusion
Using spaCy and Stanza, and perhaps scispaCy and Stanza's medical models for more medical extraction insights, plus regex and some advanced date extraction tools, should get us where we need to be for NER of PSA values and date extraction.

If not, consider a full training job for BioBERT on very targeted labeled data to extract exactly what you need.

Also, for the final relationship assessment, we saw that tracking character positions and index distances between extracted targets can be effective. If that does not work on larger data sets, consider extracting the dates and PSA values together as full text segments and turning the problem into a labeling job.

### Example
First, extract all text around the PSA targets, without worrying about dates to start with; 50 or 100 characters before and after could work.
Next, keep only the segments that contain a date; use a date NER extractor or regex for this.
Then, using a set of this data, have the clinical experts label every case where the date is related to the PSA value in question and every case where it is not.
Finally, use those labels in a simpletransformers classification training job to get a model that makes the binary call: related or not.
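
The first two steps above can be sketched with plain regular expressions. This is a rough illustration under stated assumptions: the patterns `PSA_RE` and `DATE_RE` and the helper `psa_windows` are made up for this example and cover only the date formats seen in the mock notes. The labeling and classification steps are not shown.

```python
import re

# Assumed patterns: tune both against real notes before relying on them.
PSA_RE = re.compile(r"\bPSA\b", re.IGNORECASE)
DATE_RE = re.compile(
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"   # e.g. 10/29/1999
    r"|\b(?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December)"
    r"\s+(?:\d{1,2},\s+)?\d{4}\b"    # e.g. July 10, 2023 or June 2022
)

def psa_windows(text, width=50):
    """Return the text segments around each PSA mention that also contain a date."""
    segments = []
    for match in PSA_RE.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        segment = text[start:end]
        if DATE_RE.search(segment):
            segments.append(segment)
    return segments

text = ("The patient came in today with PSA-150 on 10/29/1999. This is indicative of cancer. "
        "Unlike on 10/2/1999 when PSA was 2, not cancer indicative.")
for segment in psa_windows(text, width=30):
    print(repr(segment))
```

Each returned segment would become one row in the labeling job. A fixed character window can cut a date in half at its edge, so snapping the window to sentence or token boundaries may be worth trying on real notes.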


Or you could try another way. First, focus only on whether a note's PSA values are of interest, and turn that into a labeling job to identify whether the note is of interest.
Then, for notes of interest, set the date value to the note date.
Then extract the PSA values from the text.
Then look near each target PSA value for a better date to use.
If you find one, use it; if not, use the default date.
And that's your whole model.
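
The extraction and date-fallback part of this second approach could look roughly like the sketch below. It assumes that every value followed by `ng/mL` is a PSA value, which holds for the mock notes but not in general. The patterns, the helper `psa_with_dates`, and the `max_distance` threshold are all hypothetical choices for illustration.

```python
import re

# Assumed patterns for the mock notes; real notes will need broader coverage.
MONTHS = (r"(?:January|February|March|April|May|June|July|August"
          r"|September|October|November|December)")
DATE_RE = re.compile(rf"\b\d{{1,2}}/\d{{1,2}}/\d{{2,4}}\b|\b{MONTHS}\s+(?:\d{{1,2}},\s+)?\d{{4}}\b")
VALUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*ng/mL")  # assumption: ng/mL values are PSA values

def psa_with_dates(note_text, note_date, max_distance=10):
    """Pair each PSA value with the nearest date within max_distance characters,
    falling back to the note date when no date is close enough."""
    dates = [(m.start(), m.end(), m.group()) for m in DATE_RE.finditer(note_text)]
    results = []
    for value in VALUE_RE.finditer(note_text):
        best_date, best_gap = note_date, max_distance + 1
        for d_start, d_end, d_text in dates:
            # Gap in characters between the value and the date, whichever comes first.
            gap = d_start - value.end() if d_start >= value.end() else value.start() - d_end
            if 0 <= gap < best_gap:
                best_date, best_gap = d_text, gap
        results.append((float(value.group(1)), best_date))
    return results

note = ("PSA Level: 4.2 ng/mL. Previous PSA levels were 3.8 ng/mL (June 2022), "
        "4.0 ng/mL (July 2021), and 3.6 ng/mL (August 2020).")
for value, date in psa_with_dates(note, note_date="July 10, 2023"):
    print(value, date)
```

The threshold matters: in this note, a much larger `max_distance` would attach "June 2022" to the current 4.2 ng/mL reading instead of the note date, so the value should be tuned on real data.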