<a href="https://colab.research.google.com/github/orthopendar/CTGAN/blob/main/04_NLP_with_medspaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<img src="https://github.com/abchapman93/DELPHI_Intro_to_NLP_Spring_2024/blob/main/media/DELPHI-long.png?raw=true" size="20%">
</br>

<h1 valign="center" align="center"><font size="+150">Introduction to NLP in Python</br>Spring 2024</font></h1>

In [None]:
!pip install https://github.com/abchapman93/DELPHI_Intro_to_NLP_Spring_2024/releases/download/v0.1/delphi_nlp_2024-0.1.tar.gz

In [None]:
from delphi_nlp_2024 import *
from delphi_nlp_2024.quizzes.quizzes import *
from delphi_nlp_2024.helpers import *

# NLP with medspaCy
This notebook will introduce the Python package `medspaCy`, a toolkit for clinical NLP.

# I. Overview
Clinical text is very complex and differs from general domain language.
- **It is very messy**, with semi-structured formatting from EHR
- Clinical documents include **many abbreviations**, some of which are ambiguous
- There are **specific tasks** needed in clinical NLP, such as **detecting negation or uncertainty** for concepts in the text

## medspaCy
<img alt="MedSpaCy logo" src="https://github.com/medspacy/medspacy/raw/master/images/medspacy_logo.png">


[medspaCy](https://github.com/medspacy/medspacy) is an open-source package maintained by NLP developers at the University of Utah and the US Department of Veterans Affairs. It's built on the popular [spaCy](https://spacy.io/) library and is specifically designed for working with clinical notes.

The goal of medspaCy is to provide flexible, easy-to-use spaCy components for common clinical NLP tasks, such as:

- Concept extraction
- Negation detection
- Document section splitting

Here are a few papers that used medspaCy:

- [Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8861690/)
- [A Natural Language Processing System for National COVID-19 Surveillance in the US Department of Veterans Affairs](https://aclanthology.org/2020.nlpcovid19-acl.10.pdf)
- [ReHouSED: A novel measurement of Veteran housing stability using natural language processing](https://www.sciencedirect.com/science/article/pii/S153204642100232X?via%3Dihub)
- [Temporary Financial Assistance Reduced The Probability Of Unstable Housing Among Veterans For More Than 1 Year](https://www.healthaffairs.org/doi/full/10.1377/hlthaff.2023.00730)

## Getting started with medspaCy
This notebook will show how to use medspaCy to process clinical text and introduce some of the spaCy infrastructure. We'll then design some rules to extract concepts from clinical texts.


To get started with medspaCy, we'll import the library and then load a **model** which we will call `nlp`. A model in spaCy is the object which processes a note and performs the various steps of text processing.

In [None]:
!pip install medspacy==1.1.5


In [None]:
import medspacy

### `nlp`
The simplest way to create a model in medspaCy is `medspacy.load()`. This returns an instance of spaCy's `English` class. You can also load models for other languages.

In [None]:
nlp = medspacy.load()
nlp

<spacy.lang.en.English at 0x7ab5c0089270>

### `Doc`
To process a text, we call `nlp(text)` and save the result to `doc`. Calling `nlp` on a text returns an object from the `Doc` class. In spaCy, `Doc` objects represent a single text.

In [None]:
text = "Chief complaint: Fever and SOB"
doc = nlp(text)
doc

Chief complaint: Fever and SOB

In [None]:
type(doc)

spacy.tokens.doc.Doc

### `Token`
A `Token` is a single word, symbol, or whitespace in a `doc`. When we create a `doc` object, the text is broken up into individual tokens. This is called **"tokenization"**.

**Discussion**: Look at the tokens generated from this text snippet. What can you say about the tokenization method? Is it as simple as splitting up into words every time we reach a whitespace?

In [None]:
for token in doc:
    print(token)

Chief
complaint
:
Fever
and
SOB
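
To make the comparison in the discussion concrete, here is a small sketch that splits the same text on whitespace using plain Python and compares the result with the tokens printed above. The `tokens_from_output` list is copied by hand from that output rather than computed, and the variable names are only for illustration.

```python
example_text = "Chief complaint: Fever and SOB"

# Naive approach: split only where whitespace occurs
whitespace_tokens = example_text.split()

# Tokens copied by hand from the loop output above
tokens_from_output = ["Chief", "complaint", ":", "Fever", "and", "SOB"]

print(whitespace_tokens)
print(tokens_from_output)

# Items produced by whitespace splitting that the tokenizer did not produce
only_whitespace = [t for t in whitespace_tokens if t not in tokens_from_output]
print(only_whitespace)
```

Whitespace splitting leaves the colon attached to `complaint`, while the tokenizer separates it into its own token.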


In [None]:
print(type(token))

<class 'spacy.tokens.token.Token'>


If we access a single index of a doc, we get a token:

In [None]:
token = doc[0]
token

Chief

In [None]:
quiz_doc_cc_idx


### `Span`
While a `Token` represents a single word, a `Span` represents one or more words from a `Doc`. We can get a `Span` by slicing a `Doc` object:

In [None]:
span = doc[0:3]
span
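
As a rough illustration of how slicing works, the sketch below builds a span and inspects a few of its attributes. It assumes a blank spaCy English pipeline (`spacy.blank("en")`) instead of `medspacy.load()` to keep the example lightweight; tokenization may differ between the two for other texts, and the `example_` names are only for illustration.

```python
import spacy

# A blank English pipeline has only a tokenizer, which is all this sketch needs
blank_nlp = spacy.blank("en")
example_doc = blank_nlp("Chief complaint: Fever and SOB")

# Slicing uses token indices; the end index is exclusive, as with Python lists
example_span = example_doc[0:3]

print(example_span.text)
print(example_span.start, example_span.end)  # token offsets within the doc
print(len(example_span))                     # number of tokens in the span
print(type(example_span[0]))                 # indexing a span gives a Token
```

Note that the indices count tokens, not characters, so the colon counts as the third token of the span.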

## Pipeline Components
Under the hood, the `nlp` object goes through a number of sequential steps to process the text. This is called a **pipeline** and it allows us to create modular, independent processing steps when analyzing text. We can see the names of our pipeline components through the `nlp.pipe_names` attribute:

In [None]:
nlp.pipe_names

There's also a step that runs before all of these components: the `tokenizer`. It does not appear in `nlp.pipe_names`. It splits the text into tokens and creates a `Doc` object, which is then passed on to the rest of the components.

In [None]:
nlp.tokenizer(text)
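
The difference between the tokenizer and the pipeline components can be sketched with a toy pipeline. This example assumes a blank spaCy English pipeline rather than medspaCy, and `demo_token_counter` is a hypothetical component written only for this illustration; it is not part of medspaCy.

```python
import spacy
from spacy.language import Language

# Hypothetical component, registered under a made-up name for this sketch
@Language.component("demo_token_counter")
def demo_token_counter(doc):
    # A component receives a Doc, may modify it, and must return it
    doc.user_data["n_tokens"] = len(doc)
    return doc

demo_nlp = spacy.blank("en")   # starts with a tokenizer and no components
print(demo_nlp.pipe_names)

demo_nlp.add_pipe("demo_token_counter")
print(demo_nlp.pipe_names)

demo_text = "Chief complaint: Fever and SOB"
tokenized_only = demo_nlp.tokenizer(demo_text)  # tokenizer alone
fully_processed = demo_nlp(demo_text)           # tokenizer, then every component

print("n_tokens" in tokenized_only.user_data)
print(fully_processed.user_data["n_tokens"])
```

Calling the tokenizer directly returns a `Doc` that no component has touched, while calling the pipeline object runs the tokenizer and then each component in order.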

We'll learn more about some of these pipeline components in the following notebooks. First, we'll start with the `target_matcher` component and learn how to extract clinical concepts from text.

In [None]:
# For now, remove some components we don't need
nlp.remove_pipe("medspacy_pyrush")
nlp.remove_pipe("medspacy_context")
nlp.pipe_names

## Concept Extraction
One of the first steps in many clinical NLP tasks is identifying particular **concepts** in text. These will vary with each use case, but some common examples of concepts are:
- Diagnoses
- Signs and symptoms
- Medications
- Tests

### TODO
For each of the texts below, identify the best description of the concepts **in bold**.

In [None]:
# RUN CELL TO SEE QUIZ
quiz_medical_concepts_1

In [None]:
# RUN CELL TO SEE QUIZ
quiz_medical_concepts_2

In [None]:
# RUN CELL TO SEE QUIZ
quiz_medical_concepts_3

The task of extracting these spans of text is called **named entity recognition (NER)**. This can be done using either machine learning models or rule-based models. In this class, we'll focus on building rule-based systems. In rule-based NLP, we define patterns to match concepts in text. spaCy offers many [rule-based methods](https://spacy.io/usage/rule-based-matching). medspaCy uses a pipeline component called `TargetMatcher` and rules defined by a class called `TargetRule`. Extracted concepts will be stored as `Span` objects in `doc.ents`.

### `target_matcher`
To start adding rules, we'll first need to access the pipeline component. We can do this by calling `nlp.get_pipe(pipe_name)`:

In [None]:
target_matcher = nlp.get_pipe("medspacy_target_matcher")

Next we need to actually write some rules using the `TargetRule` class. Target rules require two positional arguments:
- `literal`: A span of text to match in the text (case insensitive)
- `category`: The label to assign to extracted concepts

(There are also a few keyword arguments that we'll explore later, but these are the two required arguments.)

Let's say that we want to extract patient diagnoses from the following text:

In [None]:
dx_text = "Pt is a 63M w/ h/o metastatic carcinoid tumor, HTN and hyperlipidemia"

There are three diagnoses in this text. The first is `"metastatic carcinoid tumor"`. Let's write a rule to capture this:

In [None]:
from medspacy.target_matcher import TargetRule
rule = TargetRule("metastatic carcinoid tumor", "DIAGNOSIS")

We can then add it to our target matcher:

In [None]:
target_matcher.add(rule)

In [None]:
target_matcher.rules

Now let's process the text above and see if it's extracted by our NLP model by looking at `doc.ents`:

In [None]:
doc = nlp(dx_text)
doc.ents

The `target_matcher` added a `Span` to the doc's entities representing the concept we just extracted. Let's assign this span to the variable `ent`. We can see the concept category by checking the `ent.label_` attribute.

In [None]:
ent = doc.ents[0]
print(ent)
print(type(ent))
print(ent.label_)
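
To get a feel for what an entity in `doc.ents` is, the sketch below creates a labeled `Span` by hand with plain spaCy. This is not how medspaCy's `TargetMatcher` is implemented; it is a simplified illustration that assumes a blank spaCy English pipeline, and the `manual_` names are hypothetical.

```python
import spacy

blank_nlp = spacy.blank("en")
dx_example = "Pt is a 63M w/ h/o metastatic carcinoid tumor, HTN and hyperlipidemia"
manual_doc = blank_nlp(dx_example)

# Locate the phrase by character offsets, then convert it to a labeled Span
phrase = "metastatic carcinoid tumor"
start_char = dx_example.index(phrase)
manual_ent = manual_doc.char_span(
    start_char, start_char + len(phrase), label="DIAGNOSIS"
)

# Assigning to doc.ents is what makes the span an entity of the doc
manual_doc.ents = [manual_ent]

for e in manual_doc.ents:
    print(e.text, e.label_, e.start_char, e.end_char)
```

An entity is therefore an ordinary `Span` that carries a label and is registered on the doc, which is why attributes such as `label_` are available on `ent`.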

`medspaCy` provides some visualization functions which make it easier to look at what has been extracted from the notes:

In [None]:
from medspacy.visualization import visualize_ent

In [None]:
visualize_ent(doc)

### TODO
Edit the cell below to write a list of rules for extracting the two remaining diagnoses from `dx_text`. Then add them to the target matcher and reprocess the doc.

In [None]:
rules = [

]

In [None]:
target_matcher.add(rules)

In [None]:
doc_dx = nlp(dx_text)

In [None]:
visualize_ent(doc_dx)

In [None]:
doc_dx.ents

In [None]:
# RUN CELL TO TEST VALUE
test_dx_text.test(doc_dx)

### Advanced pattern matching
So far we have passed simple strings to our `target_matcher` to extract exact matches. However, there may be many small variations in the text we want to extract, and it becomes cumbersome to type out every possible string. Instead, we'll do some more advanced matching using **token attribute matching**.

spaCy allows us to write patterns based not only on the exact text but also on other linguistic attributes such as **part-of-speech tag**, **numerical properties**, **regular expressions**, and much more.
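
As a minimal sketch of token attribute matching, the example below uses spaCy's `Matcher` directly on a blank English pipeline. The `"DOSE"` label, the unit list, and the example sentence are hypothetical choices for illustration, and the sketch does not go through medspaCy's `TargetMatcher`.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank("en")
matcher = Matcher(blank_nlp.vocab)

# One dict per token: a token made only of digits, followed by a unit.
# LOWER compares against the lowercased token text, so "MG" also matches.
dose_pattern = [
    {"IS_DIGIT": True},
    {"LOWER": {"IN": ["mg", "mcg", "g"]}},
]
matcher.add("DOSE", [dose_pattern])

dose_doc = blank_nlp("Started metoprolol 25 mg daily and aspirin 81 MG at bedtime")

# Each match is a (match_id, start, end) tuple of token offsets
matches = sorted(matcher(dose_doc), key=lambda m: m[1])
dose_matches = [dose_doc[start:end].text for _, start, end in matches]
print(dose_matches)
```

A single pattern of this kind covers any number and any of the listed units, which is the main advantage over listing every literal string.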

### Example: Chronic Kidney Disease
Each of the texts below mentions a different stage of Chronic Kidney Disease:

---
- 76 year old man with CKD Stage 3.
- relevant diagnoses: ckd stage 4
- The patient has progressed to ckd stage 5
---

We could write different target rules to match each text, but sometimes there are too many combinations to feasibly write out every option. Instead of trying to think of the near-infinite number of variations, let's write one pattern which will match all of these clinical problems.

One way we can do this is by using **regular expressions**. We can pass a regular expression into the `pattern` argument of a `TargetRule`. The pattern will be case insensitive.
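
Here is a small regular expression sketch on a different condition, using Python's built-in `re` module rather than a `TargetRule`. The pattern and example strings are hypothetical, and `re.IGNORECASE` is passed explicitly to mimic the case-insensitive behavior described above.

```python
import re

# Hypothetical pattern: "type", then 1, 2, I, or II, then "diabetes"
diabetes_regex = re.compile(r"type\s+(?:1|2|ii?)\s+diabetes", flags=re.IGNORECASE)

examples = [
    "Hx of Type 2 diabetes",
    "type 1 diabetes mellitus",
    "TYPE II DIABETES, well controlled",
    "No history of diabetes",
]

found = []
for s in examples:
    m = diabetes_regex.search(s)
    # Keep the matched text, or None when the pattern does not occur
    found.append(m.group(0) if m else None)
    print(repr(s), "->", found[-1])
```

One pattern covers several surface forms, and the last example shows that a text without the full phrase is left unmatched.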

#### TODO
Finish the code below to create a rule which will match all three examples of CKD using regular expressions. Then add it to the pipeline and test your model. You can test it on the texts below.

In [None]:
texts = [
    "76 year old man with CKD Stage 3.",
    "relevant diagnoses: ckd stage 4",
    "The patient has progressed to ckd stage 5",
    "She was dx'd with CKD in January."
]

In [None]:
rule = TargetRule("CKD Stage X", "DIAGNOSIS",
                  pattern=___)

In [None]:
target_matcher = nlp.get_pipe("medspacy_target_matcher")
target_matcher.add(rule)

In [None]:
for text in texts:
    visualize_ent(nlp(text))

In [None]:
# RUN CELL TO TEST VALUE
test_ckd_stage_x.test(nlp)

## Concept extraction practice
Let's return to the example discharge summary we looked at in a previous notebook. Add rules to `target_matcher` that will extract the following concepts from the text:
- `"DIAGNOSIS"`
- `"MEDICATION"`
- `"SIGN/SYMPTOM"`
- `"SOCIAL_DETERMINANT"`
- `"PROCEDURE"`

It might be useful to work in teams with clinicians or people familiar with these concepts so you can identify and define them. You don't need to extract every concept from the text (there are a lot!), so just go through the note and add a few examples of each concept. If you'd like to write more sophisticated rules, it may be helpful to review spaCy's [rule-based NLP documentation](https://spacy.io/usage/rule-based-matching#matcher) (look at the documentation under `Matcher`).

In [None]:
# RUN CELL TO SEE HINT
hint_discharge_summ_target_rules

In [None]:
# Load a fresh NLP model
nlp = medspacy.load(enable=["medspacy_target_matcher"])
target_matcher = nlp.get_pipe("medspacy_target_matcher")

In [None]:
rules = [


]

target_matcher.add(rules)

In [None]:
doc = nlp(disch_summ)
visualize_ent(doc)