[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jianlins/BMI_NLP_2025/blob/main/Module%202%20NLP%20Basics.ipynb)
# Welcome to the "Colab/spaCy" demo for Module 2.

Google offers a browser-based Python development environment called "Colab" that replicates most of the functionality of a Jupyter notebook. We plan to use it in class for future assignments, so we want to walk you through a gentle introduction here.

You can see right away that Colab looks a lot like the notebooks we have used in class. Just as in a notebook, you use the SHIFT-RETURN keys to run the code in a cell OR you can click the "run arrow" on the left of code cells.

Our goal with this Colab notebook is simply to show you Colab itself, as well as to introduce you to a few spaCy functions. IF YOU HAVE NOT READ the Mod2/Week2 Canvas text yet, STOP, READ IT, and THEN RETURN HERE. As you go, compare spaCy's output with similar functions from NLTK or CoreNLP that you have already seen.

# Set up the environment
In Colab, a set of commonly used Python packages has already been installed by default. But medspaCy, which we use later in this notebook, is not one of them. So we need to install it ourselves, together with spaCy's small English model. _Note that the next cell may take a minute or two to run._

In [None]:
!pip install medspacy
!python -m spacy download en_core_web_sm

In [None]:
import spacy
from spacy import displacy
from pathlib import Path
print(spacy.__version__)

# Overview
In this notebook, we will use spaCy's visualization tool to demonstrate the basic NLP concepts that we covered in the course text. We will cut short a discussion of what these NLP basics are (that's what we did in the Canvas course text). You can use the "Table of contents" on the left to navigate through different NLP basics.

First, let's get a sample text into the Python variable "text":

In [None]:
text='''HX: This 64 y/o RHM had had difficulty remembering names, phone numbers and events for 12 months prior to presentation, on 2/28/95. This had been called to his attention by the clerical staff at his parish--he was a Catholic priest. He had had no professional or social faux pas or mishaps due to his memory. He could not tell whether his problem was becoming worse, so he brought himself to the Neurology clinic on his own referral.'''
text

spaCy does not have a default visualization for tokens or sentences. Although visualization is not necessary for the code to run, it can be useful for humans trying to learn about NLP basics. So we will play a few tricks here to visualize tokens, sentences, and POS. That way you can get a tangible sense of what these basic functions are really doing "under the hood."

In [None]:
def vis_tokens(doc, color='#c4e6e2'):
  # convert tokens into entities for visualization
  doc.ents=[doc.char_span(t.idx, t.idx+len(t), label="Token") for t in doc]
  displacy.render(doc, jupyter=True, style="ent", options={'colors':{'Token':color}})

def vis_sentences(doc, color='#89C4F4'):
  # convert sentences into entities for visualization
  doc.ents=[doc.char_span(s.start_char, s.end_char, label="Sentence") for s in doc.sents]
  displacy.render(doc, jupyter=True, style="ent", options={'colors':{'Sentence':color}})

def vis_pos(doc, colors=None):
  # default palette: one background color per POS tag
  default_colors={'ADJ':'#ffccd5', 'ADP':'#dee2e6','ADV':'#e2ece9','AUX':'#dfe7fd',
        'CONJ':'#e9ecef','CCONJ':'#adb5bd','DET':'#edede9','INTJ':'#fad2e1',
        'NOUN':'#fff1e6','NUM':'#fde2e4','PART':'#f0efeb','PRON':'#eddcd2','PROPN':'#f0efeb','PUNCT':'#eae4e9',
        'SCONJ':'#8e9aaf','SYM':'#bee1e6','VERB':'#cddafd','X':'#edf2fb'}
  # colors passed in by the caller override the defaults
  colors={**default_colors, **(colors or {})}
  # spaCy has another style of visualization to display POS tags, but here, before we throw in additional jargon, we use the same trick to render these tags
  doc.ents=[doc.char_span(t.idx, t.idx+len(t), label=t.pos_) for t in doc]
  displacy.render(doc, jupyter=True, style="ent",options={'colors':colors})

## Processing text

In [None]:
# load a spaCy model (similar to the concept of "pipeline" that we will cover in the next module). spaCy
# wraps everything together into "nlp" so we don't need to execute each task individually
nlp = spacy.load("en_core_web_sm")

In [None]:
# process the text: one line of code. Simple, isn't it?
doc=nlp(text)

# Tokenization
As we mentioned above, all the basic NLP tasks were executed by the single line of code above. We don't need to run tokenization separately here. But we do want to *see* the tokens, so we run the vis_tokens() function:

In [None]:
vis_tokens(doc)
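
The highlighting above relies on each token knowing where it starts in the original string. As a minimal sketch of that idea, the cell below tokenizes a short sentence taken from our sample text with a blank English pipeline (tokenizer only, so no model download is needed) and lists each token with its character offsets. The names `tokenizer_only`, `offset_doc`, and `token_offsets` are ours, chosen so they do not overwrite the notebook's `nlp` and `doc`.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer.
tokenizer_only = spacy.blank("en")
offset_doc = tokenizer_only("He was a Catholic priest.")

# t.idx is the character offset where the token starts; len(t) is its length in characters.
token_offsets = [(t.text, t.idx, t.idx + len(t)) for t in offset_doc]

for tok_text, start, end in token_offsets:
    print(f"{start:>2}-{end:<2} {tok_text}")
```

These start and end offsets are exactly what `vis_tokens()` passes to `doc.char_span()`.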

# Sentence segmentation
Same here. We just show the sentences that have been split. Note that spaCy decided "HX:" was a self-contained sentence. Do you agree?

In [None]:
vis_sentences(doc)
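
For comparison, here is a small sketch of a purely rule-based alternative. It uses spaCy's built-in `sentencizer` component, which splits on sentence-final punctuation rather than using the statistical model loaded above, so it is not how `en_core_web_sm` made its decision. The shortened example text and the names `rule_nlp`, `sent_doc`, and `sentence_spans` are ours.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer (no trained model involved).
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")

sent_doc = rule_nlp("HX: He was a Catholic priest. He could not remember names.")

# Each sentence is a span with character offsets into the original text.
sentence_spans = [(s.start_char, s.end_char, s.text) for s in sent_doc.sents]

for start, end, sent_text in sentence_spans:
    print(start, end, repr(sent_text))
```

With its default settings the rule-based splitter does not treat the colon as a sentence boundary, so "HX:" stays attached to the sentence that follows it.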

# POS tagging
When we started this Colab we loaded a general "language model" called "en_core_web_sm". It is a general English model with a relatively small number of parameters. The text we defined here is clinical text, so clinical idioms like "HX" (short for "history") can confuse spaCy when it expects general English. The visualization below shows why it makes sense for humans to review visualized output. We can see that spaCy thought "HX:" was a proper noun (see the POS schema below).

In [None]:
vis_pos(doc)

## POS schema used in spaCy
| POS | DESCRIPTION | EXAMPLES |
| --- | --------- |  ----------------------------------------- |
| ADJ | adjective | *big, old, green, incomprehensible, first* |
| ADP | adposition | *in, to, during* |
| ADV | adverb | *very, tomorrow, down, where, there* |
| AUX | auxiliary | *is, has (done), will (do), should (do)* |
| CONJ | conjunction | *and, or, but* |
| CCONJ | coordinating conjunction | *and, or, but* |
| DET | determiner | *a, an, the* |
| INTJ | interjection | *psst, ouch, bravo, hello* |
| NOUN | noun | *girl, cat, tree, air, beauty* |
| NUM | numeral | *1, 2017, one, seventy-seven, IV, MMXIV* |
| PART | particle | *’s, not* |
| PRON | pronoun | *I, you, he, she, myself, themselves, somebody* |
| PROPN | proper noun | *Mary, John, London, NATO, HBO* |
| PUNCT | punctuation | *., (, ), ?* |
| SCONJ | subordinating conjunction | *if, while, that* |
| SYM | symbol | *$, %, §, ©, +, −, ×, ÷, =, :), 😝* |
| VERB | verb | *run, runs, running, eat, ate, eating* |
| X | other | *sfpksdpsxmsa* |
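
You do not have to memorize this table. As a small sketch, spaCy's `spacy.explain()` function looks up the description of a label in its built-in glossary. The selection of labels and the name `pos_glossary` are ours.

```python
import spacy

# Look up a few POS labels in spaCy's built-in glossary.
pos_labels = ["PROPN", "ADP", "CCONJ", "SCONJ"]
pos_glossary = {label: spacy.explain(label) for label in pos_labels}

for label, description in pos_glossary.items():
    print(f"{label:<6} {description}")
```

The same call also works for the named entity labels that appear later in this notebook, for example `spacy.explain("NORP")`.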





# Dependency parsing


In [None]:
displacy.render(list(doc.sents)[2], jupyter=True, style="dep")

# Named entity recognition
Since our code above overwrote the entities in the "doc", we reprocess the text before we display the **real** named entities spaCy is looking for. The model en_core_web_sm looks for named entities such as dates (`DATE`), nationalities or religious or political groups (`NORP`), and facilities (`FAC`: named bridges, buildings, etc.). Hmm. Is the number 64 really a date here? In spaCy's view it is. To spaCy, a date is "an absolute or relative date or time period." If spaCy were looking at the phrase "64 y/o", then `DATE` makes sense. And "Catholic" is definitely a religion. But why it concluded that "2/28/95" was a cardinal number is not clear.

The lesson? NLP is never perfect. So we ALWAYS have to evaluate how well our NLP systems are doing.


In [None]:
doc = nlp(text)

In [None]:
displacy.render(doc, jupyter=True, style="ent")

The following table explains these labels:

| LABEL | DESCRIPTION   |
| --- | ------------------------------------------------------- |
| DATE | Absolute or relative dates or periods |
| PERSON | People, including fictional |
| GPE | Countries, cities, states |
| LOC | Non-GPE locations, mountain ranges, bodies of water |
| MONEY | Monetary values, including unit |
| TIME | Times smaller than a day |
| PRODUCT | Objects, vehicles, foods, etc. (not services) |
| CARDINAL | Numerals that do not fall under another type |
| ORDINAL | "first", "second", etc. |
| QUANTITY | Measurements, as of weight or distance |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| FAC | Buildings, airports, highways, bridges, etc. |
| LANGUAGE | Any named language |
| LAW | Named documents made into laws. |
| NORP | Nationalities or religious or political groups |
| PERCENT | Percentage, including "%" |
| WORK_OF_ART | Titles of books, songs, etc. |


## Exercise:

Now we understand that 'en_core_web_sm' may not suffice for basic NER tasks, even though it serves as a good starting point for experimentation. If you're curious about the performance of larger models, consider trying the following:
*   en_core_web_md
*   en_core_web_lg



Additionally, spaCy offers a transformer-based model that also performs NER. We'll delve deeper into transformers later in the course. For now, let's just think of it as a much larger model. You will need to install spacy-transformers and download the model before you can load it.
*   en_core_web_trf
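
One possible way to approach this exercise is sketched below. The helper `entities_by_model` is hypothetical (our own name, not part of spaCy). It uses `spacy.util.is_package` to skip models that have not been downloaded yet, so the cell runs even if only some of the models are installed. The shortened example sentence is taken from our sample text.

```python
import spacy
from spacy.util import is_package

def entities_by_model(sample_text, model_names):
    """Return {model name: list of (entity text, label)}, or None for models not installed."""
    results = {}
    for name in model_names:
        if not is_package(name):
            # Model has not been downloaded (e.g. via `python -m spacy download <name>`).
            results[name] = None
            continue
        model = spacy.load(name)
        results[name] = [(ent.text, ent.label_) for ent in model(sample_text).ents]
    return results

comparison = entities_by_model(
    "This 64 y/o RHM was seen on 2/28/95. He was a Catholic priest.",
    ["en_core_web_sm", "en_core_web_md", "en_core_web_lg"],
)
for model_name, ents in comparison.items():
    print(model_name, ents if ents is not None else "(not installed)")
```

Comparing the lists side by side makes it easy to see where the models disagree.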



# QuickUMLS

In the Module pages we mentioned QuickUMLS. Although its full functioning requires a full UMLS installation (which takes a long time to load and is very space-hungry), we can demonstrate it using a made-up "UMLS" with 2 CUIs (covering 3 terms) hardcoded as below.

QuickUMLS needs two files: *MRCONSO.RRF* and *MRSTY.RRF*. We follow the same format of these two files, and create a customized dictionary:

In [None]:
mrconso_text=r'''C0000001|ENG|S|L0000002|VO|S0000895|N|A0000030||||BI|PT|BI00001|difficulty remembering|2|N|256|
C0000001|ENG|S|L0000002|VO|S0000895|N|A0000031||||BI|PT|BI00001|memory lost|2|N|256|
C0000002|ENG|S|L0000003|VO|S0000005|N|A0000041||||BI|PT|BI00002|neurology clinic|2|N|256|'''

In [None]:
mrsty_text=r'''C0000001|T191|||||
C0000002|T191|||||'''
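
To make the pipe-delimited format a little less opaque, here is a minimal sketch that reads the made-up MRCONSO rows and groups the terms by CUI. It uses only the Python standard library and assumes the standard MRCONSO.RRF column order, in which the CUI is the first field and the term string is the fifteenth. The names `sample_mrconso`, `terms_by_cui`, and `cui_terms` are ours.

```python
from collections import defaultdict

# Field positions, assuming the standard MRCONSO.RRF column order.
CUI_FIELD = 0
TERM_FIELD = 14

sample_mrconso = r'''C0000001|ENG|S|L0000002|VO|S0000895|N|A0000030||||BI|PT|BI00001|difficulty remembering|2|N|256|
C0000001|ENG|S|L0000002|VO|S0000895|N|A0000031||||BI|PT|BI00001|memory lost|2|N|256|
C0000002|ENG|S|L0000003|VO|S0000005|N|A0000041||||BI|PT|BI00002|neurology clinic|2|N|256|'''

def terms_by_cui(mrconso_rows):
    """Group term strings by concept identifier (CUI)."""
    grouped = defaultdict(list)
    for row in mrconso_rows.splitlines():
        if not row.strip():
            continue  # skip blank lines
        fields = row.split("|")
        grouped[fields[CUI_FIELD]].append(fields[TERM_FIELD])
    return dict(grouped)

cui_terms = terms_by_cui(sample_mrconso)
for cui, terms in cui_terms.items():
    print(cui, terms)
```

Note that one concept (C0000001) has two terms: this is the "many terms map to one concept" idea in miniature.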

In [None]:
Path('MRCONSO.RRF').write_text(mrconso_text)
Path('MRSTY.RRF').write_text(mrsty_text)

In [None]:
# the location of the default quickumls (medspacy's version) data
default_medspacy_quickumls_dir='/usr/local/lib/python3.10/dist-packages/resources/quickumls/QuickUMLS_SAMPLE_lowercase_POSIX_unqlite'
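
The path above is tied to one Python version (3.10), so it may not exist on a runtime that uses a different version. As a hedged sketch, the cell below derives the directory from wherever a package is actually installed. The helpers `package_install_root` and `guess_quickumls_dir` are hypothetical names of ours, and the `resources/quickumls/...` layout is an assumption copied from the hard-coded path above, not something we have verified for every medspaCy release.

```python
import importlib.util
from pathlib import Path

# Sub-path copied from the hard-coded location above (assumed layout).
QUICKUMLS_SUBDIR = Path("resources") / "quickumls" / "QuickUMLS_SAMPLE_lowercase_POSIX_unqlite"

def package_install_root(package_name):
    """Return the directory that contains the installed package, or None if it is not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at <root>/<package>/__init__.py
    return Path(spec.origin).resolve().parent.parent

def guess_quickumls_dir(package_name="medspacy"):
    """Guess the sample QuickUMLS directory next to the installed package."""
    root = package_install_root(package_name)
    return None if root is None else root / QUICKUMLS_SUBDIR

print(guess_quickumls_dir())
```

If the printed path differs from the hard-coded one, use the printed path in the cells that follow.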

In [None]:
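# clean up the folder: remove any existing copy, then recreate it empty
import shutil
quickumls_dir = Path(default_medspacy_quickumls_dir)
if quickumls_dir.exists():
    shutil.rmtree(quickumls_dir)
# parents=True also creates missing parent folders
quickumls_dir.mkdir(parents=True, exist_ok=True)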

In [None]:
# install our customized "UMLS"
!python -m quickumls.install  ./ /usr/local/lib/python3.10/dist-packages/resources/quickumls/QuickUMLS_SAMPLE_lowercase_POSIX_unqlite

In [None]:
import spacy
import medspacy
from quickumls.spacy_component import SpacyQuickUMLS

In [None]:
# start from a default spacy language model
nlp = spacy.load('en_core_web_sm')

In [None]:
nlp.add_pipe('medspacy_quickumls', config={"quickumls_fp": default_medspacy_quickumls_dir})

In [None]:
doc = nlp(text)

In [None]:
# Now you can see two additional named entities have been annotated and labeled with CUIs.
displacy.render(doc, jupyter=True, style="ent")

So that was a super simple illustration of QuickUMLS.

We will not use QuickUMLS again in this course, but we will refer often to the UMLS itself. This last exercise is meant to reinforce the idea that the UMLS has many concepts and has many **more** terms that map to those concepts. Tools like medspaCy and CLAMP can help identify these concepts automatically.
