# Spacy and SciSpacy 

[Spacy](https://spacy.io/usage/processing-pipelines) is one of the main open-source libraries for NLP, and its developers are among the field's main open-source contributors.

It offers a wide range of methods, pipelines, pre-trained models, tools and tutorials for NLP. Everything from text cleaning and linguistic analysis to advanced modelling capabilities is available in Spacy.

One of its most interesting capabilities comes from the related SciSpacy package, which provides several biomedical models and tools for medical NLP analysis.

This notebook covers a few of the basic topics in the Spacy toolkit.



In [None]:
# Imports
import spacy
# Importing EntityLinker registers the "scispacy_linker" component used later
from scispacy.linking import EntityLinker

# Load the model
# Warning: make sure you download the relevant model first.
# Check the README for instructions.
nlp = spacy.load("en_core_sci_md")


## Basic functionalities

In [None]:
text: str = "Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in people."
# Running the pipeline adds NLP attributes and methods to the text.
doc = nlp(text)
# For example, for each token we can now:
#   - check whether it is a stopword,
#   - get its lemma and its part of speech,
#   - get its syntactic dependency.
for token in doc:
    if not token.is_stop:
        print(
            token.text, "->",
            token.lemma_,  # Lemma
            token.pos_,  # Coarse-grained part of speech
            token.tag_,  # Fine-grained part-of-speech tag
            token.dep_,  # Syntactic dependency relation
            token.shape_  # Shape: pattern of capitalization, digits and punctuation
        )

In [None]:
from spacy import displacy  # import the visualizer explicitly
displacy.render(next(doc.sents), style='dep', jupyter=True)

## Entity Linker

One of the most interesting features of SciSpacy is the entity linker, which is described in more detail in the [SciSpacy README](https://github.com/allenai/scispacy?tab=readme-ov-file#entitylinker:~:text=config%3D%7B%22make_serializable%22%3A%20True%7D-,EntityLinker,-The%20EntityLinker%20is).

It allows you to link an entity to an entry in a particular knowledge base. For example, Ibuprofen can be linked to its entry in a drug knowledge base, and you can then retrieve further details about the linked entity.

**WARNING** This will download about 1GB of data and is slow the first time (later runs should be faster because the download is cached). Even so, adding the linker to your pipeline will severely decrease processing speed.

In [None]:
# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.

nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})


In [None]:
# Adjacent string literals are concatenated, so the text contains
# no stray indentation whitespace.
doc = nlp(
    "Spinal and bulbar muscular atrophy (SBMA) is an "
    "inherited motor neuron disease caused by the expansion "
    "of a polyglutamine tract within the androgen receptor (AR). "
    "SBMA can be caused by this easily."
)

In [None]:
print(doc.ents)

In [None]:
entity = doc.ents[2]

print("Name: ", entity)
# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:  # each item is a (CUI, score) pair
    print(linker.kb.cui_to_entity[umls_ent[0]])
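
The comment in the cell above mentions character 3-gram matching. As a rough intuition for what that means, the sketch below scores a mention against candidate names by the overlap (Jaccard similarity) of their character 3-grams. This is a deliberately simplified stand-in, not SciSpacy's implementation, and the candidate names and helper functions are hypothetical, chosen only for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of `text`."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical knowledge-base names, made up for this example.
candidate_names = [
    "androgen receptor",
    "estrogen receptor",
    "motor neuron disease",
]

mention = "androgen receptors"
scores = {
    name: jaccard(char_ngrams(mention), char_ngrams(name))
    for name in candidate_names
}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 2))
```

Names that share long character runs with the mention score high even when the strings are not identical, which is why this kind of matching tolerates plurals and small spelling variations.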

## Word and text similarity

Text similarity has been central to NLP since its beginning, and even with the great advances in GenAI, the methods used to compare words and sentences semantically still rely mainly on vector operations and "distance" metrics. For example, we know that `shark` and `whale` are more closely related to each other than `shark` and `computer`. With modern language models (Word2Vec, transformers, etc.) we can represent this mathematically, usually with a common metric such as Euclidean distance or cosine similarity. Let's give it a try.
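
To make the two metrics concrete first, here is a minimal sketch in plain Python. The three-dimensional vectors are invented for illustration and do not come from any model; real word vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy embeddings, NOT taken from a real model.
toy_vectors = {
    "shark": [0.9, 0.8, 0.1],
    "whale": [0.8, 0.9, 0.2],
    "computer": [0.1, 0.2, 0.9],
}

for other in ("whale", "computer"):
    cos = cosine_similarity(toy_vectors["shark"], toy_vectors[other])
    dist = euclidean_distance(toy_vectors["shark"], toy_vectors[other])
    print(f"shark vs {other}: cosine={cos:.3f}, euclidean={dist:.3f}")
```

Note that the two metrics point in opposite directions: related words have a high cosine similarity but a low Euclidean distance.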

In [None]:
# let's compare words
text1 = "shark"
text2 = "whale"
text3 = "computer"

print(
    "Similarity Shark and Whale:",
    nlp(text1).similarity(nlp(text2))
)
print()
print(
    "Similarity Shark and Computer:",
    nlp(text1).similarity(nlp(text3))
)


In [None]:
# We can do the same with sentences.
# Spacy averages the token vectors of each sentence
# and compares the resulting sentence vectors.
text1 = "Tylenol is used to treat headaches"
text2 = "Ibuprofen is used to alleviate migraines"

nlp(text1).similarity(nlp(text2))

In [None]:
# A completely unrelated sentence.
text3 = "There is no place like home"

nlp(text1).similarity(nlp(text3))

In [None]:
# Get the Vector representation 
word: str = "melanoma"
word_id = nlp.vocab.strings[word]
word_vector = nlp.vocab.vectors[word_id]
print(word_vector[:50])

In [None]:
# You can also get the vector of a sentence

nlp("Hello, this is a sentence").vector[:20]
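
By default, Spacy's `Doc.vector` is the average of the token vectors. The sketch below reproduces that averaging in plain Python. The two-dimensional token vectors and the `mean_vector` helper are hypothetical, invented only to show the arithmetic.

```python
def mean_vector(vectors):
    """Element-wise average of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


# Hypothetical token vectors, invented for this example.
toy_token_vectors = {
    "hello": [1.0, 0.0],
    "this": [0.0, 2.0],
    "sentence": [2.0, 4.0],
}

sentence_vector = mean_vector(list(toy_token_vectors.values()))
print(sentence_vector)
```

Because every token contributes equally, word order is lost and long texts tend to drift towards a generic "average" vector, which is worth keeping in mind when comparing sentences this way.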


### Semantic similarity example

In NLP, one of the biggest breakthroughs came when the Word2Vec model was able to "answer" the famous analogy riddle: `man` is to `king` as `woman` is to ____. For humans, it's easy to find the analogy and answer `queen`; however, this was thought to be almost impossible for an AI. In vector form, the riddle can be written as `x = king - man + woman`. With the original Word2Vec model, the resulting vector is very close to the vector for `queen`, which added a whole new dimension to NLP and AI.
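
The arithmetic can be sketched in plain Python with toy vectors. The two-dimensional embeddings below are hypothetical (one axis loosely standing for "royalty", the other for "gender") and are constructed so the analogy works exactly; real embeddings only approximate it. The `analogy` helper is likewise an illustration, not a library function.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in {a, b, c})
    return min(candidates, key=lambda word: math.dist(target, vectors[word]))


closest = analogy("king", "man", "woman", toy_vectors)
print(closest)
```

Excluding the three input words from the candidates is a common choice, since otherwise the nearest neighbour of the result is often one of the inputs themselves.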

We can do something similar with the bio-medical model.

In [None]:
from scipy.spatial import distance
import numpy as np

# Same pattern as king - man + woman, with biomedical terms:
# cardiologist - heart + brain
king = nlp("cardiologist").vector
man = nlp("heart").vector
woman = nlp("brain").vector

result = king - man + woman

# Collect every vector in the vocabulary for the distance function
ids = list(nlp.vocab.vectors.keys())
vectors = np.array([nlp.vocab.vectors[x] for x in ids])

# Find the closest word (cdist uses Euclidean distance by default)
closest_index = distance.cdist(result.reshape(1, -1), vectors).argmin()
word_id = ids[closest_index]
output_word = nlp.vocab[word_id].text
output_word


## Spacy's Pipelines

That said, I would say Spacy's greatest feature is its ability to chain multiple transformations into a Pipe (pipeline). It lets you set up an elaborate pre-processing pipeline to efficiently clean, tag and analyse your text input. For example, let's create a cleaning pipeline in which we disable some components we don't need, to make it run faster.
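
Before the real thing, it may help to picture a pipeline as an ordered list of named components, each of which takes a document and returns it with extra annotations. The plain-Python sketch below mimics that idea with dictionaries instead of Spacy `Doc` objects. The component names and the `run_pipeline` helper are hypothetical and only illustrate why disabling components saves work.

```python
def tokenizer(doc):
    """Split the raw text into whitespace-separated tokens."""
    doc["tokens"] = doc["text"].split()
    return doc


def lowercaser(doc):
    """Add a lowercase version of every token."""
    doc["lower"] = [token.lower() for token in doc["tokens"]]
    return doc


def length_tagger(doc):
    """Add the character length of every token."""
    doc["lengths"] = [len(token) for token in doc["tokens"]]
    return doc


# Ordered (name, component) pairs; each component is a callable doc -> doc.
pipeline = [
    ("tokenizer", tokenizer),
    ("lowercaser", lowercaser),
    ("length_tagger", length_tagger),
]


def run_pipeline(text, components, disable=()):
    """Apply each enabled component in order and return the annotated doc."""
    doc = {"text": text}
    for name, component in components:
        if name not in disable:
            doc = component(doc)
    return doc


full = run_pipeline("Narcolepsy in people", pipeline)
light = run_pipeline("Narcolepsy in people", pipeline, disable=["length_tagger"])
print(sorted(full))
print(sorted(light))
```

A disabled component is simply skipped, so its annotations are missing from the result and its cost is never paid.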


In [None]:
import pandas as pd
from time import time  # Measure execution time


data_path: str = "../data/mtsamples.csv"
df = pd.read_csv(data_path)
# let's pick the first 50 transcriptions as an example
transcriptions = df["transcription"].dropna()[:50]

In [None]:
# reload the NLP model.
nlp = spacy.load(
  'en_core_sci_md',
  disable=['ner', 'parser']  # let's remove some things we don't need for cleaning
) 

def cleaning(doc) -> str:
    """Simple cleaning pipeline using Spacy.

    Lemmatizes tokens and removes stopwords. Keeps only alphabetic
    tokens (drops digits and punctuation).

    Args:
        doc (spacy.tokens.doc.Doc): Document processed by spacy's pipeline
    Returns:
        str: Processed string.
    """
    txt = [
        token.lemma_ for token in doc if not token.is_stop and token.is_alpha
    ]
    return ' '.join(txt)


In [None]:
# Example in a short sentence
text: str = "Alterations in the Hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in people."
cleaning(nlp(text))

In [None]:
t = time()  # let's measure execution time

txt = [
    cleaning(doc) for doc in nlp.pipe(
        transcriptions,
        batch_size=20,
        n_process=1  # number of processes to use
    )
]
# measure the execution time
t_ = round((time() - t) / 60, 2)  # minutes needed to run
print(f'Execution time: {t_} mins')
print(txt[0])

In [None]:
# Without nlp.pipe batching (still a single process):
# call nlp on each document one at a time

t = time()  # let's measure execution time

txt = [
    cleaning(nlp(doc)) for doc in transcriptions
]
# measure the execution time
t_ = round((time() - t) / 60, 2)  # minutes needed to run
print(f'Execution time: {t_} mins')