### spaCy Intro

Jay Urbain, PhD

References and credits:  
https://spacy.io/      

spaCy is an open-source natural language processing library written in Python. It leverages the SciPy ecosystem. The API is well defined and relatively easy to use.

spaCy is very efficient and is designed for production use. The spaCy open-source team quickly incorporates new NLP models, and the spaCy library interoperates well with other machine learning libraries including TensorFlow, PyTorch, scikit-learn, and Gensim.

Some other options for NLP libraries include:  
- [NLTK](https://www.nltk.org/): The most popular Python NLP library. It is more difficult to use and does not typically have top-performing models. It has several add-on packages.
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/): Written in Java, with a Python API library. It has several state-of-the-art models, including dependency parsing.
- [Gensim](https://github.com/RaRe-Technologies/gensim): Written in Python, best for unsupervised NLP tasks like topic modeling and word vectors.
- Many more ...

Here's a quick comparison of the libraries from the spaCy website. More can be found here: [spaCy Facts and Figures](https://spacy.io/usage/facts-figures)

<img src="spacy_nltk_corenlp_comparison.png" width="400px"/>                                                 

The goal of this notebook is to provide an introduction and getting-started guide for using spaCy for baseline NLP tasks.



#### Installation (not required for Colab) 

Pip:
`python -m venv .env
source .env/bin/activate
pip install spacy`

Conda:
`conda install -c conda-forge spacy`


In [None]:
!pip install spacy

Check the spaCy version:

In [None]:
import spacy
spacy.__version__

####  Models & Languages

spaCy’s models can be installed as Python packages. This means that they’re a component of your application, just like any other module. They’re versioned and can be defined as a dependency in your `requirements.txt`. Models can be installed from a download URL or a local directory, manually or via pip. Their data can be located anywhere on your file system.

Install model:
`python -m spacy download en_core_web_sm`

External URL (very helpful for requirements.txt during deployment):
`pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz`

or local file:
`pip install /Users/you/en_core_web_sm-2.1.0.tar.gz`

requirements.txt model format:   
`spacy>=2.0.0,<3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm`


In [None]:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz

You can explore the meta-data for a spaCy model prior to loading it. The description and pipeline fields identify the NLP functionality provided by the model.

In [None]:
import spacy
spacy.info("en_core_web_sm")
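
Because a model is installed as an ordinary Python package, a script can check for it before trying to load it. The following is a minimal sketch of that idea, not a recommended deployment pattern: `load_pipeline` and `fallback_lang` are hypothetical names introduced here, and falling back to a blank (tokenizer-only) pipeline is an assumption made so the sketch runs even when no model has been downloaded.

```python
import spacy
from spacy.util import is_package


def load_pipeline(name="en_core_web_sm", fallback_lang="en"):
    """Load an installed model package, or fall back to a blank pipeline.

    `load_pipeline` and `fallback_lang` are hypothetical names used only
    for this sketch; they are not part of spaCy.
    """
    if is_package(name):
        # The model is installed as a regular Python package.
        return spacy.load(name)
    # No model package found: a blank pipeline provides the tokenizer only.
    return spacy.blank(fallback_lang)


nlp = load_pipeline()
# With a model installed, pipe_names lists its components;
# with the blank fallback, the list is empty.
print(nlp.lang, nlp.pipe_names)
```

A blank pipeline can tokenize text but cannot predict part-of-speech tags, dependencies, or entities, so the fallback is only useful for the tokenization examples.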

#### NLP object and tokenization

At the center of spaCy is the object containing the processing pipeline, which is instantiated by loading a model. It is usually referenced by the variable `nlp`.

In the following example, when we load the "en_core_web_sm" model we instantiate an `nlp` pipeline with the functionality provided by that model. See the meta-data above.

When we process a sentence through the pipeline, spaCy creates a document (`Doc`) object. The Doc lets you access information about the text in a structured way. In our example, we can access each token of the text from the tokenization provided by the pipeline.

Pipeline annotations can be accessed like any other Python sequence.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm") # load model package "en_core_web_sm"
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
    
doc[5]

In [None]:
# 1. On your own

# Import spacy
import _____

# Create the nlp object
nlp = ____

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(____.text)

# Print text of each token
for token in doc:
    print(____.____)

Solutions below.

You can also create a slice of tokens within doc, or characters within a token.

In [None]:
print(doc.text)
print( doc[1:6] )
print( doc[2].text[1:3])
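
To make the distinction between indexing and slicing concrete, here is a small sketch that checks the types involved. It uses `spacy.blank("en")`, a tokenizer-only pipeline, as an assumption so that it runs without a downloaded model; the tokenization is the same rule-based step the loaded model uses.

```python
import spacy
from spacy.tokens import Span, Token

# A blank English pipeline contains only the tokenizer.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

first = doc[0]       # indexing returns a Token
phrase = doc[1:4]    # slicing returns a Span, a view onto the Doc
last = doc[-1]       # negative indices work as they do for lists

print(type(first).__name__, first.text)
print(type(phrase).__name__, phrase.text)
print(len(doc), last.text)

# Token.text is an ordinary Python string, so string slicing applies to it.
print(first.text[:3])
```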

#### Lexical attributes

You can access token text to make lexical (text) comparisons.

Check whether the next token’s text attribute is a percent sign, "%".

The `like_num` token attribute can be used to check if a token is a number.

The index of the next token in the doc is token.i + 1.
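
Before attempting the exercise, it may help to see a few lexical attributes side by side. This is a small sketch on a made-up sentence; it assumes a blank English pipeline, since lexical attributes such as `is_alpha`, `is_punct`, and `like_num` come from the text itself and do not need a trained model.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained model required
doc = nlp("She bought 3 books and ten pens.")

for token in doc:
    # token.i is the token's position in the doc
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)

# like_num matches digits as well as English number words
numbers = [token.text for token in doc if token.like_num]
print(numbers)
```

Note that `like_num` asks whether a token resembles a number, so both "3" and "ten" qualify.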


In [None]:
# 2. On your own
# Import spacy
import spacy

# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if ____.____:
        # Get the next token in the document
        next_token = ____[____]
        # Check if the next token's text equals '%'
        if next_token.____ == "%":
            print("Percentage found:", token.text)

Solutions below.

#### Statistical models

spaCy's statistical models allow you to predict linguistic attributes in context:
- Part-of-speech tags  
- Syntactic dependencies  
- Named entities  

Models are trained on labeled example texts and can be updated with more examples to fine-tune predictions.

The `en_core_web_sm` package, which we have already loaded, is a small English model that supports all core spaCy capabilities and is trained on web text.

The `spacy.load` function loads a model package by name and returns an `nlp` object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.


In [None]:
# Import spacy
import spacy

# Create the nlp object
nlp = spacy.load("en_core_web_sm")

*Predicting parts-of-speech*

Part of speech tagging can be helpful for several downstream NLP tasks, e.g., noun-chunking for candidate entities, named entity recognition, and sentence parsing.

For each token in the Doc, we can print the text and the `pos_` attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an ID.
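
The string/ID pairing can be seen without a trained model by using the `orth` attribute (the token's verbatim text), which follows the same naming pattern as `pos` and `pos_`. The sketch below assumes a blank English pipeline and a made-up sentence.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("spaCy stores strings as hashes")
token = doc[0]

print(token.orth_)  # string form of the token's verbatim text
print(token.orth)   # integer ID (hash) of that same string

# The vocabulary's StringStore maps in both directions.
string_id = nlp.vocab.strings["spaCy"]
assert string_id == token.orth
assert nlp.vocab.strings[string_id] == "spaCy"
```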

In [None]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Process a text
doc = nlp("Chronic obstructive pulmonary disease (COPD) is a chronic inflammatory lung disease that causes obstructed airflow from the lungs.")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

*Predicting syntactic dependencies*

Using a dependency parser, we can predict how words in a sentence are related. This is especially helpful for extracting entity attributes and identifying entity relations.

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

In the example below, `Chronic`, `obstructive`, and `pulmonary` are all modifiers of the NOUN `disease`. Together they identify a distinct entity noun phrase.

`Chronic obstructive pulmonary disease` is related to `chronic inflammatory lung disease` by the verb `is`. This is also called an `is-a` or `type` relation.

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

*Dependency label scheme*

<img src="spacy_dependency_label_scheme.png" width="500px"/>

The pronoun "She" is a nominal subject attached to the verb "ate".

The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".

The determiner "the", also known as an article, is attached to the noun "pizza".
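
The three relations described above can be reproduced in code. The sketch below does not run a parser: it builds a `Doc` with hand-written annotations that match the description, so it works without a downloaded model. It assumes spaCy v3 or later, where the `Doc` constructor accepts `pos`, `heads`, and `deps`; older versions do not support these arguments.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations (not model output). Assumes spaCy v3+.
words = ["She", "ate", "the", "pizza"]
doc = Doc(
    nlp.vocab,
    words=words,
    pos=["PRON", "VERB", "DET", "NOUN"],
    heads=[1, 1, 3, 1],  # absolute index of each token's head; the root points to itself
    deps=["nsubj", "ROOT", "det", "dobj"],
)

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Children of the root verb: its subject and its direct object
root = [token for token in doc if token.dep_ == "ROOT"][0]
print([child.text for child in root.children])
```

With a trained model loaded, the same loop over `token.dep_` and `token.head` applies to the parser's predictions instead of hand-written labels.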

*Predicting Named Entities*

Named entities are "real world objects" that are assigned a name, e.g., person, location, organization, or country. The task of identifying named entities in text is typically called *named entity recognition (NER)*.

The `doc.ents` property lets you access the named entities predicted by the model.

It returns a tuple of `Span` objects (slices of the `Doc`), so we can print the entity text and the entity label using the `label_` attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.


In [None]:
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
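
Since each entity is a `Span`, it also carries token indices and character offsets. The sketch below illustrates this with hand-labeled entities that mirror the predictions described above; it assumes a blank English pipeline so that it runs without a trained model, which means the labels here are assigned manually rather than predicted.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, so entities are assigned by hand below
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

# Hand-labeled entities (not model output)
labeled = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]
spans = []
for phrase, label in labeled:
    start = text.index(phrase)
    # char_span returns None if the offsets do not align with token boundaries
    span = doc.char_span(start, start + len(phrase), label=label)
    if span is not None:
        spans.append(span)
doc.ents = spans

for ent in doc.ents:
    # start/end are token indices; start_char/end_char are character offsets
    print(ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
```

Character offsets are useful when entities need to be mapped back onto the original string, for example for highlighting.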

Now let's try a slightly more complex example.

The results aren't too good. The acronym LBP (low back pain) is labeled as an organization, and the location is missed. NER is one of the most common and useful tasks in NLP, but different domains have different vocabularies and require specially trained NER models.

1 - Try a few examples of your own.  
2 - Try out a medical named entity recognizer here: https://cis.ctsi.mcw.edu/nlp/ 

In [None]:
doc = nlp(u"Jay Urbain, is an aging caucasian male suffering from illusions of grandeur and LBP. Jay has been prescribed meloxicam, and venti americano. He lives at 9050 N. Tennyson Dr., Disturbia, WI with his wife Kimberly Urbain.")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Get definitions of the most common tags and labels.

In [None]:
print( 'GPE:', spacy.explain('GPE') )

print( 'NNP:', spacy.explain('NNP') )

print( 'dobj:', spacy.explain('dobj') )

#### Predicting linguistic annotations

In [None]:
# 3. On your own

# Process the text with the nlp object and create a doc.
# For each token, print the token text, the token's .pos_ (part-of-speech tag),
# and the token's .dep_ (dependency label).

import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = ____

for token in doc:
    # Get the token text, part-of-speech tag, and dependency label
    token_text = ____.____
    token_pos = ____.____
    token_dep = ____.____
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))


In [None]:
# 4. On your own

# Process the text and create a doc object.  
# Iterate over the doc.ents and print the entity text and label_ attribute.  

import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = ____

# Iterate over the predicted entities
for ent in ____.____:
    # Print the entity text and its label
    print(ent.____, ____.____)

Solutions below.

#### Summary:  
    
We covered the core capabilities of spaCy. spaCy has a lot more useful functionality, including tools for annotating your data and a machine learning library for building custom models for applications like named entity recognition and text classification.

For more information, read the documentation and take the tutorials at spaCy: https://spacy.io/

#### Solutions

In [None]:
# 1. On your own

# Import spacy
import spacy

# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

# Print text of each token
for token in doc:
    print(token.text)

In [None]:
# 2. On your own
# Import spacy
import spacy

# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)

In [None]:
# 3. On your own

import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

In [None]:
# 4. On your own

import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)