## Advanced NLP with spaCy course: https://course.spacy.io/en

Introduction to spaCy

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".

For example, to create an English nlp object, you can import spacy and use the spacy.blank method to create a blank English pipeline. You can use the nlp object like a function to analyze text.

The nlp object contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages.

In [2]:
# Import spaCy
import spacy

# Create a blank English nlp object
nlp = spacy.blank("en")
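
As a quick check of what a blank pipeline holds, the sketch below inspects the object created above. It assumes the `lang` and `pipe_names` attributes of the nlp object; a blank pipeline should report no trained components, because it only carries the language-specific tokenizer.

```python
import spacy

# Create a blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

# Language code the pipeline was created for
print(nlp.lang)

# Names of the pipeline components; empty for a blank pipeline
print(nlp.pipe_names)
```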

In [3]:
# Create a Doc by processing a string of text with the nlp object
doc = nlp("2nd Edition Machine Learning for Biomedicine Research and Healthcare: From Theory to Practice")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

2nd
Edition
Machine
Learning
for
Biomedicine
Research
and
Healthcare
:
From
Theory
to
Practice


Token objects represent the tokens in a document – for example, a word or a punctuation character.

To get a token at a specific position, you can index into the doc.

Token objects also provide various attributes that let you access more information about the tokens. For example, the .text attribute returns the verbatim token text.

In [4]:
# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

Edition


A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a span, you can use Python's slice notation.

In [7]:
# A slice from the Doc is a Span object
span = doc[1:9]

# Get the span text via the .text attribute
print(span.text)

Edition Machine Learning for Biomedicine Research and Healthcare
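
To make the "view" idea more concrete, here is a minimal sketch on a made-up sentence (the text is an illustration, not from the course). It shows that a Span only records where it starts and ends in its parent Doc, and that its tokens keep their original positions.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world, this is spaCy!")

# Slice tokens 3, 4 and 5 (the end index is exclusive)
span = doc[3:6]

print(span.text)             # text covered by the span
print(span.start, span.end)  # token offsets into the parent Doc
print(len(span))             # number of tokens in the span
print(span.doc is doc)       # the span points back to its parent Doc
print(span[0].i)             # tokens keep their index in the Doc
```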


Lexical attributes

In [8]:
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
Text:     ['2nd', 'Edition', 'Machine', 'Learning', 'for', 'Biomedicine', 'Research', 'and', 'Healthcare', ':', 'From', 'Theory', 'to', 'Practice']
is_alpha: [False, True, True, True, True, True, True, True, True, False, True, True, True, True]
is_punct: [False, False, False, False, False, False, False, False, False, True, False, False, False, False]
like_num: [True, False, False, False, False, False, False, False, False, False, False, False, False, False]
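
The attributes above work with a blank pipeline because they are computed from the token text alone. As a small illustration on a made-up sentence, the sketch below uses them as filters; note that like_num also matches English number words, not just digits.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I have ten apples and 3 pears.")

# like_num is True for digits and for English number words
numbers = [token.text for token in doc if token.like_num]
print(numbers)

# is_alpha is True only for tokens made up of alphabetic characters
words = [token.text for token in doc if token.is_alpha]
print(words)
```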


Task 1


-Use spacy.blank to create a blank English ("en") nlp object.
-Create a doc and print its text.

In [None]:
# Create the English nlp object
nlp = ____

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(____.text)

Task 2
-Use spacy.blank to create a blank Polish ("pl") nlp object.
-Create a doc and print its text.

In [None]:
# Create the Polish nlp object
nlp = ____

# Process a text (this is Polish for: "Kind regards!")
doc = nlp("Pozdrawiam")

# Print the document text
print(____.text)

Task 3
-Use spacy.blank to create a blank Spanish ("es") nlp object.
-Create a doc and print its text.

In [None]:
# Create the Spanish nlp object
nlp = ____

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(____.text)


Task 4
When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.

Step 1

-Use spacy.blank to create the English nlp object.
-Process the text and instantiate a Doc object in the variable doc.
-Select the first token of the Doc and print its text.

In [12]:
nlp = ____

# Process the text
doc = ____("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[____]

# Print the first token's text
print(first_token.____)


Step 2

-Use spacy.blank to create the English nlp object.
-Process the text and instantiate a Doc object in the variable doc.
-Create a slice of the Doc for the tokens “tree kangaroos” and “tree kangaroos and narwhals”.

In [13]:
nlp = ____

# Process the text
doc = ____("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = ____
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = ____
print(tree_kangaroos_and_narwhals.text)


Task 5
In this example, you’ll use spaCy’s Doc and Token objects, and lexical attributes to find percentages in a text. You’ll be looking for two subsequent tokens: a number and a percent sign.

-Use the like_num token attribute to check whether a token in the doc resembles a number.
-Get the token following the current token in the document. The index of the next token in the doc is token.i + 1.
-Check whether the next token’s text attribute is a percent sign "%".


In [None]:
nlp = spacy.blank("en")

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if ____.____:
        # Get the next token in the document
        next_token = ____[____]
        # Check if the next token's text equals "%"
        if next_token.____ == "%":
            print("Percentage found:", token.text)
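
One edge case worth keeping in mind with the look-ahead described in this task: if a number is the last token, token.i + 1 is out of range and indexing the doc raises an IndexError. The sketch below is one possible way to guard against that, shown on a hypothetical text rather than the exercise text; the `found` list is only there for illustration.

```python
import spacy

nlp = spacy.blank("en")
# Hypothetical example text: the final token is a number with nothing after it
doc = nlp("Growth was 12% last year, up from 7")

found = []
for token in doc:
    # Only look ahead when a following token exists
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            found.append(token.text)

print("Percentages found:", found)
```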

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Trained pipeline components have statistical models that enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Pipelines are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

spaCy provides a number of trained pipeline packages you can download using the spacy download command. For example, the "en_core_web_sm" package is a small English pipeline that supports all core capabilities and is trained on web text.

The spacy.load method loads a pipeline package by name and returns an nlp object.

In [16]:
nlp = spacy.load("en_core_web_sm")
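
spacy.load fails if the package has not been installed. As a hedged sketch, the check below uses the `is_package` helper from `spacy.util` to test for the package first; the `components` variable is just an illustration.

```python
import spacy
from spacy.util import is_package

name = "en_core_web_sm"

if is_package(name):
    # The package is installed: load it and list its components
    nlp = spacy.load(name)
    components = nlp.pipe_names
else:
    # Not installed: it can be fetched with `python -m spacy download en_core_web_sm`
    components = []

print(components)
```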

For each token in the doc, we can print the text and the .pos_ attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an integer ID value.

In [17]:
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
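
The underscore convention mentioned above can be tried without a trained pipeline. This sketch uses the `orth` / `orth_` pair (the verbatim token text) instead of `pos` / `pos_`, purely so that it runs on a blank pipeline; the same pattern applies to the predicted attributes.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("She ate the pizza")
token = doc[1]

# Without the underscore: an integer ID (a hash of the string)
print(type(token.orth))

# With the underscore: the readable string
print(token.orth_)

# The pipeline's string store maps the ID back to the string
print(nlp.vocab.strings[token.orth])
```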


In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The .dep_ attribute returns the predicted dependency label.

The .head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [18]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate
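
To show how the head and dependency labels can be navigated, here is a minimal sketch that does not run the parser. It assumes spaCy v3, where a Doc can be built with hand-written `heads` and `deps`; the annotations are copied from the output above, not predicted by this code.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations copied from the output above (not predicted here)
words = ["She", "ate", "the", "pizza"]
heads = [1, 1, 3, 1]                     # index of each token's head
deps = ["nsubj", "ROOT", "det", "dobj"]  # dependency labels
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The root is the token labelled ROOT; it is its own head
root = [token for token in doc if token.dep_ == "ROOT"][0]
print(root.text, root.head.text)

# Tokens directly attached to the root
children = [child.text for child in root.children]
print(children)

# The subject of the sentence
subjects = [token.text for token in doc if token.dep_ == "nsubj"]
print(subjects)
```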


Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The doc.ents property lets you access the named entities predicted by the named entity recognition model.

It returns a tuple of Span objects, so we can iterate over it and print the entity text and the entity label using the .label_ attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.

In [19]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY
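
Because entities are Span objects, they also expose their token offsets. The sketch below builds the same sentence by hand and assigns the entity spans manually to mirror the predictions above; nothing here is predicted by a model, and the token boundaries are an assumption made explicit in the `words` list.

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")

# Hand-built Doc so the token boundaries are explicit
words = ["Apple", "is", "looking", "at", "buying", "U.K.",
         "startup", "for", "$", "1", "billion"]
spaces = [True, True, True, True, True, True,
          True, True, False, True, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Entity spans set by hand to mirror the predictions above
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 5, 6, label="GPE"),
    Span(doc, 8, 11, label="MONEY"),
]

for ent in doc.ents:
    # start and end are token offsets into the Doc
    print(ent.text, ent.label_, ent.start, ent.end)
```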


A quick tip: To get definitions for the most common tags and labels, you can use the spacy.explain helper function.

For example, "GPE" for geopolitical entity isn't exactly intuitive – but spacy.explain can tell you that it refers to countries, cities and states.

In [20]:
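# Wrap each call in print: a notebook cell only displays the value of its last expression
print(spacy.explain("GPE"))
print(spacy.explain("NNP"))

Countries, cities, states
noun, proper singular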
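
When several labels need looking up at once, a small loop can help. The sketch below uses labels that appear in the outputs earlier in this section, plus one made-up label to show that spacy.explain returns None when it has no entry; the `explanations` dictionary is only an illustration.

```python
import spacy

# Labels taken from the outputs earlier in this section, plus one made-up label
labels = ["PRON", "nsubj", "dobj", "GPE", "NOT_A_LABEL"]

explanations = {label: spacy.explain(label) for label in labels}
for label, description in explanations.items():
    # spacy.explain returns None when it has no entry for the label
    print(label, "->", description)
```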

Task 6

The pipelines we’re using in this course are already pre-installed. For more details on spaCy’s trained pipelines and how to install them on your machine, see the documentation.

-Use spacy.load to load the small English pipeline "en_core_web_sm".
-Process the text and print the document text.

In [None]:
# Load the "en_core_web_sm" pipeline
nlp = ____

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = ____

# Print the document text
print(____.____)


Task 7

You’ll now get to try one of spaCy’s trained pipeline packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain("PROPN") or spacy.explain("GPE").

Part 1
-Process the text with the nlp object and create a doc.
-For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).

In [22]:
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = ____

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = ____.____
    token_pos = ____.____
    token_dep = ____.____
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")


Part 2

-Process the text and create a doc object.
-Iterate over the doc.ents and print the entity text and label_ attribute.

In [23]:
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = ____

# Iterate over the predicted entities
for ent in ____.____:
    # Print the entity text and its label
    print(ent.____, ____.____)
