# Introduction to spaCy

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

[spaCy](https://spacy.io/) provides a fairly complete NLP pipeline, in which the output of one module feeds into the input of the next: it takes a raw document and performs tokenization, POS tagging, stop word recognition, morphological analysis, lemmatization, sentence splitting, dependency parsing and Named Entity Recognition (NER). It also supports similarity prediction, but that is outside the scope of this notebook. The advantages of spaCy are that it is fast and accurate. In addition, it supports multiple languages, including English, German, Spanish, Portuguese, French, Italian and Dutch.

In this notebook, we will show you the basic usage. If you want to learn more, please visit spaCy's website; it has extensive documentation and provides excellent user guides. 

**At the end of this notebook, you will be able to extract the output from spaCy for the following NLP tasks**:
* **Sentence splitting**: attribute **sents** of a `Doc` (of type *spacy.tokens.doc.Doc*)
* **Tokenization**: `Doc` contains a sequence of `Token` objects (of type *spacy.tokens.token.Token*)
* **Part-of-speech (POS) tagging**: attributes **pos_** and **tag_** of `Token`
* **Stop word recognition**: attribute **is_stop** of `Token`
* **Lemmatization**: attribute **lemma_** of `Token`
* **Dependency parsing**: attributes **dep_** and **head** of `Token`
* **Named Entity Recognition (NER)**: attribute **ents** of `Doc` (of type *spacy.tokens.doc.Doc*); each entity is of type *spacy.tokens.span.Span*

In addition, you will be able to use spaCy to visualize the output for each NLP task.

### Acknowledgements
We thank [Chantal van Son](https://chantalvanson.wordpress.com/) for creating [her notebook](https://github.com/cltl/python-for-text-analysis/blob/master/Chapters/Chapter%2019%20-%20More%20about%20Natural%20Language%20Processing%20Tools%20(spaCy).ipynb), parts of which were used in this notebook.

## Installing and loading spaCy

To install spaCy, check out the instructions [here](https://spacy.io/usage). That page explains exactly how to install spaCy for your operating system, package manager and desired languages. Simply run the suggested commands in your terminal ([Anaconda Prompt](https://docs.anaconda.com/anaconda/user-guide/getting-started/) or cmd).

Alternatively, you may be able to just run the following cells in this notebook:

**Tip**: comment out the next two commands after using them.

In [None]:
# %%bash
# conda install -c conda-forge spacy

In [None]:
# %%bash
# python -m spacy download en

Now, let's first load spaCy. We import the spaCy module and load the English tokenizer, tagger, parser, NER, and word vectors.

In [2]:
# %pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
from pathlib import Path

import nltk
from nltk import word_tokenize
from nltk.tokenize import sent_tokenize
import spacy
from spacy import displacy  # needed for the visualizations later in this notebook

# nlp = spacy.load('en') # spaCy 2.x shortcut name; other languages: de, es, pt, fr, it, nl

You might get a variation of the following error:
```
OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
```

This means that there is a problem with linking the language model of spaCy. You can try to load it in the following way:

In [None]:
nlp = spacy.load('en_core_web_sm')
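
If you are unsure whether a model is installed, one option is to catch the `OSError` shown above and report it clearly instead of letting the notebook stop. The snippet below is only a sketch: the helper name `load_pipeline` and the name `model_that_does_not_exist` are hypothetical and used purely for illustration.

```python
import spacy


def load_pipeline(name):
    """Return the loaded pipeline, or None if the model cannot be found.

    Hypothetical helper for illustration; not part of spaCy.
    """
    try:
        return spacy.load(name)
    except OSError as error:
        # spaCy raises OSError when a model package or path is missing
        print(f"Could not load '{name}': {error}")
        return None


# A name that is (presumably) not installed, to show the failure path
missing = load_pipeline("model_that_does_not_exist")
print("Result for missing model:", missing)
```

Calling `load_pipeline('en_core_web_sm')` in the same way returns a usable pipeline if the model is installed, and `None` otherwise.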

Even if this works, you might still want to invest some time to make sure that the linking was successful, i.e., that you can load spaCy with `spacy.load('en')`. Here is some more information on how to fix this.

**Troubleshooting (optional)**

*Cause*: The Anaconda Prompt does not have enough privileges to execute the linking part of `python -m spacy download en`. The same is true for any other `python -m spacy [...]` command.

*Solution:* 

1. Create the link manually

The prompt should display something along the lines of:

```
<Data Downloaded>
You do not have sufficient privilege to perform this operation.

    Linking successful
    <Anaconda dir>\lib\site-packages\en_core_web_sm -->
    <Anaconda dir>\lib\site-packages\spacy\data\en

    You can now load the model via spacy.load('en')
```

Use the following command. Note the argument order: the link comes first, followed by the target it points to.

```
mklink /D <Anaconda>\lib\site-packages\spacy\data\en <Anaconda>\lib\site-packages\en_core_web_sm
```

2. Give Anaconda permission to create the link. Simply using "runas ... python -m spacy ..." may not suffice.

3. More details: https://github.com/explosion/spaCy/issues/1283

**If none of this works (can happen on Windows)**, you might want to install the model manually from spaCy's GitHub through pip: `pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz`. Choose the model release that is compatible with your installed spaCy version.

## Using spaCy

`nlp` is now a Python object representing the English NLP pipeline that we can use to process a text. 

Parsing a text with spaCy after loading a language model is as easy as follows:

In [None]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")

`doc` is now a Python object of the class `Doc`. It is a container for accessing linguistic annotations and a sequence of `Token` objects.

#### Doc, Token and Span objects

At this point, there are three important types of objects to remember:

* A `Doc` is a sequence of `Token` objects.
* A `Token` object represents an individual token — i.e. a word, punctuation symbol, whitespace, etc. It has attributes representing linguistic annotations. 
* A `Span` object is a slice from a `Doc` object and a sequence of `Token` objects.
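
The relation between the three types can be sketched as follows. To keep the example runnable without a downloaded model, it uses `spacy.blank("en")`, a tokenizer-only pipeline; that choice, and the names `blank_nlp`, `example_doc` and `example_span`, are ours and not part of the rest of this notebook.

```python
import spacy

# A blank English pipeline only tokenizes; no trained model is required
blank_nlp = spacy.blank("en")
example_doc = blank_nlp("I have an awesome cat.")

# Slicing a Doc gives a Span; indexing a Doc or Span gives a Token
example_span = example_doc[2:5]
example_token = example_span[0]

print(type(example_doc).__name__, len(example_doc))
print(type(example_span).__name__, example_span.text)
print(type(example_token).__name__, example_token.text)
```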

Since `Doc` is a sequence of `Token` objects, we can iterate over all of the tokens in the text as shown below, or select a single token from the sequence: 

In [None]:
# Iterate over the tokens
for token in doc:
    print(token)
print()

# Select one single token by index
first_token = doc[0]
print("First token:", first_token)

Please note that even though these look like strings, they are not:

In [None]:
for token in doc:
    print(token, "\t", type(token))

These `Token` objects have many useful methods and *attributes*, which we can list by using `dir()`. We haven't really talked about attributes during this course: methods are operations performed by an object, whereas attributes are 'static' features of the object. Methods are called using parentheses (as we have seen with `str.upper()`, for instance), while attributes are accessed without parentheses. We will see some examples below.

You can find more detailed information about the token methods and attributes in the [documentation](https://spacy.io/api/token).
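
As a small illustration of the difference, the sketch below reads two attributes (`i` and `is_punct`) and calls one method (`nbor()`). It assumes a tokenizer-only pipeline created with `spacy.blank("en")` so that it runs without a downloaded model; the variable names are ours.

```python
import spacy

blank_nlp = spacy.blank("en")  # tokenizer only, no trained model needed
example_doc = blank_nlp("I have an awesome cat.")
token = example_doc[0]

# Attributes: accessed without parentheses
print(token.text, token.i, token.is_punct)

# Method: called with parentheses; nbor() returns the neighbouring token
next_token = token.nbor()
print(next_token.text)
```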

In [3]:
dir(first_token)

Let's inspect some of the attributes of the tokens. Can you figure out what they mean? Feel free to try out a few more.

In [None]:
# Print attributes of tokens
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_)

Notice that some of the attributes end with an underscore. For example, tokens have both `lemma` and `lemma_` attributes. The `lemma` attribute represents the id of the lemma (integer), while the `lemma_` attribute represents the unicode string representation of the lemma. In practice, you will mostly use the `lemma_` attribute.

In [None]:
for token in doc:
    print(token.lemma, token.lemma_)
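
The same convention holds for other attribute pairs. The sketch below uses `orth` and `orth_` (the verbatim token text) instead of `lemma` and `lemma_`, because it assumes a blank, tokenizer-only pipeline that has no lemmatizer. It illustrates that the integer is a key into the vocabulary's string store, which maps in both directions.

```python
import spacy

blank_nlp = spacy.blank("en")  # tokenizer only, so we look at orth/orth_
example_doc = blank_nlp("I have awesome cats")
token = example_doc[3]

# Integer ID and string form of the same value
print(token.orth, token.orth_)

# The string store maps an ID back to its string, and a string to its ID
strings = example_doc.vocab.strings
print(strings[token.orth])
print(strings["cats"] == token.orth)
```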

You can also use `spacy.explain` to find out more about certain labels:

In [None]:
# try out some more, such as NN, ADP, PRP, VBD, VBP, VBZ, WDT, aux, nsubj, pobj, dobj, npadvmod
spacy.explain("VBZ")
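
To look up several labels at once, a short loop is enough. This is a minimal sketch; the selection of labels is arbitrary and taken from the suggestions in the cell above.

```python
import spacy

labels = ["NN", "VBZ", "ADP", "nsubj", "amod", "dobj"]

# spacy.explain returns a short description for each known label
for label in labels:
    print(f"{label:6} {spacy.explain(label)}")
```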

## Sentence splitting & tokenization
spaCy performs sentence splitting for you. The sentences are available through the attribute **sents** of `Doc` (of type *spacy.tokens.doc.Doc*).
Each `Doc` contains a sequence of `Token` objects, i.e., this is where the output from the tokenizer is found. The text of a token can be accessed using the attribute **text**.

In [None]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")

In [4]:
for sentence in doc.sents:
    print()
    print(sentence)
    for token in sentence:
        print(token.text)
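
With a trained model, the sentence boundaries usually come from the dependency parser. As a lightweight alternative, spaCy also ships a rule-based `sentencizer` component that splits on sentence-final punctuation. The sketch below assumes spaCy 3.x (where components are added by name) and a blank pipeline; neither assumption is required by the rest of this notebook.

```python
import spacy

blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")  # spaCy 3.x syntax; rule-based splitter

example_doc = blank_nlp(
    "I have an awesome cat. It's sitting on the mat that I bought yesterday."
)

# doc.sents is a generator of Span objects; wrap it in a list to index it
sentences = list(example_doc.sents)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence.text, f"({len(sentence)} tokens)")
```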

## Lemmatization
The output from the lemmatizer is stored in the attribute **lemma_** of each `Token` object.

In [None]:
doc = nlp("I have awesome cats")

In [None]:
cat_token = doc[3]
print(cat_token.text, cat_token.lemma_)

## Part of speech tagging

The output from the part of speech tagger is stored:
* in the attribute **pos_** of each `Token` object: The simple part-of-speech tag
* in the attribute **tag_** of each `Token` object: The detailed part-of-speech tag ([Penn Treebank POS tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html))

In [None]:
doc = nlp("I have awesome cats")

In [None]:
cat_token = doc[3]
print(cat_token.text, cat_token.pos_, cat_token.tag_)

## Stop word recognition
The output from stop word recognition is stored in the attribute **is_stop** of each `Token` object.

In [None]:
doc = nlp("I have awesome cats")

In [None]:
have_token = doc[1]
print(have_token.is_stop) # this means that 'have' is a stop word according to spaCy

In [None]:
cats_token = doc[3]
print(cats_token.is_stop) # this means that 'cats' is not a stop word according to spaCy
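
A common use of **is_stop** is to filter stop words out of a text. The sketch below assumes a blank, tokenizer-only English pipeline, which is sufficient because stop word status is looked up in a word list and does not need a trained model. The variable names are ours.

```python
import spacy

blank_nlp = spacy.blank("en")
example_doc = blank_nlp("I have awesome cats")

# Keep only the tokens that are not stop words
content_words = [token.text for token in example_doc if not token.is_stop]
print(content_words)

# The underlying stop word list is a set of strings on the language defaults
stop_words = blank_nlp.Defaults.stop_words
print(len(stop_words), "stop words;", "'have' included:", "have" in stop_words)
```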

In [None]:
type(token)

## Dependency parsing
The output of the dependency parser can only be accessed by combining the information from multiple attributes. Let's look at an example:

In [None]:
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

In [None]:
displacy.render(doc, jupyter=True, style='dep')

We observe that each token is connected to another token by a dependency relation. For example:
* **Autonomous** is attached to **cars** with an **amod** relation
* **cars** is attached to **shift** with an **nsubj** relation

If you want to know what these relations mean, you can use **spacy.explain**

In [None]:
spacy.explain('amod')

spaCy makes use of the terms **child** and **head** in its dependency parsing output.
* a relation is always between a **child** and a **head**, e.g., *autonomous* is the child of *cars*
* a head of a phrase can be the child of another token, e.g., *cars* is the child of *shift*

The following attributes are needed to access this information:
* **dep_** provides the syntactic relation, e.g., *nsubj*
* **head** provides the **head** of a `Token`, e.g., in the case of *autonomous* the head would be *cars*
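
The child/head terminology can be sketched without running a parser by building a `Doc` with hand-assigned annotations. Everything below is an assumption for illustration: it presumes spaCy 3.x (whose `Doc` constructor accepts `heads` and `deps`), and the heads and labels are typed in by hand rather than produced by a model.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
# Hand-assigned annotations (not parser output):
# heads[i] is the index of the head of words[i]; the root is its own head
heads = [1, 2, 2, 4, 2, 2, 5]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]
example_doc = Doc(vocab, words=words, heads=heads, deps=deps)

# For every token: its relation, its head and its children
for token in example_doc:
    children = [child.text for child in token.children]
    print(f"{token.text:14} {token.dep_:9} head={token.head.text:10} children={children}")

# Follow the head links from the first token up to the root
token = example_doc[0]
path = [token.text]
while token.head.i != token.i:
    token = token.head
    path.append(token.text)
print(" -> ".join(path))
```

With a loaded model, the same loops work on the output of `nlp(...)`; only the construction of the `Doc` differs.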

In [None]:
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

In [None]:
autonomous_token = doc[0]
print(autonomous_token.dep_, autonomous_token.head)

In [None]:
cars_token = doc[1]
print(cars_token.text, cars_token.head)

### Save tree structure to SVG image

In [None]:
tree_structure = displacy.render(doc, jupyter=False, style='dep')

output_path = 'spacy_tree_structure.svg'
with open(output_path, 'w', encoding='utf-8') as outfile:
    outfile.write(tree_structure)

## Named Entity Recognition
The output from the Named Entity Recognizer is stored in the attribute **ents** of `Doc`.
The attribute **label_** of an **ent** (of type *spacy.tokens.span.Span*) contains the named entity type.

In [None]:
text = """But Google is starting from behind. The company made a late push into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption."""
doc = nlp(text)

In [None]:
displacy.render(doc, jupyter=True, style='ent')

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)
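
Because each entity is a `Span`, it also carries its position in the document. The sketch below shows this on a `Doc` with a hand-assigned entity, so that it runs without a trained model. It assumes spaCy 3.x, whose `Doc` constructor accepts token-level IOB tags through `ents`; the tags are typed in by hand and are not model output.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["But", "Google", "is", "starting", "from", "behind", "."]
# Hand-assigned IOB tags (illustrative assumption, not NER output)
iob_tags = ["O", "B-ORG", "O", "O", "O", "O", "O"]
example_doc = Doc(vocab, words=words, ents=iob_tags)

entities = list(example_doc.ents)
for ent in entities:
    # start/end are token indices; start_char/end_char are character offsets
    print(ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
```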

# End of this notebook