# spaCy


spaCy is an open-source library for natural language processing (NLP) in Python that supports a wide variety of languages. One big advantage of spaCy is that it is designed to be integrated into real products without much difficulty.

### Getting started
To start using spaCy, you need to specify which language class you are going to use. Remember that spaCy was created to be used with several languages, so it doesn't assume that you want to use English; you need to specify this explicitly.

In [1]:
# Note
# If you haven't installed spaCy, please uncomment the next line and run this cell
# !pip install spacy

In [2]:
import spacy

Let's begin with an example in English. Since we already know how to tokenize a text, let's take a look at how spaCy does this for us.

In [3]:
# Import English
from spacy.lang.en import English

nlp = English()

raw = "Hard to judge whether these sides were good. We were grossed " \
      "out by the melted styrofoam and didn't want to eat it for fear of getting sick."

doc = nlp(raw)

print(doc)

for token in doc:
    print(token.text)

Hard to judge whether these sides were good. We were grossed out by the melted styrofoam and didn't want to eat it for fear of getting sick.
Hard
to
judge
whether
these
sides
were
good
.
We
were
grossed
out
by
the
melted
styrofoam
and
did
n't
want
to
eat
it
for
fear
of
getting
sick
.


In [4]:
# Now it's your turn to do the same for the following Spanish text taken from BBC in Spanish.

spanish_raw = '¿Es posible "desconectar" a un país entero de internet? ' \
              'La respuesta corta es "sí".'

from spacy.lang.es import Spanish

es_nlp = Spanish()

es_doc = es_nlp(spanish_raw)

print(es_doc)

for token in es_doc:
    print(token.text)

¿Es posible "desconectar" a un país entero de internet? La respuesta corta es "sí".
¿
Es
posible
"
desconectar
"
a
un
país
entero
de
internet
?
La
respuesta
corta
es
"
sí
"
.


### Indexing

spaCy uses the same syntax as Python for indexing, so you can address specific tokens in your documents.

In [5]:
last_word = doc[-1]
first_word = doc[0]
print(first_word, last_word)

Hard .


Every token in our document has some characteristics that are known in spaCy as **lexical attributes**.

In [6]:
print(first_word.is_digit)
print(last_word)
print(last_word.is_punct)

False
.
True
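
As a minimal sketch of a few more lexical attributes, the cell below builds a tokenizer-only pipeline with `spacy.blank("en")` (assumed here to behave like `English()` above) and prints `is_alpha`, `like_num` and `is_punct` for every token. The sentence and the variable names are made up for illustration. These attributes depend only on the token's text, so no trained model is needed.

```python
import spacy

# A blank pipeline only tokenizes; no trained model is required.
nlp_blank = spacy.blank("en")

# Hypothetical example sentence, not taken from the text above.
sample_doc = nlp_blank("I paid 20 dollars for lunch.")

# Collect (text, is_alpha, like_num, is_punct) for every token.
rows = [(t.text, t.is_alpha, t.like_num, t.is_punct) for t in sample_doc]

for row in rows:
    print(row)
```

Because nothing here looks at the surrounding words, the same word gets the same values wherever it appears.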


But what do we need indexing for?

### Documents and spans

A token or a sequence of tokens can be referred to as a span. Spans are very relevant in some NLP tasks. For instance, in areas such as Question Answering (QA), obtaining the correct span that answers a query is crucial for the task itself. With spaCy, we can also define spans and use their attributes in the same way as we do for a token.

In [7]:
span = doc[4:9]

In [8]:
print(span)

these sides were good.


In [9]:
# This cell is reserved for you to explore more about lexical attributes on the previous text. 
# Check this link: https://spacy.io/api/token for more attributes.
# What can you comment about?
print("Here is a part-of-speech tag:", last_word.pos_) # Why is it empty?

Here is a part-of-speech tag: 


### Let's get a bit deeper in statistics

In our last exercise we could play around with probabilities. Working with language requires statistics most of the time. As an example, we can decide whether the word _tweet_ is being used as a noun or as a verb by counting. Can you tell why?

Knowing the context of a word and counting how often our desired word appears after a verb or after a noun would give us the probability that we are searching for.
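
The counting idea can be sketched in plain Python. The tiny hand-tagged corpus below is invented for illustration, and the helper `tag_probability` is a hypothetical name, not part of spaCy. It estimates how likely _tweet_ is to carry a given tag based on the tag of the word just before it; a trained model uses far more context than this.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented for illustration only.
tagged_sentences = [
    [("I", "PRON"), ("tweet", "VERB"), ("daily", "ADV")],
    [("They", "PRON"), ("tweet", "VERB"), ("often", "ADV")],
    [("The", "DET"), ("tweet", "NOUN"), ("went", "VERB"), ("viral", "ADJ")],
    [("A", "DET"), ("tweet", "NOUN"), ("appeared", "VERB")],
    [("We", "PRON"), ("read", "VERB"), ("the", "DET"), ("tweet", "NOUN")],
]

# counts[previous_tag][tag_of_tweet] = number of occurrences
counts = defaultdict(Counter)
for sentence in tagged_sentences:
    # Pair every word with the word that follows it.
    for (_, prev_tag), (word, tag) in zip(sentence, sentence[1:]):
        if word == "tweet":
            counts[prev_tag][tag] += 1


def tag_probability(prev_tag, tag):
    """Relative frequency of `tag` for 'tweet' given the previous word's tag."""
    total = sum(counts[prev_tag].values())
    return counts[prev_tag][tag] / total if total else 0.0


print("P(NOUN | previous is DET): ", tag_probability("DET", "NOUN"))
print("P(VERB | previous is PRON):", tag_probability("PRON", "VERB"))
```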

### How can we include statistics in spaCy?

The good news is that spaCy provides pre-trained models that we can choose according to our needs. Small, medium and large models are available for different languages. With such a model, we can use attributes that depend on context. But what exactly does a pre-trained model contain? It contains a vocabulary of the words used to train the model, their weights and meta-information useful for spaCy.

Let's download and use a small model for English:

In [1]:
# Run this cell only if you haven't downloaded the model before.
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm==2.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz (12.0MB)
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... error
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ukllw283/en-core-web-sm/setup.py'"'"'; __file__='"'"'/tmp/pip-install-ukllw283/en-core-web-sm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --rec

Loading the model is as simple as telling spaCy the name of the model to load.

In [14]:
nlp = spacy.load('en_core_web_sm')

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

And we already know what to do...

In [15]:
# It's your turn to create a new document of our English text 
# and define a span for its last two words excluding the dot.

# new_doc =
# last_span = 
new_doc = nlp(raw)

word_two = new_doc[1]
last_span = new_doc[-3:-1]
print(word_two.text)

to


In [16]:
# Now display part-of-speech tags, dependencies and lemma for them.
for token in last_span:
    print(token.text, token.pos_, token.dep_, token.lemma_)

getting   getting
sick   sick


### Structure inside spaCy

Until this point, we have seen how to pass raw text to spaCy and process it into tokens with lexical features. However, storing every token separately for every occurrence in a text is expensive in memory. Therefore, spaCy manages everything in a kind of `internal structure`.

This structure has three levels or components: the document (doc), a vocabulary called **vocab**, and a lookup table that spaCy calls the **string store**. The vocab stores its entries under ids, also known as **hashes**. From now on, we will call every entry in the vocab a **lexeme**. The string store is the look-up table that maps each hash to its string and back, so it tells us which token corresponds to which lexeme.
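
To see why this layout saves memory, here is a toy model of the three levels in plain Python. It is only a sketch and not spaCy's implementation: `toy_hash`, `add_word`, the dictionaries and the sample sentence are all hypothetical, and the md5-based id merely stands in for spaCy's own hash function.

```python
import hashlib


def toy_hash(text):
    """Deterministic 64-bit id for a string (stand-in for spaCy's own hash)."""
    digest = hashlib.md5(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


string_store = {}  # hash -> string (the look-up table)
vocab = {}         # hash -> lexeme: context-independent facts about a word


def add_word(text):
    """Register a word and return its id."""
    key = toy_hash(text)
    string_store[key] = text
    # Each distinct word is stored only once, however often it occurs.
    vocab.setdefault(key, {"is_alpha": text.isalpha(), "length": len(text)})
    return key


# The "document" keeps only the ids, in order of appearance.
toy_doc = [add_word(w) for w in "we were there and we were late".split()]

print(len(toy_doc), "tokens,", len(vocab), "lexemes")
print([string_store[key] for key in toy_doc])
```

The repeated words share one vocab entry each, and the original text can still be rebuilt from the ids through the look-up table.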

### Recap

- A document contains tokens with their lexical attributes

In [None]:
for token in last_span:
    print(token.text, token.pos_, token.dep_, token.lemma_)

- Each object in our vocab is a lexeme

In [None]:
lexeme = nlp.vocab[last_span[1].text]
print(lexeme.text, lexeme.orth)


- The string for a hash id can be looked up in the string store, and vice versa.

In [None]:
searched_string = nlp.vocab.strings[lexeme.orth]
searched_hash = nlp.vocab.strings[lexeme.text]

print("This is my desired string:", searched_string)
print("This is my desired hash:", searched_hash)

### Searching for specific patterns with Matcher

spaCy provides a `Matcher`, which works similarly to regular expressions in Python. The difference is that you can search not only the text but also other token attributes. This way we could, for example, differentiate between _tweet_ being a verb or a noun and search only for the noun occurrences.

Here, we have examples of searching text, lexical attributes for a specific token and lexical attributes in a more general search.

In [None]:
example = "Google Inc. is a company that has a big development in NLP. " \
          "When users google for a word or any query, their system internally " \
          "runs a pipeline in order to process what the person is querying."

In [None]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [None]:
# Matching exact text

pattern_text = [{'TEXT': 'Google'}, {'TEXT': 'Inc.'}]

# Match lexical attributes

pattern_attr = [{'LOWER': 'google'}]

# Match any token attributes with these characteristics

pattern_gen_attr = [{'LEMMA': 'query'}, {'IS_PUNCT': True}]

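# Add the patterns to the matcher.
# This is the spaCy 2.x call signature; spaCy 3.x expects a list of patterns
# instead, e.g. matcher.add('PATTERN_TEXT', [pattern_text]).
matcher.add('PATTERN_TEXT', None, pattern_text)
matcher.add('PATTERN_ATTR', None, pattern_attr)
matcher.add('PATTERN_GEN_ATTR', None, pattern_gen_attr)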

# Process some text
doc = nlp(example)

# Call the matcher on the doc
matches = matcher(doc)

In [None]:
print(matches)
print("Total of matches found:", len(matches))

But, what can we do with this output? What does it mean?

`Matcher` returns a list of `(match_id, start, end)` tuples: the hash of the matching pattern's name, followed by the start and end token indices of each matched span.

In [None]:
# Display a list of found matches
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
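
The `match_id` in each tuple is a hash, so it can be turned back into the pattern name through the string store. The cell below is a small self-contained sketch of that lookup on a tokenizer-only pipeline. The sentence, the pattern name `GOOGLE_ANY_CASE` and the variable names are made up, and the list-of-patterns form of `add` is assumed to require spaCy 2.2.2 or newer.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # tokenizer only, no trained model
name_matcher = Matcher(nlp_blank.vocab)

# List-of-patterns form of add(); assumed to need spaCy 2.2.2 or newer.
name_matcher.add("GOOGLE_ANY_CASE", [[{"LOWER": "google"}]])

# Hypothetical example sentence.
sample_doc = nlp_blank("Users google a word, and Google answers.")

found = []
for match_id, start, end in name_matcher(sample_doc):
    # match_id is a hash; the string store turns it back into the name.
    rule_name = nlp_blank.vocab.strings[match_id]
    found.append((rule_name, start, end, sample_doc[start:end].text))

for item in found:
    print(item)
```

Keeping the pattern name next to each span makes it easy to tell which rule fired once several patterns share one matcher.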

In [None]:
# This cell is reserved for you to suggest a piece of text and create patterns. 
# The main idea for those patterns is to disambiguate tokens.

Following what we have seen so far, download a Spanish and a German model and create patterns to find several tokens with more than one occurrence in the texts given in the following cells.

#### Hint!
Notice that models for languages other than English were trained on news data instead of web data.

In [None]:
# !python -m spacy download de_core_news_sm

In [None]:
# !python -m spacy download es_core_news_sm

In [None]:
es_nlp = spacy.load('es_core_news_sm')
de_nlp = spacy.load('de_core_news_sm')

In [None]:
from spacy.lang.es.examples import sentences 
raw_spanish = sentences[0:5]

In [None]:
from spacy.lang.de.examples import sentences
raw_german = sentences[0:5]

In [None]:
# This cell is reserved for you to create your patterns.