# Spacy Basics

The two main NLP libraries we are going to use are **Spacy** and **NLTK**.

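Main differences between the two libraries:
- NLTK was released in 2001 and implements a wide range of algorithms and models, often several alternatives for the same task.
- Spacy was released in 2015 and usually offers a single, optimized implementation per task instead of many alternatives; it can be much faster than NLTK (reportedly more than 100x on some tasks).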

Spacy can be tricky to install: see [Spacy Installation](https://spacy.io/usage). Keep in mind that the trained language models need to be downloaded separately. I installed everything as follows:

```bash
conda install keras nltk
conda install -c conda-forge spacy
# Download the trained language models (full package names)
python -m spacy download en_core_web_sm # spacy.load('en_core_web_sm')
python -m spacy download es_core_news_sm # spacy.load('es_core_news_sm')
python -m spacy download de_core_news_sm # spacy.load('de_core_news_sm')
```
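
Each downloaded model is installed as a regular Python package, so one way to confirm that the downloads worked is to check whether Python can locate those packages. The sketch below uses only the standard library; the helper name `installed_models` is hypothetical, and the package names are assumed to be the small English, Spanish, and German models.

```python
from importlib.util import find_spec

def installed_models(names):
    """Map each model package name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}

# Assumed package names of the small English, Spanish and German models
status = installed_models(["en_core_web_sm", "es_core_news_sm", "de_core_news_sm"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

A model reported as missing can be downloaded again with `python -m spacy download` followed by its package name.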

Both libraries are used to perform **Natural Language Processing**, which consists of parsing and structuring raw text so that a computer can handle it.

For a starting guide: [Spacy 101](https://spacy.io/usage/spacy-101).

Overview of contents:

1. Model, Doc, Pipeline
2. Tokens and Their Attributes
3. Spans (Slices of Docs) and Sentences

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided via a download link; I haven't found a repository to fork from.*

## 1. Model, Doc, Pipeline

In [75]:
import spacy

In [76]:
# We load our English _model_
nlp = spacy.load('en_core_web_sm')

In [77]:
# Create a _Doc_ object:
# the nlp model processes the text
# and stores the structured result in the Doc object
# The u prefix marks a Unicode string; in Python 3 every str is Unicode, so it is optional
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [78]:
# Print each token separately
# Tokens are the basic units of the text: words, symbols, punctuation
# Note that spacy does a lot of identification work already:
# $ is a symbol, U.S. is handled as a single word, etc.
for token in doc:
    # token.text: raw text
    # token.pos_: part of speech: proper noun, verb, ... (MORPHOLOGY)
    # token.dep_: syntactic dependency: subject, object, ... (SYNTAX)
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN dobj
startup VERB dep
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


In [79]:
# The Doc object contains the processed text
# To see how it is processed, we can show the pipeline used
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fbd3c713bb0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7fbd3c713d70>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fbd3c7b01d0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7fbd3e367eb0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fbd3e212aa0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fbd3c7b0650>)]

In [80]:
# We can get the basic names of the steps in the pipeline
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

For a more detailed explanation of pipelines and their steps, see [Spacy Pipelines](https://spacy.io/usage/spacy-101#pipelines).

![Spacy Pipeline](../pics/spacy_pipeline.png)
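
Conceptually, the pipeline is an ordered list of `(name, component)` pairs: the text is tokenized first, and then each component receives the document and returns it with more annotations. The toy below illustrates only that control flow. The components and the dictionary standing in for a `Doc` are invented for illustration; this is not how Spacy implements it.

```python
def tokenize(text):
    # Stand-in for the tokenizer: whitespace split plus an empty annotation store
    return {"tokens": text.split(), "annotations": {}}

def count_tokens(doc):
    # Hypothetical component: records how many tokens the document has
    doc["annotations"]["n_tokens"] = len(doc["tokens"])
    return doc

def flag_capitalized(doc):
    # Hypothetical component: records the tokens that start with an uppercase letter
    doc["annotations"]["capitalized"] = [t for t in doc["tokens"] if t[:1].isupper()]
    return doc

# Ordered (name, component) pairs, mirroring the structure of nlp.pipeline
toy_pipeline = [("counter", count_tokens), ("capitals", flag_capitalized)]

def run(text, pipeline):
    doc = tokenize(text)
    for name, component in pipeline:
        doc = component(doc)  # each step adds its own annotations
    return doc

result = run("Tesla is looking at buying U.S. startup", toy_pipeline)
print([name for name, _ in toy_pipeline])  # analogous to nlp.pipe_names
print(result["annotations"])
```

The order matters because a later component can rely on annotations written by an earlier one.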

## 2. Tokens and Their Attributes

Each token is a unit with an identifiable role. Tokens are often words, but they can also be whitespace, punctuation, negation particles, etc., because all of those carry identifiable meaning as well.

Spacy assigns many attributes to the detected tokens; these can be explored by typing `.` after a token object and pressing TAB. The most important ones are:

- `.pos_`: part-of-speech, i.e., morphological type: noun, verb, adjective, etc.
- `.dep_`: syntactic dependency; a list of classes can be seen in the [Stanford NLP Dependencies Manual](https://nlp.stanford.edu/software/dependencies_manual.pdf).

The function `spacy.explain()` returns a short description of a given tag or label.

**Additional attributes**:

|Attribute|Description|Example for `doc[0]`|
|:------|:------|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape: capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [81]:
doc2 = nlp(u"Tesla isn't    looking into startups anymore.")

In [82]:
# Tokens are the basic units that carry meaning
# Spacy, additionally, annotates them!
# Example:
# - "n't" is a token meaning negation of the root verb
# - "." is a punctuation symbol
# - the run of extra spaces becomes its own whitespace token
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
    SPACE dep
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [83]:
doc2

Tesla isn't    looking into startups anymore.

In [84]:
# Get first token
doc2[0]

Tesla

In [85]:
# Part-of-speech: Morphology
doc2[0].pos_

'PROPN'

In [86]:
# Syntactical function
doc2[0].dep_

'nsubj'

In [87]:
spacy.explain('PROPN')

'proper noun'

In [88]:
spacy.explain('nsubj')

'nominal subject'

In [89]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

looking
look


In [90]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

VERB
VBG / verb, gerund or present participle


In [91]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.
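
The two shapes printed above follow a simple mapping: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, and other characters are kept. The sketch below is my approximation of that mapping, not Spacy's own code. The function name `approx_shape` is hypothetical, and the truncation of long runs to four characters is an assumption about how Spacy abbreviates shapes.

```python
def approx_shape(text, max_run=4):
    """Approximate a word shape: X/x for letters, d for digits, other chars kept."""
    shape = []
    last, run = None, 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch
        # Count how long the current run of identical shape characters is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:  # assumption: runs longer than max_run are cut off
            shape.append(mapped)
    return "".join(shape)

for word in ["Tesla", "U.S.", "$6"]:
    print(word, approx_shape(word))
```

Shapes are handy as coarse features: two unseen words with the same shape often belong to the same kind of token, such as abbreviations or amounts.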


In [92]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
False


## 3. Spans (Slices of Docs) and Sentences

Since `Doc` objects can be very large, we often want to work with `Span` objects, which are slices of a `Doc`.

In [93]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [94]:
type(doc3)

spacy.tokens.doc.Doc

In [95]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [96]:
type(life_quote)

spacy.tokens.span.Span
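
Counting token positions by hand to find slice boundaries is error-prone. As a rough alternative, the boundaries of a quoted passage can be found by scanning for the quote-mark tokens. The helper `quoted_ranges` below is hypothetical; it only reads a `.text` attribute, so the `Tok` namedtuple is a stand-in that lets the sketch run without a model, and the short token list is made up for the demo.

```python
from collections import namedtuple

# Hypothetical stand-in for a token; only the .text attribute is used
Tok = namedtuple("Tok", ["text"])

def quoted_ranges(tokens):
    """Return (start, end) index pairs, end exclusive, for each "..." passage."""
    ranges = []
    open_at = None
    for i, tok in enumerate(tokens):
        if tok.text == '"':
            if open_at is None:
                open_at = i  # opening quote mark
            else:
                ranges.append((open_at, i + 1))  # include the closing quote mark
                open_at = None
    return ranges

words = 'the phrase " Life is what happens " was written'.split()
tokens = [Tok(w) for w in words]
ranges = quoted_ranges(tokens)
start, end = ranges[0]
print(ranges, " ".join(t.text for t in tokens[start:end]))
```

Since a `Doc` yields tokens with a `.text` attribute when iterated, the same function should work on `doc3`, and `doc3[start:end]` would then give the corresponding `Span`.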

Since tokens have a start-of-sentence attribute `is_sent_start`, we can navigate from sentence to sentence.

In [97]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [98]:
w = doc4[6]

In [99]:
w.is_sent_start

True

In [100]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.
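
To make the link between `is_sent_start` and sentence boundaries explicit, the sketch below groups tokens into sentences by starting a new group whenever the flag is set. The function `group_by_sent_start` and the `Tok` namedtuple are hypothetical stand-ins so the example runs without a model; with a real `Doc`, iterating over `doc.sents` as above is the simpler route.

```python
from collections import namedtuple

# Hypothetical stand-in for a token; only the two attributes used below
Tok = namedtuple("Tok", ["text", "is_sent_start"])

def group_by_sent_start(tokens):
    """Group token texts into sentences, opening a new one at each sentence start."""
    sentences = []
    for tok in tokens:
        if tok.is_sent_start or not sentences:
            sentences.append([])
        sentences[-1].append(tok.text)
    return sentences

words = "This is the first sentence . This is another sentence .".split()
# For the demo, a token starts a sentence if it is first or follows a period
tokens = [Tok(w, i == 0 or words[i - 1] == ".") for i, w in enumerate(words)]
sentences = group_by_sent_start(tokens)
for sent in sentences:
    print(" ".join(sent))
```

The truthiness check also treats an unset flag (`None`) like `False`, so tokens without boundary information simply stay in the current sentence.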
