## Intro to spaCy

*Course materials*: https://course.spacy.io/

**Chapter 1 headlines**:
- data structures in spaCy;
- statistical models;
- predicting linguistic features.

spaCy lets you set up a language processing pipeline by instantiating a language object. <br>
For example, let's create an **English NLP object** below. 

In [1]:
"""
- import the English language class;
- create an nlp object;
- the object includes language-specific rules for tokenization.
"""
import spacy
from spacy.lang.en import English

nlp = English()

We can pass text to the nlp object, and spaCy will create a **doc**: a document that holds the text in **tokenized form** (words and punctuation as tokens) and that can be accessed by index. <br>
So we can **iterate over the tokens**.

In [2]:
doc = nlp('Hello world!')
for indx, token in enumerate(doc):
    print(indx, token.text)

0 Hello
1 world
2 !


The output above shows that a doc's tokens can be enumerated like a normal Python sequence.
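
Because a doc behaves like a sequence, the usual sequence operations should also apply. A minimal sketch, assuming `spacy.blank("en")` as a shorthand for the same tokenizer-only English pipeline that `English()` creates above:

```python
import spacy

# Tokenizer-only English pipeline; no trained model is required
nlp = spacy.blank("en")
doc = nlp("Hello world!")

# len() gives the number of tokens, not the number of characters
print(len(doc))

# Negative indices count from the end, as with a list
print(doc[-1].text)
```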

### Token object

**Token objects** represent the tokens in a document: words or punctuation marks.<br>
As shown above, we can **get a token at a specific position** by indexing into the doc.

![title](doc.png)

In [3]:
"""
- for example, let's access the second token in the doc (index 1);
- use the .text attribute to get the token's text.
"""
token = doc[1]
print(token.text)

world
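
Besides `.text`, a token also records where it sits in the document. The sketch below reuses the "Hello world!" example and prints the token index `.i` and the character offset `.idx`; the blank pipeline is an assumption made to keep the example self-contained:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world!")
token = doc[1]

# .i is the token's position in the doc, .idx its character offset in the text
print(token.i, token.idx)

# The character offset maps back to the original string
print(doc.text[token.idx:token.idx + len(token)])
```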


### Span object 

A **Span object** is a slice of the document that contains one or more tokens. <br>
The closest analogy is a Python list slice: the start index is included and the end index is excluded. 

![title](span.png)

In [4]:
"""
- slice tokens 1 and 2 out of the document (the end index is exclusive)
"""

span = doc[1:3]
print(span.text)

world!
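
A span also remembers where it came from. A short sketch, again assuming a blank English pipeline, that inspects the slice boundaries and indexes into the span itself:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world!")
span = doc[1:3]

# Token indices of the slice within the parent doc (end is exclusive)
print(span.start, span.end)

# A span has a length and supports indexing relative to its own start
print(len(span), span[0].text)
```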


### Lexical Attributes

We can read lexical attributes from each token in a document:
- `.is_alpha` returns a boolean that indicates whether the token consists of alphabetic characters;
- `.is_punct` returns a boolean that indicates whether the token is punctuation;
- `.like_num` returns a boolean that indicates whether the token resembles a number (for example, "10" or "ten"). 

In [5]:
doc = nlp('It costs €10.')

In [6]:
print('Index:', [token.i for token in doc])
print('Text:', [token.text for token in doc])

print('Alphabetic?', [token.is_alpha for token in doc])
print('Punctuation?', [token.is_punct for token in doc])
print('Numbers', [token.like_num for token in doc])

Index: [0, 1, 2, 3, 4]
Text: ['It', 'costs', '€', '10', '.']
Alphabetic? [True, True, False, False, False]
Punctuation? [False, False, False, False, True]
Numbers [False, False, False, True, False]


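These attributes are useful for lexical analysis and for evaluating a doc's content, for example how often numbers are used or whether the text contains punctuation. Because Python counts `True` as 1, summing the flags gives the number of matching tokens. 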

In [7]:
print('Alphabetic tokens in doc:', sum([token.is_alpha for token in doc]))
print('Punctuation in doc:', sum([token.is_punct for token in doc]))
print('Numbers in doc:', sum([token.like_num for token in doc]))

Alphabetic tokens in doc: 2
Punctuation in doc: 1
Numbers in doc: 1
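
The same flags can also filter tokens rather than count them. A small sketch with a made-up sentence (not from the course); for English, `like_num` is expected to match number words such as "ten" as well as digits:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I bought ten apples and 3 pears.")

# Keep the text of the tokens for which each flag is True
numbers = [token.text for token in doc if token.like_num]
words = [token.text for token in doc if token.is_alpha]

print("Number-like tokens:", numbers)
print("Alphabetic tokens:", words)
```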


### Getting started

Note that spaCy currently provides language classes for more than **45 languages**.  

**English example**

In [8]:
from spacy.lang.en import English 

nlp = English()

doc = nlp('This is a sentence')
print(doc.text)
print('Tokens in English doc: ', [token.text for token in doc])

This is a sentence
Tokens in English doc:  ['This', 'is', 'a', 'sentence']


**German example**

In [9]:
from spacy.lang.de import German 

nlp = German()

doc = nlp('Liebe Grüße!')
print(doc.text)
print('Tokens in German doc: ', [token.text for token in doc])

Liebe Grüße!
Tokens in German doc:  ['Liebe', 'Grüße', '!']


**Spanish example**

In [10]:
from spacy.lang.es import Spanish 

nlp = Spanish()

doc = nlp('¿Cómo estás?')
print(doc.text)
print('Tokens in Spanish doc: ', [token.text for token in doc])

¿Cómo estás?
Tokens in Spanish doc:  ['¿', 'Cómo', 'estás', '?']
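
The three examples differ only in the language class. As a compact alternative, the sketch below assumes `spacy.blank()` with ISO language codes, which should create the same tokenizer-only pipelines as the classes imported above:

```python
import spacy

# Language code -> sample text (the same sentences as in the cells above)
samples = {
    "en": "This is a sentence",
    "de": "Liebe Grüße!",
    "es": "¿Cómo estás?",
}

results = {}
for code, text in samples.items():
    nlp = spacy.blank(code)  # e.g. "de" corresponds to the German class
    results[code] = [token.text for token in nlp(text)]
    print(code, results[code])
```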


Next, let's create more example docs, spans and tokens.

In [11]:
nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


One more span example below: slicing the document. 

In [12]:
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


**Lexical attribute example**

With spaCy we can analyze text content, for example find specific patterns such as **percentages (%)**.<br>
In this example, we look for two consecutive tokens: **a number followed by a percent sign**. <br>
We iterate over the tokens and:
- use `like_num` to check whether the token resembles a number;
- use `token.i + 1` to get the index of the next token in the document;
- check whether the next token's `text` attribute is a percent sign.

In [13]:
# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
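
The same check can be wrapped in a reusable function. The sketch below is one possible variant, not the course's solution: `find_percentages` is a hypothetical helper that pairs each token with its successor, so a number at the very end of a text cannot cause an out-of-range lookup.

```python
import spacy

nlp = spacy.blank("en")

def find_percentages(doc):
    """Return the text of every number-like token directly followed by '%'."""
    found = []
    # zip pairs each token with the one after it; the last token
    # never appears on the left, so no index runs past the end
    for token, next_token in zip(doc, doc[1:]):
        if token.like_num and next_token.text == "%":
            found.append(token.text)
    return found

doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
print(find_percentages(doc))
```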


### Statistical models

Statistical models in spaCy can **analyze words in context**, for example:
- whether a word is a verb;
- whether a span of text is a person's name. 

We can **predict attributes** in context:
- part-of-speech tags;
- syntactic dependencies;
- named entities.

**Models** used for these predictions:
- are trained on large sets of labeled text (so pretrained models are available); 
- can be fine-tuned with more data or custom labels.
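
To see why a trained model is needed, it may help to look at what a pipeline without one returns. A minimal sketch, assuming a blank English pipeline created with `spacy.blank("en")`:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("She ate the pizza")

# A blank pipeline has no trained components
print(nlp.pipe_names)

# Without a model nothing is predicted, so the tags stay empty strings
print([token.pos_ for token in doc])
```

The tokenizer still works because it is rule-based, but the context-dependent attributes remain unset.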

### Model packages

A range of pretrained model packages can be installed with spaCy's `download` command.<br>
For example, `en_core_web_sm` is a small English model that supports all core capabilities and is trained on web text.<br>
`spacy.load()` loads the model package by name and returns an nlp object.<br>
The package provides **binary weights** that enable the library to make predictions. 

Usage: `$ python -m spacy download en_core_web_sm`

In [15]:
nlp = spacy.load('en_core_web_sm')
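
If the package has not been downloaded yet, `spacy.load()` raises an `OSError`. The sketch below shows one way to handle that; `load_pipeline` is a hypothetical helper, and the fallback to a blank pipeline is an assumption that only makes sense when tokenization alone is enough:

```python
import spacy

def load_pipeline(name="en_core_web_sm"):
    """Load a model package, falling back to a blank English pipeline."""
    try:
        return spacy.load(name)
    except OSError:
        # Raised by spacy.load when the package is not installed
        print(f"Model '{name}' not found; run: python -m spacy download {name}")
        return spacy.blank("en")

nlp = load_pipeline()
print(nlp.pipe_names)
```

Note that the fallback pipeline cannot predict tags, dependencies or entities, so the cells that follow still need the real model.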

#### Predicting part-of-speech tags

Let's use the loaded English model to predict parts of speech in context.<br>
First we process the text "She ate the pizza" with the nlp object. <br>
Then, for each token in the doc, we print the `.pos_` attribute, which holds the predicted part-of-speech tag. 

In [16]:
doc = nlp('She ate the pizza')

for token in doc:
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


Here, the model correctly predicted "ate" as a verb and "pizza" as a noun.<br>
Note that in spaCy, attributes that end with an underscore (such as `pos_`) usually return strings, while the same attributes without the underscore return integer IDs.
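
The difference between the two attribute forms can be sketched without a trained model. The example assumes spaCy v3, where the `Doc` constructor accepts a `pos` argument; the tags are assigned by hand to stand in for model predictions:

```python
import spacy
from spacy.symbols import VERB
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Tags are set manually here instead of being predicted
doc = Doc(
    nlp.vocab,
    words=["She", "ate", "the", "pizza"],
    pos=["PRON", "VERB", "DET", "NOUN"],
)

token = doc[1]
print(token.pos_)          # string label
print(token.pos)           # integer ID of the same label
print(token.pos == VERB)   # the ID matches the constant in spacy.symbols
```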

#### Predicting syntactic dependencies

*More about*: https://spacy.io/usage/linguistic-features

Additionally, we can predict **how the words are related** in a document. <br>
For example, whether a word is the subject or the object of a sentence. <br>
The `.dep_` attribute returns the predicted dependency label, and `.head` returns the **syntactic head** token: the parent token the word is attached to. 

In [18]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


Here we see that "She" is a pronoun and the **nominal subject** attached to the syntactic **head** token "ate".<br>
Meanwhile, "ate" is a verb and the **root** of the sentence, so it is its own head. <br>
"pizza" is the **direct object** of the parent token "ate" (attached to the same head).

Additionally, we can display the **child tokens** of each token in the document using `.children`.

In [20]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text,
         [child for child in token.children])

She PRON nsubj ate []
ate VERB ROOT ate [She, pizza]
the DET det pizza []
pizza NOUN dobj ate [the]


We see that some words are parent tokens of others, while the determiner "the" and the pronoun "She" have no children.
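
Once heads and labels are available, the tree can be queried directly. The sketch below assumes spaCy v3, where the `Doc` constructor accepts `heads` (absolute token indices) and `deps`; the values are copied by hand from the output above rather than predicted:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=["She", "ate", "the", "pizza"],
    heads=[1, 1, 3, 1],  # the root points at itself
    deps=["nsubj", "ROOT", "det", "dobj"],
)

# Find the root, then pick its subject and object among its children
root = [token for token in doc if token.dep_ == "ROOT"][0]
subjects = [child.text for child in root.children if child.dep_ == "nsubj"]
objects = [child.text for child in root.children if child.dep_ == "dobj"]
print(root.text, subjects, objects)

# .subtree yields a token together with all of its descendants
print([token.text for token in doc[3].subtree])
```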

![title](dep_labels_schema.png)

#### Predicting named entities

Named entities are "real-world objects" that are assigned a name, such as an organisation, a person or a country.<br>
We can use `doc.ents` to access the named entities predicted by the model.<br>
It returns a sequence of Span objects, so we can print each entity's text and its label using `.label_`. 

In [26]:
"""
- process the text as usual
- iterate over the predicted entities
"""
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


In this case "Apple" is `ORG`, an organisation, and "U.K." is `GPE`, a geopolitical entity (country, city or state). <br>
"$1 billion" is `MONEY`.
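
Since every entity is a Span, it also carries its position in the document. In the sketch below the doc and its entities are built by hand to stand in for model predictions, so it runs without a trained model; the token boundaries are an assumption based on the output above:

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
words = ["Apple", "is", "looking", "at", "buying", "U.K.",
         "startup", "for", "$", "1", "billion"]
# True where a token is followed by a space in the original text
spaces = [True] * 8 + [False, True, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Entities are assigned manually here instead of being predicted
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 5, 6, label="GPE"),
    Span(doc, 8, 11, label="MONEY"),
]

for ent in doc.ents:
    # start/end are token indices, start_char/end_char are character offsets
    print(ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
```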

We can quickly look up the definition of an unfamiliar tag or label with the `spacy.explain` function. 

In [29]:
print('GPE:',spacy.explain('GPE'))
print('NNP:',spacy.explain('NNP'))
print('dobj:',spacy.explain('dobj'))

GPE: Countries, cities, states
NNP: noun, proper singular
dobj: direct object
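
`spacy.explain` returns `None` for a term it does not know, which is worth handling when labels come from a custom model. A small sketch; `describe` is a hypothetical helper, and the placeholder text is an arbitrary choice:

```python
import spacy

def describe(label):
    """Return spaCy's description of a label, or a placeholder if it has none."""
    description = spacy.explain(label)
    return description if description is not None else "(no description available)"

print(describe("GPE"))
print(describe("NOT_A_REAL_LABEL"))
```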
