## What will we use in this session?

* spaCy ([conda-forge](https://anaconda.org/conda-forge/spacy))


# spaCy
[spaCy](https://spacy.io/) is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive. We like to think of spaCy as the Ruby on Rails of Natural Language Processing.

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python's awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.
### How to start
We can install it with conda:

```bash
conda install -c conda-forge spacy
``` 
or with pip:
```bash
pip install spacy
```
Next, we need to download the [language models](https://spacy.io/usage/models#quickstart), e.g. the small and large English models:
```bash
python -m spacy download en_core_web_sm
```
and
```bash
python -m spacy download en_core_web_lg
```
Then we import spaCy and load a model:
```python
import spacy

nlp = spacy.load("en_core_web_sm")           # load model package "en_core_web_sm"
nlp = spacy.load("/path/to/en_core_web_sm")  # load package from a directory
nlp = spacy.load("en")                       # load model with shortcut link "en" (spaCy v2 only)
```

I could not load spaCy models in JupyterLab with the methods above, so I imported the model package directly, e.g.:
```python
import en_core_web_lg
nlp = en_core_web_lg.load()
```

To left-align tables and images, I also used the cell below.

In [2]:
%%html
<style>
table {float:left}
img{float: left}
</style>

In [3]:
from spacy.lang.en import English
from spacy import displacy
nlp = English()

## Documents, spans and tokens

When you process a text with the ```nlp``` object, spaCy creates a **Doc** object, short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.
The Doc also behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index.


**Token** objects represent the tokens in a document – for example, a word or a punctuation character.

To get a token at a specific position, you can index into the Doc.

Token objects also provide various attributes that let you access more information about the tokens. For example, the ```.text``` attribute returns the verbatim token text.

A **Span** object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

![span_indices.png](img/span_indices.png)

In [20]:
doc = nlp("Hello world!")

In [19]:
token = doc[1]
print(token.text)

world


In [22]:
span = doc[1:3]
print(span.text)

world!
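
As a small sketch of the "view" idea, the snippet below checks that a Span only records where it starts and ends inside its parent Doc. It assumes a tokenizer-only pipeline created with `spacy.blank("en")`, which is not part of the original notebook and needs no downloaded model.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained model).
nlp = spacy.blank("en")
doc = nlp("Hello world!")

span = doc[1:3]

# A Span keeps its token boundaries and a reference to its parent Doc.
print(len(doc))                        # number of tokens in the Doc
print(span.start, span.end)            # token offsets of the slice
print(span.doc is doc)                 # the Span points back at the same Doc
print([token.text for token in span])  # tokens covered by the Span
```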


### Token lexical attributes
Here you can see some of the available token attributes:

* ```i``` is the index of the token within the parent document.

* ```text``` returns the token text.

* ```is_alpha```, ```is_punct``` and ```like_num``` return boolean values indicating whether the token consists of alphabetic characters, whether it is punctuation, or whether it resembles a number. For example, ```like_num``` is true both for the token "10" and for the word "ten".

These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.
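
Because lexical attributes do not depend on context, they can be tried without a trained model. The sketch below assumes a tokenizer-only pipeline from `spacy.blank("en")` and a made-up sentence, and shows that both a digit token and a spelled-out number are flagged by `like_num`.

```python
import spacy

# Assumption: tokenizer-only English pipeline; the sentence is made up.
nlp = spacy.blank("en")
doc = nlp("I paid ten dollars for 10 apples.")

# Print each token's index, text and three lexical flags.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)

# Collect the tokens that resemble a number.
number_like = [token.text for token in doc if token.like_num]
print(number_like)
```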


In [40]:
doc = nlp("It costs $5.")

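In [54]:
# Alternative input: the word "five" is also number-like.
# Left commented out so that the output below matches "It costs $5.".
# doc = nlp("It costs $five.")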

In [39]:
print('Index:   ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])

print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]


In [33]:
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are.")

In [35]:
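for token in doc:
    # Look at the following token only if one exists (avoids an IndexError at the end of the Doc)
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            print("Percentage found:", token.text + "%")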

Percentage found: 60%
Percentage found: 4%


## Statistical models

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person's name.

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts. They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

In [124]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [128]:
doc = nlp("She ate the pizza")

## Part of Speech
In this example, we're using spaCy to predict part-of-speech tags, the word types in context.
For each token in the Doc, we can print the text and the ```pos_``` attribute, the predicted part-of-speech tag. In spaCy, attributes that return strings usually end with an underscore; attributes without the underscore return an integer ID.

In [60]:
for token in doc:
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
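
The underscore convention can be checked without a trained model. Since `pos_` needs a tagger, the sketch below uses the lexical attribute pair `lower` / `lower_` as a stand-in and assumes a tokenizer-only pipeline from `spacy.blank("en")`.

```python
import spacy

# Assumption: tokenizer-only pipeline, so we use lower/lower_ instead of pos/pos_.
nlp = spacy.blank("en")
doc = nlp("She ate the pizza")
token = doc[0]

print(token.lower_)  # string form of the attribute
print(token.lower)   # integer ID (hash) of the same string

# The shared string store maps the ID back to the string.
print(nlp.vocab.strings[token.lower] == token.lower_)
```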


### Syntactic Dependencies
In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The ```dep_``` attribute returns the predicted dependency label.

The ```head``` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

| Label     | Description          | Example |
|:----------|:---------------------|:--------|
| nsubj     | nominal subject      | She     |
| dobj      | direct object        | pizza   |
| det       | determiner (article) | the     |

In [130]:
displacy.render(doc, style="dep")
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


## Named Entities
Named entities are "real world objects" that are assigned a name, for example a person, an organization or a country. The ```doc.ents``` property lets you access the named entities predicted by the model.

It returns a tuple of Span objects, so we can print the entity text and the entity label using the ```label_``` attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.


In [135]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
  
displacy.render(doc, style="ent")

for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


### Entities
Entity patterns are dictionaries with two keys: ```"label"```, specifying the label to assign to the entity if the pattern is matched, and ```"pattern"```, the match pattern. There are two types of patterns:
1. **Phrase patterns** for exact string matches (string).
```json
{
    "label": "ORG", 
    "pattern": "Apple"
}
```
2. **Token patterns** with one dictionary describing one token (list).
```json
{
    "label": "GPE", 
    "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]
}
```
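
The section does not say which component consumes these dictionaries. The sketch below assumes they are meant for spaCy's `EntityRuler` and applies the ruler directly to a Doc from a tokenizer-only pipeline (`spacy.blank("en")`). The example sentence is made up for illustration.

```python
import spacy
from spacy.pipeline import EntityRuler

# Assumption: tokenizer-only pipeline, so every entity comes from the rules.
nlp = spacy.blank("en")

patterns = [
    {"label": "ORG", "pattern": "Apple"},                                    # phrase pattern
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}  # token pattern
]

ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)

# Calling the ruler on a Doc sets doc.ents from the matched patterns.
doc = ruler(nlp("Apple is opening its first big office in San Francisco."))
print([(ent.text, ent.label_) for ent in doc.ents])
```

Applying the ruler by hand keeps the sketch independent of how a given spaCy version registers pipeline components.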

### Adding Entities
In this example, we create the Doc and Span objects manually and update the named entities, just like spaCy does behind the scenes. It reuses the ```nlp``` object created earlier.

In [139]:
from spacy.tokens import Doc, Span

words = ["My", "manager", "is", "Philip", "Doueihi"]
spaces = [True, True, True, True, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

span = Span(doc, 3, 5, label="PERSON")
print(span.text, span.label_)

doc.ents = [span]

print([(ent.text, ent.label_) for ent in doc.ents])

My manager is Philip Doueihi
Philip Doueihi PERSON
[('Philip Doueihi', 'PERSON')]


### Naming explanation
To get definitions for the most common tags and labels, you can use the ```spacy.explain()``` helper function.

For example, ```"GPE"``` for geopolitical entity isn't exactly intuitive, but ```spacy.explain()``` can tell you that it refers to countries, cities and states. The same works for part-of-speech tags and dependency labels.

In [69]:
import spacy

for i in ['GPE', 'NNP', 'dobj']:
    print("{0}: {1}".format(i, spacy.explain(i)))

GPE: Countries, cities, states
NNP: noun, proper singular
dobj: direct object


### Predicting named entities
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.
* Process the text with the nlp object.
* Iterate over the entities and print the entity text and label.

Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.


In [125]:
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

Apple ORG


In [74]:
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

iphone_x = doc[1:3]
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
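
The cell above only slices the missing tokens. As a possible next step, the sketch below turns that slice into a labelled Span and appends it to `doc.ents`. The label `"PRODUCT"` is an assumption (the original does not name one), and the tokenizer-only pipeline from `spacy.blank("en")` predicts no entities of its own, so "Apple" does not appear here.

```python
import spacy
from spacy.tokens import Span

# Assumption: tokenizer-only pipeline, so doc.ents starts out empty.
nlp = spacy.blank("en")
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Assumption: "PRODUCT" is our own choice of label for the missing entity.
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Keep whatever entities already exist and add the new span.
doc.ents = list(doc.ents) + [iphone_x]
print([(ent.text, ent.label_) for ent in doc.ents])
```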


## Rule-based matching
Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

In this example, we're looking for two tokens with the text "iPhone" and "X".

```[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]```


We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

```[{'LOWER': 'iphone'}, {'LOWER': 'x'}]```

We can even write patterns using attributes predicted by the model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".

```[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]```

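The ```matcher.add()``` method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is an optional callback. We don't need one here, so we set it to ```None```. The third argument is the pattern. This is the spaCy v2 call signature used throughout this notebook; spaCy v3 expects a list of patterns as the second argument instead, e.g. ```matcher.add('IPHONE_PATTERN', [pattern])```.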

When you call the matcher on a doc, it returns a list of tuples. Each tuple consists of three values:
* ```match_id```: hash value of the pattern name
* ```start```: start index of the matched span
* ```end```: end index of the matched span

This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.

In [118]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

#### Example 1

In [79]:
doc = nlp("New iPhone X release date leaked")

In [80]:
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
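
Since `match_id` is only a hash, it can be useful to turn it back into the pattern name. The sketch below looks it up in the shared string store. It assumes a tokenizer-only pipeline from `spacy.blank("en")` and uses the list form `matcher.add(key, [pattern])`, which to my knowledge is accepted by spaCy v2.2.2 and later as well as v3.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; TEXT patterns need no trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])  # list form of the patterns argument

doc = nlp("New iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    # Resolve the hash back to the name the pattern was registered under.
    pattern_name = nlp.vocab.strings[match_id]
    results.append((pattern_name, start, end, doc[start:end].text))

print(results)
```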


#### Example 2
Here's an example of a more complex pattern using lexical attributes.

We're looking for five tokens:
* A token consisting of only digits.
* 3 case-insensitive tokens for "fifa", "world" and "cup".
* A token that consists of punctuation.

In [83]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

In [95]:
doc = nlp("2018 FIFA World Cup: France won!")

In [85]:
matcher.add('FIFA', None, pattern)

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


#### Example 3

In this example, we're looking for two tokens:
* a verb with the lemma "love"
* followed by a noun.

In [97]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]

In [98]:
doc = nlp("I loved dogs but now I love cats more.")

In [99]:
matcher.add('LOVE', None, pattern)

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


#### Example 4

Operators and quantifiers let you define how often a token should be matched. They can be added using the ```'OP'``` key. Here, the ```'?'``` operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

| Operator          | Description                  |
|:------------------|:-----------------------------|
| ```{'OP': '!'}``` | Negation: match 0 times      |
| ```{'OP': '?'}``` | Optional: match 0 or 1 times |
| ```{'OP': '+'}``` | Match 1 or more times        |
| ```{'OP': '*'}``` | Match 0 or more times        |


In [114]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]

In [115]:
doc = nlp("I bought a smartphone. Now I'm buying apps.")

matcher.add('BUY', None, pattern)

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


#### Example 5
This pattern matches an adjective followed by one or two nouns.

In [119]:
pattern = [{"POS": 'ADJ'}, 
           {"POS": 'NOUN'}, 
           {"POS": 'NOUN', "OP": '?'}]

In [120]:
doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


## PhraseMatcher
The phrase matcher is another helpful tool to find sequences of words in your data. It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context.

It takes Doc objects as patterns. This makes it very useful for matching large dictionaries and word lists on large volumes of text.

To create the patterns, each phrase has to be processed with the ```nlp``` object. If you have a model loaded, doing this in a loop or list comprehension can easily become inefficient and slow. If you **only need the tokenization and lexical attributes**, you can run ```nlp.make_doc``` instead, which will only run the tokenizer. For an additional speed boost, you can also use the ```nlp.tokenizer.pipe``` method, which will process the texts as a stream.

BAD:  ```patterns = [nlp(term) for term in LOTS_OF_TERMS]```

BETTER: ```patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]```

**THE BEST**: ```patterns = list(nlp.tokenizer.pipe(LOTS_OF_TERMS))```
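
As a quick sanity check on the two faster options, the sketch below builds the same patterns with `nlp.make_doc` and with `nlp.tokenizer.pipe` and confirms that they tokenize identically. It assumes a tokenizer-only pipeline from `spacy.blank("en")` and a short, made-up term list standing in for `LOTS_OF_TERMS`.

```python
import spacy

# Assumption: tokenizer-only pipeline and a tiny stand-in for LOTS_OF_TERMS.
nlp = spacy.blank("en")
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]

# One tokenizer call per term.
patterns_make_doc = [nlp.make_doc(term) for term in terms]

# The tokenizer processes the terms as a stream.
patterns_pipe = list(nlp.tokenizer.pipe(terms))

tokens_make_doc = [[token.text for token in doc] for doc in patterns_make_doc]
tokens_pipe = [[token.text for token in doc] for doc in patterns_pipe]
print(tokens_make_doc == tokens_pipe)
```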

In [16]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add('DOG', None, pattern)

doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever


In [17]:
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")

matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Angela Merkel
Barack Obama
Washington, D.C.


We can also use ```nlp.pipe``` to create patterns for the PhraseMatcher:
```python
people = ["David Bowie", "Angela Merkel", "Lady Gaga"]

patterns = list(nlp.pipe(people))
```

### Matching over attributes
By default, the ```PhraseMatcher``` matches on the verbatim token text, i.e. ```Token.text```. By setting the ```attr``` argument on initialization, you can change **which token attribute the matcher should use** when comparing the phrase pattern to the matched ```Doc```. For example, the attribute ```LOWER``` lets you match on ```Token.lower``` and create case-insensitive match patterns. We can also pass a list of patterns, ```patterns```, to the matcher by unpacking it. The callback is passed as a function object (or ```None```), not called:
```python
matcher.add(pattern_name, callback_function, *patterns)
```

The examples here use ```nlp.make_doc``` to create ```Doc``` object patterns as efficiently as possible and without running any of the other pipeline components. If the token attribute you want to match on is set by a pipeline component, make sure that the pipeline component runs when you create the pattern. For example, to match on ```POS``` or ```LEMMA```, the pattern ```Doc``` objects need to have part-of-speech tags set by the ```tagger```. You can either call the ```nlp``` object on your pattern texts instead of ```nlp.make_doc```, or use ```nlp.disable_pipes``` to disable components selectively.

Another possible use case is matching number tokens like IP addresses based on their **shape**. This means that you won't have to worry about how those strings will be tokenized, and you'll be able to find tokens and combinations of tokens based on a few examples. Here, we're matching on the shapes ```ddd.d.d.d``` and ```ddd.ddd.d.d```.

In [18]:
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(name) for name in ["Angela Merkel", "Barack Obama"]]
matcher.add("Names", None, *patterns)

doc = nlp("Angela merkel and us president Barack Obama")
for match_id, start, end in matcher(doc):
    print("Matched based on lowercase token text:", doc[start:end])

Matched based on lowercase token text: Angela merkel
Matched based on lowercase token text: Barack Obama


In [19]:
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("IP", None, nlp("127.0.0.1"), nlp("127.127.0.0"))

doc = nlp("Often the router will have an IP address such as 192.168.1.1 or 192.168.2.1.")
for match_id, start, end in matcher(doc):
    print("Matched based on token shape:", doc[start:end])

Matched based on token shape: 192.168.1.1
Matched based on token shape: 192.168.2.1
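
To see why two example addresses are enough to match other addresses, it helps to print the shapes themselves. The sketch below assumes a tokenizer-only pipeline from `spacy.blank("en")` and relies on each address being kept as a single token, as in the example above.

```python
import spacy

# Assumption: tokenizer-only pipeline; SHAPE is a lexical attribute.
nlp = spacy.blank("en")

shapes = {}
for text in ["127.0.0.1", "127.127.0.0", "192.168.1.1"]:
    token = nlp(text)[0]
    shapes[text] = token.shape_  # digits are abstracted to "d"
    print(text, "->", token.shape_)
```

Addresses with the same digit grouping share a shape, which is what the matcher compares when `attr="SHAPE"` is set.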


## String store

* ```Vocab```: stores data shared across multiple documents
* To save memory, spaCy encodes all strings to hash values
* Strings are only stored once in the StringStore via nlp.vocab.strings
* String store: lookup table in both directions
* Hashes can't be reversed – that's why we need to provide the shared vocab
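
The last point can be illustrated with two independent string stores. This is a minimal sketch that uses `spacy.strings.StringStore` directly instead of `nlp.vocab.strings`: a store that has seen "coffee" can resolve the hash, while a fresh store cannot.

```python
from spacy.strings import StringStore

store = StringStore()
coffee_hash = store.add("coffee")    # string -> hash
print(store[coffee_hash])            # hash -> string, works because the string was added

other_store = StringStore()          # a separate store that has never seen "coffee"
try:
    other_store[coffee_hash]
    found = True
except KeyError:
    # The hash alone carries no information about the original string.
    found = False
print(found)
```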



In [133]:
coffee_hash = nlp.vocab.strings['coffee']
coffee_string = nlp.vocab.strings[coffee_hash]
print("Hash: {0} String: {1}".format(coffee_hash, coffee_string))

Hash: 3197928453018144401 String: coffee


## Similarity
One thing that's very important: In order to use similarity, you need a larger spaCy model that has word vectors included.

For example, the medium or large English model – but not the small one. So if you want to use vectors, always go with a model that ends in "md" or "lg". You can find more details on this in the models documentation.

### Importing large models

In [140]:
import en_core_web_lg
nlp = en_core_web_lg.load()

### Similarity between documents
We can then create two doc objects and use the first doc's similarity method to compare it to the second.

Here, a fairly high similarity score of 0.86 is predicted for "I like fast food" and "I like pizza".

In [142]:
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8627204117787385


### Similarity of tokens
According to the word vectors, the tokens "pizza" and "pasta" are fairly similar and receive a score of about 0.74.

In [144]:
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.7369546


### Similarity between document and token
You can also use the similarity methods to compare different types of objects.

For example, a document and a token.

Here, the similarity score is pretty low and the two objects are considered fairly dissimilar.

In [145]:
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

0.32531983166759537


### Similarity between span and document
Here's another example comparing a span – "pizza and pasta" – to a document about McDonalds.

The score returned here is about 0.62, so the two are considered moderately similar.

In [146]:
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.6199092090831612


### Similarity between spans

In [156]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

span1 = doc[3:5]
span2 = doc[-3:-1]

similarity = span1.similarity(span2)
print("The similarity between {1} and {2} is {0}".format(similarity, span1, span2))

The similarity between great restaurant and nice bar is 0.730758547782898


### Word vector
Similarity is determined using word vectors, multi-dimensional representations of meanings of words. You might have heard of Word2Vec, which is an algorithm that's often used to train word vectors from raw text. Vectors can be added to spaCy's statistical models. By default, the similarity returned by spaCy is the cosine similarity between two vectors – but this can be adjusted if necessary. 
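
For intuition, cosine similarity can be written in a few lines of plain Python. This is only a sketch of the formula, not spaCy's implementation. The three-dimensional vectors are made up for illustration, and returning 0.0 for a zero-length vector is a choice made here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # our own convention for vectors without direction
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "word vectors", not taken from any model.
pizza = [0.9, 0.1, 0.3]
pasta = [0.8, 0.2, 0.4]
soap = [-0.2, 0.9, -0.5]

print(cosine_similarity(pizza, pasta))  # close to 1: similar direction
print(cosine_similarity(pizza, soap))   # negative: different direction
```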

Vectors for objects consisting of several tokens, like the Doc and Span, default to the average of their token vectors.

That's also why you usually get more value out of shorter phrases with fewer irrelevant words.
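
The averaging step can be sketched in plain Python as well. The token vectors below are hypothetical and only illustrate how every extra word pulls the document vector away from the vector of the word you care about.

```python
def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


# Hypothetical token vectors for "I like pizza", not taken from any model.
token_vectors = {
    "I": [0.0, 0.2, 0.0],
    "like": [0.2, 0.0, 0.2],
    "pizza": [1.0, 0.4, 0.4],
}

doc_vector = average_vector(list(token_vectors.values()))
print(doc_vector)              # averaged over all three tokens
print(token_vectors["pizza"])  # the content word on its own
```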

In [150]:
doc = nlp("I have a banana")
print("Below is the {0}-dimensional vector of the word banana \n".format(len(doc[3].vector)))
print(doc[3].vector)

Below is the 300-dimensional vector of the word banana 

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2