# NLP session 29th Oct 2019


## What is NLP?

**Neuro-linguistic programming** -  is a pseudoscientific approach to communication, personal development, and psychotherapy created by Richard Bandler and John Grinder in California, United States in the 1970s. NLP's creators claim there is a connection between neurological processes (neuro-), language (linguistic) and behavioral patterns learned through experience (programming), and that these can be changed to achieve specific goals in life.[[1]](http://web.archive.org/web/20190103020411/http://www.som.surrey.ac.uk/NLP/Resources/IntroducingNLP.pdf)
[[2]](https://en.wikipedia.org/wiki/Neuro-linguistic_programming)

**Natural language processing** -  is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve *speech recognition*, *natural language understanding*, and *natural language generation*. 
[[3]](https://en.wikipedia.org/wiki/Natural_language_processing)

In [None]:
%%html
<style>
table {float: left}
</style>

## Setting Up env
Let us import **spaCy** and check which version we have; this should be ```2.2.1```. I am also using **displacy** for visualizations.

To check which versions of the language models are installed, also run the following in a terminal:
```bash
python -m spacy validate
```

We will be using some advanced tools, so we need the large model ```en_core_web_lg```. Lastly, we also import **Matcher**.

In [176]:
import time
import random
import nltk

import spacy
print("spaCy version: ",spacy.__version__)
from spacy import displacy

import en_core_web_lg
import en_core_web_sm

from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Token, Span
from spacy.lang.en import English

import json
import numpy as np

spaCy version:  2.2.1


## Strings

Strings in Python are **immutable sequences** of characters. We can use both positive and negative indices to slice them
```python
b = "Hello, World!"
print(b[-5:-2])
```

We can concatenate them
```python
a = "Hello"
b = "World"
c = a + b
print(c)
```
... format
```python
quantity = 3
itemno = 567
price = 49.95
myorder = "I want {} pieces of item {} for {} dollars."
print(myorder.format(quantity, itemno, price))
```

...split
```python
a = "Hello, World!"
print(a.split(",")) 
```

... apply methods

In [177]:
string = "Konrad"
print("Length: ",len(string))
print("Upper case: ",string.upper())
print("Lower case: ",string.lower())
print("Title case: ",string.lower().title())
print(string.startswith('ko'))
print(string.lower().startswith('ko'))
print(string.endswith('D'))
print(string.upper().endswith('D'))

Length:  6
Upper case:  KONRAD
Lower case:  konrad
Title case:  Konrad
False
True
False
True


In [178]:
filename = "cisco_article.txt"

with open(filename, 'r') as f:
    text_whole=f.read().replace('\n', '').replace('    ','')

## Regex

Webpage: [Regex 101](https://regex101.com/)

**"If someone told you that they know regex, they lied to you."**

Pattern for words (including ```It's```, ```AI/ML``` or ```SD-WAN```)
```
(\w*[-'—/]*\w+)
```
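
As a quick check, the pattern can be tried with Python's built-in `re` module. This is only a sketch: the sample sentence is made up for illustration and is not taken from the article.

```python
import re

# The word pattern from above, written as a raw string so the backslashes
# reach the regex engine unchanged.
WORD_PATTERN = re.compile(r"(\w*[-'—/]*\w+)")

# Hypothetical sample sentence, chosen to exercise the special cases.
sample = "It's an AI/ML SD-WAN rollout."

tokens = WORD_PATTERN.findall(sample)
print(tokens)
```

Note that the pattern joins at most two word parts, so a longer compound such as `state-of-the-art` would be split into several tokens.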

## Tokenization
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

In [179]:
from nltk import RegexpTokenizer
# Raw string so that \w is passed to the regex engine as written
tokenizer = RegexpTokenizer(r"(\w*[-'—/]*\w+)")

In [180]:
nltk_tokenized_words = tokenizer.tokenize(text_whole)
print(len(nltk_tokenized_words))
print(nltk_tokenized_words[666])

1467
baseline


### Sentence tokenize


In [181]:
corpus = nltk.sent_tokenize(text_whole.lower())
print(len(corpus))
print(corpus[12])

35
in the cisco dna center and the cloud.cisco ai network analytics in the cloudfor years now, cisco has been integrating ai/ml into many operational and security components, with cisco dna center the focal point for insights and actions.


## Frequency

We now understand what a token is and how to tokenize text. The next step is to check how frequently tokens appear in our document. To count them we will use ```Counter``` from ```collections```, a built-in Python module. To see the X most common objects we run ```Counter().most_common(X)```.

We will be monitoring this frequency in the next steps.

In [182]:
from collections import Counter

In [183]:
Counter(nltk_tokenized_words).most_common(20)

[('and', 74),
 ('the', 46),
 ('of', 44),
 ('to', 44),
 ('a', 28),
 ('for', 25),
 ('Cisco', 25),
 ('in', 22),
 ('AI', 20),
 ('network', 19),
 ('with', 15),
 ('that', 14),
 ('Analytics', 14),
 ('data', 13),
 ('Network', 13),
 ('is', 12),
 ('performance', 11),
 ('can', 11),
 ('DNA', 11),
 ('issues', 10)]

## Stop words
**Stop words** are words which are filtered out before processing of natural language data (text).

Stop words are generally the most common words in a language; there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools avoid removing stop words to support phrase search. 

In [184]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

In [185]:
nltk_filtered_sentence = [w for w in nltk_tokenized_words if w not in stop_words]
print(len(nltk_filtered_sentence))

985


In [186]:
Counter(nltk_filtered_sentence).most_common(20)

[('Cisco', 25),
 ('AI', 20),
 ('network', 19),
 ('Analytics', 14),
 ('data', 13),
 ('Network', 13),
 ('performance', 11),
 ('DNA', 11),
 ('issues', 10),
 ('NetOps', 8),
 ('Worldwide', 8),
 ('Data', 8),
 ('Platform', 8),
 ('networks', 6),
 ('branch', 6),
 ('devices', 6),
 ('IT', 6),
 ('patterns', 6),
 ('Center', 6),
 ('offices', 5)]

## Stemming

**Stemming** is the process of reducing inflected words to a common stem, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.

The stem (root) is the part of the word to which inflectional or derivational affixes such as ```-ed```, ```-ize```, ```-s```, ```de-``` or ```mis-``` are added. Stems are produced by removing these suffixes or prefixes, so stemming a word or sentence may result in words that are not actual words.

**Information**: Removing suffixes from a word is called suffix stripping.
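
To make suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter, Lancaster or Snowball algorithm; the suffix list and the `strip_suffix` helper are assumptions chosen only for illustration.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

def strip_suffix(word, min_stem_length=3):
    """Remove the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for word in ["cats", "troubling", "troubled", "trouble"]:
    print(word, "->", strip_suffix(word))
```

With these rules `trouble` is left untouched because nothing handles a trailing `-e`, which is one reason real stemmers use many more rules than this.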

In [187]:
from nltk import PorterStemmer
porter = PorterStemmer()
# provide a word to be stemmed
print(porter.stem("cats"))
print(porter.stem("trouble"))
print(porter.stem("troubling"))
print(porter.stem("troubled"))
print(porter.stem("destabilized"))

cat
troubl
troubl
troubl
destabil


In [188]:
from nltk import LancasterStemmer
lancaster=LancasterStemmer()
print(lancaster.stem("cats"))
print(lancaster.stem("trouble"))
print(lancaster.stem("troubling"))
print(lancaster.stem("troubled"))
print(lancaster.stem("destabilized"))
print(lancaster.stem("the"))

cat
troubl
troubl
troubl
dest
the


In [189]:
from nltk import SnowballStemmer
snowball = SnowballStemmer('english', ignore_stopwords=True)
# provide a word to be stemmed
print(snowball.stem("cats"))
print(snowball.stem("trouble"))
print(snowball.stem("troubling"))
print(snowball.stem("troubled"))
print(snowball.stem("destabilized"))

cat
troubl
troubl
troubl
destabil


In [190]:
nltk_snowball_stemmer = [snowball.stem(w) for w in nltk_filtered_sentence]
print(len(nltk_snowball_stemmer))

985


In [191]:
Counter(nltk_snowball_stemmer).most_common(20)

[('network', 46),
 ('cisco', 25),
 ('data', 21),
 ('ai', 20),
 ('analyt', 15),
 ('perform', 12),
 ('center', 11),
 ('dna', 11),
 ('issu', 11),
 ('action', 9),
 ('platform', 9),
 ('netop', 8),
 ('insight', 8),
 ('worldwid', 8),
 ('branch', 7),
 ('offic', 7),
 ('devic', 7),
 ('connect', 7),
 ('cloud', 7),
 ('time', 7)]

## Lemmatization
Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

Consider the below two cells. **Did the lemmatizer work as expected?**

In the next step let us check the most frequent words. **Do you see some issue here?**

In [192]:
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

In [193]:
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("trouble"))
print(lemmatizer.lemmatize("troubling"))
print(lemmatizer.lemmatize("troubled"))
print(lemmatizer.lemmatize("destabilized"))

cat
trouble
troubling
troubled
destabilized


In [194]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [195]:
nltk_lemmatized_words = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk_filtered_sentence]
print(len(nltk_lemmatized_words))
Counter(nltk_lemmatized_words).most_common(20)

985


[('network', 25),
 ('Cisco', 25),
 ('AI', 20),
 ('Analytics', 14),
 ('data', 13),
 ('Network', 13),
 ('performance', 11),
 ('DNA', 11),
 ('issue', 11),
 ('NetOps', 8),
 ('Worldwide', 8),
 ('Data', 8),
 ('Platform', 8),
 ('branch', 7),
 ('office', 7),
 ('device', 7),
 ('pattern', 7),
 ('action', 7),
 ('application', 6),
 ('time', 6)]

In [198]:
nltk_lemmatized_words = [lemmatizer.lemmatize(w.lower(), get_wordnet_pos(w)) for w in nltk_filtered_sentence]
print(len(nltk_lemmatized_words))
Counter(nltk_lemmatized_words).most_common(20)

985


[('network', 44),
 ('cisco', 25),
 ('data', 21),
 ('ai', 20),
 ('analytics', 15),
 ('center', 11),
 ('performance', 11),
 ('dna', 11),
 ('issue', 11),
 ('platform', 9),
 ('netops', 8),
 ('insight', 8),
 ('action', 8),
 ('worldwide', 8),
 ('branch', 7),
 ('office', 7),
 ('device', 7),
 ('cloud', 7),
 ('pattern', 7),
 ('baseline', 7)]

## Lexical Diversity 
Let’s define a short function to compute an introductory metric for our text. Lexical diversity is the ratio of unique words to the total number of words in the text.

In [199]:
def lexical_diversity(text):
    return (len(set(text)) / len(text), len(set(text)), len(text))

In [200]:
print("Lexical diversity of NLTK: ",lexical_diversity(nltk_tokenized_words))
print("Lexical diversity of NLTK without stopwords: ",lexical_diversity(nltk_filtered_sentence))
print("Lexical diversity of NLTK Snowball: ",lexical_diversity(nltk_snowball_stemmer))
print("Lexical diversity of NLTK Lemmatized: ",lexical_diversity(nltk_lemmatized_words))

Lexical diversity of NLTK:  (0.4178595773687798, 613, 1467)
Lexical diversity of NLTK without stopwords:  (0.5604060913705584, 552, 985)
Lexical diversity of NLTK Snowball:  (0.44568527918781725, 439, 985)
Lexical diversity of NLTK Lemmatized:  (0.47411167512690355, 467, 985)


## Bag of Words (BOW)

The bag-of-words model is simple to understand and implement. It is a way of extracting features from the text for use in machine learning algorithms.

If you want to make an NLP application that classifies documents in different categories, then you can use BOW. BOW is also used to generate frequency count and vocabulary from a dataset. These derived attributes are then used in NLP applications such as sentiment analysis, Word2vec, and so on.

In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word pairs is called a bigram model. For example, the bigrams in the first document, ```It was the best of times```, are as follows:
* "it was"
* "was the"
* "the best"
* "best of"
* "of times"

We treat each sentence as a separate document and build a vocabulary of all words from all the documents, excluding punctuation.

The next step is to create vectors. Vectors convert the text into numbers that can be used by a machine learning algorithm.
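
The two steps above (vocabulary, then vectors) can be sketched with the standard library alone. The four short documents are an assumption made for illustration (the first is the sentence used in the bigram example above), and `to_count_vector` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical example documents, one sentence per document.
documents = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

# Tokenize on whitespace (these sentences contain no punctuation).
tokenized = [document.split() for document in documents]

# Vocabulary: every distinct word, sorted so vector positions are stable.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def to_count_vector(tokens):
    """Return one count per vocabulary word for the given document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(tokens) for tokens in tokenized]
print(vocabulary)
for vector in vectors:
    print(vector)
```

The notebook cells further down use a variant of this idea: the vocabulary is limited to the 20 most frequent lemmas, and each position records only presence (1) or absence (0) rather than a count.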

### n-grams
*"In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles."*

[source](https://en.wikipedia.org/wiki/N-gram)

In n-grams, word order is important, whereas BOW does not preserve word order. In an NLP application, n-grams are used to consider words in their real order, so we can get an idea of the context of a particular word; BOW is used to build a vocabulary for your text dataset.
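
A minimal sketch of extracting word n-grams with plain Python follows. The `ngrams` helper is written here for illustration (NLTK ships its own utility, which is not used in this sketch).

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was the best of times".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Because each tuple keeps its tokens in their original order, `("was", "the")` and `("the", "was")` would be counted as different bigrams, which is exactly the information a bag of words discards.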





In [201]:
most_freq_words = Counter(nltk_lemmatized_words).most_common(20)
most_freq_words = [freq[0] for freq in most_freq_words]
most_freq_words

['network',
 'cisco',
 'data',
 'ai',
 'analytics',
 'center',
 'performance',
 'dna',
 'issue',
 'platform',
 'netops',
 'insight',
 'action',
 'worldwide',
 'branch',
 'office',
 'device',
 'cloud',
 'pattern',
 'baseline']

In [202]:
sentence_vectors = []
for sentence in corpus:
    sentence_tokens = tokenizer.tokenize(sentence)
    lemmatized_sentence_tokens = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in sentence_tokens]
    sent_vec = []
    for token in most_freq_words:
        if token in lemmatized_sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)

In [203]:
sentence_vectors = np.asarray(sentence_vectors)
sentence_vectors

array([[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0],
       [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 1, 0,

## Word2vec
The Word2vec model produces a vocabulary, with each word represented by an n-dimensional NumPy array (300 values in this example).

In [204]:
from gensim.models import Word2Vec

In [205]:
corpus_tokenised = []
for sentence in corpus:
    sentence_tokens = tokenizer.tokenize(sentence)
    filtered_sentence = [w.lower() for w in sentence_tokens if not w in stop_words] 
    lemmatized_sentence_tokens = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in filtered_sentence]
    corpus_tokenised.append(lemmatized_sentence_tokens)

In [222]:
# gensim 3.x API; in gensim 4+ the `size` argument is named `vector_size`
model = Word2Vec(corpus_tokenised, size=300, window=3, min_count=1)

In [209]:
model.wv['ai']

array([ 4.6817740e-04, -2.8518352e-05, -3.3628449e-04, -4.9796910e-04,
        7.3395413e-04,  1.4484520e-03, -6.6038156e-05,  2.1801452e-04,
        5.9292431e-04, -5.1206538e-05, -4.2694263e-04, -1.7206463e-03,
       -1.3506947e-03, -9.1160729e-04,  1.4229257e-03, -6.0548878e-04,
       -8.1224879e-04,  9.8901649e-04,  5.9237989e-04, -3.1332063e-04,
        7.5117685e-05,  3.5242827e-04, -1.6659006e-04, -1.2199940e-04,
       -7.9429876e-05, -2.3054839e-04, -3.0415252e-04, -9.7894762e-04,
        1.5105685e-03,  9.4492879e-04, -1.3115926e-03, -2.9610924e-04,
        1.5793691e-03,  1.1521649e-03, -1.4870856e-03, -1.0963898e-03,
        1.3932979e-03, -2.1771775e-04,  9.9827885e-04, -4.8045994e-04,
        1.6082077e-03,  1.1020086e-03,  2.9957460e-05, -3.3649329e-05,
       -3.9492518e-04, -1.7665951e-04,  2.2322120e-04,  8.9955184e-04,
       -1.3561846e-03,  8.0665498e-04,  8.7253470e-04, -1.1191472e-03,
       -1.1536921e-03,  1.3400570e-03,  1.0289421e-03, -2.2954899e-05,
      

### Similarity 
We can now use Word2vec to compute the similarity between two words in the vocabulary by calling ```model.wv.similarity()``` and passing in the relevant words.
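
The similarity reported for word vectors is the cosine similarity of the two vectors. The sketch below computes it by hand on made-up three-dimensional vectors; the numbers are assumptions for illustration and have nothing to do with the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(same_direction, orthogonal, opposite)
```

Values range from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction). A value close to 0 therefore means the two word vectors carry almost no shared signal, which is plausible for a model trained on a single short article.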

In [223]:
model.wv.similarity('ai','cisco')

-0.0069827754

In [224]:
model.wv.most_similar('cisco')

[('gather', 0.18863195180892944),
 ('application', 0.182594895362854),
 ('abnormal', 0.14176899194717407),
 ('blizzard', 0.14128705859184265),
 ('operating', 0.14054402709007263),
 ('problem', 0.1390073001384735),
 ('switch', 0.13393692672252655),
 ('change', 0.1276453733444214),
 ('human', 0.12112872302532196),
 ('compare', 0.12103308737277985)]

# spaCy

Before going here please familiarise yourself with Preworkout :)

### Similarity

In [225]:
nlp = en_core_web_lg.load()

In [240]:
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")

Do something to the doc here!
Do something to the doc here!


In [234]:
print("Doc similarity: ", doc1.similarity(doc2))
print("Words similarity: ",doc1[2].similarity(doc2[2]))

displacy.render(doc1, style="dep")
print(doc1[2])
print(doc2[2])

Doc similarity:  0.957709143352323
Words similarity:  0.83117634


cats
dogs


In [242]:
displacy.render(doc2, style="dep")
displacy.render(doc1, style="ent")


In [228]:
print(doc1[2].vector)

[-0.26763    0.029846  -0.3437    -0.54409   -0.49919    0.15928
 -0.35278   -0.2036     0.23482    1.5671    -0.36458   -0.028713
 -0.27053    0.2504    -0.18126    0.13453    0.25795    0.93213
 -0.12841   -0.18505   -0.57597    0.18538   -0.19147   -0.38465
  0.21656   -0.4387    -0.27846   -0.41339    0.37859   -0.2199
 -0.25907   -0.019796  -0.31885    0.12921    0.22168    0.32671
  0.46943   -0.81922   -0.20031    0.013561  -0.14663    0.14438
  0.0098044 -0.15439    0.21146   -0.28409   -0.4036     0.45355
  0.12173   -0.11516   -0.12235   -0.096467  -0.26991    0.028776
 -0.11307    0.37219   -0.054718  -0.20297   -0.23974    0.86271
  0.25602   -0.3064     0.014714  -0.086497  -0.079054  -0.33109
  0.54892    0.20076    0.28064    0.037788   0.0076729 -0.0050123
 -0.11619   -0.23804    0.33027    0.26034   -0.20615   -0.35744
  0.54125   -0.3239     0.093441   0.17113   -0.41533    0.13702
 -0.21765   -0.65442    0.75733    0.359      0.62492    0.019685
  0.21156    0.28125 

In [None]:
print(doc2[2].vector)

### Vector Norm and OOV
```token.vector_norm``` is the L2 norm of the token's vector (the square root of the sum of the squared values), while ```token.is_oov``` checks whether the token is out-of-vocabulary (OOV).
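
As a small worked example of the L2 norm, here is the same calculation in plain Python. The two-component vector is made up; the `l2_norm` helper is a sketch and not how spaCy computes the value internally.

```python
import math

def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))

# Made-up vector; real vectors in en_core_web_lg have 300 components.
print(l2_norm([3.0, 4.0]))  # 5.0
```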

In [243]:
print(doc1[2].vector_norm)
print(doc2[2].vector_norm)

22.897898
21.888851


In [244]:
print(doc1[2].is_oov)

True


## Statistical Models vs Rule-based systems
Statistical models are useful if your application needs to be able to generalize based on a few examples.

Rule-based approaches on the other hand come in handy if there's a more or less finite number of instances you want to find. For example, all countries or cities of the world, drug names or even dog breeds.

|                     | Statistical models                                          | Rule-based systems                                     |
|:--------------------|:------------------------------------------------------------|:-------------------------------------------------------|
| Use cases           | application needs to generalize based on examples           | dictionary with finite number of examples              |
| Real-world examples | product names, person names, subject/object relationships   | countries of the world, cities, drug names, dog breeds |
| spaCy features      | entity recognizer, dependency parser, part-of-speech tagger | tokenizer, Matcher, PhraseMatcher                      |

In [245]:
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

Do something to the doc here!
Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


#### Task 1
Why does this pattern not match the tokens “Silicon Valley” in the doc?

In [248]:
pattern = [{'LOWER': 'silicon'}, {'TEXT': ' '}, {'LOWER': 'valley'}]
doc = nlp("Can silicon valley workers rein in big tech from within?")

matcher = Matcher(nlp.vocab)
matcher.add("SILICON_VALLEY", None, pattern)

for match_id, start, end in matcher(doc):
    print("Matched based on token shape:", doc[start:end])

Do something to the doc here!


In [249]:
# The tokenizer doesn't create tokens for single spaces, so there's no token with the value ' ' in between. 
# The tokenizer already takes care of splitting off whitespace and each dictionary in the pattern describes one token.
pattern = [{'LOWER': 'silicon'}, {'LOWER': 'valley'}]
doc = nlp("Can Silicon Valley workers rein in big tech from within?")

matcher = Matcher(nlp.vocab)
matcher.add("SILICON_VALLEY", None, pattern)

for match_id, start, end in matcher(doc):
    print("Matched based on token shape:", doc[start:end])

Do something to the doc here!
Matched based on token shape: Silicon Valley


#### Task 2
Both patterns in this exercise contain mistakes and won’t match as expected. Can you fix them? 
* ```pattern1``` so that it correctly matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun.
* ```pattern2``` so that it correctly matches all case-insensitive mentions of "ad-free", plus the following noun.


In [251]:
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

pattern1 = [{"LOWER": "Amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad-free"}, {"POS": "NOUN"}]

matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

for match_id, start, end in matcher(doc):
    print(doc.vocab.strings[match_id], doc[start:end].text)

Do something to the doc here!


In [252]:
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]

matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

for match_id, start, end in matcher(doc):
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


### Exact string matching
Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let’s use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES.
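
To illustrate the idea behind exact phrase matching independently of spaCy, the sketch below scans a token list for known phrases. It is a simplified stand-in and not how `PhraseMatcher` is implemented; the `find_phrases` helper and the two-entry country list are assumptions for illustration.

```python
def find_phrases(tokens, phrases):
    """Return (start, end) token offsets of every phrase occurrence.

    tokens  -- list of token strings
    phrases -- list of phrases, each itself a list of token strings
    """
    matches = []
    for start in range(len(tokens)):
        for phrase in phrases:
            end = start + len(phrase)
            if tokens[start:end] == phrase:
                matches.append((start, end))
    return matches

# Hypothetical miniature country list; the notebook loads the full list from JSON.
countries = [["Czech", "Republic"], ["Slovakia"]]
tokens = "Czech Republic may help Slovakia protect its airspace".split()

for start, end in find_phrases(tokens, countries):
    print(" ".join(tokens[start:end]))
```

As with the spaCy matchers, a match is reported as a pair of token offsets, so the matched span can be recovered by slicing.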

In [253]:
doc = nlp("Czech Republic may help Slovakia protect its airspace")

Do something to the doc here!


In [254]:
with open("data/countries.json") as f:
    COUNTRIES = json.loads(f.read())

patterns = list(nlp.pipe(COUNTRIES))

matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", None, *patterns)

print([doc[start:end] for match_id, start, end in matcher(doc)])

Do something to the doc here!

## Pipelines
spaCy ships with the following built-in pipeline components. 

### What happens when we call nlp
First, the tokenizer is applied to turn the string of text into a Doc object. Next, a series of pipeline components is applied to the Doc in order. In this case, the tagger, then the parser, then the entity recognizer. Finally, the processed Doc is returned, so you can work with it.

<img src="img/pipeline.png" >
<br clear="left"/>

The part-of-speech tagger sets the ```token.tag``` attribute. The dependency parser adds the ```token.dep``` and ```token.head``` attributes and is also responsible for detecting sentences and base noun phrases, also known as noun chunks. The named entity recognizer adds the detected entities to the ```doc.ents``` property. It also sets entity type attributes on the tokens that indicate whether a token is part of an entity or not. Finally, the text classifier sets category labels that apply to the whole text, and adds them to the ```doc.cats``` property. Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default, but you can use it to train your own system.


| Name    | Description             | Creates                                           |
|:--------|:------------------------|:--------------------------------------------------|
| tagger  | Part-of-speech tagger   | Token.tag                                         |
| parser  | Dependency parser       | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| ner     | Named entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type           |
| textcat | Text classifier         | Doc.cats                                          |
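
The flow described above (tokenizer first, then each component in order, each returning the doc) can be mimicked in a few lines of plain Python. This is a toy model, not spaCy code: the "doc" is a dictionary, and `tokenizer`, `make_component` and `run` are hypothetical names.

```python
def tokenizer(text):
    """Stand-in tokenizer: the 'doc' is a dict holding whitespace tokens."""
    return {"text": text, "tokens": text.split(), "applied": []}

def make_component(name):
    """Build a stand-in component that records its name and returns the doc."""
    def component(doc):
        doc["applied"].append(name)
        return doc
    return component

# Same order as the default pipeline described above.
pipeline = [(name, make_component(name)) for name in ("tagger", "parser", "ner")]

def run(text):
    """Tokenize the text, then pass the doc through every component in order."""
    doc = tokenizer(text)
    for _name, component in pipeline:
        doc = component(doc)
    return doc

doc = run("This is a sentence.")
print(doc["tokens"])
print(doc["applied"])
```

spaCy's real tokenizer is more careful than `str.split`; for instance, it would also split the final period off as its own token.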

In [235]:
nlp = en_core_web_sm.load()
nlp.pipe_names

['tagger', 'parser', 'ner']

In [236]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fc24c9579d0>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fc1ea430d70>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fc1ea430f30>)]

### Custom pipeline components

After the text is tokenized and a ```Doc``` object has been created, pipeline components are applied in order. spaCy supports a range of built-in components, but also lets you define your own. Custom components are executed automatically when you call the ```nlp``` object on a text. They're especially useful for adding your own custom metadata to documents and tokens. You can also use them to update built-in attributes, like the named entity spans.

Fundamentally, a pipeline component is a function or callable that takes a ```doc```, modifies it and returns it, so it can be processed by the next component in the pipeline. Components can be added to the pipeline using the ```nlp.add_pipe``` method. The method takes at least one argument: the component function. 

Don't forget to return the ```Doc``` so it can be processed by the next component in the pipeline! The Doc created by the tokenizer is passed through all components, so it's important that they all return the modified doc.

| Argument | Description          | Example                                 |
|:---------|:---------------------|:----------------------------------------|
| last     | If True, add last    | nlp.add_pipe(component, last=True)      |
| first    | If True, add first   | nlp.add_pipe(component, first=True)     |
| before   | Add before component | nlp.add_pipe(component, before='ner')   |
| after    | Add after component  | nlp.add_pipe(component, after='tagger') |
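
The placement arguments in the table can be pictured as list insertions. The sketch below models only the component names, under the assumption that the default is to append last; the `add_pipe` function here is a toy written for illustration, not spaCy's method, and `pre_ner` is a hypothetical component name.

```python
def add_pipe(pipe_names, name, first=False, before=None, after=None):
    """Insert name into pipe_names following the table above; default is last."""
    if first:
        index = 0
    elif before is not None:
        index = pipe_names.index(before)
    elif after is not None:
        index = pipe_names.index(after) + 1
    else:
        index = len(pipe_names)
    pipe_names.insert(index, name)
    return pipe_names

names = ["tagger", "parser", "ner"]
add_pipe(names, "length_component", first=True)
add_pipe(names, "animal_component", after="ner")
add_pipe(names, "pre_ner", before="ner")
print(names)
```

The position matters because each component only sees what earlier components have already written to the doc.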

In [237]:
def custom_component(doc):
    print("Do something to the doc here!")
    return doc

In [238]:
nlp.add_pipe(custom_component, first=True)

In [239]:
print([pip for pip in nlp.pipeline])

[('custom_component', <function custom_component at 0x7fc24670f560>), ('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fc24c9579d0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fc1ea430d70>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fc1ea430f30>)]


In [None]:
nlp = en_core_web_lg.load()

#### Exercise 1 (Simple component)
Before we run the tagger, we want to know the length of ```Doc``` object in tokens.

In [255]:
# Load the small English model
nlp = spacy.load("en_core_web_sm")

def length_component(doc):
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    return doc

nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

['length_component', 'tagger', 'parser', 'ner']


In [256]:
doc = nlp("This is a sentence.")

This document is 5 tokens long.


#### Exercise 2 (Complex Component)
Use the ```PhraseMatcher``` to find animal names in the document and add the matched spans to ```doc.ents```. A ```PhraseMatcher``` with the animal patterns is created as the variable ```matcher```.

In [257]:
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("ANIMAL", None, *animal_patterns)

def animal_component(doc):
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

nlp.add_pipe(animal_component, after="ner")
print(nlp.pipe_names)

doc = nlp("I have a cat and a golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])
displacy.render(doc, style="ent")

This document is 2 tokens long.
This document is 1 tokens long.
This document is 1 tokens long.
This document is 2 tokens long.
animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['length_component', 'tagger', 'parser', 'ner', 'animal_component']
This document is 8 tokens long.
[('cat', 'ANIMAL'), ('golden Retriever', 'ANIMAL')]


### Extension attributes
Custom attributes let you add any metadata to ```Docs```, ```Tokens``` and ```Spans```. The data can be added once, or it can be computed dynamically. Custom attributes are available via the ```._``` property.
```python
doc._.title = 'My document'
token._.is_color = True
span._.has_color = False
```
This makes it clear that they were added by the user, and not built into spaCy, like ```token.text```.  Attributes need to be registered on the global ```Doc```, ```Token``` and ```Span``` classes you can import from ```spacy.tokens```. To register a custom attribute on the ```Doc```, ```Token``` or ```Span```, you can use the ```set_extension``` method.

The first argument is the attribute name. Keyword arguments let you define how the value should be computed. In this case, it has a default value and can be overwritten.

In [None]:
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)

#### Attribute extensions
Attribute extensions set a default value that can be overwritten, for example a custom ```is_color``` attribute on the token that defaults to ```False```. Passing ```force=True``` re-registers an extension that already exists instead of raising an error, which is handy when a notebook cell is run more than once.

On individual tokens, its value can be changed by overwriting it – in this case, ```True``` for the token ```blue```.

```python
Token.set_extension('is_color', default=False, force=True)
doc = nlp("The sky is blue.")
doc[3]._.is_color = True
```
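
The effect of ```force=True``` can be checked with a small sketch. The attribute name ```is_demo_flag``` is hypothetical and used only here, so the example does not touch the ```is_color``` extension from this notebook.

```python
from spacy.tokens import Token

# "is_demo_flag" is a hypothetical attribute name used only for this sketch.
NAME = "is_demo_flag"

# Start from a clean slate in case the cell is re-run in the same session.
if Token.has_extension(NAME):
    Token.remove_extension(NAME)

Token.set_extension(NAME, default=False)

try:
    # Registering the same name again without force is rejected.
    Token.set_extension(NAME, default=True)
    second_registration_failed = False
except ValueError:
    second_registration_failed = True

# With force=True the existing registration is replaced.
Token.set_extension(NAME, default=True, force=True)
# get_extension returns a (default, method, getter, setter) tuple.
new_default = Token.get_extension(NAME)[0]

print(second_registration_failed, new_default)
```

Extensions are registered on the class, so they are shared by every ```nlp``` object in the session.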


#### Property extensions

Property extensions work like properties in Python: they can define a ```getter``` function and an optional ```setter```. The ```getter``` function is only called when you retrieve the attribute. This lets you compute the value dynamically, and even take other custom attributes into account. ```Getter``` functions take one argument: the object, in this case, the token. In this example, the function returns whether the token text is in our list of colors. We can then provide the function via the getter keyword argument when we register the extension. 

The token "blue" now returns ```True``` for ```is_color```.

##### Tokens

In [None]:
nlp = en_core_web_sm.load()

def get_is_color(token):
    colors = ['red', 'yellow', 'blue', 'green']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color, force=True)

doc = nlp("The sky is blue. Roses are red. Grass is green.")
print([str(token._.is_color) + ' - ' + token.text for token in doc if token._.is_color])

##### Spans

In [None]:
from spacy.tokens import Span

def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

Span.set_extension('has_color', getter=get_has_color, force=True)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

#### Method extensions
Method extensions make the extension attribute a callable method. You can then pass one or more arguments to it, and compute attribute values dynamically – for example, based on a certain argument or setting.

In this example, the method function checks whether the ```doc``` contains a ```token``` with a given text. The first argument of the method is always the object itself – in this case, the ```Doc```. It's passed in automatically when the method is called. All other function arguments will be arguments on the method extension. In this case, ```token_text```.

Here, the custom ```has_token``` method returns ```True``` for the word "blue" and ```False``` for the word "cloud".

In [None]:
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

Doc.set_extension('has_token', method=has_token, force=True)

doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

#### Exercise 3
Check whether the ```Doc``` object contains a number.

In [None]:
nlp = English()

def get_has_number(doc):
    return any(token.like_num for token in doc)

Doc.set_extension("has_number", getter=get_has_number, force=True)

doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

#### Exercise 4
Wrap a ```Span``` in XML tags with a ```to_html``` method extension and return the result.

In [None]:
nlp = English()

def to_html(span, tag):
    return "<{tag}>{text}</{tag}>".format(tag=tag, text=span.text)

Span.set_extension("to_html", method=to_html, force=True)

doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html("strong"))

#### Exercise 5
Components with extensions. Extension attributes are especially powerful if they’re combined with custom pipeline components. Write a pipeline component that finds country names and a custom extension attribute that returns a country’s capital, if available.
The list of countries is in ```data/countries.json```, the capitals are in ```data/capitals.json``` and the text to check is in ```data/country_text.txt```.

In [None]:
with open("data/countries.json") as f:
    COUNTRIES = json.loads(f.read())

with open("data/capitals.json") as f:
    CAPITALS = json.loads(f.read())
    
with open("data/country_text.txt") as f:
    TEXT = f.read()

nlp = English()

In [None]:
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", None, *list(nlp.pipe(COUNTRIES)))

def countries_component(doc):
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc

nlp.add_pipe(countries_component)
print(nlp.pipe_names)

In [None]:
get_capital = lambda span: CAPITALS.get(span.text)
Span.set_extension("capital", getter=get_capital, force=True)

In [None]:
doc = nlp(TEXT)
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

### Scaling
If you need to process a lot of texts and create a lot of ```Doc``` objects in a row, the ```nlp.pipe``` method can speed this up significantly. It processes the texts as a stream and yields ```Doc``` objects. It is much faster than just calling ```nlp``` on each text, because **it batches up the texts**. ```nlp.pipe``` is a generator that yields ```Doc``` objects, so to get a list of ```Doc``` objects, remember to wrap the call in ```list()```.

**BAD**:
```python
docs = [nlp(text) for text in LOTS_OF_TEXTS]
```
**GOOD**:
```python
docs = list(nlp.pipe(LOTS_OF_TEXTS))
```
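
A rough way to compare the two approaches is to time them on the same input. This is only a sketch: ```texts``` is a hypothetical stand-in for ```LOTS_OF_TEXTS```, and a tokenizer-only ```English()``` pipeline is assumed so that no model download is needed. With only a tokenizer the gap is usually small; the benefit of batching is expected to show mainly when statistical components such as the tagger or parser are in the pipeline.

```python
import time
from spacy.lang.en import English

nlp = English()  # tokenizer only, so the sketch needs no model download

# Hypothetical stand-in for LOTS_OF_TEXTS.
texts = ["This is text number {}.".format(i) for i in range(2000)]

# Call nlp on each text separately.
start = time.time()
docs_loop = [nlp(text) for text in texts]
loop_seconds = time.time() - start

# Let nlp.pipe stream and batch the texts.
start = time.time()
docs_pipe = list(nlp.pipe(texts))
pipe_seconds = time.time() - start

# Both approaches produce the same tokens; only the run time may differ.
same_tokens = all(
    [t.text for t in a] == [t.text for t in b]
    for a, b in zip(docs_loop, docs_pipe)
)
print("loop: {:.3f}s  pipe: {:.3f}s  same tokens: {}".format(
    loop_seconds, pipe_seconds, same_tokens))
```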

#### Passing in context
```nlp.pipe``` also supports passing in tuples of text / context if you set ```as_tuples=True```. The method will then yield doc / context tuples. This is useful for passing in additional metadata, like an ID associated with the text, or a page number.

In [None]:
Doc.set_extension('id', default=None, force=True)
Doc.set_extension('page_number', default=None, force=True)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']
    print("ID: {0}, Page Number: {1}, Content: {2}".format(doc._.id, doc._.page_number, doc.text))

### Performance
Another common scenario: Sometimes you already have a model loaded to do other processing, but you only need the tokenizer for one particular text. Running the whole pipeline is unnecessarily slow, because you'll be getting a bunch of predictions from the model that you don't need.
<img src="img/pipeline.png" >
<br clear="left"/>

If you only need a tokenized ```Doc``` object, you can use the ```nlp.make_doc``` method instead, which takes a text and returns a ```Doc```. This is also how spaCy does it behind the scenes: ```nlp.make_doc``` turns the text into a ```Doc``` before the pipeline components are called.

**BAD**: 
```python
doc = nlp("Hello world")
```
**GOOD**:
```python
doc = nlp.make_doc("Hello world")
```


In [259]:
nlp1 = en_core_web_lg.load()
nlp2 = en_core_web_sm.load()
nlp3 = English()

for nlp in [nlp1, nlp2, nlp3]:
    for method in ['complete', 'make_doc']:
        if method == 'complete':
            start_time = time.time()
            doc = nlp("Hello world")
            stop_time = time.time() - start_time
            print(nlp, method ,stop_time)
        else:
            start_time = time.time()
            doc = nlp.make_doc("Hello world!")
            stop_time = time.time() - start_time
            print(nlp, method ,stop_time)

<spacy.lang.en.English object at 0x7fc232fc8550> complete 0.6604928970336914
<spacy.lang.en.English object at 0x7fc232fc8550> make_doc 0.00038242340087890625
<spacy.lang.en.English object at 0x7fc20c765c50> complete 0.007763385772705078
<spacy.lang.en.English object at 0x7fc20c765c50> make_doc 0.0002598762512207031
<spacy.lang.en.English object at 0x7fc20c765590> complete 0.00023317337036132812
<spacy.lang.en.English object at 0x7fc20c765590> make_doc 0.00016069412231445312


#### Disabling pipeline components
spaCy also allows you to temporarily disable pipeline components using the ```nlp.disable_pipes``` context manager. It takes a variable number of arguments: the string names of the pipeline components to disable. For example, if you only want to use the entity recognizer to process a document, you can temporarily disable the tagger and parser. Inside the ```with``` block, spaCy will only run the remaining components. After the ```with``` block, the disabled pipeline components are automatically restored.

```python
with nlp.disable_pipes('tagger', 'parser'):
    doc = nlp(text)
    print(doc.ents)
```

In [None]:
nlp = en_core_web_sm.load()

with open("data/tweets.json") as f:
    TEXTS = json.loads(f.read())

In [None]:
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])

In [None]:
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

## Training and updating model

spaCy’s models are **statistical** and every “decision” they make – for example, which ```part-of-speech``` tag to assign, or whether a word is a named entity – is a **prediction**. This prediction is based on the examples the model has seen during **training**. To train a model, you first need training data – examples of text, and the labels you want the model to predict. This could be a part-of-speech tag, a named entity or any other information.

The model is then shown the unlabelled text and will make a prediction. Because we know the correct answer, we can give the model feedback on its prediction in the form of an **error gradient** of the **loss function**, which measures the difference between the model's prediction and the expected output. The greater the difference, the larger the gradient and the bigger the update to our model.

<img src="img/training.png" >
<br clear="left"/>

* **Training data**: Examples and their annotations.
* **Text**: The input text the model should predict a label for.
* **Label**: The label the model should predict.
* **Gradient**: How to change the weights.
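
The feedback loop can be illustrated with a toy example. This is not how spaCy's models are implemented: it assumes a hypothetical one-weight model (```prediction = weight * x```) and a squared-error loss, purely to show that a larger error yields a larger gradient and therefore a larger update.

```python
# Toy model: prediction = weight * x. Not spaCy's actual model.
def gradient(weight, x, target):
    prediction = weight * x
    # Derivative of the squared error (prediction - target) ** 2
    # with respect to the weight.
    return 2 * (prediction - target) * x

weight = 0.0
learning_rate = 0.1
x, target = 1.0, 3.0  # one hypothetical training example

history = []
for step in range(5):
    grad = gradient(weight, x, target)
    history.append(abs(grad))
    weight -= learning_rate * grad  # move against the gradient

# The gradient magnitude shrinks as the prediction approaches the target.
print(history)
print(weight)
```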

**Why update the model?**
* Better results on your specific domain
* Learn classification schemes specifically for your problem
* Essential for text classification
* Very useful for named entity recognition
* Less critical for part-of-speech tagging and dependency parsing

### Where to get training data?
Collecting training data may sound incredibly painful – and it can be, if you’re planning a large-scale annotation project. 

spaCy’s rule-based ```Matcher``` is a great way to quickly create training data for named entity models.

In [None]:
with open("data/iphone.json") as f:
    TEXTS = json.loads(f.read())

nlp = English()

In [None]:
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]
matcher.add("GADGET", None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

In [None]:
with open("data/gadgets.json") as f:
    TRAINING_DATA = json.loads(f.read())
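
Before training on data in this format, it can help to check that the character offsets really point at the intended text. The sketch below is plain Python; ```SAMPLE_DATA```, ```entity_texts``` and ```has_valid_offsets``` are hypothetical names for illustration, and the two examples are made up rather than taken from ```data/gadgets.json```.

```python
# Hypothetical examples in the same (text, {"entities": [...]}) format.
SAMPLE_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

def entity_texts(example):
    """Slice each (start_char, end_char, label) annotation out of the text."""
    text, annotations = example
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]

def has_valid_offsets(example):
    """Offsets must be in range and must not start or end on whitespace."""
    text, annotations = example
    for start, end, _label in annotations["entities"]:
        if not (0 <= start < end <= len(text)):
            return False
        span = text[start:end]
        if span != span.strip():
            return False
    return True

for example in SAMPLE_DATA:
    print(entity_texts(example), has_valid_offsets(example))
```

The same functions work on data loaded from JSON, where the tuples arrive as lists.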

### Create new pipe
We start off with a blank English model using the ```spacy.blank``` method. The blank model doesn't have any pipeline components, only the language data and tokenization rules. We then create a blank entity recognizer and add it to the pipeline. Using the ```add_label``` method, we can add new string labels to the model.



In [None]:
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

### Problems with training
When you start running your own experiments, you might find that a lot of things just don't work the way you want them to. And that's okay. Training models is an iterative process, and you have to try different things until you find out what works best.

#### Problem 1: Models can "forget" things

Statistical models can learn lots of things – but it doesn't mean that they won't unlearn them. If you're updating an existing model with new data, especially new labels, it can **overfit** and **adjust too much to the new examples**. For instance, if you're only updating it with examples of ```WEBSITE```, it may "forget" other labels it previously predicted correctly – like ```PERSON```. This is also known as the catastrophic forgetting problem.

**TL;DR**
* An existing model can overfit on new data, e.g. if you only update it with ```WEBSITE```, it can "unlearn" what a ```PERSON``` is
* This is also known as the **catastrophic forgetting** problem

#### Solution 1
To prevent this, make sure to always mix in examples of what the model previously got correct. If you're training a new category ```WEBSITE```, also include examples of ```PERSON```. You can create those additional examples by running the existing model over data and extracting the entity spans you care about. You can then mix those examples in with your existing data and update the model with annotations of all labels.

**BAD**:
```python
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]
```
**GOOD**:
```python
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
]
```
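
Mixing the two kinds of examples can be sketched in plain Python. The names ```NEW_EXAMPLES```, ```REVISION_EXAMPLES``` and ```mix_training_data``` are hypothetical; in practice the revision examples could come from running the existing model over raw text and keeping the entity spans it predicts.

```python
import random

# Examples for the label we want to teach.
NEW_EXAMPLES = [
    ("Reddit is a website", {"entities": [(0, 6, "WEBSITE")]}),
]

# Hypothetical examples of what the model already gets right.
REVISION_EXAMPLES = [
    ("Obama is a person", {"entities": [(0, 5, "PERSON")]}),
]

def mix_training_data(new_examples, revision_examples, seed=0):
    """Combine both sets and shuffle them so labels are interleaved."""
    mixed = list(new_examples) + list(revision_examples)
    random.Random(seed).shuffle(mixed)
    return mixed

TRAINING_DATA = mix_training_data(NEW_EXAMPLES, REVISION_EXAMPLES)

# Confirm that both labels are present in the mixed data.
labels = {label
          for _text, annotations in TRAINING_DATA
          for _start, _end, label in annotations["entities"]}
print(sorted(labels))  # ['PERSON', 'WEBSITE']
```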

#### Problem 2: Models can't learn everything

Another common problem is that your model just won't learn what you want it to. spaCy's models make **predictions based on the local context** – for example, for named entities, the surrounding words are most important. **If the decision is difficult to make based on the context, the model can struggle to learn it**. The label scheme also needs to be consistent and not too specific. For example, it may be very difficult to teach a model to predict whether something is ```ADULT_CLOTHING``` or ```CHILDRENS_CLOTHING``` based on the context. However, just predicting the label ```CLOTHING``` may work better.

* spaCy's models make predictions based on local context
* Model can struggle to learn if decision is difficult to make based on context
* Label scheme needs to be consistent and not too specific. For example: ```CLOTHING``` is better than ```ADULT_CLOTHING``` and ```CHILDRENS_CLOTHING```.

#### Solution 2
Before you start training and updating models, it's worth taking a step back and planning your label scheme. Try to pick categories that are reflected in the local context and make them more generic if possible. You can always add a rule-based system later to go from generic to specific. Generic categories like "clothing" or "band" are both easier to label and easier to learn.

**BAD**:
```python
LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']
```
**GOOD**:
```python
LABELS = ['CLOTHING', 'BAND']
```
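
Going from a generic label to a specific one with rules might look like the sketch below. The keyword list and the ```refine_label``` function are hypothetical and deliberately simplistic; a real system could look the entity up in a product catalogue instead.

```python
# Hypothetical keyword rules for illustration only.
CHILD_KEYWORDS = {"kids", "children", "toddler", "baby"}

def refine_label(label, sentence):
    """Map the generic CLOTHING label to a more specific one via keywords."""
    if label != "CLOTHING":
        return label  # only CLOTHING is refined in this sketch
    words = {word.strip(".,!?").lower() for word in sentence.split()}
    if words & CHILD_KEYWORDS:
        return "CHILDRENS_CLOTHING"
    return "ADULT_CLOTHING"

print(refine_label("CLOTHING", "I bought sneakers for my kids."))
print(refine_label("CLOTHING", "I bought sneakers for work."))
print(refine_label("BAND", "I saw Metallica live."))
```

The statistical model only has to learn the easier ```CLOTHING``` label, and the rules stay easy to inspect and change.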

## TEST

### Question I

Which statement about strings in Python is correct?

1) Strings are fancy underwear

**2) Strings are immutable sequences**

3) Strings are mutable

4) We can use only double quote (") for Strings.

### Question II

What is required for Lemmatization (greedy)?

1) Sentence tokenization

2) Words tokenization

3) Stemming

**4) Words tokenization and Parts of speech**

### Question III 

Which of these are examples of stop words in English?

1) 'yes', 'no', 'gimme more'

2) 'pneumonoultramicroscopicsilicovolcanoconiosis'

3) 'stop'

**4) 'a', 'the'**

https://en.wikipedia.org/wiki/Pneumonoultramicroscopicsilicovolcanoconiosis

### Question IV

How many bi-grams can we generate from the given sentence:
“Cisco is a great place to work”

1) 7

**2) 6**

3) 5

4) 10
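
The count can be checked with a few lines of plain Python, assuming simple whitespace tokenization: a sentence with n tokens has n - 1 bi-grams.

```python
sentence = "Cisco is a great place to work"
tokens = sentence.split()

# Pair every token with its right-hand neighbour.
bigrams = list(zip(tokens, tokens[1:]))

print(len(tokens), len(bigrams))  # 7 6
print(bigrams)
```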


### Question V 

Collaborative filtering and content-based models are two popular kinds of recommendation engine. What role does NLP play in building such algorithms?

1) Feature Extraction from text

2) Measuring Feature Similarity

3) Engineering Features for vector space learning model

**4) All of these**




### Question VII
What’s not included in a model package that you can load into spaCy?

1) A meta file including the language, pipeline and license.

2) Binary weights to make statistical predictions.

**3) The labelled data that the model was trained on.**

4) Strings of the model's vocabulary and their hashes.

**Note:** Statistical models allow you to generalize based on a set of training examples. Once they’re trained, they use binary weights to make predictions. That’s why it’s not necessary to ship them with their training data.

### Question VIII 
Why does this code throw an error?
```python
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings['Bowie']
print(bowie_id)

# Look up the ID for 'Bowie' in the vocab
print(nlp_de.vocab.strings[bowie_id])
```



**1) The string ```'Bowie'``` isn't in the German vocab, so the hash can't be resolved in the string store.**

2) ```'Bowie'``` is not a regular word in the English or German dictionary, so it can't be hashed.

3) ```nlp_de``` is not a valid name. The vocab can only be shared if the ```nlp``` objects have the same name.

**Note:** Hashes can’t be reversed. To prevent this problem, add the word to the new vocab by processing a text or looking up the string, or use the same vocab to resolve the hash back to a string.
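
One way to apply the fix from the note is sketched below: the failing lookup is caught, the string is added to the German string store, and the same hash then resolves. The variable names ```lookup_failed``` and ```resolved``` are only for illustration.

```python
from spacy.lang.en import English
from spacy.lang.de import German

nlp = English()
nlp_de = German()

# Hashing a string works without the string being stored anywhere.
bowie_id = nlp.vocab.strings["Bowie"]

try:
    # The German string store has not seen 'Bowie' yet.
    nlp_de.vocab.strings[bowie_id]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Adding the string makes the hash resolvable in the German vocab too.
nlp_de.vocab.strings.add("Bowie")
resolved = nlp_de.vocab.strings[bowie_id]

print(lookup_failed, resolved)
```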

### Question IX

What does spaCy do when you call ```nlp``` on a string of text?
```python
doc = nlp("This is a sentence.")
```
1) Run the tagger, parser and entity recognizer and then the tokenizer.

**2) Tokenize the text and apply each pipeline component in order.**

3) Connect to the spaCy server to compute the result and return it.

4) Initialize the language, add the pipeline and load in the binary model weights.


### Question X

Here’s an excerpt from a training set that labels the entity type TOURIST_DESTINATION in traveler reviews.
```python
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "TOURIST_DESTINATION")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "TOURIST_DESTINATION")]},
    ),
    ("There's also a Paris in Arkansas, lol", {"entities": []}),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "TOURIST_DESTINATION")]},
    ),
]
```
Why is this data and label scheme problematic?

**1) Whether a place is a tourist destination is a subjective judgement and not a definitive category. It will be very difficult for the entity recognizer to learn.**

2) Paris and Arkansas should also be labelled as tourist destinations for consistency. Otherwise, the model will be confused.

3) Rare out-of-vocabulary words like the misspelled 'amsterdem' shouldn't be labelled as entities.

**Note:** A much better approach would be to only label ```GPE``` (geopolitical entity) or ```LOCATION``` and then use a rule-based system to determine whether the entity is a tourist destination in this context. For example, you could resolve the entity types back to a knowledge base or look them up in a travel wiki. 

Even very uncommon words or misspellings can be labelled as entities. In fact, **being able to predict categories in misspelled text based on the context is one of the big advantages of statistical named entity recognition**.

### Question XI 

A model was trained with the data you just labelled, plus a few thousand similar examples. After training, it’s doing great on ```WEBSITE```, but doesn’t recognize ```PERSON``` anymore. Why could this be happening?

1) It's very difficult for the model to learn about different categories like PERSON and WEBSITE.

**2) The training data included no examples of PERSON, so the model learned that this label is incorrect.** 

3) The hyperparameters need to be retuned so that both entity types can be recognized.

**Note:** If ```PERSON``` entities occur in the training data but aren’t labelled, the model will learn that they shouldn’t be predicted. Similarly, if an existing entity type isn’t present in the training data, the model may "forget" and stop predicting it.

## Next steps

[0. urllib and bs4 and textacy]()

[1. Rule-based entity](https://spacy.io/usage/rule-based-matching#entityruler)

[2. Pipelines](https://spacy.io/usage/processing-pipelines)

[3. Training models](https://spacy.io/usage/training#basics)

[4. Stanford's CS224n](http://web.stanford.edu/class/cs224n/)