### Language Models

To work with spaCy, we first have to download a [language model](https://spacy.io/usage/models). Each language model contains the statistical properties and attributes learned for one specific language. For example, a common set of English language models is:

* `en_core_web_sm`
* `en_core_web_md`
* `en_core_web_lg`

These are the names of English models trained on a web corpus (there are other genres), in varying sizes: small (`sm`), medium (`md`), and large (`lg`).

You can install spaCy and download a specific language model with the following commands:

```shell
pip install spacy
python -m spacy download en_core_web_sm
```
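
Before loading a model in a script, it can help to check that it is actually installed. The following is a minimal sketch, assuming spaCy v3; the fallback to a blank English pipeline is an illustrative choice for this example, not something the rest of this notebook relies on.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"

if is_package(MODEL_NAME):
    # the model was installed with `python -m spacy download`
    nlp = spacy.load(MODEL_NAME)
else:
    # assumption for this sketch: fall back to a tokenizer-only pipeline
    nlp = spacy.blank("en")

print(nlp.lang, nlp.pipe_names)
```

With the blank fallback, `nlp.pipe_names` is empty, so attributes such as part-of-speech tags and entities will not be filled in.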

### Visualizing Documents
You can use spaCy's `displacy` tool to visualize the relationships and meta-information that spaCy gathers while processing your documents.

In [1]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love Professor Chen's class - it is great.")
displacy.render(doc)
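
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. A minimal sketch, assuming spaCy v3: the entities here are set by hand on a blank pipeline so the example runs without a downloaded model, whereas a loaded model would predict them. The names `nlp_blank` and `viz_doc` are chosen only for this example.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc, Span

nlp_blank = spacy.blank("en")  # tokenizer only, no trained components
words = ["Chesa", "Boudin", ",", "the", "district", "attorney", "of", "San", "Francisco"]
viz_doc = Doc(nlp_blank.vocab, words=words)

# hand-assigned entities (an assumption of this sketch, not model output)
viz_doc.ents = [Span(viz_doc, 0, 2, label="PERSON"), Span(viz_doc, 7, 9, label="GPE")]

# style="ent" highlights entities; jupyter=False returns the HTML as a string
html = displacy.render(viz_doc, style="ent", jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file and opened in a browser.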

### The spaCy Pipeline

When you call `nlp(raw_text)`, spaCy performs a series of operations in sequence, hence the name pipeline:

* `Tokenizer`: splits the raw string into a sequence of tokens that make up a `Doc`.
* `Tagger`: tags each token in the `Doc` with its part of speech.
* `Dependency Parser`: parses the tokens to assign dependency relationships (e.g., possessive, compound noun, etc.).
* `Named Entity Recognizer`: identifies and labels tokens as named entities.
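
The order of these steps can be inspected on the pipeline object. A minimal sketch, assuming spaCy v3: it builds a blank English pipeline (an illustrative stand-in that needs no model download) and adds the built-in rule-based `sentencizer` to show a component running after the tokenizer. For a loaded model such as `en_core_web_sm`, `pipe_names` lists its trained components instead.

```python
import spacy

nlp_blank = spacy.blank("en")        # tokenizer only, no trained components
print(nlp_blank.pipe_names)          # components that run after the tokenizer

nlp_blank.add_pipe("sentencizer")    # rule-based sentence splitter
print(nlp_blank.pipe_names)

pipeline_doc = nlp_blank("I love cats. Me adore felines.")
sentences = [sent.text for sent in pipeline_doc.sents]
print(sentences)
```

The tokenizer always runs first and does not appear in `pipe_names`; every listed component then receives the `Doc` produced by the previous step.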

#### Tokenizing Text

You can quickly retrieve information about your tokens by iterating over the `Doc` with a regular `for` loop.

In [164]:
text = """Chesa Boudin, the district attorney of San Francisco, 
will face a recall election next June after a backlash in one of 
America's most liberal cities to his policies aimed at reducing 
one hundred thousand people in jails and prisons.
"""

spacy_doc = nlp(text) # spacy_doc is an iterable object of tokens

In [169]:
sentence1 = nlp("I love cats.")

In [172]:
sentence2 = nlp("Me adore felines.")

In [174]:
for token in spacy_doc:
    print(f"""
token: {token.text}, tag:{token.tag_}, lemma:{token.lemma_}, 
entity type: {token.ent_type_}, stopword: {token.is_stop}, like num: {token.like_num}
{spacy.explain(token.tag_)}
    """)


token: Chesa, tag:NNP, lemma:Chesa, 
entity type: PERSON, stopword: False, like num: False
noun, proper singular
    

token: Boudin, tag:NNP, lemma:Boudin, 
entity type: PERSON, stopword: False, like num: False
noun, proper singular
    

token: ,, tag:,, lemma:,, 
entity type: , stopword: False, like num: False
punctuation mark, comma
    

token: the, tag:DT, lemma:the, 
entity type: , stopword: True, like num: False
determiner
    

token: district, tag:NN, lemma:district, 
entity type: , stopword: False, like num: False
noun, singular or mass
    

token: attorney, tag:NN, lemma:attorney, 
entity type: , stopword: False, like num: False
noun, singular or mass
    

token: of, tag:IN, lemma:of, 
entity type: , stopword: True, like num: False
conjunction, subordinating or preposition
    

token: San, tag:NNP, lemma:San, 
entity type: GPE, stopword: False, like num: False
noun, proper singular
    

token: Francisco, tag:NNP, lemma:Francisco, 
entity type: GPE, stopword: False, lik

You can use the [following link](https://spacy.io/usage/linguistic-features#pos-tagging) to reference what each part of speech tag means.
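
For a quick lookup without leaving the notebook, `spacy.explain` returns a short description for a tag or label and does not require a loaded model. A small sketch using tags and labels that appear in the output above:

```python
import spacy

# tags and entity labels taken from the token output above
labels = ["NNP", "DT", "NN", "GPE"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```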

#### Accessing Named Entities
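You can quickly identify the named entities in your text using `spacy_doc.ents`, or token by token using `token.ent_type_`. These named entities are identified by a trained multi-class classification model. In general, such models are often some derivative of a sequence-based neural network, such as an LSTM; spaCy's own entity recognizer is a transition-based neural component.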

In [175]:
for token in spacy_doc:
    if token.ent_type_:
        print(token.text, token.ent_type_, spacy.explain(token.ent_type_))

Chesa PERSON People, including fictional
Boudin PERSON People, including fictional
San GPE Countries, cities, states
Francisco GPE Countries, cities, states
next DATE Absolute or relative dates or periods
June DATE Absolute or relative dates or periods
one CARDINAL Numerals that do not fall under another type
America GPE Countries, cities, states
one CARDINAL Numerals that do not fall under another type
hundred CARDINAL Numerals that do not fall under another type
thousand CARDINAL Numerals that do not fall under another type
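
The loop above prints one row per token, so multi-word entities such as `San Francisco` appear as separate rows. `Doc.ents` instead yields one `Span` per entity. A minimal sketch, assuming spaCy v3: the entities are assigned by hand on a blank pipeline so that it runs without a model, whereas `nlp(text)` with a loaded model fills `doc.ents` automatically. The names `nlp_blank` and `ent_doc` are chosen only for this example.

```python
import spacy
from spacy.tokens import Doc, Span

nlp_blank = spacy.blank("en")
words = ["Chesa", "Boudin", ",", "the", "district", "attorney", "of", "San", "Francisco"]
ent_doc = Doc(nlp_blank.vocab, words=words)

# hand-assigned entities (an assumption of this sketch, not model output)
ent_doc.ents = [Span(ent_doc, 0, 2, label="PERSON"), Span(ent_doc, 7, 9, label="GPE")]

# span-level view: one entry per entity, with token offsets
entity_spans = [(ent.text, ent.label_, ent.start, ent.end) for ent in ent_doc.ents]
print(entity_spans)

# token-level view: B marks the first token of an entity, I a continuation
token_view = [(t.text, t.ent_iob_, t.ent_type_) for t in ent_doc if t.ent_type_]
print(token_view)
```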


You can use this to quickly count references to certain concepts, topics, or people. In the code snippet below, we read in 249 BBC Sport football articles (files `001.txt` through `249.txt`), run them through spaCy's pipeline, and identify the most frequently recurring entities (for example, `Everton`).

In [176]:
from collections import Counter

documents = []
# range(1, 250) yields article numbers 001 through 249
articles = [f"bbcsport/football/{str(i).zfill(3)}.txt" for i in range(1, 250)]

count = Counter()
for article in articles:
    with open(article, encoding="latin1") as f:  # open each sports article
        article_doc = nlp(f.read())
    documents.append(article_doc)
    # count the entities of the article just parsed, not of the earlier `doc`
    count.update(entity.text for entity in article_doc.ents)
count.most_common(5)
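
The counting pattern can also be tried without the BBC files. A minimal sketch, assuming spaCy v3: the three sentences and the two `entity_ruler` patterns are made up for illustration, and the rule-based `entity_ruler` stands in for the trained entity recognizer so that no model download is needed. `nlp.pipe` processes the texts as a stream, which is usually faster than calling the pipeline on each text separately.

```python
from collections import Counter

import spacy

nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
# illustrative patterns; a trained model would predict entities instead
ruler.add_patterns([
    {"label": "ORG", "pattern": "Everton"},
    {"label": "ORG", "pattern": "Arsenal"},
])

# made-up sample sentences standing in for the article files
sample_texts = [
    "Everton beat Arsenal.",
    "Arsenal drew with Everton.",
    "Everton signed a new striker.",
]

entity_counts = Counter()
for parsed in nlp_rules.pipe(sample_texts):
    # each `parsed` is a separate Doc, so its own entities are counted
    entity_counts.update(ent.text for ent in parsed.ents)

print(entity_counts.most_common(2))
```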

#### Fusing Tokens Together
In the previous example, you can see that `San` and `Francisco` are defined as two separate tokens. Perhaps we want to fuse these tokens together.

In [181]:
spacy_doc[7:9]

San Francisco

In [107]:
with spacy_doc.retokenize() as rtk:
    rtk.merge(spacy_doc[7:9])
for token in spacy_doc[:8]:
    print(token)

Chesa
Boudin
,
the
district
attorney
of
San Francisco, 
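
The effect of a merge can be checked on a short phrase. A minimal sketch, assuming spaCy v3 and a blank English pipeline (tokenizer only), which is enough here because merging does not depend on trained components:

```python
import spacy

nlp_blank = spacy.blank("en")
phrase = nlp_blank("the district attorney of San Francisco")
tokens_before = [t.text for t in phrase]

with phrase.retokenize() as retokenizer:
    retokenizer.merge(phrase[4:6])  # fuse "San" and "Francisco" into one token

tokens_after = [t.text for t in phrase]
print(tokens_before)
print(tokens_after)
```

Merging changes the `Doc` in place and shifts the indices of all later tokens, so running the same merge cell twice will fuse a different pair of tokens the second time.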



#### Get the Document Vector (Average of Each Token's Vector)

In [186]:
spacy_doc.vector

array([ 0.7863186 ,  0.1918263 ,  0.13655715, -0.32552698,  0.02235261,
       -0.23494928, -0.23189236,  0.22802822,  0.2849812 , -0.44059864,
       -0.14603324,  0.11368418, -0.14675935, -0.06631666, -0.04136682,
        0.06409609,  0.36511496,  0.04003874,  0.06382134, -0.0392394 ,
       -0.22590397, -0.02681326, -0.18175632, -0.01114394, -0.2981428 ,
       -0.11283874, -0.13519627,  0.12961553,  0.08538555,  0.07563426,
       -0.42844483,  0.11483084, -0.02804231, -0.14675218,  0.50241476,
       -0.03776771,  0.04326417, -0.08465672,  0.16092725,  0.0996326 ,
       -0.01267661,  0.08726451, -0.22938558,  0.17821172, -0.08325638,
       -0.22826734, -0.01873402,  0.05682325, -0.0900195 , -0.08148623,
        0.09486844,  0.12895676, -0.11231033,  0.02614486,  0.17087993,
       -0.1627619 , -0.18449049, -0.00526714,  0.02839799,  0.17819184,
       -0.10176533, -0.28971377,  0.19312021, -0.17347246,  0.06518303,
       -0.00627545, -0.13956574, -0.177201  , -0.03008786,  0.17

In [184]:
for token in spacy_doc:
    print(token.text, token.vector)

Chesa [ 2.5781105   1.4851422   0.66668475 -0.60898983  0.5822772  -0.99396026
 -0.6164455   1.1441483  -0.15920722 -0.9916673  -0.733324   -0.46003422
  0.5547266   1.0656301  -0.9460996   0.65011287 -0.2894455   1.4954193
  1.2494959  -0.09863493 -0.97842     0.36420727 -1.0774815   0.580465
  0.9059092   0.9087603  -0.16817498  0.11348483  1.9163984   0.35930339
  1.4921854  -0.04471752 -0.03041995 -0.81111723 -0.06112164 -0.11117259
 -0.02868384  1.4696815   0.05071256  2.1784353  -1.5221603  -0.14882278
 -1.1196026  -0.45203105 -0.7293926  -1.0136325   0.32718337 -0.8637043
 -0.15427247  0.6167193  -0.31026864  0.60379845 -0.6902966   1.0173128
  0.2813806  -0.70849043 -0.88784236  1.1615374   0.17625298  0.9271499
 -0.25180635 -1.36858     1.262291   -0.5726997  -0.05301022 -1.4893147
 -1.4506245  -0.6753426   0.5262906  -1.1294615  -1.0431541   0.13071054
  0.31262332  0.48740473 -0.23954546  1.119847    1.1350921  -0.7028847
 -2.1299992   1.6502466  -0.7313124  -0.48681015  0.2

In [188]:
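import numpy as np

# start from a zero vector with the same width as the document vector (96 here)
initial_vector = np.zeros(spacy_doc.vector.shape[0])

for token in spacy_doc:
    initial_vector += token.vector  # sum the token vectors
initial_vector /= len(spacy_doc)  # divide by the token count to get the mean
print(spacy_doc.vector[:5])
print(initial_vector[:5])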

[ 0.7863186   0.1918263   0.13655715 -0.32552698  0.02235261]
[ 0.78631861  0.19182629  0.13655713 -0.32552698  0.02235261]
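
The two printed rows show a different number of digits because spaCy stores vectors as 32-bit floats, while `np.zeros` creates a 64-bit array by default. The following sketch uses made-up toy vectors (not spaCy output) to show the same averaging done with a loop and with `np.mean`, along with the resulting dtypes:

```python
import numpy as np

# toy stand-ins for token vectors, stored as float32 like spaCy's vectors
toy_vectors = [
    np.array([1.0, 2.0, 3.0], dtype=np.float32),
    np.array([3.0, 2.0, 1.0], dtype=np.float32),
    np.array([2.0, 2.0, 2.0], dtype=np.float32),
]

# loop version: accumulate into a float64 array, then divide by the count
loop_mean = np.zeros(3)
for vec in toy_vectors:
    loop_mean += vec
loop_mean /= len(toy_vectors)

# vectorized version: stack into a (3, 3) matrix and average over rows
stacked_mean = np.mean(np.stack(toy_vectors), axis=0)

print(loop_mean.dtype, stacked_mean.dtype)
print(np.allclose(loop_mean, stacked_mean))
```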


### Using Matcher

Often, we want to match sequences of tokens that follow a certain pattern. We can construct patterns using the `Matcher` class from `spacy.matcher`:

In [None]:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# one or more tokens labeled as a PERSON entity, followed by a verb
pattern = [{"ENT_TYPE": "PERSON", "OP": "+"}, {"POS": "VERB"}]
matcher.add("FULL_NAME", [pattern])
doc = nlp("Elon Musk spends more money than Jeff Bezos. But also Bill burns cash too.")
matched_names = matcher(doc)  # list of (match_id, start, end) tuples
for _, start, end in matched_names:
    print(doc[start:end])
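
Because `"OP": "+"` matches one or more tokens, the matcher can return overlapping results, for example a full name plus verb and also just the last name plus verb. A minimal sketch, assuming spaCy v3: it uses `IS_TITLE` and `LOWER`, which are lexical attributes and therefore work on a blank pipeline without a model (a stand-in for `ENT_TYPE` and `POS`, which need trained components), and keeps the longest non-overlapping matches with `spacy.util.filter_spans`.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp_blank = spacy.blank("en")
title_matcher = Matcher(nlp_blank.vocab)

# one or more title-cased tokens followed by the word "spends"
title_pattern = [{"IS_TITLE": True, "OP": "+"}, {"LOWER": "spends"}]
title_matcher.add("NAME_SPENDS", [title_pattern])

example = nlp_blank("Elon Musk spends more money")
all_spans = title_matcher(example, as_spans=True)  # Span objects instead of tuples
longest = filter_spans(all_spans)                  # drop shorter overlapping spans

print([span.text for span in all_spans])
print([span.text for span in longest])
```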