## spaCy 101

spaCy usage:
1. Information Extraction
2. Natural Language Understanding
3. Pre-processing text for Deep Learning

Features:
1. Tokenization
2. Lemmatization
3. Part-of-Speech Tagging
4. Dependency Parsing
5. Sentence Boundary Detection
6. Named Entity Recognition
7. Entity Linking
8. Similarity
9. Text Classification
10. Rule-based Matching
11. Training
12. Serialization

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')
doc = nlp("The cat sat on the mat.")
for token in doc:
    print(token.text, token.pos_, token.dep_)

The DET det
cat NOUN nsubj
sat VERB ROOT
on ADP prep
the DET det
mat NOUN pobj
. PUNCT punct


### Tokenization

In [3]:
doc = nlp("You don't seem to have the spaCy package itself installed.")
for token in doc:
    print(token.text)

You
do
n't
seem
to
have
the
spaCy
package
itself
installed
.


### Part-of-Speech Tagging and Dependencies

    Text: The original word text.
    Lemma: The base form of the word.
    POS: The simple part-of-speech tag.
    Tag: The detailed part-of-speech tag.
    Dep: Syntactic dependency, i.e. the relation between tokens.
    Shape: The word shape – capitalization, punctuation, digits.
    Is alpha: Does the token consist of alphabetic characters?
    Is stop: Is the token part of a stop list, i.e. the most common words of the language?

In [4]:
# Columns: text, POS, tag, dependency, shape, is_stop, is_alpha
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, token.shape_, token.is_stop, token.is_alpha)

You PRON PRP nsubj Xxx True True
do AUX VBP aux xx True True
n't PART RB neg x'x True False
seem VERB VB ROOT xxxx True True
to PART TO aux xx True True
have AUX VB xcomp xxxx True True
the DET DT det xxx True True
spaCy NOUN NN compound xxxXx False True
package NOUN NN nsubj xxxx False True
itself PRON PRP appos xxxx True True
installed VERB VBD ccomp xxxx False True
. PUNCT . punct . False False
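
The Shape column can be approximated in plain Python to see where values such as `Xxx` and `xxxXx` come from. This is a minimal sketch, not spaCy's own code: `approx_shape` is a hypothetical helper, and it assumes that letters map to `x`/`X`, digits to `d`, and that runs of the same shape character are cut off after four.

```python
def approx_shape(text: str) -> str:
    """Approximate a spaCy-style word shape (hypothetical helper, not spaCy's API)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # Punctuation and other symbols are kept as-is
            shape_char = char
        # Count how long the current run of identical shape characters is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Assumption: runs longer than four characters are truncated
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["You", "n't", "spaCy", "package"]:
    print(word, approx_shape(word))
```

For these four words the sketch reproduces the shapes in the table above: `Xxx`, `x'x`, `xxxXx` and `xxxx`.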


### Named Entities

    Text: The original entity text.
    Start: Character offset of the start of the entity in the Doc's text.
    End: Character offset of the end of the entity in the Doc's text.
    Label: Entity label, i.e. type.

In [5]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
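
The numbers printed by `start_char` and `end_char` are character offsets into the original string, so each entity can be recovered with an ordinary slice (`ent.start` and `ent.end`, by contrast, are token indices). A small sketch that needs no model; the `spans` list is simply copied from the output above:

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# end_char is exclusive, exactly like the end of a Python slice
entities = [(text[start:end], label) for start, end, label in spans]

for entity_text, label in entities:
    print(entity_text, label)
```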


### Word Vectors and Similarity

1. Has vector: Does the token have a vector representation?
2. Vector norm: The L2 norm of the token's vector (the square root of the sum of the squared values).
3. OOV: Is the token out-of-vocabulary, i.e. does it lack a word vector?

In [7]:
# The medium model ships with word vectors; en_core_web_sm does not
nlp = spacy.load('en_core_web_md')
doc = nlp("Lion Tiger Milk Honey")

for token in doc:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

Lion True 6.5120897 False
Tiger True 6.518183 False
Milk True 7.3250523 False
Honey True 6.845015 False


In [8]:
# Pairwise similarity between every pair of tokens
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

Lion Lion 1.0
Lion Tiger 0.7359829
Lion Milk 0.2319908
Lion Honey 0.27584368
Tiger Lion 0.7359829
Tiger Tiger 1.0
Tiger Milk 0.20491147
Tiger Honey 0.308199
Milk Lion 0.2319908
Milk Tiger 0.20491147
Milk Milk 1.0
Milk Honey 0.64480776
Honey Lion 0.27584368
Honey Tiger 0.308199
Honey Milk 0.64480776
Honey Honey 1.0
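
By default, `similarity` is the cosine similarity of the two vectors, which in turn is built on the L2 norm described earlier. The sketch below works through both on toy data. The three vectors are made up for illustration and are not taken from `en_core_web_md`, and `l2_norm` and `cosine_similarity` are hypothetical helper names.

```python
import math


def l2_norm(vector):
    """Square root of the sum of the squared values."""
    return math.sqrt(sum(value * value for value in vector))


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


# Toy 3-dimensional vectors, made up for illustration
v1 = [1.0, 2.0, 2.0]
v2 = [2.0, 4.0, 4.0]   # same direction as v1, twice as long
v3 = [2.0, -1.0, 0.0]  # orthogonal to v1

print(l2_norm(v1))                # 3.0
print(cosine_similarity(v1, v2))  # 1.0
print(cosine_similarity(v1, v3))  # 0.0
```

Because the score depends only on direction, `v1` and `v2` score 1.0 despite their different lengths, which is also why every token above has similarity 1.0 with itself.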


In [9]:
# Serialize the whole pipeline to a bytestring
data = nlp.to_bytes()
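
A bytestring is only useful if it can be loaded back. The following is a minimal round-trip sketch, assuming a blank English pipeline (`spacy.blank("en")`) so that it runs without downloading a model. To restore a trained pipeline such as `en_core_web_md`, the receiving object would first need the same components, typically recreated from the saved `nlp.config`.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
data = nlp.to_bytes()

# Restore into a fresh pipeline with the same language and components
restored = spacy.blank("en").from_bytes(data)

doc = restored("The cat sat on the mat.")
tokens = [token.text for token in doc]
print(tokens)
```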