# spaCy Demo

In [159]:
# Import pandas (for tabular output), spaCy's English analyzer, and the
# location where spaCy's data is stored
import pandas as pd
from spacy.en import English, LOCAL_DATA_DIR

In [2]:
# What's the English object all about?
English?

In [3]:
# Set up spaCy NLP analyzer (tokenizer, parser, NER-er, etc.)
nlp_analyzer = English(data_dir=LOCAL_DATA_DIR)

In [4]:
# How does one interact with the analyzer object?
nlp_analyzer?

## Input

In [32]:
"""
The analyzer takes text as input, and its output contains every type of analysis
(parsing, tagging, and entity recognition can each be turned off by setting them
to False)

Here's some input text that we'll use:
"""
text = ["This is a very simple sentence.",
        "This sentence, which is moderately more complex, is still quite simple.",
        "The two preceding sentences are easy to understand, hopefully easy to parse too.",
        "These sentences will be correctly parsed and tokenized if the gods look favorably on this demo.",
        "I hope that strange words like vapidity and celerity don't confuse the analyser (nor British spellings).",
        "One would even hopes that ungrammatical sentences not effects the parsing drammatically."]
text = ' '.join(text)

"""
Let's analyze it and also time how long the analysis takes for a text of this
size
"""
%timeit nlp_analyzer(text)
analyzed_text = nlp_analyzer(text)

100 loops, best of 3: 4.21 ms per loop
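
Outside IPython the `%timeit` magic is not available, but the standard-library `timeit` module gives a comparable measurement. The following is a minimal sketch: `analyze` is a hypothetical stand-in for `nlp_analyzer` (it only splits on whitespace), so the numbers it prints say nothing about spaCy's speed.

```python
import timeit


def analyze(text):
    # Hypothetical stand-in for nlp_analyzer; replace with the real analyzer
    return text.split()


sample = "This is a very simple sentence."

# 3 repeats of 100 calls each, mirroring "100 loops, best of 3"
totals = timeit.repeat(lambda: analyze(sample), number=100, repeat=3)

# Each total covers 100 calls, so divide to get the time per call
best_per_loop = min(totals) / 100
print("100 loops, best of 3: {:.3g} s per loop".format(best_per_loop))
```

Taking the minimum rather than the mean is the usual convention, since slower runs mostly reflect interference from other processes.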


In [None]:
# Type "analyzed_text." and press Tab to browse the available attributes
analyzed_text.

## Sentence Recognition

In [33]:
"""
Let's take a look at what's in the output

The output is automatically divided into its constituent sentences (.sents
attribute); both the sentences and the full text are sequences of tokens
"""
for sent in analyzed_text.sents:
    print('{}\n'.format(sent))

This is a very simple sentence.

This sentence, which is moderately more complex, is still quite simple.

The two preceding sentences are easy to understand, hopefully easy to parse too.

These sentences will be correctly parsed and tokenized if the gods look favorably on this demo.

I hope that strange words like vapidity and celerity don't confuse the analyser (nor British spellings).

One would even hopes that ungrammatical sentences not effects the parsing drammatically.



### .sents

In [46]:
"""
The .sents attribute is a generator that yields one object for each recognized
sentence
"""
sent = next(analyzed_text.sents)

In [52]:
"""
Each sentence is of type spacy.tokens.span.Span, which is basically just a sequence
of token objects (more on that later)

Here you can see the type of the objects
"""
type(sent)

spacy.tokens.span.Span
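
Because `.sents` is a generator, sentences are produced one at a time, and picking one by position means materializing it with `list()` first. The sketch below shows that behaviour in plain Python; `toy_sents` is a hypothetical stand-in that splits naively on ". " and is not how spaCy detects sentence boundaries.

```python
def toy_sents(text):
    # Hypothetical stand-in for .sents: yields one sentence at a time
    for piece in text.split(". "):
        yield piece if piece.endswith(".") else piece + "."


gen = toy_sents("This is a very simple sentence. This one is second.")

first = next(gen)   # pulls only the first sentence
rest = list(gen)    # materializes whatever has not been consumed yet

print(first)
print(rest)
print(list(gen))    # the generator is now exhausted
```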

### .string or .orth_

In [82]:
"""
To get the string representation of any object (not just a sentence), e.g., the
original token, the original sentence, or the lemma, use the .string or .orth_
attribute
"""
sent.orth_

'This is a very simple sentence.'

In [84]:
# The .string attribute includes the trailing whitespace
sent.string

'This is a very simple sentence. '

## Miscellaneous String Attributes

### .is_alpha, .is_oov, .is_space, .like_email, .is_title, etc.

In [186]:
"""
Various pieces of information can be collected about each object representing a
token.
"""
lines = []
for token in sent:
    lines.append(dict(Token=token.orth_, letter=token.is_alpha, ASCII=token.is_ascii,
                      digit=token.is_digit, lower=token.is_lower, OOV=token.is_oov,
                      punct=token.is_punct, space=token.is_space, stop=token.is_stop,
                      titlecase=token.is_title, like_email=token.like_email,
                      like_number=token.like_num, like_url=token.like_url,
                      shape=token.shape_, prefix=token.prefix_, suffix=token.suffix_,
                      lowercased=token.lower_))
pd.DataFrame(lines)

| | ASCII | OOV | Token | digit | letter | like_email | like_number | like_url | lower | lowercased | prefix | punct | shape | space | stop | suffix | titlecase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | False | This | False | True | False | False | False | False | this | T | False | Xxxx | False | True | his | True |
| 1 | True | False | is | False | True | False | False | False | True | is | i | False | xx | False | True | is | False |
| 2 | True | False | a | False | True | False | False | False | True | a | a | False | x | False | True | a | False |
| 3 | True | False | very | False | True | False | False | False | True | very | v | False | xxxx | False | True | ery | False |
| 4 | True | False | simple | False | True | False | False | False | True | simple | s | False | xxxx | False | False | ple | False |
| 5 | True | False | sentence | False | True | False | False | False | True | sentence | s | False | xxxx | False | False | nce | False |
| 6 | True | False | . | False | False | False | False | False | False | . | . | True | . | False | False | . | False |
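
The `shape`, `prefix`, and `suffix` columns follow a simple pattern that can be reproduced by hand. The sketch below is an approximation inferred from the values in the table, not spaCy's implementation: `word_shape`, `prefix`, and `suffix` are hypothetical helpers, and the run limit of 4 is an assumption based on "sentence" mapping to "xxxx".

```python
def word_shape(word, max_run=4):
    # Map each character to a class symbol and cap runs of the same symbol
    out = []
    run = 0
    for ch in word:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation and other characters are kept as-is
        run = run + 1 if out and out[-1] == sym else 1
        if run <= max_run:
            out.append(sym)
    return "".join(out)


def prefix(word):
    # First character, as in the "prefix" column
    return word[:1]


def suffix(word):
    # Last three characters, as in the "suffix" column
    return word[-3:]


for w in ["This", "is", "sentence", "."]:
    print(w, word_shape(w), prefix(w), suffix(w))
```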


### .doc attribute

In [60]:
"""
If you want the whole document that the sentence occurred in, use the .doc attribute.
"""
sent.doc

This is a very simple sentence. This sentence, which is moderately more complex, is still quite simple. The two preceding sentences are easy to understand, hopefully easy to parse too. These sentences will be correctly parsed and tokenized if the gods look favorably on this demo. I hope that strange words like vapidity and celerity don't confuse the analyser (nor British spellings). One would even hopes that ungrammatical sentences not effects the parsing drammatically.

## Lemmatization

### .lemma_

In [68]:
"""
A lemmatized version of the object can be accessed via the .lemma_ attribute
"""
sent.lemma_

'this be a very simple sentence .'

## Parts of Speech and Tags

In [181]:
tokens = []
for token in sent:
    tokens.append(dict(Token=token.orth_, tag=token.tag_, part_of_speech=token.pos_))
pd.DataFrame(tokens)

| | Token | part_of_speech | tag |
|---|---|---|---|
| 0 | This | DET | DT |
| 1 | is | VERB | VBZ |
| 2 | a | DET | DT |
| 3 | very | ADV | RB |
| 4 | simple | ADJ | JJ |
| 5 | sentence | NOUN | NN |
| 6 | . | PUNCT | . |
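
Here `.tag_` is the fine-grained tag and `.pos_` the coarser part of speech. As a small sketch of how the two relate, the rows of the table above can be turned into a tag-to-part-of-speech lookup and grouped by part of speech. The data is copied from the table; the variable names are illustrative only.

```python
from collections import defaultdict

# (token, fine-grained tag, coarse part of speech), copied from the table
rows = [
    ("This", "DT", "DET"),
    ("is", "VBZ", "VERB"),
    ("a", "DT", "DET"),
    ("very", "RB", "ADV"),
    ("simple", "JJ", "ADJ"),
    ("sentence", "NN", "NOUN"),
    (".", ".", "PUNCT"),
]

# Each fine-grained tag observed here maps to exactly one coarse label
tag_to_pos = {tag: pos for _, tag, pos in rows}

# Group the tokens by their coarse part of speech
by_pos = defaultdict(list)
for token, _, pos in rows:
    by_pos[pos].append(token)

print(tag_to_pos)
print(dict(by_pos))
```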


## Parsing

### .root

In [128]:
print("sentence = {}".format(sent.orth_))
print("root of sentence = {}".format(sent.root))

sentence = This is a very simple sentence.
root of sentence = is 


### .children, .dep_ attributes

In [126]:
"""
Parse tree-related attributes can be accessed for each token, such as the
token's children, its head (parent), and its dependency relationship
"""
token = sent[1]
print("sentence = {}".format(sent.orth_))
print("token: {}".format(token))
print("children: {}".format(list(token.children)))
print("head: {}".format(token.head))
print("dependency relationship: {}".format(token.dep_))


sentence = This is a very simple sentence.
token: is 
children: [This , sentence, . ]
head: is 
dependency relationship: ROOT
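
A dependency parse can be thought of as one head index per token, with the root pointing to itself (as the output above shows for "is"). The sketch below encodes the example sentence that way in plain Python. The heads of "This", "sentence", and "." follow from the printed children of "is"; the heads chosen for "a", "very", and "simple" are assumptions, since the notebook does not print them.

```python
tokens = ["This", "is", "a", "very", "simple", "sentence", "."]

# heads[i] is the index of token i's head; the root is its own head.
# Entries for "a", "very" and "simple" (indices 2-4) are assumed.
heads = [1, 1, 5, 4, 5, 1, 1]


def children(i):
    # All tokens whose head is token i (excluding the root's self-loop)
    return [j for j, h in enumerate(heads) if h == i and j != i]


def depth(i):
    # Number of head links between token i and the root
    hops = 0
    while heads[i] != i:
        i = heads[i]
        hops += 1
    return hops


root = next(i for i, h in enumerate(heads) if h == i)
print("root:", tokens[root])
print("children of root:", [tokens[j] for j in children(root)])
print("depth of 'very':", depth(3))
```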


## Word Representation Vectors

### .repvec, .has\_vector, .similarity()

In [201]:
"""
Representing words as vectors allows for similarity calculations.
"""
last_sentence = list(analyzed_text.sents)[-1]
last_sentence.string

'One would even hopes that ungrammatical sentences not effects the parsing drammatically.'

In [235]:
token1 = last_sentence[5]
token1

ungrammatical 

In [236]:
# Does this token have a vector?
token1.has_vector

True

In [204]:
token1.repvec

array([ 0.11843595, -0.06623109,  0.04538647,  0.02884347, -0.01294969,
       -0.0082486 , -0.0888363 , -0.00583113,  0.00400044, -0.04227782,
        0.12704024,  0.0674548 , -0.05201958, -0.01820735,  0.01730199,
       -0.06442036, -0.10153879, -0.11339349,  0.04282263, -0.03628189,
        0.01757631, -0.08853921,  0.05862839, -0.06814495,  0.03264917,
        0.10444937,  0.00236884, -0.11556426,  0.08706038,  0.00214826,
        0.00678102, -0.02899291, -0.03419275,  0.03446937, -0.01230073,
        0.01153022, -0.01424529,  0.00808482, -0.02617856,  0.1339121 ,
        0.02390695,  0.03199203,  0.03178117, -0.06182787,  0.01291667,
        0.00363886,  0.04186788,  0.01629478, -0.05109784,  0.05638494,
       -0.05947157, -0.03952258,  0.06754078, -0.04384648, -0.06870665,
        0.15238763,  0.10236482, -0.12464427,  0.06899939, -0.04538237,
       -0.03409704,  0.02009485, -0.05656509, -0.02657468,  0.0268659 ,
       -0.00144608, -0.0471281 , -0.05084553,  0.03195825,  0.02

In [205]:
token2 = last_sentence[6]
token2

sentences 

In [206]:
# How similar are "ungrammatical" and "sentences"?
token1.similarity(token2)

0.3955893739185366
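
A similarity score like this is typically a cosine similarity between the two word vectors, which is also the measure this notebook uses for the analogy example further down. As a minimal sketch of that calculation using only the standard library, with made-up three-dimensional vectors rather than real word vectors:

```python
import math


def cosine(v1, v2):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


# Toy vectors (made up for illustration)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine(a, b))
print(cosine(a, c))
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score 0, which is why the measure suits vectors whose magnitudes vary.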

In [207]:
# How similar are two other random words?
token3 = last_sentence[8]
token3

effects 

In [213]:
token4 = last_sentence[1]
token4

would 

In [214]:
token3.similarity(token4)

0.23203454697543943

In [215]:
# The similarity is not as low as one might expect for two unrelated words,
# but it is still lower than for the previous pair (0.23 vs. 0.40)

## Log Probabilities and Brown Cluster IDs

### .prob, .cluster

In [218]:
for i, token in enumerate(sent):
    print("original:", token.orth_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")

original: This
log probability: -6.78391695022583
Brown cluster id: 382
----------------------------------------
original: is
log probability: -4.457748889923096
Brown cluster id: 762
----------------------------------------
original: a
log probability: -3.92978835105896
Brown cluster id: 19
----------------------------------------
original: very
log probability: -6.93242883682251
Brown cluster id: 234
----------------------------------------
original: simple
log probability: -9.069649696350098
Brown cluster id: 551
----------------------------------------
original: sentence
log probability: -10.146957397460938
Brown cluster id: 14309
----------------------------------------
original: .
log probability: -3.0678977966308594
Brown cluster id: 8
----------------------------------------
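
Log probabilities are easier to interpret once converted back. The sketch below assumes the values are natural logarithms (the notebook does not state the base) and uses three of the values printed above.

```python
import math

# Values copied from the output above
log_probs = {
    "a": -3.92978835105896,
    "sentence": -10.146957397460938,
    ".": -3.0678977966308594,
}

# Assuming natural logs, exp() recovers the estimated probability
probs = {word: math.exp(lp) for word, lp in log_probs.items()}

# A difference of log probabilities is the log of the frequency ratio
ratio = math.exp(log_probs["a"] - log_probs["sentence"])

for word, p in probs.items():
    print("{!r}: {:.6f}".format(word, p))
print("'a' is about {:.0f} times as likely as 'sentence'".format(ratio))
```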


## NER

### .ents, .ent\_label, .ent\_type_, .ent\_iob_

In [230]:
# Get a list of the entities directly with .ents
analyzed_text.ents

(two, British)

In [233]:
# Let's print out only those tokens in the example text that are entities
for token in analyzed_text:
    if token.ent_type_ != "":
        print(token.orth_, token.ent_type_)

two CARDINAL
British NORP
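
The heading above also lists `.ent_iob_`, which is not exercised in the cells. In IOB tagging each token is marked B (begins an entity), I (inside an entity), or O (outside any entity), which is what allows multi-token entities to be reassembled. The sketch below works on hand-written tuples rather than spaCy objects: `spans_from_iob` is a hypothetical helper, and the "New York" tokens are made-up data added to show a two-token entity.

```python
def spans_from_iob(tagged):
    """Group (text, iob, label) triples into (entity text, label) pairs."""
    spans = []
    current = None  # (list of words, label) for the entity being built
    for text, iob, label in tagged:
        if iob == "B":
            if current:
                spans.append(current)
            current = ([text], label)
        elif iob == "I" and current:
            current[0].append(text)
        else:
            # "O" (or a stray "I") closes any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(words), label) for words, label in spans]


tagged = [
    ("The", "O", ""),
    ("two", "B", "CARDINAL"),
    ("preceding", "O", ""),
    ("New", "B", "GPE"),    # made-up example tokens
    ("York", "I", "GPE"),
]
entities = spans_from_iob(tagged)
print(entities)
```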

## Can Handle Messy Data

In [239]:
messy_data = "lol that is rly funny :) This is gr8 i rate it 8/8!!!"
analyzed_messy_data = nlp_analyzer(messy_data)
for token in analyzed_messy_data:
    print(token.orth_, token.pos_, token.lemma_)

lol NOUN lol
that ADJ that
is VERB be
rly ADV rly
funny ADJ funny
:) PUNCT :)
This DET this
is VERB be
gr8 VERB gr8
i NOUN i
rate VERB rate
it NOUN it
8/8 NUM 8/8
! PUNCT !
! PUNCT !
! PUNCT !


## Access to the Vocabulary

In [256]:
# The vocabulary that the analyzer uses can be accessed directly (and,
# like almost every other component of the system, it can also be
# customized)
vocab = nlp_analyzer.vocab
vocab.length

1297484

In [259]:
# Any word that's in the vocabulary can be looked up and then
# interacted with
vapid = vocab['vapid']

In [261]:
vapid.similarity(vocab['senseless'])

0.68950913894393639

In [286]:
# Credit: https://nicschrading.com/project/Intro-to-NLP-with-spaCy/
# Let's see if it can figure out this analogy
# Man is to King as Woman is to ??
from numpy import dot
from numpy.linalg import norm

# Cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

king = nlp_analyzer.vocab['king']
man = nlp_analyzer.vocab['man']
woman = nlp_analyzer.vocab['woman']

result = king.repvec - man.repvec + woman.repvec

# Gather all known words, take only the lowercased versions
all_words = list({w for w in nlp_analyzer.vocab
                  if w.has_vector
                     and w.orth_.islower()
                     and w.lower_ != "king"
                     and w.lower_ != "man"
                     and w.lower_ != "woman"})

# Sort by similarity to the result
all_words.sort(key=lambda w: cosine(w.repvec, result), reverse=True)
print("Top 3 closest results for king - man + woman:\n")
for word in all_words[:3]:
    print("\t{}".format(word.orth_))

Top 3 closest results for king - man + woman:

	queen
	monarch
	princess
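
The arithmetic behind the analogy is easier to follow on a toy example. The sketch below uses made-up two-dimensional vectors (one axis loosely standing for "royalty", the other for "gender"), so the ranking it produces illustrates the mechanics only and says nothing about real word vectors.

```python
import math


def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))


# Made-up vectors: [royalty, gender]
vectors = {
    "king": [1.0, 1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "queen": [1.0, -1.0],
    "princess": [0.7, -1.0],
    "monarch": [1.0, 0.0],
    "apple": [-1.0, 0.2],
}

# king - man + woman, computed component by component
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank every other word by similarity to the result vector
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
ranked = sorted(candidates, key=lambda w: cosine(vectors[w], result),
                reverse=True)
print(ranked)
```

Excluding the three input words matters: without that filter, one of the inputs is often the nearest neighbour of the result.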


In [268]:
# Most of the methods/attributes that we've been using can also be used in
# "standalone" mode, directly on the analyzer object, and further attributes
# of the analyzer's components can be inspected

In [262]:
nlp_analyzer.like_email("mulhodm@gmail.com")

True
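
A check like this can be approximated with a regular expression. The sketch below is a deliberately simplified pattern and is not spaCy's implementation: `looks_like_email` and `EMAIL_RE` are hypothetical names, and real-world address validation is considerably more permissive than this.

```python
import re

# Simplified pattern: local part, "@", then a domain with at least one dot
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def looks_like_email(text):
    # True if the whole string matches the simplified pattern
    return EMAIL_RE.match(text) is not None


for candidate in ["someone@example.com", "not an email", "a@b"]:
    print(candidate, looks_like_email(candidate))
```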

In [275]:
nlp_analyzer.tagger.tag_names

('""',
 '#',
 '$',
 "''",
 ',',
 '-LRB-',
 '-RRB-',
 '.',
 ':',
 'ADD',
 'AFX',
 'BES',
 'CC',
 'CD',
 'DT',
 'EX',
 'FW',
 'GW',
 'HVS',
 'HYPH',
 'IN',
 'JJ',
 'JJR',
 'JJS',
 'LS',
 'MD',
 'NFP',
 'NIL',
 'NN',
 'NNP',
 'NNPS',
 'NNS',
 'PDT',
 'POS',
 'PRP',
 'PRP$',
 'RB',
 'RBR',
 'RBS',
 'RP',
 'SP',
 'SYM',
 'TO',
 'UH',
 'VB',
 'VBD',
 'VBG',
 'VBN',
 'VBP',
 'VBZ',
 'WDT',
 'WP',
 'WP$',
 'WRB',
 'XX',
 '``')

# Sources, Links to Guides

1. [spaCy home page](https://spacy.io/) - [tutorials section](http://spacy.io/docs/#tutorials)
2. [Nic Schrading's Intro to NLP with spaCy](https://nicschrading.com/project/Intro-to-NLP-with-spaCy/), a fantastic guide (which I stole from a little)

## End