# spaCy Demo

In [159]:
# Import pandas (for tabular display), spaCy's English analyzer, and the location where its data is stored
import pandas as pd
from spacy.en import English, LOCAL_DATA_DIR

In [2]:
# What's the English object all about?
English?

In [3]:
# Set up spaCy NLP analyzer (tokenizer, parser, NER-er, etc.)
nlp_analyzer = English(data_dir=LOCAL_DATA_DIR)

In [4]:
# How does one interact with the analyzer object?
nlp_analyzer?

## Input

In [32]:
"""
The analyzer takes text as input. By default the output includes every type of
analysis; parsing, tagging, and entity recognition can each be turned off by
setting them to False.

Here's some input text that we'll use:
"""
text = ["This is a very simple sentence.",
        "This sentence, which is moderately more complex, is still quite simple.",
        "The two preceding sentences are easy to understand, hopefully easy to parse too.",
        "These sentences will be correctly parsed and tokenized if the gods look favorably on this demo.",
        "I hope that strange words like vapidity and celerity don't confuse the analyser (nor British spellings).",
        "One would even hopes that ungrammatical sentences not effects the parsing drammatically."]
text = ' '.join(text)

"""
Let's analyze it and time how long a text of this size takes to analyze
"""
%timeit nlp_analyzer(text)
analyzed_text = nlp_analyzer(text)

100 loops, best of 3: 4.21 ms per loop
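
`%timeit` is an IPython magic, so the timing cell above only runs inside a notebook. In a plain script, a comparable measurement can be sketched with the standard-library `timeit` module. The `analyze` function and `sample_text` below are hypothetical stand-ins (a whitespace split over a repeated sentence) so the snippet runs on its own; in the notebook they would be `nlp_analyzer` and `text`.

```python
import timeit


def analyze(text):
    # Hypothetical stand-in for nlp_analyzer: a plain whitespace split
    return text.split()


sample_text = "This is a very simple sentence. " * 6

# Mirror %timeit's "100 loops, best of 3": 3 runs of 100 calls each
runs = timeit.repeat(lambda: analyze(sample_text), number=100, repeat=3)

# Report the fastest run, converted to milliseconds per call
best_ms_per_loop = min(runs) / 100 * 1000
print("100 loops, best of 3: {:.4f} ms per loop".format(best_ms_per_loop))
```

Taking the minimum of the runs, as `%timeit` does, reduces the influence of unrelated background activity on the reported time.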


In [None]:
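# List the public attributes of the analyzed document (or type `analyzed_text.`
# and press Tab in the notebook to browse them interactively)
[name for name in dir(analyzed_text) if not name.startswith('_')]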

## Sentence Recognition

In [33]:
"""
Let's take a look at what's in the output.

The output is automatically split into its constituent sentences (the .sents
attribute), and both the sentences and the full text are sequences of tokens.
"""
for sent in analyzed_text.sents:
    print('{}\n'.format(sent))

This is a very simple sentence.

This sentence, which is moderately more complex, is still quite simple.

The two preceding sentences are easy to understand, hopefully easy to parse too.

These sentences will be correctly parsed and tokenized if the gods look favorably on this demo.

I hope that strange words like vapidity and celerity don't confuse the analyser (nor British spellings).

One would even hopes that ungrammatical sentences not effects the parsing drammatically.



### .sents

In [46]:
"""
The .sents attribute is a generator that yields one object per recognized
sentence
"""
sent = list(analyzed_text.sents)[0]
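
Because `.sents` is a generator, `list(...)[0]` builds every sentence object just to take the first one. A minimal sketch of the lazier alternatives, using a hypothetical stand-in generator (`fake_sents`) rather than a real spaCy document:

```python
from itertools import islice


def fake_sents():
    # Hypothetical stand-in for analyzed_text.sents
    yield "This is a very simple sentence."
    yield "This sentence, which is moderately more complex, is still quite simple."
    yield "The two preceding sentences are easy to understand, hopefully easy to parse too."


# Take only the first sentence without building the full list
first = next(fake_sents())

# Take the first two sentences lazily
first_two = list(islice(fake_sents(), 2))

# A generator has no len() and cannot be indexed; convert it to a list for that
all_sents = list(fake_sents())

print(first)
print(len(first_two), len(all_sents))
```

In the notebook, the equivalent of the first case would presumably be `next(iter(analyzed_text.sents))`.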

In [52]:
"""
Each sentence is of type spacy.tokens.span.Span, which is basically just a sequence
of token objects (more on that later)

Here you can see the type of the objects
"""
type(sent)

spacy.tokens.span.Span

### .string or .orth_

In [82]:
"""
To get the original text of an object (a token, a sentence, etc.), use the
.orth_ or .string attributes
"""
sent.orth_

'This is a very simple sentence.'

In [84]:
# The .string attribute includes trailing whitespace
sent.string

'This is a very simple sentence. '
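
The trailing space in `.string` is what lets the original text be rebuilt from its pieces: each piece keeps the whitespace that followed it. A toy sketch with hand-built `(text, trailing_whitespace)` pairs; these pairs are illustrative and are not produced by spaCy here:

```python
# Illustrative (text, trailing whitespace) pairs for the first sentence
pieces = [("This", " "), ("is", " "), ("a", " "), ("very", " "),
          ("simple", " "), ("sentence", ""), (".", " ")]

# Joining each text with its trailing whitespace gives the .string-style form
with_ws = "".join(text + ws for text, ws in pieces)

# Dropping the final whitespace gives the .orth_-style form
without_ws = with_ws.rstrip()

print(repr(with_ws))
print(repr(without_ws))
```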

## Miscellaneous String Attributes

### .is_alpha, .is_oov, .is_space, .like_email, .is_title, etc.

In [186]:
"""
Each token object exposes a range of boolean flags and string features.
"""
lines = []
for token in sent:
    lines.append(dict(Token=token.orth_, letter=token.is_alpha, ASCII=token.is_ascii,
                      digit=token.is_digit, lower=token.is_lower, OOV=token.is_oov,
                      punct=token.is_punct, space=token.is_space, stop=token.is_stop,
                      titlecase=token.is_title, like_email=token.like_email,
                      like_number=token.like_num, like_url=token.like_url,
                      shape=token.shape_, prefix=token.prefix_, suffix=token.suffix_,
                      lowercased=token.lower_))
pd.DataFrame(lines)

| | ASCII | OOV | Token | digit | letter | like_email | like_number | like_url | lower | lowercased | prefix | punct | shape | space | stop | suffix | titlecase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | False | This | False | True | False | False | False | False | this | T | False | Xxxx | False | True | his | True |
| 1 | True | False | is | False | True | False | False | False | True | is | i | False | xx | False | True | is | False |
| 2 | True | False | a | False | True | False | False | False | True | a | a | False | x | False | True | a | False |
| 3 | True | False | very | False | True | False | False | False | True | very | v | False | xxxx | False | True | ery | False |
| 4 | True | False | simple | False | True | False | False | False | True | simple | s | False | xxxx | False | False | ple | False |
| 5 | True | False | sentence | False | True | False | False | False | True | sentence | s | False | xxxx | False | False | nce | False |
| 6 | True | False | . | False | False | False | False | False | False | . | . | True | . | False | False | . | False |
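
The `shape` column abstracts away the actual characters of a token: uppercase letters become `X`, lowercase letters `x`, and long runs appear to be cut off after four characters (`simple` and `sentence` both give `xxxx`). The function below is an approximation written to match the table above, with digits mapped to `d` as an extra assumption; it is not spaCy's own implementation.

```python
def word_shape(text, max_run=4):
    # Approximate word shape: map characters to classes and cap repeated runs
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"  # assumption: digits map to "d"
        else:
            shape_char = char  # punctuation is kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Keep at most max_run consecutive copies of the same class
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["This", "is", "a", "very", "simple", "sentence", "."]:
    print(word, word_shape(word))
```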


### .doc attribute

In [60]:
"""
If you want the whole document that the sentence occurred in, use the .doc attribute.
"""
sent.doc

This is a very simple sentence. This sentence, which is moderately more complex, is still quite simple. The two preceding sentences are easy to understand, hopefully easy to parse too. These sentences will be correctly parsed and tokenized if the gods look favorably on this demo. I hope that strange words like vapidity and celerity don't confuse the analyser (nor British spellings). One would even hopes that ungrammatical sentences not effects the parsing drammatically.

## Lemmatization

### .lemma_

In [68]:
"""
A lemmatized version of the object can be accessed via the .lemma_ attribute
"""
sent.lemma_

'this be a very simple sentence .'

## Parts of Speech and Tags

In [181]:
# Each token carries a fine-grained tag (.tag_) and a coarse part of speech (.pos_)
tokens = []
for token in sent:
    tokens.append(dict(Token=token.orth_, tag=token.tag_, part_of_speech=token.pos_))
pd.DataFrame(tokens)

| | Token | part_of_speech | tag |
|---|---|---|---|
| 0 | This | DET | DT |
| 1 | is | VERB | VBZ |
| 2 | a | DET | DT |
| 3 | very | ADV | RB |
| 4 | simple | ADJ | JJ |
| 5 | sentence | NOUN | NN |
| 6 | . | PUNCT | . |
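
The two columns are related: `tag` is the fine-grained tag and `part_of_speech` the coarser category it falls under. A small sketch that groups the tokens by coarse category and records the tag-to-category pairs seen in this one sentence; the rows are copied from the table above, and the resulting mapping covers only this sample, not the full tag set.

```python
from collections import defaultdict

# (token, fine-grained tag, coarse part of speech) rows copied from the table
rows = [("This", "DT", "DET"), ("is", "VBZ", "VERB"), ("a", "DT", "DET"),
        ("very", "RB", "ADV"), ("simple", "JJ", "ADJ"),
        ("sentence", "NN", "NOUN"), (".", ".", "PUNCT")]

# Group tokens by coarse part of speech
by_pos = defaultdict(list)
for token, tag, pos in rows:
    by_pos[pos].append(token)

# Record which coarse category each fine-grained tag maps to in this sample
tag_to_pos = {tag: pos for _, tag, pos in rows}

print(dict(by_pos))
print(tag_to_pos)
```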


## Parsing

### .root

In [128]:
print("sentence = {}".format(sent.orth_))
print("root of sentence = {}".format(sent.root))

sentence = This is a very simple sentence.
root of sentence = is 


### .children, .dep_ attributes

In [126]:
"""
Parse-tree attributes can be accessed for each token, such as its children, its
head (parent), and its dependency relation to that head.
"""
token = sent[1]
print("sentence = {}".format(sent.orth_))
print("token: {}".format(token))
print("children: {}".format(list(token.children)))
print("head: {}".format(token.head))
print("dependency relationship: {}".format(token.dep_))


sentence = This is a very simple sentence.
token: is 
children: [This , sentence, . ]
head: is 
dependency relationship: ROOT
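
The output above shows that the root token is its own head. A dependency parse can therefore be sketched as a list of head indices, from which the root, the children, and whole subtrees follow. In the sketch below, the heads of `This`, `sentence`, and `.` come from the output above; the heads assigned to `a`, `very`, and `simple` are assumptions made for illustration, and `subtree` is a hypothetical helper.

```python
words = ["This", "is", "a", "very", "simple", "sentence", "."]

# heads[i] is the index of the head of words[i]; the root points at itself.
# Heads for "a", "very", and "simple" are assumed, not taken from the output.
heads = [1, 1, 5, 4, 5, 1, 1]

# The root is the token that is its own head
root = next(i for i, h in enumerate(heads) if i == h)

# Invert the head pointers to get each token's children
children = {i: [] for i in range(len(words))}
for i, h in enumerate(heads):
    if i != h:
        children[h].append(i)


def subtree(i):
    # Hypothetical helper: indices of token i and all its descendants, in order
    result = [i]
    for child in children[i]:
        result.extend(subtree(child))
    return sorted(result)


print("root:", words[root])
print("children of root:", [words[c] for c in children[root]])
print("subtree of 'sentence':", [words[j] for j in subtree(5)])
```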


In [185]:
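# Each token also carries lexical statistics: a log probability (.prob) and a
# Brown cluster id (.cluster). Print them for the first three tokens.
for i, token in enumerate(sent):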
    print("original:", token.orth_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")
    if i > 1:
        break

original: This
log probability: -6.78391695022583
Brown cluster id: 382
----------------------------------------
original: is
log probability: -4.457748889923096
Brown cluster id: 762
----------------------------------------
original: a
log probability: -3.92978835105896
Brown cluster id: 19
----------------------------------------
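
Log probabilities are easier to compare once converted back to plain probabilities. A short sketch using the three values printed above, assuming they are natural logarithms (the output does not state the base):

```python
import math

# Log probabilities copied from the output above
log_probs = {"This": -6.78391695022583,
             "is": -4.457748889923096,
             "a": -3.92978835105896}

# Assumption: the values are natural logs, so exp() recovers the probability
probs = {word: math.exp(lp) for word, lp in log_probs.items()}

# Print from most to least probable
for word in sorted(probs, key=probs.get, reverse=True):
    print("{:>5}: {:.5f}".format(word, probs[word]))
```

A less negative log probability means a more frequent word, so `a` ranks above `is`, which ranks above `This`.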
