# Intro to spaCy

In [2]:
!pip3 install spacy

Collecting spacy
  Downloading spacy-1.8.2.tar.gz (3.3MB)
Collecting murmurhash<0.27,>=0.26 (from spacy)
  Downloading murmurhash-0.26.4.tar.gz
Collecting cymem<1.32,>=1.30 (from spacy)
  Downloading cymem-1.31.2.tar.gz
Collecting preshed<2.0.0,>=1.0.0 (from spacy)
  Downloading preshed-1.0.0.tar.gz (89kB)
Collecting thinc<6.6.0,>=6.5.0 (from spacy)
  Downloading thinc-6.5.2.tar.gz (926kB)
Collecting plac<1.0.0,>=0.9.6 (from spacy)
  Downloading plac-0.9.6-py2.py3-none-any.whl
Collecting pathlib (from spacy)
  Downloading pathlib-1.0.1.tar.gz (49kB)
Collecting dill<0.3,>=0.2 (from spacy)
  Downloading dill-0.2.7.tar.gz (64kB)

## [Spacy Documentation](https://spacy.io/docs)

spaCy is an NLP/computational linguistics package built from the ground up. It's written in Cython, so it's fast.

Let's check it out. Here's some text from [Alice in Wonderland](https://www.gutenberg.org/files/11/11-h/11-h.htm), available for free on Project Gutenberg.

In [3]:
text = """'Please would you tell me,' said Alice, a little timidly, for she was not quite sure whether it was good manners for her to speak first, 'why your cat grins like that?'
'It's a Cheshire cat,' said the Duchess, 'and that's why. Pig!'
She said the last word with such sudden violence that Alice quite jumped; but she saw in another moment that it was addressed to the baby, and not to her, so she took courage, and went on again:—
'I didn't know that Cheshire cats always grinned; in fact, I didn't know that cats could grin.'
'They all can,' said the Duchess; 'and most of 'em do.'
'I don't know of any that do,' Alice said very politely, feeling quite pleased to have got into a conversation.
'You don't know much,' said the Duchess; 'and that's a fact.'"""

Download and load the model. spaCy has an excellent English NLP processor. It has the following features, which we shall explore:
- Entity recognition
- Dependency Parsing
- Part of Speech tagging
- Word Vectorization
- Tokenization
- Lemmatization
- Noun Chunks

## Download the Model (it may take a while)

Install the English model, which enables parsing and other NLP features. We're not training a model; we're using one that someone else has already trained.

In [8]:
!python3 -m spacy download en


    Downloading en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz (52.2MB)


In [2]:
import spacy
# import spacy.en.download
# spacy.en.download.download.en()


In [5]:
processor = spacy.load('en')
#processor = spacy.en.English()

In [6]:
processed_text = processor(text)
processed_text

'Please would you tell me,' said Alice, a little timidly, for she was not quite sure whether it was good manners for her to speak first, 'why your cat grins like that?'
'It's a Cheshire cat,' said the Duchess, 'and that's why. Pig!'
She said the last word with such sudden violence that Alice quite jumped; but she saw in another moment that it was addressed to the baby, and not to her, so she took courage, and went on again:—
'I didn't know that Cheshire cats always grinned; in fact, I didn't know that cats could grin.'
'They all can,' said the Duchess; 'and most of 'em do.'
'I don't know of any that do,' Alice said very politely, feeling quite pleased to have got into a conversation.
'You don't know much,' said the Duchess; 'and that's a fact.'

Looks like the same text? Let's dig a little deeper.

## Tokenization

### Sentence Tokenizing

In [21]:
for n, sentence in enumerate(processed_text.sents):
    print(n, sentence)

0 'Please would you tell me,' said Alice, a little timidly, for she was not quite sure whether it was good manners for her to speak first, 'why your cat grins like that?'
'It's a Cheshire cat,' said the Duchess, 'and that's why.
1 Pig!'

2 She said the last word with such sudden violence that Alice quite jumped; but she saw in another moment that it was addressed to the baby, and not to her, so she took courage, and went on again:—
'I didn't know that Cheshire cats always grinned; in fact, I didn't know that cats could grin.'
'They all can,' said the Duchess; 'and most of 'em do.'
'I don't know of any that do,' Alice said very politely, feeling quite pleased to have got into a conversation.
'
3 You don't know much,' said the Duchess; 'and that's a fact.'


The preceding code breaks the Alice in Wonderland text into sentences. As the first (very long) sentence, indexed at 0, shows, spaCy keeps quoted speech together and waits for sentence-ending punctuation; it does not split on commas or semicolons.

### Words and Punctuation - Along with POS tagging

In [10]:
n = 0
for sentence in processed_text.sents:
    for token in sentence:
        print(n, token, token.pos_, token.lemma_, token.nbor())
        n+=1

0 ' PUNCT ' Please
1 Please INTJ please would
2 would VERB would you
3 you PRON -PRON- tell
4 tell VERB tell me
5 me PRON -PRON- ,
6 , PUNCT , '
7 ' PUNCT ' said
8 said VERB say Alice
9 Alice PROPN alice ,
10 , PUNCT , a
11 a DET a little
12 little ADJ little timidly
13 timidly ADV timidly ,
14 , PUNCT , for
15 for ADP for she
16 she PRON -PRON- was
17 was VERB be not
18 not ADV not quite
19 quite ADV quite sure
20 sure ADJ sure whether
21 whether ADP whether it
22 it PRON -PRON- was
23 was VERB be good
24 good ADJ good manners
25 manners NOUN manner for
26 for ADP for her
27 her PRON -PRON- to
28 to PART to speak
29 speak VERB speak first
30 first ADV first ,
31 , PUNCT , '
32 ' PUNCT ' why
33 why ADV why your
34 your ADJ -PRON- cat
35 cat NOUN cat grins
36 grins VERB grin like
37 like ADP like that
38 that DET that ?
39 ? PUNCT ? '
40 ' PUNCT ' 

41 
 SPACE 
 '
42 ' PUNCT ' It
43 It PRON -PRON- 's
44 's VERB be a
45 a DET a Cheshire
46 Cheshire PROPN cheshire cat
47 cat NOUN cat ,
48

IndexError: list index out of range

Next we separate the text into words, some of which have variants because of apostrophes or attached punctuation. We loop over the "tokens" in each sentence (a word, a piece of a word such as 's, or a punctuation mark) and print the running index (n), the token itself, its part of speech (token.pos_), its lemma or root form (token.lemma_, e.g., 17 was VERB be or 76 jumped VERB jump), and its right-hand neighbor (token.nbor(), e.g., 12 little ADJ little timidly). The IndexError at the end of the output most likely comes from token.nbor(): the last token in the document has no following token to return. Tokens carry many other attributes as well, such as the language, whether the token looks like a URL or a number, prefixes, and so on.
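One way to avoid running off the end of the document is to check the token's index before asking for its neighbor. The helper below is a minimal sketch, not part of spaCy: `safe_nbor` is a hypothetical name, and it assumes only that the token has an index attribute `i` and that the document supports `len()` and integer indexing, as spaCy documents do.

```python
def safe_nbor(token, doc):
    """Return the token that follows `token` in `doc`, or None at the end.

    Hypothetical helper: relies only on token.i, len(doc), and doc[index].
    """
    next_index = token.i + 1  # token.i is the token's position in the document
    if next_index < len(doc):
        return doc[next_index]
    return None  # the last token has no right-hand neighbor
```

In the loop above, `token.nbor()` could then be swapped for `safe_nbor(token, processed_text)`, which prints `None` for the final token instead of raising.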

### Entities - [Explanation of Entity Types](https://spacy.io/docs#annotation-ner)

In [11]:
for entity in processed_text.ents:
    print(entity, entity.label_)

Alice PERSON
first ORDINAL
Cheshire GPE
Alice PERSON
Cheshire GPE
Alice PERSON


Entity recognition takes the processed text and returns the named entities it finds, each with a type: PERSON, ORDINAL (a numbered position such as "first"), GPE (a geopolitical entity, i.e., a geographic location such as a city), and so on. In short, it looks for named things.
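Because the same entity can appear several times, it is often handy to tally entities rather than list them. The sketch below uses a hypothetical helper, `count_entities`, and assumes only that each entity exposes `text` and `label_` attributes, as the spans in `processed_text.ents` do.

```python
from collections import Counter

def count_entities(ents):
    """Count (text, label) pairs over an iterable of entity-like objects.

    Hypothetical helper: each item only needs .text and .label_ attributes.
    """
    return Counter((ent.text, ent.label_) for ent in ents)
```

Given the entity output shown above, `count_entities(processed_text.ents).most_common(1)` should put `('Alice', 'PERSON')` first with a count of 3.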

### Noun Chunks

If you want to find relationships among entities, it may be useful to look at pronouns or noun chunks. For example, the entity recognizer above pulled out Cheshire as a GPE, but noun chunking pulls out "a Cheshire cat", which gives useful context on what Cheshire refers to.

In [8]:
for noun_chunk in processed_text.noun_chunks:
    print(noun_chunk)

you
me
Alice
she
it
good manners
her
your cat
It
a Cheshire cat
the Duchess
She
the last word
such sudden violence
Alice
she
another moment
it
the baby
her
she
courage
again:—
I
Cheshire cats
fact
I
cats
They
the Duchess
'em
I
Alice
a conversation
You
the Duchess
a fact


## The Semi Holy Grail - Syntactic Dependency Parsing [See Demo for clarity](https://spacy.io/demos/displacy)

Dependency parsing describes the role each word plays in a sentence (e.g., subject, verb, adjective, pronoun, auxiliary verb, dependent object). The parser guesses at the syntactic structure of a sentence, which provides context and the relationships among words. (displaCy, linked above, is spaCy's visualizer for these parses.)

It takes in a sentence and returns its words, such as nouns and verbs, together with the relationships among them.

In [9]:
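def pr_tree(word, level):
    """Print the dependency subtree rooted at word, indented one tab per level."""
    if word.is_punct:
        return  # skip punctuation (and anything attached beneath it)
    # Children to the left of the word are printed first
    for child in word.lefts:
        pr_tree(child, level+1)
    print('\t'* level + word.text + ' - ' + word.dep_)
    # Then the children to the right of the word
    for child in word.rights:
        pr_tree(child, level+1)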

In [10]:
for sentence in processed_text.sents:
    pr_tree(sentence.root, 0)
    print('-------------------------------------------')

		Please - intj
		would - aux
		you - nsubj
	tell - ccomp
		me - dobj
said - ROOT
	Alice - nsubj
			a - det
		little - npadvmod
	timidly - advmod
		for - mark
		she - nsubj
	was - advcl
		not - neg
			quite - advmod
		sure - acomp
				whether - mark
				it - nsubj
			was - ccomp
					good - amod
				manners - attr
						for - mark
						her - nsubj
						to - aux
					speak - relcl
						first - advmod
		why - advmod
			your - poss
		cat - nsubj
	grins - ccomp
		like - prep
			that - pobj
		It - nsubj
	's - ccomp
			a - det
			Cheshire - compound
		cat - attr
-------------------------------------------
said - ROOT
		the - det
	Duchess - nsubj
		and - cc
		that - nsubj
	's - conj
		why - ccomp
-------------------------------------------
Pig - ROOT
-------------------------------------------
	She - nsubj
said - ROOT
		the - det
		last - amod
	word - dobj
		with - prep
				such - amod
				sudden - amod
			violence - pobj
			that - nsubj
			Alice - nsubj
			quite - advmod
		jumped - relcl


Look at the example that begins with She - nsubj: She (subject), said (verb), word (direct object), violence (prepositional object). The grammatical framework (the set of relation labels) is given, and the computer works out the structure of each sentence on its own. Look at the dependency relations to learn more about the grammatical, syntactic structure.
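The subject, verb, and object in that example can also be pulled out programmatically by following each subject to its head and looking for a direct object among the head's children. This is a rough sketch under stated assumptions: `subject_verb_object` is a hypothetical helper, it relies only on the `dep_`, `head`, `children`, and `text` attributes of each token, and it looks only for the `nsubj` and `dobj` labels seen in the output above.

```python
def subject_verb_object(tokens):
    """Yield (subject, verb, object) text triples from parsed tokens.

    Hypothetical helper: each token needs .dep_, .head, .children and .text.
    """
    for token in tokens:
        if token.dep_ != 'nsubj':
            continue  # only start from nominal subjects
        verb = token.head  # the word the subject attaches to
        for child in verb.children:
            if child.dep_ == 'dobj':  # direct object of the same verb
                yield (token.text, verb.text, child.text)
```

Calling `list(subject_verb_object(processed_text))` would collect the triples for the whole document; for the parse shown above, the "She said the last word" sentence should contribute `('She', 'said', 'word')`.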

### What is 'nsubj'? 'acomp'? See [The Universal Dependencies](http://universaldependencies.org/u/dep/)

## Word Vectorization - [Word2Vec](http://deeplearning4j.org/word2vec)

Word2Vec is a way of representing a word as a vector of numbers. The following code loops over each token in proc_fruits (defined in cell In [12] further down) and prints the token along with its vector (token.vector).

In the printed results, you can see that the word "I" is represented as a vector, with context built into it. Using these vectors, we can compare the similarity of words.

In [14]:
for sent in proc_fruits.sents:
    for token in sent:
        print(token, token.vector)

I [  1.87329993e-01   4.05950010e-01  -5.11740029e-01  -5.54820001e-01
   3.97160016e-02   1.28869995e-01   4.51370001e-01  -5.91489971e-01
   1.55910000e-01   1.51370001e+00  -8.70199978e-01   5.06719984e-02
   1.52109995e-01  -1.91829994e-01   1.11809999e-01   1.21310003e-01
  -2.72119999e-01   1.62030005e+00  -2.48840004e-01   1.40599996e-01
   3.30989987e-01  -1.80610009e-02   1.52439997e-01  -2.69430012e-01
  -2.78329998e-01  -5.21229990e-02  -4.81489986e-01  -5.18390000e-01
   8.62620026e-02   3.08180004e-02  -2.12530002e-01  -1.13779999e-01
  -2.23839998e-01   1.82620004e-01  -3.45409989e-01   8.26110020e-02
   1.00240000e-01  -7.95499980e-02  -8.17210019e-01   6.56209979e-03
   8.01339969e-02  -3.99760008e-01  -6.31309971e-02   3.22600007e-01
  -3.16249989e-02   4.30559993e-01  -2.72700012e-01  -7.60200024e-02
   1.02930002e-01  -8.86529982e-02  -2.90870011e-01  -4.72140014e-02
   4.60360013e-02  -1.77880004e-02   6.49899989e-02   8.84509981e-02
  -3.15739989e-01  -5.85219979e-

   4.88559991e-01   5.72209992e-02   2.47580007e-01  -4.03829992e-01]
in [  8.91870037e-02   2.57919997e-01   2.62820005e-01  -2.93649994e-02
   4.71870005e-01  -1.03890002e-01  -1.00129999e-01   8.12299997e-02
   2.08829999e-01   2.57259989e+00  -6.78539991e-01   3.61209996e-02
   1.30850002e-01   1.24619994e-03   1.47689998e-01   2.69259989e-01
   3.71439993e-01   1.35010004e+00  -1.13260001e-01  -2.30360001e-01
  -2.65749991e-01  -1.80769995e-01   9.24549997e-02  -1.62149996e-01
   1.50030002e-01  -3.45470011e-01   7.22950026e-02   4.06590015e-01
   1.00210002e-02  -7.92570040e-03  -1.14349999e-01   1.70079991e-02
  -2.97890007e-01   1.90789998e-01   3.71120006e-01  -2.65879989e-01
   1.62120000e-01   6.54689968e-02  -3.17809999e-01  -3.22600007e-02
   8.19690004e-02   3.44500005e-01  -1.73620000e-01  -3.57450008e-01
   5.44870012e-02   3.99410009e-01   1.36989996e-01  -2.20660008e-02
   1.10250004e-01  -4.18980002e-01   1.27599999e-01  -9.58689973e-02
  -1.79440007e-01  -1.74429998

In [12]:
proc_fruits = processor('''I think green apples are delicious. 
                            While pears have a strange texture to them. 
                            The bowls they sit in are ugly.''')
apples, pears, bowls = proc_fruits.sents
fruit = processed_text.vocab['fruit']
print(apples.similarity(fruit))
print(pears.similarity(fruit))
print(bowls.similarity(fruit))


0.63287260512
0.430215129782
0.360582530461


Unpacking proc_fruits.sents assigns the three sentences to apples, pears, and bowls.

fruit = processed_text.vocab['fruit'] pulls out the individual word "fruit" from the vocabulary.

To compare it with the word "apple", we run apple = proc_fruits.vocab['apple'] and then apple.similarity(fruit) (the similarity of apple to fruit). We can also run apple.similarity(processed_text.vocab['steel']), which gives a low similarity score of 0.13.

This can be used to find related sentences in a text.
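To make the scores less mysterious: as far as I know, spaCy's default similarity is the cosine similarity between the two vectors, which is close to 1 when they point the same way and close to 0 when they are unrelated. The function below is a plain-Python sketch of that calculation on ordinary lists of numbers; it is an illustration, not spaCy's own implementation, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)
```

With real tokens, something like `cosine_similarity(apple.vector, fruit.vector)` should land near the value that `apple.similarity(fruit)` reports.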

In [18]:
fruit = proc_fruits.vocab['fruit']

In [20]:
apple = proc_fruits.vocab['apple']

In [21]:
apple.similarity(fruit)

0.63061824225321095

In [25]:
apple.similarity(processed_text.vocab['comedy'])

0.15424666228308809

# Assignment
Find your favorite news source and grab the article text.

1. Show the most common words in the article.
2. Show the most common words under a part of speech (e.g., NOUN: {'Bob': 12, 'Alice': 4}).
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common entities and their types.
5. Find entities and their dependencies (hint: entity.root.head).
6. Find the most similar nouns (chunks) in the article.

Techniques for each step:
1. Tokenization
2. POS tagging
3. Dependency parser
4. Common entities
5. Find the dependency of the entities, i.e., the verbs related to them (entity.root.head gives you the dependency).
6. Find similarities among noun chunks.
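As a possible starting point for step 2, word counts can be grouped by part of speech with a dictionary of counters. This is only a sketch: `words_by_pos` is a hypothetical helper, and it assumes each token exposes the `text`, `pos_`, `is_punct`, and `is_space` attributes.

```python
from collections import Counter, defaultdict

def words_by_pos(tokens):
    """Map each part-of-speech tag to a Counter of the words seen with it.

    Hypothetical helper: each token needs .text, .pos_, .is_punct, .is_space.
    """
    counts = defaultdict(Counter)
    for token in tokens:
        if token.is_punct or token.is_space:
            continue  # ignore punctuation and whitespace tokens
        counts[token.pos_][token.text] += 1
    return counts
```

For a processed article `doc`, `words_by_pos(doc)['NOUN'].most_common(10)` would then list the ten most frequent nouns.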