# spaCy and NLTK

spaCy is an open-source Natural Language Processing (NLP) library.

It is designed to handle NLP tasks effectively, using efficient implementations of common algorithms.

NLTK (Natural Language Toolkit) is also open source. It provides a lot of functionality, but its implementations are generally less efficient.

In spaCy, calling a loaded model (conventionally named `nlp`) on raw text automatically runs a series of operations that tag, parse, and describe the text data.

# Natural Language Processing

NLP is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Computers are good at analysis when the data is mostly numerical, but they find raw text difficult to understand. Text data is also highly unstructured and can be in multiple languages, so a computer needs specialized processing techniques to make sense of it. NLP uses a variety of techniques to create structure out of text data.
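
As a minimal sketch of what "creating structure" can mean, the snippet below turns a raw string into word counts using only the Python standard library. The sample sentence and the `bag_of_words` helper are illustrative assumptions, not part of spaCy or NLTK, and real NLP libraries do considerably more than this.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, pull out runs of letters, and count each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

# Hypothetical sample text, used only for illustration.
counts = bag_of_words("The movie was good. The acting was very good!")

# The unstructured string is now a table of word -> frequency.
for word, count in counts.items():
    print(word, count)
```

Once text is in a numerical form like this, the usual numerical tools (sorting, comparing, feeding a classifier) become applicable.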

Example use cases of NLP:
1. Classifying emails as spam vs legitimate
2. Sentiment analysis of text movie reviews
3. Analyzing trends from written customer feedback forms
4. Understanding commands spoken by users to voice assistants such as Alexa, Siri, or Google.

In [6]:
import spacy

In [8]:
nlp = spacy.load("en") # Load the English model and name it nlp. In spaCy v3+ the "en" shortcut was removed; use spacy.load("en_core_web_sm") instead

In [13]:
# Using the language model we just loaded, spaCy parses the entire string into separate components called tokens.
doc = nlp(u'Tesla is looking at buying a U.S.S.R. startup for $6 million')
# Applying the model nlp to the text creates a Doc (document) object, which holds the processed text.

In [14]:
for token in doc:
    print(token.text)

Tesla
is
looking
at
buying
a
U.S.S.R.
startup
for
$
6
million


In [15]:
for token in doc:
    print(token.text,token.pos) # pos is the part of speech, stored as an integer ID. Each number stands for a category such as adverb, noun, or conjunction.

Tesla 96
is 87
looking 100
at 85
buying 100
a 90
U.S.S.R. 96
startup 92
for 85
$ 99
6 93
million 93


In [16]:
for token in doc:
    print(token.text,token.pos_) # The trailing underscore returns the readable string label of the part of speech instead of its integer ID

Tesla PROPN
is AUX
looking VERB
at ADP
buying VERB
a DET
U.S.S.R. PROPN
startup NOUN
for ADP
$ SYM
6 NUM
million NUM
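
The relationship between the two outputs above can be pictured as a lookup from integer ID to label. The sketch below builds that lookup by hand from those outputs. The dictionary `POS_NAMES` is a hypothetical stand-in, and the numbers are simply the ones printed in this session, so they should not be assumed stable across spaCy versions.

```python
# Integer IDs and labels copied from the token.pos and token.pos_ outputs above.
POS_NAMES = {
    96: "PROPN", 87: "AUX", 100: "VERB", 85: "ADP",
    90: "DET", 92: "NOUN", 99: "SYM", 93: "NUM",
}

# (text, integer ID) pairs as printed by token.pos.
tagged = [("Tesla", 96), ("is", 87), ("looking", 100), ("startup", 92)]

labels = []
for text, pos_id in tagged:
    # Looking up the ID gives the readable label that token.pos_ reports.
    labels.append(POS_NAMES[pos_id])
    print(text, pos_id, POS_NAMES[pos_id])
```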


In [29]:
for token in doc:
    print(token.text,token.pos_,token.dep_) # dep_ is the syntactic dependency label, which describes each token's grammatical role in the sentence

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
a DET det
U.S.S.R. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


___
# SpaCy Objects

After importing the spacy module in the cell above we loaded a **model** and named it `nlp`.

Next we created a **Doc** object by applying the model to our text, and named it `doc`.

spaCy also builds a companion **Vocab** object.

# Pipeline

When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse, and describe the data. The spaCy documentation illustrates this pipeline at https://spacy.io/usage/spacy-101#pipelines

In [70]:
nlp.pipeline # Lists each pipeline component as a (name, component) pair

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1f3c2b275c8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1f3c2b1b948>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1f3c2b1bee8>)]

In [71]:
nlp.pipe_names

['tagger', 'parser', 'ner']
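
To make the idea of a pipeline concrete, here is a toy sketch in plain Python: a list of named steps, each of which receives the document and returns it with more annotations. The functions `toy_tokenize_text`, `tag_numbers`, `tag_capitalized`, and `run_pipeline` are hypothetical and far simpler than spaCy's tagger, parser, and entity recognizer. Only the (name, component) structure mirrors the `nlp.pipeline` output above.

```python
def toy_tokenize_text(text):
    """Break the raw text into token dictionaries (whitespace split only)."""
    return [{"text": word} for word in text.split()]

def tag_numbers(doc):
    """Toy annotation step: mark each token as a number or not."""
    for token in doc:
        token["is_number"] = token["text"].isdigit()
    return doc

def tag_capitalized(doc):
    """Toy annotation step: mark tokens that start with an uppercase letter."""
    for token in doc:
        token["is_capitalized"] = token["text"][:1].isupper()
    return doc

# Named components, applied in order after tokenization.
toy_pipeline = [("numbers", tag_numbers), ("capitalized", tag_capitalized)]

def run_pipeline(text):
    doc = toy_tokenize_text(text)
    for name, component in toy_pipeline:
        doc = component(doc)
    return doc

toy_doc = run_pipeline("Tesla paid 6 million")
print([name for name, _ in toy_pipeline])
print(toy_doc[0])
print(toy_doc[2])
```

Each step only adds annotations to the tokens it receives, which is why the order of the components matters when a later step depends on an earlier one.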

# Tokenization

The first step in processing any text is to split it into its component parts, i.e. words and punctuation, called tokens. These tokens are annotated inside the Doc object with descriptive information.

In the example below, notice how `isn't` is split into two tokens. spaCy recognizes both the root verb `is` and the negation attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.

It's important to note that even though `doc2` contains processed information about each token, it also retains the original text, which is available as `doc2.text`.

In [40]:
doc2 = nlp(u"Tesla isn't looking into       startups anymore.") # The u prefix marks a Unicode string (optional in Python 3, where every str is Unicode)

In [41]:
for token in doc2:
    print(token.text)

Tesla
is
n't
looking
into
      
startups
anymore
.
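
The splitting behaviour shown above can be imitated, very roughly, with a regular expression. The sketch below is not spaCy's tokenizer, which relies on language-specific rules and exception lists. `toy_tokenize` is a hypothetical helper that only handles the `n't` contraction, runs of extra whitespace, and single punctuation characters.

```python
import re

# Alternatives are tried left to right:
#   1. the part of a word that precedes "n't" (e.g. "is" in "isn't")
#   2. the contraction "n't" itself
#   3. any other run of word characters
#   4. extra spaces beyond the single separating space
#   5. a single punctuation character
TOKEN_PATTERN = re.compile(r"\w+?(?=n't\b)|n't\b|\w+|(?<= ) +|[^\w\s]")

def toy_tokenize(text):
    """Return the list of token strings found in text."""
    return TOKEN_PATTERN.findall(text)

# Same sentence as doc2, with seven spaces between "into" and "startups".
sample = "Tesla isn't looking into" + " " * 7 + "startups anymore."
tokens = toy_tokenize(sample)
for token in tokens:
    print(repr(token))
```

A pattern like this breaks down quickly on inputs such as `U.S.S.R.`, which is one reason a dedicated tokenizer is preferable in practice.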


In [42]:
for token in doc2:
    print(token.text, token.pos)

Tesla 96
is 87
n't 94
looking 100
into 85
       103
startups 92
anymore 86
. 97


In [43]:
for token in doc2:
    print(token.text,token.pos_)

Tesla PROPN
is AUX
n't PART
looking VERB
into ADP
       SPACE
startups NOUN
anymore ADV
. PUNCT


In [44]:
for token in doc2:
    print(token.text,token.pos_,token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
looking VERB ROOT
into ADP prep
       SPACE 
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [45]:
doc2[0]

Tesla

In [46]:
doc2[0].pos_

'PROPN'

In [47]:
doc2[0].dep_

'nsubj'

In [48]:
doc2[0].lemma_

'Tesla'

In [49]:
doc2[0].tag_

'NNP'

In [50]:
doc2[0].shape_

'Xxxxx'

In [52]:
doc2[0].is_alpha

True

In [54]:
doc2[0].is_stop

False

In [55]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [56]:
life_quote = doc3[16:30]

In [57]:
life_quote

"Life is what happens to us while we are making other plans"

In [59]:
type(life_quote)

spacy.tokens.span.Span

In [60]:
type(doc3)

spacy.tokens.doc.Doc

In [61]:
doc4 = nlp(u"This is the first sentence. This is another sentence. This is the last sentence")

In [62]:
for sentence in doc4.sents: # doc4.sents yields one sentence at a time; spaCy detects the boundaries, here after each period.
    print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence
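
For comparison, a naive rule-based splitter reproduces this particular output. This is a swapped-in technique, not what spaCy does: with the pipeline used in this document, sentence boundaries come from the dependency parser. The hypothetical `toy_split_sentences` helper simply splits after `.`, `!`, or `?` when followed by whitespace, so it would fail on abbreviations such as `U.S.S.R.` in the middle of a sentence.

```python
import re

def toy_split_sentences(text):
    """Split text after sentence-final punctuation that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)

text = "This is the first sentence. This is another sentence. This is the last sentence"
sentences = toy_split_sentences(text)
for sentence in sentences:
    print(sentence)
```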


In [64]:
doc4[6]

This

In [65]:
doc4[6].is_sent_start

True

In [66]:
doc4[8]

another

In [68]:
doc4[8].is_sent_start

___
## Dependencies
We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.

For a full list of syntactic dependencies, visit https://spacy.io/api/annotation#dependency-parsing

A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf).

To see the full name of a tag, use `spacy.explain(tag)`.

In [72]:
spacy.explain('PROPN')

'proper noun'

In [73]:
spacy.explain('nsubj')

'nominal subject'

## Additional Token Attributes

|Attribute|Description|Value for `doc2[0]`|
|:------|:------|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`Tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [74]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

into
into


In [75]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

ADP
IN / conjunction, subordinating or preposition


In [76]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
a : x
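
The idea behind a word shape can be sketched in a few lines of plain Python. The `word_shape` function below is a hypothetical approximation, not spaCy's own code: it maps uppercase letters to `X`, lowercase letters to `x`, and digits to `d`, and leaves other characters unchanged. Capping repeated symbols at four is an assumption made here. It is consistent with the `Xxxxx` output above but is not confirmed by it.

```python
def word_shape(text, max_run=4):
    """Map letters to X/x and digits to d, keeping at most max_run repeats of a symbol."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        # Count how many times the same symbol has repeated in a row.
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)

for word in ["Tesla", "a", "U.S.S.R.", "startups", "1957"]:
    print(word, ":", word_shape(word))
```

Shapes like these let a model treat `Tesla` and any other capitalized five-letter word alike, which can help with words it has never seen.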


In [77]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
False
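
These two flags are easy to imitate in plain Python, which may help clarify what they mean. spaCy's documentation describes `is_alpha` as equivalent to `token.text.isalpha()`. The stop list below, `TOY_STOP_WORDS`, is a tiny hypothetical one chosen for illustration, whereas spaCy ships a much larger list per language.

```python
# Hypothetical, tiny stop list for illustration only.
TOY_STOP_WORDS = {"a", "is", "the", "at", "for", "into"}

def toy_flags(text):
    """Return (is_alpha, is_stop) for a single token string."""
    is_alpha = text.isalpha()
    is_stop = text.lower() in TOY_STOP_WORDS
    return is_alpha, is_stop

for word in ["Tesla", "is", "n't", "6", "."]:
    print(word, toy_flags(word))
```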
