### NLP packages: NLTK & spaCy
Here we import spaCy and load its small English pipeline, 'en_core_web_sm', as 'nlp'. Calling nlp on a text runs it through the following processing steps:
1. tokenization
2. tagging
3. parsing
4. named entity recognition

In [4]:
import spacy

# Small English pipeline for spaCy, trained on web text
nlp = spacy.load('en_core_web_sm')

In [5]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

##### Tokenization - the first step in processing is to split text into tokens (words and punctuation)

##### Part-of-speech tagging (POS) - each token is assigned a part of speech such as 'PROPN' for proper noun, VERB, or ADP (adposition)

##### Dependencies - each token is also assigned a label for its syntactic relation to its head in the sentence, such as 'nsubj' for nominal subject
https://spacy.io/api/dependencyparser

In [10]:
# POS = part of speech: token.pos is the integer ID, token.pos_ is the matching string label
for token in doc:
    print(token.text, token.pos, token.pos_, token.dep_)

Tesla 96 PROPN nsubj
is 87 AUX aux
looking 100 VERB ROOT
at 85 ADP prep
buying 100 VERB pcomp
U.S. 96 PROPN compound
startup 92 NOUN dobj
for 85 ADP prep
$ 99 SYM quantmod
6 93 NUM compound
million 93 NUM pobj
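
The abbreviations in this output can be looked up with `spacy.explain`, and the integer in the second column can be mapped back to its label through the vocabulary's string store. The following is a minimal sketch, assuming spaCy v3. It uses a blank English pipeline (`spacy.blank`) only so that no trained model is needed; that choice, and the variable names, are assumptions of this example rather than part of the notebook.

```python
import spacy
from spacy.symbols import PROPN

# spacy.explain returns a short description for a tag or label (None if unknown)
labels = ["PROPN", "ADP", "nsubj", "pobj"]
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:<6} {description}")

# token.pos is an integer symbol; the string store converts in both directions
blank_nlp = spacy.blank("en")  # tokenizer-only pipeline, assumed here for illustration
pos_name = blank_nlp.vocab.strings[PROPN]    # int -> string label
pos_id = blank_nlp.vocab.strings["PROPN"]    # string label -> int
print(pos_name, pos_id)
```

The integer IDs are internal to spaCy and may differ between versions, so code that compares tags is usually easier to read when it uses `token.pos_`.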


In [12]:
# Which components does nlp() run on the text after tokenizing it?
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x28d68c35900>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x28d68c35a80>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x28d689f9150>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x28d68db50c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x28d68d71840>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x28d689f94d0>)]

In [13]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
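
The tokenizer does not appear in `nlp.pipe_names`: it always runs first and creates the Doc that the listed components then annotate. One way to see this is to build a pipeline with no components at all. This is a minimal sketch, assuming spaCy v3; `spacy.blank("en")` and the variable names are choices made for this example, not something the notebook uses.

```python
import spacy

# A blank pipeline (assumed here for illustration) has a tokenizer but no components
blank_nlp = spacy.blank("en")
component_names = blank_nlp.pipe_names

doc = blank_nlp("Tesla is looking at buying U.S. startup for $6 million")
tokens = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

print(component_names)  # no tagger, parser, or ner
print(tokens)           # the text is still split into tokens
print(pos_tags)         # empty strings, since no tagger has run
```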

In [14]:
doc2 = nlp(u"Tesla isn't looking into startups anymore.")

In [15]:
for token in doc2:
    print(token.text, token.pos, token.pos_, token.dep_)

Tesla 96 PROPN nsubj
is 87 AUX aux
n't 94 PART neg
looking 100 VERB ROOT
into 85 ADP prep
startups 92 NOUN pobj
anymore 86 ADV advmod
. 97 PUNCT punct


In [18]:
doc2[3].pos_

'VERB'

In [19]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [20]:
life_quote = doc3[16:30]

In [21]:
life_quote

"Life is what happens to us while we are making other plans"

In [22]:
type(doc2)

spacy.tokens.doc.Doc
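
A slice of a Doc, such as `life_quote`, is not a new Doc but a Span: a view onto a range of tokens in the original Doc. The sketch below illustrates the difference on a shorter sentence. It assumes spaCy v3 and uses a blank English pipeline so that no trained model is needed; the sentence and variable names are made up for this example.

```python
import spacy
from spacy.tokens import Doc, Span

blank_nlp = spacy.blank("en")  # tokenizer-only pipeline, assumed for illustration
doc = blank_nlp('The phrase "Life is what happens" was written by Allen Saunders.')

quote = doc[2:8]         # slice by token index; the end index is exclusive
first_token = quote[0]   # indexing a Span yields a Token

print(type(doc).__name__, type(quote).__name__)
print(quote.text, quote.start, quote.end)
print(quote.doc is doc)  # the Span keeps a reference to its parent Doc
```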

In [23]:
doc4 = nlp(u"This is the first sentence. This is another sentence. This is the last sentence")

In [24]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence


In [27]:
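# Is this token the start of a sentence?
# doc4[6] is 'This', which opens the second sentence
doc4[6].is_sent_start

# doc4[8] is 'another', which is mid-sentence; only this last expression is displayed
doc4[8].is_sent_start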

False
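
In the notebook, sentence boundaries come from the dependency parser in 'en_core_web_sm'. As a lighter alternative, spaCy also ships a rule-based `sentencizer` component that splits on sentence-final punctuation. The sketch below swaps in that component on a blank pipeline (an assumption of this example, using the spaCy v3 `add_pipe` API) and lists every token index where a sentence starts.

```python
import spacy

blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")  # spaCy v3 API; rule-based, no trained model needed

doc = blank_nlp("This is the first sentence. This is another sentence. This is the last sentence")

# Collect the index of every token flagged as a sentence start
sentence_starts = [token.i for token in doc if token.is_sent_start]
sentences = [sent.text for sent in doc.sents]

print(sentence_starts)
print(sentences)
```

Because the sentencizer only looks at punctuation, it can disagree with the parser on text with abbreviations or missing punctuation.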

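### Tokenization
Tokens are the basic building blocks of a Doc.

Types of tokenizer rules
* prefix - characters split off the start, for example $ ( "
* suffix - characters split off the end, for example km ) , . ! "
* infix - characters split out of the middle, for example - -- / ...
* exception - special-case rule that splits a string into several tokens or keeps it whole, for example let's, U.S.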
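
To make the four rule types concrete, here is a toy tokenizer in plain Python. It is a simplified sketch of the general idea (split on whitespace, then peel off prefixes and suffixes, check for exceptions, and finally split on infixes), not spaCy's actual implementation. The rule tables and the `toy_tokenize` function are hypothetical and cover only the examples listed above.

```python
import re

# Hypothetical rule tables for illustration; spaCy's real rules are far larger
SPECIAL_CASES = {
    "We're": ["We", "'re"],
    "let's": ["let", "'s"],
    "L.A.": ["L.A."],
    "U.S.": ["U.S."],
}
PREFIX_RE = re.compile(r'^[$("]')
SUFFIX_RE = re.compile(r'(km|[),.!"])$')
INFIX_RE = re.compile(r'(\.\.\.|--|[-/])')


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel prefixes and suffixes until an exception or a plain word remains
        while chunk and chunk not in SPECIAL_CASES:
            prefix = PREFIX_RE.search(chunk)
            suffix = SUFFIX_RE.search(chunk)
            if prefix:
                prefixes.append(prefix.group())
                chunk = chunk[prefix.end():]
            elif suffix:
                suffixes.insert(0, suffix.group())
                chunk = chunk[:suffix.start()]
            else:
                break
        if chunk in SPECIAL_CASES:
            middle = SPECIAL_CASES[chunk]
        else:
            # re.split with a capturing group keeps the infix itself as a piece
            middle = [piece for piece in INFIX_RE.split(chunk) if piece]
        tokens.extend(prefixes + middle + suffixes)
    return tokens


tokens = toy_tokenize('"We\'re moving to L.A.!"')
print(tokens)
```

Checking for exceptions before stripping a trailing period is what keeps "L.A." and "U.S." intact while the exclamation mark and quotes around them are still split off.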

In [32]:
doc1 = nlp(u"\"We're moving to L.A.!\"")

In [33]:
for token in doc1:
    print(token)

"
We
're
moving
to
L.A.
!
"

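spaCy can report which kind of rule produced each token through `Tokenizer.explain`, and new exceptions can be registered with `Tokenizer.add_special_case`. The sketch below assumes spaCy v3 and a blank English pipeline (no trained model); the "gimme" rule is the illustrative special case used in spaCy's own documentation, not something from this notebook.

```python
import spacy
from spacy.symbols import ORTH

blank_nlp = spacy.blank("en")  # tokenizer-only pipeline, assumed for illustration
text = '"We\'re moving to L.A.!"'

# Each entry pairs the name of the rule that fired with the resulting token text
explained = blank_nlp.tokenizer.explain(text)
for rule, piece in explained:
    print(f"{rule:<12} {piece}")

# Register an exception: always split "gimme" into "gim" + "me"
blank_nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
gimme_tokens = [token.text for token in blank_nlp("gimme that")]
print(gimme_tokens)
```

The pieces of a special case must join back to the original string exactly, because spaCy's tokenization never alters the underlying text.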