#### The `nlp()` function from spaCy takes raw text and performs a series of operations to tag, parse, and describe it.

In [62]:
import spacy
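# 'en' is a shortcut link available in spaCy v2; spaCy v3 requires the full model name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')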

In [19]:
# Create a Doc object from a unicode string
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [28]:
# spaCy recognizes tokens quite easily
for token in doc:
    print(f"{token.text:{10}}, {token.pos:{5}}, {token.pos_:{10}}, {token.dep_:{20}}")

Tesla     ,    96, PROPN     , nsubj               
is        ,    87, AUX       , aux                 
looking   ,   100, VERB      , ROOT                
at        ,    85, ADP       , prep                
buying    ,   100, VERB      , pcomp               
U.S.      ,    96, PROPN     , compound            
startup   ,    92, NOUN      , dobj                
for       ,    85, ADP       , prep                
$         ,    99, SYM       , quantmod            
6         ,    93, NUM       , compound            
million   ,    93, NUM       , pobj                


#### Tesla is recognized as a proper noun (`PROPN`).
#### U.S. is kept as a single token.

### spaCy pipeline object

When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data. 

In [29]:
# Series of operations
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x229f6fba148>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x229f6fa0ac8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x229f6fbc0a8>)]

In [30]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [32]:
nlp.pipe_factories

{'tagger': 'tagger', 'parser': 'parser', 'ner': 'ner'}
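
To make the idea of a processing pipeline concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation: `toy_tokenizer`, `toy_tagger`, `toy_counter`, and `run` are hypothetical names invented for this illustration, and the "tagging" rule is deliberately naive. It only shows the shape of the design: the text is tokenized first, then each named component annotates the same object in order.

```python
# Toy stand-in for a processing pipeline: NOT spaCy's implementation.
# Each component receives the shared "doc" dict, adds annotations, and returns it.

def toy_tokenizer(text):
    # Hypothetical tokenizer: splits on whitespace only
    return {"text": text, "tokens": text.split()}

def toy_tagger(doc):
    # Hypothetical rule: capitalized tokens are tagged PROPN, everything else X
    doc["pos"] = ["PROPN" if tok[0].isupper() else "X" for tok in doc["tokens"]]
    return doc

def toy_counter(doc):
    # Adds a simple document-level annotation
    doc["n_tokens"] = len(doc["tokens"])
    return doc

# Ordered list of (name, component) pairs, similar in spirit to nlp.pipeline
pipeline = [("tagger", toy_tagger), ("counter", toy_counter)]

def run(text):
    doc = toy_tokenizer(text)          # tokenization always happens first
    for name, component in pipeline:   # then each component runs in order
        doc = component(doc)
    return doc

result = run("Tesla is looking at startups")
pipe_names = [name for name, _ in pipeline]
print(pipe_names)
print(result["pos"])
```

Keeping the components in an ordered list of `(name, callable)` pairs is what makes it possible to inspect the pipeline by name, as `nlp.pipe_names` does above.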

## Tokenization
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:

In [39]:
doc2 = nlp(u"Tesla isn't looking into    startups anymore.")

In [40]:
for token in doc2:
    print(f"{token.text:{10}},{token.pos:{5}},{token.pos_:{10}},{token.dep_:{10}}")

Tesla     ,   96,PROPN     ,nsubj     
is        ,   87,AUX       ,aux       
n't       ,   94,PART      ,neg       
looking   ,  100,VERB      ,ROOT      
into      ,   85,ADP       ,prep      
          ,  103,SPACE     ,          
startups  ,   92,NOUN      ,pobj      
anymore   ,   86,ADV       ,advmod    
.         ,   97,PUNCT     ,punct     


Notice how `isn't` has been split into two tokens. spaCy recognizes both the auxiliary verb `is` and the negation `n't` attached to it, while `looking` is the root of the sentence. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.

It's important to note that even though `doc2` contains processed information about each token, it also retains the original text:

In [43]:
doc2

Tesla isn't looking into    startups anymore.

In [44]:
# Index into the Doc to grab an individual token
doc2[0]

Tesla
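
One way to see how the original text can survive tokenization is to store each token together with the whitespace that follows it. The sketch below is plain Python, not spaCy: the `(text, trailing_whitespace)` pairs are written by hand, and the exact way the run of extra spaces is divided between tokens is an assumption made for illustration. In spaCy itself, the comparable attributes are `token.text_with_ws` and `doc.text`.

```python
# Hand-written (text, trailing_whitespace) pairs standing in for the tokens of doc2.
# How the run of spaces is split between "into" and the whitespace token is an assumption.
tokens = [
    ("Tesla", " "), ("is", ""), ("n't", " "), ("looking", " "),
    ("into", " "), (" " * 3, ""), ("startups", " "), ("anymore", ""), (".", ""),
]

# The raw text, with a run of four spaces between "into" and "startups"
original = "Tesla isn't looking into" + " " * 4 + "startups anymore."

# Concatenating every token with its trailing whitespace restores the raw text
rebuilt = "".join(text + ws for text, ws in tokens)
print(rebuilt == original)
```

Because nothing is discarded, the token annotations and the raw string can always be mapped back onto each other.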

## POS Tagging

The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `Tesla` was recognized to be a ***proper noun***. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

In [45]:
doc2[0].pos_

'PROPN'

## Syntactic Dependencies

We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)

In [46]:
doc2[0].dep_

'nsubj'

In [47]:
spacy.explain('PROPN')

'proper noun'

In [48]:
spacy.explain('nsubj')

'nominal subject'

___
## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`Tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [76]:
print(doc2)
print(f'The original word text : {doc2[3].text}')
print(f'The base form of the word : {doc2[3].lemma_}')
print(f'The simple part of speech tag : {doc2[3].pos_}')
print(f'The detailed part of speech tag : {doc2[3].tag_}')
print(f'The word shape : {doc2[3].shape_}')
print(f'Is the token an alpha character? : {doc2[3].is_alpha}')
print(f'Is the token part of a stop list? : {doc2[3].is_stop}')

Tesla isn't looking into    startups anymore.
The original word text : looking
The base form of the word : look
The simple part of speech tag : VERB
The detailed part of speech tag : VBG
The word shape : xxxx
Is the token an alpha character? : True
Is the token part of a stop list? : False


### Lemmatization

Lemmatization reduces a word to its base form (its lemma). For example, `looking` becomes `look`.

In [64]:
print(doc2[0])
print(doc2[0].lemma_)

Tesla
Tesla


In [65]:
print(doc2[3])
print(doc2[3].lemma_)

looking
look


In [72]:
type(doc2[3])

spacy.tokens.token.Token

### Word Shapes

In [73]:
print(doc2[0])
print(doc2[0].shape_)

Tesla
Xxxxx


In [74]:
print(doc[5])
print(doc[5].shape_)

U.S.
X.X.


In [75]:
print(doc[8])
print(doc[8].shape_)

$
$
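
The shapes above suggest a simple mapping: uppercase letters become `X`, lowercase letters become `x`, other characters are kept, and long runs of the same shape character are cut short (note that `looking` came out as `xxxx` earlier). The function below is a rough sketch of that behavior, written only to reproduce the outputs shown in this notebook. `word_shape` and `max_run` are names chosen for this illustration, the digit-to-`d` mapping follows the table's description of shapes, and none of this is spaCy's own code.

```python
def word_shape(text, max_run=4):
    """Approximate a word shape: X for uppercase, x for lowercase, d for digits.

    Other characters are kept as-is, and runs of the same shape character
    are truncated to max_run. Illustrative sketch only, not spaCy's code.
    """
    shape = []
    last = None   # previous shape character
    run = 0       # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            last = shape_char
            run = 1
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)

for word in ["Tesla", "U.S.", "$", "looking"]:
    print(word, word_shape(word))
```

Truncating long runs keeps the number of distinct shapes small, which is what makes a shape useful as a coarse feature.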


___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`, where `start` is included and `stop` is excluded, as with ordinary Python slices.

In [77]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [78]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [79]:
type(life_quote)

spacy.tokens.span.Span

In [80]:
type(doc3)

spacy.tokens.doc.Doc

In [81]:
type(doc3[0])

spacy.tokens.token.Token

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [82]:
doc4 = nlp(u"This is the first sentence. This is another sentence. This is yet another sentence. This is the last sentence.")

In [88]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is yet another sentence.
This is the last sentence.


In [89]:
doc4[6]

This

In [90]:
doc4[6].is_sent_start

True

In [91]:
doc4[8]

another

In [94]:
# Displays nothing here (the value is None) because this token is not the first token of a sentence;
# other spaCy versions may return False instead

doc4[8].is_sent_start
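
To illustrate how per-token "start of sentence" flags are enough to produce sentence segments, here is a small pure-Python sketch. It does not use spaCy: `group_sentences` is a hypothetical helper, and the token list and flags are written by hand to mirror the first two sentences of `doc4` (token 6 is `This`, token 8 is `another`).

```python
def group_sentences(tokens, sent_starts):
    """Group tokens into sentences using per-token start-of-sentence flags."""
    sentences = []
    for token, is_start in zip(tokens, sent_starts):
        # Open a new sentence at each flagged token (and for the very first token)
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(token)
    return sentences

# Hand-written tokens mirroring the start of doc4
tokens = ["This", "is", "the", "first", "sentence", ".",
          "This", "is", "another", "sentence", "."]
# Only tokens 0 and 6 are flagged as sentence starts
sent_starts = [i in (0, 6) for i in range(len(tokens))]

sentences = group_sentences(tokens, sent_starts)
for sentence in sentences:
    print(" ".join(sentence))
```

No list of sentences needs to be stored up front: the segments are derived on demand from the flags on the tokens.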