# spaCy Basics

#### Import Spacy

In [39]:
import spacy

#### Load Model

In [2]:
nlp = spacy.load('en_core_web_sm')

#### Convert a Sentence into Tokens

In [3]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [7]:
for token in doc:
    print(token.text,token.pos_,token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


### Pipeline Object

When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data.

<img src="../pipeline1.png" width="600">

The basic NLP pipeline consists of a tagger, a parser, and an entity recognizer (NER, short for named entity recognition).

In [8]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f8a16d104a8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f8a16ba6288>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f8a16ba62e8>)]

In [10]:
nlp.pipe_names

['tagger', 'parser', 'ner']
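
Conceptually, `nlp.pipeline` is an ordered list of `(name, component)` pairs, and each component receives the document and hands it on with more annotations attached. The plain-Python sketch below imitates that shape only. The dict standing in for a `Doc` and the three placeholder components are assumptions made for illustration; they are not spaCy's implementation.

```python
# Toy pipeline: an ordered list of (name, component) pairs, mirroring the
# shape of nlp.pipeline. The components are placeholders, not real models.

def tagger(doc):
    doc["steps"].append("tagged")
    return doc

def parser(doc):
    doc["steps"].append("parsed")
    return doc

def ner(doc):
    doc["steps"].append("entities")
    return doc

pipeline = [("tagger", tagger), ("parser", parser), ("ner", ner)]

def run_pipeline(text):
    # Tokenization comes first; a whitespace split stands in for it here.
    doc = {"tokens": text.split(), "steps": []}
    # Each component runs in order and returns the (annotated) document.
    for name, component in pipeline:
        doc = component(doc)
    return doc

pipe_names = [name for name, _ in pipeline]
result = run_pipeline("Tesla is looking at buying U.S. startup")
print(pipe_names)
print(result["steps"])
```

The order of the list is the order in which the components run, which is why `nlp.pipe_names` reads the same way as the pipeline diagram.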

### Tokenization

The first step in processing any text is to split it into its component parts, such as words and punctuation. Each of these parts is called a token.

In [15]:
doc2 = nlp(u"Tesla isn't looking into stratup anymore.")

In [12]:
for token in doc2:
    print(token.text, token.pos_,token.dep_)

Tesla ADV nsubj
is AUX aux
n't PART neg
looking VERB ROOT
into ADP prep
stratup NOUN pobj
anymore ADV advmod


In [16]:
doc2[0]

Tesla
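
To make the splitting rules more concrete, here is a toy tokenizer built on a single regular expression. It is a rough sketch under simplifying assumptions, not how spaCy tokenizes: `TOKEN_RE` and `toy_tokenize` are hypothetical names, and the pattern only knows about the `n't` contraction and single-character punctuation.

```python
import re

# Alternatives are tried left to right:
#   1. a word stem that is directly followed by "n't" (so "isn't" -> "is")
#   2. the contraction suffix "n't" itself
#   3. any other run of word characters
#   4. a single punctuation or symbol character
TOKEN_RE = re.compile(r"\w+?(?=n't)|n't|\w+|[^\w\s]")

def toy_tokenize(text):
    """Split text into word, contraction, and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = toy_tokenize("Tesla isn't looking into startups anymore.")
print(tokens)

# Abbreviations show the limits of the toy rules: the periods are split off.
print(toy_tokenize("U.S."))
```

The toy rules break `U.S.` into four pieces, whereas spaCy kept it as one token in the first example. Handling such exceptions is one reason a real tokenizer needs more than a single pattern.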

### Part-of-Speech Tagging (POS)
The next step after splitting the text into tokens is to assign parts of speech. In the first example (`doc`), `Tesla` was recognized as a ***proper noun***. This step relies on statistical modeling: for example, words that follow "the" are typically nouns. Because the tags are predictions, they can be wrong, as the `doc2` output shows, where the model tagged `Tesla` as `ADV`.

In [17]:
doc2[0].pos_

'ADV'
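
The remark that words following "the" are typically nouns can be illustrated by counting tags in context. The sketch below uses four hand-tagged toy sentences, which are an assumption made up for this illustration and not real corpus data; an actual tagger learns from far larger annotated corpora and uses many more features.

```python
from collections import Counter

# Hand-tagged toy sentences (invented for illustration, not real corpus data).
tagged = [
    [("the", "DET"), ("startup", "NOUN"), ("grew", "VERB")],
    [("she", "PRON"), ("bought", "VERB"), ("the", "DET"), ("company", "NOUN")],
    [("the", "DET"), ("new", "ADJ"), ("plan", "NOUN"), ("failed", "VERB")],
    [("the", "DET"), ("phrase", "NOUN"), ("stuck", "VERB")],
]

# Count which tag appears directly after the word "the".
after_the = Counter()
for sentence in tagged:
    for (word, _), (_, next_tag) in zip(sentence, sentence[1:]):
        if word == "the":
            after_the[next_tag] += 1

most_likely_tag, count = after_the.most_common(1)[0]
print(after_the)
print(most_likely_tag)
```

Even in this tiny sample the most frequent tag is not the only one, which is why tagging is a matter of probabilities rather than fixed rules.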

### Dependencies
We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj`, the ***nominal subject*** of the sentence. To look up what a tag or label means, pass it to `spacy.explain()`.

In [18]:
doc2[0].dep_

'nsubj'

In [19]:
spacy.explain('PROPN')

'proper noun'

In [20]:
spacy.explain('nsubj')

'nominal subject'

### Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Example (`Tesla`)|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

#### Lemmas (the base form of the word):

In [23]:
print(doc2[3].text)
print(doc2[3].lemma_)

looking
look


#### Word Shapes:

In [24]:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.
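
The shape outputs above follow a simple character mapping. The function below is an approximation written for illustration, and `word_shape` here is a local helper, not a call into spaCy. It assumes uppercase letters become `X`, lowercase letters `x`, digits `d`, other characters stay as they are, and runs of the same shape character are cut off after four.

```python
def word_shape(text):
    """Approximate a token's shape string (illustrative, not spaCy's code)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch  # punctuation and symbols are kept unchanged
        if s == last:
            run += 1
        else:
            run = 1
            last = s
        # Keep at most four consecutive copies of the same shape character.
        if run <= 4:
            shape.append(s)
    return "".join(shape)

print(word_shape("Tesla"))
print(word_shape("U.S."))
print(word_shape("startup"))
```

Under these assumptions the helper reproduces the two shapes printed above, and a longer all-lowercase word such as `startup` collapses to four `x` characters.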


### Spans
Large `Doc` objects can be hard to work with at times. A **span** is a slice of a `Doc` object in the form `Doc[start:stop]`.

In [25]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [26]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [27]:
type(life_quote)

spacy.tokens.span.Span
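
Span indices follow ordinary Python slice semantics: `start` is inclusive and `stop` is exclusive, so `doc3[16:30]` covers 30 - 16 = 14 tokens, counting the two quotation marks. The sketch below shows the same arithmetic on a plain list of pre-split tokens. The list is a stand-in assumed for illustration; a real `Span` is a view onto its `Doc` rather than a copied list.

```python
# Pre-split tokens standing in for a Doc (a real Doc is produced by nlp()).
tokens = ['the', 'phrase', '"', 'Life', 'is', 'what', 'happens',
          'to', 'us', '"', 'was', 'written']

start, stop = 2, 10          # start is inclusive, stop is exclusive
quote = tokens[start:stop]

print(len(quote))            # equals stop - start
print(" ".join(quote))
```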

### Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [31]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [32]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence.


#### Checking whether a token starts a sentence

In [35]:
doc4[6]

This

In [33]:
doc4[6].is_sent_start

True
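
The link between per-token start flags and sentence segments can be sketched in a few lines. The generator below is a minimal illustration under stated assumptions: `sentences` is a hypothetical helper, and the token list and flags are written by hand rather than produced by a model.

```python
def sentences(tokens, is_sent_start):
    """Group tokens into sentences using per-token 'start of sentence' flags."""
    current = []
    for token, starts in zip(tokens, is_sent_start):
        # A start flag closes the sentence collected so far.
        if starts and current:
            yield current
            current = []
        current.append(token)
    if current:
        yield current

# Hand-written tokens and flags (assumed for illustration).
tokens = ["This", "is", "first", ".", "This", "is", "last", "."]
flags = [True, False, False, False, True, False, False, False]

sents = list(sentences(tokens, flags))
for sent in sents:
    print(" ".join(sent))
```

Nothing is stored as a list of sentences up front; the segments are generated on demand from the flags, which matches the way `Doc.sents` is described above.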