# spaCy Basics

**spaCy** (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

In this section we'll install and set up spaCy to work with Python, and then introduce some concepts related to Natural Language Processing.

For many NLP tasks, spaCy implements only one method, choosing the most efficient algorithm currently available. As a result, you don't have the option to choose other algorithms.

## NLTK vs spaCy
- NLTK was released in 2001 and spaCy in 2015.
- NLTK provides many functionalities, but includes less efficient implementations.
- For many NLP tasks, spaCy is much faster and more efficient, at the cost of the user not being able to choose the algorithmic implementation.
- However, spaCy doesn't include pre-created models for some applications such as sentiment analysis, which is typically easier to perform with NLTK.
- This course will focus on spaCy because of its efficiency (reportedly up to 400 times faster on some tasks) and state-of-the-art approach, but will use NLTK where it is easier to use.



The `nlp()` function from spaCy takes raw text and automatically performs a series of operations to tag, parse and describe the text data.

Let's discover the pipeline object and its series of operations.


# [Working with spaCy in Python](https://spacy.io/models)

# Installation and Setup

Installation is a two-step process. First, install spaCy using either conda or pip. Next, download the specific model you want, based on language.<br> For more info visit https://spacy.io/usage/

### 1. From the command line or terminal:
> `conda install -c conda-forge spacy`
> <br>*or*<br>
> `pip install -U spacy`

> ### Alternatively you can create a virtual environment:
> `conda create -n spacyenv python=3 spacy=2`

### 2. Next, also from the command line (you may need to run this as admin or use sudo):

> `python -m spacy download en_core_web_sm`

> ### With spaCy v2 and the shortcut command `python -m spacy download en`, a successful download prints a message like:

> **`Linking successful`**<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\en_core_web_sm -->`<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\spacy\data\en`<br>
> ` `<br>
> `    You can now load the model via spacy.load('en')`
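
To confirm that both installation steps worked, a quick check from Python can help. The sketch below prints the installed spaCy version and uses `spacy.util.is_package` to see whether the `en_core_web_sm` package is available. The variable names are illustrative only.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"  # the model name loaded later in this notebook

print("spaCy version:", spacy.__version__)

# is_package returns True when the name is an installed Python package
model_installed = is_package(MODEL_NAME)
if model_installed:
    print(f"{MODEL_NAME} is installed")
else:
    print(f"{MODEL_NAME} is missing; run: python -m spacy download {MODEL_NAME}")
```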




![](images\16.PNG)

This is a typical set of instructions for importing and working with spaCy. Don't be surprised if this takes a while, since spaCy has a fairly large library to load:

In [1]:
print("Learning SpaCy")

Learning SpaCy


Topics to be covered in this lecture:
- Loading the language library
- Building a pipeline object
- Using tokens
- Part-of-speech tagging
- Understanding token attributes

spaCy works with a pipeline object.

The `nlp()` function from spaCy takes raw text and automatically performs a series of operations (tokenization, POS tagging, etc.) to tag, parse, and describe the text data.


To avoid this error: `OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.`  

In your Anaconda Prompt, run the command below. A leading `!` is only needed when you run it from a notebook cell instead:
```
python -m spacy download en_core_web_sm
```

In [4]:
# !python -m spacy download en_core_web_sm

In [5]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm') 
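
If the model may not be installed on every machine, the load step can be guarded. The following is a minimal sketch, not part of the original notebook: `load_pipeline` is a hypothetical helper, and falling back to a blank English pipeline (tokenizer only) is an assumption made here so that the rest of a script can still run.

```python
import spacy

MODEL_NAME = "en_core_web_sm"


def load_pipeline(name=MODEL_NAME):
    """Load a trained pipeline, or fall back to a blank English one (hypothetical helper)."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError (E050) when the package cannot be found
        print(f"Model '{name}' not found; using a blank English pipeline instead.")
        print(f"Install it with: python -m spacy download {name}")
        return spacy.blank("en")


nlp_safe = load_pipeline()
print(nlp_safe.pipe_names)  # an empty list means the fallback was used
```

A blank pipeline still tokenizes text, but attributes such as `pos_` and `dep_` stay empty until a trained model is loaded.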

In [6]:
# Create a Document object
doc = nlp('Tesla is looking at buying U.S. startup for $6 million')
doc

Tesla is looking at buying U.S. startup for $6 million

Using the language library loaded above, spaCy parses the entire sentence into separate components known as tokens.


The `doc` object holds the processed text and is the main focus of this notebook.

In [4]:
type(doc)

spacy.tokens.doc.Doc

The Doc object is iterable and yields one token at a time. (spaCy also builds a companion Vocab object from the doc.)

In [5]:
# Print each token separately
for token in doc:
    print(token)

Tesla
is
looking
at
buying
U.S.
startup
for
$
6
million
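
Because a Doc behaves like a sequence of tokens, the usual sequence operations apply to it. The sketch below assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that it runs without downloading a model. The names `nlp_blank` and `doc_demo` are illustrative.

```python
import spacy

# Assumption: a blank English pipeline, which contains only the tokenizer
nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")

print(len(doc_demo))                       # number of tokens in the Doc
print([token.text for token in doc_demo])  # tokens as plain strings

for token in doc_demo:
    # token.i is the token's position in the Doc,
    # token.idx is its character offset in the original text
    print(token.i, token.idx, repr(token.text))
```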


Observe that __U.S.__ is kept together as a single token, while __$__ is split off from __6__. spaCy also tags __Tesla__ as a proper noun, as the output of the next cell shows.

So spaCy knows a lot of information about these tokens.

`token.pos_` is the coarse part-of-speech tag and `token.dep_` is the syntactic dependency label.

In [7]:
# Print each token with its part-of-speech tag and syntactic dependency
for token in doc:
    print(f"{token.text} - {token.pos_} - {token.dep_}")

Tesla - PROPN - nsubj
is - AUX - aux
looking - VERB - ROOT
at - ADP - prep
buying - VERB - pcomp
U.S. - PROPN - dobj
startup - VERB - advcl
for - ADP - prep
$ - SYM - quantmod
6 - NUM - compound
million - NUM - pobj


This doesn't look very user-friendly, but right away we see some interesting things happen:
1. Tesla is recognized to be a Proper Noun, not just a word at the start of a sentence
2. U.S. is kept together as one entity (we call this a 'token')

As we dive deeper into spaCy we'll see what each of these abbreviations means and how it is derived. We'll also see how spaCy can interpret the last three tokens combined `$6 million` as referring to ***money***.

___
# spaCy Objects

After importing the spacy module in the cell above we loaded a **model** and named it `nlp`.<br>Next we created a **Doc** object by applying the model to our text, and named it `doc`.<br>spaCy also builds a companion **Vocab** object (for vocabulary) that we'll cover in later sections.<br>The **Doc** object that holds the processed text is our focus here.

___
# Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data.   


![](images\15a.PNG)

We can check to see what components currently live in the pipeline. In later sections we'll learn how to disable components and add new ones as needed.

In [6]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x15715e96380>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x157161a4fa0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x15715ef6960>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1571602db80>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x15715d6d2c0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x15715ef6c00>)]

We can get the basic names using `nlp.pipe_names`.

In [7]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
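
As a small preview of how components enter the pipeline, the sketch below starts from a blank English pipeline and adds one built-in component. It assumes the spaCy v3 API, where `add_pipe` takes the component's registered string name. The name `nlp_custom` is illustrative.

```python
import spacy

# Assumption: spaCy v3, where add_pipe accepts a registered component name
nlp_custom = spacy.blank("en")
print(nlp_custom.pipe_names)         # a blank pipeline has a tokenizer but no components

nlp_custom.add_pipe("sentencizer")   # rule-based sentence splitter that ships with spaCy
print(nlp_custom.pipe_names)
print(nlp_custom.has_pipe("sentencizer"))

# nlp.pipeline holds (name, component) pairs in processing order
for name, component in nlp_custom.pipeline:
    print(name, type(component).__name__)
```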

___
## Tokenization
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. 

We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:

In [9]:
# The u prefix marks a unicode string; in Python 3 every str is unicode, so the prefix is optional
doc2 = nlp(u"Tesla isn't   looking into startups anymore.") 

for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
   SPACE dep
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


Notice how `isn't` has been split into two tokens. spaCy recognizes both the auxiliary verb `is` and the negation attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.

It's important to note that even though `doc2` contains processed information about each token, it also retains the original text:

In [15]:
doc2

Tesla isn't   looking into startups anymore.
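
The original text survives because each token remembers the whitespace that follows it. The sketch below rebuilds the text from the tokens. It assumes a blank English pipeline so that no model download is needed, and the names `nlp_blank`, `doc_ws` and `rebuilt` are illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only)
nlp_blank = spacy.blank("en")
text = "Tesla isn't   looking into startups anymore."
doc_ws = nlp_blank(text)

for token in doc_ws:
    # whitespace_ is the trailing whitespace character (if any) after the token
    print(repr(token.text), repr(token.whitespace_))

# text_with_ws is the token text plus its trailing whitespace
rebuilt = "".join(token.text_with_ws for token in doc_ws)
print(rebuilt == text)
print(doc_ws.text == text)
```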

We can also use indexing to grab tokens.

In [16]:
doc2[0]

Tesla

In [17]:
doc2[0].dep_

'nsubj'

From that token object we can also get its part of speech and syntactic dependency.

In [18]:
doc2[0].pos_

'PROPN'

In [19]:
type(doc2)

spacy.tokens.doc.Doc

___
## Part-of-Speech Tagging (POS)
The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `Tesla` was recognized to be a ***proper noun***. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

In [20]:
doc2[0].pos_

'PROPN'

___
## Dependencies
We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)

In [21]:
doc2[0].dep_

'nsubj'

To see the full name of a tag use `spacy.explain(tag)`

In [22]:
spacy.explain('PROPN')

'proper noun'

In [23]:
spacy.explain('nsubj')

'nominal subject'
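
To decode several labels at once, `spacy.explain` can be applied in a loop. The sketch below looks up some of the labels that appear in the outputs earlier in this notebook. The list contents and the `glossary` name are choices made for this example.

```python
import spacy

# Labels copied from the outputs earlier in this notebook
pos_labels = ["PROPN", "AUX", "VERB", "ADP", "NUM", "PART", "PUNCT"]
dep_labels = ["nsubj", "aux", "prep", "pobj", "dobj", "neg"]

# spacy.explain is a glossary lookup, so no trained model is required
glossary = {label: spacy.explain(label) for label in pos_labels + dep_labels}

for label, meaning in glossary.items():
    print(f"{label:>6}  {meaning}")
```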

___
## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [24]:
print(doc2)

Tesla isn't   looking into startups anymore.


In [25]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

looking
look


In [26]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

VERB
VBG / verb, gerund or present participle


In [27]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.


In [28]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
False
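
Several of these attributes are purely lexical, so they are available even without a trained model, whereas `lemma_`, `pos_` and `tag_` rely on pipeline components. The sketch below prints a few lexical attributes for every token. It assumes a blank English pipeline, and `like_num` is an extra token attribute not listed in the table above.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only)
nlp_blank = spacy.blank("en")
doc_attrs = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")

print(f"{'text':<10}{'shape_':<8}{'is_alpha':<10}{'is_stop':<9}{'like_num':<9}")
for token in doc_attrs:
    # Booleans are converted to str so they can be padded into columns
    print(f"{token.text:<10}{token.shape_:<8}{str(token.is_alpha):<10}"
          f"{str(token.is_stop):<9}{str(token.like_num):<9}")
```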


___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [29]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [30]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


spaCy treats the _life_quote_ variable as a Span of _doc3_, not as a plain string.

In [31]:
type(life_quote)

spacy.tokens.span.Span

In upcoming lectures we'll see how to create Span objects using `Span()`. This will allow us to assign additional information to the Span.
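
As a brief preview, the sketch below shows the token offsets that a sliced Span carries and how a Span built with `Span()` can hold a label. It assumes a blank English pipeline, and the `"ORG"` label is assigned by hand here purely for illustration, not predicted by a model.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank English pipeline (tokenizer only)
nlp_blank = spacy.blank("en")
doc_span = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")

# Slicing a Doc returns a Span that remembers its position in the parent Doc
money = doc_span[8:11]
print(money.text, money.start, money.end, len(money))
print(money.doc is doc_span)

# Constructing a Span directly lets us attach a label of our own choosing
org = Span(doc_span, 0, 1, label="ORG")
print(org.text, org.label_)
```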

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [32]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

`Doc.sents` is a generator, so it has to be iterated over (or wrapped in `list()`) and cannot be indexed directly.

In [33]:
for sent in doc4.sents:   ## .sents => sentence attribute
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [35]:
for token in doc4:
    print(token.text)

This
is
the
first
sentence
.
This
is
another
sentence
.
This
is
the
last
sentence
.


In [36]:
doc4[0]

This

In [34]:
doc4[6].is_sent_start   ### ask if this is start of a sentence

True

In [37]:
doc4[7].is_sent_start

False

## Next up: Tokenization