# Introduction to spaCy

## Everything You Need to Know about spaCy

https://spacy.io/usage/spacy-101


spaCy uses neural network models, trained on standard annotated NLP datasets, to predict linguistic annotations for a sentence, such as part-of-speech tags, dependencies, and named entities. Different models suit different use cases: some are larger and more accurate, some are trained on different kinds of data, and some predict different annotations.

### spaCy Features

- Tokenization -- Segmenting text into words, punctuation marks, and other tokens.
- Part-of-Speech Tagging -- Assigning grammatical word types (noun, verb, etc.) to individual tokens in a sentence.
- Dependency Parsing -- Assigning syntactic dependency labels that describe the relationships between tokens.
- Lemmatization -- Assigning the base form of a word, e.g. "was" becomes "be".
- Sentence Boundary Detection -- Finding and segmenting individual sentences.
- Named Entity Recognition -- Labeling named real-world objects such as people, companies, and locations.
- Similarity -- Comparing words, spans, or documents to determine how similar they are.
- Text Classification -- Assigning categories or labels to a whole document or to parts of it.
- Rule-Based Matching -- Finding sequences of tokens based on their text and linguistic annotations, similar to regular expressions.
- Training -- Updating and improving a statistical model's predictions.
- Serialization -- Saving objects to files or byte strings.
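
Two of the features above, rule-based matching and serialization, work without a trained model, so they can be tried on a blank English pipeline. The following is a minimal sketch, assuming a spaCy version recent enough that `Matcher.add` accepts a list of patterns (older 2.x releases used a different signature). The pattern name `APPLE_MUSIC` and the example sentence are illustrative only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank pipeline has only a tokenizer, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("I do not like Apple Music.")

# Rule-based matching: patterns describe tokens, not raw characters.
# "APPLE_MUSIC" is a hypothetical pattern name chosen for this example.
matcher = Matcher(nlp.vocab)
matcher.add("APPLE_MUSIC", [[{"LOWER": "apple"}, {"LOWER": "music"}]])
matched_spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_spans)

# Serialization: round-trip the Doc through a byte string.
data = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(data)
print(restored.text == doc.text)
```

Unlike a plain regular expression, each dictionary in the pattern describes one token, so the same mechanism can also match on attributes such as part-of-speech tags once a trained model is loaded.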

### English Models
spaCy provides downloadable statistical models that predict and assign linguistic features. Most are convolutional neural networks (CNNs) with residual connections, layer normalization, and maxout nonlinearity. The models are used for:

- tagging
- parsing
- entity recognition

#### en_core_web_sm
English multi-task CNN that assigns context-specific token vectors, part-of-speech tags, a dependency parse, and named entities. Size: 29 MB.

In [12]:
import spacy

# load the model
nlp = spacy.load('en_core_web_sm')

# Run the pipeline on a sentence and store the resulting Doc object.
doc = nlp("My name is Harrison and I do not likely Apple Music.")
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

My name is Harrison and I do not likely Apple Music.
My ADJ poss
name NOUN nsubj
is VERB ROOT
Harrison PROPN attr
and CCONJ cc
I PRON nsubj
do VERB aux
not ADV neg
likely ADV conj
Apple PROPN compound
Music PROPN dobj
. PUNCT punct


### Linguistic Annotation 

Load a model with `spacy.load()`, which returns a `Language` object that is conventionally named `nlp`. Calling `nlp` on a string of text returns a processed `Doc`, whose tokens carry the predicted annotations.

In [22]:
doc = nlp("Sometimes I cry myself to sleep at night thinking about Donald Trump and Brexit.")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Sometimes ADV advmod
I PRON nsubj
cry VERB ROOT
myself PRON dobj
to PART aux
sleep VERB ccomp
at ADP prep
night NOUN pobj
thinking VERB advcl
about ADP prep
Donald PROPN compound
Trump PROPN pobj
and CCONJ cc
Brexit PROPN conj
. PUNCT punct
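
The tag and label abbreviations printed above can be expanded with `spacy.explain`, which looks each one up in spaCy's built-in glossary. A small sketch, using a few labels copied from the output above:

```python
import spacy

# Labels copied from the printed output above.
labels = ["PROPN", "nsubj", "dobj", "pobj", "advmod"]

# spacy.explain returns a short description, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:8} {description}")
```

This lookup needs no trained model, so it is a quick way to decode unfamiliar labels while reading parser output.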


### Tokenization
Each document is tokenized using rules specific to its language.
The raw text is first split on whitespace, then the tokenizer processes each resulting substring from left to right.

For each substring, it performs two checks:
1. Does the substring match a tokenizer exception?
2. Can a prefix, suffix, or infix be split off?

Prefix: Character(s) at the beginning, e.g. $, (, “, ¿.

Suffix: Character(s) at the end, e.g. km, ), ”, !.

Infix: Character(s) in between, e.g. -, --, /, ….

If one of the checks matches, the corresponding rule is applied and the tokenizer continues its loop on the newly split substrings, then moves on through the rest of the text.
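
The loop described above can be sketched in plain Python. This is a heavily simplified illustration, not spaCy's actual implementation: the rule tables `EXCEPTIONS`, `PREFIX_RE`, `SUFFIX_RE`, and `INFIX_RE` are hypothetical stand-ins built from the example characters listed above, whereas spaCy's real rules are language-specific and far larger.

```python
import re

# Hypothetical, tiny rule tables for illustration only.
EXCEPTIONS = {"don't": ["do", "n't"]}
PREFIX_RE = re.compile(r'[$("“¿]')                  # one character at the start
SUFFIX_RE = re.compile(r'(?:(?<=\d)km|[)"”!.,])$')  # unit or punctuation at the end
INFIX_RE = re.compile(r'(--|[-/…])')                # separators inside a chunk


def tokenize(text):
    tokens = []
    for chunk in text.split():  # step 0: split the raw text on whitespace
        suffixes = []           # collected right to left, reversed at the end
        while chunk:
            # Check 1: does the remaining substring match an exception?
            if chunk in EXCEPTIONS:
                tokens.extend(EXCEPTIONS[chunk])
                chunk = ""
                continue
            # Check 2: can a prefix or suffix be split off?
            prefix = PREFIX_RE.match(chunk)
            if prefix:
                tokens.append(prefix.group())
                chunk = chunk[prefix.end():]
                continue
            suffix = SUFFIX_RE.search(chunk)
            if suffix:
                suffixes.append(suffix.group())
                chunk = chunk[:suffix.start()]
                continue
            # Nothing left to strip: split the remainder on infixes.
            tokens.extend(part for part in INFIX_RE.split(chunk) if part)
            chunk = ""
        tokens.extend(reversed(suffixes))
    return tokens


print(tokenize("Chimpanzees drink boba-tea in the sunshine."))
print(tokenize("(don't!)"))
```

Re-checking the exception table after every split is what lets a substring such as `(don't!)` lose its brackets and punctuation first and still be handled as the special case `don't`.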

In [28]:
import pandas as pd
doc = nlp("Chimpanzees drink boba-tea in the sunshine.")
df = pd.DataFrame([token.text for token in doc], columns = ["Text"])
df

| | Text |
|---|---|
| 0 | Chimpanzees |
| 1 | drink |
| 2 | boba |
| 3 | - |
| 4 | tea |
| 5 | in |
| 6 | the |
| 7 | sunshine |
| 8 | . |
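
To see which kind of rule produced each token in the example above, the tokenizer can be asked to explain itself. This is a sketch that assumes a reasonably recent spaCy version in which `Tokenizer.explain` is available. It uses a blank English pipeline, so no trained model is required.

```python
import spacy

# A blank pipeline contains only the language-specific tokenizer.
nlp = spacy.blank("en")

# explain() yields (rule_name, token_text) pairs for debugging.
explained = nlp.tokenizer.explain("Chimpanzees drink boba-tea in the sunshine.")
for rule, token_text in explained:
    print(f"{rule:10} {token_text}")

token_texts = [token_text for _, token_text in explained]
```

The rule names in the first column indicate whether a token came from an exception, a prefix, a suffix, an infix, or was left over as an ordinary token.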


### Part-of-Speech Tags and Dependencies
