# spaCy models

spaCy's statistical models predict linguistic attributes in context, e.g. whether a word is a verb. They can predict:

- Part-of-speech (POS) tags
- Syntactic dependencies
- Named entities

Pre-trained models are available. Once a model package is installed, load it by name, e.g. `nlp = spacy.load('en_core_web_sm')`. Some options:

- `en_core_web_sm`
- `en_core_web_md`
- `en_core_web_lg`
- ScispaCy's `en_core_sci_sm`
- ScispaCy's `en_core_sci_md`
- Other ScispaCy models are listed on the [ScispaCy site](https://allenai.github.io/scispacy/)
- [medaCy](https://github.com/NLPatVCU/medaCy) from VCU's NLP Lab
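
Loading a model by name fails if its package is not installed. The sketch below wraps `spacy.load` in a small helper; `load_pipeline` is a hypothetical name, not part of spaCy, and falling back to `spacy.blank("en")` is only one possible choice. Note that a blank pipeline tokenizes text but makes no POS, dependency, or entity predictions.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load a spaCy pipeline by name, falling back to a blank English one.

    Hypothetical helper for illustration; not part of spaCy itself.
    """
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found
        print(f"Model '{name}' not found; using a blank English pipeline.")
        return spacy.blank("en")


nlp = load_pipeline()
# Lists the pipeline components; empty for the blank fallback
print(nlp.pipe_names)
```

Model packages are usually installed from the command line, e.g. `python -m spacy download en_core_web_sm`.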

Models are trained on labeled example texts, and an existing model can be updated with more examples to fine-tune its predictions.
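
As a rough illustration of training on labeled examples, here is a minimal sketch that assumes spaCy v3 (the `spacy.training.Example` API). The two sentences and their `ORG` annotations in `TRAIN_DATA` are made-up toy data, and the sketch starts from a blank pipeline so that it runs without a downloaded model. Two examples are far too few to learn anything useful; the point is only the shape of the loop.

```python
import random

import spacy
from spacy.training import Example

# Toy training data (hypothetical): text plus character offsets of entities
TRAIN_DATA = [
    ("Apple is a company", {"entities": [(0, 5, "ORG")]}),
    ("Google hired new engineers", {"entities": [(0, 6, "ORG")]}),
]

# Start from a blank English pipeline and add an entity recognizer
nlp = spacy.blank("en")
nlp.add_pipe("ner")

# Pair each unannotated Doc with its gold-standard annotations
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# Initialize the model weights; labels are inferred from the examples
optimizer = nlp.initialize(lambda: examples)

for epoch in range(20):
    random.shuffle(examples)
    losses = {}
    # Each call updates the model weights from the labeled examples
    nlp.update(examples, sgd=optimizer, losses=losses)

print(losses)
```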

## Part-of-speech tagging

In [3]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Process a text
doc = nlp("She ate the pizza.")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
. PUNCT


## Syntactic dependencies

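The dependency parser predicts how the words in a sentence are related: each token gets a dependency label (`token.dep_`) and a syntactic head (`token.head`).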

In [4]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate
. PUNCT punct ate
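
The token/head pairs in this output describe a tree rooted at the token labeled `ROOT`, which is its own head. The sketch below rebuilds that tree from plain tuples copied from the output above, so it makes no spaCy calls. Keying the `children` dictionary by token text is a simplification that only works because every word in this sentence is unique.

```python
# (token, dependency label, head) rows copied from the output above
rows = [
    ("She", "nsubj", "ate"),
    ("ate", "ROOT", "ate"),
    ("the", "det", "pizza"),
    ("pizza", "dobj", "ate"),
    (".", "punct", "ate"),
]

# The root is the token whose dependency label is ROOT
root = next(token for token, dep, head in rows if dep == "ROOT")

# Map each head to its direct children, skipping the root's self-reference
children = {}
for token, dep, head in rows:
    if token != head:
        children.setdefault(head, []).append((token, dep))


def render(token, depth=0, label=None):
    """Return the subtree under `token` as indented lines."""
    suffix = f" ({label})" if label else ""
    lines = ["  " * depth + token + suffix]
    for child, dep in children.get(token, []):
        lines.extend(render(child, depth + 1, dep))
    return lines


tree_lines = render(root)
print("\n".join(tree_lines))
```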


You can use the `spacy.explain` function to look up what these abbreviations mean.

In [10]:
# `nsubj` is the nominal subject, e.g. "She"
# `dobj` is the direct object, e.g. "pizza"
# `det` is the determiner, e.g. "the"

spacy.explain('dobj')

'direct object'
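
To look up several labels at once, `spacy.explain` can be called in a loop. This is a small sketch; `not_a_real_label` is a deliberately made-up string, included to show that unknown terms come back as `None` (newer spaCy versions may also emit a warning for them).

```python
import spacy

# Labels from the dependency output, plus one made-up label
labels = ["nsubj", "dobj", "det", "not_a_real_label"]

# Build a small glossary; unknown labels map to None
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(f"{label:<18}{description}")
```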

## Named Entity Recognition

Named entities are real-world objects that are assigned a name, e.g. medications, countries, and organizations.

`doc.ents` returns the predicted entities as a tuple of `Span` objects, so we can iterate over it and print each entity's text and label (using the `label_` attribute).

In [11]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")


# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


In [12]:
spacy.explain("GPE")

'Countries, cities, states'
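
Because each entity is a `Span`, it carries its position in the text as well as its label. The sketch below shows those attributes without needing a downloaded model: it uses a blank English pipeline and sets one entity by hand, which is an assumption made for illustration. With a trained model, `doc.ents` is filled in by the entity recognizer instead.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizes only, predicts no entities
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Manually mark the first token ("Apple") as an ORG entity
doc.ents = [Span(doc, 0, 1, label="ORG")]

for ent in doc.ents:
    # Entity text, label, and character offsets into the original text
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```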

# Examples

In [14]:
import spacy

# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


## POS and syntactic dependency

In [15]:
# POS and dependency

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          PROPN     ROOT      
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          VERB      ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


## Named Entity Prediction

In [16]:
# Predict Named Entities

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY
