## spaCy Crash Course

In [1]:
import spacy
import en_ner_bc5cdr_md as ner
from spacy.matcher import Matcher

In [2]:
nlp = ner.load()

In [3]:
doc = nlp("Caffeine is metabolized into paraxanthine.")

### Tokens

When text is passed to the `nlp` pipeline, it is tokenized: split into tokens such as words and punctuation marks. Iterating over `doc` yields these tokens in order.

In [4]:
for token in doc:
    print(token)

Caffeine
is
metabolized
into
paraxanthine
.
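
Beyond iteration, a `Doc` can be indexed and sliced much like a list. The sketch below assumes a blank English pipeline from `spacy.blank("en")`, which only tokenizes, so it runs without the scispaCy model. The names `nlp_blank`, `doc_blank`, `first_token`, and `phrase` are illustrative and not part of the original notebook.

```python
import spacy

# Blank pipeline: tokenizer only, no trained components (an assumption for this sketch)
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Caffeine is metabolized into paraxanthine.")

# len() counts tokens, including the final period
num_tokens = len(doc_blank)

# Indexing returns a single Token; token.i is its position in the Doc
first_token = doc_blank[0]

# Slicing returns a Span covering tokens 2, 3 and 4
phrase = doc_blank[2:5]

print(num_tokens)
print(first_token.text, first_token.i)
print(phrase.text)
```

A slice is a view onto the original `Doc`, so no text is copied until `.text` is requested.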


### Linguistic Features
Each token has built-in linguistic attributes, including:
-   `text` (original token text)
-   `lemma_` (base form of the word)
-   `pos_` (coarse-grained part-of-speech tag)
-   `dep_` (syntactic dependency label)
-   `is_punct` (boolean: whether the token is punctuation)

In [5]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.is_punct)

Caffeine caffeine NOUN nsubjpass False
is be AUX auxpass False
metabolized metabolize VERB ROOT False
into into ADP case False
paraxanthine paraxanthine NOUN nmod False
. . PUNCT punct True
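
These attributes are often used to filter tokens. As a minimal sketch, the example below drops punctuation using `is_punct`. It assumes a blank English pipeline (`spacy.blank("en")`), which is enough here because `is_punct` is a lexical attribute that does not depend on a trained tagger or parser; attributes such as `pos_` and `dep_` would still need a full model like the one loaded above. The names `nlp_blank`, `doc_blank`, and `words` are illustrative.

```python
import spacy

# Tokenizer-only pipeline (an assumption for this sketch; no model download needed)
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Caffeine is metabolized into paraxanthine.")

# Keep the text of every token that is not punctuation
words = [token.text for token in doc_blank if not token.is_punct]

print(words)
```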


### Token Matching
spaCy includes a rule-based token matching engine, `Matcher`, that finds token sequences matching patterns of linguistic attributes that you define. Consider the example sentence below:

In [6]:
doc2 = nlp("The family received an apple, a sack of rice, and bananas.")

Suppose we want to extract every phrase that follows the pattern `article` `noun`. Articles are tagged as determiners (`DET`), so the pattern is a `DET` token followed by a `NOUN` token. We add this pattern to the matcher:

In [7]:
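matcher = Matcher(nlp.vocab)
# A determiner (e.g. an article) followed by a noun
pattern = [{"POS": "DET"}, {"POS": "NOUN"}]
# Patterns are passed as a list; this form is accepted from spaCy 2.2.2 and required in spaCy 3
matcher.add("CAND", [pattern])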

In [8]:
matches = matcher(doc2)
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc2[start:end]
    print(matched_span.text)

The family
an apple
a sack
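
The `POS` pattern above relies on a pipeline that includes a tagger. As a rough sketch of the same idea using only lexical attributes, the example below matches an article by its lowercase text (`LOWER` with an `IN` list) followed by any alphabetic token (`IS_ALPHA`). It assumes a blank English pipeline, and the names `nlp_blank`, `doc3`, `matcher_lex`, and `ARTICLE_WORD` are illustrative rather than taken from the original notebook.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: no POS tags are available here (an assumption for this sketch)
nlp_blank = spacy.blank("en")
doc3 = nlp_blank("The family received an apple, a sack of rice, and bananas.")

matcher_lex = Matcher(nlp_blank.vocab)
# An article, matched case-insensitively, followed by one alphabetic token
article_then_word = [
    {"LOWER": {"IN": ["a", "an", "the"]}},
    {"IS_ALPHA": True},
]
matcher_lex.add("ARTICLE_WORD", [article_then_word])

# Sort by start index so the phrases come out in document order
found = sorted(matcher_lex(doc3), key=lambda match: match[1])
lexical_matches = [doc3[start:end].text for _, start, end in found]

print(lexical_matches)
```

On this sentence the sketch should find the same three phrases as the `POS` pattern. A lexical pattern is looser, though: it would also match an article followed by an adjective or any other word, which is why the tagged `POS` version is the more precise choice when a full model is available.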
