# spaCy

spaCy is a Python library for building text analysis pipelines, covering tasks such as named entity recognition, part-of-speech tagging, and entity linking, with support for 70+ languages. Its predictions come from trained statistical models. It also supports adding custom components to its pipelines and training new models, and it has some useful built-in visualizers.


In [117]:
import spacy
from spacy.matcher import Matcher, PhraseMatcher
from spacy.tokens import Token, Doc, Span
from spacy.language import Language
from IPython.display import clear_output
import random

## Chapter 1: Finding words, phrases, names, & concepts

spaCy's core functionality lies in the processing pipeline, typically called `nlp`. This object can be used like a function to analyze text. 

In the cell below, a blank pipeline is created. It contains only the language-specific rules and components, such as those used for tokenizing.

In [26]:
nlp = spacy.blank('en')

Processing text with this object yields a `Doc` object. `Token` objects represent the tokens in a `Doc`, which can be indexed or iterated over.

In [28]:
doc = nlp('Hello world!')


for token in doc:
    print(f'Token at index {token.i} (iterator) is: {token.text}')

print(f'Token at index 1 (index) is: {doc[1]}')

Token at index 0 (iterator) is: Hello
Token at index 1 (iterator) is: world
Token at index 2 (iterator) is: !
Token at index 1 (index) is: world


`Span` objects are slices of the `Doc`. A `Span` is only a view of the `Doc` and doesn't contain any data itself. They can be created using normal Python slicing on a `Doc`.

In [29]:
span = doc[1:3]
print(f'The span text from index 1:3 is: {span.text}')

The span text from index 1:3 is: world!


`Tokens` have a number of useful attributes, such as:
- i: index within the parent document
- text: token text
- is_alpha: bool indicating whether the token consists of alphabetic characters
- is_punct: bool indicating whether the token is punctuation
- like_num: bool indicating whether the token "resembles" a number

Attributes such as these are lexical attributes: they don't depend on how the token is used (its context).

In [30]:
doc = nlp('Google is looking at buying a London based company for $20 million.')

for token in doc:
    print(f'Index: {token.i:2d}, Text: {token.text:>10}, Is alphabetic: {token.is_alpha:3}, Is punctuation: {token.is_punct:3}, Like number: {token.like_num:3}')

Index:  0, Text:     Google, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  1, Text:         is, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  2, Text:    looking, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  3, Text:         at, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  4, Text:     buying, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  5, Text:          a, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  6, Text:     London, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  7, Text:      based, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  8, Text:    company, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  9, Text:        for, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index: 10, Text:          $, Is alphabetic:   0, Is punctuation:   0, Like number:   0
Index: 11, Text:         20, Is alphabetic:

### Trained Pipelines

Trained pipelines contain models that use context to make predictions, e.g. POS tags and named entities. The `spacy download` command downloads a trained pipeline, which can then be loaded with the `spacy.load` function.

A pipeline's package contains the necessary weights for its models, the vocabulary, meta information, and the configuration file used to train it.

```python -m spacy download en_core_web_sm```

In [33]:
!python -m spacy download en_core_web_sm
clear_output()
print('en_core_web_sm pipeline can now be loaded!')

en_core_web_sm pipeline can now be loaded!


With a trained pipeline, we can predict context-dependent attributes. Attributes that return strings usually end with an underscore; those without one return an integer ID from the central `Vocab`. Some context-dependent attributes include:
- pos_: predicted part-of-speech
- dep_: dependency label, relationship between two tokens
- head: syntactic head token, parent token this one is attached to

In [40]:
nlp = spacy.load('en_core_web_sm')

doc = nlp('She ate the large pizza')

for token in doc:
    print(f'Token text: {token.text:>10}, Token POS: {token.pos_:>5}, Token POS ID: {token.pos}, Token Dependency: {token.dep_}, Token Head: {token.head}')


Token text:        She, Token POS:  PRON, Token POS ID: 95, Token Dependency: nsubj, Token Head: ate
Token text:        ate, Token POS:  VERB, Token POS ID: 100, Token Dependency: ROOT, Token Head: ate
Token text:        the, Token POS:   DET, Token POS ID: 90, Token Dependency: det, Token Head: pizza
Token text:      large, Token POS:   ADJ, Token POS ID: 84, Token Dependency: amod, Token Head: pizza
Token text:      pizza, Token POS:  NOUN, Token POS ID: 92, Token Dependency: dobj, Token Head: ate


The `.ents` attribute on a `Doc` object gives access to the named entities predicted by the NER model. It returns a tuple of `Span` objects.

In [46]:
doc = nlp('Apply is looking at buying U.K. startup for $1 billion after receiving positive reviews on their iPhone X case.')

for ent in doc.ents:
    print(f'Entity: {ent.text:>10}, Label: {ent.label_:>5}')

Entity:       U.K., Label:   GPE
Entity: $1 billion, Label: MONEY


The `spacy.explain` function can be used to get definitions for most tags & labels.

In [43]:
print(spacy.explain('GPE'))
print(spacy.explain('NNP'))
print(spacy.explain('dobj'))

Countries, cities, states
noun, proper singular
direct object


### Rule-based matching

The matcher in spaCy operates on `Doc` and `Token` objects rather than only on strings, as regular expressions do, so it can match on lexical or predicted attributes as well as text.

Matcher patterns are lists of dictionaries, with each dictionary describing one token. The keys of the dictionaries are the names of attributes.

The matcher needs to be initialized with the shared vocabulary. Its `add` method registers patterns, taking a unique ID & a list of patterns. Calling the matcher on a `Doc` object returns a list of tuples containing the match ID, start index, & end index.

Common keys to use for patterns are:
- TEXT
- LOWER
- LEMMA
- POS
- IS_DIGIT
- IS_PUNCT

In [55]:
matcher = Matcher(nlp.vocab)

pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(f'Match ID: {match_id}, Matched Span: {matched_span}')

Match ID: 9528407286733565721, Matched Span: iPhone X
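
Patterns that only use lexical attributes do not need a trained pipeline. The cell below is a minimal sketch of that idea on a blank English pipeline; the example sentence and the `WORLD_CUP` ID are made up for illustration. It also looks the match ID up in the string store to recover the name passed to `add`.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough: the pattern below only uses lexical attributes.
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

# One dictionary per token: a digit-only token followed by three tokens
# compared on their lowercase form, so casing in the text does not matter.
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
]
matcher.add('WORLD_CUP', [pattern])  # hypothetical pattern ID

doc = nlp('The 2018 FIFA World Cup was held in Russia')

matched = []
for match_id, start, end in matcher(doc):
    # The match ID is a hash; the string store maps it back to the name.
    matched.append((nlp.vocab.strings[match_id], doc[start:end].text))

print(matched)
```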


Operators let you define how often a token should be matched. They are added using the 'OP' key.
- ?: makes the token optional, matched 0 or 1 times
- !: negates the token, matched 0 times
- +: matches a token 1+ times
- *: matches a token 0+ times

In [56]:
pattern = [{'LEMMA': 'buy'}, {'POS': 'DET', 'OP': '?'}, {'POS': 'NOUN'}]
matcher = Matcher(nlp.vocab)

matcher.add('BUYING', [pattern])

doc = nlp("I bought a smartphone. Now I'm buying apps.")

matches = matcher(doc)

for match_id, start, end in matches:
    print(f'Match: {doc[start:end]}')

Match: bought a smartphone
Match: buying apps


## Chapter 2: Large-scale data analysis with spaCy

spaCy stores all shared data in a central vocabulary, which includes a bidirectional lookup table for strings shared across multiple documents. All strings are hashed and stored in the string store (available as `nlp.vocab.strings`); internally, spaCy only communicates using the hash IDs.

The `Doc` object also exposes its vocab & strings.

In [58]:
nlp.vocab.strings.add('coffee')
coffee_hash = nlp.vocab.strings['coffee']
coffee_str = nlp.vocab.strings[coffee_hash]

print(f'String: {coffee_str}, Hash: {coffee_hash}')

doc = nlp('I love coffee')
print(f'Coffee document hash: {doc.vocab.strings["coffee"]}')

String: coffee, Hash: 3197928453018144401
Coffee document hash: 3197928453018144401
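
To make the "bidirectional lookup table" idea concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: the `ToyStringStore` class is hypothetical, and SHA-256 is used only as a convenient stand-in for spaCy's own hash function. The point it illustrates is that a hash can always be computed from a string, but a hash can only be turned back into a string by a store that has seen that string.

```python
import hashlib


class ToyStringStore:
    """Toy two-way lookup: string -> hash ID and hash ID -> string."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(string):
        # Stand-in hash: first 8 bytes of SHA-256 as an unsigned integer.
        digest = hashlib.sha256(string.encode('utf8')).digest()
        return int.from_bytes(digest[:8], 'big')

    def add(self, string):
        key = self._hash(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self._hash(item)   # string -> hash always works
        return self._by_hash[item]    # hash -> string needs a stored entry


store = ToyStringStore()
coffee_hash = store.add('coffee')
print(store['coffee'] == coffee_hash, store[coffee_hash])

# A different store has never seen 'coffee', so it cannot reverse the hash.
other_store = ToyStringStore()
try:
    other_store[coffee_hash]
    reversed_ok = True
except KeyError:
    reversed_ok = False
print(reversed_ok)
```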


`Lexemes` are context-independent entries in the `Vocab`. They are returned by looking up a string or hash ID in the vocab. `Lexemes` expose lexical attributes just like `Tokens`.

In [59]:
lexeme = nlp.vocab['coffee']

print(f'Lexeme Text: {lexeme.text}, Lexeme Hash: {lexeme.orth}, Example attribute (is_alpha): {lexeme.is_alpha}')

Lexeme Text: coffee, Lexeme Hash: 3197928453018144401, Example attribute (is_alpha): True


### Data Structures: Doc, Span, & Token

`Docs` are one of the central data structures in spaCy. They are created automatically by calling `nlp` on some text, or manually by providing the shared vocab, a list of words, & a list of bools indicating which words are followed by a space.

In [62]:
words = ['Hello', 'world', '!']
spaces = [True, False, False]

doc = Doc(nlp.vocab, words = words, spaces = spaces)

print(doc)

Hello world!


`Spans` are slices of `Docs`. A `Span` can be created manually from the `Doc` it refers to, the starting index, & the ending index (exclusive). It can also be given an optional label.

In [64]:
span = Span(doc, 0, 2, label = 'GREETING')

print(span)

Hello world


spaCy is optimized to work with `Docs` & `Spans`, so it's usually best to convert them to text as late as possible and to use built-in attributes as much as possible, e.g. `token.i` for the token index.
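
As a rough illustration of "convert to text as late as possible", the sketch below finds percentages using token attributes and token indices, and only produces strings at the very end. The example text and variable names are made up for illustration, and a blank pipeline is used because only lexical attributes are needed.

```python
import spacy

nlp = spacy.blank('en')
doc = nlp(
    'In 1990, more than 60% of people in East Asia were in extreme poverty. '
    'Now less than 4% are.'
)

percent_spans = []
for token in doc:
    # Use the built-in lexical attribute and the token index rather than
    # inspecting raw strings with string methods or regular expressions.
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == '%':
            percent_spans.append(doc[token.i:token.i + 2])

# Only now, at the end, are the spans converted to plain strings.
percentages = [span.text for span in percent_spans]
print(percentages)
```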

### Word vectors and semantic similarity

`Doc`, `Token`, & `Span` objects all have a `similarity` method that takes in another object and returns a floating point number, generally between 0 and 1, indicating how similar they are. Objects of different types can be compared, e.g. a `Doc` and a `Token`.

Doing this requires a medium or large pipeline; the small one doesn't ship with word vectors.

In [68]:
!python -m spacy download en_core_web_md
clear_output()

print('en_core_web_md pipeline can now be loaded!')

en_core_web_md pipeline can now be loaded!


In [71]:
nlp = spacy.load('en_core_web_md')

doc1 = nlp('I like fast food')
doc2 = nlp('I like pizza')
print(f'doc1 <-> doc2 similarity: {doc1.similarity(doc2)}')

doc = nlp('I like pizza and pasta')
token1, token2 = doc[2], doc[4]
print(f'Token 1: {token1}, Token 2: {token2}, Similarity: {token1.similarity(token2)}')

doc1 <-> doc2 similarity: 0.869833325851152
Token 1: pizza, Token 2: pasta, Similarity: 0.6850197911262512


By default, similarity is measured as the cosine similarity between two vectors. Objects composed of several tokens, like `Doc`, use the average of their token vectors.

These vectors can also be accessed individually.

In [73]:
doc = nlp('I have a banana.')
print(f'{doc[3].text}: {doc[3].vector}')

banana: [ 0.20778  -2.4151    0.36605   2.0139   -0.23752  -3.1952   -0.2952
  1.2272   -3.4129   -0.54969   0.32634  -1.0813    0.55626   1.5195
  0.97797  -3.1816   -0.37207  -0.86093   2.1509   -4.0845    0.035405
  3.5702   -0.79413  -1.7025   -1.6371   -3.198    -1.9387    0.91166
  0.85409   1.8039   -1.103    -2.5274    1.6365   -0.82082   1.0278
 -1.705     1.5511   -0.95633  -1.4702   -1.865    -0.19324  -0.49123
  2.2361    2.2119    3.6654    1.7943   -0.20601   1.5483   -1.3964
 -0.50819   2.1288   -2.332     1.3539   -2.1917    1.8923    0.28472
  0.54285   1.2309    0.26027   1.9542    1.1739   -0.40348   3.2028
  0.75381  -2.7179   -1.3587   -1.1965   -2.0923    2.2855   -0.3058
 -0.63174   0.70083   0.16899   1.2325    0.97006  -0.23356  -2.094
 -1.737     3.6075   -1.511    -0.9135    0.53878   0.49268   0.44751
  0.6315    1.4963    4.1725    2.1961   -1.2409    0.4214    2.9678
  1.841     3.0133   -4.4652    0.96521  -0.29787   4.3386   -1.2527
 -1.7734   -3.5637   
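
The "average the token vectors, then take the cosine" computation can be sketched in plain Python. This is an illustration of the arithmetic only, not spaCy's code: the helper names and the tiny 3-dimensional vectors are made up, whereas real pipeline vectors have hundreds of dimensions.

```python
import math


def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "token vectors" for two short documents.
doc_a_vectors = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc_b_vectors = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

sim = cosine_similarity(average_vector(doc_a_vectors),
                        average_vector(doc_b_vectors))
print(round(sim, 4))

# Vectors pointing in opposite directions give a negative cosine, which is
# why the score is not strictly guaranteed to be non-negative.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(round(opposite, 4))
```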

There is no objective measure of similarity; it depends on the context and what the application needs to do. For example, the cell below has two pieces of text with opposing sentiments that are nevertheless scored as very similar.

In [74]:
doc1 = nlp('I love cats')
doc2 = nlp('I hate cats')

print(f'doc1 <-> doc2 similarity: {doc1.similarity(doc2)}')

doc1 <-> doc2 similarity: 0.9534673962804006


### Combining predictions & rules

Statistical models are useful if you need to generalize from a few examples instead of listing out all possibilities. Rule-based approaches might be more useful if there's a finite number of instances we're searching for.

Statistical components include the named entity recognizer, dependency parser, and part-of-speech tagger. Rule-based components include the tokenizer, `Matcher`, and `PhraseMatcher`.

The `PhraseMatcher` is faster & more efficient than the `Matcher`. It takes `Docs` as patterns and gives access to the matched tokens in context, which makes it better suited to large dictionaries & word lists on large volumes of text. It follows the same API as the base `Matcher`, except that each pattern is a `Doc` object instead of a list of dictionaries.

In [78]:
matcher = PhraseMatcher(nlp.vocab)

pattern = nlp('Golden Retriever')
matcher.add('DOG', [pattern])
doc = nlp('I have a Golden Retriever')

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print(f'Matched span: {span.text}')

Matched span: Golden Retriever
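
For a longer word list, one possible setup is sketched below. It assumes a blank pipeline is sufficient, builds the patterns with `nlp.make_doc` (which only tokenizes), and passes `attr='LOWER'` so that matching ignores case. The term list and example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')

# Compare tokens on their lowercase form instead of the exact text.
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

# Hypothetical word list; in practice this could hold thousands of terms.
terms = ['Golden Retriever', 'German Shepherd', 'Border Collie']

# make_doc only runs the tokenizer, which is all a phrase pattern needs.
patterns = [nlp.make_doc(term) for term in terms]
matcher.add('DOG', patterns)

doc = nlp('My golden retriever chased a BORDER COLLIE at the park.')
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```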


## Chapter 3: Processing Pipelines

Pipelines are a series of functions applied to a `Doc` to add attributes. The first thing a pipeline does is apply the tokenizer to turn the string into a `Doc` object, then the series of components are applied to the object in order. spaCy has several built-in components, such as:
- tagger: part-of-speech tagger, creates `Token.tag` & `Token.pos`
- parser: dependency parser, creates `Token.dep`, `Token.head`, `Doc.sents`, `Doc.noun_chunks`
- ner: named entity recognizer, creates `Doc.ents`, `Token.ent_iob`, `Token.ent_type`
- textcat: text classifier, creates `Doc.cats`

All pipeline packages include several files & a `config.cfg`, which defines things like the language, which components to instantiate, & how they should be configured. The names of the pipeline components are stored in the `pipe_names` attribute, while the `pipeline` attribute stores `(name, component)` tuples.

*Text categories are very specific, so the text classifier is not included in any trained pipeline by default, but it can be used to train your own.



In [79]:
print(f'Component names: {nlp.pipe_names}')
print(f'Names and components: {nlp.pipeline}')

Component names: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Names and components: [('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7ff81cc64f50>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7ff81cc64b90>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7ff83cc5cf90>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7ff814016610>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7ff81d2734d0>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7ff83cc5d4d0>)]


### Custom pipeline components

Custom components allow you to change the behavior of the executed pipeline. A component is a function or callable that takes a `Doc`, modifies it, & returns it to be processed by the next component. Custom components have to be decorated with `@Language.component` above the definition, providing a name.

Once registered, it can be added using the `add_pipe` method providing the name given to the decorator. To specify where the component should be added, use the `first` or `last` keyword (which take a bool) or the `before` or `after` keywords (which take the string name of another component).

In [90]:
nlp = spacy.load('en_core_web_md')

@Language.component('custom_component')
def custom_component_function(doc):
    print(f'Document Length: {len(doc)}')
    return doc

nlp.add_pipe('custom_component', first=True)
# nlp.add_pipe('custom_component', after='ner')

print(f'Current pipeline: {nlp.pipe_names}')

doc = nlp('Hello world!')

Current pipeline: ['custom_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Document Length: 3


In [None]:
nlp.remove_pipe('custom_component')

### Extension attributes

Custom attributes can add metadata to `Docs`, `Tokens`, & `Spans`, either computed dynamically or set once. These attributes are available via the `._` (dot underscore) property, which makes it clear they were added by the user. They must be registered on the global class using the `set_extension` method, providing an attribute name and using keyword args to determine how the value should be computed.

In [96]:
Doc.set_extension('title', default = None)

There are 3 types of extensions:
- attribute extensions: set a default value that can be overwritten
- property extensions: define a getter & optional setter function; the getter is called only when you retrieve the attribute & takes only one argument (the object itself, e.g. the token)
- method extensions: make the attribute a callable method, taking in arguments & computing values dynamically, first argument is always the object itself
  
*If setting an extension attribute on a `Span`, you almost always want to use a property extension; otherwise you'd have to update every possible span by hand.

In [112]:
doc = nlp('The sky is blue.')


#attribute
Token.set_extension('is_color', default = False)
doc[3]._.is_color = True

#property
def get_is_title(token):
    return 'A' <= token.text[0] <= 'Z'
Token.set_extension('is_title', getter = get_is_title)

#method
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc
Doc.set_extension('has_token', method = has_token)


for token in doc:
    print(f'Token: {token.text}, is_color: {token._.is_color}, is_title: {token._.is_title}')

print(f'Document has token "blue"? {doc._.has_token("blue")}')
print(f'Document has token "cloud"? {doc._.has_token("cloud")}')

Token: The, is_color: False, is_title: True
Token: sky, is_color: False, is_title: False
Token: is, is_color: False, is_title: False
Token: blue, is_color: True, is_title: False
Token: ., is_color: False, is_title: False
Document has token "blue"? True
Document has token "cloud"? False


In [113]:
Token.remove_extension('is_color')
Token.remove_extension('is_title')
Doc.remove_extension('has_token')
Doc.remove_extension('title')

(None, None, None, None)
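
The note about `Span` extensions can be illustrated with a small sketch. The `has_number` attribute and its getter are hypothetical names chosen for this example. Because the getter runs each time the attribute is read, it works for any slice of the `Doc` without storing a value per span. `force=True` is passed only so the cell can be re-run without an "already registered" error.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank('en')


def get_has_number(span):
    # Computed on demand from the tokens inside whichever span is asked.
    return any(token.like_num for token in span)


# Property extension: only a getter, no stored default value.
Span.set_extension('has_number', getter=get_has_number, force=True)

doc = nlp('The museum closed for five years in 2012.')

first = doc[1:4]   # museum closed for
second = doc[3:6]  # for five years
print(first.text, first._.has_number)
print(second.text, second._.has_number)
```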

### Scaling & Performance

The `nlp.pipe` method processes texts as a stream, yielding `Doc` objects, which is much faster than calling `nlp` on each text separately. Since it's a generator, we need to wrap it in a list to get a list of `Docs`.

In [116]:
LOTS_OF_TEXT = ['The quick brown fox jumped over the lazy dog' for _ in range(100)]
docs = list(nlp.pipe(LOTS_OF_TEXT))

It also supports passing context along with each text: if you set `as_tuples` to True and pass in `(text, context)` tuples, it yields `(doc, context)` tuples. This is useful for including extra metadata.

In [118]:
LOTS_OF_TEXT = [(text, {'page_number': random.randint(1, 101)}) for text in LOTS_OF_TEXT]
docs = list(nlp.pipe(LOTS_OF_TEXT, as_tuples=True))

for doc, context in docs:
    print(doc.text, context['page_number'])

The quick brown fox jumped over the lazy dog 26
The quick brown fox jumped over the lazy dog 40
The quick brown fox jumped over the lazy dog 1
The quick brown fox jumped over the lazy dog 22
The quick brown fox jumped over the lazy dog 78
The quick brown fox jumped over the lazy dog 37
The quick brown fox jumped over the lazy dog 52
The quick brown fox jumped over the lazy dog 93
The quick brown fox jumped over the lazy dog 100
The quick brown fox jumped over the lazy dog 56
The quick brown fox jumped over the lazy dog 101
The quick brown fox jumped over the lazy dog 72
The quick brown fox jumped over the lazy dog 51
The quick brown fox jumped over the lazy dog 9
The quick brown fox jumped over the lazy dog 80
The quick brown fox jumped over the lazy dog 23
The quick brown fox jumped over the lazy dog 87
The quick brown fox jumped over the lazy dog 50
The quick brown fox jumped over the lazy dog 80
The quick brown fox jumped over the lazy dog 13
The quick brown fox jumped over the lazy
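
One common way to use the context is to copy it onto the `Doc` through an extension attribute, so the metadata travels with the document. The sketch below assumes a blank pipeline and a hypothetical `page_number` extension; the texts and page numbers are made up.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank('en')

# Hypothetical extension to hold the metadata; force=True allows re-running.
Doc.set_extension('page_number', default=None, force=True)

data = [
    ('This is a text', {'page_number': 15}),
    ('And another text', {'page_number': 16}),
]

docs = []
for doc, context in nlp.pipe(data, as_tuples=True):
    # Attach the metadata to the Doc itself instead of keeping it separate.
    doc._.page_number = context['page_number']
    docs.append(doc)

for doc in docs:
    print(doc.text, doc._.page_number)
```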

If all you need is a tokenized `Doc`, and not to run the entire pipeline, the `nlp.make_doc` method can be used to tokenize some text and return a `Doc`.

In [119]:
doc = nlp.make_doc('Hello world!')
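
To see the difference between calling `nlp` and calling `nlp.make_doc`, the sketch below adds a small component that records every text it processes. The component name `record_call` and the `calls` list are hypothetical and exist only for this illustration.

```python
import spacy
from spacy.language import Language

calls = []


@Language.component('record_call')
def record_call(doc):
    # Remember which texts actually passed through the pipeline components.
    calls.append(doc.text)
    return doc


nlp = spacy.blank('en')
nlp.add_pipe('record_call')

full_doc = nlp('Hello world!')                # tokenizer + components
tokens_only = nlp.make_doc('Hello again!')    # tokenizer only

print(calls)
print([token.text for token in tokens_only])
```

Only the text passed to `nlp` shows up in `calls`, while `make_doc` still returns a fully tokenized `Doc`.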