# spaCy

spaCy is a Python library for building text analysis pipelines. It covers tasks such as named entity recognition, part-of-speech tagging, and entity linking, and supports more than 70 languages through trained models. It also supports adding custom components to a pipeline and training new models, and it ships with some useful built-in visualizers.


In [25]:
import spacy
from IPython.display import clear_output

## Chapter 1: Finding words, phrases, names, & concepts

spaCy's core functionality lies in the processing pipeline, typically called `nlp`. This object can be used like a function to analyze text. 

In the cell below, a blank pipeline is created. It contains only the language-specific rules and components, such as those used for tokenizing.

In [26]:
nlp = spacy.blank('en')
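
As a quick sanity check, a blank pipeline should report no trained components. This is a minimal sketch, assuming spaCy v3 is installed; the variable name `blank_nlp` is only illustrative.

```python
import spacy

# Create a blank English pipeline: tokenizer rules only, no trained components.
blank_nlp = spacy.blank("en")

print(blank_nlp.lang)        # language code the pipeline was created for
print(blank_nlp.pipe_names)  # names of the components in the pipeline
```

An empty component list is what distinguishes a blank pipeline from a trained one loaded with `spacy.load`.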

Processing text with this object yields a `Doc` object. `Token` objects represent the tokens in a `Doc`, which can be indexed or iterated over.

In [28]:
doc = nlp('Hello world!')


for token in doc:
    print(f'Token at index {token.i} (iterator) is: {token.text}')

print(f'Token at index 1 (index) is: {doc[1]}')

Token at index 0 (iterator) is: Hello
Token at index 1 (iterator) is: world
Token at index 2 (iterator) is: !
Token at index 1 (index) is: world


A `Span` object is a slice of a `Doc`. It is only a view of the `Doc` and doesn't contain any data itself. Spans can be created using normal Python slicing on a `Doc`.

In [29]:
span = doc[1:3]
print(f'The span text from index 1:3 is: {span.text}')

The span text from index 1:3 is: world!
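
To make the "view" idea concrete, the sketch below checks that a span only stores boundaries into its parent `Doc` and that its tokens keep their document-level indices. It assumes the same blank English pipeline as above.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world!")
span = doc[1:3]

# A span records where it starts and ends in the parent document.
print(span.start, span.end)

# The span refers back to the Doc it was sliced from.
print(span.doc is doc)

# Tokens reached through the span keep their index in the parent Doc.
print([token.i for token in span])
```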


`Token`s have a number of useful attributes, such as:
- i: index within the parent document
- text: token text
- is_alpha: bool indicating whether token consists of alphabetic characters
- is_punct: bool indicating whether token is punctuation
- like_num: bool indicating whether token "resembles" a number

Attributes such as these are lexical attributes: they don't depend at all on how the token is used (its context).

In [30]:
doc = nlp('Google is looking at buying a London based company for $20 million.')

for token in doc:
    print(f'Index: {token.i:2d}, Text: {token.text:>10}, Is alphabetic: {token.is_alpha:3}, Is punctuation: {token.is_punct:3}, Like number: {token.like_num:3}')

Index:  0, Text:     Google, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  1, Text:         is, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  2, Text:    looking, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  3, Text:         at, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  4, Text:     buying, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  5, Text:          a, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  6, Text:     London, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  7, Text:      based, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  8, Text:    company, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index:  9, Text:        for, Is alphabetic:   1, Is punctuation:   0, Like number:   0
Index: 10, Text:          $, Is alphabetic:   0, Is punctuation:   0, Like number:   0
Index: 11, Text:         20, Is alphabetic:
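
Because lexical attributes are context-independent, they can also be read from a vocabulary entry without processing any sentence. The sketch below looks words up through `nlp.vocab`, which returns `Lexeme` objects; the word list is an arbitrary choice for illustration.

```python
import spacy

nlp = spacy.blank("en")

# Look up vocabulary entries directly: no sentence, so no context is involved.
for word in ["London", "ten", "20", "!"]:
    lexeme = nlp.vocab[word]
    print(
        f"{lexeme.text:>8} "
        f"is_alpha={lexeme.is_alpha} "
        f"is_punct={lexeme.is_punct} "
        f"like_num={lexeme.like_num}"
    )
```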

### Trained Pipelines

Trained pipelines contain models that make predictions using context, e.g. POS tags and named entities. The `spacy download` command downloads a trained pipeline, which can then be loaded with the `spacy.load` function.

A pipeline's package contains the necessary weights for its models, the vocabulary, meta information, and the configuration file used to train it.

```python -m spacy download en_core_web_sm```

In [33]:
!python -m spacy download en_core_web_sm
clear_output()
print('en_core_web_sm pipeline can now be loaded!')

en_core_web_sm pipeline can now be loaded!
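
The package contents described above can be inspected once the pipeline is installed. The sketch below first checks for the package with `spacy.util.is_package`, so it also runs, with a hint, when the download step was skipped. It assumes spaCy v3.

```python
import spacy
from spacy.util import is_package

name = "en_core_web_sm"

if is_package(name):
    nlp = spacy.load(name)
    # Meta information shipped with the package.
    print(nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])
    # Component names, as loaded and as listed in the training config.
    print(nlp.pipe_names)
    print(nlp.config["nlp"]["pipeline"])
else:
    print(f"{name} is not installed; run: python -m spacy download {name}")
```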


With a trained pipeline, we can predict context-dependent attributes. Attributes that return strings usually end with an underscore; the same attributes without the underscore return an integer ID from the central `Vocab`. Some context-dependent attributes include:
- pos_: predicted part-of-speech
- dep_: dependency label, relationship between two tokens
- head: syntactic head token, parent token this one is attached to

In [40]:
nlp = spacy.load('en_core_web_sm')

doc = nlp('She ate the large pizza')

for token in doc:
    print(f'Token text: {token.text:>10}, Token POS: {token.pos_:>5}, Token POS ID: {token.pos}, Token Dependency: {token.dep_}, Token Head: {token.head}')


Token text:        She, Token POS:  PRON, Token POS ID: 95, Token Dependency: nsubj, Token Head: ate
Token text:        ate, Token POS:  VERB, Token POS ID: 100, Token Dependency: ROOT, Token Head: ate
Token text:        the, Token POS:   DET, Token POS ID: 90, Token Dependency: det, Token Head: pizza
Token text:      large, Token POS:   ADJ, Token POS ID: 84, Token Dependency: amod, Token Head: pizza
Token text:      pizza, Token POS:  NOUN, Token POS ID: 92, Token Dependency: dobj, Token Head: ate
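
The underscore convention is not limited to predicted attributes. As a small illustration that needs no trained pipeline, the sketch below compares `token.orth_` (string) with `token.orth` (integer ID) and resolves the ID back through the vocabulary's string store.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("She ate the large pizza")
token = doc[4]

print(token.orth_)  # string form of the token text
print(token.orth)   # integer ID for the same text

# The string store maps in both directions: ID to string and string to ID.
print(nlp.vocab.strings[token.orth])
print(nlp.vocab.strings["pizza"] == token.orth)
```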


The `.ents` attribute on a `Doc` object gives access to the named entities predicted by the NER model. It returns a tuple of `Span` objects.

In [41]:
doc = nlp('Apply is looking at buying U.K. startup for $1 billion.')

for ent in doc.ents:
    print(f'Entity: {ent.text:>10}, Label: {ent.label_:>5}')

Entity:       U.K., Label:   GPE
Entity: $1 billion, Label: MONEY
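
Since entities are ordinary `Span` objects, they carry character offsets as well as a label, and `spacy.explain` gives a short description of a label. The sketch below uses a blank pipeline and assigns the two entities by hand to mirror the output above; that manual assignment is an assumption made only so the example runs without a trained model.

```python
import spacy

nlp = spacy.blank("en")
text = "Apply is looking at buying U.K. startup for $1 billion."
doc = nlp(text)

# Hand-labelled spans standing in for NER predictions (illustration only).
manual_ents = []
for phrase, label in [("U.K.", "GPE"), ("$1 billion", "MONEY")]:
    start = text.index(phrase)
    manual_ents.append(doc.char_span(start, start + len(phrase), label=label))
doc.ents = manual_ents

for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char, spacy.explain(ent.label_))
```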
