# spaCy

spaCy is a popular NLP library that provides a wide range of features for text processing and analysis, including tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and text classification.

In [1]:
import spacy

In spaCy, we can build our own pipelines.

In [2]:
# Create a blank Portuguese pipeline with spaCy
nlp = spacy.blank('pt')

A blank pipeline contains only the language-specific tokenizer; it has no trained components such as a tagger or a parser.
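As a quick check of what a blank pipeline holds, the sketch below lists its components and tokenizes the same sentence used in this notebook. It assumes only that spaCy is installed; no trained model is needed.

```python
import spacy

# A blank pipeline has a tokenizer but no named pipeline components
nlp = spacy.blank("pt")
pipe_names = nlp.pipe_names
print(pipe_names)

# Tokenization still works, because the tokenizer is always present
doc = nlp("Oi eu sou goku.")
tokens = [token.text for token in doc]
print(tokens)
```

The component list is empty because the tokenizer is not counted as a pipeline component.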

In [3]:
# Define the sentence to process with this blank pipeline
doc = nlp('Oi eu sou goku.')

In [4]:
# The original text is available through the text attribute
doc.text

'Oi eu sou goku.'

Once the sentence has gone through the pipeline, it is tokenized, and the resulting Doc gives us many attributes and methods to work with.

In [5]:
# This is an example of a token in the text
doc[0]

Oi

In [6]:
# Every token also has its own attributes
for token in doc:
    # is_punct tells whether the token is punctuation
    print(token.is_punct, token)

False Oi
False eu
False sou
False goku
True .
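Beyond `is_punct`, tokens carry other lexical attributes that work without a trained model. The sketch below prints a few of them; the example sentence is an assumption chosen for illustration, not part of the original notebook.

```python
import spacy

nlp = spacy.blank("pt")

# Hypothetical sentence containing a word, a number and punctuation
doc = nlp("Custa 5 reais.")

# Collect (text, is_alpha, like_num, is_punct) for every token
rows = [(t.text, t.is_alpha, t.like_num, t.is_punct) for t in doc]
for row in rows:
    print(row)
```

These flags are computed from the token text alone, which is why they are available even in a blank pipeline.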


# Let's use a trained pipeline

In [7]:
# Install a trained pipeline with "spacy download"
#!python -m spacy download pt_core_news_sm

With this pipeline we will identify the part-of-speech (POS) tags in our sentences.

The coarse-grained POS tags are:

* "ADJ": Adjective
    * A word that describes or modifies a noun or pronoun.
* "ADP": Adposition
    * A word that shows the relationship between a noun or pronoun and other words in a sentence, such as prepositions and postpositions.
* "ADV": Adverb
    * A word that modifies a verb, adjective, or other adverb, and typically provides more information about time, place, manner, degree, etc.
* "AUX": Auxiliary verb
    * An auxiliary verb that is used with a main verb to form a verb tense, mood, or voice.
* "CCONJ": Coordinating conjunction
    * A word that connects words, phrases, or clauses of equal syntactic importance, such as "and", "or", "but", etc.
* "DET": Determiner
    * A word that indicates which noun is being referred to, such as articles, demonstratives, and possessives.
* "INTJ": Interjection
    * An interjection, which is a word or phrase used to express strong emotion or surprise, such as "oh", "ah", "wow", etc.
* "NOUN": Noun
    * A word that represents a person, place, thing, or idea.
* "NUM": Numeral
    * A word that represents a number, such as "one", "two", "three", etc.
* "PART": Particle
    * A word that is used in a sentence to indicate a grammatical relationship with a neighboring word, such as "to" in "going to", or "not" in "not happy".
* "PRON": Pronoun
    * A word that is used to replace a noun or noun phrase, such as "he", "she", "it", "they", etc.
* "PROPN": Proper noun
    * A specific name of a person, place, or organization, such as "Maria", "Lisboa", "Google", etc.
* "PUNCT": Punctuation
    * A punctuation mark, such as ".", ",", ";", etc.
* "SCONJ": Subordinating conjunction
    * A word that connects a subordinate clause to a main clause, such as "because", "although", "if", etc.
* "SYM": Symbol
    * A symbol, such as "$", "%", "#", etc.
* "VERB": Verb
    * A word that expresses an action, occurrence, or state of being.
* "X": Other
    * A catch-all category for words that do not fall into any of the other categories.

In [8]:
# Load the small Portuguese pipeline
nlp = spacy.load("pt_core_news_sm")

# Process a text
doc = nlp("Ela gosta de comer pizza, na Australia")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

Ela PRON
gosta VERB
de SCONJ
comer VERB
pizza NOUN
, PUNCT
na ADP
Australia PROPN


We can also inspect the syntactic dependencies, which link each token to the word it depends on (its head). Common relations include:

* Subject-verb (nsubj): the subject noun or pronoun depends on the verb.
* Object-verb (obj): the object noun or pronoun depends on the verb.
* Adjective-noun (amod): the adjective modifies the noun.
* Adverb-verb (advmod): the adverb modifies the verb.
* Preposition-noun (case): the preposition marks the noun or pronoun it introduces.

In [9]:
for token in doc:
    print(token.text, token.dep_)

Ela nsubj
gosta ROOT
de mark
comer xcomp
pizza obj
, punct
na case
Australia obl
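To see which word each token attaches to, `token.head` can be printed next to `token.dep_`. The sketch below builds the Doc by hand so that it runs without a trained model: the dependency labels are copied from the output above, but the `heads` indices are hand-assigned assumptions, not the parser's actual output.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("pt")

words = ["Ela", "gosta", "de", "comer", "pizza", ",", "na", "Australia"]
# Labels taken from the parser output shown above
deps = ["nsubj", "ROOT", "mark", "xcomp", "obj", "punct", "case", "obl"]
# Assumed head index for each token (the root points at itself)
heads = [1, 1, 3, 1, 3, 7, 7, 3]

doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Print each token with its relation and the word it depends on
for token in doc:
    print(token.text, token.dep_, token.head.text)

# The root is the token labelled ROOT; its children depend on it directly
root = [token for token in doc if token.dep_ == "ROOT"][0]
root_children = [child.text for child in root.children]
print(root.text, root_children)
```

With a trained pipeline the same loop works directly on `nlp(text)`, and the heads come from the parser instead of being supplied by hand.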


spaCy can also recognize named entities, such as the names of places mentioned in a text.

In [10]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Australia LOC


Finally, it is sometimes hard to tell what a tag or label stands for, so we can use the explain function in spaCy.

In [11]:
spacy.explain('LOC')

'Non-GPE locations, mountain ranges, bodies of water'
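The same function also covers POS tags and dependency labels. A small sketch that looks up a few of the labels seen in the outputs above:

```python
import spacy

# Labels taken from the POS, dependency and entity outputs in this notebook
labels = ["PROPN", "nsubj", "obl", "LOC"]

# spacy.explain returns a short description string for known labels
explanations = {label: spacy.explain(label) for label in labels}
for label, description in explanations.items():
    print(label, "->", description)
```

For a label that spaCy does not know, `spacy.explain` returns `None` rather than raising an error.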

# Matcher
We can use the Matcher to find patterns in our text. Each token in a pattern is described by attributes such as:

* 'TEXT': Matches the exact text of the token (case-sensitive).
* 'LOWER': Matches the lowercase form of the token, which makes the match case-insensitive.
* 'LEMMA': Matches the base form of the token, i.e. the canonical form of the word. For example, the lemma of 'ran' is 'run'.
* 'POS': Matches the part-of-speech tag of the token.
* 'TAG': Matches the fine-grained part-of-speech tag of the token.
* 'SHAPE': Matches the shape of the token, i.e. its capitalization, punctuation, and digit pattern.
* 'ENT_TYPE': Matches the named entity type of the token.
* 'DEP': Matches the syntactic dependency label of the token.

In [12]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

pattern = [[{'LOWER': 'new'}, {'LOWER': 'york'}, {'LOWER': 'city'}]]

matcher.add('NYC_PATTERN', pattern)

doc = nlp("I love New York City, but I also love San Francisco.")

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)


New York City
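The `LOWER` attribute used above ignores case, while `TEXT` requires an exact match. The sketch below contrasts the two on a blank English pipeline; the pattern names `EXACT` and `ANY_CASE` and the sample text are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# TEXT must match the token text exactly, including case
matcher.add("EXACT", [[{"TEXT": "New"}, {"TEXT": "York"}]])
# LOWER compares against the lowercased token text
matcher.add("ANY_CASE", [[{"LOWER": "new"}, {"LOWER": "york"}]])

doc = nlp("new york or New York")

# Resolve each match_id back to its pattern name
found = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
for name, text in found:
    print(name, text)
```

Only the capitalized occurrence satisfies `EXACT`, whereas `ANY_CASE` matches both.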


As another example, here's a pattern that matches a verb followed by a noun:

```python
pattern = [{'POS': 'VERB'}, {'POS': 'NOUN'}]
```
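A pattern like this needs POS annotations on the Doc. To keep the sketch independent of a downloaded model, the POS tags below are assigned by hand when building the Doc; the sentence and its tags are assumptions for illustration only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated sentence (a trained tagger would normally set these tags)
words = ["She", "eats", "pizza", "quickly"]
pos = ["PRON", "VERB", "NOUN", "ADV"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
pattern = [{"POS": "VERB"}, {"POS": "NOUN"}]
matcher.add("VERB_NOUN", [pattern])

# Collect the text of every verb + noun sequence
results = [doc[start:end].text for _, start, end in matcher(doc)]
print(results)
```

With a trained pipeline, `doc = nlp(text)` would replace the manual Doc and the matcher code would stay the same.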

# Create a Doc manually

In [13]:
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ["Oi", "Tudo", "Bem"]
spaces = [True, True, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

print(doc)

Oi Tudo Bem
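The `spaces` list controls whether each word is followed by a space, which matters once punctuation is involved. A minimal sketch with a hypothetical sentence:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("pt")

# No space before the comma or the question mark
words = ["Olá", ",", "tudo", "bem", "?"]
spaces = [False, True, True, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# text_with_ws shows each token together with its trailing whitespace
pieces = [token.text_with_ws for token in doc]
print(pieces)
```

Joining the `text_with_ws` pieces reproduces `doc.text` exactly.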


# Text similarity

In [14]:
# !python -m spacy download en_core_web_md

In [15]:
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("I like frutas")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.9607215630444086


In [16]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.6850197911262512


* Similarity is determined using word vectors, which are multi-dimensional representations of word meaning.
* Word vectors are generated with an algorithm like Word2Vec and lots of text.
* They can be added to spaCy's pipelines.
* The default measure is cosine similarity, but it can be adjusted.
* Doc and Span vectors default to the average of their token vectors.
* Short phrases work better than long documents with many irrelevant words.

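Each token also exposes a tensor, the context-sensitive representation produced by the pipeline's tok2vec component. This is different from the static word vector, which is available as token.vector.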

In [17]:
token1.tensor

array([ 0.14112753, -0.33611476,  0.06699206, -0.88421255,  0.06669152,
       -0.1297453 , -0.31521478,  0.60422575,  0.4733977 , -0.64390045,
        0.9460465 ,  1.2743245 , -0.25726676, -1.1570396 , -0.4630825 ,
       -0.05508524, -1.2497791 , -0.51810896, -0.7528888 , -0.47276112,
        0.2770418 ,  0.22972283, -0.00947742,  0.2889976 ,  0.28562495,
        0.9584112 ,  0.51542306,  0.7909095 , -0.5317948 , -0.40295917,
        0.0052537 , -0.3223989 , -0.36749154,  0.74106073,  0.384262  ,
       -0.13837826,  0.03337788, -0.11230475,  0.15701273,  0.3177856 ,
        0.2158667 ,  0.18146369, -0.21280894, -0.3369672 , -0.3046311 ,
       -0.2077651 ,  0.3454957 ,  0.47986588,  0.37145495, -0.24688868,
        0.5591996 , -0.5802103 ,  0.03678508, -0.25743204, -0.38831013,
       -0.84508497, -0.23053315,  0.19568542, -0.15692766, -0.17506352,
        0.56411767, -0.4119339 , -0.9519711 , -0.40832722,  0.72344995,
       -0.68837714, -0.16142121,  0.07086641, -0.30072153, -0.29

In [18]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [19]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1e0476c5280>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1e0476c5c40>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1e047f5f890>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1e0479a2380>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1e0479d6740>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1e047f5f820>)]

In [21]:
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component_function(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
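Because the component assigns `doc.ents = spans`, any entities set earlier in the pipeline are replaced. One possible way to keep both is `spacy.util.filter_spans`, which drops overlapping spans and prefers longer ones. The sketch below uses a blank pipeline, and the `GPE` span is set by hand to stand in for an entity that a trained `ner` component might have produced.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", [nlp.make_doc("cat")])

doc = nlp("I have a cat in Berlin")

# Assumption: pretend an earlier component already labelled "Berlin"
doc.ents = [Span(doc, 5, 6, label="GPE")]

# Build the ANIMAL spans from the phrase matches
animal_spans = [
    Span(doc, start, end, label="ANIMAL") for _, start, end in matcher(doc)
]

# Merge existing and new entities, removing any overlaps
doc.ents = filter_spans(list(doc.ents) + animal_spans)
merged = [(ent.text, ent.label_) for ent in doc.ents]
print(merged)
```

Whether to overwrite or merge depends on the use case; overwriting, as in the component above, is simpler when only the custom label matters.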


In [22]:
# Add another component to the pipeline

@Language.component("custom_component")
def custom_component_function(doc):
    # Print the doc's length
    print("Doc length:", len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe("custom_component", first=True)

# Process a text
doc = nlp("Hello world!")

Doc length: 3
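Besides `first=True` and `after=...`, `add_pipe` also accepts `before=...` and `last=True` to control placement. The sketch below uses a blank pipeline with the built-in `sentencizer`; the component name `token_counter_sketch` is hypothetical, and storing the count in `doc.user_data` is just one convenient way to inspect the result.

```python
import spacy
from spacy.language import Language

# Hypothetical component that records the number of tokens on the doc
@Language.component("token_counter_sketch")
def token_counter_sketch(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# Place the custom component before the sentencizer
nlp.add_pipe("token_counter_sketch", before="sentencizer")

pipe_names = nlp.pipe_names
print(pipe_names)

doc = nlp("Hello world!")
n_tokens = doc.user_data["n_tokens"]
print(n_tokens)
```

Components run in the order shown by `nlp.pipe_names`, so placement matters whenever one component relies on annotations set by another.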


Creating a complex component with matcher

# Models

In [None]:
import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span, DocBin

with open("exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Define the pipeline
nlp = spacy.blank("en")

# Define the matcher with the pipeline's vocab
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# A token whose lowercase form matches "iphone", followed by a digit token
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and create docs with matched entities
matcher.add("GADGET", [pattern1, pattern2])

docs = []
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    print(spans)
    doc.ents = spans
    docs.append(doc)

doc_bin = DocBin(docs=docs)
doc_bin.to_disk("./train.spacy")
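To check that the entities survive serialization, a DocBin can be read back and its docs inspected. The sketch below is self-contained: it uses two hypothetical sentences instead of `iphone.json` and serializes to bytes rather than to disk, so nothing is written to the file system.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add(
    "GADGET",
    [
        [{"LOWER": "iphone"}, {"LOWER": "x"}],
        [{"LOWER": "iphone"}, {"IS_DIGIT": True}],
    ],
)

# Hypothetical stand-in for the texts loaded from the JSON file
texts = ["I bought an iPhone X yesterday", "The iPhone 8 is cheaper"]

docs = []
for doc in nlp.pipe(texts):
    doc.ents = [Span(doc, s, e, label=m) for m, s, e in matcher(doc)]
    docs.append(doc)

# Serialize to bytes, then restore the docs with the same vocab
data = DocBin(docs=docs).to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))

restored_ents = [[(ent.text, ent.label_) for ent in d.ents] for d in restored]
print(restored_ents)
```

For a file written with `to_disk`, `DocBin().from_disk(path)` can be used in the same way.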

In [3]:
# !python -m spacy init config ./config.cfg --lang en --pipeline ner
# !cat ./config.cfg
# !python -m spacy train ./exercises/en/config_gadget.cfg --output ./output --paths.train ./exercises/en/train_gadget.spacy --paths.dev ./exercises/en/dev_gadget.spacy