<a href="https://colab.research.google.com/github/guimaraesabrina/mastering_spaCy/blob/main/mastering_spacy_chapters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mastering spaCy
## Chapter 1: Getting started with spaCy



## Install spaCy library and setup

In [None]:
%pip install -U spacy

In [None]:
!python -m spacy download en_core_web_sm

In [10]:
import spacy

from spacy import displacy

## How to load the model

In [7]:
nlp = spacy.load('en_core_web_sm')

In [17]:
doc = nlp("Hi! My name is Sabrina Guimarães. I live in São Paulo and I'm working at NTT DATA")
doc

Hi! My name is Sabrina Guimarães. I live in São Paulo and I'm working at NTT DATA

## Entity recognition

In [18]:
displacy.serve(doc, style='ent')


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
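`displacy.serve` starts a web server and blocks the cell until it is stopped. Inside a notebook, `displacy.render` is usually more convenient. The sketch below is a minimal illustration that uses displaCy's manual mode, so it runs without a trained model; the text and the hand-written `GPE` span are assumptions made for the example, not model predictions.

```python
from spacy import displacy

text = "I live in São Paulo"
start = text.index("São Paulo")

# Hand-labelled entity (illustrative only, not produced by a model)
example = {
    "text": text,
    "ents": [{"start": start, "end": start + len("São Paulo"), "label": "GPE"}],
    "title": None,
}

# jupyter=False returns the HTML markup as a string instead of displaying it
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:200])
```

With a processed `Doc`, the same call without `manual=True` renders the entities the model predicted.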


## Chapter 2: Core operations with spaCy


- Tokenization
- Lemmatization
- POS Tagger (part-of-speech tagger)
- Parser (Dependency Parser)

In [None]:
!python -m spacy download en_core_web_md

In [23]:
nlp = spacy.load("en_core_web_md")

In [24]:
doc = nlp("I went there")

In [None]:
!python -m spacy download pt_core_news_sm

In [26]:
nlp = spacy.load("pt_core_news_sm")
text = "Oi. Meu nome é Sabrina e eu tenho 3 gatos"
doc = nlp(text)

print("Tokens:")
for token in doc:
    print(token.text)

Tokens:
Oi
.
Meu
nome
é
Sabrina
e
eu
tenho
3
gatos


In [27]:
print("\nLemmatization:")
# The lemma is the base form of a token, e.g. "é" -> "ser", "tenho" -> "ter"
for token in doc:
    print(f"Lemma: {token.lemma_}")


Lemmatization:
Lemma: Oi
Lemma: .
Lemma: meu
Lemma: nome
Lemma: ser
Lemma: Sabrina
Lemma: e
Lemma: eu
Lemma: ter
Lemma: 3
Lemma: gato


In [29]:
text = "A Maria corre rápido para pegar o ônibus e não se atrasar para a faculdade."
doc = nlp(text)

print("\nPOS Tagger:")
for token in doc:
    print(f"Token: {token.text}, POS (Universal): {token.pos_}, POS (detailed): {token.tag_}")


POS Tagger:
Token: A, POS (Universal): DET, POS (detailed): DET
Token: Maria, POS (Universal): PROPN, POS (detailed): PROPN
Token: corre, POS (Universal): VERB, POS (detailed): VERB
Token: rápido, POS (Universal): ADV, POS (detailed): ADV
Token: para, POS (Universal): SCONJ, POS (detailed): SCONJ
Token: pegar, POS (Universal): VERB, POS (detailed): VERB
Token: o, POS (Universal): DET, POS (detailed): DET
Token: ônibus, POS (Universal): NOUN, POS (detailed): NOUN
Token: e, POS (Universal): CCONJ, POS (detailed): CCONJ
Token: não, POS (Universal): ADV, POS (detailed): ADV
Token: se, POS (Universal): PRON, POS (detailed): PRON
Token: atrasar, POS (Universal): VERB, POS (detailed): VERB
Token: para, POS (Universal): ADP, POS (detailed): ADP
Token: a, POS (Universal): DET, POS (detailed): DET
Token: faculdade, POS (Universal): NOUN, POS (detailed): NOUN
Token: ., POS (Universal): PUNCT, POS (detailed): PUNCT


In [31]:
text = "O cachorro grande latiu alto na rua."
doc = nlp(text)
for token in doc:
    print(f"Token: {token.text}, Head: {token.head.text}, Relation: {token.dep_}")

Token: O, Head: cachorro, Relation: det
Token: cachorro, Head: latiu, Relation: nsubj
Token: grande, Head: cachorro, Relation: amod
Token: latiu, Head: latiu, Relation: ROOT
Token: alto, Head: latiu, Relation: advmod
Token: na, Head: rua, Relation: case
Token: rua, Head: latiu, Relation: obl
Token: ., Head: latiu, Relation: punct


In [32]:
displacy.render(doc, style="dep", jupyter=True, options={"distance": 90})

In [34]:
# tokenization

doc = nlp("It's been a crazy week!!!")
print([token.text for token in doc])

# Tokenization does not need a trained statistical model
# Tokenization is based on language-specific rules
# Those rules depend on the grammar and conventions of each individual language

['It', "'s", 'been', 'a', 'crazy', 'week', '!', '!', '!']
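To illustrate that tokenization is rule-based, the following sketch tokenizes the same sentence with a blank English pipeline. `spacy.blank("en")` creates a pipeline that contains only the tokenizer, so no model download is needed.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components
nlp = spacy.blank("en")
print(nlp.pipe_names)

doc = nlp("It's been a crazy week!!!")
tokens = [token.text for token in doc]
print(tokens)
```

The token list should match the one produced by the full model above, since both use the same English tokenizer rules.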


### Customizing the tokenizer

In [36]:
from spacy.symbols import ORTH

In [38]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("lemme that")
print([w.text for w in doc])

['lemme', 'that']


In [39]:
special_case = [{ORTH: "lem"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("lemme", special_case)
print([w.text for w in nlp("lemme that")])

['lem', 'me', 'that']
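To see which rule produced each token, the tokenizer offers a debugging helper, `nlp.tokenizer.explain`. The sketch below registers the same special case on a blank English pipeline (an assumption made so that the snippet runs without a model) and prints the rule behind each piece.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# The ORTH values must concatenate back to the original string "lemme"
nlp.tokenizer.add_special_case("lemme", [{ORTH: "lem"}, {ORTH: "me"}])

# explain() yields (rule name, token text) pairs for debugging
for rule, piece in nlp.tokenizer.explain("lemme that"):
    print(rule, piece)

tokens = [t.text for t in nlp("lemme that")]
print(tokens)
```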


### Sentence segmentation

In [41]:
text = "I flied to N.Y yesterday. It was around 9 am."
doc = nlp(text)
for sent in doc.sents:
  print(sent.text)

I flied to N.Y yesterday.
It was around 9 am.
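The cell above relies on the dependency parser of the loaded model to find sentence boundaries. As a lighter alternative, spaCy also ships a rule-based `sentencizer` component that splits on punctuation. The sketch below adds it to a blank pipeline; the example sentence is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
# Rule-based sentence splitter; no trained parser required
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. This is the second one.")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

Because it only looks at punctuation, the sentencizer can mis-split abbreviations that the parser-based approach handles better.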


### Lemma

In [43]:
nlp.get_pipe("attribute_ruler").add([[{"TEXT":"Angeltown"}]], {"LEMMA": "Los Angeles"})

In [45]:
doc = nlp("I am flying to Angeltown")
for token in doc:
  print (token.text, token.lemma_)

I I
am be
flying fly
to to
Angeltown Los Angeles


In [46]:
# print the entities

doc = nlp("I flied to NY with Ashley")
entities = doc.ents
print(entities)

(NY, Ashley)


In [47]:
# each sentence is a Span object

doc = nlp("This is the 1st sentence. This is the 2nd sentence. And blablabla.")
sentences = list(doc.sents)
print(sentences)

[This is the 1st sentence., This is the 2nd sentence., And blablabla.]


### Focus on token object

In [52]:
doc = nlp("Hello madam!")
print(doc[0])
print(doc[1])
print(doc[2])

Hello
madam
!


In [53]:
token = doc[0]
print(token.doc)

Hello madam!


In [55]:
doc = nlp("The Brazilian president visited Argentina")
print(doc.ents)
print(doc[1].ent_type_, spacy.explain(doc[1].ent_type_))

(Brazilian, Argentina)
NORP Nationalities or religious or political groups


In [58]:
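doc = nlp("I visited Buenos Aires when I was on vacation")
print(doc.ents)
# doc[1] is "visited", which is not part of any entity, so ent_type_ is an
# empty string and spacy.explain() returns None
print(doc[1].ent_type_, spacy.explain(doc[1].ent_type_))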

(Buenos Aires,)
 None


Other interesting spaCy token attributes:

- like_url
- like_num
- like_email
- shape_
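A minimal sketch of these attributes follows. The words are passed to `Doc` already split, so the result does not depend on tokenizer rules; the URL and email address are made-up examples.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Pre-split words (illustrative values) to keep the focus on the attributes
words = ["https://spacy.io", "ten", "42", "jane@example.com", "week"]
doc = Doc(nlp.vocab, words=words)

for token in doc:
    print(token.text, token.like_url, token.like_num, token.like_email, token.shape_)
```

Note that `like_num` also recognizes number words such as "ten", not just digits.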

In [60]:
# token.shape_

doc = nlp("My nickname is Sa123Sous456a")
for token in doc:
  print(token.text, token.shape_)

My Xx
nickname xxxx
is xx
Sa123Sous456a XxdddXxxxdddx


In [61]:
# stop words (such as the, a, an, and, just, with...)

doc = nlp("She was just walking with a friend to the park, and they had an ice cream along the way.")
for token in doc:
  print(token, token.is_stop)

She True
was True
just True
walking False
with True
a True
friend False
to True
the True
park False
, False
and True
they True
had True
an True
ice False
cream False
along True
the True
way False
. False
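A common use of `is_stop` is to drop stop words and keep only the content words. The sketch below does that on a shortened version of the sentence above, using a blank pipeline because stop-word flags come from the language defaults rather than from a trained model.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("She was just walking with a friend to the park.")

# Keep tokens that are neither stop words nor punctuation
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_words)
```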


### spaCy's core NLP pipeline notes:
- https://spacy.io/usage/processing-pipelines
- https://www.bmc.com/blogs/nlu-vs-nlp-natural-language-understanding-processing/

Tokenization:
- The first step: breaking raw text into individual units called tokens (words, punctuation, numbers).
- Every later pipeline component operates on tokens, so all subsequent processing depends on this step.
- How to access: token.text for the token's original string.

Lemmatization:
- Reducing words to their base or dictionary form, known as a lemma. For example, "running," "ran," and "runs" all become "run."
- Helps normalize words, so different inflections of the same word are treated consistently, which is crucial for tasks like frequency analysis.
- How to access: token.lemma_

Part-of-Speech (POS) Tagging:
- Assigning a grammatical category (e.g., noun, verb, adjective, adverb) to each token.
- Provides grammatical context, essential for disambiguating word meanings and understanding syntactic roles.
- How to access: token.pos_ (universal tag, like NOUN, VERB) and token.tag_ (detailed tag, language-specific).

Dependency Parsing:
- Analyzing the grammatical relationships between words in a sentence, creating a dependency tree. It identifies which words modify or depend on others.
- Reveals the syntactic structure, allowing for deeper semantic understanding (e.g., identifying subjects, objects, and modifiers).
- How to access: token.head (the token this one depends on) and token.dep_ (the type of dependency relationship).
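The four steps above can be inspected side by side on a single sentence. This is a sketch that assumes `en_core_web_sm` is installed; if it is not, it falls back to a blank pipeline, in which case only the token text is filled in and the other columns stay empty. The example sentence is made up.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Model not installed: tokenizer-only fallback, annotations will be empty
    nlp = spacy.blank("en")

doc = nlp("The big dog barked loudly.")

# One row per token: text, lemma, universal POS, detailed tag, head, dependency
rows = [(t.text, t.lemma_, t.pos_, t.tag_, t.head.text, t.dep_) for t in doc]
for row in rows:
    print(row)
```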

![spaCy NLP pipeline](https://spacy.io/images/pipeline.svg)

### NER — Named Entity Recognition

In [65]:
nlp = spacy.load("pt_core_news_sm")
text = "Saulo mora em São Paulo e trabalha na Google desde 2010."
doc = nlp(text)

print("NER:")
for ent in doc.ents:
    print(f"Text: {ent.text}, Label: {ent.label_}")

NER:
Text: Saulo, Label: PER
Text: São Paulo, Label: LOC
Text: Google, Label: ORG


- Doc
- Token
- Span
- Lexeme

In [66]:
# Doc

text = "Texto de exemplo"
doc = nlp(text)  # the Doc object is created here
print(f"Type of the doc object: {type(doc)}")
print(f"Number of tokens in the doc: {len(doc)}")

Type of the doc object: <class 'spacy.tokens.doc.Doc'>
Number of tokens in the doc: 3


In [67]:
# Token

text = "Alexa, coloque meu alarme para 6 am"
doc = nlp(text)

print("\nToken details:")
for token in doc:
    print(f"Text: '{token.text}' | Lemma: '{token.lemma_}' | POS: '{token.pos_}' | Is alpha: {token.is_alpha}")


Token details:
Text: 'Alexa' | Lemma: 'alexa' | POS: 'PROPN' | Is alpha: True
Text: ',' | Lemma: ',' | POS: 'PUNCT' | Is alpha: False
Text: 'coloque' | Lemma: 'coloque' | POS: 'VERB' | Is alpha: True
Text: 'meu' | Lemma: 'meu' | POS: 'DET' | Is alpha: True
Text: 'alarme' | Lemma: 'alarme' | POS: 'NOUN' | Is alpha: True
Text: 'para' | Lemma: 'para' | POS: 'ADP' | Is alpha: True
Text: '6' | Lemma: '6' | POS: 'NUM' | Is alpha: False
Text: 'am' | Lemma: 'am' | POS: 'PROPN' | Is alpha: True


In [68]:
# Span
# A Span is a slice of a Doc
# It represents a sequence of one or more adjacent tokens

text = "Eu amo ir ao Rio de Janeiro, é uma cidade linda!"
doc = nlp(text)

print("\nSpans (sentences and entities):")
for sent in doc.sents:
    print(f"Sentence: '{sent.text}'")

for ent in doc.ents:
    print(f"Entity: '{ent.text}' | Type: '{ent.label_}'")

# Manually created Span (token indices 3, 4 and 5)
span_manual = doc[3:6]
print(f"Manual span: '{span_manual.text}'")
print(f"Type of the manual span: {type(span_manual)}")


Spans (sentences and entities):
Sentence: 'Eu amo ir ao Rio de Janeiro, é uma cidade linda!'
Entity: 'Rio de Janeiro' | Type: 'LOC'
Manual span: 'ao Rio de'
Type of the manual span: <class 'spacy.tokens.span.Span'>


- a Doc is the entire document.
- a Doc is composed of a sequence of Tokens.
- a Span is a slice of a Doc (a sequence of Tokens).
- named entities (ent) are Spans.
- each Token refers to a Lexeme in the vocabulary for its context-independent properties.
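The relationships in this list can be checked directly. The following sketch uses a blank English pipeline and a made-up sentence; looking a string up in `nlp.vocab` returns the corresponding `Lexeme`.

```python
import spacy
from spacy.tokens import Doc, Span, Token

nlp = spacy.blank("en")
doc = nlp("I love visiting Rio de Janeiro")

token = doc[3]                   # a single Token
span = doc[3:6]                  # a slice of the Doc is a Span
lexeme = nlp.vocab[token.text]   # context-independent vocabulary entry

print(type(doc).__name__, len(doc))
print(type(token).__name__, token.text, token.i)
print(type(span).__name__, span.text, span.start, span.end)
print(type(lexeme).__name__, lexeme.text, lexeme.is_alpha)
```

The token and the lexeme share the same `orth` hash, which is how a token in context is tied to its entry in the vocabulary.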

## Chapter 3: Extracting linguistic features

### POS tagging
One of the core tasks in NLP is part-of-speech (POS) tagging: assigning each word in a text a grammatical category such as noun, verb, adjective, or adverb. These categories expose the structure of a phrase, which helps downstream components analyze human language more accurately.

In [70]:
nlp = spacy.load("en_core_web_sm")

text = "The quick brown foxes jump over the lazy dogs."
doc = nlp(text)

print("POS Tagging example:")
for token in doc:
    print(f"Token: '{token.text}' | Detailed POS Tag (token.tag_): '{token.tag_}' | Universal POS Tag (token.pos_): '{token.pos_}'")

POS Tagging example:
Token: 'The' | Detailed POS Tag (token.tag_): 'DT' | Universal POS Tag (token.pos_): 'DET'
Token: 'quick' | Detailed POS Tag (token.tag_): 'JJ' | Universal POS Tag (token.pos_): 'ADJ'
Token: 'brown' | Detailed POS Tag (token.tag_): 'JJ' | Universal POS Tag (token.pos_): 'ADJ'
Token: 'foxes' | Detailed POS Tag (token.tag_): 'NNS' | Universal POS Tag (token.pos_): 'NOUN'
Token: 'jump' | Detailed POS Tag (token.tag_): 'VBP' | Universal POS Tag (token.pos_): 'VERB'
Token: 'over' | Detailed POS Tag (token.tag_): 'IN' | Universal POS Tag (token.pos_): 'ADP'
Token: 'the' | Detailed POS Tag (token.tag_): 'DT' | Universal POS Tag (token.pos_): 'DET'
Token: 'lazy' | Detailed POS Tag (token.tag_): 'JJ' | Universal POS Tag (token.pos_): 'ADJ'
Token: 'dogs' | Detailed POS Tag (token.tag_): 'NNS' | Universal POS Tag (token.pos_): 'NOUN'
Token: '.' | Detailed POS Tag (token.tag_): '.' | Universal POS Tag (token.pos_): 'PUNCT'


The more detailed tags (`token.tag_`) come from a tagset called the **Penn Treebank Tagset**, which is very common in NLP for English.

* DT: Determiner. Covers articles (a, an, the) and other determiners that precede nouns.
  Example: "The", "a", "this".

* JJ: Adjective. A word that describes or modifies a noun or pronoun.
  Example: "quick", "brown", "lazy".

* NNS: Noun, plural.
  Example: "foxes", "dogs", "cats".

* VBP: Verb, non-3rd person singular present.
  Example: "jump" (I jump, you jump, they jump), "walk", "sing".

* IN: Preposition or subordinating conjunction.
  Example: "over", "in", "on", "because".

* . : Sentence-closing punctuation mark.
  Example: ".", "!", "?".

---

### Other common tags

* NN: Noun, singular or mass (uncountable).
  Example: "cat", "water", "love".

* PRP: Personal pronoun.
  Example: "I", "you", "he", "she", "it", "we", "they".

* PRP\$: Possessive pronoun.
  Example: "my", "your", "his", "her", "its", "our", "their".

* VB: Verb, base form (infinitive without "to").
  Example: "run", "walk", "sing".

* VBG: Verb, gerund or present participle (ending in -ing).
  Example: "running", "walking", "singing".

* VBN: Verb, past participle.
  Example: "eaten", "walked", "sung".

* RB: Adverb. A word that modifies a verb, adjective, or another adverb.
  Example: "quickly", "very", "happily".

* CC: Coordinating conjunction. Joins words, phrases, or independent clauses.
  Example: "and", "but", "or".
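There is no need to memorize these tags: `spacy.explain` returns a short description for any tag in spaCy's glossary. A small sketch that looks up the tags listed above:

```python
import spacy

tags = ["DT", "JJ", "NNS", "VBP", "IN", "NN", "PRP", "PRP$",
        "VB", "VBG", "VBN", "RB", "CC"]

# spacy.explain() needs no loaded model; it is a glossary lookup
for tag in tags:
    print(f"{tag:5} {spacy.explain(tag)}")
```

For a string that is not in the glossary, `spacy.explain` returns `None`.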


### WSD — Word Sense Disambiguation
- A classic NLU problem
- A word can have many senses

In [71]:
nlp = spacy.load("en_core_web_sm")

sent1 = "I flew to Rome"
sent2 = "I'm flying to Rome"
sent3 = "I will fly to Rome"

doc1 = nlp(sent1)
doc2 = nlp(sent2)
doc3 = nlp(sent3)

# For each document, collect the tokens tagged 'VBG' (gerund or present participle) or 'VB' (base form)
# as (token text, token lemma) tuples. "flew" is a past-tense form and gets neither tag, so the first list is empty.
results = []
for doc in [doc1, doc2, doc3]:
    extracted_tokens = []
    for w in doc:
        if w.tag_ == 'VBG' or w.tag_ == 'VB':
            extracted_tokens.append((w.text, w.lemma_))
    results.append(extracted_tokens)

print(results)

[[], [('flying', 'fly')], [('fly', 'fly')]]
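The first list in the output is empty because the filter only accepts `VB` and `VBG`. A possible variation is to accept every Penn Treebank verb tag by checking the `VB` prefix. The sketch below uses a hypothetical helper, `verb_forms`, and a hand-annotated `Doc` whose tags and lemmas are written by hand (an assumption that lets the snippet run without a model); with a trained pipeline you would pass `nlp(text)` instead.

```python
import spacy
from spacy.tokens import Doc


def verb_forms(doc):
    """Return (text, tag, lemma) for every token whose tag starts with 'VB'."""
    return [(t.text, t.tag_, t.lemma_) for t in doc if t.tag_.startswith("VB")]


vocab = spacy.blank("en").vocab

# Hand-written annotations for illustration, not model output
doc = Doc(
    vocab,
    words=["I", "flew", "to", "Rome"],
    tags=["PRP", "VBD", "IN", "NNP"],
    lemmas=["I", "fly", "to", "Rome"],
)

result = verb_forms(doc)
print(result)
```

Filtering on `token.pos_ == "VERB"` is another option when the exact verb form does not matter.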
