# POS tagging and Chunking

To help a machine understand a sentence, we tell it what grammatical role each word plays and which words belong together as phrases.
For that we use **P**art **O**f **S**peech (POS) tagging and **chunking**.

- [More info here](https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb)


## What is Part of Speech?

"The part of speech explains how a word is used in a sentence. There are 8 main POS tags: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections."

## How to do that?

Many tools can perform this task, and spaCy (again...) does it quite well. When you call its `nlp` object on a text, it runs a complete processing pipeline, which includes POS tagging.

#### Let's practice: can you find the POS tag of each word using spaCy?

In [24]:
import spacy

# Load the English language model.
# spacy.load("en_core_web_sm") loads a pre-trained spaCy pipeline.
# "en_core_web_sm" is a small English model trained on web text. It includes
# components for tokenization, part-of-speech tagging, named entity recognition, and more.

nlp = spacy.load("en_core_web_sm")

# Sample text
text = "SpaCy is a powerful library for natural language processing that removes stop words. USA, France on 21/10/2025"

# Step 1: Tokenization (calling nlp runs the full pipeline; we only print the tokens here)
doc = nlp(text)
print("\nStep 1: Tokenization")
for token in doc:
    print(f"Token: {token.text}")




Step 1: Tokenization
Token: SpaCy
Token: is
Token: a
Token: powerful
Token: library
Token: for
Token: natural
Token: language
Token: processing
Token: that
Token: removes
Token: stop
Token: words
Token: .
Token: USA
Token: ,
Token: France
Token: on
Token: 21/10/2025


In [25]:

# Step 2: Part-of-Speech Tagging
print("\nStep 2: Part-of-Speech Tagging")
for token in doc:
    print(f"Token: {token.text}, POS Tag: {token.pos_}")




Step 2: Part-of-Speech Tagging
Token: SpaCy, POS Tag: PROPN
Token: is, POS Tag: AUX
Token: a, POS Tag: DET
Token: powerful, POS Tag: ADJ
Token: library, POS Tag: NOUN
Token: for, POS Tag: ADP
Token: natural, POS Tag: ADJ
Token: language, POS Tag: NOUN
Token: processing, POS Tag: NOUN
Token: that, POS Tag: PRON
Token: removes, POS Tag: VERB
Token: stop, POS Tag: VERB
Token: words, POS Tag: NOUN
Token: ., POS Tag: PUNCT
Token: USA, POS Tag: PROPN
Token: ,, POS Tag: PUNCT
Token: France, POS Tag: PROPN
Token: on, POS Tag: ADP
Token: 21/10/2025, POS Tag: NUM
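
The tag names above are abbreviations from the Universal POS tag set. As a small sketch, `spacy.explain` can turn them into readable descriptions; it returns `None` for labels it does not know. The list of tags below is simply copied from the output above.

```python
import spacy

# Coarse-grained tags that appear in the output above.
tags = ["PROPN", "AUX", "DET", "ADJ", "NOUN", "ADP", "PRON", "VERB", "PUNCT", "NUM"]

# spacy.explain looks up a short description in spaCy's built-in glossary.
descriptions = {tag: spacy.explain(tag) for tag in tags}
for tag, description in descriptions.items():
    print(f"{tag}: {description}")
```

The same function also accepts dependency labels such as `nsubj` and entity labels such as `GPE`.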


In [26]:
# Step 3: Named Entity Recognition (NER)
print("\nStep 3: Named Entity Recognition (NER)")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

print(doc.ents)


Step 3: Named Entity Recognition (NER)
Entity: USA, Label: GPE
Entity: France, Label: GPE
Entity: 21/10/2025, Label: DATE
(USA, France, 21/10/2025)


In [27]:
# Step 4: Dependency Parsing
print("\nStep 4: Dependency Parsing")
for token in doc:
    print(f"Token: {token.text}, Dependency: {token.dep_}, Head: {token.head.text}")



Step 4: Dependency Parsing
Token: SpaCy, Dependency: nsubj, Head: is
Token: is, Dependency: ROOT, Head: is
Token: a, Dependency: det, Head: library
Token: powerful, Dependency: amod, Head: library
Token: library, Dependency: attr, Head: is
Token: for, Dependency: prep, Head: library
Token: natural, Dependency: amod, Head: language
Token: language, Dependency: compound, Head: processing
Token: processing, Dependency: pobj, Head: for
Token: that, Dependency: nsubj, Head: removes
Token: removes, Dependency: nsubj, Head: stop
Token: stop, Dependency: relcl, Head: library
Token: words, Dependency: dobj, Head: stop
Token: ., Dependency: punct, Head: is
Token: USA, Dependency: ROOT, Head: USA
Token: ,, Dependency: punct, Head: USA
Token: France, Dependency: appos, Head: USA
Token: on, Dependency: prep, Head: USA
Token: 21/10/2025, Dependency: pobj, Head: on


In [28]:

# Step 5: Lemmatization
print("\nStep 5: Lemmatization")
for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}")


Step 5: Lemmatization
Token: SpaCy, Lemma: SpaCy
Token: is, Lemma: be
Token: a, Lemma: a
Token: powerful, Lemma: powerful
Token: library, Lemma: library
Token: for, Lemma: for
Token: natural, Lemma: natural
Token: language, Lemma: language
Token: processing, Lemma: processing
Token: that, Lemma: that
Token: removes, Lemma: remove
Token: stop, Lemma: stop
Token: words, Lemma: word
Token: ., Lemma: .
Token: USA, Lemma: USA
Token: ,, Lemma: ,
Token: France, Lemma: France
Token: on, Lemma: on
Token: 21/10/2025, Lemma: 21/10/2025


In [29]:
# Sample text
text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976. The company is known for its innovative products like the iPhone and MacBook."

# Process the text using the loaded model
doc = nlp(text)

# Display named entities and their labels
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

# Extract organizations from named entities
organizations = [ent.text for ent in doc.ents if ent.label_ == "ORG"]

# Print information about organizations
print("\nOrganizations mentioned in the text:")
for org in organizations:
    print(f"- {org}")


Named Entities:
Entity: Apple Inc., Label: ORG
Entity: Steve Jobs, Label: PERSON
Entity: Steve Wozniak, Label: PERSON
Entity: April 1976, Label: DATE
Entity: iPhone, Label: ORG
Entity: MacBook, Label: ORG

Organizations mentioned in the text:
- Apple Inc.
- iPhone
- MacBook


In [None]:

import spacy
nlp = spacy.load("en_core_web_sm")

text = "I am a junior data scientist at Becode and my ultimate dream is to become a famous NLP engineer"

doc = nlp(text)

for token in doc:
    pos = ...  # TO COMPLETE: replace ... with the POS tag of the token
    print(token, "--", pos)

## What is chunking?

"Chunking is a process of extracting phrases from unstructured text. Instead of just simple tokens which may not represent the actual meaning of the text, its advisable to use phrases such as “South Africa” as a single word instead of ‘South’ and ‘Africa’ separate words."

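To make the idea concrete, here is a deliberately simple, rule-based sketch in plain Python: it groups consecutive determiners, adjectives and nouns into one phrase. The tagged sentence and the `noun_chunks` helper are hand-written for illustration, and this is not how spaCy chunks text internally.

```python
# Hypothetical, hand-written (token, POS) pairs; in practice a tagger would produce them.
tagged = [
    ("She", "PRON"), ("visited", "VERB"), ("South", "PROPN"), ("Africa", "PROPN"),
    ("with", "ADP"), ("an", "DET"), ("old", "ADJ"), ("friend", "NOUN"), (".", "PUNCT"),
]

# Tags allowed inside a (very simplified) noun phrase.
NP_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}


def noun_chunks(tagged_tokens):
    """Group runs of consecutive noun-phrase tokens into single strings."""
    chunks, current = [], []
    for text, pos in tagged_tokens:
        if pos in NP_TAGS:
            current.append(text)
        elif current:
            # A token outside the phrase closes the current chunk.
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


print(noun_chunks(tagged))  # ['South Africa', 'an old friend']
```

A rule this naive merges two adjacent noun phrases into one, which is one reason real chunkers rely on a grammar or a dependency parse.
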
## How to do that?

Every library has its own way of doing it. Let's first look at how spaCy analyzes a sentence with [displaCy, its visualization tool](https://spacy.io/usage/visualizers):

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
text = """In ancient Rome, some neighbors live in three adjacent houses. In the center is the house of Senex, who lives there with wife Domina, son Hero, and several slaves, including head slave Hysterium and the musical's main character Pseudolus."""

# Process the text
doc = nlp(text)
# Create a list of sentences
sentence_spans = list(doc.sents)
# Display the displaCy dependency visualization for each sentence
displacy.render(sentence_spans, style="dep")

Now, find out how spaCy chunks the text.

In [None]:
# Print the chunks of the text using the Doc object

## Additional resources
* [Learning POS tagging & chunking in NLP](https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb)
* [spaCy API](https://spacy.io/api)