# Lab 02: POS Tagging, Morphology, Lemmatization, Dependency Parsing, and Tokenization

## Introduction  
In this tutorial, we will explore some core NLP tasks using **spaCy**, a powerful and efficient Python library for NLP. Additionally, we will examine tokenization techniques used in modern language models.  

### Topics Covered:  

1. **Tokenization**  
2. **Part-of-Speech (POS) Tagging**  
3. **Lemmatization**  
4. **Morphology**  
5. **Dependency Parsing**  
6. **Subword Tokenization**  

## Prerequisites  

Before we begin, ensure that you have **spaCy** installed in your environment. If you are using the `NLP2026` environment, make sure it is activated. Otherwise, you can install **spaCy** with the `pip` command in the install cell below.



In [1]:
from IPython.display import HTML, display
colab_button = HTML(
    '<a href="https://colab.research.google.com/github/surrey-nlp/NLP-2026/blob/main/lab02/lab02_Tokenization.ipynb" target="_parent">'
    '<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>'
)
display(colab_button)

In [None]:
!pip install spacy

Next, download the spaCy model for the English language:

In [None]:
!python -m spacy download en_core_web_sm

## Importing spaCy
Let's start by importing the spaCy library and loading the English language model.

In [None]:
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')

Before beginning, let's define a couple of example strings that we can look at.

In [None]:
short_text = "Cats like to chase mice."
long_text = "The University of Surrey is a U.K. university founded in 1966, with a budget of £314.0 million."

## 1. Tokenization  

**Tokenization** is the process of breaking down text input into smaller units called **tokens**, which can be **words, punctuation marks, or other meaningful elements**. This is a fundamental step in NLP as it enables structured text analysis.  

### Tokenization in spaCy  

In **spaCy**, tokenization is performed using language-specific rules (for punctuation, prefixes, suffixes, and known exceptions). For example:  
- Punctuation at the end of a sentence is **split off** as a separate token.  
- Abbreviations like **"U.K."** retain their periods within a single token.  

### How spaCy Handles Tokenization  

- The **input** to the tokenizer is **Unicode text**.  
- The **output** is a **Doc object**, which consists of individual tokens.  
- We can **iterate** over tokens and access attributes such as `token.text`.  
- spaCy's tokenizer is **non-destructive**, meaning it preserves the original text while providing structured access to tokens.  

This efficient tokenization process enables deeper linguistic analysis while maintaining the integrity of the original text.  
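To make the idea of **non-destructive** tokenization concrete, here is a minimal sketch in plain Python. It is not spaCy's tokenizer: the function `toy_tokenize` and its regular expression are assumptions made for illustration only. The point is that each token keeps its character offset and trailing whitespace, so the original text can be rebuilt exactly.

```python
import re

def toy_tokenize(text):
    """Return (token_text, start_offset, trailing_whitespace) triples."""
    tokens = []
    # A token is a run of word characters or a single punctuation mark,
    # followed by whitespace that we record instead of throwing away.
    for match in re.finditer(r"(\w+|[^\w\s])(\s*)", text):
        tokens.append((match.group(1), match.start(1), match.group(2)))
    return tokens

sample = "Cats like to chase mice."
toy_tokens = toy_tokenize(sample)

# Joining every token with its stored whitespace recovers the input.
rebuilt = "".join(tok + ws for tok, _, ws in toy_tokens)

print([tok for tok, _, _ in toy_tokens])
print("Round trip is lossless:", rebuilt == sample)
```

In spaCy the same information is available through `token.idx` (character offset) and `token.whitespace_` (trailing whitespace), and `doc.text` returns the original string. Note that this toy version ignores leading whitespace, which a real tokenizer has to handle.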


In [None]:
# tabulate formats lists of rows as a plain-text table
# (install it with `pip install tabulate` if the import fails)
import tabulate

doc = nlp(short_text)

# Build one table row per token
table = []
for count, token in enumerate(doc):
    table.append([count + 1, token.text])

print(tabulate.tabulate(table, headers=['Position','Text']))

In [None]:
# Again for our longer text
doc2 = nlp(long_text)

table = []
for count, token in enumerate(doc2):
    table.append([count + 1, token.text])

print(tabulate.tabulate(table, headers=['Position','Text']))

## 2. Part-of-Speech (POS) Tagging  

**Part-of-Speech (POS) tagging** is the process of assigning grammatical tags to individual words in a sentence, indicating their role, such as **noun, verb, adjective,** etc. This helps in understanding the **syntactic structure** of a sentence and is fundamental in many NLP tasks.  

### Using spaCy for POS Tagging  

Since we have already processed the text input using **spaCy**, we can retrieve the POS tag for each token with a simple attribute call: `token.pos_`.


In [None]:
# For our short sentence
POS_Tags = []
for count, token in enumerate(doc):
    POS_Tags.append([count + 1, token.text, token.pos_])

print(tabulate.tabulate(POS_Tags, headers=['Position','Text', 'POS Tag']))

In [None]:
# And our longer sentence
POS_Tags = []
for count, token in enumerate(doc2):
    POS_Tags.append([count + 1, token.text, token.pos_])

print(tabulate.tabulate(POS_Tags, headers=['Position','Text', 'POS Tag']))

Based on the example above, we can see several **POS Tags**. Some common examples include:

- **DET**: Determiner  
- **PROPN**: Proper Noun  
- **ADP**: Adposition  

These tags represent different parts of speech in a sentence and are crucial for understanding the syntactic structure of the language.

### Why Use POS Tagging?

In NLP, understanding the grammatical structure of sentences can be extremely valuable for many tasks. POS tagging helps computers to identify the roles that different words play within a sentence, such as subjects, objects, or actions.

However, in some tasks, it may also be useful to **discard certain words** based on their POS tags. For example:

- **Sentiment Analysis**:  
  In sentiment analysis, words like **articles** (e.g., *"the"*, *"a"*) and **pronouns** (e.g., *"he"*, *"she"*) might be discarded because they contribute little to the overall sentiment of the text.

By filtering out less relevant POS tags, the model can focus on words that carry more meaning and help improve task performance.
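As a small illustration of POS-based filtering, the sketch below drops determiners, pronouns, and punctuation from a tagged sentence. The sentence and its tags are written by hand for this example (they are not copied from spaCy's output), and the set `DROP_TAGS` is just one possible choice.

```python
# Hand-assigned Universal POS tags; illustrative, not spaCy output.
tagged_sentence = [
    ("The", "DET"),
    ("film", "NOUN"),
    ("was", "AUX"),
    ("surprisingly", "ADV"),
    ("good", "ADJ"),
    (".", "PUNCT"),
]

# Tags we choose to discard for this hypothetical sentiment task.
DROP_TAGS = {"DET", "PRON", "PUNCT"}

kept_words = [word for word, pos in tagged_sentence if pos not in DROP_TAGS]
print(kept_words)
```

With a spaCy `Doc`, the equivalent filter would be `[t.text for t in doc if t.pos_ not in DROP_TAGS]`. Which tags are safe to drop depends on the task, so it is worth checking the effect on a validation set.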


## 3. Lemmatization  

**Lemmatization** is the process of reducing words to their **base** or dictionary form, known as a **lemma** (for example, *mice* becomes *mouse* and *chasing* becomes *chase*). This helps in **text normalization** by converting different inflectional forms of a word into a single standardized form.  

Lemmatization is particularly useful in NLP tasks such as:  
- Improving text **search and retrieval**  
- Enhancing **sentiment analysis**  
- Reducing **dimensionality** in text-based models  
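
To see how lemmatization can reduce dimensionality, the sketch below compares vocabulary size before and after mapping words to lemmas. The lookup table `LEMMA_TABLE` is a hand-written toy (an assumption for this example); spaCy's lemmatizer is considerably more sophisticated.

```python
# Toy lemma lookup, written by hand for illustration only.
LEMMA_TABLE = {
    "cats": "cat",
    "mice": "mouse",
    "chased": "chase",
    "chases": "chase",
    "chasing": "chase",
}

def toy_lemma(word):
    """Lower-case the word and map it to its lemma if we know one."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)

words = "Cats chased mice while a cat chases a mouse chasing cats".split()

surface_vocab = {w.lower() for w in words}
lemma_vocab = {toy_lemma(w) for w in words}

print("Distinct surface forms:", len(surface_vocab))
print("Distinct lemmas:", len(lemma_vocab))
```

Fewer distinct types means fewer features for count-based models such as bag-of-words.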


In [None]:
# For our short sentence
Lemmas = []
for count, token in enumerate(doc):
    Lemmas.append([count + 1, token.text, token.lemma_])

print(tabulate.tabulate(Lemmas, headers=['Position','Text','Lemma']))

In [None]:
# And our longer sentence
Lemmas = []
for count, token in enumerate(doc2):
    Lemmas.append([count + 1, token.text, token.lemma_])

print(tabulate.tabulate(Lemmas, headers=['Position','Text','Lemma']))

## 4. Morphology  

**Morphology** is the study of the internal structure of words and their components, such as **prefixes, suffixes,** and **roots**. In particular, inflectional morphology describes how the base form (lemma) of a word is modified, typically by adding prefixes or suffixes, to change its meaning or grammatical function.  

In **spaCy**, we can access detailed morphological information for each token, which includes features such as:  
- **Number** (singular or plural)  
- **Tense** (present, past, etc.)  
- **Mood**: Indicates the mode or manner in which the action is expressed (e.g., **indicative**, **imperative**, or **subjunctive**).  
  - Example: *"She eats"* (indicative) vs. *"Eat!"* (imperative)
- **Aspect**: Describes the temporal flow or completion of an action (e.g., **perfect**, **progressive**, or **habitual**).  
  - Example: *"I am eating"* (progressive) vs. *"I have eaten"* (perfect)  

This morphological analysis is essential for understanding how words relate to one another in context and is crucial for tasks such as syntactic parsing and word generation.  
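Morphological features are usually written as `Feature=Value` pairs joined by `|`, which is also how `token.morph` prints in spaCy. The helper below, `parse_feats`, is a hypothetical name used only in this sketch; it turns such a string into a dictionary so individual features can be looked up.

```python
def parse_feats(feats):
    """Convert 'Number=Plur|Tense=Past' style strings into a dict."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Feature strings written by hand in the Universal Dependencies style.
print(parse_feats("Number=Plur"))
print(parse_feats("Tense=Pres|VerbForm=Fin"))
print(parse_feats(""))
```

When working with spaCy directly, `token.morph.to_dict()` and `token.morph.get("Number")` give the same kind of access without manual parsing.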


In [None]:
# For our short sentence
Morphs = []
for count, token in enumerate(doc):
    Morphs.append([count + 1, token.text, token.morph])

print(tabulate.tabulate(Morphs, headers=['Position','Text', 'Morphology']))

In [None]:
# And our longer sentence
Morphs = []
for count, token in enumerate(doc2):
    Morphs.append([count + 1, token.text, token.morph])

print(tabulate.tabulate(Morphs, headers=['Position','Text', 'Morphology']))

## 5. Dependency Parsing  

Dependency parsing involves analyzing the grammatical structure of a sentence and establishing relationships between **head** words and their **modifiers**. Each word is linked directly to the word it depends on, and every word except the root has exactly one head. These relationships are typically represented as a **tree structure**, illustrating how words depend on one another.  

### Example  

**Sentence:**  
*"I prefer the morning flight through Denver."*  

The diagram below visualizes the sentence's dependency structure:  

![Dependency Parsing](https://cdn.analyticsvidhya.com/wp-content/uploads/2024/09/29920Screenshot-127.webp)  
*[Source](https://www.analyticsvidhya.com/blog/2021/12/dependency-parsing-in-natural-language-processing-with-examples/)*  

### Understanding the Dependency Structure  

In the diagram:  

- **Directed arcs** illustrate grammatical relationships between words in the sentence.  
- The **root** of the tree, *prefer*, serves as the central unit of the sentence.  
- Each dependency is labeled with a **dependency tag**, which specifies the relationship between two words.  

For instance, in the phrase **"flight through Denver"**, the noun *Denver* modifies the meaning of *flight*. This creates a **dependency** where:  

- *Flight* is the **head** (governing word).  
- *Denver* is the **dependent** (child node).  
- This relationship is marked by the **nmod** (nominal modifier) tag, indicating that *Denver* provides additional information about *flight*.  

Dependency parsing plays a crucial role in natural language processing (NLP), helping models understand syntactic structures and improving tasks such as named entity recognition, question answering, and machine translation.  
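A dependency tree can be stored very compactly: one head index and one label per word. The sketch below encodes the example sentence by hand, following the structure described above. The annotations are assumptions for illustration and may differ from what a particular parser (including spaCy) produces.

```python
words = ["I", "prefer", "the", "morning", "flight", "through", "Denver", "."]

# heads[i] is the index of the head of words[i].
# The root points to itself, which is also spaCy's convention.
heads = [1, 1, 4, 4, 1, 6, 4, 1]
labels = ["nsubj", "root", "det", "nmod", "dobj", "case", "nmod", "punct"]

def children_of(head_index):
    """Indices of all words whose head is the given word."""
    return [i for i, h in enumerate(heads) if h == head_index and i != head_index]

root_index = next(i for i, h in enumerate(heads) if h == i)
print("Root:", words[root_index])

for i, word in enumerate(words):
    child_words = [words[c] for c in children_of(i)]
    print(f"{word:8} {labels[i]:6} head={words[heads[i]]:8} children={child_words}")
```

In spaCy, the same information is exposed through `token.head`, `token.dep_`, and `token.children`.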


All of this can be done easily with spaCy through the following:

In [None]:
Dependency_Parsing = []
for count, token in enumerate(doc):
    Dependency_Parsing.append([count + 1, token.text, token.dep_, [child.text for child in token.children]])

print(tabulate.tabulate(Dependency_Parsing, headers=['Position','Text', 'Dependency', 'Children']))

This table might look very confusing, which is why spaCy offers a quick way to easily view the tree structure with the following:

In [None]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)

In [None]:
Dependency_Parsing = []
for count, token in enumerate(doc2):
    Dependency_Parsing.append([count + 1, token.text, token.dep_, [child.text for child in token.children]])

print(tabulate.tabulate(Dependency_Parsing, headers=['Position','Text', 'Dependency', 'Children']))

This table is even harder to read, but luckily we can visualise it again!

In [None]:
# And a more complicated one
from spacy import displacy

displacy.render(doc2, style="dep", jupyter=True)

## 6. Subword Tokenization  

Tokenization is the process of breaking down a sentence into smaller units, enabling AI models to process text as discrete tokens rather than as a continuous block of text. In previous sections, you have used spaCy for tokenization, which primarily segments text into individual words. While this approach is efficient, it struggles with handling uncommon or out-of-vocabulary (OOV) words.  

To address this limitation, modern tokenization techniques predominantly use **subword-based methods**. Instead of strictly segmenting text into words, these approaches break words into smaller subword units when necessary. For example, the word *unhappiness* might be tokenized into *un* and *happiness*. This strategy offers several advantages:  

- **Improved Handling of Rare Words** – By decomposing words into meaningful subunits, the model can recognize and generate words that were not explicitly seen during training.  
- **Compact Vocabulary** – Instead of storing an extensive vocabulary of all possible words, subword tokenization relies on a smaller set of subunits, which can be combined to form complex words.  
- **Efficient Representation** – By balancing whole-word tokens with subword segments, this method optimizes both memory usage and model performance.  

(**Note**: Often tokenizers try to maintain words that are frequently used, and split rare words into smaller subwords)

We will therefore explore three subword tokenization techniques:  

1. **WordPiece**  
2. **SentencePiece** 
3. **Byte-Pair Encoding (BPE)**  

These tokenization methods have become standard in modern NLP models and are widely used in recent Large Language Models (LLMs).  


Before starting, let's define a simple string that we will be tokenizing.

In [None]:
text = "Natural Language Processing is incontrovertibly a good module."

## 6.1. WordPiece Tokenization

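**WordPiece Tokenization** is a subword tokenization technique used in models like BERT (Bidirectional Encoder Representations from Transformers). It breaks words down into subwords so that complex words, unknown terms, and out-of-vocabulary (OOV) words can still be represented efficiently.

WordPiece builds its vocabulary by starting from individual characters and iteratively merging pairs of characters or subword units in a large corpus. Unlike BPE, which always merges the most frequent pair, WordPiece is usually described as choosing the merge that most increases the likelihood of the training data, which favours pairs that occur together more often than their individual frequencies would suggest. The resulting subwords represent common word components, which helps to reduce the size of the vocabulary while maintaining broad language coverage.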

### Example: Tokenizing Text with BERT's WordPiece Tokenizer

In this section, we will use the Hugging Face `transformers` library to showcase how the BERT tokenizer works. We'll tokenize a sample sentence, convert the tokens into token IDs, and then decode those IDs back into a human-readable string.


In [None]:
# First, install the required library
!pip install transformers

In [None]:
# Importing the necessary module from Hugging Face transformers
from transformers import BertTokenizer

# Step 1: Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Step 2: Tokenize the text into subword tokens
tokens = tokenizer.tokenize(text)
print("\nBERT Tokens:", tokens)

# Step 3: Convert tokens to their corresponding token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("\nBERT Token IDs:", token_ids)

# Step 4: Decode the token IDs back to human-readable text
decoded_text = tokenizer.decode(token_ids)
print("\nDecoded Text:", decoded_text)

In the example above, the word **"incontrovertibly"**, which is quite rare, is split into **5 subwords** by the tokenizer. This process of splitting words into smaller subunits is particularly useful for handling rare or out-of-vocabulary (OOV) words.

Each subword is represented as a **token**, and you can see that certain tokens are prefixed with `##`. This notation indicates that the subword continues the previous token within the same word (i.e., it does not start a new word). The tokenizer has broken down the word into smaller, more frequent subwords that are part of the model's vocabulary.
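
At tokenization time, WordPiece is commonly implemented as a greedy longest-match-first search over the vocabulary. The sketch below shows that idea with a tiny hand-made vocabulary; `wordpiece_tokenize` and `toy_vocab` are hypothetical names for this example, and the real BERT tokenizer also performs normalization and pre-tokenization that are omitted here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny vocabulary made up for this example.
toy_vocab = {"un", "##happi", "##ness", "cat", "##s"}

print(wordpiece_tokenize("unhappiness", toy_vocab))
print(wordpiece_tokenize("cats", toy_vocab))
print(wordpiece_tokenize("dog", toy_vocab))
```

Because real vocabularies contain every individual character seen in training, the `[UNK]` case is rare in practice and mostly occurs for unseen symbols.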

### Why Do We Use Token IDs?

As shown above, the tokens are also associated with **token IDs**. These token IDs are numerical representations of the words or subwords. In the context of machine learning and NLP models, it's crucial to convert words into numbers because models operate on numerical data.

Each token is mapped to a unique ID in the model's vocabulary, which allows the model to process text efficiently. This conversion is essential because:

- **Models can't understand raw text**: Machine learning models, including NLP models, don't process text directly. Instead, they process **numerical representations** of words.
- **Token IDs map to model parameters**: The model's vocabulary is essentially a map of tokens (words or subwords) to unique IDs. These IDs are used by the model to look up the corresponding word embeddings (vector representations) in the model's parameters.
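
The lookup described above can be sketched in a few lines of plain Python. The vocabulary, the embedding size, and the random vectors are all made up for illustration; in a real model the embedding table is a learned parameter matrix.

```python
import random

# Hypothetical four-entry vocabulary: token -> ID.
toy_vocab = {"[UNK]": 0, "cats": 1, "like": 2, "mice": 3}

# One vector per ID. Real embeddings are learned; these are random.
random.seed(0)
embedding_dim = 4
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(toy_vocab))
]

def encode(tokens):
    """Map tokens to IDs, falling back to the [UNK] ID for unknown tokens."""
    return [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]

ids = encode(["cats", "like", "dogs"])
vectors = [embedding_table[token_id] for token_id in ids]

print("Token IDs:", ids)
print("Vector for first token:", vectors[0])
```

The ID is simply a row index into the table, which is why the tokenizer and the model must always share the same vocabulary.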

## 6.2. SentencePiece Tokenization

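**SentencePiece** is another popular subword tokenizer, used in models like T5 (Text-to-Text Transfer Transformer) and other transformer-based architectures. It works directly on raw text, treats whitespace as an ordinary symbol (displayed as `▁`), and supports both the unigram language model and BPE algorithms.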

In this section, we will use the Hugging Face `transformers` library to demonstrate how the **SentencePiece tokenizer** works. We will also be training our own SentencePiece tokenizer later on!


In [None]:
from transformers import T5Tokenizer

# Step 1: Load a pre-trained SentencePiece tokenizer (T5 model)
# Note: T5Tokenizer needs the `sentencepiece` package (pip install sentencepiece)
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Step 2: Tokenize the text into subword tokens using SentencePiece
tokens = tokenizer.tokenize(text)
print("\nSentencePiece Tokens:", tokens)

# Step 3: Convert tokens to their corresponding token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("\nSentencePiece Token IDs:", token_ids)

# Step 4: Decode the token IDs back to human-readable text
decoded_text = tokenizer.decode(token_ids)
print("\nDecoded Text:", decoded_text)


## 6.3. Byte-Pair Encoding (BPE) Tokenization

**Byte-Pair Encoding (BPE)** is another subword tokenization technique, used in models like GPT (Generative Pretrained Transformer) and other transformer-based architectures. 

### Byte-Level BPE Tokenization

Instead of treating text as sequences of **Unicode characters** (such as 'a', 'b', 'c', etc.), **byte-level BPE** tokenizes text at the **byte level**. The text is first encoded as a sequence of **bytes** (UTF-8), so a character such as 'a' is a single byte, while '£' or an emoji spans several bytes.

The **base vocabulary** for byte-level BPE is small: it has exactly **256 entries**, one for each possible byte value. Because any text can be written as a sequence of bytes, every input can be tokenized without resorting to an **unknown token** for out-of-vocabulary (OOV) words.

This approach allows models like **GPT-2** and **RoBERTa** to handle any character or symbol, including those from different languages, special symbols, or rare characters, without needing additional vocabularies or dealing with OOV issues.
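The following short sketch shows why 256 base symbols are enough. It encodes a few characters as UTF-8 and prints their byte values; the choice of sample characters is ours, and nothing here depends on a particular tokenizer.

```python
samples = ["a", "£", "🙂"]

byte_values = {}
for ch in samples:
    # Each character becomes one or more integers in the range 0..255.
    byte_values[ch] = list(ch.encode("utf-8"))
    print(repr(ch), "->", byte_values[ch])

# Decoding the bytes gives back exactly the original character.
round_trip = {ch: bytes(vals).decode("utf-8") for ch, vals in byte_values.items()}
print(round_trip)
```

Rare characters cost more bytes, and therefore often more tokens, but they never fall outside the base vocabulary.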


### Example: Tokenizing Text with BPE Tokenizer

In this section, we will use the Hugging Face `transformers` library to demonstrate how a BPE tokenizer works.

In [None]:
from transformers import GPT2Tokenizer

# Step 1: Load the pre-trained BPE tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Step 2: Tokenize the text into subword tokens
tokens = tokenizer.tokenize(text)
print("\nBPE Tokens:", tokens)

# Step 3: Convert tokens to their corresponding token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("\nBPE Token IDs:", token_ids)

# Step 4: Decode the token IDs back to human-readable text
decoded_text = tokenizer.decode(token_ids)
print("\nDecoded Text:", decoded_text)

### The Ġ Character in Byte-Pair Encoding (BPE)

The output above reveals a noticeable difference: the **Ġ** character. GPT-2's **byte-level BPE** displays every byte as a printable character, and the **space** byte is shown as **Ġ**. A token that starts with **Ġ** is therefore preceded by a space. Keeping the space inside the token lets the tokenizer distinguish a word at the start of the text from the same word in the middle of a sentence, and lets decoding restore the original spacing exactly.
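
The sketch below reconstructs the byte-to-character table along the lines of the `bytes_to_unicode` helper published with GPT-2. It is a re-implementation for illustration, so treat the details as an approximation of the original. Printable bytes map to themselves, and the remaining bytes are shifted to code points from 256 upwards.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_list = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in byte_list:
            # Non-printable byte: assign the next code point above 255.
            byte_list.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_list, code_points)}

byte_to_char = bytes_to_unicode()

print("Space is shown as:", byte_to_char[ord(" ")])
print("".join(byte_to_char[b] for b in " good".encode("utf-8")))
```

The space byte (32) is the 33rd non-printable byte, so it lands on code point 288, which is `Ġ`.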


## 7. Training a SentencePiece Model

In this section, we'll walk through how to **train a SentencePiece model** from a text corpus using **Byte-Pair Encoding (BPE)**. As seen above, SentencePiece handles rare or out-of-vocabulary (OOV) words by splitting them into smaller, manageable units.

### Training Process Overview:
1. **Input Corpus**: We use a text file (e.g., **Shakespear_1_10.txt**) as input.
2. **Model Parameters**:
   - **Vocabulary size**: Set to **2000**.
   - **Model type**: We use **BPE**.
3. **Training**: The model is trained using `SentencePieceTrainer.train()` to learn subword units.
4. **Output**: The model and vocabulary files are saved with the specified prefix (e.g., `mymodel.model`, `mymodel.vocab`).
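
To give a feel for what "learning subword units" means in step 3, here is a toy version of the classic BPE training loop in plain Python. It is not SentencePiece's implementation: the function `learn_bpe` and the word counts are assumptions for this sketch, and details such as whitespace handling and tie-breaking differ in real libraries.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn merge rules by repeatedly joining the most frequent adjacent pair."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Made-up word counts standing in for a corpus.
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_freqs, num_merges=3)

print("Learned merges:", merges)
print("Segmented words:", segmented)
```

Each merge adds one new symbol to the vocabulary, so the vocabulary size parameter effectively controls how many merges are learned.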


In [None]:
!pip install sentencepiece

In [None]:
import sentencepiece as spm

# Step 1: Define the input corpus file (a large text file)
corpus_file = 'Shakespear_1_10.txt'  

# Step 2: Define the model output directory and parameters
model_prefix = 'mymodel' 
vocab_size = 2000
model_type = 'bpe'  # BPE model (could also be 'unigram', 'char', etc.)

# Step 3: Train the SentencePiece model
spm.SentencePieceTrainer.train(
    input=corpus_file,  
    model_prefix=model_prefix,  
    vocab_size=vocab_size,  
    model_type=model_type, 
    character_coverage=0.9995,  # Coverage for character set (default is 0.9995)
    input_format='text'  # Format of input (usually plain text)
)

print(f"Model trained and saved with prefix: {model_prefix}")


## SentencePiece Tokenization and Detokenization

Once you've trained your **SentencePiece** model, you can use it to tokenize and detokenize sentences. The process involves converting a sentence into subword units (tokens) and then reconstructing the sentence from those tokens.


In [None]:
# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load('mymodel.model')

# Tokenize a sentence
sentence = "I have successfully trained a SentencePiece model."
tokens = sp.encode(sentence, out_type=str)  # or out_type=int for token IDs
print(f"\nTokenized sentence: {tokens}")

# Detokenize the sentence
detokenized = sp.decode(tokens)
print(f"\nDetokenized sentence: {detokenized}")
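
Detokenization is lossless because SentencePiece keeps spaces inside the pieces, written as the `▁` symbol. The sketch below imitates the decoding step by hand. The pieces are hypothetical (a trained model may split the sentence differently), and `toy_decode` is an illustrative helper rather than part of the library.

```python
META = "\u2581"  # the '▁' symbol that SentencePiece uses to mark a space

# Hand-written pieces for illustration; not the output of a trained model.
example_pieces = [
    META + "I", META + "have", META + "train", "ed",
    META + "a", META + "model", ".",
]

def toy_decode(pieces):
    """Join the pieces, turn the space marker back into spaces, drop the first one."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text

print(toy_decode(example_pieces))
```

Pieces without the marker, such as `ed` and `.`, attach directly to the preceding piece, which is how the original spacing is recovered.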

## 8. 🔎 Interactive Exploration

You can explore tokenization with various models interactively here:

**Tiktoken Visualizer:**  
👉 https://tiktokenizer.vercel.app



Try pasting the following examples into the tool and compare the token counts:

<ol>
  <li>Cats like to chase mice.</li>
  <li>The University of Surrey was founded in 1966 with £314 million.</li>
  <li>incontrovertibly</li>
  <li>emojis 🙂🙃🔥</li>
</ol>

See how:

<ol>
  <li>Whitespace affects tokenization.</li>
  <li>Some words are split into subword pieces.</li>
  <li>Emojis and symbols are handled.</li>
</ol>