# Part-of-Speech Tagging with Transformers

In this notebook, we will explore how to perform Part-of-Speech (POS) tagging using a pre-trained BERT model fine-tuned for POS tagging.

## Objectives
By the end of this notebook, you will:
1. Understand what Part-of-Speech (POS) tagging is and its importance in Natural Language Processing (NLP).
2. Learn how to use a pre-trained transformer-based model for POS tagging.
3. Visualize POS tags in a given text.

## What is POS Tagging?
POS tagging labels each word in a sentence with its part of speech, such as noun, verb, or adjective.

For example:

**Input Text**: "The quick brown fox jumps over the lazy dog."

**Output POS Tags**:
- The: Determiner
- quick: Adjective
- brown: Adjective
- fox: Noun
- jumps: Verb
- over: Preposition
- the: Determiner
- lazy: Adjective
- dog: Noun

POS tagging helps in understanding the grammatical structure of a sentence and is a foundational step in many NLP tasks, such as parsing, named entity recognition, and machine translation.
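
The readable names in the example correspond to the abbreviated tags that appear in this notebook's model output further down (`DET`, `ADJ`, `NOUN`, `VERB`, `ADP`). As a small illustration, the snippet below stores the example as (word, name) pairs and converts the names to those abbreviations. The dictionary `NAME_TO_TAG` and the helper `abbreviate` are illustrative names, and the mapping covers only the tags used in this one example.

```python
# Illustrative mapping from the readable names in the example to the
# abbreviated tags seen in the model output later in this notebook.
NAME_TO_TAG = {
    "Determiner": "DET",
    "Adjective": "ADJ",
    "Noun": "NOUN",
    "Verb": "VERB",
    "Preposition": "ADP",
}

# The example sentence as (word, readable tag name) pairs.
example = [
    ("The", "Determiner"),
    ("quick", "Adjective"),
    ("brown", "Adjective"),
    ("fox", "Noun"),
    ("jumps", "Verb"),
    ("over", "Preposition"),
    ("the", "Determiner"),
    ("lazy", "Adjective"),
    ("dog", "Noun"),
]


def abbreviate(tagged_words):
    """Replace each readable tag name with its abbreviation."""
    return [(word, NAME_TO_TAG[name]) for word, name in tagged_words]


for word, tag in abbreviate(example):
    print(f"{word}: {tag}")
```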

---



# Preliminaries: Libraries and Model

We will use the Hugging Face `transformers` library to load a pre-trained BERT model fine-tuned for POS tagging.

### Key Components:
1. **AutoTokenizer**: Converts text into tokens that the model can process.
2. **AutoModelForTokenClassification**: Loads a pre-trained model for token classification tasks like POS tagging.
3. **Pipeline**: Provides a high-level API to simplify using models for various NLP tasks.

If you don’t have the `transformers` library installed, run:
```bash
pip install transformers
```


In [1]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Part-of-Speech Tagging with a Pre-trained Model

We will use the model `vblagoje/bert-english-uncased-finetuned-pos` from Hugging Face. This model is based on BERT and has been fine-tuned for POS tagging on English text.

### Steps:
1. Load the tokenizer and model.
2. Use the tokenizer to split the input text into tokens.
3. Pass the tokens through the model to obtain POS tags.
4. Aggregate and format the results for visualization.
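
To make step 4 more concrete, here is a simplified sketch of what merging adjacent tokens with the same label looks like. It is not the library's implementation: the real pipeline also handles word pieces, `B-`/`I-` label prefixes, and scores. The function name `merge_adjacent` and the hand-written token dictionaries are assumptions for illustration; they only mimic the `word`, `entity`, `start`, and `end` fields of per-token results.

```python
def merge_adjacent(tokens):
    """Merge neighbouring tokens that share a label into one group.

    Simplified illustration only; it ignores sub-word pieces and scores.
    """
    groups = []
    for tok in tokens:
        if groups and groups[-1]["entity_group"] == tok["entity"]:
            # Same label as the previous group: extend that group.
            groups[-1]["word"] += " " + tok["word"]
            groups[-1]["end"] = tok["end"]
        else:
            # Different label: start a new group.
            groups.append({
                "entity_group": tok["entity"],
                "word": tok["word"],
                "start": tok["start"],
                "end": tok["end"],
            })
    return groups


# Hand-written stand-in for per-token results on "The quick brown fox".
sample_tokens = [
    {"word": "the", "entity": "DET", "start": 0, "end": 3},
    {"word": "quick", "entity": "ADJ", "start": 4, "end": 9},
    {"word": "brown", "entity": "ADJ", "start": 10, "end": 15},
    {"word": "fox", "entity": "NOUN", "start": 16, "end": 19},
]

for group in merge_adjacent(sample_tokens):
    print(f"{group['word']}: {group['entity_group']}")
```

Because `quick` and `brown` share the label `ADJ`, they end up in a single group, so the number of groups can be smaller than the number of words in the sentence.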


In [2]:
# Define the POS tagging function
def perform_pos_tagging(text):
    # Load pre-trained model and tokenizer (this happens on every call;
    # load them once outside the function if you tag many texts)
    model_name = "vblagoje/bert-english-uncased-finetuned-pos"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name)

    # Create POS tagging pipeline
    pos_pipeline = pipeline("token-classification", model=model,
                             tokenizer=tokenizer, aggregation_strategy="simple")

    # Perform POS tagging
    results = pos_pipeline(text)

    # Print results
    print(f"Text: {text}\n")
    print("POS Tags:")
    for result in results:
        print(f"{result['word']}: {result['entity_group']}")

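    # Visualize POS tags in text.
    # Caveat: zip() pairs the i-th whitespace-separated word with the i-th result,
    # which only lines up when the pipeline returns exactly one result per word.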
    words = text.split()
    tagged_text = ""
    for word, pos in zip(words, results):
        tagged_text += f"{word}[{pos['entity_group']}] "

    print("\nTagged text:")
    print(tagged_text.strip())


# POS Tagging Results and Visualization

The tagged text is displayed with each word annotated by its corresponding POS tag. This makes it easier to understand the grammatical structure of the input sentence.

**Example**:

Input Text: "The quick brown fox jumps over the lazy dog."

Output:
- Text: The[DET] quick[ADJ] brown[ADJ] fox[NOUN] jumps[VERB] over[ADP] the[DET] lazy[ADJ] dog[NOUN]

This format provides both the word and its POS tag in a readable manner. Let's pass this string into our model and see if it tags properly!


In [3]:
# Our original sample text
sample_text = "The quick brown fox jumps over the lazy dog."

# Perform POS tagging
perform_pos_tagging(sample_text)



Text: The quick brown fox jumps over the lazy dog.

POS Tags:
the: DET
quick brown: ADJ
fox: NOUN
jumps: VERB
over: ADP
the: DET
lazy: ADJ
dog: NOUN
.: PUNCT

Tagged text:
The[DET] quick[ADJ] brown[NOUN] fox[VERB] jumps[ADP] over[DET] the[ADJ] lazy[NOUN] dog.[PUNCT]


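The tags in the `POS Tags` list are all correct for our sample sentence (the words appear lowercased because the model is uncased). The `Tagged text` line, however, is misaligned. With `aggregation_strategy="simple"`, the two adjacent adjectives were merged into a single `quick brown` entry and the period was tagged separately, so the results no longer match `text.split()` one-to-one. Every tag after `quick` lands one word too early, which is why `brown` is shown as `NOUN` and `dog.` as `PUNCT`.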
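
One way to keep words and tags aligned in the `Tagged text` line (where `brown` is paired with `NOUN` although the tag list has it under `ADJ`) is to use the `start` and `end` character offsets that the pipeline returns with each result, instead of `text.split()`. The sketch below assumes each result carries those offsets; the helper name `render_tagged_text` is made up for illustration, and `sample_results` is a hand-written stand-in for the pipeline output above, so the snippet runs without downloading the model.

```python
def render_tagged_text(text, results):
    """Annotate each word with the tag of the result span that contains it."""
    pieces = []
    for result in results:
        # Slice the original text so the original casing is preserved.
        span = text[result["start"]:result["end"]]
        # A merged group such as "quick brown" is split back into words.
        for word in span.split():
            pieces.append(f"{word}[{result['entity_group']}]")
    return " ".join(pieces)


sample_text = "The quick brown fox jumps over the lazy dog."

# Hand-written stand-in mirroring the tag list printed above.
sample_results = [
    {"entity_group": "DET", "word": "the", "start": 0, "end": 3},
    {"entity_group": "ADJ", "word": "quick brown", "start": 4, "end": 15},
    {"entity_group": "NOUN", "word": "fox", "start": 16, "end": 19},
    {"entity_group": "VERB", "word": "jumps", "start": 20, "end": 25},
    {"entity_group": "ADP", "word": "over", "start": 26, "end": 30},
    {"entity_group": "DET", "word": "the", "start": 31, "end": 34},
    {"entity_group": "ADJ", "word": "lazy", "start": 35, "end": 39},
    {"entity_group": "NOUN", "word": "dog", "start": 40, "end": 43},
    {"entity_group": "PUNCT", "word": ".", "start": 43, "end": 44},
]

print(render_tagged_text(sample_text, sample_results))
```

With this stand-in data, every word is paired with the tag from the list above, and the period gets its own `.[PUNCT]` entry. A word whose sub-word pieces receive different tags would still be split into several entries, so treat this as a starting point rather than a complete solution.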

# Exercises: Hands-on Practice

1. **Input Custom Text**:
   Modify the code to accept user input, then enter your own sentences and observe the POS tagging results.

2. **Extend the Visualization**:
   Highlight nouns in a sentence by wrapping them in `*` symbols. For example, convert `fox[NOUN]` to `*fox*[NOUN]`.

3. **Analyze Ambiguity**:
   Test the model with ambiguous sentences, such as "Time flies like an arrow." or "Fruit flies like a banana." Discuss how POS tags vary with different interpretations.

4. **Compare with Another Model**:
   Replace the current model with another POS tagging model from Hugging Face. Compare the results.

5. **Explore New Languages**:
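   Use a multilingual model for POS tagging on non-English text, for example a checkpoint built on `bert-base-multilingual-cased` that has been fine-tuned for POS tagging (the base checkpoint on its own has no trained tagging head). Test with sentences in other languages.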

6. **Extract Specific POS Tags**:
   Write a function to extract and print all nouns and verbs from a sentence.

7. **Evaluate on a Dataset**:
   Use a sample POS-tagged dataset (e.g., from the Universal Dependencies project) to evaluate the model’s accuracy.
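
As a possible starting point for exercise 7, the sketch below computes token-level accuracy from two equal-length tag lists. The function name `token_accuracy` is illustrative. The gold tags are taken from the expected output shown earlier in this notebook, and the predicted tags are read off the `Tagged text` line printed by the sample run, ignoring the trailing period attached to `dog.`. A real evaluation would load gold tags from a dataset and make sure the model's tokens line up with the dataset's tokens.

```python
def token_accuracy(gold_tags, predicted_tags):
    """Return the fraction of positions where the two tag lists agree."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tag lists must have the same length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Gold tags: the expected output for the sample sentence.
gold = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN"]

# Predicted tags: as they appear, word by word, in the "Tagged text" line.
predicted = ["DET", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN", "PUNCT"]

print(f"Token accuracy: {token_accuracy(gold, predicted):.2f}")
```

Only the first two positions agree here, which shows how much a word-to-tag misalignment can distort an accuracy figure even when the underlying tags are right.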
