In [None]:
# Lab 2

import nltk
import spacy
from transformers import pipeline

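# Download required NLTK resources.
# Recent NLTK releases look for "punkt_tab" (tokenizer) and
# "averaged_perceptron_tagger_eng" (POS tagger); the older "punkt" and
# "averaged_perceptron_tagger" names are kept for older releases.
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger")
nltk.download("averaged_perceptron_tagger_eng")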

# Load spaCy model
spacy.cli.download("en_core_web_sm")  # ensure model is available
nlp = spacy.load("en_core_web_sm")

# Example text
text = "Google acquired DeepMind in London for $500 million."

# --- 1. Tokenization ---
tokens = nltk.word_tokenize(text)
print("Word Tokens:", tokens)

sent_tokens = nltk.sent_tokenize(text)
print("\nSentence Tokens:", sent_tokens)

# --- 2. POS Tagging ---
pos_tags = nltk.pos_tag(tokens)
print("\nPOS Tags:", pos_tags)

# --- 3. NER with spaCy ---
doc = nlp(text)
print("\nNER (spaCy):")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)

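# --- 4. NER with Transformer (BERT-based model) ---
# No model is passed, so the pipeline falls back to its default NER checkpoint.
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True.
ner_model = pipeline("ner", aggregation_strategy="simple")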
print("\nNER (Transformer):")
print(ner_model(text))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Word Tokens: ['Google', 'acquired', 'DeepMind', 'in', 'London', 'for', '$', '500', 'million', '.']

Sentence Tokens: ['Google acquired DeepMind in London for $500 million.']

POS Tags: [('Google', 'NNP'), ('acquired', 'VBD'), ('DeepMind', 'NNP'), ('in', 'IN'), ('London', 'NNP'), ('for', 'IN'), ('$', '$'), ('500', 'CD'), ('million', 'CD'), ('.', '.')]
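
The tagger returns plain `(token, tag)` tuples, so the result can be filtered with ordinary Python. As a minimal sketch, the snippet below hard-codes the tagged output printed above (rather than re-running NLTK) and keeps the tokens whose Penn Treebank tag starts with `NNP`. The name `proper_nouns` is illustrative only.

```python
# Tagged tokens copied from the POS output above (no NLTK needed to run this).
pos_tags = [
    ("Google", "NNP"), ("acquired", "VBD"), ("DeepMind", "NNP"),
    ("in", "IN"), ("London", "NNP"), ("for", "IN"),
    ("$", "$"), ("500", "CD"), ("million", "CD"), (".", "."),
]

# NNP = singular proper noun, NNPS = plural proper noun (Penn Treebank tags).
proper_nouns = [token for token, tag in pos_tags if tag.startswith("NNP")]
print(proper_nouns)  # ['Google', 'DeepMind', 'London']
```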

NER (spaCy):
Google -> ORG
DeepMind -> GPE
London -> GPE
$500 million -> MONEY
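
spaCy labels DeepMind as `GPE` (geopolitical entity) here even though it is a company, so the small model's output is worth inspecting before it is used downstream. One simple way to do that is to group the entity texts by label. The sketch below hard-codes the four entities printed above; with a live `doc`, the same pairs can be built with `[(ent.text, ent.label_) for ent in doc.ents]`. The name `by_label` is hypothetical.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output above.
spacy_ents = [
    ("Google", "ORG"),
    ("DeepMind", "GPE"),
    ("London", "GPE"),
    ("$500 million", "MONEY"),
]

# Group entity texts under their label so odd assignments stand out.
by_label = defaultdict(list)
for ent_text, label in spacy_ents:
    by_label[label].append(ent_text)

for label, texts in by_label.items():
    print(label, "->", texts)
```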


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu



NER (Transformer):
[{'entity_group': 'ORG', 'score': np.float32(0.99947506), 'word': 'Google', 'start': 0, 'end': 6}, {'entity_group': 'ORG', 'score': np.float32(0.9989691), 'word': 'DeepMind', 'start': 16, 'end': 24}, {'entity_group': 'LOC', 'score': np.float32(0.99912757), 'word': 'London', 'start': 28, 'end': 34}]
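
The `start` and `end` fields are character offsets into the input string, and the two models use different label sets: spaCy's `en_core_web_sm` emits OntoNotes-style labels such as `GPE` and `MONEY`, while the BERT checkpoint uses the CoNLL-03 labels (`PER`, `ORG`, `LOC`, `MISC`). The sketch below hard-codes both outputs from this run (scores omitted), checks the offsets by slicing `text`, and lists where the models disagree. The `SPACY_TO_CONLL` mapping is a rough assumption made for this comparison and is not part of either library.

```python
text = "Google acquired DeepMind in London for $500 million."

# Entities copied from the two NER outputs above (scores omitted).
spacy_ents = {
    "Google": "ORG",
    "DeepMind": "GPE",
    "London": "GPE",
    "$500 million": "MONEY",
}
bert_ents = [
    {"entity_group": "ORG", "word": "Google", "start": 0, "end": 6},
    {"entity_group": "ORG", "word": "DeepMind", "start": 16, "end": 24},
    {"entity_group": "LOC", "word": "London", "start": 28, "end": 34},
]

# 1. Slicing the input with start/end should recover each entity string.
offsets_ok = all(text[e["start"]:e["end"]] == e["word"] for e in bert_ents)
print("Offsets match:", offsets_ok)

# 2. Assumed, rough mapping from spaCy labels to CoNLL-03 labels.
SPACY_TO_CONLL = {"ORG": "ORG", "GPE": "LOC", "LOC": "LOC", "PERSON": "PER"}
bert_labels = {e["word"]: e["entity_group"] for e in bert_ents}

disagreements = []  # (entity, spaCy label, BERT label)
spacy_only = []     # entities the BERT output does not contain
for ent_text, spacy_label in spacy_ents.items():
    bert_label = bert_labels.get(ent_text)
    if bert_label is None:
        spacy_only.append(ent_text)
    elif SPACY_TO_CONLL.get(spacy_label) != bert_label:
        disagreements.append((ent_text, spacy_label, bert_label))

print("Disagreements:", disagreements)
print("spaCy only:", spacy_only)
```

On this sentence the only disagreement is DeepMind (`GPE` from spaCy, `ORG` from BERT). CoNLL-03 has no money class, so the dollar amount appears only in the spaCy output.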
