# What are Named Entities?

Named entities are words or phrases that refer to specific persons, organizations, locations, dates, quantities, or other real-world objects or concepts.

In the sentence:

"Apple CEO Tim Cook spoke in New York on Monday"

what are the named entities?

*"Apple" (organization), "Tim Cook" (person), "New York" (location), and "Monday" (time)*

# Techniques for Named Entity Extraction

Can you name some techniques we can use for Named Entity extraction?

Let's build a rule-based Named Entity extraction system. Here's your passage:


> The Ministry of Education announced at 1 PM on 12th of October 2024 that Perera scored the highest in the Advanced Level examination at Royal College. Captain Cook Restaurant near Lake Gardens is owned by Mr. Silva, who used to teach at Trinity. While Elephant House ice cream remains popular, new brands like Happy Cow from Green Farms Ltd. are gaining market share. The Morning News reported that Dialog's CEO will speak at SLIIT next Friday about Smart Living solutions. The company's annual meeting is scheduled for 1300 10/12/2024 at Unity Plaza. Meanwhile, Unity Plaza shops are preparing for the April Season sales, which coincide with both New Year celebrations. Eagle Insurance (now merged with Union Bank) opened their branch opposite Victoria Park, where Peace Pagoda is clearly visible. Speaking of peace, Peace Cola is launching their new drink called Life, competing with Sprite and Sprint.

Now write a function that extracts all
-- persons
-- organisations
-- locations
-- brands
-- dates or time



In [None]:
## your code here
import re

def find_date(text):
    # Regular expression pattern for dates in formats like "dd-mm-yyyy", "dd/mm/yyyy", "yyyy-mm-dd", "yyyy/mm/dd"
    date_pattern1 = r"\b(?:\d{1,2}[/-])(?:\d{1,2}[/-])\d{2,4}\b"
    date_pattern2 = r"\d{1,2}(?:st|nd|rd|th)? of [A-Z][a-z]+ \d{4}"
    # Find all dates that match the pattern
    dates = re.findall(date_pattern1, text)
    dates.extend(re.findall(date_pattern2, text))
    return dates

# Sample text
text = "The Ministry of Education announced at 1 PM on 12th of October 2024 that Perera scored the highest in the Advanced Level examination at Royal College. Captain Cook Restaurant near Lake Gardens is owned by Mr. Silva, who used to teach at Trinity. While Elephant House ice cream remains popular, new brands like Happy Cow from Green Farms Ltd. are gaining market share. The Morning News reported that Dialog's CEO will speak at SLIIT next Friday about Smart Living solutions. The company's annual meeting is scheduled for 1300 10/12/2024 at Unity Plaza. Meanwhile, Unity Plaza shops are preparing for the April Season sales, which coincide with both New Year celebrations. Eagle Insurance (now merged with Union Bank) opened their branch opposite Victoria Park, where Peace Pagoda is clearly visible. Speaking of peace, Peace Cola is launching their new drink called Life, competing with Sprite and Sprint."
print(find_date(text))

['10/12/2024', '12th of October 2024']


In [None]:
def find_name(text):
    # Regular expression pattern for dates in formats like "dd-mm-yyyy", "dd/mm/yyyy", "yyyy-mm-dd", "yyyy/mm/dd"
    name_pattern1 = r"[A-Z][a-z][.]+(?: [A-Z][a-z]+)"
    name_pattern2 = r"[A-Z][a-z]+(?: [A-Z][a-z]+)"
    # Find all dates that match the pattern
    dates = re.findall(name_pattern1, text)
    dates.extend(re.findall(name_pattern2, text))
    return dates

text = "The Ministry of Education announced at 1 PM on 12th of October 2024 that Perera scored the highest in the Advanced Level examination at Royal College. Captain Cook Restaurant near Lake Gardens is owned by Mr. Silva, who used to teach at Trinity. While Elephant House ice cream remains popular, new brands like Happy Cow from Green Farms Ltd. are gaining market share. The Morning News reported that Dialog's CEO will speak at SLIIT next Friday about Smart Living solutions. The company's annual meeting is scheduled for 1300 10/12/2024 at Unity Plaza. Meanwhile, Unity Plaza shops are preparing for the April Season sales, which coincide with both New Year celebrations. Eagle Insurance (now merged with Union Bank) opened their branch opposite Victoria Park, where Peace Pagoda is clearly visible. Speaking of peace, Peace Cola is launching their new drink called Life, competing with Sprite and Sprint."
print(find_name(text))

['Mr. Silva', 'The Ministry', 'Advanced Level', 'Royal College', 'Captain Cook', 'Lake Gardens', 'While Elephant', 'Happy Cow', 'Green Farms', 'The Morning', 'Smart Living', 'Unity Plaza', 'Unity Plaza', 'April Season', 'New Year', 'Eagle Insurance', 'Union Bank', 'Victoria Park', 'Peace Pagoda', 'Peace Cola']


# Named Entity Extraction with BERT

## What is BERT?

Read the paper: https://arxiv.org/pdf/1810.04805

Read this blog to understand transformers: https://jalammar.github.io/illustrated-transformer/

## CoNLL dataset

Read more about it here: https://www.clips.uantwerpen.be/conll2003/ner/

In [None]:
%pip install transformers datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (1

In [None]:
from datasets import load_dataset

dataset = load_dataset("conll2003")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

conll2003.py:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

In [None]:
# Let's inspect the dataset
sentence_0 = dataset["train"][0]

In [None]:
sentence_0

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

Can you identify each of the keys?

Explain:
- id
- tokens
- pos_tags- Each number represents a specific POS category (e.g., noun, verb, adjective) based on a tagging scheme

- chunk_tags-  These tags help identify phrases (chunks) in the sentence, such as noun phrases (NP) or verb phrases (VP)

- ner_tags - Each number indicates whether a token corresponds to a named entity (like a person, organization, location, etc.). For example:
3 might represent an organization (in this case, "EU" and "German")



In [None]:
sentence_0_str = " ".join(dataset["train"][0]['tokens'])
sentence_0_str

'EU rejects German call to boycott British lamb .'

In [None]:
# map pos tag numbers to their labels
pos_tags = dataset["train"].features["pos_tags"].feature.names
chunk_tags = dataset["train"].features["chunk_tags"].feature.names
ner_tags = dataset["train"].features["ner_tags"].feature.names

print(pos_tags)
print(chunk_tags)
print(ner_tags)


['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']
['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP']
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


In [None]:
print("POS Tags for sentence_0")
for i, pos_ in enumerate(sentence_0['pos_tags']):
    print(f"{pos_}  \t- {pos_tags[pos_]} \t- {sentence_0['tokens'][i]}")

POS Tags for sentence_0
22  	- NNP 	- EU
42  	- VBZ 	- rejects
16  	- JJ 	- German
21  	- NN 	- call
35  	- TO 	- to
37  	- VB 	- boycott
16  	- JJ 	- British
21  	- NN 	- lamb
7  	- . 	- .


CC: Coordinating conjunction (e.g., and, or)

CD: Cardinal number (e.g., 1, 42)

DT: Determiner (e.g., the, a)

JJ: Adjective (e.g., big, blue)

NN: Noun, singular (e.g., dog)

NNP: Proper noun, singular (e.g., John)

VBZ: Verb, 3rd person singular present (e.g., runs)

.: Punctuation (e.g., period)

In [None]:
print("Chunk Tags for sentence_0")
for i, chunk_ in enumerate(sentence_0['chunk_tags']):
    print(f"{chunk_} \t- {chunk_tags[chunk_]} \t- {sentence_0['tokens'][i]}")

Chunk Tags for sentence_0
11 	- B-NP 	- EU
21 	- B-VP 	- rejects
11 	- B-NP 	- German
12 	- I-NP 	- call
21 	- B-VP 	- to
22 	- I-VP 	- boycott
11 	- B-NP 	- British
12 	- I-NP 	- lamb
0 	- O 	- .


B-NP: Beginning of a noun phrase

I-NP: Inside a noun phrase

B-VP: Beginning of a verb phrase

O: Outside of any chunk

In [None]:
print("NER Tags for sentence_0")
for i, ner_ in enumerate(sentence_0['ner_tags']):
    print(f"{ner_} \t- {ner_tags[ner_]} \t\t- {sentence_0['tokens'][i]}")

NER Tags for sentence_0
3 	- B-ORG 		- EU
0 	- O 		- rejects
7 	- B-MISC 		- German
0 	- O 		- call
0 	- O 		- to
0 	- O 		- boycott
7 	- B-MISC 		- British
0 	- O 		- lamb
0 	- O 		- .


B-PER: Beginning of a person's name

I-PER: Inside a person's name

B-ORG: Beginning of an organization name

I-ORG: Inside an organization name

B-LOC: Beginning of a location name

I-LOC: Inside a location name

O: Outside of any named entity

In [None]:
# Check the size of the dataset

print(f"Train size: {len(dataset['train'])}")
print(f"Validation size: {len(dataset['validation'])}")
print(f"Test size: {len(dataset['test'])}")

Train size: 14041
Validation size: 3250
Test size: 3453


## BertTokenizer example

In [None]:
from transformers import BertTokenizer

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



In [None]:
# Define the sentence
sentence = "EU rejects German call to boycott British lamb ."

# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)

# Print the tokens
print("Tokens:", tokens)


Tokens: ['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.']


number of words not equal to number of tokens

In [None]:
# Define the sentence
sentence = "She felt unhappiness after the event."

# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)

# Print the tokens
print("Tokens:", tokens)

Tokens: ['she', 'felt', 'un', '##ha', '##pp', '##iness', 'after', 'the', 'event', '.']


use the token to next sentence as well. it helps to adapt  new words
Reducing Vocabulary Size

In [None]:
import random
sample_tokens=random.sample(list(tokenizer.vocab.keys()),100)
print("token sample:",sample_tokens)
print("lengh of tokens:", len(tokenizer.vocab))

token sample: ['##lating', 'busiest', 'classroom', 'yao', 'blues', 'blake', '##bury', 'alright', 'musa', 'author', '##vara', '##門', 'xiang', 'bishop', 'hitch', 'magical', '1765', 'medicinal', 'musicians', 'neighbour', 'iv', 'cabin', '1937', 'button', '##meter', '##act', '##kala', 'shielded', 'concluded', 'fairs', '##ɬ', 'seats', 'neutral', 'send', 'introductory', 'hancock', 'gust', '290', 'surged', 'colonel', 'unsuccessfully', '##uen', 'գ', 'luminous', 'succeed', '188', '##iaceae', '##gent', 'confused', '##ncia', '[unused226]', 'butch', '##ल', '##shin', '875', 'roads', 'authorization', 'jackie', 'basic', 'supervise', 'mosques', '##pan', 'enormous', 'owl', 'weasel', '##chua', 'meet', 'preferences', 'lucie', 'fellows', 'thousand', 'mingled', 'hadley', '##icles', '##sio', 'challenger', 'rosenthal', '##rane', '##wk', 'ivy', '##anor', '1773', '##jured', 'co', '##sque', '##ties', 'z', 'tribe', 'vegetarian', '##₄', 'rudd', 'activities', 'ß', 'orphans', 'thirteen', 'blended', '##sing', 'france

when train the tokenizer. it can preserve the verb+ing without sub-word tokenization

In [None]:
subword_tokens= [token for token in tokenizer.vocab.keys() if token.startswith("##")]
subword_tokens[:5]

['##s', '##a', '##e', '##i', '##ing']

## Dataset preparation for traning

In [None]:
from transformers import AutoTokenizer
import torch
from torch.utils.data import DataLoader, Dataset

class NERDataset(Dataset):
    def __init__(self, tokenized_dataset):
        self.tokenized_dataset = tokenized_dataset

    def __len__(self):
        return len(self.tokenized_dataset)

    def __getitem__(self, idx):
        item = self.tokenized_dataset[idx]
        return {
            'input_ids': torch.tensor(item['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(item['attention_mask'], dtype=torch.long),
            'labels': torch.tensor(item['labels'], dtype=torch.long)
        }

def prepare_data():
    # Load the CoNLL-2003 dataset
    dataset = load_dataset("conll2003")

    # Initialize tokenizer
    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

    # Get label list from dataset
    label_list = dataset["train"].features["ner_tags"].feature.names

    def tokenize_and_align_labels(examples):
        tokenized_inputs = tokenizer(
            examples["tokens"],
            truncation=True,
            is_split_into_words=True,
            padding='max_length',
            max_length=128,
            return_tensors="pt"  # Return PyTorch tensors
        )

        labels = []
        for i, label in enumerate(examples["ner_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                if word_idx is None:
                    label_ids.append(-100)
                elif word_idx != previous_word_idx:
                    label_ids.append(label[word_idx])
                else:
                    label_ids.append(-100)
                previous_word_idx = word_idx
            labels.append(label_ids)

        tokenized_inputs["labels"] = labels
        return tokenized_inputs

    # Tokenize datasets
    tokenized_datasets = dataset.map(
        tokenize_and_align_labels,
        batched=True,
        remove_columns=dataset["train"].column_names
    )

    return tokenized_datasets, len(label_list),label_list


## Implement the model

In [None]:
from transformers import BertForTokenClassification
import torch.nn as nn

class NERModel(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertForTokenClassification.from_pretrained(
            'bert-base-cased',
            num_labels=num_labels
        )

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        return outputs

## Implement the traning loop

In [None]:
def evaluate_model_loss(model, eval_dataloader, device):
    model.eval()
    total_val_loss = 0

    with torch.no_grad():
        for batch in eval_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            total_val_loss += outputs.loss.item()

    avg_val_loss = total_val_loss / len(eval_dataloader)
    return avg_val_loss

In [None]:
## your code here
import torch
from tqdm import tqdm

def train_model(model, train_dataloader, eval_dataloader, optimizer, scheduler, device, id2label,num_epochs=10, patience=3):
    model.to(device)

    best_val_loss = float("inf")
    epochs_no_improve = 0

    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}")
        print("-" * 20)

        # Training phase
        model.train()  # Set model to training mode
        total_train_loss = 0

        for batch in tqdm(train_dataloader, desc="Training"):
            # Move batch data to the device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Clear previously calculated gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss  # Retrieve the loss from the model's output

            # Backward pass (compute gradients)
            loss.backward()

            # Update parameters and learning rate
            optimizer.step()
            scheduler.step()

            # Accumulate training loss
            total_train_loss += loss.item()

        avg_train_loss = total_train_loss / len(train_dataloader)
        print(f"Training loss: {avg_train_loss:.4f}")

        # Validation phase
        val_loss = evaluate_model_loss(model, eval_dataloader, device)
        print(f"Validation loss: {val_loss:.4f}")

        # Early Stopping check
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_no_improve = 0
            # Save the best model
            torch.save(model.state_dict(), "best_model.pt")
        else:
            epochs_no_improve += 1

        if epochs_no_improve >= patience:
            print(f"Early stopping triggered after {epoch + 1} epochs.")
            break

    # Load the best model for final evaluation
    model.load_state_dict(torch.load("best_model.pt"))
    print("Best model loaded.")


## Implement the evaluation function

In [None]:
%pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/43.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... [?25l[?25hdone
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=e326a779dd22397f59b2353594b12bf1e7817a8e201c31bb37c217a3389359dc
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [None]:
from seqeval.metrics import classification_report

def evaluate_model(model, eval_dataloader, device,id2label):
    model.eval()  # Set model to evaluation mode
    true_labels = []
    predicted_labels = []
    total_val_loss = 0

    with torch.no_grad():  # Disable gradient calculation for evaluation
        for batch in eval_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass to get logits
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_val_loss += loss.item()

            logits = outputs.logits  # Raw predictions from the model
            predictions = torch.argmax(logits, dim=-1).cpu().numpy()
            labels = labels.cpu().numpy()

            # Align predictions and labels by filtering out padding tokens
            for i, label_seq in enumerate(labels):
                true_seq = []
                pred_seq = []
                for j, label_id in enumerate(label_seq):
                    if label_id == -100:  # Ignore padding tokens
                        continue
                    true_seq.append(id2label[label_id])
                    pred_seq.append(id2label[predictions[i][j]])
                true_labels.append(true_seq)
                predicted_labels.append(pred_seq)

    # Calculate average validation loss
    avg_val_loss = total_val_loss / len(eval_dataloader)
    print(classification_report(true_labels, predicted_labels))


## Implement the main() function

In [None]:
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup
def main():
    # Device configuration
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")

    # Load and prepare dataset
    print("Loading and preparing dataset...")
    tokenized_datasets, num_labels, label_list = prepare_data()
    id2label = {i: label for i, label in enumerate(label_list)}
    print(id2label)
    # Convert to custom Dataset objects
    train_dataset = NERDataset(tokenized_datasets["train"])
    eval_dataset = NERDataset(tokenized_datasets["validation"])
    test_dataset = NERDataset(tokenized_datasets["test"])

    # Create data loaders
    batch_size = 32  # You can adjust this as needed
    train_dataloader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True
    )
    eval_dataloader = DataLoader(
        eval_dataset,
        batch_size=batch_size
    )
    test_dataloader = DataLoader(
        test_dataset,
        batch_size=batch_size
    )

    # Model initialization
    model = NERModel(num_labels=num_labels)
    model.to(device)

    # Define optimizer with better parameters
    optimizer = AdamW(
        model.parameters(),
        lr=5e-5,
        weight_decay=0.01,
        eps=1e-8
    )

    # Learning rate scheduler
    num_epochs = 3
    num_training_steps = num_epochs * len(train_dataloader)
    num_warmup_steps = num_training_steps // 10
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps
    )

    # Training
    print("Starting training...")
    train_model(model, train_dataloader, eval_dataloader, optimizer, scheduler, device, num_epochs=num_epochs,id2label=id2label)  # Pass id2label here


    # Evaluation
    print("\nEvaluating on validation set...")
    evaluate_model(model, eval_dataloader, device, id2label)

    return model



['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [None]:
model=main()

Using device: cuda
Loading and preparing dataset...




{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training...
Epoch 1/3
--------------------


Training: 100%|██████████| 439/439 [04:35<00:00,  1.59it/s]


Training loss: 0.2433
Validation loss: 0.0443
Epoch 2/3
--------------------


Training: 100%|██████████| 439/439 [04:34<00:00,  1.60it/s]


Training loss: 0.0264
Validation loss: 0.0327
Epoch 3/3
--------------------


Training: 100%|██████████| 439/439 [04:34<00:00,  1.60it/s]


Training loss: 0.0126
Validation loss: 0.0332


  model.load_state_dict(torch.load("best_model.pt"))


Best model loaded.

Evaluating on validation set...
              precision    recall  f1-score   support

         LOC       0.97      0.96      0.96      1837
        MISC       0.88      0.90      0.89       922
         ORG       0.91      0.92      0.92      1341
         PER       0.97      0.98      0.97      1836

   micro avg       0.94      0.95      0.94      5936
   macro avg       0.93      0.94      0.94      5936
weighted avg       0.94      0.95      0.94      5936



In [7]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
model.save_pretrained("/content/drive/MyDrive/idp_bootcamp/week_5/bert_ner_model")  # Save the model
tokenizer.save_pretrained("/content/drive/MyDrive/idp_bootcamp/week_5/bert_ner_model")  # Save the tokenizer


('/content/drive/MyDrive/idp_bootcamp/week_5/bert_ner_model/tokenizer_config.json',
 '/content/drive/MyDrive/idp_bootcamp/week_5/bert_ner_model/special_tokens_map.json',
 '/content/drive/MyDrive/idp_bootcamp/week_5/bert_ner_model/vocab.txt',
 '/content/drive/MyDrive/idp_bootcamp/week_5/bert_ner_model/added_tokens.json',
 '/content/drive/MyDrive/idp_bootcamp/week_5/bert_ner_model/tokenizer.json')

## Implement inference function


load the previously trained model

In [16]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import os

def prepare_inference1(model_path=None):
    """Initialize tokenizer and load model for inference"""
    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

    # Load trained model if path provided
    if model_path:
        # Load model from the given path
        model = AutoModelForTokenClassification.from_pretrained(model_path)
    else:
        model = AutoModelForTokenClassification.from_pretrained('bert-base-cased')

    # Device configuration
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)  # Move the model to the device

    id2label = {
        0: "O",
        1: "B-PER",
        2: "I-PER",
        3: "B-ORG",
        4: "I-ORG",
        5: "B-LOC",
        6: "I-LOC",
        7: "B-MISC",
        8: "I-MISC"
    }

    return tokenizer, id2label, model, device

def inference(text, model, tokenizer, id2label, device):
    """Perform NER inference on input text"""
    print(f"Using device: {device}")

    model.eval()  # Set model to evaluation mode

    # Tokenize the text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, is_split_into_words=False)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move inputs to the specified device

    # Perform inference
    with torch.no_grad():  # Disable gradient calculation for inference
        outputs = model(**inputs)
        logits = outputs.logits  # Extract logits from model outputs

    # Convert predictions to labels
    predictions = torch.argmax(logits, dim=2)  # Get the predicted label ids
    predicted_labels = predictions[0].cpu().numpy()  # Move to CPU and convert to numpy array

    # Align predictions with words
    tokens = tokenizer.tokenize(text)
    labeled_words = []

    # Convert ids to labels
    for token, label_id in zip(tokens, predicted_labels):
        label = id2label[label_id]
        labeled_words.append((token, label))

    return labeled_words

def print_entities(labeled_words):
    """Pretty print the labeled entities"""
    current_entity = None
    entity_text = []

    for word, label in labeled_words:
        if label == "O":
            if current_entity:
                print(f"{current_entity}: {' '.join(entity_text)}")
                current_entity = None
                entity_text = []
        elif label.startswith("B-"):
            if current_entity:
                print(f"{current_entity}: {' '.join(entity_text)}")
            current_entity = label[2:]  # Remove "B-" prefix
            entity_text = [word]
        elif label.startswith("I-"):
            if current_entity == label[2:]:  # If it's the same entity type
                entity_text.append(word)
            else:
                if current_entity:
                    print(f"{current_entity}: {' '.join(entity_text)}")
                current_entity = label[2:]
                entity_text = [word]

    if current_entity:  # Print last entity if exists
        print(f"{current_entity}: {' '.join(entity_text)}")



used the previously trained model

In [24]:
# First initialize with the path to your trained model
tokenizer, id2label, model, device = prepare_inference1("/content/drive/MyDrive/idp_bootcamp/week_5/bert_ner_model")  # Ensure the path points to the correct model directory

# Example texts to analyze
texts = [
    "John Smith works at Microsoft in Seattle and visited New York last summer.",
    "The European Union signed a trade deal with Japan in Brussels.",
    "Tesla CEO Elon Musk announced new features coming to their vehicles."
]

# Process each text
for text in texts:
    print("\nText:", text)
    print("Entities found:")
    results = inference(text, model, tokenizer, id2label, device)  # Pass device to inference
    print_entities(results)



Text: John Smith works at Microsoft in Seattle and visited New York last summer.
Entities found:
Using device: cpu
PER: New
PER: last
PER: .

Text: The European Union signed a trade deal with Japan in Brussels.
Entities found:
Using device: cpu
PER: Union
PER: in

Text: Tesla CEO Elon Musk announced new features coming to their vehicles.
Entities found:
Using device: cpu
PER: CEO
PER: Mu
PER: coming
PER: to


directly used the predefined

In [23]:


# First initialize with the path to your trained model
tokenizer, id2label, model, device = prepare_inference1()  # Ensure the path points to the correct model directory

# Example texts to analyze
texts1 = [
    "John Smith works at Microsoft in Seattle and visited New York last summer.",
    "The European Union signed a trade deal with Japan in Brussels.",
    "Tesla CEO Elon Musk announced new features coming to their vehicles."
]

# Process each text
for text in texts1:
    print("\nText:", text)
    print("Entities found:")
    results = inference(text, model, tokenizer, id2label, device)  # Pass device to inference
    print_entities(results)


model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Text: John Smith works at Microsoft in Seattle and visited New York last summer.
Entities found:
Using device: cpu
PER: Microsoft
PER: in
PER: Seattle

Text: The European Union signed a trade deal with Japan in Brussels.
Entities found:
Using device: cpu
PER: European
PER: Union
PER: signed
PER: a
PER: trade
PER: deal
PER: with
PER: Japan
PER: Brussels

Text: Tesla CEO Elon Musk announced new features coming to their vehicles.
Entities found:
Using device: cpu
PER: CEO
PER: new
PER: their
