In [2]:
# Installing the 'transformers' library from Hugging Face, which provides pre-trained models for natural language processing (NLP) tasks
# The 'datasets' library allows easy access to a wide range of NLP datasets
# The 'tokenizers' library offers efficient tokenization algorithms, necessary for processing text data
# The 'seqeval' library is used for sequence evaluation, often in tasks like Named Entity Recognition (NER)
!pip install transformers datasets tokenizers seqeval -q

# Installing the 'accelerate' library which helps to streamline the process of training and evaluating machine learning models, especially on multiple GPUs
!pip install accelerate

# Installing 'transformers[torch]', a specific installation of the transformers library with additional support for PyTorch, a popular deep learning framework
!pip install transformers[torch]



In [3]:
import datasets
import numpy as np

Dataset link : https://huggingface.co/datasets/eriktks/conll2003

In [4]:
# The CoNLL-2003 dataset is commonly used for training and evaluating Named Entity Recognition (NER) models
conll2003 = datasets.load_dataset("conll2003")


The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



This JSON object represents a single example from the CoNLL-2003 dataset. Here’s a breakdown of its components:

•	chunk_tags: Integer IDs of the syntactic chunk tag for each token (for example, whether the token belongs to a noun phrase or a verb phrase).

•	id: The unique identifier for this specific example.

•	ner_tags: Integer IDs of the named entity recognition (NER) label for each token, indicating whether the token is part of a person, organization, location, or miscellaneous entity, or of no entity at all.

•	pos_tags: Integer IDs of the part-of-speech (POS) label for each token, indicating the grammatical role of the word (e.g., noun, verb, adjective).

•	tokens: The actual words (tokens) in the sentence.


In [5]:
{
    "chunk_tags": [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0],
    "id": "0",
    "ner_tags": [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "pos_tags": [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7],
    "tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "."]
}


{'chunk_tags': [11,
  12,
  12,
  21,
  13,
  11,
  11,
  21,
  13,
  11,
  12,
  13,
  11,
  21,
  22,
  11,
  12,
  17,
  11,
  21,
  17,
  11,
  12,
  12,
  21,
  22,
  22,
  13,
  11,
  0],
 'id': '0',
 'ner_tags': [0,
  3,
  4,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'pos_tags': [12,
  22,
  22,
  38,
  15,
  22,
  28,
  38,
  15,
  16,
  21,
  35,
  24,
  35,
  37,
  16,
  21,
  15,
  24,
  41,
  15,
  16,
  21,
  21,
  20,
  37,
  40,
  35,
  21,
  7],
 'tokens': ['The',
  'European',
  'Commission',
  'said',
  'on',
  'Thursday',
  'it',
  'disagreed',
  'with',
  'German',
  'advice',
  'to',
  'consumers',
  'to',
  'shun',
  'British',
  'lamb',
  'until',
  'scientists',
  'determine',
  'whether',
  'mad',
  'cow',
  'disease',
  'can',
  'be',
  'transmitted',
  'to',
  'sheep',
  '.']}
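The integer tags in the example are indices into the label names that the dataset exposes through its `features`. A minimal sketch of decoding them by hand, assuming the `ner_tags` label order reported by `conll2003["train"].features` and using only the first ten tokens of the example (the variable names here are illustrative, not part of the dataset API):

```python
# Label names in the order reported by the dataset's `ner_tags` ClassLabel feature.
ner_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# First ten tokens and tags of the example sentence.
tokens = ["The", "European", "Commission", "said", "on", "Thursday",
          "it", "disagreed", "with", "German"]
ner_tags = [0, 3, 4, 0, 0, 0, 0, 0, 0, 7]

# Pair every token with its label name, then keep only tokens inside an entity.
decoded = [(tok, ner_names[tag]) for tok, tag in zip(tokens, ner_tags)]
entities = [(tok, name) for tok, name in decoded if name != 'O']
print(entities)
```

Only "European Commission" (an organization) and "German" (miscellaneous) carry entity labels in this slice; every other token is tagged `O`.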

In [6]:
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [7]:
conll2003["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [8]:
conll2003["train"]

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
})

In [9]:
conll2003["train"].description

'The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on\nfour types of named entities: persons, locations, organizations and names of miscellaneous entities that do\nnot belong to the previous three groups.\n\nThe CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on\na separate line and there is an empty line after each sentence. The first item on each line is a word, the second\na part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags\nand the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only\nif two phrases of the same type immediately follow each other, the first word of the second phrase will have tag\nB-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2\ntagging scheme, whereas the original dataset uses 
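The description explains the `B-`, `I-`, and `O` tag prefixes. As a rough sketch of how such tags group into entities, the helper below assumes IOB2-style tags where every entity starts with `B-`. The function name `iob2_spans` is hypothetical, and the sample tokens are taken from a training sentence shown later in this notebook.

```python
def iob2_spans(tokens, tags):
    """Group IOB2 tags into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open span, then open a new one (tolerates a stray I- tag).
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the currently open span.
            current_tokens.append(token)
        else:
            # "O": close any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Germany", "'s", "representative", "to", "the", "European", "Union",
          "'s", "veterinary", "committee", "Werner", "Zwingmann"]
tags = ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG",
        "O", "O", "O", "B-PER", "I-PER"]
spans = iob2_spans(tokens, tags)
print(spans)
```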

In [10]:
# Importing the fast version of the BERT tokenizer from the transformers library
# This tokenizer will be used to convert text data into token IDs that can be processed by the BERT model
from transformers import BertTokenizerFast

# Importing the data collator specifically designed for token classification tasks
# This collator ensures that the input data is properly formatted and padded for the BERT model
from transformers import DataCollatorForTokenClassification

# Importing the auto class that loads a pre-trained model with a token classification head
# The loaded model will be fine-tuned on the specific token classification task (e.g., NER)
from transformers import AutoModelForTokenClassification

In [11]:
# Specifying the pre-trained BERT checkpoint to be used
# "bert-base-uncased" is a commonly used variant of BERT where all the text is lowercased during pre-processing
model = "bert-base-uncased"  # BERT-base has roughly 110 million parameters (BERT-large has about 340 million)

In [12]:
# Initializing the fast tokenizer for the specified BERT model
tokenizer = BertTokenizerFast.from_pretrained(model)


In [13]:
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [14]:
conll2003["train"]

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
})

In [16]:
conll2003["train"].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None),
 'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}

In [17]:
conll2003["train"].features['ner_tags']

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [18]:
conll2003["train"].features['chunk_tags']

Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None)

In [19]:
conll2003["train"].features['pos_tags']

Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None)

In [20]:
example_text=conll2003['train'][0]

In [21]:
example_text["tokens"]

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

In [22]:
example_text

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [23]:
# Tokenizing the example text using the BERT tokenizer
# The parameter 'is_split_into_words=True' indicates that the input is already split into individual tokens/words
tokenized_id = tokenizer(example_text["tokens"], is_split_into_words=True)

In [24]:
tokenized_id

{'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [25]:
tokenized_id["input_ids"]

[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102]

In [26]:
# Converting the token IDs back to their corresponding tokens (words)
tokens = tokenizer.convert_ids_to_tokens(tokenized_id["input_ids"])

In [27]:
tokens

['[CLS]',
 'eu',
 'rejects',
 'german',
 'call',
 'to',
 'boycott',
 'british',
 'lamb',
 '.',
 '[SEP]']
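To make the relationship between words, tokens, and IDs concrete, here is a toy lookup that reproduces the IDs printed above. It is only a sketch: the vocabulary is a hand-copied subset of the token/ID pairs shown in this notebook, `toy_encode` is a hypothetical helper, and the real tokenizer additionally splits unknown words into WordPiece sub-tokens, which this lookup does not attempt.

```python
# Hand-copied subset of the vocabulary, taken from the outputs above.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102, "eu": 7327, "rejects": 19164, "german": 2446,
    "call": 2655, "to": 2000, "boycott": 17757, "british": 2329,
    "lamb": 12559, ".": 1012,
}


def toy_encode(words):
    """Lowercase each word, look it up, and wrap the sequence in [CLS] ... [SEP]."""
    pieces = ["[CLS]"] + [w.lower() for w in words] + ["[SEP]"]
    return [toy_vocab[p] for p in pieces]


words = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
input_ids = toy_encode(words)
print(input_ids)
```

The two extra IDs at the ends (101 and 102) are why the encoded sequence has eleven entries for nine words.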

In [28]:
example_text['ner_tags']

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [29]:
for i, label in enumerate(example_text["ner_tags"]):
  print(i,label)

0 3
1 0
2 7
3 0
4 0
5 0
6 7
7 0
8 0


In [30]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    # Tokenize the input tokens with truncation and specify that the input is already split into words
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []

    # Iterate over each example and its corresponding NER tags
    for i, label in enumerate(examples["ner_tags"]):
        # word_ids() returns a list that maps each token to the index of the word it came from
        # in the original sentence (None for special tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens like [CLS] and [SEP] map to None.
                # Label them -100 so they are automatically ignored by the loss function.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a new word (the regular case): use that word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-word tokens of the same word: repeat the word's label
                # if label_all_tokens is True, otherwise mask them with -100
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
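The example sentence used in this notebook has no word that is split into several sub-tokens, so the effect of `label_all_tokens` is not visible in its output. The sketch below isolates the alignment loop and runs it on a hand-written `word_ids` list. Both the helper name `align_labels` and the `word_ids` values are hypothetical, chosen so that the second word is split into two sub-tokens.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand per-word labels to per-token labels, using -100 for ignored positions."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)                      # special token
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])     # first sub-token of a word
        else:
            # later sub-token of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous = word_idx
    return aligned


# Hypothetical mapping: [CLS], word 0, word 1 (two sub-tokens), word 2, [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 7, 0]  # B-ORG, B-MISC, O

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)
print(first_only)
```

With `label_all_tokens=True` the second sub-token repeats the `B-MISC` ID (7), so a single entity contributes two `B-` labels; with `False` that position is masked and only the first sub-token is scored.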

In [31]:
conll2003["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [32]:
conll2003["train"][0:1]

{'id': ['0'],
 'tokens': [['EU',
   'rejects',
   'German',
   'call',
   'to',
   'boycott',
   'British',
   'lamb',
   '.']],
 'pos_tags': [[22, 42, 16, 21, 35, 37, 16, 21, 7]],
 'chunk_tags': [[11, 21, 11, 12, 21, 22, 11, 12, 0]],
 'ner_tags': [[3, 0, 7, 0, 0, 0, 7, 0, 0]]}

In [33]:
q=tokenize_and_align_labels(conll2003["train"][0:1])

In [34]:
q

{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]]}

In [35]:
# Print each token along with its label
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]):
    print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
eu______________________________________ 3
rejects_________________________________ 0
german__________________________________ 7
call____________________________________ 0
to______________________________________ 0
boycott_________________________________ 0
british_________________________________ 7
lamb____________________________________ 0
._______________________________________ 0
[SEP]___________________________________ -100


In [36]:
# Apply the function to the dataset using the map method
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)


In [37]:
tokenized_datasets["train"][4]

{'id': '4',
 'tokens': ['Germany',
  "'s",
  'representative',
  'to',
  'the',
  'European',
  'Union',
  "'s",
  'veterinary',
  'committee',
  'Werner',
  'Zwingmann',
  'said',
  'on',
  'Wednesday',
  'consumers',
  'should',
  'buy',
  'sheepmeat',
  'from',
  'countries',
  'other',
  'than',
  'Britain',
  'until',
  'the',
  'scientific',
  'advice',
  'was',
  'clearer',
  '.'],
 'pos_tags': [22,
  27,
  21,
  35,
  12,
  22,
  22,
  27,
  16,
  21,
  22,
  22,
  38,
  15,
  22,
  24,
  20,
  37,
  21,
  15,
  24,
  16,
  15,
  22,
  15,
  12,
  16,
  21,
  38,
  17,
  7],
 'chunk_tags': [11,
  11,
  12,
  13,
  11,
  12,
  12,
  11,
  12,
  12,
  12,
  12,
  21,
  13,
  11,
  12,
  21,
  22,
  11,
  13,
  11,
  1,
  13,
  11,
  17,
  11,
  12,
  12,
  21,
  1,
  0],
 'ner_tags': [5,
  0,
  0,
  0,
  0,
  3,
  4,
  0,
  0,
  0,
  1,
  2,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  5,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'input_ids': [101,
  2762,
  1005,
  1055,


In [None]:
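# Recap of the preprocessing so far:
# tokens   -> the input data
# ner_tags -> the target column
# tokens   -> input_ids (with [CLS] and [SEP] added by the tokenizer)
# targets  -> labels, with -100 at the [CLS] and [SEP] positions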

In [38]:
# Load a pre-trained BERT model for token classification with a specified number of labels
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
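The warning lists only `classifier.weight` and `classifier.bias` as newly initialized, which suggests that the rest of the network comes from the checkpoint. A back-of-the-envelope count of those new parameters, assuming the classification head is a single linear layer from the hidden size (768 for `bert-base-uncased`, as listed in the model's `config.json`) to the 9 labels:

```python
hidden_size = 768  # hidden size of bert-base-uncased
num_labels = 9     # number of NER labels passed to from_pretrained

weight_params = hidden_size * num_labels  # one weight per (hidden unit, label) pair
bias_params = num_labels                  # one bias per label
new_params = weight_params + bias_params
print(new_params)
```

Under that assumption only a few thousand parameters start from random values, which is tiny compared with the pre-trained encoder.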


In [39]:
# TrainingArguments: This class is used to define the training configurations and hyperparameters.
# It includes settings such as output directory, evaluation strategy, learning rate, batch size,
# number of training epochs, weight decay, logging options, and more. These arguments control how
# the training and evaluation loops are executed by the Trainer class.

# Trainer: This class provides an easy-to-use training and evaluation loop for PyTorch models.
# It handles all aspects of training, including loading the data, computing the loss,
# performing backpropagation, and updating the model parameters. The Trainer class takes care
# of all the boilerplate code required to train a model, allowing you to focus on your model and data.
# It can also perform evaluation and prediction tasks and is highly customizable.
from transformers import TrainingArguments, Trainer

https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/trainer#transformers.TrainingArguments

In [40]:
args=TrainingArguments(
    "test-ner",                      # Output directory for storing model predictions and checkpoints
    evaluation_strategy="epoch",     # Evaluation strategy to use at the end of each epoch
    learning_rate=2e-5,              # Learning rate for the optimizer
    per_device_train_batch_size=16,  # Batch size for training per device (e.g., GPU)
    per_device_eval_batch_size=16,   # Batch size for evaluation per device (e.g., GPU)
    num_train_epochs=1,              # Number of training epochs
    weight_decay=0.01                # Strength of weight decay for regularization to prevent overfitting
)



https://github.com/huggingface/datasets/blob/main/metrics/seqeval/seqeval.py

In [41]:
metric = datasets.load_metric("seqeval")

# The `load_metric` function from the `datasets` library is used to load a specific evaluation metric.
# Here, "seqeval" is being loaded, which is a popular metric for sequence labeling tasks such as
# Named Entity Recognition (NER). The `seqeval` metric is used to evaluate the performance of
# token classification models by computing precision, recall, and F1-score for named entity predictions.
# Note: `datasets.load_metric` is deprecated in newer releases of `datasets`; the separate `evaluate`
# library provides `evaluate.load("seqeval")` as its replacement.


The repository for seqeval contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/seqeval.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


In [58]:
data_collator=DataCollatorForTokenClassification(tokenizer)
# DataCollatorForTokenClassification: This class is used to dynamically pad the input sequences to the same length
# during training and evaluation. It ensures that batches of input sequences have the same length by adding padding tokens
# where necessary. This collator is specifically designed for token classification tasks such as Named Entity Recognition (NER).
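The following is a simplified illustration of what dynamic padding means for a token classification batch. It is not the library's implementation. It assumes a pad token ID of 0 (the `pad_token_id` of this model) and pads labels with -100 so that padded positions are ignored by the loss. The helper name `pad_batch` is hypothetical.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [101, 7327, 19164, 102], "labels": [-100, 3, 0, -100]},
    {"input_ids": [101, 2446, 102], "labels": [-100, 7, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Because padding is done per batch, each batch is only as long as its longest sentence rather than a fixed global maximum.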

In [45]:
example=conll2003["train"][0]

In [46]:
conll2003["train"].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None),
 'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}

In [47]:
conll2003["train"].features["ner_tags"].feature.names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [48]:
example

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [49]:
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# label_list: This list defines the possible labels for the Named Entity Recognition (NER) task.
# Each label represents a different entity type or position within an entity.
# 'O' stands for outside any entity, 'B-' prefixes indicate the beginning of an entity,
# and 'I-' prefixes indicate the inside of an entity.
# 'PER' stands for person, 'ORG' for organization, 'LOC' for location, and 'MISC' for miscellaneous entities.

In [50]:
for i in example["ner_tags"]:
  print(i)

3
0
7
0
0
0
7
0
0


In [51]:
['EU','rejects','German','call','to','boycott','British','lamb','.']

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

In [52]:
labels = [label_list[i] for i in example["ner_tags"]]
labels

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

In [53]:
metric.compute(predictions=[labels], references=[labels])

# metric.compute: This function calculates the evaluation metrics using the 'seqeval' metric.
# predictions: This should be a list of predicted labels for the tokens in the dataset.
# references: This should be a list of the true labels for the tokens in the dataset.
# Here, both predictions and references are set to [labels] for demonstration purposes,
# which means it computes the metrics with the same list for both predictions and references.
# In a real scenario, predictions would be the output of your model, and references would be the true labels.

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}
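Comparing a sequence with itself always yields perfect scores, so it helps to see a case with one error. The sketch below is a simplified re-implementation of entity-level precision, recall, and F1, not `seqeval` itself: it assumes IOB2 tags, counts a predicted entity as correct only when its type and boundaries match exactly, and ignores stray `I-` tags. The helper names are hypothetical.

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from one IOB2 tag sequence."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not inside:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


def entity_prf(predictions, references):
    """Entity-level precision, recall, and F1 over a list of tag sequences."""
    tp = n_pred = n_ref = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_spans(pred), extract_spans(ref)
        tp, n_pred, n_ref = tp + len(p & r), n_pred + len(p), n_ref + len(r)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


reference = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
prediction = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O']  # misses "British"
precision, recall, f1 = entity_prf([prediction], [reference])
print(precision, recall, f1)
```

Here both predicted entities are correct, but one of the three reference entities is missed, so precision stays at 1.0 while recall drops to two thirds.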

In [54]:
# Reference example of seqeval usage, kept as comments so the cell is not executed:
# predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
# references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
# seqeval = datasets.load_metric("seqeval")
# results = seqeval.compute(predictions=predictions, references=references)
# print(list(results.keys()))
# Output: ['MISC', 'PER', 'overall_precision', 'overall_recall', 'overall_f1', 'overall_accuracy']
# print(results["overall_f1"])

In [59]:
def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds

    # Apply argmax over the label dimension to get the most likely label index for each token.
    # Softmax preserves the ordering of the logits, so it is not needed before argmax.
    pred_ids = np.argmax(pred_logits, axis=2)

    # Convert predicted indices to label names, skipping positions whose label is -100
    # (special tokens like [CLS] and [SEP], and padding)
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    # Convert the true label indices to label names, skipping the same -100 positions
    true_labels = [
        [label_list[l] for l in label if l != -100]
        for label in labels
    ]

    # Compute the evaluation metric (e.g., precision, recall, F1-score, accuracy)
    results = metric.compute(predictions=predictions, references=true_labels)

    # Return the computed metrics
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
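The two key steps of `compute_metrics`, taking the argmax over the label dimension and dropping positions labelled -100, can be traced on a tiny hand-made input. This sketch uses plain Python lists instead of NumPy arrays, and the logits are invented for illustration; it also checks that applying softmax first would not change the selected label.

```python
import math

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def argmax(row):
    """Index of the largest value in a list (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    exps = [math.exp(x - max(row)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for one sentence: [CLS], two word tokens, [SEP]; 9 scores per token.
logits = [
    [0.1] * 9,
    [0.0, 0.2, 0.1, 2.5, 0.3, 0.0, 0.1, 0.4, 0.0],  # largest score at index 3 (B-ORG)
    [3.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.2, 0.1],  # largest score at index 0 (O)
    [0.1] * 9,
]
labels = [-100, 3, 0, -100]

pred_ids = [argmax(row) for row in logits]
predictions = [label_list[p] for p, l in zip(pred_ids, labels) if l != -100]
true_labels = [label_list[l] for l in labels if l != -100]
same_after_softmax = all(argmax(softmax(row)) == argmax(row) for row in logits)
print(predictions, true_labels, same_after_softmax)
```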

In [56]:
trainer=Trainer(
    model,                             # The pre-trained model to be fine-tuned
    args,                              # Training arguments defined above
    train_dataset=tokenized_datasets["train"],  # The training dataset
    eval_dataset=tokenized_datasets["validation"], # The evaluation dataset
    data_collator=data_collator,       # Data collator that dynamically pads the inputs received
    tokenizer=tokenizer,               # The tokenizer used for preprocessing the input text
    compute_metrics=compute_metrics    # Function to compute metrics during evaluation
)

In [57]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.2167,0.069599,0.910811,0.926502,0.918589,0.981334


TrainOutput(global_step=878, training_loss=0.15929987436003457, metrics={'train_runtime': 163.8204, 'train_samples_per_second': 85.71, 'train_steps_per_second': 5.36, 'total_flos': 341387954498718.0, 'train_loss': 0.15929987436003457, 'epoch': 1.0})
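The numbers in `TrainOutput` can be sanity-checked against the training configuration. The arithmetic below assumes a single device, no gradient accumulation, and that the final partial batch is kept; under those assumptions it reproduces the reported `global_step` and throughput figures.

```python
import math

num_train_examples = 14041        # rows in the training split
per_device_train_batch_size = 16  # from TrainingArguments
num_train_epochs = 1
train_runtime = 163.8204          # seconds, from the reported metrics

# One optimizer step per batch; the last batch holds the remaining examples.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

samples_per_second = round(num_train_examples * num_train_epochs / train_runtime, 2)
steps_per_second = round(total_steps / train_runtime, 2)
print(total_steps, samples_per_second, steps_per_second)
```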

In [60]:
model.save_pretrained("ner_model")
# save_pretrained: This method saves the model's configuration and weights to the specified directory ("ner_model").
# It does not save the tokenizer files; the tokenizer is saved separately.

In [61]:
tokenizer.save_pretrained("tokenizer")
# save_pretrained: This method saves the tokenizer's configuration and vocabulary files to the specified directory.
# "tokenizer": The directory where the tokenizer files will be saved.
# This allows you to load the tokenizer later for preprocessing text data when using the saved model.

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [63]:
import json
# json: This module provides functions to work with JSON (JavaScript Object Notation) data.
# It allows you to parse JSON strings into Python objects and convert Python objects into JSON strings.
# JSON is a common format for data exchange, especially in web applications.

In [64]:
with open("/content/ner_model/config.json") as f:
    config = json.load(f)

# json.load: This function parses a JSON file and returns the corresponding Python dictionary.
# open("/content/ner_model/config.json"): Opens the specified JSON file in read mode;
# the `with` block closes the file once it has been read.
# config: This variable will contain the contents of the JSON file as a Python dictionary.
# This is typically used to load the model configuration file saved during the save_pretrained process.

In [65]:
config

{'_name_or_path': 'bert-base-uncased',
 'architectures': ['BertForTokenClassification'],
 'attention_probs_dropout_prob': 0.1,
 'classifier_dropout': None,
 'gradient_checkpointing': False,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'id2label': {'0': 'LABEL_0',
  '1': 'LABEL_1',
  '2': 'LABEL_2',
  '3': 'LABEL_3',
  '4': 'LABEL_4',
  '5': 'LABEL_5',
  '6': 'LABEL_6',
  '7': 'LABEL_7',
  '8': 'LABEL_8'},
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'label2id': {'LABEL_0': 0,
  'LABEL_1': 1,
  'LABEL_2': 2,
  'LABEL_3': 3,
  'LABEL_4': 4,
  'LABEL_5': 5,
  'LABEL_6': 6,
  'LABEL_7': 7,
  'LABEL_8': 8},
 'layer_norm_eps': 1e-12,
 'max_position_embeddings': 512,
 'model_type': 'bert',
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'pad_token_id': 0,
 'position_embedding_type': 'absolute',
 'torch_dtype': 'float32',
 'transformers_version': '4.41.2',
 'type_vocab_size': 2,
 'use_cache': True,
 'vocab_size': 30522}

In [66]:
conll2003["train"].features["ner_tags"].feature.names

# conll2003["train"]: Accesses the training split of the CoNLL-2003 dataset.
# .features["ner_tags"]: Accesses the features of the dataset, specifically the 'ner_tags' feature.
# .feature.names: Retrieves the names of the labels for the 'ner_tags' feature.
# This command returns a list of label names used for the Named Entity Recognition (NER) task in the CoNLL-2003 dataset.

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [67]:
id2label = {
    str(i): label for i,label in enumerate(label_list)
}
# id2label: This dictionary maps label indices to their corresponding label names.
# str(i): The index of the label is converted to a string to be used as a key in the dictionary.
# label: The label name corresponding to the index.
# enumerate(label_list): This function returns pairs of index and label for each label in label_list.
# The resulting dictionary maps each index (as a string) to its label name, which is useful for interpreting model outputs.

In [68]:
id2label

{'0': 'O',
 '1': 'B-PER',
 '2': 'I-PER',
 '3': 'B-ORG',
 '4': 'I-ORG',
 '5': 'B-LOC',
 '6': 'I-LOC',
 '7': 'B-MISC',
 '8': 'I-MISC'}

In [69]:
label2id = {
    label: str(i) for i, label in enumerate(label_list)
}

# label2id: This dictionary maps label names to their corresponding label indices as strings.
# label: The label name to be used as a key in the dictionary.
# str(i): The index of the label converted to a string to be used as a value in the dictionary.
# enumerate(label_list): This function returns pairs of index and label for each label in label_list.
# The resulting dictionary maps each label name to its index (as a string), which is useful for converting label names to indices.
# Note: the config written by save_pretrained stores label2id values as integers (e.g. 'LABEL_0': 0),
# so using `label: i` without str() would match that convention more closely.
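One reason the two mappings look asymmetric in `config.json` is that JSON object keys are always strings, while JSON values keep their type. A small sketch of that round trip, using integer indices on both sides (a variation on the cells above, not the notebook's exact code):

```python
import json

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

id2label = {i: label for i, label in enumerate(label_list)}  # integer keys
label2id = {label: i for i, label in enumerate(label_list)}  # integer values

# Serialize to JSON text and parse it back, as saving and reloading a config would.
restored = json.loads(json.dumps({"id2label": id2label, "label2id": label2id}))

keys_are_strings = all(isinstance(k, str) for k in restored["id2label"])
values_are_ints = all(isinstance(v, int) for v in restored["label2id"].values())
print(keys_are_strings, values_are_ints)
```

After the round trip the `id2label` keys have become strings such as `"3"`, whereas the `label2id` values are still integers.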

In [70]:
label2id

{'O': '0',
 'B-PER': '1',
 'I-PER': '2',
 'B-ORG': '3',
 'I-ORG': '4',
 'B-LOC': '5',
 'I-LOC': '6',
 'B-MISC': '7',
 'I-MISC': '8'}

In [71]:
config["id2label"] = id2label

# config["id2label"]: Accesses or creates the 'id2label' key in the config dictionary.
# = id2label: Assigns the id2label dictionary to the 'id2label' key in the config dictionary.
# This updates the configuration dictionary to include a mapping from label indices to label names,
# which can be useful for interpreting model outputs and ensuring consistency in label representation.

In [72]:
config["label2id"] = label2id

# config["label2id"]: Accesses or creates the 'label2id' key in the config dictionary.
# = label2id: Assigns the label2id dictionary to the 'label2id' key in the config dictionary.
# This updates the configuration dictionary to include a mapping from label names to label indices,
# which can be useful for converting label names to indices, ensuring consistency in label representation,
# and facilitating model training and evaluation.

In [73]:
with open("/content/ner_model/config.json", "w") as f:
    json.dump(config, f)
# json.dump: This function serializes the config dictionary as a JSON formatted stream to a file.
# config: The dictionary to be serialized and saved as JSON.
# open("/content/ner_model/config.json", "w"): Opens the specified file in write mode. If the file does not exist, it will be created.
# The `with` block closes the file afterwards, which ensures the contents are flushed to disk.
# This command updates the 'config.json' file with the new contents of the config dictionary, including the 'id2label' and 'label2id' mappings.

In [74]:
with open("/content/ner_model/config.json") as f:
    config = json.load(f)
config
# json.load: This function parses the JSON file and returns the corresponding Python dictionary.
# open("/content/ner_model/config.json"): Opens the specified JSON file in read mode; the 'with' block closes it afterwards.
# config = ...: Stores the contents of the JSON file as a Python dictionary.
# The trailing 'config' expression displays the loaded dictionary, showing the current configuration,
# including the 'id2label' and 'label2id' mappings that were previously added and saved.

{'_name_or_path': 'bert-base-uncased',
 'architectures': ['BertForTokenClassification'],
 'attention_probs_dropout_prob': 0.1,
 'classifier_dropout': None,
 'gradient_checkpointing': False,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'id2label': {'0': 'O',
  '1': 'B-PER',
  '2': 'I-PER',
  '3': 'B-ORG',
  '4': 'I-ORG',
  '5': 'B-LOC',
  '6': 'I-LOC',
  '7': 'B-MISC',
  '8': 'I-MISC'},
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'label2id': {'O': '0',
  'B-PER': '1',
  'I-PER': '2',
  'B-ORG': '3',
  'I-ORG': '4',
  'B-LOC': '5',
  'I-LOC': '6',
  'B-MISC': '7',
  'I-MISC': '8'},
 'layer_norm_eps': 1e-12,
 'max_position_embeddings': 512,
 'model_type': 'bert',
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'pad_token_id': 0,
 'position_embedding_type': 'absolute',
 'torch_dtype': 'float32',
 'transformers_version': '4.41.2',
 'type_vocab_size': 2,
 'use_cache': True,
 'vocab_size': 30522}
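
The `id2label` keys in the reloaded config are strings. That would be the case even if they had been written as integers, because JSON object keys must be strings. A small sketch of this behaviour, using a hypothetical integer-keyed mapping (`id2label_int`) that does not appear in the notebook:

```python
import json

# Hypothetical mapping with integer keys, unlike the notebook's string keys.
id2label_int = {0: "O", 1: "B-PER", 2: "I-PER"}

# json.dumps converts the integer keys to strings, because JSON object
# keys must be strings.
serialized = json.dumps(id2label_int)
reloaded = json.loads(serialized)

print(reloaded)
print(list(reloaded.keys()))

# Converting the keys back restores the original integer-keyed mapping.
restored = {int(idx): label for idx, label in reloaded.items()}
print(restored == id2label_int)
```

Any code that looks up labels in a dictionary loaded straight from `config.json` therefore needs either string keys or an explicit conversion like the one in `restored`.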

In [None]:
model_fine_tuned

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

# Transformer Pipeline

In [75]:
from transformers import pipeline

# pipeline: This function provides a high-level API for using pre-trained models for various tasks.
# It simplifies the process of using models for tasks such as text classification, token classification,
# question answering, translation, and more. By specifying the task and the model, the pipeline function
# sets up the necessary components and allows for easy inference and evaluation.

In [76]:
model = "bert-base-uncased"

In [78]:
tokenizer = BertTokenizerFast.from_pretrained(model)

# BertTokenizerFast.from_pretrained: This method loads a pre-trained tokenizer from the specified model.
# model: The pre-trained model whose tokenizer is being loaded.
# tokenizer: The variable that stores the loaded tokenizer. This tokenizer is used for preprocessing input text,
# converting it into token IDs, and ensuring compatibility with the specified pre-trained model.

In [None]:
#tokenizer = BertTokenizerFast.from_pretrained("/content/tokenizer")

In [79]:
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("/content/ner_model")

# AutoModelForTokenClassification.from_pretrained: This method loads a pre-trained model for token classification from the specified directory.
# "/content/ner_model": The directory where the fine-tuned model is stored.
# model_fine_tuned: The variable that stores the loaded fine-tuned model. This model is now ready for inference or further training on token classification tasks.

In [80]:
nlp_pipeline = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)

# pipeline: This function sets up a high-level API for using the specified pre-trained model and tokenizer.
# "ner": Specifies that the task is Named Entity Recognition (NER).
# model=model_fine_tuned: The fine-tuned NER model to be used for predictions.
# tokenizer=tokenizer: The tokenizer used to preprocess input text.
# nlp_pipeline: The variable that stores the configured pipeline. This pipeline can now be used to perform NER on input texts,
# leveraging the fine-tuned model and the tokenizer.

In [81]:
example="Sunny is Data Scientist and Generative AI Engineer"

In [82]:
nlp_pipeline(example)

[{'entity': 'B-PER',
  'score': 0.9861881,
  'index': 1,
  'word': 'sunny',
  'start': 0,
  'end': 5}]

In [83]:
example2="apple launch mobile while eating apple which taste like orange"

In [84]:
nlp_pipeline(example2)

[{'entity': 'B-ORG',
  'score': 0.797234,
  'index': 1,
  'word': 'apple',
  'start': 0,
  'end': 5},
 {'entity': 'B-MISC',
  'score': 0.48541978,
  'index': 6,
  'word': 'apple',
  'start': 33,
  'end': 38}]
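
The second "apple" is tagged B-MISC with a score below 0.5, so the model is far less certain about it than about the first one. One possible way to handle such cases is to drop predictions below a confidence threshold. The sketch below works on the output copied from the cell above. The helper `filter_by_score` and the threshold of 0.6 are illustrative assumptions, not part of the notebook or of the `transformers` API.

```python
# Output copied from the cell above.
predictions = [
    {"entity": "B-ORG", "score": 0.797234, "index": 1,
     "word": "apple", "start": 0, "end": 5},
    {"entity": "B-MISC", "score": 0.48541978, "index": 6,
     "word": "apple", "start": 33, "end": 38},
]


def filter_by_score(entities, min_score=0.6):
    """Keep only predictions whose score is at least min_score."""
    return [e for e in entities if e["score"] >= min_score]


confident = filter_by_score(predictions)
print([(e["word"], e["entity"]) for e in confident])
```

A suitable threshold depends on the application and would normally be tuned on held-out data rather than fixed by hand.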

In [85]:
[['EU',
   'rejects',
   'German',
   'call',
   'to',
   'boycott',
   'British',
   'lamb',
   '.']]


[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']]

In [86]:
nlp_pipeline([['EU',
   'rejects',
   'German',
   'call',
   'to',
   'boycott',
   'British',
   'lamb',
   '.']]
)

[[{'entity': 'B-MISC',
   'score': 0.9752789,
   'index': 1,
   'word': 'british',
   'start': 0,
   'end': 7}]]
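
In the output above, 'british' is reported at characters 0 to 7, which does not match its position in the full sentence. This suggests the word list was not processed as one sentence. A possible alternative, sketched below, is to join the pre-tokenized words into a single string before calling the pipeline and to keep each word's character span, so that predicted offsets can be mapped back to word positions. The names `word_spans` and `word_index_for` are hypothetical helpers introduced for illustration.

```python
words = ["EU", "rejects", "German", "call", "to",
         "boycott", "British", "lamb", "."]

# Join the pre-tokenized words into one raw sentence string.
sentence = " ".join(words)

# Record the (start, end) character span of each word in the joined string.
word_spans = []
cursor = 0
for word in words:
    word_spans.append((cursor, cursor + len(word)))
    cursor += len(word) + 1  # +1 for the joining space


def word_index_for(start, spans):
    """Return the index of the word whose span contains the offset `start`."""
    for i, (begin, end) in enumerate(spans):
        if begin <= start < end:
            return i
    return None


print(sentence)
print(word_spans[6], sentence[34:41])
```

The joined `sentence` can then be passed to `nlp_pipeline` as a single string, and the `start` value of each returned entity can be looked up with `word_index_for`.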

In [87]:
example="my name is suuny savita"

In [88]:
nlp_pipeline(example)

[{'entity': 'B-PER',
  'score': 0.9804453,
  'index': 4,
  'word': 'su',
  'start': 11,
  'end': 13},
 {'entity': 'B-PER',
  'score': 0.9513342,
  'index': 5,
  'word': '##un',
  'start': 13,
  'end': 15},
 {'entity': 'B-PER',
  'score': 0.98147476,
  'index': 6,
  'word': '##y',
  'start': 15,
  'end': 16},
 {'entity': 'I-PER',
  'score': 0.9865589,
  'index': 7,
  'word': 'sa',
  'start': 17,
  'end': 19},
 {'entity': 'I-PER',
  'score': 0.9805064,
  'index': 8,
  'word': '##vita',
  'start': 19,
  'end': 23}]
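
The name "suuny savita" comes back as five WordPiece fragments ('su', '##un', '##y', 'sa', '##vita') instead of two words. The sketch below shows one way to merge the '##' continuation pieces back into whole words, using the output copied from the cell above. `merge_word_pieces` is a hypothetical helper, and keeping the first piece's label and the lowest score is an assumed design choice. The `pipeline` function also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens inside the library itself.

```python
# Output copied from the cell above.
predictions = [
    {"entity": "B-PER", "score": 0.9804453, "index": 4, "word": "su", "start": 11, "end": 13},
    {"entity": "B-PER", "score": 0.9513342, "index": 5, "word": "##un", "start": 13, "end": 15},
    {"entity": "B-PER", "score": 0.98147476, "index": 6, "word": "##y", "start": 15, "end": 16},
    {"entity": "I-PER", "score": 0.9865589, "index": 7, "word": "sa", "start": 17, "end": 19},
    {"entity": "I-PER", "score": 0.9805064, "index": 8, "word": "##vita", "start": 19, "end": 23},
]


def merge_word_pieces(entities):
    """Merge '##' continuation pieces into the directly preceding piece.

    The merged word keeps the label of its first piece and the lowest
    score among its pieces (a conservative, assumed choice).
    """
    merged = []
    for ent in entities:
        is_continuation = (
            ent["word"].startswith("##")
            and merged
            and merged[-1]["end"] == ent["start"]  # pieces must be adjacent
        )
        if is_continuation:
            prev = merged[-1]
            prev["word"] += ent["word"][2:]
            prev["end"] = ent["end"]
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append({k: ent[k] for k in ("entity", "score", "word", "start", "end")})
    return merged


words = merge_word_pieces(predictions)
print([(w["word"], w["entity"], w["start"], w["end"]) for w in words])
```

After merging, the two words carry the labels B-PER and I-PER, which together form a single person entity under the BIO scheme.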

In [89]:
example2="apple launch mobile while"

In [90]:
nlp_pipeline(example2)

[{'entity': 'B-ORG',
  'score': 0.91030735,
  'index': 1,
  'word': 'apple',
  'start': 0,
  'end': 5}]

In [91]:
example="apple founder loves eating apple"

In [92]:
nlp_pipeline(example)

[{'entity': 'B-ORG',
  'score': 0.87643075,
  'index': 1,
  'word': 'apple',
  'start': 0,
  'end': 5}]

In [93]:
example="Microsoft Windows created their software by idea that came from the window of the house"

In [94]:
nlp_pipeline(example)

[{'entity': 'B-ORG',
  'score': 0.95652646,
  'index': 1,
  'word': 'microsoft',
  'start': 0,
  'end': 9},
 {'entity': 'I-ORG',
  'score': 0.9382114,
  'index': 2,
  'word': 'windows',
  'start': 10,
  'end': 17}]

In [95]:
example= "sunny is a founder of facebook and microsoft"

In [96]:
nlp_pipeline(example)

[{'entity': 'B-PER',
  'score': 0.9837525,
  'index': 1,
  'word': 'sunny',
  'start': 0,
  'end': 5},
 {'entity': 'B-ORG',
  'score': 0.89379334,
  'index': 6,
  'word': 'facebook',
  'start': 22,
  'end': 30},
 {'entity': 'B-ORG',
  'score': 0.8724207,
  'index': 8,
  'word': 'microsoft',
  'start': 35,
  'end': 44}]

In [97]:
example= "engineer are very talented they can do anything"

In [98]:
nlp_pipeline(example)

[]