- We will start from pre-trained models from Hugging Face, which reduces the computational cost compared with training from scratch.
- https://huggingface.co/

In [1]:
!pip install transformers datasets tokenizers seqeval -q

In [2]:
import datasets
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification



- Hugging Face Datasets: https://huggingface.co/docs/datasets/quickstart
- Hugging Face provides audio, vision, and NLP datasets.

In [3]:
# Let's check how many datasets are available on Hugging Face

from datasets import list_datasets
datasets_list = list_datasets()

num_datasets = len(datasets_list)
print("Number of datasets available on Hugging Face:", num_datasets)

Number of datasets available on Hugging Face: 115130


- We are going to do an NER task with the "conll2003" dataset, which has 4 types of entities: persons, locations, organizations, and miscellaneous.
- https://huggingface.co/datasets/conll2003

In [4]:
coll_dataset = datasets.load_dataset("conll2003")

Downloading builder script:   0%|          | 0.00/2.58k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

Downloading and preparing dataset conll2003/conll2003 (download: 959.94 KiB, generated: 9.78 MiB, post-processed: Unknown size, total: 10.72 MiB) to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee...


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14042 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3251 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3454 [00:00<?, ? examples/s]

Dataset conll2003 downloaded and prepared to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
coll_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3251
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3454
    })
})

In [6]:
coll_dataset["train"]

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14042
})

In [7]:
coll_dataset["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

POS tagging: Part-of-speech (POS) tagging is the natural language processing (NLP) task of labelling each word in a sentence with its part of speech, such as noun, verb, or adjective. The goal is to assign the correct grammatical category to every word in a text automatically.

NER: Named Entity Recognition (NER) is an NLP subtask that identifies named entities in a text and classifies them into predefined categories such as persons, organizations, locations, dates, and numerical expressions.

In [8]:
# extract ner tags from features
coll_dataset["train"].features['ner_tags']

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

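There are 9 classes available in ner_tags: 'O' for tokens outside any entity, plus a B- (beginning) and an I- (inside) tag for each of the 4 entity types.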
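To make the integer tags easier to read, the sketch below maps them back to their names and groups B-/I- tags into entity spans, using the first training sentence shown earlier. `decode_entities` is a hypothetical helper written for this illustration (it is not part of `datasets`), and it assumes the IOB2 scheme mentioned in the dataset description.

```python
# Label names in the same order as the ClassLabel feature printed above.
NER_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def decode_entities(tokens, tag_ids, names=NER_NAMES):
    """Group IOB2 tags into (entity_type, text) pairs (hypothetical helper)."""
    entities = []
    current_type, current_words = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = names[tag_id]
        if tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_words.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
        if tag == "O":
            current_type, current_words = None, []
        else:
            # "B-X" (or a stray "I-X") starts a new entity of type X.
            current_type, current_words = tag[2:], [token]
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
print(decode_entities(tokens, ner_tags))
```

For this sentence the helper returns one ORG entity ("EU") and two MISC entities ("German" and "British").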

In [9]:
# check out the description of the data
coll_dataset['train'].description

'The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on\nfour types of named entities: persons, locations, organizations and names of miscellaneous entities that do\nnot belong to the previous three groups.\n\nThe CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on\na separate line and there is an empty line after each sentence. The first item on each line is a word, the second\na part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags\nand the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only\nif two phrases of the same type immediately follow each other, the first word of the second phrase will have tag\nB-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2\ntagging scheme, whereas the original dataset uses 

#### The Pipeline (collection of components): we will use Hugging Face libraries to build all of these components
- Data Ingestion
- Data Preprocessing
- Model Training
- Model Evaluation

In the model training part, we will use the bert-base-uncased model, which lowercases the text, then fine-tune the model and save it.

In [10]:
# Import tokenizer

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [11]:
# We will tokenize the training data.

# Let's first try the tokenizer on a single example
example_text = coll_dataset['train'][405]

In [12]:
example_text

{'id': '405',
 'tokens': ['Kenny',
  'Dalglish',
  'spoke',
  'on',
  'Thursday',
  'of',
  'his',
  'sadness',
  'at',
  'leaving',
  'Blackburn',
  ',',
  'the',
  'club',
  'he',
  'led',
  'to',
  'the',
  'English',
  'premier',
  'league',
  'title',
  'in',
  '1994-95',
  '.'],
 'pos_tags': [22,
  22,
  38,
  15,
  22,
  15,
  29,
  21,
  15,
  39,
  22,
  6,
  12,
  21,
  28,
  38,
  35,
  12,
  16,
  16,
  21,
  21,
  15,
  11,
  7],
 'chunk_tags': [11,
  12,
  21,
  13,
  11,
  13,
  11,
  12,
  13,
  21,
  11,
  0,
  11,
  12,
  11,
  21,
  13,
  11,
  12,
  12,
  12,
  12,
  13,
  11,
  0],
 'ner_tags': [1,
  2,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  3,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0]}

In [13]:
example_text['tokens']

['Kenny',
 'Dalglish',
 'spoke',
 'on',
 'Thursday',
 'of',
 'his',
 'sadness',
 'at',
 'leaving',
 'Blackburn',
 ',',
 'the',
 'club',
 'he',
 'led',
 'to',
 'the',
 'English',
 'premier',
 'league',
 'title',
 'in',
 '1994-95',
 '.']

In [14]:
tokenized_input = tokenizer(example_text['tokens'], is_split_into_words=True)

In [15]:
tokenized_input

{'input_ids': [101, 8888, 17488, 25394, 4095, 3764, 2006, 9432, 1997, 2010, 12039, 2012, 2975, 13934, 1010, 1996, 2252, 2002, 2419, 2000, 1996, 2394, 4239, 2223, 2516, 1999, 2807, 1011, 5345, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Convert IDs into tokens

In [16]:
tokenized_input['input_ids']

[101,
 8888,
 17488,
 25394,
 4095,
 3764,
 2006,
 9432,
 1997,
 2010,
 12039,
 2012,
 2975,
 13934,
 1010,
 1996,
 2252,
 2002,
 2419,
 2000,
 1996,
 2394,
 4239,
 2223,
 2516,
 1999,
 2807,
 1011,
 5345,
 1012,
 102]

In [17]:
tokens = tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'])
tokens

['[CLS]',
 'kenny',
 'dal',
 '##gli',
 '##sh',
 'spoke',
 'on',
 'thursday',
 'of',
 'his',
 'sadness',
 'at',
 'leaving',
 'blackburn',
 ',',
 'the',
 'club',
 'he',
 'led',
 'to',
 'the',
 'english',
 'premier',
 'league',
 'title',
 'in',
 '1994',
 '-',
 '95',
 '.',
 '[SEP]']

- "[CLS]" and "[SEP]" are special tokens used in the pre-training and fine-tuning of BERT-style Transformer models in natural language processing (NLP).

1. "[CLS]": This token stands for "classification." It is prepended to the input sequence in BERT-based models and serves as a special token to represent the beginning of the sequence. During fine-tuning for specific downstream tasks such as text classification or sentence pair classification, the output representation corresponding to the "[CLS]" token is used as input to a task-specific classifier.

2. "[SEP]": This token stands for "separator." It is used to separate two sentences or sequences in input. It is used both during pre-training, where BERT is trained on tasks like next sentence prediction, and during fine-tuning, where BERT might be used for tasks like question answering where it is important to separate question and context or for tasks like sentence pair classification.
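The framing described above can be sketched in plain Python. The real tokenizer adds these tokens automatically; `frame_for_bert` is a hypothetical helper used only to show where "[CLS]" and "[SEP]" go and how the segment IDs (`token_type_ids`) differ between a single sequence and a pair.

```python
def frame_for_bert(first, second=None):
    """Add [CLS]/[SEP] around one or two token lists (illustrative helper)."""
    tokens = ["[CLS]"] + first + ["[SEP]"]
    # Segment 0 covers [CLS], the first sequence and its [SEP].
    token_type_ids = [0] * len(tokens)
    if second is not None:
        tokens += second + ["[SEP]"]
        # Segment 1 covers the second sequence and the final [SEP].
        token_type_ids += [1] * (len(second) + 1)
    return tokens, token_type_ids


single, single_types = frame_for_bert(["eu", "rejects", "german", "call"])
pair, pair_types = frame_for_bert(
    ["who", "left", "blackburn", "?"],
    ["kenny", "dal", "##gli", "##sh"],
)
print(single, single_types)
print(pair, pair_types)
```

NER in this notebook only feeds single sequences, which is why every `token_type_ids` entry in the tokenizer output above is 0.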

In [18]:
coll_dataset['train']

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14042
})

In [19]:
# get the word ids for example_text: for each token, the index of the word it came from (None for special tokens)
word_ids = tokenized_input.word_ids()

print(word_ids)

[None, 0, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 23, 23, 24, None]


In [20]:
example_text["ner_tags"]

[1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0]
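The two lists above have different lengths (31 tokens versus 25 word-level tags), so the tags have to be spread over the sub-word tokens. The sketch below does this for the single example, using the `word_ids` and `ner_tags` values printed above; `align` is a hypothetical helper that mirrors the alignment idea in plain Python, with no tokenizer required.

```python
word_ids = [None, 0, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
            16, 17, 18, 19, 20, 21, 22, 23, 23, 23, 24, None]
ner_tags = [1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0]


def align(word_ids, tags, label_all_tokens=True):
    """Give every token a label; -100 marks positions to be ignored."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP]) belong to no word.
            aligned.append(-100)
        elif word_id != previous or label_all_tokens:
            # First sub-word of a word, or every sub-word when labelling all tokens.
            aligned.append(tags[word_id])
        else:
            # Later sub-words are masked when label_all_tokens is False.
            aligned.append(-100)
        previous = word_id
    return aligned


all_tokens = align(word_ids, ner_tags, label_all_tokens=True)
first_only = align(word_ids, ner_tags, label_all_tokens=False)
print(all_tokens[:5])   # labels for [CLS], kenny, dal, ##gli, ##sh
print(first_only[:5])
```

With `label_all_tokens=True` the three pieces of "Dalglish" all carry the I-PER id; with `False` only the first piece does and the rest are masked.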

In [21]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    """Tokenize pre-split words and align the NER labels with the sub-word tokens."""
    # Tokenize the word lists; truncation guards against over-long inputs
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []

    for i, label in enumerate(examples["ner_tags"]):
        # word_ids() returns, for each token, the index of the word it came from
        # in the original sentence (None for special tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)

        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] have no word index.
                # Label them -100 so the loss function ignores them.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a new word: use that word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-word tokens of the same word: repeat the word's label
                # if label_all_tokens is True, otherwise mask them with -100
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [22]:
coll_dataset['train'][4:5]

{'id': ['4'],
 'tokens': [['Germany',
   "'s",
   'representative',
   'to',
   'the',
   'European',
   'Union',
   "'s",
   'veterinary',
   'committee',
   'Werner',
   'Zwingmann',
   'said',
   'on',
   'Wednesday',
   'consumers',
   'should',
   'buy',
   'sheepmeat',
   'from',
   'countries',
   'other',
   'than',
   'Britain',
   'until',
   'the',
   'scientific',
   'advice',
   'was',
   'clearer',
   '.']],
 'pos_tags': [[22,
   27,
   21,
   35,
   12,
   22,
   22,
   27,
   16,
   21,
   22,
   22,
   38,
   15,
   22,
   24,
   20,
   37,
   21,
   15,
   24,
   16,
   15,
   22,
   15,
   12,
   16,
   21,
   38,
   17,
   7]],
 'chunk_tags': [[11,
   11,
   12,
   13,
   11,
   12,
   12,
   11,
   12,
   12,
   12,
   12,
   21,
   13,
   11,
   12,
   21,
   22,
   11,
   13,
   11,
   1,
   13,
   11,
   17,
   11,
   12,
   12,
   21,
   1,
   0]],
 'ner_tags': [[5,
   0,
   0,
   0,
   0,
   3,
   4,
   0,
   0,
   0,
   1,
   2,
   0,
   0,
   0,
   0,
   0,


In [23]:
tokenize_and_align_labels(coll_dataset['train'][4:5])

{'input_ids': [[101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, -100]]}

In [24]:
# see how each token is mapped to its tag
q = tokenize_and_align_labels(coll_dataset['train'][4:5])

In [25]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q['input_ids'][0]), q['labels'][0]):
    print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
germany_________________________________ 5
'_______________________________________ 0
s_______________________________________ 0
representative__________________________ 0
to______________________________________ 0
the_____________________________________ 0
european________________________________ 3
union___________________________________ 4
'_______________________________________ 0
s_______________________________________ 0
veterinary______________________________ 0
committee_______________________________ 0
werner__________________________________ 1
z_______________________________________ 2
##wing__________________________________ 2
##mann__________________________________ 2
said____________________________________ 0
on______________________________________ 0
wednesday_______________________________ 0
consumers_______________________________ 0
should__________________________________ 0
buy_____________________________________ 0
sheep___

In [26]:
## Apply the function to the entire dataset
tokenized_datasets = coll_dataset.map(tokenize_and_align_labels, batched=True)

  0%|          | 0/15 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

In [27]:
tokenized_datasets['train']

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 14042
})

In [28]:
tokenized_datasets['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'input_ids': [101,
  7327,
  19164,
  2446,
  2655,
  2000,
  17757,
  2329,
  12559,
  1012,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]}

Each example now has input_ids, token_type_ids, and attention_mask, along with the aligned labels.
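The -100 entries at the start and end of `labels` matter because PyTorch's cross-entropy loss skips targets equal to its `ignore_index`, which defaults to -100. The sketch below reproduces that behaviour in plain Python with made-up logits; `masked_cross_entropy` is an illustrative function, not the one the Trainer actually calls.

```python
import math

IGNORE_INDEX = -100  # same value the notebook assigns to [CLS] and [SEP]


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        # Negative log-softmax of the gold class, computed stably.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


# Toy scores for 4 token positions and 3 classes (made-up numbers).
logits = [[9.0, -3.0, 0.5], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [-4.0, 7.0, 1.0]]
labels = [-100, 0, 1, -100]

loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))
```

Changing the logits at the two masked positions leaves the loss unchanged, which is the effect the -100 labels are meant to have.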

Load the model:
- https://huggingface.co/transformers/v3.0.2/model_doc/auto.html

In [29]:
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased",num_labels=9)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


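The warning above is expected: the classification head (classifier.weight and classifier.bias) is newly initialized, so we start from the pretrained model and fine-tune it on our task.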

In [30]:
from transformers import TrainingArguments, Trainer

In [31]:
args = TrainingArguments(
    "test-ner",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01,
)

In [32]:
metric = datasets.load_metric('seqeval')

Downloading builder script:   0%|          | 0.00/2.47k [00:00<?, ?B/s]

In [33]:
example_text

{'id': '405',
 'tokens': ['Kenny',
  'Dalglish',
  'spoke',
  'on',
  'Thursday',
  'of',
  'his',
  'sadness',
  'at',
  'leaving',
  'Blackburn',
  ',',
  'the',
  'club',
  'he',
  'led',
  'to',
  'the',
  'English',
  'premier',
  'league',
  'title',
  'in',
  '1994-95',
  '.'],
 'pos_tags': [22,
  22,
  38,
  15,
  22,
  15,
  29,
  21,
  15,
  39,
  22,
  6,
  12,
  21,
  28,
  38,
  35,
  12,
  16,
  16,
  21,
  21,
  15,
  11,
  7],
 'chunk_tags': [11,
  12,
  21,
  13,
  11,
  13,
  11,
  12,
  13,
  21,
  11,
  0,
  11,
  12,
  11,
  21,
  13,
  11,
  12,
  12,
  12,
  12,
  13,
  11,
  0],
 'ner_tags': [1,
  2,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  3,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0]}

In [34]:
label_list = coll_dataset['train'].features['ner_tags'].feature.names

label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [35]:
for i in example_text['ner_tags']:
    print(i)

1
2
0
0
0
0
0
0
0
0
3
0
0
0
0
0
0
0
7
0
0
0
0
0
0


In [36]:
# keeping it all inside a list
labels = [label_list[i] for i in example_text['ner_tags']]
labels

['B-PER',
 'I-PER',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-ORG',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-MISC',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [37]:
metric.compute(predictions=[labels], references=[labels])

# for now the predictions and the references are the same labels, so every score is 1.0

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}
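seqeval scores whole entities rather than individual tokens, so a prediction that gets part of a name wrong loses the entire entity. The sketch below is a simplified, pure-Python approximation of that idea (seqeval itself handles more edge cases); `spans` and `entity_scores` are hypothetical helpers and the tag sequences are made up.

```python
def spans(tags):
    """Return the set of (type, start, end) entity spans in an IOB2 tag list."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a trailing entity
        inside = tag.startswith("I-") and tag[2:] == kind
        if not inside:
            if kind is not None:
                found.add((kind, start, i))
            start, kind = (i, tag[2:]) if tag != "O" else (None, None)
    return found


def entity_scores(reference, prediction):
    """Entity-level precision, recall and F1 for one sentence."""
    ref, pred = spans(reference), spans(prediction)
    correct = len(ref & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


reference = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-MISC"]
# The prediction cuts the person name short; every other tag is right.
prediction = ["B-PER", "O", "O", "B-ORG", "O", "B-MISC"]

token_accuracy = sum(r == p for r, p in zip(reference, prediction)) / len(reference)
print(entity_scores(reference, prediction), token_accuracy)
```

Only one of six tokens is wrong here, yet one of three entities is lost, so the entity-level F1 is lower than the token accuracy.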

In [38]:
# define a function to compute the scores

def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds

    # The argmax of the logits equals the argmax of the softmax probabilities,
    # so we don't need to apply the softmax
    pred_ids = np.argmax(pred_logits, axis=2)

    # Drop every position whose label is -100 and map the ids to label names
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    results = metric.compute(predictions=predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
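Two steps in the function above can be checked on a toy input: taking the argmax of the raw logits gives the same class as taking it after a softmax, and positions labelled -100 are dropped before scoring. The sketch below uses plain Python lists instead of NumPy arrays and made-up logits; `softmax` and `argmax` are small helpers written for this illustration.

```python
import math

label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def softmax(scores):
    """Turn a list of logits into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up logits for one sentence: 5 token positions x 9 classes.
logits = [
    [5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # [CLS]
    [0.1, 3.2, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # first name
    [0.0, 1.0, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # last name
    [4.0, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # ordinary word
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # [SEP]
]
labels = [-100, 1, 2, 0, -100]

pred_ids = [argmax(row) for row in logits]
# Softmax is monotonic, so it never changes which class wins.
same_choice = all(argmax(softmax(row)) == argmax(row) for row in logits)

# Keep only positions with a real label, then map ids to names.
predictions = [label_list[p] for p, l in zip(pred_ids, labels) if l != -100]
true_labels = [label_list[l] for l in labels if l != -100]
print(same_choice, predictions, true_labels)
```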

In [39]:
# define the data collator, which pads the inputs and labels to the longest sequence in each batch

data_collator = DataCollatorForTokenClassification(tokenizer)


In [40]:
trainer = Trainer(
   model,
   args,
   train_dataset=tokenized_datasets["train"],
   eval_dataset=tokenized_datasets["validation"],
   data_collator=data_collator,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics
)

In [41]:
# fine-tune the model; the number of epochs can be varied to explore its effect
trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc




| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.067201 | 0.905509 | 0.923034 | 0.914188 | 0.981461 |
| 2 | 0.188300 | 0.056479 | 0.920544 | 0.93836 | 0.929367 | 0.983446 |
| 3 | 0.048700 | 0.057639 | 0.934452 | 0.94731 | 0.940837 | 0.986052 |
| 4 | 0.026600 | 0.059109 | 0.9375 | 0.946415 | 0.941936 | 0.986163 |
| 5 | 0.016700 | 0.064486 | 0.932863 | 0.946638 | 0.9397 | 0.985734 |
| 6 | 0.012800 | 0.063506 | 0.940955 | 0.952008 | 0.946449 | 0.987021 |
| 7 | 0.008500 | 0.065214 | 0.938514 | 0.951113 | 0.944772 | 0.986783 |




TrainOutput(global_step=3073, training_loss=0.04927719170830749, metrics={'train_runtime': 947.8195, 'train_samples_per_second': 103.705, 'train_steps_per_second': 3.242, 'total_flos': 2642738912648208.0, 'train_loss': 0.04927719170830749, 'epoch': 7.0})
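The reported `global_step=3073` can be sanity-checked with a little arithmetic. The sketch below assumes no gradient accumulation and treats the number of devices as an unknown; the notebook does not say how many GPUs were used, so the two-device case is only an inference from the numbers.

```python
import math


def total_steps(num_examples, per_device_batch, num_devices, epochs):
    """Optimizer steps for a run, assuming gradient accumulation of 1."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    return steps_per_epoch * epochs


one_device = total_steps(14042, 16, 1, 7)
two_devices = total_steps(14042, 16, 2, 7)
print(one_device, two_devices)
```

The two-device figure matches the 3073 steps in the output, which would be consistent with an effective batch size of 32. It would also be consistent with the "No log" entry for epoch 1, since an epoch of 439 steps ends before the Trainer's default logging interval of 500 steps.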

To save the model:
- Saving the model creates a config file and a safetensors file.
- The config file contains the 'id2label' and 'label2id' dictionaries.

In [42]:
model.save_pretrained('ner_model')
print("Model Saved...")

In [43]:
# save tokenizer 
tokenizer.save_pretrained('tokenizer')

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [44]:
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [45]:
id2label = {
    str(i): label for i,label in enumerate(label_list)
}

In [46]:
id2label

{'0': 'O',
 '1': 'B-PER',
 '2': 'I-PER',
 '3': 'B-ORG',
 '4': 'I-ORG',
 '5': 'B-LOC',
 '6': 'I-LOC',
 '7': 'B-MISC',
 '8': 'I-MISC'}

In [47]:
# keep the ids as integers, matching the label2id format in the saved config
label2id = {
    label: i for i, label in enumerate(label_list)
}

label2id

{'O': 0,
 'B-PER': 1,
 'I-PER': 2,
 'B-ORG': 3,
 'I-ORG': 4,
 'B-LOC': 5,
 'I-LOC': 6,
 'B-MISC': 7,
 'I-MISC': 8}

In [48]:
import json

In [50]:
# read the saved config; the with-block closes the file afterwards
with open("/kaggle/working/ner_model/config.json") as f:
    config = json.load(f)

In [51]:
config


{'_name_or_path': 'bert-base-uncased',
 'architectures': ['BertForTokenClassification'],
 'attention_probs_dropout_prob': 0.1,
 'classifier_dropout': None,
 'gradient_checkpointing': False,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'id2label': {'0': 'LABEL_0',
  '1': 'LABEL_1',
  '2': 'LABEL_2',
  '3': 'LABEL_3',
  '4': 'LABEL_4',
  '5': 'LABEL_5',
  '6': 'LABEL_6',
  '7': 'LABEL_7',
  '8': 'LABEL_8'},
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'label2id': {'LABEL_0': 0,
  'LABEL_1': 1,
  'LABEL_2': 2,
  'LABEL_3': 3,
  'LABEL_4': 4,
  'LABEL_5': 5,
  'LABEL_6': 6,
  'LABEL_7': 7,
  'LABEL_8': 8},
 'layer_norm_eps': 1e-12,
 'max_position_embeddings': 512,
 'model_type': 'bert',
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'pad_token_id': 0,
 'position_embedding_type': 'absolute',
 'torch_dtype': 'float32',
 'transformers_version': '4.38.1',
 'type_vocab_size': 2,
 'use_cache': True,
 'vocab_size': 30522}

In [52]:
config["id2label"] = id2label

In [54]:
config["label2id"] = label2id

In [55]:
# write the updated config back; the with-block flushes and closes the file
with open("/kaggle/working/ner_model/config.json", "w") as f:
    json.dump(config, f)


In [56]:
model_fine_tuned=AutoModelForTokenClassification.from_pretrained("ner_model")

In [57]:
model_fine_tuned

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

Run inference to check the predictions

Transformers pipeline

In [58]:
from transformers import pipeline

In [59]:
# build a pipeline from the fine-tuned model and tokenizer for the 'ner' task
nlp_pipeline=pipeline("ner",model=model_fine_tuned,tokenizer=tokenizer)

In [60]:
nlp_pipeline

<transformers.pipelines.token_classification.TokenClassificationPipeline at 0x7b958550c8b0>

In [65]:
example="Narendra Modi is Indian Politician"

In [63]:
example="apple launch mobile while eating apple which taste like orange"

In [64]:
nlp_pipeline(example)


[{'entity': 'B-ORG',
  'score': 0.9943808,
  'index': 1,
  'word': 'apple',
  'start': 0,
  'end': 5},
 {'entity': 'B-MISC',
  'score': 0.9416092,
  'index': 6,
  'word': 'apple',
  'start': 33,
  'end': 38},
 {'entity': 'B-MISC',
  'score': 0.8638995,
  'index': 10,
  'word': 'orange',
  'start': 56,
  'end': 62}]
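Each prediction carries character offsets (`start`, `end`) into the input string and a confidence `score`, so the output can be post-processed without the model. The sketch below copies the three predictions shown above and keeps only the confident ones; `confident_entities` is a hypothetical helper and the 0.9 threshold is an arbitrary choice for illustration.

```python
example = "apple launch mobile while eating apple which taste like orange"

# Predictions copied from the pipeline output above.
predictions = [
    {"entity": "B-ORG", "score": 0.9943808, "index": 1, "word": "apple", "start": 0, "end": 5},
    {"entity": "B-MISC", "score": 0.9416092, "index": 6, "word": "apple", "start": 33, "end": 38},
    {"entity": "B-MISC", "score": 0.8638995, "index": 10, "word": "orange", "start": 56, "end": 62},
]


def confident_entities(text, preds, min_score=0.9):
    """Keep predictions at or above min_score and read their text from the offsets."""
    return [
        (text[p["start"]:p["end"]], p["entity"][2:], round(p["score"], 3))
        for p in preds
        if p["score"] >= min_score
    ]


kept = confident_entities(example, predictions)
print(kept)
```

With this threshold the "orange" prediction is dropped, while both "apple" mentions are kept with different entity types (ORG and MISC).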