<a href="https://colab.research.google.com/github/hajar-hajji/BERT-Repository/blob/main/BERT%20for%20NER/ner_using_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To learn more about the approach used in this notebook, please refer to the [*Hugging Face page*](https://huggingface.co/transformers/v4.2.2/custom_datasets.html#token-classification-with-w-nut-emerging-entities), which explains the topic in more detail (in particular the encoding part).



### Importing dataset & preprocessing

#### Importing data

In [1]:
import numpy as np
import pandas as pd

In [2]:
data=pd.read_csv("/content/drive/MyDrive/my datasets/ner_dataset.csv",encoding="unicode_escape")
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


In [3]:
data.shape #1 048 575 words

(1048575, 4)

#### Preprocessing data

In [4]:
data["Sentence"] = data.groupby(["Sentence #"])["Word"].transform(lambda x: " ".join(x))
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag,Sentence
0,Sentence: 1,Thousands,NNS,O,Thousands
1,,of,IN,O,
2,,demonstrators,NNS,O,
3,,have,VBP,O,
4,,marched,VBN,O,


In [5]:
#drop the POS column: POS tagging is out of scope for now (it may be added later)
data=data[["Sentence #","Word","Tag"]]
data.head()

Unnamed: 0,Sentence #,Word,Tag
0,Sentence: 1,Thousands,O
1,,of,O
2,,demonstrators,O
3,,have,O
4,,marched,O


In [6]:
#some info
print("Number of Sentences :")
#note: unique() also counts NaN, which marks the rows whose sentence ID is not filled in yet
print(f"{len(data['Sentence #'].unique())} sentences")
print()
print("Unique tags :")
unique_tags=data.Tag.unique()
print(unique_tags)
print()
print("Number of unique tags :")
print(f"{len(unique_tags)} tags")

Number of Sentences :
47960 sentences

Unique tags :
['O' 'B-geo' 'B-gpe' 'B-per' 'I-geo' 'B-org' 'I-org' 'B-tim' 'B-art'
 'I-art' 'I-per' 'I-gpe' 'I-tim' 'B-nat' 'B-eve' 'I-eve' 'I-nat']

Number of unique tags :
17 tags


#### IOB-tags

***O*** stands for **Outside**, ***B*** for **Beginning** and ***I*** for **Inside** (the token's position relative to an entity) <Br>
To learn more about the *IOB-tags* format, click [*here*](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))
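
As a small illustration of the scheme, the sketch below groups IOB tags back into entity spans. It is a minimal example and not part of the original notebook: the helper name `iob_to_spans` is hypothetical, and the sample is one sentence from this dataset with its tags.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts (a stray I- tag is treated as a beginning).
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # "O": close any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["They", "marched", "from", "the", "Houses", "of", "Parliament",
          "to", "a", "rally", "in", "Hyde", "Park", "."]
tags = ["O"] * 11 + ["B-geo", "I-geo", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)  # [('geo', 'Hyde Park')]
```

Here `Hyde` opens a `geo` entity with `B-geo` and `Park` continues it with `I-geo`, so the two tokens form a single span.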

In [7]:
#check whether any values are missing in the Word and Tag columns
data[data["Word"].isnull() | data["Tag"].isnull()] #none found

Unnamed: 0,Sentence #,Word,Tag


In [8]:
#fill in the missing sentence IDs with forward fill (ffill), which replaces
#each missing value with the last known value in the same column.
data["Sentence #"]=data["Sentence #"].ffill()
data[20:30]

Unnamed: 0,Sentence #,Word,Tag
20,Sentence: 1,from,O
21,Sentence: 1,that,O
22,Sentence: 1,country,O
23,Sentence: 1,.,O
24,Sentence: 2,Families,O
25,Sentence: 2,of,O
26,Sentence: 2,soldiers,O
27,Sentence: 2,killed,O
28,Sentence: 2,in,O
29,Sentence: 2,the,O


In [9]:
data["tokens"] = data.groupby(["Sentence #"])["Word"].transform(lambda x: ",".join(x))
data.head()

Unnamed: 0,Sentence #,Word,Tag,tokens
0,Sentence: 1,Thousands,O,"Thousands,of,demonstrators,have,marched,throug..."
1,Sentence: 1,of,O,"Thousands,of,demonstrators,have,marched,throug..."
2,Sentence: 1,demonstrators,O,"Thousands,of,demonstrators,have,marched,throug..."
3,Sentence: 1,have,O,"Thousands,of,demonstrators,have,marched,throug..."
4,Sentence: 1,marched,O,"Thousands,of,demonstrators,have,marched,throug..."


In [10]:
data["labels"] = data.groupby(["Sentence #"])["Tag"].transform(lambda x: ",".join(x))
data.head()

Unnamed: 0,Sentence #,Word,Tag,tokens,labels
0,Sentence: 1,Thousands,O,"Thousands,of,demonstrators,have,marched,throug...","O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
1,Sentence: 1,of,O,"Thousands,of,demonstrators,have,marched,throug...","O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
2,Sentence: 1,demonstrators,O,"Thousands,of,demonstrators,have,marched,throug...","O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
3,Sentence: 1,have,O,"Thousands,of,demonstrators,have,marched,throug...","O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
4,Sentence: 1,marched,O,"Thousands,of,demonstrators,have,marched,throug...","O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."


In [11]:
data=data.drop_duplicates(subset=["tokens", "labels"], keep="first").reset_index(drop=True)
data.shape

(47610, 5)

In [12]:
data=data[["tokens","labels"]]
data.head()

Unnamed: 0,tokens,labels
0,"Thousands,of,demonstrators,have,marched,throug...","O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
1,"Families,of,soldiers,killed,in,the,conflict,jo...","O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,B-per,O,O,..."
2,"They,marched,from,the,Houses,of,Parliament,to,...","O,O,O,O,O,O,O,O,O,O,O,B-geo,I-geo,O"
3,"Police,put,the,number,of,marchers,at,10,000,wh...","O,O,O,O,O,O,O,O,O,O,O,O,O,O,O"
4,"The,protest,comes,on,the,eve,of,the,annual,con...","O,O,O,O,O,O,O,O,O,O,O,B-geo,O,O,B-org,I-org,O,..."


In [13]:
#extract the token lists and label lists from the dataframe
list_tokens=data.tokens.apply(lambda x:x.split(",")).tolist()
print(list_tokens[:3])
print()
list_labels=data.labels.apply(lambda x:x.split(",")).tolist()
print(list_labels[:3])

[['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through', 'London', 'to', 'protest', 'the', 'war', 'in', 'Iraq', 'and', 'demand', 'the', 'withdrawal', 'of', 'British', 'troops', 'from', 'that', 'country', '.'], ['Families', 'of', 'soldiers', 'killed', 'in', 'the', 'conflict', 'joined', 'the', 'protesters', 'who', 'carried', 'banners', 'with', 'such', 'slogans', 'as', '"', 'Bush', 'Number', 'One', 'Terrorist', '"', 'and', '"', 'Stop', 'the', 'Bombings', '.', '"'], ['They', 'marched', 'from', 'the', 'Houses', 'of', 'Parliament', 'to', 'a', 'rally', 'in', 'Hyde', 'Park', '.']]

[['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-per', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O']]
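
One caveat about the comma join/split round trip used above: a word that itself contains a comma (such as `10,000` in the fourth sentence), or that is a comma, is split into extra pieces, so a token list can end up longer than its label list. The sketch below reproduces the effect on toy rows and shows one way to build the lists directly. The rows and the helper name `group_rows` are illustrative assumptions; in pandas, a similar result could probably be obtained with something like `data.groupby("Sentence #", sort=False)["Word"].agg(list)`.

```python
from collections import defaultdict

# Toy rows in the same (sentence id, word, tag) layout as the dataframe.
rows = [
    ("Sentence: toy", "at", "O"),
    ("Sentence: toy", "10,000", "O"),
    ("Sentence: toy", ",", "O"),
    ("Sentence: toy", "while", "O"),
]
words = [word for _, word, _ in rows]
tags = [tag for _, _, tag in rows]

# Round trip through a comma-joined string, as in the cells above.
round_trip_words = ",".join(words).split(",")
round_trip_tags = ",".join(tags).split(",")
print(len(words), len(round_trip_words), len(round_trip_tags))


def group_rows(rows):
    """Collect words and tags per sentence without going through strings."""
    tokens_by_sentence = defaultdict(list)
    labels_by_sentence = defaultdict(list)
    for sentence_id, word, tag in rows:
        tokens_by_sentence[sentence_id].append(word)
        labels_by_sentence[sentence_id].append(tag)
    return dict(tokens_by_sentence), dict(labels_by_sentence)


tokens_by_sentence, labels_by_sentence = group_rows(rows)
print(tokens_by_sentence["Sentence: toy"])
```

The four toy words become six pieces after the round trip while the tags stay at four. A length mismatch like this shifts every label after the affected word, so it may be worth checking that each sentence has as many tokens as labels before encoding.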


In [14]:
#assign an index to each tag in order to perform classification task
get_idx=dict()
for ind,tag in enumerate(unique_tags):
    get_idx[tag]=ind
get_idx

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'B-art': 8,
 'I-art': 9,
 'I-per': 10,
 'I-gpe': 11,
 'I-tim': 12,
 'B-nat': 13,
 'B-eve': 14,
 'I-eve': 15,
 'I-nat': 16}

## BERT Model 

#### install transformers gpu version on colab

In [16]:
!pip install transformers seqeval[gpu]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting seqeval[gpu]
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)

In [17]:
import torch
from transformers import BertTokenizerFast, BertConfig

### Loading BERT tokenizer

In [18]:
tokenizer=BertTokenizerFast.from_pretrained("bert-base-uncased")


In [19]:
#length (in words) of the longest sentence in list_tokens
max_length=max(len(sentence) for sentence in list_tokens)
max_length

117

In [20]:
#encode each sentence of list_tokens (represented by a list of tokens)
enc=tokenizer(list_tokens,
                is_split_into_words=True, #the input is already split into words
                return_offsets_mapping=True, 
                padding="max_length",  #ensure all inputs have the same length
                truncation=True, 
                max_length=max_length)

In [21]:
print(enc[0])

Encoding(num_tokens=117, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In [22]:
print(enc[0].ids)

[101, 5190, 1997, 28337, 2031, 9847, 2083, 2414, 2000, 6186, 1996, 2162, 1999, 5712, 1998, 5157, 1996, 10534, 1997, 2329, 3629, 2013, 2008, 2406, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [23]:
print(enc[0].offsets) #start position and end position relative to the original token it was split from

[(0, 0), (0, 9), (0, 2), (0, 13), (0, 4), (0, 7), (0, 7), (0, 6), (0, 2), (0, 7), (0, 3), (0, 3), (0, 2), (0, 4), (0, 3), (0, 6), (0, 3), (0, 10), (0, 2), (0, 7), (0, 6), (0, 4), (0, 4), (0, 7), (0, 1), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0)]


#### Preparing the model inputs

In [24]:
inputs_model=[]

#iterate over every sentence
for i in range(len(list_tokens)):
    #word-level label ids, padded with 0 up to max_length
    label=[get_idx[lbl] for lbl in list_labels[i]]
    label+=[0]*(max_length-len(label))
    #-100 is the default ignore_index of PyTorch's cross-entropy loss (-1 for tensorflow...)
    encoded_tags=[-100 for _ in range(max_length)]
    
    offsets=enc[i].offsets
    k=0
    for ind,tpl in enumerate(offsets):
        start_pos=tpl[0]
        end_pos=tpl[1]
        #(0, 0) marks [CLS], [SEP] and [PAD]; a start of 0 with a non-zero end marks the
        #first wordpiece of a word, the only piece that receives the word's label
        if start_pos==0 and end_pos!=0:
            #replace -100 by label
            encoded_tags[ind]=label[k]
            k+=1
    
    #convert to PyTorch tensors and keep only the fields the model needs
    input_sentence={"input_ids":torch.as_tensor(enc[i].ids),
                    "attention_mask":torch.as_tensor(enc[i].attention_mask),
                    "labels":torch.as_tensor(encoded_tags)}
    inputs_model.append(input_sentence)

#the data is now encoded for BERT and can be fed to the model
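
An alternative to the offsets check is to align labels through word indices, which fast tokenizers expose via `enc.word_ids(batch_index=i)`. Each wordpiece then points at its source word explicitly, so the label lookup does not depend on a running counter. The following is a minimal sketch under that assumption; `align_labels` and the sample `word_ids` are hypothetical and not part of the notebook.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first wordpiece and mask everything else."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP], [PAD]) have no source word.
            aligned.append(ignore_index)
        elif word_id != previous_word:
            # First wordpiece of a new word: use that word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later wordpieces of the same word are ignored by the loss.
            aligned.append(ignore_index)
        previous_word = word_id
    return aligned


# Hypothetical encoding of three words; the first two are split into two wordpieces each.
word_ids = [None, 0, 0, 1, 1, 2, None, None]
word_labels = [3, 10, 0]  # B-per, I-per, O in the get_idx mapping
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 3, -100, 10, -100, 0, -100, -100]
```

With this approach a word that produces no wordpiece simply never appears in `word_ids`, so the labels of the following words stay attached to the right tokens.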

In [25]:
print(len(inputs_model))

47610


In [26]:
inputs_model[:2] #list of dictionaries, each dictionary is tensor representation of each sentence

[{'input_ids': tensor([  101,  5190,  1997, 28337,  2031,  9847,  2083,  2414,  2000,  6186,
           1996,  2162,  1999,  5712,  1998,  5157,  1996, 10534,  1997,  2329,
           3629,  2013,  2008,  2406,  1012,   102,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0]),
  'attention_mask': tensor([1, 1, 1, 1, 1, 1, 

In [27]:
inputs_model[0] #first sentence inputs

{'input_ids': tensor([  101,  5190,  1997, 28337,  2031,  9847,  2083,  2414,  2000,  6186,
          1996,  2162,  1999,  5712,  1998,  5157,  1996, 10534,  1997,  2329,
          3629,  2013,  2008,  2406,  1012,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [28]:
for t, lbl in zip(tokenizer.convert_ids_to_tokens(inputs_model[3000]["input_ids"]), inputs_model[3000]["labels"]):
  print(t, lbl.tolist())

[CLS] -100
lebanese 2
prime 3
minister 10
sa 10
##ad -100
hari 10
##ri -100
has 0
denied 0
a 0
newspaper 0
report 0
that 0
says 0
he 0
will 0
ask 0
a 0
u 1
. -100
n -100
. -100
tribunal 0
to 0
stop 0
its 0
investigation 0
into 0
the 0
2005 7
assassination 0
of 0
his 0
father 0
former 0
prime 0
minister 3
raf 10
##iq -100
hari 10
##ri -100
. 10
[SEP] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD]

### Loading BERT model

In [29]:
from torch.utils.data import DataLoader
from transformers import BertForTokenClassification #the optimizer used below comes from torch.optim

In [30]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device) #to check if gpu or cpu

cuda


In [31]:
model=BertForTokenClassification.from_pretrained('bert-base-uncased',num_labels=len(unique_tags))
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

#### Loading data

In [32]:
from sklearn.model_selection import train_test_split

#split the data: 80% for training and the remaining 20% for testing
#extract the features and labels and put them in lists
input_ids = [d["input_ids"] for d in inputs_model]
attention_masks = [d["attention_mask"] for d in inputs_model]
labels = [d["labels"] for d in inputs_model]

#splitting
train_input_ids, test_input_ids,\
train_attention_masks, test_attention_masks,\
train_labels, test_labels\
=train_test_split(input_ids, attention_masks, labels, test_size=.2, random_state=42)

#regroup
train_dataset = [{"input_ids": input_id, "attention_mask": attention_mask, "labels": label} \
                 for input_id, attention_mask, label in zip(train_input_ids, train_attention_masks, train_labels)]
test_dataset = [{"input_ids": input_id, "attention_mask": attention_mask, "labels": label} \
                for input_id, attention_mask, label in zip(test_input_ids, test_attention_masks, test_labels)]


In [33]:
print(len(train_dataset))
print(len(test_dataset))

38088
9522


In [34]:
#checking on a random sample (token and its label)
for t, lbl in zip(tokenizer.convert_ids_to_tokens(train_dataset[7]["input_ids"]), train_dataset[7]["labels"]):
  print(t, lbl.tolist())

[CLS] -100
the 0
american 5
embassy 6
in 0
tokyo 1
says 0
a 0
high 0
- -100
level -100
delegation 0
led 0
by 0
agriculture 0
under 0
##se -100
##cre -100
##tary -100
j 3
. -100
b 10
. -100
penn 10
will 0
arrive 0
in 0
japan 1
monday 7
to 0
discuss 0
the 0
situation 0
. 0
[SEP] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[PAD] -100
[P

#### Training model

In [35]:
batch_size=8
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

In [36]:
from sklearn.metrics import accuracy_score

In [37]:
def train_model(train_loader,optimizer):
    model.train()
    
    sum_loss=0
    sum_acc=0
    nbr_steps=0

    train_pred=[]
    train_gold=[]
    
    for idx_batch,batch in enumerate(train_loader):  #iterate over batch of batch_size from train
        optimizer.zero_grad()
        
        input_ids=batch["input_ids"].to(device)
        attention_mask=batch["attention_mask"].to(device)
        labels=batch["labels"].to(device)
        
        #perform forward pass through model
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        #model outputs: the loss, then the logits (unnormalized scores for each label class)
        loss=outputs[0] #cross-entropy loss for this model
        sum_loss+=loss.item()
        
        nbr_steps+=1
        
        if idx_batch%100==0:
            loss_per_step = sum_loss/nbr_steps
            print("training loss per 100 steps : ",loss_per_step)
            
        flat_labels=labels.view(-1)
        
        train_logits=outputs[1]
        relevant_logits=train_logits.view(-1, model.num_labels)
        flat_pred=torch.argmax(relevant_logits, axis=1)
        
        filtering=labels.view(-1) != -100 #only keep relevant labels...
        
        relevant_labels=torch.masked_select(flat_labels,filtering)
        relevant_pred=torch.masked_select(flat_pred,filtering)
        
        accuracy=accuracy_score(relevant_labels.cpu().numpy(),relevant_pred.cpu().numpy()) #don't forget to convert it back to cpu !
        sum_acc+=accuracy

        train_pred.extend(relevant_pred)
        train_gold.extend(relevant_labels)
        
        # backward pass
        loss.backward()
        optimizer.step()

    #use the predictions and gold labels accumulated over the whole epoch, not only the last batch
    train_predictions=[tr.tolist() for tr in train_pred]
    gold_labels=[gold.tolist() for gold in train_gold]
        
    epoch_loss=sum_loss/nbr_steps
    epoch_acc=sum_acc/nbr_steps
    print("Loss per epoch: ",epoch_loss)
    print("Accuracy per epoch: ",epoch_acc)

    return train_predictions,gold_labels
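
The accuracy reported by this function is the mean of per-batch accuracies, each computed on the positions whose label is not `-100`. Since batches contain different numbers of scored tokens, that mean can differ slightly from the accuracy over all tokens pooled together. A minimal sketch on toy data (the helper `masked_accuracy` and the values are illustrative, not taken from the notebook):

```python
def masked_accuracy(labels, preds, ignore_index=-100):
    """Return (correct, total) over positions whose label is not ignore_index."""
    kept = [(gold, pred) for gold, pred in zip(labels, preds) if gold != ignore_index]
    correct = sum(1 for gold, pred in kept if gold == pred)
    return correct, len(kept)


# Two toy batches (flattened): the first has 1 scored token, the second has 3.
batches = [
    ([-100, 3, -100, -100], [0, 3, 0, 0]),
    ([-100, 0, 10, 1], [0, 0, 0, 0]),
]
counts = [masked_accuracy(labels, preds) for labels, preds in batches]

# Mean of per-batch accuracies, as in train_model and test_model.
mean_of_batches = sum(correct / total for correct, total in counts) / len(counts)
# Accuracy over all scored tokens pooled together.
pooled = sum(correct for correct, _ in counts) / sum(total for _, total in counts)
print(round(mean_of_batches, 3), pooled)  # 0.667 0.5
```

With a batch size of 8 and thousands of batches the two numbers are usually close, but the pooled value is the one that matches a token-level report computed on the returned lists.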

In [38]:
lr=1e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=lr) 
#variant of Adam with decoupled weight decay: the decay shrinks the weights
#directly in the update step instead of adding a penalty term to the loss
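
To make the role of weight decay concrete, here is a scalar sketch of one AdamW update with decoupled weight decay. It is a simplified illustration of the update rule, not the PyTorch implementation; the function name `adamw_step` is hypothetical, and the defaults (`betas`, `eps`, `weight_decay=0.01`) mirror those of `torch.optim.AdamW`, which the notebook leaves unchanged.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step on a single scalar parameter."""
    # Decoupled weight decay: shrink the weight directly, independently of the gradient.
    theta = theta * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the first steps (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adam update using the corrected moments.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# With a zero gradient the Adam term vanishes, yet the weight still shrinks slightly.
theta, m, v = adamw_step(theta=1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(theta < 1.0, m, v)
```

The example with a zero gradient isolates the decay term: the parameter moves only because of `weight_decay`, which is what distinguishes this scheme from an L2 penalty folded into the loss.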

I will start with a single epoch, then progressively add more until I get a satisfactory accuracy score.

In [39]:
%%time
nbr_epochs=1
for e in range(nbr_epochs):
    print("EPOCH : ",e+1)
    train_predictions,gold_labels=train_model(train_loader,optimizer)

EPOCH :  1
training loss per 100 steps :  2.8424293994903564
training loss per 100 steps :  0.9876810164734868
training loss per 100 steps :  0.7844870378128925
training loss per 100 steps :  0.6798451715628571
training loss per 100 steps :  0.6134218634512656
training loss per 100 steps :  0.5659125389423437
training loss per 100 steps :  0.5307069876279291
training loss per 100 steps :  0.504501298346295
training loss per 100 steps :  0.4815107113431753
training loss per 100 steps :  0.4634031862037031
training loss per 100 steps :  0.4471453947129545
training loss per 100 steps :  0.4331166524572009
training loss per 100 steps :  0.42141528109974113
training loss per 100 steps :  0.4104251089583879
training loss per 100 steps :  0.4011100753596708
training loss per 100 steps :  0.3921069403146105
training loss per 100 steps :  0.38457265291840004
training loss per 100 steps :  0.37742925739613337
training loss per 100 steps :  0.37121827586314565
training loss per 100 steps :  0.364

**overall accuracy after one epoch : 0.91** <Br>
I will train for another epoch to improve accuracy...

In [42]:
%%time
nbr_epochs=1
for e in range(nbr_epochs):
    print("EPOCH : ",e+1)
    train_predictions,gold_labels=train_model(train_loader,optimizer)

EPOCH :  1
training loss per 100 steps :  0.14495129883289337
training loss per 100 steps :  0.19285037621191822
training loss per 100 steps :  0.18456226981138413
training loss per 100 steps :  0.18521641120562127
training loss per 100 steps :  0.18550439332228646
training loss per 100 steps :  0.18616061197768546
training loss per 100 steps :  0.18580325005170015
training loss per 100 steps :  0.18473884209023309
training loss per 100 steps :  0.18329173593913273
training loss per 100 steps :  0.182581993628223
training loss per 100 steps :  0.18244844534760946
training loss per 100 steps :  0.18336397470004662
training loss per 100 steps :  0.18401194839109414
training loss per 100 steps :  0.1834448211326004
training loss per 100 steps :  0.18335236872291774
training loss per 100 steps :  0.18283363598784175
training loss per 100 steps :  0.18183210030312114
training loss per 100 steps :  0.18123718493315297
training loss per 100 steps :  0.1809356306797966
training loss per 100 st

**overall accuracy after two epochs : 0.94** <Br>
I will add one more epoch to check whether the accuracy increases further.

In [46]:
%%time
nbr_epochs=1
for e in range(nbr_epochs):
    print("EPOCH : ",e+1)
    train_predictions,gold_labels=train_model(train_loader,optimizer)

EPOCH :  1
training loss per 100 steps :  0.2500331401824951
training loss per 100 steps :  0.1570617931314034
training loss per 100 steps :  0.15040052995845601
training loss per 100 steps :  0.15151779462221354
training loss per 100 steps :  0.1524734129929483
training loss per 100 steps :  0.15331773829466092
training loss per 100 steps :  0.15416756007020763
training loss per 100 steps :  0.15246599180472434
training loss per 100 steps :  0.15149768129149255
training loss per 100 steps :  0.15220322335070696
training loss per 100 steps :  0.1506141528922629
training loss per 100 steps :  0.15095367022729958
training loss per 100 steps :  0.15006203994078934
training loss per 100 steps :  0.1511121597652861
training loss per 100 steps :  0.15049141829003773
training loss per 100 steps :  0.15036287411599408
training loss per 100 steps :  0.14978760330071633
training loss per 100 steps :  0.14960740318824087
training loss per 100 steps :  0.1494589565285088
training loss per 100 step

**overall accuracy after three epochs : 0.95** <Br>

#### Evaluate

In [44]:
#there is no optimizer, no backward pass
def test_model(test_loader):
    model.eval()
    
    sum_loss=0
    sum_acc=0
    nbr_steps=0
    
    test_pred=[]
    test_gold=[]
    
    with torch.no_grad(): #no gradient
        for idx_batch,batch in enumerate(test_loader):  #iterate over batch of batch_size from test
        
            input_ids=batch["input_ids"].to(device)
            attention_mask=batch["attention_mask"].to(device)
            labels=batch["labels"].to(device)
        
            #perform forward pass through model
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            #model outputs: the loss, then the logits (unnormalized scores for each label class)
            loss=outputs[0] #cross-entropy loss for this model
            sum_loss+=loss.item()
        
            nbr_steps+=1
        
            if idx_batch%100==0:
                loss_per_step = sum_loss/nbr_steps
                print("test loss per 100 steps : ",loss_per_step)
            
            flat_labels=labels.view(-1)
        
            test_logits=outputs[1]
            relevant_logits=test_logits.view(-1, model.num_labels)
            flat_pred=torch.argmax(relevant_logits, axis=1)
        
            filtering=labels.view(-1) != -100 #only keep relevant labels...
        
            relevant_labels=torch.masked_select(flat_labels,filtering)
            relevant_pred=torch.masked_select(flat_pred,filtering)
        
            accuracy=accuracy_score(relevant_labels.cpu().numpy(),relevant_pred.cpu().numpy())
            sum_acc+=accuracy
            
            test_pred.extend(relevant_pred)
            test_gold.extend(relevant_labels)
            
    epoch_loss=sum_loss/nbr_steps
    epoch_acc=sum_acc/nbr_steps
    print("Loss test per epoch: ",epoch_loss)
    print("Accuracy data test per epoch: ",epoch_acc)
    
    test_predictions=[pred.tolist() for pred in test_pred]
    gold_labels=[gold.tolist() for gold in test_gold]
    
    return test_predictions,gold_labels

after one epoch of training...

In [41]:
%time predicted_labels,true_labels=test_model(test_loader)

test loss per 100 steps :  0.2802024483680725
test loss per 100 steps :  0.19028250281099635
test loss per 100 steps :  0.19344269301723782
test loss per 100 steps :  0.1956329899979291
test loss per 100 steps :  0.19576770038880464
test loss per 100 steps :  0.19422171673479074
test loss per 100 steps :  0.192444485942283
test loss per 100 steps :  0.19149861787944053
test loss per 100 steps :  0.1930717881071322
test loss per 100 steps :  0.1926311619071234
test loss per 100 steps :  0.19384275456252215
test loss per 100 steps :  0.1929536937946601
Loss test per epoch:  0.1933574818839467
Accuracy data test per epoch:  0.9383660654958826
CPU times: user 1min 22s, sys: 84.4 ms, total: 1min 22s
Wall time: 1min 25s


**overall accuracy on test dataset after one epoch : 0.938** <Br>
After training for an additional epoch:

In [45]:
%time predicted_labels_2,true_labels_2=test_model(test_loader)

test loss per 100 steps :  0.1274585872888565
test loss per 100 steps :  0.19028832442542115
test loss per 100 steps :  0.17918728620036325
test loss per 100 steps :  0.1746260095449579
test loss per 100 steps :  0.1774536643474551
test loss per 100 steps :  0.17552775698165574
test loss per 100 steps :  0.17728594424336305
test loss per 100 steps :  0.17457403261719348
test loss per 100 steps :  0.1755712561462814
test loss per 100 steps :  0.17562729835956792
test loss per 100 steps :  0.17538276158801683
test loss per 100 steps :  0.17490779272839765
Loss test per epoch:  0.17503530036432843
Accuracy data test per epoch:  0.9427453122097624
CPU times: user 1min 18s, sys: 148 ms, total: 1min 18s
Wall time: 1min 18s


**overall accuracy on test dataset after two epochs : 0.943** <Br>
After three epochs the accuracy score is nearly the same, so I stopped training to avoid overfitting.

In [48]:
%time predicted_labels,true_labels=test_model(test_loader)

test loss per 100 steps :  0.21542325615882874
test loss per 100 steps :  0.18943102388541297
test loss per 100 steps :  0.18005220314478548
test loss per 100 steps :  0.17751613078929757
test loss per 100 steps :  0.17431121832108185
test loss per 100 steps :  0.17424152642414123
test loss per 100 steps :  0.1714938173950735
test loss per 100 steps :  0.17078715644792763
test loss per 100 steps :  0.17093594431686454
test loss per 100 steps :  0.16946755544118144
test loss per 100 steps :  0.17025647864450927
test loss per 100 steps :  0.17000668191500717
Loss test per epoch:  0.1696483302489236
Accuracy data test per epoch:  0.9459243105473963
CPU times: user 1min 16s, sys: 138 ms, total: 1min 16s
Wall time: 1min 21s


**overall accuracy on test dataset after three epochs : 0.946** <Br>

### Some relevant metrics

#### sklearn for classic classification (classes)

In [52]:
from sklearn.metrics import classification_report

print(classification_report(true_labels, predicted_labels,zero_division=1))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98    170087
           1       0.77      0.79      0.78      7340
           2       0.87      0.87      0.87      3150
           3       0.80      0.75      0.77      3343
           4       0.71      0.70      0.71      1425
           5       0.70      0.63      0.66      4078
           6       0.74      0.66      0.70      3374
           7       0.80      0.81      0.81      3886
           8       0.33      0.07      0.12        70
           9       0.09      0.03      0.04        40
          10       0.79      0.85      0.81      3385
          11       0.76      0.61      0.67        46
          12       0.82      0.63      0.72      1177
          13       0.67      0.26      0.37        39
          14       0.58      0.23      0.33        61
          15       0.35      0.27      0.31        41
          16       1.00      0.00      0.00         9

    accuracy              

Class 0, which stands for "Outside", has the highest scores and the largest support by far, so it dominates the overall result (hence the high overall accuracy score). The scores of the remaining classes show that the model is far from perfect...
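
To quantify how much class 0 weighs, the support column of the report can be turned into a majority-class baseline: the accuracy a model would reach by always predicting "O". The numbers below are copied from the report; the variable names are illustrative.

```python
# Support (number of scored test tokens) per class id, copied from the report above.
support = {0: 170087, 1: 7340, 2: 3150, 3: 3343, 4: 1425, 5: 4078, 6: 3374,
           7: 3886, 8: 70, 9: 40, 10: 3385, 11: 46, 12: 1177, 13: 39,
           14: 61, 15: 41, 16: 9}

total = sum(support.values())
# Share of tokens labelled "O": the accuracy of a model that always predicts class 0.
majority_share = support[0] / total
print(total, round(majority_share, 3))  # 201551 0.844
```

About 84% of the scored tokens are "O", so accuracy alone says little about how well entities are recognized; the per-class f1-scores are more informative.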

#### seqeval metrics for sequences
suitable for NER tasks

In [50]:
from seqeval.metrics import classification_report

#get tag from index
get_tag=dict()
for ind,tag in enumerate(unique_tags):
    get_tag[ind]=tag

predicted_labels_copy=[[get_tag[ind]] for ind in predicted_labels]
true_labels_copy=[[get_tag[ind]] for ind in true_labels]

print(classification_report(true_labels_copy, predicted_labels_copy))

              precision    recall  f1-score   support

         art       0.23      0.05      0.09       110
         eve       0.51      0.27      0.36       102
         geo       0.79      0.80      0.80      8765
         gpe       0.87      0.87      0.87      3196
         nat       0.73      0.23      0.35        48
         org       0.76      0.68      0.71      7452
         per       0.86      0.86      0.86      6728
         tim       0.85      0.82      0.83      5063

   micro avg       0.81      0.79      0.80     31464
   macro avg       0.70      0.57      0.61     31464
weighted avg       0.81      0.79      0.80     31464



We notice that classes that are underrepresented in the dataset, such as art, eve and nat, have the lowest f1-scores. Removing them would probably improve the averaged results.
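
One caveat on the seqeval report above: each tag is wrapped in its own one-element list, so seqeval sees every token as a separate sentence. This is consistent with the total support of 31464, which equals the number of non-"O" tokens in the sklearn report. For entity-level scores, seqeval expects one list of tags per sentence. The sketch below shows one way to rebuild such lists from per-sentence label ids that still contain `-100`; the helper name `to_tag_sequences` and the toy ids are assumptions.

```python
def to_tag_sequences(batch_labels, batch_preds, id2tag, ignore_index=-100):
    """Convert per-sentence id lists into per-sentence tag lists, dropping masked positions."""
    true_sequences, pred_sequences = [], []
    for labels, preds in zip(batch_labels, batch_preds):
        true_tags, pred_tags = [], []
        for label, pred in zip(labels, preds):
            if label == ignore_index:
                continue  # special token, padding or non-first wordpiece
            true_tags.append(id2tag[label])
            pred_tags.append(id2tag[pred])
        true_sequences.append(true_tags)
        pred_sequences.append(pred_tags)
    return true_sequences, pred_sequences


# Toy ids following the get_idx mapping: 0 = O, 1 = B-geo, 4 = I-geo.
id2tag = {0: "O", 1: "B-geo", 4: "I-geo"}
batch_labels = [[-100, 0, 1, 4, -100], [-100, 1, -100, 0, -100]]
batch_preds = [[0, 0, 1, 0, 0], [0, 1, 1, 0, 0]]
true_sequences, pred_sequences = to_tag_sequences(batch_labels, batch_preds, id2tag)
print(true_sequences)
print(pred_sequences)
```

The resulting lists could then be passed to seqeval's `classification_report(true_sequences, pred_sequences)`. Doing so would require `test_model` to keep labels and predictions grouped by sentence rather than flattened.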

### Save model for future use

In [53]:
model.save_pretrained('/content/drive/MyDrive/mymodels')