# BERT NER Model Training

We fine-tune a pretrained model, specifically [DistilBERT](https://huggingface.co/distilbert-base-uncased). Starting from a pretrained model reduces computation costs and your carbon footprint, and it lets you use a state-of-the-art model without having to train one from scratch.




In [None]:
PRETRAINED_MODEL = 'distilbert-base-uncased'

In [None]:
!pip install transformers[torch] torch datasets evaluate seqeval

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting accelerate>=0.20.3 (from transformers[torch])
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## Prepare Dataset

Before fine-tuning a pretrained model, we download a dataset and prepare it for training. The dataset is stored in Google Drive, so we pull it from there and unzip it.

Our dataset uses the following BIO tags:

```
# BIO Tags
bio_tags = [
    "O",
    "B-TOPIC",
    "I-TOPIC"
]
```
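
To make the tagging scheme concrete, the following minimal sketch groups a tag sequence into topic phrases: `B-TOPIC` opens a span, `I-TOPIC` extends it, and `O` closes it. The helper name `bio_to_spans` is an illustrative assumption and not part of the dataset tooling; the example words follow the sample record shown later in this notebook.

```python
def bio_to_spans(words, tags):
    """Group words into topic phrases according to their BIO tags."""
    spans, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-TOPIC":
            # A new topic starts; close any span that is still open
            if current:
                spans.append(" ".join(current))
            current = [word]
        elif tag == "I-TOPIC" and current:
            # Continue the currently open topic
            current.append(word)
        else:
            # "O" (or a stray "I-TOPIC" with no open span) ends the topic
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


words = ["A", "decentralized", "MAC", "layer", "protocol"]
tags = ["O", "O", "B-TOPIC", "I-TOPIC", "O"]
print(bio_to_spans(words, tags))  # ['MAC layer']
```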



In [None]:
!gdown "https://drive.google.com/uc?export=download&id=1yFN3KFhbKMp8aNguVxz96b8iGu98OtoH"


Traceback (most recent call last):
  File "/usr/local/bin/gdown", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/gdown/cli.py", line 151, in main
    filename = download(
  File "/usr/local/lib/python3.10/dist-packages/gdown/download.py", line 203, in download
    filename_from_url = m.groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'


In [None]:
!gdown "https://drive.google.com/uc?export=download&id=1tDA7mPoLMG0PxHJ_vGDQiRpkU38vtlja"

Downloading...
From: https://drive.google.com/uc?export=download&id=1tDA7mPoLMG0PxHJ_vGDQiRpkU38vtlja
To: /content/bio_tagged_dataset.jsonl.zip
  0% 0.00/23.7M [00:00<?, ?B/s] 38% 8.91M/23.7M [00:00<00:00, 84.7MB/s]100% 23.7M/23.7M [00:00<00:00, 127MB/s] 


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Unzip the file
!unzip /content/bio_tagged_gold_dataset.jsonl.zip -d /content

unzip:  cannot find or open /content/bio_tagged_gold_dataset.jsonl.zip, /content/bio_tagged_gold_dataset.jsonl.zip.zip or /content/bio_tagged_gold_dataset.jsonl.zip.ZIP.


### Split if required (Currently NOT)
The full dataset is large, and the limited resources that Colab provides are not enough to process and train on it in one go. If needed, we therefore split it into a test set and several training chunks.



In [None]:
# import json
# import os

# TRAIN_CHUNK_SIZE = 30000
# TEST_CHUNK_SIZE = 30000
# DATASET_PATH = '/content/bio_tagged_dataset.jsonl'
# PATH_OF_DATASET_CHUNKS = '/content/drive/MyDrive/MasterThesis/bert/dataset'

# def split_data(file_path, train_chunk_size=TRAIN_CHUNK_SIZE, test_size=TEST_CHUNK_SIZE, output_base_path=PATH_OF_DATASET_CHUNKS):
#     with open(file_path, 'r') as file:
#         lines = file.readlines()

#     test_data = lines[:test_size]
#     train_data = lines[test_size:]

#     # Ensure the output directory exists
#     os.makedirs(output_base_path, exist_ok=True)

#     # Save test data
#     with open(os.path.join(output_base_path, 'test_data.jsonl'), 'w') as file:
#         file.writelines(test_data)

#     # Split and save train data into chunks
#     for i in range(0, len(train_data), train_chunk_size):
#         chunk_file = os.path.join(output_base_path, f'train_data_chunk_{i//train_chunk_size}.jsonl')
#         with open(chunk_file, 'w') as file:
#             file.writelines(train_data[i:i+train_chunk_size])

# # Example usage
# split_data(DATASET_PATH)


### Load and Visualize Data Sample
```
# Sample dataset
 [{ "id": 30001, "text": "A decentralized MAC layer protocol ...","bio_tags": ["O", "O", "B-TOPIC", "I-TOPIC", "O"]},...]
```



In [None]:
from datasets import load_dataset
import json

def load_and_visualize_data(file_path, num_samples=1):
    # Load the dataset
    dataset = load_dataset('json', data_files=file_path, split='train')

    # Visualize a few samples
    for i in range(num_samples):
        sample = dataset[i]
        print(json.dumps(sample, indent=2))

In [None]:
TRAIN_DATA_PATH = '/content/bio_tagged_gold_dataset.jsonl'
# Example usage: Adjust the file path to point to one of your training data chunks
load_and_visualize_data(TRAIN_DATA_PATH)

Generating train split: 0 examples [00:00, ? examples/s]

{
  "id": "c3a7323cc4e0f762bf9feb042d15856d",
  "text": "Recommender systems with social regularization Although Recommender Systems have been comprehensively analyzed in the past decade , the study of social-based recommender systems just started . In this paper , aiming at providing a general method for improving recommender systems by incorporating social network information , we propose a matrix factorization framework with social regularization . The contributions of this paper are four-fold : ( 1 ) We elaborate how social network information can benefit recommender systems ; ( 2 ) We interpret the differences between social-based recommender systems and trust-aware recommender systems ; ( 3 ) We coin the term Social Regularization to represent the social constraints on recommender systems , and we systematically illustrate how to design a matrix factorization objective function with social regularization ; and ( 4 ) The proposed method is quite general , which can be easily exten

## Preprocess Dataset

1. Tokenize: We tokenize our dataset with the WordPiece tokenizer, which breaks words down into smaller units (subwords). This converts the raw text into a format that BERT can understand. `AutoTokenizer` automatically loads the tokenizer that matches the DistilBERT model.

2. Align labels with tokens: Because the tokenizer may split a word into multiple tokens (subwords), our BIO tags may no longer line up with the tokens, so we need to align the BIO annotations with them.
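
As a rough illustration of how WordPiece splits a word, here is a minimal greedy longest-match-first sketch. The tiny vocabulary `TOY_VOCAB` and the function `wordpiece_split` are hypothetical; the real DistilBERT tokenizer uses a vocabulary of roughly 30,000 entries and applies additional normalization.

```python
# Hypothetical toy vocabulary; continuation pieces carry a "##" prefix
TOY_VOCAB = {"social", "book", "##mark", "##ing", "web", "[UNK]"}


def wordpiece_split(word, vocab=TOY_VOCAB):
    """Split one lowercased word into subwords, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched, so the whole word is treated as unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


print(wordpiece_split("bookmarking"))  # ['book', '##mark', '##ing']
print(wordpiece_split("web"))          # ['web']
```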


### Load Tokenizer

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL) #DistilBERT Model


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Define function to tokenize and align the label
When we tokenize the `text` of our dataset, it adds some special tokens [CLS] and [SEP] and creates a mismatch between the `text` and `bio_tags`. A single word corresponding to a single label may now be split into two subwords. Hence, we need to align the labels as follows:

1. Mapping all tokens to their corresponding word with the word_ids method.
2. Assigning the label -100 to the special tokens [CLS] and [SEP] so they’re ignored by the PyTorch loss function (see CrossEntropyLoss).
3. Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.

Also, because our labels (`bio_tags`) are strings, we need to map each one to an integer for computation.
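
The three steps can be traced on a hand-written example. In this sketch the `word_ids` list is written out by hand to mimic what a fast tokenizer returns for `[CLS] social book ##mark ##ing [SEP]`; the helper `align_labels` is a hypothetical name used only for illustration and assumes one tag per word.

```python
label_map = {"O": 0, "B-TOPIC": 1, "I-TOPIC": 2}


def align_labels(word_ids, bio_tags):
    """Return one label id per token; -100 marks tokens the loss should skip."""
    label_ids, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            # Special token such as [CLS] or [SEP]
            label_ids.append(-100)
        elif word_idx != previous:
            # First subword of a word carries the word's label
            label_ids.append(label_map[bio_tags[word_idx]])
        else:
            # Later subword of the same word
            label_ids.append(-100)
        previous = word_idx
    return label_ids


# Tokens:  [CLS] social book ##mark ##ing [SEP]
word_ids = [None, 0, 1, 1, 1, None]
bio_tags = ["B-TOPIC", "I-TOPIC"]  # one tag per word: "social", "bookmarking"
print(align_labels(word_ids, bio_tags))  # [-100, 1, 2, -100, -100, -100]
```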



In [None]:
# Define the label mapping
label_map = {'O': 0, 'B-TOPIC': 1, 'I-TOPIC': 2}

# Function to tokenize and align labels
def tokenize_and_align_labels(batch):
    # Assumption: `bio_tags` holds one tag per whitespace-separated token of `text`,
    # so we split the text ourselves and tell the tokenizer the input is pre-split.
    words = [text.split() for text in batch["text"]]
    # No padding here: the data collator pads each batch dynamically later on
    tokenized_inputs = tokenizer(words, is_split_into_words=True, truncation=True)

    labels = []
    for i, bio_tags in enumerate(batch["bio_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens ([CLS], [SEP]) are ignored by the loss
                label_ids.append(-100)
            elif word_idx != previous_word_idx and word_idx < len(bio_tags):
                # First subword of a word: convert label string to integer
                label_ids.append(label_map[bio_tags[word_idx]])
            else:
                # Remaining subwords (or words without a tag) are ignored
                label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

### Load, preprocess, and split the dataset

In [None]:
from datasets import load_dataset, DatasetDict

raw_dataset = load_dataset('json', data_files=TRAIN_DATA_PATH)
train_test_split = raw_dataset['train'].train_test_split(test_size=0.2)
split_dataset = DatasetDict({'train': train_test_split['train'], 'validation': train_test_split['test']})

# Preprocess the entire dataset using map; batched processing speeds this up
tokenized_dataset = split_dataset.map(tokenize_and_align_labels, batched=True)

ValueError: test_size=0 should be either positive and smaller than the number of samples 70 or a float in the (0, 1) range

### Dynamic Padding

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
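
Dynamic padding means each batch is padded only to the length of its longest sequence rather than to a global maximum. The sketch below imitates that idea in plain Python. `pad_batch` is a hypothetical helper, the token ids are made up for illustration, and the pad token id of 0 is an assumption. Labels are padded with -100 so that padded positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example in the batch to the longest sequence in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Illustrative ids only, not real tokenizer output
features = [
    {"input_ids": [101, 2591, 102], "labels": [-100, 1, -100]},
    {"input_ids": [101, 4773, 5530, 2015, 102], "labels": [-100, 1, 2, -100, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 2591, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
print(batch["labels"][0])          # [-100, 1, -100, -100, -100]
```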

## Evaluation
Including a metric during training is often helpful for evaluating your model’s performance. For this task, we use the seqeval framework, which produces several scores: precision, recall, F1, and accuracy.

In [None]:
import evaluate
import numpy as np
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

seqeval = evaluate.load("seqeval")

label_list = list(label_map.keys()) #['O', 'B-TOPIC', 'I-TOPIC']

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        "accuracy": accuracy_score(true_labels, true_predictions),
    }
    return results

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]
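
seqeval computes precision, recall, and F1 over whole entity spans, while accuracy is computed per token. This is why accuracy can stay high while the other scores are zero when a model predicts only `O`. The following sketch mimics that behaviour in plain Python under simplifying assumptions (a stray `I-TOPIC` without a preceding `B-TOPIC` is ignored). `extract_spans` and `span_f1` are hypothetical helpers, not seqeval's implementation.

```python
def extract_spans(tags):
    """Return the set of (start, end) index pairs of TOPIC spans in a tag list."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        if start is not None and tag != "I-TOPIC":
            spans.add((start, i))
            start = None
        if tag == "B-TOPIC":
            start = i
    return spans


def span_f1(true_tags, pred_tags):
    """Entity-level F1: a prediction counts only if the whole span matches."""
    true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


true_tags = ["O", "O", "B-TOPIC", "I-TOPIC", "O", "O", "O", "O", "O", "O"]
all_o = ["O"] * 10  # a model that never predicts a topic
token_accuracy = sum(t == p for t, p in zip(true_tags, all_o)) / len(true_tags)
print(token_accuracy)             # 0.8
print(span_f1(true_tags, all_o))  # 0.0
```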

## Train

In [None]:
label2id = label_map;
id2label = {v: k for k, v in label_map.items()}
print(label2id)
print(id2label)

{'O': 0, 'B-TOPIC': 1, 'I-TOPIC': 2}
{0: 'O', 1: 'B-TOPIC', 2: 'I-TOPIC'}


In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

MODEL_OUTPUT_PATH = '/content/drive/MyDrive/MasterThesis/distilbert_ner/gold_dataset/'

label2id = label_map; #{'O': 0, 'B-TOPIC': 1, 'I-TOPIC': 2}
id2label = {v: k for k, v in label_map.items()} #{0: 'O', 1: 'B-TOPIC', 2: 'I-TOPIC'}

model = AutoModelForTokenClassification.from_pretrained(PRETRAINED_MODEL, num_labels=3, id2label=id2label, label2id=label2id)

# Define training arguments
training_args = TrainingArguments(
    output_dir="cso_ner_gold",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=15,
    weight_decay=0.01
)

# Initialize the Trainer with training and evaluation datasets
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],  # Training dataset
    eval_dataset=tokenized_dataset["validation"],  # Evaluation dataset
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

# Push to the Hugging Face Hub if logged in
trainer.push_to_hub()


# Save the model and tokenizer to Google Drive
model.save_pretrained(MODEL_OUTPUT_PATH)
tokenizer.save_pretrained(MODEL_OUTPUT_PATH)

# Note: the tokenizer needs to be uploaded manually from Google Drive to Hugging Face


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.580928 | 0.0 | 0.0 | 0.0 | 0.952031 |
| 2 | No log | 0.374916 | 0.0 | 0.0 | 0.0 | 0.952031 |
| 3 | No log | 0.266298 | 0.0 | 0.0 | 0.0 | 0.952031 |
| 4 | No log | 0.234569 | 0.0 | 0.0 | 0.0 | 0.952031 |
| 5 | No log | 0.233315 | 0.0 | 0.0 | 0.0 | 0.952031 |
| 6 | No log | 0.233776 | 0.0 | 0.0 | 0.0 | 0.952031 |
| 7 | No log | 0.230937 | 0.0 | 0.0 | 0.0 | 0.952031 |
| 8 | No log | 0.226727 | 0.0 | 0.0 | 0.0 | 0.952031 |
| 9 | No log | 0.223105 | 0.0 | 0.0 | 0.0 | 0.952031 |
| 10 | No log | 0.220842 | 0.0 | 0.0 | 0.0 | 0.952031 |



('/content/drive/MyDrive/MasterThesis/distilbert_ner/gold_dataset/tokenizer_config.json',
 '/content/drive/MyDrive/MasterThesis/distilbert_ner/gold_dataset/special_tokens_map.json',
 '/content/drive/MyDrive/MasterThesis/distilbert_ner/gold_dataset/vocab.txt',
 '/content/drive/MyDrive/MasterThesis/distilbert_ner/gold_dataset/added_tokens.json',
 '/content/drive/MyDrive/MasterThesis/distilbert_ner/gold_dataset/tokenizer.json')

## Inference


('/content/drive/MyDrive/MasterThesis/distilbert_ner/v5/tokenizer_config.json',
 '/content/drive/MyDrive/MasterThesis/distilbert_ner/v5/special_tokens_map.json',
 '/content/drive/MyDrive/MasterThesis/distilbert_ner/v5/vocab.txt',
 '/content/drive/MyDrive/MasterThesis/distilbert_ner/v5/added_tokens.json',
 '/content/drive/MyDrive/MasterThesis/distilbert_ner/v5/tokenizer.json')

In [None]:
from transformers import pipeline

model_path = '/content/drive/MyDrive/MasterThesis/distilbert_ner/v5/'
classifier = pipeline("ner", model=model_path)

results = classifier("The success of the Semantic Web depends on the availability of Web pages annotated with metadata. Free form metadata or tags, as used in social bookmarking and folksonomies, have become more and more popular and successful. Such tags are relevant keywords associated with or assigned to a piece of information (e.g., a Web page), describing the item and enabling keyword-based classification. In this paper we propose P-TAG, a method which automatically generates personalized tags for Web pages. Upon browsing a Web page, P-TAG produces keywords relevant both to its textual content, but also to the data residing on the surfer's Desktop, thus expressing a personalized viewpoint. Empirical evaluations with several algorithms pursuing this approach showed very promising results. We are therefore very confident that such a user oriented automatic tagging approach can provide large scale personalized metadata annotations as an important step towards realizing the Semantic Web.")
print(results)


[{'entity': 'B-TOPIC', 'score': 0.9951733, 'index': 5, 'word': 'semantic', 'start': 19, 'end': 27}, {'entity': 'I-TOPIC', 'score': 0.9931439, 'index': 6, 'word': 'web', 'start': 28, 'end': 31}, {'entity': 'B-TOPIC', 'score': 0.6034693, 'index': 12, 'word': 'web', 'start': 63, 'end': 66}, {'entity': 'I-TOPIC', 'score': 0.59571624, 'index': 13, 'word': 'pages', 'start': 67, 'end': 72}, {'entity': 'B-TOPIC', 'score': 0.96885026, 'index': 29, 'word': 'social', 'start': 137, 'end': 143}, {'entity': 'I-TOPIC', 'score': 0.96392673, 'index': 30, 'word': 'book', 'start': 144, 'end': 148}, {'entity': 'I-TOPIC', 'score': 0.96806115, 'index': 31, 'word': '##mark', 'start': 148, 'end': 152}, {'entity': 'I-TOPIC', 'score': 0.9144426, 'index': 32, 'word': '##ing', 'start': 152, 'end': 155}, {'entity': 'B-TOPIC', 'score': 0.49843606, 'index': 132, 'word': 'the', 'start': 618, 'end': 621}]
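
The raw pipeline output has one entry per subword token (for example `book`, `##mark`, `##ing`). One way to read it is to merge each `B-TOPIC` token with the adjacent `I-TOPIC` tokens that follow it, using the `start` and `end` character offsets. The sketch below does this in plain Python; `group_entities` is a hypothetical helper, and `text` is a shortened stand-in for the input string with the first four predictions copied from the output above. The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a comparable grouping.

```python
def group_entities(text, token_results):
    """Merge consecutive B-/I-TOPIC tokens into character-level spans."""
    spans = []
    for tok in token_results:
        # A continuation must directly follow the previous span
        # (same offset for subwords, one space apart for whole words)
        adjacent = bool(spans) and tok["start"] - spans[-1]["end"] <= 1
        if tok["entity"] == "I-TOPIC" and adjacent:
            spans[-1]["end"] = tok["end"]
        else:
            # "B-TOPIC" (or an isolated "I-TOPIC") opens a new span
            spans.append({"start": tok["start"], "end": tok["end"]})
    return [text[s["start"]:s["end"]] for s in spans]


text = "The success of the Semantic Web depends on the availability of Web pages annotated with metadata."
token_results = [
    {"entity": "B-TOPIC", "word": "semantic", "start": 19, "end": 27},
    {"entity": "I-TOPIC", "word": "web", "start": 28, "end": 31},
    {"entity": "B-TOPIC", "word": "web", "start": 63, "end": 66},
    {"entity": "I-TOPIC", "word": "pages", "start": 67, "end": 72},
]
print(group_entities(text, token_results))  # ['Semantic Web', 'Web pages']
```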
