# Using Hugging Face Transformers

# Seqeval
Used to evaluate the performance of named entity recognition. Given the true and the predicted label sequences for a tokenized input, it scores how well the predicted entities match the true ones. It provides performance metrics such as ```accuracy_score, f1_score, precision_score, recall_score```
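To make the idea of entity-level scoring concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not seqeval's actual code: it assumes BIO-style tags (`B-PER`, `I-PER`, `O`), ignores stray `I-` tags that have no preceding `B-`, and the helper names `extract_spans` and `entity_prf` are made up for this example.

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from a list of BIO tags."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel "O" closes the last span
        inside = tag.startswith("I-") and tag[2:] == current
        if current is not None and not inside:
            spans.add((current, start, i - 1))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans


def entity_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exactly matching entity spans."""
    true_spans, pred_spans = set(), set()
    for sent_id, (t, p) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(sent_id, *s) for s in extract_spans(t)}
        pred_spans |= {(sent_id, *s) for s in extract_spans(p)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: the prediction finds the person but misses the location.
y_true = [["O", "B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["O", "B-PER", "I-PER", "O", "O"]]
precision, recall, f1 = entity_prf(y_true, y_pred)
print(precision, recall, round(f1, 4))  # 1.0 0.5 0.6667
```

An entity only counts as correct when its type and both boundaries match, which is why entity-level precision and recall are usually lower than token-level accuracy.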


In [1]:
!pip install transformers datasets seqeval torch




# Imports

* AutoTokenizer --> converts input text into tokens and maps each token to a numeric ID from the model's vocabulary.
Example:- for ```I'm Krutika``` the tokenizer only produces token IDs. The entity tags come from the dataset labels: a tag sequence such as [0, 5] means that ```I'm``` is not an entity (tag 0), while ```Krutika``` is a name and carries an entity tag.

* AutoModelForTokenClassification --> For labelling the input. It predicts a label (LABEL_0, LABEL_1, etc.) for each token of the input text

* Trainer --> Training, saving and model evaluation

*   TrainingArguments --> Sets the training hyperparameters of the model, such as the number of epochs, the learning rate and the batch size

*   DataCollatorForTokenClassification --> prepares batches of tokens before training the model

* Accuracy metrics for evaluating performance
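As a rough illustration of what the tokenizer step does, the sketch below applies greedy longest-match subword splitting (the WordPiece idea used by BERT-style tokenizers) to a tiny, hypothetical vocabulary. The vocabulary, the IDs and the function name `wordpiece` are assumptions for this example. The real `bert-base-cased` vocabulary is far larger and will split words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest sub-word pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no piece matches: the whole word becomes unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary: token -> numeric ID
vocab = {"[UNK]": 0, "I": 1, "'": 2, "m": 3, "Kr": 4, "##uti": 5, "##ka": 6}

pieces = wordpiece("Krutika", vocab)
ids = [vocab[p] for p in pieces]
print(pieces, ids)  # ['Kr', '##uti', '##ka'] [4, 5, 6]
```

The numeric IDs are only vocabulary indices. They carry no notion of importance, and the entity labels are attached separately from the dataset.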




In [2]:
import numpy as np
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments
from transformers import DataCollatorForTokenClassification
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score


# Entities in Dataset

Name, Location, Organization, Date_Time, Geopolitical Entity, Natural Phenomenon

In [3]:
dataset = load_dataset("rjac/kaggle-entity-annotated-corpus-ner-dataset")




In [4]:


print(dataset['train'][0])

# Check the dataset column names
print(dataset['train'].column_names)

{'sentence_id': ' 1', 'tokens': ['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through', 'London', 'to', 'protest', 'the', 'war', 'in', 'Iraq', 'and', 'demand', 'the', 'withdrawal', 'of', 'British', 'troops', 'from', 'that', 'country', '.'], 'ner_tags': [0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0]}
['sentence_id', 'tokens', 'ner_tags']


# Using pre-trained model ```bert-base-cased```

```tokenize_and_align_labels``` --> I first process each example in the dataset and make them all the same length by padding to the maximum length. The data is already split into individual words, so the tokenizer is called with ```is_split_into_words=True```.

Then I assigned a label to each token based on the word index (```word_ids```) of that token. If the word index is ```None```, the token is a special or padding token, so it gets the label ```-100```, which the loss function ignores.


Once the label array had all the necessary labels, I mapped the function over the dataset to get the tokenized inputs, that is ```input_ids```, ```token_type_ids```, ```attention_mask``` and ```labels```.

Lastly, I converted the tokenized dataset to PyTorch tensors.
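The alignment step can be tried in isolation with a toy example. This is a minimal sketch with hand-written `word_ids`: in the notebook they come from the tokenizer, and the values and the helper name `align_labels` are assumptions made here for illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give every token the label of its source word; special/padding tokens get ignore_index."""
    return [ignore_index if w is None else word_labels[w] for w in word_ids]


# Three words with labels [0, 5, 0]; the second word was split into two sub-word tokens.
# None marks special tokens ([CLS], [SEP]) and padding.
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = [0, 5, 0]

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 0, 5, 5, 0, -100, -100]
```

Every sub-word token inherits the label of its word. A common alternative is to label only the first sub-word and set the rest to `-100`.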

In [5]:
# dataset = load_dataset("conll2003")


tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, padding="max_length", is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = [-100 if word_id is None else label[word_id] for word_id in word_ids]
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
# for conll2003
# tokenized_datasets = tokenized_datasets.remove_columns(["tokens", "pos_tags", "chunk_tags", "ner_tags"])

# for kaggle
tokenized_datasets = tokenized_datasets.remove_columns(['sentence_id', 'tokens', 'ner_tags'])
tokenized_datasets.set_format("torch")


In [6]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 47959
    })
})

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence_id', 'tokens', 'ner_tags'],
        num_rows: 47959
    })
})

# Number of Unique Labels

In [8]:
# determine the number of unique labels in kaggle's new dataset

unique_tags = set(tag for doc in dataset['train']['ner_tags'] for tag in doc)
num_labels = len(unique_tags)
print("Number of unique labels:", num_labels)


Number of unique labels: 17


# Model Initialization

In [9]:
# 9 for CoNLL-2003 labels
# model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

# for kaggle
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=num_labels)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
!pip install accelerate -U
!pip install torch -U
from transformers import TrainingArguments, DataCollatorForTokenClassification, Trainer



# Fine-Tuning Parameters

In [11]:
training_args = TrainingArguments(
    output_dir="./ner_results",
    num_train_epochs=3,
    per_device_train_batch_size=16, # samples per device in each training step
    per_device_eval_batch_size=16,  # samples per device in each evaluation step
    warmup_steps=100,    # learning rate increases linearly for the first 100 steps --> stabilizes the early training phase
    weight_decay=0.01,   # regularization that keeps the model's weights from growing too large
    logging_dir="./ner_logs",
    learning_rate=1e-4,
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",  # evaluates model after each epoch
    # eval_steps=500,      # only if eval strategy is steps
    logging_steps=100,     # log training metrics every 100 steps
    # save_strategy="steps", # only if eval strategy is steps
    gradient_accumulation_steps=4,  # increase effective batch size without increasing memory requirement
    fp16=True,  # mixed precision: reduces memory use and speeds up training
)
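To see how these settings interact, the sketch below works out the effective batch size and a linear warmup/decay learning-rate curve. It is an approximation under stated assumptions: a single device, about 38,367 training rows (80% of the 47,959 rows), and a linear schedule. The function names are made up for this example, and the Trainer's own step count may differ slightly because of rounding.

```python
import math


def effective_batch_size(per_device_batch, accumulation_steps, n_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * accumulation_steps * n_devices


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


batch = effective_batch_size(16, 4)           # 16 samples x 4 accumulation steps
steps_per_epoch = math.ceil(38367 / batch)    # assumed training-set size, see above
total_steps = 3 * steps_per_epoch             # num_train_epochs=3

print(batch, steps_per_epoch, total_steps)    # 64 600 1800
for step in (0, 50, 100, 950, total_steps):
    print(step, linear_schedule_lr(step, 1e-4, 100, total_steps))
```

With roughly 1,800 updates in total, the 100 warmup steps cover only the first few percent of training.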


# DataCollatorForTokenClassification --> pads each batch and distinguishes between actual data and added padding

In [12]:
data_collator = DataCollatorForTokenClassification(tokenizer)
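The idea behind the collator can be sketched in plain Python: pad every example in a batch to the longest one, mark the padding in the attention mask, and pad the labels with `-100` so padding never counts towards the loss. This is an illustration only, not the library's implementation. The function name `pad_batch` and the token IDs are hypothetical.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)  # 1 = real token, 0 = padding
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two toy examples of different lengths (hypothetical token IDs)
features = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 0, 5, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 2, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
print(batch["labels"])          # [[-100, 0, 5, -100], [-100, 2, -100, -100]]
```

Since the examples in this notebook were already padded with `padding="max_length"` during tokenization, the collator mostly just stacks them into a batch.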


In [13]:
# Only for kaggle's ds as it has no train test splits

print(tokenized_datasets)


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 47959
    })
})


# Train-Test Splits

In [14]:
from datasets import DatasetDict

# Split the "train" dataset into training and validation sets
train_test_split = tokenized_datasets["train"].train_test_split(test_size=0.2)  # 80 % train, 20 % test
# Assign the new splits back to the tokenized_datasets
tokenized_datasets = DatasetDict({
    'train': train_test_split['train'],
    'validation': train_test_split['test']
})
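The split is random, so each run produces a different validation set unless a seed is fixed (`train_test_split` accepts a `seed` argument for this). The sketch below mimics a seeded shuffle-and-split on row indices to show the resulting sizes. It is a stand-in written for illustration, with a hypothetical helper name, and does not reproduce the library's exact shuffling.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off the first test_size share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, val_idx = split_indices(47959, test_size=0.2, seed=42)
print(len(train_idx), len(val_idx))  # 38367 9592

# The same seed gives the same split on every run.
assert split_indices(47959, seed=42) == (train_idx, val_idx)
```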


# Model Training

In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)


In [16]:
# for kaggle, running train and eval after defining compute_metrics to save time
# trainer.train()
# trainer.evaluate()


# Defining Seqeval Performance Metrics

In [17]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [model.config.id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [model.config.id2label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return {
        "accuracy": accuracy_score(true_labels, true_predictions),
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }
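The filtering inside `compute_metrics` can be checked on a single toy sentence without NumPy. This is a minimal sketch: the logits, the three-label `id2label` map and the function name `decode_predictions` are made-up values for illustration.

```python
def decode_predictions(logits, label_ids, id2label, ignore_index=-100):
    """Take the argmax per token and drop positions whose label is ignore_index."""
    pred_tags, true_tags = [], []
    for token_logits, label in zip(logits, label_ids):
        if label == ignore_index:  # special or padding token: skipped in the metrics
            continue
        best = max(range(len(token_logits)), key=token_logits.__getitem__)
        pred_tags.append(id2label[best])
        true_tags.append(id2label[label])
    return pred_tags, true_tags


id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}
logits = [
    [2.0, 0.1, 0.1],  # special token, ignored
    [0.2, 3.0, 0.1],  # argmax 1
    [0.1, 0.2, 1.5],  # argmax 2
    [9.0, 0.0, 0.0],  # padding, ignored
]
label_ids = [-100, 1, 0, -100]

pred_tags, true_tags = decode_predictions(logits, label_ids, id2label)
print(pred_tags)  # ['LABEL_1', 'LABEL_2']
print(true_tags)  # ['LABEL_1', 'LABEL_0']
```

Only the two real tokens reach the metric functions. One of the two predictions is right, so token-level accuracy on this toy sentence would be 0.5.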


In [18]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 0 | 0.133 | 0.122094 | 0.960283 | 0.86852 | 0.877037 | 0.872758 |
| 2 | 0.0599 | 0.118816 | 0.964006 | 0.881496 | 0.89234 | 0.886885 |




{'eval_loss': 0.11881628632545471,
 'eval_accuracy': 0.9640062081035564,
 'eval_precision': 0.8814961647849694,
 'eval_recall': 0.8923399057756064,
 'eval_f1': 0.8868848905267722,
 'eval_runtime': 119.4574,
 'eval_samples_per_second': 80.296,
 'eval_steps_per_second': 5.023,
 'epoch': 3.0}

In [19]:
print(model.config.id2label)


{0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2', 3: 'LABEL_3', 4: 'LABEL_4', 5: 'LABEL_5', 6: 'LABEL_6', 7: 'LABEL_7', 8: 'LABEL_8', 9: 'LABEL_9', 10: 'LABEL_10', 11: 'LABEL_11', 12: 'LABEL_12', 13: 'LABEL_13', 14: 'LABEL_14', 15: 'LABEL_15', 16: 'LABEL_16'}


# Redact PII

Create an NER pipeline to redact specific information.

```aggregation_strategy``` --> If a word is split into several tokens that get different labels, the token with the maximum score determines the label of the whole word
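A rough sketch of the "max" idea for a single word: merge its sub-word tokens and keep the label of the highest-scoring token. The token dictionaries, scores and the function name below are hypothetical and simplified. The real pipeline also tracks character offsets and groups neighbouring words into entities.

```python
def aggregate_word_max(token_predictions):
    """Merge the sub-word tokens of one word; the highest-scoring token decides the label."""
    best = max(token_predictions, key=lambda t: t["score"])
    word = "".join(
        t["token"][2:] if t["token"].startswith("##") else t["token"]
        for t in token_predictions
    )
    return {"word": word, "entity": best["entity"], "score": best["score"]}


# Hypothetical predictions for the two sub-word tokens of "GreenTech"
tokens = [
    {"token": "Green", "entity": "LABEL_3", "score": 0.91},
    {"token": "##Tech", "entity": "LABEL_0", "score": 0.55},
]
result = aggregate_word_max(tokens)
print(result)  # {'word': 'GreenTech', 'entity': 'LABEL_3', 'score': 0.91}
```

Without aggregation, the two sub-word tokens would be reported separately and could end up with conflicting labels.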

In [20]:
from transformers import pipeline

def redact_personal_info(text, model, tokenizer):
    ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="max")

    entities = ner_pipeline(text)
    redacted_text = text

    for entity in reversed(entities):
        # Ensure you're using the correct entity types for redaction
        if entity['entity_group'] in ['LABEL_1','LABEL_2','LABEL_3', 'LABEL_4','LABEL_5','LABEL_6','LABEL_9','LABEL_10']:
            start = entity['start']
            end = entity['end']
            redacted_text = redacted_text[:start] + "[REDACTED]" + redacted_text[end:]

    return redacted_text, entities
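The loop walks the entities in reverse because each replacement changes the length of the string. Replacing from the end keeps the `start`/`end` offsets of the earlier entities valid. The sketch below shows this with hand-written character spans (hypothetical values and function names, no model involved) and contrasts it with a naive front-to-back pass.

```python
def redact_spans(text, spans, mask="[REDACTED]"):
    """Replace (start, end) character spans, working from the end of the text backwards."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


def redact_spans_forward(text, spans, mask="[REDACTED]"):
    """Naive front-to-back version: later offsets go stale after the first replacement."""
    for start, end in sorted(spans):
        text = text[:start] + mask + text[end:]
    return text


text = "Emily Green visited Denver"
spans = [(0, 11), (20, 26)]  # "Emily Green" and "Denver"

print(redact_spans(text, spans))          # [REDACTED] visited [REDACTED]
print(redact_spans_forward(text, spans))  # offsets shifted, so part of "Denver" survives
```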


# Trying the model on sample input text

In [21]:
# sample_text = "In the heart of the Himalayas, a team from the World Health Organization led by Ava Smith embarked on a mission to study the effects of climate change on local communities. Their research, conducted during the challenging winter season of 2021, involved the use of advanced drones to map the changing landscape. This expedition coincided with the annual Snow Leopard Festival, a significant event that celebrates the region's commitment to preserving its unique wildlife and natural beauty."
sample_text = "Dr. Emily Green visited the Rocky Mountains with representatives from GreenTech Innovations to discuss climate change impacts at the Global Environmental Summit held in Denver last March. They showcased a new eco-friendly device designed to monitor atmospheric conditions in sensitive ecosystems."

redacted_text, entities = redact_personal_info(sample_text, model, tokenizer)

print("Original Text:", sample_text)
print("Redacted Text:", redacted_text)
print("Entities:")
for entity in entities:
    print(f"Text: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.4f}")


Original Text: Dr. Emily Green visited the Rocky Mountains with representatives from GreenTech Innovations to discuss climate change impacts at the Global Environmental Summit held in Denver last March. They showcased a new eco-friendly device designed to monitor atmospheric conditions in sensitive ecosystems.
Redacted Text: [REDACTED] [REDACTED] visited the [REDACTED] [REDACTED] with representatives from [REDACTED] [REDACTED] to discuss climate change impacts at the Global Environmental Summit held in [REDACTED] last [REDACTED]. They showcased a new eco-friendly device designed to monitor atmospheric conditions in sensitive ecosystems.
Entities:
Text: Dr., Label: LABEL_1, Score: 0.9991
Text: Emily Green, Label: LABEL_2, Score: 0.9671
Text: visited the, Label: LABEL_0, Score: 0.9976
Text: Rocky, Label: LABEL_5, Score: 0.8741
Text: Mountains, Label: LABEL_6, Score: 0.8780
Text: with representatives from, Label: LABEL_0, Score: 0.9999
Text: GreenTech, Label: LABEL_3, Score: 0.9978
Text: 