# **Optuna for NER**
### **Searching for the best hyperparameters**

Optuna is a Python library for automatic hyperparameter optimization.

For each hyperparameter we define a search range, then run the search so that Optuna finds the combination that yields the highest recall for the ARCH entity on the validation set.
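
To make the idea concrete, here is a minimal sketch of a hyperparameter search loop written with the standard library only. It uses plain random search as a stand-in for Optuna's sampler, and `toy_recall` is a hypothetical scoring function that replaces "train the model and measure validation recall". The ranges mirror the ones used later in this notebook.

```python
import math
import random

def sample_config(rng):
    """Draw one hyperparameter combination from the search ranges."""
    return {
        # Log-uniform: the exponent is sampled uniformly, so every order of
        # magnitude inside [6e-5, 6e-4] is equally likely.
        "learning_rate": math.exp(rng.uniform(math.log(6e-5), math.log(6e-4))),
        "per_device_train_batch_size": rng.choice([8, 16]),
        "warmup_steps": rng.randint(400, 600),
    }

def toy_recall(config):
    """Hypothetical stand-in for training a model and returning its recall."""
    # Peaks when the learning rate is close to 2e-4; purely illustrative.
    distance = abs(math.log10(config["learning_rate"]) - math.log10(2e-4))
    return max(0.0, 0.7 - 0.3 * distance)

def random_search(n_trials, seed=0):
    """Try n_trials random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = toy_recall(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(n_trials=30)
print(best_config, round(best_score, 3))
```

Optuna follows the same loop of sampling, scoring and keeping the best trial, but it chooses new combinations based on the results of earlier trials rather than purely at random.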

# Environment Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/"

Mounted at /content/drive
/content/drive


In [None]:
from transformers import AutoTokenizer
checkpoint = 'dslim/distilbert-ner'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


# Data Definition

In [None]:
label_names = ['O', 'B-PER', 'I-PER', 'B-ARCH', 'I-ARCH', 'B-LOC', 'I-LOC']

id2label = {k: v for k, v in enumerate(label_names)}

label2id = {v: k for k, v in enumerate(label_names)}
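
As a quick illustration of how these two mappings are used, the sketch below encodes a hand-tagged sentence into label ids and decodes it back. The sentence and its tags are a hypothetical example written for this illustration, not taken from the dataset.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ARCH', 'I-ARCH', 'B-LOC', 'I-LOC']
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

# Hypothetical sentence, tagged by hand in the BIO scheme:
# B- marks the first token of an entity, I- marks the following tokens.
words = ["Wright", "designed", "Unity", "Temple", "in", "Oak", "Park"]
tags = ["B-PER", "O", "B-ARCH", "I-ARCH", "O", "B-LOC", "I-LOC"]

ids = [label2id[tag] for tag in tags]
print(ids)  # [1, 0, 3, 4, 0, 5, 6]

# Decoding with id2label recovers the original tags.
round_trip = [id2label[i] for i in ids]
print(round_trip == tags)  # True
```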

In [None]:
test_archs = [
    "Ste-Geneviéve", "Newgate Gaol", "Schauspielhaus", "Altes Museum", "British Museum", "Bibliotheque Ste-Geneviéve", "Palm House", "Gare de I’Est", "Streatham Street Flats", "Crystal Palace", "Bibliotheque Nationale", "‘Old English’ country house", "Le Raincy church",
    "Palais de Justice", "Galerie des Machines", "State Museums", "Burgtheater", "Swan House", "Rijksmuseum", "Winn Memorial Library", "Neue Hofburg", "Casa Vicens", "Sagrada Familia", "Marshall Field Wholesale Store", "Glessner House", "Palau Guell", "Miller House",
    "Century Guild Exhibition Stand", "Auditorium Building", "Walker Warehouse", "Oak Park house", "Oak Park studio", "Wainwright Building", "Dooly Block", "Bedford Park", "Fair Store", "Charnley House", "Standen", "Landesmuseum", "Hétel Tassel", "Moller House",
    "Transportation building", "Winslow House", "Guaranty Building", "Hotel Solvay", "church of St-Jean-de-Montmartre", "Van Eetvelde and his own house", "Luxfer Prism offices", "Amsterdam Exchange", "McAfee House", "Francisco Terrace apartments", "Heuberg Estate",
    "Ecole du Sacré Coeur", "Maison Carpeaux", "Heller and Husser Houses", "Sturgis House", "Secession Building", "Maison du Peuple", "Millbank Estate", "Glasgow School of Art", "The Barn", "Majolica House", "Humbert de Romans concert hall", "Colonia Guell", "Rufer House",
    "Goldman and Salatsch facade", "Broadleys", "Castel Henriette", "Ernst Ludwig House", "Schlesinger and Mayer department store", "Café Museum", "Heller and Husser Houses", "The Orchard", "Warren house", "House for an Art-lover", "Dana House", "Heurtley House",
    "Avenue Wagram", "pavilion for the Exhibition of Decorative Arts", "apartment building in the Rue Franklin", "Wertheim store", "Larkin Building", "Post Office Savings Bank", "Purkersdorf Sanatorium", "Grand Ducal School of Arts and Crafts", "Dumont Theatre",
    "Martin House", "Willow Tea Rooms", "Unity Temple", "Palais Stoclet", "Hardy House", "Nashdom", "Robie House", "Casa Mila", "Tietz department store", "Avery Coonley House", "American Bar", "hotel at Campo de’ Fiori", "Steiner House", "Viceroy’s House",  "Tristan Tzara house",
    "apartment block completed in Rue Vavin", "Central Station", "Monza Cemetery", "Leipzig Steel Pavilion", "Jahrhunderthalle", "Glass Pavilion", "Werkbund Theatre", "Midway Gardens", "Citta Nuova", "Twin Airship Hangars", "Fiat Works", "Hotel Imperial","Villa on the Lido"
    ]

In [None]:
import json
import string

def read_data(file_name):
    with open(file_name, 'r') as f:
        data = json.load(f)

    new_data = []

    for line in data["annotations"]:
        text = line[0]
        # Skip texts that mention any building listed in test_archs.
        if any(arch in text for arch in test_archs):
            continue
        entities = line[1]["entities"]
        sentences = text.split(".")

        # Collect the surface form of every annotated entity, grouped by label.
        entities_list = {"PER": [], "LOC": [], "ARCH": []}
        for start, end, label in entities:
            entities_list[label].append(text[start:end])

        for sentence in sentences:
            sentence = sentence.strip()
            new_data_line = ["", {"entities": []}]
            for key, ents in entities_list.items():
                for ent in ents:
                    if ent in sentence:
                        if new_data_line[0] == "":
                            # Store the sentence with its punctuation removed.
                            new_data_line[0] = sentence.translate(str.maketrans('', '', string.punctuation))
                        current_ents = " ".join([current_ent for current_ent, label in new_data_line[1]["entities"]])
                        # Do not add an entity already covered by a previous one.
                        if ent not in current_ents:
                            new_data_line[1]["entities"].append((ent.strip(), key))
            # Keep only sentences that contain at least one entity.
            if new_data_line[0] != "":
                new_data.append(new_data_line)
    return new_data

In [None]:
def tokenize_data(data):
    tokenized = []

    for line in data:
        text = line[0]
        entities = line[1]["entities"]

        tokens = tokenizer(text, return_offsets_mapping=True)

        # Every token starts as 'O'; the special tokens at both ends get -100
        # so that the loss ignores them.
        labels = [0] * len(tokens.tokens())
        labels[0] = -100
        labels[-1] = -100

        for ent, label in entities:
            start = text.find(ent)
            if start == -1:
                # The entity no longer appears verbatim (e.g. its punctuation was stripped).
                continue
            end = start + len(ent)
            for idx, (token_start, token_end) in enumerate(tokens["offset_mapping"]):
                if token_start == token_end:
                    # Special tokens have empty offsets; keep their -100 label.
                    continue
                if token_start >= start and token_end <= end:
                    # First token of the span is B-, the rest are I-.
                    prefix = "B" if token_start == start else "I"
                    labels[idx] = label2id[f"{prefix}-{label}"]

        tokenized.append({
            "input_ids": tokens["input_ids"],
            "attention_mask": tokens["attention_mask"],
            "labels": labels
        })

    return tokenized
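
The character-offset alignment can be tried in isolation. The sketch below is a simplified illustration, not the notebook's function: whitespace-separated words stand in for the subword tokenizer's `offset_mapping`, the helper `align_span` is a hypothetical name, and it also skips entities that `str.find` cannot locate.

```python
import re

def align_span(text, entity, label, offsets):
    """Return BIO labels for tokens described by (start, end) character offsets."""
    labels = ["O"] * len(offsets)
    start = text.find(entity)
    if start == -1:
        # Entity not present in the text: leave every token as "O".
        return labels
    end = start + len(entity)
    for idx, (tok_start, tok_end) in enumerate(offsets):
        # A token belongs to the entity if it lies fully inside its span.
        if tok_start >= start and tok_end <= end:
            labels[idx] = f"B-{label}" if tok_start == start else f"I-{label}"
    return labels

text = "Wright designed Unity Temple"
# Whitespace tokens stand in for the tokenizer's offset_mapping (an assumption).
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
labels = align_span(text, "Unity Temple", "ARCH", offsets)
print(list(zip(text.split(), labels)))
# [('Wright', 'O'), ('designed', 'O'), ('Unity', 'B-ARCH'), ('Temple', 'I-ARCH')]
```

With a real subword tokenizer a single word may be split into several tokens, in which case only the first piece of the entity receives the `B-` tag and all later pieces receive `I-`.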

In [None]:
import os

folder_path = "/content/drive/MyDrive/ARCHITECTURE_NER/NER/annotations_dataset"

general_data = []

# Read every JSON annotation file in the folder and merge the results.
for file_name in os.listdir(folder_path):
    if file_name.endswith(".json"):
        file_path = os.path.join(folder_path, file_name)
        data = read_data(file_path)
        general_data += data

tokenized_data = tokenize_data(general_data)

In [None]:
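from datasets import Dataset, DatasetDict

dataset = Dataset.from_list(tokenized_data)
# Hold out 20% of the sentences for validation, with a fixed seed for reproducibility.
split_data = dataset.train_test_split(test_size=0.2, seed=1234)
dataset_dict = DatasetDict({
    'train': split_data['train'],
    'validation': split_data['test']
})

print(dataset_dict)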

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2320
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 581
    })
})


In [None]:
labels_count = {v: 0 for k, v in enumerate(label_names)}

labels_count_2 = {v: 0 for k, v in enumerate(label_names)}

for row in dataset_dict['train']:
  for label in row['labels']:
    if label != -100:
      labels_count[id2label[label]] += 1

for row in dataset_dict['validation']:
  for label in row['labels']:
    if label != -100:
      labels_count_2[id2label[label]] += 1

print("Train labels:")
print(labels_count)
print("\nValidation labels:")
print(labels_count_2)

Train labels:
{'O': 68902, 'B-PER': 2776, 'I-PER': 5169, 'B-ARCH': 892, 'I-ARCH': 3201, 'B-LOC': 1042, 'I-LOC': 1109}

Validation labels:
{'O': 16931, 'B-PER': 706, 'I-PER': 1227, 'B-ARCH': 195, 'I-ARCH': 695, 'B-LOC': 231, 'I-LOC': 241}
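
The counts above show a strong class imbalance. As a small worked example, the sketch below computes the share of each entity type among the training tokens directly from the printed counts; the variable names are chosen for this illustration.

```python
# Training label counts copied from the output above.
train_counts = {'O': 68902, 'B-PER': 2776, 'I-PER': 5169, 'B-ARCH': 892,
                'I-ARCH': 3201, 'B-LOC': 1042, 'I-LOC': 1109}

total = sum(train_counts.values())

def entity_share(entity):
    """Fraction of training tokens tagged B- or I- for the given entity."""
    return (train_counts[f"B-{entity}"] + train_counts[f"I-{entity}"]) / total

arch_share = entity_share("ARCH")
o_share = train_counts['O'] / total

print(f"total tokens: {total}")
print(f"ARCH share:   {arch_share:.1%}")
print(f"O share:      {o_share:.1%}")
```

Because ARCH tokens are a small fraction of the data, a model can reach a low loss while still missing many of them, which is one reason to track ARCH recall explicitly.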


# Train

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
def precision_calculator(pred_labels: list[list[int]], true_labels: list[list[int]], entities_to_consider: list[int]) -> float:
    true_positives = 0
    false_positives = 0

    for pred_label, true_label in zip(pred_labels, true_labels):
        for pred, true in zip(pred_label, true_label):
            if pred == -100 or true == -100:
                continue

            if pred in entities_to_consider:
                if true in entities_to_consider:
                    true_positives += 1
                else:
                    false_positives += 1

    if true_positives + false_positives == 0:
        return 0.0

    return true_positives / (true_positives + false_positives)

def recall_calculator(pred_labels: list[list[int]], true_labels: list[list[int]], entities_to_consider: list[int]) -> float:
    true_positives = 0
    false_negatives = 0

    for pred_label, true_label in zip(pred_labels, true_labels):
        for pred, true in zip(pred_label, true_label):
            if pred == -100 or true == -100:
                continue

            if true in entities_to_consider:
                if pred in entities_to_consider:
                    true_positives += 1
                else:
                    false_negatives += 1

    if true_positives + false_negatives == 0:
        return 0.0

    return true_positives / (true_positives + false_negatives)

def f1_score_calculator(precision, recall) -> float:
    if precision + recall == 0:
        return 0.0

    return 2 * (precision * recall) / (precision + recall)
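
A small worked example helps to see what these token-level metrics count. The sketch below reimplements the same counting logic in compact form under hypothetical helper names (`token_recall`, `token_precision`) and applies it to a made-up prediction; ids 3 and 4 are `B-ARCH` and `I-ARCH`.

```python
def token_recall(pred, true, entity_ids):
    """Share of true entity tokens predicted as any tag of that entity."""
    hits = misses = 0
    for p, t in zip(pred, true):
        if p == -100 or t == -100:
            continue  # ignored positions (special tokens, padding)
        if t in entity_ids:
            if p in entity_ids:
                hits += 1
            else:
                misses += 1
    return hits / (hits + misses) if hits + misses else 0.0

def token_precision(pred, true, entity_ids):
    """Share of predicted entity tokens that are truly part of that entity."""
    hits = false_alarms = 0
    for p, t in zip(pred, true):
        if p == -100 or t == -100:
            continue
        if p in entity_ids:
            if t in entity_ids:
                hits += 1
            else:
                false_alarms += 1
    return hits / (hits + false_alarms) if hits + false_alarms else 0.0

ARCH = {3, 4}  # B-ARCH, I-ARCH
true = [0, 3, 4, 4, 0, 0]
pred = [0, 4, 4, 0, 3, 3]  # made-up prediction for illustration

recall = token_recall(pred, true, ARCH)        # 2 of 3 true ARCH tokens found
precision = token_precision(pred, true, ARCH)  # 2 of 4 predicted ARCH tokens correct
f1 = 2 * precision * recall / (precision + recall)
print(recall, precision, f1)
```

Note that the second token is truly `B-ARCH` but predicted as `I-ARCH`, and it still counts as a hit: these metrics check the entity type of each token, not exact span boundaries.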


In [None]:
import numpy as np
from collections import defaultdict

def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels

    predictions = np.argmax(logits, axis=-1)

    str_labels = []
    str_preds = []

    for label in labels:
        filtered_label = [int(t) for t in label if t != -100]
        str_labels.append(filtered_label)

    for prediction, label in zip(predictions, labels):
        filtered_prediction = [int(p) for p, t in zip(prediction, label) if t != -100]
        str_preds.append(filtered_prediction)

    entity_metrics = defaultdict(float)

    for entity in ['ARCH']:
        # entity_metrics[f"{entity}_precision"] = precision_calculator(str_preds, str_labels, [label2id[f'B-{entity}'], label2id[f'I-{entity}']])
        entity_metrics[f"{entity}_recall"] = recall_calculator(str_preds, str_labels, [label2id[f'B-{entity}'], label2id[f'I-{entity}']])
        # entity_metrics[f"{entity}_f1"] = f1_score_calculator(entity_metrics[f"{entity}_precision"], entity_metrics[f"{entity}_recall"])

    return dict(entity_metrics)



In [None]:
!pip install -qq optuna


In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForTokenClassification
import optuna

def model_init():
    # Called once per trial so that each one starts from the same checkpoint.
    # The checkpoint's 9-label head is replaced by a new 7-label head.
    return AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True
    )

def hp_space_optuna(trial):
    # Search space: one entry per hyperparameter to tune.
    return {
        # Sampled log-uniformly, so each order of magnitude is equally likely.
        "learning_rate": trial.suggest_float("learning_rate", 6e-05, 6e-04, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16]),
        "weight_decay": trial.suggest_float("weight_decay", 0.01, 0.2, log=True),
        "warmup_steps": trial.suggest_int("warmup_steps", 400, 600),
        "lr_scheduler_type": trial.suggest_categorical("lr_scheduler_type", ["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"])
    }

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/ARCHITECTURE_NER/NER/models/",
    report_to="none",
    save_strategy="no",
    eval_strategy="epoch",
    num_train_epochs=3,
    greater_is_better=True,
    load_best_model_at_end=False,
    # Must match the key returned by compute_metrics.
    metric_for_best_model="ARCH_recall",
)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=dataset_dict["train"],
    eval_dataset=dataset_dict["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at dslim/distilbert-ner and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([9]) in the checkpoint and torch.Size([7]) in the model instantiated
- classifier.weight: found shape torch.Size([9, 768]) in the checkpoint and torch.Size([7, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
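# No pruner is passed, so the Optuna study uses its default MedianPruner.
# With the default objective, each trial is scored by the sum of the evaluation
# metrics other than the loss, which here is just the ARCH recall.
best_run = trainer.hyperparameter_search(
    hp_space=hp_space_optuna,
    direction="maximize",
    backend="optuna",
    n_trials=30,
)

print("Best hyperparameters found:", best_run)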

[I 2024-11-11 03:35:37,418] A new study created in memory with name: no-name-f4379c47-7337-4c53-8031-3d39bcd60ee8

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.260082,0.52861
2,0.336600,0.21934,0.663488
3,0.336600,0.21349,0.690736


[I 2024-11-11 03:36:47,615] Trial 0 finished with value: 0.6907356948228883 and parameters: {'learning_rate': 0.00023201586641598173, 'per_device_train_batch_size': 8, 'weight_decay': 0.10625362317625302, 'warmup_steps': 432, 'lr_scheduler_type': 'polynomial'}. Best is trial 0 with value: 0.6907356948228883.

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.280302,0.546322
2,0.381000,0.234798,0.543597
3,0.381000,0.216446,0.649864


[I 2024-11-11 03:37:53,652] Trial 1 finished with value: 0.6498637602179836 and parameters: {'learning_rate': 7.734271546334847e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.051902282006642375, 'warmup_steps': 587, 'lr_scheduler_type': 'linear'}. Best is trial 0 with value: 0.6907356948228883.

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.265489,0.512262
2,0.357200,0.231556,0.662125
3,0.357200,0.217076,0.651226


[I 2024-11-11 03:38:58,661] Trial 2 finished with value: 0.6512261580381471 and parameters: {'learning_rate': 0.00010446633489573423, 'per_device_train_batch_size': 8, 'weight_decay': 0.05905498873986834, 'warmup_steps': 488, 'lr_scheduler_type': 'cosine_with_restarts'}. Best is trial 0 with value: 0.6907356948228883.

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.279989,0.553134
2,0.377300,0.229411,0.547684
3,0.377300,0.220213,0.643052


[I 2024-11-11 03:40:04,367] Trial 3 finished with value: 0.6430517711171662 and parameters: {'learning_rate': 8.195452105387542e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.04942614729052601, 'warmup_steps': 577, 'lr_scheduler_type': 'polynomial'}. Best is trial 0 with value: 0.6907356948228883.

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.244592,0.599455
2,No log,0.229417,0.577657
3,No log,0.245086,0.433243


[I 2024-11-11 03:41:03,071] Trial 4 finished with value: 0.4332425068119891 and parameters: {'learning_rate': 0.0002814842147070491, 'per_device_train_batch_size': 16, 'weight_decay': 0.1316857293540562, 'warmup_steps': 450, 'lr_scheduler_type': 'cosine'}. Best is trial 0 with value: 0.6907356948228883.

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.248337,0.551771


[I 2024-11-11 03:41:23,851] Trial 5 pruned. 

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.244286,0.512262


[I 2024-11-11 03:41:44,474] Trial 6 pruned. 

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.300819,0.346049


[I 2024-11-11 03:42:07,358] Trial 7 pruned. 

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.259612,0.574932


[I 2024-11-11 03:42:27,906] Trial 8 pruned. 

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.301367,0.474114


[I 2024-11-11 03:42:48,555] Trial 9 pruned. 

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.27547,0.544959


[I 2024-11-11 03:43:11,811] Trial 10 pruned. 

Epoch,Training Loss,Validation Loss,Arch Recall
1,No log,0.231033,0.557221
