# **Named Entity Recognition (NER)**

### **Natural Language Processing Model for Entity Recognition**

This model identifies and extracts specific entities from text, such as names of people, architectural works, and places. Structuring the information this way makes later analysis and database queries easier. This notebook trains a NER model of this kind.

The processed text is segmented and labeled with the following categories:  
- **ARCH** (Architecture)  
- **PER** (Person)  
- **LOC** (Geographic location)  
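
To make the label scheme concrete, here is a minimal sketch of how character spans could map to BIO tags over whitespace tokens. The sentence, the spans, and the helper name `spans_to_bio` are illustrative assumptions and are not taken from the dataset; the notebook itself aligns labels against sub-word tokens further down.

```python
def spans_to_bio(text, spans):
    """Tag whitespace tokens with BIO labels from (start, end, label) character spans."""
    tags = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            # A token inside a span is B- if it opens the span, I- otherwise.
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append((token, tag))
    return tags

# Hypothetical example sentence and spans.
text = "Gaudi designed Casa Mila in Barcelona"
spans = [(0, 5, "PER"), (15, 24, "ARCH"), (28, 37, "LOC")]
for token, tag in spans_to_bio(text, spans):
    print(f"{token:10s} {tag}")
```

Each entity contributes one `B-` tag followed by zero or more `I-` tags, which is why the label list used later has seven entries for three entity types.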

### **Model used: distilbert-NER**

The base model is [distilbert-NER](https://huggingface.co/dslim/distilbert-NER), a DistilBERT checkpoint fine-tuned for named entity recognition (NER). It was chosen because it is lightweight and efficient, which suits tasks that need a balance between accuracy and speed.

### **How the model works**  
1. **Text input:**  
   The training texts are labeled with the online tool [NER Annotator](https://arunmozhi.in/ner-annotator/), and the resulting annotations are stored in the *annotations_dataset* folder of this directory.

2. **Resulting model:**  
   The trained model is saved in the *best_models* folder. It can later be reused to identify architectural works in other texts.  
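
The loading code later in this notebook (`read_data`) reads each annotation file as a JSON object with an `annotations` list, where every item is a `[text, {"entities": [[start, end, label], ...]}]` pair. A minimal sketch of that layout follows; the sentence and its offsets are made up for illustration.

```python
import json

# Hypothetical file content in the layout that read_data expects.
raw = json.dumps({
    "annotations": [
        ["Wright built Unity Temple in Oak Park.",
         {"entities": [[0, 6, "PER"], [13, 25, "ARCH"], [29, 37, "LOC"]]}]
    ]
})

data = json.loads(raw)
extracted = []
for text, meta in data["annotations"]:
    # Offsets are character positions into the annotated text.
    for start, end, label in meta["entities"]:
        extracted.append((text[start:end], label))
print(extracted)
```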

# Environment Setup

In [None]:
!pip install -qq datasets


In [None]:
from google.colab import drive
from datasets import Dataset, DatasetDict
from transformers import DataCollatorForTokenClassification, AutoTokenizer, AutoModelForTokenClassification, Trainer, pipeline
from collections import defaultdict
import sys
import string
import json
import numpy as np
import os

In [None]:
drive.mount('/content/drive')
%cd "/content/drive/"

Mounted at /content/drive
/content/drive


In [None]:
checkpoint = 'dslim/distilbert-ner'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)




# Data Definition

In [None]:
label_names = ['O', 'B-PER', 'I-PER', 'B-ARCH', 'I-ARCH', 'B-LOC', 'I-LOC']

id2label = {k: v for k, v in enumerate(label_names)}

label2id = {v: k for k, v in enumerate(label_names)}

In [None]:
test_archs = [
    "Ste-Geneviéve", "Newgate Gaol", "Schauspielhaus", "Altes Museum", "British Museum", "Bibliotheque Ste-Geneviéve", "Palm House", "Gare de I’Est", "Streatham Street Flats", "Crystal Palace", "Bibliotheque Nationale", "‘Old English’ country house", "Le Raincy church",
    "Palais de Justice", "Galerie des Machines", "State Museums", "Burgtheater", "Swan House", "Rijksmuseum", "Winn Memorial Library", "Neue Hofburg", "Casa Vicens", "Sagrada Familia", "Marshall Field Wholesale Store", "Glessner House", "Palau Guell", "Miller House",
    "Century Guild Exhibition Stand", "Auditorium Building", "Walker Warehouse", "Oak Park house", "Oak Park studio", "Wainwright Building", "Dooly Block", "Bedford Park", "Fair Store", "Charnley House", "Standen", "Landesmuseum", "Hétel Tassel", "Moller House",
    "Transportation building", "Winslow House", "Guaranty Building", "Hotel Solvay", "church of St-Jean-de-Montmartre", "Van Eetvelde and his own house", "Luxfer Prism offices", "Amsterdam Exchange", "McAfee House", "Francisco Terrace apartments", "Heuberg Estate",
    "Ecole du Sacré Coeur", "Maison Carpeaux", "Heller and Husser Houses", "Sturgis House", "Secession Building", "Maison du Peuple", "Millbank Estate", "Glasgow School of Art", "The Barn", "Majolica House", "Humbert de Romans concert hall", "Colonia Guell", "Rufer House",
    "Goldman and Salatsch facade", "Broadleys", "Castel Henriette", "Ernst Ludwig House", "Schlesinger and Mayer department store", "Café Museum", "Heller and Husser Houses", "The Orchard", "Warren house", "House for an Art-lover", "Dana House", "Heurtley House",
    "Avenue Wagram", "pavilion for the Exhibition of Decorative Arts", "apartment building in the Rue Franklin", "Wertheim store", "Larkin Building", "Post Office Savings Bank", "Purkersdorf Sanatorium", "Grand Ducal School of Arts and Crafts", "Dumont Theatre",
    "Martin House", "Willow Tea Rooms", "Unity Temple", "Palais Stoclet", "Hardy House", "Nashdom", "Robie House", "Casa Mila", "Tietz department store", "Avery Coonley House", "American Bar", "hotel at Campo de’ Fiori", "Steiner House", "Viceroy’s House",  "Tristan Tzara house",
    "apartment block completed in Rue Vavin", "Central Station", "Monza Cemetery", "Leipzig Steel Pavilion", "Jahrhunderthalle", "Glass Pavilion", "Werkbund Theatre", "Midway Gardens", "Citta Nuova", "Twin Airship Hangars", "Fiat Works", "Hotel Imperial","Villa on the Lido"
    ]

In [None]:
def read_data(file_name):
    # Use a context manager so the file handle is always closed.
    with open(file_name, 'r', encoding='utf-8') as f:
        data = json.load(f)

    new_data = []

    for line in data["annotations"]:
        text = line[0]
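        # Skip the whole annotation when it mentions a held-out test building,
        # so evaluation names do not leak into the training data.
        if any(arch in text for arch in test_archs):
            continue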
        entities = line[1]["entities"]
        sentences = line[0].split(".")

        entities_list = {"PER": [], "LOC": [], "ARCH": []}
        for start, end, label in entities:
            entities_list[label].append(text[start:end])

        for sentence in sentences:
            sentence = sentence.strip()
            new_data_line = ["", {"entities": []}]
            for key, entities in entities_list.items():
                for ent in entities:
                    if ent in sentence:
                        if new_data_line[0] == "":
                            new_data_line[0] = sentence.translate(str.maketrans('', '', string.punctuation))
                        current_ents = " ".join([current_ent for current_ent, label in new_data_line[1]["entities"]])
                        if ent not in current_ents:
                            new_data_line[1]["entities"].append((ent.strip(), key))
            if new_data_line[0] != "":
                new_data.append(new_data_line)
    return new_data

In [None]:
def tokenize_data(data):
    tokenize_data = []

    for line in data:
        text = line[0]
        entities = line[1]["entities"]

        tokens = tokenizer(text, return_offsets_mapping=True)

        labels = [0] * len(tokens.tokens())
        labels[0] = -100
        labels[-1] = -100

        for ent, label in entities:
            start = text.find(ent)
            if start == -1:
                # Entity text is absent (e.g. altered by punctuation stripping); skip it.
                continue
            end = start + len(ent)
            for idx, (token_start, token_end) in enumerate(tokens["offset_mapping"]):
                if token_start == token_end:
                    # Special tokens report an empty (0, 0) span; keep their -100 label.
                    continue
                if token_start >= start and token_end <= end:
                    prefix = "B" if token_start == start else "I"
                    labels[idx] = label2id[f"{prefix}-{label}"]

        input_ids = tokens["input_ids"]
        attention_mask = tokens["attention_mask"]
        tokenize_data.append({
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        })

    return tokenize_data
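
The offset-based alignment in `tokenize_data` can be tried without downloading a tokenizer. The sketch below is a simplified stand-in, not the notebook's function: the helper `align_labels` and the hand-written offsets are assumptions, and the sub-word split shown in the comment is only a plausible one. It treats tokens with an empty offset span as special tokens, which is how fast tokenizers report `[CLS]` and `[SEP]`.

```python
LABELS = ['O', 'B-PER', 'I-PER', 'B-ARCH', 'I-ARCH', 'B-LOC', 'I-LOC']
LABEL2ID = {name: i for i, name in enumerate(LABELS)}

def align_labels(offsets, start, end, label):
    """Return one label id per token; tokens with empty offsets get -100."""
    labels = []
    for tok_start, tok_end in offsets:
        if tok_start == tok_end:
            labels.append(-100)  # special token, ignored by the loss
        elif tok_start >= start and tok_end <= end:
            prefix = "B" if tok_start == start else "I"
            labels.append(LABEL2ID[f"{prefix}-{label}"])
        else:
            labels.append(LABEL2ID["O"])
    return labels

# Hand-written offsets for "Robie House stands"
# (assumed split: [CLS] Rob ##ie House stands [SEP]).
offsets = [(0, 0), (0, 3), (3, 5), (6, 11), (12, 18), (0, 0)]
aligned = align_labels(offsets, 0, 11, "ARCH")
print(aligned)
```

Only the first sub-word of the entity receives `B-ARCH`; every later sub-word inside the span receives `I-ARCH`.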

In [None]:
folder_path = "/content/drive/MyDrive/ARCHITECTURE_NER/NER/annotations_dataset"

general_data = []

for file_name in os.listdir(folder_path):
    if file_name.endswith(".json"):
        file_path = os.path.join(folder_path, file_name)
        data = read_data(file_path)
        general_data += data

tokenized_data = tokenize_data(general_data)

In [None]:
dataset = Dataset.from_list(tokenized_data)
split_data = dataset.train_test_split(test_size=0.2, seed=1234)
dataset_dict = DatasetDict({
    'train': split_data['train'],
    'validation': split_data['test']
})

print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2320
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 581
    })
})


In [None]:
labels_count = {v: 0 for k, v in enumerate(label_names)}

labels_count_2 = {v: 0 for k, v in enumerate(label_names)}

for row in dataset_dict['train']:
  for label in row['labels']:
    if label != -100:
      labels_count[id2label[label]] += 1

for row in dataset_dict['validation']:
  for label in row['labels']:
    if label != -100:
      labels_count_2[id2label[label]] += 1

print("Train labels:")
print(labels_count)
print("\nValidation labels:")
print(labels_count_2)

Train labels:
{'O': 68003, 'B-PER': 2802, 'I-PER': 5066, 'B-ARCH': 853, 'I-ARCH': 3047, 'B-LOC': 1026, 'I-LOC': 1064}

Validation labels:
{'O': 17830, 'B-PER': 680, 'I-PER': 1330, 'B-ARCH': 234, 'I-ARCH': 849, 'B-LOC': 247, 'I-LOC': 286}
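
The counts above show a strong class imbalance. As a quick worked check, the sketch below recomputes the share of each entity type in the training split from the printed numbers: about 83% of the labeled tokens are `O`, while ARCH tokens make up under 5%.

```python
# Token counts copied from the "Train labels" output above.
train_counts = {'O': 68003, 'B-PER': 2802, 'I-PER': 5066, 'B-ARCH': 853,
                'I-ARCH': 3047, 'B-LOC': 1026, 'I-LOC': 1064}

total = sum(train_counts.values())
shares = {}
for name, count in train_counts.items():
    entity = name.split('-')[-1]  # 'B-ARCH' -> 'ARCH', 'O' -> 'O'
    shares[entity] = shares.get(entity, 0) + count / total

for entity, share in shares.items():
    print(f"{entity:5s} {share:.1%}")
```

With so few ARCH tokens, accuracy alone would be dominated by `O`, which is one reason to track a per-entity metric such as ARCH recall.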


# Train

In [None]:
SAVE_MODEL = False

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
def precision_calculator(pred_labels: list[list[int]], true_labels: list[list[int]], entities_to_consider: list[int]) -> float:
    true_positives = 0
    false_positives = 0

    for pred_label, true_label in zip(pred_labels, true_labels):
        for pred, true in zip(pred_label, true_label):
            if pred == -100 or true == -100:
                continue

            if pred in entities_to_consider:
                if true in entities_to_consider:
                    true_positives += 1
                else:
                    false_positives += 1

    if true_positives + false_positives == 0:
        return 0.0

    return true_positives / (true_positives + false_positives)

def recall_calculator(pred_labels: list[list[int]], true_labels: list[list[int]], entities_to_consider: list[int]) -> float:
    true_positives = 0
    false_negatives = 0

    for pred_label, true_label in zip(pred_labels, true_labels):
        for pred, true in zip(pred_label, true_label):
            if pred == -100 or true == -100:
                continue

            if true in entities_to_consider:
                if pred in entities_to_consider:
                    true_positives += 1
                else:
                    false_negatives += 1

    if true_positives + false_negatives == 0:
        return 0.0

    return true_positives / (true_positives + false_negatives)

def f1_score_calculator(precision, recall) -> float:
    if precision + recall == 0:
        return 0.0

    return 2 * (precision * recall) / (precision + recall)
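
The calculators above work at token level and treat the `B-` and `I-` tags of one entity as the same class. The toy example below is a compact re-statement of that idea, not the notebook's code: the helper names and the label sequences are made up, and ids 3 and 4 stand for `B-ARCH` and `I-ARCH` as in `label2id`.

```python
def grouped_precision(preds, trues, ids):
    """Share of tokens predicted as the entity that truly belong to it."""
    tp = sum(1 for p, t in zip(preds, trues) if p in ids and t in ids)
    fp = sum(1 for p, t in zip(preds, trues) if p in ids and t not in ids)
    return tp / (tp + fp) if tp + fp else 0.0

def grouped_recall(preds, trues, ids):
    """Share of tokens truly in the entity that were predicted as it."""
    tp = sum(1 for p, t in zip(preds, trues) if t in ids and p in ids)
    fn = sum(1 for p, t in zip(preds, trues) if t in ids and p not in ids)
    return tp / (tp + fn) if tp + fn else 0.0

arch_ids = {3, 4}                 # B-ARCH, I-ARCH
trues = [0, 3, 4, 4, 0, 3]        # hypothetical gold labels
preds = [0, 4, 4, 0, 3, 3]        # hypothetical predictions

precision = grouped_precision(preds, trues, arch_ids)
recall = grouped_recall(preds, trues, arch_ids)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

Note that the second token counts as a hit even though `I-ARCH` was predicted where `B-ARCH` was expected, so this metric rewards finding the entity's tokens rather than its exact boundaries.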

In [None]:
def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels

    predictions = np.argmax(logits, axis=-1)

    str_labels = []
    str_preds = []

    for label in labels:
        filtered_label = [int(t) for t in label if t != -100]
        str_labels.append(filtered_label)

    for prediction, label in zip(predictions, labels):
        filtered_prediction = [int(p) for p, t in zip(prediction, label) if t != -100]
        str_preds.append(filtered_prediction)

    entity_metrics = defaultdict(float)

    for entity in ['ARCH']:
        # entity_metrics[f"{entity}_precision"] = precision_calculator(str_preds, str_labels, [label2id[f'B-{entity}'], label2id[f'I-{entity}']])
        entity_metrics[f"{entity}_recall"] = recall_calculator(str_preds, str_labels, [label2id[f'B-{entity}'], label2id[f'I-{entity}']])
        # entity_metrics[f"{entity}_f1"] = f1_score_calculator(entity_metrics[f"{entity}_precision"], entity_metrics[f"{entity}_recall"])

    return dict(entity_metrics)


In [None]:
model = AutoModelForTokenClassification.from_pretrained(
  checkpoint,
  id2label=id2label,
  label2id=label2id,
  ignore_mismatched_sizes=True
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at dslim/distilbert-ner and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([9]) in the checkpoint and torch.Size([7]) in the model instantiated
- classifier.weight: found shape torch.Size([9, 768]) in the checkpoint and torch.Size([7, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/ARCHITECTURE_NER/NER/models/",
    report_to="none",
    save_strategy="epoch",
    eval_strategy="epoch",
    learning_rate=6.40302974239745e-05,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    lr_scheduler_type='constant',
    weight_decay=0.09864540343039606,
    warmup_steps=442,
    seed=1234,
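    # Select the best checkpoint by ARCH recall (reported as eval_ARCH_recall).
    metric_for_best_model="ARCH_recall",
    greater_is_better=True,
    load_best_model_at_end=True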
)

In [None]:
trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=dataset_dict["train"],
  eval_dataset=dataset_dict["validation"],
  tokenizer=tokenizer,
  compute_metrics=compute_metrics,
  data_collator=data_collator,
)



In [None]:
trainer.train()

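| Epoch | Training Loss | Validation Loss | Arch Recall |
|-------|---------------|-----------------|-------------|
| 1 | No log | 0.347737 | 0.735955 |
| 2 | No log | 0.407194 | 0.580899 |
| 3 | No log | 0.400131 | 0.761798 |
| 4 | 0.020700 | 0.37245 | 0.7 |
| 5 | 0.020700 | 0.439519 | 0.722472 |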


TrainOutput(global_step=725, training_loss=0.018897340379912277, metrics={'train_runtime': 132.4349, 'train_samples_per_second': 87.59, 'train_steps_per_second': 5.474, 'total_flos': 242763156422688.0, 'train_loss': 0.018897340379912277, 'epoch': 5.0})
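
The reported `global_step=725` can be checked by hand from the values already shown in this notebook. The sketch below assumes a single device and no gradient accumulation, which is consistent with the result but not stated explicitly in the configuration.

```python
import math

train_rows = 2320      # train split size from the DatasetDict printout
batch_size = 16        # per_device_train_batch_size
epochs = 5             # num_train_epochs
warmup_steps = 442     # warmup_steps from TrainingArguments

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps
print(steps_per_epoch, total_steps, round(warmup_fraction, 2))
```

The configured warmup would span roughly 61% of all optimizer steps. As far as I know, `lr_scheduler_type='constant'` does not apply warmup at all (the `constant_with_warmup` variant does), so that setting probably has no effect in this run.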

In [None]:
if SAVE_MODEL:
  model.save_pretrained("/content/drive/MyDrive/ARCHITECTURE_NER/NER/best_models/")
  tokenizer.save_pretrained("/content/drive/MyDrive/ARCHITECTURE_NER/NER/best_models/")

# Test

In [None]:
ner = pipeline(
    'token-classification',
    model="lucasdefino/architecture-NER",
    tokenizer="lucasdefino/architecture-NER",
    aggregation_strategy='simple',
    device=0
  )

In [None]:
fname = '/content/drive/MyDrive/ARCHITECTURE_NER/Procesamiento_de_textos/pages/pages_BOOK Kenneth Frampton Modern Architecture.txt'

In [None]:
import re

titles = ['Mrs.', 'Mr.', 'Dr.', 'Ms.']
uppers = [chr(i) + '.' for i in range(65, 91)]
lowers = [' ' + chr(i) + '.' for i in range(97, 123)]  # 'a'..'z' inclusive
numbers = [str(i) + '.' for i in range(0, 10)]

def remove_titles(text):
    mytext = text

    for title in titles:
        mytext = mytext.replace(title, title[:-1])
    for upper in uppers:
        mytext = mytext.replace(upper, upper[:-1])
    for number in numbers:
        mytext = mytext.replace(number, number[:-1]+',')
    for lower in lowers:
        mytext = mytext.replace(lower, lower[:-1])

    return mytext

def prepare_data(filename):
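    # Read the whole file and flatten line breaks; the context manager closes the file.
    with open(filename, 'r', encoding='utf-8') as f:
        txt = f.read().replace('\n', ' ')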
    txt = txt.replace('_', ' ')
    txt = txt.replace('—', '-')
    txt = txt.replace('–', '-')
    txt = txt.replace('“', '"')
    txt = txt.replace('”', '"')
    txt = txt.replace('’', "'")
    txt = txt.replace('‘', "'")
    txt = txt.replace('…', ' ')
    txt = txt.replace('...', ' ')
    txt = txt.replace('|', 'I')
    txt = txt.replace('+', ' ')
    txt = txt.replace('/', ' ')
    txt = txt.replace('(', ' ')
    txt = txt.replace(')', ' ')
    txt = txt.replace(':', ' ')
    txt = txt.replace(';', ' ')

    pattern = r"\[\d+\]"
    txt = re.sub(pattern, "", txt)
    txt = re.sub(' +', ' ', txt)
    txt = remove_titles(txt)
    data = [sentence.strip() for sentence in txt.split('.') if sentence.strip() != '']

    return data
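
`prepare_data` splits sentences on every period, which is why `remove_titles` first drops the periods that follow titles, initials, and digits. The simplified sketch below shows the effect for titles only; the helper names and the sample text are made up for illustration.

```python
def naive_split(text):
    """Split on every period and drop empty fragments."""
    return [s.strip() for s in text.split('.') if s.strip()]

def protected_split(text, abbreviations=('Mr.', 'Mrs.', 'Dr.', 'Ms.')):
    """Drop the period after known titles before splitting."""
    for abbr in abbreviations:
        text = text.replace(abbr, abbr[:-1])
    return naive_split(text)

sample = "Dr. Winslow hired Wright. The house was finished later."
print(naive_split(sample))
print(protected_split(sample))
```

Without the protection step, the title becomes a one-word "sentence" and the person name loses part of its context before it reaches the model.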

In [None]:
data = prepare_data(fname)

In [None]:
ARCHS = []

for text in data:
  for ent in ner(text):
    if ent['entity_group'] != 'ARCH':
      continue
    if '#' in ent['word']:
      # The span starts mid-word: walk back to the word start and re-read the text.
      while ent['start'] != 0 and text[ent['start']-1] != ' ':
        ent['start'] -= 1
      ent['word'] = text[ent['start']:ent['end']]
    ARCHS.append((ent['word'], ent['score']))

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
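
The loop above repairs entities whose text starts with a sub-word marker (`##`) by moving the start offset back to the beginning of the word. A minimal sketch of that repair on a mocked result follows; the sentence and the `ent` dictionary are hypothetical and only mimic the keys the pipeline returns.

```python
def expand_to_word_start(text, start):
    """Move `start` left until it sits at the start of a space-delimited word."""
    while start > 0 and text[start - 1] != ' ':
        start -= 1
    return start

# Hypothetical pipeline result whose span begins in the middle of a word.
text = "He admired the Jahrhunderthalle in Breslau"
ent = {"word": "##hunderthalle", "start": 19, "end": 31}

new_start = expand_to_word_start(text, ent["start"])
repaired = text[new_start:ent["end"]]
print(repaired)
```

Reading the entity back from the original text, instead of cleaning up the `##` pieces, also restores the original casing and spacing.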


In [None]:
test_archs = [
    "Ste-Geneviéve", "Newgate Gaol", "Schauspielhaus", "Altes Museum", "British Museum", "Bibliotheque Ste-Geneviéve", "Palm House", "Gare de I’Est", "Streatham Street Flats", "Crystal Palace", "Bibliotheque Nationale", "‘Old English’ country house", "Le Raincy church",
    "Palais de Justice", "Galerie des Machines", "State Museums", "Burgtheater", "Swan House", "Rijksmuseum", "Winn Memorial Library", "Neue Hofburg", "Casa Vicens", "Sagrada Familia", "Marshall Field Wholesale Store", "Glessner House", "Palau Guell", "Miller House",
    "Century Guild Exhibition Stand", "Auditorium Building", "Walker Warehouse", "Oak Park house", "Oak Park studio", "Wainwright Building", "Dooly Block", "Bedford Park", "Fair Store", "Charnley House", "Standen", "Landesmuseum", "Hétel Tassel", "Moller House",
    "Transportation building", "Winslow House", "Guaranty Building", "Hotel Solvay", "church of St-Jean-de-Montmartre", "Van Eetvelde and his own house", "Luxfer Prism offices", "Amsterdam Exchange", "McAfee House", "Francisco Terrace apartments", "Heuberg Estate",
    "Ecole du Sacré Coeur", "Maison Carpeaux", "Heller and Husser Houses", "Sturgis House", "Secession Building", "Maison du Peuple", "Millbank Estate", "Glasgow School of Art", "The Barn", "Majolica House", "Humbert de Romans concert hall", "Colonia Guell", "Rufer House",
    "Goldman and Salatsch facade", "Broadleys", "Castel Henriette", "Ernst Ludwig House", "Schlesinger and Mayer department store", "Café Museum", "Heller and Husser Houses", "The Orchard", "Warren house", "House for an Art-lover", "Dana House", "Heurtley House",
    "Avenue Wagram", "pavilion for the Exhibition of Decorative Arts", "apartment building in the Rue Franklin", "Wertheim store", "Larkin Building", "Post Office Savings Bank", "Purkersdorf Sanatorium", "Grand Ducal School of Arts and Crafts", "Dumont Theatre",
    "Martin House", "Willow Tea Rooms", "Unity Temple", "Palais Stoclet", "Hardy House", "Nashdom", "Robie House", "Casa Mila", "Tietz department store", "Avery Coonley House", "American Bar", "hotel at Campo de’ Fiori", "Steiner House", "Viceroy’s House",  "Tristan Tzara house",
    "apartment block completed in Rue Vavin", "Central Station", "Monza Cemetery", "Leipzig Steel Pavilion", "Jahrhunderthalle", "Glass Pavilion", "Werkbund Theatre", "Midway Gardens", "Citta Nuova", "Twin Airship Hangars", "Fiat Works", "Hotel Imperial","Villa on the Lido"
    ]

In [None]:
ARCHS_90 = []
ARCHS_70 = []
ARCHS_50 = []
ARCHS_30 = []
ARCHS_10 = []

for arch in ARCHS:
  if arch[1] > 0.9 and len(arch[0]) > 7:
    ARCHS_90.append(arch[0].lower())
  if arch[1] > 0.7 and len(arch[0]) > 7:
    ARCHS_70.append(arch[0].lower())
  if arch[1] > 0.5 and len(arch[0]) > 7:
    ARCHS_50.append(arch[0].lower())
  if arch[1] > 0.3 and len(arch[0]) > 7:
    ARCHS_30.append(arch[0].lower())
  if arch[1] > 0.1 and len(arch[0]) > 7:
    ARCHS_10.append(arch[0].lower())

In [None]:
not_found = []

i = 110
for arch_set in [ARCHS_90, ARCHS_70, ARCHS_50, ARCHS_30, ARCHS_10]:
  count = 0
  arch_set = set(arch_set)
  for arch in test_archs:
    if arch.lower() in arch_set:
      count += 1
    else:
      not_found.append(arch)
  i -= 20
  print(f"{i} percentage of confidence", count/len(test_archs))

90 percentage of confidence 0.6666666666666666
70 percentage of confidence 0.6929824561403509
50 percentage of confidence 0.6929824561403509
30 percentage of confidence 0.6929824561403509
10 percentage of confidence 0.6929824561403509


In [None]:
set(not_found)

{'American Bar',
 'Bibliotheque Ste-Geneviéve',
 'British Museum',
 'Broadleys',
 'Central Station',
 'Ecole du Sacré Coeur',
 'Gare de I’Est',
 'Glasgow School of Art',
 'Goldman and Salatsch facade',
 'Grand Ducal School of Arts and Crafts',
 'Heller and Husser Houses',
 'Hotel Imperial',
 'House for an Art-lover',
 'Humbert de Romans concert hall',
 'Jahrhunderthalle',
 'Luxfer Prism offices',
 'Nashdom',
 'Neue Hofburg',
 'Sagrada Familia',
 'Schauspielhaus',
 'Schlesinger and Mayer department store',
 'Standen',
 'State Museums',
 'Ste-Geneviéve',
 'The Barn',
 'The Orchard',
 'Tietz department store',
 'Twin Airship Hangars',
 'Van Eetvelde and his own house',
 'Viceroy’s House',
 'Villa on the Lido',
 'apartment block completed in Rue Vavin',
 'apartment building in the Rue Franklin',
 'church of St-Jean-de-Montmartre',
 'hotel at Campo de’ Fiori',
 'pavilion for the Exhibition of Decorative Arts',
 '‘Old English’ country house'}
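
Several of the missed names contain typographic apostrophes (for example `'Viceroy’s House'`), while `prepare_data` rewrites `’` and `‘` to straight quotes before the text reaches the model. An exact lowercase comparison therefore cannot match them, whatever the model predicts. One possible remedy, sketched below with a hypothetical helper `normalize_name` and made-up predictions, is to apply the same quote normalization to the reference names before comparing. Whether this changes the scores reported above has not been measured here.

```python
def normalize_name(name):
    """Lowercase a name and map typographic quotes to their ASCII forms."""
    table = str.maketrans({'’': "'", '‘': "'", '“': '"', '”': '"'})
    return name.translate(table).lower().strip()

reference = ["Viceroy’s House", "hotel at Campo de’ Fiori", "Robie House"]
predicted = ["viceroy's house", "robie house"]  # hypothetical model output

predicted_set = {normalize_name(p) for p in predicted}
found = [r for r in reference if normalize_name(r) in predicted_set]
print(found)
```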