To apply deep learning to a classification task, we plan to fine-tune a BERT variant pretrained on clinical text. Specifically, we fine-tune the model on our conversations, evaluate it, and then use it for inference: first to classify conversations from the validation set, and then through a short script that classifies texts we write ourselves.

## FIRST APPROACH: Fine-tune Bio_ClinicalBERT (classification head)

In [None]:
import pandas as pd

In [None]:
import re
import unicodedata
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

In [None]:
# load the dataset
df = pd.read_csv("../../dataset/MTS-Dialog-TrainingSet.csv")

In [None]:
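# preprocessing for BERT
def normalize_for_bert(s):
    # Missing dialogues become empty strings
    if pd.isna(s):
        return ""
    # Normalize Unicode compatibility characters
    s = unicodedata.normalize("NFKC", str(s))
    # Remove speaker tags such as "Doctor:" or "Patient:"
    s = re.sub(r'\b(Doctor|Doctor_2|Patient|Guest_family(_\d)?|Guest_clinician)[:\-]\s*', '', s, flags=re.I)
    # Collapse runs of whitespace into single spaces
    s = re.sub(r'\s+', ' ', s).strip()
    return s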

df['text_for_bert'] = df['dialogue'].apply(normalize_for_bert)


X = df['text_for_bert']
y = df['section_header']

# Encode the section headers as integer class ids
le = LabelEncoder()
y_encoded = le.fit_transform(y)


X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, stratify=y_encoded, random_state=42
)

In [None]:
# load the tokenizer and model
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(le.classes_),
    problem_type="single_label_classification"
)

# Tokenization
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=512
    )

In [None]:
# datasets
train_dataset = Dataset.from_dict({'text': X_train.tolist(), 'label': y_train.tolist()})
test_dataset = Dataset.from_dict({'text': X_test.tolist(), 'label': y_test.tolist()})

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

In [None]:
# evaluation metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1_macro': f1_score(labels, predictions, average='macro')
    }

In [None]:
# training configuration
training_args = TrainingArguments(
    output_dir='./results_clinicalbert',
    eval_strategy='epoch',
    save_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='f1_macro',
    logging_dir='./logs',
    logging_steps=10,
    seed=42,
    fp16=True,
    gradient_accumulation_steps=2,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

In [None]:
# Fine-tune
trainer.train()

# Evaluation
results = trainer.evaluate()
print(results)

# Save the model
model.save_pretrained('./finetuned_clinicalbert')
tokenizer.save_pretrained('./finetuned_clinicalbert')

# Save the label encoder
import pickle
with open('./finetuned_clinicalbert/label_encoder.pkl', 'wb') as f:
    pickle.dump(le, f)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 960/960 [00:00<00:00, 3864.05 examples/s]
Map: 100%|██████████| 241/241 [00:00<00:00, 3535.02 examples/s]


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro |
|---|---|---|---|---|
| 1 | 1.9317 | 1.64391 | 0.634855 | 0.262969 |
| 2 | 1.2774 | 1.200764 | 0.759336 | 0.339809 |
| 3 | 1.1068 | 1.117554 | 0.771784 | 0.345453 |


{'eval_loss': 1.1175535917282104, 'eval_accuracy': 0.7717842323651453, 'eval_f1_macro': 0.34545303347045025, 'eval_runtime': 1.8796, 'eval_samples_per_second': 128.218, 'eval_steps_per_second': 16.493, 'epoch': 3.0}


The metrics are hard to read in the raw output, so we restate them here:

'eval_loss': 1.1175535917282104

'eval_accuracy': 0.7717842323651453

'eval_f1_macro': 0.34545303347045025



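Accuracy is good (0.77), as we expected from a transformer-based model, but macro F1 is much lower (0.35), which suggests that the less frequent classes are predicted poorly. Training loss fell with every pass over the dataset (epoch), from 1.93 to 1.11, and validation loss was still falling after the third epoch, so longer training could plausibly improve the fine-tuned model.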
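The evaluation reports an accuracy of 0.77 next to a macro F1 of 0.35. The sketch below illustrates how such a gap can arise: macro F1 averages the per-class F1 scores with equal weight, so rare classes that are never predicted pull it down even when accuracy stays high. The labels are hand-made for illustration (they are not taken from the MTS-Dialog data), and `per_class_f1` is a hypothetical helper written in plain Python rather than the `sklearn` call used in the notebook.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from two equally long label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[t] += 1  # t was the true class but missed
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


# Hypothetical labels: class 0 dominates, classes 1 and 2 are rare.
y_true = [0] * 8 + [1, 2]
# A classifier that always predicts the majority class.
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_by_class = per_class_f1(y_true, y_pred)
f1_macro = sum(f1_by_class.values()) / len(f1_by_class)

print(f"accuracy={accuracy:.3f}  f1_macro={f1_macro:.3f}")
print("per-class F1:", f1_by_class)
```

Printing the per-class scores for the real predictions would show which section headers the fine-tuned model misses.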

With the model saved, we now run it in inference mode to see it in action beyond the accuracy figures and other metrics.

INFERENCE TEST

In [8]:
import pandas as pd
from transformers import pipeline
import pickle
import re, unicodedata

# Minimal preprocessing (same as training)
def normalize_for_bert(s):
    s = unicodedata.normalize("NFKC", str(s))
    s = re.sub(r'\b(Doctor|Doctor_2|Patient|Guest_family(_\d)?|Guest_clinician)[:\-]\s*', '', s, flags=re.I)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

# Load label encoder and model
with open('./finetuned_clinicalbert/label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)

clf = pipeline(
    "text-classification",
    model="./finetuned_clinicalbert",
    tokenizer="./finetuned_clinicalbert",
    device=0  # use -1 for CPU
)

# Load a validation sample
df_val = pd.read_csv("../../dataset/MTS-Dialog-ValidationSet.csv")
text = normalize_for_bert(df_val.loc[0, "dialogue"])  # any row from validation set

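# Predict (truncate to 512 tokens, as in training, so long dialogues fit BERT's input limit)
out = clf(text, truncation=True, max_length=512)[0]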
pred_label = le.inverse_transform([int(out['label'].split('_')[-1])])[0]
print(f"Predicted: {pred_label} | Confidence: {out['score']:.3f}")

Device set to use cuda:0


Predicted: GENHX | Confidence: 0.924


The model assigns the first dialogue in the validation set to GENHX with a confidence of 0.924, which is correct. This is an interesting and rather surprising result, given how few epochs the model was trained for and how confidently it picks that class.
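For a single-label classifier such as this one, the confidence reported by the `text-classification` pipeline is the softmax probability of the highest-scoring class. The sketch below reproduces that calculation in plain Python; the logits are hypothetical values chosen for illustration, not outputs of the fine-tuned model.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-class problem.
logits = [0.2, 3.1, -1.0, 0.5]
probs = softmax(logits)

# The predicted class is the one with the highest probability.
top = max(range(len(probs)), key=probs.__getitem__)
print(f"LABEL_{top} score={probs[top]:.3f}")
```

Because the probabilities are shared across all classes, a score close to 1 means the remaining classes received almost no probability mass, while a score near `1 / num_labels` means the model is close to guessing.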

To classify a text of our own, we can use the following script:

In [None]:
# Reuses clf and le from the inference cell above
sample_text = "Patient reports chest pain for 3 days..."
result = clf(sample_text)
predicted_label = le.inverse_transform([int(result[0]['label'].split('_')[-1])])[0]
print(f"Predicted: {predicted_label}, Confidence: {result[0]['score']:.3f}")

Predicted: FAM/SOCHX, Confidence: 0.138


Clearly, the sample text is too short and lacks preprocessing, as the low confidence (0.138) reflects; the script would still be useful for a text closer to the training data.
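Both inference cells decode the pipeline's `LABEL_k` strings by hand. These are the default names that `transformers` assigns when the model config carries no `id2label` mapping. A minimal sketch of a reusable helper follows; `decode_predictions` is a hypothetical name, the two-class list stands in for `le.classes_`, and the pipeline outputs are stubbed so the example runs without the model.

```python
def decode_predictions(predictions, class_names):
    """Map pipeline outputs such as {'label': 'LABEL_1', 'score': 0.9}
    to (class name, score) pairs, using the position k in 'LABEL_k'."""
    decoded = []
    for pred in predictions:
        index = int(pred["label"].rsplit("_", 1)[-1])
        decoded.append((class_names[index], pred["score"]))
    return decoded


# Hypothetical stand-in for le.classes_ (LabelEncoder sorts class names).
class_names = ["FAM/SOCHX", "GENHX"]

# Stubbed pipeline outputs; in the notebook these would come from clf(texts).
stub_outputs = [
    {"label": "LABEL_1", "score": 0.924},
    {"label": "LABEL_0", "score": 0.138},
]

for name, score in decode_predictions(stub_outputs, class_names):
    print(f"Predicted: {name} | Confidence: {score:.3f}")
```

In the notebook, passing `list(le.classes_)` as `class_names` would give the same result as the `le.inverse_transform` calls.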