This notebook shows an example of the results obtained when classifying sentences into the categories Obligation, Right, or None with a model trained on very little data using few-shot learning.

In [None]:
!pip install setfit
!pip install huggingface-hub==0.11.0

In [2]:
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer, sample_dataset
from huggingface_hub import notebook_login

Log in to Hugging Face so that the trained model can be uploaded.

In [None]:
notebook_login()

We load the training and validation data. The training split has 8 examples per category.

In [4]:
data_files = {"train": "train.csv", "validation": "validation.csv"}
dataset = load_dataset("csv", data_files=data_files)

dataset

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-3450257588300b9e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-3450257588300b9e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 24
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 22
    })
})

In [5]:
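# Evaluate on the full validation split.
eval_dataset = dataset["validation"]
# Sample 8 examples per class (the sample_dataset default); the train split already has 8 per category.
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)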
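
To illustrate what sampling a fixed number of examples per class involves, here is a minimal pure-Python sketch. It is not SetFit's implementation: the helper `sample_per_class` and the toy rows are hypothetical and only show the idea of grouping by label and keeping at most N rows per group.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Keep at most `num_samples` rows per label, chosen reproducibly."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        sampled.extend(group[:num_samples])
    return sampled


# Hypothetical toy data: 3 classes with 10 rows each.
rows = [{"text": f"sentence {i}", "label": i % 3} for i in range(30)]
subset = sample_per_class(rows, num_samples=8)
counts = {label: sum(r["label"] == label for r in subset) for label in (0, 1, 2)}
print(counts)
```

With 8 examples per category already in the training split, sampling 8 per class keeps all 24 rows.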

We download the model to be fine-tuned with the SetFit framework.

In [None]:
model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)

Fine-tuning the SetFitModel

In [8]:
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_epochs=3,
    num_iterations=50,
    learning_rate=2e-5,
    column_mapping={"text": "text", "label": "label"},
)
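
The trainer above fine-tunes the sentence embeddings on sentence pairs with `CosineSimilarityLoss`. The following is a simplified pure-Python sketch of that idea, not SetFit's or sentence-transformers' actual code: it assumes that each example contributes one same-label pair (target 1.0) and one different-label pair (target 0.0) per iteration, and it computes the loss as the mean squared error between the cosine similarity and the target. The helpers `make_pairs` and `cosine_similarity_loss` and the 2-D embeddings are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def make_pairs(labels, num_iterations, seed=0):
    """Per iteration and example, draw one positive and one negative partner.

    Assumes at least two classes and at least two examples per class.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    pairs = []
    for _ in range(num_iterations):
        for i in indices:
            same = [j for j in indices if j != i and labels[j] == labels[i]]
            other = [j for j in indices if labels[j] != labels[i]]
            pairs.append((i, rng.choice(same), 1.0))   # same class -> target 1
            pairs.append((i, rng.choice(other), 0.0))  # different class -> target 0
    return pairs


def cosine_similarity_loss(embeddings, pairs):
    """Mean squared error between cosine similarity and the pair target."""
    errors = [(cosine(embeddings[i], embeddings[j]) - t) ** 2 for i, j, t in pairs]
    return sum(errors) / len(errors)


# Hypothetical toy setup: four examples, two classes, 2-D embeddings.
labels = [0, 0, 1, 1]
separated = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
collapsed = [(1.0, 0.0)] * 4

pairs = make_pairs(labels, num_iterations=5)
loss_separated = cosine_similarity_loss(separated, pairs)
loss_collapsed = cosine_similarity_loss(collapsed, pairs)
print(len(pairs), loss_separated, loss_collapsed)
```

In this toy setup the well-separated embeddings reach zero loss, while embeddings that collapse every sentence onto the same vector are penalised on every negative pair.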

In [9]:
trainer.train()

Applying column mapping to training dataset
***** Running training *****
  Num examples = 2400
  Num epochs = 3
  Total optimization steps = 450
  Total train batch size = 16
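
The numbers in this log can be reproduced from the trainer settings. The sketch below assumes that two pairs are generated per training example per iteration, which is consistent with the reported `Num examples = 2400`; the variable names are illustrative only.

```python
import math

num_train = 24       # rows in the train split (8 per category, 3 categories)
num_iterations = 50  # num_iterations passed to the trainer
num_epochs = 3       # num_epochs passed to the trainer
batch_size = 16      # "Total train batch size" in the log

# Assumption: one positive and one negative pair per example per iteration.
num_pairs = num_train * num_iterations * 2
steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * num_epochs

print(num_pairs, steps_per_epoch, total_steps)
```

This yields 2400 pairs, 150 steps per epoch, and 450 optimization steps in total, matching the log.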


Model evaluation.

In [10]:
metrics = trainer.evaluate()
metrics

Applying column mapping to evaluation dataset
***** Running evaluation *****


{'accuracy': 0.7727272727272727}
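
Accuracy is the fraction of validation sentences whose predicted label equals the true label. With 22 validation rows, the reported value is consistent with 17 correct predictions. The sketch below shows the computation on hypothetical label lists; the real predictions are not reproduced here.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 22 rows, of which 17 are predicted correctly.
y_true = [0] * 22
y_pred = [0] * 17 + [1] * 5
score = accuracy(y_true, y_pred)
print(score)
```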

Upload the trained model to the Hugging Face repository.

In [None]:
trainer.push_to_hub('marmolpen3/p-MiniLM-L3-v2-sla-obligations-rights')

Test data can be classified as follows:

In [None]:
data_file = {"test": "test.csv"}
test_data = load_dataset("csv", data_files=data_file)
test_data

In [None]:
preds = model(test_data["test"][:]["text"])
preds

Results.

In [None]:
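labels = ["Obligation", "Right", "None"]  # assumed id-to-name order (not defined in the original); adjust to train.csv
[labels[int(p)] for p in preds]  # one predicted class id per sentence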