Install everything we need.


In [27]:
!pip install transformers[torch] datasets evaluate



Import the dataset and the evaluation metric. The metric makes validation possible because it scores predicted answers against the reference answers of this particular dataset.


In [31]:
import numpy as np
import collections
import evaluate
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset
from transformers import DistilBertForQuestionAnswering
from transformers import AutoTokenizer
from transformers import default_data_collator
from transformers import Trainer, TrainingArguments
from transformers import pipeline
from tqdm.auto import tqdm

metric = evaluate.load("squad")
dataset = load_dataset("squad")

Since we do not have much memory, we take 5000 samples for training and 500 for validation.

In [32]:
dataset['train'] = dataset['train'].select(range(5000))
dataset['validation'] = dataset['validation'].select(range(500))

Define the model and the tokenizer. We will fine-tune DistilBertForQuestionAnswering.

In [35]:
checkpoint = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)
model = DistilBertForQuestionAnswering.from_pretrained(checkpoint).to(device)

cuda


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [53]:
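def QA_answer(model_checkpoint, tokenizer, context, question):
  """
  Answer a question about the given context and print the result.
  """
  # Build an extractive QA pipeline from a model name or checkpoint path,
  # using the tokenizer that was passed in.
  question_answerer = pipeline(
      "question-answering", model=model_checkpoint, tokenizer=tokenizer
  )

  result = question_answerer(question=question, context=context)
  print(f"Question: {question} \n Answer: {result['answer']}")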

In [55]:
i = np.random.randint(len(dataset['validation']))
context = dataset['validation'][i]['context']
question = dataset['validation'][i]['question']
answer = dataset['validation'][i]['answers']['text']
print(f"Q: {question} \n C: {context} \n A: {answer}")
print("---"*30)
print("Not fine-tuned model")
QA_answer(checkpoint, tokenizer, context, question)

Q: What site is located in the San Francisco Bay Area? 
 C: The league eventually narrowed the bids to three sites: New Orleans' Mercedes-Benz Superdome, Miami's Sun Life Stadium, and the San Francisco Bay Area's Levi's Stadium. 
 A: ["Levi's Stadium", "Levi's Stadium", "Levi's Stadium"]
------------------------------------------------------------------------------------------
Not fine-tuned model


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Question: What site is located in the San Francisco Bay Area? 
 Answer: three sites: New Orleans' Mercedes


Next come the data preprocessing functions for training and validation.

In [57]:
def preprocess_training_examples(examples, max_length = 512, stride = 128):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
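The core of the function above is the conversion of a character-level answer span into token indices by means of the offset mapping. The sketch below repeats that logic on plain Python data so it can be followed step by step. The helper name `char_span_to_token_span` and the hand-made offsets are illustrative assumptions, not real tokenizer output.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token).

    offsets: list of (char_start, char_end), one pair per token.
    context_start / context_end: first and last token index of the context.
    Returns (0, 0) when the answer is not fully inside this window.
    """
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy example: whitespace "tokens" with hand-written offsets (an assumption).
context = "the venue was Levi's Stadium"
offsets = [(0, 3), (4, 9), (10, 13), (14, 20), (21, 28)]
answer_text = "Levi's Stadium"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

span = char_span_to_token_span(offsets, 0, len(offsets) - 1, start_char, end_char)
print(span)
```

In the real features, index 0 is the `[CLS]` token, so the `(0, 0)` label marks windows that do not contain the whole answer.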

In [58]:
def preprocess_validation_examples(examples, max_length = 512, stride = 128):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
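Both preprocessing functions pass `return_overflowing_tokens=True` together with `stride`, so a context that does not fit into `max_length` is split into several overlapping windows, and `overflow_to_sample_mapping` records which example each window came from. The following is a simplified sketch of that windowing on a plain list. The helper `split_with_overlap` is hypothetical and ignores the question and special tokens that the real tokenizer also places in every window.

```python
def split_with_overlap(tokens, window, stride):
    """Split tokens into windows of at most `window` items,
    where consecutive windows share `stride` items."""
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return chunks


# Ten context "tokens", windows of 4 with an overlap of 2.
context_tokens = list(range(10))
chunks = split_with_overlap(context_tokens, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

The overlap means that an answer cut by one window boundary can still appear whole in a neighbouring window.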

In [59]:
def compute_metrics(start_logits, end_logits, features, examples,
                    n_best = 20, max_answer_length = 30):

    example_to_features = collections.defaultdict(list)

    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
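The span search inside `compute_metrics` can be tried in isolation. Below is a minimal sketch with made-up logits for a five-token feature: it takes the `n_best` highest start and end positions, discards pairs that end before they start or exceed the length limit, and keeps the pair with the highest summed logit. The function name `best_span` and the toy numbers are assumptions for illustration.

```python
def best_span(start_logit, end_logit, n_best=2, max_answer_length=3):
    """Return (start_index, end_index, score) of the best valid span, or None."""
    # Indices of the n_best highest logits, best first.
    start_indexes = sorted(
        range(len(start_logit)), key=lambda i: start_logit[i], reverse=True
    )[:n_best]
    end_indexes = sorted(
        range(len(end_logit)), key=lambda i: end_logit[i], reverse=True
    )[:n_best]

    best = None
    for s in start_indexes:
        for e in end_indexes:
            # Skip spans that end before they start or are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logit[s] + end_logit[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Made-up logits: the single best start (index 2) and best end (index 1)
# do not form a valid span, so the search falls back to the pair (2, 3).
start_logit = [0.1, 0.2, 3.0, 0.5, 2.5]
end_logit = [0.0, 2.8, 0.3, 2.0, 0.4]
result = best_span(start_logit, end_logit)
print(result)
```

This shows why the function scores pairs of candidates instead of taking the argmax of each logit vector independently.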

Prepare the datasets.

In [60]:
train_dataset = dataset["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

validation_dataset = dataset["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)


We train the model with the Trainer class that transformers provides.

In [65]:
args = TrainingArguments(
    output_dir='./fine_tuned_model',
    evaluation_strategy="no",
    save_strategy="epoch",  # one checkpoint per epoch; save_steps would be ignored here
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    logging_steps=200,
)


trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer
)
trainer.train()

Step    Training Loss
200     1.0713
400     0.8118
600     0.6239
800     0.5283


TrainOutput(global_step=942, training_loss=0.7312760160733687, metrics={'train_runtime': 776.0788, 'train_samples_per_second': 19.363, 'train_steps_per_second': 1.214, 'total_flos': 1963324134180864.0, 'train_loss': 0.7312760160733687, 'epoch': 3.0})
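The step count in the output above can be sanity-checked with a little arithmetic. The sketch assumes a single device and no gradient accumulation (the defaults), and it infers the number of training features from the reported throughput, since the notebook does not print it. That inferred number is an estimate, not a value taken from the run.

```python
import math


def total_steps(num_features, batch_size, epochs):
    """Optimizer steps when every epoch ends with one possibly partial batch."""
    return math.ceil(num_features / batch_size) * epochs


# Estimate: samples/second * runtime covers all 3 epochs of the training set.
num_features = round(19.363 * 776.0788 / 3)
steps = total_steps(num_features, batch_size=16, epochs=3)
print(num_features, steps)
```

The estimate is slightly above 5000 because a few long contexts are split into more than one window, and the resulting step count matches `global_step=942`.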

In [66]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
compute_metrics(start_logits, end_logits, validation_dataset, dataset["validation"])


{'exact_match': 65.2, 'f1': 71.41661134565473}
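To make the two numbers above concrete, here is a simplified sketch of how SQuAD-style exact match and token-level F1 are computed for a single prediction and a single reference. The real metric additionally takes the maximum over all reference answers of an example and reports the average over examples as a percentage. The function names are illustrative.

```python
import collections
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts tokens shared by both answers.
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Levi's Stadium"
print(exact_match("the Levi's Stadium.", reference))
print(f1("San Francisco Bay Area's Levi's Stadium", reference))
print(f1("three sites: New Orleans' Mercedes", reference))
```

A prediction that contains the reference plus extra words scores zero on exact match but keeps partial F1 credit, which is why F1 is the higher of the two numbers.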

The metrics are modest but already usable. Let's try the model on our question. For this we take the last checkpoint of the fine-tuned model; in my case it is step 942.

In [67]:
QA_answer('/content/fine_tuned_model/checkpoint-942', trainer.tokenizer, context, question)

Question: What site is located in the San Francisco Bay Area? 
 Answer: Levi's Stadium


The answer is clearly better than the one given by the model before fine-tuning.

Let's test the model on domain-specific questions about geology.

In [70]:
q = "What is the term used to describe the process of breaking down rocks into smaller particles through physical or chemical means?"
c = "Weathering is a fundamental concept in geology."
QA_answer('/content/fine_tuned_model/checkpoint-942', trainer.tokenizer, c, q)

Question: What is the term used to describe the process of breaking down rocks into smaller particles through physical or chemical means? 
 Answer: Weathering


If the answer is present in the context, the model copes well. Let's check what happens when the answer is partly or completely missing from the context.

In [71]:
q = "What is the process by which one type of rock transforms into another type of rock under intense heat and pressure?"
c = "Geological processes can cause rocks to change from one type to another."
QA_answer('/content/fine_tuned_model/checkpoint-942', trainer.tokenizer, c, q)

Question: What is the process by which one type of rock transforms into another type of rock under intense heat and pressure? 
 Answer: Geological processes
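An extractive model always returns some span of the context, even when the context does not contain the answer. The question-answering pipeline also returns a `score` next to `answer`, so one hedged way to handle such cases is to abstain when the score is low. The sketch below works on hand-written dictionaries shaped like the pipeline output. The helper `answer_or_abstain`, the threshold of 0.3 and the example scores are all made up for illustration and were not measured on this model.

```python
def answer_or_abstain(result, min_score=0.3):
    """Return the answer text, or None when the pipeline score is too low.

    result: dict with at least 'answer' and 'score' keys, the shape returned
    by the question-answering pipeline. min_score is a hypothetical threshold
    that would need tuning on validation data.
    """
    if result["score"] < min_score:
        return None
    return result["answer"]


# Hand-written placeholder results (scores are invented, not model output).
confident = {"answer": "span A", "score": 0.92, "start": 0, "end": 6}
unsure = {"answer": "span B", "score": 0.08, "start": 0, "end": 6}

print(answer_or_abstain(confident))
print(answer_or_abstain(unsure))
```

A low score does not guarantee a wrong answer, so this is only a rough filter.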


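The expected answer is metamorphism (the process that produces metamorphic rock). That word never appears in the context, and an extractive model can only return a span of the context, so it falls back on "Geological processes". Let's try simpler, hand-written question/answer pairs.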

In [72]:
q = "How many countries in the world?"
c = "Countries of the world according to UN – 195. A place with its own borders and fully independent government. This is the definition of a sovereign state or a country by the United Nations (UN). According to the UN, there are 193 countries in the world and 2 observer states (Palestine and Vatican City). So in total, there are 195 countries in the world."
QA_answer('/content/fine_tuned_model/checkpoint-942', trainer.tokenizer, c, q)

Question: How many countries in the world? 
 Answer: 193


In [73]:
q = "whose windows were washed?"
c = "While we were walking, we were watching window washers wash Washington’s windows with warm washing water."
QA_answer('/content/fine_tuned_model/checkpoint-942', trainer.tokenizer, c, q)

Question: whose windows were washed? 
 Answer: Washington’s windows with warm washing water


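The model handled the tricky tongue twister only partially: it found "Washington’s" but kept the rest of the sentence in the answer. I expect that training on the full dataset would reduce this problem.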