<a href="https://colab.research.google.com/github/npinto97/LLM_Fine-tuning/blob/main/NLPProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NLP project a.y. 2023-24

Nicolas Pinto 807348

Emanuele Tanzi 807406

# LLM360

This work presents LLM360, an innovative project that explores the use of large language models (LLMs) through a complete cycle that includes dataset generation, model fine-tuning, and response evaluation. The primary goal is to assess the effectiveness of LLMs in various application contexts, improving their capabilities in generating explanations for Italian terms and studying the semantic evolution of words over time.

In this study, we generate a dataset of Italian terms and their diachronic explanations, fine-tune a pre-trained LLM to improve its explanatory capabilities, and evaluate the quality of the generated explanations using both quantitative metrics and a qualitative assessment performed by an LLM. Through this integrated cycle, we aim to show how well LLMs can capture and explain the semantic evolution of words over time.

# Dataset construction

In this section, we use the dataset provided by WiC-ITA (https://wic-ita.github.io/) as the foundation for constructing the synthetic dataset on which the fine-tuning will be based.

WiC-ITA offers a dataset designed for the word sense disambiguation task, where the objective is to determine whether a word *w* that appears in two sentences *s1* and *s2* has the same meaning in both contexts.

The dataset includes several attributes, such as the lemma of the word to be examined, the two sentences in which the word appears, and other attributes useful for the task.

For our purposes, we extract the lemmas of all the words to be examined from both the training and test sets of the WiC-ITA dataset and save them into two separate files. It is important to note that, due to the nature of the task, some words in the test dataset also appear in the training dataset. Therefore, we removed all words from the test word list that were already present in the training word list.

In [None]:
!pip install together langchain_together langchain_core


In [None]:
import json

In [None]:
# Function to extract lemmas from the JSON file
def extract_lemmas(input_file_path, output_file_path):
    with open(input_file_path, 'r', encoding='utf-8') as input_file:
        lines = input_file.readlines()

    lemmas = []

    for line in lines:
        data = json.loads(line)
        lemmas.append(data["lemma"])

    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        for lemma in lemmas:
            output_file.write(lemma + '\n')


In [None]:
input_file_path = '/content/terms_sentences_train.jsonl'
output_file_path = '/content/list_of_words_train.txt'

extract_lemmas(input_file_path, output_file_path)


In [None]:
input_file_path = '/content/terms_sentences_test.jsonl'
output_file_path = '/content/list_of_words_test.txt'

extract_lemmas(input_file_path, output_file_path)


In [None]:
# Keep only the terms that appear in the test list but not in the train list
with open('list_of_words_train.txt', 'r', encoding='utf-8') as f1:
    train_words = set(line.strip() for line in f1)

with open('list_of_words_test.txt', 'r', encoding='utf-8') as f2:
    test_words = set(line.strip() for line in f2)

# Find words that are only in the test list
unique_words = test_words - train_words

# Sort so that the output file is the same on every run
with open('list_of_words_test.txt', 'w', encoding='utf-8') as out_file:
    for word in sorted(unique_words):
        out_file.write(word + '\n')


## Generation of datasets with Llama 3 70B via TogetherAI

This is the first of three phases in which LLMs play a central role in this project. In this phase, we use the Llama 3 70B model through the client library provided by TogetherAI, a cloud platform for building and running generative AI. Using a one-shot prompt, Llama is asked to generate, for each word saved in the previous phase, an explanation of how its meaning has evolved over time in the Italian language.


Note: the same term appears multiple times.

In [None]:
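import os
# Only the LangChain wrapper is used below; a second `Together` imported
# from the `together` package would be shadowed by this import.
from langchain_together import Together
from langchain_core.prompts import PromptTemplate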

In [None]:
def read_txt(txt_path):
    with open(txt_path, 'r', encoding='utf-8') as f:
        content = f.read().splitlines()
    return content

# Main function to build the dataset and save the results in a file
def build_dataset(file_parole, template_path, output_path):

    list_of_words = read_txt(file_parole)

    # Reads the template from the file
    with open(template_path, 'r', encoding='utf-8') as f:
        template = f.read()

    results = []

    # Create the client once instead of once per word
    # (the API key is left blank here; supply your own)
    llm = Together(model="meta-llama/Llama-3-70b-chat-hf", max_tokens=200, temperature=0.6, together_api_key="")

    for word in list_of_words:
        # Building the prompt with the current word
        prompt_text = template.format(parola=word)

        response = llm.invoke(prompt_text)

        explanation = response.strip()

        results.append(f'{{"lemma": "{word}", "spiegazione": "{explanation}"}}')
        #print(f'{{"lemma": "{word}", "spiegazione": "{explanation}"}}')

    with open(output_path, 'w', encoding='utf-8') as f:
        for result in results:
            f.write(result + '\n')


In [None]:
file_of_words = "list_of_words_train.txt"
template_path = "/content/promptLlama_con_esempio.txt"
output_path = "output_llama_train.txt"

build_dataset(file_of_words, template_path, output_path)

In [None]:
file_of_words = "list_of_words_test.txt"
template_path = "/content/promptLlama_con_esempio.txt"
output_path = "output_llama_test.txt"

build_dataset(file_of_words, template_path, output_path)

## Regex function for formatting the dataset

The code in this section filters and formats the output from Llama using regular expressions. This process transforms the raw output into a JSONL dataset, ready for subsequent use.
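For comparison, here is a minimal sketch of writing lemma/explanation pairs with `json.dumps`, which escapes quotes and newlines so that each line is valid JSON without later repair. The helper name `to_jsonl_line` and the sample strings are hypothetical, and this is not the approach taken in the notebook, which cleans the raw output with regular expressions instead.

```python
import json

def to_jsonl_line(lemma, explanation):
    """Serialize one record as a single JSON line (hypothetical helper)."""
    record = {"lemma": lemma, "spiegazione": explanation.strip()}
    # ensure_ascii=False keeps accented Italian characters readable
    return json.dumps(record, ensure_ascii=False)

# Illustrative input containing double quotes and a newline
line = to_jsonl_line("accendere", 'La parola "accendere"\nderiva dal latino.')
parsed = json.loads(line)
```

Because the serializer handles escaping, the record still occupies a single line and round-trips through `json.loads` unchanged.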

In [None]:
import re

def modify_text(text):
    # Regex to find everything between `"spiegazione" :` and `La parola`, removing any characters including special ones
    pattern_blocks = re.compile(r'("spiegazione": ").*?(La parola)', re.DOTALL)
    modified_text = pattern_blocks.sub(r'\1La parola', text)

    # Regex to replace leaked 'assistant' role markers (together with the preceding character) with a period
    pattern_assistant = re.compile(r'\b.assistant\b', re.IGNORECASE)
    modified_text = pattern_assistant.sub(".", modified_text)

    lines = modified_text.split('\n')

    # Keep only lines that start with a curly bracket and ensure each line ends with `"}`
    filtered_lines = []
    for line in lines:
        if line.strip().startswith("{"):
            if not line.strip().endswith('"}'):
                line = line.rstrip() + '"}'
            filtered_lines.append(line)

    modified_text = '\n'.join(filtered_lines)

    # Regex to remove all quotes between 'La parola' and '"}'
    pattern_quotes = re.compile(r'(La parola)[^}]*?(?="})')
    final_text = pattern_quotes.sub(lambda m: m.group(0).replace('"', ''), modified_text)

    return final_text


In [None]:
with open('output_llama_train.txt', 'r', encoding='utf-8') as file:
    content = file.read()

modified_content = modify_text(content)

with open('terms_explanations_train.jsonl', 'w', encoding='utf-8') as file:
    file.write(modified_content)

In [None]:
with open('output_llama_test.txt', 'r', encoding='utf-8') as file:
    content = file.read()

modified_content = modify_text(content)

with open('terms_explanations_test.jsonl', 'w', encoding='utf-8') as file:
    file.write(modified_content)

In [None]:
# Check that the JSONL files are properly formatted
def validate_jsonl(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                print(f"Error in line {i}: {e}")
                print(line)

validate_jsonl('terms_explanations_train.jsonl')
validate_jsonl('terms_explanations_test.jsonl')
validate_jsonl('terms_sentences_train.jsonl')
validate_jsonl('terms_sentences_test.jsonl')

# Fine-tuning

The fine-tuning phase is the second stage in which a large language model plays a central role. We selected Gemma 2B as the baseline model, a choice driven primarily by the limited resources available. The Gemma family is nevertheless a series of lightweight, state-of-the-art open models built from the research and technology behind the Gemini models, and these models have shown strong performance across academic benchmarks in language understanding, reasoning, and safety.

In [None]:
!pip install datasets
!pip install peft
!pip install bitsandbytes
!pip install trl


In [None]:
import json

test_data = []
with open('terms_explanations_test.jsonl', 'r') as f:
    for line in f:
        test_data.append(json.loads(line))

## Test with the base model


Before any fine-tuning, Gemma 2B is asked to complete the task from a single instruction prompt. This request is made through the `generate_response` function. To ensure consistency, the same function is used unchanged for the subsequent models. Analyzing Gemma 2B's responses before fine-tuning establishes a baseline against which the performance of the fine-tuned models can be evaluated.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
base_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")


The `generate_response` function generates a response from a language model while penalizing repetitions to improve the variety and quality of the output. Several generation parameters control the creativity and diversity of the responses:
1. Temperature: rescales the token probabilities; lower values make the output more deterministic, higher values more varied.
2. Top-k: limits the number of candidate tokens the model can choose from at each step.
3. Top-p (nucleus sampling): limits the choice to the smallest set of tokens whose cumulative probability reaches p.

Parameter values were chosen empirically, staying within commonly used ranges.
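To make the three parameters concrete, the following is a simplified, library-independent sketch of how temperature, top-k, and top-p act on a vector of logits before a token is drawn. The function name `filter_and_sample` and the example logits are hypothetical, and real implementations (including the one in `transformers`) differ in details such as the order of renormalization. Note also that in Hugging Face `generate`, these sampling parameters only take effect when `do_sample=True`.

```python
import math
import random

def filter_and_sample(logits, temperature=0.7, top_k=50, top_p=0.9, rng=None):
    """Sample one token index after temperature, top-k and top-p filtering."""
    if rng is None:
        rng = random.Random(0)

    # Temperature: divide logits before the softmax; values < 1 sharpen the distribution
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights of the surviving tokens implicitly
    choice = rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
    return choice, kept

# Illustrative logits for a five-token vocabulary
choice, kept = filter_and_sample([4.0, 3.0, 1.0, 0.5, -2.0], top_k=3, top_p=0.9)
```

With these illustrative logits, top-k first narrows the candidates to three tokens, and top-p then narrows them to two, because the two most probable tokens already cover more than 90% of the probability mass.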

In [None]:
# Function to generate answers with repetition penalty
def generate_response(model, tokenizer, lemma, max_length=256):
    input_text = f"Descrivi brevemente come si è evoluto il significato della parola {lemma} nella lingua italiana."
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Parameters for generation
    repetition_penalty = 1.5  # Repetition penalty. Common values: [1.0, 2.0]
    no_repeat_ngram_size = 2  # Forbid repeating any bigram
    temperature = 0.7  # Temperature. Common values: [0.7, 1.0]
    top_k = 50  # Top-k. Common values: 40, 50 or 100
    top_p = 0.9  # Top-p (nucleus sampling). Common values: [0.8, 0.95]

    outputs = model.generate(
        **inputs,
        max_length=max_length,  # counts prompt tokens plus generated tokens
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,  # without sampling, temperature/top_k/top_p are ignored
        repetition_penalty=repetition_penalty,
        no_repeat_ngram_size=no_repeat_ngram_size,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

In [None]:
# Testing the response of the base model
lemma = "accendere"
response = generate_response(base_model, base_tokenizer, lemma)
print(response)



Descrivi brevemente come si è evoluto il significato della parola accendere nella lingua italiana.

The word "accender" has a long history in Italian, and its meaning is constantly evolving to reflect the changing times. In ancient times, the word was used as a verb to describe lighting fires or candles, but over time it came to mean more than just igniting something: It also referred to starting up machinery or processes that required energy input from outside sources (such as electricity). Today, we use "accende" when referring not only these physical objects themselves – like lamps -butalso any kind of activity which requires an initial spark before proceeding further down its path towards completion!


In [None]:
# Generation of responses with the base model
base_responses = []
true_explanations = []

In [None]:
for example in test_data:
    # Read the lemma inside the loop so that each test term gets its own response
    lemma = example["lemma"]
    true_explanation = example["spiegazione"]

    base_response = generate_response(base_model, base_tokenizer, lemma)
    base_responses.append(base_response)
    true_explanations.append(true_explanation)

KeyboardInterrupt: 

In [None]:
with open('base_responses.json', 'w') as f:
    json.dump(base_responses, f)

In [None]:
# Release the model and free cached GPU memory
del base_model
del base_tokenizer
torch.cuda.empty_cache()


## First fine-tuning

The fine-tuning phase is of central importance. Here the baseline model is trained on the previously generated dataset so that it specializes in explaining lexical semantic change. Fine-tuning is configured through the `LoraConfig` class of the PEFT library. Low-Rank Adaptation (LoRA) is a Parameter-Efficient Fine-Tuning (PEFT) method that represents the update to a large weight matrix, here in the attention layers, as the product of two much smaller low-rank matrices. This significantly reduces the number of parameters that need to be trained, making the process more efficient and less resource-intensive.

An essential part of this phase is defining a training prompt, which is formatted with the current lemma-explanation pair each time. The training parameters are set according to commonly used values, ensuring they are consistent with the available resources. This approach allows for effective fine-tuning while optimizing resource usage, ensuring that the model becomes skilled in capturing and explaining the semantic changes of words over time.

In [None]:
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, AutoModelForCausalLM, AutoTokenizer

In [None]:
model_name = "google/gemma-2b"
fine_tuned_model = AutoModelForCausalLM.from_pretrained(model_name)
fine_tuned_tokenizer = AutoTokenizer.from_pretrained(model_name)


LoRA modifies the fine-tuning process by freezing the original model weights and training a separate, much smaller set of weights whose contribution is added to the original parameters. Because the update is constrained to a low rank, far fewer parameters need training, which speeds up the process and lowers costs.
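As a rough illustration of the saving, the sketch below counts trainable parameters for one weight matrix with and without LoRA, and builds the effective weight `W + (alpha / r) * B @ A` on a tiny example. The helper names and the 2048 x 2048 matrix shape are assumptions made for illustration; only `r=16` is taken from the configuration in this notebook.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in          # updating W directly
    lora = r * (d_in + d_out)    # B is d_out x r, A is r x d_in
    return full, lora

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, B, A, alpha, r):
    """Compute W + (alpha / r) * B @ A; W itself stays frozen."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Hypothetical 2048 x 2048 projection matrix with rank r=16
full, lora = lora_param_counts(2048, 2048, 16)

# Tiny numeric example: 2x2 frozen weight with a rank-1 update
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # 2 x 1
A = [[0.5, 0.5]]     # 1 x 2
W_eff = lora_effective_weight(W, B, A, alpha=1, r=1)
```

Under these assumed shapes, the adapter trains 65,536 parameters in place of 4,194,304, about 1.6% of the full matrix.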

In [None]:
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Add the LoRA adapters to the model to be fine-tuned
fine_tuned_model = get_peft_model(fine_tuned_model, peft_config)

In [None]:
train_dataset = load_dataset('json', data_files='terms_explanations_train.jsonl', split='train')


In [None]:
# Definition of the training prompt
training_prompt = """Descrivi brevemente come si è evoluto il significato della parola {} nella lingua italiana.
Spiegazione: {}
"""

EOS_TOKEN = fine_tuned_tokenizer.eos_token
def formatting_prompts_func(examples):
    lemmas = examples["lemma"]
    spiegazioni = examples["spiegazione"]
    texts = []
    for lemma, spiegazione in zip(lemmas, spiegazioni):
        text = training_prompt.format(lemma, spiegazione) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

In [None]:
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/2797 [00:00<?, ? examples/s]

In [None]:
def tokenize_function(examples):
    return fine_tuned_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
tokenized_train_dataset = tokenized_train_dataset.rename_column("input_ids", "labels")


In [None]:
# Training parameters
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=2e-4,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    seed=3407,
    output_dir="outputs",
)

In [None]:
trainer = SFTTrainer(
    model=fine_tuned_model,
    tokenizer=fine_tuned_tokenizer,
    train_dataset=tokenized_train_dataset,
    dataset_text_field="text",
    max_seq_length=512,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)



max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer.train()

Step,Training Loss
1,2.0813
2,2.0977
3,2.0751
4,2.0069
5,2.1055
6,2.0481
7,1.9633
8,1.9382
9,1.8359
10,1.9079


TrainOutput(global_step=60, training_loss=1.3874642034371694, metrics={'train_runtime': 251.6427, 'train_samples_per_second': 1.907, 'train_steps_per_second': 0.238, 'total_flos': 980188946743296.0, 'train_loss': 1.3874642034371694, 'epoch': 0.17155110793423875})

In [None]:
# Fine tuned model testing
lemma = "accendere"
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, lemma)
print(response)



Descrivi brevemente come si è evoluto il significato della parola accendere nella lingua italiana.
Spiegazione: La prima attestazione scritta di accende in italiano risale al XIV secolo, dove indicava la funzione elettrica dell'incendio o del fuoco per illuminare una stanza oppure un edificio; successivamente nel XVI e XVII secolo acquisì anche significati più astratti quali rappresentare l'atto mentale che porta alla comprensione delle idee filosofiche ed epistemologiche (accedere significa infatti comprendere). In seguito a questo uso comunemente accettato, la definizione originaria fu estesa ad includerla anche nell accezioni scientifiche relative all'elettromagnetismo e alle scienze fisici-chimico-biologiche. Infine, con lo sviluppo dei linguaggi informatici e tecnologici ha assunto ancora altri concetti relativi ai processi informativi e cognitivi basati sulla memorizzazione dati e su algoritmi computazionali. Inoltre, l’uso corrente includeva anche le definizioni correlate legate

In [None]:
# Generation of responses with the first fine-tuned model
fine_tuned_responses = []

for example in test_data:
    lemma = example["lemma"]
    true_explanation = example["spiegazione"]

    fine_tuned_response = generate_response(fine_tuned_model, fine_tuned_tokenizer, lemma)
    fine_tuned_responses.append(fine_tuned_response)

with open('fine_tuned_responses.json', 'w') as f:
    json.dump(fine_tuned_responses, f)

In [None]:
# Release the model and free cached GPU memory
del fine_tuned_model
del fine_tuned_tokenizer
torch.cuda.empty_cache()

## Second fine-tuning

In the second fine-tuning phase, the baseline model (Gemma) is trained on both the synthetically generated dataset and the dataset supplied by WiC-ITA. This decision was driven by the observation that both datasets cover the same terms, as discussed above. We wanted to test whether incorporating the WiC-ITA dataset would offer additional information that the model could exploit to improve its performance. As we will show, integrating the WiC-ITA dataset into the fine-tuning process unexpectedly proved counterproductive.

### Preparation and Formatting of Datasets

In [None]:
import json
from datasets import load_dataset, Dataset

In [None]:
new_train_data = []
with open('terms_sentences_train.jsonl', 'r') as f:
    for line in f:
        new_train_data.append(json.loads(line))

In [None]:
# Formatting the new dataset
def format_new_dataset(data):
    formatted_data = []
    for item in data:
        lemma = item["lemma"]
        sentence1 = item["sentence1"]
        sentence2 = item["sentence2"]
        label = item["label"]
        instruction = "Determina se la parola ha lo stesso significato in entrambe le frasi."
        input_text = f"Parola: {lemma}\nFrase 1: {sentence1}\nFrase 2: {sentence2}"
        response_text = f"Label: {label}"
        formatted_data.append({
            "instruction": instruction,
            "input": input_text,
            "output": response_text
        })
    return formatted_data

formatted_new_train_data = format_new_dataset(new_train_data)

In [None]:
# Loading existing train dataset
train_data = []
with open('terms_explanations_train.jsonl', 'r') as f:
    for line in f:
        train_data.append(json.loads(line))

In [None]:
# Formatting for existing train dataset.
def format_existing_dataset(data):
    formatted_data = []
    for item in data:
        lemma = item["lemma"]
        spiegazione = item["spiegazione"]
        instruction = "Descrivi come si è evoluto il significato della parola data nella lingua italiana."
        input_text = f"Lemma: {lemma}"
        response_text = f"Spiegazione: {spiegazione}"
        formatted_data.append({
            "instruction": instruction,
            "input": input_text,
            "output": response_text
        })
    return formatted_data

formatted_train_data = format_existing_dataset(train_data)

In [None]:
# Union of the two datasets and conversion to Dataset
combined_train_data = formatted_train_data + formatted_new_train_data

train_dataset = Dataset.from_list(combined_train_data)

### Fine-Tuning

To keep the two fine-tuning runs comparable, the same LoRA configuration and training parameters were used; the training prompt was adapted to the input and output fields of the combined dataset.

This decision was made to ensure that any differences in the performance of the models could be attributed to the variations in the datasets rather than changes in the training configuration.
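One consequence of keeping the training parameters fixed is that both runs see roughly the same number of examples regardless of dataset size. The back-of-the-envelope sketch below uses the batch size, gradient accumulation, and `max_steps` values from this notebook, together with the dataset sizes shown in the run logs (2797 and 5602 examples). The function name `training_coverage` is hypothetical, and the fractions only approximate the `epoch` values reported by the trainer.

```python
def training_coverage(num_examples, batch_size=2, grad_accum=4, max_steps=60):
    """Approximate examples seen and fraction of one epoch covered."""
    effective_batch = batch_size * grad_accum      # examples per optimizer step
    examples_seen = effective_batch * max_steps    # examples over the whole run
    return effective_batch, examples_seen, examples_seen / num_examples

# Dataset sizes taken from the run logs
first = training_coverage(2797)    # synthetic explanations only
second = training_coverage(5602)   # explanations plus WiC-ITA sentences
```

Each run therefore processes about 480 examples: roughly 17% of the first dataset but under 9% of the combined one, so in the second run only part of those 480 examples can be explanations.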

In [None]:
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from transformers import TrainingArguments, AutoModelForCausalLM, AutoTokenizer

In [None]:
model_name = "google/gemma-2b"
fine_tuned_model_2 = AutoModelForCausalLM.from_pretrained(model_name)
fine_tuned_tokenizer_2 = AutoTokenizer.from_pretrained(model_name)


In [None]:
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

fine_tuned_model_2 = get_peft_model(fine_tuned_model_2, peft_config)

In [None]:
training_prompt = """Lemma: {}
Spiegazione: {}
"""

EOS_TOKEN = fine_tuned_tokenizer_2.eos_token
def formatting_prompts_func(examples):
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for input_text, output in zip(inputs, outputs):
        text = training_prompt.format(input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

In [None]:
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/5602 [00:00<?, ? examples/s]

In [None]:
def tokenize_function(examples):
    return fine_tuned_tokenizer_2(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
tokenized_train_dataset = tokenized_train_dataset.rename_column("input_ids", "labels")


In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=2e-4,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    seed=3407,
    output_dir="outputs",
)

In [None]:
trainer = SFTTrainer(
    model=fine_tuned_model_2,
    tokenizer=fine_tuned_tokenizer_2,
    train_dataset=tokenized_train_dataset,
    dataset_text_field="text",
    max_seq_length=512,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)



max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer.train()

Step,Training Loss
1,3.0864
2,3.0509
3,2.9084
4,3.5503
5,2.1912
6,3.4864
7,2.3316
8,2.2341
9,2.6057
10,3.407


TrainOutput(global_step=60, training_loss=2.351414555311203, metrics={'train_runtime': 219.1774, 'train_samples_per_second': 2.19, 'train_steps_per_second': 0.274, 'total_flos': 845424284319744.0, 'train_loss': 2.351414555311203, 'epoch': 0.08568368439842913})

In [None]:
# Testing the second fine tuned model's response
lemma = "accendere"
response = generate_response(fine_tuned_model_2, fine_tuned_tokenizer_2, lemma)
print(response)



Descrivi brevemente come si è evoluto il significato della parola accendere nella lingua italiana.
Come ha cambiato la definizione di acchiappare nel corso del tempo?

Grazie!
Ciao,
la prima domanda non mi sembra molto chiara: cosa intendi per "accendersi"? Se ti riferisci alla luce che viene fuori quando un oggetto o una persona entra in contatto con l'aria (ad esempio se qualcuno spara a te), allora sì, questa accezione era già presente all’inizio dell’XI secolo; ma poi questo senso scomparve e fu ripreso solo verso fine XVI-inizi XVII secolo da Giovanni Battista Guarini nell’opera La grammatica moderna .
La seconda invece chiedeva semplicemente quale fosse stato lo sviluppo storico delle parole 'accendire', 'sputare' , 'spegnere'. In realtà queste due ultime erano state introdotte dal latino medievale : ad es., le prime attestate sono quelle riportata dalla Enciclopedia Treccani mentre quella relativa al secondo termine proviene dall’Enciclopedia Italiana 1930 ; quest’ultima però in

In [None]:
# Generating responses with the second fine-tuned model
fine_tuned_responses = []

for example in test_data:
    lemma = example["lemma"]
    true_explanation = example["spiegazione"]

    fine_tuned_response = generate_response(fine_tuned_model_2, fine_tuned_tokenizer_2, lemma)
    fine_tuned_responses.append(fine_tuned_response)

with open('fine_tuned_responses_2.json', 'w') as f:
    json.dump(fine_tuned_responses, f)




In [None]:
del fine_tuned_model_2
del fine_tuned_tokenizer_2
torch.cuda.empty_cache()


## Evaluation

In the evaluation phase, the objective was to compare the two fine-tuned models with the baseline model. This evaluation was conducted in two stages.

First, standard automatic metrics, namely BERTScore, BLEU, and ROUGE, were used. These metrics provide a quantitative assessment of how closely the generated explanations of lexical semantic change match the reference explanations.

Second, we experimented with using an LLM, specifically Llama 3 70B, as a qualitative evaluator of the responses produced by the two fine-tuned models. This approach aims to leverage the language understanding capabilities of Llama 3 70B to provide a more nuanced assessment of the generated explanations, beyond what quantitative metrics alone can capture.

By combining both quantitative and qualitative evaluations, we aimed to gain a comprehensive understanding of the strengths and limitations of the fine-tuned models compared to the baseline, ensuring a robust assessment of their performance.

### Evaluation of the two fine-tuned models against the baseline with BERTScore, BLEU, and ROUGE

We consider BERTScore as the primary metric for evaluating the models in this task. Its ability to capture the semantic meaning of sentences makes it particularly suitable for assessing explanations of lexical semantic change. Since understanding and explaining the evolution of word meanings require a deep grasp of context and semantics, BERTScore’s approach ensures that the generated explanations are not only lexically accurate but also semantically coherent and meaningful. This makes it the most crucial metric for our evaluation.
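
To make the idea of embedding-based matching concrete, the following is a minimal sketch of the greedy cosine matching at the core of BERTScore. It is not the `bert_score` implementation: it omits IDF weighting and baseline rescaling, and the two-dimensional "embeddings" are invented toy values rather than real BERT outputs. The function name `greedy_match_scores` is hypothetical.

```python
import numpy as np

def greedy_match_scores(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """Return (precision, recall, f1) from token embeddings via greedy cosine matching."""
    # Normalise rows so that dot products are cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # shape: (candidate tokens, reference tokens)
    precision = sim.max(axis=1).mean()  # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()     # each reference token -> best candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

# Toy 2-d "embeddings" (hypothetical values, not real BERT outputs).
candidate = np.array([[1.0, 0.0], [0.0, 1.0]])
reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
p, r, f = greedy_match_scores(candidate, reference)
print(p, r, f)
```

In this toy case every candidate token has an exact match in the reference, so precision is 1, while the third reference token is only partially covered, which lowers recall.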

While BERTScore is our primary metric, BLEU provides additional insights into the accuracy of word and phrase generation. It helps assess whether the generated explanations contain the correct terms and expressions found in the reference explanations. BLEU is included to observe whether the fine-tuned models can produce text that closely mirrors the reference dataset in terms of exact word usage.
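
As an illustration of what "exact word usage" means for BLEU, here is a small sketch of the clipped (modified) n-gram precision that BLEU builds on. It assumes a single reference and whitespace tokenization, and it leaves out the brevity penalty and the geometric mean over n-gram orders; the helper name `modified_precision` and the example sentences are hypothetical.

```python
from collections import Counter

def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Clip each candidate count by how often the n-gram occurs in the reference.
    clipped = sum(min(count, ref_ngrams[ngram]) for ngram, count in cand_ngrams.items())
    return clipped / total

# Invented example sentences, for illustration only.
cand = "la parola indica la luce".split()
ref = "la parola indica il fuoco".split()
p1 = modified_precision(cand, ref, 1)  # unigram precision
p2 = modified_precision(cand, ref, 2)  # bigram precision
print(p1, p2)
```

Because the repeated "la" is clipped to the single occurrence in the reference, repeating a correct word does not inflate the score.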

ROUGE is particularly useful for evaluating the comprehensiveness of the generated explanations. Given that our task involves generating detailed and accurate historical explanations of word meanings, ROUGE helps ensure that the generated text captures the necessary breadth of information. By focusing on recall, ROUGE metrics confirm that the fine-tuned models do not miss critical elements present in the reference explanations.
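
The recall orientation of ROUGE can be sketched with a simplified ROUGE-1 computation. This is only an approximation of what the `rouge_score` package does: it uses lowercased whitespace tokens and no stemming, and the function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict[str, float]:
    """Unigram-overlap precision, recall and F-measure (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Multiset intersection: each shared word counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

# Invented example sentences, for illustration only.
reference = "nel medioevo la parola indicava il fuoco"
candidate = "la parola indicava il fuoco sacro"
scores = rouge1(reference, candidate)
print(scores)
```

Here recall is 5/7 because two of the seven reference words ("nel", "medioevo") are missing from the candidate, which is the kind of omission the recall component is meant to expose.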

#### Combined Rationale:
While BERTScore serves as the cornerstone of our evaluation due to its superior semantic evaluation capabilities, BLEU and ROUGE are utilized for a more comprehensive analysis. BLEU analyzes precision and exact word matching, and ROUGE assesses recall and completeness. This multi-faceted approach provides a well-rounded assessment, capturing both the lexical and semantic quality of the generated explanations.

In [None]:
!pip install bert_score rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=8a1f7db855eb77daf7084e907e25ad329798ce32cd0407e1ad819d0affcf2162
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
import json
from bert_score import score
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer
import pandas as pd

In [None]:
with open('base_responses.json', 'r') as f:
    base_responses = json.load(f)

with open('fine_tuned_responses.json', 'r') as f:
    fine_tuned_responses = json.load(f)

with open('fine_tuned_responses_2.json', 'r') as f:
    fine_tuned_responses_2 = json.load(f)

In [None]:
# Loading reference explanations
with open('terms_explanations_test.jsonl', 'r') as f:
    true_explanations = [json.loads(line)["spiegazione"] for line in f]

# Ensure that the number of responses and explanations is the same
assert len(base_responses) == len(true_explanations), "The number of base model responses differs from the number of explanations"
assert len(fine_tuned_responses) == len(true_explanations), "The number of responses from the first fine-tuned model differs from the number of explanations"
assert len(fine_tuned_responses_2) == len(true_explanations), "The number of responses from the second fine-tuned model differs from the number of explanations"

In [None]:
# BERTScore calculation for the base model
P_base, R_base, F1_base = score(base_responses, true_explanations, lang="it", verbose=True)

# BERTScore calculation for the first fine tuned model
P_fine, R_fine, F1_fine = score(fine_tuned_responses, true_explanations, lang="it", verbose=True)

# BERTScore calculation for the second fine-tuned model
P_fine_2, R_fine_2, F1_fine_2 = score(fine_tuned_responses_2, true_explanations, lang="it", verbose=True)

calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.62 seconds, 25.64 sentences/sec
calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.80 seconds, 20.07 sentences/sec
calculating scores...
computing bert embedding.
computing greedy matching.
done in 0.70 seconds, 22.75 sentences/sec


In [None]:
# BLEU
reference_list = [[ref.split()] for ref in true_explanations]
base_list = [resp.split() for resp in base_responses]
fine_list = [resp.split() for resp in fine_tuned_responses]
fine_2_list = [resp.split() for resp in fine_tuned_responses_2]

bleu_base = corpus_bleu(reference_list, base_list)
bleu_fine = corpus_bleu(reference_list, fine_list)
bleu_fine_2 = corpus_bleu(reference_list, fine_2_list)

In [None]:
# ROUGE
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

rouge_base_scores = [scorer.score(ref, resp) for ref, resp in zip(true_explanations, base_responses)]
rouge_fine_scores = [scorer.score(ref, resp) for ref, resp in zip(true_explanations, fine_tuned_responses)]
rouge_fine_2_scores = [scorer.score(ref, resp) for ref, resp in zip(true_explanations, fine_tuned_responses_2)]

def avg_rouge_scores(rouge_scores):
    rouge1 = {"precision": 0, "recall": 0, "fmeasure": 0}
    rouge2 = {"precision": 0, "recall": 0, "fmeasure": 0}
    rougeL = {"precision": 0, "recall": 0, "fmeasure": 0}

    for score in rouge_scores:
        rouge1["precision"] += score["rouge1"].precision
        rouge1["recall"] += score["rouge1"].recall
        rouge1["fmeasure"] += score["rouge1"].fmeasure
        rouge2["precision"] += score["rouge2"].precision
        rouge2["recall"] += score["rouge2"].recall
        rouge2["fmeasure"] += score["rouge2"].fmeasure
        rougeL["precision"] += score["rougeL"].precision
        rougeL["recall"] += score["rougeL"].recall
        rougeL["fmeasure"] += score["rougeL"].fmeasure

    num_scores = len(rouge_scores)
    for key in rouge1:
        rouge1[key] /= num_scores
        rouge2[key] /= num_scores
        rougeL[key] /= num_scores

    return {"rouge1": rouge1, "rouge2": rouge2, "rougeL": rougeL}

avg_rouge_base = avg_rouge_scores(rouge_base_scores)
avg_rouge_fine = avg_rouge_scores(rouge_fine_scores)
avg_rouge_fine_2 = avg_rouge_scores(rouge_fine_2_scores)

In [None]:
results = {
    "Metric": ["BERTScore Precision", "BERTScore Recall", "BERTScore F1",
               "BLEU",
               "ROUGE-1 Precision", "ROUGE-1 Recall", "ROUGE-1 F1",
               "ROUGE-2 Precision", "ROUGE-2 Recall", "ROUGE-2 F1",
               "ROUGE-L Precision", "ROUGE-L Recall", "ROUGE-L F1"],
    "Base Model": [P_base.mean().item(), R_base.mean().item(), F1_base.mean().item(),
                   bleu_base,
                   avg_rouge_base["rouge1"]["precision"], avg_rouge_base["rouge1"]["recall"], avg_rouge_base["rouge1"]["fmeasure"],
                   avg_rouge_base["rouge2"]["precision"], avg_rouge_base["rouge2"]["recall"], avg_rouge_base["rouge2"]["fmeasure"],
                   avg_rouge_base["rougeL"]["precision"], avg_rouge_base["rougeL"]["recall"], avg_rouge_base["rougeL"]["fmeasure"]],
    "Fine-tuned Model 1": [P_fine.mean().item(), R_fine.mean().item(), F1_fine.mean().item(),
                           bleu_fine,
                           avg_rouge_fine["rouge1"]["precision"], avg_rouge_fine["rouge1"]["recall"], avg_rouge_fine["rouge1"]["fmeasure"],
                           avg_rouge_fine["rouge2"]["precision"], avg_rouge_fine["rouge2"]["recall"], avg_rouge_fine["rouge2"]["fmeasure"],
                           avg_rouge_fine["rougeL"]["precision"], avg_rouge_fine["rougeL"]["recall"], avg_rouge_fine["rougeL"]["fmeasure"]],
    "Fine-tuned Model 2": [P_fine_2.mean().item(), R_fine_2.mean().item(), F1_fine_2.mean().item(),
                           bleu_fine_2,
                           avg_rouge_fine_2["rouge1"]["precision"], avg_rouge_fine_2["rouge1"]["recall"], avg_rouge_fine_2["rouge1"]["fmeasure"],
                           avg_rouge_fine_2["rouge2"]["precision"], avg_rouge_fine_2["rouge2"]["recall"], avg_rouge_fine_2["rouge2"]["fmeasure"],
                           avg_rouge_fine_2["rougeL"]["precision"], avg_rouge_fine_2["rougeL"]["recall"], avg_rouge_fine_2["rougeL"]["fmeasure"]],
}

results_df = pd.DataFrame(results)
print(results_df)

                 Metric  Base Model  Fine-tuned Model 1  Fine-tuned Model 2
0   BERTScore Precision    0.634947            0.692616            0.653618
1      BERTScore Recall    0.666521            0.738854            0.681913
2          BERTScore F1    0.650136            0.714930            0.667063
3                  BLEU    0.010289            0.019053            0.015501
4     ROUGE-1 Precision    0.087910            0.247815            0.253092
5        ROUGE-1 Recall    0.110618            0.380555            0.290233
6            ROUGE-1 F1    0.094744            0.298872            0.254315
7     ROUGE-2 Precision    0.018771            0.047776            0.048094
8        ROUGE-2 Recall    0.023392            0.072917            0.049930
9            ROUGE-2 F1    0.020052            0.057449            0.045095
10    ROUGE-L Precision    0.052296            0.117114            0.123879
11       ROUGE-L Recall    0.065345            0.179274            0.136418
12          

### Evaluation Results

The metrics show that the first fine-tuned model, trained only on the synthetically generated dataset, performs best on almost every measure. It achieved a BERTScore F1 of 0.715, against 0.667 for the second model and 0.650 for the baseline. The second model, fine-tuned on both the generated dataset and the WiC dataset, also outperformed the baseline but scored lower than the first model on BERTScore, BLEU, and all ROUGE recall and F1 values reported above; only its ROUGE precision values are marginally higher.
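
To put the differences in perspective, the short sketch below computes the relative gain over the baseline for BERTScore F1. The three values are copied from the results table above; the helper name `relative_gain` is hypothetical, and the calculation says nothing about statistical significance on a test set of this size.

```python
# BERTScore F1 values copied from the results table above.
bertscore_f1 = {
    "Base Model": 0.650136,
    "Fine-tuned Model 1": 0.714930,
    "Fine-tuned Model 2": 0.667063,
}

def relative_gain(score: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (score - baseline) / baseline * 100

baseline = bertscore_f1["Base Model"]
gains = {
    name: relative_gain(value, baseline)
    for name, value in bertscore_f1.items()
    if name != "Base Model"
}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}% over the baseline")
```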

For the qualitative evaluation, a prompt presents LLama3_70b with the responses of the baseline model and of the two fine-tuned models for the same word, and LLama3_70b is asked to identify the most accurate one. This adds a second layer of validation, so that the assessment does not rest solely on BERT embeddings but also reflects the judgement of another capable language model.

This qualitative evaluation also indicated that the first fine-tuned model, which utilized only the synthetically generated dataset, was the best performer.

### Evaluation of responses using Llama 3 70b

The evaluation with LLama3_70b is the third stage of the project in which an LLM plays a central role. Here LLama3_70b acts as a qualitative evaluator: given an appropriate prompt, it indicates which of the supplied responses is the most accurate. The idea is that BERTScore measures the semantic similarity between generated and reference texts through BERT embeddings; by loose analogy, asking LLama3_70b to judge the responses relies on the language understanding encoded in that larger model, although in our prompt the responses are compared with one another rather than with a reference explanation.

To conduct this evaluation, we performed five tests where the prompt included responses provided by the baseline model and the two fine-tuned models for the same word. The responses were selected randomly to ensure unbiased evaluation. LLama3_70b was then asked to determine which response was the most accurate.

As we can see from the results, this qualitative assessment also indicated that the first fine-tuned model, which used only the synthetically generated dataset, was superior. LLama3_70b consistently identified responses from this model as the most accurate compared to those from the baseline model and the second fine-tuned model.
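
One way to turn the judge's free-text verdicts into a count per model is sketched below. This is not part of the original pipeline: it assumes that the verdict names its choice as "Risposta N" and that the first such mention is the chosen answer, which is a simplification. The names `LABELS`, `extract_choice`, and `tally_verdicts` are hypothetical, and the example verdicts are invented.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical mapping from the answer number used in the prompt to a model name.
LABELS = {1: "base", 2: "fine_tuned_1", 3: "fine_tuned_2"}

def extract_choice(verdict: str) -> Optional[int]:
    """Return the first answer number (1-3) mentioned as 'Risposta N', if any."""
    match = re.search(r"Risposta\s+([123])", verdict, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

def tally_verdicts(verdicts: list[str]) -> Counter:
    """Count how often each model is picked; unparseable verdicts are kept separate."""
    counts = Counter()
    for verdict in verdicts:
        choice = extract_choice(verdict)
        counts[LABELS[choice] if choice is not None else "unparsed"] += 1
    return counts

# Invented verdicts for illustration only; they are not outputs of the actual run.
example_verdicts = [
    "La risposta più accurata è la Risposta 2, perché ...",
    "Scelgo la Risposta 2.",
    "Nessuna delle risposte è soddisfacente.",
]
counts = tally_verdicts(example_verdicts)
print(counts)
```

Verdicts that do not match the pattern are counted as "unparsed" rather than guessed, so they can be checked by hand.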

In [None]:
import random
import json
import requests

In [None]:
with open('base_responses.json', 'r') as f:
    base_responses = json.load(f)

with open('fine_tuned_responses.json', 'r') as f:
    fine_tuned_responses = json.load(f)

with open('fine_tuned_responses_2.json', 'r') as f:
    fine_tuned_responses_2 = json.load(f)

In [None]:
# Loading reference explanations
true_explanations = []
true_terms = []
with open('terms_explanations_test.jsonl', 'r') as f:
    for line in f:
        data = json.loads(line)
        true_explanations.append(data["spiegazione"])
        true_terms.append(data["lemma"])

In [None]:
assert len(base_responses) == len(fine_tuned_responses) == len(fine_tuned_responses_2) == len(true_explanations) == len(true_terms), "The dataset lengths do not match."

In [None]:
# Select 5 random answers
indices = random.sample(range(len(true_explanations)), 5)

selected_base_responses = [base_responses[i] for i in indices]
selected_fine_tuned_responses = [fine_tuned_responses[i] for i in indices]
selected_fine_tuned_responses_2 = [fine_tuned_responses_2[i] for i in indices]
selected_true_explanations = [true_explanations[i] for i in indices]
selected_true_terms = [true_terms[i] for i in indices]

In [None]:
# Using TogetherAI's API to evaluate responses.
!pip install langchain_together
import requests
from langchain_together import Together
from langchain_core.prompts import PromptTemplate

Collecting langchain_together
  Downloading langchain_together-0.1.3-py3-none-any.whl (9.6 kB)
Collecting langchain-core<0.3,>=0.2.2 (from langchain_together)
  Downloading langchain_core-0.2.8-py3-none-any.whl (315 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 315.8/315.8 kB 6.2 MB/s eta 0:00:00
Collecting langchain-openai<0.2.0,>=0.1.8 (from langchain_together)
  Downloading langchain_openai-0.1.8-py3-none-any.whl (38 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.3,>=0.2.2->langchain_together)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.2.0,>=0.1.75 (from langchain-core<0.3,>=0.2.2->langchain_together)
  Downloading langsmith-0.1.79-py3-none-any.whl (125 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m125.3/125.3 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
Collecting openai<2.0.0,>=1.26.0 (from langchain-openai<0.2.0,>=0.1.8->langchain_together)
  Downloading open

In [None]:
import os

# TogetherAI API key, assumed to be stored in the TOGETHER_API_KEY environment variable.
api_key = os.environ.get("TOGETHER_API_KEY", "")

# Plain (non-f) string: PromptTemplate fills the {placeholders} at invoke time,
# so braces inside the model responses cannot break the template.
template = """
    Istruzioni: Di seguito sono riportate tre risposte alla stessa richiesta. Valuta quale risposta è la più accurata in base al contesto fornito.

    Contesto: Descrivi come si è evoluto il significato della parola {term} nella lingua italiana.

    Risposta 1: {response_1}

    Risposta 2: {response_2}

    Risposta 3: {response_3}

    Il tuo compito è scegliere la risposta più accurata e motivare la scelta.
    """

# Build the judge model and the chain once, outside the loop.
model = Together(model="meta-llama/Llama-3-70b-chat-hf", max_tokens=180, temperature=0.4, together_api_key=api_key)
chain = PromptTemplate.from_template(template) | model

responses = []

for i in range(len(selected_true_terms)):
    print("Risposta 1 (modello base): \n", selected_base_responses[i])
    print("-----------------------------------------------\n")
    print("Risposta 2 (primo modello fine-tunato): \n", selected_fine_tuned_responses[i])
    print("-----------------------------------------------\n")
    print("Risposta 3 (secondo modello fine-tunato): \n", selected_fine_tuned_responses_2[i])
    print("-----------------------------------------------\n")
    response = chain.invoke({
        "term": selected_true_terms[i],
        "response_1": selected_base_responses[i],
        "response_2": selected_fine_tuned_responses[i],
        "response_3": selected_fine_tuned_responses_2[i],
    })
    responses.append(response)  # keep the judge's verdicts for later inspection
    print(response)
    print("***********************************************\n")
    print("\n")


Risposta 1 (modello base): 
 Descrivi brevemente come si è evoluto il significato della parola minore nella lingua italiana.

The word "minor" has been used in Italian since the 16th century, but its meaning and usage have changed over time. In the early 20th Century, it was commonly used to refer to a person who is not of noble birth or status, such as a servant or a tradesman. However, by the mid-century, the term had taken on a more specific meaning related to music, specifically referring to an instrument that plays lower notes than the standard orchestra. Today, minor can also be used metaphorically to describe something small or insignificant compared to other things.
-----------------------------------------------

Risposta 2 (primo modello fine-tunato): 
 Descrivi brevemente come si è evoluto il significato della parola minore nella lingua italiana.
Spiegazione: La prima attestazione scritta di minori in italiano risale al XII secolo, dove indicava un gruppo o una categoria inf

## Considerations and Conclusions
These results suggest several important considerations. Firstly, the superior performance of the first fine-tuned model highlights the effectiveness of the synthetically generated dataset in training the model for lexical semantic change. This dataset, tailored specifically for the task, appears to provide the necessary information and context that the model needs to generate accurate and coherent explanations of semantic evolution.

On the other hand, the inclusion of the WiC dataset in the second model's fine-tuning process seems to introduce noise or conflicting signals. While the WiC dataset is high-quality and focused on word sense disambiguation, its different primary focus might have caused the model to receive mixed messages, ultimately diluting the specialized training provided by the synthetically generated dataset. This indicates that the specific nature and focus of training data are crucial for tasks requiring deep semantic understanding.

Furthermore, the results emphasize the importance of dataset coherence in fine-tuning language models. The synthetic dataset, designed to target lexical semantic change directly, aligned closely with the model's training objectives, leading to better performance. In contrast, the mixed datasets likely led to less effective learning due to their differing emphases and potentially conflicting information.

In conclusion, this evaluation underscores the importance of carefully curating and selecting training data for fine-tuning large language models. Ensuring that the data is specifically tailored to the task at hand can significantly enhance model performance, while combining datasets with different focuses may inadvertently degrade it. This insight is crucial for future efforts in fine-tuning language models for specific tasks, guiding the choice of datasets to optimize learning outcomes.