<a href="https://colab.research.google.com/github/petersun1937/finetune-lm-research_topics/blob/main/finetune_lm_research_topics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install dependencies
!pip install transformers datasets
!pip install -q datasets rouge-score bert-score

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.wh

In [None]:
# To ignore WANDB
import os
os.environ["WANDB_DISABLED"] = "true"

# **Finetune LM (Distilgpt2 or TinyMistral, uncomment relevant parts if needed)**

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling, AutoTokenizer, AutoModelForCausalLM, AutoConfig
from datasets import load_dataset

# Load dataset
# Upload or mount this file into Colab first
dataset = load_dataset("json", data_files={"train": "research_nlp_chapters_100.jsonl"}, split="train")

# Preprocess into model input format
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
#tokenizer = AutoTokenizer.from_pretrained("M4-ai/TinyMistral-248M-v3")
tokenizer.pad_token = tokenizer.eos_token  # Avoids padding error

def format_and_tokenize(example):
    prompt = f"### Problem: {example['input']}\n### Approach:"
    full_text = f"{prompt} {example['output']}"
    return tokenizer(full_text, truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(format_and_tokenize)

# Load and modify config first
config = AutoConfig.from_pretrained("M4-ai/TinyMistral-248M-v3")
config.attn_pdrop = 0.1
config.resid_pdrop = 0.1

# Load model with custom config
model = GPT2LMHeadModel.from_pretrained("distilgpt2")
#model = AutoModelForCausalLM.from_pretrained("M4-ai/TinyMistral-248M-v3", config=config)
model.resize_token_embeddings(len(tokenizer))

# Define TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_strategy="no",
    logging_steps=10,
    fp16=torch.cuda.is_available(),  # Use mixed precision on GPU if possible
    weight_decay=0.01,
)

# Trainer Setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# Train
trainer.train()

# Save model
#trainer.save_model("fine-tuned-tinyMistral")
#tokenizer.save_pretrained("fine-tuned-tinyMistral")
trainer.save_model("fine-tuned-distilgpt2")
tokenizer.save_pretrained("fine-tuned-distilgpt2")


Generating train split: 0 examples [00:00, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/657 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
10,5.642
20,4.8537
30,4.4139
40,4.156
50,3.9579
60,3.5238
70,3.531
80,3.5089
90,3.5332
100,3.3698


('fine-tuned-distilgpt2/tokenizer_config.json',
 'fine-tuned-distilgpt2/special_tokens_map.json',
 'fine-tuned-distilgpt2/vocab.json',
 'fine-tuned-distilgpt2/merges.txt',
 'fine-tuned-distilgpt2/added_tokens.json')

# **Test fine-tuned model**

In [None]:
import torch
from tqdm import tqdm
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

# Load Model & Tokenizer
#model = AutoModelForCausalLM.from_pretrained("fine-tuned-tinyMistral")
#tokenizer = AutoTokenizer.from_pretrained("fine-tuned-tinyMistral")

tokenizer = GPT2Tokenizer.from_pretrained("fine-tuned-distilgpt2")
model = GPT2LMHeadModel.from_pretrained("fine-tuned-distilgpt2")

tokenizer.pad_token = tokenizer.eos_token
model.eval()

# Load Evaluation Dataset
eval_data = load_dataset("json", data_files="test_research_nlp_chapters_10.jsonl")["train"]

# Load SBERT for Cosine Similarity
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Helper Functions
def generate_response(prompt, max_len=128):
    input_text = f"### Problem: {prompt}\n### Approach:"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_len,
        do_sample=True,             # Enable sampling instead of greedy decoding
        temperature=0.8             # Lower = conservative, Higher = creative
    )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded.split("### Approach:")[-1].strip()

#def repetition_ratio(text, n=5):
#    lines = [line.strip() for line in text.split("\n") if line.strip()]
#    return len(lines) / len(set(lines)) if lines else 1.0
def repetition_score(text):
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return len(lines) / len(set(lines)) if lines else 1.0

# Generate Responses
generated_responses, rouge_scores, repetition_scores = [], [], []
for ex in eval_data:
    prompt = ex["input"]
    gold = ex["output"]
    gen = generate_response(prompt)

    generated_responses.append(gen)

    # ROUGE-L
    rouge = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    rouge_l_f1 = rouge.score(gold, gen)['rougeL'].fmeasure
    rouge_scores.append(rouge_l_f1)

    # Repetition Score
    repetition_scores.append(repetition_score(gen))

# Cosine Similarity (batch)
gen_embeds = sbert_model.encode(generated_responses, convert_to_tensor=True)
ref_embeds = sbert_model.encode([ex["output"] for ex in eval_data], convert_to_tensor=True)
cosine_scores = util.cos_sim(gen_embeds, ref_embeds).diagonal().cpu().tolist()

# Build DataFrame
df = pd.DataFrame({
    "Prompt": [ex["input"] for ex in eval_data],
    "Generated": generated_responses,
    "Reference": [ex["output"] for ex in eval_data],
    "ROUGE-L": np.round(rouge_scores, 3),
    "CosineSim": np.round(cosine_scores, 3),
    "RepetitionScore": np.round(repetition_scores, 2)
})

# Print Summary
print(df[["Prompt", "Generated", "Reference", "ROUGE-L", "CosineSim", "RepetitionScore"]])
print("\nAverage ROUGE-L:", round(df["ROUGE-L"].mean(), 3))
print("Average Cosine Similarity:", round(df["CosineSim"].mean(), 3))
print("Average Repetition Score:", round(df["RepetitionScore"].mean(), 2))



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


                                              Prompt  \
0  How can we correct multilingual grammatical er...   
1  What are the best strategies for few-shot abst...   
2  How to detect and mitigate bias in generated q...   
3  What are current solutions for summarizing ext...   
4  How do we model user preferences in dialogue g...   
5  What are tokenization-free alternatives for la...   
6  What techniques support speech-to-text alignme...   
7  How can systems handle topic drift in ongoing ...   
8  How to improve answer reranking in hybrid sear...   
9  What are viable ways to ground symbolic repres...   

                                           Generated  \
0  Train a sample vocabulary model with multiling...   
1  Leverage structured question-answer pairs to t...   
2  Create a task-length filter that assess and mi...   
3  Introduce a summarizing model using a summatio...   
4  Train dialogue generation with reinforcement l...   
5  Use tokenization-aware tokenization using mu

# **Test Pretrained Model**

In [None]:
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

# Load Pretrained Model & Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")

#model = AutoModelForCausalLM.from_pretrained("M4-ai/TinyMistral-248M-v3")
#tokenizer = AutoTokenizer.from_pretrained("M4-ai/TinyMistral-248M-v3")

tokenizer.pad_token = tokenizer.eos_token
model.eval()

# Load Evaluation Dataset
eval_data = load_dataset("json", data_files="test_research_nlp_chapters_10.jsonl")["train"]

# Load SBERT for Cosine Similarity
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Helper Functions
def generate_response(prompt, max_len=128):
    input_text = f"### Problem: {prompt}\n### Approach:"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_len,
        do_sample=True,             # Enable sampling instead of greedy decoding
        temperature=0.8             # Lower = conservative, Higher = creative
    )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded.split("### Approach:")[-1].strip()

#def repetition_ratio(text, n=5):
#    lines = [line.strip() for line in text.split("\n") if line.strip()]
#    return len(lines) / len(set(lines)) if lines else 1.0
def repetition_score(text):
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return len(lines) / len(set(lines)) if lines else 1.0

# Generate Responses
generated_responses, rouge_scores, repetition_scores = [], [], []
for ex in eval_data:
    prompt = ex["input"]
    gold = ex["output"]
    gen = generate_response(prompt)

    generated_responses.append(gen)

    # ROUGE-L
    rouge = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    rouge_l_f1 = rouge.score(gold, gen)['rougeL'].fmeasure
    rouge_scores.append(rouge_l_f1)

    # Repetition Score
    repetition_scores.append(repetition_score(gen))

# Cosine Similarity (batch)
gen_embeds = sbert_model.encode(generated_responses, convert_to_tensor=True)
ref_embeds = sbert_model.encode([ex["output"] for ex in eval_data], convert_to_tensor=True)
cosine_scores = util.cos_sim(gen_embeds, ref_embeds).diagonal().cpu().tolist()

# Build DataFrame
df = pd.DataFrame({
    "Prompt": [ex["input"] for ex in eval_data],
    "Generated": generated_responses,
    "Reference": [ex["output"] for ex in eval_data],
    "ROUGE-L": np.round(rouge_scores, 3),
    "CosineSim": np.round(cosine_scores, 3),
    "RepetitionScore": np.round(repetition_scores, 2)
})

# Print Summary
print(df[["Prompt", "Generated", "Reference", "ROUGE-L", "CosineSim", "RepetitionScore"]])
print("\nAverage ROUGE-L:", round(df["ROUGE-L"].mean(), 3))
print("Average Cosine Similarity:", round(df["CosineSim"].mean(), 3))
print("Average Repetition Score:", round(df["RepetitionScore"].mean(), 2))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


                                              Prompt  \
0  How can we correct multilingual grammatical er...   
1  What are the best strategies for few-shot abst...   
2  How to detect and mitigate bias in generated q...   
3  What are current solutions for summarizing ext...   
4  How do we model user preferences in dialogue g...   
5  What are tokenization-free alternatives for la...   
6  What techniques support speech-to-text alignme...   
7  How can systems handle topic drift in ongoing ...   
8  How to improve answer reranking in hybrid sear...   
9  What are viable ways to ground symbolic repres...   

                                           Generated  \
0  To understand grammatical errors, we can look ...   
1  The question is: How do you respond to a given...   
2  To detect and mitigate bias in generated quest...   
3  You must be very careful with your formatting ...   
4  The answer to this is simple. To do so we need...   
5  The problem is that the tokenization solutio

# **LoRA (performance not good)**

In [None]:
!pip install -q peft datasets transformers accelerate bitsandbytes

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoConfig,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model, TaskType

# Load Dataset
dataset = load_dataset("json", data_files={"train": "research_nlp_chapters_100.jsonl"}, split="train")

# Tokenizer & Prompt Formatting
tokenizer = AutoTokenizer.from_pretrained("M4-ai/TinyMistral-248M-v3")
tokenizer.pad_token = tokenizer.eos_token

def format_and_tokenize(example):
    prompt = f"### Problem: {example['input']}\n### Approach:"
    full_text = f"{prompt} {example['output']}"
    return tokenizer(full_text, truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

# Load Base Model and Apply LoRA
config = AutoConfig.from_pretrained("M4-ai/TinyMistral-248M-v3")
base_model = AutoModelForCausalLM.from_pretrained("M4-ai/TinyMistral-248M-v3", config=config)
base_model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # Adapt based on model internals
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # Optional: confirm LoRA is active

# TrainingArguments
training_args = TrainingArguments(
    output_dir="./results_lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    logging_steps=10,
    save_strategy="epoch",
    fp16=torch.cuda.is_available(),
    weight_decay=0.01,
    report_to="none"
)

# Trainer Setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# Train & Save
trainer.train()
trainer.save_model("fine-tuned-lora-TinyMistral-248M-v3")
tokenizer.save_pretrained("fine-tuned-lora-TinyMistral-248M-v3")


trainable params: 319,488 || all params: 248,343,552 || trainable%: 0.1286


  trainer = Trainer(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
10,5.3606
20,5.4316
30,5.1997
40,5.3791
50,5.3422
60,5.208
70,5.0804
80,4.9755
90,4.9892
100,4.9709


('fine-tuned-lora-TinyMistral-248M-v3/tokenizer_config.json',
 'fine-tuned-lora-TinyMistral-248M-v3/special_tokens_map.json',
 'fine-tuned-lora-TinyMistral-248M-v3/tokenizer.json')

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from bert_score import score as bertscore

# Load Fine-Tuned Model
model_path = "fine-tuned-lora-TinyMistral-248M-v3"
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_path).to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()

# Load Test Dataset
eval_data = load_dataset("json", data_files="test_research_nlp_chapters_10.jsonl")["train"]

def generate_response(prompt, max_len=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_len,
        do_sample=False,
        repetition_penalty=1.2,  # Helps reduce output loops
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded.strip()

# Initialize Metrics
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
bleu_scores, rouge_scores, preds, refs = [], [], [], []

# Run Evaluation
for ex in eval_data:
    gold = ex["output"]
    prompt = ex["input"]
    gen = generate_response(prompt)

    # Print example
    print("\n---")
    print("Prompt:", prompt)
    print("Generated:", gen)
    print("Reference:", gold)

    # BLEU
    ref_tokens = [gold.split()]
    pred_tokens = gen.split()
    bleu = sentence_bleu(ref_tokens, pred_tokens)
    bleu_scores.append(bleu)

    # ROUGE-L
    rouge_L = rouge.score(gold, gen)["rougeL"].fmeasure
    rouge_scores.append(rouge_L)

    preds.append(gen)
    refs.append(gold)

# BERTScore
P, R, F1 = bertscore(preds, refs, lang="en", rescale_with_baseline=True)

# Print Metrics
print("\n--- Evaluation Summary ---")
print(f"Average BLEU:      {sum(bleu_scores)/len(bleu_scores):.3f}")
print(f"Average ROUGE-L:   {sum(rouge_scores)/len(rouge_scores):.3f}")
print(f"Average BERTScore-F1: {F1.mean().item():.3f}")





---
Prompt: How can we correct multilingual grammatical errors effectively?
Generated: How can we correct multilingual grammatical errors effectively?
The aim of this paper is to develop a new approach for translating phonetic expressions into English. The purpose of the study was to investigate whether phonological pronunciation and vocabulary are related in different ways, as well as how these differences affect language acquisition. We found that phonemic speech patterns differed between languages from native speakers (English) and non-native speakers (English). Our results suggest that phonemes may be more sensitive than linguists because they have higher levels of phonemes than other phonemes. This suggests that phonemes might also exhibit lower phoneme
Reference: Use a multilingual encoder-decoder model equipped with language-specific adapters, trained on synthetic grammatical errors derived from morphosyntactic templates. Integrate beam rescoring modules using monolingual fluen

The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()



---
Prompt: What are the best strategies for few-shot abstractive question answering?
Generated: What are the best strategies for few-shot abstractive question answering?
The following is a list of questions that can be answered by asking:

1. How do you define an effective methodology in which to evaluate and analyze data?
2. How does one describe a strategy based on objective criteria, such as accuracy or consistency?
3. How should we assess whether a tool has been used successfully in practice?
4. How should we measure effectiveness when using tools with high quality results?
5. How should our research approach compare to other methods?
6. How should we use metrics to improve performance?
7. How should we apply statistical analysis techniques to quantify differences between
Reference: Implement a retrieval-augmented QA system that uses example-driven prompt generation. Incorporate contrastive representation learning between support examples and candidate answers for better alignmen

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()



---
Prompt: How to detect and mitigate bias in generated questions?
Generated: How to detect and mitigate bias in generated questions?
In this paper, we present a novel method for predicting bias based on the following data: 1) The probability of being falsely accused by an individual is reduced when he or she has been identified as guilty. This means that if one suspects another person’s guilt, then it will be considered false positives (i.e., false negatives). We show here that implicitly discriminating against individuals who are not innocent does not necessarily lead to negative outcomes. In addition, our results suggest that implicit prejudice may increase risky behavioral behaviors among people with low self-esteem. Our findings indicate that unconscious
Reference: Use a two-stage system where question generators are followed by bias detectors trained on linguistically annotated cues. Introduce a debiasing layer that rephrases problematic patterns via conditional rewriting.

---

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--- Evaluation Summary ---
Average BLEU:      0.000
Average ROUGE-L:   0.060
Average BERTScore-F1: -0.045
