# Question answering

Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri or Google what the weather is, then you've used a question answering model before. There are two common types of question answering tasks:

- Extractive: extract the answer from the given context.
- Abstractive: generate an answer from the context that correctly answers the question.

This guide will show you how to:

1. Finetune [ALBERT](https://huggingface.co/albert/albert-base-v2) with LoRA adapters on the [SQuAD](https://huggingface.co/datasets/squad) dataset for extractive question answering.
2. Use your finetuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[ALBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/albert), [BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart), [BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bert), [BigBird](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/big_bird), [BigBird-Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bigbird_pegasus), [BLOOM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bloom), [CamemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/camembert), [CANINE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/canine), [ConvBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/convbert), [Data2VecText](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/data2vec-text), [DeBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta), [DeBERTa-v2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta-v2), [DistilBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/distilbert), [ELECTRA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/electra), [ERNIE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ernie), [ErnieM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ernie_m), [FlauBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/flaubert), [FNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/fnet), [Funnel Transformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/funnel), [OpenAI GPT-2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt2), [GPT Neo](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neo), [GPT NeoX](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neox), 
[GPT-J](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gptj), [I-BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ibert), [LayoutLMv2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/layoutlmv2), [LayoutLMv3](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/layoutlmv3), [LED](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/led), [LiLT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/lilt), [Longformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longformer), [LUKE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/luke), [LXMERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/lxmert), [MarkupLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/markuplm), [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart), [MEGA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mega), [Megatron-BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/megatron-bert), [MobileBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mobilebert), [MPNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mpnet), [MVP](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mvp), [Nezha](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nezha), [Nyströmformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nystromformer), [OPT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/opt), [QDQBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/qdqbert), [Reformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/reformer), [RemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/rembert), 
[RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta), [RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta-prelayernorm), [RoCBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roc_bert), [RoFormer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roformer), [Splinter](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/splinter), [SqueezeBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/squeezebert), [XLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm), [XLM-RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta), [XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta-xl), [XLNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlnet), [X-MOD](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xmod), [YOSO](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/yoso)


<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. In a notebook you can do this with `notebook_login()` from `huggingface_hub`; when prompted, enter your token to log in.

## Load SQuAD dataset

Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [1]:
from datasets import load_dataset

cache_dir = '/proj/ciptmp/ix05ogym/.cache/'  # local cache directory; adjust to your environment
squad = load_dataset("squad", split="train[:5000]", streaming=False, cache_dir=cache_dir)


Split the dataset's `train` split into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [10]:
squad = squad.train_test_split(test_size=0.2)

Then take a look at an example:

In [3]:
squad["train"][0]

{'id': '56ce66e4aab44d1400b8875d',
 'title': 'Solar_energy',
 'context': 'Concentrating Solar Power (CSP) systems use lenses or mirrors and tracking systems to focus a large area of sunlight into a small beam. The concentrated heat is then used as a heat source for a conventional power plant. A wide range of concentrating technologies exists; the most developed are the parabolic trough, the concentrating linear fresnel reflector, the Stirling dish and the solar power tower. Various techniques are used to track the Sun and focus light. In all of these systems a working fluid is heated by the concentrated sunlight, and is then used for power generation or energy storage.',
 'question': 'In all the different CSP systems, concentrated sunlight is used to heat what?',
 'answers': {'text': ['a working fluid'], 'answer_start': [491]}}

There are several important fields here:

- `answers`: the answer text and the character offset in `context` where the answer starts (`answer_start`).
- `context`: background information from which the model needs to extract the answer.
- `question`: the question a model should answer.
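
To see how these fields relate, the sketch below slices an excerpt of the example `context` at the character offset of the answer. It works on a shortened excerpt and finds the offset with `str.find`, so the offset it prints applies to the excerpt only and is not the dataset's `answer_start` value.

```python
# Excerpt of the example context shown above (shortened for illustration).
context = (
    "In all of these systems a working fluid is heated by the concentrated "
    "sunlight, and is then used for power generation or energy storage."
)
answer_text = "a working fluid"

# `answer_start` is a character offset into `context`, not a token index.
answer_start = context.find(answer_text)
answer_end = answer_start + len(answer_text)

# Slicing the context with these offsets recovers the answer text.
extracted = context[answer_start:answer_end]
print(answer_start, repr(extracted))
```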

## Preprocess

The next step is to load an ALBERT tokenizer to process the `question` and `context` fields:

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2", cache_dir=cache_dir)

There are a few preprocessing steps particular to question answering tasks you should be aware of:

1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original `context` by setting
   `return_offsets_mapping=True`.
3. With the mapping in hand, now you can find the start and end tokens of the answer. Use the [sequence_ids](https://huggingface.co/docs/tokenizers/main/en/api/encoding#tokenizers.Encoding.sequence_ids) method to
   find which part of the offset corresponds to the `question` and which corresponds to the `context`.

Here is how you can create a function to truncate and map the start and end tokens of the `answer` to the `context`:

In [103]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs['id'] = examples['id']
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
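
The span search inside `preprocess_function` can be tried in isolation on a tiny, hand-written encoding. The sketch below is only an illustration: the helper name `char_span_to_token_span`, the offsets, and the sequence ids are made up for this example and do not come from a real tokenizer, but the two loops follow the same logic as the function above.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token), or (0, 0)."""
    # Locate the first and last context tokens (sequence id 1).
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # The answer lies outside the (possibly truncated) context window.
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0

    # Advance past every token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk back past every token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Hand-made encoding: [CLS] where ? [SEP] iPods in Glasgow , Scotland [SEP]
context = "iPods in Glasgow, Scotland"
offsets = [(0, 0), (0, 5), (5, 6), (0, 0), (0, 5), (6, 8), (9, 16), (16, 17), (18, 26), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, None]

start_char = context.index("Glasgow")
end_char = start_char + len("Glasgow, Scotland")
span = char_span_to_token_span(offsets, sequence_ids, start_char, end_char)
print(span)  # (6, 8): the tokens "Glasgow", ",", "Scotland"
```

Both returned indices are inclusive, which matters later when the predicted span is sliced out of `input_ids`.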

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove any columns you don't need:

In [104]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)


Map: 100%|██████████| 4000/4000 [00:00<00:00, 5743.97 examples/s]
Map: 100%|██████████| 1000/1000 [00:00<00:00, 5551.44 examples/s]


In [105]:
tokenized_squad['train'][0].keys()

dict_keys(['id', 'input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'])

Now create a batch of examples using [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator). Unlike other data collators in 🤗 Transformers, the [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator) does not apply any additional preprocessing such as padding.

In [14]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
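
Because the collator does not pad, every example in a batch must already have the same length, which is what `padding="max_length"` in the preprocessing step guarantees. The sketch below illustrates the idea with a hypothetical `simple_collate` helper written for this guide; it is not the library implementation, which also converts the batch to tensors.

```python
def simple_collate(features):
    """Group a list of per-example dicts into one dict of lists, without padding."""
    lengths = {len(example["input_ids"]) for example in features}
    if len(lengths) != 1:
        raise ValueError(f"examples have different lengths: {sorted(lengths)}")
    return {key: [example[key] for example in features] for key in features[0]}


# Two toy examples already padded to the same length (0 is the pad id here).
features = [
    {"input_ids": [2, 113, 25, 3, 0, 0], "start_positions": 1, "end_positions": 2},
    {"input_ids": [2, 184, 151, 60, 9, 3], "start_positions": 3, "end_positions": 4},
]
batch = simple_collate(features)
print(batch["input_ids"])
```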

## Train

In [8]:
"""from unsloth import FastLanguageModel , is_bfloat16_supported
model_name = "unsloth/mistral-7b-v0.3-bnb-4bit"
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

model , tokenizer= FastLanguageModel.from_pretrained(model_name ,

                                                      load_in_4bit=True,
                                                      
                                                      max_seq_length=max_seq_length,
                                                      cache_dir='/proj/ciptmp/ix05ogym/.cache/',
                                                      
                                                      )

model = FastLanguageModel.get_peft_model( model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context

)"""


<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

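You're ready to start training your model now! Load ALBERT with [AutoModelForQuestionAnswering](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForQuestionAnswering), then wrap it with LoRA adapters using `get_peft_model` from the PEFT library so that only the adapter weights and the question answering head are trained: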

In [9]:
import torch
#from unsloth import FastLanguageModel , is_bfloat16_supported

from transformers import AutoModelForQuestionAnswering, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

print(torch.cuda.is_available())
torch.cuda.get_device_name(0)

# 4-bit quantization settings; unused unless `quantization_config` is enabled below
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "albert/albert-base-v2"  # alternative: "unsloth/mistral-7b-v0.3-bnb-4bit"
model = AutoModelForQuestionAnswering.from_pretrained(
    model_name,
    #attn_implementation="flash_attention_2",
    #quantization_config=config,
    #low_cpu_mem_usage=True,
    cache_dir='/proj/ciptmp/ix05ogym/.cache/',
    #device_map="auto",
)
print(model)

#model = prepare_model_for_kbit_training(model,use_gradient_checkpointing=False)
# Attach LoRA adapters to the attention projections
loraconfig = LoraConfig(r=16, target_modules=['query', 'key', 'value', 'dense'], task_type='QUESTION_ANS')
model = get_peft_model(model, loraconfig)

model.print_trainable_parameters()
print(model.get_memory_footprint())




True


Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


AlbertForQuestionAnswering(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias

In [10]:
for n,p in model.named_parameters():
    if p.requires_grad:
        print(n)

base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.query.lora_A.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.query.lora_B.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.key.lora_A.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.key.lora_B.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.value.lora_A.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.value.lora_B.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.lora_A.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.lora_B.default.weight
base_model.model.qa_outputs.modules_to_save.default.weight
base_model.model.qa_outputs.modules_to_save.default.bias
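
The list above can be turned into a rough parameter count. The sketch below is a back-of-the-envelope calculation, not the output of `print_trainable_parameters()`. It assumes every adapted layer is a 768 by 768 linear layer (the printed model confirms this for `query`, `key` and `value`; for `dense` it is an assumption because the printout is cut off) and that the LoRA matrices carry no bias, which matches the weight-only names listed above.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


hidden_size = 768  # Linear(in_features=768, out_features=768) in the printed model
rank = 16          # r=16 in the LoraConfig
adapted = ["query", "key", "value", "dense"]  # adapted attention layers

lora_params = sum(lora_param_count(hidden_size, hidden_size, rank) for _ in adapted)
# qa_outputs maps each hidden state to 2 logits (start, end): weight plus bias.
qa_head_params = hidden_size * 2 + 2

print(lora_params, qa_head_params, lora_params + qa_head_params)
# 98304 1538 99842
```

ALBERT shares one layer group across all of its transformer layers, which is why only a single set of adapter weights appears in the list.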


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, and data collator.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [32]:
import numpy as np

def var_dump(obj, indent=0):
    spacing = '  ' * indent
    if isinstance(obj, dict):
        for key, value in obj.items():
            print(f'{spacing}{key}:')
            var_dump(value, indent + 1)
    elif isinstance(obj, (list, tuple, set)):
        for idx, item in enumerate(obj):
            print(f'{spacing}[{idx}]:')
            var_dump(item, indent + 1)
    elif hasattr(obj, '__dict__'):
        print(f'{spacing}{obj.__class__.__name__}:')
        var_dump(vars(obj), indent + 1)
    else:
        print(f'{spacing}{obj}')

In [106]:
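st = squad['test']           # raw test examples, including the answer texts
d = tokenized_squad['test']  # tokenized test examples
model = model.cuda()
o = trainer.predict(d)       # start/end logits and the labeled positions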

In [185]:
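preds = []
input_ids = d['input_ids']  # read each column once instead of on every iteration
ids = d['id']
for i in range(len(input_ids)):
    start, end = int(o1[i]), int(o2[i])
    pp = 0.0
    if end < start or end == 0:
        # Invalid span, or a span ending on the first ([CLS]) token: treat as "no answer"
        pp = 1.0
        x = ' '
    else:
        # `end` is the index of the last answer token, so the slice must include it
        x = tokenizer.decode(input_ids[i][start:end + 1])

    preds.append({'prediction_text': x, 'id': ids[i], 'no_answer_probability': pp})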


In [186]:
preds[4]

{'prediction_text': ' ',
 'id': '573392e24776f41900660d9e',
 'no_answer_probability': 1.0}

In [108]:
# Most likely start and end token index for each test example
o1 = o.predictions[0].argmax(1)
o2 = o.predictions[1].argmax(1)

# Labeled start and end token positions
ro1 = o.label_ids[0]
ro2 = o.label_ids[1]

import torchmetrics

#em = torchmetrics.classification.MulticlassExactMatch(num_classes =384)
#em(torch.tensor(o1),torch.tensor(ro1))

#torchmetrics.text.SQuAD(preds =  ,target = )
print(d[0]['id'])
st[0]

56cd7ab462d2951400fa660d


{'id': '56cd7ab462d2951400fa660d',
 'title': 'IPod',
 'context': 'Besides earning a reputation as a respected entertainment device, the iPod has also been accepted as a business device. Government departments, major institutions and international organisations have turned to the iPod line as a delivery mechanism for business communication and training, such as the Royal and Western Infirmaries in Glasgow, Scotland, where iPods are used to train new staff.',
 'question': 'Where is Royal and Western Infirmaries located?',
 'answers': {'text': ['Glasgow, Scotland'], 'answer_start': [334]}}

In [227]:
import evaluate
import torchmetrics

mysq = torchmetrics.text.SQuAD()  # alternative metric, unused below
#mysq(preds,st)
mm = evaluate.load('squad_v2')

def transform_id(example):
    # Decoded predictions are lowercase, so lowercase the reference answer as well
    example['answers']['text'] = [example['answers']['text'][0].lower()]
    return example

st2 = st.select_columns(['answers', 'id'])
st2 = st2.map(transform_id)
st2 = st2.to_list()

# Score the first j + 1 predictions against their references
j = 20
ssss = mm.compute(predictions=preds[0:j + 1], references=st2[0:j + 1])
print(preds[j], st2[j])
ssss

{'prediction_text': ' ', 'id': '56cfbab4234ae51400d9bf1d', 'no_answer_probability': 1.0} {'answers': {'text': ['1790'], 'answer_start': [549]}, 'id': '56cfbab4234ae51400d9bf1d'}


{'exact': 0.0,
 'f1': 25.23088023088023,
 'total': 21,
 'HasAns_exact': 0.0,
 'HasAns_f1': 25.23088023088023,
 'HasAns_total': 21,
 'best_exact': 0.0,
 'best_exact_thresh': 0.0,
 'best_f1': 25.23088023088023,
 'best_f1_thresh': 0.0}
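
An exact-match score of 0 alongside a nonzero F1 means predictions overlap the references without matching them completely. The sketch below is a simplified re-implementation in the spirit of the SQuAD scoring rules (lowercasing, stripping punctuation and articles, token-overlap F1). It is not the code used by `evaluate`, and the function names are chosen for this example.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match, anything else as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that stops one token short of the reference answer.
em = exact_match("glasgow,", "Glasgow, Scotland")
f1 = token_f1("glasgow,", "Glasgow, Scotland")
print(em, round(f1, 3))  # 0.0 0.667
```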

In [178]:
for p,a in zip(preds,st2):
    print(p['prediction_text'],a['answers']['text'])

glasgow, ['Glasgow, Scotland']
as ['Aspiro']
go set a watch ['Go Set a Watchman']
ten languages. in the years since, it has sold more than 30 million copies and been translated into more than ['more than 40']
 ['its own Great Purge']
montana ( ['from the Spanish word montaña']
adrian ['Adrian Gallagher']
tudor ['brownstone rowhouses']
2011 glastonbury ['the 2011 Glastonbury Festival']
 ['1862']
political killing ['political killings']
4 ['44']
ordon ['Ordon Village']
november ['2005']
 ['The word genocide is the combination of the Greek prefix geno- (meaning tribe or race) and caedere (the Latin word for to kill).']
 ['10,000']
the giving of ['four-hour program called The Giving of Love']
20% gramicidin and 80% tyroc ['wounds and ulcers']
april ['2013']
giselle knowles-car ['Beyoncé Giselle Knowles-Carter']
 ['1790']
$30 ['73 million']
rondo op. ['Rondo Op. 1']
 ['Madonna']
friederike ['Friederike Müller']
pre-professional ['Department of Pre-Professional Studies']
l-targeting (z-targe

In [148]:
preds[4]

{'prediction_text': '',
 'id': '573392e24776f41900660d9e',
 'no_answer_probability': 0.0}

In [65]:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
#['adamw_hf', 'adamw_torch', 'adamw_torch_fused', 'adamw_torch_xla', 'adamw_torch_npu_fused', 'adamw_apex_fused', 'adafactor', 'adamw_anyprecision', 'sgd', 'adagrad', 'adamw_bnb_8bit', 'adamw_8bit', 'lion_8bit', 'lion_32bit', 'paged_adamw_32bit', 'paged_adamw_8bit', 'paged_lion_32bit', 'paged_lion_8bit', 'rmsprop', 'rmsprop_bnb', 'rmsprop_bnb_8bit', 'rmsprop_bnb_32bit', 'galore_adamw', 'galore_adamw_8bit', 'galore_adafactor', 'galore_adamw_layerwise', 'galore_adamw_8bit_layerwise', 'galore_adafactor_layerwise']
#import torch._dynamo
#torch._dynamo.config.suppress_errors = True



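def compute_metrics(eval_preds):
    outputs, labels = eval_preds
    start_logits, end_logits = outputs
    label_starts, label_ends = labels

    pred_starts = start_logits.argmax(1)
    pred_ends = end_logits.argmax(1)

    # Represent each span as a string of token indices so that the SQuAD metric
    # scores the positional overlap between predicted and labeled spans.
    preds = []
    ref = []
    for i in range(len(label_starts)):
        pred_text = " ".join(map(str, np.arange(pred_starts[i], pred_ends[i] + 1)))
        ref_text = " ".join(map(str, np.arange(label_starts[i], label_ends[i] + 1)))
        # The metric expects string ids and lists of reference answers
        preds.append({"id": str(i), "prediction_text": pred_text})
        ref.append({"id": str(i), "answers": {"text": [ref_text], "answer_start": [0]}})

    return metric.compute(predictions=preds, references=ref)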
 
import evaluate
from transformers import TrainingArguments, Trainer

metric = evaluate.load('squad')

# Evaluate on two test examples only, to keep the per-epoch evaluation fast
t = tokenized_squad["test"].select(range(2))
trainer = Trainer(
    model=model,
    train_dataset=tokenized_squad["train"],
    eval_dataset=t,
    tokenizer=tokenizer,
    data_collator=data_collator,
    #compute_metrics=compute_metrics,
    args=TrainingArguments(
        output_dir="my_awesome_qa_model",
        eval_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=10,
        #weight_decay=0.01,
        tf32=True,  # requires an Ampere or newer GPU
        bf16=True,  # alternative: is_bfloat16_supported() from unsloth
        #optim='adamw_hf',
        #dataloader_pin_memory=True,
        #dataloader_num_workers=0,
        #torch_compile=True,  # raised errors in this setup
        #push_to_hub=True,
        report_to='tensorboard',
    ),
)

#trainer.train()  # uncomment to finetune the model
print(t[0])

trainer.evaluate()



{'input_ids': [2, 113, 25, 612, 17, 650, 19, 20590, 11301, 335, 60, 3, 3410, 6555, 21, 4530, 28, 21, 10861, 2302, 3646, 15, 14, 31, 10670, 63, 67, 74, 2217, 28, 21, 508, 3646, 9, 283, 8627, 15, 394, 3449, 17, 294, 8626, 57, 412, 20, 14, 31, 10670, 293, 28, 21, 6010, 6534, 26, 508, 3291, 17, 838, 15, 145, 28, 14, 612, 17, 650, 19, 20590, 11301, 19, 6005, 15, 2207, 15, 113, 31, 16894, 50, 147, 20, 1528, 78, 1138, 9, 3, 0, 0, 0, ... (padding zeros continue; output truncated)

{'eval_loss': 0.484375,
 'eval_runtime': 0.0223,
 'eval_samples_per_second': 89.555,
 'eval_steps_per_second': 44.777}

[5000/5000 15:22, Epoch 10/10]

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 4.859800 | 3.931242 |
| 2 | 3.372500 | 2.843172 |
| 3 | 2.442000 | 2.213433 |
| 4 | 2.050300 | 1.954712 |
| 5 | 1.865000 | 1.826765 |
| 6 | 1.750300 | 1.764793 |
| 7 | 1.677600 | 1.692042 |
| 8 | 1.626800 | 1.656364 |
| 9 | 1.599700 | 1.637798 |
| 10 | 1.583800 | 1.631518 |
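
The step count in the progress line follows from the dataset size and the training arguments. The sketch below reproduces that arithmetic under two assumptions that the log does not state: training ran on a single GPU and without gradient accumulation.

```python
import math

train_examples = 4000      # size of the train split after the 80/20 split
per_device_batch_size = 8  # per_device_train_batch_size
num_epochs = 10            # num_train_epochs
num_devices = 1            # assumption: a single GPU
grad_accum_steps = 1       # assumption: no gradient accumulation

effective_batch_size = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 500 5000
```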

In [19]:
model.save_pretrained("my_awesome_qa_model")
tokenizer.save_pretrained("my_awesome_qa_model")


('my_awesome_qa_model/tokenizer_config.json',
 'my_awesome_qa_model/special_tokens_map.json',
 'my_awesome_qa_model/spiece.model',
 'my_awesome_qa_model/added_tokens.json',
 'my_awesome_qa_model/tokenizer.json')

Once training is completed, share your model on the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method, for example by calling `trainer.push_to_hub()`, so everyone can use your model.

<Tip>

For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb).

</Tip>

## Evaluate

Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) still calculates the evaluation loss during training so you're not completely in the dark about your model's performance.

If you have more time and you're interested in how to evaluate your model for question answering, take a look at the [Question answering](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing) chapter from the 🤗 Hugging Face Course!

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a question and some context you'd like the model to predict:

In [1]:
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for question answering with your model, and pass your text to it:

In [20]:
from transformers import pipeline

question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
question_answerer(question=question, context=context)

Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.012835928238928318,
 'start': 89,
 'end': 118,
 'answer': 'and 13 programming languages.'}
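
The `start` and `end` values in this result are character offsets into `context`, with `end` exclusive. The short check below copies the values from the output above and slices the context string with them.

```python
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

# Values copied from the pipeline output above.
result = {"start": 89, "end": 118, "answer": "and 13 programming languages."}

# `start` and `end` index characters of `context`; `end` is exclusive.
span = context[result["start"]:result["end"]]
print(repr(span))  # 'and 13 programming languages.'
```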

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="pt")
inputs = inputs.to("cuda")  # move input_ids, token_type_ids and attention_mask to the GPU
inputs



{'input_ids': tensor([[    2,   184,   151,  3143,  2556,   630,  8064,   555,    60,     3,
          8064,    63,    13, 11633,  2786, 12905,    17,    92,  7920,  1854,
            19,  5084,  2556,  1112,  2556,    17,   539,  3143,  2556,     9,
             3]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

Pass your inputs to the model and return the `logits`:

In [3]:
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model").cuda()
with torch.no_grad():
    outputs = model(**inputs)

Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
outputs.start_logits.shape
#inputs['input_ids'].shape
outputs

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-0.1825,  1.3876,  2.3054,  1.4554,  0.0895, -1.3429,  0.3068, -1.0149,
         -3.1631, -0.2974,  1.2329,  0.2215,  3.8604,  6.7246,  1.5241,  0.1145,
         -0.4441, -0.1687,  0.0218,  1.4461,  1.5058,  8.8211,  1.8876,  4.5003,
          1.6900,  1.5468,  9.3359,  3.3335,  1.8808, -2.0084, -0.2974]],
       device='cuda:0'), end_logits=tensor([[ 0.5415, -0.3170,  3.1900, -0.0726,  4.1455, -0.4629,  0.5939, -0.1192,
         -1.2102, -0.0291,  0.9579,  0.2125,  1.6114,  6.1145,  5.2858,  2.9438,
          0.3411, -0.6261, -1.0451,  1.2660,  0.4267,  7.5396,  4.6675,  3.5092,
          5.8924,  1.7175,  8.0168,  1.6303,  5.9388, -0.1727, -0.0291]],
       device='cuda:0'), hidden_states=None, attentions=None)

Get the most likely start and end positions by taking the `argmax` of the start and end logits:

In [4]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

In [5]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'13'
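
Taking the two argmaxes independently can produce an end index that comes before the start index. Below is a minimal sketch of one way to guard against that, assuming plain Python lists of logits: score every span with `start <= end` and a bounded length, then keep the best one. The helper name `best_span`, the toy logits, and the `max_answer_len` default are illustrative assumptions, not part of the guide.

```python
import math


def softmax(values):
    """Convert a list of logits into probabilities."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) for the best span with start <= end."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for start, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_probs))
        for end in range(start, last):
            score = p_start * end_probs[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Toy logits (hypothetical): independent argmaxes give start=3, end=2.
start_logits = [0.1, 3.0, 0.3, 5.0]
end_logits = [0.2, 0.1, 4.0, 1.0]
start, end, score = best_span(start_logits, end_logits)
print(start, end)
```

With real model outputs you could pass `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()` to the same helper.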

In [24]:
def compute_metrics(eval_preds):
    # eval_preds is a (start_logits, start_positions) pair of NumPy arrays.
    start_logits, labels = eval_preds
    # Take the argmax per example (axis=-1), not over the flattened batch.
    predicted_start = start_logits.argmax(axis=-1)
    # Assumption: `metric` was loaded earlier, e.g. with evaluate.load("accuracy").
    return metric.compute(predictions=predicted_start, references=labels)

# Run a single evaluation batch through the model as a smoke test.
for batch in trainer.get_eval_dataloader():
    batch = {k: v.to("cuda") for k, v in batch.items()}
    with torch.no_grad():
        outputs = trainer.model(**batch)

    print(batch.keys())
    predictions = outputs.start_logits.cpu().numpy()
    # A QA batch has no "labels" key; the targets are start_positions and end_positions.
    labels = batch["start_positions"].cpu().numpy()

    compute_metrics((predictions, labels))
    break

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'])

https://github.com/unslothai/unsloth

The table below compares the optimizers that can be selected by name (for example through the `optim` argument of `TrainingArguments`) by implementation, precision, hardware support, and typical applications. Supported hardware (GPU, TPU, NPU) and implementation differences affect performance and suitability for a given training scenario.

| Optimizer Name                 | Implementation      | Precision        | Hardware Support            | Key Features                                   | Typical Applications                       |
|--------------------------------|---------------------|------------------|-----------------------------|------------------------------------------------|--------------------------------------------|
| adamw_hf                       | Hugging Face        | 32-bit           | CPU, GPU                    | Decoupled weight decay, Transformer training  | NLP, Transformers                          |
| adamw_torch                    | PyTorch             | 32-bit           | CPU, GPU                    | Decoupled weight decay, flexible             | General deep learning                      |
| adamw_torch_fused              | PyTorch             | 32-bit           | CPU, GPU                    | Fused operations for efficiency              | General deep learning                      |
| adamw_torch_xla                | PyTorch             | 32-bit           | TPU                         | TPU optimized                                 | High-performance training on TPUs          |
| adamw_torch_npu_fused          | PyTorch             | 32-bit           | NPU                         | Fused operations for NPUs                    | Training on NPU hardware                   |
| adamw_apex_fused               | NVIDIA Apex         | 32-bit, mixed    | GPU                         | Fused operations, mixed precision            | High-performance training on GPUs          |
| adafactor                      | Hugging Face        | 32-bit, mixed    | CPU, GPU                    | Memory efficient (factored second moments)   | NLP, large models                          |
| adamw_anyprecision             | torchdistx          | Variable         | CPU, GPU                    | Flexible precision handling                  | Custom precision requirements              |
| sgd                            | PyTorch             | 32-bit           | CPU, GPU                    | Simple, effective                            | Classic machine learning, simple models    |
| adagrad                        | PyTorch             | 32-bit           | CPU, GPU                    | Adaptive learning rate                       | Sparse data                                |
| adamw_bnb_8bit                 | BitsAndBytes        | 8-bit            | CPU, GPU                    | Memory efficient, fast                       | Large-scale training on limited hardware   |
| adamw_8bit                     | BitsAndBytes        | 8-bit            | CPU, GPU                    | Alias of `adamw_bnb_8bit`                    | Large models with memory constraints       |
| lion_8bit                      | BitsAndBytes        | 8-bit            | CPU, GPU                    | Memory efficient, reduced precision          | Memory constrained environments            |
| lion_32bit                     | BitsAndBytes        | 32-bit           | CPU, GPU                    | Sign-based update, full-precision state      | General deep learning                      |
| paged_adamw_32bit              | BitsAndBytes        | 32-bit           | CPU, GPU                    | Pages optimizer state to absorb memory spikes | Large models with memory constraints      |
| paged_adamw_8bit               | BitsAndBytes        | 8-bit            | CPU, GPU                    | Paged optimizer, memory efficient            | Large models with memory constraints       |
| paged_lion_32bit               | BitsAndBytes        | 32-bit           | CPU, GPU                    | Pages optimizer state to absorb memory spikes | Large models with memory constraints      |
| paged_lion_8bit                | BitsAndBytes        | 8-bit            | CPU, GPU                    | Paged optimizer, memory efficient            | Large models with memory constraints       |
| rmsprop                        | PyTorch             | 32-bit           | CPU, GPU                    | Adaptive learning rate                       | RNNs, general deep learning                |
| rmsprop_bnb                    | BitsAndBytes        | 32-bit           | CPU, GPU                    | BitsAndBytes optimization                    | Memory efficient training                  |
| rmsprop_bnb_8bit               | BitsAndBytes        | 8-bit            | CPU, GPU                    | Memory efficient, fast                       | Large-scale training on limited hardware   |
| rmsprop_bnb_32bit              | BitsAndBytes        | 32-bit           | CPU, GPU                    | Higher precision                              | General deep learning                      |
| galore_adamw                   | GaLore              | 32-bit           | CPU, GPU                    | AdamW with low-rank gradient projection      | Memory-efficient full-parameter training   |
| galore_adamw_8bit              | GaLore              | 8-bit            | CPU, GPU                    | Low-rank gradient projection, 8-bit state    | Memory constrained environments            |
| galore_adafactor               | GaLore              | 32-bit, mixed    | CPU, GPU                    | Adafactor with low-rank gradient projection  | NLP, large models                          |
| galore_adamw_layerwise         | GaLore              | 32-bit           | CPU, GPU                    | Layerwise optimization                       | Memory-efficient full-parameter training   |
| galore_adamw_8bit_layerwise    | GaLore              | 8-bit            | CPU, GPU                    | Memory efficient, layerwise optimization     | Memory constrained environments            |
| galore_adafactor_layerwise     | GaLore              | 32-bit, mixed    | CPU, GPU                    | Layerwise optimization, memory efficient     | NLP, large models                          |

### Notes:
- **Precision**: Indicates whether the optimizer supports standard 32-bit precision or has options for mixed/8-bit precision for memory efficiency.
- **Hardware Support**: Identifies the primary hardware the optimizer is designed to run on efficiently, e.g., CPU, GPU, TPU, NPU.
- **Key Features**: Highlights unique aspects or enhancements that distinguish each optimizer.
- **Typical Applications**: Common use cases or scenarios where the optimizer is particularly effective.

Choosing the right optimizer depends on your specific training needs, hardware availability, and whether you need to manage large models or datasets within memory constraints.
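
To make the memory-related entries concrete, here is a rough back-of-the-envelope sketch of optimizer state size. It assumes AdamW keeps two state values per parameter and Lion keeps one, stored at 4 bytes (32-bit) or 1 byte (8-bit), and it ignores the small amount of quantization metadata that 8-bit optimizers also store. The function name and the 1-billion-parameter model are hypothetical.

```python
# State tensors kept per model parameter (assumption stated above).
STATE_TENSORS = {"adamw": 2, "lion": 1}
# Bytes used to store each state value.
BYTES_PER_VALUE = {"32bit": 4, "8bit": 1}


def optimizer_state_gb(num_params, family, precision):
    """Estimate optimizer state size in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * STATE_TENSORS[family] * BYTES_PER_VALUE[precision]
    return total_bytes / 1e9


num_params = 1_000_000_000  # hypothetical 1B-parameter model
for family in ("adamw", "lion"):
    for precision in ("32bit", "8bit"):
        size = optimizer_state_gb(num_params, family, precision)
        print(f"{family}_{precision}: {size:.1f} GB")
```

These figures cover optimizer state only; weights, gradients, and activations add to the total.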




The convergence speed of an optimizer depends on various factors such as the type of model, the dataset, the specific problem being solved, and the tuning of hyperparameters. However, here are some general insights into the convergence speed of the listed optimizers:

1. **AdamW variants**:
    - `adamw_hf`, `adamw_torch`, `adamw_torch_fused`, `adamw_torch_xla`, `adamw_torch_npu_fused`, `adamw_apex_fused`, `adamw_anyprecision`, `adamw_bnb_8bit`, `adamw_8bit`, `galore_adamw`, `galore_adamw_8bit`, `galore_adamw_layerwise`, `galore_adamw_8bit_layerwise`:
      - AdamW typically converges quickly thanks to its adaptive per-parameter learning rates and decoupled weight decay. The fused versions (`adamw_torch_fused`, `adamw_apex_fused`, `adamw_torch_npu_fused`) apply the same update with more efficient kernels, so each step is faster while the number of steps needed to converge stays the same.
      - `adamw_torch_xla` and `adamw_torch_npu_fused` are optimized for specific hardware (TPU and NPU, respectively), which can shorten wall-clock training time on those platforms.

2. **Adafactor**:
    - `adafactor`, `galore_adafactor`, `galore_adafactor_layerwise`:
      - Adafactor is memory-efficient and suitable for training very large models. It can converge quickly in large-scale NLP tasks, particularly when memory constraints are an issue.

3. **Lion**:
    - `lion_8bit`, `lion_32bit`, `paged_lion_32bit`, `paged_lion_8bit`:
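      - Lion optimizers are less commonly used. Lion updates parameters with the sign of a momentum-based term and keeps a single state value per parameter, so it uses less optimizer memory than AdamW. It can converge faster in some scenarios but usually needs a smaller learning rate than AdamW.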

4. **SGD**:
    - `sgd`:
      - SGD with momentum can converge quickly in some scenarios but generally requires more careful tuning of learning rates and momentum parameters. It may not converge as fast as AdamW in many deep learning tasks.

5. **Adagrad**:
    - `adagrad`:
      - Adagrad adapts the learning rate based on the historical gradient, which can be beneficial for sparse data but may lead to slower convergence in dense data scenarios.

6. **Paged AdamW**:
    - `paged_adamw_32bit`, `paged_adamw_8bit`:
      - These optimizers page optimizer state from GPU to CPU memory when GPU memory runs short, which helps avoid out-of-memory errors. They apply the same AdamW update, so convergence should match the non-paged versions; paging mainly affects step time.

7. **RMSprop**:
    - `rmsprop`, `rmsprop_bnb`, `rmsprop_bnb_8bit`, `rmsprop_bnb_32bit`:
      - RMSprop is designed for fast convergence in non-stationary settings. It can converge faster than SGD in many cases.

8. **Galore Optimizers**:
    - `galore_adamw`, `galore_adamw_8bit`, `galore_adafactor`, `galore_adamw_layerwise`, `galore_adamw_8bit_layerwise`, `galore_adafactor_layerwise`:
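      - GaLore (Gradient Low-Rank Projection) optimizers project gradients into a low-rank subspace to reduce optimizer-state memory. The layerwise variants update each layer as its gradient becomes available during the backward pass, which saves further memory. Convergence is intended to stay close to that of the underlying AdamW or Adafactor optimizer.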
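
The "adaptive learning rate and decoupled weight decay" behaviour described for the AdamW variants can be sketched for a single scalar parameter as follows. This is a simplified illustration under stated assumptions, not the code of any of the listed implementations; the function name `adamw_step` and the example values are hypothetical.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a scalar parameter. Returns (param, m, v)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the parameter directly,
    # instead of adding weight_decay * param to the gradient.
    param = param * (1 - lr * weight_decay)
    # Adaptive step: the gradient is scaled by its running magnitude.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p_no_decay, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.0)
p_decay, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1)
print(p_no_decay, p_decay)
```

On the first step the adaptive term has magnitude close to `lr` regardless of the gradient scale, which is one reason AdamW is less sensitive to gradient magnitude than plain SGD.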

### General Recommendations:
- For most standard deep learning tasks, **AdamW variants** are likely to offer the fastest convergence due to their adaptive learning rates and weight decay.
- For large-scale NLP tasks, **Adafactor** can be very efficient and fast.
- For training on specific hardware (TPU, NPU), use the optimizers optimized for those platforms like `adamw_torch_xla` or `adamw_torch_npu_fused`.
- For memory-constrained environments, **8-bit variants** and **Paged optimizers** can offer good convergence speed while managing memory efficiently.
- If GPU memory spikes cause out-of-memory errors, **paged variants** of AdamW or Lion can be particularly effective.

Ultimately, the best way to determine which optimizer converges the fastest for your specific use case is to experiment with a few of them on your dataset and model. Hyperparameter tuning (such as learning rate adjustments) also plays a significant role in convergence speed.

The table below summarizes these evaluation metrics: their pros, cons, use cases, formulas, and the call that loads each one. After `import evaluate`, most metrics are computed with `metric.compute(predictions=predictions, references=references)`. `compute` accepts keyword arguments only, benchmark metrics such as `glue` need a configuration name as the second argument to `evaluate.load`, and a few metrics (for example `code_eval`, `mean_iou`, `mahalanobis`, `roc_auc`, and `wiki_split`) expect additional or differently named arguments.

| **Metric** | **Pros** | **Cons** | **Use Case** | **Formula** | **Code Usage** |
|---|---|---|---|---|---|
| **MSE** | Penalizes larger errors more | Sensitive to outliers | Regression | \(MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\) | `metric = evaluate.load("mse")` |
| **Mean IoU** | Standard in segmentation | Can be misleading if classes are imbalanced | Image Segmentation | \(\text{Mean IoU} = \frac{1}{k} \sum_{i=1}^k \frac{TP_i}{TP_i + FP_i + FN_i}\) | `metric = evaluate.load("mean_iou")` |
| **Pearson Correlation Coefficient** | Measures linear correlation | Only captures linear relationships | Regression | \(r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}\) | `metric = evaluate.load("pearsonr")` |
| **GLUE** | Comprehensive benchmark | Complex to interpret | NLP | Various component metrics | `metric = evaluate.load("glue", "mrpc")` |
| **Confusion Matrix** | Comprehensive error analysis | Can be difficult to interpret for many classes | Classification | N/A | `metric = evaluate.load("confusion_matrix")` |
| **SQuAD** | Standard for QA | Limited to QA tasks | Question Answering | Exact match and token-level F1 | `metric = evaluate.load("squad")` |
| **Code Eval** | Specialized for code generation | Limited to code generation | Code Generation | pass@k | `metric = evaluate.load("code_eval")` |
| **TER** | Measures post-edit distance | Less commonly used | Machine Translation | \(TER = \frac{\text{number of edits}}{\text{average number of reference words}}\) | `metric = evaluate.load("ter")` |
| **Mahalanobis Distance** | Accounts for data distribution | Requires covariance matrix, less intuitive | Anomaly Detection, Clustering | \(D_M = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}\) | `metric = evaluate.load("mahalanobis")` |
| **CUAD** | Legal document understanding | Specialized for legal documents | Legal Document Analysis | N/A | `metric = evaluate.load("cuad")` |
| **Spearman Correlation Coefficient** | Measures rank correlation | Only captures monotonic relationships | Regression | \(\rho = 1 - \frac{6 \sum d_i^2}{n (n^2 - 1)}\) | `metric = evaluate.load("spearmanr")` |
| **Brier Score** | Measures probability prediction accuracy | Limited to binary outcomes | Probability Forecasting | \(BS = \frac{1}{n} \sum_{i=1}^n (f_i - o_i)^2\) | `metric = evaluate.load("brier_score")` |
| **BERT Score** | Leverages contextual embeddings, correlates with human judgment | Requires large computational resources, newer and less widespread | Text Summarization, Translation | Greedy token-level cosine similarity between prediction and reference embeddings, combined into precision, recall, and F1 | `metric = evaluate.load("bertscore")` |
| **IndicGLUE** | Benchmark for Indian languages | Limited to specific languages | NLP for Indian Languages | Various component metrics | `metric = evaluate.load("indic_glue", "wnli")` |
| **F1** | Balances precision and recall | Not informative alone, can be misleading if data is imbalanced | Classification tasks | \(F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\) | `metric = evaluate.load("f1")` |
| **SQuAD v2** | Includes unanswerable questions | Limited to QA tasks | Question Answering | N/A | `metric = evaluate.load("squad_v2")` |
| **chrF** | Measures character n-gram F-score | Can be less interpretable | Machine Translation | N/A | `metric = evaluate.load("chrf")` |
| **WikiSplit** | Measures sentence splitting and rephrasing | Specialized for text simplification | Text Simplification | N/A | `metric = evaluate.load("wiki_split")` |
| **XTREME-S** | Cross-lingual benchmark | Complex and multifaceted | Multilingual NLP | Various component metrics | `metric = evaluate.load("xtreme_s", "mls")` |
| **NIST_MT** | Measures precision with information weight | Less commonly used | Machine Translation | N/A | `metric = evaluate.load("nist_mt")` |
| **ROC AUC** | Measures the ability of a classifier to distinguish classes | Can be over-optimistic with imbalanced datasets | Binary Classification | Area under the ROC curve | `metric = evaluate.load("roc_auc")` |
| **CharacTER** | Character-level translation error rate | Can be overly harsh on minor errors | Machine Translation | N/A | `metric = evaluate.load("character")` |
| **CER** | Measures character-level errors | Can be overly harsh on minor errors | Speech Recognition | \(CER = \frac{\text{Substitutions} + \text{Deletions} + \text{Insertions}}{\text{Number of characters in reference}}\) | `metric = evaluate.load("cer")` |
| **Precision** | Measures exactness of positive predictions | Does not account for false negatives | Classification tasks | \(\text{Precision} = \frac{TP}{TP + FP}\) | `metric = evaluate.load("precision")` |
| **BLEURT** | Context-aware evaluation metric for text | Requires pretrained model | Text Generation, Translation | N/A | `metric = evaluate.load("bleurt")` |
| **SacreBLEU** | Standardized BLEU metric | Ignores synonyms, sensitive to exact matches | Machine Translation | \(BLEU = BP \cdot \exp \left( \sum_{n=1}^{N} w_n \log p_n \right)\) | `metric = evaluate.load("sacrebleu")` |
| **Matthews Correlation Coefficient** | Measures quality of binary classifications | Can be less intuitive | Binary Classification | \(MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\) | `metric = evaluate.load("matthews_correlation")` |
| **CharCut** | Measures character-level translation cut error | Can be overly harsh on minor errors | Machine Translation | N/A | `metric = evaluate.load("charcut_mt")` |
| **Competition MATH**| Standardized math benchmark                                 | Specialized for mathematical problem-solving                  | Math Problem Solving                                    | N/A                                                                                      | `import evaluate \

Hugging Face provides a variety of evaluation metrics for assessing the performance of machine learning models, especially in natural language processing (NLP) tasks. Here’s a table summarizing the pros, cons, use cases, and formulas (where applicable) for some common evaluation metrics:

| **Metric**      | **Pros**                                                                                     | **Cons**                                                                                    | **Use Cases**                                                                                     | **Formula**                                                                 |
|-----------------|----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|
| **Accuracy**    | - Easy to understand and interpret.                                                          | - Not useful for imbalanced datasets.                                                      | - Classification tasks with balanced classes.                                                    | \(\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}\)                         |
| **Precision**   | - Useful for understanding false positive rates.                                             | - Does not account for false negatives.                                                    | - When the cost of false positives is high (e.g., spam detection).                                | \(\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\)                           |
| **Recall**      | - Useful for understanding false negative rates.                                             | - Does not account for false positives.                                                    | - When the cost of false negatives is high (e.g., medical diagnosis).                             | \(\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\)                           |
| **F1 Score**    | - Balances precision and recall.                                                             | - Can be misleading if precision and recall are very different.                            | - General purpose, especially for imbalanced datasets.                                           | \(2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)                  |
| **ROC-AUC**     | - Measures the ability of the classifier to distinguish between classes.                     | - Can be over-optimistic with imbalanced datasets.                                         | - Binary classification tasks.                                                                    | Area under the ROC curve.                                                |
| **BLEU**        | - Standard for evaluating machine translation.                                               | - Sensitive to exact matches, which may not capture semantic similarity.                   | - Machine translation, text generation.                                                           | \(BLEU = BP \cdot \exp \left( \sum_{n=1}^{N} w_n \log p_n \right)\)                                        |
| **ROUGE**       | - Measures overlap of n-grams, useful for evaluating summaries.                              | - Can be biased towards longer summaries.                                                  | - Text summarization.                                                                             | Based on n-gram overlap and longest common subsequence.                          |
| **METEOR**      | - Considers synonymy and stemming, often correlates better with human judgment.              | - More complex and computationally expensive than BLEU.                                    | - Machine translation, text generation.                                                           | Based on precision, recall, and fragmentation of matched segments.        |
| **Perplexity**  | - Measures how well a probability model predicts a sample.                                   | - Not directly interpretable.                                                              | - Language modeling.                                                                              | \(2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(x_i)}\)                                                          |
| **WER (Word Error Rate)** | - Measures the rate of errors in speech recognition.                                                 | - Can be unfairly harsh for minor errors.                                                   | - Speech recognition.                                                                             | \(\frac{\text{Substitutions} + \text{Insertions} + \text{Deletions}}{\text{Number of words in reference}}\) |
| **CIDEr**       | - Measures consensus in image captioning, based on cosine similarity of TF-IDF vectors.      | - May not capture all nuances of human judgment.                                           | - Image captioning.                                                                               | \(\frac{1}{N} \sum_{i=1}^{N} \frac{\text{TF-IDF}(g_i) \cdot \text{TF-IDF}(c_i)}{\lVert\text{TF-IDF}(g_i)\rVert \cdot \lVert\text{TF-IDF}(c_i)\rVert}\) |
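
As a worked instance of the WER formula in the table, the sketch below counts substitutions, insertions, and deletions with a word-level edit distance and divides by the number of reference words. It is a minimal illustration in plain Python; the function name and the example sentences are hypothetical, and library implementations may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the processed reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 4))
```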

### Explanation of Metrics:

1. **Accuracy**: Measures the proportion of correct predictions among the total predictions.
2. **Precision**: Measures the proportion of true positive results among all positive results predicted by the classifier.
3. **Recall**: Measures the proportion of true positive results among all actual positive instances.
4. **F1 Score**: The harmonic mean of precision and recall, providing a balance between the two.
5. **ROC-AUC**: Area under the receiver operating characteristic curve, evaluating the classifier's performance across all classification thresholds.
6. **BLEU**: Bilingual Evaluation Understudy, measures the overlap between n-grams of the generated and reference texts.
7. **ROUGE**: Recall-Oriented Understudy for Gisting Evaluation, measures the overlap of n-grams and longest common subsequence between generated and reference summaries.
8. **METEOR**: Metric for Evaluation of Translation with Explicit ORdering, considers precision, recall, and fragmentation with stemming and synonymy.
9. **Perplexity**: Measures how well a probabilistic model predicts a sample, often used in language modeling.
10. **WER (Word Error Rate)**: Measures the rate of errors in speech recognition systems.
11. **CIDEr**: Consensus-based Image Description Evaluation, measures the similarity of generated captions to reference captions using TF-IDF weighting.
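
For the question answering task in this guide, the SQuAD metric listed in the first table reports exact match and a token-level F1. The sketch below shows the idea in plain Python, assuming the usual normalization (lowercasing, removing punctuation and the articles "a", "an", "the", collapsing whitespace). The function names and the example strings are illustrative; use `evaluate.load("squad")` for actual evaluation.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer, compared with the short prediction "13".
em = exact_match("13", "13 programming languages.")
f1 = token_f1("13", "13 programming languages.")
print(em, f1)
```

A partially correct span gets an exact match of 0 but a non-zero F1, which is why the two scores are usually reported together.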