# Question answering

Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri or Google what the weather is, then you've used a question answering model before. There are two common types of question answering tasks:

- Extractive: extract the answer from the given context.
- Abstractive: generate an answer from the context that correctly answers the question.

This guide will show you how to:

1. Finetune [ALBERT](https://huggingface.co/albert/albert-base-v2) on the [SQuAD](https://huggingface.co/datasets/squad) dataset for extractive question answering, using 4-bit quantization and LoRA adapters.
2. Use your finetuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[ALBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/albert), [BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart), [BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bert), [BigBird](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/big_bird), [BigBird-Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bigbird_pegasus), [BLOOM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bloom), [CamemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/camembert), [CANINE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/canine), [ConvBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/convbert), [Data2VecText](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/data2vec-text), [DeBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta), [DeBERTa-v2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta-v2), [DistilBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/distilbert), [ELECTRA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/electra), [ERNIE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ernie), [ErnieM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ernie_m), [FlauBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/flaubert), [FNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/fnet), [Funnel Transformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/funnel), [OpenAI GPT-2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt2), [GPT Neo](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neo), [GPT NeoX](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neox), 
[GPT-J](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gptj), [I-BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ibert), [LayoutLMv2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/layoutlmv2), [LayoutLMv3](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/layoutlmv3), [LED](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/led), [LiLT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/lilt), [Longformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longformer), [LUKE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/luke), [LXMERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/lxmert), [MarkupLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/markuplm), [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart), [MEGA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mega), [Megatron-BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/megatron-bert), [MobileBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mobilebert), [MPNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mpnet), [MVP](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mvp), [Nezha](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nezha), [Nyströmformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nystromformer), [OPT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/opt), [QDQBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/qdqbert), [Reformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/reformer), [RemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/rembert), 
[RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta), [RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta-prelayernorm), [RoCBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roc_bert), [RoFormer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roformer), [Splinter](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/splinter), [SqueezeBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/squeezebert), [XLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm), [XLM-RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta), [XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta-xl), [XLNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlnet), [X-MOD](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xmod), [YOSO](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/yoso)


<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. One way to do this in a notebook is `notebook_login()` from `huggingface_hub`; when prompted, enter your token to log in.

## Load SQuAD dataset

Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [1]:
from datasets import load_dataset

cache_dir = "/proj/ciptmp/ix05ogym/.cache/"  # local cache directory; adjust to your environment
squad = load_dataset("squad", split="train[:5000]", streaming=False, cache_dir=cache_dir)

Downloading data: 100%|██████████| 14.5M/14.5M [00:00<00:00, 33.9MB/s]
Downloading data: 100%|██████████| 1.82M/1.82M [00:00<00:00, 7.24MB/s]
Generating train split: 100%|██████████| 87599/87599 [00:01<00:00, 83901.86 examples/s] 
Generating validation split: 100%|██████████| 10570/10570 [00:00<00:00, 72123.00 examples/s]


Split the dataset's `train` split into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [2]:
squad = squad.train_test_split(test_size=0.2)

Then take a look at an example:

In [3]:
squad["train"][0]

{'id': '56d2581659d6e41400145edc',
 'title': 'To_Kill_a_Mockingbird',
 'context': 'Absent mothers and abusive fathers are another theme in the novel. Scout and Jem\'s mother died before Scout could remember her, Mayella\'s mother is dead, and Mrs. Radley is silent about Boo\'s confinement to the house. Apart from Atticus, the fathers described are abusers. Bob Ewell, it is hinted, molested his daughter, and Mr. Radley imprisons his son in his house until Boo is remembered only as a phantom. Bob Ewell and Mr. Radley represent a form of masculinity that Atticus does not, and the novel suggests that such men as well as the traditionally feminine hypocrites at the Missionary Society can lead society astray. Atticus stands apart as a unique model of masculinity; as one scholar explains: "It is the job of real men who embody the traditional masculine qualities of heroic individualism, bravery, and an unshrinking knowledge of and dedication to social justice and morality, to set the society s

There are several important fields here:

- `answers`: the character offset in `context` where the answer starts (`answer_start`) and the answer text (`text`).
- `context`: background information from which the model needs to extract the answer.
- `question`: the question a model should answer.
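As a quick sanity check on these fields, `answer_start` is a character offset into `context`, so slicing the context at that offset should reproduce the answer text. The record below is a made-up example in the SQuAD layout, not an actual row from the dataset:

```python
# Hypothetical record in the SQuAD layout (not taken from the dataset).
example = {
    "question": "Who wrote To Kill a Mockingbird?",
    "context": "To Kill a Mockingbird is a novel by Harper Lee, published in 1960.",
    "answers": {"text": ["Harper Lee"], "answer_start": [36]},
}

# answer_start counts characters, not tokens.
start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
end = start + len(text)

# Slicing the context with the character span recovers the answer.
recovered = example["context"][start:end]
print(recovered)
```

The preprocessing step later converts this character span into token positions, because the model predicts token indices rather than character offsets.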

## Preprocess

The next step is to load the ALBERT tokenizer to process the `question` and `context` fields:

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2", cache_dir=cache_dir)

There are a few preprocessing steps particular to question answering tasks you should be aware of:

1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original `context` by setting
   `return_offsets_mapping=True`.
3. With the mapping in hand, now you can find the start and end tokens of the answer. Use the [sequence_ids](https://huggingface.co/docs/tokenizers/main/en/api/encoding#tokenizers.Encoding.sequence_ids) method to
   find which part of the offset corresponds to the `question` and which corresponds to the `context`.

Here is how you can create a function to truncate and map the start and end tokens of the `answer` to the `context`:

In [5]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
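To see what the label-mapping loop computes, here is a toy sketch with hand-written offsets instead of a real tokenizer. The helper name `char_span_to_token_span` and the example offsets are illustrative assumptions; the sketch uses the strict check that labels an answer `(0, 0)` unless it lies entirely inside the context window:

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span to (start_token, end_token); (0, 0) if not fully in the context."""
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    context_start, context_end = context_tokens[0], context_tokens[-1]

    # The answer must lie entirely within the (possibly truncated) context window.
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy encoding of: [CLS] who wrote it [SEP] harper lee wrote it [SEP]
sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (0, 0),
           (0, 6), (7, 10), (11, 16), (17, 19), (0, 0)]

# "Harper Lee" covers characters 0..10 of the context.
full = char_span_to_token_span(offsets, sequence_ids, 0, 10)

# Same context truncated before "it": an answer at characters 17..19 is cut off.
truncated = char_span_to_token_span(offsets[:8] + [(0, 0)], sequence_ids[:8] + [None], 17, 19)
print(full, truncated)
```

Special tokens have `None` as their sequence id and `(0, 0)` as their offset, which is why the search is restricted to tokens whose sequence id is 1.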

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove any columns you don't need:

In [6]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map: 100%|██████████| 4000/4000 [00:00<00:00, 5392.55 examples/s]
Map: 100%|██████████| 1000/1000 [00:00<00:00, 5351.11 examples/s]


Now create a batch of examples using [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator). Unlike other data collators in 🤗 Transformers, the [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator) does not apply any additional preprocessing such as padding.

In [7]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
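Because the preprocessing function already pads every example to `max_length=384`, the collator only has to group equal-length examples into a batch. A rough pure-Python sketch of that idea follows; `simple_collate` is a hypothetical helper for illustration, not the library's implementation (which also converts the lists to tensors):

```python
def simple_collate(features):
    """Group per-example dicts into one dict of lists, without any padding."""
    lengths = {len(f["input_ids"]) for f in features}
    if len(lengths) != 1:
        # Without padding, examples of different lengths cannot be stacked.
        raise ValueError("examples differ in length; pad them before collating")
    return {key: [f[key] for f in features] for key in features[0]}


# Two already-padded toy examples (0 is used as the padding id here).
features = [
    {"input_ids": [1, 2, 0], "start_positions": 1, "end_positions": 1},
    {"input_ids": [3, 0, 0], "start_positions": 0, "end_positions": 0},
]
batch = simple_collate(features)
print(batch)
```

If you switched to dynamic padding instead of `padding="max_length"`, a collator that pads per batch would be needed.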

## Train

In [8]:
"""from unsloth import FastLanguageModel , is_bfloat16_supported
model_name = "unsloth/mistral-7b-v0.3-bnb-4bit"
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

model , tokenizer= FastLanguageModel.from_pretrained(model_name ,

                                                      load_in_4bit=True,
                                                      
                                                      max_seq_length=max_seq_length,
                                                      cache_dir='/proj/ciptmp/ix05ogym/.cache/',
                                                      
                                                      )

model = FastLanguageModel.get_peft_model( model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context

)"""


<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load ALBERT with [AutoModelForQuestionAnswering](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForQuestionAnswering), quantize it to 4-bit with `BitsAndBytesConfig`, and wrap it with LoRA adapters using `get_peft_model`:

In [9]:
import torch
from transformers import AutoModelForQuestionAnswering, BitsAndBytesConfig, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

# Confirm that a CUDA device is available before loading a 4-bit model
print(torch.cuda.is_available())

# 4-bit NF4 quantization with double quantization; computation runs in bfloat16
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "albert/albert-base-v2"
model = AutoModelForQuestionAnswering.from_pretrained(
    model_name,
    quantization_config=config,
    cache_dir=cache_dir,
)
print(model)

# Attach LoRA adapters to the attention projections
loraconfig = LoraConfig(
    r=16,
    target_modules=["query", "key", "value", "dense"],
    task_type="QUESTION_ANS",
)
model = get_peft_model(model, loraconfig)

model.print_trainable_parameters()
print(model.get_memory_footprint())



No ROCm runtime is found, using ROCM_HOME='/usr'
`low_cpu_mem_usage` was None, now set to True since model is quantized.


True


Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


AlbertForQuestionAnswering(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear4bit(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear4bit(in_features=768, out_features=768, bias=True)
                (key): Linear4bit(in_features=768, out_features=768, bias=True)
                (value): Linear4bit(in_features=768, out_fe

In [10]:
# List the parameters that will be updated during training
for n, p in model.named_parameters():
    if p.requires_grad:
        print(n)

base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.query.lora_A.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.query.lora_B.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.key.lora_A.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.key.lora_B.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.value.lora_A.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.value.lora_B.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.lora_A.default.weight
base_model.model.albert.encoder.albert_layer_groups.0.albert_layers.0.attention.dense.lora_B.default.weight
base_model.model.qa_outputs.modules_to_save.default.weight
base_model.model.qa_outputs.modules_to_save.default.bias
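The list above is short because ALBERT shares a single layer across its encoder, so each targeted projection gets one pair of LoRA matrices. A back-of-the-envelope count of the trainable parameters, assuming 768x768 projections, rank 16, and a `qa_outputs` head that maps 768 features to 2 logits, might look like this:

```python
def lora_param_count(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r) beside the frozen weight.
    return r * in_features + out_features * r


hidden = 768
r = 16
adapted_modules = ["query", "key", "value", "dense"]  # attention projections in the list above

lora_params = len(adapted_modules) * lora_param_count(hidden, hidden, r)
qa_head_params = hidden * 2 + 2  # assumed Linear(768 -> 2): weight plus bias

total_trainable = lora_params + qa_head_params
print(lora_params, qa_head_params, total_trainable)  # 98304 1538 99842
```

This is only an estimate derived from the parameter names; `model.print_trainable_parameters()` reports the authoritative count.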


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. To push the model to the Hub, set `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model); it is commented out in the cell below, so this run only saves locally.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, and data collator.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [11]:
# Allow TF32 matmuls and convolutions (Ampere or newer GPUs)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Optimizer names accepted by `optim`:
# ['adamw_hf', 'adamw_torch', 'adamw_torch_fused', 'adamw_torch_xla', 'adamw_torch_npu_fused',
#  'adamw_apex_fused', 'adafactor', 'adamw_anyprecision', 'sgd', 'adagrad', 'adamw_bnb_8bit',
#  'adamw_8bit', 'lion_8bit', 'lion_32bit', 'paged_adamw_32bit', 'paged_adamw_8bit',
#  'paged_lion_32bit', 'paged_lion_8bit', 'rmsprop', 'rmsprop_bnb', 'rmsprop_bnb_8bit',
#  'rmsprop_bnb_32bit', 'galore_adamw', 'galore_adamw_8bit', 'galore_adafactor',
#  'galore_adamw_layerwise', 'galore_adamw_8bit_layerwise', 'galore_adafactor_layerwise']

training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    eval_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    tf32=True,
    fp16=False,
    bf16=True,  # requires a GPU with bfloat16 support (Ampere or newer)
    optim="adamw_hf",
    # torch_compile=True,  # raised an error in this setup
    # push_to_hub=True,
    report_to="tensorboard",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.2046        |                 |


NaN or Inf found in input tensor.
NaN or Inf found in input tensor.


TrainOutput(global_step=500, training_loss=3.204638671875, metrics={'train_runtime': 102.1162, 'train_samples_per_second': 39.171, 'train_steps_per_second': 4.896, 'total_flos': 67171553280000.0, 'train_loss': 3.204638671875, 'epoch': 1.0})
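The numbers in `TrainOutput` can be cross-checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation, so one optimizer step consumes one batch of 8 examples:

```python
import math

train_examples = 4000      # 80% of the 5,000-example subset
batch_size = 8             # per_device_train_batch_size, assuming one device
epochs = 1

# One optimizer step per batch.
steps = math.ceil(train_examples / batch_size) * epochs
print(steps)  # 500

train_runtime = 102.1162   # seconds, from the TrainOutput above
samples_per_second = round(train_examples * epochs / train_runtime, 3)
steps_per_second = round(steps / train_runtime, 3)
print(samples_per_second, steps_per_second)  # 39.171 4.896
```

The results line up with `global_step=500`, `train_samples_per_second` and `train_steps_per_second` in the output above.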

In [12]:
model.save_pretrained("my_awesome_qa_model")
tokenizer.save_pretrained("my_awesome_qa_model")




('my_awesome_qa_model/tokenizer_config.json',
 'my_awesome_qa_model/special_tokens_map.json',
 'my_awesome_qa_model/spiece.model',
 'my_awesome_qa_model/added_tokens.json',
 'my_awesome_qa_model/tokenizer.json')

Once training is completed, you can share your model on the Hub by calling the trainer's [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method, for example `trainer.push_to_hub()`, so everyone can use your model.

<Tip>

For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb).

</Tip>

## Evaluate

Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) still calculates the evaluation loss during training so you're not completely in the dark about your model's performance.

If you have more time and you're interested in how to evaluate your model for question answering, take a look at the [Question answering](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing) chapter from the 🤗 Hugging Face Course!
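To give a feel for what the final scoring step involves, here is a simplified sketch of the two metrics commonly reported for SQuAD-style answers, exact match and token-level F1. It assumes the predicted answer strings have already been extracted from the logits, and it is a pared-down approximation rather than the official evaluation script:

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The 13", "13"))
print(f1_score("and 13 programming", "13"))
```

A prediction that contains the right answer plus extra words gets partial F1 credit but no exact-match credit.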

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a question and some context you'd like the model to predict:

In [13]:
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for question answering with your model, and pass your text to it:

In [14]:
from transformers import pipeline

question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
question_answerer(question=question, context=context)

Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.01029176265001297,
 'start': 89,
 'end': 107,
 'answer': 'and 13 programming'}

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [15]:
from transformers import AutoTokenizer

#tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
# Move every tensor to the GPU; token_type_ids and attention_mask keep their own values
inputs = tokenizer(question, context, return_tensors="pt").to("cuda")
inputs


{'input_ids': tensor([[    2,   184,   151,  3143,  2556,   630,  8064,   555,    60,     3,
          8064,    63,    13, 11633,  2786, 12905,    17,    92,  7920,  1854,
            19,  5084,  2556,  1112,  2556,    17,   539,  3143,  2556,     9,
             3]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

Pass your inputs to the model and return the `logits`:

In [16]:
import torch
from transformers import AutoModelForQuestionAnswering

#model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
with torch.no_grad():
    outputs = model(**inputs)

Get the highest probability from the model output for the start and end positions:

In [17]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

In [18]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'[CLS]'

https://github.com/unslothai/unsloth

The `optim` argument of `TrainingArguments` accepts the optimizer names listed in the comment in the training cell above. The table below compares them by implementation, precision, hardware support, and typical applications. Details such as the supported hardware (GPU, TPU, NPU) and implementation differences can affect performance and suitability for a given training scenario.

| Optimizer Name                 | Implementation      | Precision        | Hardware Support            | Key Features                                   | Typical Applications                       |
|--------------------------------|---------------------|------------------|-----------------------------|------------------------------------------------|--------------------------------------------|
| adamw_hf                       | Hugging Face        | 32-bit           | CPU, GPU                    | Decoupled weight decay, Transformer training  | NLP, Transformers                          |
| adamw_torch                    | PyTorch             | 32-bit           | CPU, GPU                    | Decoupled weight decay, flexible             | General deep learning                      |
| adamw_torch_fused              | PyTorch             | 32-bit           | CPU, GPU                    | Fused operations for efficiency              | General deep learning                      |
| adamw_torch_xla                | PyTorch             | 32-bit           | TPU                         | TPU optimized                                 | High-performance training on TPUs          |
| adamw_torch_npu_fused          | PyTorch             | 32-bit           | NPU                         | Fused operations for NPUs                    | Training on NPU hardware                   |
| adamw_apex_fused               | NVIDIA Apex         | 32-bit, mixed    | GPU                         | Fused operations, mixed precision            | High-performance training on GPUs          |
| adafactor                      | Hugging Face        | 32-bit, mixed    | CPU, GPU                    | Memory efficient, scalable                   | NLP, large models                          |
| adamw_anyprecision             | torchdistx          | Variable         | CPU, GPU                    | Flexible precision handling                  | Custom precision requirements              |
| sgd                            | Standard            | 32-bit           | CPU, GPU                    | Simple, effective                            | Classic machine learning, simple models    |
| adagrad                        | Standard            | 32-bit           | CPU, GPU                    | Adaptive learning rate                       | Sparse data                                |
| adamw_bnb_8bit                 | BitsAndBytes        | 8-bit            | CPU, GPU                    | Memory efficient, fast                       | Large-scale training on limited hardware   |
| adamw_8bit                     | BitsAndBytes        | 8-bit            | CPU, GPU                    | Memory efficient, reduced precision          | Large models with memory constraints       |
| lion_8bit                      | BitsAndBytes        | 8-bit            | CPU, GPU                    | Memory efficient, reduced precision          | Memory constrained environments            |
| lion_32bit                     | BitsAndBytes        | 32-bit           | CPU, GPU                    | Higher precision                              | General deep learning                      |
| paged_adamw_32bit              | BitsAndBytes        | 32-bit           | CPU, GPU                    | Paged optimizer for memory management        | Large datasets                             |
| paged_adamw_8bit               | BitsAndBytes        | 8-bit            | CPU, GPU                    | Paged optimizer, memory efficient            | Large models with memory constraints       |
| paged_lion_32bit               | BitsAndBytes        | 32-bit           | CPU, GPU                    | Paged optimizer for memory management        | Large datasets                             |
| paged_lion_8bit                | BitsAndBytes        | 8-bit            | CPU, GPU                    | Paged optimizer, memory efficient            | Large models with memory constraints       |
| rmsprop                        | Standard            | 32-bit           | CPU, GPU                    | Adaptive learning rate                       | RNNs, general deep learning                |
| rmsprop_bnb                    | BitsAndBytes        | 32-bit           | CPU, GPU                    | BitsAndBytes optimization                    | Memory efficient training                  |
| rmsprop_bnb_8bit               | BitsAndBytes        | 8-bit            | CPU, GPU                    | Memory efficient, fast                       | Large-scale training on limited hardware   |
| rmsprop_bnb_32bit              | BitsAndBytes        | 32-bit           | CPU, GPU                    | Higher precision                              | General deep learning                      |
| galore_adamw                   | GaLore              | 32-bit           | CPU, GPU                    | AdamW with low-rank gradient projection      | Advanced training scenarios                |
| galore_adamw_8bit              | GaLore              | 8-bit            | CPU, GPU                    | Memory efficient, fast                       | Memory constrained environments            |
| galore_adafactor               | GaLore              | 32-bit, mixed    | CPU, GPU                    | Memory efficient, scalable                   | NLP, large models                          |
| galore_adamw_layerwise         | GaLore              | 32-bit           | CPU, GPU                    | Layerwise optimization                       | Advanced training scenarios                |
| galore_adamw_8bit_layerwise    | GaLore              | 8-bit            | CPU, GPU                    | Memory efficient, layerwise optimization     | Memory constrained environments            |
| galore_adafactor_layerwise     | GaLore              | 32-bit, mixed    | CPU, GPU                    | Layerwise optimization, memory efficient     | NLP, large models                          |

### Notes:
- **Precision**: Indicates whether the optimizer supports standard 32-bit precision or has options for mixed/8-bit precision for memory efficiency.
- **Hardware Support**: Identifies the primary hardware the optimizer is designed to run on efficiently, e.g., CPU, GPU, TPU, NPU.
- **Key Features**: Highlights unique aspects or enhancements that distinguish each optimizer.
- **Typical Applications**: Common use cases or scenarios where the optimizer is particularly effective.

Choosing the right optimizer depends on your specific training needs, hardware availability, and whether you need to manage large models or datasets within memory constraints.
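
As a rough illustration of why precision matters under memory constraints, the sketch below estimates optimizer state size from a parameter count. It assumes AdamW keeps two state values per parameter and Lion keeps one, and it ignores the small overhead of the quantization constants used by 8-bit optimizers. The helper `optimizer_state_bytes` and the 66M parameter count (roughly DistilBERT-base) are illustrative assumptions, not values taken from the table.

```python
def optimizer_state_bytes(num_params, states_per_param, bits_per_state):
    """Approximate optimizer state size in bytes (hypothetical helper)."""
    return num_params * states_per_param * bits_per_state // 8


# Assumption: about 66M trainable parameters, roughly DistilBERT-base.
NUM_PARAMS = 66_000_000

# (state values per parameter, bits per state value)
CONFIGS = {
    "AdamW, 32-bit": (2, 32),  # first and second moment
    "AdamW, 8-bit": (2, 8),
    "Lion, 32-bit": (1, 32),   # momentum only
    "Lion, 8-bit": (1, 8),
}

for name, (states, bits) in CONFIGS.items():
    mib = optimizer_state_bytes(NUM_PARAMS, states, bits) / 2**20
    print(f"{name:<14} {mib:8.1f} MiB")
```

The estimate covers optimizer state only. Weights, gradients, and activations also take memory, so treat the numbers as a lower bound on what the optimizer choice costs.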




Convergence speed depends on the model, the dataset, the task, and hyperparameter tuning, so no single optimizer is fastest everywhere. With that caveat, here are some general observations about the optimizers listed above:

1. **AdamW variants**:
    - `adamw_hf`, `adamw_torch`, `adamw_torch_fused`, `adamw_torch_xla`, `adamw_torch_npu_fused`, `adamw_apex_fused`, `adamw_anyprecision`, `adamw_bnb_8bit`, `adamw_8bit`, `galore_adamw`, `galore_adamw_8bit`, `galore_adamw_layerwise`, `galore_adamw_8bit_layerwise`:
      - AdamW typically converges quickly thanks to its per-parameter adaptive learning rates and decoupled weight decay. The fused implementations (`adamw_torch_fused`, `adamw_apex_fused`) compute the same update more efficiently, so each step is faster even though the number of steps needed to converge does not change.
      - `adamw_torch_xla` and `adamw_torch_npu_fused` target specific hardware (TPU and NPU, respectively), which can shorten wall-clock training time on those platforms.

2. **Adafactor**:
    - `adafactor`, `galore_adafactor`, `galore_adafactor_layerwise`:
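      - Adafactor stores factored (row and column) second-moment statistics instead of a full per-parameter estimate, which makes it memory-efficient and suitable for training very large models. It can converge quickly in large-scale NLP tasks, particularly when memory is the limiting factor.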

3. **Lion**:
    - `lion_8bit`, `lion_32bit`, `paged_lion_32bit`, `paged_lion_8bit`:
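      - Lion updates weights using only the sign of an interpolation between the momentum and the current gradient, and it keeps a single momentum state, so it needs less optimizer memory than AdamW. It is less commonly used and can converge faster in some scenarios, but it usually needs a smaller learning rate and a larger weight decay than AdamW.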

4. **SGD**:
    - `sgd`:
      - SGD with momentum can converge quickly in some scenarios but generally requires more careful tuning of learning rates and momentum parameters. It may not converge as fast as AdamW in many deep learning tasks.

5. **Adagrad**:
    - `adagrad`:
      - Adagrad scales each parameter's learning rate by its accumulated squared gradients, which helps with sparse data. Because that accumulation only grows, the effective learning rate keeps shrinking, which can slow convergence on dense data.

6. **Paged AdamW**:
    - `paged_adamw_32bit`, `paged_adamw_8bit`:
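      - These optimizers move optimizer state between GPU and CPU memory when GPU memory runs short, which helps avoid out-of-memory errors when training large models. They apply the same update rule as AdamW, so convergence per step is comparable, although paging may slow individual steps when it is triggered.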

7. **RMSprop**:
    - `rmsprop`, `rmsprop_bnb`, `rmsprop_bnb_8bit`, `rmsprop_bnb_32bit`:
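      - RMSprop normalizes each update by an exponential moving average of squared gradients, so it adapts to non-stationary objectives without Adagrad's ever-shrinking learning rate. It can converge faster than SGD in many cases.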

8. **GaLore Optimizers**:
    - `galore_adamw`, `galore_adamw_8bit`, `galore_adafactor`, `galore_adamw_layerwise`, `galore_adamw_8bit_layerwise`, `galore_adafactor_layerwise`:
      - GaLore (Gradient Low-Rank Projection) projects gradients into a low-rank subspace, which shrinks the optimizer state while still training all parameters. The `layerwise` variants update weights layer by layer to save further memory. Convergence is intended to stay close to that of the underlying optimizer (AdamW or Adafactor).
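
The update rules described in the list above can be sketched for a single scalar parameter. This is a minimal, simplified illustration and not the code used by PyTorch, bitsandbytes, or Transformers: the function names, the toy objective `(w - 3)^2`, and the hyperparameter values are all assumptions chosen for readability.

```python
import math


def sgd_momentum_step(w, g, state, lr=0.1, momentum=0.9):
    # Accumulate a velocity from past gradients, then step along it.
    state["v"] = momentum * state.get("v", 0.0) + g
    return w - lr * state["v"]


def adagrad_step(w, g, state, lr=0.5, eps=1e-10):
    # The running sum of squared gradients only grows, so steps only shrink.
    state["sum"] = state.get("sum", 0.0) + g * g
    return w - lr * g / (math.sqrt(state["sum"]) + eps)


def rmsprop_step(w, g, state, lr=0.05, alpha=0.99, eps=1e-8):
    # An exponential moving average replaces Adagrad's ever-growing sum.
    state["sq"] = alpha * state.get("sq", 0.0) + (1 - alpha) * g * g
    return w - lr * g / (math.sqrt(state["sq"]) + eps)


def adamw_step(w, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
               weight_decay=0.01):
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** t)  # bias-corrected second moment
    w = w * (1 - lr * weight_decay)        # decoupled weight decay
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def lion_step(w, g, state, lr=0.01, beta1=0.9, beta2=0.99, weight_decay=0.0):
    m = state.get("m", 0.0)
    c = beta1 * m + (1 - beta1) * g             # interpolate momentum and gradient
    update = (c > 0) - (c < 0)                  # sign of c: -1, 0, or 1
    state["m"] = beta2 * m + (1 - beta2) * g    # the only state Lion keeps
    return w * (1 - lr * weight_decay) - lr * update


def loss(w):
    return (w - 3.0) ** 2


def grad(w):
    return 2.0 * (w - 3.0)


STEP_FNS = {
    "sgd": sgd_momentum_step,
    "adagrad": adagrad_step,
    "rmsprop": rmsprop_step,
    "adamw": adamw_step,
    "lion": lion_step,
}


def run(step_fn, steps=200, w0=0.0):
    w, state = w0, {}
    for _ in range(steps):
        w = step_fn(w, grad(w), state)
    return w, loss(w)


initial_loss = loss(0.0)
results = {name: run(fn) for name, fn in STEP_FNS.items()}
for name, (w, final_loss) in results.items():
    print(f"{name:<8} w={w:.4f} loss={final_loss:.6f}")
```

The final losses on this toy problem depend heavily on the chosen learning rates and say nothing about which optimizer is fastest on a real model. The point is the shape of each update: for example, every Lion step has magnitude `lr` regardless of the gradient size, while AdamW rescales the step by its running gradient statistics.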

### General Recommendations:
- For most standard deep learning tasks, **AdamW variants** are likely to offer the fastest convergence due to their adaptive learning rates and decoupled weight decay.
- For large-scale NLP tasks, **Adafactor** can be very efficient and fast.
- For training on specific hardware (TPU, NPU), use the optimizers optimized for those platforms like `adamw_torch_xla` or `adamw_torch_npu_fused`.
- For memory-constrained environments, **8-bit variants** and **paged optimizers** reduce optimizer memory while keeping convergence close to that of their 32-bit counterparts.
- If the model barely fits in GPU memory, **paged variants** of AdamW or Lion can help absorb memory spikes.
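
The recommendations above can be encoded as a small lookup. The function `suggest_optim` and its arguments are hypothetical and not part of any library; it only returns optimizer names that appear in this document, and the order of the checks is one possible reading of the bullets.

```python
def suggest_optim(hardware="gpu", memory_constrained=False, very_large_model=False):
    """Return a starting-point optimizer name (hypothetical helper)."""
    if hardware == "tpu":
        return "adamw_torch_xla"
    if hardware == "npu":
        return "adamw_torch_npu_fused"
    if memory_constrained:
        # 8-bit state plus paging keeps optimizer memory low.
        return "paged_adamw_8bit"
    if very_large_model:
        return "adafactor"
    # Default for most standard deep learning tasks.
    return "adamw_torch"


print(suggest_optim())
print(suggest_optim(hardware="tpu"))
print(suggest_optim(memory_constrained=True))
```

Treat the result as a first guess to validate experimentally, not as a rule.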

Ultimately, the best way to determine which optimizer converges the fastest for your specific use case is to experiment with a few of them on your dataset and model. Hyperparameter tuning (such as learning rate adjustments) also plays a significant role in convergence speed.
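
One way to organize such an experiment is to enumerate optimizer and learning rate combinations up front. The sketch below only builds the configurations; the helper `build_sweep`, the candidate values, and the `runs` directory are assumptions for illustration. The dictionary keys mirror the `output_dir`, `optim`, and `learning_rate` parameters of `TrainingArguments`, so each entry could be passed as `TrainingArguments(**cfg)` in a real run.

```python
from itertools import product


def build_sweep(optimizers, learning_rates, base_dir="runs"):
    """Build one config dict per (optimizer, learning rate) pair."""
    sweep = []
    for optim, lr in product(optimizers, learning_rates):
        sweep.append(
            {
                "output_dir": f"{base_dir}/{optim}-lr{lr:g}",
                "optim": optim,
                "learning_rate": lr,
            }
        )
    return sweep


# Assumed candidates; adjust to your hardware and memory budget.
sweep = build_sweep(
    optimizers=["adamw_torch", "adafactor", "paged_adamw_8bit"],
    learning_rates=[1e-5, 2e-5, 5e-5],
)

for cfg in sweep:
    print(cfg)
```

Comparing the evaluation loss of each run after the same number of training steps gives a like-for-like view of convergence speed for your model and dataset.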