# Introduction to Fine-Tuning Large Language Models with LoRA

In this lab, we explore Low-Rank Adaptation (LoRA), a parameter-efficient technique for fine-tuning large language models. Instead of updating every weight, LoRA trains a small number of additional low-rank parameters while the pretrained weights stay frozen. This makes it feasible to adapt large models such as Meta's OPT (published under the `facebook/` namespace on the Hugging Face Hub) to new tasks without extensive computational resources.
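To make the efficiency claim concrete, here is a small worked calculation. It assumes a single square projection matrix of width 1024 (the hidden size of `facebook/opt-350m`) and a LoRA rank of 8. The rank and the helper name `lora_param_counts` are illustrative choices, not values fixed by this lab.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # full fine-tuning updates every entry of W
    lora = rank * (d_in + d_out)   # A is (rank x d_in), B is (d_out x rank)
    return full, lora


# Illustrative numbers: one 1024 x 1024 projection, rank 8 (assumed).
full, lora = lora_param_counts(d_out=1024, d_in=1024, rank=8)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA, rank 8:     {lora:,} trainable parameters")
print(f"ratio:            {100 * lora / full:.2f}%")
```

Under these assumptions, LoRA trains 16,384 parameters for this matrix instead of 1,048,576, which is roughly 1.6% of the original count.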

#### Objectives:
- Understand the principles behind LoRA and its advantages for fine-tuning large language models.
- Learn how to apply LoRA to the OPT model by training two low-rank matrices (LoRA_A and LoRA_B) while the pretrained weights stay frozen.
- Explore the use of Hugging Face's `transformers` and `datasets` libraries for model training and evaluation.
- Gain practical experience in handling natural language processing tasks, such as question-answering, through the implementation and fine-tuning of the model.
- Evaluate the performance of the fine-tuned model using standard NLP metrics such as BLEU, ROUGE, and Exact Match.

#### Prerequisites:
Before starting this lab, you should have a basic understanding of PyTorch, neural networks, and NLP concepts. Familiarity with Hugging Face libraries will also be beneficial.

#### Lab Overview:
The lab is structured as follows:
1. **Environment Setup**: We will begin by setting up our working environment, including installing the necessary libraries and loading environment variables.
2. **Model Initialization and LoRA Configuration**: You will learn how to initialize the OPT model and configure it using LoRA by modifying specific layers while freezing the rest of the model to prevent unnecessary updates.
3. **Data Preparation**: We will load and preprocess the data suitable for our task, creating custom datasets and dataloaders.
4. **Fine-Tuning and Evaluation**: You will fine-tune the model on a question-answering task and evaluate its performance using several NLP metrics.
5. **Results Analysis**: Finally, we will analyze the results and understand the impact of LoRA fine-tuning on the model's performance.

By the end of this lab, you will have hands-on experience fine-tuning large language models using LoRA, which is a valuable skill in the field of artificial intelligence and natural language processing.
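Before moving on, here is a minimal sketch of the idea behind step 2, assuming PyTorch is installed. The class name `LoRALinear`, the default `rank` and `alpha`, and the initialization are illustrative assumptions; the implementation used later in this lab may differ in its details.

```python
import math

import torch
from torch import nn


class LoRALinear(nn.Module):
    """Hypothetical wrapper: a frozen nn.Linear plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # freeze the pretrained weight and bias

        # lora_A projects down to `rank` dimensions, lora_B projects back up.
        self.lora_A = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank correction x @ A^T @ B^T.
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return self.base(x) + self.scaling * update


layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
x = torch.randn(2, 5, 1024)
out = layer(x)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)
```

Because `lora_B` starts at zero in this sketch, the wrapped layer initially produces exactly the same output as the frozen base layer, so training starts from the pretrained model's behavior. Only `lora_A` and `lora_B` receive gradients.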

# Import basic libraries

In [1]:
!pip install -q "transformers[torch]"
!pip install -q datasets
!pip install -q -U accelerate
!pip install -q py7zr
!pip install -q evaluate nltk rouge_score


In [2]:
import evaluate
from torch import nn
import math



# Load Evaluator

In this section of the code, we are initializing evaluation metrics for the language model we plan to fine-tune. Specifically, we use the `evaluate` library, which is a part of the Hugging Face ecosystem, designed for evaluating and comparing the performance of models across a wide range of NLP tasks.

1. `bleu_scorer = evaluate.load('bleu')`: This line loads the BLEU (Bilingual Evaluation Understudy) scorer from the `evaluate` library. BLEU is a widely used metric for evaluating the quality of text which has been machine-translated from one natural language to another. It works by comparing the machine-generated text to one or more reference texts (typically human-generated) and computes a score indicating how similar they are, based on the presence of the same words and phrases. BLEU is particularly popular in tasks like machine translation but is also used in other contexts like text summarization.

2. `rouge_scorer = evaluate.load('rouge')`: This line loads the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scorer. ROUGE is another popular evaluation metric used primarily in summarization tasks. Unlike BLEU, which is precision-oriented, ROUGE focuses on recall, meaning it measures how well the generated summaries cover the content present in the reference summaries. It compares the overlap of n-grams, word sequences, and word pairs between the computer-generated output and the reference texts.

These metrics will be used later in the lab to evaluate how well the fine-tuned model performs on the question-answering task. They let us quantitatively compare generated answers against reference answers and make informed decisions about the model's performance and potential improvements.
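To build intuition for the precision/recall distinction, here is a simplified, pure-Python sketch. It is not the BLEU or ROUGE implementation from `evaluate`: real BLEU combines several n-gram orders with a brevity penalty, and ROUGE has several variants. The helper names `unigram_overlap` and `exact_match` are hypothetical, and tokenization by whitespace is an assumption made for brevity.

```python
from collections import Counter
from typing import Tuple


def unigram_overlap(prediction: str, reference: str) -> Tuple[float, float]:
    """Return (precision, recall) of clipped unigram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


prediction = "the cat sat"
reference = "the cat sat on the mat"
precision, recall = unigram_overlap(prediction, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"exact match={exact_match(prediction, reference)}")
```

In this example every predicted word appears in the reference, so the precision-style score is 1.0, but the prediction covers only half of the reference words, so the recall-style score is 0.5. Exact Match is 0.0 because the strings differ.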


In [None]:
bleu_scorer = evaluate.load('bleu')
rouge_scorer = evaluate.load('rouge')

# Adding some necessary code for OPT Model

In [3]:
# @title
# coding=utf-8
# Copyright 2022 The Fairseq Authors and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch OPT model."""
import random
from typing import List, Optional, Tuple, Union

import torch
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

from transformers.activations import ACT2FN
from transformers.modeling_outputs import (
    BaseModelOutputWithPast,
    CausalLMOutputWithPast,
    QuestionAnsweringModelOutput,
    SequenceClassifierOutputWithPast,
)
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import (
    add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    logging,
    replace_return_docstrings,
)
from transformers.models.opt.configuration_opt import OPTConfig

logger = logging.get_logger(__name__)

_CHECKPOINT_FOR_DOC = "facebook/opt-350m"
_CONFIG_FOR_DOC = "OPTConfig"

# Base model docstring
_EXPECTED_OUTPUT_SHAPE = [1, 8, 1024]

# SequenceClassification docstring
_CHECKPOINT_FOR_SEQUENCE_CLASSIFICATION = "ArthurZ/opt-350m-dummy-sc"
_SEQ_CLASS_EXPECTED_LOSS = 1.71
_SEQ_CLASS_EXPECTED_OUTPUT = "'LABEL_0'"

OPT_PRETRAINED_MODEL_ARCHIVE_LIST = [
    "facebook/opt-125m",
    "facebook/opt-350m",
    "facebook/opt-1.3b",
    "facebook/opt-2.7b",
    "facebook/opt-6.7b",
    "facebook/opt-13b",
    "facebook/opt-30b",
    # See all OPT models at https://huggingface.co/models?filter=opt
]


# Copied from transformers.models.bart.modeling_bart._make_causal_mask
def _make_causal_mask(
        input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
):
    """
    Make causal mask used for uni-directional (causal) self-attention.
    """
    bsz, tgt_len = input_ids_shape
    mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
    mask_cond = torch.arange(mask.size(-1), device=device)
    mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
    mask = mask.to(dtype)

    if past_key_values_length > 0:
        mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
    return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)


def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
    """
    Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
    """
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len

    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)

    inverted_mask = 1.0 - expanded_mask

    return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)


class OPTLearnedPositionalEmbedding(nn.Embedding):
    """
    This module learns positional embeddings up to a fixed maximum size.
    """

    def __init__(self, num_embeddings: int, embedding_dim: int):
        # OPT is set up so that if padding_idx is specified then offset the embedding ids by 2
        # and adjust num_embeddings appropriately. Other models don't have this hack
        self.offset = 2
        super().__init__(num_embeddings + self.offset, embedding_dim)

    def forward(self, attention_mask: torch.LongTensor, past_key_values_length: int = 0):
        """`attention_mask` is expected to be [bsz x seqlen]."""
        attention_mask = attention_mask.long()

        # create positions depending on attention_mask
        positions = (torch.cumsum(attention_mask, dim=1).type_as(attention_mask) * attention_mask).long() - 1

        # cut positions if `past_key_values_length` is > 0
        positions = positions[:, past_key_values_length:]

        return super().forward(positions + self.offset)


class OPTDecoderLayer(nn.Module):
    def __init__(self, config: OPTConfig):
        super().__init__()
        self.embed_dim = config.hidden_size
        self.self_attn = OPTAttention(
            embed_dim=self.embed_dim,
            num_heads=config.num_attention_heads,
            dropout=config.attention_dropout,
            is_decoder=True,
            bias=config.enable_bias,
        )
        self.do_layer_norm_before = config.do_layer_norm_before
        self.dropout = config.dropout
        self.activation_fn = ACT2FN[config.activation_function]

        self.self_attn_layer_norm = nn.LayerNorm(
            self.embed_dim, elementwise_affine=config.layer_norm_elementwise_affine
        )
        self.fc1 = nn.Linear(self.embed_dim, config.ffn_dim, bias=config.enable_bias)
        self.fc2 = nn.Linear(config.ffn_dim, self.embed_dim, bias=config.enable_bias)
        self.final_layer_norm = nn.LayerNorm(self.embed_dim, elementwise_affine=config.layer_norm_elementwise_affine)

    def forward(
            self,
            hidden_states: torch.Tensor,
            attention_mask: Optional[torch.Tensor] = None,
            layer_head_mask: Optional[torch.Tensor] = None,
            past_key_value: Optional[Tuple[torch.Tensor]] = None,
            output_attentions: Optional[bool] = False,
            use_cache: Optional[bool] = False,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
            layer_head_mask (`torch.FloatTensor`, *optional*): mask for attention heads in a given layer of size
                `(encoder_attention_heads,)`.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
        """

        residual = hidden_states

        # 125m, 1.3B, ..., 175B apply layer norm BEFORE attention
        if self.do_layer_norm_before:
            hidden_states = self.self_attn_layer_norm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            past_key_value=past_key_value,
            attention_mask=attention_mask,
            layer_head_mask=layer_head_mask,
            output_attentions=output_attentions,
        )
        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
        hidden_states = residual + hidden_states

        # 350m applies layer norm AFTER attention
        if not self.do_layer_norm_before:
            hidden_states = self.self_attn_layer_norm(hidden_states)

        # Fully Connected
        hidden_states_shape = hidden_states.shape
        hidden_states = hidden_states.reshape(-1, hidden_states.size(-1))
        residual = hidden_states

        # 125m, 1.3B, ..., 175B apply layer norm BEFORE the feed-forward block
        if self.do_layer_norm_before:
            hidden_states = self.final_layer_norm(hidden_states)

        hidden_states = self.fc1(hidden_states)
        hidden_states = self.activation_fn(hidden_states)

        hidden_states = self.fc2(hidden_states)
        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)

        hidden_states = (residual + hidden_states).view(hidden_states_shape)

        # 350m applies layer norm AFTER the feed-forward block
        if not self.do_layer_norm_before:
            hidden_states = self.final_layer_norm(hidden_states)

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs


OPT_START_DOCSTRING = r"""
    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
    etc.)

    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
    and behavior.

    Parameters:
        config ([`OPTConfig`]):
            Model configuration class with all the parameters of the model. Initializing with a config file does not
            load the weights associated with the model, only the configuration. Check out the
            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""


@add_start_docstrings(
    "The bare OPT Model outputting raw hidden-states without any specific head on top.",
    OPT_START_DOCSTRING,
)
class OPTPreTrainedModel(PreTrainedModel):
    config_class = OPTConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["OPTDecoderLayer"]
    _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]

    def _init_weights(self, module):
        std = self.config.init_std
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()

    def _set_gradient_checkpointing(self, module, value=False):
        if isinstance(module, (OPTDecoder)):
            module.gradient_checkpointing = value


OPT_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
            it.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            [What are attention masks?](../glossary#attention-mask)

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
            `past_key_values`).

            If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
            and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
            information on the default strategy.
        head_mask (`torch.Tensor` of shape `(encoder_layers, encoder_attention_heads)`, *optional*):
            Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in `[0, 1]`:

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.

        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
            `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

            Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
            blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
            model's internal embedding lookup matrix.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`).
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
            tensors for more detail.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
            more detail.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""


class OPTDecoder(OPTPreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`OPTDecoderLayer`]

    Args:
        config: OPTConfig
    """

    def __init__(self, config: OPTConfig):
        super().__init__(config)
        self.dropout = config.dropout
        self.layerdrop = config.layerdrop
        self.padding_idx = config.pad_token_id
        self.max_target_positions = config.max_position_embeddings
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.word_embed_proj_dim, self.padding_idx)
        self.embed_positions = OPTLearnedPositionalEmbedding(config.max_position_embeddings, config.hidden_size)

        if config.word_embed_proj_dim != config.hidden_size:
            self.project_out = nn.Linear(config.hidden_size, config.word_embed_proj_dim, bias=False)
        else:
            self.project_out = None

        if config.word_embed_proj_dim != config.hidden_size:
            self.project_in = nn.Linear(config.word_embed_proj_dim, config.hidden_size, bias=False)
        else:
            self.project_in = None

        # Note that the only purpose of `config._remove_final_layer_norm` is to keep backward compatibility
        # with checkpoints that have been fine-tuned before transformers v4.20.1
        # see https://github.com/facebookresearch/metaseq/pull/164
        if config.do_layer_norm_before and not config._remove_final_layer_norm:
            self.final_layer_norm = nn.LayerNorm(
                config.hidden_size, elementwise_affine=config.layer_norm_elementwise_affine
            )
        else:
            self.final_layer_norm = None

        self.layers = nn.ModuleList([OPTDecoderLayer(config) for _ in range(config.num_hidden_layers)])

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value

    # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
    def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
        # create causal mask
        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
        combined_attention_mask = None
        if input_shape[-1] > 1:
            combined_attention_mask = _make_causal_mask(
                input_shape,
                inputs_embeds.dtype,
                device=inputs_embeds.device,
                past_key_values_length=past_key_values_length,
            )

        if attention_mask is not None:
            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
            expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
                inputs_embeds.device
            )
            combined_attention_mask = (
                expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
            )

        return combined_attention_mask

    def forward(
            self,
            input_ids: torch.LongTensor = None,
            attention_mask: Optional[torch.Tensor] = None,
            head_mask: Optional[torch.Tensor] = None,
            past_key_values: Optional[List[torch.FloatTensor]] = None,
            inputs_embeds: Optional[torch.FloatTensor] = None,
            use_cache: Optional[bool] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        r"""
        Args:
            input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)
            head_mask (`torch.Tensor` of shape `(num_hidden_layers, num_attention_heads)`, *optional*):
                Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`:

                - 1 indicates the head is **not masked**,
                - 0 indicates the head is **masked**.

            past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
                Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
                shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of

                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
                cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those
                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of
                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.

            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated vectors
                than the model's internal embedding lookup matrix.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            output_hidden_states (`bool`, *optional*):
                Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
                for more detail.
            return_dict (`bool`, *optional*):
                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # retrieve input_ids and inputs_embeds
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
            input_ids = input_ids.view(-1, input_shape[-1])
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)

        batch_size, seq_length = input_shape
        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
        # required mask seq length can be calculated via length of past
        mask_seq_length = past_key_values_length + seq_length

        # embed positions
        if attention_mask is None:
            attention_mask = torch.ones(batch_size, mask_seq_length, device=inputs_embeds.device)
        elif attention_mask.shape[1] != mask_seq_length:
            raise ValueError(
                f"The provided attention mask has length {attention_mask.shape[1]}, but its length should be "
                f"{mask_seq_length} (sum of the lengths of current and past inputs)"
            )
        causal_attention_mask = self._prepare_decoder_attention_mask(
            attention_mask, input_shape, inputs_embeds, past_key_values_length
        )
        pos_embeds = self.embed_positions(attention_mask, past_key_values_length)

        if self.project_in is not None:
            inputs_embeds = self.project_in(inputs_embeds)

        hidden_states = inputs_embeds + pos_embeds

        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        next_decoder_cache = () if use_cache else None

        # check if head_mask has a correct number of layers specified if desired
        for attn_mask, mask_name in zip([head_mask], ["head_mask"]):
            if attn_mask is not None:
                if attn_mask.size()[0] != (len(self.layers)):
                    raise ValueError(
                        f"The `{mask_name}` should be specified for {len(self.layers)} layers, but it is for"
                        f" {head_mask.size()[0]}."
                    )

        for idx, decoder_layer in enumerate(self.layers):
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            dropout_probability = random.uniform(0, 1)
            if self.training and (dropout_probability < self.layerdrop):
                continue

            past_key_value = past_key_values[idx] if past_key_values is not None else None

            if self.gradient_checkpointing and self.training:

                def create_custom_forward(module):
                    def custom_forward(*inputs):
                        # None for past_key_value
                        return module(*inputs, output_attentions, None)

                    return custom_forward

                layer_outputs = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(decoder_layer),
                    hidden_states,
                    causal_attention_mask,
                    head_mask[idx] if head_mask is not None else None,
                    None,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=causal_attention_mask,
                    layer_head_mask=(head_mask[idx] if head_mask is not None else None),
                    past_key_value=past_key_value,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                )

            hidden_states = layer_outputs[0]

            if use_cache:
                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        if self.final_layer_norm is not None:
            hidden_states = self.final_layer_norm(hidden_states)

        if self.project_out is not None:
            hidden_states = self.project_out(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = next_decoder_cache if use_cache else None
        if not return_dict:
            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )


@add_start_docstrings(
    "The bare OPT Model outputting raw hidden-states without any specific head on top.",
    OPT_START_DOCSTRING,
)
class OPTModel(OPTPreTrainedModel):
    def __init__(self, config: OPTConfig):
        super().__init__(config)
        self.decoder = OPTDecoder(config)
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.decoder.embed_tokens

    def set_input_embeddings(self, value):
        self.decoder.embed_tokens = value

    def get_decoder(self):
        return self.decoder

    @add_start_docstrings_to_model_forward(OPT_INPUTS_DOCSTRING)
    @add_code_sample_docstrings(
        checkpoint=_CHECKPOINT_FOR_DOC,
        output_type=BaseModelOutputWithPast,
        config_class=_CONFIG_FOR_DOC,
        expected_output=_EXPECTED_OUTPUT_SHAPE,
    )
    def forward(
            self,
            input_ids: torch.LongTensor = None,
            attention_mask: Optional[torch.Tensor] = None,
            head_mask: Optional[torch.Tensor] = None,
            past_key_values: Optional[List[torch.FloatTensor]] = None,
            inputs_embeds: Optional[torch.FloatTensor] = None,
            use_cache: Optional[bool] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consist of (dec_features, past_key_value, dec_hidden, dec_attn)
        decoder_outputs = self.decoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        if not return_dict:
            return decoder_outputs

        return BaseModelOutputWithPast(
            last_hidden_state=decoder_outputs.last_hidden_state,
            past_key_values=decoder_outputs.past_key_values,
            hidden_states=decoder_outputs.hidden_states,
            attentions=decoder_outputs.attentions,
        )


class OPTForCausalLM(OPTPreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.model = OPTModel(config)

        # the lm_head weight is automatically tied to the embed tokens weight
        self.lm_head = nn.Linear(config.word_embed_proj_dim, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.decoder.embed_tokens

    def set_input_embeddings(self, value):
        self.model.decoder.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        self.model.decoder = decoder

    def get_decoder(self):
        return self.model.decoder

    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
            self,
            input_ids: torch.LongTensor = None,
            attention_mask: Optional[torch.Tensor] = None,
            head_mask: Optional[torch.Tensor] = None,
            past_key_values: Optional[List[torch.FloatTensor]] = None,
            inputs_embeds: Optional[torch.FloatTensor] = None,
            labels: Optional[torch.LongTensor] = None,
            use_cache: Optional[bool] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)
            head_mask (`torch.Tensor` of shape `(num_hidden_layers, num_attention_heads)`, *optional*):
                Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`:

                - 1 indicates the head is **not masked**,
                - 0 indicates the head is **masked**.

            past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
                Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
                shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of
                shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. The two additional
                tensors are only required when the model is used as a decoder in a Sequence to Sequence model.

                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
                cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those
                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of
                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.
            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated vectors
                than the model's internal embedding lookup matrix.
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            output_hidden_states (`bool`, *optional*):
                Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
                for more detail.
            return_dict (`bool`, *optional*):
                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, OPTForCausalLM

        >>> model = OPTForCausalLM.from_pretrained("facebook/opt-350m")
        >>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

        >>> prompt = "Hey, are you consciours? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you consciours? Can you talk to me?\nI'm not consciours, but I can talk to you."
        ```"""

        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model.decoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        logits = self.lm_head(outputs[0]).contiguous()

        loss = None
        if labels is not None:
            # move labels to correct device to enable model parallelism
            labels = labels.to(logits.device)
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1))

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
            self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
    ):
        if past_key_values:
            input_ids = input_ids[:, -1:]

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
        return reordered_past


@add_start_docstrings(
    """
    The OPT Model transformer with a sequence classification head on top (linear layer).

    [`OPTForSequenceClassification`] uses the last token in order to do the classification, as other causal models
    (e.g. GPT-2) do.

    Since it does classification on the last token, it requires to know the position of the last token. If a
    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
    each row of the batch).
    """,
    OPT_START_DOCSTRING,
)
class OPTForSequenceClassification(OPTPreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"lm_head.weight"]

    def __init__(self, config: OPTConfig):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = OPTModel(config)
        self.score = nn.Linear(config.word_embed_proj_dim, self.num_labels, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    @add_start_docstrings_to_model_forward(OPT_INPUTS_DOCSTRING)
    @add_code_sample_docstrings(
        checkpoint=_CHECKPOINT_FOR_SEQUENCE_CLASSIFICATION,
        output_type=SequenceClassifierOutputWithPast,
        config_class=_CONFIG_FOR_DOC,
        expected_output=_SEQ_CLASS_EXPECTED_OUTPUT,
        expected_loss=_SEQ_CLASS_EXPECTED_LOSS,
    )
    def forward(
            self,
            input_ids: Optional[torch.LongTensor] = None,
            attention_mask: Optional[torch.FloatTensor] = None,
            head_mask: Optional[torch.FloatTensor] = None,
            past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
            inputs_embeds: Optional[torch.FloatTensor] = None,
            labels: Optional[torch.LongTensor] = None,
            use_cache: Optional[bool] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size, sequence_length = input_ids.shape[:2]
        else:
            batch_size, sequence_length = inputs_embeds.shape[:2]

        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
            else:
                sequence_lengths = -1
                logger.warning(
                    f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
                    "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
                )

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )

    def get_input_embeddings(self):
        return self.model.decoder.embed_tokens

    def set_input_embeddings(self, value):
        self.model.decoder.embed_tokens = value


@add_start_docstrings(
    """
    The OPT Model transformer with a span classification head on top for extractive question-answering tasks like SQuAD
    (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
    """,
    OPT_START_DOCSTRING,
)
class OPTForQuestionAnswering(OPTPreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"lm_head.weight"]

    def __init__(self, config: OPTConfig):
        super().__init__(config)
        self.model = OPTModel(config)
        self.qa_outputs = nn.Linear(config.word_embed_proj_dim, 2)

        # Initialize weights and apply final processing
        self.post_init()

    @add_start_docstrings_to_model_forward(OPT_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=QuestionAnsweringModelOutput, config_class=_CONFIG_FOR_DOC)
    def forward(
            self,
            input_ids: Optional[torch.LongTensor] = None,
            attention_mask: Optional[torch.FloatTensor] = None,
            head_mask: Optional[torch.FloatTensor] = None,
            past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
            inputs_embeds: Optional[torch.FloatTensor] = None,
            start_positions: Optional[torch.LongTensor] = None,
            end_positions: Optional[torch.LongTensor] = None,
            use_cache: Optional[bool] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
    ) -> Union[Tuple, QuestionAnsweringModelOutput]:
        r"""
        start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the start of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
            are not taken into account for computing the loss.
        end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the end of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
            are not taken into account for computing the loss.

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, OPTForQuestionAnswering
        >>> import torch

        >>> torch.manual_seed(4)  # doctest: +IGNORE_RESULT
        >>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

        >>> # note: we are loading a OPTForQuestionAnswering from the hub here,
        >>> # so the head will be randomly initialized, hence the predictions will be random
        >>> model = OPTForQuestionAnswering.from_pretrained("facebook/opt-350m")

        >>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

        >>> inputs = tokenizer(question, text, return_tensors="pt")
        >>> with torch.no_grad():
        ...     outputs = model(**inputs)

        >>> answer_start_index = outputs.start_logits.argmax()
        >>> answer_end_index = outputs.end_logits.argmax()

        >>> answer_offset = len(tokenizer(question)[0])

        >>> predict_answer_tokens = inputs.input_ids[
        ...     0, answer_offset + answer_start_index : answer_offset + answer_end_index + 1
        ... ]
        >>> predicted = tokenizer.decode(predict_answer_tokens)
        >>> predicted
        ' a nice puppet'
        ```"""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]

        logits = self.qa_outputs(hidden_states)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1).contiguous()
        end_logits = end_logits.squeeze(-1).contiguous()

        total_loss = None
        if start_positions is not None and end_positions is not None:
            # On multi-GPU, splitting the batch adds a dimension; squeeze it away
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions = start_positions.clamp(0, ignored_index)
            end_positions = end_positions.clamp(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        if not return_dict:
            output = (start_logits, end_logits) + transformer_outputs[2:]
            return ((total_loss,) + output) if total_loss is not None else output

        return QuestionAnsweringModelOutput(
            loss=total_loss,
            start_logits=start_logits,
            end_logits=end_logits,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )

    def get_input_embeddings(self):
        return self.model.decoder.embed_tokens

    def set_input_embeddings(self, value):
        self.model.decoder.embed_tokens = value


# Task 1: Implement LoRA in the OPTAttention Class

In this task, you will enhance the `OPTAttention` class by integrating Low-Rank Adaptation (LoRA) into the multi-headed attention mechanism. LoRA leaves the pretrained projection weights unchanged and learns a small low-rank update alongside them, so fine-tuning only has to train a small fraction of the parameters.

#### Requirements:
1. Understand the structure and functionality of the multi-headed attention mechanism as introduced in the "Attention Is All You Need" paper.
2. Gain familiarity with the concept of Low-Rank Adaptation (LoRA) and how it can be applied to neural network layers, specifically within the context of attention mechanisms.

#### Instructions:

1. **LoRA Parameters Initialization**:
    - In the `__init__` method of the `OPTAttention` class, initialize the LoRA parameters `lora_A_k`, `lora_B_k`, `lora_A_v`, and `lora_B_v` right after the standard attention projections (`k_proj`, `v_proj`, `q_proj`, `out_proj`) are created.
    - These parameters introduce low-rank matrices that will modify the standard key and value projections within the attention mechanism. The matrices `lora_A_k` and `lora_A_v` reduce the dimension from the embedding dimension to a smaller rank, while `lora_B_k` and `lora_B_v` project them back to the original space.

2. **Implementing LoRA in Attention Projections**:
    - Within the `forward` method, integrate the LoRA parameters with the attention mechanism. Specifically, modify the `key_states` and `value_states` calculations where indicated by the comments.
    - Apply LoRA adaptation by combining the original key and value states (obtained from the standard projections) with the outcomes of passing `key_value_states` or `hidden_states` through the LoRA matrices.
    - Check that the LoRA-adapted projections have the same shape as the standard projections before adding them, so that the rest of the attention computation is unaffected.

3. **Scaling and Regularization with LoRA**:
    - Apply `lora_scaling` to the LoRA-adapted components before adding them to the standard key and value projections. `lora_scaling` controls the impact of LoRA modifications on the model and is computed as `lora_alpha / rank`, where `lora_alpha` is a predefined scaling factor and `rank` is the inner dimension of the low-rank matrices.
    - Implement `lora_dropout` by applying dropout to the reduced-dimension representations in the LoRA layers. This helps in regularizing the LoRA modifications and preventing overfitting.

4. **Validation**:
    - Validate your implementation by ensuring it passes any provided tests or validation checks. Pay particular attention to the tensor shapes and the logical flow of your LoRA modifications to ensure they align with the expected functionality of the multi-headed attention mechanism.
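
To make the scaling and the size of the low-rank matrices concrete, the short sketch below works through the arithmetic for `rank = 16` and `lora_alpha = 32`, the values used in the code cell of this task. The helper names and the choice `embed_dim = 1024` are assumptions made for illustration only; substitute the embedding dimension of the checkpoint you actually load.

```python
def lora_scaling(lora_alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update: lora_alpha / rank."""
    return lora_alpha / rank


def lora_param_count(embed_dim: int, rank: int) -> int:
    """Weights in one LoRA pair: A is (embed_dim x rank), B is (rank x embed_dim)."""
    return embed_dim * rank + rank * embed_dim


def full_param_count(embed_dim: int) -> int:
    """Weights in one full square projection such as k_proj (bias excluded)."""
    return embed_dim * embed_dim


embed_dim, rank, lora_alpha = 1024, 16, 32  # embed_dim is an illustrative assumption

scaling = lora_scaling(lora_alpha, rank)
lora_params = lora_param_count(embed_dim, rank)
full_params = full_param_count(embed_dim)
fraction = lora_params / full_params

print(f"lora_scaling = {scaling}")
print(f"LoRA weights per projection = {lora_params} of {full_params} ({fraction:.2%})")
```

The fraction shrinks as `embed_dim` grows, because the LoRA pair scales linearly with the embedding dimension while the full projection scales quadratically.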

#### Example Implementation:
Below is an example of how you might modify a section of the code for LoRA integration within the attention projections:

```python
# Original line for context:
key_states = self._shape(self.k_proj(key_value_states), -1, bsz)

# Modified line with LoRA:
key_states = self._shape(
    # The LoRA operation goes here; please refer to the paper for more detail
)
# Please be aware of the comments marked with ###
```
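
Before editing `OPTAttention` itself, it may help to try the idea on a single projection in isolation. The following is a minimal sketch, not the reference solution: `LoRAProjection` is a hypothetical helper class, and the sizes (`embed_dim=64`, a batch of 2 sequences of length 5) are arbitrary. It assumes the usual LoRA convention of a bias-free down-projection `A`, a bias-free up-projection `B` initialized to zero, dropout on the low-rank update path, and a scaling of `lora_alpha / rank`.

```python
import torch
from torch import nn


class LoRAProjection(nn.Module):
    """Hypothetical helper: a frozen linear projection plus a trainable low-rank update."""

    def __init__(self, embed_dim: int, rank: int = 16, lora_alpha: float = 32.0, lora_dropout: float = 0.05):
        super().__init__()
        self.base = nn.Linear(embed_dim, embed_dim)
        self.base.requires_grad_(False)  # stands in for a frozen pretrained projection
        self.lora_A = nn.Linear(embed_dim, rank, bias=False)  # down-projection to the low rank
        self.lora_B = nn.Linear(rank, embed_dim, bias=False)  # up-projection back to embed_dim
        nn.init.zeros_(self.lora_B.weight)  # the update is zero before any training step
        self.lora_scaling = lora_alpha / rank
        self.lora_dropout = nn.Dropout(p=lora_dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base projection + scaled low-rank update B(dropout(A(x)))
        update = self.lora_B(self.lora_dropout(self.lora_A(x)))
        return self.base(x) + update * self.lora_scaling


torch.manual_seed(0)
proj = LoRAProjection(embed_dim=64)
x = torch.randn(2, 5, 64)  # (batch, sequence, embed_dim)
out = proj(x)

trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
print(out.shape, trainable)
```

Two checks are worth carrying over to your own implementation: the adapted output must have the same shape as the base projection, and with `lora_B` at zero it must equal the base projection exactly.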



class OPTAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(
            self,
            embed_dim: int,
            num_heads: int,
            dropout: float = 0.0,
            is_decoder: bool = False,
            bias: bool = True,
    ):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.dropout = dropout
        self.head_dim = embed_dim // num_heads

        if (self.head_dim * num_heads) != self.embed_dim:
            raise ValueError(
                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}"
                f" and `num_heads`: {num_heads})."
            )
        self.scaling = self.head_dim ** -0.5
        self.is_decoder = is_decoder

        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)

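        ### Your code here:
        rank = 16
        lora_alpha = 32
        lora_dropout = 0.05
        # lora_A_* project from embed_dim down to the low rank; lora_B_* project back up to embed_dim.
        # The low-rank matrices carry no bias, following the LoRA paper.
        self.lora_A_k = nn.Linear(embed_dim, rank, bias=False)
        self.lora_B_k = nn.Linear(rank, embed_dim, bias=False)
        self.lora_A_v = nn.Linear(embed_dim, rank, bias=False)
        self.lora_B_v = nn.Linear(rank, embed_dim, bias=False)
        # B starts at zero so the LoRA update is initially a no-op (as in the LoRA paper);
        # a later weight-init pass such as post_init() may overwrite this.
        nn.init.zeros_(self.lora_B_k.weight)
        nn.init.zeros_(self.lora_B_v.weight)
        self.lora_scaling = lora_alpha / rank
        self.lora_dropout = nn.Dropout(p=lora_dropout)
        ### End of code writing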


    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()

    def forward(
            self,
            hidden_states: torch.Tensor,
            key_value_states: Optional[torch.Tensor] = None,
            past_key_value: Optional[Tuple[torch.Tensor]] = None,
            attention_mask: Optional[torch.Tensor] = None,
            layer_head_mask: Optional[torch.Tensor] = None,
            output_attentions: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        """Input shape: Batch x Time x Channel"""

        # if key_value_states are provided this layer is used as a cross-attention layer
        # for the decoder
        is_cross_attention = key_value_states is not None

        bsz, tgt_len, _ = hidden_states.size()

        # get query proj
        query_states = self.q_proj(hidden_states) * self.scaling
        # get key, value proj
        if is_cross_attention and past_key_value is not None:
            # reuse k,v, cross_attentions
            key_states = past_key_value[0]
            value_states = past_key_value[1]
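        elif is_cross_attention:
            # cross_attentions
            ### Your code here: add lora to key value state
            lora_k = self.lora_B_k(self.lora_dropout(self.lora_A_k(key_value_states))) * self.lora_scaling
            lora_v = self.lora_B_v(self.lora_dropout(self.lora_A_v(key_value_states))) * self.lora_scaling
            key_states = self._shape(self.k_proj(key_value_states) + lora_k, -1, bsz)
            value_states = self._shape(self.v_proj(key_value_states) + lora_v, -1, bsz)
        elif past_key_value is not None:
            # reuse k, v, self_attention
            ### Your code here: add lora to key value state
            lora_k = self.lora_B_k(self.lora_dropout(self.lora_A_k(hidden_states))) * self.lora_scaling
            lora_v = self.lora_B_v(self.lora_dropout(self.lora_A_v(hidden_states))) * self.lora_scaling
            key_states = self._shape(self.k_proj(hidden_states) + lora_k, -1, bsz)
            value_states = self._shape(self.v_proj(hidden_states) + lora_v, -1, bsz)
            key_states = torch.cat([past_key_value[0], key_states], dim=2)
            value_states = torch.cat([past_key_value[1], value_states], dim=2)
        else:
            # self_attention
            ### Your code here: add lora to key value state
            lora_k = self.lora_B_k(self.lora_dropout(self.lora_A_k(hidden_states))) * self.lora_scaling
            lora_v = self.lora_B_v(self.lora_dropout(self.lora_A_v(hidden_states))) * self.lora_scaling
            key_states = self._shape(self.k_proj(hidden_states) + lora_k, -1, bsz)
            value_states = self._shape(self.v_proj(hidden_states) + lora_v, -1, bsz)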

        if self.is_decoder:
            # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
            # Further calls to cross_attention layer can then reuse all cross-attention
            # key/value_states (first "if" case)
            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
            # all previous decoder key/value_states. Further calls to uni-directional self-attention
            # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
            # if encoder bi-directional self-attention `past_key_value` is always `None`
            past_key_value = (key_states, value_states)

        proj_shape = (bsz * self.num_heads, -1, self.head_dim)
        query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
        key_states = key_states.view(*proj_shape)
        value_states = value_states.view(*proj_shape)

        src_len = key_states.size(1)
        attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))

        if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
                f" {attn_weights.size()}"
            )

        if attention_mask is not None:
            if attention_mask.size() != (bsz, 1, tgt_len, src_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
                )
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
            attn_weights = torch.max(
                attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
            )
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        # upcast to fp32 if the weights are in fp16. Please see https://github.com/huggingface/transformers/pull/17437
        if attn_weights.dtype == torch.float16:
            attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16)
        else:
            attn_weights = nn.functional.softmax(attn_weights, dim=-1)

        if layer_head_mask is not None:
            if layer_head_mask.size() != (self.num_heads,):
                raise ValueError(
                    f"Head mask for a single layer should be of size {(self.num_heads,)}, but is"
                    f" {layer_head_mask.size()}"
                )
            attn_weights = layer_head_mask.view(1, -1, 1, 1) * attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        if output_attentions:
            # this operation is a bit awkward, but it's required to
            # make sure that attn_weights keeps its gradient.
            # In order to do so, attn_weights have to be reshaped
            # twice and have to be reused in the following
            attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
        else:
            attn_weights_reshaped = None

        attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)

        attn_output = torch.bmm(attn_probs, value_states)

        if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is"
                f" {attn_output.size()}"
            )

        attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
        attn_output = attn_output.transpose(1, 2)

        # Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be
        # partitioned across GPUs when using tensor-parallelism.
        attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim)

        attn_output = self.out_proj(attn_output)

        return attn_output, attn_weights_reshaped, past_key_value

# Initialize the model
- In this tutorial, we use OPT-2.7B as the base model

In [6]:
model_name = 'facebook/opt-2.7b'
model = OPTForCausalLM.from_pretrained(model_name)


# Task2: Post-Initialization and Freezing Model Weights

In this section, you will perform two critical steps necessary for preparing your model for LoRA-based fine-tuning: post-initialization of LoRA parameters and freezing the base model weights. Follow the instructions below to understand and implement these steps in your code.

#### Post-Initialization of LoRA Parameters:
After integrating LoRA into your model, it is crucial to initialize the parameters correctly:

1. **Kaiming Uniform Initialization for LoRA_A**:
   - Use the `nn.init.kaiming_uniform_` function to initialize parameters associated with `lora_A`.
   - Kaiming initialization scales the sampling range by the layer's fan-in so that the variance of activations stays roughly constant from layer to layer; it was designed for layers followed by ReLU-style activations.
   - The parameter `a` is the negative slope of the (leaky) rectifier assumed when computing the gain. PyTorch's default `nn.Linear` initialization uses `a=math.sqrt(5)`, which is a common choice here as well.

2. **Zero Initialization for LoRA_B**:
   - Initialize the `lora_B` parameters with zeros using `nn.init.zeros_`.
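   - Because the LoRA update is the product of `lora_B` and `lora_A`, a zero `lora_B` makes the update exactly zero at the start of training. The adapted model therefore initially behaves identically to the pre-trained model, and the update is learned from that neutral state.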
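
As a minimal sketch of the two initialization steps, the example below uses a toy module rather than the lab's modified OPT attention. `ToyLoRALinear` and `init_lora` are hypothetical names, and `a=math.sqrt(5)` is an assumed value that mirrors PyTorch's default `nn.Linear` initialization. The final check illustrates the effect of zero-initializing `lora_B`: the layer's output matches the base projection until training changes `lora_B`.

```python
import math

import torch
import torch.nn as nn


class ToyLoRALinear(nn.Module):
    """Hypothetical stand-in for a LoRA-augmented projection (not the lab's OPT code)."""

    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        # Allocated uninitialized on purpose; init_lora fills them in.
        self.lora_A = nn.Parameter(torch.empty(rank, in_features))
        self.lora_B = nn.Parameter(torch.empty(out_features, rank))

    def forward(self, x):
        # Base projection plus the low-rank update x @ A^T @ B^T.
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T


def init_lora(model):
    """Kaiming-uniform init for lora_A parameters, zeros for lora_B parameters."""
    for name, param in model.named_parameters():
        if "lora_A" in name:
            nn.init.kaiming_uniform_(param, a=math.sqrt(5))  # assumed value of `a`
        elif "lora_B" in name:
            nn.init.zeros_(param)


torch.manual_seed(0)
layer = ToyLoRALinear(in_features=16, out_features=16, rank=4)
init_lora(layer)

x = torch.randn(2, 16)
unchanged = torch.allclose(layer(x), layer.base(x))
print("output equals base projection at init:", unchanged)
```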

#### Freezing Base Model Weights:
To focus the learning process on the LoRA parameters while preserving the pre-trained knowledge in other parts of the model, you need to freeze the base model weights:

1. **Set `requires_grad` to False for Non-LoRA Parameters**:
   - Iterate through all named parameters in the model. If a parameter's name does not contain 'lora', set its `requires_grad` attribute to `False`.
   - This action prevents the standard backpropagation updates from modifying these parameters, effectively freezing them during training.
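
The name-based freezing rule can be sketched on a toy module as follows. `ToyBlock` and `freeze_non_lora` are hypothetical names used only for illustration; the lab's model has many more parameters, but the filter on the substring `'lora'` works the same way.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Hypothetical module with one base projection and one pair of LoRA matrices."""

    def __init__(self, dim=8, rank=2):
        super().__init__()
        self.k_proj = nn.Linear(dim, dim)
        self.lora_A = nn.Parameter(torch.zeros(rank, dim))
        self.lora_B = nn.Parameter(torch.zeros(dim, rank))


def freeze_non_lora(model):
    """Disable gradients for every parameter whose name does not contain 'lora'."""
    for name, param in model.named_parameters():
        if "lora" not in name:
            param.requires_grad = False


block = ToyBlock()
freeze_non_lora(block)

trainable = sorted(name for name, p in block.named_parameters() if p.requires_grad)
frozen = sorted(name for name, p in block.named_parameters() if not p.requires_grad)
print("trainable:", trainable)
print("frozen:", frozen)
```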


In [None]:
### Write your code following the instructions
# do post initialization on lora A and lora B


# freeze base model weight


In [None]:
# enable gradient checkpoint for lower gpu memory usage
model.gradient_checkpointing_enable()
# enable input gradient for lora training
model.enable_input_require_grads()

# Monitoring Model Parameters

In machine learning and deep learning, understanding the structure and capacity of your model is crucial. One key aspect of this is knowing the number of parameters that are being trained. This information can help in diagnosing model complexity, memory usage, and potential overfitting or underfitting scenarios.

#### Function Overview: `print_trainable_parameters`
This function is designed to provide a clear overview of the parameters within a given model. Specifically, it calculates and prints the total number of parameters, the number of trainable parameters, and the percentage of parameters that are trainable. Here's a breakdown of its functionalities:

1. **Total Parameters Count (`all_param`)**: This represents the total number of parameters in the model, including both trainable and non-trainable (frozen) parameters.

2. **Trainable Parameters Count (`trainable_params`)**: This is the subset of the total parameters that will be updated during training. Parameters that are not trainable have been frozen and will retain their values during the training process.

3. **Trainable Percentage**: This metric provides insight into how much of the model is being actively trained. A lower percentage might indicate that a large portion of the model is frozen, which could be intentional in scenarios like transfer learning or fine-tuning.

The function iterates through all parameters in the model, counting the total and trainable parameters, and then prints out these values along with the percentage of parameters that are trainable. This utility can be particularly useful when you are experimenting with different architectures or when fine-tuning pre-trained models.

#### Usage:
Simply call `print_trainable_parameters(model)` after defining your model. This will output the parameter statistics, giving you a better understanding of your model's capacity and training scope.

Example output:
```plaintext
trainable params: 1024 || all params: 2048 || trainable%: 50.0
```


In [None]:

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
    return {"trainable": trainable_params, "all": all_param, "trainable%": 100 * trainable_params / all_param}



# Trainable Parameter Check
- Just to check how many parameters are trainable in the model.
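
For a rough sense of what to expect, the LoRA parameter count can be estimated by hand: each adapted projection adds a `rank x hidden_size` matrix (`lora_A`) and a `hidden_size x rank` matrix (`lora_B`). The sketch below assumes that only the key and value projections are adapted. It also uses assumed values for OPT-2.7B (hidden size 2560, 32 decoder layers) and a hypothetical rank of 8, so check `model.config` and your own LoRA settings before relying on the number.

```python
def lora_param_count(hidden_size, num_layers, rank, num_adapted_projections=2):
    """Estimate trainable LoRA parameters for square projections of size hidden_size."""
    per_projection = rank * hidden_size + hidden_size * rank  # lora_A + lora_B
    return num_layers * num_adapted_projections * per_projection


# Assumed values, not read from the lab's model: verify against model.config.
estimate = lora_param_count(hidden_size=2560, num_layers=32, rank=8)
print(f"estimated LoRA params: {estimate:,}")
```

If `print_trainable_parameters(model)` reports a much larger trainable count than such an estimate, some base weights are probably still unfrozen.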

In [None]:
print_trainable_parameters(model)

# Task3: Implementing Custom Dataset for NLP

In this task, you will create a custom dataset class, `HGDataset`, using PyTorch for a natural language processing (NLP) task. This custom dataset will facilitate data handling for training and testing NLP models. Follow the instructions below to complete the missing parts of the `HGDataset` class.

#### Objectives:
1. Learn to handle NLP datasets using PyTorch's Dataset and DataLoader.
2. Implement custom processing and loading methods for NLP data.

#### Task Instructions:

1. **Initialize the Dataset**:
   - In the `__init__` method of the `HGDataset` class, complete the code to extract questions and answers from the provided dataset.
   - Assign the questions to `self.input_x` and the answers to `self.target`.
   - Ensure that each question corresponds to its respective answer by asserting that `input_x` and `target` are of the same length.
   - Store the dataset split (e.g., 'train', 'test') in `self.split` for future reference.

2. **Implement the `__getitem__` Method**:
   - This method should return a single instance of your data. Implement the `__getitem__` method to return a dictionary containing the input question, the target answer, and the split information for a given index `idx`.
   - Ensure the method correctly maps an index to the corresponding question and answer in the dataset.

3. **Implement the `__len__` Method**:
   - This method should return the total number of items in the dataset. Implement the `__len__` method to return the size of `self.input_x`, which corresponds to the total number of questions (and answers).
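
The three methods above can be sketched on a toy in-memory table. This is one possible shape, not the reference solution: `ToyQADataset` is a hypothetical name, and the column names `question` and `answer` are assumptions about the layout of `nq_open` that you should confirm by inspecting `dataset['train'].column_names`.

```python
from torch.utils.data import Dataset


class ToyQADataset(Dataset):
    """Hypothetical map-style dataset over a dict of columns."""

    def __init__(self, dataset, split):
        # Slicing a Hugging Face dataset (e.g. dataset['train'][:1000]) yields a dict of lists.
        self.input_x = dataset["question"]  # assumed column name
        self.target = dataset["answer"]     # assumed column name
        assert len(self.input_x) == len(self.target), "each question needs an answer"
        self.split = split

    def __getitem__(self, idx):
        return {
            "input": self.input_x[idx],
            "target": self.target[idx],
            "split": self.split,
        }

    def __len__(self):
        return len(self.input_x)


# Toy rows standing in for the real data.
toy_rows = {
    "question": ["who wrote hamlet", "capital of france"],
    "answer": [["william shakespeare"], ["paris"]],
}
toy_dataset = ToyQADataset(toy_rows, "train")
print(len(toy_dataset), toy_dataset[1])
```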

#### Usage:
After completing the `HGDataset` class, you can create instances of the dataset for the training and validation splits:

```python
dataset_name = 'nq_open'  # This is the name of the dataset we are using

# Load the raw dataset
dataset = load_dataset(dataset_name)

# Create instances of HGDataset for training and validation
train_dataset = HGDataset(dataset['train'][:10000], 'train')
test_dataset = HGDataset(dataset['validation'][:], 'test')
```


In [None]:

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader


class HGDataset(torch.utils.data.Dataset):
    # longest first for batch finder
    def __init__(self, dataset, split):
        ### Your code here

    def __getitem__(self, idx):
        ### Your code here

    def __len__(self):
        ### Your code here


dataset_name = 'nq_open'

dataset = load_dataset(dataset_name)

# Adjust the dataset size to your preference.
# The full training split has about 87.9K examples; the validation split has about 3.61K.
train_dataset = HGDataset(dataset['train'][:1000], 'train')
test_dataset = HGDataset(dataset['validation'][:100], 'test')

# Task4: Implementing Tokenization and Data Collation

In this task, you will set up tokenization parameters and implement a custom data collator function for an NLP model. This is essential for preparing text data properly before feeding it into a neural network for training or inference. Follow the instructions below to configure tokenization and complete the missing parts of the `data_collator_customized` function.

#### Objectives:
1. Configure tokenizer parameters for handling different aspects of text processing.
2. Implement custom logic for batching and preparing NLP data.


#### Task Instructions:

1. **Implement Custom Data Collator**:
   - In the `data_collator_customized` function, complete the code to transform a list of feature dictionaries into a unified batch for model processing.
   - Collect batched features by accumulating each key's values from the feature dictionaries.
   - Determine the batch's split type (training or inference) based on the 'split' field from the input features.
   - Prepare the text for tokenization by concatenating questions and answers, adding special tokens as necessary. Use different formatting based on whether the batch is for training or inference.
   - Tokenize the concatenated text with the appropriate tokenizer. A suggested choice is `rtokenizer` (right-padded) for the concatenated question-and-answer text and `tokenizer` (left-padded) for inference, but you may choose differently.
   - Build the final batch dictionary, which must contain the `input_ids`, `attention_mask`, and `labels` keys.
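
The non-tokenizer parts of these steps can be sketched in plain Python. The helper names (`group_features`, `build_texts`, `mask_padding`) and the `Question: ... Answer:` prompt template are assumptions made for illustration. Plain lists stand in for tensors, and the actual tokenizer call is left out so that the snippet runs without downloading a model.

```python
def group_features(features):
    """Turn a list of per-example dicts into one dict of lists."""
    grouped = {}
    for feature in features:
        for key, value in feature.items():
            grouped.setdefault(key, []).append(value)
    return grouped


def build_texts(grouped, eos_token="</s>"):
    """Format prompts: training text includes the answer, inference text stops at 'Answer:'."""
    split = grouped["split"][0]  # assumes a batch never mixes splits
    texts = []
    for question, target in zip(grouped["input"], grouped["target"]):
        prompt = f"Question: {question} Answer:"  # assumed template
        if split == "train":
            # If the target is a list of acceptable answers, use the first one (assumption).
            answer = target[0] if isinstance(target, (list, tuple)) else target
            prompt = f"{prompt} {answer}{eos_token}"
        texts.append(prompt)
    return split, texts


def mask_padding(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


features = [
    {"input": "who wrote hamlet", "target": ["william shakespeare"], "split": "train"},
    {"input": "capital of france", "target": ["paris"], "split": "train"},
]
split, texts = build_texts(group_features(features))
labels = mask_padding([[5, 6, 1, 1]], [[1, 1, 0, 0]])
print(split, texts, labels)
```

Setting padded label positions to -100 keeps them out of the loss, since that value is the default ignore index of PyTorch's cross-entropy loss.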

#### Usage:
After completing the setup and implementation, your custom data collator will be used by the trainer to prepare data batches.


In [None]:
import transformers

MAX_TOKEN_LENGTH = 128
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'left'
tokenizer.truncation_side = 'left'

rtokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
rtokenizer.padding_side = 'right'
rtokenizer.truncation_side = 'right'

def data_collator_customized(features, return_tensors="pt"):
    batch = {}
    ### Your code here
    ### End of code writing
    return batch

# Setting Up the Seq2Seq Trainer

The code block below initializes and configures a `Seq2SeqTrainer` from Hugging Face's Transformers library. This trainer is designed for sequence-to-sequence models, such as those used in translation, summarization, and other NLP tasks where both the input and output are sequences of tokens. It is used here with OPT, a decoder-only causal language model, because it can generate text during evaluation (see `predict_with_generate` below).

#### Trainer Configuration:
- `model`: The model to train; in this lab, the LoRA-adapted OPT model.
- `train_dataset`: The dataset used for training the model.
- `args`: A set of training arguments configuring how the model should be trained. These arguments include:
    - `per_device_train_batch_size`: The batch size per device during training.
    - `per_device_eval_batch_size`: The batch size per device during evaluation.
    - `gradient_accumulation_steps`: The number of steps to accumulate gradients before performing a backward/update pass.
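    - `warmup_steps`: The number of steps over which the learning rate is warmed up linearly from 0 to `learning_rate`.
    - `num_train_epochs`: The total number of training epochs.
    - `learning_rate`: The initial learning rate for the optimizer (AdamW by default).
    - `bf16`: Enables training using bfloat16 precision on supported GPUs, which can improve performance and reduce memory usage.
    - `fp16`: Enables 16-bit floating-point mixed-precision training; the code below uses this instead of `bf16`.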
    - `logging_steps`: How often to log training information.
    - `report_to`: Determines the integration backends where the logs should be reported (here, logging is disabled with 'none').
    - `remove_unused_columns`: Indicates whether columns not used by the model forward pass should be removed.
    - `output_dir`: The directory where the training outputs should be saved.
    - `generation_config`: Specific generation settings for the evaluation phase, such as `max_length` and `num_beams`.
    - `predict_with_generate`: Determines whether predictions should be made by generating text during evaluation.

- `data_collator`: Custom function to prepare the data batches for training and evaluation.

#### Training Process:
- `trainer.train()`: This method starts the training process based on the configurations provided.

The trainer automates many tasks associated with training a sequence-to-sequence model, including data loading, batch creation, model optimization, and logging, allowing you to focus on model architecture and data rather than boilerplate training code.

---

This setup is typical for training state-of-the-art NLP models and provides a flexible, high-level interface for custom sequence-to-sequence tasks.


In [None]:
batch_size = 2
trainer = transformers.Seq2SeqTrainer(
    model=model,
    train_dataset=train_dataset,
    args=transformers.Seq2SeqTrainingArguments(
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=1,
        warmup_steps=0,
        num_train_epochs=1.0,
        learning_rate=0.0001,
        bf16=False,  # set to True (and fp16 to False) if your GPU supports bfloat16
        fp16=True,  # bf16 is disabled above, so use FP16 mixed precision instead
        logging_steps=1,
        report_to=['none'],
        remove_unused_columns=False,
        output_dir='model_output',
        generation_config=transformers.GenerationConfig(
            max_length=5,
            num_beams=1,
        ),
        predict_with_generate=True,
    ),
    data_collator=data_collator_customized
)

trainer.train()

# Processing Evaluation Results

The provided code block is part of an evaluation process, typically after a model has been used to generate predictions on a dataset. This section explains the purpose of each line in the code:

1. `logits = eval_result.predictions`:
    - This line retrieves the model's predictions from the evaluation results. Despite the variable name, these are not raw pre-softmax scores: because the trainer was configured with `predict_with_generate=True`, the predictions should be generated token IDs, one row per test example.

2. `logits[logits == -100] = tokenizer.eos_token_id`:
    - The value -100 is conventionally used in NLP pipelines to mark positions that should be ignored, for example padded tokens in the loss calculation. It can also appear here as padding when generated sequences of different lengths are gathered into one array. It is not a valid token ID, so it cannot be decoded.
    - This line replaces every -100 with the end-of-sequence (EOS) token ID from the tokenizer. The EOS token marks the end of a text sequence, so after the replacement each row decodes cleanly and ends at an EOS marker instead of at placeholder values.

3. `text_result = []`:
    - This line initializes an empty list, `text_result`, which will be used to store the final generated text sequences after converting the logits (or token IDs) back into human-readable text.

---

This code is typically part of a larger process where the model's predictions (logits) are converted into text sequences. The logits are first cleaned or adjusted as necessary (e.g., replacing placeholder values with EOS token IDs), and then each sequence of logits is decoded into text, usually involving additional steps not shown here.
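
The replacement step can be illustrated on a toy array, assuming the predictions are token IDs padded with -100. The IDs below are made up, and `EOS_TOKEN_ID = 2` is a placeholder; in the lab, use `tokenizer.eos_token_id`.

```python
import numpy as np

EOS_TOKEN_ID = 2  # placeholder value; use tokenizer.eos_token_id in the lab

# Two made-up generated sequences; the second is shorter and padded with -100.
predictions = np.array([
    [2, 45, 67, 89, 2],
    [2, 12, 34, -100, -100],
])

cleaned = predictions.copy()              # keep the original array untouched
cleaned[cleaned == -100] = EOS_TOKEN_ID   # boolean-mask assignment replaces every -100
print(cleaned)
```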


In [None]:
eval_result = trainer.predict(test_dataset, max_new_tokens=5)
logits = eval_result.predictions
logits[logits == -100] = tokenizer.eos_token_id
text_result = []

# Task5: Decoding Model Predictions

In this task, you will write code to decode the raw predictions from your NLP model and process them to extract meaningful output. You will work with the `logits` array produced during evaluation, which holds generated token IDs, and turn it into human-readable text. Follow the instructions below to complete the missing parts of the code.

#### Objectives:
1. Decode raw model predictions into text.
2. Process the decoded text to extract and clean the answers.

#### Task Instructions:

1. **Decode the Logits**:
   - Use the `tokenizer.batch_decode` method to convert the `logits` obtained from model predictions into raw text sequences. Assign the result to `raw_text_result`.
   - This method converts token IDs back into strings, making them human-readable.

2. **Extract Context for Evaluation**:
   - Retrieve the original questions (context) from `test_dataset` for reference or comparison. Store these in a list called `context` by iterating over the dataset and accessing the 'input' field for each item.

3. **Process Decoded Text**:
   - Initialize an empty list `text_result` to store the processed answers.
   - Iterate over each decoded text sequence in `raw_text_result`:
       - Remove any padding tokens from the text using the `replace` method.
       - Find the index of the keyword 'Answer:' which separates the question from the answer in the decoded text.
       - Extract the answer text by slicing the string from just after 'Answer:' to the end of the text.
       - Check if the EOS (end-of-sequence) token exists in the answer and truncate the answer at this token to remove any trailing text.
       - Append the cleaned answer to `text_result`.

4. **Extract Ground Truth for Comparison**:
   - Similar to extracting the context, retrieve the ground truth answers from `test_dataset` by accessing the 'target' field for each item. Store these in a list called `ground_truth`.

5. **Output Results**:
   - Print the first 10 entries of `text_result` to verify the decoding and processing steps.
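
The string-processing part of step 3 can be sketched on already-decoded text. `extract_answer` is a hypothetical helper, and the `<pad>` and `</s>` strings are assumed token spellings; in the lab, read them from `tokenizer.pad_token` and `tokenizer.eos_token`. The final `strip()` is an extra tidy-up step that the instructions do not require.

```python
def extract_answer(decoded, pad_token="<pad>", eos_token="</s>", marker="Answer:"):
    """Return the text after `marker`, cut at the first EOS token."""
    text = decoded.replace(pad_token, "")           # drop padding tokens
    marker_pos = text.find(marker)
    if marker_pos != -1:
        text = text[marker_pos + len(marker):]      # keep only what follows 'Answer:'
    eos_pos = text.find(eos_token)
    if eos_pos != -1:
        text = text[:eos_pos]                       # truncate trailing text after EOS
    return text.strip()


raw_text_example = [
    "<pad><pad></s>Question: who wrote hamlet Answer: william shakespeare</s> extra",
    "</s>Question: capital of france Answer: paris",
]
answers = [extract_answer(text) for text in raw_text_example]
print(answers)
```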

#### Usage:
After completing the above steps, your code will effectively process the raw predictions from your NLP model into a clean, human-readable format, suitable for evaluation against the ground truth answers.

```python
# Place your completed code here based on the above instructions
```


In [None]:
### Your code here

### End of Your code
ground_truth = [test_dataset[i]['target'] for i in range(len(test_dataset))]
for question, pred, gt in list(zip(context, text_result, ground_truth))[:10]:
    print(f"""
    Question: {question}
    Prediction: {pred}
    Ground Truth: {gt}
    """)

# Calculating Evaluation Metrics

The provided code block is involved in evaluating the performance of an NLP model, specifically focusing on text generation tasks such as translation, summarization, or question-answering. It calculates two common metrics used in NLP: BLEU and ROUGE. Here is what each part of the code is responsible for:

1. `bleu_score = bleu_scorer.compute(predictions=text_result, references=ground_truth)`:
    - This line computes the BLEU (Bilingual Evaluation Understudy) score for the model's predictions against the ground truth answers. BLEU measures how closely machine-generated text matches one or more reference texts by computing modified n-gram precision (how many n-grams in the generated text also occur in a reference) and applying a brevity penalty to overly short outputs. A higher BLEU score indicates closer agreement with the references. The code below reports the unigram precision (`precisions[0]`) as BLEU1.

2. `rouge_score = rouge_scorer.compute(predictions=text_result, references=ground_truth)`:
    - This line computes the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score. ROUGE is used primarily to evaluate text summarization quality but can also be applied to other generation tasks. Unlike BLEU, which is precision-oriented, ROUGE focuses on recall — it measures the amount of overlap (in terms of n-grams, word sequences, and word pairs) between the generated text and the reference texts. There are different variations of ROUGE, each focusing on different aspects of the texts' overlap. A higher ROUGE score indicates that more elements of the reference texts were captured in the generated text.

3. `print(bleu_score)`, `print(rouge_score)`:
    - These lines output the calculated BLEU and ROUGE scores, respectively. These scores give an indication of how well the generated text matches the expected output according to different metrics of textual similarity and quality.

---

By using these metrics, developers and researchers can quantitatively assess the performance of their NLP models in tasks like translation, summarization, or question answering. It helps in understanding the model's capabilities and areas for improvement.
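
Exact Match, listed among the lab objectives, is not computed by the cell below. A minimal sketch of one common formulation follows, assuming each reference is a list of acceptable answers. The normalization (lowercasing, removing punctuation and English articles) is a design choice borrowed from common QA evaluation practice, and the function names are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """True if the normalized prediction equals any normalized reference."""
    if isinstance(references, str):
        references = [references]
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def exact_match_score(predictions, references_list):
    """Percentage of predictions that exactly match one of their references."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references_list))
    return 100.0 * hits / len(predictions)


toy_predictions = ["The Beatles", "paris", "1999"]
toy_references = [["beatles"], ["Paris, France", "Paris"], ["2000"]]
em = exact_match_score(toy_predictions, toy_references)
print(f"Exact Match: {em:.2f}")
```

In the lab, the same function could be called with `text_result` and `ground_truth` in place of the toy lists.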


In [None]:
bleu_score = bleu_scorer.compute(predictions=text_result, references=ground_truth)
rouge_score = rouge_scorer.compute(predictions=text_result, references=ground_truth)
print('BLEU1:', bleu_score['precisions'][0]*100)
print(f"""
ROUGE-1: {rouge_score['rouge1']*100}
ROUGE-2: {rouge_score['rouge2']*100}
ROUGE-L: {rouge_score['rougeL']*100}
""")

# Task6: Improve task performance & play with the model
- From the last implementation, you can easily find the
- You can improve task performance by trying different models.
- You can also experiment with the model: ask it questions of your own and inspect what it generates.