# Setup

In [None]:
!pip install -qU bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score


This cell installs the Python libraries needed to load and fine-tune large language models (LLMs). Let's break it down:

```bash
!pip install -qU bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score
```

Here's what each part does:

- `!pip`: The leading `!` is the Jupyter/IPython shell escape; it runs the rest of the line as a shell command instead of Python code. Here it invokes pip, the package installer for Python.

- `install`: This is the command that tells pip to download and install the specified packages.

- `-qU`: These are flags that modify how pip installs the packages.
  - `-q` stands for "quiet". It reduces pip's output, hiding most of the per-package detail so the notebook cell stays less cluttered.
  - `-U` stands for "upgrade". It tells pip to upgrade the listed packages to the newest available version if they are already installed.

- The list of package names (`bitsandbytes`, `transformers`, etc.): These are the actual libraries that are being installed. Each library serves a specific purpose:
  - `bitsandbytes`: A library that provides 8-bit and 4-bit quantization for PyTorch models, along with 8-bit optimizers. It is what makes the 4-bit model loading later in this notebook possible.
  - `transformers`: This is one of the most important libraries for natural language processing (NLP) tasks, especially when working with large language models. It provides pre-trained models and a simple interface to use them.
  - `peft`: Stands for "Parameter-Efficient Fine-Tuning", which is a method used to fine-tune pre-trained language models efficiently by modifying only a small portion of the model's parameters.
  - `accelerate`: A library designed to simplify distributed training and mixed precision training, making it easier to scale your machine learning experiments.
  - `datasets`: Provides an efficient way to load and manipulate datasets, which is crucial for training and testing machine learning models.
  - `scipy`: The SciPy library is used for scientific computing in Python. It provides functions for scientific and engineering applications, including signal processing, linear algebra, and statistics.
  - `einops`: A library that simplifies tensor operations by providing an expressive syntax to manipulate tensors in a more readable way.
  - `evaluate`: A library that helps you evaluate the performance of your machine learning models with various metrics.
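  - `trl`: Stands for "Transformer Reinforcement Learning". It provides trainers for post-training transformer language models, including reinforcement learning methods and the supervised fine-tuning trainer (`SFTTrainer`) imported later in this notebook.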
  - `rouge_score`: This library calculates ROUGE scores, which are metrics used to evaluate the quality of text summarization and generation tasks.

In summary, this command installs a set of libraries that are commonly used in natural language processing, especially when working with large language models. These libraries can help you load data, train models efficiently, and evaluate their performance using various metrics.
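Because `-U` pulls whatever is newest at run time, it can help to record which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this illustration.

```python
from importlib import metadata

def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

packages = ["bitsandbytes", "transformers", "peft", "accelerate", "datasets", "trl"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```

Saving this output next to the results makes a run easier to reproduce later.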

In [None]:
# Core ML/DL Libraries
import torch
import numpy as np
import pandas as pd

# Hugging Face Transformers and Related Libraries
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    GenerationConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    set_seed
)
from datasets import load_dataset
from peft import (
    LoraConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)
from trl import SFTTrainer

import evaluate

# Hugging Face Authentication and Platform
from huggingface_hub import interpreter_login, login

# System and Utilities
import os
import time
from functools import partial
from tqdm import tqdm
from google.colab import userdata
import warnings


warnings.simplefilter(action='ignore', category=FutureWarning)

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# disable Weights and Biases
os.environ['WANDB_DISABLED']="true"

This code snippet imports the libraries used to fine-tune large language models (LLMs) with the Hugging Face ecosystem. Let's break it down into sections:

### Core ML/DL Libraries
```python
import torch
import numpy as np
import pandas as pd
```
These lines import fundamental libraries used in machine learning and deep learning:
- `torch`: A popular library for building and training neural networks.
- `numpy` (`np`): A library for efficient numerical computation, often used in conjunction with `torch`.
- `pandas` (`pd`): A library for data manipulation and analysis.

### Hugging Face Transformers and Related Libraries
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    GenerationConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    set_seed
)
from datasets import load_dataset
from peft import (
    LoraConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)
from trl import SFTTrainer
```
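These lines import libraries and modules from the Hugging Face ecosystem:
- `transformers`: A library providing pre-trained models, tokenizers, and other utilities for NLP tasks.
  - `AutoModelForCausalLM` and `AutoTokenizer` load a pre-trained model and its tokenizer; `BitsAndBytesConfig` configures quantized loading; `TrainingArguments`, `Trainer`, and `DataCollatorForLanguageModeling` configure and run training; `GenerationConfig`, `HfArgumentParser`, and `set_seed` cover generation settings, argument parsing, and seeding.
- `datasets`: A library for loading and manipulating datasets.
- `peft`: A library for parameter-efficient fine-tuning of pre-trained language models.
  - `LoraConfig` and `get_peft_model` configure and attach LoRA adapters, `prepare_model_for_kbit_training` prepares a quantized model for training, and `PeftModel` wraps a model with trained adapters.
- `trl`: Provides `SFTTrainer`, a trainer for supervised fine-tuning.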

### Hugging Face Authentication and Platform
```python
from huggingface_hub import interpreter_login, login
```
These lines import functions for authenticating with the Hugging Face hub:
- `interpreter_login` and `login`: Functions used to log in to the Hugging Face hub. Logging in is needed for gated or private repositories and for uploading; public models and datasets can be downloaded without it.

### System and Utilities
```python
import os
import time
from functools import partial
from tqdm import tqdm
from google.colab import userdata
import warnings
```
These lines import libraries for system-related tasks and utilities:
- `os`: A library providing functions for interacting with the operating system.
- `time`: A module for time-related functions, such as timestamps and sleeping.
- `functools`: A module providing higher-order functions, such as `partial`.
- `tqdm`: A library providing a progress bar utility.
- `google.colab`: A package available only inside Google Colab; its `userdata` module reads secrets stored in the notebook environment.
- `warnings`: A library allowing you to control warning messages.

### Suppressing Warnings and Environment Configuration
```python
warnings.simplefilter(action='ignore', category=FutureWarning)

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# disable Weights and Biases
os.environ['WANDB_DISABLED']="true"
```
These lines configure the environment to:
- Suppress `FutureWarning` messages using `warnings`.
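- Request faster downloads from the Hugging Face hub through the optional `hf_transfer` package by setting `HF_HUB_ENABLE_HF_TRANSFER`. This only helps if `hf_transfer` is installed, and the install cell above does not list it.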
- Disable Weights & Biases, a library for tracking experiment metrics, by setting another environment variable.

In summary, this code imports essential libraries for training large language models using the Hugging Face ecosystem and configures the environment to suppress warnings and disable certain features.
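Since `HF_HUB_ENABLE_HF_TRANSFER` relies on a separate package, a cautious variant sets the flag only when `hf_transfer` can be imported. The sketch below uses a hypothetical helper name; depending on the installed `huggingface_hub` version, setting the flag without the package may lead to an error at download time, which is the case this guard is meant to avoid.

```python
import importlib.util
import os

def enable_hf_transfer_if_available(env=os.environ):
    """Set HF_HUB_ENABLE_HF_TRANSFER=1 only if the hf_transfer package is importable."""
    available = importlib.util.find_spec("hf_transfer") is not None
    if available:
        env["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    else:
        # Leave the flag unset so downloads use the default code path.
        env.pop("HF_HUB_ENABLE_HF_TRANSFER", None)
    return available

print("hf_transfer enabled:", enable_hf_transfer_if_available())
```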

In [None]:
# login to HF
login(token = userdata.get('HF_TOKEN') )

This line of code logs you into the Hugging Face hub using a token:
```python
login(token = userdata.get('HF_TOKEN') )
```
Here's what's happening:

- `login()`: This is a function from the `huggingface_hub` library that allows you to log in to the Hugging Face hub.
- `token = userdata.get('HF_TOKEN')`: This retrieves an authentication token from Colab's secrets store through the `userdata` module, which is specific to Google Colab notebooks. The token is stored under the key `'HF_TOKEN'`.
- By passing this token to the `login()` function, you're authenticating with the Hugging Face hub using your stored credentials.

**Why do we need to log in?**

Public models and datasets can be downloaded without logging in. Authenticating with the Hugging Face hub is needed to:
- Access gated or private models and datasets
- Upload models or datasets to your account
- Use other account-bound features of the Hugging Face ecosystem, such as model sharing and collaboration

**How to obtain an HF token?**

To use this code, you'll need to obtain a Hugging Face token. You can do this by:
1. Creating an account on the [Hugging Face website](https://huggingface.co/).
2. Going to your account settings and clicking on "Access Tokens".
3. Creating a new token or using an existing one.
4. Storing the token securely, such as in an environment variable or a secure storage system.

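In Google Colab, add the token through the Secrets panel (the key icon in the left sidebar): create a secret named `HF_TOKEN`, paste the token as its value, and grant the notebook access to it. The notebook then reads the secret with `userdata.get`:
```python
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')
```
Avoid pasting the token itself into a notebook cell, since notebooks are easily shared.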
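Outside Colab the `google.colab` import fails, so the login cell cannot run as written. One possible fallback, sketched here with a hypothetical helper name `get_hf_token`, reads the token from an environment variable of the same name when Colab secrets are unavailable.

```python
import os

def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(secret_name)
    except Exception:
        # Not in Colab, or the secret is missing / not shared with this notebook.
        return os.environ.get(secret_name)

token = get_hf_token()
print("token found:", token is not None)
```

The returned value can then be passed to `login(token=...)`; printing only whether a token was found avoids leaking it into the cell output.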

In [None]:
class CFG:
    device = 'cuda'
    model_name = 'microsoft/phi-2'
    seed = 42
    maxsteps = 100
    huggingface_dataset_name = "neil-code/dialogsum-test"
    temp = 0.1
    topp = 0.95

**Configuring Hyperparameters and Model Settings**

This code defines a class `CFG` that stores configuration settings for a machine learning model:
```python
class CFG:
    device = 'cuda'
    model_name = 'microsoft/phi-2'
    seed = 42
    maxsteps = 100
    huggingface_dataset_name = "neil-code/dialogsum-test"
    temp = 0.1
    topp = 0.95
```
Here's a breakdown of each attribute:

* `device`: Specifies the device to use for training and inference. In this case, it's set to `'cuda'`, which means the model will be trained on a GPU (Graphics Processing Unit) using CUDA.
* `model_name`: The name of the pre-trained model to use. Here, it's set to `'microsoft/phi-2'`, which is a specific model from the Hugging Face model hub.
* `seed`: The random seed used for reproducibility. Setting this value ensures that the same sequence of random numbers will be generated every time the code is run.
* `maxsteps`: The maximum number of training steps to perform. This value controls how many iterations the model will train for.
* `huggingface_dataset_name`: The name of the dataset to use from the Hugging Face dataset hub. In this case, it's set to `"neil-code/dialogsum-test"`, which is a specific dataset for testing dialogue summarization models.
* `temp` (temperature): A hyperparameter that controls the randomness of the model's output. A lower value (e.g., 0.1) means the model will produce more deterministic outputs, while a higher value (e.g., 1.0) means the model will produce more random outputs.
* `topp` (top-p): The nucleus sampling threshold. At each step the model samples only from the smallest set of most likely tokens whose cumulative probability reaches this value. With 0.95, the low-probability tail holding at most about 5% of the probability mass is excluded.

By defining these attributes in a class, you can easily access and modify them throughout your code. This helps keep your configuration settings organized and makes it easier to experiment with different hyperparameters and models.

**Example usage:**

You can access these attributes like this:
```python
cfg = CFG()
print(cfg.device)  # Output: cuda
print(cfg.model_name)  # Output: microsoft/phi-2
```
Or, you can modify them directly:
```python
cfg.maxsteps = 200
print(cfg.maxsteps)  # Output: 200
```
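
The effect of `temp` and `topp` can be seen on a toy distribution. The sketch below is plain Python, not the implementation used by `transformers`: it applies temperature scaling to made-up logits and then keeps the smallest set of tokens whose cumulative probability reaches the top-p threshold.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
print(top_p_filter(softmax_with_temperature(logits, 1.0), 0.95))
```

At a temperature of 0.1 almost all probability mass lands on the top token, which is why the low `temp` in `CFG` gives nearly deterministic output even with sampling enabled.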

# Functions

In [None]:
def gen(model, p, maxlen=100, sample=True):
    # eval_tokenizer is a global tokenizer that must exist before gen() is called
    toks = eval_tokenizer(p, return_tensors="pt")
    res = model.generate(
        **toks.to(CFG.device),
        max_new_tokens=maxlen,
        do_sample=sample,
        temperature=CFG.temp,
        top_p=CFG.topp,
        num_beams=1, num_return_sequences=1,
    ).to('cpu')
    return eval_tokenizer.batch_decode(res, skip_special_tokens=True)

This function generates text based on a given prompt `p` using a pre-trained model. Here's a step-by-step explanation:

1. **Tokenize the input prompt**: The tokenizer object `eval_tokenizer`, a global that must be defined elsewhere in the notebook, tokenizes the input prompt `p` and returns PyTorch tensors. This converts the text into a format that the model can understand.
2. **Move tokens to the device**: The tokenized input is moved to the device specified in `CFG.device` (in this case, a CUDA GPU) using the `to()` method.
3. **Generate text with the model**: The `generate()` method of the pre-trained model is called with the following arguments:
	* `**toks.to(CFG.device)`: Passes the tokenized input to the model.
	* `max_new_tokens=maxlen`: Specifies the maximum number of new tokens to generate (100 by default).
	* `do_sample=sample`: If `True`, the model samples from the probability distribution over possible next tokens. If `False`, it picks the most likely token at each step (greedy decoding), and `temperature` and `top_p` have no effect.
	* `num_return_sequences=1`: Returns only one generated sequence.
	* `temperature=CFG.temp`: Controls the randomness of the generated text (in this case, set to 0.1).
	* `num_beams=1`: Disables beam search; a single candidate sequence is tracked.
	* `top_p=CFG.topp`: Specifies the top-p sampling strategy (in this case, set to 0.95).
4. **Move the generated sequence back to CPU**: The generated sequence is moved back to the CPU using the `to('cpu')` method.
5. **Decode the generated sequence**: The generated sequence is decoded back into text using the `batch_decode()` method of the tokenizer, with special tokens skipped.

**Return value**:
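The function returns a list with one decoded string. For a causal language model, `generate()` returns the prompt tokens followed by the new tokens, so the string contains the prompt as well as the generated continuation.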
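
When only the continuation is of interest, the prompt can be removed from the decoded string afterwards. The helper below is a sketch with a hypothetical name, `extract_completion`; it assumes decoding reproduces the prompt verbatim and falls back to the full text when it does not. The `decoded` value is a made-up string standing in for real model output.

```python
def extract_completion(decoded_text, prompt):
    """Return only the newly generated part of a decoded causal-LM output."""
    if decoded_text.startswith(prompt):
        return decoded_text[len(prompt):].strip()
    # Decoding did not round-trip the prompt exactly; return everything.
    return decoded_text.strip()

prompt = "Instruct: Summarize the below conversation.\nOutput:"
decoded = prompt + " A patient visits the doctor for a check-up."
print(extract_completion(decoded, prompt))
```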


In [None]:
def create_prompt_formats(sample):
    """
    Build a training prompt from the sample's 'dialogue' and 'summary' fields,
    join the parts with two newline characters, and store the result under 'text'
    :param sample: Sample dictionary
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"

    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample

The main goal of this function is to take in a piece of data (called a "sample") and format it into a fixed prompt template that can be used as training text for a language model. This is prompt formatting for instruction fine-tuning: every example presents the task, the dialogue, and the expected summary in the same layout, so the model learns to produce a summary when it sees that layout.

Here's what happens step by step:

1. **Defining constants**: The code starts by defining some constant strings that will be used throughout the function. These include an introductory message, keywords for instructions and output, and a keyword to mark the end of the prompt.

2. **Extracting relevant information**: It then extracts specific fields from the input "sample" data. This includes a dialogue (if present) and a summary.

3. **Constructing the prompt parts**: The function constructs different parts of the final prompt by combining these extracted fields with the constant strings defined earlier. Each part is essentially a formatted string that serves a particular purpose in the overall prompt.

4. **Filtering out empty parts**: To ensure the final prompt doesn't contain any unnecessary or empty sections, it filters out any parts that might be blank.

5. **Assembling the final prompt**: The remaining parts are then joined together with two newline characters (`\n\n`) to create a well-structured and readable prompt. This prompt is designed to guide the language model in understanding what task it needs to perform (in this case, summarizing a conversation).

6. **Updating the sample data**: Finally, the function updates the original "sample" data by adding the newly formatted prompt under a key called "text." This updated sample can then be used as input for further processing or training of the language model.

The overall purpose of this code is to prepare and structure the input data in a way that makes it easy for a language model to understand the task at hand, which is crucial for achieving good results in tasks like text summarization.
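
At inference time the same template is typically reused with the summary left out, so the model is asked to complete the part after `### Output:`. The function below is a sketch of that idea; `build_inference_prompt` is a hypothetical helper, not part of the notebook, and the example dialogue is invented.

```python
def build_inference_prompt(dialogue):
    """Build the training template without the summary and end marker."""
    intro = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
    )
    instruction = "### Instruct: Summarize the below conversation."
    response_key = "### Output:"
    parts = [f"\n{intro}", instruction, dialogue, response_key]
    # Skip empty parts, as the training formatter does, and end with a newline
    # so generation starts on the line after the response key.
    return "\n\n".join(part for part in parts if part) + "\n"

example = "#Person1#: Are you free tonight?\n#Person2#: Yes, let's have dinner."
print(build_inference_prompt(example))
```

Keeping the wording identical to the training template matters, because the fine-tuned model has only seen that exact layout.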

In [None]:
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py

def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


The purpose of this function, `get_max_length`, is to determine the maximum sequence length that a given language model can handle. This is an important piece of information because many models have limitations on how long their input sequences can be.

Here's a step-by-step breakdown:

1. **Accessing model configuration**: The function starts by accessing the configuration (`conf`) of the provided `model`. Model configurations often contain various settings and parameters that define the model's behavior and capabilities.

2. **Searching for max length setting**: It then searches for the maximum sequence length setting within the model's configuration. This setting can be referred to differently across various models, so it checks for three possible attribute names: `n_positions`, `max_position_embeddings`, and `seq_length`.

3. **Checking each possible setting**: For each of these potential settings, it attempts to retrieve the value using `getattr`. If a value is found (i.e., not `None`), it prints a message indicating that it has found the maximum length and breaks out of the loop.

4. **Defaulting to a standard max length**: If none of the expected settings are found in the model's configuration, it defaults to a maximum sequence length of 1024. This is only a fallback; 1024 matches the context size of GPT-2, and other models may support shorter or longer sequences.

5. **Returning the max length**: Finally, the function returns the determined maximum sequence length, whether it was found in the model's configuration or defaulted to 1024.
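
The lookup logic can be tried without loading a model. The sketch below uses `types.SimpleNamespace` objects as stand-ins for a model configuration and a hypothetical variant, `find_max_length`, that returns the value without printing.

```python
from types import SimpleNamespace

def find_max_length(config, default=1024):
    """Return the first truthy length attribute on `config`, else `default`."""
    for name in ("n_positions", "max_position_embeddings", "seq_length"):
        value = getattr(config, name, None)
        if value:
            return value
    return default

# Stand-in config objects; real ones come from `model.config`.
with_limit = SimpleNamespace(max_position_embeddings=2048)
without_limit = SimpleNamespace()
print(find_max_length(with_limit), find_max_length(without_limit))
```

Note that the check is for a truthy value, so an attribute set to `0` or `None` is skipped just like a missing one.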

In [None]:
def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )

In [None]:
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param seed: Seed used when shuffling the dataset
    :param dataset: Dataset with 'id', 'topic', 'dialogue' and 'summary' columns
    """

    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)

    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # Drop samples whose input_ids reached max_length (these were most likely truncated)
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset

The purpose of `preprocess_batch` is to prepare a batch of text data for use with a language model. It tokenizes the raw text and truncates any sequence longer than `max_length`; no padding is applied at this stage.

Here's a step-by-step explanation:

1. **Taking in parameters**: The function accepts three inputs:
   - `batch`: A dictionary containing the text data to be preprocessed.
   - `tokenizer`: An object responsible for splitting the input text into individual tokens (e.g., words, subwords).
   - `max_length`: The maximum sequence length that the model can handle.

2. **Tokenizing the batch**: The function uses the provided `tokenizer` to tokenize the text data stored in `batch["text"]`. This process typically involves:
   - Splitting the text into individual tokens (e.g., words, subwords).
   - Converting each token into a numerical representation (token ID) that the model can understand.

3. **Truncating to max length**: The `truncation=True` argument ensures that any input sequences longer than `max_length` are truncated to fit within this limit. This is necessary because many language models have limitations on the maximum sequence length they can handle.

4. **Returning the preprocessed batch**: The function returns the tokenized and potentially truncated batch, which is now ready for use with a language model. Because neither `return_tensors` nor padding is requested, the values are plain Python lists of varying length rather than fixed-shape tensors:
   - `input_ids`: One list of token IDs per input text, each at most `max_length` long.
   - `attention_mask`: One list per input text marking which positions hold real tokens; without padding, every entry is 1.

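This function is a crucial step in preparing text data for use with language models, as it converts raw text into a format that can be processed by the model. The specific tokenizer used will depend on the requirements of the model being employed.

`preprocess_dataset` wraps this step: it first adds the prompt text with `create_prompt_formats`, tokenizes the dataset in batches, removes the original `id`, `topic`, `dialogue`, and `summary` columns, drops samples that reached `max_length`, and shuffles the result with the given seed.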
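
The interaction between truncation and the later length filter is easy to miss: a sequence that gets truncated ends up exactly `max_length` tokens long and is then removed by the `<` comparison. The toy sketch below shows this with a whitespace "tokenizer"; both helper names are made up and stand in for the real tokenizer and `dataset.filter`.

```python
def toy_tokenize(texts, max_length):
    """Whitespace 'tokenizer' that truncates each sequence to max_length tokens."""
    return {"input_ids": [text.split()[:max_length] for text in texts]}

def keep_short(batch, max_length):
    """Mirror the notebook's filter: keep sequences strictly shorter than max_length."""
    return [ids for ids in batch["input_ids"] if len(ids) < max_length]

texts = ["a b c", "a b c d e f g"]
batch = toy_tokenize(texts, max_length=5)
print([len(ids) for ids in batch["input_ids"]])
print(len(keep_short(batch, max_length=5)))
```

The two sequences have lengths 3 and 5 after truncation, and only the first survives the filter. In effect, over-long examples are discarded rather than trained on in truncated form.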

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"


The purpose of `print_number_of_trainable_model_parameters` is to count the trainable and total parameters in a given model and compute the percentage that is trainable. Despite its name, the function returns the report as a string rather than printing it.

Here's a step-by-step explanation:

1. **Initializing counters**: The function starts by initializing two counters:
   - `trainable_model_params`: This counter keeps track of the number of model parameters that can be updated during training (i.e., require gradients).
   - `all_model_params`: This counter keeps track of the total number of model parameters, regardless of whether they are trainable or not.

2. **Iterating over model parameters**: The function then iterates over all the parameters in the model using the `named_parameters()` method, which returns an iterator over the model's parameters and their corresponding names.

3. **Updating counters**: For each parameter:
   - The `all_model_params` counter is always incremented by the number of elements in the parameter (`param.numel()`), whether or not the parameter is trainable.
   - If the parameter requires gradients (its `requires_grad` attribute is `True`), the `trainable_model_params` counter is incremented by the same amount.

4. **Calculating and returning results**: After iterating over all parameters, the function calculates:
   - The total number of trainable model parameters (`trainable_model_params`).
   - The total number of all model parameters (`all_model_params`).
   - The percentage of trainable model parameters by dividing `trainable_model_params` by `all_model_params` and multiplying by 100.

5. **Returning the results as a string**: The function returns a formatted string containing the three calculated values:
   - The number of trainable model parameters.
   - The total number of all model parameters.
   - The percentage of trainable model parameters, rounded to two decimal places.

This information is useful for understanding the complexity and capacity of a model, as well as identifying potential issues or inefficiencies in the training process. For example, if a large proportion of the model's parameters are not trainable, it may indicate that some layers or modules are frozen during training.
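
The counting logic can be checked on paper with stand-in parameters. In the sketch below, `FakeParam` is a made-up class that imitates only the two members the function relies on (`numel()` and `requires_grad`); the shapes are arbitrary and chosen to resemble one large frozen weight next to a small trainable one.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class FakeParam:
    """Stand-in for a tensor parameter: just a shape and a requires_grad flag."""
    shape: tuple
    requires_grad: bool

    def numel(self):
        return prod(self.shape)

def count_parameters(named_params):
    """Return (trainable, total) element counts from (name, param) pairs."""
    trainable = total = 0
    for _, param in named_params:
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total

params = [
    ("frozen.weight", FakeParam((1000, 1000), requires_grad=False)),
    ("adapter.weight", FakeParam((1000, 8), requires_grad=True)),
]
trainable, total = count_parameters(params)
print(f"{trainable} / {total} = {100 * trainable / total:.2f}%")
```

Here 8,000 of 1,008,000 elements are trainable, which is the kind of small percentage one would expect to see when most of a model is frozen.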

# Data

In [None]:
dataset = load_dataset(CFG.huggingface_dataset_name)

This line downloads a dataset from the Hugging Face Hub with the `load_dataset` function from the `datasets` library and returns it as a `DatasetDict` keyed by split.

The dataset is selected by `CFG.huggingface_dataset_name`, an attribute of the `CFG` class defined earlier, which holds the name `"neil-code/dialogsum-test"`.

By loading the dataset in this way, the code can then use it for tasks such as training a model, fine-tuning a pre-trained language model, or performing other types of analysis. The Hugging Face library provides a simple and efficient way to work with datasets, making it easier to focus on the task at hand rather than worrying about data preparation and processing.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1999
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
})

In [None]:
dataset['train'][0]

{'id': 'train_0',
 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.",
 'summary': "Mr. Smith'

# Model

In [None]:
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
device_map = {"": 0}

This code is setting up configuration options for optimizing the performance of a deep learning model, specifically when it comes to memory usage and computational precision.

The first part specifies the data type to be used for computations, which in this case is set to 16-bit floating point numbers. This is a lower-precision format than the default 32-bit floating point numbers, but it can still provide good results while reducing memory usage and improving performance.

The next section builds a `BitsAndBytesConfig`, which tells `transformers` to load the model weights through the `bitsandbytes` library in a 4-bit format. `bnb_4bit_quant_type='nf4'` selects the NF4 (4-bit NormalFloat) quantization type, `bnb_4bit_compute_dtype` sets the data type used for computation on the dequantized weights (float16 here), and `bnb_4bit_use_double_quant=False` disables the additional quantization of the quantization constants.

Finally, the `device_map` variable maps parts of the model to devices. The empty-string key refers to the whole model, and the value `0` is the index of the first GPU, so the entire model is placed on GPU 0.
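
A rough back-of-the-envelope calculation shows why 4-bit storage matters. The sketch below assumes a model of about 2.7 billion parameters (an assumed figure for `microsoft/phi-2`; check the model card for the exact count) and counts weight storage only. It ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so real memory usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 2.7e9  # assumed parameter count, not read from the model
for label, bits in [("float32", 32), ("float16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights shrink from roughly 10.8 GB in float32 to about 1.35 GB in 4-bit form, which is what lets the model fit on a single modest GPU.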

In [None]:
original_model = AutoModelForCausalLM.from_pretrained(CFG.model_name,
                                                      device_map = device_map ,
                                                      quantization_config = bnb_config,
                                                      trust_remote_code = True,
                                                      use_auth_token = True)

This call downloads and loads a pre-trained causal language model from the Hugging Face Hub. The model is selected by `CFG.model_name`, here `microsoft/phi-2`.

The model is being loaded with several custom configurations:

* `device_map`: This specifies where the model is placed. The map defined earlier puts the whole model on GPU 0.
* `quantization_config`: This is the `BitsAndBytesConfig` set up earlier, which loads the weights in 4-bit format to reduce memory usage.
* `trust_remote_code`: This option allows custom Python modeling code stored in the model's repository on the Hub to be downloaded and executed locally. Setting it to `True` means trusting the repository's authors, so it should only be used with sources you have reviewed or trust.
* `use_auth_token`: This option sends the token saved by `login()` with the download request. It is only required for gated or private repositories. Recent `transformers` releases deprecate this argument in favor of `token`.

By loading the model with these custom configurations, the code can take advantage of optimized performance and memory usage while still leveraging the power of a pre-trained language model. The loaded model can then be used for tasks such as text generation, fine-tuning, or other natural language processing applications.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(CFG.model_name, trust_remote_code=True,
                                          padding_side="left", add_eos_token=True,
                                          add_bos_token=True,use_fast=False)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

This line of code is loading a pre-trained tokenizer from the Hugging Face library, which is designed to work with the previously loaded language model. The tokenizer is responsible for converting text into a format that can be understood by the model.

The tokenizer is being loaded with several specific configurations:

* `trust_remote_code`: Similar to when the model was loaded, this option allows custom tokenizer code from the model's repository to be downloaded and executed.
* `padding_side`: This specifies where padding tokens are placed. In this case, it's set to "left", which means that padding tokens will be added to the beginning of the text.
* `add_eos_token` and `add_bos_token`: These options add end-of-sequence (EOS) and beginning-of-sequence (BOS) tokens to the text, respectively. These tokens help the model understand the boundaries of the input text.
* `use_fast`: This option is set to `False`, which selects the pure-Python ("slow") tokenizer implementation instead of the Rust-based "fast" one.

After loading the tokenizer, the code sets the `pad_token` attribute to be equal to the `eos_token`. This means that when the tokenizer needs to add padding tokens to the text, it will use the same token as the end-of-sequence token. This is a common workaround when a tokenizer does not define a dedicated padding token of its own.

By configuring the tokenizer in this way, the code can ensure that the input text is properly formatted and padded for the language model, which will help with tasks such as text generation and fine-tuning.
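To make the effect of `padding_side` concrete, here is a minimal pure-Python sketch. It does not use the real tokenizer: the `pad_batch` helper and the short token ID lists are made up for illustration, and only the pad ID 50256 is taken from the generation log in this notebook.

```python
# Toy illustration of padding_side; pad_batch and the token IDs are hypothetical.
PAD_ID = 50256  # eos_token_id reported in this notebook's generation log, reused as pad


def pad_batch(sequences, pad_id, padding_side="left"):
    """Pad lists of token IDs to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        mask_pad = [0] * len(padding)   # 0 = padding position, ignored by attention
        mask_real = [1] * len(seq)      # 1 = real token
        if padding_side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + mask_real)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(mask_real + mask_pad)
    return input_ids, attention_mask


batch = [[11, 12, 13], [21]]
left_ids, left_mask = pad_batch(batch, PAD_ID, "left")
right_ids, right_mask = pad_batch(batch, PAD_ID, "right")

print(left_ids)    # padding in front of the shorter sequence
print(left_mask)
print(right_ids)   # padding behind the shorter sequence
```

With left padding, the last position of every row holds a real token, which is the position a decoder-only model continues from during generation.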

In [None]:
eval_tokenizer = AutoTokenizer.from_pretrained(CFG.model_name, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token


This code is loading another instance of the pre-trained tokenizer from the Hugging Face library, similar to the previous one. The purpose of this additional tokenizer is likely for evaluation purposes, as indicated by the variable name `eval_tokenizer`.

The configuration options used to load the `eval_tokenizer` are slightly different from the previous tokenizer:

* `add_bos_token` is set to `True`, which means that a beginning-of-sequence token will be added to the text.
* `trust_remote_code` is set to `True`, allowing custom tokenizer code from the model's repository to be downloaded and executed.
* `use_fast` is set to `False`, which selects the pure-Python ("slow") tokenizer implementation instead of the Rust-based "fast" one.

Unlike the previous tokenizer, this one does not specify `add_eos_token` or `padding_side`, so the tokenizer's defaults apply for both. It still sets the `pad_token` attribute to be equal to the `eos_token`, so the evaluation tokenizer also uses the EOS token for padding.

A separate tokenizer is useful at inference time because the training tokenizer appends an EOS token to every input. A prompt that ends with EOS signals that the sequence is already finished, which is undesirable right before generation. The `eval_tokenizer` presumably exists for this reason: it can encode prompts for generation without the trailing EOS token, while the training tokenizer is used to prepare the training data.

## Zero-shot

In [None]:
zeroshot_responses = {}

In [None]:
%%time
index = 10

prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model,formatted_prompt,100,)
output = res[0].split('Output:\n')[1]
# for later
zeroshot_responses['f' + str(index)] = output

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# attends Brian's birthday pa

This code is generating a summary of a conversation using the previously loaded language model, and then comparing it to a human-written baseline summary.

Here's what's happening:

1. The code selects a specific example from the test dataset, identified by the `index` variable (set to 10 in this case). It retrieves the conversation text (`prompt`) and the corresponding human-written summary (`summary`) for that example.
2. The conversation text is then formatted into a prompt that can be fed into the language model. The prompt includes an instruction to summarize the conversation, followed by the conversation text itself, and finally an "Output:" label to indicate where the model's response should start.
3. The `gen` function (not shown in this code snippet) is called with the formatted prompt, the language model (`original_model`), and a maximum output length of 100 tokens. This generates a response from the model, which is stored in the `res` variable.
4. The model's response is then processed to extract the actual generated text, by splitting the response at the "Output:\n" label and taking the second part (i.e., everything after the label). This extracted text is stored in the `output` variable.
5. The code stores the generated output in a dictionary called `zeroshot_responses`, with a key that includes the index of the example being processed (e.g., "f10").
6. Finally, the code prints out several things:
	* A dashed line (`dash_line`) to separate the different sections of output.
	* The input prompt that was fed into the model.
	* The human-written baseline summary for comparison.
	* The model's generated summary (i.e., the `output` variable).

The purpose of this code is to evaluate the performance of the language model in generating summaries of conversations, by comparing its output to human-written summaries. The `%%time` cell magic at the top of the cell reports how long the whole cell took to run.
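The prompt formatting and output extraction steps can be tried without a model. The sketch below is an illustration only: `build_prompt` and `extract_output` are hypothetical helper names, and `fake_generation` is a made-up string standing in for `res[0]`, under the assumption that `gen` returns decoded text that still contains the prompt.

```python
def build_prompt(dialogue):
    """Same template as the notebook: instruction, dialogue, then an 'Output:' label."""
    return f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"


def extract_output(decoded_text):
    """Same extraction as the notebook: keep the text after the 'Output:' label."""
    return decoded_text.split('Output:\n')[1]


dialogue = "#Person1#: Taxi!\n#Person2#: Where will you go, sir?"
prompt = build_prompt(dialogue)

# Placeholder for the model's decoded output (prompt followed by the continuation).
fake_generation = prompt + "Person 1 hails a taxi."
output = extract_output(fake_generation)
print(output)

# The separator line from the notebook: joining 100 empty strings with '-'
# produces 99 dashes, one between each pair of neighbours.
dash_line = '-'.join('' for x in range(100))
print(len(dash_line))
```

Because the extraction takes index `[1]` of the split, anything after a second `Output:\n` in the generated text would be dropped.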

In [None]:
%%time

index = 50

prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model,formatted_prompt,100,)
output = res[0].split('Output:\n')[1]
# for later
zeroshot_responses['f' + str(index)] = output

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Yeah. Just pull on this strip. Then peel off the back.
#Person2#: You might make a few enemies this way.
#Person1#: If they don't think this is fun, they're not meant to be our friends.
#Person2#: You mean your friends. I think it's cruel.
#Person1#: Yeah. But it's fun. Look at those two ugly old ladies. . . or are they men?
#Person2#: Hurry! Get a shot!. . . Hand it over!
#Person1#: I knew you'd come around. . .
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is about to make a prank. #Person2# thinks it's cruel at first but then joins.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
Person 1 and Person 2 are discussing a prank involving a 

In [None]:
index = 150

prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model,formatted_prompt,100,)
output = res[0].split('Output:\n')[1]
# for later
zeroshot_responses['f' + str(index)] = output

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Taxi!
#Person2#: Where will you go, sir?
#Person1#: Friendship Hotel.
#Person2#: OK, it's not far from here.
#Person1#: I have something important to do, can you fast the speed?
#Person2#: Sure, I'll try my best. Here we are.
#Person1#: It's fast! How much should I pay you?
#Person2#: The reading on the meter is 15 yuan.
#Person1#: Here's 20 yuan, keep the change.
#Person2#: Thank you very much.
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# takes a taxi to the Friendship Hotel for something important.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
Person 1 takes an Uber from the airport to the Friendship Hotel and asks the driver to go fa

In [None]:
index = 450


prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model,formatted_prompt,100,)
output = res[0].split('Output:\n')[1]
# for later
zeroshot_responses['f' + str(index)] = output

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: What's up?
#Person2#: I guess there is some kind of virus seeking into my computer, I can't send out this e-mail. Do you have the number of the text port?
#Person1#: Do you mind I have a look at your computer?
#Person2#: Of course not, I appreciate that.
#Person1#: Well, it has nothing to do with virus. The problem is your attachment is a bit larger. It has exceeded the e-mail capacity.
#Person2#: I see. What can I do now?
#Person1#: You can send a compressed one.
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# finds that #Person2# e-mail exceeds capacity and suggests #Person2# compress the email.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SH

# Data preparation

In [None]:
# ## Pre-process dataset
max_length = get_max_length(original_model)
print(max_length)

train_dataset = preprocess_dataset(tokenizer, max_length, CFG.seed, dataset['train'])
eval_dataset = preprocess_dataset(tokenizer, max_length,CFG.seed, dataset['validation'])

Found max lenth: 2048
2048
Preprocessing dataset...


Map:   0%|          | 0/1999 [00:00<?, ? examples/s]

Map:   0%|          | 0/1999 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1999 [00:00<?, ? examples/s]

Preprocessing dataset...


Map:   0%|          | 0/499 [00:00<?, ? examples/s]

Map:   0%|          | 0/499 [00:00<?, ? examples/s]

Filter:   0%|          | 0/499 [00:00<?, ? examples/s]

This code is pre-processing the dataset to prepare it for training and evaluation.

Here's what's happening:

1. The `get_max_length` function (not shown in this code snippet) is called with the `original_model` as an argument, and it returns the maximum length of input that the model can handle. This value is stored in the `max_length` variable.
2. The `preprocess_dataset` function (also not shown) is then called twice:
	* Once for the training dataset (`dataset['train']`), with the `tokenizer`, `max_length`, and a random seed (`CFG.seed`) as arguments. The resulting pre-processed dataset is stored in the `train_dataset` variable.
	* Once for the validation dataset (`dataset['validation']`), with the same arguments as before. The resulting pre-processed dataset is stored in the `eval_dataset` variable.

The function itself is not shown, but its progress output (two `Map` passes followed by a `Filter` pass) and the columns of the resulting dataset (`text`, `input_ids`, `attention_mask`) suggest that it likely performs tasks such as:

* Building the prompt text for each example
* Tokenizing that text using the specified tokenizer
* Filtering out examples whose tokenized length exceeds `max_length` (none were dropped here, since the row counts stay at 1999 and 499)
* Possibly shuffling the dataset using the given seed

The resulting pre-processed datasets (`train_dataset` and `eval_dataset`) can then be used as input to the model for training and evaluation.
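Since `preprocess_dataset` is not shown, the following is only a sketch of what such a map-then-filter pipeline could look like. Everything in it is an assumption: the `toy_tokenize` stand-in (one "token" per whitespace-separated word), the `preprocess` helper, and the prompt template, which simply reuses the zero-shot format with the summary appended.

```python
def toy_tokenize(text):
    """Hypothetical stand-in for the real tokenizer: one token per word."""
    return text.split()


def preprocess(examples, max_length):
    """Format each example, tokenize it, then drop rows that are too long."""
    processed = []
    for ex in examples:
        # Assumed template; the notebook's actual training template is not shown.
        text = (
            "Instruct: Summarize the following conversation.\n"
            f"{ex['dialogue']}\nOutput:\n{ex['summary']}"
        )
        tokens = toy_tokenize(text)
        processed.append(
            {"text": text, "input_ids": tokens, "attention_mask": [1] * len(tokens)}
        )
    # Filter step: keep only rows that fit into the model's context window.
    return [row for row in processed if len(row["input_ids"]) <= max_length]


examples = [
    {"dialogue": "#Person1#: Taxi!", "summary": "Someone hails a taxi."},
    {"dialogue": "word " * 50, "summary": "A very long dialogue."},
]
rows = preprocess(examples, max_length=20)
print(len(rows), sorted(rows[0].keys()))
```

The resulting rows carry the same three columns that the real dataset reports: `text`, `input_ids`, and `attention_mask`.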


In [None]:
print(f"Shapes of the datasets:")
print(f"Training: {train_dataset.shape}")
print(f"Validation: {eval_dataset.shape}")
print(train_dataset)

Shapes of the datasets:
Training: (1999, 3)
Validation: (499, 3)
Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 1999
})


This code is printing out information about the shapes and contents of the pre-processed training and validation datasets.

Here's what's happening:

1. The code prints out a message indicating that it will display the shapes of the datasets.
2. It then prints out the shape of the `train_dataset` using the `shape` attribute, followed by the shape of the `eval_dataset`.
3. Finally, it prints out the contents of the `train_dataset` itself.

The `shape` attribute is a `(rows, columns)` tuple: the training set has 1,999 examples and the validation set has 499, each with three columns. Printing the dataset object lists those columns: `text`, `input_ids`, and `attention_mask`.

By printing out the shapes and contents of the datasets, the code is providing a way to verify that the pre-processing step was successful and that the datasets are in the expected format. This can be useful for debugging purposes or for gaining insight into the structure and content of the data.


# Finetuning setup

In [None]:
print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 262364160
all model parameters: 1521392640
percentage of trainable model parameters: 17.24%


In [None]:
print(original_model)

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (rotary_emb): PhiRotaryEmbedding()
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (final_layernorm): 

In [None]:
config = LoraConfig(
    r=32, #Rank
    lora_alpha=32,
    target_modules=[
        'q_proj', 'k_proj', 'v_proj', 'dense'
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

# 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
original_model.gradient_checkpointing_enable()

# 2 - Using the prepare_model_for_kbit_training method from PEFT
original_model = prepare_model_for_kbit_training(original_model)

peft_model = get_peft_model(original_model, config)
peft_model = peft_model.to(CFG.device)

This code configures and prepares the model for fine-tuning with LoRA (Low-Rank Adaptation), a technique for efficient fine-tuning of large language models that trains a small number of added parameters instead of the full weight matrices.

Here's what's happening:

1. The `LoraConfig` object is created with several parameters that control the behavior of LoRA:
	* `r=32`: This sets the rank of the two low-rank matrices that LoRA adds to each target layer. A higher rank means more trainable parameters and more capacity for the learned update.
	* `lora_alpha=32`: This is a scaling factor. The low-rank update is multiplied by `lora_alpha / r`, which is 1 with these settings.
	* `target_modules`: This specifies the modules within the model that receive LoRA adapters. In this case, it includes the query, key, and value projection layers (`q_proj`, `k_proj`, `v_proj`) as well as the attention output projection (`dense`). The MLP layers (`fc1`, `fc2`) are not adapted.
	* `bias="none"`: This indicates that no bias terms are trained.
	* `lora_dropout=0.05`: This is the dropout probability applied to the input of the LoRA layers during training.
	* `task_type="CAUSAL_LM"`: This specifies the type of task being performed, which in this case is causal language modeling.
2. The code enables gradient checkpointing for the original model using the `gradient_checkpointing_enable()` method. With this technique, only a subset of the intermediate activations is kept during the forward pass, and the rest are recomputed during backpropagation. This trades extra computation for significantly lower memory usage during fine-tuning.
3. The `prepare_model_for_kbit_training` function from the PEFT (Parameter-Efficient Fine-Tuning) library is applied to the original model. This prepares the quantized model for training, for example by freezing the base model's weights and casting some layers to higher precision for numerical stability. It does not add the LoRA adapters itself.
4. The `get_peft_model` function wraps the model and injects LoRA adapters into the target modules. The resulting model is then moved to the specified device (e.g., GPU) using the `to()` method.

The resulting `peft_model` is now ready for fine-tuning with LoRA, which trains only the small adapter matrices while the base model's weights stay frozen.
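The core idea of a LoRA layer can be sketched in a few lines of plain Python. This is a toy illustration with made-up 3x3 weights and rank 1, not the PEFT implementation: the helper names `matvec` and `lora_forward` are hypothetical, and dropout is left out for brevity.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base projection plus the scaled low-rank update B(Ax)."""
    base = matvec(W, x)                  # frozen pretrained weights
    update = matvec(B, matvec(A, x))     # low-rank path: down-project, then up-project
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]            # frozen 3x3 weight (made up)
A = [[0.5, -0.5, 1.0]]           # r x d_in  = 1 x 3
B_zero = [[0.0], [0.0], [0.0]]   # d_out x r = 3 x 1, zero before any training
x = [1.0, 2.0, 3.0]

# With B at zero, the adapted layer behaves exactly like the base layer.
y_initial = lora_forward(W, A, B_zero, x, lora_alpha=1, r=1)

# After training, B is non-zero and shifts the output.
B_trained = [[1.0], [0.0], [-1.0]]
y_adapted = lora_forward(W, A, B_trained, x, lora_alpha=1, r=1)

print(y_initial)
print(y_adapted)
```

Only `A` and `B` would receive gradients during fine-tuning; `W` stays fixed, which is why the number of trainable parameters is so small.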

In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 20971520
all model parameters: 1542364160
percentage of trainable model parameters: 1.36%
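
The trainable parameter count reported above can be reproduced by hand from the LoRA configuration and the layer sizes in the model printout. The sketch below assumes that only the `lora_A` and `lora_B` matrices of the four targeted projections are trainable; the variable names are chosen for illustration.

```python
hidden_size = 2560         # in_features and out_features of q_proj, k_proj, v_proj, dense
rank = 32                  # r in LoraConfig
num_layers = 32            # PhiDecoderLayer count from the model printout
targets_per_layer = 4      # q_proj, k_proj, v_proj, dense

# Each adapted module adds lora_A (2560 -> 32) and lora_B (32 -> 2560), both without bias.
params_per_module = hidden_size * rank + rank * hidden_size
trainable = params_per_module * targets_per_layer * num_layers

all_params_after = 1542364160    # "all model parameters" reported for peft_model
all_params_before = 1521392640   # "all model parameters" reported for original_model

percentage = 100 * trainable / all_params_after
print(trainable)
print(f"{percentage:.2f}%")
print(all_params_after - all_params_before)
```

The difference between the two "all model parameters" figures equals the number of adapter parameters, which matches the picture of LoRA adding new matrices on top of an unchanged base model.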


In [None]:
print(peft_model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): PhiForCausalLM(
      (model): PhiModel(
        (embed_tokens): Embedding(51200, 2560)
        (layers): ModuleList(
          (0-31): 32 x PhiDecoderLayer(
            (self_attn): PhiAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2560, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear4b

# Train PEFT adapter

In [None]:
output_dir = './peft-dialogue-summary-training/final-checkpoint'


peft_training_args = TrainingArguments(
    output_dir = output_dir,
    warmup_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps = CFG.maxsteps, # 1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=25,
    eval_strategy="steps",
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir = 'True',
    group_by_length=True,
)

peft_model.config.use_cache = False

peft_trainer = Trainer(
    model = peft_model ,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    args = peft_training_args,
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

This code is setting up the training arguments and creating a trainer object for fine-tuning the `peft_model` using the Hugging Face Transformers library.

Here's what's happening:

1. The output directory for the training process is specified as `./peft-dialogue-summary-training/final-checkpoint`. This is where the model checkpoints will be saved.
2. The `TrainingArguments` object is created with several parameters that control the training process:
	* `output_dir`: The directory where the model checkpoints will be saved.
	* `warmup_steps=1`: The number of warmup steps, which is the number of initial training steps where the learning rate is gradually increased from 0 to its maximum value.
	* `per_device_train_batch_size=1`: The batch size per device (e.g., GPU) during training. In this case, it's set to 1, which means that each device will process one example at a time.
	* `gradient_accumulation_steps=4`: The number of batches whose gradients are accumulated before each weight update. With a per-device batch size of 1, the effective batch size is 1 × 4 = 4 per device, without the memory cost of holding four examples at once.
	* `max_steps=CFG.maxsteps`: The maximum number of training steps, taken from the notebook's configuration object (`CFG.maxsteps`).
	* `learning_rate=2e-4`: The initial learning rate for the optimizer.
	* `optim="paged_adamw_8bit"`: The optimizer algorithm used during training. In this case, it's an 8-bit version of the AdamW optimizer with paged memory allocation.
	* `logging_steps=25`: The frequency at which logging information is printed to the console and saved to the log file.
	* `logging_dir="./logs"`: The directory where the log files will be saved.
	* `save_strategy="steps"`: The strategy used to save model checkpoints. In this case, it's set to "steps", which means that checkpoints are saved every certain number of steps (defined by `save_steps`).
	* `save_steps=25`: The frequency at which model checkpoints are saved.
	* `eval_strategy="steps"`: The strategy used to evaluate the model during training. In this case, it's set to "steps", which means that evaluation is performed every certain number of steps (defined by `eval_steps`).
	* `eval_steps=25`: The frequency at which the model is evaluated during training.
	* `do_eval=True`: A flag indicating whether evaluation should be performed during training.
	* `gradient_checkpointing=True`: A flag enabling gradient checkpointing, which can help reduce memory usage during backpropagation.
	* `report_to="none"`: The integrations used to track the training process. In this case, it's set to "none", which disables reporting to external experiment trackers.
	* `overwrite_output_dir='True'`: A flag indicating whether the output directory should be overwritten if it already exists. The parameter expects a boolean; the notebook passes the string `'True'`, where the boolean `True` is the intended value.
	* `group_by_length=True`: A flag enabling grouping of examples by length during training, which can help reduce padding and improve efficiency.
3. The `use_cache` attribute of the `peft_model.config` object is set to `False`, which disables the key/value cache that speeds up generation. The cache is not needed during training and is incompatible with gradient checkpointing.
4. The `Trainer` object is created with several parameters:
	* `model=peft_model`: The model being trained, which is the `peft_model`.
	* `train_dataset=train_dataset`: The training dataset used during training.
	* `eval_dataset=eval_dataset`: The evaluation dataset used during evaluation.
	* `args=peft_training_args`: The training arguments defined earlier.
	* `data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)`: The data collator used to assemble batches for the model. With `mlm=False`, masked language modeling is disabled and batches are prepared for causal language modeling: the labels are a copy of the input IDs, with padding positions ignored in the loss.

In [None]:
peft_trainer.train()

Step,Training Loss,Validation Loss
25,1.6733,1.370079
50,1.2023,1.329419
75,1.4465,1.322451
100,1.2169,1.315212


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


TrainOutput(global_step=100, training_loss=1.3847660064697265, metrics={'train_runtime': 657.3905, 'train_samples_per_second': 0.608, 'train_steps_per_second': 0.152, 'total_flos': 1846465383444480.0, 'train_loss': 1.3847660064697265, 'epoch': 0.2001000500250125})
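
The figures in `TrainOutput` are consistent with the training arguments, as the following arithmetic sketch shows. It assumes training ran on a single device, so the effective batch size is just the per-device batch size times the accumulation steps.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
global_step = 100          # from TrainOutput
train_rows = 1999          # size of train_dataset
train_runtime = 657.3905   # seconds, from TrainOutput

# Assumption: one device, so no extra multiplier for data parallelism.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = global_step * effective_batch_size

epoch = examples_seen / train_rows
samples_per_second = examples_seen / train_runtime
steps_per_second = global_step / train_runtime

print(effective_batch_size, examples_seen)
print(round(epoch, 4))
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

In other words, 100 optimizer steps cover 400 examples, about a fifth of one pass over the training set.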

In [None]:
# Free memory for merging weights
del original_model
del peft_trainer
torch.cuda.empty_cache()

# Test and compare

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(CFG.model_name,
                                                      device_map='auto',
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
eval_tokenizer = AutoTokenizer.from_pretrained(CFG.model_name, add_bos_token=True,
                                               trust_remote_code=True,
                                               use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

In [None]:
ft_model = PeftModel.from_pretrained(base_model,
                "/content/peft-dialogue-summary-training/final-checkpoint/checkpoint-"+str(CFG.maxsteps),
                torch_dtype=torch.float16,is_trainable=False)


This code loads the LoRA adapter trained in the previous step and attaches it to a freshly loaded base model, using the PEFT (Parameter-Efficient Fine-Tuning) library.

Here's what's happening:

1. The `PeftModel.from_pretrained` method loads adapter weights from a checkpoint directory.
2. The `base_model` argument is the model the adapter is applied to. It was reloaded in an earlier cell from the same `CFG.model_name` and with the same quantization config that were used for training.
3. The second argument, `"/content/peft-dialogue-summary-training/final-checkpoint/checkpoint-"+str(CFG.maxsteps)`, is the path to the checkpoint directory. The `Trainer` names its checkpoint directories `checkpoint-<step>`, so using `CFG.maxsteps` selects the checkpoint saved at the final training step. The absolute `/content/...` prefix matches the relative `output_dir` used for training only when the working directory is `/content`, as on Google Colab.
4. The `torch_dtype=torch.float16` argument specifies the data type that should be used for the model's weights and activations. In this case, it's set to `torch.float16`, which is a 16-bit floating-point format that can help reduce memory usage and improve performance on certain hardware platforms.
5. The `is_trainable=False` argument loads the adapter in inference mode: its weights are frozen and will not be updated.

The resulting `ft_model` combines the quantized base model with the fine-tuned adapter and is used for inference and evaluation in the cells that follow.
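One way to avoid the hard-coded absolute path is to derive the checkpoint directory from the same `output_dir` that was passed to `TrainingArguments`. The sketch below is a suggestion, not part of the original notebook; `max_steps = 100` is a stand-in for `CFG.maxsteps`, chosen to match the `global_step=100` in the training output.

```python
import os

# Same value as output_dir in the training cell.
output_dir = './peft-dialogue-summary-training/final-checkpoint'

# Stand-in for CFG.maxsteps (assumption: 100, as in the training log).
max_steps = 100

# The Trainer saves checkpoints in sub-directories named "checkpoint-<step>".
checkpoint_dir = os.path.join(output_dir, f"checkpoint-{max_steps}")
print(checkpoint_dir)
```

The resulting `checkpoint_dir` could then be passed as the second argument to `PeftModel.from_pretrained`, which keeps the loading cell working regardless of where the notebook runs.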

In [None]:
index = 10
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

peft_model_res = gen(ft_model,prompt,100,)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
#print(peft_model_output)
prefix, success, result = peft_model_output.partition('#End')
zs = zeroshot_responses['f' + str(index)]

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'OG MODEL:\n{zs}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# attends Brian's birthday pa

This code is generating a summary of a conversation using the fine-tuned PEFT model and comparing it to the original model's output, as well as a human-written baseline summary.

Here's what's happening:

1. The code selects a specific example from the test dataset, identified by the `index` variable (set to 10 in this case). It retrieves the conversation text (`dialogue`) and the corresponding human-written summary (`summary`) for that example.
2. A prompt is constructed by concatenating an instruction to summarize the conversation with the conversation text itself, followed by an "Output:\n" label.
3. The `gen` function (not shown in this code snippet) is called with the fine-tuned PEFT model (`ft_model`), the prompt, and a maximum output length of 100 tokens. This generates a response from the model, which is stored in the `peft_model_res` variable.
4. The model's response is then processed to extract the actual generated text, by splitting the response at the "Output:\n" label and taking the second part (i.e., everything after the label). This extracted text is stored in the `peft_model_output` variable.
5. The `partition` method splits `peft_model_output` at the first "#End" marker into three parts: `prefix`, `success`, and `result`. If the marker is present, `prefix` holds everything before it. If the marker is absent, `prefix` holds the entire output and both `success` and `result` are empty strings. In either case, only `prefix` is used.
6. The code retrieves the original model's output for the same example from the `zeroshot_responses` dictionary, which was computed earlier.
7. A dashed line (`dash_line`) is printed to separate the different sections of output.
8. The input prompt, baseline human summary, original model's output, and PEFT model's output are all printed out for comparison.

By generating summaries using both the original model and the fine-tuned PEFT model, the code can compare their performance and evaluate the effectiveness of the PEFT fine-tuning approach.
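
The post-processing in steps 4 and 5 can be sketched in isolation. This is a minimal illustration, not the notebook's code: `extract_summary` is a hypothetical helper, the sample strings are made up, and the final `strip()` is an addition that the notebook does not perform.

```python
def extract_summary(decoded: str,
                    answer_label: str = "Output:\n",
                    end_marker: str = "#End") -> str:
    """Hypothetical helper: return the generated answer, cut at the end marker."""
    # Mirror the notebook: keep the text that follows the first answer label.
    answer = decoded.split(answer_label)[1]
    # str.partition returns (before, separator, after). When the marker is
    # missing, `before` is the whole string and the other two parts are empty.
    prefix, found, rest = answer.partition(end_marker)
    # Assumption: trailing whitespace is not wanted (not done in the notebook).
    return prefix.strip()


# Made-up decoded outputs, one with and one without the end marker.
with_marker = "Instruct: Summarize.\nOutput:\nThey greet each other.\n#End of output#"
without_marker = "Instruct: Summarize.\nOutput:\nThey greet each other."

print(extract_summary(with_marker))
print(extract_summary(without_marker))
```

Both calls yield the same summary text, which shows why the code only needs `prefix` regardless of whether the marker was generated.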

In [None]:
index = 50
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

peft_model_res = gen(ft_model,prompt,100,)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
#print(peft_model_output)
prefix, success, result = peft_model_output.partition('#End')
zs = zeroshot_responses['f' + str(index)]

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'OG MODEL:\n{zs}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Yeah. Just pull on this strip. Then peel off the back.
#Person2#: You might make a few enemies this way.
#Person1#: If they don't think this is fun, they're not meant to be our friends.
#Person2#: You mean your friends. I think it's cruel.
#Person1#: Yeah. But it's fun. Look at those two ugly old ladies. . . or are they men?
#Person2#: Hurry! Get a shot!. . . Hand it over!
#Person1#: I knew you'd come around. . .
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is about to make a prank. #Person2# thinks it's cruel at first but then joins.

---------------------------------------------------------------------------------------------------
OG MODEL:
Person 1 and Person 2 are discussing a prank involving a strip. Person 1 sugg

In [None]:
index = 150
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

peft_model_res = gen(ft_model,prompt,100,)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
#print(peft_model_output)
prefix, success, result = peft_model_output.partition('#End')
zs = zeroshot_responses['f' + str(index)]

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'OG MODEL:\n{zs}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Taxi!
#Person2#: Where will you go, sir?
#Person1#: Friendship Hotel.
#Person2#: OK, it's not far from here.
#Person1#: I have something important to do, can you fast the speed?
#Person2#: Sure, I'll try my best. Here we are.
#Person1#: It's fast! How much should I pay you?
#Person2#: The reading on the meter is 15 yuan.
#Person1#: Here's 20 yuan, keep the change.
#Person2#: Thank you very much.
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# takes a taxi to the Friendship Hotel for something important.

---------------------------------------------------------------------------------------------------
OG MODEL:
Person 1 takes an Uber from the airport to the Friendship Hotel and asks the driver to go fast. The driver agree

In [None]:
index = 450
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

peft_model_res = gen(ft_model,prompt,100,)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
#print(peft_model_output)
prefix, success, result = peft_model_output.partition('#End')
zs = zeroshot_responses['f' + str(index)]

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'OG MODEL:\n{zs}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: What's up?
#Person2#: I guess there is some kind of virus seeking into my computer, I can't send out this e-mail. Do you have the number of the text port?
#Person1#: Do you mind I have a look at your computer?
#Person2#: Of course not, I appreciate that.
#Person1#: Well, it has nothing to do with virus. The problem is your attachment is a bit larger. It has exceeded the e-mail capacity.
#Person2#: I see. What can I do now?
#Person1#: You can send a compressed one.
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# finds that #Person2# e-mail exceeds capacity and suggests #Person2# compress the email.

---------------------------------------------------------------------------------------------------
OG MODEL:
Person 1: What's

# Numerical eval

In [None]:
original_model = AutoModelForCausalLM.from_pretrained(CFG.model_name,
                                                      device_map='auto',
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)


In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    human_baseline_text_output = human_baseline_summaries[idx]
    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

    original_model_res = gen(original_model,prompt,100,)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]

    peft_model_res = gen(ft_model,prompt,100,)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    print(peft_model_output)
    peft_model_text_output, success, result = peft_model_output.partition('###')

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Ms. Dawson takes a dictation for #Person1#, who wants to restrict all office communications to email and memos. Ms. Dawson asks if the restriction applies to external communications as well. #Person1# says yes, and warns that employees who continue to use Instant Messaging will face termination.





Ms. Dawson takes a dictation for #Person1# to write a memo about a new office communication policy. The memo will restrict all communications to email and official memos. Any employee who continues to use Instant Messaging will face termination.





Ms. Dawson takes a dictation for #Person1# to write a memo about a new office communication policy. The memo will restrict all communications to email and official memos. Any employee who continues to use Instant Messaging will face termination.





#Person2# decides to take public transport and start biking to work to reduce pollution and stress.





#Person2# decides to take public transport and start biking to work to reduce pollution and stress.





#Person2# decides to quit driving to work and start taking public transport or biking to work.





Kate tells #Person2# that Masha and Hero are getting divorced. They are having a separation for 2 months and filed for divorce. Masha and Hero are well matched and no quarrelling about who get the house and stock. The divorce will be final in the New Year.

#End of output#





Kate tells Masha and Hero are getting divorced. They are having a separation for 2 months and filed for divorce. Masha and Hero are well matched and no quarrelling about who get the house and stock. The divorce will be final in the New Year.

#Person1#: Kate, you never believe what's happened.
#Person2#: What do you mean?
#Person1#: Well, I don't really know, but I heard that Masha




Kate tells Masha and Hero are getting divorced. They are having a separation for 2 months and filed for divorce. Masha and Hero are well matched and no quarrelling about who get the house and stock. The divorce will be final in the New Year.





Brian's birthday party is a success. Brian and #Person1# have a dance and #Person1# compliments Brian's appearance.

#Person1# and #Person2# have a drink together to celebrate Brian's birthday.



| | human_baseline_summaries | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Person 1: Ms. Dawson, I need you to take a dic... | Ms. Dawson takes a dictation for #Person1#, wh... |
| 1 | In order to prevent employees from wasting tim... | Person 1: Ms. Dawson, I need you to take a dic... | Ms. Dawson takes a dictation for #Person1# to ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | Person 1: Ms. Dawson, I need you to take a dic... | Ms. Dawson takes a dictation for #Person1# to ... |
| 3 | #Person2# arrives late because of traffic jam.... | Person1 and Person2 are discussing the traffic... | #Person2# decides to take public transport and... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | Person2: I'm going to stop driving to work bec... | #Person2# decides to take public transport and... |
| 5 | #Person2# complains to #Person1# about the tra... | Person1 and Person2 are discussing the traffic... | #Person2# decides to quit driving to work and ... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Kate informed that Masha and Hero are getting ... | Kate tells #Person2# that Masha and Hero are g... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Kate informed that Masha and Hero are getting ... | Kate tells Masha and Hero are getting divorced... |
| 8 | #Person1# and Kate talk about the divorce betw... | Kate informed that Masha and Hero are getting ... | Kate tells Masha and Hero are getting divorced... |
| 9 | #Person1# and Brian are at the birthday party ... | Person1 and Person2 are at a party, and Person... | Brian's birthday party is a success. Brian and... |


This code is generating summaries for a set of dialogues using both the original model and the fine-tuned PEFT model, and comparing them to human-written baseline summaries. The results are then stored in a pandas DataFrame.

Here's what's happening:

1. The code selects a subset of 10 dialogues from the test dataset, along with their corresponding human-written baseline summaries.
2. Three empty lists are created to store the summaries generated by:
	* `original_model_summaries`: the original model
	* `instruct_model_summaries`: (not used in this code snippet)
	* `peft_model_summaries`: the fine-tuned PEFT model
3. The code loops through each dialogue and its corresponding human baseline summary.
4. For each dialogue, a prompt is constructed to instruct the models to summarize the conversation.
5. The original model and the PEFT model are used to generate summaries for each dialogue, using the `gen` function.
6. The generated summaries are processed to extract the actual text output, by splitting the response at the "Output:\n" label.
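7. The PEFT model's output is additionally partitioned on the "###" marker, and only the first part (`peft_model_text_output`) is kept. None of the printed outputs contains "###", so this first part is the entire output, including trailing text such as "#End of output#" or an echoed dialogue. Note that the single-example cells partition on "#End" instead.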
8. The generated summaries are appended to their respective lists.
9. The human baseline summaries, original model summaries, and PEFT model summaries are zipped together into a list of tuples, where each tuple contains one summary from each category.
10. A pandas DataFrame is created from the zipped list, with columns named "human_baseline_summaries", "original_model_summaries", and "peft_model_summaries".
11. The resulting DataFrame (`df`) can be used to compare and analyze the performance of the original model and the fine-tuned PEFT model.

By storing the summaries in a DataFrame, the code provides an easy way to access and manipulate the data for further analysis or evaluation.
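
The collection loop can be sketched without a model by passing the generator in as a function. This is an illustration under stated assumptions: `collect_summaries` and `echo_first_turn` are hypothetical names, and `echo_first_turn` is a stand-in that only mimics the shape of a decoded output (prompt followed by generated text). It is not the notebook's `gen`.

```python
from typing import Callable, List, Sequence, Tuple

PROMPT_TEMPLATE = "Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"


def collect_summaries(dialogues: Sequence[str],
                      references: Sequence[str],
                      generate: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Pair each human reference with the text generated after 'Output:\\n'."""
    rows = []
    # zip keeps dialogue and reference aligned without manual indexing.
    for dialogue, reference in zip(dialogues, references):
        prompt = PROMPT_TEMPLATE.format(dialogue=dialogue)
        decoded = generate(prompt)
        summary = decoded.split("Output:\n")[1]
        rows.append((reference, summary))
    return rows


def echo_first_turn(prompt: str) -> str:
    """Hypothetical stand-in for a model: echoes the prompt plus a fake answer."""
    first_turn = prompt.split("\n")[1]
    return prompt + f"Summary of: {first_turn}"


rows = collect_summaries(
    dialogues=["#Person1#: Taxi!\n#Person2#: Where to?"],
    references=["#Person1# takes a taxi."],
    generate=echo_first_turn,
)
print(rows)
```

A list of tuples in this shape can be handed to `pd.DataFrame(rows, columns=[...])` in the same way the notebook does with `zipped_summaries`.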

In [None]:

rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')


ORIGINAL MODEL:
{'rouge1': 0.2907777637564819, 'rouge2': 0.11537375527395644, 'rougeL': 0.22293006187622993, 'rougeLsum': 0.23326764700409144}
PEFT MODEL:
{'rouge1': 0.400124870806793, 'rouge2': 0.14012417659234377, 'rougeL': 0.3003604714495282, 'rougeLsum': 0.3108840296482332}
Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 10.93%
rouge2: 2.48%
rougeL: 7.74%
rougeLsum: 7.76%


This code is using the ROUGE evaluation metric to compare the quality of summaries generated by the original model and the fine-tuned PEFT model.

Here's what's happening:

1. The `rouge` evaluator is loaded from the Hugging Face `evaluate` library.
2. The `compute` method of the `rouge` evaluator is used to calculate the ROUGE scores for:
	* `original_model_summaries`: the summaries generated by the original model
	* `peft_model_summaries`: the summaries generated by the PEFT model
3. The `references` argument specifies the human-written baseline summaries that are being compared against.
4. The `use_aggregator=True` argument aggregates the per-example scores over all prediction/reference pairs into a single value per metric. With `use_aggregator=False`, `compute` returns a list of per-example scores for each metric instead.
5. The `use_stemmer=True` argument enables the use of a stemmer to normalize the words in the summaries and references.
6. The results are stored in the `original_model_results` and `peft_model_results` variables, which contain dictionaries with the ROUGE scores (e.g., ROUGE-1, ROUGE-2, ROUGE-L).
7. The results for both models are printed to the console.
8. The absolute improvement of the PEFT model over the original model is calculated by subtracting the original model's ROUGE scores from the PEFT model's ROUGE scores. The code does not divide by the original scores, so the printed values are differences in percentage points, not relative improvements.
9. The improvement is printed to the console for each ROUGE metric (e.g., ROUGE-1, ROUGE-2, ROUGE-L).
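
The difference between an absolute gain in percentage points and a relative improvement can be checked with the scores printed above. This is a small arithmetic sketch. The dictionaries are copied from the cell output, and the relative figure is an extra computation that the notebook itself does not report.

```python
# Scores copied from the cell output above.
original = {'rouge1': 0.2907777637564819, 'rouge2': 0.11537375527395644,
            'rougeL': 0.22293006187622993, 'rougeLsum': 0.23326764700409144}
peft = {'rouge1': 0.400124870806793, 'rouge2': 0.14012417659234377,
        'rougeL': 0.3003604714495282, 'rougeLsum': 0.3108840296482332}

absolute_points = {}
relative_percent = {}
for name in original:
    difference = peft[name] - original[name]
    # What the notebook prints: the raw difference scaled by 100.
    absolute_points[name] = difference * 100
    # Not computed in the notebook: the gain relative to the original score.
    relative_percent[name] = difference / original[name] * 100
    print(f'{name}: +{absolute_points[name]:.2f} points, '
          f'{relative_percent[name]:.1f}% relative')
```

The absolute values reproduce the figures in the output (for example 10.93 for `rouge1`), while the relative figures are noticeably larger because the original scores are well below 1.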

The ROUGE metrics are used to evaluate the quality of text summaries by comparing them against human-written references. The metrics include:

* ROUGE-1: measures the overlap between the summary and reference at the unigram level
* ROUGE-2: measures the overlap between the summary and reference at the bigram level
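* ROUGE-L: measures the longest common subsequence between the summary and reference
* ROUGE-Lsum: a variant of ROUGE-L that splits the text on newlines and computes the longest common subsequence sentence by sentence (reported as `rougeLsum` in the results above)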

By comparing the ROUGE scores of the original model and the PEFT model, this code can help determine whether the fine-tuning process has improved the quality of the generated summaries.
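
To make the metrics concrete, here is a simplified pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It is an illustration only: it assumes lowercased whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` implementation that `evaluate` uses. All function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_score(matches, pred_total, ref_total):
    """Harmonic mean of precision and recall, 0.0 when nothing matches."""
    if matches == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter & keeps the minimum counts
    return f1_score(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, one DP row at a time."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f1_score(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(prediction, reference, 1))
print(rouge_n(prediction, reference, 2))
print(rouge_l(prediction, reference))
```

In this toy pair, five of six unigrams and three of five bigrams overlap, and the longest common subsequence has five tokens, which is why the bigram score is the lowest of the three.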