
# **Background**

Welcome to the C4AI Scholars Program Take-Home Challenge! This exercise is designed to allow you to showcase your engineering and problem-solving skills. The challenge consists of several parts, including:

*   Identifying bugs, and getting the code working. This is designed to test your ability to grapple with real world engineering challenges.
*   Testing your ability to generate code for a specified problem.
*   An opportunity for you to attempt an optional challenge question that extends the original problem set.

These tasks were chosen as a setting to see how you think about problems, even if they are not in your own research field of interest. The tasks and dataset are not meant to be indicative of the research goals of the Scholar Program. We purposefully have selected a simple toy problem so the focus is on how you think, and does not require significant machine learning resources (can be run in this colab).

Good luck!

**How to Use and Submit this Document?**

*   **Make a copy of this document** and rename it **Firstname_Lastname_C4AIScholarsChallenge**
*   Once you have completed all tasks, save and pin your revisions
*   Submit the assignment by responding directly to this email with a link to your final document by Sunday, September 15th, 11 PM PDT.

## **Coding Challenge Part 1: Debugging custom SmolLM code [10 points]**

In this coding challenge, you are required to debug and fix a bare-bones implementation of the following model.

**Model** : SmolLM-135M can be found at [HuggingFace](https://huggingface.co/HuggingFaceTB/SmolLM-135M).

We have 10 bugs in the following implementation.
There is a `check_solution` function for your convenience to verify you have correctly identified all the bugs. If you have found all bugs, the generated outputs will match the reference model exactly.

**Rules**:
1. **Bug Definition:**
  - There are 10 bugs to be fixed.
  - A bug is *defined as **{incorrect, missing, unnecessary}** lines of code*.
  - You earn 1 point for each correctly identified and fixed bug.
2. **Fix Guidelines:**
  - You are encouraged to make the smallest possible fix, wherever possible (e.g. edit a line instead of replacing it entirely).
  - Do not optimize the code; only fix the bugs. The implementation is *intentionally* non-optimized but valid.
3. **Documentation:** Document each fix by adding a comment on the line above the fix: `### BUG FIX ###`.
4. **Sections:** *1. Setup [Helper Functions]* and *3. Test* don't contain bugs and shouldn't be changed.
5. **Submission:** Your final submission should be the exact same file except with your proposed fixes and the respective comments as per Rule #3.

## 1. Setup [Helper Functions]

In [2]:
######################################################################################################################
############################################## DO NOT CHANGE[START] ##################################################
######################################################################################################################


# [Don't use. Rate limit issues.] Use gdown to get weights file(BareBones_SmolLM-135M.pt) at https://drive.google.com/file/d/1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU/view . gdown should be installed by default else use `pip install gdown`
# !gdown 1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU


# [Recommended]Use HF to download the weights
!git lfs install
!git clone https://huggingface.co/dsouzadaniel/C4AI_SMOLLM135
!mv C4AI_SMOLLM135/BareBones_SmolLM-135M.pt ./
!ls

Git LFS initialized.
Cloning into 'C4AI_SMOLLM135'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (6/6), 2.11 KiB | 1.06 MiB/s, done.
BareBones_SmolLM-135M.pt  C4AI_SMOLLM135  sample_data


In [None]:
!pip install peft
!pip install transformers
!pip install datasets
!pip install accelerate
!pip install evaluate
!pip install trl

In [2]:
# Libraries
import torch
import torch.nn.functional as F
from torch import nn
import math
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model initialization/settings
checkpoint="HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

__reference_model = AutoModelForCausalLM.from_pretrained(checkpoint)
__reference_model.eval()

class smolConfig:
    vocab_size=49152
    hidden_size=576
    intermediate_size=1536
    num_hidden_layers = 30
    num_heads = 9
    kv_heads=3
config = smolConfig

# Helper Functions
def __generate(model, inputs, num_tokens):
    collect = []
    for _ in range(num_tokens):
        output = model(**inputs)
        output_id = torch.argmax(output['logits'][0,-1]).item()
        collect.append(output_id)
        if output_id==tokenizer.eos_token_id:
            break
        inputs['input_ids'] = torch.unsqueeze(torch.cat([inputs['input_ids'][0],torch.tensor([output_id])]),dim=0)
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
    return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(collect))

def check_solution(prompt, num_tokens, model_A, model_B):
    print()
    print(f"{'>'*20}\n\tPrompt\n{'<'*20}\n{prompt}\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_A Generation\n{'<'*30}\n{__generate(model_A,  model_inputs, num_tokens)}")
    print("\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_B Generation\n{'<'*30}\n{__generate(model_B,  model_inputs, num_tokens)}")

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## 2. Custom SmolLM (for BugFixes)

In [None]:
#BUG 1 - Added an early return in repeat_kv for the case n_rep == 1.
#BUG 2 - Moved inv_freq to the same device as the model; the RotaryEmbedder class now takes the device as a parameter.
#BUG 3 - Rewrote the forward pass of RotaryEmbedder, starting by unsqueezing position_ids to add a batch dimension; the previous implementation raised a tensor size mismatch.
#BUG 4 - The RMSNorm forward pass must use rsqrt, not sqrt.
#BUG 5 - In RopeAttention, the output must be passed through the o_proj layer rather than returned directly.
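
The rsqrt note for RMSNorm can be checked without PyTorch. Below is a minimal sketch, assuming plain Python lists stand in for tensors; the helper name `rms_norm` is illustrative and is not part of the challenge code. Multiplying by the reciprocal square root should leave the output with a root-mean-square close to 1, which multiplying by the square root would not.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal root-mean-square of its entries, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # rsqrt: multiply by 1/sqrt, not by sqrt
    return [w * v * inv_rms for v, w in zip(x, weight)]

x = [3.0, 4.0]
y = rms_norm(x, weight=[1.0, 1.0])

# After normalization the root-mean-square of the output is approximately 1.
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(y, rms_after)
```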

In [None]:
#checked
def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

#checked
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def repeat_kv(hidden_states, n_rep):
    if n_rep == 1:  #added if loop #BUG 1
        return hidden_states
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

class RotaryEmbedder(nn.Module):
    def __init__(self, dim, base, device):
        super().__init__()
        self.inv_freq = 1.0/(base ** (torch.arange(0, dim, 2, dtype=torch.int64).float().to(device)/dim)) #have to be on the same device #BUG 2
    # have to revisit
    @torch.no_grad()
    # def forward(self,x):
    #     pos = torch.arange(x.shape[-2],dtype=torch.long)
    #     angles = torch.einsum('f,p->fp', self.freq, pos.float()).unsqueeze(dim=0)
    #     emb = torch.cat((angles, angles), dim=-1)
    #     return emb.cos(), emb.sin()
    def forward(self, x, position_ids=None):
        # Default to positions 0..seq_len-1 with a leading batch dimension  #BUG - 3
        if position_ids is None:
            position_ids = torch.arange(x.shape[-2], dtype=torch.long, device=x.device).unsqueeze(0)
        # inv_freq is a plain attribute (not a buffer), so move it to the input's device explicitly
        inv_freq_expanded = self.inv_freq[None, :, None].float().to(x.device).expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()
        device_type = x.device.type
        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
        with torch.autocast(device_type=device_type, enabled=False):
            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
            emb = torch.cat((freqs, freqs), dim=-1)
            cos = emb.cos()
            sin = emb.sin()
        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


class MLP(nn.Module):
    #checked
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.W_gate = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_up = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_down = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = torch.nn.modules.activation.SiLU()
    #checked
    def forward(self, x):
        # Gated MLP: apply the activation to the gate projection only, then multiply by the up projection
        down_proj = self.W_down(self.act_fn(self.W_gate(x)) * self.W_up(x))
        return down_proj

class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-5): #modified the default value according to the blogs
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        # Compute the norm in float32 for numerical stability, then cast back to the input dtype
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) #the original code applied sqrt, but RMSNorm uses rsqrt. #BUG 4
        return self.weight * hidden_states.to(input_dtype)

class RopeAttention(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.hidden_size=config.hidden_size
        self.num_heads = config.num_heads
        self.head_dim = config.hidden_size//self.num_heads
        self.kv_heads = config.kv_heads
        self.rope_theta = 10000.0
        # max_position_head missing

        self.W_query = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.W_key = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_value = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_output = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.rotary_emb = RotaryEmbedder(base=self.rope_theta,
                                         dim=config.hidden_size//self.num_heads,
                                         device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')) #BUG - 5

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask= None,
        position_ids = None,
    ):
        b, q, _ = hidden_states.size()

        q_states = self.W_query(hidden_states)
        k_states = self.W_key(hidden_states)
        v_states = self.W_value(hidden_states)

        q_states = q_states.view(b, q, self.num_heads, self.head_dim).transpose(1, 2)
        k_states = k_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)
        v_states = v_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)

        cos, sin = self.rotary_emb(v_states, position_ids)
        q_states, k_states = apply_rotary_pos_emb(q_states, k_states, cos, sin)

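        __kv_groups = self.num_heads // self.kv_heads  # integer division: expand() and reshape() need an int
        k_states = repeat_kv(k_states, __kv_groups)
        v_states = repeat_kv(v_states, __kv_groups)

        # Scaled dot-product attention divides by sqrt(head_dim), not sqrt(hidden_size)
        attn_weights = torch.matmul(q_states, k_states.transpose(2, 3)) / math.sqrt(self.head_dim)
        attn_weights = attn_weights + attention_mask  # expects an additive mask: 0 for visible positions, -inf for masked ones
        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
        # F.dropout defaults to training=True, so tie it to the module's mode to keep eval() deterministic
        attn_weights = nn.functional.dropout(attn_weights, training=self.training)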

        attn_output = torch.matmul(attn_weights, v_states)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(b, q, -1)
        attn_output = self.W_output(attn_output)  # project through the output layer defined in __init__  #BUG 5
        return attn_output

class LlamaDecoder(nn.Module):
    def __init__(self,config):
        super().__init__()
        #hidden layer size not assigned.
        self.self_attn = RopeAttention(config)
        self.mlp = MLP(hidden_size=config.hidden_size, intermediate_size=config.intermediate_size)
        self.pre_attn_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)
        self.pre_mlp_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(self,hidden_states, attention_mask):
        residual = hidden_states
        hidden_states = self.pre_attn_rmsnorm(hidden_states)
        # Build the additive causal mask: 0 on and below the diagonal, -inf above it
        attention_mask = torch.triu(torch.full((attention_mask.shape[-1],attention_mask.shape[-1]), fill_value=float('-inf'), device=hidden_states.device),diagonal=1)

        hidden_states = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
        )
        hidden_states += residual
        residual = hidden_states #update the residual #BUG 3
        hidden_states = self.pre_mlp_rmsnorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states += residual

        outputs = (hidden_states,)

        return outputs

class smolModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.rope_theta = 10000.0
        self.num_heads = config.num_heads
        #padding and vocab initiation missing
        self.embed_tokens = nn.Embedding(num_embeddings=config.vocab_size,
                                         embedding_dim=config.hidden_size) #padding not passed over as a parameter
        self.layers = nn.ModuleList([LlamaDecoder(config) for _ in range(config.num_hidden_layers)]) #of hidden layers are the same.  #layer_idx is not passed over as a parameter.
        self.norm = RMSNorm(config.hidden_size, eps=1e-05) #same

        self.rotary_emb = RotaryEmbedder(base=self.rope_theta,
                                        dim=config.hidden_size//self.num_heads,
                                        device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')) #BUG - 5

    def forward(
        self,
        input_ids= None,
        attention_mask= None,
        position_ids = None,
    ):
        inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
        position_embeddings = self.rotary_emb(hidden_states, position_ids) #BUG - 6 create the rotary embedding here.
        for decoder_layer in self.layers:
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
            )
            hidden_states = layer_outputs[0]
        hidden_states = self.norm(hidden_states)
        return [hidden_states]

class smolLM(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.model = smolModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self,input_ids,attention_mask):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden_states = outputs[0]  # keep the batch dimension: __generate indexes the logits as [0, -1]
        logits = self.lm_head(hidden_states)
        logits = logits.float()
        return {'logits':logits}

In [None]:
__test_model = smolLM(config)
__test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)
__test_model.eval()

# 3. Test

In [None]:
######################################################################################################################
############################################## DO NOT CHANGE[START] ##################################################
######################################################################################################################

###### TESTING PROMPTS
# Single-Token Quick Test
check_solution(prompt="Given the following film movie by a critic, rate it out of 10. Respond in a single number.\n\nThe movie started off extremely well, but just got worse after that.\nThe storyline was all over the place and everyone acted terribly.\n 10/10 would not recommend! \n\n ",
               num_tokens=1,
               model_A=__reference_model,
               model_B=__test_model)


In [None]:
# Multi-Token Quick Test
check_solution(prompt="Where is the Nile located?",
               num_tokens=50,
               model_A=__reference_model,
               model_B=__test_model)

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################

## Documentation:

https://docs.google.com/document/d/1De1v9IXa23Ha6-7Ik44pW7Ea8_L3d0m5DhEGyeeGCTY/edit?usp=sharing

# **Coding Challenge Part 2: Teach SmolLM to do grammatical error correction [15 points]**

The goal of this part is to train the SmolLM-135M model to perform grammatical error correction (GEC) using the Grammarly CoEdIT dataset. This [dataset](https://huggingface.co/datasets/grammarly/coedit), derived from the [CoEdIT project](https://arxiv.org/abs/2305.09857), provides a rich collection of text editing instructions and examples. The task involves several key steps that mimic conventional alignment processes:




## **2.1 Supervised Fine-Tuning (SFT) on Training Data [5 points]**

* Fine-tune the [SmolLM-135M model](https://huggingface.co/HuggingFaceTB/SmolLM-135M) using the CoEdIT dataset, which includes input sentences with grammatical errors and their corrected versions. Use the training GEC portion of the CoEdIT dataset to teach the model how to correct grammatical errors effectively.
* Calculate the BLEU score on the validation set to evaluate the model's performance in generating grammatically correct sentences. Ensure that this evaluation process is reusable for later comparisons.
* Search for an optimal set of hyperparameters, such as the learning rate. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters. **Do not train for more than 3 epochs -- we do not expect extensive training time.**
* For Part 2, don't use additional libraries. If an imported library is missing, install it with **pip install**.
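
To make the BLEU requirement above concrete, here is a minimal sketch of a reusable corpus-level BLEU function in plain Python. It is a simplified stand-in, not the implementation in the `evaluate` library: it assumes whitespace tokenization, a single reference per prediction, and no smoothing, so its scores may differ from the ones reported by `evaluate.load("bleu")`. The function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(predictions, references, max_order=4):
    """Corpus-level BLEU with clipped n-gram precision and a brevity penalty."""
    matches = [0] * max_order
    totals = [0] * max_order
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.split(), ref.split()  # assumption: whitespace tokenization
        pred_len += len(p)
        ref_len += len(r)
        for n in range(1, max_order + 1):
            p_counts, r_counts = ngrams(p, n), ngrams(r, n)
            # Counter intersection clips each n-gram count at the reference count
            matches[n - 1] += sum((p_counts & r_counts).values())
            totals[n - 1] += sum(p_counts.values())
    if pred_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any order without a match gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_precision)

perfect = corpus_bleu(["She does not like green apples ."], ["She does not like green apples ."])
disjoint = corpus_bleu(["x y z w"], ["a b c d"])
print(perfect, disjoint)
```

Because the function only takes lists of strings, the same call can be reused after SFT and after any later training stage for a like-for-like comparison.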

In [25]:
from datasets import load_dataset

# Download the GEC data
full_train_ds = load_dataset("grammarly/coedit", split="train")
full_test_ds = load_dataset("grammarly/coedit", split="validation")

README.md:   0%|          | 0.00/1.88k [00:00<?, ?B/s]

train.jsonl:   0%|          | 0.00/19.7M [00:00<?, ?B/s]

validation.jsonl:   0%|          | 0.00/692k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/69071 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1712 [00:00<?, ? examples/s]

In [26]:
# TODO: Filter examples, keeping only GEC task

full_train_ds = full_train_ds.filter(lambda example: example['task'] == 'gec')
full_test_ds = full_test_ds.filter(lambda example: example['task'] == 'gec')

Filter:   0%|          | 0/69071 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1712 [00:00<?, ? examples/s]

The expected numbers of train and test samples are 19823 and 485, respectively.

In [27]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

model_name = "HuggingFaceTB/SmolLM-135M"

# TODO: Load the model and the tokenizer from huggingface
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = "<empty_output>"  # reuse a rarely used special token as pad_token, since the tokenizer warned that none was set
    print("Assigned pad_token:", tokenizer.pad_token)

if tokenizer.eos_token is None:
    tokenizer.eos_token = "<|endoftext|>"
    print("Assigned eos_token:", tokenizer.eos_token)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    device_map="auto",
    # attn_implementation="flash_attention_2", # tried this, but it raised an error asking for weights.
    torch_dtype=torch.bfloat16 # as per the blog post
)

model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id

print("Model vocab size:", model.config.vocab_size)
print("Tokenizer vocab size:", len(tokenizer))
# No need to resize the model embeddings, since we are reusing an existing token

Assigned pad_token: <empty_output>
Model vocab size: 49152
Tokenizer vocab size: 49152


In [41]:
# TRL - Transformer Reinforcement Learning -- https://huggingface.co/docs/trl/en/index
from trl import SFTConfig, SFTTrainer

# TODO: Run SFT
sft_config = SFTConfig(
    output_dir="smol_output",
    bf16=True,
    max_seq_length=2048,
    do_eval=True,  # evaluation_strategy="epoch" turns evaluation on anyway, so state it explicitly
    evaluation_strategy="epoch",
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    num_train_epochs=1,
    seed=42,
    save_strategy="epoch",
    lr_scheduler_type="cosine",
    max_steps=-1,
    weight_decay=0.1,
    per_device_eval_batch_size=16,
    per_device_train_batch_size=16,
    learning_rate=3e-03,          # Higher learning rate, as suggested by the launch blog and the SFT recipes.
    logging_dir="smol_logs",
    log_level="debug",
    logging_steps=100,
    packing=True,  # Important for preprocessed datasets according to the SFTTrainer page
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


#### Preparing the data and Training

In [39]:
# 2. Modifying the Tokenization Function

def tokenize_function(example):
    source = example['src']
    target = example['tgt']

    sep_token = tokenizer.sep_token if tokenizer.sep_token else '\n'
    eos_token = tokenizer.eos_token if tokenizer.eos_token else '<|endoftext|>'
    # Merge source, separator, target, and eos_token as the string to train the model on
    input_text = source + sep_token + target + eos_token
    # input_text = f"### Instruction: {source} {sep_token} ### Answer: {target} {eos_token}" # alternative format I tried; it performed worse than the plain format above.

    tokenized_input = tokenizer(
        input_text,
        max_length=2048,
        padding='max_length',
        truncation=True,
    )

    tokenized_source = tokenizer(
        source + sep_token,
        max_length=1024,
        truncation=True,
    )

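    source_len = len(tokenized_source['input_ids'])

    # Compute the loss on target tokens only: mask the source prompt and the padding with -100
    # (assumes right padding, so the source occupies the first source_len positions)
    labels = [
        token_id if (i >= source_len and mask == 1) else -100
        for i, (token_id, mask) in enumerate(zip(tokenized_input['input_ids'], tokenized_input['attention_mask']))
    ]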

    tokenized_input['labels'] = labels
    tokenized_input['source'] = source
    tokenized_input['target'] = target

    return tokenized_input

full_train_ds = full_train_ds.map(
    tokenize_function,
    batched=False,
    remove_columns=[],
)

full_test_ds = full_test_ds.map(
    tokenize_function,
    batched=False,
    remove_columns=[],
)

Map:   0%|          | 0/19823 [00:00<?, ? examples/s]

Map:   0%|          | 0/485 [00:00<?, ? examples/s]
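
The label construction in the tokenization cell can be illustrated on a toy example. This is a minimal sketch, assuming right padding and made-up token ids; `build_labels` and `IGNORE_INDEX` are hypothetical names. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it do not contribute to the training loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_labels(input_ids, attention_mask, prompt_len):
    """Keep target tokens as labels; mask the prompt and the padding."""
    return [
        tok if (i >= prompt_len and keep == 1) else IGNORE_INDEX
        for i, (tok, keep) in enumerate(zip(input_ids, attention_mask))
    ]

# Toy sequence: 3 prompt tokens, 3 target tokens (ending in an EOS id of 2), 2 pad tokens.
input_ids      = [11, 12, 13, 21, 22, 2, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]

labels = build_labels(input_ids, attention_mask, prompt_len=3)
print(labels)  # [-100, -100, -100, 21, 22, 2, -100, -100]
```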

In [42]:
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator()

trainer = SFTTrainer(
    model=model,
    train_dataset=full_train_ds,
    eval_dataset=full_test_ds,
    args=sft_config,
    data_collator=data_collator,
)

trainer.train()

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM-135M/snapshots/1d461723eec654e65efdc40cf49301c89c0c92f4/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM-135M/snapshots/1d461723eec654e65efdc40cf49301c89c0c92f4/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM-135M/snapshots/1d461723eec654e65efdc40cf49301c89c0c92f4/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM-135M/snapshots/1d461723eec654e65efdc40cf49301c89c0c92f4/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM-135M/snapshots/1d461723eec654e65efdc40cf49301c89c0c92f4/tokenizer_config.json
Using auto half precision backend
Currently training with a batch size of:

Epoch,Training Loss,Validation Loss
0,0.0045,0.032991


Saving model checkpoint to smol_output/checkpoint-309
Configuration saved in smol_output/checkpoint-309/config.json
Configuration saved in smol_output/checkpoint-309/generation_config.json
Model weights saved in smol_output/checkpoint-309/model.safetensors
tokenizer config file saved in smol_output/checkpoint-309/tokenizer_config.json
Special tokens file saved in smol_output/checkpoint-309/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `LlamaForCausalLM.forward` and have been ignored: target, task, src, _id, source, tgt. If target, task, src, _id, source, tgt are not expected by `LlamaForCausalLM.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 485
  Batch size = 16
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/g

TrainOutput(global_step=309, training_loss=0.15038072008000608, metrics={'train_runtime': 1256.5462, 'train_samples_per_second': 15.776, 'train_steps_per_second': 0.246, 'total_flos': 2.580823505947853e+16, 'train_loss': 0.15038072008000608, 'epoch': 0.9975786924939467})

In [None]:
# Quick test if your model works properly
def format_text(text: str) -> str:
    # here you may have formatting of the input that you adopted for training
    return text


# Example of how to run inference on a single example
text = "Fix grammatically: I likes turtles"
inputs = tokenizer(format_text(text), return_tensors="pt", padding=True, truncation=True, max_length=128).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy decoding; temperature=0.0 is not a valid sampling temperature
print(tokenizer.decode(outputs[0]))

Expected output: I like turtles.

#### Single example evaluation

In [23]:
model.eval()

def evaluate_single_examples(model, tokenizer, examples):
    preds = []
    targets = []
    model.eval()
    for example in examples:
        source = example['source']
        target = example['target']
        sep_token = tokenizer.sep_token if tokenizer.sep_token else '\n'
        eos_token = tokenizer.eos_token if tokenizer.eos_token else '<|endoftext|>'
        input_text = source + sep_token
        inputs = tokenizer(
            input_text,
            return_tensors='pt',
            truncation=True,
            max_length=1024,
        ).to(model.device)

        outputs = model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=1024,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            do_sample=True, #top_p and temperature only take effect when sampling is enabled
            top_p = 0.95, #parameter mentioned in the smollm blog post
            temperature = 0.2, #parameter mentioned in the smollm blog post
        )

        # Decode the generated text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Extract the generated correction
        generated_correction = generated_text[len(source):].strip()

        # Print the results
        print(f"Original: {source}")
        print(f"Target Correction: {target}")
        print(f"Model Correction: {generated_correction}")
        print("-" * 50)
        preds.append(generated_correction)
        targets.append(target)
    return preds, targets


samples = [
    {'source': "Fix the grammar : She don't like apples.", 'target': "She doesn't like apples."},
    {'source': "Fix the grammar : He go to school everyday.", 'target': "He goes to school every day."},
]

preds, targets = evaluate_single_examples(model, tokenizer, samples)


# preds, targets = evaluate_single_examples(model, tokenizer, full_test_ds.select(range(8))) #temporary use of the test set.

In [None]:
import evaluate
from torch.utils.data import DataLoader
from tqdm.auto import tqdm

test_loader = DataLoader(
    full_test_ds.select_columns(['source', 'target']),  # keep only the raw strings; prompts are tokenized per batch below
    batch_size=16,
)

def evaluate_batch(model, tokenizer, dataloader):
    model.eval()
    bleu = evaluate.load("bleu")
    all_predictions = []
    all_references = []
    sep_token = tokenizer.sep_token if tokenizer.sep_token else '\n'
    tokenizer.padding_side = 'left'  # left-pad so each generation continues directly after its prompt

    for batch in tqdm(dataloader):
        # Prompt with the source only; the training inputs also contain the target and would leak it
        prompts = [source + sep_token for source in batch['source']]
        inputs = tokenizer(
            prompts,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=1024,
        ).to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_length=1024,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
                do_sample=True, #top_p and temperature only take effect when sampling is enabled
                top_p = 0.95, #parameter mentioned in the smollm blog post
                temperature = 0.2, #parameter mentioned in the smollm blog post
            )

        # Drop the prompt tokens so that only the generated correction is decoded
        generated = outputs[:, inputs['input_ids'].shape[1]:]
        corrections = tokenizer.batch_decode(generated, skip_special_tokens=True)
        for correction, target in zip(corrections, batch['target']):
            all_predictions.append(correction.strip())
            all_references.append([target])

    # bleu.compute returns a dict; the score itself is stored under the 'bleu' key
    results = bleu.compute(predictions=all_predictions, references=all_references)
    print(f"BLEU score: {results['bleu']}")

    return all_predictions, all_references, results['bleu']


predictions, references, bleu_score = evaluate_batch(model, tokenizer, test_loader)

In my testing, the model built with this config and with the data packaged this way reached a BLEU score of 0.2, slightly better than when I instruction-tuned the model using a formatting function.

Expected BLEU score after 1 epoch SFT is ~ 0.48.

## **2.2 Create a preference optimization dataset [5 points]**

* *Generate Output Variants* -- for each input sentence in the training set, use the fine-tuned model to generate two different output variants.
 * Consider using different decoding strategies, such as varying the temperature or beam size, to produce diverse outputs. Select an approach based on the desired balance between diversity and quality.

* *Preference Annotation* -- measure the edit distance between each **generated predicted variant** and **ground truth correction**. Label the variant with the lower edit distance as "chosen" and the one with the higher edit distance as "rejected."
 * Beyond using edit distance, what other metrics or methods could you consider to do preference dataset annotation?
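
The annotation rule described above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the cells below (which rely on the `fast_edit_distance` package); the helper names `levenshtein` and `label_pair` and the example sentences are assumptions made for this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def label_pair(variant_a: str, variant_b: str, ground_truth: str) -> dict:
    """Label the variant closer to the ground truth as 'chosen', the other as 'rejected'."""
    dist_a = levenshtein(variant_a, ground_truth)
    dist_b = levenshtein(variant_b, ground_truth)
    if dist_a <= dist_b:  # ties go to the first variant (an arbitrary choice for this sketch)
        return {"chosen": variant_a, "rejected": variant_b}
    return {"chosen": variant_b, "rejected": variant_a}


example = label_pair(
    "She doesn't like apples.",
    "She don't likes apples.",
    "She doesn't like apples.",
)
print(example["chosen"])
```

When both variants are equally distant from the ground truth, the pair carries no preference signal, so it may be worth filtering such ties out rather than labelling them arbitrarily.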


In [4]:
!unzip checkpoint-927.zip

Archive:  checkpoint-927.zip
   creating: checkpoint-927/
  inflating: __MACOSX/._checkpoint-927  
  inflating: checkpoint-927/model.safetensors  
  inflating: __MACOSX/checkpoint-927/._model.safetensors  
  inflating: checkpoint-927/rng_state.pth  
  inflating: __MACOSX/checkpoint-927/._rng_state.pth  
  inflating: checkpoint-927/tokenizer_config.json  
  inflating: __MACOSX/checkpoint-927/._tokenizer_config.json  
  inflating: checkpoint-927/special_tokens_map.json  
  inflating: __MACOSX/checkpoint-927/._special_tokens_map.json  
  inflating: checkpoint-927/optimizer.pt  
  inflating: __MACOSX/checkpoint-927/._optimizer.pt  
  inflating: checkpoint-927/config.json  
  inflating: __MACOSX/checkpoint-927/._config.json  
  inflating: checkpoint-927/scheduler.pt  
  inflating: __MACOSX/checkpoint-927/._scheduler.pt  
  inflating: checkpoint-927/tokenizer.json  
  inflating: __MACOSX/checkpoint-927/._tokenizer.json  
  inflating: checkpoint-927/generation_config.json  
  inflating: __MAC

In [19]:
# load the best model saved
model_loaded = AutoModelForCausalLM.from_pretrained(
    "checkpoint-927",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

In [6]:
!pip install fast_edit_distance

Collecting fast_edit_distance
  Downloading fast_edit_distance-1.2.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.4 kB)
Downloading fast_edit_distance-1.2.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (115 kB)
Installing collected packages: fast_edit_distance
Successfully installed fast_edit_distance-1.2.1


In [None]:
import random

def generate_variants(model, tokenizer, input_text, num_variants=2, max_length=1024):
    """Generate `num_variants` outputs per input, each with a randomly chosen decoding strategy."""
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    # Move the inputs to wherever the model lives.
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    variants = []
    for _ in range(num_variants):
        strategy = random.choice(['temperature', 'top_k', 'top_p', 'beam_search'])
        if strategy == 'temperature':
            gen_kwargs = dict(do_sample=True, temperature=0.7 + torch.rand(1).item() * 0.3)
        elif strategy == 'top_k':
            gen_kwargs = dict(do_sample=True, top_k=random.randint(20, 50))
        elif strategy == 'top_p':
            gen_kwargs = dict(do_sample=True, top_p=0.7 + torch.rand(1).item() * 0.25)
        else:  # beam_search
            gen_kwargs = dict(num_beams=random.randint(3, 5), early_stopping=True)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,  # input_ids and attention_mask
                max_length=max_length,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id,
                **gen_kwargs,
            )
        # generate() returns prompt + continuation, so the decoded text still contains the prompt.
        decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        variants.append(decoded_outputs)
    # Regroup so that each element holds all variants for one input.
    return list(zip(*variants))

In [None]:
from fast_edit_distance import edit_distance
from datasets import Dataset

def create_preference_dataset(model, tokenizer, dataset, batch_size=16, num_samples=1000):
    dataloader = DataLoader(dataset.select(range(num_samples)), batch_size=batch_size, shuffle=False)
    preference_data = []

    for batch in tqdm(dataloader, desc="Processing batches"):
        input_texts = batch['src']
        ground_truths = batch['tgt']

        variants_batch = generate_variants(model, tokenizer, input_texts)

        for input_text, ground_truth, variants in zip(input_texts, ground_truths, variants_batch):
            distances = [edit_distance(variant, ground_truth) for variant in variants]
            chosen_idx = distances.index(min(distances))
            rejected_idx = 1 - chosen_idx

            preference_data.append({
                'prompt': input_text,
                'chosen': variants[chosen_idx],
                'rejected': variants[rejected_idx],
                'ground_truth': ground_truth
            })

    return preference_data

preference_dataset = create_preference_dataset(model_loaded, tokenizer, full_train_ds, batch_size=32, num_samples=6400)
hf_preference_dataset = Dataset.from_list(preference_dataset)

Beyond edit distance, we could also use the BLEU score, use an LLM to judge relevance as a new metric, or convert the texts into embeddings and compare them by similarity.
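
As a rough illustration of swapping in a different metric, the sketch below ranks variants by a word-level similarity ratio from Python's standard `difflib` module. This is only a dependency-free stand-in for the BLEU or embedding-similarity options mentioned above, not an implementation of them; `similarity` and `rank_by_similarity` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def similarity(candidate: str, reference: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means the token sequences are identical."""
    return SequenceMatcher(None, candidate.split(), reference.split()).ratio()


def rank_by_similarity(variants, ground_truth):
    """Pick the most similar variant as 'chosen' and the least similar as 'rejected'."""
    scored = sorted(variants, key=lambda v: similarity(v, ground_truth), reverse=True)
    return {"chosen": scored[0], "rejected": scored[-1]}


pair = rank_by_similarity(
    ["He go to school every day.", "He goes to school every day."],
    "He goes to school every day.",
)
print(pair)
```

Because the scoring function is isolated in `similarity`, a BLEU or embedding-based scorer could be substituted without touching the labelling logic.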

In [None]:
#save the DPO dataset.
output_directory = "./preference_dataset"
hf_preference_dataset.save_to_disk(output_directory)

In [None]:
!pip install datasets

In [11]:
from datasets import load_dataset, Dataset
dataset = load_dataset('json', data_files='/content/preference_dataset.jsonl')
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'ground_truth'],
        num_rows: 6400
    })
})


In [12]:
from sklearn.model_selection import train_test_split

train_dataset, test_dataset = train_test_split(dataset['train'], test_size=0.1, random_state=42)

# Convert back to Hugging Face Datasets
train_dataset = Dataset.from_dict(train_dataset)
test_dataset = Dataset.from_dict(test_dataset)

In [13]:
# TODO: (Load and) Visualize the created dataset -- display at least 5 lines of the dataset.
for i in range(5):
    print(f"Sample {i+1}:")
    print(f"Prompt: {dataset['train'][i]['prompt']}")
    print(f"Chosen: {dataset['train'][i]['chosen']}")
    print(f"Rejected: {dataset['train'][i]['rejected']}")
    print(f"Ground Truth: {dataset['train'][i]['ground_truth']}")
    print("\n")

Sample 1:
Prompt: Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.
Chosen: Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.iticus.
Rejected: Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.ArgumentParsererous
Ground Truth: For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.


Sample 2:
Prompt: Improve the grammaticality: As the number of people grows, the need of habitable environment is unquestionably e

## **2.3 Run Direct Preference Optimization (DPO) [5 points]**
* Use the preference optimization dataset to further train the model through DPO, a method that leverages human-like preferences for model training.
* After running DPO, measure the BLEU score on the test set. Compare this performance to the baseline established during the SFT phase.
* Search for an optimal set of hyperparameters, such as the learning rate and number of epochs. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters.
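
For reference, the objective that DPO optimizes for a single preference pair can be written out in a few lines. This is a minimal sketch of the loss formula only, not TRL's implementation; the function name `dpo_loss` and the example log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy equals the reference, the margin is zero and the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# When the policy favors the chosen response more than the reference does, the loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The `beta` argument plays the same role as `beta` in `DPOConfig`: larger values penalize deviations from the reference model more strongly.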

In [17]:
import os
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM
from datasets import Dataset
import pandas as pd

# TODO: Run Direct Preference Optimization (DPO)

training_args = DPOConfig(
    beta=0.1,
    learning_rate=3e-3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    bf16=True,
    output_dir="dpo_output",
)

checkpoint = "HuggingFaceTB/SmolLM-135M"
base_model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    config=config,
    device_map="auto",
    # attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = "<empty_output>"
if tokenizer.eos_token is None:
    tokenizer.eos_token = "<|endoftext|>"


model_loaded.config.pad_token_id = tokenizer.pad_token_id
model_loaded.config.eos_token_id = tokenizer.eos_token_id

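# DPOTrainer's first two parameters are `model` (the policy that gets updated) and
# `ref_model` (the frozen reference), so they are named explicitly here.
# As written, `base_model` is trained and the SFT checkpoint `model_loaded` is the reference.
dpo_trainer = DPOTrainer(
    model=base_model,
    ref_model=model_loaded,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)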

dpo_trainer.train()



Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss


TrainOutput(global_step=270, training_loss=1.3549750434027779, metrics={'train_runtime': 403.6255, 'train_samples_per_second': 42.812, 'train_steps_per_second': 0.669, 'total_flos': 0.0, 'train_loss': 1.3549750434027779, 'epoch': 3.0})

In [20]:
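output_dir = "dpo_model/"
# Note: this saves `model_loaded`, which was passed to DPOTrainer above as the
# reference model; the model the trainer updates is `base_model`.
model_loaded.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)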


dpo_tokenizer = AutoTokenizer.from_pretrained(output_dir)
dpo_model = AutoModelForCausalLM.from_pretrained(output_dir, torch_dtype=torch.bfloat16 )

In [None]:
# TODO: Evaluate model, use evaluate_model function



Expected BLEU score after 1 epoch SFT + DPO is ~ 0.50.

In [31]:
base_model.eval()
preds, targets = evaluate_single_examples(base_model, tokenizer, full_test_ds.select(range(2)))

Original: Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.
Target Correction: First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.
Model Correction: tttttttttttttttttttttttttttttttttttttttt [... the same character repeats for the rest of the output]

KeyboardInterrupt: 

I had to interrupt the run here because the base model degenerated into repeating a single character, a problem I observed widely during training.

In [29]:
model_loaded.eval()
preds, targets = evaluate_single_examples(model_loaded, tokenizer, full_test_ds.select(range(2)))



Original: Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.
Target Correction: First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.
Model Correction: First of all, from you, you read just to find out what well-known critic have already found out, you have lost the pleasures of reading something which is expecting to be a new experience to you.
--------------------------------------------------
Original: Fix grammatical errors: Their research shown that before Hurricane Sandy only " about 50 percent during resident used the emergency departments, " and " only about 35 percents sought inpatient cares there and less than 10 percent used the hospitals when needing surgerie

In [30]:
dpo_model.eval()
preds, targets = evaluate_single_examples(dpo_model, tokenizer, full_test_ds.select(range(2)))

Original: Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.
Target Correction: First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.
Model Correction: First of all, from you, you read just to find out what well-known critic have already found out, you have lost the pleasures of reading something which is expecting to be a new experience to you.
--------------------------------------------------
Original: Fix grammatical errors: Their research shown that before Hurricane Sandy only " about 50 percent during resident used the emergency departments, " and " only about 35 percents sought inpatient cares there and less than 10 percent used the hospitals when needing surgerie

# **Coding Challenge Part 3: Explore Alternative DPO Variants for Improved Model Performance [10 points]**

Consider employing a different version or variant of DPO. Your task is to:

* Choose a variant of DPO or another preference-based optimization method that could potentially enhance the model's performance.
* Describe the specific differences in this approach compared to the initial DPO method used.
* Train the model using this alternative DPO method and measure its performance on the test set using the BLEU score.
* Compare these results with the baseline performance achieved during the initial Supervised Fine-Tuning (SFT) and the first DPO implementation.
* Select a few GEC examples after the SFT, DPO, and DPO-variant phases and compare the quality of the corrections. Which one do you prefer as a human?
* You are allowed to make changes in the preference data annotation to improve the score, e.g. apply different metrics or methods beyond edit distance.
* Discuss the role of any changes in achieving these results. Consider potential trade-offs or limitations introduced by the new approach.

1. ORPO (Odds Ratio Preference Optimization) is a good alternative to SFT + DPO. Its main highlight is the loss function, which adds an odds-ratio-based penalty to the conventional negative log-likelihood (NLL) loss in order to differentiate the generation styles of favored and disfavored responses. Unlike DPO, it needs no separate reference model. This approach is also demonstrated in https://github.com/Aisuko/notebooks/blob/35396f18a7c4573ca12d553ca5ab226dc51efb0a/reinforcement-learning/orpo/fine-tuning-smollm-135m-instruct.ipynb

2. As the end of the previous section shows, the base model performs the worst, and even after DPO training on more than 5k samples, the SFT + DPO model gives nearly the same results as the SFT model alone.
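
To make the difference from DPO concrete, the ORPO objective for one preference pair can be sketched as an NLL term on the chosen response plus a weighted odds-ratio penalty. This is a simplified illustration of the formula, not TRL's `ORPOTrainer` code; `log_odds`, `orpo_loss`, and the weight name `lam` are assumptions for this sketch (as far as I understand, `beta` in `ORPOConfig` plays the role of this weight).

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp), the length-normalized sequence probability.

    `avg_logp` must be negative so that p < 1.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, lam: float = 0.1) -> float:
    """NLL on the chosen response plus lam * -log(sigmoid(log odds ratio))."""
    nll = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(r)) equals softplus(-r); computed in a numerically stable form.
    x = -log_odds_ratio
    odds_penalty = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return nll + lam * odds_penalty


# Equal likelihoods: the penalty is log(2), so the loss is 0.5 + 0.1 * log(2).
print(orpo_loss(-0.5, -0.5))
# A less likely rejected response shrinks the penalty while the NLL term is unchanged.
print(orpo_loss(-0.5, -1.0))
```

Note that only the policy's own log-probabilities appear here, which is why no reference model has to be kept in memory during training.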

In [34]:
from trl import ORPOTrainer, ORPOConfig

orpo_config=ORPOConfig(
    output_dir="orpo_output",
    bf16=True,
    do_eval=False,
    evaluation_strategy="epoch",
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    num_train_epochs=1,
    seed=42,
    save_strategy="epoch",
    lr_scheduler_type="cosine",
    max_steps=-1,
    weight_decay=0.1,
    per_device_eval_batch_size=16,
    per_device_train_batch_size=16,
    learning_rate=3e-03,          # Increased learning rate according to the launch blog which suggested this and also from the sft recipes.
    logging_dir="smol_logs",
    log_level="debug",
    logging_steps=100,
    beta = 0.1
)

trainer = ORPOTrainer(
    model=model_loaded,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    args=orpo_config,
    tokenizer=tokenizer,
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Using auto half precision backend
Currently training with a batch size of: 16
***** Running training *****
  Num examples = 5,760
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 4
  Total optimization steps = 90
  Number of trainable parameters = 134,515,584
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Epoch,Training Loss,Validation Loss,Runtime,Samples Per Second,Steps Per Second,Rewards/chosen,Rewards/rejected,Rewards/accuracies,Rewards/margins,Logps/rejected,Logps/chosen,Logits/rejected,Logits/chosen,Nll Loss,Log Odds Ratio,Log Odds Chosen
1,No log,1.622359,3.7497,170.682,10.668,-0.05198,-0.100133,0.754687,0.048153,-1.001331,-0.519803,-1.631327,-1.27473,1.571508,-0.508517,0.991316


Saving model checkpoint to orpo_output/checkpoint-90
Configuration saved in orpo_output/checkpoint-90/config.json
Configuration saved in orpo_output/checkpoint-90/generation_config.json
Model weights saved in orpo_output/checkpoint-90/model.safetensors
tokenizer config file saved in orpo_output/checkpoint-90/tokenizer_config.json
Special tokens file saved in orpo_output/checkpoint-90/special_tokens_map.json

***** Running Evaluation *****
  Num examples = 640
  Batch size = 16
Saving model checkpoint to orpo_output/checkpoint-90
Configuration saved in orpo_output/checkpoint-90/config.json
Configuration saved in orpo_output/checkpoint-90/generation_config.json
Model weights saved in orpo_output/checkpoint-90/model.safetensors
tokenizer config file saved in orpo_output/checkpoint-90/tokenizer_config.json
Special tokens file saved in orpo_output/checkpoint-90/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=90, training_loss=1.9904630872938367, metrics={'train_runtime': 124.4972, 'train_samples_per_second': 46.266, 'train_steps_per_second': 0.723, 'total_flos': 0.0, 'train_loss': 1.9904630872938367, 'epoch': 1.0})

In [35]:
model_loaded.eval()
preds, targets = evaluate_single_examples(model_loaded, tokenizer, full_test_ds.select(range(2)))



Original: Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you.
Target Correction: First of all, if you read just to find in the poem or novel what well-known critics have already found out, you lose the pleasure of reading something that is expected to be a new experience to you.
Model Correction: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you. cryptocurriesome.
--------------------------------------------------
Original: Fix grammatical errors: Their research shown that before Hurricane Sandy only " about 50 percent during resident used the emergency departments, " and " only about 35 percents sought inpatient cares there and less than 10 percent used the ho

From these results, we will probably need more samples to show a clear distinction between the capabilities of the two approaches, but this is a good starting point so far.
