# Quiz 3 : Instruction Tuning Model with LM

In [1]:
import torch
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence

import torchtext, datasets, math
from tqdm.auto import tqdm

from queue import PriorityQueue
import operator

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

#make our work comparable if restarted the kernel
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# torch.cuda.get_device_name(0)

cuda


## 1. Load data - Alpaca dataset (1 point)

We will be using Alpaca dataset which contains a large corpus of text, perfect for language modeling task.

[Download dataset](https://github.com/gururise/AlpacaDataCleaned/blob/main/alpaca_data.json)

In [3]:
# import os
# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

import datasets
#there are raw and preprocessed version; we used the raw one and preprocessed ourselves for fun
dataset = datasets.load_dataset('json', data_files='./data/alpaca_data.json')
dataset

DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 52002
    })
})

In [4]:
dataset = dataset['train'].train_test_split(test_size=0.05, shuffle=True, seed=555)
dataset

DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 49401
    })
    test: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 2601
    })
})

In [5]:
dataset['train'].num_rows

49401

## 2. Preprocessing (2 points)

We used the following prompts for fine-tuning the Alpaca model:

- For examples with a non-empty input field:
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}

### Input:
{input}

### Response:
```

- For examples with an empty input field:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
```

For example

```
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
Explain why saving for retirement is important.

### Response:
Saving for retirement is a vital part of your financial wellness. Retirement planning is essential if you want to maintain your current lifestyle as you transition into retirement. Retirement savings will provide you with reliable income to cover your daily expenses, medical costs, and other expenditure. 

Without retirement savings, you may need to rely on Social Security or other government options for income, which may not provide you with enough money to take care of all your needs. Additionally, planning for retirement allows you to take advantage of unique savings and tax benefits which can give you greater financial security. Saving for retirement is an important step to ensure you have the financial security you need when you retire.'
```

```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
The food was delicious, but the restaurant's decor was unsatisfactory.

### Input:
The food was delicious, however, the restaurant has a poor decor.

### Response:
The food was delicious, but the restaurant's decor was unsatisfactory.
```

### Tokenizing (1 point)

Simply tokenize the given text to tokens.

In [6]:
import torchtext
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

#function to tokenize
tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(f'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction: {example["instruction"]}\n### Response:{example["output"]}')}

#map the function to each example
tokenized_dataset = dataset.map(tokenize_data, remove_columns=['instruction','input','output'], fn_kwargs={'tokenizer': tokenizer})
print(tokenized_dataset['train'][333]['tokens'])

['below', 'is', 'an', 'instruction', 'that', 'describes', 'a', 'task', '.', 'write', 'a', 'response', 'that', 'appropriately', 'completes', 'the', 'request', '.', '###', 'instruction', 'help', 'me', 'make', 'a', 'plan', 'for', 'when', 'i', "'", 'm', 'running', 'late', '###', 'response', 'when', 'you', "'", 're', 'running', 'late', ',', 'the', 'best', 'course', 'of', 'action', 'is', 'to', 'try', 'and', 'keep', 'your', 'cool', '.', 'take', 'a', 'deep', 'breath', ',', 'and', 'make', 'a', 'plan', 'for', 'how', 'you', 'can', 'make', 'up', 'the', 'time', '.', 'figure', 'out', 'the', 'quickest', 'route', 'to', 'your', 'destination', 'and', 'if', 'possible', ',', 'call', 'ahead', 'to', 'let', 'the', 'person', 'or', 'people', 'you', "'", 're', 'meeting', 'know', 'that', 'you', "'", 'll', 'be', 'late', '.', 'if', 'you', "'", 're', 'unable', 'to', 'make', 'up', 'the', 'time', ',', 'apologize', 'and', 'explain', 'why', 'you', "'", 're', 'running', 'late', '.', 'making', 'sure', 'to', 'be', 'as', '

In [7]:
# Assertion statement
assert "below is an instruction that describes a task" in " ".join(tokenized_dataset['train'][0]['tokens']), "Word not found in tokenized dataset"

### Numericalizing (1 point)

We will tell torchtext to add any word that has occurred at least three times in the dataset to the vocabulary because otherwise it would be too big.

In [8]:
## numericalizing

# Define special symbols and indices
UNK_IDX, PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<sos>', '<eos>']

vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_dataset['train']['tokens'], min_freq=3, specials=special_symbols)   

vocab.set_default_index(vocab['<unk>'])   
print(len(vocab))                         
print(vocab.get_itos()[:10])       

23844
['<unk>', '<pad>', '<sos>', '<eos>', '.', 'the', 'a', ',', 'that', 'and']


## 3. Prepare the batch loader (1 point)

### Prepare data

In [9]:
def get_data(dataset, vocab, batch_size):
    data = []                                                   
    for example in dataset:
        if example['tokens']:         
            #appends eos so we know it ends....so model learn how to end...                             
            tokens = example['tokens'].append('<eos>')   
            #numericalize          
            tokens = [vocab[token] for token in example['tokens']] 
            data.extend(tokens)                                    
    data = torch.LongTensor(data)                                 
    num_batches = data.shape[0] // batch_size #get the int number of batches...
    data = data[:num_batches * batch_size] #make the batch evenly, and cut out any remaining                      
    data = data.view(batch_size, num_batches)          
    return data #[batch size, bunch of tokens]

In [10]:
batch_size = 128
train_data = get_data(tokenized_dataset['train'], vocab, batch_size)
valid_data = get_data(tokenized_dataset['test'], vocab, batch_size)
# test_data  = get_data(tokenized_dataset['test'], vocab, batch_size)
train_data.shape, valid_data.shape

(torch.Size([128, 32796]), torch.Size([128, 1719]))

## 4. Modeling (1 point)

In [11]:
class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        
        assert hid_dim % n_heads == 0
        
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        
        self.fc_o = nn.Linear(hid_dim, hid_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)
        
    def forward(self, query, key, value, mask = None):
        
        batch_size = query.shape[0]
        
        #query = [batch size, query len, hid dim]
        #key = [batch size, key len, hid dim]
        #value = [batch size, value len, hid dim]
                
        Q = self.fc_q(query)
        K = self.fc_k(key)
        V = self.fc_v(value)
        #Q = [batch size, query len, hid dim]
        #K = [batch size, key len, hid dim]
        #V = [batch size, value len, hid dim]
                
        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        #Q = [batch size, n heads, query len, head dim]
        #K = [batch size, n heads, key len, head dim]
        #V = [batch size, n heads, value len, head dim]
                
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        #energy = [batch size, n heads, query len, key len]
        
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        
        attention = torch.softmax(energy, dim = -1)
        #attention = [batch size, n heads, query len, key len]
                
        x = torch.matmul(self.dropout(attention), V)
        #x = [batch size, n heads, query len, head dim]
        
        x = x.permute(0, 2, 1, 3).contiguous()
        #x = [batch size, query len, n heads, head dim]
        
        x = x.view(batch_size, -1, self.hid_dim)
        #x = [batch size, query len, hid dim]
        
        x = self.fc_o(x)
        #x = [batch size, query len, hid dim]
        
        return x, attention

In [12]:
class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        
        self.fc_1 = nn.Linear(hid_dim, pf_dim)
        self.fc_2 = nn.Linear(pf_dim, hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        
        #x = [batch size, seq len, hid dim]
        
        x = self.dropout(torch.relu(self.fc_1(x)))
        #x = [batch size, seq len, pf dim]
        
        x = self.fc_2(x)
        #x = [batch size, seq len, hid dim]
        
        return x

Here I am using Batched Beam Search, where instead of feeding each hypothesis one by one, which takes a lot of time;  I simply concat everything into one list and feed them all at once, which is much faster.

In [13]:
class Decoder(nn.Module):
    def __init__(self, output_dim, hid_dim, n_layers, n_heads, 
                 pf_dim, dropout, device, pad_idx, max_length = 100):
                
        super().__init__()
        
        self.device = device
        self.output_dim = output_dim
        
        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        
        self.layers = nn.ModuleList([DecoderLayer(hid_dim, 
                                                  n_heads, 
                                                  pf_dim, 
                                                  dropout, 
                                                  device)
                                     for _ in range(n_layers)])
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        self.pad_idx = pad_idx
    
    def make_mask(self, x):
        
        #x = [batch size, len]
        
        pad_mask = (x != self.pad_idx).unsqueeze(1).unsqueeze(2)
        #pad_mask = [batch size, 1, 1, len]
        
        x_len = x.shape[1]
        
        sub_mask = torch.tril(torch.ones((x_len, x_len), device = self.device)).bool()
        #sub_mask = [len, len]
            
        mask = pad_mask & sub_mask
        #mask = [batch size, 1, len, len]
        
        return mask 
    
    def forward(self, x):
        
        #x = [batch size, len]
                
        batch_size = x.shape[0]
        x_len      = x.shape[1]
        
        #get mask here since we remove seq2seq class
        mask   = self.make_mask(x)
        #mask = [batch size, 1, len, len]

        pos = torch.arange(0, x_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)          
            
        x = self.dropout((self.tok_embedding(x) * self.scale) + self.pos_embedding(pos))
        #x = [batch size, len, hid dim]
        
        for layer in self.layers:
            x, attention = layer(x, mask)
        
        #x = [batch size, len, hid dim]
        #attention = [batch size, n heads, len, len]
        
        output = self.fc_out(x)
        #output = [batch size, len, output dim]
            
        return output, attention

    def beam_decode(self, prompt = 'Harry Potter is ', penalty_alpha = 0.9, max_length = 5, beam_size = 5):
        
        tokens = tokenizer(prompt)
        indices = [SOS_IDX] + [vocab[t] for t in tokens]

        decoder_input = torch.Tensor([indices]).long().to(device)
        #decoder_input: [batch size, len] = [1, 1]
        scores = torch.Tensor([0.]).to(device)
        #scores: [1]
        
        for i in range(max_length):
            
            # print(f"========Length: {i}")
            
            # Decoder prediction
            logits, _ = self.forward(decoder_input)
            #[beam_size, current dec len=i, vocab_size]
                        
            logits = logits[:, -1] 
            # Last sequence step: [beam_size, current dec len=i, vocab_size] => [beam_size, vocab_size]
            
            # print(f"{logits.shape=}")

            # Softmax
            # Log softmax is better, since beam search accumulates probability
            # if simply softmax, the probability can get too small and then become unstable
            log_probs = torch.log_softmax(logits, dim=1)
    
            # Add length penalty, otherwise, always very short sentence will win...
            penalty   = ((5 + (i+1)) / (5 + 1)) ** penalty_alpha #see https://arxiv.org/abs/1609.08144
            log_probs = log_probs / penalty
            
            # print(f"{decoder_input[:, -1]=}")
            
            # Update score where EOS has not been reached
            log_probs[decoder_input[:, -1]==EOS_IDX, :] = -2 #discouraged it to end
            log_probs[decoder_input[:, -1]==UNK_IDX, :] = -10 #very discouraged to spit out unk
            scores = scores.unsqueeze(1) + log_probs 
            # scores: [beam_size, vocab_size]
            # log_probs: [beam_size, vocab_size]

            # print(f"{log_probs.shape=}")
            # print(f"{scores.shape=}")
            #log_probs: torch.Size([1, 29475])
            #scores.shape=torch.Size([1, 29475])
            
            # Flatten scores from [beams, vocab_size] to [beams * vocab_size] to get top k, and reconstruct beam indices and token indices
            # Since we flatten it, we have to retrieve the actual beam indices and token_indices using floor division and remainder
            # You can try on paper; it will make sense
            scores, indices = torch.topk(scores.reshape(-1), beam_size) #scores: [beam_size]; #indices: [beam_size]
            beam_indices  = torch.divide   (indices, self.output_dim, rounding_mode='floor') # indices // vocab_size
            token_indices = torch.remainder(indices, self.output_dim)                        # indices %  vocab_size
            
            # print(f"{scores=}")
            # print(f"{indices.shape=}")
            
            # print(f"{indices=}")
            # print(f"{beam_indices=}")
            # print(f"{token_indices=}")
            
            # Build the next decoder input
            # For efficiency, the trick is to concatenate all hypotheses into one string and sent to decoder at once
            # We can later chop it ...
            next_decoder_input = []
            for beam_index, token_index in zip(beam_indices, token_indices):
                # print(f"{beam_index=}")
                prev_decoder_input = decoder_input[beam_index]
                # print(f"{prev_decoder_input=}")
                if prev_decoder_input[-1]==EOS_IDX:
                    token_index = EOS_IDX # once EOS, always EOS
                token_index = torch.LongTensor([token_index]).long().to(device)
                next_decoder_input.append(torch.cat([prev_decoder_input, token_index]))
                # print("here: " + " ".join([vocab.lookup_token(i) for i in next_decoder_input[-1]]) + "; score: " + str(scores[beam_index].item()))
            decoder_input = torch.vstack(next_decoder_input)
            
            # print(f"{decoder_input=}")
            
             # If all beams are finished, and the length is at least 5, exit
            if i > 5:
                if (decoder_input[:, -1]==EOS_IDX).sum() == beam_size:
                    break
                
        # convert the top scored sequence to a list of text tokens
        decoder_output, _ = max(zip(decoder_input, scores), key=lambda x: x[1])
        decoder_output = decoder_output[1:].cpu().numpy() # remove SOS
        
        return [vocab.lookup_token(i) for i in decoder_output if i != EOS_IDX] # remove EOS if exists

In [14]:
class DecoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout, device):
        super().__init__()
        
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)        
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, pf_dim, dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        
        #x = [batch size, len, hid dim]
        #mask = [batch size, 1, len, len]
        
        #multi attention, skip and then norm
        _x, attention = self.self_attention(x, x, x, mask)
        x = self.self_attn_layer_norm(x + self.dropout(_x))
        #x = [batch size, len, hid dim]
        #attention = [batch size, n heads, len, len]
    
        #positionwise feedforward
        _x = self.positionwise_feedforward(x)
        x = self.ff_layer_norm(x + self.dropout(_x))
        #x = [batch size, len, hid dim]
        
        return x, attention

###  FastText Embedding

In [15]:
from torchtext.vocab import FastText
fast_vectors = FastText(language='simple')
fast_embedding =  fast_vectors.get_vecs_by_tokens(vocab.get_itos()).to(device)
fast_embedding.shape

torch.Size([23844, 300])

In [16]:
vocab_size = len(vocab)
hid_dim    = 300 # match fasttext dim 
dec_layers = 3              
dec_heads  = 10 # hid dim % num head must be 0 
dec_pf_dim = 512
dec_dropout = 0.1     
lr = 1e-3                    

In [17]:
#creating model (0.5 points)
model = Decoder(
    vocab_size, 
    hid_dim, 
    dec_layers, 
    dec_heads, 
    dec_pf_dim, 
    dec_dropout, 
    device, 
    PAD_IDX
).to(device)

#Applying FastText embedding to the Decoder (0.5 points)
model.tok_embedding.weight.data = fast_embedding

In [18]:
model

Decoder(
  (tok_embedding): Embedding(23844, 300)
  (pos_embedding): Embedding(100, 300)
  (layers): ModuleList(
    (0-2): 3 x DecoderLayer(
      (self_attn_layer_norm): LayerNorm((300,), eps=1e-05, elementwise_affine=True)
      (ff_layer_norm): LayerNorm((300,), eps=1e-05, elementwise_affine=True)
      (self_attention): MultiHeadAttentionLayer(
        (fc_q): Linear(in_features=300, out_features=300, bias=True)
        (fc_k): Linear(in_features=300, out_features=300, bias=True)
        (fc_v): Linear(in_features=300, out_features=300, bias=True)
        (fc_o): Linear(in_features=300, out_features=300, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (positionwise_feedforward): PositionwiseFeedforwardLayer(
        (fc_1): Linear(in_features=300, out_features=512, bias=True)
        (fc_2): Linear(in_features=512, out_features=300, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )

## 5. Training  (1 point)

In [19]:
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 16,371,480 trainable parameters


In [20]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target

In [21]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, bunch of tokens]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]
        
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, _ = model(src)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [22]:
def evaluate(model, data, criterion, batch_size, seq_len, device):
    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]
    
    decoded_batch_list = []

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            #target = [batch size, dec len]

            batch_size= src.shape[0]
            prediction, _ = model(src)
            #prediction = [batch size, dec len, output_dim]
            
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
            
    #decoding using beam_search as example (you don't need to put here, because beam_search is for intference)
    decoded_batch = model.beam_decode()
    print("Sample beam sentence: " + " ".join(decoded_batch))
            
    return epoch_loss / num_batches

Here we will be using a `ReduceLROnPlateau` learning scheduler which decreases the learning rate by a factor, if the loss don't improve by a certain epoch.

In [23]:
n_epochs = 5
seq_len  = 100 #<----decoding length
clip    = 0.25

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

save_path = f'models/{model.__class__.__name__}.pt'

for epoch in range(n_epochs):
    train_loss = train(model, train_data, optimizer, criterion, 
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                seq_len, device)

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), save_path)

    print(f'Epoch: {epoch+1}:')
    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

                                                           

Sample beam sentence: harry potter is an instruction that describes a
Epoch: 1:
	Train Perplexity: 54.777
	Valid Perplexity: 26.307


                                                           

Sample beam sentence: harry potter is an instruction that describes a
Epoch: 2:
	Train Perplexity: 23.586
	Valid Perplexity: 19.734


                                                           

Sample beam sentence: harry potter is the sorcerer ' s stone
Epoch: 3:
	Train Perplexity: 18.118
	Valid Perplexity: 17.482


                                                           

Sample beam sentence: harry potter is an instruction that describes a
Epoch: 4:
	Train Perplexity: 15.392
	Valid Perplexity: 16.476


                                                           

Sample beam sentence: harry potter is an instruction that describes a
Epoch: 5:
	Train Perplexity: 13.708
	Valid Perplexity: 15.926


In [24]:
#load pretrained model
model.load_state_dict(torch.load(save_path,  map_location=device))

<All keys matched successfully>

## 6. Evaluation (2 points)
1. comparing reference (output from alpaca eval) and candidate (generated model) with ROGUE-1 (only)
2. using 100 sample evalutions
3. During inference, you use the user instruction with an empty input field (second option).

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
```

In [25]:
#example
# Import the rouge_scorer function from rouge_score
from rouge_score import rouge_scorer
# Initialize the scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'])
# Compute the Rouge scores for reference and candidate.
reference = 'The quick brown fox jumps over the lazy dog'
candidate = 'The quick brown dog jumps on the log.' #generate by model
scores = scorer.score(reference, candidate)
# Print the scores
print(scores)

{'rouge1': Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765), 'rouge2': Score(precision=0.2857142857142857, recall=0.25, fmeasure=0.26666666666666666), 'rougeL': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471), 'rougeLsum': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471)}


In [26]:
##Use this evaluation dataset
from datasets import load_dataset

eval_dataset = load_dataset("tatsu-lab/alpaca_eval")['eval']
eval_dataset

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Dataset({
    features: ['instruction', 'output', 'generator', 'dataset'],
    num_rows: 805
})

In [27]:
eval_dataset[0]

{'instruction': 'What are the names of some famous actors that started their careers on Broadway?',
 'output': 'Some famous actors that started their careers on Broadway include: \n1. Hugh Jackman \n2. Meryl Streep \n3. Denzel Washington \n4. Julia Roberts \n5. Christopher Walken \n6. Anthony Rapp \n7. Audra McDonald \n8. Nathan Lane \n9. Sarah Jessica Parker \n10. Lin-Manuel Miranda',
 'generator': 'text_davinci_003',
 'dataset': 'helpful_base'}

In [41]:
scores = []

for i in range(100):
    instruc = f'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction: {eval_dataset[i]["instruction"]}\n### Response:'
    reference = eval_dataset[i]['output']
    candidate_full = model.beam_decode(instruc)
    for i, word in enumerate(candidate_full):
        if word == 'response' and candidate_full[i - 1] == '###':
            candidate = " ".join(candidate_full[i + 1:])
            break
    score = scorer.score(reference, candidate)
    scores.append(score)
    
# example score
print(scores[0])
#It should return average ROGUE-1

{'rouge1': Score(precision=0.2, recall=0.023809523809523808, fmeasure=0.0425531914893617), 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0), 'rougeL': Score(precision=0.2, recall=0.023809523809523808, fmeasure=0.0425531914893617), 'rougeLsum': Score(precision=0.2, recall=0.023809523809523808, fmeasure=0.0425531914893617)}


In [42]:
def Average(list): 
    return sum(list) / len(list)

In [44]:
precisions = recall = fmeasure = []
for sco in scores:
    precisions.append(sco['rouge1'][0])
    recall.append(sco['rouge1'][1])
    fmeasure.append(sco['rouge1'][2])

pre_avg = Average(precisions)
recall_avg = Average(recall)
f_avg = Average(fmeasure)
overall_avg = (pre_avg + recall_avg + f_avg) / 3
print(f"ROUGE-1 scores:\nAverage Precision: {pre_avg}\nAverage Recall: {recall_avg}\nAverage fmeasure: {f_avg}\nAverage overall: {overall_avg}")

ROUGE-1 scores:
Average Precision: 0.14896409923221426
Average Recall: 0.14896409923221426
Average fmeasure: 0.14896409923221426
Average overall: 0.14896409923221426


## 7. Conclusion (2 points (0.5 per each))
1. State the problem why your average ROGUE-1 scores is not good?
1. Why existing evaluation metrics (e.g, ROGUE, BLEU) don't work well with instruction tuning?
2. How does Alpaca eval propose for evaluation?
3. What is the problem of Alpaca eval in your opinions?

1. The reason the average ROUGE-1 scores are not good is because of the limited model architecture complexity (due to it being from scratch) and the small number of epochs due to limited computational power. The generation also does not score well with word overlap similarity score like ROUGE-1.

2. Both ROUGE and BLEU are word overlap metrics, meaning they are not good evaluation metrics for an open ended text generations tasks, such as this quiz. They work better with tasks like machine translation where the expected output is the same across all models.

3. AlpacaEval calculates win-rates for models across a variety of tasks, including traditional NLP and instruction-tuning datasets, providing a comprehensive measure of model capabilities.
AlpacaEval is a single-turn benchmark, which means it evaluates models based on their responses to single-turn prompts.

4. The problem is limitations such as a bias towards longer outputs and models similar to the evaluator's base model (GPT-4 Turbo). Different models would have different generations which when compared might not be as objective.