# Week 14: Colab Experiment

# I. Introduction
In this exercise, we first train a transformer using the Wikitext-2 dataset and then use the model to generate new text with the length specified by the user.  

# II. Methods


### 1. Library Imports
The necessary libraries are imported to handle neural network operations, math calculations, and time tracking:
- **torch**: For building and training the model.
- **math**: For mathematical calculations.
- **time**: For tracking training and evaluation time.

### 2. Data Loading
The data is processed using a custom `data.Corpus` class that handles text data. The data is then split into training, validation, and test datasets. The following function is used to prepare the datasets for training:
- **batchify**: Organizes the dataset into batches of a specified size for efficient training.

### 3. Model Construction
The core model is based on the Transformer architecture, with the following components:
- **PositionalEncoding**: A class to add positional information to the input sequences since Transformers don't inherently handle sequence order.
- **TransformerModel**: A class derived from `nn.Transformer`, modified to handle embedding inputs and applying positional encoding.

### 4. Training Functions
Several functions are defined to handle training, evaluation, and batch preparation:
- **get_batch**: Retrieves batches of data for training.
- **train**: The training loop where the model receives data, computes the loss, performs backpropagation, and updates the model's parameters using gradient descent. Gradient clipping is applied to avoid large updates and stabilize training.
- **evaluate**: Computes the loss on validation or test data to evaluate the performance of the trained model.

In [4]:

import time
import math
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

In [5]:
# Uncomment one of the following that works for you.

# device = torch.device("cuda")
device = torch.device("mps")
# device = torch.device("cpu")

In [6]:
batch_size = 20

emsize = 200 # size of word embeddings
nhead = 2
nhid = 200
nlayers = 2
dropout = 0.2
lr = 20 # initial learning rate
epochs=10 # upper epoch limit

bptt=35 #sequence length
clip=0.25 #gradient clipping
log_interval=200 # report interval

save='model.pt' #path to save the final model

# Set the random seed manually for reproducibility.
torch.manual_seed(0)

eval_batch_size = 10

## Load data

In [8]:

import sys
sys.path.append('./') # Change to your own path
import data

In [9]:
corpus = data.Corpus('./data/wikitext-2')

def batchify(data, bsz):
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

train_data = batchify(corpus.train, batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)
ntokens = len(corpus.dictionary)

## Build the model

In [11]:
# Define positional encoding used in the transformer model

#################################################################################################
# [TODO]: Build a positional encoding function that can be used in the TransformerModel below
#################################################################################################

class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        
        # Initialize dropout layer
        self.dropout = nn.Dropout(p=dropout)

        # Create a long enough position encoding matrix
        pe = torch.zeros(max_len, d_model)
        
        # Compute the positional encoding values
        for pos in range(max_len):
            for i in range(0, d_model, 2):
                # Apply the sine function to even indices
                pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))
                # Apply the cosine function to odd indices
                if i + 1 < d_model:
                    pe[pos, i + 1] = math.cos(pos / (10000 ** ((i + 1) / d_model)))
        
        # Add a batch dimension to pe, then register as a buffer to not be updated by gradients
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the positional encoding to the input tensor
        x = x + self.pe[:, :x.size(1)].detach()  # detaching to prevent gradients from updating it
        return self.dropout(x)

This code implements **Positional Encoding**, which is used in the **Transformer** model to assign a unique representation to each position in a sequence, allowing the model to capture the order of elements in the sequence.

- `pe`: A zero matrix of size `(max_len, d_model)`, representing the encoding for each position up to the maximum length.
- For each position `pos` (from 0 to `max_len-1`), the position encoding is calculated using a position formula.
- For even-indexed dimensions (`i`), the **sine function** (`sin`) is used.
- For odd-indexed dimensions (`i+1`), the **cosine function** (`cos`) is used.
- These formulas ensure that each position has a unique encoding, and the encoding varies based on the position. By using `10000` as a base, the changes in higher dimensions are smoother.
- `pe.unsqueeze(0)`: Adds a batch dimension to the positional encoding matrix, changing its shape to `(1, max_len, d_model)`, so it can match the shape of the input data.

In [13]:
# Define the transformer model
# Define the TransformerModel class inheriting from nn.Transformer
class TransformerModel(nn.Transformer):

    #Initialize the Transformer model.
    
    #Parameters:
    #- ntoken: The number of tokens in the vocabulary (input/output size).
    #- ninp: Dimensionality of the input embeddings.
    #- nhead: Number of attention heads in the multi-head attention mechanism.
    #- nhid: Dimensionality of the feedforward network in the transformer layers.
    #- nlayers: Number of encoder layers in the transformer.
    #- dropout: Dropout rate for regularization.

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        # Call the parent class constructor with the appropriate arguments
        super(TransformerModel, self).__init__(d_model=ninp, nhead=nhead, dim_feedforward=nhid, num_encoder_layers=nlayers)
         # Model type for reference
        self.model_type = 'Transformer'
        # Placeholder for the source mask
        self.src_mask = None
        # Positional encoding to inject position information into the input embeddings
        self.pos_encoder = PositionalEncoding(ninp, dropout) # This is what you had constructed above
        # Embedding layer for input tokens

        self.input_emb = nn.Embedding(ntoken, ninp)
        self.ninp = ninp# Store the input embedding dimension
        # Linear layer to decode transformer outputs into vocabulary logits
        self.decoder = nn.Linear(ninp, ntoken)
        # Initialize weights for the model

        self.init_weights()
        
        #Generate a square mask for the sequence to ensure that a token 
        #only attends to previous tokens (used in language modeling).
        
        #Parameters:
        #- sz: Size of the mask (sequence length).
        
        #Returns:
        #- A lower triangular log-transformed mask.
 
    def _generate_square_subsequent_mask(self, sz):
        return torch.log(torch.tril(torch.ones(sz,sz)))

    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.input_emb.weight, -initrange, initrange)
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, src, has_mask=True):
        """
        Forward pass for the transformer model.
        
        Parameters:
        - src: Input sequence of token indices.
        - has_mask: Flag to determine if a mask should be applied.
        
        Returns:
        - Log-softmax probabilities over the vocabulary for each token.
        """
        if has_mask:
            device = src.device
            if self.src_mask is None or self.src_mask.size(0) != len(src):
                mask = self._generate_square_subsequent_mask(len(src)).to(device)
                self.src_mask = mask
        else:
            self.src_mask = None

        src = self.input_emb(src) * math.sqrt(self.ninp)
        src = self.pos_encoder(src)
        output = self.encoder(src, mask=self.src_mask)
        output = self.decoder(output)
        return F.log_softmax(output, dim=-1)

In [14]:
model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout).to(device)
criterion = nn.NLLLoss()



## Training

In [16]:


def get_batch(source, i):
    """
    Fetches a batch of data and its corresponding target from the dataset.

    Parameters:
    - source: The dataset (tensor) to fetch data from.
    - i: The starting index of the batch.

    Returns:
    - data: Input sequence (length: seq_len).
    - target: Target sequence (length: seq_len, flattened for loss calculation).
    """
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target


def evaluate(data_source):
    """
    Evaluates the model on the given dataset.

    Parameters:
    - data_source: Dataset to evaluate on (validation or test data).

    Returns:
    - Average loss per token across the dataset.
    """
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(corpus.dictionary)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            output = model(data)
            output = output.view(-1, ntokens)

            total_loss += len(data) * criterion(output, targets).item()
    return total_loss / (len(data_source) - 1)


def train():
    """
    Trains the model for one epoch on the training data.

    Returns:
    - None
    """
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(corpus.dictionary)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        model.zero_grad()
        output = model(data)
        output = output.view(-1, ntokens)
        loss = criterion(output, targets)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        for p in model.parameters():
            p.data.add_(p.grad, alpha=-lr)

        total_loss += loss.item()

        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // bptt, lr,
                elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()



# Loop over epochs.
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

# Load the best saved model.
with open(save, 'rb') as f:
    model = torch.load(f)


# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

#Perplexity，困惑度）时，一般来说，PPL 值越低越好。


| epoch   1 |   200/ 2983 batches | lr 20.00 | ms/batch 94.91 | loss 16.50 | ppl 14676497.04
| epoch   1 |   400/ 2983 batches | lr 20.00 | ms/batch 96.22 | loss 12.49 | ppl 265771.89
| epoch   1 |   600/ 2983 batches | lr 20.00 | ms/batch 94.79 | loss 10.49 | ppl 35803.16
| epoch   1 |   800/ 2983 batches | lr 20.00 | ms/batch 94.32 | loss  9.71 | ppl 16466.20
| epoch   1 |  1000/ 2983 batches | lr 20.00 | ms/batch 94.44 | loss  9.27 | ppl 10619.17
| epoch   1 |  1200/ 2983 batches | lr 20.00 | ms/batch 95.20 | loss  9.05 | ppl  8528.49
| epoch   1 |  1400/ 2983 batches | lr 20.00 | ms/batch 93.66 | loss  8.76 | ppl  6354.50
| epoch   1 |  1600/ 2983 batches | lr 20.00 | ms/batch 80.48 | loss  8.96 | ppl  7816.05
| epoch   1 |  1800/ 2983 batches | lr 20.00 | ms/batch 80.48 | loss  8.77 | ppl  6465.86
| epoch   1 |  2000/ 2983 batches | lr 20.00 | ms/batch 80.40 | loss  8.74 | ppl  6226.95
| epoch   1 |  2200/ 2983 batches | lr 20.00 | ms/batch 79.62 | loss  8.63 | ppl  5614.36
| epoc

  model = torch.load(f)


| End of training | test loss  6.81 | test ppl   911.16


# III. Results
Here we generate text of length 100 words.

In [18]:
num_words = 100
temperature = 1


g = torch.Generator().manual_seed(0)
initial_state = g.get_state()

with open('./model.pt', 'rb') as f:
    model = torch.load(f, map_location=device)
model.eval()

  model = torch.load(f, map_location=device)


TransformerModel(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=200, out_features=200, bias=True)
        )
        (linear1): Linear(in_features=200, out_features=200, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=200, out_features=200, bias=True)
        (norm1): LayerNorm((200,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((200,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((200,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): Linear(in_features=200, out_features=33278, bias=True)
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (input_emb): Embedding(33278, 200)
)

In [39]:
g.set_state(initial_state)
input = torch.randint(ntokens, (1, 1), dtype=torch.long, generator=g).to(device)


generated_text = ""

##################################################################################
# [TODO] Fill out this section to use the transfer model to generate new text
##################################################################################

for i in range(num_words):
    # Step 1: Predict next word probabilities
    output = model(input)
    
    # Step 2: Scale probabilities with temperature
    output = output.squeeze(0)  # remove the batch dimension
    scaled_output = output / temperature
    
    # Step 3: Sample the next word index
    probabilities = F.softmax(scaled_output, dim=-1)  # Convert to probabilities
    next_word_idx = torch.multinomial(probabilities, 1)  # Sample from the distribution
    
    # Step 4: Add the sampled word to the input
    input = next_word_idx.view(1, 1).to(device)
    
    # Step 5: Find the word for the index (using the corpus dictionary)
    word = corpus.dictionary.idx2word[next_word_idx.item()]
    
    # Step 6: Add word to the output text
    generated_text += ' ' + word
print(generated_text)



 Instead tooth later 1949 Today the . The Nightingale itself returns . @-@ the State . more name records a = 1951 the kept of in hair at the . 0 in Academy = <eos> called to on both pay take R1 me the can 9 week wrote Howe too his <eos> when fighting by of . Civil professional telephone and the . eroded <eos> in actions an <unk> and rickshaws females alone the " , 100 is ( the " while 's for by trees amount of are this and most the Yeomanry number £ the high Lucia form


# IV. Conclusion and Discussion


#### Training Results
The model's training progressed over 10 epochs, with the loss decreasing steadily from 6.95 at the end of epoch 5 to 6.88 by the end of epoch 10. The perplexity followed a similar trend, improving from 1047.37 in epoch 5 to 968.07 in epoch 10. The learning rate was gradually reduced, from 1.25 in epoch 5 to 0.02 by epoch 10, helping stabilize the training. Throughout the epochs, the model showed consistent improvement in both training and validation performance, with the final validation perplexity at 911.16 indicating good model generalization.

#### Model Performance
Although the model is learning, its performance remains suboptimal for complex tasks. The output text contains both coherent phrases and gibberish, indicating the need for further training to improve sequence generation and contextual understanding.

#### Insights and Learnings
1. **Training Stability**: The fluctuations in loss suggest a need for stabilization methods like gradient clipping or dynamic learning rate adjustments.
2. **Data Quality**: The model struggles to generate meaningful text, indicating the potential need for better data preprocessing or augmentation.
3. **Evaluation Metrics**: While perplexity and loss are useful, they may not fully reflect text quality. Additional evaluation methods, such as human evaluation or BLEU score, could provide deeper insights.
4. **Training Duration**: Ten epochs may be insufficient for complex language modeling tasks. More epochs or early stopping could help improve performance.

#### Future Improvements
1. **Hyperparameter Tuning**: Adjusting learning rates, batch sizes, and other parameters could lead to faster convergence and more stable training.
2. **Model Architecture**: Modifying the Transformer architecture (e.g., different attention mechanisms, layer normalization) could help improve generalization and prevent overfitting.
3. **Longer Training**: Training for more epochs or using early stopping could enhance the model’s ability to learn complex patterns.
4. **Data Augmentation**: Improving tokenization, sentence segmentation, and using larger, more diverse datasets could help the model generate more realistic text.
5. **Evaluation Metrics**: Incorporating metrics like BLEU or ROUGE, alongside human evaluation, would offer more comprehensive assessments of the model's performance.

#### Positional Encoding in Transformer
The self-attention mechanism does not inherently capture sequence order, which is why positional encoding is introduced. Using sine and cosine functions, positional encoding ensures:

- **Uniqueness**: Each position in the sequence has a distinct representation.
- **Periodicity**: The periodicity of sine and cosine captures both long- and short-term position relationships.
- **Smooth Transition**: The gradual change in encoding across positions allows the model to learn continuous positional relationships.

#### Conclusion
In conclusion, while the Transformer model has shown incremental progress in language generation tasks, several challenges remain. Training instability and suboptimal performance highlight the need for further optimization in terms of hyperparameter tuning, model architecture adjustments, and longer training durations. By improving data preprocessing and exploring additional evaluation metrics, the model's ability to generate coherent and contextually relevant text can be enhanced. The use of positional encoding is crucial for the model’s understanding of sequence order, and its smooth, periodic nature ensures that the model can capture complex positional relationships. Future work should focus on refining these aspects to improve the model’s overall performance and applicability to real-world language tasks. 


