<a href="https://colab.research.google.com/github/glz200133/7008project/blob/main/Transformer_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## STAT8021 / STAT8307
### Assignment 3: Transformer Mechanics, Application, and Pre-training/Fine-tuning Analysis
### DUE: April 27, 2025, Sunday, 11:59 PM

## 1. Understanding Transformer

In [1]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
import math

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

### Transformer: Multi-head Attention




#### Q1 (a) Query, Key, Value


In Transformers, we perform self-attention, which means that the values, keys, and queries are all derived from the same input $X \in \mathbb{R}^{\ell \times d_1}$, where $\ell$ is the sequence length. Specifically, we learn parameter matrices $V_i,K_i,Q_i \in \mathbb{R}^{d_1\times d/h}$ that map the input $X$ as follows:

\begin{align}
v_i = XV_i\ \ i \in \{1,\dots,h\}\\
k_i = XK_i\ \ i \in \{1,\dots,h\}\\
q_i = XQ_i\ \ i \in \{1,\dots,h\}
\end{align}

where $i$ refers to the $i$-th head and $h$ is the number of heads.
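
The per-head matrices do not have to be stored separately. The sketch below, written in plain Python so the arithmetic stays visible, illustrates why a single projection of width $d$ that is then split into $h$ contiguous column blocks gives the same result as $h$ separate projections $XQ_i$. The helper names `matmul` and `split_heads` and the toy numbers are illustrative assumptions, not part of the assignment.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_heads(m, n_heads):
    """Split a (rows x d) matrix into n_heads matrices of shape (rows x d/h),
    taking contiguous column blocks (the same layout a reshape of the last
    dimension into (n_heads, head_dim) produces)."""
    d = len(m[0])
    assert d % n_heads == 0
    head_dim = d // n_heads
    return [[row[i * head_dim:(i + 1) * head_dim] for row in m] for i in range(n_heads)]


# Toy input: sequence length 3, d1 = 2; fused weight maps d1 = 2 -> d = 4
X = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
W = [[1.0, 0.0, 2.0, 1.0],
     [0.5, 1.0, -1.0, 0.0]]
h = 2

# Option 1: one fused projection, then split the result into heads
fused_heads = split_heads(matmul(X, W), h)

# Option 2: split the weight into per-head matrices Q_i, then project separately
separate_heads = [matmul(X, W_i) for W_i in split_heads(W, h)]

print(fused_heads == separate_heads)
```

This is the reason one `nn.Linear` per role (query, key, value) followed by a reshape is enough to represent all $h$ heads.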

In [2]:
def get_multihead_qkv(query, key, value, embed_dim, n_heads):
  """
  Inputs:
        - query: Input data to be used as the query, of shape (N, S, hidden_dim)
        - key: Input data to be used as the key, of shape (N, T, hidden_dim)
        - value: Input data to be used as the value, of shape (N, T, hidden_dim)
        - embed_dim: The embedding dimension of q,k,v (d in the formula)
        - n_heads: the number of heads
        Note: In the shape definitions above, N is the batch size, S is the source
        sequence length, T is the target sequence length, and hidden_dim is the hidden dimension of X (d1 in the formula).
  Returns:
        - output: a tuple containing query, key, value with shapes of (N, H, S, head_dim), (N, H, T, head_dim), (N, H, T, head_dim) respectively
  """
  N, S, E = query.shape
  N, T, E = value.shape
  assert embed_dim % n_heads == 0

  head_dim = embed_dim // n_heads
  # Notes:
  #  1) Define your projections using nn.Linear() and initialize them following the order q,k,v
  #  2) You'll want to split your shape from (N, T, embed_dim) into (N, T, H, head_dim),
        #     where H is the number of heads.
  #  3) Tensor.view() and Tensor.permute() might help
  # ------------------------------------------------------------------------------------------------------------------------------
  # Write your code here
  q_proj = nn.Linear(E, embed_dim)
  k_proj = nn.Linear(E, embed_dim)
  v_proj = nn.Linear(E, embed_dim)
  # Apply the projections
  q = q_proj(query)
  k = k_proj(key)
  v = v_proj(value)
  # Reshape: split the last dimension into (n_heads, head_dim)
  q = q.view(N, S, n_heads, head_dim)
  k = k.view(N, T, n_heads, head_dim)
  v = v.view(N, T, n_heads, head_dim)
  # Permute to (N, H, sequence length, head_dim)
  q = q.permute(0, 2, 1, 3)
  k = k.permute(0, 2, 1, 3)
  v = v.permute(0, 2, 1, 3)











  # ------------------------------------------------------------------------------------------------------------------------------

  return q,k,v



In [3]:
torch.manual_seed(123)
batch_size = 1
sequence_length = 3
embed_dim = 8 #d
hidden_dim = 8 #d1
n_heads = 2
data = torch.randn(batch_size, sequence_length, hidden_dim)
q, k, v = get_multihead_qkv(data, data, data, embed_dim, n_heads)
print('The shape of query is {}.'.format(q.shape))
print('The L2 norm of query is {:.4f}.'.format(torch.linalg.norm(q)))

The shape of query is torch.Size([1, 2, 3, 4]).
The L2 norm of query is 2.7053.


#### Q1 (b) Multi-Headed Scaled Dot-Product Attention
In multi-headed attention, we learn a separate set of parameter matrices $Q_i$, $K_i$ and $V_i$ for each head $i$, which gives the model more expressivity to attend to different parts of the input. Let $Y_i$ denote the attention output of head $i$. To keep the overall computation the same as in the single-headed case, each matrix maps to only $d/h$ dimensions: $Q_i, K_i, V_i \in \mathbb{R}^{d_1\times d/h}$ (in the tests in this notebook, $d_1 = d$). Scaling the dot products by $\frac{1}{\sqrt{d/h}}$, the attention weights of head $i$ are

\begin{equation} \label{qkv_eqn}
A_i = \text{softmax}\bigg(\frac{(XQ_i)(XK_i)^\top}{\sqrt{d/h}}\bigg)
\end{equation}
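
To make the formula concrete, here is a minimal single-head sketch in plain Python. The function names `softmax` and `attention_weights` and the toy `q`, `k`, and `mask` values are assumptions chosen for illustration; the assignment itself uses batched PyTorch tensors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list; -inf entries get weight 0."""
    m = max(row)
    if m == float("-inf"):
        raise ValueError("every position is masked; softmax is undefined")
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k, mask=None):
    """q: (S x head_dim), k: (T x head_dim), mask: (S x T) of 0/1.
    Returns the (S x T) matrix softmax(q k^T / sqrt(head_dim))."""
    scale = math.sqrt(len(q[0]))
    weights = []
    for i, q_row in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        if mask is not None:
            # mask[i][j] == 0 means query i must not attend to key j
            scores = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    return weights


q = [[1.0, 0.0], [0.0, 1.0]]                 # S = 2, head_dim = 2
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # T = 3
mask = [[1, 0, 0], [1, 1, 0]]

A = attention_weights(q, k)
A_masked = attention_weights(q, k, mask)
print(A)
print(A_masked)
```

Each row of the result sums to 1, and masked positions receive exactly zero weight because `exp(-inf)` is 0.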

In [4]:
def calculate_multihead_attention(q, k, attn_mask=None):
  """
  Inputs:
        - q: Multi-head query with the shape of (N, H, S, head_dim)
        - k: Multi-head key with the shape of (N, H, T, head_dim)
        - attn_mask (if provided): Array of shape (S, T) where attn_mask[i,j] == 0 indicates token
          j in the key/value should not influence token i in the query output.
        - Note: head_dim refers to embed_dim/n_heads
  Returns:
        - attention_weights: attention tensor with shape of (N, H, S, T)
  """
  # Notes:
  #  1) You need to transpose k
  #  2) You need to set scores to '-inf' where mask==0. Tensor.masked_fill() might help.
  # ------------------------------------------------------------------------------------------------------------------------------
  # Write your code here
  # Transpose the key for matrix multiplication
  # k: (N, H, T, head_dim) -> k_t: (N, H, head_dim, T)
  N, H, S, head_dim = q.shape
  N, H, T, head_dim = k.shape
  k_t = k.transpose(-2, -1)

  # Calculate the scaled dot product: (q @ k_t) / sqrt(head_dim)
  # q: (N, H, S, head_dim) @ k_t: (N, H, head_dim, T) -> (N, H, S, T)
  #head_dim = q.size(-1)
  scale = math.sqrt(head_dim)
  scores = torch.matmul(q, k_t) / scale

  # Apply mask if provided
  if attn_mask is not None:
    # Expand the mask to match the scores shape
    # attn_mask: (S, T) -> (1, 1, S, T)
    expanded_mask = attn_mask.unsqueeze(0).unsqueeze(0)
    scores = scores.masked_fill(expanded_mask == 0, float('-inf'))

  # Apply softmax to get attention weights
  attention_weights = F.softmax(scores, dim=-1)











  # ------------------------------------------------------------------------------------------------------------------------------
  return attention_weights


In [5]:
torch.manual_seed(456)
# Create a boolean mask where False (0) means mask out and True (1) means keep
mask = torch.randn(sequence_length, sequence_length) < 0.5 # ~31% are False (masked), since P(z < 0.5) ≈ 0.69
self_attn_output = calculate_multihead_attention(q, k)
masked_self_attn_output = calculate_multihead_attention(q, k, attn_mask=mask)
print('The shape of attention is {}.'.format(self_attn_output.shape))
print('The L2 norm of self-attention is {:.4f}.'.format(torch.linalg.norm(self_attn_output)))
print('The L2 norm of masked_self_attn_output is {:.4f}.'.format(torch.linalg.norm(masked_self_attn_output)))

The shape of attention is torch.Size([1, 2, 3, 3]).
The L2 norm of self-attention is 1.4747.
The L2 norm of masked_self_attn_output is 2.2622.
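
The random mask in the cell above is built by comparing standard normal draws with 0.5, so the fraction of kept entries is the standard normal CDF at 0.5 rather than one half. The short calculation below is a sketch using only the standard library; `std_normal_cdf` is a helper name introduced here. It also estimates how often a whole row of three entries would be masked, a case in which every score in that row becomes `-inf` and the softmax is undefined (PyTorch returns NaN for such a row).

```python
import math


def std_normal_cdf(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


keep_prob = std_normal_cdf(0.5)      # P(randn < 0.5): entry is True (kept)
mask_prob = 1.0 - keep_prob          # entry is False (masked out)
row_fully_masked = mask_prob ** 3    # all 3 keys masked for one query

print(f"kept: {keep_prob:.4f}, masked: {mask_prob:.4f}")
print(f"P(a given row of 3 is fully masked): {row_fully_masked:.4f}")
```

With a sequence length of 3 this is rare for any single row, but it is worth checking the mask before trusting a masked result.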


#### Q1 (c) Final outputs and Wrap-up

Now that we have the attention weights $A_i$, each head's output can be computed with the following formula.

\begin{equation}
Y_i = A_i(XV_i)
\end{equation}

where $Y_i\in\mathbb{R}^{\ell \times d/h}$ and $\ell$ is the sequence length.



In our implementation, we apply dropout to the attention weights (though in practice it could be used at any step):

\begin{equation} \label{qkvdropout_eqn}
Y_i = \text{dropout}\bigg(A_i\bigg)(XV_i)
\end{equation}
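
As a rough illustration of what dropout does to a row of attention weights, the sketch below implements inverted dropout in plain Python: each entry is zeroed with probability $p$ and the survivors are scaled by $1/(1-p)$, which is the behaviour `nn.Dropout` documents for training mode. The function name `dropout` and the example row are assumptions for illustration.

```python
import random


def dropout(row, p, rng):
    """Inverted dropout: zero each entry with probability p and scale the
    surviving entries by 1 / (1 - p) so the expected value is unchanged."""
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in row]


rng = random.Random(0)
attn_row = [0.5, 0.3, 0.2]           # one row of A_i, sums to 1
dropped = dropout(attn_row, p=0.1, rng=rng)

print(dropped)
print(sum(dropped))                  # generally no longer exactly 1
```

One consequence is that, during training, a row of the dropped-out attention matrix no longer has to sum to 1; in evaluation mode dropout is the identity and the rows are true probability distributions again.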

Finally, the output of the self-attention layer is a linear transformation of the concatenation of the heads:

\begin{equation}
Y = [Y_1;\dots;Y_h]W
\end{equation}

where $W \in\mathbb{R}^{d\times d}$ and $[Y_1;\dots;Y_h]\in\mathbb{R}^{\ell \times d}$.
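
The concatenation step can be sketched in plain Python as follows. Each head contributes a block of $d/h$ columns, and the blocks are placed side by side before the final projection. The helper names and the toy head outputs are assumptions; $W$ is set to the identity here only so the result is easy to read.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def concat_heads(heads):
    """heads: list of h matrices of shape (ell x d/h).
    Returns one (ell x d) matrix with the heads side by side."""
    n_rows = len(heads[0])
    return [[v for head in heads for v in head[r]] for r in range(n_rows)]


# Two heads, sequence length 3, head_dim 2 (so d = 4)
Y1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Y2 = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]

Y_cat = concat_heads([Y1, Y2])

# Identity projection, chosen only to keep the example readable
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Y = matmul(Y_cat, W)

print(Y_cat)
print(Y)
```

In a batched tensor implementation the same layout is typically obtained by moving the head axis next to the feature axis and flattening the two into one dimension of size $d$.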


In [6]:
class MultiHeadAttention(nn.Module):
    """
    A model layer which implements a simplified version of masked attention, as
    introduced by "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).
    """

    def __init__(self, embed_dim, num_heads, dropout=0.1):
        """
        Construct a new MultiHeadAttention layer.

        Inputs:
         - embed_dim: Dimension of the token embedding
         - num_heads: Number of attention heads
         - dropout: Dropout probability
        """
        super().__init__()
        assert embed_dim % num_heads == 0

        # Linear layers for Q, K, V projections (input dim = output dim = embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

        # Final linear projection layer
        self.proj = nn.Linear(embed_dim, embed_dim)

        # Dropout layer for attention weights
        self.attn_drop = nn.Dropout(dropout)

        # Store dimensions
        self.n_head = num_heads
        self.emd_dim = embed_dim
        self.head_dim = self.emd_dim // self.n_head

    def forward(self, query, key, value, attn_mask=None):
        """
        Calculate the masked attention output for the provided data, computing
        all attention heads in parallel.

        In the shape definitions below, N is the batch size, S is the source
        sequence length, T is the target sequence length, and E is the embedding
        dimension.

        Inputs:
        - query: Input data to be used as the query, of shape (N, S, E)
        - key: Input data to be used as the key, of shape (N, T, E)
        - value: Input data to be used as the value, of shape (N, T, E)
        - attn_mask: Array of shape (S, T) where attn_mask[i,j] == 0 indicates token
          j in the key/value should not influence token i in the query output.

        Returns:
        - output: Tensor of shape (N, S, E) giving the weighted combination of
          data in value according to the attention weights calculated using key
          and query.
        """
        N, S, E = query.shape
        N, T, E = value.shape

        # Notes:
        #  1) Please do not directly call the functions defined above. Instead, write your code step by step.
        # ------------------------------------------------------------------------------------------------------------------------------
        # Write your code here
        Q = self.query(query)
        K = self.key(key)
        V = self.value(value)
        #reshape
        Q = Q.view(N, S, self.n_head, self.head_dim).transpose(1, 2)  # (N, nh, S, hd)
        K = K.view(N, T, self.n_head, self.head_dim).transpose(1, 2)   # (N, nh, T, hd)
        V = V.view(N, T, self.n_head, self.head_dim).transpose(1, 2)   # (N, nh, T, hd)
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)  # (N, nh, S, T)

        # Apply attention mask if provided
        if attn_mask is not None:
            attn_scores = attn_scores.masked_fill(attn_mask.unsqueeze(0).unsqueeze(0) == 0, float('-inf'))

        # Compute attention weights with softmax and dropout
        attn_weights = torch.softmax(attn_scores, dim=-1)
        attn_weights = self.attn_drop(attn_weights)

        # Apply attention weights to values
        Y = torch.matmul(attn_weights, V)  # (N, nh, S, hd)

        # Concatenate heads and apply final projection
        Y = Y.transpose(1, 2).contiguous().view(N, S, E)  # (N, S, E)
        output = self.proj(Y)


















        # ------------------------------------------------------------------------------------------------------------------------------
        return output

In [7]:
torch.manual_seed(231)


batch_size = 1
sequence_length = 3
attn = MultiHeadAttention(embed_dim, num_heads=2)

# Self-attention.
data = torch.randn(batch_size, sequence_length, embed_dim)
self_attn_output = attn(query=data, key=data, value=data)

# Masked self-attention.
mask = torch.randn(sequence_length, sequence_length) < 0.5
masked_self_attn_output = attn(query=data, key=data, value=data, attn_mask=mask)

# Attention using two inputs.
other_data = torch.randn(batch_size, sequence_length, embed_dim)
attn_output = attn(query=data, key=other_data, value=other_data)

expected_self_attn_output = np.asarray([[
[-0.2494,  0.1396,  0.4323, -0.2411, -0.1547,  0.2329, -0.1936,
          -0.1444],
         [-0.1997,  0.1746,  0.7377, -0.3549, -0.2657,  0.2693, -0.2541,
          -0.2476],
         [-0.0625,  0.1503,  0.7572, -0.3974, -0.1681,  0.2168, -0.2478,
          -0.3038]]])

expected_masked_self_attn_output = np.asarray([[
[-0.1347,  0.1934,  0.8628, -0.4903, -0.2614,  0.2798, -0.2586,
          -0.3019],
         [-0.1013,  0.3111,  0.5783, -0.3248, -0.3842,  0.1482, -0.3628,
          -0.1496],
         [-0.2071,  0.1669,  0.7097, -0.3152, -0.3136,  0.2520, -0.2774,
          -0.2208]]])

expected_attn_output = np.asarray([[
[-0.1980,  0.4083,  0.1968, -0.3477,  0.0321,  0.4258, -0.8972,
          -0.2744],
         [-0.1603,  0.4155,  0.2295, -0.3485, -0.0341,  0.3929, -0.8248,
          -0.2767],
         [-0.0908,  0.4113,  0.3017, -0.3539, -0.1020,  0.3784, -0.7189,
          -0.2912]]])

print('self_attn_output error: ', rel_error(expected_self_attn_output, self_attn_output.detach().numpy()))
print('masked_self_attn_output error: ', rel_error(expected_masked_self_attn_output, masked_self_attn_output.detach().numpy()))
print('attn_output error: ', rel_error(expected_attn_output, attn_output.detach().numpy()))

self_attn_output error:  0.0003775124598178026
masked_self_attn_output error:  0.0001526367643724865
attn_output error:  0.0003530104862933477


Checker: A correct implementation should give relative errors on the order of `1e-3` or smaller.
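
For intuition about the scale of these numbers, the sketch below re-implements the `rel_error` formula for plain Python floats. The names `rel_error_scalar` and `rel_error_lists` are introduced here for illustration. The expected arrays are printed to four decimal places, so rounding alone is a likely source of errors around `1e-4`.

```python
def rel_error_scalar(x, y):
    """|x - y| / max(1e-8, |x| + |y|): 0 for equal values, 1 for opposite signs."""
    return abs(x - y) / max(1e-8, abs(x) + abs(y))


def rel_error_lists(xs, ys):
    """Worst-case relative error over paired elements, as in rel_error."""
    return max(rel_error_scalar(x, y) for x, y in zip(xs, ys))


print(rel_error_scalar(1.0, 1.0))      # identical values
print(rel_error_scalar(1.0, -1.0))     # same magnitude, opposite sign
print(rel_error_scalar(1.0, 1.002))    # a 0.2% difference
print(rel_error_lists([1.0, 2.0], [1.0, 2.002]))
```

Because the denominator is $|x| + |y|$ rather than $|x|$, a 0.2% difference shows up as roughly `1e-3`, half of what a one-sided relative error would report.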

### Transformer: Positional Embedding

#### Q1 (d) Positional Embedding

While transformers are able to easily attend to any part of their input, the attention mechanism has no concept of token order. However, for many tasks (especially natural language processing), relative token order is very important. To recover this, the authors add a positional encoding to the embeddings of individual word tokens.

Let us define a matrix $P \in \mathbb{R}^{l\times d}$, where $P_{ij} = $

$$
\begin{cases}
\text{sin}\left(i \cdot 10000^{-\frac{j}{d}}\right) & \text{if j is even} \\
\text{cos}\left(i \cdot 10000^{-\frac{(j-1)}{d}}\right) & \text{otherwise} \\
\end{cases}
$$

Rather than directly passing an input $X \in \mathbb{R}^{l\times d}$ to our network, we instead pass $X + P$.
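
The definition of $P$ can be checked by hand with a direct, loop-based sketch in plain Python. The function name `positional_encoding` is an assumption made for this example; a tensor implementation would vectorise the same formula.

```python
import math


def positional_encoding(length, d):
    """Build P (length x d) with
    P[i][j] = sin(i * 10000^(-j/d))       if j is even
    P[i][j] = cos(i * 10000^(-(j-1)/d))   otherwise."""
    assert d % 2 == 0
    P = []
    for i in range(length):
        row = []
        for j in range(d):
            base = j if j % 2 == 0 else j - 1
            angle = i * 10000.0 ** (-base / d)
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        P.append(row)
    return P


P = positional_encoding(length=2, d=6)
for row in P:
    print([round(v, 4) for v in row])

# The frequency term can equivalently be written with exp and log
j, d = 2, 6
print(math.isclose(10000.0 ** (-j / d), math.exp(-j * math.log(10000.0) / d)))
```

Position 0 always encodes to alternating 0 and 1, since $\sin(0) = 0$ and $\cos(0) = 1$; every later position rotates each sine/cosine pair at its own frequency.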

In [8]:
class PositionalEncoding(nn.Module):
    """
    Encodes information about the positions of the tokens in the sequence. In
    this case, the layer has no learnable parameters, since it is a simple
    function of sines and cosines.
    """
    def __init__(self, embed_dim, dropout=0.1, max_len=5000):
        """
        Construct the PositionalEncoding layer.

        Inputs:
         - embed_dim: the size of the embed dimension
         - dropout: the dropout value
         - max_len: the maximum possible length of the incoming sequence
        """
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        assert embed_dim % 2 == 0
        # Create an array with a "batch dimension" of 1 (which will broadcast
        # across all examples in the batch).
        pe = torch.zeros(1, max_len, embed_dim)
        # Notes:
        #  1) Construct the positional encoding array as described above.
        # ------------------------------------------------------------------------------------------------------------------------------
        # Write your code here
        # Position indices 0 .. max_len-1 as a column vector of shape (max_len, 1)
        position = torch.arange(max_len).unsqueeze(1)

        # Frequency terms 10000^(-j/d) for even j, computed as exp(-j * log(10000) / d)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() *
            (-math.log(10000.0) / embed_dim)
        ).unsqueeze(0).unsqueeze(0)  # shape (1, 1, embed_dim // 2)

        # Sine on even embedding indices, cosine on odd embedding indices
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)








        # ------------------------------------------------------------------------------------------------------------------------------

        # Make sure the positional encodings will be saved with the model
        # parameters (mostly for completeness).
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Element-wise add positional embeddings to the input sequence.

        Inputs:
         - x: the sequence fed to the positional encoder model, of shape
              (N, S, D), where N is the batch size, S is the sequence length and
              D is embed dim
        Returns:
         - output: the input sequence + positional encodings, of shape (N, S, D)
        """
        N, S, D = x.shape
        # Create a placeholder, to be overwritten by your code below.
        output = torch.empty((N, S, D))
        # Notes:
        #  1) Index into your array of positional encodings, and add the appropriate ones to the input sequence.
        #  2) Don't forget to apply dropout afterward.
        # ------------------------------------------------------------------------------------------------------------------------------
        # Write your code here
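        # Step 1: keep only the first S positions, (1, max_len, D) -> (1, S, D)
        pe = self.pe[:, :S, :]

        # Step 2: add the encodings to the input; the leading dimension of
        # size 1 broadcasts over the batch, giving shape (N, S, D)
        output = x + pe

        # Step 3: apply dropout to the sum
        output = self.dropout(output)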




        # ------------------------------------------------------------------------------------------------------------------------------


        return output

In [9]:
torch.manual_seed(231)

batch_size = 1
sequence_length = 2
embed_dim = 6
data = torch.randn(batch_size, sequence_length, embed_dim)

pos_encoder = PositionalEncoding(embed_dim)
output = pos_encoder(data)

expected_pe_output = np.asarray([[[-1.2340,  1.1127,  1.6978, -0.0865, -0.0000,  1.2728],
                                  [ 0.9028, -0.4781,  0.5535,  0.8133,  1.2644,  1.7034]]])

print('pe_output error: ', rel_error(expected_pe_output, output.detach().numpy()))

pe_output error:  0.00010421011374914356


Checker: A correct implementation should give a relative error on the order of `1e-3` or smaller.

## 2. Applying Transformer

In [10]:
! pip install transformers datasets evaluate


In [11]:
from datasets import load_dataset, DatasetDict

ag_news_dataset = load_dataset("ag_news")


In [12]:
ag_news_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [13]:
# Keep only the first 100 whitespace-separated words for speed/running on CPU
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:100]),
        'label': example['label']
    }

# Take 1024 random examples for training and 128 for validation
small_ag_news_dataset = DatasetDict(
    train=ag_news_dataset['train'].shuffle(seed=1111).select(range(1024)).map(truncate),
    val=ag_news_dataset['test'].shuffle(seed=1111).select(range(128)).map(truncate),
)

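
Note that `truncate` counts whitespace-separated words, not model tokens. The standalone sketch below isolates that behaviour; `truncate_words` is a hypothetical helper that mirrors the expression inside `truncate`, and the sample strings are made up.

```python
def truncate_words(text, max_words=100):
    """Keep the first max_words whitespace-separated words, joined by single spaces."""
    return " ".join(text.split()[:max_words])


long_text = " ".join(f"word{i}" for i in range(150))
short = truncate_words(long_text)

print(len(long_text.split()), "->", len(short.split()))
print(repr(truncate_words("a  b\nc d", max_words=3)))  # runs of whitespace collapse
```

Because a subword tokenizer can split one word into several pieces, 100 words may map to more than 100 token ids, so a separate length limit is still needed at tokenization time.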

In [14]:
small_ag_news_dataset['train'][0]

{'text': 'India and Pakistan balk at bold Kashmir peace plan Pakistani President Pervez Musharraf this week urged steps to end the bitter dispute.',
 'label': 0}

In [15]:
small_ag_news_dataset['val'][0]

{'text': 'Nortel warns of lower Q3 revenue TORONTO - Nortel Networks warned Thursday its third-quarter revenue will be below the \\$2.6 billion US preliminary unaudited revenues it reported for the second quarter.',
 'label': 2}

In [16]:
id2label = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Sci/Tech",
    }

#### Q2 (a)

In [17]:
from transformers import DistilBertTokenizerFast

# ------------------------------------------------------------------------------------------------------------------------------
# Write your code here
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

def tokenize_function(examples):
    tokenized_inputs = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )
    return {
        "input_ids": tokenized_inputs["input_ids"].squeeze(0),     # shape [128]
        "attention_mask": tokenized_inputs["attention_mask"].squeeze(0),  # shape [128]
        "labels": torch.tensor(examples["label"], dtype=torch.long)
    }


small_tokenized_dataset = small_ag_news_dataset.map(
    tokenize_function,
    batched=False,
    remove_columns=["text","label"]
)



# print the first 3 processed samples
small_tokenized_dataset['train'][:3]
# ------------------------------------------------------------------------------------------------------------------------------


{'input_ids': [[101, 2634, 1998, 4501, 28352, 2243, 2012, 7782, 13329, 3521,
   2933, 9889, 2343, 2566, 26132, 14163, 7377, 11335, 2546, 2023,
   2733, 9720, 4084, 2000, 2203, 1996, 8618, 7593, 1012, 102,
   0, 0, 0, ... (98 padding zeros in total, giving length 128)],
  [101, 3042, 2194, 14054, 2000, 2158, 2654, 2086, 2101, 1037, 7056, 2158, 1001, 4464,
   ... (output truncated)
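
The first sample above has 30 real token ids followed by zeros. The sketch below reproduces that layout in plain Python to show how `padding="max_length"` relates `input_ids` to `attention_mask`. The helper `pad_to_max_length` is a hypothetical simplification: the real tokenizer also handles special tokens when it truncates, which this sketch does not.

```python
PAD_ID = 0  # the padding id visible in the output above


def pad_to_max_length(ids, max_length, pad_id=PAD_ID):
    """Right-pad (or naively cut) ids to max_length and build the matching
    attention mask: 1 for real tokens, 0 for padding."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad


# Token ids of the first training sample, copied from the output above
first_ids = [101, 2634, 1998, 4501, 28352, 2243, 2012, 7782, 13329, 3521,
             2933, 9889, 2343, 2566, 26132, 14163, 7377, 11335, 2546, 2023,
             2733, 9720, 4084, 2000, 2203, 1996, 8618, 7593, 1012, 102]

input_ids, attention_mask = pad_to_max_length(first_ids, max_length=128)

print(len(input_ids), sum(attention_mask), input_ids.count(PAD_ID))
```

The attention mask is what lets the model ignore the padded positions, so the two lists always have the same length.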

#### Q2 (b)

In [18]:
from transformers import get_linear_schedule_with_warmup
from tqdm.notebook import tqdm
import os
from transformers import DistilBertForSequenceClassification
import torch
import evaluate
from torch.utils.data import DataLoader
from torch.optim import AdamW

# This cell might take a long time (≈1 hour), please be patient
# ------------------------------------------------------------------------------------------------------------------------------
# Write your code here

# Define your model, optimizer, hyperparameters, etc.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
batch_size = 16
learning_rate = 2e-5
num_epochs = 3

# Explicitly set the dataset format to PyTorch tensors.
# Let any error propagate so the cell fails visibly.
small_tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
print("Dataset format set to 'torch'.")

train_loader = DataLoader(
    small_tokenized_dataset["train"],
    batch_size=batch_size,
    shuffle=True
)

val_loader = DataLoader(
    small_tokenized_dataset["val"],
    batch_size=batch_size
)


model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=4
)
model.to(device)
# Optimizer configuration
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Learning rate scheduler
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)


metric = evaluate.load("accuracy")



for epoch in range(num_epochs):
    # Train and evaluate your model
    model.train()
    train_preds, train_labels = [], []
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Forward pass
        outputs = model(input_ids,
                       attention_mask=attention_mask,
                       labels=labels)
        loss = outputs.loss

        # Backward pass, gradient clipping, and parameter update
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        # Collect training predictions
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        train_preds.extend(preds.cpu().numpy())
        train_labels.extend(labels.cpu().numpy())

    # Compute training accuracy
    train_acc = metric.compute(predictions=train_preds, references=train_labels)["accuracy"]

    # ======================== Validation ========================
    model.eval()
    val_preds, val_labels = [], []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=1)
            val_preds.extend(preds.cpu().numpy())   # move to CPU before collecting
            val_labels.extend(labels.cpu().numpy())

    validation_acc = metric.compute(predictions=val_preds, references=val_labels)["accuracy"]

    # print the training process
    print("Epoch {}: train acc = {:.4f}, validation acc = {:.4f}".format(epoch + 1, train_acc, validation_acc))

# ------------------------------------------------------------------------------------------------------------------------------

Using device: cuda
Dataset format set to 'torch'.



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch 1: train acc = 0.7490, validation acc = 0.8984



Epoch 2: train acc = 0.8945, validation acc = 0.8906



Epoch 3: train acc = 0.9238, validation acc = 0.8672
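
The run above steps a linear schedule with zero warmup once per batch, for 64 batches per epoch over 3 epochs. The sketch below is a plain-Python rendering of the learning-rate multiplier such a schedule applies, as I understand it: a linear ramp from 0 to 1 during warmup, then a linear decay to 0 at the final step. The function name `linear_schedule_multiplier` is an assumption; consult the library documentation for the authoritative definition.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 64 * 3  # 64 batches per epoch, 3 epochs, as in the run above

for step in (0, 64, 96, 128, 192):
    mult = linear_schedule_multiplier(step, num_warmup_steps=0, num_training_steps=total_steps)
    print(f"step {step:3d}: multiplier = {mult:.3f}, lr = {base_lr * mult:.2e}")
```

With no warmup, training starts at the full base learning rate, and the rate has already fallen by a third when the second epoch begins.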


In [19]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Q2 (c)

In [1]:
chatgpt_generated_news = [
    "In an exciting match last night, the Los Angeles Lakers defeated the Brooklyn Nets 115-110. Lakers' LeBron James made a comeback after missing several games due to injury and scored 25 points while teammate Anthony Davis added 28 points. Nets' star player Kevin Durant scored 32 points but couldn't lead his team to victory.",
    "Scientists have discovered a new species of dinosaur that roamed the earth 80 million years ago. The species, named Almatherium, was found in Uzbekistan and is believed to be an ancestor of the modern-day armadillo. The discovery sheds new light on the evolution of mammals and their relationship with dinosaurs.",
    "The United Nations has called for an immediate ceasefire in Yemen as the country faces a growing humanitarian crisis. The UN's special envoy for Yemen, Martin Griffiths, urged all parties to end the violence and engage in peace talks. The conflict has left millions of Yemenis at risk of famine and disease.",
    "Amazon has announced that it will be opening its first fulfillment center in New Zealand, creating more than 500 new jobs. The center will be located in Auckland and is expected to open in 2022. This move will allow Amazon to expand its operations in the region and improve delivery times for customers.",
]
prediction_label = []

# ------------------------------------------------------------------------------------------------------------------------------
# Write your code here

# test your finetuned model on chatgpt_generated_news
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


model.eval()


prediction_label = []


label_mapping = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Sci/Tech",

}


for news in chatgpt_generated_news:
    inputs = tokenizer(
        news,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        pred = torch.argmax(logits, dim=1).item()

    # Look up the text label for the predicted class index
    predicted_label = label_mapping.get(pred, f"label {pred}")
    prediction_label.append(predicted_label)


# print the predictions for chatgpt_generated_news
print(prediction_label)

# ------------------------------------------------------------------------------------------------------------------------------

KeyboardInterrupt: 
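
The cell above was interrupted before it printed anything, so no predictions are recorded here. To show only the final step, turning a vector of four logits into a class name, here is a plain-Python sketch. The logits are made-up numbers, not model output, and `predict_label` is a hypothetical helper.

```python
import math

id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def predict_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]          # stable softmax
    probs = [e / sum(exps) for e in exps]
    pred = max(range(len(logits)), key=lambda i: logits[i])  # argmax
    return id2label[pred], probs[pred]


hypothetical_logits = [-1.2, 3.4, 0.1, -0.5]  # illustrative values only
label, prob = predict_label(hypothetical_logits)
print(label, round(prob, 3))
```

Taking the argmax of the logits gives the same class as taking the argmax of the softmax probabilities, so the softmax is only needed when a confidence value is wanted.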

#### Q2 (d)

In [None]:
# This cell might take a long time (≈1.5 hours), please be patient
# ------------------------------------------------------------------------------------------------------------------------------
# Write your code here


# Define your model. optimizer, hyper-parameter and etc.


for epoch in range(num_epochs):
    #train and evaluate your model

    # print the training process
    print("Epoch {}: train acc = {:.4f}, validation acc = {:.4f}".format(epoch + 1, train_acc, validation_acc))

# ------------------------------------------------------------------------------------------------------------------------------