# Lab4 - Sentiment Analysis with LLM

Group 4 | 蕭名妍 劉貞莉 黃詩涵

We ran the code below under three different setups and compared the results:

| | Run 1 | Run 2 | Run 3 |
| --- | :---: | :---: | :---: |
| Loading Data (data size) | 10,000 reviews | 10,000 reviews | 50,000 reviews |
| Load Model | CPU | MPS (GPU) | MPS (GPU) |
| Fine-tuning (training set) | 8,000 reviews | 8,000 reviews | 40,000 reviews |
| Evaluation | CPU | CPU | CPU |
| Inference (testing set) | 2,000 reviews | 2,000 reviews | 10,000 reviews |

See the slide deck for the detailed comparison.

In [55]:
# pip install transformers

In [56]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
import pandas as pd
import random

# PyTorch with MPS (GPU on Mac)

In [57]:
print(torch.__version__)

2.1.0


In [58]:
# Is MPS even available? macOS 12.3+
print(torch.backends.mps.is_available())

# Was the current version of PyTorch built with MPS activated?
print(torch.backends.mps.is_built())

# If both calls print True, PyTorch can use the Apple Metal GPU through the MPS backend.

True
True


In [59]:
# source: https://developer.apple.com/metal/pytorch/
if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)
    print(x)
else:
    print("MPS device not found.")

tensor([1.], device='mps:0')


# Loading Data

In [60]:
df = pd.read_csv("IMDB_Dataset.csv")
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [61]:
df["sentiment"].value_counts() # the sentiment column has only two values: positive and negative
# the two classes are balanced

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [62]:
# Optionally shrink the data (a free Colab instance cannot handle the full set)
# Uncomment the next line to sample 50k -> 10k rows, as in the 10,000-review runs
# df = df.sample(frac=0.2, random_state=42)
reviews = df['review'].to_numpy()
labels = df['sentiment'].map({'positive':1, 'negative':0}).to_numpy() # convert category data to integer

In [63]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(reviews,labels, test_size=0.2, random_state=42)

# Load Model

In [64]:
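# load pre-trained BERT model and tokenizer

# The tokenizer splits text into words or subwords and converts them into the input format the model expects
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# bert-base-uncased is the name of a pre-trained BERT model on the Hugging Face Model Hub
# It was trained on lowercased English text ("uncased" means it does not distinguish upper and lower case)

# Create a BERT model for sequence classification, a BERT variant suited to text classification
# The BERT encoder weights are pre-trained on large text corpora; the classification head is newly initialized (see the warning below) and is learned during fine-tuning
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Text is encoded by the tokenizer, and the encoded tensors are passed to the model to get predictions
# cpu: 1.5s
# mps(gpu): 1.8s
# mps(gpu): 2.1s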

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [65]:
# BERT input conversion
# Convert raw text into the input format BERT expects, for both training and inference
def tokenize_reviews(reviews, labels):
    input_ids = [] # convert tokens to integers (each id represents a unique token)
    attention_masks = [] # purpose: handle sequences of varying lengths

    for review in reviews:
        encoded_dict = tokenizer.encode_plus( # encode the text into the format BERT accepts
            review, # the text currently being processed
            add_special_tokens=True, # add the special tokens BERT requires, such as [CLS] and [SEP]
            max_length=128, # maximum length of the encoded sequence, in tokens
            padding='max_length', # pad shorter sequences up to max_length
            truncation=True, # cut longer sequences down to max_length
            return_attention_mask=True, # return an attention mask marking real tokens (1) versus padding (0)
            return_tensors='pt' # return PyTorch tensors
        )
        input_ids.append(encoded_dict['input_ids']) # encoded token ids for each review
        attention_masks.append(encoded_dict['attention_mask']) # matching attention mask

    # torch.cat joins the per-review tensors in each list into one large tensor; dim=0 concatenates along the first (batch) dimension
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels) # the label for each review, stored as a PyTorch tensor

    # the returned tensors can be fed directly to the BERT model for text classification or other NLP tasks
    return input_ids, attention_masks, labels
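
To make the `max_length`, `padding`, and `truncation` arguments concrete, here is a toy sketch of what they do to a single list of token ids. It is an illustration only, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the input ids are made up, and the ids 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) are assumed to match the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids for bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    # Truncation: keep only as many tokens as fit between [CLS] and [SEP].
    body = list(token_ids)[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)          # 1 marks a real token
    n_pad = max_length - len(ids)  # padding needed to reach max_length
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 marks padding


# A short sequence is padded; a long one is truncated.
short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=6)
long_ids, long_mask = pad_or_truncate(range(1000, 1010), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every encoded review ends up with the same length, the per-review tensors can be concatenated into one tensor, which is what `torch.cat` does in `tokenize_reviews`.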

# Fine-tuning

In [66]:
# BERT fine-tuning
# convert data to PyTorch tensors
input_ids, attention_masks, labels = tokenize_reviews(X_train, y_train)
# cpu: 10.8s
# mps(gpu): 9.9s
# mps(gpu): 53.1s

In [67]:
# create DataLoader for training data
# A DataLoader serves the training data in batches, the usual way to feed data in deep learning
dataset = TensorDataset(input_ids, attention_masks, labels)
# `TensorDataset` is a PyTorch class that combines several tensors into one dataset
# `input_ids` holds the encoded reviews, `attention_masks` the attention masks, and `labels` the label of each review
# The `TensorDataset` bundles the three tensors so that the i-th item of each is returned together

train_dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# `DataLoader` is a PyTorch class that batches a dataset and handles loading. It iterates over the data in batches of `batch_size` and can optionally shuffle the data (`shuffle`).
# Here each batch contains 32 samples
# The data is reshuffled at the start of every epoch so the model sees a different order each time, which usually helps training

# Once `train_dataloader` exists, the training loop iterates over it
# In each epoch, every batch it yields is passed to the model for one training step
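
As a rough picture of what `batch_size=32` and `shuffle=True` mean for the loop, the following standard-library sketch shuffles sample indices and cuts them into batches. `iter_minibatches` is a hypothetical stand-in for illustration, not how `DataLoader` is implemented.

```python
import math
import random


def iter_minibatches(n_samples, batch_size, seed=None):
    """Yield lists of sample indices in a shuffled order, batch_size at a time."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # plays the role of shuffle=True
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]  # the last batch may be smaller


# 8,000 training reviews in batches of 32
batches = list(iter_minibatches(8000, 32, seed=0))
print(len(batches), math.ceil(8000 / 32))
```

The number of optimizer steps per epoch is the number of batches, so the 40,000-review run takes five times as many steps per epoch as the 8,000-review runs.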

In [68]:
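# BERT fine-tuning
# Create an optimizer that updates the model weights to minimize the loss
# AdamW is a variant of Adam with decoupled weight decay and is a common choice for fine-tuning transformer models
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# model.parameters() returns the model's parameters (weights and biases), telling the optimizer what to update
# lr=2e-5 sets the learning rate, i.e. the step size of each update; it is a hyperparameter that should be tuned for the task and data

loss_fn = torch.nn.CrossEntropyLoss() # not used below: the model computes its own loss when labels are passed to it
epochs = 2

# epochs is a hyperparameter: the number of full passes over the training set. In one epoch, every training sample goes through one forward and one backward pass and contributes to a weight update. The value is usually chosen experimentally, taking the following into account:
# - Dataset size: the number of updates per epoch grows with the dataset, so the right number of epochs depends on how much data there is.
# - Training time: more epochs mean longer training, so the available compute and time limit the choice.
# - Model performance: track training accuracy or loss, and keep training until performance levels off.
# - Overfitting: if the model keeps improving on the training set but gets worse on a validation set, it is overfitting, and more epochs will make things worse. Early stopping can prevent this.
# In general, start with a small value such as 2 or 3 epochs and watch the model's performance. If it has not converged or performs poorly, consider adding epochs, but do so cautiously to avoid overtraining. The final choice should rest on experiments and validation results.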
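
The early-stopping rule mentioned above can be sketched as follows. This is a minimal illustration that assumes a held-out validation split, which this notebook does not create; `early_stopping_epoch`, `patience`, `min_delta`, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=1, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss          # new best: reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses)       # never triggered: ran all epochs


# Made-up validation losses: improvement stalls after epoch 2.
losses = [0.40, 0.31, 0.33, 0.36]
print(early_stopping_epoch(losses, patience=1))
print(early_stopping_epoch(losses, patience=2))
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra epochs.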

In [69]:
model.to(mps_device) # move the model to the selected compute device

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [72]:
# Train the model on mps(gpu)
for epoch in range(epochs): # training runs for several epochs
  model.train() # switch to training mode, which enables dropout (gradient tracking is controlled separately, e.g. by torch.no_grad())
  for batch in train_dataloader: # inner loop over all batches; train_dataloader yields one batch per iteration
    input_ids, attention_mask, labels = batch # unpack the input tensors (input_ids and attention_mask) and the labels
    input_ids, attention_mask, labels = input_ids.to(mps_device), attention_mask.to(mps_device), labels.to(mps_device) # move the batch to the selected device
    optimizer.zero_grad() # reset gradients to zero; PyTorch accumulates gradients across backward passes, so stale values would otherwise corrupt this batch's update

    outputs = model(input_ids, attention_mask=attention_mask, labels=labels) # forward pass; because labels are passed, the model also computes the loss
    loss = outputs.loss # loss computed by the model
    loss.backward() # backpropagation: compute the gradient of the loss with respect to the parameters
    optimizer.step() # update the parameters using those gradients (one AdamW step)
# cpu: 25m 59s
# mps(gpu): 6m 30.5s
# mps(gpu): 37m 43.9s
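
The three timing comments can be turned into a quick comparison. The sketch below assumes they correspond to runs 1, 2, and 3 in that order (CPU with 8,000 reviews, MPS with 8,000 reviews, MPS with 40,000 reviews); the notebook does not label them explicitly.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' timing into seconds."""
    return minutes * 60 + seconds


cpu_8k = to_seconds(25, 59)     # assumed run 1: CPU, 8,000 training reviews
mps_8k = to_seconds(6, 30.5)    # assumed run 2: MPS, 8,000 training reviews
mps_40k = to_seconds(37, 43.9)  # assumed run 3: MPS, 40,000 training reviews

speedup = cpu_8k / mps_8k       # how much faster MPS is on the same data
time_ratio = mps_40k / mps_8k   # cost of 5x more data on MPS

print(f"MPS speedup over CPU: {speedup:.2f}x")
print(f"5x the data took {time_ratio:.2f}x the time on MPS")
```

Under that assumption, training time on MPS grows a little faster than the amount of data.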

# Evaluation

In [73]:
# torch.cuda.empty_cache()
torch.mps.empty_cache()

In [None]:
# Switch to the CPU for evaluation (the GPU ran out of memory)
device = torch.device('cpu')
model.to(device) # move the model to the CPU
model.eval() # set to evaluation mode
test_input_ids, test_attention_masks, test_labels = tokenize_reviews(X_test, y_test) # convert the test text into BERT's input format
test_input_ids, test_attention_masks, test_labels = test_input_ids.to(device), test_attention_masks.to(device), test_labels.to(device) # the data must be on the CPU as well

with torch.no_grad(): # context manager that disables gradient tracking in this block; inference does not need gradients, so this saves time and memory
  logits = model(test_input_ids, attention_mask=test_attention_masks) # run the fine-tuned BERT classifier on the whole test set in a single forward pass
  # the returned object's `.logits` field holds the raw score for each class; these can be turned into probabilities or reduced with `argmax` to get the predicted label
predicted_labels = np.argmax(logits.logits.cpu().numpy(), axis=1) # pick the highest-scoring class for each sample, i.e. the predicted label
# `logits.logits` is a PyTorch tensor, so `.cpu().numpy()` converts it to a NumPy array for easier handling
# cpu: 59s
# cpu: 1m 2.4s
# cpu: 234m 34.3s for 50k data
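
The cell above sends the entire test set through the model in one forward pass, which is one likely reason memory ran short. A common alternative is to predict in small chunks and join the results. The sketch below shows only the chunking pattern with a stand-in predictor; `predict_in_batches` and `toy_predict` are hypothetical names, and the approach has not been timed on this notebook's data.

```python
def predict_in_batches(predict_fn, inputs, batch_size=64):
    """Apply predict_fn to consecutive slices of inputs and join the results."""
    predictions = []
    for start in range(0, len(inputs), batch_size):
        chunk = inputs[start:start + batch_size]
        predictions.extend(predict_fn(chunk))  # only one chunk is processed at a time
    return predictions


# Stand-in for the model: call a "review" positive when its score is above 0.
def toy_predict(chunk):
    return [1 if score > 0 else 0 for score in chunk]


scores = [0.9, -0.2, 0.4, -1.3, 0.1]  # made-up inputs
print(predict_in_batches(toy_predict, scores, batch_size=2))
```

In the notebook, `predict_fn` would wrap the model call under `torch.no_grad()` on a slice of `test_input_ids` and `test_attention_masks` and return the `argmax` labels for that slice.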

In [None]:
# Collect the metrics in a DataFrame
evaluation = pd.DataFrame({'value':[accuracy_score(y_test, predicted_labels),
                                    precision_score(y_test, predicted_labels, average='macro'),
                                    recall_score(y_test, predicted_labels, average='macro'),
                                    f1_score(y_test, predicted_labels, average='macro')]},
                           index=["accuracy", "precision", "recall", "f1_score"])

evaluation.round(4)

Unnamed: 0,value
accuracy,0.875
precision,0.8753
recall,0.875
f1_score,0.875


In [None]:
evaluation

Unnamed: 0,value
accuracy,0.875
precision,0.875345
recall,0.874985
f1_score,0.874968


In [None]:
logits

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.0041, -1.1740],
        [ 1.3857, -0.6221],
        [ 2.4623, -1.8269],
        ...,
        [ 2.6914, -2.0562],
        [ 1.0100, -0.3644],
        [ 1.0894, -0.4423]]), hidden_states=None, attentions=None)
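
The logits are raw scores, not probabilities. A softmax converts one row into class probabilities, and the largest entry gives the predicted label. A minimal standard-library sketch, using the first row printed above (`softmax` here is a hand-written helper, not the PyTorch function):

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


first_row = [2.0041, -1.1740]  # first row of the logits printed above
probs = softmax(first_row)
predicted = max(range(len(probs)), key=probs.__getitem__)  # same as argmax
print([round(p, 4) for p in probs], predicted)
```

Index 0 corresponds to `negative` and index 1 to `positive`, following the label mapping used when the data was loaded.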

In [None]:
y_test[:10]

array([0, 0, 0, 0, 0, 1, 0, 0, 1, 1])

In [None]:
predicted_labels[:10]

array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1])
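
To show what `average='macro'` means in the metrics cell, the sketch below recomputes accuracy and macro-averaged precision, recall, and F1 by hand on the ten labels printed above. `macro_scores` is a hypothetical helper written for illustration; it is not scikit-learn's implementation, and ten samples say little about the full test set.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,  # unweighted mean over classes
        "recall": sum(recalls) / n,
        "f1_score": sum(f1s) / n,
    }


y_true_10 = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]  # y_test[:10] from above
y_pred_10 = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1]  # predicted_labels[:10] from above
for name, value in macro_scores(y_true_10, y_pred_10).items():
    print(f"{name}: {value:.4f}")
```

Macro averaging gives each class equal weight regardless of how many samples it has, which matters little here because the dataset is balanced.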

In [None]:
X_test[:10]

array(['the tortuous emotional impact is degrading, whether adult or adolescent the personal values shown in this movie belong in a bad psychodrama if anywhere at all. This movie has a plot, but it is all evil from start to end. This is no way for people to act and degrades both sexes all the way through the movie. teen killing - bad preteen sex - bad emotional battering - bad animal cruelty - bad psychological torture - bad parental neglect - bad the only merit if any is the excellent color shots of contrasting red, blond and green leaves a bad feeling for anyone that respects life and peace, what a bad mistake to make, or to watch... it is UGLY',
       'Anyone who knows anything about evolution wouldn\'t even need to see the film to say "fake". "it\'s never been disproved" also is a weak argument. Saying the universe was created by a giant hippo cannot be disproved. Although, to be fair, it does seem like the only people who do believe are the same people who open email attachments 

In [None]:
# Print the number of trainable parameters
def get_learnable_params(module):
    return [p for p in module.parameters() if p.requires_grad]

model_params = get_learnable_params(model)
clf_params = get_learnable_params(model.classifier)

print(f"""
Parameters in the full classification model: {sum(p.numel() for p in model_params)}
Parameters in the linear classifier head: {sum(p.numel() for p in clf_params)}
""")

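The counts can also be estimated by hand from the layer shapes. The sketch below uses the sizes visible in the printed model (vocabulary 30522, hidden size 768, 512 positions, 12 layers) and two output classes, as in the logits. The feed-forward size of 3072 and the pooler layer are not visible in the truncated printout; they are assumed from the standard `bert-base` configuration, so treat the total as an estimate rather than the cell's actual output.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def layer_norm_params(hidden):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * hidden


hidden, vocab, positions, token_types, n_layers = 768, 30522, 512, 2, 12  # from the printout
intermediate = 3072  # assumption: standard bert-base feed-forward size
n_labels = 2         # two sentiment classes

embeddings = (vocab + positions + token_types) * hidden + layer_norm_params(hidden)
encoder_layer = (
    4 * linear_params(hidden, hidden)      # query, key, value, attention output
    + layer_norm_params(hidden)
    + linear_params(hidden, intermediate)  # feed-forward up-projection (assumed)
    + linear_params(intermediate, hidden)  # feed-forward down-projection (assumed)
    + layer_norm_params(hidden)
)
pooler = linear_params(hidden, hidden)     # assumption: default pooler layer
classifier = linear_params(hidden, n_labels)

total = embeddings + n_layers * encoder_layer + pooler + classifier
print(f"classifier head: {classifier:,}")
print(f"estimated full model: {total:,}")
```

The classifier head is a tiny fraction of the model, so fine-tuning mostly adjusts the pre-trained encoder weights.
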
# Inference

In [None]:
# Map the predictions back to positive / negative and combine them with the test data in one DataFrame
mapping = {1: 'positive', 0: 'negative'}
y_test_map = np.array([mapping[val] for val in y_test])
predicted_labels_map = np.array([mapping[val] for val in predicted_labels])
result_df = pd.DataFrame({'review': X_test,'sentiment': y_test_map, 'predicted_sentiment': predicted_labels_map})
result_df

Unnamed: 0,review,sentiment,predicted_sentiment
0,"the tortuous emotional impact is degrading, wh...",negative,negative
1,Anyone who knows anything about evolution woul...,negative,negative
2,I'm glad I rented this movie for one reason: i...,negative,negative
3,"Yes, the votes are in. This film may very well...",negative,positive
4,This mini-series is actually more entertaining...,negative,positive
...,...,...,...
1995,I can tell by the other comments that NOBODY c...,negative,negative
1996,The story turns around Antonio 'Scarface' Mont...,positive,positive
1997,I am so disappointed. After waiting for 3 year...,negative,negative
1998,I saw this movie in Blockbuster and thought it...,negative,negative


In [None]:
result_df.to_csv("result_df.csv", index=False, encoding="utf-8")

In [26]:
# true is positive, predict is positive
t_sentiment = "positive"
p_sentiment = "positive"
random.seed(2023) # fix the random seed so every run gives the same result
temp_df = result_df[(result_df["sentiment"]==t_sentiment) & (result_df["predicted_sentiment"]==p_sentiment)]
temp = temp_df.iloc[random.sample(range(len(temp_df)), k=5)]
for i in range(5):
    print(temp.iloc[i,0])
    print("========================")
temp

An hilariously accurate caricature of trying to sell a script. Documentary hits all the beats, plot points, character arcs, seductions, moments of elation and disappointments and the allure but insane prospect of selling a script or getting an agent in Hollywood;and all the fleeting, fantasy-realizing but ultimately empty rites of passage attendant to being socialized into "the system." Hotz and Rice capture the moment of thinking you're finally a player, only to find that what goes up comes down fast and in a blind-siding fashion;that for inexplicable reasons, Hollywood has moved on and left you checking your heart, your dreams, and your pockets. Pitch is a must-see for students in film school to taste the mind and ego-bashing gantlet that is, for most, the road that must be traveled to sell oneself and one's projects in Hollywood. If your teacher or guru has never been there, they can't tell you what you need to prepare for this gantlet. To enter the"biz," talent is necessary but far

Unnamed: 0,review,sentiment,predicted_sentiment
7213,An hilariously accurate caricature of trying t...,positive,positive
8395,This movie is just great. It's entertaining fr...,positive,positive
7325,I did not read anything about the film before ...,positive,positive
6053,A comedy of epically funny proportions from th...,positive,positive
6377,"I am going to go out on a limb, and actually d...",positive,positive


In [27]:
# true is negative, predict is negative
t_sentiment = "negative"
p_sentiment = "negative"
random.seed(2023) # fix the random seed so every run gives the same result
temp_df = result_df[(result_df["sentiment"]==t_sentiment) & (result_df["predicted_sentiment"]==p_sentiment)]
temp = temp_df.iloc[random.sample(range(len(temp_df)), k=5)]
for i in range(5):
    print(temp.iloc[i,0])
    print("========================")
temp

I can't believe this movie managed to get such a relatively high rating of 6! It is barely watchable and unbelievably boring, certainly one of the worst films I have seen in a long, long time.<br /><br />In a no-budget way, it reminded me of Star Wars Episodes I and II for the sheer impression that you are watching a total creative train wreck.<br /><br />This film should be avoided at all costs. It's one of those "festival films" that only please the pseudo-intellectuals because they are so badly made those people think it makes it "different", therefore good.<br /><br />Bad film-making is not "different", it's just bad film-making.
SPOILERS<br /><br />I love movies. I've seen a lot of movies. I didn't think I'd ever see a film that I actually hated. Son of the Mask ruined it. Son of the Mask is so bad I'm not even going to do a detailed comment like I usually do. In fact, I'm not even going to write a lot. I think all of you should know that this movie is horribly awful. And poor Jam

Unnamed: 0,review,sentiment,predicted_sentiment
6795,I can't believe this movie managed to get such...,negative,negative
7999,SPOILERS<br /><br />I love movies. I've seen a...,negative,negative
6927,If pulp fiction and Get shorty didn't exist th...,negative,negative
5683,"""The Secret Life"" starts with the worst possib...",negative,negative
5983,This is one of the most hilariously bad movies...,negative,negative


In [28]:
# true is positive, predict is negative
t_sentiment = "positive"
p_sentiment = "negative"
random.seed(2023) # fix the random seed so every run gives the same result
temp_df = result_df[(result_df["sentiment"]==t_sentiment) & (result_df["predicted_sentiment"]==p_sentiment)]
temp = temp_df.iloc[random.sample(range(len(temp_df)), k=5)]
for i in range(5):
    print(temp.iloc[i,0])
    print("========================")
temp

I had never read much about (or even seen stills of) the six-man British comedy group The Crazy Gang, but my positive experiences with their contemporaries Will Hay and Arthur Askey – and especially Graham Greene’s high praise of THE FROZEN LIMITS itself (“The funniest English picture yet produced…it can bear comparison with SAFETY LAST and THE GENERAL”) – made me take the plunge with the bare-bones R2 DVDs from Network of this and their subsequent film GASBAGS (1941; see below), both of which were released earlier this year with virtually no fanfare.<br /><br />A British-made Western is a rarity, but a British Western spoof is rarer still (CARRY ON COWBOY [1965] was still some 25 years away). Incidentally, going back to the Silent classics mentioned by Greene, the film seems to me to be more obviously indebted to THE GOLD RUSH (1925) and WAY OUT WEST (1937). Besides, it also plays like a variation on the “Snow White And The Seven Dwarfs” fairy-tale (which had just been immortalized on

Unnamed: 0,review,sentiment,predicted_sentiment
5621,I had never read much about (or even seen stil...,positive,negative
6413,"Ahhhh, 1984.... I was young and stupid, and ju...",positive,negative
5681,"This movie is pretty cheesy, but I do give it ...",positive,negative
4888,Toy Soldiers is an okay action movie but what ...,positive,negative
8796,When two writers make a screenplay of a horror...,positive,negative


In [29]:
# true is negative, predict is positive
t_sentiment = "negative"
p_sentiment = "positive"
random.seed(2023) # fix the random seed so every run gives the same result
temp_df = result_df[(result_df["sentiment"]==t_sentiment) & (result_df["predicted_sentiment"]==p_sentiment)]
temp = temp_df.iloc[random.sample(range(len(temp_df)), k=5)]
for i in range(5):
    print(temp.iloc[i,0])
    print("========================")
temp

I debated quite a bit over what rating to give this one because it's my least favorite Herschell Gordon Lewis film so far other than The Gruesome Twosome, but it has the best acting I've seen in a Lewis film. However, we all know that's not saying much. Once the movie was done, I was happy because it felt like I had been sitting through a 4 hour movie, though it was only 82 minutes long. I'm trying to see all of HGL's films and that's probably the only reason to see this one.<br /><br />The gore is good as usual, the one thing that Herschell seemed to get right. The acting is just as bad as usual with one exception. That exception is Frank Kress. Now, would I say that he's a good actor? No way, but he's good compared to everyone else. The story is boring and flat and goes no where and by the end, I didn't care what happened just so long as it ended. I know this is a cult classic but I didn't enjoy it very much at all. I hope you will.
The trouble with the book, "Memoirs of a Geisha" is

Unnamed: 0,review,sentiment,predicted_sentiment
5743,I debated quite a bit over what rating to give...,negative,positive
9708,"The trouble with the book, ""Memoirs of a Geish...",negative,positive
6597,The latest Rumor going around is that Vh1 is s...,negative,positive
5830,I first watched Kindred in 1987 along with ano...,negative,positive
4713,Kevin Kline and Meg Ryan are among that class ...,negative,positive
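
The four cells above repeat the same filter-and-sample steps with different label pairs. One way to factor that out is a small helper, sketched here on plain dictionaries so it runs without pandas. `sample_outcome` and the toy rows are hypothetical; unlike the cells above, the helper returns fewer than `k` rows when a bucket is small instead of raising an error.

```python
import random


def sample_outcome(rows, true_label, predicted_label, k=5, seed=2023):
    """Pick up to k rows whose (sentiment, predicted_sentiment) pair matches."""
    matches = [
        row for row in rows
        if row["sentiment"] == true_label
        and row["predicted_sentiment"] == predicted_label
    ]
    rng = random.Random(seed)  # local generator: same seed, same sample
    return rng.sample(matches, k=min(k, len(matches)))


# Made-up rows standing in for result_df
toy_rows = [
    {"review": "made-up review A", "sentiment": "positive", "predicted_sentiment": "positive"},
    {"review": "made-up review B", "sentiment": "negative", "predicted_sentiment": "positive"},
    {"review": "made-up review C", "sentiment": "negative", "predicted_sentiment": "positive"},
    {"review": "made-up review D", "sentiment": "negative", "predicted_sentiment": "negative"},
]

# true is negative, predict is positive
false_positives = sample_outcome(toy_rows, "negative", "positive", k=5)
for row in false_positives:
    print(row["review"])
    print("========================")
```

With a DataFrame, `result_df.to_dict("records")` would give rows in this shape.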
