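### <strong>Topic:</strong>
Beer review rating prediction - building a classification model
### <strong>Description:</strong>
Continuing the earlier exercise on the beer review dataset, the final goal this time is to treat rating prediction as a classification problem <br />
and build a BERT model that predicts the score of each attribute (appearance, aroma, overall, palate, taste). Note in particular <br />
that, unlike the course example, several targets must be predicted at once, which makes this a typical multi-label problem <br />
(multi-label classification).
### <strong>Tasks</strong>
1. Using the beer data prepared last time, build the corresponding PyTorch Dataset and PyTorch DataLoader<br />
(complete BeerDataset and create_data_loader below)
2. Using the beer data prepared last time, build the main model architecture (complete BeerRateClassifier below)
3. Complete the training loop, obtain a weights file, and confirm that the model architecture works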

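#### <strong>Hint 1: If GPU limits make training slow, consider lowering the number of training epochs or MAX_LEN, or choosing a smaller BERT model.</strong>
#### <strong>Hint 2: If you are still unsure how to set up a multi-label problem, see this [example](https://www.learnopencv.com/multi-label-image-classification-with-pytorch/).</strong>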
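
To make the multi-target setup concrete, the sketch below computes one cross-entropy loss per aspect and sums them into a single scalar, which is the same combination the training loop in this notebook uses. It is written in plain Python rather than PyTorch, and the helper names `cross_entropy` and `total_loss` as well as the example logits are assumptions made for illustration.

```python
import math

ASPECTS = ["appearance", "aroma", "overall", "palate", "taste"]

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def total_loss(outputs, targets):
    """Sum the per-aspect losses so every head contributes to one scalar."""
    return sum(cross_entropy(outputs[a], targets[a]) for a in ASPECTS)

# Hypothetical logits: a model that scores all 4 classes equally.
uniform_outputs = {a: [0.0, 0.0, 0.0, 0.0] for a in ASPECTS}
targets = {a: 1 for a in ASPECTS}
baseline = total_loss(uniform_outputs, targets)
print(round(baseline, 4))  # 5 * ln(4), about 6.93
```

Under these assumptions a uniform guess over four classes costs ln 4 per aspect, so a summed loss near 6.93 would suggest a model that has not learned anything yet.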


In [1]:
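# Connect to personal storage to read the training data and save the model
# Mount your own Google Drive first so the data and the trained model can be stored there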
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os

# Current directory
print(os.getcwd())

# change directory
os.chdir('/content/drive/MyDrive/python_training/NLP100Days-part2/project_2_3/')
print(os.getcwd())

/content
/content/drive/MyDrive/python_training/NLP100Days-part2/project_2_3


In [3]:
!pip install torch
!pip install transformers
#!pip install -q transformers
#!pip install transformers==3
# Pin the torchtext version; the runtime must be restarted after installation
!pip install torchtext==0.6.0
#!pip install -r requirements.txt




In [4]:
import torch
import transformers
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from torch import nn, optim
from transformers import BertModel, BertTokenizer
from transformers import AdamW, get_linear_schedule_with_warmup

In [5]:
PRE_TRAINED_MODEL_NAME = "bert-base-cased"
BATCH_SIZE = 16
MAX_LEN = 255
EPOCHS = 10

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
TOKENIZER = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

In [6]:
class BeerDataset(Dataset):
    """
    Convert the beer review dataframe into the PyTorch Dataset form
    expected by the DataLoader.
    """
    def __init__(self,
                 comments,
                 appearance_target,
                 aroma_target,
                 overall_target,
                 palate_target,
                 taste_target, max_len):
        # Part to be completed...
        self.comments = comments
        self.appearance_target = appearance_target
        self.aroma_target = aroma_target
        self.overall_target = overall_target
        self.palate_target = palate_target
        self.taste_target = taste_target
        self.max_len = max_len
    def __len__(self):
        return len(self.comments)

    def __getitem__(self, item):
        # Part to be completed...
        comment = str(self.comments[item])
        appearance_target = self.appearance_target[item]
        aroma_target = self.aroma_target[item]
        overall_target = self.overall_target[item]
        palate_target = self.palate_target[item]
        taste_target = self.taste_target[item]
        encoding = TOKENIZER.encode_plus(
            comment,
            max_length=self.max_len,
            truncation=True,
            add_special_tokens=True,
            return_token_type_ids=False,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'comment': comment,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'appearance_target': torch.tensor(appearance_target, dtype=torch.long),
            'aroma_target': torch.tensor(aroma_target, dtype=torch.long),
            'overall_target': torch.tensor(overall_target, dtype=torch.long),
            'palate_target': torch.tensor(palate_target, dtype=torch.long),
            'taste_target': torch.tensor(taste_target, dtype=torch.long)
        }
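
As a rough picture of what truncation and padding to `max_length` do to each review, the sketch below pads or cuts a list of token ids to a fixed length and builds the matching attention mask. It is a simplification in plain Python: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer additionally keeps the closing `[SEP]` token when it truncates.

```python
PAD_ID = 0  # assumption: id of the [PAD] token in the BERT vocabulary

def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                      # truncation
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad             # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token id sequences, one shorter and one longer than max_len
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item ends up with the same length, the DataLoader can stack the items of a batch into one tensor without a custom collate function.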

In [7]:
def create_data_loader(dataframe, max_len, batch_size):
    """
    Wrap the PyTorch Dataset in a PyTorch DataLoader.
    """
    dataset = BeerDataset(  # Part to be completed...
        comments=dataframe['review/text'],
        appearance_target=dataframe.review_appearance,
        aroma_target=dataframe.review_aroma,
        overall_target=dataframe.review_overall,
        palate_target=dataframe.review_palate,
        taste_target=dataframe.review_taste,
        max_len=max_len)

    return DataLoader(
        dataset,
        batch_size=batch_size
    )

In [8]:
class BeerRateClassifier(nn.Module):
    """
    Main model for beer review rating classification.
    """
    def __init__(self,
                 appearance_n_classes,
                 aroma_n_classes,
                 overall_n_classes,
                 palate_n_classes,
                 taste_n_classes,
                ):
        super(BeerRateClassifier, self).__init__()
        # Part to be completed...
        aspects = {
            'appearance': appearance_n_classes,
            'aroma': aroma_n_classes,
            'overall': overall_n_classes,
            'palate': palate_n_classes,
            'taste': taste_n_classes
        }

        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.aspect_outs = nn.ModuleDict({
            aspect: nn.Linear(self.bert.config.hidden_size, n_classes)
            for aspect, n_classes in aspects.items()  
        })
        self.drop = nn.Dropout(0.2)


    def forward(self, input_ids, attention_mask):
        # Part to be completed...
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        out = self.drop(outputs.pooler_output)  # same tensor as outputs['pooler_output']
        aspect_outputs = {
            aspect: aspect_out(out)
            for aspect, aspect_out in self.aspect_outs.items()
        }

        return aspect_outputs
        # return {
        #     "appearance": appearance_output,
        #     "aroma": aroma_output,
        #     "overall": overall_output,
        #     "palate": palate_output,
        #     "taste": taste_output,
        # }

In [9]:
def train_epoch(model,
                data_loader,
                loss_fn,
                optimizer,
                scheduler,
                n_examples):
    """
    Main training loop of the BERT rating classifier for one epoch.
    """
    model = model.train()

    losses = []
    correct_predictions = 0.
    for batch in data_loader:
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        preds = {
            aspect: torch.max(output, dim=1)[1]
            for aspect, output in outputs.items()
        }
        targets = {
            aspect: batch[f"{aspect}_target"].view(-1).to(DEVICE)
            for aspect in preds.keys()
        }
        aspect_losses = {
            aspect: loss_fn(outputs[aspect], targets[aspect])
            for aspect in preds.keys()
        }
        correct_predictions += sum([
            torch.sum(preds[aspect] == targets[aspect]).item() for aspect in preds.keys()
        ])

        loss = torch.stack([val for _, val in aspect_losses.items()]).sum()
        losses.append(loss.item())

        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    return correct_predictions / n_examples / 5, np.mean(losses)

In [11]:
def eval_model(model,
               data_loader,
               loss_fn,
               n_examples):
    """
    Evaluation pass of the BERT rating classifier, run after each training epoch.
    """
    model = model.eval()

    losses = []
    correct_predictions = 0.
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

            preds = {
                aspect: torch.max(output, dim=1)[1]
                for aspect, output in outputs.items()
            }
            targets = {
                aspect: batch[f"{aspect}_target"].view(-1).to(DEVICE)
                for aspect in preds.keys()
            }
            aspect_losses = {
                aspect: loss_fn(outputs[aspect], targets[aspect])
                for aspect in preds.keys()
            }
            correct_predictions += sum([
                torch.sum(preds[aspect] == targets[aspect]).item() for aspect in preds.keys()
            ])

            loss = torch.stack([val for _, val in aspect_losses.items()]).sum()
            losses.append(loss.item())

    return correct_predictions / n_examples / 5, np.mean(losses)
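
Both `train_epoch` and `eval_model` report a single accuracy: the number of correct predictions summed over the five aspects, divided by the number of examples and then by 5. The sketch below reproduces that arithmetic in plain Python on a made-up batch; `argmax` and `mean_aspect_accuracy` are hypothetical helpers, not part of the notebook.

```python
ASPECTS = ["appearance", "aroma", "overall", "palate", "taste"]

def argmax(scores):
    """Index of the largest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])

def mean_aspect_accuracy(outputs, targets):
    """outputs[aspect]: per-example logits; targets[aspect]: per-example labels."""
    n_examples = len(targets[ASPECTS[0]])
    correct = 0
    for aspect in ASPECTS:
        preds = [argmax(logits) for logits in outputs[aspect]]
        correct += sum(p == t for p, t in zip(preds, targets[aspect]))
    return correct / n_examples / len(ASPECTS)

# Hypothetical batch of 2 reviews with 4 classes per aspect
outputs = {a: [[0.1, 2.0, 0.3, 0.0], [1.5, 0.2, 0.1, 0.0]] for a in ASPECTS}
targets = {a: [1, 0] for a in ASPECTS}
targets["taste"] = [1, 3]  # makes one of the ten predictions wrong
acc = mean_aspect_accuracy(outputs, targets)
print(acc)
```

Averaging this way weights every aspect equally, so a head that performs poorly can be hidden by the others; tracking the per-aspect counts separately would be one way to see that.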

In [13]:
TRAIN = pd.read_json("./data/train_set_lng.json")
TRAIN = TRAIN.sample(frac=1).reset_index(drop=True)
VAL = pd.read_json("./data/test_set_lng.json")
VAL = VAL.sample(frac=1).reset_index(drop=True)
TRAIN = pd.concat([TRAIN, VAL[500:]]).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0
VAL = VAL.iloc[:500]
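
The cell above shuffles both files, keeps the first 500 shuffled test rows for validation, and folds the remaining test rows into the training set. A minimal sketch of the same logic on plain lists follows; the `split` helper, the row ids, and the fixed seed are assumptions for illustration (the notebook itself does not set a seed).

```python
import random

def split(train_rows, test_rows, n_val=500, seed=0):
    """Shuffle both sets, hold out n_val test rows, add the rest to training."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    train_rows = train_rows[:]  # copy so the caller's lists stay untouched
    test_rows = test_rows[:]
    rng.shuffle(train_rows)
    rng.shuffle(test_rows)
    val = test_rows[:n_val]
    train = train_rows + test_rows[n_val:]
    return train, val

# Hypothetical row ids: 1000 training rows and 800 test rows
train, val = split(list(range(1000)), list(range(1000, 1800)))
print(len(train), len(val))
```

The two slices never overlap, so no validation row leaks into training as long as the same review does not appear in both source files.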

In [14]:
TRAIN.head(2)

Unnamed: 0,index,beer/ABV,beer/beerId,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,review/timeStruct,review/timeUnix,user/ageInSeconds,user/birthdayRaw,user/birthdayUnix,user/gender,user/profileName,review_appearance,review_aroma,review_overall,review_palate,review_taste,text_length
0,39708,5.11,28950,13397,Nut Brown Ale,English Brown Ale,3.0,3.0,2.5,2.5,3.0,12oz brown bottle serveed into a nonic glass.\...,"{'min': 6, 'hour': 2, 'mday': 12, 'sec': 24, '...",1213236384,1677420000.0,"Oct 16, 1961",-259084800.0,Male,jcdiflorio,1,1,1,1,1,90
1,8266,8.5,29687,395,Jefferson's Reserve Bourbon Barrel Stout,American Double / Imperial Stout,,,,,,Picked this up in Ohio along with the other be...,"{'min': 29, 'hour': 6, 'mday': 11, 'sec': 34, ...",1231655374,,,,,JerzDevl2000,3,3,3,3,3,211


In [15]:
VAL.head(2)

Unnamed: 0,index,beer/ABV,beer/beerId,beer/brewerId,beer/name,beer/style,review/appearance,review/aroma,review/overall,review/palate,review/taste,review/text,review/timeStruct,review/timeUnix,user/ageInSeconds,user/birthdayRaw,user/birthdayUnix,user/gender,user/profileName,review_appearance,review_aroma,review_overall,review_palate,review_taste,text_length
0,24760,6.9,23474,1199,Founders Rübæus,Fruit / Vegetable Beer,3.5,4.0,4.0,4.0,4.0,Pours from the bottle a reddish copper hue wit...,"{'min': 47, 'hour': 23, 'mday': 29, 'sec': 16,...",1154216836,,,,Male,merlin48,1,2,2,2,2,87
1,42257,8.0,54731,263,Aecht Schlenkerla Eiche,Doppelbock,3.5,4.5,4.5,4.0,4.5,500ml bottle.\t\tPours a slightly hazy orangy ...,"{'min': 3, 'hour': 1, 'mday': 9, 'sec': 28, 'y...",1320800608,,,,Male,rangerred,1,3,3,2,3,88


In [16]:
MODEL = BeerRateClassifier(4, 4, 4, 4, 4)
MODEL.to(DEVICE)

TRAIN_DATA_LOADER = create_data_loader(TRAIN, MAX_LEN, BATCH_SIZE)
VAL_DATA_LOADER = create_data_loader(VAL, MAX_LEN, BATCH_SIZE)

OPTIMIZER = AdamW(MODEL.parameters(), lr=2e-5, correct_bias=False)
TOTAL_STEPS = len(TRAIN_DATA_LOADER) * EPOCHS
SCHEDULER = get_linear_schedule_with_warmup(
    OPTIMIZER,
    num_warmup_steps=0,
    num_training_steps=TOTAL_STEPS
)
LOSS_FN = nn.CrossEntropyLoss().to(DEVICE)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
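
The scheduler configured above decays the learning rate linearly from its initial value to zero over `TOTAL_STEPS`, with no warmup because `num_warmup_steps=0`. The sketch below reproduces that multiplier in plain Python to show its shape. `linear_schedule_factor` is a hypothetical helper written for illustration, not the library function, and the dataset size is made up.

```python
import math

def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)  # linear ramp-up during warmup
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Hypothetical dataset size; batch size, epochs, and lr match the notebook
n_examples, batch_size, epochs = 1000, 16, 10
steps_per_epoch = math.ceil(n_examples / batch_size)  # DataLoader length without drop_last
total_steps = steps_per_epoch * epochs
base_lr = 2e-5
for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

Since the schedule depends on the total step count, lowering `EPOCHS` or changing `BATCH_SIZE` also changes how quickly the learning rate falls.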


In [None]:
BEST_ACCURACY = 0

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    train_acc, train_loss = train_epoch(
        MODEL,
        TRAIN_DATA_LOADER,
        LOSS_FN,
        OPTIMIZER,
        SCHEDULER,
        len(TRAIN)
    )

    print(f'Train loss {train_loss} accuracy {train_acc}')

    val_acc, val_loss = eval_model(
        MODEL,
        VAL_DATA_LOADER,
        LOSS_FN,
        len(VAL)
    )

    print(f'Val   loss {val_loss} accuracy {val_acc}')
    print()

    if val_acc > BEST_ACCURACY:
        # Note: this saves only the BERT encoder, not the classification heads
        MODEL.bert.save_pretrained("./")
        BEST_ACCURACY = val_acc

Epoch 1/10
----------




Train loss 4.704062187525251 accuracy 0.5284323232323233
Val   loss 4.423305444419384 accuracy 0.5599999999999999

Epoch 2/10
----------
Train loss 4.357919468368032 accuracy 0.5706141414141415
Val   loss 4.44634410738945 accuracy 0.5556

Epoch 3/10
----------
Train loss 4.076980198285005 accuracy 0.6081414141414141
Val   loss 4.592077486217022 accuracy 0.5396

Epoch 4/10
----------
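
In the log above, validation accuracy is highest in epoch 1 while training accuracy keeps rising, so only the first checkpoint is worth keeping. The sketch below shows the best-so-far bookkeeping that decides when to save, in plain Python; `epochs_that_save` is a hypothetical helper, and the accuracies are the three validation values printed above.

```python
def epochs_that_save(val_accuracies):
    """Return the 1-based epochs at which a new best validation accuracy is reached."""
    best = 0.0
    saved = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best = acc  # update the running best, otherwise every epoch would save
            saved.append(epoch)
    return saved

# Validation accuracies of the three completed epochs in the log
logged = [0.5599999999999999, 0.5556, 0.5396]
print(epochs_that_save(logged))
```

If the running best were never updated, every epoch with a non-zero accuracy would overwrite the saved weights with a possibly worse model.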
