<a href="https://colab.research.google.com/github/kcwanglucky/bert_run_lm_streamline/blob/master/bert_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### File to put in current directory:
intent.csv: 原本labeled data\
cleaned_wo_punc.csv: 將原本的4個unlabeled data檔案合併（共17000筆），並把多餘的標點符號空格刪除了


### Environment

In [0]:
from google.colab import auth
auth.authenticate_user()

# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = 'ai-report-240709'
!gcloud config set project {project_id}

Updated property [core/project].


To take a quick anonymous survey, run:
  $ gcloud survey



In [0]:
# Download the file from a given Google Cloud Storage bucket.
!gsutil cp gs://luckykcw/cleaned_w_embed.tsv .

Copying gs://luckykcw/cleaned_w_embed.tsv...
\ [1 files][ 33.4 MiB/ 33.4 MiB]                                                
Operation completed over 1 objects/33.4 MiB.                                     


In [0]:
# Copy raw intent.csv file to environment
!gsutil cp gs://luckykcw/intent.csv .

Copying gs://luckykcw/intent.csv...
/ [0 files][    0.0 B/277.0 KiB]                                                -- [1 files][277.0 KiB/277.0 KiB]                                                
Operation completed over 1 objects/277.0 KiB.                                    


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
!cp -r /content/drive/My Drive/ESUN/0303_intent_newclass .

### 下載並安裝 transformers

In [0]:
!cp -r drive/My\ Drive/ESUN/0206_data_lm_ft/transformers .
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/13/33/ffb67897a6985a7b7d8e5e7878c3628678f553634bd3836404fef06ef19b/transformers-2.5.1-py3-none-any.whl (499kB)
[K     |▋                               | 10kB 22.4MB/s eta 0:00:01[K     |█▎                              | 20kB 5.9MB/s eta 0:00:01[K     |██                              | 30kB 7.1MB/s eta 0:00:01[K     |██▋                             | 40kB 5.5MB/s eta 0:00:01[K     |███▎                            | 51kB 5.9MB/s eta 0:00:01[K     |████                            | 61kB 7.0MB/s eta 0:00:01[K     |████▋                           | 71kB 7.1MB/s eta 0:00:01[K     |█████▎                          | 81kB 6.9MB/s eta 0:00:01[K     |██████                          | 92kB 7.7MB/s eta 0:00:01[K     |██████▋                         | 102kB 8.1MB/s eta 0:00:01[K     |███████▏                        | 112kB 8.1MB/s eta 0:00:01[K     |███████▉                        | 122kB 8.1M

### Import package

In [0]:
import torch
from transformers import BertTokenizer
from transformers import BertForSequenceClassification
from IPython.display import clear_output
import pandas as pd
import numpy as np
import os

In [0]:
df_data = pd.read_csv("intent.csv")

In [0]:
# 只留取index及question coluumn
df_data = df_data.iloc[:, [0, 2]]

### **1. Train a bert with current labeled data**

#### Data Preprocessing (此function用來過濾掉太長的question或是sample太少的種類)

In [0]:
# 把各組數量大於mineachgroup的及Question長度小於maxlength的取出來
# 然後再轉換index，讓被刪除的index不要留空
# Output一個更新後的dataframe
def preprocessing(df_data, mineachgroup, maxlength):
    # filter out long question
    df_data = df_data[~(df_data.question.apply(lambda x : len(x)) > maxlength)]

    freq = df_data['index'].value_counts()
    idxs = np.array(freq.index)
    counts = freq

    # index numbers of groups with count >= mineachgroup
    list_idx = [i for i, c in zip(idxs, counts) if c > mineachgroup]

    # filter out data with "index" in list_idx 
    df_data = df_data[df_data['index'].isin(list_idx)]

    # # Redindex the topic group so that it is continuous
    # index = df_data["index"]
    # index2label = {idx:val for idx, val in enumerate(index.unique()) }
    # label2index = {val:idx for idx, val in index2label.items() }
    # def getindex4label(label):
    #     return label2index[label]
    
    # df_data["index"] = df_data["index"].apply(getindex4label) 

    return df_data    

In [0]:
df_data_prep = preprocessing(df_data, 4, 30)

#### Redindex the topic group so that it is continuous

In [0]:
def reIndex(df):
    index = df["index"]
    index2label = {idx:val for idx, val in enumerate(index.unique()) }
    label2index = {val:idx for idx, val in index2label.items() }
    def getindex4label(label):
        return label2index[label]
    df["index"] = df["index"].apply(getindex4label) 
    return df

In [0]:
df_data = reIndex(df_data)

#### Create split based on percentage (Get every index sampled) 
邏輯：每種class都取perct %的資料進去training set

In [0]:
NUM_LABELS = len(df_data['index'].value_counts())
print("label的數量：{}".format(NUM_LABELS))

label的數量：499


In [0]:
""" 從各類別random sample出fraction比例的資料集
    data: df data that includes the "index" and "question" column
    fraction: the fraction of data you want to sample (ex: 0.7)
""" 
def bootstrap(data, fraction):
    # This function will be applied on each group of instances of the same
    # class in data.
    def sampleClass(classgroup):
        return classgroup.sample(frac = fraction)

    samples = data.groupby('index').apply(sampleClass)
    
    # If you want an index which is equal to the row in `data` where the sample came from
    # If you don't change it then you'll have a multiindex with level 0
    # being the class and level 1 being the row in `data` where
    # the sample came from.
    samples.index = samples.index.get_level_values(1)
    return samples

#### 輸出預處理解果(train and test set)

In [0]:
""" 將原本全部的cleaned data依照指定的比例分成train/val/test set，
    並output成tsv檔到環境中(檔名ex: 70%train.tsv)
    df: df data that includes the "index" and "question" column
    fraction: fraction of all data to be assigned to training set
    The remaining (1-fraction) data will be equally splitted between
    validation and testing set
"""

def output_split(df, fraction = 0.7):
    df_train = bootstrap(df, fraction)
    df_remain = pd.concat([df_train, df]).drop_duplicates(keep=False)
    df_val = df_remain.sample(frac = 0.5)
    df_test = pd.concat([df_val, df_remain]).drop_duplicates(keep=False)

    # 放入label資料夾，以區分出之後unlabel的資料
    path = os.path.join("data", "0226label")
    if not os.path.exists(path):
        os.makedirs(path)

    print("訓練樣本數：", len(df_train))
    df_train.to_csv(os.path.join(path, str(int(fraction * 100))+"%train.tsv"), sep="\t", index=False)

    print("validation樣本數：", len(df_val))
    df_val.to_csv(os.path.join(path, str(int(fraction * 100))+"%val.tsv"), sep="\t", index=False)

    print("預測樣本數：", len(df_test))
    df_test.to_csv(os.path.join(path, str(int(fraction * 100))+"%test.tsv"), sep="\t", index=False)

In [0]:
output_split(df_data, 0.7)

訓練樣本數： 3177
validation樣本數： 625
預測樣本數： 625


#### 讀取前面預處理結果

In [0]:
path = os.path.join("data", "0226label")
fraction = 0.7
train_path = os.path.join(path, str(int(fraction * 100))+"%train.tsv")
val_path = os.path.join(path, str(int(fraction * 100))+"%val.tsv")
test_path = os.path.join(path, str(int(fraction * 100))+"%test.tsv")
df_train = pd.read_csv(train_path, sep="\t").fillna("")
df_val = pd.read_csv(val_path, sep="\t").fillna("")
df_test = pd.read_csv(test_path, sep="\t").fillna("")

#### 用OnlineQueryDataset來存取資料

In [0]:
"""
    實作一個可以用來讀取訓練 / 測試集的 Dataset，此 Dataset 每次將 tsv 裡的一筆成對句子
    轉換成 BERT 相容的格式，並回傳 3 個 tensors：
    - tokens_tensor：兩個句子合併後的索引序列，包含 [CLS] 與 [SEP]
    - segments_tensor：可以用來識別兩個句子界限的 binary tensor
    - label_tensor：將分類標籤轉換成類別索引的 tensor, 如果是測試集則回傳 None
"""
from torch.utils.data import Dataset
   
class OnlineQueryDataset(Dataset):
    # 讀取前處理後的 tsv 檔並初始化一些參數
    # mode: in ["train", "test", "val"]
    # tokenizer: one of bert tokenizer
    # perc: percentage of data to put in training set
    # path: if given, then read data from the path(ex training set)
    def __init__(self, mode, tokenizer, perc = 70, path = None):
        assert mode in ["train", "val", "test"]  # 一般訓練你會需要 dev set
        self.mode = mode
        if not path: 
            path = os.path.join("data", str(perc) + "%" + mode + ".tsv")
        self.df = pd.read_csv(path, sep="\t").fillna("")
        self.len = len(self.df)
        self.tokenizer = tokenizer 
    
    # 定義回傳一筆訓練 / 測試數據的函式
    #@pysnooper.snoop()  # 加入以了解所有轉換過程
    def __getitem__(self, idx):
        if self.mode == "test":
            text = self.df.iloc[idx, 1]
            label_tensor = None
        elif self.mode == "val":
            label, text = self.df.iloc[idx, :].values
            label_tensor = torch.tensor(label)
        else:
            label, text = self.df.iloc[idx, :].values
            # 將label文字也轉換成索引方便轉換成 tensor
            label_tensor = torch.tensor(label)
            
        """
        # 建立第一個句子的 BERT tokens 並加入分隔符號 [SEP]
        word_pieces = ["[CLS]"]
        tokens_a = self.tokenizer.tokenize(text_a)
        word_pieces += tokens_a + ["[SEP]"]
        len_a = len(word_pieces)
        
        # 第二個句子的 BERT tokens
        tokens_b = self.tokenizer.tokenize(text_b)
        word_pieces += tokens_b + ["[SEP]"]
        len_b = len(word_pieces) - len_a
        """
        # 建立句子的 BERT tokens 
        word_pieces = ["[CLS]"]
        tokens = self.tokenizer.tokenize(text)
        word_pieces += tokens + ["[SEP]"]
        len_a = len(word_pieces)
        
        # 將整個 token 序列轉換成索引序列
        ids = self.tokenizer.convert_tokens_to_ids(word_pieces)
        tokens_tensor = torch.tensor(ids)
        
        # 將第一句包含 [SEP] 的 token 位置設為 0，其他為 1 表示第二句
        segments_tensor = torch.tensor([1] * len_a, dtype=torch.long)
        
        return (tokens_tensor, segments_tensor, label_tensor)
    
    def __len__(self):
        return self.len

In [0]:
PRETRAINED_MODEL_NAME = "bert-base-chinese"
# 取得此預訓練模型所使用的 tokenizer
tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)
clear_output()

In [0]:
# 初始化一個專門讀取訓練樣本的 Dataset，使用中文 BERT 斷詞
trainset = OnlineQueryDataset("train", tokenizer=tokenizer, path = "data/0226label/70%train.tsv")

In [0]:
valset = OnlineQueryDataset("val", tokenizer=tokenizer, perc=70, path = "data/0226label/70%val.tsv")
testset = OnlineQueryDataset("test", tokenizer=tokenizer, perc=70, path = "data/0226label/70%test.tsv")

In [0]:
"""
實作可以一次回傳一個 mini-batch 的 DataLoader
這個 DataLoader 吃我們上面定義的 `OnlineQueryDataset`，
回傳訓練 BERT 時會需要的 4 個 tensors：
- tokens_tensors  : (batch_size, max_seq_len_in_batch)
- segments_tensors: (batch_size, max_seq_len_in_batch)
- masks_tensors   : (batch_size, max_seq_len_in_batch)
- label_ids       : (batch_size)
"""

from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# 這個函式的輸入 `samples` 是一個 list，裡頭的每個 element 都是
# 剛剛定義的 `FakeNewsDataset` 回傳的一個樣本，每個樣本都包含 3 tensors：
# - tokens_tensor
# - segments_tensor
# - label_tensor
# 它會對前兩個 tensors 作 zero padding，並產生前面說明過的 masks_tensors
def create_mini_batch(samples):
    tokens_tensors = [s[0] for s in samples]
    segments_tensors = [s[1] for s in samples]
    
    # 訓練集有 labels
    if samples[0][2] is not None:
        label_ids = torch.stack([s[2] for s in samples])
    else:
        label_ids = None
    
    # zero pad 到同一序列長度
    tokens_tensors = pad_sequence(tokens_tensors, 
                                  batch_first=True)
    segments_tensors = pad_sequence(segments_tensors, 
                                    batch_first=True)
    
    # attention masks，將 tokens_tensors 裡頭不為 zero padding
    # 的位置設為 1 讓 BERT 只關注這些位置的 tokens
    masks_tensors = torch.zeros(tokens_tensors.shape, 
                                dtype=torch.long)
    masks_tensors = masks_tensors.masked_fill(
        tokens_tensors != 0, 1)
    
    return tokens_tensors, segments_tensors, masks_tensors, label_ids

In [0]:
# 初始化一個每次回傳 64 個訓練樣本的 DataLoader
# 利用 `collate_fn` 將 list of samples 合併成一個 mini-batch 是關鍵
BATCH_SIZE = 64
trainloader = DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True,  
                         collate_fn=create_mini_batch)

In [0]:
valloader = DataLoader(valset, batch_size=BATCH_SIZE,  
                         collate_fn=create_mini_batch)

#### Check first batch

In [0]:
data = next(iter(trainloader))

tokens_tensors, segments_tensors, \
    masks_tensors, label_ids = data

print(f"""
tokens_tensors.shape   = {tokens_tensors.shape} 
{tokens_tensors}
------------------------
segments_tensors.shape = {segments_tensors.shape}
{segments_tensors}
------------------------
masks_tensors.shape    = {masks_tensors.shape}
{masks_tensors}
------------------------
label_ids.shape        = {label_ids.shape}
{label_ids}
""")


tokens_tensors.shape   = torch.Size([64, 24]) 
tensor([[ 101,  100, 2791,  ...,    0,    0,    0],
        [ 101, 4192, 7442,  ...,    0,    0,    0],
        [ 101, 6296,  928,  ...,    0,    0,    0],
        ...,
        [ 101,  784, 7938,  ...,    0,    0,    0],
        [ 101, 1912, 2395,  ...,    0,    0,    0],
        [ 101, 9878, 1377,  ...,    0,    0,    0]])
------------------------
segments_tensors.shape = torch.Size([64, 24])
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
------------------------
masks_tensors.shape    = torch.Size([64, 24])
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
------------------------
label_ids.shape        

In [0]:
#tokenizer.convert_ids_to_tokens(tokens_tensors[0])

In [0]:
#df_train[df_train["index"] == 391]

#### Training step
初始化一個BertForSequenceClassification Model

In [0]:
PRETRAINED_MODEL_NAME = "bert-base-chinese"
NUM_LABELS = len(df_data['index'].value_counts())

model = BertForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL_NAME, num_labels=NUM_LABELS)
clear_output()

In [0]:
def get_predictions(model, dataloader, compute_acc=False):
    predictions = None
    correct = 0
    total = 0
      
    with torch.no_grad():
        # 遍巡整個資料集
        for data in dataloader:
            # 將所有 tensors 移到 GPU 上
            if next(model.parameters()).is_cuda:
                data = [t.to("cuda:0") for t in data if t is not None]
            
            # 別忘記前 3 個 tensors 分別為 tokens, segments 以及 masks
            # 且強烈建議在將這些 tensors 丟入 `model` 時指定對應的參數名稱
            tokens_tensors, segments_tensors, masks_tensors = data[:3]
            outputs = model(input_ids=tokens_tensors, 
                            token_type_ids=segments_tensors, 
                            attention_mask=masks_tensors)
            
            logits = outputs[0]
            _, pred = torch.max(logits.data, 1)
            
            # 用來計算訓練集的分類準確率
            if compute_acc:
                labels = data[3]
                total += labels.size(0)
                correct += (pred == labels).sum().item()
                
            # 將當前 batch 記錄下來
            if predictions is None:
                predictions = pred
            else:
                predictions = torch.cat((predictions, pred))
    
    if compute_acc:
        acc = correct / total
        return predictions, acc
    return predictions

In [0]:
# 讓模型跑在 GPU 上並取得訓練集的分類準確率
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("device:", device)
model = model.to(device)
pred, acc = get_predictions(model, trainloader, compute_acc=True)
print("classification acc:", acc)

device: cuda:0
classification acc: 0.0015738117721120553


In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("device:", device)
model = model.to(device)

# 使用 Adam Optim 更新整個分類模型的參數
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

EPOCHS = 35
for epoch in range(EPOCHS):
    
    running_loss = 0.0
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch + 1, EPOCHS))
    print('Training...')

    # 訓練模式
    model.train()

    for data in trainloader: # trainloader is an iterator over each batch
        
        tokens_tensors, segments_tensors, \
        masks_tensors, labels = [t.to(device) for t in data]

        # 將參數梯度歸零
        optimizer.zero_grad()
        
        # forward pass
        outputs = model(input_ids=tokens_tensors, 
                        token_type_ids=segments_tensors, 
                        attention_mask=masks_tensors, 
                        labels=labels)

        loss = outputs[0]
        # backward
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()

        # 紀錄當前 batch loss
        running_loss += loss.item()
        
    # 計算分類準確率
    logit, acc = get_predictions(model, trainloader, compute_acc=True)

    print('loss: %.3f, acc: %.3f' %
          (running_loss, acc))    
    print("")
    print("Running Validation...")

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for data in valloader:
        
        tokens_tensors, segments_tensors, \
        masks_tensors, labels = [t.to(device) for t in data]
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # This will return the logits rather than the loss because we have
            # not provided labels.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            outputs = model(input_ids=tokens_tensors, 
                        token_type_ids=segments_tensors, 
                        attention_mask=masks_tensors, 
                        labels=labels)
        
        # Get the "logits" output by the model. The "logits" are the output
        # values prior to applying an activation function like the softmax.
        logits = outputs[0]


    _, acc = get_predictions(model, valloader, compute_acc=True)
        # Move logits and labels to CPU
        #logits = logits.detach().cpu().numpy()
        #label_ids = b_labels.to('cpu').numpy()
        
        # Calculate the accuracy for this batch of test sentences.
        #tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        #eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        #nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(acc))

device: cuda:0

Training...
loss: 175.478, acc: 0.677

Running Validation...
  Accuracy: 0.60

Training...
loss: 168.623, acc: 0.698

Running Validation...
  Accuracy: 0.61

Training...
loss: 163.112, acc: 0.711

Running Validation...
  Accuracy: 0.62

Training...
loss: 157.072, acc: 0.744

Running Validation...
  Accuracy: 0.65

Training...
loss: 151.195, acc: 0.754

Running Validation...
  Accuracy: 0.65

Training...
loss: 145.896, acc: 0.773

Running Validation...
  Accuracy: 0.67

Training...
loss: 140.078, acc: 0.790

Running Validation...
  Accuracy: 0.67

Training...
loss: 135.148, acc: 0.806

Running Validation...
  Accuracy: 0.68

Training...
loss: 129.663, acc: 0.813

Running Validation...
  Accuracy: 0.69

Training...
loss: 124.711, acc: 0.843

Running Validation...
  Accuracy: 0.70

Training...
loss: 119.587, acc: 0.841

Running Validation...
  Accuracy: 0.71

Training...
loss: 115.062, acc: 0.854

Running Validation...
  Accuracy: 0.71

Training...
loss: 110.288, acc: 0.86

#### Save Model

In [0]:
import os
output_dir = os.path.join("model", "70%labeled")

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))

Saving model to model/70%labeled


('model/70%labeled/vocab.txt',
 'model/70%labeled/special_tokens_map.json',
 'model/70%labeled/added_tokens.json')

#### Testing

In [0]:
# 建立測試集。這邊我們可以用跟訓練時不同的 batch_size，看你 GPU 多大
#testset = OnlineQueryDataset("test", tokenizer=tokenizer)
testloader = DataLoader(testset, batch_size=32, 
                        collate_fn=create_mini_batch)

# 用分類模型預測測試集
predictions = get_predictions(model, testloader)

In [0]:
def accuracy(label, pred):
  return (label == pred).sum().item()/len(label)

In [0]:
test_label = torch.cuda.FloatTensor(testset.df['index'])

In [0]:
print("Testset accuracy: %f" % accuracy(test_label, predictions))

Testset accuracy: 0.822400


若有把太少sample的組別以及太長的問題刪掉的話（用preprocessing function），可達準確率84%左右

### **2. 加入unlabel data一起train bert**
Random sample出大約3177筆Unlabelled資料（須經過rule based過濾 ex: 長度/數字更換/drop duplicate等等），將它丟進原本bert model做分類，然後再用317 + 3177筆labeled的資料重新train一個Bert model。再用此model預測剩下那30% labeled的testing data成效如何

#### Load model

In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = BertForSequenceClassification.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir)

# Copy the model to the GPU.
model.to(device)

#### Preprocess the unlabeled data

In [0]:
unlabel_path = os.path.join("data", "0226unlabel")
if not os.path.exists(unlabel_path):
    os.makedirs(unlabel_path)

# cleaned_wo_punc.csv 是將原本的4個unlabel data file合併
# （共17000筆），並把多餘的標點符號空格刪除了
df_unlabel = pd.read_csv("cleaned_wo_punc.csv", sep = "\t")

In [0]:
import re

In [0]:
def toolongshort(x):
    return len(x) > MAXLENGTH or len(x) < MINLENGTH

In [0]:
def containsPhone(x):
    pattern = r"(0\d{1,2}(\d{6,8})|0\d{1,2}-?(\d{6,8})|09\d{2}-?(\d{3})-?(\d{3}))"
    prog = re.compile(pattern)
    if prog.match(x):
        return True
    else:
        return False

In [0]:
def containsAddr(x):
    pattern = r".*(路|縣|市|街|巷|號|樓).*"
    prog = re.compile(pattern)
    if prog.match(x):
        return True
    else:
        return False

In [0]:
# 包含抱歉的句子幾乎都其實沒有要問事情
def containSorry(x):
    pattern = r".*(對不起|抱歉|sorry|錯了|不好意思).*"
    prog = re.compile(pattern)
    if prog.match(x):
        return True
    else:
        return False

In [0]:
def tobefiltered(x):
    return containsPhone(x) or toolongshort(x) or containsAddr(x) or containSorry(x)

In [0]:
# 用一些rule來過濾出不適合的資料集
# Output一個更新後的dataframe
MINLENGTH = 5
MAXLENGTH = 20
def preprocessing(df):
    # drop the duplicate question in the dataset
    df = df.drop_duplicates(subset='question')

    # drop question with phone number and filter out long question
    df = df[~(df["question"].apply(tobefiltered))]
    
    return df

In [0]:
df_unlabel_clean = preprocessing(df_unlabel)

In [0]:
# Random sample出大約3177筆Unlabelled資料，利用先前的bert model來自動分類，
# 再將這些資料加入原本的training set重新train一個bert
df_unlabel_train = df_unlabel_clean.sample(n=3177, random_state=1)
df_unlabel_train.to_csv(os.path.join(unlabel_path, "unlabel_test.tsv"), sep="\t", index=False)

In [0]:
# df_unlabel_train目前尚無標籤，將這些視為test set放入bert分類
testset = OnlineQueryDataset("test", tokenizer, perc = 75, path = os.path.join(unlabel_path, "unlabel_test.tsv"))
testloader = DataLoader(testset, batch_size=32, collate_fn=create_mini_batch)

# 用分類模型預測測試集
predictions = get_predictions(model, testloader)

In [0]:
# 把欲操出來的分類放到index column
df_unlabel_train['index'] = np.array(predictions.cpu())

In [0]:
# 與原本的trainset合併，共有3177+3177筆新的training set
df_new_train = pd.concat([df_unlabel_train, df_train])

In [0]:
df_new_train.to_csv("trainset.tsv", sep = "\t", index = False)

In [0]:
trainset = OnlineQueryDataset("train", tokenizer=tokenizer, perc = 100, path = "trainset.tsv")

#### Retrain a bert model with 6354 training data

In [0]:
PRETRAINED_MODEL_NAME = "bert-base-chinese"
NUM_LABELS = len(df_new_train['index'].value_counts())

model = BertForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL_NAME, num_labels=NUM_LABELS)
clear_output()

In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("device:", device)
model = model.to(device)

# 使用 Adam Optim 更新整個分類模型的參數
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

EPOCHS = 35
for epoch in range(EPOCHS):
    
    running_loss = 0.0
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch + 1, EPOCHS))
    print('Training...')

    # 訓練模式
    model.train()

    for data in trainloader: # trainloader is an iterator over each batch
        
        tokens_tensors, segments_tensors, \
        masks_tensors, labels = [t.to(device) for t in data]

        # 將參數梯度歸零
        optimizer.zero_grad()
        
        # forward pass
        outputs = model(input_ids=tokens_tensors, 
                        token_type_ids=segments_tensors, 
                        attention_mask=masks_tensors, 
                        labels=labels)

        loss = outputs[0]
        # backward
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()

        # 紀錄當前 batch loss
        running_loss += loss.item()
        
    # 計算分類準確率
    logit, acc = get_predictions(model, trainloader, compute_acc=True)

    print('loss: %.3f, acc: %.3f' %
          (running_loss, acc))    
    print("")
    print("Running Validation...")

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for data in valloader:
        
        tokens_tensors, segments_tensors, \
        masks_tensors, labels = [t.to(device) for t in data]
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # This will return the logits rather than the loss because we have
            # not provided labels.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            outputs = model(input_ids=tokens_tensors, 
                        token_type_ids=segments_tensors, 
                        attention_mask=masks_tensors, 
                        labels=labels)
        
        # Get the "logits" output by the model. The "logits" are the output
        # values prior to applying an activation function like the softmax.
        logits = outputs[0]


    _, acc = get_predictions(model, valloader, compute_acc=True)
        # Move logits and labels to CPU
        #logits = logits.detach().cpu().numpy()
        #label_ids = b_labels.to('cpu').numpy()
        
        # Calculate the accuracy for this batch of test sentences.
        #tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        #eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        #nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(acc))

device: cuda:0

Training...
loss: 36.268, acc: 0.986

Running Validation...
  Accuracy: 0.82

Training...
loss: 33.903, acc: 0.987

Running Validation...
  Accuracy: 0.81

Training...
loss: 31.953, acc: 0.986

Running Validation...
  Accuracy: 0.82

Training...
loss: 30.145, acc: 0.987

Running Validation...
  Accuracy: 0.82

Training...
loss: 28.237, acc: 0.989

Running Validation...
  Accuracy: 0.83

Training...
loss: 26.808, acc: 0.990

Running Validation...
  Accuracy: 0.82

Training...
loss: 25.050, acc: 0.990

Running Validation...
  Accuracy: 0.83

Training...
loss: 23.748, acc: 0.991

Running Validation...
  Accuracy: 0.83

Training...
loss: 22.331, acc: 0.992

Running Validation...
  Accuracy: 0.82

Training...
loss: 20.934, acc: 0.992

Running Validation...
  Accuracy: 0.83

Training...
loss: 19.601, acc: 0.992

Running Validation...
  Accuracy: 0.83

Training...
loss: 18.455, acc: 0.992

Running Validation...
  Accuracy: 0.83

Training...
loss: 17.284, acc: 0.993

Running Va

In [0]:
testset = OnlineQueryDataset("test", tokenizer=tokenizer, perc=70, path = "data/0226label/70%test.tsv")
testloader = DataLoader(testset, batch_size=32, 
                        collate_fn=create_mini_batch)

In [0]:
# 用分類模型預測測試集
predictions_with_unlabel = get_predictions(model, testloader)

In [0]:
print("Testset accuracy with unlabel data: %f" % accuracy(test_label, predictions_with_unlabel))

Testset accuracy with unlabel data: 0.820800


沒有什麼進步XD

In [0]:
output_dir = os.path.join("model", "70%withunlabeled")

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

Saving model to model/70%withunlabeled


('model/70%withunlabeled/vocab.txt',
 'model/70%withunlabeled/special_tokens_map.json',
 'model/70%withunlabeled/added_tokens.json')