- author: Lee Meng
- date: 2019-07-08 09:00
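- title: Attack on BERT: Harnessing the Power of the Titans in NLP
- slug: attack_on_bert_transfer_learning_in_nlp
- tags: Natural Language Processing, NLP, PyTorch
- description: An introduction to BERT, the language representation model that is currently very popular in natural language processing, and to how transfer learning works. Using a fake news classification problem, this article shows in PyTorch how to apply a powerful language representation model to the NLP tasks you care about, so you can fly faster and farther.
- summary: An introduction to BERT, the language representation model that is currently very popular in natural language processing, and to how transfer learning works. Using a fake news classification problem, this article shows in PyTorch how to apply a powerful language representation model to the NLP tasks you care about, so you can fly faster and farther.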
- image: attack_on_bert.jpg


!quote
- This article helps you understand and actually use the power of the titans in today's NLP: the BERT model.



If you remember, in [A Beginner's Guide to NLP and Deep Learning](https://leemeng.tw/shortest-path-to-the-nlp-world-a-gentle-guide-of-natural-language-processing-and-deep-learning-for-everyone.html) I used an LSTM and Google's language representation model [BERT](https://github.com/google-research/bert) to classify Chinese fake news. Interestingly, thanks to BERT's sheer power, I effortlessly reached 85 % accuracy in [that Kaggle competition](https://www.kaggle.com/c/fake-news-pair-classification-challenge/leaderboard), 3 % behind first place and within the top 30 % overall.



!image
- nlp-kaggle-intro/kaggle-final-result.png

That said, using BERT has never been all that intuitive. With [PyTorch Hub](https://pytorch.org/hub) recently adding [BERT](https://pytorch.org/hub/huggingface_pytorch-pretrained-bert_bert/), and Professor Hung-yi Lee's [machine learning course](http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML19.html) releasing a [tutorial video on BERT](https://youtu.be/UYPa347-DdE), I think now is the best time to understand and **actually use** BERT!


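After reading this article, you will be able to use the powerful BERT just as I do, in the most intuitive and simple way. You will learn about [transfer learning](https://docs.google.com/presentation/d/1DJI1yX4U5IgApGwavt0AmOCLWwso7ou1Un93sMuAWmA/edit?usp=sharing), a technique that is very popular in NLP right now, and put this power of the titans to work on the natural language tasks you care about. Along the way I also point to some interesting research and applications so that you can keep exploring the fast-changing world of NLP.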

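We will briefly review some important concepts in BERT shortly. If you are completely new to NLP or have never heard of BERT, though, I strongly recommend finding time later (or right now!) to watch Professor Hung-yi Lee's video explaining models such as [ELMo](https://allennlp.org/elmo), BERT, and [GPT](https://github.com/openai/gpt-2). It is clear and easy to follow: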

!youtube
- UYPa347-DdE
- Professor Hung-yi Lee explains how the latest NLP research teaches machines to read text (I love this screenshot)

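## BERT: A Language Representation Model That Understands Context

A quick convention. The following terms are used interchangeably in this article:
- representation
- repr.
- repr. vector

They all refer to a multi-dimensional continuous vector that can **represent** a token (in a given context).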

Anyone working in NLP these days has surely heard of the Transformer's [classic paper Attention Is All You Need](https://arxiv.org/abs/1706.03762) and its famous [self-attention mechanism](https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html#Encoder-Decoder-%E6%A8%A1%E5%9E%8B-+-%E6%B3%A8%E6%84%8F%E5%8A%9B%E6%A9%9F%E5%88%B6). [BERT](https://arxiv.org/abs/1810.04805), short for **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is a **language model** that Google "forged" from a large amount of unlabeled text in an unsupervised manner. Its architecture is the Encoder of the Transformer.

In [A Brief Look at Neural Machine Translation & English-to-Chinese Translation with Transformer](https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html) I already explained all the Transformer-related concepts in great detail, so I will not repeat them here.

!image
- bert/bert-intro.jpg
- BERT is essentially the Encoder of the Transformer, just with many layers
- https://youtu.be/UYPa347-DdE?list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4

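BERT is a language model, and what a classic [language model (**L**anguage **M**odel, LM)](https://youtu.be/iWea12EAu6U) does is estimate the probability distribution of the next token given some preceding tokens (BERT itself is trained with a masked variant of this objective, described later in this article). The LSTM in [Let AI Write Some Jin Yong](https://leemeng.tw/how-to-generate-interesting-text-with-tensorflow2-and-tensorflow-js.html) is also a language model, only many orders of magnitude smaller than BERT.

Why would you want to train a language model (LM)? Because:
- Unlike [ImageNet](http://www.image-net.org/), which requires people to label the data, any text on the web is a potential dataset for training an LM, so data is practically unlimited (the pre-training data for BERT contains 3.3 **billion** words in total, drawn from Wikipedia and [BooksCorpus](https://arxiv.org/abs/1506.06724))
- A strong LM can learn syntactic structure, interpret semantics, and even perform [coreference resolution](http://ckip.iis.sinica.edu.tw/project/coreference/). Through feature extraction or fine-tuning, it makes training downstream tasks more efficient and improves their performance

This is the two-stage approach that has recently become very popular in NLP:
- First, pre-train a huge LM in an unsupervised manner
- Then use that LM for feature extraction, or fine-tune it on a downstream task that has labeled data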


!image
- bert/lm-equation.jpg
- Given the first t tokens from the vocabulary, the language model estimates the probability distribution P of token t + 1
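
To make the idea of "estimating the distribution of the next token" concrete, here is a minimal count-based sketch. It is a simplification and not how BERT works: it conditions only on the single previous token (a bigram model) rather than on all t preceding tokens, and the toy corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_lm(sentences):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Return P(next | prev) as a dict summing to 1 (empty if prev is unseen)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()} if total else {}


# Hypothetical toy corpus, tokenized on whitespace.
corpus = [
    "the tide goes out".split(),
    "the tide comes in".split(),
    "the tide goes in".split(),
]
lm = train_bigram_lm(corpus)
dist = next_token_distribution(lm, "tide")
print(dist)
```

A neural LM replaces the count table with a network that maps the whole context to scores over the vocabulary, but the output has the same form: one probability per candidate token.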

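Of course, there is no free lunch.

Training a 12-layer **BERT-BASE** with 110 million parameters takes 16 [TPU chips](https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-global-availability.html) running for a full 4 days and [costs about 500 USD](https://medium.com/syncedreview/the-staggering-cost-of-training-sota-ai-models-e329e80fa82); the 24-layer **BERT-LARGE** has 340 million parameters and needs 64 TPU chips (about 7,000 USD) to train. Oh, and don't forget to multiply these costs several times over for repeated experiments.

Fortunately, the authors released the trained models. By loading a pre-trained BERT with [TensorFlow](https://github.com/google-research/bert) or [PyTorch](https://github.com/huggingface/pytorch-pretrained-BERT), you skip the entire cost of pre-training and can use the powerful BERT right away.

Let's take a quick look at how BERT is trained.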

!image
- bert/bert-pretrain-tasks.jpg
- The two tasks BERT has to complete during pre-training
- https://youtu.be/UYPa347-DdE?list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4



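When pre-training BERT, Google has it perform two tasks **at the same time**:
- A fill-in-the-blank test (the [Cloze task proposed in 1953](https://journals.sagepub.com/doi/abs/10.1177/107769905303000401))
- Deciding whether the 2nd sentence originally followed the 1st sentence (**N**ext **S**entence **P**rediction, NSP)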
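
As a rough illustration of how training examples for these two tasks could be built from raw text, here is a simplified sketch. The function names and the random "distractor" sentence are hypothetical. It also simplifies the real recipe: the BERT paper selects 15 % of the tokens and replaces only 80 % of those with `[MASK]` (10 % become a random token and 10 % stay unchanged), whereas this sketch always uses `[MASK]`.

```python
import random


def make_cloze_example(tokens, mask_prob=0.15, seed=0):
    """Replace roughly mask_prob of the tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)    # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)   # nothing to predict at this position
    return inputs, targets


def make_nsp_example(sent_a, true_next, random_sent, is_next):
    """Pair sentence A with its real successor (label 1) or a random sentence (label 0)."""
    sent_b = true_next if is_next else random_sent
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    return tokens, int(is_next)


# Character-level tokens, which is roughly how Chinese BERT splits text.
sentence = list("等到潮水退了就知道誰沒穿褲子")
inputs, targets = make_cloze_example(sentence)
pair, label = make_nsp_example(
    list("醒醒吧"), list("你沒有妹妹"), list("今天天氣很好"), is_next=True
)
print(inputs)
print(pair, label)
```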

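For netizens who know everything under the sun, these two tasks are trivially easy. A quick look at the **surrounding context** tells you that the `[MASK]` in the cloze task on the left should be filled with `退了` (receded), and that `你沒有妹妹` (you don't have a little sister) following `醒醒吧` (wake up) makes perfect sense.

Enough talk. This is a hands-on BERT article, so let's load the [BERT model](https://pytorch.org/hub/huggingface_pytorch-pretrained-bert_bert/) from [PyTorch Hub](https://pytorch.org/hub) right away and try it out. First we need to install a few simple libraries: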


In [0]:
%%bash
pip install tqdm boto3 requests regex -q

Next, we load the tokenizer used by Chinese BERT:

In [0]:
import torch
from IPython.display import clear_output

GITHUB_REPO = "huggingface/pytorch-pretrained-BERT" # thanks to the HuggingFace team for this gift to the community
PRETRAINED_MODEL_NAME = "bert-base-chinese"  # the BERT-BASE model pre-trained on Traditional and Simplified Chinese

# Get the tokenizer used by this pre-trained model
tokenizer = torch.hub.load(GITHUB_REPO, 'bertTokenizer', PRETRAINED_MODEL_NAME)
clear_output()

# Show vocabulary information
vocab = tokenizer.vocab
print("Vocabulary size:", len(vocab))
# tokenizer.vocab

Vocabulary size: 21128


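If I remember correctly, the vocabulary of English BERT has roughly 30,000 entries. Let's take a look at some of the tokens recorded in the Chinese BERT vocabulary and their corresponding indices: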

In [0]:
import random
random_tokens = random.sample(list(vocab), 10)
random_ids = [vocab[t] for t in random_tokens]

print("{0:20}{1:15}".format("token", "index"))
print("-" * 25)
for t, idx in zip(random_tokens, random_ids):  # `idx` avoids shadowing the built-in `id`
    print("{0:15}{1:10}".format(t, idx))

token               index          
-------------------------
抒                    2830
##ein               11858
##薯                 19018
result              13170
149                  9491
evolution           12691
mwc                 12184
##刃                 14202
kelly               11394
##换                 15997


BERT uses the [WordPiece tokenization](https://arxiv.org/abs/1609.08144) originally proposed for [Google NMT](https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html), which splits words into finer-grained wordpieces and effectively mitigates the [OOV](https://en.wiktionary.org/wiki/OOV) problem. For Chinese this works roughly like character-level tokenization. Tokens with a `##` prefix are wordpieces that continue a word rather than start one.
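
To show how a word gets split into wordpieces at tokenization time, here is a minimal sketch of greedy longest-match-first lookup, which is the general strategy WordPiece tokenizers use. The tiny vocabulary and the function name are hypothetical, and the sketch skips details such as lowercasing, punctuation splitting, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into wordpieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"frag", "##ment", "##s", "result"}  # hypothetical vocabulary
print(wordpiece_tokenize("fragments", toy_vocab))  # ['frag', '##ment', '##s']
print(wordpiece_tokenize("result", toy_vocab))     # ['result']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

Because rare words decompose into pieces that are in the vocabulary, far fewer inputs end up as `[UNK]` than with a whole-word vocabulary.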

Zhuyin (Bopomofo) symbols such as ㄅㄆㄇㄈ are of course included as well:

In [0]:
indices = list(range(647, 657))
some_pairs = [(t, idx) for t, idx in vocab.items() if idx in indices]
for pair in some_pairs:
    print(pair)

('ㄅ', 647)
('ㄆ', 648)
('ㄇ', 649)
('ㄉ', 650)
('ㄋ', 651)
('ㄌ', 652)
('ㄍ', 653)
('ㄎ', 654)
('ㄏ', 655)
('ㄒ', 656)



You can also see all the available pre-trained BERT models in the [Hugging Face team's repo](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/hubconfs/bert_hubconf.py). As of this writing, the following models are available:

- bert-base-chinese
- bert-base-uncased
- bert-base-cased
- bert-base-german-cased
- bert-base-multilingual-uncased
- bert-base-multilingual-cased
- bert-large-cased
- bert-large-uncased
- bert-large-uncased-whole-word-masking
- bert-large-cased-whole-word-masking

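The main differences among these models are the language of the pre-training text and the number of layers. The English `bert-base-cased` is the one most commonly used in applications and research, but to make BERT's behavior easier to follow, this article uses the pre-trained model that covers both Traditional and Simplified Chinese.

Let's tokenize an actual Chinese sentence: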

In [0]:
# Tokenize a Chinese sentence with the Chinese BERT tokenizer
text = "[CLS] 等到潮水 [MASK] 了，就知道誰沒穿褲子。"
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

print(text)
print(tokens[:10], '...')
print(ids[:10], '...')

[CLS] 等到潮水 [MASK] 了，就知道誰沒穿褲子。
['[CLS]', '等', '到', '潮', '水', '[MASK]', '了', '，', '就', '知'] ...
[101, 5023, 1168, 4060, 3717, 103, 749, 8024, 2218, 4761] ...


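Besides ordinary wordpieces, BERT has 5 special tokens, each with its own job:
- `[CLS]`: in classification tasks, its final-layer repr. is treated as the repr. of the whole input sequence
- `[SEP]`: two sentences are merged into one sequence, and this token is inserted between them as a separator
- `[UNK]`: characters that do not appear in the vocabulary are replaced by this token
- `[PAD]`: the padding token, used to pad sequences of different lengths to the same length for batch computation
- `[MASK]`: the placeholder for an unknown token, used only during pre-training

Padding masks were covered in detail in my earlier [Transformer](https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html#%E7%9B%B4%E8%A7%80%E7%90%86%E8%A7%A3%E9%81%AE%E7%BD%A9%E5%9C%A8%E6%B3%A8%E6%84%8F%E5%87%BD%E5%BC%8F%E4%B8%AD%E7%9A%84%E6%95%88%E6%9E%9C) article. The `[MASK]` token is normally not used during fine-tuning or feature extraction; it appears here only to demonstrate the cloze task from the pre-training stage.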
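
The following sketch shows how `[CLS]`, `[SEP]`, and `[PAD]` fit together when two sentences are packed into one input, along with the segment ids and attention mask that usually accompany them. The helper name `build_pair_inputs` is hypothetical, and the sentence pair is only an illustration.

```python
def build_pair_inputs(tokens_a, tokens_b, max_len):
    """Assemble [CLS] A [SEP] B [SEP], then pad to max_len with [PAD]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # 1 marks a real token

    n_pad = max_len - len(tokens)
    if n_pad < 0:
        raise ValueError("max_len is shorter than the assembled sequence")
    tokens += ["[PAD]"] * n_pad
    segment_ids += [0] * n_pad
    attention_mask += [0] * n_pad       # 0 tells the model to ignore padding
    return tokens, segment_ids, attention_mask


tokens, segment_ids, attention_mask = build_pair_inputs(
    list("東京好熱，"), list("台灣也是嗎？"), max_len=15
)
for row in zip(tokens, segment_ids, attention_mask):
    print(row)
```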

Now let's see what BERT fills in for the sentence above that contains `[MASK]`:

In [0]:
"""Load the pre-trained masked language model and predict the token for [MASK]."""

# Besides the tokens, we also need segment ids that tell the sentences apart
tokens_tensor = torch.tensor([ids])  # (1, seq_len)
segments_tensors = torch.zeros_like(tokens_tensor)  # (1, seq_len)
maskedLM_model = torch.hub.load(GITHUB_REPO, 
                                'bertForMaskedLM', 
                                PRETRAINED_MODEL_NAME)
clear_output()

# Use the masked LM to estimate the actual token at the [MASK] position
maskedLM_model.eval()
with torch.no_grad():
    predictions = maskedLM_model(tokens_tensor, segments_tensors)
    # (1, seq_len, vocab_size): one score per vocabulary entry at each position

# Take the top k most likely tokens from the distribution at the [MASK] position
masked_index = 5
k = 3
probs, indices = torch.topk(torch.softmax(predictions[0, masked_index], -1), k)
predicted_tokens = tokenizer.convert_ids_to_tokens(indices.tolist())

print("Input sentence:", tokens[:10], '...')
print('-' * 50)
for i, (t, p) in enumerate(zip(predicted_tokens, probs), 1):
    tokens[masked_index] = t
    print("Top {} ({:2}%)：{}".format(i, int(p.item() * 100), tokens[:10]), '...')

Input sentence: ['[CLS]', '等', '到', '潮', '水', '[MASK]', '了', '，', '就', '知'] ...
--------------------------------------------------
Top 1 (67%)：['[CLS]', '等', '到', '潮', '水', '來', '了', '，', '就', '知'] ...
Top 2 (25%)：['[CLS]', '等', '到', '潮', '水', '濕', '了', '，', '就', '知'] ...
Top 3 ( 2%)：['[CLS]', '等', '到', '潮', '水', '過', '了', '，', '就', '知'] ...


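Google surely did not read [PTT](https://term.ptt.cc/) when training Chinese BERT, so it still cannot come up with the `退` (recede) that netizens are looking for. From a language model's point of view, though, the predictions are already quite good. By attending to the two characters `潮水` (tide), BERT estimates from a vocabulary of over 20,000 wordpieces that the `[MASK]` token is `來` (come), which is reasonable enough.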
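
The percentages in the output above come from applying softmax to the model's scores at the `[MASK]` position and keeping the k largest probabilities. Here is a dependency-free sketch of that last step. The five-entry vocabulary and the scores are made up for illustration and are not real model outputs.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores over a hypothetical 5-token vocabulary.
toy_vocab = ["來", "濕", "過", "退", "走"]
toy_scores = [4.0, 3.0, 0.5, 0.2, -1.0]

probs = softmax(toy_scores)
for idx, p in top_k(probs, k=3):
    print(toy_vocab[idx], round(p, 3))
```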

!image
- bert/bert-attention.jpg

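This is BERT's attention as visualized by [BertViz](https://github.com/jessevig/bertviz). I will list the installation steps shortly so you can play with it yourself. Note that this is the attention of one particular head inside the [multi-head attention](https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html#Multi-head-attention%EF%BC%9A%E4%BD%A0%E7%9C%8B%E4%BD%A0%E7%9A%84%EF%BC%8C%E6%88%91%E7%9C%8B%E6%88%91%E7%9A%84) of layer 8.

Learning to do cloze tests lets BERT better model the repr. each token should have in different contexts, while the NSP task helps BERT model the relationship between two sentences. This is very helpful for [question answering (QA)](https://zh.wikipedia.org/wiki/%E5%95%8F%E7%AD%94%E7%B3%BB%E7%B5%B1), [natural language inference (NLI)](http://nlpprogress.com/english/natural_language_inference.html), or any classification task involving two sentences.


This kind of repr. is the [contextual word representation](https://youtu.be/S-CspeZ8FHc) concept that has been hugely popular in recent years. It differs greatly from earlier context-free word embeddings such as [Word2Vec and GloVe](https://youtu.be/8rXD5-xhemo), which carry no contextual information. To put it a bit more academically:



!quote
- A contextual word repr. gives word tokens of the same word type different representations in different contexts, whereas a traditional word repr. gives word tokens of the same word type the same repr. regardless of context.

Intuitively, contextual word representations better reflect how human language really works. After all, it is perfectly normal for the same word to mean different things in different situations. Let me give a concrete example: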


```text
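Scenario 1:

胖虎叫大雄去買漫畫，回來慢了就打他。
(Gian told Nobita to go buy comics, and would hit him if he came back late.)

Scenario 2:

妹妹說胖虎是「胖子」，他聽了很不開心。
(His little sister called Gian "fatty", and he was very unhappy to hear it.)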

```

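Clearly, 「他」 (he/him) refers to different things in these two scenarios. With word vectors that carry no contextual information, a machine would have a hard time correctly "interpreting" the meaning of the two sentences.

Now let's follow [this Colab notebook](https://colab.research.google.com/drive/1g2nhY9vZG-PLC3w3dcHGqwsHBAXnD9EY) and use the [BertViz](https://github.com/jessevig/bertviz) tool we just saw to find out how BERT handles these two scenarios: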


In [0]:
# Install BertViz
import sys
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if 'bertviz_repo' not in sys.path:
  sys.path += ['bertviz_repo']

# import packages
from bertviz.pytorch_pretrained_bert import BertModel, BertTokenizer
from bertviz.head_view_bert import show

# Helper for displaying the visualization inside a Jupyter notebook
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


Once the setup is done, visualizing BERT's internal attention mechanism is very easy:

```python
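# Remember that we are using Chinese BERT
bert_version = 'bert-base-chinese'
model = BertModel.from_pretrained(bert_version)
tokenizer = BertTokenizer.from_pretrained(bert_version)

# The sentence from scenario 1, split into two segments
sentence_a = "胖虎叫大雄去買漫畫，"
sentence_b = "回來慢了就打他。"
call_html()
show(model, tokenizer, sentence_a, sentence_b)

# Note: running this code only displays the result on the left of the figure below.
# To make comparison easier, I also include the result for scenario 2.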
```

!image
- bert/bert-coreference.jpg

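This is the attention of one head in the Encoder block of layer 9 in BERT.

The lines in the figure show how much attention this head pays to other tokens (right side) when updating the repr. of 「他」 (left side). The thicker the line, the higher the attention weight. Clearly this head has some ability to do [coreference resolution](https://youtu.be/i19m4GzBhfc) and correctly finds what 「他」 refers to.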
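
To give a feel for where such attention weights come from, here is a minimal sketch of scaled dot-product attention for a single query, following the formulation in Attention Is All You Need. The 2-dimensional vectors are invented for illustration. In real BERT, the query and key vectors are produced by learned linear projections of 768-dimensional hidden states.

```python
import math


def attention_weights(query, keys):
    """Scaled dot-product attention: softmax(q . k_i / sqrt(d)) over all keys."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical key vectors for three tokens and a hypothetical query for 「他」.
tokens = ["胖虎", "大雄", "他"]
keys = [[0.1, 0.2], [2.0, 1.0], [0.5, 0.5]]
query_for_ta = [1.5, 1.0]

weights = attention_weights(query_for_ta, keys)
for tok, w in zip(tokens, weights):
    print(tok, round(w, 3))
```

The token whose key is most aligned with the query receives the largest weight, which corresponds to the thickest line in a BertViz plot.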

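Coreference resolution is no easy task, so it is exciting that BERT reaches this level through self-attention, a deep bidirectional language model, and a large amount of training text.


Of course, BERT is not the first language model to try to produce contextual word repr. The best-known examples before it are [ELMo](https://allennlp.org/elmo) and [GPT](https://github.com/openai/gpt-2), both mentioned earlier: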




!image
- bert/bert_elmo_gpt.jpg
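- ELMo, GPT, and BERT all obtain contextual word representations by training a language model

ELMo trains a language model with two layers of bidirectional LSTMs and concatenates the intermediate hidden state vectors to form the contextual word representation of each token. GPT uses the Transformer Decoder to train a conventional left-to-right language model.

BERT differs from them by using the **M**asked **L**anguage **M**odel idea (MLM, a fancier name for the cloze test) to train a **bidirectional** language model, so that the repr. it outputs for every token carries information from both the preceding and the following context.

Compared with earlier models, BERT handles natural language better and also performs remarkably well on the well-known question answering task [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/):

!image
- bert/squad2.jpg
- At the time of writing, 4 of the top 5 entries on the SQuAD 2.0 leaderboard use BERT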

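### Load Chinese BERT

- Hung-yi Lee: BERT and GPT-2 both use word pieces (for example, fragment can be split into the two pieces frag + ment, and a single word can also form a word piece on its own). Word pieces can be obtained by collecting a large amount of data and finding frequently occurring patterns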

In [0]:
from IPython.display import clear_output

In [0]:
!pip install tqdm boto3 requests regex -q

In [0]:
import torch

github_repo = "huggingface/pytorch-pretrained-BERT"
PRETRAINED_MODEL_NAME = "bert-base-chinese"

# bert_tokenizer = torch.hub.load(github_repo, 'bertTokenizer', PRETRAINED_MODEL_NAME, do_basic_tokenize=False)
bert_tokenizer = torch.hub.load(github_repo, 'bertTokenizer', PRETRAINED_MODEL_NAME)
# TODO: True / False

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-pretrained-BERT_master


In [0]:
text = "[CLS] 東京好熱， [SEP] 台灣也是嗎？ [SEP] [PAD]"
tokens = bert_tokenizer.tokenize(text)
ids = bert_tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)

['[CLS]', '東', '京', '好', '熱', '，', '[SEP]', '台', '灣', '也', '是', '嗎', '？', '[SEP]', '[PAD]']
[101, 3346, 776, 1962, 4229, 8024, 102, 1378, 4124, 738, 3221, 1621, 8043, 102, 0]


### Sentence finetune model

In [0]:
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

segments_tensors = torch.tensor([segments_ids])
tokens_tensors = torch.tensor([ids])
masks_tensors = torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]], dtype=torch.long)

In [0]:
print(tokens_tensors.shape)
print(segments_tensors.shape)
print(masks_tensors.shape)

torch.Size([1, 15])
torch.Size([1, 15])
torch.Size([1, 15])


In [0]:
# Load bertForSequenceClassification
finetune_task = "bertForSequenceClassification"
num_labels = 3

classifier = torch.hub.load(github_repo, finetune_task, PRETRAINED_MODEL_NAME, num_labels=num_labels)
classifier.eval()

# Predict the sequence classification logits
with torch.no_grad():
    logits = classifier(input_ids=tokens_tensors, 
                        token_type_ids=segments_tensors, 
                        attention_mask=masks_tensors)

clear_output()
print(logits)

tensor([[-0.5932, -0.7918,  0.1374]])


### Compare custom loss to predefined loss

In [0]:
tmp_criterion = torch.nn.CrossEntropyLoss()

In [0]:
labels = torch.tensor([1])
tmp_loss = tmp_criterion(logits, labels)
tmp_loss

tensor(1.5586)

In [0]:
loss = classifier(tokens_tensors, segments_tensors, attention_mask=masks_tensors, labels=labels)
loss

tensor(1.5586, grad_fn=<NllLossBackward>)
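
The two losses above agree because the model's built-in loss is an ordinary cross-entropy over the logits. As a sanity check, the same number can be reproduced by hand without PyTorch. This sketch plugs in the logits printed earlier (rounded to four decimals) and the label `1`; the function name is hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one example: log-sum-exp of the logits minus the true-class logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


loss = cross_entropy([-0.5932, -0.7918, 0.1374], label=1)
print(round(loss, 4))  # 1.5586
```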

In [0]:
# train on GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("device:", device)

classifier.to(device)

device: cuda:0


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediat

### Attentionmap & BertViz

In [0]:
# bert_tokenizer.save_vocabulary("vocab.txt")
# classifier.bert.config.to_json_file("bert_config.json")

### Download raw data and sample

In [0]:
# TODO: see whether torch utils can download the Kaggle dataset directly

In [0]:
#ignore
zip_file = "drive-download-20190516T113709Z-001.zip"
file_url = "https://s3-ap-northeast-1.amazonaws.com/smartnews-dmp-tmp/meng/fake_news/" + zip_file

!wget {file_url}
!unzip {zip_file}
!mv train_bert.tsv train_orig.tsv
!mv dev_bert.tsv dev_orig.tsv
!mv test_bert.tsv test.tsv
clear_output()

In [0]:
import pandas as pd

In [0]:
#ignore
full_train = True
MAX_LENGTH = 30

df_train = pd.read_csv("train_orig.tsv", sep="\t")
df_dev = pd.read_csv("dev_orig.tsv", sep="\t")

if not full_train:
    df_train = df_train.sample(frac=0.1, random_state=9537)

df_train = df_train[~(df_train.text_a.apply(lambda x : len(x)) > MAX_LENGTH)]
df_train = df_train[~(df_train.text_b.apply(lambda x : len(x)) > MAX_LENGTH)]
    
    
df_train.to_csv("train.tsv", sep="\t", index=False)    
df_dev.sample(frac=0.1, random_state=9537).to_csv("dev.tsv", sep="\t", index=False)

In [0]:
df_train = pd.read_csv("train.tsv", sep="\t")
df_train.head(5)

| | text_a | text_b | label |
|---|---|---|---|
| 0 | 2018年社保改革新政策：这四大情况可提前支取养老金 | 2018年社保改革新政策，这三大养老金调整方式须知晓！ | unrelated |
| 1 | 注射胰岛素会上瘾？5种情况必须使用胰岛素治疗！ | 谣言粉碎机｜吃粗粮能降糖？胰岛素注射液必须放冰箱？ | unrelated |
| 2 | 三高了？不如试试用凉水泡它喝，坚持一个月，效果惊到你！ | 凉水喝了拉肚子还致癌？关于“凉水”你该知道的3个真相 | unrelated |
| 3 | 罕见的水果吃过5种以上是土豪，吃过10种以上就是贵族！ | 这些水果吃过3种是土豪，吃过5种是贵族，你吃过几种呢？ | agreed |
| 4 | 最管用的“天然胰岛素”，糖尿病的救星终于出现！血糖猛降不是梦 | “天然胰岛素”，糖尿病“死对头”，每天吃两口，血糖不升反降！ | agreed |


In [0]:
df_train.shape

(23871, 3)

### Load tsv as pytorch dataset

- input: (text_a, text_b, label)
- output: 
    - tokens_tensor: `[CLS] T1 T2 [SEP] T3 T4 [SEP] [PAD]`
    - segments_tensors
    - mask_tensors
    
```python
input_mask = torch.zeros(x.shape, device=x.device, dtype=torch.long).masked_fill(x != 0, 1)
input_type_ids = torch.zeros(x.shape, device=x.device, dtype=torch.long)
```

In [0]:
!pip install pysnooper -q
import pysnooper

```python
# The convention in BERT is:
  # (a) For sequence pairs:
  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
  # (b) For single sequences:
  #  tokens:   [CLS] the dog is hairy . [SEP]
  #  type_ids: 0     0   0   0  0     0 0
  #
  # Where "type_ids" are used to indicate whether this is the first
  # sequence or the second sequence. The embedding vectors for `type=0` and
  # `type=1` were learned during pre-training and are added to the wordpiece
  # embedding vector (and position vector). This is not *strictly* necessary
  # since the [SEP] token unambiguously separates the sequences, but it makes
  # it easier for the model to learn the concept of sequences.
  #
  # For classification tasks, the first vector (corresponding to [CLS]) is
  # used as the "sentence vector". Note that this only makes sense because
  # the entire model is fine-tuned.
 ```

In [0]:
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
 
    
class FakeNewsDataset(Dataset):
    # Read the raw tsv file and set up a few attributes
    def __init__(self, mode, tokenizer):
        assert mode in ["train", "dev", "test"]
        self.df = pd.read_csv(mode + ".tsv", sep="\t")
        self.len = len(self.df)
        self.labels = sorted(self.df.label.unique())
        self.label_map = {label: i for i, label in enumerate(self.labels)}
        self.tokenizer = tokenizer
    
    # Return one training / validation / test example
#     @pysnooper.snoop()
    def __getitem__(self, idx):
        text_a, text_b, label = self.df.iloc[idx, :].values
        
        # Build the BERT tokens of the first sentence and append the separator [SEP]
        word_pieces = ["[CLS]"]
        tokens_a = self.tokenizer.tokenize(text_a)
        word_pieces += tokens_a + ["[SEP]"]
        len_a = len(word_pieces)
        
        # BERT tokens of the second sentence
        tokens_b = self.tokenizer.tokenize(text_b)
        word_pieces += tokens_b + ["[SEP]"]
        len_b = len(word_pieces) - len_a
        
        # Convert the whole token sequence into a sequence of indices
        ids = self.tokenizer.convert_tokens_to_ids(word_pieces)
        tokens_tensor = torch.tensor(ids)
        
        # Convert the label into an index as well
        label_id = self.label_map[label]
        
        # Segment 0 covers [CLS] + sentence A + [SEP]; segment 1 covers sentence B + [SEP]
        segments_tensor = torch.tensor([0] * len_a + [1] * len_b, 
                                        dtype=torch.long)
        
        return (tokens_tensor, segments_tensor, torch.tensor(label_id))
    
    def __len__(self):
        return self.len


def create_mini_batch(samples):
    tokens_tensors = [s[0] for s in samples]
    segments_tensors = [s[1] for s in samples]
    label_ids = torch.stack([s[2] for s in samples])
    
    tokens_tensors = pad_sequence(tokens_tensors, 
                                  batch_first=True)
    
    segments_tensors = pad_sequence(segments_tensors, 
                                    batch_first=True)
    
    masks_tensors = torch.zeros(tokens_tensors.shape, 
                                dtype=torch.long)
    masks_tensors = masks_tensors.masked_fill(
        tokens_tensors != 0, 1)
    
    return tokens_tensors, segments_tensors, masks_tensors, label_ids
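
For readers who want to see what `create_mini_batch` does without PyTorch, here is a plain-Python sketch of the same idea: right-pad every sequence to the longest one in the batch, then mark real tokens with 1 and padding with 0. The function name `pad_batch` is hypothetical. Like the cell above, it relies on the `[PAD]` token having id 0 in this vocabulary.

```python
def pad_batch(id_seqs, pad_id=0):
    """Right-pad id sequences to equal length and build the matching attention masks."""
    max_len = max(len(seq) for seq in id_seqs)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in id_seqs]
    # A position is "real" (mask 1) exactly when it does not hold the pad id.
    masks = [[1 if tok != pad_id else 0 for tok in row] for row in padded]
    return padded, masks


# Two id sequences of different lengths, using ids shown earlier in this article.
batch = [
    [101, 3346, 776, 1962, 4229, 8024, 102],
    [101, 1378, 4124, 102],
]
padded, masks = pad_batch(batch)
print(padded)
print(masks)
```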


In [0]:
idx = 1
v = df_train.loc[idx, ['text_a', 'text_b']].values
text_a = v[0]
text_b = v[1]
input_words = ["[CLS]"] + bert_tokenizer.tokenize(text_a) + ["[SEP]"] + bert_tokenizer.tokenize(text_b) + ['[SEP]']

In [0]:
BATCH_SIZE = 64

train = FakeNewsDataset("train", tokenizer=bert_tokenizer)
trainloader = DataLoader(train, batch_size=BATCH_SIZE, collate_fn=create_mini_batch)

for i, data in enumerate(trainloader):
    
    tokens_tensors, segments_tensors, masks_tensors, label_ids = \
        data
    
    print(f"batch {i}:")
    print("tokens_tensors:", tokens_tensors.shape)
    print("segments_tensors:", segments_tensors.shape)
    print("masks_tensors:", masks_tensors.shape)
    print("label_ids:", label_ids.shape)
    print('-' * 30)
    
    # columns: input token > token id > segment id > attention mask
    input_words += ['[PAD]'] * (len(masks_tensors[idx]) - len(input_words))
    # use `w` for the word so the batch index `i` is not overwritten
    for w, t, s, m in zip(input_words, tokens_tensors[idx], segments_tensors[idx], masks_tensors[idx]):
        print(w, t.item(), '   ', s.item(),  '  ', m.item())
    
    break  # only inspect the first batch




batch 0:
tokens_tensors: torch.Size([64, 63])
segments_tensors: torch.Size([64, 63])
masks_tensors: torch.Size([64, 63])
label_ids: torch.Size([64])
------------------------------
[CLS] 101     0    1
注 3800     0    1
射 2198     0    1
胰 5536     0    1
岛 2270     0    1
素 5162     0    1
会 833     0    1
上 677     0    1
瘾 4614     0    1
？ 8043     0    1
5 126     0    1
种 4905     0    1
情 2658     0    1
况 1105     0    1
必 2553     0    1
须 7557     0    1
使 886     0    1
用 4500     0    1
胰 5536     0    1
岛 2270     0    1
素 5162     0    1
治 3780     0    1
疗 4545     0    1
！ 8013     0    1
[SEP] 102     0    1
谣 6469     1    1
言 6241     1    1
粉 5106     1    1
碎 4810     1    1
机 3322     1    1
｜ 8078     1    1
吃 1391     1    1
粗 5110     1    1
粮 5117     1    1
能 5543     1    1
降 7360     1    1
糖 5131     1    1
？ 8043     1    1
胰 5536     1    1
岛 2270     1    1
素 5162     1    1
注 3800     1    1
射 2198     1    1
液 3890     1    1
必 2553     1    1
须 7557  

### Direct prediction


In [0]:
df_dev = pd.read_csv("dev.tsv", sep="\t")
df_dev.shape

(3206, 3)

In [0]:
df_dev.label.value_counts()

unrelated    2187
agreed        934
disagreed      85
Name: label, dtype: int64

In [0]:
df_dev.label.apply(lambda x: 1 if x == "unrelated" else 0).sum() / len(df_dev)

0.682158452900811
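
The number above is the accuracy of always guessing the most frequent class, which is a useful floor for judging any classifier on this dev set. The sketch below recomputes it from the label counts printed earlier. The helper name is hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Label counts taken from the value_counts() output above.
labels = ["unrelated"] * 2187 + ["agreed"] * 934 + ["disagreed"] * 85
label, acc = majority_baseline(labels)
print(label, round(acc, 4))  # unrelated 0.6822
```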

In [0]:
dev = FakeNewsDataset("dev", tokenizer=bert_tokenizer)
devloader = DataLoader(dev, batch_size=BATCH_SIZE, collate_fn=create_mini_batch)

In [0]:
correct = 0
total = 0
with torch.no_grad():
    for data in devloader:
        tokens_tensors, segments_tensors, \
        masks_tensors, labels = [t.to(device) for t in data]
        
        logits = classifier(tokens_tensors, segments_tensors, masks_tensors)
        _, predicted = torch.max(logits.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on dev set: %d %%' % (
    100 * correct / total))

Accuracy of the network on dev set: 67 %


In [0]:
labels, predicted

(tensor([1, 0, 0, 2, 2, 0], device='cuda:0'),
 tensor([0, 0, 0, 2, 0, 0], device='cuda:0'))

### Loss function & Optimizer

In [0]:
# num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# print(f"Model has {num_params} parameters") 

# model.to('cuda:0')
# loss = torch.nn.NLLLoss()
# loss = loss.to('cuda:0')

# learnable_params = [p for p in model.parameters() if p.requires_grad]
# optimizer = torch.optim.Adam(learnable_params, lr=1.0e-4)

In [0]:
criterion = torch.nn.CrossEntropyLoss()
criterion = criterion.to(device)

learnable_params = [p for p in classifier.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(learnable_params, lr=1.0e-4)
print(f"# params: {sum(p.numel() for p in learnable_params)}")

# params: 102269955
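
The parameter count above can be checked by hand from the layer shapes. The vocabulary size, hidden size, position and segment embedding sizes come from the model printout earlier, and the 12 layers and 3 labels come from this article. The feed-forward size of 3072 and the presence of a pooler layer are assumptions based on the standard BERT-BASE configuration, since the printout above is truncated before those layers.

```python
def linear(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of layer normalization: one scale and one shift per feature."""
    return 2 * n


vocab_size, hidden, max_positions, n_segments = 21128, 768, 512, 2
n_layers, num_labels = 12, 3
intermediate = 3072  # assumption: standard BERT-BASE feed-forward size

embeddings = (vocab_size + max_positions + n_segments) * hidden + layer_norm(hidden)
per_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm(hidden)
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward projection
    + layer_norm(hidden)
)
pooler = linear(hidden, hidden)       # assumption: dense layer applied to [CLS]
classifier_head = linear(hidden, num_labels)

total = embeddings + n_layers * per_layer + pooler + classifier_head
print(total)  # 102269955
```

Almost all of the parameters sit in the 12 encoder layers and the token embedding table, while the classification head added for fine-tuning contributes only 2,307.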


### Finetune Train the network

Outputs:
    if `labels` is not `None`:
        Outputs the CrossEntropy classification loss of the output with the labels.
    if `labels` is `None`:
        Outputs the classification logits of shape [batch_size, num_labels].

In [0]:
classifier.train()
clear_output()

In [0]:
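def evaluate_on_dev():
    # Switch to eval mode so dropout is disabled while measuring accuracy.
    # (The accuracies logged below were produced without this switch.)
    classifier.eval()

    correct = 0
    total = 0
    with torch.no_grad():
        for data in devloader:
            tokens_tensors, segments_tensors, \
            masks_tensors, labels = [t.to(device) for t in data]

            logits = classifier(tokens_tensors, segments_tensors, masks_tensors)
            _, predicted = torch.max(logits.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print('Accuracy of the network on dev set: %.2f %%' % (
        100 * correct / total))

    # Return to training mode so fine-tuning can continue afterwards
    classifier.train()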

In [0]:
from tqdm import tqdm

In [0]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [0]:
MODEL_PATH = '/content/drive/My Drive/latest_checkpoint.pth'

In [0]:
import os
if os.path.exists(MODEL_PATH):
    classifier.load_state_dict(torch.load(MODEL_PATH))
    print("recovered latest checkpoints.")

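- 1 epoch: 50 minutes (author's note; the run logged below took about 5.5 minutes per epoch, or roughly 54 minutes for 10 epochs)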

In [44]:
%%time
EPOCHS = 30  # note: the progress bar in the log below shows a run of 10 epochs

for epoch in tqdm(range(EPOCHS)):
    
    
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        
        tokens_tensors, segments_tensors, \
        masks_tensors, labels = [t.to(device) for t in data]

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
#         logits = classifier(input_ids=tokens_tensors, 
#                             token_type_ids=segments_tensors, 
#                             attention_mask=masks_tensors)
#         loss = criterion(logits, labels)
#         loss.backward()
        loss = classifier(input_ids=tokens_tensors, 
                            token_type_ids=segments_tensors, 
                            attention_mask=masks_tensors, labels=labels)
        loss.backward()
        optimizer.step()
        

        # print statistics: the value printed is the loss summed over the last 100 mini-batches, not the mean
        running_loss += loss.item()
        
        if i % 100 == 99:    # print every 100 mini-batches
            print('[epoch %d, steps %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss))
            running_loss = 0.0
    
    
    evaluate_on_dev()
    torch.save(classifier.state_dict(), 
               f"/content/drive/My Drive/checkpoint_epoch_{epoch + 1}.pth")
    torch.save(classifier.state_dict(), MODEL_PATH)

print('Finished Training')

  0%|          | 0/10 [00:00<?, ?it/s]

[epoch 1, steps   100] loss: 47.437
[epoch 1, steps   200] loss: 38.050
[epoch 1, steps   300] loss: 36.146
Accuracy of the network on dev set: 83.91 %


 10%|█         | 1/10 [05:26<48:59, 326.59s/it]

[epoch 2, steps   100] loss: 31.601
[epoch 2, steps   200] loss: 29.002
[epoch 2, steps   300] loss: 27.861
Accuracy of the network on dev set: 85.50 %


 20%|██        | 2/10 [10:53<43:32, 326.54s/it]

[epoch 3, steps   100] loss: 25.822
[epoch 3, steps   200] loss: 22.838
[epoch 3, steps   300] loss: 26.275
Accuracy of the network on dev set: 83.81 %


 30%|███       | 3/10 [16:19<38:06, 326.66s/it]

[epoch 4, steps   100] loss: 24.291
[epoch 4, steps   200] loss: 22.369
[epoch 4, steps   300] loss: 21.653
Accuracy of the network on dev set: 85.75 %


 40%|████      | 4/10 [21:46<32:39, 326.59s/it]

[epoch 5, steps   100] loss: 19.547
[epoch 5, steps   200] loss: 15.473
[epoch 5, steps   300] loss: 16.879
Accuracy of the network on dev set: 83.06 %


 50%|█████     | 5/10 [27:12<27:12, 326.55s/it]

[epoch 6, steps   100] loss: 16.239
[epoch 6, steps   200] loss: 12.886
[epoch 6, steps   300] loss: 13.843
Accuracy of the network on dev set: 82.41 %


 60%|██████    | 6/10 [32:39<21:45, 326.49s/it]

[epoch 7, steps   100] loss: 15.607
[epoch 7, steps   200] loss: 17.340
[epoch 7, steps   300] loss: 19.988
Accuracy of the network on dev set: 86.03 %


 70%|███████   | 7/10 [38:05<16:19, 326.44s/it]

[epoch 8, steps   100] loss: 13.632
[epoch 8, steps   200] loss: 13.587
[epoch 8, steps   300] loss: 13.003
Accuracy of the network on dev set: 86.28 %


 80%|████████  | 8/10 [43:31<10:52, 326.46s/it]

[epoch 9, steps   100] loss: 11.371
[epoch 9, steps   200] loss: 10.259
[epoch 9, steps   300] loss: 11.037
Accuracy of the network on dev set: 85.25 %


 90%|█████████ | 9/10 [48:58<05:26, 326.62s/it]

[epoch 10, steps   100] loss: 9.708
[epoch 10, steps   200] loss: 9.456
[epoch 10, steps   300] loss: 10.046
Accuracy of the network on dev set: 85.75 %


100%|██████████| 10/10 [54:26<00:00, 326.76s/it]

Finished Training
CPU times: user 31min 2s, sys: 22min 52s, total: 53min 54s
Wall time: 54min 26s





In [105]:
train.df.shape

(23871, 3)

In [45]:
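correct = 0
total = 0
classifier.eval()  # disable dropout while measuring accuracy (the figure below was logged without this)
with torch.no_grad():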
    for data in trainloader:
        tokens_tensors, segments_tensors, \
        masks_tensors, labels = [t.to(device) for t in data]

        logits = classifier(tokens_tensors, segments_tensors, masks_tensors)
        _, predicted = torch.max(logits.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on train set: %.2f %%' % (
    100 * correct / total))

Accuracy of the network on train set: 97.64 %



### Feature extraction

- 1 epoch 20 minutes

In [0]:
# for p in classifier.bert.parameters():
#     p.requires_grad = False

In [0]:
# learnable_params = [p for p in classifier.parameters() if p.requires_grad]
# print(f"# params: {sum(p.numel() for p in learnable_params)}")

In [0]:
# %%time
# EPOCHS = 10

# for epoch in tqdm(range(EPOCHS)):
    
    
#     running_loss = 0.0
#     for i, data in enumerate(trainloader, 0):
        
#         tokens_tensors, segments_tensors, \
#         masks_tensors, labels = [t.to(device) for t in data]

#         # zero the parameter gradients
#         optimizer.zero_grad()

#         # forward + backward + optimize
#         logits = classifier(input_ids=tokens_tensors, 
#                             token_type_ids=segments_tensors, 
#                             attention_mask=masks_tensors)
#         loss = criterion(logits, labels)
#         loss.backward()
#         optimizer.step()

#         # print statistics
#         running_loss += loss.item()
        
#         if i % 100 == 99:    # print every 100 mini-batches
#             print('[epoch %d, steps %5d] loss: %.3f' %
#                   (epoch + 1, i + 1, running_loss))
#             running_loss = 0.0
    
    
#     evaluate_on_dev()
#     torch.save(classifier.state_dict(), './checkpoint.pth')

# print('Finished Training')

In [0]:
# correct = 0
# total = 0
# with torch.no_grad():
#     for data in trainloader:
#         tokens_tensors, segments_tensors, \
#         masks_tensors, labels = [t.to(device) for t in data]
        
#         logits = classifier(tokens_tensors, segments_tensors, masks_tensors)
#         _, predicted = torch.max(logits.data, 1)
#         total += labels.size(0)
#         correct += (predicted == labels).sum().item()

# print('Accuracy of the network on train set: %d %%' % (
#     100 * correct / total))

In [0]:
# TODO: add a script version of the demo

### Generating test predictions

In [0]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
 
    
class FakeNewsDataset(Dataset):
    # Read the raw tsv file and set up some parameters
    def __init__(self, mode, tokenizer):
        assert mode in ["train", "dev", "test"]
        self.mode = mode
        self.df = pd.read_csv(mode + ".tsv", sep="\t")
        self.len = len(self.df)
        self.label_map = {'agreed': 0, 'disagreed': 1, 'unrelated': 2} # TEMP

            
#         self.labels = sorted(self.df.label.unique())
#         self.label_map = {label: i for i, label in enumerate(self.labels)}
        self.tokenizer = tokenizer
    
    # Return one training / dev / test example
#     @pysnooper.snoop()
    def __getitem__(self, idx):
        if self.mode == "test":
            text_a, text_b = self.df.iloc[idx, :].values
            label_tensor = None
        else:
            text_a, text_b, label = self.df.iloc[idx, :].values
            # Convert the label to its index as well
            label_id = self.label_map[label]
            label_tensor = torch.tensor(label_id)
            
        
        # Build the BERT tokens of the first sentence and append the separator [SEP]
        word_pieces = ["[CLS]"]
        tokens_a = self.tokenizer.tokenize(text_a)
        word_pieces += tokens_a + ["[SEP]"]
        len_a = len(word_pieces)
        
        # BERT tokens of the second sentence
        tokens_b = self.tokenizer.tokenize(text_b)
        word_pieces += tokens_b + ["[SEP]"]
        len_b = len(word_pieces) - len_a
        
        # Convert the whole token sequence into a sequence of indices
        ids = self.tokenizer.convert_tokens_to_ids(word_pieces)
        tokens_tensor = torch.tensor(ids)
        
        segments_tensor = torch.tensor([0] * len_a + [1] * len_b, 
                                        dtype=torch.long)
        
        return (tokens_tensor, segments_tensor, label_tensor)
    
    def __len__(self):
        return self.len


def create_mini_batch(samples):
    tokens_tensors = [s[0] for s in samples]
    segments_tensors = [s[1] for s in samples]
    
    if samples[0][2] is not None:
        label_ids = torch.stack([s[2] for s in samples])
    else:
        label_ids = None
    
    tokens_tensors = pad_sequence(tokens_tensors, 
                                  batch_first=True)
    
    segments_tensors = pad_sequence(segments_tensors, 
                                    batch_first=True)
    
    masks_tensors = torch.zeros(tokens_tensors.shape, 
                                dtype=torch.long)
    masks_tensors = masks_tensors.masked_fill(
        tokens_tensors != 0, 1)
    
    return tokens_tensors, segments_tensors, masks_tensors, label_ids


In [65]:
# dev.label_map

{'agreed': 0, 'disagreed': 1, 'unrelated': 2}

In [98]:
# test.df.title2_zh.apply(lambda x: 1 if type(x) == float else 0 ).sum()

0

In [0]:
test = FakeNewsDataset("test", tokenizer=bert_tokenizer)
testloader = DataLoader(test, batch_size=BATCH_SIZE, collate_fn=create_mini_batch)

In [0]:
test.df.fillna("", inplace=True)

In [0]:
def predict_test():

    predictions = None
    with torch.no_grad():
        for data in testloader:
            tokens_tensors = data[0].to(device)
            segments_tensors = data[1].to(device)
            masks_tensors = data[2].to(device)

            logits = classifier(tokens_tensors, segments_tensors, masks_tensors)
            _, predicted = torch.max(logits.data, 1)
            
            if predictions is None:
                
                predictions = predicted
            else:
                predictions = torch.cat((predictions, predicted))

#     print('Accuracy of the network on dev set: %.2f %%' % (
#         100 * correct / total))
    return predictions

In [0]:
predictions = predict_test()

In [101]:
predictions.shape

torch.Size([80126])

In [102]:
index_map = {v: k for k, v in test.label_map.items()}
index_map

{0: 'agreed', 1: 'disagreed', 2: 'unrelated'}

In [103]:
result = pd.DataFrame({"result": predictions.tolist()})
result.columns = ['label']
result['label'] = result.label.apply(lambda x: index_map[x])
result.head()

Unnamed: 0,label
0,unrelated
1,unrelated
2,unrelated
3,unrelated
4,unrelated


In [78]:
!test -f test.csv.zip || wget https://s3-ap-northeast-1.amazonaws.com/smartnews-dmp-tmp/meng/test.csv.zip
!unzip test.csv.zip

--2019-07-07 14:53:09--  https://s3-ap-northeast-1.amazonaws.com/smartnews-dmp-tmp/meng/test.csv.zip
Resolving s3-ap-northeast-1.amazonaws.com (s3-ap-northeast-1.amazonaws.com)... 52.219.4.142
Connecting to s3-ap-northeast-1.amazonaws.com (s3-ap-northeast-1.amazonaws.com)|52.219.4.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6906894 (6.6M) [application/zip]
Saving to: ‘test.csv.zip’


2019-07-07 14:53:09 (48.4 MB/s) - ‘test.csv.zip’ saved [6906894/6906894]

Archive:  test.csv.zip
  inflating: test.csv                


In [79]:
df_test = pd.read_csv("test.csv").fillna('')
df_test.head()

Unnamed: 0,id,tid1,tid2,title1_zh,title2_zh,title1_en,title2_en
0,321187,167562,59521,萨拉赫人气爆棚!埃及总统大选未参选获百万选票 现任总统压力山大,辟谣！里昂官方否认费基尔加盟利物浦，难道是价格没谈拢？,egypt 's presidential election failed to win m...,Lyon! Lyon officials have denied that Felipe F...
1,321190,167564,91315,萨达姆被捕后告诫美国的一句话，发人深思,10大最让美国人相信的荒诞谣言，如蜥蜴人掌控着美国,A message from Saddam Hussein after he was cap...,The Top 10 Americans believe that the Lizard M...
2,321189,167563,167564,萨达姆此项计划没有此国破坏的话，美国还会对伊拉克发动战争吗,萨达姆被捕后告诫美国的一句话，发人深思,Will the United States wage war on Iraq withou...,A message from Saddam Hussein after he was cap...
3,321193,167564,160994,萨达姆被捕后告诫美国的一句话，发人深思,被绞刑处死的萨达姆是替身？他的此男人举动击破替身谣言！,A message from Saddam Hussein after he was cap...,The hanging Saddam is a surrogate? This man's ...
4,321191,167564,15084,萨达姆被捕后告诫美国的一句话，发人深思,中国川贝枇杷膏在美国受到热捧？纯属谣言！,A message from Saddam Hussein after he was cap...,Chinese loquat loquat plaster in America? Pure...


In [104]:
final_result = pd.concat([df_test.reset_index().loc[:, 'id'], result.loc[:, 'label']], axis=1)
final_result.columns = ['Id', 'Category']
final_result.head()

Unnamed: 0,Id,Category
0,321187,unrelated
1,321190,unrelated
2,321189,unrelated
3,321193,unrelated
4,321191,unrelated


In [0]:
final_result.to_csv('bert_epoch10_bsize64_pytorch_10prec_num_sample_23871_train.csv', index=False)

We can see that this is a *massive* gain over our CNN baseline and also improves over our ELMo contextual embeddings for this dataset.  BERT has shown strong results across many datasets, and integrating it into unstructured prediction problems is quite simple, as we saw in this section.

## Conclusion

In this section we investigated the Transformer model architecture, particularly in the context of pretraining LMs.  We discussed some of the model details and we looked at how BERT extends the GPT approach from OpenAI.  We then built our own fine-tuned classifier using the Hugging Face PyTorch library to create and re-load the BERT model and add our own layers on top.

### Some further resources

We have only scratched the surface of the exciting ways that transfer learning is transforming NLP.


- **Transformer Architecture**
  - [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html): mentioned previously, but so good it deserves mentioning again
  - [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/): a good tutorial on how the Transformer works

- **Transfer Learning**
  - A really nice [blog post on transfer learning](http://ruder.io/nlp-imagenet/) from Sebastian Ruder
  - A [fantastic tutorial at NAACL this year](https://docs.google.com/presentation/d/1fIhGikFPnb7G5kr58OvYC3GN4io7MznnM0aAgadvJfc/edit) which is both thorough and introductory.  It covers a lot of material, including how to probe pretrained models to try and figure out what they are up to
  - A nice [colab from the Google BERT devs](https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb) showing how to use BERT from TF-Hub

- **Model Interpretation and Probing**
  - Jesse Vig's blog posts analyzing the different attention heads of BERT
    - Part I: https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77
    - Part II: https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1
    - And a [colab that drills into the Q and K vectors during multi-head attention](https://colab.research.google.com/drive/1Nlhh2vwlQdKleNMqpmLDBsAwrv_7NnrB)
  - [Kevin Clark's Jupyter Notebooks](https://github.com/clarkkev/attention-analysis) for [What Does BERT Look At? An Analysis of BERT's Attention, Clark et al., 2019](https://arxiv.org/abs/1906.04341)
  - [Tal Linzen's code](https://github.com/TalLinzen/rnn_agreement) for [Assessing the ability of LSTMs to learn syntax-sensitive dependencies, Linzen et al., 2016](https://arxiv.org/abs/1611.01368)
  - [Yoav Goldberg's code](https://github.com/yoavg/bert-syntax) assessing syntactic abilities of BERT
  - [Nelson Liu's code](https://github.com/nelson-liu/contextual-repr-analysis) for [Linguistic Knowledge and Transferability of Contextual Representations, Liu et al., 2019](https://homes.cs.washington.edu/~nfliu/papers/liu+gardner+belinkov+peters+smith.naacl2019.pdf)

- **More about Neural NLP**
  -  Get right into the source material.  Some papers that are helpful to understand deep learning in NLP (https://github.com/dpressel/lit)

- **Get Hacking**
  - Implementations of most of what we talked about today in TensorFlow and PyTorch (https://github.com/dpressel/baseline)

There is also an end-to-end PyTorch example that uses the Baseline API to train a GPT-like LM:

https://github.com/dpressel/baseline/blob/master/api-examples/pretrain-transformer-lm.py






# Part III: Fine-tuning a pre-trained model

In the last section, we looked at using a biLM network's layers as embeddings for our classification model.  In that approach, we maintain the exact same model architecture as before, but just switch our word embeddings out for context embeddings (or, more commonly, use them in concert).

The paper [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) (Radford et al., 2018) explored a different approach, much more similar to what is typically done in computer vision.  In fine-tuning, we reuse the network architecture and simply replace the head.  We no longer use any task-specific architecture, just a final layer.  There is an accompanying blog post [here](https://openai.com/blog/language-unsupervised/).  The image below is borrowed from that blog post.

![alt text](https://openai.com/content/images/2018/06/zero-shot-transfer@2x.png)

As we can see from the image, these models can rapidly improve our downstream performance with very limited fine-tuning supervision.
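
To make "reuse the network and replace the head" concrete, here is a minimal sketch. `TinyBody` is a hypothetical stand-in for a pretrained network (it is randomly initialized, not a real pretrained model), and all sizes are arbitrary assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class TinyBody(nn.Module):
    """Hypothetical stand-in for a pretrained encoder (not a real pretrained model)."""

    def __init__(self, vocab_size=100, d_model=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, token_ids):
        # (B, T) -> (B, T, d_model) -> mean-pool over time -> (B, d_model)
        return torch.tanh(self.proj(self.embed(token_ids))).mean(dim=1)


class WithNewHead(nn.Module):
    """Reuse the body unchanged and add one task-specific linear layer."""

    def __init__(self, body, d_model, num_classes):
        super().__init__()
        self.body = body
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        return self.head(self.body(token_ids))


body = TinyBody()
model = WithNewHead(body, d_model=16, num_classes=3)
logits = model(torch.randint(0, 100, (4, 7)))  # batch of 4 sequences, 7 tokens each
print(logits.shape)
```

In fine-tuning, gradients flow through both `body` and `head`; setting `requires_grad = False` on the body's parameters would turn this into pure feature extraction.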



## The Transformer



The original Transformer is an all-attention encoder-decoder model first introduced in [Attention Is All You Need, Vaswani et al., 2017](https://arxiv.org/abs/1706.03762).  It is described at a high-level in [this Google AI post](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html).
Here is an image of the model architecture for a Transformer:

![Transformer Architecture](http://nlp.seas.harvard.edu/images/the-annotated-transformer_14_0.png)

The reference implementation from Google is the [tensor2tensor repository](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor).  There is a lot going on in that codebase, which some people may find hard to follow.

We are going to go through each component in a hands-on manner, which will hopefully give you a visual feel of what is happening.

If you want to understand Transformers better, there is a terrific blog post called [The Annotated Transformer, Rush, 2018](http://nlp.seas.harvard.edu/2018/04/03/attention.html) where you can see how to code up a Transformer from scratch to do Neural Machine Translation (NMT) while following along with the paper.

In versions used in practice, there are slight differences from the actual image, most notably, that layer norm is performed first.  Also, in a causal LM pre-training setting, as in the case of GPT, we have no need for the decoder, which simplifies our architecture substantially, leaving only a masked self-attention in the encoder (this prevents us from seeing the future as we predict).
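
As a rough illustration of the masked self-attention mentioned above, the sketch below builds a lower-triangular (causal) mask and applies it to dummy scores. The sequence length and the all-zero scores are assumptions for illustration; the `mask == 0` convention matches the attention code later in this section.

```python
import torch
import torch.nn.functional as F

T = 4
# Lower-triangular mask: position t may look at positions 0..t only.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.long))

scores = torch.zeros(T, T)  # dummy attention scores, all equal
masked = scores.masked_fill(causal_mask == 0, -1e9)
weights = F.softmax(masked, dim=-1)

print(causal_mask)
print(weights)
```

Each row of `weights` spreads its probability mass only over the current and earlier positions, which is what keeps the model from seeing the future while it predicts.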




### A Transformer Encoder Layer



Here is code adapted from [Baseline](https://github.com/dpressel/baseline) that implements a Transformer block used in a GPT-like architecture (pictured above).  We are going to take a closer look at these blocks, so let's think of this as the high-level overview.  The input to this class is a `torch.Tensor` of shape `BxTxD`, where `D` is `d_model`: token IDs of shape `BxT` must already have been embedded, since the block starts with a `LayerNorm` over `d_model`.  The first sub-component in a Transformer block is the Multi-Headed Attention.  The second is the "FFN" shown in the image -- an MLP layer followed by a linear projection back to the original size, which we encapsulate in an `nn.Sequential`.  Notice that each sub-layer is wrapped in a residual connection.


```python

class TransformerEncoder(nn.Module):
    def __init__(self, num_heads, d_model, pdrop, scale=True, activation_type='relu', d_ff=None):
        """
        :param num_heads (`int`): the number of heads for self-attention
        :param d_model (`int`): The model dimension size
        :param pdrop (`float`): The dropout probability
        :param scale (`bool`): Whether we are doing scaled dot-product attention
        :param activation_type: What activation type to use
        :param d_ff: The feed forward layer size
        """
        super(TransformerEncoder, self).__init__()
        self.d_model = d_model
        self.d_ff = d_ff if d_ff is not None else num_heads * d_model
        self.self_attn = MultiHeadedAttention(num_heads, d_model, pdrop, scale=scale)
        self.ffn = nn.Sequential(nn.Linear(self.d_model, self.d_ff),
                                 pytorch_activation(activation_type),
                                 nn.Linear(self.d_ff, self.d_model))
        self.ln1 = nn.LayerNorm(self.d_model, eps=1e-12)
        self.ln2 = nn.LayerNorm(self.d_model, eps=1e-12)
        self.dropout = nn.Dropout(pdrop)

    def forward(self, x, mask=None):
        """
        :param x: the inputs
        :param mask: a mask for the inputs
        :return: the encoder output
        """
        # Builtin Attention mask
        x = self.ln1(x)
        h = self.self_attn(x, x, x, mask)
        x = x + self.dropout(h)

        x = self.ln2(x)
        x = x + self.dropout(self.ffn(x))
        return x

```



### Multi-headed Attention



Multi-headed attention is one of the key innovations of the Transformer.  The idea was to allow each attention head to learn different relations.

![MHA](https://1.bp.blogspot.com/-AVGK0ApREtk/WaiAuzddKVI/AAAAAAAAB_A/WPV5ropBU-cxrcMpqJBFHg73K9NX4vywwCLcBGAs/s1600/image2.png)

#### Scaled dot product attention

Here is a picture of the operations involved in scaled dot product attention.

![MHA Architecture](http://nlp.seas.harvard.edu/images/the-annotated-transformer_33_0.png)

`Q`, `K` and `V` are low-order projections of the input.  For Encoder-Decoders, the `Q` is a query vector in the decoder, and `K` and `V` are representations of the Encoder.  A dot product of the encoder keys and the query vector determines a set of weights that are applied against the `V` (again, also a representation of the encoder values).  In the case of the encoder, these are all drawn from the same input.  Basic dot product attention was actually introduced in [Effective Approaches to Attention-based Neural Machine Translation, Luong et al., 2015](https://arxiv.org/abs/1508.04025), but in the Transformer paper, the authors made a strong case that the basic dot product attention benefits from scaling.

This is implemented (again adapted from Baseline), as follows:

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(query, key, value, mask=None, dropout=None):
    """Scaled dot product attention, as defined in https://arxiv.org/abs/1706.03762

    We apply the query to the keys to receive our weights via softmax, which are then applied
    for each value, but in a series of efficient matrix operations.  In the case of self-attention,
    the key, query and values are all low order projections of the same input.

    :param query: a query for alignment. Can come from self in case of self-attn or decoder in case of E/D
    :param key: a set of keys from encoder or self
    :param value: a set of values from encoder or self
    :param mask: masking (for destination) to prevent seeing what we shouldn't
    :param dropout: apply dropout operator post-attention (this is not a float)
    :return: A tuple of the attended values (BxHxTxD) and the attention weights (BxHxTxT)

    """
    # (., H, T, T) = (., H, T, D) x (., H, D, T)
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        weights = dropout(weights)
    return torch.matmul(weights, value), weights
```
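
To get a feel for why the scaling helps, here is a small sketch comparing softmax weights computed from raw and scaled dot products. The random vectors, the choice of `d_k = 512`, and the seed are assumptions for illustration only.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
query = torch.randn(1, d_k)    # one query vector
keys = torch.randn(10, d_k)    # ten key vectors

raw = query @ keys.t()             # (1, 10) unscaled dot products
scaled = raw / math.sqrt(d_k)      # the same scores after scaling

raw_weights = F.softmax(raw, dim=-1)
scaled_weights = F.softmax(scaled, dim=-1)

print("max weight, unscaled:", raw_weights.max().item())
print("max weight, scaled:  ", scaled_weights.max().item())
```

With large `d_k` the raw dot products have a large spread, so the unscaled softmax tends to put nearly all of its weight on a single key. Dividing by `sqrt(d_k)` flattens the distribution, which keeps gradients through the softmax from vanishing.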


#### The Multi-head part

Each of the attention operations above that we apply is going to learn some weighted representation of our input -- what are we paying attention to?  There are lots of things that might be useful!  We might want to attend to the next word for language modeling.  To remember what we said, we might want to learn something like which pronouns refer to which nouns that we saw in previous tokens (this is called anaphora resolution and is a subset of coreference resolution).  We might hope that it picks up parse dependencies that could help us with tasks that benefit from syntax.  Remember that each of our `Q`, `K` and `V` are low-order projections of our input.  What if we had many low-order projections and used each to learn different weightings?  This is exactly what multi-head attention is.  Each "head" does the operation above and learns something meaningful (or at least, we hope it does!).

Here is some code that implements multi-headed attention using our function above:

```python
class MultiHeadedAttention(nn.Module):
    """
    Multi-headed attention from https://arxiv.org/abs/1706.03762 via http://nlp.seas.harvard.edu/2018/04/03/attention.html

    Multi-headed attention provides multiple looks of low-order projections K, Q and V using an attention function
    (specifically `scaled_dot_product_attention` in the paper).  This allows multiple relationships to be illuminated
    via attention on different positional and representational information from each head.

    The number of heads `h` times the low-order projection dim `d_k` is equal to `d_model` (which is asserted upfront).
    This means that each weight matrix can be simply represented as a linear transformation from `d_model` to `d_model`,
    and partitioned into heads after the fact.

    Finally, an output projection is applied which brings the output space back to `d_model`, in preparation for the
    subsequent `FFN` sub-layer.

    There are 3 uses of multi-head attention in the Transformer.
    For encoder-decoder layers, the queries come from the previous decoder layer, and the memory keys come from
    the encoder.  For encoder layers, the K, Q and V all come from the output of the previous layer of the encoder.
    And for self-attention in the decoder, K, Q and V all come from the decoder, but here it is masked to prevent using
    future values.
    """
    def __init__(self, h, d_model, dropout=0.1, scale=False):
        """Constructor for multi-headed attention

        :param h: The number of heads
        :param d_model: The model hidden size
        :param dropout (``float``): The amount of dropout to use
        :param scale (``bool``): Whether to use scaled dot-product attention (otherwise unscaled dot-product attention)
        """
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.w_Q = nn.Linear(d_model, d_model)
        self.w_K = nn.Linear(d_model, d_model)
        self.w_V = nn.Linear(d_model, d_model)
        self.w_O = nn.Linear(d_model, d_model)
        self.attn_fn = scaled_dot_product_attention if scale else dot_product_attention
        self.attn = None
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        """Low-order projections of query, key and value into multiple heads, then attention application and dropout

        :param query: a query for alignment. Can come from self in case of self-attn or decoder in case of E/D
        :param key: a set of keys from encoder or self
        :param value: a set of values from encoder or self
        :param mask: masking (for destination) to prevent seeing what we shouldn't
        :return: Multi-head attention output, result of attention application to sequence (B, T, d_model)
        """
        batchsz = query.size(0)

        # (B, H, T, D)
        query = self.w_Q(query).view(batchsz, -1, self.h, self.d_k).transpose(1, 2)
        key = self.w_K(key).view(batchsz, -1, self.h, self.d_k).transpose(1, 2)
        value = self.w_V(value).view(batchsz, -1, self.h, self.d_k).transpose(1, 2)

        x, self.attn = self.attn_fn(query, key, value, mask=mask, dropout=self.dropout)

        x = x.transpose(1, 2).contiguous() \
            .view(batchsz, -1, self.h * self.d_k)
        return self.w_O(x)
```
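
The `view` / `transpose` calls in `forward` are the step that most often confuses people, so here is a small shape trace of just the head split and merge. The sizes are arbitrary assumptions, and the linear projections are left out.

```python
import torch

B, T, h, d_k = 2, 5, 4, 8
d_model = h * d_k
x = torch.randn(B, T, d_model)

# Split: (B, T, d_model) -> (B, T, h, d_k) -> (B, h, T, d_k)
heads = x.view(B, -1, h, d_k).transpose(1, 2)
print(heads.shape)

# Merge: (B, h, T, d_k) -> (B, T, h, d_k) -> (B, T, d_model)
merged = heads.transpose(1, 2).contiguous().view(B, -1, h * d_k)
print(merged.shape, torch.equal(merged, x))
```

The split is only a reinterpretation of the last dimension, so merging recovers the original tensor exactly. The `.contiguous()` call is needed because `transpose` returns a non-contiguous view that `view` cannot reshape directly.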



We are going to take a look at how multi-headed attention works visually. To do this, we are going to use the [bertviz codebase](https://github.com/jessevig/bertviz) from Jesse Vig.  The accompanying paper is [A Multiscale Visualization of Attention in the Transformer Model, Vig, 2019](https://arxiv.org/pdf/1906.05714.pdf).



In [0]:
# import sys

# !test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo"
# # !rm -r bertviz_repo # Uncomment if you need a clean pull from repo
# !test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
# if not 'bertviz_repo' in sys.path:
#   sys.path += ['bertviz_repo']
# !pip install regex


In [0]:
# from bertviz import attention, visualization
# from bertviz.pytorch_pretrained_bert import BertModel as VizBertModel
# from bertviz.pytorch_pretrained_bert import BertTokenizer as VizBertTokenizer

In [0]:
# %%javascript
# require.config({
#   paths: {
#       d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min'
#   }
# });

def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

In [0]:
# model = VizBertModel.from_pretrained('bert-base-uncased')
# tokenizer = VizBertTokenizer.from_pretrained('bert-base-uncased')
# sentence_a = "The dog crossed the road ."
# sentence_b = "The owner came out and put him on a leash ."
# attention_visualizer = visualization.AttentionVisualizer(model, tokenizer)
# tokens_a, tokens_b, attn = attention_visualizer.get_viz_data(sentence_a, sentence_b)
# call_html()
# attention.show(tokens_a, tokens_b, attn)

Try playing around with `sentence_a` and `sentence_b`.  You can select and unselect different attention heads, as well as the layer that you are visualizing.  There is a lot going on here.  [This blog post](https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77) by Jesse Vig, the author of the software we are using to render the attention heads above, discusses how BERT attention heads learn various types of attention.  [Clark et al., 2019](https://arxiv.org/abs/1906.04341) also delve into what BERT learns, particularly in the context of our linguistic notions of syntax.

It turns out BERT learns a lot of stuff:


- **next/previous/identical word tracking**

- **stuff that correlates closely to linguistic notions of syntax**:  

  - BERT attention heads learn something like coreference
  - BERT attention heads learn some approximation of dependency parsing.  Different attention heads learn different dependency/governor relationships




#### Multi-Headed Attention is easy now in PyTorch!!

This operation is now built into PyTorch as `torch.nn.MultiheadAttention`.  One caveat is that only scaled dot-product attention is supported.  The code above does not use that module, since the code above supports both scaled and unscaled attention.
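
For reference, a minimal sketch of calling the built-in `torch.nn.MultiheadAttention` for self-attention. The sizes are arbitrary assumptions, and the module's default sequence-first input layout is used.

```python
import torch
import torch.nn as nn

T, B, d_model, num_heads = 5, 2, 32, 4
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, dropout=0.0)

# By default this module expects sequence-first inputs of shape (T, B, d_model).
x = torch.randn(T, B, d_model)
output, attn_weights = mha(x, x, x)  # self-attention: query = key = value

print(output.shape)        # (T, B, d_model)
print(attn_weights.shape)  # (B, T, T), averaged over heads by default
```

Note the layout difference from the hand-written class above, which works on batch-first `(B, T, d_model)` tensors.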





### Positional embeddings

Because the Transformer drops recurrence (RNNs) entirely, positional embeddings need to be created and added to the word embeddings.  Otherwise, attention would have no way to account for word position. There are several ways to support positional embeddings.

The first way is very simple -- you just need to create an `nn.Embedding` that you index with the position offset of each token.  An embedding will be learned for each position, but only up to the maximum number of positions seen during training.
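
A minimal sketch of this learned variant, assuming arbitrary vocabulary, position and model sizes:

```python
import torch
import torch.nn as nn

vocab_size, max_positions, d_model = 1000, 128, 16
word_embed = nn.Embedding(vocab_size, d_model)
pos_embed = nn.Embedding(max_positions, d_model)  # one learned vector per position

token_ids = torch.randint(0, vocab_size, (2, 10))         # (B, T)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # (1, T): 0, 1, ..., T-1

# Broadcasting adds the same position vectors to every sequence in the batch.
x = word_embed(token_ids) + pos_embed(positions)          # (B, T, d_model)
print(x.shape)
```

Any sequence longer than `max_positions` would index past the end of `pos_embed`, which is the limitation described above.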

Another way, used in the original Transformer is to embed a bunch of sinusoids with different frequencies that are a function of the position:

$$PE_{(pos,2i)}=\sin\left(pos/10000^{2i/d_{model}}\right)$$
$$PE_{(pos,2i+1)}=\cos\left(pos/10000^{2i/d_{model}}\right)$$

where $pos$ is the position and $i$ is the dimension corresponding to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000\times2\pi$.
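
The sinusoidal encoding can be sketched in a few lines. `sinusoidal_positions` is a hypothetical helper name, and the sketch assumes an even `d_model`.

```python
import math

import torch


def sinusoidal_positions(max_len, d_model):
    """Return a (max_len, d_model) tensor of fixed sinusoidal position encodings."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i / d_model) for every even dimension index 2i
    inv_freq = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(pos * inv_freq)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos * inv_freq)  # odd dimensions use cosine
    return pe


pe = sinusoidal_positions(max_len=50, d_model=16)
print(pe.shape)
print(pe[0, :4])
```

Unlike the learned variant, this table has no trainable parameters and can be generated for any sequence length.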



## BERT



For this section of the tutorial, we are going to fine-tune BERT ([Devlin et al., 2018](https://arxiv.org/abs/1810.04805)), a transformer architecture that replaces the causal LM objective with 2 new objectives:

1. **Mask out words** with some probability, then predict the missing words (MLM objective)

![MLM](https://2.bp.blogspot.com/-pNxcHHXNZg0/W9iv3evVyOI/AAAAAAAADfA/KTSvKXNzzL0W8ry28PPl7nYI1CG_5WuvwCLcBGAs/s1600/f1.png)

2. Given 2 sentences, **predict if the second sentence follows the first** (NSP objective)

![NSP](https://4.bp.blogspot.com/-K_7yu3kjF18/W9iv-R-MnyI/AAAAAAAADfE/xUwR_G1iTY0vq9X-Z3LnW5t4NLS9BQzdgCLcBGAs/s1600/f2.png)
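
As a rough sketch of the MLM objective, the function below applies the masking recipe described in the BERT paper: about 15% of positions are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. `mask_tokens` and the toy vocabulary are hypothetical names made up for illustration, not part of any library.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            masked.append(tok)
            labels.append(None)  # not selected: no prediction target
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(tok)                # 10%: keep the token unchanged
    return masked, labels


tokens = ["[CLS]", "the", "dog", "crossed", "the", "road", "[SEP]"]
vocab = ["the", "dog", "cat", "road", "crossed", "owner"]
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, so the model cannot simply copy its input everywhere.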

[This blog post announcing BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) uses an architecture diagram to show how BERT differs from GPT and ELMo:

![BERT vs GPT and ELMo](https://1.bp.blogspot.com/-RLAbr6kPNUo/W9is5FwUXmI/AAAAAAAADeU/5y9466Zoyoc96vqLjbruLK8i_t8qEdHnQCLcBGAs/s1600/image3.png)

Our model will simply build on the existing model architecture with a single transformation layer to the output number of classes.  BERT is [open source](https://github.com/google-research/bert) but the code is in TensorFlow, and since this tutorial is written in PyTorch, we need a different solution.  We will use the [Hugging Face Transformer codebase](https://github.com/huggingface/pytorch-pretrained-BERT) as our API -- it can read in the original Google-trained weights.

In [0]:
# !pip install pytorch-pretrained-bert


In [0]:
# import numpy as np
# import torch
# import torch.nn as nn
# import torch.nn.functional as F
# import io
# import os
# import re
# import codecs
# from collections import Counter
# from torch.utils.data import DataLoader, TensorDataset
# from pytorch_pretrained_bert.tokenization import BertTokenizer
# from pytorch_pretrained_bert.modeling import BertModel


### Tokenization in BERT

In the last section, we talked about how ELMo biLMs can limit their parameters while accounting for unseen words using character-compositional word embeddings.  This technique is very powerful, but it's also slow.  It is common in NMT to use some sort of sub-word encoding that limits the vocabulary size, but allows us to not have unattested words.  The `tensor2tensor` codebase, for example, creates an invertible encoding for words into sub-tokens with a limited vocabulary.  The tokenizer is built from a corpus upfront and stored in a file, and then can be used to encode text.

There are 4 phases in this algorithm described in the tensor2tensor codebase:


    1. Tokenize into a list of tokens.  Each token is a unicode string of either
      all alphanumeric characters or all non-alphanumeric characters.  We drop
      tokens consisting of a single space that are between two alphanumeric
      tokens.
    2. Escape each token.  This escapes away special and out-of-vocabulary
      characters, and makes sure that each token ends with an underscore, and
      has no other underscores.
    3. Represent each escaped token as the concatenation of a list of subtokens
      from the limited vocabulary.  Subtoken selection is done greedily from
      beginning to end.  That is, we construct the list in order, always picking
      the longest subtoken in our vocabulary that matches a prefix of the
      remaining portion of the encoded token.
    4. Concatenate these lists.  This concatenation is invertible due to the
      fact that the trailing underscores indicate when one list is finished.
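
Step 3 above, the greedy longest-prefix match, can be sketched as follows. This is a simplified illustration rather than the actual tensor2tensor implementation; `greedy_subtokens` and the toy vocabulary are hypothetical.

```python
def greedy_subtokens(escaped_token, vocab):
    """Split one escaped token into subtokens by greedy longest-prefix matching."""
    pieces = []
    start = 0
    while start < len(escaped_token):
        # Try the longest remaining prefix first, then shrink it one character at a time.
        for end in range(len(escaped_token), start, -1):
            piece = escaped_token[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            raise ValueError(f"no subtoken matches at position {start}")
    return pieces


# Toy vocabulary (an assumption for illustration); single characters guarantee coverage.
vocab = {"un", "believ", "able_", "a", "b", "e", "i", "l", "n", "u", "v", "_"}
pieces = greedy_subtokens("unbelievable_", vocab)
print(pieces)
print("".join(pieces) == "unbelievable_")
```

Because every subtoken is a contiguous slice of the escaped token, concatenating the pieces always restores the input, which is what makes the encoding invertible.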



We can access Google's trained BERT tokenizer via the Hugging Face API.

### Bert tokenizer / vectorizer

Our model this time around is very simple.  It is a single output linear layer applied to the pooled output from BERT.

In [0]:
# def whitespace_tokenizer(words):
#     return words.split() 

# def sst2_tokenizer(words):
#     REPLACE = { "'s": " 's ",
#                 "'ve": " 've ",
#                 "n't": " n't ",
#                 "'re": " 're ",
#                 "'d": " 'd ",
#                 "'ll": " 'll ",
#                 ",": " , ",
#                 "!": " ! ",
#                 }
#     words = words.lower()
#     words = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", words)
#     for k, v in REPLACE.items():
#             words = words.replace(k, v)
#     return [w.strip() for w in words.split()]

# BERT_TOKENIZER = BertTokenizer.from_pretrained('bert-base-uncased')
# BERT_MODEL = BertModel.from_pretrained('bert-base-uncased')
# def bert_tokenizer(words, pretokenizer=whitespace_tokenizer):
#     subwords = ['[CLS]']
#     for word in pretokenizer(words):
#         if word == '<unk>':
#             subword = '[UNK]'
#         else:
#             subword = BERT_TOKENIZER.tokenize(word)
#         subwords += subword
#     return subwords + ['[SEP]']

# def bert_vectorizer(sentence):
#     return BERT_TOKENIZER.convert_tokens_to_ids(sentence)
#     #return [BERT_TOKENIZER.vocab.get(subword, BERT_TOKENIZER.vocab['[PAD]']) for subword in sentence]



In [0]:
# words = "National Taiwan University"
# print("words:")
# print(words)
# print()
# print("BERT_TOKENIZER.tokenize(words):")
# print(BERT_TOKENIZER.tokenize(words))
# print()
# print("bert_tokenizer(words):")
# tokens = bert_tokenizer(words)
# print(tokens)

In [0]:
# print(tokens)
# ids = BERT_TOKENIZER.convert_tokens_to_ids(tokens)
# print(ids)
# print(BERT_TOKENIZER.convert_ids_to_tokens(ids))

### FineTuneClassifier

In [0]:

# class FineTuneClassifier(nn.Module):

#     def __init__(self, base_model, num_classes, embed_dim, hidden_units=[]):
#         super().__init__()
#         self.base_model = base_model
#         input_units = embed_dim
#         output_units = embed_dim
#         sequence = []
#         for h in hidden_units:
#             sequence.append(nn.Linear(input_units, h))
#             input_units = h
#             output_units = h
            
#         sequence.append(nn.Linear(output_units, num_classes))
#         self.outputs = nn.Sequential(*sequence)

#     def forward(self, inputs):
#         x, lengths = inputs
        
#         input_mask = torch.zeros(x.shape, device=x.device, dtype=torch.long).masked_fill(x != 0, 1)
#         input_type_ids = torch.zeros(x.shape, device=x.device, dtype=torch.long)
#         _, pooled = self.base_model(x, token_type_ids=input_type_ids, attention_mask=input_mask)
        
#         stacked = self.outputs(pooled)
#         return F.log_softmax(stacked, dim=-1)
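
The `forward` method builds two helper inputs and returns log-probabilities. The sketch below reproduces both steps in plain Python so the shapes are easy to inspect. It assumes that id 0 is `[PAD]` (true for the `bert-base-uncased` vocabulary, where 101 and 102 are `[CLS]` and `[SEP]`); the other ids in `batch` are made up, and `build_bert_inputs` and `log_softmax` are hypothetical helper names.

```python
import math

PAD_ID = 0  # assumption: id 0 is [PAD], as in the bert-base-uncased vocabulary


def build_bert_inputs(batch):
    """Return (attention_mask, token_type_ids) for a batch of padded id lists."""
    # 1 for real tokens, 0 for padding, mirroring masked_fill(x != 0, 1).
    attention_mask = [[1 if tok != PAD_ID else 0 for tok in row] for row in batch]
    # Single-sentence classification: every position belongs to segment 0.
    token_type_ids = [[0] * len(row) for row in batch]
    return attention_mask, token_type_ids


def log_softmax(logits):
    """Numerically stable log-softmax over one list of class scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


batch = [
    [101, 2001, 2002, 2003, 102],  # full-length sequence (inner ids are illustrative)
    [101, 2004, 102, 0, 0],        # shorter sequence padded with [PAD]
]
mask, segments = build_bert_inputs(batch)
print(mask)      # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
print(segments)  # [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

log_probs = log_softmax([2.0, 1.0, 0.1])
print(sum(math.exp(v) for v in log_probs))  # the probabilities sum to about 1.0
```

Because the classifier outputs log-probabilities, it pairs with `NLLLoss` during training; together they compute the usual cross-entropy.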

The rest of our code comes from the previous sections.

In [0]:
# import torch
# import torch.nn as nn
# import torch.nn.functional as F
# from typing import List, Tuple
# import os
# import io
# import re
# import codecs
# import numpy as np
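# import csv  # needed by ConfusionMatrix.save
# from collections import Counter, OrderedDict  # OrderedDict is needed by ConfusionMatrix.save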
# from torch.utils.data import DataLoader, TensorDataset


### ConfusionMatrix

In [0]:

# class ConfusionMatrix:
#     """Confusion matrix with metrics

#     This class accumulates classification output, and tracks it in a confusion matrix.
#     Metrics are available that use the confusion matrix
#     """
#     def __init__(self, labels):
#         """Constructor with input labels

#         :param labels: Either a dictionary (`k=int,v=str`) or an array of labels
#         """
#         if type(labels) is dict:
#             self.labels = []
#             for i in range(len(labels)):
#                 self.labels.append(labels[i])
#         else:
#             self.labels = labels
#         nc = len(self.labels)
#         self._cm = np.zeros((nc, nc), dtype=int)  # np.int was removed in NumPy 1.24

#     def add(self, truth, guess):
#         """Add a single value to the confusion matrix based off `truth` and `guess`

#         :param truth: The real `y` value (or ground truth label)
#         :param guess: The guess for `y` value (or assertion)
#         """

#         self._cm[truth, guess] += 1

#     def __str__(self):
#         values = []
#         width = max(8, max(len(x) for x in self.labels) + 1)
#         for i, label in enumerate([''] + self.labels):
#             values += ["{:>{width}}".format(label, width=width+1)]
#         values += ['\n']
#         for i, label in enumerate(self.labels):
#             values += ["{:>{width}}".format(label, width=width+1)]
#             for j in range(len(self.labels)):
#                 values += ["{:{width}d}".format(self._cm[i, j], width=width + 1)]
#             values += ['\n']
#         values += ['\n']
#         return ''.join(values)

#     def save(self, outfile):
#         ordered_fieldnames = OrderedDict([("labels", None)] + [(l, None) for l in self.labels])
#         with open(outfile, 'w') as f:
#             dw = csv.DictWriter(f, delimiter=',', fieldnames=ordered_fieldnames)
#             dw.writeheader()
#             for index, row in enumerate(self._cm):
#                 row_dict = {l: row[i] for i, l in enumerate(self.labels)}
#                 row_dict.update({"labels": self.labels[index]})
#                 dw.writerow(row_dict)

#     def reset(self):
#         """Reset the matrix
#         """
#         self._cm *= 0

#     def get_correct(self):
#         """Get the diagonals of the confusion matrix

#         :return: (``int``) Number of correct classifications
#         """
#         return self._cm.diagonal().sum()

#     def get_total(self):
#         """Get total classifications

#         :return: (``int``) total classifications
#         """
#         return self._cm.sum()

#     def get_acc(self):
#         """Get the accuracy

#         :return: (``float``) accuracy
#         """
#         return float(self.get_correct())/self.get_total()

#     def get_recall(self):
#         """Get the recall

#         :return: (``float``) recall
#         """
#         total = np.sum(self._cm, axis=1)
#         total = (total == 0) + total
#         return np.diag(self._cm) / total.astype(float)

#     def get_support(self):
#         return np.sum(self._cm, axis=1)

#     def get_precision(self):
#         """Get the precision
#         :return: (``float``) precision
#         """

#         total = np.sum(self._cm, axis=0)
#         total = (total == 0) + total
#         return np.diag(self._cm) / total.astype(float)

#     def get_mean_precision(self):
#         """Get the mean precision across labels

#         :return: (``float``) mean precision
#         """
#         return np.mean(self.get_precision())

#     def get_weighted_precision(self):
#         return np.sum(self.get_precision() * self.get_support())/float(self.get_total())

#     def get_mean_recall(self):
#         """Get the mean recall across labels

#         :return: (``float``) mean recall
#         """
#         return np.mean(self.get_recall())

#     def get_weighted_recall(self):
#         return np.sum(self.get_recall() * self.get_support())/float(self.get_total())

#     def get_weighted_f(self, beta=1):
#         return np.sum(self.get_class_f(beta) * self.get_support())/float(self.get_total())

#     def get_macro_f(self, beta=1):
#         """Get the macro F_b, with adjustable beta (defaulting to F1)

#         :param beta: (``float``) defaults to 1 (F1)
#         :return: (``float``) macro F_b
#         """
#         if beta < 0:
#             raise Exception('Beta must be non-negative')
#         return np.mean(self.get_class_f(beta))

#     def get_class_f(self, beta=1):
#         p = self.get_precision()
#         r = self.get_recall()

#         b = beta*beta
#         d = (b * p + r)
#         d = (d == 0) + d

#         return (b + 1) * p * r / d

#     def get_f(self, beta=1):
#         """Get 2 class F_b, with adjustable beta (defaulting to F1)

#         :param beta: (``float``) defaults to 1 (F1)
#         :return: (``float``) 2-class F_b
#         """
#         p = self.get_precision()[1]
#         r = self.get_recall()[1]
#         if beta < 0:
#             raise Exception('Beta must be non-negative')
#         d = (beta*beta * p + r)
#         if d == 0:
#             return 0
#         return (beta*beta + 1) * p * r / d

#     def get_all_metrics(self):
#         """Make a map of metrics suitable for reporting, keyed by metric name

#         :return: (``dict``) Map of metrics keyed by metric names
#         """
#         metrics = {'acc': self.get_acc()}
#         # If 2 class, assume second class is positive AKA 1
#         if len(self.labels) == 2:
#             metrics['precision'] = self.get_precision()[1]
#             metrics['recall'] = self.get_recall()[1]
#             metrics['f1'] = self.get_f(1)
#         else:
#             metrics['mean_precision'] = self.get_mean_precision()
#             metrics['mean_recall'] = self.get_mean_recall()
#             metrics['macro_f1'] = self.get_macro_f(1)
#             metrics['weighted_precision'] = self.get_weighted_precision()
#             metrics['weighted_recall'] = self.get_weighted_recall()
#             metrics['weighted_f1'] = self.get_weighted_f(1)
#         return metrics

#     def add_batch(self, truth, guess):
#         """Add a batch of data to the confusion matrix

#         :param truth: The truth tensor
#         :param guess: The guess tensor
#         :return:
#         """
#         for truth_i, guess_i in zip(truth, guess):
#             self.add(truth_i, guess_i)
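
A small worked example makes the metric definitions concrete. The sketch below recomputes precision, recall and F1 from a 2x2 matrix using plain lists, with the same convention as the class above (rows are the truth, columns are the guess). `per_class_metrics` is a hypothetical helper and the counts are made up.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        row_total = sum(cm[k])                        # examples whose truth is k
        col_total = sum(cm[i][k] for i in range(n))   # examples guessed as k
        precision = tp / col_total if col_total else 0.0
        recall = tp / row_total if row_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Rows: truth, columns: guess (made-up counts).
cm = [[8, 2],
      [1, 9]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
print(accuracy)  # 0.85

for label, (p, r, f1) in enumerate(per_class_metrics(cm)):
    print(label, round(p, 3), round(r, 3), round(f1, 3))
# 0 0.889 0.8 0.842
# 1 0.818 0.9 0.857
```

For class 1, nine of the eleven guesses are correct (precision 9/11) and nine of the ten true examples are found (recall 9/10).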


### Trainer

In [0]:

# class Trainer:
#     def __init__(self, optimizer: torch.optim.Optimizer):
#         self.optimizer = optimizer

#     def run(self, model, labels, train, loss, batch_size): 
#         model.train()       
#         train_loader = DataLoader(train, batch_size=batch_size, shuffle=True)

#         cm = ConfusionMatrix(labels)

#         for batch in train_loader:
#             loss_value, y_pred, y_actual = self.update(model, loss, batch)
#             _, best = y_pred.max(1)
#             yt = y_actual.cpu().int().numpy()
#             yp = best.cpu().int().numpy()
#             cm.add_batch(yt, yp)

#         print(cm.get_all_metrics())
#         return cm
    
#     def update(self, model, loss, batch):
#         self.optimizer.zero_grad()
#         x, lengths, y = batch
#         lengths, perm_idx = lengths.sort(0, descending=True)
#         x_sorted = x[perm_idx]
#         y_sorted = y[perm_idx]
#         y_sorted = y_sorted.to('cuda:0')
#         inputs = (x_sorted.to('cuda:0'), lengths)
#         y_pred = model(inputs)
#         loss_value = loss(y_pred, y_sorted)
#         loss_value.backward()
#         self.optimizer.step()
#         return loss_value.item(), y_pred, y_sorted
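
`update` sorts each batch by length and reorders the inputs and labels with the same permutation. The plain-Python sketch below shows why the pairs stay aligned. `sort_batch_by_length` is a hypothetical helper and the data is made up. The sort is probably a carry-over from the recurrent models in earlier sections (an assumption on my part); the BERT classifier above only relies on the padding mask.

```python
def sort_batch_by_length(xs, lengths, ys):
    """Reorder a batch so that the longest sequence comes first."""
    # perm plays the role of perm_idx returned by lengths.sort(0, descending=True).
    perm = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    xs_sorted = [xs[i] for i in perm]
    lengths_sorted = [lengths[i] for i in perm]
    ys_sorted = [ys[i] for i in perm]
    return xs_sorted, lengths_sorted, ys_sorted


xs = [[5, 6, 0], [7, 8, 9], [4, 0, 0]]  # padded token ids (made up)
lengths = [2, 3, 1]                     # number of real tokens per row
ys = [0, 1, 2]                          # labels

xs_sorted, lengths_sorted, ys_sorted = sort_batch_by_length(xs, lengths, ys)
print(xs_sorted)       # [[7, 8, 9], [5, 6, 0], [4, 0, 0]]
print(lengths_sorted)  # [3, 2, 1]
print(ys_sorted)       # [1, 0, 2]
```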


### Evaluator

In [0]:

# class Evaluator:
#     def __init__(self):
#         pass

#     def run(self, model, labels, dataset, batch_size=1):
#         model.eval()
#         valid_loader = DataLoader(dataset, batch_size=batch_size)
#         cm = ConfusionMatrix(labels)
#         for batch in valid_loader:
#             y_pred, y_actual = self.inference(model, batch)
#             _, best = y_pred.max(1)
#             yt = y_actual.cpu().int().numpy()
#             yp = best.cpu().int().numpy()
#             cm.add_batch(yt, yp)
#         return cm

#     def inference(self, model, batch):
#         with torch.no_grad():
#             x, lengths, y = batch
#             lengths, perm_idx = lengths.sort(0, descending=True)
#             x_sorted = x[perm_idx]
#             y_sorted = y[perm_idx]
#             y_sorted = y_sorted.to('cuda:0')
#             inputs = (x_sorted.to('cuda:0'), lengths)
#             y_pred = model(inputs)
#             return y_pred, y_sorted


### fit()

In [0]:

# def fit(model, labels, optimizer, loss, epochs, batch_size, train, valid, test):

#     trainer = Trainer(optimizer)
#     evaluator = Evaluator()
#     best_acc = 0.0
    
#     for epoch in range(epochs):
#         print('EPOCH {}'.format(epoch + 1))
#         print('=================================')
#         print('Training Results')
#         cm = trainer.run(model, labels, train, loss, batch_size)
#         print('Validation Results')
#         cm = evaluator.run(model, labels, valid)
#         print(cm.get_all_metrics())
#         if cm.get_acc() > best_acc:
#             print('New best model {:.2f}'.format(cm.get_acc()))
#             best_acc = cm.get_acc()
#             torch.save(model.state_dict(), './checkpoint.pth')
#     if test:
#         model.load_state_dict(torch.load('./checkpoint.pth'))
#         cm = evaluator.run(model, labels, test)
#         print('Final result')
#         print(cm.get_all_metrics())
#     return cm.get_acc()

### Train

In [0]:
# BASE = 'trec'
# TRAIN = os.path.join(BASE, 'trec.nodev.utf8')
# VALID = os.path.join(BASE, 'trec.dev.utf8')
# TEST = os.path.join(BASE, 'trec.test.utf8')

# # lowercase=False: leave casing to BERT's tokenizer (the bert-base-uncased tokenizer lowercases its input)
# r = Reader((TRAIN, VALID, TEST,), lowercase=False, vectorizer=bert_vectorizer, tokenizer=bert_tokenizer)
# train = r.load(TRAIN)
# valid = r.load(VALID)
# test = r.load(TEST)

In [0]:
# bert_small_dims = 768  # hidden size of bert-base-uncased, i.e. the width of the pooled output
# batch_size = 50
# epochs = 12
# model = FineTuneClassifier(BERT_MODEL, len(r.labels), bert_small_dims)
# num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# print(f"Model has {num_params} parameters") 


# model.to('cuda:0')
# loss = torch.nn.NLLLoss()
# loss = loss.to('cuda:0')

# learnable_params = [p for p in model.parameters() if p.requires_grad]
# optimizer = torch.optim.Adam(learnable_params, lr=1.0e-4)

# fit(model, r.labels, optimizer, loss, epochs, batch_size, train, valid, test)

In [0]:
# train.tensors