<a href="https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W05-transformer-and-document-similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is written by [Haowen Jiang](https://howard-haowen.rohan.tw/), and is meant for the 2022 [NLP Workshop at NSYSU](https://howard-haowen.rohan.tw/NLP-demos/nsysu_workshop).

In [1]:
from datetime import date

today = date.today()
print("Last updated:", today)

Last updated: 2022-05-12


# Transformer Embeddings 

![](https://i.pinimg.com/originals/52/cd/a2/52cda28bf15c418805d76a6c309ba6d3.jpg)

> A transformer is a deep learning model that adopts the mechanism of self-attention ([Wikipedia](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model))). 

- Self-attention

![](https://jalammar.github.io/images/t/transformer_self-attention_visualization_2.png)
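To make the idea of self-attention a little more concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is only an illustration: the toy sizes, the random weight matrices, and the helper names `softmax` and `self_attention` are assumptions made for this example, not part of any model used in this notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8         # toy sizes, for illustration only
x = rng.normal(size=(seq_len, d_model))    # stand-in for 4 token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)         # (4, 8)
print(weights.sum(axis=1))  # every row of attention weights sums to 1
```

Each row of `weights` says how much one token attends to every token in the sequence, itself included, which is what the visualization above depicts.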

Many state-of-the-art NLP models are built on top of a transformer, such as
- USE (Universal Sentence Encoder)
- [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) (Bidirectional Encoder Representations from Transformers)

Embeddings from these pretrained models are commonly used for text similarity tasks.

![](https://miro.medium.com/max/1400/1*hPxezDTuv308MxlX03eYtg.png)

BERT has its own family, and there's even a name for the study of BERT, namely, Bertology.

![](https://miro.medium.com/max/896/1*IdLJIaaandrB_aR_2ZCnlg.jpeg)

In this tutorial, we'll embed documents using USE (no pun intended 😼!), search for similar texts, and build a clustering model.

## Dataset

For the purpose of this tutorial, we'll work with a tiny corpus of online posts crawled from Dcard. If you want more up-to-date data, feel free to follow the sample code in [this post](https://howard-haowen.rohan.tw/blog.ai/cloudscraper/schedule/sqlite3/logging/2021/09/12/Scraping-Dcard-with-cloudscraper.html) of mine.

In [2]:
!wget -O Dcard.db https://github.com/howard-haowen/NLP-demos/raw/main/Dcard_20220304.db

--2022-05-12 03:16:01--  https://github.com/howard-haowen/NLP-demos/raw/main/Dcard_20220304.db
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/howard-haowen/NLP-demos/main/Dcard_20220304.db [following]
--2022-05-12 03:16:02--  https://raw.githubusercontent.com/howard-haowen/NLP-demos/main/Dcard_20220304.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 151552 (148K) [application/octet-stream]
Saving to: ‘Dcard.db’


2022-05-12 03:16:02 (26.2 MB/s) - ‘Dcard.db’ saved [151552/151552]



In [3]:
import sqlite3
import pandas as pd

In [4]:
conn = sqlite3.connect("Dcard.db")  
df = pd.read_sql("SELECT * FROM Posts;", conn)
df

Unnamed: 0,createdAt,title,excerpt,categories,topics,forum_en,forum_zh
0,2022-03-04T07:54:19.886Z,專題需要數據🥺🥺幫填～,希望各位能花個20秒幫我填一下,,,dressup,穿搭
1,2022-03-04T07:42:59.512Z,#詢問 找衣服🥲,想找這套衣服🥲，但發現不知道該用什麼關鍵字找，（圖是草屯囝仔的校園演唱會截圖）,詢問,衣服 | 鞋子 | 衣物 | 男生穿搭 | 尋找,dressup,穿搭
2,2022-03-04T07:24:25.147Z,#黑特 網購50% FIFTY PERCENT請三思,因為文會有點長，先說結論是，50%是目前網購過的平台退貨最麻煩的一家，甚至我認為根本是刻意刁...,,黑特 | 網購 | 三思 | 退貨 | 售後服務,dressup,穿搭
3,2022-03-04T06:39:13.017Z,尋衣服,來源：覺得呱吉這襯衫好好看~~，或有人知道有類似的嗎,,衣服 | 尋找 | 日常穿搭 | 男生穿搭,dressup,穿搭
4,2022-03-04T06:28:06.137Z,#詢問 想問,各位，因為這個證件夾臺灣買不到，是美國outlet 的限量版貨，所以在以下的這間蝦皮上買，但...,詢問,穿搭 | 閒聊版 | 閒聊排解 | 假貨,dressup,穿搭
...,...,...,...,...,...,...,...
355,2022-03-03T03:41:10.972Z,開了新頻道,昨天上了第一支影片，之前有發過沒有線條的動畫影片，新的頻道改成有線條的，感覺大家好像比較喜歡...,,Youtuber | 頻道 | 有趣 | 日常 | 搞笑,youtuber,YouTuber
356,2022-03-03T02:26:58.821Z,估計某個YTUBER又有陰謀論可以寫了,今天全台灣大停電，應該過幾天就會有個戴面具的出來說，一定是中共……，我從上個影片就預測了……,,陰謀論 | Youtuber,youtuber,YouTuber
357,2022-03-02T21:25:51.080Z,#問 阿神和放火發生過什麼嗎？,想問有沒有人知道阿神和放火是認識還是有結過什麼仇之類的嗎？首先我個人基本沒關注過放火，但是最...,,Youtuber | 放火 | 阿神,youtuber,YouTuber
358,2022-03-02T20:33:47.713Z,#文長 我眼中的Rice&Shine,無意引戰，單純分享我的觀察與個人想法～這幾天看了Dcard幾篇關於Rice& Shine的貼...,,Riceandshine | Youtuber | 生活 | Vlog | youtuber板,youtuber,YouTuber


## Universal Sentence Encoder

Google's [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175) transforms every sentence into a 512-dimensional embedding and is available from TensorFlow Hub. There's an English version and a multilingual version, which **supports 16 languages**. We'll be using the multilingual version to embed Chinese.

![](https://amitness.com/images/use-overall-pipeline.png)

In [5]:
# Quote the requirement so that the shell doesn't treat `>=` as an output redirection
!pip3 install "tensorflow_text>=2.0.0rc0"

In [6]:
import tensorflow_hub as hub
import numpy as np
import tensorflow_text

embed_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

Here's some sample code taken from TensorFlow Hub.

In [7]:
# Some texts of different lengths.
english_sentences = ["dog", "Puppies are nice.", "I enjoy taking long walks along the beach with my dog."]
chinese_sentences = ["狗", "小狗很好", "我喜歡和我的狗一起沿著海灘散步"]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
en_result = embed_model(english_sentences)
zh_result = embed_model(chinese_sentences)
ja_result = embed_model(japanese_sentences)

similarity_matrix_zh = np.inner(en_result, zh_result)
similarity_matrix_ja = np.inner(en_result, ja_result)
print(f"En-Zh Sim: {similarity_matrix_zh}")
print(f"En-Ja Sim: {similarity_matrix_ja}")

En-Zh Sim: [[0.92495644 0.5396135  0.2973976 ]
 [0.44881964 0.66703236 0.33681804]
 [0.25609976 0.30248648 0.56330156]]
En-Ja Sim: [[0.9171356  0.51152694 0.3158718 ]
 [0.44313586 0.6586347  0.3092131 ]
 [0.26650536 0.25377443 0.7672991 ]]
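`np.inner` returns raw dot products, which equal cosine similarities only when the vectors have unit length. The sketch below normalizes the rows explicitly before taking the inner product. The helper name `cosine_similarity_matrix` and the toy vectors are made up for illustration; they are not real USE outputs.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype="float32")
    b = np.asarray(b, dtype="float32")
    # Scale every row to unit length so the inner product becomes a cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.inner(a_unit, b_unit)

# Toy 2-dimensional vectors, chosen only to make the result easy to check.
toy_en = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_zh = np.array([[2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy_en, toy_zh)
print(sim)
```

With the notebook's variables, `cosine_similarity_matrix(en_result, zh_result)` should give values close to the matrix printed above if the embeddings are already near unit length, which is an assumption worth checking with `np.linalg.norm`.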


In [8]:
docid = 5
texts = df['title'] + ' ' + df['excerpt']
texts[docid]

'#請益 請問這雙怎麼入手 蝦皮上有一個賣3500，我下單後他跟我說斷貨了，叫我取消訂單，想請問有什麼方式能入手這雙鞋'

In [9]:
embeddings = embed_model(texts)
embeddings.shape

TensorShape([360, 512])

## Similarity

It's cumbersome to do pairwise calculations of text similarity, so we'll use the `faiss` library (short for Facebook AI Similarity Search) to create an embedding index, which makes it much faster to search for similar vectors. A similar library is `annoy`.
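For comparison, here is a rough sketch of what a hand-written exhaustive search looks like in plain NumPy. The function name `brute_force_search` and the random toy vectors are assumptions made for this example; the point is only to show the computation that an index takes care of for us.

```python
import numpy as np

def brute_force_search(vectors, query, k=3):
    """Return (squared L2 distances, row indices) of the k rows nearest to query."""
    diffs = vectors - query
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)  # row-wise squared norms
    nearest = np.argsort(sq_dists)[:k]
    return sq_dists[nearest], nearest

rng = np.random.default_rng(42)
toy_vectors = rng.normal(size=(1000, 16)).astype("float32")  # stand-in embeddings

# Query with row 5 itself, so the closest hit should be row 5 at distance 0.
dists, ids = brute_force_search(toy_vectors, toy_vectors[5], k=3)
print(ids[0], dists[0])
```

This works for a few hundred documents, but every query scans all the vectors, and the bookkeeping of IDs and distances is left to you.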

In [10]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
     |████████████████████████████████| 8.6 MB 15.6 MB/s 
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.2


In [11]:
import faiss

To create the index, we'll first convert the inputs to NumPy arrays (`np.ndarray`).

In [12]:
type(embeddings)

tensorflow.python.framework.ops.EagerTensor

In [13]:
embed_arrays = np.array(embeddings)
type(embed_arrays)

numpy.ndarray

In [14]:
index_arrays = df.index.values
type(index_arrays)

numpy.ndarray

The following snippet is taken from [this post](https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8).

In [15]:
def create_index_embeddings(doc_vectors: np.ndarray,
                            index_series: np.ndarray):
    # Step 1: Cast the vectors to float32, the dtype faiss expects
    embeddings = np.asarray(doc_vectors, dtype="float32")

    # Step 2: Instantiate a flat (exact) index that uses L2 distance
    index = faiss.IndexFlatL2(embeddings.shape[1])

    # Step 3: Wrap the index in IndexIDMap so that vectors can carry custom IDs
    index = faiss.IndexIDMap(index)

    # Step 4: Add the vectors together with their IDs (faiss expects int64 IDs)
    index.add_with_ids(embeddings, index_series.astype("int64"))

    return index, embeddings

In [16]:
index, faiss_embeddings = create_index_embeddings(embed_arrays, index_arrays)

In [17]:
docid = 5
test_text = df.loc[docid, 'excerpt']
test_text

'蝦皮上有一個賣3500，我下單後他跟我說斷貨了，叫我取消訂單，想請問有什麼方式能入手這雙鞋'

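Let's retrieve the 10 nearest neighbours of the document with the ID 5. Note that `IndexFlatL2` returns squared L2 distances, so the values in `D` are squared. The first hit is the query document itself, at distance 0.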


In [18]:
docid = 5
D, I = index.search(np.array([faiss_embeddings[docid]]), k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nDoc IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 0.8845899701118469, 0.9940855503082275, 0.9995702505111694, 1.00590181350708, 1.0115234851837158, 1.1354154348373413, 1.1602306365966797, 1.184503436088562, 1.1936644315719604]

Doc IDs: [5, 270, 23, 4, 294, 16, 13, 25, 299, 1]


In [19]:
D

array([[0.        , 0.88458997, 0.99408555, 0.99957025, 1.0059018 ,
        1.0115235 , 1.1354154 , 1.1602306 , 1.1845034 , 1.1936644 ]],
      dtype=float32)

In [20]:
D.flatten()

array([0.        , 0.88458997, 0.99408555, 0.99957025, 1.0059018 ,
       1.0115235 , 1.1354154 , 1.1602306 , 1.1845034 , 1.1936644 ],
      dtype=float32)
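If the embeddings are (approximately) unit length, which is an assumption here rather than something this notebook verifies, a squared L2 distance d² and a cosine similarity are tied together by d² = 2 − 2·cos. The sketch below checks that identity on two random unit vectors; the helper name `squared_l2_to_cosine` is hypothetical.

```python
import numpy as np

def squared_l2_to_cosine(sq_dists):
    """Convert squared L2 distances between unit vectors to cosine similarities."""
    return 1.0 - np.asarray(sq_dists, dtype="float64") / 2.0

# Check the identity on two random unit vectors.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 512))
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

sq_dist = np.sum((u - v) ** 2)
print(np.isclose(squared_l2_to_cosine(sq_dist), u @ v))
```

Under the same assumption, `squared_l2_to_cosine(D.flatten())` would turn the distances above into similarity scores, where larger means more similar.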

Now let's check out the results using document IDs.


In [21]:
cols_to_show = ['title', 'excerpt', 'forum_zh']
df.loc[I.flatten(), cols_to_show]

Unnamed: 0,title,excerpt,forum_zh
5,#請益 請問這雙怎麼入手,蝦皮上有一個賣3500，我下單後他跟我說斷貨了，叫我取消訂單，想請問有什麼方式能入手這雙鞋,穿搭
270,Diana 樂福鞋求收,如題，因為之前活動關係買了兩雙兩個尺寸，以為把不合的退掉就好了好聰明，結果沒想到有活動滿20...,女孩
23,#詢問 想問哪裡還買得到這件外套,這件mouggan+Mercci22聯名的直紋五分西裝外套找了官網都沒有m號了好難過，不知道...,穿搭
4,#詢問 想問,各位，因為這個證件夾臺灣買不到，是美國outlet 的限量版貨，所以在以下的這間蝦皮上買，但...,穿搭
294,#問 Tory Burch 包款,請問有人知道這款包的名字嗎？因為不是熱門款有點難找，有人賣4500 猶豫要不要下手 也不知道...,女孩
16,#詢問 此款tommy牛仔外套還買得到嗎？,各位朋友半夜好。找了整整兩天，在網路上都難以找到有這件外套「正品」的購買資訊。有丟了幾家代購...,穿搭
13,有人知道哪裡有賣這件衣服嗎？,在小客廳看到這件衣服被燒到，本來想說今天下單 結果剛剛發現已經被下架了，想詢問一下大家知道哪...,穿搭
25,#詢問 #詢問 賣場真假,最近一直看有沒有不錯的賣場，有人知道這家是不是正的嗎，謝謝🥺🥺,穿搭
299,腳歪歪的,其實好久了，每次穿短的都覺得好醜，但坐著很正常。但真的太熱了🥲，這有什麼方法拯救嗎？！，ps...,女孩
1,#詢問 找衣服🥲,想找這套衣服🥲，但發現不知道該用什麼關鍵字找，（圖是草屯囝仔的校園演唱會截圖）,穿搭


## Clustering

Last time, we built a clustering model using Scikit-learn's K-means. This time, let's try
[Scikit-learn's AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) instead.



![](https://miro.medium.com/max/1039/0*afzanWwrDq9vd2g-)

Here are some parameters for this class taken straight from the official documentation:

- `n_clusters`: int or None, default=2
The number of clusters to find. 

- `affinity`: str or callable, default=’euclidean’
Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. 

- `linkage`: {‘ward’, ‘complete’, ‘average’, ‘single’}, default=’ward’
Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.

    - `ward` minimizes the variance of the clusters being merged.
    - `average` uses the average of the distances of each observation of the two sets.
    - `complete` or `maximum` linkage uses the maximum distances between all observations of the two sets.
    - `single` uses the minimum of the distances between all observations of the two sets.
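To see how three of these criteria differ, here is a small NumPy sketch that measures the distance between two toy clusters under `single`, `average`, and `complete` linkage. The function `linkage_distance` is written for this illustration only and is not how Scikit-learn implements clustering internally; `ward` is left out because it is based on variance rather than on pairwise distances alone.

```python
import numpy as np

def linkage_distance(cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters under single, complete, or average linkage."""
    # All pairwise Euclidean distances between points of the two clusters.
    diffs = cluster_a[:, None, :] - cluster_b[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(axis=-1))
    if linkage == "single":
        return pairwise.min()
    if linkage == "complete":
        return pairwise.max()
    if linkage == "average":
        return pairwise.mean()
    raise ValueError(f"unsupported linkage: {linkage}")

# Two toy clusters on a line: points at x = 0, 1 and at x = 3, 5.
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[3.0, 0.0], [5.0, 0.0]])
for name in ("single", "average", "complete"):
    print(name, linkage_distance(a, b, name))
# single 2.0
# average 3.5
# complete 5.0
```

At each step, agglomerative clustering merges the pair of clusters for which the chosen criterion is smallest.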

In [22]:
from sklearn.cluster import AgglomerativeClustering

clusterer = AgglomerativeClustering(n_clusters=100, 
                                    # renamed to `metric` in scikit-learn 1.2; `affinity` was removed in 1.4
                                    affinity="euclidean", 
                                    linkage="ward")
clusters = clusterer.fit_predict(embeddings)

Let's pick some random clusters to inspect the results.

> Using parentheses to connect multiple lines is a trick I learned from the book [*Effective Pandas*](https://www.amazon.com/Effective-Pandas-Patterns-Manipulation-Treading/dp/B09MYXXSFM). It's really worth reading if you are into data wrangling. 

In [23]:
df['clusters'] = clusters
sample_clusters = [10, 50, 80]
cols_to_show = ['title', 'excerpt', 'forum_zh', 'clusters']

# the parenthesis trick
results = (
    df
    # filtering
    [df['clusters'].isin(sample_clusters)]
    # selecting cols
    [cols_to_show]
)
results

Unnamed: 0,title,excerpt,forum_zh,clusters
213,開合約之前，你知道什麼是杠杆率和保證金率嗎？,近期市場情緒低迷，暴漲暴跌時有發生，但整體行情仍然處於震盪狀態。不少投資者看著盤面有些著急，...,理財,50
214,Richart外幣帳戶,爬了好多文還是不太懂外幣的一些問題，希望能有人幫我解答 謝謝，1.想請問買賣美元是在賺那個匯...,理財,80
218,股票區塊鏈金融交易理財就是：抄心態！,投資的實質是與自我頑固的靈魂作鬥爭，股票區塊鏈金融交易最後是炒心態。大師級的高手最後的較量並...,理財,50
219,3.4比特幣行情分析參考建議,比特幣昨晚先是小幅拉升至44000一線，期間也是到了財神給大家的參考區間43800—4430...,理財,50
222,#分享 3/4（五）盤前分享,1. 道瓊工業下跌96.69點/-0.29%，那斯達克下路214.08點/-1.56%，費半...,理財,50
225,為什麼不能用槓桿型及反向型ETF賺的更多,嗨 各位大家好~，我是油油的股票肥宅，最近有小粉絲問了一個問題。如果要長期投資，是否可以開槓...,理財,50
230,【投資理財】甚麼是雙幣投資 | 超高報酬背後的風險,現在的DeFi礦池收益沒有以前那麼好了，最近常聽到許多人在討論雙幣投資的，看到那個年化利率可...,理財,80
231,#請益 海運大漲該停利嘛,從上波高點套到現在，今天終於變紅了好感動，想請問該怎麼操作呢，要加碼還是停利嘛，第一次買霸脫霸脫,理財,50
238,在幣圈炒幣屬於投機嗎？,不少人問我炒幣是投機嗎？說實話炒幣大概率是投機，不是為了投機還能是為了什麼，難道真的相信do...,理財,80
330,《庫洛魔法使》（迷你）服裝製作,又來跟大家分享新的作品了~，頻道常常分享 {縫紉} {服裝製作} 等相關教學，大家對服裝製...,YouTuber,10


# Assignment

### BERT

You can easily leverage BERT embeddings using spaCy. Your task is to build a similarity index and a clustering model on any dataset by vectorizing texts with BERT embeddings. The following shows how to get a document vector.

> Use [this notebook](https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/nlp_datasets.ipynb) to explore more datasets.

> Make sure to download the language model that matches the language of your dataset. It's `zh_core_web_trf` for Chinese and `en_core_web_trf` for English.

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download zh_core_web_trf

In [25]:
!pip list | grep spacy

spacy                         3.3.0
spacy-alignments              0.8.5
spacy-legacy                  3.0.9
spacy-loggers                 1.0.2
spacy-pkuseg                  0.0.30
spacy-transformers            1.1.5


In [26]:
import spacy

In [27]:
nlp = spacy.load('zh_core_web_trf')

In [28]:
docid = 5
texts = df['title'] + ' ' + df['excerpt']
sample = texts[docid]
sample

'#請益 請問這雙怎麼入手 蝦皮上有一個賣3500，我下單後他跟我說斷貨了，叫我取消訂單，想請問有什麼方式能入手這雙鞋'

Once a text has been run through the pipeline, all transformer-related data is stored in the `._.trf_data` extension attribute of the resulting `Doc` object.

In [43]:
doc = nlp(sample)  # run the pipeline on the sample text to create a Doc
doc._.trf_data

TransformerData(wordpieces=WordpieceBatch(strings=[['[CLS]', '#', '請', '益', '請', '問', '這', '雙', '怎', '麼', '入', '手', '蝦', '皮', '上', '有', '一', '個', '賣', '3500', '，', '我', '下', '單', '後', '他', '跟', '我', '說', '斷', '貨', '了', '，', '叫', '我', '取', '消', '訂', '單', '，', '想', '請', '問', '有', '什', '麼', '方', '式', '能', '入', '手', '這', '雙', '鞋', '[SEP]']], input_ids=array([[ 101,  108, 6313, 4660, 6313, 1558, 6857, 7427, 2582, 7938, 1057,
        2797, 6076, 4649,  677, 3300,  671,  943, 6546, 9252, 8024, 2769,
         678, 1606, 2527,  800, 6656, 2769, 6303, 3174, 6515,  749, 8024,
        1373, 2769, 1357, 3867, 6242, 1606, 8024, 2682, 6313, 1558, 3300,
         784, 7938, 3175, 2466, 5543, 1057, 2797, 6857, 7427, 7490,  102]],
      dtype=int32), attention_mask=array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1.,

More specifically, `.wordpieces` stores the wordpiece tokens produced by the BERT tokenizer, along with their input IDs and the attention mask.

In [32]:
doc._.trf_data.wordpieces

WordpieceBatch(strings=[['[CLS]', '#', '請', '益', '請', '問', '這', '雙', '怎', '麼', '入', '手', '蝦', '皮', '上', '有', '一', '個', '賣', '3500', '，', '我', '下', '單', '後', '他', '跟', '我', '說', '斷', '貨', '了', '，', '叫', '我', '取', '消', '訂', '單', '，', '想', '請', '問', '有', '什', '麼', '方', '式', '能', '入', '手', '這', '雙', '鞋', '[SEP]']], input_ids=array([[ 101,  108, 6313, 4660, 6313, 1558, 6857, 7427, 2582, 7938, 1057,
        2797, 6076, 4649,  677, 3300,  671,  943, 6546, 9252, 8024, 2769,
         678, 1606, 2527,  800, 6656, 2769, 6303, 3174, 6515,  749, 8024,
        1373, 2769, 1357, 3867, 6242, 1606, 8024, 2682, 6313, 1558, 3300,
         784, 7938, 3175, 2466, 5543, 1057, 2797, 6857, 7427, 7490,  102]],
      dtype=int32), attention_mask=array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1.]], dtype=float32), 

NOTE: `doc._.trf_data.tensors` is different from `doc.tensor` or `doc.vector`. 

In [45]:
doc._.trf_data.tensors

(array([[[-0.06093796,  0.4808125 , -1.515294  , ...,  1.2998419 ,
           0.06041297,  0.34086958],
         [-0.4960665 ,  0.37047324, -0.5070612 , ...,  0.48365518,
           1.2482525 ,  0.24359411],
         [ 0.09421694,  0.4672077 , -1.5198385 , ...,  0.42167848,
          -0.175895  , -0.26311597],
         ...,
         [-0.62458724, -0.5568265 , -1.0633736 , ...,  0.8060819 ,
           0.4572911 ,  0.28040668],
         [ 0.3076699 , -0.73708737, -1.0244833 , ...,  0.36052567,
           1.2926208 , -0.11057007],
         [-1.143755  ,  0.4068495 , -2.2327175 , ...,  1.5238872 ,
           1.1633435 ,  1.027017  ]]], dtype=float32),
 array([[ 0.9482748 ,  0.3646351 ,  0.49285644,  0.7384991 ,  0.695666  ,
          0.8379382 , -0.6671747 , -0.9659504 ,  0.9981446 , -0.9838525 ,
          0.9991503 ,  0.11251239, -0.9947545 , -0.7300057 ,  0.997386  ,
         -0.96912575, -0.797553  , -0.8557561 , -0.01615306,  0.55186325,
          0.9844663 , -0.24069616, -0.55427885, 

- Shape of the first tensor, which holds token-level vectors: (doc number, token number, vector dimensions)

In [46]:
doc._.trf_data.tensors[0].shape

(1, 55, 768)

- Shape of the second tensor, which holds one vector per document: (doc number, vector dimensions)

In [47]:
doc._.trf_data.tensors[1].shape

(1, 768)

Finally, here's how you get access to the overall vector of a document.

In [49]:
doc_vec = doc._.trf_data.tensors[1]
type(doc_vec)

numpy.ndarray
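Another way to obtain a single document vector, sometimes used instead of the second tensor, is to average the token-level vectors while ignoring padded positions. The sketch below uses random stand-in arrays with the same shapes as the tensors shown above; the helper `mean_pool` is hypothetical and not part of spaCy.

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average token vectors per document, ignoring padded positions."""
    mask = attention_mask[..., None].astype(token_vectors.dtype)  # (docs, tokens, 1)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
# Random stand-ins shaped like doc._.trf_data.tensors[0] and the attention mask.
fake_tokens = rng.normal(size=(1, 55, 768)).astype("float32")
fake_mask = np.ones((1, 55), dtype="float32")

pooled = mean_pool(fake_tokens, fake_mask)
print(pooled.shape)  # (1, 768)
```

Whichever vector you choose, stacking one row per document into a 2-D `float32` array gives you an input of the same form as `embed_arrays` in the similarity and clustering sections.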