# Related Question
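The goal of this project is to show customers answers to similar questions before they submit their own, in the hope of resolving simple issues up front and reducing the volume of customer complaints.
I use BERT as the backbone to produce sentence representations.
For BERT, I use the [bert-as-service](https://github.com/hanxiao/bert-as-service) package together with the [BERT-Base, Uncased (12-layer, 768-hidden, 12-heads, 110M parameters)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) pretrained model to output the sentence representations.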

I chose this model simply because it is the only one that fits in my GPU memory; a larger model would presumably give better results.

Usage:
1. Install the required packages

    ```
    pip install bert-serving-server  # server
    pip install bert-serving-client  # client, independent of `bert-serving-server`
    ```
2. Download the pretrained model
3. Start the BERT service
    Run the command below in a shell on the same machine and wait until it prints `all set, ready to serve request!`.
    Note that `num_worker` affects memory usage, so if that message never appears, the likely cause is insufficient memory.
    
    `bert-serving-start -model_dir uncased_L-12_H-768_A-12 -num_worker 3 -port 1355 -max_seq_len 150 -device_map 3 -show_tokens_to_client`
4. Run this Jupyter notebook!
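
Before running the notebook cells, it can help to confirm that something is listening on the service port. The sketch below is a plain TCP check using only the standard library; `server_port_is_open` is a hypothetical helper, not part of bert-as-service, and the default port 1355 is taken from the command in step 3. An open port only shows that a process has bound it, not that the workers have finished loading, so the `all set, ready to serve request!` message remains the real signal.

```python
import socket


def server_port_is_open(host="localhost", port=1355, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration; it does not speak the
    bert-as-service protocol, it only tests that the port accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    print("Port 1355 open:", server_port_is_open())
```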

In [1]:
import torch
import torch.nn as nn

import pickle
import numpy as np

loadpath = "processed_data_bert_expand"
bert_data_path = "bert_expand.pkl"

## Predict Dataset

Load the data that has already been preprocessed by `Data Preprocessing.ipynb`.

In [2]:
with open(loadpath, "rb") as f:
    output = pickle.load(f)
clean_data = output["clean_data"]
reduced_data = output["reduced_data"]
token_data = output["token_data"]

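Next, compute the sentence representation for every sentence in the dataset. The processing time depends on the `num_worker` setting and on GPU throughput; on a GeForce GTX 1080 Ti with `num_worker=4`, it took me about 3 hours.

If the connection succeeds, the shell where you ran `bert-serving-start` should start printing a stream of log messages.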

In [None]:
from bert_serving.client import BertClient
bc = BertClient(port=1355)
print("Start predicting")
bert_output = bc.encode(clean_data)
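
Because a single `encode` call over the whole dataset can run for hours, one option is to encode in chunks and report progress along the way. The following is a minimal sketch under stated assumptions: `encode_in_chunks`, `encode_fn`, and `chunk_size` are hypothetical names, and `fake_encode` is a stand-in so the example runs without a server.

```python
def encode_in_chunks(sentences, encode_fn, chunk_size=1000):
    """Encode `sentences` chunk by chunk, printing progress after each chunk.

    `encode_fn` takes a list of strings and returns one vector per string.
    """
    vectors = []
    total = len(sentences)
    for start in range(0, total, chunk_size):
        chunk = sentences[start:start + chunk_size]
        vectors.extend(encode_fn(chunk))
        print("Encoded {}/{} sentences".format(min(start + chunk_size, total), total))
    return vectors


def fake_encode(batch):
    """Stand-in encoder (assumption): one fake 2-d vector per sentence."""
    return [[float(len(s)), 0.0] for s in batch]


demo_vectors = encode_in_chunks(["a", "bb", "ccc"], fake_encode, chunk_size=2)
```

In the notebook, `encode_fn` would be `bc.encode`, and the collected rows could be stacked back into a single array with `np.array(vectors)`.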

Save the hard-won results so they do not have to be recomputed later.

In [None]:
bert_data = {
    "clean_data": clean_data,
    "reduced_data": reduced_data,
    "token_data": token_data,
    "bert_data": bert_output
}
with open(bert_data_path, "wb") as f:
    pickle.dump(bert_data, f)
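
The save-then-reload pattern used here can also be wrapped in a small caching helper, so that the expensive step runs only when no saved file exists. This is a sketch, not the notebook's own code: `load_or_compute` and `expensive` are hypothetical names, and the demo writes to a temporary directory instead of `bert_expand.pkl`.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn):
    """Return the pickled result at `cache_path`, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive():
    """Stand-in for the slow encoding step; records each invocation."""
    calls.append(1)
    return {"bert_data": [[0.1, 0.2]]}


demo_path = os.path.join(tempfile.mkdtemp(), "demo_cache.pkl")
first = load_or_compute(demo_path, expensive)   # computes and writes the file
second = load_or_compute(demo_path, expensive)  # reads the file, no recompute
```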

Load the dataset along with its BERT sentence representations.

In [3]:
with open(bert_data_path, "rb") as f:
    bert_data = pickle.load(f)
clean_data = bert_data["clean_data"]
reduced_data = bert_data["reduced_data"]
token_data = bert_data["token_data"]
bert_output = bert_data["bert_data"]

In [4]:
print("Type: ", type(bert_output), bert_output.shape)
bert_tensor = torch.from_numpy(bert_output)
print(bert_tensor.size())

Type:  <class 'numpy.ndarray'> (100910, 768)
torch.Size([100910, 768])


## Testing
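This section simulates the testing scenario: when a new query sentence arrives, it must be preprocessed first and then passed to the `predict()` function. To keep things simple, I just pass in sentences that were already preprocessed earlier.

Note that the model still has to be running via `bert-serving-start` in a shell before `predict()` is called.

The `predict()` function first obtains the sentence representation of the query sentence, then computes its cosine similarity against the BERT sentence representations obtained earlier for the dataset. The higher the value, the more similar the sentence is to the query.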
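
The ranking step described above can be illustrated on toy data. The sketch below is pure Python with 2-dimensional vectors standing in for the 768-dimensional BERT output; `l2_normalize` and `rank_by_cosine` are hypothetical helper names. The notebook performs the same computation with `torch.matmul` on pre-normalised tensors.

```python
import math


def l2_normalize(vec):
    """Scale `vec` to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_cosine(query, corpus, top_k):
    """Return (index, cosine similarity) pairs for the `top_k` closest corpus vectors."""
    q = l2_normalize(query)
    # The dot product of two unit vectors is their cosine similarity
    scores = [sum(a * b for a, b in zip(l2_normalize(doc), q)) for doc in corpus]
    order = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


corpus = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
query = [2.0, 1.0]
top = rank_by_cosine(query, corpus, top_k=2)
print(top)  # indices 2 and 0 rank highest for this query
```

Normalising first makes the score depend only on direction: `[3.0, 3.0]` ranks first here even though it is the longest vector, because it points closest to the query.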

In [10]:
import re

from bert_serving.client import BertClient
bc = BertClient(port=1355)
bert_norm = bert_tensor / torch.norm(bert_tensor, dim=1).view(-1, 1)



def predict(test_sentence, num_related, ignore_first=False):
    """Print the `num_related` dataset sentences most similar to `test_sentence`."""
    test_sentence = test_sentence.lower()
    print("Query: {}".format(test_sentence))
    # show_tokens=True needs the server to be started with -show_tokens_to_client
    test_array, token = bc.encode([test_sentence], show_tokens=True)
    test_tensor = torch.tensor(test_array[0])

    # Cosine similarity = dot product of L2-normalised vectors
    test_norm = test_tensor / torch.norm(test_tensor)
    similarity = torch.matmul(bert_norm, test_norm)  # shape: (num_sentences,)

    rank = torch.argsort(similarity, descending=True)
    # If the query comes from the dataset, the top hit is the query itself; skip it
    start = 1 if ignore_first else 0

    for idx in rank[start:start + num_related].tolist():
        print("\n" + "=" * 10 + "Similarity: {}".format(similarity[idx].item()) + "=" * 10)
        print(re.sub(r'<[^<]*?/?>', '', reduced_data[idx]))  # strip HTML tags from the output sentence
    return None
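
The regular expression that `predict()` uses to strip HTML from the printed sentences can be tried in isolation. The sketch below reuses the same pattern; `strip_tags` is a hypothetical wrapper name and the sample strings are made up for illustration.

```python
import re

# Same pattern as in predict(): "<", then any run of non-"<" characters, then ">"
TAG_PATTERN = re.compile(r'<[^<]*?/?>')


def strip_tags(text):
    """Remove anything that looks like an HTML tag from `text`."""
    return TAG_PATTERN.sub('', text)


print(strip_tags("Hello,<br/>I purchased <b>PowerDirector</b>."))
print(strip_tags("a < b and c > d"))
```

The second call shows a limitation: a literal `<` followed later by `>` is treated as a tag, so the text between them is removed as well. For customer messages that contain comparison signs, an HTML parser would be the safer choice.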

In [11]:
import random

for i, index in enumerate(random.sample(range(len(clean_data)), 5)):
    #print("Query: {}".format(reduced_data[index]))
    predict(clean_data[index], 3, ignore_first=True)
    print("\n" + "*" * 50 + "\n")

Query: hi team,    please send me the link for downloading digital copy of powerdirector 17 ultra and photodirector 10     i have purchased the software and i have the dvd  i would like to download digital copies of the     thank you,  karthik   

Hi  I purchased PowerDirector 14 Ultimate - Incl. Premium Effects and Templates on 02/08/2016, order number 213323203.  Today I accidentally lost the digital copy copies of the software and installations.  Could you kindly provide the links to download all of them once again for which I will be very grateful. Thank you. Sincerely  Giridhar Havanoor

Hi there,    I tried to download the files but the link is broken.    Please kindly provide the link to download     PowerDirector 15 Ultimate + PhotoDirector 8 Ultra    Thanks a lot   Shanyao Lee

I purchased these three products, I had to installed windows 10 pro on my laptop, I have contacted your tech support about the last product and getting it re downloaded (Power Director 17 Ultimate, Hi n

In [None]:
while True:
    input_sentence = input("Please type your question here:") # I cannot activate my PowerDVD.
    if input_sentence == "EOF":
        break
    predict(input_sentence, 5)

Please type your question here:I cannot activate my PowerDVD.
Query: i cannot activate my powerdvd.

I am unable to activate my PowerDVD 19 software. 

I am unable to activate my PowerDVD

I am unable to activate my Power2Go 11. 

I am unable to activate my Power2Go 12 software. 

I am unable to activate my PowerDirector 17. Product key is invalid.
Please type your question here:I lose my CD key.
Query: i lose my cd key.

I lost my product key numbers

I accidentally lost the product key for my director zone 15.

i lose again my power dvd

My PC Crash again and need to reinstall my Power Director and the Screen Recorder     Thanks

i reinstall my  pc and i  need the key
Please type your question here:My subscription is up i believe in October of 2019. I want to terminate my subscription and do not want to renew it. Please help me with this?
Query: my subscription is up i believe in october of 2019. i want to terminate my subscription and do not want to renew it. please help me with thi

## Future Work
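The final output looks reasonably good, but after some further investigation I found that BERT is not well suited to being used directly as a sentence encoder like this. The fixes I have in mind so far are:

1. Fine-tune on a task first, such as the supervised classification I did at the very beginning, and then use the `[CLS]` output as the sentence representation.
2. Use the [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf).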