# Related Question
The goal of this notebook is to show customers answers to similar questions before they submit their own, so that simple problems get resolved on the spot and the number of complaints goes down.
The approach uses BERT as the backbone to encode each sentence as a vector.
For BERT I use the [bert-as-service](https://github.com/hanxiao/bert-as-service) package with the [BERT-Base, Uncased (12-layer, 768-hidden, 12-heads, 110M parameters)](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) pretrained model to produce sentence representations.

I chose this model simply because it is the only one that fits in my GPU memory; a larger model would presumably give better results.
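
To make "sentence representation" concrete: it is a single fixed-length vector per sentence. To my knowledge, bert-as-service by default averages the token vectors of the second-to-last layer (its `REDUCE_MEAN` pooling strategy). The sketch below only illustrates that masked-averaging idea with NumPy on made-up numbers; `mean_pool` is a hypothetical helper and not part of the package.

```python
import numpy as np

def mean_pool(token_vectors, mask):
    """Average the token vectors of each sentence, ignoring padded positions.

    token_vectors: (batch, seq_len, hidden) array of per-token outputs.
    mask: (batch, seq_len) array, 1 for real tokens and 0 for padding.
    """
    mask = mask.astype(token_vectors.dtype)[:, :, None]  # (batch, seq_len, 1)
    summed = (token_vectors * mask).sum(axis=1)          # (batch, hidden)
    counts = np.maximum(mask.sum(axis=1), 1.0)           # (batch, 1), avoids division by zero
    return summed / counts

# Toy input: 2 sentences, 3 token slots, hidden size 2 (made-up values).
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
                   [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]]])
mask = np.array([[1, 1, 0],   # the last slot of sentence 0 is padding
                 [1, 1, 1]])
sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors)
```

Each row of `sentence_vectors` plays the role of one 768-dimensional row in the BERT output used later in this notebook.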

Workflow:
1. Install the required packages

    ```
    pip install bert-serving-server  # server
    pip install bert-serving-client  # client, independent of `bert-serving-server`
    ```
2. Download the pretrained model
3. Start the BERT service
    Run the command below in a shell on the same machine and wait until it prints `all set, ready to serve request!`.
    Note that `num_worker` affects memory usage; if the ready message never appears, insufficient memory is a likely cause.
    
    `bert-serving-start -model_dir uncased_L-12_H-768_A-12 -num_worker 3  -port 1355 -max_seq_len 150 -device_map 3 -show_tokens_to_client`
4. Run this Jupyter notebook!

In [1]:
import torch
import torch.nn as nn

import pickle
import numpy as np

loadpath = "processed_data_bert_expand"
bert_data_path = "bert_expand.pkl"

## Predict Dataset

Load the data that was already preprocessed by Data Preprocessing.ipynb.

In [2]:
with open(loadpath, "rb") as f:
    output = pickle.load(f)
clean_data = output["clean_data"]
reduced_data = output["reduced_data"]
token_data = output["token_data"]

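Next, get the sentence representation of every sentence in the dataset. Processing time depends on the number of workers (`num_worker`) and on the GPU's compute power; on a GeForce GTX 1080 Ti with 4 workers, the run took about 3 hours.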
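
Because a run of this length is painful to repeat, one option is to encode the dataset in chunks so that progress is visible and partial results can be kept. The sketch below is a rough illustration under that assumption: `encode_in_chunks` and `fake_encode` are hypothetical names, and `fake_encode` merely stands in for `bc.encode` so the example runs without a server.

```python
import numpy as np

def encode_in_chunks(sentences, encode_fn, chunk_size=1000):
    """Encode `sentences` chunk by chunk and stack the results row-wise."""
    parts = []
    total = len(sentences)
    for start in range(0, total, chunk_size):
        chunk = sentences[start:start + chunk_size]
        parts.append(np.asarray(encode_fn(chunk)))
        print("encoded {}/{}".format(min(start + chunk_size, total), total))
    return np.vstack(parts)

def fake_encode(chunk):
    # Stand-in for bc.encode: returns one 4-dimensional vector per sentence.
    return np.array([[len(s), 0.0, 0.0, 0.0] for s in chunk], dtype=np.float32)

demo_sentences = ["a", "bb", "ccc", "dddd", "eeeee"]
demo_vectors = encode_in_chunks(demo_sentences, fake_encode, chunk_size=2)
```

With the real client this would be called as `encode_in_chunks(clean_data, bc.encode)`, and each chunk could additionally be written to disk inside the loop.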

If the connection succeeds, a stream of log messages should appear in the shell where you ran `bert-serving-start`.

In [3]:
from bert_serving.client import BertClient
bc = BertClient(port=1355)
print("Start predicting")
bert_output = bc.encode(clean_data)

Start predicting


here is what you can do:
- or, start a new server with a larger "max_seq_len"
  '- or, start a new server with a larger "max_seq_len"' % self.length_limit)
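
The warning above appears to come from the client's length check: inputs longer than the server's `max_seq_len` are truncated. A rough way to see how many sentences are affected is to count whitespace-separated words, as sketched below. This is only an approximation, since BERT's WordPiece tokenizer usually produces more tokens than there are words and also adds `[CLS]` and `[SEP]`; `find_too_long` is a hypothetical helper and the demo sentences are made up.

```python
def find_too_long(sentences, max_seq_len):
    """Return indices of sentences with more whitespace-separated words than max_seq_len."""
    return [i for i, s in enumerate(sentences) if len(s.split()) > max_seq_len]

demo = ["short sentence", "word " * 200, "another short one"]
too_long = find_too_long(demo, max_seq_len=150)
print("{} of {} sentences exceed the limit".format(len(too_long), len(demo)))
```

Running the same check on `clean_data` with the `-max_seq_len` value passed to the server gives a lower bound on the number of truncated inputs.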


Save the hard-won results so they don't have to be recomputed next time.

In [4]:
bert_data = {
    "clean_data": clean_data,
    "reduced_data": reduced_data,
    "token_data": token_data,
    "bert_data": bert_output
}
with open(bert_data_path, "wb") as f:
    pickle.dump(bert_data, f)

Load the dataset along with its BERT sentence representations.

In [5]:
with open(bert_data_path, "rb") as f:
    bert_data = pickle.load(f)
clean_data = bert_data["clean_data"]
reduced_data = bert_data["reduced_data"]
token_data = bert_data["token_data"]
bert_output = bert_data["bert_data"]

In [6]:
print("Type: ", type(bert_output), bert_output.shape)
bert_tensor = torch.from_numpy(bert_output)
print(bert_tensor.size())

Type:  <class 'numpy.ndarray'> (100910, 768)
torch.Size([100910, 768])


## Testing
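This section simulates the testing scenario: when a new query sentence arrives, it has to be preprocessed first and then passed to the `predict()` function. Here I take a shortcut and feed in sentences that were already preprocessed.

Note that the model must still be running in a shell via `bert-serving-start` before `predict()` is called.

`predict()` first obtains the sentence representation of the query sentence, then computes its cosine similarity with the BERT sentence representations obtained earlier for the dataset. The higher the value, the more similar the sentence is to the query.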
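
The ranking step described above can be sketched on toy data as follows. This is a minimal NumPy illustration of cosine similarity computed by normalising the vectors and taking dot products, not the notebook's actual code; `top_k_similar` and the toy vectors are made up for the example.

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k):
    """Return (indices, scores) of the k corpus rows most similar to query_vec."""
    # Unit-normalise both sides so that a dot product equals cosine similarity.
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = corpus_norm @ query_norm   # one cosine similarity per corpus row
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]

corpus = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0],
                   [-1.0, 0.0]])
query = np.array([2.0, 0.0])
indices, scores = top_k_similar(query, corpus, k=2)
print(indices, scores)
```

Normalising the corpus once up front, as the notebook does with `bert_norm`, avoids repeating that work for every query.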

In [7]:
import re

from bert_serving.client import BertClient
bc = BertClient(port=1355)
# Normalise every dataset vector to unit length once, so that cosine
# similarity reduces to a dot product.
bert_norm = bert_tensor / torch.norm(bert_tensor, dim=1).view(-1, 1)


def predict(test_sentence, num_related):
    """Print the `num_related` dataset entries most similar to `test_sentence`."""
    print("Query: {}".format(test_sentence))
    # show_tokens=True makes encode() return (vectors, tokens); the server
    # has to be started with -show_tokens_to_client for this to work.
    test_array, token = bc.encode([test_sentence], show_tokens=True)
    test_tensor = torch.tensor(test_array[0])

    test_norm = test_tensor / torch.norm(test_tensor)
    similarity = torch.matmul(bert_norm, test_norm.view(-1, 1))  # shape: (N, 1)

    rank = torch.argsort(similarity, dim=0, descending=True)

    # Start at 1: the queries used below come from the dataset itself,
    # so rank 0 is presumably the query sentence.
    for i in range(1, num_related + 1):
        idx = rank[i].item()
        print("\n" + "=" * 10 + "Similarity: {}".format(similarity[idx].item()) + "=" * 10)
        print(re.sub(r'<[^<]*?/?>', '', reduced_data[idx]))  # strip HTML tags from the output
    return None

In [9]:
import random

for i, index in enumerate(random.sample(range(len(clean_data)), 5)):
    #print("Query: {}".format(reduced_data[index]))
    predict(clean_data[index], 3)
    print("\n" + "*" * 50 + "\n")

Query: ﻿   dear sir,  kindly confim that is it possible to edit  (quick time) videos in your android video editing           my both nikon camera produce  videos, which is not edit able by any other           related          kaushik        ﻿

Hi,  I have an indoor security camera uses avi format, it's not MP3 or MP4,  I'm unable to play any video from this security device, even if I edited then play in different video players,, , the clip will be distorted, i love this app just can't edit any videos from this sec cam, your help is much appreciated.  Gashi Attach File : DeviceInfo.txt

I want to capture clips/videos from my HDV sony camera (with tape) by means of a fire wire. Power director recognize the camera but when I want to capture I got a message (see attachment) The setting at my camera are correct. A friend could capture this video with his editing software (Magix) without problem after downloading MPEG-2 codec. I have already downloaded MPEG-2 codec but it still doesn't work.

## Future Work
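The final results actually look reasonably good, but after some further investigation I found that BERT is not well suited to being used directly as a sentence encoder like this. The fixes I have in mind so far are:

1. First fine-tune on some task, such as the supervised classification done at the very beginning, and then use the `[CLS]` output as the sentence representation.
2. Use the [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf).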