# Building a Korean Chatbot with SBERT (Sentence-BERT)
#### Reference: https://wikidocs.net/154530

In [1]:
pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.11.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecti

In [2]:
import numpy as np
import pandas as pd
from numpy import dot
from numpy.linalg import norm
import urllib.request
from sentence_transformers import SentenceTransformer

In [3]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData.csv", filename="ChatBotData.csv")
train_data = pd.read_csv('ChatBotData.csv')
train_data.head()

Unnamed: 0,Q,A,label
0,12시 땡!,하루가 또 가네요.,0
1,1지망 학교 떨어졌어,위로해 드립니다.,0
2,3박4일 놀러가고 싶다,여행은 언제나 좋죠.,0
3,3박4일 정도 놀러가고 싶다,여행은 언제나 좋죠.,0
4,PPL 심하네,눈살이 찌푸려지죠.,0


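### Loading a Pretrained BERT (Multilingual Model)
- The model supports 100 languages. It follows a BERT Base model that was trained on SNLI data and then on STS-B data, and it builds sentence vectors with mean pooling. As the `xlm-r` prefix in the model name suggests, the multilingual encoder itself is XLM-RoBERTa.
- In short: training on NLI data first, then additional fine-tuning on STS.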
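Mean pooling, mentioned in the list above, turns the per-token vectors of a sentence into a single sentence vector by averaging them while skipping padding positions. The following is a minimal NumPy sketch of that idea only; the function name `mean_pool` and the toy arrays are illustrative assumptions and are not part of the `sentence_transformers` API.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0).

    token_embeddings: array of shape (seq_len, hidden_dim)
    attention_mask:   array of shape (seq_len,), 1 for real tokens, 0 for padding
    """
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # shape (seq_len, 1)
    summed = (np.asarray(token_embeddings, dtype=float) * mask).sum(axis=0)
    count = max(mask.sum(), 1e-9)  # guard against an all-padding input
    return summed / count

# Toy input (hypothetical values): three token vectors, the last one is padding.
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = np.array([1, 1, 0])
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # the padded row does not affect the average
```

In the real model this step runs inside `SentenceTransformer`, so the notebook never has to call it directly.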

In [4]:
model = SentenceTransformer('sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens')

## Computing Embedding Vectors

In [6]:
# Compute an embedding vector for each training question (column Q)
# and store it in a new 'embedding' column
train_data['embedding'] = train_data.apply(lambda row: model.encode(row.Q), axis=1)
train_data.head()

Unnamed: 0,Q,A,label,embedding
0,12시 땡!,하루가 또 가네요.,0,"[0.20179598, -0.034438115, 1.539572, 0.0106974..."
1,1지망 학교 떨어졌어,위로해 드립니다.,0,"[0.07716598, -0.03427816, 0.8624426, 0.0263606..."
2,3박4일 놀러가고 싶다,여행은 언제나 좋죠.,0,"[0.10445247, -0.01243222, 1.0132877, 0.0225015..."
3,3박4일 정도 놀러가고 싶다,여행은 언제나 좋죠.,0,"[0.09760746, -0.046716865, 0.8936941, 0.021047..."
4,PPL 심하네,눈살이 찌푸려지죠.,0,"[-0.07002865, 0.03196122, 1.4915428, 4.3391115..."
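The cell above calls `model.encode` once per row. `SentenceTransformer.encode` also accepts a list of sentences together with a `batch_size` argument, which is usually faster on a dataset of this size. The sketch below shows that pattern under stated assumptions: `add_question_embeddings` is a hypothetical helper, and `DummyEncoder` is a stand-in that replaces the real model so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd

def add_question_embeddings(df, encoder, batch_size=32):
    """Encode every question in df['Q'] in batches and store one vector per row."""
    questions = df['Q'].tolist()
    vectors = encoder.encode(questions, batch_size=batch_size)  # shape (n_rows, dim)
    df = df.copy()
    df['embedding'] = list(np.asarray(vectors))  # one 1-D array per row
    return df

class DummyEncoder:
    """Hypothetical stand-in for the real model, used only for this demo."""
    def encode(self, sentences, batch_size=32):
        # Fake embedding: [character count, 1.0] for each sentence.
        return np.array([[float(len(s)), 1.0] for s in sentences])

demo = pd.DataFrame({
    'Q': ['12시 땡!', 'PPL 심하네'],
    'A': ['하루가 또 가네요.', '눈살이 찌푸려지죠.'],
})
demo = add_question_embeddings(demo, DummyEncoder())
print(demo['embedding'].iloc[0])
```

With the real model, the call would presumably look like `add_question_embeddings(train_data, model)`; the resulting `embedding` column has the same layout as the one built row by row.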


In [7]:
def cos_sim(A, B):
    # Cosine similarity: dot product divided by the product of the vector norms
    return dot(A, B) / (norm(A) * norm(B))
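To make the behavior of `cos_sim` concrete, here is a small worked example on hand-picked 2-D vectors (the vectors are illustrative, not model outputs). Cosine similarity depends only on direction, so scaling a vector does not change the score.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
    # Same definition as in the notebook cell
    return dot(A, B) / (norm(A) * norm(B))

a = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # longer, but points the same way
orthogonal = np.array([0.0, 2.0])       # at a right angle to a
opposite = np.array([-1.0, 0.0])        # points the other way

sim_same = cos_sim(a, same_direction)   # 1.0
sim_orth = cos_sim(a, orthogonal)       # 0.0
sim_opp = cos_sim(a, opposite)          # -1.0
print(sim_same, sim_orth, sim_opp)
```

Note that the function divides by the norms, so a zero vector would produce a division by zero; embeddings from the model are not expected to be all zeros.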

In [10]:
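def return_answer(question):
    # Embed the user's question with the same model used for the training questions
    embedding = model.encode(question)
    # Score every stored question by cosine similarity to the user's question
    train_data['score'] = train_data.apply(lambda x: cos_sim(x['embedding'], embedding), axis=1)
    # Return the answer paired with the most similar stored question
    return train_data.loc[train_data['score'].idxmax()]['A']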
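`return_answer` scores the rows one at a time through `DataFrame.apply`. The same nearest-question lookup can be written as a single matrix product once the embeddings are stacked into a 2-D array. The sketch below is one possible variant, not the notebook's method: `best_match` is a hypothetical helper, and the toy vectors and answer strings stand in for the real embeddings and the `A` column.

```python
import numpy as np

def best_match(question_vec, embedding_matrix):
    """Return (row_index, cosine_score) of the stored vector closest to question_vec."""
    matrix = np.asarray(embedding_matrix, dtype=float)
    query = np.asarray(question_vec, dtype=float)
    # After normalizing to unit length, a dot product equals cosine similarity.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit  # one score per stored question
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])

# Toy data (hypothetical): three stored question vectors and their answers.
stored = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
answers = ['answer 0', 'answer 1', 'answer 2']

idx, score = best_match(np.array([2.0, 2.1]), stored)
print(answers[idx], round(score, 3))
```

In the notebook, the matrix could be built once with `np.vstack(train_data['embedding'])`. Returning the score alongside the index also makes it possible to fall back to a default reply when the best score is low, at a threshold you would have to choose yourself.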

In [12]:
return_answer('사랑해')  # query: "I love you"; the reply below means "Try telling them."

'상대방에게 전해보세요.'

In [18]:
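return_answer('위로를?')  # query: "Comfort?"; the reply below means "Try using a mist spray, a humidifier, a wet towel, and so on."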

'미스트나 가습기, 젖은 수건 등을 사용해보세요.'

In [17]:
return_answer('말 해봐')  # query: "Say something"; the reply below means "If you say 'I broke up today', I'll comfort you."

'오늘 헤어졌어 라고 하면 위로해 드려요.'

In [16]:
return_answer('뭐해?')  # query: "What are you doing?"; the reply below means "I'm working."

'일해요.'