# Why Chatbot
* No new app has to be installed
* With no app to install, there is nothing new to learn
* Convenient UX: the user just types text
* Immediate feedback

## Building a simple Q/A bot with Seq2Seq
![이미지](http://suriyadeepan.github.io/img/seq2seq/seq2seq2.png)
* Python 3.5, TensorFlow 1.x (the first cell prints 1.9.0), KoNLPy (Mecab), Word2Vec (Gensim), matplotlib (graphs)

In [1]:
# -*- coding: utf-8 -*-
import tensorflow as tf
from eunjeon import Mecab
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

mecab = Mecab()
print(tf.__version__)

1.9.0


### Preparing the data for seq2seq
* Each question/answer pair is stored as a list
* The vocabulary is sorted by value using `operator`
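The sorting step in the second bullet can be sketched in plain Python. This is only an illustration, not the notebook's own code: the morphemes are hardcoded from the Mecab output printed further down, and lowercasing plus dropping `'.'` is an assumption meant to imitate the default behaviour of the Keras `Tokenizer`.

```python
import operator

# Morphemes of each question/answer pair, hardcoded so the sketch runs without Mecab.
qa_morphs = [
    [['안녕'], ['만나', '서', '반가워']],
    [['넌', '누구', '니'], ['나', '는', 'AI', '봇', '이', '란다', '.']],
    [['피자', '주문', '할께'], ['페파', '로니', '주문', '해', '줘']],
    [['음료', '는', '멀', '로'], ['콜라', '로', '해', '줘']],
]

# Count every morpheme (assumption: lowercase it and skip the period).
counts = {}
for question, answer in qa_morphs:
    for token in question + answer:
        token = token.lower()
        if token == '.':
            continue
        counts[token] = counts.get(token, 0) + 1

# Sort by value (the count), most frequent first; ties keep first-seen order.
ranked = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)

# Index 0 is left free for padding, so numbering starts at 1.
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
print(word_index)
```

With this data the most frequent morpheme `'는'` gets index 1 and `'콜라'` gets index 24, the same numbering the notebook prints for `tokenizer.word_index`.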

In [2]:
train_data = [
    ['안녕', '만나서 반가워'],
    ['넌누구니', '나는 AI 봇이란다.'],
    ['피자 주문 할께', '페파로니 주문해줘'],
    ['음료는 멀로', '콜라로 해줘']
]


all_input_sentences = []
all_target_sentences = []

tokenizer = Tokenizer()

for row_data in train_data:
    # Split the question and the answer into morphemes with Mecab
    input_morphs = mecab.morphs(row_data[0])
    output_morphs = mecab.morphs(row_data[1])
    print(input_morphs)
    print(output_morphs)
    # Add the morphemes of both sides to the shared vocabulary
    tokenizer.fit_on_texts(input_morphs)
    tokenizer.fit_on_texts(output_morphs)
    all_input_sentences.append(input_morphs)
    all_target_sentences.append(output_morphs)

# Convert each morpheme list into a list of vocabulary indices
input_texts = tokenizer.texts_to_sequences(all_input_sentences)
output_texts = tokenizer.texts_to_sequences(all_target_sentences)

# +1 because index 0 is reserved for padding
MAX_NB_WORDS = len(tokenizer.word_index) + 1
MAX_SEQUENCE_LENGTH_X = max(len(seq) for seq in input_texts)
MAX_SEQUENCE_LENGTH_Y = max(len(seq) for seq in output_texts)
MAX_SEQUENCE_LENGTH = max(MAX_SEQUENCE_LENGTH_X, MAX_SEQUENCE_LENGTH_Y)

# Pad every sequence with leading zeros so that all share one length
input_texts = pad_sequences(input_texts, maxlen=MAX_SEQUENCE_LENGTH)
output_texts = pad_sequences(output_texts, maxlen=MAX_SEQUENCE_LENGTH)

['안녕']
['만나', '서', '반가워']
['넌', '누구', '니']
['나', '는', 'AI', '봇', '이', '란다', '.']
['피자', '주문', '할께']
['페파', '로니', '주문', '해', '줘']
['음료', '는', '멀', '로']
['콜라', '로', '해', '줘']
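What `pad_sequences` does with its default settings can be sketched without Keras. The helper `pad_left` is a hypothetical name introduced here. The index sequences are the four questions written with the vocabulary indices the notebook prints in the next section, and the length 5 is the sequence length shown in the model summary.

```python
def pad_left(seq, maxlen, value=0):
    """Left-pad seq with `value`, or keep only its last `maxlen` items (maxlen > 0)."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

# The four questions as vocabulary indices.
sequences = [[6], [10, 11, 12], [18, 2, 19], [22, 1, 23, 5]]
max_len = 5

padded = [pad_left(s, max_len) for s in sequences]
for row in padded:
    print(row)
```

The zeros go in front because Keras pads (and truncates) at the start of a sequence unless told otherwise, so the real tokens always sit at the end of the input window.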


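### Building vectors (one vector per unit of the input sentence)
 - In general, the smaller the processing unit, the less the model is affected by out-of-vocabulary words and the smaller the vector dimension can stay,
 - but the sequences get longer and training gets harder, so a suitable embedding has to be found.
 - The right choice differs by business domain; it is a balance between complexity and expressiveness.
 - To keep things easy to follow, the code below uses plain one-hot vectors. Its unit is the Mecab morpheme from the previous cell rather than the single character.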
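The trade-off in the list above can be made concrete with the training sentences themselves. The sketch below is only an illustration: the morphemes are hardcoded from the Mecab output above, and `'피로'` is a hypothetical unseen word chosen because both of its characters occur in the training data.

```python
# Morphemes as printed by Mecab above (hardcoded so the sketch needs no Mecab).
morph_sentences = [
    ['안녕'],
    ['만나', '서', '반가워'],
    ['넌', '누구', '니'],
    ['나', '는', 'AI', '봇', '이', '란다', '.'],
    ['피자', '주문', '할께'],
    ['페파', '로니', '주문', '해', '줘'],
    ['음료', '는', '멀', '로'],
    ['콜라', '로', '해', '줘'],
]

# The same sentences split into single characters.
char_sentences = [[ch for tok in sent for ch in tok] for sent in morph_sentences]

token_vocab = {tok for sent in morph_sentences for tok in sent}
char_vocab = {ch for sent in char_sentences for ch in sent}

def longest(seqs):
    return max(len(s) for s in seqs)

# Smaller units give longer sequences.
print('longest morpheme sequence :', longest(morph_sentences))
print('longest character sequence:', longest(char_sentences))

# Smaller units cope better with unseen words.
unseen = '피로'  # hypothetical new word, not in the training data
token_known = unseen in token_vocab
chars_known = all(ch in char_vocab for ch in unseen)
print('known as a token     :', token_known)
print('known as characters  :', chars_known)
```

The unseen word cannot be looked up as a token, yet every one of its characters is already in the character vocabulary. The price is a longer sequence for the same sentence.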

In [3]:
index_word = {v: k for k, v in tokenizer.word_index.items()}
print(index_word)

{1: '는', 2: '주문', 3: '해', 4: '줘', 5: '로', 6: '안녕', 7: '만나', 8: '서', 9: '반가워', 10: '넌', 11: '누구', 12: '니', 13: '나', 14: 'ai', 15: '봇', 16: '이', 17: '란다', 18: '피자', 19: '할께', 20: '페파', 21: '로니', 22: '음료', 23: '멀', 24: '콜라'}


### One-Hot Encoding
* Each character of '안녕??' is assigned a position, and that position is set to 1 <br>
안 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] <br>
녕 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] <br>
? [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] <br>
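The example above can be reproduced with a few lines of plain Python. This is a minimal sketch: `one_hot` and `char_index` are names introduced here, and the width of 10 is taken from the vectors shown above.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

text = '안녕??'

# Give every distinct character the next free position, in order of first appearance.
char_index = {}
for ch in text:
    char_index.setdefault(ch, len(char_index))

SIZE = 10  # vector width used in the example above
for ch in text:
    print(ch, one_hot(char_index[ch], SIZE))
```

Both question marks map to the same vector, since a one-hot vector encodes which symbol it is and not where it occurs. `to_categorical` in the next cell applies the same idea to whole arrays of integer indices.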

In [4]:
from tensorflow.python.keras.utils import np_utils
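# One-hot encode every index; use the vocabulary size instead of a hardcoded 25
input_texts = np_utils.to_categorical(input_texts, MAX_NB_WORDS)
output_texts = np_utils.to_categorical(output_texts, MAX_NB_WORDS)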

## Model

In [5]:
from tensorflow.python.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.python.keras.models import Sequential
model = Sequential()
# Encoder: read the one-hot sequence and compress it into one 150-dim vector
model.add(LSTM(150, input_shape=(MAX_SEQUENCE_LENGTH, MAX_NB_WORDS)))
# Repeat that vector once per output time step
model.add(RepeatVector(MAX_SEQUENCE_LENGTH))
# Decoder: produce one hidden state per time step
model.add(LSTM(150, return_sequences=True))
# Softmax over the vocabulary at every time step
model.add(TimeDistributed(Dense(MAX_NB_WORDS, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 150)               105600    
_________________________________________________________________
repeat_vector (RepeatVector) (None, 5, 150)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 5, 150)            180600    
_________________________________________________________________
time_distributed (TimeDistri (None, 5, 25)             3775      
Total params: 289,975
Trainable params: 289,975
Non-trainable params: 0
_________________________________________________________________
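The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer. The helper names are introduced here for illustration.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # one weight per input/output pair plus one bias per output
    return input_dim * units + units

vocab, hidden = 25, 150  # one-hot width and LSTM size from the summary above

encoder = lstm_params(vocab, hidden)
decoder = lstm_params(hidden, hidden)
head = dense_params(hidden, vocab)
total = encoder + decoder + head

print(encoder, decoder, head, total)
```

`RepeatVector` has no weights, which is why it shows 0 parameters. `TimeDistributed` shares one dense layer across all five time steps, so its count does not depend on the sequence length.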


### Running predictions

In [6]:
def inference_embed(data):
    # Same pipeline as for training: morphemes -> indices -> padding -> one-hot
    encode_raw = mecab.morphs(data)
    output = tokenizer.texts_to_sequences([encode_raw])
    output = pad_sequences(output, maxlen=MAX_SEQUENCE_LENGTH)
    output = np_utils.to_categorical(output, MAX_NB_WORDS)
    return output

def predict(data):
    x_predict = inference_embed(data)
    y = model.predict(x_predict, verbose=0)
    # Pick the most probable vocabulary index at every time step
    arr = [dim.argmax() for dim in y[0]]
    index_word = {v: k for k, v in tokenizer.word_index.items()}  # index -> word
    # Index 0 is padding, so it contributes no text
    words = ['' if seq == 0 else index_word.get(seq, '') for seq in arr]
    output_text = ''.join(words)
    print('input text :' + data)
    print('output text :' + output_text)

# No model.fit call appears in the cells shown, so these predictions
# come from a model that has not been trained on train_data.
predict('안녕')
predict('넌누구니')
predict('피자 주문 할께')
predict('음료는 멀로')

input text :안녕
output text :줘줘줘줘줘
input text :넌누구니
output text :만나만나만나만나만나
input text :피자 주문 할께
output text :넌넌넌넌넌
input text :음료는 멀로
output text :란다란다란다란다란다
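The decoding step inside `predict` can be tried in isolation, without TensorFlow. In the sketch below the probability rows and the three-word vocabulary are made up for illustration, and `decode` is a name introduced here. Only the logic (argmax per time step, skip index 0, join the words) follows the notebook.

```python
def decode(step_probs, index_word):
    """Turn one probability row per time step into a string."""
    words = []
    for probs in step_probs:
        # Index of the largest probability in this time step
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != 0:  # index 0 is padding and adds no text
            words.append(index_word[best])
    return ''.join(words)

# Hypothetical vocabulary and model output (4 time steps, 4 classes).
index_word = {1: '만나', 2: '서', 3: '반가워'}
step_probs = [
    [0.90, 0.05, 0.03, 0.02],  # padding
    [0.10, 0.70, 0.10, 0.10],  # '만나'
    [0.10, 0.10, 0.60, 0.20],  # '서'
    [0.10, 0.20, 0.10, 0.60],  # '반가워'
]

decoded = decode(step_probs, index_word)
print(decoded)
```

When every time step has the same argmax, the decoded string is one morpheme repeated, which is the pattern visible in the outputs above.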
