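Theano is the tool on which well-known open-source deep learning frameworks such as Keras were originally built. <br>
These days, TensorFlow and Theano are commonly compared so that the framework better suited to the task can be chosen.<br>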

# Automatic Sentence Generation

In [39]:
import csv
import itertools
import operator
import numpy as np
import nltk # natural language processing package used for sentence and word tokenization
import sys
from datetime import datetime
from utils import *
import tensorflow as tf

import matplotlib.pyplot as plt
%matplotlib inline
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package dependency_treebank is already up-to-date!
[nltk_data]    | Downloadi

True

### Setting the vocabulary limit and special tokens for RNN training

In [40]:
vocabulary_size = 8000 # limit the vocabulary to 8,000 words
unknown_token = "UNKNOWN_TOKEN" # substituted for any word that is not in the vocabulary
sentence_start_token = "SENTENCE_START" # marks the beginning of a sentence
sentence_end_token = "SENTENCE_END" # marks the end of a sentence

In [41]:
# Dataset of 15,000 comment strings downloaded through the Google BigQuery API
with open('./reddit-comment-20180718-104322.csv', 'r', encoding="utf-8") as f:
    reader = csv.reader(f, skipinitialspace=True, delimiter=",")
    next(reader)
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]

print("Parsed %d sentences." % (len(sentences)))

Parsed 37657 sentences.


In [42]:
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print("Found %d unique word tokens." % len(word_freq.items()))

Found 35260 unique word tokens.


In [43]:
# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

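print(index_to_word) # the vocabulary as a list of words

print("Using a vocabulary size of %d." % vocabulary_size)
print("The least frequent word in the vocabulary is '%s', which appears %d times." % (vocab[-1][0], vocab[-1][1]))

Using a vocabulary size of 8000.
The least frequent word in the vocabulary is 'ankle', which appears 4 times.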
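
The vocabulary step above can be illustrated on a toy corpus. This is a minimal sketch, not the notebook's code: it uses `collections.Counter` (which `nltk.FreqDist` subclasses) so that it runs without NLTK, and every `toy_*` name and the three sentences are made up for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical); the notebook uses tokenized Reddit comments instead.
toy_sentences = [
    ["SENTENCE_START", "i", "like", "tea", ".", "SENTENCE_END"],
    ["SENTENCE_START", "i", "like", "coffee", ".", "SENTENCE_END"],
    ["SENTENCE_START", "you", "like", "tea", ".", "SENTENCE_END"],
]
toy_vocabulary_size = 7  # 6 most frequent words + 1 slot for the unknown token
toy_unknown = "UNKNOWN_TOKEN"

# Count word frequencies and keep the (vocabulary_size - 1) most common words.
counts = Counter(w for sent in toy_sentences for w in sent)
toy_vocab = counts.most_common(toy_vocabulary_size - 1)
toy_index_to_word = [w for w, _ in toy_vocab] + [toy_unknown]
toy_word_to_index = {w: i for i, w in enumerate(toy_index_to_word)}

# Words that fell outside the vocabulary map to the index of the unknown token.
unknown_index = toy_word_to_index[toy_unknown]
encoded = [
    [toy_word_to_index.get(w, unknown_index) for w in sent]
    for sent in toy_sentences
]
print(toy_index_to_word)
print(encoded)
```

Because the unknown token is appended last, it always receives the highest index, which matches the `7999` entries that appear in the training data further down.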


In [44]:
# Replace every word that is not in the vocabulary with unknown_token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

# Print the first sentence and its tokenized form
print("\nExample sentence: '%s'" % sentences[0])
print("\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0])


Example sentence: 'SENTENCE_START i don't know where the people on this sub live. SENTENCE_END'

Example sentence after Pre-processing: '['SENTENCE_START', 'i', 'do', "n't", 'know', 'where', 'the', 'people', 'on', 'this', 'sub', 'live', '.', 'SENTENCE_END']'


In [45]:
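# Create the training data
# Represent each sentence as a sequence of word indices
# dtype=object keeps the variable-length lists as-is; recent NumPy versions require it for ragged input
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences], dtype=object)
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences], dtype=object)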

print(X_train, y_train)

[list([0, 8, 26, 16, 96, 140, 3, 45, 22, 20, 608, 288, 2])
 list([0, 6, 342, 37, 11, 50, 129, 288, 14, 39, 868, 4, 194, 670, 242, 606, 2])
 list([0, 10, 50, 27, 7999, 4, 29, 10, 268, 37, 80, 15, 40, 106, 22, 3, 1882, 35, 219, 9, 3, 873, 2])
 ... list([0, 1642, 2]) list([0, 25, 15, 88, 103, 7999, 1600, 2])
 list([0, 7, 43, 7999, 588, 20, 183, 4, 8, 141, 126, 25, 134, 2916, 6, 163, 9, 787, 5, 3, 280, 2])] [list([8, 26, 16, 96, 140, 3, 45, 22, 20, 608, 288, 2, 1])
 list([6, 342, 37, 11, 50, 129, 288, 14, 39, 868, 4, 194, 670, 242, 606, 2, 1])
 list([10, 50, 27, 7999, 4, 29, 10, 268, 37, 80, 15, 40, 106, 22, 3, 1882, 35, 219, 9, 3, 873, 2, 1])
 ... list([1642, 2, 1]) list([25, 15, 88, 103, 7999, 1600, 2, 1])
 list([7, 43, 7999, 588, 20, 183, 4, 8, 141, 126, 25, 134, 2916, 6, 163, 9, 787, 5, 3, 280, 2, 1])]
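
The output above shows that each target sequence is the input sequence shifted by one position: at every step the network is asked to predict the next word. A small sketch with a made-up index sequence (`toy_sentence` is hypothetical, with 0 and 1 standing for the start and end tokens as in the output):

```python
# Hypothetical index sequence: 0 = SENTENCE_START, 1 = SENTENCE_END.
toy_sentence = [0, 8, 26, 16, 2, 1]

x = toy_sentence[:-1]  # everything except the final SENTENCE_END
y = toy_sentence[1:]   # everything except the initial SENTENCE_START

# At step t the network reads x[t] and should predict y[t], i.e. x[t + 1].
for x_t, y_t in zip(x, y):
    print("input %2d -> target %2d" % (x_t, y_t))
```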


---
### Implementing backpropagation through time (BPTT)

In [49]:
def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]

RNNTheano.bptt = bptt
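
The `bptt` function relies on `forward_propagation` returning the outputs `o` and hidden states `s`, which the notebook does not show. The following NumPy sketch is one way such a forward pass could look. The shapes are inferred from how `bptt` indexes `U`, `V` and `W`, the tanh activation from the `1 - s ** 2` terms, and the name `forward_propagation_sketch` is hypothetical.

```python
import numpy as np

def forward_propagation_sketch(x, U, V, W):
    """Return (o, s) for a sequence of word indices x.

    Assumed shapes: U is (hidden_dim, word_dim), V is (word_dim, hidden_dim),
    W is (hidden_dim, hidden_dim).
    """
    T = len(x)
    hidden_dim = U.shape[0]
    # One extra row: s[-1] stays all zeros and serves as the initial state,
    # which is what s[bptt_step-1] refers to when bptt_step == 0.
    s = np.zeros((T + 1, hidden_dim))
    o = np.zeros((T, V.shape[0]))
    for t in range(T):
        # Selecting a column of U is equivalent to multiplying U by a one-hot vector.
        s[t] = np.tanh(U[:, x[t]] + W.dot(s[t - 1]))
        z = V.dot(s[t])
        e = np.exp(z - z.max())  # subtract the max for numerical stability
        o[t] = e / e.sum()       # softmax over the vocabulary
    return o, s

rng = np.random.RandomState(0)
word_dim, hidden_dim = 5, 3
U = rng.uniform(-0.5, 0.5, (hidden_dim, word_dim))
V = rng.uniform(-0.5, 0.5, (word_dim, hidden_dim))
W = rng.uniform(-0.5, 0.5, (hidden_dim, hidden_dim))
o, s = forward_propagation_sketch([0, 1, 2, 3], U, V, W)
print(o.shape, s.shape)
```

Each row of `o` is a probability distribution over the vocabulary, which is why `bptt` can form the output error by subtracting 1 at the index of the correct word.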

# Start training

In [61]:
# Outer SGD Loop
# - model: RNNTheano
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs

def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    for epoch in range(nepoch): # run nepoch epochs (100 by default)
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            # Compute the loss
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print ("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            
            
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5
                print ("Setting learning rate to %f" % learning_rate)
            
            
            sys.stdout.flush()
            
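        # Update the weights with SGD, one training example at a time
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1
    # Return the recorded (num_examples_seen, loss) pairs so the caller can plot them
    return losses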

In [62]:
np.random.seed(42)
# Train on a small subset of the data to see what happens
model = RNNTheano(vocabulary_size)
# Observe the RNN's vanishing gradient behavior.
losses = train_with_sgd(model, X_train[:100], y_train[:100], nepoch=100, evaluate_loss_after=1)

  o_t = T.nnet.softmax(V.dot(s_t))


2018-07-19 05:22:54: Loss after num_examples_seen=0 epoch=0: 8.987464
2018-07-19 05:22:59: Loss after num_examples_seen=100 epoch=1: 8.971848
2018-07-19 05:23:03: Loss after num_examples_seen=200 epoch=2: 8.947441
2018-07-19 05:23:08: Loss after num_examples_seen=300 epoch=3: 8.888321
2018-07-19 05:23:12: Loss after num_examples_seen=400 epoch=4: 7.688737
2018-07-19 05:23:16: Loss after num_examples_seen=500 epoch=5: 7.321134
2018-07-19 05:23:21: Loss after num_examples_seen=600 epoch=6: 7.134718
2018-07-19 05:23:25: Loss after num_examples_seen=700 epoch=7: 6.990103
2018-07-19 05:23:30: Loss after num_examples_seen=800 epoch=8: 6.889382
2018-07-19 05:23:34: Loss after num_examples_seen=900 epoch=9: 6.821579
2018-07-19 05:23:39: Loss after num_examples_seen=1000 epoch=10: 6.773368
2018-07-19 05:23:43: Loss after num_examples_seen=1100 epoch=11: 6.734595
2018-07-19 05:23:48: Loss after num_examples_seen=1200 epoch=12: 6.702999
2018-07-19 05:23:52: Loss after num_examples_seen=1300 epoch

KeyboardInterrupt: 
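
The first loss in the log above, 8.987464, is close to what a model guessing uniformly over 8,000 words would score. This assumes `calculate_loss` reports the mean per-word cross-entropy; the notebook does not show its definition, so treat this as a sanity check under that assumption rather than a statement about the library.

```python
import math

vocabulary_size = 8000
# Cross-entropy of a uniform guess over the vocabulary: -ln(1 / vocabulary_size).
uniform_loss = math.log(vocabulary_size)
print("%.4f" % uniform_loss)
```

A starting loss near this value suggests that the randomly initialized network is predicting roughly uniformly, and that the later drop toward 6.7 reflects actual learning.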

## Test with Theano framework

In [63]:
sys.path.append("./rnn-tutorial-rnnlm/")
from rnn_theano import RNNTheano, gradient_check_theano
from utilsfortheano import load_model_parameters_theano, save_model_parameters_theano

np.random.seed(42)
# To avoid performing millions of expensive calculations we use a smaller vocabulary size for checking.
grad_check_vocab_size = 5
model = RNNTheano(grad_check_vocab_size, 10)
gradient_check_theano(model, [0,1,2,3], [1,2,3,4])

  o_t = T.nnet.softmax(V.dot(s_t))


Performing gradient check for parameter U with size 50.
Gradient check for parameter U passed.
Performing gradient check for parameter V with size 50.
Gradient check for parameter V passed.
Performing gradient check for parameter W with size 100.
Gradient check for parameter W passed.
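
A gradient check of this kind compares analytic gradients against numerical estimates. The internals of `gradient_check_theano` are not shown here, so the sketch below only illustrates the general idea with central differences on a toy function. `numerical_gradient`, the step size `h` and the 0.01 threshold are illustrative choices, not values taken from the library.

```python
import numpy as np

def numerical_gradient(f, theta, h=1e-5):
    """Estimate df/dtheta with central differences, one entry at a time."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        original = theta[idx]
        theta[idx] = original + h
        f_plus = f(theta)
        theta[idx] = original - h
        f_minus = f(theta)
        theta[idx] = original  # restore the parameter before moving on
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Toy objective with a known gradient: f(theta) = sum(theta ** 2), so df/dtheta = 2 * theta.
theta = np.array([[1.0, -2.0], [0.5, 3.0]])
estimated = numerical_gradient(lambda p: np.sum(p ** 2), theta)
analytic = 2 * theta

# Relative error is scale independent, so one threshold works for all parameters.
relative_error = np.abs(estimated - analytic) / (np.abs(estimated) + np.abs(analytic))
print(relative_error.max() < 0.01)
```

Each parameter entry costs two evaluations of the loss, which is why the cell above uses a vocabulary of only 5 words for the check.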


In [64]:
np.random.seed(42)
model = RNNTheano(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)
model = RNNTheano(vocabulary_size, hidden_dim=50)

# losses = train_with_sgd(model, X_train, y_train, nepoch=50)
# save_model_parameters_theano('./data/trained-model-theano.npz', model)
load_model_parameters_theano('./rnn-tutorial-rnnlm/data/trained-model-theano.npz', model)
def generate_sentence(model):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    # Repeat until we get an end token
    while not new_sentence[-1] == word_to_index[sentence_end_token]:
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs[-1])
            sampled_word = np.argmax(samples)
        new_sentence.append(sampled_word)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    return sentence_str

num_sentences = 10
senten_min_length = 7

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model)
    print (" ".join(sent)+"\n")

  o_t = T.nnet.softmax(V.dot(s_t))


49 ms ± 894 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


  o_t = T.nnet.softmax(V.dot(s_t))


Loaded model parameters from ./rnn-tutorial-rnnlm/data/trained-model-theano.npz. hidden_dim=50 word_dim=8000
’ are sleeping and second ) just

a when have public i during quickly .

're bones a other have annoying people you feel .

watch if and frigate prakrit in he makes .

those 're legally have do you .

a when wrong to tell to latest thought did .

still on players ) notch ; to from and invaded .

a ] difference 's turned will ram did .

read , stages google i ‘ .

she want '' change are that .
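
The inner loop of `generate_sentence` draws a word from the predicted distribution and redraws whenever the unknown token comes up. A self-contained sketch of that sampling step, using a made-up four-word distribution (`toy_probs`, `sample_excluding` and `banned_index` are hypothetical names):

```python
import numpy as np

def sample_excluding(probs, banned_index, rng):
    """Draw one index from probs, redrawing while banned_index comes up."""
    sampled = banned_index
    while sampled == banned_index:
        # A single multinomial draw comes back as a one-hot count vector.
        one_hot = rng.multinomial(1, probs)
        sampled = int(np.argmax(one_hot))
    return sampled

rng = np.random.RandomState(42)
toy_probs = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical next-word distribution
draws = [sample_excluding(toy_probs, banned_index=3, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

Rejecting and redrawing has the same effect as renormalizing the distribution over the remaining words, so the relative probabilities of the allowed words are preserved.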

