In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.3.0'

# Text generation with LSTM

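This notebook contains the code examples for Chapter 8, Section 1 of [케라스 창시자에게 배우는 딥러닝](https://tensorflow.blog/%EC%BC%80%EB%9D%BC%EC%8A%A4-%EB%94%A5%EB%9F%AC%EB%8B%9D/), the Korean edition of Deep Learning with Python. The book has more content and figures; this notebook includes only the explanations that relate to the source code.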

----

[...]

## Implementing a character-level LSTM text generation model

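Let's implement these ideas in Keras. Training a language model requires a lot of text data. You can use any sufficiently large text file or collection of text files, such as Wikipedia or The Lord of the Rings. This example uses the writings of Nietzsche, the late-19th-century German philosopher (translated into English). The language model we learn is therefore not a generic model of English but a model of Nietzsche's writing style and topics.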

## Preparing the data

First, download the corpus and convert it to lowercase:

In [2]:
import keras
import numpy as np

path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))

Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt
Corpus length: 600893


In [3]:
type(text)

str

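Next, extract partially overlapping sequences of length `maxlen`, one-hot encode them, and pack them into a 3D NumPy array `x` of shape `(sequences, maxlen, unique_characters)`. At the same time, prepare an array `y` that holds the corresponding targets: the one-hot-encoded character that comes right after each extracted sequence.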

In [4]:
# Extract sequences of 60 characters.
maxlen = 60

# Sample a new sequence every three characters.
step = 3

# List that holds the extracted sequences
sentences = []

# List that holds the targets (the character following each sequence)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of the unique characters in the corpus
chars = sorted(set(text))
print('Unique characters:', len(chars))
# Dictionary that maps each character to its index in the chars list
char_indices = {char: i for i, char in enumerate(chars)}

# One-hot encode the characters into binary arrays.
# The built-in bool replaces np.bool, which recent NumPy releases no longer provide.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sequences: 200278
Unique characters: 58
Vectorization...
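
As a sanity check on the windowing logic, the following minimal sketch runs the same loop on a toy string. The names `make_windows`, `toy_text`, and `x_toy` are hypothetical and introduced only for this illustration. It also shows that the number of windows is `ceil((len(text) - maxlen) / step)`, which for the corpus size printed above gives 200278.

```python
import math
import numpy as np

def make_windows(text, maxlen, step):
    """Return (windows, targets) using the same loop as the notebook cell."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i: i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets

toy_text = "hello world, hello keras"  # hypothetical toy corpus (24 characters)
toy_maxlen, toy_step = 5, 3
windows, targets = make_windows(toy_text, toy_maxlen, toy_step)

# The number of windows follows directly from the range() arguments.
expected = math.ceil((len(toy_text) - toy_maxlen) / toy_step)
print(len(windows), expected)

# The same formula applied to the real corpus: 600893 characters, maxlen=60, step=3.
corpus_windows = math.ceil((600893 - 60) / 3)
print(corpus_windows)

# One-hot encode into shape (sequences, maxlen, unique_characters).
toy_chars = sorted(set(toy_text))
toy_index = {c: i for i, c in enumerate(toy_chars)}
x_toy = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, c in enumerate(window):
        x_toy[i, t, toy_index[c]] = True
print(x_toy.shape)
```

Each position in `x_toy` has exactly one `True` entry, so summing over the last axis gives 1 everywhere.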


## Building the network

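This network consists of a single `LSTM` layer followed by a `Dense` classifier, which produces a softmax output over all possible characters. Recurrent neural networks are not the only way to generate sequence data: 1D convnets have recently proven to work very well for this kind of task too.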

In [5]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

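
To get a feel for the size of this model, the parameter counts can be worked out by hand. This is a minimal sketch that assumes the Keras defaults (bias terms enabled, a standard LSTM with four gates) and the 58 unique characters reported above. The helper names `lstm_param_count` and `dense_param_count` are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # Four gate blocks (input, forget, cell candidate, output), each with an
    # input kernel, a recurrent kernel, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output unit.
    return input_dim * units + units

n_chars = 58   # unique characters reported for this corpus
units = 128    # LSTM width used in the cell above

lstm_params = lstm_param_count(n_chars, units)
dense_params = dense_param_count(units, n_chars)
total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)  # 95744 7482 103226
```

Under these assumptions, `model.summary()` should report the same totals.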


Because the targets are one-hot encoded, use the `categorical_crossentropy` loss to train the model:

In [6]:
optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
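
With a one-hot target, categorical cross-entropy reduces to the negative log of the probability the model assigns to the correct character. The following NumPy sketch illustrates that; it is not Keras's own implementation, and the function name, the `eps` clipping constant, and the example vectors are assumptions made for this illustration.

```python
import numpy as np

def categorical_crossentropy_np(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an assumed constant for this sketch.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# One sample over a hypothetical four-character vocabulary.
y_true = np.array([[0.0, 0.0, 1.0, 0.0]])   # the correct character is index 2
y_pred = np.array([[0.1, 0.2, 0.6, 0.1]])   # softmax output of a model

loss = categorical_crossentropy_np(y_true, y_pred)
print(loss, -np.log(0.6))
```

Only the probability at the target index contributes to the loss, which is why a one-hot `y` pairs naturally with a softmax output layer.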

## Training the language model and sampling from it

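Given a trained model and a short seed text, you can generate new text by repeating the following steps:

1.	Feed the text generated so far to the model to get a probability distribution over the next character.
2.	Reweight this distribution to a given temperature.
3.	Sample the next character at random from the reweighted distribution.
4.	Append the new character to the end of the generated text.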

The following code reweights the original probability distribution produced by the model and draws the index of a new character from it (the sampling function):

In [7]:
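def sample(preds, temperature=1.0):
    # Cast to float64 so the renormalized probabilities sum to 1 accurately
    preds = np.asarray(preds).astype('float64')
    # Divide the log-probabilities by the temperature
    preds = np.log(preds) / temperature
    # Exponentiate and renormalize (a softmax over the rescaled log-probabilities)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample; the result is a one-hot vector
    probas = np.random.multinomial(1, preds, 1)
    # Return the index of the sampled character
    return np.argmax(probas)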
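
To see what the temperature does before any sampling happens, here is a small sketch that applies the same reweighting to a made-up distribution and reports its entropy. The helper names `reweight` and `entropy` and the example distribution are hypothetical. Subtracting the maximum before exponentiating is an added numerical-stability step that leaves the result unchanged.

```python
import numpy as np

def reweight(preds, temperature):
    # Same transformation as in sample(), without the random draw.
    preds = np.asarray(preds, dtype='float64')
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / np.sum(exp_logits)

def entropy(p):
    # Shannon entropy in bits; higher means a flatter, more random distribution.
    return -np.sum(p * np.log2(p))

original = np.array([0.5, 0.3, 0.15, 0.05])  # hypothetical model output
entropies = {}
for temperature in [0.2, 0.5, 1.0, 1.2]:
    p = reweight(original, temperature)
    entropies[temperature] = entropy(p)
    print(temperature, np.round(p, 3), round(entropies[temperature], 3))
```

A temperature of 1.0 leaves the distribution unchanged, values below 1.0 concentrate probability on the most likely character, and values above 1.0 flatten the distribution.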

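Finally, the following loop repeatedly trains the model and generates text. After each epoch it generates text at several different temperatures. This lets you see how the generated text evolves as the model converges, and how the temperature affects the sampling strategy.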

In [8]:
import random
import sys

random.seed(42)
start_index = random.randint(0, len(text) - maxlen - 1)

# Train the model for 60 epochs
for epoch in range(1, 61):
    print('Epoch', epoch)
    # Fit the model for one pass over the data
    model.fit(x, y, batch_size=128, epochs=1)

    # Take the seed text; start_index was drawn once above, so every epoch uses the same seed
    seed_text = text[start_index: start_index + maxlen]
    print('--- Seed text: "' + seed_text + '"')

    # Try a range of sampling temperatures
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ Temperature:', temperature)
        generated_text = seed_text
        sys.stdout.write(generated_text)

        # Generate 400 characters, starting from the seed text
        for i in range(400):
            # One-hot encode the characters generated so far
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            # Sample the next character
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            # Slide the window: append the new character and drop the oldest one
            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

Epoch 1
Epoch 1/1
--- Seed text: "the slowly ascending ranks and classes, in which,
through fo"
------ Temperature: 0.2
the slowly ascending ranks and classes, in which,
through for the astent and allong and the interistion of the something the master and in a man a something and string the this of the are strend the self-and the something and as a souls a stand in the same all the the are the last and in the reast the persons and the also the master and in a man the conscience, the part of the making as a man in the really the something and man the man a think of the senst
------ Temperature: 0.5
the slowly ascending ranks and classes, in which,
through for the are make of the discomedity of the man because the masterss and interpection of the should morality and the secent which which in the beerable to be nother from to ethic can the moile of the scinguth a there in the harts and and must means of the action: the proment to stractire the man man the still lo

mose regard renate at loodles of all over sep sometion. they made for thew form--in, bron the footful
all 
------ Temperature: 1.2
the slowly ascending ranks and classes, in which,
through fowstulgen wampuld, did egain of the world mowing brought besie also "every having poments love, tradic forlevolative
fet is incpind, very inveryed nothing do?
" "other attens-or
do not civibidd,
and bodieant things of the of moral trie under. do rend are by aakids, exrestimitual, senfraung , metsop, every whate" hate
to
contemnceop light for aasmive" if him--nor trays intention on them, in indixtic
Epoch 9
Epoch 1/1
--- Seed text: "the slowly ascending ranks and classes, in which,
through fo"
------ Temperature: 0.2
the slowly ascending ranks and classes, in which,
through for the belief is so that the contradict the subject of the and the soul of the deception of the consists of the commander the religions and the concealed the and comparison of the same to delightful and the same the son and the experience and contrary 

discourses--the well-reason to the aimages--pirtewing he lacking, but the whole philosopher mutually eye spool by nature, perhaps accordant to
tacting, possibilities, by repute. in the
only iust "commanding in attained? and dangerous vew sworrm of the hepic intensation -hind affect and deal
funday and clear to
former
of the stench," of many ou
------ Temperature: 1.2
the slowly ascending ranks and classes, in which,
through foundation, the
cases,
"tirely higher kue, intentially soger iborfation tuithe a
one is avour agest think, and fund predion of schopenhauedly.="--a suicating regard to learns, we freewive rue knows, but humblenties
deedy, negation? by heave which still hopes, of existinguly belief of understand to cold
the very tears. "so, a colsing, creal senseon--chases-is the gueds, deliete at
aom,  owing,
of wej
Epoch 17
Epoch 1/1
--- Seed text: "the slowly ascending ranks and classes, in which,
through fo"
------ Temperature: 0.2
the slowly ascending ranks and classes, in which,
through for in the su



most seventy hiviliwoble say, enway, laid eguticed, which perhaps that un
in civwally, bold, nongean
fromnlity, good of aograhy, most, in the brinate, and like
to let us to me
to it would be mudyment of chinging," as truthfulness and
pognourhen
here and security of mothous modes on, light"
Epoch 19
Epoch 1/1
--- Seed text: "the slowly ascending ranks and classes, in which,
through fo"
------ Temperature: 0.2
the slowly ascending ranks and classes, in which,
through form the more comparanical person the commander the subject the contradishor to the contradishor to the same the successful the commander the commander the contemporarily comparanical end the more things of the experience of the experience of the same the world in the presentimine the world of the good the subtle the subnlining the same the reason to the devil and suffered in the sense of the same t
------ Temperature: 0.5
the slowly ascending ranks and classes, in which,
through form of the more
discovered the attester there is development which 

in
a being
li! bbost, less anothers sch, hardlgene prelave much and ih the signto work and advanners. christianity,
acts and deliring sympathy of all even but he debtures
and pessimistine or necessary from the preaching, indeed, accuracule sestes meas god, nespowen, however, and by mind like judgm these much care.

171. the humanit-:lishers othern, some appreciation motay for any orrxculous beli
Epoch 23
Epoch 1/1
--- Seed text: "the slowly ascending ranks and classes, in which,
through fo"
------ Temperature: 0.2
the slowly ascending ranks and classes, in which,
through former and sense of the same the sense of the matters of the soul is all the same and all the power to the philosophers of the sense of the philosophers of the same and distingust and distingust and sacrifice the evil and superiority of the same and subjection of the same and subjection of the same the stronger of the same the matters of the sense of the same and all the contrary and subject and sou
------ Temperature: 0.5
the slowly ascendi

to moral rights of happened times, the comprehension. it was eoritation of their equally confurins of selfess, which hence alone our own very "their makes thus will somanstrackes which it is ene" also, as it
usually with the platos to nowadays therepands, as so wrong it is a has to "disgaleless
her and rational
advancees 
------ Temperature: 1.2
the slowly ascending ranks and classes, in which,
through foc' usmofternessese, the wordism, of
suicabce
(what
light, in it: nams a smile is turn of rehard
through one, of thethly new raes, for these gavanity but curitus
stupages more soul. as
his vicauine music and fly beare
beeved, varieck that
sued ex-cerrable others account, for his bad tempo of
philosophical. their truthe). frath a e,
too variepory--man
praisedilitg do alchringful discovere gen) ateet
Epoch 31
Epoch 1/1
--- Seed text: "the slowly ascending ranks and classes, in which,
through fo"
------ Temperature: 0.2
the slowly ascending ranks and classes, in which,
through for the experient of the world and 

through fourness, psychols more so for i knows that self-derplate the poying, the midday, that was the
depthy, more
"bidder of nociotby through every lasis and by
the south, on a science, that ispating on an astath throlude medns cyre tauncture, vengenely of his precisely beneffals, in views good and single he had apthech through armeritude of the shutivity and science, is as feews! hourdeding,
"pass of inv
------ Temperature: 1.2
the slowly ascending ranks and classes, in which,
through founders, and it, and is these
ca notions of scientific craise yence merely-nature
and
extravampatal proud shadet, it is everynready sbunders thinkss or, prudence of the bjease when world jesumment the rudsed us the very inmeatations something one myexhous to away appreciation
agefully
once to yord. in evolving, and this work of the fatuer carrhuallished! a men of here this sastersts, inventedfer su
Epoch 35
Epoch 1/1
--- Seed text: "the slowly ascending ranks and classes, in which,
through fo"
------ Temperature: 0.2
the sl

through forething to an attain of the best for the contemm of religious and defented for the bad of the soul-has not been necessary and such a subtle of a matery on the are the way are should not something self-as the contradiction from the same creathing and life and the disposition of the good to any one
seems to the powerful expendence of the philosophy has assertion, for the militanity man is that the c
------ Temperature: 1.0
the slowly ascending ranks and classes, in which,
through following of her doing
through the solped state sometux it must praise and
to balow! that i kmon amost charm beliefing as individuve to sembies something is withdrates
that which do his question man therefore, as other all retain in arts "to pleasure and the philosophers
and viscoursts and creally "manifols and civilization,
however, the
putherst, the worth
of them
standardly, consequence,
but so sa
------ Temperature: 1.2
the slowly ascending ranks and classes, in which,
through fore
an ply elcheman move receenckains d

through formerly desire that the same and the same the same taste of the same the sense of the same the interests of the same the intellectual the same the sense of the things, the same the sense of the same the sense of the same and its again for the same the sense of the most person the religions and faculty and perceive that the same and in the same and the same common for the same the artists of the sam
------ Temperature: 0.5
the slowly ascending ranks and classes, in which,
through formerly not
now the artist of the thing of the same and experience and not common present, which the most the new interests, what is not the same the interests it is all the absued that the part of the most faith in the religion of the
comparison and consequently and religions and the intellectual man as the religion, the motives to the same their partision, the contrary for the most patently as su
------ Temperature: 1.0
the slowly ascending ranks and classes, in which,
through foot, to care may be that utrals of grad

through former are there is a conceptions of the most profound the problem of the expedient of the more things of a thing the more their the formul and for the problem that the world of the most personal conception of the most pain, and in the same and the more profound the most men and all the suffering the more property and the sense of the most men who has a man of the reality of the same artistic of the
------ Temperature: 0.5
the slowly ascending ranks and classes, in which,
through for the arrend the value of the comprehend the convolition, and almost the more profound the god and his spirit of men and manifold, as the belief in the conceptions of the formul and with a more pain of the progress, with here and with the mouth, to an attempt
a promisent of the old the early and the signification, are he may also the more promise as
they makes are may always the very one must perh
------ Temperature: 1.0
the slowly ascending ranks and classes, in which,
through for what nearently, imagination--there a

through formerly the same and stand in the world and successful and stand and in the superstition of the sense of the sense of the soul, the world and all the most distrust and desire the problem of the most deceived and all the spirit of the world and successful and the most delight to the subject and something the presence of the sense of the strength and foreign to the sense of the superstition of the wo
------ Temperature: 0.5
the slowly ascending ranks and classes, in which,
through formerly of all the position of the mad successful and all the future of the duting and schopenhauer's profound the states of his states the
sense of the whole the constant in the secined which still have the same and all the sense them as it is its worth; and they are not so domally with the soul, with the weaker become the consequences, and the world of man the truth the great to be something in s
------ Temperature: 1.0
the slowly ascending ranks and classes, in which,
through fordemptranity fentwiagre and
own, 
    

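As you can see, a low temperature produces extremely repetitive and predictable text, but its local structure is very realistic: most words (a word being a local pattern of characters) are real English words, with only occasional near-misses such as ‘contradishor’. With higher temperatures, the generated text becomes more interesting, surprising, even creative. It sometimes invents completely new words that look fairly plausible (such as ‘begarmed’ and ‘isharent’). At the highest temperatures, the local structure starts to break down, and most words look like semi-random strings of characters. For this network, 0.5 appears to be the best temperature for text generation, although that is a subjective judgment. Always experiment with multiple sampling strategies! A good balance between learned structure and randomness is what makes the results interesting.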

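Training a bigger, deeper model on more data would produce text samples that look far more coherent and realistic than these. Of course, do not expect the model to generate meaningful text other than by chance: all it does is sample data from a statistical model of which characters follow which. Language is a communication channel, and there is a difference between what a communication is about and the statistical structure of the messages that encode it. To examine this difference, consider a thought experiment: what if human language compressed communication much more efficiently, the way computers do for most digital communications? Language would be no less meaningful, but it would lose its intrinsic statistical structure, which would make it impossible to learn a language model like the one trained above.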
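
The thought experiment can be loosely illustrated with ordinary compression. The sketch below builds a structured text from a small, hypothetical vocabulary, compresses it with `zlib`, and compares the byte-level entropy before and after. The vocabulary, the seed, and the `byte_entropy` helper are assumptions made for this illustration, and byte entropy is only a crude proxy for "statistical structure".

```python
import math
import random
import zlib
from collections import Counter

def byte_entropy(data):
    # Empirical Shannon entropy of the byte distribution, in bits per byte.
    counts = Counter(data)
    total = len(data)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

# Hypothetical vocabulary; any text with regular structure would do.
vocab = ["the", "will", "to", "power", "truth", "and", "morality", "of",
         "good", "evil", "spirit", "man", "is", "a", "philosopher", "soul"]

random.seed(0)
raw = " ".join(random.choice(vocab) for _ in range(5000)).encode("ascii")
compressed = zlib.compress(raw, 9)

raw_entropy = byte_entropy(raw)
compressed_entropy = byte_entropy(compressed)
print(len(raw), len(compressed))
print(round(raw_entropy, 2), round(compressed_entropy, 2))
```

The compressed bytes carry the same message in less space, but their distribution is much closer to uniform, so a next-byte model has far less regularity to learn from.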

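## Wrapping up

* You can generate sequence data by training a model to predict the next token(s) given the previous tokens.
* For text, such a model is called a language model. It can operate on either words or characters.
* Sampling the next token requires a balance between adhering to what the model judges likely and introducing randomness.
* Softmax temperature is the tool used to strike this balance. Always experiment with different temperatures to find a suitable value.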