
In [None]:
import tensorflow
tensorflow.keras.__version__

'2.5.0'

# Text generation with LSTM

This notebook contains the code samples found in Chapter 8, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

[...]

## Implementing character-level LSTM text generation


Let's put these ideas in practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a 
language model. You could use any sufficiently large text file or set of text files -- Wikipedia, the Lord of the Rings, etc. In this 
example we will use some of the writings of Nietzsche, the late-19th century German philosopher (translated to English). The language model 
we will learn will thus be specifically a model of Nietzsche's writing style and topics of choice, rather than a more generic model of the 
English language.

- Here we practice text generation with an LSTM.

## Preparing the data

Let's start by downloading the corpus and converting it to lowercase:

- In this data-preparation step, we download the Nietzsche corpus.

In [None]:
import tensorflow.keras
import numpy as np

path = tensorflow.keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))

Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt
Corpus length: 600893


- The full text is about 600,893 characters long.

In [None]:
text[:100]

'preface\n\n\nsupposing that truth is a woman--what then? is there not ground\nfor suspecting that all ph'


Next, we will extract partially-overlapping sequences of length `maxlen`, one-hot encode them and pack them in a 3D Numpy array `x` of 
shape `(sequences, maxlen, unique_characters)`. Simultaneously, we prepare an array `y` containing the corresponding targets: the one-hot 
encoded characters that come right after each extracted sequence.

In [None]:
# Length of extracted character sequences
maxlen = 60  # take 60 characters as input and build a model that predicts the next character

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus (this part counts the distinct characters)
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
# Allocate x and y up front, then use char_indices to mark each character's position with a 1.
# The builtin `bool` works across NumPy versions; `np.bool` was deprecated in NumPy 1.20.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sequences: 200278
Unique characters: 57
Vectorization...


In [None]:
sentences[:4]  

['preface\n\n\nsupposing that truth is a woman--what then? is the',
 'face\n\n\nsupposing that truth is a woman--what then? is there ',
 'e\n\n\nsupposing that truth is a woman--what then? is there not',
 '\nsupposing that truth is a woman--what then? is there not gr']

- Looking at `sentences` on its own, we can see that the window advances 3 characters at a time.
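
To make the windowing concrete, here is a minimal sketch on a toy string. The helper `make_windows` and the `toy_` names are illustrative and not part of the original notebook; the loop mirrors the extraction loop above. The last lines redo the sequence count for the real corpus and estimate the size of `x`, assuming one byte per boolean entry.

```python
def make_windows(text, maxlen, step):
    """Slide a window of `maxlen` characters over `text`, moving `step` characters at a time."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])   # input: `maxlen` consecutive characters
        targets.append(text[i + maxlen])     # target: the character right after the window
    return windows, targets

toy_windows, toy_targets = make_windows("hello world", maxlen=5, step=3)
print(toy_windows)   # ['hello', 'lo wo']
print(toy_targets)   # [' ', 'r']

# Same arithmetic for the real corpus: 600893 characters, maxlen=60, step=3
n_sequences = len(range(0, 600893 - 60, 3))
print(n_sequences)   # 200278

# Size of the one-hot array `x`, assuming one byte per boolean entry
x_bytes = n_sequences * 60 * 57
print(x_bytes)       # 684950760
```

The count agrees with the `Number of sequences` printed by the cell above, and the estimate suggests `x` takes roughly 685 MB, which is worth keeping in mind on a small machine.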

In [None]:
list(set(text))  # returns every distinct character used in text
# the set is then converted to a list

['c',
 'e',
 '8',
 'z',
 '3',
 "'",
 'm',
 'é',
 'x',
 '4',
 'v',
 'q',
 '"',
 'æ',
 'u',
 'b',
 'p',
 'r',
 ',',
 'ä',
 'a',
 'n',
 'o',
 'l',
 '=',
 't',
 '2',
 'y',
 'k',
 ':',
 'ë',
 's',
 ')',
 '6',
 '7',
 '?',
 'd',
 '0',
 ' ',
 '!',
 'w',
 'i',
 ';',
 '(',
 '[',
 ']',
 'j',
 '9',
 '\n',
 'h',
 '1',
 '5',
 '-',
 '_',
 '.',
 'f',
 'g']

In [None]:
chars[0]

'\n'

In [None]:
char_indices['\n']

0

- With `chars` and `char_indices` we can convert between characters and numbers.
  Use `chars[]` to go from a number to a character, and `char_indices[]` to go from a character to a number.
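
As a small illustration of that two-way mapping, the sketch below builds both lookups for a toy string and round-trips a word through them. The names `toy_chars`, `toy_char_indices`, `encode` and `decode` are made up for this example so that they do not overwrite the notebook's own `chars` and `char_indices`.

```python
toy_text = "what then?"
toy_chars = sorted(set(toy_text))                            # index -> character
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}   # character -> index

def encode(s):
    """Characters to integer indices (the role of `char_indices[]`)."""
    return [toy_char_indices[c] for c in s]

def decode(indices):
    """Integer indices back to characters (the role of `chars[]`)."""
    return "".join(toy_chars[i] for i in indices)

ids = encode("what")
print(ids)          # [7, 4, 2, 6]
print(decode(ids))  # what
```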

In [None]:
char_indices


{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 "'": 4,
 '(': 5,
 ')': 6,
 ',': 7,
 '-': 8,
 '.': 9,
 '0': 10,
 '1': 11,
 '2': 12,
 '3': 13,
 '4': 14,
 '5': 15,
 '6': 16,
 '7': 17,
 '8': 18,
 '9': 19,
 ':': 20,
 ';': 21,
 '=': 22,
 '?': 23,
 '[': 24,
 ']': 25,
 '_': 26,
 'a': 27,
 'b': 28,
 'c': 29,
 'd': 30,
 'e': 31,
 'f': 32,
 'g': 33,
 'h': 34,
 'i': 35,
 'j': 36,
 'k': 37,
 'l': 38,
 'm': 39,
 'n': 40,
 'o': 41,
 'p': 42,
 'q': 43,
 'r': 44,
 's': 45,
 't': 46,
 'u': 47,
 'v': 48,
 'w': 49,
 'x': 50,
 'y': 51,
 'z': 52,
 'ä': 53,
 'æ': 54,
 'é': 55,
 'ë': 56}

In [None]:
next_chars[:4]

['r', 'n', ' ', 'o']

## Building the network

Our network is a single `LSTM` layer followed by a `Dense` classifier and softmax over all possible characters. But let us note that 
recurrent neural networks are not the only way to do sequence data generation; 1D convnets also have proven extremely successful at it in 
recent times.

We build the LSTM model to be fitted.

In [None]:
from tensorflow.keras import layers

model = tensorflow.keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))  # a simple model that predicts the next character
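
To get a feel for the size of this network, the following sketch counts its trainable parameters by hand. It assumes the standard Keras LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias vector); the helper names are illustrative. Comparing the result with `model.summary()` is a reasonable sanity check.

```python
def lstm_param_count(input_dim, units):
    # Four gates (input, forget, cell, output), each with an input kernel,
    # a recurrent kernel, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit
    return input_dim * units + units

n_chars = 57      # unique characters reported earlier in the notebook
lstm_units = 128  # size of the LSTM layer in the model above

lstm_params = lstm_param_count(n_chars, lstm_units)
dense_params = dense_param_count(lstm_units, n_chars)
print(lstm_params)                 # 95232
print(dense_params)                # 7353
print(lstm_params + dense_params)  # 102585
```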

Since our targets are one-hot encoded, we will use `categorical_crossentropy` as the loss to train the model:

In [None]:
optimizer = tensorflow.keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


## Training the language model and sampling from it


Given a trained model and a seed text snippet, we generate new text by repeatedly:

* 1) Drawing from the model a probability distribution over the next character given the text available so far
* 2) Reweighting the distribution to a certain "temperature"
* 3) Sampling the next character at random according to the reweighted distribution
* 4) Adding the new character at the end of the available text

This is the code we use to reweight the original probability distribution coming out of the model, 
and draw a character index from it (the "sampling function"):

- The code below defines the sampling function.

In [None]:
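def sample(preds, temperature=1.0):
    # Work in float64 to limit rounding error when renormalizing
    preds = np.asarray(preds).astype('float64')
    # Dividing log-probabilities by the temperature sharpens (T < 1) or flattens (T > 1) the distribution
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # Renormalize so the reweighted probabilities sum to 1
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample; `probas` is a one-hot row marking the chosen index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)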
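
To see what the reweighting does before any random draw, here is a minimal plain-Python sketch of the same arithmetic applied to a made-up distribution. The function names `reweight` and `entropy` and the values in `toy_probs` are assumptions for illustration; only the formula (log, divide by temperature, exponentiate, renormalize) comes from `sample` above.

```python
import math

def reweight(probs, temperature):
    """Plain-Python version of the reweighting step in `sample` (no random draw)."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in bits; higher means a more random choice."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical model output over 4 characters
for t in [0.2, 0.5, 1.0, 1.2]:
    p = reweight(toy_probs, t)
    print(t, [round(v, 3) for v in p], round(entropy(p), 3))
```

A temperature of 1.0 leaves the distribution unchanged, values below 1.0 concentrate probability on the most likely character, and values above 1.0 spread it out, which the entropy column makes visible.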


Finally, this is the loop where we repeatedly train and generate text. We start generating text using a range of different temperatures 
after every epoch. This allows us to see how the generated text evolves as the model starts converging, as well as the impact of 
temperature in the sampling strategy.

In [None]:
import random
import sys

for epoch in range(1, 60):  # run epochs 1 through 59, generating text from the model after each one
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,   # fit the model
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)  # pick a random position in the text and use it as the seed
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:  # the temperatures to try
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)

        # We generate 400 characters for each of the 4 temperatures
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

epoch 1
--- Generating with seed: "for example, in music;
and if a philosophy alleges to us the"
------ temperature: 0.2
for example, in music;
and if a philosophy alleges to us the self-and intelloper and the consequent, and the interracher and self-all the self-all the serience of the self-and the belief in a moral to the self-previdually and presentiness of the self-compariness and the have and anterially and the mankind and manifices of the self-and the self-consequents, and the self-divines himself the self-consequents, and the comes and conceptions and and the have and
------ temperature: 0.5
sequents, and the comes and conceptions and and the have and to a bound formunces of the many, are cause and teren in its our friend the fares and diviners, in the greates and all the meally the desestives and all the condriestions and one any perion actual a seater and its appechisciences in its becours, instimes in the destrainst for the fally. for sould in must consedfiness of diffore and


r nthrut
am forme, with anovglac through europe--the
other only doysmuch,"      that the by state of ratpersvkying and is also and proms religious and mediocre things. in the faculty. the mor gatt then, he englors--he bad but avis thesrame
as! veuseds
wure stage of beith other in--the
enotmise shares sfouns trevere of 
epoch 18
--- Generating with seed: ", they have to be something new, they have to
signify someth"
------ temperature: 0.2
, they have to be something new, they have to
signify something that the sense of the continue of the sense of the continue of the consideres of the continuers of the same of the strength of the spirit and suffering the spirit and suffering the spirit of the struggle of the strength of the continuer of the same of the consequent for the struggle of the develops of the sense of the consists of the continue of the consists of the spirit of the sense of the s
------ temperature: 0.5
continue of the consists of the spirit of the sense of the spirit of the

- At a low temperature the text is repetitive, but we can see that real English words do appear.
- A higher temperature produces more interesting words, but the model sometimes invents entirely new words or tends to produce text that makes no sense.


As you can see, a low temperature results in extremely repetitive and predictable text, but where local structure is highly realistic: in 
particular, nearly all words (a word being a local pattern of characters) are real English words. With higher temperatures, the generated text 
becomes more interesting, surprising, even creative; it may sometimes invent completely new words that sound somewhat plausible (such as 
"eterned" or "troveration"). With a high temperature, the local structure starts breaking down and most words look like semi-random strings 
of characters. Here, 0.5 appears to be the most interesting temperature for text generation in this specific setup. Always experiment 
with multiple sampling strategies! A clever balance between learned structure and randomness is what makes generation interesting.

Note that by training a bigger model, longer, on more data, you can achieve generated samples that will look much more coherent and 
realistic than ours. But of course, don't expect to ever generate any meaningful text, other than by random chance: all we are doing is 
sampling data from a statistical model of which characters come after which characters. Language is a communication channel, and there is 
a distinction between what communications are about, and the statistical structure of the messages in which communications are encoded. To 
evidence this distinction, here is a thought experiment: what if human language did a better job at compressing communications, much like 
our computers do with most of our digital communications? Then language would be no less meaningful, yet it would lack any intrinsic 
statistical structure, thus making it impossible to learn a language model like we just did.
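
The thought experiment can be approximated with a rough sketch: compress some text and compare how evenly the bytes are distributed before and after. Everything here is an assumption for illustration: the made-up vocabulary, the randomly drawn "corpus", the use of `zlib` as a stand-in for "better compression", and single-byte entropy as a crude proxy for statistical structure.

```python
import math
import random
import zlib
from collections import Counter

def byte_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte (maximum 8)."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical stand-in corpus: random draws from a small made-up vocabulary
rng = random.Random(0)
vocab = ["the", "spirit", "of", "self", "and", "truth", "is", "a",
         "woman", "what", "then", "moral", "sense", "strength", "belief", "philosophy"]
raw = " ".join(rng.choice(vocab) for _ in range(20000)).encode("ascii")

# Compress at the highest zlib level
compressed = zlib.compress(raw, 9)

print(len(raw), round(byte_entropy(raw), 2))
print(len(compressed), round(byte_entropy(compressed), 2))
```

The compressed bytes carry the same message (decompression recovers it exactly) but should look much closer to uniform noise, leaving a next-byte predictor far less regularity to learn from.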


## Takeaways

* We can generate discrete sequence data by training a model to predict the next token(s) given previous tokens.
* In the case of text, such a model is called a "language model" and could be based on either words or characters.
* Sampling the next token requires balance between adhering to what the model judges likely, and introducing randomness.
* One way to handle this is the notion of _softmax temperature_. Always experiment with different temperatures to find the "right" one.
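
As a brief illustration of the word-versus-character choice, the sketch below builds (context, next token) training pairs from the same sentence at both granularities. The helper `next_token_pairs` is a hypothetical name introduced for this example; the notebook itself only implements the character-level case.

```python
def next_token_pairs(tokens, context_len):
    """Return (context, next_token) training pairs from a token sequence."""
    return [(tokens[i:i + context_len], tokens[i + context_len])
            for i in range(len(tokens) - context_len)]

sentence = "truth is a woman"
char_tokens = list(sentence)     # character-level tokens
word_tokens = sentence.split()   # word-level tokens

print(next_token_pairs(char_tokens, 5)[0])  # (['t', 'r', 'u', 't', 'h'], ' ')
print(next_token_pairs(word_tokens, 2))     # [(['truth', 'is'], 'a'), (['is', 'a'], 'woman')]
```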