<a href="https://colab.research.google.com/github/jsansao/teic-20231/blob/main/TEIC_Licao19_Geracao_de_Texto_RNN_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Character-Level Text Generation with an RNN (LSTM)

This notebook shows how to train a recurrent neural network (LSTM) to generate text one character at a time.

We train the model on a small dataset of Shakespeare sonnets. The goal is for the model to learn the patterns of the language (which characters tend to follow which others) and then generate new text that resembles the original.

The process is split into:
1.  **Load and Prepare the Data**: Load the text and build a "vocabulary" of unique characters.
2.  **Mapping**: Create dictionaries that convert characters to integers and back.
3.  **Create Sequences**: Turn the text into input sequences (X) and the next target character (y).
4.  **Build the Model**: Define the network architecture with Embedding and LSTM layers.
5.  **Train the Model**: Train the network to predict the next character.
6.  **Generate Text**: Use the trained model to generate new text from an initial "seed".


## 1. Imports and Dataset Definition

In [6]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input
from tensorflow.keras.preprocessing.sequence import pad_sequences
import sys

print(f"TensorFlow version: {tf.__version__}")

TensorFlow version: 2.19.0


In [7]:

# Our "toy dataset": Shakespeare's Sonnet 18
# It is small enough to train quickly, but it has
# structure, punctuation, and capitalization for the network to learn.
text = """
Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance, or nature's changing course, untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou ow'st;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou grow'st:
  So long as men can breathe or eyes can see,
  So long lives this, and this gives life to thee.
"""

# Clean the text (optional, but it removes the leading \n)
text = text.strip()

print(f"Original text has {len(text)} characters.")

Original text has 625 characters.


In [32]:
# Our "toy dataset": several Shakespeare sonnets (Sonnets 1-10).
# This cell replaces the `text` defined in the previous cell.
# The text is significantly larger than the previous one,
# which lets the model learn richer patterns.
text = """
FROM fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel.
Thou that art now the world's fresh ornament
And only herald to the gaudy spring,
Within thine own bud buriest thy content
And, tender churl, mak'st waste in niggarding.
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.

When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery, so gazed on now,
Will be a tatter'd weed, of small worth held:
Then being ask'd where all thy beauty lies,
Where all the treasure of thy lusty days,
To say, within thine own deep-sunken eyes,
Were an all-eating shame and thriftless praise.
How much more praise deserved thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.

Look in thy glass, and tell the face thou viewest
Now is the time that face should form another;
Whose fresh repair if now thou not renewest,
Thou dost beguile the world, unbless some mother.
For where is she so fair whose unear'd womb
Disdains the tillage of thy husbandry?
Or who is he so fond will be the tomb
Of his self-love, to stop posterity?
Thou art thy mother's glass, and she in thee
Calls back the lovely April of her prime:
So thou through windows of thine age shalt see,
Despite of wrinkles, this thy golden time.
But if thou live, remember'd not to be,
Die single, and thine image dies with thee.

Unthrifty loveliness, why dost thou spend
Upon thyself thy beauty's legacy?
Nature's bequest gives nothing, but doth lend,
And being frank, she lends to those are free.
Then, beauteous niggard, why dost thou abuse
The bounteous largess given thee to give?
Profitless usurer, why dost thou use
So great a sum of sums, yet canst not live?
For having traffic with thyself alone,
Thou of thyself thy sweet self dost deceive.
Then how, when nature calls thee to be gone,
What acceptable audit canst thou leave?
Thy unused beauty must be tomb'd with thee,
Which, used, lives th' executor to be.

Those hours, that with gentle work did frame
The lovely gaze where every eye doth dwell,
Will play the tyrants to the very same
And that unfair which fairly doth excel;
For never-resting time leads summer on
To hideous winter and confounds him there;
Sap check'd with frost and lusty leaves quite gone,
Beauty o'ersnow'd and bareness every where:
Then, were not summer's distillation left,
A liquid prisoner pent in walls of glass,
Beauty's effect with beauty were bereft,
Nor it, nor no remembrance what it was:
But flowers distill'd, though they with winter meet,
Leese but their show; their substance still lives sweet.

Then let not winter's ragged hand deface
In thee thy summer, ere thou be distill'd:
Make sweet some vial; treasure thou some place
With beauty's treasure, ere it be self-kill'd.
That use is not forbidden usury,
Which happies those that pay the willing loan;
That's for thyself to breed another thee,
Or ten times happier, be it ten for one;
Ten times thyself were happier than thou art,
If ten of thine ten times refigured thee:
Then what could death do, if thou shouldst depart,
Leaving thee living in posterity?
Be not self-will'd, for thou art much too fair
To be death's conquest and make worms thine heir.

Lo, in the orient when the gracious light
Lifts up his burning head, each under eye
Doth homage to his new-appearing sight,
Serving with looks his sacred majesty;
And having climb'd the steep-up heavenly hill,
Resembling strong youth in his middle age,
Yet mortal looks adore his beauty still,
Attending on his golden pilgrimage;
But when from highmost pitch, with weary car,
Like feeble age, he reeleth from the day,
The eyes, 'fore duteous, now converted are
From his low tract, and look another way:
So thou, thyself outgoing in thy noon,
Unlook'd on diest, unless thou get a son.

Music to hear, why hear'st thou music sadly?
Sweets with sweets war not, joy delights in joy.
Why lov'st thou that which thou receiv'st not gladly,
Or else receiv'st with pleasure thine annoy?
If the true concord of well-tuned sounds,
By unions married, do offend thine ear,
They do but sweetly chide thee, who confounds
In singleness the parts that thou shouldst bear.
Mark how one string, sweet husband to another,
Strikes each in each by mutual ordering;
Resembling sire and child and happy mother,
Who, all in one, one pleasing note do sing:
Whose speechless song, being many, seeming one,
Sings this to thee: 'Thou single wilt prove none.'

Is it for fear to wet a widow's eye
That thou consum'st thyself in single life?
Ah! if thou issueless shalt hap to die,
The world will wail thee, like a makeless wife;
The world will be thy widow, and still weep
That thou no form of thee hast left behind,
When every private widow well may keep
By children's eyes her husband's shape in mind.
Look, what an unthrift in the world doth spend
Shifts but his place, for still the world enjoys it;
But beauty's waste hath in the world an end,
And kept unused, the user so destroys it.
No love toward others in that bosom sits
That on himself such murderous shame commits.

For shame! deny that thou bear'st love to any,
Who for thyself art so unprovident.
Grant, if thou wilt, thou art beloved of many,
But that thou none lov'st is most evident;
For thou art so possess'd with murderous hate,
That 'gainst thyself thou stick'st not to conspire,
Seeking that beauteous roof to ruinate
Which to repair should be thy chief desire.
O, change thy thought, that I may change my mind!
Shall hate be fairer lodged than gentle love?
Be, as thy presence is, gracious and kind,
Or to thyself at least kind-hearted prove:
Make thee another self, for love of me,
That beauty still may live in thine or thee.
"""

# Clean the text (strip extra whitespace/newlines at the start and end)
text = text.strip()

print(f"Original text has {len(text)} characters.")

Original text has 6149 characters.


## 2. Preprocessing: Character Mapping

The neural network does not understand 'a', 'b', 'c'. It understands numbers.

First, we build a "vocabulary" of all the unique characters in the text. Then we create two dictionaries:

1.  `char_to_int`: Maps a character (e.g. 'h') to a number (e.g. 15).
2.  `int_to_char`: Maps a number (e.g. 15) back to a character (e.g. 'h').
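
As a minimal sketch of the round trip these two dictionaries enable, the snippet below encodes a string to integers and decodes it back. The toy string and the names `encode`, `decode`, and the `_demo` variables are assumptions made for illustration; they are not part of the notebook.

```python
# Toy text standing in for the sonnets (an assumption for illustration).
sample = "hello world"

# Same construction as the notebook uses: sorted unique characters.
chars_demo = sorted(set(sample))
char_to_int_demo = {ch: i for i, ch in enumerate(chars_demo)}
int_to_char_demo = {i: ch for i, ch in enumerate(chars_demo)}


def encode(s, table):
    """Convert a string into a list of integer ids."""
    return [table[ch] for ch in s]


def decode(ids, table):
    """Convert a list of integer ids back into a string."""
    return "".join(table[i] for i in ids)


ids = encode("hello", char_to_int_demo)
print(ids)                            # integer ids for "hello"
print(decode(ids, int_to_char_demo))  # back to the original string
```

Because the vocabulary is built only from the characters present in the text, encoding a character that never appears in it raises a `KeyError`.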


In [33]:
# Find all unique characters and sort them
chars = sorted(list(set(text)))
vocab_size = len(chars)

print(f"Total unique characters (vocabulary): {vocab_size}")
print(f"Vocabulary: {''.join(chars)}")

# Create the mapping dictionaries
char_to_int = {char: i for i, char in enumerate(chars)}
int_to_char = {i: char for i, char in enumerate(chars)}

# Mapping example
print(f"\nExample: 'S' -> {char_to_int['S']}, 'a' -> {char_to_int['a']}")
print(f"Example: 10 -> '{int_to_char[10]}', 20 -> '{int_to_char[20]}'")


Total unique characters (vocabulary): 55
Vocabulary: 
 !',-.:;?ABCDFGHILMNOPRSTUWYabcdefghijklmnopqrstuvwxyz

Example: 'S' -> 24, 'a' -> 29
Example: 10 -> 'A', 20 -> 'N'


## 3. Creating the Training Sequences

Now we turn the text into a supervised learning problem. We use a "sliding window" to create input sequences (X) and a target character (y).

If `seq_length = 4` and the text is "hello":

* `X` = "hell", `y` = "o"
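
The sliding window can be sketched on plain strings before moving to integer ids. The helper name `make_windows` and the toy text are assumptions for illustration; the loop has the same shape as the one in the cell that follows.

```python
def make_windows(text, seq_length, step=1):
    """Return (inputs, targets): each input is `seq_length` characters
    and each target is the single character that follows it."""
    inputs, targets = [], []
    for i in range(0, len(text) - seq_length, step):
        inputs.append(text[i : i + seq_length])
        targets.append(text[i + seq_length])
    return inputs, targets


X_demo, y_demo = make_windows("hello world", seq_length=4)
for window, target in zip(X_demo[:3], y_demo[:3]):
    print(repr(window), "->", repr(target))
```

A text of `n` characters yields `n - seq_length` windows when `step = 1`, since the last window must still leave one character to serve as its target.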

In [34]:
# Length of each input sequence
seq_length = 40
step = 1 # Slide the window 1 character at a time

sequences = []   # Holds the input sequences (X)
next_chars = []  # Holds the target characters (y)

# Convert the whole text to integers
encoded_text = [char_to_int[c] for c in text]

# Create the sequences
for i in range(0, len(encoded_text) - seq_length, step):
    sequences.append(encoded_text[i : i + seq_length])
    next_chars.append(encoded_text[i + seq_length])

n_sequences = len(sequences)
print(f"Total training sequences: {n_sequences}")

# Convert to numpy arrays
X = np.array(sequences)
y = np.array(next_chars)

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

Total training sequences: 6109
Shape of X: (6109, 40)
Shape of y: (6109,)


## 4. Building the RNN (LSTM) Model

We build the model from four layers:

1.  **Input**: Defines the input shape, which is `(seq_length,)`.
2.  **Embedding**: Turns each integer id (e.g. 15) into a dense vector (e.g. `[0.1, 0.7, -0.2, ...]`). This lets the network learn *relationships* between characters; for example, characters that occur in similar contexts, such as the vowels 'a' and 'e', can end up with closer vectors than 'a' and 'z'.
3.  **LSTM**: The main recurrent layer, which processes the sequence and maintains memory across time steps.
4.  **Dense (output)**: A dense layer with `vocab_size` units and `softmax` activation. It outputs a probability distribution over *all possible characters* in the vocabulary, indicating which one the model expects next.
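
To get a feel for the size of this architecture, the sketch below counts trainable parameters by hand. It assumes the standard formulas for an embedding table, an LSTM layer with biases (four gates, each with an input kernel, a recurrent kernel, and a bias), and a dense layer with bias. The sizes 55, 64, and 128 are the vocabulary size, embedding dimension, and LSTM units used in this notebook; the function names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


vocab_size_demo, embedding_dim_demo, lstm_units_demo = 55, 64, 128

counts = {
    "embedding": embedding_params(vocab_size_demo, embedding_dim_demo),
    "lstm": lstm_params(embedding_dim_demo, lstm_units_demo),
    "dense": dense_params(lstm_units_demo, vocab_size_demo),
}
total_params = sum(counts.values())
print(counts)
print("total:", total_params)
```

Under these assumptions most of the parameters sit in the LSTM layer, which is worth keeping in mind given that the training set has only a few thousand sequences.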



In [35]:
# Hyperparameters
EMBEDDING_DIM = 64
LSTM_UNITS = 128

model = Sequential([
    # The input is a sequence of `seq_length` integers
    Input(shape=(seq_length,)),

    # Embedding layer: turns integers into dense vectors
    # input_dim = vocabulary size
    # output_dim = size of each embedding vector
    Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM),

    # The LSTM layer that processes the sequence of vectors
    LSTM(LSTM_UNITS),

    # The output layer that predicts the next character
    Dense(vocab_size, activation='softmax')
])

# Compile the model
# We use 'sparse_categorical_crossentropy' because our targets (y)
# are integers (0, 1, 2...), not one-hot vectors.
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()
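
The choice of `sparse_categorical_crossentropy` can be illustrated with a small hand computation. This is a sketch in plain Python, not the Keras implementation: for a single example, the sparse loss is the negative log of the probability assigned to the target index, which matches the one-hot form term by term. The function names and the three-class distribution are made up for illustration.

```python
import math


def sparse_cross_entropy(probs, target_index):
    """Loss for one example when the target is an integer class id."""
    return -math.log(probs[target_index])


def one_hot_cross_entropy(probs, one_hot):
    """Loss for one example when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


probs_demo = [0.1, 0.7, 0.2]       # hypothetical softmax output
target_demo = 1                    # integer target, as in `y`
one_hot_demo = [0.0, 1.0, 0.0]     # the same target, one-hot encoded

loss_sparse = sparse_cross_entropy(probs_demo, target_demo)
loss_one_hot = one_hot_cross_entropy(probs_demo, one_hot_demo)
print(loss_sparse, loss_one_hot)
```

Both forms give the same value; the sparse variant simply avoids materializing a one-hot vector of length `vocab_size` for every target.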


## 5. Training the Model

Now we train the model. Because the dataset is very small, each epoch is fast and we can afford a larger number of epochs.


In [36]:
print("Starting training...")

# The dataset is small, so we can use a high number of epochs.
# 50-100 epochs should show an interesting result.
history = model.fit(
    X,
    y,
    epochs=100,
    batch_size=64,
    verbose=1
)

print("Training finished.")

Starting training...
Epoch 1/100
96/96 ━━━━━━━━━━━━━━━━━━━━ 7s 60ms/step - accuracy: 0.1347 - loss: 3.4391
Epoch 2/100
96/96 ━━━━━━━━━━━━━━━━━━━━ 5s 54ms/step - accuracy: 0.1857 - loss: 3.0185
Epoch 3/100
96/96 ━━━━━━━━━━━━━━━━━━━━ 5s 57ms/step - accuracy: 0.2815 - loss: 2.6783
Epoch 4/100
96/96 ━━━━━━━━━━━━━━━━━━━━ 5s 56ms/step - accuracy: 0.3051 - loss: 2.4888
Epoch 5/100
96/96 ━━━━━━━━━━━━━━━━━━━━ 5s 54ms/step - accuracy: 0.3266 - loss: 2.3727
Epoch 6/100
96/96 ━━━━━━━━━━━━━━━━━━━━ 10s 57ms/step - accuracy: 0.3452 - loss: 2.2968
Epoch 7/100
96/96 ━━━━━━━━━━━━━━━━━━━━ 6s 60ms/step - accuracy: 0.3289 - loss: 2.2674
Epoch 8/100
96/96 ━━━━━━━━━━━━━━━━━━━━ 5s 51ms/step - accuracy: 0.3520 - loss: 2.2387
Epoch 9/100
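
One way to read the loss values in this log is to convert them to perplexity, `exp(loss)`, which can be interpreted as the effective number of characters the model is choosing among. The sketch below uses the epoch 1 and epoch 8 losses from the log and compares them with the loss of a uniform guess over the 55-character vocabulary. It assumes the loss is reported in nats (natural logarithm); the names are hypothetical.

```python
import math

vocab_size_demo = 55  # vocabulary size reported earlier in the notebook

# Loss of a model that assigns equal probability to every character.
uniform_loss = math.log(vocab_size_demo)


def perplexity(loss):
    """Convert a cross-entropy loss in nats to perplexity."""
    return math.exp(loss)


# Losses copied from the training log above (epochs 1 and 8).
logged_losses = {1: 3.4391, 8: 2.2387}

print(f"uniform guess: loss={uniform_loss:.4f}, perplexity={perplexity(uniform_loss):.1f}")
for epoch, loss in logged_losses.items():
    print(f"epoch {epoch}: loss={loss:.4f}, perplexity={perplexity(loss):.1f}")
```

These are training-set figures only; the notebook holds out no validation data, so they say nothing about how well the model generalizes.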


## 6. Text Generation

This is the fun part!

Generation is an iterative process:
1.  Provide a text "seed" (e.g. "Shall I compare").
2.  Prepare the seed: convert it to integers and pad it to length `seq_length`.
3.  Ask the model to *predict* the next character.
4.  The model returns probabilities. Instead of always taking the most likely character (which tends to produce dull, repetitive text), we *sample* from that distribution.
5.  Append the sampled character to the end of the seed (and drop the seed's first character).
6.  Repeat steps 3-5 to generate as many characters as we want.
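
The sampling step can be sketched in isolation. The snippet below adds a `temperature` knob, which is not part of this notebook and is shown only as a common variation: values below 1 sharpen the distribution toward the most likely character, and values above 1 flatten it. It uses only the standard library so that it runs without TensorFlow; the function names and the three-class distribution are assumptions for illustration.

```python
import math
import random


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution; temperature=1.0 leaves it unchanged."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space; the small floor avoids log(0).
    logits = [math.log(max(float(p), 1e-12)) / temperature for p in probs]
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, rng=random):
    """Draw one index at random, weighted by `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs_demo = [0.1, 0.7, 0.2]  # hypothetical model output
sharp = apply_temperature(probs_demo, temperature=0.1)
flat = apply_temperature(probs_demo, temperature=5.0)
print("sharp:", [round(p, 4) for p in sharp])
print("flat: ", [round(p, 4) for p in flat])
print("sampled index:", sample_index(flat, random.Random(0)))
```

As the temperature approaches zero the behavior approaches always picking the most likely character, which is the repetitive strategy step 4 warns about.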



In [37]:
def generate_text(model, seed_text, num_chars_to_gen=200):
    """
    Generate text with the trained model, one character at a time.
    """
    print(f"\n--- Generating text from seed: '{seed_text}' ---")

    # Holds the generated text
    generated_text = seed_text

    # 'current_input' is our "sliding window" over the text
    current_input = seed_text

    for i in range(num_chars_to_gen):
        # 1. Prepare the input (seed)
        # Convert the input text to integers
        x_pred = np.array([char_to_int[c] for c in current_input])

        # Pad (or truncate) to length `seq_length`
        # 'pre' padding: [0, 0, ..., id1, id2, id3]
        # Note: id 0 is also a real character in this vocabulary ('\n'),
        # so a seed shorter than `seq_length` is effectively prefixed with newlines.
        x_pred = pad_sequences([x_pred], maxlen=seq_length, padding='pre', truncating='pre')

        # 2. Make the prediction (get probabilities)
        # model.predict returns shape (1, vocab_size), so we take [0]
        preds = model.predict(x_pred, verbose=0)[0]

        # 3. Sample the next character
        # Cast to float64 and renormalize first: float32 softmax outputs may not
        # sum to 1 within the tolerance that np.random.choice requires.
        preds = np.asarray(preds, dtype=np.float64)
        preds /= preds.sum()
        # Sampling from the probabilities 'p' makes generation
        # non-deterministic and more varied.
        next_int = np.random.choice(len(chars), p=preds)

        # 4. Convert the integer back to a character
        next_char = int_to_char[next_int]

        # 5. Append to the result and update the input window
        generated_text += next_char
        current_input = current_input[1:] + next_char

        # Print the generated character in real time
        sys.stdout.write(next_char)
        sys.stdout.flush()

    print("\n--- End of generation ---")
    return generated_text

In [41]:

# Let's generate!
# Take a seed from a random position in the text.
start_index = np.random.randint(0, len(text) - seq_length - 1)
seed_text = text[start_index : start_index + seq_length]
# Generate the text
generated_text = generate_text(model, seed_text, num_chars_to_gen=300)


--- Generating text from seed: 'stop posterity?
Thou art thy mother's gl' ---
ass, and seathid;
To eart to be thy winters warone if there;
Srace if thy farld wilt ways hear's bearte
Or the wiel, from hichming not ent so viede
fatr of thou faccelv'st bacuners glive
Dof maness the conts torld on time leats
Lind to he has loof thyself anst beguith's deft
And summ'd with beauty b
--- End of generation ---



```

Like feeble age, he reeleth from the day,
The eyes, 'fore duteous, now converted are
From his low tract, and look another way:
So thou, thyself outgoing in thy noon,
Unlook'd on diest, unless thou get a son.


--- Generating text from seed: 'ch, with weary car,
Like feeble age, he ' ---
hee anund'st the ares,
Or tenger of thou alow a sonf in thy braire.
Prout thyself for stoms tont thyself thy eve?
To reweir my inow, she gold to though beraith's in thein.
Fo consing, of thee happier thou beauty livery And,
Reseaptures with tin widow werl, form dom die,
Thy entracs, and the ir thy f
--- End of generation ---


```

