## [Text Generation NLP – Everything You Need To Know / Python Code To Get Started](https://spotintelligence.com/2022/12/19/text-generation-nlp/)
Here is an example of how you could use the **NLTK library** to train a simple generative model for text using a **bigram language model**:

In [13]:
import nltk
from nltk.corpus import brown
from nltk.probability import ConditionalFreqDist

# Download the Brown corpus (skipped if it is already present)
nltk.download('brown')

# Load the data as a sequence of word tokens (no further preprocessing)
text = brown.words()

# Create a bigram language model
bigrams = nltk.bigrams(text)
cfd = ConditionalFreqDist(bigrams)
print(list(cfd.keys())[:20])  # Print the first 20 keys in cfd

# Generate text greedily: always append the most frequent follower
seed_text = "recent"
generated_text = seed_text
print(seed_text in cfd)  # Check that the seed word occurs in the corpus
for _ in range(10):
    # Find the most likely next word under the bigram model
    next_word = cfd[seed_text].max()
    generated_text += " " + next_word
    seed_text = next_word
print(generated_text)

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\karol\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that']
True
recent years ago , and the same time , and the
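
The output above shows the weakness of always picking the most frequent follower with `max()`: the text falls into the cycle `, and the same time , and the`. One common alternative is to sample the next word in proportion to its bigram count. The following is only a minimal sketch under assumptions: it uses the standard library and a toy corpus so that it runs without downloading Brown, and `build_bigram_counts`, `generate`, and the sample sentence are hypothetical names and data for illustration, not part of the original notebook.

```python
import random
from collections import Counter, defaultdict

def build_bigram_counts(words):
    """Map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, seed, length=10, rng=None):
    """Sample each next word in proportion to its bigram count."""
    rng = rng or random.Random()
    output = [seed]
    word = seed
    for _ in range(length):
        followers = counts.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        candidates = list(followers.keys())
        weights = list(followers.values())
        word = rng.choices(candidates, weights=weights, k=1)[0]
        output.append(word)
    return " ".join(output)

# Toy corpus (an assumption for illustration, not the Brown corpus)
toy_words = "the cat sat on the mat and the dog sat on the rug".split()
toy_counts = build_bigram_counts(toy_words)
sample = generate(toy_counts, "the", length=8, rng=random.Random(0))
print(sample)
```

With the notebook's data, `build_bigram_counts(list(brown.words()))` would play the role of `cfd`, since NLTK's `ConditionalFreqDist` stores the same follower counts.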


In [48]:
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Load and preprocess the data
# text = "This is an example of some text that we want to use to train a generative model."
text = " ".join(brown.words()[:50000])

# Tokenize the text and create a vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
vocab_size = len(tokenizer.word_index) + 1

# Convert the text to a sequence of word indices
sequences = tokenizer.texts_to_sequences([text])[0]

# Create input-output pairs
sequence_length = 10
input_sequences = []
for i in range(len(sequences) - sequence_length):
    input_sequences.append(sequences[i:i + sequence_length + 1])

# Convert to NumPy array
input_sequences = np.array(input_sequences)

# Split into X (inputs) and y (outputs)
X = input_sequences[:, :-1]  # Inputs: all except the last word
y = input_sequences[:, -1]   # Outputs: the last word

# One-hot encode the outputs
y = to_categorical(y, num_classes=vocab_size)

# Define the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=10, input_length=sequence_length))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))

# Compile and fit the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=100, verbose=1)

# Generate text
seed_text = "This is an"
for i in range(10):
    # Encode the seed text as a sequence of word indices
    seed_sequence = tokenizer.texts_to_sequences([seed_text])[0]
    seed_sequence = pad_sequences([seed_sequence], maxlen=sequence_length, padding='pre')
    # Predict the next word
    next_word_probs = model.predict(seed_sequence, verbose=0)[0]
    next_word_idx = np.argmax(next_word_probs)
    next_word = tokenizer.index_word.get(next_word_idx, '')  # Safely fetch word
    seed_text += " " + next_word

print(seed_text)


Epoch 1/100
1415/1415 ━━━━━━━━━━━━━━━━━━━━ 8s 5ms/step - accuracy: 0.0704 - loss: 7.5890
Epoch 2/100
1415/1415 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.0693 - loss: 6.9798
Epoch 3/100
1415/1415 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.0776 - loss: 6.8593
Epoch 4/100
1415/1415 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.0844 - loss: 6.7278
Epoch 5/100
1415/1415 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.0910 - loss: 6.5291
Epoch 6/100
1415/1415 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.0916 - loss: 6.3655
Epoch 7/100
1415/1415 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.0989 - loss: 6.1863
Epoch 8/100
1415/1415 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.1096 - loss: 5.9873
Epoch 9/100
...

In [50]:
seed_text = "It is"
for i in range(10):
    # Encode the seed text as a sequence of word indices
    seed_sequence = tokenizer.texts_to_sequences([seed_text])[0]
    seed_sequence = pad_sequences([seed_sequence], maxlen=sequence_length, padding='pre')
    # Predict the next word
    next_word_probs = model.predict(seed_sequence, verbose=0)[0]
    next_word_idx = np.argmax(next_word_probs)
    next_word = tokenizer.index_word.get(next_word_idx, '')  # Safely fetch word
    seed_text += " " + next_word

print(seed_text)

It is caught the vital departments of the legislature mississippi's mitchell the
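
Both generation loops pick the next word with `np.argmax`, so a given seed always produces the same continuation. A common alternative is temperature sampling: rescale the predicted probabilities and draw the next index at random. The sketch below is an illustration under assumptions rather than part of the original notebook: `sample_with_temperature` is a hypothetical helper, it uses only the standard library, and the probability vector is made up to stand in for `next_word_probs`.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw an index from `probs` after rescaling by `temperature`.

    temperature < 1 sharpens the distribution (closer to argmax);
    temperature > 1 flattens it (more varied output).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Work in log space; the small epsilon avoids log(0)
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)
    # Unnormalized weights; subtracting the peak keeps exp() numerically stable
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Made-up distribution over a 4-word vocabulary (stands in for next_word_probs)
toy_probs = [0.05, 0.7, 0.2, 0.05]
rng = random.Random(0)
draws = [sample_with_temperature(toy_probs, temperature=0.8, rng=rng) for _ in range(1000)]
most_common = max(set(draws), key=draws.count)
print(most_common)
```

In the loop above, passing `next_word_probs` to this helper in place of the `np.argmax` call would be one way to get varied output; which temperature works well for this model is not something the notebook establishes.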


### Conclusion
I need **a lot** of data, plus storage for it. For big data, it is worth considering other file formats.