## **Introduction**

Text generation refers to the process of using artificial intelligence algorithms to automatically generate natural language text. By generating large amounts of text data, machine learning algorithms can be trained to recognize patterns and relationships within language, which can then be used to develop more advanced NLP applications like chatbots and language translation.

### 1 Import Tools

In [1]:
import re
import string
import numpy as np 
import random
import pandas as pd 
import nltk
import keras.utils
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from tensorflow.keras.layers import Conv1D, Bidirectional, LSTM, Dense, Input, Dropout
from tensorflow.keras.layers import SpatialDropout1D
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow
from keras.preprocessing.text import Tokenizer
import string
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
           os.path.join(dirname, filename)

### 2 Load Dataset
Here, I use nyt-comments dataset for text-generation and take only articles file and then extract only headlines columns.

In [2]:
a_ap_17 = pd.read_csv("/kaggle/input/nyt-comments/ArticlesApril2017.csv")
a_ap_18 = pd.read_csv("/kaggle/input/nyt-comments/ArticlesApril2018.csv")
a_feb_17 = pd.read_csv("/kaggle/input/nyt-comments/ArticlesFeb2017.csv")
a_feb_18 = pd.read_csv("/kaggle/input/nyt-comments/ArticlesFeb2018.csv")
a_jan_17 = pd.read_csv("/kaggle/input/nyt-comments/ArticlesJan2017.csv")
a_jan_18 = pd.read_csv("/kaggle/input/nyt-comments/ArticlesJan2018.csv")
a_mr_17 = pd.read_csv("/kaggle/input/nyt-comments/ArticlesMarch2017.csv")
a_mr_18 = pd.read_csv("/kaggle/input/nyt-comments/ArticlesMarch2018.csv")
a_my_17 = pd.read_csv("/kaggle/input/nyt-comments/ArticlesMay2017.csv")

totaldata = pd.concat([a_ap_17,a_ap_18,a_feb_17,a_feb_18,a_jan_17,a_jan_18,a_mr_17,a_mr_18,a_my_17])

data = totaldata.headline
data = [i for i in data if i != "Unknown"]

### 3 Text Cleaning

Text cleaning in NLP refers to the process of removing irrelevant information from text data for task. It typically involves many steps, including Removing non-alphabetic characters, Lowercasing and more.

In [3]:
def cleaning(a):
    a = (a).lower()
    a = re.sub('[%s]' % re.escape(string.punctuation), '', a)
    return a
data = [cleaning(x) for x in data]

### 4 Tokenization and Creating N_gram Sequences
Tokenization is the process of converting text into a sequence of words or sub-words, known as tokens. In natural language processing, tokenization is a crucial step in preparing text data for further analysis or modeling. The Tokenizer class in the Keras library is a powerful tool for tokenizing text data. N-gram sequences are simply a sequence of N tokens in a row. They are used to represent the context of a word or phrase in a text sequence. To create n-gram sequences, we first tokenize the text data using the Tokenizer class in Keras. We then generate n-gram sequences by sliding a window of n words over the tokenized text and adding each window of words to a list.



In [4]:
token = Tokenizer()
token.fit_on_texts(data)
total_words = len(token.word_index) + 1 
Input = []
for i in data:
    token_list = token.texts_to_sequences([i])[0]
    for j in range(1, len(token_list)):
        n_gram = token_list[:j+1]
        Input.append(n_gram)
Input[:10]

[[381, 17],
 [381, 17, 5220],
 [381, 17, 5220, 511],
 [381, 17, 5220, 511, 4],
 [381, 17, 5220, 511, 4, 2],
 [381, 17, 5220, 511, 4, 2, 1573],
 [381, 17, 5220, 511, 4, 2, 1573, 139],
 [381, 17, 5220, 511, 4, 2, 1573, 139, 5],
 [381, 17, 5220, 511, 4, 2, 1573, 139, 5, 1930],
 [7, 69]]

### 5 Padding and Separate Label (target) and Predictors (features)
Padding in NLP refers to the process of adding zeros to the end of sequences or to the start of the sequences to make them all the same length. This is often necessary because neural networks require fixed-length inputs, and real-world text data typically varies in length.

In [5]:
max_sequence_len = len(max(Input, key=len))
Input = np.array(pad_sequences(Input, maxlen=max_sequence_len))
predictors = Input[:,:-1]  #selects all the columns of the Input array except for the last column
label = Input[:,-1] #elects only the last column of the Input array
label =keras.utils.to_categorical(label, num_classes=total_words)

### 6 Initialize Model

In [6]:
input_len = max_sequence_len - 1
model = Sequential()
model.add(Embedding(total_words,300, input_length=input_len))
model.add(LSTM(150))
model.add(Dropout(0.2))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')   
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 23, 300)           3645000   
                                                                 
 lstm (LSTM)                 (None, 150)               270600    
                                                                 
 dropout (Dropout)           (None, 150)               0         
                                                                 
 dense (Dense)               (None, 12150)             1834650   
                                                                 
Total params: 5,750,250
Trainable params: 5,750,250
Non-trainable params: 0
_________________________________________________________________


### 7 Train the Model

In [7]:
model.fit(predictors, label, epochs=70)

Epoch 1/70
Epoch 2/70
Epoch 3/70
Epoch 4/70
Epoch 5/70
Epoch 6/70
Epoch 7/70
Epoch 8/70
Epoch 9/70
Epoch 10/70
Epoch 11/70
Epoch 12/70
Epoch 13/70
Epoch 14/70
Epoch 15/70
Epoch 16/70
Epoch 17/70
Epoch 18/70
Epoch 19/70
Epoch 20/70
Epoch 21/70
Epoch 22/70
Epoch 23/70
Epoch 24/70
Epoch 25/70
Epoch 26/70
Epoch 27/70
Epoch 28/70
Epoch 29/70
Epoch 30/70
Epoch 31/70
Epoch 32/70
Epoch 33/70
Epoch 34/70
Epoch 35/70
Epoch 36/70
Epoch 37/70
Epoch 38/70
Epoch 39/70
Epoch 40/70
Epoch 41/70
Epoch 42/70
Epoch 43/70
Epoch 44/70
Epoch 45/70
Epoch 46/70
Epoch 47/70
Epoch 48/70
Epoch 49/70
Epoch 50/70
Epoch 51/70
Epoch 52/70
Epoch 53/70
Epoch 54/70
Epoch 55/70
Epoch 56/70
Epoch 57/70
Epoch 58/70
Epoch 59/70
Epoch 60/70
Epoch 61/70
Epoch 62/70
Epoch 63/70
Epoch 64/70
Epoch 65/70
Epoch 66/70
Epoch 67/70
Epoch 68/70
Epoch 69/70
Epoch 70/70


<keras.callbacks.History at 0x7128d1e3fe10>

### 8 Define Function to Generate Text

In [8]:
def generate_text(model, token, max_sequence_len):
    #seed_text = input("Enter some text: ")
    seed_text = "President Donald Trump"
    #next_words = int(input("Enter the number of words to generate: "))
    next_words = 5
    for _ in range(next_words):
        token_list = token.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)[0]
        
        predicted = np.argmax(predicted_probs) + 1
        
        output_word = ""
        for word,index in token.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text.title()

### 9 Function Call

In [9]:
generated_text = generate_text(model, token, max_sequence_len)
print("The generated text is : ",generated_text)


The generated text is :  President Donald Trump To War Is Their Trivia
