## Creating NN 

To support developing a model that can generate text across different domains (each with its own level of text difficulty), we first need a base implementation of a neural network (NN) for text generation.

__Text generation is defined as "the task of generating text with the goal of appearing indistinguishable to human-written text."__

In this case, we will attempt to build short stories and articles by steering the model's output. We do this by providing a starter sentence that the model uses for tone, subject, and sentiment.

#### Key Resources 

- [News Articles](https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection) (non-fiction, adult)
- [Blogs](https://imerit.net/blog/25-best-nlp-datasets-for-machine-learning-all-pbm/) (non-fiction, adult)
- [Edgar Allan Poe's Short Stories](https://www.kaggle.com/datasets/leangab/poe-short-stories-corpuscsv) (fiction, adult)
- [Professional/Law](https://metatext.io/datasets/hansards-canadian-parliament) (law, adult)
- [Harry Potter Text Corpus](https://www.kaggle.com/datasets/balabaskar/harry-potter-books-corpora-part-1-7) (fiction, young adult)
- [Children's Books](https://venturebeat.com/business/facebook-releases-1-6gb-data-set-of-childrens-stories-for-training-its-ai/) (fiction, children)

*Potentially for use* 
- [Book Summaries](https://www.kaggle.com/datasets/applecrazy/cmu-book-summary-dataset)


In [30]:
import pandas as pd
import re
import random
import torch



In [13]:
# defining preprocessing steps 
def preprocess(text):
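	# replace every run of non-alphanumeric characters with a single space
	text_input = re.sub(r'[^a-zA-Z0-9]+', ' ', str(text))
	# drop the remaining digits
	out = re.sub(r'\d+', '', text_input)
	return out.lower().strip()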
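
To make the effect of the preprocessing step concrete, here is a small standalone sketch that runs a copy of `preprocess` (with the digit class written as `0-9`) on a made-up sentence. The sample text is an assumption for illustration and is not taken from the dataset.

```python
import re

def preprocess(text):
    # Replace every run of non-alphanumeric characters with a single space.
    text_input = re.sub(r'[^a-zA-Z0-9]+', ' ', str(text))
    # Drop the digits that survived the first pass.
    out = re.sub(r'\d+', '', text_input)
    return out.lower().strip()

# Hypothetical sample sentence, used only to illustrate the behaviour.
sample = "Mr. Dursley, of number 4 Privet Drive!"
cleaned = preprocess(sample)
print(repr(cleaned))  # 'mr dursley of number  privet drive'
```

Note the double space left behind where the digit was removed; a whitespace-based tokenizer will usually ignore it, but it may matter if the cleaned text is used elsewhere.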

In [14]:
# load in HP data 
hp = pd.read_csv("../data/harrypotter.csv")
hp['sentence'] = [preprocess(t) for t in hp['sentence']]

### Split into training/validation datasets

In [31]:
# work on a plain Python list: shuffling a pandas Series in place
# triggers a SettingWithCopyWarning
hpd = hp['sentence'].tolist()

# shuffle data
random.shuffle(hpd)

# fraction of training data
split_train_valid = 0.9

# split dataset
train_size = int(split_train_valid * len(hpd))
valid_size = len(hpd) - train_size
train_dataset, valid_dataset = torch.utils.data.random_split(hpd, [train_size, valid_size])
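
The shuffle-then-split logic can also be sketched without pandas or torch. The helper name `shuffle_and_split`, the fixed seed, and the toy sentences below are assumptions for illustration, not part of the notebook.

```python
import random

def shuffle_and_split(items, train_fraction=0.9, seed=0):
    """Shuffle a copy of items and split it into (train, valid) lists."""
    shuffled = list(items)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train_size = int(train_fraction * len(shuffled))
    return shuffled[:train_size], shuffled[train_size:]

# Hypothetical stand-in for the list of preprocessed sentences.
sentences = [f"sentence {i}" for i in range(20)]
train, valid = shuffle_and_split(sentences)
print(len(train), len(valid))  # 18 2
```

Copying to a list first avoids mutating the original column, which is the situation pandas warns about when a Series is shuffled in place.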


In [None]:
def make_dataset(dataset, epochs):
    total_text = '<|endoftext|>'
    tt = [t for t in dataset]
    for _ in range(epochs):
        random.shuffle(tt)
        total_text += '<|endoftext|>'.join(tt) + '<|endoftext|>'
    return total_text
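
As a quick check of what `make_dataset` produces, the sketch below runs a copy of it on three made-up sentences (an assumption for illustration). Each epoch appends every sentence once, in a fresh random order, with `<|endoftext|>` as the separator.

```python
import random

def make_dataset(dataset, epochs):
    total_text = '<|endoftext|>'
    tt = list(dataset)
    for _ in range(epochs):
        random.shuffle(tt)
        total_text += '<|endoftext|>'.join(tt) + '<|endoftext|>'
    return total_text

# Hypothetical toy corpus.
toy = ["harry looked up", "ron laughed", "hermione sighed"]
text = make_dataset(toy, epochs=2)

# One leading marker plus one marker after each sentence in each epoch.
print(text.count('<|endoftext|>'))  # 7
```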

In [15]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(hp['sentence'])
# tokenizer.get_config()

In [16]:
input_sequences = []

for review in hp['sentence']:
	token_list = tokenizer.texts_to_sequences([review])[0]
	# every prefix of length >= 2 becomes one training sequence
	for i in range(1, len(token_list)):
		n_gram_sequence = token_list[:i+1]
		input_sequences.append(n_gram_sequence)
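
The loop above turns each tokenized sentence into a series of growing prefixes, so that the last token of each prefix can later serve as the prediction target. A minimal sketch of that step on a made-up token list (the helper name `prefix_sequences` and the token ids are assumptions):

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

# Hypothetical token ids for a four-word sentence.
tokens = [12, 7, 45, 3]
for seq in prefix_sequences(tokens):
    print(seq)
# [12, 7]
# [12, 7, 45]
# [12, 7, 45, 3]
```

A sentence of n tokens therefore contributes n - 1 training sequences.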

In [17]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import tensorflow.keras.utils as ku

# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
total_words = len(tokenizer.word_index) + 1
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
label = ku.to_categorical(label, num_classes=total_words)
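
The padding and predictor/label split can be illustrated in plain Python. This is a sketch only: `pad_pre` is a hypothetical helper that mimics left-padding with zeros, and the token ids are made up.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with value up to maxlen (dropping leading items if longer)."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

# Hypothetical prefix sequences for one sentence.
sequences = [[12, 7], [12, 7, 45], [12, 7, 45, 3]]
max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]

# Everything but the last token is the input; the last token is the target.
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

print(predictors)  # [[0, 0, 12], [0, 12, 7], [12, 7, 45]]
print(labels)      # [7, 45, 3]
```

Padding on the left keeps the target word in the final column for every row, which is why a single slice is enough to separate inputs from labels.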

In [18]:
# import from tensorflow.keras throughout, matching the earlier cells
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential

def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

In [19]:
model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 201, 10)           77910     
                                                                 
 lstm (LSTM)                 (None, 100)               44400     
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 7791)              786891    
                                                                 
Total params: 909,201
Trainable params: 909,201
Non-trainable params: 0
_________________________________________________________________
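
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the variable names are chosen here for illustration.

```python
vocab_size = 7791   # total_words, from the Dense output shape
embed_dim = 10      # embedding size passed to Embedding
lstm_units = 100    # units passed to LSTM

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Weight matrix plus bias for the softmax output layer.
dense_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 77910 44400 786891 909201
```

Most of the parameters sit in the output layer, since it scales with the vocabulary size.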


In [21]:
# documented verbose values are 0 (silent), 1 (progress bar), and 2 (one line per epoch)
model.fit(predictors, label, epochs=1, verbose=0)

<keras.callbacks.History at 0x3becbcd00>

### Model Built... now generate the text

In [27]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        # encode the running text and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # greedy decoding: take the index of the most probable next word
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])

        # index 0 is reserved for padding and maps to no word
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text.title()
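
The generation loop is a greedy decoder: at each step it appends the single most probable next word and feeds the longer text back in. The sketch below isolates that idea with a stand-in for the model. `greedy_generate`, `toy_predict`, and the three-word vocabulary are all hypothetical and exist only to make the example runnable without Keras.

```python
def greedy_generate(seed_tokens, next_words, predict_proba, index_word):
    """Append the arg-max token next_words times, then decode to text."""
    tokens = list(seed_tokens)
    for _ in range(next_words):
        probs = predict_proba(tokens)
        # Index of the highest probability, i.e. the greedy choice.
        best = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(best)
    return " ".join(index_word[t] for t in tokens)

# Hypothetical reverse vocabulary; index 0 is left for padding.
index_word = {1: "harry", 2: "ron", 3: "ran"}

def toy_predict(tokens):
    # Stand-in for model.predict: favours the word after the last one, cyclically.
    probs = [0.0, 0.1, 0.1, 0.1]
    probs[tokens[-1] % 3 + 1] = 0.7
    return probs

print(greedy_generate([1], 3, toy_predict, index_word))  # harry ron ran harry
```

Greedy decoding is deterministic, so the same seed always yields the same continuation; sampling from the predicted distribution instead would give more varied text.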

In [28]:
# max_sequence_len must be the value the model was built with. Passing 10 padded
# the input to length 9 and raised:
#   ValueError: Input 0 of layer "sequential" is incompatible with the layer:
#   expected shape=(None, 201), found shape=(None, 9)
print(generate_text("ron and harry were on their way to", 1, model, max_sequence_len))
