# GAN Model Training

GANs are used for generating words that might match up to the input provided. For example, for the input "I am feeling", the model might output "I am feeling very great today!". It produces the output based on patterns learned from the training dataset, which might not be well suited to what I am aiming for right now. Nonetheless, I still gave it a try.

## Importing the data

In [1]:
import pandas as pd
from nltk.tokenize import word_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
import string
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout, Input
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from random import randint

In [2]:
df = pd.read_csv("data/cleaned_training_data.csv")

## Cleaning the data

In [3]:
# text = ''.join([c for c in ' '.join(df.dropna().values.flatten()).lower() if c not in string.punctuation])
# Make sure every prompt ends with a period, then join everything into one lowercase string
sentences = [str(sentence).strip() for sentence in df['prompts'].values.flatten()]
text = ' '.join([s if s.endswith('.') else s + '.' for s in sentences]).lower()
words = word_tokenize(text)
n_words = len(words)
unique_words = len(set(words))

print('Total Words: %d' % n_words)
print('Unique Words: %d' % unique_words)

Total Words: 678114
Unique Words: 16515


In [4]:
unique_chars = sorted(list(set(text)))
char_to_index = {char: idx for idx, char in enumerate(unique_chars)}
char_to_index

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 '$': 4,
 "'": 5,
 '(': 6,
 ')': 7,
 '*': 8,
 '+': 9,
 ',': 10,
 '-': 11,
 '.': 12,
 '/': 13,
 '0': 14,
 '1': 15,
 '2': 16,
 '3': 17,
 '4': 18,
 '5': 19,
 '6': 20,
 '7': 21,
 '8': 22,
 '9': 23,
 ':': 24,
 ';': 25,
 '?': 26,
 '_': 27,
 'a': 28,
 'b': 29,
 'c': 30,
 'd': 31,
 'e': 32,
 'f': 33,
 'g': 34,
 'h': 35,
 'i': 36,
 'j': 37,
 'k': 38,
 'l': 39,
 'm': 40,
 'n': 41,
 'o': 42,
 'p': 43,
 'q': 44,
 'r': 45,
 's': 46,
 't': 47,
 'u': 48,
 'v': 49,
 'w': 50,
 'x': 51,
 'y': 52,
 'z': 53,
 '|': 54,
 'à': 55,
 'á': 56,
 'ç': 57,
 'è': 58,
 'é': 59,
 'í': 60,
 'ð': 61,
 'ñ': 62,
 'ó': 63,
 'ô': 64,
 'ù': 65,
 'ú': 66,
 'ł': 67,
 'ń': 68,
 'š': 69,
 'ž': 70,
 'क': 71,
 'च': 72,
 'प': 73,
 'म': 74,
 'र': 75,
 'व': 76,
 'ी': 77,
 '्': 78,
 '–': 79,
 '—': 80,
 '‘': 81,
 '’': 82,
 '“': 83,
 '”': 84,
 '•': 85,
 '→': 86,
 '道': 87}

## Preparing data for training

In [5]:
input_sequence = []
output_words = []
input_seq_length = 40

# Slide a window over the text: 40 characters in, the following character out
for i in range(0, len(text) - input_seq_length, 1):
    in_seq = text[i:i + input_seq_length]
    out_seq = text[i + input_seq_length]
    input_sequence.append([char_to_index[char] for char in in_seq])
    output_words.append(char_to_index[out_seq])
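To make the windowing above concrete, here is a minimal sketch of the same loop on a toy string. The names `toy_text` and `toy_seq_length` and their values are assumptions chosen for illustration; the notebook itself uses the full `text` and a window of 40.

```python
# Toy corpus and window length; both are illustrative, not the notebook's values.
toy_text = "hello world"
toy_seq_length = 4

# Same character-to-index mapping idea as char_to_index in the notebook.
toy_chars = sorted(set(toy_text))
toy_char_to_index = {char: idx for idx, char in enumerate(toy_chars)}

windows = []
targets = []
for i in range(len(toy_text) - toy_seq_length):
    windows.append(toy_text[i:i + toy_seq_length])   # input characters
    targets.append(toy_text[i + toy_seq_length])     # the character to predict

# Show the first few (input window -> next character) pairs and their encodings.
for window, target in zip(windows[:3], targets[:3]):
    encoded = [toy_char_to_index[c] for c in window]
    print(repr(window), "->", repr(target), encoded, toy_char_to_index[target])
```

Each step shifts the window by one character, so consecutive samples overlap in all but one position. That is why a text of a few million characters yields a few million training samples.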

In [6]:
X = np.reshape(input_sequence, (len(input_sequence), input_seq_length, 1))
X = X / float(len(unique_chars))

y = to_categorical(output_words, num_classes=len(unique_chars))
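The reshaping, scaling, and one-hot encoding in this cell can be checked on a tiny example. The sketch below uses only NumPy; `np.eye(...)[...]` is a stand-in that I assume yields the same one-hot layout as `to_categorical`, and all `toy_*` names and values are hypothetical.

```python
import numpy as np

# Two already-encoded windows of length 3 and their target indices (toy values).
toy_indices = [[0, 2, 1], [2, 1, 3]]
toy_targets = [3, 0]
toy_vocab_size = 4

# Shape (samples, timesteps, features), then scale the indices into [0, 1).
X_toy = np.reshape(toy_indices, (len(toy_indices), 3, 1)) / float(toy_vocab_size)

# One-hot targets: row i has a single 1 at column toy_targets[i].
y_toy = np.eye(toy_vocab_size)[toy_targets]

print(X_toy.shape, y_toy.shape)
print(y_toy)
```

Note that dividing indices by the vocabulary size treats characters as points on a numeric scale, so 'a' and 'b' look "close" to the network only because of sort order. Feeding integer indices into an `Embedding` layer (already imported above but unused) is a common alternative, though I have not tried it here.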

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=.2,
    random_state=12
)
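One caveat with the random split above: because neighbouring windows overlap in 39 of their 40 characters, a shuffled split puts near-identical samples in both sets. A position-based split is one possible alternative. The sketch below is an assumption-laden illustration, not what the notebook does; `chronological_split` and its `gap` parameter are hypothetical names.

```python
import numpy as np

def chronological_split(X, y, test_fraction=0.2, gap=0):
    """Split by position instead of shuffling.

    `gap` samples are skipped between the two parts so that test windows
    do not share characters with training windows (use gap = window length).
    """
    n = len(X)
    n_test = int(round(n * test_fraction))
    split = n - n_test
    return X[:split], X[split + gap:], y[:split], y[split + gap:]

# Toy data standing in for the real X and y.
X_demo = np.arange(100).reshape(100, 1)
y_demo = np.arange(100)
X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo, test_fraction=0.2, gap=5)
print(len(X_tr), len(X_te), X_te[0, 0])
```

With the real data, `gap=input_seq_length` would keep the last training window and the first test window from overlapping.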

In [8]:
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

X_train shape: (2952565, 40, 1)
y_train shape: (2952565, 88)


## Initializing model

In [9]:
model = Sequential([
    Input((input_seq_length, 1)),
    LSTM(128, return_sequences=True),
    LSTM(128),
    Dense(len(unique_chars), activation='softmax')
])
model.summary()

model.compile(loss='categorical_crossentropy', optimizer='adam')
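The output of `model.summary()` is not shown here, so as a rough check the parameter count can be worked out by hand. This sketch assumes the default Keras `LSTM` configuration (four gates, bias enabled) and a vocabulary of 88 characters, as in the mapping printed earlier; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units

vocab_size = 88

lstm_1 = lstm_params(input_dim=1, units=128)     # first LSTM sees 1 feature per step
lstm_2 = lstm_params(input_dim=128, units=128)   # second LSTM sees 128 features per step
dense = dense_params(input_dim=128, units=vocab_size)
total = lstm_1 + lstm_2 + dense

print(lstm_1, lstm_2, dense, total)
```

Under these assumptions the model has roughly 200 thousand trainable parameters, most of them in the second LSTM layer.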

## Model training

In [10]:
model.fit(X_train, y_train, batch_size=1024, epochs=10, verbose=1)

Epoch 1/10
2884/2884 ━━━━━━━━━━━━━━━━━━━━ 827s 286ms/step - loss: 2.9440
Epoch 2/10
2884/2884 ━━━━━━━━━━━━━━━━━━━━ 808s 280ms/step - loss: 2.4920
Epoch 3/10
2884/2884 ━━━━━━━━━━━━━━━━━━━━ 807s 280ms/step - loss: 2.2558
Epoch 4/10
2884/2884 ━━━━━━━━━━━━━━━━━━━━ 808s 280ms/step - loss: 2.1045
Epoch 5/10
2884/2884 ━━━━━━━━━━━━━━━━━━━━ 808s 280ms/step - loss: 1.9972
Epoch 6/10
2884/2884 ━━━━━━━━━━━━━━━━━━━━ 826s 286ms/step - loss: 1.9119
Epoch 7/10
2884/2884 ━━━━━━━━━━━━━━━━━━━━ 874s 303ms/step - loss: 1.8439
Epoch 8/10
2884/2884 ━━━━━━━━━━━━━━━━━━━━ 856s 297ms/step - loss: 1.7887
Epoch 9/10
2884/2884 ━━━━━━━━━━━━━━━━━━━━ 827s 287ms/step - loss: 1.7421
Epoch 10/10
2884/2884 ━━

<keras.src.callbacks.history.History at 0x29bbc157d10>
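To read the training log, it helps to translate the numbers into something more intuitive. The sketch below reproduces the 2884 steps per epoch from the training-set size and batch size, and converts the cross-entropy loss into a per-character perplexity. It uses the epoch 9 loss because the epoch 10 line is cut off in the log; treating `exp(loss)` as perplexity assumes the loss is reported in nats, which is my understanding of Keras's `categorical_crossentropy`.

```python
import math

n_train = 2952565      # from X_train.shape
batch_size = 1024
steps_per_epoch = math.ceil(n_train / batch_size)

vocab_size = 88
first_epoch_loss = 2.9440
ninth_epoch_loss = 1.7421

# Loss of a model that guesses uniformly over all characters, for comparison.
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of characters the model is choosing between.
perplexity_first = math.exp(first_epoch_loss)
perplexity_ninth = math.exp(ninth_epoch_loss)

print("steps per epoch:", steps_per_epoch)
print("uniform-guess loss: %.3f" % uniform_loss)
print("perplexity after epoch 1: %.2f" % perplexity_first)
print("perplexity after epoch 9: %.2f" % perplexity_ninth)
```

The loss is still dropping steadily at epoch 9, so the model has probably not converged after 10 epochs.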

## Example usage

In [12]:
import random

# Pick a random starting sequence
start_index = random.randint(0, len(text) - input_seq_length - 1)
seed_text = text[start_index:start_index + input_seq_length]

# Generate characters
generated_text = seed_text
for _ in range(200):  # Generate 200 characters
    input_seq = np.array([[char_to_index[char] for char in seed_text]]).reshape(1, input_seq_length, 1)
    input_seq = input_seq / float(len(unique_chars))

    # Predict the next character
    predicted_index = np.argmax(model.predict(input_seq, verbose=0))
    next_char = unique_chars[predicted_index]

    # Append to generated text and update seed
    generated_text += next_char
    seed_text = seed_text[1:] + next_char

print(generated_text)

manship permeating the space. the composition is a medi-muered seene in the soft and a soft glow of a soft glgwsent and a soatle eroendent strles and a soatle eroendent of a woung aod a soit groe and a soft glow of a lodern soited sarterns 
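The repeated fragments in this output ("a soft glow of a soft ...") are typical of always picking the most likely character with `np.argmax`. One common alternative, which is not what the notebook does, is to sample from the predicted distribution with a temperature. The sketch below is a minimal NumPy version; `sample_with_temperature` is a hypothetical helper and the example probabilities are made up.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw an index from `probs` after rescaling it by `temperature`.

    Temperatures below 1 sharpen the distribution (closer to argmax);
    temperatures above 1 flatten it (more random output).
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()            # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# Made-up distribution over three characters, for illustration only.
demo_rng = np.random.default_rng(0)
demo_probs = [0.7, 0.2, 0.1]
draws = [sample_with_temperature(demo_probs, 1.0, demo_rng) for _ in range(200)]
cold_draws = [sample_with_temperature(demo_probs, 0.01, demo_rng) for _ in range(20)]
print(sorted(set(draws)), sorted(set(cold_draws)))
```

In the generation loop, this would replace the `np.argmax(...)` line with something like `sample_with_temperature(model.predict(input_seq, verbose=0)[0], temperature=0.8)`, where 0.8 is only an illustrative starting value. It may reduce the looping, but it will not fix the misspellings, which come from the model itself.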


As we can see, this is not really what we are looking for right now. The output should follow a specific pattern based on the context of the whole sentence; that is, the model should generate a new sentence with the same meaning but a different tone or aspect. Based on my understanding, GANs do not do that very well.