# Deep Learning Project: Text Generation

In this project, I'd like to explore generating realistic-sounding text. After some preliminary research, I'd like to experiment with doing this through Generative Adversarial Networks (GANs) and Adversarial Autoencoders. My goal is to test the hypothesis that these models, which are primarily used for images and work very well there, can also do well on text.

A GAN sets up a competition between two neural networks: the generator, which produces fake samples, and the discriminator, which tries to tell those fake samples apart from real data.
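
For reference, this competition is usually written as a two-player minimax game. This is the standard GAN formulation, stated here as background rather than taken from this notebook's code. $D(x)$ is the discriminator's estimate that $x$ is real, and $G(z)$ is the generator's output for a noise vector $z$:

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
$$

The discriminator tries to make this value large by scoring real samples near 1 and generated samples near 0, while the generator tries to make it small by producing samples that the discriminator scores near 1.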

Some existing research explores text generation with GANs, even though they are normally used for image generation, and reports that they can produce surprisingly realistic-sounding text compared to other methods. To get realistic, human-sounding data, I chose the Amazon Q&A dataset, which includes all of the questions and answers about products on Amazon. This also gives me the freedom to reduce the data to the questions for a specific category of products. I'd like to test this claim, see what the results look like, and judge how well it holds up.

First, we need to load the data and process it into a format that we'll use later. The original JSON file is not valid JSON; it's a string of Python dictionaries, so it needs some extra processing. The data came with some clean-up code (create_data.py) that formats it correctly, but unfortunately it only works within the Google Cloud Platform (GCP). Consequently, I wrote some Python code that manually converts the files into txt files that we can use. The original data is stored in the folder `data`, while the cleaned-up files are in `clean`.
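
The conversion step could look roughly like the sketch below. It is not the notebook's actual conversion code. It assumes that each line of the raw file is one Python dictionary literal with `question` and `answer` keys, and the function name `convert_raw_to_txt` is hypothetical. `ast.literal_eval` parses the literals without executing arbitrary code, and the output uses the layout that the sampling code expects: a question line, an answer line, then a blank line.

```python
import ast

def convert_raw_to_txt(raw_path, out_path):
    """Convert a file of Python dict literals (one per line) to plain text.

    Assumption: each record has 'question' and 'answer' keys.
    Returns the number of question/answer pairs written.
    """
    written = 0
    with open(raw_path, 'r') as raw_file, open(out_path, 'w') as out_file:
        for line in raw_file:
            line = line.strip()
            if not line:
                continue
            record = ast.literal_eval(line)  # parses literals only, runs no code
            question = record.get('question')
            answer = record.get('answer')
            if not question or not answer:
                continue  # skip incomplete records
            # Collapse internal whitespace so each field stays on one line
            out_file.write(' '.join(question.split()) + '\n')
            out_file.write(' '.join(answer.split()) + '\n')
            out_file.write('\n')
            written += 1
    return written
```

Collapsing whitespace matters because the later code treats every non-blank line as either a question or an answer, so a multi-line answer would break the pairing.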

In [1]:
# Importing libraries
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset
import re
import torch.nn.functional as F
import random



In [2]:
# Write a random sample of question/answer pairs from file_path to a new text file
def generate_random_qa_file(file_path, num_pairs, output_path='clean/sample.txt'):
    with open(file_path, 'r') as file:
        lines = [line.strip() for line in file]

    qa_pairs = []
    current_qa_pair = []

    # Pairs are separated by blank lines
    for line in lines:
        if line:
            current_qa_pair.append(line)
        else:
            qa_pairs.append(current_qa_pair)
            current_qa_pair = []
    # Keep the last pair when the file does not end with a blank line
    if current_qa_pair:
        qa_pairs.append(current_qa_pair)

    # Keep only well-formed pairs (one question line, one answer line) before sampling,
    # so that the output contains num_pairs pairs whenever enough are available
    qa_pairs = [pair for pair in qa_pairs if len(pair) == 2]
    random.shuffle(qa_pairs)

    with open(output_path, 'w') as output_file:
        for question, answer in qa_pairs[:num_pairs]:
            output_file.write(question + '\n')
            output_file.write(answer + '\n')
            output_file.write('\n')

# The second argument is the number of pairs to write to the file "sample.txt" in the folder "clean".
generate_random_qa_file('clean/full_data.txt', 1000)

In [3]:
# Data processing (to input into training/generation later)
class TextDataset(Dataset):
    """Word-level dataset: each item is the vocabulary index of a single word."""
    def __init__(self, file_path, max_length=100):
        # max_length is accepted so that train() can pass it, but it is not used here
        with open(file_path, 'r') as file:
            data = file.read().lower()
        # Tokenize into words; punctuation is dropped
        data = re.findall(r'\b\w+\b', data)
        # Sorting makes the word <-> index mapping reproducible for the same file
        self.vocab = sorted(set(data))
        if "<START>" not in self.vocab:
            self.vocab.append("<START>")
        self.word_to_idx = {word: idx for idx, word in enumerate(self.vocab)}
        self.idx_to_word = {idx: word for idx, word in enumerate(self.vocab)}
        self.data = [self.word_to_idx[word] for word in data]
        print("Vocabulary size:", len(self.vocab))

    def __len__(self):
        return len(self.data)

    def get_input_dim(self):
        return len(self.vocab)

    def __getitem__(self, idx):
        return self.data[idx]
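
To make the preprocessing concrete, here is a small standalone example that applies the same lowercasing, regex tokenization, and sorted-vocabulary indexing as `TextDataset` to a made-up sentence. The sentence is purely illustrative.

```python
import re

text = "Does it fit? Yes, it fits."

# Same tokenization as TextDataset: lowercase, then keep runs of word characters
tokens = re.findall(r'\b\w+\b', text.lower())

# Sorted unique tokens define the vocabulary and the word -> index mapping
vocab = sorted(set(tokens))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
encoded = [word_to_idx[word] for word in tokens]

print(tokens)   # ['does', 'it', 'fit', 'yes', 'it', 'fits']
print(vocab)    # ['does', 'fit', 'fits', 'it', 'yes']
print(encoded)  # [0, 3, 1, 4, 3, 2]
```

Two consequences follow from this design: punctuation never reaches the model, and because `__getitem__` returns one index at a time, each training sample is a single word rather than a sentence.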

In [4]:
class Generator(nn.Module):
    def __init__(self, noise_dim, output_dim, hidden_dim):
        super(Generator, self).__init__()
        self.fc1 = nn.Linear(noise_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

class Discriminator(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Discriminator, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

class GAN(nn.Module):
    def __init__(self, generator, discriminator):
        super(GAN, self).__init__()
        self.generator = generator
        self.discriminator = discriminator

    def forward(self, x):
        return self.discriminator(self.generator(x))
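
As a rough sense of scale for these two networks, the sketch below counts their trainable parameters. It uses the vocabulary size printed for the 1000-pair sample (5636) and the default `noise_dim` and `hidden_dim` from `train`. The helper `linear_params` is introduced here for illustration, and it relies on `nn.Linear` including a bias term by default.

```python
def linear_params(in_features, out_features):
    """Parameter count of a fully connected layer: weights plus biases."""
    return in_features * out_features + out_features

vocab_size = 5636   # "Vocabulary size" printed for the 1000-pair sample
noise_dim = 1024    # default in train()
hidden_dim = 1024   # default in train()

generator_params = linear_params(noise_dim, hidden_dim) + linear_params(hidden_dim, vocab_size)
discriminator_params = linear_params(vocab_size, hidden_dim) + linear_params(hidden_dim, 1)

print(generator_params)      # 6826500
print(discriminator_params)  # 5773313
```

Most of the parameters sit in the layers that touch the vocabulary dimension, so both models grow roughly linearly with the number of distinct words in the sample.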

In [5]:
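# model_class, input_dim, output_dim and embedding_dim are currently unused;
# the layer sizes are taken from the dataset's vocabulary size instead.
def train(path='clean/sample.txt', model_class=GAN, lr=1e-5, noise_dim=1024, input_dim=5000,
          hidden_dim=1024, output_dim=5000, max_length=25, embedding_dim=250, epochs=5, batch_size=64):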
    train_data = TextDataset(path, max_length=max_length)
    data_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)

    generator = Generator(noise_dim, train_data.get_input_dim(), hidden_dim)
    discriminator = Discriminator(train_data.get_input_dim(), hidden_dim)
    loss = nn.BCELoss()

    opt_gen = optim.Adam(generator.parameters(), lr=lr)
    opt_discrim = optim.Adam(discriminator.parameters(), lr=lr)

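    for i in range(epochs):
        for batch in data_loader:
            # Match the noise batch to the real batch, which can be smaller at the end of an epoch
            noise = torch.randn(batch.size(0), noise_dim)
            # Real samples are one-hot word vectors; fake samples are the generator's raw outputs
            data_real = F.one_hot(batch, num_classes=train_data.get_input_dim()).float()
            data_fake = generator(noise)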

            labels_real = torch.ones((data_real.size(0), 1), device=data_real.device)
            labels_fake = torch.zeros((data_fake.size(0), 1), device=data_fake.device)
            
            output_real = discriminator(data_real)
            output_fake = discriminator(data_fake.detach())
            
            # Training discriminator
            opt_discrim.zero_grad()
            loss_discrim = loss(output_real, labels_real) + loss(output_fake, labels_fake)
            loss_discrim.backward()
            opt_discrim.step()
    
            # Training generator
            opt_gen.zero_grad()
            labels_real = torch.ones((data_fake.size(0), 1), device=data_fake.device)
            loss_gen = loss(discriminator(data_fake), labels_real)
            loss_gen.backward()
            opt_gen.step()
    
            # Clear gradients
            opt_discrim.zero_grad()
            opt_gen.zero_grad()
            
        # Report the losses from the last batch of each epoch
        print("Epoch:", i, "Generator Loss:", loss_gen.item(), "Discriminator Loss:", loss_discrim.item())
            
    return generator

# Generate the text
def beam_search(generator, noise_dim, train_data, max_length=100, beam_width=20):
    # Note: the generator only sees fresh random noise, so the word scores at each
    # step do not depend on the words already chosen.
    generator.eval()
    with torch.no_grad():
        # Each beam entry is (list of words so far, cumulative log-probability)
        beam = [([], 0.0)]
        for _ in range(max_length):
            new_beam = []
            for seq, log_prob in beam:
                # log_softmax avoids taking log(0) for words with vanishing probability
                output_log_probs = F.log_softmax(generator(torch.randn(1, noise_dim)), dim=1)
                top_candidates = torch.topk(output_log_probs, beam_width, dim=1)
                for i in range(beam_width):
                    token_idx = top_candidates.indices[0][i].item()
                    token_log_prob = top_candidates.values[0][i].item()
                    new_seq = seq + [train_data.idx_to_word[token_idx]]
                    new_beam.append((new_seq, log_prob + token_log_prob))

            # Keep the beam_width highest-scoring sequences
            new_beam.sort(key=lambda x: x[1], reverse=True)
            beam = new_beam[:beam_width]

        best_sequence, _ = beam[0]
        generated_text = ' '.join(best_sequence)
        return generated_text
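
For comparison, the sketch below runs beam search on a toy model whose next-word probabilities depend on the previous word, which is the setting beam search is designed for. The probability table `NEXT_WORD_PROBS` and the function `toy_beam_search` are hypothetical and are not part of this notebook's generator, which is driven by noise alone. The bookkeeping is the same as in `beam_search`: extend every sequence in the beam, add log-probabilities, and keep the `beam_width` best candidates.

```python
import math

# Hypothetical next-word probabilities, conditioned on the previous word
NEXT_WORD_PROBS = {
    '<start>': {'does': 0.6, 'is': 0.4},
    'does': {'it': 0.9, 'this': 0.1},
    'is': {'it': 0.5, 'this': 0.5},
    'it': {'fit': 0.7, 'work': 0.3},
    'this': {'fit': 0.2, 'work': 0.8},
}

def toy_beam_search(max_length=3, beam_width=2):
    # Each beam entry is (sequence of words, cumulative log-probability)
    beam = [(['<start>'], 0.0)]
    for _ in range(max_length):
        candidates = []
        for seq, log_prob in beam:
            # Extend the sequence with every word that can follow its last word
            for word, prob in NEXT_WORD_PROBS.get(seq[-1], {}).items():
                candidates.append((seq + [word], log_prob + math.log(prob)))
        if not candidates:
            break  # no sequence in the beam can be extended further
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_width]
    best_seq, best_log_prob = beam[0]
    # Drop the '<start>' marker and convert the score back to a probability
    return ' '.join(best_seq[1:]), math.exp(best_log_prob)

text, prob = toy_beam_search()
print(text)  # does it fit
```

Here the best sequence has probability 0.6 × 0.9 × 0.7 = 0.378. Because each step conditions on the previous word, keeping several candidates can change which sequence wins; when the scores ignore the prefix, the search reduces to picking high-scoring words independently at each step.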

In [6]:
path = 'clean/sample.txt'
generator = train(path)

Vocabulary size: 5636
Epoch: 0 Generator Loss: 1.1007364988327026 Discriminator Loss: 1.1103870868682861
Epoch: 1 Generator Loss: 1.043791651725769 Discriminator Loss: 1.1330885887145996
Epoch: 2 Generator Loss: 1.1323392391204834 Discriminator Loss: 1.084793210029602
Epoch: 3 Generator Loss: 1.187387466430664 Discriminator Loss: 1.0572293996810913
Epoch: 4 Generator Loss: 1.3126481771469116 Discriminator Loss: 1.0036396980285645
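
To help read these numbers, the sketch below computes the binary cross-entropy values for a discriminator that cannot tell real from fake and outputs 0.5 for every sample. The helper `bce` is a plain-Python stand-in written for this illustration; it mirrors what `nn.BCELoss` computes for a single prediction.

```python
import math

def bce(prediction, label):
    """Binary cross-entropy for a single prediction in (0, 1)."""
    return -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))

# A discriminator that cannot tell real from fake outputs 0.5 for everything
d_real = 0.5
d_fake = 0.5

loss_discrim = bce(d_real, 1) + bce(d_fake, 0)  # real labelled 1, fake labelled 0
loss_gen = bce(d_fake, 1)                       # generator wants fakes labelled 1

print(round(loss_discrim, 4))  # 1.3863
print(round(loss_gen, 4))      # 0.6931
```

Against these reference points, a discriminator loss below 1.386 together with a generator loss above 0.693 suggests that the discriminator is ahead, which matches the trend in the log above. Note that the printed values come from the last batch of each epoch only, so they are a noisy indicator rather than an epoch average.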


In [9]:
generated_text = beam_search(generator, noise_dim=1024, train_data=TextDataset(path), max_length = 100, beam_width = 20)
print("Generated Text:", generated_text)

Vocabulary size: 5636
Generated Text: 1996 corners extra expert spite corners golf dedicated include inquiring corners squeeze catch wave hook had unlocking had video had had real double had wave group mx20 matters process cleanable incoming corners roofing had had varies anymore few dish had xl34 speakers had wheelchairs diffuser anymore incoming had ties fibrous had s5s incoming i435 corners allows suddenly ideal scented weekly developing channel fibrous suddenly improved had 5lb catch monitor had i435 golf wave wrapper suddenly hawaiian ringger golf lleg had concerning othe wheelchairs porch squeeze columbia draw suddenly equipment ohm andrew had cleanable cropped incoming consoles diffuser roofing louder dish
