# Controllable spooky text generation based on author's embedding

### Lucas H. Ueda

***Abstract—Natural language processing (NLP) has advanced
rapidly in recent years. With the emergence of Transformers and
networks based on them, generated text has come increasingly
close to human writing, as shown by the texts produced by GPT-2
and, more recently, GPT-3. However, these architectures do not
consider control over the style of the generated text. This project
studies, with a language model approach, whether writing style
can be controlled by separating authors in a latent space (author
embedding) generated by the model. In particular, we work with
3 horror story writers (Mary Shelley, Edgar Allan Poe and
H.P. Lovecraft). Our source code is publicly available at
https://github.com/lucashueda/reproducible research.***

***Index Terms—NLP, Language modeling, embedding,
neural networks, controllable***

In [9]:
%%capture
# Run this to preprocess the data (almost instantaneous)
%cd ../dev
%run ../dev/1_data_preprocess.ipynb

In [10]:
%%capture
# Run this to train the model (about 20 min on an RTX 2060 6 GB)
import time
init = time.time()
%run ../dev/2_training_language_model.ipynb
end = time.time()
print(end - init)  # hidden by %%capture, so it is printed again in the next cell

In [12]:
# Print the time (in seconds) taken by the training cell
print(end - init)

1371.2590572834015


## INTRODUCTION

Language modelling tries to estimate a probability
distribution that predicts the next token from the previous
ones [1]. This technique allows us to represent words in a
latent space as dense vectors of fixed dimension. In the
literature these dense vectors serve two main purposes: first,
they represent a phrase better than a one-hot vector; second,
they allow latent meanings to be controlled by finding patterns
in the latent space ([2], [3]).

Text generation is a very hard task in NLP because of its
"human" nature: it is hard to generate text as a human would,
since no specific pattern is known in human-written text.
Text degeneration is still a big problem, making generated
text repetitive and without reasonable meaning [4]. Additionally,
there is no standard automatic way to validate a text generator
in terms of plausibility and coherence with human language.
Controlling the generated text is one more problem in this task.

In this work we try to build a complete text generation system
that is able to produce reasonable texts conditioned on an
author embedding, i.e., given an initial text and an author
embedding as input, the system produces text in the style of
the target author. We use a dataset with texts from 3 spooky
authors: Edgar Allan Poe, H.P. Lovecraft and Mary Shelley.
The dataset comes from a Kaggle competition and contains many
sentences by these 3 authors; we use it to extract the authors'
embeddings and to build our proposed method.

This work is organized as follows: Section 1 is the
introduction; in Section 2 we describe the methods that motivate
our proposed method; in Section 3 we describe the experiments
performed on the dataset and the architectures tested; in
Sections 4 and 5 we discuss and conclude our work. Further
sections contain the source code and acknowledgements.

## METHOD

In this section we are going to summarize the motivations
of our proposed method.

### A. LANGUAGE MODELLING

Language modelling is the process of estimating the
probability distribution that predicts a token given a
sequence of past tokens [1]. This approach can be used to
generate tokens, but its greatest impact has been on word
representation [2], where a dense vector represents a token
better than its one-hot vector. This enabled much of the
progress in NLP, with GloVe, ELMo and, more recently,
Transformers.

In this project we use language modelling to generate our
tokens. We could have used sequence-to-sequence modelling
instead of per-token generation, but to isolate effects we
chose a simple approach to token generation. We also use
top-k and nucleus sampling [4] as decoding methodologies,
which have proven to be the best for our purpose.
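As a minimal sketch of the two decoding strategies from [4], the snippet below filters a next-token distribution with top-k and with nucleus (top-p) sampling and then draws a token. The function names, the toy logits and the values of `k` and `p` are assumptions made for illustration; they are not taken from the project code.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def sample(filtered, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(filtered)
    weights = [filtered[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Toy logits over a 5-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
probs = softmax(logits)
top2 = top_k_filter(probs, k=2)
nucleus = nucleus_filter(probs, p=0.9)
next_token = sample(nucleus, random.Random(0))
```

Top-k always keeps a fixed number of candidates, while nucleus sampling keeps more candidates when the distribution is flat and fewer when it is peaked.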

### B. tSNE

t-Distributed Stochastic Neighbor Embedding (tSNE) [5]
is a non-linear technique for dimensionality reduction that
is particularly well suited to visualizing high-dimensional
datasets. It is commonly used on unstructured data, such as
images, text and speech.

tSNE seeks to preserve, in the low-dimensional space, the
similarity distribution of the vectors in the high-dimensional
space. This makes it possible to maintain the statistical
similarity characteristics of the data even in the lower
dimension. In this project it is used to visualize how well
the authors are differentiated in the latent space generated
by the model.
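For reference, the similarity-preservation idea can be written as in [5]. In this sketch, $x_i$ are the $N$ high-dimensional vectors, $y_i$ their low-dimensional counterparts, and $\sigma_i$ is a per-point bandwidth chosen from the tSNE perplexity hyperparameter (which is unrelated to the language model perplexity used for evaluation).

$$
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
$$

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
$$

$$
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

The low-dimensional points $y_i$ are found by minimizing $C$ with gradient descent, so pairs that are similar in the original space are pulled together in the projection.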

### C. EVALUATION

To measure the quality of our model we analyze the classical
loss curves for the training and validation steps. We also
look at perplexity [6], which measures how much probability
the model assigns to the correct token: it is the exponential
of the average negative log-likelihood of the correct tokens,
so lower values are better. Perplexity is often used to
measure quality in the next-token generation task, which is
why it was chosen.
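A minimal sketch of the perplexity computation, using the standard definition as the exponential of the average negative log-likelihood of the correct tokens. The function name and the toy probabilities are assumptions made for illustration.

```python
import math


def perplexity(correct_token_probs):
    """Exponential of the average negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that is uniform over a 4-token vocabulary gives each correct token p = 0.25.
uniform_ppl = perplexity([0.25, 0.25, 0.25])
# A more confident model that assigns high probability to the correct tokens.
confident_ppl = perplexity([0.9, 0.8, 0.95])
```

A uniform model over a vocabulary of size V has perplexity V, which gives a useful upper reference point when reading the training curves.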

In the future we will seek to evaluate the task with other
metrics, such as BLEU [6] and MRR [7], as well as a custom
authorial metric that measures how often the words generated
by the model for a given author match the most frequent words
used by that author.
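One possible reading of the authorial metric is sketched below: the fraction of generated tokens that belong to the author's `n` most frequent words. This is only a hypothetical formulation; the function names, the cutoff `n` and the toy corpora are assumptions, not part of the project.

```python
from collections import Counter


def author_top_words(author_tokens, n):
    """Return the set of the n most frequent tokens in an author's corpus."""
    return {word for word, _ in Counter(author_tokens).most_common(n)}


def authorial_rate(generated_tokens, author_tokens, n=100):
    """Fraction of generated tokens that fall in the author's top-n vocabulary."""
    if not generated_tokens:
        return 0.0
    top = author_top_words(author_tokens, n)
    hits = sum(1 for token in generated_tokens if token in top)
    return hits / len(generated_tokens)


# Toy corpora (made up for illustration, not taken from the dataset).
poe_like = "the raven the night the dark raven night the".split()
generated = "the raven flew into the bright day".split()
rate = authorial_rate(generated, poe_like, n=3)
```

In practice, very common function words would probably need to be removed first, otherwise all three authors would score similarly.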

## EXPERIMENTS

The dataset is balanced between the 3 authors and has
around 20,000 observations, each with a sentence and a label
indicating which author wrote it. This dataset was preprocessed
into the language modelling format. We use a context of 5
tokens and the 6th token as the target, which gives around
500,000 observations, of which we use 90% for training and
10% for validation.
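The windowing described above can be sketched as follows. Whitespace tokenization, the sequential (unshuffled) split and the function names are assumptions made for this example; the preprocessing notebook may differ.

```python
def make_windows(tokens, context_size=5):
    """Slide over a token list, collecting (context, target) pairs."""
    pairs = []
    for start in range(len(tokens) - context_size):
        context = tokens[start:start + context_size]
        target = tokens[start + context_size]
        pairs.append((context, target))
    return pairs


def train_val_split(pairs, train_fraction=0.9):
    """Split pairs into training and validation sets, keeping order."""
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


# Whitespace tokenization is an assumption made for this sketch.
sentence = "it was a dark and stormy night in the old house".split()
pairs = make_windows(sentence, context_size=5)
train, val = train_val_split(pairs)
```

A sentence of n tokens yields n - 5 pairs, which is how roughly 20,000 sentences can expand into about 500,000 training observations.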

Both models were trained with this 5-token context and a
batch size of 256. All experiments were run on a local
machine with 16 GB of RAM and an RTX 2060 6 GB GPU.

### A. VANILLA MODEL

As a baseline we use a Vanilla model, the simplest
implementation of the language model described by Bengio et
al. [1], with a lookup embedding layer and a linear layer
that maps the embedding dimension to the space of vocabulary
tokens.
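The forward pass of such a baseline can be sketched in plain Python. This is an illustration only: concatenating the five context embeddings before the linear layer is an assumption (it follows the spirit of [1]), the sizes are toy values, and the weights are random rather than trained.

```python
import math
import random


def init_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these with SGD."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(context_ids, embedding, weights, bias):
    """Look up each context token, concatenate, project to the vocabulary, softmax."""
    # Concatenating the context embeddings is an assumption of this sketch.
    features = [value for token_id in context_ids for value in embedding[token_id]]
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Toy sizes for readability; the experiments use an embedding dimension of 128.
vocab_size, embedding_dim, context_size = 10, 4, 5
embedding = init_matrix(vocab_size, embedding_dim, rng)
weights = init_matrix(vocab_size, context_size * embedding_dim, rng)
bias = [0.0] * vocab_size

probs = forward([1, 4, 2, 7, 3], embedding, weights, bias)
```

The output is a probability for every vocabulary token, from which the next token can be chosen by a decoding strategy.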

The model used an embedding dimension of 128 and the SGD
optimizer with a learning rate of 5e-3, and was run for 20
epochs. Figure 1 shows how difficult it is for a simple model
to achieve good perplexity. Additionally, the 2-dimensional
visualization (Figure 2) shows that the model cannot
differentiate the authors, indicating that it lacks the
capacity to generate a latent space that separates them.

<img src="../dev/partial_results/vanilla_model_train_perplexity_graph.png" width="1000">

Fig 1: Loss and perplexity of the Vanilla model.

<img src="../dev/partial_results/vanilla_model_umap_embedding.png" width="1000">
Fig 2: Author's embeddings of the Vanilla model by tSNE.

### B. PROPOSED MODEL

Our proposed model adds a GRU recurrent unit [8] after the
lookup embedding layer of the Vanilla model. This model was
also run for 20 epochs with a learning rate of 5e-3 and the
SGD optimizer. The recurrent layer is intended to model the
temporal order of the token sequence, allowing the model to
be more accurate and, consequently, to generate a latent
space that better separates the authors.
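For reference, the GRU update in the formulation evaluated in [8] can be written as below. Here $x_t$ is the embedding of the $t$-th context token, $h_t$ the hidden state, $\sigma$ the logistic sigmoid and $\odot$ the element-wise product. The weight names are notation for this sketch, and a given library implementation may arrange the gate terms slightly differently.

$$
z_t = \sigma\left(W_z x_t + U_z h_{t-1}\right), \qquad
r_t = \sigma\left(W_r x_t + U_r h_{t-1}\right)
$$

$$
\tilde{h}_t = \tanh\left(W x_t + U \left(r_t \odot h_{t-1}\right)\right)
$$

$$
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
$$

The update gate $z_t$ decides how much of the previous state is kept, and the reset gate $r_t$ decides how much of it is used when forming the candidate state $\tilde{h}_t$.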

Unfortunately, this model also failed to converge well on
this dataset (Figure 3). However, it generated a more
differentiable space than the previous one (Figure 4).
Differentiation was the goal, so a more robust architecture
is expected to yield a better-separated space.

<img src="../dev/partial_results/proposed_model_train_perplexity_graph.png" width="1000">
Fig 3: Loss and perplexity of the Proposed model.

<img src="../dev/partial_results/proposed_model_umap_embedding.png" width="1000">
Fig 4: Author's embeddings of the Proposed model by tSNE.

In the future, the idea is to develop a model structure based
on Transformers [9], which have been showing great results in
NLP tasks. For the specific task of generating the latent
space, we will try a Variational AutoEncoder (VAE) [3], which
generates a Gaussian latent space by optimizing a probability
density function that describes the generator of each author's
dense vectors. Finally, at inference time, the centroid of
each author's cluster in the latent space will be used as the
parameter that controls text generation. The final model
proposal is outlined in Figure 5.

<img src="../figures/proposed_model.png" width="1000">
Fig 5: Training pipeline of the intended future model. At inference
time there is no connection between the embedding layer and the VAE
encoder layer; the target author is set by the hard-coded
centroid of that author in the latent space.
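The centroid step of this planned pipeline could look like the sketch below. The latent vectors are made-up 2-D values standing in for encoder outputs, and the function and key names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def author_centroids(latents_by_author):
    """Map each author to the centroid of that author's latent vectors."""
    return {author: centroid(vecs) for author, vecs in latents_by_author.items()}


# Made-up 2-D latent vectors; real ones would come from the trained encoder.
latents = {
    "Poe": [[0.0, 1.0], [2.0, 3.0]],
    "Lovecraft": [[-1.0, -1.0], [-3.0, 1.0]],
    "Shelley": [[4.0, 4.0]],
}
centroids = author_centroids(latents)
target_style = centroids["Poe"]  # fixed vector that would condition generation
```

Because the centroids are computed once after training, the encoder is not needed at inference time, which matches the description in the caption of Figure 5.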

## DISCUSSION

The results show that models like these are not able
to effectively model the probability distribution that
generates the tokens of this dataset. The validation loss
and perplexity values suggest that the models overfit
easily.

The proposed model can minimally differentiate the authors
in the low-dimensional projection of the latent space, while
the vanilla model cannot differentiate any of the authors.
The poor quality of the models as language models precludes
a deeper analysis of their suitability for controlling text
generation; however, it is reasonable to assume that more
robust, well-established architectures from the current
literature are needed. The model proposed for the future
looks promising, given that work in similar areas has
achieved good results using such techniques [10].

## CONCLUSION

Text generation is definitely a difficult task. Adding
control to this generation proved to be far more complicated
than expected to accomplish within the time frame of a
course. Tests with state-of-the-art models were carried out,
but without success in controlling text generation.
Nevertheless, we hope that in the future a Transformer-based
model will make it possible to better differentiate the
latent space, in addition to performing better on the
language modelling task.

## SOURCE CODE

All the code used in this project is available at
https://github.com/lucashueda/reproducible research.

## REFERENCES

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural
probabilistic language model,” online:
http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf, 2003.

[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation
of word representations in vector space,” online:
https://arxiv.org/abs/1301.3781, 2013.

[3] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”
online: https://arxiv.org/abs/1312.6114, 2013.

[4] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious
case of neural text degeneration,” online:
https://arxiv.org/abs/1904.09751, 2019.

[5] L. van der Maaten and G. Hinton, “Visualizing data using t-sne,”
online:
http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf,
2008.

[6] S. Chen, D. Beeferman, and R. Rosenfeld, “Evaluation metrics
for language models,” online:
https://www.cs.cmu.edu/~roni/papers/eval-metrics-bntuw-9802.pdf, 2001.

[7] N. Craswell, Mean Reciprocal Rank, pp. 1703–1703. Boston, MA:
Springer US, 2009.

[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation
of gated recurrent neural networks on sequence modeling,” online:
https://arxiv.org/abs/1412.3555, 2014.

[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you
need,” online: https://arxiv.org/abs/1706.03762, 2017.

[10] K. Akuzawa, Y. Iwasawa, and Y. Matsuo, “Expressive speech synthesis
via modeling expressions with variational autoencoder,” online:
https://arxiv.org/abs/1804.02135, 2018.

## ACKNOWLEDGEMENTS

This project is part of the Computational Reproducible Research
course at Unicamp (1S2020).