<a href="https://colab.research.google.com/github/gustavo-medinav/char-rnn-arxiv-simple/blob/main/arxiv_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation for abstracts of scientific articles

This notebook implements a minimal language model that generates abstracts of scientific articles using a recurrent neural network. The model works at the character level and covers not only alphanumeric characters but also punctuation and the symbols used to typeset mathematical expressions in LaTeX, such as $$.

The architecture includes an embedding layer that learns representations of the characters in the text, so no pre-trained embeddings are required.
This minimal example uses two LSTM layers stacked on top of each other, where each layer passes its output at every time step forward (many-to-many).
The final layer is a fully connected dense layer that models the probability distribution over the vocabulary (characters). The problem can be regarded as a classification problem where the number of categories is the size of the vocabulary.
Thus, the loss function is selected to be categorical cross-entropy, where each example in the training set belongs to a single category.

The model was trained on 100k abstracts from papers in the arXiv across all categories, taking about 8 hours to train on the GPU provided by Google Colab. The trained model can autocomplete words and generate syntactically correct sentences, although it lacks semantic understanding and displays limited long-term memory. The notebook outlines each step in the process and demonstrates the capabilities and limitations of the model.

## 1. Import dataset from Kaggle

To make use of the Kaggle API, we must upload our authentication key (see your Kaggle profile to generate an API token).

In [4]:
from google.colab import files

In [5]:
files.upload();

Saving kaggle.json to kaggle.json


By default, the API looks for the authentication key in '/root/.kaggle', so we must move the file to that location.

In [6]:
!mkdir /root/.kaggle
!mv kaggle.json /root/.kaggle
!chmod 600 /root/.kaggle/kaggle.json
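
Note that `mkdir` reports an error if the directory already exists, for example when the cell is run a second time. As a hedged alternative, the same three steps can be sketched in plain Python. `install_kaggle_key` is a hypothetical helper, not part of the Kaggle API, and the demo runs against a temporary directory rather than '/root/.kaggle'.

```python
import os
import shutil
import stat
import tempfile

def install_kaggle_key(src, dest_dir):
    """Move the API token into dest_dir and restrict it to the owner."""
    os.makedirs(dest_dir, exist_ok=True)      # no error if it already exists
    dest = os.path.join(dest_dir, 'kaggle.json')
    shutil.move(src, dest)
    os.chmod(dest, 0o600)                     # read/write for the owner only
    return dest

# Demo in a throwaway directory (in Colab, dest_dir would be '/root/.kaggle')
with tempfile.TemporaryDirectory() as tmp:
    token = os.path.join(tmp, 'kaggle.json')
    with open(token, 'w') as f:
        f.write('{"username": "demo", "key": "not-a-real-key"}')
    installed = install_kaggle_key(token, os.path.join(tmp, '.kaggle'))
    key_exists = os.path.exists(installed)
    key_mode = stat.S_IMODE(os.stat(installed).st_mode)
```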

Now we can download the dataset. We'll be looking at the arXiv metadata dataset.

In [7]:
!kaggle datasets list -s arxiv

ref                                                        title                                              size  lastUpdated          downloadCount  
---------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  
Cornell-University/arxiv                                   arXiv Dataset                                     902MB  2020-11-08 00:52:17           4062  
neelshah18/arxivdataset                                    ARXIV data from 24,000+ papers                     18MB  2018-03-31 03:47:25           1698  
prasunroy/natural-images                                   Natural Images                                    342MB  2018-08-11 18:24:11          10524  
rmisra/news-headlines-dataset-for-sarcasm-detection        News Headlines Dataset For Sarcasm Detection        3MB  2019-07-03 23:52:57          19412  
tayorm/arxiv-papers-metadata                               Arxiv Papers Metadata D

The following line will start the download, which should not take too long.

In [8]:
!kaggle datasets download -d 'Cornell-University/arxiv'

Downloading arxiv.zip to /content
100% 898M/902M [00:29<00:00, 33.4MB/s]
100% 902M/902M [00:29<00:00, 32.4MB/s]


Unzip the file to access the contents

In [9]:
import zipfile
with zipfile.ZipFile('/content/arxiv.zip','r') as file:
  file.extractall('/content/')

Now we should have in our directory the dataset in a JSON file

In [10]:
!ls

arxiv-metadata-oai-snapshot.json  arxiv.zip  drive  sample_data


## 2. Load and transform the data

The dataset is too large to load all at once, so we use a generator to read the records one at a time. Moreover, a little data cleaning is required, as some of the abstracts may contain characters we will want to remove. Therefore, let's import the regular expressions module to clean the data.

In [14]:
import json
import re
import string
import itertools
import numpy as np

Since this is going to be a character-level model, the vocabulary is known beforehand: the set of alphanumeric characters, whitespace characters and special characters. Thus, it can be loaded a priori from the 'string' module. The vocabulary can then be used to create lookup tables mapping characters to indices and vice versa.

In [15]:
vocab = [n for n in string.printable]
vocab_size = len(vocab)
char2idx = {v:idx for idx,v in enumerate(vocab)}
idx2char= np.array(vocab)
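
To make the role of the two lookup tables concrete, here is a small round-trip sketch. It rebuilds the tables with a plain list instead of a NumPy array, and `encode` and `decode` are hypothetical helper names introduced only for this illustration.

```python
import string

vocab = list(string.printable)                 # 100 printable ASCII characters
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = vocab                               # a plain list is enough here

def encode(text):
    """Map every character to its vocabulary index."""
    return [char2idx[ch] for ch in text]

def decode(indices):
    """Map indices back to characters and join them into a string."""
    return ''.join(idx2char[i] for i in indices)

sample = "Let $x^2 \\leq 1$.\n"
encoded = encode(sample)
decoded = decode(encoded)
```

Encoding followed by decoding returns the original string, including the LaTeX symbols and the trailing newline, since all of them are part of `string.printable`.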

The dataset contains a lot of records. We can control how many records we want to load by using itertools to define an iterator with a fixed start and fixed end.

At this point, you should load a few examples and perform some exploratory data analysis to understand what the dataset contains. In this example we are only interested in the abstracts of the papers, so we will ignore the other fields.

In [24]:
gen_json = (json.loads(line) for line in itertools.islice(open('arxiv-metadata-oai-snapshot.json','r'),10))
temp = []
for line in gen_json:
  temp.append(line['abstract'])
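
The cell above leaves the file handle open. A minimal sketch of the same lazy-reading idea wrapped in a generator function that closes the file when it is exhausted is shown below. `read_abstracts` is a hypothetical helper, and the demo uses a tiny stand-in file with the same one-record-per-line layout instead of the real snapshot.

```python
import itertools
import json
import os
import tempfile

def read_abstracts(path, limit):
    """Yield the 'abstract' field of the first `limit` records in a JSON-lines file."""
    with open(path, 'r') as handle:           # closed once the generator finishes
        for line in itertools.islice(handle, limit):
            yield json.loads(line)['abstract']

# Demo on a toy file: five records, one JSON object per line
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.json')
    with open(path, 'w') as f:
        for i in range(5):
            record = {'id': str(i), 'abstract': '  Abstract %d.\n' % i}
            f.write(json.dumps(record) + '\n')
    first_three = list(read_abstracts(path, 3))
```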

In [25]:
temp[0]

'  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab Tevatron, and predictions are made for\nmore detailed tests with CDF and DO data. Predictions are shown for\ndistributions of diphoton pairs produced at the energy of the Large Hadron\nCollider (LHC). Distributions of the diphoton pairs from the decay of a Higgs\nboson are contrasted with those produced from QCD processes at the LHC, showing\nthat enhanced sensitivity to the signal can be obtained with judicious\nselection of events.\n'

Going through the examples, you will have noticed a few things:
1. All abstracts start with 2 blanks and end with a newline character.
2. Newline characters are included within the text for aesthetic purposes.
3. Abstracts use LaTeX to typeset mathematical expressions, which are enclosed in $$. These expressions also make use of the backslash '\', underscore '_' and caret (hat) '^' characters, and may include all types of brackets, i.e. '()', '[]', and '{}'.
4. In a few examples, the text has been corrupted and non-printable characters appear. For instance, in one example, "don't" is rendered as "donâ\x80\x99t".

These characteristics may introduce some noise to the data. While it may be true that a healthy degree of noise can naturally help prevent overfitting, we don't want the model to learn spurious patterns. Thus, some transformations are in order to clean the data.

1. Newline characters within the text are replaced by single spaces. In turn, an additional newline character is appended at the end, such that all abstracts start with two blank spaces and end with two newline characters.
2. Characters resulting from corrupted text are transformed back to their original meaning. This is inferred from reading the examples.

It is straightforward, if somewhat tedious, to find the relevant instances of corrupted text. The following cell performs the necessary transformations on the first 100k records. Note that further corrupted characters may appear beyond this point and must be checked manually if more examples are to be included.

Finally, the characters are represented as indices using the lookup table previously defined from the vocabulary.

In [26]:
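num_examples = 100000
gen_json = (json.loads(line) for line in itertools.islice(open('arxiv-metadata-oai-snapshot.json','r'),num_examples))
abs_list = []
for line in gen_json:
  abstract = line['abstract']  # named 'abstract' to avoid shadowing the built-in abs()
  # Collapse whitespace between words (including newlines) to a single space,
  # then double the remaining trailing newline
  abstract = re.sub(r'(\S)\s+(\S)',r'\1 \2',abstract).replace('\n','\n\n')
  # Replace mis-decoded byte sequences found by inspecting the examples
  abstract = abstract.replace('â\x80\x99',"'")  # bytes of the right single quotation mark
  abstract = abstract.replace('\x7f',"")        # DEL control character
  abstract = abstract.replace('â\x88\x9e',"'")  # bytes of the infinity sign
  abstract = abstract.replace('â\x89¤',"'")     # bytes of the less-than-or-equal sign
  abstract = abstract.replace('â\x80\x94',"'")  # bytes of the em dash
  abstract = abstract.replace('â\x80\x93',"-")  # bytes of the en dash
  # Encode every character as its vocabulary index
  for k in abstract:
    abs_list.append(char2idx[k])

abs_list = np.array(abs_list)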

In [27]:
abs_list.shape

(80064734,)
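
One subtlety of the whitespace pattern is worth illustrating. Because `(\S)\s+(\S)` consumes the character on each side of the gap, the whitespace that follows a one-letter word is skipped, so a newline after a word such as "a" survives and is then doubled. The sketch below assumes the intent is to collapse every interior whitespace run and uses lookarounds instead. `clean_abstract` and `MOJIBAKE` are hypothetical names, and only two of the replacements from the cell above are included.

```python
import re

# Mis-decoded sequences taken from the cell above (apostrophe and en dash only)
MOJIBAKE = {'â\x80\x99': "'", 'â\x80\x93': '-'}

def clean_abstract(text):
    # Lookarounds do not consume the neighbouring characters, so the
    # whitespace on both sides of a one-letter word is collapsed too
    text = re.sub(r'(?<=\S)\s+(?=\S)', ' ', text)
    text = text.replace('\n', '\n\n')   # only the trailing newline is left
    for bad, good in MOJIBAKE.items():
        text = text.replace(bad, good)
    return text

raw = '  This is a\ntest; it donâ\x80\x99t\nwrap.\n'
cleaned = clean_abstract(raw)

# The same input through the capturing-group pattern, for comparison
grouped = re.sub(r'(\S)\s+(\S)', r'\1 \2', raw).replace('\n', '\n\n')
```

Here `cleaned` is a single line ending in two newlines, whereas `grouped` still contains a doubled newline after the word "a".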

## 3. Create the model

Next we can initialise the model. This is done using the Keras API with TensorFlow as the backend. We will use the Embedding, LSTM and Dense layers. Alternatively, you can replace LSTM with GRU, which is faster to train. We also import os now, as it will be used to set up the checkpoints during training.

In [2]:
import os
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, GRU, Dense

The strategy for training the model is to divide the text into non-overlapping fixed-length sequences. The model is trained to predict the following character at each position within a sequence. Thus, for each sequence there is an input and a target sequence, where the target is just the input shifted one character forward.

At this point, we need to choose the length of the sequences. Scientific writing tends to contain longer sentences than other texts, as technical expressions are usually longer and sentences need to be precise in meaning to avoid ambiguities. Therefore, a sequence length of 150 characters is chosen based on this intuition. You can experiment with different lengths and verify the results.

The training, which makes use of gradient descent, is performed in mini-batches. This means that the training examples are fed in blocks containing a fixed number of examples. The batch size impacts training speed and accuracy. Values that are too small or too large can be slow to train. The optimum value is not known a priori, so this is another hyper-parameter to experiment with. Moreover, it has been noted that large batch sizes may lead to poorer generalisation if the learning rate is not adjusted accordingly. In this case, a value of 256 is chosen as an initial guess. You can experiment by recording training speed and performance for different batch sizes.

The following cell prepares the dataset to use for training:
1. The dataset is initialised using the Dataset class from the TensorFlow module. This allows for easy manipulation of the data.
2. The text is then separated into sequences.
3. Training and target sets are created
4. The examples are shuffled and packed into batches.

In [28]:
seq_len = 150
batch_size = 256
dataset = tf.data.Dataset.from_tensor_slices(abs_list)
dataset = dataset.batch(seq_len+1,drop_remainder=True)
dataset = dataset.map(lambda x: (x[:-1],x[1:]))
dataset = dataset.shuffle(1000).batch(batch_size,drop_remainder=True)

In [29]:
dataset

<BatchDataset shapes: ((256, 150), (256, 150)), types: (tf.int64, tf.int64)>

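The chunk-and-shift step can be illustrated without TensorFlow. The sketch below is a plain-Python stand-in for the `batch(seq_len+1)` and `map` calls above, where `make_pairs` is a hypothetical helper. It also works out the dataset sizes implied by the shapes printed so far.

```python
def make_pairs(indices, seq_len):
    """Split a flat index list into (input, target) pairs of length seq_len."""
    chunk = seq_len + 1                       # one extra item for the shifted target
    n_chunks = len(indices) // chunk          # an incomplete final chunk is dropped
    pairs = []
    for c in range(n_chunks):
        block = indices[c * chunk:(c + 1) * chunk]
        pairs.append((block[:-1], block[1:]))  # target = input shifted by one
    return pairs

toy_pairs = make_pairs(list(range(11)), seq_len=4)

# Sizes implied by the 80,064,734 encoded characters
total_chars, seq_len, batch_size = 80064734, 150, 256
n_sequences = total_chars // (seq_len + 1)
steps_per_epoch = n_sequences // batch_size
```

With these settings there are 530,230 sequences, which gives 2,071 full batches per epoch once the incomplete last batch is dropped.
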
The model can now be specified using the Sequential model class from Keras. This minimal example includes four layers.

The first is an Embedding layer. This layer learns a representation of the characters in a vector space of some dimension. The embedding dimension is the third hyper-parameter so far introduced. Larger dimensions allow for representations that capture more structure, however, they also increase the number of parameters to train. In principle, transfer-learning could be used by loading learnt representations and thus saving training time.

The second and third layers are LSTMs, recurrent units whose current state and output depend not only on the inputs but also on the previous state. The number of cells can be tuned to increase the complexity of the model. Note that return_sequences is set to True, indicating that each layer returns its output at every time step (many-to-many mode), instead of only the output at the last time step (many-to-one mode). The 'stateful' parameter will be set to False during training but can be turned on for generating text later on. With 'stateful' set to False, the internal state of the cells is reset to zeros at the start of each batch. Note that since we have shuffled the training examples and we are assuming that 150 characters is enough context to predict the next one, it would not make sense to train with 'stateful' in this case.

The fourth layer is a fully connected dense layer with as many units as the vocabulary size. The outputs of this layer are passed through a softmax function so they can be interpreted as a probability distribution over the characters. Note that the activation function is not included here, as it is more efficient to apply it when the loss is computed.

The loss is the categorical cross-entropy. The function 'SparseCategoricalCrossentropy' is used because the targets are supplied as integer class indices rather than one-hot vectors. This saves memory, as only the index of the non-zero entry has to be stored instead of the full vector. Note that the option 'from_logits' is set to True, indicating that the model outputs raw scores and the softmax function is applied inside the loss.

In [12]:
def make_model(vocabulary_size,embedding_dimension,rnn_units,batch_size,stateful):
  model = Sequential()
  model.add(Embedding(vocabulary_size,embedding_dimension,
                      batch_input_shape=[batch_size,None]))
  model.add(LSTM(rnn_units,return_sequences=True,stateful=stateful))
  model.add(LSTM(rnn_units,return_sequences=True,stateful=stateful))
  model.add(Dense(vocabulary_size))
  model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                optimizer='adam',metrics=['accuracy'])
  model.summary()
  return model
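
To illustrate what the sparse loss computes, the following plain-Python sketch applies a softmax to raw scores and evaluates the cross-entropy for a target given either as an index or as a one-hot vector. It is an illustration of the formula rather than the Keras implementation, and all function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, target_index):
    """Loss when the target is given as a class index."""
    return -math.log(softmax(logits)[target_index])

def dense_cross_entropy(logits, one_hot):
    """The same loss when the target is given as a one-hot vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0, 0.0]
loss_sparse = sparse_cross_entropy(logits, 0)
loss_dense = dense_cross_entropy(logits, [1.0, 0.0, 0.0, 0.0])

# Uniform scores over 100 characters give a loss of ln(100), about 4.6
loss_uniform = sparse_cross_entropy([0.0] * 100, 42)
expected_uniform = math.log(100)
```

Both forms give the same value, and the uniform case provides a rough reference for the loss of a model that has not learnt anything yet.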

The embedding dimension is set to 256 and the number of cells per LSTM layer to 1024.

In [31]:
emb_dim = 256
rnn_units = 1024
model = make_model(vocab_size,emb_dim,rnn_units,batch_size,False)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (256, None, 256)          25600     
_________________________________________________________________
lstm_6 (LSTM)                (256, None, 1024)         5246976   
_________________________________________________________________
lstm_7 (LSTM)                (256, None, 1024)         8392704   
_________________________________________________________________
dense_3 (Dense)              (256, None, 100)          102500    
Total params: 13,767,780
Trainable params: 13,767,780
Non-trainable params: 0
_________________________________________________________________
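
The parameter counts in the summary can be checked by hand. The sketch below uses the standard counting formulas for embedding, LSTM and dense layers, and the helper names are hypothetical.

```python
def embedding_params(vocab_size, emb_dim):
    # One vector of length emb_dim per character
    return vocab_size * emb_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, emb_dim, rnn_units = 100, 256, 1024
counts = {
    'embedding': embedding_params(vocab_size, emb_dim),
    'lstm_1': lstm_params(emb_dim, rnn_units),
    'lstm_2': lstm_params(rnn_units, rnn_units),
    'dense': dense_params(rnn_units, vocab_size),
}
total_params = sum(counts.values())
```

The results match the summary: 25,600 + 5,246,976 + 8,392,704 + 102,500 = 13,767,780 parameters, most of which sit in the two LSTM layers.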


## 4. Train the model

It is useful to set up checkpoints during training so that the weights are saved at the end of each epoch. In the following, the directory where the checkpoints are saved is declared. Then, the filename of the checkpoints is given, keeping the epoch number as a variable. To implement this in training, a callback is created using 'ModelCheckpoint' from 'callbacks'.

In [11]:
checkpoint_dir = '/content/training_checkpoints/'
checkpoint_prefix = os.path.join(checkpoint_dir,'chkpt_{epoch}')
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_prefix,
                                                         save_weights_only=True)
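
As a small illustration of the filename template, the sketch below expands the '{epoch}' placeholder with ordinary string formatting. It assumes the callback numbers epochs from 1, which is consistent with the 'chkpt_10' file that exists after 10 epochs. `posixpath` is used only so that the result does not depend on the operating system.

```python
import posixpath

checkpoint_dir = '/content/training_checkpoints/'
checkpoint_prefix = posixpath.join(checkpoint_dir, 'chkpt_{epoch}')

# Names expected for a 10-epoch run, assuming 1-based epoch numbers
epoch_names = [checkpoint_prefix.format(epoch=e) for e in range(1, 11)]
```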

Train the model using 10 epochs and the callback previously defined.

In [None]:
model.fit(dataset,epochs=10,callbacks=[checkpoint_callback])

Before using the model to generate text, we need to make a technical adjustment. Once the model has been constructed, the batch size cannot be changed. Thus, we need to build a new model with the batch size we want to use. In this case, a batch size of 1 is enough, as we will only feed in one piece of text at a time. This time 'stateful' can be set to True, as in theory we could want to feed several lines of text sequentially. The weights can be loaded from the latest checkpoint. Make sure that the path to the latest checkpoint is the correct one. In my case, after training I moved my checkpoints to my Google Drive, as local files are deleted when the Colab notebook is closed. The model is built by passing an empty example of the appropriate shape.

In [8]:
# Change this to the directory where you saved your weights
checkpoint_dir = '/content/drive/My Drive/nlp_projects/arxiv_text_generation/training_checkpoints'

You can verify the filename of the latest checkpoint

In [10]:
tf.train.latest_checkpoint(checkpoint_dir)

'/content/drive/My Drive/nlp_projects/arxiv_text_generation/training_checkpoints/chkpt_10'

In [16]:
model = make_model(vocab_size,emb_dim,rnn_units,1,True)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1,None]))

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (1, None, 256)            25600     
_________________________________________________________________
lstm (LSTM)                  (1, None, 1024)           5246976   
_________________________________________________________________
lstm_1 (LSTM)                (1, None, 1024)           8392704   
_________________________________________________________________
dense (Dense)                (1, None, 100)            102500    
Total params: 13,767,780
Trainable params: 13,767,780
Non-trainable params: 0
_________________________________________________________________


Optionally, you can save the full model in an HDF5 file.

In [17]:
model.save('/content/drive/My Drive/nlp_projects/arxiv_text_generation/model_13p8M_parameters.hdf5')

## 5. Generate text

The model can now be used to generate text. To do this, we can pass a string to act as a seed for the text we want to generate. Note this should be converted to the index representation using the table previously created. The following function takes care of this by converting a given string to the index representation, adding an artificial batch dimension of one, and running predictions on a loop until the desired number of characters is generated.

The next character is predicted by drawing at random from the probability distribution returned by the model. Note that since the targets were shifted sequences, the model outputs one set of scores per input character, i.e., a predicted distribution for the character that follows each position in the input (150 of them for a training sequence). We are only interested in the last one, i.e., the distribution of the unseen character.

The predicted character is used as the input in the next iteration. Note that the state of the LSTM cells is reset at the start of each independent prediction but is remembered while predicting a sequence of characters. In other words, if you call the function more than once, each call is assumed to be independent of the previous ones.

If you want the model to remember what it saw in previous calls, you can comment out the call to model.reset_states(). This is useful, for example, when all calls are part of the same sequence of text, such as when the model autocompletes words in a long document, alternating between human input and machine prediction.

In [20]:
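def generate_text(model,seed,num_characters):
  # Encode the seed as indices and add a batch dimension of size 1
  seed_text = tf.expand_dims([char2idx[k] for k in seed],0)
  generated_text = []
  # Start from a clean state so that each call is independent
  model.reset_states()
  # Note: the loop runs num_characters+1 times, so one extra character is generated
  for n in range(num_characters+1):
    # Sample the next index from the scores at the last time step
    result = tf.random.categorical(model(seed_text)[0,-1:],num_samples=1)
    result = result[0,0].numpy()
    generated_text.append(result)
    # Feed only the new character; the stateful LSTMs carry the context
    seed_text = tf.expand_dims([result],0)
  return ''.join(idx2char[generated_text])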
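
The sample-and-feed-back loop can be sketched without TensorFlow. In the toy below, `toy_logits` is a hypothetical stand-in for the trained network that strongly favours the next letter of a six-letter alphabet, and `sample_from_logits` mimics drawing from a categorical distribution defined by unnormalised scores, which is what `tf.random.categorical` does in the function above.

```python
import math
import random

def sample_from_logits(logits, rng):
    """Draw one index with probability softmax(logits)[index]."""
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]   # unnormalised probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

ALPHABET = 'abcdef'

def toy_logits(last_index):
    # Hypothetical stand-in for the network: favours the next letter in the cycle
    logits = [0.0] * len(ALPHABET)
    logits[(last_index + 1) % len(ALPHABET)] = 50.0
    return logits

def generate(seed, num_characters, rng):
    last = ALPHABET.index(seed[-1])
    out = []
    for _ in range(num_characters):
        # Sample the next character, then use it as the next input
        last = sample_from_logits(toy_logits(last), rng)
        out.append(ALPHABET[last])
    return ''.join(out)

generated = generate('ab', 6, random.Random(0))
```

Because one score dominates at every step, the toy output simply continues the alphabet cyclically. With the real model the scores are much flatter, which is why repeated calls with the same seed produce different continuations.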

Next we can try some different examples to see how well or badly the model performs. We'll give the model some initial text and call it 5 times to predict the next 100 characters. We start with a simple test:

In [65]:
# Can the model guess obvious characters/words?
seed = ("This is a short string to test if the model knows what the next character sh")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

are both the dipole moment. Based on these assumptions, we show how ZAMO is that the source points on
ould be observable in less than two of only one of the variables Vega sup.This paper addresses the ap
ould invaled a given almost sure constant. However for which we have two types of Caucations the topo
ows a classical warping coefficient. Our approach is modeled by calculating arbitrary functional form
ould be valid against one and histosized (i.e. cooperative) function. We extend and transform the sma


It correctly autocompletes "sh-" to "share", "should" and "shows", which are all possible options. Given the context, "should" should be favoured, and it does appear 3/5 times. Can it predict an 'obvious' word based on the previous ones?

In [23]:
# Can it guess obvious characters/words?
seed = ("This is a text about Quantum Field ")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

Theory (CDMFT). The quantum analysis of an upper bound on the character expressing the coefficients e
Theory. The paper concludes by incorporating symmetric double sets of the projection of the ideal of 
(CQF) is a buffer system, and is not an enhanced channel via an antagonia which capability and quanti
Evaluation Formula (QMTs) to anharmonic DLAs).In this work we prove that all dedicated systems engang
Nonestatic Welping group combines the optimal detector design techniques using Monte Carlo studies to


In [24]:
# Can it guess obvious characters/words?
seed = ("Einstein introduced the theory of General ")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

Relativity and electromagnetic fluxes. In this respect, we introduce the study of stochastic deformat
Relativity, this reduction takes place as quasi-local the density enhanced dynamics with a definite q
Relativity, when the dark energy coupless type 1 by slower cascaded cosmology with an accelerated cos
Relativity higher order at infinite barrier and coordinates.We review how the strongly coupled theory
Relativity or state transitions.Recent proofs of the orbifold compactification of a complex projectiv


In [26]:
# Can it guess obvious characters/words?
seed = ("The results of the experiment indicate that the null hypothesis can be ")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

included in the previous calculations and how creation of silicon diodes to form target states is des
proved either in terms of the kinetic dependence of the dissipation function of an individual atom la
reconstructed from 5 yr-emission. The X-ray and X-ray light curves and photospheric dynamical cosmic 
supposed to jet carryin different properties associated with $f_{\rm H1}$, demonstrating that miscold
suppressed using NLTE model.In this paper, we investigate dynamics of a chain octage of asymmetric nu


In [30]:
# Can it guess obvious characters/words?
seed = ("The gradient of the potential is computed by taking the derivative of the field with ")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

respect to an element.
  One kind of equation is the total height function to Berry within the framew
respect to the slip operator and some further hotter distributions. In this case we strengthen the sc
the result of a condition on a Chern chase of the interior is given. We present a dynamical model whi
one of the boundary conditions on the equation ${\cal M}^{w,t} = \frag{g}q(M)$, which is the sharp ma
the knot initial state equation. To this end, we address lack of solutions to dimension reductions in


It seems to do well with proper nouns such as QFT and GR, but not too well in other cases. We would expect a null hypothesis to be rejected, not "included". However, the model does predict verbs in the past participle form, ending in "-ed", which are syntactically correct. In the second example, the fairly common phrase "with respect to" is correctly autocompleted 2/5 times. The alternatives are also syntactically correct.

Next we can test whether the model is sensitive to how much initial context it has to make predictions on. This can be taken as a proxy for the memory span of the model.

In [32]:
# A short seed text, not much context given
seed = ("Quantum field theory is the framework that describes physics at the " 
        "miscroscopic scale. In this paper, we ")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

present the familiar analysis of semi-infinite graded splitting and actions on suitable isochrones fu
analyze the model assuming SU(N_c) gauge theories with the supersymmetry and decay changes to CP even
address the topology of a binary system in de Sitter space-time and the ML necessities from the sandw
discuss the terminology of some previous questions on truncated spheroid-operator algebras of removal
discuss fundamental quantum flows with arbitrarily lare classicality. It is shown that an accelerated


In [36]:
# Medium-sized seed text, some context given
seed = ("Quantum field theory is the framework that describes physics at the " 
        "miscroscopic scale. General relativity, on the other hand, describes "
        "dynamics of gravity across large distances. These two theories form "
        "the state-of-the-art of human understanding of the universe. However, "
        "the Standard Model and Einstein gravity are not compatible at the "
        "quantum level. In this paper, we ")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

provide a new photothe design regularisation approach.A finite temperature transition at C(H) substit
consider the model consisting of hypersurfaces to be on an open compact operator, and find that E(B-g
propose, assuming smooth Bow Bodies - bounds transform this black hole. From equivalence with the fil
provide quantitative evidence that this problem is modelled out by introduction of unbounded modes. W
characterize the semilinear phase space description for delocalization of cavity quantum interference


In [57]:
# Actual abstract from 2011.02926 minus the last sentence.
seed = ("We consider two fundamental long-standing problems in quantum "
        "chromodynamics (QCD): the origin of color confinement and structure "
        "of a true vacuum and color singlet quantum states. There is a common "
        "belief that resolution to these problems needs a knowledge of a "
        "strict non-perturbative quantum Yang-Mills theory and new ideas. "
        "Our principal idea in resolving these problems is that structure of "
        "color confinement and color singlet quantum states must be determined "
        "by a Weyl symmetry which is an intrinsic symmetry of the Yang-Mills "
        "gauge theory, and by properties of a selected class of solutions "
        "satisfying special requirements. Following this idea we construct for "
        "the first time a space of color singlet one particle quantum states "
        "for primary colorless gluons and quarks and reveal the structure of "
        "color confinement in quantum Yang-Mills theory. "
        "As an application we demonstrate ")
        # "formation of physical observables in a pure QCD, pure glueballs.")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

that the so-called global solution in QCD is established by generating a Josephson exchange (PI) fiel
that the second correspondence between two simple clusters with virtual gluons is conserved in shaplo
that the CO gate is disregarted as given on by O(1) in terms of the effective masses on full gauge th
how the coupling also reproduces the observed anomaly at the internal layers or in angular scales for
that the non-classical sources for $P_{1/1}$ perturbations at finite magnetic fields like a real vari


The predictions show a relatively weak dependence on context. For the short context, 2/5 predictions contain keywords that can be attributed to particle physics. For the medium context, only 1/5 predictions is related to gravity, while the others do not clearly follow a specific theme. For the long context, all predictions are clearly related to particle physics. Thus, longer contexts help the model to focus on the specific topic of the text; in particular, topic keywords are generated instead of more general sentences that could apply to different topics.

Let's now consider what the model does when it encounters something completely different from what it has been trained on.

In [38]:
# Unrelated string
seed = ("This string does not have to do with science at all, it's a text "
        "about baby shark. Did you know baby shark is the most watched video "
        "on Youtube as of November 2020? That is insane, this is because ")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

of its matrix identity the Bell inequality. In this work, we use the ladder ensembles of long standin
orthogonalization between the Nitrix de Braue allows the source and the coalition properties of the t
the view will be measured to be reduced in 1901 up to 95.4 (161 -276-250 km) are re-expressed as dist
all elastic control systems which are plausible from the more and most studied form of the classical 
its separation is between the referee out of the two target features for each case. We provide an alg


It has only been trained on abstracts of scientific articles, so of course it does not know how to write sentences in an informal manner, and it tends to go back to scientific topics.

In [40]:
# Text in another language. The characters have structure (word and sentences) 
# but the model has never seen these combinations before
seed = ("Este texto esta escrito en otro idioma que el modelo no ha aprendido, "
        "veamos que pasa si recibe estas palabras raras que no existen en "
        "ingles. Lo que el modelo predice es que ")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

time financert' resonances considerable, we concentrate on the seesaw maturity of its operators and t
la  \`on k-fabrideaum d'e de lence immediate & compilator exponentialtion la cettian. To each in plac
les problema components - Pont\citeplete langar of 10 percent t. A molecule removal (ECRT). In this w
les variant M column teches corrabolog expression de medistille can communction kernel to pass abstra
earths et ap networks under revised Solid Enstreides, bidirectionally coupled them, his semilinear "s


The training set is written mostly in English, so the model does not know what to do with text written in another language. However, note that some French words appear every now and then, as well as made-up words. It is possible that some of the abstracts were either written in French or contained French words, and the model remembers this. It can somehow tell that this pattern of characters is not English and it is trying to replicate non-English sequences it has seen before. However, it has not seen enough non-English examples to really do anything with this. In some sense, the model is 'pretending' to speak another language.

In [43]:
# Gibberish
seed = ("m0n923h lxnaefpw;'[kdawpe_dlen;a[ak[k] [';jd0389hufw")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

ol]2ittwavever, both quantum and networks with respect to cluster sizes are also studied.The calculat
own]_scot]) = +1.375(8), 1.86523.5853(2). We also show that it was still simple and useful to constru
en,8e]In. We will discuss new applications; Zasparefandillo's theory of Schwartz semigroups and canon
]^storisate segments under the Uniseneal coupling instead of axiomatized families for the pattern of 
]+ -> \mathbf{CP}^{\infty.})_{n=1}^{+} (in [0,1],\pi_1|XT_n(x+1)X_2|X_2|I_{n=1}^*;\leq |\net{V}}|B|V_


Given gibberish as input, the model seems to assume that it is part of some technical or numeric expression. In particular, note that it tries to close the open square brackets left in the seed text.
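
The seed leaves several square brackets open, which may be what the generated `]` characters respond to. As a minimal sketch (the helper `unclosed_brackets` is a hypothetical name, not part of the notebook), the following counts how many brackets the seed leaves unmatched:

```python
def unclosed_brackets(text, opening="[", closing="]"):
    """Return the number of opening brackets left unmatched in `text`."""
    depth = 0
    for ch in text:
        if ch == opening:
            depth += 1
        elif ch == closing and depth > 0:
            # A closing bracket matches the most recent open one.
            depth -= 1
    return depth

# Same gibberish seed as in the cell above.
seed = "m0n923h lxnaefpw;'[kdawpe_dlen;a[ak[k] [';jd0389hufw"
open_count = unclosed_brackets(seed)
print(open_count)
```

For this seed the count is 3, which is consistent with every sample emitting a `]` within its first few characters. The sketch only inspects the seed; it says nothing about what the model has actually learned.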

In [67]:
# Does it know how to use question marks?
seed = ("Can naked singularities exist")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

 as well as an elliptic definite. If ${\mathbb Z}^\sigma$ and a closed ti bell conjecture, and if the
, in fact the media of a compact coordinate group enlible in a symmetric even when such that physical
 with respect to the geometric structure and details of the technics or as integrator theory on its b
 at rest of the original precompact abelian function. We give formulas to study the most general disc
, thus providing a framework for iteratively carrier more in the theory. In particular, an interestin


The model does not introduce a question mark, even where one would be a natural continuation. This is likely because relatively few abstracts contain question marks, so the character is virtually absent from the training data.
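
The rarity of question marks could be checked directly on the training corpus. The sketch below shows one way to do so. Both `fraction_containing` and the toy `abstracts` list are illustrative assumptions rather than the notebook's data, so the printed value says nothing about the real arXiv set.

```python
def fraction_containing(texts, char):
    """Fraction of strings in `texts` that contain `char` at least once."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(char in text for text in texts) / len(texts)

# Toy stand-in for the corpus (hypothetical examples, not real abstracts).
abstracts = [
    "We study the stability of a class of solutions.",
    "Can such a system be solved exactly? We show that it can.",
    "A new bound is derived for the error term.",
    "We present measurements of the cross section.",
]

share = fraction_containing(abstracts, "?")
print(f"{share:.2%} of the toy abstracts contain a question mark")
```

Running the same function over the list of training abstracts would give the actual share.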

Next, what will it do when it encounters an out-of-vocabulary character? Since we have not taken any precautions to handle this scenario, the answer is simple: it will throw an error.

In [44]:
# OOV test
seed = ("This string contains a character out of vocabulary, the character "
        "ñ, what will the model do? It predicts the following, ")
for k in range(5):
  temp = generate_text(model,seed,100)
  print(temp)

KeyError: ignored
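
One simple precaution, sketched here under assumptions, is to sanitise the seed before it reaches the model. The helpers `sanitize_seed` and `oov_characters` and the small `vocab` set are hypothetical. In the notebook the vocabulary would be the set of characters collected from the training corpus, and `generate_text` is assumed to look up each seed character in a dictionary, which is where the `KeyError` would come from.

```python
import string

def oov_characters(seed, vocab):
    """Return the sorted distinct characters of `seed` that are missing from `vocab`."""
    return sorted(set(seed) - set(vocab))

def sanitize_seed(seed, vocab, replacement=""):
    """Replace every character of `seed` that is not in `vocab` with `replacement`."""
    return "".join(ch if ch in vocab else replacement for ch in seed)

# Hypothetical vocabulary for illustration only.
vocab = set(string.ascii_letters + string.digits + " ,.?")

seed = "the character ñ, what will the model do?"
missing = oov_characters(seed, vocab)
clean_seed = sanitize_seed(seed, vocab)
print(missing)
print(clean_seed)
```

The cleaned string could then be passed on, for example as `generate_text(model, clean_seed, 100)`. Dropping characters silently alters the seed, so reserving a dedicated "unknown" character at training time may be a more principled alternative.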

Finally, let us have it predict a long text and watch the output make less and less sense.

In [54]:
# Actual abstract from 2011.02926 minus the last sentence.
seed = ("We consider two fundamental long-standing problems in quantum "
        "chromodynamics (QCD): the origin of color confinement and structure "
        "of a true vacuum and color singlet quantum states. There is a common "
        "belief that resolution to these problems needs a knowledge of a "
        "strict non-perturbative quantum Yang-Mills theory and new ideas. "
        "Our principal idea in resolving these problems is that structure of "
        "color confinement and color singlet quantum states must be determined "
        "by a Weyl symmetry which is an intrinsic symmetry of the Yang-Mills "
        "gauge theory, and by properties of a selected class of solutions "
        "satisfying special requirements. Following this idea we construct for "
        "the first time a space of color singlet one particle quantum states "
        "for primary colorless gluons and quarks and reveal the structure of "
        "color confinement in quantum Yang-Mills theory. "
        "As an application we demonstrate ")
        # "formation of physical observables in a pure QCD, pure glueballs.")
temp = generate_text(model,seed,1000)

In [55]:
import textwrap
print(textwrap.fill(temp,80))

that specifics of the non-Abelian algebra are studied via exact solutions of the
\textit{type} identity theorem for SU(3)Gissilov equations and the Reed-Solomon
triples modified Jolan formula and apply this method. We present optimal control
of the various dynamical systems that defines the Hamiltonian of the four-state
free association a decay of e^+ invariance. Neglecting some other variations on
some webs, this negative treatment predicts the general list of Lorentz-
invariance. The present work investigates smoothness violations of finite box
scattered light in superfields as they share intuitionists.In the framework of
nonclassical low bulk systems, we extend the method to deconstruction at the
equator of presentation. Here we propose that abstract approaches at seesaw
manifolds are obtained for Gowers over knots. We calculate weak actions for the
so called "MacsherWith algebra". Conservation of secular equations is
considered.We explore the phenomenology of a new phase state mode

## 6. Summary

We have implemented a simple character-level RNN model that generates text, trained on abstracts from articles in the arXiv. The model contains an embedding layer, two LSTM layers and a fully connected dense layer. Given enough training time, this simple architecture can autocomplete words, predict the next word and formulate complete sentences that are syntactically correct. The more context it is given, the more specific its predictions are, as evidenced by the appearance of keywords associated with particular scientific areas. When faced with non-scientific text, it tends to drift back to scientific topics, as that is what it has been trained on. It can recognise when sequences are non-English or are not words; however, in these cases it cannot give coherent predictions. Although the spelling and grammar of the predicted text are mostly correct, the model lacks understanding of the topic and does not display long-term memory.