# RNN For Text Generation

John Sullivan  
Daniel Crews

Recurrent Neural Networks are effective at detecting patterns in sequential data, which makes them a good choice for applications like time series prediction.  This same strength can be leveraged to detect patterns in other types of sequential input, such as language.

The purpose of this project is to use neural networks to create text, specifically tweets.  With a corpus of 20,000+ tweets, our goal is to generate text that is similar in style to the source text.  This is accomplished by first training the network on the corpus and then generating 140 characters, starting with a randomly chosen character as input.
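
The generation step described above can be sketched as a sampling loop. This is only an illustration under stated assumptions, not the project's code: `generate_text` and `uniform_model` are hypothetical names, `uniform_model` stands in for a trained network by giving every character equal probability, and counting the seed character toward the 140 is an assumption.

```python
import numpy as np

def generate_text(next_char_probs, vocab, length=140, seed=None):
    """Sample `length` characters, feeding each sampled character back in.

    `next_char_probs` is a hypothetical callable: given the text so far, it
    returns one probability per entry of `vocab` for the next character.
    """
    rng = np.random.default_rng(seed)
    # Start from a randomly chosen character (assumed to count toward `length`)
    text = [vocab[rng.integers(len(vocab))]]
    while len(text) < length:
        probs = next_char_probs("".join(text))
        # Draw the next character index according to the model's probabilities
        text.append(vocab[rng.choice(len(vocab), p=probs)])
    return "".join(text)

def uniform_model(text_so_far, vocab_size=5):
    # Stand-in for a trained network: every character is equally likely
    return np.full(vocab_size, 1.0 / vocab_size)

vocab = list("abc !")
sample = generate_text(uniform_model, vocab, length=140, seed=0)
print(len(sample))
```

Sampling from the predicted distribution, rather than always taking the most likely character, keeps the output from collapsing into a repeated loop.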

## Collect the tweets from the repository

Our collection of tweets comes from @realDonaldTrump, chosen because a large number of tweets are available (in a GitHub repository hosted by user *bpb27*) and because of the author's distinctive style.

The `collect_tweets` function downloads the yearly archives from the repository and unzips them.  `fix_tweets` drops retweets and any tweet that contains a link, leaving only tweets with actual prose.  It then strips line breaks and tabs and returns the text as an array of single characters, with a space between tweets.
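
To make the filtering rule concrete, here is a small sketch that applies the same checks to a few made-up records. The helper name `clean_tweet` and the sample records are hypothetical; only the `text` and `is_retweet` fields and the two regular expressions come from the notebook code.

```python
import html
import re

LINK = re.compile(r'http')
WHITESPACE = re.compile(r'[\r\n\t]')

def clean_tweet(msg):
    """Return the cleaned text of a tweet, or None if it should be dropped."""
    if msg['is_retweet'] or LINK.search(msg['text']):
        return None
    # Decode HTML entities, then delete line breaks and tabs
    return WHITESPACE.sub('', html.unescape(msg['text']))

# Made-up records shaped like the fields fix_tweets reads
records = [
    {'text': 'Big news &amp; more\ncoming soon', 'is_retweet': False},
    {'text': 'Read this: http://example.com', 'is_retweet': False},
    {'text': 'RT something', 'is_retweet': True},
]
cleaned = [clean_tweet(r) for r in records]
print(cleaned)
```

Because line breaks are replaced with an empty string, the words on either side of a break are joined (`more` and `coming` above); substituting a space instead would keep them apart.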

In [1]:
import zipfile
import urllib.request
import html
import json
import re
import numpy as np

URL_F = 'https://github.com/bpb27/trump_tweet_data_archive/raw/master/condensed_{}.json.zip'
TWEET_YEARS = ['2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016']

regex = re.compile(r'[\r\n\t]')
hypertext = re.compile(r'http')

def fix_tweets(fnames):
    tweets = []
    for fname in fnames:
        with open(fname) as fp:
            for msg in json.load(fp):
                if msg['is_retweet'] or hypertext.search(msg['text']):
                    continue
                text = regex.sub('', html.unescape(msg['text']))
                for t in text:
                    tweets.append(t)
                tweets.append(' ')
    return np.array(tweets)

def collect_tweets(download = True, years = TWEET_YEARS, dl_years = TWEET_YEARS):
    if download:
        for year in dl_years:
            request = urllib.request.urlopen(URL_F.format(year))
            with open(f'condensed_{year}.json.zip', 'wb') as fp:
                fp.write(request.read())
            with zipfile.ZipFile(f'condensed_{year}.json.zip') as jzip:
                jzip.extractall()
    return fix_tweets([f'condensed_{year}.json' for year in years])

## Preprocessing

After collecting the tweets we create the X and y data sets.  Since the neural network works best with numeric data, we use two dictionaries, one which turns the chars into ints and another to reverse the process.

The output layer of the network should have one unit for each possible character in the data set; this count is obtained by gathering all the unique characters before processing.  Next, labels for the training set are created by shifting the sequence one position to the left, so the label for each character is the character that follows it.  Since the last character has no following character to serve as its label, it is removed from the inputs.
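
The shift is easiest to see on a toy string. The sketch below uses `np.eye` for the one-hot step instead of scikit-learn's `OneHotEncoder`, purely to keep the example short; the names `X_toy`, `y_toy`, and `decode` are illustrative and not part of the project.

```python
import numpy as np

chars = np.array(list("hello"))
vocab = np.unique(chars)                       # sorted unique characters
char_to_int = {c: i for i, c in enumerate(vocab)}
int_to_char = {i: c for i, c in enumerate(vocab)}

ids = np.array([char_to_int[c] for c in chars])
one_hot = np.eye(len(vocab))[ids]              # one row per character

X_toy = one_hot[:-1]   # every character except the last
y_toy = one_hot[1:]    # the character that follows each input

def decode(rows):
    # Turn one-hot rows back into a string
    return "".join(int_to_char[i] for i in rows.argmax(axis=1))

print(decode(X_toy), "->", decode(y_toy))
```

Each row of `y_toy` is the one-hot vector of the character that comes right after the matching row of `X_toy`.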

In [2]:
from sklearn.preprocessing import OneHotEncoder
one_hot = OneHotEncoder(sparse=False)
tweets = collect_tweets(download = False) # Collect the tweets

In [3]:
# Dictionaries used to translate between characters and integer indices
char_to_int = {char:ind for ind, char in enumerate(np.unique(tweets))}
int_to_char = {ind:char for ind, char in enumerate(np.unique(tweets))}

# Get the size of the vocab
char_count = len(np.unique(tweets)) # Total number of characters / output size

# Transform the data into numbers for one-hot encode
data = [char_to_int[char] for char in tweets]

# One hot encode the data
data = one_hot.fit_transform(np.reshape(data, (len(data), 1)))

# Build inputs and labels: each label is the character that follows its input
X = data[:-1]
y = data[1:]

## TensorFlow

The RNN is small with only a single layer of 100 neurons to make training time fast enough to experiment and test without too much downtime.  This would need to be expanded greatly to train the final network.

One challenge we're dealing with is the best way to structure the data when sending it into the network.  `dynamic_rnn` expects the data to be structured like `[instances, time_steps, n_inputs]`.  If one character is being predicted at a time, should the labels (`y`) be fed as an array the length of a tweet or as a flat array?
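
One possible arrangement, offered as a sketch rather than a settled answer, is to keep the labels in the same three-dimensional layout as the inputs and flatten them only when the loss is computed, so that each row lines up with one (sequence, time step) pair. The toy sizes and made-up character ids below are assumptions for illustration.

```python
import numpy as np

n_steps, n_inputs = 4, 3                      # toy sizes
ids = np.arange(13) % n_inputs                # 13 made-up character ids
one_hot = np.eye(n_inputs)[ids]

X_flat, y_flat = one_hot[:-1], one_hot[1:]    # 12 inputs, 12 labels
usable = (len(X_flat) // n_steps) * n_steps   # drop any ragged tail

# [instances, time_steps, n_inputs] for both inputs and labels
X_seq = X_flat[:usable].reshape(-1, n_steps, n_inputs)
y_seq = y_flat[:usable].reshape(-1, n_steps, n_inputs)

# Flattening the labels gives one row per (sequence, time step) pair
y_rows = y_seq.reshape(-1, n_inputs)
print(X_seq.shape, y_seq.shape, y_rows.shape)
```

With this layout the label at step `t` of a sequence is the input at step `t + 1` of the same sequence, and the flattened labels have the same row order as the flattened network outputs.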

Another issue we have encountered is how to determine the cost function.  Most other projects that have tackled this problem have used `sparse_softmax_cross_entropy_with_logits`, which takes integer class labels; the code below uses `softmax_cross_entropy_with_logits` with one-hot labels instead.  We are also unsure how to measure performance.  At the moment, the network always produces the same accuracy score, 0.74, regardless of how much time is spent training.  This is likely because the extra spaces following a tweet are encoded with zeroes.  However, when the loss is monitored, it does decrease.
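
If padded positions really are encoded as all-zero label rows, a plain argmax comparison will count them as matches whenever the network predicts class 0. The sketch below shows one way to leave those rows out of the accuracy. `masked_accuracy` is a hypothetical helper, and the probabilities and labels are made up.

```python
import numpy as np

def masked_accuracy(probs, labels_one_hot):
    """Fraction of correctly predicted characters, skipping all-zero label rows."""
    probs = probs.reshape(-1, probs.shape[-1])
    labels = labels_one_hot.reshape(-1, labels_one_hot.shape[-1])
    valid = labels.sum(axis=1) > 0             # all-zero rows are treated as padding
    if not valid.any():
        return 0.0
    hits = probs.argmax(axis=1)[valid] == labels.argmax(axis=1)[valid]
    return float(hits.mean())

# Made-up predictions over a 3-character vocabulary
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.9, 0.05, 0.05]])
labels = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 1],
                   [0, 0, 0]])                 # last row: padding

naive = float((probs.argmax(axis=1) == labels.argmax(axis=1)).mean())
print(naive, masked_accuracy(probs, labels))
```

In this toy case the naive score is higher than the masked one because the padding row is counted as a correct prediction, which is the kind of inflation that could hold a reported accuracy at a fixed value.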

In [4]:
n_inputs = char_count
n_outputs = char_count
n_neurons = 100
n_steps = 20
n_layers = 1 # Single layer to make experimentation faster

# Use up to 2 million characters, trimmed to a whole number of n_steps-long sequences
n_chars = (min(len(X), 2000000) // n_steps) * n_steps
X_train = np.reshape(X[:n_chars], (n_chars // n_steps, n_steps, char_count))
y_train = np.reshape(y[:n_chars], (n_chars // n_steps, n_steps, char_count))

In [5]:
import tensorflow as tf

In [6]:
tf.reset_default_graph()

learning_rate = 0.001

# Placeholders for the input sequences and their one-hot labels
# (note: these rebind X and y, which held the NumPy arrays above)
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None, None, n_outputs])
y_batch_long = tf.reshape(y, [-1, n_outputs])


# LSTM
lstm_cells = [tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons) for layer in range(n_layers)]
multi_cell = tf.contrib.rnn.MultiRNNCell(lstm_cells)


# Iteratively compute output of recurrent network
outputs, states = tf.nn.dynamic_rnn(multi_cell, X, dtype=tf.float32)


# Linear activation (FC layer on top of the LSTM net)
rnn_out_W = tf.Variable(tf.random_normal( (n_neurons, n_outputs), stddev=0.01 ))
rnn_out_B = tf.Variable(tf.random_normal( (n_outputs, ), stddev=0.01 ))

outputs_reshaped = tf.reshape( outputs, [-1, n_neurons] )
network_output = ( tf.matmul( outputs_reshaped, rnn_out_W ) + rnn_out_B )


# Catch the output for text generation
batch_time_shape = tf.shape(outputs)
final_outputs = tf.reshape( tf.nn.softmax( network_output), (batch_time_shape[0], batch_time_shape[1], n_outputs) )


# Training: provide target outputs for supervised training.
cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=network_output, labels=y_batch_long) )
train_op = tf.train.RMSPropOptimizer(learning_rate, 0.9).minimize(cost)        

In [7]:
init = tf.global_variables_initializer()

n_epochs = 10
batch_size = 100

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(len(X_train)//batch_size):
            X_batch = X_train[iteration*batch_size:iteration*batch_size+batch_size]
            y_batch = y_train[iteration*batch_size:iteration*batch_size+batch_size]
            cst, _, pred = sess.run([cost, train_op, final_outputs], feed_dict={X: X_batch, y: y_batch})
            
            # Sample one character from the prediction for the first time step of
            # the first sequence, reusing `pred` instead of re-running the graph
            letter_probs = pred[0][0]
            element = np.random.choice(range(char_count), p=letter_probs)
            print(int_to_char[element], end="")
            
        print("Epoch", epoch, "Cost = ", cst)

☺t['JM8é😳😆❤👋👋☞👢=😎B📺😂!

KeyboardInterrupt: 

## The plan

Seeing the cost decrease gives us hope that the network is being trained, but we are still unsure how effective it will be at detecting the patterns and predicting characters.  We still need to experiment with batch size and training instance size.  If our current approach isn't actually working, we will need to continue researching how others have structured these kinds of networks.