## Tweeting like Trump - Generating Trump-like tweets using Stacked LSTMs

#### By: Ian Herve U. Chu Te

In this notebook, we train a character-level **two-layer LSTM model** on U.S. President Donald Trump's tweets and use it to generate new tweets (complete with hashtags, mentions and links).

The data we use is sourced from Kaggle's **Better Donald Trump Tweets** dataset. You may download the dataset <a href="https://www.kaggle.com/kingburrito666/better-donald-trump-tweets">here</a>.

### 1. Feature Engineering

Raw text cannot be fed directly into the LSTM model. We first have to turn it into fixed-length sequences of one-hot encoded characters before we can proceed to the modelling step.

Firstly, let us import some libraries.

In [1]:
import numpy as np
import pandas as pd

Then, let us load the dataset.

**NOTE:** If you wish to reuse this notebook, download the <a href="https://www.kaggle.com/kingburrito666/better-donald-trump-tweets">Kaggle Dataset</a>, unzip it, and rename the CSV file to *data.csv*.

In [2]:
data = pd.read_csv('data.csv')

In [3]:
data.head()

Unnamed: 0,Date,Time,Tweet_Text,Type,Media_Type,Hashtags,Tweet_Id,Tweet_Url,twt_favourites_IS_THIS_LIKE_QUESTION_MARK,Retweets,Unnamed: 10,Unnamed: 11
0,16-11-11,15:26:37,Today we express our deepest gratitude to all ...,text,photo,ThankAVet,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,127213,41112,,
1,16-11-11,13:33:35,Busy day planned in New York. Will soon be mak...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,141527,28654,,
2,16-11-11,11:14:20,Love the fact that the small groups of protest...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,183729,50039,,
3,16-11-11,2:19:44,Just had a very open and successful presidenti...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/796...,214001,67010,,
4,16-11-11,2:10:46,A fantastic day in D.C. Met with President Oba...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/796...,178499,36688,,


All we need is the **Tweet_Text** field. Let us combine all the rows to create a text corpus by concatenating tweets but separating them with two newlines:

In [4]:
text = '\n\n'.join(data['Tweet_Text'].values)

To reduce the size of our feature space and our training time, we remove rare characters, here defined as those that occur fewer than 300 times in the corpus:

In [5]:
from collections import Counter
import re

In [6]:
cntr = Counter(text)
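# Characters that occur fewer than 300 times in the corpus
rare = {c for c, n in cntr.items() if n < 300}
# Filter directly instead of calling re.sub: characters such as '\' or '^'
# would otherwise be misread as regex syntax inside a character class
text = ''.join(c for c in text if c not in rare)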
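
To make the effect of this filtering step concrete, here is a minimal sketch on a toy string. The helper name `drop_rare_chars`, the toy input and the threshold of 2 are illustrative assumptions; the notebook itself applies a threshold of 300 to the full corpus.

```python
from collections import Counter

def drop_rare_chars(text, min_count):
    """Remove every character that occurs fewer than min_count times."""
    counts = Counter(text)
    keep = {c for c, n in counts.items() if n >= min_count}
    return ''.join(c for c in text if c in keep)

# 'a' and 'b' occur three times and the space twice;
# '[', 'c' and ']' occur once each and are dropped.
toy = 'aaa bbb [c]'
cleaned = drop_rare_chars(toy, min_count=2)
print(repr(cleaned))
```

Filtering through a set of kept characters sidesteps any escaping concerns, since no character is ever interpreted as pattern syntax.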

Here is what the start of the corpus looks like:

In [7]:
text[:1000]

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z\n\nBusy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!\n\nLove the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!\n\nJust had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!\n\nA fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!\n\nHappy 241st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz2dhrXzo4\n\nSuch a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before\n\nWatching the returns at 9:45pm.\n#ElectionNight #MAGA__ https://t.co/HfuJeRZ

The corpus is 857177 characters long and there are 78 unique characters within it:

In [8]:
print('corpus length:', len(text))
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 857177
total chars: 78


Now, let us cut the text into overlapping (semi-redundant) sequences of *maxlen* characters, each paired with the character that follows it, so that it can be fed into an LSTM model:

In [9]:
maxlen = 50
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 285709
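
The windowing logic is easier to follow on a short string. The sketch below wraps the same loop in a hypothetical helper, `make_windows` (a name introduced here only for illustration), runs it on a ten-character toy text, and then checks the sequence count reported above from the corpus length alone.

```python
def make_windows(text, maxlen, step):
    """Return (input windows, next characters) using the same loop as above."""
    sentences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i:i + maxlen])   # maxlen input characters
        next_chars.append(text[i + maxlen])    # the character to predict
    return sentences, next_chars

toy_sentences, toy_next = make_windows('abcdefghij', maxlen=4, step=3)
for s, n in zip(toy_sentences, toy_next):
    print(repr(s), '->', repr(n))

# The number of windows depends only on the corpus length, maxlen and step.
n_sequences = len(range(0, 857177 - 50, 3))
print(n_sequences)
```

Consecutive windows start `step` characters apart, so with `maxlen = 50` and `step = 3` neighbouring windows share 47 characters, which is what "semi-redundant" refers to.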


Then, let us vectorize the sentences:

In [10]:
# Use the built-in bool: the np.bool alias was removed in NumPy 1.24
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
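
As a small illustration of this one-hot encoding, the sketch below vectorizes two toy windows over a three-character vocabulary. All `toy_*` names and values are made up for the example. The last lines estimate the memory taken by the real `X`, assuming the sequence count, `maxlen` and vocabulary size reported earlier and one byte per boolean element.

```python
import numpy as np

toy_sentences = ['ab', 'bc']   # input windows (maxlen = 2)
toy_next = ['c', 'a']          # character that follows each window
toy_chars = sorted(set(''.join(toy_sentences + toy_next)))  # ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}

# One row per window, one one-hot vector per time step
X_toy = np.zeros((len(toy_sentences), 2, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        X_toy[i, t, toy_index[char]] = True
    y_toy[i, toy_index[toy_next[i]]] = True

print(X_toy.astype(int))
print(y_toy.astype(int))

# Rough size of the full input tensor: sequences x maxlen x vocabulary
full_bytes = 285709 * 50 * 78
print(full_bytes / 1e9, 'GB')
```

At roughly 1.1 GB for `X`, the boolean dtype matters: the same tensor stored as 64-bit floats would be eight times larger.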

### 2. Generative Modelling

Now we can proceed to the modelling phase.

First, let us import *Keras* - a powerful neural network library.

In [11]:
from __future__ import print_function
import random
import sys
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop

Using TensorFlow backend.


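Let's create some reusable functions that can sample and generate text from our generative model. We also compute the relative frequency of each character (`char_probs`), which is used later to pad seeds that are shorter than *maxlen*: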

In [14]:
cntr = Counter(text)
cntr_sum = sum(cntr.values())
char_probs = list(map(lambda c: cntr[c] / cntr_sum, chars))

In [15]:
def sample(preds):
    preds = np.asarray(preds).astype('float64')
    preds = preds / np.sum(preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
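
`sample` draws a single index from the categorical distribution described by `preds`, rather than always picking the most likely character. As a quick sanity check, the sketch below repeats the function and draws from a made-up three-class distribution; the probabilities, the seed and the number of draws are arbitrary choices for illustration.

```python
import numpy as np

def sample(preds):
    # Cast to float64 and renormalize so the probabilities sum to 1
    preds = np.asarray(preds).astype('float64')
    preds = preds / np.sum(preds)
    # One multinomial trial yields a one-hot vector; argmax recovers the index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

np.random.seed(0)
toy_preds = [0.7, 0.2, 0.1]
draws = [sample(toy_preds) for _ in range(10000)]

# Empirical frequencies should be close to toy_preds
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)
```

The cast and renormalization are likely there to guard against rounding: softmax outputs are typically 32-bit floats, and their sum can drift slightly away from 1.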

In [16]:
def generate(model, length, seed=''):
    """Print and return `seed` followed by `length` sampled characters."""
    if len(seed) != 0:
        sys.stdout.write(seed)

    generated = seed
    sentence = seed[-maxlen:]  # the model only sees the last maxlen characters

    for _ in range(length):
        x = np.zeros((1, maxlen, len(chars)))

        # Left-pad contexts shorter than maxlen using the character priors
        padding = maxlen - len(sentence)
        x[0, :padding] = char_probs

        for t, char in enumerate(sentence):
            x[0, padding + t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds)
        next_char = indices_char[next_index]

        # Grow the context until it reaches maxlen, then slide the window
        sentence = (sentence + next_char)[-maxlen:]
        generated += next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()

    return generated
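
The way a short seed is turned into a model input can be sketched in isolation. The helper `encode_seed` below is hypothetical (it is not part of the notebook), and the two-character vocabulary and its priors are made up. It assumes the seed is truncated to its last `maxlen` characters, and it fills the leading positions with the character priors instead of zeros.

```python
import numpy as np

def encode_seed(seed, maxlen, chars, char_probs):
    """Build a (1, maxlen, len(chars)) input, left-padded with the priors."""
    char_indices = {c: i for i, c in enumerate(chars)}
    seed = seed[-maxlen:]                    # assumption: keep the last maxlen chars
    x = np.zeros((1, maxlen, len(chars)))
    padding = maxlen - len(seed)
    x[0, :padding] = char_probs              # prior distribution at padded steps
    for t, ch in enumerate(seed):
        x[0, padding + t, char_indices[ch]] = 1.0   # one-hot at real steps
    return x

toy_chars = ['a', 'b']
toy_probs = [0.75, 0.25]
x_toy = encode_seed('ab', maxlen=4, chars=toy_chars, char_probs=toy_probs)
print(x_toy[0])
```

Every time step then holds a valid probability vector: the priors where no character is known yet, and a one-hot vector where the seed provides one.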

Now, let us build the graph of our neural network.

Afterwards, let us train our model, and display some samples at every epoch. 

At the end of the training, we save the model so we can quickly reuse it in the future.
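
To get a feel for the size of the network built in the next cell, its trainable parameters can be counted by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights and one bias vector) and uses the 78-character vocabulary, 128 hidden units and batch size of 3000 from this notebook. The helper names are introduced only for this example.

```python
def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix + bias
    return input_dim * units + units

n_chars, n_hidden = 78, 128

layer1 = lstm_params(n_chars, n_hidden)    # first LSTM reads one-hot characters
layer2 = lstm_params(n_hidden, n_hidden)   # second LSTM reads the first layer's outputs
output = dense_params(n_hidden, n_chars)   # softmax over the vocabulary
total = layer1 + layer2 + output
print(layer1, layer2, output, total)

# Batches per epoch: 285709 sequences in batches of 3000, rounded up
batches_per_epoch = -(-285709 // 3000)
print(batches_per_epoch)
```

Under these assumptions the model has about a quarter of a million parameters, and each epoch consists of 96 weight updates.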

In [17]:
from os.path import isfile
from keras.models import load_model

MODEL_PATH = 'stacked-lstm-2-layers-128-hidden.h5'

if isfile(MODEL_PATH):
    model = load_model(MODEL_PATH)
else:
    N_HIDDEN = 128

    model = Sequential()
    model.add(LSTM(
        N_HIDDEN, dropout=0.1, input_shape=(maxlen, len(chars)), return_sequences=True))
    model.add(LSTM(N_HIDDEN, dropout=0.1))
    model.add(Dense(len(chars), activation='softmax'))

    optimizer = RMSprop(lr=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)

    for iteration in range(1, 40):
        print('\n')
        print('-' * 50)
        print('\nIteration', iteration)
        model.fit(X, y, batch_size=3000, epochs=1)

        print('\n-------------------- SAMPLE ---------------------\n')

        rand = np.random.randint(len(text) - maxlen)
        seed = text[rand:rand + maxlen]
        generate(model, 400, seed)

    model.save(MODEL_PATH)



--------------------------------------------------

Iteration 1
Epoch 1/1

-------------------- SAMPLE ---------------------

 put up approximately 50 million for my successfull es FoK Tump mumird bis Vof loutp oremy Fall #/axhoor,. Tho iss Tf amy lutrns ailr Corrt
#Ahate,s dort is bhitin fafd yhis hacs Wial yor yrup thin Caryant te mety aund Bonnt wacy targypos na as 1 ufpin  ibe bo sestant yornor:e.. Eus aseind frort. https://t.co/pfIGPBKKh

"@coners Horlerd I mtaris het haalda bopp dorele omrst. AAgoo MaGDingare abtins Here - ant brinn @atpad bo the  eakop me batp of

--------------------------------------------------

Iteration 2
Epoch 1/1

-------------------- SAMPLE ---------------------

aldJTrumpJr: An Honor to be in #Indiana w @realDonaldTTump is upsor homqrtup soas. The bern. Sated pestibey!

De that bezt
with the nutterteljo. #TrumpT nathxesdering pesterg lest and THPSCs! https://t.co/xGr8e_tOqV https://t.co/KPWaJI7lHD"

Tnggreight oppaile allmenss far Beart Geating sameat


-------------------- SAMPLE ---------------------

s - all others are status quo #MakeAmericaGreatAgain!

I will be on @FluneTrump is TPEILENTE #DakelAmerica great!

"@Rentarnie: @realDonaldTrump 118 funny for cant finally clear but not Clain!

Wow, I dont want۪ will do the @TODAYshamed who wants to show a state-the women. I would take it for wife!

Hispagace, a smills, just last tegetion on 2 umress.

Crooked Hillary Cliston join us Donald Trump 32 Cless Otioka. #Trump2016

Problegs in TV,R. Enj

--------------------------------------------------

Iteration 27
Epoch 1/1

-------------------- SAMPLE ---------------------

support! TOGETHER we will MAKE AMERICA GREAT AGAIN!"

"@MikcLeyBustickand: @realDonaldTrump you mind of woll speak rann in Graham for years &amp; you she has defend it to get a stablishment prejicision in 200/15.3 Supio and get youre۪s good for shlaph!

Lightwatc media cant be massive Florida great job days  on me!

Hillary Clinton pifacked guy he is out lat this mor

Now let's try out the model!

Using the first sentence of this <a href='https://twitter.com/realDonaldTrump/status/890764622852173826'>tweet</a> as seed, let us try to continue Trump's sentence and see what interesting stuff our model can say:

In [50]:
sample_tweet_start = 'Go Republican Senators, Go!'
_ = generate(model, 200, sample_tweet_start)

Go Republican Senators, Go!.. Big crowd! I will be last thoughts.

The everyone be not look encertive - https://t.co/6dcnr62pna

They get I will people is you.!

"@KathyFurnAnier: @getepkis26 16 @BirlinieleyBy34Ie ht_

Just doe