Requirements 

- TensorFlow; if you have Anaconda: `conda install tensorflow`
- `pip install pandas numpy keras`

In [132]:
# Data
import pandas as pd
import numpy as np
import sys
import re

# Keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.callbacks import ModelCheckpoint

## Inputs

In [2]:
df = pd.read_csv('data/tweets-fixed.tsv', sep='\t', index_col=0)
df.head(2)

Unnamed: 0,source,text,created,retweets,favorites,is_retweet,id
0,Twitter for iPhone,RT @GOPChairwoman: .@realDonaldTrump is the Pa...,12-14-2017 23:26:54,4262,0,True,941449449850761217
1,Twitter for iPhone,“Manufacturing Optimism Rose to Another All-Ti...,12-14-2017 21:20:51,4789,19906,False,941417725833998340


# Processing

Sanitize tweets into a format suitable for training

In [225]:
# Build a set of characters to remove to reduce our vocab set
chars_to_remove = {'"', '$', '%', "'", '(', ')', '*', '+', '/', ';',
                   #'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '#',
                   '<', '=', '>', '[', '\\', ']', '_', '`',  '{', '|', '}', '~'}
# In 32,000 tweets he apparently never uses a comma, probably an artifact of CSV export.

text = ''
for i, tweet in enumerate(df[df.is_retweet == False].text):
    print(tweet)
    # Remove URLs
    tweet = re.sub(r'http\S+', '', tweet)
    
    # Fix ampersand HTML artifact
    tweet = tweet.replace('&amp;', '&')
    
    # lower() reduces the pool of possible characters (lower-cases the string)
    # the decode/encode step removes non-ASCII characters such as curly quotes
    # (Python 2 byte strings; on Python 3 use .encode("ascii", errors="ignore").decode())
    tweet = tweet.lower().decode("ascii", errors="ignore").encode()

    # Replace each char in our chars_to_remove set with a space
    for x in chars_to_remove:
        tweet = tweet.replace(x, ' ')
    
    # Split and rejoin to normalize odd spacing (done first so that a
    # whitespace-only tweet becomes empty and is skipped below)
    tweet = ' '.join(tweet.split())

    # Skip if there's no tweet left (just a URL, for example)
    if not tweet:
        continue

    # Add a period if the tweet lacks ending punctuation, then a trailing space
    if tweet[-1] not in ('.', '!', '?'):
        tweet += '. '
    else:
        tweet += ' '
    
    print(tweet)

    # Add to text (by redefining with +=)
    text += tweet
    
    print('')
    # Debug limit: stop after roughly 100 tweets; remove to process every tweet
    if i > 100:
        break
    
len(text)

“Manufacturing Optimism Rose to Another All-Time High in the Latest @ShopFloorNAM Outlook Survey” https://t.co/LuV4BMp0Xc
manufacturing optimism rose to another all-time high in the latest @shopfloornam outlook survey. 

In 1960 there were approximately 20000 pages in the Code of Federal Regulations. Today there are over 185000 pages as seen in the Roosevelt Room.Today we CUT THE RED TAPE! It is time to SET FREE OUR DREAMS and MAKE AMERICA GREAT AGAIN! https://t.co/teAVNzjvcx
in 1960 there were approximately 20000 pages in the code of federal regulations. today there are over 185000 pages as seen in the roosevelt room.today we cut the red tape! it is time to set free our dreams and make america great again! 

When Americans are free to thrive innovate & prosper there is no challenge too great no task too large & no goal beyond our reach. We are a nation of explorers pioneers innovators & inventors. We are nation of people who work hard dream big & who never ever give up... https://t.co

19457
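
The cleaning steps above can also be packaged as a standalone function, which makes them easier to test on single tweets. This is a minimal Python 3 sketch, not the code that produced the output above: `clean_tweet` and `CHARS_TO_REMOVE` are hypothetical names, and the encode/decode order is flipped because Python 3 strings are Unicode rather than bytes.

```python
import re

# Same characters as chars_to_remove above, written as one string
CHARS_TO_REMOVE = set('"$%\'()*+/;<=>[\\]_`{|}~')


def clean_tweet(tweet):
    """Return the cleaned tweet with a trailing space, or '' if nothing is left."""
    tweet = re.sub(r'http\S+', '', tweet)   # remove URLs
    tweet = tweet.replace('&amp;', '&')     # fix ampersand HTML artifact
    # Lower-case, then drop non-ASCII characters (Python 3 form)
    tweet = tweet.lower().encode('ascii', errors='ignore').decode()
    for ch in CHARS_TO_REMOVE:
        tweet = tweet.replace(ch, ' ')
    tweet = ' '.join(tweet.split())         # normalize odd spacing
    if not tweet:
        return ''
    if tweet[-1] not in '.!?':
        tweet += '.'
    return tweet + ' '


print(repr(clean_tweet('MAKE AMERICA GREAT AGAIN! https://t.co/teAVNzjvcx')))
# 'make america great again! '
```

Building `text` then reduces to `''.join(clean_tweet(t) for t in tweets)`, since empty results contribute nothing.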

Create character set and mapping

In [220]:
chars = sorted(set(text))
print('Total chars: {}'.format(len(chars)))
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

Total chars: 45


Cut text up into arbitrary sequences using a maximum length and a step size

In [62]:
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

('nb sequences:', 1103342)
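
To see what the windowing does, it may help to run the same loop on a short string. The sketch below wraps it in a hypothetical helper, `make_windows` (not part of the notebook), and uses a toy string with small `maxlen` and `step` values chosen only for illustration.

```python
def make_windows(text, maxlen, step):
    """Split text into overlapping inputs of length maxlen plus the character following each."""
    sentences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i:i + maxlen])
        next_chars.append(text[i + maxlen])
    return sentences, next_chars


toy_sentences, toy_next_chars = make_windows('make america great', maxlen=5, step=3)
for s, c in zip(toy_sentences, toy_next_chars):
    print(repr(s), '->', repr(c))
# 'make ' -> 'a'
# 'e ame' -> 'r'
# 'meric' -> 'a'
# 'ica g' -> 'r'
# ' grea' -> 't'
```

Each input overlaps the previous one by `maxlen - step` characters, so a smaller `step` yields more (and more redundant) training examples.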


Vectorization of inputs

- x: has dimensionality (num_sequences, maxlen, num_chars)
- y: has dimensionality (num_sequences, num_chars)

A sequence in `x` is represented as a 2D matrix of `maxlen` by `num_chars`. Each row corresponds to one position in the sequence and encodes which character of the character set appears there: every value is 0 except for a 1 in the column corresponding to that character. For example, the letter 'a' corresponds to index 5 (the sixth column, counting from zero) in the matrix.

In [66]:
# Example from above
char_indices['a']

5

In [67]:
# Start by creating arrays of zeros with our final dimensionality
# Use the builtin bool: the np.bool alias was removed in newer NumPy releases
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
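
The one-hot layout is easier to inspect on a tiny vocabulary. The following sketch repeats the vectorization on a made-up three-character text; the `toy_` names and values are assumptions for illustration, not data from the tweets.

```python
import numpy as np

toy_text = 'abcab'
toy_chars = sorted(set(toy_text))                  # ['a', 'b', 'c']
toy_indices = {c: i for i, c in enumerate(toy_chars)}

toy_maxlen = 3
toy_sentences = ['abc', 'bca']                     # windows of toy_text with step 1
toy_next = ['a', 'b']                              # character following each window

toy_x = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[i, t, toy_indices[char]] = 1
    toy_y[i, toy_indices[toy_next[i]]] = 1

print(toy_x.shape, toy_y.shape)    # (2, 3, 3) (2, 3)
print(toy_x[0].astype(int))        # rows for 'a', 'b', 'c' form the identity matrix
```

Every row of every sequence contains exactly one `True`, which is a quick sanity check worth running on the real `x` as well.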

## Model

We'll start with a simple single-layer LSTM model, which will likely need tuning to get better results

In [71]:
# Simplest model abstraction is Sequential
model = Sequential()
# Add LSTM layer with 256 memory units
model.add(LSTM(256, input_shape=(maxlen, len(chars))))
# Add Dropout layer at 20% node dropout (helps reduce overfitting)
model.add(Dropout(0.2))
# Add final Dense layer with one softmax output per character
model.add(Dense(len(chars), activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Training

We'll save a checkpoint whenever the training loss improves, then select the final weights with the lowest loss

In [72]:
path = "model/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(path, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

Fit the model

In [74]:
model.fit(x, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x17fbc2d10>

# Text Generation

Load model with weights corresponding to smallest loss

In [76]:
weights = 'model/weights-improvement-19-1.3245.hdf5'
model.load_weights(weights)
model.compile(loss='categorical_crossentropy', optimizer='adam')
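
Since the checkpoint filenames encode the loss, the best file can be picked programmatically instead of being typed in by hand. This is a rough sketch: `best_checkpoint` is a hypothetical helper, the regular expression assumes the filename pattern from the `path` template above, and every filename in the example list except the epoch-19 one is invented for illustration.

```python
import re


def best_checkpoint(filenames):
    """Return the filename with the lowest loss encoded in its name, or None if none match."""
    pattern = re.compile(r'weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$')
    best_name, best_loss = None, float('inf')
    for name in filenames:
        match = pattern.search(name)
        if match and float(match.group(2)) < best_loss:
            best_name, best_loss = name, float(match.group(2))
    return best_name


candidates = [
    'model/weights-improvement-01-2.5000.hdf5',   # invented example
    'model/weights-improvement-19-1.3245.hdf5',   # the file loaded above
    'model/weights-improvement-10-1.6000.hdf5',   # invented example
]
print(best_checkpoint(candidates))
# model/weights-improvement-19-1.3245.hdf5
```

In practice the list could come from `glob.glob('model/*.hdf5')`, and the result would be passed to `model.load_weights`.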

Pick a random seed sequence from the text

In [125]:
start_index = np.random.randint(0, len(text) - maxlen - 1)

generated = ''
sentence = text[start_index: start_index + maxlen]
generated += sentence
print('Generating with seed: {}\n'.format(sentence))

Generating with seed: ut the weather. no wonder their ratings 



In [None]:
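def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    # Rescale log-probabilities: temperature < 1 sharpens the distribution, > 1 flattens it
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # Renormalize so the probabilities sum to 1
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample; argmax recovers the index of the drawn character
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)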
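
The temperature step in `sample` is equivalent to raising each probability to the power `1 / temperature` and renormalizing. The sketch below restates just that rescaling in plain Python, without the random draw, so its effect can be inspected directly; `reweight` is a hypothetical name and the three probabilities are made up.

```python
import math


def reweight(preds, temperature=1.0):
    """Apply the log / temperature / exp / normalize steps of sample(), minus the sampling."""
    scaled = [math.log(p) / temperature for p in preds]
    exp_scaled = [math.exp(s) for s in scaled]
    total = sum(exp_scaled)
    return [e / total for e in exp_scaled]


probs = [0.5, 0.3, 0.2]   # made-up distribution for illustration
for t in (1.0, 0.5, 0.1):
    print(t, [round(p, 3) for p in reweight(probs, t)])
```

At a temperature of 1.0 the distribution is unchanged, while at 0.1 nearly all of the mass lands on the most likely entry, so sampling behaves almost like always picking the top character.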

Generate text from seed

In [126]:
sys.stdout.write(generated)
diversity = 0.1
for i in range(400):
    # Convert text seed to input
    x_pred = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
        x_pred[0, t, char_indices[char]] = 1.
    
    # Make prediction with model
    preds = model.predict(x_pred, verbose=0)[0]
    next_index = sample(preds, diversity)
    next_char = indices_char[next_index]
    
    # Append the new character and slide the seed window forward by one
    generated += next_char
    sentence = sentence[1:] + next_char
    sys.stdout.write(next_char)
    sys.stdout.flush()

ut the weather. no wonder their ratings and the state of the world and the state of the world in the world in the world with the state of the best thing i would be a great thing is the only one of the great state of the world in the world in the world in the world with the presidential beautiful and a great time to be a great time to be a great time to be a great time to be a great time to be a great time to be a great time to be a grea