## Requirements

- TensorFlow; if you use Anaconda: `conda install tensorflow`
- `pip install pandas numpy keras`

In [1]:
# Data
import pandas as pd
import numpy as np

# Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


## Inputs

In [2]:
df = pd.read_csv('data/tweets-fixed.tsv', sep='\t', index_col=0)
df.head(2)

Unnamed: 0,source,text,created,retweets,favorites,is_retweet,id
0,Twitter for iPhone,RT @GOPChairwoman: .@realDonaldTrump is the Pa...,12-14-2017 23:26:54,4262,0,True,941449449850761217
1,Twitter for iPhone,“Manufacturing Optimism Rose to Another All-Ti...,12-14-2017 21:20:51,4789,19906,False,941417725833998340


## Processing

Combine texts into a single string, dropping retweets

In [54]:
# Build a set of characters to remove to reduce our vocab set

chars_to_remove = {'"', '#', '$', '%', "'", '(', ')', '*', '+', '-', '/', 
                   '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', 
                   '<', '=', '>',  '@', '[', '\\', ']', '_', '`',  '{', '|', '}', '~'}

# `.`, `!`, `?` and `&` were kept as valid punctuation. 
# In 32,000 tweets he apparently never uses a comma, probably an artifact of CSV export.

text = ''
for tweet in df[df.is_retweet == False].text:
    # lower() shrinks the character vocabulary by folding upper case into lower case
    # The decode/encode round trip (on a Python 2 byte string) drops non-ASCII characters
    tweet = tweet.lower().decode("ascii", errors="ignore").encode()

    # Remove chars from our chars_to_remove set with list comprehension
    tweet = ''.join([x for x in tweet if x not in chars_to_remove])

    # Strings are immutable in Python, so += rebinds `text` to a new string
    text += tweet
len(text)

3310064
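
The cell above relies on Python 2 byte strings, where `decode` then `encode` strips non-ASCII bytes. As a rough sketch only, the same cleaning step in Python 3 might look like the following. The helper name `clean_tweet` and the sample string are made up for illustration and are not part of the original notebook.

```python
# Python 3 sketch of the cleaning step; `clean_tweet` is a hypothetical helper.
CHARS_TO_REMOVE = set('"#$%\'()*+-/0123456789:;<=>@[\\]_`{|}~')


def clean_tweet(tweet):
    # Lower-case first so 'A' and 'a' share one vocabulary entry.
    tweet = tweet.lower()
    # In Python 3, str is Unicode: encode to ASCII bytes (silently dropping
    # anything that does not fit), then decode back to str.
    tweet = tweet.encode("ascii", errors="ignore").decode("ascii")
    # Drop every character listed in the removal set.
    return ''.join(c for c in tweet if c not in CHARS_TO_REMOVE)


# Made-up sample containing curly quotes, a hashtag, digits and a percent sign.
sample = "Hello WORLD! #Test \u201c100%\u201d"
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

Note that the order is reversed relative to Python 2: a Python 3 `str` has no `decode` method, so the text is encoded first and decoded afterwards.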

Create character set and mapping

In [61]:
chars = sorted(list(set(text)))
print('Total chars: {}'.format(len(chars)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

Total chars: 31
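
To make the two mappings concrete, here is a small sketch that builds them for a toy string and round-trips a word through them. The toy `text` is an assumption standing in for the cleaned tweet text, so the indices differ from those in the real notebook.

```python
# Toy stand-in for the cleaned tweet text (not the real data).
text = "hello world."

# Same construction as above: sorted unique characters, then two lookup tables.
chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

# Encode a word as integer indices, then decode it back to a string.
encoded = [char_indices[c] for c in "hello"]
decoded = ''.join(indices_char[i] for i in encoded)

print(chars)
print(encoded)
print(decoded)
```

Because the two dictionaries are exact inverses, decoding an encoded string always returns the original, provided every character appears in `chars`.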


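Cut the text into overlapping sequences of `maxlen` characters, advancing `step` characters between sequences. The character that follows each sequence is stored as its prediction target.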

In [62]:
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

('nb sequences:', 1103342)
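
The windowing is easier to see on a short string. The sketch below wraps the same loop in a function and also checks the sequence count against a closed-form expression. The names `make_windows` and `expected_count` are hypothetical helpers introduced here for illustration.

```python
def make_windows(text, maxlen, step):
    """Return (sentences, next_chars) using the same loop as the cell above."""
    sentences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i:i + maxlen])   # input window
        next_chars.append(text[i + maxlen])    # character to predict
    return sentences, next_chars


def expected_count(n_chars, maxlen, step):
    """Number of windows: ceil((n_chars - maxlen) / step), in integer arithmetic."""
    return max(0, (n_chars - maxlen + step - 1) // step)


sentences, next_chars = make_windows("hello world", maxlen=4, step=3)
for s, n in zip(sentences, next_chars):
    print(repr(s), '->', repr(n))

# The same formula applied to the sizes reported in this notebook.
print(expected_count(3310064, 40, 3))
```

With `maxlen = 40` and `step = 3`, consecutive windows share 37 characters, which is why roughly one window is produced for every three characters of text.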


Vectorization of inputs

- x: has shape (num_sequences, maxlen, num_chars)
- y: has shape (num_sequences, num_chars)

A sequence in `x` is represented as a 2D matrix of `maxlen` rows by `num_chars` columns. Each row is the one-hot encoding of one character of the sequence: every value is 0 except for a 1 in the column assigned to that character. For example, the letter 'a' maps to index 5, so a row encoding 'a' has its 1 in the sixth column (indices start at 0). Each row of `y` one-hot encodes the character that follows the corresponding sequence.

In [66]:
# Example from above
char_indices['a']

5
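
As an illustration of the one-hot layout, the sketch below encodes a four-character sequence over a three-character vocabulary and then recovers the string with `argmax`. The tiny vocabulary and the name `x_toy` are assumptions chosen to keep the matrix readable. In the real data the matrix has 40 rows and 31 columns.

```python
import numpy as np

# Toy vocabulary (assumption); the notebook's real vocabulary has 31 characters.
chars = [' ', 'a', 'b']
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

sentence = "ab a"

# One row per character position, one column per vocabulary entry.
x_toy = np.zeros((len(sentence), len(chars)), dtype=bool)
for t, char in enumerate(sentence):
    x_toy[t, char_indices[char]] = True

print(x_toy.astype(int))

# argmax along each row returns the column holding the single 1,
# which maps straight back to the character.
decoded = ''.join(indices_char[int(i)] for i in x_toy.argmax(axis=1))
print(decoded)
```

Every row sums to exactly 1, which is a quick sanity check that a vectorized array is a valid one-hot encoding.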

In [67]:
# Start by creating arrays of zeros with our final dimensionality
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
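
These arrays are large, so a back-of-the-envelope size estimate can be useful. The sketch below uses the counts reported earlier in this notebook and assumes one byte per element, which is how NumPy stores boolean arrays.

```python
# Sizes reported in the cells above.
num_sequences, maxlen, num_chars = 1103342, 40, 31

# One byte per boolean element (NumPy's bool storage).
x_bytes = num_sequences * maxlen * num_chars
y_bytes = num_sequences * num_chars

print('x: {:.2f} GB'.format(x_bytes / 1e9))
print('y: {:.1f} MB'.format(y_bytes / 1e6))
```

By this estimate `x` alone takes roughly 1.4 GB. A larger `step` or a smaller `maxlen` shrinks it proportionally.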

In [None]:
x