# Predicting the Success of a Reddit Submission with Deep Learning and Keras

by Max Woolf



BigQuery query used to get the data:

```sql
#standardSQL 
SELECT id, title,
  CAST(FORMAT_TIMESTAMP('%H', TIMESTAMP_SECONDS(created_utc), 'America/New_York') AS INT64) AS hour,
  CAST(FORMAT_TIMESTAMP('%M', TIMESTAMP_SECONDS(created_utc), 'America/New_York') AS INT64) AS minute,
  CAST(FORMAT_TIMESTAMP('%w', TIMESTAMP_SECONDS(created_utc), 'America/New_York') AS INT64) AS dayofweek,
  CAST(FORMAT_TIMESTAMP('%j', TIMESTAMP_SECONDS(created_utc), 'America/New_York') AS INT64) AS dayofyear,
  IF(PERCENT_RANK() OVER (ORDER BY score ASC) >= 0.50, 1, 0) as is_top_submission
  FROM `fh-bigquery.reddit_posts.*`
  WHERE (_TABLE_SUFFIX BETWEEN '2017_01' AND '2017_04')
  AND subreddit = 'AskReddit'
```
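
The label comes from `PERCENT_RANK()`, which BigQuery defines as `(rank - 1) / (rows - 1)`, with tied scores sharing the lowest rank of their group. A minimal Python sketch of that labeling on made-up scores (the `percent_rank` helper and the toy data are illustrative, not part of the original pipeline) suggests how many tied low scores can leave well under half of the rows labeled `1`:

```python
def percent_rank(scores):
    """Return (rank - 1) / (n - 1) for each score; ties share the lowest rank."""
    n = len(scores)
    if n == 1:
        return [0.0]
    # Position of the first occurrence in sorted order equals rank - 1.
    first_position = {}
    for position, score in enumerate(sorted(scores)):
        first_position.setdefault(score, position)
    return [first_position[score] / (n - 1) for score in scores]

# Made-up scores: most submissions are stuck at a score of 1.
toy_scores = [1, 1, 1, 1, 1, 1, 2, 5, 40, 300]
toy_labels = [1 if pr >= 0.50 else 0 for pr in percent_rank(toy_scores)]

print(toy_labels)
print(sum(toy_labels) / len(toy_labels))  # share labeled as top submissions
```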

In [68]:
import numpy as np
import os
import csv
from random import random, sample, seed

data_path = '/Volumes/Extreme 510/Data/askreddit_data_timings.csv'
embeddings_path = '/Volumes/Extreme 510/Data/glove.6B.50d.txt'

In [69]:
titles = []
hours = []
minutes = []
dayofweeks = []
dayofyears = []
is_top_submission = []

with open(data_path, 'r', encoding="latin1") as f:
    reader = csv.DictReader(f)
    for submission in reader:
        titles.append(submission['title'])
        hours.append(submission['hour'])
        minutes.append(submission['minute'])
        dayofweeks.append(submission['dayofweek'])
        dayofyears.append(submission['dayofyear'])
        is_top_submission.append(submission['is_top_submission'])
            
titles = np.array(titles)
hours = np.array(hours, dtype=int)
minutes = np.array(minutes, dtype=int)
dayofweeks = np.array(dayofweeks, dtype=int)
dayofyears = np.array(dayofyears, dtype=int)
is_top_submission = np.array(is_top_submission, dtype=int)

In [71]:
print(titles[0:2])
print(titles.shape)
print(hours[0:2])
print(minutes[0:2])
print(dayofweeks[0:2])
print(dayofyears[0:2])
print(is_top_submission[0:2])

['People who have been cheated on: how did you find out?'
 "What is the biggest display of confidence/charisma you've ever seen?"]
(976538,)
[ 1 16]
[10 48]
[3 3]
[46 46]
[0 0]


In [72]:
1 - np.mean(is_top_submission)

0.64075949937432031

The No-Information Rate is about 64% (i.e. say all AskReddit submissions are terrible), so any model trained must do better than that.
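
The No-Information Rate is just the accuracy of always predicting the majority class. A small sketch on made-up labels (the `no_information_rate` helper and `toy_model_accuracy` are hypothetical names used only for illustration):

```python
def no_information_rate(labels):
    """Accuracy obtained by always predicting the most common class."""
    positive_share = sum(labels) / len(labels)
    return max(positive_share, 1 - positive_share)

# Made-up binary labels: 3 of 10 are "top" submissions.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
baseline = no_information_rate(toy_labels)

# Hypothetical model accuracy, only to show the comparison.
toy_model_accuracy = 0.75
beats_baseline = toy_model_accuracy > baseline
print(baseline, beats_baseline)
```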

# Process /r/AskReddit Submission Title Text

In [73]:
from keras.preprocessing import sequence
from keras.preprocessing.text import text_to_word_sequence, Tokenizer

max_features = 40000

word_tokenizer = Tokenizer(max_features)
word_tokenizer.fit_on_texts(titles)

print(str(word_tokenizer.word_counts)[0:100])
print(str(word_tokenizer.word_index)[0:100])
print(len(word_tokenizer.word_counts))   # true word count

{'people': 77432, 'who': 80438, 'have': 120730, 'been': 24255, 'cheated': 946, 'on': 87077, 'how': 1
{'you': 1, 'what': 2, 'the': 3, 'to': 4, 'a': 5, 'of': 6, 'your': 7, 'is': 8, 'do': 9, 'and': 10, 'i
98770
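
The `word_index` printed above gives index `1` to the most frequent word. A rough pure-Python sketch of that frequency ranking follows. It is not Keras's implementation: the regex tokenizer and the `build_word_index` / `to_sequence` helpers are simplifications invented for illustration, and the `< num_words` cutoff is an assumption about how `Tokenizer` applies its limit.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified stand-in for Keras's text_to_word_sequence.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1 (0 is left for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def to_sequence(text, word_index, num_words):
    """Convert text to indices, dropping words outside the kept vocabulary."""
    return [word_index[t] for t in tokenize(text)
            if t in word_index and word_index[t] < num_words]

toy_titles = ["What do you do", "You do you"]
toy_index = build_word_index(toy_titles)
toy_sequence = to_sequence("What do you do", toy_index, num_words=3)
print(toy_index)
print(toy_sequence)
```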


In [74]:
titles_tf = word_tokenizer.texts_to_sequences(titles)

print(titles_tf[0])

[27, 26, 17, 75, 1222, 24, 15, 33, 1, 119, 55]


In [75]:
maxlen = 20
titles_tf = sequence.pad_sequences(titles_tf, maxlen=maxlen)

print(titles_tf[0])

[   0    0    0    0    0    0    0    0    0   27   26   17   75 1222   24
   15   33    1  119   55]
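
`pad_sequences` pads on the left by default, which is why the zeroes come first. A minimal sketch of that default behaviour (the `pad_pre` helper is illustrative and assumes `maxlen` is at least 1):

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`; if too long, keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

first_title = [27, 26, 17, 75, 1222, 24, 15, 33, 1, 119, 55]
padded = pad_pre(first_title, maxlen=20)
print(padded)
```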


## Add Pretrained Embeddings

Adapted from [the official Keras tutorial](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html).

Use pretrained GloVe embeddings both to give the Embedding layer's training a good start and to account for words that might be present in the test set but not in the training set.

First, load the 50D embeddings into memory.

In [11]:
embedding_vectors = {}

with open(embeddings_path, 'r') as f:
    for line in f:
        line_split = line.strip().split(" ")
        vec = np.array(line_split[1:], dtype=float)
        word = line_split[0]
        embedding_vectors[word] = vec
        
print(embedding_vectors['you'])

[ -1.09190000e-03   3.33240000e-01   3.57430000e-01  -5.40410000e-01
   8.20320000e-01  -4.93910000e-01  -3.25880000e-01   1.99720000e-03
  -2.38290000e-01   3.55540000e-01  -6.06550000e-01   9.89320000e-01
  -2.17860000e-01   1.12360000e-01   1.14940000e+00   7.32840000e-01
   5.11820000e-01   2.92870000e-01   2.83880000e-01  -1.35900000e+00
  -3.79510000e-01   5.09430000e-01   7.07100000e-01   6.29410000e-01
   1.05340000e+00  -2.17560000e+00  -1.32040000e+00   4.00010000e-01
   1.57410000e+00  -1.66000000e+00   3.77210000e+00   8.69490000e-01
  -8.04390000e-01   1.83900000e-01  -3.43320000e-01   1.07140000e-02
   2.39690000e-01   6.67480000e-02   7.01170000e-01  -7.37020000e-01
   2.08770000e-01   1.15640000e-01  -1.51900000e-01   8.59080000e-01
   2.26200000e-01   1.65190000e-01   3.63090000e-01  -4.56970000e-01
  -4.89690000e-02   1.13160000e+00]


Initialize the weights matrix as zeroes, then replace the row at each word's index with that word's pretrained embedding vector. Words without a pretrained vector keep their all-zero row.

In [76]:
weights_matrix = np.zeros((max_features + 1, 50))

for word, i in word_tokenizer.word_index.items():

    embedding_vector = embedding_vectors.get(word)
    if embedding_vector is not None and i <= max_features:
        weights_matrix[i] = embedding_vector

# index 0 vector should be all zeroes, index 1 vector should be the same one as above
print(weights_matrix[0:2,:])

[[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [ -1.09190000e-03   3.33240000e-01   3.57430000e-01  -5.40410000e-01
    8.20320000e-01  -4.93910000e-01  -3.25880000e-01 
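
Words that have no GloVe vector keep an all-zero row in the weights matrix. A toy sketch of the same fill logic, plus a coverage check, using made-up 3D vectors (all `toy_*` names and values are hypothetical):

```python
import numpy as np

# Made-up vocabulary and "pretrained" vectors; 'askreddit' has no vector.
toy_word_index = {'you': 1, 'what': 2, 'askreddit': 3}
toy_vectors = {
    'you': np.array([0.1, 0.2, 0.3]),
    'what': np.array([0.4, 0.5, 0.6]),
}
toy_max_features = 3

toy_weights = np.zeros((toy_max_features + 1, 3))
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None and i <= toy_max_features:
        toy_weights[i] = vec

# Count rows that received a pretrained vector (row 0 is the padding index).
covered = int(np.any(toy_weights != 0, axis=1).sum())
coverage = covered / len(toy_word_index)
print(covered, coverage)
```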

# Process Other Metadata

All metadata must be zero-indexed integers.

* `hours` is already in the correct format (`0` = 12AM Eastern Time, `23` = 11PM Eastern Time).
* `dayofweeks` is already in the correct format (`0` = Sunday, `6` = Saturday).
* `minutes` is already in the correct format (`0` to `59`).
* `dayofyears` is one-indexed (`1` to `366`); subtract 1 to zero-index it.

In [78]:
#minutes_tf = minutes / 59
#dayofyears_tf = dayofyears / 366
dayofyears_tf = dayofyears - 1

print(dayofyears_tf[0:10])

[ 45  45  44 104  47  46 115 115  57   1]
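
Each metadata variable feeds an `Embedding` of fixed size in the model below (24, 7, 60 and 366), so every value has to be an integer in `[0, input_dim)`. A small sanity-check sketch on made-up values (the `fits_embedding` helper is hypothetical):

```python
def fits_embedding(values, input_dim):
    """True if every value is a valid row index for an Embedding of this size."""
    return min(values) >= 0 and max(values) < input_dim

# Made-up day-of-year values as produced by '%j' (1 to 366).
toy_dayofyears = [46, 46, 45, 105, 366]

raw_ok = fits_embedding(toy_dayofyears, 366)
shifted_ok = fits_embedding([d - 1 for d in toy_dayofyears], 366)
print(raw_ok, shifted_ok)
```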


# Build the Model

Use Keras's functional API to build a branching model.

In [79]:
from keras.models import Input, Model
from keras.layers import Dense, Embedding, GlobalAveragePooling1D, concatenate, Activation
from keras.layers.core import Masking, Dropout, Reshape
from keras.layers.normalization import BatchNormalization

batch_size = 32
embedding_dims = 50
epochs = 20

## Text Branch

Encode the text using a mock fasttext approach. Use `weights_matrix` derived above.

In [80]:
titles_input = Input(shape=(maxlen,))
titles_embedding = Embedding(max_features + 1, embedding_dims, weights=[weights_matrix])(titles_input)
titles_pooling = GlobalAveragePooling1D()(titles_embedding)
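
The fasttext-style encoding amounts to looking up one vector per token and averaging them. A NumPy sketch with a made-up 2D embedding table (not the real weights); since the `Embedding` above is created without `mask_zero`, this sketch assumes padded positions are averaged in as well:

```python
import numpy as np

# Made-up embedding table: row 0 is the padding index.
toy_embedding = np.array([
    [0.0, 0.0],
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
])

# One padded title of length 4 containing word indices 1 and 3.
toy_padded = np.array([[0, 0, 1, 3]])

embedded = toy_embedding[toy_padded]   # shape (1, 4, 2): one vector per position
pooled = embedded.mean(axis=1)         # shape (1, 2): average over positions
print(pooled)
```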

Add an auxiliary output to regularize the text component.

In [81]:
aux_output = Dense(1, activation='sigmoid', name='aux_out')(titles_pooling)

## Metadata Branch

Each metadata variable gets its own input and Embeddings. (size of each Embedding is already known by construction of the variables)

`Reshape` is necessary to flatten each Embedding output from 2D `(1, meta_embedding_dims)` to 1D `(meta_embedding_dims,)`.

In [82]:
meta_embedding_dims = 64

hours_input = Input(shape=(1,))
hours_embedding = Embedding(24, meta_embedding_dims)(hours_input)
hours_reshape = Reshape((meta_embedding_dims,))(hours_embedding)

dayofweeks_input = Input(shape=(1,))
dayofweeks_embedding = Embedding(7, meta_embedding_dims)(dayofweeks_input)
dayofweeks_reshape = Reshape((meta_embedding_dims,))(dayofweeks_embedding)

minutes_input = Input(shape=(1,))
minutes_embedding = Embedding(60, meta_embedding_dims)(minutes_input)
minutes_reshape = Reshape((meta_embedding_dims,))(minutes_embedding)

dayofyears_input = Input(shape=(1,))
dayofyears_embedding = Embedding(366, meta_embedding_dims)(dayofyears_input)
dayofyears_reshape = Reshape((meta_embedding_dims,))(dayofyears_embedding)
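
The shape change can be checked with plain NumPy indexing, which behaves like an embedding lookup here. A sketch with a made-up lookup table (`hour_table` and the 4D size are illustrative only):

```python
import numpy as np

meta_dims = 4  # small made-up size; the model above uses 64
hour_table = np.arange(24 * meta_dims, dtype=float).reshape(24, meta_dims)

hours_batch = np.array([[1], [16]])           # shape (batch, 1), like the Input
looked_up = hour_table[hours_batch]           # shape (batch, 1, meta_dims)
flattened = looked_up.reshape(-1, meta_dims)  # shape (batch, meta_dims)

print(looked_up.shape, flattened.shape)
```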

Minutes and dayofyears are embedded as categorical values like the other metadata, so they also need a `Reshape`.

## Merge the Branches and Complete Model

Combine the five branches (the 50D pooled title vector plus four 64D metadata embeddings, 306D total), add a hidden layer with batch normalization to capture latent characteristics, and output 1 value for logistic regression.

In [83]:
merged = concatenate([titles_pooling, hours_reshape, dayofweeks_reshape, minutes_reshape, dayofyears_reshape])

hidden_1 = Dense(256, activation='relu')(merged)
hidden_1 = BatchNormalization()(hidden_1)

main_output = Dense(1, activation='sigmoid', name='main_out')(hidden_1)
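
A quick arithmetic check of the merged width and of the first `Dense` layer's size, using the dimensions defined earlier (`n_meta_branches` and `hidden_units` are names introduced only for this sketch):

```python
embedding_dims = 50        # pooled title vector
meta_embedding_dims = 64   # each metadata embedding
n_meta_branches = 4        # hours, dayofweeks, minutes, dayofyears
hidden_units = 256

merged_width = embedding_dims + n_meta_branches * meta_embedding_dims

# A Dense layer has one weight per (input, unit) pair plus one bias per unit.
dense_params = merged_width * hidden_units + hidden_units
print(merged_width, dense_params)
```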

## Compile the Model

In [84]:
model = Model(inputs=[titles_input,
                      hours_input,
                      dayofweeks_input,
                      minutes_input,
                      dayofyears_input], outputs=[main_output, aux_output])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'],
              loss_weights=[1, 0.2])

model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_41 (InputLayer)            (None, 20)            0                                            
____________________________________________________________________________________________________
input_42 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_43 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_44 (InputLayer)            (None, 1)             0                                            
___________________________________________________________________________________________
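
With `loss_weights=[1, 0.2]`, the quantity being minimized is the main output's loss plus 0.2 times the auxiliary output's loss. A pure-Python sketch on made-up predictions (the `binary_crossentropy` helper is a simplified stand-in, not the Keras function):

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, clipping predictions away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and predictions for the two outputs.
toy_labels = [1, 0, 0, 1]
toy_main_pred = [0.9, 0.2, 0.1, 0.6]
toy_aux_pred = [0.7, 0.4, 0.3, 0.5]

loss_weights = [1, 0.2]
main_loss = binary_crossentropy(toy_labels, toy_main_pred)
aux_loss = binary_crossentropy(toy_labels, toy_aux_pred)
total_loss = loss_weights[0] * main_loss + loss_weights[1] * aux_loss
print(main_loss, aux_loss, total_loss)
```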

# Train the Model!

Shuffle the data before training, since Keras [takes the last 20%](https://keras.io/getting-started/faq/#how-is-the-validation-split-computed) as the validation set.

In [85]:
seed(123)
split = 0.2

# returns randomized indices with no repeats
idx = sample(range(titles_tf.shape[0]), titles_tf.shape[0])

titles_tf = titles_tf[idx, :]
hours = hours[idx]
dayofweeks = dayofweeks[idx]
minutes = minutes[idx]
dayofyears_tf = dayofyears_tf[idx]
is_top_submission = is_top_submission[idx]

Determine the No-Information Rate of the validation set: the `val_main_out_acc` must be better than it.

In [86]:
print(1 - np.mean(is_top_submission[:(int(titles_tf.shape[0] * split))]))

0.64137998126
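
Since Keras holds out the last fraction of the arrays, the baseline for the validation set specifically would be computed over the tail slice. A sketch on made-up labels; the `validation_slice` helper is hypothetical, and the split index formula is an assumption that happens to reproduce the 781230 / 195308 sample counts in the training log:

```python
def validation_slice(labels, split):
    """Return the trailing `split` fraction of `labels`."""
    split_at = int(len(labels) * (1.0 - split))
    return labels[split_at:]

# Made-up, already shuffled labels.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 1]
toy_validation = validation_slice(toy_labels, 0.2)
toy_validation_rate = 1 - sum(toy_validation) / len(toy_validation)
print(toy_validation, toy_validation_rate)

# Split point for the dataset size reported earlier in the notebook.
n_rows = 976538
n_train = int(n_rows * (1.0 - 0.2))
print(n_train, n_rows - n_train)
```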


In [89]:
model.fit([titles_tf, hours, dayofweeks, minutes, dayofyears_tf], [is_top_submission, is_top_submission],
          batch_size=batch_size,
          epochs=epochs,
          validation_split=split)

Train on 781230 samples, validate on 195308 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20

KeyboardInterrupt: 