## Kaggle Exploration

This notebook focuses on playing with deep neural networks
for text-based recommendation.
(It can be adapted to run on Google Colaboratory, which gives free GPU time!)

I am currently reworking the TensorFlow code to use the Dataset interface,
rather than the slower feed_dict approach.  My initial finding was that
training was just as slow on the GPU as on the CPU on Colab.

Looking at the NMT tutorial,
they used the Dataset interface with batch generation.  Importantly,
they also loaded the embedding matrix onto the GPU, and used a built-in TensorFlow op
to build up the matrix representation of each sentence.  I think that will provide a
huge training speedup.
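
As a rough illustration of the batch-generation idea (this is plain NumPy, not the TensorFlow Dataset API), here is a minimal sketch that shuffles the data and yields padded mini-batches of word indices. The name `batch_generator`, the padding value of 0, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np

def batch_generator(X, y, batch_size, seed=0, pad_value=0):
    """Yield (padded_indices, lengths, targets) mini-batches in shuffled order.

    X is a sequence of variable-length lists of word indices; y is a 1-D array.
    Hypothetical helper: the name and padding value are illustrative assumptions.
    """
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    y = np.asarray(y)
    for start in range(0, len(X), batch_size):
        batch_idx = order[start:start + batch_size]
        lengths = np.array([len(X[i]) for i in batch_idx])
        # Pad every sentence in the batch to the longest one in that batch.
        width = max(int(lengths.max()), 1)
        padded = np.full((len(batch_idx), width), pad_value, dtype=np.int64)
        for row, i in enumerate(batch_idx):
            padded[row, :lengths[row]] = X[i]
        yield padded, lengths, y[batch_idx]

# Toy usage: three "comments" of different lengths.
X_toy = [[4, 8, 15], [16], [23, 42]]
y_toy = np.array([0.8, 0.3, 0.6])
for xb, lens, yb in batch_generator(X_toy, y_toy, batch_size=2):
    print(xb.shape, lens, yb)
```

Padding only to the longest sentence in each batch keeps the batches small, at the cost of batch shapes that vary from step to step.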

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys

from sklearn.model_selection import StratifiedKFold, KFold

from util import clean_up, check_predictions

from IPython.display import clear_output

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
#load in raw data.
#can then clean, and find word indices.
df = pd.read_csv('data/boardgame-comments-english.csv')

In [3]:
# Cleaning and Tokenizing
df['comment_clean']=clean_up(df['comment'])

In [66]:
#set lower bound on scores to 1.
msk=df['rating']<1
df.loc[msk,'rating']=1

# Word Embeddings

As an initial test, let's use the 50-dimensional GloVe text embeddings.  The vocabulary includes 400k tokens (numerous rude words among them).
Note that stop words are not being trimmed out.

In [26]:
from util import load_glove, sentence_lookup,sent_to_matrix
glove_vec,glove_dict=load_glove()

In [10]:
#Prepare strings for tokenization:
#1) make lower case
#2) split at space.
#split comment at spaces to make a list
#com_split = df_train['comment_clean'].str.lower().apply(str.split,' ')

#now get those indices. (took a minute or two)
df['comment_vec_index']=df['comment_clean'].apply(lambda x:sentence_lookup(x,glove_dict))
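
`sentence_lookup` lives in `util` and isn't shown in this notebook. As a guess at the behaviour described in the comments above (lower-case, split on whitespace, keep only words found in the dictionary), a minimal stand-in might look like this. The name `sentence_lookup_sketch` and the toy dictionary are hypothetical, and the real helper may well differ.

```python
def sentence_lookup_sketch(sentence, word_to_index):
    """Map a cleaned comment to a list of vocabulary indices, skipping unknown words.

    Hypothetical stand-in for util.sentence_lookup, for illustration only.
    """
    if not isinstance(sentence, str):
        return []  # e.g. NaN for a missing comment
    tokens = sentence.lower().split()
    return [word_to_index[tok] for tok in tokens if tok in word_to_index]

toy_dict = {'great': 0, 'game': 1, 'fun': 2}
print(sentence_lookup_sketch('Great game but fiddly', toy_dict))  # [0, 1]
```

Under this sketch, a comment with no recognized words maps to an empty list.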

In [None]:
#rename the first column to be 'userID'
#(rename returns a new frame, rather than mutating the column index in place)
df=df.rename(columns={df.columns[0]:'userID'})

In [22]:
#save df after cleaning, and including indices (super slow-took around 5min)
df.to_csv('data/boardgame-comments-clean.csv.gzip',index=False,columns=['userID','gameID','rating','comment_vec_index'],compression='gzip')

In [None]:
df.head()

In [7]:
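#NOTE: this path differs from the '.csv.gzip' file written above;
#to read that file directly, pass its path with compression='gzip'.
df = pd.read_csv('data/boardgame-comments-clean.csv')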

In [12]:
sys.getsizeof(df)/(1024**2)

659.228157043457

In [47]:
tuple(['1','2'])

('1', '2')

In [46]:
#lists were saved as strings; parse them back into Python lists.
#ast.literal_eval only accepts literals, so it is safer than a bare eval.
#could probably do better just with a tuple.
import ast
df['comment_vec_index']=df['comment_vec_index'].apply(ast.literal_eval)

# Test/Train split

Let's split the training data.  I'll hold back 10% of the data as a test set, and use the rest for training and tuning.

In [55]:
seed=2343
np.random.seed(seed=seed)
msk=np.random.random(len(df))<0.1
df_train=df[~msk]
df_test=df[msk]

Let's further split the training data into training/dev sets using KFold.
I'll use this cross-validation for tuning hyperparameters. For this initial exploration I'll just use the first fold.

In [25]:
n_splits=5
kseed=3032
kfold = KFold(n_splits=n_splits, shuffle=True,random_state=kseed)
index=df_train.index.values
splits=kfold.split(index)

#just grab a single split
train_index,dev_index=next(splits)

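#and create new variables.
#(KFold.split yields positional indices, so select with .iloc rather than .loc;
# df_train's index labels have gaps where the test rows were removed)
df_train_K=df_train.iloc[train_index]
df_dev_K=df_train.iloc[dev_index]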

In [18]:
mat=sent_to_matrix(df_train.loc[12,'comment_vec_index'],glove_vec)

In [29]:
vec_len=df_train['comment_vec_index'].apply(len)

In [48]:
len(vec_len)

757029

In [27]:
def count_words(string):
    #split() with no argument splits on any whitespace and drops empty strings
    word_list=string.split()
    return len(word_list)

In [28]:
#find number of words in each comment
word_len=df_train['comment_clean'].apply(count_words)

In [80]:
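#Plot histogram of recognized-word counts on a log-scaled x axis
#(shift up by 0.1 to avoid 0 counts from nonsense comments; those fall below the first bin edge).
plt.hist((vec_len+0.1),bins=[1,10,50,100,250,500,750,1000,1500,2000])
plt.xscale('log')
plt.xlabel('Number of recognized words per comment (log scale)')
plt.ylabel('Number of comments')
plt.show()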

<matplotlib.figure.Figure at 0x7efaf6e69470>

In [59]:
np.sum(vec_len==0)

1857

So, a small fraction of comments (1857 of roughly 757k) contain no recognized words whatsoever.
Otherwise, the majority of comments have between 10 and 100 words.
The zero-length comments are either missing comments, or one-word reviews, often suggesting the game was sold "thrifted".
My cleaning script was also at fault: it was eating too much markup,
and some reviews were entirely within markup.

# Recurrent Neural Network

Let's try a multi-layer recurrent neural network (GRU first, then LSTM).
As a first step, this uses two layers on top of the 50-dim pretrained word vectors.
I'll use dropout (planned rate 0.5; note that the config below passes keep_prob=0.75) to try to regularize the results.
It may be worth implementing an early-stopping criterion as well: every N epochs, check the performance on a
dev set, and stop training when that error starts to increase.
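
The early-stopping idea can be sketched independently of the network code. The version below is one possible reading of it, not what `recurrent_network` does: the two callables are hypothetical hooks, and the `patience` parameter (how many non-improving checks to tolerate) is my addition.

```python
def train_with_early_stopping(train_one_epoch, dev_error, max_epochs=100,
                              check_every=5, patience=2):
    """Run training, checking the dev error every `check_every` epochs.

    Stops once the dev error has failed to improve for `patience` consecutive
    checks. Returns (best_epoch, best_error). Both callables are hypothetical
    hooks: train_one_epoch() runs one epoch, dev_error() returns a float.
    """
    best_error, best_epoch, bad_checks = float('inf'), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if epoch % check_every != 0:
            continue
        err = dev_error()
        if err < best_error:
            best_error, best_epoch, bad_checks = err, epoch, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break
    return best_epoch, best_error

# Toy usage: a dev error that falls and then rises again.
fake_errors = iter([0.9, 0.5, 0.4, 0.45, 0.6, 0.7])
print(train_with_early_stopping(lambda: None, lambda: next(fake_errors),
                                max_epochs=30, check_every=5, patience=2))
# (15, 0.4)
```

In practice you would also save a checkpoint whenever the dev error improves, so that the best model can be restored after stopping.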

The training set is quite unbalanced: there are many favourable reviews and relatively few negative ones.
This reflects the nature of the dataset. These are the most-reviewed games on boardgamegeek, and popular games are likely to be decent.
They won't necessarily be the best games for all tastes, but they are probably fairly good.

This imbalance can be handled by rebalancing the dataset via stratified sampling.
Alternatively, we could tweak the metric to be a weighted average,
which puts more weight on the rarer reviews. The weighted metric does assume that there are enough rare training examples,
and that the model sees them often enough during training.
There are around 60k negative reviews (score<5), and only 12k reviews with scores <= 2, out of 700k total reviews.
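
One way the weighted metric could look, as a sketch only: weight each review by the inverse size of its rating bin, so the rare negative reviews count for more. The helper names, the single bin edge at 5 (matching the "score<5" cut above), and the toy ratings are assumptions made for illustration.

```python
import numpy as np

def inverse_frequency_weights(ratings, bin_edges):
    """Weight each example by 1 / (size of its rating bin), normalized to mean 1."""
    ratings = np.asarray(ratings, dtype=float)
    bin_id = np.digitize(ratings, bin_edges)
    counts = np.bincount(bin_id)
    weights = 1.0 / counts[bin_id]
    return weights * len(weights) / weights.sum()

def weighted_mse(y_true, y_pred, weights):
    """Weighted mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights)

# Toy usage: one low rating among mostly high ones.
ratings = np.array([2, 7, 8, 8, 9])
w = inverse_frequency_weights(ratings, bin_edges=[5])  # two bins: <5 and >=5
pred = np.full(5, 8.0)  # a model that always predicts 8
print(w)
print(weighted_mse(ratings, pred, w), np.mean((ratings - pred) ** 2))
```

With these weights the single low rating contributes as much as all the high ratings combined, so always predicting a high score is penalized much more heavily than under the plain MSE. scikit-learn's `mean_squared_error` accepts the same weights through its `sample_weight` argument.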

In [173]:
import tensorflow as tf
from recurrent_network import RNNConfig,recurrentNeuralNetwork

In [74]:
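#work with a subset of the training data for quick experiments
Nsub=50000
names=['rating']
df_sub=df_train.iloc[:Nsub]
X0=df_sub['comment_vec_index'].values
#scale the ratings down by a factor of 10
y0=df_sub['rating'].values/10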

In [174]:
#create config object.
RNN_Config=RNNConfig(Nepoch=100,maxlen=100,Ndim=50,wordvec=glove_vec,
keep_prob=0.75, cell='GRU', lr=0.01)

%pdb off 
RNN=recurrentNeuralNetwork(RNN_Config)

Automatic pdb calling has been turned OFF


The NMT tutorial instead loads the embedding matrix as a TF constant onto a particular device (CPU/GPU), and uses tf.nn.embedding_lookup.

In [178]:
#RNN.train_graph(X,y)
%pdb on
RNN.train_graph(X0,y0,'./tf_models/rnn')

TypeError: The value of a feed cannot be a tf.Tensor object. Acceptable feed values include Python scalars, strings, lists, numpy ndarrays, or TensorHandles.

Automatic pdb calling has been turned ON


In [145]:
# Note: this only seems to work AFTER I've already run
# train_graph?  Then I can load up another checkpoint.
rnn_pred=RNN.predict_all('./tf_models/rnn',X0[:1000],reset=True)

INFO:tensorflow:Restoring parameters from ./tf_models/rnn-100


In [146]:
rnn_pred2=RNN.predict_all('./tf_models/rnn',X0[:1000],reset=True)

INFO:tensorflow:Restoring parameters from ./tf_models/rnn-100


In [158]:
tf.nn.embedding_lookup(glove_vec,np.array([[1,2,50],[3,4,0]]))

<tf.Tensor 'embedding_lookup_5:0' shape=(2, 3, 50) dtype=float64>
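
For a single embedding matrix, the lookup above gathers one row per index, which is what NumPy fancy indexing does. The sketch below uses a small random matrix as a stand-in for `glove_vec` (an assumption, so the block runs on its own), and adds a length mask in case index 0 doubles as padding.

```python
import numpy as np

# Stand-in for glove_vec: a tiny random (vocab_size, embedding_dim) matrix.
rng = np.random.RandomState(0)
toy_vec = rng.normal(size=(60, 50))

# Same index array as in the cell above: 2 sentences, 3 tokens each.
idx = np.array([[1, 2, 50], [3, 4, 0]])

# Fancy indexing gathers one embedding row per index.
embedded = toy_vec[idx]
print(embedded.shape)  # (2, 3, 50)

# If 0 is used as padding, a length mask can zero out the padded positions.
lengths = np.array([3, 2])  # assumed true lengths of the two sentences
mask = np.arange(idx.shape[1])[None, :] < lengths[:, None]
embedded_masked = embedded * mask[:, :, None]
```

Note that index 0 is presumably a real token in the GloVe vocabulary, so padding with 0 and then not masking would feed that word's vector into the network at every padded position.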

In [169]:
A=np.zeros(3)
A[:2]=[1,2]

In [170]:
A

array([ 1.,  2.,  0.])

In [71]:
from sklearn.metrics import log_loss, roc_auc_score, mean_squared_error

In [137]:
mean_squared_error(rnn_pred,y0[:1000])

0.032085286715421012
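
Since `y0` was divided by 10, this MSE is on the scaled ratings. A quick conversion back to rating points, assuming the ratings were on the original 1-10 scale:

```python
import numpy as np

mse_scaled = 0.032085286715421012  # value reported above, on ratings / 10
rmse_scaled = np.sqrt(mse_scaled)
rmse_points = 10 * rmse_scaled     # undo the /10 scaling applied to y0
print(round(rmse_points, 2))       # 1.79
```

So on this 1000-comment slice of the training subset, the predictions are off by roughly 1.8 rating points in the root-mean-square sense.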

In [138]:
rnn_pred.reshape(-1).shape

(1000,)

In [140]:
plt.hist(rnn_pred.reshape(-1),log=False)
plt.show()

<matplotlib.figure.Figure at 0x7efaf55a6630>

In [114]:
RNN.pred

<tf.Tensor 'strided_slice:0' shape=(100, 1) dtype=float32>