<h1> Creating a custom Word2Vec embedding on your data </h1>

This notebook illustrates:
<ol>
<li> Creating a training dataset
<li> Running word2vec
<li> Examining the created embedding
<li> Exporting the embedding into a file you can use in other models
<li> Training the text classification model of [txtcls2.ipynb](txtcls2.ipynb) with this custom embedding.
</ol>


In [1]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

## Creating a training dataset

The training dataset consists of words extracted from your documents and separated by spaces. The words appear in the order in which they occur in the documents, and words from successive documents are appended together. In other words, there is no "document separator".
<p>
The only preprocessing that I do is to lowercase the text and replace anything that is not a letter, space, dollar sign, or hyphen with a space.
<p>
Recall that word2vec is unsupervised. There is no label.
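
If your documents are not in BigQuery, the same cleaning can be done locally. The following is a minimal sketch that reuses the character class from the query in the next cell; the helper name `clean_text` is hypothetical and not part of the notebook.

```python
import re

# Same character class as the REGEXP_REPLACE call in the BigQuery query:
# keep ASCII letters, spaces, '$' and '-'; everything else becomes a space.
_NOT_ALLOWED = re.compile(r"[^a-zA-Z $-]")

def clean_text(title, text):
    """Hypothetical helper: clean and join one story's title and body."""
    cleaned_title = _NOT_ALLOWED.sub(" ", title).lower()
    cleaned_text = _NOT_ALLOWED.sub(" ", text).lower()
    return cleaned_title + " " + cleaned_text

print(clean_text("Ask HN: What's new?", "Real-time ads cost $5!"))
```

Replaced characters leave runs of spaces behind. That is harmless here as long as the consumer splits the file on whitespace.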

In [9]:
import google.datalab.bigquery as bq

query="""
SELECT
  CONCAT( LOWER(REGEXP_REPLACE(title, '[^a-zA-Z $-]', ' ')), 
  " ", 
  LOWER(REGEXP_REPLACE(text, '[^a-zA-Z $-]', ' '))) AS text
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 100
  AND LENGTH(text) > 100
"""

df = bq.Query(query).execute().result().to_dataframe()

In [10]:
df[:5]

Unnamed: 0,text
0,reddit bookmarklets allow web site owners to c...
1,why not let online ads fight it out in a geome...
2,smashing the clock bestbuy s location and ho...
3,ask hn can google aggregate everything you ve...
4,ask yc think out loud - like twitter justi...


In [11]:
with open('word2vec/words.txt', 'w') as ofp:
  for txt in df['text']:
    ofp.write(txt + " ")

This is what the resulting file looks like:

In [12]:
!cut -c-1000 word2vec/words.txt

reddit bookmarklets allow web site owners to cheat to get mostly up votes  simple realistic example given   the idea is to associate a positive link and a negative link with your site  you would submit both to reddit  p based on the user s experience  you would switch him her to the positive negative link  p that way  happy users would vote up the positive link while unhappy users would vote down the negative link   your site now has a better chance of making the front page  p as an example  suppose your site has a game puzzle  p when the user visits the site via the positive or negative link  you redirect to the negative link  p if the user plays several levels of the game puzzle  then he she probably likes it and then you can switch him her to the positive link  why not let online ads fight it out in a geometric real-time game played by advertisers and consumers  the advertiser may display his her ad along with all the other ads currently on display    p larger ads have the disadvant

## Running word2vec

We can run the existing tutorial code as-is.

In [None]:
%bash
cd word2vec
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++11 \
  -shared word2vec_ops.cc word2vec_kernels.cc \
  -o word2vec_ops.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} \
  -O2 -D_GLIBCXX_USE_CXX11_ABI=0

#   -I/usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public \

The contents of the evaluation dataset don't matter here. Just make sure that some of its words also appear in the training input. The analogy dataset is of the form
<pre>
Athens Greece Cairo Egypt
Baghdad Iraq Beijing China
</pre>
i.e. four words per line where the model is supposed to predict the fourth given the first three. But we'll just make up a junk file.
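
To make the analogy task concrete, here is a rough sketch of how such a file could be parsed and how a fourth word could be predicted from the first three with vector arithmetic. It is an illustration under assumptions, not the tutorial's own code: the function names, the toy vocabulary, and the toy 2-D vectors are all made up.

```python
import numpy as np

def read_analogies(lines):
    """Parse analogy questions: four words per line; ':' lines are section headers."""
    questions = []
    for line in lines:
        if line.startswith(":"):
            continue
        words = line.strip().lower().split()
        if len(words) == 4:
            questions.append(tuple(words))
    return questions

def predict_fourth(a, b, c, word_to_id, embeddings):
    """Return the word whose vector is closest to (b - a + c), excluding a, b, c."""
    # Normalize rows so that dot products behave like cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    target = unit[word_to_id[b]] - unit[word_to_id[a]] + unit[word_to_id[c]]
    scores = unit.dot(target)
    for w in (a, b, c):
        scores[word_to_id[w]] = -np.inf  # never answer with a question word
    id_to_word = {i: w for w, i in word_to_id.items()}
    return id_to_word[int(np.argmax(scores))]

# Toy data (assumption): made-up 2-D vectors, one row per word.
word_to_id = {"athens": 0, "greece": 1, "cairo": 2, "egypt": 3, "banana": 4}
embeddings = np.array([[1.0, 0.0], [0.866, 0.5], [0.0, 1.0],
                       [-0.5, 0.866], [-0.94, -0.34]])
for a, b, c, d in read_analogies(["Athens Greece Cairo Egypt"]):
    print(a, b, c, "->", predict_fourth(a, b, c, word_to_id, embeddings))
```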

In [30]:
%writefile word2vec/junk.txt
: analogy-questions-ignored
the user plays several levels
of the game puzzle
vote down the negative

Writing word2vec/junk.txt


In [None]:
%bash
cd word2vec
rm -rf trained
python word2vec.py \
   --train_data=./words.txt --eval_data=./junk.txt --save_path=./trained \
   --min_count=1 --embedding_size=10 --window_size=2

## Examine the created embedding

Let's load up the embedding file in TensorBoard.  Start up TensorBoard, switch to the "Projector" tab and then click on the button to "Load data".  Load the vocab.txt that is in the output directory of the model.

In [None]:
from google.datalab.ml import TensorBoard
TensorBoard().start('word2vec/trained')

Here, for example, is the word "founders" in context: it's near "doing", "creative", "difficult", and "fight", which sounds about right. The numbers next to the words reflect their counts. We should try to get a large enough vocabulary that we can use --min_count=10 when training word2vec, but that would also take too long for a classroom situation. <img src="embeds.png" />
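
You can run a similar neighbor lookup without TensorBoard. The sketch below assumes that row i of vectors.txt belongs to the word on line i of vocab.txt, and it ranks neighbors by cosine similarity, which may differ from the distance the Projector is set to use. The names `load_trained` and `nearest_words` are hypothetical.

```python
import os
import numpy as np

def load_trained(dirname):
    """Read vocab.txt ('word count' per line) and vectors.txt from a model directory."""
    with open(os.path.join(dirname, "vocab.txt")) as f:
        words = [line.split()[0] for line in f if line.strip()]
    vectors = np.loadtxt(os.path.join(dirname, "vectors.txt"))
    return words, vectors

def nearest_words(query, words, vectors, k=4):
    """Return the k words most similar to `query`, with their cosine similarities."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    idx = words.index(query)
    sims = unit.dot(unit[idx])
    order = np.argsort(-sims)  # most similar first
    return [(words[i], float(sims[i])) for i in order if i != idx][:k]
```

After training, something like `words, vectors = load_trained('word2vec/trained')` followed by `nearest_words('founders', words, vectors)` should list the closest words. With only 10 dimensions and a small corpus, expect noisy neighbors.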

In [None]:
for pid in TensorBoard.list()['pid']:
    TensorBoard().stop(pid)
    print('Stopped TensorBoard with pid {}'.format(pid))

## Export the embedding vectors into a text file

Let's export the embedding into a text file, so that we can use it the way we used the GloVe embeddings in txtcls2.ipynb.

Notice that we have written out our vocabulary and vectors into two files.  We just have to merge them now.

In [38]:
!wc word2vec/trained/*.txt

   890   8900 226934 word2vec/trained/vectors.txt
   890   1780   8259 word2vec/trained/vocab.txt
  1780  10680 235193 total


In [39]:
!head -3 word2vec/trained/*.txt

==> word2vec/trained/vectors.txt <==
-2.472065091133117676e-01 -3.885798156261444092e-01 -2.226969599723815918e-01 8.574548363685607910e-02 4.453513324260711670e-01 3.030938208103179932e-01 2.762222662568092346e-02 -4.628151655197143555e-01 6.405805051326751709e-02 -4.708295166492462158e-01
-1.005752161145210266e-01 3.006918132305145264e-01 1.801920384168624878e-01 -3.159367144107818604e-01 -3.252084553241729736e-01 4.999429285526275635e-01 -3.082303404808044434e-01 2.440812736749649048e-01 -4.505534768104553223e-01 -2.321645617485046387e-01
3.727774024009704590e-01 2.538295388221740723e-01 -9.570891410112380981e-02 -2.781682647764682770e-02 4.326484501361846924e-01 4.568791389465332031e-01 3.149969279766082764e-01 2.019654512405395508e-01 -4.677839279174804688e-01 -1.786493211984634399e-01

==> word2vec/trained/vocab.txt <==
UNK 0
to 99
the 98


In [50]:
import pandas as pd
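# na_filter=False keeps vocabulary words such as "null" or "nan" as strings instead of NaN
vocab = pd.read_csv("word2vec/trained/vocab.txt", sep=r"\s+", header=None, names=('word', 'count'), na_filter=False)
vectors = pd.read_csv("word2vec/trained/vectors.txt", sep=r"\s+", header=None)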
vectors = pd.concat([vocab, vectors], axis=1)
del vectors['count']
vectors.to_csv("word2vec/trained/embedding.txt.gz", sep=" ", header=False, index=False, index_label=False, compression='gzip')

In [52]:
!zcat word2vec/trained/embedding.txt.gz | head -3

UNK -0.247206509113 -0.388579815626 -0.222696959972 0.0857454836369 0.445351332426 0.30309382081 0.0276222266257 -0.46281516552 0.0640580505133 -0.470829516649
to -0.100575216115 0.300691813231 0.180192038417 -0.315936714411 -0.325208455324 0.499942928553 -0.308230340481 0.244081273675 -0.45055347681 -0.232164561749
the 0.372777402401 0.253829538822 -0.0957089141011 -0.0278168264776 0.432648450136 0.456879138947 0.314996927977 0.201965451241 -0.467783927917 -0.178649321198

gzip: stdout: Broken pipe
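
To check the exported file from Python, or to use it in your own model code, you can read it back into a dictionary. This is a minimal sketch that assumes the format shown above (a word followed by space-separated floats on each line); `load_embedding` is a hypothetical helper, not the loader used in txtcls2.ipynb.

```python
import gzip
import numpy as np

def load_embedding(path):
    """Read a gzipped 'word v1 v2 ... vN' text file into a {word: vector} dict."""
    embedding = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            values = [float(x) for x in parts[1:]]
            embedding[parts[0]] = np.asarray(values, dtype="float32")
    return embedding
```

For example, `load_embedding('word2vec/trained/embedding.txt.gz')['the']` should return the 10-dimensional vector printed above for "the".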


## Training model with custom embedding

Now you can use this embedding file instead of the GloVe embedding used in [txtcls2.ipynb](txtcls2.ipynb).

In [56]:
%bash
gsutil cp word2vec/trained/embedding.txt.gz gs://${BUCKET}/txtcls2/custom_embedding.txt.gz

Copying file://word2vec/trained/embedding.txt.gz [Content-Type=text/plain]...
/ [1 files][ 66.1 KiB/ 66.1 KiB]
Operation completed over 1 objects/66.1 KiB.                                     


In [None]:
%bash
OUTDIR=gs://${BUCKET}/txtcls2/trained_model
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gsutil cp txtcls1/trainer/*.py $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=$(pwd)/txtcls1/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC_GPU \
   --runtime-version=1.4 \
   -- \
   --bucket=${BUCKET} \
   --output_dir=${OUTDIR} \
   --glove_embedding=gs://${BUCKET}/txtcls2/custom_embedding.txt.gz \
   --train_steps=36000

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License