# Simple Substitution
`w266 Final Project: Crosslingual Word Embeddings`

The code in this notebook was used to develop an algorithm to generate crosslingual word embeddings by training on a monolingual corpus and substituting translations at runtime.
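
As a rough illustration of what "substituting translations at runtime" could look like, the sketch below swaps a token for one of its dictionary translations with some probability while training examples are being generated. The function name `substitute`, the `p_swap` parameter, and the toy dictionary are hypothetical and are not taken from the project code.

```python
import random

def substitute(token, translations, rng, p_swap=0.5):
    """Return a translation of `token` with probability `p_swap`, else `token`.

    `translations` maps a word to a list of candidate translations
    (a hypothetical structure used only for this sketch).
    """
    candidates = translations.get(token)
    if candidates and rng.random() < p_swap:
        return rng.choice(candidates)
    return token

# toy dictionary, for illustration only
toy_dict = {'dog': ['cane'], 'house': ['casa', 'abitazione']}
rng = random.Random(0)
sentence = ['the', 'dog', 'is', 'in', 'the', 'house']
mixed = [substitute(tok, toy_dict, rng) for tok in sentence]
print(mixed)
```

Words without a dictionary entry pass through unchanged, so the output is a mixed-language sentence of the same length as the input.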

# Notebook Setup

In [1]:
# general imports
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# tell matplotlib not to open a new window
%matplotlib inline

# autoreload modules
%load_ext autoreload
%autoreload 2

In [2]:
# filepaths
import os
# expand '~' explicitly so the paths also work with plain file opens
BASE = os.path.expanduser('~/Documents/MIDS/w266/Data') #'/home/mmillervedam/Data'
PROJ = os.path.expanduser('~/Documents/MIDS/w266/FinalProject') #'/home/mmillervedam/ProjectRepo'
FPATH_EN = BASE + '/test/wiki_en_10K.txt' # first 10000 lines from wiki dump
FPATH_ES = BASE + '/test/wiki_es_10K.txt' # first 10000 lines from wiki dump
#FULL_EN = BASE + '/en/full.txt'
#FULL_ES = BASE + '/es/full.txt'
EN_ES_DICT = PROJ + '/XlingualEmb/data/dicts/en.es.panlex.all.processed'
EN_IT_DICT = PROJ + '/XlingualEmb/data/dicts/en.it.panlex.all.processed'
EN_IT_RAW = PROJ + '/XlingualEmb/data/mono/en_it.shuf.10k'

# Load & Preprocess Data
__`ORIGINAL AUTHORS SAY:`__ "Normally, the monolingual word embeddings are trained on billions of words. However, getting that much of monolingual data for a low-resource language is also challenging. That is why we only select the top 5 million sentences (around 100 million words) for each language." - _Section 5.1, Duong et al._

In [3]:
from parsing import Corpus, Vocabulary, batch_generator

### Corpus

In [4]:
# load corpus
en_it_data = Corpus(EN_IT_RAW)

In [5]:
# Corpus Stats
!wc {EN_IT_RAW}

   20000  430928 3746786 /Users/mmillervedam/Documents/MIDS/w266/FinalProject/XlingualEmb/data/mono/en_it.shuf.10k


__`i.e.:`__ 20K sentences (10K in each language) with ~430K tokens
> So this must not be their full data. For now, I'm just going to look at the top 20K words and see what happens. In reality we should probably modify the Vocab class so that it explicitly collects the top words for each language separately and then concatenates the index.
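
One way to collect the top words per language and then concatenate the index, as suggested above, is sketched here. It assumes each token carries a language prefix such as `en_` or `it_`; that prefix convention, the special tokens, and the function name `build_bilingual_index` are assumptions for illustration and are not the project's `Vocabulary` API.

```python
from collections import Counter

def build_bilingual_index(tokens, langs=('en', 'it'), size_per_lang=3,
                          specials=('<s>', '</s>', '<unk>')):
    """Map word -> id, keeping the top `size_per_lang` words of each language."""
    counts = {lang: Counter() for lang in langs}
    for tok in tokens:
        lang = tok.split('_', 1)[0]   # assumed prefix convention, e.g. 'en_dog'
        if lang in counts:
            counts[lang][tok] += 1
    words = list(specials)            # special tokens take the first ids
    for lang in langs:
        # most_common returns (word, count) pairs sorted by descending count
        words.extend(w for w, _ in counts[lang].most_common(size_per_lang))
    return {w: i for i, w in enumerate(words)}

toy_tokens = ['en_the', 'en_dog', 'en_the', 'it_il', 'it_cane', 'it_il', 'it_il']
index = build_bilingual_index(toy_tokens, size_per_lang=1)
print(index)
```

Counting per language keeps a frequent language from crowding the other out of a single shared top-N list.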

### Dictionary

In [6]:
# loading english-italian dictionary
pld = pd.read_csv(EN_IT_DICT, sep='\t', names = ['en', 'it'], dtype=str)
en_set = set(pld.en.unique())
it_set = set(pld.it.unique())

In [7]:
# dictionary vocab lengths:
print('EN:', len(en_set))
print('IT:', len(it_set))

EN: 266450
IT: 258641


### Vocabulary

In [8]:
# create a multilingual vocabulary from the corpus tokens
en_it_vocab = Vocabulary(en_it_data.gen_tokens(), size=100000)

### CBOW Data Generator
__`CHECK PAPER for HYPERPARAMS!`__: I can't seem to find where they talk about the context window size, embedding size and batch size they use -- it may actually be in the Vulic and Moens paper instead of the Duong one.

__`RLH Update`__: Duong et al. section 6, footnote 4: "Default learning rate of 0.025, negative sampling with 25 samples, subsampling rate of value 1e−4, embedding dimension d = 200, window size cs = 48 and run for 15 epochs"
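
For reference, the defaults quoted in that footnote can be collected in one place and compared against whatever values the notebook actually uses. The dictionary name and key names are illustrative only.

```python
# Defaults quoted from Duong et al., section 6, footnote 4
PAPER_DEFAULTS = {
    'learning_rate': 0.025,
    'negative_samples': 25,
    'subsampling_rate': 1e-4,
    'embedding_dim': 200,
    'window_size': 48,
    'epochs': 15,
}

for name, value in sorted(PAPER_DEFAULTS.items()):
    print('{:>18}: {}'.format(name, value))
```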


In [110]:
BATCH_SIZE = 48
WINDOW_SIZE = 1
MAX_EPOCHS = 1 # fail safe

In [111]:
batches = batch_generator(en_it_data, 
                          en_it_vocab, 
                          BATCH_SIZE, 
                          WINDOW_SIZE, 
                          MAX_EPOCHS)

In [112]:
# sanity check
for context, label in batches:
    print("CONTEXT IDS:", context[:5])
    print("LABEL IDS:", label[:5])
    break

CONTEXT IDS: [[0, 1], [0, 1], [0, 34], [20, 17318], [34, 1638]]
LABEL IDS: [25668, 37957, 20, 34, 17318]


# Initialize Model
__`CODE NOTES:`__ To get this running I had to hard code the context length (set to 8) inside `BuildCoreGraph()` where we generate `self.input_` in line 102. That should really be inferred from `self.context_` itself, but it doesn't seem to like the placeholder dimension (we don't have a span length until runtime). Does TensorFlow not have a vectorized average? Something to fix later. I also had to hard code the number of samples for softmax. I had originally made this a `tf.placeholder_with_default`, thinking we could pass it in to the training function (since it's a training parameter), but TF kicked out an error message asking for an integer, so for now I'll just give it what it wants. I need to think more about why TF doesn't want this changing from batch to batch (or whether there is another reason it wants an int).
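
On the averaging question: a CBOW input is the mean of the context embeddings, and that mean can be taken along the context axis without knowing the context length ahead of time. The NumPy sketch below shows the shape logic with made-up sizes. In TensorFlow the analogous call is `tf.reduce_mean` with `axis=1` applied to the result of `tf.nn.embedding_lookup`; how that fits into `BuildCoreGraph()` is not shown here and would need checking against `models.py`.

```python
import numpy as np

V, H = 10, 4                                      # toy vocab and embedding sizes
rng = np.random.RandomState(0)
embeddings = rng.randn(V, H)                      # stand-in for the context embedding matrix

context_ids = np.array([[0, 1], [0, 3], [2, 5]])  # shape (batch, context_len)
looked_up = embeddings[context_ids]               # shape (batch, context_len, H)
cbow_input = looked_up.mean(axis=1)               # shape (batch, H)

print(looked_up.shape, cbow_input.shape)
```

Because the reduction is over axis 1, the same line works for any context length, as long as every row in a batch has the same number of ids.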

In [149]:
from models import BiW2V

In [150]:
EMBEDDING_SIZE = 128

In [151]:
model = BiW2V(index = en_it_vocab.index, H = EMBEDDING_SIZE)

In [152]:
model.BuildCoreGraph()

In [153]:
model.BuildTrainingGraph()

In [154]:
model.BuildValidationGraph()

# Training

__`IMPORTANT!`__ Right now the model only works with a window of 1 because the feed dict can't handle context windows of different lengths. We'll either need to figure out how to have a variable-length dimension or else add extra padding to the sentences to account for the window size.
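
The padding option could look roughly like this: pad each sentence with `window` start ids and `window` end ids so that every target word gets exactly `2 * window` context ids, whatever the window size. The function name and the use of ids 0 and 1 for the start and end tokens are assumptions (they are consistent with the sanity-check output earlier, but this is not the project's `batch_generator`).

```python
def cbow_examples(sentence_ids, window, start_id=0, end_id=1):
    """Yield (context, label) pairs with a fixed-length context of 2 * window ids."""
    padded = [start_id] * window + list(sentence_ids) + [end_id] * window
    for i in range(window, len(padded) - window):
        left = padded[i - window:i]
        right = padded[i + 1:i + 1 + window]
        yield left + right, padded[i]

sentence = [20, 34, 17318, 1638]     # word ids for one toy sentence
for context, label in cbow_examples(sentence, window=1):
    print('{} {}'.format(context, label))
# [0, 34] 20
# [20, 17318] 34
# [34, 1638] 17318
# [17318, 1] 1638
```

With every context the same length, a whole batch can be fed as a rectangular array, at the cost of the padding ids contributing to the averaged context near sentence boundaries.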

In [155]:
model.train(1, batches)

... Model Initialized
	 <tf.Variable 'Embedding_Layer/ContextEmbeddings:0' shape=(48579, 128) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/WordEmbeddings:0' shape=(48579, 128) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/b:0' shape=(48579,) dtype=float32_ref>
... Starting Training
Average loss at step  0 :  6.77033853531


InvalidArgumentError: You must feed a value for placeholder tensor 'Validation/Placeholder' with dtype int32 and shape [?]
	 [[Node: Validation/Placeholder = Placeholder[dtype=DT_INT32, shape=[?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'Validation/Placeholder', defined at:
  File "//anaconda/envs/nlp/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "//anaconda/envs/nlp/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/anaconda/envs/nlp/lib/python2.7/site-packages/ipykernel/__main__.py", line 3, in <module>
    app.launch_new_instance()
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 478, in start
    self.io_loop.start()
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 281, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 232, in dispatch_shell
    handler(stream, idents, msg)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 397, in execute_request
    user_expressions, allow_stdin)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
    if self.run_code(code, result):
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-154-92690c90bb77>", line 1, in <module>
    model.BuildValidationGraph()
  File "models.py", line 33, in wrapper
    return function(self, *args, **kwargs)
  File "models.py", line 164, in BuildValidationGraph
    self.valid_words_ = tf.placeholder(tf.int32, shape=[None,])
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1599, in placeholder
    return gen_array_ops._placeholder(dtype=dtype, shape=shape, name=name)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3091, in _placeholder
    "Placeholder", dtype=dtype, shape=shape, name=name)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "//anaconda/envs/nlp/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Validation/Placeholder' with dtype int32 and shape [?]
	 [[Node: Validation/Placeholder = Placeholder[dtype=DT_INT32, shape=[?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
