[WIP][GSoC 2018] Similarity Learning #2050
Conversation
This reverts commit 6c06fbc.
…into similarity_learning_develop
@menshikh-iv
About current state:
I am trying to get things reviewed early so I can stay on the right track. As a result, some changes are still missing; I am working on them. These are:
For the random seeding I need help: I don't understand what you want. Could you link me to a code example or a tutorial? How should I set the seed? I have also added the .rst, but I am not sure it is correct. Is there a way to generate the docs?
but I cannot see any indentation on that line. 😕
That's expected
In your case, you need to create a random vector for each OOV word manually (not the full matrix for all OOV words at one moment), like this (not a perfect example, but it demonstrates what I mean):

import numpy as np

emb_size = 300
oov_words = ["hello", "world", "wow"]

matrix = []
for word in oov_words:
    # seed the RNG per word so each OOV word always gets the same vector
    rng = np.random.RandomState(seed=abs(hash(word)) % (2 ** 32 - 1))
    matrix.append(rng.rand(emb_size))

matrix = np.array(matrix)  # use this matrix in Keras
assert matrix.shape == (len(oov_words), emb_size)
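One caveat with the sketch above (an observation of mine, not from the reviewer): Python's builtin hash() is randomized per process for strings, so abs(hash(word)) only gives reproducible seeds within a single run unless PYTHONHASHSEED is fixed. A hedged variation using a stable digest from hashlib keeps the per-word vectors identical across runs; the helper name oov_vector is hypothetical.

```python
import hashlib

import numpy as np


def oov_vector(word, emb_size=300):
    """Deterministic random embedding for an out-of-vocabulary word."""
    # hashlib digests are stable across Python processes, unlike the
    # builtin hash(), which is salted per interpreter run.
    seed = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % (2 ** 32 - 1)
    rng = np.random.RandomState(seed=seed)
    return rng.rand(emb_size)


# Build the OOV matrix one word at a time, as the reviewer suggested.
matrix = np.array([oov_vector(w) for w in ["hello", "world", "wow"]])
assert matrix.shape == (3, 300)
```

With this variant the same word maps to the same vector in every process, which makes results easier to reproduce and debug.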
This pointed to the class docstring (not the module); please replace it.
@@ -68,6 +68,7 @@ Modules:
  models/deprecated/keyedvectors
  models/deprecated/fasttext_wrapper
  models/base_any2vec
+ models/experimental/drmm_tks
You also need to include the other files (like callbacks, layers, etc.) in the documentation build.
Please refer to the link below, which shows the diff of the requested changes.
Please note that tox -e docs will throw errors, not in my files but in some Keras files, since I am inheriting from the Keras Layer class, which has some unformatted docs.
@aneesh-joshi that shouldn't happen (because you include only your files, not Keras). Can you show me the log of tox -e docs that mentions the error in a Keras file (not yours)?
/home/aneeshj/Projects/gensim/.tox/docs/local/lib/python2.7/site-packages/gensim/models/experimental/custom_layers.py:docstring of gensim.models.experimental.custom_layers.TopKLayer.add_weight:10: WARNING: Unexpected indentation.
/home/aneeshj/Projects/gensim/.tox/docs/local/lib/python2.7/site-packages/gensim/models/experimental/custom_layers.py:docstring of gensim.models.experimental.custom_layers.TopKLayer.add_weight:12: WARNING: Block quote ends without a blank line; unexpected unindent.
/home/aneeshj/Projects/gensim/.tox/docs/local/lib/python2.7/site-packages/gensim/models/experimental/custom_layers.py:docstring of gensim.models.experimental.custom_layers.TopKLayer.call:4: WARNING: Inline strong start-string without end-string.
I haven't implemented any of the above functions; I just inherited the Layer class.
Aha, looks like you are right (an issue with the docstring of the parent class, which we can't control).
A simple workaround is to define these methods yourself and call super (but don't worry much about it now; you have more critical tasks).
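The workaround the reviewer describes can be sketched in plain Python (Parent here is a hypothetical stand-in for the Keras Layer class; the real override would call the Keras method the same way):

```python
class Parent:
    def add_weight(self, name):
        """Badly formatted parent docstring.
                unexpected indentation here
        trips Sphinx warnings when autodoc inherits it."""
        return ("weight", name)


class Child(Parent):
    def add_weight(self, name):
        """Create a weight variable.

        Overriding only to replace the inherited docstring with a
        Sphinx-friendly one; behaviour is delegated unchanged via super().
        """
        return super().add_weight(name)
```

Sphinx then documents Child.add_weight with the clean docstring, while the runtime behaviour stays identical to the parent's.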
The model needs to be trained on data in the following format:

>>> queries = ["When was World War 1 fought ?".lower().split(),
No vertical indents (again), here and everywhere.
Also, all imports should be at the top of the examples (and please import the current model too).
No vertical indents (again).
Sorry for making you repeat yourself; I keep missing it.
>>> queries = ["how are glacier caves formed ?".lower().split()]
>>> docs = ["A partly submerged glacier cave on Perito Moreno Glacier".lower().split(),
...         "A glacier cave is a cave formed within the ice of a glacier".lower().split()]
Where are your tests?
- fixes all docs and doctest errors
- fixes requested changes in PR
"metadata": {},
"outputs": [],
"source": [
"!python experimental_data/get_data.py"
Better to place this code directly in the notebook and remove get_data.py from the repo.
"metadata": {},
"outputs": [],
"source": [
"queries = [simple_preprocess(\"how are glacier caves formed\"),\n",
Again, no vertical indents (here and everywhere).
"skipping query-doc pair due to no words in vocab\n",
"MAP: 0.56\n",
"nDCG@1 : 0.41 \n",
apply this function to your NN too
----------
test_data : dict
    A dictionary which holds the validation data. It consists of the following keys:
    - "X1" : numpy array
Is this rendered correctly? I can't check because current build failed.
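For reference, a numpydoc-style Parameters section along these lines should render cleanly in Sphinx, provided a blank line precedes the bullet list (the function name evaluate and its trivial body are hypothetical, just enough to carry the docstring):

```python
def evaluate(test_data):
    """Evaluate the model on validation data.

    Parameters
    ----------
    test_data : dict
        A dictionary which holds the validation data. It consists of
        the following keys:

        - "X1" : numpy array
        - "X2" : numpy array
        - "y" : numpy array
        - "doc_lengths" : list
    """
    return sorted(test_data.keys())
```

Without the blank line before the bullet list, Sphinx typically emits "Unexpected indentation" warnings like the ones quoted earlier in this thread.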
)
for key in test_data.keys():
    if key not in ['X1', 'X2', 'y', 'doc_lengths']:
        raise ValueError("test_data dictionary doesn't have the keys: 'X1', 'X2', 'y', 'doc_lengths'")
Incorrect check: if test_data.keys() contains the needed keys plus some additional key, this will fail.
# get all the vocab words
for q in self.queries:
    self.word_counter.update(q)
If I call build_vocab twice, what happens?
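The reviewer's question has a concrete answer: Counter.update accumulates, so a second call double-counts every word instead of rebuilding the vocabulary. A simplified stand-in for the build_vocab logic demonstrates this:

```python
from collections import Counter

queries = [["hello", "world"], ["hello"]]
word_counter = Counter()


def build_vocab(counter, qs):
    # Counter.update accumulates: calling this twice on the same
    # queries double-counts every word rather than resetting counts.
    for q in qs:
        counter.update(q)


build_vocab(word_counter, queries)
assert word_counter["hello"] == 2
build_vocab(word_counter, queries)
assert word_counter["hello"] == 4  # doubled, not rebuilt
```

Guarding against this usually means either clearing the counter at the top of build_vocab or making repeat calls an explicit error.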
self.build_vocab(self.queries, self.docs, self.labels, self.word_embedding)

is_iterable = False
if isinstance(self.queries, Iterable) and not isinstance(self.queries, list):
Again, is_iterable is super strange; your input is always iterable.
loss = 'mse'
self.model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
else:
    logger.info("Model will be retrained")
What does "retrained" mean? Is this updating an existing model or training from scratch?
)
val_callback = [val_callback]  # since `model.fit` requires a list

# If train is called again, not all values should be reset
Which values? Can you clarify, please?
self.first_train = False

if is_iterable:
    self.model.fit_generator(train_generator, steps_per_epoch=steps_per_epoch, callbacks=val_callback,
This should always use fit_generator, with no is_iterable.
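One way to follow this advice is to wrap in-memory data in a generator up front, so the training code has a single fit_generator path and no is_iterable branch. A minimal sketch (the function name batch_generator and the batching layout are assumptions, not the PR's actual code):

```python
def batch_generator(X1, X2, y, batch_size):
    """Yield mini-batches forever, in the endless style fit_generator expects.

    Wrapping plain lists/arrays in a generator means both streaming and
    in-memory inputs go through the same fit_generator call.
    """
    n = len(y)
    while True:  # Keras-style generators loop indefinitely
        for start in range(0, n, batch_size):
            end = start + batch_size
            yield (X1[start:end], X2[start:end]), y[start:end]


gen = batch_generator([1, 2, 3, 4], [5, 6, 7, 8], [0, 1, 0, 1], batch_size=2)
```

Each next(gen) then yields one ((x1_batch, x2_batch), y_batch) tuple, regardless of whether the underlying data started life as a list or a stream.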
* fix TopK Layer output dim shape
* update ipynb to have newest model

Run dssm_example.py to get a complete run of the implementation. Work is in progress, so several features need to be added and the code needs to be cleaned. This is provided as a proof of concept/demo.