- adds non model files to docs
- fixes all docs and doctest errors
- fixes requested changes in PR
aneesh-joshi committed Jul 3, 2018
1 parent 157b7d7 commit 451e3b1
Showing 11 changed files with 121 additions and 94 deletions.
4 changes: 4 additions & 0 deletions docs/src/apiref.rst
Original file line number Diff line number Diff line change
@@ -69,6 +69,10 @@ Modules:
models/deprecated/fasttext_wrapper
models/base_any2vec
models/experimental/drmm_tks
models/experimental/custom_callbacks
models/experimental/custom_layers
models/experimental/custom_losses
models/experimental/evaluation_metrics
similarities/docsim
similarities/index
sklearn_api/atmodel
9 changes: 9 additions & 0 deletions docs/src/models/experimental/custom_callbacks.rst
@@ -0,0 +1,9 @@
:mod:`models.experimental.custom_callbacks` -- Custom Callbacks for Similarity Learning
=======================================================================================

.. automodule:: gensim.models.experimental.custom_callbacks
:synopsis: Custom Callbacks for Similarity Learning
:members:
:inherited-members:
:undoc-members:
:show-inheritance:
9 changes: 9 additions & 0 deletions docs/src/models/experimental/custom_layers.rst
@@ -0,0 +1,9 @@
:mod:`models.experimental.custom_layers` -- Custom Layers for Similarity Learning
=================================================================================

.. automodule:: gensim.models.experimental.custom_layers
:synopsis: Custom Layers for Similarity Learning
:members:
:inherited-members:
:undoc-members:
:show-inheritance:
9 changes: 9 additions & 0 deletions docs/src/models/experimental/custom_losses.rst
@@ -0,0 +1,9 @@
:mod:`models.experimental.custom_losses` -- Loss for Similarity Learning
========================================================================

.. automodule:: gensim.models.experimental.custom_losses
:synopsis: Loss functions for Similarity Learning
:members:
:inherited-members:
:undoc-members:
:show-inheritance:
4 changes: 2 additions & 2 deletions docs/src/models/experimental/drmm_tks.rst
@@ -1,5 +1,5 @@
:mod:`models.experimental.drmm_tks` -- Similarity Learning
============================================================================
:mod:`models.experimental.drmm_tks` -- Neural Nets for Similarity Learning
==========================================================================

.. automodule:: gensim.models.experimental.drmm_tks
:synopsis: Neural Network Similarity Learning
9 changes: 9 additions & 0 deletions docs/src/models/experimental/evaluation_metrics.rst
@@ -0,0 +1,9 @@
:mod:`models.experimental.evaluation_metrics` -- Evaluation Metrics for Similarity Learning
===========================================================================================

.. automodule:: gensim.models.experimental.evaluation_metrics
:synopsis: Evaluation Metrics for Similarity Learning
:members:
:inherited-members:
:undoc-members:
:show-inheritance:
21 changes: 10 additions & 11 deletions gensim/models/experimental/custom_callbacks.py
@@ -17,19 +17,18 @@ def __init__(self, test_data):
Parameters
----------
test_data : dict
A dictionary which holds the validation data
It consists of the following keys:
"X1" : numpy array
A dictionary which holds the validation data. It consists of the following keys:
- "X1" : numpy array
The queries as a numpy array of shape (n_samples, text_maxlen)
"X2" : numpy array
- "X2" : numpy array
The candidate docs as a numpy array of shape (n_samples, text_maxlen)
"y" : list of int
It is the labels for each of the query-doc pairs as a 1 or 0 with shape (n_samples,)
where 1: doc is relevant to query
0: doc is not relevant to query
"doc_lengths" : list of int
It contains the length of each document group. I.e., the number of queries
which represent one topic. It is needed for calculating the metrics.
- "y" : list of int
It is the labels for each of the query-doc pairs as a 1 or 0 with shape (n_samples,)
where 1 : doc is relevant to query, 0 : doc is not relevant to query
- "doc_lengths" : list of int
It contains the length of each document group. I.e., the number of queries
which represent one topic. It is needed for calculating the metrics.
"""

if not KERAS_AVAILABLE:
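The `test_data` dictionary described in the docstring above could be assembled roughly as follows. This is a minimal sketch with toy values: the sizes (`n_samples = 5`, `text_maxlen = 4`) and the filler word indices are made up for illustration, and only the key names and shapes come from the docstring. It also assumes, as the flattening code in `evaluate` suggests, that the entries of `doc_lengths` sum to `n_samples`.

```python
import numpy as np

n_samples, text_maxlen = 5, 4  # hypothetical toy sizes

test_data = {
    # One row of word indices per query-doc pair (filler values here).
    "X1": np.zeros((n_samples, text_maxlen), dtype=int),
    "X2": np.ones((n_samples, text_maxlen), dtype=int),
    # 1: doc is relevant to query, 0: doc is not relevant to query.
    "y": [0, 1, 1, 0, 0],
    # Two document groups: the first holds 2 pairs, the second holds 3.
    "doc_lengths": [2, 3],
}

# Basic consistency checks on the structure.
assert test_data["X1"].shape == test_data["X2"].shape == (n_samples, text_maxlen)
assert len(test_data["y"]) == n_samples
assert sum(test_data["doc_lengths"]) == n_samples
```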
6 changes: 3 additions & 3 deletions gensim/models/experimental/custom_layers.py
@@ -15,10 +15,10 @@ def __init__(self, output_dim, topk, **kwargs):
Parameters
----------
output_dim : tuple of ints
The dimension of the tensor after going through this layer
output_dim : tuple of int
The dimension of the tensor after going through this layer.
topk : int
The k topmost values to be returned
The k topmost values to be returned.
"""
self.output_dim = output_dim
self.topk = topk
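To illustrate what "the k topmost values" means for the `topk` parameter, here is a small NumPy sketch. It is not the layer's Keras implementation; the helper name `topk_values` and the sample scores are hypothetical, and the descending order of the result is an assumption of this sketch.

```python
import numpy as np


def topk_values(matrix, k):
    """Return the k largest values along the last axis, in descending order."""
    # Sorting the negated array ascending yields a descending sort of the original.
    sorted_desc = -np.sort(-np.asarray(matrix), axis=-1)
    return sorted_desc[..., :k]


scores = np.array([[0.1, 0.9, 0.4, 0.7],
                   [0.3, 0.2, 0.8, 0.5]])
top2 = topk_values(scores, k=2)
```

Here `top2` keeps the two largest entries of each row, so its last dimension shrinks from 4 to `k = 2`.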
120 changes: 55 additions & 65 deletions gensim/models/experimental/drmm_tks.py
@@ -13,9 +13,8 @@
Abbreviations
=============
DRMM : Deep Relevance Matching Model
TKS : Top K Solutions
- DRMM : Deep Relevance Matching Model
- TKS : Top K Solutions
About DRMM_TKS
==============
@@ -33,15 +32,13 @@
The trained model needs to be trained on data in the format:
>>> queries = ["When was World War 1 fought ?".lower().split(),
... "When was Gandhi born ?".lower().split()]
>>> docs = [["The world war was bad".lower().split(),
... "It was fought in 1996".lower().split()],
... ["Gandhi was born in the 18th century".lower().split(),
... "He fought for the Indian freedom movement".lower().split(),
... "Gandhi was assasinated".lower().split()]]
>>> labels = [[0, 1], [1, 0, 0]]
>>> from gensim.models.experimental import DRMM_TKS
>>> import gensim.downloader as api
>>> queries = ["When was World War 1 fought ?".lower().split(), "When was Gandhi born ?".lower().split()]
>>> docs = [["The world war was bad".lower().split(), "It was fought in 1996".lower().split()], ["Gandhi was born in"
... "the 18th century".lower().split(), "He fought for the Indian freedom movement".lower().split(),
... "Gandhi was assasinated".lower().split()]]
>>> labels = [[0, 1], [1, 0, 0]]
>>> word_embeddings_kv = api.load('glove-wiki-gigaword-50')
>>> model = DRMM_TKS(queries, docs, labels, word_embedding=word_embeddings_kv, verbose=0)
@@ -59,27 +56,24 @@
Testing on new data :
>>> queries = ["how are glacier caves formed ?".lower().split()]
>>> docs = ["A partly submerged glacier cave on Perito Moreno Glacier".lower().split(),
... "A glacier cave is a cave formed within the ice of a glacier".lower().split()]
Predicting on new data :
>>> from gensim.test.utils import datapath
>>> model = DRMM_TKS.load(datapath('drmm_tks'))
>>> print(model.predict([["hello", "world"]], [["i", "am", "happy"], ["good", "morning"]]))
[[0.99346054]
[0.999115 ]
[0.9989991 ]]
>>>
>>> queries = ["how are glacier caves formed ?".lower().split()]
>>> docs = [["A partly submerged glacier cave on Perito Moreno Glacier".lower().split(), "glacier cave is cave formed"
... " within the ice of glacier".lower().split()]]
>>> print(model.predict(queries, docs))
[[0.9915068 ]
[0.99228466]]
>>> print(model.predict([["hello", "world"]], [[["i", "am", "happy"], ["good", "morning"]]]))
[[0.9975487]
[0.999115 ]]
More information can be found in:
More information can be found in:
`Jiafeng Guo, Yixing Fan, Qingyao Ai, W. Bruce Croft "A Deep Relevance Matching Model for Ad-hoc Retrieval"
<http://www.bigdatalab.ac.cn/~gjf/papers/2016/CIKM2016a_guo.pdf>`_
`MatchZoo Repository <https://github.com/faneshion/MatchZoo>`_
`Similarity Learning Wikipedia Page <https://en.wikipedia.org/wiki/Similarity_learning>`_
"""
@@ -224,8 +218,8 @@ def __init__(self, queries=None, docs=None, labels=None, word_embedding=None,
The candidate answers for the similarity learning model.
labels: iterable list of list of int, optional
Indicates when a candidate document is relevant to a query
1 : relevant
0 : irrelevant
- 1 : relevant
- 0 : irrelevant
word_embedding : :class:`~gensim.models.keyedvectors.KeyedVectors`, optional
a KeyedVector object which has the embeddings pre-loaded.
If None, random word embeddings will be used.
@@ -249,22 +243,20 @@ def __init__(self, queries=None, docs=None, labels=None, word_embedding=None,
the way the model should be trained, either to rank or classify
verbose : {0, 1, 2}
the level of information shared while training
0 = silent, 1 = progress bar, 2 = one line per epoch
- 0 : silent
- 1 : progress bar
- 2 : one line per epoch
Examples
--------
The trained model needs to be trained on data in the format
>>> queries = ["When was World War 1 fought ?".lower().split(),
... "When was Gandhi born ?".lower().split()]
>>> docs = [["The world war was bad".lower().split(),
... "It was fought in 1996".lower().split()],
... ["Gandhi was born in the 18th century".lower().split(),
... "He fought for the Indian freedom movement".lower().split(),
... "Gandhi was assasinated".lower().split()]]
>>> labels = [[0, 1],
... [1, 0, 0]]
>>> queries = ["When was World War 1 fought ?".lower().split(), "When was Gandhi born ?".lower().split()]
>>> docs = [["The world war was bad".lower().split(), "It was fought in 1996".lower().split()], ["Gandhi was"
... "born in the 18th century".lower().split(), "He fought for the Indian freedom movement".lower().split(),
... "Gandhi was assasinated".lower().split()]]
>>> labels = [[0, 1], [1, 0, 0]]
>>> import gensim.downloader as api
>>> word_embeddings_kv = api.load('glove-wiki-gigaword-50')
>>> model = DRMM_TKS(queries, docs, labels, word_embedding=word_embeddings_kv, verbose=0)
@@ -292,8 +284,9 @@ def __init__(self, queries=None, docs=None, labels=None, word_embedding=None,
self._get_full_batch_iter = _get_full_batch_iter

if self.target_mode not in ['ranking', 'classification']:
raise ValueError("Unkown target_mode %s. It must be either"
"'ranking' or 'classification'" % self.target_mode)
raise ValueError(
"Unkown target_mode %s. It must be either 'ranking' or 'classification'" % self.target_mode
)

if unk_handle_method not in ['random', 'zero']:
raise ValueError("Unkown token handling method %s" % str(unk_handle_method))
@@ -346,8 +339,7 @@ def build_vocab(self, queries, docs, labels, word_embedding):
# Initialize the embedding matrix
# UNK word gets the vector based on the method
if self.unk_handle_method == 'random':
self.embedding_matrix = np.random.uniform(-0.2, 0.2,
(self.vocab_size, self.embedding_dim))
self.embedding_matrix = np.random.uniform(-0.2, 0.2, (self.vocab_size, self.embedding_dim))
elif self.unk_handle_method == 'zero':
self.embedding_matrix = np.zeros((self.vocab_size, self.embedding_dim))
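The two `unk_handle_method` branches above can be tried in isolation with a sketch like the following. The function name `init_embedding_matrix` and the `seed` argument are hypothetical additions for reproducibility; only the uniform range (-0.2, 0.2), the zero initialisation, and the two accepted method names come from the diff.

```python
import numpy as np


def init_embedding_matrix(vocab_size, embedding_dim, unk_handle_method="random", seed=None):
    """Build an initial embedding matrix of shape (vocab_size, embedding_dim)."""
    if unk_handle_method == "random":
        # Small uniform noise, matching the range used in the diff.
        rng = np.random.RandomState(seed)
        return rng.uniform(-0.2, 0.2, (vocab_size, embedding_dim))
    if unk_handle_method == "zero":
        return np.zeros((vocab_size, embedding_dim))
    raise ValueError("Unknown token handling method %s" % unk_handle_method)


random_matrix = init_embedding_matrix(6, 3, "random", seed=0)
zero_matrix = init_embedding_matrix(6, 3, "zero")
```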

@@ -361,9 +353,10 @@ def build_vocab(self, queries, docs, labels, word_embedding):
# Creates the same random vector for the given string each time
self.embedding_matrix[i] = self._seeded_vector(word, self.embedding_dim)
n_non_embedding_words += 1
logger.info("There are %d words out of %d (%.2f%%) not in the embeddings. Setting them to %s" %
(n_non_embedding_words, self.vocab_size, n_non_embedding_words * 100 / self.vocab_size,
self.unk_handle_method))
logger.info(
"There are %d words out of %d (%.2f%%) not in the embeddings. Setting them to %s", n_non_embedding_words,
self.vocab_size, n_non_embedding_words * 100 / self.vocab_size, self.unk_handle_method
)

# Include embeddings for words in embedding file but not in the train vocab
# It will be useful for embedding words encountered in validation and test set
@@ -410,11 +403,9 @@ def build_vocab(self, queries, docs, labels, word_embedding):
logger.info("Normalizing the word embeddings")
self.embedding_matrix = normalize(self.embedding_matrix)

logger.info("Embedding Matrix build complete. It now has shape %s" %
str(self.embedding_matrix.shape))
logger.info("Pad word has been set to index %d" % self.pad_word_index)
logger.info("Unknown word has been set to index %d" %
self.unk_word_index)
logger.info("Embedding Matrix build complete. It now has shape %s", str(self.embedding_matrix.shape))
logger.info("Pad word has been set to index %d", self.pad_word_index)
logger.info("Unknown word has been set to index %d", self.unk_word_index)
logger.info("Embedding index build complete")
self.needs_vocab_build = False
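The logging changes in this file replace eager `%` formatting with arguments passed to `logger.info`. The sketch below shows the practical difference, under the assumption that the record is filtered out by the logger's level. The logger name and the `Expensive` helper class are hypothetical and exist only to count string conversions.

```python
import logging

logger = logging.getLogger("lazy_logging_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # INFO records are filtered out


class Expensive:
    """Counts how many times an instance is converted to a string."""
    calls = 0

    def __str__(self):
        Expensive.calls += 1
        return "expensive"


obj = Expensive()

# Eager: the % operator builds the string before logger.info is even called.
logger.info("value is %s" % obj)
eager_calls = Expensive.calls

# Lazy: logging formats the message only if the record is actually emitted.
logger.info("value is %s", obj)
lazy_calls = Expensive.calls - eager_calls
```

With the level set to `WARNING`, the eager call still pays for the string conversion, while the lazy call skips it.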

@@ -566,8 +557,10 @@ def train(self, queries, docs, labels, word_embedding=None,
indexed_long_query_list = self._translate_user_data(long_query_list)
indexed_long_doc_list = self._translate_user_data(long_doc_list)

val_callback = ValidationCallback({"X1": indexed_long_query_list, "X2": indexed_long_doc_list,
"doc_lengths": doc_lens, "y": long_label_list})
val_callback = ValidationCallback(
{"X1": indexed_long_query_list, "X2": indexed_long_doc_list, "doc_lengths": doc_lens,
"y": long_label_list}
)
val_callback = [val_callback] # since `model.fit` requires a list

# If train is called again, not all values should be reset
@@ -613,16 +606,17 @@ def _translate_user_data(self, data):
translated_sentence.append(self.unk_word_index)
n_skipped_words += 1
if len(sentence) > self.text_maxlen:
logger.info("text_maxlen: %d isn't big enough. Error at sentence of length %d."
"Sentence is %s" % (
self.text_maxlen, len(sentence), str(sentence))
)
logger.info(
"text_maxlen: %d isn't big enough. Error at sentence of length %d."
"Sentence is %s", self.text_maxlen, len(sentence), str(sentence)
)
translated_sentence = translated_sentence + \
(self.text_maxlen - len(sentence)) * [self.pad_word_index]
translated_data.append(np.array(translated_sentence))

logger.info("Found %d unknown words. Set them to unknown word index : %d" %
(n_skipped_words, self.unk_word_index))
logger.info(
"Found %d unknown words. Set them to unknown word index : %d", n_skipped_words, self.unk_word_index
)
return np.array(translated_data)
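The index translation and padding performed by `_translate_user_data` can be sketched as a standalone function. The name `translate_and_pad`, the `word2index` dictionary, and the toy vocabulary are hypothetical. Note one deliberate difference: the code in the diff only logs over-long sentences, whereas this sketch truncates them to `text_maxlen` so that the result is always a rectangular array.

```python
import numpy as np


def translate_and_pad(sentences, word2index, text_maxlen, pad_word_index, unk_word_index):
    """Map words to indices, then pad each sentence to text_maxlen."""
    translated = []
    for sentence in sentences:
        # Words missing from the vocabulary fall back to the unknown-word index.
        indices = [word2index.get(word, unk_word_index) for word in sentence]
        # Assumption of this sketch: truncate instead of only logging.
        indices = indices[:text_maxlen]
        indices += [pad_word_index] * (text_maxlen - len(indices))
        translated.append(indices)
    return np.array(translated)


vocab = {"hello": 2, "world": 3}  # toy vocabulary; 0 = pad, 1 = unknown
batch = translate_and_pad(
    [["hello", "world"], ["hello", "unseen", "world", "again", "extra"]],
    vocab, text_maxlen=4, pad_word_index=0, unk_word_index=1,
)
```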

def predict(self, queries, docs):
@@ -643,9 +637,9 @@ def predict(self, queries, docs):
>>> model = DRMM_TKS.load(datapath('drmm_tks'))
>>>
>>> queries = ["When was World War 1 fought ?".split(), "When was Gandhi born ?".split()]
>>> docs = [["The world war was bad".split(), "It was fought in 1996".split()],
... ["Gandhi was born in the 18th century".split(), "He fought for the Indian freedom movement".split(),
... "Gandhi was assasinated".split()]]
>>> docs = [["The world war was bad".split(), "It was fought in 1996".split()], ["Gandhi was born in the 18th"
... " century".split(), "He fought for the Indian freedom movement".split(), "Gandhi was"
... " assasinated".split()]]
>>> print(model.predict(queries, docs))
[[0.9933108 ]
[0.9925415 ]
@@ -672,9 +666,9 @@ def predict(self, queries, docs):

return predictions


def evaluate(self, queries, docs, labels):
"""Evaluates the model and provides the results in terms of metrics (MAP, nDCG)
This should ideally be called on the test set.
Parameters
----------
@@ -685,7 +679,6 @@ def evaluate(self, queries, docs, labels):
labels : list of list of int
The relevance of the document to the query. 1 = relevant, 0 = not relevant
"""

long_doc_list = []
long_label_list = []
long_query_list = []
@@ -698,19 +691,16 @@ def evaluate(self, queries, docs, labels):
long_label_list.append(l)
i += 1
doc_lens.append(len(doc))

indexed_long_query_list = self._translate_user_data(long_query_list)
indexed_long_doc_list = self._translate_user_data(long_doc_list)
predictions = self.model.predict(x={'query': indexed_long_query_list, 'doc': indexed_long_doc_list})
Y_pred = []
Y_true = []
offset = 0

for doc_size in doc_lens:
Y_pred.append(predictions[offset: offset + doc_size])
Y_true.append(long_label_list[offset: offset + doc_size])
offset += doc_size

logger.info("MAP: %.2f", mapk(Y_true, Y_pred))
for k in [1, 3, 5, 10, 20]:
logger.info("nDCG@%d : %.2f", k, mean_ndcg(Y_true, Y_pred, k=k))
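The regrouping step in `evaluate`, which slices the flat prediction and label lists back into one group per query using `doc_lens`, can be isolated as below. The helper name `group_by_lengths` and the toy values are hypothetical; the offset-based slicing mirrors the loop in the diff.

```python
def group_by_lengths(flat_values, group_lengths):
    """Split a flat sequence into consecutive groups of the given sizes."""
    groups = []
    offset = 0
    for size in group_lengths:
        groups.append(list(flat_values[offset:offset + size]))
        offset += size
    return groups


flat_labels = [0, 1, 1, 0, 0]
doc_lens = [2, 3]  # two queries with 2 and 3 candidate docs
grouped = group_by_lengths(flat_labels, doc_lens)
```

Here `grouped` recovers the nested per-query layout that the metrics expect.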
12 changes: 7 additions & 5 deletions gensim/models/experimental/evaluation_metrics.py
@@ -3,7 +3,8 @@

logger = logging.getLogger(__name__)
logging.basicConfig(
format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO
)


def mapk(Y_true, Y_pred):
@@ -19,8 +20,8 @@ def mapk(Y_true, Y_pred):
Y_pred : numpy array or list of floats
Contains the predicted similarity score between a query and document
Usage
-----
Examples
--------
>>> Y_true = [[0, 1, 0, 1], [0, 0, 0, 0, 1, 0], [0, 1, 0]]
>>> Y_pred = [[0.1, 0.2, -0.01, 0.4], [0.12, -0.43, 0.2, 0.1, 0.99, 0.7], [0.5, 0.63, 0.92]]
>>> print(mapk(Y_true, Y_pred))
@@ -61,8 +62,9 @@ def mean_ndcg(Y_true, Y_pred, k=10):
Y_pred : numpy array or list of floats
Contains the predicted similarity score between a query and document
Usage
-----
Examples
--------
>>> Y_true = [[0, 1, 0, 1], [0, 0, 0, 0, 1, 0], [0, 1, 0]]
>>> Y_pred = [[0.1, 0.2, -0.01, 0.4], [0.12, -0.43, 0.2, 0.1, 0.19, 0.7], [0.5, 0.63, 0.72]]
>>> for k in [1, 3, 5, 10]:
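For readers unfamiliar with the two metrics, the sketch below implements the textbook definitions of mean average precision and nDCG@k for binary relevance labels. It is not the code of `mapk` or `mean_ndcg` in this commit, and it may differ from them in tie-breaking and edge cases. All function names and the toy data here are hypothetical.

```python
import math


def rank_by_score(y_true, y_pred):
    """Return the labels reordered by descending predicted score."""
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    return [y_true[i] for i in order]


def average_precision(y_true, y_pred):
    """Mean of precision@rank over the ranks at which a relevant doc appears."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(rank_by_score(y_true, y_pred), start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(y_true, y_pred, k=10):
    """DCG of the predicted ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(y_true, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(rank_by_score(y_true, y_pred), k) / ideal_dcg


def mean_over_queries(metric, Y_true, Y_pred, **kwargs):
    """Average a per-query metric over all queries."""
    return sum(metric(t, p, **kwargs) for t, p in zip(Y_true, Y_pred)) / len(Y_true)


Y_true = [[0, 1, 0, 1], [0, 1, 0]]
Y_pred = [[0.1, 0.2, -0.01, 0.4], [0.5, 0.63, 0.92]]
map_score = mean_over_queries(average_precision, Y_true, Y_pred)
ndcg_3 = mean_over_queries(ndcg_at_k, Y_true, Y_pred, k=3)
```

In the toy data the first query ranks both relevant docs on top, while the second places its only relevant doc at rank 2, which lowers both scores.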