Renamed one_hot function to hashing_trick, made hashing stable #6887

aesuli · 2017-06-07T16:02:02Z

As pointed out in #2294, the one_hot function does not really implement a one-hot encoding but a hashing-based encoding, with possible collisions between words.
This is a well-known indexing method, a.k.a. the hashing trick.
For this reason I propose renaming the function from one_hot to hashing_trick.
I also changed the implementation: the original one_hot function used the built-in hash() function, which is not stable across runs (its output is randomized for security reasons), so I replaced it with md5() from hashlib, which is stable.
Using md5 is not as straightforward or as fast as using a dedicated hashing function such as murmurhash, but using it as the default avoids adding new dependencies to keras.
A custom hashing function can optionally be passed as a parameter to the hashing_trick function.
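
For readers who want to see the idea concretely, here is a minimal sketch of a stable, md5-based hashing trick. It is not the code from this PR: the names `stable_md5_hash` and `hashing_trick_sketch` are hypothetical, tokenization is reduced to lowercasing and whitespace splitting, and reserving index 0 is an assumption of the sketch.

```python
import hashlib


def stable_md5_hash(word):
    """Map a string to a large integer that is identical on every run."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def hashing_trick_sketch(text, n, hash_function=stable_md5_hash):
    """Convert text to a list of indices in [1, n - 1].

    Index 0 is left unused (an assumption of this sketch), and distinct
    words may collide on the same index.
    """
    words = text.lower().split()
    return [hash_function(w) % (n - 1) + 1 for w in words]


indices = hashing_trick_sketch("the cat sat on the mat", n=50)
# Both occurrences of "the" map to the same index.
assert indices[0] == indices[4]
```

Because md5 depends only on the input bytes, the same text yields the same indices in every process, which is the property a saved model needs.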

aesuli · 2017-06-07T16:08:53Z

Forgot to commit edited test.

fchollet · 2017-06-07T19:30:31Z

The purpose of using hash here is speed. It can be a pretty significant bottleneck for this use case. Using a cryptographic hash function, via a hasher class, is slower. Try benchmarking it.
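
A rough way to run the suggested benchmark is sketched below. The word list, function names, and repeat count are all arbitrary choices made for illustration, and the timings will vary by machine and Python version, so no expected numbers are given.

```python
import hashlib
import timeit

# Synthetic vocabulary, used only for timing purposes.
WORDS = ["word%d" % i for i in range(1000)]


def with_builtin_hash(words):
    return [hash(w) for w in words]


def with_md5(words):
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) for w in words]


def benchmark(repeat=20):
    """Return the seconds taken by each approach over `repeat` passes."""
    t_hash = timeit.timeit(lambda: with_builtin_hash(WORDS), number=repeat)
    t_md5 = timeit.timeit(lambda: with_md5(WORDS), number=repeat)
    return t_hash, t_md5


if __name__ == "__main__":
    t_hash, t_md5 = benchmark()
    print("hash(): %.4fs  md5: %.4fs" % (t_hash, t_md5))
```

Note that CPython caches the hash of a str object after its first use, so repeated passes over the same list favour hash() even further than a single pass over fresh strings would.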

Arbitrarily renaming functions in the public API breaks backwards compatibility and is not acceptable.

Stability would be good to have: if you have a PR that introduces stability in this function at no performance cost, we will merge it.

aesuli · 2017-06-07T20:31:43Z

I'm aware of the efficiency aspect, yet a model trained using hash is not usable after a restart because there is no control over the randomization component of hash.
I can change this PR by:

- setting the default for the hash_function argument of hashing_trick to hash, so speed is preserved while other hash functions can still be used (I actually use mmh3, but I kept it out of this PR because it would add a package dependency; md5 was a fallback that added no dependencies).
- adding a note in the documentation that the default is not stable.
- putting back one_hot as a wrapper around hashing_trick defaulting to hash (with a deprecation warning? Really, the one_hot name is confusing, also with respect to the meaning of the one_hot methods in the keras backend).
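
The restart problem described above can be reproduced with a small experiment. This sketch assumes Python 3, where str hashing is randomized per process and controlled by the `PYTHONHASHSEED` environment variable; the helper name `run_with_seed` and the sample word are made up for the demonstration.

```python
import os
import subprocess
import sys

# Code executed in each child interpreter: print hash() and md5 of one word.
SNIPPET = (
    "import hashlib; w = 'keras'; "
    "print(hash(w), hashlib.md5(w.encode()).hexdigest())"
)


def run_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output([sys.executable, "-c", SNIPPET], env=env)
    builtin, md5_hex = out.decode().split()
    return int(builtin), md5_hex


# Two separate processes with different hash seeds stand in for two restarts.
hash_a, md5_a = run_with_seed(1)
hash_b, md5_b = run_with_seed(2)
print("hash() equal across runs:", hash_a == hash_b)
print("md5 equal across runs:", md5_a == md5_b)
```

Indices derived from hash() in one process therefore need not match those computed after a restart, while md5-derived indices do.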

fchollet · 2017-06-08T19:00:19Z

keras/preprocessing/text.py

+    # Arguments
+        text: Input text (string).
+        n: Dimension of the hashing space.
+        hash_function: The hash function to use. Takes in input a string,


Should this argument accept a string from a predefined set, such as "md5"? What would be some good (fast) functions here?
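
One possible shape for such a predefined set is sketched here using only the standard library. The registry `NAMED_HASHES`, the helper `lookup_hash`, and the inclusion of `"crc32"` are illustrative assumptions, not something proposed in this thread, and the sketch does not assess how well each function spreads words over the index space.

```python
import hashlib
import zlib


def md5_hash(word):
    """Cryptographic digest converted to an integer; stable across runs."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def crc32_hash(word):
    """Non-cryptographic 32-bit checksum; also stable across runs."""
    return zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF


# Hypothetical registry mapping string names to hash callables.
NAMED_HASHES = {"md5": md5_hash, "crc32": crc32_hash}


def lookup_hash(name):
    """Return the callable registered under `name`, or raise ValueError."""
    try:
        return NAMED_HASHES[name]
    except KeyError:
        raise ValueError("Unknown hash function name: %r" % (name,))
```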

fchollet · 2017-06-08T19:00:57Z

keras/preprocessing/text.py

            lower=True,
            split=' '):
+    """One-hot encode a text into a list of word indexes in a vocabulary of
+    size n (unicity of word to index mapping non-guaranteed).


One-line docstring description should be one line and end with a period.

fchollet · 2017-06-08T19:01:05Z

keras/preprocessing/text.py

+                  filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
+                  lower=True,
+                  split=' '):
+    """Converts a text to a sequence of indices in a fixed-size hashing space


One-line docstring description should end with a period.

fchollet · 2017-06-08T19:02:09Z

keras/preprocessing/text.py

+            it is not consistent across different run.
+            If a model learned that uses the hashing trick is meant to be
+            saved and reused a stable hashing function must be given as
+            argument.


Docstring contains a few typos, please fix / rephrase

fchollet · 2017-06-08T19:02:25Z

keras/preprocessing/text.py

+    collisions by the hashing function.
+    The probability of a collision is in relation to the dimension of
+    the hashing space and the number of distinct objects, see
+    https://en.wikipedia.org/wiki/Birthday_problem#Probability_table


Use markdown format for links
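
The relation between collision probability, hashing-space size, and number of distinct words that the docstring refers to can be computed directly. The sketch below assumes an idealized hash that assigns each distinct word a uniformly random slot; the function name is hypothetical.

```python
def collision_probability(num_words, n):
    """Probability that at least two of `num_words` distinct words share a
    slot when each is hashed uniformly at random into `n` slots."""
    if num_words > n:
        # Pigeonhole principle: more words than slots forces a collision.
        return 1.0
    p_no_collision = 1.0
    for i in range(num_words):
        # The (i + 1)-th word must avoid the i slots already taken.
        p_no_collision *= (n - i) / n
    return 1.0 - p_no_collision


# Classic birthday-problem check: 23 people, 365 days, just over 50%.
assert abs(collision_probability(23, 365) - 0.507) < 1e-3
```

In practice this means the hashing space has to be much larger than the vocabulary if collisions are to be rare.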

fchollet · 2017-06-13T22:31:51Z

docs/templates/preprocessing/text.md

+```python
+keras.preprocessing.text.hashing_trick(text, n,
+                  hash_function=None,
+                  filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',


These arguments should be documented, too. Also please fix the indent in the signature.

fchollet · 2017-06-13T22:32:27Z

keras/preprocessing/text.py

+try:
+    import mmh3
+except ImportError:
+    mmh3 = None


I don't think this is required, please remove mmh3-related code.

fchollet · 2017-06-13T22:33:49Z

keras/preprocessing/text.py

+    the number of distinct objects.
+    """
+    if hash_function == 'hash':
+        hash_function = hash


It is safe to just make it default to hash, no need for the string conversion. Hash is not the name of a specific hash function anyway, just a Python utility.
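
The suggestion above, together with the earlier request to keep one_hot in the public API, might look roughly like the following. This is a simplified sketch with hypothetical `_sketch` names and whitespace-only tokenization, not the merged implementation.

```python
def hashing_trick_sketch(text, n, hash_function=None):
    """Map each word of `text` to an index in [1, n - 1]."""
    if hash_function is None:
        # Default to the built-in hash(): fast, but for str it is randomized
        # per process in Python 3, so indices can change between runs.
        hash_function = hash
    words = text.lower().split()
    return [hash_function(w) % (n - 1) + 1 for w in words]


def one_hot_sketch(text, n):
    """Backwards-compatible name that delegates to the sketch above."""
    return hashing_trick_sketch(text, n, hash_function=hash)


# Within a single process both entry points agree.
assert one_hot_sketch("a b a", 10) == hashing_trick_sketch("a b a", 10)
```

Keeping the default as a plain callable avoids a special string case such as `'hash'`, while callers who need reproducibility can still pass a stable function explicitly.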

aesuli · 2017-06-16T15:10:37Z

The check error is due to a timeout: "The job exceeded the maximum time limit for jobs, and has been terminated."

fchollet · 2017-06-16T18:58:47Z

docs/templates/preprocessing/text.md

 ```python
 keras.preprocessing.text.text_to_word_sequence(text, 
-    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")
+                                               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")


One keyword argument per line (same below)

fchollet · 2017-06-16T18:59:08Z

docs/templates/preprocessing/text.md

+- __Arguments__:
+    - __text__: str.
    - __n__: int. Size of vocabulary.
+    - __filters__: list (or concatenation) of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' , includes basic punctuation, tabs, and newlines.


Split this line into 3

fchollet · 2017-06-16T18:59:23Z

docs/templates/preprocessing/text.md

+            Note that 'hash' is not a stable hashing function, so
+            it is not consistent across different runs, while 'md5'
+            is a stable hashing function.
+    - __filters__: list (or concatenation) of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' , includes basic punctuation, tabs, and newlines.


Split this line into 3

fchollet · 2017-06-16T20:00:07Z

keras/preprocessing/text.py

            filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
            lower=True,
            split=' '):
+    """One-hot encode a text into a list of word indexes of size n.


fchollet · 2017-06-20T17:21:57Z

LGTM

aesuli added 2 commits June 7, 2017 16:53

Replaced one_hot function with hashing_trick

a591c9c

Update text_test.py

691f4ca

aesuli added 2 commits June 7, 2017 18:46

PEP8 fix

fc94419

Update text.py

597f28a

Put one_hot back, added documentation

3152d9f

fchollet reviewed Jun 8, 2017

aesuli added 2 commits June 13, 2017 17:13

Changes following the review comments

9e9f827

PEP8

0b430f2

fchollet reviewed Jun 13, 2017

aesuli added 2 commits June 16, 2017 15:04

Chages after second review

5f50801

fixed wrong default for hashing_trick

cde5dc6

fchollet approved these changes Jun 16, 2017

aesuli added 2 commits June 19, 2017 10:21

formatted documentation

093865f

typo

a99903f

fchollet merged commit 6814506 into keras-team:master Jun 20, 2017

mohanson mentioned this pull request Jul 22, 2017

Synchronize updates to English documents MoyanZitto/keras-cn#101

Merged
