aesuli (Contributor) commented Jun 7, 2017

As pointed out in #2294, the one_hot function does not really implement a one-hot encoding, but a hashing-based encoding, with possible collisions between words.
This is a well-known indexing method, a.k.a. the hashing trick.
For this reason I propose renaming the function from one_hot to hashing_trick.
I also changed the implementation: the original one_hot function used the built-in hash() function, which is not guaranteed to be stable across runs (its output can be randomized for security reasons), so I replaced it with md5() from hashlib, which is stable.
md5 is neither as simple nor as fast as a dedicated hashing function such as murmurhash, but using it as the default avoids adding new dependencies to keras.
A custom hashing function can optionally be passed as a parameter to the hashing_trick function.
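
For readers unfamiliar with the technique, here is a minimal sketch of a hashing-based word encoding built on md5 from hashlib. It is an illustration only, not necessarily the code in this PR: the names `stable_md5_hash` and `hashing_trick_sketch` are hypothetical, and leaving index 0 unused is an assumption made for the example.

```python
import hashlib


def stable_md5_hash(word):
    """Return an integer derived from the md5 digest of `word`.

    The value depends only on the bytes of the word, so it is the same
    in every Python process.
    """
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def hashing_trick_sketch(text, n, hash_function=stable_md5_hash,
                         lower=True, split=" "):
    """Map each word of `text` to an index in [1, n - 1].

    Different words may receive the same index (a collision).
    """
    if lower:
        text = text.lower()
    words = [w for w in text.split(split) if w]
    # Assumption for this sketch: index 0 is left unused (e.g. for padding).
    return [hash_function(w) % (n - 1) + 1 for w in words]


indices = hashing_trick_sketch("The cat sat on the mat", 50)
print(indices)
```

Both occurrences of "the" map to the same index, and the result is identical across interpreter restarts because md5 has no per-process randomization.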

aesuli (Contributor, Author) commented Jun 7, 2017

Forgot to commit edited test.

fchollet (Collaborator) commented Jun 7, 2017

The purpose of using hash here is speed. It can be a pretty significant bottleneck for this use case. Using a cryptographic hash function, via a hasher class, is slower. Try benchmarking it.

Arbitrarily renaming functions in the public API breaks backwards compatibility and is not acceptable.

Stability would be good to have: if you have a PR that introduces stability in this function at no performance cost, we will merge it.
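
One way to run the suggested benchmark is sketched below with timeit. The synthetic text, the helper names (`md5_hash`, `encode`, `best_time`) and the repeat counts are assumptions made for illustration; absolute timings depend on the machine and Python version, so no expected numbers are given.

```python
import hashlib
import timeit

# Synthetic text; real corpora have different word-length distributions.
TEXT = " ".join("word%d" % i for i in range(2000))


def md5_hash(word):
    """Stable integer hash built from the md5 digest of a word."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def encode(text, n, hash_function):
    # split() creates fresh string objects on each call, so CPython's
    # per-object caching of str hashes does not flatter the built-in hash.
    return [hash_function(w) % (n - 1) + 1 for w in text.split(" ")]


def best_time(hash_function, repeat=3, number=5):
    """Best-of-`repeat` wall time in seconds for `number` encodings of TEXT."""
    timer = timeit.Timer(lambda: encode(TEXT, 1000, hash_function))
    return min(timer.repeat(repeat=repeat, number=number))


builtin_seconds = best_time(hash)
md5_seconds = best_time(md5_hash)
print("built-in hash: %.4f s, md5: %.4f s" % (builtin_seconds, md5_seconds))
```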

aesuli (Contributor, Author) commented Jun 7, 2017

I'm aware of the efficiency aspect, yet a model trained using hash is not usable after a restart, because there is no control over the randomization component of hash.
I can change this PR by:

  • setting the default for the hash_function argument of hashing_trick to hash, so speed is preserved but other hash functions can still be used (I actually use mmh3, but I kept it out of this PR because it would add a package dependency; md5 was a fallback that added no dependencies).
  • adding a note in the documentation that the default is not stable.
  • putting back one_hot as a wrapper around hashing_trick that defaults to hash (with a deprecation warning? Really, the name one_hot is confusing, also with respect to the meaning of the one_hot methods in the keras backend).
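
The three proposed changes could look roughly like the sketch below. This is a simplified illustration under stated assumptions, not the merged code: the signatures omit the `filters` argument, and the deprecation warning was only raised as a question in the comment above, so its presence and wording here are hypothetical.

```python
import warnings


def hashing_trick(text, n, hash_function=hash, lower=True, split=" "):
    """Convert a text to a list of indices in a hashing space of size n.

    The default, Python's built-in hash, is fast but may differ between
    runs; pass a stable function if the mapping must survive a restart.
    """
    if lower:
        text = text.lower()
    words = [w for w in text.split(split) if w]
    # Index 0 is left unused in this sketch (an assumption).
    return [hash_function(w) % (n - 1) + 1 for w in words]


def one_hot(text, n, lower=True, split=" "):
    """Backwards-compatible wrapper around hashing_trick using hash."""
    # Hypothetical warning: the thread only floats this as an option.
    warnings.warn("one_hot is a hashing-based encoding; "
                  "consider calling hashing_trick directly.",
                  DeprecationWarning, stacklevel=2)
    return hashing_trick(text, n, hash_function=hash,
                         lower=lower, split=split)


print(one_hot("a b a", 10))
```

Keeping one_hot as a thin wrapper preserves the public API while letting callers opt into a stable hash function through hashing_trick.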

# Arguments
text: Input text (string).
n: Dimension of the hashing space.
hash_function: The hash function to use. Takes in input a string,
Collaborator

Should this argument accept a string from a predefined set, such as "md5"? What would be some good (fast) functions here?

lower=True,
split=' '):
"""One-hot encode a text into a list of word indexes in a vocabulary of
size n (unicity of word to index mapping non-guaranteed).
Collaborator

One-line docstring description should be one line and end with a period.

filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
lower=True,
split=' '):
"""Converts a text to a sequence of indices in a fixed-size hashing space
Collaborator

One-line docstring description should end with a period.

it is not consistent across different run.
If a model learned that uses the hashing trick is meant to be
saved and reused a stable hashing function must be given as
argument.
Collaborator

Docstring contains a few typos, please fix / rephrase

collisions by the hashing function.
The probability of a collision is in relation to the dimension of
the hashing space and the number of distinct objects, see
https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
Collaborator

Use markdown format for links
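
To make the docstring's point about collisions concrete, the birthday-problem probability can be computed directly. The sketch below assumes the hash function spreads words uniformly over the available indices, which is an idealization; the function name `collision_probability` is hypothetical.

```python
def collision_probability(num_words, num_buckets):
    """Probability that at least two of `num_words` distinct words share
    an index when each is assigned uniformly at random to one of
    `num_buckets` indices (the birthday problem).
    """
    if num_words > num_buckets:
        return 1.0  # pigeonhole principle: a collision is certain
    p_no_collision = 1.0
    for i in range(num_words):
        # The (i + 1)-th word must avoid the i indices already taken.
        p_no_collision *= (num_buckets - i) / num_buckets
    return 1.0 - p_no_collision


# Classic birthday problem: 23 people, 365 days, about 0.507.
print(round(collision_probability(23, 365), 3))
# A vocabulary of 1000 distinct words in a hashing space of 10000 indices.
print(collision_probability(1000, 10000))
```

Under this idealized model, a collision becomes more likely than not once the number of distinct words reaches roughly the square root of the hashing space size, so the dimension usually needs to be much larger than the vocabulary to keep collisions rare.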

```python
keras.preprocessing.text.hashing_trick(text, n,
hash_function=None,
filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
```
Collaborator

These arguments should be documented, too. Also please fix the indent in the signature.

try:
import mmh3
except ImportError:
mmh3 = None
Collaborator

I don't think this is required, please remove mmh3-related code.

the number of distinct objects.
"""
if hash_function == 'hash':
hash_function = hash
Collaborator

It is safe to just make it default to hash, no need for the string conversion. Hash is not the name of a specific hash function anyway, just a Python utility.

aesuli (Contributor, Author) commented Jun 16, 2017

The check error is due to a timeout: "The job exceeded the maximum time limit for jobs, and has been terminated."

```python
keras.preprocessing.text.text_to_word_sequence(text,
filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")
```
Collaborator

One keyword argument per line (same below)

- __Arguments__:
- __text__: str.
- __n__: int. Size of vocabulary.
- __filters__: list (or concatenation) of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' , includes basic punctuation, tabs, and newlines.
Collaborator

Split this line into 3

Note that 'hash' is not a stable hashing function, so
it is not consistent across different runs, while 'md5'
is a stable hashing function.
- __filters__: list (or concatenation) of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' , includes basic punctuation, tabs, and newlines.
Collaborator

Split this line into 3

filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
lower=True,
split=' '):
"""One-hot encode a text into a list of word indexes of size n.
Collaborator

encodes

fchollet (Collaborator)

LGTM
