-
Notifications
You must be signed in to change notification settings - Fork 19.7k
Renamed one_hot function to hashing_trick, made hashing stable #6887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Forgot to commit edited test. |
|
The purpose of using Arbitrarily renaming functions in the public API breaks backwards compatibility and is not acceptable. Stability would be good to have: if you have a PR that introduces stability in this function at no performance cost, we will merge it. |
|
I'm aware of the efficiency aspect, yet a model learned using
|
keras/preprocessing/text.py
Outdated
| # Arguments | ||
| text: Input text (string). | ||
| n: Dimension of the hashing space. | ||
| hash_function: The hash function to use. Takes in input a string, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this argument accepted a string from a predefined set, such as "md5"? What would be some good (fast) functions here?
keras/preprocessing/text.py
Outdated
| lower=True, | ||
| split=' '): | ||
| """One-hot encode a text into a list of word indexes in a vocabulary of | ||
| size n (unicity of word to index mapping non-guaranteed). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One-line docstring description should be one line and end with a period.
keras/preprocessing/text.py
Outdated
| filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', | ||
| lower=True, | ||
| split=' '): | ||
| """Converts a text to a sequence of indices in a fixed-size hashing space |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One-line docstring description should end with a period.
keras/preprocessing/text.py
Outdated
| it is not consistent across different run. | ||
| If a model learned that uses the hashing trick is meant to be | ||
| saved and reused a stable hashing function must be given as | ||
| argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring contains a few typos, please fix / rephrase
keras/preprocessing/text.py
Outdated
| collisions by the hashing function. | ||
| The probability of a collision is in relation to the dimension of | ||
| the hashing space and the number of distinct objects, see | ||
| https://en.wikipedia.org/wiki/Birthday_problem#Probability_table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use markdown format for links
docs/templates/preprocessing/text.md
Outdated
| ```python | ||
| keras.preprocessing.text.hashing_trick(text, n, | ||
| hash_function=None, | ||
| filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These arguments should be documented, too. Also please fix the indent in the signature.
keras/preprocessing/text.py
Outdated
| try: | ||
| import mmh3 | ||
| except ImportError: | ||
| mmh3 = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is required, please remove mmh3-related code.
| the number of distinct objects. | ||
| """ | ||
| if hash_function == 'hash': | ||
| hash_function = hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is safe to just make it default to hash, no need for the string conversion. Hash is not the name of a specific hash function anyway, just a Python utility.
|
The check error is due to a timeout: "The job exceeded the maximum time limit for jobs, and has been terminated." |
docs/templates/preprocessing/text.md
Outdated
| ```python | ||
| keras.preprocessing.text.text_to_word_sequence(text, | ||
| filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ") | ||
| filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One keyword argument per line (same below)
docs/templates/preprocessing/text.md
Outdated
| - __Arguments__: | ||
| - __text__: str. | ||
| - __n__: int. Size of vocabulary. | ||
| - __filters__: list (or concatenation) of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' , includes basic punctuation, tabs, and newlines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split this line into 3
docs/templates/preprocessing/text.md
Outdated
| Note that 'hash' is not a stable hashing function, so | ||
| it is not consistent across different runs, while 'md5' | ||
| is a stable hashing function. | ||
| - __filters__: list (or concatenation) of characters to filter out, such as punctuation. Default: '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' , includes basic punctuation, tabs, and newlines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split this line into 3
keras/preprocessing/text.py
Outdated
| filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', | ||
| lower=True, | ||
| split=' '): | ||
| """One-hot encode a text into a list of word indexes of size n. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encodes
|
LGTM |
As pointed out in #2294 the one_hot function does not really implement a one hot encoding, but a hashing-based encoding, with possible collisions between words.
This method is a well-known indexing method, a.k.a. the hashing trick.
For this reason I propose to rename the function from one_hot to hashing_trick.
I also changed the implementation, as the original one_hot function used the hash() function that is not implemented as a stable hashing function (due to security concerns), and replaced it with md5() from hashlib, which is stable.
The use of md5 is not as straightforward and fast as using a dedicated hashing function, such as murmurhash, but using it as default avoids adding new dependencies to keras.
A custom hashing function can be eventually passed as a parameter to the hashing_trick function.