Initial research repository. The most recent code should be with NuPIC.
SimHash Distributed Document Encoder (SHaDDE)
A version that can encode documents is now available, using the same approach as the recent SimHash Scalar Encoder.
This encoder also uses Locality-Sensitive Hashing (LSH), specifically the SimHash algorithm, to encode document text into Sparse Distributed Representations (SDRs), ready to be fed into a Hierarchical Temporal Memory (HTM), like NuPIC by Numenta. LSH and SimHash come from the world of nearest-neighbor document similarity searching.
Document tokens are supplied with optional weighting values. We generate a SHA-3 hash digest for each word token (using SHAKE256 to get a variable-width digest output size). The hashes for a document are combined into a sparse SimHash. Documents with similar text will have similar encodings; dissimilar documents will have very different encodings from each other. Similarity here is defined as binary distance between strings; there is no linguistic or semantic understanding. You'll want http://cortical.io for that.
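
As an illustration of the per-token hashing step, here is a minimal Python sketch that uses the standard library's `hashlib.shake_256` to produce a digest of any requested width. The function name `token_bits` and the `n_bits` parameter are hypothetical and chosen for this example; they are not taken from the encoder's actual code.

```python
import hashlib


def token_bits(token, n_bits=64):
    """Hash a token with SHAKE256 and return n_bits bits as a list of 0/1.

    Hypothetical helper for illustration only.
    """
    # SHAKE256 is an extendable-output function, so we can ask for
    # exactly as many bytes as the requested bit width needs.
    n_bytes = (n_bits + 7) // 8
    digest = hashlib.shake_256(token.encode("utf-8")).digest(n_bytes)
    value = int.from_bytes(digest, "big")
    # Keep the lowest n_bits bits of the digest.
    return [(value >> i) & 1 for i in range(n_bits)]


if __name__ == "__main__":
    print(token_bits("hello", 16))
```

Because the digest width is a parameter, the same hash function can serve any encoder width without truncating or padding a fixed-size hash.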
How It Works
- Take a document, split up the words, and hash each.
- You can optionally add weights to the word hashes of your document.
- Combine the hashes into a sparse SimHash for the document (same method as the SimHash Scalar Encoder).
- Planned, not yet implemented: change each word from being a plain hash to being a SimHash created from the hashes of each letter in the word. This way, near-spellings will be considered similar ("eat" vs. "eats"), which they currently are not.
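
The hashing, weighting, and combining steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the encoder's actual implementation: the names `simhash_document`, `overlap`, `n_bits`, and `n_active` are hypothetical, and the sparsification rule (keep the `n_active` columns with the highest weighted sums) is an assumption about how the sparse SimHash is formed.

```python
import hashlib


def token_bits(token, n_bits):
    """SHAKE256 digest of a token as a list of n_bits 0/1 values."""
    n_bytes = (n_bits + 7) // 8
    digest = hashlib.shake_256(token.encode("utf-8")).digest(n_bytes)
    value = int.from_bytes(digest, "big")
    return [(value >> i) & 1 for i in range(n_bits)]


def simhash_document(tokens, n_bits=400, n_active=20, weights=None):
    """Combine token hashes into a sparse binary vector (hypothetical sketch)."""
    weights = weights or {}
    sums = [0.0] * n_bits
    for token in tokens:
        w = weights.get(token, 1.0)  # unweighted tokens count as 1.0
        for i, bit in enumerate(token_bits(token, n_bits)):
            # Classic SimHash vote: a 1 bit adds the weight, a 0 bit subtracts it.
            sums[i] += w if bit else -w
    # Assumed sparsification: activate only the n_active highest columns,
    # breaking ties by column index so the result is deterministic.
    ranked = sorted(range(n_bits), key=lambda i: (-sums[i], i))
    active = set(ranked[:n_active])
    return [1 if i in active else 0 for i in range(n_bits)]


def overlap(a, b):
    """Number of positions where both encodings are active."""
    return sum(x & y for x, y in zip(a, b))


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog".split()
    doc_b = "the quick brown fox jumps over the lazy cat".split()
    doc_c = "sparse distributed representations in temporal memory".split()
    a, b, c = (simhash_document(d) for d in (doc_a, doc_b, doc_c))
    print("a vs b:", overlap(a, b))
    print("a vs c:", overlap(a, c))
```

Passing something like `weights={"fox": 3.0}` makes that token's hash count three times in each column vote, which pulls the document encoding toward that token's bits.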