simhash_htm_encoders/document at main · brev/simhash_htm_encoders

README.md

Initial research repository. The most recent code should be with NuPIC.

SimHash Distributed Document Encoder (SHaDDE)

Using the same approach as the recent SimHash Scalar Encoder, there is now a version available which can encode documents.

This also uses a Locality-Sensitive Hashing (LSH) approach to encode semantic document text data into Sparse Distributed Representations, ready to be fed into a Hierarchical Temporal Memory, like NuPIC by Numenta. It uses the SimHash algorithm to accomplish this. LSH and SimHash come from the world of nearest-neighbor document similarity searching.

Document tokens are supplied with optional weighting values. We generate a SHA-3 hash digest for each word token (using SHAKE256 to get a variable-width digest output size). The hashes for a document are combined into a sparse SimHash. Documents that are similar will have similar encodings, and dissimilar documents will have very different encodings from each other. Similarity here is defined as binary distance between strings; there is no linguistic semantic understanding. You'll want http://cortical.io for that.
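As a rough illustration of the per-token hashing step, the sketch below uses the standard library's `hashlib.shake_256` to produce a digest of a requested width. The helper name `token_bits` and the default width of 64 bits are assumptions made for this example, not names or settings taken from the encoder's source.

```python
import hashlib


def token_bits(token, n_bits=64):
    """Hash a word token with SHAKE256 and return n_bits bits as a list of 0/1.

    Hypothetical helper for illustration only.
    """
    # SHAKE256 is an extendable-output function, so we can ask for
    # exactly as many bytes as the encoding width requires.
    n_bytes = (n_bits + 7) // 8
    digest = hashlib.shake_256(token.encode("utf-8")).digest(n_bytes)
    # Drop any surplus low bits when n_bits is not a multiple of 8.
    value = int.from_bytes(digest, "big") >> (n_bytes * 8 - n_bits)
    # Unpack the integer into a list of bits, most significant first.
    return [(value >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]
```

The same token always yields the same bit pattern, and the width can be matched to whatever encoding size the downstream HTM expects.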

How It Works

Take a document, split up the words, and hash each.
You can optionally add weights to the word hashes of your document.
Combine the hashes into a sparse SimHash for the document (same method as SimHash Scalar encoder).
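The three steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the encoder's actual code: the function names (`token_bits`, `encode_document`, `overlap`), the sizes (400 bits, 20 active), and the sparsification rule (keep the bit positions with the largest weighted sums) are all choices made for this example.

```python
import hashlib


def token_bits(token, n_bits):
    """SHAKE256 digest of a token as a list of n_bits 0/1 values (hypothetical helper)."""
    n_bytes = (n_bits + 7) // 8
    digest = hashlib.shake_256(token.encode("utf-8")).digest(n_bytes)
    value = int.from_bytes(digest, "big") >> (n_bytes * 8 - n_bits)
    return [(value >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]


def encode_document(weighted_tokens, n_bits=400, n_active=20):
    """Combine weighted token hashes into a sparse binary encoding.

    weighted_tokens maps each word token to its weight (use 1.0 for unweighted).
    """
    # Classic SimHash accumulation: a 1 bit adds the token's weight
    # to that position, a 0 bit subtracts it.
    sums = [0.0] * n_bits
    for token, weight in weighted_tokens.items():
        for i, bit in enumerate(token_bits(token, n_bits)):
            sums[i] += weight if bit else -weight
    # Assumed sparsification: activate only the n_active positions with
    # the largest sums (ties resolved in favor of the lower index).
    strongest = sorted(range(n_bits), key=lambda i: sums[i], reverse=True)[:n_active]
    encoding = [0] * n_bits
    for i in strongest:
        encoding[i] = 1
    return encoding


def overlap(a, b):
    """Count the positions that are active in both encodings."""
    return sum(x & y for x, y in zip(a, b))


if __name__ == "__main__":
    doc_a = {w: 1.0 for w in "the quick brown fox jumps over".split()}
    doc_b = {w: 1.0 for w in "the quick brown fox jumps under".split()}
    print(overlap(encode_document(doc_a), encode_document(doc_b)))
```

Documents that share most of their tokens should end up with a large overlap of active bits, while documents with no tokens in common should overlap only by chance.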

Source Code

Pull Request against Old NuPIC.
Research Repo with my test runners and original research code.

Next Steps

Change each word from being a single hash to being a SimHash created from the hashes of each letter in the word. This way, near-spellings ("eat" vs. "eats") will be considered similar, which they currently are not.
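One possible reading of this idea is sketched below. It is purely illustrative: the names `char_bits`, `word_simhash`, and `hamming`, the 256-bit width, and the majority-vote rule are assumptions, and nothing here is taken from the repository.

```python
import hashlib


def char_bits(ch, n_bits):
    """SHAKE256 digest of a single letter as a list of n_bits 0/1 values."""
    n_bytes = (n_bits + 7) // 8
    digest = hashlib.shake_256(ch.encode("utf-8")).digest(n_bytes)
    value = int.from_bytes(digest, "big") >> (n_bytes * 8 - n_bits)
    return [(value >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]


def word_simhash(word, n_bits=256):
    """Dense SimHash of a word built from the hashes of its letters."""
    sums = [0] * n_bits
    for ch in word:
        for i, bit in enumerate(char_bits(ch, n_bits)):
            sums[i] += 1 if bit else -1
    # Majority vote per position; a tied sum of zero becomes a 0 bit.
    return [1 if s > 0 else 0 for s in sums]


def hamming(a, b):
    """Number of positions at which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(hamming(word_simhash("eat"), word_simhash("eats")))
    print(hamming(word_simhash("eat"), word_simhash("dog")))
```

Note that a plain per-letter SimHash ignores letter order, so anagrams such as "eat" and "tea" receive identical hashes in this sketch. Hashing letter pairs or including the letter position would be one way to address that, if it matters for the use case.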
