Implementing the fruit fly's similarity hashing

This is an implementation of the random indexing method described by Dasgupta et al. (2017) in "A neural algorithm for a fundamental computing problem". The system takes high-dimensional vectors as input, applies random projections to each vector, and returns a hash for that vector.
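As a rough illustration of the idea, here is a minimal sketch of a fly-style hash: each Kenyon cell sums a small random sample of input dimensions, and only the most active cells are kept (winner-take-all). The function and parameter names (make_projection, fly_hash, n_kenyon, n_connections, top_percent) are illustrative assumptions, not the names used in this repository's code.

```python
import random

def make_projection(n_input, n_kenyon, n_connections, seed=0):
    """For each Kenyon cell, pick the input dimensions it samples from."""
    rng = random.Random(seed)
    return [rng.sample(range(n_input), n_connections) for _ in range(n_kenyon)]

def fly_hash(vector, projection, top_percent):
    """Return a binary hash: 1 for the most active Kenyon cells, 0 elsewhere."""
    # Each Kenyon cell sums the values of the inputs it is connected to.
    activations = [sum(vector[i] for i in inputs) for inputs in projection]
    # Winner-take-all: keep only the top_percent most active cells.
    n_keep = max(1, len(activations) * top_percent // 100)
    order = sorted(range(len(activations)),
                   key=lambda k: activations[k], reverse=True)
    hashed = [0] * len(activations)
    for k in order[:n_keep]:
        hashed[k] = 1
    return hashed

# Toy input: a 20-dimensional vector hashed onto 40 Kenyon cells.
rng = random.Random(1)
vector = [rng.random() for _ in range(20)]
projection = make_projection(n_input=20, n_kenyon=40, n_connections=6)
h = fly_hash(vector, projection, top_percent=5)
```

With 40 Kenyon cells and 5% retained, the resulting hash has exactly two active cells. Similar input vectors tend to activate the same cells, which is what makes the hash usable for similarity search.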

Before you run any analysis, make sure you understand what the algorithm does to the input, referring to the paper.

Description of the data

Your data/ directory contains two small semantic spaces:

  • one from the British National Corpus, containing lemmas expressed in 4000 dimensions.
  • one from a subset of Wikipedia, containing words expressed in 1000 dimensions.

The cells in each semantic space are normalised co-occurrence frequencies, with no additional weighting such as PMI.

The directory also contains test pairs from the MEN similarity dataset, both in lemmatised and natural forms.

Finally, it contains a file generic_pod.csv, which is a compilation of around 2400 distributional web page signatures, in PeARS format. The web pages span various topics: Harry Potter, Star Wars, the Black Panther film, the Black Panther social rights movement, search engines and various small topics involving architecture.

Running the fruit fly code

To run the code, you need to enter the corpus you would like to test on, and the number of Kenyon cells you are going to use for the experiment. For instance, for the BNC space:

python3 bnc 8000 6 5

Or for the Wikipedia space:

python3 wiki 4000 4 10

The program returns the Spearman correlation with the MEN similarity data, as calculated a) from the raw frequency space; and b) after running the fly's random projections.
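To make the evaluation concrete, the sketch below computes a Spearman correlation between cosine similarities from a toy space and a set of similarity ratings. The vectors and ratings are invented for illustration and are not taken from the BNC, Wikipedia or MEN data. The helper names (cosine, ranks, spearman) are assumptions and may not match what the repository does internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Invented toy space and ratings (not real MEN data).
space = {"cat": [1, 0, 1], "dog": [1, 0, 0.8], "car": [0.1, 1, 0.3]}
pairs = [("cat", "dog", 0.9), ("cat", "car", 0.2), ("dog", "car", 0.1)]

system = [cosine(space[w1], space[w2]) for w1, w2, _ in pairs]
gold = [rating for _, _, rating in pairs]
rho = spearman(system, gold)
```

The same comparison can be run twice, once on the raw vectors and once on their hashes, to see how much of the similarity structure survives the projection.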

Tuning parameters

First, get a sense of which parameters give the best results on the MEN dataset, for both the BNC and the Wikipedia data. If you know how to code, you can run a random parameter search automatically. If not, just try different values manually and write down what you observe.
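A random parameter search could look like the sketch below. The evaluate argument is assumed to be a function you write yourself, which runs one experiment and returns its Spearman correlation. The parameter names and candidate values are hypothetical, and toy_evaluate is an arbitrary placeholder used only so that the example runs on its own.

```python
import random

def random_search(evaluate, search_space, n_trials=20, seed=0):
    """Try n_trials random parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per parameter, independently and uniformly.
        params = {name: rng.choice(values)
                  for name, values in search_space.items()}
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical parameter names and candidate values.
search_space = {
    "kenyon_cells": [1000, 2000, 4000, 8000],
    "projection_size": [2, 4, 6, 8],
    "top_percent": [1, 5, 10, 20],
}

def toy_evaluate(kenyon_cells, projection_size, top_percent):
    # Placeholder score with no empirical meaning; replace it with a
    # function that runs the experiment and returns the correlation.
    return -abs(projection_size - 6) - abs(top_percent - 5) / 10

best_params, best_score = random_search(toy_evaluate, search_space)
```

Fixing the seed makes the search reproducible, which helps when comparing the BNC and Wikipedia runs.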

Analysing the results

Compare results for the BNC and the Wikipedia data. You should see that results on the BNC are much better than on Wikipedia. Why is that?

To help you with the analysis, you can print a verbose version of the random projections with the -v flag. E.g.:

python3 bnc 8000 6 1 -v

This will print out the projection neurons that are most responsible for the activation in the Kenyon layer.
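To get a feel for what "most responsible" can mean here, the sketch below ranks the inputs connected to a single Kenyon cell by how much each contributes to its activation. It illustrates the idea only and does not reproduce the actual output of the -v flag. The labels, vector and connections are invented.

```python
def top_contributors(vector, inputs, labels, n=3):
    """Rank the inputs feeding one Kenyon cell by their contribution."""
    # The cell's activation is the sum of these values, so the largest
    # values are the ones that drive the activation most.
    contributions = [(labels[i], vector[i]) for i in inputs]
    return sorted(contributions, key=lambda pair: pair[1], reverse=True)[:n]

# Invented example: four context dimensions, one cell wired to three of them.
labels = ["animal", "vehicle", "food", "music"]
vector = [0.5, 0.0, 0.3, 0.2]
inputs = [0, 2, 3]

top = top_contributors(vector, inputs, labels, n=2)
# top == [("animal", 0.5), ("food", 0.3)]
```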

Using the fly for document similarity search

You can test how well the fly's algorithm retrieves web pages similar to a given one, crucially from their dimensionality-reduced representations, by typing:

python3 data/generic_pod.csv 2000 6 5
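Once every page has a binary hash, similarity search reduces to comparing hashes. The sketch below ranks documents by Hamming distance to a query hash. The page names and hashes are invented, and the repository's own search code may use a different measure.

```python
def hamming(h1, h2):
    """Number of positions at which two binary hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))

def nearest(query_hash, hashed_docs, k=2):
    """Return the names of the k documents closest to the query hash."""
    ranked = sorted(hashed_docs.items(),
                    key=lambda item: hamming(query_hash, item[1]))
    return [name for name, _ in ranked[:k]]

# Invented hashes for three hypothetical pages.
hashed_docs = {
    "page_a": [1, 0, 1, 0, 0, 1],
    "page_b": [1, 0, 1, 0, 1, 0],
    "page_c": [0, 1, 0, 1, 0, 0],
}

query = [1, 0, 1, 0, 0, 1]
results = nearest(query, hashed_docs, k=2)
# results == ["page_a", "page_b"]
```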


Unsupervised learning tutorial. Technique: random projections.


