Michigan State University's CSE484 "Information Retrieval" final project for Content-Based Image Retrieval. Implemented in Python.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.



This project has been superceded by a C++ variant. For whatever reason, generating the clusters in Python is obscenely slow. While I never tested the nearest-neighbor operation, I imagine that it would be as well. I figure this is Python's fault, not FLANN's.

You can find the new project here.

Old Documentation for Legacy Sake

Using Py-CBIR

MSU Py-CBIR is modularized for easy use.

  • processFeatures.py
    • Processes a feature file into an in-memory array
  • buildIndex.py
    • Builds an Index from an in-memory array of points, saves it to file
  • cluster.py
    • Clusters all of the points from an Index and saves those clusters to file
  • bag.py
    • Loads a feature file and cluster file into memory, determines which cluster each feature belongs to.
    • Determines which image that feature belongs to, and writes the cluster number to file.
      • This cluster number is taken from the ordering of the clusters in the cluster file
      • This list of cluster numbers can be directly used as 'words' in a bag-of-words approach

Example Usage

# Load the first 10 documents' features into their own file, so that 
# processing goes faster
$ head -n 2158 features/esp.feature > features.first10documents
$ FEATURES='features.first10documents'

# Convert the MSU-supplied feature list into a NumPy list that loads much faster
$ python txt2npy.py -f $FEATURES

# Build the index
$ python buildIndex.py -f $FEATURES.npy -o $FEATURES.index

# Find 100 clusters.  Note that the project description requires 125000
$ python cluster.py -f $FEATURES.npy -i $FEATURES.index -o $FEATURES.clusters -n 100

# Find cluster associations ('words)
$ python bag.py -f $FEATURES.npy -c $FEATURES.clusters -d docs -l features/imglist.txt -s features/esp.size

# Take a look at the first Image, to see which clusters its points are associated with
$ more docs/00004fc7eb4bf00ba434c167890b99fa.txt