[POC] image based search (GSOC) #28009

Draft
wants to merge 4 commits into
base: main


@NirobNabil NirobNabil commented Apr 2, 2024

PR summary

This PR is mainly meant as a reference point for a test run I did for my GSOC proposal to implement image-based search in the matplotlib docs.

I have tried updating the PR with cleaner code and more explanation. Note, however, that everything here is still only a proof of concept / test run.

In short, the process works like this:

  • The image_search extension goes through all the .py example files and fetches the generated images. (I tried to copy exactly how gen_gallery.py from sphinx-gallery iterates through example files, to replicate the same behavior.)
  • For each example, it generates thumbnail HTML that links to the example's page and attaches a thumbnail_id to uniquely identify it. These thumbnails are stored in the /image_search/index.recommendations file.
  • All fetched images are fed through the SearchSetup class, which generates the corresponding vector for each image.
  • The generated vectors are stored in _static/data.json along with the thumbnail_id of the image/example each vector represents.
  • galleries/image_search/index.rst contains the layout of the image search page. This page includes all generated thumbnails from index.recommendations, but they are kept hidden through a CSS rule.
  • The JS file at _static/image_search.js controls the search logic:
    • It loads the vectors from _static/data.json.
    • It calculates cosine similarity with the input image, following the PR by Arturo (for now, the first image in data.json is treated as the input image).
    • It then grabs the HTML of the thumbnails corresponding to the 5 most similar vectors (the thumbnails are already present in the HTML file but hidden) and appends them to the sphx-glr-imgsearchresult-container div.
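The search step described above can be sketched in Python as follows. This is only an illustration of the cosine-similarity ranking, not the code in _static/image_search.js; the record layout (a list of objects with "thumbnail_id" and "vector" keys) and the function names are assumptions made for the example.

```python
import json
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_matches(query, records, k=5):
    """Return the thumbnail_ids of the k records most similar to query."""
    scored = [
        (cosine_similarity(query, record["vector"]), record["thumbnail_id"])
        for record in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [thumbnail_id for _, thumbnail_id in scored[:k]]


# Hypothetical data.json payload; the real file layout may differ.
records = json.loads("""[
  {"thumbnail_id": "thumb-0", "vector": [1.0, 0.0, 0.0]},
  {"thumbnail_id": "thumb-1", "vector": [0.9, 0.1, 0.0]},
  {"thumbnail_id": "thumb-2", "vector": [0.0, 1.0, 0.0]},
  {"thumbnail_id": "thumb-3", "vector": [-1.0, 0.0, 0.0]}
]""")

# As in the test run, the first stored vector stands in for the input image.
query = records[0]["vector"]
best = top_matches(query, records, k=2)
```

Because the query is itself one of the stored vectors, it always ranks first; a real search would embed the uploaded image instead.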

Currently tested integrations:

  • Adding a simple Sphinx extension that fetches links to all generated example images at build time
  • Adding links with appropriate thumbnails to a standalone reST file that acts as the image search page
  • Generating vector embeddings for each example image and using them to perform image similarity search

Some modifications that I have in mind and will be adding to the test run gradually:

  • The size of data.json can be drastically reduced by using a binary data format.
  • The torch model needs to be saved in ONNX format and then loaded in the frontend to calculate the vector of the input image.
  • Support needs to be added for examples that have multiple images (currently only the first image is used).
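As a rough illustration of the first point, the sketch below packs vectors as little-endian float32 values and compares the result with the JSON encoding. The header layout (vector count, then dimension) and the function names are assumptions for this example, and the thumbnail_ids would still need to be stored separately. Note that float32 keeps only about seven significant digits, which is normally enough for similarity search.

```python
import json
import struct


def pack_vectors(vectors):
    """Pack equal-length float vectors as float32 behind an 8-byte header."""
    count = len(vectors)
    dim = len(vectors[0]) if vectors else 0
    header = struct.pack("<II", count, dim)  # number of vectors, dimension
    body = b"".join(struct.pack(f"<{dim}f", *vector) for vector in vectors)
    return header + body


def unpack_vectors(blob):
    """Inverse of pack_vectors; returns a list of lists of floats."""
    count, dim = struct.unpack_from("<II", blob, 0)
    offset = 8
    vectors = []
    for _ in range(count):
        vectors.append(list(struct.unpack_from(f"<{dim}f", blob, offset)))
        offset += 4 * dim
    return vectors


# Made-up embeddings: 10 vectors of dimension 512.
vectors = [[0.123456789 * (i + j) for j in range(512)] for i in range(10)]
as_json = json.dumps(vectors).encode("utf-8")
as_binary = pack_vectors(vectors)
print(len(as_json), len(as_binary))
```

In the browser, a file in this layout could be read with an ArrayBuffer and a Float32Array view, which avoids parsing a large JSON document.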

PR checklist

@github-actions github-actions bot added the Documentation: build building the docs label Apr 2, 2024
@story645
Member

story645 commented Apr 3, 2024

Can you please add a more informative title? And add in your description that this is a GSOC proof of concept/experiment?

@WeatherGod
Member

There are a bunch of extraneous (or just incomplete) files in this PR. For example, the zip file seems to be empty (and I would be highly against including such a file in an MR), along with the javascript file. The recommendations file seems to be a huge file of possible recommendations, which does not seem to be maintainable or verifiable.

@story645 story645 changed the title test run [POC] image based search (GSOC) Apr 3, 2024
@story645
Member

story645 commented Apr 3, 2024

@WeatherGod this PR started as part of a GSOC application for a visual search/detexify extension to the documentation build, but it looks likely that if we do this as GSOC this year it'll be a joint mentorship with sphinx-gallery and the code/PRs will likely go there. It'd likely be built off of/jumping off sphinx-gallery/sphinx-gallery#1125

Whether we want to enable that feature is a separate issue, but I think those pages can get created on build (like sphinx tags) and therefore don't need to be committed to the repo.

Basically yes I agree w/ you that this PR is unlikely to go in, but wanted to try and give you some more context on what's going on.

@NirobNabil
Author

I'm really sorry, a lot of the files were mostly for test runs and I haven't organized them properly yet. I'll add more documentation and try to explain what I did in a short summary here by tonight. Really sorry for the unorganized mess.

@story645
Member

story645 commented Apr 4, 2024

That's fine, but can you do your test run on your fork and just send me a link to the fork/update your application w/ it rather than opening up a pull request here since this is so experimental and if it gets accepted will likely live under sphinx-gallery.

ETA: our general practice for these things is that if it's not quite ready for a PR, you link folks to the branch on the fork instead.

@NirobNabil
Author

That's fine, but can you do your test run on your fork and just send me a link to the fork/update your application w/ it rather than opening up a pull request here since this is so experimental and if it gets accepted will likely live under sphinx-gallery.

Understood. In that case, is the preferred communication channel gitter or should i email?

update your application w/ it rather than opening up a pull request here

I checked the GSOC portal and it won't allow updating the proposal anymore; the submission deadline is over. But I'll update my fork with new test runs and documentation and send you the link over your preferred medium.

@story645
Member

story645 commented Apr 4, 2024

Understood. In that case, is the preferred communication channel gitter or should i email?

gitter is fine.

@NirobNabil
Author

There are a bunch of extraneous (or just incomplete) files in this PR.

For example, the zip file seems to be empty (and I would be highly against including such a file in an MR), along with the javascript file.

The recommendations file seems to be a huge file of possible recommendations, which does not seem to be maintainable or verifiable.

I had accidentally added these files, which are generated at build time, to the PR. I have modified the gitignore so that auto-generated files are no longer included.

I have tried updating the PR with cleaner code and more explanation. Note, however, that everything here is still only a proof of concept / test run.

In short, the process works like this:

  • The image_search extension goes through all the .py example files and fetches the generated images. (I tried to copy exactly how gen_gallery.py from sphinx-gallery iterates through example files, to replicate the same behavior.)
  • For each example, it generates thumbnail HTML that links to the example's page and attaches a thumbnail_id to uniquely identify it. These thumbnails are stored in the /image_search/index.recommendations file.
  • All fetched images are fed through the SearchSetup class, which generates the corresponding vector for each image.
  • The generated vectors are stored in _static/data.json along with the thumbnail_id of the image/example each vector represents.
  • galleries/image_search/index.rst contains the layout of the image search page. This page includes all generated thumbnails from index.recommendations, but they are kept hidden through a CSS rule.
  • The JS file at _static/image_search.js controls the search logic:
    • It loads the vectors from _static/data.json.
    • It calculates cosine similarity with the input image, following the PR by Arturo (for now, the first image in data.json is treated as the input image).
    • It then grabs the HTML of the thumbnails corresponding to the 5 most similar vectors (the thumbnails are already present in the HTML file but hidden) and appends them to the sphx-glr-imgsearchresult-container div.
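The build-time half of this process (thumbnail HTML plus matching vector records) might look roughly like the sketch below. Everything here is hypothetical: the CSS class name, the id format, the example paths, and the placeholder_embedding function, which merely stands in for whatever the SearchSetup class actually computes.

```python
import html
import json


def _attr(value):
    """Escape a string for use inside a double-quoted HTML attribute."""
    return html.escape(value, quote=True)


def thumbnail_html(thumbnail_id, title, page_url, image_url):
    """Return one thumbnail block that links to an example page.

    The class "imgsearch-thumb" is hypothetical; a CSS rule on it would
    keep the block hidden until the search script reveals it.
    """
    return (
        f'<div class="imgsearch-thumb" id="{_attr(thumbnail_id)}">'
        f'<a href="{_attr(page_url)}">'
        f'<img src="{_attr(image_url)}" alt="{_attr(title)}">'
        f"</a></div>"
    )


def build_index(examples, embed):
    """Return (thumbnail HTML, JSON records) that share the same thumbnail_ids."""
    thumbnails, records = [], []
    for number, example in enumerate(examples):
        thumbnail_id = f"imgsearch-thumb-{number}"
        thumbnails.append(
            thumbnail_html(thumbnail_id, example["title"], example["page"], example["image"])
        )
        records.append({"thumbnail_id": thumbnail_id, "vector": embed(example["image"])})
    return "\n".join(thumbnails), json.dumps(records)


def placeholder_embedding(image_path):
    """Stand-in for a real image embedding; not a meaningful feature vector."""
    return [float(len(image_path)), 1.0]


# Made-up examples; real titles and paths would come from the gallery build.
examples = [
    {"title": "Simple plot", "page": "simple_plot.html", "image": "simple_plot.png"},
    {"title": "Bars & lines", "page": "bars.html", "image": "bars.png"},
]
thumbnails_html, data_json = build_index(examples, placeholder_embedding)
```

Generating both outputs in one loop keeps each thumbnail_id in the HTML consistent with its entry in the vector file, which the search script relies on when it looks up results.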

Some modifications that I have in mind and will be adding to the test run gradually:

  • The size of data.json can be drastically reduced by using a binary data format.
  • The torch model needs to be saved in ONNX format and then loaded in the frontend to calculate the vector of the input image.
  • Support needs to be added for examples that have multiple images (currently only the first image is used).
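For the multiple-image case, one possible starting point is to group the generated image files by example before embedding them, so that several vectors can point at the same thumbnail_id. The sketch below assumes file names of the form sphx_glr_<example name>_<three-digit index>.png; that naming scheme and the function name are assumptions for illustration and should be checked against the actual build output.

```python
import re
from collections import defaultdict

# Assumed naming scheme: sphx_glr_<example name>_<3-digit index>.png
IMAGE_NAME = re.compile(r"^sphx_glr_(?P<example>.+)_(?P<index>\d{3})\.png$")


def group_images(file_names):
    """Map each example name to its image files, ordered by image index."""
    grouped = defaultdict(list)
    for name in file_names:
        match = IMAGE_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed scheme
        grouped[match["example"]].append((int(match["index"]), name))
    return {
        example: [name for _, name in sorted(items)]
        for example, items in grouped.items()
    }


# Made-up file listing for illustration.
file_names = [
    "sphx_glr_subplots_demo_002.png",
    "sphx_glr_subplots_demo_001.png",
    "sphx_glr_simple_plot_001.png",
    "sphx_glr_simple_plot_thumb.png",
]
images_by_example = group_images(file_names)
```

Each image in a group could then get its own vector while sharing the example's thumbnail_id, so a match on any figure leads to the same example page.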
