[POC] image based search (GSOC) #28009

NirobNabil · 2024-04-02T17:13:04Z

PR summary

This PR is mainly meant as a reference point for a test run I did for my GSOC proposal to implement image-based search in the Matplotlib docs.

I have tried updating the PR with cleaner code and more explanation. Note, however, that everything here is still only a proof of concept / test run.

In short, the process is as follows:

1. The image_search extension goes through all the .py files of the examples and fetches the generated images. (I tried to copy the exact way gen_gallery.py from sphinx-gallery iterates through the example files, to replicate the same behavior.)
2. It generates HTML for a thumbnail, linking to the example's page, for each example and attaches a thumbnail_id to each one to identify it uniquely. These thumbnails are stored in the /image_search/index.recommendations file.
3. All fetched images are fed through the SearchSetup class to generate the corresponding vector for each image.
4. The generated vectors are stored at _static/data.json along with the thumbnail_id of the image/example each vector represents.
5. galleries/image_search/index.rst contains the layout of the image search page. This page includes all generated thumbnails from index.recommendations, but all of them are kept hidden through a CSS rule.
6. The JS file at _static/image_search.js controls the search logic:
   - it loads the vectors from _static/data.json
   - it calculates the cosine similarity with the input image, following the PR by Arturo (for now the first image in data.json is treated as the input image)
   - it then grabs the HTML of the thumbnails corresponding to the top 5 most similar vectors (the thumbnails are already present in the HTML file but hidden) and appends them to the sphx-glr-imgsearchresult-container div.
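
To make the search step concrete, here is a minimal sketch of the ranking logic, written in Python for readability (in the PR itself this logic lives in _static/image_search.js). The record layout with "thumbnail_id" and "vector" keys and the function names are assumptions for illustration, not the actual format of data.json.

```python
import json
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_matches(query, entries, k=5):
    """Return the thumbnail_ids of the k entries most similar to query."""
    scored = [
        (cosine_similarity(query, entry["vector"]), entry["thumbnail_id"])
        for entry in entries
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [thumbnail_id for _, thumbnail_id in scored[:k]]


# Hypothetical layout for data.json: a list of {"thumbnail_id", "vector"} records.
entries = json.loads("""[
  {"thumbnail_id": "thumb-0", "vector": [1.0, 0.0, 0.0]},
  {"thumbnail_id": "thumb-1", "vector": [0.9, 0.1, 0.0]},
  {"thumbnail_id": "thumb-2", "vector": [0.0, 1.0, 0.0]},
  {"thumbnail_id": "thumb-3", "vector": [-1.0, 0.0, 0.0]}
]""")

# As in the test run, the first stored vector stands in for the input image.
query = entries[0]["vector"]
print(top_matches(query, entries, k=3))  # ['thumb-0', 'thumb-1', 'thumb-2']
```

The returned ids would then be used to look up the already-rendered (hidden) thumbnails and move them into the results container.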

Currently tested integrations:

- adding a simple Sphinx extension that fetches links to all generated example images at build time
- adding links with appropriate thumbnails to a standalone reST file that acts as the page for image search
- generating vector embeddings for each example image and using them to perform image similarity search
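
The first two integrations can be sketched roughly as below. This is only an illustration under stated assumptions: it does not use the Sphinx or sphinx-gallery APIs, and the function names, the imgsearch-thumb class, the imgsearch-N id scheme, and the way image and page paths are derived from the example path are all hypothetical.

```python
import html
import tempfile
from pathlib import Path


def collect_examples(examples_dir):
    """Return all .py example files under examples_dir in a stable order."""
    return sorted(Path(examples_dir).rglob("*.py"))


def thumbnail_html(thumbnail_id, image_src, page_href, title):
    """Build one thumbnail block that links to the example's page."""
    return (
        f'<div class="imgsearch-thumb" id="{html.escape(thumbnail_id, quote=True)}">'
        f'<a href="{html.escape(page_href, quote=True)}">'
        f'<img src="{html.escape(image_src, quote=True)}" '
        f'alt="{html.escape(title, quote=True)}">'
        f"</a></div>"
    )


def build_thumbnails(examples_dir):
    """Map a unique thumbnail_id to the HTML snippet of each example."""
    root = Path(examples_dir)
    blocks = {}
    for index, path in enumerate(collect_examples(root)):
        rel = path.relative_to(root).with_suffix("")
        thumbnail_id = f"imgsearch-{index}"
        # Hypothetical naming: image and page paths are derived from the example path.
        blocks[thumbnail_id] = thumbnail_html(
            thumbnail_id,
            image_src=f"_images/{rel.name}.png",
            page_href=f"{rel.as_posix()}.html",
            title=rel.name,
        )
    return blocks


# Small demonstration on a throwaway directory with two fake examples.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lines").mkdir()
    (root / "lines" / "simple_plot.py").write_text("# example\n")
    (root / "bar.py").write_text("# example\n")
    blocks = build_thumbnails(root)

for thumbnail_id, block in blocks.items():
    print(thumbnail_id, block)
```

Keeping the id-to-HTML mapping in one place means the same thumbnail_id can later be stored next to each embedding vector.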

Some modifications that I have in mind and will add to the test run gradually:

- the size of data.json can be drastically reduced by using a binary data format
- the torch model needs to be saved in ONNX format and then loaded at the frontend to calculate the vector of the input image
- support needs to be added for examples that have multiple images (currently only the first image is taken)
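
To illustrate the first point, here is a rough size comparison between JSON text and a packed float32 layout. It is a sketch under assumptions: the 512-dimensional vectors, the count of 100, and the two-integer header are made up for the example and are not the PR's actual data format. The thumbnail_ids would still need to be stored separately, for instance in a small JSON index.

```python
import json
import random
import struct


def pack_vectors(vectors):
    """Pack equal-length vectors as little-endian float32 after a (count, dim) header."""
    count = len(vectors)
    dim = len(vectors[0]) if vectors else 0
    header = struct.pack("<II", count, dim)
    body = b"".join(struct.pack(f"<{dim}f", *vector) for vector in vectors)
    return header + body


def unpack_vectors(blob):
    """Inverse of pack_vectors; values come back with float32 precision."""
    count, dim = struct.unpack_from("<II", blob, 0)
    offset = struct.calcsize("<II")
    vectors = []
    for _ in range(count):
        vectors.append(list(struct.unpack_from(f"<{dim}f", blob, offset)))
        offset += 4 * dim  # each float32 occupies 4 bytes
    return vectors


# Made-up data: 100 vectors of dimension 512 with values in [-1, 1].
random.seed(0)
vectors = [[random.uniform(-1, 1) for _ in range(512)] for _ in range(100)]

as_json = json.dumps(vectors).encode("utf-8")
as_binary = pack_vectors(vectors)
print(f"json: {len(as_json)} bytes, binary: {len(as_binary)} bytes")
```

The binary payload is exactly 4 bytes per value plus the 8-byte header, whereas JSON spends roughly 20 characters per value. On the frontend, such a buffer could be read through a Float32Array view, at the cost of float32 rounding.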

PR checklist

[N/A ] "closes #0000" is in the body of the PR description to link the related issue
new and changed code is tested
[N/A ] Plotting related features are demonstrated in an example
New Features and API Changes are noted with a directive and release note
Documentation complies with general and docstring guidelines

story645 · 2024-04-03T18:07:26Z

Can you please ~~add a more informative title? And~~ add in your description that this is a GSOC proof of concept/experiment?

WeatherGod · 2024-04-03T22:34:21Z

There are a bunch of extraneous (or just incomplete) files in this PR. For example, the zip file seems to be empty (and I would be highly against including such a file in an MR), along with the javascript file. The recommendations file seems to be a huge file of possible recommendations, which does not seem to be maintainable or verifiable.

story645 · 2024-04-03T22:43:03Z

@WeatherGod this PR started as part of a GSOC application for a visual search/detexify extension to the documentation build, but it looks likely that if we do this as GSOC this year it'll be a joint mentorship with sphinx-gallery and the code/PRs will likely go there. It'd likely be built off of/jumping off sphinx-gallery/sphinx-gallery#1125

Whether we want to enable that feature is a separate issue, but I think those pages can get created on build (like sphinx tags) and therefore don't need to be committed to the repo.

Basically yes I agree w/ you that this PR is unlikely to go in, but wanted to try and give you some more context on what's going on.

NirobNabil · 2024-04-04T08:59:34Z

I'm really sorry. A lot of the files were just for test runs and I haven't organized them properly yet. I'll add more documentation and try to explain what I did in a short summary here by tonight. Really sorry for the unorganized mess.

story645 · 2024-04-04T13:47:32Z

That's fine, but can you do your test run on your fork and just send me a link to the fork/update your application w/ it rather than opening up a pull request here since this is so experimental and if it gets accepted will likely live under sphinx-gallery.

ETA: our general practice for these things is that if it's not quite ready for a PR, you instead link folks to the branch on the fork.

NirobNabil · 2024-04-04T18:18:46Z

> That's fine, but can you do your test run on your fork and just send me a link to the fork/update your application w/ it rather than opening up a pull request here since this is so experimental and if it gets accepted will likely live under sphinx-gallery.

Understood. In that case, is the preferred communication channel gitter or should I email?

> update your application w/ it rather than opening up a pull request here

I checked the GSOC portal and it won't allow updating the proposal anymore; the submission deadline is over. But I'll update my fork with new test runs and documentation and send you a link over your preferred medium.

story645 · 2024-04-04T21:28:03Z

> Understood. In that case, is the preferred communication channel gitter or should I email?

gitter is fine.

NirobNabil · 2024-04-08T19:45:29Z

> There are a bunch of extraneous (or just incomplete) files in this PR.

> For example, the zip file seems to be empty (and I would be highly against including such a file in an MR), along with the javascript file.

> The recommendations file seems to be a huge file of possible recommendations, which does not seem to be maintainable or verifiable.

I had accidentally added these files, which are generated at build time, to the PR. I have modified the .gitignore so that auto-generated files are no longer included.


test run

a9eb68f

github-actions bot added the Documentation: build building the docs label Apr 2, 2024

story645 changed the title ~~test run~~ [POC] image based search (GSOC) Apr 3, 2024

NirobNabil added 3 commits April 6, 2024 01:15

tidy up image_search ext

7952565

add autogenerated files to gitignore and tidy up image_search.js

b3f7ec9

integrate image vector generation with image_search extension

d4d1520
