# Clip retrieval

This notebook shows how to use the published Hugging Face index with [clip-retrieval](https://github.com/rom1504/clip-retrieval).

In [1]:
!pip install dask huggingface_hub



clip-retrieval expects the following folder structure:

- index_folder/
  - image.index
  - metadata/
    - file1.parquet
    - file2.parquet
    - ...
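
The layout above can be checked with a small helper once the folder has been built. This is a minimal sketch: the function name `check_index_folder` is hypothetical, and the folder name `index_folder` is the one used in the cells of this notebook.

```python
from pathlib import Path


def check_index_folder(folder):
    """Return a list of layout problems found in `folder` (empty if none)."""
    root = Path(folder)
    problems = []
    # The faiss index is expected directly under the folder
    if not (root / "image.index").is_file():
        problems.append("missing image.index")
    # The metadata is expected as parquet files in a metadata/ subfolder
    metadata = root / "metadata"
    if not metadata.is_dir():
        problems.append("missing metadata/ directory")
    elif not any(metadata.glob("*.parquet")):
        problems.append("no .parquet files in metadata/")
    return problems


print(check_index_folder("index_folder"))
```

An empty list means the folder matches the structure listed above. The helper only checks file names and does not open the index or the parquet files.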

Following the [getting started](https://github.com/rom1504/clip-retrieval/blob/main/notebook/clip-retrieval-getting-started.ipynb) notebook, the metadata should contain the path to each image on the local machine. `clip-retrieval filter` can work with URLs as well.

In [3]:
import dask.dataframe as dd

ddf = dd.read_parquet("hf://datasets/fondant-ai/datacomp-small-clip/id_mapping")

# Reset the index so that it becomes a regular column
ddf = ddf.reset_index(drop=False)

In [4]:
# Duplicate the url column as image_path; without an image_path column we run into issues
ddf = ddf.assign(image_path=ddf['url'])

In [5]:
# Set the index: column i is the image position used in the faiss index
ddf = ddf.set_index("i")
# Keep only the columns needed for the metadata
ddf = ddf[['url', 'image_path']]

In [6]:
# Create folder structure
!mkdir -p index_folder

In [8]:
# Export to parquet
ddf.to_parquet("index_folder/metadata")

# Faiss index

In [9]:
# Download faiss index
import fsspec

index_path = "hf://datasets/fondant-ai/datacomp-small-clip/faiss"
with fsspec.open(index_path, "rb") as f:
    file_contents = f.read()

# Save as image.index
with open("index_folder/image.index", "wb") as out:
    out.write(file_contents)
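
The cell above reads the whole index into memory before writing it out. For a large index, copying in chunks may be gentler on memory. The following is a minimal sketch using only the standard library; the helper name `save_stream` and the chunk size are assumptions, not part of clip-retrieval or fsspec.

```python
import shutil


def save_stream(src, dest_path, chunk_size=16 * 1024 * 1024):
    """Copy a binary file-like object to dest_path in fixed-size chunks."""
    with open(dest_path, "wb") as out:
        # copyfileobj reads at most chunk_size bytes at a time
        shutil.copyfileobj(src, out, length=chunk_size)


# Intended use with the cell above (needs fsspec and network access):
# with fsspec.open(index_path, "rb") as f:
#     save_stream(f, "index_folder/image.index")
```

The resulting file is the same as in the cell above; only peak memory use differs.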

# Clip retrieval

In [7]:
!pip install clip-retrieval faiss-cpu



In [1]:
!clip-retrieval filter --query "cat" --output_folder "cat/" --indice_folder "index_folder" --num_results 5

Found 5 items with query 'cat'
The minimum distance is 1.54 and the maximum is 1.70
You may want to use these numbers to increase your --num_results parameter. Or use the --threshold parameter.
Copying the images in cat/
https://familia.willamowski.org/index.php?route=%2Ftree%2FWillamowski%2Fmedia-thumbnail&xref=M940&fact_id=d25530d141ec6c2a163d2457148bac82&w=100&h=100&fit=contain&mark=1&s=747d2a7b91064c3581ffaaf1f7c58d11
http://bright-media.brightmls.com/bright/images/0000/3021/8988/7272/302189887272_1440_1080_WM_4ZLKWGPvExnMZd-G.jpg
http://www.phonesreview.co.uk/wp-content/phoneimages/Google-Motorola-X-phone-release-claimed-to-shake-up-industry.jpg
https://www.allianceonline.co.uk/blog/wp-content/uploads/2018/11/pexels-photo-887827-1.jpeg
https://shop.shera.de/media/catalog/product/cache/a6bb0ee3d4b190da50cace3e771cdde4/S/I/SI160_4.jpg


`clip-retrieval filter ...` runs, but for some reason the results for the query `cat` are not cat pictures in this case.

The output folder also doesn't contain any images, since the images are not available on the local machine.

In [4]:
!ls -R cat

cat:
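
Since the metadata holds only URLs, one way to get local copies is to download the returned URLs first. The following is a rough sketch, not something clip-retrieval does itself: the helper `download_images`, its `fetch` parameter, the numbered file names, and the `.jpg` fallback extension are all assumptions made for illustration.

```python
import os
import urllib.request
from urllib.parse import urlparse

# Extensions kept as-is; anything else falls back to .jpg (assumption)
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def download_images(urls, out_dir, fetch=None, timeout=10):
    """Download each URL into out_dir; return (url, path or None) pairs."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()

    os.makedirs(out_dir, exist_ok=True)
    results = []
    for i, url in enumerate(urls):
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        if ext not in IMAGE_EXTENSIONS:
            ext = ".jpg"
        path = os.path.join(out_dir, f"{i:05d}{ext}")
        try:
            data = fetch(url)
        except Exception:
            # Dead links are expected in web-scraped datasets; record and move on
            results.append((url, None))
            continue
        with open(path, "wb") as f:
            f.write(data)
        results.append((url, path))
    return results
```

The returned paths could then replace the `image_path` values in the metadata, so that `clip-retrieval filter` has local files to copy.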


# Host index

Tunneling works, but the web UI doesn't display any images, probably because they are not available on the local machine.

In [2]:
%%bash
echo '{"example_index": "index_folder"}' > indices_paths.json
npm install -g localtunnel


changed 22 packages in 2s

3 packages are looking for funding
  run `npm fund` for details
starting boot of clip back
warming up with batch size 1 on cpu
done warming up in 3.9866015911102295s
  return self.fget.__get__(instance, owner)()
tokenizer_config.json: 100%|███████████████████| 399/399 [00:00<00:00, 1.49MB/s]
sentencepiece.bpe.model: 100%|██████████████| 5.07M/5.07M [00:00<00:00, 121MB/s]
tokenizer.json: 100%|██████████████████████| 9.08M/9.08M [00:00<00:00, 15.4MB/s]
special_tokens_map.json: 100%|██████████████████| 239/239 [00:00<00:00, 947kB/s]
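
The `indices_paths.json` file written with `echo` above maps an index name to its folder. The same file can be produced from Python, which avoids shell quoting issues when there are several indices. A minimal sketch; the helper name `write_indices_paths` is hypothetical.

```python
import json


def write_indices_paths(mapping, path="indices_paths.json"):
    """Write an {index name: index folder} mapping as JSON to `path`."""
    with open(path, "w") as f:
        json.dump(mapping, f, indent=2)
```

Calling `write_indices_paths({"example_index": "index_folder"})` gives a file with the same content as the `echo` command in the cell above.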


In [None]:
from threading import Thread
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context

def app():
    # Start the clip-retrieval backend; it blocks, so it runs in its own thread
    !clip-retrieval back --port 1234 --indices-paths indices_paths.json

if __name__ == '__main__':
    t1 = Thread(target=app)
    t1.start()
    # Expose local port 1234 through localtunnel
    !lt --port 1234

your url is: https://salty-pears-jump.loca.lt
