# Downloading images from LAION-5B

LAION-5B is a dataset of about 5.85 billion image/text pairs.

It has been indexed with CLIP embeddings, so it can be queried by text or by image.

Goal:
* Download 100 images for each of the classes in FoodVision.
* See how good these images are to use as a dataset for future models.
* Increase the number of images downloaded if they are good quality.

Resources:
* See the code on GitHub: https://github.com/rom1504/clip-retrieval 
* See the blog post: https://laion.ai/blog/h14_clip_retrieval/ 
* See the demo: https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn.laion.ai&index=laion5B-H-14&useMclip=false&query=salt+shaker+photo 
* Example notebook: https://github.com/rom1504/clip-retrieval/blob/main/notebook/clip-client-query-api.ipynb 

## Try downloading images for a single class

In [12]:
from clip_retrieval.clip_client import ClipClient, Modality

client = ClipClient(url="https://knn.laion.ai/knn-service", 
                    indice_name="laion5B-H-14", # H = Huge, L = Large (huge is bigger than large)
                    num_images=100)

In [13]:
results = client.query(text="salt shaker photo")
results[0]

{'url': 'https://diabetesdietblogdotcom.files.wordpress.com/2016/03/salt_shaker_on_white_background.jpg?w=610',
 'caption': 'Salt_shaker_on_white_background',
 'id': 3637752355,
 'similarity': 0.40506356954574585}
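
Each result is a dictionary with `url`, `caption`, `id` and `similarity` keys. Before downloading, it may help to drop duplicate URLs and low-similarity matches. The following is a minimal sketch only: `filter_results` and the `0.3` threshold are hypothetical and not part of `clip_retrieval`, so tune the threshold by inspecting real results.

```python
def filter_results(results, min_similarity=0.3):
    """Keep results at or above a similarity threshold, dropping duplicate URLs.

    Hypothetical helper; the default threshold is an assumption, not a recommended value.
    """
    seen_urls = set()
    kept = []
    for result in results:
        url = result.get("url")
        # Skip results with no URL or a URL we have already kept
        if url is None or url in seen_urls:
            continue
        # Skip weak matches
        if result.get("similarity", 0.0) < min_similarity:
            continue
        seen_urls.add(url)
        kept.append(result)
    return kept


# Toy data in the same shape as the query results above
example_results = [
    {"url": "https://example.com/a.jpg", "caption": "salt shaker", "id": 1, "similarity": 0.41},
    {"url": "https://example.com/a.jpg", "caption": "salt shaker (duplicate)", "id": 2, "similarity": 0.40},
    {"url": "https://example.com/b.jpg", "caption": "pepper mill", "id": 3, "similarity": 0.22},
]
filtered = filter_results(example_results, min_similarity=0.3)
print(len(filtered))
```

In the notebook this could be applied as `results = filter_results(results)` before writing the JSON file.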

In [14]:
# Save results to JSON file
import json
with open("clip_retrieval_results.json", "w") as f:
    json.dump(results, f)

## Download images

Download the resulting images with `img2dataset` (GitHub: https://github.com/rom1504/img2dataset).

In [16]:
!img2dataset "clip_retrieval_results.json" --input_format="json" --caption_col "caption" --output_folder="clip_retrieval_results" --resize_mode="no" --output_format="files"

Starting the downloading of this file
Sharding file number 1 of 1 called /home/daniel/code/nutrify/foodvision/notebooks/clip_retrieval_results.json
0it [00:00, ?it/s]File sharded in 1 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
1it [00:09,  9.72s/it]
worker  - success: 0.881 - failed to download: 0.119 - failed to resize: 0.000 - images per sec: 5 - count: 42
total   - success: 0.881 - failed to download: 0.119 - failed to resize: 0.000 - images per sec: 5 - count: 42


In [15]:
# !rm -rf clip_retrieval_results/*

In [5]:
# Count the entries in the clip_retrieval_results folder
# Note: `ls -l` also prints a "total" line, so this is not an exact image count
!ls -l clip_retrieval_results | wc -l

4
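
The `ls -l | wc -l` count includes the "total" line and any subfolders or metadata files, so it does not equal the number of images. As far as I understand, `img2dataset` with `--output_format="files"` writes images into shard subfolders alongside metadata files, so a recursive count of image files is a closer check. A minimal sketch, where `count_images` and the list of suffixes are assumptions for illustration:

```python
from pathlib import Path

# Suffixes treated as images (an assumption; adjust to the formats actually saved)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def count_images(folder):
    """Recursively count files under `folder` whose suffix looks like an image."""
    folder = Path(folder)
    return sum(
        1
        for path in folder.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Only count if the download folder from the cell above exists
if Path("clip_retrieval_results").exists():
    print(count_images("clip_retrieval_results"))
```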


## Get class names from labels

In [17]:
import sys
sys.path.append("..")

import pandas as pd
import numpy as np

from pathlib import Path

In [18]:
# Get config
from configs.default_config import config

args = config
print(args)

# Connect to GCP
from utils.gcp_utils import set_gcp_credentials, test_gcp_connection
set_gcp_credentials(path_to_key="../utils/google-storage-key.json")
test_gcp_connection()

import wandb

# Initialize a new run
from utils.wandb_utils import wandb_load_artifact, wandb_download_and_load_labels

run = wandb.init(project=args.wandb_project, 
                 job_type=args.wandb_job_type,
                 tags=['internet_image_download'],
                 notes="download ~100x images per class using clip-retrieval")

annotations, class_names, class_dict, reverse_class_dict, labels_path = wandb_download_and_load_labels(wandb_run=run,
wandb_labels_artifact_name=args.wandb_labels_artifact)

namespace(annotations_columns_to_export=['filename', 'image_name', 'class_name', 'label', 'split', 'clear_or_confusing', 'whole_food_or_dish', 'one_food_or_multiple', 'label_last_updated_at', 'label_source', 'image_source'], auto_augment=True, batch_size=128, epochs=10, gs_bucket_name='food_vision_bucket_with_object_versioning', gs_image_storage_path='https://storage.cloud.google.com/food_vision_bucket_with_object_versioning/all_images/', input_size=224, label_smoothing=0.1, learning_rate=0.001, model='coatnext_nano_rw_224', num_to_try_and_autocorrect=1000, path_to_gcp_credentials='utils/google-storage-key.json', path_to_label_studio_api_key='utils/label_studio_api_key.json', pretrained=True, seed=42, use_mixed_precision=True, wandb_dataset_artifact='food_vision_199_classes_images:latest', wandb_job_type='', wandb_labels_artifact='food_vision_labels:latest', wandb_model_artifact='trained_model:latest', wandb_project='test_wandb_artifacts_by_reference', wandb_run_notes='', wandb_run_tag

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: mrdbourke. Use `wandb login --relogin` to force relogin


[INFO] Labels directory: ./artifacts/food_vision_labels:v17
[INFO] Labels path: artifacts/food_vision_labels:v17/annotations.csv
[INFO] Working with: 199 classes


In [20]:
class_names[:10]

['almond_butter',
 'almonds',
 'apple',
 'apricot',
 'asparagus',
 'avocado',
 'bacon',
 'bacon_and_egg_burger',
 'bagel',
 'baklava']

### Loop through class names and download ~100 images per class

In [21]:
from clip_retrieval.clip_client import ClipClient, Modality

client = ClipClient(url="https://knn.laion.ai/knn-service", 
                    indice_name="laion5B-H-14", # H = Huge, L = Large (huge is bigger than large)
                    num_images=100)

In [28]:
from tqdm.auto import tqdm

combined_results = []
for class_name in tqdm(class_names):
    results = client.query(text=f"{class_name} (food) photo")
    for result in results:
        # Attach the class name and its integer label to each result
        result["class_name"] = class_name
        result["label"] = reverse_class_dict[class_name]
    combined_results.extend(results)

len(combined_results)

  0%|          | 0/199 [00:00<?, ?it/s]

13343
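
13,343 results across 199 classes is about 67 per class on average, below the 100 requested, so some classes returned fewer results than others. A quick way to see which ones is to count results per class. This is a sketch on toy data: `count_per_class` and `classes_below_target` are hypothetical helpers, and in the notebook they would be called with `combined_results` and `class_names`.

```python
from collections import Counter


def count_per_class(results):
    """Count how many results carry each class_name."""
    return Counter(result["class_name"] for result in results)


def classes_below_target(results, class_names, target=100):
    """Return {class_name: count} for classes with fewer than `target` results."""
    counts = count_per_class(results)
    return {
        name: counts.get(name, 0)
        for name in class_names
        if counts.get(name, 0) < target
    }


# Toy data in the same shape as combined_results
toy_results = [
    {"url": "https://example.com/1.jpg", "class_name": "apple"},
    {"url": "https://example.com/2.jpg", "class_name": "apple"},
    {"url": "https://example.com/3.jpg", "class_name": "bagel"},
]
toy_class_names = ["apple", "bagel", "baklava"]
short_classes = classes_below_target(toy_results, toy_class_names, target=2)
print(short_classes)
```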

In [34]:
# Select a random result
import random
random.choice(combined_results)
# combined_results[0]

{'url': 'https://ais.kochbar.de/kbrezept/503738_843721/400x266/fleisch-kaese-strudel-rezept.jpg',
 'caption': 'Rezept: Fleisch Käse Strudel',
 'id': 2796819674,
 'similarity': 0.37485966086387634,
 'class_name': 'sausage_roll',
 'label': 167}

In [37]:
combined_results[0]

{'url': 'https://thumbnail.image.rakuten.co.jp/@0_mall/rtor/cabinet/00770170/imgrc0070055155.jpg?_ex=300x300',
 'caption': 'カークランド【クリーミーアーモンドバター 765g入り】アーモンド100%/大容量/たっぷり/KS',
 'id': 4226929696,
 'similarity': 0.3932535946369171,
 'class_name': 'almond_butter',
 'label': 0}

In [35]:
# Save results to JSON
import json

# Get current timestamp
from utils.misc import get_now_time
now = get_now_time()

save_path = f"{now}_clip_retrieval_100_images_per_class.json"

with open(save_path, "w") as f:
    json.dump(combined_results, f)
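
`get_now_time` is a helper from this project's `utils.misc` module, and its implementation is not shown here. Judging only by the saved filename (`2023-02-07_15-43-32_...`), a stand-in could look like the sketch below. `make_timestamp` is a hypothetical name and the format string is inferred from that filename, not taken from the project's code.

```python
from datetime import datetime


def make_timestamp(now=None):
    """Return a filename-safe timestamp such as 2023-02-07_15-43-32.

    Hypothetical stand-in for the project's get_now_time helper.
    """
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d_%H-%M-%S")


example_save_path = f"{make_timestamp()}_clip_retrieval_100_images_per_class.json"
print(example_save_path)
```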

In [36]:
# Download the images listed in the saved results JSON to the local machine
# Path: 2023-02-07_15-43-32_clip_retrieval_100_images_per_class.json
!img2dataset "2023-02-07_15-43-32_clip_retrieval_100_images_per_class.json" --input_format="json" --caption_col "caption" --output_folder="2023_02_07_clip_retrieval_100_images_per_class" --resize_mode="no" --output_format="files"

Starting the downloading of this file
Sharding file number 1 of 1 called /home/daniel/code/nutrify/foodvision/notebooks/2023-02-07_15-43-32_clip_retrieval_100_images_per_class.json
0it [00:00, ?it/s]File sharded in 2 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
1it [01:31, 91.30s/it]worker  - success: 0.908 - failed to download: 0.084 - failed to resize: 0.008 - images per sec: 112 - count: 10000
total   - success: 0.908 - failed to download: 0.084 - failed to resize: 0.008 - images per sec: 112 - count: 10000
2it [02:11, 65.79s/it]
worker  - success: 0.903 - failed to download: 0.083 - failed to resize: 0.014 - images per sec: 83 - count: 3343
total   - success: 0.907 - failed to download: 0.084 - failed to resize: 0.010 - images per sec: 103 - count: 13343
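
The `total` line is the count-weighted average of the per-shard rates. The arithmetic can be checked with the numbers from the log above. Because the logged rates are rounded to three decimals, the estimated number of downloaded images is approximate.

```python
# Per-shard counts and success rates copied from the img2dataset log above
shards = [
    {"count": 10000, "success": 0.908},
    {"count": 3343, "success": 0.903},
]

total_count = sum(shard["count"] for shard in shards)

# Weight each shard's success rate by how many URLs it attempted
overall_success = sum(shard["count"] * shard["success"] for shard in shards) / total_count

# Approximate number of images that downloaded successfully
estimated_images = round(overall_success * total_count)

print(total_count, round(overall_success, 3), estimated_images)
```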


In [None]:
# Next:
# Go through downloaded images and label them with their appropriate class (they can be matched via their URL)
# Add a UUID to each image 
# Create labels for images in same style as FoodVision labels
# Make an image source tag for where the images came from
# Upload to GCP as all training images (can swap 20% to test set later)
# Track updated labels and images in Weights & Biases
# Train a model with and without new images (on unmodified test set)
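
The first two steps (matching images to classes via URL and adding a UUID) could be sketched as follows. This assumes that `img2dataset` with `--output_format="files"` writes a `.json` metadata file containing a `url` field next to each image, and that images are saved with a `.jpg` suffix. Check both against the actual output folder before relying on it. `build_url_lookup`, `label_downloaded_images` and the `image_source` value are hypothetical names for illustration.

```python
import json
import uuid
from pathlib import Path


def build_url_lookup(results):
    """Map each URL to its class_name and label (first occurrence wins)."""
    lookup = {}
    for result in results:
        lookup.setdefault(
            result["url"],
            {"class_name": result["class_name"], "label": result["label"]},
        )
    return lookup


def label_downloaded_images(download_dir, results, image_source="laion5b_clip_retrieval"):
    """Build one label row per downloaded image by matching on URL.

    Assumes each image has a sidecar JSON file with a "url" key (an assumption
    about the img2dataset files output, not verified here).
    """
    lookup = build_url_lookup(results)
    rows = []
    for meta_path in sorted(Path(download_dir).rglob("*.json")):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue
        # Skip files that are not per-image metadata (for example shard statistics)
        if not isinstance(meta, dict) or meta.get("url") not in lookup:
            continue
        image_path = meta_path.with_suffix(".jpg")
        # Skip metadata for images that failed to download
        if not image_path.exists():
            continue
        rows.append({
            "image_uuid": str(uuid.uuid4()),
            "filename": str(image_path),
            "image_source": image_source,
            **lookup[meta["url"]],
        })
    return rows
```

If the same URL was returned for more than one class, this sketch keeps only the first class it was seen with, so such images may need manual review.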

In [38]:
# Check the file size of 2023-02-07_15-43-32_clip_retrieval_100_images_per_class.json (human-readable)
!ls -lh /home/daniel/code/nutrify/foodvision/notebooks/2023-02-07_15-43-32_clip_retrieval_100_images_per_class.json

-rw-rw-r-- 1 daniel daniel 3.7M Feb  7 15:43 /home/daniel/code/nutrify/foodvision/notebooks/2023-02-07_15-43-32_clip_retrieval_100_images_per_class.json
