# Image Search Helper

## Environment Setup

In [1]:
!pip install requests transformers torch tqdm pandas pillow pyarrow



## Download Data

Download the Unsplash Lite dataset archive using the URL provided on the [unsplash-25K](https://huggingface.co/datasets/jamescalam/unsplash-25k-photos) dataset page.

In [2]:
!wget https://unsplash-datasets.s3.amazonaws.com/lite/latest/unsplash-research-dataset-lite-latest.zip

--2023-02-19 16:47:19--  https://unsplash-datasets.s3.amazonaws.com/lite/latest/unsplash-research-dataset-lite-latest.zip
Resolving unsplash-datasets.s3.amazonaws.com (unsplash-datasets.s3.amazonaws.com)... 52.216.37.17, 3.5.11.188, 54.231.231.1, ...
Connecting to unsplash-datasets.s3.amazonaws.com (unsplash-datasets.s3.amazonaws.com)|52.216.37.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 632351052 (603M) [application/zip]
Saving to: 'unsplash-research-dataset-lite-latest.zip'


2023-02-19 16:49:39 (4.35 MB/s) - 'unsplash-research-dataset-lite-latest.zip' saved [632351052/632351052]



In [3]:
# unzip the downloaded files into a temporary directory

!unzip unsplash-research-dataset-lite-latest.zip -d tmp

Archive:  unsplash-research-dataset-lite-latest.zip
  inflating: tmp/collections.tsv000  
  inflating: tmp/__MACOSX/._collections.tsv000  
  inflating: tmp/colors.tsv000       
  inflating: tmp/__MACOSX/._colors.tsv000  
  inflating: tmp/conversions.tsv000  
  inflating: tmp/__MACOSX/._conversions.tsv000  
  inflating: tmp/DOCS.md             
  inflating: tmp/keywords.tsv000     
  inflating: tmp/__MACOSX/._keywords.tsv000  
  inflating: tmp/photos.tsv000       
  inflating: tmp/__MACOSX/._photos.tsv000  
  inflating: tmp/README.md           
  inflating: tmp/TERMS.md            


## Read Data

In [4]:
import numpy as np
import pandas as pd
import glob

documents = ['photos', 'conversions']
datasets = {}

for doc in documents:
    files = glob.glob("tmp/" + doc + ".tsv*")
    subsets = []
    for filename in files:
        df = pd.read_csv(filename, sep='\t', header=0)
        subsets.append(df)
    datasets[doc] = pd.concat(subsets, axis=0, ignore_index=True)

In [5]:
df_photos = datasets['photos']
df_conversions = datasets['conversions']

# the dataset has 25K photos and all conversion history
print("photos: ", len(df_photos))
print("conversions: ", len(df_conversions))

photos:  25000
conversions:  12166088


In [6]:
df_photos.head()

Unnamed: 0,photo_id,photo_url,photo_image_url,photo_submitted_at,photo_featured,photo_width,photo_height,photo_aspect_ratio,photo_description,photographer_username,...,photo_location_country,photo_location_city,stats_views,stats_downloads,ai_description,ai_primary_landmark_name,ai_primary_landmark_latitude,ai_primary_landmark_longitude,ai_primary_landmark_confidence,blur_hash
0,XMyPniM9LF0,https://unsplash.com/photos/XMyPniM9LF0,https://images.unsplash.com/uploads/1411949294...,2014-09-29 00:08:38.594364,t,4272,2848,1.5,Woman exploring a forest,michellespencer77,...,,,2375421,6967,woman walking in the middle of forest,,,,,L56bVcRRIWMh.gVunlS4SMbsRRxr
1,rDLBArZUl1c,https://unsplash.com/photos/rDLBArZUl1c,https://images.unsplash.com/photo-141633941111...,2014-11-18 19:36:57.08945,t,3000,4000,0.75,Succulents in a terrarium,ugmonk,...,,,13784815,82141,succulent plants in clear glass terrarium,,,,,LvI$4txu%2s:_4t6WUj]xat7RPoe
2,cNDGZ2sQ3Bo,https://unsplash.com/photos/cNDGZ2sQ3Bo,https://images.unsplash.com/photo-142014251503...,2015-01-01 20:02:02.097036,t,2564,1710,1.5,Rural winter mountainside,johnprice,...,,,1302461,3428,rocky mountain under gray sky at daytime,,,,,LhMj%NxvM{t7_4t7aeoM%2M{ozj[
3,iuZ_D1eoq9k,https://unsplash.com/photos/iuZ_D1eoq9k,https://images.unsplash.com/photo-141487280988...,2014-11-01 20:15:13.410073,t,2912,4368,0.67,Poppy seeds and flowers,krisatomic,...,,,2890238,33704,red common poppy flower selective focus phography,,,,,LSC7DirZAsX7}Br@GEWWmnoLWCnj
4,BeD3vjQ8SI0,https://unsplash.com/photos/BeD3vjQ8SI0,https://images.unsplash.com/photo-141700759404...,2014-11-26 13:13:50.134383,t,4896,3264,1.5,Silhouette near dark trees,jonaseriksson,...,,,8704860,49662,trees during night time,,,,,L25|_:V@0hxtI=W;odae0ht6=^NG


In [7]:
df_conversions.head()

Unnamed: 0,converted_at,conversion_type,keyword,photo_id,anonymous_user_id,conversion_country
0,2020-07-29 00:08:04.221,download,clouds,ABmygVJcYgY,dd01ebdd-7691-4518-ab19-b2105782ae8b,VE
1,2020-07-29 00:25:23.426,download,shark,fB2jl6Rb3l4,c48ba6e0-c6a7-4a92-b569-fe57808a8a2c,QA
2,2020-07-29 00:26:13.122,download,dogs,k1hbfag2na0,62c4f043-579c-438f-8815-eb8ba3c54d34,KR
3,2020-07-29 00:37:03.308,download,astronaut,-SyUjRlHauQ,7ad6dc18-a02e-4ba2-b93c-fd7ea2e551d8,JP
4,2020-07-29 00:54:28.942,download,red roses,A0iTJUhK4es,f03a5708-32e4-4fae-8210-3c5d2632cbfb,NZ


## Image Feature Extraction

We leverage the [CLIP](https://huggingface.co/openai/clip-vit-base-patch32) model to extract image features for each image. Subsequently, we create two dataframes: one for photo information with embeddings and the other for conversion information.

These dataframes provide the information needed to run image similarity queries and examine conversion details.

In [8]:
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_image_features(image):
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model.get_image_features(**inputs)
        outputs = outputs / outputs.norm(dim=-1, keepdim=True)
    return outputs.squeeze(0).tolist()
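
Because `extract_image_features` divides each embedding by its L2 norm, the cosine similarity between two images reduces to a plain dot product. The following is a minimal sketch of a similarity query under that assumption. `top_k_similar` is a hypothetical helper that is not part of this notebook, and the toy 3-dimensional vectors stand in for real CLIP embeddings.

```python
import numpy as np

def top_k_similar(query, embeddings, k=3):
    """Return (index, score) pairs for the k rows of `embeddings` most similar to `query`.

    Assumes every vector is already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    matrix = np.asarray(embeddings, dtype=np.float32)
    scores = matrix @ np.asarray(query, dtype=np.float32)
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy unit vectors standing in for CLIP embeddings (hypothetical data)
toy = np.array([[1.0, 0.0, 0.0],
                [0.6, 0.8, 0.0],
                [0.0, 0.0, 1.0]])
results = top_k_similar(toy[0], toy, k=2)
```

With real data, `embeddings` could be built from the `photo_embed` column once it has been populated, and the returned indices mapped back to `photo_id` values.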

In [9]:
# pick first 1000 images

photo_ids = df_photos['photo_id'].tolist()[:1000]
df_photos = df_photos[df_photos['photo_id'].isin(photo_ids)].reset_index(drop=True)

In [10]:
# download images and perform feature extraction
# 
# Note: some images may fail to download due to outdated links

from PIL import Image
import requests
from tqdm.auto import tqdm


# copy the column subset so that adding a column does not write into a view
df_photos = df_photos[['photo_id', 'photo_image_url']].copy()
df_photos['photo_embed'] = None

for i, row in tqdm(df_photos.iterrows(), total=len(df_photos)):   
    # construct a URL to download an image with a smaller size by modifying the image URL
    url = row['photo_image_url'] + "?q=75&fm=jpg&w=200&fit=max"
    
    try:
        res = requests.get(url, stream=True, timeout=30)
        res.raise_for_status()
        image = Image.open(res.raw)
    except Exception:
        # remove photo if the download fails or the response is not a valid image
        photo_ids.remove(row['photo_id'])
        continue
    
    # extract feature embedding
    df_photos.at[i, 'photo_embed'] = extract_image_features(image)

  0%|          | 0/1000 [00:00<?, ?it/s]
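
The loop above appends the resize parameters with a leading `?`, which assumes the stored image URL has no query string of its own. As a hedged alternative, the sketch below merges the same parameters (`q`, `fm`, `w`, `fit`) into whatever query string is already present. `build_thumbnail_url` is a hypothetical helper, and the example URL is made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def build_thumbnail_url(image_url, width=200, quality=75, fmt="jpg"):
    """Return `image_url` with the resize parameters merged into its query string."""
    parts = urlsplit(image_url)
    params = dict(parse_qsl(parts.query))  # keep any existing parameters
    params.update({"q": str(quality), "fm": fmt, "w": str(width), "fit": "max"})
    return urlunsplit(parts._replace(query=urlencode(params)))

# Hypothetical URL used only for illustration
thumb = build_thumbnail_url("https://images.unsplash.com/photo-123?ixid=abc")
```

For a URL without a query string, the result matches what the loop builds by string concatenation.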

In [11]:
df_photos = df_photos[df_photos['photo_id'].isin(photo_ids)].reset_index(drop=True)
df_photos.head()

Unnamed: 0,photo_id,photo_image_url,photo_embed
0,XMyPniM9LF0,https://images.unsplash.com/uploads/1411949294...,"[-0.02423190325498581, 0.05229705199599266, 0...."
1,rDLBArZUl1c,https://images.unsplash.com/photo-141633941111...,"[-0.032578177750110626, 0.028756439685821533, ..."
2,cNDGZ2sQ3Bo,https://images.unsplash.com/photo-142014251503...,"[-0.03327571600675583, 0.051595136523246765, 0..."
3,iuZ_D1eoq9k,https://images.unsplash.com/photo-141487280988...,"[0.008161935955286026, 0.00707155792042613, -0..."
4,BeD3vjQ8SI0,https://images.unsplash.com/photo-141700759404...,"[-0.01028392929583788, -0.0009960911702364683,..."


In [12]:
df_conversions = df_conversions[df_conversions['photo_id'].isin(photo_ids)].reset_index(drop=True)
df_conversions = df_conversions[['photo_id', 'keyword', 'conversion_country']]
df_conversions.head()

Unnamed: 0,photo_id,keyword,conversion_country
0,Knwea-mLGAg,starry sky,TW
1,AZMmUy2qL6A,camping,KR
2,agE97zp_Xvo,lonely,IN
3,EXbGG5dBZKw,happy,CN
4,EWDvHNNfUmQ,wood,EC
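
One way to examine the conversion details is to ask which search keywords most often led to a conversion for a given photo. The sketch below assumes a dataframe with the same `photo_id` and `keyword` columns as `df_conversions`. `top_keywords` is a hypothetical helper, and the toy rows are invented for illustration.

```python
import pandas as pd

def top_keywords(conversions, photo_id, n=3):
    """Return the n most frequent keywords among conversions of `photo_id`."""
    subset = conversions[conversions["photo_id"] == photo_id]
    return subset["keyword"].value_counts().head(n)

# Toy conversion records (hypothetical data)
toy = pd.DataFrame({
    "photo_id": ["a", "a", "a", "b"],
    "keyword": ["forest", "forest", "tree", "sky"],
    "conversion_country": ["US", "DE", "US", "JP"],
})
counts = top_keywords(toy, "a")
```

Here `counts` is a Series indexed by keyword, ordered from most to least frequent.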


## Transform Dataset to Parquet

Finally, we convert the dataframes into Parquet files, which can then be uploaded to a Hugging Face dataset repository. This makes the data simple to access and share.

In [13]:
import pyarrow as pa
import pyarrow.parquet as pq

# Create Arrow tables from the dataframes (the schema is inferred from the columns)
photos_table = pa.Table.from_pandas(df_photos)
conversion_table = pa.Table.from_pandas(df_conversions)

# Write each table to its own Parquet file
pq.write_table(photos_table, 'photos.parquet')
pq.write_table(conversion_table, 'conversions.parquet')