
# Set up the Unsplash dataset




This notebook can be used to download all images from the Unsplash dataset: https://github.com/unsplash/datasets. 

There are two versions: Lite (25,000 images) and Full (2M images).

For the Full version, you need to apply for access (see [here](https://unsplash.com/data)). With access, you can run CLIP on the whole dataset yourself.

Put the .TSV files in the folder `unsplash-dataset/full` or `unsplash-dataset/lite`, or adjust the path in the cell below.
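
As a minimal sketch of how the files in such a folder could be located: the notebook later reads a file named `photos.tsv000`, so the example assumes that a table may be split into several numbered files sharing the `<table>.tsv` prefix. The helper `list_table_files` and the `dataset_dir` value are hypothetical names used only for illustration.

```python
import glob
import os


def list_table_files(dataset_dir, table):
    """Return the sorted paths of all files belonging to one table.

    Assumption: files are named like "<table>.tsv000", "<table>.tsv001", ...
    """
    pattern = os.path.join(dataset_dir, f"{table}.tsv*")
    return sorted(glob.glob(pattern))


# Hypothetical location; point this at wherever the .TSV files were unpacked.
dataset_dir = "unsplash-dataset/lite"
photo_files = list_table_files(dataset_dir, "photos")
print(photo_files)  # an empty list if the folder does not exist yet
```

Each returned path can then be passed to `pd.read_csv(path, sep='\t', header=0)`, which is the call this notebook uses further down.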

## About

The Unsplash Dataset is made up of over 250,000+ contributing global photographers and data sourced from hundreds of millions of searches across a nearly unlimited number of uses and contexts. Due to the breadth of intent and semantics contained within the Unsplash dataset, it enables new opportunities for research and learning.



The Unsplash Dataset is offered in two datasets:

- the Lite dataset: available for commercial and noncommercial usage, containing 25k nature-themed Unsplash photos, 25k keywords, and 1M searches
- the Full dataset: available for noncommercial usage, containing 3M+ high-quality Unsplash photos, 5M keywords, and over 250M searches

from [Readme](https://github.com/unsplash/datasets)

## Imports

In [1]:
import datasets
import requests
import pandas as pd
import os
import urllib.request  # "import urllib" alone does not guarantee that urllib.request is loaded
import gcsfs

import PIL
from PIL import Image


In [2]:
%%bash
mkdir -p ../data/raw/
cd ../data/raw
if [ ! -f latest ]
then
    wget https://unsplash.com/data/lite/latest --quiet
    unzip latest

fi




In [3]:
!ls ../data

raw


In [4]:
df_photos = pd.read_csv("../data/raw/photos.tsv000", sep='\t', header=0)
df_photos.head()

Unnamed: 0,photo_id,photo_url,photo_image_url,photo_submitted_at,photo_featured,photo_width,photo_height,photo_aspect_ratio,photo_description,photographer_username,...,photo_location_country,photo_location_city,stats_views,stats_downloads,ai_description,ai_primary_landmark_name,ai_primary_landmark_latitude,ai_primary_landmark_longitude,ai_primary_landmark_confidence,blur_hash
0,XMyPniM9LF0,https://unsplash.com/photos/XMyPniM9LF0,https://images.unsplash.com/uploads/1411949294...,2014-09-29 00:08:38.594364,t,4272,2848,1.5,Woman exploring a forest,michellespencer77,...,,,2375421,6967,woman walking in the middle of forest,,,,,L56bVcRRIWMh.gVunlS4SMbsRRxr
1,rDLBArZUl1c,https://unsplash.com/photos/rDLBArZUl1c,https://images.unsplash.com/photo-141633941111...,2014-11-18 19:36:57.08945,t,3000,4000,0.75,Succulents in a terrarium,ugmonk,...,,,13784815,82141,succulent plants in clear glass terrarium,,,,,LvI$4txu%2s:_4t6WUj]xat7RPoe
2,cNDGZ2sQ3Bo,https://unsplash.com/photos/cNDGZ2sQ3Bo,https://images.unsplash.com/photo-142014251503...,2015-01-01 20:02:02.097036,t,2564,1710,1.5,Rural winter mountainside,johnprice,...,,,1302461,3428,rocky mountain under gray sky at daytime,,,,,LhMj%NxvM{t7_4t7aeoM%2M{ozj[
3,iuZ_D1eoq9k,https://unsplash.com/photos/iuZ_D1eoq9k,https://images.unsplash.com/photo-141487280988...,2014-11-01 20:15:13.410073,t,2912,4368,0.67,Poppy seeds and flowers,krisatomic,...,,,2890238,33704,red common poppy flower selective focus phography,,,,,LSC7DirZAsX7}Br@GEWWmnoLWCnj
4,BeD3vjQ8SI0,https://unsplash.com/photos/BeD3vjQ8SI0,https://images.unsplash.com/photo-141700759404...,2014-11-26 13:13:50.134383,t,4896,3264,1.5,Silhouette near dark trees,jonaseriksson,...,,,8704860,49662,trees during night time,,,,,L25|_:V@0hxtI=W;odae0ht6=^NG


In [5]:
df_photos.iloc[0]

photo_id                                                                XMyPniM9LF0
photo_url                                   https://unsplash.com/photos/XMyPniM9LF0
photo_image_url                   https://images.unsplash.com/uploads/1411949294...
photo_submitted_at                                       2014-09-29 00:08:38.594364
photo_featured                                                                    t
photo_width                                                                    4272
photo_height                                                                   2848
photo_aspect_ratio                                                              1.5
photo_description                                          Woman exploring a forest
photographer_username                                             michellespencer77
photographer_first_name                                                    Michelle
photographer_last_name                                                      

In [6]:
pd.isna(df_photos.ai_description).value_counts()

False    23641
True      1359
Name: ai_description, dtype: int64

In [7]:
pd.isna(df_photos.photo_description).value_counts()

True     14098
False    10902
Name: photo_description, dtype: int64

In [8]:
# Prefer the photographer's description, fall back to the AI description, then to a single space
df_photos['description_final'] = df_photos.photo_description.fillna(df_photos.ai_description).fillna(" ")
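
To make the fallback order in the chained `fillna` calls concrete, here is a small sketch on a toy DataFrame. The three rows are illustrative values (two descriptions are borrowed from the output above), not a sample of the real data.

```python
import pandas as pd

# Toy data: row 0 has both descriptions, row 1 only an AI description, row 2 neither.
toy = pd.DataFrame({
    "photo_description": ["Woman exploring a forest", None, None],
    "ai_description": ["woman walking in the middle of forest", "trees during night time", None],
})

toy["description_final"] = (
    toy["photo_description"]
    .fillna(toy["ai_description"])  # fill gaps with the AI description, aligned by index
    .fillna(" ")                    # anything still missing becomes a single space
)

print(toy["description_final"].tolist())
# ['Woman exploring a forest', 'trees during night time', ' ']
```

The first `fillna` only touches rows where `photo_description` is missing, so a photographer-written description is never overwritten.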

## Download images

In [9]:
df_photos.head()

Unnamed: 0,photo_id,photo_url,photo_image_url,photo_submitted_at,photo_featured,photo_width,photo_height,photo_aspect_ratio,photo_description,photographer_username,...,photo_location_city,stats_views,stats_downloads,ai_description,ai_primary_landmark_name,ai_primary_landmark_latitude,ai_primary_landmark_longitude,ai_primary_landmark_confidence,blur_hash,description_final
0,XMyPniM9LF0,https://unsplash.com/photos/XMyPniM9LF0,https://images.unsplash.com/uploads/1411949294...,2014-09-29 00:08:38.594364,t,4272,2848,1.5,Woman exploring a forest,michellespencer77,...,,2375421,6967,woman walking in the middle of forest,,,,,L56bVcRRIWMh.gVunlS4SMbsRRxr,Woman exploring a forest
1,rDLBArZUl1c,https://unsplash.com/photos/rDLBArZUl1c,https://images.unsplash.com/photo-141633941111...,2014-11-18 19:36:57.08945,t,3000,4000,0.75,Succulents in a terrarium,ugmonk,...,,13784815,82141,succulent plants in clear glass terrarium,,,,,LvI$4txu%2s:_4t6WUj]xat7RPoe,Succulents in a terrarium
2,cNDGZ2sQ3Bo,https://unsplash.com/photos/cNDGZ2sQ3Bo,https://images.unsplash.com/photo-142014251503...,2015-01-01 20:02:02.097036,t,2564,1710,1.5,Rural winter mountainside,johnprice,...,,1302461,3428,rocky mountain under gray sky at daytime,,,,,LhMj%NxvM{t7_4t7aeoM%2M{ozj[,Rural winter mountainside
3,iuZ_D1eoq9k,https://unsplash.com/photos/iuZ_D1eoq9k,https://images.unsplash.com/photo-141487280988...,2014-11-01 20:15:13.410073,t,2912,4368,0.67,Poppy seeds and flowers,krisatomic,...,,2890238,33704,red common poppy flower selective focus phography,,,,,LSC7DirZAsX7}Br@GEWWmnoLWCnj,Poppy seeds and flowers
4,BeD3vjQ8SI0,https://unsplash.com/photos/BeD3vjQ8SI0,https://images.unsplash.com/photo-141700759404...,2014-11-26 13:13:50.134383,t,4896,3264,1.5,Silhouette near dark trees,jonaseriksson,...,,8704860,49662,trees during night time,,,,,L25|_:V@0hxtI=W;odae0ht6=^NG,Silhouette near dark trees


In [10]:
image_path = "../data/raw/images"
os.makedirs(image_path, exist_ok=True)

In [24]:
def download_image(row):
    """Download one photo (requested with ?w=640) and attach it to the row as a PIL image."""
    photo_id = row['photo_id']
    photo_url = row['photo_image_url'] + "?w=640"
    photo_path = f"{image_path}/{photo_id}.jpg"

    image = None
    # Only download a photo if it doesn't exist on disk yet
    if not os.path.exists(photo_path):
        try:
            urllib.request.urlretrieve(photo_url, photo_path)
            image = Image.open(photo_path)
        except Exception as e:
            # Report the failure and leave image as None so the row can be filtered out later
            print(f"Cannot download {photo_url} ; {e}")
    else:
        image = Image.open(photo_path)

    row['image'] = image
    return row
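
The function above appends `?w=640` by plain string concatenation, which works as long as `photo_image_url` carries no query string of its own. As a hedged alternative, the sketch below sets the parameter with `urllib.parse` instead. The helper name `with_width` is hypothetical and not part of the notebook.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_width(url, width=640):
    """Return url with its "w" query parameter set to width, keeping other parameters."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["w"] = str(width)
    return urlunsplit(parts._replace(query=urlencode(query)))


# URL taken from the first row of the dataset shown in this notebook.
print(with_width("https://images.unsplash.com/uploads/14119492946973137ce46/f1f2ebf3"))
# https://images.unsplash.com/uploads/14119492946973137ce46/f1f2ebf3?w=640
```

For URLs without a query string the result is identical to the concatenation used in `download_image`.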
        

In [25]:
dset = datasets.Dataset.from_pandas(df_photos)

In [26]:
dset

Dataset({
    features: ['photo_id', 'photo_url', 'photo_image_url', 'photo_submitted_at', 'photo_featured', 'photo_width', 'photo_height', 'photo_aspect_ratio', 'photo_description', 'photographer_username', 'photographer_first_name', 'photographer_last_name', 'exif_camera_make', 'exif_camera_model', 'exif_iso', 'exif_aperture_value', 'exif_focal_length', 'exif_exposure_time', 'photo_location_name', 'photo_location_latitude', 'photo_location_longitude', 'photo_location_country', 'photo_location_city', 'stats_views', 'stats_downloads', 'ai_description', 'ai_primary_landmark_name', 'ai_primary_landmark_latitude', 'ai_primary_landmark_longitude', 'ai_primary_landmark_confidence', 'blur_hash', 'description_final'],
    num_rows: 25000
})

In [27]:
dset = dset.map(download_image, num_proc=8)

         

Cannot download https://images.unsplash.com/photo-1498144846853-60ca2d43853b?w=640 ; HTTP Error 404: Not Found
Cannot download https://images.unsplash.com/photo-1578166671353-7978081a6f9c?w=640 ; HTTP Error 404: Not Found
Cannot download https://images.unsplash.company%20by%20Alessandro%20Desantis%20-%20Downloaded%20from%20500px_jpg.jpg?w=640 ; URL can't contain control characters. 'images.unsplash.company by Alessandro Desantis - Downloaded from 500px_jpg.jpg' (found at least ' ')
Cannot download https://images.unsplash.com/photo-1578411246981-e0394f597159?w=640 ; HTTP Error 404: Not Found
Cannot download https://images.unsplash.com-grass-sun.jpg?w=640 ; <urlopen error [Errno -2] Name or service not known>
Cannot download https://images.unsplash.com_TheBeach.jpg?w=640 ; <urlopen error [Errno -2] Name or service not known>
Cannot download https://images.unsplash.com/photo-1573486145949-182147241fa6?w=640 ; HTTP Error 404: Not Found
Cannot download https://images.unsplash.com/photo-1583307709837-2d0e3a82be15?w=640 ; HTTP Error 404: Not Found
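
Several of the failures above come from malformed URLs rather than missing photos: the host part is not `images.unsplash.com`, or the URL contains whitespace. A rough sketch of a check that could flag such rows before any request is made follows. The function name `looks_downloadable` and the constant `EXPECTED_HOST` are assumptions for illustration, with the host inferred only from the URLs printed in this log.

```python
from urllib.parse import urlsplit

# Assumption: every valid image URL in this dataset lives on this host.
EXPECTED_HOST = "images.unsplash.com"


def looks_downloadable(url):
    """Return True if url is whitespace-free HTTPS on the expected host."""
    if any(ch.isspace() for ch in url):
        return False
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname == EXPECTED_HOST


print(looks_downloadable("https://images.unsplash.com-grass-sun.jpg?w=640"))  # False
```

This check cannot detect the `404: Not Found` cases, since those URLs are well formed. The `try`/`except` in `download_image` is still needed for them.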


In [28]:
dset

Dataset({
    features: ['photo_id', 'photo_url', 'photo_image_url', 'photo_submitted_at', 'photo_featured', 'photo_width', 'photo_height', 'photo_aspect_ratio', 'photo_description', 'photographer_username', 'photographer_first_name', 'photographer_last_name', 'exif_camera_make', 'exif_camera_model', 'exif_iso', 'exif_aperture_value', 'exif_focal_length', 'exif_exposure_time', 'photo_location_name', 'photo_location_latitude', 'photo_location_longitude', 'photo_location_country', 'photo_location_city', 'stats_views', 'stats_downloads', 'ai_description', 'ai_primary_landmark_name', 'ai_primary_landmark_latitude', 'ai_primary_landmark_longitude', 'ai_primary_landmark_confidence', 'blur_hash', 'description_final', 'image'],
    num_rows: 25000
})

In [29]:
dset[0]

{'photo_id': 'XMyPniM9LF0',
 'photo_url': 'https://unsplash.com/photos/XMyPniM9LF0',
 'photo_image_url': 'https://images.unsplash.com/uploads/14119492946973137ce46/f1f2ebf3',
 'photo_submitted_at': '2014-09-29 00:08:38.594364',
 'photo_featured': 't',
 'photo_width': 4272,
 'photo_height': 2848,
 'photo_aspect_ratio': 1.5,
 'photo_description': 'Woman exploring a forest',
 'photographer_username': 'michellespencer77',
 'photographer_first_name': 'Michelle',
 'photographer_last_name': 'Spencer',
 'exif_camera_make': 'Canon',
 'exif_camera_model': 'Canon EOS REBEL T3',
 'exif_iso': 400.0,
 'exif_aperture_value': '1.8',
 'exif_focal_length': '50.0',
 'exif_exposure_time': '1/100',
 'photo_location_name': None,
 'photo_location_latitude': None,
 'photo_location_longitude': None,
 'photo_location_country': None,
 'photo_location_city': None,
 'stats_views': 2375421,
 'stats_downloads': 6967,
 'ai_description': 'woman walking in the middle of forest',
 'ai_primary_landmark_name': None,
 'ai_

In [30]:
# storage_options={"project": "np-public-training"}
# fs = gcsfs.GCSFileSystem(storage_options )

Remove the rows whose image could not be downloaded.

In [31]:
dset = dset.filter(lambda x: x['image'] is not None, num_proc=8)

         


In [19]:
#!rm -rf ../data/processed/*

In [32]:
dset.save_to_disk("../data/processed")



In [33]:
dset = datasets.load_from_disk("../data/processed")

In [34]:
dset[0]

{'photo_id': 'XMyPniM9LF0',
 'photo_url': 'https://unsplash.com/photos/XMyPniM9LF0',
 'photo_image_url': 'https://images.unsplash.com/uploads/14119492946973137ce46/f1f2ebf3',
 'photo_submitted_at': '2014-09-29 00:08:38.594364',
 'photo_featured': 't',
 'photo_width': 4272,
 'photo_height': 2848,
 'photo_aspect_ratio': 1.5,
 'photo_description': 'Woman exploring a forest',
 'photographer_username': 'michellespencer77',
 'photographer_first_name': 'Michelle',
 'photographer_last_name': 'Spencer',
 'exif_camera_make': 'Canon',
 'exif_camera_model': 'Canon EOS REBEL T3',
 'exif_iso': 400.0,
 'exif_aperture_value': '1.8',
 'exif_focal_length': '50.0',
 'exif_exposure_time': '1/100',
 'photo_location_name': None,
 'photo_location_latitude': None,
 'photo_location_longitude': None,
 'photo_location_country': None,
 'photo_location_city': None,
 'stats_views': 2375421,
 'stats_downloads': 6967,
 'ai_description': 'woman walking in the middle of forest',
 'ai_primary_landmark_name': None,
 'ai_

In [35]:
dset

Dataset({
    features: ['photo_id', 'photo_url', 'photo_image_url', 'photo_submitted_at', 'photo_featured', 'photo_width', 'photo_height', 'photo_aspect_ratio', 'photo_description', 'photographer_username', 'photographer_first_name', 'photographer_last_name', 'exif_camera_make', 'exif_camera_model', 'exif_iso', 'exif_aperture_value', 'exif_focal_length', 'exif_exposure_time', 'photo_location_name', 'photo_location_latitude', 'photo_location_longitude', 'photo_location_country', 'photo_location_city', 'stats_views', 'stats_downloads', 'ai_description', 'ai_primary_landmark_name', 'ai_primary_landmark_latitude', 'ai_primary_landmark_longitude', 'ai_primary_landmark_confidence', 'blur_hash', 'description_final', 'image'],
    num_rows: 24992
})