# Unsplash Dataset Creator

![](https://drive.google.com/uc?export=view&id=1TR9ydn_Kw180cQzP-LDvT52VwtNyxgr6)

This *Jupyter Notebook* creates a dataset of images stored locally on a hard drive, downloaded from the official [Unsplash Dataset](https://unsplash.com/data). It works with both the *Lite* and the *Full* version, but the official dataset has to be downloaded or requested separately.

Import `DatasetCreator` from `dataset_creation.py`, along with `cv2` (OpenCV), which provides the interpolation constant used when creating the instance.

In [1]:
from dataset_creation import DatasetCreator
import cv2

A new instance of `DatasetCreator` is created, and the `photos.tsv000` file from the official [Unsplash Dataset](https://unsplash.com/data) is loaded.

>| Parameter    | Description |
>|--------------|-------------|
>| path_in      | Path to the official [Unsplash Dataset](https://unsplash.com/data). |
>| path_out     | Directory where the downloaded dataset is stored. |
>| sizes        | Resolutions at which the images are saved (all images are square). |
>| dataset_size | Number of images the user wants to use in a future project. |

In [None]:
dataset_creator = DatasetCreator(
    'D:\\UNSPLASH Dataset Full',
    'D:\\Local UNSPLASH Dataset Full',
    sizes=(1024, 512, 256, 128, 64, 32),
    interpolation=cv2.INTER_LANCZOS4,
    dataset_size=1.2e6,
    dataset_type='Unsplash Dataset Full',
    author_name='Lukas Hueglin',
)

dataset_creator.load_dataframe()
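
For orientation, the sketch below shows one way to read a tab-separated file such as `photos.tsv000` with Python's standard `csv` module and collect image URLs. It is not the code behind `load_dataframe()`: the helper `read_image_urls`, its `limit` parameter, and the sample rows are assumptions made for illustration, and the column is assumed to be named `photo_image_url`.

```python
import csv
import io


def read_image_urls(tsv_file, limit=None):
    """Collect `photo_image_url` values from an open TSV file (hypothetical helper)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    urls = []
    for row in reader:
        # Stop early once the requested number of URLs has been collected.
        if limit is not None and len(urls) >= limit:
            break
        urls.append(row["photo_image_url"])
    return urls


# A tiny in-memory stand-in for the real file, with made-up rows.
sample = io.StringIO(
    "photo_id\tphoto_image_url\n"
    "a1\thttps://example.com/a1.jpg\n"
    "b2\thttps://example.com/b2.jpg\n"
    "c3\thttps://example.com/c3.jpg\n"
)
print(read_image_urls(sample, limit=2))
```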

If you want to know the estimated dataset size, run the next code cell.

In [None]:
dataset_creator.estimate_dataset_size()
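
As a rough illustration of the arithmetic behind such an estimate, the sketch below computes an uncompressed figure: each image is stored once per resolution, as a square with three 8-bit channels. The function name and the three-bytes-per-pixel assumption are hypothetical. Compressed formats such as JPEG need far less space, and `estimate_dataset_size()` may compute its figure differently.

```python
def uncompressed_bytes(num_images, sizes, channels=3):
    """Bytes needed if every image is stored raw at each square resolution."""
    per_image = sum(side * side * channels for side in sizes)
    return int(num_images) * per_image


total = uncompressed_bytes(1.2e6, (1024, 512, 256, 128, 64, 32))
print(f"{total / 1e12:.2f} TB uncompressed")
```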

`search_images()` collects as many *photo_image_urls* as specified in `dataset_size`, and `download_images()` then downloads them. The downloads run in parallel via multiprocessing; the number of parallel processes is set with `NUM_PROCESSES`. The whole download is split into batches, whose size is set with `BATCH_SIZE`. If you want to continue downloading images after a few days, you can select the batches to download with `batch_range=` in `download_images()`. \
<sup>*Note: The search_images() function might be updated so that it is possible to narrow down the download with keywords.*</sup>

In [None]:
BATCH_SIZE = 1000
NUM_PROCESSES = 10

urls = dataset_creator.search_images()
dataset_creator.download_images(urls, BATCH_SIZE, NUM_PROCESSES, batch_range=(1134, 1200))
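
To make the batch arithmetic concrete, the sketch below splits a list of URLs into batches and selects a range of them. It assumes that `BATCH_SIZE` counts images per batch and that `batch_range` is a half-open `(start, stop)` pair of batch indices. Both are assumptions about `download_images()`, and `select_batches` is a hypothetical helper, not part of `DatasetCreator`.

```python
def select_batches(items, batch_size, batch_range=None):
    """Split `items` into consecutive batches and return those in `batch_range`."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if batch_range is None:
        return batches
    start, stop = batch_range  # assumed half-open: start included, stop excluded
    return batches[start:stop]


urls = [f"url_{i}" for i in range(10)]
for batch in select_batches(urls, batch_size=3, batch_range=(1, 3)):
    print(batch)
```

Under these assumptions, 1.2 million URLs with `BATCH_SIZE = 1000` form 1200 batches, so `batch_range=(1134, 1200)` would cover the last 66 of them.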

`save_cache()` stores all the image URLs in `cache.csv`. A `README.md` file is then created by calling `make_README()`.

In [None]:
dataset_creator.save_cache(urls)
dataset_creator.make_README(urls, BATCH_SIZE)
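
A URL cache of this kind can be as simple as one URL per row. The sketch below writes and re-reads such a file with the standard `csv` module. The single `url` column and both helper names are assumptions, since the actual layout of `cache.csv` is not described here.

```python
import csv
import os
import tempfile


def write_url_cache(path, urls):
    """Write one URL per row under a `url` header (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([u] for u in urls)


def read_url_cache(path):
    """Read the URLs back in the order they were written."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["url"] for row in csv.DictReader(f)]


# Round trip in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache.csv")
    write_url_cache(cache_path, ["https://example.com/a.jpg", "https://example.com/b.jpg"])
    print(read_url_cache(cache_path))
```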