# Creating your own dataset from Google Images

*The dataset creation in this work was inspired by and adapted from https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson2-download.ipynb*

In [1]:
from fastai.vision import *

### Search and scroll
Go to [Google Images](http://images.google.com) and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.

Scroll down until you've seen all the images you want to download, or until you see a button that says 'Show more results'. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.

It is a good idea to put things you want to exclude into the search query. For instance, if you are searching for the Eurasian wolf, "canis lupus lupus", you may want to exclude other variants:

    "canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis

You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.
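
The exclusion query shown earlier follows a simple pattern: the exact phrase in double quotes, then each unwanted term prefixed with a minus sign. As a minimal sketch, the hypothetical helper `build_query` below (not part of fastai or any Google API) assembles such a string so it can be reused across classes:

```python
def build_query(phrase, exclusions):
    """Return a search string: the quoted phrase followed by -term exclusions."""
    parts = [f'"{phrase}"']
    parts += [f"-{term}" for term in exclusions]
    return " ".join(parts)


# Rebuild the Eurasian wolf query from the text above.
query = build_query(
    "canis lupus lupus",
    ["dog", "arctos", "familiaris", "baileyi", "occidentalis"],
)
print(query)
```

The resulting string can be pasted into the Google Images search box as-is.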

### Download into file

Now you must run some JavaScript code in your browser, which will save the URLs of all the images you want for your dataset.

In Google Chrome, press <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>j</kbd> on Windows/Linux or <kbd>Cmd</kbd><kbd>Opt</kbd><kbd>j</kbd> on macOS, and a small window, the JavaScript 'Console', will appear. In Firefox, press <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>k</kbd> on Windows/Linux or <kbd>Cmd</kbd><kbd>Opt</kbd><kbd>k</kbd> on macOS. That is where you will paste the JavaScript commands.

You will need to get the URL of each image. Before running the following commands, you may want to disable ad-blocking extensions (uBlock, AdBlockPlus, etc.) in Chrome; otherwise the `window.open()` command doesn't work. Then you can run the following commands:

```javascript
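// Collect one URL per image result, preferring data-src and falling back to data-iurl.
urls = Array.from(document.querySelectorAll('.rg_i')).map(el => el.hasAttribute('data-src') ? el.getAttribute('data-src') : el.getAttribute('data-iurl'));
// Open the URLs, one per line, as CSV data in a new window so they can be saved to a file.
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));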
```
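
A URL file saved this way can contain blank lines, duplicates, or entries that are not web addresses. Cleaning it before downloading is an optional step that the original workflow does not include. The sketch below assumes one URL per line; `clean_url_file` is a hypothetical helper, and the demo runs on made-up URLs in a temporary file.

```python
import tempfile
from pathlib import Path


def clean_url_file(path):
    """Rewrite the file at `path`, keeping one copy of each http(s) URL."""
    path = Path(path)
    seen = set()
    kept = []
    for line in path.read_text().splitlines():
        url = line.strip()
        # Skip blank lines and anything that is not a web URL.
        if not url.startswith(("http://", "https://")):
            continue
        if url not in seen:
            seen.add(url)
            kept.append(url)
    path.write_text("".join(u + "\n" for u in kept))
    return kept


# Demonstration on a temporary file with made-up URLs.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "urls_demo.csv"
    demo_file.write_text(
        "https://example.com/a.jpg\n"
        "\n"
        "https://example.com/a.jpg\n"
        "data:image/jpeg;base64,AAAA\n"
        "https://example.com/b.jpg\n"
    )
    cleaned = clean_url_file(demo_file)
    rewritten = demo_file.read_text()

print(cleaned)
```

In this notebook's layout, the helper would be called on a path such as `data/apple/urls_apple.csv`.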

### Fruits dataset

We searched for **apple**, **banana** and **orange** to create our dataset. The resulting URL files are prepared in the [data](data) folder.

### Download data
Next, we need to download the images from the URLs.

In [2]:
!head data/apple/urls_apple.csv

https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQKbAPZi3RS4CSKXLoeKWaiTUa2L2aE7IVwOKnryDkUdZuQO-IX&usqp=CAU
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTh5ZwxB7imd-P3RSychtdaSp_ty1HJl5r8ViZbHsqK_ked6S9m&usqp=CAU
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRmbzRn6H6tWtmWx6bVhlE-M-wvkbZ47xLbgU0kVm-9xNFtc9q7&usqp=CAU
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTS2H6WPq4JdC2jb-mMNPxr4WuTkUd_-SG-dqNFI0Vfqk5Zs7od&usqp=CAU
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT62GykE0Xglul21_dgniVZJgqbJW3oMIMA8TOsh_sY9mFlb-49&usqp=CAU
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRS84wdHqXjyTOxBpmvxADtYjgV3DBU9nhwGk4Ip49hsENP8AQZ&usqp=CAU
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTnHNhFIADnH8b6FQy03r-2I-OqKs38J6JOZKofdvZxPq3Pcfh6&usqp=CAU
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSjjs0o-jLceNVUs-kCE_5BBh9VxzAK-6Z3otGZ4O2IpoJZqA9s&usqp=CAU
https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSNSyd23O

In [3]:
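classes = ["apple", "banana", "orange", "grape", "strawberry"]
for c in classes:
    class_folder = f"data/{c}"
    url_file = f"{class_folder}/urls_{c}.csv"
    # Download every image listed in the URL file into the class folder.
    download_images(url_file, class_folder)
    # Delete files that cannot be opened as images; resize the rest to at most 500 pixels.
    verify_images(class_folder, delete=True, max_size=500)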
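
Some downloads fail and some files are deleted during verification, so it can help to check how many images each class ended up with. The following is a minimal sketch using only the standard library. `count_images` is a hypothetical helper, and it assumes the layout used above: one folder per class, with the URL file stored as a `.csv` alongside the images. The demo builds a throwaway directory with empty placeholder files.

```python
import tempfile
from pathlib import Path


def count_images(root, classes, skip_suffixes=(".csv",)):
    """Count files per class folder, ignoring the URL lists (.csv)."""
    counts = {}
    for c in classes:
        folder = Path(root) / c
        if not folder.is_dir():
            # A missing class folder is reported as zero images.
            counts[c] = 0
            continue
        counts[c] = sum(
            1 for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() not in skip_suffixes
        )
    return counts


# Demonstration on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_images in [("apple", 3), ("banana", 1)]:
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / f"urls_{name}.csv").write_text("")
        for i in range(n_images):
            (folder / f"{i:08d}.jpg").write_bytes(b"")
    demo_counts = count_images(tmp, ["apple", "banana", "orange"])

print(demo_counts)
```

In the notebook, `count_images("data", classes)` would report the per-class totals. A class with very few images may need a broader search.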
