# Creating your own dataset from Google Images

*Reference: Francisco Ingham and Jeremy Howard. Inspired by [Adrian Rosebrock](https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/)*

https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson2-download.ipynb

In this tutorial we will see how to easily create an image dataset through Google Images. **Note**: You will have to repeat these steps for every new category for which you need images from Google (e.g. for a dataset of dogs and cats, run them once for dogs and once for cats).

In [None]:
from fastai.vision import *

## Get a list of URLs

### Search and scroll

Go to [Google Images](http://images.google.com) and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.

Scroll down until you've seen all the images you want to download, or until you see a button that says 'Show more results'. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.

It is a good idea to put things you want to exclude into the search query. For instance, if you are searching for the "Alphonso" mango, it might be a good idea to exclude other varieties and mango products:

    "alphonso mango" -kesar -payari -langda -tetrapack -bottle -tin -yogurt -juice -bar -ice -kulfi -jam -pulp

You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.
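When there are many terms to exclude, the exclusion query shown above can be assembled programmatically. The sketch below is only an illustration: `build_query` is a hypothetical helper, not part of fastai or any Google API, and it simply produces the text you would paste into the search box.

```python
def build_query(phrase, exclude):
    """Quote the exact phrase and prefix every excluded term with '-'."""
    parts = [f'"{phrase}"'] + [f"-{term}" for term in exclude]
    return " ".join(parts)


# Hypothetical usage with a few of the terms from the example query.
query = build_query("alphonso mango", ["kesar", "payari", "langda"])
print(query)
```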

### Download into file

Now you must run some JavaScript code in your browser, which will save the URLs of all the images you want for your dataset.

In Google Chrome, press <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>j</kbd> on Windows/Linux or <kbd>Cmd</kbd>+<kbd>Opt</kbd>+<kbd>j</kbd> on macOS, and a small window, the JavaScript 'Console', will appear. In Firefox, press <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>k</kbd> on Windows/Linux or <kbd>Cmd</kbd>+<kbd>Opt</kbd>+<kbd>k</kbd> on macOS. That is where you will paste the JavaScript commands.

You will need to get the URLs of each of the images. Before running the following commands, you may want to disable ad-blocking extensions (uBlock, AdBlockPlus, etc.) in Chrome; otherwise the `window.open()` command doesn't work. Then you can run the following commands:

```javascript
urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```
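The saved file may need a little cleaning before use. JavaScript's `join` writes an empty string for any `null` entry, so blank lines can appear, and the same image may be listed more than once. The following is a minimal sketch of such a clean-up, assuming one URL per line; `clean_urls` is a hypothetical helper and the example URLs are placeholders.

```python
def clean_urls(lines):
    """Keep only http(s) URLs, dropping blanks and duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        # Skip empty lines and anything that is not a plain web URL.
        if not url.startswith(("http://", "https://")):
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


# Placeholder input standing in for the lines of the downloaded file.
raw = [
    "https://example.com/a.jpg",
    "",
    "https://example.com/a.jpg",
    "data:image/jpeg;base64,AAAA",
    " https://example.com/b.jpg ",
]
print(clean_urls(raw))
```

In practice you would read the lines from your saved file and write the cleaned list back before downloading.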

### Create directory and upload URLs file to your server

First, mount Google Drive. This is where we will download our images and later use them to build our model.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


Choose an appropriate name for your labeled images. You can run these steps multiple times to create different labels.

In [13]:
folder = 'alphonso'
file = 'alphonso.csv'

You will need to run this cell once for each category.

In [14]:
root_dir = "/content/drive/My Drive/"
base_dir = root_dir + 'Colab Notebooks/data/mangoes/'

In [15]:
path = Path(base_dir)
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [16]:
path.ls()

[PosixPath('/content/drive/My Drive/Colab Notebooks/data/mangoes/.ipynb_checkpoints'),
 PosixPath('/content/drive/My Drive/Colab Notebooks/data/mangoes/alphonso.csv'),
 PosixPath('/content/drive/My Drive/Colab Notebooks/data/mangoes/urls_othermangoes.csv'),
 PosixPath('/content/drive/My Drive/Colab Notebooks/data/mangoes/train'),
 PosixPath('/content/drive/My Drive/Colab Notebooks/data/mangoes/validation'),
 PosixPath('/content/drive/My Drive/Colab Notebooks/data/mangoes/alphonso'),
 PosixPath('/content/drive/My Drive/Colab Notebooks/data/mangoes/othermangoes')]
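If you already know all of your labels, the per-category folder creation above can be done in one loop instead of re-running the cells. This is a sketch using only `pathlib`; `make_label_dirs` is a hypothetical helper, and a temporary directory stands in for the Drive path so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def make_label_dirs(base, labels):
    """Create one sub-folder per label under `base` and return them by label."""
    dests = {}
    for label in labels:
        dest = Path(base) / label
        # Same flags as the cell above: create parents, ignore existing folders.
        dest.mkdir(parents=True, exist_ok=True)
        dests[label] = dest
    return dests


# A temporary directory replaces the real Drive location for this demo.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "mangoes"
    dests = make_label_dirs(base, ["alphonso", "othermangoes"])
    print(sorted(p.name for p in base.iterdir()))
```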

### Download images

Now you will need to download your images from their respective URLs.

fast.ai has a function that allows you to do just that. You just have to specify the URLs filename as well as the destination folder, and this function will download and save all images that can be opened. Images that cannot be opened will not be saved.

Let's download our images! Notice you can choose a maximum number of images to be downloaded. In this case we will not download all the URLs.

You will need to run this line once for every category.

In [None]:
download_images(path/file, dest, max_pics=1000)

In [None]:
# If you have problems downloading, try with `max_workers=0` to see exceptions:
# download_images(path/file, dest, max_pics=1000, max_workers=0)
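To make the idea behind this step concrete, here is a rough, stdlib-only sketch of a download loop that caps the number of images and skips failures. It is not fastai's implementation of `download_images`: `fetch_bytes`, `download_all`, `fake_fetch`, the index-based file naming, and the example URLs are all assumptions made for illustration.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=4):
    """Fetch the raw bytes at `url` (hypothetical helper, stdlib only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, dest, max_pics=1000, fetch=fetch_bytes):
    """Save up to `max_pics` URLs into `dest`, skipping any that fail."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls[:max_pics]):
        try:
            data = fetch(url)
        except Exception as exc:  # report the failure and move on
            print(f"skipping {url}: {exc}")
            continue
        target = dest / f"{i:08d}.jpg"  # assumed index-based naming
        target.write_bytes(data)
        saved.append(target)
    return saved


def fake_fetch(url):
    """Offline stand-in for `fetch_bytes`, so the demo needs no network."""
    if "broken" in url:
        raise ValueError("simulated download failure")
    return b"fake image bytes"


with tempfile.TemporaryDirectory() as tmp:
    demo_urls = [
        "https://example.com/1.jpg",
        "https://example.com/broken.jpg",
        "https://example.com/3.jpg",
    ]
    saved = download_all(demo_urls, tmp, max_pics=3, fetch=fake_fetch)
    print([p.name for p in saved])
```

Passing the fetcher as a parameter keeps the loop testable offline; with the default `fetch_bytes` it would perform real HTTP requests.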



In [None]:
# Only required the first time, so that the changes made in this Colab session become visible in Drive
# drive.flush_and_unmount()

This can be executed after the images are downloaded.

In [17]:
from fastai.vision import *

In [18]:
folder = 'alphonso'
file = 'alphonso.csv'

In [19]:
root_dir = "/content/drive/My Drive/"
base_dir = root_dir + 'Colab Notebooks/data/mangoes/'

path = Path(base_dir)
dest = path/folder

Then we can remove any images that can't be opened:

In [20]:
classes = ['alphonso']

In [21]:
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)

alphonso
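As a quick sanity check before or after this step, you can list files whose first bytes do not look like a common image format. This sketch is a much weaker check than `verify_images`: it only inspects file signatures (JPEG, PNG, GIF) and neither decodes, resizes, nor deletes anything. `sniff_image_type` and `find_suspect_files` are hypothetical helpers.

```python
import tempfile
from pathlib import Path

# Leading "magic" bytes of a few common image formats.
SIGNATURES = (
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_image_type(path):
    """Return a format name based on the file's first bytes, or None."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, kind in SIGNATURES:
        if head.startswith(magic):
            return kind
    return None


def find_suspect_files(folder):
    """List files in `folder` whose header matches none of the signatures."""
    return [
        p for p in sorted(Path(folder).iterdir())
        if p.is_file() and sniff_image_type(p) is None
    ]


# Demo on a temporary folder with one JPEG-like file and one HTML page.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "good.jpg").write_bytes(b"\xff\xd8\xff\xe0" + b"\x00" * 16)
    (Path(tmp) / "bad.jpg").write_bytes(b"<html>not an image</html>")
    print([p.name for p in find_suspect_files(tmp)])
```

Files flagged this way are often HTML error pages saved with an image extension; a file that passes the check can still be truncated or corrupt.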
