# Create a Dataset of Images

Key outcome: create our own image dataset and use it to train a model that distinguishes between whatever classes of images we choose. In this case, I'll be grabbing images of the abominable snowman and of Olaf, the friendly snowman from Disney's Frozen, for my nieces.

*Based on a tutorial by Francisco Ingham and Jeremy Howard, inspired by [Adrian Rosebrock](https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/)*

In [0]:
from fastai.vision import *

## Get a list of URLs

### Search and scroll
Step 1: Using [Google Images](http://images.google.com), first search for images of Olaf. Repeat this process later to get images of the abominable snowman. The more specific the search, the less pruning you'll need to do later. To make a search more specific, quote the exact phrase and prefix each term you want to exclude with a minus sign. For example, to find Eurasian wolf images ("canis lupus lupus") while excluding other variants, search for `"canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis`. You can limit your results to photos by clicking on Tools and selecting Photos from the Type dropdown, but Olaf is a character in a cartoon movie, so we'll keep that filter off.
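
The exclusion syntax is easy to mistype by hand, so here is a minimal sketch that assembles such a query string. `build_query` is a hypothetical helper written for this walkthrough, not part of fastai or any Google API; it only formats text that you can paste into the search box.

```python
def build_query(phrase, exclude=()):
    """Return a search string: the quoted phrase followed by -term exclusions."""
    parts = [f'"{phrase}"']
    parts += [f"-{term}" for term in exclude]
    return " ".join(parts)


query = build_query(
    "canis lupus lupus",
    exclude=["dog", "arctos", "familiaris", "baileyi", "occidentalis"],
)
print(query)
```

Calling `build_query("olaf")` with no exclusions simply returns the quoted phrase.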

Step 2: Scroll down until we've seen all the images we want to download, or until we reach the 'Show more results' button. All the images we scrolled past are now available to download. I wanted a few more, so I clicked the button and kept scrolling to the bottom. The maximum number of images Google Images shows is 700.

### Download URLs of images

Now let's run some JavaScript code in the browser to save the URLs of all the images.

Step 3: On the page with the image search results, open the JavaScript console by pressing Ctrl+Shift+J on Windows/Linux or Cmd+Opt+J on Mac.

Step 4: Paste the JavaScript commands below into the console window to get the URLs of each of the images.

```javascript
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```

### Create directory and upload 1st url file to your server
Choose a clear name for your images.

In [0]:
folder = 'olaf'
file = 'urls_olaf.csv'

In [0]:
path = Path('data/snowman')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [0]:
path.ls()

Upload the URLs file: press 'Upload' in your working directory, select your file, then click 'Upload' for each of the displayed files.
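
Before downloading, it can help to confirm that the uploaded file contains what you expect. The sketch below assumes the file holds one URL per line, which is what the JavaScript snippet produces by joining the URLs with `'\n'`. `read_urls` is a hypothetical helper, not a fastai function.

```python
from pathlib import Path


def read_urls(csv_path):
    """Read one URL per line, dropping blanks, non-URLs and duplicates (order kept)."""
    seen = set()
    urls = []
    for line in Path(csv_path).read_text().splitlines():
        url = line.strip()
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

For example, `len(read_urls(path/file))` should be close to the number of images you scrolled past in the search results.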

## Download 1st set of images
Let's download our images! fastai has a function, `download_images`, that downloads each of these images from its URL. To use it, specify the URL file and the destination folder. The function downloads and saves every image that can be opened; images that fail to open are not saved.

You can cap the number of images downloaded with `max_pics`. We'll download at most 300 of the up to 700 image URLs we collected.

In [0]:
classes = ['olaf','abominable']

In [0]:
# If you have problems downloading, try `max_workers=0` to see exceptions
# download_images(path/file, dest, max_pics=20, max_workers=0)
download_images(path/file, dest, max_pics=300)

In [0]:
# Remove any images that can't be opened.
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)
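
Once unreadable files have been removed, it is worth checking how many images are left in each class, since a badly unbalanced dataset can mislead training. This is a minimal sketch using only `pathlib`; `count_images` and the extension list are assumptions made for illustration, not part of fastai.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your downloads.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images(root, classes):
    """Return {class_name: number of image files in root/class_name}."""
    counts = {}
    for c in classes:
        folder = Path(root) / c
        if not folder.is_dir():
            counts[c] = 0  # folder has not been created yet
            continue
        counts[c] = sum(
            1 for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Calling `count_images(path, classes)` at this point should report 0 for `abominable` until the next section downloads those images.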

### Create 2nd set of images

In [0]:
folder = 'abominable'
file = 'urls_abominable.csv'

In [0]:
path = Path('data/snowman')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [0]:
path.ls()

In [0]:
classes = ['olaf','abominable']

In [0]:
download_images(path/file, dest, max_pics=300)

In [0]:
# Remove any images that can't be opened.
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)

## View data

In [0]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
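
`valid_pct=0.2` holds out a random 20% of the images for validation, and seeding the random generator first makes that selection repeatable between runs. The sketch below illustrates the idea with the standard library's `random` module. fastai relies on NumPy's generator instead, which is why the cell above calls `np.random.seed(42)`. `split_by_pct` is a hypothetical helper.

```python
import random


def split_by_pct(items, valid_pct=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed and hold out valid_pct of them."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(valid_pct * len(shuffled))
    return shuffled[cut:], shuffled[:cut]  # (train, valid)


files = [f"img_{i:03d}.jpg" for i in range(10)]
train, valid = split_by_pct(files)
print(len(train), len(valid))  # 8 2
```

Running the split twice with the same seed yields the same validation set, so error rates stay comparable across training runs.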

In [0]:
# If you already cleaned your data, run this cell instead of the one before
# np.random.seed(42)
# data = ImageDataBunch.from_csv(".", folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
#         ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [0]:
# Let's take a look at some of the images in our snowman data set.
data.classes

In [0]:
data.show_batch(rows=3, figsize=(7,8))

In [0]:
data.classes, data.c, len(data.train_ds), len(data.valid_ds)

## Train model

In [0]:
# `create_cnn` was renamed to `cnn_learner` in later fastai v1 releases
learn = cnn_learner(data, models.resnet34, metrics=error_rate)

In [0]:
learn.fit_one_cycle(4)

In [0]:
learn.save('stage-1')

In [0]:
learn.unfreeze()

In [0]:
learn.lr_find()

In [0]:
learn.recorder.plot()

In [0]:
learn.fit_one_cycle(2, max_lr=slice(3e-5,3e-4))

In [0]:
learn.save('stage-2')

## Interpretation

In [0]:
learn.load('stage-2');

In [0]:
interp = ClassificationInterpretation.from_learner(learn)

In [0]:
interp.plot_confusion_matrix()
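
A confusion matrix counts, for each actual class, how often the model predicted each class. The sketch below builds one in plain Python from made-up labels. The numbers are for illustration only and are not results from the model trained above, and `confusion_matrix` is a hypothetical helper rather than fastai's implementation.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


classes = ["abominable", "olaf"]
# Made-up labels for illustration only, not output from the model above.
actual = ["olaf", "olaf", "abominable", "abominable", "olaf"]
predicted = ["olaf", "abominable", "abominable", "abominable", "olaf"]
cm = confusion_matrix(actual, predicted, classes)
print(cm)  # [[2, 0], [1, 2]]
```

The diagonal holds the correct predictions. Here one Olaf image was mistaken for the abominable snowman, so the error rate is 1 out of 5.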

## Cleaning Up

Some of our top losses may not be due to bad performance by our model. There could be images in our dataset that don't belong there. Using the `ImageCleaner` widget from `fastai.widgets` we can prune our top losses, removing photos that don't belong.

In [0]:
from fastai.widgets import *
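
To make "top losses" concrete: a sample has a high loss when the model assigns a low probability to its label, which happens both when the model is wrong and when the label is wrong. The sketch below ranks samples by cross-entropy loss using made-up probabilities. `top_losses` is a hypothetical helper and does not reproduce how fastai computes or displays them.

```python
import math


def top_losses(probs, targets, k=3):
    """Return (index, loss) pairs for the k samples with the highest cross-entropy loss.

    probs[i][c] is the predicted probability of class c for sample i;
    targets[i] is the index of the true class.
    """
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    ranked = sorted(enumerate(losses), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up predictions for four images; classes are 0 = abominable, 1 = olaf.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.05, 0.95]]
targets = [0, 1, 1, 0]
worst = top_losses(probs, targets, k=2)
print([i for i, _ in worst])  # [3, 2]
```

Images at the top of such a ranking are the first candidates to inspect and, if they don't belong in the dataset, to remove.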