### Run in terminal

`pip uninstall enum34`

In [1]:
!pip install fastai==1.0.61

Collecting fastai==1.0.61
  Downloading fastai-1.0.61-py3-none-any.whl (239 kB)
Collecting dataclasses; python_version < "3.7"
  Downloading dataclasses-0.7-py3-none-any.whl (18 kB)
Collecting fastprogress>=0.2.1
  Downloading fastprogress-1.0.0-py3-none-any.whl (12 kB)
Collecting numexpr
  Downloading numexpr-2.7.1-cp36-cp36m-manylinux1_x86_64.whl (162 kB)
Collecting bottleneck
  Downloading Bottleneck-1.3.2.tar.gz (88 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: bottleneck
  Building wheel for bottleneck (PEP 517) ... done
  Created wheel for bottleneck: filename=Bottleneck-1.3.2-cp36-cp36m-linux_x86_64

In [21]:
import os  # used in Step 4 to delete each URL file after its images are downloaded
import pathlib
from fastai.vision import *
from azureml.core import Workspace, Datastore, Dataset

# **1.0 Create your own image classifier - Dataset Creation**

by: Paula Tattam. Adapted from fastai [Lesson 1](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson1-pets.ipynb) and [Lesson 2](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson2-download.ipynb)

In this workshop you will create your own image classification dataset using Google Images. You will then build and train your own image classifier using the [fastai V1 library](https://www.fast.ai/2018/10/02/fastai-ai/). fastai is a Python machine learning library built on top of the popular [PyTorch v1.0](https://engineering.fb.com/ai-research/facebook-accelerates-ai-development-with-new-partners-and-production-capabilities-for-pytorch-1-0/) machine learning framework.

fastai lets you rapidly build and train your own machine learning models by applying transfer learning to a range of current state-of-the-art models.

# **Step 1: Pick a classification task**
For step 1, make up an image classification task. It can be on any topic of your choice, but the images will need to be available through [Google Images](https://images.google.com/?gws_rd=ssl). For example:

*   Disney character classifier
*   Hotdogs or legs
*   Big cat classifier (tigers, lions, cheetahs, etc...)

Please try to keep it PG, and don't pick too many different classes, as you will need to repeat the step below for each class.

Google image search allows you to exclude certain words from a search, combine searches, and perform a number of other operations.

For example, to search for dogs but exclude wolves, use the `-` operator:

`dog -wolves -wolf`

See more options [here](https://support.google.com/websearch/answer/2466433?visit_id=637175902163553047-3698874010&p=adv_operators&hl=en&rd=1).
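
If you plan to search for several classes, it can help to build the query strings programmatically. The snippet below is a minimal sketch; `build_query` is a hypothetical helper written for this notebook, not part of fastai or any Google API. It only assembles the text you would paste into the search box.

```python
def build_query(term, exclude=()):
    """Build a search string that excludes each word in `exclude` using the `-` operator."""
    parts = [term] + [f"-{word}" for word in exclude]
    return " ".join(parts)


# Reproduces the example query from the text above.
print(build_query("dog", exclude=["wolves", "wolf"]))
```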






# **Step 2: Download URLs**

You will need to save the image URLs to a file. This can be done with a small snippet of JavaScript. Open the JavaScript console in either Chrome or Firefox as follows:

* Chrome: `ctrl+shift+j` (macOS: `Cmd+Opt+j`)
* Firefox: `ctrl+shift+k` (macOS: `Cmd+Opt+k`)

This will open up a window where you will paste the code snippet below. Before you paste the code, scroll down in your search results window a few times to load more images. Only the URLs of images that are currently displayed in the search results will be copied.

```javascript
urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```

Repeat this step for each classification category that you have chosen. Once a file is downloaded, rename it using the following convention:

`urls_<label>.csv`

For example, if you are building a Disney classifier you would name the files as follows:

`urls_mickey.csv, urls_minnie.csv etc...`
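
The exported files sometimes contain blank lines or repeated URLs. The sketch below shows one way to derive the expected file name for a label and tidy a file before using it. Both `url_filename` and `clean_url_file` are hypothetical helpers written for illustration, and the URLs in the demo are made up; neither function is part of fastai.

```python
import tempfile
from pathlib import Path


def url_filename(label):
    """Return the CSV name expected for a label, e.g. 'urls_mickey.csv'."""
    return f"urls_{label}.csv"


def clean_url_file(csv_path):
    """Drop blank lines and duplicate URLs in place, keeping the original order."""
    seen = set()
    urls = []
    for line in Path(csv_path).read_text().splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    Path(csv_path).write_text("\n".join(urls) + "\n")
    return urls


# Demo on a throwaway file so nothing in your data directory is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / url_filename("mickey")
    demo.write_text("https://example.com/a.jpg\n\nhttps://example.com/a.jpg\n")
    print(demo.name, clean_url_file(demo))
```

Running a clean-up like this is optional; it simply avoids spending part of the download limit on repeated URLs.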

# **Step 3: Create directories and upload files**

Choose an appropriate name for your directory and create a list of your class labels. Edit the below cells as noted and run.


In [12]:
# UPDATE ME: add your labels as per the label used for the csv file
labels = ["zoro", "sanji"]

In [13]:
# UPDATE ME: name as per your classification task
name = "one_piece_crew"

In [14]:
# Root directory for this task; one sub-directory is created per label
path = Path(f'data/{name}')
for label in labels:
  dest = path/label
  dest.mkdir(parents=True, exist_ok=True)

In [16]:
path.ls()

[PosixPath('data/one_piece_crew/sanji'), PosixPath('data/one_piece_crew/zoro')]

Lastly, upload the CSV files. Open the side menu, press 'Upload' and select your files. Don't forget to move them into the directory created above (`data/one_piece_crew` in this example), since the next step reads them from there.
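
Before moving on, it can be worth checking that every URL file ended up in the right place. The following is a minimal sketch; `missing_url_files` is a hypothetical helper, and the directory and labels in the example call simply mirror the values used in this notebook.

```python
from pathlib import Path


def missing_url_files(data_dir, labels):
    """Return the expected 'urls_<label>.csv' names that are not present in data_dir."""
    data_dir = Path(data_dir)
    expected = [f"urls_{label}.csv" for label in labels]
    return [name for name in expected if not (data_dir / name).is_file()]


# An empty list means every label has its URL file in place.
print(missing_url_files("data/one_piece_crew", ["zoro", "sanji"]))
```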

# **Step 4: Download images**

Next, download the images for each label. Luckily, fastai has a function designed specifically for this. As long as you followed the naming convention above for the CSV files, this block of code should just work.

In this example, we set the image download limit to 200.

In [18]:
for label in labels:
  filename = f"urls_{label}.csv"
  dest = path/label
  download_images(path/filename, dest, max_pics=200)
  os.remove(path/filename)

Next, remove any images that cannot be opened. The following block of code does this for us; with `max_size=500` it also resizes larger images so that neither side exceeds 500 pixels.

In [19]:
for label in labels:
  print(label)
  verify_images(path/label, delete=True, max_size=500)

zoro


sanji
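
After verification some images will have been deleted, so it is useful to see how many remain per class. The sketch below counts files by extension. `count_images` and the `IMAGE_SUFFIXES` set are assumptions made for illustration and are not fastai functions; adjust the suffix list if your downloads use other formats.

```python
from pathlib import Path

# Assumed set of image extensions; extend it if needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images(data_dir, labels):
    """Return {label: number of image files found in data_dir/label}."""
    counts = {}
    for label in labels:
        folder = Path(data_dir) / label
        if not folder.is_dir():
            counts[label] = 0  # a missing folder is reported as zero images
            continue
        counts[label] = sum(
            1 for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


print(count_images("data/one_piece_crew", ["zoro", "sanji"]))
```

A class with very few remaining images is a hint to go back to Step 2 and collect more URLs for it.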


# **Step 5: Create Dataset in Azure**

Next you will create a dataset in Azure from your downloaded images.

First, you need to upload these files to the Azure blob store. Azure automatically creates a default blob store for you to use when a workspace is created. You can find its name by navigating to the Azure Machine Learning studio.

In [22]:
datastore_name = 'workspaceblobstore'

In [23]:
workspace = Workspace.from_config()

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code PQ65SRJXQ to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.


In [24]:
datastore = Datastore.get(workspace, datastore_name)

Azure has two types of datasets:

* FileDatasets - for images or videos
* TabularDatasets - for structured data (e.g. CSV files, SQL tables, etc...)

For this task we will need to create a FileDataset.

In [46]:
src_dir = 'data'

In [47]:
%%capture
datastore.upload(src_dir)

In [57]:
datastore_paths = [(datastore, name)]
image_dataset = Dataset.File.from_files(path=datastore_paths)

In [58]:
# UPDATE ME: name as per your classification task
dataset_name = "OnePiece"
description = "Image dataset for anime one piece characters. Only includes Sanji and Zoro"

In [59]:
image_dataset.register(
    workspace=workspace,
    name=dataset_name,
    description=description,
)

{
  "source": [
    "('workspaceblobstore', 'one_piece_crew')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "37f88cdc-96f6-4e00-879d-845f4d379cdf",
    "name": "OnePiece",
    "version": 1,
    "description": "Image dataset for anime one piece characters. Only includes Sanji and Zoro",
    "workspace": "Workspace.create(name='ml-masterclass-ws', subscription_id='e36c4f51-a63e-4dd2-845f-26e8fea75d45', resource_group='ml-masterclass-rg')"
  }
}

Navigate to the Azure Machine Learning studio to see your newly created dataset.
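
You can also check the registration from code. The sketch below assumes the azureml-core (SDK v1) package used earlier in this notebook and relies on `Dataset.get_by_name` and `FileDataset.to_path`. `list_registered_files` is a hypothetical wrapper written for illustration, and the exact paths it returns depend on how the files were uploaded.

```python
def list_registered_files(workspace, dataset_name):
    """Fetch a registered FileDataset by name and return the file paths it contains."""
    # Imported inside the function so the helper can be defined
    # even where azureml-core is not installed.
    from azureml.core import Dataset

    dataset = Dataset.get_by_name(workspace, name=dataset_name)
    return dataset.to_path()


# Example call, reusing the variables defined in the cells above:
# files = list_registered_files(workspace, dataset_name)
# print(len(files), files[:5])
```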