<a href="https://colab.research.google.com/github/ptats/ml101-grad-workshop/blob/master/notebooks/Image_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Create your own image classifier - including the dataset**

by: Paula Tattam. Adapted from fastai [Lesson 1](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson1-pets.ipynb) and [Lesson 2](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson2-download.ipynb)

In this workshop you will create your own image classification dataset using Google Images. You will then build and train your own image classifier using the [fastai V1 library](https://www.fast.ai/2018/10/02/fastai-ai/). fastai is a Python machine learning library built on top of the popular [PyTorch v1.0](https://engineering.fb.com/ai-research/facebook-accelerates-ai-development-with-new-partners-and-production-capabilities-for-pytorch-1-0/) machine learning framework.

fastai allows you to rapidly build and train your own machine learning models utilising transfer learning from a range of current state-of-the-art models.

In [None]:
# run once
!curl -s https://course.fast.ai/setup/colab | bash

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"  # plain space: the backslash escape is only needed in shell commands

In [None]:
from fastai.vision import *

# **Step 1: Pick a classification task**
For step 1, make up an image classification task. It can be any topic of your choice, but the images will need to be available through [Google Images](https://images.google.com/?gws_rd=ssl). For example:

*   Disney character classifier
*   Hotdogs or legs
*   Big cat classifier (tigers, lions, cheetahs, etc...)

Please try to keep it PG, and don't pick too many different classes, as you will need to repeat the step below for each class.

Google image search allows you to exclude certain words from a search, combine searches, and perform a number of other operations.

For example, to search for dogs but exclude wolves, use the `-` operator:

`dog -wolves -wolf`

See more options [here](https://support.google.com/websearch/answer/2466433?visit_id=637175902163553047-3698874010&p=adv_operators&hl=en&rd=1).






# **Step 2: Download URLs**

You will need to download each image URL to a file. This can be done by using a small snippet of JavaScript. Open the JavaScript console in either Chrome or Firefox as follows:

* Chrome: `ctrl+shift+j` (macOS: `Cmd+Opt+j`)
* Firefox: `ctrl+shift+k` (macOS: `Cmd+Opt+k`)

This will open up a window where you will paste the below code snippet. Before you paste the code, scroll down in your search results window a few times to load more images. Only the URLs of the images currently displayed in the search results will be copied.

```javascript
urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```

Repeat this step for each classification category that you have chosen. Once each file is downloaded, rename it as per the following convention:

`urls_<label>.csv`

For example, if you are building a Disney classifier you would name the files as follows:

`urls_mickey.csv, urls_minnie.csv etc...`
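
The naming convention above can be checked with a few lines of plain Python before moving on. This is only a sketch: the helper names `expected_url_files` and `missing_url_files` and the example labels are made up for illustration and are not part of fastai.

```python
from pathlib import Path

def expected_url_files(labels, folder="."):
    """Map each label to the CSV path expected by the urls_<label>.csv convention."""
    return {label: Path(folder) / f"urls_{label}.csv" for label in labels}

def missing_url_files(labels, folder="."):
    """Return the labels whose URL file is not present in `folder` yet."""
    expected = expected_url_files(labels, folder)
    return [label for label, csv_path in expected.items() if not csv_path.exists()]

# Hypothetical labels for a Disney classifier
example_labels = ["mickey", "minnie"]
print(expected_url_files(example_labels))
print(missing_url_files(example_labels))
```

Running `missing_url_files` with your own labels and folder should return an empty list once every file has been renamed and put in place.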

# **Step 3: Create directories and upload files**

Choose an appropriate name for your directory and create a list of your class labels. Edit the below cells as noted and run.



In [None]:
# UPDATE ME: add your labels as per the label used for the csv file
labels = ["zoro", "sanji", "nami", "brook", "robin", "chopper", "frankie", "luffy", "usopp"]

In [None]:
# UPDATE ME: name as per your classification task
name = "one_piece_crew"

In [None]:
# Create one sub-directory per label inside data/<name>
path = Path(f'data/{name}')
for label in labels:
  dest = path/label
  dest.mkdir(parents=True, exist_ok=True)

In [None]:
path.ls()

Lastly, we upload the CSV files. Open the side menu, press 'Upload' and select your files. Don't forget to move them into the directory created above.

![Upload Images](https://github.com/ptats/ml101-grad-workshop/blob/master/images/Notebook_img1_LI.jpg)

# **Step 4: Download images**

Next you will need to download the images for each label. Luckily, fast.ai has a function specifically designed for this. As long as you followed the naming convention above for the CSV file, this block of code should just work.

In this example, we set the image download limit to 200.

In [None]:
for label in labels:
  filename = f"urls_{label}.csv"
  dest = path/label
  download_images(path/filename, dest, max_pics=200)
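
To make the download limit concrete, here is a minimal sketch of reading a URL file and keeping at most `max_pics` entries. It is an illustration in plain Python, not fastai's implementation of `download_images`; the helper name `read_urls` and the example URLs are hypothetical.

```python
def read_urls(lines, max_pics=200):
    """Return up to `max_pics` non-empty URLs from an iterable of text lines."""
    urls = [line.strip() for line in lines if line.strip()]
    return urls[:max_pics]

# Hypothetical file contents: one URL per line, with a stray blank line
sample_lines = ["https://example.com/a.jpg\n", "\n", "https://example.com/b.jpg\n"]
print(read_urls(sample_lines, max_pics=1))
```

If a search returned fewer URLs than the limit, all of them are kept.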

Next, you will remove any images that cannot be opened using the following block of code.

In [None]:
for label in labels:
  print(label)
  verify_images(path/label, delete=True, max_size=500)

# **Step 5: Data View**

In this section you will create a `DataBunch`, an object that is used for model training in the fast.ai library. The databunch is unique in that as the images are loaded, all the required pre-processing is completed as well. This includes:

*   resizing
*   pixel normalisation
*   flipping, rotation, zoom
*   contrast changes
*   symmetric warping

Some of the default image transformations may not make sense depending on the application. For example, we would not flip an image of a cat vertically but for satellite imagery this would make sense.
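
The difference between the two kinds of flip can be seen on a tiny "image" stored as a list of pixel rows. This is a plain Python sketch for intuition only; fastai applies its transforms to image tensors, and in fastai v1 vertical flips are controlled by the `flip_vert` argument of `get_transforms`.

```python
def flip_horizontal(image):
    """Mirror left-to-right: reverse the pixels within every row."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror top-to-bottom: reverse the order of the rows."""
    return image[::-1]

# A made-up 2x2 image with one number per pixel
tiny = [[1, 2],
        [3, 4]]
print(flip_horizontal(tiny))  # [[2, 1], [4, 3]]
print(flip_vertical(tiny))    # [[3, 4], [1, 2]]
```

A horizontally flipped cat still looks like a cat, whereas a vertically flipped one is upside down, which is why the vertical flip only suits data with no natural "up", such as satellite imagery.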

Once you create your databunch, you can inspect a batch of the data that has been loaded and do a sanity check.

In [None]:
data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2, ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
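
The `.normalize(imagenet_stats)` call above performs the pixel normalisation step: each colour channel has the ImageNet mean subtracted and is divided by the ImageNet standard deviation, because the pre-trained model was trained on images prepared that way. The sketch below shows the arithmetic for a single RGB pixel in plain Python. The helper name `normalise_pixel` is made up for illustration, and pixel values are assumed to be scaled to the range 0 to 1.

```python
# ImageNet per-channel statistics (red, green, blue) for pixel values in the range 0-1
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalise_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Subtract the channel mean, then divide by the channel standard deviation."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]

# A pixel equal to the mean colour maps to zero in every channel
print(normalise_pixel([0.485, 0.456, 0.406]))  # [0.0, 0.0, 0.0]
```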

In [None]:
data.classes, data.c, len(data.train_ds), len(data.valid_ds)

In [None]:
data.show_batch(rows=4, figsize=(7,6))

# **Step 6: Train Model**

With the data loaded and ready, you can now create your deep learning model. For this example, you will use a pre-trained convolutional neural network called ResNet34. Don't worry if you don't know what this means. For now, all you need to know is that this model takes the images as input and outputs a probability for each label that you have created. You will train this model for 4 epochs. An epoch is one complete pass through the training dataset.
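
To get a feel for what 4 epochs means in practice, the sketch below counts how many batches the model sees. The dataset size and batch size are made-up numbers, the helper names are hypothetical, and the count assumes every image is used, whereas some data loaders drop the last incomplete batch.

```python
import math

def batches_per_epoch(num_images, batch_size):
    """Number of batches needed to show every image once."""
    return math.ceil(num_images / batch_size)

def total_steps(num_images, batch_size, epochs):
    """Total number of weight updates over all epochs."""
    return batches_per_epoch(num_images, batch_size) * epochs

# Hypothetical numbers: 1000 training images, batches of 64, 4 epochs
print(batches_per_epoch(1000, 64))  # 16
print(total_steps(1000, 64, 4))     # 64
```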

In [None]:
learn = cnn_learner(data, models.resnet34, metrics=error_rate)

In [None]:
learn.model

In [None]:
learn.fit_one_cycle(4)

In [None]:
learn.save('stage-1')

# **Step 7: Fine tuning and learning rates**

Once your model is working as expected, you can train it some more and find the best learning rate to use. The learning rate is a hyperparameter that controls how large a step the model takes each time it updates its weights.
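
The effect of the learning rate is easiest to see on a toy problem. The sketch below uses plain gradient descent to minimise `f(w) = w**2`, which is a stand-in for a real loss function and not what fastai runs internally. The function name and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimise f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    w = start
    for _ in range(steps):
        w = w - lr * 2 * w  # step against the gradient, scaled by the learning rate
    return w

for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With the smallest rate `w` barely moves from 5, with the middle rate it gets close to the minimum at 0, and with the largest rate each step overshoots and `w` grows instead of shrinking. The learning rate finder below helps you pick a value in the useful middle range.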

In [None]:
learn.unfreeze()

In [None]:
learn.lr_find(start_lr=1e-7, end_lr=1)

In [None]:
learn.recorder.plot()

In [None]:
# Update the max_lr values based on your plot above
learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))
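
Passing a `slice` as `max_lr` asks fastai to use different learning rates for different parts of the network: the earliest layers get the lowest rate and the final layers the highest. As far as we understand fastai v1, the groups in between are spaced evenly on a log scale. The sketch below reproduces that spacing in plain Python; the helper name `spread_learning_rates` and the choice of three layer groups are assumptions for illustration.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Return `n_groups` learning rates evenly spaced on a log scale."""
    if n_groups == 1:
        return [highest]
    ratio = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * ratio ** i for i in range(n_groups)]

# Same range as the cell above, assuming three layer groups
print(spread_learning_rates(1e-6, 1e-4, 3))  # roughly [1e-06, 1e-05, 1e-04]
```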

In [None]:
learn.save('stage-2')

# **Step 8: Inspecting the Results**

Next you will inspect the results of your model. Here you will get to see:

*   Which categories the model confused with one another the most
*   Whether the misclassifications were reasonable or not

And lastly, you can view the confusion matrix to understand the distribution of errors the model makes.



In [None]:
interp = ClassificationInterpretation.from_learner(learn)
losses, idxs = interp.top_losses()
len(data.valid_ds)==len(losses)==len(idxs)

In [None]:
interp.plot_top_losses(25, figsize=(15,11))
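
A "top loss" is simply an image the model got badly wrong. For a classifier, the usual loss is cross-entropy: the negative log of the probability given to the correct label, so a confident wrong answer scores very high. The sketch below ranks a few made-up predictions this way. It is a plain Python illustration, and the function name and probabilities are hypothetical.

```python
import math

def top_losses(true_class_probs, k=3):
    """Rank samples by cross-entropy loss, -log(p of the true class), highest first."""
    losses = [(-math.log(p), idx) for idx, p in enumerate(true_class_probs)]
    return sorted(losses, reverse=True)[:k]

# Hypothetical probabilities the model assigned to the correct label of four images
probs = [0.95, 0.10, 0.60, 0.02]
for loss, idx in top_losses(probs, k=2):
    print(idx, round(loss, 2))
```

Image 3, where the correct label received only 2% probability, comes out on top, followed by image 1.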

In [None]:
interp.plot_confusion_matrix()

In [None]:
interp.most_confused(min_val=2)
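
The idea behind `most_confused` can be sketched by counting the (actual, predicted) pairs that disagree and keeping those that occur at least `min_val` times. The version below is a plain Python illustration with made-up labels and predictions, not fastai's own code.

```python
from collections import Counter

def most_confused(actual, predicted, min_val=1):
    """Count (actual, predicted) pairs that disagree, most frequent first."""
    errors = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return [(a, p, n) for (a, p), n in errors.most_common() if n >= min_val]

# Hypothetical validation labels and model predictions
actual    = ["zoro", "zoro", "zoro", "sanji", "sanji", "nami"]
predicted = ["zoro", "sanji", "sanji", "sanji", "zoro", "nami"]
print(most_confused(actual, predicted, min_val=2))  # [('zoro', 'sanji', 2)]
```

Each tuple reads as "actual label, predicted label, number of times it happened", which is the same information as the off-diagonal cells of the confusion matrix.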

# **Step 9: Data clean up**

Some of the top losses are not due to our model performing badly. Instead, they are due to incorrect images in our training data. fast.ai has an `ImageCleaner` widget that makes cleaning this kind of thing up super easy.

Unfortunately, this widget does not work with JupyterLab or Google Colab. To run the following lines of code you need to run this notebook locally. You can download a copy of this notebook from Google Colab or clone the GitHub repo. Make sure you complete the 'nice to have' prerequisites defined in the GitHub readme.

Once you have the notebook downloaded locally:

* Run the `from fastai.vision import *` cell
* Repeat Steps 3 & 4 above
* Download the saved model export and save it in the 'models' directory

Once the above steps are complete, you can run the below cells. You will:

* view images with the highest losses and delete or re-label
* find images that are potential duplicates in the dataset

In [None]:
from fastai.widgets import *

In [None]:
db = (ImageList.from_folder(path)
      .split_none()
      .label_from_folder()
      .transform(get_transforms(), size=224)
      .databunch())

In [None]:
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)
learn_cln.load('stage-1')

In [None]:
ds, idx = DatasetFormatter().from_toplosses(learn_cln)

In [None]:
ImageCleaner(ds, idx, path)

When you are satisfied with your cleaning or reach the end of the available data, you can move on to finding potential duplicates. A `cleaned.csv` file will have been created for you by the above step, and you will need to use this file for the next step.

Using pandas, we can preview the contents of the newly created file to see what has been created.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(path/'cleaned.csv')  # cleaned.csv sits in your data folder, e.g. data/one_piece_crew

In [None]:
df.head()

Now we can load the new csv file and use this to find duplicate images.

In [None]:
db = (ImageList.from_csv(path, 'cleaned.csv', folder='.')
       .split_none()
       .label_from_df()
       .transform(get_transforms(), size=224)
       .databunch()
)

This next cell might take some time to run as we need to run the images through the model and use the outputs from the last layer to calculate the similarities.
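
To give a feel for what "calculating the similarities" means, the sketch below compares made-up feature vectors with cosine similarity and returns the closest pair. It is a plain Python illustration under the assumption that similarity is measured by the angle between activation vectors. The function names and numbers are hypothetical, and real activations have hundreds of values per image.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar_pairs(features, k=1):
    """Return the k most similar (similarity, i, j) pairs of distinct images."""
    pairs = [(cosine_similarity(features[i], features[j]), i, j)
             for i in range(len(features))
             for j in range(i + 1, len(features))]
    return sorted(pairs, reverse=True)[:k]

# Hypothetical last-layer outputs for three images; images 0 and 2 are near-duplicates
feats = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [1.1, 0.0, 2.0]]
print(most_similar_pairs(feats))
```

Every image is compared with every other image, which is why this step slows down quickly as the dataset grows.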

In [None]:
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)

learn_cln.load('stage-1');

In [None]:
ds, idxs = DatasetFormatter().from_similars(learn_cln, pool=None)

In [None]:
ImageCleaner(ds, idxs, path, duplicates=True)

After this, you would go back and train a new model on the cleaned data to see if you can get some performance improvements.