# Session 3

This lecture covers datasets and how to prepare them for training, evaluation, and testing.


## Learning Objectives:
* Building a dataset
    * Downloading images from a Google service
    * Preparing the dataset folder structure
    * Collecting images from a camera or by sketching them
* Evaluating training performance

## Practical Examples:
* Build a general image classifier
* Build an audio recognition system

## Required Libraries:
__Scikit-learn__: A useful library for data processing. We use it here to split the data into training and validation datasets.

In [None]:
!conda install scikit-learn -y

---
### Building Dataset

Preparing a dataset for training can be a troublesome and time-consuming process. To speed it up, we can use online websites and search engines to build an initial dataset for training.

For this session, we will use the Google Images search engine to retrieve images matching specific keywords.
<br>Google Images Downloader: [GoogleImageDownload.ipynb](./Python/GoogleImageDownload.ipynb)

That notebook helps you retrieve a chosen number of images for the selected keywords. Its running time depends on the network connection and the number of entries per category.

Another way is to download an existing dataset online. One of the best-known websites for data science datasets is [Kaggle](https://www.kaggle.com/).

When preparing a dataset, it's important to organize it consistently. The simplest way is to use a folder structure in which each label is a folder containing that label's samples. For example:
<br>__Shoes__
<br>|-__children__
<br>|----|- imgA.jpg
<br>|----|- imgB.jpg
<br>|----| ....
<br>|----|- imgN.jpg
<br>|-__men__
<br>|----|- imgA.jpg
<br>|----|- imgB.jpg
<br>|----| ....
<br>|----|- imgN.jpg
<br>|-__women__
<br>|----|- imgA.jpg
<br>|----|- imgB.jpg
<br>|----| ....
<br>|----|- imgN.jpg
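A folder layout like the one above can be turned into a list of (image, label) pairs with a few lines of Python. This is only a minimal sketch: the function name `list_samples` and the set of accepted file extensions are assumptions made for illustration, not part of the course notebooks.

```python
from pathlib import Path

# Assumed set of image file types; extend it to match your own dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_samples(dataset_dir):
    """Return (image_path, label) pairs, one per image file.

    Each sub-folder name of dataset_dir is treated as a label.
    """
    samples = []
    for label_dir in sorted(Path(dataset_dir).iterdir()):
        if not label_dir.is_dir():
            continue  # skip stray files next to the label folders
        for image_path in sorted(label_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                samples.append((image_path, label_dir.name))
    return samples
```

Calling `list_samples("Shoes")` on the example structure would yield pairs whose labels are `children`, `men`, and `women`.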


When you have a complex type of data to train on (for example, images combined with sensor readings), it's best to keep an Excel-like sheet that lists the values of each sample. For example:

| ID | Image | Color | Size | Style |
|---|---|---|---|---|
| Sample 1 | ImgA.jpg | Red | 27 | Western |
| Sample 2 | ImgB.jpg | Black | 29.5 | Asian |
| ... | ... | ... | ... | ... |
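If such a sheet is saved as CSV, it can be read with Python's standard `csv` module. The sketch below assumes the column names from the table above and parses the text from an in-memory string so that it runs on its own. The helper name `load_manifest` is hypothetical, and in practice you would pass an open file instead.

```python
import csv
import io

# Rows mirroring the example table; a real project would keep them in a CSV
# file next to the images (the file name is up to you).
MANIFEST = """ID,Image,Color,Size,Style
Sample 1,ImgA.jpg,Red,27,Western
Sample 2,ImgB.jpg,Black,29.5,Asian
"""


def load_manifest(text_stream):
    """Parse the sheet into a list of dicts, converting Size to float."""
    rows = []
    for row in csv.DictReader(text_stream):
        row["Size"] = float(row["Size"])  # numeric column, stored as text in CSV
        rows.append(row)
    return rows


rows = load_manifest(io.StringIO(MANIFEST))
for row in rows:
    print(row["Image"], row["Color"], row["Size"], row["Style"])
```

Each row then carries both the image file name and the extra values that belong to that sample.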



Some helpful examples that use Processing to send images to Python and help you build your dataset:
* Python Notebook for building the dataset: [OSCImageDataset.ipynb](./Python/OSCImageDataset.ipynb)
_<br>Make sure to specify the name of your dataset and the labels to be used_

__Processing streamers:__
* Camera Images: [Camera_Streamer.pde](./Processing/Camera_Streamer/Camera_Streamer.pde)
* Image Sketches: [Sketch_Streamer.pde](./Sketch_Streamer/Sketch_Streamer.pde)

---
### Evaluate performance of training

When training a network on a specific dataset, how can we tell whether it really generalizes to unseen data? The solution is simple: we split the dataset into two main parts:
* __Train & Validation dataset (75% of all data)__: This dataset is used for training. The training part (usually 80% of it) is used internally by the network to tune the layer weights, and the validation part (20%) is used to observe performance and tune the network hyperparameters (layer count, number of perceptrons, etc.). We also monitor the training and validation losses while the network is training.

We can use the following scikit-learn function to split the dataset (x_train, y_train) into two datasets: train and test (the function calls it "test", but here it serves as the validation dataset).


In [None]:
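from sklearn.model_selection import train_test_split

# test_size=0.1 holds out 10% of the samples for validation; use 0.2 for the 80/20 split described above
X_train, X_test, Y_train, Y_test = train_test_split(x_train, y_train, test_size=0.1)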

* __Test dataset (25% of all data)__: This dataset is used *after* the model has been trained. It gives us the overall performance of the network on unseen data and indicates how it will perform in practice.
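The two-stage split described by the percentages above can be sketched by calling `train_test_split` twice. The data here is made up for illustration, and the names `x_val`, `y_val`, `x_trainval`, and `y_trainval` are assumptions of this sketch (the course code calls the validation set `X_test`). `random_state=0` is only there to make the shuffle repeatable.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples with alternating labels 0/1 (made up for illustration).
x_all = list(range(100))
y_all = [i % 2 for i in x_all]

# Step 1: hold out 25% as the test set, touched only after training is done.
x_trainval, x_test, y_trainval, y_test = train_test_split(
    x_all, y_all, test_size=0.25, random_state=0
)

# Step 2: split the remaining 75% into training (80%) and validation (20%).
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.2, random_state=0
)

print(len(x_train), len(x_val), len(x_test))
```

With 100 samples this gives 60 for training, 15 for validation, and 25 for testing, and no sample appears in more than one set.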

In the audio and image classifier examples, the validation dataset is passed to the fit function:
<br>*model.fit(X_train,Y_train,__validation_data=(X_test,Y_test)__,epochs=100,batch_size=8)*

You will notice that when training, validation accuracy is also reported:

<img src="Images/training_metrics.png" width="80%">
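The same numbers are also available in code: Keras `model.fit` returns a `History` object whose `history` attribute is a dictionary of per-epoch lists (for example `loss` and `val_loss` when validation data is given). The sketch below works on such a dictionary. The helper names and the example numbers are made up for illustration and are not results from the course notebooks.

```python
def best_epoch(history):
    """Return the 1-based epoch with the lowest validation loss."""
    val_loss = history["val_loss"]
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1


def overfitting_gap(history):
    """Final validation loss minus final training loss.

    A large positive gap suggests the network fits the training data
    much better than the validation data.
    """
    return history["val_loss"][-1] - history["loss"][-1]


# Made-up numbers for illustration only.
example_history = {
    "loss": [0.90, 0.60, 0.40, 0.25, 0.15],
    "val_loss": [0.95, 0.70, 0.55, 0.60, 0.72],
}

print("best epoch:", best_epoch(example_history))
print("final gap:", round(overfitting_gap(example_history), 2))
```

With a real model you would pass `model.fit(...).history` instead of `example_history`. A validation loss that starts rising while the training loss keeps falling, as in these made-up numbers, is the usual sign of overfitting.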


---
### Visualize training performance using Tensorboard


TensorBoard is a great tool for visualizing how your model performs during training.

In [None]:
# Don't run this code here! It's just to show the newly added TensorBoard callback
from tensorflow.keras.callbacks import TensorBoard
from time import time

## .... Model created, data is pre-processed and split to training (X_train, Y_train) and validation (X_test,Y_test)
tensorboard = TensorBoard(log_dir="logs/{}".format(time()))
model.fit(X_train,Y_train,validation_data=(X_test,Y_test),epochs=100,batch_size=8, callbacks=[tensorboard])

Then, in the command line (or terminal), go to the notebook's folder and run this command:
__<br>tensorboard --logdir=logs/__

<img src="Images/tensorboard_terminal.png" width="50%">

Then, in the browser (chrome for example), access the following URL:

[http://localhost:6006](http://localhost:6006)

If the port shown in your terminal differs from the one in the picture (6006), change the port in the URL accordingly. You can compare multiple training sessions with different model hyperparameters (number of layers, number of perceptrons, etc.) or with and without normalization.
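When comparing several training sessions, it helps if each run's log folder name says which settings were used, since TensorBoard lists runs by folder name. The helper below is a hypothetical sketch: `make_log_dir` and its naming scheme are assumptions of this example, not part of TensorBoard.

```python
from datetime import datetime


def make_log_dir(base="logs", **hyperparams):
    """Build a run folder name such as logs/layers-2_units-64_20240101-120000."""
    parts = [f"{name}-{value}" for name, value in sorted(hyperparams.items())]
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")  # keeps repeated runs apart
    return "/".join([base, "_".join(parts + [stamp])])


print(make_log_dir(layers=2, units=64))
```

The result can be passed to the callback, for example `TensorBoard(log_dir=make_log_dir(layers=2, units=64))`, so that every run shows up under a descriptive name.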

<br>This picture shows the results after tuning the network parameters to achieve the best accuracy on both the training and validation datasets:

<img src="Images/tensorboard_metrics.png" width="100%">

---

## Practical Examples

### General Image Classifier:

<br>[ImageClassifier.ipynb](./Python/ImageClassifier.ipynb) allows you to train on a custom dataset of images


### Audio Recognizer:

<br>[AudioRecognizer.ipynb](./Python/AudioRecognizer.ipynb) allows you to train on a custom dataset of audio files. It also shows how to preprocess audio files for training.