# Session 3

This lecture covers datasets and how to prepare them for training, evaluation, and testing.


## Learning Objectives:
* Building a dataset
    * Downloading images from a Google service
    * Preparing the dataset folder structure
    * Collecting images from a camera or by sketching them
* Learning Model Output
* Evaluate performance of training

## Practical Examples:
* Build a general image classifier
* Build an audio recognition system

## Required Libraries:
__Scikit-learn__: A useful library for data processing. We use it here to split the data into training and validation datasets.

In [None]:
!conda install scikit-learn -y

---
## Building a Dataset

Preparing a dataset for training can be a troublesome and time-consuming process. To speed it up, we can use online websites and search engines to build an initial dataset for training.

For this session, we will use the Google Images search engine to retrieve images matching specific keywords.
<br>Google Images Downloader: [GoogleImageDownload.ipynb](./Python/GoogleImageDownload.ipynb)

That notebook helps you retrieve a chosen number of images for selected keywords. Running it takes time, depending on the network connection and the number of entries per category.

Another way is to download an existing dataset online; one of the best-known websites for data science is [Kaggle](https://www.kaggle.com/).

When preparing a dataset, it's important to organize it consistently. The simplest way is to use a folder structure that defines the labels and the samples per label. For example:
<br>__Shoes__
<br>|-__children__
<br>|----|- imgA.jpg
<br>|----|- imgB.jpg
<br>|----| ....
<br>|----|- imgN.jpg
<br>|-__men__
<br>|----|- imgA.jpg
<br>|----|- imgB.jpg
<br>|----| ....
<br>|----|- imgN.jpg
<br>|-__women__
<br>|----|- imgA.jpg
<br>|----|- imgB.jpg
<br>|----| ....
<br>|----|- imgN.jpg
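
A folder layout like this can be turned into sample paths and integer labels with a few lines of Python. The sketch below is only an illustration: the helper name `list_samples` and the accepted file extensions are assumptions, and the demo builds a throwaway copy of the layout with empty placeholder files instead of real images.

```python
import os
import tempfile

def list_samples(dataset_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return (paths, labels, label_names) for a folder-per-label dataset."""
    # Each sub-folder name is a label; sorting gives a stable label -> ID mapping
    label_names = sorted(
        d for d in os.listdir(dataset_dir)
        if os.path.isdir(os.path.join(dataset_dir, d))
    )
    paths, labels = [], []
    for label_id, name in enumerate(label_names):
        folder = os.path.join(dataset_dir, name)
        for fname in sorted(os.listdir(folder)):
            if fname.lower().endswith(extensions):
                paths.append(os.path.join(folder, fname))
                labels.append(label_id)
    return paths, labels, label_names

# Demo on a temporary copy of the "Shoes" layout (empty placeholder files)
with tempfile.TemporaryDirectory() as root:
    shoes = os.path.join(root, "Shoes")
    for label in ("children", "men", "women"):
        os.makedirs(os.path.join(shoes, label))
        for fname in ("imgA.jpg", "imgB.jpg"):
            open(os.path.join(shoes, label, fname), "w").close()

    paths, labels, label_names = list_samples(shoes)

print(label_names)  # ['children', 'men', 'women']
print(labels)       # [0, 0, 1, 1, 2, 2]
```

Sorting the folder names keeps the label IDs the same between runs, which matters when a trained model is reloaded later.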


When the samples are more complex (for example, images combined with sensor readings), it's best to keep a spreadsheet-like table that lists the values of each sample. For example:

|ID | Image|Color| Size | Style   |
|---|---|---|---|---|
| Sample 1 | ImgA.jpg | Red   | 27   | Western |
| Sample 2 | ImgB.jpg | Black | 29.5 | Asian   |
| ...      | ...      | ...   | ...  | ...     |
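
Such a sheet is easy to read back in Python if it is saved as a CSV file. The sketch below assumes the column names from the table above and uses an in-memory string in place of a real file, so the file contents here are purely hypothetical.

```python
import csv
import io

# Hypothetical sheet contents mirroring the table above
sheet = io.StringIO(
    "ID,Image,Color,Size,Style\n"
    "Sample 1,ImgA.jpg,Red,27,Western\n"
    "Sample 2,ImgB.jpg,Black,29.5,Asian\n"
)

samples = []
for row in csv.DictReader(sheet):
    row["Size"] = float(row["Size"])  # numeric columns arrive as text
    samples.append(row)

image_files = [s["Image"] for s in samples]
sizes = [s["Size"] for s in samples]

print(image_files)  # ['ImgA.jpg', 'ImgB.jpg']
print(sizes)        # [27.0, 29.5]
```

With a real file, `open("your_sheet.csv", newline="")` would replace the `io.StringIO` object.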



The number of features used is reflected in the __input_shape__ of the model's first layer:

In [None]:
#don't run this code here
input_len=28*28 #length of the feature vector used as input
model.add(layers.Dense(64,activation='relu',input_shape=(input_len,)))
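
To make the link between the data and `input_len` concrete, here is a small NumPy sketch. It assumes the features are 28x28 grayscale images, which is one possible reading of `28*28`; the array of zeros stands in for real image data.

```python
import numpy as np

# Hypothetical batch of 5 grayscale images, 28x28 pixels each
images = np.zeros((5, 28, 28), dtype=np.float32)

# Flatten every image into one row of 28*28 = 784 features
features = images.reshape(len(images), -1)
input_len = features.shape[1]

print(features.shape)  # (5, 784)
print(input_len)       # 784
```

Reading `input_len` from the data instead of typing it by hand avoids a mismatch between the dataset and the first layer.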


Some helpful examples that use Processing to send images to Python and help you build your dataset:
* Python Notebook for building the dataset: [OSCImageDataset.ipynb](./Python/OSCImageDataset.ipynb)
_<br>Make sure to specify the name of your dataset, and labels to be used_

__Processing streamers:__
* Camera Images: [Camera_Streamer.pde](./Processing/Camera_Streamer/Camera_Streamer.pde)
* Image Sketches: [Sketch_Streamer.pde](./Sketch_Streamer/Sketch_Streamer.pde)

---
## Learning Model Output

It's important to define the output of the learning model, and with it what the network is trying to learn. Two main types of output exist: discrete and continuous.

### Discrete Output
So far we have mainly covered classification problems, which are discrete learning problems: the outputs are labels from a predefined set [L1,L2,L3,...,Ln], such as classifying animals into categories (e.g. Cats vs Dogs).
For this type of problem, we give the network as many outputs as there are labels, for example 2 outputs for Cats vs Dogs, or 10 for digit image classification.

Labels are encoded using so-called [one-hot encoding](https://en.wikipedia.org/wiki/One-hot), which defines a 1D vector whose length equals the number of labels, with all values equal to zero except at the index equal to the label number, which is set to one. For example, for [cats:0, dogs:1] labels:

|Label Name | ID| One-hot-encoding   |
|---|---|---|
| Cats | 0  | 1 0 |
| Dogs | 1  | 0 1   |

And for classifying the 10 digits:

|ID| One-hot-encoding   |
|---|---|
|0  | 1 0 0 0 0 0 0 0 0 0 |
|1  | 0 1 0 0 0 0 0 0 0 0 |
|2  | 0 0 1 0 0 0 0 0 0 0 |
|3  | 0 0 0 1 0 0 0 0 0 0 |
|...|...|
|8  | 0 0 0 0 0 0 0 0 1 0 |
|9  | 0 0 0 0 0 0 0 0 0 1 |

We can use this function to help us convert labels to the encoding:

tensorflow.keras.utils.to_categorical(Y, nb_classes)


In [2]:
from tensorflow.keras import utils

Y=[1,0,2,3,1,0,2]
nb_classes=len(set(Y)) #number of unique labels (correct only when labels are 0..n-1 with no gaps)
print("Number of labels: {0}".format(nb_classes))


oneHotY=utils.to_categorical(Y, nb_classes)

for y1,y2 in zip(Y,oneHotY):
    print("{0} --> {1}".format(y1,y2))


Number of labels: 4
1 --> [0. 1. 0. 0.]
0 --> [1. 0. 0. 0.]
2 --> [0. 0. 1. 0.]
3 --> [0. 0. 0. 1.]
1 --> [0. 1. 0. 0.]
0 --> [1. 0. 0. 0.]
2 --> [0. 0. 1. 0.]
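
The example above works because the labels are already the integers 0 to 3. When labels are names, or integers with gaps, they need to be mapped to contiguous IDs first. The following NumPy sketch shows one way to do that; the label list is made up, and `np.eye` is used here as a plain stand-in for `to_categorical`.

```python
import numpy as np

raw_labels = ["dogs", "cats", "dogs", "birds"]  # hypothetical labels

# Map each distinct label to a contiguous ID (sorted for reproducibility)
label_names = sorted(set(raw_labels))            # ['birds', 'cats', 'dogs']
label_to_id = {name: i for i, name in enumerate(label_names)}
ids = np.array([label_to_id[name] for name in raw_labels])

# Row i of the identity matrix is the one-hot vector for label ID i
one_hot = np.eye(len(label_names))[ids]

print(ids)      # [2 1 2 0]
print(one_hot)
```

Keep `label_names` around after training: it is what turns a predicted index back into a readable label.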


The design of the model is also reflected in the last layer added to the model and in how the model is compiled. We use a Dense layer whose unit count equals the number of labels (nb_classes), and we set the __activation__ function to __'softmax'__. This function turns the outputs into a probability distribution over the labels, indicating which label is most likely.

In [None]:
#don't run this code here
model.add(layers.Dense(nb_classes,activation='softmax'))
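
To see what softmax does to the raw outputs of the last layer, here is a minimal NumPy version of the standard formula. The scores are made-up numbers, and this is a sketch of the math, not the Keras implementation itself.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])      # hypothetical raw outputs for 3 labels
probs = softmax(scores)
predicted_label = int(np.argmax(probs)) # index of the most likely label

print(probs)
print(predicted_label)  # 0
```

The largest score always gets the largest probability, so `argmax` on the softmax output gives the predicted label.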

For compilation, we set the __loss__ parameter to __'categorical_crossentropy'__, which measures the error between the predicted and the provided labels as categories instead of continuous values.

In [None]:
#don't run this code here
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
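
The loss can be worked out by hand to get a feeling for it. The sketch below follows the textbook definition of categorical cross-entropy in NumPy; the clipping constant `eps` is an assumption, and the Keras version may differ in such details. The predictions are made-up numbers.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)       # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])      # one-hot labels: cat, dog
confident = np.array([[0.9, 0.1], [0.2, 0.8]])   # mostly right predictions
unsure = np.array([[0.5, 0.5], [0.5, 0.5]])      # no preference at all

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(round(loss_confident, 3))  # about 0.164
print(round(loss_unsure, 3))     # about 0.693, i.e. log(2)
```

The closer the predicted probability of the correct label is to 1, the closer the loss gets to 0.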

### Continuous Output
The other common type of training uses continuous outputs, as in regression problems.
Here the output provides continuous values within a specific range, for example predicting temperature, wind speed, and humidity from a picture (it might even work!). In this case, the model has three continuous outputs:
* Temperature -20~50
* Wind speed    0~120
* Humidity       0~100

These outputs are reflected in the output layer, which should have as many perceptrons as there are outputs (3 in this case), with the __activation__ function set to __'linear'__:


In [None]:
#don't run this code here
nb_outputs=3
model.add(layers.Dense(nb_outputs,activation='linear'))

Also, the compile function should use a __loss__ function suitable for this type of problem. The most common are:
* 'mean_squared_error'
* 'mean_absolute_error'

They are specified in the __compile__ function as follows:

In [None]:
#don't run this code here
model.compile(optimizer='adam',loss='mean_squared_error',metrics=['mae'])
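
Both quantities are simple to compute by hand, which helps when reading the numbers reported during training. The targets and predictions below are made-up values for the three outputs, used only to illustrate the two formulas.

```python
import numpy as np

# Hypothetical targets and predictions: [temperature, wind speed, humidity]
y_true = np.array([[20.0, 10.0, 50.0],
                   [25.0, 30.0, 40.0]])
y_pred = np.array([[22.0, 10.0, 45.0],
                   [25.0, 20.0, 40.0]])

errors = y_pred - y_true
mse = float(np.mean(errors ** 2))      # penalizes large errors more strongly
mae = float(np.mean(np.abs(errors)))   # average error in the original units

print(mse)            # 21.5
print(round(mae, 3))  # 2.833
```

Here a single error of 10 accounts for most of the MSE, while the MAE treats all errors proportionally.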

---
## Evaluate performance of training

When training the network on a specific dataset, how can we tell whether it really generalizes to unseen data? The solution is simple: we split the dataset into two main parts:
* __Train & Validation dataset (75% of all data)__: This dataset is used for training. The training part of it (usually 80%) is used internally by the network to tune the weights of the layers, and the validation part (20%) is used to observe its performance and tune the network parameters (layer count, number of perceptrons, etc.). We also observe the loss of both the training and validation parts while the network is training.

We can use this function from the scikit-learn library to split the dataset (x_dataset,y_dataset) into two datasets: train and test (it's called test, but here it serves as the validation dataset).


In [7]:
#Generate some random samples
import numpy as np
x_dataset=np.random.rand(100,20)
y_dataset=np.random.rand(100)

#split samples into training and validation sets
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(x_dataset,y_dataset,test_size=0.2)

print("Dataset has a total of: {0} samples".format(len(x_dataset)))
print("Training using: {0} samples".format(len(X_train)))
print("Validating using: {0} samples".format(len(X_test)))

Dataset has a total of: 100 samples
Training using: 80 samples
Validating using: 20 samples


* __Test dataset (25% of all data)__: This one is used *after* the model has been trained. It gives us the overall performance of the network on unseen data, and indicates how it will perform in reality.
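
Putting both parts together gives three groups of samples: train, validation, and test. The sketch below does the full split with plain NumPy index shuffling instead of `train_test_split`, using the 75%/25% and 80%/20% ratios described here. The variable names `X_val` and `X_holdout` are assumptions, chosen to avoid confusion with the `X_test` name used for validation data elsewhere in this session.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the split is repeatable
x_dataset = rng.random((100, 20))
y_dataset = rng.random(100)

# Shuffle sample indices once, then cut them into three groups
indices = rng.permutation(len(x_dataset))
n_test = len(indices) // 4                # 25% of all data
n_val = (len(indices) - n_test) // 5      # 20% of the remaining 75%

test_idx = indices[:n_test]
val_idx = indices[n_test:n_test + n_val]
train_idx = indices[n_test + n_val:]

X_train, Y_train = x_dataset[train_idx], y_dataset[train_idx]
X_val, Y_val = x_dataset[val_idx], y_dataset[val_idx]
X_holdout, Y_holdout = x_dataset[test_idx], y_dataset[test_idx]

print(len(X_train), len(X_val), len(X_holdout))  # 60 15 25
```

The held-out test samples should stay untouched until the model and its parameters are final.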

In the audio and image classifier examples, the validation dataset is passed to the fit function:
<br>*model.fit(X_train,Y_train,__validation_data=(X_test,Y_test)__,epochs=100,batch_size=8)*


You will notice that when training, validation accuracy is also reported:

<img src="Images/training_metrics.png" width="80%">
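
The same numbers can be inspected after training, because `model.fit(...)` returns an object whose `history` attribute is a dictionary of per-epoch values. The sketch below uses a hand-written dictionary with made-up values in place of a real training run, and shows one simple way to spot the epoch where validation loss was lowest.

```python
# Hypothetical per-epoch values, shaped like the `history.history` dict
# available after `model.fit(...)` is called with validation data
history = {
    "loss":     [0.90, 0.60, 0.40, 0.30, 0.20],
    "val_loss": [0.95, 0.70, 0.55, 0.60, 0.70],
}

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(history["val_loss"])),
                 key=lambda i: history["val_loss"][i]) + 1

# Gap between validation and training loss at the last epoch
gap = history["val_loss"][-1] - history["loss"][-1]

print(best_epoch)     # 3
print(round(gap, 2))  # 0.5
```

When training loss keeps falling while validation loss rises, as in these made-up numbers after epoch 3, the network is usually starting to overfit.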


*Note*: __batch_size__ plays an important role in the speed of learning and accuracy. The batch size refers to the number of samples propagated through the network at once during learning. It is good practice to set the batch_size to a power of two [8,16,32,64,...]. As a rule of thumb, set the batch size to around 5-10% of the total training samples.
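
The two guidelines in this note can be combined into a small helper. This is only a sketch of the rule of thumb stated above, not a general recommendation; the function name `suggest_batch_size` and its parameters are made up for this example.

```python
import math

def suggest_batch_size(n_train, low=0.05, high=0.10):
    """Largest power of two within low..high of n_train, or None if none fits."""
    candidates = [2 ** k for k in range(1, 12)]        # 2 .. 2048
    fitting = [b for b in candidates if low * n_train <= b <= high * n_train]
    return max(fitting) if fitting else None

n_train = 80
batch_size = suggest_batch_size(n_train)               # range 4..8 -> 8
steps_per_epoch = math.ceil(n_train / batch_size)      # weight updates per epoch

print(batch_size)       # 8
print(steps_per_epoch)  # 10
```

For 80 training samples this gives the `batch_size=8` used in the fit call of this session. For very small datasets no power of two may fit the range, in which case the helper returns `None`.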



---
### Visualize training performance using TensorBoard


TensorBoard is a great tool for displaying how training your model performs against your data.

In [None]:
#don't run this code here! It's just to show the newly added tensorboard callback
from tensorflow.keras.callbacks import TensorBoard
from time import time

## .... Model created, data is pre-processed and split to training (X_train, Y_train) and validation (X_test,Y_test)
tensorboard = TensorBoard(log_dir="logs/{}".format(time()))
model.fit(X_train,Y_train,validation_data=(X_test,Y_test),epochs=100,batch_size=8, callbacks=[tensorboard])

Then, in the command line (or terminal), go to the notebook's folder and run this command:
__<br>tensorboard --logdir=logs/__

<img src="Images/tensorboard_terminal.png" width="50%">

Then, in the browser (Chrome, for example), access the following URL:

[http://localhost:6006](http://localhost:6006)

If the port shown in your terminal is different from the one in the picture (6006), change the port part of the URL to match. You can compare multiple training sessions with different hyperparameters of the model (number of layers, number of perceptrons, etc.) or with/without normalization.
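
Comparing sessions is easier when each run writes to a folder whose name says which settings were used. The helper below is a hypothetical naming scheme, not part of TensorBoard; its result would be passed as `log_dir` to the `TensorBoard` callback.

```python
import os
import time

def run_log_dir(base="logs", **hyperparams):
    """Build a log folder name that encodes the settings of one run."""
    tag = "_".join("{0}-{1}".format(k, v) for k, v in sorted(hyperparams.items()))
    stamp = time.strftime("%Y%m%d-%H%M%S")     # keeps repeated runs apart
    return os.path.join(base, "{0}_{1}".format(tag, stamp))

log_dir = run_log_dir(layers=2, units=64, batch_size=8)
print(log_dir)   # e.g. logs/batch_size-8_layers-2_units-64_<timestamp>
```

Each run then shows up under its own readable name in the TensorBoard run list.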

<br>This picture shows the results after tuning the network parameters to achieve best accuracy for both training and validation datasets:

<img src="Images/tensorboard_metrics.png" width="100%">

---

## Practical Examples

### General Image Classifier:

<br>[ImageClassifier.ipynb](./Python/ImageClassifier.ipynb) allows you to train on a custom dataset of images


### Audio Recognizer:

<br>[AudioRecognizer.ipynb](./Python/AudioRecognizer.ipynb) allows you to train on a custom dataset of audio files. It shows how to preprocess audio files for training.