I want to start with a simple experiment that shows whether the concept works at all.

The goal is to train a model that distinguishes links and bases from non-matching drawings. For this task, a dataset of about 500 examples per class was created.
I will go through five steps to show that the model can distinguish *links* from *non-links*. The same algorithm is then used to identify *bases*.

This is the first attempt in a series of steps toward a neural network that identifies four-bar linkages from sketches and maps them to their digital counterparts.

The steps taken are as follows:
 1. Acquire data from the local hard drive.
 2. Prepare data by creating tensors of (image, label) pairs.
 3. Create a simple CNN to distinguish "links" from "non-hits" (*o*'s from *n*'s).
 4. Train model.
 5. Evaluate results.

### Step 1
Acquire the data and put it into the proper directories for training.

Data is stored in ../data/raw/{n, o, x}, containing "no match", "joints" and "bases" respectively.

The data is split across the *raw*, *interim* and *processed* directories. *interim* is not used here, since this simple proof of concept does not need much preprocessing of the data.

The data to be processed is transferred from '../data/raw/{n, o, x}' into the appropriate directories under '../data/processed'.

A subset of the *linkages* or *bases* and of the *non-hits* is placed in these folders to be used in training.

To distribute the data accordingly, `src.training_env` provides the following helper functions:
- create - Creates a suitable directory skeleton for the processed data and returns the respective paths.
- populate - Populates those directories with data.
- reset - Deletes all files in processed for a clean restart.

Combinations of those functions are provided for convenience (e.g. `reset_and_populate`).

Because the *proof_of_concept* uses a binary classifier, it only converges to a feasible result if exactly two classes are considered. Therefore we only look at *links* and *nohits* at this point.

In [1]:
import sys
sys.path.append('..')

from os.path import join
from src.training_env import create

processed = join('..', 'data', 'processed')

to_populate = [train_nohit, validate_nohit, test_nohit,
               train_links, validate_links, test_links,
               # train_bases, validate_bases, test_bases] 
              ] = create(['n', 'o'], processed)

However, the raw data must never be modified. Therefore the data gets subdivided into the "processed" folder at '../data/processed/{train, validation, test}/{n, o, x}'.

The dataset consists of at least 500 entries per class.
To be exact, we take 500 images per class and distribute them roughly 60/20/20 into training, validation and test respectively.
For each class this means:
 - 300 entries go into training.
 - 100 entries go into validation.
 - 100 entries go into test.

Increasing the amount of data via augmentation is a subject for later investigation, if an improvement can be expected.

Since all files are named {0,1,2,3,4,5...}.jpeg inside their label set, we can use this property to distribute the data easily.

In [2]:
from src.training_env import populate

raw = join('..', 'data', 'raw')
populate(raw, to_populate, [300,100,100], ['n', 'o'])
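
The index-based distribution can be sketched as follows. This is only a minimal illustration of the idea, assuming the files of each class are numbered consecutively from 0. `split_ranges` is a hypothetical helper and not the actual implementation of `populate`, which is not shown here.

```python
def split_ranges(counts, start=0):
    """Return one range of file indices per subset, in order and without overlap."""
    ranges = []
    for count in counts:
        ranges.append(range(start, start + count))
        start += count
    return ranges


# 300 training, 100 validation and 100 test files per class.
train_idx, validation_idx, test_idx = split_ranges([300, 100, 100])

# File names follow the numeric naming scheme, e.g. the first validation file.
first_validation_file = f"{validation_idx[0]}.jpeg"
```

Because the ranges do not overlap, no image can end up in more than one of the three subsets.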

### Step 2

Preprocess the data so that it is fit for use. (It may be better to do the preprocessing after the model definition, because the model determines the input shape.)

The data has to be transformed into tensors which can be fed into the model.
Four steps are suggested by the book *Deep Learning with Python* (p. 135):
 - Read the picture files.
 - Decode the JPEG content to RGB grids of pixels.
 - Convert these into floating-point tensors.
 - Rescale the pixel values (between 0 and 255) to the [0, 1] interval.
 
At this point we can make use of generators. They are useful because not every image has to be loaded into memory at once.
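
To illustrate why generators help, here is a minimal sketch in plain Python. It is not how `ImageDataGenerator` is implemented; `batches` and `rescale` are hypothetical helpers that only show lazy batching and rescaling of 8-bit pixel values to the [0, 1] interval.

```python
from itertools import islice


def rescale(pixels):
    """Map 8-bit pixel values (0..255) to floats in the [0, 1] interval."""
    return [p / 255.0 for p in pixels]


def batches(items, batch_size):
    """Yield successive batches lazily, so only one batch is held at a time."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Five placeholder "images", each reduced to a single pixel value for brevity.
pixel_values = [0, 51, 102, 204, 255]
for batch in batches(pixel_values, batch_size=2):
    print(rescale(batch))
```

Each batch is only built when the loop asks for it, which is the property that keeps memory usage low for large image folders.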

In [3]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_dir = join(processed, 'train')
validation_dir = join(processed, 'validation')

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(512, 512),
    batch_size=20,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(512, 512),
    batch_size=20,
    class_mode='binary')

Found 600 images belonging to 2 classes.
Found 200 images belonging to 2 classes.


### Step 3

Now a generic model for testing is created.

Here the model from *Deep Learning with Python* (p. 134) is used.

It should be reduced later and analyzed on my own. But for a quick proof of concept it should suffice.

In [4]:
from tensorflow.keras import layers
from tensorflow.keras import models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(512, 512, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 510, 510, 32)      896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 255, 255, 32)      0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 253, 253, 64)      18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 126, 126, 64)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 124, 124, 128)     73856     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 62, 62, 128)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 60, 60, 128)       1
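
The output shapes and parameter counts in this summary can be reproduced by hand. The following sketch assumes the Keras defaults used above: 3x3 convolutions without padding and with stride 1, and 2x2 max pooling with stride 2. The helper names are made up for this illustration.

```python
def conv_output(size, kernel=3):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Spatial size after max pooling whose stride equals the pool size."""
    return size // pool


def conv_params(in_channels, out_channels, kernel=3):
    """Kernel weights plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


size, channels = 512, 3  # input_shape=(512, 512, 3)
rows = []
for filters in (32, 64, 128, 128):
    params = conv_params(channels, filters)
    size = conv_output(size)
    rows.append(("conv", size, filters, params))
    size = pool_output(size)
    rows.append(("pool", size, filters, 0))
    channels = filters

for kind, s, c, p in rows:
    print(f"{kind}: (None, {s}, {s}, {c}) params={p}")
```

The first rows match the printed summary (510, 255, 253, 126, ... and 896, 18496, 73856 parameters), which is a quick check that the input shape and the layer stack fit together.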

Now the model is configured for training.
For binary classification, 'binary_crossentropy' is used as the loss function and 'RMSprop' as the optimizer, which is imported for this purpose.

This is recommended by François Chollet. Why this is the case is a matter of further research.

In [5]:
from tensorflow.keras.optimizers import RMSprop

# `learning_rate` is the current name of the argument; `lr` is a deprecated alias.
model.compile(loss='binary_crossentropy', optimizer=RMSprop(learning_rate=1e-4), metrics=['acc'])
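
To get a feeling for what the 'binary_crossentropy' loss measures, here is a minimal sketch in plain Python. It follows the textbook definition of the mean binary cross-entropy. The clipping constant `eps` is an assumption made for this sketch, and the function is not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Confident and correct predictions give a small loss ...
confident = binary_crossentropy([1, 0], [0.9, 0.1])
# ... while undecided predictions (0.5) give a loss of ln(2).
unsure = binary_crossentropy([1, 0], [0.5, 0.5])
print(confident, unsure)
```

The loss punishes confident wrong predictions heavily, which fits a single sigmoid output that is read as the probability of one of the two classes.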

### Step 4

Training the model is done via the `fit_generator` method, which is fed by the train_generator.

TensorBoard is used for visualisation. Therefore a log directory is created, together with a suitable callback object.

In [6]:
from tensorflow.keras.callbacks import TensorBoard
import numpy as np

# Directory the TensorBoard callback writes its logs to.
log_dir = join('..', 'logs', 'srp00')

callbacks = [ TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    embeddings_freq=1) ]

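# Note: `fit_generator` is deprecated since TensorFlow 2.1, where `fit` accepts generators directly.
history = model.fit_generator(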
    train_generator,
    steps_per_epoch=100,
    epochs=5,
    validation_data=validation_generator,
    validation_steps=50,
    callbacks=callbacks)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
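
A side note on the step counts used above: with a batch size of 20, one full pass over the 600 training images takes 30 steps, and one pass over the 200 validation images takes 10 steps. The chosen values of 100 and 50 therefore draw each image several times per epoch, assuming the generators loop over the data indefinitely. The helper below is a hypothetical sketch of that arithmetic.

```python
import math


def steps_for_one_pass(num_samples, batch_size):
    """Number of batches needed to see every sample exactly once."""
    return math.ceil(num_samples / batch_size)


train_steps = steps_for_one_pass(600, 20)       # 600 training images were found
validation_steps = steps_for_one_pass(200, 20)  # 200 validation images were found
print(train_steps, validation_steps)
```

Using these values would make one epoch correspond to exactly one pass over the data, which makes the curves in TensorBoard easier to interpret.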


In [8]:
%load_ext tensorboard
%tensorboard --logdir {log_dir}


### Step 5

As you can see, the training accuracy improves with this approach. After only 1 epoch it is already at 87%, but the validation accuracy stays around 90% for all epochs.

This is an indicator of overfitting: the training accuracy keeps improving while the validation accuracy does not follow. TODO *(look at coursera-ml-class to get the image of the opening loss scissors)*

This means that more data should be added or the number of features reduced.
I would try to reduce the features first and then augment the data to see if this improves the results.
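
One way to make such a widening gap visible without plots is to compare the two accuracy curves per epoch. The sketch below assumes a dictionary shaped like `history.history` with the keys `acc` and `val_acc`, as produced by `metrics=['acc']`. The numbers are illustrative placeholders only, not the results of this run.

```python
def accuracy_gap(history):
    """Per-epoch difference between training and validation accuracy."""
    return [train - val for train, val in zip(history["acc"], history["val_acc"])]


# Illustrative placeholder values only, NOT measured results.
example_history = {"acc": [0.80, 0.90, 0.95], "val_acc": [0.85, 0.86, 0.86]}

gaps = accuracy_gap(example_history)
# A gap that grows from epoch to epoch hints at overfitting.
widening = all(later > earlier for earlier, later in zip(gaps, gaps[1:]))
print(gaps, widening)
```

With the real training run, `history.history` could be passed in directly instead of the placeholder dictionary.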

Anyway, this is a proof of concept.

The next step will be to create my own models (starting with dense layers, not using a CNN at first) and to improve and compare them.

All in all, the proof of concept works.