notes
- why numpy==1.16.4 is used and not the most recent: https://github.com/tensorflow/tensorflow/issues/31249

I want to start with a simple experiment that shows whether the concept works at all.

The goal is to train a model to differentiate links and bases from non-fitting drawings. For this task, a dataset of about 500 examples per class was created.
I want to go through *n* (TODO) steps to show that the model can differentiate *non-links* from *links*. Then the same algorithm is used to identify *bases*.

This is the first attempt in a series of steps toward a neural network that identifies four-bar linkages in sketches and maps them to their digital counterparts.

The steps taken are as follows:
 1. Acquire data from the local hard drive (and then show the loaded images).
 2. Prepare data by creating tensors of (image, label) pairs.
 3. Create a simple CNN to distinguish "links" from "non-hits" (*o*'s from *n*'s).
 4. Train model.
 5. Evaluate results.
 
After these five steps, the model should be trained with a variety of hyperparameters to see which combination is the most promising.

-> Further steps will try to use these models inside another CNN to get the coordinates of hits in a sketch.

### Step 1:
Acquire the data and put it into the proper directories for training.

Data is stored in ../data/{n, o, x}, containing "no match", "joints" and "bases" drawings respectively.
It is kept outside the working directory because multiple approaches (in different programming languages) are to be tried on this dataset.

First, the following directory structure is created inside the working directory:

- data
    - train
        - n
        - o
        - x
    - validation
        - n
        - o
        - x
    - test
        - n
        - o
        - x
        
A subset of the *links* or *bases* and of the *non-hits* is placed in these folders to be used in training.

In [1]:
from os.path import join, exists
from os import mkdir

def mkdir_ex(path):
    if not exists(path):
        mkdir(path)

n_dir = join('..', 'data', 'n') # non hits
o_dir = join('..', 'data', 'o') # links
# x_dir = join('..', 'data', 'x') # bases

data = join('.', 'data')
mkdir_ex(data)
# Create base directories for training, validation and testing
train_dir = join(data, 'train')
mkdir_ex(train_dir)
validation_dir = join(data, 'validation')
mkdir_ex(validation_dir)
test_dir = join(data, 'test')
mkdir_ex(test_dir)
# Create respective training directories for data
train_nohit = join(train_dir, 'n')
mkdir_ex(train_nohit)
train_links = join(train_dir, 'o')
mkdir_ex(train_links)
# train_bases = join(train_dir, 'x')
# mkdir_ex(train_bases)
# And validation directories
validate_nohit = join(validation_dir, 'n')
mkdir_ex(validate_nohit)
validate_links = join(validation_dir, 'o')
mkdir_ex(validate_links)
# validate_bases = join(validation_dir, 'x')
# mkdir_ex(validate_bases)
# And test directories
test_nohit = join(test_dir, 'n')
mkdir_ex(test_nohit)
test_links = join(test_dir, 'o')
mkdir_ex(test_links)
# test_bases = join(test_dir, 'x')
# mkdir_ex(test_bases)
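
The same tree can be built more compactly. The following is a minimal sketch, assuming that `os.makedirs` with `exist_ok=True` is an acceptable replacement for the `mkdir_ex` helper; `make_tree` and its parameters are hypothetical names that do not appear elsewhere in this notebook.

```python
from os import makedirs
from os.path import join


def make_tree(root, splits=('train', 'validation', 'test'), labels=('n', 'o')):
    """Create root/<split>/<label> for every combination and return the paths."""
    paths = {}
    for split in splits:
        for label in labels:
            path = join(root, split, label)
            # exist_ok=True makes repeated calls harmless, like mkdir_ex
            makedirs(path, exist_ok=True)
            paths[(split, label)] = path
    return paths
```

With this helper, `paths = make_tree(join('.', 'data'))` would create all six directories, and `paths[('train', 'o')]` would play the role of `train_links`.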

Now that all folders are created and ready to be filled, the data is copied into its respective directories.

The dataset consists of at least 500 entries per class.
To be exact, we take 500 images per class and split them 60/20/20 into training, validation and test. Each class therefore contributes:
 - 300 entries to training.
 - 100 entries to validation.
 - 100 entries to test.

A helpful side effect of copying is that the original data stays untouched and cannot be compromised in any way.

Whether increasing the amount of data via augmentation would bring an improvement is a subject for later debate.

Since all files are named {0,1,2,3,4,5...}.jpeg inside their label set, we can use this property to easily distribute the data.

In [2]:
from shutil import copyfile

def distribute_data(target_dir, src_dir, begin, limit):
    for i in range(begin, limit):
        filename = str(i) + '.jpeg'
        src = join(src_dir, filename)
        target = join(target_dir, filename)
        copyfile(src, target)

distribute_data(train_nohit, n_dir, 0, 300)
distribute_data(train_links, o_dir, 0, 300)
# distribute_data(train_bases, x_dir, 0, 300)
distribute_data(validate_nohit, n_dir, 300, 400)
distribute_data(validate_links, o_dir, 300, 400)
# distribute_data(validate_bases, x_dir, 300, 400)
distribute_data(test_nohit, n_dir, 400, 500)
distribute_data(test_links, o_dir, 400, 500)
# distribute_data(test_bases, x_dir, 400, 500)
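
To check that the split worked as intended, the files in each target directory can be counted. This is a minimal sketch; `count_files` is a hypothetical helper, and it assumes that every image uses the `.jpeg` suffix as described above.

```python
from os import listdir
from os.path import isfile, join


def count_files(directory, suffix='.jpeg'):
    """Return the number of regular files in `directory` ending with `suffix`."""
    return sum(
        1 for name in listdir(directory)
        if name.endswith(suffix) and isfile(join(directory, name))
    )
```

If the target directories were empty before the copy, `count_files(train_nohit)` is expected to give 300 and `count_files(validate_links)` 100.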

### Step 2

Preprocess the data so that it can be fed to the model. (It may be better to do the preprocessing after the model definition, because the model determines the input shape.)

The data has to be transformed into tensors which can be fed into the model.
The book (*Deep Learning with Python*, p. 135) suggests four steps:
 - Read the picture files.
 - Decode the JPEG content to RGB grids of pixels.
 - Convert these into floating-point tensors.
 - Rescale the pixel values (between 0 and 255) to the [0, 1] interval.
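
The last two steps can be sketched by hand with NumPy. The sketch assumes the image has already been read and decoded into a `uint8` array of shape (height, width, 3); `to_float_tensor` is a hypothetical name. The `rescale=1./255` argument of `ImageDataGenerator` applies the same scaling factor.

```python
import numpy as np


def to_float_tensor(rgb_pixels):
    """Convert decoded uint8 RGB pixels to float32 values in [0, 1]."""
    pixels = np.asarray(rgb_pixels)
    if pixels.dtype != np.uint8:
        raise TypeError('expected decoded 8-bit pixels (dtype uint8)')
    # Step 3: convert to a floating-point tensor
    tensor = pixels.astype(np.float32)
    # Step 4: rescale from [0, 255] to [0, 1]
    return tensor / 255.0
```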

In [3]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(512, 512),
    batch_size=20,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(512, 512),
    batch_size=20,
    class_mode='binary')


Found 600 images belonging to 2 classes.
Found 200 images belonging to 2 classes.


### Step 3

Now a generic model for testing is created.

Here the model from *Deep Learning with Python* (p. 134) is used.

It should be reduced later and analyzed on my own. But for a quick proof of concept this should suffice.

In [4]:
from tensorflow.keras import layers
from tensorflow.keras import models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(512, 512, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()

W0910 23:25:54.580121 139773052057216 deprecation.py:506] From /usr/lib/python3.7/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 510, 510, 32)      896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 255, 255, 32)      0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 253, 253, 64)      18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 126, 126, 64)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 124, 124, 128)     73856     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 62, 62, 128)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 60, 60, 128)       1
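
The output shapes and parameter counts in such a summary follow from simple arithmetic. The sketch below assumes the Keras defaults used in the model definition (3x3 kernels, 'valid' padding, stride 1, 2x2 pooling); the helper names are hypothetical.

```python
def conv_output_size(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_output_size(size, pool=2):
    """Spatial size after max pooling with a stride equal to the pool size."""
    return size // pool


def conv_params(in_channels, filters, kernel=3):
    """Kernel weights for every filter plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size, channels = 512, 3
for filters in (32, 64, 128, 128):
    size = conv_output_size(size)
    print('conv', (size, size, filters), conv_params(channels, filters))
    channels = filters
    size = pool_output_size(size)
    print('pool', (size, size, filters))
```

Each convolution shrinks the image by 2 pixels per dimension and each pooling layer halves it (rounding down), which is why a 512x512 input leaves a comparatively large feature map for the Flatten layer.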

Now the model is configured for training.
For binary classification, 'binary_crossentropy' is used as the loss function and 'RMSprop' as the optimizer.


This is recommended by Francois Chollet. Why this is the case is a matter of further research.
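
For reference, binary cross-entropy for one sample with label `y` (0 or 1) and predicted probability `p` is `-(y * log(p) + (1 - y) * log(1 - p))`. The following pure-Python sketch illustrates the formula; it is not the Keras implementation, and the clipping value `eps` is an assumption made here to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels (0 or 1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the prediction so that log(0) never occurs
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

A confident correct prediction yields a loss close to 0, while a confident wrong prediction is penalized heavily, which suits a model with a single sigmoid output.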

In [5]:
from tensorflow.keras.optimizers import RMSprop

model.compile(loss='binary_crossentropy', optimizer=RMSprop(lr=1e-4), metrics=['acc'])

W0910 23:25:54.748359 139773052057216 deprecation.py:323] From /usr/lib/python3.7/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


### Step 4

Training the model is done via the "fit_generator" method, which consumes the batches produced by train_generator.

## Disclaimer: train_generator must contain exactly 2 classes. With the *x* (bases) directories commented out in Step 1, only *n* and *o* remain.

TensorBoard should be used for visualisation. Therefore a log directory is created, together with a suitable callback object.

In [6]:
from tensorflow.keras.callbacks import TensorBoard
import numpy as np

log_dir = join('.', 'logs')
mkdir_ex(log_dir)

callbacks = [
    TensorBoard(
        log_dir=log_dir,
        histogram_freq=1,
        embeddings_freq=1)
]

history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=30,
    validation_data=validation_generator,
    validation_steps=50,
    # callbacks=callbacks,  # TensorBoard callback is currently disabled
)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
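
The values `steps_per_epoch=100` and `validation_steps=50` are larger than needed to see every image once per epoch, so the generators simply cycle through the data several times. A minimal sketch for deriving the step counts from the dataset size; `steps_for` is a hypothetical helper.

```python
import math


def steps_for(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)
```

With the 600 training and 200 validation images found in Step 2 and a batch size of 20, `steps_for(600, 20)` gives 30 and `steps_for(200, 20)` gives 10.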
