## Example of how to use the datahandler
### Inputs and outputs
This notebook is a first attempt at multi-label classification.
Much of it covers the preprocessing steps that turn the raw data (the CSVs, the JSON and the folder containing the images) into the two inputs to the datahandler: **filenames** and **labels**. The results are pickled, so this preprocessing does not need to be rerun on this dataset in the future.
The rest of the notebook uses the data generator produced by the datahandler ([datahandler_multilabel.py](./datahandler_multilabel.py)) to train an example model.

Some of the code in the datahandler and in the example model is taken from [this post](https://towardsdatascience.com/multi-label-image-classification-in-tensorflow-2-0-7d4cf8a4bc72).

### Pre-processing

In [1]:
import tensorflow as tf
import json
import pandas as pd
from pathlib import Path
import os
import numpy as np
from os import listdir
from os.path import isfile, join
import pickle

from datahandler_multilabel import create_dataset

Import the CSV of the Tate dataset to get a list of all the artworks contained in the data folder (which was built using [RetrievingTateModernImages.ipynb](./RetrievingTateModernImages.ipynb)), and the JSON tree to obtain the target vector of each image. Every key whose path leads to the image name (the value) is set to 1 in that image's target vector; all other entries stay 0.

In [13]:
data_info = pd.read_csv(os.path.join("D:", "collection", "artwork_data.csv"), verbose=0)

In [5]:
with open('TateDictLevel1.json', 'r') as infile:
    tree1 = json.load(infile)
len(tree1.values())

16

In [6]:
len(tree1['people'])

15469

In [9]:
def getKeysByValue(dictOfElements, valueToFind):
    '''Return a list of (valueToFind, key) tuples, one for every key of the
    dictionary whose value (a collection) contains valueToFind
    '''
    return [(valueToFind, key)
            for key, elements in dictOfElements.items()
            if valueToFind in elements]
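
To make the return value concrete, here is a minimal sketch on a toy dictionary. The function `get_keys_by_value` mirrors `getKeysByValue` above; the dictionary contents and accession numbers are made up for illustration, and only the shape (class name mapped to a list of image names) is assumed to match `TateDictLevel1.json`.

```python
def get_keys_by_value(dict_of_elements, value_to_find):
    """Return (value_to_find, key) for every key whose collection contains the value."""
    return [(value_to_find, key)
            for key, elements in dict_of_elements.items()
            if value_to_find in elements]

# Hypothetical miniature tree: class name -> list of image names
toy_tree = {
    'people': ['A00001', 'A00002'],
    'places': ['A00002'],
    'nature': ['A00003'],
}

pairs = get_keys_by_value(toy_tree, 'A00002')    # image listed under two classes
missing = get_keys_by_value(toy_tree, 'Z99999')  # image listed under no class

print(pairs)    # [('A00002', 'people'), ('A00002', 'places')]
print(missing)  # []
```

An image that appears under no class yields an empty list, which is why the later steps only keep non-empty results.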

In [11]:
values = data_info.accession_number.tolist()

# one (possibly empty) list of (accession_number, class) tuples per artwork
tuples = [getKeysByValue(tree1, value) for value in values]
    
print(len(tuples))

69201


Mapping the key paths to numeric values

In [164]:
class2num = {'people':0, 'objects':1, 'places':2, 'architecture':3, 'abstraction':4, 'society':5, 'nature':6, \
             'emotions, concepts and ideas':7, 'interiors':8, 'work and occupations':9, 'symbols & personifications':10, \
             'religion and belief':11, 'leisure and pastimes':12, 'history':13, 'literature and fiction':14, 'group/movement':15}
class2num['architecture']

3

In [167]:
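img_target = []

for tupl in tuples:
    if len(tupl) > 0:
        # multi-hot target vector: one slot per class in class2num
        zarray = np.zeros(16)
        for _, class_name in tupl:
            zarray[class2num[class_name]] = 1
        # tupl[0][0] is the accession number shared by all tuples in tupl
        img_target.append((tupl[0][0], zarray))
        
len(img_target)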

26968
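
The encoding step can be illustrated on a small scale. The sketch below uses plain Python lists instead of NumPy arrays and a hypothetical three-class mapping (the notebook uses 16 classes); `multi_hot` is an illustrative helper, not part of the datahandler.

```python
# Hypothetical 3-class mapping; the notebook's class2num has 16 entries
toy_class2num = {'people': 0, 'places': 1, 'nature': 2}

def multi_hot(pairs, class2num):
    """Turn [(image_name, class_name), ...] into (image_name, multi-hot list)."""
    vector = [0.0] * len(class2num)
    for _, class_name in pairs:
        vector[class2num[class_name]] = 1.0  # mark every class the image belongs to
    return pairs[0][0], vector

name, target = multi_hot([('A00002', 'people'), ('A00002', 'nature')],
                         toy_class2num)
print(name, target)  # A00002 [1.0, 0.0, 1.0]
```

Unlike a one-hot vector, several positions can be 1 at once, which is what makes this a multi-label target.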

Only keep the filenames that are present in the directory (i.e. the images that could be downloaded from the links).

In [172]:
mypath = os.path.join('..','..','data_tate')
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

In [198]:
filenames = []
labels = []
onlyfiles_set = set(onlyfiles)  # set membership is much faster than list membership
for img_name, target in img_target:
    img_file = str(img_name)+'_8.jpg'
    if img_file in onlyfiles_set:
        filenames.append(img_file)
        labels.append(target)

In [199]:
filenames = [os.path.join('..','..','data_tate',str(filename)) \
                 for filename in filenames]
print(len(filenames))

24999


Dumping the resulting filenames and labels to pickle files:

In [200]:
with open('filenames.pkl', 'wb') as outfile:
    pickle.dump(filenames, outfile)
    
with open('labels.pkl', 'wb') as outfile2:
    pickle.dump(labels, outfile2)

### Using the pickled inputs to train a model

In [2]:
with open('filenames.pkl', 'rb') as infile:
    filenames = pickle.load(infile)
    
with open('labels.pkl', 'rb') as infile2:
    labels = pickle.load(infile2)
    
print(len(filenames), len(labels))

24999 24999


In [3]:
# calling the create_dataset function
train_ds = create_dataset(filenames, labels)

In [203]:
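#very simple pre-trained model
import tensorflow_hub as hub

# IMG_SIZE and CHANNELS are not defined earlier in this notebook. The values below are
# assumed from the 224x224 RGB input of this MobileNetV2 module; keep them in sync with the datahandler.
IMG_SIZE = 224
CHANNELS = 3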

feature_extractor_url = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4"
feature_extractor_layer = hub.KerasLayer(feature_extractor_url,
                                         input_shape=(IMG_SIZE,IMG_SIZE,CHANNELS))

In [196]:
feature_extractor_layer.trainable = False

In [129]:
model = tf.keras.Sequential([
    feature_extractor_layer,
    tf.keras.layers.Dense(1024, activation='relu', name='hidden_layer'),
    tf.keras.layers.Dense(16, activation='sigmoid', name='output')
])

In [130]:
def macro_f1(y, y_hat, thresh=0.5):
    """Compute the macro F1-score on a batch of observations (average F1 across labels)
    
    Args:
        y (int32 Tensor): labels array of shape (BATCH_SIZE, N_LABELS)
        y_hat (float32 Tensor): probability matrix from forward propagation of shape (BATCH_SIZE, N_LABELS)
        thresh: probability value above which we predict positive
        
    Returns:
        macro_f1 (scalar Tensor): value of macro F1 for the batch
    """
    y = tf.cast(y, tf.float32)  # labels may arrive as int or float64; match y_pred's dtype
    y_pred = tf.cast(tf.greater(y_hat, thresh), tf.float32)
    tp = tf.cast(tf.math.count_nonzero(y_pred * y, axis=0), tf.float32)
    fp = tf.cast(tf.math.count_nonzero(y_pred * (1 - y), axis=0), tf.float32)
    fn = tf.cast(tf.math.count_nonzero((1 - y_pred) * y, axis=0), tf.float32)
    f1 = 2*tp / (2*tp + fn + fp + 1e-16)
    macro_f1 = tf.reduce_mean(f1)
    return macro_f1
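
To check the metric by hand, the same computation can be sketched in plain Python on a tiny batch. `macro_f1_py` is a hypothetical dependency-free counterpart of `macro_f1` above, and the labels and probabilities are made up for illustration.

```python
def macro_f1_py(y, y_hat, thresh=0.5):
    """Average the per-label F1 over all labels for one batch (lists of lists)."""
    n_labels = len(y[0])
    f1_per_label = []
    for j in range(n_labels):
        tp = fp = fn = 0
        for row_true, row_prob in zip(y, y_hat):
            pred = 1 if row_prob[j] > thresh else 0
            true = row_true[j]
            tp += pred * true          # predicted 1, actually 1
            fp += pred * (1 - true)    # predicted 1, actually 0
            fn += (1 - pred) * true    # predicted 0, actually 1
        f1_per_label.append(2 * tp / (2 * tp + fn + fp + 1e-16))
    return sum(f1_per_label) / n_labels

# Made-up batch: 2 observations, 3 labels
y_true = [[1, 0, 1],
          [0, 1, 1]]
y_prob = [[0.9, 0.2, 0.4],
          [0.6, 0.8, 0.7]]

score = macro_f1_py(y_true, y_prob)
print(round(score, 4))  # 0.7778
```

Here the per-label F1 scores are 2/3, 1 and 2/3, so the macro average is 7/9. Because the metric is computed per batch, the value Keras reports during training is an average over batches and can differ from the macro F1 computed once over the whole dataset.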

In [131]:
LR = 1e-5 # Keep it small when transfer learning
EPOCHS = 30

In [132]:
model.compile(
  optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
  loss=tf.keras.losses.binary_crossentropy,
  metrics=[macro_f1]
)
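
For intuition about the loss, the sketch below computes binary cross-entropy for a single observation in plain Python: each label is scored as an independent yes/no decision and the per-label losses are averaged. `binary_crossentropy_py` and its clipping constant are illustrative assumptions; the Keras implementation additionally handles batching and numerical details that this sketch only approximates.

```python
import math

def binary_crossentropy_py(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one observation."""
    total = 0.0
    for true, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(true * math.log(prob) + (1 - true) * math.log(1.0 - prob))
    return total / len(y_true)

# Made-up targets and sigmoid outputs for 3 labels
loss = binary_crossentropy_py([1, 0, 1], [0.9, 0.2, 0.4])
print(round(loss, 3))  # 0.415
```

The third label dominates the loss because the model assigns only 0.4 to a label that is actually present.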

In [204]:
history = model.fit(train_ds,
  epochs=EPOCHS,)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30

KeyboardInterrupt: 