# Create the Full Image Feature Sets
In this notebook we generate the feature datasets from thumbnail-sized copies of the full artwork images.

In [1]:
import warnings
warnings.filterwarnings("ignore", message=r"Passing", category=FutureWarning)
import tensorflow as tf
import logging
import tensorflow_hub as hub
from PIL import Image
import os
import pandas as pd
import numpy as np
import sqlite3
from numpy import savez_compressed
from sklearn.model_selection import train_test_split

In [2]:
# turn off tensorflow INFO messages
tf.logging.set_verbosity(30)

In [3]:
# define array names used in tile feature files
FEATURES     = "arr_0"
TILE_TAGS    = "arr_1"
IMAGE_TAGS   = "arr_2"
RANDOM_STATE = 42

In [4]:
# define file paths
data_file_path     = "./data/"
image_file_path    = "".join([data_file_path, "images/"])  
feature_sets       = "".join([data_file_path, "full_image_feature_sets/"]) 

## Connect to the Database

In [5]:
# Create a DB connection between python and the file system
conn = sqlite3.connect(''.join([data_file_path, "database/artist.db"]))

## Splitting the Datasets into __Train__, __Test__ & __Validation__ Subsets
Both the _genre_ & _artist_ feature sets have __3306__ entries, but the _style_ feature set has __3742__. This is because an artwork may have more than one style associated with it. To allow all of the _feature sets_ to be split consistently into _train_, _test_ & _validation_ subsets, the _image_tag_ must be used to decide the final destination set of each entry. Here we perform a double _train_test_split_ to achieve a split ratio of approximately __70%__, __20%__ & __10%__.
<br/>
Create a simple query against the RDBMS to return a list of all _IMAGE_TAGs_.

In [6]:
# NOTE: select a list of unique image tags
query_string = """
SELECT IMAGE_TAG
FROM   ARTWORK_IMAGE
"""

# create the results dataframe
i_tags = pd.read_sql_query(query_string, conn)

Perform two random data splits to select the __TRAIN, TEST__ & __VALIDATE__ datasets in the approximate proportions _70%_, _20%_ & _10%_. <br/> __NOTE:__ This step only selects which images are assigned to each of the three sets, so it results in three lists of __IMAGE_TAGS__. The actual creation of the datasets happens further down the notebook. <br/>__NOTE:__ The splits are kept consistent by setting <code>RANDOM_STATE</code> to the value __42__ throughout the project.

In [7]:
# split the list of image tags into 3 sets:
# train    ~70%
# test     ~20%
# validate ~10%

# using train_test_split we first extract our training set, which leaves a remainder
train_tags   , remainder_tags, _, _ = train_test_split(i_tags        , i_tags        , test_size = 0.3, random_state = RANDOM_STATE)

# now the remainder is split into a test and a validation set
validate_tags, test_tags     , _, _ = train_test_split(remainder_tags, remainder_tags, test_size = 0.7, random_state = RANDOM_STATE)
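
The effective proportions of the two chained splits follow from multiplying the two `test_size` values. The short sketch below only reproduces that arithmetic. It does not call `train_test_split`, and the exact row counts in the notebook may differ slightly because of rounding.

```python
# Effective subset sizes produced by two chained random splits.
first_test_size = 0.3    # fraction held back by the first split (the remainder)
second_test_size = 0.7   # fraction of the remainder that becomes the test set

train_fraction = 1.0 - first_test_size
test_fraction = first_test_size * second_test_size
validate_fraction = first_test_size * (1.0 - second_test_size)

for name, fraction in [("train", train_fraction),
                       ("test", test_fraction),
                       ("validate", validate_fraction)]:
    print(f"{name:<8} {fraction:.0%}")
```

The remainder is 30% of the images, so the test set is 0.3 × 0.7 = 21% and the validation set is 0.3 × 0.3 = 9% of the total.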

Convert the three dataset selections into Numpy Arrays.

In [8]:
# to reduce the search lookup process time we convert 
# the dataframes into numpy arrays
train_tags    = train_tags[   "image_tag"].to_numpy()
test_tags     = test_tags[    "image_tag"].to_numpy()
validate_tags = validate_tags["image_tag"].to_numpy()
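
The `in` operator on a NumPy array compares the value against every element, so each membership test scans the whole array. One possible alternative, sketched here with made-up tags rather than the real split, is to build Python sets once and test membership against those. The names `validate_lookup`, `test_lookup` and `pick_subset` are hypothetical and not part of the notebook.

```python
import numpy as np

# made-up image tags standing in for the real split
validate_tags = np.array(["img_003", "img_007"])
test_tags = np.array(["img_001", "img_004", "img_009"])

# build the lookup sets once, before looping over the query rows
validate_lookup = set(validate_tags.tolist())
test_lookup = set(test_tags.tolist())

def pick_subset(image_tag):
    """Return the name of the subset an image tag belongs to."""
    if image_tag in validate_lookup:
        return "validate"
    if image_tag in test_lookup:
        return "test"
    return "train"

print(pick_subset("img_007"), pick_subset("img_004"), pick_subset("img_555"))
```

Any tag found in neither set falls through to the training set, which mirrors the `if` / `elif` / `else` routing used in the loops further down.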

Set the feature extractor.

In [9]:
# define module URL
module_url = 'https://tfhub.dev/google/imagenet/mobilenet_v2_100_96/feature_vector/3'

The function <code>get_image_features()</code> opens the image, resizes it to 96 x 96 pixels and then uses the feature extractor module <code>https://tfhub.dev/google/imagenet/mobilenet_v2_100_96/feature_vector/3</code> to extract the image's features.<br/>
__NOTE:__ The feature extraction process is run on the GPU device <code>/device:GPU:0</code>. (_Just as an experiment_).

In [10]:
# define function to extract features
def get_image_features(image_path):

    # resize to a small square image
    resized_image = Image.open(image_path).resize((96, 96), Image.ANTIALIAS)
    
    # convert the image to RGB and then to a numpy array
    img_batch     = np.array(resized_image.convert('RGB'), dtype = np.float32)[np.newaxis, :, :, :]/255

    # create graph
    img_graph     = tf.Graph()
    
    # define nvidia GPU as processing device
    with tf.device('/device:GPU:0'):

        # extract the image features
        with img_graph.as_default():

            feature_extractor = hub.Module(module_url)

            # create input placeholder
            input_imgs = tf.placeholder(dtype=tf.float32, shape=[None, 96, 96, 3])

            # a node with the features
            imgs_features = feature_extractor(input_imgs)

            # collect initializers
            init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])

        img_graph.finalize() 

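        # create a session and close it again once the features are extracted,
        # so that sessions do not pile up while looping over thousands of images
        with tf.Session(graph=img_graph) as sess:

            # initialize it
            sess.run(init_op)

            # extract features
            features = sess.run(imgs_features, feed_dict={input_imgs: img_batch})

        # return the features
        return features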
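
To make the preprocessing step easier to follow, here is a minimal sketch of what happens to the pixel data before it reaches the placeholder. It assumes the image has already been resized to 96 x 96 and converted to RGB, and it uses random pixel values as a stand-in for a real image.

```python
import numpy as np

# stand-in for a resized 96x96 RGB image (random pixel values, not real data)
rng = np.random.default_rng(42)
pixels = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)

# add a leading batch axis and scale the 0-255 values into the range [0, 1]
img_batch = np.asarray(pixels, dtype=np.float32)[np.newaxis, :, :, :] / 255

print(img_batch.shape, img_batch.dtype)
```

The resulting batch holds a single image and matches the `[None, 96, 96, 3]` shape of the input placeholder.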

## Build the __Genre__ Feature Set Arrays

Query the RDBMS to return a list of all __IMAGE_TAG__ values and the associated __GENRE__ categories.

In [11]:
# define query
query_string = """
SELECT IMAGE_TAG,
       GENRE
FROM   GENRE         AS A,
       ARTWORK       AS B,
       ARTWORK_IMAGE AS C
WHERE  A.ID = B.GENRE_ID
AND    B.ID = C.ARTWORK_ID
"""
    
# execute query
genre_query_result = pd.read_sql_query(query_string, conn)

Here we loop through every __IMAGE_TAG__ and identify whether the image feature data belongs in the __TRAIN, TEST__ or __VALIDATE__ dataset. Once the correct destination has been identified, the data is appended to the matching set of arrays. Finally, each set of arrays is written to its own compressed data file, giving three files in total.

In [12]:
# define feature set arrays
features_tr  = []
features_te  = []
features_va  = []
genre_tr     = []
genre_te     = []
genre_va     = []
image_tag_tr = []
image_tag_te = []
image_tag_va = []

# loop through selection rows
for index, row in genre_query_result.iterrows():

    # build the path of the image file
    image_path     = "".join([image_file_path, "full_image_", row["image_tag"] ,".jpg"])

    # call the helper function to extract the features
    image_features = get_image_features(image_path)
        
    # check in smallest dataset and then second 
    # smallest dataset for speed
    if row["image_tag"] in validate_tags:
        # append data to feature set arrays
        features_va.append(image_features   )
        genre_va.append(    row["genre"]    )
        image_tag_va.append(row["image_tag"])
    elif row["image_tag"] in test_tags:
        # append data to feature set arrays
        features_te.append(image_features   )
        genre_te.append(    row["genre"]    )
        image_tag_te.append(row["image_tag"])
    else:
        # append data to feature set arrays
        features_tr.append(image_features   )
        genre_tr.append(    row["genre"]    )
        image_tag_tr.append(row["image_tag"])
    
# write files
savez_compressed("".join([feature_sets,"genre_train_features"    ]),  features_tr, genre_tr, image_tag_tr)
savez_compressed("".join([feature_sets,"genre_test_features"     ]),  features_te, genre_te, image_tag_te)
savez_compressed("".join([feature_sets,"genre_validation_features"]), features_va, genre_va, image_tag_va)
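
Arrays passed to `savez_compressed` as positional arguments are stored under the names `arr_0`, `arr_1`, `arr_2`, which is why the constants `FEATURES`, `TILE_TAGS` and `IMAGE_TAGS` were defined with those values at the top of the notebook. The sketch below writes a tiny made-up feature set to a temporary directory and reads it back. The labels, tags and file name are invented for illustration.

```python
import os
import tempfile

import numpy as np
from numpy import savez_compressed

# made-up data in the same layout the loops produce:
# one (1, 1280) feature array, one label and one image tag per row
features = [np.zeros((1, 1280), dtype=np.float32),
            np.ones((1, 1280), dtype=np.float32)]
labels = ["portrait", "landscape"]
tags = ["img_001", "img_002"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_features")

    # savez_compressed appends the .npz extension itself
    savez_compressed(path, features, labels, tags)

    with np.load(path + ".npz") as data:
        names = sorted(data.files)
        loaded_features = data["arr_0"]
        loaded_labels = data["arr_1"]
        loaded_tags = data["arr_2"]

# drop the per-image batch axis to get one feature row per image
flat_features = loaded_features.reshape(len(loaded_features), -1)
print(names, loaded_features.shape, flat_features.shape)
```

Because every call to `get_image_features()` returns a batch of one, the stored feature array carries an extra axis of length 1, which a consumer may want to flatten as shown.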

## Build the __Style__ Feature Set Arrays

Query the RDBMS to return a list of all __IMAGE_TAG__ values and the associated __STYLE__ categories.

In [13]:
# NOTE: we generate additional records here because an artwork can be listed with more than one style
query_string = """
SELECT IMAGE_TAG,
       STYLE
FROM   STYLE         AS A,
       ARTWORK_STYLE AS B,
       ARTWORK_IMAGE AS C
WHERE  B.STYLE_ID   = A.ID
AND    C.ARTWORK_ID = B.ARTWORK_ID
"""
    
style_query_result = pd.read_sql_query(query_string, conn)
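
The following sketch illustrates why this join returns more rows than there are images. It runs the same query against a throwaway in-memory database. The table layouts and the inserted rows are assumptions made for the example and may not match the real `artist.db` schema.

```python
import sqlite3

# in-memory database with a hypothetical, simplified schema
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE STYLE         (ID INTEGER PRIMARY KEY, STYLE TEXT);
CREATE TABLE ARTWORK_STYLE (ARTWORK_ID INTEGER, STYLE_ID INTEGER);
CREATE TABLE ARTWORK_IMAGE (ARTWORK_ID INTEGER, IMAGE_TAG TEXT);
INSERT INTO STYLE         VALUES (1, 'Style A'), (2, 'Style B');
INSERT INTO ARTWORK_STYLE VALUES (10, 1), (10, 2), (11, 1);
INSERT INTO ARTWORK_IMAGE VALUES (10, 'img_010'), (11, 'img_011');
""")

# artwork 10 has two styles, so its image tag appears twice in the result
rows = conn_demo.execute("""
SELECT IMAGE_TAG,
       STYLE
FROM   STYLE         AS A,
       ARTWORK_STYLE AS B,
       ARTWORK_IMAGE AS C
WHERE  B.STYLE_ID   = A.ID
AND    C.ARTWORK_ID = B.ARTWORK_ID
ORDER  BY IMAGE_TAG, STYLE
""").fetchall()
conn_demo.close()

for image_tag, style in rows:
    print(image_tag, style)
```

Two images produce three rows here, in the same way that 3306 images produce 3742 style rows in the notebook.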

Here we loop through every __IMAGE_TAG__ and identify whether the image feature data belongs in the __TRAIN, TEST__ or __VALIDATE__ dataset. Once the correct destination has been identified, the data is appended to the matching set of arrays. Finally, each set of arrays is written to its own compressed data file, giving three files in total.

In [14]:
# define feature set arrays
features_tr  = []
features_te  = []
features_va  = []
style_tr     = []
style_te     = []
style_va     = []
image_tag_tr = []
image_tag_te = []
image_tag_va = []

# loop through selection rows
for index, row in style_query_result.iterrows():

    # build the path of the image file
    image_path     = "".join([image_file_path, "full_image_", row["image_tag"] ,".jpg"])
    
    # call the helper function to extract the features
    image_features = get_image_features(image_path)

    # check in smallest dataset and then second 
    # smallest dataset for speed
    if row["image_tag"] in validate_tags:
        # append data to feature set arrays
        features_va.append(image_features   )
        style_va.append(    row["style"]    )
        image_tag_va.append(row["image_tag"])
    elif row["image_tag"] in test_tags:
        # append data to feature set arrays
        features_te.append(image_features   )
        style_te.append(    row["style"]    )
        image_tag_te.append(row["image_tag"])
    else:
        # append data to feature set arrays
        features_tr.append(image_features   )
        style_tr.append(    row["style"]    )
        image_tag_tr.append(row["image_tag"])
    
# write files
savez_compressed("".join([feature_sets,"style_train_features"    ]),  features_tr, style_tr, image_tag_tr)
savez_compressed("".join([feature_sets,"style_test_features"     ]),  features_te, style_te, image_tag_te)
savez_compressed("".join([feature_sets,"style_validation_features"]), features_va, style_va, image_tag_va)
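
Because the destination is decided per image tag, every style row of a multi-style artwork lands in the same subset, so no image appears in more than one of the three files. The sketch below checks that property on made-up rows. The tags, styles and lookup sets are invented for the example.

```python
# made-up (image_tag, style) rows; img_010 carries two styles
style_rows = [("img_010", "Style A"), ("img_010", "Style B"),
              ("img_011", "Style A"), ("img_012", "Style C")]

# made-up split decided once per image tag
validate_lookup = {"img_011"}
test_lookup = {"img_010"}

subsets = {"train": [], "test": [], "validate": []}
for image_tag, style in style_rows:
    if image_tag in validate_lookup:
        subsets["validate"].append((image_tag, style))
    elif image_tag in test_lookup:
        subsets["test"].append((image_tag, style))
    else:
        subsets["train"].append((image_tag, style))

# collect the distinct tags per subset and confirm they do not overlap
tags_per_subset = {name: {tag for tag, _ in items}
                   for name, items in subsets.items()}
no_overlap = (tags_per_subset["train"].isdisjoint(tags_per_subset["test"])
              and tags_per_subset["train"].isdisjoint(tags_per_subset["validate"])
              and tags_per_subset["test"].isdisjoint(tags_per_subset["validate"]))
print(no_overlap, len(subsets["test"]))
```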

## Build the __Artist__ Feature Set Arrays

Query the RDBMS to return a list of all __IMAGE_TAG__ values and the associated __ARTIST__ names.

In [15]:
# select every image tag together with the name of its artist
query_string = """
SELECT IMAGE_TAG,
       NAME
FROM   ARTWORK       AS A,
       ARTIST        AS B,
       ARTWORK_IMAGE AS C
WHERE  A.ARTIST_ID = B.ID   
AND    A.ID        = C.ARTWORK_ID
"""
    
artist_query_result = pd.read_sql_query(query_string, conn)

Here we loop through every __IMAGE_TAG__ and identify whether the image feature data belongs in the __TRAIN, TEST__ or __VALIDATE__ dataset. Once the correct destination has been identified, the data is appended to the matching set of arrays. Finally, each set of arrays is written to its own compressed data file, giving three files in total.

In [16]:
# define feature set arrays
features_tr  = []
features_te  = []
features_va  = []
artist_tr    = []
artist_te    = []
artist_va    = []
image_tag_tr = []
image_tag_te = []
image_tag_va = []

# loop through selection rows
for index, row in artist_query_result.iterrows():

    # build the path of the image file
    image_path     = "".join([image_file_path, "full_image_", row["image_tag"] ,".jpg"])

    # call the helper function to extract the features
    image_features = get_image_features(image_path)
    
    # check in smallest dataset and then second 
    # smallest dataset for speed
    if row["image_tag"] in validate_tags:
        # append data to feature set arrays
        features_va.append(image_features   )
        artist_va.append(   row["name"]     )
        image_tag_va.append(row["image_tag"])
    elif row["image_tag"] in test_tags:
        # append data to feature set arrays
        features_te.append(image_features   )
        artist_te.append(   row["name"]     )
        image_tag_te.append(row["image_tag"])
    else:
        # append data to feature set arrays
        features_tr.append(image_features   )
        artist_tr.append(   row["name"]     )
        image_tag_tr.append(row["image_tag"])
    
# write files
savez_compressed("".join([feature_sets,"artist_train_features"    ]),  features_tr, artist_tr, image_tag_tr)
savez_compressed("".join([feature_sets,"artist_test_features"     ]),  features_te, artist_te, image_tag_te)
savez_compressed("".join([feature_sets,"artist_validation_features"]), features_va, artist_va, image_tag_va)
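
Each of the three loops runs the extractor on the same images again, and the style query repeats a tag once per style. One possible way to avoid the repeated work is to cache the features per image tag. This is a sketch under assumptions, not the notebook's method: `fake_extract`, `feature_cache` and `cached_features` are hypothetical names, and the stand-in extractor returns a dummy vector instead of calling `get_image_features()`.

```python
# record of the paths the stand-in extractor was asked to process
extractor_calls = []

def fake_extract(image_path):
    # hypothetical stand-in for get_image_features(); returns a dummy vector
    extractor_calls.append(image_path)
    return [0.0, 0.0, 0.0]

# maps an image tag to the features extracted for it
feature_cache = {}

def cached_features(image_tag, extract=fake_extract):
    """Extract the features for a tag once and reuse them afterwards."""
    if image_tag not in feature_cache:
        image_path = "".join(["./data/images/", "full_image_", image_tag, ".jpg"])
        feature_cache[image_tag] = extract(image_path)
    return feature_cache[image_tag]

# img_010 is requested twice but extracted only once
for tag in ["img_010", "img_010", "img_011"]:
    cached_features(tag)

print(len(extractor_calls))
```

In the notebook the same idea would mean passing `get_image_features` as the `extract` argument and sharing one cache across the genre, style and artist loops.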