# Predicting painting authors

In this notebook, I will build a model to predict the authors of paintings. To train the model, I will use a Kaggle [dataset](https://www.kaggle.com/datasets/ikarus777/best-artworks-of-all-time) that contains 8,000+ paintings by 50 of the most famous artists.

In [1]:
import os
import random
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8')
tf.get_logger().setLevel('ERROR')

Let's start by exploring the information about the paintings in the dataset. Luckily, there is a CSV file summarizing information about all paintings in the dataset and their authors. I will read it as a pandas dataframe.

In [None]:
artists = pd.read_csv(r'data\artists.csv').drop(columns=['id', 'bio', 'wikipedia'])

#sorting the dataframe by the number of paintings in the dataset
artists.sort_values('paintings', ascending=False).reset_index(drop=True)

Vincent van Gogh has the most paintings in the dataset, followed by Edgar Degas and Pablo Picasso. The difference between the artists with the most and the fewest paintings is large, meaning that the dataset is imbalanced. I'll deal with it later.
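
One common way to handle this kind of imbalance is to weight each class inversely to its frequency during training. The sketch below is only an illustration under stated assumptions: the painting counts are hypothetical, `balanced_class_weights` is a helper name introduced here, and the resulting dictionary has the form accepted by the `class_weight` argument of Keras `model.fit`.

```python
import numpy as np

def balanced_class_weights(counts):
    """Return one weight per class, inversely proportional to its frequency."""
    counts = np.asarray(counts, dtype=float)
    # n_samples / (n_classes * count) gives rarer classes larger weights
    return counts.sum() / (len(counts) * counts)

# hypothetical painting counts for four artists (not taken from the dataset)
toy_counts = [800, 400, 100, 50]
weights = balanced_class_weights(toy_counts)

# dictionary keyed by class index
class_weight = {i: float(w) for i, w in enumerate(weights)}
print(class_weight)
```

With these weights, every class contributes the same total weight to the loss, regardless of how many paintings it has.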

Each artist in the dataset is associated with one or more genres. Let's have a look at the distribution of artists' genres.

In [None]:
#getting all genres from the genre column
genre_count = artists.genre.str.get_dummies(sep=',').sum().sort_values()

#plotting the number of genres in the dataset
genre_count.plot.barh()
plt.show()

Impressionism and Post-Impressionism are the most popular genres, if we only count individual artists. There are also plenty of artists associated with different periods and types of Renaissance art.
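
The genre counts above rely on `str.get_dummies`, which splits each comma-separated string into indicator columns. A minimal sketch on a toy series shows the mechanics. The genre strings here are hypothetical and only mimic the format of the `genre` column.

```python
import pandas as pd

# hypothetical genre strings, one row per artist
toy_genres = pd.Series([
    'Impressionism,Post-Impressionism',
    'Impressionism',
    'Cubism',
])

# one indicator column per distinct genre, one row per artist
dummies = toy_genres.str.get_dummies(sep=',')

# summing each column counts the artists associated with that genre
toy_genre_count = dummies.sum().sort_values()
print(toy_genre_count)
```

Because an artist with two genres sets two indicators, the totals count artist-genre associations, so they can add up to more than the number of artists.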

The paintings are stored on my local machine. Below, I display some of the paintings, selected randomly. The code used here is adapted from this Stack Overflow [answer](https://stackoverflow.com/a/60443998).

In [None]:
#getting the list of all filepaths
train_folder = r'data\images'
images = {}
for folder in os.listdir(train_folder):
    for image in os.listdir(os.path.join(train_folder, folder)):
        filename = os.path.join(train_folder, folder, image)
        author = folder.replace('_', ' ')
        images[filename] = author
        
plt.figure(1, figsize=(12, 8))

#displaying nine randomly selected images
filepaths = list(images.keys())
for n in range(1, 10):
    random_img = random.choice(filepaths)
    imgs = plt.imread(random_img)
    plt.subplot(3, 3, n)
    plt.axis('off')
    plt.imshow(imgs)
    plt.title(images[random_img])

plt.show()
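
A note on the Windows-style paths used in this notebook: in an ordinary Python string literal, a backslash starts an escape sequence, so a path such as `'data\artists'` does not contain the characters it appears to. The short sketch below compares an ordinary literal, a raw string, and `os.path.join`. The folder names are used purely as examples.

```python
import os

plain = 'data\artists'   # '\a' is interpreted as the bell character
raw = r'data\artists'    # raw string keeps the backslash

print(len(plain), len(raw))
print('\x07' in plain)

# portable alternative; 'data' and 'artists' are just example folder names
portable = os.path.join('data', 'artists')
print(portable)
```

Raw strings or `os.path.join` avoid the problem. The latter also works unchanged on Linux and macOS.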

Next, I will build a simple deep learning network using TensorFlow to try to predict the authors of the paintings. I will start by generating a dataset from the images in the folder on my local machine. I will resize all images to 180x180 pixels and split them into training (80 percent) and validation (20 percent) sets.

In [None]:
image_size = (180, 180)
batch_size = 32

train_ds = tf.keras.utils.image_dataset_from_directory(
    r'data\artists',  #raw string: '\a' would otherwise be read as a bell character
    validation_split=0.2,
    subset='training',
    seed=17,
    image_size=image_size,
    batch_size=batch_size
    )

valid_ds = tf.keras.utils.image_dataset_from_directory(
    r'data\artists',  #raw string: '\a' would otherwise be read as a bell character
    validation_split=0.2,
    subset='validation',
    seed=17,
    image_size=image_size,
    batch_size=batch_size
)
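
Both calls use the same `seed` and `validation_split`, which is what keeps the training and validation subsets disjoint. The sketch below illustrates the idea with a reproducible shuffle followed by a hold-out of the last 20 percent. It is a simplified stand-in, not the actual Keras implementation: `split_filepaths` and the file names are hypothetical, and the count of 8125 simply matches the number of files reported in the output further down in this notebook.

```python
import random

def split_filepaths(filepaths, validation_split, seed):
    """Shuffle reproducibly, then hold out the last fraction for validation."""
    shuffled = list(filepaths)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(validation_split * len(shuffled))
    split_at = len(shuffled) - n_valid
    return shuffled[:split_at], shuffled[split_at:]

# hypothetical file names
toy_files = [f'img_{i}.jpg' for i in range(8125)]
train_files, valid_files = split_filepaths(toy_files, validation_split=0.2, seed=17)
print(len(train_files), len(valid_files))
```

Calling the function twice with the same seed yields the same partition. A different seed in one of the two calls could let images leak between the subsets.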

Given the relatively small number of images in the dataset, I will use data augmentation to mitigate overfitting and help the model generalize better to unseen images. I will use two data augmentation layers that randomly increase or decrease the brightness and contrast of images.

In [None]:
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomBrightness(0.5),
    tf.keras.layers.RandomContrast(0.5)
])
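
To make the two transformations concrete, here is a NumPy approximation of what brightness and contrast augmentation do to the pixel values. It is a sketch of the general technique, not the Keras layers' code: the function names, the way the random factor is drawn, and the toy image are all assumptions made for illustration.

```python
import numpy as np

def random_brightness(image, factor, rng, value_range=(0.0, 255.0)):
    """Add one random offset to every pixel and clip to the valid range."""
    low, high = value_range
    delta = rng.uniform(-factor, factor) * (high - low)
    return np.clip(image + delta, low, high)

def random_contrast(image, factor, rng):
    """Scale each channel's distance from its mean by a random factor."""
    scale = rng.uniform(1.0 - factor, 1.0 + factor)
    mean = image.mean(axis=(0, 1), keepdims=True)
    return (image - mean) * scale + mean

rng = np.random.default_rng(17)
toy_image = rng.uniform(0, 255, size=(180, 180, 3))  # hypothetical pixel data
brighter = random_brightness(toy_image, 0.5, rng)
contrasted = random_contrast(toy_image, 0.5, rng)
print(brighter.shape, contrasted.shape)
```

Brightness shifts all pixels by the same amount, while contrast stretches or compresses them around the channel mean, so the mean itself stays unchanged.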

In [None]:
def visualize(original, augmented):
    plt.subplot(1,2,1)
    plt.title('Original image')
    plt.imshow(original[0].numpy().astype("int32"))
    plt.axis('off')

    plt.subplot(1,2,2)
    plt.title('Augmented image')
    plt.imshow(augmented[0].numpy().astype("int32"))
    plt.axis('off')

Below, I will compare original resized images with the same images augmented using RandomBrightness and RandomContrast.

In [None]:
image, label = next(iter(train_ds))
augmented_image = tf.keras.layers.RandomBrightness(0.5)(
            tf.expand_dims(image[0], 0), training=True
            )
visualize(image, augmented_image)

In [None]:
image, label = next(iter(train_ds))
augmented_image = tf.keras.layers.RandomContrast(0.5)(
            tf.expand_dims(image[0], 0), training=True
            )
visualize(image, augmented_image)

I will use the data augmentation layers as the first layer of the network. I will add several convolutional layers, each followed by a max pooling layer, and a dropout layer for regularization. The final layer will output a probability for each of the 50 possible artists.

In [None]:
model = tf.keras.Sequential([
    data_augmentation,
    tf.keras.layers.Rescaling(1./255),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(256, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(50, activation='softmax')
])
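
It can help to trace how the spatial size of a 180x180 input shrinks through this stack. The sketch below assumes the Keras defaults used above (3x3 convolutions with stride 1 and no padding, 2x2 max pooling). `conv_output` and `pool_output` are helper names introduced only for this illustration.

```python
def conv_output(size, kernel=3):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1

def pool_output(size, pool=2):
    """Spatial size after non-overlapping max pooling (remainder dropped)."""
    return size // pool

side = 180
for filters in (32, 64, 128, 256):
    side = pool_output(conv_output(side))
    print(f'after Conv2D({filters}) + MaxPooling2D: {side}x{side}x{filters}')

flat_features = side * side * 256
dense_params = flat_features * 512 + 512  # weights plus biases of Dense(512)
print(flat_features, dense_params)
```

Most of the network's parameters sit in the first dense layer, which is one reason the model can overfit a dataset of this size.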

In [None]:
model.compile(
  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
  loss=tf.keras.losses.SparseCategoricalCrossentropy(),
  metrics=['accuracy'])
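
The sparse variant of categorical cross-entropy is used because the dataset yields integer class indices rather than one-hot vectors. As a minimal NumPy sketch of the quantity being minimized (the probabilities and labels below are hypothetical):

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probabilities):
    """Mean negative log-probability assigned to the true class."""
    labels = np.asarray(labels)
    probabilities = np.asarray(probabilities, dtype=float)
    true_class_probs = probabilities[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())

# hypothetical softmax outputs for two paintings and three classes
toy_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.1, 0.8]]
toy_labels = [0, 2]  # integer class indices, not one-hot vectors
loss = sparse_categorical_crossentropy(toy_labels, toy_probs)
print(round(loss, 4))
```

For reference, a model that assigns equal probability to all 50 artists would have a loss of ln(50), roughly 3.91.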

After compiling the model, I will train the network for 15 epochs and then evaluate the accuracy of its predictions on a validation set.

In [None]:
epochs = 15

history = model.fit(
  train_ds,
  validation_data=valid_ds,
  epochs=epochs
)

In [None]:
# https://stackoverflow.com/a/67256122
predictions = np.array([])
labels =  np.array([])

for x, y in valid_ds:
    predictions = np.concatenate([predictions, np.argmax(model.predict(x, verbose=0), axis=-1)])
    labels = np.concatenate([labels, y.numpy()])

m = tf.keras.metrics.Accuracy()
m(labels, predictions).numpy()

The accuracy of the model's predictions on the validation set is rather low. It peaked at around 33 percent after 11 epochs. This is not surprising for a rather simple neural network architecture, given the complicated nature of the task, the high number of possible classification labels, and the rather limited number of training examples.

To see how the accuracy of predictions on the training and validation sets changed after each epoch, I'll create a simple plot.

In [None]:
def plot_accuracy(history, epochs):
    """Plot training and validation accuracy for each epoch."""
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']

    epochs = range(1, epochs + 1)

    plt.plot(epochs, acc, 'r', label='Training accuracy')
    plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
    plt.title('Training and validation accuracy')
    plt.legend()

    plt.show()

In [None]:
plot_accuracy(history, epochs)

As the plot shows, the validation accuracy peaked at around 32-33 percent after 10 epochs and flattened out after that. Meanwhile, the training accuracy started below the validation accuracy but increased at a much faster rate, leading to significant overfitting.
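
When validation accuracy flattens while training accuracy keeps rising, a usual remedy is to stop at the epoch with the best validation score. The pure-Python sketch below shows the selection rule. The accuracy values are hypothetical, not results from this notebook, and `best_epoch` is a helper name introduced here.

```python
def best_epoch(val_accuracy, patience=3):
    """Return (epoch, accuracy) of the best validation score seen before
    `patience` consecutive epochs pass without improvement."""
    best_idx, best_val = 0, val_accuracy[0]
    for idx, value in enumerate(val_accuracy):
        if value > best_val:
            best_idx, best_val = idx, value
        elif idx - best_idx >= patience:
            break
    return best_idx + 1, best_val

# hypothetical validation accuracies
toy_val_acc = [0.10, 0.18, 0.25, 0.30, 0.33, 0.32, 0.33, 0.31, 0.32, 0.30]
print(best_epoch(toy_val_acc))
```

In Keras, the same idea is available during training through the `tf.keras.callbacks.EarlyStopping` callback, which can monitor `val_accuracy` with a `patience` setting.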

In [None]:
# #getting predicted probabilities
# y_pred_proba = model.predict(valid_ds)

# #getting predicted classes - https://github.com/keras-team/keras/issues/5961
# y_pred = y_pred_proba.argmax(axis=-1)

# #getting actual classes - https://stackoverflow.com/a/62823218
# y = np.concatenate([y for _, y in valid_ds], axis=0)

artists_alphabetic = artists.sort_values('name').reset_index(drop=True)
artists_dict = dict(zip(artists_alphabetic.index, artists_alphabetic.name))

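#predictions and labels are float arrays, so the values are cast to int before the lookup
pred_artists = [artists_dict[int(k)] for k in predictions]
actual_artists = [artists_dict[int(k)] for k in labels]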
results_df = pd.DataFrame({'predicted': pred_artists, 'actual': actual_artists})
results_df['result'] = results_df.predicted == results_df.actual
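
The mapping above assumes that sorting artist names alphabetically reproduces the order in which class indices were assigned to the image folders. A slightly more defensive option is to build the mapping from the folder names themselves. The datasets returned by `image_dataset_from_directory` expose them as `class_names`. The sketch below uses a hypothetical three-folder list in place of that attribute.

```python
# hypothetical folder names, in the form used by the image directory
class_names = sorted(['Vincent_van_Gogh', 'Edgar_Degas', 'Pablo_Picasso'])

# class index -> readable artist name
index_to_artist = {i: name.replace('_', ' ') for i, name in enumerate(class_names)}

toy_predictions = [2.0, 0.0, 1.0]  # float class indices, as produced above
toy_pred_artists = [index_to_artist[int(k)] for k in toy_predictions]
print(toy_pred_artists)
```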

In [None]:
sum(results_df.predicted == results_df.actual)

In [None]:
correct_by_artist = results_df.groupby('actual')['result'].mean()
correct_by_artist = pd.DataFrame({'name': correct_by_artist.index, 'share_correct': correct_by_artist.values})

artist_paintings_dict = dict(zip(artists.name, artists.paintings))

correct_by_artist['total_paintings'] = correct_by_artist.name.map(artist_paintings_dict)
correct_by_artist.sort_values('share_correct', ascending=False)

sns.regplot(y=correct_by_artist['share_correct'], x=correct_by_artist['total_paintings'], ci=None)

In [None]:
correct_by_artist.sort_values('share_correct', ascending=False)
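
The regression plot gives a visual impression of how per-artist accuracy relates to the number of paintings. A correlation coefficient can summarize the same relationship in one number. The sketch below uses a hypothetical four-row table with the same column names, so the values say nothing about the real results.

```python
import pandas as pd

# hypothetical per-artist results, not taken from the notebook's output
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'share_correct': [0.10, 0.20, 0.35, 0.60],
    'total_paintings': [50, 120, 300, 800],
})

# Pearson correlation between accuracy and number of paintings
corr = toy['share_correct'].corr(toy['total_paintings'])
print(round(corr, 3))
```

On the real table, the equivalent call would be `correct_by_artist['share_correct'].corr(correct_by_artist['total_paintings'])`.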

To improve the performance of the model, I will use transfer learning. I will use a pre-trained ResNet-50 convolutional neural network as a base model. To adapt it to the painting dataset and the classification task, I will add a 50-label classification layer on top of the network, along with an average pooling layer and a dropout layer, and train it on the dataset. As the ResNet-50 network accepts 224x224 images, I will re-create the training and validation datasets with the necessary image dimensions.

In [None]:
image_size = (224, 224)
batch_size = 32

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    r'data\artists',  #raw string: '\a' would otherwise be read as a bell character
    validation_split=0.2,
    subset='training',
    seed=17,
    image_size=image_size,
    batch_size=batch_size
)

valid_ds = tf.keras.preprocessing.image_dataset_from_directory(
    r'data\artists',  #raw string: '\a' would otherwise be read as a bell character
    validation_split=0.2,
    subset='validation',
    seed=17,
    image_size=image_size,
    batch_size=batch_size
)
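
These datasets yield raw pixel values between 0 and 255, whereas the ImageNet weights of ResNet-50 were trained on inputs passed through `tf.keras.applications.resnet50.preprocess_input` (the commented-out inference code at the end of this notebook calls it as well). To my understanding, that function applies Caffe-style preprocessing. The NumPy sketch below approximates it for illustration. `caffe_style_preprocess` is a name introduced here and the grey toy batch is hypothetical.

```python
import numpy as np

# per-channel ImageNet means in BGR order, as used by Caffe-style preprocessing
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68])

def caffe_style_preprocess(rgb_batch):
    """Convert RGB to BGR and subtract the ImageNet channel means."""
    bgr = rgb_batch[..., ::-1].astype('float64')
    return bgr - IMAGENET_BGR_MEAN

toy_batch = np.full((2, 224, 224, 3), 128.0)  # hypothetical grey images
processed = caffe_style_preprocess(toy_batch)
print(processed[0, 0, 0])
```

In practice, the library function itself would be applied to the images before they reach the base model. The sketch only shows what that step does to the numbers.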

In [5]:
base_model_rn = tf.keras.applications.resnet50.ResNet50(input_shape=(224, 224, 3), include_top=False, weights='imagenet')

#freezing the pre-trained layers so that only the new layers are trained
for layer in base_model_rn.layers:
    layer.trainable = False

global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
dropout_layer = tf.keras.layers.Dropout(0.2)
prediction_layer = tf.keras.layers.Dense(22, activation='softmax')

model_rn = tf.keras.Sequential([
    base_model_rn,
    global_average_layer,
    dropout_layer,
    prediction_layer
])

model_rn.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy'])
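
With the base network frozen, only the small head on top is trained. The NumPy sketch below illustrates what global average pooling does to the base model's output and how few trainable parameters the head has. It assumes that ResNet-50 without its top produces 7x7x2048 feature maps for 224x224 inputs. The feature values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical feature maps: batch of 4, 7x7 spatial grid, 2048 channels
feature_maps = rng.normal(size=(4, 7, 7, 2048))

# global average pooling: mean over the two spatial axes
pooled = feature_maps.mean(axis=(1, 2))
print(pooled.shape)

n_classes = 22  # size of the Dense output layer in the cell above
head_params = pooled.shape[1] * n_classes + n_classes
print(head_params)
```

Compared with flattening, pooling keeps the head small, which limits how much it can overfit.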

In [6]:
epochs = 10

history = model_rn.fit(
  styles_train_ds,
  validation_data=styles_valid_ds,
  epochs=epochs
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10

KeyboardInterrupt: 

In [2]:
image_size = (224, 224)
batch_size = 32

styles_train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    r'data\styles',
    validation_split=0.2,
    subset='training',
    seed=17,
    image_size=image_size,
    batch_size=batch_size
)

styles_valid_ds = tf.keras.preprocessing.image_dataset_from_directory(
    r'data\styles',
    validation_split=0.2,
    subset='validation',
    seed=17,
    image_size=image_size,
    batch_size=batch_size
)

Found 8125 files belonging to 22 classes.
Using 6500 files for training.
Found 8125 files belonging to 22 classes.
Using 1625 files for validation.


In [45]:
base_model_en = tf.keras.applications.vgg16.VGG16(input_shape=(224, 224, 3), include_top=False, weights='imagenet')

for layers in base_model_en.layers:
    layers.trainable = False

model_en = tf.keras.Sequential([
    base_model_en,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(22, activation='softmax')
])

model_en.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy'])

In [35]:
# for folder in os.listdir('data/styles/'):
#     files = sorted(os.listdir(f'data/styles/{folder}/'))
#     sample_number = len(files) // 10
#     random_files = random.sample(files, sample_number)
#     excessive_files = list(set(files) - set(random_files))
#     for file in excessive_files:
#         os.remove(f'data/styles/{folder}/{file}')

In [46]:
epochs = 10

history_en = model_en.fit(
  styles_train_ds,
  validation_data=styles_valid_ds,
  epochs=epochs
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10

KeyboardInterrupt: 

In [None]:
model_en.save('efficientnetb0.h5')

After adding the necessary layers on top of the pre-trained network, I will train these additional layers on the training dataset for 20 epochs.

Next, I will plot how the training and validation accuracies changed with each epoch.

In [None]:
plot_accuracy(history, epochs)

The accuracy of predictions on the validation set increased significantly with the use of the pre-trained network. The validation accuracy peaked at

Finally, I will train the model on the full dataset. Given that the validation accuracy ceased to improve significantly after ... epochs, I will train the model for this number of epochs.

In [None]:
image_size = (224, 224)
batch_size = 32

full_ds = tf.keras.preprocessing.image_dataset_from_directory(
    r'data\images',
    seed=17,
    image_size=image_size,
    batch_size=batch_size,
)

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy'])

In [None]:
epochs = 5

history = model.fit(
  full_ds,
  epochs=epochs
)

After the model is trained, I will save it. I will need the saved model to develop a simple web app that will predict the authors of famous paintings. I will use Hugging Face Spaces for this purpose. The app will be available [here](https://huggingface.co/spaces/osydorchuk/painting_authors).
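
For the app, the model's raw output (one probability per class) has to be turned into readable artist names. A minimal sketch of that step is shown below. `top_k_artists`, the class names, and the probabilities are all hypothetical and stand in for the model's real output.

```python
import numpy as np

def top_k_artists(probabilities, class_names, k=3):
    """Return the k most probable (artist, probability) pairs, best first."""
    probabilities = np.asarray(probabilities, dtype=float)
    top = np.argsort(probabilities)[::-1][:k]
    return [(class_names[i], float(probabilities[i])) for i in top]

# hypothetical class names and softmax output for a single painting
toy_names = ['Edgar Degas', 'Pablo Picasso', 'Vincent van Gogh']
toy_probs = [0.15, 0.05, 0.80]
print(top_k_artists(toy_probs, toy_names, k=2))
```

Showing the top few candidates rather than a single label gives the user a sense of how confident the model is.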

In [None]:
model.save('efficientnetb0.h5')

In [None]:
tf.__version__

In [None]:
# image = tf.keras.utils.load_img(r'test_data\vangoghmuseum-s0005V1962-800.jpg')
# input_arr = tf.keras.utils.img_to_array(image)
# input = tf.image.resize(input_arr, [224, 224])
# input = tf.keras.applications.resnet50.preprocess_input(input)
# input = np.array([input])
# predictions = model.predict(input)
# predictions