# ArtStyle

This project classifies paintings by style and compares several approaches to doing so.

### Data

The data consist of several thousand images of paintings and a CSV file containing their labels. As a pre-processing step, styles that occur very infrequently were removed. The remaining data were divided into training and testing sets using the script `split.py`.
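
As a rough illustration of this pre-processing, the sketch below drops styles with too few paintings and then makes a stratified split. It is not the contents of `split.py`: the helper name `filter_and_split`, the `min_count` and `test_frac` values, and the use of scikit-learn's `train_test_split` are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def filter_and_split(labels, min_count=2, test_frac=0.25, seed=0):
    """Drop rare styles, then split the rest into train and test frames.

    Hypothetical helper: `min_count` and `test_frac` are illustrative values,
    not the ones used by the project.
    """
    counts = labels['style'].value_counts()
    keep = counts[counts >= min_count].index
    filtered = labels[labels['style'].isin(keep)].reset_index(drop=True)
    # Stratify so each remaining style keeps its share in both splits.
    train, test = train_test_split(
        filtered, test_size=test_frac, stratify=filtered['style'], random_state=seed
    )
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Toy label table in the same two-column layout as the project's CSV files.
labels = pd.DataFrame({
    'style': ['Baroque'] * 8 + ['Cubism'] * 8 + ['Rococo'],
    'filename': ['img{0}.jpg'.format(i) for i in range(17)],
})
train, test = filter_and_split(labels, min_count=4)
```

Here the single `Rococo` painting falls below `min_count` and is dropped before the split, so neither frame contains it.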

The CSV files for the train, test, and combined data have the format:
```
style   |   filename
----------------------
style1  |   filename1
style2  |   filename2
...
```

For some classifiers, features are extracted from the image data in advance and stored in separate CSV files to save time.
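
The caching idea can be sketched as follows. The features here (per-channel mean and standard deviation) and the helper names are hypothetical stand-ins, not what the project's extraction script computes; only the headerless `style, filename, feature1, ...` layout follows the feature file read later in the notebook.

```python
import io

import numpy as np
import pandas as pd


def extract_features(im):
    """Hypothetical features: per-channel mean and standard deviation."""
    im = np.asarray(im, dtype=float)
    return np.concatenate([im.mean(axis=(0, 1)), im.std(axis=(0, 1))])


def build_feature_table(styles, filenames, images):
    """One row per image: style, filename, then the feature values."""
    rows = [[s, f, *extract_features(im)] for s, f, im in zip(styles, filenames, images)]
    return pd.DataFrame(rows)


# Two flat-colored toy images stand in for paintings.
images = [np.full((2, 2, 3), 10), np.full((2, 2, 3), 200)]
table = build_feature_table(['style1', 'style2'], ['a.jpg', 'b.jpg'], images)

# Write without header or index, then read back the same way.
# An in-memory buffer stands in for a CSV file on disk.
buf = io.StringIO()
table.to_csv(buf, header=False, index=False)
buf.seek(0)
loaded = pd.read_csv(buf, header=None)
```

Computing features once and reloading them avoids decoding every image each time a classifier is re-run.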

### Classifiers

There are two *baseline* approaches. In both baseline approaches, there is no training phase.

* **Naive Classification** assigns the most common style label to every painting.
* **Expert Classification** uses hand-crafted thresholds on various image properties to classify paintings. The thresholds are chosen by programmers who look at paintings, predict their styles, and then try to emulate the thought process behind those predictions.
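
A minimal sketch of the majority-label idea is shown below. The `train`/`evaluate` interface mirrors how `NaiveClassifier` is called in the notebook, but the class `MajorityBaseline` itself is a hypothetical illustration, not the project's implementation.

```python
from collections import Counter


class MajorityBaseline:
    """Predicts the most frequent training label for every input."""

    def train(self, X, Y):
        # X is ignored; only the label distribution matters.
        self.majority_ = Counter(Y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_] * len(X)

    def evaluate(self, X, Y):
        # Returns (accuracy, predictions).
        pred = self.predict(X)
        acc = sum(p == y for p, y in zip(pred, Y)) / len(Y)
        return acc, pred


mb = MajorityBaseline().train(
    ['a.jpg', 'b.jpg', 'c.jpg'], ['Impressionism', 'Impressionism', 'Baroque']
)
acc, pred = mb.evaluate(['d.jpg', 'e.jpg'], ['Impressionism', 'Cubism'])
```

The accuracy of such a baseline equals the share of the majority style in the evaluated data, which gives a floor that the other classifiers should beat.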

There are several approaches used for comparative study:

* **kNN Classification** uses votes cast by the data points closest to a test point in feature space to determine the final style. Features are pre-computed using the `extractfeatures.py` script.
* **Decision Tree Classification** successively splits the data on the most useful features until each resulting subset contains a single style. A prediction is made by running a test point through the same sequence of splits and reading off the style of the subset in which it ends up.
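
The two approaches can be compared on toy data with scikit-learn's `KNeighborsClassifier` and `DecisionTreeClassifier`. This is a sketch under assumptions: the synthetic two-dimensional features stand in for the pre-computed image features, and the project's own classifier wrappers may be configured differently.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for pre-computed features: two well-separated clusters.
X_train = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(5.0, 0.5, size=(20, 2)),
])
y_train = np.array(['style1'] * 20 + ['style2'] * 20)
X_test = np.array([[0.1, -0.2], [4.8, 5.3]])

# kNN: the 5 nearest training points vote on the label.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# Decision tree: splits on features until the leaves are pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

knn_pred = knn.predict(X_test)
tree_pred = tree.predict(X_test)
```

On data this cleanly separated both models agree; differences only show up when styles overlap in feature space, as they do for real paintings.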

In [52]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from glob import glob
import os
import imageio
from scipy import ndimage

from naiveclassifier import NaiveClassifier
from expertclassifier import ExpertClassifier
from knnclassifier import KNNClassifier
from utils import ImageStreamer, plot_cmatrix

# CSV of files to be used in training set.
traindatapath = 'data/train.csv'
# CSV of pre-extracted features for kNN and DecisionTree classifiers for files in training set.
featuredatapath = 'data/features.csv'
# column names of CSV
cols = ['style', 'filename']
# directory in which images are downloaded. Both training and test images reside in the same directory.
imgdir = r'D:\ArtStyle\Test'
data = pd.read_csv(traindatapath, header=0, names=cols)

## Preprocessing

In [2]:
# Getting list of available data points (i.e. files downloaded to device, not filtered out and not in test set)
downloaded_fnames = os.listdir(imgdir)
train_fnames = list(data['filename'].values)             # filenames in training split (downloaded or not)
train_fname_set = set(train_fnames)                      # set gives constant-time membership tests
avail_fnames = [f for f in downloaded_fnames if f in train_fname_set]  # downloaded filenames in training split

print('{0} files available out of {1} files downloaded.'.format(len(avail_fnames), len(downloaded_fnames)))

17111 files available out of 23807 files downloaded.


In [3]:
# truncating data to images already downloaded - this reduces the available training/validation data
data = data.loc[data['filename'].isin(avail_fnames)].reset_index(drop=True)

# splitting training data into training and validation sets
validation_split = 0.1  # as a fraction of the training set
train_split = 1 - validation_split

data = data.sample(frac=1, random_state=0).reset_index(drop=True)

num_instances = len(data)
num_train = int(np.ceil(train_split * num_instances))
num_val = num_instances - num_train  # the validation set is whatever remains after the training rows

val = data.iloc[num_train:]
train = data.iloc[:num_train]

print('{0} instances in training.\n{1} instances in validation.'.format(len(train), len(val)))


# Get styles in data, in alphabetical order
counts = data.groupby('style').count().add_suffix('_count').reset_index()
style_names = counts['style'].squeeze().values.copy()
style_names[11] = 'Primitivism'  # different label used for human predictions
display(counts)

15400 instances in training.
1711 instances in validation.


|   | style | filename_count |
|---|-------|----------------|
| 0 | Abstract Expressionism | 481 |
| 1 | Art Informel | 257 |
| 2 | Art Nouveau (Modern) | 1011 |
| 3 | Baroque | 1027 |
| 4 | Cubism | 391 |
| 5 | Early Renaissance | 274 |
| 6 | Expressionism | 1546 |
| 7 | High Renaissance | 239 |
| 8 | Impressionism | 2187 |
| 9 | Magic Realism | 276 |


# Naive Classifier

The naive classifier determines the majority label in the training set and uses that as the only prediction during operation.

In [None]:
# Instantiate classifier and train
nc = NaiveClassifier()
nc.train(X=train['filename'].values, Y=train['style'].values)

# Evaluate on the training set, then on the validation set
naive_acc, _ = nc.evaluate(X=train['filename'].values, Y=train['style'].values)
print('Training accuracy', naive_acc)
naive_acc, naive_pred = nc.evaluate(X=val['filename'].values, Y=val['style'].values)
print('Validation accuracy', naive_acc)

naive_cmatrix = confusion_matrix(val['style'].values, naive_pred, labels=counts['style'].values)
plt.figure(figsize=(12,12))
plot_cmatrix(naive_cmatrix, style_names)
# np.savetxt('naive_cmatrix.csv', naive_cmatrix, fmt='%3d', delimiter=',')

# Expert Classifier

The expert classifier uses hand-designed features to determine style. Its construction is in two stages:

## Training
In this stage, programmers look at paintings, predict their styles, and record image statistics that might explain those predictions.

In [None]:
# Analyzing images to learn feature detectors by hand
plot=False
N = 50
stream = ImageStreamer(train['filename'].iloc[:N], basedir=imgdir)
ec = ExpertClassifier()
results = []
if plot:
    plt.figure(figsize=(24, int(np.ceil(N/4)) * 6))
# print('#\tstyle\t\t\tvar\tavg\tblur')
for i, im in enumerate(stream):
    if plot:
        plt.subplot(int(np.ceil(N/4)),4,i+1)
        plt.imshow(im)
        plt.title(train['style'].iloc[i])  # `avail_styles` was never defined; use the training labels
    results.append([train['style'].iloc[i], ec.variance(im), ec.avg_color(im), ec.blurriness(im, 1.0)])
#     print('{0}\t{1:20s}\t{2:5.0f}\t{3:3.0f}\t{4:3.0f}'.format(i+1, train['style'].iloc[i], ec.variance(im), ec.avg_color(im), ec.blurriness(im, 1.0)))
resdf = pd.DataFrame(results, columns=['style', 'var', 'avg', 'blur'])
display(resdf)
if plot:
    plt.show()

In [None]:
# Plotting human assigned labels for expert classifier training
human = pd.read_csv('data/human_raw.csv', header=0, usecols=[1,3,4,5,6])
plt.figure(figsize=(12,12))
for i in range(4):
    plt.subplot(2,2,i+1)
    cmatrix = confusion_matrix(human['Label'].values, human.iloc[:, i+1].values, labels=style_names)
    im = plot_cmatrix(cmatrix, cbar=False)
    plt.title(human.columns[i+1])

ticks = np.arange(len(style_names))
plt.subplot(2,2,1)
plt.yticks(ticks, style_names)
plt.subplot(2,2,3)
plt.yticks(ticks, style_names)
plt.xticks(ticks, style_names, rotation=90)
plt.subplot(2,2,4)
plt.xticks(ticks, style_names, rotation=90)

fig = plt.gcf()
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
fig.colorbar(im, cax=cbar_ax)

plt.show()


After looking at paintings, we try to emulate our decision process by establishing thresholds on various image statistics (e.g. mean brightness and color variance).
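
A threshold-based rule set of this kind can be sketched as follows. The statistics, the threshold values, and the mapping from rules to labels are all made up for illustration; they are not the rules used by `ExpertClassifier`, and the placeholder labels `style1` to `style3` do not correspond to real styles.

```python
import numpy as np


def image_stats(im):
    """Return the mean brightness and the variance of pixel values."""
    im = np.asarray(im, dtype=float)
    return im.mean(), im.var()


def threshold_classify(im, bright_thresh=128.0, var_thresh=1000.0):
    """Toy rule set; thresholds and label assignments are illustrative only."""
    mean, var = image_stats(im)
    if mean < bright_thresh and var < var_thresh:
        return 'style1'   # dark image with little variation
    if var >= var_thresh:
        return 'style2'   # strongly varying pixel values
    return 'style3'       # bright image with little variation


dark = np.full((4, 4, 3), 20)
bright = np.full((4, 4, 3), 220)
# Alternating black and white pixels give a very high variance.
checker = np.indices((4, 4)).sum(axis=0) % 2 * 255

labels = [threshold_classify(im) for im in (dark, checker, bright)]
```

Rules like these are easy to inspect, but every threshold has to be tuned by hand against the per-style statistics.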

In [None]:
grouped = resdf.groupby('style')
desc = grouped.describe(percentiles=[])
display(desc)

## Prediction

Based on insights from looking at paintings, an *Expert Classifier* is hand-designed and run on validation data.

In [None]:
# After feature detectors have been hand-designed, running them on the validation data
stream = ImageStreamer(val['filename'].values, basedir=imgdir)  # same image directory as in the training cell
ec = ExpertClassifier()
acc, expert_pred = ec.evaluate(stream, val['style'].values)
print('Validation accuracy', acc)

# Get confusion matrix
true_labels = val['style'].values
expert_cmatrix = confusion_matrix(true_labels, expert_pred, labels=counts['style'].values)
plt.figure(figsize=(12,12))
plot_cmatrix(expert_cmatrix, style_names)

# kNN Classifier

In [61]:
# reading features corresponding to images
# The features.csv is of the form (no header):
# style, filename, feature1, feature2, ...
features = pd.read_csv(featuredatapath, header=None)
features.rename(columns={0:'style', 1:'filename'}, inplace=True)
features.rename(columns={features.columns[i]: 'f' + str(features.columns[i]-2) for i in range(2, len(features.columns))}, inplace=True)
# filtering out filenames not in training set, or filtered out due to low frequency
features = features.loc[features['filename'].isin(avail_fnames)].reset_index(drop=True)
# splitting features into training and validation sets
trainf = features.loc[features['filename'].isin(train['filename'])].reset_index(drop=True)
valf = features.loc[features['filename'].isin(val['filename'])].reset_index(drop=True)

In [None]:
# Instantiate classifier
hyperparameters = {
    'n_neighbors': 5,
    'weights': 'uniform',
    'algorithm': 'auto',
    'p': 2,
    'metric': 'minkowski',
    'n_jobs': -1
}
kc = KNNClassifier(**hyperparameters)
# Train on the training split only; fitting on `features` would leak validation rows into the model.
kc.train(trainf.iloc[:, 2:].values, trainf.iloc[:, 0])
acc, knn_pred = kc.evaluate(valf.iloc[:, 2:].values, valf.iloc[:, 0])
print('Validation accuracy', acc)

# Get confusion matrix
true_labels = valf['style'].values
knn_cmatrix = confusion_matrix(true_labels, knn_pred, labels=counts['style'].values)
plt.figure(figsize=(12,12))
plot_cmatrix(knn_cmatrix, style_names)