## **WikiArt Dataset Preprocessing**

#### **Overview**


Our task is to build a system that classifies images and retrieves similar ones. We chose the [WikiArt](https://www.kaggle.com/datasets/steubk/wikiart) dataset because it covers a wide variety of classes and contains a large number of images.


This notebook applies a few transformations to the dataset metadata and checks it for duplicate images.


#### **Dataset Format**


The dataset contains `80,042` artworks by different artists (in `.jpg` format). The images are split into folders by genre.


The image metadata is stored in the `classes.csv` file.


In [1]:
import pandas as pd
import ast
from sklearn.preprocessing import MultiLabelBinarizer
import cv2
import numpy as np
from tqdm import tqdm

DS_FOLDER = 'dataset/'

In [3]:
df = pd.read_csv(DS_FOLDER + 'classes.csv')

# remove unnecessary columns
df = df.drop(['artist', 'description'], axis=1)
df.head()

| | filename | genre | phash | width | height | genre_count | subset |
|---|---|---|---|---|---|---|---|
| 0 | Abstract_Expressionism/aaron-siskind_acolman-1... | ['Abstract Expressionism'] | bebbeb018a7d80a8 | 1922 | 1382 | 1 | train |
| 1 | Abstract_Expressionism/aaron-siskind_chicago-6... | ['Abstract Expressionism'] | d7d0781be51fc00e | 1382 | 1746 | 1 | train |
| 2 | Abstract_Expressionism/aaron-siskind_glouceste... | ['Abstract Expressionism'] | 9f846e5a6c639325 | 1382 | 1857 | 1 | train |
| 3 | Abstract_Expressionism/aaron-siskind_jerome-ar... | ['Abstract Expressionism'] | a5d691f85ac5e4d0 | 1382 | 1849 | 1 | train |
| 4 | Abstract_Expressionism/aaron-siskind_kentucky-... | ['Abstract Expressionism'] | 880df359e6b11db1 | 1382 | 1625 | 1 | train |


In [4]:
df.subset.value_counts()

| subset | count |
|---|---|
| train | 63998 |
| test | 16000 |
| uncertain artist | 44 |


Remove the images with an uncertain artist, leaving only the `train` and `test` sets:

In [5]:
df = df[df['subset'] != 'uncertain artist']

Check for duplicates (by perceptual hash) and `NaN` values:

In [6]:
# Double quotes inside the f-string keep this valid on Python versions before 3.12
print(f'The amount of duplicates: {len(df) - df["phash"].nunique()}')
print(f'The amount of nan values: {df.isna().sum().sum()}')

df = df.drop(['phash'], axis=1)

The amount of duplicates: 0
The amount of nan values: 0
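
The check above only counts rows whose `phash` repeats; it does not show which rows they are. The following is a minimal sketch, on a toy frame with invented file names and hashes (not values from `classes.csv`), of how the duplicated rows could be listed and dropped with `DataFrame.duplicated` and `DataFrame.drop_duplicates`. Comparing hashes for equality only catches images whose perceptual hashes are identical.

```python
import pandas as pd

# Toy data: file names and hashes are invented for illustration.
toy = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "phash": ["ff00", "0f0f", "ff00", "abcd"],
})

# Same count as in the cell above: total rows minus distinct hashes.
n_duplicates = len(toy) - toy["phash"].nunique()

# keep=False marks every member of a duplicate group, not only the repeats.
duplicate_rows = toy[toy.duplicated(subset="phash", keep=False)]

# Keep the first image of each hash group and drop the rest.
deduplicated = toy.drop_duplicates(subset="phash", keep="first")

print(n_duplicates)
print(duplicate_rows)
print(len(deduplicated))
```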


Apply one-hot encoding:

In [7]:
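# Parse the string representation of each genre list into a Python list
df.genre = df.genre.apply(ast.literal_eval)

mlb = MultiLabelBinarizer()

one_hot_encoded = mlb.fit_transform(df.genre)
# Reuse df's index so that concat aligns the rows even after the filtering above
one_hot_df = pd.DataFrame(one_hot_encoded, columns=mlb.classes_, index=df.index)

df = pd.concat([df, one_hot_df], axis=1)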
df.head()

| | filename | genre | width | height | genre_count | subset | Abstract Expressionism | Action painting | Analytical Cubism | Art Nouveau Modern | ... | Northern Renaissance | Pointillism | Pop Art | Post Impressionism | Realism | Rococo | Romanticism | Symbolism | Synthetic Cubism | Ukiyo e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abstract_Expressionism/aaron-siskind_acolman-1... | [Abstract Expressionism] | 1922 | 1382 | 1 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Abstract_Expressionism/aaron-siskind_chicago-6... | [Abstract Expressionism] | 1382 | 1746 | 1 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Abstract_Expressionism/aaron-siskind_glouceste... | [Abstract Expressionism] | 1382 | 1857 | 1 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Abstract_Expressionism/aaron-siskind_jerome-ar... | [Abstract Expressionism] | 1382 | 1849 | 1 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Abstract_Expressionism/aaron-siskind_kentucky-... | [Abstract Expressionism] | 1382 | 1625 | 1 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
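
`pd.concat(..., axis=1)` aligns rows by index label rather than by position, which matters whenever rows have been filtered out before the encoded frame is attached. The following is a minimal sketch on toy data (the file names and genres are illustrative, not taken from `classes.csv`) comparing an encoded frame that reuses the original index with one that gets a fresh `0..n-1` index.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: file names and genres are invented for illustration.
toy = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg"],
    "genre": [["Realism"], ["Cubism", "Realism"], ["Rococo"]],
})

# Filtering leaves a gap in the index: only labels 0 and 2 remain.
toy = toy[toy["filename"] != "b.jpg"]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(toy["genre"])

# Aligned: the encoded frame carries the same index labels as `toy`.
aligned = pd.concat(
    [toy, pd.DataFrame(encoded, columns=mlb.classes_, index=toy.index)],
    axis=1,
)

# Misaligned: a default 0..n-1 index makes concat take the union of labels,
# which adds a row and fills the unmatched cells with NaN.
misaligned = pd.concat(
    [toy, pd.DataFrame(encoded, columns=mlb.classes_)],
    axis=1,
)

print(aligned)
print(misaligned)
```

When the filtered rows happen to sit at the end of the frame, both variants give the same result, so the difference only shows up once the index has gaps.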


Check that the provided `genre_count` matches the sum of the one-hot encoded columns for each image:

In [8]:
one_hot_cols = df.columns[7:]
df['genre_count_ohe'] = df[one_hot_cols].sum(axis=1)
mismatch = df[df['genre_count_ohe'] != df['genre_count']]

print(f'Total amount of mismatches: {len(mismatch)}')

Total amount of mismatches: 2859

Overwrite the provided `genre_count` with the recomputed sum and drop the helper column:


In [9]:
df['genre_count'] = df['genre_count_ohe']
df = df.drop(['genre_count_ohe'], axis=1)
df.head()

| | filename | genre | width | height | genre_count | subset | Abstract Expressionism | Action painting | Analytical Cubism | Art Nouveau Modern | ... | Northern Renaissance | Pointillism | Pop Art | Post Impressionism | Realism | Rococo | Romanticism | Symbolism | Synthetic Cubism | Ukiyo e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abstract_Expressionism/aaron-siskind_acolman-1... | [Abstract Expressionism] | 1922 | 1382 | 0 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Abstract_Expressionism/aaron-siskind_chicago-6... | [Abstract Expressionism] | 1382 | 1746 | 0 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Abstract_Expressionism/aaron-siskind_glouceste... | [Abstract Expressionism] | 1382 | 1857 | 0 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Abstract_Expressionism/aaron-siskind_jerome-ar... | [Abstract Expressionism] | 1382 | 1849 | 0 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Abstract_Expressionism/aaron-siskind_kentucky-... | [Abstract Expressionism] | 1382 | 1625 | 0 | train | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
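
In the table above, `genre_count` is 0 for rows whose `Abstract Expressionism` flag is 1. This suggests that the positional slice `df.columns[7:]` may start one column past the first one-hot column, in which case the mismatch count reported earlier would reflect the slice rather than the data. The following is a minimal sketch, on toy data with illustrative genres, of selecting the one-hot columns by name from `mlb.classes_`, so that the sum does not depend on how many columns precede them. The off-by-one slice is reproduced alongside it for comparison.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: file names, genres and counts are invented for illustration.
toy = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg"],
    "genre": [["Baroque"], ["Baroque", "Realism"], ["Rococo"]],
    "genre_count": [1, 2, 1],
})

mlb = MultiLabelBinarizer()
one_hot = pd.DataFrame(
    mlb.fit_transform(toy["genre"]), columns=mlb.classes_, index=toy.index
)
toy = pd.concat([toy, one_hot], axis=1)

# Select the encoded columns by name instead of by position.
one_hot_cols = list(mlb.classes_)
by_name = toy[one_hot_cols].sum(axis=1)

# Three columns precede the one-hot block here, so [4:] skips "Baroque".
by_position = toy[toy.columns[4:]].sum(axis=1)

mismatches_by_name = int((by_name != toy["genre_count"]).sum())
mismatches_by_position = int((by_position != toy["genre_count"]).sum())

print(mismatches_by_name, mismatches_by_position)
```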


In [10]:
print(f'The amount of nan values: {df.isna().sum().sum()}')
df.groupby('subset').filename.nunique()

The amount of nan values: 0


| subset | filename |
|---|---|
| test | 16000 |
| train | 63998 |


In [11]:
df.to_csv('dataset.csv')
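
With its default arguments, `to_csv` writes the index as an unnamed first column and stores each `genre` list as its string representation. The following is a minimal sketch, on a toy frame with invented values and an in-memory buffer in place of `dataset.csv`, of how the file could be read back so that the index and the lists are restored.

```python
import ast
import io

import pandas as pd

# Toy data: file names and genres are invented for illustration.
toy = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg"],
    "genre": [["Realism"], ["Cubism", "Realism"]],
})

# Same defaults as df.to_csv('dataset.csv'), but returned as a string.
csv_text = toy.to_csv()

restored = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,                             # first column holds the saved index
    converters={"genre": ast.literal_eval},  # turn "['Realism']" back into a list
)

print(restored)
```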