### MACHINE LEARNING PROJECT

To start this project I needed a dataset. I couldn't find an appropriate one, but I found [this repository][1] by  
Lucas David, which gave me a great starting point.

Using his code I downloaded a large number of pictures by a list of artists with noticeably different styles:

* Caravaggio
* Edgar Degas
* Francisco de Goya
* Katsushika Hokusai
* Frida Kahlo
* Wassily Kandinsky
* Gustav Klimt
* Roy Lichtenstein
* Piet Mondrian
* Claude Monet
* Pablo Picasso
* Jackson Pollock
* Joaquín Sorolla
* Diego Velazquez
* Andy Warhol

It took me around three hours to get all the XXXXX pictures.

In this notebook I perform basic operations on the images and define the functions that let me build my own dataset.

[1]: https://github.com/lucasdavid/wikiart

___
### PREREQUISITES

To process images I'll be using the OpenCV library. It is important to take a look at the [docs][1] before running the next cell,  
as you may want to use another OpenCV package. For this project I'll use the *'main modules package'*.

[1]:https://pypi.org/project/opencv-python/

In [1]:
# !pip install opencv-python
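
The OpenCV wheels on PyPI come in four variants (`opencv-python`, `opencv-contrib-python` and their `-headless` counterparts), and the package docs advise installing only one of them per environment. As a minimal sketch, the check below lists which variants are already installed before running the `pip install` cell. The helper name `installed_opencv_packages` is my own and is not part of OpenCV.

```python
from importlib import metadata

# The four OpenCV distributions published on PyPI; only one should be installed
OPENCV_PACKAGES = (
    'opencv-python',
    'opencv-contrib-python',
    'opencv-python-headless',
    'opencv-contrib-python-headless',
)


def installed_opencv_packages():
    '''Return a dict {package name: version} for the OpenCV variants found.'''
    found = {}
    for name in OPENCV_PACKAGES:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This variant is not installed in the current environment
            continue
    return found


print(installed_opencv_packages())
```

If the dict holds more than one entry, it is probably worth uninstalling the extra variants before going on.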

In [2]:
### TODO - .yaml

___
### IMPORTS

In [1]:
# Modules used for data handling / test
import os
import csv
import cv2
import pathlib
import time
import pickle

from utils import get_collection, show_collection, nameof, mklist


# Modules used for EDA
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import pandas as pd
import seaborn as sns

from matplotlib.patches import Rectangle


# Modules used for image processing
import cv2

from collections import Counter
from utils import crop_img, chi_osc, extract_img_data, get_img_rgb
from utils import resize_img, reduce_col_palette, whitespace


# Modules used for ML
from sklearn.cluster import KMeans
from utils import color_quant

In [2]:
### TODO ### Import a class from a module

# For a better pd.DataFrame visualization
class display(object):
    '''This class was found in 'Python Data Science Handbook' by jakevdp (Jake Vanderplas),
    which you can access through his GitHub repository
    (https://github.com/jakevdp/PythonDataScienceHandbook)'''
    
    template = '''<div style="float: left; padding: 10px;">
                  <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
                  </div>'''
    
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_()) for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a)) for a in self.args)

In [3]:
# Modules settings
%matplotlib inline

___
### UPDATE UTILS

In [4]:
# This cell only needs to be executed to reload utils
# if it was modified after being imported

# %run utils

___
### BASIC EDA

In [5]:
raw_data = pd.read_csv('./data/raw_museum/raw_data.csv', names=['img_ID', 'artist', 'height', 'width'])

raw_data.shape, raw_data.head()

((5233, 4),
                 img_ID      artist  height  width
 0  9223372032559824886  caravaggio     559    474
 1               186636  caravaggio     900    863
 2               186724  caravaggio     800    541
 3               186639  caravaggio    3239   4501
 4               186671  caravaggio     912   1200)

In [6]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5233 entries, 0 to 5232
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   img_ID  5233 non-null   int64 
 1   artist  5233 non-null   object
 2   height  5233 non-null   int64 
 3   width   5233 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 163.7+ KB


In [7]:
raw_data.nunique()

img_ID    5233
artist      14
height    1609
width     1510
dtype: int64

In [8]:
# Get all paths to .jpg files
extensions = ['.jpg']

raw_collection = get_collection(path='./images/raw_museum', extensions=extensions)
large_collection = get_collection(path='./images/large_museum', extensions=extensions)
mid_collection = get_collection(path='./images/mid_museum', extensions=extensions)
mid_sqr_collection = get_collection(path='./images/mid_sqr_museum', extensions=extensions)
low_sqr_collection = get_collection(path='./images/low_sqr_museum', extensions=extensions)

collections = [large_collection, mid_collection, mid_sqr_collection, low_sqr_collection]

print(f'{len(raw_collection)} images in raw_collection')
print(f'{len(large_collection)} images in large_collection')
print(f'{len(mid_collection)} images in mid_collection')
print(f'{len(mid_sqr_collection)} images in mid_sqr_collection')
print(f'{len(low_sqr_collection)} images in low_sqr_collection')

5468 images in raw_collection
4276 images in large_collection
4276 images in mid_collection
4276 images in mid_sqr_collection
4276 images in low_sqr_collection
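
`get_collection` lives in `utils` and its code is not shown in this notebook. As a rough idea of what such a helper can look like, here is a sketch that assumes it walks a folder recursively and returns `pathlib.Path` objects filtered by extension. The name `get_collection_sketch` and the case-insensitive matching are my own assumptions, not the actual implementation.

```python
from pathlib import Path


def get_collection_sketch(path, extensions):
    '''Return a sorted list of pathlib.Path objects under `path`
    whose suffix is in `extensions` (compared case-insensitively).'''
    wanted = {ext.lower() for ext in extensions}
    root = Path(path)
    # rglob('*') walks the folder recursively; keep regular files only
    return sorted(p for p in root.rglob('*')
                  if p.is_file() and p.suffix.lower() in wanted)
```

It would be called the same way as above, for example `get_collection_sketch('./images/raw_museum', ['.jpg'])`.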


In [9]:
# Get all paths to .csv files
extensions = ['.csv']

large_data = get_collection(path='./data/large_museum', extensions=extensions)
mid_data = get_collection(path='./data/mid_museum', extensions=extensions)
mid_sqr_data = get_collection(path='./data/mid_sqr_museum', extensions=extensions)
low_sqr_data = get_collection(path='./data/low_sqr_museum', extensions=extensions)

# Turn paths into str
large_data = [str(i) for i in large_data]
mid_data = [str(i) for i in mid_data]
mid_sqr_data = [str(i) for i in mid_sqr_data]
low_sqr_data = [str(i) for i in low_sqr_data]

# Build museums DataFrame
columns_names=['img_ID', 'artist', 'height', 'width', 'whitespace', 'chiaroscuro',
               'color_01', 'color_02', 'color_03', 'color_04', 'color_05', 
               'color_06', 'color_07', 'color_08', 'color_09', 'color_10']

large_museum = pd.concat((pd.read_csv(file, names=columns_names) for file in large_data), ignore_index=True)
mid_museum = pd.concat((pd.read_csv(file, names=columns_names) for file in mid_data), ignore_index=True)
mid_s_museum = pd.concat((pd.read_csv(file, names=columns_names) for file in mid_sqr_data), ignore_index=True)
low_s_museum = pd.concat((pd.read_csv(file, names=columns_names) for file in low_sqr_data), ignore_index=True)

museums = [large_museum, mid_museum, mid_s_museum, low_s_museum]

In [10]:
[museum.shape for museum in museums]

[(4815, 16), (4649, 16), (4476, 16), (4600, 16)]

In [11]:
[museum['img_ID'].nunique() for museum in museums]

[4815, 4649, 4476, 4600]

I'll keep only the images that are common to all museums.

In [12]:
A = large_museum['img_ID'].isin(mid_museum['img_ID'])
B = large_museum['img_ID'].isin(mid_s_museum['img_ID'])
C = large_museum['img_ID'].isin(low_s_museum['img_ID'])

large_museum = large_museum[A & B & C]

In [13]:
A = mid_museum['img_ID'].isin(large_museum['img_ID'])
B = mid_museum['img_ID'].isin(mid_s_museum['img_ID'])
C = mid_museum['img_ID'].isin(low_s_museum['img_ID'])

mid_museum = mid_museum[A & B & C]

In [14]:
A = mid_s_museum['img_ID'].isin(large_museum['img_ID'])
B = mid_s_museum['img_ID'].isin(mid_museum['img_ID'])
C = mid_s_museum['img_ID'].isin(low_s_museum['img_ID'])

mid_s_museum = mid_s_museum[A & B & C]

In [15]:
A = low_s_museum['img_ID'].isin(large_museum['img_ID'])
B = low_s_museum['img_ID'].isin(mid_museum['img_ID'])
C = low_s_museum['img_ID'].isin(mid_s_museum['img_ID'])

low_s_museum = low_s_museum[A & B & C]

In [16]:
# Update museums
museums = [large_museum, mid_museum, mid_s_museum, low_s_museum]

[museum.shape for museum in museums]

[(4271, 16), (4271, 16), (4271, 16), (4271, 16)]
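
The four filtering cells above can be condensed into one step. A minimal sketch, assuming every museum is a DataFrame with an `img_ID` column as above. The helper name `keep_common` is hypothetical and the toy frames are made-up data, used only for illustration.

```python
from functools import reduce

import pandas as pd


def keep_common(frames, key='img_ID'):
    '''Keep, in every DataFrame, only the rows whose `key` appears in all of them.'''
    # Intersect the sets of IDs of all frames
    common = reduce(set.intersection, (set(df[key]) for df in frames))
    # .copy() returns independent frames, so later column assignments
    # do not trigger SettingWithCopyWarning
    return [df[df[key].isin(common)].copy() for df in frames]


# Toy data (made up for this example)
a = pd.DataFrame({'img_ID': [1, 2, 3, 4]})
b = pd.DataFrame({'img_ID': [2, 3, 4, 5]})
c = pd.DataFrame({'img_ID': [3, 4, 6]})

a, b, c = keep_common([a, b, c])
print([df['img_ID'].tolist() for df in (a, b, c)])
# [[3, 4], [3, 4], [3, 4]]
```

With the variables of this notebook it would read `large_museum, mid_museum, mid_s_museum, low_s_museum = keep_common(museums)`.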

I'll drop not only the rows but also the corresponding image files.

In [17]:
# img_ID should be treated as text (object dtype) rather than int64, so I'll cast it to str
for museum in museums:
    museum['img_ID'] = museum['img_ID'].astype(str, errors='ignore')

In [18]:
large_museum.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4271 entries, 0 to 4637
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   img_ID       4271 non-null   object 
 1   artist       4271 non-null   object 
 2   height       4271 non-null   int64  
 3   width        4271 non-null   int64  
 4   whitespace   4271 non-null   float64
 5   chiaroscuro  4271 non-null   float64
 6   color_01     4271 non-null   object 
 7   color_02     4271 non-null   object 
 8   color_03     4271 non-null   object 
 9   color_04     4271 non-null   object 
 10  color_05     4271 non-null   object 
 11  color_06     4271 non-null   object 
 12  color_07     4271 non-null   object 
 13  color_08     4271 non-null   object 
 14  color_09     4271 non-null   object 
 15  color_10     4271 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory usage: 567.2+ KB


In [19]:
for collection in collections:
    collection_name = nameof(collection, globals())
    del_images = 0
    
    for image in collection:
        img_ID = str(image).split('/')[-1].split('.')[0]
        
        # I can use any museum as all have the same images
        if large_museum['img_ID'][large_museum['img_ID'].str.contains(img_ID)].any():
            continue
        elif os.path.exists(image):
            os.remove(image)
            del_images += 1
        
    print(f'{del_images} images deleted from {collection_name}')

0 images deleted from large_collection
0 images deleted from mid_collection
0 images deleted from mid_sqr_collection
0 images deleted from low_sqr_collection
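
One thing to keep in mind: `str.contains` does substring (regex) matching, so a file whose ID is contained in some longer ID of the table (say `18663` inside `186636`) is treated as present and never deleted. That may be worth checking here, since the collections hold 4276 files while the museums hold 4271 rows. Below is a sketch of an exact-match alternative, assuming file names have the form `<img_ID>.jpg`. `find_orphans` is a hypothetical helper and the paths are made up; it only lists the files and leaves the deletion to the caller.

```python
from pathlib import Path


def find_orphans(image_paths, valid_ids):
    '''Return the paths whose file stem is not exactly one of `valid_ids`.'''
    valid = {str(i) for i in valid_ids}   # set lookup, exact match only
    return [p for p in map(Path, image_paths) if p.stem not in valid]


# Made-up example: '18663' is a substring of '186636' but not a valid ID
paths = ['museum/186636.jpg', 'museum/18663.jpg', 'museum/186724.jpg']
ids = ['186636', '186724']

print([p.name for p in find_orphans(paths, ids)])
# ['18663.jpg']
```

With the notebook's variables, `find_orphans(collection, large_museum['img_ID'])` would give the list of files to pass to `os.remove`.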


In [20]:
# Update collections
extensions = ['.jpg']

raw_collection = get_collection(path='./images/raw_museum', extensions=extensions)
large_collection = get_collection(path='./images/large_museum', extensions=extensions)
mid_collection = get_collection(path='./images/mid_museum', extensions=extensions)
mid_sqr_collection = get_collection(path='./images/mid_sqr_museum', extensions=extensions)
low_sqr_collection = get_collection(path='./images/low_sqr_museum', extensions=extensions)

collections = [large_collection, mid_collection, mid_sqr_collection, low_sqr_collection]

print(f'{len(raw_collection)} images in raw_collection')
print(f'{len(large_collection)} images in large_collection')
print(f'{len(mid_collection)} images in mid_collection')
print(f'{len(mid_sqr_collection)} images in mid_sqr_collection')
print(f'{len(low_sqr_collection)} images in low_sqr_collection')

5468 images in raw_collection
4276 images in large_collection
4276 images in mid_collection
4276 images in mid_sqr_collection
4276 images in low_sqr_collection


___

In [21]:
def data_report(df):
    # Get names
    cols = pd.DataFrame(df.columns.values, columns=['COL_N'])
    # Get types
    types = pd.DataFrame(df.dtypes.values, columns=["DATA_TYPE"])
    
    # Get missings
    percent_missing = round(df.isnull().sum()*100/len(df),2)
    percent_missing_df = pd.DataFrame(percent_missing.values, columns = ["MISSINGS (%)"])

    # Get unique values
    unicos = pd.DataFrame(df.nunique().values, columns = ["UNIQUE_VALUES"])

    # Get cardinality
    percent_cardin = round(unicos["UNIQUE_VALUES"]*100/len(df),2)
    percent_cardin_df = pd.DataFrame(percent_cardin.values, columns = ["CARDIN (%)"])
      
    # Concat
    concatenado = pd.concat([cols, types, percent_missing_df, unicos, percent_cardin_df], axis=1)
    concatenado.set_index('COL_N', drop=True, inplace=True)

    return concatenado.T


l_repo = data_report(large_museum)
m_repo = data_report(mid_museum)
ms_repo = data_report(mid_s_museum)
ls_repo = data_report(low_s_museum)

display('l_repo', 'm_repo', 'ms_repo', 'ls_repo')

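l_repo

| COL_N | img_ID | artist | height | width | whitespace | chiaroscuro | color_01 | color_02 | color_03 | color_04 | color_05 | color_06 | color_07 | color_08 | color_09 | color_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DATA_TYPE | object | object | int64 | int64 | float64 | float64 | object | object | object | object | object | object | object | object | object | object |
| MISSINGS (%) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| UNIQUE_VALUES | 4271 | 14 | 1 | 442 | 248 | 3871 | 87 | 83 | 88 | 87 | 88 | 90 | 90 | 89 | 94 | 94 |
| CARDIN (%) | 100.0 | 0.33 | 0.02 | 10.35 | 5.81 | 90.63 | 2.04 | 1.94 | 2.06 | 2.04 | 2.06 | 2.11 | 2.11 | 2.08 | 2.2 | 2.2 |

m_repo

| COL_N | img_ID | artist | height | width | whitespace | chiaroscuro | color_01 | color_02 | color_03 | color_04 | color_05 | color_06 | color_07 | color_08 | color_09 | color_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DATA_TYPE | object | object | int64 | int64 | float64 | float64 | object | object | object | object | object | object | object | object | object | object |
| MISSINGS (%) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.02 |
| UNIQUE_VALUES | 4271 | 14 | 1 | 1 | 2 | 2380 | 75 | 67 | 77 | 77 | 75 | 76 | 76 | 83 | 74 | 75 |
| CARDIN (%) | 100.0 | 0.33 | 0.02 | 0.02 | 0.05 | 55.72 | 1.76 | 1.57 | 1.8 | 1.8 | 1.76 | 1.78 | 1.78 | 1.94 | 1.73 | 1.76 |

ms_repo

| COL_N | img_ID | artist | height | width | whitespace | chiaroscuro | color_01 | color_02 | color_03 | color_04 | color_05 | color_06 | color_07 | color_08 | color_09 | color_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DATA_TYPE | object | object | int64 | int64 | float64 | float64 | object | object | object | object | object | object | object | object | object | object |
| MISSINGS (%) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.02 |
| UNIQUE_VALUES | 4271 | 14 | 1 | 1 | 2 | 2380 | 74 | 73 | 80 | 79 | 73 | 78 | 77 | 79 | 82 | 77 |
| CARDIN (%) | 100.0 | 0.33 | 0.02 | 0.02 | 0.05 | 55.72 | 1.73 | 1.71 | 1.87 | 1.85 | 1.71 | 1.83 | 1.8 | 1.85 | 1.92 | 1.8 |

ls_repo

| COL_N | img_ID | artist | height | width | whitespace | chiaroscuro | color_01 | color_02 | color_03 | color_04 | color_05 | color_06 | color_07 | color_08 | color_09 | color_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DATA_TYPE | object | object | int64 | int64 | float64 | float64 | object | object | object | object | object | object | object | object | object | object |
| MISSINGS (%) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.02 | 0.05 | 0.09 | 0.19 | 0.49 |
| UNIQUE_VALUES | 4271 | 14 | 1 | 1 | 2 | 1527 | 74 | 69 | 69 | 68 | 67 | 72 | 70 | 75 | 80 | 77 |
| CARDIN (%) | 100.0 | 0.33 | 0.02 | 0.02 | 0.05 | 35.75 | 1.73 | 1.62 | 1.62 | 1.59 | 1.57 | 1.69 | 1.64 | 1.76 | 1.87 | 1.8 |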
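
As a quick sanity check of how `data_report` gets its percentages: both `MISSINGS (%)` and `CARDIN (%)` are a count divided by the number of rows (4271) and rounded to two decimals. The snippet below redoes that arithmetic for a few `large_museum` values taken from the report; `pct` is just a throwaway helper of mine.

```python
N_ROWS = 4271  # rows in each museum after keeping the common images


def pct(count, total=N_ROWS):
    '''Percentage of `count` over `total`, rounded as in data_report.'''
    return round(count * 100 / total, 2)


print(pct(14))    # artist: 14 unique values        -> 0.33
print(pct(442))   # width: 442 unique values        -> 10.35
print(pct(3871))  # chiaroscuro: 3871 unique values -> 90.63
print(pct(1))     # a single missing value          -> 0.02
```

Read backwards, a `MISSINGS (%)` of 0.02 corresponds to one row with a missing value, and 0.49 to 21 rows.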


At this point I'll work with large_museum, as it contains the most information and has no missing values. Let's take a closer look.

In [23]:
with open('./data/large_museum/large_museum', 'wb') as file:
    pickle.dump(large_museum, file)
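
To pick the work up in a later session, the pickled object can be read back with `pickle.load`. A minimal round-trip sketch follows: the temporary folder and the small dict stand in for the real path and DataFrame, and the helper names `save_pickle` and `load_pickle` are my own.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    '''Serialize `obj` to `path` in binary mode.'''
    with open(path, 'wb') as file:
        pickle.dump(obj, file)


def load_pickle(path):
    '''Read back an object saved with save_pickle.
    Only unpickle files you created yourself: pickle can run code on load.'''
    with open(path, 'rb') as file:
        return pickle.load(file)


# Stand-in data and a temporary location (both made up for this example)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'large_museum'
    save_pickle({'img_ID': ['186636'], 'artist': ['caravaggio']}, target)
    restored = load_pickle(target)

print(restored['artist'])
# ['caravaggio']
```

For a DataFrame, `large_museum.to_pickle(path)` and `pd.read_pickle(path)` do the same in a single call each.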