## Exploration of the Tate Collection data
This notebook explores the data retrieved from https://github.com/tategallery/collection. The aim is to produce a target vector for the intended classification of the artworks in the collection according to their subject.

The notebook has three sections:
- retrieval of the subject metadata
- summary statistics of the data according to their subject
- [TODO] unpacking the data into a TensorFlow-compatible format

### Retrieving subject metadata
The repository cloned from GitHub contains metadata in two formats: a csv file with general metadata on all the artworks (the images themselves are not included in the repository) and one json file per image id.

The subject metadata is contained only as a section of the json files. This section therefore retrieves all the subject sections of the json files and reorganizes them into a single json file for all subjects of all images.

The json file produced has the form {class: {subclass: {subject: [ImageId, ...]}}}, as this is convenient both as a target vector format and for unpacking the images into the correct directories.

Note that an ImageId appears under more than one path, since each image has, on average, more than one subject.
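
To make the layout concrete, here is a minimal sketch that inverts a tiny hand-written tree into one list of subject paths per image, which is roughly the shape a multi-label target would start from. `invert_tree` and `toy_tree` are illustrative names that are not part of the notebook; the toy entries are taken from the P06500 record explored later in this section.

```python
from collections import defaultdict

def invert_tree(tree):
    """Map each image id to the (class, subclass, subject) paths it appears under."""
    paths_by_image = defaultdict(list)
    for cls, subclasses in tree.items():
        for subclass, subjects in subclasses.items():
            for subject, image_ids in subjects.items():
                for image_id in image_ids:
                    paths_by_image[image_id].append((cls, subclass, subject))
    return dict(paths_by_image)

# Hand-written toy tree, not the real data.
toy_tree = {
    "people": {"adults": {"woman": ["P06500"]}},
    "nature": {"landscape": {"field": ["P06500"]},
               "astronomy": {"sun": ["P06500"]}},
}
labels = invert_tree(toy_tree)
print(labels["P06500"])
```

The same image id shows up under three different paths here, which is why counts taken over the tree include repetitions.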

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import json
import os
import glob

In [2]:
data_info = pd.read_csv('../../Capstone/artwork_data.csv', verbose=0)
data_info.head()



Unnamed: 0,id,accession_number,artist,artistRole,artistId,title,dateText,medium,creditLine,year,acquisitionYear,dimensions,width,height,depth,units,inscription,thumbnailCopyright,thumbnailUrl,url
0,1035,A00001,"Blake, Robert",artist,38,A Figure Bowing before a Seated Old Man with h...,date not known,"Watercolour, ink, chalk and graphite on paper....",Presented by Mrs John Richmond 1922,,1922.0,support: 394 x 419 mm,394,419,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-a-fi...
1,1036,A00002,"Blake, Robert",artist,38,"Two Drawings of Frightened Figures, Probably f...",date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 311 x 213 mm,311,213,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-two-...
2,1037,A00003,"Blake, Robert",artist,38,The Preaching of Warning. Verso: An Old Man En...,?c.1785,Graphite on paper. Verso: graphite on paper,Presented by Mrs John Richmond 1922,1785.0,1922.0,support: 343 x 467 mm,343,467,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...
3,1038,A00004,"Blake, Robert",artist,38,Six Drawings of Figures with Outstretched Arms,date not known,Graphite on paper,Presented by Mrs John Richmond 1922,,1922.0,support: 318 x 394 mm,318,394,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-six-...
4,1039,A00005,"Blake, William",artist,39,The Circle of the Lustful: Francesca da Rimini...,"1826–7, reprinted 1892",Line engraving on paper,Purchased with the assistance of a special gra...,1826.0,1919.0,image: 243 x 335 mm,243,335,,mm,,,http://www.tate.org.uk/art/images/work/A/A00/A...,http://www.tate.org.uk/art/artworks/blake-the-...


In [3]:
keys = data_info.accession_number.unique()
len(keys)

69201

In [4]:
#removing all the images belonging to section D, as those are poor quality drawings
keys = np.array([key for key in keys if key[0] != 'D'])
len(keys)

31557

In [17]:
with open(r'D:\collection\artworks\T\005\t00500-1373.json') as json_file:
    data = json.load(json_file)
    
data['movements'][0]['name']

'Civil War and Commonwealth'

In [11]:
#exploring the format of the json file for image T00500
for p in data['subjects']:
    print('Children: ' + p)#['children'])
data#['subjects']['children']

Children: children
Children: id
Children: name


{'acno': 'T00500',
 'acquisitionYear': 1962,
 'all_artists': 'Edward Bower',
 'catalogueGroup': {},
 'classification': 'painting',
 'contributorCount': 1,
 'contributors': [{'birthYear': 1629,
   'date': 'active 1629–c.1667',
   'displayOrder': 1,
   'fc': 'Edward Bower',
   'gender': 'Male',
   'id': 45,
   'mda': 'Bower, Edward',
   'role': 'artist',
   'startLetter': 'B'}],
 'creditLine': 'Purchased 1962',
 'dateRange': {'endYear': 1646, 'startYear': 1646, 'text': '1646'},
 'dateText': '1646',
 'depth': '',
 'dimensions': 'support: 1244 x 981 mm\r\nframe: 1430 x 1190 x 100 mm',
 'foreignTitle': None,
 'groupTitle': None,
 'height': '981',
 'id': 1373,
 'inscription': 'date inscribed',
 'medium': 'Oil paint on canvas',
 'movementCount': 1,
 'movements': [{'era': {'id': 289, 'name': '16th and 17th century'},
   'id': 399,
   'name': 'Civil War and Commonwealth'}],
 'subjectCount': 9,
 'subjects': {'children': [{'children': [{'children': [{'id': 12714,
        'name': 'Drake, John, Sir

In [12]:
#replicating the path for each json given the start of the image id
path = {}
for key in keys:
    path[key] = os.path.join( "D:", "collection", "artworks", key[0].lower(), key[1:4], '*.json')

path[keys[332]]  

'D:collection\\artworks\\a\\003\\*.json'
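
The pattern above has no separator after the drive letter because `os.path.join("D:", ...)` produces a drive-relative path on Windows, which is resolved against the current directory of drive D:. If an absolute path is wanted, one option is to build it with `pathlib`, which the notebook already imports. The sketch below is only an illustration: `json_glob_for` is a hypothetical helper, the root folder is assumed to be `D:/collection/artworks`, and `"A00301"` is a made-up accession number.

```python
from pathlib import PureWindowsPath

def json_glob_for(accession_number, root="D:/collection/artworks"):
    """Return the glob pattern of the directory holding this artwork's json file."""
    letter = accession_number[0].lower()   # 'A00301' -> 'a'
    block = accession_number[1:4]          # 'A00301' -> '003'
    return str(PureWindowsPath(root, letter, block, "*.json"))

print(json_glob_for("A00301"))
```

`PureWindowsPath` is used so the example behaves the same on any operating system; on a Windows machine plain `Path` would give the same string.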

In [13]:
#exploring how to retrieve all the files in the found directories
files = {}
for key in keys:
    files[key] = glob.glob(path[key])
    
files[keys[11111]][0]

'D:collection\\artworks\\p\\065\\p06500-12527.json'

In [11]:
#exploring how to take out only the part concerning the subject
with open(files[keys[11111]][0]) as json_file:
    data = json.load(json_file)
    subjects = data['subjects']['children']
    
subjects

[{'children': [{'children': [{'id': 10231,
      'name': "Rimbaud, Arthur, 'Les Illuminations'"}],
    'id': 58,
    'name': 'literature (not Shakespeare)'}],
  'id': 55,
  'name': 'literature and fiction'},
 {'children': [{'children': [{'id': 1050, 'name': 'arm/arms raised'},
     {'id': 270, 'name': 'standing'}],
    'id': 92,
    'name': 'actions: postures and motions'},
   {'children': [{'id': 167, 'name': 'woman'}], 'id': 95, 'name': 'adults'}],
  'id': 91,
  'name': 'people'},
 {'children': [{'children': [{'id': 1834, 'name': 'dawn'},
     {'id': 2261, 'name': 'sunrise'}],
    'id': 75,
    'name': 'times of the day'},
   {'children': [{'id': 506, 'name': 'field'}], 'id': 71, 'name': 'landscape'},
   {'children': [{'id': 975, 'name': 'sun'}], 'id': 69, 'name': 'astronomy'}],
  'id': 60,
  'name': 'nature'}]

In [20]:
with open(files[keys[11111]][0]) as json_file:
    data = json.load(json_file)
    if 'movements' in data.keys():
        movement = data['movements'][0]['name']
        movement

In [15]:
#analysing the structure of the found section
subjects[1]['children'][0]#['children']
subjects[0]['name'],subjects[0]['children'][0]['name'],subjects[0]['children'][0]['children'][0]['name']

('literature and fiction',
 'literature (not Shakespeare)',
 "Rimbaud, Arthur, 'Les Illuminations'")

In [16]:
#recovering the image id from the file name
(files[keys[11111]][0]).split('\\')[-1][:6].upper()

'P06500'
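
Slicing the first six characters works for the file names seen so far, but it silently assumes that every accession number is exactly six characters long. A slightly more defensive sketch, assuming the file names always follow the `<accession>-<id>.json` pattern of the two examples above, splits on the last hyphen instead. `accession_from_path` is a hypothetical helper name.

```python
from pathlib import PureWindowsPath

def accession_from_path(json_path):
    """Recover the accession number from a file name such as 'p06500-12527.json'."""
    stem = PureWindowsPath(json_path).stem   # 'p06500-12527'
    return stem.rsplit("-", 1)[0].upper()    # drop the numeric id after the last hyphen

print(accession_from_path(r"D:collection\artworks\p\065\p06500-12527.json"))
```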

In [17]:
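#trying to construct the tree in the intended shape for one instance
#NOTE: len([...]) in the innermost loop is always 1, so only the first leaf subject of each subclass is stored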
tree = {}

for i in range(len(subjects)):
    tree[subjects[i]['name']] = {}
    for j in range(len(subjects[i]['children'])):
        tree[subjects[i]['name']][subjects[i]['children'][j]['name']] = {}
        for z in range(len([subjects[i]['children'][j]['children']])):
            tree[subjects[i]['name']]\
                [subjects[i]['children'][j]['name']]\
                [subjects[i]['children'][j]['children'][z]['name']] = (files[keys[11111]][0]).split('\\')[-1][:6].upper()
       
tree

{'literature and fiction': {'literature (not Shakespeare)': {"Rimbaud, Arthur, 'Les Illuminations'": 'P06500'}},
 'people': {'actions: postures and motions': {'arm/arms raised': 'P06500'},
  'adults': {'woman': 'P06500'}},
 'nature': {'times of the day': {'dawn': 'P06500'},
  'landscape': {'field': 'P06500'},
  'astronomy': {'sun': 'P06500'}}}

#### Actual tree construction
Now that the appropriate investigation has been done, I can start producing the complete tree.

> #### Do not run the following part

In [31]:
path1 = {}
files = {}
subjects = {}
i = 0
j = 0
z = 0
k = 0
keys4paths= list(set([key[:4] for key in keys]))

#print('number of directories', len(keys4paths))

for key in keys4paths:
    path1[key] = os.path.join( "D:", "collection", "artworks", key[0].lower(), key[1:4], '*.json')
    files[key] = glob.glob(path1[key])
    #print('number of files in directory:', len(files[key]))
    for i in range(len(files[key])):
        z = z + 1
        with open(files[key][i]) as json_file:
            name_image = (files[key][i]).split('\\')[-1][:6].upper()
            data = json.load(json_file)
            k = k + 1
            #print(data)
            try:
                subjects[name_image] = data['subjects']['children']
                i = i + 1
            except:
                j = j + 1
                #print('does not have subject info for', name_image)
                #print(data)
                
                
                  
#print('number of files considered', z, k, '\nnumber of subjects found:', i, '\nnumber of subjects not found:', j)#, subjects[name_image])

number of directories 345


In [22]:
path1 = {}
files = {}
movements = {}
i = 0
j = 0
z = 0
k = 0
keys4paths= list(set([key[:4] for key in keys]))

#print('number of directories', len(keys4paths))

for key in keys4paths:
    path1[key] = os.path.join( "D:", "collection", "artworks", key[0].lower(), key[1:4], '*.json')
    files[key] = glob.glob(path1[key])
    #print('number of files in directory:', len(files[key]))
    for i in range(len(files[key])):
        z = z + 1
        with open(files[key][i]) as json_file:
            name_image = (files[key][i]).split('\\')[-1][:6].upper()
            data = json.load(json_file)
            k = k + 1
            #print(data)
            try:
                movements[name_image] = data['movements'][0]['name']
                i = i + 1
            except:
                j = j + 1
                #print('does not have subject info for', name_image)
                #print(data)
                
                
                  
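# NOTE: `i` is also the loop variable of the inner `for`, so the "found" figure printed
# below is overwritten on every iteration and is not a real count; len(movements) is.
# The labels say "subjects" because this cell was adapted from the previous one; it counts movements.
print('number of files considered', z, k, '\nnumber of subjects found:', i, '\nnumber of subjects not found:', j)#, subjects[name_image])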

number of files considered 30381 30381 
number of subjects found: 98 
number of subjects not found: 24410


In [24]:
len(list(movements.values()))

5971
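
The printed "found" figure (98) does not match `len(movements)` (5971), while 30381 - 24410 = 5971, which suggests the counter `i` is being overwritten by the loop variable of the same name. A minimal sketch of the same scan with separate, explicitly named counters could look like this. `collect_first_movement` is a hypothetical helper, it reads the accession number from the `acno` field of each record instead of the file name, and the two records in the self-check are made up.

```python
import json
import tempfile
from pathlib import Path

def collect_first_movement(json_paths):
    """Return ({accession number: first movement name}, number of files without a movement)."""
    movements, missing = {}, 0
    for path in json_paths:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        entries = data.get("movements") or []   # absent, None or empty all count as missing
        if entries:
            movements[data["acno"]] = entries[0]["name"]
        else:
            missing += 1
    return movements, missing

# Self-check on two made-up records written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    records = [
        {"acno": "T00500", "movements": [{"name": "Civil War and Commonwealth"}]},
        {"acno": "X00001", "movements": []},
    ]
    paths = []
    for rec in records:
        p = Path(tmp) / f"{rec['acno'].lower()}.json"
        p.write_text(json.dumps(rec), encoding="utf-8")
        paths.append(p)
    found, missing = collect_first_movement(paths)

print(len(found), missing)
```

On the real data the function would be called with the list of files returned by `glob.glob`.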

In [25]:
with open('./results/TateMovementsDict.json', 'w') as outfile:
    json.dump(movements, outfile)

In [43]:
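# NOTE: as written, the `for j` loop below is not nested inside the `for i` loop, so it
# only sees the last top-level subject of each image, and len([...]) in the `for z` loop
# is always 1, so only the first leaf subject of each subclass is stored.
tree1 = {}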

for image in subjects:
    for i in range(len(subjects[image])):
        if subjects[image][i]['name'] not in tree1.keys():
            tree1[subjects[image][i]['name']] = {}
    for j in range(len(subjects[image][i]['children'])):
        if subjects[image][i]['children'][j]['name']\
        not in tree1[subjects[image][i]['name']].keys():
            tree1[subjects[image][i]['name']]\
                 [subjects[image][i]['children'][j]['name']] = {}
        for z in range(len([subjects[image][i]['children'][j]['children']])):
            if subjects[image][i]['children'][j]['children'][z]['name']\
            not in tree1[subjects[image][i]['name']]\
                        [subjects[image][i]['children'][j]['name']].keys():
                tree1[subjects[image][i]['name']]\
                     [subjects[image][i]['children'][j]['name']]\
                     [subjects[image][i]['children'][j]['children'][z]['name']] = [image]
            else:
                tree1[subjects[image][i]['name']]\
                     [subjects[image][i]['children'][j]['name']]\
                     [subjects[image][i]['children'][j]['children'][z]['name']].append(image)
       
tree1.keys()

dict_keys(['people', 'objects', 'nature', 'society', 'work and occupations', 'architecture', 'leisure and pastimes', 'emotions, concepts and ideas', 'symbols & personifications', 'interiors', 'abstraction', 'religion and belief', 'places', 'history', 'literature and fiction', 'group/movement'])
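
The single-image output earlier in this section lists 'arm/arms raised' and 'dawn' but not 'standing' or 'sunrise', which is consistent with `len([...])` always evaluating to 1. As a possible alternative, the sketch below iterates over the lists directly so that every leaf subject is kept. `build_tree` and the toy input are illustrative only, and the statistics reported later in the notebook come from the original cell, not from this function.

```python
def build_tree(subjects_by_image):
    """Build {class: {subclass: {subject: [image ids]}}} from {image id: subject list}."""
    tree = {}
    for image_id, top_levels in subjects_by_image.items():
        for top in top_levels:
            for mid in top.get("children", []):
                for leaf in mid.get("children", []):
                    (tree.setdefault(top["name"], {})
                         .setdefault(mid["name"], {})
                         .setdefault(leaf["name"], [])
                         .append(image_id))
    return tree

# Made-up input in the same shape as data['subjects']['children'].
toy_subjects = {
    "P06500": [{"name": "people", "children": [
        {"name": "actions: postures and motions",
         "children": [{"name": "arm/arms raised"}, {"name": "standing"}]}]}],
    "N05909": [{"name": "people", "children": [
        {"name": "actions: postures and motions",
         "children": [{"name": "standing"}]}]}],
}
toy_result = build_tree(toy_subjects)
print(toy_result)
```

Unlike the original cell, this version does not create empty entries for classes or subclasses that have no leaf subjects.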

Dumping the tree into a json file so it can be reused in the future without rerunning this first part.

In [48]:
with open('./results/TateDict.json', 'w') as outfile:
    json.dump(tree1, outfile)

### Summary statistics of the data

Now that the tree is built, let's look at some basic statistics on how the images are distributed among classes and how the subclasses are organized.

In [6]:
with open('./results/TateDict.json', 'r') as infile:
    tree1 = json.load(infile)

In [7]:
# Let's start by looking at an example of path
tree1.keys()

dict_keys(['people', 'objects', 'places', 'architecture', 'abstraction', 'society', 'nature', 'emotions, concepts and ideas', 'interiors', 'work and occupations', 'symbols & personifications', 'religion and belief', 'leisure and pastimes', 'history', 'literature and fiction', 'group/movement'])

In [8]:
tree1['people'].keys()

dict_keys(['named individuals', 'portraits', 'actions: postures and motions', 'adults', 'groups', 'body', 'children', 'nudes', 'actions: processes and functions', 'actions: expressive', 'ethnicity', 'diseases and conditions', 'named groups', 'named families'])

In [9]:
tree1['people']['portraits'].keys()

dict_keys(['individuals: female', 'individuals: male', 'self-portraits', 'groups'])

In [10]:
tree1['people']['portraits']['individuals: male'][:4]

['N05909', 'N05925', 'N05929', 'N05936']

In [11]:
#some helper functions
def count_subclasses(d):
    # number of direct children (keys) of a node in the tree
    return len(d.keys())
def count_entries(d):
    # number of image ids stored under a leaf subject
    return len(d)

In [12]:
count_subclasses(tree1)

16

In [13]:
for subclass in tree1.keys():
    print(subclass, count_subclasses(tree1[subclass]))

people 14
objects 21
places 12
architecture 14
abstraction 2
society 16
nature 18
emotions, concepts and ideas 3
interiors 5
work and occupations 14
symbols & personifications 12
religion and belief 11
leisure and pastimes 5
history 7
literature and fiction 6
group/movement 1


In [14]:
for subclass in tree1.keys():
    for subsubclass in tree1[subclass].keys():
        print(subclass, subsubclass, count_subclasses(tree1[subclass][subsubclass]))

people named individuals 1641
people portraits 4
people actions: postures and motions 39
people adults 5
people groups 5
people body 53
people children 4
people nudes 3
people actions: processes and functions 19
people actions: expressive 37
people ethnicity 10
people diseases and conditions 58
people named groups 26
people named families 44
objects clothing and personal effects 142
objects fine art and design, named works 437
objects furnishings 62
objects religious and ceremonial 97
objects reading, writing, printed matter 202
objects vessels and containers 25
objects food and drink 92
objects kitchen 41
objects fine arts and music 108
objects weapons 32
objects electrical appliances 28
objects agriculture, gardening & fishing 51
objects sports and games 40
objects toys and models 43
objects domestic 34
objects scientific and measuring 46
objects heating and lighting 30
objects miscellaneous 64
objects tools and machinery 73
objects medical 21
objects materials 1
places UK counties 1

In [15]:
count_entries(tree1['people']['actions: postures and motions']['lying down'])

132

In [None]:
distr_data = []
for subclass in tree1.keys():
    for subsubclass in tree1[subclass].keys():
        for subsubsubclass in tree1[subclass][subsubclass].keys():
            # number of images filed under this leaf subject
            distr_data.append(count_entries(tree1[subclass][subsubclass][subsubsubclass]))
print(len(distr_data))

In [17]:
distr_data[:10]

[1, 4, 5, 1, 2, 1, 1, 1, 1, 1]

In [18]:
'number of total classes', len(distr_data)

('number of total classes', 10897)

In [19]:
'average number of entries per class', np.mean(distr_data)

('average number of entries per class', 14.613196292557584)

In [20]:
'median number of entries per class', np.median(distr_data)

('median number of entries per class', 1.0)

In [21]:
'max number of entries per class', np.max(distr_data)

('max number of entries per class', 7145)

In [22]:
'min number of entries per class', np.min(distr_data)

('min number of entries per class', 1)

In [23]:
'total number of entries (with repetitions)', np.sum(distr_data)

('total number of entries (with repetitions)', 159240)
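
The individual cells above can also be collapsed into a single pass over the tree. The following is a minimal sketch on a hand-written toy tree, using only the standard library; `leaf_sizes` is a hypothetical helper and the image ids are placeholders.

```python
import statistics

def leaf_sizes(tree):
    """Yield the number of image ids stored under every leaf subject of a nested dict."""
    for value in tree.values():
        if isinstance(value, dict):
            yield from leaf_sizes(value)   # descend into classes and subclasses
        else:
            yield len(value)               # a leaf holds a list of image ids

# Hand-written toy tree, not the real data.
toy_tree = {
    "people": {"portraits": {"self-portraits": ["A", "B", "C"], "groups": ["A"]}},
    "nature": {"landscape": {"field": ["B"]}},
}
sizes = list(leaf_sizes(toy_tree))
summary = {
    "classes": len(sizes),
    "mean": statistics.mean(sizes),
    "median": statistics.median(sizes),
    "max": max(sizes),
    "min": min(sizes),
    "total": sum(sizes),
}
print(summary)
```

Because the helper recurses on dictionaries, it works for any depth of nesting, not only the three levels used here.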

In [24]:
images_subclasses = {}
for subclass in tree1.keys():
    for subsubclass in tree1[subclass].keys():
        images_subclasses[subsubclass] = 0
        for subsubsubclass in tree1[subclass][subsubclass].keys():
            images_subclasses[subsubclass] += count_entries(tree1[subclass][subsubclass][subsubsubclass])
        print(subclass, ' - ', subsubclass, ' - ', images_subclasses[subsubclass])

people  -  named individuals  -  2420
people  -  portraits  -  2417
people  -  actions: postures and motions  -  6008
people  -  adults  -  14034
people  -  groups  -  2947
people  -  body  -  2559
people  -  children  -  2177
people  -  nudes  -  1645
people  -  actions: processes and functions  -  1170
people  -  actions: expressive  -  1839
people  -  ethnicity  -  468
people  -  diseases and conditions  -  572
people  -  named groups  -  53
people  -  named families  -  49
objects  -  clothing and personal effects  -  3424
objects  -  fine art and design, named works  -  770
objects  -  furnishings  -  1826
objects  -  religious and ceremonial  -  783
objects  -  reading, writing, printed matter  -  1342
objects  -  vessels and containers  -  1531
objects  -  food and drink  -  509
objects  -  kitchen  -  431
objects  -  fine arts and music  -  1446
objects  -  weapons  -  686
objects  -  electrical appliances  -  174
objects  -  agriculture, gardening & fishing  -  667
objects  - 

In [25]:
images_classes = {}
for subclass in tree1.keys():
    images_classes[subclass] = 0
    for subsubclass in tree1[subclass].keys():
        for subsubsubclass in tree1[subclass][subsubclass].keys():
            images_classes[subclass] += count_entries(tree1[subclass][subsubclass][subsubsubclass])
    print(subclass, images_classes[subclass])

people 38358
objects 16322
places 17353
architecture 15426
abstraction 9322
society 10295
nature 24857
emotions, concepts and ideas 8318
interiors 1978
work and occupations 4951
symbols & personifications 2819
religion and belief 2862
leisure and pastimes 2641
history 1352
literature and fiction 2385
group/movement 1
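
These totals count an image once for every leaf subject it is filed under, which is how 'people' can reach 38358 entries while only 31557 accession numbers were kept after filtering. To see how many distinct images fall under each top-level class, the ids can be collected into a set first. This is a sketch on toy data; `unique_images_per_class` is a hypothetical helper and the image ids are placeholders.

```python
def unique_images_per_class(tree):
    """Count distinct image ids under each top-level class."""
    counts = {}
    for cls, subclasses in tree.items():
        seen = set()
        for subjects in subclasses.values():
            for image_ids in subjects.values():
                seen.update(image_ids)
        counts[cls] = len(seen)
    return counts

# Hand-written toy tree: image "A" appears under two subjects of 'people'.
toy_tree = {
    "people": {"adults": {"woman": ["A", "B"]},
               "portraits": {"individuals: female": ["A"]}},
    "nature": {"landscape": {"field": ["B"]}},
}
unique_counts = unique_images_per_class(toy_tree)
print(unique_counts)
```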


### Unpacking data in folder structure
Now that the tree is built, I can unpack the data retrieved by following the image urls into a format that can be easily handled using Keras' `flow_from_directory` in TensorFlow.

In [None]:
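from zipfile import ZipFile
with ZipFile(os.path.join("D:", "collection", "data_tate.zip"), 'r') as zipObj:
    listOfFileNames = zipObj.namelist()
    print(listOfFileNames[2])  # inspect how members are named inside the archive
    for class1 in tree1.keys():
        for class2 in tree1[class1].keys():
            for class3 in tree1[class1][class2].keys():
                for file in tree1[class1][class2][class3]:
                    try:
                        # the second argument of extract() is the destination directory;
                        # the member keeps its own name inside it
                        zipObj.extract(file, os.path.join( "D:", "collection", "data", class1, class2, class3))
                    except KeyError:
                        # extract() raises KeyError when the archive has no member with this name
                        print('file not found: ' + file)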
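
Two things may get in the way of the cell above: the tree stores bare image ids while archive members usually carry a file extension, and several class names (for example 'arm/arms raised' or 'actions: postures and motions') contain characters that are not valid in Windows directory names. The sketch below shows one way to handle both. It assumes, without having checked the real archive, that members are named `<accession number>.<extension>`; `safe_name` and `extract_by_class` are hypothetical helpers, and the demo archive and the id "N00000" are made up.

```python
import re
import shutil
import tempfile
import zipfile
from pathlib import Path, PurePosixPath

def safe_name(name):
    """Replace characters that are not allowed in Windows directory names."""
    return re.sub(r'[<>:"/\\|?*]', "_", name).strip()

def extract_by_class(zip_path, tree, out_root):
    """Copy each image into out_root/<class>/<subclass>/<subject>/; return ids not found."""
    missing = []
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: members are named '<accession number>.<extension>'.
        by_id = {PurePosixPath(n).stem.upper(): n
                 for n in archive.namelist() if not n.endswith("/")}
        for cls, subclasses in tree.items():
            for subclass, subjects in subclasses.items():
                for subject, image_ids in subjects.items():
                    target = Path(out_root, safe_name(cls), safe_name(subclass), safe_name(subject))
                    for image_id in image_ids:
                        member = by_id.get(image_id)
                        if member is None:
                            missing.append(image_id)
                            continue
                        target.mkdir(parents=True, exist_ok=True)
                        with archive.open(member) as src, \
                                open(target / PurePosixPath(member).name, "wb") as dst:
                            shutil.copyfileobj(src, dst)
    return missing

# Demo on a throwaway archive with one fake image.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp, "demo.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("N05909.jpg", b"not a real image")
    demo_tree = {"people": {"actions: postures and motions":
                            {"arm/arms raised": ["N05909", "N00000"]}}}
    out_root = Path(tmp, "data")
    missing = extract_by_class(zip_path, demo_tree, out_root)
    extracted = sorted(p.relative_to(out_root).as_posix()
                       for p in out_root.rglob("*") if p.is_file())

print(missing, extracted)
```

Since an image belongs to several subjects, it is copied once per leaf directory; whether that duplication is acceptable depends on how the directories are later used for training.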