## Exploration of the Tate Collection data
This notebook explores the data retrieved from https://github.com/tategallery/collection. The aim is to produce a target vector for the intended classification of the artworks in the collection according to their subject.

The notebook is organized in three sections:
- retrieval of the subject metadata
- summary statistics of the data according to their subject
- [TODO] unpacking of the data into a TensorFlow-compatible format

### Retrieving subject metadata
The repository cloned from GitHub contains metadata in two formats: a csv file with some general metadata on all the images (the images themselves are not included) and one json file per image id.

The subject metadata is contained only as a section of the json files. This section therefore retrieves all the subject sections of the json files and reorganizes them into a single json file for all subjects of all images.

The json file produced has the form {class: {subclass: {subject: [ImageId, ...]}}}, which is convenient both as a target vector format and for unpacking the images into the correct directories.

Note that an ImageId appears under more than one path, since each image has, on average, more than one subject.
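
As a minimal sketch of this layout (with the subject level and the id lists used in the tree built further down), the toy dictionary below files one invented id under two paths. The ids and the helper `paths_for` are hypothetical and only serve as an illustration.

```python
# Toy tree in the {class: {subclass: {subject: [ImageId, ...]}}} layout.
# The ids below are invented for illustration only.
toy_tree = {
    "people": {"adults": {"woman": ["X00001", "X00002"]}},
    "nature": {"astronomy": {"sun": ["X00001"]}},
}

def paths_for(tree, image_id):
    """Return every (class, subclass, subject) path under which image_id is filed."""
    return [
        (c1, c2, c3)
        for c1, level2 in tree.items()
        for c2, level3 in level2.items()
        for c3, ids in level3.items()
        if image_id in ids
    ]

# The same id shows up under two different paths.
print(paths_for(toy_tree, "X00001"))
```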

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path
import json
import os
import glob

In [None]:
#data_info = pd.read_csv(Path('/media/ludovica/DiscoEsterno1TB/collection/artwork_data.csv'))
data_info = pd.read_csv(r'D:\collection\artwork_data.csv', verbose=0)
data_info.head()

In [3]:
keys = data_info.accession_number.unique()
len(keys)

69201

In [6]:
#removing all the images belonging to section D as those are poor quality drawings
keys = np.array([key for key in keys if key[0] != 'D'])
len(keys)

31557

In [8]:
#exploring the format of the json file for image A00001
with open(r'D:\collection\artworks\a\000\a00001-1035.json') as json_file:
    data = json.load(json_file)
    for p in data['subjects']:
        print('Children: ' + p)#['children'])
data#['subjects']['children']

Children: children
Children: id
Children: name


{'acno': 'A00001',
 'acquisitionYear': 1922,
 'all_artists': 'Robert Blake',
 'catalogueGroup': {},
 'classification': 'on paper, unique',
 'contributorCount': 1,
 'contributors': [{'birthYear': 1762,
   'date': '1762–1787',
   'displayOrder': 1,
   'fc': 'Robert Blake',
   'gender': 'Male',
   'id': 38,
   'mda': 'Blake, Robert',
   'role': 'artist',
   'startLetter': 'B'}],
 'creditLine': 'Presented by Mrs John Richmond 1922',
 'dateRange': None,
 'dateText': 'date not known',
 'depth': '',
 'dimensions': 'support: 394 x 419 mm',
 'foreignTitle': None,
 'groupTitle': None,
 'height': '419',
 'id': 1035,
 'inscription': None,
 'medium': 'Watercolour, ink, chalk and graphite on paper. Verso: graphite on paper',
 'movementCount': 0,
 'subjectCount': 6,
 'subjects': {'children': [{'children': [{'children': [{'id': 1050,
        'name': 'arm/arms raised'},
       {'id': 272, 'name': 'kneeling'},
       {'id': 694, 'name': 'sitting'}],
      'id': 92,
      'name': 'actions: postures and m

In [9]:
#replicating the path for each json given the start of the image id
path = {}
for key in keys:
    path[key] = os.path.join( "D:", "collection", "artworks", key[0].lower(), key[1:4], '*.json')

path[keys[332]]  

'D:collection\\artworks\\a\\003\\*.json'
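
The path printed above has no separator after the drive letter, which on Windows makes it relative to the current directory of drive D rather than to its root. The sketch below uses the standard `ntpath` module (Windows path rules, importable on any OS) to show the difference. Whether the absolute form is what is wanted here is an assumption on my part.

```python
import ntpath  # Windows path rules, importable on any OS

# "D:" alone refers to the current directory on drive D, so the result is drive-relative.
relative = ntpath.join("D:", "collection", "artworks")
# "D:\\" refers to the root of drive D, so the result is absolute.
absolute = ntpath.join("D:\\", "collection", "artworks")

print(relative)
print(absolute)
print(ntpath.isabs(relative), ntpath.isabs(absolute))
```

The drive-relative form still works as long as the working directory on D: is the root, which is probably why the cells here find the files.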

In [12]:
#exploring how to retrieve all the files in the found directories
files = {}
for key in keys:
    files[key] = glob.glob(path[key])
    
files[keys[11111]][0]

'D:collection\\artworks\\p\\065\\p06500-12527.json'

In [11]:
#exploring how to take out only the part concerning the subject
with open(files[keys[11111]][0]) as json_file:
    data = json.load(json_file)
    subjects = data['subjects']['children']
    
subjects

[{'children': [{'children': [{'id': 10231,
      'name': "Rimbaud, Arthur, 'Les Illuminations'"}],
    'id': 58,
    'name': 'literature (not Shakespeare)'}],
  'id': 55,
  'name': 'literature and fiction'},
 {'children': [{'children': [{'id': 1050, 'name': 'arm/arms raised'},
     {'id': 270, 'name': 'standing'}],
    'id': 92,
    'name': 'actions: postures and motions'},
   {'children': [{'id': 167, 'name': 'woman'}], 'id': 95, 'name': 'adults'}],
  'id': 91,
  'name': 'people'},
 {'children': [{'children': [{'id': 1834, 'name': 'dawn'},
     {'id': 2261, 'name': 'sunrise'}],
    'id': 75,
    'name': 'times of the day'},
   {'children': [{'id': 506, 'name': 'field'}], 'id': 71, 'name': 'landscape'},
   {'children': [{'id': 975, 'name': 'sun'}], 'id': 69, 'name': 'astronomy'}],
  'id': 60,
  'name': 'nature'}]

In [15]:
#analysing the structure of the found section
subjects[1]['children'][0]#['children']
subjects[0]['name'],subjects[0]['children'][0]['name'],subjects[0]['children'][0]['children'][0]['name']

('literature and fiction',
 'literature (not Shakespeare)',
 "Rimbaud, Arthur, 'Les Illuminations'")

In [16]:
#reobtaining imageid
(files[keys[11111]][0]).split('\\')[-1][:6].upper()

'P06500'
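
Slicing the first six characters works for ids like P06500, but it depends on the id length and on the `\\` separator. A possible alternative, sketched below, takes everything before the first dash of the file name. It assumes that all files are named `<image id>-<record id>.json`, as the two examples seen so far are. `image_id_from_path` is a hypothetical helper name.

```python
from pathlib import PureWindowsPath

def image_id_from_path(json_path):
    """Hypothetical helper: 'p06500-12527.json' -> 'P06500'."""
    stem = PureWindowsPath(json_path).stem   # file name without the '.json' suffix
    return stem.split("-")[0].upper()        # part before the numeric record id

print(image_id_from_path(r"D:collection\artworks\p\065\p06500-12527.json"))
print(image_id_from_path(r"D:\collection\artworks\a\000\a00001-1035.json"))
```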

In [17]:
#trying to construct the tree in the intended shape for one instance
tree = {}

for i in range(len(subjects)):
    tree[subjects[i]['name']] = {}
    for j in range(len(subjects[i]['children'])):
        tree[subjects[i]['name']][subjects[i]['children'][j]['name']] = {}
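        #NOTE: len([...]) of a one-element list is always 1, so only the first subject of each subclass is stored
        for z in range(len([subjects[i]['children'][j]['children']])):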
            tree[subjects[i]['name']]\
                [subjects[i]['children'][j]['name']]\
                [subjects[i]['children'][j]['children'][z]['name']] = (files[keys[11111]][0]).split('\\')[-1][:6].upper()
       
tree

{'literature and fiction': {'literature (not Shakespeare)': {"Rimbaud, Arthur, 'Les Illuminations'": 'P06500'}},
 'people': {'actions: postures and motions': {'arm/arms raised': 'P06500'},
  'adults': {'woman': 'P06500'}},
 'nature': {'times of the day': {'dawn': 'P06500'},
  'landscape': {'field': 'P06500'},
  'astronomy': {'sun': 'P06500'}}}
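
The tree above keeps 'arm/arms raised' but not 'standing', and 'dawn' but not 'sunrise'. The short sketch below, using the two leaves of 'actions: postures and motions' from the json section printed earlier, shows why wrapping the children in a list before calling len leads to that, and how iterating directly visits every leaf.

```python
# The two leaf subjects of 'actions: postures and motions' for image P06500.
children = [{"id": 1050, "name": "arm/arms raised"}, {"id": 270, "name": "standing"}]

print(len([children]))  # 1: a one-element list whose only element is `children`
print(len(children))    # 2: the number of leaf subjects

# Iterating directly avoids index arithmetic and visits every leaf.
leaf_names = [child["name"] for child in children]
print(leaf_names)
```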

#### Actual tree construction
Now that the appropriate investigation has been done, I can start producing the complete tree.

> #### Do not run the following part

In [31]:
path1 = {}
files = {}
subjects = {}
n_files = 0     #json files opened
n_found = 0     #files with a subject section
n_missing = 0   #files without a subject section
keys4paths = list(set([key[:4] for key in keys]))

#print('number of directories', len(keys4paths))

for key in keys4paths:
    path1[key] = os.path.join( "D:", "collection", "artworks", key[0].lower(), key[1:4], '*.json')
    files[key] = glob.glob(path1[key])
    #print('number of files in directory:', len(files[key]))
    for file in files[key]:
        n_files += 1
        name_image = file.split('\\')[-1][:6].upper()
        with open(file) as json_file:
            data = json.load(json_file)
        try:
            subjects[name_image] = data['subjects']['children']
            n_found += 1
        except (KeyError, TypeError):
            #the record has no (or an empty) subject section
            n_missing += 1
            #print('does not have subject info for', name_image)

#print('number of files considered', n_files, '\nnumber of subjects found:', n_found, '\nnumber of subjects not found:', n_missing)

number of directories 345


In [30]:
len(list(subjects.values()))

26969

In [43]:
tree1 = {}

for image in subjects:
    for i in range(len(subjects[image])):
        if subjects[image][i]['name'] not in tree1.keys():
            tree1[subjects[image][i]['name']] = {}
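    #NOTE: as indented, this loop runs after the loop over i has finished, so only the last top-level subject of each image is expanded
    for j in range(len(subjects[image][i]['children'])):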
        if subjects[image][i]['children'][j]['name']\
        not in tree1[subjects[image][i]['name']].keys():
            tree1[subjects[image][i]['name']]\
                 [subjects[image][i]['children'][j]['name']] = {}
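        #NOTE: len([...]) of a one-element list is always 1, so only the first subject of each subclass is visited
        for z in range(len([subjects[image][i]['children'][j]['children']])):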
            if subjects[image][i]['children'][j]['children'][z]['name']\
            not in tree1[subjects[image][i]['name']]\
                        [subjects[image][i]['children'][j]['name']].keys():
                tree1[subjects[image][i]['name']]\
                     [subjects[image][i]['children'][j]['name']]\
                     [subjects[image][i]['children'][j]['children'][z]['name']] = [image]
            else:
                tree1[subjects[image][i]['name']]\
                     [subjects[image][i]['children'][j]['name']]\
                     [subjects[image][i]['children'][j]['children'][z]['name']].append(image)
       
tree1.keys()

dict_keys(['people', 'objects', 'nature', 'society', 'work and occupations', 'architecture', 'leisure and pastimes', 'emotions, concepts and ideas', 'symbols & personifications', 'interiors', 'abstraction', 'religion and belief', 'places', 'history', 'literature and fiction', 'group/movement'])
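
For reference, here is a more compact sketch of the same construction using `dict.setdefault`, which visits every subject of every image. This is not the code that produced `TateDict.json` or the statistics below. `add_image` and the toy records (which mimic the 'name'/'children' structure of the json files, with invented ids) are hypothetical.

```python
def add_image(tree, image_id, subjects):
    """File image_id under every class/subclass/subject path found in subjects."""
    for top in subjects:
        level2 = tree.setdefault(top["name"], {})
        for mid in top.get("children", []):
            level3 = level2.setdefault(mid["name"], {})
            for leaf in mid.get("children", []):
                level3.setdefault(leaf["name"], []).append(image_id)
    return tree

# Invented records with the same shape as data['subjects']['children'].
toy_subjects = {
    "X00001": [
        {"name": "people", "children": [
            {"name": "adults", "children": [{"name": "woman"}, {"name": "man"}]}]},
        {"name": "nature", "children": [
            {"name": "astronomy", "children": [{"name": "sun"}]}]},
    ],
    "X00002": [
        {"name": "people", "children": [
            {"name": "adults", "children": [{"name": "woman"}]}]},
    ],
}

tree = {}
for image_id, image_subjects in toy_subjects.items():
    add_image(tree, image_id, image_subjects)
print(tree)
```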

Dumping the tree into a json file so it can be reused in the future without rerunning this first part.

In [48]:
with open('TateDict.json', 'w') as outfile:
    json.dump(tree1, outfile)

### Summary statistics of the data

Now that the tree is built, let's look at some basic statistics on how the images are distributed among classes and how the subclasses are organized.

In [4]:
with open('TateDict.json', 'r') as infile:
    tree1 = json.load(infile)

In [5]:
# Let's start by looking at an example of path
tree1.keys()

dict_keys(['people', 'objects', 'nature', 'society', 'work and occupations', 'architecture', 'leisure and pastimes', 'emotions, concepts and ideas', 'symbols & personifications', 'interiors', 'abstraction', 'religion and belief', 'places', 'history', 'literature and fiction', 'group/movement'])

In [21]:
tree1['people'].keys()

dict_keys(['actions: postures and motions', 'adults', 'actions: expressive', 'children', 'actions: processes and functions', 'nudes', 'groups', 'body', 'ethnicity', 'named individuals', 'portraits', 'diseases and conditions', 'named families'])

In [22]:
tree1['people']['portraits'].keys()

dict_keys(['individuals: male', 'individuals: female', 'groups', 'self-portraits'])

In [23]:
tree1['people']['portraits']['individuals: male'][:4]

['P20089', 'P08100', 'T03576', 'N03523']

In [6]:
#some helper functions
def count_subclasses(d):
    #number of direct children (keys) of a node of the tree
    return len(d)
def count_entries(d):
    #number of image ids stored in a leaf list
    return len(d)

In [7]:
count_subclasses(tree1)

16

In [8]:
for subclass in tree1.keys():
    print(subclass, count_subclasses(tree1[subclass]))

people 13
objects 21
nature 18
society 15
work and occupations 14
architecture 14
leisure and pastimes 5
emotions, concepts and ideas 3
symbols & personifications 8
interiors 5
abstraction 2
religion and belief 10
places 5
history 3
literature and fiction 3
group/movement 0


In [76]:
for subclass in tree1.keys():
    for subsubclass in tree1[subclass].keys():
        print(subclass, subsubclass, count_subclasses(tree1[subclass][subsubclass]))

people actions: postures and motions 32
people adults 5
people actions: expressive 28
people children 4
people actions: processes and functions 14
people nudes 3
people groups 4
people body 36
people ethnicity 5
people named individuals 131
people portraits 4
people diseases and conditions 24
people named families 2
objects agriculture, gardening & fishing 31
objects vessels and containers 20
objects clothing and personal effects 94
objects religious and ceremonial 35
objects sports and games 12
objects weapons 17
objects fine arts and music 53
objects tools and machinery 41
objects reading, writing, printed matter 62
objects toys and models 11
objects food and drink 43
objects miscellaneous 29
objects heating and lighting 18
objects scientific and measuring 19
objects domestic 15
objects furnishings 41
objects kitchen 21
objects electrical appliances 15
objects fine art and design, named works 10
objects medical 8
objects materials 1
nature landscape 24
nature animals: mammals 58
natu

In [70]:
count_entries(tree1['people']['actions: postures and motions']['lying down'])

47

In [20]:
distr_data = []
for subclass in tree1.keys():
    for subsubclass in tree1[subclass].keys():
        for subsubsubclass in tree1[subclass][subsubclass].keys():
            distr_data.append(count_entries(tree1[subclass][subsubclass][subsubsubclass]))
            print(subclass, ' - ', subsubclass, ' - ', subsubsubclass, ': ', 
                  count_entries(tree1[subclass][subsubclass][subsubsubclass]))

people  -  actions: postures and motions  -  lying down :  47
people  -  actions: postures and motions  -  sitting :  570
people  -  actions: postures and motions  -  head in hand/hands :  76
people  -  actions: postures and motions  -  walking :  92
people  -  actions: postures and motions  -  standing :  447
people  -  actions: postures and motions  -  leg/legs raised :  12
people  -  actions: postures and motions  -  reclining :  210
people  -  actions: postures and motions  -  legs crossed :  16
people  -  actions: postures and motions  -  crouching :  32
people  -  actions: postures and motions  -  kneeling :  65
people  -  actions: postures and motions  -  hand/hands on hip :  22
people  -  actions: postures and motions  -  looking up :  23
people  -  actions: postures and motions  -  arms folded :  14
people  -  actions: postures and motions  -  bending forward :  35
people  -  actions: postures and motions  -  hands crossed :  8
people  -  actions: postures and motions  -  arm/

people  -  named individuals  -  Opie, William :  1
people  -  named individuals  -  Gonzalez, Lola :  1
people  -  named individuals  -  Drake :  1
people  -  named individuals  -  Ward, E.M. :  2
people  -  named individuals  -  Majendie :  1
people  -  named individuals  -  Whinny :  1
people  -  named individuals  -  Miller, Annie :  1
people  -  named individuals  -  John, Romilly :  1
people  -  named individuals  -  Pettigrew, Rose :  1
people  -  named individuals  -  Anne :  2
people  -  named individuals  -  Williams Hope :  1
people  -  named individuals  -  Skeaping, Paul :  1
people  -  named individuals  -  Rembrandt van Rijn :  1
people  -  named individuals  -  Gonzalez, Pilar :  1
people  -  named individuals  -  Hamnett, Nina :  1
people  -  named individuals  -  Collier, Ann :  1
people  -  named individuals  -  Barry, Dykes :  1
people  -  named individuals  -  Fraser, Pringle :  1
people  -  named individuals  -  Teed :  1
people  -  named individuals  -  Burne-Jon

objects  -  clothing and personal effects  -  hand mirror :  2
objects  -  clothing and personal effects  -  harness :  1
objects  -  clothing and personal effects  -  bag :  4
objects  -  clothing and personal effects  -  fan :  4
objects  -  clothing and personal effects  -  toothbrush :  1
objects  -  clothing and personal effects  -  ruff :  2
objects  -  clothing and personal effects  -  boot :  4
objects  -  clothing and personal effects  -  drapery :  17
objects  -  clothing and personal effects  -  cravat :  6
objects  -  clothing and personal effects  -  headscarf :  10
objects  -  clothing and personal effects  -  bracelet :  3
objects  -  clothing and personal effects  -  toga :  5
objects  -  clothing and personal effects  -  pinafore :  1
objects  -  clothing and personal effects  -  petticoat :  2
objects  -  clothing and personal effects  -  tights :  1
objects  -  clothing and personal effects  -  brooch :  3
objects  -  clothing and personal effects  -  rucksack :  1
o

objects  -  food and drink  -  drink, milk :  1
objects  -  food and drink  -  soup :  1
objects  -  food and drink  -  drink, whisky :  1
objects  -  food and drink  -  fruit, peach :  1
objects  -  food and drink  -  fruit, grape :  3
objects  -  food and drink  -  vegetable, mushroom :  2
objects  -  food and drink  -  sausage :  4
objects  -  food and drink  -  vegetable, carrot :  3
objects  -  food and drink  -  drink - non-specific :  2
objects  -  food and drink  -  fruit, tomato :  1
objects  -  food and drink  -  fruit, lemon :  2
objects  -  food and drink  -  vegetable, turnip :  1
objects  -  food and drink  -  fruit, fig :  1
objects  -  food and drink  -  drink, beer :  3
objects  -  food and drink  -  pie :  2
objects  -  food and drink  -  seafood :  3
objects  -  food and drink  -  peanut :  1
objects  -  food and drink  -  fruit, banana :  1
objects  -  food and drink  -  poultry - non-specific :  1
objects  -  food and drink  -  ice lolly :  1
objects  -  food and d

nature  -  landscape  -  sky :  68
nature  -  landscape  -  desert :  8
nature  -  landscape  -  cliff :  72
nature  -  landscape  -  farmland :  75
nature  -  landscape  -  forest :  57
nature  -  landscape  -  glacier :  8
nature  -  landscape  -  orchard :  5
nature  -  landscape  -  stratum :  2
nature  -  landscape  -  marsh :  6
nature  -  landscape  -  volcano :  6
nature  -  landscape  -  crater :  3
nature  -  landscape  -  jungle :  2
nature  -  landscape  -  vineyard :  2
nature  -  animals: mammals  -  horse :  201
nature  -  animals: mammals  -  ox :  10
nature  -  animals: mammals  -  goat :  26
nature  -  animals: mammals  -  cow :  114
nature  -  animals: mammals  -  deer :  25
nature  -  animals: mammals  -  cat :  14
nature  -  animals: mammals  -  tiger :  19
nature  -  animals: mammals  -  dog, Irish wolfhound :  1
nature  -  animals: mammals  -  dog - non-specific :  77
nature  -  animals: mammals  -  donkey :  24
nature  -  animals: mammals  -  lion :  17
nature  

nature  -  astronomy  -  moon :  90
nature  -  astronomy  -  sun :  70
nature  -  astronomy  -  planet :  4
nature  -  astronomy  -  space :  7
nature  -  astronomy  -  moonlight :  10
nature  -  astronomy  -  eclipse :  1
nature  -  astronomy  -  star :  13
nature  -  astronomy  -  constellation, Pictor :  1
nature  -  astronomy  -  universe :  1
nature  -  astronomy  -  comet :  1
nature  -  astronomy  -  meteor :  1
nature  -  astronomy  -  aureole :  1
nature  -  astronomy  -  supernova :  1
nature  -  animals: fish and aquatic life  -  scallop :  1
nature  -  animals: fish and aquatic life  -  fish :  22
nature  -  animals: fish and aquatic life  -  sea urchin :  1
nature  -  animals: fish and aquatic life  -  trout :  1
nature  -  animals: fish and aquatic life  -  whale :  3
nature  -  animals: fish and aquatic life  -  jellyfish :  1
nature  -  animals: fish and aquatic life  -  dolphin :  1
nature  -  animals: fish and aquatic life  -  sea lion :  1
nature  -  animals: fish an

society  -  sex and relationships  -  seduction :  1
society  -  sex and relationships  -  transvestism :  1
society  -  sex and relationships  -  virgin :  2
society  -  sex and relationships  -  courting :  1
society  -  sex and relationships  -  flirtation :  1
society  -  dress: ceremonial/royal  -  vestments :  14
society  -  dress: ceremonial/royal  -  court dress :  1
society  -  dress: ceremonial/royal  -  mourning dress :  3
society  -  dress: ceremonial/royal  -  wedding dress :  4
society  -  dress: ceremonial/royal  -  peers' robes :  5
society  -  dress: ceremonial/royal  -  matador costume :  1
society  -  government and politics  -  political prisoner :  4
society  -  government and politics  -  political tension :  2
society  -  government and politics  -  political belief :  3
society  -  government and politics  -  refugee :  2
society  -  government and politics  -  revolution :  1
society  -  government and politics  -  political protest :  1
society  -  government 

work and occupations  -  royalty and social rank  -  empress :  1
work and occupations  -  royalty and social rank  -  prince :  5
work and occupations  -  royalty and social rank  -  aristocrat :  9
work and occupations  -  royalty and social rank  -  queen :  12
work and occupations  -  royalty and social rank  -  knight :  13
work and occupations  -  royalty and social rank  -  baronet :  3
work and occupations  -  royalty and social rank  -  sultana :  1
work and occupations  -  royalty and social rank  -  princess :  2
work and occupations  -  royalty and social rank  -  pharoah :  1
work and occupations  -  equestrian and sporting  -  racing driver :  2
work and occupations  -  equestrian and sporting  -  wrestler :  6
work and occupations  -  equestrian and sporting  -  footballer :  1
work and occupations  -  equestrian and sporting  -  jockey :  7
work and occupations  -  equestrian and sporting  -  matador :  2
work and occupations  -  equestrian and sporting  -  groom :  4
w

architecture  -  townscapes, man-made features  -  breakwater :  1
architecture  -  townscapes, man-made features  -  stile :  2
architecture  -  townscapes, man-made features  -  lock :  6
architecture  -  townscapes, man-made features  -  water tank :  1
architecture  -  townscapes, man-made features  -  city wall :  1
architecture  -  townscapes, man-made features  -  alley :  1
architecture  -  townscapes, man-made features  -  gallows :  1
architecture  -  features  -  tower :  114
architecture  -  features  -  stair / step :  94
architecture  -  features  -  wall :  34
architecture  -  features  -  column :  17
architecture  -  features  -  window :  236
architecture  -  features  -  roof :  11
architecture  -  features  -  tile :  6
architecture  -  features  -  screen :  2
architecture  -  features  -  balcony :  1
architecture  -  features  -  arch :  41
architecture  -  features  -  turret :  3
architecture  -  features  -  mantelpiece :  13
architecture  -  features  -  drai

emotions, concepts and ideas  -  formal qualities  -  photographic :  720
emotions, concepts and ideas  -  formal qualities  -  repetition :  65
emotions, concepts and ideas  -  formal qualities  -  silhouette :  78
emotions, concepts and ideas  -  formal qualities  -  symmetry :  127
emotions, concepts and ideas  -  formal qualities  -  cubist space :  6
emotions, concepts and ideas  -  formal qualities  -  light :  32
emotions, concepts and ideas  -  formal qualities  -  texture :  168
emotions, concepts and ideas  -  formal qualities  -  sequence :  27
emotions, concepts and ideas  -  formal qualities  -  spontaneity :  48
emotions, concepts and ideas  -  formal qualities  -  visual illusion :  33
emotions, concepts and ideas  -  formal qualities  -  order :  13
emotions, concepts and ideas  -  formal qualities  -  misalignment :  2
emotions, concepts and ideas  -  formal qualities  -  documentary :  32
emotions, concepts and ideas  -  formal qualities  -  space :  81
emotions, conc

interiors  -  public and municipal  -  shop :  14
interiors  -  public and municipal  -  hall :  3
interiors  -  public and municipal  -  pharmacy :  1
interiors  -  public and municipal  -  museum :  5
interiors  -  public and municipal  -  waiting room :  2
interiors  -  public and municipal  -  court :  1
interiors  -  public and municipal  -  launderette :  12
interiors  -  public and municipal  -  classroom :  1
interiors  -  public and municipal  -  public lavatory :  1
interiors  -  public and municipal  -  library :  1
interiors  -  workspaces  -  office :  10
interiors  -  workspaces  -  plant room :  2
interiors  -  workspaces  -  studio :  108
interiors  -  workspaces  -  workshop :  16
interiors  -  workspaces  -  stable :  15
interiors  -  workspaces  -  photographic studio :  40
interiors  -  workspaces  -  laundry :  2
interiors  -  workspaces  -  foundry :  1
interiors  -  workspaces  -  factory :  1
interiors  -  workspaces  -  staff room :  1
interiors  -  workspaces 

history  -  politics and society  -  politics: America, foreign policy, post 1945 :  2
history  -  politics and society  -  politics: Anglo-Maratha treaty, 1790 :  1
history  -  politics and society  -  society: Waterloo Bridge, opening, 18 Jun 1817 :  2
history  -  politics and society  -  politics: Czechoslovakia, British non-intervention, 1938 :  2
history  -  politics and society  -  society: London Bridge, opening, 1 Aug 1831 :  1
history  -  politics and society  -  royalty: Prince of Wales, investiture, 1969 :  1
history  -  politics and society  -  period: Etruscan :  2
history  -  politics and society  -  period: Aztec civilization :  1
history  -  politics and society  -  protests and unrest: strike, National Union of Mineworkers, 1984 :  1
history  -  politics and society  -  royalty: Kaiser Friedrich III, memorial service, 1888 :  1
history  -  politics and society  -  politics: Situationist International, 1960s :  1
history  -  politics and society  -  period: 18th c. :  2

In [93]:
distr_data[:10]

[47, 570, 76, 92, 447, 12, 210, 16, 32, 65]

In [85]:
'number of total classes', len(distr_data)

('number of total classes', 2442)

In [88]:
'average number of entries per class', np.mean(distr_data)

('average number of entries per class', 15.402538902538902)

In [89]:
'median number of entries per class', np.median(distr_data)

('median number of entries per class', 2.0)

In [90]:
'max number of entries per class', np.max(distr_data)

('max number of entries per class', 1616)

In [91]:
'min number of entries per class', np.min(distr_data)

('min number of entries per class', 1)

In [92]:
'total number of entries (with repetitions)', np.sum(distr_data)

('total number of entries (with repetitions)', 37613)
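
The total above counts an image once for every path it is filed under. A minimal sketch of how the number of distinct images could be obtained alongside the per-leaf counts is given below. It runs on an invented toy tree and uses the standard `statistics` module instead of numpy, so the numbers it prints say nothing about the Tate data. `leaf_lists` is a hypothetical helper.

```python
from statistics import mean, median

# Invented toy tree in the same layout as tree1.
toy_tree = {
    "people": {"adults": {"woman": ["X1", "X2", "X3"], "man": ["X1"]}},
    "nature": {"astronomy": {"sun": ["X2"]}},
}

def leaf_lists(tree):
    """Yield the id list stored at every class/subclass/subject leaf."""
    for level2 in tree.values():
        for level3 in level2.values():
            yield from level3.values()

counts = [len(ids) for ids in leaf_lists(toy_tree)]
unique_images = {image_id for ids in leaf_lists(toy_tree) for image_id in ids}

print("entries with repetitions:", sum(counts))
print("distinct images:", len(unique_images))
print("mean / median entries per leaf:", mean(counts), median(counts))
```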

In [19]:
images_subclasses = {}
for subclass in tree1.keys():
    for subsubclass in tree1[subclass].keys():
        images_subclasses[subsubclass] = 0
        for subsubsubclass in tree1[subclass][subsubclass].keys():
            images_subclasses[subsubclass] += count_entries(tree1[subclass][subsubclass][subsubsubclass])
        print(subclass, ' - ', subsubclass, ' - ', images_subclasses[subsubclass])

people  -  actions: postures and motions  -  1912
people  -  adults  -  4089
people  -  actions: expressive  -  370
people  -  children  -  377
people  -  actions: processes and functions  -  298
people  -  nudes  -  398
people  -  groups  -  483
people  -  body  -  760
people  -  ethnicity  -  13
people  -  named individuals  -  145
people  -  portraits  -  212
people  -  diseases and conditions  -  69
people  -  named families  -  2
objects  -  agriculture, gardening & fishing  -  231
objects  -  vessels and containers  -  261
objects  -  clothing and personal effects  -  908
objects  -  religious and ceremonial  -  141
objects  -  sports and games  -  31
objects  -  weapons  -  160
objects  -  fine arts and music  -  367
objects  -  tools and machinery  -  221
objects  -  reading, writing, printed matter  -  215
objects  -  toys and models  -  23
objects  -  food and drink  -  103
objects  -  miscellaneous  -  75
objects  -  heating and lighting  -  55
objects  -  scientific and mea

In [17]:
images_classes = {}
for subclass in tree1.keys():
    images_classes[subclass] = 0
    for subsubclass in tree1[subclass].keys():
        for subsubsubclass in tree1[subclass][subsubclass].keys():
            images_classes[subclass] += count_entries(tree1[subclass][subsubclass][subsubsubclass])
    print(subclass, images_classes[subclass])

people 9128
objects 3333
nature 8984
society 1614
work and occupations 1521
architecture 3556
leisure and pastimes 764
emotions, concepts and ideas 2543
symbols & personifications 867
interiors 646
abstraction 4157
religion and belief 237
places 8
history 237
literature and fiction 18
group/movement 0
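
Since the final goal is a target vector, one possible encoding is a multi-hot vector over the top-level classes, with a 1 for every class an image is filed under. The sketch below is only one option, run on an invented toy tree. `multi_hot_targets` and the alphabetical class order are assumptions, not something fixed by the data.

```python
def multi_hot_targets(tree):
    """Map each image id to a 0/1 list with one slot per top-level class."""
    classes = sorted(tree)                         # fixed, reproducible class order
    index = {name: k for k, name in enumerate(classes)}
    targets = {}
    for c1, level2 in tree.items():
        for level3 in level2.values():
            for ids in level3.values():
                for image_id in ids:
                    vector = targets.setdefault(image_id, [0] * len(classes))
                    vector[index[c1]] = 1
    return classes, targets

# Invented toy tree in the same layout as tree1.
toy_tree = {
    "people": {"adults": {"woman": ["X1", "X2", "X3"], "man": ["X1"]}},
    "nature": {"astronomy": {"sun": ["X2"]}},
}

classes, targets = multi_hot_targets(toy_tree)
print(classes)
print(targets)
```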


### Unpacking data in folder structure
Now that the tree is built, I can unpack the data retrieved by following the image urls into a format that can be easily handled using TensorFlow's (Keras) flow_from_directory.

In [None]:
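from zipfile import ZipFile

with ZipFile(r'D:\collection\data_tate.zip', 'r') as zipObj:
    listOfFileNames = zipObj.namelist()
    print(listOfFileNames[2])
    for class1 in tree1.keys():
        for class2 in tree1[class1].keys():
            for class3 in tree1[class1][class2].keys():
                #NB: class names containing '/' or ':' do not map to a single valid Windows folder name
                for file in tree1[class1][class2][class3]:
                    try:
                        #extract() takes the target directory; the member keeps its own name
                        zipObj.extract(file, os.path.join( "D:", "collection", "data", class1, class2, class3))
                    except KeyError:
                        #raised when the archive has no member with this name
                        print('file not found: ' + file)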
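
Several class names in the tree contain characters such as '/' or ':' (for example 'arm/arms raised' or 'actions: postures and motions'), which cannot be used as they are in a Windows folder name. A minimal sketch of a sanitizing step follows. `safe_dirname` is a hypothetical helper, and replacing the offending characters with an underscore is just one possible convention.

```python
import re

def safe_dirname(name):
    """Hypothetical helper: replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", name)
    return cleaned.strip(" .")   # Windows also rejects trailing spaces and dots

for raw in ["arm/arms raised", "actions: postures and motions", "stair / step"]:
    print(raw, "->", safe_dirname(raw))
```

The mapping from original to sanitized names would need to be stored somewhere if the original labels are needed later, since the replacement is not reversible.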