# <a id="top"></a>Autoencoding Edward Hopper:<br>Using deep learning to recommend art

[Larry Finer](mailto:lfiner@gmail.com)  
March 2019

The goal of this project was to build a model that takes an image of an artwork and compares it to a corpus of more than 100,000 artworks from museums and other sources to find visually similar works. The main steps in the project were:

1. Download artwork images and metadata from multiple sites
2. **Combine metadata into a single pandas dataframe** (this file)  
3. Develop a convolutional neural network autoencoder model that adequately reproduces the images  
4. Extract the narrowest encoded layer and use it to encode the entire corpus as well as a test image; then compare the test image to the entire corpus using a cosine distance measure to find the nearest images
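
The comparison in step 4 can be illustrated with a small sketch. It is not the project's actual code: the function name `nearest_by_cosine` and the toy three-dimensional "encodings" are made up for illustration, and it assumes each artwork has already been encoded as a fixed-length vector.

```python
import numpy as np

def nearest_by_cosine(query, corpus, k=5):
    """Return the indices and cosine distances of the k corpus rows nearest to query."""
    corpus = np.asarray(corpus, dtype=float)
    query = np.asarray(query, dtype=float)
    # Scale every vector to unit length so a dot product equals cosine similarity
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    distances = 1.0 - corpus_unit @ query_unit
    order = np.argsort(distances)[:k]
    return order, distances[order]

# Toy encodings (hypothetical values, not real model output)
corpus_encodings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
test_encoding = np.array([1.0, 0.05, 0.0])
nearest_idx, nearest_dist = nearest_by_cosine(test_encoding, corpus_encodings, k=2)
```

The returned indices are row positions in the encoded corpus, so they can be used to look up rows in a metadata table that has the same row order.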

<hr>

## 2. Combine metadata into a single pandas dataframe 

### Sections
[2a. Imports and setup](#2a)  
[2b. Artspace](#2b)  
[2c. Guggenheim](#2c)  
[2d. MoMA](#2d)  
[2e. National Gallery of Art](#2e)  
[2f. Tate](#2f)  
[2g. Whitney](#2g)  
[2h. Merge all](#2h)

### <a id="2a"></a>2a. Imports and setup

In [1]:
import pandas as pd
import pickle
import datetime as dt

### <a id="2b"></a>2b. Artspace

In [2]:
# Load the Artspace dataframe; the cells below clean it
with open('../data/artspace/Artists and artworks dataframe.pickle', 'rb') as f:
    artspace = pickle.load(f)

In [3]:
# Drop artist ID and unique name
artspace.drop(columns=['id', 'artist'], inplace=True)

In [4]:
# Prettify medium
artspace['medium'] = artspace['medium'].str.replace('-', ' ').str.capitalize()

In [5]:
# Standardize column names, then prefix each id with the source name
artspace.rename(columns={'artwork_id': 'id', 'name_pretty': 'artist', 'year_created': 'date', 'artwork_url': 'page_url'}, inplace=True)
artspace['id'] = 'artspace_' + artspace['id'].astype(int).astype(str)
# Missing years become 0 so that the column can be cast to int
artspace.loc[artspace['date'].isna(), 'date'] = 0
artspace['date'] = artspace['date'].astype(int)
artspace['source'] = 'artspace'
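
The cell above uses 0 as a stand-in for a missing year. As a point of comparison, the sketch below shows that approach next to pandas' nullable integer dtype, which keeps missing values as `<NA>`. The sample values and the names `with_sentinel` and `nullable` are illustrative only; whether downstream code expects the 0 sentinel is not something this notebook states.

```python
import pandas as pd

# Toy year column with one missing value (made-up data)
years = pd.Series([2015.0, None, 1982.0], name='date')

# Sentinel approach: missing years become 0, giving a plain int column
with_sentinel = years.fillna(0).astype(int)

# Alternative: the nullable 'Int64' dtype keeps the gap as <NA>
nullable = years.astype('Int64')
```

With the sentinel, a filter such as `date > 0` is needed before any arithmetic on years; with `Int64`, `isna()` identifies the gaps directly.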

In [6]:
artspace = artspace[['id', 'artist', 'title', 'date', 'medium', 'source', 'page_url', 'image_url']]

In [7]:
artspace.head()

|  | id | artist | title | date | medium | source | page_url | image_url |
|---|---|---|---|---|---|---|---|---|
| 0 | artspace_35537 | Lola Soloveychik | The Dead Sea | 2015 | Photograph | artspace | https://www.artspace.com/lola-soloveychik/the-... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |
| 1 | artspace_35538 | Lola Soloveychik | California Sun | 2013 | Photograph | artspace | https://www.artspace.com/lola-soloveychik/cali... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |
| 2 | artspace_53265 | Mimmo Paladino | Horse and Knight | 2008 | Print | artspace | https://www.artspace.com/mimmo_paladino/horse-... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |
| 3 | artspace_32803 | Mimmo Paladino | Gli animali avanzano | 1982 | Print | artspace | https://www.artspace.com/mimmo_paladino/gli-an... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |
| 4 | artspace_33077 | Mimmo Paladino | Il Sognatore | 1982 | Print | artspace | https://www.artspace.com/mimmo_paladino/il-sog... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |


In [8]:
pickle.dump(artspace, open('../data/all/artspace.pickle', 'wb'))
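
The load and save cells in this notebook pass `open(...)` straight to `pickle`, which leaves closing the file to the interpreter. A minimal sketch of the same round trip with context managers follows; `save_pickle` and `load_pickle` are hypothetical helper names, and the demo writes to a temporary directory rather than to the project's `../data` tree. As with any pickle, only files from a trusted source should be loaded.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

def save_pickle(obj, path):
    """Write obj to path; the file is closed when the with-block exits."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round-trip a tiny made-up frame through a temporary file
demo = pd.DataFrame({'id': ['artspace_1'], 'title': ['Example']})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo.pickle'
    save_pickle(demo, demo_path)
    restored = load_pickle(demo_path)
```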

In [9]:
artspace.shape

(15774, 8)

### <a id="2c"></a>2c. Guggenheim

In [10]:
gugg = pd.DataFrame(pickle.load(open('../data/gugg/gugg.pickle', 'rb')))

In [11]:
gugg.rename(columns={'artwork_uid': 'id'}, inplace=True)
# Build the page URL from the raw id before the id gets its source prefix
gugg['page_url'] = 'https://www.guggenheim.org/artwork/' + gugg['id'].astype(str)
gugg['id'] = 'gugg_' + gugg['id'].astype(str)
# Keep only the first four characters of the date string
gugg['date'] = gugg['date'].str[:4]
# Medium is left blank for this source
gugg['medium'] = ''
gugg['source'] = 'gugg'
gugg = gugg[['id', 'artist', 'title', 'date', 'medium', 'source', 'page_url', 'image_url']]

In [12]:
gugg.head()

|  | id | artist | title | date | medium | source | page_url | image_url |
|---|---|---|---|---|---|---|---|---|
| 0 | gugg_8303 | Wallace Mitchell | Composition No. 2 | 1946 |  | gugg | https://www.guggenheim.org/artwork/8303 | https://i0.wp.com/www.guggenheim.org/wp-conten... |
| 1 | gugg_493 | William Baziotes | Dusk | 1958 |  | gugg | https://www.guggenheim.org/artwork/493 | https://i2.wp.com/www.guggenheim.org/wp-conten... |
| 2 | gugg_18939 | Piotr Uklanski | Untitled (Dance Floor) | 1996 |  | gugg | https://www.guggenheim.org/artwork/18939 | https://i2.wp.com/www.guggenheim.org/wp-conten... |
| 3 | gugg_1186 | Max Ernst | Attirement of the Bride (La Toilette de la mar... | 1940 |  | gugg | https://www.guggenheim.org/artwork/1186 | https://i2.wp.com/www.guggenheim.org/wp-conten... |
| 4 | gugg_13481 | Rirkrit Tiravanija | untitled 2002 (he promised) | 2002 |  | gugg | https://www.guggenheim.org/artwork/13481 | https://i2.wp.com/www.guggenheim.org/wp-conten... |


In [13]:
pickle.dump(gugg, open('../data/all/gugg.pickle', 'wb'))

### <a id="2d"></a>2d. MoMA

In [14]:
moma = pd.read_csv('../data/moma/MoMA artworks.csv')

In [15]:
moma.columns

Index(['ObjectID', 'URL', 'ThumbnailURL', 'Title', 'Artist', 'ConstituentID',
       'ArtistBio', 'Nationality', 'BeginDate', 'EndDate', 'Gender', 'Date',
       'Medium', 'Dimensions', 'CreditLine', 'AccessionNumber',
       'Classification', 'Department', 'DateAcquired', 'Cataloged',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

In [16]:
moma.shape

(136759, 29)

In [17]:
moma.ThumbnailURL.value_counts()

http://www.moma.org/media/W1siZiIsIjE5OTkxNCJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=cabab63ac5a7d402    59
http://www.moma.org/media/W1siZiIsIjEzNjMxNSJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=fbf6ad0392f98ee3    28
http://www.moma.org/media/W1siZiIsIjIzMDI1NyJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=1ef42de47aa2e7fb    11
http://www.moma.org/media/W1siZiIsIjIyODM1MyJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=a1619cfbf3adbe4e     9
http://www.moma.org/media/W1siZiIsIjYwMTUyIl0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=7c56759cf53854e0      8
http://www.moma.org/media/W1siZiIsIjIyODAxMyJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=1a7989f5c95ae577     7
http://www.moma.org/media/W1siZiIsIjIyNzkxNCJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=96e9f3a10c31e0ba     7
http://www.moma.org/media/W1siZiIsIjIyODgwNSJdLFsicCIsImNvbnZlcnQiLCI

In [18]:
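# Keep only records that have a thumbnail; .copy() avoids SettingWithCopyWarning in the later in-place edits
moma = moma[moma.ThumbnailURL.notnull()].copy()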

In [19]:
moma.shape

(68565, 29)

In [20]:
moma.rename(columns={'ObjectID': 'id', 'URL': 'page_url', 'ThumbnailURL': 'image_url', 'Title': 'title', 'Artist': 'artist', 'Date': 'date', 'Medium': 'medium'}, inplace=True)
moma['id'] = 'moma_' + moma['id'].astype(str)
moma['source'] = 'moma'
moma = moma[['id', 'artist', 'title', 'date', 'medium', 'source', 'page_url', 'image_url']]

In [21]:
moma.head()

|  | id | artist | title | date | medium | source | page_url | image_url |
|---|---|---|---|---|---|---|---|---|
| 0 | moma_2 | Otto Wagner | Ferdinandsbrücke Project, Vienna, Austria, Ele... | 1896 | Ink and cut-and-pasted painted pages on paper | moma | http://www.moma.org/collection/works/2 | http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s... |
| 1 | moma_3 | Christian de Portzamparc | City of Music, National Superior Conservatory ... | 1987 | Paint and colored pencil on print | moma | http://www.moma.org/collection/works/3 | http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw... |
| 2 | moma_4 | Emil Hoppe | Villa near Vienna Project, Outside Vienna, Aus... | 1903 | Graphite, pen, color pencil, ink, and gouache ... | moma | http://www.moma.org/collection/works/4 | http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw... |
| 3 | moma_5 | Bernard Tschumi | The Manhattan Transcripts Project, New York, N... | 1980 | Photographic reproduction with colored synthet... | moma | http://www.moma.org/collection/works/5 | http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi... |
| 4 | moma_6 | Emil Hoppe | Villa, project, outside Vienna, Austria, Exter... | 1903 | Graphite, color pencil, ink, and gouache on tr... | moma | http://www.moma.org/collection/works/6 | http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi... |


In [22]:
pickle.dump(moma, open('../data/all/moma.pickle', 'wb'))

### <a id="2e"></a>2e. National Gallery of Art

In [56]:
nga = pd.DataFrame(pickle.load(open('../data/nga/nga.pickle', 'rb')))

In [57]:
nga.rename(columns={'artwork_uid': 'id'}, inplace=True)
nga['page_url'] = 'https://www.nga.gov/collection/art-object-page.' + nga['id'].astype(str) + '.html'
nga['id'] = 'nga_' + nga['id'].astype(str)
nga['source'] = 'nga'

In [53]:
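# Keep the last four characters of each date string.
# The .str accessor is required: nga['date'][-4:] would select the last four rows of the column instead.
# (Per the execution counts, this cell ran before the reload in In [56], so the outputs below show unsliced dates.)
nga['date'] = nga['date'].str[-4:]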

In [58]:
nga.date.value_counts()

                                     9530
c. 1936                              3998
1935/1942                            3525
c. 1937                              3061
c. 1938                              2016
c. 1939                              1679
c. 1940                              1615
19th century                         1240
1955-1967                            1227
1936                                  858
1937                                  727
c. 1941                               639
1938                                  634
1969                                  570
1939                                  556
1975                                  538
1968                                  500
1972                                  461
1976                                  440
1940                                  435
1965                                  430
1973                                  428
1967                                  426
1947                              
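
The counts above show several date formats ('c. 1936', '1935/1942', '19th century', '1955-1967'), so a fixed-position slice cannot recover a year from all of them. Below is a minimal sketch of one alternative. It assumes the first run of four digits is an acceptable stand-in for the year, which is a simplification for ranges such as '1935/1942'. The sample values are copied from the output above; `first_year` and `year` are illustrative names.

```python
import pandas as pd

# Sample of the date strings seen in the NGA value counts
raw_dates = pd.Series(['c. 1936', '1935/1942', '19th century', '1955-1967', '1969', ''])

# Capture the first run of four digits; rows without one become NaN
first_year = raw_dates.str.extract(r'(\d{4})', expand=False)

# Convert to a nullable integer column so missing years stay missing
year = pd.to_numeric(first_year, errors='coerce').astype('Int64')
```

Entries such as '19th century' contain no four-digit run and end up as `<NA>` rather than as a misleading fragment.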

In [25]:
nga = nga[['id', 'artist', 'title', 'date', 'medium', 'source', 'page_url', 'image_url']]

In [26]:
nga.head()

|  | id | artist | title | date | medium | source | page_url | image_url |
|---|---|---|---|---|---|---|---|---|
| 0 | nga_27646 | Francis Law Durand | Doll | c. 1939 | pen and ink on paperboard | nga | https://www.nga.gov/collection/art-object-page... | https://media.nga.gov/iiif//public/objects/2/7... |
| 1 | nga_70914 | Bruce Nauman, Ken Farley, Gemini G.E.L. | Shit and Die | 1983 | drypoint in black on J. Barcham Green Crisbroo... | nga | https://www.nga.gov/collection/art-object-page... | https://media.nga.gov/iiif//public/objects/7/0... |
| 2 | nga_71579 | Sandro Chia | Father and Son Song | 1987/1989 | heliorelief and woodcut on hot-pressed T.H. Sa... | nga | https://www.nga.gov/collection/art-object-page... | https://media.nga.gov/iiif//public/objects/7/1... |
| 3 | nga_59247 | Franz Edmund Weirotter | Houses on an Inlet |  | etching | nga | https://www.nga.gov/collection/art-object-page... | https://media.nga.gov/iiif//public/objects/5/9... |
| 4 | nga_127352 | Ilse Bing | Self-Portrait | 1953 | gelatin silver print | nga | https://www.nga.gov/collection/art-object-page... | https://media.nga.gov/iiif//public/objects/1/2... |


In [27]:
pickle.dump(nga, open('../data/all/nga.pickle', 'wb'))

### <a id="2f"></a>2f. Tate

In [28]:
tate = pd.DataFrame(pickle.load(open('../data/tate/tate.pickle', 'rb')))

In [29]:
tate.columns

Index(['Acquisition', 'Artist', 'Artists', 'Collection', 'Dimensions',
       'Medium', 'Original title', 'Part of', 'Reference', 'artist', 'date',
       'image_url', 'title'],
      dtype='object')

In [30]:
tate.head()

|  | Acquisition | Artist | Artists | Collection | Dimensions | Medium | Original title | Part of | Reference | artist | date | image_url | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Presented by the Institute of Contemporary Pri... | Tom Phillips born 1937 |  | Tate | Image: 230 x 153 mm | Lithograph on paper |  | Ein Deutsches Requiem: After Brahms | P01539 | Tom Phillips | 1972 | https://www.tate.org.uk/art/images/work/P/P01/... | 2 |
| 1 | Presented by the Institute of Contemporary Pri... | Tom Phillips born 1937 |  | Tate | Image: 230 x 153 mm | Lithograph on paper |  | Ein Deutsches Requiem: After Brahms | P01538 | Tom Phillips | 1972 | https://www.tate.org.uk/art/images/work/P/P01/... | 1 |
| 2 | Presented by Dr Edward J. Steegmann 1908 | Aaron Edwin Penley 1807–1870 |  | Tate | Support: 254 x 327 mm | Graphite on paper |  |  | N02391 | Aaron Edwin Penley | 1842 | https://www.tate.org.uk/art/images/work/N/N02/... | Ruins at Torre Wood |
| 3 | Presented by Marlborough Graphics through the ... | Victor Pasmore 1908–1998 |  | Tate | Image: 391 x 397 mm | Intaglio print on paper |  |  | P01477 | Victor Pasmore | 1974 | https://www.tate.org.uk/art/images/work/P/P01/... | When the Lute is Broken |
| 4 | Presented by Marlborough Graphics through the ... | Victor Pasmore 1908–1998 |  | Tate | Image: 379 x 382 mm | Intaglio print on paper |  | Word and Image | P01468 | Victor Pasmore | 1974 | https://www.tate.org.uk/art/images/work/P/P01/... | ‘The Tear that Falls’ |


In [31]:
tate.rename(columns={'Reference': 'id', 'Medium': 'medium'}, inplace=True)
tate['page_url'] = ''
tate['id'] = 'tate_' + tate['id'].astype(str).str.lower()
tate['source'] = 'tate'

In [32]:
tate = tate[['id', 'artist', 'title', 'date', 'medium', 'source', 'page_url', 'image_url']]

In [33]:
tate.head()

|  | id | artist | title | date | medium | source | page_url | image_url |
|---|---|---|---|---|---|---|---|---|
| 0 | tate_p01539 | Tom Phillips | 2 | 1972 | Lithograph on paper | tate |  | https://www.tate.org.uk/art/images/work/P/P01/... |
| 1 | tate_p01538 | Tom Phillips | 1 | 1972 | Lithograph on paper | tate |  | https://www.tate.org.uk/art/images/work/P/P01/... |
| 2 | tate_n02391 | Aaron Edwin Penley | Ruins at Torre Wood | 1842 | Graphite on paper | tate |  | https://www.tate.org.uk/art/images/work/N/N02/... |
| 3 | tate_p01477 | Victor Pasmore | When the Lute is Broken | 1974 | Intaglio print on paper | tate |  | https://www.tate.org.uk/art/images/work/P/P01/... |
| 4 | tate_p01468 | Victor Pasmore | ‘The Tear that Falls’ | 1974 | Intaglio print on paper | tate |  | https://www.tate.org.uk/art/images/work/P/P01/... |


In [34]:
tate.image_url.describe()

count                                                    10
unique                                                   10
top       https://www.tate.org.uk/art/images/work/P/P01/...
freq                                                      1
Name: image_url, dtype: object

In [35]:
pickle.dump(tate, open('../data/all/tate.pickle', 'wb'))

### <a id="2g"></a>2g. Whitney

In [36]:
whitney = pd.DataFrame(pickle.load(open('../data/whitney/whitney.pickle', 'rb')))

In [37]:
whitney.head()

|  | Accession number | Artist | Artists | Classification | Credit line | Date | Dimensions | Edition information | Medium | Portfolio | Publication information | Rights and reproductions information | Series | Title | artwork_uid | image_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70.1605.57 | Edward Hopper |  | Drawings | Josephine N. Hopper Bequest | 1899–1906 | Sheet (Irregular): 11 5/8 × 7 5/16 in. (29.5 ×... |  | Pen and ink and graphite pencil on paper |  |  | © Heirs of Josephine N. Hopper, licensed by th... |  | Santa Claus Up To Date | 31577 | http://collectionimages.whitney.org/standard/8... |
| 1 | 97.136.1 | Ralph Gibson |  | Photographs | Gift of Mr. and Mrs. Raymond W. Merritt | 1971 | Sheet: 13 15/16 × 11 3/16 in. (35.4 × 28.4 cm)... | 2/25 | Gelatin silver print |  |  | © artist or artist’s estate | Deja-Vu | Untitled | 9437 | http://collectionimages.whitney.org/standard/1... |
| 2 | 93.89 | Jane Hammond |  | Prints | Purchase, with funds from the Print Committee | 1992–93 | Sheet (Irregular, Sight): 78 1/2 × 51 in. (199... | 9/32 \| 4 APs, 6 PPs | Etching, engraving, drypoint, screenprint, lit... |  | Printed and published by Universal Limited Art... | © artist or artist’s estate |  | Full House | 8430 | http://collectionimages.whitney.org/standard/1... |
| 3 | 78.94 | Harry Sternberg |  | Prints | Gift of Mr. and Mrs. Michael H. Irving | 1929 | Sheet: 11 1/16 × 13 15/16 in. (28.1 × 35.4 cm)... | Edition of 30 | Etching and aquatint |  | Printed by Harry Sternberg | © artist or artist’s estate | Circus | Circus #4: The Rings | 5128 | http://collectionimages.whitney.org/standard/1... |
| 4 | 96.117.70 | Harold Edgerton |  | Photographs | Gift of The Harold and Esther Edgerton Family ... | 1963 | Sheet: 8 1/16 × 10 1/16 in. (20.5 × 25.6 cm) I... |  | Chromogenic print |  |  | © artist or artist’s estate |  | Untitled (Moscow Circus) | 10871 | http://collectionimages.whitney.org/standard/1... |


In [38]:
whitney.rename(columns={'artwork_uid': 'id', 'Artist': 'artist', 'Date': 'date', 'Medium': 'medium', 'Title': 'title'}, inplace=True)
whitney['page_url'] = 'https://whitney.org/collection/works/' + whitney['id'].astype(str)
whitney['id'] = 'whitney_' + whitney['id'].astype(str).str.lower()
whitney['source'] = 'whitney'

In [39]:
whitney = whitney[['id', 'artist', 'title', 'date', 'medium', 'source', 'page_url', 'image_url']]

In [40]:
whitney.head()

|  | id | artist | title | date | medium | source | page_url | image_url |
|---|---|---|---|---|---|---|---|---|
| 0 | whitney_31577 | Edward Hopper | Santa Claus Up To Date | 1899–1906 | Pen and ink and graphite pencil on paper | whitney | https://whitney.org/collection/works/31577 | http://collectionimages.whitney.org/standard/8... |
| 1 | whitney_9437 | Ralph Gibson | Untitled | 1971 | Gelatin silver print | whitney | https://whitney.org/collection/works/9437 | http://collectionimages.whitney.org/standard/1... |
| 2 | whitney_8430 | Jane Hammond | Full House | 1992–93 | Etching, engraving, drypoint, screenprint, lit... | whitney | https://whitney.org/collection/works/8430 | http://collectionimages.whitney.org/standard/1... |
| 3 | whitney_5128 | Harry Sternberg | Circus #4: The Rings | 1929 | Etching and aquatint | whitney | https://whitney.org/collection/works/5128 | http://collectionimages.whitney.org/standard/1... |
| 4 | whitney_10871 | Harold Edgerton | Untitled (Moscow Circus) | 1963 | Chromogenic print | whitney | https://whitney.org/collection/works/10871 | http://collectionimages.whitney.org/standard/1... |


In [41]:
pickle.dump(whitney, open('../data/all/whitney.pickle', 'wb'))

### <a id="2h"></a>2h. Merge all

In [42]:
# ignore_index=True renumbers the rows 0..n-1 instead of repeating each source's own index labels
corpus_metadata = pd.concat([artspace, gugg, moma, nga, tate, whitney], ignore_index=True)

In [43]:
corpus_metadata.shape

(269379, 8)
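
By default `pd.concat` keeps each input's own index labels, so a stacked frame can contain repeated labels unless it is renumbered (with `ignore_index=True` or a later `reset_index(drop=True)`). The sketch below shows a few sanity checks that could be run on a merged frame. `check_corpus` is a hypothetical helper and the two small frames are made-up data, not rows from the corpus.

```python
import pandas as pd

def check_corpus(df):
    """Return a list of human-readable problems found in a merged metadata frame."""
    problems = []
    if not df.index.is_unique:
        problems.append('index has duplicate labels')
    if not df['id'].is_unique:
        problems.append('id column has duplicate values')
    if df['image_url'].isna().any():
        problems.append('some rows have no image_url')
    return problems

# Two tiny stand-in sources (hypothetical ids and URLs)
a = pd.DataFrame({'id': ['artspace_1', 'artspace_2'], 'image_url': ['u1', 'u2']})
b = pd.DataFrame({'id': ['gugg_1'], 'image_url': ['u3']})

stacked = pd.concat([a, b])                        # index labels are 0, 1, 0
renumbered = pd.concat([a, b], ignore_index=True)  # index labels are 0, 1, 2

stacked_problems = check_corpus(stacked)
renumbered_problems = check_corpus(renumbered)
```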

In [44]:
corpus_metadata.source.value_counts()

nga         95184
moma        68565
tate        64620
whitney     23335
artspace    15774
gugg         1901
Name: source, dtype: int64
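
Each of the six frames counted above was built with the same steps: rename the source's columns, build a page URL from the raw id, prefix the id with the source name, tag the source, and keep the eight shared columns. A sketch of how those steps might be folded into one function is shown here. `standardize` and `COLUMNS` are hypothetical names, not part of the notebook, and the image URL in the demo row is a placeholder.

```python
import pandas as pd

COLUMNS = ['id', 'artist', 'title', 'date', 'medium', 'source', 'page_url', 'image_url']

def standardize(df, source, rename=None, page_url_prefix=None):
    """Hypothetical helper: map one source's frame onto the shared columns."""
    out = df.rename(columns=rename or {}).copy()
    raw_id = out['id'].astype(str)
    if page_url_prefix is not None:
        # Build the page URL from the raw id, before the id is prefixed
        out['page_url'] = page_url_prefix + raw_id
    out['id'] = source + '_' + raw_id
    out['source'] = source
    # reindex adds any missing shared column (as NaN) and drops the extras
    return out.reindex(columns=COLUMNS)

raw = pd.DataFrame({
    'artwork_uid': [31577],
    'Artist': ['Edward Hopper'],
    'Title': ['Santa Claus Up To Date'],
    'Date': ['1899–1906'],
    'Medium': ['Pen and ink and graphite pencil on paper'],
    'image_url': ['http://example.com/image.jpg'],  # placeholder, not a real image URL
})
whitney_demo = standardize(
    raw,
    'whitney',
    rename={'artwork_uid': 'id', 'Artist': 'artist', 'Title': 'title',
            'Date': 'date', 'Medium': 'medium'},
    page_url_prefix='https://whitney.org/collection/works/',
)
```

Source-specific steps, such as slicing Guggenheim dates or lower-casing Tate references, would still need to run before or after the call.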

In [45]:
pickle.dump(corpus_metadata, open('../data/all/corpus_metadata.pickle', 'wb'))
pickle.dump(corpus_metadata, open('../webapp/corpus_metadata.pickle', 'wb'), protocol=2)

In [46]:
corpus_metadata = corpus_metadata.reset_index(drop=True)
corpus_metadata.head()

|  | id | artist | title | date | medium | source | page_url | image_url |
|---|---|---|---|---|---|---|---|---|
| 0 | artspace_35537 | Lola Soloveychik | The Dead Sea | 2015 | Photograph | artspace | https://www.artspace.com/lola-soloveychik/the-... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |
| 1 | artspace_35538 | Lola Soloveychik | California Sun | 2013 | Photograph | artspace | https://www.artspace.com/lola-soloveychik/cali... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |
| 2 | artspace_53265 | Mimmo Paladino | Horse and Knight | 2008 | Print | artspace | https://www.artspace.com/mimmo_paladino/horse-... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |
| 3 | artspace_32803 | Mimmo Paladino | Gli animali avanzano | 1982 | Print | artspace | https://www.artspace.com/mimmo_paladino/gli-an... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |
| 4 | artspace_33077 | Mimmo Paladino | Il Sognatore | 1982 | Print | artspace | https://www.artspace.com/mimmo_paladino/il-sog... | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... |


In [47]:
corpus_metadata.to_json('../webapp/corpus_metadata.json')

In [48]:
corpus_metadata.to_json('../data/autoencoder/models/corpus_metadata.json')
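
`DataFrame.to_json` defaults to `orient='columns'`, which nests values under column name and then index label, and which requires unique index labels. That requirement is likely why the index is reset before the JSON export. The sketch below compares the default with `orient='records'` on a two-row frame whose values are taken from the MoMA preview. Which layout the web app expects is not stated in this notebook, so this is illustration only.

```python
import json

import pandas as pd

demo = pd.DataFrame({'id': ['moma_2', 'moma_3'], 'date': ['1896', '1987']})

as_columns = demo.to_json()                  # default: {column: {index label: value}}
as_records = demo.to_json(orient='records')  # list with one object per row

# A frame with repeated index labels cannot be written with the default orient
stacked = pd.concat([demo, demo])  # index labels are 0, 1, 0, 1
try:
    stacked.to_json()
    duplicate_index_ok = True
except ValueError:
    duplicate_index_ok = False

parsed_columns = json.loads(as_columns)
parsed_records = json.loads(as_records)
```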

In [49]:
corpus_metadata[corpus_metadata['source'] != 'tate']['page_url'].describe()

count                                                204759
unique                                               204759
top       https://www.nga.gov/collection/art-object-page...
freq                                                      1
Name: page_url, dtype: object

In [52]:
corpus_metadata.loc[corpus_metadata['source'] == 'nga', 'date']

86240                c. 1939
86241                   1983
86242              1987/1989
86243                       
86244                   1953
86245                   1968
86246                   1860
86247                   1934
86248                   1918
86249              1935/1942
86250                   1973
86251                       
86252                   1975
86253                   1937
86254                c. 1930
86255                   1950
86256                   1773
86257                c. 1939
86258                   1990
86259                   1877
86260                       
86261           c. 1624/1625
86262                  1930s
86263                c. 1937
86264           16th century
86265                   1934
86266                   1929
86267                   1788
86268                c. 1500
86269                c. 1939
                 ...        
181394               c. 1920
181395                  2004
181396               c. 1936
181397        