# <a id="top"></a>Autoencoding Edward Hopper:<br>Using deep learning to recommend art

[Larry Finer](mailto:lfiner@gmail.com)  
March 2019

The goal of this project was to build a model that takes an image of an artwork and compares it with a corpus of more than 100,000 artworks from museums and other sources to find visually similar works. The main steps in the project were:

1. <b>Download artwork images and metadata from multiple sites</b> (this file)  
2. Combine metadata into a single pandas dataframe  
3. Develop a convolutional neural network autoencoder model that adequately reproduces the images
4. Extract the narrowest encoded layer and use it to encode the entire corpus as well as a test image; then compare the test image to the entire corpus using a cosine distance measure to find the nearest images
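
The comparison in step 4 can be sketched as follows. This is a minimal, standard-library illustration of ranking by cosine distance, not the project's actual code: the function names `cosine_distance` and `nearest_images` and the toy three-dimensional encodings are assumptions made up for this example, and a real corpus of this size would more likely use vectorized NumPy operations.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError('cosine distance is undefined for a zero vector')
    return 1.0 - dot / (norm_u * norm_v)


def nearest_images(query, corpus, k=3):
    """Rank corpus entries by cosine distance to the query encoding.

    corpus maps an image identifier to its encoded vector; the k closest
    (identifier, distance) pairs are returned, nearest first.
    """
    scored = [(image_id, cosine_distance(query, vec))
              for image_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy encodings, invented for illustration only.
corpus = {
    'a': [1.0, 0.0, 0.0],
    'b': [0.9, 0.1, 0.0],
    'c': [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]
print([image_id for image_id, _ in nearest_images(query, corpus, k=2)])
# prints ['a', 'b']
```

Cosine distance ignores vector length and compares direction only, so two encodings that differ just by overall scale are treated as identical.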

<hr>

## 1c. Download MoMA images

This file downloads the images in the Museum of Modern Art's (MoMA) entire online collection, approximately 69,000 images. Relevant metadata for each artwork, including artist, title, date, and medium, are available online [here](https://github.com/MuseumofModernArt/collection/blob/master/Artworks.csv).

### Sections
1c1. [Imports and setup](#1c1)  
1c2. [Download images](#1c2)

### <a id="1c1"></a>1c1. Imports and setup

In [64]:
import pandas as pd
import random
import time
import pickle
import requests
from fake_useragent import UserAgent
from itertools import islice
from lxml import html
from bs4 import BeautifulSoup

In [65]:
ua = UserAgent()
user_agent = {'User-agent': ua.random}

### <a id="1c2"></a>1c2. Download images

In [66]:
# Read in artwork data
moma = pd.read_csv('./data/moma/MoMA artworks.csv')

In [67]:
moma.shape

(136759, 29)

In [68]:
moma.columns

Index(['ObjectID', 'URL', 'ThumbnailURL', 'Title', 'Artist', 'ConstituentID',
       'ArtistBio', 'Nationality', 'BeginDate', 'EndDate', 'Gender', 'Date',
       'Medium', 'Dimensions', 'CreditLine', 'AccessionNumber',
       'Classification', 'Department', 'DateAcquired', 'Cataloged',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

In [69]:
# Reduce the dataframe to only those artworks that have an image online.
moma = moma[moma.ThumbnailURL.notnull()]

In [70]:
moma.shape

(68565, 29)

In [71]:
moma.head()

| | ObjectID | URL | ThumbnailURL | Title | Artist | ConstituentID | ArtistBio | Nationality | BeginDate | EndDate | ... | Cataloged | Circumference (cm) | Depth (cm) | Diameter (cm) | Height (cm) | Length (cm) | Weight (kg) | Width (cm) | Seat Height (cm) | Duration (sec.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | http://www.moma.org/collection/works/2 | http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s... | Ferdinandsbrücke Project, Vienna, Austria, Ele... | Otto Wagner | 6210 | (Austrian, 1841–1918) | (Austrian) | -1841 | -1918 | ... | Y | | | | 48.6 | | | 168.9 | | |
| 1 | 3 | http://www.moma.org/collection/works/3 | http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw... | City of Music, National Superior Conservatory ... | Christian de Portzamparc | 7470 | (French, born 1944) | (French) | -1944 | 0 | ... | Y | | | | 40.6401 | | | 29.8451 | | |
| 2 | 4 | http://www.moma.org/collection/works/4 | http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw... | Villa near Vienna Project, Outside Vienna, Aus... | Emil Hoppe | 7605 | (Austrian, 1876–1957) | (Austrian) | -1876 | -1957 | ... | Y | | | | 34.3 | | | 31.8 | | |
| 3 | 5 | http://www.moma.org/collection/works/5 | http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi... | The Manhattan Transcripts Project, New York, N... | Bernard Tschumi | 7056 | (French and Swiss, born Switzerland 1944) | () | -1944 | 0 | ... | Y | | | | 50.8 | | | 50.8 | | |
| 4 | 6 | http://www.moma.org/collection/works/6 | http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi... | Villa, project, outside Vienna, Austria, Exter... | Emil Hoppe | 7605 | (Austrian, 1876–1957) | (Austrian) | -1876 | -1957 | ... | Y | | | | 38.4 | | | 19.1 | | |


In [75]:
# Initialize the image counter.
imagecounter = 0

In [76]:
# pickle.dump(imagecounter, open('./data/moma/images/Image counter.pickle', 'wb'))
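
The saved counter is what lets an interrupted download resume where it stopped. A minimal sketch of one way to make that bookkeeping more forgiving is below; `load_counter` and `save_counter` are hypothetical helper names, not part of the original notebook. The sketch assumes that starting from zero is the right behavior when no counter file exists yet, and it writes through a temporary file so an interruption mid-save does not leave a truncated pickle behind.

```python
import os
import pickle


def load_counter(path, default=0):
    """Return the saved counter, or default when no file exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, 'rb') as f:
        return pickle.load(f)


def save_counter(path, value):
    """Write the counter to a temporary file, then swap it into place."""
    tmp_path = path + '.tmp'
    with open(tmp_path, 'wb') as f:
        pickle.dump(value, f)
    # os.replace overwrites the destination in a single rename step.
    os.replace(tmp_path, path)
```

With helpers like these, a first run would not need a separate cell to create the pickle file, since `load_counter` falls back to the default.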

Here is the main loop to download images.

In [1]:
# Resume from the count saved by a previous run.
with open('./data/moma/images/Image counter.pickle', 'rb') as f:
    imagecounter = pickle.load(f)
print('Starting count for this run:', imagecounter)
print()
# Loop over each artwork in the dataframe, skipping rows already downloaded.
# The stop value of 100 limits the run to the first 100 rows; use None to process every row.
for index, row in islice(moma.iterrows(), imagecounter, 100):
    # print(row['ObjectID'], row['URL'])
    # Pause 20-29 ms between requests.
    timeDelay = random.randrange(20, 30)/1000
    time.sleep(timeDelay)
    soup = BeautifulSoup(requests.get(row['URL'], headers = user_agent).text, "lxml")

    # Search the artwork's page for the URL of the image.
    link = 'http://www.moma.org' + soup.find('div', class_='work__image-container').find('img')['src']
    # print(link)

    # Download the image.
    img_data = requests.get(link, headers = user_agent).content
    file = './data/moma/images/' + str(row['ObjectID']) + '.jpg'
    # print(file)
    with open(file, 'wb') as handler:
        handler.write(img_data)

    # Increment the image counter; save it every 10 images.
    imagecounter += 1
    if imagecounter % 10 == 0:
        with open('./data/moma/images/Image counter.pickle', 'wb') as f:
            pickle.dump(imagecounter, f)
        print('Count:', imagecounter)
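
The loop above builds the image URL by string concatenation and always rewrites the output file. The sketch below shows two small helpers that could make those steps more tolerant. It rests on assumptions: the names `absolute_image_url`, `image_path`, and `needs_download` are hypothetical, and whether the site ever returns an already-absolute `src` is not something the original notebook establishes. `urljoin` simply handles both cases.

```python
from pathlib import Path
from urllib.parse import urljoin

BASE_URL = 'http://www.moma.org'  # same host the loop prepends to each src


def absolute_image_url(src, base=BASE_URL):
    """Resolve an <img> src against the site root.

    urljoin leaves an already-absolute src untouched and prefixes a
    root-relative one, so both cases yield a usable URL.
    """
    return urljoin(base, src)


def image_path(object_id, image_dir='./data/moma/images'):
    """Build the output path used by the loop: <image_dir>/<ObjectID>.jpg."""
    return Path(image_dir) / f'{object_id}.jpg'


def needs_download(object_id, image_dir='./data/moma/images'):
    """Return True when no non-empty file exists yet for this artwork."""
    path = image_path(object_id, image_dir)
    return not (path.exists() and path.stat().st_size > 0)


# '/media/example.jpg' is a made-up path used only to show the behavior.
print(absolute_image_url('/media/example.jpg'))
# prints http://www.moma.org/media/example.jpg
```

Checking `needs_download(row['ObjectID'])` before each request would let a rerun skip artworks whose files are already on disk, independently of the pickled counter.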