# <a id="top"></a>Autoencoding Edward Hopper:<br>Using deep learning to recommend art

[Larry Finer](mailto:lfiner@gmail.com)  
March 2019

The goal of this project was to build a model that takes an image of an artwork and compares it to a corpus of more than 100,000 artworks from museums and other sources in order to find visually similar works. The main steps in the project were:

1. <b>Download artwork images and metadata from multiple sites</b> (this file)  
2. Combine metadata into a single pandas dataframe  
3. Develop a convolutional neural network autoencoder model that adequately reproduces the images
4. Extract the narrowest encoded layer and use it to encode the entire corpus as well as a test image; then compare the test image to the entire corpus using a cosine distance measure to find the nearest images
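
The comparison in step 4 can be illustrated with a minimal sketch. It is not the project's model code: the vectors below are made-up three-dimensional stand-ins for real encodings, the function names `cosine_distance` and `nearest` are hypothetical, and every vector is assumed to be non-zero. With a corpus of 100,000 images, a vectorized NumPy version would be the more practical choice.

```python
import math

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, corpus, k=3):
    """Return the indices of the k corpus vectors closest to the query."""
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine_distance(query, corpus[i]))
    return ranked[:k]

# Made-up encodings standing in for the narrowest autoencoder layer
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
print(nearest(query, corpus, k=2))
```

Cosine distance depends only on the direction of the encoded vectors, not their magnitude, so two encodings that differ only by a scale factor count as identical.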

<hr>

## 1b. Download Guggenheim images and metadata

This file downloads the Guggenheim Museum's entire online collection of approximately 1,900 images, along with metadata for each artwork: artist, title, date, and image URL.

### Sections
1b1. [Imports and setup](#1b1)  
1b2. [Download images and metadata](#1b2)

### <a id="1b1"></a>1b1. Imports and setup

In [3]:
import os
import re
import pandas as pd
import numpy as np
import random
import time
import pickle
import urllib.request
import requests
from fake_useragent import UserAgent
from itertools import islice
from lxml import html
from bs4 import BeautifulSoup

In [5]:
# Create a user agent for web scraping
ua = UserAgent()
user_agent = {'User-agent': ua.random}

### <a id="1b2"></a>1b2. Download images and metadata

In [86]:
# Initialize counters for the main web scraping loop
idcounter = 0
artworkcounter = 0
imagecounter = 0

# Initialize an empty list that will contain the artwork metadata
gugg = []

On the Guggenheim website, artworks are presented on pages that have URLs of the form https://www.guggenheim.org/artwork/artwork-ID-number. Exploration suggested that all artwork ID numbers are below 60,000. Rather than searching for the valid ID numbers, I simply tried every ID from 0 to 59,999, in random order.
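
As a minimal sketch of this URL scheme, the helpers below build an artwork URL and recover the ID from a canonical link, which is the check the scraping loop uses to tell real artwork pages from 404s and redirects. The names `artwork_url`, `id_from_canonical`, and `is_requested_artwork` are hypothetical and do not appear in the notebook.

```python
BASE_URL = 'https://www.guggenheim.org/artwork/'

def artwork_url(artwork_id):
    """Build the page URL for a candidate artwork ID."""
    return BASE_URL + str(artwork_id)

def id_from_canonical(href):
    """Return the integer after the last '/' of a canonical URL, or None."""
    tail = href.rsplit('/', 1)[-1]
    try:
        return int(tail)
    except ValueError:
        return None

def is_requested_artwork(requested_id, canonical_href):
    """True only if the canonical link points back at the requested ID."""
    if canonical_href is None:
        return False
    return id_from_canonical(canonical_href) == requested_id

print(artwork_url(493))
print(is_requested_artwork(493, 'https://www.guggenheim.org/artwork/493'))
```

Comparing the canonical ID with the requested ID guards against pages that load successfully but belong to a different artwork or to a non-artwork section of the site.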

In [9]:
# Create a list of 60,000 ids (0 to 59,999), shuffled with a fixed seed so an interrupted run resumes in the same order
ids = list(range(60000))
random.Random(42).shuffle(ids)

# Take a look at a few
ids[:10]

[15586, 33591, 37199, 23870, 29869, 8176, 17876, 32671, 13543, 5084]

In [3]:
# If the program was interrupted, run this cell to reload the saved counters and list of artworks
idcounter = pickle.load(open('./data/gugg/ID counter.pickle', 'rb'))
artworkcounter = pickle.load(open('./data/gugg/Artwork counter.pickle', 'rb'))
imagecounter = pickle.load(open('./data/gugg/Image counter.pickle', 'rb'))
gugg = pickle.load(open('./data/gugg/Guggenheim.pickle', 'rb'))

Here is the main web-scraping loop.

In [None]:
# Short function to save counters and artwork list periodically.
def save_counters_and_artwork_list():
    pickle.dump(idcounter, open('./data/gugg/ID counter.pickle', 'wb'))
    pickle.dump(artworkcounter, open('./data/gugg/Artwork counter.pickle', 'wb'))
    pickle.dump(imagecounter, open('./data/gugg/Image counter.pickle', 'wb'))        
    pickle.dump(gugg, open('./data/gugg/Guggenheim.pickle', 'wb'))
    print()
    print('IDs:', idcounter)
    print('Artworks:', artworkcounter)
    print('Images:', imagecounter)
    print()
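
The save function above writes four separate pickle files, so an interruption partway through a save could leave them out of step with one another. One possible alternative, sketched here under the assumption that all state fits in a single dictionary, writes one checkpoint file atomically. The names `save_checkpoint` and `load_checkpoint` and the file name `checkpoint.pickle` are hypothetical.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write state to path atomically: dump to a temp file, then rename it."""
    directory = os.path.dirname(path) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(state, f)
        # os.replace swaps the file in one step, so readers never see a partial file
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, 'checkpoint.pickle')
    state = {'idcounter': 40, 'artworkcounter': 3, 'imagecounter': 3, 'gugg': []}
    save_checkpoint(state, checkpoint)
    print(load_checkpoint(checkpoint, default=None))
```

Keeping the counters and the artwork list in one file means a resumed run always sees a consistent snapshot.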

In [None]:
print('Starting count for this run:', idcounter)
print()
# Try each id in the shuffled list, resuming from the most recently saved counter
for id in ids[idcounter:]:
    print('Attempt:', str(idcounter), 'Artwork ID:', str(id))
    url = 'https://www.guggenheim.org/artwork/' + str(id)

    # Pause briefly (20 to 29 ms) between requests
    timeDelay = random.randrange(20, 30)/1000
    time.sleep(timeDelay)
    
    # Download the page with requests and parse it with Beautiful Soup
    soup = BeautifulSoup(requests.get(url, headers = user_agent).text, "lxml")

    # If there is a link with rel='canonical', this is an artwork page.
    try:
        pagestring = soup.find('link', rel='canonical')['href']
        # Try to get the ID number.
        pageid = int(pagestring[pagestring.rfind('/')+1:])
    # If the above failed, it is not an artwork page.
    except (TypeError, KeyError, ValueError):
        pagestring = None
    # If this is not an artwork page, or it belongs to a different artwork ID:
    if pagestring is None or pageid != id:
        print('== 404 or wrong page ==')
        # Save the counters and artwork list every 20 tries.
        if idcounter % 20 == 0:
            save_counters_and_artwork_list()
        idcounter += 1
        continue
    
    # Create an empty dictionary for this artwork.
    newdict = {}

    # Save the artwork ID.
    newdict['artwork_uid'] = id
    
    # Extract artwork metadata: title, date and artist.
    newdict['title'] = soup.find('meta', property='og:title')['content']
    newdict['date'] = soup.find('meta', property='article:published_time')['content']

    datadump = soup.find('script', text=re.compile('"artist":')).text
    chop = datadump[datadump.find('"artist":[{"id":')+16:datadump.find(',"url"')]
    artist = chop[chop.find('"name":"')+8:-1]
    print(artist)
    newdict['artist'] = artist
    
    # Get the URL for the image.
    image_url = soup.find('meta', property='og:image:secure_url')['content']
    
    # Save the image itself into a local directory.
    img_data = requests.get(image_url, headers = user_agent).content
    file = './data/gugg/images/' + str(id) + '.jpg'
    # print(file)
    with open(file, 'wb') as handler:
        handler.write(img_data)
    imagecounter += 1
    
    # Save the image URL.
    newdict['image_url'] = image_url

    # Add the filled-in dictionary to the list of artworks.
    gugg.append(newdict)    
            
    # Increment counters.
    idcounter += 1
    artworkcounter += 1
    
    # Save counters and artwork list every 20 tries.
    if idcounter % 20 == 0:
        save_counters_and_artwork_list()
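
The loop above extracts the artist name by slicing the script text at fixed offsets. A regular expression is one possible alternative, sketched below. It assumes the embedded data looks like `"artist":[{"id":...,"name":"...","url":...`, which is inferred from the slicing code rather than from the live site; the function name `extract_artist` and the sample string are hypothetical.

```python
import json
import re

# Capture the first "name" value that follows the start of the artist list.
# The capture group tolerates backslash escapes inside the quoted name.
ARTIST_NAME = re.compile(
    r'"artist":\[\{"id":.*?"name":"((?:[^"\\]|\\.)*)"',
    re.DOTALL,
)

def extract_artist(script_text):
    """Return the first artist name found in the script text, or None."""
    match = ARTIST_NAME.search(script_text)
    if match is None:
        return None
    # Decode JSON escapes such as \u0144 by parsing the value as a JSON string
    return json.loads('"' + match.group(1) + '"')

# Hypothetical script contents in the assumed layout
sample = '{"artwork":{"artist":[{"id":42,"name":"Max Ernst","url":"/artist/max-ernst"}]}}'
print(extract_artist(sample))
```

Unlike the slicing approach, this returns `None` when no artist is present instead of producing a garbled string, so the caller can decide how to handle such pages.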

In [11]:
# Check the final number of artworks.
len(gugg)

1901
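
The `date` values are taken from each page's `article:published_time` tag, so they are stored as full ISO 8601 timestamps (for example `1946-01-01T05:00:00+00:00`) rather than plain years. If only the year is needed when the metadata are combined later, it could be pulled out as in the sketch below. This assumes the year component of the timestamp is the artwork's date, which the sample entries suggest but which I have not verified for every record; `year_from_published_time` is a hypothetical helper.

```python
from datetime import datetime

def year_from_published_time(value):
    """Parse an ISO 8601 timestamp string and return its year as an int."""
    return datetime.fromisoformat(value).year

print(year_from_published_time('1946-01-01T05:00:00+00:00'))
```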

In [12]:
# Take a look at a few artwork metadata entries.
gugg[:10]

[{'artwork_uid': 8303,
  'title': 'Composition No. 2',
  'date': '1946-01-01T05:00:00+00:00',
  'artist': 'Wallace Mitchell',
  'image_url': 'https://i0.wp.com/www.guggenheim.org/wp-content/uploads/1946/01/46.1048_ph_web-1.jpg'},
 {'artwork_uid': 493,
  'title': 'Dusk',
  'date': '1958-01-01T05:00:00+00:00',
  'artist': 'William Baziotes',
  'image_url': 'https://i2.wp.com/www.guggenheim.org/wp-content/uploads/1958/01/59.1544_ph_web-1.jpg'},
 {'artwork_uid': 18939,
  'title': 'Untitled (Dance Floor)',
  'date': '1996-01-01T05:00:00+00:00',
  'artist': 'Piotr Uklanski',
  'image_url': 'https://i2.wp.com/www.guggenheim.org/wp-content/uploads/1996/01/2006.72_ph_web-1.jpg'},
 {'artwork_uid': 1186,
  'title': 'Attirement of the Bride (La Toilette de la mariée)',
  'date': '1940-01-01T05:00:00+00:00',
  'artist': 'Max Ernst',
  'image_url': 'https://i2.wp.com/www.guggenheim.org/wp-content/uploads/1940/01/76.2553.78_ph_web-1.jpg'},
 {'artwork_uid': 13481,
  'title': 'untitled 2002 (he promis