# <a id="top"></a>Autoencoding Edward Hopper:<br>Using deep learning to recommend art

[Larry Finer](mailto:lfiner@gmail.com)  
March 2019

The goal of this project was to build a model that takes an image of an artwork and compares it to a corpus of more than 100,000 artworks from museums and other sources in order to find visually similar works. The main steps in the project were:

1. <b>Download artwork images and metadata from multiple sites</b> (this file)  
2. Combine metadata into a single pandas dataframe  
3. Develop a convolutional neural network autoencoder model that adequately reproduces the images
4. Extract the narrowest encoded layer and use it to encode the entire corpus as well as a test image; then compare the test image to the entire corpus using a cosine distance measure to find the nearest images
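
The comparison in step 4 can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `cosine_distance` and `nearest`, the three-element vectors, and the toy corpus are all assumptions made for the example, and real encodings would come from the autoencoder's narrowest layer. It uses only the standard library and assumes no encoding is an all-zero vector.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, corpus, k=3):
    """Return the ids of the k corpus encodings closest to the query.

    corpus is a dict mapping an artwork id to its encoded vector
    (hypothetical structure, chosen for this sketch).
    """
    ranked = sorted(corpus, key=lambda key: cosine_distance(query, corpus[key]))
    return ranked[:k]

# Toy encodings standing in for the autoencoder output.
corpus = {
    'a': [1.0, 0.0, 0.0],
    'b': [0.9, 0.1, 0.0],
    'c': [0.0, 1.0, 0.0],
}
print(nearest([1.0, 0.05, 0.0], corpus, k=2))
```

For a corpus of more than 100,000 encodings, a vectorized implementation would likely be preferable to this pure-Python loop.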

<hr>

## 1b. Download Artspace images

This file downloads Artspace's entire online collection of approximately 16,000 images. Metadata for each artwork, including artist, title, date, and medium, had already been downloaded in [an earlier project of mine](https://github.com/lfiner/predicting-art-prices).

### Sections
1b1. [Imports and setup](#1b1)  
1b2. [Download images](#1b2)

### <a id="1b1"></a>1b1. Imports and setup

In [12]:
import pandas as pd
import random
import time
import pickle
import urllib.request
import requests
from fake_useragent import UserAgent
from itertools import islice

In [2]:
# Create a UserAgent instance and pick one random browser User-Agent string.
# The same header is then sent with every web request in this session.
ua = UserAgent()
user_agent = {'User-agent': ua.random}
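
The cell above picks a single User-Agent string for the whole run. If a different string per request were wanted, the headers could be rebuilt inside the download loop. The sketch below shows one way to do that under stated assumptions: `FALLBACK_AGENTS` and `random_headers` are hypothetical names, and the hard-coded strings stand in for the values that `ua.random` would supply.

```python
import random

# Hypothetical stand-in list; in the notebook, ua.random would supply these strings.
FALLBACK_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0',
]

def random_headers(agents=FALLBACK_AGENTS):
    """Return a headers dict with a User-Agent chosen at random from agents."""
    return {'User-Agent': random.choice(agents)}

# Build a fresh header for each request.
headers = random_headers()
print(headers['User-Agent'])
```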

### <a id="1b2"></a>1b2. Download images

In [3]:
# Load the metadata file created in the "Predicting art prices" project
with open('./data/artspace/Artists and artworks dataframe.pickle', 'rb') as f:
    df = pickle.load(f)

In [4]:
df.shape

(15774, 72)

In [5]:
df.columns

Index(['artist', 'artwork_id', 'artwork_url', 'authentication', 'edition_size',
       'fame_cat', 'height', 'image_url', 'location', 'medium',
       'physical_description', 'price', 'styles', 'tags', 'title', 'width',
       'year_created', 'book', 'decorative-arts', 'mixed-media', 'new-media',
       'no-medium', 'painting', 'photograph', 'print', 'sculpture',
       'work-on-paper', 'unique', 'edition_size2', 'log_edition',
       'edition_grp', 'ed-1', 'ed-101-more', 'ed-11-30', 'ed-2-10', 'ed-31-50',
       'ed-51-100', 'signed', 'log_price', 'area', 'log_area', 'created_cat',
       'cr-1960s-70s', 'cr-1980s-90s', 'cr-2000s', 'cr-2010s', 'cr-pre-1960',
       'blue-chip', 'established', 'non-strategic', 'rising-star',
       'up-and-coming', 'name_lf', 'id', 'name_pretty', 'birth_year',
       'hometown', 'lives_in', 'degrees', 'degrees_n', 'genres', 'museums',
       'museums_n', 'galleries', 'galleries_n', 'name_last', 'artist_death',
       'dead', 'degrees_cat', 'galleries_c

In [6]:
# Convert ID to integer
df.artwork_id = df.artwork_id.astype(int)

In [7]:
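# Preview the dataframe with a fresh 0-based index (the result is displayed, not assigned back to df)
df.reset_index(drop=True)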

| | artist | artwork_id | artwork_url | authentication | edition_size | fame_cat | height | image_url | location | medium | ... | museums_n | galleries | galleries_n | name_last | artist_death | dead | degrees_cat | galleries_cat | museums_cat | log_museums_n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lola-soloveychik | 35537 | https://www.artspace.com/lola-soloveychik/the-... | Signed and numbered by the artist. | 10.0 | 0.0 | 10.00000 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | | photograph | ... | 0.0 | | 0.0 | soloveychik | | 0.0 | 2 | 0.0 | 0 | -inf |
| 1 | lola-soloveychik | 35538 | https://www.artspace.com/lola-soloveychik/cali... | Signed and numbered by the artist. | 10.0 | 0.0 | 8.00000 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | | photograph | ... | 0.0 | | 0.0 | soloveychik | | 0.0 | 2 | 0.0 | 0 | -inf |
| 2 | mimmo_paladino | 53265 | https://www.artspace.com/mimmo_paladino/horse-... | | | 1.0 | 37.20465 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | London | print | ... | 1.0 | [Sperone Westwater, New York, NY] | 1.0 | paladino | | 0.0 | 0 | 1.0 | 1 | 0.000000 |
| 3 | mimmo_paladino | 32803 | https://www.artspace.com/mimmo_paladino/gli-an... | This work is signed and dated by the artist in... | 60.0 | 1.0 | 32.00000 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | United States | print | ... | 1.0 | [Sperone Westwater, New York, NY] | 1.0 | paladino | | 0.0 | 0 | 1.0 | 1 | 0.000000 |
| 4 | mimmo_paladino | 33077 | https://www.artspace.com/mimmo_paladino/il-sog... | This work is signed and dated by the artist in... | 60.0 | 1.0 | 32.00000 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | United States | print | ... | 1.0 | [Sperone Westwater, New York, NY] | 1.0 | paladino | | 0.0 | 0 | 1.0 | 1 | 0.000000 |
| 5 | jonathan-ryan-storm | 53062 | https://www.artspace.com/jonathan-ryan-storm/woo | This work is signed on verso. | 1.0 | 0.0 | 15.00000 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | New York | mixed-media | ... | 0.0 | | 0.0 | ryanstorm | | 0.0 | 1 | 0.0 | 0 | -inf |
| 6 | jonathan-ryan-storm | 53068 | https://www.artspace.com/jonathan-ryan-storm/ep | This work is signed on verso. | 1.0 | 0.0 | 12.00000 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | New York | mixed-media | ... | 0.0 | | 0.0 | ryanstorm | | 0.0 | 1 | 0.0 | 0 | -inf |
| 7 | jonathan-ryan-storm | 53070 | https://www.artspace.com/jonathan-ryan-storm/x... | this work is signed on verso. | 1.0 | 0.0 | 11.00000 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | New York | mixed-media | ... | 0.0 | | 0.0 | ryanstorm | | 0.0 | 1 | 0.0 | 0 | -inf |
| 8 | jonathan-ryan-storm | 53075 | https://www.artspace.com/jonathan-ryan-storm/s... | This work is signed on verso. | 1.0 | 0.0 | 35.00000 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | New York | painting | ... | 0.0 | | 0.0 | ryanstorm | | 0.0 | 1 | 0.0 | 0 | -inf |
| 9 | thomas-hylander | 16046 | https://www.artspace.com/thomas-hylander/untitled | This work is signed. | 1.0 | 1.0 | 13.77950 | https://d5wt70d4gnm1t.cloudfront.net/media/a-s... | Copenhagen | sculpture | ... | 2.0 | [David Risley Gallery, Copenhagen, Denmark] | 1.0 | hylander | | 0.0 | 3 | 1.0 | 2 | 0.693147 |


In [None]:
# For first run, instantiate the image counter
imagecounter = 0

In [10]:
# pickle.dump(imagecounter, open('./data/artspace/images/Image counter.pickle', 'wb'))
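
The two cells above set up the counter by hand on the first run. As a hedged alternative, the sketch below wraps the checkpoint in two small helpers so that a missing file simply means "start from zero". The names `load_counter` and `save_counter` are hypothetical and not part of the original notebook, and the example writes to a temporary directory rather than `./data/artspace/images/`.

```python
import os
import pickle
import tempfile

def load_counter(path):
    """Return the saved counter, or 0 if no checkpoint file exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path, 'rb') as f:
        return pickle.load(f)

def save_counter(count, path):
    """Write the counter to disk, closing the file when done."""
    with open(path, 'wb') as f:
        pickle.dump(count, f)

# Demonstrate a first run followed by a resumed run.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Image counter.pickle')
    first = load_counter(path)       # no file yet
    save_counter(first + 10, path)   # checkpoint after 10 images
    second = load_counter(path)      # picked up by the next run
print(first, second)
```
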

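Here is the main loop to download the images. It starts from the saved image counter, so an interrupted run can be resumed without downloading the earlier images again.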

In [2]:
# Reload the image counter so that an interrupted run resumes where it left off.
counter_path = './data/artspace/images/Image counter.pickle'
with open(counter_path, 'rb') as f:
    imagecounter = pickle.load(f)
print('Starting count for this run:', imagecounter)
print()
# Loop over each row of the dataframe, skipping rows handled in earlier runs.
for index, row in islice(df.iterrows(), imagecounter, None):
    # Pause 20 to 29 milliseconds between requests.
    time_delay = random.randrange(20, 30) / 1000
    time.sleep(time_delay)
    # If the artwork has an image URL, download the image.
    if row['image_url'] != 'https:None':
        response = requests.get(row['image_url'], headers=user_agent, timeout=30)
        if response.ok:
            with open('./data/artspace/images/' + str(row['artwork_id']) + '.jpg', 'wb') as handler:
                handler.write(response.content)
        else:
            # Don't save an error page as a .jpg file.
            print('Download failed:', row['artwork_id'], response.status_code)
    else:
        print()
        print('Missing URL:', row['artwork_id'])
        print()
    imagecounter += 1
    # Save the counter every 10 images.
    if imagecounter % 10 == 0:
        with open(counter_path, 'wb') as f:
            pickle.dump(imagecounter, f)
        print('Count:', imagecounter)
# Save the final count, which may not be a multiple of 10.
with open(counter_path, 'wb') as f:
    pickle.dump(imagecounter, f)
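
The resume logic in the loop above can be exercised without touching the network. The sketch below is an illustration under assumptions, not the notebook's code: `download_images` and `fake_fetch` are hypothetical names, the records are made-up `(artwork_id, image_url)` pairs, and files are written to a temporary directory. It keeps the same `islice` skip and the same `'https:None'` check for a missing URL.

```python
import os
import tempfile
from itertools import islice

def download_images(records, fetch, out_dir, start=0):
    """Save one .jpg per record, skipping the first `start` records.

    records: iterable of (artwork_id, image_url) pairs.
    fetch:   callable taking a URL and returning the image bytes.
    Returns the total count of records processed, including skipped ones.
    """
    count = start
    for artwork_id, url in islice(records, start, None):
        if url != 'https:None':
            data = fetch(url)
            with open(os.path.join(out_dir, f'{artwork_id}.jpg'), 'wb') as handler:
                handler.write(data)
        count += 1
    return count

def fake_fetch(url):
    """Stand-in for a real HTTP request; returns dummy bytes."""
    return b'image bytes for ' + url.encode()

records = [
    (1, 'https://example.com/1.jpg'),
    (2, 'https:None'),
    (3, 'https://example.com/3.jpg'),
]
with tempfile.TemporaryDirectory() as tmp:
    # Resume after the first record, as if an earlier run had handled it.
    final_count = download_images(records, fake_fetch, tmp, start=1)
    saved = sorted(os.listdir(tmp))
print(final_count, saved)
```

In a real run, `fetch` could be something like `lambda url: requests.get(url, headers=user_agent).content`, matching the request made in the loop above.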