## Art Auction Data

We begin by compiling information about artworks in the Sotheby's lot archive (https://www.sothebys.com/en/buy/lot-archive). The archive contains roughly 1.5M lots that went to auction between 1999 and the present. To pare this down, we apply two filters:

* $20k price floor
* Categories: (1) 19th century European paintings, (2) Contemporary Art, (3) Modern and Impressionist Art

This leaves us with 85,870 artworks.

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm
import os
import gc

%matplotlib inline

Request settings for scraping: a browser-like User-Agent header and the archive URL.

In [5]:
ua = {"User-Agent":"Mozilla/5.0"}
url = "https://www.sothebys.com/en/buy/lot-archive"

Quick sanity check to make sure we're getting the HTML and accessing the attributes.

In [6]:
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, 'lxml')

In [7]:
title = soup.title
print(title)

<title>Past Lot Archive | Search and browse past Fine Art, Jewels, Watches, Wine Lots | Sotheby's</title>


Each of the artworks is contained in a 'Card' div, which we'll select here.

In [10]:
cards = soup.find_all('div', {"class": "Card"})
len(cards)

15

Let's make sure we're collecting the right fields from each of these...

In [11]:
for card in cards:
    # Artwork title
    title = card.find_all('div', {"class": "Card-titleWorkTxt"})
    cleaned = BeautifulSoup(str(title), "lxml").get_text()[1:-1]
    print(cleaned)
    # Price sold (or the estimate, when no sold price is shown)
    price = card.find_all('div', {"class": "Card-details"})
    cleaned = BeautifulSoup(str(price), "lxml").get_text()
    print(cleaned[1:-1])
    # Price estimate
    estimate = card.find_all('div', {"class": "Card-estimate"})
    cleaned = BeautifulSoup(str(estimate), "lxml").get_text()
    print(cleaned)
    # Sale date and sale price
    date = card.find_all('div', {"class": "Card-salePrice"})
    cleaned = BeautifulSoup(str(date), "lxml").get_text()
    print(cleaned)
    # Auction date, time, and location
    auction = card.find_all('div', {"class": "Card-auctionDetails"})
    cleaned = BeautifulSoup(str(auction), "lxml").get_text()
    print(cleaned)
    # Image URL
    im = card.find_all('img')
    print(im[0]['data-src'])

counter-balance desk lamp
Lot Sold 75,000 USD
[30,000 – 50,000 USD]
[23 May 2019 Sale Price 75,000 USD]
[23 May 2019 | 2:00 PM EDT | New York]
https://cdn.dotcom.sothebys.psdops.com/dims4/default/8e28b51/2147483647/strip/true/crop/2000x1843+0+0/resize/330x304!/quality/90/?url=https%3A%2F%2Fcdn.dotcom.sothebys.psdops.com%2F93%2F6f%2Ff0%2F1be519877ab836652bf304a08872296e224e3c36979216cdfe4c4d30bc%2F680n10084-9zj9n.jpg
"lotus bell" table lamp
Lot Sold 37,500 USD
[25,000 – 35,000 USD]
[23 May 2019 Sale Price 37,500 USD]
[23 May 2019 | 2:00 PM EDT | New York]
https://cdn.dotcom.sothebys.psdops.com/dims4/default/0b34d05/2147483647/strip/true/crop/1678x2000+0+0/resize/315x376!/quality/90/?url=https%3A%2F%2Fcdn.dotcom.sothebys.psdops.com%2Fc0%2F88%2F26%2F39d53f047f3888cd0a4444e10878fa3c83b72adf675111b8590570a617%2F799n10084-b7b3q.jpg
an important and rare "fish and waves" table lamp
Estimate 1,000,000 – 1,500,000 USD
[1,000,000 – 1,500,000 USD]
[23 May 2019 Sale Price 1,004,000 USD]
[23 May 20

In [12]:
# URL w/ Filters
# Price $20k + 
# 19th century european, contemporary art, impressionist and modern art
url_filters = 'https://www.sothebys.com/en/buy/lot-archive?s=0&from=20000&to=&f0=20000-&from=&to=&f2=00000164-609a-d1db-a5e6-e9fff79f0000&f2=00000164-609b-d1db-a5e6-e9ff01230000&f2=00000164-609b-d1db-a5e6-e9ff08ab0000&q='
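
To see which filters this URL actually encodes, its query string can be unpacked with the standard library. This is only a sketch: reading `from`/`f0` as the price floor and the three `f2` values as category IDs is an assumption based on the comments in the cell above, not on any documented Sotheby's API.

```python
from urllib.parse import urlsplit, parse_qs

# Same filtered archive URL as in the cell above, split across lines for readability
url_filters = ('https://www.sothebys.com/en/buy/lot-archive?s=0&from=20000&to=&f0=20000-&from=&to='
               '&f2=00000164-609a-d1db-a5e6-e9fff79f0000'
               '&f2=00000164-609b-d1db-a5e6-e9ff01230000'
               '&f2=00000164-609b-d1db-a5e6-e9ff08ab0000&q=')

# parse_qs drops blank values by default, so the empty `to=` and `q=` disappear
params = parse_qs(urlsplit(url_filters).query)
for key, values in params.items():
    print(key, values)
```

Repeated keys such as `f2` come back as a list, which makes it easy to check that all three category filters are present.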

The 'Card' divs give us good summary information, but we would also like per-artwork fields such as dimensions, auction date, and artist birth date, which means visiting each artwork's own page.

In [13]:
# Function to clean the tag contents
def clean_text(string):
    return BeautifulSoup(str(string), 'lxml').get_text()
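
Because `clean_text` stringifies whatever `find_all` returns, the list brackets end up in the cleaned text, which is why some cells slice with `[1:-1]`. A possible alternative is sketched below on a made-up HTML fragment. `first_text` is a hypothetical helper, the fragment only imitates the class names used in this notebook, and the built-in `html.parser` is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

def first_text(node, tag, css_class, sep=" "):
    """Return the stripped text of the first matching element, or None if absent."""
    match = node.find(tag, {"class": css_class})
    if match is None:
        return None
    return match.get_text(separator=sep, strip=True)

# Invented fragment that mimics the structure of one archive card
html = '''
<div class="Card">
  <div class="Card-titleWorkTxt"> counter-balance desk lamp </div>
  <div class="Card-estimate">30,000 – 50,000 USD</div>
</div>
'''
card = BeautifulSoup(html, "html.parser").find("div", {"class": "Card"})

print(first_text(card, "div", "Card-titleWorkTxt"))
print(first_text(card, "div", "Card-estimate"))
print(first_text(card, "div", "Card-salePrice"))  # field missing from this card
```

Returning `None` for a missing field keeps one incomplete card from raising an `IndexError` in the middle of a long scrape.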

In [1]:
arts = []
ua = {"User-Agent": "Mozilla/5.0"}
for i in tqdm(range(1, 2)):     # Change 2 to some big number
    # NOTE: url_filters contains no '{}' placeholder, so .format(i) returns it
    # unchanged and every iteration requests the same page.
    url = url_filters.format(i)
    page = requests.get(url, headers=ua)

    soup = BeautifulSoup(page.text, 'lxml')
    cards = soup.find_all('div', {"class": "Card"})

    for card in cards:
        # Artist and artwork title
        artist = clean_text(card.find_all('div', {"class": "Card-title"}))

        title = card.find_all('div', {"class": "Card-titleWorkTxt"})
        title_clean = clean_text(title)

        # Price sold and estimate
        price = card.find_all('div', {"class": "Card-details"})
        price_clean = clean_text(price)

        estimate = card.find_all('div', {"class": "Card-estimate"})
        est_clean = clean_text(estimate)

        # Sale date and sale price
        date = card.find_all('div', {"class": "Card-salePrice"})
        date_clean = clean_text(date)

        # Auction name, then auction date, time, and location
        auction_name = clean_text(card.find_all('div', {"class": "Card-auctionTitle"}))

        auction = card.find_all('div', {"class": "Card-auctionDetails"})
        auc_clean = clean_text(auction)

        # Image URL
        im = card.find_all('img')[0]['data-src']

        # Go to artwork page and get description, dimensions, etc.
        # Use the current card (not cards[0]) so each artwork gets its own link.
        # The page is fetched here but not parsed yet.
        descrip_url = card.find_all('a', {"class": "Card-media"}, href=True)[0]['href']
        tmp_page = requests.get(descrip_url, headers=ua)

        arts.append([artist, title_clean, price_clean, est_clean, date_clean, auction_name, auc_clean, im])
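
The scraping loop calls `url_filters.format(i)`, but the URL contains no `{}` placeholder, so the page never changes. One way to page through results is to rewrite the `s` query parameter. The sketch below rests on an unverified assumption: that `s` is a zero-based result offset and that each page holds 15 cards (the `len(cards)` seen earlier). `page_url` is a hypothetical helper and the base URL is shortened for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

PAGE_SIZE = 15  # assumption: matches the 15 cards returned per request

def page_url(base_url, page, page_size=PAGE_SIZE):
    """Return base_url with its `s` parameter set to page * page_size."""
    parts = urlsplit(base_url)
    # keep_blank_values preserves empty parameters such as `to=` and `q=`
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs = [(k, str(page * page_size) if k == "s" else v) for k, v in pairs]
    return urlunsplit(parts._replace(query=urlencode(pairs)))

# Shortened stand-in for url_filters
base = "https://www.sothebys.com/en/buy/lot-archive?s=0&from=20000&to=&q="
for page in range(3):
    print(page_url(base, page))
```

Whether the site really paginates this way should be checked in a browser before relying on it.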

In [204]:
arts_cp = arts.copy()

In [208]:
!ls

art.csv       art_scp.ipynb met           moma


In [209]:
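import csv

# newline='' lets the csv module control line endings; the `with` block
# closes the file, so no explicit close() is needed.
with open('art.csv', 'w', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerows(arts_cp)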
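
Writing a header row alongside the data keeps the column names attached to the file, so they need not be supplied again at read time. A small sketch with invented sample rows follows. The column names are illustrative, and an in-memory buffer stands in for `art.csv`.

```python
import csv
import io

columns = ["Artist", "Title", "Lot Sold", "Estimate"]
rows = [
    ["Artist A", "work one", "Lot Sold 75,000 USD", "30,000 – 50,000 USD"],
    ["Artist B", "work two", "Lot Sold 37,500 USD", "25,000 – 35,000 USD"],
]

# In-memory stand-in for the CSV file
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(columns)   # header first
writer.writerows(rows)     # values containing commas are quoted automatically

# Read the rows back as dictionaries keyed by the header
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["Lot Sold"])
```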

In [210]:
!ls

art.csv       art_scp.ipynb met           moma


In [216]:
# Eight names, in the same order as the eight fields appended to `arts`
data = pd.read_csv('art.csv', names=["Artist", "Title", "Lot Sold", "Estimate", "Sale Info", "Auction Name", "Auction Info", "URL to JPEG"])

In [221]:
data[:1]

,Artist,Title,Lot Sold,Estimate,Sale Info,Auction Name,Auction Info,URL to JPEG
0,[William Merritt Chase],[near the sea (shinnecock)],"[Estimate 200,000 – 300,000 USD]","[200,000 – 300,000 USD]","[21 May 2019 Sale Price 212,500 USD]",[American Art],[21 May 2019 | 10:00 AM EDT | New York],https://cdn.dotcom.sothebys.psdops.com/dims4/d...
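
The scraped price fields are still strings such as `Lot Sold 75,000 USD` or `[200,000 – 300,000 USD]`. A sketch of turning them into numbers with regular expressions follows. `parse_amounts` and `parse_sale_price` are hypothetical helpers, and they assume comma thousands separators and a three-letter currency code, as in the samples shown in this notebook.

```python
import re

AMOUNT = re.compile(r"\d[\d,]*")
CURRENCY = re.compile(r"\b([A-Z]{3})\b")

def parse_amounts(text):
    """Return (list of integer amounts, currency code or None) for a price or estimate field."""
    amounts = [int(m.replace(",", "")) for m in AMOUNT.findall(text)]
    currency = CURRENCY.search(text)
    return amounts, (currency.group(1) if currency else None)

def parse_sale_price(text):
    """Pull the amount that follows 'Sale Price', ignoring the date in the same field."""
    m = re.search(r"Sale Price\s+(\d[\d,]*)", text)
    return int(m.group(1).replace(",", "")) if m else None

print(parse_amounts("Lot Sold 75,000 USD"))
print(parse_amounts("[200,000 – 300,000 USD]"))
print(parse_sale_price("[21 May 2019 Sale Price 212,500 USD]"))
```

The sale field gets its own function because it also contains a date, whose digits a generic amount pattern would pick up.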


# Experiments

In [123]:
url_filters

'https://www.sothebys.com/en/buy/lot-archive?s=0&from=20000&to=&f0=20000-&from=&to=&f2=00000164-609a-d1db-a5e6-e9fff79f0000&f2=00000164-609b-d1db-a5e6-e9ff01230000&f2=00000164-609b-d1db-a5e6-e9ff08ab0000&q='

In [124]:
ua = {"User-Agent":"Mozilla/5.0"}
page = requests.get(url_filters, headers=ua)
    
soup = BeautifulSoup(page.text, 'lxml')
cards = soup.find_all('div', {"class": "Card"})

In [125]:
a = cards[0].find_all('a', {"class":"Card-media"}, href=True)[0]['href']

In [126]:
a

'https://www.sothebys.com/en/buy/auction/2019/19th-century-european-art/thomas-ralph-spence-the-disciples-of-sappho'

In [127]:
a = cards[0].find_all('a', 'Card-info-container')

In [128]:
href_tags = soup.find_all('a', {'class': 'Card-info-container'})


In [129]:
href_tags[0]['href']

'https://www.sothebys.com/en/buy/auction/2019/19th-century-european-art/thomas-ralph-spence-the-disciples-of-sappho'

In [130]:
art_url = href_tags[0]['href']

In [131]:
art_url

'https://www.sothebys.com/en/buy/auction/2019/19th-century-european-art/thomas-ralph-spence-the-disciples-of-sappho'

In [132]:
art_page = requests.get(art_url, headers=ua)

In [133]:
art_soup = BeautifulSoup(art_page.text, 'lxml')

In [134]:
artist = art_soup.find_all('div', {'class': 'lotdetail-guarantee'})

In [135]:
artist

[]

In [79]:
clean_text(artist[0])

'Samuel Cooper'

In [80]:
art_name = art_soup.find_all('div', {'class': 'lotdetail-subtitle'})

In [82]:
clean_text(art_name[0])

'BRITISHPORTRAIT OF HENRY ALEXANDER, 4TH EARL OF STIRLING (D.1691), CIRCA 1666'

In [83]:
details = art_soup.find_all('div', {'class': 'lotdetail-description-text'})

In [88]:
details

[<div class="lotdetail-description-text">
                         Signed with monogram, centre right: SC<br/>Watercolour and bodycolour on vellum, silver gilt-frame, later mount, embellished with an Earl's coronet<br/>6.5cm by 5.5cm<br/></div>]
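
Dimensions appear as the last line of this description (`6.5cm by 5.5cm`). A sketch of extracting them with a regular expression follows, assuming the `<number>cm by <number>cm` pattern seen in this single example. Other lots may well use different units or wording, and `parse_dimensions_cm` is a hypothetical helper.

```python
import re

# Two decimal numbers, each followed by "cm", joined by the word "by"
DIMENSIONS = re.compile(r"(\d+(?:\.\d+)?)\s*cm\s+by\s+(\d+(?:\.\d+)?)\s*cm", re.IGNORECASE)

def parse_dimensions_cm(text):
    """Return the two measurements in cm as floats, in the order written, or None."""
    m = DIMENSIONS.search(text)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

description = (
    "Signed with monogram, centre right: SC\n"
    "Watercolour and bodycolour on vellum, silver gilt-frame, later mount, "
    "embellished with an Earl's coronet\n"
    "6.5cm by 5.5cm"
)
print(parse_dimensions_cm(description))

# The pattern also matches when the line breaks are lost and the text runs together
print(parse_dimensions_cm("embellished with an Earl's coronet6.5cm by 5.5cm]"))
```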

In [87]:
clean_text(details)

"[\r\n                        Signed with monogram, centre right: SCWatercolour and bodycolour on vellum, silver gilt-frame, later mount, embellished with an Earl's coronet6.5cm by 5.5cm]"

In [89]:
low = art_soup.find_all('span', {'class': 'range-from'})

In [90]:
low

[<span class="range-from" data-range-from="10000">10,000</span>,
 <span class="range-from" data-range-from="10000">10,000</span>,
 <span class="range-from" data-range-from="10000">10,000</span>]

In [92]:
sale_price = art_soup.find_all('div', {'class': 'price-sold'})

In [97]:
clean_text(low[0])

'10,000'

In [93]:
sale_price

[]

In [94]:
artist_dates = art_soup.find_all('div', {'class': 'lotdetail-artist-dates'})

In [96]:
clean_text(artist_dates)

'[1608-1672]'
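
The artist dates come back as a bracketed string, so a small parser is needed before birth and death years can be used as numeric columns. A sketch, assuming the `YYYY-YYYY` form seen here; `parse_life_dates` is a hypothetical helper, and other forms (for example living artists with only a birth year) simply return `None`.

```python
import re

def parse_life_dates(text):
    """Return (birth_year, death_year) as integers, or None if the pattern is absent."""
    m = re.search(r"(\d{4})\s*[-–]\s*(\d{4})", text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

print(parse_life_dates("[1608-1672]"))
```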

In [147]:
links = soup.find_all('a', {'class': 'Card-media'})

In [153]:
for link in links:
    print(link['href'])

https://www.sothebys.com/buy/6fd865ec-0f5a-4749-a2ea-aed66ba70636/lots/5c062c01-6d42-4a06-ae93-bb236f0f92ff
https://www.sothebys.com/buy/6fd865ec-0f5a-4749-a2ea-aed66ba70636/lots/19ed6e35-1d35-4639-86b9-5af0479cbdcb
https://www.sothebys.com/buy/6fd865ec-0f5a-4749-a2ea-aed66ba70636/lots/b4ed0263-e6bc-4cd2-a9f2-d8cb07584074
https://www.sothebys.com/en/auctions/ecatalogue/2019/contemporary-art-day-n10070/lot.106.html
https://www.sothebys.com/en/auctions/ecatalogue/2019/contemporary-art-day-n10070/lot.107.html
https://www.sothebys.com/en/auctions/ecatalogue/2019/contemporary-art-day-n10070/lot.108.html
https://www.sothebys.com/en/auctions/ecatalogue/2019/contemporary-art-day-n10070/lot.109.html
https://www.sothebys.com/en/auctions/ecatalogue/2019/contemporary-art-day-n10070/lot.114.html
https://www.sothebys.com/en/auctions/ecatalogue/2019/contemporary-art-day-n10070/lot.115.html
https://www.sothebys.com/en/auctions/ecatalogue/2019/contemporary-art-day-n10070/lot.116.html
https://www.sotheb
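
These links come in at least two shapes: `/buy/<id>/lots/<id>` and `/en/auctions/ecatalogue/.../lot.N.html`. If the two shapes turn out to be rendered by different page templates (an assumption, not confirmed in this notebook), the scraper could choose its selectors by URL. A sketch follows; `page_kind` and the labels it returns are hypothetical.

```python
from urllib.parse import urlsplit

def page_kind(url):
    """Classify a lot URL by the shape of its path."""
    path = urlsplit(url).path
    if "/auctions/ecatalogue/" in path:
        return "ecatalogue"
    if path.startswith("/buy/") and "/lots/" in path:
        return "buy-lot"
    return "other"

samples = [
    "https://www.sothebys.com/buy/6fd865ec-0f5a-4749-a2ea-aed66ba70636/lots/5c062c01-6d42-4a06-ae93-bb236f0f92ff",
    "https://www.sothebys.com/en/auctions/ecatalogue/2019/contemporary-art-day-n10070/lot.106.html",
]
for url in samples:
    print(page_kind(url), url)
```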

In [136]:
tmp = links[0]['href']

In [148]:
len(links)

15

In [149]:
art_url = links[0]['href']
art_page = requests.get(art_url, headers=ua)
art_soup = BeautifulSoup(art_page.text, 'lxml')        

In [None]:
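# Selectors noted for the artwork page: field -> (tag, attributes)
selectors = {
    'artist_and_title': ('h1', {'class': 'css-sxdrbj'}),
    'estimate': ('p', {'class': 'css-4945v4'}),
    'sale_price': ('span', {'class': 'css-15o7tlo'}),
    'currency': ('span', {'class': 'css-wfxyp0'}),
    'description': ('div', {'id': 'LotDetails'}),
    'jpeg': ('img', {'class': 'css-1vzoslw'}),  # the JPEG link is in the `src` attribute
}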



In [150]:
art_soup.find('h1', {'class': 'css-sxdrbj'})

In [165]:
art_url = 'https://www.sothebys.com/en/buy/auction/2019/19th-century-european-art/jean-delville-orphee-aux-enfers'

In [166]:
tmp = requests.get(art_url, headers=ua)
t = BeautifulSoup(tmp.text, 'lxml')

In [167]:
d = t.find('div', {'id': 'LotDetails'})

In [168]:
d