<a href="https://colab.research.google.com/github/hturnbull93/problematic-artwork-identification/blob/master/problematic_artwork_identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problematic Artwork Workflow

The aim of this project is to identify artworks in the UK (particularly statues/sculptures) that depict people related to or involved in the slave trade.

## The Problems

1. Getting a list of artworks in the UK.
2. Identifying which of the artworks depict a slaver.

## Getting Artworks

The best source of artworks is [ArtUK](https://artuk.org/). It doesn't have a public API, but it does have an [infinite scrolling interface](https://artuk.org/discover/artworks/view_as/grid/search/work_type:sculpture) that loads 20 artworks per "page". The [page for each artwork](https://artuk.org/discover/artworks/farmer-with-plough-244862/) contains details including title, artist and location, and for most (but not all) artworks also images and descriptions.

The strategy used was to collect the URLs from the infinite scroll page, then scrape the details from each artwork page in turn.

To get the URLs, a small piece of JavaScript was used to automate the scrolling, collect the links and match the right URLs.

```javascript
// Locate and scroll to the pagination button, causing another set of artworks to load
const scrollToButton = () => document.getElementsByClassName('pagination-btn')[0].scrollIntoView()

// Attempt to scroll every second, keeping the interval ID so it can be cleared
const scrollInterval = setInterval(scrollToButton, 1000)

// When all are loaded, manually stop the scrolling
clearInterval(scrollInterval)

// Get all the links (a tags) in the document and spread into an array
const allLinks = [...document.links]

// Map through to get their href attributes
const allUrls = allLinks.map(link => link.href)

// Remove duplicate URLs
const uniqueUrls = [...new Set(allUrls)]

// Regex match the URLs of interest. After the artworks path, match any run of
// characters that are not an underscore or forward slash, up to the closing slash.
const matcher = /^https:\/\/artuk\.org\/discover\/artworks\/[^_\/]+\/$/
const urls = uniqueUrls.filter(url => matcher.test(url))

// Create a csv data string and join the URLs with newlines 
const csvContent = 'data:text/csv;charset=utf-8,' + urls.join("\n")

// Encode the csvContent as a URI
const encodedUri = encodeURI(csvContent)

// Create a link to download the csv
const link = document.createElement("a")
link.setAttribute("href", encodedUri)
link.setAttribute("download", "artwork_urls.csv")
link.click()
```

The pagination runs out at 500 pages (10,000 artworks), so it wasn't possible to scroll through the entire sculpture category in one pass. Instead, each subcategory was collected, then the main category was scrolled with an A-Z sort and again with a Z-A sort. This produced 21,225 unique URLs out of the 24,379 artworks in the ArtUK sculpture section.
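
Combining the separate passes could be sketched as follows. This is a minimal illustration rather than the script that was actually used: `merge_url_lists` and the three sample lists are hypothetical, and the pattern only mirrors the idea of the JavaScript matcher above.

```python
import re

# Pattern for a single artwork page URL (same idea as the JavaScript matcher).
ARTWORK_URL = re.compile(r"https://artuk\.org/discover/artworks/[^_/]+/?")


def merge_url_lists(*url_lists):
    """Combine several collected URL lists, keeping the first copy of each URL."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            # Normalise the trailing slash so the same page is not counted twice.
            normalised = url.strip().rstrip("/") + "/"
            if ARTWORK_URL.fullmatch(normalised) and normalised not in seen:
                seen.add(normalised)
                merged.append(normalised)
    return merged


# Hypothetical passes: one subcategory, the A-Z sort and the Z-A sort.
subcategory = ["https://artuk.org/discover/artworks/farmer-with-plough-244862/"]
a_to_z = [
    "https://artuk.org/discover/artworks/adam-sedgwick-17851873-253304/",
    "https://artuk.org/discover/artworks/farmer-with-plough-244862/",
]
z_to_a = [
    "https://artuk.org/discover/artworks/thors-hammer-252716/",
    "https://artuk.org/discover/artworks/view_as/grid/",  # not an artwork page
]

unique_urls = merge_url_lists(subcategory, a_to_z, z_to_a)
print(len(unique_urls))  # 3
```

Keeping insertion order (rather than using a bare `set`) makes the merged list reproducible between runs.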

## Scraping Artwork Data

A Python script was used to scrape each artwork page for the title and artist.

It uses the Requests library to get the HTML of the page, which is minified using the htmlmin library to remove excess whitespace and then parsed using the BeautifulSoup library.

BeautifulSoup is then used to look up the title and artist. If either cannot be found (if the request 404s, for example), the URL is marked as skipped.

While the actual script read from and wrote to CSV files for the full ~21,000 artworks, the following is an example using a sample list of artworks.

```python
!pip install htmlmin

from bs4 import BeautifulSoup
import requests
import htmlmin

# Sample of artwork URLs, one per row, in the shape they are read from the CSV
urlsCSV = [
  ["https://artuk.org/discover/artworks/farmer-with-plough-244862/"],
  ["https://artuk.org/discover/artworks/adam-sedgwick-17851873-253304/"],
  ["https://artuk.org/discover/artworks/napoleon-bonaparte-17691821-272365//"],
  ["https://artuk.org/discover/artworks/tank-dreams-260202/"],
  ["https://artuk.org/discover/artworks/spire-260987/"],
  ["https://artuk.org/discover/artworks/the-divine-tragedy-256405/"],
  ["https://artuk.org/discover/artworks/queen-mary-18671953-256643/"],
  ["https://artuk.org/discover/artworks/bust-of-a-woman-256637/"],
  ["https://artuk.org/discover/artworks/david-livingstone-18131873-266269/"],
  ["https://artuk.org/discover/artworks/queen-alexandra-18441925-256650/"],
  ["https://artuk.org/discover/artworks/clock-256663/"],
  ["https://artuk.org/discover/artworks/bird-262955"],
  ["https://artuk.org/discover/artworks/harris-academy-gates-248131/"],
  ["https://artuk.org/discover/artworks/michelangelo-14751564-248503/"],
  ["https://artuk.org/discover/artworks/spencer-perceval-17621812-prime-minister-253103/"],
  ["https://artuk.org/discover/artworks/seated-woman-262034/"],
  ["https://artuk.org/discover/artworks/pair-of-makonde-male-and-female-sculptures-261988/"],
  ["https://artuk.org/discover/artworks/charles-mcgarel-17881876-252527/"],
  ["https://artuk.org/discover/artworks/head-of-an-elderly-man-261990/"],
  ["https://artuk.org/discover/artworks/henry-richard-18121888-272031/"],
  ["https://artuk.org/discover/artworks/stone-table-252724/"],
  ["https://artuk.org/discover/artworks/the-stone-sculptures-small-torso-252722/"],
  ["https://artuk.org/discover/artworks/thors-hammer-252716/"],
  ["https://artuk.org/discover/artworks/edward-colston-16361721-266037/"]
]

artwork_url_title_artist = []

for row in urlsCSV:
  url = row[0]
  r = requests.get(url)
  # Assumption: the pages are UTF-8. Without this, en dashes in dates can
  # come back garbled when Requests guesses a different encoding.
  r.encoding = "utf-8"
  minified = htmlmin.minify(r.text, remove_empty_space=True)
  doc = BeautifulSoup(minified, 'html.parser')

  try:
    # Collapse runs of whitespace in the title
    artwork_name = " ".join(doc.h1.contents[0].split())
    artist_name = doc.find_all("h2", class_="artist")[0].text.strip()
    new_row = [url, artwork_name, artist_name]
    artwork_url_title_artist.append(new_row)
    print(new_row)
  except (AttributeError, IndexError, TypeError):
    # The expected elements are missing (for example on a 404 page)
    error_row = [url, "SKIPPED - ERROR"]
    artwork_url_title_artist.append(error_row)
    print(error_row)
```

```
['https://artuk.org/discover/artworks/farmer-with-plough-244862/', 'Farmer with Plough', 'Denise Delavigne']
['https://artuk.org/discover/artworks/adam-sedgwick-17851873-253304/', 'Adam Sedgwick (1785–1873)', 'Thomas Woolner (1825–1892)']
['https://artuk.org/discover/artworks/napoleon-bonaparte-17691821-272365//', 'Napoleon Bonaparte (1769–...
```
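
The CSV-driven version of this loop is not shown in the notebook. A rough sketch of its shape, under stated assumptions, might look like the following: `scrape_all` and `fake_scrape` are hypothetical names, the `missing-000000` URL is made up, and in-memory files stand in for the real CSV files.

```python
import csv
import io


def scrape_all(url_rows, scrape, writer):
    """Write one output row per URL, marking failures as skipped."""
    for row in url_rows:
        if not row:
            continue  # ignore blank lines in the input file
        url = row[0]
        try:
            title, artist = scrape(url)
            writer.writerow([url, title, artist])
        except Exception:
            writer.writerow([url, "SKIPPED - ERROR"])


def fake_scrape(url):
    """Hypothetical stand-in for the Requests + BeautifulSoup step."""
    if "missing" in url:
        raise ValueError("page not found")
    return "Farmer with Plough", "Denise Delavigne"


# In-memory files keep the sketch self-contained; real files would be opened
# with open(path, newline="", encoding="utf-8").
urls_file = io.StringIO(
    "https://artuk.org/discover/artworks/farmer-with-plough-244862/\n"
    "https://artuk.org/discover/artworks/missing-000000/\n"
)
results_file = io.StringIO()

scrape_all(csv.reader(urls_file), fake_scrape, csv.writer(results_file))
print(results_file.getvalue())
```

Writing a row for skipped URLs as well keeps the output aligned with the input, so failed pages can be retried later.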

## Finding Artworks Depicting Slavers

### NLP Name Query Strategy
The initial strategy was to use natural language processing (NLP), specifically named entity recognition (NER), to detect names in the artwork titles, then query Wikipedia for an article matching each name and rate the article by the number of occurrences of the word "slave".

For the NER, two libraries were considered: spaCy and NLTK. To cast a wide net, both were used and the resulting names were combined into a unique set.

Each name found is looked up on Wikipedia using the wikipedia library. If an article with that name is found, the number of occurrences of "slave" in its content is counted and recorded. If no article is found, a search is performed instead and each article in the search results is checked in the same way, casting an extra wide net in these cases. The idea is to reduce false negatives where the subject is a slaver but there is no article under exactly that name.
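
The direct-then-search logic can be illustrated without network access. In this sketch the `ARTICLES` and `SEARCH_RESULTS` dictionaries are hypothetical stand-ins for Wikipedia, and `rate_name` is an illustrative helper, not part of the wikipedia library.

```python
import re

SLAVE_PATTERN = re.compile("slave", re.IGNORECASE)

# Hypothetical stand-ins for Wikipedia: article text by title, search results by query.
ARTICLES = {
    "Clock": "A master clock keeps each slave clock in step.",
    "Example Person": "A merchant involved in the slave trade. Slavery funded his estate.",
}
SEARCH_RESULTS = {"Example Persn": ["Example Person", "Clock"]}


def rate_name(name):
    """Return (match_type, article_title, count) rows for one extracted name."""
    if name in ARTICLES:
        # Direct hit: count the matches in that one article.
        return [("direct", name, len(SLAVE_PATTERN.findall(ARTICLES[name])))]
    rows = []
    # No direct hit: fall back to every article returned by a search.
    for title in SEARCH_RESULTS.get(name, []):
        if "(disambiguation)" in title:
            continue
        rows.append(("search", title, len(SLAVE_PATTERN.findall(ARTICLES[title]))))
    return rows


print(rate_name("Clock"))          # [('direct', 'Clock', 1)]
print(rate_name("Example Persn"))  # [('search', 'Example Person', 2), ('search', 'Clock', 1)]
```

Because the pattern is a plain substring match, words such as "Slavery" are counted too, and unrelated senses such as "slave clock" score just like relevant ones.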

As before, the actual script read from and wrote to CSV files. Additionally, as working through each of the ~21,000 artworks would take considerable time, the original CSV was split into chunks of ~1,000 rows, each processed by its own thread.
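
The chunk-per-thread pattern on its own can be sketched like this. The input rows and `process_chunk` are placeholders, and the explicit lock is an addition for illustration: the notebook's own code appends to a shared list without one.

```python
import threading


def split_into_chunks(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def process_chunk(chunk, results, lock):
    """Placeholder work: the real script would run the NER and Wikipedia lookups here."""
    processed = [row + ["processed"] for row in chunk]
    with lock:  # guard the shared list explicitly
        results.extend(processed)


rows = [[f"url-{i}", f"title-{i}"] for i in range(20)]  # hypothetical input rows
results = []
lock = threading.Lock()

threads = [
    threading.Thread(target=process_chunk, args=(chunk, results, lock))
    for chunk in split_into_chunks(rows, 8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(threads), len(results))  # 3 20
```

Threads suit this job because most of the time is spent waiting on network requests; the order of rows in `results` is not guaranteed.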

The following script demonstrates the workflow on the result of the previous script, using chunks of 8 rows, which gives only 3 threads.

```python
!pip install wikipedia
import wikipedia

import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

import re
import threading

# Case-insensitive match for "slave" (this also counts words such as "slavery")
matcher = re.compile("slave", re.IGNORECASE)

# NLTK function: recursively collect the text of every named entity (NE) subtree
def extract_entity_names(t):
  entity_names = []

  if hasattr(t, 'label') and t.label:
    if t.label() == 'NE':
      entity_names.append(' '.join([child[0] for child in t]))
    else:
      for child in t:
        entity_names.extend(extract_entity_names(child))

  return entity_names

# Thread function: find names in each title and rate the matching articles
def thread_function(index, rows):
  thread_name = f"{index:02d}"
  print(f'Thread-{thread_name} starting')

  for original_row in rows:
    title = original_row[1]

    # spaCy PERSON match
    doc = nlp(title)
    names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

    # NLTK name match
    sentences = nltk.sent_tokenize(title)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
    for tree in chunked_sentences:
      names.extend(extract_entity_names(tree))

    unique_names = list(set(names))

    if len(unique_names) == 0:
      this_row = original_row[:]
      this_row.append("na - no names found")
      slave_matches.append(this_row)

    for name in unique_names:
      try:
        # Direct lookup of an article with this name
        page = wikipedia.page(name)
        result = len(matcher.findall(page.content))
        this_row = original_row[:]
        this_row.extend(["direct", name, page.url, result])
        slave_matches.append(this_row)
        if result > 0:
          print(this_row)

      except Exception:
        # No direct article: fall back to checking every search result
        search = wikipedia.search(name)
        search = [item for item in search if "(disambiguation)" not in item]
        for item in search:
          try:
            page = wikipedia.page(item)
            result = len(matcher.findall(page.content))
            search_row = original_row[:]
            search_row.extend(["search", name, page.url, result])
            slave_matches.append(search_row)
            if result > 0:
              print(search_row)

          except Exception:
            # Skip search results that cannot be loaded
            pass

  print(f'Thread-{thread_name} done')

# Split input list into chunks
def split_into_chunks(rows, n):
  for i in range(0, len(rows), n):
    yield rows[i:i + n]

chunks = list(split_into_chunks(artwork_url_title_artist, 8))

threads = []

# Shared results list that every thread appends to
slave_matches = []

# Assign threads
print("\n==============================STARTING==============================")
for i, rows in enumerate(chunks):
  x = threading.Thread(target=thread_function, args=(i, rows))
  threads.append(x)
  x.start()

for thread in threads:
  thread.join()
print("================================DONE================================")
```

```
Thread-00 starting
Thread-01 starting
Thread-02 starting
['https://artuk.org/discover/artworks/david-livingstone-18131873-266269/', 'David Livingstone (1813–1873)', 'unknown artist', 'direct', 'David Livingstone', 'https://en.wikipedia.org/wiki/David_Livingstone', 35]
['https://artuk.org/discover/artworks/clock-256663/', 'Clock', 'John Kirton (1878–1948)', 'direct', 'Clock', 'https://en.wikipedia.org/wiki/Clock', 3]
['https://artuk.org/discover/artworks/adam-sedgwick-17851873-253304/', 'Adam Sedgwick (1785–1873)', 'Thomas Woolner (1825–1892)', 'direct', 'Adam Sedgwick', 'https://en.wikipedia.org/wiki/Adam_Sedgwick', 7]
['https://artuk.org/discover/artworks/napoleon-bonaparte-17691821-272365//', 'Napoleon Bonaparte (1769–1821)', 'Antoine-Denis Chaudet (1763–1810)', 'direct', 'Napoleon Bonaparte', 'https://en.wikipedia.org/wiki/Napoleon', 15]
['https://artuk.org/discover/artworks/harris-academy-gates-248131/', 'Harris Academy Gates', 'David Findlay Wilson (b.1962)', 'search', 'Harris Academy Gates', 'https://en.wikipedia.org/wiki/National_Treasure_(film_series)', 1]
['https://artuk.org/discover/artworks/queen-mary-18671953-256643/', 'Queen Mary (1867–1953)', 'John Kirton (1878–1948)', 'search', '...
```

This approach does flag up the slave traders/owners (in this set, Charles McGarel, Adam Sedgwick and Edward Colston).

In general, two things seem to produce a significant number of false positives: NLTK is less discriminating in its NER, so it returns names of things that aren't necessarily people, and searching for a name means every returned article is checked. For example, the article found for the artwork Clock has occurrences of "slave" in reference to slave clocks that are synchronised by master clocks. Michelangelo's article mentions artworks he made with "slave" in their title (e.g. "Rebellious Slave"). Seated Woman and Queen Mary (depicting Mary of Teck) are picked up through other articles returned by the search that happen to contain the word.

It also picks up David Livingstone and Spencer Perceval, who were both abolitionists. Interestingly, querying the name of the other abolitionist, Henry Richard, finds a direct match to the article for Richard Henry Lee rather than the Henry Richard article that exists.

In the above output, there are 13 results of interest (that is, with at least 1 instance of the word "slave" in the found article), which correspond to 3 actual items of interest. These results still require manual investigation to separate actual slave traders/owners from abolitionists, which would be a considerable effort once the workflow is scaled up to the full set.

### Levenshtein Distance Match to Known Slaver List Strategy

It turns out that Wikipedia has categories of [slave owners](https://en.wikipedia.org/wiki/Category:Slave_owners) and [slave traders](https://en.wikipedia.org/wiki/Category:Slave_traders), so a new strategy was to check artwork titles for the names of these known slavers. The article titles and URLs were collected using the following JavaScript on each of the category pages and their subcategory pages.

```javascript
// Get each of the lists and spread into an array
const lists = [...document.querySelectorAll('.mw-category-group ul')]

// Map through the lists, mapping through each list to get its list item's url
// and text (the name of the slaver), and flatten to a single array of arrays
const allNamesUrls = lists.map(list => {
  const listItems = [...list.children]
  const nameAndUrl = listItems.map(item => {
    const link = item.getElementsByTagName('a')[0]
    return [link.text, link.href]
  })
  return nameAndUrl
}).flat()

// Filter out items that are links to other subcategories or to a list page.
const urls = allNamesUrls.filter(item => {
  return !(item[1].includes("Category") || item[1].includes("List_of"))
})

// Create a csv data string and join the URLs with newlines 
const csvContent = 'data:text/csv;charset=utf-8,' + urls.join("\n")
 
// Encode the csvContent as a URI
const encodedUri = encodeURI(csvContent)
 
// Create a link to download the csv
const link = document.createElement("a")
link.setAttribute("href", encodedUri)
link.setAttribute("download", "artwork_urls.csv")
link.click()
```

In total, 1,117 slave owners and traders known to Wikipedia were collected.

spaCy and NLTK were used to extract the names from the resulting article titles, in a similar manner to the process used to identify names in artwork titles. Where more than one name was found, the most appropriate one was chosen manually.

The Levenshtein distance is a metric that measures how many single-character edits (insertions, deletions or substitutions) are needed to turn one string into another. It is used here to provide a fuzzy match rather than an exact match, in case there are slight differences in the way that the names are presented.
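
For reference, the metric itself can be computed with a short dynamic programming routine. This is a textbook sketch for illustration, not the implementation used by fuzzywuzzy or python-Levenshtein.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Hadrian", "Hadrianus"))            # 2
print(levenshtein("Simón Bolívar", "Simon Bolivar"))  # 2
```

A raw distance depends on string length, which is why matching libraries usually convert it into a normalised score.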

The fuzzywuzzy library is used to calculate a Levenshtein-based similarity score (from 0 to 100) between the artwork title (with dates in brackets removed) and each of the known slaver names. Matches scoring above a threshold of 80 are recorded.
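
The shape of this comparison can be approximated with the standard library alone. The sketch below assumes `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy's scorer, so scores may differ from those fuzzywuzzy reports when python-Levenshtein is installed; `strip_bracketed` and `token_sort_score` are illustrative names.

```python
import re
from difflib import SequenceMatcher

THRESHOLD = 80


def strip_bracketed(title):
    """Remove bracketed text such as life dates, e.g. '(1636–1721)'."""
    return re.sub(r"\(.*\)", "", title).strip()


def token_sort_score(a, b):
    """Approximate a token-sort similarity score from 0 to 100."""
    def normalise(text):
        # Lowercase, drop punctuation, then sort the words alphabetically
        # so that word order does not affect the comparison.
        tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
        return " ".join(sorted(tokens))
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return round(100 * ratio)


title = strip_bracketed("Edward Colston (1636–1721)")
print(title, token_sort_score("Edward Colston", title))
print(token_sort_score("Robert Clayton", "Sir Robert Clayton") > THRESHOLD)
```

Sorting the tokens first means a title written surname-first still scores highly against a name written forename-first.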

As before, the actual script read from and wrote to CSV files and used threads to work on chunks of ~1,000 rows. Each thread is also passed its own copy of the `known_slavers` list to prevent any possibility of conflict when accessing it.

The following script demonstrates the workflow using the scraping results.

```python
!pip install fuzzywuzzy
!pip install python-Levenshtein

from fuzzywuzzy import fuzz
import re
import threading

# Known slave owners and traders: URL, title of article, best name
known_slavers = [
  ['https://en.wikipedia.org/wiki/John_Crenshaw', 'John Crenshaw', 'John Crenshaw'],
  ['https://en.wikipedia.org/wiki/Edward_Colston', 'Edward Colston', 'Edward Colston'],
  ['https://en.wikipedia.org/wiki/Pedro_Blanco_(slave_trader)', 'Pedro Blanco (slave trader)', 'Pedro Blanco'],
  ['https://en.wikipedia.org/wiki/Francis_Baring,_3rd_Baron_Ashburton', 'Francis Baring, 3rd Baron Ashburton', 'Francis Baring'],
  ['https://en.wikipedia.org/wiki/Adam_Sedgwick', 'Adam Sedgwick', 'Adam Sedgwick'],
  ['https://en.wikipedia.org/wiki/Erasmus_W._Beck', 'Erasmus W. Beck', 'Erasmus W. Beck'],
  ['https://en.wikipedia.org/wiki/William_McIntosh', 'William McIntosh', 'William McIntosh'],
  ['https://en.wikipedia.org/wiki/Stephen_Delancey', 'Stephen Delancey', 'Stephen Delancey'],
  ['https://en.wikipedia.org/wiki/Martin_Jenkins_Crawford', 'Martin Jenkins Crawford', 'Martin Jenkins Crawford'],
  ['https://en.wikipedia.org/wiki/Charles_McGarel', 'Charles McGarel', 'Charles McGarel'],
  ['https://en.wikipedia.org/wiki/James_Jackson_(congressman)', 'James Jackson (congressman)', 'James Jackson'],
]

# Thread function: compare each artwork title against every known name
def thread_function(index, rows, uris_names):
  thread_name = f"{index:02d}"

  for original_row in rows:
    title = original_row[1]
    # Remove bracketed text such as life dates before comparing
    treated_title = re.sub(r'\(.*\)', '', title)

    for name_row in uris_names:
      name = name_row[2]
      score = fuzz.token_sort_ratio(name, treated_title)

      # Record matches that score above the threshold of 80
      if score > 80:
        this_row = original_row[:]
        this_row.append(score)
        this_row.extend(name_row)
        known_slaver_matches.append(this_row)
        print(this_row)

  print(f'Thread-{thread_name} done')

# Split input list into chunks
def split_into_chunks(rows, n):
  for i in range(0, len(rows), n):
    yield rows[i:i + n]

chunks = list(split_into_chunks(artwork_url_title_artist, 8))

threads = []

# Shared results list that every thread appends to
known_slaver_matches = []

# Assign threads
print("\n==============================STARTING==============================")
for i, rows in enumerate(chunks):
  # Give each thread its own copy of the known names
  known_slavers_copy = known_slavers[:]

  x = threading.Thread(target=thread_function, args=(i, rows, known_slavers_copy))
  threads.append(x)
  x.start()

for thread in threads:
  thread.join()
print("================================DONE================================")
```


```
['https://artuk.org/discover/artworks/adam-sedgwick-17851873-253304/', 'Adam Sedgwick (1785–1873)', 'Thomas Woolner (1825–1892)', 100, 'https://en.wikipedia.org/wiki/Adam_Sedgwick', 'Adam Sedgwick', 'Adam Sedgwick']
Thread-00 done
Thread-01 done
['https://artuk.org/discover/artworks/charles-mcgarel-17881876-252527/', 'Charles McGarel (1788–1876)', 'Hamo Thornycroft (1850–1925)', 100, 'https://en.wikipedia.org/wiki/Charles_McGarel', 'Charles McGarel', 'Charles McGarel']
['https://artuk.org/discover/artworks/edward-colston-16361721-266037/', 'Edward Colston (1636–1721)', 'John Michael Rysbrack (1694–1770)', 100, 'https://en.wikipedia.org/wiki/Edward_Colston', 'Edward Colston', 'Edward Colston']
Thread-02 done
```


This process successfully identifies all three slave traders and owners in the sample data. While there are three exact matches here, the entire set has a few false positives where the subject of the artwork happens to share a name with a slaver (the dates of birth and death do not match), and a few results where the names are merely similar.
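
The notebook describes the date comparison as a manual check. One way it could be assisted, sketched here under assumptions, is to pull the life dates out of the artwork title and compare them with dates recorded for the known person. `life_dates` and `same_person` are hypothetical helpers, and the "(1800–1870)" namesake is invented for the example.

```python
import re

# Matches life dates such as "(1636–1721)" with either an en dash or a hyphen.
LIFE_DATES = re.compile(r"\((\d{4})\s*[–-]\s*(\d{4})\)")


def life_dates(title):
    """Return (birth_year, death_year) from an artwork title, or None if absent."""
    match = LIFE_DATES.search(title)
    return (int(match.group(1)), int(match.group(2))) if match else None


def same_person(title, known_dates):
    """True or False when both sets of dates are known, None when they cannot be compared."""
    found = life_dates(title)
    if found is None or known_dates is None:
        return None
    return found == known_dates


# The 1636 and 1721 dates come from the artwork title in the sample output.
print(same_person("Edward Colston (1636–1721)", (1636, 1721)))  # True
# Invented namesake with different dates.
print(same_person("Edward Colston (1800–1870)", (1636, 1721)))  # False
```

Returning `None` for titles without dates keeps those rows in the manual review pile instead of silently accepting or rejecting them.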

In the full set, of the 64 matches with a score of 100, 55 are correct matches and 9 are not the slavers (they are other people with the same name).

A further 4 correct matches have scores below 100:

| Score | Name | Note |
|---|---|---|
| 92 | Simón Bolívar | the artwork title did not have accented characters |
| 88 | Hadrian | the artwork was titled "Hadrianus" |
| 88 | Robert Clayton | the artwork title includes his title "Sir" |
| 81 | Julius Caesar | the artwork title includes his praenomen (personal name) "Gaius" |

Because some subjects appear in more than one artwork, the 59 artworks matched in total depict 23 slave traders/owners.