# Cleaning the e-flux artists list

The initial `eflux_artists.txt` file contains many non-artist entities, e.g. universities, museums and galleries.<br>
These need to be filtered out so that the list contains only artists.

In [3]:
with open('data/eflux_artists.txt', 'r', encoding="utf-8") as f:
    artists = f.read().splitlines()

In [4]:
artists[90:100]

['AXA Art',
 'AZ',
 'Aalborg University',
 'Aaltje Kramer',
 'Aalto University',
 'Aalto University School of Arts',
 'Aarati Akkapeddi',
 'Aarhus University',
 'Aaron Angell',
 'Aaron Betsky']

We do this in three steps:

- 1) Filter out entities whose name contains "university", "museum", "institute", "gallery" or similar words. To find the relevant words, we use Named Entity Recognition (NER) to look for words in the list that are not person names.
- 1b) Flag entities whose name contains no word that NER recognizes as a person name.
- 2) Wikidata querying: We use the Wikidata API to check whether an entity is actually a human, i.e. whether its "instance of" property (P31) has the value "human" (Q5). This is a very powerful tool, but it is quite slow and, sadly, many artists are not present in Wikidata.

The combination of these three methods should give us a good list of artists.

## Filtering out cases with "university", "museum", "institute", "gallery", etc. in their name

In [2]:
def delete_artists_containing_string_of_list(artists, strings):
    return [artist for artist in artists if not any(word.lower() in artist.lower() for word in strings)]
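As a quick illustration, here is the filter applied to a few entries from the list above plus one made-up name (`Martha London` is hypothetical, not taken from the data). Matching is case-insensitive and works on substrings, so a ban word such as `london` also removes a person whose surname happens to be London.

```python
def delete_artists_containing_string_of_list(artists, strings):
    # Keep an entry only if none of the ban words occurs in it (case-insensitive)
    return [artist for artist in artists
            if not any(word.lower() in artist.lower() for word in strings)]

sample = ['AXA Art', 'Aalborg University', 'Aaltje Kramer', 'Aaron Angell',
          'Martha London']  # the last entry is a made-up name for illustration
kept = delete_artists_containing_string_of_list(sample, ['University', ' art', 'london'])
print(kept)
```

The leading space in `' art'` keeps names such as "Martha" from matching, which is presumably why several ban words are written with surrounding spaces.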

It is worth removing not only names containing these four words, but also names containing other words that commonly appear in non-person entries. To find them, we run through the list and count the occurrences of each word, check which of the fairly common ones do not represent a person, and add those to our "banwords" as well.<br>
But because the most frequent words would mostly be first names like "David", "Michael", "John", etc., we need to focus on words that are not names.

For that, we simply use `spaCy`'s NER to check that the word's entity type is not `PERSON`.

In [5]:
import spacy
from collections import Counter
import re

nlp = spacy.load('en_core_web_sm')
words = re.findall(r'\w+', ' '.join(artists)) #Just collect words, simple method (input: all artists as one string)

#Select only words which are not name (person) entities
words = [word.text for word in nlp(' '.join(words)) if word.ent_type_ != 'PERSON']

word_counts = Counter([word.lower() for word in words]) #Universal counter method

word_counts

Counter({'of': 299,
         'art': 204,
         'arts': 133,
         'university': 132,
         'the': 105,
         'and': 94,
         'de': 89,
         'museum': 87,
         'for': 81,
         'foundation': 77,
         'van': 63,
         'christian': 60,
         'daniel': 54,
         'jan': 50,
         'institute': 49,
         'gallery': 47,
         'd': 47,
         'contemporary': 47,
         'j': 45,
         'college': 44,
         'center': 41,
         'anna': 40,
         'anne': 40,
         'emily': 40,
         'a': 37,
         'ana': 37,
         'o': 35,
         'kate': 34,
         'school': 33,
         'design': 33,
         'el': 31,
         'new': 31,
         'national': 31,
         'l': 30,
         'der': 30,
         'le': 30,
         'david': 30,
         'eva': 30,
         'smith': 28,
         'helen': 28,
         'von': 27,
         'c': 27,
         'council': 27,
         'andreas': 26,
         'fine': 26,
         'claudia': 26,
   

In [6]:
banwords = ['University', 'Gallery', 'Museum', 'Institute'] + [' of ', ' arts', ' art','foundation', 'new ', 'college',
                'contemporary', 'design', 'national', 'academy', 'architecture', 'department','culture', 'international',
                'cultural', 'centre','galerie','european','collective', 'architects', 'association','studies', 'research',
                'society', 'school', 'studio', 'laboratory','visual', 'modern', 'office', 'kunsthalle', 'california', 
                'social', 'sculpture', 'kunstverein', 'education', 'society', 'australia', 'curatorial', 'endowment',
                'ministry', 'barcelona', 'ensemble', 'institut', 'technology', 'development', 'contemporani', 'minneapolis',
                'aarhus', 'vienna', 'sciences', 'science', 'initiative', 'artforum', 'colleges', 'investment', 'photography',
                'capital', 'metropolitan', 'kunst', 'massachusetts', 'amsterdam', 'london', 'berlin', 'paris', 'brussels',
                'glasgow','union', 'chicago','experimental', 'graduate', ]

artists_ver_1 = delete_artists_containing_string_of_list(artists, banwords)

In [7]:
len(artists),len(artists_ver_1)

(22285, 21352)
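The filter removed 933 entries (22285 - 21352). Because the ban words match substrings, it can be useful to see which word was responsible for each removal before trusting the result. The helper names below (`matched_banwords`, `audit_removed`) are not part of the original notebook; this is only a minimal sketch of such an audit, shown on a few entries from the list.

```python
def matched_banwords(name, banwords):
    """Return the ban words that occur (case-insensitively) in `name`."""
    lowered = name.lower()
    return [word for word in banwords if word.lower() in lowered]

def audit_removed(artists, banwords):
    """Map every entry the filter would drop to the ban words responsible."""
    report = {}
    for artist in artists:
        hits = matched_banwords(artist, banwords)
        if hits:
            report[artist] = hits
    return report

sample = ['Aalto University School of Arts', 'Aaron Betsky', 'AXA Art']
report = audit_removed(sample, ['University', ' of ', ' arts', ' art', 'school'])
for name, hits in report.items():
    print(name, '->', hits)
```

Entries that are removed by a single city name or a short generic word are the most likely false positives and are worth a manual look.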

Save version 1 of the cleaned list to a file.

In [15]:
with open('data/artists_cleaned_v1.txt', 'w', encoding="utf-8") as f:
    f.write('\n'.join(artists_ver_1))

Do the same with `eflux_artists_contemporary_extended.txt`.

In [8]:
with open('data/eflux_artists_contemporary_extended.txt', 'r', encoding="utf-8") as f:
    artists_contemporary = f.read().splitlines()

In [9]:
artists_contemporary_ver_1 = delete_artists_containing_string_of_list(artists_contemporary, banwords)

In [12]:
with open('data/artists_contemporary_cleaned_v1.txt', 'w', encoding="utf-8") as f:
    f.write('\n'.join(artists_contemporary_ver_1))

### 1b) Filtering out entities that do not contain a name entity
Since we already collected the non-name words that appear in our artist list, we can simply flag the entities that contain no word recognized as a person name.

(This takes a few minutes to run on a laptop.)

In [10]:
#Filtering out entities that do not contain a person entity

suspects = []
for artist in artists_ver_1:
    for word in artist.split():
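        #Skip the NER call for known non-name words to save computation time
        #(note: `words` keeps the original casing, so only tokens that occur in lowercase can match here)
        if (word.lower() not in words):
            if (nlp(word)[0].ent_type_ == 'PERSON'): #Classify the single word on its own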
                break
    else: #We only get here if the loop completes without breaking->no person entity found
        suspects.append(artist)
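Many words (common first names in particular) occur in hundreds of entries, so classifying each distinct word only once could shorten the run considerably. The sketch below follows the same loop logic but caches the result per word; `find_suspects` and its parameters are hypothetical names, and the classifier is passed in as a plain function so the idea can be tried without loading a model.

```python
def find_suspects(names, is_person_word, skip_words=frozenset()):
    """Return the names in which no word is recognised as a person name.

    `is_person_word` is any callable str -> bool; each distinct word is
    classified once and the answer is cached.
    """
    cache = {}
    suspects = []
    for name in names:
        for word in name.split():
            if word.lower() in skip_words:
                continue  # known non-name word, no need to classify it
            if word not in cache:
                cache[word] = is_person_word(word)
            if cache[word]:
                break  # found a person-like word, so the name is not a suspect
        else:
            # Only reached if the loop finished without a break
            suspects.append(name)
    return suspects

# Stand-in classifier for the demo; with spaCy one could instead pass
# lambda w: nlp(w)[0].ent_type_ == 'PERSON'
calls = []
def toy_is_person(word):
    calls.append(word)
    return word in {'Aaron', 'Angell', 'Betsky'}

demo = find_suspects(['Aaron Angell', 'Aaron Betsky', 'AZ'], toy_is_person)
print(demo, calls)
```

In the demo, "Aaron" is classified only once even though it appears in two entries.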

In [13]:
len(suspects)

9288

With this, we can be fairly (though not absolutely) sure that the non-suspect entities are indeed artists and not institutions or anything else.<br>
However, many actual artists also ended up in the "suspect" category, so we need to check further whether some of them are provably artists.

In [16]:
certain_artists = list(set(artists_ver_1) - set(suspects))
len(certain_artists)

12064

Looking through the first 100-200 entries, we do not find any that are not humans. Because of this, we keep the artist list as it is.
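A similar spot check can be drawn at random from the whole list instead of from its beginning. The sketch below is one possible way to do that reproducibly; `sample_for_review` is a hypothetical helper, and the names in the demo are taken from the list excerpt shown earlier.

```python
import random

def sample_for_review(names, k=100, seed=0):
    """Return up to k names drawn at random, reproducibly, from the sorted list."""
    ordered = sorted(names)      # a list built from a set has no stable order
    rng = random.Random(seed)    # local generator, leaves the global state untouched
    return rng.sample(ordered, min(k, len(ordered)))

demo_names = ['Aaron Angell', 'Aaron Betsky', 'Aaltje Kramer', 'Aarati Akkapeddi']
picked = sample_for_review(demo_names, k=2)
print(picked)
```

Sorting first matters because `list(set(...))` can come out in a different order on each run, which would otherwise make the sample differ even with a fixed seed.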

Nonetheless, if this turned out to be a bigger issue, we could use the Wikidata API to check whether the entities are actually humans, i.e. whether their "instance of" property (P31) has the value "human" (Q5). Furthermore, we could potentially check whether the entities are artists (occupation (P106) of "artist" (Q483501)), but this is not always what is listed as the occupation (e.g. for photographers or fashion designers).<br>
Many artists are not present in Wikidata though, so this would not be enough on its own.
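For reference, a check of this kind could look roughly like the sketch below. It uses the public Wikidata API actions `wbsearchentities` and `wbgetentities` through the standard library; the function names, the `User-Agent` string and the choice to trust the top search hit are all assumptions made for illustration, and taking the first hit can easily pick the wrong entity for common names. Only the parsing step is exercised here, on a hand-written fragment, so nothing is sent over the network unless `lookup_is_human` is called.

```python
import json
import urllib.parse
import urllib.request

API_URL = 'https://www.wikidata.org/w/api.php'

def is_human(entity):
    """True if an entity dict has an 'instance of' (P31) claim pointing to human (Q5)."""
    for claim in entity.get('claims', {}).get('P31', []):
        value = claim.get('mainsnak', {}).get('datavalue', {}).get('value', {})
        if isinstance(value, dict) and value.get('id') == 'Q5':
            return True
    return False

def _get_json(params):
    """Call the Wikidata API with the given query parameters and decode the JSON reply."""
    url = API_URL + '?' + urllib.parse.urlencode(params)
    # Hypothetical User-Agent value; replace it with something identifying your project
    request = urllib.request.Request(url, headers={'User-Agent': 'artist-list-cleaner-example/0.1'})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

def lookup_is_human(name):
    """Search for `name`; return True/False for the top hit, or None if nothing is found."""
    hits = _get_json({'action': 'wbsearchentities', 'search': name,
                      'language': 'en', 'format': 'json'}).get('search', [])
    if not hits:
        return None  # not present in Wikidata (or not found under this spelling)
    qid = hits[0]['id']  # assumption: the first search hit is the intended entity
    reply = _get_json({'action': 'wbgetentities', 'ids': qid,
                       'props': 'claims', 'format': 'json'})
    return is_human(reply['entities'][qid])

# Offline check of the parsing logic on a hand-written entity fragment
fake_entity = {'claims': {'P31': [{'mainsnak': {'datavalue': {'value': {'id': 'Q5'}}}}]}}
print(is_human(fake_entity))
```

A `None` result keeps "not found" separate from "found but not a human", which matters here because many artists are missing from Wikidata.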