# Current version: 0.2.1 (unfinished)

From 2024 onwards, updates to the dataset are handled and stored in a separate file; this is that file. Previously, all Art500k dataset processing was done in *art500k.csv*, which has since been renamed to *art500k_initial*.

In [11]:
import numpy as np
import pandas as pd

url_v_latest = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/artists.csv"
url_v_latest_art500k_artists = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/art500k_artists.csv"
artists = pd.read_csv(url_v_latest)
art500k_artists = pd.read_csv(url_v_latest_art500k_artists)

In [5]:
import helper_functions  # local module; provides combine_instances, used in the updates below

## 2024.01.07 - : Use measures to find artists with multiple names (aliases)

If we take a look at popular artists in the dataset, for example Rembrandt:

In [12]:
import numpy as np
import pandas as pd

url_v_01_09 = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_artists_0_1.csv"
art500k_artists = pd.read_csv(url_v_01_09)

In [4]:
art500k_artists[art500k_artists['artist'].str.contains("Rembrandt")]['artist'].unique()

array(['Rembrandt Peale', 'Rembrandt', 'after Rembrandt van Rijn',
       'Rembrandt Harmensz. van Rijn', 'Rembrandt van Rijn',
       'British 19th Century after Rembrandt van Rijn',
       'Richard Houston after Rembrandt van Rijn',
       'William Byron after Rembrandt van Rijn',
       'Georg Friedrich Schmidt after Rembrandt van Rijn',
       'Jonas Suyderhoff after Rembrandt van Rijn',
       'Timothy Cole after Rembrandt van Rijn',
       'Richard Earlom after Rembrandt van Rijn',
       'School of Rembrandt van Rijn', 'Rembrandt (Rembrandt van Rijn)',
       'Nicolaes Maes|School of Rembrandt van Rijn',
       'Rembrandt (Rembrandt van Rijn)|Ferdinand Bol',
       'Rembrandt (Rembrandt van Rijn)|Nicolaes Maes',
       'Rembrandt (Rembrandt van Rijn)|Andrea Mantegna|Rembrandt (Rembrandt van Rijn)',
       'Attributed to Rembrandt Peale',
       'Costantino Cumano after Rembrandt van Rijn',
       'follower of Rembrandt Harmensz. van Rijn',
       'Charles Turner after Rembrandt 

There are multiple entries for Rembrandt: *Rembrandt*, *Rembrandt van Rijn*, *Rembrandt Harmensz. van Rijn*, *Rembrandt (Rembrandt van Rijn)*, *Rembrandt Harmensz van Rijn (Dutch)*, *Rembrandt (Rembrandt van Rijn)|Rembrandt (Rembrandt van Rijn)*. We need to merge all entries that refer to the same artist into one.<br>
However, finding such duplicates is not trivial.

A second problem is handling entries such as "X after Y". I believe LLMs may be the most useful tool for these cases.

As of now, we tackle this problem by using a combination of measures to find artist aliases.

Considered measures:

* Fuzzy string matching (Levenshtein distance) between artist names. 
* Basic string containment (longer artist names containing a one-word artist name, e.g. Rembrandt).
* Token-Based Matching (TBM) (Jaccard similarity) between artist names.
* Named Entity Recognition (NER) (spaCy) to find artist names in text, then Coreference Resolution to link pronouns and other expressions to the correct entities.
* LLMs to find artist names from text.
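
To make the first and third measures concrete, here is a minimal pure-Python sketch. The function names `levenshtein` and `jaccard` are my own and are not part of `helper_functions`; a dedicated library would be faster on the full artist list, so this is only meant to show what the two scores measure.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via a two-row DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lower-cased, whitespace-separated token sets."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(levenshtein("Gustavo Dall'Ara", "Gustavo Dall'ara"))            # 1
print(jaccard("Rembrandt van Rijn", "Rembrandt Harmensz. van Rijn"))  # 0.75
```

A small edit distance catches spelling variants, while a high Jaccard score catches names that share most words but differ by an inserted middle name.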


Other considerations: <br>
* Phonetic matching: This could be helpful when an artist's name is spelled differently in different languages, e.g. "Č" (Czech) / "Ch" (English) / "cs" (Hungarian). Even if this is the case for some instances, we should find these with the Levenshtein distance search. <br>
* Online available resources for aliases, web scraping, etc.
* Custom rules (e.g. "... and his workshop", "... and his circle", etc.)

NER:

In [10]:
import spacy

#Example
data = {
    'author_name': ['Rembrandt', 'Rembrandt van Rijn', 'Rembrandt Peale', 'Michelangelo', 'Michelangelo Buonarroti', 'Michelangelo Merisi da Caravaggio', 'Caravaggio', 'Caravaggio, Michelangelo Merisi da', 'Caravaggio, Michelangelo Merisi da (Italian, Milan or Caravaggio 1571-1610 Porto Ercole)', 'Leonardo', 'Leonardo da Vinci'],
}
df = pd.DataFrame(data)

nlp = spacy.load("en_core_web_sm") #English only
aliases = {}

for name in df['author_name']:
    doc = nlp(name)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            aliases.setdefault(name, set()).add(ent.text)
            aliases.setdefault(ent.text, set()).add(name)

aliases = {key: list(value) for key, value in aliases.items()}
aliases

{'Rembrandt van Rijn': ['Rembrandt van Rijn'],
 'Rembrandt Peale': ['Rembrandt Peale'],
 'Michelangelo': ['Michelangelo'],
 'Michelangelo Buonarroti': ['Michelangelo Buonarroti'],
 'Michelangelo Merisi da Caravaggio': ['Michelangelo Merisi da'],
 'Michelangelo Merisi da': ['Michelangelo Merisi da Caravaggio',
  'Caravaggio, Michelangelo Merisi da (Italian, Milan or Caravaggio 1571-1610 Porto Ercole)'],
 'Caravaggio, Michelangelo Merisi da': ['Michelangelo Merisi', 'Caravaggio'],
 'Caravaggio': ['Caravaggio, Michelangelo Merisi da',
  'Caravaggio, Michelangelo Merisi da (Italian, Milan or Caravaggio 1571-1610 Porto Ercole)'],
 'Michelangelo Merisi': ['Caravaggio, Michelangelo Merisi da'],
 'Caravaggio, Michelangelo Merisi da (Italian, Milan or Caravaggio 1571-1610 Porto Ercole)': ['Michelangelo Merisi da',
  'Caravaggio'],
 'Leonardo': ['Leonardo'],
 'Leonardo da Vinci': ['Leonardo da Vinci']}

This seems to leave out many 1-word-alias cases, but it is a start.
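
One possible way to recover the one-word cases that NER misses is the basic containment check from the list of measures. The sketch below is an assumption of how that could look (the helper name `one_word_alias_candidates` is hypothetical), run on part of the example list from above. It only proposes candidates: "Rembrandt Peale" is a false positive for "Rembrandt", so the result still needs review.

```python
import re


def one_word_alias_candidates(names):
    """Map each single-word name to the longer names containing it as a whole word."""
    candidates = {}
    for short in names:
        if len(short.split()) != 1:
            continue
        # \b keeps "Leonardo" from matching inside a longer word
        pattern = re.compile(r"\b" + re.escape(short) + r"\b")
        hits = [name for name in names if name != short and pattern.search(name)]
        if hits:
            candidates[short] = hits
    return candidates


names = ['Rembrandt', 'Rembrandt van Rijn', 'Rembrandt Peale', 'Michelangelo',
         'Michelangelo Buonarroti', 'Michelangelo Merisi da Caravaggio',
         'Caravaggio', 'Leonardo', 'Leonardo da Vinci']
candidates = one_word_alias_candidates(names)
print(candidates)
```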

In [22]:
import spacy

nlp = spacy.load("en_core_web_sm") #English only
aliases = {}

for name in art500k_artists[art500k_artists['artist'].str.contains("Rembrandt")]['artist'].unique():
    doc = nlp(name)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            aliases.setdefault(name, set()).add(ent.text)
            aliases.setdefault(ent.text, set()).add(name)

aliases = {key: list(value) for key, value in aliases.items()}
aliases

{'Rembrandt Peale': ['Rembrandt Peale'],
 'after Rembrandt van Rijn': ['Rembrandt van Rijn'],
 'Rembrandt van Rijn': ['William Luson Thomas|Sir John Gilbert|Rembrandt (Rembrandt van Rijn)',
  'Jonas Suyderhoff after Rembrandt van Rijn',
  'Timothy Cole after Rembrandt van Rijn',
  'after Rembrandt van Rijn',
  'Richard Houston after Rembrandt van Rijn',
  'Charles Turner after Rembrandt van Rijn',
  'Georg Friedrich Schmidt after Rembrandt van Rijn',
  'Costantino Cumano after Rembrandt van Rijn',
  'Richard Earlom after Rembrandt van Rijn',
  'Captain William E. Baillie|Rembrandt (Rembrandt van Rijn)',
  'Jean Pierre de Frey|Rembrandt (Rembrandt van Rijn)',
  'Jan Georg (Joris) van Vliet|Rembrandt (Rembrandt van Rijn)',
  'Rembrandt van Rijn',
  'Rembrandt (Rembrandt van Rijn)|Rembrandt (Rembrandt van Rijn)',
  'Rembrandt (Rembrandt van Rijn)|Andrea Mantegna|Rembrandt (Rembrandt van Rijn)',
  'William Byron after Rembrandt van Rijn',
  'British 19th Century after Rembrandt van Rijn'],

The "after", "attributed to", "|", "follower of" cause big problems. We need to find a way to deal with these first.
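
A rule-based preprocessing step could look roughly like the sketch below. It is an assumption, not the method used in this project: the qualifier list only covers patterns visible in the outputs above, and `parse_artist_field` is a hypothetical helper name. Each raw field is split on "|", exact duplicates are dropped, and any qualifier is separated from the referenced artist.

```python
import re

# Qualifiers seen in the Rembrandt outputs above; the real data likely needs more
QUALIFIER = re.compile(
    r"^(?:(?P<maker>.+?)\s+)?"
    r"(?P<relation>after|attributed to|follower of|school of)\s+"
    r"(?P<name>.+)$",
    re.IGNORECASE,
)


def parse_artist_field(field):
    """Split a raw artist field on '|' and separate qualifiers from names."""
    parsed, seen = [], set()
    for part in field.split("|"):
        part = part.strip()
        if not part or part in seen:  # skip empty pieces and exact repeats
            continue
        seen.add(part)
        match = QUALIFIER.match(part)
        if match:
            parsed.append({"name": match.group("name"),
                           "relation": match.group("relation").lower(),
                           "maker": match.group("maker")})
        else:
            parsed.append({"name": part, "relation": None, "maker": None})
    return parsed


print(parse_artist_field("Richard Houston after Rembrandt van Rijn"))
print(parse_artist_field("Nicolaes Maes|School of Rembrandt van Rijn"))
print(parse_artist_field("Albrecht Dürer|Albrecht Dürer"))
```

Entries with a non-empty `relation` are not by the named artist, so they should probably be kept out of alias merging. Nested cases such as "After the School of ..." are not handled by this sketch.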

In [40]:
multiple_names = []
for name in art500k_artists['artist'].unique():
    if "|" in name:
        multiple_names.append(name)

multiple_names

['Albrecht Dürer|Albrecht Dürer',
 'Nina de Garis Davies|Norman de Garis Davies',
 'Vienna|Du Paquier period',
 'Du Paquier period|Vienna',
 'Naotane Taikei|Yukinaka|Honjo Yoshitane',
 'John Milton|David Bogue|Myles Birket Foster|John Milton',
 'Henry N. Hooper and Company|O.P. Drake',
 'Nakagawa Isshō|Sadakuni',
 'Yasumitsu|Iwamoto Konkan',
 'Sukemitsu of Bizen|Yasumitsu|Iwamoto Konkan',
 'Fusamune of Soshu|Sōheishi Sōtensai',
 'Thomas Tompion|Nathanial Delander',
 'Henry Atkin|James Purdey the Elder|Abbe Robins|Peter Gumbrell',
 'Toshimasa|Kajikawa|Kuniyoshi',
 'Jean Henri Riesener|Jean-Gotfritt Mercklein',
 'Hanabusa Itchō|Kajikawa',
 "Johann Schott|Hans Wechtlin|Ulrich Pinder|Unidentified Weaver's Mark|Unidentified Weaver's Mark",
 'Antoine Watteau|François Boucher',
 'Brewster & Co.|Herman Stahmer',
 'Johann Georg Christoph Fries|Angelo Quaglio',
 'Ruth Whittier Shute|Samuel Addison Shute',
 'Brewster & Co.|Paul Rodissart',
 'Brewster & Co.|Channing Britton',
 'Brice Thomas|Franço

In [15]:
from transformers import pipeline

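# Initialize the pipeline
generator = pipeline('text-generation', model='gpt2')
string = "This painting was painted after Leonardo by Rafael. The painting was painted by"
# Generate text. Note: max_length counts tokens, not characters, so len(string)+1
# allows far more than one new token (hence the long continuation in the output)
text = generator(string, max_length=len(string)+1)[0]['generated_text']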

print(text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This painting was painted after Leonardo by Rafael. The painting was painted by an Italian painter, who came from Milan, and his painting was the result of a long association with the Raffaele and with the Artemesons. A group of the artists worked together and made the painting. From the start of the 12th century, the artist wrote, he was not afraid of the artists of


In [35]:
from transformers import GPT2Tokenizer, pipeline

# Initialize the tokenizer and the pipeline
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
generator = pipeline('text-generation', model='gpt2')

pairs = []
for name in art500k_artists[art500k_artists['artist'].str.contains("Rembrandt")]['artist'].unique():
    string = "The description of this painting's author is: " + name + ". The yes-or-no answer to the question 'Is the painter of this painting known or unknown?' is:"
    # max_length is measured in tokens (prompt included), so prompt length + 1 allows one new token
    tokens = tokenizer.encode(string, return_tensors='pt')
    text = generator(string, max_length=tokens.shape[1]+1)[0]['generated_text']
    # Keep only the continuation that follows the prompt
    answer = text.split(string)[1]
    pairs.append([name, answer])

pairs

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

[['Rembrandt Peale', ' Yes'],
 ['Rembrandt', ' No'],
 ['after Rembrandt van Rijn', ' of'],
 ['Rembrandt Harmensz. van Rijn', ' Not'],
 ['Rembrandt van Rijn', ' Yes'],
 ['British 19th Century after Rembrandt van Rijn', ' No'],
 ['Richard Houston after Rembrandt van Rijn', " '"],
 ['William Byron after Rembrandt van Rijn', " '"],
 ['Georg Friedrich Schmidt after Rembrandt van Rijn', " '"],
 ['Jonas Suyderhoff after Rembrandt van Rijn', ' Is'],
 ['Timothy Cole after Rembrandt van Rijn', ' The'],
 ['Richard Earlom after Rembrandt van Rijn', ' The'],
 ['School of Rembrandt van Rijn', ' He'],
 ['Rembrandt (Rembrandt van Rijn)', ' a'],
 ['Nicolaes Maes|School of Rembrandt van Rijn', '\n'],
 ['Rembrandt (Rembrandt van Rijn)|Ferdinand Bol', ' An'],
 ['Rembrandt (Rembrandt van Rijn)|Nicolaes Maes', ' Yes'],
 ['Rembrandt (Rembrandt van Rijn)|Andrea Mantegna|Rembrandt (Rembrandt van Rijn)',
  " '"],
 ['Attributed to Rembrandt Peale', ' "'],
 ['Costantino Cumano after Rembrandt van Rijn', ' he'],
 

In [7]:
import difflib

# Function to calculate similarity between two strings
def similarity(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()


cases = art500k_artists[art500k_artists['artist'].str.contains("Rembrandt")]['artist'].unique()
# Greedy clustering: each case joins the first cluster whose center is similar enough,
# otherwise it starts a new cluster with itself as the center
clusters = {}
for case in cases:
    assigned_cluster = False
    for cluster_center, cluster_cases in clusters.items():
        if similarity(case, cluster_center) > 0.6:
            cluster_cases.append(case)
            assigned_cluster = True
            break
    if not assigned_cluster:
        clusters[case] = [case]

# Print the clusters
for cluster_center, cluster_cases in clusters.items():
    print(f"Cluster center: {cluster_center}")
    print(f"Cases in cluster: {', '.join(cluster_cases)}")
    print()


Cluster center: Rembrandt Peale
Cases in cluster: Rembrandt Peale, Rembrandt, Rembrandt van Rijn, Attributed to Rembrandt Peale

Cluster center: after Rembrandt van Rijn
Cases in cluster: after Rembrandt van Rijn, Rembrandt Harmensz. van Rijn, British 19th Century after Rembrandt van Rijn, Richard Houston after Rembrandt van Rijn, William Byron after Rembrandt van Rijn, Georg Friedrich Schmidt after Rembrandt van Rijn, Jonas Suyderhoff after Rembrandt van Rijn, Timothy Cole after Rembrandt van Rijn, Richard Earlom after Rembrandt van Rijn, School of Rembrandt van Rijn, Rembrandt (Rembrandt van Rijn), Nicolaes Maes|School of Rembrandt van Rijn, Rembrandt (Rembrandt van Rijn)|Ferdinand Bol, Rembrandt (Rembrandt van Rijn)|Nicolaes Maes, Costantino Cumano after Rembrandt van Rijn, follower of Rembrandt Harmensz. van Rijn, Charles Turner after Rembrandt van Rijn, Rembrandt Harmensz van Rijn (Dutch, After the School of Rembrandt Harmenszoon van Rijn, school of Rembrandt Harmensz. van Rijn, F

This is quite inconvenient and not always accurate (see: ['Follower of Rembrandt van Rijn', ' Yes']).
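
Part of the inaccuracy in the `difflib` clustering comes from assigning each name to the first cluster above the threshold, which is why "Rembrandt van Rijn" ends up under "Rembrandt Peale". A possible variant, sketched below under the same 0.6 threshold, assigns each name to the most similar existing center instead. The function name `cluster_best_match` is my own, and this is an untested idea for the full dataset, not a replacement that has been validated.

```python
import difflib


def similarity(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()


def cluster_best_match(cases, threshold=0.6):
    """Assign each case to its most similar cluster center, or start a new cluster."""
    clusters = {}
    for case in cases:
        best_center, best_score = None, threshold
        for center in clusters:
            score = similarity(case, center)
            if score > best_score:  # strictly better than the threshold and previous best
                best_center, best_score = center, score
        if best_center is None:
            clusters[case] = [case]
        else:
            clusters[best_center].append(case)
    return clusters


example = ["Rembrandt Peale", "after Rembrandt van Rijn",
           "Rembrandt van Rijn", "Attributed to Rembrandt Peale"]
clusters = cluster_best_match(example)
for center, members in clusters.items():
    print(center, "->", members)
```

The result still depends on the order of the names, since centers are whichever names happen to come first; stripping qualifiers such as "after" before clustering would likely help more than tuning the threshold.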

In [6]:
pd.DataFrame(art500k_artists['artist']).to_csv('art500k_artists.txt', sep=";" , index=False)
#Rembrandt
pd.DataFrame(art500k_artists[art500k_artists['artist'].str.contains("Rembrandt")]['artist'].unique()).to_csv('rembrandt.txt', sep=";" , index=False)

### Current step: Using ChatGPT to find artist aliases and filter unknowns

In [21]:
artists_for_GPT = art500k_artists[1507:]['artist'].reset_index(drop=True)
# Write the remaining artist names in chunks of 150, one file per chunk
for i in range(0, len(artists_for_GPT), 150):
    # Integer division keeps the chunk index an int (i/150 would give "0.0", "1.0", ...)
    string = "datasets/generated/artists_for_GPT_" + str(i // 150) + ".txt"
    # header=False omits the column-name line, so the first line does not need removing afterwards
    pd.DataFrame(artists_for_GPT[i:i+150]).to_csv(string, sep=";", index=False, header=False)

## Update 2024.01.12-13

A minor change: merge two duplicate artist entries to test the instance combination method, and remove quotation marks (") from artist names.

In [28]:
art500k_artists[art500k_artists['artist'].str.contains("Gustavo Dall")]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount
1314,Gustavo Dall'Ara,,,,,,,,,1875.0,1923.0,,,,,
36434,Gustavo Dall'ara,,,,,,,,,1910.0,1913.0,,,,,


In [29]:
art500k_modified = helper_functions.combine_instances(df=art500k_artists, primary_artist_name="Gustavo Dall'Ara", secondary_artist_name="Gustavo Dall'ara")
art500k_modified[art500k_modified['artist'].str.contains("Gustavo Dall")]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['Nationality'][0] = df2['Nationality'][0]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1[column][0] = column2_val

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount
40296,Gustavo Dall'Ara,,,,,,,,,1875.0,1923.0,,,,,
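
The `SettingWithCopyWarning` messages point at chained assignments of the form `df1[column][0] = ...`. The body of `helper_functions.combine_instances` is not shown here, so the following is only a sketch of how such an assignment can be written with a single `.loc` call; the toy data and variable names are made up for illustration.

```python
import pandas as pd

# Toy data (made up): two spellings of one artist
df = pd.DataFrame({
    "artist": ["Artist A", "Artist a"],
    "Nationality": [None, "Dutch"],
    "FirstYear": [1875.0, 1910.0],
})

# .copy() makes both one-row frames independent of df
primary = df[df["artist"] == "Artist A"].reset_index(drop=True).copy()
secondary = df[df["artist"] == "Artist a"].reset_index(drop=True).copy()

for column in primary.columns:
    # One .loc call with (row, column) replaces the chained primary[column][0] = ...
    if pd.isna(primary.loc[0, column]):
        primary.loc[0, column] = secondary.loc[0, column]

print(primary)
```

Here the missing `Nationality` is filled from the secondary row, while `FirstYear` keeps the primary value. The real method evidently does more (for example, the row index changes in the output above), so this only illustrates the indexing pattern.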


In [30]:
art500k_artists = art500k_modified
art500k_modified['artist'].to_csv('art500k_artists.txt', sep=";" , index=False)
art500k_modified.to_csv('datasets/saves/art500k_artists_0_1.csv', index=False)

In [32]:
art500k_artists[art500k_artists['artist'].str.contains('"')]['artist']


1836                                "CHRIS ""DAZE"" ELLIS"
8069                         "Alejandro ""Mono"" González"
8222                                        "Nemi ""UHU"""
8902                            """Rafael Lozano-Hemmer"""
12257          "Giovanni Battista Trotti (""Il Malosso"")"
12567    "Richard Cosway|Mary ""Perdita"" Robinson|Will...
12626         "Giovanni Battista Discepoli (""Il Zoppo"")"
12722           "Giovanni Battista Crespi (""Il Cerano"")"
12742    "Bernardino Rodriguez (""Bernardino Siciliano"")"
13053    "Francesco Monti (""Il Brescianino delle Batta...
13069                             "John ""Warwick"" Smith"
13449         "Giorgio di Giovanni (""Giorgio da Siena"")"
13855              "Giovanni Battista (""Titta"") Lusieri"
13881                   "Giovanni Balducci (""Il Cosci"")"
16443        "Hanna Lachert; ""Ład"" Artists’ Cooperative"
20616                                             """TC"""
21332         "Michelangelo Cerruti (""Il Candelottaro""

In [33]:
art500k_artists['artist'] = art500k_artists['artist'].str.replace('"', '')
art500k_artists['artist'].str.contains('"').sum()

0

In [34]:
art500k_artists['artist'].to_csv('art500k_artists.txt', sep=";" , index=False)
art500k_artists.to_csv('datasets/saves/art500k_artists_0_1.csv', index=False)

In [3]:
art500k_artists[art500k_artists['artist'].str.contains("Marc Bohan")]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount
1507,Marc Bohan,,,,,,,,,1969.0,1969.0,"Paris, France",,,,"{Paris:2},{France:3}"
24700,Marc Bohan for Christian Dior SE,,,,,,,,,,,"Paris, France",,,,"{Paris:4},{France:4}"


In [6]:
art500k_modified = helper_functions.combine_instances(df=art500k_artists, primary_artist_name="Marc Bohan", secondary_artist_name="Marc Bohan for Christian Dior SE")
art500k_modified[art500k_modified['artist'].str.contains("Marc Bohan")]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount
40295,Marc Bohan,,,,,,,,,1969.0,1969.0,"Paris, France",,,,"{Paris:6},{France:7}"
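
In the merged row, the `PlacesCount` values of the two entries are summed per place: `{Paris:2},{France:3}` and `{Paris:4},{France:4}` become `{Paris:6},{France:7}`. How `combine_instances` does this internally is not shown here; the sketch below is one assumed way to reproduce the arithmetic, with hypothetical helper names.

```python
import re
from collections import Counter

# Matches one "{key:count}" entry of a StylesCount/PlacesCount-style string
COUNT_ENTRY = re.compile(r"\{([^:{}]+):(\d+)\}")


def parse_counts(text):
    """Turn '{Paris:2},{France:3}' into Counter({'France': 3, 'Paris': 2})."""
    counts = Counter()
    for key, value in COUNT_ENTRY.findall(text):
        counts[key] += int(value)
    return counts


def merge_counts(primary, secondary):
    """Sum two count strings key by key and format the result in the same style."""
    total = parse_counts(primary) + parse_counts(secondary)
    return ",".join(f"{{{key}:{value}}}" for key, value in total.items())


print(merge_counts("{Paris:2},{France:3}", "{Paris:4},{France:4}"))
# {Paris:6},{France:7}
```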


In [8]:
art500k_modified.to_csv('datasets/saves/art500k_artists_0_1.csv', index=False)
art500k_modified['artist'].to_csv('art500k_artists.txt', sep=";" , index=False)
art500k_modified.to_csv('datasets/art500k_artists.csv', index=False)

## Update 2024.01.11

One minor change: remove the double "," in StylesYears and PlacesYears.

In [None]:
# Remove double commas
art500k_artists_copy = art500k_artists.copy()
dict_like_columns = ['ArtMovement', 'StylesCount', 'PlacesCount']
years_columns = ['FirstYear', 'LastYear', 'PlacesYears', 'StylesYears']

for index, row in art500k_artists.iterrows():
    for column in dict_like_columns + years_columns:
        column_value = row[column]
        if not isinstance(column_value, str):  # NaN or numeric (e.g. FirstYear): nothing to clean
            continue
        # Splitting on "," and dropping empty pieces collapses ",," into ","
        values = [x for x in column_value.split(',') if x != '']
        art500k_artists_copy.at[index, column] = ",".join(values)
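
For a single string, the split-and-rejoin logic in the loop above should be equivalent to collapsing runs of commas with a regular expression and trimming the ends. This is a sketch with a hypothetical helper name, and the sample strings are made up:

```python
import re


def collapse_commas(value):
    """Collapse runs of commas into one and drop leading/trailing commas."""
    return re.sub(r",+", ",", value).strip(",")


def split_and_join(value):
    """The approach used in the loop: drop empty pieces after splitting on ','."""
    return ",".join(x for x in value.split(",") if x != "")


samples = ["{Paris:2},,{France:3}", ",a,,b,", "a,b", ""]
for sample in samples:
    print(repr(sample), "->", repr(collapse_commas(sample)))
```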

In [184]:
art500k_artists_copy.to_csv("datasets/saves/art500k_artists_0_1.csv", index=False)