# Current version: 0.5

From 2024, updates to the dataset are handled and stored in a separate file. This is that file (previously, all Art500k dataset processing was done in *art500k.csv*, the file now renamed to *art500k_initial*).

To-do steps:

- Group artists together with their aliases (e.g. "Rembrandt" and "Rembrandt van Rijn" are two different instances in the dataset)
- Filter out artists that are not painters (e.g. photographers, sculptors, engravers, etc.)

In [2]:
import numpy as np
import pandas as pd
import helper_functions #From helper_functions.py

url_v_latest = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/artists.csv"
url_v_latest_art500k_artists = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/art500k_artists.csv"
artists = pd.read_csv(url_v_latest)
art500k_artists = pd.read_csv(url_v_latest_art500k_artists, dtype={'Type': str})

#### If one wants to store the files locally:

<details><summary><u>Save the files locally</u></summary>

```python

art500k_artists.to_csv('datasets/saves/art500k_artists_0_5.csv', index=False)
art500k_artists.to_csv('datasets/art500k_artists.csv', index=False)

```
</details>

## TODO - Get other Wikidata IDs, and combine instances based on them

## TODO (from 01.07-) : Use measures to find artists with multiple names (aliases)

This section is not fully up to date. Some of it was run on previous versions of the cleaned Art500k dataset, and the more relevant file is now `art500k_further_selected.csv`.

If we take a look at popular artists in the dataset, for example Rembrandt:

In [4]:
import numpy as np
import pandas as pd

url_v_01_11 = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_artists_0_4.csv"
art500k_artists = pd.read_csv(url_v_01_11, dtype={'Type': str})

In [5]:
art500k_artists[art500k_artists['artist'].str.contains("Rembrandt")]['artist'].unique()

array(['Rembrandt Peale', 'Rembrandt', 'after Rembrandt van Rijn',
       'Rembrandt Harmensz. van Rijn', 'Rembrandt van Rijn',
       'British 19th Century after Rembrandt van Rijn',
       'Richard Houston after Rembrandt van Rijn',
       'William Byron after Rembrandt van Rijn',
       'Georg Friedrich Schmidt after Rembrandt van Rijn',
       'Jonas Suyderhoff after Rembrandt van Rijn',
       'Timothy Cole after Rembrandt van Rijn',
       'Richard Earlom after Rembrandt van Rijn',
       'Govaert FLINCK (Discípulo de Rembrandt)'], dtype=object)

There are multiple entries for Rembrandt: *Rembrandt*, *Rembrandt van Rijn*, *Rembrandt Harmensz. van Rijn*, *Rembrandt (Rembrandt van Rijn)*, *Rembrandt Harmensz van Rijn (Dutch)*, *Rembrandt (Rembrandt van Rijn)|Rembrandt (Rembrandt van Rijn)*. When one artist has more than one entry, the entries need to be combined.<br>
However, finding such entries is generally not trivial.

The other problem is processing other instances such as "X after Y". I believe for these cases, LLMs may be the most useful.

As of now, this problem is tackled by using a combination of measures to find artist aliases.

Considered measures:

* Finding a proper word embedding model to find artist aliases / fine-tuning an LLM for this purpose.
* Finding aliases through Wikidata (works on painters such as Rembrandt, but not on all artists).
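The Wikidata lookup could be sketched as follows. This is a minimal sketch, not the code used for this dataset: it assumes the artist's Wikidata ID is already known and uses the public `wbgetentities` API action. The function names, the `User-Agent` string and the sample response are hypothetical; the sample is hand-written in the shape of an API response so the parsing can be tried offline.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def extract_aliases(response_json, qid, language="en"):
    """Collect the label and aliases of one entity from a wbgetentities response."""
    entity = response_json.get("entities", {}).get(qid, {})
    names = []
    label = entity.get("labels", {}).get(language)
    if label:
        names.append(label["value"])
    for alias in entity.get("aliases", {}).get(language, []):
        names.append(alias["value"])
    return names

def fetch_aliases(qid, language="en"):
    """Query Wikidata for the label and aliases of `qid` (needs network access)."""
    query = urllib.parse.urlencode({
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|aliases",
        "languages": language,
        "format": "json",
    })
    # Hypothetical User-Agent; replace it with one describing your own script
    request = urllib.request.Request(
        f"{WIKIDATA_API}?{query}",
        headers={"User-Agent": "art500k-alias-sketch/0.1 (example script)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return extract_aliases(payload, qid, language)

# Offline example: a hand-written response of the same shape, not real Wikidata output
sample = {"entities": {"Q5598": {
    "labels": {"en": {"language": "en", "value": "Rembrandt"}},
    "aliases": {"en": [{"language": "en", "value": "Rembrandt van Rijn"}]},
}}}
print(extract_aliases(sample, "Q5598"))
```

The aliases returned this way would still need to be matched against the names in the dataset, so this only helps for artists that already have a Wikidata ID.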

Other measures already implemented somewhat:

* String matching (Levenshtein distance, etc.) between artist names. 
* Basic string containment (other artists names containing one word artist names, e.g. Rembrandt).
* Named Entity Recognition (NER) with spaCy to find artist names in text, then applying Coreference Resolution (?) to link pronouns and other expressions to the correct entities.
* Custom rules (e.g. "... and his workshop", "... and his circle", etc.)
* Previously, using an LLM (*GPT-2*) to find artist names from text.
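To make the first two measures concrete, here is a minimal sketch of an edit-distance function and a one-word containment check. Both helpers are written for illustration only (the names `levenshtein` and `contains_single_word_name` are hypothetical and not taken from *helper_functions.py*).

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def contains_single_word_name(short_name, long_name):
    """True if a one-word name appears as a whole word in a longer name."""
    return " " not in short_name and short_name in long_name.split()

print(levenshtein("Jozef Mehoffer", "J\u00f3zef Mehoffer"))
print(contains_single_word_name("Rembrandt", "Rembrandt van Rijn"))
print(contains_single_word_name("Rembrandt", "Rembrandt Peale"))
```

The last call also returns `True`, which shows why containment alone cannot separate Rembrandt van Rijn from Rembrandt Peale.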

Some other considerations + opinions: <br>
* Creating a graph, adding edges between artist names that are similar, then only checking connected components, and hubs inside them.
* Phonetic matching: This could be helpful when an artist's name is spelled differently in different languages, e.g. "Č" (Czech) / "Ch" (English) / "cs" (Hungarian). Skipped, because even if this is the case for some instances, the Levenshtein distance search should find them. <br>
* Online resources for aliases, web scraping, etc.: I did not find any, except the already mentioned Wikidata, which isn't always flexible.
* Token-based matching/Jaccard similarity between artist names.
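The graph idea and the token-based similarity can be combined in a small sketch: names are linked when their token sets overlap enough, and the connected components are then read off with a union-find structure. The threshold of 0.5 and the function names are assumptions chosen for illustration.

```python
from itertools import combinations

def jaccard(name_a, name_b):
    """Jaccard similarity between the lowercase token sets of two names."""
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def connected_components(names, threshold=0.5):
    """Group names that are linked, directly or indirectly, by similarity >= threshold."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the root, shortening the path on the way
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if jaccard(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

names = ["Rembrandt van Rijn", "Rembrandt Harmensz. van Rijn",
         "Rembrandt Peale", "Leonardo da Vinci"]
for group in connected_components(names):
    print(group)
```

Only the candidates inside each component would then need a closer (manual or rule-based) check.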


### Some measures with examples:

NER example:

In [6]:
import spacy

#Example
data = {
    'author_name': ['Rembrandt', 'Rembrandt van Rijn', 'Rembrandt Peale', 'Michelangelo', 'Michelangelo Buonarroti', 'Michelangelo Merisi da Caravaggio', 'Caravaggio', 'Caravaggio, Michelangelo Merisi da', 'Caravaggio, Michelangelo Merisi da (Italian, Milan or Caravaggio 1571-1610 Porto Ercole)', 'Leonardo', 'Leonardo da Vinci'],
}
df = pd.DataFrame(data)

nlp = spacy.load("en_core_web_sm") #English only
aliases = {}

for name in df['author_name']:
    doc = nlp(name)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            aliases.setdefault(name, set()).add(ent.text)
            aliases.setdefault(ent.text, set()).add(name)

aliases = {key: list(value) for key, value in aliases.items()}
aliases

{'Rembrandt van Rijn': ['Rembrandt van Rijn'],
 'Rembrandt Peale': ['Rembrandt Peale'],
 'Michelangelo': ['Michelangelo'],
 'Michelangelo Buonarroti': ['Michelangelo Buonarroti'],
 'Michelangelo Merisi da Caravaggio': ['Michelangelo Merisi da'],
 'Michelangelo Merisi da': ['Michelangelo Merisi da Caravaggio',
  'Caravaggio, Michelangelo Merisi da (Italian, Milan or Caravaggio 1571-1610 Porto Ercole)'],
 'Caravaggio, Michelangelo Merisi da': ['Caravaggio', 'Michelangelo Merisi'],
 'Caravaggio': ['Caravaggio, Michelangelo Merisi da',
  'Caravaggio, Michelangelo Merisi da (Italian, Milan or Caravaggio 1571-1610 Porto Ercole)'],
 'Michelangelo Merisi': ['Caravaggio, Michelangelo Merisi da'],
 'Caravaggio, Michelangelo Merisi da (Italian, Milan or Caravaggio 1571-1610 Porto Ercole)': ['Caravaggio',
  'Michelangelo Merisi da'],
 'Leonardo': ['Leonardo'],
 'Leonardo da Vinci': ['Leonardo da Vinci']}

This seems to leave out many 1-word-alias cases and Caravaggio was put into two different instances. Let's see how it works in our case, only for names containing Rembrandt:

In [7]:
import spacy

nlp = spacy.load("en_core_web_sm") #English only
aliases = {}

for name in art500k_artists[art500k_artists['artist'].str.contains("Rembrandt")]['artist'].unique():
    doc = nlp(name)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            aliases.setdefault(name, set()).add(ent.text)
            aliases.setdefault(ent.text, set()).add(name)

aliases = {key: list(value) for key, value in aliases.items()}
aliases

{'Rembrandt Peale': ['Rembrandt Peale'],
 'after Rembrandt van Rijn': ['Rembrandt van Rijn'],
 'Rembrandt van Rijn': ['Timothy Cole after Rembrandt van Rijn',
  'Rembrandt van Rijn',
  'Richard Earlom after Rembrandt van Rijn',
  'Jonas Suyderhoff after Rembrandt van Rijn',
  'Georg Friedrich Schmidt after Rembrandt van Rijn',
  'British 19th Century after Rembrandt van Rijn',
  'Richard Houston after Rembrandt van Rijn',
  'William Byron after Rembrandt van Rijn',
  'after Rembrandt van Rijn'],
 'Rembrandt Harmensz. van Rijn': ['Rembrandt Harmensz', 'van Rijn'],
 'Rembrandt Harmensz': ['Rembrandt Harmensz. van Rijn'],
 'van Rijn': ['Rembrandt Harmensz. van Rijn'],
 'British 19th Century after Rembrandt van Rijn': ['Rembrandt van Rijn'],
 'Richard Houston after Rembrandt van Rijn': ['Richard Houston',
  'Rembrandt van Rijn'],
 'Richard Houston': ['Richard Houston after Rembrandt van Rijn'],
 'William Byron after Rembrandt van Rijn': ['William Byron',
  'Rembrandt van Rijn'],
 ... (output truncated)

Seems quite messy. The "after", "attributed to", "follower of" cause big problems. We need to find a way to deal with these first.
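One possible way to deal with these qualifiers first is a rule-based split before any NER or similarity step. The following is a rough sketch under the assumption that the qualifier separates the maker from the referenced artist; the pattern, the qualifier list and the function name are hypothetical and would need extending for the real data.

```python
import re

# Optional maker, then a qualifier, then the referenced artist
QUALIFIER_PATTERN = re.compile(
    r"^(?:(?P<maker>.+?)\s+)?"
    r"(?P<qualifier>after|attributed to|follower of)\s+"
    r"(?P<referenced>.+)$",
    flags=re.IGNORECASE,
)

def split_attribution(name):
    """Split 'X after Y' style names into maker, qualifier and referenced artist."""
    cleaned = name.strip()
    match = QUALIFIER_PATTERN.match(cleaned)
    if match is None:
        # No qualifier found: treat the whole string as the maker
        return {"maker": cleaned, "qualifier": None, "referenced": None}
    return {
        "maker": match.group("maker"),
        "qualifier": match.group("qualifier").lower(),
        "referenced": match.group("referenced"),
    }

for example in ["Richard Houston after Rembrandt van Rijn",
                "after Rembrandt van Rijn",
                "Rembrandt Peale"]:
    print(split_attribution(example))
```

For "after Rembrandt van Rijn" the maker is `None`, which marks the work as made by an unknown artist rather than by Rembrandt himself.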

Update: The "|" case is already handled.

In [10]:
from transformers import pipeline

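# Initialize the pipeline
generator = pipeline('text-generation', model='gpt2')
string = "This painting was painted after Leonardo by Rafael. The painting was painted by"
# Generate text. Note: max_length is counted in tokens, but len(string) counts characters,
# so this allows a long continuation rather than a single extra token
text = generator(string, max_length=len(string)+1)[0]['generated_text']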

print(text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This painting was painted after Leonardo by Rafael. The painting was painted by Leonardo to commemorate the 20-year anniversary of its painting 'Videar' which was made in Paris by E.P. D. Dior. This work, originally composed in the 90's, took 30 years to get complete. In 1998 it was taken back to Leonardo and painted again for an ongoing masterpiece (V


Well... even if we fine-tune it or use RAG for our problem, it seems like it will already make mistakes in cases like this (the answer is Rafael, not Leonardo).

Let's see if it can tell if the artist is known or not

In [11]:
from transformers import GPT2Tokenizer, pipeline

# Initialize the tokenizer and the pipeline
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
generator = pipeline('text-generation', model='gpt2')

pairs = []
for name in art500k_artists[art500k_artists['artist'].str.contains("Rembrandt")]['artist'].unique():
    string = "The description of this painting's author is: " + name + ". The yes-or-no answer to the question 'Is the painter of this painting known or unknown?' is:"
    tokens = tokenizer.encode(string, return_tensors='pt')
    text = generator(string, max_length=tokens.shape[1]+1)[0]['generated_text']
    answer = text.split(string)[1]
    pairs.append([name, answer])

pairs

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[['Rembrandt Peale', ' No'],
 ['Rembrandt', ' "'],
 ['after Rembrandt van Rijn', ' he'],
 ['Rembrandt Harmensz. van Rijn', ' he'],
 ['Rembrandt van Rijn', ' The'],
 ['British 19th Century after Rembrandt van Rijn', ' in'],
 ['Richard Houston after Rembrandt van Rijn', ' this'],
 ['William Byron after Rembrandt van Rijn', ' "'],
 ['Georg Friedrich Schmidt after Rembrandt van Rijn', ' none'],
 ['Jonas Suyderhoff after Rembrandt van Rijn', " '"],
 ['Timothy Cole after Rembrandt van Rijn', ' the'],
 ['Richard Earlom after Rembrandt van Rijn', ' "'],
 ['Govaert FLINCK (Discípulo de Rembrandt)', ' "']]

I believe we could fine-tune it to answer only Yes/No, but it still does not look promising (e.g. it answers "No" for Rembrandt Peale).

Clustering based on string similarity (threshold 0.6, chosen after some iterations):

In [12]:
import difflib

def similarity(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()

cases = art500k_artists[art500k_artists['artist'].str.contains("Rembrandt")]['artist'].unique()
clusters = {}
for case in cases:
    assigned_cluster = False
    for cluster_center in clusters:
        if similarity(case, cluster_center) > 0.6:
            clusters[cluster_center].append(case)
            assigned_cluster = True
            break
    if not assigned_cluster:
        clusters[case] = [case]

for cluster_center, cluster_cases in clusters.items():
    print(f"Cluster center: {cluster_center}")
    print(f"Cases in cluster: {', '.join(cluster_cases)}")
    print()


Cluster center: Rembrandt Peale
Cases in cluster: Rembrandt Peale, Rembrandt, Rembrandt van Rijn

Cluster center: after Rembrandt van Rijn
Cases in cluster: after Rembrandt van Rijn, Rembrandt Harmensz. van Rijn, British 19th Century after Rembrandt van Rijn, Richard Houston after Rembrandt van Rijn, William Byron after Rembrandt van Rijn, Georg Friedrich Schmidt after Rembrandt van Rijn, Jonas Suyderhoff after Rembrandt van Rijn, Timothy Cole after Rembrandt van Rijn, Richard Earlom after Rembrandt van Rijn

Cluster center: Govaert FLINCK (Discípulo de Rembrandt)
Cases in cluster: Govaert FLINCK (Discípulo de Rembrandt)



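This makes mistakes in important cases: *Rembrandt* and *Rembrandt van Rijn* end up in the cluster centered on *Rembrandt Peale*, while *Rembrandt Harmensz. van Rijn* is grouped with the "after" cases.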

There were also attempts to use ChatGPT to find artist aliases and filter unknowns, but this was too time-consuming without the API.

## TODO - Deal with "after" cases + remove other common words that are not names: e.g. Company, etc.

I've done a similar thing before in the e-flux project on GitHub: with an NLP sentiment tool, you can find the words that are typically not human names and appear most often (e.g. university, college, company, "co.") and then filter these cases out.
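As a simpler stand-in for that tool, a plain frequency count over the tokens of all artist names already surfaces candidate non-name words. This is only a sketch of the counting step, not the method used in the e-flux project; the function name is hypothetical, and the sample names are taken from the Rembrandt output above.

```python
from collections import Counter
import re

def most_common_tokens(names, top_n=20):
    """Count in how many names each lowercase word occurs, most frequent first."""
    counter = Counter()
    for name in names:
        # Letters only (Unicode-aware), each word counted once per name
        tokens = set(re.findall(r"[^\W\d_]+", name.lower()))
        counter.update(tokens)
    return counter.most_common(top_n)

sample_names = [
    "Richard Houston after Rembrandt van Rijn",
    "William Byron after Rembrandt van Rijn",
    "after Rembrandt van Rijn",
    "Rembrandt Peale",
]
print(most_common_tokens(sample_names, top_n=4))
```

Frequent tokens still need a manual pass: here "rembrandt" ranks above "after", so the list only proposes candidates and cannot decide by itself which words are not names.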

In [46]:
after_word_case = []
for painter in art500k_artists['artist']:
    if "after" in painter.lower():
        after_word_case.append(painter)

## TODO: Time data cleaning 

## TODO: Fix wrong years for artists that have unreasonable years (last painting after death, first painting before birth, death before birth, etc.)

There are ~600 artists with something certainly wrong with their birth, death or activity years. We need to fix these.
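The checks used below live in *helper_functions.py* and are not shown in this notebook. As an illustration of what an order check could look like, here is a hypothetical stand-in (it is not the actual `artist_years_order_check`): it assumes the years are passed as `[birth, first work, last work, death]` and that missing values are `None` or NaN.

```python
import math

def years_out_of_order(years):
    """True if the known years are not in non-decreasing order.

    `years` is [birth, first_work, last_work, death]; missing values are skipped.
    """
    known = [year for year in years if year is not None and not math.isnan(year)]
    return any(earlier > later for earlier, later in zip(known, known[1:]))

# Last work after death
print(years_out_of_order([1860.0, 1897.0, 1955.0, 1945.0]))
# First work before birth, death year missing
print(years_out_of_order([1983.0, 1912.0, 1942.0, float("nan")]))
# Consistent years (made-up values)
print(years_out_of_order([1800.0, 1820.0, 1850.0, 1860.0]))
```

The first two calls flag the row, the third does not. A separate check would be needed for implausibly large gaps, which is presumably the role of `difference_check`.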

In [6]:
art500k_further_selected = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_further_selected.csv')
wrong_order_artists = pd.DataFrame(columns=art500k_further_selected.columns)
large_difference_artists = pd.DataFrame(columns=art500k_further_selected.columns)

for index, row in art500k_further_selected.iterrows():
    if helper_functions.artist_years_order_check([row['birth_date'], row['FirstYear'], row['LastYear'], row['death_date']]):
        if len(wrong_order_artists)==0:
            wrong_order_artists = pd.DataFrame(row).T
        else:
            wrong_order_artists = pd.concat([wrong_order_artists, pd.DataFrame(row).T])
    if helper_functions.difference_check([row['birth_date'], row['FirstYear'], row['LastYear'], row['death_date']]):
        if len(large_difference_artists)==0:
            large_difference_artists = pd.DataFrame(row).T
        else:
            large_difference_artists = pd.concat([large_difference_artists, pd.DataFrame(row).T])

l1 = list(large_difference_artists['artist'])
l2 = list(wrong_order_artists['artist'])
union = list(set(l1) | set(l2))
len(union), len(l1), len(l2)

(593, 275, 578)

In [None]:
"John Atkinson Grimshaw" "FriendsAndCoworkers"

In [None]:
"Philip Bunamo"

"Baron François-Pascal-Simon Gérard (French"
"François Pascal Simon Gérard"

## 2024.10.27: Combining a few duplicates

In [None]:
primary_name = 'Georges de La Tour'
aliases = ['Georges de La Tour (1493-1652)', 'Georges De La Tour']

In [None]:
primary_name = 'József Rippl-Rónai'
aliases = ['Jozsef Rippl-Ronai']

In [14]:
primary_name = "László Moholy-Nagy"
aliases = ['Laszlo Moholy Nagy']


'Q212499'

In [None]:
"Jean-Léon Gérôme"
"Jean Leon Gerome"

## 2024.10.26-27 (Version 0.5): Combine instances based on same unicode normalized names
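Candidate pairs like the ones listed below can be found by grouping names on a normalized key. The following is a minimal sketch of that idea, not necessarily the exact procedure used here: it strips combining accents after NFKD decomposition and casefolds the result. Letters such as "ø" and "ł" have no Unicode decomposition, so the small `EXTRA_FOLDS` table is an assumption added for them; the function names are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Letters without a canonical decomposition (assumed, incomplete table)
EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O", "ł": "l", "Ł": "L"})

def normalized_key(name):
    """Accent-free, case-insensitive key for comparing artist names."""
    decomposed = unicodedata.normalize("NFKD", name.translate(EXTRA_FOLDS))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def find_duplicate_groups(names):
    """Return the groups of names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalized_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

names = ["Paul Cezanne", "Paul Cézanne", "Jan van Eyck", "Jan Van Eyck",
         "Julian Fałat", "Julian Falat", "Rembrandt"]
for group in find_duplicate_groups(names):
    print(group)
```

Each group still needs a decision on which spelling becomes the primary name.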

In [3]:
import json
import requests
response = requests.get('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_wikidata_IDs_mapping.json')
art500k_wikidata_IDs_mapping = json.loads(response.text)

art500k_further_selected = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_further_selected.csv')
response = requests.get('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/painter_name_pairs.json')
wikiart_art500k_mapping = json.loads(response.text)

In [4]:
duplicate_pairs = [("Stanisław Ignacy Witkiewicz", "Stanislaw Ignacy Witkiewicz"),("Stanisław Wyspiański","Stanislaw Wyspianski"),
         ("Julian Fałat", "Julian Falat"), ("Peder Severin Kroyer", "Peder Severin Krøyer"), ("Vilhelm Hammershoi", "Vilhelm Hammershøi"),
         ("Piotr Michałowski", "Piotr Michalowski"),
         ] + [('Alexej von Jawlensky', 'Alexej Von Jawlensky'),
 ('Constantin Brâncuși', 'Constantin Brancusi'),
 ('Fernand Leger', 'Fernand Léger'),
 ('Fernando de Szyszlo', 'Fernando De Szyszlo'),
 ('Frantisek Kupka', 'František Kupka'),
 ('Hilma af Klint', 'Hilma Af Klint'),
 ('Josef Sima', 'Josef Šíma'),
 ('Salvador Dali', 'Salvador Dalì'),
 ('Theo van Doesburg', 'Theo Van Doesburg'),
 ('Bui Xuan Phai', 'Bùi Xuân Phái'),
 ('Elaine de Kooning', 'Elaine De Kooning'),
 ('Joan Miro', 'Joan Miró'),
 ('Mark di Suvero', 'Mark Di Suvero'),
 ('Oyvind Fahlstrom', 'Öyvind Fahlström'),
 ('Rodolfo Arico', 'Rodolfo Aricò'),
 ('Briton Riviere', 'Briton Rivière'),
 ('Frederick McCubbin', 'Frederick Mccubbin'),
 ('Leon Bonnat', 'Léon Bonnat'),
 ('Mihaly Munkacsy', 'Mihály Munkácsy'),
 ('Miklos Barabas', 'Miklós Barabás'),
 ('Pedro Américo', 'Pedro Americo'),
 ('Tamara de Lempicka', 'Tamara De Lempicka'),
 ('Antoni Tapies', 'Antoni Tàpies'),
 ('Emile Galle', 'Émile Gallé'),
 ('Felix Vallotton', 'Félix Vallotton'),
 ('Jules Cheret', 'Jules Chéret'),
 ('Julio Romero de Torres', 'Julio Romero De Torres'),
 ('Józef Mehoffer', 'Jozef Mehoffer'),
 ('Leon Bakst', 'Léon Bakst'),
 ('Stefan Luchian', 'Ștefan Luchian'),
 ('Wilhelm Trübner', 'Wilhelm Trubner'),
 ('Andre Masson', 'Andre\u0301 Masson'),  # second name: "e" followed by a combining acute accent (decomposed form)
 ('Adam van Noort', 'Adam Van Noort'),
 ('Adriaen van Ostade', 'Adriaen Van Ostade'),
 ('Adriaen van de Velde', 'Adriaen Van De Velde'),
 ('Anthony van Dyck', 'Anthony Van Dyck'),
 ('Bartolome Esteban Murillo', 'Bartolomé Esteban Murillo'),
 ('Cornelis de Vos', 'Cornelis De Vos'),
 ('Cornelis van Noorde', 'Cornelis Van Noorde'),
 ('Diego Velazquez', 'Diego Velázquez'),
 ('Dirck van Baburen', 'Dirck Van Baburen'),
 ('Esaias van de Velde', 'Esaias Van De Velde'),
 ('Gabriel Metsu', 'Gabriël Metsu'),
 ('Jacob van Ruisdael', 'Jacob Van Ruisdael'),
 ('Jan van Goyen', 'Jan Van Goyen'),
 ('Maarten de Vos', 'Maarten De Vos'),
 ('Michiel van Musscher', 'Michiel Van Musscher'),
 ('Nikolaus Knüpfer', 'Nikolaus Knupfer'),
 ('Otto Marseus van Schrieck', 'Otto Marseus Van Schrieck'),
 ('Otto van Veen', 'Otto Van Veen'),
 ('Pieter de Hooch', 'Pieter De Hooch'),
 ('Sebastien Bourdon', 'Sébastien Bourdon'),
 ('Simon de Vlieger', 'Simon De Vlieger'),
 ('Theodoor van Thulden', 'Theodoor Van Thulden'),
 ('Willem van Aelst', 'Willem Van Aelst'),
 ('Edouard Vuillard', 'Édouard Vuillard'),
 ('Emile Bernard', 'Émile Bernard'),
 ('Paul Serusier', 'Paul Sérusier'),
 ('Allan McCollum', 'Allan Mccollum'),
 ('Helio Oiticica', 'Hélio Oiticica'),
 ('Herman de Vries', 'Herman De Vries'),
 ('Sol LeWitt', 'Sol Lewitt'),
 ('Francois Morellet', 'François Morellet'),
 ('Andre Derain', 'André Derain'),
 ('Carlo Carra', 'Carlo Carrà'),
 ('Carlos Merida', 'Carlos Mérida'),
 ('Filippo De Pisis', 'Filippo de Pisis'),
 ('Giacomo Manzu', 'Giacomo Manzù'),
 ('Julio Gonzalez', 'Julio González'),
 ('Maurice de Vlaminck', 'Maurice De Vlaminck'),
 ('Paul Cezanne', 'Paul Cézanne'),
 ('Tarsila do Amaral', 'Tarsila Do Amaral'),
 ('Andrea del Castagno', 'Andrea Del Castagno'),
 ('Andrea del Verrocchio', 'Andrea Del Verrocchio'),
 ('Antonello da Messina', 'Antonello Da Messina'),
 ('Francesco del Cossa', 'Francesco Del Cossa'),
 ('Leonardo da Vinci', 'Leonardo Da Vinci'),
 ('Piero della Francesca', 'Piero Della Francesca'),
 ('Andrzej Wróblewski', 'Andrzej Wroblewski'),
 ('Antonio Carneiro', 'António Carneiro'),
 ('Jose Gutierrez Solana', 'José Gutiérrez Solana'),
 ('Jose Pancetti', 'José Pancetti'),
 ('Kathe Kollwitz', 'Käthe Kollwitz'),
 ('Laszlo Mednyanszky', 'László Mednyánszky'),
 ('Moise Kisling', 'Moïse Kisling'),
 ('Oswaldo Guayasamin', 'Oswaldo Guayasamín'),
 ('Bartolome Bermejo', 'Bartolomé Bermejo'),
 ('Andrea del Sarto', 'Andrea Del Sarto'),
 ('Cima da Conegliano', 'Cima Da Conegliano'),
 ('Eugene Boudin', 'Eugène Boudin'),
 ('Frederic Bazille', 'Frédéric Bazille'),
 ('Honore Daumier', 'Honoré Daumier'),
 ('Jacques-Émile Blanche', 'Jacques-Emile Blanche'),
 ('Jose Malhoa', 'José Malhoa'),
 ('Józef Pankiewicz', 'Jozef Pankiewicz'),
 ('Nicolae Darascu', 'Nicolae Dărăscu'),
 ('Santiago Rusinol', 'Santiago Rusiñol'),
 ('Vilhelms Purvitis', 'Vilhelms Purvītis'),
 ('Gentile da Fabriano', 'Gentile Da Fabriano'),
 ('Jesus Rafael Soto', 'Jesús Rafael Soto'),
 ('Hans von Aachen', 'Hans Von Aachen'),
 ('Giorgio de Chirico', 'Giorgio De Chirico'),
 ('Jose Guadalupe Posada', 'José Guadalupe Posada'),
 ('Manuel Rodríguez Lozano', 'Manuel Rodriguez Lozano'),
 ('Abraham van Strij', 'Abraham Van Strij'),
 ('Francisco Bayeu y Subias', 'Francisco Bayeu Y Subias'),
 ('Jacob van Strij', 'Jacob Van Strij'),
 ('Theodore Chasseriau', 'Théodore Chassériau'),
 ('Hugo van der Goes', 'Hugo Van Der Goes'),
 ('Jan van Eyck', 'Jan Van Eyck'),
 ('Matthias Grünewald', 'Matthias Grunewald'),
 ('Rogier van der Weyden', 'Rogier Van Der Weyden'),
 ('Eugene Delacroix', 'Eugène Delacroix'),
 ('Théodore Géricault', 'Theodore Gericault'),
 ('Adolph de Meyer', 'Adolph De Meyer'),
 ('Gertrude Kasebier', 'Gertrude Käsebier'),
 ('Charles-Francois Daubigny', 'Charles-François Daubigny'),
 ('Gustave Dore', 'Gustave Doré'),
 ('Paja Jovanovic', 'Paja Jovanović'),
 ('Theodore Rousseau', 'Théodore Rousseau'),
 ('Theodule Ribot', 'Théodule Ribot'),
 ('Francois Boucher', 'François Boucher'),
 ('Jean-Baptiste van Loo', 'Jean-Baptiste Van Loo'),
 ('José Campeche', 'Jose Campeche'),
 ('Maurice Quentin de La Tour', 'Maurice Quentin De La Tour'),
 ('Arnold Böcklin', 'Arnold Bocklin'),
 ('Rudolf von Alt', 'Rudolf Von Alt'),
 ('Wilhelm von Kaulbach', 'Wilhelm Von Kaulbach'),
 ('Jef Aerosol', 'Jef Aérosol'),
 ('Albin Brunovsky', 'Albín Brunovský'),
 ('Alejandro Obregon', 'Alejandro Obregón'),
 ('Meret Oppenheim', 'Méret Oppenheim'),
 ('Eugene Carriere', 'Eugène Carrière'),
 ('Felicien Rops', 'Félicien Rops'),
 ('Pierre Puvis de Chavannes', 'Pierre Puvis De Chavannes'),
 ('Marianne von Werefkin', 'Marianne Von Werefkin'),
 ('Itō Jakuchū', 'Ito Jakuchu'),
 ('Tani Bunchō', 'Tani Buncho'),
 ('Paul César Helleu', 'Paul Cesar Helleu'),
 ('Antonio Da Monza', 'Antonio da Monza'),
 ('Jean Dessès', 'Jean Desses'),
 ('Martín Rico', 'Martin Rico'),
 ('Antônio Parreiras', 'Antonio Parreiras'),
 ('Hishida Shunsō', 'Hishida Shunso'),
 ('Hendrick ter Brugghen', 'Hendrick Ter Brugghen'),
 ('Félix Bonfils', 'Felix Bonfils'),
 ('Ivan Generalić', 'Ivan Generalic'),
 ('Uemura Shōen', 'Uemura Shoen'),
 ('Charles François Daubigny', 'Charles Francois Daubigny'),
 ('Willem de Kooning', 'Willem De Kooning'),
 ('Nicolas Régnier', 'Nicolas Regnier'),
 ('Pedro Rodriguez', 'Pedro Rodríguez'),
 ('Giovanni Francesco Da Rimini', 'Giovanni Francesco da Rimini'),
 ('Felix Lecomte', 'Félix Lecomte'),
 ('Kveta Pacovska', 'Květa Pacovská'),
 ('Louis Rémy Mignot', 'Louis Remy Mignot'),
 ('Pedro Weingärtner', 'Pedro Weingartner'),
 ('Cecilio Plá', 'Cecilio Pla')]

for primary_name, secondary_name in duplicate_pairs:
    art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, primary_name, secondary_name).reset_index(drop=True)
    
    if secondary_name in art500k_wikidata_IDs_mapping:
        if primary_name not in art500k_wikidata_IDs_mapping:
            art500k_wikidata_IDs_mapping[primary_name] = art500k_wikidata_IDs_mapping[secondary_name]
        del art500k_wikidata_IDs_mapping[secondary_name]
    if (secondary_name in wikiart_art500k_mapping):
        if (primary_name not in wikiart_art500k_mapping):
            wikiart_art500k_mapping[primary_name] = wikiart_art500k_mapping[secondary_name]
        del wikiart_art500k_mapping[secondary_name]
    art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!=secondary_name]
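The key handling in the loop above is the same for both mappings, so it could be factored into a small helper. This is a sketch only; `move_alias_entry` is a hypothetical name and is not part of *helper_functions.py*.

```python
def move_alias_entry(mapping, primary_name, secondary_name):
    """Move the entry of `secondary_name` to `primary_name`, keeping an existing primary entry."""
    if secondary_name in mapping:
        value = mapping.pop(secondary_name)
        # Only fill in the primary entry if it does not exist yet
        mapping.setdefault(primary_name, value)
    return mapping

example_mapping = {"Paul Cezanne": "ID_A", "Jan Van Eyck": "ID_B", "Jan van Eyck": "ID_C"}
move_alias_entry(example_mapping, "Paul Cézanne", "Paul Cezanne")   # entry is renamed
move_alias_entry(example_mapping, "Jan van Eyck", "Jan Van Eyck")   # primary entry is kept
print(example_mapping)
```

The IDs in the example are placeholders, not real Wikidata IDs.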

In [5]:
with open('datasets/saves/art500k_wikidata_IDs_mapping.json', 'w', encoding='utf-8') as f:
    json.dump(art500k_wikidata_IDs_mapping, f, ensure_ascii=False, indent=4)

art500k_artists.to_csv('datasets/saves/art500k_artists_0_5.csv', index=False)
art500k_artists.to_csv('datasets/art500k_artists.csv', index=False)

art500k_further_selected.to_csv('datasets/saves/art500k_further_selected.csv', index=False)

#There were no changes in the painter_name_pairs dictionary
with open('datasets/saves/painter_name_pairs.json', 'w', encoding='utf-8') as f:
    json.dump(wikiart_art500k_mapping, f, ensure_ascii=False, indent=4)

# End of version 0.4

## 2024.10.26 Combine instances (temporarily, for the final dataset): 'Roger de La Fresnaye', 'Luis Paret y Alcázar'

These were found through finding instances with the same romanized lowercase name, after importing the PainterPalette dataset in SQL. More detail in the *datasets_notebook.ipynb* notebook.

In [3]:
import json
import requests
response = requests.get('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_wikidata_IDs_mapping.json')
art500k_wikidata_IDs_mapping = json.loads(response.text)

art500k_further_selected = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_further_selected.csv')
#response = requests.get('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/painter_name_pairs.json')
#wikiart_art500k_mapping = json.loads(response.text);

In [4]:
primary_name = "Roger de La Fresnaye"
second_name = "Roger de la Fresnaye"
third_name = "Roger De La Fresnaye"

art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, primary_name, second_name)
art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, primary_name, third_name)

In [5]:
art500k_wikidata_IDs_mapping[primary_name] = art500k_wikidata_IDs_mapping[second_name]
del art500k_wikidata_IDs_mapping[second_name]
del art500k_wikidata_IDs_mapping[third_name]

#del second, third from further selected
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!=second_name]
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!=third_name]

In [6]:
primary_name = "Luis Paret y Alcazar"
second_name = "Luis Paret y Alcázar"
third_name = "Luis Paret Y Alcazar"


art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, primary_name, second_name)
art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, primary_name, third_name)

In [7]:
art500k_wikidata_IDs_mapping[primary_name] = art500k_wikidata_IDs_mapping[second_name]
del art500k_wikidata_IDs_mapping[second_name]
del art500k_wikidata_IDs_mapping[third_name]

#del second, third from further selected
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!=second_name]
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!=third_name]

In [8]:
with open('datasets/saves/art500k_wikidata_IDs_mapping.json', 'w', encoding='utf-8') as f:
    json.dump(art500k_wikidata_IDs_mapping, f, ensure_ascii=False, indent=4)

art500k_artists.to_csv('datasets/saves/art500k_artists_0_4.csv', index=False)
art500k_artists.to_csv('datasets/art500k_artists.csv', index=False)

art500k_further_selected.to_csv('datasets/saves/art500k_further_selected.csv', index=False)

## 2024.03.22-28: Artist combination, birth/death/activity years cleaning

**Note**: The functions were moved to the *helper_functions.py* file. 

In [54]:
import json
with open('datasets/saves/art500k_wikidata_IDs_mapping.json', 'r', encoding='utf-8') as f:
    art500k_wikidata_IDs_mapping = json.load(f)

art500k_further_selected = pd.read_csv('datasets/saves/art500k_further_selected.csv')

### Find irregular years for artists

In [57]:
wrong_order_artists = pd.DataFrame(columns=art500k_further_selected.columns)
large_difference_artists = pd.DataFrame(columns=art500k_further_selected.columns)

for index, row in art500k_further_selected.iterrows():
    if helper_functions.artist_years_order_check([row['birth_date'], row['FirstYear'], row['LastYear'], row['death_date']]):
        if len(wrong_order_artists)==0:
            wrong_order_artists = pd.DataFrame(row).T
        else:
            wrong_order_artists = pd.concat([wrong_order_artists, pd.DataFrame(row).T])
    if helper_functions.difference_check([row['birth_date'], row['FirstYear'], row['LastYear'], row['death_date']]):
        if len(large_difference_artists)==0:
            large_difference_artists = pd.DataFrame(row).T
        else:
            large_difference_artists = pd.concat([large_difference_artists, pd.DataFrame(row).T])


Initial checks:

In [58]:
fix_first_year_artists = ['Franz von Matsch', 'Madeleine Vionnet', 'Rodolfo Bernardelli', 'Pierre Auguste Renoir', 'Valdivia', 
                          'Matsumura Keibun', 'Federica Galli', 'John Wilson', 'Anton von Maron', 'Paul Cézanne', 'Carl Mydans']
fix_last_year_artists = ['Federico Barocci','Nicolas de Largillière', 'Martin van Meytens', 'Daniel Hopfer', 'Jan van Os','Augusto Stahl',
                         'Charles Cressent','Henrique Bernardelli', 'Valdivia', 'Matsumura Keibun', 'Wilhelm Von Kaulbach', 'Hans Von Aachen']
drop_artists = ['Anónimo', 'Das', 'Indian', 'English', 'Smith', 'Japanese', 'French', 'Jo Yeong-seok', 'Pratt']

In [59]:
art500k_further_selected = art500k_further_selected[~art500k_further_selected['artist'].isin(drop_artists)]

In [60]:
art500k_further_selected = helper_functions.years_setting(art500k_further_selected, fix_first_year_artists, fix_last_year_artists)

Some manual modifications:

In [61]:
# Assign through .loc on the original DataFrame; chained indexing
# (df[mask].iloc[0][col] = ...) would only modify a temporary copy
graneri_mask = art500k_further_selected['artist'] == 'Giovanni Michele Graneri'
art500k_further_selected.loc[graneri_mask, 'birth_date'] = 1708.0
art500k_further_selected.loc[graneri_mask, 'death_date'] = 1762.0

Let's check the order of the years. Some artists may have wrong first/last years (Art500k years are less reliable than Wiki birth/death years).

In [62]:
wrong_order_artists = pd.DataFrame(columns=art500k_further_selected.columns)
large_difference_artists = pd.DataFrame(columns=art500k_further_selected.columns)

for index, row in art500k_further_selected.iterrows():
    if helper_functions.artist_years_order_check([row['birth_date'], row['FirstYear'], row['LastYear'], row['death_date']]):
        if len(wrong_order_artists)==0:
            wrong_order_artists = pd.DataFrame(row).T
        else:
            wrong_order_artists = pd.concat([wrong_order_artists, pd.DataFrame(row).T])
    if helper_functions.difference_check([row['birth_date'], row['FirstYear'], row['LastYear'], row['death_date']]):
        if len(large_difference_artists)==0:
            large_difference_artists = pd.DataFrame(row).T
        else:
            large_difference_artists = pd.concat([large_difference_artists, pd.DataFrame(row).T])

In [63]:
wrong_order_artists[['artist', 'birth_date', 'FirstYear', 'LastYear', 'death_date']]

| | artist | birth_date | FirstYear | LastYear | death_date |
|---|---|---|---|---|---|
| 2 | René Lalique | 1860.0 | 1897.0 | 1955.0 | 1945.0 |
| 3 | Margaret Bourke-White | 1904.0 | 1885.0 | 1982.0 | 1971.0 |
| 27 | William Notman | 1826.0 | 1862.0 | 1896.0 | 1891.0 |
| 39 | Charles-Nicolas Cochin | 1688.0 | 1745.0 | 1755.0 | 1754.0 |
| 48 | Adolphe Braun | 1812.0 | 1854.0 | 1906.0 | 1877.0 |
| ... | ... | ... | ... | ... | ... |
| 7300 | Paula Modersohn Becker | 1876.0 | 1898.0 | 1908.0 | 1907.0 |
| 7301 | Samuel Schwarz | 1983.0 | 1912.0 | 1942.0 | |
| 7304 | Julio Romero De Torres | 1874.0 | 1898.0 | 1931.0 | 1930.0 |
| 7324 | Augustus Saint-Gaudens | 1848.0 | 1872.0 | 1926.0 | 1907.0 |


However, there are still 605 artists with irregular birth, death or activity years.

This needs to be fixed later.

In [67]:
#Union of the two artist lists
l1 = list(large_difference_artists['artist'])
l2 = list(wrong_order_artists['artist'])
union = list(set(l1) | set(l2))
len(union)

605

### Combine instances

In [90]:
art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, "Leon Wyczółkowski", "Leon Wyczolkowski")
art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, "Dusan Dzamonja", "Dušan Džamonja")
art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, "Eduard von Gebhardt", "Eduard Von Gebhardt")
art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, "Caravaggio", "Michelangelo da Caravaggio")
art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, "Caravaggio", "Michelangelo Merisi da Caravaggio (1571-1610)")
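
Several of the pairs combined above differ only in diacritics or capitalisation. Candidates of this kind could be found automatically by grouping names under a normalised key. The sketch below is one possible approach, not part of `helper_functions.py`; `alias_key` and `alias_groups` are hypothetical names, and the extra character table is needed because letters such as "ł" have no Unicode decomposition.

```python
import unicodedata
from collections import defaultdict

# Letters that NFKD cannot split into base letter + accent (small, non-exhaustive table)
_EXTRA = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"})


def alias_key(name):
    """Lowercase the name, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name.translate(_EXTRA))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def alias_groups(names):
    """Return the lists of names that share the same normalised key."""
    groups = defaultdict(list)
    for name in names:
        groups[alias_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = ["Leon Wyczółkowski", "Leon Wyczolkowski", "Dusan Dzamonja", "Dušan Džamonja",
         "Eduard von Gebhardt", "Eduard Von Gebhardt", "Caravaggio"]
for group in alias_groups(names):
    print(group)
```

This only catches spelling variants. Aliases such as "Caravaggio" and "Michelangelo da Caravaggio" still need a manual mapping or a Wikidata ID comparison.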

### Delete the duplicates from the mapping

In [91]:
del art500k_wikidata_IDs_mapping["Leon Wyczolkowski"]
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!="Leon Wyczolkowski"]

del art500k_wikidata_IDs_mapping["Dušan Džamonja"]
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!="Dušan Džamonja"]

del art500k_wikidata_IDs_mapping["Eduard Von Gebhardt"]
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!="Eduard Von Gebhardt"]

del art500k_wikidata_IDs_mapping["Michelangelo da Caravaggio"]
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!="Michelangelo da Caravaggio"]

### Other fixes (e.g. bad data):

Note: the artist IDs for the deleted duplicates will have to be fixed later.

In [92]:
#Robert Hunter will be queried again
art500k_wikidata_IDs_mapping['Robert Hunter'] = "Q20826574"
#artist_row = art500k_artists[art500k_artists['artist'] == "Robert Hunter"].iloc[0]

del art500k_wikidata_IDs_mapping["Man"]
art500k_artists = art500k_artists[art500k_artists['artist']!="Man"]
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!="Man"]

del art500k_wikidata_IDs_mapping["Vettor Pisani"]
art500k_artists = art500k_artists[art500k_artists['artist']!="Vettor Pisani"]
art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!="Vettor Pisani"]

#Nadar will be combined with Felix Nadar from the WikiArt dataset
#del art500k_wikidata_IDs_mapping["Nadar"]
#art500k_artists = art500k_artists[art500k_artists['artist']!="Nadar"]
#art500k_further_selected = art500k_further_selected[art500k_further_selected['artist']!="Nadar"]

### Save everything

In [96]:
import json

with open('datasets/saves/art500k_wikidata_IDs_mapping.json', 'w', encoding='utf-8') as f:
    json.dump(art500k_wikidata_IDs_mapping, f)

art500k_artists.to_csv('datasets/saves/art500k_artists_0_4.csv', index=False)
art500k_artists.to_csv('datasets/art500k_artists.csv', index=False)

art500k_further_selected.to_csv('datasets/saves/art500k_further_selected.csv', index=False)

## 2024.03.18-19: Rename Art500k_selected to Further Selected <br> Redo Art500k_selected based on Wikidata IDs<br> + partially deselect non-painters

Most of the artists stored in `art500k_wikidata_IDs_mapping.json` are already painters, so we can ignore that problem for the most part.

In [4]:
import json
with open('datasets/saves/art500k_wikidata_IDs_mapping.json', 'r', encoding='utf-8') as f:
    art500k_wikidata_IDs_mapping = json.load(f)

In [5]:
art500k_further_selected_artists = [x for x in art500k_wikidata_IDs_mapping.keys() if art500k_wikidata_IDs_mapping[x] is not None]

In [6]:
art500k_further_selected = art500k_artists[art500k_artists['artist'].isin(art500k_further_selected_artists)].copy() #Copy, so that columns added later do not write into a slice of art500k_artists
art500k_further_selected

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,PaintingsExhibitedAt,StylesYears,StylesCount,PaintingsExhibitedAtCount,Contemporary,Type
4,El Greco,"Spanish,Greek",Cretan School,"{Spanish Renaissance:1},{Renaissance:2},{Manne...","Byzantine Art,","Expressionism,Cubism,Eugene Delacroix,Edouard ...",,"Titian,","Giulio Clovio,",1568.0,1614.0,"Seville, London, Illescas, Romania, Moscow, Gr...",Mannerism (Late Renaissance):1568-1600,"{Renaissance:2}, {XVI CenturySpanish Painting:...","{Spain:75},{Boston:1},{MA:1},{US:27},{Museo de...",No,Painting/Sculpture
14,Utamaro,,,,,,,,,1787.0,1803.0,Japan,,,{Japan:26},No,Painting/Sculpture
15,René Lalique,,,,,"France,",,,Artists1/René Lalique/Orchids Diadem##EAHg4G1j...,1897.0,1955.0,"Paris, France",,,"{Paris:3},{France:16}",No,Painting/Sculpture
22,Margaret Bourke-White,,,{Social realism:6953},,,,,,1885.0,1982.0,,,,,,Painting/Sculpture
23,Alfred Eisenstaedt,,,,,,,,,1901.0,1993.0,,,,,,Painting/Sculpture
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20899,Jan Boskam,,,,,,,,,,,Amsterdam,,,{Amsterdam:3},,
20900,Wouter Muller,,,,,,,,,,,"Amsterdam,Netherlands",,,"{Amsterdam:13},{Netherlands:1}",,
20901,Jacques Jonghelinck,,,,,,,,,,,Antwerp,,,{Antwerp:2},,
20902,Jan van Halbeeck,,,,,,,,,,,Paris,,,{Paris:6},,


In [7]:
import httpimport
with httpimport.remote_repo('https://raw.githubusercontent.com/me9hanics/sparql-wikidata-data-collection/main/'):
    import functions as external_functions

In [None]:
attributes = ['birth_place', 'birth_date', 'death_place', 'death_date', 'gender', 'citizenship', 'occupation', 'work_locations']
for attribute in attributes:
    art500k_further_selected.loc[:, attribute] = None

further_selected_ids = [x for x in art500k_wikidata_IDs_mapping.values() if x is not None]
all_people_info = external_functions.get_multiple_people_all_info_by_id_fast_retry_missing(further_selected_ids, delay=121)
for response in all_people_info:
    artist = [key for key, value in art500k_wikidata_IDs_mapping.items() if value == response['id']][0]
    for attribute in attributes:
        try:
            art500k_further_selected.loc[art500k_further_selected['artist'] == artist, attribute] = response[attribute]
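        except Exception: #Fallback, e.g. for list-like values that cannot be assigned directly: store them as strings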
            art500k_further_selected.loc[art500k_further_selected['artist'] == artist, attribute] = str(response[attribute])

In [None]:
art500k_further_selected.rename(columns={"occupation":"occupations"}, inplace=True)

In [32]:
art500k_further_selected[0:5]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,...,Contemporary,Type,birth_place,birth_date,death_place,death_date,gender,citizenship,occupations,work_locations
4,El Greco,"Spanish,Greek",Cretan School,"{Spanish Renaissance:1},{Renaissance:2},{Manne...","Byzantine Art,","Expressionism,Cubism,Eugene Delacroix,Edouard ...",,"Titian,","Giulio Clovio,",1568.0,...,No,Painting/Sculpture,Heraklion,1541-10-01T00:00:00Z,Toledo,1614-04-07T00:00:00Z,male,Republic of Venice,"['architect', 'painter', 'sculptor', 'architec...","[{'location': 'Madrid', 'start_time': None, 'e..."
14,Utamaro,,,,,,,,,1787.0,...,No,Painting/Sculpture,Edo,1753-01-01T00:00:00Z,Edo,1806-10-31T00:00:00Z,male,Japan,"['painter', 'graphic artist', 'ukiyo-e artist']",[]
15,René Lalique,,,,,"France,",,,Artists1/René Lalique/Orchids Diadem##EAHg4G1j...,1897.0,...,No,Painting/Sculpture,Aÿ,1860-04-06T00:00:00Z,Paris,1945-05-01T00:00:00Z,male,France,"['goldsmith', 'jeweler', 'artist', 'glassblowe...","{'location': 'Paris', 'start_time': None, 'end..."
22,Margaret Bourke-White,,,{Social realism:6953},,,,,,1885.0,...,,Painting/Sculpture,The Bronx,1904-06-14T00:00:00Z,Stamford,1971-08-27T00:00:00Z,female,United States of America,"['photographer', 'writer', 'artist', 'photojou...","[{'location': 'Cleveland', 'start_time': None,..."
23,Alfred Eisenstaedt,,,,,,,,,1901.0,...,,Painting/Sculpture,Tczew,1898-12-06T00:00:00Z,Oak Bluffs,1995-08-23T00:00:00Z,male,Germany,"['photographer', 'entrepreneur', 'photojournal...","[{'location': 'New York City', 'start_time': N..."


In [15]:
def year_if_string(string):
    #Extract the year from a date string; anything else (None, NaN) becomes NaN
    if isinstance(string, str):
        return external_functions.find_year(string)
    return np.nan
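
The Wikidata dates shown above are strings such as `1541-10-01T00:00:00Z`, and only the year is needed. `external_functions.find_year` comes from the external repository and its implementation is not shown here. The following is a hypothetical stand-in (`extract_year` is an assumed name) that illustrates the idea with a regular expression, and may behave differently from the real function.

```python
import math
import re

# Optional sign, then up to four digits at the start of the string
_YEAR = re.compile(r"^\s*([+-]?\d{1,4})(?!\d)")


def extract_year(value):
    """Return the leading year of a date string as an int, or NaN if there is none."""
    if not isinstance(value, str):
        return math.nan
    match = _YEAR.match(value)
    return int(match.group(1)) if match else math.nan


for value in ["1541-10-01T00:00:00Z", "1999", None, "unknown"]:
    print(repr(value), "->", extract_year(value))
```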

In [65]:
def get_places_from_work_locations(work_locations, quiet=True):
    #Collect the unique location names, in order of first appearance
    if isinstance(work_locations, str):
        work_locations = external_functions.stringlist_to_list(work_locations)
    if isinstance(work_locations, dict):
        work_locations = [work_locations]
    places = []
    if work_locations is not None:
        for place in work_locations:
            try:
                if place["location"] not in places:
                    places.append(place["location"])
            except (KeyError, TypeError):
                #Entry without a usable "location" field
                if not quiet:
                    print(f"Could not find location in work_location: {place}")
                    print(f"work_location: {work_locations}")
    return str(places)


def get_places_with_years_from_work_locations(work_locations):
    places = []
    if type(work_locations) == str:
        work_locations = external_functions.stringlist_to_list(work_locations)
    if isinstance(work_locations, dict):
        work_locations = [work_locations]
    if work_locations:
        for place in work_locations:
            years = external_functions.get_years_from_response_location(place)
            if years != []:
                min_year = min(years); max_year = max(years)
                #Checking if the location is already in the list
                if not any(p.split(':')[0] == place["location"] for p in places):#Just get the part before the colon, which is the location's name
                    places.append(f"{place['location']}:{min_year}-{max_year}")
                else:
                    #Find the index of the location in the places list
                    for i, p in enumerate(places):
                        if p.split(':')[0] == place["location"]:
                            #Add these years next to the existing years
                            places[i] = f"{p},{min_year}-{max_year}"
                            break
    return str(places)
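
Both functions return the list as a string, and each `locations_with_years` entry has the form `location:min-max`, with further ranges appended after a comma. Reading such a cell back into Python objects could look like the sketch below. `parse_place_with_years` and `parse_locations_cell` are hypothetical helper names, and the example cell is made up for illustration, not taken from the dataset.

```python
import ast


def parse_place_with_years(entry):
    """Split 'Place:1900-1905,1910-1912' into ('Place', [(1900, 1905), (1910, 1912)])."""
    location, _, ranges = entry.partition(":")
    spans = []
    for span in ranges.split(","):
        if span:
            start, _, end = span.partition("-")
            spans.append((int(start), int(end)))
    return location, spans


def parse_locations_cell(cell):
    """Turn the stored string back into a list of (location, spans) tuples."""
    return [parse_place_with_years(entry) for entry in ast.literal_eval(cell)]


# Made-up example cell in the same format
cell = "['Paris:1900-1905,1910-1912', 'Rome:1620-1620']"
print(parse_locations_cell(cell))
```

The sketch assumes non-negative years and location names without a colon, which matches how the function above splits on `:`.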

In [None]:
art500k_further_selected['locations'] = art500k_further_selected['work_locations'].apply(get_places_from_work_locations)
art500k_further_selected['locations_with_years'] = art500k_further_selected['work_locations'].apply(get_places_with_years_from_work_locations)
art500k_further_selected['birth_date']=art500k_further_selected['birth_date'].apply(external_functions.find_year)
art500k_further_selected['death_date']=art500k_further_selected['death_date'].apply(external_functions.find_year)
art500k_further_selected.drop(columns=['work_locations'], inplace=True)

In [74]:
art500k_further_selected.to_csv('datasets/saves/art500k_further_selected.csv', index=False)

## 03.11-13 Create Art500k_selected.csv <br> (Built on the mentioned *important work* below)

We start from the file created in *datasets_notebook.ipynb*. For all Art500k artists that are not in the WikiArt dataset but have a corresponding Wikidata profile (which may occasionally be an incorrect match), we fetch further data from Wikidata.

In [4]:
import json
with open('datasets/saves/art500k_wikidata_names_mapping.json', 'r', encoding='utf-8') as f:
    art500k_wikidata_names_mapping = json.load(f)

In [None]:
art500k_selected = art500k_artists[art500k_artists['artist'].isin(art500k_wikidata_names_mapping.keys())].reset_index(drop=True)

In [7]:
import httpimport
with httpimport.remote_repo('https://raw.githubusercontent.com/me9hanics/sparql-wikidata-data-collection/main/'):
    import functions as external_functions

In [8]:
attributes = ['birth_place', 'birth_date', 'death_place', 'death_date', 'gender', 'citizenship', 'occupation', 'work_locations']
for attribute in attributes:
    art500k_selected[attribute] = None

all_people_info = external_functions.get_multiple_people_all_info_fast_retry_missing(list(art500k_selected['artist']), delay=121)
for response in all_people_info:
    for attribute in attributes:
        try:
            art500k_selected.loc[response['name'] == art500k_selected['artist'], attribute] = response[attribute]
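        except Exception: #Fallback, e.g. for list-like values that cannot be assigned directly: store them as strings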
            art500k_selected.loc[response['name'] == art500k_selected['artist'], attribute] = str(response[attribute])

Error fetching data for Tony Cragg, status code: 429.
Attempt 1 of 3.
Error fetching data for George W. Bush, status code: 429.
Attempt 1 of 3.
Error fetching data for Joaquín Torres-García, status code: 429.
Attempt 1 of 3.
Error fetching data for Cornelis van Haarlem, status code: 429.
Attempt 1 of 3.
Error fetching data for Yoshitomo Nara, status code: 429.
Attempt 1 of 3.
Error fetching data for Song Dong, status code: 429.
Attempt 1 of 3.
Error fetching data for Benvenuto Tisi, status code: 429.
Attempt 1 of 3.
Error fetching data for Sooni Taraporevala, status code: 429.
Attempt 1 of 3.
Error fetching data for José Ferraz de Almeida Júnior, status code: 429.
Attempt 1 of 3.
Error fetching data for Ryūsei Kishida, status code: 429.
Attempt 1 of 3.
Error fetching data for Martha Cooper, status code: 429.
Attempt 1 of 3.
Error fetching data for Cornelius van Poelenburgh, status code: 429.
Attempt 1 of 3.
Error fetching data for Paul-Jacques-Aimé Baudry, status code: 429.
Attem
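
Status code 429 means "Too Many Requests": the server is rate limiting the client, which is why the fetch function retries. The retry logic itself lives in the external repository and is not shown here. A generic retry loop with exponential backoff could be sketched as follows; `fetch_with_retry` and its parameters are assumptions for illustration, and `fetch` stands in for whatever function performs the actual request.

```python
import time


def fetch_with_retry(fetch, name, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(name) until it returns a non-429 status or the attempts run out.

    fetch is any callable returning (status_code, payload).
    """
    for attempt in range(1, attempts + 1):
        status, payload = fetch(name)
        if status != 429:
            return status, payload
        if attempt < attempts:
            # Wait longer after each rate-limited attempt: base_delay, then twice that, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return status, payload


# Fake fetch for demonstration: rate limited twice, then successful
calls = {"count": 0}

def fake_fetch(name):
    calls["count"] += 1
    return (429, None) if calls["count"] < 3 else (200, {"name": name})

print(fetch_with_retry(fake_fetch, "Tony Cragg", sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the function testable without real waiting.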

In the meantime, an artist was added to the mapping:

In [59]:
art500k_selected =art500k_selected[~(art500k_selected['artist'].str.contains("Costigliolo"))].reset_index(drop=True)

Now let's check which artists are still missing data:

In [41]:
art500k_selected[art500k_selected['gender'].isna()]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,...,Contemporary,Type,birth_place,birth_date,death_place,death_date,gender,citizenship,occupation,work_locations
120,Federico de Madrazo,,,,,,,,,1840.0,...,No,Painting/Sculpture,,,,,,,[],[]
262,Hong Ren,,,,,,,,,1664.0,...,No,Painting/Sculpture,,,,,,,researcher,[]
434,The Atlas,,,,,,,,,2013.0,...,Yes,Painting/Sculpture,,,,,,,[],[]
504,Manuel Capdevila,,,,,,,,,1976.0,...,No,Painting/Sculpture,,,,,,,[],[]
553,Myung Sook Kim,,"Ewha Woman's University, Seoul, Korea. B.F.A.,...",,,,,,,,...,,Painting/Sculpture,,1999,,,,,botanist,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9959,Fayum Portrait,Egyptians,,{Roman period (1c.BC - 4 c.AD):75},,"Byzantine Art,",,,,,...,,,,,,,,,[],[]
9972,Grigoras,Romanian,École de Paris,"{Abstract Expressionism,New European Painting:20}",,,,,,1997.0,...,Yes,,,,,,,,[],[]
10027,Yun Hyong Keun,South Korean,,{Dansaekhwa (Korean Monochrome Painting):15},,,,,,1970.0,...,Yes,,,,,,,,[],[]
10070,Costigliolo,Uruguayan,,"{Constructivism,De Stijl (Neoplasticism),Cubis...",Artists2/Costigliolo/Rectangulos Y Cuadrados C...,,,,,1948.0,...,No,,,,,,,,[],[]


Most of these artists do have information about them, just not on the English Wikipedia. We'd need to gather data from other languages.

In [49]:
missing = list(art500k_selected[art500k_selected['gender'].isna()]['artist'])
wikidata_names = [art500k_wikidata_names_mapping[artist] for artist in missing]
returned_infos = external_functions.get_multiple_people_all_info_fast_retry_missing(wikidata_names, delay=121)

['Federico de Madrazo',
 'Hong Ren',
 'The Atlas',
 'Manuel Capdevila',
 'Myung Sook Kim',
 'Jung Wook Kim',
 'Byung Jin Kim',
 'Seung Young Kim',
 'Herter Brothers',
 'Hense',
 'Mr Zero',
 'Layer',
 'Daas',
 'Nazza',
 'Leonardo da Vinci–Fiumicino Airport',
 'Parlee',
 'Xeva',
 'Insa',
 'Malarky',
 'Cerok',
 'Resto',
 'Atak',
 'Michael Massenburg',
 'phoenix',
 'Nomade',
 'The Norwegian Institute for Nature Research',
 'Guache',
 'Blic',
 'Distort',
 'Adres',
 'Triga',
 'King Bee',
 'Demer',
 'Nanook',
 'Macs',
 'Libertad',
 'Ecos',
 'Oster',
 'Speto',
 'Plea',
 'Motor',
 'Cazu',
 'Steep',
 'Fame',
 'Ibie',
 'Tafa3',
 'PAN1',
 'Muck Rock',
 'Plek',
 'The Usos',
 'Rodez',
 'Hin',
 'Kobra',
 'Smithe',
 'Debe',
 'Mart',
 'Maasai',
 'Krahn people',
 'Katre',
 'Tizer',
 'Katch',
 '4B',
 'Awer',
 'Roids',
 'Mobstr',
 'Vyal',
 'Fusca',
 'Artkore',
 'Eime',
 'Akse',
 'Tata Airport',
 'Rwdd3',
 'Lalone',
 'Slicer',
 '4016 Sambre',
 'Jorz',
 'writer',
 'Exot',
 'Tester',
 'Morik',
 'Jasone',
 

The dates need to be converted to years only.

In [38]:
for index,artist in art500k_selected.iterrows():
    if artist['birth_date']:
        art500k_selected.loc[index, 'birth_date'] = external_functions.find_year(artist['birth_date'])
    if artist['death_date']:
        art500k_selected.loc[index, 'death_date'] = external_functions.find_year(artist['death_date'])

Save: *art500k_selected_artists_extension.csv* (new file)

In [40]:
art500k_selected.to_csv('datasets/saves/art500k_selected_artists_extension.csv', index=False)

We then use it to build *artists_large.csv*, and split our final dataset into two files, the other one being *artists_precise.csv*.

As of now, *PainterPalette.csv* contains the data from *artists_large.csv*. *PainterPalette_precise.csv* is the representative file that contains the data from *artists_precise.csv*.

## *Important work*: Gather Wikidata data for Art500k artists that are not in the WikiArt dataset
This is done in `datasets_notebook.ipynb` instead, and is not included as part of the Art500k processed data.

First, import the WikiArt datasets and the mapping of (some) artists between the two datasets

## 2024.03.09 - Minor corrections

In [7]:
art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, "Edouard Manet", "Édouard Manet")

In [8]:
art500k_artists.to_csv('datasets/saves/art500k_artists_0_4.csv', index=False)
art500k_artists.to_csv('datasets/art500k_artists.csv', index=False)

## 2024.03.06: Remove NaN painter name row

In [10]:
def check_if_nan(entity):
    if type(entity) == float:
        if np.isnan(entity):
            return True
    return False

art500k_artists = art500k_artists[~(art500k_artists['artist'].apply(check_if_nan))].reset_index(drop=True)

In [11]:
art500k_artists.to_csv('datasets/saves/art500k_artists_0_4.csv', index=False)
art500k_artists.to_csv('datasets/art500k_artists.csv', index=False)

## 2024.02.15 Minor updates: duplicate combining, filtering
Fix "Artist:", "Painted by", "Modeled by" cases 

In [67]:
duplicates = art500k_artists[art500k_artists.duplicated(['artist'], keep=False)]
(duplicates.sort_values(by=['artist'])[0:6])

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,PaintingsExhibitedAt,StylesYears,StylesCount,PaintingsExhibitedAtCount,Contemporary,Type
3838,Rafael Lozano-Hemmer,,,,,,,,,2010.0,2011.0,,,,,Yes,
7302,Rafael Lozano-Hemmer,,,{Contemporary art:1},,,,,,2010.0,2010.0,,,,,Yes,


In [68]:
art500k_artists = helper_functions.art500k_combine_duplicates(art500k_artists)
art500k_artists

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,PaintingsExhibitedAt,StylesYears,StylesCount,PaintingsExhibitedAtCount,Contemporary,Type
0,Gustave Courbet,French,,{Realism:272},"Rembrandt,Caravaggio,Diego Velazquez,Peter Pau...","Edouard Manet,Claude Monet,Pierre-Auguste Reno...",,,,1830.0,1877.0,"London, Montpellier, Moscow, CA, UK, Norway, D...","Realism:1835-1877,Romanticism:1830-1849","{Realism:257}, {Romanticism:13}","{France:88},{Switzerland:7},{Lille:8},{Paris:4...",No,Painting/Sculpture
1,Auguste Rodin,French,,"{Modern art:3},{Impressionism:91}","Michelangelo,Donatello,","Georgia O'Keeffe,Man Ray,Aristide Maillol,Olex...","Constantin Brancusi,",,,1865.0,1985.0,"London, CA, UK, Switzerland, Lisbon, US, Germa...",Impressionism:1865-1905,{Impressionism:90},"{France:52},{Paris:15},{Brussels:2},{Belgium:1...",,Painting/Sculpture
2,Frida Kahlo,Mexican,,"{Naïve Art (Primitivism),Surrealism:99}","Amedeo Modigliani,Diego Rivera,Jose Clemente O...","Judy Chicago,Georgia O'Keeffe,Feminist Art,",,,,1922.0,1954.0,"CA, LA, New York, US, New Orleans, Washington ...","Naïve Art (Primitivism):1922-1954,Surrealism:1...","{Naïve Art (Primitivism):99}, {Surrealism:15}","{Mexico:50},{San Francisco:6},{New York:4},{Me...",No,Painting/Sculpture
3,Banksy,,,,,,,,,2011.0,2011.0,"Los Angeles, London, UK, Palestine, California...",,,"{Palestine:1},{Los Angeles:3},{California:3},{...",Yes,Painting/Sculpture
4,El Greco,"Spanish,Greek",Cretan School,"{Spanish Renaissance:1},{Renaissance:2},{Manne...","Byzantine Art,","Expressionism,Cubism,Eugene Delacroix,Edouard ...",,"Titian,","Giulio Clovio,",1568.0,1614.0,"Seville, London, Illescas, Romania, Moscow, Gr...",Mannerism (Late Renaissance):1568-1600,"{Renaissance:2}, {XVI CenturySpanish Painting:...","{Spain:75},{Boston:1},{MA:1},{US:27},{Museo de...",No,Painting/Sculpture
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21019,Édouard Debat-Ponsan,French,,"{Academic art:1},{Academic Art:11}",,,,,,1876.0,1902.0,,Academicism:1876-1902,{Academicism:11},,No,
21020,Juan de Valdés Leal,Spanish,,{Baroque:17},"Virgin-Mary,Christianity,Christianity,saints-a...","Museo del Prado, Madrid, Spain,Museum of Fine ...",,,Artists2/Juan De Valdes Leal/The Imaculate Con...,1650.0,1700.0,"Seville, US, St. Louis, Russia, Saint Petersbu...",,{Baroque:17},"{St. Louis:1},{MO:1},{US:1},{Seville:4},{Spain...",No,
21021,Park Seo Bo,South Korean,,"{Korean Informel ,Dansaekhwa (Korean Monochrom...",,,,,,1968.0,2007.0,Korea,Minimalism:1968-2007,{Minimalism:18},{Korea:1},Yes,
21022,Albrecht Durer,German,German School,"{Northern Renaissance:856},{German Renaissance...","Andrea Mantegna,Rogier van der Weyden,","Raphael,Titian,Parmigianino,Jacopo Bassano,Bar...",,"Martin Schongauer,","Raphael,Giovanni Bellini,Leonardo da Vinci,Jan...",1481.0,1588.0,"Basel,London, Weimar, Frankfurt, Germany, Ber...",Northern Renaissance:1481-1528,"{Northern Renaissance:840},{Renaissance:1},{Fl...","{Berlin:54},{Germany:138},{Albertina:101},{Vie...",No,


In [69]:
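artist_word_case = []
by_word_case = []
after_word_case = []
copy_word_case = []
for painter in art500k_artists['artist']:
    if ("by " in painter.lower()):
        by_word_case.append(painter)
    if "artist" in painter.lower():
        artist_word_case.append(painter)
    if "after" in painter.lower(): #Assumed rule: the original after_word_case list was built in a cell not shown here, so the count may differ
        after_word_case.append(painter)
    if "copy" in painter.lower():
        copy_word_case.append(painter)

len(artist_word_case), len(by_word_case), len(after_word_case), len(copy_word_case)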


(180, 93, 424, 15)

In [70]:
for painter in artist_word_case:
    if "Artist: " in painter:
        if painter in ["Artist: Falsely attributed to Hon'ami Koho (Kuchu)", 'Artist: Decoration attributed to Kano Tangen',]:
            art500k_artists = art500k_artists[art500k_artists['artist'] != painter]
            continue
        new_name = painter.replace("Artist: ", "")
        #First, check if the new name is already in the dataset
        if new_name in art500k_artists['artist'].tolist():
            art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, new_name, painter)
        else:
            art500k_artists.loc[art500k_artists['artist'] == painter, "artist"] = new_name

    elif painter not in ['Street Artist RIP 5 Points', 'GCC Collective (collective of 8 artists)',]:
        art500k_artists = art500k_artists[art500k_artists['artist'] != painter]

In [71]:
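art500k_artists = art500k_artists.reset_index(drop=True) #Assign the result, otherwise the reset index is only displayed and not kept
art500k_artists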

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,PaintingsExhibitedAt,StylesYears,StylesCount,PaintingsExhibitedAtCount,Contemporary,Type
0,Gustave Courbet,French,,{Realism:272},"Rembrandt,Caravaggio,Diego Velazquez,Peter Pau...","Edouard Manet,Claude Monet,Pierre-Auguste Reno...",,,,1830.0,1877.0,"London, Montpellier, Moscow, CA, UK, Norway, D...","Realism:1835-1877,Romanticism:1830-1849","{Realism:257}, {Romanticism:13}","{France:88},{Switzerland:7},{Lille:8},{Paris:4...",No,Painting/Sculpture
1,Auguste Rodin,French,,"{Modern art:3},{Impressionism:91}","Michelangelo,Donatello,","Georgia O'Keeffe,Man Ray,Aristide Maillol,Olex...","Constantin Brancusi,",,,1865.0,1985.0,"London, CA, UK, Switzerland, Lisbon, US, Germa...",Impressionism:1865-1905,{Impressionism:90},"{France:52},{Paris:15},{Brussels:2},{Belgium:1...",,Painting/Sculpture
2,Frida Kahlo,Mexican,,"{Naïve Art (Primitivism),Surrealism:99}","Amedeo Modigliani,Diego Rivera,Jose Clemente O...","Judy Chicago,Georgia O'Keeffe,Feminist Art,",,,,1922.0,1954.0,"CA, LA, New York, US, New Orleans, Washington ...","Naïve Art (Primitivism):1922-1954,Surrealism:1...","{Naïve Art (Primitivism):99}, {Surrealism:15}","{Mexico:50},{San Francisco:6},{New York:4},{Me...",No,Painting/Sculpture
3,Banksy,,,,,,,,,2011.0,2011.0,"Los Angeles, London, UK, Palestine, California...",,,"{Palestine:1},{Los Angeles:3},{California:3},{...",Yes,Painting/Sculpture
4,El Greco,"Spanish,Greek",Cretan School,"{Spanish Renaissance:1},{Renaissance:2},{Manne...","Byzantine Art,","Expressionism,Cubism,Eugene Delacroix,Edouard ...",,"Titian,","Giulio Clovio,",1568.0,1614.0,"Seville, London, Illescas, Romania, Moscow, Gr...",Mannerism (Late Renaissance):1568-1600,"{Renaissance:2}, {XVI CenturySpanish Painting:...","{Spain:75},{Boston:1},{MA:1},{US:27},{Museo de...",No,Painting/Sculpture
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20929,Augustus Saint-Gaudens,,,"{Neoclassicism:1},{American Renaissance:1}",,,,,,1872.0,1926.0,"Salem, Ohio, New Hampshire, Cornish, United St...",,,"{Salem:1},{Ohio:1},{United States:4},{Cornish:...",No,
20930,John Henry Twachtman,American,"Society of American Artists,Ten (Ten American ...","{American Impressionism:4},{Modern art:4},{Imp...",,,"Leon Kroll,",,,1873.0,1902.0,"CA, Cincinnati, Connecticut, Ohio, Spain, MO, ...","Impressionism:1874-1902,Tonalism:1885-1901,Rea...","{Impressionism:225}, {Tonalism:25}, {Realism:8}","{Cincinnati:1},{Ohio:1},{United States:2},{Gre...",No,
20931,Henry Wolf,,,"{Modern art:39},{Contemporary art:1},{Realism:1}",,,,,,,,,,,,,
20932,Eiraku Hozen,,,,,,,,,,,Japan,,{Kyoto ware:13},{Japan:1},,


In [72]:
for painter in by_word_case:
    if " by" not in painter and painter not in ['By Martin Carlin','Issued by William Byrd III', 'Photograph By David Messent']:
        continue #A trick that works for now
    else:
        if "By " in painter and painter!= 'Issued by William Byrd III':
            pre_text, name = painter.split("By ")
        elif " by" in painter:
            try:
                pre_text, name = painter.split("by ")
            except ValueError: #Two "by " occurrences give three parts; the middle part is kept as the name
                pre_text, name, name2 = painter.split("by ") #Multiple "by" cases, this is not relevant now and not so nice
        
        if name in art500k_artists['artist'].tolist():
            art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, name, painter)
        else:
            art500k_artists.loc[art500k_artists['artist'] == painter, "artist"] = name

In [73]:
art500k_artists[art500k_artists['artist'].isin(by_word_case)]['artist']

2332     Colonel William Willoughby Hooper
5017                Thomas Kirby Van Zandt
6518                        Baby Guerrilla
7398                      Paul Sandby Munn
10252                          KIRBY ROXAS
11535                         Digby Morton
11748                           Ruby James
16385                          Nobby Clark
20026                 Walter Darby Bannard
Name: artist, dtype: object

Perfect: the only remaining matches are artists whose actual names contain "by".
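
The `" by"` check above is described as a trick that works for now. A word-boundary regular expression is one possible alternative: it matches "by" only as a standalone word, so names such as "Digby Morton" are left alone without a hand-written exception list. This is a sketch of that idea rather than the method used in this notebook, and `name_after_by` is a hypothetical helper name.

```python
import re

# "by" as a standalone word followed by whitespace, in any capitalisation
_BY_WORD = re.compile(r"\bby\s+", re.IGNORECASE)


def name_after_by(label):
    """Return the text after the first standalone 'by', or None if there is none."""
    match = _BY_WORD.search(label)
    if match is None:
        return None
    return label[match.end():].strip()


for label in ["Photograph By David Messent", "Issued by William Byrd III",
              "Digby Morton", "Baby Guerrilla"]:
    print(repr(label), "->", repr(name_after_by(label)))
```

Labels with several "by" parts return everything after the first one, so they would still need a manual look.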

In [74]:
copy_word_case

['Artist: Copy after Katsushika Hokusai',
 'Hieronymus Bosch (copy)',
 'Lucas van Leyden (copy)',
 'Jan van Eyck (copy)',
 'Artist: Copy after Utagawa Hiroshige',
 'London and Continental Photographic Copying Company',
 'copy after Jan Boskam',
 'copy after Adriaan Waterloos',
 'copy after Wouter Muller',
 'copy after Jacques Jonghelinck',
 'possibly copy after Jan van Halbeeck',
 'Copy of Zurbarán',
 'A copy of Jan van Goyen’s  painting',
 'copy after Jan Luder',
 'Copy of Murillo']

We have overlapping cases ("Artist: Copy after..."), which were already renamed in the "Artist: " step above. The easiest way to handle this is to simply rebuild the list of copy cases.

In [79]:
copy_word_case = []
for painter in art500k_artists['artist']:
    if "copy" in painter.lower():
        copy_word_case.append(painter)

In [88]:
copy_word_case

['Copy after Katsushika Hokusai',
 'Hieronymus Bosch (copy)',
 'Lucas van Leyden (copy)',
 'Jan van Eyck (copy)',
 'Copy after Utagawa Hiroshige',
 'London and Continental Photographic Copying Company',
 'copy after Jan Boskam',
 'copy after Adriaan Waterloos',
 'copy after Wouter Muller',
 'copy after Jacques Jonghelinck',
 'possibly copy after Jan van Halbeeck',
 'Copy of Zurbarán',
 'A copy of Jan van Goyen’s  painting',
 'copy after Jan Luder',
 'Copy of Murillo']

In [None]:
art500k_artists = art500k_artists[art500k_artists['artist']!='London and Continental Photographic Copying Company']

for painter in copy_word_case:
    if painter == 'London and Continental Photographic Copying Company':
        continue
    if "copy after" in painter.lower():
        pre_text, name = painter.split("after ")
    elif "(copy)" in painter:
        #The name comes before " (copy)"; the part after it is empty
        name, post_text = painter.split(" (copy)")
    elif "copy of" in painter.lower():
        #Note: 'A copy of Jan van Goyen’s  painting' gives 'Jan van Goyen’s  painting' and still needs a manual fix
        pre_text, name = painter.split("of ")
    else:
        continue #No known pattern: skip instead of reusing the name from the previous iteration

    if name in art500k_artists['artist'].tolist():
        art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, name, painter)
    else:
        art500k_artists.loc[art500k_artists['artist'] == painter, "artist"] = name
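
The three label patterns handled above ("copy after X", "X (copy)", "copy of X") could also be expressed as regular expressions, which makes the extraction easy to test on its own. This is a sketch under that assumption; `name_from_copy_label` is a hypothetical helper and not part of `helper_functions.py`.

```python
import re

_COPY_PATTERNS = [
    re.compile(r"^(?P<name>.+?)\s*\(copy\)$", re.IGNORECASE),        # "X (copy)"
    re.compile(r"\bcopy (?:after|of) (?P<name>.+)$", re.IGNORECASE),  # "copy after X", "Copy of X"
]


def name_from_copy_label(label):
    """Return the artist name embedded in a copy label, or None if no pattern matches."""
    for pattern in _COPY_PATTERNS:
        match = pattern.search(label)
        if match:
            return match.group("name").strip()
    return None


for label in ["Hieronymus Bosch (copy)", "possibly copy after Jan van Halbeeck",
              "Copy of Murillo", "London and Continental Photographic Copying Company"]:
    print(repr(label), "->", repr(name_from_copy_label(label)))
```

Labels such as "A copy of Jan van Goyen’s  painting" would still return trailing text after the name and need manual cleaning.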

In [None]:
art500k_artists.to_csv('datasets/saves/art500k_artists_0_4.csv', index=False)
art500k_artists.to_csv('datasets/art500k_artists.csv', index=False)

## 2024.02.15: Re-do location data

In [None]:
art500k_artists = art500k_artists.rename(columns={'Places':'PaintingsExhibitedAt', 'PlacesCount':'PaintingsExhibitedAtCount'})
art500k_artists.drop(columns=["PlacesYears"], inplace=True)

In [10]:
art500k_artists

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,PaintingsExhibitedAt,StylesYears,StylesCount,PaintingsExhibitedAtCount,Contemporary,Type
0,Gustave Courbet,French,,{Realism:272},"Rembrandt,Caravaggio,Diego Velazquez,Peter Pau...","Edouard Manet,Claude Monet,Pierre-Auguste Reno...",,,,1830.0,1877.0,"London, Montpellier, Moscow, CA, UK, Norway, D...","Realism:1835-1877,Romanticism:1830-1849","{Realism:257}, {Romanticism:13}","{France:88},{Switzerland:7},{Lille:8},{Paris:4...",No,Painting/Sculpture
1,Auguste Rodin,French,,"{Modern art:3},{Impressionism:91}","Michelangelo,Donatello,","Georgia O'Keeffe,Man Ray,Aristide Maillol,Olex...","Constantin Brancusi,",,,1865.0,1985.0,"London, CA, UK, Switzerland, Lisbon, US, Germa...",Impressionism:1865-1905,{Impressionism:90},"{France:52},{Paris:15},{Brussels:2},{Belgium:1...",,Painting/Sculpture
2,Frida Kahlo,Mexican,,"{Naïve Art (Primitivism),Surrealism:99}","Amedeo Modigliani,Diego Rivera,Jose Clemente O...","Judy Chicago,Georgia O'Keeffe,Feminist Art,",,,,1922.0,1954.0,"CA, LA, New York, US, New Orleans, Washington ...","Naïve Art (Primitivism):1922-1954,Surrealism:1...","{Naïve Art (Primitivism):99}, {Surrealism:15}","{Mexico:50},{San Francisco:6},{New York:4},{Me...",No,Painting/Sculpture
3,Banksy,,,,,,,,,2011.0,2011.0,"Los Angeles, London, UK, Palestine, California...",,,"{Palestine:1},{Los Angeles:3},{California:3},{...",Yes,Painting/Sculpture
4,El Greco,"Spanish,Greek",Cretan School,"{Spanish Renaissance:1},{Renaissance:2},{Manne...","Byzantine Art,","Expressionism,Cubism,Eugene Delacroix,Edouard ...",,"Titian,","Giulio Clovio,",1568.0,1614.0,"Seville, London, Illescas, Romania, Moscow, Gr...",Mannerism (Late Renaissance):1568-1600,"{Renaissance:2}, {XVI CenturySpanish Painting:...","{Spain:75},{Boston:1},{MA:1},{US:27},{Museo de...",No,Painting/Sculpture
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21020,János Mattis-Teutsch,"Hungarian,Romanian",,"{Art Nouveau:1},{Socialist realism:1},{Abstrac...",,,,,,1909.0,1947.0,,"Constructivism:1925-1930,Abstract Art:1918-192...","{Constructivism:11}, {Abstract Art:61}, {Expre...",,,
21021,Édouard Debat-Ponsan,French,,"{Academic art:1},{Academic Art:11}",,,,,,1876.0,1902.0,,Academicism:1876-1902,{Academicism:11},,No,
21022,Juan de Valdés Leal,Spanish,,{Baroque:17},"Virgin-Mary,Christianity,Christianity,saints-a...","Museo del Prado, Madrid, Spain,Museum of Fine ...",,,Artists2/Juan De Valdes Leal/The Imaculate Con...,1650.0,1700.0,"Seville, US, St. Louis, Russia, Saint Petersbu...",,{Baroque:17},"{St. Louis:1},{MO:1},{US:1},{Seville:4},{Spain...",No,
21023,Park Seo Bo,South Korean,,"{Korean Informel ,Dansaekhwa (Korean Monochrom...",,,,,,1968.0,2007.0,Korea,Minimalism:1968-2007,{Minimalism:18},{Korea:1},Yes,


# End of version 0.3

## 2024.01.24: Look through "|" cases, make minor fixes

The number of cases where the artist name contains a "|" (suggesting multiple artists as possible painters) used to be quite high, but after the previous update, which removed 18000 instances with basically no information, only one such case remains.

In [56]:
url_v_01_11 = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_artists_0_3.csv"
art500k_artists = pd.read_csv(url_v_01_11, dtype={'Type': str})

cases = (art500k_artists[art500k_artists['artist'].str.contains("|", regex=False)]['artist']).unique()
for case in cases:
    print(case)

Albrecht Dürer|Albrecht Dürer
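The `regex=False` argument in the cell above matters: as a regular expression, `"|"` is an alternation between two empty patterns and matches every string. A minimal sketch of the difference, using the standard `re` module instead of pandas and a two-name list made up for illustration:

```python
import re

# Illustrative names only; the real check runs on the 'artist' column
names = ["Albrecht Dürer|Albrecht Dürer", "Rembrandt"]

# As a regular expression, "|" means "empty OR empty", which matches any string
regex_hits = [name for name in names if re.search("|", name)]

# A literal substring test only finds names that really contain the character
literal_hits = [name for name in names if "|" in name]

print(len(regex_hits), len(literal_hits))
# prints 2 1
```

Without the flag, the filter would therefore report every artist as a "|" case.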


In [57]:
art500k_artists_copy = helper_functions.art500k_combine_instances(art500k_artists, "Albrecht Dürer","Albrecht Dürer|Albrecht Dürer")
art500k_artists_copy = helper_functions.art500k_combine_instances(art500k_artists_copy, "Albrecht Dürer","Albrecht D_rer")
art500k_artists_copy = helper_functions.art500k_combine_instances(art500k_artists_copy, "Albrecht Dürer","Albrecht Dürer (German")
art500k_artists_copy = helper_functions.art500k_combine_instances(art500k_artists_copy, "Albrecht Durer","Albrecht Dürer") #For WikiArt, this name is better

In [58]:
art500k_artists_copy.to_csv('datasets/saves/art500k_artists_0_3.csv', index=False)
art500k_artists_copy.to_csv('datasets/art500k_artists.csv', index=False)

Manually added Type for Dürer.

## 2024.01.24 Filter out totally empty rows (aside from name, nationality, contemporary y/n and type) except if the artist is in WikiArt

In [45]:
wikiart_artists = pd.read_csv("datasets/wikiart_artists.csv")

drops = art500k_artists.copy()
drops = drops[(drops.drop(columns=['artist', 'Nationality'])).notna().any(axis=1)==False]
drops2 = drops.copy() #Cannot change dataframe while iterating over it
for artist in drops['artist']:
    if artist in wikiart_artists['artist'].unique():
        drops2 = drops2[drops2['artist'] != artist]

In [48]:
wikiart_artists = pd.read_csv("datasets/wikiart_artists.csv")

drops = art500k_artists.copy()
drops = drops[(drops.drop(columns=['artist', 'Nationality', 'Contemporary', 'Type'])).notna().any(axis=1)==False]
drops2 = drops.copy() #Cannot change dataframe while iterating over it
for artist in drops['artist']:
    if artist in wikiart_artists['artist'].unique():
        drops2 = drops2[drops2['artist'] != artist]

art500k_artists_copy = art500k_artists.copy()
art500k_artists_copy = art500k_artists_copy[~(art500k_artists_copy['artist'].isin(drops2['artist']))].reset_index(drop=True)
art500k_artists_copy

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount,Contemporary,Type
0,Gustave Courbet,French,,{Realism:272},"Rembrandt,Caravaggio,Diego Velazquez,Peter Pau...","Edouard Manet,Claude Monet,Pierre-Auguste Reno...",,,,1830.0,1877.0,"London, Montpellier, Moscow, CA, UK, Norway, D...","France:1841-1876,Switzerland:1844-1874,Lille:1...","Realism:1835-1877,Romanticism:1830-1849","{Realism:257}, {Romanticism:13}","{France:88},{Switzerland:7},{Lille:8},{Paris:4...",No,Painting/Sculpture
1,Auguste Rodin,French,,"{Modern art:3},{Impressionism:91}","Michelangelo,Donatello,","Georgia O'Keeffe,Man Ray,Aristide Maillol,Olex...","Constantin Brancusi,",,,1865.0,1985.0,"London, CA, UK, Switzerland, Lisbon, US, Germa...","France:1865-1889,Paris:1865-1898,CA:1891-1891,...",Impressionism:1865-1905,{Impressionism:90},"{France:52},{Paris:15},{Brussels:2},{Belgium:1...",,Painting/Sculpture
2,Frida Kahlo,Mexican,,"{Naïve Art (Primitivism),Surrealism:99}","Amedeo Modigliani,Diego Rivera,Jose Clemente O...","Judy Chicago,Georgia O'Keeffe,Feminist Art,",,,,1922.0,1954.0,"CA, LA, New York, US, New Orleans, Washington ...","Mexico:1927-1954,San Francisco:1931-1933,Mexic...","Naïve Art (Primitivism):1922-1954,Surrealism:1...","{Naïve Art (Primitivism):99}, {Surrealism:15}","{Mexico:50},{San Francisco:6},{New York:4},{Me...",No,Painting/Sculpture
3,Banksy,,,,,,,,,2011.0,2011.0,"Los Angeles, London, UK, Palestine, California...","London:2011-2011,UK:2011-2011",,,"{Palestine:1},{Los Angeles:3},{California:3},{...",Yes,Painting/Sculpture
4,El Greco,"Spanish,Greek",Cretan School,"{Spanish Renaissance:1},{Renaissance:2},{Manne...","Byzantine Art,","Expressionism,Cubism,Eugene Delacroix,Edouard ...",,"Titian,","Giulio Clovio,",1568.0,1614.0,"Seville, London, Illescas, Romania, Moscow, Gr...","Spain:1577-1599,London:1600-1600,UK:1600-1600,...",Mannerism (Late Renaissance):1568-1600,"{Renaissance:2}, {XVI CenturySpanish Painting:...","{Spain:75},{Boston:1},{MA:1},{US:27},{Museo de...",No,Painting/Sculpture
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21024,Théo van Rysselberghe,Belgian,Les XX,{Post-Impressionism:186},,,,,,1880.0,1926.0,"Belgium, Brussels, Netherlands, Otterlo, Hague...","Otterlo:1890-1890,Netherlands:1890-1920,Amster...","Post-Impressionism:1900-1926,Impressionism:188...","{Post-Impressionism:65}, {Impressionism:34}, {...","{Otterlo:2},{Netherlands:6},{Amsterdam:1},{Utr...",,
21025,János Mattis-Teutsch,"Hungarian,Romanian",,"{Art Nouveau:1},{Socialist realism:1},{Abstrac...",,,,,,1909.0,1947.0,,,"Constructivism:1925-1930,Abstract Art:1918-192...","{Constructivism:11}, {Abstract Art:61}, {Expre...",,,
21026,Édouard Debat-Ponsan,French,,"{Academic art:1},{Academic Art:11}",,,,,,1876.0,1902.0,,,Academicism:1876-1902,{Academicism:11},,No,
21027,Juan de Valdés Leal,Spanish,,{Baroque:17},"Virgin-Mary,Christianity,Christianity,saints-a...","Museo del Prado, Madrid, Spain,Museum of Fine ...",,,Artists2/Juan De Valdes Leal/The Imaculate Con...,1650.0,1700.0,"Seville, US, St. Louis, Russia, Saint Petersbu...",,,{Baroque:17},"{St. Louis:1},{MO:1},{US:1},{Seville:4},{Spain...",No,
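The same filter can probably be written without the Python loop. The sketch below is an assumption-laden alternative, not the code that produced the saved file: `drop_empty_rows`, `keep_names` and `info_exempt` are hypothetical names, and the toy frame is made up. It also decides row by row, whereas the loop above removes every row that shares a name with a dropped artist.

```python
import pandas as pd

def drop_empty_rows(df, keep_names, info_exempt=("artist", "Nationality", "Contemporary", "Type")):
    """Drop rows with no data outside `info_exempt`, unless the artist is in `keep_names`."""
    info_columns = df.columns.difference(list(info_exempt))
    is_empty = df[info_columns].isna().all(axis=1)       # nothing known besides the exempt columns
    is_protected = df["artist"].isin(list(keep_names))   # e.g. artists that also appear in WikiArt
    return df[~is_empty | is_protected].reset_index(drop=True)

# Toy data: B is empty and unprotected, C is empty but protected
toy = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "Nationality": ["French", None, None],
    "FirstYear": [1900.0, None, None],
    "Contemporary": [None, "No", None],
    "Type": [None, None, None],
})
filtered = drop_empty_rows(toy, keep_names=["C"])
print(filtered["artist"].tolist())
# prints ['A', 'C']
```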


In [50]:
art500k_artists_copy.to_csv('datasets/art500k_artists.csv', index=False)
art500k_artists_copy.to_csv('datasets/saves/art500k_artists_0_3.csv', index=False)

# End of version 0.2

## 2024.01.20-23 Remove "Main" from locations (e.g. London, Main -> Main was detected as a separate location)

In [3]:
import numpy as np
import pandas as pd

url_v_01_10 = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_artists_0_2.csv"
art500k_artists = pd.read_csv(url_v_01_10, dtype={'Type': str})

import helper_functions  #From helper_functions.py

The idea: if certain columns (e.g. "Places") contain a given value (e.g. "Main") but not an exception (e.g. "Maine"), a switch function is called to rewrite the row.
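The real `row_contains_values_switch` lives in `helper_functions.py` and is not reproduced in this notebook. As a rough sketch of the idea only, the hypothetical stand-in below guesses at the behaviour from the tests further down: an exception found in any inspected column leaves the whole row untouched. It works on a pandas Series or, as in this demo, a plain dict; `drop_main_suffix` is a toy switch function invented for the example.

```python
def row_contains_values_switch_sketch(row, columns, texts, exceptions=(), switch_function=None):
    """Hypothetical stand-in for helper_functions.row_contains_values_switch."""
    # Only string cells can contain a text; NaN/None cells are skipped
    cells = {column: row[column] for column in columns if isinstance(row[column], str)}
    # An exception anywhere in the inspected columns leaves the row as it is
    if switch_function is None or any(exc in value for value in cells.values() for exc in exceptions):
        return row
    result = row.copy()
    for column, value in cells.items():
        if any(text in value for text in texts):
            result = switch_function(result, column)
    return result


def drop_main_suffix(row, column_name):
    # Toy switch function for the demo only
    changed = row.copy()
    changed[column_name] = changed[column_name].replace(", Main", "")
    return changed


cleaned = row_contains_values_switch_sketch(
    {"Places": "London, Main", "PlacesYears": None},
    columns=["Places", "PlacesYears"], texts=["Main"], exceptions=["Maine"],
    switch_function=drop_main_suffix)
untouched = row_contains_values_switch_sketch(
    {"Places": "Bangor, Maine", "PlacesYears": "Main:1990-2000"},
    columns=["Places", "PlacesYears"], texts=["Main"], exceptions=["Maine"],
    switch_function=drop_main_suffix)
print(cleaned["Places"], "|", untouched["PlacesYears"])
# prints London | Main:1990-2000
```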


In [4]:
#NOTE: this is now placed in the helper_functions.py file
from helper_functions import row_contains_values_switch

#Switch function: #NOTE: A more general, word switch function is now placed in the helper_functions.py file
def switch_function_exclude_main(row_as_series, column_name):
    import re
    row = row_as_series.copy()

    if not isinstance(row_as_series[column_name], str): #For example, if it is NaN (float)
        return row

    if column_name == "Places": 
        row[column_name] = row_as_series[column_name].replace(", Main", "").replace(" Main,","")#Removes "Main" from the middle or end of the list (a leading "Main, " is left as-is)
        if row[column_name] == "Main":#One case can happen: if there is only one place, "Main"
            row[column_name] = ""
    if column_name == "PlacesYears":
        expressions = re.findall(r"Main:\d+-\d+|$", row_as_series[column_name])
        expression = expressions[0] if len(expressions) > 0 else ""
        if expression != "":
            row[column_name] = row_as_series[column_name].replace(","+expression, "").replace(expression+",","")
            if row[column_name] == expression: #If only one place, "Main"
                row[column_name] = ""
    if column_name == "PlacesCount":
        expressions = re.findall(r"\{Main:\d+\}", row_as_series[column_name])
        expression = expressions[0] if len(expressions) > 0 else ""
        if expression != "":
            row[column_name] = row_as_series[column_name].replace(","+expression, "").replace(expression+",","")
            if row[column_name] == expression:
                row[column_name] = ""
    return row

<details><summary><u>Testing:</u></summary>

```python

def test_row_switching():
    df = pd.DataFrame({
        'Places': ['umm, Maine, USA', 'London, Main', 'One, Main, Two', 'Saint-Germaine', 'USA', 'Mains, Main', 'umm, Main, USA'],
        'PlacesYears': ['umm:...,Main:1990-2000,', 'a:_,Main:2-12,b', 'Main:1990-2000', 'Main:1990-2000', 'Main:1990-2000,Maine:1990-2000', 'Mains:1990-2000,Main:1990-2000', 'umm:...,Main:1990-2000,Maine:1990-2000'],
        'PlacesCount': ['{Main:12},{yes:2}', '', '{Other:},{Main:2}', '{Mains:12}', '{Maine:12}', '{Mains:12},{Main:12}', '{Main:12},{yes:2},{Maine:12}']
    })

    df_result = df.apply(lambda row: row_contains_values_switch(row, ['Places', 'PlacesYears', 'PlacesCount'], ['Main'], exceptions=['Maine', 'Germain'], switch_function=switch_function_exclude_main), axis=1)

    assert df_result['Places'][0] == 'umm, Maine, USA' #1) Don't change anything, because Maine is an exception
    assert df_result['Places'][1] == 'London' #2) Remove Main
    assert df_result['Places'][2] == 'One, Two'#2)
    assert df_result['Places'][3] == 'Saint-Germaine' #1)
    assert df_result['Places'][4] == 'USA' #3) Don't change anything, because there is no Main
    assert df_result['Places'][5] == 'Mains' #2)
    assert df_result['Places'][6] == 'umm, Main, USA' #1b) Don't change anything, because Maine is an exception in another column

    assert df_result['PlacesYears'][0] == 'umm:...,Main:1990-2000,'
    assert df_result['PlacesYears'][1] == 'a:_,b'
    assert df_result['PlacesYears'][2] == ''
    assert df_result['PlacesYears'][3] == 'Main:1990-2000'
    assert df_result['PlacesYears'][4] == 'Main:1990-2000,Maine:1990-2000'
    assert df_result['PlacesYears'][5] == 'Mains:1990-2000'
    assert df_result['PlacesYears'][6] == 'umm:...,Main:1990-2000,Maine:1990-2000'

    assert df_result['PlacesCount'][0] == '{Main:12},{yes:2}' #1) Don't change anything, because Maine is an exception in another column
    assert df_result['PlacesCount'][1] == ''
    assert df_result['PlacesCount'][2] == '{Other:}'
    assert df_result['PlacesCount'][3] == '{Mains:12}'
    assert df_result['PlacesCount'][4] == '{Maine:12}'
    assert df_result['PlacesCount'][5] == '{Mains:12}'
    assert df_result['PlacesCount'][6] == '{Main:12},{yes:2},{Maine:12}'
    
    return df_result

test_row_switching()

```

</details>

Now let's use it on the dataset:

In [7]:
art500k_artists_copy = art500k_artists.apply(lambda row: row_contains_values_switch(row,columns = ["Places", "PlacesYears", "PlacesCount"], texts=["Main"], exceptions=["Maine", "am Main","Germain"], switch_function=switch_function_exclude_main), axis=1)

In [10]:
art500k_artists_copy.to_csv("datasets/saves/art500k_artists_0_2.csv", index=False)
art500k_artists_copy.to_csv("datasets/art500k_artists.csv", index=False)

## Update 2024.01.18: Add a few more cases to help combine with WikiArt

(see *datasets_notebook.ipynb* 2024.01.16- update)

In [6]:
import numpy as np
import pandas as pd

url_v_01_10 = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/art500k_artists_0_2.csv"
art500k_artists = pd.read_csv(url_v_01_10, dtype={'Type': str})

In [7]:
pairs = {
    "Juan Carreño de Miranda": "Juan Carreno De Miranda",
    'Albert Ràfols-Casamada': 'Albert Rafols Casamada',
    'Francisco De Zurbaran': 'Francisco de Zurbarán',
    'Andrés de Santa Maria': 'Andres De Santa Maria', 
    'Jean-Honoré Fragonard':'Jean Honore Fragonard',
    'Théo van Rysselberghe': 'Theo Van Rysselberghe',
    'János Mattis-Teutsch': 'Janos Mattis Teutsch',
    'Édouard Debat-Ponsan': 'Edouard Debat Ponsan',
    'Juan de Valdés Leal': 'Juan De Valdes Leal',
    'Park Seo Bo': 'Park Seo-bo'
}
for key, value in pairs.items():
    art500k_artists = helper_functions.art500k_combine_instances(art500k_artists, key, value)
art500k_artists[-10:]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount,Contemporary,Type
39983,Juan Carreño de Miranda,Spanish,,{Baroque:26},,,,,,1656.0,1684.0,"Valenciennes, Museo del Prado, Austria, Budape...","France:1666-1666,Museo del Prado:1680-1680,Mad...",Baroque:1656-1684,{Baroque:26},"{Valenciennes:1},{France:2},{Museo del Prado:1...",No,Painting/Sculpture
39984,Albert Ràfols-Casamada,Spanish,,{Art Informel:28},,,,,,1858.0,2004.0,,,Art Informel:1858-2004,{Art Informel:28},,,
39985,Francisco De Zurbaran,Spanish,,{Baroque:96},"Caravaggio,","Gustave Courbet,",,"Francisco Pacheco,",,1625.0,1664.0,"Hungary, Museo del Prado, Paris, Barcelona, B...","Grenoble:1626-1640,France:1626-1661,Seville:16...",Baroque:1625-1664,{Baroque:94},"{Grenoble:7},{France:19},{Seville:31},{Spain:3...",No,
39986,Andrés de Santa Maria,Colombian,,{Impressionism:10},"Jean-Francois Millet,Gustave Courbet,",,,,,1894.0,1942.0,,,Impressionism:1894-1942,{Impressionism:10},,,
39987,Jean-Honoré Fragonard,French,,"{Rococo:72},{Renaissance:1}",,,,,,1750.0,1790.0,"Netherlands, Paris,London, Pasadena, Moscow, ...","France:1753-1782,Paris:1765-1778,Russia:1760-1...",Rococo:1750-1790,{Rococo:70},"{France:21},{Paris:8},{Moscow:1},{Russia:3},{S...",No,Painting/Sculpture
39988,Théo van Rysselberghe,Belgian,Les XX,{Post-Impressionism:186},,,,,,1880.0,1926.0,"Belgium, Brussels, Netherlands, Otterlo, Hague...","Otterlo:1890-1890,Netherlands:1890-1920,Amster...","Post-Impressionism:1900-1926,Impressionism:188...","{Post-Impressionism:65}, {Impressionism:34}, {...","{Otterlo:2},{Netherlands:6},{Amsterdam:1},{Utr...",,
39989,János Mattis-Teutsch,"Hungarian,Romanian",,"{Art Nouveau:1},{Socialist realism:1},{Abstrac...",,,,,,1909.0,1947.0,,,"Constructivism:1925-1930,Abstract Art:1918-192...","{Constructivism:11}, {Abstract Art:61}, {Expre...",,,
39990,Édouard Debat-Ponsan,French,,"{Academic art:1},{Academic Art:11}",,,,,,1876.0,1902.0,,,Academicism:1876-1902,{Academicism:11},,No,
39991,Juan de Valdés Leal,Spanish,,{Baroque:17},"Virgin-Mary,Christianity,Christianity,saints-a...","Museo del Prado, Madrid, Spain,Museum of Fine ...",,,Artists2/Juan De Valdes Leal/The Imaculate Con...,1650.0,1700.0,"Seville, US, St. Louis, Russia, Saint Petersbu...",,,{Baroque:17},"{St. Louis:1},{MO:1},{US:1},{Seville:4},{Spain...",No,
39992,Park Seo Bo,South Korean,,"{Korean Informel ,Dansaekhwa (Korean Monochrom...",,,,,,1968.0,2007.0,Korea,,Minimalism:1968-2007,{Minimalism:18},{Korea:1},Yes,
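The pairs above were listed by hand. Most of them differ only in accents, hyphens and capitalisation, so one possible way to surface such candidates automatically is to compare names through a normalised key. This is a sketch of that idea, not a method used in this notebook; `ascii_key` is a hypothetical helper built on the standard `unicodedata` module.

```python
import unicodedata

def ascii_key(name):
    """Lower-cased, accent-free key with hyphens treated as spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks that NFKD splits off from accented letters
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.replace("-", " ").lower().split())

print(ascii_key("Jean-Honoré Fragonard") == ascii_key("Jean Honore Fragonard"))
# prints True
print(ascii_key("Park Seo Bo") == ascii_key("Park Seo-bo"))
# prints True
```

Names whose keys collide would still need a manual check before being merged, since distinct artists can share a normalised name.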


In [9]:
art500k_artists.to_csv('datasets/art500k_artists.csv', index=False)
art500k_artists.to_csv('datasets/saves/art500k_artists_0_2.csv', index=False)

## Update 2024.01.13-15: Add Contemporary and Type (profession) columns, and start removing unknown painters such as masters

In [41]:
art500k_artists['Contemporary'] = None
art500k_artists['Type'] = None
for index, artist in art500k_artists.iterrows():
    if pd.notnull(artist["LastYear"]):
        if artist["LastYear"] >= 2000:
            art500k_artists.loc[index, 'Contemporary'] = "Yes"
        elif artist["LastYear"] < 1980:
            art500k_artists.loc[index, 'Contemporary'] = "No"

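#Hard-coded index ranges. Note: .loc slices include both endpoints, so row 2900 is assigned twice below and ends up as "Sculpture"
art500k_artists.loc[0:1730, "Type"] = "Painting/Sculpture"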
art500k_artists.loc[1731:2760, "Type"] = "Graffiti"; art500k_artists.loc[1731:2760, "Contemporary"] = "Yes"
art500k_artists.loc[2761:2900, "Type"] = "Design/Photography/Miscellaneous"
art500k_artists.loc[2900:2902, "Type"] = "Sculpture"; art500k_artists.loc[2900:2902, "Contemporary"] = "No"
art500k_artists.loc[2903:2907, "Type"] = "Painting"; art500k_artists.loc[2903:2907, "Contemporary"] = "No"
art500k_artists.loc[2908:2908, "Type"] = "Photography"; art500k_artists.loc[2908:2908, "Contemporary"] = "No"
art500k_artists.loc[2909:2911, "Type"] = "Design"; art500k_artists.loc[2909:2911, "Contemporary"] = "Yes"
art500k_artists.loc[2912:2912, "Type"] = "Painting"; art500k_artists.loc[2912:2912, "Contemporary"] = "No"
art500k_artists.loc[2913:2913, "Type"] = "Engraver/Miscellaneous"; art500k_artists.loc[2913:2913, "Contemporary"] = "No"
art500k_artists.loc[2914:2914, "Type"] = "Graphic Design"
art500k_artists.loc[2915:2915, "Type"] = "Graffiti"; art500k_artists.loc[2915:2915, "Contemporary"] = "Yes"
art500k_artists.loc[2916:2916, "Type"] = "Miscellaneous"; art500k_artists.loc[2916:2916, "Contemporary"] = "Yes"
art500k_artists.loc[2917:2917, "Type"] = "Painting"; art500k_artists.loc[2917:2917, "Contemporary"] = "Yes"
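The year rule in the loop above can be read as a small function. The sketch below only restates those thresholds; `contemporary_label` is a hypothetical name, and it could be applied with something like `art500k_artists["LastYear"].map(contemporary_label)` instead of `iterrows`.

```python
import math

def contemporary_label(last_year):
    """Restates the rule above: >= 2000 is 'Yes', < 1980 is 'No', otherwise undecided."""
    if last_year is None or (isinstance(last_year, float) and math.isnan(last_year)):
        return None  # no LastYear recorded
    if last_year >= 2000:
        return "Yes"
    if last_year < 1980:
        return "No"
    return None  # 1980-1999 is deliberately left unlabelled

print(contemporary_label(2011.0), contemporary_label(1877.0), contemporary_label(1985.0))
# prints Yes No None
```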

In [45]:
art500k_artists[art500k_artists['artist'].str.contains("Master")]['artist']

732                                         Bedford Master
757                                    Master of Frankfurt
956                                           Master E. S.
1020                    Master of the Virgo inter Virgines
1059                                     Master of Alkmaar
                               ...                        
36895                              Second Master of Bierge
37375                                     Master of Pedret
37655                                      Budapest Master
38569    Italian 16th Century or Master of the Victoria...
39106                                       Master Francke
Name: artist, Length: 290, dtype: object

Remove masters

In [63]:
masters = art500k_artists[art500k_artists['artist'].str.contains("Master") | art500k_artists['artist'].str.contains("master")]['artist']
masters_list = masters.to_list()
masters_list.remove('Master Francke') #Keep Master Francke in the dataset
art500k_artists = art500k_artists[~(art500k_artists['artist'].isin(masters_list))].reset_index(drop=True)


In [64]:
art500k_artists.to_csv('datasets/art500k_artists.csv', index=False)
art500k_artists.to_csv('datasets/saves/art500k_artists_0_2.csv', index=False)

# End of version 0.1 of the Art500k artists dataset

## Update 2024.01.12-13

A minor artist change to test the instance combination method, plus removal of quotation marks (") from artist names.

In [28]:
art500k_artists[art500k_artists['artist'].str.contains("Gustavo Dall")]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount
1314,Gustavo Dall'Ara,,,,,,,,,1875.0,1923.0,,,,,
36434,Gustavo Dall'ara,,,,,,,,,1910.0,1913.0,,,,,


In [None]:
art500k_modified = helper_functions.art500k_combine_instances(df=art500k_artists, primary_artist_name="Gustavo Dall'Ara", secondary_artist_name="Gustavo Dall'ara")
art500k_modified[art500k_modified['artist'].str.contains("Gustavo Dall")]

In [30]:
art500k_artists = art500k_modified
art500k_modified['artist'].to_csv('art500k_artists.txt', sep=";" , index=False)
art500k_modified.to_csv('datasets/saves/art500k_artists_0_1.csv', index=False)

In [32]:
art500k_artists[art500k_artists['artist'].str.contains('"')]['artist']


1836                                "CHRIS ""DAZE"" ELLIS"
8069                         "Alejandro ""Mono"" González"
8222                                        "Nemi ""UHU"""
8902                            """Rafael Lozano-Hemmer"""
12257          "Giovanni Battista Trotti (""Il Malosso"")"
12567    "Richard Cosway|Mary ""Perdita"" Robinson|Will...
12626         "Giovanni Battista Discepoli (""Il Zoppo"")"
12722           "Giovanni Battista Crespi (""Il Cerano"")"
12742    "Bernardino Rodriguez (""Bernardino Siciliano"")"
13053    "Francesco Monti (""Il Brescianino delle Batta...
13069                             "John ""Warwick"" Smith"
13449         "Giorgio di Giovanni (""Giorgio da Siena"")"
13855              "Giovanni Battista (""Titta"") Lusieri"
13881                   "Giovanni Balducci (""Il Cosci"")"
16443        "Hanna Lachert; ""Ład"" Artists’ Cooperative"
20616                                             """TC"""
21332         "Michelangelo Cerruti (""Il Candelotta

In [33]:
art500k_artists['artist'] = art500k_artists['artist'].str.replace('"', '')
art500k_artists['artist'].str.contains('"').sum()

0

In [34]:
art500k_artists['artist'].to_csv('art500k_artists.txt', sep=";" , index=False)
art500k_artists.to_csv('datasets/saves/art500k_artists_0_1.csv', index=False)

In [3]:
art500k_artists[art500k_artists['artist'].str.contains("Marc Bohan")]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount
1507,Marc Bohan,,,,,,,,,1969.0,1969.0,"Paris, France",,,,"{Paris:2},{France:3}"
24700,Marc Bohan for Christian Dior SE,,,,,,,,,,,"Paris, France",,,,"{Paris:4},{France:4}"


In [6]:
art500k_modified = helper_functions.art500k_combine_instances(df=art500k_artists, primary_artist_name="Marc Bohan", secondary_artist_name="Marc Bohan for Christian Dior SE")
art500k_modified[art500k_modified['artist'].str.contains("Marc Bohan")]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount
40295,Marc Bohan,,,,,,,,,1969.0,1969.0,"Paris, France",,,,"{Paris:6},{France:7}"
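In the output above, the two `PlacesCount` strings appear to have been summed per place (`{Paris:2}` and `{Paris:4}` give `{Paris:6}`). How `art500k_combine_instances` does this internally is not shown here; the sketch below is just one way such count strings could be merged, with `merge_counts` being a hypothetical name.

```python
import re
from collections import Counter

def merge_counts(first, second):
    """Sum two count strings of the form '{Paris:2},{France:3}'."""
    totals = Counter()
    for text in (first, second):
        # Each entry looks like {key:number}; keys may contain spaces or commas
        for key, count in re.findall(r"\{([^{}]*):(\d+)\}", text):
            totals[key] += int(count)
    return ",".join(f"{{{key}:{count}}}" for key, count in totals.items())

print(merge_counts("{Paris:2},{France:3}", "{Paris:4},{France:4}"))
# prints {Paris:6},{France:7}
```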


In [8]:
art500k_modified.to_csv('datasets/saves/art500k_artists_0_1.csv', index=False)
art500k_modified['artist'].to_csv('art500k_artists.txt', sep=";" , index=False)
art500k_modified.to_csv('datasets/art500k_artists.csv', index=False)

## Update 2024.01.11

One minor change: remove double commas (",,") from the list-like columns, such as StylesYears and PlacesYears.

In [None]:
#Remove double commas
art500k_artists_copy = art500k_artists.copy()
for index, row in art500k_artists.iterrows():
    dict_like_columns = ['ArtMovement', 'StylesCount','PlacesCount']
    years_columns = ['FirstYear','LastYear','PlacesYears','StylesYears']

    for column in dict_like_columns+years_columns:
        column_value = row[column]
        if not isinstance(column_value, str): #NaN or a numeric year (FirstYear, LastYear)
            continue
        values = [x for x in column_value.split(',') if x != '']
        values_one_comma_string = ",".join(values)
        art500k_artists_copy.at[index, column] = values_one_comma_string
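To make the effect of the split-and-join step concrete, here is the same operation on a single string. `collapse_commas` is a hypothetical name and the input values are made up for illustration.

```python
def collapse_commas(text):
    """Drop empty items produced by repeated, leading or trailing commas."""
    return ",".join(part for part in text.split(",") if part != "")

print(collapse_commas("Realism:1835-1877,,Romanticism:1830-1849"))
# prints Realism:1835-1877,Romanticism:1830-1849
print(collapse_commas("France:1841-1876,"))
# prints France:1841-1876
```

Note that the step also strips a leading or trailing comma, not only doubled ones.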

In [184]:
art500k_artists_copy.to_csv("datasets/saves/art500k_artists_0_1.csv", index=False)