# Creating the PainterPalette dataset from paintings datasets

The aim of this project is to create a dataset of painters by combining datasets such as WikiArt and Art500k: merging their features, filling in missing painter data by web scraping (through Google and the Wiki API), and then creating links between painters based on similarity of style, geography, and social interaction.

One long-term goal is a single JSON file that holds all of the combined data hierarchically. For example, the top level could be the art movement; inside each movement are its artists, with base data such as birthplace, years of birth and death, and other geographical data; and inside each artist are the paintings with all of their data. Even better would be to split each painter into eras and place the paintings inside those eras. This structure could then be used to build a network of art movements, artists, and paintings.
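
The nesting described above could look roughly like the following sketch. It is only an illustration under stated assumptions: `build_hierarchy`, the record layout, and the painting titles are hypothetical and not part of the project. The artist rows reuse values from the WikiArt table shown later in this notebook.

```python
import json
from collections import defaultdict

# Artist rows reuse values from the WikiArt table; painting titles are placeholders.
artists = [
    {"artist": "Ad Reinhardt", "movement": "Abstract Expressionism",
     "birth_place": "Buffalo", "birth_year": 1913, "death_year": 1967},
    {"artist": "Alberto Magnelli", "movement": "Abstract Art",
     "birth_place": "Florence", "birth_year": 1888, "death_year": 1971},
]
paintings = [
    {"artist": "Ad Reinhardt", "title": "placeholder-1", "style": "Abstract Art"},
    {"artist": "Ad Reinhardt", "title": "placeholder-2", "style": "Abstract Expressionism"},
    {"artist": "Alberto Magnelli", "title": "placeholder-3", "style": "Cubism"},
]

def build_hierarchy(artists, paintings):
    """Nest paintings under artists, and artists under art movements."""
    tree = defaultdict(dict)
    movement_of = {}
    for row in artists:
        info = {k: v for k, v in row.items() if k not in ("artist", "movement")}
        info["paintings"] = []
        tree[row["movement"]][row["artist"]] = info
        movement_of[row["artist"]] = row["movement"]
    for row in paintings:
        movement = movement_of.get(row["artist"])
        if movement is None:
            continue  # painting whose artist is not in the artist table
        painting = {k: v for k, v in row.items() if k != "artist"}
        tree[movement][row["artist"]]["paintings"].append(painting)
    return dict(tree)

hierarchy = build_hierarchy(artists, paintings)
print(json.dumps(hierarchy, indent=2, ensure_ascii=False))
```

An extra "era" level, as mentioned above, would be one more dictionary between the artist and the list of paintings.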

NEXT STEPS:<br>
- Match more artists who appear in both datasets under different names, and add artists who are in neither dataset.<br>
- Remove non-painters from the datasets.<br>
- Finish and add the Art500k artist dataset updates.<br>
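
For the "remove non-painters" step, one possible filter is sketched below. It assumes that an artist counts as a painter when `painter` appears among the comma-separated values of the `occupations` column; that rule, and the helper name `is_painter`, are assumptions rather than the project's actual criterion. The sample rows are copied from the WikiArt table in this notebook.

```python
import pandas as pd

# Toy rows copied from the WikiArt table shown in this notebook.
sample = pd.DataFrame({
    "artist": ["Adnan Coker", "Wolfgang Tillmans", "Vudon Baklytsky"],
    "occupations": ["painter", "photographer, printmaker", None],
})

def is_painter(occupations):
    """True if 'painter' is one of the comma-separated occupations."""
    if not isinstance(occupations, str):
        return False  # missing value: cannot be decided from this column alone
    return "painter" in [item.strip().lower() for item in occupations.split(",")]

painters = sample[sample["occupations"].apply(is_painter)]
print(painters["artist"].tolist())
```

Rows with a missing `occupations` value are dropped by this sketch; in practice they probably deserve a manual look instead.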


FURTHER STEPS:<br>
- Turn the dataset into JSON format, and add picture data

In [3]:
import pandas as pd
import numpy as np
import helper_functions

## WikiArt data

In [4]:
artists_wikiart = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/wikiart_artists.csv')
artists_wikiart

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,death_year,death_place,gender,citizenship,occupations,locations,locations_with_years
0,Ad Reinhardt,"Abstract Art, Abstract Expressionism, Color Fi...",Abstract Expressionism,"{Abstract Art:15},{Abstract Expressionism:5},{...",52,Buffalo,1913.0,1967.0,New York City,male,United States of America,"painter, university teacher, printmaker, colla...",['New York City'],[]
1,Adnan Coker,"Abstract Art, Abstract Expressionism",Abstract Art,"{Abstract Art:25},{Abstract Expressionism:3}",28,,1927.0,2022.0,,male,,painter,[],[]
2,Akkitham Narayanan,Abstract Art,Abstract Art,{Abstract Art:17},17,Kerala,1939.0,,,male,Dominion of India,"printmaker, painter",[],[]
3,Alberto Magnelli,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...",Abstract Art,"{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...",35,Florence,1888.0,1971.0,Meudon,male,Italy,"illustrator, painter","['Florence', 'Paris']",[]
4,Alekos Kontopoulos,"Abstract Art, Cubism, Expressionism, Post-Impr...",Social Realism,"{Abstract Art:26},{Cubism:5},{Expressionism:10...",79,Lamia,1904.0,1975.0,Athens,male,Greece,"writer, painter",[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3198,Serhij Schyschko,Unknown,Academic Art,{Unknown:9},9,Nosivka,1911.0,1997.0,Kyiv,male,Soviet Union,painter,[],[]
3199,Vudon Baklytsky,Unknown,Soviet Nonconformist Art,{Unknown:46},46,,,,,,,,,
3200,Wolfgang Tillmans,Unknown,Contemporary,{Unknown:9},9,Remscheid,1968.0,,,male,Germany,"photographer, printmaker","['New York City', 'Berlin', 'London']",['New York City:1996-1996']
3201,Wu Daozi,Unknown,Tang Dynasty (618–907),{Unknown:8},8,Chang'an,680.0,740.0,Chang'an,male,Tang dynasty,painter,[],[]


Artists grouped by style:

In [5]:
wa_grouped = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/wikiart_artists_styles_grouped.csv')
print("Length:", len(wa_grouped), "\n", "Number of groups with only 1 count:", (wa_grouped['count'] == 1).sum())
wa_grouped[wa_grouped['artist'].str.contains("Monet")].sort_values(by=['count'], ascending=False)

Length: 7646 
 Number of groups with only 1 count: 1115


Unnamed: 0,style,artist,movement,count
2963,Impressionism,Claude Monet,Impressionism,1341
5468,Realism,Claude Monet,Impressionism,12
7041,Unknown,Claude Monet,Impressionism,12
462,Academicism,Claude Monet,Impressionism,1
3339,Japonism,Claude Monet,Impressionism,1
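
The grouped table has one row per (style, artist) pair, and the `styles_extended` column of the artist table appears to carry the same counts in a `{style:count}` format. A minimal sketch of expanding that format is shown below. Whether `wikiart_artists_styles_grouped.csv` was actually produced this way is an assumption, and `parse_counts` and `explode_styles` are hypothetical helper names.

```python
import re

def parse_counts(cell):
    """Turn '{Abstract Art:25},{Abstract Expressionism:3}' into a dict of counts."""
    pairs = re.findall(r"\{([^{}]*):(\d+)\}", cell)
    return {style.strip(): int(count) for style, count in pairs}

def explode_styles(artist, movement, styles_extended):
    """Yield one (style, artist, movement, count) row per style."""
    for style, count in parse_counts(styles_extended).items():
        yield (style, artist, movement, count)

rows = list(explode_styles("Adnan Coker", "Abstract Art",
                           "{Abstract Art:25},{Abstract Expressionism:3}"))
for row in rows:
    print(row)
```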


## Art500K

First dataset (from official website)

In [7]:
art500k_artists = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/art500k_artists.csv')
art500k_artists[0:8]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,PaintingsExhibitedAt,StylesYears,StylesCount,PaintingsExhibitedAtCount,Contemporary,Type
0,Gustave Courbet,French,,{Realism:272},"Rembrandt,Caravaggio,Diego Velazquez,Peter Pau...","Edouard Manet,Claude Monet,Pierre-Auguste Reno...",,,,1830.0,1877.0,"London, Montpellier, Moscow, CA, UK, Norway, D...","Realism:1835-1877,Romanticism:1830-1849","{Realism:257}, {Romanticism:13}","{France:88},{Switzerland:7},{Lille:8},{Paris:4...",No,Painting/Sculpture
1,Auguste Rodin,French,,"{Modern art:3},{Impressionism:91}","Michelangelo,Donatello,","Georgia O'Keeffe,Man Ray,Aristide Maillol,Olex...","Constantin Brancusi,",,,1865.0,1985.0,"London, CA, UK, Switzerland, Lisbon, US, Germa...",Impressionism:1865-1905,{Impressionism:90},"{France:52},{Paris:15},{Brussels:2},{Belgium:1...",,Painting/Sculpture
2,Frida Kahlo,Mexican,,"{Naïve Art (Primitivism),Surrealism:99}","Amedeo Modigliani,Diego Rivera,Jose Clemente O...","Judy Chicago,Georgia O'Keeffe,Feminist Art,",,,,1922.0,1954.0,"CA, LA, New York, US, New Orleans, Washington ...","Naïve Art (Primitivism):1922-1954,Surrealism:1...","{Naïve Art (Primitivism):99}, {Surrealism:15}","{Mexico:50},{San Francisco:6},{New York:4},{Me...",No,Painting/Sculpture
3,Banksy,,,,,,,,,2011.0,2011.0,"Los Angeles, London, UK, Palestine, California...",,,"{Palestine:1},{Los Angeles:3},{California:3},{...",Yes,Painting/Sculpture
4,El Greco,"Spanish,Greek",Cretan School,"{Spanish Renaissance:1},{Renaissance:2},{Manne...","Byzantine Art,","Expressionism,Cubism,Eugene Delacroix,Edouard ...",,"Titian,","Giulio Clovio,",1568.0,1614.0,"Seville, London, Illescas, Romania, Moscow, Gr...",Mannerism (Late Renaissance):1568-1600,"{Renaissance:2}, {XVI CenturySpanish Painting:...","{Spain:75},{Boston:1},{MA:1},{US:27},{Museo de...",No,Painting/Sculpture
5,Diego Rivera,Mexican,"Mexican Mural Renaissance,La Ruche","{Social Realism,Muralism:146}","Marc Chagall,Robert Delaunay,","Frida Kahlo,Pedro Coronel,Vlady,",,,"Amedeo Modigliani,Saturnino Herran,Roberto Mon...",1904.0,1956.0,"Moscow, CA, Acapulco, New York, Spain, Northam...","Cubism:1912-1916,Muralism:1922-1956,Art Deco:1...","{Post-impressionism:1}, {Cubism:19}, {Mexican ...","{France:1},{Paris:1},{Moscow:1},{Acapulco:2},{...",No,Painting/Sculpture
6,Claude Monet,French,,"{Modern art:3},{Impressionism:1340}","Gustave Courbet,Charles-Francois Daubigny,John...","Childe Hassam,Robert Delaunay,Wassily Kandinsk...",,"Eugene Boudin,Charles Gleyre,","Alfred Sisley,Pierre-Auguste Renoir,Camille Pi...",1858.0,1926.0,"London, Main, Moscow, Rotterdam, Giverny, CA, ...","Impressionist:1879-1904,Impressionism:1864-192...",{Nineteenth-Century European PaintingImpressio...,"{France:79},{Giverny:1},{London:6},{UK:15},{Bo...",No,Painting/Sculpture
7,Francisco Goya,Spanish,,{Romanticism:391},"Albrecht Durer,Diego Velazquez,","Pablo Picasso,Chaim Soutine,Roberto Montenegro...",,"José Luzán,Anton Raphael Mengs,",,1760.0,1828.0,"London, Rotterdam, Museo del Prado, UK, Kunsth...",Romanticism:1760-1828,"{Neoclassicism / Portrait:1}, {Genre:1}, {Roma...","{Spain:168},{Paris:7},{Madrid:96},{Museo del P...",No,Painting/Sculpture
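
The relationship columns in this table (`Influencedby`, `Influencedon`, `Pupils`, `Teachers`, `FriendsandCoworkers`) are comma-separated strings, often with a trailing comma. A sketch of turning one of them into an edge list for the painter network follows; `split_names` and `influence_edges` are hypothetical helpers, and the rows are copied from the output above.

```python
def split_names(cell):
    """Split a comma-separated cell, dropping blanks left by trailing commas."""
    if not isinstance(cell, str):
        return []  # missing value
    return [name.strip() for name in cell.split(",") if name.strip()]

def influence_edges(rows):
    """Return (influencer, influenced) pairs from the 'Influencedby' field."""
    edges = []
    for row in rows:
        for source in split_names(row.get("Influencedby")):
            edges.append((source, row["artist"]))
    return edges

rows = [
    {"artist": "Auguste Rodin", "Influencedby": "Michelangelo,Donatello,"},
    {"artist": "Banksy", "Influencedby": None},
]
edges = influence_edges(rows)
print(edges)
```

Two caveats: a name that itself contains a comma would be split in two, and these columns also hold movements (for example `Byzantine Art`), so not every node is a painter.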


# Combining the two datasets

## TODO (new version): Try to gather Wikidata data for Art500k artists that are not in the WikiArt dataset

## Version 2024.02.20 (0.3.): Same as before, minor column name change

In [6]:
import json
import requests

response = requests.get('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/painter_name_pairs.json')
mapping = json.loads(response.text); mapping_keys = list(mapping.keys())
artists_c = helper_functions.create_painter_dataset_from_mapping(mapping) #Function was changed from before
artists_wikiart = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/wikiart_artists.csv')
# Rename on the filtered result instead of in place, which avoids pandas' SettingWithCopyWarning
additional_artists = artists_wikiart[~(artists_wikiart['artist'].isin(mapping_keys))].rename(columns={'pictures_count':'wikiart_pictures_count'})
artists = pd.concat([artists_c, additional_artists], ignore_index=True)


In [9]:
artists[50:60]

Unnamed: 0,artist,Nationality,citizenship,gender,styles,movement,Art500k_Movements,birth_place,death_place,birth_year,...,occupations,PaintingsExhibitedAt,PaintingsExhibitedAtCount,PaintingSchool,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,Contemporary
50,Richard Pousette-Dart,American,United States of America,male,"Abstract Art, Abstract Expressionism, Academicism",Abstract Art,{Abstract Expressionism:54},Saint Paul,Rockland County,1916.0,...,"photographer, painter, drawer","NY, New York City, US","{New York City:2},{NY:2},{US:2}","New York School,Irascibles",,,,,,
51,Ethel Léontine Gabain,"French,British",United Kingdom,female,Neo-Romanticism,Neo-Romanticism,,Le Havre,London,1883.0,...,"lithographer, painter","London, Manchester, UK","{London:2},{UK:3},{Manchester:1}",,,,,,,No
52,Charles-Amable Lenoir,,France,male,"Academicism, Unknown",Academic Art,{Academic Art:9},Châtelaillon-Plage,Paris,1860.0,...,painter,,,,,,,,,
53,Francisco de Zurbaran,Spanish,Spain,male,"Baroque, Unknown",Baroque,{Baroque:96},Fuente de Cantos,Madrid,1598.0,...,painter,"Hungary, Museo del Prado, Paris, Barcelona, B...","{Grenoble:7},{France:19},{Seville:31},{Spain:3...",,"Caravaggio,","Gustave Courbet,",,"Francisco Pacheco,",,No
54,Pieter van Hanselaere,Belgian,Belgium,male,Neoclassicism,Neoclassicism,{Neoclassicism:8},Ghent,Ghent,1786.0,...,painter,"Netherlands, Amsterdam","{Amsterdam:2},{Netherlands:2}",,,,,"Jacques-Louis David,",,No
55,Jean-Honore Fragonard,French,France,male,"Rococo, Unknown",Rococo,"{Rococo:72},{Renaissance:1}",Grasse,Paris,1732.0,...,"illustrator, painter, printmaker, architectura...","Netherlands, Paris,London, Pasadena, Moscow, ...","{France:21},{Paris:8},{Moscow:1},{Russia:3},{S...",,,,,,,No
56,Ion Theodorescu-Sion,Romanian,Romania,male,"Art Nouveau (Modern), Impressionism, Post-Impr...",Post-Impressionism,{Post-Impressionism:43},Ianca,Bucharest,1882.0,...,"trade unionist, caricaturist, painter",,,Balchik School,,,,,,No
57,Janos Mattis-Teutsch,"Hungarian,Romanian",Romania,male,"Abstract Art, Constructivism, Cubism, Expressi...",Constructivism,"{Art Nouveau:1},{Socialist realism:1},{Abstrac...",Brașov,Brașov,1884.0,...,"writer, poet, painter, sculptor, journalist",,,,,,,,,
58,Apollinary Goravsky,"Belarusian,Russian",Russian Empire,male,Romanticism,Romanticism,{Romanticism:12},Novyja Nabarki,Mariinskaya Hospital,1833.0,...,painter,"Russia, Moscow, Saint Petersburg, Minsk, Belarus","{Minsk:7},{Belarus:7},{Saint Petersburg:2},{Ru...",,"Belarusian National Museum of Fine Arts, Minsk...",,,,,No
59,Edouard Debat-Ponsan,French,France,male,Academicism,Academic Art,"{Academic art:1},{Academic Art:11}",Toulouse,Paris,1847.0,...,painter,,,,,,,,,No


In [10]:
artists.to_csv('datasets/artists.csv', index=False)
artists.to_csv('datasets/saves/artists_0_3.csv', index=False)
artists.to_csv('artists.csv', index=False)

## Version 2024.02.15 (0.3.): Same as before, modifications in columns

In [14]:
import json
import requests

response = requests.get('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/painter_name_pairs.json')
mapping = json.loads(response.text); mapping_keys = list(mapping.keys())
artists_c = helper_functions.create_painter_dataset_from_mapping(mapping) #Function was changed from before
artists_wikiart = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/wikiart_artists.csv')
additional_artists = artists_wikiart[~(artists_wikiart['artist'].isin(mapping_keys))]
artists = pd.concat([artists_c, additional_artists], ignore_index=True)

In [16]:
artists.to_csv('datasets/artists.csv', index=False)
artists.to_csv('datasets/saves/artists_0_3.csv', index=False)
artists.to_csv('artists.csv', index=False)

## Version 2024.01.29 (0.2.) : Use the mapping to combine info from the two datasets, and take all other artists from WikiArt

In [45]:
import json
import requests

response = requests.get('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/painter_name_pairs.json')
mapping = json.loads(response.text); mapping_keys = list(mapping.keys())
artists_c = helper_functions.create_painter_dataset_from_mapping(mapping) #OLD VERSION OF THE FUNCTION
artists_wikiart = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/wikiart_artists.csv')
#artists = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/artists.csv')
additional_artists = artists_wikiart[~(artists_wikiart['artist'].isin(mapping_keys))]
artists = pd.concat([artists_c, additional_artists], ignore_index=True)

In [None]:
artists.to_csv('datasets/artists.csv', index=False)
artists.to_csv('datasets/saves/artists_0_2.csv', index=False)
artists.to_csv('artists.csv', index=False)

## Update 2024.01.28 - Create a separate function to combine the two datasets by a mapping of artist names from one set to the other + create the previous update's mapping

The mapping created in the previous update was saved as part of that update.<br>
The new function was added to *helper_functions.py*.
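
The body of `create_painter_dataset_from_mapping` is not shown in this notebook. As a rough idea of what combining by a name mapping can look like, here is a sketch under the assumption that the mapping has the form `{WikiArt name: Art500k name}`. The function `combine_by_mapping` is hypothetical and is not the project's implementation; the toy values are taken from tables in this notebook.

```python
import pandas as pd

def combine_by_mapping(wikiart, art500k, mapping):
    """Join two artist tables via {WikiArt name: Art500k name} pairs."""
    pairs = pd.DataFrame(list(mapping.items()), columns=["artist", "art500k_name"])
    left = pairs.merge(wikiart, on="artist", how="left")
    right = art500k.rename(columns={"artist": "art500k_name"})
    combined = left.merge(right, on="art500k_name", how="left")
    return combined.drop(columns="art500k_name")  # keep the WikiArt spelling

wikiart = pd.DataFrame({"artist": ["Francisco de Zurbaran", "Ad Reinhardt"],
                        "birth_place": ["Fuente de Cantos", "Buffalo"]})
art500k = pd.DataFrame({"artist": ["Francisco De Zurbaran", "Ad Reinhardt"],
                        "Nationality": ["Spanish", "American"]})
mapping = {"Francisco de Zurbaran": "Francisco De Zurbaran",
           "Ad Reinhardt": "Ad Reinhardt"}

combined = combine_by_mapping(wikiart, art500k, mapping)
print(combined)
```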

In [5]:
import json
import requests
from helper_functions import create_painter_dataset_from_mapping

response = requests.get('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/saves/painter_name_pairs.json')
mapping = json.loads(response.text)

artists_c = create_painter_dataset_from_mapping(mapping)
artists_c

Unnamed: 0,artist,Nationality,birth_place,birth_year,styles,styles_extended,StylesYears,StylesCount,PlacesCount,Contemporary,...,FirstYear,LastYear,Places,PlacesYears,PaintingSchool,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers
0,Bracha L. Ettinger,"French,Jewish,Israeli",Tel Aviv,1948.0,New European Painting,{New European Painting:21},New European Painting:1991-2009,{New European Painting:21},,Yes,...,1991.0,2009.0,,,,,,,,
1,William H. Johnson,American,Florence,1901.0,"Cubism, Expressionism, Futurism, Naïve Art (Pr...","{Cubism:1},{Expressionism:24},{Futurism:1},{Na...","Naïve Art (Primitivism):1938-1946,Expressionis...","{Naïve Art (Primitivism):74}, {Expressionism:4...",,No,...,1923.0,1946.0,,,,,,,,
2,Alexey Bogolyubov,Russian,,,"Realism, Romanticism","{Realism:25},{Romanticism:19}","Realism:1860-1889,Romanticism:1845-1860","{Realism:25}, {Romanticism:19}","{Rybinsk:2},{Russia:9},{Saint Petersburg:6},{M...",No,...,1845.0,1889.0,"Saint Petersburg, Rybinsk, Russia, Moscow","Rybinsk:1879-1889,Russia:1850-1889,Saint Peter...",Peredvizhniki (Society for Traveling Art Exhib...,,,,,
3,O. Louis Guglielmi,"American,Egyptian",Cairo,1906.0,"Cubism, Expressionism, Magic Realism","{Cubism:3},{Expressionism:6},{Magic Realism:25}","Magic Realism:1931-1946,Cubism:1946-1954,Expre...","{Magic Realism:25}, {Cubism:8}, {Expressionism:6}",,No,...,1931.0,1955.0,,,,,,,,
4,Mikalojus Konstantinas Ciurlionis,,Varėna,1875.0,Symbolism,{Symbolism:168},,,,No,...,1905.0,1905.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2649,Marianne von Werefkin,,Tula,1860.0,Unknown,{Unknown:61},,,,,...,,,,,,,,,,
2650,Robert Demachy,French,Saint-Germain-en-Laye,1859.0,Unknown,{Unknown:24},,,{France:2},No,...,1900.0,1914.0,France,,,,,,,
2651,Wolfgang Tillmans,,Remscheid,1968.0,Unknown,{Unknown:9},,,"{London:1},{United Kingdom:1}",Yes,...,2001.0,2001.0,"London, United Kingdom",,,,,,,
2652,Wu Daozi,Chinese,Chang'an,680.0,Unknown,{Unknown:8},,,,,...,,,,,Four fathers of Chinese painting,,,,,


Missing cases:

In [9]:
artists = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/artists.csv')
artists[~(artists['artist'].isin(list(artists_c['artist'])))]

Unnamed: 0,artist,Nationality,birth_place,birth_year,styles,styles_extended,StylesYears,StylesCount,PlacesCount,Contemporary,...,FirstYear,LastYear,Places,PlacesYears,PaintingSchool,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers
2458,John Frederick Herring Sr.,,,,Romanticism,{Romanticism:79},,,,,...,,,,,,,,,,
2459,Willem van Swanenburg,,Leiden,1580.0,Baroque,{Baroque:18},,,,,...,,,,,,,,,,
2460,Leon Battista Alberti,,Genoa,1404.0,Early Renaissance,{Early Renaissance:7},,,,,...,,,,,,,,,,
2465,Giuseppe Barberis,,Turin,1517.0,Romanticism,{Romanticism:200},,,,,...,,,,,,,,,,
2476,Mahmud Taghiyev,Azerbaijani,Baku,1923.0,"Expressionism, Realism","{Expressionism:16},{Realism:1}",,,,,...,,,,,,,,,,
2483,Dominique Gonzalez-Foerster,,Strasbourg,1965.0,"Conceptual Art, Unknown","{Conceptual Art:10},{Unknown:1}",,,,,...,,,,,,,,,,
2533,Carlos Quizpez Asín,,Lima,1900.0,"Cubism, Expressionism, Muralism, Neoclassicism","{Cubism:3},{Expressionism:9},{Muralism:6},{Neo...",,,,,...,,,,,,,,,,
2575,Ivan Mrkviсka,,,,"Impressionism, Orientalism, Post-Impressionism...","{Impressionism:35},{Orientalism:1},{Post-Impre...",,,,,...,,,,,,,,,,
2615,William Hawkins,,Vance County,1777.0,"Naïve Art (Primitivism), Outsider art","{Naïve Art (Primitivism):31},{Outsider art:7}",,,,,...,,,,,,,,,,
2617,Felix Gonzalez-Torres,,Guáimaro,1957.0,"Conceptual Art, Minimalism, Unknown","{Conceptual Art:9},{Minimalism:9},{Unknown:1}",,,,,...,,,,,,,,,,


In [10]:
artists_c[~(artists_c['artist'].isin(list(artists['artist'])))]

Unnamed: 0,artist,Nationality,birth_place,birth_year,styles,styles_extended,StylesYears,StylesCount,PlacesCount,Contemporary,...,FirstYear,LastYear,Places,PlacesYears,PaintingSchool,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers
160,Michael Bell,,County Louth,1936.0,New Realism,{New Realism:9},,,,Yes,...,2010.0,2010.0,,,,,,,,
163,Arthur Pan,,,,Realism,{Realism:8},,,,,...,,,,,,,,,,
167,Georges Troubat,,,,Abstract Art,{Abstract Art:12},,,{France:70},No,...,1480.0,1490.0,France,France:1480-1490,,,,,,
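
The two `isin` checks above can also be done in one pass with an outer merge and pandas' `indicator=True` option, which labels each row as `left_only`, `right_only`, or `both`. This is a sketch on toy frames, not a replacement for the cells above; the names `old` and `new` are placeholders, and the two non-shared artists are taken from the outputs above.

```python
import pandas as pd

old = pd.DataFrame({"artist": ["Ad Reinhardt", "Willem van Swanenburg"]})
new = pd.DataFrame({"artist": ["Ad Reinhardt", "Arthur Pan"]})

# indicator=True adds a '_merge' column describing where each row came from
diff = old.merge(new, on="artist", how="outer", indicator=True)
only_old = diff.loc[diff["_merge"] == "left_only", "artist"].tolist()
only_new = diff.loc[diff["_merge"] == "right_only", "artist"].tolist()
print(only_old, only_new)
```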


## Version 2024.01.19 (updates 2024.01.16): Take the intersection of WikiArt and Art500k, look for similar names in Art500k and add them too

In [5]:
artists= artists_wikiart[artists_wikiart['artist'].isin(art500k_artists['artist'])].reset_index(drop=True)
drop = artists_wikiart[~(artists_wikiart['artist'].isin(art500k_artists['artist']))]
print("Artists remaining:", len(artists), "\n", "Artists dropped:", len(drop))

Artists remaining: 2458 
 Artists dropped: 745


Let's try to find Art500k painters whose names are very similar to the ones in WikiArt:

In [6]:
import difflib
#from fuzzywuzzy import fuzz #Other possibility

# Similarity ratio between two strings, in [0, 1] (1.0 means identical)
def similarity(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()

# Rough number of differing characters, scaled by the length of s1
def similarity_difference(s1, s2):
    return (1 - similarity(s1, s2))*len(s1)
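
As a worked example of these two measures: `SequenceMatcher.ratio()` returns `2*M/T`, where `M` is the number of matched characters and `T` is the total length of both strings. The sketch below repeats the two definitions so that it runs on its own, and applies them to one name pair from this notebook.

```python
import difflib

def similarity(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()

def similarity_difference(s1, s2):
    return (1 - similarity(s1, s2)) * len(s1)

a, b = "Ayse Erkmen", "Ayşe Erkmen"
score = similarity(a, b)            # 2 * 10 matched characters / 22 total
diff = similarity_difference(a, b)  # (1 - 20/22) * 11
print(round(score, 6), round(diff, 6))
```

Because the difference is scaled by `len(s1)` only, it is not symmetric when the two names have different lengths, so it is best read as an approximate character count.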

In [7]:
def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    
    return previous_row[-1]

In [None]:
# Could use numpy to be faster, but this is fine for now
sim_columns = ['artist','"Best" Art500k pair','similarity', 'Character difference']
sim_rows = []
for painter in drop['artist']:
    all_sims = []
    max_sim = 0
    for art500k_artist in art500k_artists['artist']:
        similarity_score = similarity(painter, art500k_artist)
        if similarity_score >= max_sim: # Only keep candidates that tie or beat the best score so far
            max_sim = similarity_score
            all_sims.append((similarity_score,(1-similarity_score)*len(painter), art500k_artist))
    # Keep every candidate tied for the highest score (max_sim now holds the final maximum)
    sim_rows.extend([painter, sims[2], sims[0], sims[1]] for sims in all_sims if sims[0] == max_sim)
# Build the DataFrame once instead of calling pd.concat inside the loop
drop_sims = pd.DataFrame(sim_rows, columns=sim_columns)
#drop_sims.sort_values(by=['similarity'], ascending=False)
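
As an alternative to the hand-written loop, the standard library's `difflib.get_close_matches` ranks candidates with the same `SequenceMatcher` ratio and applies a cutoff. The sketch below is only an illustration: the helper `best_matches`, the cutoff of 0.85, and the short candidate list are assumptions, not values used in this project.

```python
import difflib

# A few Art500k-style spellings taken from the outputs in this notebook.
candidates = ["Zao Wou Ki", "Tony Delap", "Sun Xun", "Sean Lee"]

def best_matches(name, pool, cutoff=0.85, n=3):
    """Return up to n names from pool whose ratio to name is at least cutoff."""
    return difflib.get_close_matches(name, pool, n=n, cutoff=cutoff)

print(best_matches("Zao Wou-Ki", candidates))
print(best_matches("Alan Lee", candidates))
```

A ratio cutoff is not the same criterion as the "character difference" threshold used in this notebook, so the two approaches can accept different pairs.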

In [129]:
drop_sims = drop_sims.reset_index(drop=True)
drop_sims[drop_sims['Character difference'] < 1.01 ].sort_values(by=['similarity'], ascending=False)[-15:]

Unnamed: 0,artist,"""Best"" Art500k pair",similarity,Character difference
273,Ayse Erkmen,Ayşe Erkmen,0.909091,1.0
588,Park Seo-Bo,Park Seo Bo,0.909091,1.0
587,Park Seo-Bo,Park Seo-bo,0.909091,1.0
191,Zao Wou-Ki,Zao Wou Ki,0.9,1.0
225,Léo Schnug,Leo Schnug,0.9,1.0
558,Tony DeLap,Tony Delap,0.9,1.0
447,SM Sultan,Sm Sultan,0.888889,1.0
70,Jay DeFeo,Jay Defeo,0.888889,1.0
189,Se-Ok Suh,Se Ok Suh,0.888889,1.0
606,Ay-O,Ay O,0.75,1.0


In [None]:
painter_name_pairs_dict = {}
art500k_alias_groups = {}
subset = drop_sims[drop_sims['Character difference'] < 1.01 ].sort_values(by=['similarity'], ascending=False)[:-5].reset_index(drop=True)
for index, row in subset.iterrows():
    painter = row['artist']
    if painter not in painter_name_pairs_dict.keys():
        painter_name_pairs_dict[painter] = subset.loc[index, '"Best" Art500k pair']
        art500k_alias_groups[painter] = [subset.loc[index, '"Best" Art500k pair']]
    else:
        t = art500k_alias_groups[painter]
        art500k_alias_groups[painter] = t + [subset.loc[index, '"Best" Art500k pair']]

In [None]:
for key, value in art500k_alias_groups.items():
    if len(value) > 1:
        print(key, value)

Juan Carreno de Miranda ['Juan Carreño de Miranda', 'Juan Carreno De Miranda']
Albert Rafols-Casamada ['Albert Ràfols-Casamada', 'Albert Rafols Casamada']
Francisco de Zurbaran ['Francisco De Zurbaran', 'Francisco de Zurbarán']
Andres de Santa Maria ['Andres De Santa Maria', 'Andrés de Santa Maria']
Jean-Honore Fragonard ['Jean Honore Fragonard', 'Jean-Honoré Fragonard']
Theo van Rysselberghe ['Théo van Rysselberghe', 'Theo Van Rysselberghe']
Janos Mattis-Teutsch ['János Mattis-Teutsch', 'Janos Mattis Teutsch']
Edouard Debat-Ponsan ['Édouard Debat-Ponsan', 'Edouard Debat Ponsan']
Juan de Valdes Leal ['Juan de Valdés Leal', 'Juan De Valdes Leal']
Park Seo-Bo ['Park Seo Bo', 'Park Seo-bo']
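
Most of the alias groups above differ only in accents, hyphens, or letter case. A normalization key could catch such variants before any fuzzy matching; the sketch below uses the standard `unicodedata` module. The helper `name_key` is hypothetical and was not used to produce the results in this notebook.

```python
import unicodedata

def name_key(name):
    """Accent-, hyphen-, case-, and spacing-insensitive key for a name."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks (accents) left over after decomposition
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.replace("-", " ").casefold().split())

print(name_key("Juan Carreño de Miranda"))
print(name_key("Édouard Debat-Ponsan"))
```

Such a key can also merge two different people who share a normalized name, so matches found this way still need a manual check.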


I now edit these in the Art500k dataset. Let's try again:

In [10]:
art500k_artists = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/art500k_artists.csv')
artists_wikiart = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/wikiart_artists.csv')
artists= artists_wikiart[artists_wikiart['artist'].isin(art500k_artists['artist'])].reset_index(drop=True)
drop = artists_wikiart[~(artists_wikiart['artist'].isin(art500k_artists['artist']))]

In [11]:
sim_columns = ['artist','"Best" Art500k pair','similarity', 'Character difference']
sim_rows = []
print("Cases:", len(drop))
for position, painter in enumerate(drop['artist']):
    if position == len(drop)//4: # Compare the loop position, not the index label inherited from artists_wikiart
        print("25% now drop finding...")
    all_sims = []
    max_sim = 0
    for art500k_artist in art500k_artists['artist']:
        similarity_score = similarity(painter, art500k_artist)
        if similarity_score >= max_sim: # Only keep candidates that tie or beat the best score so far
            max_sim = similarity_score
            all_sims.append((similarity_score,(1-similarity_score)*len(painter), art500k_artist))
    # Keep every candidate tied for the highest score (max_sim now holds the final maximum)
    sim_rows.extend([painter, sims[2], sims[0], sims[1]] for sims in all_sims if sims[0] == max_sim)

# Build the DataFrame once instead of calling pd.concat inside the loop
drop_sims = pd.DataFrame(sim_rows, columns=sim_columns)

print("Done with the drop finding...")

painter_name_pairs_dict = {}
art500k_alias_groups = {}
subset = drop_sims[drop_sims['Character difference'] < 1.01 ].sort_values(by=['similarity'], ascending=False)[:-5].reset_index(drop=True)
for index, row in subset.iterrows():
    if (index == len(subset)//4):
        print("25% now...")
    painter = row['artist']
    if painter not in painter_name_pairs_dict.keys():
        painter_name_pairs_dict[painter] = subset.loc[index, '"Best" Art500k pair']
        art500k_alias_groups[painter] = [subset.loc[index, '"Best" Art500k pair']]
    else:
        t = art500k_alias_groups[painter]
        art500k_alias_groups[painter] = t + [subset.loc[index, '"Best" Art500k pair']]


Cases: 745




Done with the drop finding...
25% now...


In [12]:
for key, value in art500k_alias_groups.items():
    if len(value) > 1:
        print(key, value)

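So the fix succeeded: the cell above prints nothing, meaning no WikiArt name is paired with more than one Art500k alias any more. In the meantime, one pair should now be dropped: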

In [None]:
del painter_name_pairs_dict['Zaya']

Let's find the pairs with a character difference between 1 and 2:

In [14]:
drop_sims = drop_sims.reset_index(drop=True)

In [147]:
drop_sims[(drop_sims['Character difference'] > 1.01) & (drop_sims['Character difference'] <2.01) ].sort_values(by=['similarity'], ascending=False)[0:]

Unnamed: 0,artist,"""Best"" Art500k pair",similarity,Character difference
366,Marevna (Marie Vorobieff),Marevna Marie Vorobieff,0.958333,1.041667
229,Petro Kholodny (Elder),Petro Kholodny Elder,0.952381,1.047619
506,Martín Rico y Ortega,Martín Rico Ortega,0.947368,1.052632
471,"Robert De Niro, Sr.",Robert De Niro Sr,0.944444,1.055556
218,J. C. Leyendecker,J C Leyendecker,0.937500,1.062500
...,...,...,...,...
234,Kim Prisu,Kid Paris,0.777778,2.000000
349,Sun Mu,Sun Xun,0.769231,1.384615
59,Ed Clark,Clark,0.769231,1.846154
456,Alan Lee,Sean Lee,0.750000,2.000000


I checked each case manually:

In [16]:
subset = drop_sims[(drop_sims['Character difference'] > 1.01) & (drop_sims['Character difference'] <2.01) ].sort_values(by=['similarity'], ascending=False)[0:30]
for index, row in subset.iterrows():
    painter = row['artist']
    if painter not in painter_name_pairs_dict.keys(): # Precaution; in theory this is always true
        painter_name_pairs_dict[painter] = subset.loc[index, '"Best" Art500k pair']
        
painter_name_pairs_dict.update({"Chang Dai-chien": "Chang Dai Chien", "Félix Del Marle":"F Lix Del Marle", "Roger Bissière":"Roger Bissi Re",
                                "Jacques Hérold": "Jacques H Rold", "YiFei Chen": "Yifei Chen", "M.C. Escher": "M C Escher", 
                                "Hong Song-dam": "Hong Song Dam", "Mestre Ataíde": "Mestre Ata De", "Li Yuan-chia":"Li Yuan Chia", "José Luzán": "Jose Luzan"})
del painter_name_pairs_dict['Jacob Collins']
del painter_name_pairs_dict['Michael Bell']                  

More than 2 character differences (looking through all cases):

In [176]:
drop_sims[(drop_sims['Character difference'] > 2.01)].sort_values(by=['similarity'], ascending=False)[630:660]

Unnamed: 0,artist,"""Best"" Art500k pair",similarity,Character difference
5,Babak-Matveev,Barbara White,0.538462,6.0
669,Ruth Annaqtuusi Tulurialik,South Australia,0.536585,12.04878
303,EtchingRoom1,Cochino,0.526316,5.684211
276,Haralampi G. Oroschakoff,Jamie Bischoff,0.526316,11.368421
52,[ a y s h ],Mary Lish,0.5,5.5
51,[ a y s h ],Cray Fish,0.5,5.5
464,Boushra Yahya Almutawakel,Thouraya Hamouda,0.487805,12.804878
498,JAROSLAV KELUC,ALBERT OLIVE,0.461538,7.538462
499,JAROSLAV KELUC,ARCHIVO FIEL,0.461538,7.538462


In [24]:
painter_name_pairs_dict.update({"António de Carvalho da Silva Porto": "Ant Nio De Carvalho Da Silva Porto", "Jean-Joseph-Xavier Bidauld":"Jean-Joseph-Xavier Bidauld (French", "Pieter Saenredam":"Jan Pietersz Saenredam", "Shin Yoon-bok":"Sin Yun-bok",
                                "Jindrich Styrsky": "Jindřich Štyrský", "George Philip Reinagle":"Philip Reinagle", "Ralph Blakelock":"Ralph Albert Blakelock", "Frances Macdonald": "Frances Macdonald Macnair","Pellizza da Volpedo":"Giuseppe Pellizza da Volpedo", "Raimundo de Madrazo":"Raimundo de Madrazo y Garreta",
                                "Elenore Abbott":"Elenore Plaisted Abbott", "C. R. W. Nevinson":"Christopher R. W. Nevinson", "Il Sassetta (Stefano di Giovanni)":"Stefano di Giovanni", "Rafael García Hispaleto (El Hispaleto)":"Rafael García y García. Hispaleto",
                                "Alexei Korzukhin": "Aleksey Ivanovich Korzukhin", "Charles-Andre van Loo (Carle van Loo)":"Charles-André van Loo", "Lubo Kristek":"Lubo Kristek In Landsberg"})

Some pairs differ only in upper/lower case:

In [179]:
painter_lowercase_pairs = {}
for index, row in drop_sims.iterrows():
    if row['Character difference'] > 1.01:
        painter_lowercase = row['artist'].lower()
        for artist in art500k_artists['artist']:
            if painter_lowercase == artist.lower():
                painter_lowercase_pairs[row['artist']] = artist
                
painter_lowercase_pairs

{'Adam van der Meulen': 'Adam Van Der Meulen',
 'JAROSLAV KELUC': 'Jaroslav Keluc',
 'Ding Yi': 'DING Yi',
 'JCJ Vanderheyden': 'Jcj Vanderheyden',
 'Bart van der Leck': 'Bart Van Der Leck',
 'Luis de Madrazo y Kuntz': 'Luis De Madrazo Y Kuntz',
 'Phase 2': 'PHASE 2',
 'TRACY 168': 'Tracy 168'}
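
The nested loop above compares every remaining WikiArt name with every Art500k name. The same case-insensitive pairing can be sketched with a dictionary keyed by the lowercased name, which needs only one pass over each list. The function `case_insensitive_pairs` is hypothetical; note that it keeps the first Art500k spelling it sees, whereas the loop above keeps the last.

```python
def case_insensitive_pairs(wikiart_names, art500k_names):
    """Map each WikiArt name to an Art500k name that differs only in case."""
    lookup = {}
    for name in art500k_names:
        lookup.setdefault(name.lower(), name)  # keep the first spelling seen
    pairs = {}
    for name in wikiart_names:
        match = lookup.get(name.lower())
        if match is not None and match != name:
            pairs[name] = match
    return pairs

pairs = case_insensitive_pairs(["Ding Yi", "Phase 2", "Wu Daozi"],
                               ["DING Yi", "PHASE 2", "Wu Daozi"])
print(pairs)
```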

Let's do a safety check: all keys should be in WikiArt, and all values should be in Art500k:

In [27]:
painter_name_pairs_dict.update(painter_lowercase_pairs)
for key, value in painter_name_pairs_dict.items():
    if key not in artists_wikiart['artist'].str.strip().values:
        print(key, "not in artists_wikiart")
    if value not in art500k_artists['artist'].str.strip().values:
        print(value, "not in art500k_artists")

YiFei Chen was in the database with a double space, and Gohar Fermanyan had a trailing space at the end of the name. I fixed both in the WikiArt dataset and reran this check, which now passes.
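
Whitespace problems like these two can also be cleaned up programmatically before matching. A minimal sketch, where `clean_name` is a hypothetical helper and not part of *helper_functions.py*:

```python
import re

def clean_name(name):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", name).strip()

print(repr(clean_name("YiFei  Chen")))
print(repr(clean_name("Gohar Fermanyan ")))
```

Applied to a pandas column, the same function could be passed to `Series.map` before the safety check is run.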

### Combine the two datasets

In [271]:
artists= artists_wikiart[artists_wikiart['artist'].isin(art500k_artists['artist'])].reset_index(drop=True)
artists = artists.merge(art500k_artists, on='artist', how='left')
artists

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,Nationality,PaintingSchool,ArtMovement,...,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount,Contemporary,Type
0,Ad Reinhardt,"Abstract Art, Abstract Expressionism, Color Fi...",Abstract Expressionism,"{Abstract Art:15},{Abstract Expressionism:5},{...",52,Buffalo,1913.0,American,"New York School,American Abstract Artists,Iras...","{Abstract Expressionism,Minimalism:52}",...,"Jackson Pollock,",1937.0,1966.0,"US, NY, Canberra, Fort Worth, Buffalo, Austral...","New York City:1938-1966,NY:1938-1966,US:1938-1...","Expressionism:1944-1946,Abstract Art:1937-1941...","{Expressionism:7}, {Abstract Art:15}, {Color F...","{New York City:29},{NY:31},{US:32},{Buffalo:2}...",No,
1,Adnan Coker,"Abstract Art, Abstract Expressionism",Abstract Art,"{Abstract Art:25},{Abstract Expressionism:3}",28,,,Turkish,,{Abstract Art:28},...,,1968.0,2008.0,,,"Abstract Art:1992-2008,Abstract Expressionism:...","{Abstract Art:25}, {Abstract Expressionism:3}",,Yes,
2,Akkitham Narayanan,Abstract Art,Abstract Art,{Abstract Art:17},17,Kerala,1939.0,Indian,,{Abstract Art:17},...,,1974.0,1974.0,,,Abstract Art:1974-1974,{Abstract Art:17},,No,
3,Alberto Magnelli,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...",Abstract Art,"{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...",35,Florence,1888.0,"Italian,French",Abstraction-Création,"{Abstract Art,Cubo-Futurism,Concrete Art (Conc...",...,,1909.0,1971.0,,,"Abstract Art:1916-1971,Cubism:1914-1935,Metaph...","{Abstract Art:21}, {Cubism:10}, {Metaphysical ...",,No,
4,Alekos Kontopoulos,"Abstract Art, Cubism, Expressionism, Post-Impr...",Social Realism,"{Abstract Art:26},{Cubism:5},{Expressionism:10...",79,Lamia,1904.0,Greek,,"{Abstract Art,Social Realism:79}",...,,1931.0,1974.0,,,"Post-Impressionism:1932-1955,Expressionism:193...","{Post-Impressionism:8}, {Expressionism:11}, {R...",,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2453,Marianne von Werefkin,Unknown,Expressionism,{Unknown:61},61,Tula,1860.0,,,{Der Blaue Reiter:1},...,,,,,,,,,,
2454,Robert Demachy,Unknown,Pictorialism,{Unknown:24},24,Saint-Germain-en-Laye,1859.0,French,,{Pictorialism:24},...,,1900.0,1914.0,France,,,,{France:2},No,
2455,Wolfgang Tillmans,Unknown,Contemporary,{Unknown:9},9,Remscheid,1968.0,,,,...,,2001.0,2001.0,"London, United Kingdom",,,,"{London:1},{United Kingdom:1}",Yes,
2456,Wu Daozi,Unknown,Tang Dynasty (618–907),{Unknown:8},8,Chang'an,680.0,Chinese,Four fathers of Chinese painting,{Tang Dynasty (618–907):8},...,,,,,,,,,,


In [272]:
# Add the artists that appear in both datasets under different names.
# painter_name_pairs_dict maps a WikiArt name (key) to its Art500k name (value).
columns_list_W = artists_wikiart.columns.tolist()
columns_list_A = art500k_artists.columns.tolist()[1:]  # skip the first Art500k column
for key, value in painter_name_pairs_dict.items():
    wikiart_artist_df = artists_wikiart[artists_wikiart['artist'] == key]
    art500k_artist_df = art500k_artists[art500k_artists['artist'] == value]
    # Place the two single-row frames side by side
    combined_df = pd.concat([wikiart_artist_df[columns_list_W].reset_index(drop=True), art500k_artist_df[columns_list_A].reset_index(drop=True)], axis=1)
    artists = pd.concat([artists, combined_df], axis=0).reset_index(drop=True)
artists

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,Nationality,PaintingSchool,ArtMovement,...,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount,Contemporary,Type
0,Ad Reinhardt,"Abstract Art, Abstract Expressionism, Color Fi...",Abstract Expressionism,"{Abstract Art:15},{Abstract Expressionism:5},{...",52,Buffalo,1913.0,American,"New York School,American Abstract Artists,Iras...","{Abstract Expressionism,Minimalism:52}",...,"Jackson Pollock,",1937.0,1966.0,"US, NY, Canberra, Fort Worth, Buffalo, Austral...","New York City:1938-1966,NY:1938-1966,US:1938-1...","Expressionism:1944-1946,Abstract Art:1937-1941...","{Expressionism:7}, {Abstract Art:15}, {Color F...","{New York City:29},{NY:31},{US:32},{Buffalo:2}...",No,
1,Adnan Coker,"Abstract Art, Abstract Expressionism",Abstract Art,"{Abstract Art:25},{Abstract Expressionism:3}",28,,,Turkish,,{Abstract Art:28},...,,1968.0,2008.0,,,"Abstract Art:1992-2008,Abstract Expressionism:...","{Abstract Art:25}, {Abstract Expressionism:3}",,Yes,
2,Akkitham Narayanan,Abstract Art,Abstract Art,{Abstract Art:17},17,Kerala,1939.0,Indian,,{Abstract Art:17},...,,1974.0,1974.0,,,Abstract Art:1974-1974,{Abstract Art:17},,No,
3,Alberto Magnelli,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...",Abstract Art,"{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...",35,Florence,1888.0,"Italian,French",Abstraction-Création,"{Abstract Art,Cubo-Futurism,Concrete Art (Conc...",...,,1909.0,1971.0,,,"Abstract Art:1916-1971,Cubism:1914-1935,Metaph...","{Abstract Art:21}, {Cubism:10}, {Metaphysical ...",,No,
4,Alekos Kontopoulos,"Abstract Art, Cubism, Expressionism, Post-Impr...",Social Realism,"{Abstract Art:26},{Cubism:5},{Expressionism:10...",79,Lamia,1904.0,Greek,,"{Abstract Art,Social Realism:79}",...,,1931.0,1974.0,,,"Post-Impressionism:1932-1955,Expressionism:193...","{Post-Impressionism:8}, {Expressionism:11}, {R...",,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2661,Gohar Fermanyan,Post-Impressionism,Post-Impressionism,{Post-Impressionism:3},3,,,Armenian,,{Post-Impressionism:3},...,,,,,,,{Post-Impressionism:3},,,
2662,JAROSLAV KELUC,Impressionism,Impressionism,{Impressionism:33},33,,,Czech,,{Impressionism:33},...,,1949.0,1979.0,,,Impressionism:1949-1979,{Impressionism:33},,No,
2663,Ding Yi,Maximalism,Maximalism,{Maximalism:29},29,"Suixi County, Anhui",150.0,,,,...,,1989.0,1991.0,,,,,,,
2664,Phase 2,Street art,Street art,{Street art:13},13,,,,,{Street art:2},...,,,,"New York, United States",,,,"{New York:1},{United States:1}",,
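As an alternative to concatenating row by row, the differently named pairs could be handled by first rewriting the Art500k names to their WikiArt spelling and then merging once. This is only a sketch under assumptions: the toy frames, the painter names and the `name_pairs` dict are hypothetical stand-ins for `artists_wikiart`, `art500k_artists` and `painter_name_pairs_dict`, and each Art500k name is assumed to map to at most one WikiArt name.

```python
import pandas as pd

# Toy stand-ins for the two artist tables (hypothetical values)
wikiart = pd.DataFrame({"artist": ["Painter A", "B. Painter"],
                        "movement": ["Cubism", "Impressionism"]})
art500k = pd.DataFrame({"artist": ["Painter A", "Painter B"],
                        "Nationality": ["X", "Y"]})

# WikiArt name -> Art500k name, same direction as painter_name_pairs_dict
name_pairs = {"B. Painter": "Painter B"}

# Invert the mapping so that Art500k names are rewritten to the WikiArt spelling
to_wikiart = {a500k: wiki for wiki, a500k in name_pairs.items()}
art500k_renamed = art500k.assign(artist=art500k["artist"].replace(to_wikiart))

# One merge now covers both the identical names and the renamed pairs
combined = wikiart.merge(art500k_renamed, on="artist", how="inner")
print(combined)
```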


In [273]:
cols = artists.columns.tolist()
print(cols)
# Reorder the columns by position so that related fields sit next to each other
cols = cols[0:1]+cols[7:8]+cols[5:7]+cols[1:2]+cols[3:4]+cols[19:]+cols[2:3]+cols[9:10]+cols[4:5]+cols[15:19]+cols[8:9]+cols[10:15]
artists = artists[cols]
artists

['artist', 'styles', 'movement', 'styles_extended', 'pictures_count', 'birth_place', 'birth_year', 'Nationality', 'PaintingSchool', 'ArtMovement', 'Influencedby', 'Influencedon', 'Pupils', 'Teachers', 'FriendsandCoworkers', 'FirstYear', 'LastYear', 'Places', 'PlacesYears', 'StylesYears', 'StylesCount', 'PlacesCount', 'Contemporary', 'Type']


Unnamed: 0,artist,Nationality,birth_place,birth_year,styles,styles_extended,StylesYears,StylesCount,PlacesCount,Contemporary,...,FirstYear,LastYear,Places,PlacesYears,PaintingSchool,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers
0,Ad Reinhardt,American,Buffalo,1913.0,"Abstract Art, Abstract Expressionism, Color Fi...","{Abstract Art:15},{Abstract Expressionism:5},{...","Expressionism:1944-1946,Abstract Art:1937-1941...","{Expressionism:7}, {Abstract Art:15}, {Color F...","{New York City:29},{NY:31},{US:32},{Buffalo:2}...",No,...,1937.0,1966.0,"US, NY, Canberra, Fort Worth, Buffalo, Austral...","New York City:1938-1966,NY:1938-1966,US:1938-1...","New York School,American Abstract Artists,Iras...","Piet Mondrian,Kazimir Malevich,Josef Albers,","Donald Judd,Barnett Newman,Mark Rothko,Frank S...",,,"Jackson Pollock,"
1,Adnan Coker,Turkish,,,"Abstract Art, Abstract Expressionism","{Abstract Art:25},{Abstract Expressionism:3}","Abstract Art:1992-2008,Abstract Expressionism:...","{Abstract Art:25}, {Abstract Expressionism:3}",,Yes,...,1968.0,2008.0,,,,,,,,
2,Akkitham Narayanan,Indian,Kerala,1939.0,Abstract Art,{Abstract Art:17},Abstract Art:1974-1974,{Abstract Art:17},,No,...,1974.0,1974.0,,,,,,,,
3,Alberto Magnelli,"Italian,French",Florence,1888.0,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...","{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...","Abstract Art:1916-1971,Cubism:1914-1935,Metaph...","{Abstract Art:21}, {Cubism:10}, {Metaphysical ...",,No,...,1909.0,1971.0,,,Abstraction-Création,,,,,
4,Alekos Kontopoulos,Greek,Lamia,1904.0,"Abstract Art, Cubism, Expressionism, Post-Impr...","{Abstract Art:26},{Cubism:5},{Expressionism:10...","Post-Impressionism:1932-1955,Expressionism:193...","{Post-Impressionism:8}, {Expressionism:11}, {R...",,No,...,1931.0,1974.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2661,Gohar Fermanyan,Armenian,,,Post-Impressionism,{Post-Impressionism:3},,{Post-Impressionism:3},,,...,,,,,,,,,,
2662,JAROSLAV KELUC,Czech,,,Impressionism,{Impressionism:33},Impressionism:1949-1979,{Impressionism:33},,No,...,1949.0,1979.0,,,,,,,,
2663,Ding Yi,,"Suixi County, Anhui",150.0,Maximalism,{Maximalism:29},,,,,...,1989.0,1991.0,,,,,,,,
2664,Phase 2,,,,Street art,{Street art:13},,,"{New York:1},{United States:1}",,...,,,"New York, United States",,,,,,,


In [268]:
# Save one copy in the working directory and one in datasets/
artists.to_csv('artists.csv', index=False)
artists.to_csv('datasets/artists.csv', index=False)

### Re-update 2024.01.28: Save the name mapping

In [30]:
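# Artists with identical names in both datasets map to themselves
art500k_names = set(art500k_artists['artist'])  # build once for fast membership tests
for painter in artists_wikiart['artist']:
    if painter in art500k_names:
        painter_name_pairs_dict[painter] = painter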

In [33]:
#Save the dictionary
import json

with open('datasets/saves/painter_name_pairs.json', 'w') as fp:
    json.dump(painter_name_pairs_dict, fp)

## Version 2023.12.02: Take the intersection of WikiArt and Art500k

In [95]:
artist_A = pd.read_csv('datasets/wikiart_artists.csv')
artists = artist_A[artist_A['artist'].isin(art500k_artists['artist'])].reset_index(drop=True)
print("Artists remaining:", len(artists))

Artists remaining: 2457


In [96]:
artists = artists.merge(art500k_artists, on='artist', how='left')
artists

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,Nationality,PaintingSchool,ArtMovement,...,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount
0,Ad Reinhardt,"Abstract Art, Abstract Expressionism, Color Fi...",Abstract Expressionism,"{Abstract Art:15},{Abstract Expressionism:5},{...",52,Buffalo,1913.0,American,"New York School,American Abstract Artists,Iras...","{Abstract Expressionism,Minimalism:52},",...,"Donald Judd,Barnett Newman,Mark Rothko,Frank S...",,,"Jackson Pollock,",1937.0,1966.0,"US, NY, Canberra, Fort Worth, Buffalo, Austral...","New York City:1938-1966,,NY:1938-1966,,US:1938...","Expressionism:1944-1946,,Abstract Art:1937-194...","{Expressionism:7}, {Abstract Art:15}, {Color F..."
1,Adnan Coker,"Abstract Art, Abstract Expressionism",Abstract Art,"{Abstract Art:25},{Abstract Expressionism:3}",28,,,Turkish,,"{Abstract Art:28},",...,,,,,1968.0,2008.0,,,"Abstract Art:1992-2008,,Abstract Expressionism...","{Abstract Art:25}, {Abstract Expressionism:3}"
2,Akkitham Narayanan,Abstract Art,Abstract Art,{Abstract Art:17},17,Kerala,1939.0,Indian,,"{Abstract Art:17},",...,,,,,1974.0,1974.0,,,"Abstract Art:1974-1974,",{Abstract Art:17}
3,Alberto Magnelli,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...",Abstract Art,"{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...",35,Florence,1888.0,"Italian,French",Abstraction-Création,"{Abstract Art,Cubo-Futurism,Concrete Art (Conc...",...,,,,,1909.0,1971.0,,,"Abstract Art:1916-1971,,Cubism:1914-1935,,Meta...","{Abstract Art:21}, {Cubism:10}, {Metaphysical ..."
4,Alekos Kontopoulos,"Abstract Art, Cubism, Expressionism, Post-Impr...",Social Realism,"{Abstract Art:26},{Cubism:5},{Expressionism:10...",79,Lamia,1904.0,Greek,,"{Abstract Art,Social Realism:79},",...,,,,,1931.0,1974.0,,,"Post-Impressionism:1932-1955,,Expressionism:19...","{Post-Impressionism:8}, {Expressionism:11}, {R..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2452,Marianne von Werefkin,Unknown,Expressionism,{Unknown:61},61,Tula,1860.0,,,"{Der Blaue Reiter:1},",...,,,,,,,,,,
2453,Robert Demachy,Unknown,Pictorialism,{Unknown:24},24,Saint-Germain-en-Laye,1859.0,French,,"{Pictorialism:24},",...,,,,,1900.0,1914.0,France,,,
2454,Wolfgang Tillmans,Unknown,Contemporary,{Unknown:9},9,Remscheid,1968.0,,,,...,,,,,2001.0,2001.0,"London, United Kingdom",,,
2455,Wu Daozi,Unknown,Tang Dynasty (618–907),{Unknown:8},8,Chang'an,680.0,Chinese,Four fathers of Chinese painting,"{Tang Dynasty (618–907):8},",...,,,,,,,,,,


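To do later: extend this list with the artists that were skipped from either dataset. The cell below lists the WikiArt artists that have no exact name match in Art500k, sorted by picture count.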

In [97]:
artist_AnotB = artist_A[~artist_A['artist'].isin(art500k_artists['artist'])].reset_index(drop=True).sort_values(by=['pictures_count'], ascending=False)
artist_AnotB.head(10)

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year
0,Alfred Freddy Krupa,"Abstract Art, Abstract Expressionism, Academic...",New Ink Art,"{Abstract Art:1},{Abstract Expressionism:1},{A...",735,Karlovac,1971.0
720,Zdzislaw Beksinski,Surrealism,Magic Realism,{Surrealism:707},707,Sanok,1929.0
737,Oleksandr Aksinin,Unknown,Soviet Nonconformist Art,{Unknown:480},480,Kiev,1930.0
140,M.C. Escher,"Art Deco, Art Nouveau (Modern), Cubism, Expres...",Surrealism,"{Art Deco:1},{Art Nouveau (Modern):1},{Cubism:...",470,Leeuwarden,1898.0
121,Oleg Holosiy,"Academicism, Cubism, Expressionism, Naïve Art ...",Neo-Expressionism,"{Academicism:1},{Cubism:5},{Expressionism:30},...",372,Dnipro,1965.0
308,Alexander Roitburd,"Cubism, Transavantgarde",Transavantgarde,"{Cubism:1},{Transavantgarde:263}",264,Odesa,1961.0
377,Maria Bozoky,"Expressionism, Impressionism",Expressionism,"{Expressionism:252},{Impressionism:4}",256,Oradea,1909.0
606,Konstantin Gorbatov,Post-Impressionism,Post-Impressionism,{Post-Impressionism:254},254,Tolyatti,1876.0
590,Felix Nadar,Pictorialism,Pictorialism,{Pictorialism:245},245,rue Saint-Honoré,1820.0
436,J.M.W. Turner,"Impressionism, Romanticism, Unknown",Romanticism,"{Impressionism:1},{Romanticism:243},{Unknown:1}",245,London,1775.0
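To see the skipped artists from both sides at once, an outer merge with `indicator=True` is one option: pandas then adds a `_merge` column with the values `left_only`, `right_only` and `both`. The frames below are hypothetical toy data, not the real datasets.

```python
import pandas as pd

# Toy stand-ins for the WikiArt and Art500k artist tables
wikiart = pd.DataFrame({"artist": ["Painter A", "Painter B"],
                        "pictures_count": [10, 20]})
art500k = pd.DataFrame({"artist": ["Painter B", "Painter C"],
                        "Nationality": ["X", "Y"]})

# The _merge column records which side each row came from
both = wikiart.merge(art500k, on="artist", how="outer", indicator=True)
only_wikiart = both.loc[both["_merge"] == "left_only", "artist"].tolist()
only_art500k = both.loc[both["_merge"] == "right_only", "artist"].tolist()
print(only_wikiart, only_art500k)
```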


In [98]:
cols = artists.columns.tolist()
cols

['artist',
 'styles',
 'movement',
 'styles_extended',
 'pictures_count',
 'birth_place',
 'birth_year',
 'Nationality',
 'PaintingSchool',
 'ArtMovement',
 'Influencedby',
 'Influencedon',
 'Pupils',
 'Teachers',
 'FriendsandCoworkers',
 'FirstYear',
 'LastYear',
 'Places',
 'PlacesYears',
 'StylesYears',
 'StylesCount']

In [99]:
cols = cols[0:1]+cols[7:8]+cols[5:7]+cols[1:2]+cols[3:4]+cols[19:]+cols[2:3]+cols[9:10]+cols[4:5]+cols[15:19]+cols[8:9]+cols[10:15]
artists = artists[cols]
artists

Unnamed: 0,artist,Nationality,birth_place,birth_year,styles,styles_extended,StylesYears,StylesCount,movement,ArtMovement,...,FirstYear,LastYear,Places,PlacesYears,PaintingSchool,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers
0,Ad Reinhardt,American,Buffalo,1913.0,"Abstract Art, Abstract Expressionism, Color Fi...","{Abstract Art:15},{Abstract Expressionism:5},{...","Expressionism:1944-1946,,Abstract Art:1937-194...","{Expressionism:7}, {Abstract Art:15}, {Color F...",Abstract Expressionism,"{Abstract Expressionism,Minimalism:52},",...,1937.0,1966.0,"US, NY, Canberra, Fort Worth, Buffalo, Austral...","New York City:1938-1966,,NY:1938-1966,,US:1938...","New York School,American Abstract Artists,Iras...","Piet Mondrian,Kazimir Malevich,Josef Albers,","Donald Judd,Barnett Newman,Mark Rothko,Frank S...",,,"Jackson Pollock,"
1,Adnan Coker,Turkish,,,"Abstract Art, Abstract Expressionism","{Abstract Art:25},{Abstract Expressionism:3}","Abstract Art:1992-2008,,Abstract Expressionism...","{Abstract Art:25}, {Abstract Expressionism:3}",Abstract Art,"{Abstract Art:28},",...,1968.0,2008.0,,,,,,,,
2,Akkitham Narayanan,Indian,Kerala,1939.0,Abstract Art,{Abstract Art:17},"Abstract Art:1974-1974,",{Abstract Art:17},Abstract Art,"{Abstract Art:17},",...,1974.0,1974.0,,,,,,,,
3,Alberto Magnelli,"Italian,French",Florence,1888.0,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...","{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...","Abstract Art:1916-1971,,Cubism:1914-1935,,Meta...","{Abstract Art:21}, {Cubism:10}, {Metaphysical ...",Abstract Art,"{Abstract Art,Cubo-Futurism,Concrete Art (Conc...",...,1909.0,1971.0,,,Abstraction-Création,,,,,
4,Alekos Kontopoulos,Greek,Lamia,1904.0,"Abstract Art, Cubism, Expressionism, Post-Impr...","{Abstract Art:26},{Cubism:5},{Expressionism:10...","Post-Impressionism:1932-1955,,Expressionism:19...","{Post-Impressionism:8}, {Expressionism:11}, {R...",Social Realism,"{Abstract Art,Social Realism:79},",...,1931.0,1974.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2452,Marianne von Werefkin,,Tula,1860.0,Unknown,{Unknown:61},,,Expressionism,"{Der Blaue Reiter:1},",...,,,,,,,,,,
2453,Robert Demachy,French,Saint-Germain-en-Laye,1859.0,Unknown,{Unknown:24},,,Pictorialism,"{Pictorialism:24},",...,1900.0,1914.0,France,,,,,,,
2454,Wolfgang Tillmans,,Remscheid,1968.0,Unknown,{Unknown:9},,,Contemporary,,...,2001.0,2001.0,"London, United Kingdom",,,,,,,
2455,Wu Daozi,Chinese,Chang'an,680.0,Unknown,{Unknown:8},,,Tang Dynasty (618–907),"{Tang Dynasty (618–907):8},",...,,,,,Four fathers of Chinese painting,,,,,


In [100]:
artists.to_csv('datasets/artists.csv', index=False)

In [125]:
artists = pd.read_csv('datasets/artists.csv')

### Years modification

In [None]:
# Flag artists whose active period (LastYear - FirstYear) exceeds 90 years
year_mistake = []
for artist in artists['artist']:
    row = artists[artists['artist'] == artist].iloc[0]
    if (row['LastYear'] - row['FirstYear']) > 90:
        year_mistake.append(artist)
print(year_mistake)

In [None]:
artists[artists['artist'].isin(year_mistake)][['artist','birth_year','FirstYear','LastYear']]

In [129]:
too_early_years = ["Huang Yongyu","Joe Goode","Theodoros Stamos","Pablo Picasso", "Modest Cuixart","Giovanni Paolo Panini", "Guido Reni", "John Riley", "Marcello Bacciarelli","Rembrandt","Alfredo Volpi", "Henry Ossawa Tanner", "Pierre Soulages","Hieronymus Bosch","Agnes Lawrence Pelton","George Morland", "Jean-Baptiste Carpeaux"]
too_latest_years = ["Rupert Bunny", "Vasily Polenov", "Giovanni Paolo Panini", "Guido Reni","John Riley", "Luca Giordano", "Matthias Stom","Rembrandt", "Giovanni Bellini", "Alfredo Volpi", "Francesco Melzi", "Auguste Rodin", "Edgar Degas", "Henry Ossawa Tanner", "John Frederick Kensett","Giorgio de Chirico", "Maria Sibylla Merian", "Hieronymus Bosch","Jan Provoost","Jean Fouquet","Anton Azbe", "Jean-Baptiste Carpeaux"]
second_batch=['Hieronymus Bosch',
 'Jan Provoost',
 'George Lambert',
 'Charles Turner',
 'Thomas Jones',
 'William Morris']


In [128]:
# For artists whose FirstYear is too early, approximate it as birth year + 18
for artist in too_early_years:
    mask = artists['artist'] == artist
    artists.loc[mask, 'FirstYear'] = artists.loc[mask, 'birth_year'] + 18
# The too_latest_years artists are corrected manually.

In [130]:
# Manually set the last active years
their_last_year = [1947, 1898, 1765, 1642, 1641, 1705, 1649, 1669, 1516, 1988, 1570, 1917, 1917, 1937, 1872, 1978, 1705, 1705, 1705, 1529, 1460, 1900, 1875]
last_years = [1516, 1460, 1802, 1832, 1803, 1892]
# Note: their_last_year holds 23 values but too_latest_years holds 22 names,
# so the final value is never used; the name/year alignment should be double-checked.
for artist, year in zip(too_latest_years, their_last_year):
    artists.loc[artists['artist'] == artist, 'LastYear'] = year
for artist, year in zip(second_batch, last_years):
    artists.loc[artists['artist'] == artist, 'LastYear'] = year
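The corrections above pair names and years by list position, which is easy to misalign when either list is edited. One possible safeguard, sketched here with a hypothetical helper `pair_corrections` and made-up names and years, is to build a name-to-year dict and refuse lists of different lengths.

```python
def pair_corrections(names, years):
    """Pair each name with its year, refusing lists of different lengths."""
    if len(names) != len(years):
        raise ValueError(f"{len(names)} names but {len(years)} years")
    return dict(zip(names, years))

# Hypothetical names and years, for illustration only
corrections = pair_corrections(["Painter A", "Painter B"], [1900, 1950])
print(corrections)
```

The dict could then drive the update, for example `for artist, year in corrections.items(): artists.loc[artists['artist'] == artist, 'LastYear'] = year`.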

In [None]:
artists = artists.merge(subset, on='artist', how='left')

In [145]:

cols = artists.columns.to_list()
# Move the last column to position 15
cols = cols[0:15] + cols[-1:] + cols[15:-1]
cols.remove('PlacesCount_x')
# Select and rename in one step, so the rename does not act in place on a slice
artists = artists[cols].rename(columns={'PlacesCount_y': 'PlacesCount'})
artists.columns

Index(['artist', 'Nationality', 'birth_place', 'birth_year', 'styles',
       'styles_extended', 'StylesYears', 'StylesCount', 'movement',
       'ArtMovement', 'pictures_count', 'FirstYear', 'LastYear', 'Places',
       'PlacesYears', 'PlacesCount', 'PaintingSchool', 'Influencedby',
       'Influencedon', 'Pupils', 'Teachers', 'FriendsandCoworkers'],
      dtype='object')

Last step: in the .csv file, replace float .0 values with integers<br>
*With the default NumPy-backed dtypes this cannot be done directly in pandas, because an `int64` column (Series) cannot hold NaNs; pandas' nullable `Int64` extension dtype is an alternative.*
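A possible alternative to editing the file by hand is pandas' nullable integer dtype, `Int64` (capital I), which stores missing values as `<NA>` and writes them to CSV as empty cells. The sketch below uses a toy frame with illustrative values and writes to an in-memory buffer instead of `datasets/artists.csv`; it assumes all non-missing years are whole numbers, otherwise the cast raises an error.

```python
import io

import numpy as np
import pandas as pd

# Toy frame with a float year column and one missing value
df = pd.DataFrame({"artist": ["Painter A", "Painter B"],
                   "birth_year": [1913.0, np.nan]})

# Cast to the nullable integer dtype; NaN becomes <NA>
df["birth_year"] = df["birth_year"].astype("Int64")

# Write to an in-memory buffer to inspect the CSV text
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

When the file is read back, `pd.read_csv(..., dtype={'birth_year': 'Int64'})` keeps the column as integers; without the `dtype` argument, a column with empty cells is parsed as float again.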


In [63]:
#Turn the non-NaN years into integers
#t1 = artists['FirstYear'].fillna(0).astype(int).replace(0, "remove_hrgldg")
#t2 = artists['LastYear'].fillna(0).astype(int).replace(0, "remove_hrgldg")
#t3 = artists['birth_year'].fillna(0).astype(int).replace(0, "remove_hrgldg")
#
#artists['FirstYear'] = t1
#artists['LastYear'] = t2
#artists['birth_year'] = t3
#
#artists.to_csv('datasets/artists.csv', index=False)
#Manually delete the cells with "remove_hrgldg"

NOTE: manually deleted the cells containing "remove_hrgldg" from the csv file.

In [64]:
artists = pd.read_csv('datasets/artists.csv')