# Creating the PainterPalette dataset from paintings datasets

The aim of this project is to build a dataset of painters from painting datasets such as WikiArt and Art500k: combining their features, filling in missing painter data through web scraping (Google and the Wiki API), and then creating links between painters based on stylistic similarity and on geographical and social interaction.

One long-term goal is a JSON file that holds all of the combined data hierarchically. For example, the top level could be the art movement; inside each movement are its artists, with some base data such as birthplace, years of birth and death, and other geographical data; inside each artist are the paintings with all of their data. (Even better would be to split each painter into eras and place the paintings inside those.) This structure could then be used to build a network of art movements, artists, and paintings.
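A minimal sketch of the nested layout described above, using only the standard library. The field names (`movement`, `birth_place`, `paintings`, ...) and the sample records are made up for illustration; they are assumptions, not the project's final schema.

```python
import json

# Made-up flat records standing in for rows of the combined dataset.
records = [
    {"movement": "Movement X", "artist": "Artist A", "birth_place": "City 1",
     "title": "Painting 1", "year": 1900},
    {"movement": "Movement X", "artist": "Artist A", "birth_place": "City 1",
     "title": "Painting 2", "year": 1905},
    {"movement": "Movement Y", "artist": "Artist B", "birth_place": "City 2",
     "title": "Painting 3", "year": 1950},
]

def build_hierarchy(rows):
    """Nest flat painting records as movement -> artist -> paintings."""
    tree = {}
    for row in rows:
        artists = tree.setdefault(row["movement"], {})
        artist = artists.setdefault(
            row["artist"], {"birth_place": row["birth_place"], "paintings": []}
        )
        artist["paintings"].append({"title": row["title"], "year": row["year"]})
    return tree

hierarchy = build_hierarchy(records)
print(json.dumps(hierarchy, indent=2, ensure_ascii=False))
```

An extra "era" level could be added the same way, with one more `setdefault` between the artist and the paintings.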

NEXT STEPS:<br>
- Match more artists who appear in both datasets under different names, and include artists who appear in only one of the datasets.<br>
- Remove non-painters from the datasets.<br>
- Finish and add Art500k artist dataset updates. <br>


FURTHER STEPS:<br>
- Turn the dataset into JSON format, and add pictures data

In [1]:
import pandas as pd
import numpy as np

## WikiArt data

### Birthplaces, birth years

In [9]:
artists_A = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/wikiart_artists.csv')
artists_A

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year
0,Ad Reinhardt,"Abstract Art, Abstract Expressionism, Color Fi...",Abstract Expressionism,"{Abstract Art:15},{Abstract Expressionism:5},{...",52,Buffalo,1913.0
1,Adnan Coker,"Abstract Art, Abstract Expressionism",Abstract Art,"{Abstract Art:25},{Abstract Expressionism:3}",28,,
2,Akkitham Narayanan,Abstract Art,Abstract Art,{Abstract Art:17},17,Kerala,1939.0
3,Alberto Magnelli,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...",Abstract Art,"{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...",35,Florence,1888.0
4,Alekos Kontopoulos,"Abstract Art, Cubism, Expressionism, Post-Impr...",Social Realism,"{Abstract Art:26},{Cubism:5},{Expressionism:10...",79,Lamia,1904.0
...,...,...,...,...,...,...,...
3198,Serhij Schyschko,Unknown,Academic Art,{Unknown:9},9,,
3199,Vudon Baklytsky,Unknown,Soviet Nonconformist Art,{Unknown:46},46,,
3200,Wolfgang Tillmans,Unknown,Contemporary,{Unknown:9},9,Remscheid,1968.0
3201,Wu Daozi,Unknown,Tang Dynasty (618–907),{Unknown:8},8,Chang'an,680.0


Artists grouped by style data

In [7]:
wa_grouped = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/wikiart_artists_styles_grouped.csv')
print("Length:", len(wa_grouped), "\n", "Number of groups with only 1 count:", len(wa_grouped[wa_grouped['count']==min(wa_grouped['count'])]))
wa_grouped[wa_grouped['artist'].str.contains("Monet")].sort_values(by=['count'], ascending=False)

Length: 7646 
 Number of groups with only 1 count: 1115


Unnamed: 0,style,artist,movement,count
2963,Impressionism,Claude Monet,Impressionism,1341
5468,Realism,Claude Monet,Impressionism,12
7041,Unknown,Claude Monet,Impressionism,12
462,Academicism,Claude Monet,Impressionism,1
3339,Japonism,Claude Monet,Impressionism,1
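The `styles_extended` column of the WikiArt table appears to carry per-style painting counts in a compact `{style:count}` string form. The sketch below parses such a string into a dictionary; `parse_style_counts` is a hypothetical helper, and it assumes every entry follows the pattern visible in the preview.

```python
import re

# One "{style:count}" entry; the style name is everything up to the last colon.
STYLE_COUNT = re.compile(r"\{([^{}]+):(\d+)\}")

def parse_style_counts(text):
    """Turn '{Abstract Art:25},{Abstract Expressionism:3}' into a dict of counts."""
    return {style.strip(): int(count) for style, count in STYLE_COUNT.findall(text)}

# Value taken from the Adnan Coker row of the WikiArt preview.
counts = parse_style_counts("{Abstract Art:25},{Abstract Expressionism:3}")
print(counts)
print(sum(counts.values()))  # compare against that row's pictures_count
```

For this row the counts sum to 28, which agrees with its `pictures_count`.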


## Art500K

First dataset (from official website)

In [12]:
art500k_artists = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/art500k_artists.csv')
art500k_artists[0:7]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount,PlacesCount
0,Gustave Courbet,French,,{Realism:272},"Rembrandt,Caravaggio,Diego Velazquez,Peter Pau...","Edouard Manet,Claude Monet,Pierre-Auguste Reno...",,,,1830.0,1877.0,"London, Montpellier, Moscow, CA, UK, Norway, D...","France:1841-1876,Switzerland:1844-1874,Lille:1...","Realism:1835-1877,Romanticism:1830-1849","{Realism:257}, {Romanticism:13}","{France:88},{Switzerland:7},{Lille:8},{Paris:4..."
1,Auguste Rodin,French,,"{Modern art:3},{Impressionism:91}","Michelangelo,Donatello,","Georgia O'Keeffe,Man Ray,Aristide Maillol,Olex...","Constantin Brancusi,",,,1865.0,1985.0,"London, CA, UK, Switzerland, Lisbon, US, Germa...","France:1865-1889,Paris:1865-1898,CA:1891-1891,...",Impressionism:1865-1905,{Impressionism:90},"{France:52},{Paris:15},{Brussels:2},{Belgium:1..."
2,Frida Kahlo,Mexican,,"{Naïve Art (Primitivism),Surrealism:99}","Amedeo Modigliani,Diego Rivera,Jose Clemente O...","Judy Chicago,Georgia O'Keeffe,Feminist Art,",,,,1922.0,1954.0,"CA, LA, New York, US, New Orleans, Washington ...","Mexico:1927-1954,San Francisco:1931-1933,Mexic...","Naïve Art (Primitivism):1922-1954,Surrealism:1...","{Naïve Art (Primitivism):99}, {Surrealism:15}","{Mexico:50},{San Francisco:6},{New York:4},{Me..."
3,Banksy,,,,,,,,,2011.0,2011.0,"Los Angeles, London, UK, Palestine, California...","London:2011-2011,UK:2011-2011",,,"{Palestine:1},{Los Angeles:3},{California:3},{..."
4,El Greco,"Spanish,Greek",Cretan School,"{Spanish Renaissance:1},{Renaissance:2},{Manne...","Byzantine Art,","Expressionism,Cubism,Eugene Delacroix,Edouard ...",,"Titian,","Giulio Clovio,",1568.0,1614.0,"Seville, London, Illescas, Romania, Moscow, Gr...","Spain:1577-1599,London:1600-1600,UK:1600-1600,...",Mannerism (Late Renaissance):1568-1600,"{Renaissance:2}, {XVI CenturySpanish Painting:...","{Spain:75},{Boston:1},{MA:1},{US:27},{Museo de..."
5,Diego Rivera,Mexican,"Mexican Mural Renaissance,La Ruche","{Social Realism,Muralism:146}","Marc Chagall,Robert Delaunay,","Frida Kahlo,Pedro Coronel,Vlady,",,,"Amedeo Modigliani,Saturnino Herran,Roberto Mon...",1904.0,1956.0,"Moscow, CA, Acapulco, New York, Spain, Northam...","Acapulco:1956-1956,Mexico:1905-1956,Guerrero:1...","Cubism:1912-1916,Muralism:1922-1956,Art Deco:1...","{Post-impressionism:1}, {Cubism:19}, {Mexican ...","{France:1},{Paris:1},{Moscow:1},{Acapulco:2},{..."
6,Claude Monet,French,,"{Modern art:3},{Impressionism:1340}","Gustave Courbet,Charles-Francois Daubigny,John...","Childe Hassam,Robert Delaunay,Wassily Kandinsk...",,"Eugene Boudin,Charles Gleyre,","Alfred Sisley,Pierre-Auguste Renoir,Camille Pi...",1858.0,1926.0,"London, Main, Moscow, Rotterdam, Giverny, CA, ...","France:1861-1924,London:1869-1889,UK:1869-1908...","Impressionist:1879-1904,Impressionism:1864-192...",{Nineteenth-Century European PaintingImpressio...,"{France:79},{Giverny:1},{London:6},{UK:15},{Bo..."


As the preview shows, this data still needs further work (for example, Auguste Rodin's `LastYear` is listed as 1985.0).

# Combine the two datasets

## Version 2024.01.16: Take the intersection of WikiArt and Art500k, and look for similar names in Art500k

In [16]:
artists= artists_A[artists_A['artist'].isin(art500k_artists['artist'])].reset_index(drop=True)
drop = artists_A[~(artists_A['artist'].isin(art500k_artists['artist']))]
print("Artists remaining:", len(artists), "\n", "Artists dropped:", len(drop))

Artists remaining: 2457 
 Artists dropped: 746
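The same split can be expressed as a single left merge with `indicator=True`, which labels each WikiArt row as matched or unmatched. This is only a sketch on made-up stand-in frames (`wikiart`, `art500k`), not the notebook's data.

```python
import pandas as pd

# Tiny made-up stand-ins for artists_A and art500k_artists.
wikiart = pd.DataFrame({"artist": ["Artist A", "Artist B", "Artist C"]})
art500k = pd.DataFrame({"artist": ["Artist A", "Artist C", "Artist D"]})

# Left merge on the name; the _merge column says where each row was found.
merged = wikiart.merge(
    art500k[["artist"]].drop_duplicates(), on="artist", how="left", indicator=True
)
matched = merged.loc[merged["_merge"] == "both", "artist"].tolist()
unmatched = merged.loc[merged["_merge"] == "left_only", "artist"].tolist()
print("Artists remaining:", len(matched), "Artists dropped:", len(unmatched))
```

Dropping duplicate names on the right side first keeps the merge from multiplying WikiArt rows.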


Let's look in Art500k for painters whose names are very similar to the dropped WikiArt names:

In [30]:
import difflib
#from fuzzywuzzy import fuzz #Other possibility

# Function to calculate similarity between two strings
def similarity(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()

def similarity_difference(s1, s2):
    return (1 - similarity(s1, s2))*len(s1)

In [33]:
def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    
    return previous_row[-1]
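To see how the two measures relate, the sketch below evaluates both on two name pairs taken from the two datasets. It repeats the functions defined above so that it runs on its own. Note that the scaled `difflib` measure is not an edit count: a single inserted character gives a Levenshtein distance of 1 but a scaled difference below 1.

```python
import difflib

def similarity(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()

def levenshtein_distance(s1, s2):
    # Same dynamic-programming algorithm as above, kept here for self-containment.
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

pairs = [
    ("Leon Battista Alberti", "Leone Battista Alberti"),  # one inserted character
    ("Park Seo-Bo", "Park Seo Bo"),                        # one substituted character
]
for a, b in pairs:
    sim = similarity(a, b)
    print(a, "|", b, "|", round(sim, 6), round((1 - sim) * len(a), 6), levenshtein_distance(a, b))
```

Because `ratio()` is twice the number of matched characters divided by the combined length, the scaled difference depends on both string lengths, so thresholds on it are only approximate character counts.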

In [126]:
#Could use numpy to be faster, but this is fine for now
drop_sims = pd.DataFrame(columns=['artist','"Best" Art500k pair','similarity', 'Character difference'])
for painter in drop['artist']:
    all_sims = []
    max_sim = 0
    for art500k_artist in art500k_artists['artist']:
        similarity_score = similarity(painter, art500k_artist)
        if similarity_score >= max_sim: #Runtime reasons
            max_sim = similarity_score
            all_sims.append((similarity_score,(1-similarity_score)*len(painter), art500k_artist))
    final_maximum = max(sims[0] for sims in all_sims) 
    for sims in all_sims:
        if sims[0] == final_maximum: #Just take the highest ones
            drop_sims = pd.concat([drop_sims, pd.DataFrame([[painter, sims[2], sims[0], sims[1]]], columns=['artist','"Best" Art500k pair','similarity', 'Character difference'])])    
drop_sims.sort_values(by=['similarity'], ascending=False)



Unnamed: 0,artist,"""Best"" Art500k pair",similarity,Character difference
0,John Frederick Herring Sr.,"John Frederick Herring, Sr.",0.981132,0.490566
0,Leon Battista Alberti,Leone Battista Alberti,0.976744,0.488372
0,Willem van Swanenburg,Willem van Swanenburgh,0.976744,0.488372
0,O. Louis Guglielmi,O Louis Guglielmi,0.971429,0.514286
0,Alexey Bogolyubov,Alexey Bogolyubov,0.971429,0.514286
...,...,...,...,...
0,[ a y s h ],Mary Lish,0.500000,5.500000
0,[ a y s h ],Cray Fish,0.500000,5.500000
0,Boushra Yahya Almutawakel,Thouraya Hamouda,0.487805,12.804878
0,JAROSLAV KELUC,ARCHIVO FIEL,0.461538,7.538462
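A possible restructuring of the search above: collect the result rows in a plain list and build the DataFrame once at the end. This avoids repeatedly concatenating onto an initially empty DataFrame, which is slow and can emit a warning in recent pandas versions. `best_matches` and its column names are hypothetical, and the candidate list is a small stand-in.

```python
import difflib
import pandas as pd

def best_matches(names, candidates):
    """For each name, keep every candidate that ties for the highest difflib ratio."""
    columns = ["artist", "best_pair", "similarity", "character_difference"]
    rows = []
    for name in names:
        scored = [
            (difflib.SequenceMatcher(None, name, cand).ratio(), cand)
            for cand in candidates
        ]
        if not scored:
            continue
        top = max(score for score, _ in scored)
        for score, cand in scored:
            if score == top:
                rows.append({
                    "artist": name,
                    "best_pair": cand,
                    "similarity": score,
                    "character_difference": (1 - score) * len(name),
                })
    return pd.DataFrame(rows, columns=columns)

# Small stand-in candidate list; the first two spellings occur in Art500k.
result = best_matches(["Park Seo-Bo"], ["Park Seo Bo", "Park Seo-bo", "Someone Else"])
print(result.sort_values("similarity", ascending=False))
```

Both close spellings tie for the best score here, so both are kept, mirroring the behaviour of the loop above.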


In [129]:
drop_sims = drop_sims.reset_index(drop=True)
drop_sims[drop_sims['Character difference'] < 1.01 ].sort_values(by=['similarity'], ascending=False)[-15:]

Unnamed: 0,artist,"""Best"" Art500k pair",similarity,Character difference
273,Ayse Erkmen,Ayşe Erkmen,0.909091,1.0
588,Park Seo-Bo,Park Seo Bo,0.909091,1.0
587,Park Seo-Bo,Park Seo-bo,0.909091,1.0
191,Zao Wou-Ki,Zao Wou Ki,0.9,1.0
225,Léo Schnug,Leo Schnug,0.9,1.0
558,Tony DeLap,Tony Delap,0.9,1.0
447,SM Sultan,Sm Sultan,0.888889,1.0
70,Jay DeFeo,Jay Defeo,0.888889,1.0
189,Se-Ok Suh,Se Ok Suh,0.888889,1.0
606,Ay-O,Ay O,0.75,1.0


In [111]:
painter_name_pairs_dict = {}
art500k_alias_groups = {}
subset = drop_sims[drop_sims['Character difference'] < 1.01 ].sort_values(by=['similarity'], ascending=False)[:-5].reset_index(drop=True)
for index, row in subset.iterrows():
    painter = row['artist']
    if painter not in painter_name_pairs_dict.keys():
        painter_name_pairs_dict[painter] = subset.loc[index, '"Best" Art500k pair']
        art500k_alias_groups[painter] = [subset.loc[index, '"Best" Art500k pair']]
    else:
        t = art500k_alias_groups[painter]
        art500k_alias_groups[painter] = t + [subset.loc[index, '"Best" Art500k pair']]

In [117]:
for key, value in art500k_alias_groups.items():
    if len(value) > 1:
        print(key, value)

Juan Carreno de Miranda ['Juan Carreño de Miranda', 'Juan Carreno De Miranda']
Albert Rafols-Casamada ['Albert Ràfols-Casamada', 'Albert Rafols Casamada']
Francisco de Zurbaran ['Francisco De Zurbaran', 'Francisco de Zurbarán']
Andres de Santa Maria ['Andres De Santa Maria', 'Andrés de Santa Maria']
Jean-Honore Fragonard ['Jean Honore Fragonard', 'Jean-Honoré Fragonard']
Theo van Rysselberghe ['Théo van Rysselberghe', 'Theo Van Rysselberghe']
Janos Mattis-Teutsch ['János Mattis-Teutsch', 'Janos Mattis Teutsch']
Edouard Debat-Ponsan ['Édouard Debat-Ponsan', 'Edouard Debat Ponsan']
Juan de Valdes Leal ['Juan de Valdés Leal', 'Juan De Valdes Leal']
Park Seo-Bo ['Park Seo Bo', 'Park Seo-bo']


I now edit these in the Art500k dataset. Let's try again:

In [130]:
art500k_artists = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/art500k_artists.csv')
artists= artists_A[artists_A['artist'].isin(art500k_artists['artist'])].reset_index(drop=True)
drop = artists_A[~(artists_A['artist'].isin(art500k_artists['artist']))]

drop_sims = pd.DataFrame(columns=['artist','"Best" Art500k pair','similarity', 'Character difference'])
print("Cases:", len(drop))
for painter in drop['artist']:
    if (drop.index[drop['artist'] == painter][0] == len(drop)//4):
        print("25% now drop finding...")
    all_sims = []
    max_sim = 0
    for art500k_artist in art500k_artists['artist']:
        similarity_score = similarity(painter, art500k_artist)
        if similarity_score >= max_sim: #Runtime reasons
            max_sim = similarity_score
            all_sims.append((similarity_score,(1-similarity_score)*len(painter), art500k_artist))
    final_maximum = max(sims[0] for sims in all_sims) 

    for sims in all_sims:
        if sims[0] == final_maximum: #Just take the highest ones
            drop_sims = pd.concat([drop_sims, pd.DataFrame([[painter, sims[2], sims[0], sims[1]]], columns=['artist','"Best" Art500k pair','similarity', 'Character difference'])])    

print("Done with the drop finding...")

painter_name_pairs_dict = {}
art500k_alias_groups = {}
subset = drop_sims[drop_sims['Character difference'] < 1.01 ].sort_values(by=['similarity'], ascending=False)[:-5].reset_index(drop=True)
for index, row in subset.iterrows():
    if (index == len(subset)//4):
        print("25% now...")
    painter = row['artist']
    if painter not in painter_name_pairs_dict.keys():
        painter_name_pairs_dict[painter] = subset.loc[index, '"Best" Art500k pair']
        art500k_alias_groups[painter] = [subset.loc[index, '"Best" Art500k pair']]
    else:
        t = art500k_alias_groups[painter]
        art500k_alias_groups[painter] = t + [subset.loc[index, '"Best" Art500k pair']]


Cases: 746


  drop_sims = pd.concat([drop_sims, pd.DataFrame([[painter, sims[2], sims[0], sims[1]]], columns=['artist','"Best" Art500k pair','similarity', 'Character difference'])])


Done with the drop finding...
25% now...


In [131]:
for key, value in art500k_alias_groups.items():
    if len(value) > 1:
        print(key, value)

So the fix succeeded: no WikiArt painter matches more than one Art500k spelling any more. Meanwhile, one case should now be dropped:

In [136]:
del painter_name_pairs_dict['Zaya']

Next, let's look at pairs with a larger difference, roughly two characters (a character difference between 1.01 and 2.01):

In [141]:
drop_sims = drop_sims.reset_index(drop=True)

In [147]:
drop_sims[(drop_sims['Character difference'] > 1.01) & (drop_sims['Character difference'] <2.01) ].sort_values(by=['similarity'], ascending=False)[0:]

Unnamed: 0,artist,"""Best"" Art500k pair",similarity,Character difference
366,Marevna (Marie Vorobieff),Marevna Marie Vorobieff,0.958333,1.041667
229,Petro Kholodny (Elder),Petro Kholodny Elder,0.952381,1.047619
506,Martín Rico y Ortega,Martín Rico Ortega,0.947368,1.052632
471,"Robert De Niro, Sr.",Robert De Niro Sr,0.944444,1.055556
218,J. C. Leyendecker,J C Leyendecker,0.937500,1.062500
...,...,...,...,...
234,Kim Prisu,Kid Paris,0.777778,2.000000
349,Sun Mu,Sun Xun,0.769231,1.384615
59,Ed Clark,Clark,0.769231,1.846154
456,Alan Lee,Sean Lee,0.750000,2.000000


In [146]:
subset = drop_sims[(drop_sims['Character difference'] > 1.01) & (drop_sims['Character difference'] <2.01) ].sort_values(by=['similarity'], ascending=False)[0:30]
for index, row in subset.iterrows():
    painter = row['artist']
    if painter not in painter_name_pairs_dict.keys(): #Precaution; in theory this is always true
        painter_name_pairs_dict[painter] = subset.loc[index, '"Best" Art500k pair']
        
painter_name_pairs_dict.update({"Chang Dai-chien": "Chang Dai Chien", "Félix Del Marle":"F Lix Del Marle", "Roger Bissière":"Roger Bissi Re","Jacques Hérold": "Jacques H Rold", "YiFei Chen": "Yifei Chen", "M.C. Escher": "M C Escher", "Hong Song-dam": "Hong Song Dam", "Mestre Ataíde": "Mestre Ata De", "Li Yuan-chia":"Li Yuan Chia", "José Luzán": "Jose Luzan"})
del painter_name_pairs_dict['Jacob Collins']
del painter_name_pairs_dict['Michael Bell']                               

In [148]:
painter_name_pairs_dict["Robert De Niro, Sr."]	

'Robert De Niro Sr'

In [138]:
painter_lowercase_pairs = {}
for index, row in drop_sims.iterrows():
    if row['Character difference'] > 1.01:
        painter_lowercase = row['artist'].lower()
        for artist in art500k_artists['artist']:
            if painter_lowercase == artist.lower():
                painter_lowercase_pairs[row['artist']] = artist
                
painter_lowercase_pairs

{'Adam van der Meulen': 'Adam Van Der Meulen',
 'JAROSLAV KELUC': 'Jaroslav Keluc',
 'Ding Yi': 'DING Yi',
 'JCJ Vanderheyden': 'Jcj Vanderheyden',
 'Bart van der Leck': 'Bart Van Der Leck',
 'Luis de Madrazo y Kuntz': 'Luis De Madrazo Y Kuntz',
 'Phase 2': 'PHASE 2',
 'TRACY 168': 'Tracy 168'}
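Once a WikiArt-to-Art500k name mapping exists, one way to use it is to rename the Art500k artists to their WikiArt spelling before taking the intersection. The sketch below assumes a mapping shaped like `painter_name_pairs_dict`; the `rows` column is a made-up placeholder for the real Art500k columns.

```python
import pandas as pd

# WikiArt spelling -> Art500k spelling (two pairs from the results above).
name_pairs = {"Zao Wou-Ki": "Zao Wou Ki", "Tony DeLap": "Tony Delap"}

# Stand-in for art500k_artists; "rows" is a placeholder column.
art500k = pd.DataFrame(
    {"artist": ["Zao Wou Ki", "Tony Delap", "Claude Monet"], "rows": [1, 2, 3]}
)

# Invert the mapping so Art500k spellings are replaced by WikiArt spellings.
to_wikiart = {art500k_name: wikiart_name for wikiart_name, art500k_name in name_pairs.items()}
art500k["artist"] = art500k["artist"].replace(to_wikiart)
print(art500k)
```

If several Art500k spellings map to the same WikiArt name, the renamed rows become duplicates and would still need to be merged.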

## Version 2023.12.02: Take the intersection of WikiArt and Art500k

In [95]:
artist_A = pd.read_csv('datasets/wikiart_artists.csv')
artists= artist_A[artist_A['artist'].isin(art500k_artists['artist'])].reset_index(drop=True)
print("Artists remaining:", len(artists))

Artists remaining: 2457


In [96]:
artists = artists.merge(art500k_artists, on='artist', how='left')
artists

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,Nationality,PaintingSchool,ArtMovement,...,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places,PlacesYears,StylesYears,StylesCount
0,Ad Reinhardt,"Abstract Art, Abstract Expressionism, Color Fi...",Abstract Expressionism,"{Abstract Art:15},{Abstract Expressionism:5},{...",52,Buffalo,1913.0,American,"New York School,American Abstract Artists,Iras...","{Abstract Expressionism,Minimalism:52},",...,"Donald Judd,Barnett Newman,Mark Rothko,Frank S...",,,"Jackson Pollock,",1937.0,1966.0,"US, NY, Canberra, Fort Worth, Buffalo, Austral...","New York City:1938-1966,,NY:1938-1966,,US:1938...","Expressionism:1944-1946,,Abstract Art:1937-194...","{Expressionism:7}, {Abstract Art:15}, {Color F..."
1,Adnan Coker,"Abstract Art, Abstract Expressionism",Abstract Art,"{Abstract Art:25},{Abstract Expressionism:3}",28,,,Turkish,,"{Abstract Art:28},",...,,,,,1968.0,2008.0,,,"Abstract Art:1992-2008,,Abstract Expressionism...","{Abstract Art:25}, {Abstract Expressionism:3}"
2,Akkitham Narayanan,Abstract Art,Abstract Art,{Abstract Art:17},17,Kerala,1939.0,Indian,,"{Abstract Art:17},",...,,,,,1974.0,1974.0,,,"Abstract Art:1974-1974,",{Abstract Art:17}
3,Alberto Magnelli,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...",Abstract Art,"{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...",35,Florence,1888.0,"Italian,French",Abstraction-Création,"{Abstract Art,Cubo-Futurism,Concrete Art (Conc...",...,,,,,1909.0,1971.0,,,"Abstract Art:1916-1971,,Cubism:1914-1935,,Meta...","{Abstract Art:21}, {Cubism:10}, {Metaphysical ..."
4,Alekos Kontopoulos,"Abstract Art, Cubism, Expressionism, Post-Impr...",Social Realism,"{Abstract Art:26},{Cubism:5},{Expressionism:10...",79,Lamia,1904.0,Greek,,"{Abstract Art,Social Realism:79},",...,,,,,1931.0,1974.0,,,"Post-Impressionism:1932-1955,,Expressionism:19...","{Post-Impressionism:8}, {Expressionism:11}, {R..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2452,Marianne von Werefkin,Unknown,Expressionism,{Unknown:61},61,Tula,1860.0,,,"{Der Blaue Reiter:1},",...,,,,,,,,,,
2453,Robert Demachy,Unknown,Pictorialism,{Unknown:24},24,Saint-Germain-en-Laye,1859.0,French,,"{Pictorialism:24},",...,,,,,1900.0,1914.0,France,,,
2454,Wolfgang Tillmans,Unknown,Contemporary,{Unknown:9},9,Remscheid,1968.0,,,,...,,,,,2001.0,2001.0,"London, United Kingdom",,,
2455,Wu Daozi,Unknown,Tang Dynasty (618–907),{Unknown:8},8,Chang'an,680.0,Chinese,Four fathers of Chinese painting,"{Tang Dynasty (618–907):8},",...,,,,,,,,,,


Later, extend this list with the artists skipped from both datasets.

In [97]:
artist_AnotB = artist_A[~artist_A['artist'].isin(art500k_artists['artist'])].reset_index(drop=True).sort_values(by=['pictures_count'], ascending=False)
artist_AnotB.head(10)

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year
0,Alfred Freddy Krupa,"Abstract Art, Abstract Expressionism, Academic...",New Ink Art,"{Abstract Art:1},{Abstract Expressionism:1},{A...",735,Karlovac,1971.0
720,Zdzislaw Beksinski,Surrealism,Magic Realism,{Surrealism:707},707,Sanok,1929.0
737,Oleksandr Aksinin,Unknown,Soviet Nonconformist Art,{Unknown:480},480,Kiev,1930.0
140,M.C. Escher,"Art Deco, Art Nouveau (Modern), Cubism, Expres...",Surrealism,"{Art Deco:1},{Art Nouveau (Modern):1},{Cubism:...",470,Leeuwarden,1898.0
121,Oleg Holosiy,"Academicism, Cubism, Expressionism, Naïve Art ...",Neo-Expressionism,"{Academicism:1},{Cubism:5},{Expressionism:30},...",372,Dnipro,1965.0
308,Alexander Roitburd,"Cubism, Transavantgarde",Transavantgarde,"{Cubism:1},{Transavantgarde:263}",264,Odesa,1961.0
377,Maria Bozoky,"Expressionism, Impressionism",Expressionism,"{Expressionism:252},{Impressionism:4}",256,Oradea,1909.0
606,Konstantin Gorbatov,Post-Impressionism,Post-Impressionism,{Post-Impressionism:254},254,Tolyatti,1876.0
590,Felix Nadar,Pictorialism,Pictorialism,{Pictorialism:245},245,rue Saint-Honoré,1820.0
436,J.M.W. Turner,"Impressionism, Romanticism, Unknown",Romanticism,"{Impressionism:1},{Romanticism:243},{Unknown:1}",245,London,1775.0


In [98]:
cols = artists.columns.tolist()
cols

['artist',
 'styles',
 'movement',
 'styles_extended',
 'pictures_count',
 'birth_place',
 'birth_year',
 'Nationality',
 'PaintingSchool',
 'ArtMovement',
 'Influencedby',
 'Influencedon',
 'Pupils',
 'Teachers',
 'FriendsandCoworkers',
 'FirstYear',
 'LastYear',
 'Places',
 'PlacesYears',
 'StylesYears',
 'StylesCount']

In [99]:
cols = cols[0:1]+cols[7:8]+cols[5:7]+cols[1:2]+cols[3:4]+cols[19:]+cols[2:3]+cols[9:10]+cols[4:5]+cols[15:19]+cols[8:9]+cols[10:15]
artists = artists[cols]
artists

Unnamed: 0,artist,Nationality,birth_place,birth_year,styles,styles_extended,StylesYears,StylesCount,movement,ArtMovement,...,FirstYear,LastYear,Places,PlacesYears,PaintingSchool,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers
0,Ad Reinhardt,American,Buffalo,1913.0,"Abstract Art, Abstract Expressionism, Color Fi...","{Abstract Art:15},{Abstract Expressionism:5},{...","Expressionism:1944-1946,,Abstract Art:1937-194...","{Expressionism:7}, {Abstract Art:15}, {Color F...",Abstract Expressionism,"{Abstract Expressionism,Minimalism:52},",...,1937.0,1966.0,"US, NY, Canberra, Fort Worth, Buffalo, Austral...","New York City:1938-1966,,NY:1938-1966,,US:1938...","New York School,American Abstract Artists,Iras...","Piet Mondrian,Kazimir Malevich,Josef Albers,","Donald Judd,Barnett Newman,Mark Rothko,Frank S...",,,"Jackson Pollock,"
1,Adnan Coker,Turkish,,,"Abstract Art, Abstract Expressionism","{Abstract Art:25},{Abstract Expressionism:3}","Abstract Art:1992-2008,,Abstract Expressionism...","{Abstract Art:25}, {Abstract Expressionism:3}",Abstract Art,"{Abstract Art:28},",...,1968.0,2008.0,,,,,,,,
2,Akkitham Narayanan,Indian,Kerala,1939.0,Abstract Art,{Abstract Art:17},"Abstract Art:1974-1974,",{Abstract Art:17},Abstract Art,"{Abstract Art:17},",...,1974.0,1974.0,,,,,,,,
3,Alberto Magnelli,"Italian,French",Florence,1888.0,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...","{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...","Abstract Art:1916-1971,,Cubism:1914-1935,,Meta...","{Abstract Art:21}, {Cubism:10}, {Metaphysical ...",Abstract Art,"{Abstract Art,Cubo-Futurism,Concrete Art (Conc...",...,1909.0,1971.0,,,Abstraction-Création,,,,,
4,Alekos Kontopoulos,Greek,Lamia,1904.0,"Abstract Art, Cubism, Expressionism, Post-Impr...","{Abstract Art:26},{Cubism:5},{Expressionism:10...","Post-Impressionism:1932-1955,,Expressionism:19...","{Post-Impressionism:8}, {Expressionism:11}, {R...",Social Realism,"{Abstract Art,Social Realism:79},",...,1931.0,1974.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2452,Marianne von Werefkin,,Tula,1860.0,Unknown,{Unknown:61},,,Expressionism,"{Der Blaue Reiter:1},",...,,,,,,,,,,
2453,Robert Demachy,French,Saint-Germain-en-Laye,1859.0,Unknown,{Unknown:24},,,Pictorialism,"{Pictorialism:24},",...,1900.0,1914.0,France,,,,,,,
2454,Wolfgang Tillmans,,Remscheid,1968.0,Unknown,{Unknown:9},,,Contemporary,,...,2001.0,2001.0,"London, United Kingdom",,,,,,,
2455,Wu Daozi,Chinese,Chang'an,680.0,Unknown,{Unknown:8},,,Tang Dynasty (618–907),"{Tang Dynasty (618–907):8},",...,,,,,Four fathers of Chinese painting,,,,,
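Reordering by positional slices breaks silently whenever a column is added or removed. A name-based alternative is sketched below on a small made-up frame; the `preferred` list is an assumption about which columns should come first.

```python
import pandas as pd

# Empty stand-in frame with a few of the real column names.
df = pd.DataFrame(columns=["artist", "styles", "birth_place", "birth_year", "Nationality"])

# Columns to move to the front, in order; everything else keeps its relative order.
preferred = ["artist", "Nationality", "birth_place", "birth_year"]
ordered = [c for c in preferred if c in df.columns] + [
    c for c in df.columns if c not in preferred
]
df = df[ordered]
print(df.columns.tolist())
```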


In [100]:
artists.to_csv('datasets/artists.csv', index=False)

In [125]:
artists = pd.read_csv('datasets/artists.csv')

In [None]:
# Flag artists whose active period (LastYear - FirstYear) exceeds 90 years
active_span = artists['LastYear'] - artists['FirstYear']
year_mistake = artists.loc[active_span > 90, 'artist'].tolist()
print(year_mistake)

In [None]:
artists[artists['artist'].isin(year_mistake)][['artist','birth_year','FirstYear','LastYear']]

In [129]:
too_early_years = ["Huang Yongyu","Joe Goode","Theodoros Stamos","Pablo Picasso", "Modest Cuixart","Giovanni Paolo Panini", "Guido Reni", "John Riley", "Marcello Bacciarelli","Rembrandt","Alfredo Volpi", "Henry Ossawa Tanner", "Pierre Soulages","Hieronymus Bosch","Agnes Lawrence Pelton","George Morland", "Jean-Baptiste Carpeaux"]
too_latest_years = ["Rupert Bunny", "Vasily Polenov", "Giovanni Paolo Panini", "Guido Reni","John Riley", "Luca Giordano", "Matthias Stom","Rembrandt", "Giovanni Bellini", "Alfredo Volpi", "Francesco Melzi", "Auguste Rodin", "Edgar Degas", "Henry Ossawa Tanner", "John Frederick Kensett","Giorgio de Chirico", "Maria Sibylla Merian", "Hieronymus Bosch","Jan Provoost","Jean Fouquet","Anton Azbe", "Jean-Baptiste Carpeaux"]
second_batch=['Hieronymus Bosch',
 'Jan Provoost',
 'George Lambert',
 'Charles Turner',
 'Thomas Jones',
 'William Morris']


In [128]:
for artist in too_early_years:
    artists.loc[artists['artist'] == artist, 'FirstYear'] = artists[artists['artist'] == artist]['birth_year']+18
#The latest_years artists are manually corrected.

In [130]:
#Manual edit last years
their_last_year = [1947, 1898, 1765, 1642, 1641, 1705, 1649, 1669, 1516, 1988, 1570, 1917, 1917, 1937, 1872, 1978, 1705, 1705, 1705, 1529, 1460, 1900, 1875]
last_years = [1516, 1460, 1802, 1832, 1803, 1892]
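# NOTE: their_last_year holds 23 values but too_latest_years holds 22 names, so the last value
# is never used and the pairing should be double-checked. zip() stops at the shorter list,
# which matches the behaviour of the original index-based loop.
for artist, last_year in zip(too_latest_years, their_last_year):
    artists.loc[artists['artist'] == artist, 'LastYear'] = last_year
for artist, last_year in zip(second_batch, last_years):
    artists.loc[artists['artist'] == artist, 'LastYear'] = last_year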
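Parallel lists are easy to misalign. A dictionary keyed by artist name keeps each correction next to its name and makes it possible to check for names that are not in the table. The sketch below uses a small stand-in frame (`artists_demo`); the two corrections are values that appear in the lists above.

```python
import pandas as pd

# Stand-in for the artists table, with LastYear values as shown in the Art500k preview.
artists_demo = pd.DataFrame(
    {"artist": ["Auguste Rodin", "Claude Monet"], "LastYear": [1985.0, 1926.0]}
)

# Corrections keyed by name, so each value stays attached to its artist.
last_year_fixes = {"Auguste Rodin": 1917, "Rembrandt": 1669}

# Names in the corrections that are absent from the table (worth reviewing by hand).
missing = sorted(set(last_year_fixes) - set(artists_demo["artist"]))

mask = artists_demo["artist"].isin(last_year_fixes)
artists_demo.loc[mask, "LastYear"] = artists_demo.loc[mask, "artist"].map(last_year_fixes)
print(artists_demo)
print("Not found:", missing)
```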

In [None]:
artists = artists.merge(subset, on='artist', how='left')

In [145]:

cols = artists.columns.to_list()
cols  = cols[0:15]+cols[-1:]+cols[15:-1]
cols.remove('PlacesCount_x')
artists = artists[cols]
artists.rename(columns={'PlacesCount_y':'PlacesCount'}, inplace=True)
artists.columns

Index(['artist', 'Nationality', 'birth_place', 'birth_year', 'styles',
       'styles_extended', 'StylesYears', 'StylesCount', 'movement',
       'ArtMovement', 'pictures_count', 'FirstYear', 'LastYear', 'Places',
       'PlacesYears', 'PlacesCount', 'PaintingSchool', 'Influencedby',
       'Influencedon', 'Pupils', 'Teachers', 'FriendsandCoworkers'],
      dtype='object')

Last step: in the .csv file, replace float .0 values with integers<br>
*This cannot be done with a regular NumPy-backed integer column, since such a column (Series) cannot hold NaNs; pandas' nullable `Int64` extension dtype is the exception.*


In [63]:
#Turn the non-NaN years into integers
#t1 = artists['FirstYear'].fillna(0).astype(int).replace(0, "remove_hrgldg")
#t2 = artists['LastYear'].fillna(0).astype(int).replace(0, "remove_hrgldg")
#t3 = artists['birth_year'].fillna(0).astype(int).replace(0, "remove_hrgldg")
#
#artists['FirstYear'] = t1
#artists['LastYear'] = t2
#artists['birth_year'] = t3
#
#artists.to_csv('datasets/artists.csv', index=False)
#Manually delete the cells with "remove_hrgldg"

NOTE: manually deleted the cells containing "remove_hrgldg" from the csv file.
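An alternative that avoids the placeholder and the manual deletion is pandas' nullable integer dtype, `Int64` (capital "I"), which stores missing values next to integers. The sketch below uses a two-row stand-in frame (`demo`) with values from the WikiArt preview and writes to an in-memory buffer instead of `datasets/artists.csv`.

```python
import io
import pandas as pd

# Stand-in frame: one known birth year and one missing value.
demo = pd.DataFrame(
    {"artist": ["Ad Reinhardt", "Adnan Coker"], "birth_year": [1913.0, float("nan")]}
)

# Nullable integers: 1913.0 becomes 1913 and NaN becomes <NA>.
demo["birth_year"] = demo["birth_year"].astype("Int64")

# Missing values are written as empty cells, so no manual clean-up is needed.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```

When reading the file back, passing `dtype={"birth_year": "Int64"}` to `pd.read_csv` should keep the column as nullable integers rather than floats.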

In [64]:
artists = pd.read_csv('datasets/artists.csv')