# WikiArt dataset cleaning, extending

WikiArt: all art pieces stored as a picture in Wikipedia (till a certain date).<br>
Original gathering and steps is in the *datasets/originals/wikiart_initial.ipynb* file. This file contains the updates since 2024.02.05.

## Update 2024.02.12: Check for birth_year and birth_place mistakes

 (e.g., 385: William Scott)

## Update 2024.02.08-12: Gather more information on painters via the Wikidata API

We take the current list of painters:

In [1]:
import pandas as pd

artists_wikiart = pd.read_csv('https://raw.githubusercontent.com/me9hanics/PainterPalette/main/datasets/wikiart_artists.csv')
artists_wikiart

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,death_year,death_place,gender,citizenship,occupations,locations,locations_with_years
0,Ad Reinhardt,"Abstract Art, Abstract Expressionism, Color Fi...",Abstract Expressionism,"{Abstract Art:15},{Abstract Expressionism:5},{...",52,Buffalo,1913.0,1967.0,New York City,male,United States of America,"painter, university teacher, printmaker, colla...",['New York City'],[]
1,Adnan Coker,"Abstract Art, Abstract Expressionism",Abstract Art,"{Abstract Art:25},{Abstract Expressionism:3}",28,,,,,,,,[],[]
2,Akkitham Narayanan,Abstract Art,Abstract Art,{Abstract Art:17},17,Kerala,1939.0,,,,,,[],[]
3,Alberto Magnelli,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...",Abstract Art,"{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...",35,Florence,1888.0,1971.0,Meudon,male,Italy,"illustrator, painter","['Florence', 'Paris']",[]
4,Alekos Kontopoulos,"Abstract Art, Cubism, Expressionism, Post-Impr...",Social Realism,"{Abstract Art:26},{Cubism:5},{Expressionism:10...",79,Lamia,1904.0,1975.0,Athens,male,Greece,"writer, painter",[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3198,Serhij Schyschko,Unknown,Academic Art,{Unknown:9},9,,,,,,,,,
3199,Vudon Baklytsky,Unknown,Soviet Nonconformist Art,{Unknown:46},46,,,,,,,,,
3200,Wolfgang Tillmans,Unknown,Contemporary,{Unknown:9},9,Remscheid,1968.0,,,male,Germany,"photographer, printmaker","['New York City', 'Berlin', 'London']",['New York City:1996-1996']
3201,Wu Daozi,Unknown,Tang Dynasty (618–907),{Unknown:8},8,Chang'an,680.0,,,,,,[],[]


Example:

In [2]:
import httpimport

with httpimport.remote_repo('https://raw.githubusercontent.com/me9hanics/sparql-wikidata-data-collection/main/'):
    import functions as external_functions

In [5]:
external_functions.get_all_person_info("Rembrandt")

{'name': 'Rembrandt',
 'birth_place': 'Leiden',
 'birth_date': '1977-03-15T00:00:00Z',
 'death_date': '2001-10-30T00:00:00Z',
 'death_place': 'Amsterdam',
 'gender': 'male organism',
 'citizenship': 'Dutch Republic',
 'occupation': ['drawer',
  'printmaker',
  'etcher',
  'art collector',
  'collector',
  'painter'],
 'work_locations': [{'location': 'Amsterdam',
   'start_time': '1623-01-01T00:00:00Z',
   'end_time': '1625-01-01T00:00:00Z',
   'point_in_time': None},
  {'location': 'Amsterdam',
   'start_time': '1631-01-01T00:00:00Z',
   'end_time': '1669-01-01T00:00:00Z',
   'point_in_time': None},
  {'location': 'Leiden',
   'start_time': '1625-01-01T00:00:00Z',
   'end_time': '1631-01-01T00:00:00Z',
   'point_in_time': None},
  {'location': 'Leiden',
   'start_time': '1620-01-01T00:00:00Z',
   'end_time': '1624-01-01T00:00:00Z',
   'point_in_time': None}]}

In [6]:
df = artists_wikiart.copy()
unprocessed_painters = [painter for painter in artists_wikiart['artist'] if df.loc[df['artist']==painter, 'gender'].isnull().values[0] == True]

### Partial fetching (querying 1 painter at a time), slower but more reliable:

Warning: this should take over long (5 hours in practice), as Wikidata only accepts 30 queries per minute.

In [31]:
for painter in unprocessed_painters:
    if df.loc[df['artist']==painter,'gender'].isnull().values[0]==False: #We already processed this painter
        continue
    painter_json = external_functions.get_person_info_retry_after(painter, placeofbirth = False, dateofbirth = False)
    if painter_json:
        df.loc[df['artist'] == painter, 'death_year'] = external_functions.find_year(painter_json['death_date'])
        df.loc[df['artist'] == painter, 'death_place'] = painter_json['death_place']
        df.loc[df['artist'] == painter, 'gender'] = painter_json['gender']
        df.loc[df['artist'] == painter, 'citizenship'] = painter_json['citizenship']
        df.loc[df['artist'] == painter, 'occupations'] = ', '.join((painter_json['occupation']))
        df.loc[df['artist'] == painter, 'locations'] = external_functions.get_places_from_response(painter_json)
        df.loc[df['artist'] == painter, 'locations_with_years'] = external_functions.get_places_with_years_from_response(painter_json)

df
        

Error fetching data for David Michael Hinnebusch, status code: 429.
Attempt 1 of 3.
Error fetching data for Jorge Martins, status code: 429.
Attempt 1 of 3.
Error fetching data for Stepan Ryabchenko, status code: 429.
Attempt 1 of 3.
Error fetching data for Arthur Pinajian, status code: 429.
Attempt 1 of 3.
Error fetching data for Elmer Bischoff, status code: 429.
Attempt 1 of 3.
Error fetching data for Ilse D'Hollander, status code: 429.
Attempt 1 of 3.
Error fetching data for Marcia Hafif, status code: 429.
Attempt 1 of 3.
Error fetching data for Richard Smith, status code: 429.
Attempt 1 of 3.
Error fetching data for Ronnie Landfield, status code: 429.
Attempt 1 of 3.
Error fetching data for Theodoros Stamos, status code: 429.
Attempt 1 of 3.
Error fetching data for Warren Rohrer, status code: 429.
Attempt 1 of 3.
Error fetching data for Adalbert Schaffer, status code: 429.
Attempt 1 of 3.
Error fetching data for Alexandre Cabanel, status code: 429.
Attempt 1 of 3.
Error fetching da

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,death_year,death_place,gender,citizenship,occupations,locations,locations_with_years
0,Ad Reinhardt,"Abstract Art, Abstract Expressionism, Color Fi...",Abstract Expressionism,"{Abstract Art:15},{Abstract Expressionism:5},{...",52,Buffalo,1913.0,1967.0,New York City,male,United States of America,"painter, university teacher, printmaker, colla...",['New York City'],[]
1,Adnan Coker,"Abstract Art, Abstract Expressionism",Abstract Art,"{Abstract Art:25},{Abstract Expressionism:3}",28,,,,,,,,[],[]
2,Akkitham Narayanan,Abstract Art,Abstract Art,{Abstract Art:17},17,Kerala,1939.0,,,,,,[],[]
3,Alberto Magnelli,"Abstract Art, Art Nouveau (Modern), Cubism, Ex...",Abstract Art,"{Abstract Art:19},{Art Nouveau (Modern):2},{Cu...",35,Florence,1888.0,1971.0,Meudon,male,Italy,"illustrator, painter","['Florence', 'Paris']",[]
4,Alekos Kontopoulos,"Abstract Art, Cubism, Expressionism, Post-Impr...",Social Realism,"{Abstract Art:26},{Cubism:5},{Expressionism:10...",79,Lamia,1904.0,1975.0,Athens,male,Greece,"writer, painter",[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3198,Serhij Schyschko,Unknown,Academic Art,{Unknown:9},9,,,,,,,,,
3199,Vudon Baklytsky,Unknown,Soviet Nonconformist Art,{Unknown:46},46,,,,,,,,,
3200,Wolfgang Tillmans,Unknown,Contemporary,{Unknown:9},9,Remscheid,1968.0,,,male,Germany,"photographer, printmaker","['New York City', 'Berlin', 'London']",['New York City:1996-1996']
3201,Wu Daozi,Unknown,Tang Dynasty (618–907),{Unknown:8},8,Chang'an,680.0,,,,,,[],[]


We see two types of missing painters' data:
- Painters where there was no information returned, but we got a response returned (e.g. "Adnan Coker", we can see it that his list of locations is not NaN, just an empty list, which means we had a response, but no data)

-  Painters where there was no response returned (e.g. "Vudon Baklytsky")

In [6]:
len(df[df['locations'].isnull()]), len(df[df['gender'].isnull()])

(138, 1560)

138 no-response cases, and 1422 (1560-138) empty-response cases

### Less query fetching (querying 150 painters at a time), faster but less reliable:

2024.02.10 CET afternoon run

**Note**: The external function was modified while adding this update, on 2024.02.10. Before that, artists had multiple responses (based their occupations list and locations list).

In [9]:
unprocessed_painters = [painter for painter in artists_wikiart['artist'] if df.loc[df['artist']==painter, 'gender'].isnull().values[0] == True]
responses = external_functions.get_multiple_people_all_info(unprocessed_painters)

We got some missing data of artists this way (800+ instances):

In [15]:
no_response_painters =[]
modified_painters = []
for painter in unprocessed_painters:
    if not df.loc[df['artist']==painter, "gender"].isnull().values[0]: #We expect None values, any other value means we already processed this
        continue
    try:
        response_list = [response for response in responses if response['name']==painter][0]
    except IndexError:
        no_response_painters.append(painter)
        #There was no response for this painter
        continue
    #The try-except can be replace with using the "next" function: response_list=next((response for response in responses if response['name']==painter), None) then check if response_list is None
    if pd.isnull(df.loc[df['artist'] == painter, 'death_year']).values[0]:
        df.loc[df['artist'] == painter, 'death_year'] = external_functions.find_year(response_list['death_date'])
    if pd.isnull(df.loc[df['artist'] == painter, 'death_place']).values[0]:
        df.loc[df['artist'] == painter, 'death_place'] = response_list['death_place']
    if pd.isnull(df.loc[df['artist'] == painter, 'gender']).values[0]:
        df.loc[df['artist'] == painter, 'gender'] = response_list['gender'] 
    if pd.isnull(df.loc[df['artist'] == painter, 'citizenship']).values[0]:
        df.loc[df['artist'] == painter, 'citizenship'] = response_list['citizenship']
    if pd.isnull(df.loc[df['artist'] == painter, 'occupations']).values[0] or df.loc[df['artist'] == painter, 'occupations'].values[0] in ([], "[]"):
        df.loc[df['artist'] == painter, 'occupations'] = ', '.join((response_list['occupation']))
    if pd.isnull(df.loc[df['artist'] == painter, 'locations']).values[0] or df.loc[df['artist'] == painter, 'locations'].values[0] in ([], "[]"):
        df.loc[df['artist'] == painter, 'locations'] = external_functions.get_places_from_response(response_list)
    if pd.isnull(df.loc[df['artist'] == painter, 'locations_with_years']).values[0] or df.loc[df['artist'] == painter, 'locations_with_years'].values[0] in ([], "[]"):
        df.loc[df['artist'] == painter, 'locations_with_years'] = external_functions.get_places_with_years_from_response(response_list)
    modified_painters.append(painter)

In [26]:
df.to_csv('datasets/wikiart_artists.csv', index=False)

But many still missing:

In [20]:
df[df['gender'].isnull()]

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,death_year,death_place,gender,citizenship,occupations,locations,locations_with_years
1,Adnan Coker,"Abstract Art, Abstract Expressionism",Abstract Art,"{Abstract Art:25},{Abstract Expressionism:3}",28,,,,,,,,[],[]
2,Akkitham Narayanan,Abstract Art,Abstract Art,{Abstract Art:17},17,Kerala,1939.0,,,,,,[],[]
8,Alfred Freddy Krupa,"Abstract Art, Abstract Expressionism, Academic...",New Ink Art,"{Abstract Art:1},{Abstract Expressionism:1},{A...",735,Karlovac,1971.0,,,,,,[],[]
13,Amin Aghaei,"Abstract Art, Magic Realism",Contemporary,"{Abstract Art:6},{Magic Realism:10}",16,Isfahan,1982.0,,,,,,[],[]
14,Andrzej Nowacki,Abstract Art,Abstract Art,{Abstract Art:26},26,Rabka-Zdrój,1953.0,,,,,,[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3192,Pablo Rey,Unknown,Contemporary,{Unknown:30},30,Barcelona,1968.0,,,,,,[],[]
3198,Serhij Schyschko,Unknown,Academic Art,{Unknown:9},9,,,,,,,,,
3199,Vudon Baklytsky,Unknown,Soviet Nonconformist Art,{Unknown:46},46,,,,,,,,,
3201,Wu Daozi,Unknown,Tang Dynasty (618–907),{Unknown:8},8,Chang'an,680.0,,,,,,[],[]


In [21]:
len(df[df['locations'].isnull()]), len(df[df['gender'].isnull()])

(135, 732)

For other painters without a response, let's check the most matching painter name, possibly we can find the correct name for getting a response:

In [22]:
import requests

painter_names_url = 'https://raw.githubusercontent.com/me9hanics/PainterPalette/main/painter_names_200k.txt'
wiki_painter_names_200k = requests.get(painter_names_url).text.splitlines()

Let's first find exact name matches, as it can be fastly done with numpy (remember that wiki_painter_names_200k has ~200000 names nearly, we need to be resourceful).<br>
We'd expect to find 0 exact matches:

In [23]:
import numpy as np

missing_data_painters = [painter['artist'] for index, painter in df[df['gender'].isnull()].iterrows()]
same_name_painters = np.intersect1d(missing_data_painters, wiki_painter_names_200k)
no_match_painters = np.setdiff1d(missing_data_painters, wiki_painter_names_200k)
same_name_painters

array(['Adnan Coker', 'Adolf Fleischmann', 'Adriana Varejão',
       'Agim Sulaj', 'Aki Kuroda', 'Akira Kanayama', 'Akkitham Narayanan',
       'Alan Lee', 'Albert Bitran', 'Albert Huie', 'Albert Julius Olsson',
       'Alberto Gironella', 'Alberto Sotio', 'Alejandro Cabeza',
       'Aleksandra Ekster', 'Alexander Khvostenko-Khvostov',
       'Alexandre Jacovleff', 'Alfred Freddy Krupa', 'Ali Akbar Sadeghi',
       'Ali Omar Ermes', "Allan D'Arcangelo", 'Andrei Rublev',
       'Andrey Shishkin', 'Andrzej Nowacki', 'Anne Truitt',
       'Anthony Sims', 'Antoine Blanchard', 'Antonietta Raphael',
       'Archibald Thorburn', 'Armin Andreas Pangerl', 'Arnulf Rainer',
       'Arsen Savadov', 'Arthur Pan', 'Ashley Bickerton', 'Ay-O',
       'Aydin Aghdashloo', 'Bahman Mohasses', 'Banksy',
       'Barbara Chase-Riboud', 'Barrington Watson', 'Basil Beattie',
       'Beatriz González', 'Beatriz Milhazes', 'Benny Andrews',
       'Bernardo Marques', 'Bernd Luz', 'Betty Goodwin', 'Billy Childish'

We found that half of the cases are in the list already, which is surprising, either there is no data about them, or we had problems fetching them.

Firstly, there are interestly some cases where we didn't get any non-location data, but we got some location data. We should firstly correct these cases (by simply running the function again for these painters).

In [27]:
partial_missing_data_painters = df[(df['artist'].isin(same_name_painters)) & (df['locations'] != '[]')] ['artist'].tolist()
df[(df['artist'].isin(same_name_painters)) & (df['locations'] != '[]')] [:5]

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,death_year,death_place,gender,citizenship,occupations,locations,locations_with_years
300,Leiko Ikemura,"Abstract Expressionism, Neo-Expressionism",Neo-Expressionism,"{Abstract Expressionism:2},{Neo-Expressionism:10}",12,Tsu,1951.0,,,,,,"['Zürich', 'Cologne', 'Berlin', 'Spain']","['Zürich:1979-1979', 'Spain:1972-1972']"
350,Richard Smith,"Abstract Expressionism, Hard Edge Painting, Po...",Post-Painterly Abstraction,"{Abstract Expressionism:1},{Hard Edge Painting...",20,Detroit,1968.0,1733.0,Saint Michael,,,,"['New York City', 'Brecon']",[]
381,Walasse Ting,"Abstract Expressionism, Naïve Art (Primitivism...",Abstract Expressionism,"{Abstract Expressionism:10},{Naïve Art (Primit...",70,Wuxi,1929.0,2010.0,New York City,,,,"['New York City', 'Amsterdam', 'Paris']",[]
389,Yayoi Kusama,"Abstract Expressionism, Art Brut, Conceptual A...",Feminist Art,"{Abstract Expressionism:1},{Art Brut:2},{Conce...",27,Matsumoto,1929.0,,,,,,"['Netherlands', 'Kyoto', 'New York City', 'Tok...","['Netherlands:1967-1967', 'Tokyo:2000-2015', '..."
431,Eugene de Blaas,"Academicism, Unknown",Academic Art,"{Academicism:96},{Unknown:1}",97,Albano Laziale,1843.0,1931.0,Venice,,,,"['Paris', 'Rome', 'Belgium', 'Venice', 'Nether...",[]


The get_person_info_retry_after() function was modified to be more like the get_all_person_info() function (iterating over multiple results). Let's try now, on all painters in same_name_painters:

In [25]:
df2 = df.copy()

In [33]:
modified_painters_2 = []
for painter in same_name_painters:
    response = external_functions.get_person_info_retry_after(painter, placeofbirth_return = False, dateofbirth_return = False)
    if response:
        if pd.isnull(df.loc[df['artist'] == painter, 'death_year']).values[0]:
            df.loc[df['artist'] == painter, 'death_year'] = external_functions.find_year(response['death_date'])
            if painter not in modified_painters_2:
                modified_painters_2.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'death_place']).values[0]:
            df.loc[df['artist'] == painter, 'death_place'] = response['death_place']
            if painter not in modified_painters_2:
                modified_painters_2.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'gender']).values[0]:
            df.loc[df['artist'] == painter, 'gender'] = response['gender']
            if painter not in modified_painters_2:
                modified_painters_2.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'citizenship']).values[0] or df.loc[df['artist'] == painter, 'citizenship'].values[0] in ([], "[]"):
            df.loc[df['artist'] == painter, 'citizenship'] = response['citizenship']
            if painter not in modified_painters_2:
                modified_painters_2.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'occupations']).values[0] or df.loc[df['artist'] == painter, 'occupations'].values[0] in ([], "[]"):
            df.loc[df['artist'] == painter, 'occupations'] = ', '.join((response['occupation']))
            if painter not in modified_painters_2:
                modified_painters_2.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'locations']).values[0] or df.loc[df['artist'] == painter, 'locations'].values[0] in ([], "[]"):
            df.loc[df['artist'] == painter, 'locations'] = external_functions.get_places_from_response(response)
            if painter not in modified_painters_2:
                modified_painters_2.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'locations_with_years']).values[0] or df.loc[df['artist'] == painter, 'locations_with_years'].values[0] in ([], "[]"):
            df.loc[df['artist'] == painter, 'locations_with_years'] = external_functions.get_places_with_years_from_response(response)
            if painter not in modified_painters_2:
                modified_painters_2.append(painter)

df[(df['artist'].isin(same_name_painters)) & (df['locations'] != '[]')]

Error fetching data for Albert Julius Olsson, status code: 429.
Attempt 1 of 3.
Error fetching data for Allan D'Arcangelo, status code: 429.
Attempt 1 of 3.
Error fetching data for Arnulf Rainer, status code: 429.
Attempt 1 of 3.
Error fetching data for Basil Beattie, status code: 429.
Attempt 1 of 3.
Error fetching data for Bracha L. Ettinger, status code: 429.
Attempt 1 of 3.
Error fetching data for Charles Jacque, status code: 429.
Attempt 1 of 3.
Error fetching data for Conroy Maddox, status code: 429.
Attempt 1 of 3.
Error fetching data for Darren Waterston, status code: 429.
Attempt 1 of 3.
Error fetching data for Edward Avedisian, status code: 429.
Attempt 1 of 3.
Error fetching data for Ernesto Neto, status code: 429.
Attempt 1 of 3.
Error fetching data for Fernando Calhau, status code: 429.
Attempt 1 of 3.
Error fetching data for Gebre Kristos Desta, status code: 429.
Attempt 1 of 3.
Error fetching data for Gianni Piacentino, status code: 429.
Attempt 1 of 3.
Error fetching da

Unnamed: 0,artist,styles,movement,styles_extended,pictures_count,birth_place,birth_year,death_year,death_place,gender,citizenship,occupations,locations,locations_with_years
300,Leiko Ikemura,"Abstract Expressionism, Neo-Expressionism",Neo-Expressionism,"{Abstract Expressionism:2},{Neo-Expressionism:10}",12,Tsu,1951.0,,,female,Switzerland,"university teacher, sculptor, painter, illustr...","['Zürich', 'Cologne', 'Berlin', 'Spain']","['Zürich:1979-1979', 'Spain:1972-1972']"
350,Richard Smith,"Abstract Expressionism, Hard Edge Painting, Po...",Post-Painterly Abstraction,"{Abstract Expressionism:1},{Hard Edge Painting...",20,Detroit,1968.0,1733.0,Saint Michael,male,Canada,"Catholic bishop, Catholic priest, film actor, ...","['New York City', 'Brecon']",[]
381,Walasse Ting,"Abstract Expressionism, Naïve Art (Primitivism...",Abstract Expressionism,"{Abstract Expressionism:10},{Naïve Art (Primit...",70,Wuxi,1929.0,2010.0,New York City,male,Republic of China,"graphic artist, painter, illustrator, poet, wr...","['New York City', 'Amsterdam', 'Paris']",[]
389,Yayoi Kusama,"Abstract Expressionism, Art Brut, Conceptual A...",Feminist Art,"{Abstract Expressionism:1},{Art Brut:2},{Conce...",27,Matsumoto,1929.0,,,female,Japan,"collagist, conceptual artist, video artist, in...","['Netherlands', 'Kyoto', 'New York City', 'Tok...","['Netherlands:1967-1967', 'Tokyo:2000-2015', '..."
431,Eugene de Blaas,"Academicism, Unknown",Academic Art,"{Academicism:96},{Unknown:1}",97,Albano Laziale,1843.0,1931.0,Venice,male,Kingdom of Italy,painter,"['Paris', 'Rome', 'Belgium', 'Venice', 'Nether...",[]
749,Frank Xavier Leyendecker,"Art Nouveau (Modern), Kitsch",Art Nouveau,"{Art Nouveau (Modern):12},{Kitsch:5}",17,Germany,1876.0,1924.0,,male,Germany,"drawer, painter, illustrator","['Netherlands', 'United States of America']","['Netherlands:1912-1912', 'United States of Am..."
766,Henry van de Velde,"Art Nouveau (Modern), Cloisonnism, Constructiv...",Neo-Impressionism,"{Art Nouveau (Modern):11},{Cloisonnism:1},{Con...",20,Antwerp,1863.0,1957.0,Zürich,male,Germany,painter,['Italy'],[]
937,Jan Brueghel the Elder,Baroque,Baroque,{Baroque:112},112,City of Brussels,1568.0,1625.0,Antwerp,male,Habsburg Netherlands,"architectural draftsperson, graphic artist, pa...","['Rome', 'Spa', 'Naples', 'Netherlands', 'Ital...","['Rome:1592-1594', 'Naples:1590-1590', 'Nether..."
1062,Imi Knoebel,"Color Field Painting, Minimalism",Minimalism,"{Color Field Painting:3},{Minimalism:12}",15,Dresden,1940.0,,,male,Germany,"video installation artist, sculptor, painter, ...","['United States of America', 'Düsseldorf', 'Da...","['United States of America:1974-1974', 'Düssel..."
1092,Banksy,"Conceptual Art, Graffiti Art, Street art",Street art,"{Conceptual Art:6},{Graffiti Art:1},{Street ar...",30,Yate,1974.0,,,male,United Kingdom,"artivist, graffiti artist, political activist,...","['London', 'West End', 'New York City', 'Washi...",[]


In [39]:
df.to_csv('datasets/wikiart_artists.csv', index=False)

In [46]:
no_match_painters

array(['3D', 'A.Y. Jackson', 'Adalbert Erdeli', 'Adrian Piper',
       'Alaa Awad', 'Alan Tellez', 'Albrecht Durer', 'Aleksander Belyaev',
       'Alex Hay', 'Alexander Shilov', 'Alexey  Bogolyubov',
       'Alexis Gritchenko', 'Alfred Concanen', 'Alvaro Lapa',
       'Amin Aghaei', 'Andrey Allakhverdov', 'Angel Planells',
       'Anima Ehtiat', 'Ann Hamilton', 'Antonio de La Gándara',
       'António de Carvalho da Silva Porto', 'Arkhyp Kuindzhi',
       'Arne Quinze', 'Arthur Nísio', 'Arthur Pinajian', 'Babak-Matveev',
       'Bahia Shehab', 'Bartolome Esteban Murillo',
       'Benito Quinquela Martin', 'Benoit Maire', 'Bernadette Resha',
       'Bernd and Hilla Becher', 'Berthold  Woltze', 'Blek le Rat',
       'Brassai', 'Cameron Platter', 'Carl-Ludwig Johann Christineck',
       'Carles Delclaux Is', 'Carlos Merida', 'Chaibia Talal',
       'Chaokun Wang', 'Charles-Andre van Loo (Carle van Loo)',
       'Charly Palmer', 'Christo and Jeanne-Claude', 'Chul Hyun Ahn',
       'Clarenc

In [53]:
import difflib
def similarity(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()

#Could use numpy to be faster, but this is fine for now
pairs = pd.DataFrame(columns=['artist','"Best" Wiki200k pair','similarity', 'Character difference'])
for name in no_match_painters:
    all_sims = []
    max_sim = 0
    for wiki_name in wiki_painter_names_200k:
        similarity_score = similarity(name, wiki_name)
        character_difference = (1-similarity_score)*len(name)
        if similarity_score >= max_sim: #Runtime reasons
            max_sim = similarity_score
            all_sims.append((similarity_score,character_difference, wiki_name))
        if (character_difference<0.01) | ((character_difference < 1.02) & (len(name)>5)):#1 Character difference, runtime reasons
            break
        
    final_maximum = max(sims[0] for sims in all_sims) 
    for sims in all_sims:
        if sims[0] == final_maximum: #Just take the highest ones
            if pairs.empty: #Could simply concat it, but then we'dget a FutureWarning
                pairs = pd.DataFrame([[name, sims[2], sims[0], sims[1]]], columns=['artist','"Best" Wiki200k pair','similarity', 'Character difference'])
            pairs = pd.concat([pairs, pd.DataFrame([[name, sims[2], sims[0], sims[1]]], columns=['artist','"Best" Wiki200k pair','similarity', 'Character difference'])])    
pairs.sort_values(by=['similarity'], ascending=False)

Unnamed: 0,artist,"""Best"" Wiki200k pair",similarity,Character difference
0,Richard Caton Woodville Sr.,"Richard Caton Woodville, Sr.",0.981818,0.490909
0,John Frederick Herring Sr.,"John Frederick Herring, Sr.",0.981132,0.490566
0,Pieter Bruegel the Elder,Pieter Brueghel the Elder,0.979592,0.489796
0,George Frederick Watts,George Frederic Watts,0.976744,0.511628
0,Colette Pope Heldner,Collette Pope Heldner,0.975610,0.487805
...,...,...,...,...
0,[ a y s h ],Harry Fish,0.476190,5.761905
0,[ a y s h ],Mary Brush,0.476190,5.761905
0,JAROSLAV KELUC,ROA,0.352941,9.058824
0,JAROSLAV KELUC,JAS,0.352941,9.058824


In [58]:
pairs = pairs.reset_index(drop=True)

In [85]:
pairs.sort_values(by=['similarity'], ascending=False)[0:5]

Unnamed: 0,artist,"""Best"" Wiki200k pair",similarity,Character difference
291,Richard Caton Woodville Sr.,"Richard Caton Woodville, Sr.",0.981818,0.490909
168,John Frederick Herring Sr.,"John Frederick Herring, Sr.",0.981132,0.490566
279,Pieter Bruegel the Elder,Pieter Brueghel the Elder,0.979592,0.489796
106,George Frederick Watts,George Frederic Watts,0.976744,0.511628
59,Colette Pope Heldner,Collette Pope Heldner,0.97561,0.487805


In [72]:
pairs_dict ={}
for index, row in pairs[:90].iterrows():
    pairs_dict[row['artist']] = row['"Best" Wiki200k pair']
pairs_dict.update({'Jose de Guimaraes': 'José de Guimarães', 'Jan Sluyters':'Jan Sluijters', 'Shin Yoon-bok': 'Shin Yun-bok', 'Hendrick Terbrugghen': 'Hendrick ter Brugghen',
                   'Gevorg Bashindzhagian':'Gevorg Bashinjaghian', 'Kim Tschang-yeul': 'Kim Tschang Yeul', 'Alexander Shilov':'Aleksandr Shilov',
                    'Efim Volkov':'Yefim Volkov', 'Efim Volkov':'Shôzô Shimamoto', 'Serhij Schyschko': 'Serhii Shyshko', 'Samuel Dirksz van Hoogstraten': 'Samuel van Hoogstraten',
                    'Miklos Barabas': 'Miklós Barabás', 'Brassai': 'Brassaï', 'Linder': 'I Linder', 'Marjorie Acker Phillips': 'Marjorie Phillips', 'Spyros Papaloukas': 'Spiros Papalucas',
                    'George Mavroides': 'Giorgos Mavroides', 'Sarunas Sauka': 'Šarūnas Sauka', 'Rene Bertholo': 'René Bértholo', 'Oleksandr Bogomazov': 'Alexander Bogomazov',
                    'Lady Frieda Harris': 'Frieda Harris', 'Maria Bozoky': 'Mária Bozóky', 'Eleonora Brigalda Barbas':'Eleonora Brigalda', 'Dumitru Ghiatza': 'Dumitru Ghiață',
                    'Arthur Nísio': 'Arthur José Nísio', 'Oleksandr Aksinin': 'Alexander Aksinin', 'Fab 5 Freddy': 'Fab Five Freddy', 'Edward Ruscha': 'Ed Ruscha',
                    'Georgyi Yakutovytch': 'Heorhii Yakutovych', 'Angel Planells': 'Angel Planells playes', 'Giovanni (Nino) Costa': 'Giovanni Costa',
                    'Quentin Matsys':'Quinten Metsys', 'Ramirez Villamizar':'Eduardo Ramírez Villamizar', 'Ivan Tvorozhnikov':'Ivan Ivanovich Tvorozhnikov',
                    'Clarence Holbrook Carter': 'Clarence Carter', 'Julio Le Parc': 'julio le parc', 'Rafael García Hispaleto (El Hispaleto)': 'Rafael García y Hispaleto',
                    'Tran Van Can': 'Trần Văn Cẩn', 'Marevna (Marie Vorobieff)': 'Marie Vorobieff', 'Emiliano Di Cavalcanti': 'Di Cavalcanti', 'Il Sassetta (Stefano di Giovanni)': 'Stefano di Giovanni',
                    'Jan Both':'Jan Dirksz Both', 'Charles-Andre van Loo (Carle van Loo)': 'Charles-André van Loo', 'Herbert Gustave Schmalz (Herbert Carmichael)': 'Herbert Gustave Schmalz',
                    'Meta Vaux Warrick Fuller': 'Warwick Fuller', 'Mihri Musfik':'Mihri Müşfik Hanım', 'TRACY 168': 'Tracy 168'})

In [81]:
modified_painters_3 = [] #For painters whose alias we found
modified_painters_4 = [] #For those we didn't
for painter in no_match_painters:
    if painter in pairs_dict:
        alias = pairs_dict[painter]
        modifying_list = modified_painters_3 #Just for less writing down below
    else:
        alias = painter
        modifying_list = modified_painters_4
    response = external_functions.get_person_info_retry_after(alias, )
    if response:
        df.loc[df['artist'] == painter, 'birth_year'] = external_functions.find_year(response['birth_date'])
        df.loc[df['artist'] == painter, 'birth_place'] = response['birth_place']
        if pd.isnull(df.loc[df['artist'] == painter, 'death_year']).values[0]:
            df.loc[df['artist'] == painter, 'death_year'] = external_functions.find_year(response['death_date'])
            if painter not in modifying_list:
                modifying_list.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'death_place']).values[0]:
            df.loc[df['artist'] == painter, 'death_place'] = response['death_place']
            if painter not in modifying_list:
                modifying_list.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'gender']).values[0]:
            df.loc[df['artist'] == painter, 'gender'] = response['gender']
            if painter not in modifying_list:
                modifying_list.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'citizenship']).values[0] or df.loc[df['artist'] == painter, 'citizenship'].values[0] in ([], "[]"):
            df.loc[df['artist'] == painter, 'citizenship'] = response['citizenship']
            if painter not in modifying_list:
                modifying_list.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'occupations']).values[0] or df.loc[df['artist'] == painter, 'occupations'].values[0] in ([], "[]"):
            df.loc[df['artist'] == painter, 'occupations'] = ', '.join((response['occupation']))
            if painter not in modifying_list:
                modifying_list.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'locations']).values[0] or df.loc[df['artist'] == painter, 'locations'].values[0] in ([], "[]"):
            df.loc[df['artist'] == painter, 'locations'] = external_functions.get_places_from_response(response)
            if painter not in modifying_list:
                modifying_list.append(painter)
        if pd.isnull(df.loc[df['artist'] == painter, 'locations_with_years']).values[0] or df.loc[df['artist'] == painter, 'locations_with_years'].values[0] in ([], "[]"):
            df.loc[df['artist'] == painter, 'locations_with_years'] = external_functions.get_places_with_years_from_response(response)
            if painter not in modifying_list:
                modifying_list.append(painter)
    if painter in pairs_dict:
        modified_painters_3 = modifying_list
    else:
        modified_painters_4 = modifying_list     


Error fetching data for Alfred Cohen, status code: 429.
Attempt 1 of 3.
Error fetching data for Aarne Nurminen, status code: 429.
Attempt 1 of 3.
Error fetching data for Berthold Woltze, status code: 429.
Attempt 1 of 3.
Error fetching data for Carl Palme, status code: 429.
Attempt 1 of 3.
Error fetching data for Corneliu Michăilescu, status code: 429.
Attempt 1 of 3.
Error fetching data for Don Drost, status code: 429.
Attempt 1 of 3.
Error fetching data for Émile Bernard, status code: 429.
Attempt 1 of 3.
Error fetching data for Giovanni Costa, status code: 429.
Attempt 1 of 3.
Error fetching data for I Linder, status code: 429.
Attempt 1 of 3.
Error fetching data for Rafael García y Hispaleto, status code: 429.
Attempt 1 of 3.


In [98]:
df.to_csv('datasets/wikiart_artists.csv', index=False)

This is quite some improvement.

We are only really left with:
- 319-113=206 painters for whom we didn't find an alias (Among whom are at least 83-2 are painters who never had any response, therefore has None for locations)
- 2 painters whom we did have an alias for, but not a response

Overall, we have 83 painters with no new data, and 22 more painters who miss some "trivial" data.

### --- PARTIAL SAVES --- (2024.02.10 3:30 AM UTC and 3:20 PM UTC)

In [33]:
artists_wikiart.to_csv('datasets/saves/wikiart_artists.csv', index=False)
df.to_csv('datasets/wikiart_artists.csv', index=False)