# Network of Painters: building a dataset from paintings datasets, then creating links

The aim of this project is to build a dataset of painters from painting datasets such as WikiArt and Art500k: combining their features, filling in missing painter data by web scraping (Google and the Wiki API), and then creating links between painters based on similarity of style and on geographical and social interaction.

Note: One long-term goal would be a single JSON file that combines everything hierarchically. For example, the top level could be the art movement; inside it, the artists with some base data such as birthplace, years of birth and death, and other geographical data; and inside each artist, the paintings with all their data. Even better would be to include the eras of each painter as a sub-level, with the paintings inside them. We could then use this to create a network of art movements, artists, and paintings.
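
A minimal sketch of what such a hierarchy could look like, assuming flat painting records with `movement` and `artist` fields (as in the WikiArt table). The helper name `build_hierarchy` and the `info` / `paintings` keys are hypothetical choices, not existing project code:

```python
import json

def build_hierarchy(paintings, artist_info=None):
    """Nest flat painting records as movement -> artist -> paintings."""
    artist_info = artist_info or {}
    tree = {}
    for row in paintings:
        artists = tree.setdefault(row["movement"], {})
        artist = artists.setdefault(
            row["artist"],
            {"info": artist_info.get(row["artist"], {}), "paintings": []},
        )
        # Keep every remaining field of the painting at the leaf level.
        artist["paintings"].append(
            {k: v for k, v in row.items() if k not in ("movement", "artist")}
        )
    return tree

# Toy records shaped like rows of the WikiArt table.
paintings = [
    {"artist": "Andrei Rublev", "movement": "Byzantine Art",
     "style": "Moscow school of icon painting", "genre": "religious painting"},
    {"artist": "Andrei Rublev", "movement": "Byzantine Art",
     "style": "Moscow school of icon painting", "genre": "miniature"},
    {"artist": "Claude Monet", "movement": "Impressionism", "style": "Impressionism"},
]
# Artist-level data, e.g. taken from the Art500k artist table.
artist_info = {"Claude Monet": {"nationality": "French"}}

hierarchy = build_hierarchy(paintings, artist_info)
print(json.dumps(hierarchy, indent=2, ensure_ascii=False))
```

An extra "era" level could be added the same way, with one more `setdefault` between the artist and the paintings.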

NEXT STEPS:<br>
-Add "Places" for Art500k datasets (+change datasets_notebook save.csv loads)<br>
-Add aliases for painters in Art500k datasets<br>
-Combine the datasets on authors<br>

FURTHER STEPS: <br>
-Define connections between painters<br>
-Create a network of painters<br>
-Analyze the network<br>

<details><summary><u> Update 11.06: Maximilian Schich </u></summary>
<p>
I e-mailed an art researcher that Elisa suggested, Maximilian Schich, asking about datasets for our project. He said: 

-we do not have a record of social interactions between artists at the corpus scale. The closest thing is: co-exhibition networks, which you may already know from the work of Fraiberger et al. (incl. Laszlo Barabasi). (http://genetics.bwh.harvard.edu/courses/Biophysics205/Papers/All_papers/Fraiberger_2018.pdf page 2) The issue there is that the network is short, circa 1985 to 2020.

-Hyperlink networks (I guess WikiLinks, PageRank and such), such as those found in Wikipedia, are obviously beset with all kinds of issues, even though they do recapitulate the evolution of conventional style periods pretty well (cf. the work of Doron Goldfarb et al., incl. myself). More locally speaking, it is a core topic in art history to shed light on the social network of artists and their patrons, but this does not lend itself to quantitative analysis.

-I personally have done a visualization for Max Planck, based on the social network of 5500 individuals related to the Roman Baroque (https://zuccaro.schich.info/), which did reveal another issue, which is that for painters, art historians tend to research family relationships (more cliques), while for architects they focus on business relationships (more hubs). But here you have the inverse problem: there is not much information on the paintings.

-There is a question/issue he raised from this: "Should we really assume social interaction influencing the styles of artists? Note that this may substantially underestimate the plasticity of the human brain/mind! It is like assuming that cellists only hang out with cellists, when we all know that grunge bands in Seattle all did hang out together and missing a bassist. Meanwhile we do have evidence that artists such as Rubens did routinely hang out with different(!) artists, who could serve clients with different genres and if necessary styles. Bramante did build Gothic in Milan and Renaissance style in Rome at the same time. Rubens would call in Elsheimer to do miniatures, etc. And since the mid 19th century, all artists in the Western scene were essentially familiar, not only with the same corpus of classic artists and their works, but also with the contemporary production. Large art exhibitions in Paris literally drew millions of people each year in the mid 19th century (think Burning Man or SXSW today). So it is safe to say that most artists of note were familiar with a great number of styles. Styles may bifurcate. For artists the opposite may be true (cf. Run-DMC meets Aerosmith => https://www.youtube.com/watch?v=4B_UYYPb-Gk). If I were you, I'd turn the question around, pointing into the opposite direction: **If two artists have similar style, can we find traces that they (eventually) knew each other**?" He said influence is B.S. (literally) and there's 100 times more evidence for similarity than influence between two artworks, and suggested answering "does style lead to social interaction?"

-"Here is how this question can be attacked with the available data: The standard "corpus" for artists is their "catalog raisonne", i.e. the catalog of all their works, which does not exist for all artists and is typically a lot of work, sold in expensive books. We are a long way from a comprehensive dataset like this. Yet, for the purpose of a more limited project, you could use general conventional style similarity from the usual suspect databases (Wikiart, Art500k, etc.). As a proxy of social interaction, you could use the hyperlink and/or wikidata links connected to the same artists. Even though these two sources are limited, you could still compare the two graphs as in "Wikipedia connection" vs. "visual similarity".

We have recently published a paper on general similarity using compression ensembles, using a subset of art500k/Wikiart, which is essentially 65k paintings with a reliable year as data. We have also used the first 100 days of the hic et nunc NFT art platform (where, coincidentally, you get both social interaction and painting information). See "Availability of data and materials" in https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-023-00397-3#Sec21 "

So this could be interesting to think about.
</p>
</details>
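
The comparison suggested in the update above ("Wikipedia connection" vs. "visual similarity") could start from something as simple as the overlap between two edge sets. A minimal sketch, assuming each graph is available as a list of undirected artist pairs; the pairs below are made-up placeholders and `edge_overlap` is a hypothetical helper:

```python
def normalize(edges):
    """Treat edges as undirected: (a, b) and (b, a) are the same link."""
    return {frozenset(pair) for pair in edges if pair[0] != pair[1]}

def edge_overlap(edges_a, edges_b):
    """Compare two undirected edge lists defined over the same artists."""
    a, b = normalize(edges_a), normalize(edges_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Share of the pairs in graph b that are also linked in graph a.
        "b_in_a": len(shared) / len(b) if b else 0.0,
    }

# Placeholder graphs: hyperlinks between artist pages vs. style-similar pairs.
wiki_links = [("artist_1", "artist_2"), ("artist_2", "artist_3"), ("artist_3", "artist_4")]
style_similar = [("artist_2", "artist_1"), ("artist_1", "artist_3")]

result = edge_overlap(wiki_links, style_similar)
print(result)
```

With the style graph passed second, `b_in_a` is the fraction of style-similar pairs that also have a Wikipedia connection, which is one way to phrase the "does style lead to social interaction?" question.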

In [2]:
import pandas as pd
import numpy as np

<details><summary><u>National Gallery of Art (US) dataset (unused)</u></summary>
<p>
    
```python

df1 = pd.read_csv('datasets/originals/nga_constituents.csv') # From their website
df1.head()

```
    
</p>
</details>

## WikiArt data

Load the cleaned paintings data

In [3]:
wa_paintings = pd.read_csv('datasets/wikiart_paintings_refined.csv')
print("Length:", len(wa_paintings))
wa_paintings.head() #Consider dropping style: "Unknown" 

Length: 175313


Unnamed: 0,artist,style,genre,movement,tags
0,Andrei Rublev,Moscow school of icon painting,religious painting,Byzantine Art,"['Christianity', 'saints-and-apostles', 'angel..."
1,Andrei Rublev,Moscow school of icon painting,religious painting,Byzantine Art,"['Christianity', 'Old-Testament', 'Daniel', 'p..."
2,Andrei Rublev,Moscow school of icon painting,miniature,Byzantine Art,"['Christianity', 'saints-and-apostles', 'Khitr..."
3,Andrei Rublev,Moscow school of icon painting,religious painting,Byzantine Art,"['Christianity', 'saints-and-apostles', 'St.-L..."
4,Andrei Rublev,Moscow school of icon painting,miniature,Byzantine Art,"['Christianity', 'arts-and-crafts', 'saints-an..."


Load the grouped data: artists grouped by style

In [11]:
wa_grouped = pd.read_csv('datasets/wikiart_artists_styles_grouped.csv')
n_singletons = (wa_grouped['count'] == 1).sum()  # (style, artist) groups that contain exactly one painting
print("Length:", len(wa_grouped), "\n", "Number of groups with only 1 count:", n_singletons)
wa_grouped[wa_grouped['artist'].str.contains("Monet")].sort_values(by=['count'], ascending=False)

Length: 7647 
 Number of groups with only 1 count: 1115


Unnamed: 0,style,artist,movement,count
2963,Impressionism,Claude Monet,Impressionism,1341
5468,Realism,Claude Monet,Impressionism,12
7042,Unknown,Claude Monet,Impressionism,12
462,Academicism,Claude Monet,Impressionism,1
3339,Japonism,Claude Monet,Impressionism,1
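
For reference, a table with this shape can be derived from the paintings table with a `groupby`. This is a minimal sketch on a toy frame built from the five Andrei Rublev rows shown earlier, assuming one row per painting; it is not necessarily how `wikiart_artists_styles_grouped.csv` was actually produced:

```python
import pandas as pd

# Toy stand-in for wa_paintings (the five rows displayed above).
toy_paintings = pd.DataFrame({
    "artist": ["Andrei Rublev"] * 5,
    "style": ["Moscow school of icon painting"] * 5,
    "genre": ["religious painting", "religious painting", "miniature",
              "religious painting", "miniature"],
    "movement": ["Byzantine Art"] * 5,
})

# Count paintings per (style, artist, movement) combination.
toy_grouped = (
    toy_paintings
    .groupby(["style", "artist", "movement"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(toy_grouped)
```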


## Art500K

First dataset (from official website)

In [5]:
art500k = pd.read_csv('datasets/art500k_cleaned.csv')
(art500k[4:10])



Unnamed: 0,author_name,Genre,Style,Nationality,PaintingSchool,ArtMovement,Date,Influencedby,Influencedon,Tag,Pupils,Location,Teachers,FriendsandCoworkers
4,El Greco,,,,,,ca. 1610-1614,,,,,,,
5,El Greco,,,,,,,,,,,,,
6,Diego Rivera,,,,,,,,,,,,,
7,Claude Monet,,,,,,,,,,,,,
8,Francisco Goya,,,,,,,,,,,,,
9,Francisco Goya,,,,,,,,,,,,,


In [30]:
art500k_artists = pd.read_csv('save.csv')
art500k_artists[0:10]

Unnamed: 0,artist,Nationality,PaintingSchool,ArtMovement,Influencedby,Influencedon,Pupils,Teachers,FriendsandCoworkers,FirstYear,LastYear,Places
0,Gustave Courbet,French,,"{Realism:272},","Rembrandt,Caravaggio,Diego Velazquez,Peter Pau...","Edouard Manet,Claude Monet,Pierre-Auguste Reno...",,,,1830.0,1877.0,
1,Auguste Rodin,French,,"{Modern art:3},{Impressionism:91},","Michelangelo,Donatello,","Georgia O'Keeffe,Man Ray,Aristide Maillol,Olex...","Constantin Brancusi,",,,1865.0,1985.0,
2,Frida Kahlo,Mexican,,"{Naïve Art (Primitivism),Surrealism:99},","Amedeo Modigliani,Diego Rivera,Jose Clemente O...","Judy Chicago,Georgia O'Keeffe,Feminist Art,",,,,1922.0,1954.0,
3,Banksy,,,,,,,,,2011.0,2011.0,
4,El Greco,"Spanish,Greek",Cretan School,"{Spanish Renaissance:1},{Renaissance:2},{Manne...","Byzantine Art,","Expressionism,Cubism,Eugene Delacroix,Edouard ...",,"Titian,","Giulio Clovio,",1568.0,1614.0,
5,Diego Rivera,Mexican,"Mexican Mural Renaissance,La Ruche","{Social Realism,Muralism:146},","Marc Chagall,Robert Delaunay,","Frida Kahlo,Pedro Coronel,Vlady,",,,"Amedeo Modigliani,Saturnino Herran,Roberto Mon...",1904.0,1956.0,
6,Claude Monet,French,,"{Modern art:3},{Impressionism:1340},","Gustave Courbet,Charles-Francois Daubigny,John...","Childe Hassam,Robert Delaunay,Wassily Kandinsk...",,"Eugene Boudin,Charles Gleyre,","Alfred Sisley,Pierre-Auguste Renoir,Camille Pi...",1858.0,1926.0,
7,Francisco Goya,Spanish,,"{Romanticism:391},","Albrecht Durer,Diego Velazquez,","Pablo Picasso,Chaim Soutine,Roberto Montenegro...",,"José Luzán,Anton Raphael Mengs,",,1760.0,1828.0,
8,Edvard Munch,Norwegian,"Berlin Secession,Degenerate art","{Symbolism,Expressionism:188},","Paul Gauguin,Vincent van Gogh,Henri de Toulous...","Egon Schiele,Wassily Kandinsky,Ernst Ludwig Ki...",,"Leon Bonnat,","Franz Marc,",1881.0,1944.0,
9,Édouard Manet,,,,,,,,,1858.0,1882.0,


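As the table shows, further cleaning is needed: some artists (e.g. Banksy, Édouard Manet) have almost no metadata, and some years look implausible (e.g. LastYear 1985 for Auguste Rodin).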
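
One concrete cleaning step is turning the string-encoded columns into Python structures. A minimal sketch, assuming the `{label:count},` format of `ArtMovement` and the trailing-comma lists (e.g. `Teachers`) visible in the table above hold for all rows; `parse_art_movement` and `split_names` are hypothetical helper names:

```python
import re

def parse_art_movement(raw):
    """Turn '{Modern art:3},{Impressionism:1340},' into a {label: count} dict."""
    if not isinstance(raw, str):  # NaN / missing value
        return {}
    pairs = re.findall(r"\{([^{}]*):(\d+)\}", raw)
    # A combined label such as 'Symbolism,Expressionism' is kept as one key.
    return {label.strip(): int(count) for label, count in pairs}

def split_names(raw):
    """Turn 'Eugene Boudin,Charles Gleyre,' into a list of names."""
    if not isinstance(raw, str):
        return []
    return [name.strip() for name in raw.split(",") if name.strip()]

print(parse_art_movement("{Modern art:3},{Impressionism:1340},"))
print(split_names("Eugene Boudin,Charles Gleyre,"))
```

On the real table this could be applied column-wise, e.g. `art500k_artists['ArtMovement'].apply(parse_art_movement)`.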

Second dataset: from Rasta <br>
https://github.com/nphilou/rasta/tree/d22b34d5ac1aee9c1f80b4a73ad6792fd465c605/data/art500k

<details><summary><u>Importing the data</u></summary>
    
```python

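# Fields are separated by a tab or by 4+ whitespace characters (raw string avoids an invalid-escape warning)
rasta = pd.read_table('datasets/originals/art500k_rasta370k.txt', header=0, engine='python', sep=r'\t|\s{4,}')
rasta[0:5]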

```

</details>

Every painting has either East or West origin (or none given), so we may simply filter to one of them.

From this, we could possibly create a network.

<details><summary><u>Something further:</u></summary>
<p>

https://en.wikipedia.org/wiki/Renaissance (at the bottom)
https://en.wikipedia.org/wiki/Periods_in_Western_art_history
    
</p>
</details>

## Combine the two

Take the artist name as the key for combining the two datasets.
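
A minimal sketch of joining on the artist name, assuming the column is called `artist` in both tables (as in the outputs above). The normalization step is an assumption: names with diacritics, such as "Édouard Manet" versus "Edouard Manet", would otherwise fail to match. `normalize_name` is a hypothetical helper and the frames are toy data:

```python
import unicodedata

import pandas as pd

def normalize_name(name):
    """Lowercase, strip diacritics and collapse whitespace for matching."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())

# Toy stand-ins for the WikiArt and Art500k artist tables.
wa_artists = pd.DataFrame({
    "artist": ["Claude Monet", "Andrei Rublev"],
    "movement": ["Impressionism", "Byzantine Art"],
})
a5_artists = pd.DataFrame({
    "artist": ["Claude Monet", "Frida Kahlo"],
    "Nationality": ["French", "Mexican"],
})

for frame in (wa_artists, a5_artists):
    frame["artist_key"] = frame["artist"].map(normalize_name)

# Outer join keeps artists found in only one source; `_merge` records which one.
combined = wa_artists.merge(
    a5_artists, on="artist_key", how="outer",
    suffixes=("_wikiart", "_art500k"), indicator=True,
)
print(combined[["artist_key", "movement", "Nationality", "_merge"]])
```

The `_merge` column shows which artists still need aliases before the two sources line up.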

### PageRank / Wiki Connections:
The Python Class 6 notebook (original_networkx_SP_06.ipynb) has a good PageRank example.

Wiki Connections: full dataset http://www.iesl.cs.umass.edu/data/data-wiki-links

smaller dataset: https://snap.stanford.edu/data/wikispeedia.html
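
For intuition, PageRank can be written as a short power iteration over a link dictionary. This is a plain-Python sketch on a made-up three-page graph, not the code from the class notebook; for real data a library implementation (e.g. in networkx or igraph) is the better choice:

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank for a dict {page: [pages it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Made-up link structure: two pages point to "hub", which links back to one of them.
toy_links = {"page_a": ["hub"], "page_b": ["hub"], "hub": ["page_a"]}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```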

# Philosophy, Politics
<details><summary><u>Philosophy</u></summary>
<p>

## Philosopher's web: 

Only available to pro users (costs $10).

## Philosophy data
https://philosophydata.com/phil_nlp.zip
Downloaded but not used yet, since it appears to be NLP data.
</p>
</details>

## Network connection: Six Degrees of Francis Bacon
A network of the people connected to Francis Bacon. Sadly, most people in the set are English and were born in the 16th century, so most philosophers in the list are not very relevant for us (there is no Kant, Nietzsche, etc.). Still, it is a good example of a network.

http://www.sixdegreesoffrancisbacon.com/?ids=10000473&min_confidence=60&type=network

<details><summary><u>Code for obtaining graph</u></summary>
<p>
    
```python
import igraph as ig  # To install: conda install -c conda-forge python-igraph
import pandas as pd

people = pd.read_csv('datasets/SDFB_people_.csv')
relationships = pd.read_csv('datasets/SDFB_relationships_.csv')

# igraph is used because it is faster than networkx, and graph-tool is hard to set up on Windows
network = relationships.rename(columns={'id': 'relationship_id'}).drop(columns=['created_by', 'approved_by', 'citation'])
print(network.head(), '\n')

# Move columns 1 and 2 (assumed to be the two person ids) to the front:
# Graph.DataFrame reads the first two columns as the edge endpoints
cols = network.columns.tolist()
cols = cols[1:3] + cols[0:1] + cols[3:]
network = network[cols]

# Drop edges starting from id 10050190: for some reason no person with this id exists (found with a loop)
network = network[network['person1_index'] != 10050190]

# Documentation followed:
# https://python.igraph.org/en/stable/generation.html#from-pandas-dataframe-s
# https://python.igraph.org/en/stable/api/igraph.Graph.html#DataFrame
g = ig.Graph.DataFrame(network, directed=False, vertices=people[['id', 'display_name','historical_significance','birth_year','death_year']], use_vids=False)
print(g.summary().replace(',', '\n'))
```
    
</p>
</details>
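
Instead of hard-coding the missing id, edges that point to unknown people could be dropped in one step. A minimal sketch on toy frames, assuming the two endpoint columns are named `person1_index` and `person2_index` (only the first name appears in the code above; the second is a guess):

```python
import pandas as pd

# Toy stand-ins for the SDFB people and relationships tables.
toy_people = pd.DataFrame({"id": [1, 2, 3]})
toy_relationships = pd.DataFrame({
    "id": [100, 101, 102],
    "person1_index": [1, 2, 99],  # 99 does not exist in toy_people
    "person2_index": [2, 3, 1],
})

known_ids = set(toy_people["id"])
endpoints_known = (
    toy_relationships["person1_index"].isin(known_ids)
    & toy_relationships["person2_index"].isin(known_ids)
)

dangling_edges = toy_relationships[~endpoints_known]  # edges to report or inspect
clean_edges = toy_relationships[endpoints_known]      # edges safe to pass to igraph
print(dangling_edges)
```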

<details><summary><u>Code for filtering</u></summary>
<p>
    
```python
# Remove isolated vertices (degree 0), see https://python.igraph.org/en/stable/tutorial.html#selecting-vertices-and-edges
isolated = g.vs.select(_degree=0)
g.delete_vertices(isolated)

import cairo  # Needed for plotting (alternatively: import cairocffi as cairo); matplotlib can be used too
# layout = g.layout(layout='auto')
# ig.plot(g, layout=layout)  # plain ig.plot(g) looks even worse

```
    
</p>
</details>

<details><summary><u>Code for layout and plotting</u></summary>
<p>
    
```python
layout = g.layout(layout='reingold_tilford_circular') #kamada_kawai requires too much computing, 'fruchterman_reingold' is too dense
visual_style = {}
visual_style["vertex_size"] = 5
visual_style["vertex_color"] = "blue"
visual_style['bbox'] = (900, 900)
visual_style["layout"] = layout
# ig.plot(g, **visual_style)  # Commented out because it uses a lot of memory
# Needs improvement, but it's a start
```
    
</p>
</details>


# Other opportunities for networks:
https://global.health/ has nice data on diseases, probably time-varying too

<details><summary><u>Monkeypox, ebola</u></summary>
<p>
    
```python
df3 = pd.read_csv('datasets/monkeypox.csv')
df4 = pd.read_csv('datasets/ebola.csv')
df4
```
    
</p>
</details>

## Modeling of Biological + Socio-tech systems (MOBS) Lab
https://www.mobs-lab.org/

