# Network building (graph-tool)

Using the `graph-tool` library and the functions in `graph_functions.py`, we first created a large network of artists, in which two artists are connected if they share at least one exhibition. We then created two smaller networks: one in which two artists are connected only if they share at least two exhibitions (this network was not analyzed in the end), and one restricted to the subset of artists who also appear in [PainterPalette](https://github.com/me9hanics/PainterPalette). This is described in more detail in the `5_community_detection.ipynb` notebook.

## Initial steps

In [63]:
import json
with open('data/artists_cleaned_v1.txt', 'r', encoding='utf-8') as f:
    artists = f.read().splitlines()

with open('data/announcements.json', 'r', encoding='utf-8') as f:
    announcements = json.load(f)

An example announcement:

In [64]:
announcements["281251"]

{'link': '/announcements/281251/curators-intensive-taipei-19/',
 'title_artists': None,
 'title': 'Curators’ Intensive Taipei 19',
 'subtitle': 'Taipei Fine Arts Museum',
 'announcement_date': 'September 28, 2019',
 'artists': ['Ade Darmawan',
  'David Teh',
  'Esther Lu',
  'Fang-Wei Chang',
  'Kenjiro Hosaka',
  'Manray Hsu',
  'Mirwan Andan',
  'Raimundas Malašauskas',
  'TFAM',
  'Zoe Butt']}

We only use the `artists` attribute.

## 1) Initial network

The network is built with `graph-tool`, a Python graph library whose core is implemented in C++ (which is what makes it fast) and which is well suited to visualization and (social) network analysis.

The way we build a network of artists:
- Add artists (e.g. from the `artists_cleaned_v1.txt` file) as nodes in the graph
- Run through all announcements, add edges between artists that share an announcement
- Graph is undirected and weighted, weight = number of announcements shared
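
The weighting rule above can be illustrated without `graph-tool`. The sketch below uses a small made-up `toy_announcements` dictionary (not real data) and counts, for every unordered pair of artists, how many announcements they share. The names `toy_announcements` and `pair_weights` are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Toy data in the same shape as announcements.json (made-up names, illustration only)
toy_announcements = {
    "1": {"artists": ["A", "B", "C"]},
    "2": {"artists": ["B", "A"]},
    "3": {"artists": ["C"]},
}

pair_weights = Counter()
for announcement in toy_announcements.values():
    # sorted(set(...)) makes (A, B) and (B, A) the same key and ignores duplicate names
    names = sorted(set(announcement["artists"]))
    for pair in combinations(names, 2):
        pair_weights[pair] += 1

print(dict(pair_weights))
```

Sorting the names first is a design choice of this sketch: it gives each undirected pair a single canonical key, so no separate "does this edge already exist" lookup is needed.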

In [3]:
#import graph_tool as gt
#from graph_tool import inference
from graph_tool.all import * #Otherwise, draw_hierarchy will not be found in any case (even if importing graph_tool.draw)
import graph_functions

#### Add the nodes of the graph (artists)

(In the end, the artists were not added from the *.txt* file but from the announcements dictionary. In theory the two lists of artists should be identical; in practice the *.txt* file may contain artists without any announcement, which is why we chose the slower method of iterating over the announcements to collect the artists.)
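
One way to check how far the two artist lists actually differ is a set comparison. The sketch below uses made-up stand-ins (`txt_artists`, `toy_announcements`) for the two sources loaded earlier; it is an illustration, not part of the original pipeline.

```python
# Made-up stand-ins for artists_cleaned_v1.txt and announcements.json
txt_artists = ["A", "B", "D"]
toy_announcements = {
    "1": {"artists": ["A", "B"]},
    "2": {"artists": ["B", "C"]},
}

# Every artist that occurs in at least one announcement
announced = {name for ann in toy_announcements.values() for name in ann["artists"]}

# Artists listed in the text file that never appear in an announcement
only_in_txt = set(txt_artists) - announced
# Artists that appear in announcements but are absent from the text file
only_in_announcements = announced - set(txt_artists)

print(sorted(only_in_txt), sorted(only_in_announcements))
```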

In [4]:
g = Graph(directed=False)

artist_to_vertex = {}

for announcement in announcements.values():
    announcement_artists = announcement['artists']
    for artist in announcement_artists: #iterate over this announcement's artists, not the list from the .txt file
        if artist not in artist_to_vertex:
            v = g.add_vertex()
            artist_to_vertex[artist] = v

Build the reverse mapping (vertex to artist name) as a vertex property, so that we can look up an artist's name from its vertex:

In [5]:
artist_name = g.new_vertex_property("string") #vertex_to_artist = {v: artist for artist, v in artist_to_vertex.items()}
for artist,v in artist_to_vertex.items():
    artist_name[v] = artist

g.vertex_properties["artist_name"] = artist_name

(We now have an artist-to-vertex dict and a vertex-to-artist property map, which give fast lookups in both directions during later computations.)


#### Add edges:

In [6]:
import itertools

edge_weights = {}
for announcement in announcements.values():
    artists_ = announcement['artists']
    for artist1, artist2 in itertools.combinations(artists_, 2):
        #Check if the edge already exists (this returns the edge, or None)
        edge = g.edge(artist_to_vertex[artist1], artist_to_vertex[artist2])
        if edge is not None:
            edge_weights[edge] += 1
        else:
            edge = g.add_edge(artist_to_vertex[artist1], artist_to_vertex[artist2])
            edge_weights[edge] = 1

Add edge weight property to the edges

In [7]:
weight = g.new_edge_property("int")
for edge, weight_value in edge_weights.items():
    weight[edge] = weight_value
g.edge_properties["weight"] = weight

Now we have a graph that includes all connections.

In [8]:
#Print the number of vertices and edges
print(g.num_vertices())
print(g.num_edges())

21350
1149352


The problem is that we have too many edges: the nested block model computation crashed on the full graph, so we run it on a smaller subset of the data, a subgraph of the original graph.<br>
(I ran it locally in a container; JupyterLab may be faster and may not run out of memory.)
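
To put the size in perspective, the vertex and edge counts printed above give the mean degree and the density of the graph. The sketch only uses those two reported numbers and the standard formulas for a simple undirected graph.

```python
n_vertices = 21350   # g.num_vertices() as reported above
n_edges = 1149352    # g.num_edges() as reported above

# Each undirected edge contributes to the degree of two vertices
mean_degree = 2 * n_edges / n_vertices

# Fraction of all possible vertex pairs that are actually connected
density = n_edges / (n_vertices * (n_vertices - 1) / 2)

print(f"mean degree: {mean_degree:.1f}, density: {density:.4f}")
```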

## Focusing on a subgraph

We construct two sub-networks of the original network:

- `g_doubles`: has edges only between artists that share 2+ announcements<br>
- `g_selected`: is restricted to the ~1000 artists who also appear in the [PainterPalette](https://github.com/me9hanics/PainterPalette) dataset (developed by Mihaly), for whom we therefore have more data, such as locations, nationality/citizenship, styles, biographical data and so on. (Some artists may appear in the two datasets under different names; we do not account for this.)

The PainterPalette dataset contains:

- Bio data of a painter: Nationality/citizenship, name, birth and death years and places, gender
- Artistic style data
- Locations of activity (sometimes with years)
- Occupations (e.g. painter, sculptor, lithographer, etc.)
- Influences: on painters, and from painters, pupils, teachers
- Friends, coworkers (limited data)
- Quantities of paintings, in styles, etc.

The dataset is mostly about painters, but it also includes some other artists. By looking up the artists of our network in this dataset, we get basic information about them that we can later use to analyze the communities: how do style, gender ratio and "success" differ among communities? Restricting ourselves to the intersection of the two datasets (artists who are both in our network and in PainterPalette) makes such an analysis possible.

Construct the two networks (functions described in `graph_functions.py`):

In [10]:
#Construct g_doubles: a subgraph of g with only edges with weight > 1 (so with at least 2 connections)
g_doubles = graph_functions.create_subgraph_from_edges(g, [edge for edge in g.edges() if g.edge_properties["weight"][edge] > 1])

In [11]:
print("Nodes:", g_doubles.num_vertices())
print("Edges:", g_doubles.num_edges())

Nodes: 10192
Edges: 233958


As we see, this is already considerably smaller than the original network we started from.
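
The node count drops together with the edge count, presumably because artists left without any edge of weight 2 or more are not carried over into the subgraph. The pure-Python sketch below illustrates that effect on a made-up weight dictionary. It is not the implementation of `graph_functions.create_subgraph_from_edges`, whose details are not shown here.

```python
# Made-up edge weights: (artist, artist) -> number of shared announcements
toy_weights = {
    ("A", "B"): 3,
    ("A", "C"): 1,
    ("B", "D"): 2,
    ("C", "E"): 1,
}

# Keep only edges with weight > 1, mirroring the construction of g_doubles
strong_edges = {pair: w for pair, w in toy_weights.items() if w > 1}

# Vertices that still have at least one edge after filtering
remaining_nodes = {name for pair in strong_edges for name in pair}

print(len(strong_edges), sorted(remaining_nodes))
```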

`g_selected`, the other network, is even smaller. It contains only those artists who appear in the *PainterPalette* dataset (which typically holds a lot of information about them). That is a bit more than 1000 people, mostly painters; only a smaller portion of them are contemporary artists.

In [70]:
import pandas as pd

url = "https://raw.githubusercontent.com/me9hanics/PainterPalette/main/PainterPalette.csv"
artists_df = pd.read_csv(url).drop(columns=['Type', 'Contemporary'])
#Loading directly from the URL avoids a manual download and picks up updates to the dataset
#Dropping 'Type' and 'Contemporary' because they are "artificial" columns

As an example, this is the data we have for one artist:

In [12]:
artists_df.iloc[0] #first row of the PainterPalette dataframe, shown as a Series

artist                                                      Bracha L. Ettinger
Nationality                                              French,Jewish,Israeli
citizenship                                                             Israel
gender                                                                  female
styles                                                   New European Painting
movement                                                 New European Painting
Art500k_Movements                                   {New European Painting:21}
birth_place                                                           Tel Aviv
death_place                                                                NaN
birth_year                                                              1948.0
death_year                                                                 NaN
FirstYear                                                               1991.0
LastYear                                            

Find the intersection of the artists in the graph and the artists in the dataset:

In [14]:
artists_in_graph = set(artist_name[v] for v in g.vertices())
artists_in_dataset = set(artists_df['artist'])
artists_in_both = artists_in_graph.intersection(artists_in_dataset)

print("Amount of artists of whom we have extra data about:", len(artists_in_both))

Amount of artists of whom we have extra data about: 1043
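
The intersection above matches names exactly, so spelling variants are missed. As a hedged illustration only (this step was not part of the analysis), the sketch below shows how names could be normalized before matching. `normalize_name` is a hypothetical helper, and the names are fictional.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Lowercase, strip accents and collapse whitespace (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.lower().split())

# Fictional names: the same person spelled differently in two sources
graph_names = {"Zoë Müller", "Ann Lee"}
dataset_names = {"zoe  muller", "Bob Ray"}

# Exact matching finds nothing; matching on normalized names finds one artist
exact_matches = graph_names & dataset_names
normalized_matches = (
    {normalize_name(n) for n in graph_names}
    & {normalize_name(n) for n in dataset_names}
)

print(len(exact_matches), normalized_matches)
```

Normalization of this kind can also merge distinct people with similar names, so any extra matches would need a manual check.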


Get their graph:

In [16]:
g_selected = graph_functions.create_subgraph_from_names(g, artists_in_both, artist_to_vertex)
print("Number of edges in the graph:", g_selected.num_edges())

Number of edges in the graph: 53003
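
Conceptually, restricting the network to a set of names gives an induced subgraph: an edge survives only if both of its endpoints are in the selection. The sketch below shows this on made-up data in pure Python. It is an assumption about what `create_subgraph_from_names` does, not its actual code.

```python
# Made-up edge weights: (artist, artist) -> number of shared announcements
toy_weights = {
    ("A", "B"): 3,
    ("A", "C"): 1,
    ("B", "D"): 2,
    ("C", "D"): 1,
}

# Made-up selection, playing the role of artists_in_both
selected = {"A", "B", "C"}

# Keep an edge only if both endpoints are selected
induced_edges = {
    (u, v): w for (u, v), w in toy_weights.items()
    if u in selected and v in selected
}

print(induced_edges)
```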


Save graphs: (the official documentation recommends saving the graph in binary format for perfect reconstruction)

In [None]:
g_selected.save("data/coexhibition_network_selected_artists.gt.gz")
g.save("data/coexhibition_network.gt.gz")

We use the first of these graphs, `g_selected`, in the `5_community_detection.ipynb` notebook, where we analyze the communities of artists in this network.