# 1. Build artist graph  
This notebook shows how to build the artist graph, which represents artists and their collaboration network. I used [Spotipy](https://spotipy.readthedocs.io/en/2.18.0/) to explore artists and their collaborators in breadth-first order.

## The graph `G`
* The nodes represent artists.
* There exists an edge `E(X,Y)` between nodes `X` and `Y` if the corresponding artists have a collaborative recording on Spotify.
* The edge weight between nodes `X` and `Y` is the number of collaborative albums between artists `X` and `Y`.
* Each edge `E(X,Y)` has an `albums` attribute: the set of collaborative albums found between artists `X` and `Y`. Storing the albums as a set ensures that no album is double-counted, so the edge weight always equals the size of the `albums` attribute.
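
As a minimal sketch of this edge layout (not the code in `musicians.py`), the snippet below stores albums in a set on each edge and derives the weight from it. The helper names `add_collaboration` and `set_weights`, and the artist and album IDs, are hypothetical and used only for illustration.

```python
import networkx as nx


def add_collaboration(G, x, y, album_id):
    """Record that artists x and y share album_id (hypothetical helper)."""
    if not G.has_edge(x, y):
        G.add_edge(x, y, albums=set())
    # A set silently ignores an album that was already recorded.
    G[x][y]["albums"].add(album_id)


def set_weights(G):
    """Set each edge's weight to the number of distinct shared albums."""
    for _, _, data in G.edges(data=True):
        data["weight"] = len(data["albums"])


G_demo = nx.Graph()
add_collaboration(G_demo, "artist_X", "artist_Y", "album_1")
add_collaboration(G_demo, "artist_X", "artist_Y", "album_1")  # same album again
add_collaboration(G_demo, "artist_X", "artist_Y", "album_2")
set_weights(G_demo)

print(G_demo["artist_X"]["artist_Y"]["weight"])
```

Because the second call repeats `album_1`, the edge ends up with two albums rather than three.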

## Graph building process: BFS
* Given a start artist, the graph `G` is built by exploring each artist's collaborators in breadth-first search (BFS) order.
* In particular, I repeated the following process for every artist in the queue:
```
visit artist X
find collaborative albums of artist X
for each collaborative album A:
    for each artist Y credited alongside X on a track of album A:
        create node for artist Y (if necessary)
        create edge E(X,Y) between X and Y (if necessary)
        add album A to the albums attribute of edge E(X,Y)
        add artist Y to queue if Y hasn't been visited
```
* After exploring a sufficient number of artists, I set the `weight` attribute of each edge to the number of albums shared by its two nodes.
* I also kept a dictionary of artist names and their Spotify IDs.
* For details, see the functions `build_graph` and `continue_building_graph` in `musicians.py`.
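
The steps above can be sketched on toy data as follows. This is an illustrative sketch under stated assumptions, not the implementation in `musicians.py`: the function `bfs_build`, the lookup `toy_collaborative_albums`, and the `TOY_COLLABS` data are all made up, and the real code queries Spotify instead of a dictionary.

```python
from collections import deque

import networkx as nx

# Hypothetical toy data: artist -> {album: artists credited together on a track}
TOY_COLLABS = {
    "A": {"album_1": ["A", "B"], "album_2": ["A", "B", "C"]},
    "B": {"album_1": ["A", "B"], "album_2": ["A", "B", "C"], "album_3": ["B", "D"]},
    "C": {"album_2": ["A", "B", "C"]},
    "D": {"album_3": ["B", "D"]},
}


def toy_collaborative_albums(artist):
    """Stand-in for the Spotify lookup (assumption, for illustration only)."""
    return TOY_COLLABS.get(artist, {})


def bfs_build(start, max_visits, find_albums):
    G = nx.Graph()
    G.add_node(start)
    queue = deque([start])
    visited = set()
    while queue and len(visited) < max_visits:
        x = queue.popleft()
        if x in visited:  # an artist may have been queued more than once
            continue
        visited.add(x)
        for album, artists in find_albums(x).items():
            for y in artists:
                if y == x:
                    continue
                if not G.has_edge(x, y):
                    G.add_edge(x, y, albums=set())
                G[x][y]["albums"].add(album)
                if y not in visited:
                    queue.append(y)
    # Weights are derived once exploration stops.
    for _, _, data in G.edges(data=True):
        data["weight"] = len(data["albums"])
    return G, queue, visited


G_toy, Q_toy, visited_toy = bfs_build("A", 2, toy_collaborative_albums)
print(len(visited_toy), len(G_toy))
```

Stopping after two visits leaves unvisited artists in the queue, which is why the graph has more nodes than visited artists and why the queue is worth saving for a later expansion.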
    
## Criteria for a collaborative album
* I defined an album `A` to be a collaborative album between artists `X` and `Y` if album `A` contains at least one track that credits both `X` and `Y` as artists.
* I chose this definition because relying on the `album artist` attribute alone can overestimate collaboration. Many classical music albums are compilations of recordings by various artists who never perform together on any track. For example, albums like 'Greatest Classical Music' or 'Classical Christmas' may contain individual tracks by artists `X` and `Y` without any track that credits both. My definition ensures that such albums are not counted as a collaboration between `X` and `Y`.
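
A minimal sketch of this track-level criterion, assuming an album is given as a list of tracks and each track as a list of credited artist IDs. The function name `collaborating_pairs` and the sample albums are hypothetical.

```python
from itertools import combinations


def collaborating_pairs(track_artists):
    """Return the artist pairs that share at least one track on an album.

    track_artists: list of tracks, each a list of credited artist IDs.
    """
    pairs = set()
    for artists in track_artists:
        # Sort so that each pair is stored in one canonical order.
        for x, y in combinations(sorted(set(artists)), 2):
            pairs.add((x, y))
    return pairs


# A compilation: every track has a single artist, so nobody collaborates.
compilation = [["X"], ["Y"], ["Z"]]
# An album where X and Y share a track, and Y and Z share another.
shared_album = [["X", "Y"], ["X"], ["Y", "Z"]]

print(collaborating_pairs(compilation))
print(collaborating_pairs(shared_album))
```

The compilation yields no pairs even though `X`, `Y`, and `Z` all appear on it, while the second album links `X` with `Y` and `Y` with `Z`, but not `X` with `Z`.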
 

## Summary
* I built the artist graph by visiting 80,000 artists.
* The resulting graph consists of ~188,000 nodes. The number of nodes exceeds 80,000 because the graph contains the 80,000 visited artists plus all of their collaborators, including those that were never visited themselves.

In [None]:
%load_ext autoreload
%autoreload 2

import networkx as nx 
import pickle
import spotipy

from collections import deque
from spotipy.oauth2 import SpotifyClientCredentials
from musicians import *

In [2]:
# Connect to Spotify Web API (fill in your own credentials)
cid = ""     # Spotify client ID
secret = ""  # Spotify client secret

auth_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret) 
sp = spotipy.Spotify(auth_manager=auth_manager)

In [35]:
# specify start artist
Gilels_uri = '21h8E3aA7a9mjcUHbLpjxf'

# build graph
G, Q, visited, ID_name = build_graph(Gilels_uri, 80000, sp)

In [46]:
# Create a dictionary mapping artist name -> ID.
# For names with multiple IDs, keep the ID that has the highest degree in G.

name_ID = dict()
for ID, name in ID_name.items():
    if name not in name_ID:
        name_ID[name] = ID
    else:
        # Name already appears in name_ID: compare degrees.
        # IDs that are not nodes of G count as degree 0.
        old_ID = name_ID[name]

        ID_degree = G.degree(ID) if ID in G else 0
        old_ID_degree = G.degree(old_ID) if old_ID in G else 0

        # On a tie, the most recently seen ID wins.
        if ID_degree >= old_ID_degree:
            name_ID[name] = ID

In [48]:
print('number of visited artists:', len(visited))
print('number of nodes in G: ', len(G))

number of visited artists: 80000
number of nodes in G:  188294


In [None]:
### save
"""
# convert each edge's 'albums' attribute from set to list (write_gml cannot serialize sets)
for n1,n2,edge in G.edges(data=True):
    edge['albums'] = list(edge['albums'])
nx.write_gml(G, "graph_80000/graph.gml")

# save visited artists
visited = list(visited)
with open('graph_80000/visited_artists_80000.pkl', 'wb') as f:
    pickle.dump(visited, f)
    
# save dictionaries for artist names and IDs    
with open('graph_80000/ID_name_80000.pkl', 'wb') as f:
    pickle.dump(ID_name, f)
with open('graph_80000/name_ID_80000.pkl', 'wb') as f:
    pickle.dump(name_ID, f)
    
# save queue (stored as a list; it is converted back to a deque on load)
with open('graph_80000/queue_80000.pkl', 'wb') as f:
    pickle.dump(list(Q), f)
"""

### Optional: Expanding on existing graph
One can expand a previously built graph by calling the function `continue_building_graph`.

In [5]:
# load existing graph and data
G = nx.read_gml("graph_80000/graph.gml")
# convert each edge's 'albums' attribute from list back to set
for n1,n2,edge in G.edges(data=True):
    edge['albums'] = set(edge['albums'])

# file names match those written by the save cell above
with open('graph_80000/ID_name_80000.pkl', 'rb') as f:
    ID_name = pickle.load(f)

with open('graph_80000/visited_artists_80000.pkl', 'rb') as f:
    visited = pickle.load(f)
visited = set(visited)

with open('graph_80000/queue_80000.pkl', 'rb') as f:
    Q = pickle.load(f)
Q = deque(Q)

In [None]:
# visit 1000 more artists and expand the graph G
G, Q, visited, ID_name = continue_building_graph(G, Q, visited, ID_name, 1000)