**Spotify recommendation network analysis**

This code creates and analyses a network based on Spotify's data.

In this first part, all the necessary libraries are imported.

In [None]:
import json
from pyvis.network import Network
from IPython.display import HTML, display
from scipy.stats import linregress
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

net = Network(notebook=True, height="1000px", width="100%", directed=True, select_menu=True, filter_menu=True)

This section converts the data returned by the Spotify API from JSON into Python objects. It also prints the fields of interest for each artist, so that they can be presented, for example, in a table. This information is then used to create the data files that are analysed in the following sections.

In [None]:
json_data = '{"artists": []}' #Replace this placeholder with the JSON string obtained from the Spotify API
data = json.loads(json_data) #Convert the JSON string into Python objects (dictionaries and lists)

artists = data['artists']
for artist in artists:
    name = artist['name']
    followers = artist['followers']['total']
    genres = artist['genres']
    popularity = artist['popularity']
    print("Name:", name)
    print("Genres:", genres)
    print("Popularity:", popularity)
    print("Followers:", followers)
    print()
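
The later sections read plain-text files in which the fields of each line are separated by a ?. The notebook does not show how those files were produced, so the following is only a minimal sketch of one way to write them from the parsed records. The helper write_popularity_file, the file name and the sample records are hypothetical and are not part of the original code.

```python
import os
import tempfile

def write_popularity_file(artists, path):
    """Write one 'name?popularity?followers' line per artist (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as file:
        for artist in artists:
            fields = [
                artist["name"],
                str(artist["popularity"]),
                str(artist["followers"]["total"]),
            ]
            file.write("?".join(fields) + "\n")

# Made-up records shaped like the entries of data['artists']
sample_artists = [
    {"name": "Artist A", "popularity": 80, "followers": {"total": 120000}, "genres": ["pop"]},
    {"name": "Artist B", "popularity": 35, "followers": {"total": 900}, "genres": []},
]

sample_path = os.path.join(tempfile.mkdtemp(), "popularity_vs_followers.txt")
write_popularity_file(sample_artists, sample_path)

with open(sample_path, encoding="utf-8") as file:
    written_lines = file.read().splitlines()
print(written_lines)
```

Note that an artist name containing a ? would break this format, so such names would need to be cleaned before writing.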

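The following function returns a color based on a numerical value given as input: it expects a value between 0 and 100 and maps low values to red and high values to blue. It is used to color the nodes (artists) according to their popularity or their number of followers (the latter has to be rescaled to the same 0-100 range first).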

In [None]:
def get_node_color(value):
    normalized_value = value / 100.0
    red = int((1 - normalized_value) * 255)
    green = 0
    blue = int(normalized_value * 255)
    return f"#{red:02x}{green:02x}{blue:02x}" #Return the color in hexadecimal format
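
As a quick check of the mapping, the sketch below repeats the function and evaluates it at the two ends of the range. Because the function divides by 100, a value above 100 (such as a raw follower count) would give a negative red component and an invalid color string. The variant get_scaled_color and its max_value parameter are assumptions added here for illustration, showing one possible way to rescale such values first.

```python
def get_node_color(value):
    """Same mapping as above: 0 -> red, 100 -> blue."""
    normalized_value = value / 100.0
    red = int((1 - normalized_value) * 255)
    blue = int(normalized_value * 255)
    return f"#{red:02x}00{blue:02x}"

def get_scaled_color(value, max_value):
    """Hypothetical variant: clamp value to [0, max_value], then rescale it to [0, 100]."""
    clamped = min(max(value, 0), max_value)
    return get_node_color(100.0 * clamped / max_value)

low = get_node_color(0)        # least popular: pure red
high = get_node_color(100)     # most popular: pure blue
followers_color = get_scaled_color(2_000_000, max_value=1_000_000)
print(low, high, followers_color)
```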

In this section the network is built. The *pyvis* library is used to obtain a better visual representation.
I decided to give each node a color and a size based on its popularity, a number between 0 and 100 assigned by Spotify's algorithm.

In [None]:
data_file = input("Write the path of the data file: ") #The data files are given from keyboard
popularity_file = input("Write the path of the popularity data file: ")

with open(popularity_file, 'r', encoding = 'utf-8') as file:
    popularity_lines = file.readlines()

with open(data_file, 'r', encoding = 'utf-8') as file:
    data_lines = file.readlines()

for line in popularity_lines:
    fields = line.strip().split('?') #Each line holds the name of an artist and its popularity, separated by a ?
    artist = fields[0]
    popularity = float(fields[1])
    color = get_node_color(popularity)
    net.add_node(artist, size = popularity, color = color) #Add the artist as a node in the graph

for line in data_lines:
    fields = line.strip().split('?') #Each line holds the name of an artist followed by its suggested artists, separated by a ?
    artist = fields[0]
    suggested_artists = fields[1:]
    for suggested_artist in suggested_artists:
        net.add_edge(artist, suggested_artist) #Add an edge from the current artist to the suggested artist

net.show("name_of_the_file.html")

#If the line above raises the following error, use the block below instead:
#UnicodeEncodeError: 'charmap' codec can't encode character '' in position : character maps to <undefined>

html = net.generate_html()
with open("name_of_the_file.html", mode = 'w', encoding = 'utf-8') as fp:
    fp.write(html) #Write the page explicitly as UTF-8
display(HTML(html))

Before studying the properties of the network, it is interesting to plot the popularity of the artists against their number of followers, to check whether the popularity value Spotify assigns to each of them is consistent with the number of people who follow them.

In [None]:
file_path = input("Write the path of the popularity-vs-followers data file: ")
with open(file_path, 'r', encoding = 'utf-8-sig') as file:
    lines = file.readlines()

popularity = []
followers = []

for line in lines:
    fields = line.strip().split('?') #Split the line into artist name, popularity and number of followers
    popularity.append(float(fields[1]))
    followers.append(float(fields[2]))

plt.plot(popularity, followers, 'bo')
plt.xlabel('Popularity')
plt.ylabel('Number of followers')
plt.xticks(np.arange(0, 105, 5)) #Ticks every 5 units; the stop value 105 is excluded, so the last tick falls on the popularity value 100
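
The plot gives a visual impression only. If a number is wanted, one possible measure (not used in the original analysis) is the Spearman rank correlation, which checks whether followers tend to grow with popularity without assuming a linear relation. The sketch below is a plain-Python version with made-up sample values; scipy.stats.spearmanr provides a ready-made equivalent.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up values, for illustration only
sample_popularity = [10, 35, 35, 60, 90]
sample_followers = [500, 2_000, 1_500, 80_000, 5_000_000]
rho = spearman(sample_popularity, sample_followers)
print(f"Spearman rho: {rho:.3f}")
```

A value close to 1 would indicate that artists with more followers are almost always rated as more popular.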

To study the Spotify network it is better to rebuild it with the *networkx* library, since *pyvis* is only a visualization tool and has no built-in functions to analyze a network.
I therefore defined the following function, which generates a network from an input file; the network itself is created in the cell after it.

In [None]:
def create_graph_from_file(file_path):
    graph = nx.DiGraph()  #Create a directed graph object

    with open(file_path, 'r', encoding = 'utf-8-sig') as file:
        lines = file.readlines()

    for line in lines:
        fields = line.strip().split('?')  #Split the line into artist and suggested artists
        artist = fields[0]
        suggested_artists = fields[1:]
        graph.add_node(artist)  #Add the artist as a node in the graph
        for suggested_artist in suggested_artists:
            graph.add_edge(artist, suggested_artist)  #Add an edge from the current artist to the suggested artist

    return graph

In [None]:
file_path = input("Write the path of the data file: ") #The data file is given from keyboard
graph = create_graph_from_file(file_path)
print(graph) #It gives back the number of nodes and edges of the network
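
The input format is easiest to see on a small example. The sketch below applies the same splitting rule as create_graph_from_file to three made-up lines and returns the resulting directed edges as pairs; parse_recommendation_lines and the sample lines are hypothetical and only illustrate the format.

```python
def parse_recommendation_lines(lines):
    """Return (artist, suggested_artist) pairs from 'artist?s1?s2...' lines."""
    edges = []
    for line in lines:
        fields = line.strip().split("?")
        artist, suggested_artists = fields[0], fields[1:]
        edges.extend((artist, suggested) for suggested in suggested_artists)
    return edges

# Made-up sample: A suggests B and C, B suggests A, C has no suggestions
sample_lines = [
    "Artist A?Artist B?Artist C\n",
    "Artist B?Artist A\n",
    "Artist C\n",
]
edges = parse_recommendation_lines(sample_lines)
print(edges)
```

A line with no ? produces no edges, which is why the function above also calls graph.add_node for every artist.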

The first things one can obtain from the generated graph are the adjacency matrix and the incidence matrix.

In [None]:
adjacency_matrix = nx.adjacency_matrix(graph)
adjacency_matrix = adjacency_matrix.toarray() #Convert the adjacency matrix to a NumPy array

fig, ax = plt.subplots()

im = ax.imshow(adjacency_matrix, cmap = 'binary') #Plot the adjacency matrix in black and white (binary attribute)
cbar = ax.figure.colorbar(im, ax = ax) #A colorbar is added to the plot

In [None]:
incidence_matrix = nx.incidence_matrix(graph)
incidence_matrix = incidence_matrix.toarray() #Convert the incidence matrix to a NumPy array

fig, ax = plt.subplots()

im = ax.imshow(incidence_matrix, cmap = 'binary') #Plot the incidence matrix in black and white (binary attribute)
cbar = ax.figure.colorbar(im, ax = ax) #A colorbar is added to the plot
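
To make the content of the adjacency matrix concrete, here is a small plain-Python sketch on a made-up three-node directed graph. It follows the convention in which entry (i, j) is 1 when there is an edge from node i to node j, so that each row sum is an outdegree and each column sum is an indegree. The node names and edges are invented for illustration.

```python
# Made-up directed graph: A -> B, A -> C, B -> A
edges = [("A", "B"), ("A", "C"), ("B", "A")]
nodes = ["A", "B", "C"]
index = {node: i for i, node in enumerate(nodes)}

# adjacency[i][j] == 1 when there is an edge from node i to node j
adjacency = [[0] * len(nodes) for _ in nodes]
for source, target in edges:
    adjacency[index[source]][index[target]] = 1

row_sums = [sum(row) for row in adjacency]                # outdegree of each node
column_sums = [sum(column) for column in zip(*adjacency)] # indegree of each node
print(adjacency)
print(row_sums, column_sums)
```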

Now centrality measures are taken into account. I start by considering the degree centrality (or just degree) of the nodes. Since the network is directed, I also consider the indegree and the outdegree.

In [None]:
degree = graph.degree()
in_degree = graph.in_degree()
out_degree = graph.out_degree()

print(degree)
print(in_degree)
print(out_degree)

Now I consider the degree distribution, that is, the fraction of nodes having each degree value, and plot it on a log-log scale.

In [None]:
degree = dict(degree) #Dictionary node:degree, to access the values easily
degree = sorted(list(degree.values())) #Keep only the degree values, without the names of the artists
frequency = [degree.count(x) for x in degree] #Number of nodes sharing the degree of each node
x = np.asarray(degree, dtype = float)
y = np.asarray(frequency, dtype = float)
y_normalized = y/graph.number_of_nodes() #The frequency is normalized with respect to the total number of nodes

plt.figure(figsize = (7,6))
plt.xlabel('Degree')
plt.ylabel('Frequency')
plt.xscale('log')
plt.yscale('log')
plt.plot(x, y_normalized, 'bo')
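
The cell above plots the fraction of nodes at each degree value. If a cumulative view is preferred, a common choice is the complementary cumulative distribution, i.e. the fraction of nodes whose degree is at least k, which is usually less noisy in the tail. The following is a minimal sketch with made-up degrees; degree_ccdf is a hypothetical helper, not part of the original code.

```python
from collections import Counter

def degree_ccdf(degrees):
    """Return the sorted distinct degrees k and the fraction of nodes with degree >= k."""
    counts = Counter(degrees)
    total = len(degrees)
    distinct = sorted(counts)
    fractions = []
    remaining = total  # number of nodes with degree >= current k
    for k in distinct:
        fractions.append(remaining / total)
        remaining -= counts[k]
    return distinct, fractions

# Made-up degree sequence, for illustration only
sample_degrees = [1, 1, 1, 2, 2, 5]
ks, fractions = degree_ccdf(sample_degrees)
print(ks, fractions)
```

The two returned lists can be passed to plt.plot in place of x and y_normalized, keeping the same log-log axes.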

Using the following section of the code, it is possible to fit a power law to the data obtained. Here this is done only for the indegree.

In [None]:
in_degree = dict(in_degree) #Dictionary node:in_degree, to access the values easily

#Keep only the indegree values (without the names of the artists) and only the positive ones, since log(0) is not defined
in_degree = sorted(value for value in in_degree.values() if value > 0)

frequency = [in_degree.count(x) for x in in_degree]
x = np.asarray(in_degree, dtype = float)
y = np.asarray(frequency, dtype = float)
y_normalized = y/graph.number_of_nodes() #The frequency is normalized with respect to the total number of nodes

plt.figure(figsize = (7,6))
plt.xlabel('Indegree')
plt.ylabel('Frequency')
plt.xscale('log')
plt.yscale('log')
plt.plot(x, y_normalized, 'bo', label='Indegree points')

#linregress returns five values, so all of them must be unpacked
slope, intercept, r_value, p_value, std_err = linregress(np.log(x), np.log(y_normalized)) #Perform a linear regression on the log-log data
fitted = np.exp(intercept) * x**slope #Power-law assumption: y = exp(intercept) * x^slope
plt.plot(x, fitted, 'r', label=f'Fitted curve: y = {np.exp(intercept):.2f} * x^({slope:.2f})')
plt.legend()

print(f"Linear Regression R-squared value: {r_value ** 2}") #Quantifies how well the fitted line matches the log-log data
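
The fit works because a power law becomes a straight line in log-log coordinates. Writing the assumed law with a prefactor C and an exponent α (symbols introduced here for illustration), the regression output maps onto the two parameters as follows:

```latex
y = C\,x^{\alpha}
\;\Longrightarrow\;
\ln y = \ln C + \alpha \ln x ,
\qquad
\alpha = \texttt{slope}, \quad C = e^{\texttt{intercept}} .
```

The R-squared value printed above therefore measures how well a straight line describes the points in log-log space, not how well the curve fits in the original coordinates.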

To characterise the network, the average degree and the average clustering coefficient are computed.

In [None]:
average_degree = sum(degree)/len(degree) #degree is the sorted list of degree values built in the degree distribution cell

clustering_coefficient = nx.clustering(graph)
average_clustering_coefficient = sum(clustering_coefficient.values())/len(clustering_coefficient)

print("Average Degree:", average_degree)
print("Average Clustering Coefficient:", average_clustering_coefficient)
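
As a sanity check on the printed average degree, recall that in a directed network every edge contributes one unit of outdegree and one unit of indegree. With N the number of nodes and E the number of edges (graph.number_of_nodes() and graph.number_of_edges()), this gives:

```latex
\langle k^{\mathrm{in}} \rangle = \langle k^{\mathrm{out}} \rangle = \frac{E}{N},
\qquad
\langle k \rangle = \langle k^{\mathrm{in}} \rangle + \langle k^{\mathrm{out}} \rangle = \frac{2E}{N}
```

The value of average_degree should therefore equal 2E/N, since the degree of a node in a directed graph is the sum of its indegree and outdegree.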

Now other centrality measures are taken into account: in particular, betweenness centrality, closeness centrality, eigenvector centrality and PageRank.

In [None]:
betweenness_centrality = nx.betweenness_centrality(graph)
closeness_centrality = nx.closeness_centrality(graph)
eigenvector_centrality = nx.eigenvector_centrality(graph)
pagerank = nx.pagerank(graph)

To view these results in decreasing order and find the information I was interested in, I used the code below. The variable centrality can be set to any of the quantities computed above: for example, to see the betweenness centrality results, set centrality = betweenness_centrality.

In [None]:
centrality = betweenness_centrality #Replace with closeness_centrality, eigenvector_centrality or pagerank as needed
centrality_data = list(zip(centrality.values(), centrality.keys()))
sorted_values = sorted(centrality_data, key=lambda x: x[0], reverse=True) #Sort the (value, artist) pairs by decreasing value
print(sorted_values)
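
For large networks the full sorted list is long, and often only the first few entries matter. A small helper such as the one sketched below returns just the highest-ranked artists; top_n and the sample scores are hypothetical and only illustrate the idea.

```python
def top_n(centrality, n=5):
    """Return the n (node, score) pairs with the highest scores."""
    return sorted(centrality.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up scores, shaped like the dictionaries returned by the centrality functions
sample_centrality = {"Artist A": 0.12, "Artist B": 0.40, "Artist C": 0.05}
ranking = top_n(sample_centrality, n=2)
print(ranking)
```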

To create the smaller Spotify network and the AllMusic network, one can use the create_graph_from_file function and, to visualize them, a combination of the *networkx* and *pyvis* libraries, in the following way:

In [None]:
file_path = input("Write the path of the data file: ") #The data file is given from keyboard
graph = create_graph_from_file(file_path)
print(graph) #It gives back the number of nodes and edges of the network
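#Start from an empty pyvis network, so that the nodes of the previous network are not carried over
net = Network(notebook=True, height="1000px", width="100%", directed=True, select_menu=True, filter_menu=True)
net.from_nx(graph)
net.show("name_of_file.html")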

To analyze the two new networks, one can reuse the code presented in the previous sections, adapting it to the specific case of interest.