<a href="https://colab.research.google.com/github/matt2fu/spotify-feature-analytics/blob/main/spotify_feature_analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spotify Feature Analytics

# Motivation

This project focuses on Spotify artists. We wanted to find the equivalent of the "Kevin Bacon number" for Spotify artists who have worked together, i.e. been featured on the same song.

For this project, we used two datasets: edges.csv and nodes.csv. edges.csv records which artists are featured on the same song, in the form of an edge list of artist ids. nodes.csv contains data on the Spotify artists themselves: their id, name, etc.

We then used breadth-first search (BFS) to find the shortest path, i.e. the smallest number of connections, between different artists.

# Imports and Setup

In [None]:
# import packages
import glob
import pandas as pd
import numpy as np
import re
import os
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm
from google.colab import drive

# Data Loading and Preprocessing

Download the CSV files: 
*   edges.csv: https://www.kaggle.com/datasets/jfreyberg/spotify-artist-feature-collaboration-network?select=edges.csv
*   nodes.csv: https://www.kaggle.com/datasets/jfreyberg/spotify-artist-feature-collaboration-network?select=nodes.csv

Upload them to Google Colab:
*   Expand the menu on the left
*   Click the file icon (the bottom-most icon)
*   Click the upload icon (the left-most icon)
*   Upload the nodes.csv and edges.csv files

## Edges Data

This dataset records which artists are featured on the same song, in the form of an edge list of artist ids.

In [None]:
# load data
edges_df = pd.read_csv("edges.csv")

In [None]:
# preview data
edges_df.head()

Unnamed: 0,id_0,id_1
0,76M2Ekj8bG8W7X2nbx2CpF,7sfl4Xt5KmfyDs2T3SVSMK
1,0hk4xVujcyOr6USD95wcWb,7Do8se3ZoaVqUt3woqqSrD
2,38jpuy3yt3QIxQ8Fn1HTeJ,4csQIMQm6vI2A2SCVDuM2z
3,6PvcxssrQ0QaJVaBWHD07l,6UCQYrcJ6wab6gnQ89OJFh
4,2R1QrQqWuw3IjoP5dXRFjt,4mk1ScvOUkuQzzCZpT6bc0


## Nodes Data

This dataset contains data on the Spotify artists: their id, name, etc.

In [None]:
# load data
nodes_df = pd.read_csv("nodes.csv")

# keep only the id and name columns
nodes_2_df = nodes_df[['spotify_id', 'name']]

In [None]:
# preview data
nodes_2_df.head()

Unnamed: 0,spotify_id,name
0,48WvrUGoijadXXCsGocwM4,Byklubben
1,4lDiJcOJ2GLCK6p9q5BgfK,Kontra K
2,652XIvIBNGg3C0KIGEJWit,Maxim
3,3dXC1YPbnQPsfHPVkm1ipj,Christopher Martin
4,74terC9ol9zMo8rfzhSOiG,Jakob Hellman


## Merge Nodes and Edges Data

We merge nodes_df and edges_df to get an edge list of artists that are featured on the same song, expressed by artist name instead of by id.

In [None]:
# look up the name of the first artist (id_0) in each edge
feature_df = edges_df.merge(nodes_2_df, left_on = 'id_0', right_on = 'spotify_id')

# keep the name and the second id, and rename 'name' to 'artist_1'
feature_2_df = feature_df[['name', 'id_1']]
feature_3_df = feature_2_df.rename(columns = {'name' : 'artist_1'})

# look up the name of the second artist (id_1) in each edge
feature_4_df = feature_3_df.merge(nodes_2_df, left_on = 'id_1', right_on = 'spotify_id')

# keep the two name columns, and rename 'name' to 'artist_2'
feature_5_df = feature_4_df[['artist_1', 'name']]
feature_6_df = feature_5_df.rename(columns = {'name' : 'artist_2'})

In [None]:
# preview data
feature_6_df.head()

Unnamed: 0,artist_1,artist_2
0,NGHTMRE,Lil Jon
1,Offset,Lil Jon
2,Max Styler,Lil Jon
3,Sak Noel,Lil Jon
4,Alvaro,Lil Jon
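
The merges above rely on pandas' default inner join, so an edge whose id has no match in the nodes data is dropped without warning. The sketch below illustrates one way to check for that on toy data; the frames `toy_nodes` and `toy_edges` and all ids and names in them are hypothetical and not taken from the real datasets.

```python
import pandas as pd

# Hypothetical toy data: "id_x" deliberately has no entry in toy_nodes.
toy_nodes = pd.DataFrame(
    {"spotify_id": ["id_a", "id_b", "id_c"],
     "name": ["Artist A", "Artist B", "Artist C"]}
)
toy_edges = pd.DataFrame(
    {"id_0": ["id_a", "id_b", "id_x"],
     "id_1": ["id_b", "id_c", "id_a"]}
)

# Default merge is an inner join: the edge starting at "id_x" disappears.
inner = toy_edges.merge(toy_nodes, left_on="id_0", right_on="spotify_id")
print(len(toy_edges), len(inner))  # 3 2

# A left join with indicator=True keeps every edge and flags the unmatched ones.
checked = toy_edges.merge(
    toy_nodes, left_on="id_0", right_on="spotify_id", how="left", indicator=True
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["id_0"].tolist())  # ['id_x']
```

Running the same check on edges_df and nodes_2_df would show whether any edges are lost in the merge.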


## Pandas DataFrame to NumPy Array

In [None]:
# pandas dataframe to numpy array
feature_edge_list = feature_6_df.to_numpy()

In [None]:
# preview numpy array
feature_edge_list

array([['NGHTMRE', 'Lil Jon'],
       ['Offset', 'Lil Jon'],
       ['Max Styler', 'Lil Jon'],
       ...,
       ['Erdzan Saidov', 'Cubita'],
       ['Elai', 'Ayoo ELAI'],
       ['Def Rock', 'Tarlan']], dtype=object)

# Breadth-First Search (BFS)
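
As described in the Motivation section, BFS gives the shortest path between two artists in the collaboration graph. The following is a minimal sketch of that idea, not necessarily the implementation used in this project. It runs on a small toy edge list with the same shape as feature_edge_list; the artist names and the helper functions `build_adjacency` and `shortest_path` are hypothetical.

```python
from collections import defaultdict, deque

import numpy as np

# Hypothetical toy edge list, shaped like feature_edge_list.
toy_edge_list = np.array(
    [
        ["Artist A", "Artist B"],
        ["Artist B", "Artist C"],
        ["Artist C", "Artist D"],
        ["Artist E", "Artist F"],
    ],
    dtype=object,
)


def build_adjacency(edge_list):
    """Build an undirected adjacency map from rows of [artist_1, artist_2]."""
    adjacency = defaultdict(set)
    for artist_1, artist_2 in edge_list:
        adjacency[artist_1].add(artist_2)
        adjacency[artist_2].add(artist_1)
    return adjacency


def shortest_path(adjacency, source, target):
    """Return the shortest list of artists from source to target, or None."""
    if source not in adjacency or target not in adjacency:
        return None
    parents = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            # Walk the parent links back to the source.
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbor in adjacency[current]:
            if neighbor not in parents:
                parents[neighbor] = current
                queue.append(neighbor)
    return None  # the two artists are not connected


toy_adjacency = build_adjacency(toy_edge_list)
path = shortest_path(toy_adjacency, "Artist A", "Artist D")
print(path)           # ['Artist A', 'Artist B', 'Artist C', 'Artist D']
print(len(path) - 1)  # 3
```

The number of connections between two artists is the path length minus one. Because the edge list here is keyed by artist name, two different artists who share a name would be treated as a single node; keying the graph by spotify_id would avoid that.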