Pip Install Commands

In [1]:
%pip install shapely
%pip install node2vec


[notice] A new release of pip available: 22.1.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.




Libraries

In [33]:
import os
import json
import requests
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from shapely.prepared import prep
from shapely.geometry import mapping, shape, Point
from node2vec import Node2Vec
from sklearn.cluster import KMeans

Const Values

In [2]:
YEAR_COLUMN = "year"
TEMPO_COLUMN = "tempo"
LOUDNESS_COLUMN = "loudness"
DURATION_COLUMN = "duration"
SONG_HOTTTNESSS_COLUMN = "song_hotttnesss"
ARTIST_HOTTTNESSS_COLUMN = "artist_hotttnesss"
ARTIST_FAMILIARITY_COLUMN = "artist_familiarity"
DECADE_COLUMN = "decade"

NUMERIC_COLUMNS_LIST = [
    YEAR_COLUMN,
    TEMPO_COLUMN,
    LOUDNESS_COLUMN,
    DURATION_COLUMN,
    SONG_HOTTTNESSS_COLUMN,
    ARTIST_HOTTTNESSS_COLUMN,
    ARTIST_FAMILIARITY_COLUMN,
    DECADE_COLUMN
]

SONG_TITLE_COLUMN = "song_title"
COUNTRY_COLUMN = "country"
ARTIST_LONGITUDE_COLUMN = "artist_longitude"
ARTIST_LATITUDE_COLUMN = "artist_latitude"
ARTIST_LOCATION_COLUMN = "artist_location"
ARTIST_ID_COLUMN = "artist_id"
SONG_ID_COLUMN = "song_id"

UNKNOWN_COUNTRY_VALUE = "unknown"
MUSIC_DATA_FOLDER_PATH = "../Music Data/"
MODELS_FOLDER_PATH = "../models/"

In [3]:
def get_attribute_node_name(node_type, node_value):
    return f"{node_type} {node_value}"

Loading Songs & Artists datasets

In [4]:
raw_songs_dataset = pd.read_csv("../Data/songs_dataset.csv")
raw_artists_dataset = pd.read_csv("../Data/artist_terms.csv")

Riaz: Merging datasets based on artist_id<br>
In the following cell I remove duplicate rows based on `artist_id`, keeping only the first record for each artist, and then merge the two datasets.

In [5]:
# Remove duplicates from the artist dataset based on artist_id
raw_artists_dataset = raw_artists_dataset.drop_duplicates(subset=ARTIST_ID_COLUMN, keep='first')

# Merge the datasets on artist_id
raw_music_dataset = pd.merge(raw_songs_dataset, raw_artists_dataset, on=ARTIST_ID_COLUMN, how='left')

The cell above merges the datasets on `artist_id` using a left join.
With `how='left'`, every key from the left dataframe (here `raw_songs_dataset`) is included in the merged dataframe, and only the matching keys from the right dataframe (here `raw_artists_dataset`) are added.

In other words:

- All rows from the left dataframe (`raw_songs_dataset`) are retained.
- If the right dataframe (`raw_artists_dataset`) has a matching key (here `artist_id`), its data is added to the merged row.
- If the right dataframe has no matching key, the corresponding columns in the merged dataframe are filled with NaN (missing values).
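
A minimal sketch of the left-join behaviour described above, using two tiny hypothetical dataframes (`songs`, `artists`, and all their values are made up for illustration and are not taken from the dataset). Passing `indicator=True` adds a `_merge` column that shows which rows found a match.

```python
import pandas as pd

# Hypothetical toy data: three songs, but only two artists have term data
songs = pd.DataFrame({
    "song_title": ["A", "B", "C"],
    "artist_id": ["AR1", "AR2", "AR3"],
})
artists = pd.DataFrame({
    "artist_id": ["AR1", "AR1", "AR2"],
    "term": ["pop rock", "rock", "cumbia"],
})

# Keep one row per artist so the merge cannot multiply song rows
artists = artists.drop_duplicates(subset="artist_id", keep="first")

# Left join: every song is kept, unmatched artists get NaN in "term"
merged = pd.merge(songs, artists, on="artist_id", how="left", indicator=True)
print(merged)
```

Song "C" has no matching artist, so its `term` is NaN and its `_merge` value is `left_only`. Dropping duplicates first is what keeps the merged row count equal to the number of songs.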

In [6]:
raw_music_dataset.head()

Unnamed: 0,song_id,song_title,year,release,tempo,loudness,duration,song_hotttnesss,artist_id,artist_name,artist_latitude,artist_longitude,artist_location,artist_hotttnesss,artist_familiarity,term
0,SOVFVAK12A8C1350D9,Tanssi vaan,1995.0,Karkuteillä,150.778,-10.555,156.55138,0.299877,ARMVN3U1187FB3A1EB,Karkkiautomaatti,,,,0.356992,0.439604,pop rock
1,SOGTUKN12AB017F4F1,No One Could Ever,2006.0,Butter,177.768,-2.06,138.97098,0.617871,ARGEKB01187FB50750,Hudson Mohawke,55.8578,-4.24251,"Glasgow, Scotland",0.437504,0.643681,broken beat
2,SOBNYVR12A8C13558C,Si Vos Querés,2003.0,De Culo,87.433,-4.654,145.05751,,ARNWYLR1187B9B2F9C,Yerba Brava,,,,0.372349,0.448501,cumbia
3,SOHSBXH12A8C13B0DF,Tangle Of Aspens,,Rene Ablaze Presents Winter Sessions,140.035,-7.806,514.29832,,AREQDTE1269FB37231,Der Mystic,,,,0.0,0.0,hard trance
4,SOZVAPQ12A8C13B63C,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",,Berwald: Symphonies Nos. 1/2/3/4,90.689,-21.42,816.53506,,AR2NS5Y1187FB5879D,David Montgomery,,,,0.109626,0.361287,ragtime


In [7]:
raw_music_dataset.isna().sum()

song_id                    0
song_title                 0
year                  484270
release                    5
tempo                      0
loudness                   0
duration                   0
song_hotttnesss       417782
artist_id                  0
artist_name                0
artist_latitude       641766
artist_longitude      641766
artist_location       487546
artist_hotttnesss         12
artist_familiarity       185
term                    3767
dtype: int64

In [8]:
raw_music_dataset.isna().sum().sum()

2677099

Shartil: For now I am going to delete all rows with missing data.

In [9]:
music_dataset = raw_music_dataset.dropna()

In [10]:
len(music_dataset)

126905

Shartil: Evenly selecting just over 1000 songs and saving them as the final dataframe

In [11]:
print(music_dataset.shape)

music_dataset = music_dataset.iloc[::120] # keeps every 120th row; the resulting shape is (1058, 16)

print(music_dataset.shape)

(126905, 16)
(1058, 16)
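
The row count above follows directly from the step size. A short sketch of the arithmetic; `target_size` and `derived_step` are hypothetical names introduced only to show how a step could be derived instead of hard-coded.

```python
total_rows = 126905   # rows left after dropna(), as printed above
step = 120            # step used in iloc[::120]

# iloc[::step] keeps positions 0, step, 2*step, ...
# so the number of kept rows is ceil(total_rows / step)
selected_rows = len(range(0, total_rows, step))
print(selected_rows)  # 1058

# Hypothetical alternative: derive the step from a target size
target_size = 1000
derived_step = total_rows // target_size
derived_rows = len(range(0, total_rows, derived_step))
print(derived_step, derived_rows)  # 126 1008
```

Either step leaves slightly more than 1000 rows, which gives some headroom for rows that are dropped later.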


Najeeb: Introducing a new column "country" based on latitude and longitude.

This code loads geographical data from a local GeoJSON file, processes it, and then determines which country a given pair of latitude and longitude coordinates falls into. The country names extracted from the GeoJSON file are then inserted into a new column of the dataset.

- The "json.load" function reads the file and converts it into a Python dictionary ('geojson_data').

- An empty dictionary named 'countries' is created to store the geometry associated with each country.

- The script iterates over each feature in the 'features' array of 'geojson_data'. Each feature represents a country.

- For each feature, the geometry ('geom') and the administrative name of the country (the 'ADMIN' property) are extracted.

- The geometry is converted into a shapely shape with 'shape(geom)' and passed to 'prep', which builds a prepared geometry that speeds up repeated spatial queries. The prepared geometry is stored in the 'countries' dictionary with the country name as the key.

- A function 'get_country' is defined, which takes a longitude ('lon') and a latitude ('lat') as arguments and creates a 'Point' object from the coordinates.

- It then iterates over the 'countries' dictionary and checks whether the point is contained within any of the country geometries, using the 'contains' method of the prepared geometry.

- If a containing country is found, the function returns the country's name. If no containing country is found, it returns 'UNKNOWN_COUNTRY_VALUE'.

- A new column in the dataset ('music_dataset') is populated by applying the 'get_country' function to each row. In this way the country column is added to music_dataset based on the latitude and longitude columns.
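
The lookup described in the steps above can be tried without the GeoJSON file. The sketch below assumes two made-up rectangular "regions" in place of real country polygons; `regions` and `find_region` are hypothetical names used only for this illustration.

```python
from shapely.geometry import Point, box
from shapely.prepared import prep

# Hypothetical stand-ins for country polygons: two rectangles,
# given as box(min_lon, min_lat, max_lon, max_lat)
regions = {
    "West": prep(box(-10, 0, 0, 10)),
    "East": prep(box(0, 0, 10, 10)),
}

def find_region(lon, lat, default="unknown"):
    # shapely points are (x, y), i.e. (longitude, latitude)
    point = Point(lon, lat)
    for name, geom in regions.items():
        if geom.contains(point):
            return name
    return default

print(find_region(-5, 5))   # West
print(find_region(5, 5))    # East
print(find_region(50, 50))  # unknown
```

Note the argument order: longitude comes first because it is the x coordinate. A point that lies exactly on a polygon boundary is not reported as contained, so coordinates on a border or coastline can fall through to the default value.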


In [12]:
# Fetch and process the geojson data from a local file
with open(r'../Data/countries.geojson.json', 'r') as file:
    geojson_data = json.load(file)

countries = {}
for feature in geojson_data["features"]:
    geom = feature["geometry"]
    country = feature["properties"]["ADMIN"]
    countries[country] = prep(shape(geom))

# Function to get country name from latitude and longitude
def get_country(lon, lat):
    point = Point(lon, lat)
    for country, geom in countries.items():
        if geom.contains(point):
            return country

    return UNKNOWN_COUNTRY_VALUE

# Apply the function to create a new 'country' column
music_dataset[COUNTRY_COLUMN] = music_dataset.apply(
    lambda row: get_country(row[ARTIST_LONGITUDE_COLUMN], row[ARTIST_LATITUDE_COLUMN]),
    axis=1
)

Shartil: Deleting redundant columns 

In [13]:
music_dataset = music_dataset.drop(
    [
        ARTIST_LATITUDE_COLUMN,
        ARTIST_LONGITUDE_COLUMN,
        ARTIST_LOCATION_COLUMN,
        SONG_ID_COLUMN,
        ARTIST_ID_COLUMN
    ], 
    axis=1)

Shartil: Deleting all rows with "unknown" as the country value

In [14]:
music_dataset = music_dataset[music_dataset[COUNTRY_COLUMN] != UNKNOWN_COUNTRY_VALUE]

In [15]:
music_dataset.reset_index(drop=True, inplace=True)
music_dataset.head()

Unnamed: 0,song_title,year,release,tempo,loudness,duration,song_hotttnesss,artist_name,artist_hotttnesss,artist_familiarity,term,country
0,No One Could Ever,2006.0,Butter,177.768,-2.06,138.97098,0.617871,Hudson Mohawke,0.437504,0.643681,broken beat,United Kingdom
1,Don't Save It All For Christmas Day,2004.0,Merry Christmas With Love,127.397,-9.149,273.08363,0.732281,Clay Aiken,0.500596,0.8521,teen pop,United States of America
2,White Lies,2006.0,Rocinate,92.103,-9.323,388.80608,0.417314,Ester Drang,0.330889,0.525616,shoegaze,United States of America
3,Guess Who I Saw In Paris,1999.0,Sugar Me,105.054,-18.484,170.31791,0.368414,Claudine Longet,0.377489,0.563184,easy listening,France
4,No More Birthdays (Phil Spector Folk) / San Fr...,2006.0,Born To Please,95.658,-6.141,280.45016,0.0,Sound Team,0.368423,0.590111,art rock,United States of America


In [16]:
music_dataset.shape

(1036, 12)

Shartil: making sure the final dataset contains 1000 songs

In [17]:
music_dataset = music_dataset.iloc[:1000] # returns dataframe with 1000 songs

print(music_dataset.shape)

(1000, 12)


Shartil: Adding decade column to dataset

In [18]:
# The lambda receives the whole dataframe, so the decade is computed for the full column at once
music_dataset = music_dataset.assign(decade=lambda df: (df[YEAR_COLUMN].astype(int) // 10) * 10)
music_dataset.head()

Unnamed: 0,song_title,year,release,tempo,loudness,duration,song_hotttnesss,artist_name,artist_hotttnesss,artist_familiarity,term,country,decade
0,No One Could Ever,2006.0,Butter,177.768,-2.06,138.97098,0.617871,Hudson Mohawke,0.437504,0.643681,broken beat,United Kingdom,2000
1,Don't Save It All For Christmas Day,2004.0,Merry Christmas With Love,127.397,-9.149,273.08363,0.732281,Clay Aiken,0.500596,0.8521,teen pop,United States of America,2000
2,White Lies,2006.0,Rocinate,92.103,-9.323,388.80608,0.417314,Ester Drang,0.330889,0.525616,shoegaze,United States of America,2000
3,Guess Who I Saw In Paris,1999.0,Sugar Me,105.054,-18.484,170.31791,0.368414,Claudine Longet,0.377489,0.563184,easy listening,France,1990
4,No More Birthdays (Phil Spector Folk) / San Fr...,2006.0,Born To Please,95.658,-6.141,280.45016,0.0,Sound Team,0.368423,0.590111,art rock,United States of America,2000


In [19]:
min_decade = music_dataset[DECADE_COLUMN].min()
max_decade = music_dataset[DECADE_COLUMN].max()
decade_array = np.arange(min_decade, max_decade + 10, 10, dtype=int)

Shartil: Saving music_dataset as a CSV file

In [20]:
# Create the output folder if it does not exist yet
os.makedirs(MUSIC_DATA_FOLDER_PATH, exist_ok=True)

music_dataset.to_csv(os.path.join(MUSIC_DATA_FOLDER_PATH, "music_dataset.csv"))

Shartil: I will normalize the numeric columns using min max normalization.

In [21]:
def min_max_normalize_column(df, column_name):
    min_val = df[column_name].min()
    max_val = df[column_name].max()
    
    if min_val == max_val:
        raise ValueError("Cannot normalize column when all values are the same.")
    
    df[column_name] = (df[column_name] - min_val) / (max_val - min_val)

In [22]:
normalized_music_dataset = music_dataset.copy()

for numeric_column in NUMERIC_COLUMNS_LIST:
    min_max_normalize_column(normalized_music_dataset, numeric_column)

In [23]:
normalized_music_dataset.head()

Unnamed: 0,song_title,year,release,tempo,loudness,duration,song_hotttnesss,artist_name,artist_hotttnesss,artist_familiarity,term,country,decade
0,No One Could Ever,0.927273,Butter,0.716171,0.961028,0.100247,0.620488,Hudson Mohawke,0.511917,0.671199,broken beat,United Kingdom,0.833333
1,Don't Save It All For Christmas Day,0.890909,Merry Christmas With Love,0.513242,0.763265,0.203325,0.735383,Clay Aiken,0.585741,0.906486,teen pop,United States of America,0.833333
2,White Lies,0.927273,Rocinate,0.371054,0.758411,0.292268,0.419082,Ester Drang,0.387169,0.537915,shoegaze,United States of America,0.833333
3,Guess Who I Saw In Paris,0.8,Sugar Me,0.423229,0.502846,0.12434,0.369974,Claudine Longet,0.441695,0.580325,easy listening,France,0.666667
4,No More Birthdays (Phil Spector Folk) / San Fr...,0.927273,Born To Please,0.385376,0.84718,0.208987,0.0,Sound Team,0.431087,0.610724,art rock,United States of America,0.833333
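
The min-max normalization applied above maps each value x to (x - min) / (max - min), so the smallest value in a column becomes 0 and the largest becomes 1. A minimal pure-Python sketch on a hypothetical list of years (the sample values and the name `min_max_normalize` are chosen for illustration only):

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("Cannot normalize when all values are the same.")
    return [(value - low) / (high - low) for value in values]

# Hypothetical sample values
years = [1955, 1999, 2006, 2010]
normalized_years = min_max_normalize(years)
print([round(value, 6) for value in normalized_years])  # [0.0, 0.8, 0.927273, 1.0]
```

Because the scaling depends on the column's minimum and maximum, a single extreme value compresses all other values into a narrow band.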


Shartil: Now I am going to create the graph

In [24]:
music_graph = nx.Graph()

In [25]:
for current_index, current_row in normalized_music_dataset.iterrows():
    node_data_dict = {}

    for current_column in NUMERIC_COLUMNS_LIST:
        node_data_dict[current_column] = current_row[current_column]

    music_graph.add_node(current_index, **node_data_dict)

In [26]:
# Shartil: adding empty nodes for the decades, which function as hub nodes,
# i.e., all the songs from the 1950s will be connected to the "decade 1950" node.
# This avoids the complex logic of connecting every pair of songs from the same decade and keeps the resulting graph simpler.
for current_decade in decade_array:
    node_name = get_attribute_node_name(DECADE_COLUMN, current_decade)
    music_graph.add_node(node_name)

Shartil: for now, the graph only has decade nodes and song nodes; each song node is identified by its row index in the dataframe

In [27]:
for index, row in music_dataset.iterrows():
    current_decade = row[DECADE_COLUMN]
    node_name = get_attribute_node_name(DECADE_COLUMN, current_decade)
    music_graph.add_edge(node_name, index)

print(music_graph)

Graph with 1007 nodes and 1000 edges
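
The hub-node layout can be illustrated on a toy graph. This is a sketch under assumed data: `song_decades` is a made-up mapping from song index to decade, and the node naming mirrors `get_attribute_node_name`.

```python
import networkx as nx

graph = nx.Graph()

# Hypothetical toy data: song index -> decade
song_decades = {0: 1990, 1: 1990, 2: 2000, 3: 1990}

# One hub node per decade
for decade in sorted(set(song_decades.values())):
    graph.add_node(f"decade {decade}")

# One edge per song, linking it to its decade hub
for song, decade in song_decades.items():
    graph.add_edge(f"decade {decade}", song)

print(graph.number_of_nodes(), graph.number_of_edges())  # 6 4
print(sorted(graph.neighbors("decade 1990")))            # [0, 1, 3]
```

With hubs, the edge count equals the number of songs. Connecting n songs of one decade pairwise would instead need n(n-1)/2 edges, and all songs of a decade are still reachable from each other in two hops through the hub.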


In [28]:
def get_songs_by_criteria(music_graph, given_criteria):
    # The songs matching a criterion are the neighbors of its attribute node
    return list(music_graph.neighbors(given_criteria))

In [29]:
def print_length_and_items(given_list, amount_of_items=50):
    print(len(given_list))
    print(given_list[:amount_of_items])

In [30]:
decade_input = 1990
node_name = get_attribute_node_name(DECADE_COLUMN, decade_input)

decade_songs = get_songs_by_criteria(music_graph, node_name)
print_length_and_items(decade_songs)

256
[3, 12, 18, 23, 24, 28, 34, 35, 53, 54, 55, 58, 70, 71, 75, 77, 80, 92, 93, 94, 98, 101, 102, 104, 109, 112, 115, 117, 121, 126, 131, 138, 143, 145, 158, 159, 162, 166, 167, 168, 176, 178, 180, 186, 189, 191, 193, 198, 199, 202]


Shartil: this code no longer works, since the knowledge graph doesn't contain country data

In [32]:
# country_input = "Sweden"

# country_songs = get_songs_by_criteria(music_graph, country_input)
# print_length_and_items(country_songs)

Shartil: Now let's get the intersection of the lists<br>
This code was taken from this [StackOverflow answer](https://stackoverflow.com/a/3697438/9609586)

In [33]:
# print("The songs from Sweden that were released in 1990:")

# result_list = list(set(decade_songs) & set(country_songs))
# print_length_and_items(result_list)

Riaz: Node embedding using node2vec

In [36]:
# Precompute probabilities and generate walks - **ON WINDOWS ONLY WORKS WITH workers=1**
node2vec = Node2Vec(music_graph, dimensions=64, walk_length=10, num_walks=200, workers=4)  # Use temp_folder for big graphs

# Embed nodes
model = node2vec.fit(window=10, min_count=1, batch_words=4)  # Any keywords acceptable by gensim.Word2Vec can be passed, `dimensions` and `workers` are automatically passed (from the Node2Vec constructor)

Computing transition probabilities:   0%|          | 0/1007 [00:00<?, ?it/s]

Generating walks (CPU: 3): 100%|██████████| 50/50 [00:09<00:00,  5.51it/s]
Generating walks (CPU: 2): 100%|██████████| 50/50 [00:09<00:00,  5.50it/s]
Generating walks (CPU: 1): 100%|██████████| 50/50 [00:09<00:00,  5.46it/s]
Generating walks (CPU: 4): 100%|██████████| 50/50 [00:09<00:00,  5.44it/s]


Saving the embeddings and the model into the models folder

In [37]:

# Save embeddings for later use
model.wv.save_word2vec_format(f"{MODELS_FOLDER_PATH}/embedding")

# Save model for later use
model.save(f"{MODELS_FOLDER_PATH}/node2vec_model")


Riaz: Node clustering based on embeddings using K-Means

In [41]:
# Extract the embeddings and their labels
embeddings = model.wv
labels = list(embeddings.index_to_key)
X = np.array([embeddings[label] for label in labels])

print('Shape of X: ',X.shape)
print('Vectors :', X[:2])


# Apply K-means clustering
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)
clusters = kmeans.labels_

cluster_nodes = {}

# Group the node labels by their assigned cluster
for i, label in enumerate(labels):
    cluster_nodes.setdefault(clusters[i], []).append(label)


Shape of X:  (1007, 64)
Vectors : [[-8.87519047e-02  4.64707240e-02  1.01184301e-01  4.76024926e-01
  -2.08615005e-01 -2.13695958e-01  3.62863749e-01  6.45037964e-02
  -1.05757378e-01 -1.23130515e-01  6.17349148e-01 -8.56697261e-02
   5.61661273e-03 -2.57294148e-01 -4.34916578e-02  2.29714572e-01
   7.66308606e-02 -8.14373568e-02 -1.64118066e-01 -1.04754325e-02
   2.74835616e-01  2.65286207e-01  9.02275220e-02 -4.18187320e-01
  -2.69567192e-01  9.51370001e-02 -1.26770392e-01  2.07653016e-01
  -1.80968121e-01  1.19017452e-01 -6.30519390e-02 -4.76274863e-02
   3.28287899e-01 -5.73682368e-01 -4.89366442e-01  3.90114635e-01
   3.25769693e-01  2.50507772e-01 -1.71384722e-01 -2.49990404e-01
  -5.56424797e-01  4.21053588e-01 -1.83949932e-01  4.46699202e-01
   2.31308848e-01 -1.54556632e-01 -1.44896820e-01  8.38391781e-02
   1.07053630e-01  9.62468460e-02 -5.36897898e-01  2.80958742e-01
  -2.29705885e-01  2.92848140e-01  2.83302665e-01 -2.94188976e-01
   1.22754931e-01 -4.36842859e-01 -4.00900
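
The grouping step in the cell above can be tried on toy data. The sketch below assumes made-up two-dimensional vectors in place of the 64-dimensional node2vec embeddings; all `toy_*` names are hypothetical, and `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D "embeddings" forming two well-separated groups
toy_labels = ["a", "b", "c", "d", "e"]
toy_vectors = np.array([
    [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_vectors)

# Group the labels by their assigned cluster id
toy_cluster_nodes = defaultdict(list)
for label, cluster_id in zip(toy_labels, toy_kmeans.labels_):
    toy_cluster_nodes[int(cluster_id)].append(label)

# Print the size and members of each cluster
for cluster_id, members in sorted(toy_cluster_nodes.items()):
    print(cluster_id, len(members), members)
```

The cluster ids themselves are arbitrary, so only the grouping is meaningful. Printing the size of each cluster alongside its members is a quick way to spot a cluster that has absorbed most of the nodes.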

Displaying the nodes in each cluster

In [42]:

for i in cluster_nodes:
    print(f" Cluster: {i}. Nodes: {cluster_nodes[i]} \n")

 Cluster: 0. Nodes: ['decade 2000', '801', '707', '110', '543', '244', '84', '532', '415', '526', '197', '219', '10', '489', '550', '394', '144', '419', '521', '490', '957', '715', '214', '122', '906', '500', '752', '847', '552', '285', '310', '432', '248', '228', '798', '91', '497', '229', '27', '448', '966', '969', '16', '799', '59', '152', '754', '896', '910', '358', '946', '286', '591', '596', '323', '66', '320', '477', '653', '697', '598', '779', '991', '895', '31', '578', '581', '507', '743', '366', '577', '692', '442', '820', '538', '113', '250', '805', '243', '316', '129', '668', '302', '11', '629', '425', '137', '268', '825', '160', '386', '758', '384', '755', '597', '787', '407', '73', '51', '263', '298', '959', '452', '599', '986', '525', '975', '179', '82', '704', '900', '462', '463', '293', '999', '221', '322', '161', '38', '150', '314', '834', '57', '210', '331', '768', '499', '730', '674', '627', '222', '155', '62', '522', '968', '992', '465', '875', '340', '595', '897',