Pip Install Commands

In [208]:
%pip install shapely


[notice] A new release of pip available: 22.1.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


Libraries

In [209]:
import os
import json
import requests
import numpy as np
import pandas as pd
import networkx as nx
from shapely.prepared import prep
from shapely.geometry import mapping, shape, Point

Const Values

In [210]:
YEAR_COLUMN_NAME = "year"
DECADE_COLUMN_NAME = "decade"
SONG_TITLE_COLUMN_NAME = "song_title"
COUNTRY_COLUMN_NAME = "country"
ARTIST_LONGITUDE_COLUMN_NAME = "artist_longitude"
ARTIST_LATITUDE_COLUMN_NAME = "artist_latitude"
ARTIST_LOCATION_COLUMN_NAME = "artist_location"
ARTIST_ID_COLUMN_NAME = "artist_id"
SONG_ID_COLUMN_NAME = "song_id"

UNKNOWN_COUNTRY_VALUE = "unknown"
MUSIC_DATA_FOLDER_PATH = "../Music Data/"

Loading Songs & Artists datasets

In [211]:
raw_songs_dataset = pd.read_csv("../Data/songs_dataset.csv")
raw_artists_dataset = pd.read_csv("../Data/artist_terms.csv")

Riaz: Merging datasets based on artist_id

In [212]:
raw_songs_dataset.head()

Unnamed: 0,song_id,song_title,year,release,tempo,loudness,duration,song_hotttnesss,artist_id,artist_name,artist_latitude,artist_longitude,artist_location,artist_hotttnesss,artist_familiarity
0,SOVFVAK12A8C1350D9,Tanssi vaan,1995.0,Karkuteillä,150.778,-10.555,156.55138,0.299877,ARMVN3U1187FB3A1EB,Karkkiautomaatti,,,,0.356992,0.439604
1,SOGTUKN12AB017F4F1,No One Could Ever,2006.0,Butter,177.768,-2.06,138.97098,0.617871,ARGEKB01187FB50750,Hudson Mohawke,55.8578,-4.24251,"Glasgow, Scotland",0.437504,0.643681
2,SOBNYVR12A8C13558C,Si Vos Querés,2003.0,De Culo,87.433,-4.654,145.05751,,ARNWYLR1187B9B2F9C,Yerba Brava,,,,0.372349,0.448501
3,SOHSBXH12A8C13B0DF,Tangle Of Aspens,,Rene Ablaze Presents Winter Sessions,140.035,-7.806,514.29832,,AREQDTE1269FB37231,Der Mystic,,,,0.0,0.0
4,SOZVAPQ12A8C13B63C,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",,Berwald: Symphonies Nos. 1/2/3/4,90.689,-21.42,816.53506,,AR2NS5Y1187FB5879D,David Montgomery,,,,0.109626,0.361287


In the following cell I remove duplicate rows based on `artist_id`, keeping only the first record for each artist.

In [213]:
# Remove duplicates from the artist dataset based on artist_id
raw_artists_dataset = raw_artists_dataset.drop_duplicates(subset=ARTIST_ID_COLUMN_NAME, keep='first')

# Merge the datasets on artist_id
raw_music_dataset = pd.merge(raw_songs_dataset, raw_artists_dataset, on=ARTIST_ID_COLUMN_NAME, how='left')

The cell above merges the two datasets on `artist_id` using a left join.
With `how='left'`, every key from the left dataframe (here `raw_songs_dataset`) is kept in the merged dataframe, and rows from the right dataframe (here `raw_artists_dataset`) are attached only where the key matches.

In other words:

- All rows from the left dataframe (`raw_songs_dataset`) are retained.
- Where `artist_id` matches a row in the right dataframe (`raw_artists_dataset`), that row's columns are added to the merged row.
- Where there is no match, the right-hand columns are filled with NaN (missing values).

Because duplicate `artist_id` rows were dropped first, each song matches at most one artist row, so the merge does not add rows.
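As a minimal sketch of the behaviour described above, the following uses two made-up tables (`songs` and `artist_terms` are illustrative stand-ins, not the real datasets) to show that a left join keeps every song and fills unmatched artists with NaN.

```python
import pandas as pd

# Toy stand-ins for the songs and artist-terms tables (made-up values).
songs = pd.DataFrame({
    "song_id": ["S1", "S2", "S3"],
    "artist_id": ["A1", "A2", "A3"],
})
artist_terms = pd.DataFrame({
    "artist_id": ["A1", "A1", "A2"],
    "term": ["pop rock", "indie", "cumbia"],
})

# Keep one term per artist so the merge cannot multiply song rows.
artist_terms = artist_terms.drop_duplicates(subset="artist_id", keep="first")

# Left join: every song survives; A3 has no match, so its term is NaN.
merged = pd.merge(songs, artist_terms, on="artist_id", how="left")
print(merged)
```

The merged frame has the same three rows as `songs`; only the `term` of the song by `A3` is missing.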

In [214]:
raw_music_dataset.head()

Unnamed: 0,song_id,song_title,year,release,tempo,loudness,duration,song_hotttnesss,artist_id,artist_name,artist_latitude,artist_longitude,artist_location,artist_hotttnesss,artist_familiarity,term
0,SOVFVAK12A8C1350D9,Tanssi vaan,1995.0,Karkuteillä,150.778,-10.555,156.55138,0.299877,ARMVN3U1187FB3A1EB,Karkkiautomaatti,,,,0.356992,0.439604,pop rock
1,SOGTUKN12AB017F4F1,No One Could Ever,2006.0,Butter,177.768,-2.06,138.97098,0.617871,ARGEKB01187FB50750,Hudson Mohawke,55.8578,-4.24251,"Glasgow, Scotland",0.437504,0.643681,broken beat
2,SOBNYVR12A8C13558C,Si Vos Querés,2003.0,De Culo,87.433,-4.654,145.05751,,ARNWYLR1187B9B2F9C,Yerba Brava,,,,0.372349,0.448501,cumbia
3,SOHSBXH12A8C13B0DF,Tangle Of Aspens,,Rene Ablaze Presents Winter Sessions,140.035,-7.806,514.29832,,AREQDTE1269FB37231,Der Mystic,,,,0.0,0.0,hard trance
4,SOZVAPQ12A8C13B63C,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",,Berwald: Symphonies Nos. 1/2/3/4,90.689,-21.42,816.53506,,AR2NS5Y1187FB5879D,David Montgomery,,,,0.109626,0.361287,ragtime


In [215]:
raw_music_dataset.isna().sum()

song_id                    0
song_title                 2
year                  484270
release                    7
tempo                      0
loudness                   0
duration                   0
song_hotttnesss       417782
artist_id                  0
artist_name                0
artist_latitude       641766
artist_longitude      641766
artist_location       487546
artist_hotttnesss         12
artist_familiarity       185
term                    3767
dtype: int64

In [216]:
raw_music_dataset.isna().sum().sum()

2677103

Shartil: For now I am going to delete all rows with missing data.

In [217]:
music_dataset = raw_music_dataset.dropna()

In [218]:
len(music_dataset)

126903
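For reference, a small sketch of what `dropna()` does on a made-up frame (`toy` and its values are assumptions for illustration). It also shows the `subset` argument, which may be worth considering later if dropping every incomplete row turns out to be too aggressive.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "song_title": ["a", "b", "c"],
    "year": [1995.0, np.nan, 2003.0],
    "song_hotttnesss": [0.3, 0.6, np.nan],
})

# dropna() with no arguments removes a row if ANY column is missing.
all_complete = toy.dropna()

# Restricting to a subset keeps rows that are only missing unrelated fields.
year_known = toy.dropna(subset=["year"])

print(len(all_complete), len(year_known))
```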

Shartil: Adding decade column to dataset

In [219]:
# assign() passes the whole DataFrame to the callable, so the argument is a frame, not a row
music_dataset = music_dataset.assign(decade=lambda df: (df[YEAR_COLUMN_NAME].astype(int) // 10) * 10)

In [220]:
min_decade = music_dataset[DECADE_COLUMN_NAME].min()
max_decade = music_dataset[DECADE_COLUMN_NAME].max()

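# Step through the range ten years at a time so every value is a real decade,
# however many decades the data spans (np.linspace with a fixed count of 10 only
# lands on decades when the range covers exactly ten of them)
decade_array = np.arange(min_decade, max_decade + 10, 10, dtype=int)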
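A short sketch of the decade logic on made-up years (`toy_years` is an assumption, not taken from the dataset): floor division truncates each year to its decade, and a step of 10 enumerates the decades between the smallest and largest value.

```python
import numpy as np
import pandas as pd

toy_years = pd.Series([1995.0, 2006.0, 1989.0, 1980.0])

# Floor-divide by 10 and multiply back to truncate each year to its decade.
toy_decades = (toy_years.astype(int) // 10) * 10

# Step through the observed range ten years at a time (endpoint included).
decade_values = np.arange(toy_decades.min(), toy_decades.max() + 10, 10)
print(toy_decades.tolist(), decade_values.tolist())
```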

Najeeb: Introducing a new column "country" based on Latitude and Longitude.

In [221]:
# Load the country boundaries (GeoJSON) from a local file
with open("../Data/countries.geojson.json", "r", encoding="utf-8") as file:
    geojson_data = json.load(file)

countries = {}
for feature in geojson_data["features"]:
    geom = feature["geometry"]
    country = feature["properties"]["ADMIN"]
    countries[country] = prep(shape(geom))

# Function to get country name from latitude and longitude
def get_country(lon, lat):
    point = Point(lon, lat)
    for country, geom in countries.items():
        if geom.contains(point):
            return country

    return UNKNOWN_COUNTRY_VALUE

# Apply the function to create a new 'country' column
music_dataset[COUNTRY_COLUMN_NAME] = music_dataset.apply(
    lambda row: get_country(
        row[ARTIST_LONGITUDE_COLUMN_NAME],
        row[ARTIST_LATITUDE_COLUMN_NAME],
    ),
    axis=1,
)
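To illustrate the point-in-polygon lookup without the full boundaries file, here is a sketch with two made-up square "countries" (`Westland`, `Eastland`, and the `toy_*` names are hypothetical). It follows the same pattern as the cell above: build prepared geometries once, then test each point against them.

```python
from shapely.geometry import Point, shape
from shapely.prepared import prep

# Two made-up square "countries" in GeoJSON form (coordinates are lon, lat).
toy_geojson = {
    "features": [
        {"properties": {"ADMIN": "Westland"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]]]}},
        {"properties": {"ADMIN": "Eastland"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[10, 0], [20, 0], [20, 10], [10, 10], [10, 0]]]}},
    ]
}

# Preparing each geometry once speeds up repeated contains() checks.
toy_countries = {
    f["properties"]["ADMIN"]: prep(shape(f["geometry"]))
    for f in toy_geojson["features"]
}

def toy_get_country(lon, lat):
    point = Point(lon, lat)  # shapely expects x=longitude, y=latitude
    for name, geom in toy_countries.items():
        if geom.contains(point):
            return name
    return "unknown"

print(toy_get_country(5, 5), toy_get_country(15, 5), toy_get_country(50, 50))
```

Note the argument order: longitude comes first, which is why the notebook passes `artist_longitude` before `artist_latitude`.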

Shartil: Deleting redundant columns 

In [222]:
music_dataset = music_dataset.drop(
    [
        ARTIST_LATITUDE_COLUMN_NAME,
        ARTIST_LONGITUDE_COLUMN_NAME,
        ARTIST_LOCATION_COLUMN_NAME,
        SONG_ID_COLUMN_NAME,
        ARTIST_ID_COLUMN_NAME
    ], 
    axis=1)

Shartil: Deleting all rows with "unknown" as the country value

In [223]:
music_dataset = music_dataset[music_dataset[COUNTRY_COLUMN_NAME] != UNKNOWN_COUNTRY_VALUE]

Shartil: Evenly selecting 1000 songs and saving them as the final dataframe

In [224]:
music_dataset.shape
music_dataset = music_dataset.iloc[::124] # returns dataframe with (1003, 13)
music_dataset = music_dataset.iloc[:-3] # returns dataframe with (1000, 13)
music_dataset.shape

(1000, 13)
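The stride of 124 and the trailing trim of 3 rows are specific to the current row count. As a sketch only, a hypothetical helper (`sample_evenly` is not part of the notebook) could derive the stride from the frame length so the selection keeps working if the filtered dataset changes size.

```python
import pandas as pd

def sample_evenly(df, target_rows):
    # Hypothetical helper: derive the stride instead of hard-coding it.
    stride = max(len(df) // target_rows, 1)
    # Striding may overshoot by a few rows, so trim to the target.
    return df.iloc[::stride].head(target_rows).reset_index(drop=True)

toy = pd.DataFrame({"value": range(103)})
sampled = sample_evenly(toy, 10)
print(sampled["value"].tolist())
```

With 103 toy rows and a target of 10, the stride is 10 and the eleventh strided row is trimmed, mirroring the `iloc[::124]` followed by `iloc[:-3]` in the cell above.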

In [225]:
music_dataset.reset_index(drop=True, inplace=True)
music_dataset.head()

Unnamed: 0,song_title,year,release,tempo,loudness,duration,song_hotttnesss,artist_name,artist_hotttnesss,artist_familiarity,term,decade,country
0,No One Could Ever,2006.0,Butter,177.768,-2.06,138.97098,0.617871,Hudson Mohawke,0.437504,0.643681,broken beat,2000,United Kingdom
1,Golden Promises,1980.0,A Black Box,95.564,-12.293,176.71791,0.407902,Peter Hammill,0.412421,0.574284,art rock,1980,United Kingdom
2,Chambers Of The Heart,1989.0,Set Free - The Definitive Edition,118.639,-14.426,277.91628,0.0,Constance Demby,0.389029,0.500782,meditation,1980,United States of America
3,In The Bleak Mid Winter (Album Version),2002.0,Hymnsongs,164.025,-6.953,332.19873,0.320229,Phil Keaggy,0.493816,0.58851,ccm,2000,United States of America
4,Good Man,2006.0,In The Dark Live At Vicar Street,111.589,-9.55,258.53342,0.425125,Josh Ritter,0.489464,0.729852,folk rock,2000,United States of America


In [226]:
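# Create the output folder if it does not exist yet
os.makedirs(MUSIC_DATA_FOLDER_PATH, exist_ok=True)

# os.path.join avoids the doubled slash, since the folder path already ends with "/"
music_dataset.to_csv(os.path.join(MUSIC_DATA_FOLDER_PATH, "music_dataset.csv"))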

Shartil: Now I am going to create the graph

In [227]:
music_graph = nx.Graph()

In [228]:
music_graph.add_nodes_from(decade_array.tolist())
music_graph.add_nodes_from(music_dataset[COUNTRY_COLUMN_NAME].unique().tolist())
music_graph.add_nodes_from(music_dataset[SONG_TITLE_COLUMN_NAME].tolist())

In [229]:
relationships = []
for index, row in music_dataset.iterrows():
    current_song_title = row[SONG_TITLE_COLUMN_NAME]
    current_decade = row[DECADE_COLUMN_NAME]
    current_country = row[COUNTRY_COLUMN_NAME]

    relationships.append((current_decade, current_song_title, {"label": "release_decade"}))
    relationships.append((current_country, current_song_title, {"label": "release_country"}))

music_graph.add_edges_from(relationships)

In [230]:
print(music_graph)

Graph with 1056 nodes and 1997 edges


In [231]:
def get_songs_by_criteria(music_graph, given_criteria):
    # Indexing the graph by a node yields its neighbours, i.e. the connected songs
    selected_songs = list(music_graph[given_criteria])
    return selected_songs

In [232]:
def print_length_and_items(given_list, amount_of_items=50):
    print(len(given_list))
    print(given_list[:amount_of_items])

In [233]:
decade_input = 1990

decade_songs = get_songs_by_criteria(music_graph, decade_input)
print_length_and_items(decade_songs)

244
['Annabelle', 'Find Me A Girl', 'A Million Miles From Nowhere', 'Montate', 'Big Screen Television', 'The Birds', 'Hypocrisy Is The Greatest Luxury', "Crazy 'Bout My Baby", 'Stirnenfuß', 'Seven Dreaming Souls', "Che Cosse' L'amor", 'True Lovers', 'Crushonya', 'Greys', 'Blue Water', 'Just For You', 'Until The Sun Comes Back Again (LP Version)', 'Fast As You Can', 'A Day At The Beach', "Sun Won't Shine", 'Madness', 'Du Er En Dritt', 'Justine', "Fontanellette (2007 Remastered Live @ CBGB's)", 'The Masquerade Is Over', 'Will To Give', 'Asshole', 'Continuum', 'Lullabye', 'Over and Over', 'To Make Up My Mind', 'My God Called Me This Morning', 'Under Pressure (Album Version)', 'Laid Back Sunday (Album Version)', 'I Found My Smile Again', 'Folsom Prison Blues', 'Theme from Symphony #40', 'One Thousand Cycles', 'Geek The Girl', 'The Heart Never Learns', 'Love You Too Much', 'I Can Only Give You Everything', 'It Is Well With My Soul', '¿ Cómo_ cuándo y porqué ( Why do I love you so )', 'Encon

In [234]:
country_input = "Sweden"

country_songs = get_songs_by_criteria(music_graph, country_input)
print_length_and_items(country_songs)

14
['Swedish Sin', 'Seven Dreaming Souls', '125', 'Voyage of gurdijeff', 'Intro', 'Sky Phenomenon', 'The Khlysti Evangelist', 'Break Another Heart', 'Fly Catching', 'Feed On me', 'Baby', 'Reborn in blasphemy', 'Leaders', 'Go For The Soul']


Shartil: Now let's get the intersection of the lists<br>
This code was taken from this [StackOverflow answer](https://stackoverflow.com/a/3697438/9609586)

In [235]:
print("The songs from Sweden that were released in the 1990s:")

result_list = list(set(decade_songs) & set(country_songs))
print_length_and_items(result_list)

The songs from Sweden that were released in the 1990s:
3
['Seven Dreaming Souls', 'Fly Catching', 'Reborn in blasphemy']
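Since both criteria are nodes in the same graph, the intersection can also be read directly off the graph. The sketch below uses a made-up toy graph (`toy_graph` and the song names are assumptions) and `nx.common_neighbors`, which yields the nodes adjacent to both given nodes. It is an alternative to the set intersection, not a replacement for it.

```python
import networkx as nx

toy_graph = nx.Graph()
toy_graph.add_edges_from([
    (1990, "Song A"), (1990, "Song B"),
    ("Sweden", "Song B"), ("Sweden", "Song C"),
])

# Songs adjacent to both the decade node and the country node.
shared = sorted(nx.common_neighbors(toy_graph, 1990, "Sweden"))
print(shared)
```

Sorting the result also makes the output order stable, which a plain `set` intersection does not guarantee.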
