### Hello and welcome to my TLG Python project!

Something you probably don't know about me is that I've gotten really into dance music over the past year - namely house, techno, and trance.

If you've spent any time listening to these genres, you know they can all sound the same (lots of "boots" and "hats" noises) and share similar BPMs.

I've become really interested in analyzing the music I listen to and trying to understand what sets songs and artists apart from one another, so I created this audio analysis tool I'm calling... 

# The MusicMapper

The overall goal is this: 
1. Allow the user to select an artist and return 4 related artists
2. For each of the (now) 5 artists, query 10 of their most popular songs
3. Take the 50 songs and return audio features for each - we'll cover this more later, but generally considering things like energy, tempo, acousticness, etc.
4. Find the two fields with the greatest variance (or let the user decide!) and analyze these in 2D space on an interactive scatterplot.

Alright, enough prelude - let's get started!

First things first, let's install and import some libraries.

For those new to Jupyter and notebooks, the "!pip install" command installs the specified packages into the local environment - we use it in notebooks to set up the environment since there's no terminal.

In [None]:
!pip install spotipy
!pip install pandas
!pip install scikit-learn
!pip install bokeh

import csv
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import requests
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from bokeh.plotting import figure, output_file, show
from bokeh.models import HoverTool, CategoricalColorMapper
from bokeh.layouts import column
from bokeh.models.widgets import Div

Great, now that the environment has all the libraries and packages we need, let's set up the API authentication variables.

I recognize hard-coding API credentials is terrible practice, but for the sake of this exercise, let's keep them here (I'll rotate credentials immediately after).
Here, I'm storing my ID and the secret as variables, then presenting them to SpotifyClientCredentials, and storing that connection as an object "sp".

In [None]:
# Storing my Spotify API credentials - delete before commit!
CLIENT_ID = '349309ae5ef349eb85096837bc86547b'
CLIENT_SECRET = '2c5d4880a00a4bba80c915b55ccb54b5'

# Authenticate with Spotify
client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
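
As a hedged alternative to hard-coding, here is a minimal sketch that reads the credentials from environment variables instead. The helper name `load_spotify_credentials` is hypothetical (it is not part of this notebook or of spotipy); the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` follow the names spotipy itself looks for, and the sketch assumes you've set them before launching the notebook.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from an environment mapping.

    Hypothetical helper for illustration; `env` defaults to os.environ
    but can be any dict-like object, which makes it easy to try out.
    """
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET first."
        )
    return client_id, client_secret
```

The returned pair could then be handed to SpotifyClientCredentials in place of the literal strings, so nothing secret ends up in the notebook itself.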

Next, let's take a look at how we want our data to be structured.
We're about to define a blank csv here, and each field (column) has been given a name matching a feature of the songs we'll be looking at. Most are self-explanatory; the URI is the identifier Spotify uses to look up objects within the app and API. But just to outline everything...

The following values can be accessed at this link (https://developer.spotify.com/documentation/web-api/reference/get-audio-features)

***Key***: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

***Mode***: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

***Danceability***: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

***Energy***: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

***Loudness***: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

***Speechiness***: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

***Acousticness***: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

***Instrumentalness***: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

***Liveness***: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

***Valence***: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

***Tempo***: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

***Embed Code***: The HTML code you would use to embed a track in a webpage, like so:
 <iframe style="border-radius:12px" src="https://open.spotify.com/embed/track/5eg711QA7sgVe8MFdFE5zT?utm_source=generator" width="100%" height="352" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
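
To make the Key and Mode encodings above a bit more concrete, here's a small sketch that turns the two integers into a readable label. The names `PITCH_CLASSES` and `describe_key` are made up for this illustration and aren't used elsewhere in the notebook.

```python
# Pitch classes in standard Pitch Class notation, indexed 0-11.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def describe_key(key, mode):
    """Return a label such as 'A minor' for a key/mode pair.

    Hypothetical helper: key is 0-11 (or -1 if undetected),
    mode is 1 for major and 0 for minor.
    """
    if key == -1:
        return "Unknown key"
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"


print(describe_key(9, 0))  # A minor
```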

In [None]:
csv_file = '50_tracks.csv'
field_names = ['Artist Name', 'Artist URI', 'Track Name', 'Track URI', 'Album',
               'Key', 'Mode', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
               'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Embed Code']


So now that we have the csv nicely defined, let's write the headers, which we just defined as "field_names".

In [None]:
with open(csv_file, 'w', newline='') as file: # Writing to the csv, with open, using writer
    writer = csv.DictWriter(file, fieldnames=field_names)
    writer.writeheader()

Great! Time to define our first function - artist_lookup

First, we'll present the user with the input to select their own artist. Then we'll take that input and use it to search for just 1 artist on Spotify and store that as "results".

Next, through a small if/else check:
    if results has values:
        store the artist_uri and take that to the next function, top_tracks
    else:
        exit, we couldn't find your artist (sorry!)

In [None]:
def artist_lookup():
    # Prompt the user for an artist name
    user_artist = input("Enter an artist: ")

    # Search for the artist using Spotify API
    results = sp.search(q='artist:' + user_artist, type='artist', limit=1)

    # Get the artist URI from the search results
    if results['artists']['items']:
        artist = results['artists']['items'][0]
        artist_uri = artist['uri']
        top_tracks(artist_uri, user_artist) # Passing to top_tracks, with artist_uri and user_artist as arguments
        related_artists(artist_uri) # Passing to related_artists, with artist_uri as argument
    else:
        print("Artist not found.")

Time to define our second function, top_tracks

This function takes two arguments, artist_uri and user_artist

First, we'll pass artist_uri to sp.artist_top_tracks and store those results - then, we parse out just the tracks in the next line.

After that, we'll print some text and then write to our csv, using a for loop.

For each track in the list tracks, get just the ID from the URI (dropping the 14-character "spotify:track:" prefix), then get info for each (this just returns the album name).
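
Slicing by a fixed character count depends on the exact prefix length ("spotify:track:" is 14 characters, "spotify:artist:" is 15). As a hedged alternative, this sketch splits on the colon instead, so the same helper works for either kind of URI. `spotify_id` is a hypothetical name, not something the notebook or spotipy defines.

```python
def spotify_id(uri):
    """Return the trailing ID from a URI shaped like 'spotify:<type>:<id>'.

    Hypothetical helper for illustration only.
    """
    return uri.split(":")[-1]


track_uri = "spotify:track:5eg711QA7sgVe8MFdFE5zT"
print(spotify_id(track_uri))  # 5eg711QA7sgVe8MFdFE5zT
```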

Next, we'll take that track_uri and pass it through sp.audio_features(), which will give us data for each field we defined in our csv.

Following that, we're just doing a quick lookup of the oEmbed data - which is a little tricky, but just returns html for each track to be used with our scatterplot later

Finally, let's write all this data to the csv!

In [None]:
def top_tracks(artist_uri, user_artist):
    results = sp.artist_top_tracks(artist_uri) # Saving as results
    tracks = results['tracks'] # Looking for just tracks

    print("\n\nThe top 10 tracks for the artist are...\n")
    with open(csv_file, 'a', newline='', encoding="utf-8") as file: # Opening and appending the csv now
        writer = csv.DictWriter(file, fieldnames=field_names)
        for track in tracks:
            track_uri = track['uri'][14:] # Dropping the first 14 characters (spotify:track:) from the output
            track_info = sp.track(track_uri) # Getting information for each track, such as Album Name
            audio_features = sp.audio_features(track_uri) # Getting audio features

            # Perform Spotify oEmbed lookup
            oembed_api_url = (f'https://open.spotify.com/oembed?url=https://open.spotify.com/track/{track_uri}&format=json')
            response = requests.get(oembed_api_url)

            if response.status_code == 200:
                oembed_data = response.json()
                embedded_html = oembed_data['html'] # Taking just the html portion of the response.json()
            else:
                embedded_html = ''

            writer.writerow({ # Writing everything to the csv now
                'Artist Name': user_artist,
                'Artist URI': artist_uri[15:],
                'Track Name': track['name'],
                'Track URI': track_uri,
                'Album': track_info['album']['name'],
                'Key': audio_features[0]['key'],
                'Mode': audio_features[0]['mode'],
                'Danceability': audio_features[0]['danceability'],
                'Energy': audio_features[0]['energy'],
                'Loudness': audio_features[0]['loudness'],
                'Speechiness': audio_features[0]['speechiness'],
                'Acousticness': audio_features[0]['acousticness'],
                'Instrumentalness': audio_features[0]['instrumentalness'],
                'Liveness': audio_features[0]['liveness'],
                'Valence': audio_features[0]['valence'],
                'Tempo': audio_features[0]['tempo'],
                'Embed Code': embedded_html
            })

            print(f"{track['name']}")

Whew, that was a lot, but great job so far!

Next, we're going to look up related artists for the user_artist.

First, let's pass the artist_uri variable to the API and store it as related. Next, we'll initialize a blank list, and then finally, we'll find just 4 artists in Spotify's long list of related artists.

Then, we'll append that artist_uri to a list, and look up the top 10 tracks for that artist (going back up to the function above us).

In [None]:
def related_artists(artist_uri): # Using this to look up the 4 most closely related artists for the user_artist
    print("\nAnd 4 related artists... ")
    related = sp.artist_related_artists(artist_uri)
    related_artists_uris = []
    for artist in related['artists'][:4]: # Limiting to just 4
        print(artist['name'])
        artist_uri = artist['uri']
        related_artists_uris.append(artist_uri) # Appending to the list
        top_tracks(artist_uri, artist['name']) # Running each of these through top_tracks

Okay cool - we made it to the end of filling in our csv with data! Now time to run artist_lookup and get things started.

In [None]:
artist_lookup()

So csvs are really cool, but the de facto data structure for doing actual analysis in Python is a pandas dataframe, which we call "df".

Here, we use pandas to read the csv and save it as a dataframe, afterwards printing a random sample of 5 rows/records of the dataframe.

In [None]:
df = pd.read_csv('50_tracks.csv', encoding='utf-8')

df.sample(5)

Great, so if you look at some of our data from the above sample... you'll see it's not all the same type, or within the same range - we'll have to clean and normalize everything.

df.columns[7:-1] will just "cut" the columns we want - so this will give us everything from "Danceability" onwards, but not including "Embed Code".
Then, MinMaxScaler will scale all of the values in those columns to be floats between 0.0 and 1.0 (just like most of the data in our dataset).
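
For intuition, min-max scaling maps each value x in a column to (x - min) / (max - min). Here's a plain-Python sketch of that formula on a made-up list of tempos; `min_max_scale` is a hypothetical helper for illustration, not the scikit-learn implementation, and returning zeros for a constant column is an assumption of this sketch.

```python
def min_max_scale(values):
    """Scale a list of numbers into the 0.0-1.0 range.

    Illustrative helper: a constant column has no spread,
    so every value is mapped to 0.0 here.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


tempos = [120.0, 124.0, 128.0, 140.0]  # made-up BPM values
print(min_max_scale(tempos))  # [0.0, 0.2, 0.4, 1.0]
```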

In [None]:
data_columns = df.columns[7:-1]  # Exclude the last column "Embed Code"

# Normalizing with Min-Max scaling - this will take tempo and convert it into a 0-1 value
scaler = MinMaxScaler()
df[data_columns] = scaler.fit_transform(df[data_columns])

Awesome - let's do some exploratory analysis on our data!

So since our dataset has 9 variables we're considering - and since humans can't view things in 9 dimensions - we have to reduce our dimensionality (unfortunately). To do so, a simple thing we could do is find the two variables with the most variance. We'll use df[data_columns].var() to get the variance of each column, and then sort and keep just the top two fields.
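
Here's a tiny standalone sketch of that "pick the two most spread-out columns" idea, using the standard library instead of pandas. The data and the helper name `top_two_by_variance` are invented for illustration; `statistics.variance` computes the sample variance, which is also what pandas' `.var()` returns by default.

```python
from statistics import variance


def top_two_by_variance(columns):
    """Return the two column names with the largest sample variance.

    `columns` is a dict mapping a column name to a list of numbers.
    """
    variances = {name: variance(vals) for name, vals in columns.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:2]


sample = {  # made-up, already-normalized values
    "Energy": [0.1, 0.9, 0.5],
    "Valence": [0.4, 0.5, 0.6],
    "Tempo": [0.0, 1.0, 0.2],
}
print(top_two_by_variance(sample))  # ['Tempo', 'Energy']
```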

In [None]:
variances = df[data_columns].var()

# Sort the fields based on their variances in descending order
sorted_variances = variances.sort_values(ascending=False)
print("All of the variances:")
print(sorted_variances)

# Select the top two fields with the highest variances
top_two_fields = sorted_variances.head(2)
field1 = top_two_fields.index[0]  # Get the field name of the first field
field2 = top_two_fields.index[1]  # Get the field name of the second field

print(f"The two fields with the greatest variance are {field1} and {field2}.\n")

There's also an option for the user to explore the data on their own and decide on two different variables of their own choosing.

In [None]:
# Ask the user if they want to use the fields with the greatest variance or define their own fields
user_choice = input("Would you like to select your own categories to analyze? (Y/N): ")

if user_choice.lower() == 'y':
    field1 = input("Enter Field 1: ") # Would be nice to incorporate some input filtering/control here
    field2 = input("Enter Field 2: ")
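
One possible way to add the input control mentioned in the comment above is sketched here: keep asking until the answer matches a known column, ignoring case and stray spaces. `choose_field` and its `ask` parameter are hypothetical names; `ask` defaults to the built-in input and exists only so the function can be tried without a keyboard.

```python
def choose_field(prompt, valid_fields, ask=input):
    """Prompt until the user names one of valid_fields.

    Matching is case-insensitive; the properly-cased name is returned.
    """
    lookup = {name.lower(): name for name in valid_fields}
    while True:
        answer = ask(prompt).strip().lower()
        if answer in lookup:
            return lookup[answer]
        print(f"Please choose one of: {', '.join(valid_fields)}")
```

In the notebook, something like choose_field("Enter Field 1: ", list(data_columns)) would be one way to use it.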

Time to make things look nice for presentation. First, we'll get the unique artists (there should be 5), and use the length of that list to pick one color per artist from our "color palette".

color_palette has 5 colors we've defined as hex values, then we'll connect each artist to a color using the CategoricalColorMapper. 
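
Conceptually, the mapping is just "artist name in, color out". This sketch shows the same idea with a plain dict rather than Bokeh's mapper; `assign_colors` is a hypothetical helper, the artist names are placeholders, and cycling back through the palette when there are more artists than colors is an assumption of this sketch.

```python
from itertools import cycle


def assign_colors(artists, palette):
    """Pair each artist with a palette color, reusing colors if needed."""
    return dict(zip(artists, cycle(palette)))


colors = assign_colors(["Artist A", "Artist B", "Artist C"],
                       ["#FF0000", "#FF8000"])
print(colors["Artist C"])  # #FF0000
```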

Next, we create our scatterplot, using the fields we defined earlier as our X and Y axes and the color points we designated for each artist.

In [None]:
# Get unique artists
artists = df['Artist Name'].unique()

# Defining a color palette for each artist
color_palette = ['#FF0000', '#FF8000', '#000080', '#008000', '#00FFFF']

# Create a color mapper mapping each artist to a color
num_artists = len(artists)
palette = color_palette[:num_artists]
color_mapper = CategoricalColorMapper(factors=artists, palette=palette) # Building using CCM

# Create the scatterplot with colored points
scatterplot = figure()
scatterplot.circle(x=field1, y=field2, source=df, fill_color={'field': 'Artist Name', 'transform': color_mapper}, size=6, legend_field='Artist Name')

Alright, last step! I'll openly admit that this was the most challenging section for me - this took a ton of research from StackOverflow, GitHub Copilot, and ChatGPT.

HoverTool is used in our plot to display Artist Name and Track Name when we hover over a point on our scatterplot. Next, we define the labels for our plot with our chosen fields.

Then, we have to parse out the html (remember those pesky embed codes?) from our dataframe - so really just the last column - and display each of those on the page.

Finally - let's plot our graph using bokeh!

In [None]:
# Customize hover tooltips
hover = HoverTool(tooltips=[("Artist", "@{Artist Name}"), ("Track", "@{Track Name}")])
scatterplot.add_tools(hover)

# Set plot attributes
scatterplot.xaxis.axis_label = field1
scatterplot.yaxis.axis_label = field2

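# Get the HTML data from the last field for each record
# (empty embed codes are read back from the csv as NaN, so swap them for '' before joining)
html_data = df.iloc[:, -1].fillna('')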

# Create a Div element to display the HTML data     
html_content = Div(text='', width=800, height=200)

# Concatenate all the HTML data and update the Div element
html_content.text = '\n\n'.join(html_data)

# Create a layout with the scatterplot and the Div element
layout = column(scatterplot, html_content)

# Show the plot within the Jupyter Notebook
from bokeh.io import output_notebook, show
output_notebook()
show(layout)