In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
from getSpotifyData import GetSpotifyData

This notebook pulls the top playlists for several genres, looks at the top tracks to see how well they represent each genre, and then explores the distributions of the track features for these genres. We will look at 'deep house', 'yacht rock', and 'shoegaze', since they are quite different from one another and vary in specificity. To be clear, Yacht Rock is a very niche genre made up of only a few prominent rock bands from a brief time period. Shoegaze has a broader meaning and has lots of sister genres that sound pretty similar, like dream pop, but the general sound is very easy to identify. Deep House is a very unspecific genre, and the name gets applied to many different artists from many different decades. Some say Larry Heard is Deep House, some say Tchami is Deep House, and some say that any house track with an FM bass that you might hear in Ibiza is Deep House. Let's see how well this code grasps the essence of each of these genres.

In [None]:
s = GetSpotifyData('credentials.json')
s.authenticate()

Let's use the class I've written to get the top 100 playlists for each search of these genres, each track from those playlists, and the track features for each track.

In [None]:

dh = s.get_tracks_for_search_term('deep house', 100)
sg = s.get_tracks_for_search_term('shoegaze', 100)
yr = s.get_tracks_for_search_term('yacht rock', 100)

Let's save all of these to CSV files so that we can access them more easily in the future.

In [None]:
dh.to_csv('deep_house.csv',index=False)
sg.to_csv('shoegaze.csv',index=False)
yr.to_csv('yacht_rock.csv',index=False)

In [None]:
# reload from the saved CSVs instead of querying Spotify again
dh = pd.read_csv('deep_house.csv')
sg = pd.read_csv('shoegaze.csv')
yr = pd.read_csv('yacht_rock.csv')
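
The save-then-reload pattern above can be folded into one helper, so that Spotify is only queried when no CSV exists yet. This is a minimal sketch, not part of the GetSpotifyData class: the name `load_or_fetch` and its `fetch` callable argument are hypothetical and introduced only for illustration.

```python
from pathlib import Path

import pandas as pd


def load_or_fetch(csv_path, fetch):
    """Read csv_path if it exists; otherwise call fetch(), save the result, and return it.

    `fetch` is any zero-argument callable that returns a DataFrame (hypothetical helper).
    """
    path = Path(csv_path)
    if path.exists():
        # cached copy found, so skip the slow API calls
        return pd.read_csv(path)
    df = fetch()
    df.to_csv(path, index=False)
    return df
```

With the objects in this notebook, a call might look like `dh = load_or_fetch('deep_house.csv', lambda: s.get_tracks_for_search_term('deep house', 100))`. Keep in mind that a frame read back from CSV can have different dtypes than the freshly fetched one.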

In [None]:
print('Deep House shape:', dh.shape)
print('Shoegaze shape:', sg.shape)
print('Yacht Rock shape:', yr.shape)

Let's get the association rules for each genre. Here, we look at all the playlists and see how likely certain tracks are to appear in the same playlist together. We can find out things like "given track A is in this playlist, how likely is it that track B is in this playlist?" We can also see the frequency of each individual track and find the most popular tracks for the genre this way. Let's do that for each of the genres now.

The min_sup parameter of the get_track_associations method sets the minimum support, or frequency, that a track needs in order to be considered significant enough to be part of the association rules mining. If you set it too low, there will be too many relationships to analyze and the package won't be able to handle it, but if you set it too high, there won't be enough tracks to get any meaningful relationships. I would recommend starting with a value of 0.1 and adjusting until you have an output size you deem appropriate (the cells below ended up using 0.04 and 0.08).
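
To make support and confidence concrete, here is a minimal sketch in plain Python. It is not the implementation behind get_track_associations, which is not shown in this notebook. The function name `support_and_confidence`, the list-of-lists playlist format, and the placeholder track names are all assumptions for illustration, and it only covers rules between pairs of tracks.

```python
from collections import Counter
from itertools import combinations


def support_and_confidence(playlists, min_sup=0.1):
    """Return ({track: support}, {(a, b): confidence of a -> b}) for frequent tracks."""
    n = len(playlists)
    baskets = [set(p) for p in playlists]  # ignore repeats within a playlist

    # support(track) = fraction of playlists that contain the track
    track_counts = Counter(t for b in baskets for t in b)
    support = {t: c / n for t, c in track_counts.items() if c / n >= min_sup}

    # count co-occurrences only among tracks that pass the support threshold
    pair_counts = Counter()
    for b in baskets:
        frequent = sorted(t for t in b if t in support)
        pair_counts.update(combinations(frequent, 2))

    # confidence(a -> b) = share of playlists containing a that also contain b
    confidence = {}
    for (a, b), c in pair_counts.items():
        confidence[(a, b)] = c / track_counts[a]
        confidence[(b, a)] = c / track_counts[b]
    return support, confidence


# toy playlists with placeholder track names, not real data
playlists = [["A", "B", "C"], ["A", "B"], ["A", "D"], ["B", "C"]]
support, confidence = support_and_confidence(playlists, min_sup=0.5)
```

In this toy example, track D appears in only one of four playlists, so a min_sup of 0.5 drops it before any pairs are counted. That is the trade-off described above: a lower threshold keeps more tracks but makes the number of candidate pairs grow quickly.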

In [None]:
dh_rules, dh_top_tracks = s.get_track_associations(dh, min_sup=0.04)

In [None]:
def print_tracks(df):
    # iterate over rows directly so this works for any index, not just 0..n-1
    for _, row in df.iterrows():
        print(f"{row['track_name']} - {row['artist_names']}")

print_tracks(dh_top_tracks)

Reading through the Deep House tracks, none jump out as being completely out of place. There are some that I would consider more Tech House than Deep House, but they are definitely still close. Overall I get the sense that this is mostly the "anything with an FM bass" kind of Deep House, given the large number of covers/remixes by Alok/R3HAB and lots of MEDUZA/Regard/RAYE/MNEK/Imanbek. These are definitely the most popular types of "Deep House" songs, especially the type to be found in a random playlist. I wonder if we would get more traditional Deep House if we were to use less popular tracks or add more words to our search term.

In [None]:
sg_rules, sg_top_tracks = s.get_track_associations(sg, min_sup=0.04)
print(sg_top_tracks.shape)

In [None]:
print_tracks(sg_top_tracks)

These results look pretty great for the most part. Cocteau Twins, my bloody valentine, Slowdive, and Ride are probably the four biggest bands of the genre, so it is good that they all appear a lot. Other less prominent artists like Lush, Mazzy Star, The Jesus and Mary Chain, and Beach House also show up a lot. There is lots of Japanese music at the end, which was very surprising to me. Going to have to look into some of these songs!

ダブル・プラトニツク・スウイサイド - 溶けない名前 <- this one is pretty cool! Definitely shoe-gazey as well: a little more upbeat and energetic than is normal, but still dreamy and distorted.

Virgin Suicide - 宇宙ネコ子 <- this one is good too, but not as interesting as the first. Still definitely shoe-gazey!



In [None]:
yr_rules, yr_top_tracks = s.get_track_associations(yr, min_sup=0.08)
print(yr_top_tracks.shape)

In [None]:
print_tracks(yr_top_tracks)

These also look like pretty great results! Lots of Toto, Kenny Loggins, Hall and Oates, Doobie Brothers, America, the Eagles, and of course Steely Dan. It looks like this is a pretty good balance between the die-hard Yacht Rock archetypes and the more radio-friendly side of the genre!

Let's look at some visualizations of the feature distributions for the different genres.

In [None]:
inspect_cols = [
    'tempo',
    'popularity',
    'danceability', 
    'energy',
    'loudness',
    'speechiness', 
    'acousticness', 
    'instrumentalness', 
    'liveness',
    'valence'
]

def plot_genre_features_distributions(df, columns, color=None):
    # keep the color column in the data, but only plot the feature columns
    keep_cols = columns if color is None else columns + [color]
    # remove duplicate occurrences of tracks across playlists
    df = df[keep_cols].drop_duplicates()

    for col in columns:
        fig = px.histogram(df, x=col, marginal='box', color=color, histnorm='probability density')
        fig.show()

In [None]:
plot_genre_features_distributions(dh, inspect_cols)

In [None]:
plot_genre_features_distributions(sg,inspect_cols)

In [None]:
plot_genre_features_distributions(yr,inspect_cols)

Now let's overlay them all on top of each other so we can compare more directly.

In [None]:
# first join all data together
dh['genre'] = 'Deep House'
sg['genre'] = 'Shoegaze'
yr['genre'] = 'Yacht Rock'

# filter down to just columns we care about and remove duplicates
dh_sample = dh[inspect_cols+['genre']].drop_duplicates()
sg_sample = sg[inspect_cols+['genre']].drop_duplicates()
yr_sample = yr[inspect_cols+['genre']].drop_duplicates()

# take random sample of each genre so that they all have same number of tracks
sample_size = min(dh_sample.shape[0],sg_sample.shape[0],yr_sample.shape[0])
print('sample size:',sample_size)
dh_sample = dh_sample.sample(sample_size)
sg_sample = sg_sample.sample(sample_size)
yr_sample = yr_sample.sample(sample_size)


all_genres = pd.concat([dh_sample,sg_sample,yr_sample],axis=0)
print(all_genres.shape)




In [18]:
plot_genre_features_distributions(all_genres,inspect_cols,'genre')

Some interesting results:

- Tempo for Deep House has much less variance than for the other genres
- Danceability is unsurprisingly highest for Deep House, but Yacht Rock is not far behind
- Surprised that Yacht Rock is lower energy than Shoegaze
- Not surprised that Yacht Rock is the happiest genre (highest valence), but surprised that Deep House and Shoegaze are so close
- Surprised that Deep House is not more popular. Interesting that the range of popularity in Yacht Rock is so wide, but it makes sense given that it has top hits of the late 70s and early 80s as well as some lesser-known one-hit wonders.
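
Observations like these are read off the histograms by eye, so a small numeric summary can help back them up. The sketch below assumes a frame with a `genre` column and numeric feature columns, like `all_genres` above. The helper name `summarize_by_genre` is hypothetical, and the toy values are made up purely to show the output shape, not taken from the Spotify data.

```python
import pandas as pd


def summarize_by_genre(df, columns, by="genre"):
    """Median and standard deviation of each feature column, per genre."""
    return df.groupby(by)[columns].agg(["median", "std"])


# With the notebook's data this would be: summarize_by_genre(all_genres, inspect_cols)
# The toy frame below uses made-up numbers only to illustrate the result layout.
toy = pd.DataFrame({
    "genre": ["Deep House", "Deep House", "Shoegaze", "Shoegaze"],
    "tempo": [124.0, 126.0, 90.0, 150.0],
})
summary = summarize_by_genre(toy, ["tempo"])
print(summary)
```

The result has one row per genre and a (feature, statistic) column pair for each feature, so a claim such as "tempo has less variance for Deep House" can be checked by comparing the `std` values directly.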


TODO:

- Look at the association rules and build a method that will build playlists with the most popular tracks of a genre
- Build a method that will build a playlist from seed tracks by selecting the tracks most correlated with them, to give a more customized playlist
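
As a starting point for the seed-track idea, here is a rough sketch that ranks candidate tracks by a simple co-occurrence count with the seeds. This is a stand-in for "most correlated", not a true correlation and not the association rules themselves. The function name `recommend_from_seeds`, the list-of-lists playlist format, and the placeholder track names are assumptions for illustration.

```python
from collections import Counter


def recommend_from_seeds(playlists, seeds, n=5):
    """Rank non-seed tracks by how often they share a playlist with the seed tracks."""
    seeds = set(seeds)
    scores = Counter()
    for playlist in playlists:
        tracks = set(playlist)
        overlap = len(tracks & seeds)  # how many seeds this playlist contains
        if overlap == 0:
            continue
        # every non-seed track in the playlist gets credit for each seed it sits next to
        for t in tracks - seeds:
            scores[t] += overlap
    return [t for t, _ in scores.most_common(n)]


# toy playlists with placeholder track names, not real data
playlists = [["A", "B", "C"], ["A", "B"], ["A", "D"], ["B", "C"]]
top = recommend_from_seeds(playlists, ["A"], n=1)
```

Raw counts favor tracks that are popular everywhere, so a real version would probably normalize by each track's overall frequency or use the confidence values from the mined rules instead.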