# Spotify Association Rule Mining

## Data Preprocessing

First, all required modules and functions are imported:

In [1]:
from spotify_association_rules import (
    load_data, bin_continous_attributes, convert_categorial_attributes,
    bin_tracks_with_most_imp_words, df_to_lists, mine_association_rules,
    print_rules, transform_item_list_to_binary,
)
from time import time

Next, the data is loaded into a DataFrame:

In [7]:
df=load_data(n=1000)
df.head()

Unnamed: 0,track,artist,uri,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,chorus_hit,sections,decade,hit
29385,Redshift,Soul Extract,spotify:track:6zqI70p7pulTXdha8A3GcR,0.531,0.876,4,-6.381,0,0.0948,0.000313,0.0,0.0809,0.306,148.026,248010,4,19.92028,10,10s,0
18970,Mini-Skirt Minnie,Wilson Pickett,spotify:track:7p7kHvFpphFrlZvgKUhclw,0.68,0.648,9,-9.482,1,0.0749,0.193,0.000124,0.0419,0.827,97.7,179200,4,33.01747,8,60s,1
8480,Airship,Nobuo Uematsu,spotify:track:3Dh0ceEo8pIuzS7xTCm8KW,0.282,0.337,10,-13.761,1,0.0457,0.598,0.853,0.17,0.945,180.508,49933,4,21.10148,5,80s,0
11435,Don't Walk Away,Toni Childs,spotify:track:4R9rKeBJjQSMiKIXacPPEI,0.66,0.814,11,-10.808,1,0.0345,0.065,1.2e-05,0.119,0.722,115.134,238973,4,29.60502,12,80s,1
27330,Post To Be,Omarion Featuring Chris Brown & Jhene Aiko,spotify:track:0fgZUSa7D7aVvv3GfO0A1n,0.733,0.676,10,-5.655,0,0.0432,0.0697,0.0,0.208,0.701,97.448,226581,4,18.36223,13,10s,1


The data contains many continuous attributes. The Apriori algorithm operates on a finite set of discrete items, so each continuous attribute is binned into two intervals (`n_bins=2`). Categorical attributes are only converted into a more readable form. The `uri` attribute is dropped because track URIs carry no useful information. The `artist` attribute has many distinct values that are hard to bin and would require a lot of computation, so it is omitted. The `track` attribute is analyzed further using NLP.

In [8]:
cont_attributes=['danceability','energy','loudness',
                'speechiness','acousticness','instrumentalness','valence',
                'liveness','tempo','duration_ms','chorus_hit','sections']
    
cat_attributes=['key','mode','time_signature','decade','hit']
    
df=bin_continous_attributes(df, cont_attributes)
df=convert_categorial_attributes(df, cat_attributes)
ser_track=df['track']
df=df.drop(['uri','track','artist'], axis='columns')
df.head()    


Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,chorus_hit,sections,decade,hit
29385,"danceability (0.515, 0.963)","energy (0.501, 0.999)",key (4),"loudness (-19.398, 2.291)",mode (0),"speechiness (0.021, 0.483)","acousticness (-0.001, 0.498)","instrumentalness (-0.001, 0.496)","liveness (0.021, 0.502)","valence (0.021, 0.506)","tempo (138.346, 217.396)","duration_ms (17763.906, 904080.0)",time_signature (4),"chorus_hit (-0.174, 86.785)","sections (1.928, 38.0)",decade (10s),hit (0)
18970,"danceability (0.515, 0.963)","energy (0.501, 0.999)",key (9),"loudness (-19.398, 2.291)",mode (1),"speechiness (0.021, 0.483)","acousticness (-0.001, 0.498)","instrumentalness (-0.001, 0.496)","liveness (0.021, 0.502)","valence (0.506, 0.99)","tempo (59.138, 138.346)","duration_ms (17763.906, 904080.0)",time_signature (4),"chorus_hit (-0.174, 86.785)","sections (1.928, 38.0)",decade (60s),hit (1)
8480,"danceability (0.066, 0.515)","energy (0.003, 0.501)",key (10),"loudness (-19.398, 2.291)",mode (1),"speechiness (0.021, 0.483)","acousticness (0.498, 0.995)","instrumentalness (0.496, 0.992)","liveness (0.021, 0.502)","valence (0.506, 0.99)","tempo (138.346, 217.396)","duration_ms (17763.906, 904080.0)",time_signature (4),"chorus_hit (-0.174, 86.785)","sections (1.928, 38.0)",decade (80s),hit (0)
11435,"danceability (0.515, 0.963)","energy (0.501, 0.999)",key (11),"loudness (-19.398, 2.291)",mode (1),"speechiness (0.021, 0.483)","acousticness (-0.001, 0.498)","instrumentalness (-0.001, 0.496)","liveness (0.021, 0.502)","valence (0.506, 0.99)","tempo (59.138, 138.346)","duration_ms (17763.906, 904080.0)",time_signature (4),"chorus_hit (-0.174, 86.785)","sections (1.928, 38.0)",decade (80s),hit (1)
27330,"danceability (0.515, 0.963)","energy (0.501, 0.999)",key (10),"loudness (-19.398, 2.291)",mode (0),"speechiness (0.021, 0.483)","acousticness (-0.001, 0.498)","instrumentalness (-0.001, 0.496)","liveness (0.021, 0.502)","valence (0.506, 0.99)","tempo (59.138, 138.346)","duration_ms (17763.906, 904080.0)",time_signature (4),"chorus_hit (-0.174, 86.785)","sections (1.928, 38.0)",decade (10s),hit (1)
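
The binning described above can be illustrated with a small, dependency-free sketch. It is not the code behind `bin_continous_attributes`, which is not shown in this notebook. The helper name `bin_equal_width`, the right-closed intervals, and the rounding to three decimals are assumptions chosen to mimic the item labels in the output above.

```python
import math


def bin_equal_width(values, name, n_bins=2):
    """Map numeric values to item labels such as 'energy (0.0, 0.5)'.

    Hypothetical helper: splits the observed range into n_bins intervals of
    equal width and treats each interval as right-closed, i.e. (left, right].
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: a single degenerate bin
        else:
            # Right-closed bins: a value on an inner edge belongs to the lower bin.
            idx = math.ceil((v - lo) / width) - 1
            idx = max(0, min(idx, n_bins - 1))  # the minimum goes into bin 0
        left = round(lo + idx * width, 3)
        right = round(lo + (idx + 1) * width, 3)
        labels.append(f"{name} ({left}, {right})")
    return labels


# The five energy values shown by df.head() in the first output table.
energy = [0.876, 0.648, 0.337, 0.814, 0.676]
print(bin_equal_width(energy, "energy"))
```

Unlike the labels in the notebook output, this sketch does not extend the lowest edge slightly below the minimum, so its boundaries will not match the table exactly.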


The function `bin_tracks_with_most_imp_words` analyzes the `track` attribute using NLP: it computes the most relevant words across all track titles with tf-idf and assigns each of those words to every sample whose title contains it. Because a sample can contain anywhere from 0 to `n_words` of these words, `df_to_lists` converts each sample, together with all of its other attributes, into a list of items, which yields a list of lists. The Apriori implementation in `mlxtend` only accepts a binary (one-hot encoded) item table, so `transform_item_list_to_binary` converts `df_lists` into that form.
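
To make the tf-idf step more concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not the implementation of `bin_tracks_with_most_imp_words`: the function names `top_tfidf_words` and `words_per_title`, the tiny stop-word list, the tokenisation, and the choice to rank words by their summed tf-idf score are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list for the illustration.
STOP_WORDS = frozenset({"the", "a", "of", "to", "in", "and", "me", "you", "on"})


def tokenize(title):
    """Lowercase a title and split it into words, dropping stop words."""
    return [w for w in re.findall(r"[a-z']+", title.lower()) if w not in STOP_WORDS]


def top_tfidf_words(titles, n_words=10):
    """Return the n_words words with the highest tf-idf score summed over all titles."""
    docs = [tokenize(t) for t in titles]
    n_docs = len(docs)
    # Document frequency: in how many titles does each word occur?
    doc_freq = Counter(w for doc in docs for w in set(doc))
    scores = Counter()
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            # "+ 1" keeps a nonzero weight for words that occur in every title.
            idf = math.log(n_docs / doc_freq[word]) + 1.0
            scores[word] += tf * idf
    return [word for word, _ in scores.most_common(n_words)]


def words_per_title(titles, vocabulary):
    """For each title, list the selected words it contains as 'word (...)' items."""
    vocab = set(vocabulary)
    return [[f"word ({w})" for w in sorted(set(tokenize(t)) & vocab)] for t in titles]


titles = ["Love Me", "Love You", "Rock On", "Love Rock"]
vocab = top_tfidf_words(titles, n_words=2)
print(vocab)
print(words_per_title(titles, vocab))
```

A title that contains none of the selected words maps to an empty list, which is why the samples end up with a variable number of items.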

In [9]:
list_tracks=bin_tracks_with_most_imp_words(ser_track,n_words=10)
df_lists=df_to_lists(df, list_tracks)
df_binary=transform_item_list_to_binary(df_lists)
df_binary.head()

Unnamed: 0,"acousticness (-0.001, 0.498)","acousticness (0.498, 0.995)","chorus_hit (-0.174, 86.785)","chorus_hit (86.785, 173.569)","danceability (0.066, 0.515)","danceability (0.515, 0.963)",decade (00s),decade (10s),decade (60s),decade (70s),...,word (got),word (let),word (live),word (love),word (metal),word (need),word (never),word (rock),word (time),word (version)
0,True,False,True,False,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,True,False,False,True,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
2,False,True,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,False,True,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,True,False,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
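
The conversion from item lists to the binary table shown above can be sketched in a few lines of plain Python. The helper `to_binary_rows` is hypothetical and only illustrates the idea; `mlxtend` provides `TransactionEncoder` for the same purpose, but whether `transform_item_list_to_binary` uses it is not shown here.

```python
def to_binary_rows(item_lists):
    """One-hot encode lists of items.

    Returns the sorted column names and, per sample, a dict that maps every
    column to True if the sample contains that item and to False otherwise.
    """
    columns = sorted({item for items in item_lists for item in items})
    rows = []
    for items in item_lists:
        present = set(items)
        rows.append({col: col in present for col in columns})
    return columns, rows


# Two toy samples; the second one also carries a word item from its title.
samples = [
    ["hit (1)", "mode (1)"],
    ["hit (0)", "mode (1)", "word (love)"],
]
columns, rows = to_binary_rows(samples)
print(columns)
for row in rows:
    print(row)
```

Every distinct item becomes one column, so samples with different numbers of items still produce rows of equal width.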


## Association Rule Mining

The function `mine_association_rules` extracts the association rules. In this example, only rules with `hit (1)` as the consequent are kept. Since `hit (1)` is the label for hit songs, the mined rules reveal patterns that characterise hit songs.
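
Each mined rule is reported with three metrics. For a rule "antecedent → consequent", support is the fraction of samples that contain both sides, confidence is the fraction of samples containing the antecedent that also contain the consequent, and lift is the confidence divided by the support of the consequent alone. The sketch below computes these standard definitions on toy data; the function name `rule_metrics` and the item labels are made up for illustration and are not part of `spotify_association_rules`.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    both = antecedent | consequent
    n = len(transactions)
    n_antecedent = sum(antecedent <= set(t) for t in transactions)
    n_consequent = sum(consequent <= set(t) for t in transactions)
    n_both = sum(both <= set(t) for t in transactions)
    if n_antecedent == 0 or n_consequent == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n)
    return support, confidence, lift


# Toy data: three high-energy songs, two of which are hits.
songs = [
    {"energy (high)", "hit (1)"},
    {"energy (high)", "hit (1)"},
    {"energy (high)", "hit (0)"},
    {"energy (low)", "hit (0)"},
]
support, confidence, lift = rule_metrics(songs, {"energy (high)"}, {"hit (1)"})
print(f"Support: {support:.3f}  Confidence: {confidence:.3f}  Lift: {lift:.3f}")
```

A lift above 1 means the antecedent occurs together with `hit (1)` more often than it would if the two were independent. Because lift equals confidence divided by the consequent's support, a rule with confidence 0.81 and lift 1.613, as reported in the output further down, implies that roughly half of the sampled songs carry the label `hit (1)` (0.81 / 1.613 ≈ 0.50).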

In [11]:
# Mining association rules is computationally expensive and can take 1-3 min
start=time()
df_ar_hits=mine_association_rules(df_binary)
stop=time()
dur=stop-start
print('Duration of computation: {}'.format(dur))
print()
    
print('Association rules for hit songs:')
print()
print_rules(df_ar_hits)
print()

Processing 14 combinations | Sampling itemset size 14320
Finished Apriori algorithm
Inferred association rules
Duration of computation: 42.26316809654236

Association rules for hit songs:

Association rules 0:
    time_signature (4)
    acousticness (-0.001, 0.498)
    danceability (0.515, 0.963)
    tempo (59.138, 138.346)
    instrumentalness (-0.001, 0.496)
    sections (1.928, 38.0)
    liveness (0.021, 0.502)
    energy (0.501, 0.999)

    Support: 0.217
    Confidence: 0.81
    Lift: 1.613

Association rules 1:
    time_signature (4)
    acousticness (-0.001, 0.498)
    danceability (0.515, 0.963)
    tempo (59.138, 138.346)
    loudness (-19.398, 2.291)
    instrumentalness (-0.001, 0.496)
    sections (1.928, 38.0)
    speechiness (0.021, 0.483)
    liveness (0.021, 0.502)
    energy (0.501, 0.999)

    Support: 0.217
    Confidence: 0.81
    Lift: 1.613

Association rules 2:
    time_signature (4)
    acousticness (-0.001, 0.498)
    danceability (0.515, 0.963)
    tempo (59.1