# AcousticBrainz Feature Analysis

The goal of this project is to generate graphs that show how musical features relate to the genres of the files present in four datasets:
- the allmusic dataset
- the discogs dataset
- the lastfm dataset
- the tagtraum dataset

The graphs are computed from the features and the genres of those files. To keep things simple, only the N subgenres with the most songs are considered, so these are computed first.

### Importing externals and constants

In [1]:
from utilities.constants import *
from utilities.data_management import *
from utilities.file_management import *
from utilities.plot_utils import *

print("ROOT_DIR: ",ROOT_DIR)
print("IMAGE_FOLDER: ",IMAGE_FOLDER)
print("DATA_FOLDER: ",DATA_FOLDER)

ROOT_DIR:  /notebooks
IMAGE_FOLDER:  /notebooks/Output Plots
DATA_FOLDER:  /notebooks/Data Files


### Getting the Ids for all datasets

The datasets must be in tsv format ('\t' delimiter) and should contain the name of each file in the first column. The first element of that column is ignored, as it should be the column name. If the file to be read is not a tsv file, give the delimiter as a kwarg (delimiter = ',' for instance).

In [2]:
allmusic_ids = get_ids("acousticbrainz-mediaeval2017-allmusic-train.tsv")
discogs_ids = get_ids("acousticbrainz-mediaeval2017-discogs-train.tsv")
lastfm_ids = get_ids("acousticbrainz-mediaeval2017-lastfm-train.tsv")
tagtraum_ids = get_ids("acousticbrainz-mediaeval2017-tagtraum-train.tsv")
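
As an illustration of the file layout described above, here is a minimal sketch of reading the first column of a delimited file while skipping its header row. The function `read_first_column_ids` and the sample data are hypothetical stand-ins, not the actual `get_ids` implementation from `utilities`, which is not shown in this notebook.

```python
import csv
import io


def read_first_column_ids(handle, delimiter="\t"):
    """Return the first-column values of a delimited file, skipping the header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # discard the header row (the name of the column)
    return [row[0] for row in reader if row]


# Tiny in-memory stand-in for a dataset file (made-up ids and genres).
sample = io.StringIO("id\tgenre1\nid-a\trock\nid-b\tpop\n")
ids = read_first_column_ids(sample)
print(ids)
```

For a comma-separated file, the same hypothetical helper would be called with `delimiter=","`.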

### Computing the intersection of all the ids

With the ids in four different iterables, the intersection can be computed easily by converting the lists to sets.

In [3]:
intersection_ids = compute_instersection(allmusic_ids,discogs_ids,lastfm_ids,tagtraum_ids)
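
The set-based approach mentioned above could look roughly like the sketch below. `intersect_ids` is a hypothetical name used only for illustration; the helper actually called in the cell lives in `utilities` and may differ.

```python
def intersect_ids(*id_lists):
    """Return the set of ids that appear in every one of the given iterables."""
    if not id_lists:
        return set()
    # set.intersection accepts any number of iterables as arguments.
    return set(id_lists[0]).intersection(*id_lists[1:])


# Made-up ids, one list per dataset.
common = intersect_ids(["a", "b", "c"], ["b", "c"], ["c", "b", "d"])
print(sorted(common))
```

Returning a set also makes later set arithmetic, such as subtracting the intersection from a dataframe's index, straightforward.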


### Computing the list of sounds with the most frequent genres in the lastfm dataset

To do so, the dataset is first loaded, and only the sounds whose ids are in the intersection are kept.

In [4]:
lastfm_sounds = load_file("acousticbrainz-mediaeval2017-lastfm-train.tsv")

# Get the ids that are not in the intersection and drop them from the dataframe
diff = set(lastfm_sounds.index.tolist())-intersection_ids
lastfm_sounds = lastfm_sounds.drop(diff)

print("Rows of lastfm_sounds (number of sound files): ", lastfm_sounds.shape[0])

Rows of lastfm_sounds (number of sound files):  247716


Then the most frequent genres are obtained from that reduced dataset.

In [5]:
most_frequent = get_most_frequent(lastfm_sounds,N = 20)

With that list of genres, the dataset is reduced again by keeping only the sounds that have those genres.

In [6]:
lastfm_sounds = reduce_df(lastfm_sounds,most_frequent)
most_frequent_generes_lastfm_sounds = lastfm_sounds.index.tolist()
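
The two steps above (finding the N most frequent genres, then keeping only the sounds that carry them) can be sketched with plain dictionaries. This is only an illustration under the assumption that each sound id maps to a list of genre labels; `most_frequent_labels` and `keep_sounds_with_labels` are hypothetical names, and the real `get_most_frequent` and `reduce_df` operate on dataframes and may behave differently.

```python
from collections import Counter


def most_frequent_labels(labels_per_sound, n):
    """Return the n labels that are attached to the largest number of sounds."""
    counts = Counter()
    for labels in labels_per_sound.values():
        counts.update(set(labels))  # count each label at most once per sound
    return [label for label, _ in counts.most_common(n)]


def keep_sounds_with_labels(labels_per_sound, wanted):
    """Keep only the sounds that have at least one of the wanted labels."""
    wanted = set(wanted)
    return {sound_id: labels
            for sound_id, labels in labels_per_sound.items()
            if wanted.intersection(labels)}


# Made-up data: sound id -> genre labels.
sounds = {
    "id-a": ["rock", "jazz"],
    "id-b": ["rock"],
    "id-c": ["pop", "jazz"],
    "id-d": ["rock"],
    "id-e": ["pop"],
}
top = most_frequent_labels(sounds, n=1)
reduced = keep_sounds_with_labels(sounds, top)
print(top, sorted(reduced))
```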

### Computation of the reduced dataset containing the features to be plotted

In [7]:
selected_features = load_file("acousticbrainz-mediaeval2017-train-amplab2019-selected-features-mbid.csv",sep=',')
diff_features = set(selected_features.index.tolist())-set(most_frequent_generes_lastfm_sounds)
selected_features = selected_features.drop(diff_features)
print(selected_features.shape)
selected_features.sort_index().head()

(124185, 8)


| mbid | lowlevel.average_loudness | metadata.audio_properties.length | metadata.audio_properties.replay_gain | rhythm.bpm | rhythm.danceability | rhythm.onset_rate | tonal.key_key | tonal.key_scale |
|---|---|---|---|---|---|---|---|---|
| 00005a44-2152-4971-80c1-c217563845eb | 0.902541 | 333.348572 | -5.080051 | 128.858856 | 1.105704 | 2.669715 | D | minor |
| 00005ac4-210c-4914-89ba-6279ea881809 | 0.778497 | 274.756989 | 0.355402 | 142.368774 | 1.228754 | 3.26087 | A# | major |
| 00007960-9d81-4192-b548-ad33d6b0ca54 | 0.96831 | 191.440002 | -12.583757 | 115.908508 | 1.148735 | 3.327064 | A | minor |
| 0000d8a7-8a9b-4b9d-a95c-038c6cb66547 | 0.937835 | 291.186676 | -16.055473 | 90.996552 | 1.063905 | 3.094078 | D | major |
| 0000fb36-5ee0-44c5-9fa7-5c944d8a85ac | 0.903276 | 291.030212 | -16.016674 | 134.99791 | 1.110559 | 4.439092 | D | minor |


### Organizing the features in a dictionary of genres and plotting them

In [8]:
information = organize_features_in_genre_dict(selected_features, lastfm_sounds, most_frequent)

plot_all_features(information, selected_features, most_frequent)

Done!
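
To give an idea of what a "dictionary of genres" could look like, the sketch below groups the values of one feature by genre. It assumes a simple mapping from sound id to feature value and from sound id to genre labels; `group_feature_by_genre` is a hypothetical name, and the actual structure returned by `organize_features_in_genre_dict` is not shown in this notebook.

```python
from collections import defaultdict


def group_feature_by_genre(feature_values, genres_per_sound, wanted_genres):
    """Collect the feature values of each wanted genre into a list."""
    wanted = set(wanted_genres)
    grouped = defaultdict(list)
    for sound_id, value in feature_values.items():
        for genre in genres_per_sound.get(sound_id, []):
            if genre in wanted:
                grouped[genre].append(value)
    return dict(grouped)


# Made-up data: one feature (bpm) and the genres of each sound.
bpm = {"id-a": 120.0, "id-b": 90.0, "id-c": 140.0}
genres = {"id-a": ["rock"], "id-b": ["jazz", "rock"], "id-c": ["pop"]}
by_genre = group_feature_by_genre(bpm, genres, ["rock", "jazz"])
print(by_genre)
```

A structure like this makes it easy to draw one distribution per genre for each feature.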


### Additional features computation

This cell takes a long time to run, so the resulting dataframe is saved in the next cell in case it is needed again.

In [9]:
features_to_extract_from_json = ["aggressive", "happy", "sad", "party", "relaxed",
                                 "instrumental", "voice", "female", "male"]

labels_to_extract = ["mood_aggressive", "mood_happy", "mood_sad", "mood_party", "mood_relaxed",
                     "voice_instrumental", "voice_instrumental", "gender", "gender"]

highlevel_features = dataframe_from_json(features_to_extract_from_json, labels_to_extract, selected_features)

2559.237488000001
9/257
Expected time: 0h 42m 39s


KeyboardInterrupt: 

In [None]:
save_file(highlevel_features,"acousticbrainz-mediaeval-train-intersection-highlevel-selectedfeatures.tsv",
          sep='\t')
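
Since the extraction is slow, one possible pattern is to compute the result only when no saved copy exists. The sketch below uses `pickle` purely for illustration, whereas the notebook itself saves a tsv file through `save_file`; `load_or_compute` and `slow_extraction` are hypothetical names.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Load a cached result if present; otherwise compute it and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def slow_extraction():
    """Stand-in for the long-running extraction (returns made-up values)."""
    calls.append(1)
    return {"id-a": {"mood_happy": 0.8}}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "highlevel.pkl"
    first = load_or_compute(cache, slow_extraction)   # computes and saves
    second = load_or_compute(cache, slow_extraction)  # loads the saved copy

print(len(calls))
```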

### Organizing the features in a dictionary, as was done before

In [None]:
highlevel_features = load_file("acousticbrainz-mediaeval-train-intersection-highlevel-selectedfeatures.tsv")
information = organize_features_in_genre_dict(highlevel_features, lastfm_sounds, most_frequent)

### Plotting the high-level features

In [None]:
plot_all_features(information, highlevel_features, most_frequent)