# 02_transforming_feats: Transforming Audio Analysis Features for EDA Preparation

**Description**: Cleaning the audio features listing that I pulled from the Spotify API, and iterating through each of the audio analysis JSON files to create a feature set that will be analyzed extensively in the subsequent EDA process.

**Disclaimer:** Due to the size of the combined audio analysis files (~10GB), they are not available in this repository, so it is not possible to run every cell from start to finish.

## Table of Contents

1. [Section Breakdown](#1)
2. [Segments Breakdown](#2)

In [1]:
import csv
import json
import re

import numpy as np
import pandas as pd

from library import kc_counter, analysis_sorter, get_sec_ss, pt_grabber, pt_grabber_sgl, mk_sum_df

## Analyzing Audio Analysis Features

For a thorough overview of the information contained in an individual audio analysis file, please refer to this [article](https://www.youtube.com/redirect?q=http%3A%2F%2Fdocs.echonest.com.s3-website-us-east-1.amazonaws.com%2F_static%2FAnalyzeDocumentation.pdf&redir_token=Z7-grOcoBMsZQ3-_KTKY0DdMPlN8MTUzOTkxMTI3OUAxNTM5ODI0ODc5&event=video_description&v=goUzHd7cTuA). It may also be helpful to refer to this [presentation](https://www.youtube.com/watch?v=goUzHd7cTuA) by Spotify's Mark Koh.

There is much to be gleaned from the audio analysis files. However, it's not realistic from a compute standpoint to keep every piece of information for calculating similarity, so I've kept only a select portion of the features from each file.

In [41]:
with open('../data/audio_analysis/000xQL6tZNLJzIrtIgxqSl.json', 'r') as f:
    example = json.load(f)

##### Each of the Sections Present in an Audio Analysis File

In [42]:
example.keys()

dict_keys(['meta', 'track', 'bars', 'beats', 'tatums', 'sections', 'segments'])

<a name="1"></a>
## 1. Section Breakdown

From the Analyzer Documentation: *a set of section markers, in seconds. Sections are defined by large variations in rhythm or timbre, e.g. chorus, verse, bridge, guitar solo, etc. Each section contains its own descriptions of tempo, key, mode, time_signature, and loudness*  

Also interesting to note: Mark Koh characterized the carving up of segments as "not great", in that they're not particularly accurate for delineating the true sections of a song (e.g., verse 1, chorus, verse 2). Given this uncertainty, and the varying number of sections in a given song, I decided to forgo including individual section-level features in the recommender.

I still felt it was necessary to capture the information included within the sections, though, because the changes in key, tempo, and time signature are important in determining the overall shape of the song. Therefore, I've gathered the mean and variance of each feature across a song's sections to capture their general shape (with the exception of `key`, `key_confidence`, `mode`, and `mode_confidence`, as there's no ordinal value to a song's key or to whether it's major or minor).

In [43]:
for k, v in example['sections'][0].items():
    print(k)

start
duration
confidence
loudness
tempo
tempo_confidence
key
key_confidence
mode
mode_confidence
time_signature
time_signature_confidence


In [24]:
df = pd.DataFrame(example['sections'])

#### An Example Section

In [8]:
example['sections'][0]

{'start': 0.0,
 'duration': 19.11423,
 'confidence': 1.0,
 'loudness': -12.969,
 'tempo': 166.075,
 'tempo_confidence': 0.507,
 'key': 2,
 'key_confidence': 0.504,
 'mode': 1,
 'mode_confidence': 0.731,
 'time_signature': 4,
 'time_signature_confidence': 0.514}

### 1a. Counting Key Changes

Given that there's an estimated key for each section of a song, I decided to capture this movement by counting the number of times the key changed.

In [57]:
kc_list = kc_counter.kc_counter(analysis_list)
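The `kc_counter` helper lives in the project's `library` package and isn't shown in this notebook. As a rough illustration only, one plausible way to count key changes is to compare the estimated `key` of each section with the one before it. The function name `count_key_changes` and the sample sections below are hypothetical, not the author's implementation.

```python
def count_key_changes(sections):
    """Count how often the estimated key differs between consecutive sections.

    Assumption: a "key change" means section i has a different `key`
    than section i - 1. The real kc_counter may define it differently.
    """
    keys = [section["key"] for section in sections]
    # Pair each key with its successor and count the mismatches
    return sum(prev != curr for prev, curr in zip(keys, keys[1:]))


# Made-up sections: key 2 -> 2 -> 7 -> 2 contains two changes
sections = [{"key": 2}, {"key": 2}, {"key": 7}, {"key": 2}]
n_changes = count_key_changes(sections)
print(n_changes)
```

A song with zero or one section yields 0 under this definition, since there are no consecutive pairs to compare.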

##### Checking for `null` results

In [81]:
for entry in kc_list:
    for k, v in entry.items():
        if v == 'unable to record key changes':
            print(k, v)

21VDF2xzLl8P1vDVr0nuQY unable to record key changes


I'll need to drop this record.

#### Casting the Results to a Dataframe

In [74]:
# Each element of kc_list is a {song_id: count} dict, so flatten them into one mapping
kc_df = pd.Series({k: v for d in kc_list for k, v in d.items()})

In [76]:
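# Build the Series one song at a time; dtype=object because one entry holds an error string
kc_df = pd.Series(dtype=object)
for entry in kc_list:
    for k, v in entry.items():
        kc_df.loc[k] = v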

In [99]:
kc_df.head()

000xQL6tZNLJzIrtIgxqSl    2
001wUOgo8t9VElHl45bxzr    9
003eoIwxETJujVWmNFMoZy    5
003z5LtGJ2cdJARKIO9LgL    6
004S8bMhFQjnbuqvdh6W71    9
dtype: object

##### Dropping record with no key change value

In [83]:
kc_df.drop(labels='21VDF2xzLl8P1vDVr0nuQY', inplace=True)

#### Dropping non-songs from key change listing

In [84]:
non_songs = pd.read_csv('../data/non_songs.csv', index_col = 0)

In [86]:
non_songs.head(2)

Unnamed: 0,s_song_id,album_release_date,artist_id,artist_name,duration_ms,explicit,linked_album,song_title
163,2xfcxlx0QRbqUhpVidqmOU,2013-05-28,1xlkcCr7PNHw2dRG1Gm6YF,Ron White,322322.0,True,A Little Unprofessional,L.A. Beautiful/You're Beautiful/The Yellow Blur
708,5LMcncchvV1jYHMG4hviSN,1998-04-04,0NnoRcD3WkqC9aouHyE8YY,Trey Parker,153026.0,False,Cannibal! The Musical (Original Motion Picture...,Overture


In [89]:
kc_df.drop(labels=non_songs['s_song_id'], inplace=True)

In [90]:
kc_df.shape

(22909,)

I have slightly fewer works than what's present in my `song_df`. It's still enough to conduct a thorough analysis, however.

### 1b. Getting Mean and Variance of Several Features within Song Sections

Here's where I actually iterate through the audio analysis files and grab the mean and variance of the selected features.

In [177]:
mean_dicts, var_dicts = analysis_sorter.analysis_sorter(analysis_list)

Completed 5000 files
Completed 10000 files
Completed 15000 files
Completed 20000 files
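Since `analysis_sorter` isn't shown here, the following is only a sketch of how per-song section statistics could be computed with pandas. The feature list is copied from the summary output in this notebook; the function name `section_summary` and the two sample sections are made up. It assumes sample variance (`ddof=1`, the pandas default), which appears consistent with the `NaN` values that single-section songs produce later on.

```python
import pandas as pd

# Feature names taken from the summary output shown in this notebook
FEATURES = ["confidence", "duration", "loudness", "mode",
            "mode_confidence", "tempo", "tempo_confidence"]


def section_summary(sections):
    """Return (mean, variance) dicts computed over a song's sections."""
    df = pd.DataFrame(sections)[FEATURES]
    # DataFrame.var defaults to ddof=1, i.e. the sample variance
    return df.mean().to_dict(), df.var().to_dict()


# Two made-up sections, for illustration only
sections = [
    {"confidence": 1.0, "duration": 20.0, "loudness": -12.0, "mode": 1,
     "mode_confidence": 0.7, "tempo": 100.0, "tempo_confidence": 0.5},
    {"confidence": 0.5, "duration": 30.0, "loudness": -8.0, "mode": 0,
     "mode_confidence": 0.5, "tempo": 120.0, "tempo_confidence": 0.3},
]
means, variances = section_summary(sections)
```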


##### Verifying Mean and Variance Values

In [182]:
mean_dicts[0]

{'6k9L7kTBzjXY0GfazHYqCg': {'confidence': 0.6483,
  'duration': 20.578667000000003,
  'loudness': -9.3965,
  'mode': 0.8,
  'mode_confidence': 0.46769999999999995,
  'tempo': 142.13690000000003,
  'tempo_confidence': 0.2643}}

In [237]:
var_dicts[0]

{'6k9L7kTBzjXY0GfazHYqCg': {'confidence': 0.04827756666666666,
  'duration': 88.7212888207789,
  'loudness': 85.8691118333333,
  'mode': 0.17777777777777778,
  'mode_confidence': 0.03472067777777778,
  'tempo': 1.7909101000000023,
  'tempo_confidence': 0.007427788888888888}}

#### Combining Summary Stats Docs

In [184]:
#mean
with open('../data/section_mean_summary_5000.json', 'r') as f:
    section_mean_summary_5000 = json.load(f)
with open('../data/section_mean_summary_10000.json', 'r') as f:
    section_mean_summary_10000 = json.load(f)
with open('../data/section_mean_summary_15000.json', 'r') as f:
    section_mean_summary_15000 = json.load(f)
with open('../data/section_mean_summary_20000.json', 'r') as f:
    section_mean_summary_20000 = json.load(f)

#var
with open('../data/section_var_summary_5000.json', 'r') as f:
    section_var_summary_5000 = json.load(f)
with open('../data/section_var_summary_10000.json', 'r') as f:
    section_var_summary_10000 = json.load(f)
with open('../data/section_var_summary_15000.json', 'r') as f:
    section_var_summary_15000 = json.load(f)
with open('../data/section_var_summary_20000.json', 'r') as f:
    section_var_summary_20000 = json.load(f)

In [185]:
mean_dicts.extend(section_mean_summary_5000)
mean_dicts.extend(section_mean_summary_10000)
mean_dicts.extend(section_mean_summary_15000)
mean_dicts.extend(section_mean_summary_20000)

var_dicts.extend(section_var_summary_5000)
var_dicts.extend(section_var_summary_10000)
var_dicts.extend(section_var_summary_15000)
var_dicts.extend(section_var_summary_20000)
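The eight near-identical `open` calls and `extend` calls above could also be written as a loop. This is a minimal sketch, assuming each checkpoint file holds a JSON list of `{song_id: {...}}` dicts (which the `extend` calls suggest); the helper name `load_chunks` is hypothetical.

```python
import json
from pathlib import Path


def load_chunks(data_dir, stat, checkpoints=(5000, 10000, 15000, 20000)):
    """Concatenate the per-checkpoint summary lists for one statistic.

    `stat` is 'mean' or 'var', matching the file names used above,
    e.g. section_mean_summary_5000.json.
    """
    combined = []
    for n in checkpoints:
        path = Path(data_dir) / f"section_{stat}_summary_{n}.json"
        with open(path, "r") as f:
            combined.extend(json.load(f))
    return combined


# Usage mirroring the cells above (requires the files to exist):
# mean_dicts.extend(load_chunks('../data', 'mean'))
# var_dicts.extend(load_chunks('../data', 'var'))
```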

In [186]:
len(mean_dicts), len(var_dicts)

(23125, 23125)

In [188]:
for e in var_dicts[:2]:
    print(e.keys())

dict_keys(['6k9L7kTBzjXY0GfazHYqCg'])
dict_keys(['6kAS4yj3wHJXcLp93vr5aG'])


#### Tossing Section Means and Vars into DataFrame + `csv`

In [None]:
section_mean = get_sec_ss.get_sec_ss(mean_dicts)

In [None]:
section_var = get_sec_ss.get_sec_ss(var_dicts)

### 1c. How Many Entries did not have Summary Stats?

In [211]:
len(section_mean), len(section_var)

(23125, 23125)

In [281]:
section_var[section_var['confidence'] == 'Unable to calculate variance of section features']

Unnamed: 0,confidence,duration,loudness,mode,mode_confidence,tempo,tempo_confidence
21VDF2xzLl8P1vDVr0nuQY,Unable to calculate variance of section features,Unable to calculate variance of section features,Unable to calculate variance of section features,Unable to calculate variance of section features,Unable to calculate variance of section features,Unable to calculate variance of section features,Unable to calculate variance of section features


In [283]:
section_mean[section_mean['confidence'] == 'Unable to calculate mean of section features']

Unnamed: 0,confidence,duration,loudness,mode,mode_confidence,tempo,tempo_confidence
21VDF2xzLl8P1vDVr0nuQY,Unable to calculate mean of section features,Unable to calculate mean of section features,Unable to calculate mean of section features,Unable to calculate mean of section features,Unable to calculate mean of section features,Unable to calculate mean of section features,Unable to calculate mean of section features


Only for one title was I unable to calculate the mean and/or variance of section features. I'll go ahead and drop this record.

In [285]:
section_mean.drop(labels='21VDF2xzLl8P1vDVr0nuQY', inplace=True)
section_var.drop(labels='21VDF2xzLl8P1vDVr0nuQY', inplace=True)

#### Checking for Duplicate Values

In [267]:
section_mean[section_mean.duplicated(keep=False)]

Unnamed: 0,confidence,duration,loudness,mode,mode_confidence,tempo,tempo_confidence
6wLj4AQJiBuJl5uiY0hSe8,0.760429,34.6233,-16.9286,0.285714,0.553429,110.056,0.601714
2memjAKXTXCK1WsUsWGHe7,0.760429,34.6233,-16.9286,0.285714,0.553429,110.056,0.601714


In [266]:
section_var[section_var.duplicated(keep=False)]

Unnamed: 0,confidence,duration,loudness,mode,mode_confidence,tempo,tempo_confidence
6wLj4AQJiBuJl5uiY0hSe8,0.05692,336.221,184.478,0.238095,0.065341,0.00296895,0.00792824
2GJxRwFe8oLcbXgTw9P5of,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2ULmjTNKicNAC0HAyYa47y,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2memjAKXTXCK1WsUsWGHe7,0.05692,336.221,184.478,0.238095,0.065341,0.00296895,0.00792824
5Asz9rHr2rViBdl6pkXpoq,0.0,0.0,0.0,0.0,0.0,0.0,0.0


It appears as though the duplicated rows are simply identical summary statistics recorded under two different song IDs. Rather than drop them, I will keep the values and decide what to do after I've combined my datasets for EDA.
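When revisiting these duplicates later, it may help to see which song IDs share a row of statistics. A small sketch follows, using two of the IDs from the output above with a reduced set of columns; the third ID, its values, and the `song_id` axis name are made up for the example.

```python
import pandas as pd

stats = pd.DataFrame(
    {"confidence": [0.760429, 0.760429, 0.5],
     "tempo": [110.056, 110.056, 98.0]},
    index=["6wLj4AQJiBuJl5uiY0hSe8", "2memjAKXTXCK1WsUsWGHe7", "hypothetical_id"],
)

# keep=False marks every member of a duplicate group, not just the repeats
dupes = stats[stats.duplicated(keep=False)]

# Group on all statistic columns and collect the song IDs that share them
groups = (
    dupes.rename_axis("song_id")
    .reset_index()
    .groupby(list(stats.columns))["song_id"]
    .apply(list)
    .tolist()
)
```

Each inner list in `groups` holds the IDs that produced one identical row, which makes it easier to check whether they are the same recording released under two IDs.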

#### Checking for `null`s

In [230]:
section_var.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23125 entries, 6k9L7kTBzjXY0GfazHYqCg to 6k7e2cjr10EbQW5QnblOtY
Data columns (total 7 columns):
confidence          23122 non-null object
duration            23122 non-null object
loudness            23122 non-null object
mode                23122 non-null object
mode_confidence     23122 non-null object
tempo               23122 non-null object
tempo_confidence    23122 non-null object
dtypes: object(7)
memory usage: 1.4+ MB


In [232]:
section_mean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23125 entries, 6k9L7kTBzjXY0GfazHYqCg to 6k7e2cjr10EbQW5QnblOtY
Data columns (total 7 columns):
confidence          23125 non-null object
duration            23125 non-null object
loudness            23125 non-null object
mode                23125 non-null object
mode_confidence     23125 non-null object
tempo               23125 non-null object
tempo_confidence    23125 non-null object
dtypes: object(7)
memory usage: 2.0+ MB


It's very strange that I have null values in the variance df but not in the means df. I'll check the dicts that I converted to see if those values are null there as well.

In [240]:
for d in var_dicts:
    if '2ULmjTNKicNAC0HAyYa47y' in d:
        print(d)

{'2ULmjTNKicNAC0HAyYa47y': {'confidence': nan, 'duration': nan, 'loudness': nan, 'mode': nan, 'mode_confidence': nan, 'tempo': nan, 'tempo_confidence': nan}}


It's null there as well. I will go back, retrieve the original JSON files, and compute the variance of each value.

In [221]:
section_var.shape

(23125, 7)

In [254]:
section_var[section_var.isnull().any(axis=1)].head()

Unnamed: 0,confidence,duration,loudness,mode,mode_confidence,tempo,tempo_confidence
2GJxRwFe8oLcbXgTw9P5of,,,,,,,
2ULmjTNKicNAC0HAyYa47y,,,,,,,
5Asz9rHr2rViBdl6pkXpoq,,,,,,,


#### Dealing with Null Values

In [256]:
with open('../data/audio_analysis/2GJxRwFe8oLcbXgTw9P5of.json', 'r') as f:
    _2GJxRwFe8oLcbXgTw9P5of = json.load(f)
with open('../data/audio_analysis/2ULmjTNKicNAC0HAyYa47y.json', 'r') as f:
    _2ULmjTNKicNAC0HAyYa47y = json.load(f)
with open('../data/audio_analysis/5Asz9rHr2rViBdl6pkXpoq.json', 'r') as f:
    _5Asz9rHr2rViBdl6pkXpoq = json.load(f)

In [260]:
_5Asz9rHr2rViBdl6pkXpoq['sections']

[{'start': 0.0,
  'duration': 59.93333,
  'confidence': 1.0,
  'loudness': -17.937,
  'tempo': 80.584,
  'tempo_confidence': 0.115,
  'key': 2,
  'key_confidence': 0.285,
  'mode': 0,
  'mode_confidence': 0.487,
  'time_signature': 3,
  'time_signature_confidence': 0.128}]

It turns out that each of these titles has only one section, so the sample variance is undefined and comes back as `NaN`. I set the variance for these observations to 0 across all features.

In [263]:
section_var.loc['2GJxRwFe8oLcbXgTw9P5of'] = 0
section_var.loc['2ULmjTNKicNAC0HAyYa47y'] = 0
section_var.loc['5Asz9rHr2rViBdl6pkXpoq'] = 0
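To see why a single section produces nulls, here is a short check using two of the values from the one-section song shown above. It assumes the variance was computed as a sample variance (`ddof=1`), which is the pandas default; the population variance (`ddof=0`) of one observation is 0, which matches the fill value chosen here.

```python
import pandas as pd

# One section only, using two features from the example above
one_section = pd.DataFrame([{"loudness": -17.937, "tempo": 80.584}])

# Sample variance divides by n - 1 = 0, so pandas returns NaN
sample_var = one_section.var()

# Population variance divides by n = 1 and is exactly 0
population_var = one_section.var(ddof=0)

print(sample_var)
print(population_var)
```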

#### Output to `csv`

In [286]:
section_mean.to_csv('../data/spotify_section_means.csv')

In [287]:
section_var.to_csv('../data/spotify_section_var.csv')

### 1d. Merging Key Change listing with Spotify Features

Since the overall `key` of every song already appears in my `spotify_song_feat` table, I'll merge the key change counts into that table to cut down on extraneous lists.

In [95]:
song_feat = pd.read_csv('../data/spotify_song_feat.csv').set_index('id')
song_feat.head()

id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
62bOmKYxYg7dhrC6gH9vFn,0.61,0.926,8,-4.843,0,0.0479,0.031,0.0012,0.0821,0.861,172.638,200400,4
46n2EGFnPC3tzWCN1Aqe26,0.55,0.587,2,-6.279,1,0.0329,0.354,0.0,0.128,0.466,165.975,284760,4
2AW37v0bDyuOzGP3XnmFuA,0.636,0.873,0,-4.672,0,0.071,0.0407,1e-06,0.0372,0.908,165.071,192427,4
594M0rqYMOo8BhMGEdoi5C,0.686,0.915,7,-4.447,1,0.0364,0.0028,7e-06,0.233,0.796,110.054,211000,4
0Jc8qF1mUPo1A96HE9QxZz,0.706,0.861,11,-6.684,1,0.154,0.0341,0.0,0.127,0.923,119.946,238427,4


In [98]:
song_feat.drop(labels=non_songs['s_song_id'], inplace=True)

In [103]:
song_feat_kc = song_feat.merge(pd.DataFrame(kc_df), left_on='id', right_on=kc_df.index)

In [108]:
song_feat_kc.rename({0:'key_changes'}, axis=1, inplace=True)
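An alternative to the merge-then-rename steps above is to name the Series first and join on the index. This is only a sketch with made-up IDs and values; unlike the cells above, it keeps `id` as the index rather than a column. Note that if an `id` appears more than once in either table, any join will multiply rows, which may be one reason duplicates show up after the merge (a guess, not something verified here).

```python
import pandas as pd

# Toy stand-ins for song_feat and kc_df; IDs and values are invented
song_feat = pd.DataFrame(
    {"key": [8, 2], "tempo": [172.638, 165.975]},
    index=pd.Index(["song_a", "song_b"], name="id"),
)
key_changes = pd.Series({"song_a": 2, "song_c": 5}, name="key_changes")

# Inner join keeps only songs present in both the features and the counts
merged = song_feat.join(key_changes, how="inner")
print(merged)
```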

In [110]:
song_feat_kc.shape

(23661, 15)

##### Removing Duplicate Songs from `song_feat_kc`

In [115]:
song_feat_kc.drop_duplicates(inplace=True)

In [117]:
song_feat_kc.set_index('id', inplace=True)

#### Exporting Song Feats back to `csv`

In [118]:
song_feat_kc.to_csv('../data/song_feats.csv')

<a name="2"></a>
## 2. Segments

Each segment represents a uniform element of sound, typically under 1 second (e.g., in a piano piece, a segment could represent a single chord being played). Each segment is characterized by its perceptual onset and duration in seconds, loudness, pitch, and timbral content [(Mark Koh)](https://www.youtube.com/watch?v=goUzHd7cTuA&feature=youtu.be).

There is a variable number of segments in a given song, which makes them difficult to use directly for calculating similarity; the calculations should be time invariant. Therefore, I decided to take the mean and variance of each `Timbre` and `Pitch` value (please refer to the data dictionary for more information on each). I split `pitches` and `timbre` into separate matrices, since I need the mean and variance of each element within each of those arrays.

I did not store any other values from the audio segments, because I wasn't overly concerned with the length of each segment, nor with the loudness levels, since loudness is already covered by the `Sections` and the audio features.
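To make the idea concrete, here is a minimal sketch of collapsing a variable number of segments into fixed-length summaries. It assumes each segment carries 12-element `pitches` and `timbre` lists (matching the 12 `dim_` columns produced later) and uses sample variance (`ddof=1`). The function name `summarize_vectors` and the two toy segments are hypothetical; the actual `pt_grabber` may differ.

```python
import numpy as np


def summarize_vectors(segments, field):
    """Return per-dimension (mean, variance) for one segment field.

    Stacks the 12-element vectors into an (n_segments, 12) matrix and
    reduces over the segment axis, so the output length is always 12
    regardless of how many segments the song has.
    """
    matrix = np.array([seg[field] for seg in segments], dtype=float)
    return matrix.mean(axis=0), matrix.var(axis=0, ddof=1)


# Two toy segments with constant vectors, for illustration only
segments = [
    {"timbre": [1.0] * 12, "pitches": [0.0] * 12},
    {"timbre": [3.0] * 12, "pitches": [1.0] * 12},
]
timbre_mean, timbre_var = summarize_vectors(segments, "timbre")
pitch_mean, pitch_var = summarize_vectors(segments, "pitches")
```

With `ddof=1`, a song that has only one segment would yield `NaN` variances, so such cases would need handling similar to the single-section songs in Section 1.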

### 2a. Creating Functions to Grab Summary Pitch and Timbre Statistics for Each Song

#### Grabbing Pitch / Timbre Summary Stats

In [112]:
timbre_means, timbre_var, pitch_means, pitch_var = pt_grabber.pt_grabber(analysis_list)

grabbing 1001
grabbing 2001
grabbing 3001
grabbing 4001
grabbing 5001
grabbing 6001
grabbing 7001
grabbing 8001
grabbing 9001
grabbing 10001
grabbing 11001
grabbing 12001
grabbing 13001
grabbing 14001
grabbing 15001
grabbing 16001
grabbing 17001
grabbing 18001
grabbing 19001
grabbing 20001
grabbing 21001
grabbing 22001
grabbing 23001


#### Checking Lists for Lack of Summary Stats

In [119]:
count = 0
for e in pitch_var:
    if isinstance(e, str):
        count += 1
print(count)

0


### 2b. Tossing Lists into DataFrames + `csv`

In [139]:
list(timbre_means[0].keys())[0]

'000xQL6tZNLJzIrtIgxqSl'

In [135]:
list(timbre_means[0].values())[0]

49.242242014742054

In [None]:
tm_df = mk_sum_df.mk_sum_df(timbre_means)
tv_df = mk_sum_df.mk_sum_df(timbre_var)
pm_df = mk_sum_df.mk_sum_df(pitch_means)
pv_df = mk_sum_df.mk_sum_df(pitch_var)

#### Removing Duplicate Records

In [167]:
tm_df.shape, tv_df.shape, pm_df.shape, pv_df.shape

((23129, 12), (23129, 12), (23129, 12), (23129, 12))

In [168]:
tm_df.drop_duplicates(inplace = True)
tv_df.drop_duplicates(inplace = True)
pm_df.drop_duplicates(inplace = True)
pv_df.drop_duplicates(inplace = True)

In [169]:
tm_df.shape, tv_df.shape, pm_df.shape, pv_df.shape

((23124, 12), (23124, 12), (23124, 12), (23124, 12))

#### Checking for `null`s

In [273]:
# Use the frame's own mask; .any(axis=1) flags rows with at least one null
pv_df[pv_df.isnull().any(axis=1)]

Unnamed: 0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,dim_7,dim_8,dim_9,dim_10,dim_11,dim_12


#### Outputting to csv

In [173]:
tm_df.to_csv('../data/timbre_means.csv')
tv_df.to_csv('../data/timbre_var.csv')
pm_df.to_csv('../data/pitch_means.csv')
pv_df.to_csv('../data/pitch_var.csv')

#### Next notebook: 03_EDA