# Week 03/Week 04 Linear Regression and Logistic Regression Assignment

- Course: Z604 Music Data Mining 
- Instructor: Kahyun Choi 

Download this week's data files and helper scripts from GitHub

In [None]:
# If you get the error "fatal: destination path 'W04' already exists and is not an empty directory", uncomment the line below and run again
# !rm -fr W04/
!git clone https://github.com/music-data-mining/W04.git

In [None]:
# go to the directory of the week
%cd W04

# Setup

In [None]:
# Common imports
import os
import numpy as np
import pandas as pd
from scipy import stats

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

## Exploring Deezer MSD Mood Dataset II (Three N-dimensional MSD Features: timbre, chroma, and loudness)

### Load moodmsdfeatures.csv

In [None]:
data = pd.read_csv('moodmsdfeatures.csv')  # load data set

In [None]:
data.head() # Return the first 5 rows

Unnamed: 0,dzr_sng_id,MSD_sng_id,MSD_track_id,valence,arousal,artist_name,track_name,mode,tempo,loudness,quadrant
0,560270,SOTFUBR12A8C13E3EC,TRAYKVH128F42AC993,0.373325,-0.923151,Faithless,Mass Destruction,1,88.74,-5.509,4
1,560274,SOULDME12AB01887C6,TRATCMK12903CABAC8,0.373325,-0.923151,Faithless,Salva Mea,0,128.067,-9.0,4
2,623060,SOAYOFO12AF72A4B88,TRALJBT128F4266FD8,0.594359,-0.130347,Jennifer Lopez,Play,0,104.796,-1.81,4
3,916339,SOOOWIC12A6701C7E5,TRBGPJP128E078ED20,1.071901,0.84683,Aerosmith,Crazy,0,232.709,-4.43,1
4,916480,SOJCBAM12A6701FD04,TRAGFPP128E078F34C,0.032224,-0.512921,The Cardigans,Paralyzed,1,145.271,-6.966,4


In [None]:
data.info() # get a quick description of the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224 entries, 0 to 223
Data columns (total 11 columns):
dzr_sng_id      224 non-null int64
MSD_sng_id      224 non-null object
MSD_track_id    224 non-null object
valence         224 non-null float64
arousal         224 non-null float64
artist_name     224 non-null object
track_name      224 non-null object
mode            224 non-null int64
tempo           224 non-null float64
loudness        224 non-null float64
quadrant        224 non-null int64
dtypes: float64(4), int64(3), object(4)
memory usage: 19.4+ KB


### Load MSD timbre, chroma, and loudmax features and Calculate Their Means and Variances

In [None]:
import csv
import hdf5_getters

songreader = csv.DictReader(open('moodmsdfeatures.csv'))

timbre_mean = np.empty((0,12), dtype=float)
timbre_var = np.empty((0,12), dtype=float)
chroma_mean = np.empty((0,12), dtype=float)
chroma_var = np.empty((0,12), dtype=float)
loudmax_mean = np.empty((0,1), dtype=float)
loudmax_var = np.empty((0,1), dtype=float)

for song in songreader:
    trid = song['MSD_track_id']
    filename = 'deezer_MSD/' + trid + '.h5'
    h5 = hdf5_getters.open_h5_file_read(filename)
    print("filename: ",filename)
    timbre = hdf5_getters.get_segments_timbre(h5)
    print("timbre.shape: ",timbre.shape)
    timbre_mean = np.vstack((timbre_mean, np.mean(timbre, axis = 0)))
    print("timbre_mean.shape: ",timbre_mean.shape)
    timbre_var = np.vstack((timbre_var, np.var(timbre, axis = 0)))
    print("timbre_var.shape: ",timbre_var.shape)
    chroma = hdf5_getters.get_segments_pitches(h5)
    chroma_mean = np.vstack((chroma_mean, np.mean(chroma, axis = 0)))
    chroma_var = np.vstack((chroma_var, np.var(chroma, axis = 0)))
    loudmax = hdf5_getters.get_segments_loudness_max(h5)
    loudmax_mean = np.vstack((loudmax_mean, np.mean(loudmax, axis = 0)))
    loudmax_var = np.vstack((loudmax_var, np.var(loudmax, axis = 0)))    

filename:  deezer_MSD/TRAYKVH128F42AC993.h5
timbre.shape:  (966, 12)
timbre_mean.shape:  (1, 12)
timbre_var.shape:  (1, 12)
filename:  deezer_MSD/TRATCMK12903CABAC8.h5
timbre.shape:  (1084, 12)
timbre_mean.shape:  (2, 12)
timbre_var.shape:  (2, 12)
filename:  deezer_MSD/TRALJBT128F4266FD8.h5
timbre.shape:  (1077, 12)
timbre_mean.shape:  (3, 12)
timbre_var.shape:  (3, 12)
filename:  deezer_MSD/TRBGPJP128E078ED20.h5
timbre.shape:  (843, 12)
timbre_mean.shape:  (4, 12)
timbre_var.shape:  (4, 12)
filename:  deezer_MSD/TRAGFPP128E078F34C.h5
timbre.shape:  (927, 12)
timbre_mean.shape:  (5, 12)
timbre_var.shape:  (5, 12)
filename:  deezer_MSD/TRAXLPR128F428E466.h5
timbre.shape:  (732, 12)
timbre_mean.shape:  (6, 12)
timbre_var.shape:  (6, 12)
filename:  deezer_MSD/TRBBAHD128F428E0FE.h5
timbre.shape:  (900, 12)
timbre_mean.shape:  (7, 12)
timbre_var.shape:  (7, 12)
filename:  deezer_MSD/TRBHLDQ128F423EF10.h5
timbre.shape:  (559, 12)
timbre_mean.shape:  (8, 12)
timbre_var.shape:  (8, 12)

### Linear Regression using means and vars of timbre, chroma, and loudmax

In [None]:
from sklearn.linear_model import LinearRegression
X = np.hstack((timbre_mean,timbre_var,chroma_mean,chroma_var, loudmax_mean, loudmax_var))
y = data['valence'].values.reshape(-1,1)
reg = LinearRegression()
reg.fit(X, y)
y_pred = reg.predict(X)

from sklearn.metrics import mean_squared_error, r2_score
print('Coefficient of determination: {:.2f}'.format(r2_score(y, y_pred)))
print('Mean squared error: {:.2f}'.format(mean_squared_error(y, y_pred)))

Coefficient of determination: 0.30
Mean squared error: 0.78
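
Note that the scores above are computed on the same songs the model was fitted to. As a reminder of what the two numbers mean, here is a minimal sketch that computes both by hand with NumPy. The helper names `mse` and `r2` and the toy values are hypothetical, chosen only for illustration; they are not part of the course scripts.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Toy values (hypothetical, not taken from the dataset)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

print(mse(y_true, y_pred))           # 0.25
print(round(r2(y_true, y_pred), 3))  # 0.8
```

An R squared of 1 would mean the predictions match the targets exactly, while a value near 0 means the model does no better than always predicting the mean.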


# Q1. Load beat-aligned MSD features, find the best linear model using those features, and report its R squared and mean squared error.

You've loaded the MSD timbre, chroma, and loudmax features and calculated their means and variances. You then fitted a linear model with sklearn.linear_model.LinearRegression and reported its R squared score (r2_score) and mean squared error (mean_squared_error).

As we learned last week, a feature can be summarized not only by its mean and variance over all segments but also by the mean and variance of its per-beat means. In Q1-1, you will use the baf.get_btFEATURENAME(filename) functions in beat_aligned_feats.py to get the mean of a feature within each beat. Please calculate the following values and use them as features for linear regression, in addition to the means and variances of the timbre, chroma, and loudmax features.

- timbre_mean_mean
- timbre_mean_var
- chroma_mean_mean
- chroma_mean_var
- loudmax_mean_mean
- loudmax_mean_var
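
The aggregation step can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the random arrays stand in for beat-aligned features that have already been transposed to shape (n_beats, 12), and the names `beat_feats_per_song`, `feat_mean_mean`, and `feat_mean_var` are hypothetical. Loading the real features with beat_aligned_feats is left to the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two songs' beat-aligned features,
# each with shape (n_beats, 12); real data would come from the helper script.
beat_feats_per_song = [
    rng.normal(size=(382, 12)),
    rng.normal(size=(469, 12)),
]

mean_rows, var_rows = [], []
for bt in beat_feats_per_song:
    # Collapse the beat axis: one 12-dimensional summary vector per song.
    mean_rows.append(bt.mean(axis=0))
    var_rows.append(bt.var(axis=0))

# Stack once at the end: one row per song, one column per feature dimension.
feat_mean_mean = np.vstack(mean_rows)
feat_mean_var = np.vstack(var_rows)

print(feat_mean_mean.shape, feat_mean_var.shape)  # (2, 12) (2, 12)
```

Collecting rows in a list and stacking once avoids reallocating the array on every iteration, which is a design choice of this sketch rather than a requirement.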

Deliverable: Report R squared scores and mean squared error of the best linear model. 

### Q1-1. Load beat-aligned MSD timbre, chroma, and loudmax features (1pt)


In [None]:
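## This code shows how to use the beat_aligned_feats.get_btFEATURENAME functions
## bttimbre holds, roughly, the mean of timbre within each beat,
## which is why it has fewer rows than timbre here
import hdf5_getters
import beat_aligned_feats as baf

filename = 'deezer_MSD/TRAPEIN128E078892E.h5'

# Open the file named above; otherwise h5 would still point to the last file read in the earlier loop
h5 = hdf5_getters.open_h5_file_read(filename)
timbre = hdf5_getters.get_segments_timbre(h5)
h5.close()
print("timbre.shape: ",timbre.shape)

# Transpose so that rows are beats and columns are the 12 timbre dimensions
bttimbre = baf.get_bttimbre(filename)
bttimbre = np.transpose(bttimbre)
print("bttimbre.shape: ",bttimbre.shape)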


timbre.shape:  (1011, 12)
bttimbre.shape:  (270, 12)


In [None]:
import csv
import hdf5_getters
import beat_aligned_feats as baf

songreader = csv.DictReader(open('moodmsdfeatures.csv'))

timbre_mean_mean = np.empty((0,12), dtype=float)
timbre_mean_var = np.empty((0,12), dtype=float)
chroma_mean_mean = np.empty((0,12), dtype=float)
chroma_mean_var = np.empty((0,12), dtype=float)
loudmax_mean_mean = np.empty((0,1), dtype=float)
loudmax_mean_var = np.empty((0,1), dtype=float)

for song in songreader:
    trid = song['MSD_track_id']
    filename = 'deezer_MSD/' + trid + '.h5'
    
    print(filename)
    #
    #
    #
    # Write code to calculate timbre_mean_mean, timbre_mean_var, chroma_mean_mean, chroma_mean_var, loudmax_mean_mean, loudmax_mean_var
    #
    #   

deezer_MSD/TRAYKVH128F42AC993.h5
bttimbre.shape:  (382, 12)
timbre_mean_mean.shape:  (1, 12)
timbre_mean_var.shape:  (1, 12)
deezer_MSD/TRATCMK12903CABAC8.h5
bttimbre.shape:  (469, 12)
timbre_mean_mean.shape:  (2, 12)
timbre_mean_var.shape:  (2, 12)
deezer_MSD/TRALJBT128F4266FD8.h5
bttimbre.shape:  (369, 12)
timbre_mean_mean.shape:  (3, 12)
timbre_mean_var.shape:  (3, 12)
deezer_MSD/TRBGPJP128E078ED20.h5
bttimbre.shape:  (1238, 12)
timbre_mean_mean.shape:  (4, 12)
timbre_mean_var.shape:  (4, 12)
deezer_MSD/TRAGFPP128E078F34C.h5
bttimbre.shape:  (701, 12)
timbre_mean_mean.shape:  (5, 12)
timbre_mean_var.shape:  (5, 12)
deezer_MSD/TRAXLPR128F428E466.h5
bttimbre.shape:  (360, 12)
timbre_mean_mean.shape:  (6, 12)
timbre_mean_var.shape:  (6, 12)
deezer_MSD/TRBBAHD128F428E0FE.h5
bttimbre.shape:  (613, 12)
timbre_mean_mean.shape:  (7, 12)
timbre_mean_var.shape:  (7, 12)
deezer_MSD/TRBHLDQ128F423EF10.h5
bttimbre.shape:  (396, 12)
timbre_mean_mean.shape:  (8, 12)
timbre_mean_var.shape:  (8, 12)

### Q1-2. Linear Regression using means and vars of means (for each beat) of timbre, chroma, and loudmax (1pt)

Report R squared scores and mean squared error of the best linear model that describes valence when the following features were used. 

timbre_mean_mean, timbre_mean_var, chroma_mean_mean, chroma_mean_var, loudmax_mean_mean, loudmax_mean_var. 

Coefficient of determination: 0.31
Mean squared error: 0.77


Report R squared scores and mean squared error of the best linear model that describes arousal when the following features were used. 

timbre_mean_mean, timbre_mean_var, chroma_mean_mean, chroma_mean_var, loudmax_mean_mean, loudmax_mean_var. 

Coefficient of determination: 0.40
Mean squared error: 0.54


### Q1-3. Linear Regression using 1) means and vars of timbre, chroma, and loudmax and 2) means and vars of means (for each beat) of timbre, chroma, and loudmax (1pt)

Report R squared scores and mean squared error of the best linear model that describes valence when the following features were used. 

timbre_mean,timbre_var,chroma_mean,chroma_var, loudmax_mean, loudmax_var, timbre_mean_mean, timbre_mean_var, chroma_mean_mean, chroma_mean_var, loudmax_mean_mean, loudmax_mean_var

Coefficient of determination: 0.52
Mean squared error: 0.54


Report R squared scores and mean squared error of the best linear model that describes arousal when the following features were used. 

timbre_mean,timbre_var,chroma_mean,chroma_var, loudmax_mean, loudmax_var, timbre_mean_mean, timbre_mean_var, chroma_mean_mean, chroma_mean_var, loudmax_mean_mean, loudmax_mean_var

Coefficient of determination: 0.60
Mean squared error: 0.36


### Q1-4. Softmax Regression using 1) means and vars of timbre, chroma, and loudmax and 2) means and vars of means (for each beat) of timbre, chroma, and loudmax (1pt)

* Build a softmax regression classifier to solve the music mood classification problem. The four mood classes are four quadrants of Russell’s emotion circumplex.
 
* Use 10% of the dataset for the test set. Report classification scores of the training set and the test set. 


* Use the following features:
timbre_mean,timbre_var,chroma_mean,chroma_var, loudmax_mean, loudmax_var, timbre_mean_mean, timbre_mean_var, chroma_mean_mean, chroma_mean_var, loudmax_mean_mean, loudmax_mean_var
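
One possible shape for this experiment is sketched below on synthetic data. It rests on several assumptions: the random matrix `X_demo` and labels `y_demo` are placeholders for the real feature matrix and the quadrant column, the standardization step is an addition not requested by the assignment, and scikit-learn's `LogisticRegression` with its default solver is used as the softmax (multinomial) classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Placeholder data: 224 songs, 100 features
# (2 feature sets x (12 + 12 + 12 + 12 + 1 + 1) columns), labels in 1..4.
X_demo = rng.normal(size=(224, 100))
y_demo = rng.integers(1, 5, size=224)

# Hold out 10% of the songs for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)

# Standardize features (an assumption of this sketch), then fit the classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # mean accuracy on the training set
test_acc = clf.score(X_test, y_test)     # mean accuracy on the held-out set
print("train accuracy: {:.2f}".format(train_acc))
print("test accuracy: {:.2f}".format(test_acc))
```

Because the placeholder labels are random, the test accuracy here should hover around chance; a large gap between training and test accuracy on the real data would suggest overfitting, which is plausible with 100 features and only 224 songs.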

# Q2. Explore MSD features

You can see an example track description at http://millionsongdataset.com/pages/example-track-description/. The table on that page lists the fields in an example MSD HDF5 file. Some of them are metadata, such as artist information, and some are features, such as timbre and loudness. Explore the features of Deezer MSD Mood Dataset II after loading them with hdf5_getters.open_h5_file_read(filename) and hdf5_getters.get_FEATURENAME(h5). In particular, pick three features and report their summary information using the data.hist() and data.describe() functions of pandas.

You can also read the ISMIR MSD paper (http://ismir2011.ismir.net/papers/OS6-1.pdf) to get a deeper understanding of the MSD.

Deliverable: Summary information of three MSD features of your interest  (1pt)
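
A minimal sketch of the summary step, assuming the three chosen features have already been collected into one value per song. The column names and the random values below are hypothetical placeholders; with real data each column would be filled from the corresponding hdf5_getters call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical per-song values standing in for three MSD fields.
features = pd.DataFrame({
    "tempo": rng.uniform(60, 200, size=224),
    "loudness": rng.normal(-8, 3, size=224),
    "duration": rng.uniform(120, 400, size=224),
})

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = features.describe()
print(summary)

# features.hist(bins=30) would additionally draw one histogram per column
# (it needs matplotlib, which the Setup cell already imports).
```

Reading the two outputs together is useful: describe() gives the numeric range and spread, while the histograms show skew or multiple modes that the summary statistics can hide.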

# Q3. Find one MIR paper that used MSD and write a one-paragraph summary of it.

To find a paper, use https://scholar.google.com. Some possible search keywords are:

- million song dataset ismir
- million song dataset ismir chord
- million song dataset cover

Deliverable: A link to the paper and your one-paragraph summary/introduction of it (just a few sentences would be sufficient). Some possible questions that you can answer are:

- What problem did they solve?
- Which features did they use?
- What is the most interesting takeaway message from the paper?  (1pt)