# HW4 Problem:  Classifying Music Genres

This problem asks you to classify music into genres that include:  Blues, Classical, Jazz, Metal, Pop, Rock.

Column 192 in the dataset is the "GENRE" attribute used for classification.
The 191 columns before this are numeric features of music clips.
<blockquote>
A database of 60 music performers has been prepared for the competition. The material is divided into six categories: classical music, jazz, blues, pop, rock and heavy metal. For each of the performers 15-20 music pieces have been collected. All music pieces are partitioned into 20 segments and parameterized. The descriptors used in parametrization also those formulated within the MPEG-7 standard, are only listed here since they have already been thoroughly reviewed and explained in many studies.  <br /><br />The feature vector consists of 191 parameters, the first 127 parameters are based on the MPEG-7 standard, the remaining ones are cepstral coefficients descriptors and time-related dedicated parameters:<br /><br />a) parameter 1: Temporal Centroid, <br />b) parameter 2: Spectral Centroid average value, <br />c) parameter 3: Spectral Centroid variance, <br />d) parameters 4-37: Audio Spectrum Envelope (ASE)  average values in 34 frequency bands<br />e) parameter 38: ASE average value (averaged for all frequency bands)<br />f) parameters 39-72: ASE variance values in 34 frequency bands<br />g) parameter 73: averaged ASE variance parameters<br />h) parameters 74,75: Audio Spectrum Centroid -- average and variance values<br />i) parameters 76,77: Audio Spectrum Spread -- average and variance values<br />j) parameters 78-101: Spectral Flatness Measure (SFM) average values for 24 frequency bands<br />k) parameter 102: SFM average value (averaged for all frequency bands)<br />l) parameters 103-126: Spectral Flatness Measure (SFM) variance values for 24 frequency bands<br />m) parameter 127: averaged SFM variance parameters<br />n) parameters 128-147: 20 first mel cepstral coefficients average values <br />o) parameters 148-167: the same as 128-147<br />p) parameters 168-191: dedicated parameters in time domain based of the analysis of the distribution of the envelope in relation to the rms value.<br />
</blockquote>
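
The parameter ranges listed above can be written down as index groups, which makes it easier to select one family of descriptors at a time. The following is a minimal sketch in R. The group names are labels invented for this example and are not part of the dataset, and the sketch assumes the feature columns appear in the order given in the description.

```r
# Parameter ranges copied from the dataset description above.
# The group names are hypothetical labels chosen for this sketch.
feature.groups = list(
  temporal.centroid       = 1,
  spectral.centroid       = 2:3,
  ase.mean                = 4:38,     # 34 bands plus the band-averaged value (38)
  ase.variance            = 39:73,    # 34 bands plus the averaged variance (73)
  audio.spectrum.centroid = 74:75,
  audio.spectrum.spread   = 76:77,
  sfm.mean                = 78:102,   # 24 bands plus the band-averaged value (102)
  sfm.variance            = 103:127,  # 24 bands plus the averaged variance (127)
  mfcc.block.1            = 128:147,
  mfcc.block.2            = 148:167,
  time.domain             = 168:191
)

# Every column index from 1 to 191 should appear exactly once.
all.indices = as.integer(sort(unlist(feature.groups, use.names = FALSE)))
group.sizes = sapply(feature.groups, length)
print(group.sizes)
print(identical(all.indices, 1:191))
```

Under the same assumption about column order, a group such as `feature.groups$sfm.mean` could be used to subset the feature columns of the training data.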

The results of a contest for building classifiers for this dataset are reported in:
<blockquote>
http://duch.mimuw.edu.pl/~mwojnars/papers/ismis-2011-contest.pdf
</blockquote>
This paper offers some ideas about models to consider.

#  The Goal

In this assignment you are to generate the best genre predictions you can for a set of test data:
<ul><li>
Given the training file <tt>training_set.csv</tt>, develop a classifier that is as accurate as possible.
</li><li>
Use your classifier to predict a genre for each row of data in <tt>test_set.csv</tt>.
</li><li>
Put your predictions in a .csv file called <tt>HW4_Music_Predictions.csv</tt> and upload it to CCLE.
</li></ul>

## Step 1: download your data, using your UID

<blockquote>

Download the music data at:
<br/>
http://datamining.cs.ucla.edu/cs249/hw4/music/___PUT_YOUR_UID_HERE___.zip

<br/>
<br/>
<i>For example, if your UID is  123456789, download the file</i>
    http://datamining.cs.ucla.edu/cs249/hw4/music/123456789.zip
    
</blockquote>
    
This zip file has two csv data files:  a training set and a test set.
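
The download step can be sketched in R as follows. The helper name `build.data.url` is hypothetical, and the nine-digit check is an assumption based on the example UID above. The download and unzip calls are left commented out because they need network access.

```r
# Hypothetical helper: build the per-student download URL from a UID.
build.data.url = function(uid) {
  uid = as.character(uid)
  # The example UID has nine digits; this check assumes all UIDs do.
  stopifnot(grepl("^[0-9]{9}$", uid))
  paste0("http://datamining.cs.ucla.edu/cs249/hw4/music/", uid, ".zip")
}

data.url = build.data.url("123456789")
print(data.url)

# Uncomment to fetch and unpack the archive (requires network access).
# The file and folder names below are placeholders.
# download.file(data.url, destfile = "music.zip", mode = "wb")
# unzip("music.zip", exdir = "HW4_music_data")
# list.files("HW4_music_data")
```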

## Step 2: construct a model from <tt>training_set.csv</tt>

Using the <tt>training_set.csv</tt> data, construct a classifier.

<br/>
<b>YOU CAN USE ANY ENVIRONMENT YOU LIKE TO BUILD A CLASSIFIER.</b>
Please construct the most accurate models you can.

<hr style="border-width:20px;">

## Step 3: generate predictions from <tt>test_set.csv</tt>
    
The rows of file <tt>test_set.csv</tt> have input features for a number of music clips.
Using your classifier, produce a class prediction for each of them.

<br/>
Put one predicted class name per line in a CSV file <tt>HW4_Music_Predictions.csv</tt>.
This file should also have the header line "<tt>GENRE</tt>".
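
As a rough illustration of the expected file layout, the sketch below writes a few placeholder predictions and reads the file back. The genre values are made up, and the file is written to a temporary directory so that nothing in the working directory is overwritten.

```r
# Placeholder predictions; in practice these come from the classifier.
predicted.genres = c("Jazz", "Rock", "Blues", "Jazz")

# One column named GENRE, one prediction per row.
predictions = data.frame(GENRE = predicted.genres)
output.file = file.path(tempdir(), "HW4_Music_Predictions.csv")
write.csv(predictions, file = output.file, row.names = FALSE, quote = FALSE)

# Read the file back to confirm the header and the row count.
file.lines = readLines(output.file)
print(file.lines)
```

The file should contain one more line than there are rows in the test set, because of the header.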

<br/>
<i>Your score on this problem will be the accuracy of these predictions.</i>
<br/>

<hr style="border-width:20px;">

## Step 4: upload <tt>HW4_Music_Predictions.csv</tt> and your notebook to CCLE

Finally, go to CCLE and upload:
<ul><li>
your output CSV file <tt>HW4_Music_Predictions.csv</tt>
</li><li>
your notebook file <tt>HW4_Music_Genres.ipynb</tt>
</li></ul>

We do not plan to run any of the uploaded notebooks.
However, to show your work, your notebook should contain the commands you used in developing your models.
As announced, all assignment grading in this course will be automated,
and the notebook is needed to check the results of the grading program.

# Solution : Music Genre

## Load all the libraries

In [17]:
library(data.table)
library(caret)
library(kernlab)
library(gbm)
library(nnet)
library(randomForest)
library(class)
library(e1071)
library(glmnet) # provides glmnet(), used for the lasso model
library(doMC)
registerDoMC(cores=6) # parallel processing

## Load all the features 

In [18]:
options(warn=-1)

main.training.set = fread("HW4_music_data/training_set.csv", data.table = TRUE)
main.testing.set = fread("HW4_music_data/test_set.csv", data.table = TRUE)

## Split Data for Training Data and Class Label

In [19]:
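set.seed(249) # arbitrary seed (an assumption) so the shuffle and later random steps are reproducible
main.training.set = main.training.set[sample(nrow(main.training.set)),] # shuffle data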

main.training.data = main.training.set
main.training.data$GENRE = NULL

## Feature Selection - Remove Highly Correlated Features 

In [9]:
# correlationMatrix = cor(main.training.data)
# highlyCorrelated = findCorrelation(correlationMatrix, cutoff=0.75)

# main.training.set = subset(main.training.set, select = colnames(main.training.set)[-highlyCorrelated])
# main.testing.set = subset(main.testing.set, select = colnames(main.testing.set)[-highlyCorrelated])
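
The idea behind this filter can be illustrated without caret. The sketch below is a simple greedy variant written for this example. It is not the algorithm used by `findCorrelation`, which chooses the column to remove based on mean absolute correlation and may therefore select different columns. The helper name `drop.correlated` is hypothetical.

```r
# Greedy filter: keep a column only if its absolute correlation with every
# column kept so far is at or below the cutoff.
drop.correlated = function(x, cutoff = 0.75) {
  cor.matrix = abs(cor(x))
  keep = integer(0)
  for (j in seq_len(ncol(x))) {
    if (all(cor.matrix[j, keep] <= cutoff)) {
      keep = c(keep, j)
    }
  }
  setdiff(seq_len(ncol(x)), keep)  # indices of the columns to drop
}

# Toy data: column b is nearly a copy of a, column c is independent noise.
set.seed(1)
a = rnorm(200)
toy = data.frame(a = a, b = a + rnorm(200, sd = 0.01), c = rnorm(200))
dropped = drop.correlated(toy, cutoff = 0.75)
print(colnames(toy)[dropped])
```

Whichever method is used, the indices should be computed on the training features only and then applied unchanged to the test set, so that both keep the same columns.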

## Split into Training & Testing Model

In [20]:
split.idx = createDataPartition(main.training.set$GENRE, p = .7,
                                list = FALSE,
                                times = 1)


training.set = main.training.set[split.idx, ]
testing.set = main.training.set[-split.idx, ]

print(dim(training.set))
print(dim(testing.set))

training.data = training.set
training.data$GENRE = NULL

testing.data= testing.set
testing.data$GENRE = NULL

[1] 7872  192
[1] 3373  192
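
`createDataPartition` samples within each class so that the class proportions are roughly preserved in both parts. The base R sketch below shows the same idea on toy labels. The helper `stratified.split` is hypothetical and only approximates what caret does.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified.split = function(labels, p = 0.7) {
  idx.by.class = split(seq_along(labels), labels)
  picked = lapply(idx.by.class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
toy.labels = rep(c("Jazz", "Rock"), times = c(70, 30))
train.idx = stratified.split(toy.labels, p = 0.7)

print(table(toy.labels[train.idx]))   # class counts in the training part
print(table(toy.labels[-train.idx]))  # class counts in the held-out part
```

With 70 Jazz and 30 Rock labels, the training part receives 49 and 21 rows respectively, so the 70/30 class ratio is kept on both sides of the split.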


## Normalize Data

In [21]:
procValues = preProcess(main.training.data, method = c("center", "scale"))

normalize.main.training.data = predict(procValues, main.training.data)
normalize.main.testing.set = predict(procValues, main.testing.set)

normalize.main.training.set = data.frame(cbind(normalize.main.training.data, main.training.set$GENRE))
colnames(normalize.main.training.set) = c(colnames(normalize.main.training.data), "GENRE")


procValues = preProcess(training.data, method = c("center", "scale"))
normalize.training.data = predict(procValues, training.data)
normalize.testing.data = predict(procValues, testing.data)

normalize.training.set = data.frame(cbind(normalize.training.data, training.set$GENRE))
colnames(normalize.training.set) = c(colnames(normalize.training.data), "GENRE")
normalize.testing.set =  data.frame(cbind(normalize.testing.data, testing.set$GENRE))
colnames(normalize.testing.set) = c(colnames(normalize.testing.data), "GENRE")

control = trainControl(method="repeatedcv", number=10, repeats=3)
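
The important detail in this step is that the centering and scaling statistics are estimated on the training data and then reused for the test data. A minimal base R sketch of that pattern, using made-up numbers, might look like this:

```r
# Toy training and test matrices with two features.
train.x = matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                 dimnames = list(NULL, c("f1", "f2")))
test.x  = matrix(c(2.5, 25), ncol = 2, dimnames = list(NULL, c("f1", "f2")))

# Estimate the mean and standard deviation on the training data only.
train.mean = colMeans(train.x)
train.sd   = apply(train.x, 2, sd)

# Apply the same shift and scale to both sets.
train.scaled = scale(train.x, center = train.mean, scale = train.sd)
test.scaled  = scale(test.x,  center = train.mean, scale = train.sd)

print(round(colMeans(train.scaled), 10))  # both columns are centered at 0
print(test.scaled)                        # the test row equals the training mean
```

Because the single test row equals the training mean, it maps to zero in both columns.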

## Accuracy Function

In [22]:
calc.accuracy = function(predicted, actual){
  return(length(which(predicted == actual)) / length(actual))
}
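
For reference, the same quantity can be computed as the mean of a logical vector, and a confusion table shows which classes are being mixed up. The labels below are made up for illustration.

```r
predicted = c("Jazz", "Rock", "Jazz", "Blues")
actual    = c("Jazz", "Rock", "Rock", "Blues")

# Fraction of matching labels; the same quantity calc.accuracy computes.
accuracy = mean(predicted == actual)
print(accuracy)

# Cross-tabulate predictions against the truth to see which classes are confused.
confusion = table(predicted = predicted, actual = actual)
print(confusion)
```

Three of the four labels match, so the accuracy is 0.75, and the single error appears in the table as a Rock clip predicted as Jazz.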

## Model Testing

### Multinomial Logistic Regression

In [23]:
# multinom() has no trControl argument (it belongs to caret::train), so it is not passed here
nnet.fit = multinom(GENRE ~. , data = training.set, MaxNWts = 10000)
nnet.fit.n = multinom(GENRE ~. , data = normalize.training.set, MaxNWts = 10000)

predicted.genre.softmax = predict(nnet.fit, newdata = testing.set)
predicted.genre.softmax.n = predict(nnet.fit.n, newdata = normalize.testing.set)

cat("\nMultinomial Logistic Regression Accuracy\n")
cat("Without Normalize ", calc.accuracy(predicted.genre.softmax, testing.set$GENRE), "\n")
cat("With Normalize ", calc.accuracy(predicted.genre.softmax.n, testing.set$GENRE), "\n")

# weights:  193 (192 variable)
initial  value 5456.454605 
iter  10 value 2981.753545
iter  20 value 1799.687223
iter  30 value 1332.898608
iter  40 value 1171.515439
iter  50 value 1113.047086
iter  60 value 1112.520628
final  value 1112.520504 
converged
# weights:  193 (192 variable)
initial  value 5456.454605 
iter  10 value 1826.749107
iter  20 value 1570.530788
iter  30 value 1396.695954
iter  40 value 1187.585073
iter  50 value 1058.893249
iter  60 value 945.098932
iter  70 value 908.767225
iter  80 value 895.603720
iter  90 value 886.275796
iter 100 value 880.201454
final  value 880.201454 
stopped after 100 iterations

Multinomial Logistic Regression 
Without Normalize  0.9454492 
With Normalize  0.9531574 
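
These accuracies come from a single 70/30 hold-out split. A k-fold estimate is less sensitive to one particular split. The sketch below shows the mechanics in base R on synthetic two-class data, with a plain `glm` standing in for the classifier. The data, the fold count, and the variable names are all assumptions made for illustration.

```r
set.seed(7)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Labels depend on x1 plus noise, so the classes overlap.
toy$y = factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 0, "Rock", "Jazz"))

k = 5
fold.id = sample(rep(seq_len(k), length.out = n))  # random fold assignment

fold.accuracy = sapply(seq_len(k), function(fold) {
  train.part = toy[fold.id != fold, ]
  test.part  = toy[fold.id == fold, ]
  fit = glm(y ~ x1 + x2, data = train.part, family = binomial)
  # glm models the probability of the second factor level ("Rock").
  prob.rock = predict(fit, newdata = test.part, type = "response")
  predicted = ifelse(prob.rock > 0.5, "Rock", "Jazz")
  mean(predicted == test.part$y)
})

print(fold.accuracy)
print(mean(fold.accuracy))
```

caret's `train` function can run this kind of resampling automatically when it is given a `trainControl` object.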


### Lasso Logistic Regression 

In [24]:
glmnet.fitted.model = glmnet(x = as.matrix(training.data),y = training.set$GENRE,
                             family="multinomial", lambda=0.04)
glmnet.fitted.model.n = glmnet(x = as.matrix(normalize.training.data),y = training.set$GENRE,
                               family="multinomial", lambda=0.04)

predicted.genre.lasso = predict(glmnet.fitted.model, newx=as.matrix(testing.data), type="class")
predicted.genre.lasso.n = predict(glmnet.fitted.model.n, newx=as.matrix(normalize.testing.data), type="class")

cat("\nLasso Logistic Regression Accuracy\n")
cat("Without Normalize ", calc.accuracy(predicted.genre.lasso, testing.set$GENRE), "\n")
cat("With Normalize ", calc.accuracy(predicted.genre.lasso.n, testing.set$GENRE), "\n")


Lasso Logistic Regression 
Without Normalize  0.9060184 
With Normalize  0.9060184 


### Support Vector Machine

In [25]:
# e1071::svm() scales the features internally by default (scale = TRUE);
# cv.folds and trControl are not svm() arguments, so they are not passed here
svm.model = svm(factor(GENRE) ~ ., data = training.set)
svm.model.n = svm(factor(GENRE) ~ ., data = normalize.training.set)

predicted.genre.svm.n = predict(svm.model.n, normalize.testing.set)
predicted.genre.svm = predict(svm.model, testing.data)

cat("\nSupport Vector Machine Accuracy\n")
cat("Without Normalize ", calc.accuracy(predicted.genre.svm, testing.set$GENRE), "\n")
cat("With Normalize ", calc.accuracy(predicted.genre.svm.n, testing.set$GENRE), "\n")

Support Vector Machine 
Without Normalize  0.973021 
With Normalize  0.973021 


### Random Forest 

In [26]:
# cv.folds and trControl are not randomForest() arguments, so they are not passed here
rf.model = randomForest(factor(GENRE) ~ .,data = training.set)
rf.model.n = randomForest(factor(GENRE) ~ .,data = normalize.training.set)

predicted.genre.rf = predict(rf.model, newdata = as.matrix(testing.data), type="response", predict.all = FALSE)
predicted.genre.rf.n = predict(rf.model.n, newdata = as.matrix(normalize.testing.data), type="response", 
                               predict.all = FALSE)

cat("\nRandom Forest Accuracy\n")
cat("Without Normalize ", calc.accuracy(predicted.genre.rf, testing.set$GENRE), "\n")
cat("With Normalize ", calc.accuracy(predicted.genre.rf.n, testing.set$GENRE), "\n")


Random Forest 
Without Normalize  0.9715387 
With Normalize  0.9712422 


### Ensemble Classifier 

In [27]:
# Used RANDOM FOREST, SVM & SOFTMAX 
ensemble.model = data.frame(predicted.genre.rf, predicted.genre.svm, predicted.genre.softmax) 
ensemble.model.n = data.frame(predicted.genre.rf.n, predicted.genre.svm.n, predicted.genre.softmax.n)

predicted.ensemble = apply(ensemble.model,1,function(x) names(which.max(table(x))))
predicted.ensemble.n = apply(ensemble.model.n,1,function(x) names(which.max(table(x))))

cat("Ensemble Classifier Accuracy \n")
cat("Without Normalize ", calc.accuracy(predicted.ensemble, testing.set$GENRE), "\n")
cat("With Normalize ", calc.accuracy(predicted.ensemble.n, testing.set$GENRE), "\n")

Ensemble Classifier 
Without Normalize  0.973614 
With Normalize  0.9739105 
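
The voting rule used here is a plain majority vote across the three models. The toy example below, with made-up predictions, shows how it behaves, including what happens when all three models disagree.

```r
# Hypothetical predictions from three models for four clips.
votes = data.frame(
  rf      = c("Rock", "Jazz", "Rock", "Blues"),
  svm     = c("Rock", "Jazz", "Jazz", "Rock"),
  softmax = c("Jazz", "Jazz", "Rock", "Jazz"),
  stringsAsFactors = FALSE
)

# Most frequent label in a row; ties go to the first label in sorted order.
majority.vote = function(x) names(which.max(table(x)))
ensemble = apply(votes, 1, majority.vote)
print(unname(ensemble))
```

For the last clip the three models disagree. `table` sorts the labels alphabetically and `which.max` returns the first maximum, so the tie resolves to "Blues". That tie-breaking is arbitrary, and a weighted vote or a fallback to the strongest single model would be a reasonable alternative.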


## Model Building for Main Testing Data

### Without Normalized Data

In [28]:
nnet.fit.main = multinom(GENRE ~. , data = main.training.set, MaxNWts = 20000)
# cv.folds is not an argument of svm() or randomForest(), so it is not passed here
svm.model.main = svm(factor(GENRE) ~ ., data = main.training.set)
rf.model.main = randomForest(factor(GENRE) ~ .,data = main.training.set)

predicted.genre.softmax.main = predict(nnet.fit.main, newdata = as.matrix(main.testing.set))
predicted.genre.svm.main = predict(svm.model.main, newdata = as.matrix(main.testing.set))
predicted.genre.rf.main = predict(rf.model.main, newdata = as.matrix(main.testing.set), 
                                  type="response", predict.all = FALSE)

ensemble.classifier = data.frame(predicted.genre.rf.main, predicted.genre.svm.main,
                                 predicted.genre.softmax.main) # Used RANDOM FOREST, SVM & SOFTMAX 
ensemble.classifier.output = matrix(apply(ensemble.classifier, 1, function(x)
                                    names(which.max(table(x)))), ncol = 1)
colnames(ensemble.classifier.output) = c("GENRE")

# weights:  193 (192 variable)
initial  value 7794.440045 
iter  10 value 4011.089795
iter  20 value 2615.000693
iter  30 value 1928.200753
iter  40 value 1726.974736
iter  50 value 1630.930578
iter  60 value 1628.832170
iter  70 value 1593.910829
iter  80 value 1591.973378
iter  90 value 1558.587544
iter 100 value 1545.905154
final  value 1545.905154 
stopped after 100 iterations


### With Normalized Data

In [29]:
nnet.fit.main.n = multinom(GENRE ~. , data = normalize.main.training.set, MaxNWts = 20000)
# cv.folds is not an argument of svm() or randomForest(), so it is not passed here
svm.model.main.n = svm(factor(GENRE) ~ ., data = normalize.main.training.set)
rf.model.main.n = randomForest(factor(GENRE) ~ .,data = normalize.main.training.set)

predicted.genre.softmax.main.n = predict(nnet.fit.main.n, newdata = as.matrix(normalize.main.testing.set))
predicted.genre.svm.main.n = predict(svm.model.main.n, newdata = as.matrix(normalize.main.testing.set))
predicted.genre.rf.main.n = predict(rf.model.main.n, newdata = as.matrix(normalize.main.testing.set), 
                                    type="response", predict.all = FALSE)

normalize.ensemble.classifier = data.frame(predicted.genre.rf.main.n, predicted.genre.svm.main.n, 
                                           predicted.genre.softmax.main.n) # Used RANDOM FOREST, SVM & SOFTMAX 
normalize.ensemble.classifier.output = matrix(apply(normalize.ensemble.classifier, 1, function(x) 
                                              names(which.max(table(x)))), ncol = 1)
colnames(normalize.ensemble.classifier.output) = c("GENRE")

# weights:  193 (192 variable)
initial  value 7794.440045 
iter  10 value 2966.913634
iter  20 value 2572.383333
iter  30 value 2185.995636
iter  40 value 1830.931338
iter  50 value 1629.587835
iter  60 value 1448.323272
iter  70 value 1359.387574
iter  80 value 1337.027284
iter  90 value 1328.416636
iter 100 value 1322.371502
final  value 1322.371502 
stopped after 100 iterations


## Output results to CSV

In [33]:
# Single-column CSV with the header GENRE and one prediction per row
write.table(ensemble.classifier.output, file = "HW4_Music_Predictions.csv",
            sep = ",", quote = TRUE, row.names = FALSE, col.names = TRUE)