## Benchmark

Below is a benchmark multinomial logistic regression model trained on audio features only. You will likely find that incorporating lyric features improves performance. Note that enlarging the feature set or using more observations from the training set increases the computational cost of fitting the model. Consider pre-processing steps (e.g., feature selection or dimension reduction) to reduce this cost. You can cache these pre-processing steps with the code chunk option `cache=TRUE`, so you do not have to rerun them each time you try a new model specification!
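
As a rough illustration of the dimension-reduction idea, the chunk below sketches one possible approach: principal component analysis with base R's `prcomp`, fitted on the training features only and then applied to the test features. It is a sketch under stated assumptions, not a recommended pipeline. The data are simulated stand-ins, and the object names (`x_tr`, `x_te`, `pca`, `k`) and the 90% variance threshold are illustrative choices rather than part of the benchmark. The chunk option `cache=TRUE` stores the results so the chunk is not re-evaluated on later knits unless its code changes.

```{r cache=TRUE}
# Simulated stand-ins for the audio feature tables (hypothetical data,
# used only so that this sketch runs on its own)
set.seed(474)
n_tr <- 200
n_te <- 50
p <- 20
feature_names <- paste0('audio_', seq_len(p))
x_tr <- matrix(rnorm(n_tr * p), n_tr, p, dimnames = list(NULL, feature_names))
x_te <- matrix(rnorm(n_te * p), n_te, p, dimnames = list(NULL, feature_names))

# Fit PCA on the training features only, centering and scaling each column
pca <- prcomp(x_tr, center = TRUE, scale. = TRUE)

# Keep the smallest number of components that explains at least 90% of the
# variance (the threshold is an arbitrary choice for illustration)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.90)[1]

# Project both sets onto the first k components; predict() reuses the
# centering, scaling and rotation learned from the training data
x_tr_pc <- pca$x[, seq_len(k), drop = FALSE]
x_te_pc <- predict(pca, newdata = x_te)[, seq_len(k), drop = FALSE]
```

The reduced matrices could then take the place of the full feature matrices when fitting a model. Fitting the rotation on the training set alone keeps information from the test set out of the pre-processing step.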


```{r message=FALSE, warning=FALSE}
library(glmnet)
library(caret)
library(data.table)

#download.file('https://github.com/lse-my474/pset_data/raw/main/songs_test.csv', 'songs_test.csv')
#download.file('https://github.com/lse-my474/pset_data/raw/main/songs_train.csv', 'songs_train.csv')

songs_tr <- read.csv('songs_train.csv')
songs_te <- read.csv('songs_test.csv')

# keep only the audio feature columns (`%like%` comes from `data.table`)
songs_tr_sub <- songs_tr[,(colnames(songs_tr) %like% 'audio_')]
songs_te_sub <- songs_te[,(colnames(songs_te) %like% 'audio_')]

# convert the outcome to a one-hot matrix for the multinomial logit model
# using base R's `model.matrix`; `~0+` drops the intercept, giving one column per genre
# see https://en.wikipedia.org/wiki/One-hot#Machine_learning_and_statistics
tr_y <- model.matrix(~0+genre, data=songs_tr)
tr_x <- model.matrix(~., data=songs_tr_sub)
# relabel the columns; this assumes the genre levels sort alphabetically in this order
colnames(tr_y) <- c('hip hop', 'pop', 'rap', 'rock')

te_x <- model.matrix(~., data=songs_te_sub)

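# lambda = 0 switches off the penalty, so this is an unpenalized
# multinomial logit and alpha has no effect on the fit
mod <- glmnet(
    tr_x,
    tr_y,
    family = "multinomial",
    type.logistic = "modified.Newton",
    alpha = 1,
    lambda = 0
)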
y_pred <- predict(mod, te_x, type = "class")

# Output answers for submission to Kaggle
answers <- cbind(10001:(10000+nrow(te_x)), y_pred)
colnames(answers) <- c('song_id', 'genre')
write.csv(answers, 'answers.csv', row.names=FALSE)