## Popularity of Music Records

The music industry is a well-developed market with global annual revenue of around $15 billion. The recording industry is highly competitive and is dominated by three large production companies that together account for nearly 82% of total annual album sales.

Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.

Unfortunately, the success of an artist's release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable.

Given the competitive nature of the recording industry, record labels face a fundamental decision: which musical releases to support in order to maximize their financial success.

How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song's properties to predict its popularity. The dataset `songs.csv` consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn't make the Top 10. The data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

Here's a detailed description of the variables:

- year = the year the song was released
- songtitle = the title of the song
- artistname = the name of the artist of the song
- songID and artistID = identifying variables for the song and artist
- timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
- loudness = a continuous variable indicating the average amplitude of the audio in decibels
- tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
- key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate
- energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
- pitch = a continuous variable that indicates the pitch of the song
- timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
- Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)
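
The naming pattern of the 24 timbre variables can be generated programmatically, which is handy for checking that none is missing. This is a minimal sketch; `timbre_vars` is a hypothetical name that does not appear in the original notebook.

```R
# For each of the twelve timbre components (0..11) there is a _min and a _max column.
# paste0 recycles c("_min", "_max") across the repeated component indices.
timbre_vars = paste0("timbre_", rep(0:11, each = 2), c("_min", "_max"))

length(timbre_vars)     # number of generated names
head(timbre_vars, 4)    # first two components, min and max each

# With the real data loaded, one could check: all(timbre_vars %in% names(songs))
```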

### Understanding the Data

In [1]:
songs = read.csv('./dataset/songs.csv')
str(songs)

'data.frame':	7574 obs. of  39 variables:
 $ year                    : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ songtitle               : Factor w/ 7141 levels "̈́ l'or_e des bois",..: 6204 5522 241 3115 48 608 255 4419 2886 6756 ...
 $ artistname              : Factor w/ 1032 levels "50 Cent","98 Degrees",..: 3 3 3 3 3 3 3 3 3 12 ...
 $ songID                  : Factor w/ 7549 levels "SOAACNI1315CD4AC42",..: 595 5439 5252 1716 3431 1020 1831 3964 6904 2473 ...
 $ artistID                : Factor w/ 1047 levels "AR00B1I1187FB433EB",..: 671 671 671 671 671 671 671 671 671 507 ...
 $ timesignature           : int  3 4 4 4 4 4 4 4 4 4 ...
 $ timesignature_confidence: num  0.853 1 1 1 0.788 1 0.968 0.861 0.622 0.938 ...
 $ loudness                : num  -4.26 -4.05 -3.57 -3.81 -4.71 ...
 $ tempo                   : num  91.5 140 160.5 97.5 140.1 ...
 $ tempo_confidence        : num  0.953 0.921 0.489 0.794 0.286 0.347 0.273 0.83 0.018 0.929 ...
 $ key                   

In [2]:
songs_2010 = subset(songs, year == 2010)
nrow(songs_2010)

In [3]:
songs_mj = subset(songs, artistname == 'Michael Jackson')
nrow(songs_mj)

In [5]:
songs_mj_top10 = subset(songs_mj, Top10 == 1)
songs_mj_top10

| | year | songtitle | artistname | songID | artistID | timesignature | timesignature_confidence | loudness | tempo | tempo_confidence | ⋯ | timbre_7_max | timbre_8_min | timbre_8_max | timbre_9_min | timbre_9_max | timbre_10_min | timbre_10_max | timbre_11_min | timbre_11_max | Top10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` |
| 4329 | 2001 | You Rock My World | Michael Jackson | SOBLCOF13134393021 | ARXPPEY1187FB51DF4 | 4 | 1.0 | -2.768 | 95.003 | 0.892 | ⋯ | 120.076 | -53.839 | 63.576 | -85.169 | 84.84 | -102.185 | 55.266 | -48.107 | 56.116 | 1 |
| 6207 | 1995 | You Are Not Alone | Michael Jackson | SOJKNNO13737CEB162 | ARXPPEY1187FB51DF4 | 4 | 1.0 | -9.408 | 120.566 | 0.805 | ⋯ | 90.735 | -61.583 | 60.92 | -55.904 | 76.632 | -69.799 | 46.173 | -67.281 | 47.128 | 1 |
| 6210 | 1995 | Black or White | Michael Jackson | SOBBRFO137756C9CB7 | ARXPPEY1187FB51DF4 | 4 | 1.0 | -4.017 | 115.027 | 0.535 | ⋯ | 107.974 | -55.063 | 52.505 | -110.999 | 71.477 | -133.939 | 60.442 | -55.008 | 43.473 | 1 |
| 6218 | 1995 | Remember the Time | Michael Jackson | SOIQZMT136C9704DA5 | ARXPPEY1187FB51DF4 | 4 | 1.0 | -3.633 | 107.921 | 1.0 | ⋯ | 146.587 | -58.117 | 62.157 | -54.44 | 94.501 | -112.348 | 90.437 | -53.634 | 51.681 | 1 |
| 6915 | 1992 | In The Closet | Michael Jackson | SOKIOOC12AF729ED9E | ARXPPEY1187FB51DF4 | 4 | 0.991 | -4.315 | 110.501 | 0.949 | ⋯ | 124.354 | -78.303 | 41.322 | -83.184 | 106.263 | -136.109 | 102.829 | -48.192 | 74.575 | 1 |


In [6]:
unique(songs$timesignature)

In [8]:
sort(table(songs$timesignature))


   0    7    5    1    3    4 
  10   19  112  143  503 6787 

In [9]:
which.max(songs$tempo)

In [10]:
songs$songtitle[which.max(songs$tempo)]
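
The lookup above relies on `which.max` returning the position of the largest value, which is then used to index another column. Here is a small self-contained sketch of the same pattern; the data frame `toy` and its values are made up for illustration and are not taken from `songs.csv`.

```R
# Toy stand-in for the songs data (hypothetical values)
toy = data.frame(
  songtitle = c("Song A", "Song B", "Song C"),
  tempo = c(91.5, 160.5, 140.0),
  stringsAsFactors = FALSE
)

idx = which.max(toy$tempo)   # position of the (first) largest tempo
idx
toy$songtitle[idx]           # title of the row at that position
```

If several rows tie for the maximum, `which.max` reports only the first of them.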

### Creating Our Prediction Model

In [11]:
SongsTrain = subset(songs, year <= 2009)
SongsTest = subset(songs, year == 2010)
nrow(SongsTrain)
nrow(SongsTest)
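
The split above is chronological: earlier years train the model and the final year tests it. A quick sanity check for such a split is that the two sets do not overlap and together cover every row. The sketch below shows this on made-up data; `toy`, `toy_train` and `toy_test` are hypothetical names.

```R
# Toy data with a year column (values invented for illustration)
toy = data.frame(year = c(2008, 2009, 2009, 2010, 2010),
                 Top10 = c(0, 1, 0, 0, 1))

toy_train = subset(toy, year <= 2009)
toy_test = subset(toy, year == 2010)

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(toy)

# All training years precede the test year
max(toy_train$year) < min(toy_test$year)
```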

In this problem, our outcome variable is "Top10" - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model. We'll start by using all song attributes as our independent variables, which we'll call Model 1.
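
As a reminder of why logistic regression suits a binary outcome: the model forms a linear combination of the inputs and passes it through the logistic function, so the result always lies between 0 and 1 and can be read as a probability. The following is a minimal sketch with made-up linear-predictor values; `eta` and `p` are illustrative names.

```R
# Made-up values of the linear predictor (log-odds)
eta = c(-2, 0, 2)

# Logistic function: maps any real number into (0, 1)
p = 1 / (1 + exp(-eta))
p

# Base R's plogis computes the same transformation
all.equal(p, plogis(eta))
```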

We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model. So we won't use the variables "year", "songtitle", "artistname", "songID" or "artistID".

We have seen in the lecture that, to build the logistic regression model, we would normally explicitly input the formula including all the independent variables in R. However, in this case, this is a tedious amount of work since we have a large number of independent variables.

There is a nice trick to avoid doing so. Let's suppose that, except for the outcome variable Top10, all other variables in the training set are inputs to Model 1. Then, we can use the formula
```R
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)
```
to build our model. Notice that the "." is used in place of enumerating all the independent variables. (Also, keep in mind that you can choose to put quotes around binomial, or leave out the quotes. R can understand this argument either way.)
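
To see that the "." shorthand really does stand for "every other column", one can fit the same model both ways on a small simulated data set and compare the coefficients. This is only a sketch: the data frame `toy`, its columns `x1`, `x2`, `y`, and the seed are assumptions made for illustration.

```R
set.seed(1)

# Simulated data: two numeric inputs and a binary outcome
toy = data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y = rbinom(50, 1, plogis(toy$x1))

# "." expands to all columns of the data other than the outcome
fit_dot = glm(y ~ ., data = toy, family = binomial)
fit_explicit = glm(y ~ x1 + x2, data = toy, family = binomial)

names(coef(fit_dot))
all.equal(coef(fit_dot), coef(fit_explicit))
```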

However, in our case, we want to exclude some of the variables in our dataset from being used as independent variables ("year", "songtitle", "artistname", "songID", and "artistID"). To do this, we can use the following trick. First define a vector of variable names called nonvars - these are the variables that we won't use in our model.
```R
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
```
To remove these variables from your training and testing sets, type the following commands in your R console:
```R
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]

SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
```
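
The column filter works by building a logical vector with one entry per column: `%in%` marks the names to drop, and `!` flips that into the names to keep. A small sketch on a one-row toy data frame (the frame `toy` and the vector `drop_vars` are hypothetical) makes the intermediate step visible.

```R
# One-row toy data frame with a mix of columns to keep and drop
toy = data.frame(year = 2010, songtitle = "Song A", loudness = -4.2, Top10 = 1)
drop_vars = c("year", "songtitle")

# TRUE for columns to keep, FALSE for columns listed in drop_vars
keep = !(names(toy) %in% drop_vars)
keep

toy_reduced = toy[, keep]
names(toy_reduced)
```

One caveat: if only a single column were kept, `toy[, keep]` would return a plain vector rather than a data frame; adding `drop = FALSE` avoids that.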
Now, use the glm function to build a logistic regression model to predict Top10 using all of the other variables as the independent variables. You should use SongsTrain to build the model.

In [12]:
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")

In [13]:
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars)]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars)]

In [14]:
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family='binomial')
summary(SongsLog1)


Call:
glm(formula = Top10 ~ ., family = "binomial", data = SongsTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9220  -0.5399  -0.3459  -0.1845   3.0770  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               1.470e+01  1.806e+00   8.138 4.03e-16 ***
timesignature             1.264e-01  8.674e-02   1.457 0.145050    
timesignature_confidence  7.450e-01  1.953e-01   3.815 0.000136 ***
loudness                  2.999e-01  2.917e-02  10.282  < 2e-16 ***
tempo                     3.634e-04  1.691e-03   0.215 0.829889    
tempo_confidence          4.732e-01  1.422e-01   3.329 0.000873 ***
key                       1.588e-02  1.039e-02   1.529 0.126349    
key_confidence            3.087e-01  1.412e-01   2.187 0.028760 *  
energy                   -1.502e+00  3.099e-01  -4.847 1.25e-06 ***
pitch                    -4.491e+01  6.835e+00  -6.570 5.02e-11 ***
timbre_0_min              2.316e-02  4.256e-03   5.

### Beware of Multicollinearity Issues!

In [15]:
cor(SongsTrain$loudness, SongsTrain$energy)

Given that these two variables are highly correlated, Model 1 suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and rerun the logistic regression. In the rest of this problem, we'll build two variations of our original model: Model 2, in which we keep "energy" and omit "loudness", and Model 3, in which we keep "loudness" and omit "energy".
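
One symptom of multicollinearity is that coefficient estimates become unstable: when two inputs carry almost the same information, the model cannot tell their effects apart and the standard errors grow. The simulation below is a sketch under assumed data, not a reanalysis of the songs data; `x1`, `x2`, `y` and the seed are invented for illustration.

```R
set.seed(42)
n = 500

x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.1)      # x2 is nearly a copy of x1
y = rbinom(n, 1, plogis(x1))      # outcome depends on x1 only

cor(x1, x2)                        # very close to 1

# Standard error of the x1 coefficient, with and without its near-duplicate
se_both = summary(glm(y ~ x1 + x2, family = binomial))$coefficients["x1", "Std. Error"]
se_alone = summary(glm(y ~ x1, family = binomial))$coefficients["x1", "Std. Error"]
c(both = se_both, alone = se_alone)
```

Dropping one of the two correlated inputs, as Models 2 and 3 do, removes this ambiguity.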

Create Model 2, which is Model 1 without the independent variable "loudness". This can be done with the following command:
```R
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)
```
Here we simply subtracted the variable loudness in the model formula. We couldn't do this with the variables "songtitle" and "artistname", because they are not numeric variables, and the test set might contain values that the training set has never seen. Subtracting a variable in the model formula will always work, however, when the variable you want to remove is numeric.
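
A quick way to confirm that a variable was excluded by the formula is to inspect the names of the fitted coefficients. The sketch below uses simulated data; `toy` and its columns `x1`, `x2`, `x3`, `y` are hypothetical.

```R
set.seed(7)

# Simulated data: three numeric inputs and a binary outcome
toy = data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
toy$y = rbinom(40, 1, 0.5)

# "." selects every input, "- x2" then removes x2 from the model
fit = glm(y ~ . - x2, data = toy, family = binomial)
names(coef(fit))
```

Note that the column itself stays in the data frame; only the model ignores it.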

In [16]:
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family='binomial')
summary(SongsLog2)


Call:
glm(formula = Top10 ~ . - loudness, family = "binomial", data = SongsTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0983  -0.5607  -0.3602  -0.1902   3.3107  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -2.241e+00  7.465e-01  -3.002 0.002686 ** 
timesignature             1.625e-01  8.734e-02   1.860 0.062873 .  
timesignature_confidence  6.885e-01  1.924e-01   3.578 0.000346 ***
tempo                     5.521e-04  1.665e-03   0.332 0.740226    
tempo_confidence          5.497e-01  1.407e-01   3.906 9.40e-05 ***
key                       1.740e-02  1.026e-02   1.697 0.089740 .  
key_confidence            2.954e-01  1.394e-01   2.118 0.034163 *  
energy                    1.813e-01  2.608e-01   0.695 0.486991    
pitch                    -5.150e+01  6.857e+00  -7.511 5.87e-14 ***
timbre_0_min              2.479e-02  4.240e-03   5.847 5.01e-09 ***
timbre_0_max             -1.007e-01  1.1

In [17]:
SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family='binomial')
summary(SongsLog3)


Call:
glm(formula = Top10 ~ . - energy, family = "binomial", data = SongsTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9182  -0.5417  -0.3481  -0.1874   3.4171  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               1.196e+01  1.714e+00   6.977 3.01e-12 ***
timesignature             1.151e-01  8.726e-02   1.319 0.187183    
timesignature_confidence  7.143e-01  1.946e-01   3.670 0.000242 ***
loudness                  2.306e-01  2.528e-02   9.120  < 2e-16 ***
tempo                    -6.460e-04  1.665e-03  -0.388 0.698107    
tempo_confidence          3.841e-01  1.398e-01   2.747 0.006019 ** 
key                       1.649e-02  1.035e-02   1.593 0.111056    
key_confidence            3.394e-01  1.409e-01   2.409 0.015984 *  
pitch                    -5.328e+01  6.733e+00  -7.914 2.49e-15 ***
timbre_0_min              2.205e-02  4.239e-03   5.200 1.99e-07 ***
timbre_0_max             -3.105e-01  2.537

### Validating Our Model

In [18]:
testPred = predict(SongsLog3, newdata=SongsTest, type='response')
table(SongsTest$Top10, testPred >= 0.45)

   
    FALSE TRUE
  0   309    5
  1    40   19

In [19]:
(309 + 19) / nrow(SongsTest)

In [23]:
ntop10 = subset(SongsTest, Top10 == 0)

In [25]:
nrow(ntop10) / nrow(SongsTest)

It seems that Model 3 gives us a small improvement over the baseline model. Still, does it create an edge?
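
Both figures in this comparison can be reproduced from the confusion matrix alone. The sketch below copies the four counts from the output above into a matrix named `cm` (a name chosen here for illustration) and computes the model accuracy alongside the accuracy of the baseline that always predicts "not Top 10".

```R
# Counts copied from the confusion matrix above
# (matrix() fills column by column: FALSE column first, then TRUE column)
cm = matrix(c(309, 40, 5, 19), nrow = 2,
            dimnames = list(actual = c("0", "1"),
                            predicted = c("FALSE", "TRUE")))

n = sum(cm)                       # total number of test songs in the matrix
accuracy = sum(diag(cm)) / n      # (309 + 19) / n
baseline = sum(cm["0", ]) / n     # share of songs that did not reach the Top 10

round(c(accuracy = accuracy, baseline = baseline), 4)
```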

Let's view the two models from an investment perspective. A production company is interested in investing in songs that are highly likely to make it to the Top 10. The company's objective is to minimize its risk of financial losses attributed to investing in songs that end up unpopular.

A competitive edge can therefore be achieved if we can provide the production company a list of songs that are highly likely to end up in the Top 10. We note that the baseline model does not prove useful, as it simply does not label any song as a hit. Let us see what our model has to offer.

In [26]:
sensitivity = 19 / (40 + 19)
sensitivity

In [27]:
specificity = 309 / (309 + 5)
specificity
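
The two ratios above follow the usual definitions: sensitivity is the share of actual hits that the model catches, and specificity is the share of non-hits that it correctly leaves alone. A small helper makes the roles of the four counts explicit; `sens_spec` and its argument names are hypothetical and not part of the original notebook.

```R
# tn, fp, fn, tp: true negatives, false positives, false negatives, true positives
sens_spec = function(tn, fp, fn, tp) {
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Counts taken from the Model 3 confusion matrix at the 0.45 threshold
result = sens_spec(tn = 309, fp = 5, fn = 40, tp = 19)
round(result, 3)
```

The high specificity reflects that only 5 non-hits were flagged as hits, while the low sensitivity reflects that 40 of the 59 actual hits were missed.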