#### POPULARITY OF MUSIC RECORDS

The music industry is a well-developed market with global annual revenue of around $15 billion. The recording industry is highly competitive and is dominated by three big production companies, which together account for nearly 82% of total annual album sales.

Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.

Unfortunately, the success of an artist's release is highly uncertain: one single may be extremely popular, resulting in widespread radio play and digital downloads, while another may turn out to be unpopular, and therefore unprofitable.

Given the competitive nature of the recording industry, record labels face a fundamental decision problem: which musical releases to support in order to maximize their financial success.

How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song's properties to predict its popularity. The dataset songs.csv consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn't make the Top 10. The data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

Here's a detailed description of the variables:

- year = the year the song was released
- songtitle = the title of the song
- artistname = the name of the artist of the song
- songID and artistID = identifying variables for the song and artist
- timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
- loudness = a continuous variable indicating the average amplitude of the audio in decibels
- tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
- key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate
- energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
- pitch = a continuous variable that indicates the pitch of the song
- timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
- Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)

In [45]:
# Owen Wichiencharoen's standard Python Imports:

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import seaborn as sns
sns.set_style('darkgrid')

from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.api as sm

import matplotlib.pyplot as plt
%matplotlib inline


# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import datasets, metrics
import sklearn.linear_model as lm

# from sklearn.tree import DecisionTreeClassifier, export_graphviz
# from sklearn.ensemble import RandomForestClassifier
# import pydot
# from os import system
# from sklearn.externals.six import StringIO
# from IPython.display import Image

#import itertools
#import pandas_datareader.data as pdweb
#from pandas_datareader.data import DataReader
#from datetime import datetime
#from io import StringIO

In [2]:
df_raw = pd.read_csv('../data/songs.csv', encoding='latin-1')
df_raw[:5]

,year,songtitle,artistname,songID,artistID,timesignature,timesignature_confidence,loudness,tempo,tempo_confidence,...,timbre_7_max,timbre_8_min,timbre_8_max,timbre_9_min,timbre_9_max,timbre_10_min,timbre_10_max,timbre_11_min,timbre_11_max,Top10
0,2010,This Is the House That Doubt Built,A Day to Remember,SOBGGAB12C5664F054,AROBSHL1187B9AFB01,3,0.853,-4.262,91.525,0.953,...,82.475,-52.025,39.116,-35.368,71.642,-126.44,18.658,-44.77,25.989,0
1,2010,Sticks & Bricks,A Day to Remember,SOPAQHU1315CD47F31,AROBSHL1187B9AFB01,4,1.0,-4.051,140.048,0.921,...,106.918,-61.32,35.378,-81.928,74.574,-103.808,121.935,-38.892,22.513,0
2,2010,All I Want,A Day to Remember,SOOIZOU1376E7C6386,AROBSHL1187B9AFB01,4,1.0,-3.571,160.512,0.489,...,80.621,-59.773,45.979,-46.293,59.904,-108.313,33.3,-43.733,25.744,0
3,2010,It's Complicated,A Day to Remember,SODRYWD1315CD49DBE,AROBSHL1187B9AFB01,4,1.0,-3.815,97.525,0.794,...,96.675,-78.66,41.088,-49.194,95.44,-102.676,46.422,-59.439,37.082,0
4,2010,2nd Sucks,A Day to Remember,SOICMQB1315CD46EE3,AROBSHL1187B9AFB01,4,0.788,-4.707,140.053,0.286,...,110.332,-56.45,37.555,-48.588,67.57,-52.796,22.888,-50.414,32.758,0


In [3]:
# How many 2010 songs are there in this dataset?

df_raw[df_raw['year']==2010].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 373 entries, 0 to 372
Data columns (total 39 columns):
year                        373 non-null int64
songtitle                   373 non-null object
artistname                  373 non-null object
songID                      373 non-null object
artistID                    373 non-null object
timesignature               373 non-null int64
timesignature_confidence    373 non-null float64
loudness                    373 non-null float64
tempo                       373 non-null float64
tempo_confidence            373 non-null float64
key                         373 non-null int64
key_confidence              373 non-null float64
energy                      373 non-null float64
pitch                       373 non-null float64
timbre_0_min                373 non-null float64
timbre_0_max                373 non-null float64
timbre_1_min                373 non-null float64
timbre_1_max                373 non-null float64
timbre_2_min           

In [4]:
# How many Michael Jackson songs are there in this data set?

len(df_raw[df_raw['artistname']=='Michael Jackson'])

18

In [5]:
# How many MJ songs made top 10?

pd.crosstab(index=df_raw[df_raw['artistname']=='Michael Jackson']['artistname'],columns=df_raw['Top10'])

Top10             0  1
artistname
Michael Jackson  13  5


In [6]:
# Which MJ songs made top 10?

df_raw[(df_raw['artistname']=='Michael Jackson') & (df_raw['Top10']==1)]

,year,songtitle,artistname,songID,artistID,timesignature,timesignature_confidence,loudness,tempo,tempo_confidence,...,timbre_7_max,timbre_8_min,timbre_8_max,timbre_9_min,timbre_9_max,timbre_10_min,timbre_10_max,timbre_11_min,timbre_11_max,Top10
4328,2001,You Rock My World,Michael Jackson,SOBLCOF13134393021,ARXPPEY1187FB51DF4,4,1.0,-2.768,95.003,0.892,...,120.076,-53.839,63.576,-85.169,84.84,-102.185,55.266,-48.107,56.116,1
6206,1995,You Are Not Alone,Michael Jackson,SOJKNNO13737CEB162,ARXPPEY1187FB51DF4,4,1.0,-9.408,120.566,0.805,...,90.735,-61.583,60.92,-55.904,76.632,-69.799,46.173,-67.281,47.128,1
6209,1995,Black or White,Michael Jackson,SOBBRFO137756C9CB7,ARXPPEY1187FB51DF4,4,1.0,-4.017,115.027,0.535,...,107.974,-55.063,52.505,-110.999,71.477,-133.939,60.442,-55.008,43.473,1
6217,1995,Remember the Time,Michael Jackson,SOIQZMT136C9704DA5,ARXPPEY1187FB51DF4,4,1.0,-3.633,107.921,1.0,...,146.587,-58.117,62.157,-54.44,94.501,-112.348,90.437,-53.634,51.681,1
6914,1992,In The Closet,Michael Jackson,SOKIOOC12AF729ED9E,ARXPPEY1187FB51DF4,4,0.991,-4.315,110.501,0.949,...,124.354,-78.303,41.322,-83.184,106.263,-136.109,102.829,-48.192,74.575,1


In [7]:
# What are the values of timesignature that occur in our dataset?
# Which timesignature value is the most frequent among songs in our dataset?

df_raw['timesignature'].value_counts()

4    6787
3     503
1     143
5     112
7      19
0      10
Name: timesignature, dtype: int64

In [8]:
# What is the song with the highest tempo in this dataset?

df_raw[df_raw['tempo']==df_raw['tempo'].max()]

,year,songtitle,artistname,songID,artistID,timesignature,timesignature_confidence,loudness,tempo,tempo_confidence,...,timbre_7_max,timbre_8_min,timbre_8_max,timbre_9_min,timbre_9_max,timbre_10_min,timbre_10_max,timbre_11_min,timbre_11_max,Top10
6205,1995,Wanna Be Startin' Somethin',Michael Jackson,SONHIQM13738B7BE80,ARXPPEY1187FB51DF4,3,1.0,-14.528,244.307,0.566,...,93.6,-52.012,95.827,-63.554,84.129,-53.492,67.001,-73.421,67.308,0


#### Creating Our Prediction Model

We wish to predict whether or not a song will make it to the Top 10.

To do this, first split the data (here by filtering the DataFrame on year) into a training set "SongsTrain", consisting of all the observations up to and including 2009 song releases, and a testing set "SongsTest", consisting of the 2010 song releases.

How many observations (songs) are in the training set?

In [9]:
SongsTrain = df_raw[df_raw['year']<=2009]
SongsTest = df_raw[df_raw['year']>2009]

len(SongsTrain)

7201

In this problem, our outcome variable is "Top10" - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model. We'll start by using all song attributes as our independent variables, which we'll call Model 1.

We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model. So we won't use the variables "year", "songtitle", "artistname", "songID" or "artistID".

Now build a logistic regression model (here a binomial GLM in statsmodels) to predict Top10 using all of the other variables as the independent variables. You should use SongsTrain to build the model.

Looking at the summary of your model, what is the value of the Akaike Information Criterion (AIC)?
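For intuition about the model being built here, the following is a minimal sketch of how a logistic regression turns a linear combination of features into a probability. The function names, the two-feature setup, and the coefficient values are hypothetical and chosen only for illustration; they are not the fitted values from this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def top10_probability(intercept, coefs, features):
    """P(Top10 = 1) for one song: sigmoid(intercept + sum(coef * feature))."""
    score = intercept + sum(c * x for c, x in zip(coefs, features))
    return sigmoid(score)

# Hypothetical coefficients and feature values (illustration only)
p = top10_probability(intercept=-2.0, coefs=[0.3, 0.7], features=[-4.0, 1.0])
print(round(p, 4))
```

A positive coefficient raises the score, and therefore the predicted probability, as the feature increases; a negative coefficient lowers it.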

In [10]:
# Reassign instead of using inplace=True on a slice of df_raw; this avoids SettingWithCopyWarning
SongsTrain = SongsTrain.drop(["year", "songtitle", "artistname", "songID", "artistID"], axis=1)
SongsTest = SongsTest.drop(["year", "songtitle", "artistname", "songID", "artistID"], axis=1)
SongsTest.columns

Index(['timesignature', 'timesignature_confidence', 'loudness', 'tempo',
       'tempo_confidence', 'key', 'key_confidence', 'energy', 'pitch',
       'timbre_0_min', 'timbre_0_max', 'timbre_1_min', 'timbre_1_max',
       'timbre_2_min', 'timbre_2_max', 'timbre_3_min', 'timbre_3_max',
       'timbre_4_min', 'timbre_4_max', 'timbre_5_min', 'timbre_5_max',
       'timbre_6_min', 'timbre_6_max', 'timbre_7_min', 'timbre_7_max',
       'timbre_8_min', 'timbre_8_max', 'timbre_9_min', 'timbre_9_max',
       'timbre_10_min', 'timbre_10_max', 'timbre_11_min', 'timbre_11_max',
       'Top10'],
      dtype='object')
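The column list above can also be turned into a model formula programmatically rather than typed out by hand. The snippet below is a sketch under that assumption: `build_formula` is a hypothetical helper, and the column names are rebuilt inline so the example runs on its own.

```python
# Predictor names as printed above, rebuilt here so the snippet is self-contained
base = ['timesignature', 'timesignature_confidence', 'loudness', 'tempo',
        'tempo_confidence', 'key', 'key_confidence', 'energy', 'pitch']
timbre = ['timbre_{}_{}'.format(i, stat)
          for i in range(12) for stat in ('min', 'max')]
predictors = base + timbre

def build_formula(outcome, predictors, exclude=()):
    """Return an 'outcome ~ a + b + ...' string, skipping excluded names."""
    kept = [name for name in predictors if name not in exclude]
    return '{} ~ {}'.format(outcome, ' + '.join(kept))

formula_all = build_formula('Top10', predictors)
formula_no_energy = build_formula('Top10', predictors, exclude=('energy',))
print(len(predictors))
print(formula_no_energy)
```

In the notebook itself, `predictors` could instead be taken from the DataFrame, for example `[c for c in SongsTrain.columns if c != 'Top10']`.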

In [11]:
# Let's use statsmodels, so we can compute the AIC from llf (the fitted model's log-likelihood) ...
formula = 'Top10 ~ timesignature + timesignature_confidence + loudness + tempo + tempo_confidence + key + key_confidence + energy + pitch + timbre_0_min + timbre_0_max + timbre_1_min + timbre_1_max + timbre_2_min + timbre_2_max + timbre_3_min + timbre_3_max + timbre_4_min + timbre_4_max + timbre_5_min + timbre_5_max + timbre_6_min + timbre_6_max + timbre_7_min + timbre_7_max + timbre_8_min + timbre_8_max + timbre_9_min + timbre_9_max + timbre_10_min + timbre_10_max + timbre_11_min + timbre_11_max'
LogReg0 = smf.glm(formula=formula, data=SongsTrain, family=sm.families.Binomial()).fit()
LogReg0.summary()

Dep. Variable:,Top10,No. Observations:,7201.0
Model:,GLM,Df Residuals:,7167.0
Model Family:,Binomial,Df Model:,33.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-2379.6
Date:,"Sun, 07 Aug 2016",Deviance:,4759.2
Time:,02:06:58,Pearson chi2:,6700.0
No. Iterations:,9,,

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,14.7000,1.806,8.138,0.000,11.160 18.240
timesignature,0.1264,0.087,1.457,0.145,-0.044 0.296
timesignature_confidence,0.7450,0.195,3.814,0.000,0.362 1.128
loudness,0.2999,0.029,10.282,0.000,0.243 0.357
tempo,0.0004,0.002,0.215,0.830,-0.003 0.004
tempo_confidence,0.4732,0.142,3.328,0.001,0.195 0.752
key,0.0159,0.010,1.529,0.126,-0.004 0.036
key_confidence,0.3087,0.141,2.187,0.029,0.032 0.585
energy,-1.5021,0.310,-4.847,0.000,-2.110 -0.895


In [12]:
print(LogReg0.params)

Intercept                   14.699988
timesignature                0.126395
timesignature_confidence     0.744992
loudness                     0.299879
tempo                        0.000363
tempo_confidence             0.473227
key                          0.015882
key_confidence               0.308675
energy                      -1.502144
pitch                      -44.907740
timbre_0_min                 0.023159
timbre_0_max                -0.330982
timbre_1_min                 0.005881
timbre_1_max                -0.000245
timbre_2_min                -0.002127
timbre_2_max                 0.000659
timbre_3_min                 0.000692
timbre_3_max                -0.002967
timbre_4_min                 0.010396
timbre_4_max                 0.006111
timbre_5_min                -0.005598
timbre_5_max                 0.000077
timbre_6_min                -0.016856
timbre_6_max                 0.003668
timbre_7_min                -0.004549
timbre_7_max                -0.003774
timbre_8_min

In [13]:
# AIC = 2k - 2*ln(L), where k = 34 estimated parameters (33 predictors + intercept)
print('AIC is: ', 2*(33+1) - 2*LogReg0.llf)

AIC is:  4827.15410239
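As a quick cross-check of the arithmetic, the AIC can be recomputed from the rounded log-likelihoods printed in the model summaries. The helper name `aic` is an assumption for this sketch; the inputs are the values reported in this notebook (Model 1: log-likelihood -2379.6 with 33 predictors; the model without energy: -2391.4 with 32 predictors).

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Log-likelihoods as printed (rounded) in the summaries; k counts the intercept too
aic_model1 = aic(-2379.6, 33 + 1)
aic_model3 = aic(-2391.4, 32 + 1)
print(round(aic_model1, 1), round(aic_model3, 1))
```

A lower AIC indicates a better trade-off between fit and number of parameters. The fitted statsmodels GLM results object also exposes this value directly through its `aic` attribute.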


#### Let's try sklearn linear model - because Stefan Jansen says it's better

In [14]:
LogReg = lm.LogisticRegression()
X = SongsTrain[['timesignature', 'timesignature_confidence', 'loudness', 'tempo',
       'tempo_confidence', 'key', 'key_confidence', 'energy', 'pitch',
       'timbre_0_min', 'timbre_0_max', 'timbre_1_min', 'timbre_1_max',
       'timbre_2_min', 'timbre_2_max', 'timbre_3_min', 'timbre_3_max',
       'timbre_4_min', 'timbre_4_max', 'timbre_5_min', 'timbre_5_max',
       'timbre_6_min', 'timbre_6_max', 'timbre_7_min', 'timbre_7_max',
       'timbre_8_min', 'timbre_8_max', 'timbre_9_min', 'timbre_9_max',
       'timbre_10_min', 'timbre_10_max', 'timbre_11_min', 'timbre_11_max']]
# X = sm.add_constant(X)
y = SongsTrain['Top10']
LogReg.fit(X,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [15]:
LogReg_results = pd.DataFrame({'Indep_vars': X.columns,'LogReg_coefs': LogReg.coef_.flatten()})
print(LogReg_results)

                  Indep_vars  LogReg_coefs
0              timesignature      0.251621
1   timesignature_confidence      0.735213
2                   loudness      0.109770
3                      tempo      0.001656
4           tempo_confidence      0.553166
5                        key      0.018453
6             key_confidence      0.348524
7                     energy     -0.905281
8                      pitch     -0.129706
9               timbre_0_min      0.024806
10              timbre_0_max     -0.154196
11              timbre_1_min      0.005909
12              timbre_1_max     -0.000576
13              timbre_2_min     -0.003374
14              timbre_2_max      0.000209
15              timbre_3_min      0.000702
16              timbre_3_max     -0.003026
17              timbre_4_min      0.007977
18              timbre_4_max      0.007225
19              timbre_5_min     -0.006468
20              timbre_5_max      0.000404
21              timbre_6_min     -0.017489
22         

In [16]:
LogReg.intercept_

array([ 0.36153456])
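The sklearn coefficients and intercept differ noticeably from the statsmodels fit (for example, an intercept of about 0.36 versus 14.70, and a pitch coefficient of about -0.13 versus -44.9). A likely reason, visible in the estimator settings printed above (`penalty='l2'`, `C=1.0`, `solver='liblinear'`), is that sklearn's `LogisticRegression` fits a penalized model by default, whereas the GLM is an unpenalized maximum-likelihood fit. As a sketch, with the labels recoded as y_i in {-1, +1}, the L2-penalized objective described in the sklearn documentation is:

```latex
\min_{w,\,c}\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + c\right)\right)\right)
```

Larger values of C weaken the penalty, and the unpenalized fit corresponds to the limit of very large C. Because the features are on very different scales (pitch values are around 0.01 while timbre values are in the tens), the penalty shrinks a coefficient such as pitch far more than the others.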

In [17]:
LogReg_data = SongsTrain.copy()
LogReg_data['probability'] = LogReg.predict_proba(SongsTrain.drop('Top10',axis=1)).T[1]
LogReg_data[:5]

,timesignature,timesignature_confidence,loudness,tempo,tempo_confidence,key,key_confidence,energy,pitch,timbre_0_min,...,timbre_8_min,timbre_8_max,timbre_9_min,timbre_9_max,timbre_10_min,timbre_10_max,timbre_11_min,timbre_11_max,Top10,probability
373,3,0.732,-6.32,89.614,0.652,1,0.773,0.598529,0.004,0.0,...,-71.776,58.432,-53.816,88.571,-89.816,38.026,-52.075,52.827,0,0.060917
374,3,0.906,-9.541,117.742,0.542,0,0.722,0.363399,0.006,0.739,...,-64.47,58.086,-76.937,74.441,-88.244,42.209,-66.812,40.749,0,0.039069
375,4,0.987,-4.842,119.018,0.838,6,0.106,0.760151,0.003,0.0,...,-52.459,40.679,-50.408,58.811,-78.239,35.264,-54.2,46.49,0,0.02537
376,4,0.822,-5.272,71.479,0.613,4,0.781,0.755034,0.014,0.0,...,-55.811,78.963,-51.504,70.455,-74.928,30.839,-51.377,27.768,0,0.047078
377,4,0.983,-6.233,77.492,0.74,8,0.552,0.523658,0.008,0.0,...,-61.392,50.309,-62.994,96.837,-90.397,60.549,-52.122,48.059,0,0.135712


Let's now think about the variables in our dataset related to the confidence of the time signature, key and tempo (timesignature_confidence, key_confidence, and tempo_confidence). Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key and tempo themselves). What does the model suggest?

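Answer: The coefficients on all three confidence variables are positive and statistically significant, so the model suggests that the higher the confidence in a song's time signature, key, and tempo, the more likely the song is to reach the Top 10.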

In [18]:
np.corrcoef(SongsTrain['loudness'],SongsTrain['energy'])

array([[ 1.        ,  0.73990671],
       [ 0.73990671,  1.        ]])

Given that these two variables are highly correlated, LogReg suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and rerun the logistic regression. In the rest of this problem, we'll build two variations of our original model: LogReg2, in which we keep "energy" and omit "loudness", and LogReg3 (Model3), in which we keep "loudness" and omit "energy".
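The same check can be extended to every pair of predictors instead of one hand-picked pair. The sketch below is a plain-Python illustration of that idea; `pearson` and `correlated_pairs` are hypothetical helper names, the 0.7 cutoff is an arbitrary assumption, and the values are the first five training rows shown earlier, so the resulting correlations will not match the full-data figure of 0.74.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlated_pairs(columns, threshold=0.7):
    """List (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# First five training rows from the table above (illustration only)
sample = {
    'loudness': [-6.32, -9.541, -4.842, -5.272, -6.233],
    'energy': [0.598529, 0.363399, 0.760151, 0.755034, 0.523658],
    'tempo': [89.614, 117.742, 119.018, 71.479, 77.492],
}
pairs = correlated_pairs(sample)
for a, b, r in pairs:
    print(a, b, round(r, 2))
```

With pandas, `SongsTrain.corr()` gives the full correlation matrix in one call; the loop above only spells out what is being compared.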

In [19]:
LogReg2 = lm.LogisticRegression()
X = SongsTrain[['timesignature', 'timesignature_confidence', 'tempo',
       'tempo_confidence', 'key', 'key_confidence', 'energy', 'pitch',
       'timbre_0_min', 'timbre_0_max', 'timbre_1_min', 'timbre_1_max',
       'timbre_2_min', 'timbre_2_max', 'timbre_3_min', 'timbre_3_max',
       'timbre_4_min', 'timbre_4_max', 'timbre_5_min', 'timbre_5_max',
       'timbre_6_min', 'timbre_6_max', 'timbre_7_min', 'timbre_7_max',
       'timbre_8_min', 'timbre_8_max', 'timbre_9_min', 'timbre_9_max',
       'timbre_10_min', 'timbre_10_max', 'timbre_11_min', 'timbre_11_max']]
y = SongsTrain['Top10']
LogReg2.fit(X,y)

LogReg2_results = pd.DataFrame({'Indep_vars': X.columns,'LogReg2_coefs': LogReg2.coef_.flatten()})
print(LogReg2_results)
print("intercept : ",LogReg2.intercept_)

                  Indep_vars  LogReg2_coefs
0              timesignature       0.104796
1   timesignature_confidence       0.725203
2                      tempo      -0.000553
3           tempo_confidence       0.530720
4                        key       0.014666
5             key_confidence       0.325711
6                     energy      -0.200724
7                      pitch      -1.121612
8               timbre_0_min       0.022986
9               timbre_0_max      -0.098995
10              timbre_1_min       0.006435
11              timbre_1_max      -0.000792
12              timbre_2_min      -0.003402
13              timbre_2_max       0.000322
14              timbre_3_min       0.000756
15              timbre_3_max      -0.002799
16              timbre_4_min       0.008183
17              timbre_4_max       0.007223
18              timbre_5_min      -0.006651
19              timbre_5_max       0.000643
20              timbre_6_min      -0.016335
21              timbre_6_max    

In [20]:
LogReg3 = lm.LogisticRegression()
X = SongsTrain[['timesignature', 'timesignature_confidence', 'loudness', 'tempo',
       'tempo_confidence', 'key', 'key_confidence', 'pitch',
       'timbre_0_min', 'timbre_0_max', 'timbre_1_min', 'timbre_1_max',
       'timbre_2_min', 'timbre_2_max', 'timbre_3_min', 'timbre_3_max',
       'timbre_4_min', 'timbre_4_max', 'timbre_5_min', 'timbre_5_max',
       'timbre_6_min', 'timbre_6_max', 'timbre_7_min', 'timbre_7_max',
       'timbre_8_min', 'timbre_8_max', 'timbre_9_min', 'timbre_9_max',
       'timbre_10_min', 'timbre_10_max', 'timbre_11_min', 'timbre_11_max']]
y = SongsTrain['Top10']
LogReg3.fit(X,y)

LogReg3_results = pd.DataFrame({'Indep_vars': X.columns,'LogReg3_coefs': LogReg3.coef_.flatten()})
print(LogReg3_results)
print("intercept : ",LogReg3.intercept_)

                  Indep_vars  LogReg3_coefs
0              timesignature       0.236011
1   timesignature_confidence       0.720563
2                   loudness       0.098730
3                      tempo       0.000510
4           tempo_confidence       0.408968
5                        key       0.017520
6             key_confidence       0.323141
7                      pitch      -0.358035
8               timbre_0_min       0.023101
9               timbre_0_max      -0.172972
10              timbre_1_min       0.005301
11              timbre_1_max      -0.000715
12              timbre_2_min      -0.003812
13              timbre_2_max       0.000039
14              timbre_3_min       0.000431
15              timbre_3_max      -0.003135
16              timbre_4_min       0.008652
17              timbre_4_max       0.007568
18              timbre_5_min      -0.006217
19              timbre_5_max       0.000494
20              timbre_6_min      -0.018353
21              timbre_6_max    

In [42]:
LogReg3_predictions = LogReg3.predict_proba(SongsTest[X.columns]).T[1]

threshold = 0.45
LogReg3_pred_0_1 = (LogReg3_predictions > threshold)
pd.crosstab(index=SongsTest['Top10'], columns=LogReg3_pred_0_1)

col_0  False  True
Top10
0        311     3
1         47    12
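The threshold controls how predicted probabilities become 0/1 predictions, and therefore how the confusion matrix is filled. The following sketch shows that step in plain Python; `confusion_counts` is a hypothetical helper, and the labels and probabilities are made up for illustration rather than taken from the model.

```python
def confusion_counts(y_true, probs, threshold=0.45):
    """Return (tn, fp, fn, tp), labeling a song positive when prob > threshold."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, probs):
        predicted_hit = p > threshold
        if predicted_hit and y == 1:
            tp += 1
        elif predicted_hit:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up labels and probabilities (illustration only)
y = [0, 0, 0, 1, 1, 1]
probs = [0.05, 0.30, 0.50, 0.20, 0.46, 0.90]
for t in (0.25, 0.45, 0.60):
    print(t, confusion_counts(y, probs, threshold=t))
```

Raising the threshold trades false positives for false negatives, which is why a fairly high cutoff such as 0.45 yields few predicted hits when most predicted probabilities are small.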


#### Now let's repeat and do it with statsmodels

In [25]:
formula = 'Top10 ~ timesignature + timesignature_confidence + loudness + tempo + tempo_confidence + key + key_confidence + pitch + timbre_0_min + timbre_0_max + timbre_1_min + timbre_1_max + timbre_2_min + timbre_2_max + timbre_3_min + timbre_3_max + timbre_4_min + timbre_4_max + timbre_5_min + timbre_5_max + timbre_6_min + timbre_6_max + timbre_7_min + timbre_7_max + timbre_8_min + timbre_8_max + timbre_9_min + timbre_9_max + timbre_10_min + timbre_10_max + timbre_11_min + timbre_11_max'
model3 = smf.glm(formula=formula, data=SongsTrain, family=sm.families.Binomial()).fit()
model3.summary()

Dep. Variable:,Top10,No. Observations:,7201.0
Model:,GLM,Df Residuals:,7168.0
Model Family:,Binomial,Df Model:,32.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-2391.4
Date:,"Sun, 07 Aug 2016",Deviance:,4782.7
Time:,02:09:52,Pearson chi2:,7000.0
No. Iterations:,9,,

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,11.9606,1.714,6.977,0.000,8.601 15.320
timesignature,0.1151,0.087,1.319,0.187,-0.056 0.286
timesignature_confidence,0.7143,0.195,3.670,0.000,0.333 1.096
loudness,0.2306,0.025,9.120,0.000,0.181 0.280
tempo,-0.0006,0.002,-0.388,0.698,-0.004 0.003
tempo_confidence,0.3841,0.140,2.747,0.006,0.110 0.658
key,0.0165,0.010,1.593,0.111,-0.004 0.037
key_confidence,0.3394,0.141,2.409,0.016,0.063 0.616
pitch,-53.2841,6.733,-7.914,0.000,-66.480 -40.088


In [26]:
print(len(X.columns))
print(len(model3.params))
print(len(model3.predict()))

32
33
7201


In [27]:
temp = ['intercept'] + list(X.columns)
temp

['intercept',
 'timesignature',
 'timesignature_confidence',
 'loudness',
 'tempo',
 'tempo_confidence',
 'key',
 'key_confidence',
 'pitch',
 'timbre_0_min',
 'timbre_0_max',
 'timbre_1_min',
 'timbre_1_max',
 'timbre_2_min',
 'timbre_2_max',
 'timbre_3_min',
 'timbre_3_max',
 'timbre_4_min',
 'timbre_4_max',
 'timbre_5_min',
 'timbre_5_max',
 'timbre_6_min',
 'timbre_6_max',
 'timbre_7_min',
 'timbre_7_max',
 'timbre_8_min',
 'timbre_8_max',
 'timbre_9_min',
 'timbre_9_max',
 'timbre_10_min',
 'timbre_10_max',
 'timbre_11_min',
 'timbre_11_max']

In [28]:
model3_results = pd.DataFrame({'Indep_vars': temp,'model3_coefs': model3.params})
print(model3_results)

                                        Indep_vars  model3_coefs
Intercept                                intercept     11.960562
timesignature                        timesignature      0.115094
timesignature_confidence  timesignature_confidence      0.714270
loudness                                  loudness      0.230557
tempo                                        tempo     -0.000646
tempo_confidence                  tempo_confidence      0.384093
key                                            key      0.016495
key_confidence                      key_confidence      0.339406
pitch                                        pitch    -53.284058
timbre_0_min                          timbre_0_min      0.022045
timbre_0_max                          timbre_0_max     -0.310480
timbre_1_min                          timbre_1_min      0.005416
timbre_1_max                          timbre_1_max     -0.000511
timbre_2_min                          timbre_2_min     -0.002254
timbre_2_max             

In [38]:
# The formula selects the columns it needs from SongsTest, so no manual column list is required;
# predict() returns probabilities (the response scale) by default.
model3_predictions = model3.predict(SongsTest)

threshold = 0.45
model3_pred_0_1 = (model3_predictions>threshold)
pd.crosstab(index=SongsTest['Top10'], columns=model3_pred_0_1)

col_0  False  True
Top10
0        309     5
1         40    19


In [43]:
print("accuracy : ",(19+309)/(19+309+5+40))
print("sensitivity : ",19/59)
print("specificity : ",309/314)

accuracy :  0.8793565683646113
sensitivity :  0.3220338983050847
specificity :  0.9840764331210191
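The hand-typed arithmetic above can be wrapped in a small helper so the same metrics are computed consistently for any confusion matrix. This is a sketch: `confusion_metrics` is a hypothetical name, and the counts are copied from the two confusion matrices in this notebook (statsmodels Model 3 and the sklearn LogReg3 model, both at threshold 0.45).

```python
def confusion_metrics(tn, fp, fn, tp):
    """Standard rates derived from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,
        'sensitivity': tp / (tp + fn),   # recall, true positive rate
        'specificity': tn / (tn + fp),   # true negative rate
        'precision': tp / (tp + fp),     # share of predicted hits that are real
    }

# Counts from the statsmodels Model 3 confusion matrix (threshold 0.45)
glm_metrics = confusion_metrics(tn=309, fp=5, fn=40, tp=19)
# Counts from the sklearn LogReg3 confusion matrix (same threshold)
sk_metrics = confusion_metrics(tn=311, fp=3, fn=47, tp=12)

for name, m in [('statsmodels', glm_metrics), ('sklearn', sk_metrics)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Sensitivity here is the same quantity that `metrics.recall_score` reports.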


#### So is this model good or bad?

High specificity and low sensitivity mean:

Model 3 provides conservative predictions: it very rarely predicts that a song will make it to the Top 10. So while it detects only about a third of the Top 10 songs (19 of 59), we can be fairly confident in the songs that it does predict to be Top 10 hits (19 of its 24 positive predictions were correct).
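One way to approach the "good or bad" question is to compare the model against a trivial baseline that always predicts "not Top 10". The sketch below does that with the test-set counts from the Model 3 confusion matrix; choosing this particular baseline is an assumption of the example, not something prescribed elsewhere in the notebook.

```python
# Test-set counts taken from the Model 3 confusion matrix
n_not_top10 = 309 + 5    # songs that did not reach the Top 10
n_top10 = 40 + 19        # songs that did reach the Top 10
n_total = n_not_top10 + n_top10

# Baseline: predict "not Top 10" for every song
baseline_accuracy = n_not_top10 / n_total
# Model 3: correct predictions are the true negatives plus the true positives
model_accuracy = (309 + 19) / n_total

print(round(baseline_accuracy, 4), round(model_accuracy, 4))
```

Because most songs are not hits, accuracy alone flatters any model on this data; the baseline never identifies a single hit, so sensitivity and precision are the more informative comparison.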

In [46]:
metrics.roc_auc_score(SongsTest['Top10'], model3_predictions)

0.84896901651732692

In [50]:
metrics.average_precision_score(SongsTest['Top10'], model3_predictions)

0.5820177599773827

In [53]:
metrics.recall_score(SongsTest['Top10'], model3_pred_0_1)

0.32203389830508472

In [55]:
metrics.accuracy_score(SongsTest['Top10'], model3_pred_0_1)

0.87935656836461129

In [56]:
metrics.precision_score(SongsTest['Top10'], model3_pred_0_1)

0.79166666666666663

In [60]:
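# Is this the same as the log-likelihood (llf)? No: log_loss is the negative log-likelihood
# averaged per observation (here on the test set), while llf is the total on the training set.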

metrics.log_loss(SongsTest['Top10'], model3_predictions)

0.3176440416229917
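To make the relationship between log loss and log-likelihood concrete, the sketch below computes both by hand. The function names are hypothetical and the labels and probabilities are made up; the last line applies the same relationship to the training log-likelihood (-2391.4 over 7201 songs) reported in the statsmodels summary.

```python
import math

def log_likelihood(y_true, p_pred):
    """Total Bernoulli log-likelihood: sum of y*ln(p) + (1 - y)*ln(1 - p)."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(y_true, p_pred))

def mean_log_loss(y_true, p_pred):
    """Negative log-likelihood averaged per observation."""
    return -log_likelihood(y_true, p_pred) / len(y_true)

# Made-up labels and probabilities (illustration only); probabilities must lie strictly in (0, 1)
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(log_likelihood(y, p), mean_log_loss(y, p))

# Per-song training log loss implied by the summary's total log-likelihood
print(round(2391.4 / 7201, 4))
```

The training figure and the 0.3176 test figure are then on the same per-observation scale and can be compared directly.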