Applying Logistic Regression to predict whether the song will be a hit

In [16]:
from pandas import Series, DataFrame
import pandas as pd
from sklearn.model_selection import train_test_split #to split data into test/train 
from patsy import dmatrices #to split data into matrices for model 
from sklearn.linear_model import LogisticRegression #to fit a logistic regression classifier
from sklearn import metrics #to compute accuracy score 
import warnings
warnings.filterwarnings('ignore') #to ignore warnings
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
df = pd.read_csv('No_Outliers_Spotify.csv') #importing data set

The following function takes the data set, a patsy formula relating the independent variables to the dependent variable, and the name of the target variable. It splits the data into training (70%) and test (30%) sets, fits a logistic regression to the training data, and prints the train and test accuracy scores. For comparison, it then prints the test accuracy of a baseline model, which simply predicts the outcome that is most common in the training set. Finally, it prints the fitted weights and intercept.

In [21]:
def train_test(df,formula,y_var):
    Y, X = dmatrices(formula, df, return_type='dataframe') #transform data into matrices
    y = Y[y_var].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split into test and train
    model = LogisticRegression() #logistic regression classifier
    model.fit(X_train, y_train) #fit the classifier to the training data
    prediction_train = model.predict(X_train) #predict model with training data 
    print("Train Accuracy: ",metrics.accuracy_score(y_train, prediction_train)) #compare actual vs predicted scores for train
    prediction = model.predict(X_test) #predict model with test data
    print("Test Accuracy: ",metrics.accuracy_score(y_test, prediction)) #compare actual vs predicted scores for test
    #comparing to baseline
    num_pos = len(y_train[y_train==1])
    num_neg = len(y_train[y_train==0])
    #want max value as that's what baseline would choose:
    if num_pos >= num_neg: #if most common outcome is a 1
        max_val = 1
    else: #if most common outcome is a 0
        max_val = 0
    #compute accuracy for baseline model 
    correct_examples_in_test = len(y_test[y_test==max_val]) 
    total_examples_in_test = len(y_test)
    print('Number of examples where baseline is correct =', correct_examples_in_test)
    print('Baseline accuracy =', correct_examples_in_test * 1.0 / total_examples_in_test)
    print("Variables weights:",model.coef_)
    print("Intercept:",model.intercept_)
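
The majority-class baseline computed by hand in the function can also be obtained from scikit-learn's `DummyClassifier`. The sketch below is only an illustration on synthetic stand-in data (the arrays `X` and `y` are made up, not the Spotify data) and checks that the two approaches agree. On an exact tie between classes the function above picks 1, whereas both versions in the sketch pick the lower label.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 200 rows, roughly 60% positive labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.6).astype(int)

# Same 70/30 split settings as in train_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# DummyClassifier ignores the features and always predicts
# the most frequent label seen in the training set.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Manual equivalent: find the majority training label, then count test matches.
majority_label = int(np.bincount(y_train).argmax())
manual_accuracy = float(np.mean(y_test == majority_label))

print("DummyClassifier baseline accuracy:", baseline_accuracy)
print("Manual baseline accuracy:", manual_accuracy)
```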

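Next, I apply the function to our data, using the independent variables selected by stepwise regression. The `0 +` in the formula drops patsy's own intercept column (scikit-learn fits its own intercept), and `C(mode)` treats `mode` as a categorical variable.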

In [22]:
predictors = 'target ~ 0 + instrumentalness + danceability + loudness + energy + acousticness + C(mode) + time_signature'
train_test(df,predictors,'target')

Train Accuracy:  0.8087357569180683
Test Accuracy:  0.829746835443038
Number of examples where baseline is correct = 818
Baseline accuracy = 0.5177215189873418
Variables weights: [[-0.28406032  0.27791191 -4.94196468  4.69901682  0.41089726 -5.15041623
  -1.38164077  0.54518021]]
Intercept: [2.22234958]
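
Because the weights are printed as a bare array, it can help to pair each one with the name of its design-matrix column. The sketch below does this on a small synthetic data frame; the column names are borrowed from the formula above, but the data and the `toy` and `weights` names are hypothetical.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (made up for illustration only).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "danceability": rng.random(n),
    "energy": rng.random(n),
    "mode": rng.integers(0, 2, size=n),
})
toy["target"] = (toy["danceability"] + 0.3 * rng.normal(size=n) > 0.5).astype(int)

# Same formula style as above: no patsy intercept, mode as categorical.
Y, X = dmatrices("target ~ 0 + danceability + energy + C(mode)",
                 toy, return_type="dataframe")
model = LogisticRegression().fit(X, Y["target"].values)

# coef_ has one row for a binary problem; its entries follow X.columns.
weights = pd.Series(model.coef_[0], index=X.columns)
print(weights)
```

The index of `weights` shows the order in which the design matrix lists its columns, which is the order of the printed weights.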


Overall, our logistic regression model reached 80.87% train accuracy and 82.97% test accuracy, compared with 51.77% test accuracy for the baseline model. The logistic regression model clearly outperforms the baseline.

__Model__

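The fitted weights are on the log-odds scale, so the linear expression gives log(p / (1 - p)), where p is the predicted probability that target = 1, rather than the target itself. The eight weights follow the column order of the design matrix `X`, in which patsy places the two `C(mode)` indicator columns before the numeric variables (assuming `mode` is coded 0/1; the order can be confirmed by printing `X.columns`):

log(p / (1 - p)) = 2.222 - 0.284(mode = 0) + 0.278(mode = 1) - 4.942(instrumentalness) + 4.699(danceability) + 0.411(loudness) - 5.150(energy) - 1.382(acousticness) + 0.545(time_signature)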
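
A logistic regression's linear expression is a log-odds score, and the logistic (sigmoid) function turns it into a probability. The sketch below applies the printed weights and intercept to a single feature row. The row `x_row` is hypothetical (made-up values) and is assumed to list one value per design-matrix column, in the same order as `X.columns`.

```python
import numpy as np

# Weights and intercept copied from the printed model output.
coef = np.array([-0.28406032, 0.27791191, -4.94196468, 4.69901682,
                 0.41089726, -5.15041623, -1.38164077, 0.54518021])
intercept = 2.22234958

# Hypothetical design-matrix row (made-up values), one entry per column of X,
# in the same order as X.columns.
x_row = np.array([0.0, 1.0, 0.0, 0.7, -6.0, 0.8, 0.1, 4.0])

# Linear score on the log-odds scale.
log_odds = intercept + coef @ x_row

# The logistic function maps log-odds to a probability in (0, 1).
probability = 1.0 / (1.0 + np.exp(-log_odds))

# predict() returns class 1 when the log-odds are positive,
# which corresponds to a probability above 0.5.
predicted_class = int(log_odds > 0)

print("Log-odds:", log_odds)
print("Probability of target = 1:", probability)
print("Predicted class:", predicted_class)
```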