# Classification

Based on the lyrics and the audio features, we try to predict whether a song will be a hit.

To predict the class from the lyrics, we first need a numerical representation of the lyrics. We used several techniques for this: TF-IDF, Word2Vec and Doc2Vec.

For classification, we used the following machine learning techniques: Logistic Regression, Random Forest and Neural Networks.

In [1]:
import json
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import multiprocessing
from tqdm import tqdm
from sklearn import utils
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.base import TransformerMixin,BaseEstimator
from sklearn.preprocessing import PolynomialFeatures
import codecs
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras import optimizers
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

Using TensorFlow backend.


In [2]:
with open('../data/top_hits_merged_clean_lyrics_audio_features.json') as json_file:
    top_hits = json.load(json_file)
    
with open('../data/not_hits_merged_clean_lyrics_audio_features.json') as json_file:
    not_hits = json.load(json_file)

We assign a class to each song, depending on whether the song was on the Billboard 100:
- 1 - hit
- 0 - not hit

In [3]:
top_hits_df = pd.read_json(top_hits)
not_hits_df = pd.read_json(not_hits)

top_hits_df['class'] = 1
not_hits_df['class'] = 0

df = pd.concat([top_hits_df, not_hits_df])

## TF-IDF

We used TF-IDF (Term Frequency-Inverse Document Frequency) to vectorize the lyrics. Additionally, we tried dimensionality reduction with PCA to see whether it improves the results.

TF-IDF represents how important a particular term is in the context of a given document, based on how many times the term appears in that document and how many other documents contain the same term. The higher the TF-IDF, the more important the term is to that document.
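To make the definition concrete, the following minimal sketch computes TF-IDF by hand on a toy corpus. It uses the plain textbook formula `tf * log(N / df)`; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ. The toy corpus and the helper name `tf_idf` are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical three-document corpus.
toy_corpus = [
    "love love me do",
    "all you need is love",
    "let it be",
]
toy_weights = tf_idf(toy_corpus)
```

In the first toy document, "love" occurs twice but also appears in another document, so its weight ends up lower than that of "me", which occurs once but only in this document.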

In [4]:
# Create our list of punctuation marks
punctuations = string.punctuation

# Create our list of stopwords
nlp = spacy.load('en')
stop_words = spacy.lang.en.stop_words.STOP_WORDS

# Create a blank English pipeline, used here to tokenize the lyrics
parser = English()

# Creating tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

In [5]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [6]:
X = df['clean_lyrics'] # the features we want to analyze
ylabels = df['class'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3, random_state=72)

#### Logistic Regression

We try to predict whether the song will be a hit or not based on the lyrics, using Logistic Regression. The test accuracy is 0.55.

In [7]:
classifier = LogisticRegression(solver="lbfgs")

# Create pipeline: TF-IDF vectorizer followed by the classifier
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print(" test Accuracy:",metrics.accuracy_score(y_test, predicted))
print(" Precision:",metrics.precision_score(y_test, predicted, average=None))
print(" Recall:",metrics.recall_score(y_test, predicted, average=None))

 test Accuracy: 0.5497220506485485
 Precision: [0.51548947 0.58374384]
 Recall: [0.55172414 0.54797688]
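With `average=None`, scikit-learn reports precision and recall per class, in the order of the sorted labels (here class 0, then class 1). The sketch below recomputes both by hand on made-up labels; `per_class_precision_recall` and the toy data are illustrative assumptions, not part of the notebook.

```python
def per_class_precision_recall(y_true, y_pred, labels=(0, 1)):
    """Return ([precision per label], [recall per label])."""
    precisions, recalls = [], []
    for label in labels:
        # True positives for this label: predicted as label and actually label.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return precisions, recalls

# Hypothetical labels: 3 non-hits and 5 hits.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 0, 0]
precision_toy, recall_toy = per_class_precision_recall(y_true_toy, y_pred_toy)
```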


#### Logistic Regression with PCA

We try to predict whether the song will be a hit or not based on the lyrics, using Logistic Regression. We use PCA to reduce the TF-IDF matrix to 50 dimensions. The test accuracy is 0.58, so Logistic Regression improves when combined with Principal Component Analysis.
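As a rough illustration of what PCA does, the sketch below finds the first principal component of a tiny 2-D dataset by centering the data and running power iteration on the covariance matrix. This is a didactic stand-in, not how scikit-learn's `PCA` is implemented (it relies on SVD); all names and data here are assumptions.

```python
import math

def column_means(rows):
    """Mean of every column of a list-of-lists matrix."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]

def first_principal_component(rows, iterations=100):
    """Return the unit-length direction of maximum variance of `rows`."""
    n, dim = len(rows), len(rows[0])
    means = column_means(rows)
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    vec = [1.0] * dim
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        vec = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

def project(rows, component):
    """Project the centered rows onto one component (one number per row)."""
    means = column_means(rows)
    return [sum((r[j] - means[j]) * component[j] for j in range(len(r)))
            for r in rows]

# Hypothetical points lying exactly on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(points)
reduced = project(points, pc1)
```

Because the toy points lie on a line, a single component captures all of their variance; reducing the TF-IDF matrix to 50 components keeps only the 50 directions of highest variance in the same sense.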

In [8]:
class ToDenseTransformer(BaseEstimator,TransformerMixin):

    # here you define the operation it should perform
    def transform(self, X, y=None, **fit_params):
        return X.todense()

    # just return self
    def fit(self, X, y=None, **fit_params):
        return self

In [9]:
# Create pipeline: TF-IDF vectorizer, dense conversion, PCA, then the classifier
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('to_dense',ToDenseTransformer()),
                 ('pca',PCA(50)),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print(" test Accuracy:",metrics.accuracy_score(y_test, predicted))
print(" Precision:",metrics.precision_score(y_test, predicted, average=None))
print(" Recall:",metrics.recall_score(y_test, predicted, average=None))

 test Accuracy: 0.5781346510191476
 Precision: [0.53854506 0.63037249]
 Recall: [0.65782493 0.50867052]


#### Random Forest Classifier

We try to predict whether the song will be a hit or not based on the lyrics, using a Random Forest Classifier. The test accuracy is 0.54.

In [10]:
classifier = RandomForestClassifier(n_estimators=1000)

# Create pipeline: TF-IDF vectorizer followed by the classifier
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print(" test Accuracy:",metrics.accuracy_score(y_test, predicted))
print(" Precision:",metrics.precision_score(y_test, predicted, average=None))
print(" Recall:",metrics.recall_score(y_test, predicted, average=None))

 test Accuracy: 0.5379864113650401
 Precision: [0.50424929 0.56407448]
 Recall: [0.47214854 0.59537572]


#### Random Forest Classifier with PCA

We try to predict whether the song will be a hit or not based on the lyrics, using a Random Forest Classifier. We use PCA to reduce the TF-IDF matrix to 50 dimensions. The test accuracy is 0.50, so the Random Forest Classifier does not improve when combined with Principal Component Analysis.

In [11]:
# Create pipeline: TF-IDF vectorizer, dense conversion, PCA, then the classifier
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('to_dense',ToDenseTransformer()),
                 ('pca',PCA(50)),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print(" test Accuracy:",metrics.accuracy_score(y_test, predicted))
print(" Precision:",metrics.precision_score(y_test, predicted, average=None))
print(" Recall:",metrics.recall_score(y_test, predicted, average=None))

 test Accuracy: 0.5046324891908586
 Precision: [0.46833773 0.53658537]
 Recall: [0.47082228 0.53410405]


We can see that, with TF-IDF, we get the best results using Logistic Regression with PCA.

## Word2Vec

Next, we tried the word2vec technique, which produces numeric representations (vectors) of individual words. The vectors retain the linguistic context of the words, meaning that words appearing in similar contexts have similar numerical representations. Training these representations requires a large amount of data, so we used the pre-trained GloVe word embeddings downloaded from https://nlp.stanford.edu/projects/glove/. We then combined the vectors of the individual words into one vector representing the entire song by averaging the word vectors. This technique has previously been proposed for representing sentences.
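The averaging step can be sketched in plain Python as follows. The tiny embedding table and the function name `average_word_vectors` are made up for illustration. The zero-vector fallback for lyrics with no in-vocabulary words is an assumption added here; the notebook's own `get_w2v` has no such guard.

```python
def average_word_vectors(text, embeddings, dim):
    """Average the vectors of all in-vocabulary words; zeros if none match."""
    vectors = [embeddings[word] for word in text.split() if word in embeddings]
    if not vectors:
        # Assumption: fall back to a zero vector when no word is known.
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 3-dimensional embeddings (the notebook uses 50 dimensions).
toy_embeddings = {
    "love": [1.0, 0.0, 2.0],
    "you": [3.0, 2.0, 0.0],
}
song_vector = average_word_vectors("love you baby", toy_embeddings, dim=3)
empty_vector = average_word_vectors("zzz qqq", toy_embeddings, dim=3)
```

Words missing from the embedding table ("baby" above) are simply skipped, so they do not influence the song vector.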

In [12]:
def load_embeddings_binary(embeddings_path):
    """
    Loads GloVe embeddings saved in binary form (a `.vocab` word list plus a `.npy` matrix).
    Loading this format is faster than loading the GloVe txt file.
    :param embeddings_path: path of glove file.
    :return: glove model
    """
    with codecs.open(embeddings_path + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]
    wv = np.load(embeddings_path + '.npy')
    model = {}
    for i, w in enumerate(index2word):
        model[w] = wv[i]
    return model

In [13]:
w2v_model = load_embeddings_binary('../data/glove.6B.50d')

In [14]:
def get_w2v(sentence, model):
    """
    :param sentence: a single string of whitespace-separated words (here, the lyrics of one song).
    :param model: the GloVe model (dict mapping each word to its vector).
    :return: numpy array with the average embedding of all in-vocabulary words in the input.
    """
    return np.mean(np.array([list(model[val]) for val in sentence.split() if val in model]), axis=0)

In [15]:
X = df['clean_lyrics'].apply(lambda lyrics: get_w2v(lyrics, w2v_model)).values # the features we want to analyze
ylabels = df['class'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3, random_state=72)

#### Logistic Regression

We try to predict whether the song will be a hit or not based on the lyrics, using Logistic Regression. The test accuracy is 0.57. 

In [16]:
classifier = LogisticRegression()
classifier.fit(list(X_train), y_train)
predicted = classifier.predict(list(X_test))

print(" test Accuracy:",metrics.accuracy_score(y_test, predicted))

 test Accuracy: 0.5719579987646696


#### Neural Networks

We try to predict whether the song will be a hit or not based on the lyrics, using a neural network. We tried different values for the layer size, optimizer, batch size and number of epochs, and the values below gave the best results. The test accuracy is 0.56.
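The network in the next cell has one hidden ReLU layer followed by a single sigmoid output, trained with binary cross-entropy. As a rough sketch of the computation such a network performs for one input, here is a plain-Python forward pass with a tiny hidden layer. The weights are made up for illustration, and dropout is left out because it is inactive at prediction time.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden ReLU layer followed by a single sigmoid output unit."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)

def binary_crossentropy(y_true, y_prob):
    """Loss for a single example with label y_true in {0, 1}."""
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))

# Hypothetical weights: 2 input features, 2 hidden units.
prob_hit = forward(
    features=[1.0, 2.0],
    hidden_weights=[[0.5, -0.5], [1.0, 1.0]],
    hidden_biases=[0.0, -1.0],
    out_weights=[1.0, -1.0],
    out_bias=2.0,
)
loss = binary_crossentropy(1, prob_hit)
```

With these toy weights the predicted probability is 0.5 and the loss is ln 2 (about 0.693), which is the loss of a model that outputs 0.5 regardless of its input.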

In [17]:
model = Sequential()
model.add(Dense(30, input_dim = 50))
model.add(Activation('relu')) 
                           
model.add(Dropout(0.1))   
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [18]:
model_hist = model.fit(np.stack(X_train), y_train,
                       batch_size=8, epochs=100,
                       verbose=1, validation_split=0.2)

Train on 3021 samples, validate on 756 samples


In [19]:
score = model.evaluate(np.stack(X_test), y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.7129344347820671
Test accuracy: 0.5602223873138428


**If we compare the results, we can see that Logistic Regression achieved a slightly higher accuracy (0.57) than the neural network (0.56).**

## Doc2Vec

In our opinion, the low performance of the word2vec model stems from the fact that our documents (songs) are long, and combining so many word vectors into one document vector leads to poor performance. Doc2vec is an extension of the word2vec approach that learns document representations from context instead of simply combining the representations of individual words. However, we could not find any pre-trained doc2vec models. We trained our own model on our data, but, as expected, it performed poorly. We have only a few thousand documents (songs) to train on, and learning good document representations requires much larger datasets. We believe that a larger corpus of songs would improve the performance of the doc2vec model.

In [20]:
train, test = train_test_split(df, test_size=0.3, random_state=42)
lemmatizer = WordNetLemmatizer() 

def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(lemmatizer.lemmatize(word.lower()))
    return tokens

train_tagged = train.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['clean_lyrics']), tags=[r['class']]), axis=1)
test_tagged = test.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['clean_lyrics']), tags=[r['class']]), axis=1)

In [21]:
cores = multiprocessing.cpu_count()

model_dbow = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])

100%|██████████| 3777/3777 [00:00<00:00, 1997212.08it/s]


In [22]:
%%time
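# Note: alpha starts at gensim's default of 0.025, so subtracting 0.002 per pass
# drives it below zero after 13 passes.
for epoch in range(30):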
    model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha


CPU times: user 60 s, sys: 751 ms, total: 1min
Wall time: 19.4 s


In [23]:
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in sents])
    return targets, regressors

#### Logistic Regression

We try to predict whether the song will be a hit or not using Logistic Regression. The test accuracy is 0.49, so it is worse than the base rate.
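The base rate here is the accuracy obtained by always predicting the most frequent class. A minimal sketch with made-up labels follows; the helper name and the toy data are illustrative assumptions, and the real value depends on the class counts in the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return sum(1 for label in y_test if label == majority_label) / len(y_test)

# Hypothetical labels: 1 = hit, 0 = not hit.
toy_train = [1, 1, 1, 0, 0]
toy_test = [1, 0, 1, 1]
baseline = majority_baseline_accuracy(toy_train, toy_test)
```

Any classifier whose test accuracy falls below this number has learned nothing useful from its features.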

In [24]:
y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print('Testing accuracy %s' % accuracy_score(y_test, y_pred))
print('Testing F1 score: {}'.format(f1_score(y_test, y_pred, average='weighted')))

Testing accuracy 0.4898085237801112
Testing F1 score: 0.4893726416195995


# Audio Features

We will try to predict whether a song will be a hit or not based on the audio features. The audio features are already normalized by Spotify.

In [25]:
audio_features = ['acousticness', 'danceability',  'energy',
            'instrumentalness', 'liveness', 'loudness', 'mode',
            'speechiness', 'tempo', 'time_signature', 'valence']

In [26]:
X = df[audio_features]
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=72)

#### Logistic Regression

We try to predict whether the song will be a hit or not based on the Audio Features, using Logistic Regression. The test accuracy is 0.59.

In [27]:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)

print(" test Accuracy:",metrics.accuracy_score(y_test, predicted))

 test Accuracy: 0.5941939468807906


#### Logistic Regression - Polynomial Features

We generate a new feature matrix consisting of all polynomial combinations of the features up to degree 2, and try to predict whether the song will be a hit or not using Logistic Regression. The test accuracy is 0.61, an improvement over Logistic Regression on the original audio features.
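To show which columns a degree-2 expansion produces, the sketch below enumerates the monomials in plain Python, including the constant bias term that `PolynomialFeatures` adds by default. The helper name `polynomial_terms` and the placeholder feature names are assumptions for illustration.

```python
from itertools import combinations_with_replacement

def polynomial_terms(feature_names, degree):
    """List every monomial of the features up to `degree`, including the bias."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(feature_names, d):
            # The empty combination (d == 0) is the constant term.
            terms.append("*".join(combo) if combo else "1")
    return terms

toy_terms = polynomial_terms(["a", "b"], 2)
# 11 placeholder names stand in for the 11 audio features.
n_audio_terms = len(polynomial_terms([f"x{i}" for i in range(11)], 2))
```

For two features the result is `1, a, b, a*a, a*b, b*b`. For the 11 audio features the count is 1 + 11 + 66 = 78, which matches the `input_dim = 78` of the neural network further down.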

In [28]:
poly = PolynomialFeatures(2)

X_train = poly.fit_transform(X_train)
X_test = poly.transform(X_test)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)

print(" test Accuracy:", metrics.accuracy_score(y_test, predicted))

 test Accuracy: 0.6145768993205682


#### Neural Networks

We try to predict whether the song will be a hit or not based on the audio features, using a neural network. The input is the polynomial feature matrix from the previous step (78 columns). We tried different values for the layer size, optimizer, batch size and number of epochs, and the values below gave the best results. The test accuracy is 0.54.

In [29]:
model = Sequential()
model.add(Dense(30, input_dim = 78))
model.add(Activation('relu')) 

model.add(Dropout(0.1))   
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [30]:
model_hist = model.fit(X_train, y_train,
                       batch_size=10, epochs=100,
                       verbose=1, validation_split=0.2)

Train on 3021 samples, validate on 756 samples


In [31]:
score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.692354913513332
Test accuracy: 0.5355157256126404


**If we compare the results, we can see that we got the highest accuracy when we used Logistic Regression with polynomial features expansion of the matrix.**