# Using Lyrics for Sentiment Analysis (Binary Genre Classification) of Songs, Jake Schaeffer

In [1]:
import numpy as np
import pandas as pd

# reads the CSV of Song Titles, Artists, and Lyrics
data = pd.read_csv('songdata.csv')
# removes '\n's that were scattered throughout the lyrics text
data['text'] = data['text'].replace('\n', '', regex=True)

# reads CSV of top songs from 
aa_data = pd.read_csv('top_songs.csv')

# reads Jan Wiebe's Subjectivity Lexicon
lexicon = pd.read_csv("subjectivity.csv")

In [2]:
song_title = data['song'].values
song_artist = data['artist'].values

In [3]:
import pylast

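# You need your own API_KEY and API_SECRET from last.fm.
API_KEY = "my key"
API_SECRET = "my secret"

# username and password_hash must also be defined before this call
# (pylast.md5(password) produces the hash); they are omitted here.
network = pylast.LastFMNetwork(api_key=API_KEY, api_secret=API_SECRET,
                               username=username, password_hash=password_hash)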

In [4]:
# Collects the last.fm key for each song
lasted_tracks = []
for i in range(len(song_title)):
    track = network.get_track(song_artist[i],song_title[i])
    lasted_tracks.append(track)

In [None]:
# Collects all of the tags for each song in the dataset.
# This block took about 10 hours to run so I saved the results to a CSV
# which was loaded in below to be used when I came back to this project.
top_list = []
for i in range(len(lasted_tracks)):
    try:
        top_tag = lasted_tracks[i].get_top_tags()
        top_list.append(top_tag)
    except Exception:
        # keep list positions aligned with lasted_tracks when a lookup fails
        top_list.append("NA")
    if i % 10 == 0:
        print("at stage %d" % i)
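
Since a run this long can be interrupted, one option is to write each result to disk as it arrives and skip finished rows on restart. The following is only a sketch of that idea, written for Python 3; `fetch_with_checkpoint` and its `fetch` argument are hypothetical names standing in for the API call above, not part of pylast.

```python
import csv
import os

def fetch_with_checkpoint(items, fetch, path, every=100):
    """Call fetch(item) for each item, appending each result to a CSV.

    Rows already present in `path` are skipped, so an interrupted run
    resumes where it stopped. `fetch` is a hypothetical stand-in for
    the slow API call.
    """
    done = 0
    if os.path.exists(path):
        with open(path, newline="") as fh:
            done = sum(1 for _ in csv.reader(fh))
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        for i, item in enumerate(items):
            if i < done:
                continue  # fetched on an earlier run
            try:
                result = fetch(item)
            except Exception:
                result = "NA"  # same placeholder the notebook uses
            writer.writerow([i, result])
            if i % every == 0:
                fh.flush()  # push buffered rows to disk periodically
    return path
```

Calling it a second time with the same `path` only fetches the items that are not yet in the file.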

In [5]:
# CSV of collected tag data
tags_formatted = pd.read_csv("songs_tags")
listed = pd.DataFrame(lasted_tracks)
tags_formatted['names'] = listed[0:100]

In [7]:
# The API calls returned a long string that contained the tags buried
# amongst messy text. This parses the strings and extracts the tag.
def substring_after(s, delim):
    return s.partition(delim)[2]

topdf = tags_formatted.drop(tags_formatted.columns[[0]], axis=1)
complex_tags = topdf.iloc[:,0][1]
sub = substring_after(complex_tags[2],"(u'")

complex_tags = complex_tags.split("Tag")
sub = substring_after(complex_tags[4],"(u'")
tag = sub.split("'")[0]

In [79]:
# This cell does the same job as the one above. Tags held in a variable
# and tags loaded back from my saved CSV needed different processing
# steps (very annoyingly).
tags = []
weights = []

A = []

for i in range(len(topdf.iloc[:,0])):
    my_list = []
    complex_tags = str(topdf.iloc[:,0][i])
    complex_tags = complex_tags.split("Tag")
    for j in range(len(complex_tags)):
        sub = substring_after(complex_tags[j],"(u'")
        tag = sub.split("'")[0]
        my_list.append(tag)
    A.append(my_list)
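
The split-and-partition parsing above could probably be collapsed into a single regular expression. This is a sketch only: the sample string is hypothetical, shaped after what the parsing code implies (a `Tag(u'name'` fragment per tag), and the real API output may differ.

```python
import re

# Matches the tag name in fragments shaped like  Tag(u'rock', ...)
TAG_PATTERN = re.compile(r"Tag\(u?'([^']*)'")

def extract_tags(raw):
    """Return every tag name found in one stringified API result."""
    return TAG_PATTERN.findall(str(raw))

# Hypothetical sample, not real last.fm output.
sample = ("[TopItem(item=Tag(u'rock', ...), weight=u'100'), "
          "TopItem(item=Tag(u'sad', ...), weight=u'40')]")
print(extract_tags(sample))  # ['rock', 'sad']
print(extract_tags("NA"))    # []
```

Rows that failed during collection (stored as "NA") simply yield an empty list, so no special case is needed.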

In [11]:
# Adf = pd.DataFrame(A)
Adf = pd.read_csv('songs_tags')
Adf=Adf.drop(Adf.columns[[0]], axis=1)
# Adf['title'] = song_title
Adf['artist'] = song_artist

count = 0
no_tags = []
# Gets index of all songs that had no tags
for i in range(len(Adf.iloc[:,0].values)):
    # missing values read from a CSV are NaN, which never equals None
    if pd.isnull(Adf.iloc[:,0].values[i]):
        no_tags.append(i)
# Removes songs from DataFrame with no tags, shrinking DataFrame from
# 57,650 entries to 46,010
Adf.drop(Adf.index[no_tags],inplace=True)

The following block goes through Jan Wiebe's subjectivity lexicon, notes
whether each tag matches a word with positive or negative sentiment in
the lexicon, and classifies each song based on whether it has more
positive or negative tags.

Again, this script was slow, taking about 30 minutes to run (my mistake
for iterating through a pandas DataFrame using .iloc), so the results
were also saved to a CSV to be referenced later.

In [12]:
import time
word_list = lexicon.iloc[:,0].values
word_list = word_list.tolist()

classifier = []

start = time.time()
for i in range(len(Adf)):
    pos_count = 0
    neg_count = 0
    for j in range(len(Adf.iloc[i])):
        single_tag = Adf.iloc[i][j]
        try:
            ind = word_list.index(single_tag.lower())
        except (AttributeError, ValueError):
            # non-string cell (e.g. NaN) or a tag that is not in the lexicon
            continue
        if lexicon.iloc[:,1][ind] == "positive":
            pos_count += 1
        if lexicon.iloc[:,1][ind] == "negative":
            neg_count += 1
    if pos_count > neg_count:
        classifier.append(1)
    elif neg_count > pos_count:
        classifier.append(-1)
    else:
        classifier.append(0)
    if i%1000 == 0:
        print("stage %d" % i)

end = time.time()
runtime = end - start
print("Script took %s seconds." % runtime)
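
Most of the runtime above likely comes from `word_list.index` scanning the whole lexicon for every tag, plus the repeated `.iloc` lookups. A dictionary lookup avoids both. The sketch below uses a toy lexicon and hypothetical helper names (`build_polarity`, `classify_tags`); it is meant to mirror the counting rule of the loop above, not to replace it verbatim.

```python
def build_polarity(words, labels):
    """Map each lexicon word to its polarity label."""
    polarity = {}
    for word, label in zip(words, labels):
        # first entry wins for repeated words, like list.index does
        polarity.setdefault(word, label)
    return polarity

def classify_tags(tags, polarity):
    """Return 1, -1 or 0 for mostly-positive, mostly-negative or tied tags."""
    pos = neg = 0
    for tag in tags:
        if not isinstance(tag, str):
            continue  # skips missing values such as NaN
        label = polarity.get(tag.lower())
        if label == "positive":
            pos += 1
        elif label == "negative":
            neg += 1
    return (pos > neg) - (neg > pos)

# Toy stand-in for the lexicon columns (word, polarity).
polarity = build_polarity(["happy", "love", "sad", "happy"],
                          ["positive", "positive", "negative", "negative"])
print(classify_tags(["Happy", "rock", "sad", "love"], polarity))  # 1
print(classify_tags(["sad", float("nan")], polarity))             # -1
print(classify_tags(["rock", "pop"], polarity))                   # 0
```

With the real data, each row of tags could be passed to `classify_tags` once, so the cost per tag no longer grows with the size of the lexicon.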

In [15]:
Adf['sentiment'] = classifier
# .copy() avoids pandas' SettingWithCopyWarning when 'merger' is added below
Training_S = Adf[['title','artist','sentiment']].copy()
data.rename(columns={'song': 'title', 'text': 'lyrics'}, inplace=True)
Training_S.to_csv('title_sentiment')
data.to_csv('cleaned_data')
# This block creates unique song/artist string combos as many songs have
# the same exact title and were messing with the merge.
Training_S['merger'] = Adf['title'] + Adf['artist']
data['merger'] = data['title'] + data['artist']
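
One caveat with plain string concatenation is that two different title/artist pairs can produce the same key. The sketch below shows the collision and one way around it; `make_key` is a hypothetical helper, and the lowercasing and stripping are an extra normalization I am assuming is acceptable here.

```python
def make_key(title, artist, sep="\x1f"):
    """Join title and artist with a separator unlikely to occur in either."""
    return title.strip().lower() + sep + artist.strip().lower()

# Plain concatenation maps these two different pairs to the same key...
print("Hello" + "Adele" == "Hell" + "oAdele")                    # True
# ...while a separated key keeps them apart.
print(make_key("Hello", "Adele") == make_key("Hell", "oAdele"))  # False
```

In pandas the same effect can be had without a helper column by merging on both columns at once, e.g. `on=['title', 'artist']`.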

In [16]:
# This step equalizes the datasets so that there are an even number of
# positive and negative classified songs.
Merged = Training_S.merge(data, on='merger', how='left').drop_duplicates()
Merged = Merged[['title_x','artist_x','lyrics','sentiment']]
Merged.rename(columns={'title_x': 'title', 'artist_x': 'artist'}, inplace=True)

PosSent = Merged[Merged.sentiment == 1]
NegSent = Merged[Merged.sentiment == -1]
PosSentMatch = PosSent.sample(n=len(NegSent))
SentMatched = pd.concat([PosSentMatch, NegSent], ignore_index=True)
SentMatched.to_csv('training_data')
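
The balancing step can be sketched without pandas to make the logic explicit. This version is an illustration with toy rows and a hypothetical `balance` helper; unlike the cell above it downsamples whichever class is larger, and it fixes the random seed so the sample is reproducible.

```python
import random

def balance(rows, label_of, seed=42):
    """Downsample the larger class so both classes have equal size."""
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == -1]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)  # fixed seed gives the same sample every run
    return rng.sample(pos, n) + rng.sample(neg, n)

# Toy (title, sentiment) rows: three positive songs, one negative.
rows = [("a", 1), ("b", 1), ("c", 1), ("d", -1)]
balanced = balance(rows, label_of=lambda r: r[1])
print(len(balanced))  # 2
```

In the pandas version, passing `random_state` to `sample` would give the same reproducibility.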

In [139]:
# Loads in the cleaned dataset all of the steps above created
SentMatched = pd.read_csv('training_data')
SentMatched = SentMatched.drop(SentMatched.columns[[0]], axis=1)
SentMatched.head()

   title                artist         lyrics                                            sentiment
0  Down In The Park     Foo Fighters   Down in the park where the machmen meet The m...  1
1  If You Love Me Baby  The Beatles    If you leave me, baby I don't know what I'll ...  1
2  A Chair In The Sky   Joni Mitchell  The rain slammed hard as bars It caught me--b...  1
3  No Bone Movies       Ozzy Osbourne  Silver screen such a disgrace I couldn't look...  1
4  Common Mortal Man    Free           I was on my way to a needle factory Up and co...  1


In [151]:
# assign with [] so 'target' becomes a real column, not just an attribute
SentMatched['target'] = SentMatched['sentiment']
lyric_data = SentMatched.lyrics
target_data = SentMatched.target

Now that I have a cleaned, classified dataset, I can begin the model selection process.

The cell below builds a dictionary of features and transforms documents to feature vectors through text preprocessing and tokenizing. Note that CountVectorizer does not filter stop words with its default settings, which are what is used here.

In [126]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
X_train, X_test, y_train, y_test = train_test_split(lyric_data, target_data, 
                                                    test_size=0.25, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
y_train.target_names = ['positive','negative']
X_train_counts.shape

(5776, 26690)

In [140]:
# CountVectorizer supports counts of N-grams of words or consecutive
# characters. Once fitted, the vectorizer has built a dictionary that
# maps each word in the vocabulary to its feature index, i.e. its column
# in the count matrix (not its frequency in the training corpus).
count_vect.vocabulary_.get('terror')

23403
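
To make the meaning of that number concrete, here is a hand-rolled sketch of what a fitted vocabulary and count matrix look like on two toy documents. It uses plain whitespace tokenizing rather than CountVectorizer's own tokenizer, so it is an illustration of the idea, not of the library's exact behavior.

```python
from collections import Counter

docs = ["the terror of the night", "night after night"]

# Vocabulary: each distinct token gets a column index (alphabetical order).
tokens = sorted({w for d in docs for w in d.split()})
vocabulary = {term: idx for idx, term in enumerate(tokens)}

# Count matrix: one row per document, one column per vocabulary term.
counts = []
for d in docs:
    c = Counter(d.split())
    counts.append([c[t] for t in tokens])

print(vocabulary["terror"])  # 3
print(counts)                # [[0, 1, 1, 1, 2], [1, 2, 0, 0, 0]]
```

The value looked up for a word is therefore a column position: column 3 holds the counts of "terror" for every document.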

The occurrence count of words within lyrics provided a solid starting point in the classification process, but there was an issue: longer songs have higher average count values than shorter songs, even when they might talk about the same topics.

To prevent this issue, I can divide the number of occurrences of each word in a lyric by the total number of words in the lyric: these new features are called tf, for Term Frequencies.

As one more refinement, I can downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is referred to in scikit-learn as tf-idf, for “Term Frequency times Inverse Document Frequency”. tf and tf-idf are calculated below.

In [85]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(5776, 26690)

In [86]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(5776, 26690)
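
As a worked illustration of the two weightings, the sketch below computes tf and tf-idf by hand on a toy count matrix, using the textbook definitions described in the text. It is not a reimplementation of TfidfTransformer: by default scikit-learn smooths the idf term and L2-normalizes each row, so its numbers will differ.

```python
import math

counts = [[0, 1, 1, 1, 2],   # toy word counts, one row per lyric
          [1, 2, 0, 0, 0]]

# tf: each count divided by the total number of words in that lyric
tf = [[c / sum(row) for c in row] for row in counts]

# idf: log(N / df), where df is the number of lyrics containing the word
n_docs = len(counts)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
idf = [math.log(n_docs / d) for d in df]

# tf-idf: term frequency scaled by how rare the word is across lyrics
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

print(tf[0])        # [0.0, 0.2, 0.2, 0.2, 0.4]
print(tfidf[0][1])  # 0.0, since that word appears in every lyric
```

The second column belongs to a word that occurs in both lyrics, so its idf is log(2/2) = 0 and it contributes nothing after weighting.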

Now that features have been extracted from the lyric corpus, I wanted to train a classifier to try to predict positive or negative sentiment of a song. To start, I used a Naive Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier, but the one most appropriate for word counts is the multinomial Naive Bayes classifier.

To calculate P(sentiment|lyrics), the algorithm calculates P(lyrics|sentiment)*P(sentiment)/P(lyrics). P(lyrics) has no effect on the comparison, since it is the same for both sentiments when classifying a given lyric. Thus, P(lyrics|sentiment)*P(sentiment) became the only calculation. The lyrics are represented by the series of n words that constitutes them, and the algorithm assumes that the probability of seeing any given word depends exclusively on the classified sentiment; it does not take into consideration the other words in the lyrics.

To return to the original calculation goal, P(sentiment|lyrics) is proportional to P(sentiment) multiplied by the product of P(word(i)|sentiment) for all values from i=1 to n.
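
The calculation described above can be sketched in a few lines of plain Python. This is a toy version on made-up training lyrics, working in log space and using add-one smoothing (the default `alpha=1.0` in MultinomialNB); the function names are mine and the real classifier operates on the tf-idf matrix rather than raw strings.

```python
import math
from collections import Counter

# Made-up training lyrics with labels (1 = positive, -1 = negative).
train = [("happy love sunshine", 1), ("love happy day", 1),
         ("sad lonely night", -1), ("sad tears night", -1)]

def fit(train):
    priors, word_counts = Counter(), {}
    for text, label in train:
        priors[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for c in word_counts.values() for w in c}
    return priors, word_counts, vocab

def predict(text, priors, word_counts, vocab):
    scores = {}
    total_docs = sum(priors.values())
    for label in priors:
        total = sum(word_counts[label].values())
        score = math.log(priors[label] / total_docs)  # log P(sentiment)
        for w in text.split():
            if w in vocab:  # words never seen in training are ignored
                # log P(word | sentiment) with add-one smoothing
                score += math.log((word_counts[label][w] + 1)
                                  / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict("happy is happy", *model))  # 1
print(predict("sad is sad", *model))      # -1
```

Summing logs is equivalent to multiplying the probabilities, but avoids underflow when a lyric contains hundreds of words.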

In [93]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)

With the classifier trained, I can now test out what sentiment the model predicts for different lyric sets. +1 classification represents positive sentiment, -1 represents negative sentiment.

In [146]:
sample_lyrics = ['happy is happy', 'sad is sad']
X_new_counts = count_vect.transform(sample_lyrics)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)
predicted

array([ 1, -1])

The following cell creates a pipeline that allows the model to be trained with just one command. I'm then able to look at the accuracy of this trained model.

In [112]:
# vectorized lyrics model
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])
text_clf.fit(X_train, y_train)  
docs_test = X_test
NB_predicted = text_clf.predict(docs_test)
accuracy = np.mean(NB_predicted == y_test)
print("Naive Bayes Accuracy: {} %".format(accuracy*100))

Naive Bayes Accuracy: 60.7476635514 %


In the paper that I'm replicating, the Naive Bayes approach resulted in an average basic accuracy of 56%. While I would like to assume that it is my exceptional machine learning model parameter setting skills that resulted in this improvement, it is more likely that I was able to achieve better accuracy as a result of my order-of-magnitude larger dataset. While the paper had a total of 420 classified songs, my dataset comprised 7,702 classified songs. Despite this slight improvement in accuracy over the paper, I wanted to test another model, the linear SVM.

In [113]:
# model with linear support vector machine
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None)),])
text_clf.fit(X_train, y_train)
SVM_predicted = text_clf.predict(docs_test)
accuracy = np.mean(SVM_predicted == y_test)
print("Linear SVM Accuracy: {} %".format(accuracy*100))

Linear SVM Accuracy: 58.6708203531 %


Surprisingly, the linear SVM performed worse on my dataset, resulting in an accuracy of 58.7%. The paper I'm replicating did not report on a linear SVM model in their research.

In [118]:
from sklearn import metrics
# classification_report lists classes in sorted label order (-1, then 1),
# so target_names has to name the negative class first.
print("Naive Bayes Model Results:")
print(metrics.classification_report(y_test, NB_predicted,
    target_names=['negative sentiment','positive sentiment']))
print(" ")
print("Linear SVM Model Results:")
print(metrics.classification_report(y_test, SVM_predicted,
    target_names=['negative sentiment','positive sentiment']))

Naive Bayes Model Results:
                    precision    recall  f1-score   support

negative sentiment       0.66      0.45      0.54       969
positive sentiment       0.58      0.77      0.66       957

       avg / total       0.62      0.61      0.60      1926

 
Linear SVM Model Results:
                    precision    recall  f1-score   support

negative sentiment       0.58      0.66      0.62       969
positive sentiment       0.60      0.51      0.55       957

       avg / total       0.59      0.59      0.58      1926



In looking at the metrics provided by sklearn, we can look at precision as "how useful the search results are", and recall as "how complete our results are". The two metrics are calculated independently of each other. Note that the report lists classes in sorted label order, so the first row is the -1 (negative) class and the second row is the +1 (positive) class. In looking at the extrema of the results, interestingly, positive sentiment had recall of nearly 0.8 in the Naive Bayes model, while negative sentiment had recall of only 0.45. Also, the Naive Bayes model performed slightly better than the Linear SVM model, which I found surprising given Linear SVM's general regard as one of the best text classifiers. I also found it interesting that 
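
As a sanity check on how these numbers fit together, overall accuracy equals the support-weighted average of per-class recall. The sketch below applies that identity to the rounded recall and support values from the two reports above; `accuracy_from_recall` is a helper name of my own, and the small gaps to the printed accuracies come from the recalls being rounded to two decimals.

```python
def accuracy_from_recall(recalls, supports):
    """Overall accuracy as the support-weighted mean of per-class recall."""
    correct = sum(r * n for r, n in zip(recalls, supports))
    return correct / sum(supports)

# (recall, support) pairs copied from the two reports above,
# in the row order in which they are printed
nb = accuracy_from_recall([0.45, 0.77], [969, 957])
svm = accuracy_from_recall([0.66, 0.51], [969, 957])
print(round(nb, 3), round(svm, 3))  # 0.609 0.585
```

These agree with the reported 60.7% and 58.7% to within rounding, which confirms that the two classifiers trade recall between the classes in opposite directions while landing at similar overall accuracy.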

The Nature of Music and Lyrical Classification:
    
In my dataset, there were over 3x as many songs classified as positive as negative. This is unsurprising, as the database I accessed had the lyrics of popular songs, more of which will be happy than sad. Looking at my results, though, I'm happy with the classification accuracy I was able to achieve. Lyrics are much more difficult to classify than something like reviews, for a few reasons.

1. Songs can contain a series of negative lyrics but end on an uplifting, positive note, or vice versa. This is notable in the areas of love songs where lyricists may speak on how happy they were in a relationship, but end on a sad note because of a break-up.

2. Songs may not contain any words from a typical sentiment lexicon, but still express positive or negative emotions. For example, the song "Ocean Front Property" by George Strait includes the following stanza:

I got some ocean front property in Arizona
From my front porch you can see the sea
I got some ocean front property in Arizona
If you'll buy that, I'll throw the golden gate in free

It's tough to find an individual word that screams "positive" from this selection, but taken as a whole the section seems positive. But on deeper analysis, there is no such thing as ocean front property in the landlocked state of Arizona. This is a semi-spiteful message, and its complexity would likely be entirely lost on a classifier.

3. Hip-hop songs in particular suffer from the issue of containing lyrics that express positive emotions about negative events like shootings and robbery. This just contributes to the complexities a classification system is expected to pick up on.

With these considerations in mind, I'm satisfied with the results my algorithm was able to return.

References:

1. Identifying the Emotional Polarity of Song Lyrics through Natural
Language Processing (Oudenne and Chasins)
https://people.eecs.berkeley.edu/~schasins/papers/identifyingEmotionalPolarity.pdf

2. 55000+ Song Lyrics from LyricsFreak 
https://www.kaggle.com/mousehead/songlyrics

3. Jan Wiebe's Subjectivity Lexicon
http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/