In [20]:
# This is an exploratory journey into the poetic world of one of the legendary bands in the history of rock.
# Trying to understand a band purely from a collection of its lyrics is inevitably a shallow approach,
# but I hope it will at least be a fun ride and an interesting exercise in textual analysis.

In [21]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt
# %matplotlib inline

In [22]:
df = pd.read_csv('songdata.csv')

In [23]:
df.columns

Index(['artist', 'song', 'link', 'text'], dtype='object')

In [24]:
df = df[['artist', 'song', 'text']]

In [25]:
df.head(3)

Unnamed: 0,artist,song,text
0,ABBA,Ahe's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...


In [26]:
# len(doors) == 97
doors = df[df['artist'] == 'Doors']['text']
len(doors)

97

In [27]:
songs = doors.tolist()

In [28]:
# `songs` is now a plain Python list holding the lyrics of every Doors song in the dataset.
# In total there are 97 songs.

In [29]:
songs[0]

"All hail the American night!  \n  \nWhat was that?  \nI don't know  \nSounds like guns, thunder.  \n  \nAlright! Alright! Alright!  \nHey, listen! Listen! Listen, man! listen, man!  \nI don't know how many you people believe in astrology  \nYeah, that's right, that's right, baby, I, I am a  \nSagittarius  \nThe most philosophical of all the signs  \nBut anyway, I don't believe in it  \nI think it's a bunch of bullshit, myself  \nBut I tell you this, man, I tell you this  \nI don't know what's gonna happen, man,  \nBut I want to have my kicks  \nBefore the whole shithouse goes up in flames  \nAlright!\n\n"

In [30]:
# Let's get rid of some of this noise so we can get to the content
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')  # raw string, so \w is not treated as a string escape

for i in range(len(songs)):
    songs[i] = tokenizer.tokenize(songs[i].lower())

In [31]:
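# Trying to get rid of the stopwords. However, spaCy's stop list does not cover fragments such as "'t",
# so some of them would still remain.
# (spaCy >= 3 dropped the "en" shortcut; load a full model name such as "en_core_web_sm" there.)
import spacy
nlp = spacy.load("en")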
nlp.vocab["'t"].is_stop 

False

In [32]:
# so we need to write a custom filter...
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # public import path
from nltk.corpus import stopwords as nltk_stopwords  # aliased so the set below does not shadow the module
stopwords = set(nltk_stopwords.words('english') +
                ["n't", "'s", "'m", "'t", "'re", "yeah", "ya", "oh", "l", "la"] + list(ENGLISH_STOP_WORDS))
# tokens were already lowercased during tokenization, so a plain membership test is enough
songs = [[token for token in song if token not in stopwords] for song in songs]

In [33]:
songs[2][10:30]

['mon',
 'set',
 'free',
 'said',
 'warden',
 'warden',
 'warden',
 'break',
 'lock',
 'key',
 'said',
 'warden',
 'warden',
 'warden',
 'break',
 'lock',
 'key',
 'come',
 'mister',
 'c']

In [34]:
# Looks a bit better, but there are still some single-character tokens (such as 'c') and words with little
# semantic content (fillers such as 'yeah' and 'ya' had to be added to the stop list by hand). On the other
# hand, there are repetitions like 'warden' that pose a challenge to our semantic analysis: they might be
# there simply for the purposes of rhythm and music, and don't necessarily carry much weight with regard
# to the overall message conveyed in a song. To counter that we will later use tf-idf. For now, before we
# proceed further, let's create a word cloud from all the words in the lyrics:

# we're flattening the nested list containing all the words
lyrics = " ".join(word for song in songs for word in song)

from wordcloud import WordCloud
from PIL import Image
jim_mask = np.array(Image.open("jim.jpg"))
wc = WordCloud(background_color="white", max_words=3000, mask=jim_mask)
wc.generate(lyrics)

# wc.to_file("jim_cloud.png")

<wordcloud.wordcloud.WordCloud at 0x7fd4a8758b00>

<img src="jim_cloud.png" height="700" width="400"/>

In [35]:
# Nice wordcloud, ain't it? Call me crazy but it kinda looks like Jim himself. 

Now that we have come this far, how about looking at co-occurrences of words in The Doors' lyrics? In other words, we'd like to see which words are more likely to appear together in the band's lyrics. For this we first create dense word vectors using word2vec, and then use the dimensionality reduction technique t-SNE to map the vectors to a two-dimensional space, where we can plot them on a 2D plane.


In [36]:
import gensim, re, matplotlib
from gensim.models.word2vec import Word2Vec
from sklearn.manifold import TSNE

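# when training the word2vec model, we only consider words that occurred at least min_count=30 times.
# workers must be a positive thread count: with workers=-1 gensim starts no worker threads and the
# vectors are never trained. (gensim >= 4 renames size to vector_size and wv.vocab to wv.key_to_index.)
model = Word2Vec(songs, window=5, size=300, workers=4, min_count=30)
labels = []
tokens = []

for word in model.wv.vocab:
    tokens.append(model.wv[word])  # index through model.wv; model[word] is deprecated
    labels.append(word)
# t-SNE projection of the word vectors onto a 2D plane
tsne_model = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
new_values = tsne_model.fit_transform(np.array(tokens))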

In [None]:
# split the 2D t-SNE coordinates into x and y columns
x = new_values[:, 0]
y = new_values[:, 1]
plt.figure(figsize=(16, 12)) 
for i in range(len(x)):
    plt.scatter(x[i],y[i])
    plt.annotate(labels[i],
                 xy=(x[i], y[i]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
# plt.savefig('30_most_frequent.png');

<img src="30_most_frequent_1.png" />

Again, call me nuts, but it looks like a vague touch of Jim is still lurking in the background of this plot. If you squint, you might see him.

You will get different results in separate runs, so we can't take the plot too seriously, but there are nonetheless some interesting recurring patterns. The words on the plot are those that occur at least 30 times across the lyrics, which makes them the most prominent motifs in the band's lyrics. Both <i>man</i> and <i>night</i> appear on the plot, and in some runs they are actually adjacent. They remind me of the famous story of Jim witnessing a car accident involving American Indians. <i>end</i> and <i>time</i> also appear next to each other, for which <a href="https://en.wikipedia.org/wiki/The_End_(The_Doors_song)">the connotations are clear</a>.

The relative distance between the words can be interpreted in several ways. In some cases it corresponds to semantic similarity (<i>need</i> and <i>want</i>, or <i>ride</i> and <i>roll</i>), and in other cases, I assume, it reflects the selectional preferences of verbs. For instance, the verbs <i>got</i>, <i>gonna</i>, and <i>gotta</i> are placed relatively close to <i>roll</i> and <i>ride</i>. The same goes for the verbs <i>come</i> and <i>tell</i> being close to the noun <i>baby</i>.
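
Keep in mind that distances on a t-SNE plot are only a projection of the original 300-dimensional vectors. In the original space, similarity between two word vectors is usually measured with cosine similarity, which gensim exposes as `model.wv.similarity(w1, w2)`. Here is a minimal sketch of that measure in plain Python. The three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for this example only
toy_vectors = {
    "need": [1.0, 2.0, 0.0],
    "want": [2.0, 4.0, 0.0],   # same direction as "need"
    "night": [0.0, 0.0, 3.0],  # orthogonal to "need"
}

sim_need_want = cosine_similarity(toy_vectors["need"], toy_vectors["want"])
sim_need_night = cosine_similarity(toy_vectors["need"], toy_vectors["night"])
print(sim_need_want, sim_need_night)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their length. Checking a pair this way before reading too much into its position on the plot is a cheap sanity test.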

The next plot is what we get in a subsequent run. In each run we see different patterns, which we can interpret any way we want, only to reinforce our own preconceptions about the band. There was a little bug in the code the first time I ran it, so we ended up seeing both <i>come</i> and <i>Come</i> on the plot. But the general idea is the same.

<img src="30_most_frequent_2.png" />

Again, there are some adjacencies that could be telling. <i>away</i> and <i>girl</i> evoke certain love-related songs. <i>little</i> and <i>baby</i> are close (co-occurrence), as are <i>like</i> and <i>want</i> (semantic similarity). <i>night</i> and <i>man</i> are closer here.
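
One way to check whether a pair is close because of plain co-occurrence is to count how often the two words fall within a few tokens of each other. The sketch below does that with the standard library only. The function name `cooccurrence_counts` and the toy token lists are assumptions made for this example. In the notebook you would pass the tokenized `songs` list instead.

```python
from collections import Counter

def cooccurrence_counts(token_lists, window=5):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for tokens in token_lists:
        for i, word in enumerate(tokens):
            # Look ahead only, so each pair of positions is counted once
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    counts[tuple(sorted((word, other)))] += 1
    return counts

# Hypothetical toy data standing in for the tokenized lyrics
toy_songs = [
    ["little", "baby", "come", "night"],
    ["baby", "little", "girl"],
]
pair_counts = cooccurrence_counts(toy_songs, window=2)
print(pair_counts.most_common(3))
```

Pairs are stored in alphabetical order, so `pair_counts[("baby", "little")]` covers both "little baby" and "baby little". A high count for a pair that is also close on the plot supports the co-occurrence reading, while a low count suggests the closeness comes from somewhere else.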

This is all good, but now that we have a list of all the words in each song, let's do a little unsupervised clustering to see which songs are topically similar.