

## Trendy or Timeless?


Popular music is on the forefront of what the people consider new and cool, but it can also reflect shared values that are more persistent, even timeless. Leaving aside the ancient question of whether art imitates life or life imitates art, are there some concepts so firmly rooted in collective human belief systems that they remain constant across time? While a fully generalizable answer to that question may not be possible, some insights can be obtained from analysis of the content of the lyrics of popular music, and comparing them across time. 

This inquiry is primarily exploratory, but by proceeding at multiple levels of abstraction it is hoped that some conclusions can be drawn. More concrete metrics, such as word count, song length, and lexical diversity (non-repetitiveness) help establish a heuristic baseline. As one might expect, the word "love" appears very frequently in song lyrics, but not always in a positive context. It may be possible to learn more by deeper examination.  Another potentially useful metric of the sentiment content of music lyrics is obtained by treating the artist as a proxy (literally the embodiment) for the music's semantic content, and tracing the persistence versus evanescence of individual artists or bands over time.  And of course, some more modern text mining techniques, Latent Semantic Analysis (LSI) Latent Dirichlet Analysis (LDA), will be used to extract models of what the songs are about.

All of these inquiries use the same data set, a sample consisting of the lyrics for songs listed in the Billboard Top 100 for each year from 1070 through 2014. 

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")


#### Load additional:

In [2]:
import itertools
import collections
import json
import pickle
import nltk
import gensim

Instantiate spark: (note this will only work after vagrant is up):

In [3]:
''' 
import os
import findspark
findspark.init()
print findspark.find()
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local[4]')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)
sc._conf.getAll()
''' 

' \nimport os\nimport findspark\nfindspark.init()\nprint findspark.find()\nimport pyspark\nconf = (pyspark.SparkConf()\n    .setMaster(\'local[4]\')\n    .setAppName(\'pyspark\')\n    .set("spark.executor.memory", "2g"))\nsc = pyspark.SparkContext(conf=conf)\nsc._conf.getAll()\n'

In [4]:
''' 
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()
''' 

' \nimport sys\nrdd = sc.parallelize(xrange(10),10)\nrdd.map(lambda x: sys.version).collect()\n'

Open everything that's been saved in /data/conditioned

In [5]:
# Open everything that's been saved in /data/conditioned

df=pd.read_csv("../../data/conditioned/use-this-master-lyricsdf-extracted.csv")
dfg=pd.read_csv("../../data/conditioned/master-lyricsdf-genre_inner.csv")

with open("../../data/conditioned/noun-n-gram.json") as json_file:
    noungram = json.load(json_file)
with open("../../data/conditioned/nounvocab.json") as json_file:
    nounvocab = json.load(json_file)
with open("../../data/conditioned/nounid2word.json") as json_file:
    nounid2word = json.load(json_file)
with open("../../data/conditioned/adj-n-gram.json") as json_file:
    adjgram = json.load(json_file)
with open("../../data/conditioned/adjvocab.json") as json_file:
    adjvocab = json.load(json_file)
with open("../../data/conditioned/adjid2word.json") as json_file:
    adjvocab = json.load(json_file)
with open("../../data/conditioned/decade-dict.json") as json_file:
    decade_dict = json.load(json_file)
    

f = open("../../data/conditioned/ahypes.p",'r')  
ahypes = pickle.load(f)  
f = open("../../data/conditioned/nhypes.p",'r')  
nhypes = pickle.load(f)
f = open("../../data/conditioned/corpus.p",'r')  
corpus = pickle.load(f)


df.head(3)

Unnamed: 0,index,position,year,title.href,title,artist,lyrics,decade,song_key,lyrics_url,lyrics_abstract
0,0,1,1970,https://en.wikipedia.org/wiki/Bridge_over_Trou...,Bridge over Troubled Water,Simon and Garfunkel,When you're weary. Feeling small. When tears a...,1970,1970-1,http://lyrics.wikia.com/Simon_And_Garfunkel:Br...,When you're weary. Feeling small. When tears a...
1,1,2,1970,https://en.wikipedia.org/wiki/(They_Long_to_Be...,(They Long to Be) Close to You,The Carpenters,Why do birds suddenly appear. Everytime you ar...,1970,1970-2,http://lyrics.wikia.com/Carpenters:%28They_Lon...,Why do birds suddenly appear. Everytime you ar...
2,2,3,1970,https://en.wikipedia.org/wiki/American_Woman_(...,American Woman,The Guess Who,"Mmm, da da da. Mmm, mmm, da da da. Mmm, mmm, d...",1970,1970-3,http://lyrics.wikia.com/The_Guess_Who:American...,"Mmm, da da da. Mmm, mmm, da da da. Mmm, mmm, d..."


In [6]:
decades=df.decade.unique()
df.shape, dfg.shape

((4500, 11), (2946, 316))

Additional Data Cleaning- should be able to eliminate with new data set

In [7]:
# eliminate " We don't currently have a license"
counter=0
#eliminate rows with null lyrics
dfc = df[pd.isnull(df.lyrics)==False]


for row in dfc.iterrows():
    if row[1][6].startswith("We don't currently have a license"):
        dfc.iloc[row[1][1],6]="Instrumental"
    counter+=1
    if counter >3000:
        break

#eliminate instrumentals
dfc = dfc[df.lyrics!="Instrumental"]
# eliminate " 	We don't currently have a license"

dfc.shape
#dfc.head()



(4341, 11)

## Unsupervised Machine Learning  

Fortunately, most of the hard work in making LSA and LDA models operational is accomplished by the `gensim` library for Python, which resulted from the [Phd Dissertation of Radim Hurek](http://radimrehurek.com/phd_rehurek.pdf). Hurek provides an exelent accessible dicussion of the logic behind textual anal;ysis, beginning with the statistical semantics hypothesis:

> Statistical patterns of human word usage can be used to figure out what people mean. 

 
The first use of gensim LSI and LDA is to extract topics across the entire data set, spanning years 1970-2014. The number of topic nodes can be changed by adjusting the `numtopics` variable; the value of `showtopics` determines how many of the extracted topics will be displayed.   

In [8]:
# because the index numbers were in string form in the dict
id2w=dict()
run = dict()
for k in nounid2word:
    id2w[int(k)]=nounid2word[k]   

ntopics=10
nwords=10

In [9]:
%%time
print "Topics:"+str(ntopics)+" Words: "+str(nwords)
# reduce the chunksize since our data is not so big.
#for i in range(1,ntopics): just run 10 topics every time for now
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2w, num_topics=ntopics, update_every=1, chunksize=100, passes=1)
topicsobject= lda.print_topics(num_topics=ntopics,num_words=nwords)

Topics:10 Words: 10
Wall time: 7.53 s


In [10]:
topicsdict={}
topicnumber=0
for t in topicsobject:
    topictuple= [d.split('*') for d in t.split(' + ')]
    topicsdict[topicnumber] = topictuple
    topicnumber += 1
lda_decade_topics={}  # this dict will hold decade results, put this one in now for convenience
lda_decade_topics[1000] = topicsdict   # all-year results stored as "decade" 1000   


In [11]:
# this function makes reading the output easier
# dectops is the master dict of topics keyed by decade
# remmeber decade 1000 is the result of running across all years
# howmany is th enumber ot display
# years is a list of the years to display, defaults to whole set
def printreadable(dectops,howmany,years=decades):
    for dec in years:
        howmany = min(howmany,ntopics)
        print "Top "+str(howmany)+" (of "+str(ntopics)+") Topics for "+str(dec)+":"
        for i in range(0,len(dectops[dec])):
            print "\n- - - -Topic "+str(i)+"- - - - "
            for j in range(0,howmany):
                print dectops[dec][i][j][1], dectops[dec][i][j][0]
        print '-----------------------------------------------\n'

In [12]:
printreadable(lda_decade_topics,nwords,[1000])


Top 10 (of 10) Topics for 1000:

- - - -Topic 0- - - - 
baby 0.237
eye 0.063
hand 0.060
road 0.044
touch 0.044
body 0.043
problem 0.040
ooh 0.025
fun 0.025
lip 0.014

- - - -Topic 1- - - - 
thing 0.159
time 0.152
nigga 0.119
day 0.083
man 0.051
bout 0.043
lady 0.037
friend 0.035
shot 0.028
glass 0.014

- - - -Topic 2- - - - 
life 0.115
kid 0.072
boy 0.065
hitta 0.053
hoe 0.035
light 0.032
youre 0.027
star 0.020
drink 0.019
soul 0.017

- - - -Topic 3- - - - 
hitta 0.041
word 0.031
floor 0.029
start 0.024
house 0.022
lane 0.017
truck 0.017
dancing 0.016
tear 0.016
skirt 0.015

- - - -Topic 4- - - - 
girl 0.256
love 0.170
somethin 0.062
sky 0.060
round 0.029
foot 0.029
bed 0.019
dope 0.018
shade 0.017
guy 0.013

- - - -Topic 5- - - - 
bass 0.062
ass 0.051
mind 0.048
club 0.038
hitta 0.030
dance 0.027
home 0.026
ho 0.025
break 0.023
kiss 0.019

- - - -Topic 6- - - - 
diamond 0.093
tonight 0.086
world 0.085
dream 0.039
game 0.034
right 0.033
everybody 0.023
line 0.021
person 0.018
underwear

In [13]:
#for bow in corpus[2:4400:440]:
#    print bow
#    print lda.get_document_topics(bow)
#    print " ".join([id2w[e[0]] for e in bow])
#    print "=========================================="

# LSI analysis 

In [14]:
lsi = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=id2w, num_topics=ntopics)

In [15]:
lsi_decade_topics = {}

tlist = lsi.print_topics(nwords)

i=0
lsidict={}
for t in tlist:
    u = t.split(' + ')
    topictuple= [v.split('*') for v in u]
    lsidict[i]= topictuple
    i +=1

lsi_decade_topics[1000]=lsidict


In [16]:
printreadable(lsi_decade_topics,nwords,[1000])

Top 10 (of 10) Topics for 1000:

- - - -Topic 0- - - - 
"hitta" 0.990
"day" 0.050
"shit" 0.046
"ride" 0.043
"finger" 0.042
"trigger" 0.042
"motherfucking" 0.042
"bitch" 0.034
"love" 0.032
"case" 0.021

- - - -Topic 1- - - - 
"love" 0.808
"baby" 0.507
"girl" 0.168
"time" 0.133
"thing" 0.088
"cheep" 0.072
"night" 0.071
"momma" 0.061
"way" 0.053
"man" 0.046

- - - -Topic 2- - - - 
"baby" 0.730
"love" -0.574
"girl" 0.259
"cheep" 0.146
"momma" 0.123
"man" 0.072
"bird" 0.063
"night" 0.056
"thing" 0.051
"way" 0.046

- - - -Topic 3- - - - 
"girl" 0.903
"baby" -0.335
"man" 0.125
"thing" 0.104
"cheep" -0.088
"momma" -0.072
"pop" 0.071
"lock" 0.068
"time" 0.066
"way" 0.051

- - - -Topic 4- - - - 
"time" -0.961
"girl" 0.137
"thing" -0.130
"love" 0.096
"baby" 0.095
"touch" -0.077
"night" -0.042
"hand" -0.041
"ya" -0.034
"life" -0.033

- - - -Topic 5- - - - 
"thing" -0.634
"touch" -0.593
"hand" -0.296
"time" 0.190
"applause" -0.173
"man" -0.158
"girl" 0.138
"make" -0.119
"loud" -0.115
"way" -0.048



## Breaking It Down by Decade

The corpus for each decade has been separately generated and saved in a separate subdirectory using the name corpus each time. To analyze by decade, we therefore use identical code operating on different working directories.


### LSA

In [17]:
%%time
decades = ["1970","1980","1990","2000","2010"]
for dec in decades:
    filename="../../data/conditioned/decades/"+dec+"/corpus"+dec+".p"
    f = open(filename,'r')  
    corpus = pickle.load(f)
    print dec,len(corpus)
    lsi = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=id2w, num_topics=ntopics)
    tlist = lsi.print_topics(nwords)
    i=0
    lsidict={}
    for t in tlist:
        u = t.split(' + ')
        topictuple= [v.split('*') for v in u]
        lsidict[i]= topictuple
        i +=1
    lsi_decade_topics[int(dec)]=lsidict


1970 6031
1980 6447
1990 7795
2000 10398
2010 4801
Wall time: 1.75 s


In [18]:
printreadable(lsi_decade_topics,nwords)
#lsi_decade_topics

Top 10 (of 10) Topics for 1970:

- - - -Topic 0- - - - 
"schoolgirl" 0.634
"glare" 0.529
"passin" 0.382
"monsoon" 0.271
"goner" 0.185
"hoss" 0.123
"downtown" 0.115
"groove" 0.107
"intellectualism" 0.107
"icky" 0.106

- - - -Topic 1- - - - 
"doobie" -0.864
"dancey" -0.504
"prejudice" -0.009
"windmill" -0.000
"eso" 0.000
"landmark" -0.000
"hunkey" -0.000
"motha" -0.000
"trick-or-treat" 0.000
"picnic" 0.000

- - - -Topic 2- - - - 
"sound" -0.981
"split" -0.082
"passin" -0.076
"duct" -0.050
"tip" -0.048
"schoolgirl" 0.044
"freek" -0.040
"ridin" -0.038
"glare" 0.037
"judge" -0.036

- - - -Topic 3- - - - 
"duct" 0.815
"passin" 0.474
"macaroni" 0.151
"schoolgirl" -0.139
"glare" -0.116
"sound" -0.105
"operating" 0.096
"necklace" 0.073
"coco" 0.071
"monsoon" -0.063

- - - -Topic 4- - - - 
"passin" 0.746
"duct" -0.547
"schoolgirl" -0.207
"glare" -0.173
"operating" 0.121
"monsoon" -0.095
"cologne" 0.086
"split" 0.084
"goner" -0.060
"sound" -0.059

- - - -Topic 5- - - - 
"split" -0.974
"passin" 0.

## LDA 

In [19]:
%%time
topicsobjects={}
for dec in decades:
    filename="../../data/conditioned/decades/"+dec+"/corpus"+dec+".p"
    f = open(filename,'r')  
    corpus = pickle.load(f)
    print dec,len(corpus)
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2w, num_topics=10, update_every=1, chunksize=100, passes=3)
    topicsobjects[dec]=lda.print_topics(num_topics=ntopics,num_words=nwords)
    

1970 6031
1980 6447
1990 7795
2000 10398
2010 4801
Wall time: 23.5 s


In [24]:
# the dict decadetopics was declared back when lsa was done on all years
# it holds lsa for all years under key 1000.
for dec in topicsobjects:
    topicsdict={}
    topicnumber=0
    for t in topicsobjects[dec]:
        topictuple= [d.split('*') for d in t.split(' + ')]
        
        topicsdict[topicnumber] = topictuple
        topicnumber += 1
    lda_decade_topics[int(dec)]=topicsdict
   

In [25]:

printreadable(lda_decade_topics,nwords)

Top 10 (of 10) Topics for 1970:

- - - -Topic 0- - - - 
sound 0.429
freek 0.079
club 0.026
hassle 0.024
physician 0.016
nightshift 0.015
picnic 0.013
body 0.012
selfs 0.011
wig 0.009

- - - -Topic 1- - - - 
pon 0.059
stabber 0.057
reminisce 0.055
steppin 0.033
leprechaun 0.019
curve 0.014
heh 0.013
t-t-tasty 0.013
fighter 0.013
jour 0.013

- - - -Topic 2- - - - 
susie 0.179
smokescreen 0.057
hurtin 0.039
video 0.034
karat 0.027
woa 0.020
addition 0.018
scheming 0.017
twirlin 0.012
blue 0.012

- - - -Topic 3- - - - 
necklace 0.113
duct 0.105
burger 0.049
shrug 0.039
jockin 0.036
skeezer 0.023
excitement 0.021
master 0.016
fortune 0.016
windowpane 0.014

- - - -Topic 4- - - - 
macaroni 0.095
wrapping 0.064
radio 0.043
hit 0.027
veil 0.024
gully 0.023
trick-or-treat 0.022
slave 0.020
floor-oor 0.018
sweetness 0.015

- - - -Topic 5- - - - 
nuttin 0.105
amarretta 0.082
slippin 0.046
downtown 0.041
intellectualism 0.027
ahehe 0.027
kush 0.025
herrre 0.022
wonderland 0.020
past 0.019

- - - -

In [None]:
#topicsobjects