

## Trendy or Timeless?


Popular music is on the forefront of what the people consider new and cool, but it can also reflect shared values that are more persistent, even timeless. Leaving aside the ancient question of whether art imitates life or life imitates art, are there some concepts so firmly rooted in collective human belief systems that they remain constant across time? While a fully generalizable answer to that question may not be possible, some insights can be obtained from analysis of the content of the lyrics of popular music, and comparing them across time. 

This inquiry is primarily exploratory, but by proceeding at multiple levels of abstraction it is hoped that some conclusions can be drawn. More concrete metrics, such as word count, song length, and lexical diversity (non-repetitiveness) help establish a heuristic baseline. As one might expect, the word "love" appears very frequently in song lyrics, but not always in a positive context. It may be possible to learn more by deeper examination.  Another potentially useful metric of the sentiment content of music lyrics is obtained by treating the artist as a proxy (literally the embodiment) for the music's semantic content, and tracing the persistence versus evanescence of individual artists or bands over time.  And of course, some more modern text mining techniques, Latent Semantic Analysis (LSI) Latent Dirichlet Analysis (LDA), will be used to extract models of what the songs are about.

All of these inquiries use the same data set, a sample consisting of the lyrics for songs listed in the Billboard Top 100 for each year from 1970 through 2014. 

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")


#### Load additional:

In [2]:
import itertools
import collections
import json
import pickle
import nltk
import gensim

Instantiate spark: (note this will only work after vagrant is up):

In [3]:
''' 
import os
import findspark
findspark.init()
print findspark.find()
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local[4]')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)
sc._conf.getAll()
''' 

' \nimport os\nimport findspark\nfindspark.init()\nprint findspark.find()\nimport pyspark\nconf = (pyspark.SparkConf()\n    .setMaster(\'local[4]\')\n    .setAppName(\'pyspark\')\n    .set("spark.executor.memory", "2g"))\nsc = pyspark.SparkContext(conf=conf)\nsc._conf.getAll()\n'

In [4]:
''' 
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()
''' 

' \nimport sys\nrdd = sc.parallelize(xrange(10),10)\nrdd.map(lambda x: sys.version).collect()\n'

Open everything that's been saved in /data/conditioned

In [5]:
# Open everything that's been saved in /data/conditioned

df=pd.read_csv("../../data/conditioned/use-this-master-lyricsdf-extracted.csv")
dfg=pd.read_csv("../../data/conditioned/master-lyricsdf-genre_inner.csv")

with open("../../data/conditioned/noun-n-gram.json") as json_file:
    noungram = json.load(json_file)
with open("../../data/conditioned/nounvocab.json") as json_file:
    nounvocab = json.load(json_file)
with open("../../data/conditioned/nounid2word.json") as json_file:
    nounid2word = json.load(json_file)
with open("../../data/conditioned/adj-n-gram.json") as json_file:
    adjgram = json.load(json_file)
with open("../../data/conditioned/adjvocab.json") as json_file:
    adjvocab = json.load(json_file)
with open("../../data/conditioned/adjid2word.json") as json_file:
    adjvocab = json.load(json_file)
with open("../../data/conditioned/decade-dict.json") as json_file:
    decade_dict = json.load(json_file)
    

f = open("../../data/conditioned/ahypes.p",'r')  
ahypes = pickle.load(f)  
f = open("../../data/conditioned/nhypes.p",'r')  
nhypes = pickle.load(f)
f = open("../../data/conditioned/corpus.p",'r')  
corpus = pickle.load(f)


In [6]:
decades=df.decade.unique()
df.shape, dfg.shape

((4500, 11), (2946, 316))

Additional Data Cleaning- should be able to eliminate with new data set

In [7]:
# eliminate " We don't currently have a license"
counter=0
#eliminate rows with null lyrics
dfc = df[pd.isnull(df.lyrics)==False]


for row in dfc.iterrows():
    if row[1][6].startswith("We don't currently have a license"):
        dfc.iloc[row[1][1],6]="Instrumental"
    counter+=1
    if counter >3000:
        break

#eliminate instrumentals
dfc = dfc[df.lyrics!="Instrumental"]
# eliminate " 	We don't currently have a license"

dfc.shape
#dfc.head()



(4341, 11)

## Unsupervised Machine Learning  

Fortunately, most of the hard work in making LSA and LDA models operational is accomplished by the `gensim` library for Python, which resulted from the [Phd Dissertation of Radim Hurek](http://radimrehurek.com/phd_rehurek.pdf). Hurek provides an exelent accessible dicussion of the logic behind textual anal;ysis, beginning with the statistical semantics hypothesis:

> Statistical patterns of human word usage can be used to figure out what people mean. 


 

The first use of gensim LSI and LDA is to extract topics across the entire data set, spanning years 1970-2014. The number of topic nodes can be changed by adjusting the `numtopics` variable; the value of `showtopics` determines how many of the extracted topics will be displayed.   

In [8]:
# because the index numbers were in string form in the dict
id2w=dict()
run = dict()
for k in nounid2word:
    id2w[int(k)]=nounid2word[k]   
   
    # this sets the parameters for all the runs to follow.
ntopics=80 # topics for LDA
nfeatures = 300   #features for LSI
nwords=12

In [9]:
# this function makes reading the output easier
# dectops is the master dict of topics keyed by decade
# remmeber decade 1000 is the result of running across all years
# howmany is the number of topics to display
# years is a list of the years to display, defaults to whole set
def printreadable(dectops,howmany,years=decades):
    for dec in years:
        howmany = min(howmany,ntopics)
        print "Top "+str(howmany)+" of "+str(ntopics)+" Topics for decade "+str(dec)+":"
        for i in range(0,howmany):
            print "\n- - - -Topic "+str(i)+"- - - - "
            for j in range(0,howmany):
                print dectops[dec][i][j][1], dectops[dec][i][j][0]
        print '-----------------------------------------------\n'

# Latent Semantic Indexing (LSI) Analysis 

Latent Semantic Indexing, also called Latent Semantic Analysis, was first described in Deerwester, et al, (1990)  [Indexing by Latent Semantic Analysis](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf). According to the [gensim documentation](https://radimrehurek.com/gensim/tut2.html), "target dimensionality of 200-500 is recommended as a 'golden standard.'"  Accordingly, for LSI 300 features are selected. We first use LSI to extract topic features across the entire data set, then do the same for the LDA model, which is an extension of LSI. Then the sample is partitioned according to decade, and both processes are applied to each partition separately.  

In [10]:
nfeatures = 300
lsi = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=id2w, num_topics=300)

In [11]:
lsi_decade_topics = {}

tlist = lsi.print_topics(nwords)

i=0
lsidict={}
for t in tlist:
    u = t.split(' + ')
    topictuple= [v.split('*') for v in u]
    lsidict[i]= topictuple
    i +=1

lsi_decade_topics[1000]=lsidict



In [12]:
printreadable(lsi_decade_topics,10,[1000])

Top 10 of 80 Topics for decade 1000:

- - - -Topic 0- - - - 
"hitta" 0.990
"day" 0.054
"shit" 0.046
"ride" 0.044
"finger" 0.042
"trigger" 0.042
"motherfucking" 0.042
"bitch" 0.034
"love" 0.032
"drink" 0.021

- - - -Topic 1- - - - 
"love" 0.807
"baby" 0.508
"girl" 0.169
"time" 0.133
"thing" 0.087
"night" 0.074
"cheep" 0.072
"momma" 0.061
"way" 0.055
"man" 0.047

- - - -Topic 2- - - - 
"baby" 0.727
"love" -0.576
"girl" 0.264
"cheep" 0.144
"momma" 0.122
"man" 0.069
"bird" 0.062
"night" 0.061
"thing" 0.055
"time" 0.046

- - - -Topic 3- - - - 
"girl" 0.900
"baby" -0.339
"man" 0.126
"thing" 0.112
"cheep" -0.088
"momma" -0.072
"pop" 0.070
"time" 0.068
"lock" 0.067
"way" 0.056

- - - -Topic 4- - - - 
"time" -0.937
"thing" -0.192
"girl" 0.154
"touch" -0.133
"baby" 0.099
"love" 0.098
"hand" -0.074
"night" -0.049
"life" -0.038
"applause" -0.038

- - - -Topic 5- - - - 
"thing" -0.644
"touch" -0.574
"hand" -0.306
"time" 0.282
"applause" -0.161
"girl" 0.121
"make" -0.114
"loud" -0.107
"man" -0.057
"

## Latent Dirichlet Analysis (LDA)

In [13]:
%%time
print "Topics:"+str(ntopics)+" Words: "+str(nwords)
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2w, num_topics=ntopics, update_every=1, chunksize=100, passes=3)
topicsobject= lda.print_topics(num_topics=ntopics,num_words=nwords)

Topics:80 Words: 12
Wall time: 1min 30s


In [14]:
lda_decade_topics={}  # this dict will hold decade results, also this one for convenience

topicsdict={}
topicnumber=0
for t in topicsobject:  #store them all even if only some will be displayed
    topictuple= [d.split('*') for d in t.split(' + ')]
    topicsdict[topicnumber] = topictuple
    topicnumber += 1

lda_decade_topics[1000] = topicsdict   # all-year results stored as "decade" 1000   


In [15]:
printreadable(lda_decade_topics,nwords,[1000])


Top 12 of 80 Topics for decade 1000:

- - - -Topic 0- - - - 
heart 0.340
kind 0.111
smile 0.096
hair 0.078
train 0.047
door 0.043
ice 0.038
morning 0.031
bang 0.028
ticket 0.024
sea 0.017
bucket 0.017

- - - -Topic 1- - - - 
seat 0.223
jean 0.203
inside 0.119
couple 0.070
suit 0.055
fool 0.051
queen 0.036
da 0.036
la 0.030
lung 0.013
hitta 0.012
bench 0.011

- - - -Topic 2- - - - 
man 0.386
shot 0.204
fight 0.041
picture 0.039
king 0.037
think 0.036
tune 0.030
castle 0.027
mix 0.017
note 0.014
square 0.011
needle 0.011

- - - -Topic 3- - - - 
lover 0.121
notch 0.092
lie 0.063
filth 0.063
answer 0.061
shawty 0.059
la-laaa 0.057
shine 0.050
bite 0.043
panamera 0.040
belt 0.035
alcohol 0.028

- - - -Topic 4- - - - 
hitta 0.238
finger 0.180
competition 0.045
church 0.039
tobacco 0.038
shotgun 0.036
league 0.034
doll 0.032
difference 0.031
record 0.028
mommy 0.022
interlock 0.015

- - - -Topic 5- - - - 
tear 0.265
thought 0.154
nothin 0.117
shoulder 0.102
honey 0.071
contigo 0.035
aisle 0.0

In [16]:
#for bow in corpus[2:4400:440]:
#    print bow
#    print lda.get_document_topics(bow)
#    print " ".join([id2w[e[0]] for e in bow])
#    print "=========================================="

## Breaking It Down by Decade

The corpus for each decade has been separately generated and saved in a separate subdirectory using the name corpus each time. To analyze by decade, we therefore use identical code operating on different working directories.


### LSA by Decade

In [17]:
%%time
decades = ["1970","1980","1990","2000","2010"]
for dec in decades:
    filename="../../data/conditioned/decades/"+dec+"/corpus"+dec+".p"
    f = open(filename,'r')  
    corpus = pickle.load(f)
    print dec,len(corpus)
    lsi = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=id2w, num_topics=ntopics)
    tlist = lsi.print_topics(nwords)
    i=0
    lsidict={}
    for t in tlist:
        u = t.split(' + ')
        topictuple= [v.split('*') for v in u]
        lsidict[i]= topictuple
        i +=1
    lsi_decade_topics[int(dec)]=lsidict


1970 6031
1980 6447
1990 7795
2000 10398
2010 4801
Wall time: 1.8 s


In [18]:
printreadable(lsi_decade_topics,10)
#lsi_decade_topics

Top 10 of 80 Topics for decade 1970:

- - - -Topic 0- - - - 
"schoolgirl" 0.634
"glare" 0.529
"passin" 0.382
"monsoon" 0.271
"goner" 0.185
"hoss" 0.123
"downtown" 0.115
"groove" 0.107
"intellectualism" 0.107
"icky" 0.106

- - - -Topic 1- - - - 
"doobie" 0.864
"dancey" 0.504
"prejudice" 0.009
"true" -0.000
"declinin" 0.000
"pinky" 0.000
"airwave" 0.000
"baby-baby-baby" -0.000
"finisher" 0.000
"salary" 0.000

- - - -Topic 2- - - - 
"sound" -0.981
"split" -0.082
"passin" -0.076
"duct" -0.050
"tip" -0.048
"schoolgirl" 0.044
"freek" -0.040
"ridin" -0.038
"glare" 0.037
"judge" -0.036

- - - -Topic 3- - - - 
"duct" 0.815
"passin" 0.474
"macaroni" 0.151
"schoolgirl" -0.139
"glare" -0.116
"sound" -0.105
"operating" 0.096
"necklace" 0.073
"coco" 0.071
"monsoon" -0.063

- - - -Topic 4- - - - 
"passin" 0.746
"duct" -0.547
"schoolgirl" -0.207
"glare" -0.173
"operating" 0.121
"monsoon" -0.095
"cologne" 0.086
"split" 0.084
"goner" -0.060
"sound" -0.059

- - - -Topic 5- - - - 
"split" -0.974
"passin" 

## LDA Over the Years

In [19]:
%%time
topicsobjects={}
for dec in decades:
    filename="../../data/conditioned/decades/"+dec+"/corpus"+dec+".p"
    f = open(filename,'r')  
    corpus = pickle.load(f)
    print dec,len(corpus)
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2w, num_topics=10, update_every=1, chunksize=100, passes=3)
    topicsobjects[dec]=lda.print_topics(num_topics=ntopics,num_words=nwords)
    

1970 6031
1980 6447
1990 7795
2000 10398
2010 4801
Wall time: 21.1 s


In [20]:
# the dict decadetopics was declared back when lsa was done on all years
# it holds lsa for all years under key 1000.
for dec in topicsobjects:
    topicsdict={}
    topicnumber=0
    for t in topicsobjects[dec]:
        topictuple= [d.split('*') for d in t.split(' + ')]
        
        topicsdict[topicnumber] = topictuple
        topicnumber += 1
    lda_decade_topics[int(dec)]=topicsdict
   

In [21]:

printreadable(lda_decade_topics,nwords)

Top 12 of 80 Topics for decade 1970:

- - - -Topic 0- - - - 
operating 0.050
hurtin 0.049
jockin 0.037
ahehe 0.031
herrre 0.026
physician 0.026
grape 0.015
blue 0.015
wig 0.014
heh 0.013
fighter 0.013
x11 0.013

- - - -Topic 1- - - - 
nuttin 0.108
smokescreen 0.063
preacher 0.051
reminisce 0.050
share 0.030
addition 0.020
excitement 0.019
forgiveness 0.019
box 0.018
wound 0.018
hi-fi 0.016
tombstone 0.012

- - - -Topic 2- - - - 
macaroni 0.091
amarretta 0.080
wrapping 0.061
ba-da 0.039
trim 0.038
buffalo 0.031
veil 0.023
gully 0.022
trick-or-treat 0.021
master 0.014
baby-baby-baby 0.014
guzzling 0.012

- - - -Topic 3- - - - 
split 0.315
necklace 0.091
duct 0.084
watch 0.026
skeezer 0.018
traffic 0.017
windowpane 0.011
upset 0.011
windmill 0.009
rockilla 0.008
thinking 0.007
meant 0.007

- - - -Topic 4- - - - 
sound 0.385
cologne 0.092
hoss 0.082
club 0.023
hassle 0.022
smyth 0.012
picnic 0.011
foreigner 0.008
groovy 0.008
a-rappin 0.007
sweeties 0.007
finisher 0.007

- - - -Topic 5- - 

KeyError: 10

Finally, the topic result sets are saved for processing and display.

In [None]:
#access example, LDA topic 3  for 1970:
print lda_decade_topics[1970][3]
# LSI is structured identically, of course results are different:
print "\n" ,lsi_decade_topics[1970][3]

In [None]:
with open("lda_decade_topics.json","w") as fd:
    json.dump(lda_decade_topics, fd)
    
with open("lsi_decade_topics.json","w") as fd:
    json.dump(lsi_decade_topics, fd)
