## LSI and LDA On Data Partitioned by Genre

In this part, gensim's LSI and LDA models are applied to genre subsets of the database. Because each subset is smaller than the full sample, both the number of topics and the number of genres must be reduced. Requiring more than 240 documents per genre and estimating 200 rather than 300 topics for LSI reduces the potential for overfitting. For LDA, the number of topics extracted remains at 80.

As usual, the programming starts with importing libraries.

#### Note: this notebook requires Spark and is run under Vagrant.

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [2]:
import time
import itertools
import json
import pickle
import os
import collections
import nltk
import gensim

In [3]:
import findspark
findspark.init()
print findspark.find()

/home/vagrant/spark


In [4]:
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local[4]')
    .setAppName('pyspark')
    .set("spark.executor.memory", "4g"))
sc = pyspark.SparkContext(conf=conf)

sc._conf.getAll()

[(u'spark.master', u'local[4]'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.executor.memory', u'4g'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.app.name', u'pyspark')]

In [5]:
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()

['2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 

In [6]:
sys.version

'2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

In [7]:
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')

from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS

import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")

from collections import Counter

## Loading of Data

The data has been extensively conditioned elsewhere. This dataset has the genres for titles all the way back to 1970, in the wide format used by HW1, with each genre represented by a column containing either true or false. In addition, the JSON file `songsbygenre.json` is keyed by the complete list of genres and provides an easy source for iterating over them.
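
As a rough illustration of how the wide format supports per-genre subsetting, the sketch below counts the songs flagged for each genre and keeps only the genres above a size threshold. It uses plain Python dicts in place of the DataFrame. The rows, the helper name `genres_above`, and the threshold of 1 are made up for the example; the analysis in this notebook uses 240.

```python
# Toy stand-in for the wide table: one dict per song, one boolean per genre.
# The rows and the helper name are hypothetical, for illustration only.
songs = [
    {"title": "A", "/wiki/Funk": True,  "/wiki/Disco": True,  "/wiki/Zydeco": False},
    {"title": "B", "/wiki/Funk": True,  "/wiki/Disco": False, "/wiki/Zydeco": False},
    {"title": "C", "/wiki/Funk": False, "/wiki/Disco": True,  "/wiki/Zydeco": True},
    {"title": "D", "/wiki/Funk": True,  "/wiki/Disco": False, "/wiki/Zydeco": False},
]
genre_columns = ["/wiki/Funk", "/wiki/Disco", "/wiki/Zydeco"]

def genres_above(rows, genres, min_songs):
    """Map each genre flagged on more than min_songs rows to its row count."""
    counts = {g: sum(1 for r in rows if r[g]) for g in genres}
    return {g: n for g, n in counts.items() if n > min_songs}

selected = genres_above(songs, genre_columns, min_songs=1)
print(sorted(selected.items()))
```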

In [8]:
# load 
with open("../../notebooks/ss/songsbygenre.json") as json_file:
    genresj = json.load(json_file)
genrelist= genresj.keys()

lyrics_pd_df = pd.read_csv("../../data/conditioned/all_years_and_genres_with_lyrics_and_wordcount_and_vocabulary.csv")
lyrics_pd_df.head(2)
#genrelist


Unnamed: 0.1,Unnamed: 0,song_key,lyrics,lyrics_url,lyrics_abstract,decade,artist,title,year,band_singer,ranking,song,songurl,url,born,genres,ya,/wiki/2_Tone,/wiki/A_cappella,/wiki/Acid_house,/wiki/Acid_jazz,/wiki/Acid_rock,/wiki/Acoustic_music,/wiki/Acoustic_rock,/wiki/Adult_Contemporary,/wiki/Adult_Contemporary_Music,/wiki/Adult_contemporary,/wiki/Adult_contemporary_music,/wiki/Adult_contemporary_music#Soft_adult_contemporary,/wiki/Afrobeat,/wiki/Album-oriented_rock,/wiki/Alternative_R%26B,/wiki/Alternative_country,/wiki/Alternative_dance,/wiki/Alternative_dance#Indietronica,/wiki/Alternative_hip_hop,/wiki/Alternative_metal,/wiki/Alternative_pop,/wiki/Alternative_rock,/wiki/Ambient_house,/wiki/Ambient_music,/wiki/American_folk_music,/wiki/Americana_(music),/wiki/Anarcho-punk,/wiki/Anti-folk,/wiki/Arena_rock,/wiki/Art_pop,/wiki/Art_punk,/wiki/Art_rock,/wiki/Avant-garde_music,...,/wiki/Southern_hip_hop,/wiki/Southern_rap,/wiki/Southern_rock,/wiki/Southern_soul,/wiki/Space_disco,/wiki/Space_rock,/wiki/Spoken_word,/wiki/Sunshine_pop,/wiki/Surf_music,/wiki/Surf_rock,/wiki/Swamp_pop,/wiki/Swamp_rock,/wiki/Swing_(genre),/wiki/Swing_music,/wiki/Symphonic_rock,/wiki/Synthpop,/wiki/Talking_blues,/wiki/Tech_house,/wiki/Techno,/wiki/Techno_music,/wiki/Teen_pop,/wiki/Tejano_music,/wiki/Thrash_metal,/wiki/Traditional_pop,/wiki/Traditional_pop_music,/wiki/Trance_music,/wiki/Trap_music,/wiki/Trip_hop,/wiki/UK_funky,/wiki/UK_garage,/wiki/Underground_hip_hop,/wiki/Urban_adult_contemporary,/wiki/Urban_contemporary,/wiki/Urban_contemporary_gospel,/wiki/Urban_music,/wiki/Vocal_music,/wiki/West_Coast_Rap,/wiki/West_Coast_hip_hop,/wiki/West_coast_hip_hop,/wiki/Western_music_(North_America),/wiki/Western_swing,/wiki/Witch_house,/wiki/World_music,/wiki/Worldbeat,/wiki/Worship_music,/wiki/Zydeco,wordcount,wordset,lexdiv,repetition_score
0,627,1976-86,Are you ready\nDo what you wanna do\nDo what y...,http://lyrics.wikia.com/Ohio_Players:Who%27d_S...,Are you ready\nDo what you wanna do\nDo what y...,1970,Ohio Players,Who'd She Coo?,1976,Ohio Players,86,Who'd She Coo?,/wiki/Who%27d_She_Coo%3F,/wiki/Ohio_Players,False,"[/wiki/Funk, /wiki/Disco, /wiki/Rhythm_and_blu...",1959 ( 1959 ) –2002 ( 2002 ),False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,35,26,0.742857,1.346154
1,1375,1984-59,I thought that dreams belonged to other men 'C...,http://lyrics.wikia.com/index.php?title=Mike_R...,I thought that dreams belonged to other men 'C...,1980,Mike Reno,Almost Paradise,1984,Ann Wilson,59,Almost Paradise,/wiki/Almost_Paradise,/wiki/Ann_Wilson,1950-06-19,"[/wiki/Rock_music, /wiki/Hard_rock, /wiki/Folk...",1970–present,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,144,104,0.722222,1.384615


In [9]:

# cull excess columns swept up on read
del lyrics_pd_df['Unnamed: 0']
del lyrics_pd_df['lyrics_abstract']
del lyrics_pd_df['band_singer']

# after this come some long function definitions; then the action occurs
lyrics_pd_df.shape

(4893, 451)

## Helper Functions

The following functions are lifted from HW5 with some modification. Because they were well documented elsewhere, most of the comments have been removed. 

In [10]:
def get_parts(thetext):
    """Return (nouns, descriptives): per-sentence lists of noun and adjective lemmas."""
    # replace runs of dots and dashes with a single space
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            # token[1] is the part-of-speech tag, token[4] the lemma
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    # skip stopwords, punctuation-edged tokens and single characters
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    # keep only sentences that have at least one noun and one adjective
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2
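
To make the filtering logic of `get_parts` easier to follow without installing `pattern`, here is a minimal sketch that applies the same rules to sentences that are already tagged. The hand-tagged input, the tiny stop list, and the names `usable` and `split_parts` are assumptions made for the example; the real function gets its tags and lemmas from `pattern.en.parse`.

```python
# Penn Treebank tags kept by get_parts
NOUN_TAGS = {"NN", "NNS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
# Hypothetical, tiny stand-ins for the stop list and punctuation list
STOP = {"the", "a", "is", "my"}
PUNCT = set(".,;:!?")

def usable(lemma):
    """Reject empty or one-character lemmas, stopwords, punctuation-edged tokens."""
    return (len(lemma) > 1 and lemma not in STOP
            and lemma[0] not in PUNCT and lemma[-1] not in PUNCT)

def split_parts(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (tag, lemma) pairs."""
    nouns, descriptives = [], []
    for sentence in tagged_sentences:
        n = [lemma for tag, lemma in sentence if tag in NOUN_TAGS and usable(lemma)]
        d = [lemma for tag, lemma in sentence if tag in ADJ_TAGS and usable(lemma)]
        # a sentence survives only if it has both a noun and an adjective
        if n and d:
            nouns.append(n)
            descriptives.append(d)
    return nouns, descriptives

tagged = [
    [("PRP$", "my"), ("JJ", "lonely"), ("NN", "heart"), ("VBZ", "beat")],
    [("NNS", "dream"), ("VBP", "fade")],  # no adjective, so this one is dropped
]
nouns, descriptives = split_parts(tagged)
print(nouns)
print(descriptives)
```

Note that a sentence with nouns but no adjectives contributes nothing to the noun corpus either, which is worth keeping in mind when reading the document counts later on.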


The Spark processes work appropriately inside this function:

In [11]:
# this modularizes the process documented in detail in HW5
def makecorpus(genre_df):
    # NOTE: the pickle path below reads the global `genre` set by the calling loop
    ldf = sqlsc.createDataFrame(genre_df)
    lyric_parts = ldf.map(lambda r : get_parts(r.lyrics))

    parseout=lyric_parts.collect()
    # keep the nouns only; flatten so that each sentence is one document
    nounrdd=sc.parallelize([ele[0] for ele in parseout]).flatMap(lambda l: l)
    # count how often each noun occurs
    nwordsrdd = (nounrdd.flatMap(lambda word: word)
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b)
                    )
    top10=nwordsrdd.takeOrdered(10, key = lambda x: -x[1])  # for inspection; not used below

    # assign an integer id to every noun
    nounvocabtups = (nwordsrdd
                 .map(lambda (x,y): x)
                 .zipWithIndex()
    )
    nounvocab=nounvocabtups.collectAsMap()
    nounid2word=nounvocabtups.map(lambda (x,y): (y,x)).collectAsMap()
    # bag of words: each document becomes a list of (word id, count) pairs
    documents = nounrdd.map(lambda words: Counter([nounvocab[word] for word in words]).most_common())
    corpus=documents.collect()
    # genre starts with "/wiki/", so the file lands under ../../notebooks/ss/wiki/
    with open("../../notebooks/ss"+genre+"_corpus.p", "wb") as fd:
        pickle.dump(corpus, fd)
    return corpus,nounid2word
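
For readers who want to see what `makecorpus` produces without a Spark context, the following is a small pure-Python sketch of the same bag-of-words construction. The sample sentences are invented, and the vocabulary is sorted here only to make the ids deterministic; in the Spark version the ids come from `zipWithIndex` and their order is not guaranteed.

```python
from collections import Counter

# Each inner list stands for the nouns of one sentence (hypothetical data).
sentences = [["heart", "love", "heart"], ["night", "love"]]

# Count every noun, then give each distinct noun an integer id.
word_counts = Counter(w for s in sentences for w in s)
vocab = {w: i for i, w in enumerate(sorted(word_counts))}
id2word = {i: w for w, i in vocab.items()}

# A document is a list of (word id, count) pairs, the input format gensim expects.
corpus = [sorted(Counter(vocab[w] for w in s).items()) for s in sentences]

print(id2word)
print(corpus)
```

The pair `(corpus, id2word)` plays the same role as the tuple returned by `makecorpus`, and can be passed to a gensim model through its `corpus` and `id2word` arguments.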


## Iterate Over Genres, Building Corpus and Running LSI and LDA

In [12]:
%%time
#genrelist=["/wiki/Rock_music"]  # UNCOMMENT TO TEST ON ONE GENRE

# dicts to hold the output
lsi_genre_topics={}
lda_genre_topics={}    
nfeatures = 200
ntopics=80
nwords=20 

for genre in genrelist:
    # take genre subset of df 
    genredf = lyrics_pd_df[lyrics_pd_df[genre]==True]
    if len(genredf) > 240 and genre != "NA":
        print genre, len(genredf)
        #get the corpus and the id-to-word dict
        atuple=makecorpus(genredf)
        corpus=atuple[0]
        id2w=atuple[1]
        lsi = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=id2w, num_topics=nfeatures)
        tlist = lsi.print_topics(nfeatures)
        # store the topics in a JSON-compatible way: topic id -> list of "weight*word" strings
        lsi_topics_for_genre={}
        for t in tlist:
            plist = t[1].split(' + ')
            lsi_topics_for_genre[t[0]]=plist
        lsi_genre_topics[genre]=lsi_topics_for_genre
         
        lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2w, num_topics=ntopics, update_every=1, chunksize=100, passes=3)
        topicsobject= lda.print_topics(num_topics=ntopics,num_words=nwords)
        # store the topics in a JSON-compatible way: topic id -> list of "weight*word" strings
        lda_topics_for_genre={}
        for t in topicsobject:
            plist = t[1].split(' + ')
            lda_topics_for_genre[t[0]]=plist
        lda_genre_topics[genre]=lda_topics_for_genre
            

    


/wiki/Dance-pop 283
/wiki/Alternative_rock 328
/wiki/Soul_music 835
/wiki/Rhythm_and_blues 393
/wiki/Dance_music 291
/wiki/Rock_music 721
/wiki/Pop_rock 709
/wiki/Contemporary_R%26B 1252
/wiki/Disco 268
/wiki/Funk 324
/wiki/Soft_rock 504
/wiki/Country_music 403
/wiki/Hard_rock 311
/wiki/Hip_hop_music 1275
/wiki/Pop_music 1598
CPU times: user 2min 19s, sys: 1.14 s, total: 2min 20s
Wall time: 4min 37s
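
The loop keeps each topic as a list of `weight*word` strings. When numeric weights are needed downstream, the strings can be split further. The sketch below is one possible way to do that; the function name `parse_topic` and the sample string are invented, and the exact string format (for example, whether words are wrapped in double quotes) depends on the gensim version, so treat the quote stripping as an assumption.

```python
def parse_topic(topic_string):
    """Split 'w1*term1 + w2*term2' into [(w1, term1), (w2, term2)]."""
    pairs = []
    for part in topic_string.split(' + '):
        weight, term = part.split('*', 1)
        # strip the double quotes that some gensim versions put around the term
        pairs.append((float(weight), term.strip().strip('"')))
    return pairs

# Hypothetical topic string; LSI weights can be negative.
example = '0.523*"love" + -0.301*"heart" + 0.112*night'
parsed = parse_topic(example)
print(parsed)
```

Lists of `(weight, word)` pairs serialize to JSON as nested lists, so they could be stored in place of the raw strings if that is more convenient for later analysis.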


### Place the Results in JSON Files

In [13]:
with open("../../notebooks/ss/lsi_genre_topics.json","w") as fd:
    json.dump(lsi_genre_topics, fd)
    
with open("../../notebooks/ss/lda_genre_topics.json","w") as fd:
    json.dump(lda_genre_topics, fd)

print "Finished!"

Finished!
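
One detail to keep in mind when reading these files back: the topic ids are integers in the dictionaries above, but JSON object keys are always strings. The sketch below shows the round trip on a small made-up dictionary, using `json.dumps` and `json.loads` in memory rather than the files written above.

```python
import json

# Hypothetical topics for one genre: integer topic id -> list of "weight*word" strings
topics = {0: ['0.523*"love"', '0.301*"heart"'], 1: ['0.410*"night"']}

encoded = json.dumps(topics)
decoded = json.loads(encoded)

# The integer ids come back as the strings '0' and '1'.
print(sorted(decoded.keys()))

# Convert the keys back to integers if the numeric ids are needed.
restored = {int(k): v for k, v in decoded.items()}
print(restored == topics)
```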


In [15]:
os.getcwd()

'/home/vagrant/lyrics-lab/notebooks/mj'