#Lyrics Metadata Processing
**This notebook combines lessons learned from [Data-Exploration Notebook](Data-Exploration.ipynb) and [Process-Missing-Lyrics Notebook](Process-Missing-Lyrics.ipynb) to accomplish the following strategy**

1. the pipeline will use the lyrics.wikia API to retrieve song lyrics, both abstracts (suitable for partial display to users) and full text (suitable for processing needs)
    1. URL e.g. [Paul Simon's "Bridge Over Troubled Water"](http://lyrics.wikia.com/wiki/Paul_Simon:Bridge_Over_Troubled_Water)
    1. API Metadata e.g. [Joe Bonamassa's "So Many Roads"](http://lyrics.wikia.com/api.php?action=lyrics&artist=Joe%20Bonamassa&song=So%20Many%20Roads&fmt=json)  
1. maintain the following columns from the initial [parsed lyrics metadata for 1970-2014](../../data/provided/all%20billboard%20top%20100%20songs%20from%201970-2014.csv) (which was provided): `position`, `year`, `title`, `artist`, `title.href` (wikipedia)
1. maintain the derived columns `decade` and `song_key`

This notebook will result in a master dataframe exported to [master-lyricsdf.csv](../../data/conditioned/master-lyricsdf.csv).

Lyrics harvesting (beyond metadata which is done in this notebook) will be done in [Lyrics-Raw-Harvesting Notebook](Lyrics-Raw-Harvesting.ipynb) 

After the core pipeline processing in [Vocab-Consolidation Notebook](Vocab-Consolidation.ipynb), the following artifacts will be established, possibly combined for better latent-factor processing (not reflected here):
* vocabs for noun and adj
* n-gram for noun and adj
* synonyms for noun and adj
* hypernyms for noun and adj

Possible additional steps with data processing:
1. split out decade-centric artifacts to aid statistical analysis, to be reflected in work done in [Decade-Separation Notebook](Decade-Separation.ipynb)
1. process 2015?
1. process prior to 1970? (involves some Billboard scraping)

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [105]:
## MLJ: Additional Extras
import os
import codecs
import requests
import time
import itertools
import json
import pickle

##Load Provided Data Into Pandas Dataframe
* ultimately will preserve all of `position`, `year`, `title`, `artist`, `title.href` (wikipedia)
* `lyrics` will only be used if not replaced by lyrics.wikia api

In [3]:
# load the provided lyrics
lyrics_pd_df = pd.read_csv("../../data/provided/all billboard top 100 songs from 1970-2014.csv")  

In [4]:
# cull excess columns swept up on read
lyrics_pd_df = lyrics_pd_df[['position','year','title.href','title','artist','lyrics']]

In [5]:
lyrics_pd_df.shape

(4500, 6)

In [6]:
lyrics_pd_df.head()

Unnamed: 0,position,year,title.href,title,artist,lyrics
0,1,1970,https://en.wikipedia.org/wiki/Bridge_over_Trou...,Bridge over Troubled Water,Simon and Garfunkel,When you're weary feeling small When tears are...
1,2,1970,https://en.wikipedia.org/wiki/(They_Long_to_Be...,(They Long to Be) Close to You,The Carpenters,x
2,3,1970,https://en.wikipedia.org/wiki/American_Woman_(...,American Woman,The Guess Who,"American woman, stay away from me American wom..."
3,4,1970,https://en.wikipedia.org/wiki/Raindrops_Keep_F...,Raindrops Keep Fallin' on My Head,B.J. Thomas,Raindrops keep falling on my head Just like th...
4,5,1970,https://en.wikipedia.org/wiki/War_(Edwin_Starr...,War,Edwin Starr,"War huh Yeah! Absolutely uh-huh, uh-huh huh Ye..."


##Augment With Additional Derived Columns
* `decade` , e.g. 1970
* `song_key`, e.g. 1970-1
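
As a quick illustration of the two derived values, the sketch below computes them for a single row without pandas. The helper names `decade_for` and `song_key_for` are hypothetical and are not used elsewhere in this notebook; the cells that follow do the same thing column-wise on the dataframe.

```python
def decade_for(year):
    """Floor a year to the start of its decade, e.g. 1987 -> 1980."""
    return year - year % 10


def song_key_for(year, position):
    """Join year and chart position with a hyphen, e.g. (1970, 1) -> '1970-1'."""
    return "{}-{}".format(year, position)


print(decade_for(2008))        # 2000
print(song_key_for(2008, 91))  # 2008-91
```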

In [7]:
# add `decade` column to df
lyrics_pd_df['decade'] = lyrics_pd_df.year.apply(lambda y : y - y%10)

In [8]:
# add a `song_key` column by joining `year` and `position` for better identity 
# adapted from:
# http://stackoverflow.com/questions/29983946/concatenate-cells-into-a-string-with-separator-pandas-python
lyrics_pd_df['song_key'] = lyrics_pd_df[['year','position']].apply(lambda row: '-'.join(row.astype(str).values), axis=1)

In [9]:
# view a sample of output
lyrics_pd_df.sample(5)

Unnamed: 0,position,year,title.href,title,artist,lyrics,decade,song_key
3890,91,2008,https://en.wikipedia.org/wiki/Mrs._Officer,Mrs. Officer,Lil Wayne,x,2000,2008-91
1279,80,1982,https://en.wikipedia.org/wiki/Here_I_Am_(Air_S...,Here I Am,Air Supply,Here I am playing with those memories again An...,1980,1982-80
4227,28,2012,https://en.wikipedia.org/wiki/Boyfriend_(Justi...,Boyfriend,Justin Bieber,If I was your boyfriend I'd never let you go I...,2010,2012-28
175,76,1971,https://en.wikipedia.org/wiki/If_Not_for_You,If Not for You,Olivia Newton-John,"If not for you, babe, I Couldn't even find the...",1970,1971-76
1483,84,1984,https://en.wikipedia.org/wiki/Time_Will_Reveal...,Time Will Reveal,DeBarge,What can I do? To make you feel secure Remove ...,1980,1984-84


##Process Metadata from Lyrics.Wikia
* URL e.g. [Paul Simon's "Bridge Over Troubled Water"](http://lyrics.wikia.com/wiki/Paul_Simon:Bridge_Over_Troubled_Water)
* API Metadata e.g. [Joe Bonamassa's "So Many Roads"](http://lyrics.wikia.com/api.php?action=lyrics&artist=Joe%20Bonamassa&song=So%20Many%20Roads&fmt=json)   
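
To make the shape of the API request concrete, here is a minimal sketch that builds the query URL with the standard library only. In this notebook `requests` does the encoding itself when given a `params` dictionary, so `build_api_url` is a hypothetical helper for illustration, and the parameter order is simply one fixed choice.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

API_ROOT = "http://lyrics.wikia.com/api.php"


def build_api_url(artist, song):
    """Return the metadata request URL for one artist/song pair."""
    # A list of pairs keeps the parameter order stable across Python versions.
    params = [("action", "lyrics"), ("fmt", "json"),
              ("artist", artist), ("song", song)]
    return "{}?{}".format(API_ROOT, urlencode(params))


print(build_api_url("Edwin Starr", "War"))
```

Spaces are encoded as `+` and characters such as parentheses or apostrophes are percent-encoded, which is equivalent to the `%20` form used in the example link above.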

In [10]:
lw_api_root = "http://lyrics.wikia.com/api.php"
lw_success_dir = "../../data/harvested/lw-json/"
lw_issues_dir = "../../data/harvested/lw-json-error/"

In [11]:
# adapted from https://justgagan.wordpress.com/2010/09/22/python-create-path-or-directories-if-not-exist/
def assureDirExists(path):
    d = os.path.dirname(path)
    if not os.path.exists(d):
        os.makedirs(d)

In [12]:
# make sure the directories exist
assureDirExists(lw_success_dir)
assureDirExists(lw_issues_dir)

In [13]:
# adapted from http://stackoverflow.com/questions/82831/check-whether-a-file-exists-using-python
def isNonZeroFile(fpath):
    # True only when the path is an existing file with at least one byte
    return os.path.isfile(fpath) and os.path.getsize(fpath) > 0

In [14]:
print "lw-json test (expect true) --> ", isNonZeroFile("{}1971-72.json".format(lw_success_dir))
print "lw-json-error test (expect false) --> ", isNonZeroFile("{}1971-72.json".format(lw_issues_dir))

lw-json test (expect true) -->  True
lw-json-error test (expect false) -->  False


In [15]:
# consolidated helper method for paths to be used in writing success and issues as needed.
def buildPathsDictFor(song_key):
    """
    return a dictionary of paths and filename for the `song_key`
    """
    success_jpath = "{}{}.json".format(lw_success_dir,song_key) #normal json
    issues_jpath = "{}{}.json".format(lw_issues_dir,song_key) #json with issues
    issues_tpath = "{}{}.txt".format(lw_issues_dir,song_key) #text with issue message
    
    return {'song_key':song_key, 'success_jpath':success_jpath, 
            'issues_jpath':issues_jpath, 'issues_tpath':issues_tpath}
    

In [104]:
# consolidated helper method for result writes
def writeStrToFile(text, pathsd, pathsd_key):
    """
    write the given text to the path stored in pathsd[pathsd_key].
    
    --- Input ---
    text: String to write (named `text` to avoid shadowing the builtin `str`)
    pathsd: Dictionary holding paths
    pathsd_key: String key to use in pathsd when writing
    
    --- Return ---
    pathsd[pathsd_key]
    """
    path = pathsd[pathsd_key]
    
    with codecs.open(path,'w',encoding='utf8') as text_file:
        text_file.write(text) #auto-closed due to context manager use
    
    return path
    

In [73]:
# consolidated handling of clean and write for non-cached api / raw text results.
# this will write to a customized json (augmented for our processing needs).

def cleanAndWriteLyricsWikiaMetaToJson(song_key, text, pathsd, debug=False):
    """
    clean metadata from result of API call to lyrics.wikia
    write results, even issues, appropriately
    
     --- Input ---
    song_key: String key to use
    text: String response from the lw api 
    pathsd: Dictionary of paths to use based on results
    debug: optional print of processing, default = False
    
    --- Return ---
    tuple of the following:
    [0] String path to results        
    [1] Boolean error in lyrics processing?
    [2] Boolean error in url processing? 
    """  
    k = song_key #shorthand
    v = text #shorthand
    
    d = {}
    d['song_key'] = k    
        
    if debug:
        print "key --> ", k    
        print "raw value -->\n", v
    
    qtoken = '":"'
    atoken = "':'"    
    q_mode = False
        
    qc = v.count(qtoken)
    ac = v.count(atoken)
    
    lyrics = ""
    llast = -1
    url = ""
    ulast = -1
    
    elyrics = False
    eurl = False
    
    # want lyrics and url
    #quot mode
    if qc > ac:    
        q_mode = True
        
        #lyrics
        try:
            lyrics = v.split('"lyrics{}'.format(qtoken),1)[1]
            lyrics = lyrics[:lyrics.find('"url{}'.format(qtoken))]
            llast = lyrics.rfind('"')    
        except Exception as e:
            print "{} lyrics not parsed --> {}".format(k,e)
            elyrics = True
        
        #url
        try:
            url = v.split('"url{}'.format(qtoken),1)[1]
            ulast = url.rfind('"') 
        except Exception as e:
            print "{} url not parsed --> {}".format(k,e)    
            eurl = True
        
    #apos mode    
    else:
        q_mode = False
        
        #lyrics
        try:
            lyrics = v.split("'lyrics{}".format(atoken),1)[1]
            lyrics = lyrics[:lyrics.find("'url{}".format(atoken))]
            llast = lyrics.rfind("'")
        except Exception as e:
            print "{} lyrics not parsed --> {}".format(k,e)
            elyrics = True
        
        #url
        try:
            url = v.split("'url{}".format(atoken),1)[1]
            ulast = url.rfind("'")
        except Exception as e:
            print "{} url not parsed --> {}".format(k,e)    
            eurl = True
        
    if debug:
        print
        print "q_mode? {}, qc: {}, ac: {}".format(q_mode, qc, ac)
    
    # final parse on lyrics
    if lyrics:
        lyrics = lyrics[:llast]  
        
        # check for any lingering escaped quotes within lyrics 
        lyrics = lyrics.replace('\\"','"').replace("\\'","'")
        # check for double new lines
        lyrics = lyrics.replace("\n\n","\n")
            
    # final parse on url    
    if url:
        url = url[:ulast]
    
    # add to dictionary, even if empty.
    d['lyrics_abstract'] = lyrics
    d['lyrics_url'] = url
    
    if debug:
        print
        print "error with lyrics_abstract parse? ", elyrics
        print "lyrics -->\n", lyrics
        print
        print "error with lyrics_url parse? ", eurl
        print "url --> ", url
    
    # determine pathsd_key
    pathsd_key = None
    if elyrics or eurl:
        pathsd_key = "issues_jpath"
        d['meta'] = v #go ahead and add meta
    else:
        pathsd_key = "success_jpath"
    
    # write results
    json_str = json.dumps(d)
    path = writeStrToFile(json_str, pathsd, pathsd_key)

    # return the written path (not the dictionary key), matching the docstring
    return path, elyrics, eurl
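
The split-and-trim approach used above can be seen in isolation in the following sketch. It assumes a response in "apos mode" where `lyrics` is immediately followed by `url`; the `sample` text is invented for this example and is not a captured API response, and `extract_value` is a hypothetical helper (it also guards against a missing stop key, which the function above does not).

```python
def extract_value(text, name, stop_name=None, quote="'"):
    """Pull the string value stored under `name` from a quoted key/value blob."""
    marker = "{q}{n}{q}:{q}".format(q=quote, n=name)
    tail = text.split(marker, 1)[1]          # IndexError if the key is absent
    if stop_name is not None:
        stop = "{q}{n}{q}:{q}".format(q=quote, n=stop_name)
        cut = tail.find(stop)
        if cut != -1:
            tail = tail[:cut]                # drop everything from the next key on
    return tail[:tail.rfind(quote)]          # trim at the last remaining quote


# Hypothetical response text, invented for this example.
sample = (
    "song = {\n"
    "'artist':'Edwin Starr',\n"
    "'song':'War',\n"
    "'lyrics':'War, huh, yeah\nWhat is it good for?',\n"
    "'url':'http://lyrics.wikia.com/Edwin_Starr:War'\n"
    "}"
)

print(extract_value(sample, "lyrics", stop_name="url"))
print(extract_value(sample, "url"))
```

Trimming at the last quote rather than the first is what lets apostrophes inside the lyrics survive the parse.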

In [26]:
# this is a refactor from lessons learned from Process-Missing-Lyrics Notebook, to use
# a convention of persisted results on disk over a bloated cache object.

def cachedRefsOrBuildFromLyricsWikia(song_key, payload, force=False, debug=False):
    """
    Leverage cache where possible; helpful for reprocessing. This will use
    `song_key` to lookup available results within data/harvested/lw-json.
    
    Successful results are within data/harvested/lw-json/<song_key>.json
    Unsuccessful results are within data/harvested/lw-json-error/<song_key>.json
    
     --- Input ---
    song_key: String will be used to look for existing persisted results
    payload: dictionary with request params used to process the API call if not in persisted results
    force: optional Boolean to indicate full processing, ignoring cache, default = False
    debug: optional Boolean to indicate more verbose output
    
    --- Return ---
    tuple of the following:
    t[0] String path to processing results, 
    t[1] Boolean indicating True for success, False for issue
    t[2] Boolean indicating True for cache results, else False
    """  
    
    pathsd = buildPathsDictFor(song_key)
    
    #if not force and is in the cache (i.e. already persisted) just return it
    if not force and isNonZeroFile(pathsd['success_jpath']):
        print "... using song_key in cache: ", song_key
        return pathsd['success_jpath'], True, True
    
    try:
        # otherwise, attempt to download via api
        r = requests.get(lw_api_root, params=payload)
        print("... attempting retrieval: ",r.url)
   
        # here we access the webpage and download the content using requests, just keeping text
        if r.status_code == 200:            
            text = r.text
            # cleanAndWriteLyricsWikiaMetaToJson returns a tuple of the following:
            # t[0] String path to processing results
            # t[1] Boolean indicating True for error with lyrics abstract parse
            # t[2] Boolean indicating True for error with lyrics url parse
            t = cleanAndWriteLyricsWikiaMetaToJson(song_key,text, pathsd, debug)
            
            if t[1] or t[2]:
                return t[0], False, False #issues_path, not success, not cache
            else:
                return t[0], True, False #success_path, success, not cache
        
        # essentially, else status code not 200
        msg = "Not able to process song_key: `{}`, status_code: `{}`".format(song_key,r.status_code)
        return writeStrToFile(msg, pathsd, 'issues_tpath'), False, False #issues_tpath, not success, not cache
    
    except Exception as e:
        msg = "exception processing song_key: `{}`, {}".format(song_key,e)
        print msg
        return writeStrToFile(msg, pathsd, 'issues_tpath'), False, False #issues_tpath, not success, not cache
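
The "persisted results on disk as the cache" convention can be reduced to a few lines. The sketch below is a simplified stand-in, not the function above: `cached_or_fetch` and `fake_fetch` are hypothetical names, the fetch step is injected as a callable so no network access is needed, and error handling is left out.

```python
import json
import os
import tempfile


def cached_or_fetch(cache_dir, key, fetch):
    """Return (path, from_cache), calling `fetch(key)` only on a cache miss."""
    path = os.path.join(cache_dir, "{}.json".format(key))
    # A non-empty file on disk counts as a cache hit.
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return path, True
    with open(path, "w") as fp:
        json.dump(fetch(key), fp)
    return path, False


calls = []


def fake_fetch(key):
    """Stand-in for the API call; records each invocation."""
    calls.append(key)
    return {"song_key": key, "lyrics_url": "", "lyrics_abstract": ""}


cache_dir = tempfile.mkdtemp()
first = cached_or_fetch(cache_dir, "1970-1", fake_fetch)
second = cached_or_fetch(cache_dir, "1970-1", fake_fetch)
print("first from cache: {}, second from cache: {}, fetches: {}".format(
    first[1], second[1], len(calls)))
```

Because each key is its own file, a run can be interrupted and resumed without rebuilding any in-memory state.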

In [27]:
# Main entry-point for processing lyric metadata against a dataframe.
# This will not alter the dataframe.
def refsFromLyricsWikia(df, query_delay=1, force=False, debug=False):
    """
    Attempt to populate refs from lyrics.wikia via API, skipping successfully persisted previous results.
    Each song_key result is individually persisted to file for repeat / additive processing pipeline.
    
     --- Input ---
    df: Dataframe from which to build and cache results
    query_delay: optional delay value, default 1
    force: optional Boolean to indicate full processing, ignoring cache, default = False
    debug: optional Boolean to indicate more verbose output
    
    --- Return ---
    tuple of the following:
    t[0] dictionary of new processing by song_key with path to results,
    t[1] dictionary of existing / cached processing by song_key with path to results,
    t[2] dictionary of issues by song_key with path to results 
    """   
    cache_refs = {}
    new_refs = {}
    issues = {}
    
    for _, row in df.iterrows():
        song_key = row['song_key']
        artist = row['artist']
        song = row['title']
         
        if debug:    
            print "... song_key: {}, artist: {}, song: {}".format(song_key,artist,song)
        
        payload = {'action': 'lyrics', 'fmt': 'json', 'artist': artist, 'song': song }
        
        # the following returns tuple with the following:
        # t[0] String path to processing results, 
        # t[1] Boolean indicating True if success or False if issue
        # t[2] Boolean indicating 'True' if results from cache   
        t = cachedRefsOrBuildFromLyricsWikia(song_key, payload, force=force, debug=debug)
        
        # cached results (ignored)
        if t[2] and t[1]:
            cache_refs[song_key] = t[0]
        # new results    
        elif t[1]:
            new_refs[song_key] = t[0]
        # issues    
        else:    
            issues[song_key] = t[0]
        
        # time delay if results not from cache
        if not t[2]:
            time.sleep(query_delay)
        
    return new_refs, cache_refs, issues

###Quick Test to Verify Handling on a Single Key

In [109]:
# quick test #1
tnew_refs, tcache_refs, t_issues = refsFromLyricsWikia(lyrics_pd_df[lyrics_pd_df.song_key == "2001-96"], debug=True)   
print
print "how many new refs were downloaded via API? ", len(tnew_refs)
print "how many results were in the cache? ", len(tcache_refs)
print "how many issues were encountered? ", len(t_issues)
print tnew_refs

... song_key: 2001-96, artist: 3 Doors Down, song: Be Like That
... using song_key in cache:  2001-96

how many new refs were downloaded via API?  0
how many results were in the cache?  1
how many issues were encountered?  0
{}


###Full Handling
**Processing piecemeal, one decade at a time, for clarity and the ability to go in chunks. With a 1 second delay per uncached request, a full decade of 1,000 songs takes at least 1,000 seconds (roughly 17 minutes, plus request time).**
* Normal results go to [lw-json](../../data/harvested/lw-json); this is essentially the cache. If an entry for a `song_key` exists here, the API is not called unless `force` is True.
* Results with any detected issues go to [lw-json-error](../../data/harvested/lw-json-error) for review. Once corrected (e.g. by manual lookup of the information), a result can be added to the cache, where it looks like an automated "normal" result, so other processes can use it without any special awareness.
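
The run-time estimate can be worked out per decade with a short sketch. It assumes exactly 100 chart rows per year for 1970-2014 (consistent with the 4500 rows loaded earlier) and counts only the delay between requests, so the figures are lower bounds; `estimated_minutes` is a hypothetical helper.

```python
def estimated_minutes(n_songs, query_delay=1.0):
    """Lower bound on run time: one delay per uncached song, ignoring request latency."""
    return n_songs * query_delay / 60.0


# Assumption: 100 chart positions per year; the data covers 1970-2014.
songs_per_decade = {}
for year in range(1970, 2015):
    decade = year - year % 10
    songs_per_decade[decade] = songs_per_decade.get(decade, 0) + 100

for decade in sorted(songs_per_decade):
    n = songs_per_decade[decade]
    print("{}s: {} songs, at least {:.1f} minutes".format(
        decade, n, estimated_minutes(n)))
```

Under these assumptions the 2010s take about half as long as the other decades, since only 2010-2014 are present.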

####Process the 1970s

In [76]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Mon, 23 Nov 2015 00:31:03


In [75]:
%%time
# process 70s
new_refs70, cache_refs70, issues70 = refsFromLyricsWikia(lyrics_pd_df[lyrics_pd_df.decade == 1970])  

('... attempting retrieval: ', u'http://lyrics.wikia.com/api.php?action=lyrics&fmt=json&artist=Simon+and+Garfunkel&song=Bridge+over+Troubled+Water')
('... attempting retrieval: ', u'http://lyrics.wikia.com/api.php?action=lyrics&fmt=json&artist=The+Carpenters&song=%28They+Long+to+Be%29+Close+to+You')
('... attempting retrieval: ', u'http://lyrics.wikia.com/api.php?action=lyrics&fmt=json&artist=The+Guess+Who&song=American+Woman')
('... attempting retrieval: ', u'http://lyrics.wikia.com/api.php?action=lyrics&fmt=json&artist=B.J.+Thomas&song=Raindrops+Keep+Fallin%27+on+My+Head')
('... attempting retrieval: ', u'http://lyrics.wikia.com/api.php?action=lyrics&fmt=json&artist=Edwin+Starr&song=War')
('... attempting retrieval: ', u'http://lyrics.wikia.com/api.php?action=lyrics&fmt=json&artist=Diana+Ross&song=Ain%27t+No+Mountain+High+Enough')
('... attempting retrieval: ', u'http://lyrics.wikia.com/api.php?action=lyrics&fmt=json&artist=The+Jackson+5&song=I%27ll+Be+There')
('... attempting retrie

####Process the 1980s

In [None]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

In [None]:
%%time
# process 80s
new_refs80, cache_refs80, issues80 = refsFromLyricsWikia(lyrics_pd_df[lyrics_pd_df.decade == 1980])  

####Process the 1990s

In [None]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

In [None]:
%%time
# process 90s
new_refs90, cache_refs90, issues90 = refsFromLyricsWikia(lyrics_pd_df[lyrics_pd_df.decade == 1990])  

####Process the 2000s

In [None]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

In [None]:
%%time
# process 2000s
new_refs2000, cache_refs2000, issues2000 = refsFromLyricsWikia(lyrics_pd_df[lyrics_pd_df.decade == 2000])  

####Process the 2010s

In [None]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

In [None]:
%%time
# process 2010s
new_refs2010, cache_refs2010, issues2010 = refsFromLyricsWikia(lyrics_pd_df[lyrics_pd_df.decade == 2010])  
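
Once the decade runs finish, the files in the issues directory need manual review. A minimal sketch for listing the affected keys follows; `issue_keys` is a hypothetical helper, and the demonstration uses a throwaway directory with made-up file names rather than the real harvest output.

```python
import os
import tempfile


def issue_keys(issues_dir):
    """Return the sorted song_keys that have a .json or .txt file in `issues_dir`."""
    keys = set()
    for name in os.listdir(issues_dir):
        stem, ext = os.path.splitext(name)
        if ext in (".json", ".txt"):
            keys.add(stem)
    return sorted(keys)


# Demonstration against a temporary directory with invented file names.
demo_dir = tempfile.mkdtemp()
for name in ("1975-10.json", "1992-33.txt", "notes.md"):
    open(os.path.join(demo_dir, name), "w").close()

print(issue_keys(demo_dir))
```

In this notebook the call would presumably be `issue_keys(lw_issues_dir)`.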

##Add Results to Dataframe
Loop through the persisted results and add `lyrics_url` and `lyrics_abstract` columns.

In [98]:
# work on a deep copy so that lyrics_pd_df itself stays unchanged
lyricsdf = lyrics_pd_df.copy(deep=True)

In [99]:
def _lyricsFieldFor(song_key, field):
    """
    return `field` from the persisted success json for `song_key`, or "" if unavailable.
    note: values with non-ascii characters also fall through to "" (encode raises).
    """
    try:
        path = "{}{}.json".format(lw_success_dir,song_key) #normal json
        with open(path) as fp:
            j = json.load(fp)
            return j[field].encode('ascii')
    except Exception:
        return ""

def lyricsUrlFor(song_key):
    return _lyricsFieldFor(song_key, 'lyrics_url')

def lyricsAbstractFor(song_key):
    return _lyricsFieldFor(song_key, 'lyrics_abstract')

In [100]:
# Apply lyrics url where available
lyricsdf['lyrics_url'] = lyricsdf.song_key.apply(lyricsUrlFor)

In [101]:
# Apply lyrics abstract where available
lyricsdf['lyrics_abstract'] = lyricsdf.song_key.apply(lyricsAbstractFor)

In [102]:
lyricsdf.head()

Unnamed: 0,position,year,title.href,title,artist,lyrics,decade,song_key,lyrics_url,lyrics_abstract
0,1,1970,https://en.wikipedia.org/wiki/Bridge_over_Trou...,Bridge over Troubled Water,Simon and Garfunkel,When you're weary feeling small When tears are...,1970,1970-1,http://lyrics.wikia.com/Simon_And_Garfunkel:Br...,When you're weary\nFeeling small\nWhen tears a...
1,2,1970,https://en.wikipedia.org/wiki/(They_Long_to_Be...,(They Long to Be) Close to You,The Carpenters,x,1970,1970-2,http://lyrics.wikia.com/Carpenters:%28They_Lon...,Why do birds suddenly appear\nEverytime you ar...
2,3,1970,https://en.wikipedia.org/wiki/American_Woman_(...,American Woman,The Guess Who,"American woman, stay away from me American wom...",1970,1970-3,http://lyrics.wikia.com/The_Guess_Who:American...,"Mmm, da da da\nMmm, mmm, da da da\nMmm, mmm, d..."
3,4,1970,https://en.wikipedia.org/wiki/Raindrops_Keep_F...,Raindrops Keep Fallin' on My Head,B.J. Thomas,Raindrops keep falling on my head Just like th...,1970,1970-4,http://lyrics.wikia.com/B.J._Thomas:Raindrops_...,Raindrops are falling on my head\nAnd just lik...
4,5,1970,https://en.wikipedia.org/wiki/War_(Edwin_Starr...,War,Edwin Starr,"War huh Yeah! Absolutely uh-huh, uh-huh huh Ye...",1970,1970-5,http://lyrics.wikia.com/Edwin_Starr:War,"War, huh, yeah\nWhat is it good for?\nAbsolute..."


##Save Dataframe

In [97]:
# save the dataframe as it was before the lyrics.wikia columns were added
lyrics_pd_df.to_csv("../../data/conditioned/pre-lyricsdf.csv",index=False) #encoding='utf-8' doesn't work

In [103]:
# save lyricsdf
lyricsdf.to_csv("../../data/conditioned/master-lyricsdf.csv",index=False) #encoding='utf-8' doesn't work