#Lyrics Raw Harvesting
This notebook continues the processing pipeline. It acts on each available `lyrics_url` in [master-lyricsdf.csv](../../data/conditioned/master-lyricsdf.csv), the output of the [Lyrics-Metadata-Processing notebook](Lyrics-Metadata-Processing.ipynb). It downloads the raw lyrics page for each song and persists it to disk as a cache, so that later runs can resume processing without repeating calls to Lyrics.Wikia.
* [lw-raw-lyrics](../../data/harvested/lw-raw-lyrics) is the directory for song lyrics (pre-parsed)
* [lw-raw-lyrics-error](../../data/harvested/lw-raw-lyrics-error) is the directory for songs that could not be processed and require some sort of manual intervention. Once a song has been corrected by hand, downstream processing steps should not need to know that the intervention took place.
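
Because each song is persisted as its own file, the state of a harvest can be read straight from these two directories. The following is a minimal sketch, assuming the `<song_key>.html` and `<song_key>.txt` naming used later in this notebook. The helper name `summarizeHarvest` is hypothetical and not part of the pipeline, and unlike the notebook's own cache check it only looks at file names, not file sizes.

```python
import os

def summarizeHarvest(success_dir, issues_dir):
    """Return (harvested, failed) sets of song_keys found on disk."""
    def keysIn(directory, ext):
        # a directory that does not exist yet simply holds no keys
        if not os.path.isdir(directory):
            return set()
        return set(os.path.splitext(name)[0]
                   for name in os.listdir(directory)
                   if name.endswith(ext))

    harvested = keysIn(success_dir, ".html")
    # a key counts as failed only if no successful download exists for it
    failed = keysIn(issues_dir, ".txt") - harvested
    return harvested, failed
```

Subtracting `harvested` keeps a stale error message from masking a song that was later downloaded successfully.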

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [38]:
## MLJ: Additional Extras
import os
import codecs
import requests
import time
import itertools
import json
import pickle

##Load Lyrics Dataframe

In [3]:
# load the latest master lyricsdf
lyricsdf = pd.read_csv("../../data/conditioned/master-lyricsdf.csv")  

In [4]:
lyricsdf.head()

Unnamed: 0,position,year,title.href,title,artist,lyrics,decade,song_key,lyrics_url,lyrics_abstract
0,1,1970,https://en.wikipedia.org/wiki/Bridge_over_Trou...,Bridge over Troubled Water,Simon and Garfunkel,When you're weary feeling small When tears are...,1970,1970-1,http://lyrics.wikia.com/Simon_And_Garfunkel:Br...,When you're weary\nFeeling small\nWhen tears a...
1,2,1970,https://en.wikipedia.org/wiki/(They_Long_to_Be...,(They Long to Be) Close to You,The Carpenters,x,1970,1970-2,http://lyrics.wikia.com/Carpenters:%28They_Lon...,Why do birds suddenly appear\nEverytime you ar...
2,3,1970,https://en.wikipedia.org/wiki/American_Woman_(...,American Woman,The Guess Who,"American woman, stay away from me American wom...",1970,1970-3,http://lyrics.wikia.com/The_Guess_Who:American...,"Mmm, da da da\nMmm, mmm, da da da\nMmm, mmm, d..."
3,4,1970,https://en.wikipedia.org/wiki/Raindrops_Keep_F...,Raindrops Keep Fallin' on My Head,B.J. Thomas,Raindrops keep falling on my head Just like th...,1970,1970-4,http://lyrics.wikia.com/B.J._Thomas:Raindrops_...,Raindrops are falling on my head\nAnd just lik...
4,5,1970,https://en.wikipedia.org/wiki/War_(Edwin_Starr...,War,Edwin Starr,"War huh Yeah! Absolutely uh-huh, uh-huh huh Ye...",1970,1970-5,http://lyrics.wikia.com/Edwin_Starr:War,"War, huh, yeah\nWhat is it good for?\nAbsolute..."


##Setup Directories and File Utility Methods

In [5]:
lw_root = "http://lyrics.wikia.com/wiki/"
lw_success_dir = "../../data/harvested/lw-raw-lyrics/"
lw_issues_dir = "../../data/harvested/lw-raw-lyrics-error/"

In [6]:
# adapted from https://justgagan.wordpress.com/2010/09/22/python-create-path-or-directories-if-not-exist/
def assureDirExists(path):
    d = os.path.dirname(path)
    if not os.path.exists(d):
        os.makedirs(d)

In [7]:
# make sure the directories exist
assureDirExists(lw_success_dir)
assureDirExists(lw_issues_dir)
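
`assureDirExists` relies on `os.path.dirname`, which returns everything before the final path separator. The directory constants above end in `/`, so the full directory is created; without the trailing slash only the parent directory would be. A short illustration on plain path strings (no files are touched):

```python
import os

with_slash = os.path.dirname("../../data/harvested/lw-raw-lyrics/")
without_slash = os.path.dirname("../../data/harvested/lw-raw-lyrics")

print(with_slash)      # ../../data/harvested/lw-raw-lyrics
print(without_slash)   # ../../data/harvested
```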

In [8]:
# adapted from http://stackoverflow.com/questions/82831/check-whether-a-file-exists-using-python
def isNonZeroFile(fpath):
    return os.path.isfile(fpath) and os.path.getsize(fpath) > 0

In [33]:
# consolidated helper method for paths to be used in writing success and issues as needed.
# note *_hpath used here vs *_jpath in previous notebook.
def buildPathsDictFor(song_key):
    """
    return a dictionary of paths and filename for the `song_key`
    """
    success_hpath = "{}{}.html".format(lw_success_dir,song_key) #raw html for a successful download
    issues_hpath = "{}{}.html".format(lw_issues_dir,song_key) #raw html for a download with issues
    issues_tpath = "{}{}.txt".format(lw_issues_dir,song_key) #text with issue message
    
    return {'song_key':song_key, 'success_hpath':success_hpath, 
            'issues_hpath':issues_hpath, 'issues_tpath':issues_tpath}

In [36]:
# consolidated helper method for result writes
def writeStrToFile(text, pathsd, pathsd_key):
    """
    write the given text to the path stored under pathsd[pathsd_key].
    
     --- Input ---
    text: String to write (named `text` to avoid shadowing the builtin `str`)
    pathsd: Dictionary holding paths
    pathsd_key: String key to use in pathsd when writing
    
    --- Return ---
    pathsd[pathsd_key]
    """
    path = pathsd[pathsd_key]
    
    with codecs.open(path,'w',encoding='utf8') as text_file:
        text_file.write(text) #auto-closed due to context manager use
    
    return path
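
Later steps need to read these files back with the same encoding they were written in. A minimal sketch of the reverse operation follows; the helper name `readStrFromFile` is hypothetical and is not defined in the original notebook.

```python
import codecs

def readStrFromFile(path):
    """Read a UTF-8 file written by the harvest step and return its text."""
    with codecs.open(path, 'r', encoding='utf8') as text_file:
        return text_file.read()
```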

##Grab or Identify Cached Song Lyrics

In [29]:
def cachedLyricsOrBuildFromLyricsWikia(song_key, lyrics_url, pathsd, force=False, debug=False):
    """
    Leverage cache where possible; helpful for reprocessing. This will use
    `song_key` to lookup available results within data/harvested/lw-raw-lyrics.
    
    Successful results are within data/harvested/lw-raw-lyrics/<song_key>.html
    Unsuccessful results are within data/harvested/lw-raw-lyrics-error/<song_key>.txt
    
     --- Input ---
    song_key: String identifying the song; used in log and issue messages
    lyrics_url: String url used to download the lyrics if they are not already persisted
    pathsd: Dictionary of paths for this song_key, as built by buildPathsDictFor
    force: optional Boolean to indicate full processing, ignoring cache, default = False
    debug: optional Boolean to indicate more verbose output
    
    --- Return ---
    tuple of the following:
    t[0] String path to processing results, 
    t[1] Boolean indicating True for success, False for issue
    t[2] Boolean indicating True for cache results, else False
    """  
    
    #if not force and is in the cache (i.e. already persisted) just return it
    if not force and isNonZeroFile(pathsd['success_hpath']):
        print "... using song lyric in cache for ", song_key
        return pathsd['success_hpath'], True, True #success_hpath, success, cache
    
    try:
        # otherwise, attempt to download from lyrics.wikia
        r = requests.get(lyrics_url)
        print "... attempting retrieval: ",r.url
        
        if debug:
            print "...encoding: ", r.encoding
   
        # the page has already been requested above; on success keep only the response text
        if r.status_code == 200:            
            text = r.text
            
            return writeStrToFile(text, pathsd, 'success_hpath'), True, False #success_hpath, success, not cache
        
        # essentially, else status code not 200
        msg = "Not able to process lyrics for song_key: `{}`, status_code: `{}`".format(song_key,r.status_code)
        return writeStrToFile(msg, pathsd, 'issues_tpath'), False, False #issues_tpath, not success, not cache
    
    except Exception as e:
        msg = "exception processing lyrics for song_key: `{}`, {}".format(song_key,e)
        print msg
        return writeStrToFile(msg, pathsd, 'issues_tpath'), False, False #issues_tpath, not success, not cache    
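
The cache-first control flow can be exercised without touching the network by passing the download step in as a function. The sketch below is a simplified stand-in, not the notebook's function: `cachedOrFetch` and `fetch` are hypothetical names, and `fetch` is assumed to return a `(status_code, text)` pair.

```python
import codecs
import os

def cachedOrFetch(path, fetch, force=False):
    """
    Return (text, from_cache). `fetch` is any callable returning
    (status_code, text); it is only invoked on a cache miss or when forced.
    """
    # cache hit: a non-empty file already exists and we are not forcing
    if not force and os.path.isfile(path) and os.path.getsize(path) > 0:
        with codecs.open(path, 'r', encoding='utf8') as f:
            return f.read(), True

    status_code, text = fetch()
    if status_code != 200:
        raise IOError("download failed with status {}".format(status_code))

    # persist the fresh download so the next call is served from disk
    with codecs.open(path, 'w', encoding='utf8') as f:
        f.write(text)
    return text, False
```

In real use, `fetch` could be a small wrapper around `requests.get` that returns `r.status_code` and `r.text`; in a test it can be a stub that counts how often it is called.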

In [25]:
# main entry-point for raw lyric processing
def getAvailableRawSongLyrics(df, query_delay=1, force=False, debug=False):
    """
    Attempt to download lyrics from lyrics.wikia, skipping successfully persisted previous results.
    Each song_key result is individually persisted to file for repeat / additive processing pipeline.
    
     --- Input ---
    df: Dataframe from which to build and cache results
    query_delay: optional delay in seconds after each download attempt that was not served from cache, default 1
    force: optional Boolean to indicate full processing, ignoring cache, default = False
    debug: optional Boolean to indicate more verbose output
    
    --- Return ---
    tuple of the following:
    t[0] dictionary of new processing by song_key with path to results,
    t[1] dictionary of existing / cached processing by song_key with path to results,
    t[2] dictionary of issues by song_key with path to results 
    """   
    cache_refs = {}
    new_refs = {}
    issues = {}    
    
    for _, row in df.iterrows():
        song_key = row.song_key
        lyrics_url = row.lyrics_url
        
        pathsd = buildPathsDictFor(song_key)
        
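        # a missing url is read from the csv as NaN, which is truthy, so test for null explicitly
        if pd.notnull(lyrics_url) and lyrics_url: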
            if debug:
                print "song_key: {}, lyrics_url: {}".format(song_key, lyrics_url)
            
            # let this call handle download or skip based on cache 
            # the following returns tuple with the following:
            # t[0] String path to processing results, 
            # t[1] Boolean indicating True if success or False if issue
            # t[2] Boolean indicating 'True' if results from cache
            t = cachedLyricsOrBuildFromLyricsWikia(song_key,lyrics_url,pathsd,force=force,debug=debug)
            
            # results served from the cache (no download was needed)
            if t[2] and t[1]:
                cache_refs[song_key] = t[0]
            # new results    
            elif t[1]:
                new_refs[song_key] = t[0]
            # issues    
            else:    
                issues[song_key] = t[0]
            
            # time delay if results not from cache
            if not t[2]:
                time.sleep(query_delay)
            
        else:
            msg = "no url for song_key: {}".format(song_key)
            if debug:
                print msg
            # store the path to the issue message, as the other issue entries do
            issues[song_key] = writeStrToFile(msg, pathsd, 'issues_tpath')
        
    return new_refs, cache_refs, issues
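
One detail worth checking in the loop above is how a missing URL appears. `pandas.read_csv` loads empty cells as `NaN`, a float that evaluates as true, so a plain truthiness test does not distinguish it from a real URL. Below is a stdlib-only sketch of a stricter check; `hasUsableUrl` is a hypothetical name, not part of the notebook.

```python
def hasUsableUrl(value):
    """True only for non-empty strings; NaN, None and '' are rejected."""
    try:
        string_types = (basestring,)   # Python 2
    except NameError:
        string_types = (str,)          # Python 3
    return isinstance(value, string_types) and len(value.strip()) > 0

print(bool(float("nan")))           # True: NaN is truthy
print(hasUsableUrl(float("nan")))   # False
print(hasUsableUrl("http://lyrics.wikia.com/Edwin_Starr:War"))  # True
```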

###Quick Test to Verify Handling a Single Key

In [41]:
# quick test
ttuple = getAvailableRawSongLyrics(lyricsdf[lyricsdf.song_key == "2001-96"], debug=True)   
print
print "how many new lyrics were downloaded? ", len(ttuple[0])
print "how many results were in the cache? ", len(ttuple[1])
print "how many issues were encountered? ", len(ttuple[2])

print ttuple[0]


song_key: 2001-96, lyrics_url: http://lyrics.wikia.com/3_Doors_Down:Be_Like_That
... using song lyric in cache for  2001-96

how many new lyrics were downloaded?  0
how many results were in the cache?  1
how many issues were encountered?  0
{}


###Process the 1970s

In [44]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Mon, 23 Nov 2015 03:33:43


In [45]:
%%time
# process 70s
new_refs70, cache_refs70, issues70 = getAvailableRawSongLyrics(lyricsdf[lyricsdf.decade == 1970])  

... using song lyric in cache for  1970-1
... using song lyric in cache for  1970-2
... using song lyric in cache for  1970-3
... using song lyric in cache for  1970-4
... using song lyric in cache for  1970-5
... using song lyric in cache for  1970-6
... using song lyric in cache for  1970-7
... using song lyric in cache for  1970-8
... using song lyric in cache for  1970-9
... using song lyric in cache for  1970-10
... using song lyric in cache for  1970-11
... using song lyric in cache for  1970-12
... using song lyric in cache for  1970-13
... using song lyric in cache for  1970-14
... using song lyric in cache for  1970-15
... using song lyric in cache for  1970-16
... using song lyric in cache for  1970-17
... using song lyric in cache for  1970-18
... using song lyric in cache for  1970-19
... using song lyric in cache for  1970-20
... using song lyric in cache for  1970-21
... using song lyric in cache for  1970-22
... using song lyric in cache for  1970-23
... using song lyric