#Vector Ensemble
Ensemble approach using Spark. This notebook uses the consolidated vector CSV, which includes the normal, synonym, and hypernym vectors; see [master-lyricsdf-word_syn_hype_vectors.csv](../../data/conditioned/master-lyricsdf-word_syn_hype_vectors.csv).

##SET DECADE (OR NOT)?

In [1]:
# if decade is set, then filter results accordingly.
decade = None
# decade = 1970

In [2]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [3]:
## MLJ: Additional Extras
import time
import itertools
import json
import pickle

In [4]:
import os
# os.environ['PYSPARK_PYTHON'] = '/anaconda/bin/python'

In [5]:
import findspark
findspark.init()
print findspark.find()
# Depending on your setup you might have to change this line of code
# findspark makes sure I don't need the line below on Homebrew.
#os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.5.1/libexec/"
# The line below actually broke my Spark, so I removed it.
# Depending on how you started the notebook, you might need it.
# os.environ['PYSPARK_SUBMIT_ARGS']="--master local pyspark --executor-memory 4g"

/home/vagrant/spark


In [6]:
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local[4]')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)

In [7]:
sc._conf.getAll()

[(u'spark.executor.memory', u'2g'),
 (u'spark.master', u'local[4]'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.driver.memory', u'8g'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.app.name', u'pyspark')]

In [8]:
import sys
rdd = sc.parallelize(xrange(2),2)
rdd.map(lambda x: sys.version).collect()

['2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]']

In [9]:
sys.version

'2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

In [10]:
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

##Setup Data For Pipeline
###Load Dataframe into Pandas for initial manipulation

In [11]:
# load the lyrics from the approved "master" dataframe
lyrics_pd_df = pd.read_csv("../../data/conditioned/master-lyricsdf-word_syn_hype_vectors.csv")  

In [12]:
if decade:
    lyrics_pd_df = lyrics_pd_df[lyrics_pd_df['decade'] == decade]

In [13]:
lyrics_pd_df.head()

Unnamed: 0,index,position,year,title.href,title,artist,lyrics,decade,song_key,lyrics_url,lyrics_abstract,noun_vector,adj_vector,noun_syn_vector,adj_syn_vector,noun_syn_hype_vector,adj_syn_hype_vector
0,0,1,1970,https://en.wikipedia.org/wiki/Bridge_over_Trou...,Bridge over Troubled Water,Simon and Garfunkel,When you're weary. Feeling small. When tears a...,1970,1970-1,http://lyrics.wikia.com/Simon_And_Garfunkel:Br...,When you're weary. Feeling small. When tears a...,time bridge water,rough troubled,time bridge water,troubled rough,water bridge time,rough troubled rough
1,1,2,1970,https://en.wikipedia.org/wiki/(They_Long_to_Be...,(They Long to Be) Close to You,The Carpenters,Why do birds suddenly appear. Everytime you ar...,1970,1970-2,http://lyrics.wikia.com/Carpenters:%28They_Lon...,Why do birds suddenly appear. Everytime you ar...,dream starlight eye,true blue,starlight eye dream,true blue,starlight eye dream,blue blue blue blue true
2,2,3,1970,https://en.wikipedia.org/wiki/American_Woman_(...,American Woman,The Guess Who,"Mmm, da da da. Mmm, mmm, da da da. Mmm, mmm, d...",1970,1970-3,http://lyrics.wikia.com/The_Guess_Who:American...,"Mmm, da da da. Mmm, mmm, da da da. Mmm, mmm, d...",woman mess mind mama thing time growin light y...,american important old coloured leave,light time ma mess thing woman crap mind,american colored old important,ma thing light ma crap ma ma crap woman mess t...,american colored colored important old colored
3,3,4,1970,https://en.wikipedia.org/wiki/Raindrops_Keep_F...,Raindrops Keep Fallin' on My Head,B.J. Thomas,Raindrops are falling on my head. And just lik...,1970,1970-4,http://lyrics.wikia.com/B.J._Thomas:Raindrops_...,Raindrops are falling on my head. And just lik...,guy foot bed happiness step eye,big long red,measure guy eye happiness foot bed,long red large,eye guy bed happiness measure guy measure foot,large red red red long red red large
4,4,5,1970,https://en.wikipedia.org/wiki/War_(Edwin_Starr...,War,Edwin Starr,"War, huh, yeah. What is it good for? Absolutel...",1970,1970-5,http://lyrics.wikia.com/Edwin_Starr:War,"War, huh, yeah. What is it good for? Absolutel...",god destruction life war unrest generation man...,good innocent younger young short precious fig...,life coevals destruction agitation man war man...,active better cherished younger young short in...,agitation godhead manner life god manner day w...,cherished active innocent better younger short...


###Add Labels for Data based on position
The labeling scheme changes from run to run. A straightforward use is to see how well membership in the top 50 versus the bottom 50 can be predicted.
**Note: Spark ML seems picky about `label` being the column name**

In [21]:
# use positions for labeling
positions_10_percent = {
  10.0:range(1,11),
  20.0:range(11,21),
  30.0:range(21,31),
  40.0:range(31,41),
  50.0:range(41,51),
  60.0:range(51,61), 
  70.0:range(61,71),
  80.0:range(71,81),
  90.0:range(81,91),
  100.0:range(91,101)  
}

positions_25_percent = {
  25.0:range(1,26),
  50.0:range(26,51),
  75.0:range(51,76),
  100.0:range(76,101)
}

# binary classification for top 25
positions_top_25 = {
  0.0:range(1,26),
  1.0:range(26,101)
}

# binary classification for top 50
positions_top_50 = {
  0.0:range(1,51),
  1.0:range(51,101)
}

In [22]:
# Here is the dictionary for this run. This is how the classification is being done.
positions_description = "Top 50 versus Bottom 50"
positions_dict = positions_top_50

def labelForPosition(pos):
    for k,p in positions_dict.iteritems():
        if pos in p:
            return k
    return None

# label is the key of the range containing the position; for this run, 0.0 = top 50 and 1.0 = bottom 50
lyrics_pd_df['label'] = lyrics_pd_df.position.apply(labelForPosition)
lyrics_pd_df.sample(5).head()

Unnamed: 0,index,position,year,title.href,title,artist,lyrics,decade,song_key,lyrics_url,lyrics_abstract,noun_vector,adj_vector,noun_syn_vector,adj_syn_vector,noun_syn_hype_vector,adj_syn_hype_vector,label
1311,1311,12,1983,https://en.wikipedia.org/wiki/You_and_I_(Eddie...,You and I,Eddie Rabbitt,Just you and I. Sharing our love together. And...,1980,1983-12,http://lyrics.wikia.com/Eddie_Rabbitt_%26_Crys...,Just you and I. Sharing our love together. And...,,,,,,,0
1796,1796,97,1987,,I've Been in Love Before,Cutting Crew,"Oooh, oo oooh, cha. Catch my breath, close my ...",1980,1987-97,http://lyrics.wikia.com/Cutting_Crew:I%27ve_Be...,"Oooh, oo oooh, cha. Catch my breath, close my ...",oooh cha inside hit minute dance word,oo wrong oooh dangerous small,hit inside dance word minute,dangerous incorrect small,minute dance inside inside word hit,dangerous small incorrect small,1
4316,4316,17,2013,https://en.wikipedia.org/wiki/We_Can%27t_Stop,We Can't Stop,Miley Cyrus,"It's our party, we can do what we want (. ). (...",2010,2013-17,http://lyrics.wikia.com/Miley_Cyrus:We_Can%27t...,"It's our party, we can do what we want (. ). (...",party kiss cup body whop home homegirl butt,ooh-ooh red sweaty ready big,homegirl kiss home butt cup body party,red large ready,party home kiss butt homegirl cup body,large red red red red red ready large,0
2712,2712,13,1997,https://en.wikipedia.org/wiki/For_You_I_Will_(...,For You I Will,Monica,When you're feeling lost in the night. When yo...,1990,1997-13,http://lyrics.wikia.com/Monica:For_You_I_Will,When you're feeling lost in the night. When yo...,time fortress,tough tall strong,time fortress,tough tall strong,fortress time,tall strong tough,0
2624,2624,25,1996,https://en.wikipedia.org/wiki/Who_Will_Save_Yo...,Who Will Save Your Soul,Jewel,People living their lives for you on TV. They ...,1990,1996-25,http://lyrics.wikia.com/Jewel:Who_Will_Save_Yo...,People living their lives for you on TV. They ...,brick wall boy doctor lawyer thrill home god f...,cold free cute cheap homeless different afraid...,security bargain male_child wall lawyer brick ...,homeless different free all_right cunning cold...,wall doctor god male_child bargain bang home l...,cunning free cold all_right different homeless...,0
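
The labeling step above boils down to a lookup from chart position to the key of whichever range contains it. Here is a minimal standalone sketch, assuming the same layout as `positions_top_50`; the helper name `label_for_position` and its `positions` argument are illustrative and not part of the notebook.

```python
# Illustrative sketch: map a chart position to the label whose range contains it.
# Mirrors positions_top_50 above; label_for_position is a hypothetical helper.
positions_top_50 = {
    0.0: range(1, 51),    # positions 1-50
    1.0: range(51, 101),  # positions 51-100
}

def label_for_position(pos, positions=positions_top_50):
    """Return the label for `pos`, or None if no range contains it."""
    for label, members in positions.items():  # .items() works on Python 2 and 3
        if pos in members:
            return label
    return None

labels = [label_for_position(p) for p in (1, 50, 51, 100, 101)]
```

With this dictionary, the label `0.0` marks the top 50 and `1.0` marks the bottom 50, which is worth keeping in mind when reading the predictions later on.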


###Filter out Non-Lyric Records
**Non-Lyrics due to:**
* Instrumentals
* Licensing restrictions on lyrics.wikia
* No lyrics added to lyrics.wikia

In [15]:
# Check for nulls (which may include instrumentals and songs with no lyrics on lyrics.wikia)
empties = np.where(pd.isnull(lyrics_pd_df[['lyrics']]))
print "How many empties are there? {}".format(len(empties[0]))

How many empties are there? 159


In [17]:
lyrics_pd_df.shape

(4500, 18)

In [18]:
# filter out null lyrics
lyrics_pd_df = lyrics_pd_df.dropna(axis=0, how='any', thresh=None, subset=['lyrics'], inplace=False)

In [19]:
lyrics_pd_df.shape

(4341, 18)

###What Vector column is to be used for this run?

In [28]:
vector_col = 'noun_vector'
# vector_col = 'noun_syn_vector'
# vector_col = 'noun_syn_hype_vector'

In [29]:
vector_col_values = lyrics_pd_df[vector_col].values

In [30]:
len(vector_col_values)

4341

In [31]:
vector_col_values[:5]

array(['time bridge water', 'dream starlight eye',
       'woman mess mind mama thing time growin light ya shit',
       'guy foot bed happiness step eye',
       'god destruction life war unrest generation man dream day lord way'], dtype=object)

###Manipulate `vector_col` into list 
**This list removes duplicate tokens and sorts the tokens within each vector**

In [34]:
corpus_list = []
for v in vector_col_values:
    if not isinstance(v,float):
        tmp = v.split()
        cs = []
        for t in tmp:
            if t not in cs:
                cs.append(t)
        corpus_list.append(sorted(cs))
    # just in case, handle empty    
    else:
        corpus_list.append([])

In [35]:
print corpus_list[:10]

[['bridge', 'time', 'water'], ['dream', 'eye', 'starlight'], ['growin', 'light', 'mama', 'mess', 'mind', 'shit', 'thing', 'time', 'woman', 'ya'], ['bed', 'eye', 'foot', 'guy', 'happiness', 'step'], ['day', 'destruction', 'dream', 'generation', 'god', 'life', 'lord', 'man', 'unrest', 'war', 'way'], ['baby', 'day', 'desire', 'lip', 'love', 'mountain', 'remember', 'river', 'valley'], ['dream', 'honey', 'love', 'respect', 'shoulder', 'world'], ['dream', 'love', 'time', 'ya'], ['night', 'person'], ['darkness', 'room']]


In [36]:
len(corpus_list)

4341
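
The de-duplicate-and-sort loop above can also be written with a `set`. The sketch below is one possible compact form, not the notebook's own code; the helper name `dedupe_and_sort` is hypothetical, and the sample strings are taken from the vector columns shown earlier.

```python
def dedupe_and_sort(vector):
    """Split a space-separated vector string into sorted unique tokens."""
    # pandas represents a missing cell as a float NaN, so treat non-strings as empty
    # (on Python 2, test against basestring instead of str)
    if not isinstance(vector, str):
        return []
    return sorted(set(vector.split()))

sample = ['time bridge water', 'blue blue blue blue true', float('nan')]
tokens = [dedupe_and_sort(v) for v in sample]
```

Sorting after building the set gives the same result as the loop, because the loop's insertion order is discarded by `sorted` anyway.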

###Convert and manipulate with Spark

In [None]:
# convert from pandas to spark dataframe
lyricsdf = sqlsc.createDataFrame(lyrics_pd_df)

In [None]:
lyricsdf.show(3)

##Pipeline Using Spark
Reference [combine all features into a single feature vector](https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html)
![Ensemble Pipeline Overview](https://databricks.com/wp-content/uploads/2015/07/simple-pipeline.png)
* Tokenizer
* HashingTF
* Word2Vec
* OneHotEncoder
* Vector Assembler
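
Of the stages listed above, HashingTF is probably the least self-explanatory: it maps each token to a slot in a fixed-length vector by hashing it, then counts occurrences per slot. The sketch below illustrates only that idea in plain Python. It assumes `zlib.crc32` as a stand-in hash, so the indices will not match what Spark produces; `hashing_tf` is a hypothetical helper, and `num_features=200` simply matches the value used in the pipelines below.

```python
import zlib

def hashing_tf(tokens, num_features=200):
    """Count tokens into a fixed-length vector indexed by hash(token) % num_features."""
    vector = [0] * num_features
    for token in tokens:
        # zlib.crc32 is a stand-in chosen for determinism; Spark uses its own hash
        index = (zlib.crc32(token.encode('utf-8')) & 0xffffffff) % num_features
        vector[index] += 1
    return vector

tf = hashing_tf('when you are weary feeling small when tears'.split())
```

Distinct tokens can collide in the same slot, which is the trade-off for never having to build a vocabulary; a larger `num_features` lowers the collision rate.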

In [None]:
# Pipeline adapted from:
# http://spark.apache.org/docs/latest/ml-guide.html
# https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html
from pyspark.ml.feature import *
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import Row

In [None]:
# Used in common for printing predictions
def printPredicts(predictsdf,pred_hits,pipeline_name="Pipeline"):
    hits = 0
    misses = 0
    print "How did {} do predicting {}?".format(pipeline_name,positions_description)
    for r in predictsdf.iterrows():
        song_key = r[1].song_key
        pred = pred_hits[song_key]  
        result = labelForPosition(r[1].position)
        correct = result == pred
        if correct:
            hits +=1
            print "Correct ::: song_key --> {}, predicted {}".format(song_key, pred)
        else:
            misses +=1
            print "Incorrect ::: song_key --> {}, predicted {}".format(song_key, pred)
    print "{} hits: {}, misses: {}".format(pipeline_name,hits,misses)
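
For plotting or comparing pipelines, it may be handier to get the hit and miss counts back as values rather than printed text. The following is a minimal sketch of that tally, assuming actual labels and predictions are both available as dictionaries keyed by `song_key`; `tally_predictions` is a hypothetical helper, and the labels in the example are made-up toy values.

```python
def tally_predictions(actual, predicted):
    """Return (hits, misses) for two dicts keyed by song_key."""
    hits = sum(1 for key, label in actual.items() if predicted.get(key) == label)
    return hits, len(actual) - hits

actual = {'2014-1': 0.0, '2014-2': 0.0, '2014-60': 1.0}     # made-up labels
predicted = {'2014-1': 0.0, '2014-2': 1.0, '2014-60': 1.0}  # made-up predictions
hits, misses = tally_predictions(actual, predicted)
accuracy = hits / float(len(actual))
```

A song missing from `predicted` counts as a miss here, whereas `printPredicts` would raise a `KeyError`.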

In [None]:
# TODO (HERE OR TABLEAU) -- VIZ Method

###Whole Corpus Approach: Fit a model to songs through 2013 and predict on 2014.

In [None]:
# augment lyricsdf with corpus_list. withColumn() expects a Column, not a Python
# list, so attach the token lists in pandas before converting to Spark.
lyricsdf_corpus = sqlsc.createDataFrame(lyrics_pd_df.assign(lyrics_tokenized=corpus_list))

In [None]:
# training on songs through 2013 (every year except 2014)
training = lyricsdf.filter(lyricsdf['year'] != 2014).select(['song_key','lyrics','label'])

# training_corpus on songs through 2013
training_corpus = lyricsdf_corpus.filter(lyricsdf_corpus['year'] != 2014).select(['song_key','lyrics','lyrics_tokenized','label'])

In [None]:
# test year 2014
test = lyricsdf.filter(lyricsdf['year'] == 2014).select(['song_key','lyrics','label'])

# test_corpus year 2014
test_corpus = lyricsdf_corpus.filter(lyricsdf_corpus['year'] == 2014).select(['song_key','lyrics','lyrics_tokenized','label'])
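
The split above holds out one year for testing and trains on everything else. Here is a plain-Python sketch of the same partition, assuming rows are dictionaries with a `year` field; `split_by_year` is a hypothetical helper and the rows are toy values.

```python
def split_by_year(rows, holdout_year):
    """Partition row dicts into (training, test) on the 'year' field."""
    training = [r for r in rows if r['year'] != holdout_year]
    test = [r for r in rows if r['year'] == holdout_year]
    return training, test

rows = [  # toy rows with made-up keys
    {'song_key': '2012-1', 'year': 2012},
    {'song_key': '2013-1', 'year': 2013},
    {'song_key': '2014-1', 'year': 2014},
]
training_rows, test_rows = split_by_year(rows, holdout_year=2014)
```

Filtering with `!=` keeps any years after the holdout year in the training set as well; if the data extends past the holdout year, a `<` comparison would keep the training set strictly earlier.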

###Pipeline 1 : Whole Document as Tokens

In [None]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

In [None]:
%%time
# Prepare training documents from a list of (id, text, label) tuples.
LabeledDocument = Row("song_key", "lyrics", "label")

## ML Pipeline 
tok1 = Tokenizer(inputCol="lyrics", outputCol="words")
htf1 = HashingTF(inputCol=tok1.getOutputCol(), outputCol="features", numFeatures=200)
lr1 = LogisticRegression(maxIter=10, regParam=0.01)
pipeline1 = Pipeline(stages=[tok1, htf1, lr1])

# Fit the pipeline to training documents.
model1 = pipeline1.fit(training)

In [None]:
# Make predictions on test documents and print columns of interest.
prediction1 = model1.transform(test)
selected1 = prediction1.select("song_key", "lyrics", "prediction")

In [None]:
print type(selected1)

In [None]:
# build up predictions
pred_hits1 = {}
for row in selected1.collect():
    pred_hits1[row[0]] = row[2]

In [None]:
# quick check: prediction for song 2014-1 (0.0 means top 50 in this run)
pred_hits1['2014-1']

###How did pipeline1 do at predicting the 2014 top 50, given the hits from earlier years?

In [None]:
printPredicts(lyrics_pd_df[lyrics_pd_df['year'] == 2014],pred_hits1,pipeline_name="Pipeline1")

In [None]:
# TODO : VIZ

###Pipeline2
**Word Vector approach**

####TODO: IMPLEMENT

In [None]:
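## Word-vector pipeline (still a TODO: no classifier stage yet)
tok2 = Tokenizer(inputCol="lyrics", outputCol="words")
htf2 = HashingTF(inputCol=tok2.getOutputCol(), outputCol="tf", numFeatures=200)
# Word2Vec expects an array of tokens, so feed it the tokenizer output
w2v = Word2Vec(inputCol=tok2.getOutputCol(), outputCol="w2v")
# NOTE: assembling the encoded label into the features leaks the target;
# drop "lbl" from inputCols before adding a classifier stage
ohe = OneHotEncoder(inputCol="label", outputCol="lbl")
va = VectorAssembler(inputCols=["tf", "w2v", "lbl"], outputCol="features")
pipeline2 = Pipeline(stages=[tok2, htf2, w2v, ohe, va])

# Fit the pipeline to training documents.
model2 = pipeline2.fit(training)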

##Try per decade predictions