#Data Exploration
### Adapted concepts from [HW1](https://github.com/cs109-students/michaeljohns-2015hw/blob/hw1/hw1.ipynb) and [HW5 Part1](https://github.com/cs109-students/michaeljohns-2015hw/blob/hw5/hw5part1.ipynb)

**Run this notebook locally by issuing `vagrant up` from the project root, then open it at "http://localhost:4545". You may also need to issue `vagrant provision` to update any required resources.**


In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [2]:
## MLJ: Additional Extras
import time
import itertools
import json
import pickle

In [3]:
import os
# os.environ['PYSPARK_PYTHON'] = '/anaconda/bin/python'

In [4]:
import findspark
findspark.init()
print findspark.find()
# Depending on your setup you might have to change this line of code.
# findspark makes sure I don't need the line below on Homebrew.
#os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.5.1/libexec/"
# The line below actually broke my Spark, so I removed it.
# Depending on how you started the notebook, you might need it.
# os.environ['PYSPARK_SUBMIT_ARGS']="--master local pyspark --executor-memory 4g"

/home/vagrant/spark


In [5]:
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local[4]')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)

In [6]:
sc._conf.getAll()

[(u'spark.executor.memory', u'2g'),
 (u'spark.master', u'local[4]'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.driver.memory', u'8g'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.app.name', u'pyspark')]

In [7]:
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()

['2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 

In [8]:
sys.version

'2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

In [9]:
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

#Load Provided Data Into Pandas Dataframe

In [10]:
# load the provided lyrics
lyrics_pd_df = pd.read_csv("../../data/provided/all billboard top 100 songs from 1970-2014.csv")  

In [11]:
# cull excess columns swept up on read
lyrics_pd_df = lyrics_pd_df[['position','year','title.href','title','artist','lyrics']]

In [12]:
lyrics_pd_df.shape

(4500, 6)

In [13]:
lyrics_pd_df.head()

Unnamed: 0,position,year,title.href,title,artist,lyrics
0,1,1970,https://en.wikipedia.org/wiki/Bridge_over_Trou...,Bridge over Troubled Water,Simon and Garfunkel,When you're weary feeling small When tears are...
1,2,1970,https://en.wikipedia.org/wiki/(They_Long_to_Be...,(They Long to Be) Close to You,The Carpenters,x
2,3,1970,https://en.wikipedia.org/wiki/American_Woman_(...,American Woman,The Guess Who,"American woman, stay away from me American wom..."
3,4,1970,https://en.wikipedia.org/wiki/Raindrops_Keep_F...,Raindrops Keep Fallin' on My Head,B.J. Thomas,Raindrops keep falling on my head Just like th...
4,5,1970,https://en.wikipedia.org/wiki/War_(Edwin_Starr...,War,Edwin Starr,"War huh Yeah! Absolutely uh-huh, uh-huh huh Ye..."


In [14]:
# add `decade` column to df
lyrics_pd_df['decade'] = lyrics_pd_df.year.apply(lambda y : y - y%10)
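
The decade is the year minus its remainder modulo 10. A minimal, pandas-free sketch of the same arithmetic follows; the helper name `to_decade` is hypothetical and only mirrors the lambda above.

```python
def to_decade(year):
    """Floor a year to the first year of its decade, e.g. 1987 -> 1980."""
    return year - year % 10

years = [1970, 1979, 1980, 1999, 2014]
decades = [to_decade(y) for y in years]
print(decades)  # [1970, 1970, 1980, 1990, 2010]
```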

In [15]:
# add a `song_key` column by joining `year` and `position` for better identity 
# adapted from:
# http://stackoverflow.com/questions/29983946/concatenate-cells-into-a-string-with-separator-pandas-python
lyrics_pd_df['song_key'] = lyrics_pd_df[['year','position']].apply(lambda row: '-'.join(row.astype(str).values), axis=1)

In [16]:
# view output
lyrics_pd_df.sample(5).head()

Unnamed: 0,position,year,title.href,title,artist,lyrics,decade,song_key
294,95,1972,,Baby Let Me Take You (In My Arms),The Detroit Emeralds,x,1970,1972-95
2707,8,1997,https://en.wikipedia.org/wiki/Return_of_the_Mack,Return of the Mack,Mark Morrison,,1990,1997-8
1541,42,1985,https://en.wikipedia.org/wiki/All_I_Need_(Jack...,All I Need,Jack Wagner,Kissing you is not what I had planned And now ...,1980,1985-42
2231,32,1992,https://en.wikipedia.org/wiki/Smells_Like_Teen...,Smells Like Teen Spirit,Nirvana,"Load up on guns, bring your friends, It's fun ...",1990,1992-32
2342,43,1993,https://en.wikipedia.org/wiki/I%27d_Die_Withou...,I'd Die Without You,P.M. Dawn,x,1990,1993-43


##Which Lyrics Are Missing?

In [17]:
# missing lyrics have a value of 'x'; these will need to be back-filled in the Harvest section
missing_lyricsdf = lyrics_pd_df[lyrics_pd_df['lyrics'] == 'x']
print "lyrics == 'x' shape --> ", missing_lyricsdf.shape
print "check lyrics len <= 10 --> ", lyrics_pd_df[lyrics_pd_df['lyrics'].str.len() <= 10].shape
missing_lyricsdf.head()

lyrics == 'x' shape -->  (863, 8)
check lyrics len <= 10 -->  (863, 8)


Unnamed: 0,position,year,title.href,title,artist,lyrics,decade,song_key
1,2,1970,https://en.wikipedia.org/wiki/(They_Long_to_Be...,(They Long to Be) Close to You,The Carpenters,x,1970,1970-2
6,7,1970,https://en.wikipedia.org/wiki/I%27ll_Be_There_...,I'll Be There,The Jackson 5,x,1970,1970-7
8,9,1970,https://en.wikipedia.org/wiki/Let_It_Be_(song),Let It Be,The Beatles,x,1970,1970-9
14,15,1970,https://en.wikipedia.org/wiki/ABC_(song),ABC,The Jackson 5,x,1970,1970-15
15,16,1970,https://en.wikipedia.org/wiki/The_Love_You_Save,The Love You Save,The Jackson 5,x,1970,1970-16
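
The filter above matches only the literal placeholder `'x'`. The sampled output earlier also shows a row (Return of the Mack) whose lyrics field appears to be empty, which neither check counts. A hedged sketch of a broader test is given below; the helper `is_missing_lyric` is hypothetical, and treating blank or NaN values as missing is an assumption, not something the notebook does.

```python
import math

def is_missing_lyric(value, placeholder='x'):
    """Hypothetical helper: True for None/NaN, blank strings, or the placeholder."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    text = str(value).strip()
    return text == '' or text == placeholder

samples = ["When you're weary feeling small", 'x', float('nan'), '  ', None]
flags = [is_missing_lyric(v) for v in samples]
print(flags)  # [False, True, True, True, True]
```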


##Join Provided Artist Info with Song 
* **TODO**: Use artist info provided to have a joined dataframe, reference HW1

##Harvest Missing Data
* **TODO**: consider missing lyrics from `missing_lyricsdf`
* **TODO**: consider additional years: those prior to 1970, plus 2015
* **TODO**: harvest artist info for the additional songs

##Join Harvested Artist Info with Song 
* **TODO**: Use artist info harvested to have a joined dataframe, reference HW1

##Build Decade Dictionary

In [18]:
# build dictionary holding indexes for decades, useful for filtering in corpus, vocabs, etc.
def buildDecadeDict(df=lyrics_pd_df):
    """
    Map each decade to the [first, last] row positions it spans.
    Assumes df is sorted by year.
    Note: values are saved as lists to be consistent with json persistence.
    """
    d = {}
    dyear = None
    didx = 0
    idx = -1
    # iterate over the dataframe that was passed in
    for row in df.iterrows():
        idx += 1
        # compare decades rather than raw years, so the data need not start on a decade boundary
        decade = row[1].year - row[1].year % 10
        # initial conditions
        if dyear is None:
            dyear = decade
        # decade change
        elif decade != dyear:
            # add the finished decade to the dictionary
            d[dyear] = [didx, idx - 1]
            dyear = decade
            didx = idx
    # handle the last entry in the loop
    d[dyear] = [didx, idx]
    print d[dyear]

    return d
decade_dict = buildDecadeDict()
decade_dict

[4000, 4499]


{1970: [0, 999],
 1980: [1000, 1999],
 1990: [2000, 2999],
 2000: [3000, 3999],
 2010: [4000, 4499]}
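
The same index-range bookkeeping can be sketched on a plain list of years, without pandas. The function name `decade_ranges` and the toy data are hypothetical; like the cell above, the sketch assumes the input is sorted by year.

```python
def decade_ranges(years):
    """Map each decade to [first_index, last_index]; assumes years are sorted."""
    ranges = {}
    for idx, year in enumerate(years):
        decade = year - year % 10
        if decade not in ranges:
            ranges[decade] = [idx, idx]   # first row seen for this decade
        else:
            ranges[decade][1] = idx       # extend the range to the current row
    return ranges

toy_years = [1978, 1979, 1979, 1980, 1981, 1990]
toy_ranges = decade_ranges(toy_years)
print(toy_ranges)
```

Because the decade is computed for every row, the toy data may start mid-decade (1978) and still produce a 1970 entry.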

##Save Pandas Conditioning

In [19]:
# save lyrics_pd_df
lyrics_pd_df.to_csv("../../data/conditioned/lyrics_pd_df.csv",index=False) #note: ascii

In [20]:
# save decade dict
with open('../../data/conditioned/decade-dict.json', 'w') as fp:
    json.dump(decade_dict, fp)

##Manipulate With Spark

In [21]:
# convert from pandas to spark dataframe
lyricsdf = sqlsc.createDataFrame(lyrics_pd_df)

In [22]:
# view output
lyricsdf.show(3)

+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+
|position|year|          title.href|               title|             artist|              lyrics|decade|song_key|
+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+
|       1|1970|https://en.wikipe...|Bridge over Troub...|Simon and Garfunkel|When you're weary...|  1970|  1970-1|
|       2|1970|https://en.wikipe...|(They Long to Be)...|     The Carpenters|                   x|  1970|  1970-2|
|       3|1970|https://en.wikipe...|      American Woman|      The Guess Who|American woman, s...|  1970|  1970-3|
+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+
only showing top 3 rows



In [23]:
# no longer need the pandas version, clear it out
del lyrics_pd_df

In [24]:
#view output
lyricsdf.show(3)

+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+
|position|year|          title.href|               title|             artist|              lyrics|decade|song_key|
+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+
|       1|1970|https://en.wikipe...|Bridge over Troub...|Simon and Garfunkel|When you're weary...|  1970|  1970-1|
|       2|1970|https://en.wikipe...|(They Long to Be)...|     The Carpenters|                   x|  1970|  1970-2|
|       3|1970|https://en.wikipe...|      American Woman|      The Guess Who|American woman, s...|  1970|  1970-3|
+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+
only showing top 3 rows



In [25]:
# cache the DataFrame so that repeated actions do not recompute it
lyricsdf.cache()
print "How many songs do we have?", lyricsdf.count()

How many songs do we have? 4500


In [26]:
print "What is the schema?"
lyricsdf.printSchema()

What is the schema?
root
 |-- position: long (nullable = true)
 |-- year: long (nullable = true)
 |-- title.href: string (nullable = true)
 |-- title: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- lyrics: string (nullable = true)
 |-- decade: long (nullable = true)
 |-- song_key: string (nullable = true)


##Sample Lyrics (or Not)

Optionally draw a random sample of songs from each year. With `sample_lyrics` set to `False`, the full data set is processed.

In [27]:
# whether or not to sample lyrics, and how many to sample per year
sample_lyrics = False
PER_YEAR_SAMPLES=10

In [28]:
def randomSubSampleLyrics(sparkdf,take=PER_YEAR_SAMPLES):    
    # generate (year, song_key) pairs
    br_pairs = sparkdf.map(lambda r: (r.year, r.song_key))
    
    # group by year for a list of song keys per year, then collect
    br_grouped = br_pairs.groupByKey().mapValues(lambda x: list(x)).collect()
        
    #sample after collect
    br_sample = [np.random.choice(v, size=take, replace=False) for k,v in br_grouped]    
    
    #flatten into a list
    return list(itertools.chain.from_iterable(br_sample))
    
small_song_keys = randomSubSampleLyrics(lyricsdf)
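
A minimal sketch of the same per-year sampling using only the standard library is shown below. The function name `sample_keys_per_year` and the toy rows are hypothetical. One deliberate difference is the `min()` guard: `np.random.choice(..., replace=False)` raises a `ValueError` when asked for more items than a group holds, whereas this sketch simply returns the whole group.

```python
import random
from collections import defaultdict

def sample_keys_per_year(rows, take, seed=None):
    """rows: iterable of (year, song_key). Returns up to `take` keys per year."""
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for year, key in rows:
        by_year[year].append(key)
    sampled = []
    for year in sorted(by_year):
        keys = by_year[year]
        # min() guards against years with fewer than `take` songs
        sampled.extend(rng.sample(keys, min(take, len(keys))))
    return sampled

rows = [(1970, '1970-1'), (1970, '1970-2'), (1970, '1970-3'), (1971, '1971-1')]
picked = sample_keys_per_year(rows, take=2, seed=0)
print(len(picked))  # 3: two keys from 1970, one from 1971
```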

In [29]:
if sample_lyrics:
    print "How many small_song_keys? ", len(small_song_keys)
    small_song_keys[:5]
else:
    print "No lyric sampling, full processing (change `sample_lyrics` value to `True` to sample)"

No lyric sampling, full processing (change `sample_lyrics` value to `True` to sample)


In [30]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Fri, 20 Nov 2015 16:17:16


In [31]:
%%time
# restrict to the sampled songs when sampling is enabled
if sample_lyrics:
    ldf=lyricsdf[lyricsdf.song_key.isin(small_song_keys)]#creates new dataframe
else:
    ldf=lyricsdf

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 8.11 µs


In [32]:
# cache results
ldf.cache()

DataFrame[position: bigint, year: bigint, title.href: string, title: string, artist: string, lyrics: string, decade: bigint, song_key: string]

In [33]:
print "How many lyrics are in ldf? ", ldf.count()

How many lyrics are in ldf?  4500


##NLP

In [34]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')

In [35]:
from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS

In [36]:
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")
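
The two patterns match runs of two or more periods and two or more hyphens. A short sketch of their effect on a sample string follows; the variable names and the sample text are illustrative only.

```python
import re

dots = re.compile(r"\.{2,}")    # two or more consecutive periods
dashes = re.compile(r"\-{2,}")  # two or more consecutive hyphens

raw = "The patio...job was and...perfect, well--mostly. Fine."
clean = dashes.sub(' ', dots.sub(' ', raw))
print(clean)  # The patio job was and perfect, well mostly. Fine.
```

Single periods and single hyphens are left alone, so sentence boundaries and hyphenated words survive the cleanup.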

In [37]:
print "Quick Test of parse..."
parse("The world is the craziest place. I am working hard.", tokenize=True, lemmata=True)

Quick Test of parse...


u'The/DT/B-NP/O/the world/NN/I-NP/O/world is/VBZ/B-VP/O/be the/DT/B-NP/O/the craziest/JJ/I-NP/O/craziest place/NN/I-NP/O/place ././O/O/.\nI/PRP/B-NP/O/i am/VBP/B-VP/O/be working/VBG/I-VP/O/work hard/RB/B-ADVP/O/hard ././O/O/.'

In [38]:
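def get_parts(thetext):
    """Return (nouns, adjectives): per-sentence lists of lemmas, keeping only sentences that have both."""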
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2

In [39]:
print "Quick check of get_parts ..."
get_parts("Have had many other items and just love the food. The patio...job was and...perfect. Lunch is good, and the only egg is great")

Quick check of get_parts ...


([[u'patio', u'job'], [u'lunch', u'egg']], [[u'perfect'], [u'good', u'great']])
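
The filtering rule inside `get_parts` can be sketched on pre-tagged tokens without the `pattern` library. Everything below is illustrative: the `(lemma, tag)` input format, the tiny `STOPWORDS` set (a stand-in for sklearn's `ENGLISH_STOP_WORDS`), and the helper names are assumptions, and the sketch handles a single sentence rather than a whole text.

```python
NOUN_TAGS = {'NN', 'NNS'}
ADJ_TAGS = {'JJ', 'JJR', 'JJS'}
STOPWORDS = {'other', 'many'}           # stand-in for sklearn's ENGLISH_STOP_WORDS
PUNCT = set('.,;:!?()[]{}`"@#$^&*+-|=~_')

def keep(lemma):
    """Reject stopwords, lemmas with punctuation at either end, and single characters."""
    return not (lemma in STOPWORDS or lemma[0] in PUNCT
                or lemma[-1] in PUNCT or len(lemma) == 1)

def split_sentence(tagged):
    """tagged: list of (lemma, tag) pairs for one sentence."""
    nouns = [w for w, t in tagged if w and t in NOUN_TAGS and keep(w)]
    adjs = [w for w, t in tagged if w and t in ADJ_TAGS and keep(w)]
    return nouns, adjs

sentence = [('lunch', 'NN'), ('be', 'VBZ'), ('good', 'JJ'), ('-', 'NN'),
            ('many', 'JJ'), ('egg', 'NN'), ('be', 'VBZ'), ('great', 'JJ')]
parts = split_sentence(sentence)
print(parts)  # (['lunch', 'egg'], ['good', 'great'])
```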

###Run Get Parts on Provided Data

In [40]:
# extract nouns and adjectives from each song's lyrics
lyric_parts = ldf.map(lambda r : get_parts(r.lyrics))

In [41]:
# view output
lyric_parts.take(2)

[([[u'feeling',
    u'tear',
    u'eye',
    u'time',
    u'friend',
    u'bridge',
    u'water',
    u'bridge',
    u'water',
    u'street',
    u'evening',
    u'comfort',
    u'darkness',
    u'pain',
    u'bridge',
    u'water',
    u'bridge',
    u'water',
    u'time',
    u'dream',
    u'way',
    u'friend',
    u'bridge',
    u'water',
    u'mind',
    u'bridge',
    u'water',
    u'mind']],
  [[u'weary',
    u'small',
    u'rough',
    u'troubled',
    u'troubled',
    u'troubled',
    u'troubled',
    u'troubled',
    u'troubled']]),
 ([], [])]

In [42]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Fri, 20 Nov 2015 16:17:18


In [43]:
%%time
parseout=lyric_parts.collect()

CPU times: user 126 ms, sys: 13 ms, total: 139 ms
Wall time: 59.6 s


##Vocab
###Nouns

In [44]:
print "How many parseout entries? ", len(parseout)

How many parseout entries?  4500


In [45]:
# flatten parseout to create initial noun rdd
nounrdd=sc.parallelize([ele[0] for ele in parseout]).flatMap(lambda l: l)

In [46]:
# view output
nounrdd.take(1)

[[u'feeling',
  u'tear',
  u'eye',
  u'time',
  u'friend',
  u'bridge',
  u'water',
  u'bridge',
  u'water',
  u'street',
  u'evening',
  u'comfort',
  u'darkness',
  u'pain',
  u'bridge',
  u'water',
  u'bridge',
  u'water',
  u'time',
  u'dream',
  u'way',
  u'friend',
  u'bridge',
  u'water',
  u'mind',
  u'bridge',
  u'water',
  u'mind']]

In [47]:
# cache results
nounrdd.cache()

PythonRDD[34] at RDD at PythonRDD.scala:43

In [48]:
# straight reduce for overall word counts
nwordsrdd = (nounrdd.flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
)

In [49]:
# view output
nwordsrdd.take(5)

[(u'jockin', 3),
 (u'sleet', 6),
 (u'sleep', 77),
 (u'mansion', 10),
 (u'integrity', 1)]

In [50]:
# top n, based on values, sorted descending
nwordsrdd.takeOrdered(10, key = lambda x: -x[1])

[(u'love', 4969),
 (u'baby', 3852),
 (u'time', 3801),
 (u'way', 2994),
 (u'girl', 2825),
 (u'night', 2406),
 (u'heart', 2006),
 (u'thing', 1996),
 (u'day', 1947),
 (u'life', 1806)]
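
For small inputs, the same flatten, count, and top-n steps can be reproduced locally with `collections.Counter`. This is only a sketch on toy data, not a replacement for the Spark pipeline.

```python
from collections import Counter
from itertools import chain

sentences = [['bridge', 'water', 'bridge'], ['love'], ['love', 'love']]
counts = Counter(chain.from_iterable(sentences))          # flatten, then count each word
top2 = sorted(counts.items(), key=lambda kv: -kv[1])[:2]  # same ordering idea as takeOrdered
print(top2)  # [('love', 3), ('bridge', 2)]
```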

In [51]:
nwordsrdd.cache()

PythonRDD[41] at RDD at PythonRDD.scala:43

In [52]:
# assign each distinct noun an integer index
nounvocabtups = (nwordsrdd
             .map(lambda (x,y): x)
             .zipWithIndex()
)

In [53]:
# view output
nounvocabtups.take(3)

[(u'jockin', 0), (u'sleet', 1), (u'sleep', 2)]

In [54]:
# cache results
nounvocabtups.cache()

PythonRDD[44] at RDD at PythonRDD.scala:43

In [55]:
# collect results
nounvocab=nounvocabtups.collectAsMap()
nounid2word=nounvocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [56]:
# sampling may be in effect, so avoid hard-coded lookups such as `nounvocab['dance']`, which may not be present
nounid2word[0], nounvocab.keys()[5], nounvocab[nounvocab.keys()[5]]

(u'jockin', u'woody', 2302)

In [57]:
print "How big is the noun vocabulary? ", len(nounvocab.keys())

How big is the noun vocabulary?  8757
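
The two maps built above are inverses of each other: one goes from word to id, the other from id back to word. A minimal local sketch with `enumerate`, using three words from the output above as toy data:

```python
words = ['jockin', 'sleet', 'sleep']            # distinct words, in some fixed order
vocab = {w: i for i, w in enumerate(words)}     # word -> id
id2word = {i: w for w, i in vocab.items()}      # id -> word, the inverse map

# round trip: every word maps to an id that maps back to the same word
assert all(id2word[vocab[w]] == w for w in words)
print(sorted(vocab.items(), key=lambda kv: kv[1]))
```

The ids carry no meaning beyond identity, so the two maps should always be saved and loaded as a pair.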


###Adjectives

In [58]:
# create initial adj rdd from parseout
adjrdd=sc.parallelize([ele[1] for ele in parseout])

In [59]:
# view output
adjrdd.take(3)

[[[u'weary',
   u'small',
   u'rough',
   u'troubled',
   u'troubled',
   u'troubled',
   u'troubled',
   u'troubled',
   u'troubled']],
 [],
 [[u'american',
   u'american',
   u'hanging',
   u'important',
   u'old',
   u'american',
   u'american',
   u'american',
   u'american',
   u'american',
   u'american',
   u'warm',
   u'american',
   u'american',
   u'american',
   u'american',
   u'good',
   u'good',
   u'american',
   u'american',
   u'american']]]

In [60]:
# cache results
adjrdd.cache()

ParallelCollectionRDD[46] at parallelize at PythonRDD.scala:423

In [61]:
# straight reduce for overall word counts
awordsrdd = (adjrdd
             .flatMap(lambda l: l)
             .flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
)

In [62]:
# view output
awordsrdd.take(5)

[(u'ooh-oh', 1),
 (u'good-night', 1),
 (u'suicidal', 2),
 (u'b-b-b-be', 1),
 (u'crucial', 5)]

In [63]:
# top n, based on values, sorted descending
awordsrdd.takeOrdered(10, key = lambda x: -x[1])

[(u'good', 1934),
 (u'little', 1560),
 (u'bad', 858),
 (u'real', 842),
 (u'true', 705),
 (u'wrong', 703),
 (u'right', 661),
 (u'long', 632),
 (u'crazy', 626),
 (u'new', 620)]

In [64]:
# cache results
awordsrdd.cache()

PythonRDD[54] at RDD at PythonRDD.scala:43

In [65]:
# assign each distinct adjective an integer index
adjvocabtups = (awordsrdd
              .map(lambda (x,y): x)
              .zipWithIndex()
)

In [66]:
# view output
adjvocabtups.take(3)

[(u'ooh-oh', 0), (u'good-night', 1), (u'suicidal', 2)]

In [67]:
# cache results
adjvocabtups.cache()

PythonRDD[57] at RDD at PythonRDD.scala:43

In [68]:
# collect results
adjvocab=adjvocabtups.collectAsMap()
adjid2word=adjvocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [69]:
# sampling may be in effect, so avoid hard-coded lookups such as `adjvocab['exotic']`, which may not be present
adjid2word[0], adjvocab.keys()[5], adjvocab[adjvocab.keys()[5]]

(u'ooh-oh', u'dynamic', 32)

In [70]:
print "How big is the adjective vocabulary? ", len(adjvocab)

How big is the adjective vocabulary?  3486


##Document Corpus

In [71]:
##################################################################################################
# CITATION - Use of counter for reduce within each word list from:
# http://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python
##################################################################################################
from collections import Counter

# for each sentence, reduce into a list of (k, v) tuples where k = vocab index and v = count;
# each word list is sorted by occurrence count, descending
documents = nounrdd.map(lambda words: Counter([nounvocab[word] for word in words]).most_common())

In [72]:
# verify output
documents.take(1)

[[(610, 6),
  (3123, 6),
  (8417, 2),
  (5210, 2),
  (232, 2),
  (5211, 1),
  (580, 1),
  (840, 1),
  (828, 1),
  (8241, 1),
  (6868, 1),
  (1206, 1),
  (7987, 1),
  (7003, 1),
  (3420, 1)]]
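
Each document is a list of `(vocab id, count)` pairs, most frequent first. A toy sketch of that transformation for a single sentence, with a hypothetical three-word vocabulary:

```python
from collections import Counter

vocab = {'bridge': 0, 'water': 1, 'time': 2}
sentence = ['bridge', 'water', 'bridge', 'time', 'bridge', 'water']
doc = Counter(vocab[w] for w in sentence).most_common()
print(doc)  # [(0, 3), (1, 2), (2, 1)]
```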

In [73]:
# gather spark results
corpus=documents.collect()

##Save Spark Conditioning

In [74]:
# save lyricsdf / ldf
ldf.toPandas().to_csv("../../data/conditioned/sample_or_not_lyricsdf.csv",index=False,encoding='utf-8') #note: utf-8

In [75]:
# save noun n-gram
with open('../../data/conditioned/noun-n-gram.json', 'w') as fp:
    json.dump(dict(nwordsrdd.collect()), fp)

In [76]:
# save adjective n-gram
with open('../../data/conditioned/adj-n-gram.json', 'w') as fp:
    json.dump(dict(awordsrdd.collect()), fp)

In [77]:
# save noun vocab and id2word
with open('../../data/conditioned/nounvocab.json', 'w') as fp:
    json.dump(nounvocab, fp)
    
with open('../../data/conditioned/nounid2word.json', 'w') as fp:
    json.dump(nounid2word, fp)    

In [78]:
# save adj vocab and id2word
with open('../../data/conditioned/adjvocab.json', 'w') as fp:
    json.dump(adjvocab, fp)
    
with open('../../data/conditioned/adjid2word.json', 'w') as fp:
    json.dump(adjid2word, fp) 
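
One caveat when reloading the id-to-word maps: JSON object keys are always strings, so integer ids come back as `'0'`, `'1'`, and so on. A small sketch of the round trip and the conversion back to integers, using toy data:

```python
import json

id2word = {0: 'jockin', 1: 'sleet'}
restored = json.loads(json.dumps(id2word))
print(sorted(restored.keys()))          # keys are now the strings '0' and '1'

# convert the keys back to integers after loading
id2word_again = {int(k): v for k, v in restored.items()}
assert id2word_again == id2word
```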

In [79]:
# save corpus
with open("../../data/conditioned/corpus.p", "wb") as fp:
    pickle.dump(corpus, fp)

##Synonyms

###Synonym Lookups
Focus on the WordNet Python package within [nltk](http://www.nltk.org) via [textblob](https://textblob.readthedocs.org/en/dev/).
The main idea is to look up every word in the noun and adjective vocab dictionaries and, where possible, collapse it to a synonym. The synonyms can also be used for common_support.
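
A rough sketch of what collapsing to synonyms could look like once a mapping exists. The `canonical` dictionary here is hand-written and purely hypothetical; in practice it would be derived from WordNet lookups, which this sketch does not attempt.

```python
from collections import Counter

# Hypothetical hand-written mapping; in practice it would come from WordNet lookups.
canonical = {'infant': 'baby', 'babe': 'baby', 'evening': 'night'}

counts = Counter({'baby': 5, 'infant': 2, 'babe': 1, 'night': 4, 'evening': 3})
collapsed = Counter()
for word, n in counts.items():
    collapsed[canonical.get(word, word)] += n   # unmapped words keep their own entry
print(collapsed.most_common())  # [('baby', 8), ('night', 7)]
```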

In [83]:
# test of TextBlob

# from https://textblob.readthedocs.org/en/dev/
from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #            'ultimate movie monster',
                    #            'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341

blob.translate(to="es")  # 'La amenaza titular de The Blob...'

0.06
-0.341666666667


TextBlob("La amenaza titular de The Blob siempre me ha parecido la película final
monstruo: una, la masa insaciablemente hambriento ameba capaz de penetrar
prácticamente cualquier salvaguardia, capaz de - como médico condenado escalofriantemente
lo describe - "asimilar carne en contacto.
Comparaciones Snide a la gelatina ser condenados, es un concepto con el más
devastadora de las posibles consecuencias, no muy diferente del escenario plaga gris
propuesto por los teóricos tecnológicos temerosos de
la inteligencia artificial ejecutar rampante.")

In [85]:
# test of Synset and path_similarity

# from http://textblob.readthedocs.org/en/dev/quickstart.html#quickstart
from textblob.wordnet import Synset
octopus = Synset("octopus.n.02")
nautilus = Synset('paper_nautilus.n.01')
shrimp = Synset('shrimp.n.03')
pearl = Synset('pearl.n.01')

print "octopus similarity to octopus --> ",octopus.path_similarity(octopus)  # 1.0
print "octopus similarity to nautilus --> ",octopus.path_similarity(nautilus)  # 0.33
print "octopus similarity to shrimp --> ",octopus.path_similarity(shrimp)  # 0.11
print "octopus similarity to pearl --> ",octopus.path_similarity(pearl)  # 0.07

octopus similarity to octopus -->  1.0
octopus similarity to nautilus -->  0.333333333333
octopus similarity to shrimp -->  0.111111111111
octopus similarity to pearl -->  0.0666666666667
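
WordNet's `path_similarity` is 1 / (1 + d), where d is the length of the shortest path between the two synsets in the hypernym hierarchy. Inverting that formula on the scores printed above recovers the implied path lengths. The helper name below is hypothetical.

```python
def implied_path_length(similarity):
    """Invert path_similarity = 1 / (1 + d) to recover the path length d."""
    return int(round(1.0 / similarity - 1))

scores = {'octopus': 1.0, 'nautilus': 0.333333333333,
          'shrimp': 0.111111111111, 'pearl': 0.0666666666667}
lengths = {name: implied_path_length(s) for name, s in scores.items()}
print(sorted(lengths.items(), key=lambda kv: kv[1]))
```

So the nautilus is 2 steps from the octopus, the shrimp 8, and the pearl 14.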


##Vice / Virtue Separation

In [None]:
# TODO

##Ensemble Approach
See [combine all features into a single feature vector](https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html)

In [None]:
# see https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html


###Spark Word2Vec for similarity
This approach is plausible, but it should probably be used as one component of an ensemble feature vector; see [combine all features into a single feature vector](https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html)

In [80]:
"""
from pyspark.mllib.feature import Word2Vec #mllib works with RDD as-is
# from pyspark.ml.feature import Word2Vec

word2vec = Word2Vec()
model = word2vec.fit(nounrdd)

synonyms = model.findSynonyms('ho', 10)

for word, cosine_distance in synonyms:
    print("{}: {}".format(word, cosine_distance))
"""
print




##Split out by decade

In [None]:
# TODO