#Vocab Consolidation
### Adapted concepts from [HW1](https://github.com/cs109-students/michaeljohns-2015hw/blob/hw1/hw1.ipynb) and [HW5 Part1](https://github.com/cs109-students/michaeljohns-2015hw/blob/hw5/hw5part1.ipynb)

**Run this notebook locally by issuing `vagrant up` from the project root, then open the notebook at "http://localhost:4545". You may also need to issue `vagrant provision` to update any required resources.**

The following artifacts will be established by manipulating the output of the processing pipeline for harvesting data, file [use-this-master-lyricsdf-extracted.csv](../../data/conditioned/use-this-master-lyricsdf-extracted.csv):
* vocabs for noun and adj
* n-gram for noun and adj
* synonyms for noun and adj
* hypernyms for noun and adj

Other notes:
* this notebook leverages and finalizes exploratory work in [Data-Exploration Notebook](Data-Exploration.ipynb).
* outputs are anticipated to be combined in follow-on work for better latent factors, prediction, and recommendation processing (not reflected here)
* other notebooks that reuse the same steps establish the n-gram and vocab artifacts per decade.



In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [2]:
## MLJ: Additional Extras
import time
import itertools
import json
import pickle

In [3]:
import os
# os.environ['PYSPARK_PYTHON'] = '/anaconda/bin/python'

In [4]:
import findspark
findspark.init()
print findspark.find()
# Depending on your setup you might have to change this line of code
#findspark makes sure I don't need the below on homebrew.
#os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.5.1/libexec/"
#the below actually broke my spark, so I removed it. 
#Depending on how you started the notebook, you might need it.
# os.environ['PYSPARK_SUBMIT_ARGS']="--master local pyspark --executor-memory 4g"

/home/vagrant/spark


In [5]:
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local[4]')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)

In [6]:
sc._conf.getAll()

[(u'spark.executor.memory', u'2g'),
 (u'spark.master', u'local[4]'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.driver.memory', u'8g'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.app.name', u'pyspark')]

In [7]:
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()

['2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 

In [8]:
sys.version

'2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

In [9]:
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

#Load Finalized Conditioned Data Into Pandas Dataframe

In [10]:
# load the lyrics from the approved "master" dataframe
lyrics_pd_df = pd.read_csv("../../data/conditioned/use-this-master-lyricsdf-extracted.csv")  

In [11]:
lyrics_pd_df.shape

(4500, 11)

In [12]:
lyrics_pd_df.head()

Unnamed: 0,index,position,year,title.href,title,artist,lyrics,decade,song_key,lyrics_url,lyrics_abstract
0,0,1,1970,https://en.wikipedia.org/wiki/Bridge_over_Trou...,Bridge over Troubled Water,Simon and Garfunkel,When you're weary. Feeling small. When tears a...,1970,1970-1,http://lyrics.wikia.com/Simon_And_Garfunkel:Br...,When you're weary. Feeling small. When tears a...
1,1,2,1970,https://en.wikipedia.org/wiki/(They_Long_to_Be...,(They Long to Be) Close to You,The Carpenters,Why do birds suddenly appear. Everytime you ar...,1970,1970-2,http://lyrics.wikia.com/Carpenters:%28They_Lon...,Why do birds suddenly appear. Everytime you ar...
2,2,3,1970,https://en.wikipedia.org/wiki/American_Woman_(...,American Woman,The Guess Who,"Mmm, da da da. Mmm, mmm, da da da. Mmm, mmm, d...",1970,1970-3,http://lyrics.wikia.com/The_Guess_Who:American...,"Mmm, da da da. Mmm, mmm, da da da. Mmm, mmm, d..."
3,3,4,1970,https://en.wikipedia.org/wiki/Raindrops_Keep_F...,Raindrops Keep Fallin' on My Head,B.J. Thomas,Raindrops are falling on my head. And just lik...,1970,1970-4,http://lyrics.wikia.com/B.J._Thomas:Raindrops_...,Raindrops are falling on my head. And just lik...
4,4,5,1970,https://en.wikipedia.org/wiki/War_(Edwin_Starr...,War,Edwin Starr,"War, huh, yeah. What is it good for? Absolutel...",1970,1970-5,http://lyrics.wikia.com/Edwin_Starr:War,"War, huh, yeah. What is it good for? Absolutel..."


##Manipulate With Spark

In [14]:
# convert from pandas to spark dataframe
lyricsdf = sqlsc.createDataFrame(lyrics_pd_df)

In [15]:
# view output
lyricsdf.show(3)

+-----+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+--------------------+--------------------+
|index|position|year|          title.href|               title|             artist|              lyrics|decade|song_key|          lyrics_url|     lyrics_abstract|
+-----+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+--------------------+--------------------+
|    0|       1|1970|https://en.wikipe...|Bridge over Troub...|Simon and Garfunkel|When you're weary...|  1970|  1970-1|http://lyrics.wik...|When you're weary...|
|    1|       2|1970|https://en.wikipe...|(They Long to Be)...|     The Carpenters|Why do birds sudd...|  1970|  1970-2|http://lyrics.wik...|Why do birds sudd...|
|    2|       3|1970|https://en.wikipe...|      American Woman|      The Guess Who|Mmm, da da da. Mm...|  1970|  1970-3|http://lyrics.wik...|Mmm, da da da. Mm...|
+-----+--------+----+-

In [17]:
#We cache the data to make sure it is only read once from disk
lyricsdf.cache()
print "How many songs do we have?", lyricsdf.count()

How many songs do we have? 4500


In [18]:
# printSchema() prints the schema itself and returns None, so call it on its own line
print "What is the schema?"
lyricsdf.printSchema()

What is the schema?
root
 |-- index: long (nullable = true)
 |-- position: long (nullable = true)
 |-- year: long (nullable = true)
 |-- title.href: string (nullable = true)
 |-- title: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- lyrics: string (nullable = true)
 |-- decade: long (nullable = true)
 |-- song_key: string (nullable = true)
 |-- lyrics_url: string (nullable = true)
 |-- lyrics_abstract: string (nullable = true)


##Sample Lyrics (or Not)

Optionally sample a fixed number of songs from each year, which keeps test runs short.

In [19]:
# whether or not to sample lyrics, and how many to sample per year
sample_lyrics = False
PER_YEAR_SAMPLES=10

In [20]:
# sample `take` song keys per year
def randomSubSampleLyrics(sparkdf,take=PER_YEAR_SAMPLES):    
    # generate spark pairs as a (year, song_key) tuple
    br_pairs = sparkdf.map(lambda r: (r.year, r.song_key))
    
    # group by key for a list of song keys per year and collect
    br_grouped = br_pairs.groupByKey().mapValues(lambda x: list(x)).collect()
        
    # sample after collect (without replacement, so each year needs at least `take` songs)
    br_sample = [np.random.choice(v, size=take, replace=False) for k,v in br_grouped]    
    
    #flatten into a list
    return list(itertools.chain.from_iterable(br_sample))
    
small_song_keys = randomSubSampleLyrics(lyricsdf)

In [21]:
if sample_lyrics:
    print "How many small_song_keys? ", len(small_song_keys)
    small_song_keys[:5]
else:
    print "No lyric sampling, full processing (change `sample_lyrics` value to `True` to sample)"

No lyric sampling, full processing (change `sample_lyrics` value to `True` to sample)
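
A note on the sampling step: `np.random.choice(..., replace=False)` raises a `ValueError` when a year has fewer songs than `take`. The following is a minimal sketch in plain Python (no Spark) of per-year sampling that caps the sample size instead. The function name `sample_keys_per_year`, the `seed` parameter, and the toy data are assumptions for illustration, not part of the notebook.

```python
import random
from collections import defaultdict

def sample_keys_per_year(pairs, take=10, seed=None):
    """pairs: iterable of (year, song_key); returns a flat list of sampled keys."""
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for year, key in pairs:
        by_year[year].append(key)
    sampled = []
    for year in sorted(by_year):
        keys = by_year[year]
        # cap the sample size so a sparse year does not raise an error
        sampled.extend(rng.sample(keys, min(take, len(keys))))
    return sampled

# hypothetical toy data: three songs in 1970, one song in 1971
toy_pairs = [(1970, "1970-1"), (1970, "1970-2"), (1970, "1970-3"), (1971, "1971-1")]
toy_sample = sample_keys_per_year(toy_pairs, take=2, seed=0)
```

With `take=2`, the sketch returns two keys for 1970 and the single available key for 1971.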


In [22]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 24 Nov 2015 04:11:01


In [23]:
%%time
# restrict to the sampled song keys when sampling is enabled
if sample_lyrics:
    ldf=lyricsdf[lyricsdf.song_key.isin(small_song_keys)] # creates new dataframe
else:
    ldf=lyricsdf

CPU times: user 11 µs, sys: 4 µs, total: 15 µs
Wall time: 32.2 µs


In [24]:
# cache results
ldf.cache()

DataFrame[index: bigint, position: bigint, year: bigint, title.href: string, title: string, artist: string, lyrics: string, decade: bigint, song_key: string, lyrics_url: string, lyrics_abstract: string]

In [25]:
print "How many lyrics are in ldf? ", ldf.count()

How many lyrics are in ldf?  4500


##NLP

In [26]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
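# note: the adjacent '' splits this into two concatenated string literals,
# so the apostrophe itself is not part of the list
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')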

In [27]:
from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS

In [28]:
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")
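
To make the purpose of the two patterns concrete, here is a small sketch of what they do to a string: runs of two or more periods or hyphens become a single space, while a lone period is left in place as a sentence boundary. The helper name `normalize_runs` and the sample text are assumptions for illustration.

```python
import re

regex1 = re.compile(r"\.{2,}")   # two or more consecutive periods
regex2 = re.compile(r"\-{2,}")   # two or more consecutive hyphens

def normalize_runs(text):
    # replace each run with one space so the words on either side stay separate tokens
    text = regex1.sub(' ', text)
    return regex2.sub(' ', text)

cleaned = normalize_runs("The patio...job was and...perfect. Wait--what")
```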

In [29]:
print "Quick Test of parse..."
parse("The world is the craziest place. I am working hard.", tokenize=True, lemmata=True)

Quick Test of parse...


u'The/DT/B-NP/O/the world/NN/I-NP/O/world is/VBZ/B-VP/O/be the/DT/B-NP/O/the craziest/JJ/I-NP/O/craziest place/NN/I-NP/O/place ././O/O/.\nI/PRP/B-NP/O/i am/VBP/B-VP/O/be working/VBG/I-VP/O/work hard/RB/B-ADVP/O/hard ././O/O/.'
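
The tagged string above packs five slash-separated fields into each token: word, part-of-speech tag, chunk tag, a fourth tag, and lemma. That is why `get_parts` below reads `token[1]` for the tag and `token[4]` for the lemma. The sketch below splits that same output string by hand, without `pattern`, to show the layout. The helper `split_tagged` is hypothetical, and it assumes no word itself contains a `/` or a space.

```python
# the tagged string returned by the quick test of parse()
tagged = (u'The/DT/B-NP/O/the world/NN/I-NP/O/world is/VBZ/B-VP/O/be '
          u'the/DT/B-NP/O/the craziest/JJ/I-NP/O/craziest place/NN/I-NP/O/place ././O/O/.\n'
          u'I/PRP/B-NP/O/i am/VBP/B-VP/O/be working/VBG/I-VP/O/work hard/RB/B-ADVP/O/hard ././O/O/.')

def split_tagged(tagged_text):
    """One list per sentence (newline-separated), one 5-field list per token."""
    sentences = []
    for line in tagged_text.split('\n'):
        sentences.append([tok.split('/') for tok in line.split(' ')])
    return sentences

sentences = split_tagged(tagged)

# index 1 is the part-of-speech tag, index 4 is the lemma
nouns = [tok[4] for tok in sentences[0] if tok[1] in ('NN', 'NNS')]
adjectives = [tok[4] for tok in sentences[0] if tok[1] in ('JJ', 'JJR', 'JJS')]
```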

In [30]:
def get_parts(thetext):
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2

In [31]:
print "Quick check of get_parts ..."
get_parts("Have had many other items and just love the food. The patio...job was and...perfect. Lunch is good, and the only egg is great")

Quick check of get_parts ...


([[u'patio', u'job'], [u'lunch', u'egg']], [[u'perfect'], [u'good', u'great']])

###Run Get Parts on Provided Data

In [32]:
# extract per-sentence nouns and adjectives from each song's lyrics
lyric_parts = ldf.map(lambda r : get_parts(r.lyrics))

In [33]:
# view output
lyric_parts.take(2)

[([[u'time'],
   [u'bridge', u'water'],
   [u'bridge', u'water'],
   [u'bridge', u'water'],
   [u'bridge', u'water'],
   [u'bridge', u'water'],
   [u'bridge', u'water']],
  [[u'rough'],
   [u'troubled'],
   [u'troubled'],
   [u'troubled'],
   [u'troubled'],
   [u'troubled'],
   [u'troubled']]),
 ([[u'dream'], [u'starlight', u'eye'], [u'dream'], [u'starlight', u'eye']],
  [[u'true'], [u'blue'], [u'true'], [u'blue']])]

In [34]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 24 Nov 2015 04:11:55


In [35]:
%%time
parseout=lyric_parts.collect()

CPU times: user 139 ms, sys: 21.8 ms, total: 161 ms
Wall time: 1min 31s


##Vocab
###Nouns

In [36]:
print "How many parseout entries? ", len(parseout)

How many parseout entries?  4500


In [37]:
# flatten parseout to create initial noun rdd
nounrdd=sc.parallelize([ele[0] for ele in parseout]).flatMap(lambda l: l)

In [38]:
# view output
nounrdd.take(1)

[[u'time']]

In [39]:
# cache results
nounrdd.cache()

PythonRDD[34] at RDD at PythonRDD.scala:43

In [40]:
# straight reduce for overall word counts
nwordsrdd = (nounrdd.flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
)

In [41]:
# view output
nwordsrdd.take(5)

[(u'jockin', 1),
 (u'slope', 1),
 (u'girl(oh', 1),
 (u'dance', 216),
 (u'pigeon', 3)]

In [42]:
# top n, based on values, sorted descending
nwordsrdd.takeOrdered(10, key = lambda x: -x[1])

[(u'love', 2390),
 (u'baby', 1665),
 (u'girl', 1583),
 (u'time', 1544),
 (u'thing', 1097),
 (u'night', 1003),
 (u'man', 918),
 (u'way', 881),
 (u'day', 830),
 (u'heart', 802)]
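
For reference, the `flatMap` / `map` / `reduceByKey` chain is a word count. A minimal local equivalent, assuming the data fits in memory, uses `collections.Counter`. The input below is the per-sentence noun list of the first song from the `lyric_parts.take(2)` output; the variable names are assumptions for illustration.

```python
from collections import Counter
from itertools import chain

# per-sentence noun lists for one song, copied from the lyric_parts output
sentence_nouns = [[u'time'],
                  [u'bridge', u'water'], [u'bridge', u'water'], [u'bridge', u'water'],
                  [u'bridge', u'water'], [u'bridge', u'water'], [u'bridge', u'water']]

# flatten the sentences and count each word once per occurrence
counts = Counter(chain.from_iterable(sentence_nouns))

# top n by count descending, ties broken alphabetically for a stable order
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
```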

In [43]:
nwordsrdd.cache()

PythonRDD[41] at RDD at PythonRDD.scala:43

In [44]:
# assign an integer index to each distinct noun
nounvocabtups = (nwordsrdd
             .map(lambda (x,y): x)
             .zipWithIndex()
)

In [45]:
# view output
nounvocabtups.take(3)

[(u'jockin', 0), (u'slope', 1), (u'girl(oh', 2)]

In [46]:
# cache results
nounvocabtups.cache()

PythonRDD[44] at RDD at PythonRDD.scala:43

In [47]:
# collect results
nounvocab=nounvocabtups.collectAsMap()
nounid2word=nounvocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [48]:
# since sampling may be used, avoiding more common usage, e.g. `nounvocab['dance']`
nounid2word[0], nounvocab.keys()[5], nounvocab[nounvocab.keys()[5]]

(u'jockin', u'catch', 728)

In [49]:
print "How big is the noun vocabulary? ", len(nounvocab.keys())

How big is the noun vocabulary?  5144
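
The vocabulary is a pair of inverse dictionaries: word to id (`nounvocab`) and id to word (`nounid2word`). A small local sketch of the same idea follows, with a hypothetical helper `build_vocab` and toy input. It assigns ids in first-seen order, which is an assumption of this sketch; the ids in the Spark run depend on how `zipWithIndex` orders the RDD.

```python
def build_vocab(words):
    """Return (word -> id, id -> word), assigning ids in first-seen order."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    id2word = dict((idx, word) for word, idx in vocab.items())
    return vocab, id2word

# hypothetical toy input with one repeated word
vocab, id2word = build_vocab([u'jockin', u'slope', u'dance', u'slope'])
```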


###Adjectives

In [50]:
# create initial adj rdd from parseout
adjrdd=sc.parallelize([ele[1] for ele in parseout])

In [51]:
# view output
adjrdd.take(3)

[[[u'rough'],
  [u'troubled'],
  [u'troubled'],
  [u'troubled'],
  [u'troubled'],
  [u'troubled'],
  [u'troubled']],
 [[u'true'], [u'blue'], [u'true'], [u'blue']],
 [[u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'important'],
  [u'old'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'coloured'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'coloured'],
  [u'american'],
  [u'leave'],
  [u'american'],
  [u'american']]]

In [52]:
# cache results
adjrdd.cache()

ParallelCollectionRDD[46] at parallelize at PythonRDD.scala:423

In [53]:
# straight reduce for overall word counts
awordsrdd = (adjrdd
             .flatMap(lambda l: l)
             .flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
)

In [54]:
# view output
awordsrdd.take(5)

[(u'suicidal', 2),
 (u'hooked', 21),
 (u'resist', 1),
 (u'dynamic', 3),
 (u'cocky', 2)]

In [55]:
# top n, based on values, sorted descending
awordsrdd.takeOrdered(10, key = lambda x: -x[1])

[(u'little', 1838),
 (u'good', 1727),
 (u'real', 946),
 (u'bad', 770),
 (u'new', 764),
 (u'big', 678),
 (u'true', 649),
 (u'sweet', 635),
 (u'ooh', 607),
 (u'long', 579)]

In [56]:
# cache results
awordsrdd.cache()

PythonRDD[54] at RDD at PythonRDD.scala:43

In [57]:
# assign an integer index to each distinct adjective
adjvocabtups = (awordsrdd
              .map(lambda (x,y): x)
              .zipWithIndex()
)

In [58]:
# view output
adjvocabtups.take(3)

[(u'suicidal', 0), (u'hooked', 1), (u'resist', 2)]

In [59]:
# cache results
adjvocabtups.cache()

PythonRDD[57] at RDD at PythonRDD.scala:43

In [60]:
# collect results
adjvocab=adjvocabtups.collectAsMap()
adjid2word=adjvocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [61]:
# since sampling may be used, avoiding more common usage, e.g. `adjvocab['exotic']`
adjid2word[0], adjvocab.keys()[5], adjvocab[adjvocab.keys()[5]]

(u'suicidal', u'suspenseful', 1696)

In [62]:
print "How big is the adjective vocabulary? ", len(adjvocab)

How big is the adjective vocabulary?  3379


##Document Corpus

In [63]:
##################################################################################################
# CITATION - Use of counter for reduce within each word list from:
# http://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python
##################################################################################################
from collections import Counter

# for each sentence, reduce into a list of (k, v) tuples where k=vocab index and v=count,
# each word list is sorted by occurrence count
documents = nounrdd.map(lambda words: Counter([nounvocab[word] for word in words]).most_common())

In [64]:
# verify output
documents.take(1)

[[(5139, 1)]]
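
Each entry of `documents` is therefore a bag of words for one sentence (the elements of `nounrdd` are sentence-level word lists), expressed as `(vocab id, count)` pairs. A minimal sketch of that transformation on toy data is below. The names `toy_vocab` and `to_bag` are assumptions, and a word missing from the vocabulary would raise a `KeyError`, as it would in the cell above.

```python
from collections import Counter

# hypothetical three-word vocabulary
toy_vocab = {u'bridge': 0, u'water': 1, u'time': 2}

def to_bag(words, vocab):
    # map words to ids, count them, and order by count descending
    return Counter(vocab[w] for w in words).most_common()

bag = to_bag([u'water', u'bridge', u'water'], toy_vocab)
```

Here `water` (id 1) appears twice and `bridge` (id 0) once, so the pair for id 1 comes first.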

In [65]:
# gather spark results
corpus=documents.collect()

##Save Spark Conditioning

In [66]:
# save noun n-gram
with open('../../data/conditioned/noun-n-gram.json', 'w') as fp:
    json.dump(dict(nwordsrdd.collect()), fp)

In [67]:
# save adjective n-gram
with open('../../data/conditioned/adj-n-gram.json', 'w') as fp:
    json.dump(dict(awordsrdd.collect()), fp)

In [68]:
# save noun vocab and id2word
with open('../../data/conditioned/nounvocab.json', 'w') as fp:
    json.dump(nounvocab, fp)
    
with open('../../data/conditioned/nounid2word.json', 'w') as fp:
    json.dump(nounid2word, fp)    

In [69]:
# save adj vocab and id2word
with open('../../data/conditioned/adjvocab.json', 'w') as fp:
    json.dump(adjvocab, fp)
    
with open('../../data/conditioned/adjid2word.json', 'w') as fp:
    json.dump(adjid2word, fp) 
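
One caveat when reading the `*id2word.json` files back: JSON object keys are always strings, so the integer ids come back as `'0'`, `'1'`, and so on. A short sketch of the round trip and the conversion, using an in-memory string rather than the real files (the variable names are assumptions):

```python
import json

id2word = {0: u'jockin', 1: u'slope'}

# round trip through JSON: the integer keys are written as strings
restored_raw = json.loads(json.dumps(id2word))

# convert the keys back to int before using the mapping for lookups by id
restored = dict((int(k), v) for k, v in restored_raw.items())
```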

In [70]:
# save corpus
with open("../../data/conditioned/corpus.p", "wb") as fp:
    pickle.dump(corpus, fp)

##Synonyms

###Synonym Lookups
Focus on the WordNet Python package within [nltk](http://www.nltk.org) via [textblob](https://textblob.readthedocs.org/en/dev/).
The main idea is to look up all words in the noun and adj vocab dictionaries and attempt to collapse them, where possible, to synonyms. The synonyms can also be used for common_support.

In [71]:
from textblob.wordnet import Synset
from textblob.wordnet import NOUN
from textblob.wordnet import ADJ

SIM_THRESHOLD = 1.0 # Only act on values at/above threshold

In [72]:
## COMMON METHODS FOR SYNSETS
def synsetStr(syn):
    """
    attempt to parse the string from a Synset, e.g. Synset('dog.n.01') would return 'dog'
    return String or None
    """
    try:
        return syn.name().split('.')[0]
    except Exception:
        return None
    
def flattenSynsetValues(syn_dict, skip_invalid=True, replace_invalid=None):
    """
    flatten synset values in dictionary using params
    """
    d = {}
    for k,v in syn_dict.iteritems():
        if v:
            d[k] = synsetStr(v)
        elif not skip_invalid:
            d[k] = replace_invalid
    return d

In [73]:
## CORE FUNCTIONS FOR BUILDING SIMILARITY MATRIX

def posToSingle(pos):
    """
    Keep up with which pos values are implemented.
    """
    if pos == NOUN:
        return "n"
    elif pos == ADJ:
        return "a"
    return None # essentially, else clause


def cachedSynsetOrBuild(idx, syns, p, id_lookup):
    """
    Build Synset for given `idx`, using the `id_lookup`.
    Cache results so each Synset is built at most once, i.e. O(n) Synset constructions.
    
    --- Input ---
    idx: id to build and cache
    syns: existing dictionary of synsets, with k: id, v: Synset or None
    p: String pos value in the form needed for Synset generation, see `posToSingle`
    id_lookup: dictionary for noun / adj to build n x n matrix of similarity.
    
    --- Return ---
    Synset or None
    """
    if idx in syns:
        return syns[idx] 
        
    # focus on `.01` only
    try:                      
        syn = Synset("{}.{}.01".format(id_lookup[idx],p))
        syns[idx] = syn
        return syn
    except Exception:
        syns[idx] = None
        return None

def similarityMatrix(id2word, pos, take_n=None):
    """
    ##############################################################
    Build matrix of synsets for given id2word dictionary.    
    Optionally, only build a similarity matrix for the first n values.
    
    --- Input ---    
    id2word: dictionary for noun / adj to build n x n matrix of similarity.
    pos: WordNet part of speech, `NOUN` or `ADJ` imported based on needs
    take_n: if set, only use the first n values (for testing), default=None
    
    --- Return ---
    return a tuple, t where
    t[0]: n x n matrix with raw similarity score or zero
    t[1]: dictionary of synsets with k: id, v: Synset or None
    ##############################################################    
    """    
    syns = {} # synset cache, so each Synset is built only once
    p = posToSingle(pos)
    
    # determine n
    n = len(id2word)
    if take_n:
        n = take_n
    
    # n x n matrix, initialized with zeros 
    matrix = np.zeros((n,n))
    
    # populate
    ns = range(n)
    for i in ns:  
        isyn = cachedSynsetOrBuild(i,syns,p,id2word)       
        for j in ns:
            # find j in synset
            jsyn = None
            if isyn:
                jsyn = cachedSynsetOrBuild(j,syns,p,id2word) # no reason unless isyn is ok
        
            # update matrix with path_similarity between i and j words
            if isyn and jsyn:            
                ps = isyn.path_similarity(jsyn)            
                if ps:
                    matrix[i][j] = ps
            
    return matrix, syns
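
The double loop above scores every ordered pair, so the work grows with n squared and each pair is scored twice. If the similarity measure is symmetric (an assumption that should be checked for the WordNet measure in use), roughly half of the calls can be skipped by filling only the upper triangle and mirroring. The sketch below shows the idea with a hypothetical stand-in score function rather than `path_similarity`.

```python
def symmetric_scores(items, score):
    """Fill an n x n list-of-lists by scoring each unordered pair once."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    calls = 0
    for i in range(n):
        for j in range(i, n):          # upper triangle, diagonal included
            value = score(items[i], items[j])
            calls += 1
            if value:
                matrix[i][j] = value
                matrix[j][i] = value   # mirror into the lower triangle
    return matrix, calls

def toy_score(a, b):
    # hypothetical stand-in for a similarity call: 1.0 when first letters match
    return 1.0 if a[0] == b[0] else 0.0

matrix, calls = symmetric_scores(['red', 'ruby', 'blue'], toy_score)
```

For three items this makes 6 score calls instead of 9.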

In [74]:
## FUNCTIONS FOR EVALUATING SIMILARITY MATRIX RESULTS

def printSimilarityPairs(matrix, show_n=None, id_lookup=None, sim_threshold=SIM_THRESHOLD): 
    """
    print non zero similarities, ignoring diagonals.
    Optionally, show only first n non zeros then return.
    Optionally, lookup ids with words.
    Optionally, only evaluate values at/above a threshold.
    """
    ns = range(len(matrix))      
    c = 0
    for i in ns:
        for j in ns:
            v = matrix[i][j] 
            
            # handle sim_threshold
            met_threshold = True
            if sim_threshold and v < sim_threshold:
                met_threshold = False
            elif not v:
                met_threshold = False
                    
            if (i != j) and met_threshold:                
                if not show_n or c < show_n:
                    c += 1
                    s_i = i
                    s_j = j
                    if id_lookup:
                        s_i = id_lookup[i]
                        s_j = id_lookup[j]
                    print "{},{} --> {}".format(s_i,s_j,v)
                elif show_n:
                    return
                
def countSimilarityPairs(matrix, sim_threshold=SIM_THRESHOLD):
    """
    count non zero similarities, ignoring diagonals.
    Optionally, only evaluate values at/above a threshold.    
    """
    c = 0
    ns = range(len(matrix))         
    for i in ns:
        for j in ns:
            v = matrix[i][j]
            
            # handle sim_threshold
            met_threshold = True
            if sim_threshold and v < sim_threshold:
                met_threshold = False
            elif not v:
                met_threshold = False
            
            if (i != j) and met_threshold:                
                c += 1                    
    return c
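
Both helpers walk every `(i, j)` and `(j, i)` cell, so with a symmetric matrix each similar pair is counted twice. That is consistent with the counts reported in this notebook: 334 adjective pairs here against 167 adjective hypernym entries later, where keys are ordered tuples. A small sketch on a toy matrix, using a hypothetical helper `pairs_at_threshold` that lists each unordered pair once:

```python
def pairs_at_threshold(matrix, threshold=1.0):
    """List (i, j) with i < j whose score is non-zero and at/above the threshold."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] and matrix[i][j] >= threshold]

# hypothetical symmetric 3 x 3 similarity matrix
toy = [[1.0, 1.0, 0.2],
       [1.0, 1.0, 0.0],
       [0.2, 0.0, 1.0]]

unordered = pairs_at_threshold(toy)

# the same count taken over ordered pairs, as the helpers above do
ordered_count = sum(1 for i in range(3) for j in range(3)
                    if i != j and toy[i][j] >= 1.0)
```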

In [75]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 24 Nov 2015 04:17:05


In [76]:
%%time
# build adj similarity matrix
asimatrix, asyns = similarityMatrix(adjid2word, ADJ)

CPU times: user 27.4 s, sys: 0 ns, total: 27.4 s
Wall time: 27.4 s


In [77]:
# Count non-zero similarities for adjectives at/above SIM_THRESHOLD, ignoring diagonal
countSimilarityPairs(asimatrix)

334

In [78]:
# Check adj similarity results, are they any good?
printSimilarityPairs(asimatrix, show_n=10, id_lookup=adjid2word)

crimson,ruby --> 1.0
crimson,cherry --> 1.0
crimson,scarlet --> 1.0
crimson,red --> 1.0
magic,magical --> 1.0
aflame,ablaze --> 1.0
small,little --> 1.0
7th,seventh --> 1.0
blue,bluish --> 1.0
unsure,shy --> 1.0


In [79]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 24 Nov 2015 04:19:04


In [80]:
%%time
# build noun similarity matrix (can take 30+ minutes!!!)
nsimatrix, nsyns = similarityMatrix(nounid2word, NOUN)

CPU times: user 16min 36s, sys: 7.07 s, total: 16min 44s
Wall time: 16min 46s


In [81]:
# Count non-zero similarities for nouns at/above SIM_THRESHOLD, ignoring diagonal
countSimilarityPairs(nsimatrix)

586

In [82]:
# Check noun similarity results, are they any good?
printSimilarityPairs(nsimatrix, show_n = 10, id_lookup=nounid2word)

sleep,slumber --> 1.0
prick,motherfucker --> 1.0
prick,bastard --> 1.0
prick,asshole --> 1.0
chatter,yack --> 1.0
cavity,pit --> 1.0
topic,subject --> 1.0
tush,ass --> 1.0
tush,derriere --> 1.0
tush,fanny --> 1.0


## Save Similarity Matrix


In [83]:
# save asimatrix
with open("../../data/conditioned/asimatrix.p", "wb") as fp:
    pickle.dump(asimatrix, fp)

In [84]:
# flatten and save asyns
with open('../../data/conditioned/asyns.json', 'w') as fp:
    json.dump(flattenSynsetValues(asyns), fp)

In [85]:
# save nsimatrix
with open("../../data/conditioned/nsimatrix.p", "wb") as fp:
    pickle.dump(nsimatrix, fp)

In [86]:
# flatten and save nsyns
with open('../../data/conditioned/nsyns.json', 'w') as fp:
    json.dump(flattenSynsetValues(nsyns), fp)

##Hypernyms
Find the lowest common [hypernym](https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy) between each pair of similar words.

In [87]:
#Quick Test
Synset('dog.n.01').lowest_common_hypernyms(Synset('cat.n.01'))[0]

Synset('carnivore.n.01')

In [88]:
## CORE FUNCTIONS FOR BUILDING HYPERNYM

def makeOrderedTuple(idx1, idx2):
    if idx1 > idx2:
        return (idx2,idx1) 
    return (idx1,idx2) 

def cachedHypernymOrBuild(idx1, idx2, syn_lookup, hypes, hype_as_str=True):
    """
    Build Hypernym for the pair (`idx1`, `idx2`), using the `syn_lookup`.
    Cache results so each unordered pair is resolved only once.
    Will internally manage hypernym keys as ordered tuple.
    
    --- Input ---
    idx1, idx2: ids of the two words to build and cache a hypernym for
    syn_lookup: existing dictionary of synsets, with k: id, v: Synset or None    
    hypes: dictionary for hypernyms with k: ordered tuple, v: hypernym.
    hype_as_str: optional build map with string values, default = True
    --- Return ---
    a hypernym (String if `hype_as_str`, else Synset) or None
    """
    ituple = makeOrderedTuple(idx1,idx2)    
    if ituple in hypes: 
        return hypes[ituple] 
    
    try:    
        s1 = syn_lookup[ituple[0]]
        s2 = syn_lookup[ituple[1]]
        h = s1.lowest_common_hypernyms(s2)[0]
        
        if hype_as_str:
            h = synsetStr(h)
            
        hypes[ituple] = h
        return h
    except Exception:
        hypes[ituple] = None
        return None

def lowestCommonHypernyms(simatrix, syn_lookup, sim_threshold=SIM_THRESHOLD, hype_as_str=True):
    """
    Build a dictionary of hypernyms where found.
    Optionally, only evaluate values at/above a threshold.
    
    --- Input ---
    simatrix: n x n similarity matrix, see `similarityMatrix`
    syn_lookup: existing dictionary of synsets, with k: id, v: Synset or None    
    sim_threshold: optional threshold to use for establishing hypernyms, default = SIM_THRESHOLD
    hype_as_str: optional build map with string values, default = True
    
    --- Return ---
    dictionary for hypernyms with k: ordered tuple, v: hypernym (String if `hype_as_str`, else Synset).    
    """
    
    hypes = {} # dictionary to build up.
    
    n = len(simatrix)
    ns = range(n)          
    for i in ns:
        for j in ns:
            v = simatrix[i][j] 
            
            # handle sim_threshold
            met_threshold = True
            if sim_threshold and v < sim_threshold:
                met_threshold = False
            elif not v:
                met_threshold = False
                    
            if (i != j) and met_threshold:                                
                cachedHypernymOrBuild(i,j, syn_lookup, hypes, hype_as_str)
                
    return hypes
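
The ordered-tuple key is what lets `(i, j)` and `(j, i)` share one cache entry. The sketch below isolates that pattern with a hypothetical `compute` callback in place of the WordNet lookup; all names in it are assumptions for illustration.

```python
def ordered_pair(a, b):
    # same idea as makeOrderedTuple: smaller id first
    return (b, a) if a > b else (a, b)

def cached_pair_lookup(a, b, cache, compute):
    """Return the cached value for the unordered pair, computing it on first use."""
    key = ordered_pair(a, b)
    if key not in cache:
        cache[key] = compute(*key)
    return cache[key]

calls = []

def toy_compute(a, b):
    # hypothetical stand-in for the hypernym lookup; records each real call
    calls.append((a, b))
    return "h{}{}".format(a, b)

cache = {}
first = cached_pair_lookup(2, 1, cache, toy_compute)
second = cached_pair_lookup(1, 2, cache, toy_compute)
```

The second call is served from the cache, so `toy_compute` runs only once.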

In [89]:
## FUNCTIONS FOR EVALUATING HYPERNYMS

def countHypernyms(hypes, count_valid=True, count_invalid=True):
    """
    Count hypernym entries: valid (non-None), invalid (None), or both, depending on the flags.
    """
    c = 0
    for k,v in hypes.iteritems():
        if count_valid and v:
            c += 1
        elif count_invalid and not v:
            c += 1        
    return c

###Adjective Hypernyms

In [90]:
# find adj hypernyms, defaulting to only the string value
ahypes = lowestCommonHypernyms(asimatrix, asyns)

In [91]:
# check results
print "how many adj hypernyms? ", countHypernyms(ahypes)
print "how many valid adj hypernyms? ", countHypernyms(ahypes, count_valid=True, count_invalid=False)
print "how many invalid adj hypernyms? ", countHypernyms(ahypes, count_valid=False, count_invalid=True)
print "example key: {}, value: {}".format(ahypes.keys()[0],ahypes[ahypes.keys()[0]])

how many adj hypernyms?  167
how many valid adj hypernyms?  167
how many invalid adj hypernyms?  0
example key: (1581, 1687), value: grateful


In [92]:
ahypes

{(23, 570): u'red',
 (23, 1490): u'red',
 (23, 1630): u'red',
 (23, 1727): u'red',
 (40, 1467): u'charming',
 (43, 1534): u'ablaze',
 (49, 1960): u'small',
 (63, 81): u'seventh',
 (66, 2802): u'blue',
 (80, 1308): u'diffident',
 (100, 1041): u'icky',
 (100, 1505): u'icky',
 (100, 1947): u'icky',
 (134, 530): u'ignored',
 (174, 1812): u'enormous',
 (178, 930): u'casual',
 (191, 447): u'all_right',
 (191, 655): u'all_right',
 (193, 2240): u'cheery',
 (193, 2304): u'cheery',
 (207, 2757): u'nauseating',
 (211, 408): u'ferocious',
 (211, 564): u'ferocious',
 (237, 1159): u'awful',
 (265, 2225): u'boggy',
 (289, 2803): u'dizzy',
 (289, 3127): u'dizzy',
 (300, 1132): u'hairy',
 (309, 1660): u'ageless',
 (309, 2634): u'ageless',
 (309, 3215): u'ageless',
 (330, 2788): u'alone',
 (346, 1714): u'disgusting',
 (348, 2028): u'colossal',
 (350, 1786): u'adolescent',
 (354, 2645): u'extreme',
 (376, 2027): u'cockamamie',
 (376, 2812): u'cockamamie',
 (378, 481): u'bare',
 (378, 2377): u'bare',
 (40

###Noun Hypernyms

In [93]:
# find noun hypernyms
nhypes = lowestCommonHypernyms(nsimatrix, nsyns)

In [94]:
# check results
print "how many noun hypernyms? ", countHypernyms(nhypes)
print "how many valid noun hypernyms? ", countHypernyms(nhypes, count_valid=True, count_invalid=False)
print "how many invalid noun hypernyms? ", countHypernyms(nhypes, count_valid=False, count_invalid=True)
print "example key: {}, value: {}".format(nhypes.keys()[0],nhypes[nhypes.keys()[0]])

how many noun hypernyms?  293
how many valid noun hypernyms?  293
how many invalid noun hypernyms?  0
example key: (1172, 5051), value: bent


In [95]:
nhypes

{(10, 1664): u'sleep',
 (13, 2406): u'asshole',
 (13, 3216): u'asshole',
 (13, 3859): u'asshole',
 (26, 98): u'yak',
 (27, 532): u'pit',
 (32, 4653): u'subject',
 (33, 449): u'buttocks',
 (33, 711): u'buttocks',
 (33, 1995): u'buttocks',
 (75, 225): u'attempt',
 (75, 313): u'attempt',
 (78, 4399): u'phase',
 (94, 4646): u'battle',
 (113, 1044): u'kingdom',
 (131, 295): u'chap',
 (131, 1767): u'chap',
 (131, 5081): u'chap',
 (144, 2608): u'scream',
 (153, 4740): u'amour_propre',
 (177, 704): u'purpose',
 (177, 2491): u'purpose',
 (180, 1039): u'bartender',
 (182, 2506): u'idea',
 (187, 3998): u'career',
 (197, 1600): u'fellow',
 (214, 223): u'baby',
 (218, 4225): u'component',
 (219, 1091): u'grief',
 (219, 4797): u'grief',
 (225, 313): u'attempt',
 (229, 300): u'millimeter',
 (231, 4022): u'past',
 (237, 1964): u'boom',
 (237, 2495): u'boom',
 (237, 4053): u'boom',
 (244, 3828): u'laugh',
 (260, 4034): u'fall',
 (273, 3240): u'play',
 (282, 969): u'dad',
 (282, 1293): u'dad',
 (282, 36

##Save Hypernyms

In [96]:
# save adj hypernyms
with open("../../data/conditioned/ahypes.p", "wb") as fp:
    pickle.dump(ahypes, fp)

In [97]:
# save noun hypernyms
with open("../../data/conditioned/nhypes.p", "wb") as fp:
    pickle.dump(nhypes, fp)
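
Pickle is a reasonable fit for the hypernym dictionaries because their keys are tuples, which `json` refuses to serialize as object keys. The sketch below demonstrates that on two entries copied from the `ahypes` output, round-trips them through pickle in memory, and shows one possible string-key encoding in case a JSON copy is wanted. The `"i,j"` key format is an assumption of this sketch, not something the notebook uses.

```python
import io
import json
import pickle

# two entries copied from the ahypes output
hypes = {(23, 570): u'red', (49, 1960): u'small'}

# json cannot use tuples as object keys
try:
    json.dumps(hypes)
    json_ok = True
except TypeError:
    json_ok = False

# pickle keeps the tuple keys intact
buf = io.BytesIO()
pickle.dump(hypes, buf)
buf.seek(0)
restored = pickle.load(buf)

# hypothetical alternative: encode each key as "i,j" for a JSON copy
as_json = json.dumps(dict(("{},{}".format(i, j), v) for (i, j), v in hypes.items()))
```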