#Vocab Consolidation (DECADE 1990)
### Adapted concepts from [HW1](https://github.com/cs109-students/michaeljohns-2015hw/blob/hw1/hw1.ipynb) and [HW5 Part1](https://github.com/cs109-students/michaeljohns-2015hw/blob/hw5/hw5part1.ipynb)

**Run this notebook locally by issuing `vagrant up` from the project root, then open the notebook at "http://localhost:4545". You may also need to issue `vagrant provision` to update any required resources.**

The following artifacts will be established by manipulating the output of the processing pipeline for harvesting data, file [use-this-master-lyricsdf-extracted.csv](../../data/conditioned/use-this-master-lyricsdf-extracted.csv):
* vocabs for noun and adj
* n-gram for noun and adj
* synonyms for noun and adj
* hypernyms for noun and adj

Note: this notebook establishes the n-gram and vocab artifacts separately for the given decade; results are stored in the [decades](../../../data/conditioned/decades) directory.

Other notes:
* this notebook leverages and finalizes exploratory work in [Data-Exploration Notebook](Data-Exploration.ipynb).
* outputs are anticipated to be combined in follow-on work for better latent factors, prediction, and recommendation processing (not reflected here)
* **IMPORTANT TO ISOLATE DECADE NOTEBOOKS INTO THEIR OWN FOLDER, e.g. decades/1970.**


##IMPORTANT: SET THE DECADE BELOW

In [1]:
## SET THE DECADE FOR PROCESS FILTERING
decade = 1990

In [2]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [3]:
## MLJ: Additional Extras
import time
import itertools
import json
import pickle

In [4]:
import os
# os.environ['PYSPARK_PYTHON'] = '/anaconda/bin/python'

In [5]:
import findspark
findspark.init()
print findspark.find()
# Depending on your setup you might have to change this line of code
#findspark makes sure I dont need the below on homebrew.
#os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.5.1/libexec/"
#the below actually broke my spark, so I removed it. 
#Depending on how you started the notebook, you might need it.
# os.environ['PYSPARK_SUBMIT_ARGS']="--master local pyspark --executor-memory 4g"

/home/vagrant/spark


In [6]:
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local[4]')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)

In [7]:
sc._conf.getAll()

[(u'spark.executor.memory', u'2g'),
 (u'spark.master', u'local[4]'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.driver.memory', u'8g'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.app.name', u'pyspark')]

In [8]:
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()

['2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 

In [9]:
sys.version

'2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

In [10]:
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

#Load Finalized Conditioned Data Into Pandas Dataframe

In [11]:
# load the lyrics from the approved "master" dataframe
lyrics_pd_df = pd.read_csv("../../../../data/conditioned/use-this-master-lyricsdf-extracted.csv")  

##NEW FOR DECADE ::: FILTER BY DECADE
Filter `lyrics_pd_df` by decade, then process the entire notebook, saving results within this directory.
**IMPORTANT TO ISOLATE DECADE NOTEBOOKS INTO THEIR OWN FOLDER, e.g. decades/1970.**

In [12]:
#FILTER BY DECADE
lyrics_pd_df = lyrics_pd_df[lyrics_pd_df['decade'] == decade]

In [13]:
lyrics_pd_df.shape

(1000, 11)

In [14]:
lyrics_pd_df.head()

| | index | position | year | title.href | title | artist | lyrics | decade | song_key | lyrics_url | lyrics_abstract |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 2000 | 1 | 1990 | https://en.wikipedia.org/wiki/Hold_On_(Wilson_... | Hold On | Wilson Phillips | I know this pain. Why do you lock yourself up ... | 1990 | 1990-1 | http://lyrics.wikia.com/Wilson_Phillips:Hold_On | I know this pain. Why do you lock yourself up ... |
| 2001 | 2001 | 2 | 1990 | https://en.wikipedia.org/wiki/It_Must_Have_Bee... | It Must Have Been Love | Roxette | Must have been love. But it's over now. Lay a ... | 1990 | 1990-2 | http://lyrics.wikia.com/Roxette:It_Must_Have_B... | Must have been love. But it's over now. Lay a ... |
| 2002 | 2002 | 3 | 1990 | https://en.wikipedia.org/wiki/Nothing_Compares... | Nothing Compares 2 U | Sinead O'Connor | It's been seven hours and fifteen days. Since ... | 1990 | 1990-3 | http://lyrics.wikia.com/Sin%C3%A9ad_O%27Connor... | It's been seven hours and fifteen days. Since ... |
| 2003 | 2003 | 4 | 1990 | https://en.wikipedia.org/wiki/Poison_(Bell_Biv... | Poison | Bell Biv DeVoe | Poison. [Ron:] Yeah Spyderman & Freeze in full... | 1990 | 1990-4 | http://lyrics.wikia.com/Bell_Biv_DeVoe:Poison | Poison. [Ron:] Yeah Spyderman & Freeze in full... |
| 2004 | 2004 | 5 | 1990 | https://en.wikipedia.org/wiki/Vogue_(Madonna_s... | Vogue | Madonna | We don't currently have a license for these ly... | 1990 | 1990-5 | http://lyrics.wikia.com/Madonna:Vogue | We don't currently have a license for these ly... |


##Manipulate With Spark

In [15]:
# convert from pandas to spark dataframe
lyricsdf = sqlsc.createDataFrame(lyrics_pd_df)

In [16]:
# view output
lyricsdf.show(3)

+-----+--------+----+--------------------+--------------------+---------------+--------------------+------+--------+--------------------+--------------------+
|index|position|year|          title.href|               title|         artist|              lyrics|decade|song_key|          lyrics_url|     lyrics_abstract|
+-----+--------+----+--------------------+--------------------+---------------+--------------------+------+--------+--------------------+--------------------+
| 2000|       1|1990|https://en.wikipe...|             Hold On|Wilson Phillips|I know this pain....|  1990|  1990-1|http://lyrics.wik...|I know this pain....|
| 2001|       2|1990|https://en.wikipe...|It Must Have Been...|        Roxette|Must have been lo...|  1990|  1990-2|http://lyrics.wik...|Must have been lo...|
| 2002|       3|1990|https://en.wikipe...|Nothing Compares 2 U|Sinead O'Connor|It's been seven h...|  1990|  1990-3|http://lyrics.wik...|It's been seven h...|
+-----+--------+----+--------------------+----

In [18]:
# cache the dataframe so it is only materialized once across the actions below
lyricsdf.cache()
print "How many songs do we have?", lyricsdf.count()

How many songs do we have? 1000


In [19]:
print "What is the schema?", lyricsdf.printSchema()

What is the schema? root
 |-- index: long (nullable = true)
 |-- position: long (nullable = true)
 |-- year: long (nullable = true)
 |-- title.href: string (nullable = true)
 |-- title: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- lyrics: string (nullable = true)
 |-- decade: long (nullable = true)
 |-- song_key: string (nullable = true)
 |-- lyrics_url: string (nullable = true)
 |-- lyrics_abstract: string (nullable = true)

None


##Sample Lyrics (or Not)

Optionally take a random sample of songs from each year instead of processing the full decade.

In [20]:
# whether or not to sample lyrics, and how many to sample per year
sample_lyrics = False
PER_YEAR_SAMPLES=10

In [21]:
# randomly sample `take` song keys from each year
def randomSubSampleLyrics(sparkdf,take=PER_YEAR_SAMPLES):    
    # generate spark pairs as a tuple
    br_pairs = sparkdf.map(lambda r: (r.year, r.song_key))
    
    # group by key for a list of song keys per year and collect
    br_grouped = br_pairs.groupByKey().mapValues(lambda x: list(x)).collect()
        
    #sample after collect
    br_sample = [np.random.choice(v, size=take, replace=False) for k,v in br_grouped]    
    
    #flatten into a list
    return list(itertools.chain.from_iterable(br_sample))
    
small_song_keys = randomSubSampleLyrics(lyricsdf)
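
The per-year sampling can also be sketched without Spark. The following is a minimal illustration, assuming plain `(year, song_key)` pairs; the helper name `sample_keys_per_year` and the toy data are hypothetical. Unlike the cell above, it caps the sample size at the group size, because `np.random.choice(..., replace=False)` raises a `ValueError` when asked for more items than a group holds.

```python
from collections import defaultdict
import itertools

import numpy as np


def sample_keys_per_year(pairs, take=10, seed=None):
    """Return up to `take` randomly chosen song keys for each year."""
    rng = np.random.RandomState(seed)
    by_year = defaultdict(list)
    for year, song_key in pairs:
        by_year[year].append(song_key)
    samples = [
        rng.choice(keys, size=min(take, len(keys)), replace=False)
        for keys in by_year.values()
    ]
    # flatten the per-year arrays into one list of plain strings
    return [str(k) for k in itertools.chain.from_iterable(samples)]


# hypothetical toy input: three songs in 1990, one in 1991
toy_pairs = [(1990, "1990-1"), (1990, "1990-2"), (1990, "1990-3"), (1991, "1991-1")]
toy_sample = sample_keys_per_year(toy_pairs, take=2, seed=0)
```

With `take=2`, the sketch returns two keys for 1990 and the single available key for 1991.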

In [22]:
if sample_lyrics:
    print "How many small_song_keys? ", len(small_song_keys)
    small_song_keys[:5]
else:
    print "No lyric sampling, full processing (change `sample_lyrics` value to `True` to sample)"

No lyric sampling, full processing (change `sample_lyrics` value to `True` to sample)


In [23]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 24 Nov 2015 05:41:52


In [24]:
%%time
# use the sampled subset if requested, otherwise the full dataframe
if sample_lyrics:
    ldf=lyricsdf[lyricsdf.song_key.isin(small_song_keys)]#creates new dataframe
else:
    ldf=lyricsdf

CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 13.8 µs


In [25]:
# cache results
ldf.cache()

DataFrame[index: bigint, position: bigint, year: bigint, title.href: string, title: string, artist: string, lyrics: string, decade: bigint, song_key: string, lyrics_url: string, lyrics_abstract: string]

In [26]:
print "How many lyrics are in ldf? ", ldf.count()

How many lyrics are in ldf?  1000


##NLP

In [27]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
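# note: the adjacent '' ends one string literal and starts the next,
# so the apostrophe itself is not part of this list
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')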

In [28]:
from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS

In [29]:
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")

In [30]:
print "Quick Test of parse..."
parse("The world is the craziest place. I am working hard.", tokenize=True, lemmata=True)

Quick Test of parse...


u'The/DT/B-NP/O/the world/NN/I-NP/O/world is/VBZ/B-VP/O/be the/DT/B-NP/O/the craziest/JJ/I-NP/O/craziest place/NN/I-NP/O/place ././O/O/.\nI/PRP/B-NP/O/i am/VBP/B-VP/O/be working/VBG/I-VP/O/work hard/RB/B-ADVP/O/hard ././O/O/.'

In [31]:
def get_parts(thetext):
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2
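
The last step of `get_parts` keeps a sentence only when it contributed at least one noun and at least one adjective. A minimal sketch of that step on already-extracted word lists follows; the function name `keep_paired_sentences` and the sample lists are hypothetical, and no tagging is performed here.

```python
def keep_paired_sentences(nouns, descriptives):
    """Keep sentence i only if both nouns[i] and descriptives[i] are non-empty."""
    kept = [(n, d) for n, d in zip(nouns, descriptives) if n and d]
    nouns2 = [n for n, _ in kept]
    descriptives2 = [d for _, d in kept]
    return nouns2, descriptives2


# hypothetical per-sentence word lists before filtering
raw_nouns = [["item", "food"], ["patio", "job"], ["lunch", "egg"]]
raw_adjs = [[], ["perfect"], ["good", "great"]]
paired = keep_paired_sentences(raw_nouns, raw_adjs)
```

The first toy sentence is dropped because it has nouns but no adjectives, so the two returned lists stay aligned sentence by sentence.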

In [32]:
print "Quick check of get_parts ..."
get_parts("Have had many other items and just love the food. The patio...job was and...perfect. Lunch is good, and the only egg is great")

Quick check of get_parts ...


([[u'patio', u'job'], [u'lunch', u'egg']], [[u'perfect'], [u'good', u'great']])

###Run Get Parts on Provided Data

In [33]:
# extract nouns and adjectives from each song's lyrics
lyric_parts = ldf.map(lambda r : get_parts(r.lyrics))

In [34]:
# view output
lyric_parts.take(2)

[([[u'way'],
   [u'pain'],
   [u'time'],
   [u'break', u'chain'],
   [u'ya', u'break', u'chain']],
  [[u'fair'], [u'comfortable'], [u'worth'], [u'free'], [u'break', u'free']]),
 ([[u'wake', u'air', u'silence'], [u'water'], [u'winter', u'day']],
  [[u'lonely'], [u'outside'], [u'hard']])]

In [35]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 24 Nov 2015 05:41:54


In [36]:
%%time
parseout=lyric_parts.collect()

CPU times: user 85.2 ms, sys: 18.9 ms, total: 104 ms
Wall time: 15.1 s


##Vocab
###Nouns

In [37]:
print "How many parseout entries? ", len(parseout)

How many parseout entries?  1000


In [38]:
# flatten parseout to create initial noun rdd
nounrdd=sc.parallelize([ele[0] for ele in parseout]).flatMap(lambda l: l)

In [39]:
# view output
nounrdd.take(1)

[[u'way']]

In [40]:
# cache results
nounrdd.cache()

PythonRDD[34] at RDD at PythonRDD.scala:43

In [41]:
# straight reduce for overall word counts
nwordsrdd = (nounrdd.flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
)
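
For a vocabulary of this size, the same counts can be cross-checked on the driver without Spark. A minimal sketch, assuming the sentences are already available as lists of words (the sample data is hypothetical):

```python
from collections import Counter
from itertools import chain

# hypothetical stand-in for the collected noun sentences
noun_sentences = [["way"], ["pain"], ["break", "chain"], ["ya", "break", "chain"]]

# equivalent of flatMap + map((word, 1)) + reduceByKey(add)
noun_counts = Counter(chain.from_iterable(noun_sentences))

# rough equivalent of takeOrdered(n, key=lambda x: -x[1]); tie order may differ
top_two = noun_counts.most_common(2)
```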

In [42]:
# view output
nwordsrdd.take(5)

[(u'jockin', 1), (u'mardi', 1), (u'liar', 1), (u'dance', 24), (u'rod', 2)]

In [43]:
# top n, based on values, sorted descending
nwordsrdd.takeOrdered(10, key = lambda x: -x[1])

[(u'love', 593),
 (u'baby', 400),
 (u'time', 331),
 (u'heart', 260),
 (u'girl', 237),
 (u'thing', 233),
 (u'day', 233),
 (u'man', 224),
 (u'way', 207),
 (u'life', 180)]

In [44]:
nwordsrdd.cache()

PythonRDD[41] at RDD at PythonRDD.scala:43

In [45]:
# assign each distinct noun an integer id
nounvocabtups = (nwordsrdd
             .map(lambda (x,y): x)
             .zipWithIndex()
)

In [46]:
# view output
nounvocabtups.take(3)

[(u'jockin', 0), (u'mardi', 1), (u'liar', 2)]

In [47]:
# cache results
nounvocabtups.cache()

PythonRDD[44] at RDD at PythonRDD.scala:43

In [48]:
# collect results
nounvocab=nounvocabtups.collectAsMap()
nounid2word=nounvocabtups.map(lambda (x,y): (y,x)).collectAsMap()
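
`zipWithIndex` assigns ids in whatever order the words appear in the RDD. If reproducible ids are wanted, one option is to sort the words first; that is an assumption of this Spark-free sketch, not what the notebook does, and the helper name `build_vocab` is hypothetical.

```python
def build_vocab(words):
    """Map each distinct word to an integer id, and back."""
    vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}
    id2word = {idx: word for word, idx in vocab.items()}
    return vocab, id2word


toy_vocab, toy_id2word = build_vocab(["love", "baby", "time", "love"])
```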

In [49]:
# since sampling may be used, avoiding more common usage, e.g. `nounvocab['dance']`
nounid2word[0], nounvocab.keys()[5], nounvocab[nounvocab.keys()[5]]

(u'jockin', u'appetite', 1494)

In [50]:
print "How big is the noun vocabulary? ", len(nounvocab.keys())

How big is the noun vocabulary?  2423


###Adjectives

In [51]:
# create initial adj rdd from parseout
adjrdd=sc.parallelize([ele[1] for ele in parseout])

In [52]:
# view output
adjrdd.take(3)

[[[u'fair'], [u'comfortable'], [u'worth'], [u'free'], [u'break', u'free']],
 [[u'lonely'], [u'outside'], [u'hard']],
 [[u'fancy'], [u'lonely'], [u'wrong'], [u'planted'], [u'willing']]]

In [53]:
# cache results
adjrdd.cache()

ParallelCollectionRDD[46] at parallelize at PythonRDD.scala:423

In [54]:
# straight reduce for overall word counts
awordsrdd = (adjrdd
             .flatMap(lambda l: l)
             .flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
)

In [55]:
# view output
awordsrdd.take(5)

[(u'suicidal', 1),
 (u'dynamic', 1),
 (u'intricate', 1),
 (u'shot', 1),
 (u'hate', 2)]

In [56]:
# top n, based on values, sorted descending
awordsrdd.takeOrdered(10, key = lambda x: -x[1])

[(u'little', 425),
 (u'good', 354),
 (u'real', 283),
 (u'true', 229),
 (u'new', 183),
 (u'bad', 164),
 (u'sweet', 163),
 (u'big', 125),
 (u'right', 123),
 (u'wrong', 121)]

In [57]:
# cache results
awordsrdd.cache()

PythonRDD[54] at RDD at PythonRDD.scala:43

In [58]:
# assign each distinct adjective an integer id
adjvocabtups = (awordsrdd
              .map(lambda (x,y): x)
              .zipWithIndex()
)

In [59]:
# view output
adjvocabtups.take(3)

[(u'suicidal', 0), (u'dynamic', 1), (u'intricate', 2)]

In [60]:
# cache results
adjvocabtups.cache()

PythonRDD[57] at RDD at PythonRDD.scala:43

In [61]:
# collect results
adjvocab=adjvocabtups.collectAsMap()
adjid2word=adjvocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [62]:
# since sampling may be used, avoiding more common usage, e.g. `adjvocab['exotic']`
adjid2word[0], adjvocab.keys()[5], adjvocab[adjvocab.keys()[5]]

(u'suicidal', u'saved', 11)

In [63]:
print "How big is the adjective vocabulary? ", len(adjvocab)

How big is the adjective vocabulary?  1471


##Document Corpus

In [64]:
##################################################################################################
# CITATION - Use of counter for reduce within each word list from:
# http://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python
##################################################################################################
from collections import Counter

# for each sentence, reduce into a list of tuples (k, v) where k=vocab index and v=count,
# each word list is sorted by occurrence
documents = nounrdd.map(lambda words: Counter([nounvocab[word] for word in words]).most_common())
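
Each document in the corpus is therefore a list of `(vocab id, count)` pairs for one sentence. A small sketch with a hypothetical three-word vocabulary and a hypothetical helper `to_document`:

```python
from collections import Counter

toy_vocab = {"break": 0, "chain": 1, "ya": 2}  # hypothetical ids


def to_document(words, vocab):
    """Convert one sentence (list of words) into (vocab id, count) pairs."""
    return Counter(vocab[word] for word in words).most_common()


doc = to_document(["ya", "break", "chain", "break"], toy_vocab)
```

The most frequent id comes first; the order among ids with equal counts is not something to rely on.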

In [65]:
# verify output
documents.take(1)

[[(1482, 1)]]

In [66]:
# gather spark results
corpus=documents.collect()

##Save Spark Conditioning

###NEW FOR DECADE ::: SAVE LOCAL TO NOTEBOOK DIR (THEN MOVE TO DATA)

In [67]:
# save noun n-gram
with open('noun-n-gram{}.json'.format(decade), 'w') as fp:
    json.dump(dict(nwordsrdd.collect()), fp)

In [68]:
# save adjective n-gram
with open('adj-n-gram{}.json'.format(decade), 'w') as fp:
    json.dump(dict(awordsrdd.collect()), fp)

In [69]:
# save noun vocab and id2word
with open('nounvocab{}.json'.format(decade), 'w') as fp:
    json.dump(nounvocab, fp)
    
with open('nounid2word{}.json'.format(decade), 'w') as fp:
    json.dump(nounid2word, fp)    

In [70]:
# save adj vocab and id2word
with open('adjvocab{}.json'.format(decade), 'w') as fp:
    json.dump(adjvocab, fp)
    
with open('adjid2word{}.json'.format(decade), 'w') as fp:
    json.dump(adjid2word, fp) 
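
One caveat when reading these files back: JSON object keys are always strings, so the integer ids in the `id2word` dictionaries come back as `'0'`, `'1'`, and so on. A minimal sketch of restoring them, using an in-memory round trip and a hypothetical two-entry dictionary:

```python
import json

toy_id2word = {0: "jockin", 1: "mardi"}  # hypothetical sample

# json.dumps converts the integer keys to strings
restored_raw = json.loads(json.dumps(toy_id2word))

# convert the keys back to integers after loading
restored = {int(k): v for k, v in restored_raw.items()}
```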

In [71]:
# save corpus
with open("corpus{}.p".format(decade), "wb") as fp:
    pickle.dump(corpus, fp)

##Synonyms

###Synonym Lookups
Focus on the WordNet Python package within [nltk](http://www.nltk.org) via [textblob](https://textblob.readthedocs.org/en/dev/).

The main idea is to look up all words in the noun and adj vocab dictionaries and, where possible, collapse them down to synonyms. The synonyms can also be used for common_support.
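
How the collapsing is done is left to follow-on processing. One possible approach, shown here only as a sketch, is to group words connected by synonym pairs and replace each word with one representative of its group. The pairs and the helper name `collapse_synonyms` are hypothetical; the pairs mimic the style of the similarity output printed later in this notebook.

```python
def collapse_synonyms(pairs):
    """Map every word in `pairs` to a representative of its synonym group."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # keep the alphabetically smaller root so results are deterministic
            keep, drop = sorted([root_a, root_b])
            parent[drop] = keep
    return {word: find(word) for word in parent}


# hypothetical synonym pairs
canonical = collapse_synonyms([("okay", "fine"), ("okay", "ok"), ("small", "little")])
```

Here `okay`, `ok`, and `fine` all map to `fine`, and `small` maps to `little`.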

In [72]:
from textblob.wordnet import Synset
from textblob.wordnet import NOUN
from textblob.wordnet import ADJ

SIM_THRESHOLD = 1.0 # Only act on values at/above threshold

In [73]:
## COMMON METHODS FOR SYNSETS
def synsetStr(syn):
    """
    attempt to parse the string from a Synset, e.g. Synset('dog.n.01') would return 'dog'
    return String or None
    """
    try:
        return syn.name().split('.')[0]
    except Exception:
        return None
    
def flattenSynsetValues(syn_dict, skip_invalid=True, replace_invalid=None):
    """
    flatten synset values in dictionary using params
    """
    d = {}
    for k,v in syn_dict.iteritems():
        if v:
            d[k] = synsetStr(v)
        elif not skip_invalid:
            d[k] = replace_invalid
    return d

In [74]:
## CORE FUNCTIONS FOR BUILDING SIMILARITY MATRIX

def posToSingle(pos):
    """
    Map a WordNet part-of-speech constant to its single-letter code.
    Only `NOUN` and `ADJ` are implemented; anything else returns None.
    """
    if pos == NOUN:
        return "n"
    elif pos == ADJ:
        return "a"
    return None # essentially, else clause


def cachedSynsetOrBuild(idx, syns, p, id_lookup):
    """
    Build Synset for given `idx`, using the `id_lookup`.
    Caching ensures each Synset is built at most once (O(n) builds), even though the matrix has n x n entries.
    
    --- Input ---
    idx: id to build and cache
    syns: existing dictionary of synsets, with k: id, v: Synset or None
    p: String pos value in the form needed for Synset generation, see `posToSingle`
    id_lookup: dictionary for noun / adj to build n x n matrix of similarity.
    
    --- Return ---
    Synset or None
    """
    if idx in syns:
        return syns[idx] 
        
    # focus on `.01` only
    try:                      
        syn = Synset("{}.{}.01".format(id_lookup[idx],p))
        syns[idx] = syn
        return syn
    except Exception:
        syns[idx] = None
        return None

def similarityMatrix(id2word, pos, take_n=None):
    """
    ##############################################################
    Build matrix of synsets for given id2word dictionary.    
    Optionally, only build a similarity matrix for the first n values.
    
    --- Input ---    
    id2word: dictionary for noun / adj to build n x n matrix of similarity.
    pos: WordNet part of speech, `NOUN` or `ADJ` imported based on needs
    take_n: if set, only use the first n ids (useful for testing), default=None
    
    --- Return ---
    return a tuple, t where
    t[0]: n x n matrix with raw similarity score or zero
    t[1]: dictionary of synsets with k: id, v: Synset or None
    ##############################################################    
    """    
    syns = {} # synset cache, so each id is built at most once
    p = posToSingle(pos)
    
    # determine n
    n = len(id2word)
    if take_n:
        n = take_n
    
    # n x n matrix, initialized with zeros 
    matrix = np.zeros((n,n))
    
    # populate
    ns = range(n)
    for i in ns:  
        isyn = cachedSynsetOrBuild(i,syns,p,id2word)       
        for j in ns:
            # find j in synset
            jsyn = None
            if isyn:
                jsyn = cachedSynsetOrBuild(j,syns,p,id2word) # no reason unless isyn is ok
        
            # update matrix with path_similarity between i and j words
            if isyn and jsyn:            
                ps = isyn.path_similarity(jsyn)            
                if ps:
                    matrix[i][j] = ps
            
    return matrix, syns
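
If the similarity measure is symmetric (an assumption of this sketch that should be verified for the measure in use), each unordered pair only needs to be scored once and mirrored, which roughly halves the work. The toy similarity function below is a hypothetical stand-in for a WordNet measure, and `symmetric_similarity_matrix` is not part of the notebook.

```python
import numpy as np


def symmetric_similarity_matrix(items, similarity):
    """Fill an n x n matrix by scoring each unordered pair once and mirroring it."""
    n = len(items)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            score = similarity(items[i], items[j])
            if score:  # treat None and 0 as "no similarity"
                matrix[i, j] = score
                matrix[j, i] = score
    return matrix


def toy_similarity(a, b):
    """Hypothetical stand-in for a WordNet measure: 1.0 when first letters match."""
    return 1.0 if a[0] == b[0] else None


toy_matrix = symmetric_similarity_matrix(["big", "bad", "new"], toy_similarity)
```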

In [75]:
## FUNCTIONS FOR EVALUATING SIMILARITY MATRIX RESULTS

def printSimilarityPairs(matrix, show_n=None, id_lookup=None, sim_threshold=SIM_THRESHOLD): 
    """
    print non zero similarities, ignoring diagonals.
    Optionally, show only first n non zeros then return.
    Optionally, lookup ids with words.
    Optionally, only evaluate values at/above a threshold.
    """
    ns = range(len(matrix))      
    c = 0
    for i in ns:
        for j in ns:
            v = matrix[i][j] 
            
            # handle sim_threshold
            met_threshold = True
            if sim_threshold and v < sim_threshold:
                met_threshold = False
            elif not v:
                met_threshold = False
                    
            if (i != j) and met_threshold:                
                if not show_n or c < show_n:
                    c += 1
                    s_i = i
                    s_j = j
                    if id_lookup:
                        s_i = id_lookup[i]
                        s_j = id_lookup[j]
                    print "{},{} --> {}".format(s_i,s_j,v)
                elif show_n:
                    return
                
def countSimilarityPairs(matrix, sim_threshold=SIM_THRESHOLD):
    """
    count non zero similarities, ignoring diagonals.
    Optionally, only evaluate values at/above a threshold.    
    """
    c = 0
    ns = range(len(matrix))         
    for i in ns:
        for j in ns:
            v = matrix[i][j]
            
            # handle sim_threshold
            met_threshold = True
            if sim_threshold and v < sim_threshold:
                met_threshold = False
            elif not v:
                met_threshold = False
            
            if (i != j) and met_threshold:                
                c += 1                    
    return c
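
For large matrices, the nested loops can be replaced with array operations. A minimal sketch, assuming the matrix is a square NumPy array; the function name `count_pairs_vectorized` and the toy matrix are hypothetical.

```python
import numpy as np


def count_pairs_vectorized(matrix, sim_threshold=1.0):
    """Count off-diagonal entries that are non-zero and at/above the threshold."""
    matrix = np.asarray(matrix)
    mask = matrix != 0
    if sim_threshold:
        mask &= matrix >= sim_threshold
    mask &= ~np.eye(len(matrix), dtype=bool)  # ignore the diagonal
    return int(np.count_nonzero(mask))


toy = np.array([[1.0, 1.0, 0.5],
                [1.0, 1.0, 0.0],
                [0.5, 0.0, 1.0]])
```

Calling `count_pairs_vectorized(toy)` counts only the off-diagonal `1.0` entries, while `sim_threshold=None` counts every non-zero off-diagonal entry.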

In [76]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 24 Nov 2015 05:42:14


In [77]:
%%time
# build adj similarity matrix
asimatrix, asyns = similarityMatrix(adjid2word, ADJ)

CPU times: user 7.31 s, sys: 27.3 ms, total: 7.34 s
Wall time: 7.34 s


In [78]:
# Count non-zero similarities for adjectives at/above SIM_THRESHOLD, ignoring diagonal
countSimilarityPairs(asimatrix)

100

In [79]:
# Check adj similarity results, are they any good?
printSimilarityPairs(asimatrix, show_n=10, id_lookup=adjid2word)

magic,magical --> 1.0
small,little --> 1.0
unsure,shy --> 1.0
okay,fine --> 1.0
okay,ok --> 1.0
large,big --> 1.0
eternal,everlasting --> 1.0
eternal,perpetual --> 1.0
fine,okay --> 1.0
fine,ok --> 1.0


In [80]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 24 Nov 2015 05:42:22


In [81]:
%%time
# build noun similarity matrix (can take 30+ minutes!!!)
nsimatrix, nsyns = similarityMatrix(nounid2word, NOUN)

CPU times: user 4min 27s, sys: 2.19 s, total: 4min 29s
Wall time: 4min 29s


In [82]:
# Count non-zero similarities for nouns at/above SIM_THRESHOLD, ignoring diagonal
countSimilarityPairs(nsimatrix)

244

In [83]:
# Check noun similarity results, are they any good?
printSimilarityPairs(nsimatrix, show_n = 10, id_lookup=nounid2word)

prick,motherfucker --> 1.0
prick,bastard --> 1.0
hate,hatred --> 1.0
tush,ass --> 1.0
tush,fanny --> 1.0
shriek,scream --> 1.0
seashore,coast --> 1.0
crap,shit --> 1.0
limousine,limo --> 1.0
babe,baby --> 1.0


## Save Similarity Matrix

###NEW FOR DECADE ::: SAVE LOCAL TO NOTEBOOK DIR (THEN MOVE TO DATA)

In [84]:
# save asimatrix
with open("asimatrix{}.p".format(decade), "wb") as fp:
    pickle.dump(asimatrix, fp)

In [85]:
# flatten and save asyns
with open('asyns{}.json'.format(decade), 'w') as fp:
    json.dump(flattenSynsetValues(asyns), fp)

In [86]:
# save nsimatrix
with open("nsimatrix{}.p".format(decade), "wb") as fp:
    pickle.dump(nsimatrix, fp)

In [87]:
# flatten and save nsyns
with open('nsyns{}.json'.format(decade), 'w') as fp:
    json.dump(flattenSynsetValues(nsyns), fp)

##Hypernyms
Find the lowest common [hypernym](https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy) between similar words, i.e. pairs at or above `SIM_THRESHOLD` in the similarity matrices.

In [88]:
#Quick Test
Synset('dog.n.01').lowest_common_hypernyms(Synset('cat.n.01'))[0]

Synset('carnivore.n.01')
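
To make the idea concrete without WordNet, the lowest common hypernym can be sketched on a toy taxonomy stored as child-to-parent links. The taxonomy and function names below are hypothetical and much simpler than WordNet, where a synset can have several hypernyms.

```python
# hypothetical single-parent taxonomy: child -> parent
PARENT = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}


def ancestors(word):
    """Return [word, parent, grandparent, ...] up to the root."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the first ancestor of `a` that is also an ancestor of `b`, or None."""
    b_ancestors = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in b_ancestors:
            return candidate
    return None
```

In this toy taxonomy, `lowest_common_hypernym("dog", "cat")` walks up from `dog` and stops at `carnivore`, the first node that is also above `cat`.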

In [89]:
## CORE FUNCTIONS FOR BUILDING HYPERNYM

def makeOrderedTuple(idx1, idx2):
    if idx1 > idx2:
        return (idx2,idx1) 
    return (idx1,idx2) 

def cachedHypernymOrBuild(idx1, idx2, syn_lookup, hypes, hype_as_str=True):
    """
    Build Hypernym for the pair (`idx1`, `idx2`), using the `syn_lookup`.
    Caching ensures each unordered pair is resolved at most once.
    Will internally manage hypernym keys as ordered tuple.
    
    --- Input ---
    idx1, idx2: ids of the two words to build and cache
    syn_lookup: existing dictionary of synsets, with k: id, v: Synset or None    
    hypes: dictionary for hypernyms with k: ordered tuple, v: hypernym.
    hype_as_str: optional build map with string values, default = True
    --- Return ---
    a hypernym (String if `hype_as_str`, otherwise Synset) or None
    """
    ituple = makeOrderedTuple(idx1,idx2)    
    if ituple in hypes: 
        return hypes[ituple] 
    
    try:    
        s1 = syn_lookup[ituple[0]]
        s2 = syn_lookup[ituple[1]]
        h = s1.lowest_common_hypernyms(s2)[0]
        
        if hype_as_str:
            h = synsetStr(h)
            
        hypes[ituple] = h
        return h
    except Exception:
        hypes[ituple] = None
        return None

def lowestCommonHypernyms(simatrix, syn_lookup, sim_threshold=SIM_THRESHOLD, hype_as_str=True):
    """
    Build a dictionary of hypernyms where found.
    Optionally, only evaluate values at/above a threshold.
    
    --- Input ---
    simatrix: n x n similarity matrix, see `similarityMatrix`
    syn_lookup: existing dictionary of synsets, with k: id, v: Synset or None    
    sim_threshold: optional threshold to use for establishing hypernyms, default = SIM_THRESHOLD
    hype_as_str: optional build map with string values, default = True
    
    --- Return ---
    dictionary for hypernyms with k: ordered tuple, v: hypernym (String or Synset) or None.    
    """
    
    hypes = {} # dictionary to build up.
    
    n = len(simatrix)
    ns = range(n)          
    for i in ns:
        for j in ns:
            v = simatrix[i][j] 
            
            # handle sim_threshold
            met_threshold = True
            if sim_threshold and v < sim_threshold:
                met_threshold = False
            elif not v:
                met_threshold = False
                    
            if (i != j) and met_threshold:                                
                cachedHypernymOrBuild(i,j, syn_lookup, hypes, hype_as_str)
                
    return hypes

In [90]:
## FUNCTIONS FOR EVALUATING HYPERNYMS

def countHypernyms(hypes, count_valid=True, count_invalid=True):
    """
    Count hypernyms: valid (non-None) entries, invalid (None) entries, or both, depending on the flags.
    """
    c = 0
    for k,v in hypes.iteritems():
        if count_valid and v:
            c += 1
        elif count_invalid and not v:
            c += 1        
    return c

###Adjective Hypernyms

In [91]:
# find adj hypernyms, defaulting to only the string value
ahypes = lowestCommonHypernyms(asimatrix, asyns)

In [92]:
# check results
print "how many adj hypernyms? ", countHypernyms(ahypes)
print "how many valid adj hypernyms? ", countHypernyms(ahypes, count_valid=True, count_invalid=False)
print "how many invalid adj hypernyms? ", countHypernyms(ahypes, count_valid=False, count_invalid=True)
print "example key: {}, value: {}".format(ahypes.keys()[0],ahypes[ahypes.keys()[0]])

how many adj hypernyms?  50
how many valid adj hypernyms?  50
how many invalid adj hypernyms?  0
example key: (200, 973), value: cheery
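
The count of 50 is consistent with the 100 adjective similarity pairs reported earlier: each match appears twice in the matrix, as (i, j) and (j, i), and the ordered-tuple key stores it once. A minimal sketch of that de-duplication, with a hypothetical snake_case copy of `makeOrderedTuple` and hypothetical sample pairs:

```python
def make_ordered_tuple(idx1, idx2):
    """Return the pair with the smaller id first."""
    return (idx2, idx1) if idx1 > idx2 else (idx1, idx2)


# hypothetical symmetric matches, listed in both directions
directed_pairs = [(16, 568), (568, 16), (21, 866), (866, 21)]
unique_pairs = {make_ordered_tuple(i, j) for i, j in directed_pairs}
```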


In [93]:
ahypes

{(16, 568): u'charming',
 (21, 866): u'small',
 (39, 432): u'diffident',
 (89, 189): u'all_right',
 (89, 288): u'all_right',
 (130, 652): u'large',
 (140, 1141): u'ageless',
 (140, 1274): u'ageless',
 (189, 288): u'all_right',
 (194, 949): u'cardinal',
 (200, 973): u'cheery',
 (200, 1004): u'cheery',
 (206, 213): u'bare',
 (218, 1226): u'cockamamie',
 (223, 732): u'religious',
 (296, 630): u'crisp',
 (321, 359): u'favorite',
 (331, 1273): u'bally',
 (342, 1338): u'bang-up',
 (370, 675): u'red',
 (370, 848): u'red',
 (382, 578): u'barbarous',
 (382, 1084): u'barbarous',
 (412, 742): u'brumous',
 (427, 1158): u'entire',
 (430, 705): u'grateful',
 (475, 593): u'wide',
 (521, 561): u'grey',
 (547, 611): u'blasted',
 (562, 1116): u'brassy',
 (578, 1084): u'barbarous',
 (598, 1165): u'apparent',
 (631, 1024): u'bitty',
 (631, 1376): u'bitty',
 (675, 848): u'red',
 (697, 1257): u'hurt',
 (763, 888): u'chunky',
 (812, 1214): u'chief',
 (839, 1454): u'incredible',
 (900, 1406): u'particular',
 

###Noun Hypernyms

In [94]:
# find noun hypernyms
nhypes = lowestCommonHypernyms(nsimatrix, nsyns)

In [95]:
# check results
print "how many noun hypernyms? ", countHypernyms(nhypes)
print "how many valid noun hypernyms? ", countHypernyms(nhypes, count_valid=True, count_invalid=False)
print "how many invalid noun hypernyms? ", countHypernyms(nhypes, count_valid=False, count_invalid=True)
print "example key: {}, value: {}".format(nhypes.keys()[0],nhypes[nhypes.keys()[0]])

how many noun hypernyms?  122
how many valid noun hypernyms?  122
how many invalid noun hypernyms?  0
example key: (670, 1979), value: criminal


In [96]:
nhypes

{(6, 826): u'asshole',
 (6, 1670): u'asshole',
 (9, 1771): u'hate',
 (15, 222): u'buttocks',
 (15, 1119): u'buttocks',
 (70, 918): u'scream',
 (85, 2090): u'seashore',
 (89, 1823): u'crap',
 (98, 706): u'limousine',
 (103, 110): u'baby',
 (106, 2013): u'grief',
 (108, 434): u'dad',
 (108, 467): u'dad',
 (108, 1616): u'dad',
 (108, 2108): u'dad',
 (108, 2196): u'dad',
 (133, 1317): u'sister',
 (135, 680): u'answer',
 (152, 242): u'topographic_point',
 (160, 1347): u'narrative',
 (184, 893): u'person',
 (184, 2170): u'person',
 (203, 1379): u'adieu',
 (209, 389): u'ace',
 (214, 2162): u'aroma',
 (222, 1119): u'buttocks',
 (239, 301): u'hood',
 (239, 1469): u'hood',
 (249, 791): u'chump',
 (253, 1651): u'loot',
 (262, 469): u'ma',
 (262, 1402): u'ma',
 (262, 1690): u'ma',
 (262, 1777): u'ma',
 (280, 1871): u'fall',
 (301, 1469): u'hood',
 (319, 1028): u'sunset',
 (323, 1480): u'shop',
 (328, 2186): u'battle',
 (351, 1836): u'barroom',
 (394, 1021): u'expression',
 (398, 671): u'speaker',


##Save Hypernyms

###NEW FOR DECADE ::: SAVE LOCAL TO NOTEBOOK DIR (THEN MOVE TO DATA)

In [97]:
# save adj hypernyms
with open("ahypes{}.p".format(decade), "wb") as fp:
    pickle.dump(ahypes, fp)

In [98]:
# save noun hypernyms
with open("nhypes{}.p".format(decade), "wb") as fp:
    pickle.dump(nhypes, fp)
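
The hypernym dictionaries are keyed by tuples, which JSON cannot represent as object keys, so pickle is a natural fit here. A minimal sketch of writing and reading such a file, using a temporary directory and a hypothetical two-entry dictionary:

```python
import os
import pickle
import tempfile

toy_hypes = {(16, 568): "charming", (21, 866): "small"}  # hypothetical sample

path = os.path.join(tempfile.mkdtemp(), "toy_hypes.p")
with open(path, "wb") as fp:
    pickle.dump(toy_hypes, fp)

# read the dictionary back; the tuple keys survive the round trip
with open(path, "rb") as fp:
    loaded_hypes = pickle.load(fp)
```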