#Vocab Consolidation
### Adapted concepts from [HW1](https://github.com/cs109-students/michaeljohns-2015hw/blob/hw1/hw1.ipynb) and [HW5 Part1](https://github.com/cs109-students/michaeljohns-2015hw/blob/hw5/hw5part1.ipynb)

**This notebook should be run locally by issuing `vagrant up` from the project root, then opening the notebook at "http://localhost:4545". You may also need to issue `vagrant provision` to update any required resources.**

The following artifacts are built from the output of the data-harvesting pipeline, the file [use-this-master-lyricsdf-extracted.csv](../../data/conditioned/use-this-master-lyricsdf-extracted.csv):
* vocabs for noun and adj
* n-gram for noun and adj
* synonyms for noun and adj
* hypernyms for noun and adj

Other notes:
* this notebook builds on and finalizes the exploratory work in the [Data-Exploration Notebook](Data-Exploration.ipynb).
* the outputs are expected to be combined in follow-on work for better latent factors, prediction, and recommendation processing (not reflected here).
* other notebooks with exactly the same contents as this one establish the n-gram and vocab per decade.



In [1]:
## SET THE DECADE FOR PROCESS FILTERING
## THIS WILL ALLOW SPECIAL PROCESSING
decade = None # for no decade filtering, i.e. corpus-wide
# decade = 1970
# decade = 1980
# decade = 1990
# decade = 2000
# decade = 2010

##Imports

In [2]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

In [3]:
## MLJ: Additional Extras
import os
import time
import itertools
import json
import pickle

##Handle Directory for Output

In [4]:
# adapted from https://justgagan.wordpress.com/2010/09/22/python-create-path-or-directories-if-not-exist/
def assureDirExists(path):
    d = os.path.dirname(path)
    if not os.path.exists(d):
        os.makedirs(d)

In [5]:
# create requisite directory for processing
root_out = ""
if not decade:
    root_out = "../../data/conditioned/corpus_vocabs/" #entire corpus
else:
    root_out = "../../data/conditioned/decades/"+str(decade)+"/" #single decade
    
assureDirExists(root_out)
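
One detail of `assureDirExists` is worth spelling out: it calls `os.path.dirname`, so the last path component is only created when the path ends with a separator (as both `root_out` values do). The following is a minimal sketch of that behavior in a temporary directory; the helper name `ensure_parent_dir` and the toy paths are assumptions for illustration, not part of the notebook.

```python
import os
import tempfile

def ensure_parent_dir(path):
    """Create the directory portion of `path` if missing (mirrors assureDirExists)."""
    d = os.path.dirname(path)
    if d and not os.path.exists(d):
        os.makedirs(d)

base = tempfile.mkdtemp()

# Trailing separator: dirname() returns the full directory, so it is created.
with_slash = os.path.join(base, "decades", "1970") + os.sep
ensure_parent_dir(with_slash)
created_with_slash = os.path.isdir(os.path.join(base, "decades", "1970"))

# No trailing separator: dirname() drops the last component,
# so only the parent directory is created.
without_slash = os.path.join(base, "corpus", "vocabs")
ensure_parent_dir(without_slash)
created_without_slash = os.path.isdir(os.path.join(base, "corpus", "vocabs"))
parent_created = os.path.isdir(os.path.join(base, "corpus"))
```

If an output path is ever written without the trailing `/`, the files would end up one level higher than intended.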

##Spark Setup

In [6]:
import os
# os.environ['PYSPARK_PYTHON'] = '/anaconda/bin/python'

In [7]:
import findspark
findspark.init()
print findspark.find()
# Depending on your setup you might have to change this line of code
# findspark makes sure I don't need the line below on Homebrew.
#os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.5.1/libexec/"
#the below actually broke my spark, so I removed it. 
#Depending on how you started the notebook, you might need it.
# os.environ['PYSPARK_SUBMIT_ARGS']="--master local pyspark --executor-memory 4g"

/home/vagrant/spark


In [8]:
import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local[4]')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)

In [9]:
sc._conf.getAll()

[(u'spark.executor.memory', u'2g'),
 (u'spark.master', u'local[4]'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.driver.memory', u'8g'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.app.name', u'pyspark')]

In [10]:
import sys
rdd = sc.parallelize(xrange(10),10)
rdd.map(lambda x: sys.version).collect()

['2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]',
 

In [11]:
sys.version

'2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

In [12]:
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

#Load Finalized Conditioned Data Into Pandas Dataframe

In [13]:
# load the lyrics from the approved "master" dataframe
lyrics_pd_df = pd.read_csv("../../data/conditioned/use-this-master-lyricsdf-extracted.csv")  

In [14]:
#FILTER BY DECADE IF SET
if decade:
    lyrics_pd_df = lyrics_pd_df[lyrics_pd_df['decade'] == decade]

In [15]:
lyrics_pd_df.shape

(4500, 11)

In [16]:
lyrics_pd_df.head()

Unnamed: 0,index,position,year,title.href,title,artist,lyrics,decade,song_key,lyrics_url,lyrics_abstract
0,0,1,1970,https://en.wikipedia.org/wiki/Bridge_over_Trou...,Bridge over Troubled Water,Simon and Garfunkel,When you're weary. Feeling small. When tears a...,1970,1970-1,http://lyrics.wikia.com/Simon_And_Garfunkel:Br...,When you're weary. Feeling small. When tears a...
1,1,2,1970,https://en.wikipedia.org/wiki/(They_Long_to_Be...,(They Long to Be) Close to You,The Carpenters,Why do birds suddenly appear. Everytime you ar...,1970,1970-2,http://lyrics.wikia.com/Carpenters:%28They_Lon...,Why do birds suddenly appear. Everytime you ar...
2,2,3,1970,https://en.wikipedia.org/wiki/American_Woman_(...,American Woman,The Guess Who,"Mmm, da da da. Mmm, mmm, da da da. Mmm, mmm, d...",1970,1970-3,http://lyrics.wikia.com/The_Guess_Who:American...,"Mmm, da da da. Mmm, mmm, da da da. Mmm, mmm, d..."
3,3,4,1970,https://en.wikipedia.org/wiki/Raindrops_Keep_F...,Raindrops Keep Fallin' on My Head,B.J. Thomas,Raindrops are falling on my head. And just lik...,1970,1970-4,http://lyrics.wikia.com/B.J._Thomas:Raindrops_...,Raindrops are falling on my head. And just lik...
4,4,5,1970,https://en.wikipedia.org/wiki/War_(Edwin_Starr...,War,Edwin Starr,"War, huh, yeah. What is it good for? Absolutel...",1970,1970-5,http://lyrics.wikia.com/Edwin_Starr:War,"War, huh, yeah. What is it good for? Absolutel..."


##Manipulate With Spark

In [17]:
# convert from pandas to spark dataframe
lyricsdf = sqlsc.createDataFrame(lyrics_pd_df)

In [18]:
# view output
lyricsdf.show(3)

+-----+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+--------------------+--------------------+
|index|position|year|          title.href|               title|             artist|              lyrics|decade|song_key|          lyrics_url|     lyrics_abstract|
+-----+--------+----+--------------------+--------------------+-------------------+--------------------+------+--------+--------------------+--------------------+
|    0|       1|1970|https://en.wikipe...|Bridge over Troub...|Simon and Garfunkel|When you're weary...|  1970|  1970-1|http://lyrics.wik...|When you're weary...|
|    1|       2|1970|https://en.wikipe...|(They Long to Be)...|     The Carpenters|Why do birds sudd...|  1970|  1970-2|http://lyrics.wik...|Why do birds sudd...|
|    2|       3|1970|https://en.wikipe...|      American Woman|      The Guess Who|Mmm, da da da. Mm...|  1970|  1970-3|http://lyrics.wik...|Mmm, da da da. Mm...|
+-----+--------+----+-

In [20]:
#We cache the data to make sure it is only read once from disk
lyricsdf.cache()
print "How many songs do we have?", lyricsdf.count()

How many songs do we have? 4500


In [21]:
print "What is the schema?"
lyricsdf.printSchema()  # printSchema() prints directly and returns None, so it is not passed to print

What is the schema?
root
 |-- index: long (nullable = true)
 |-- position: long (nullable = true)
 |-- year: long (nullable = true)
 |-- title.href: string (nullable = true)
 |-- title: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- lyrics: string (nullable = true)
 |-- decade: long (nullable = true)
 |-- song_key: string (nullable = true)
 |-- lyrics_url: string (nullable = true)
 |-- lyrics_abstract: string (nullable = true)


##Sample Lyrics (or Not)

Some initial sampling to take from each year.

In [22]:
# whether or not to sample lyrics, and how many to sample per year
sample_lyrics = False
PER_YEAR_SAMPLES=10

In [23]:
# build a flat list of randomly sampled song keys, `take` per year
def randomSubSampleLyrics(sparkdf,take=PER_YEAR_SAMPLES):    
    # generate (year, song_key) pairs
    br_pairs = sparkdf.map(lambda r: (r.year, r.song_key))
    
    # group by key for a list of song keys per year, then collect to the driver
    br_grouped = br_pairs.groupByKey().mapValues(lambda x: list(x)).collect()
        
    # sample after collect (without replacement, so each year needs at least `take` songs)
    br_sample = [np.random.choice(v, size=take, replace=False) for k,v in br_grouped]    
    
    # flatten into a single list
    return list(itertools.chain.from_iterable(br_sample))
    
small_song_keys = randomSubSampleLyrics(lyricsdf)

In [24]:
if sample_lyrics:
    print "How many small_song_keys? ", len(small_song_keys)
    small_song_keys[:5]
else:
    print "No lyric sampling, full processing (change `sample_lyrics` value to `True` to sample)"

No lyric sampling, full processing (change `sample_lyrics` value to `True` to sample)
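
The per-year sampling can also be sketched in plain Python, without Spark. This version assumes the input is a list of `(year, song_key)` pairs; the function name `sample_keys_per_year`, the `seed` parameter, and the toy rows are made up for illustration. Unlike `np.random.choice(..., replace=False)`, which raises an error when a group is smaller than the requested sample, it falls back to taking the whole group.

```python
import random
from collections import defaultdict

def sample_keys_per_year(rows, take=2, seed=0):
    """Pick up to `take` song keys per year, without replacement."""
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for year, song_key in rows:
        by_year[year].append(song_key)
    sampled = []
    for year in sorted(by_year):
        keys = by_year[year]
        # min() guards against years with fewer than `take` songs
        sampled.extend(rng.sample(keys, min(take, len(keys))))
    return sampled

rows = [(1970, "1970-1"), (1970, "1970-2"), (1970, "1970-3"), (1971, "1971-1")]
small_keys = sample_keys_per_year(rows, take=2)
```

The guard matters mostly for decade-filtered or otherwise incomplete inputs, where a year may hold fewer songs than `PER_YEAR_SAMPLES`.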


In [25]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 08 Dec 2015 03:53:26


In [26]:
%%time
# use the sampled subset if requested, otherwise the full dataframe
if sample_lyrics:
    ldf=lyricsdf[lyricsdf.song_key.isin(small_song_keys)]#creates new dataframe
else:
    ldf=lyricsdf

CPU times: user 11 µs, sys: 5 µs, total: 16 µs
Wall time: 22.2 µs


In [27]:
# cache results
ldf.cache()

DataFrame[index: bigint, position: bigint, year: bigint, title.href: string, title: string, artist: string, lyrics: string, decade: bigint, song_key: string, lyrics_url: string, lyrics_abstract: string]

In [28]:
print "How many lyrics are in ldf? ", ldf.count()

How many lyrics are in ldf?  4500


##NLP

In [29]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
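# NOTE: the adjacent '' below closes and reopens the string literal (implicit concatenation),
# so the apostrophe character itself is NOT part of this list.
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')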

In [30]:
from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS

In [31]:
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")
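
As a quick illustration of what these two patterns do: `get_parts` uses them to replace runs of periods and runs of hyphens with a space, presumably so that words joined by an ellipsis or a dash are tokenized separately. The sample sentence below is made up for the sketch.

```python
import re

regex1 = re.compile(r"\.{2,}")  # two or more consecutive periods
regex2 = re.compile(r"\-{2,}")  # two or more consecutive hyphens

text = "The patio...job was and...perfect -- really"
cleaned = regex2.sub(" ", regex1.sub(" ", text))
tokens = cleaned.split()
```

A single period or hyphen is left alone, so sentence boundaries and hyphenated words survive.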

In [32]:
print "Quick Test of parse..."
parse("The world is the craziest place. I am working hard.", tokenize=True, lemmata=True)

Quick Test of parse...


u'The/DT/B-NP/O/the world/NN/I-NP/O/world is/VBZ/B-VP/O/be the/DT/B-NP/O/the craziest/JJ/I-NP/O/craziest place/NN/I-NP/O/place ././O/O/.\nI/PRP/B-NP/O/i am/VBP/B-VP/O/be working/VBG/I-VP/O/work hard/RB/B-ADVP/O/hard ././O/O/.'

In [33]:
def get_parts(thetext):
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2

In [34]:
print "Quick check of get_parts ..."
get_parts("Have had many other items and just love the food. The patio...job was and...perfect. Lunch is good, and the only egg is great")

Quick check of get_parts ...


([[u'patio', u'job'], [u'lunch', u'egg']], [[u'perfect'], [u'good', u'great']])
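
The filtering rule inside `get_parts` can be sketched without the `pattern` tagger. The example below assumes the sentences are already tagged as `(tag, lemma)` pairs; the tags are assigned by hand, `STOPWORDS` is a tiny stand-in for sklearn's `ENGLISH_STOP_WORDS`, and the names `keep` and `split_parts` are hypothetical.

```python
STOPWORDS = {"the", "only", "many", "other"}  # tiny stand-in for ENGLISH_STOP_WORDS
PUNCT = set(".,;:!?()[]{}`\"@#$^&*+-|=~_")

def keep(lemma):
    """Same lemma filter as get_parts: no stopwords, no single characters,
    no leading or trailing punctuation."""
    return (len(lemma) > 1 and lemma not in STOPWORDS
            and lemma[0] not in PUNCT and lemma[-1] not in PUNCT)

def split_parts(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (tag, lemma) pairs."""
    nouns, adjs = [], []
    for sentence in tagged_sentences:
        n = [lem for tag, lem in sentence if tag in ("NN", "NNS") and keep(lem)]
        a = [lem for tag, lem in sentence if tag in ("JJ", "JJR", "JJS") and keep(lem)]
        if n and a:  # keep only sentences with at least one noun AND one adjective
            nouns.append(n)
            adjs.append(a)
    return nouns, adjs

tagged = [
    # adjectives are all stopwords, so the whole sentence is dropped
    [("JJ", "many"), ("JJ", "other"), ("NNS", "item"), ("NN", "food")],
    [("NN", "lunch"), ("JJ", "good"), ("NN", "egg"), ("JJ", "great")],
]
nouns, adjs = split_parts(tagged)
```

This also shows why the first sentence of the quick check contributes nothing: once its adjectives are filtered out as stopwords, the sentence has nouns but no adjectives and is discarded.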

###Run Get Parts on Provided Data

In [35]:
# extract nouns and adjectives from each lyric (lazy; nothing runs until an action is called)
lyric_parts = ldf.map(lambda r : get_parts(r.lyrics))

In [36]:
# view output
lyric_parts.take(2)

[([[u'time'],
   [u'bridge', u'water'],
   [u'bridge', u'water'],
   [u'bridge', u'water'],
   [u'bridge', u'water'],
   [u'bridge', u'water'],
   [u'bridge', u'water']],
  [[u'rough'],
   [u'troubled'],
   [u'troubled'],
   [u'troubled'],
   [u'troubled'],
   [u'troubled'],
   [u'troubled']]),
 ([[u'dream'], [u'starlight', u'eye'], [u'dream'], [u'starlight', u'eye']],
  [[u'true'], [u'blue'], [u'true'], [u'blue']])]

In [37]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 08 Dec 2015 03:53:27


In [38]:
%%time
parseout=lyric_parts.collect()

CPU times: user 112 ms, sys: 34.4 ms, total: 146 ms
Wall time: 1min 22s


##Vocab
###Nouns

In [39]:
print "How many parseout entries? ", len(parseout)

How many parseout entries?  4500


In [40]:
# flatten parseout to create initial noun rdd
nounrdd=sc.parallelize([ele[0] for ele in parseout]).flatMap(lambda l: l)

In [41]:
# view output
nounrdd.take(5)

[[u'time'],
 [u'bridge', u'water'],
 [u'bridge', u'water'],
 [u'bridge', u'water'],
 [u'bridge', u'water']]

In [42]:
# cache results
nounrdd.cache()

PythonRDD[34] at RDD at PythonRDD.scala:43

In [43]:
# straight reduce for overall word counts
nwordsrdd = (nounrdd.flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
)

In [44]:
# view output
nwordsrdd.take(5)

[(u'jockin', 1),
 (u'slope', 1),
 (u'girl(oh', 1),
 (u'dance', 216),
 (u'pigeon', 3)]

In [45]:
# top n, based on values, sorted descending
nwordsrdd.takeOrdered(10, key = lambda x: -x[1])

[(u'love', 2390),
 (u'baby', 1665),
 (u'girl', 1583),
 (u'time', 1544),
 (u'thing', 1097),
 (u'night', 1003),
 (u'man', 918),
 (u'way', 881),
 (u'day', 830),
 (u'heart', 802)]

In [46]:
nwordsrdd.cache()

PythonRDD[41] at RDD at PythonRDD.scala:43

In [47]:
# assign each unique noun an integer id, giving (word, id) tuples
nounvocabtups = (nwordsrdd
             .map(lambda (x,y): x)
             .zipWithIndex()
)

In [48]:
# view output
nounvocabtups.take(3)

[(u'jockin', 0), (u'slope', 1), (u'girl(oh', 2)]

In [49]:
# cache results
nounvocabtups.cache()

PythonRDD[44] at RDD at PythonRDD.scala:43

In [50]:
# collect results
nounvocab=nounvocabtups.collectAsMap()
nounid2word=nounvocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [51]:
# since sampling may be used, avoiding more common usage, e.g. `nounvocab['dance']`
nounid2word[0], nounvocab.keys()[5], nounvocab[nounvocab.keys()[5]]

(u'jockin', u'catch', 728)

In [52]:
print "How big is the noun vocabulary? ", len(nounvocab.keys())

How big is the noun vocabulary?  5144
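
For readers without a Spark context, the vocabulary construction above can be sketched in plain Python: `enumerate` stands in for `zipWithIndex`, and the two dictionaries are inverses of each other. The toy counts are assumptions, and the words are sorted here only to make the ids deterministic; in the notebook the ids follow whatever order the RDD happens to have.

```python
counts = {"love": 3, "baby": 2, "time": 1}  # word -> frequency (toy data)

# enumerate() plays the role of zipWithIndex(): each unique word gets an integer id
vocab = {word: idx for idx, word in enumerate(sorted(counts))}

# inverse mapping, id -> word
id2word = {idx: word for word, idx in vocab.items()}
```

Because the Spark ids depend on partition order, the saved `vocab` and `id2word` files should always be used as a pair from the same run.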


###Adjectives

In [53]:
# create initial adj rdd from parseout
adjrdd=sc.parallelize([ele[1] for ele in parseout])

In [54]:
# view output
adjrdd.take(3)

[[[u'rough'],
  [u'troubled'],
  [u'troubled'],
  [u'troubled'],
  [u'troubled'],
  [u'troubled'],
  [u'troubled']],
 [[u'true'], [u'blue'], [u'true'], [u'blue']],
 [[u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'important'],
  [u'old'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'coloured'],
  [u'american'],
  [u'american'],
  [u'american'],
  [u'coloured'],
  [u'american'],
  [u'leave'],
  [u'american'],
  [u'american']]]

In [55]:
# cache results
adjrdd.cache()

ParallelCollectionRDD[46] at parallelize at PythonRDD.scala:423

In [56]:
# straight reduce for overall word counts
awordsrdd = (adjrdd
             .flatMap(lambda l: l)
             .flatMap(lambda word: word)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
)

In [57]:
# view output
awordsrdd.take(5)

[(u'suicidal', 2),
 (u'hooked', 21),
 (u'resist', 1),
 (u'dynamic', 3),
 (u'cocky', 2)]

In [58]:
# top n, based on values, sorted descending
awordsrdd.takeOrdered(10, key = lambda x: -x[1])

[(u'little', 1838),
 (u'good', 1727),
 (u'real', 946),
 (u'bad', 770),
 (u'new', 764),
 (u'big', 678),
 (u'true', 649),
 (u'sweet', 635),
 (u'ooh', 607),
 (u'long', 579)]

In [59]:
# cache results
awordsrdd.cache()

PythonRDD[54] at RDD at PythonRDD.scala:43

In [60]:
# assign each unique adjective an integer id, giving (word, id) tuples
adjvocabtups = (awordsrdd
              .map(lambda (x,y): x)
              .zipWithIndex()
)

In [61]:
# view output
adjvocabtups.take(3)

[(u'suicidal', 0), (u'hooked', 1), (u'resist', 2)]

In [62]:
# cache results
adjvocabtups.cache()

PythonRDD[57] at RDD at PythonRDD.scala:43

In [63]:
# collect results
adjvocab=adjvocabtups.collectAsMap()
adjid2word=adjvocabtups.map(lambda (x,y): (y,x)).collectAsMap()

In [64]:
# since sampling may be used, avoiding more common usage, e.g. `adjvocab['exotic']`
adjid2word[0], adjvocab.keys()[5], adjvocab[adjvocab.keys()[5]]

(u'suicidal', u'suspenseful', 1696)

In [65]:
print "How big is the adjective vocabulary? ", len(adjvocab)

How big is the adjective vocabulary?  3379


##Document Corpus

In [66]:
##################################################################################################
# CITATION - Use of counter for reduce within each word list from:
# http://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python
##################################################################################################
from collections import Counter

# for each sentence, reduce into a list of (k, v) tuples where k=vocab index and v=count;
# each list is sorted by occurrence count, descending
documents = nounrdd.map(lambda words: Counter([nounvocab[word] for word in words]).most_common())

In [67]:
# verify output
documents.take(1)

[[(5139, 1)]]

In [68]:
# gather spark results
corpus=documents.collect()
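
A small, Spark-free sketch of the same transformation may help when reading `corpus.p` later. It assumes a toy vocabulary and two toy sentences; note that each "document" here is one sentence's noun list, as in `nounrdd`, not a whole song.

```python
from collections import Counter

vocab = {"bridge": 0, "water": 1, "time": 2}
sentences = [["time"], ["bridge", "water", "bridge"]]

# each sentence becomes a list of (vocab id, count) pairs, most frequent first
documents = [Counter(vocab[w] for w in words).most_common() for words in sentences]
```

This is the familiar sparse bag-of-words layout, with one list of `(id, count)` pairs per document.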

##Save Spark Conditioning

###Part of Speech Nouns / Adjectives (Original Lyrics Array)

In [69]:
ncollect = sc.parallelize([ele[0] for ele in parseout]).collect()
acollect = sc.parallelize([ele[1] for ele in parseout]).collect()

In [70]:
print "How many noun rows? ", len(ncollect)
print "How many adjective rows? ", len(acollect)

How many noun rows?  4500
How many adjective rows?  4500


In [71]:
print ncollect[:3]

[[[u'time'], [u'bridge', u'water'], [u'bridge', u'water'], [u'bridge', u'water'], [u'bridge', u'water'], [u'bridge', u'water'], [u'bridge', u'water']], [[u'dream'], [u'starlight', u'eye'], [u'dream'], [u'starlight', u'eye']], [[u'woman', u'mess', u'mind'], [u'woman', u'mess', u'mind'], [u'woman', u'mess', u'mind'], [u'woman', u'mess', u'mind'], [u'woman', u'mess', u'mind'], [u'woman', u'mess', u'mind'], [u'woman', u'mess', u'mind'], [u'woman'], [u'woman', u'mama'], [u'thing'], [u'time', u'growin'], [u'woman'], [u'woman'], [u'woman', u'mama'], [u'light'], [u'woman'], [u'woman'], [u'woman'], [u'light'], [u'woman', u'mama'], [u'ya', u'woman'], [u'woman'], [u'shit']]]


In [72]:
print acollect[:3]

[[[u'rough'], [u'troubled'], [u'troubled'], [u'troubled'], [u'troubled'], [u'troubled'], [u'troubled']], [[u'true'], [u'blue'], [u'true'], [u'blue']], [[u'american'], [u'american'], [u'american'], [u'american'], [u'american'], [u'american'], [u'american'], [u'american'], [u'american'], [u'important'], [u'old'], [u'american'], [u'american'], [u'american'], [u'coloured'], [u'american'], [u'american'], [u'american'], [u'coloured'], [u'american'], [u'leave'], [u'american'], [u'american']]]


In [73]:
# save ncollect
with open(root_out+'noun_collect.json', 'w') as fp:
    json.dump(ncollect, fp)

In [74]:
# save acollect
with open(root_out+'adj_collect.json', 'w') as fp:
    json.dump(acollect, fp)

###Unique words per lyric

In [75]:
# Word Reduction per document
def buildWordReduction(collected):
    ngram_reduced = []
    for r in collected:
        v = []
        for rr in r:
            for i in rr:
                if not i in v:
                    v.append(i)
        ngram_reduced.append(v)
    return ngram_reduced

In [76]:
nreduction = buildWordReduction(ncollect)
areduction = buildWordReduction(acollect)

In [77]:
nreduction[2]

[u'woman',
 u'mess',
 u'mind',
 u'mama',
 u'thing',
 u'time',
 u'growin',
 u'light',
 u'ya',
 u'shit']
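
`buildWordReduction` tests membership against a list, which is linear in the number of words already seen. One possible variant, sketched here under the assumption that the input has the same nested shape (lyrics, then sentences, then words), keeps first-seen order but tracks seen words in a set. The name `unique_words_per_lyric` is hypothetical.

```python
def unique_words_per_lyric(collected):
    """collected: list of lyrics, each a list of sentences, each a list of words."""
    reduced = []
    for lyric in collected:
        seen = set()
        ordered = []
        for sentence in lyric:
            for word in sentence:
                if word not in seen:  # constant-time membership test
                    seen.add(word)
                    ordered.append(word)
        reduced.append(ordered)
    return reduced

lyrics = [[["woman", "mess", "mind"], ["woman", "mama"], ["thing"]]]
reduction = unique_words_per_lyric(lyrics)
```

For lyric-sized inputs the difference is small, so this is a readability and scaling note rather than a required change.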

In [78]:
# save noun word reduction
with open(root_out+'noun-word-reduction.json', 'w') as fp:
    json.dump(nreduction, fp)

In [79]:
# save adj word reduction
with open(root_out+'adj-word-reduction.json', 'w') as fp:
    json.dump(areduction, fp)

###N-Gram Specific
**We want the raw n-gram (total occurrences of each word), then the reduced n-gram (each word counted at most once per document).**

In [80]:
# save noun n-gram (raw)
with open(root_out+'noun-n-gram.json', 'w') as fp:
    json.dump(dict(nwordsrdd.collect()), fp)

In [81]:
# save adjective n-gram (raw)
with open(root_out+'adj-n-gram.json', 'w') as fp:
    json.dump(dict(awordsrdd.collect()), fp)

In [82]:
# build from nreduction and areduction to get actual counts.
def buildNgramReduced(reduction):
    return (sc.parallelize(reduction)
          .flatMap(lambda word: word)
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
       ).collect()

In [83]:
n_ngram_reduced = buildNgramReduced(nreduction)
a_ngram_reduced = buildNgramReduced(areduction)
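
The difference between the raw and the reduced counts can be seen on a two-song toy example. This is a plain-Python sketch with made-up data: the raw count is a total term frequency, while the reduced count is effectively a document frequency.

```python
from collections import Counter

lyrics = [
    [["bridge", "water"], ["bridge", "water"]],  # song 1
    [["water"], ["dream"]],                      # song 2
]

# raw: every occurrence counts
raw = Counter(w for lyric in lyrics for sentence in lyric for w in sentence)

# reduced: each word counts at most once per song
reduced = Counter(
    w
    for lyric in lyrics
    for w in set(x for sentence in lyric for x in sentence)
)
```

A chorus that repeats a word many times inflates `raw` but not `reduced`, which is why both are saved.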

In [84]:
# save reduced noun n-gram
with open(root_out+'noun_n-gram_reduced.json', 'w') as fp:
    json.dump(n_ngram_reduced, fp)

In [85]:
# save reduced adj n-gram
with open(root_out+'adj_n-gram_reduced.json', 'w') as fp:
    json.dump(a_ngram_reduced, fp)

###Vocab, id2word

In [86]:
# save noun vocab and id2word
with open(root_out+'nounvocab.json', 'w') as fp:
    json.dump(nounvocab, fp)
    
with open(root_out+'nounid2word.json', 'w') as fp:
    json.dump(nounid2word, fp)    

In [87]:
# save adj vocab and id2word
with open(root_out+'adjvocab.json', 'w') as fp:
    json.dump(adjvocab, fp)
    
with open(root_out+'adjid2word.json', 'w') as fp:
    json.dump(adjid2word, fp) 
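
One caveat when reading the `*id2word.json` files back: JSON object keys are always strings, so the integer ids are written as `"0"`, `"1"`, and so on. A minimal round-trip sketch with a toy dictionary shows the effect and one way to restore integer keys after loading.

```python
import json

id2word = {0: "jockin", 1: "slope"}

# round-trip through JSON text, as json.dump / json.load would via a file
restored = json.loads(json.dumps(id2word))

# the integer ids came back as strings; convert them explicitly
id2word_again = {int(k): v for k, v in restored.items()}
```

The `*vocab.json` files are unaffected on the key side, since their keys are already words.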

###Corpus

In [88]:
# save corpus
pickle.dump( corpus, open( root_out+'corpus.p', "wb" ) )

##Synonyms

###Synonym Lookups
Focus on the WordNet Python package within [nltk](http://www.nltk.org) via [textblob](https://textblob.readthedocs.org/en/dev/).
The main idea is to look up all words in the noun and adj vocab dictionaries and, where possible, collapse them down to synonyms. The synonyms can also be used for common_support.

In [89]:
from textblob.wordnet import Synset
from textblob.wordnet import NOUN
from textblob.wordnet import ADJ

SIM_THRESHOLD = 1.0 # Only act on values at/above threshold

In [90]:
## COMMON METHODS FOR SYNSETS
def synsetStr(syn):
    """
    attempt to parse the string from a Synset, e.g. Synset('dog.n.01') would return 'dog'
    return String or None
    """
    try:
        return syn.name().split('.')[0]
    except Exception:
        return None
    
def flattenSynsetValues(syn_dict, skip_invalid=True, replace_invalid=None):
    """
    flatten synset values in dictionary using params
    """
    d = {}
    for k,v in syn_dict.iteritems():
        if v:
            d[k] = synsetStr(v)
        elif not skip_invalid:
            d[k] = replace_invalid
    return d

In [91]:
## CORE FUNCTIONS FOR BUILDING SIMILARITY MATRIX

def posToSingle(pos):
    """
    Keep up with which pos values are implemented.
    """
    if pos == NOUN:
        return "n"
    elif pos == ADJ:
        return "a"
    return None # essentially, else clause


def cachedSynsetOrBuild(idx, syns, p, id_lookup):
    """
    Build Synset for given `idx`, using the `id_lookup`.
    Caching means each Synset is built at most once, i.e. O(n) Synset constructions overall.
    
    --- Input ---
    idx: id to build and cache
    syns: existing dictionary of synsets, with k: id, v: Synset or None
    p: String pos value in the form needed for Synset generation, see `posToSingle`
    id_lookup: dictionary for noun / adj to build n x n matrix of similarity.
    
    --- Return ---
    Synset or None
    """
    if idx in syns:
        return syns[idx] 
        
    # focus on `.01` only
    try:                      
        syn = Synset("{}.{}.01".format(id_lookup[idx],p))
        syns[idx] = syn
        return syn
    except Exception:
        syns[idx] = None
        return None

def similarityMatrix(id2word, pos, take_n=None):
    """
    ##############################################################
    Build matrix of synsets for given id2word dictionary.    
    Optionally, only build a similarity matrix for the first n values.
    
    --- Input ---    
    id2word: dictionary for noun / adj to build n x n matrix of similarity.
    pos: WordNet part of speech, `NOUN` or `ADJ` imported based on needs
    take_n: if set, only use the first n values (for testing), default=None
    
    --- Return ---
    return a tuple, t where
    t[0]: n x n matrix with raw similarity score or zero
    t[1]: dictionary of synsets with k: id, v: Synset or None
    ##############################################################    
    """    
    syns = {} # synset cache: each Synset is built at most once
    p = posToSingle(pos)
    
    # determine n
    n = len(id2word)
    if take_n:
        n = take_n
    
    # n x n matrix, initialized with zeros 
    matrix = np.zeros((n,n))
    
    # populate
    ns = range(n)
    for i in ns:  
        isyn = cachedSynsetOrBuild(i,syns,p,id2word)       
        for j in ns:
            # find j in synset
            jsyn = None
            if isyn:
                jsyn = cachedSynsetOrBuild(j,syns,p,id2word) # no reason unless isyn is ok
        
            # update matrix with path_similarity between i and j words
            if isyn and jsyn:            
                ps = isyn.path_similarity(jsyn)            
                if ps:
                    matrix[i][j] = ps
            
    return matrix, syns
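
The double loop above evaluates every ordered pair, which is where most of the noun run time goes. If the similarity measure is symmetric (an assumption that should be verified for the WordNet measure in use), roughly half of the calls can be skipped by filling only the upper triangle and mirroring it. The sketch below uses a stand-in similarity function and plain lists instead of WordNet and NumPy; `symmetric_matrix` and `toy_sim` are hypothetical names.

```python
def symmetric_matrix(n, sim):
    """Fill an n x n matrix by evaluating sim(i, j) only for j >= i and mirroring."""
    matrix = [[0.0] * n for _ in range(n)]
    calls = 0
    for i in range(n):
        for j in range(i, n):
            calls += 1
            score = sim(i, j) or 0.0  # treat None as "no similarity"
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix, calls

words = ["small", "little", "crimson"]

def toy_sim(i, j):
    """Stand-in similarity: 1.0 if both words are in one synonym group, else None."""
    return 1.0 if {words[i], words[j]} <= {"small", "little"} else None

matrix, calls = symmetric_matrix(len(words), toy_sim)
```

For three words this takes 6 evaluations instead of 9; for n words it is n(n+1)/2 instead of n².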

In [92]:
## FUNCTIONS FOR EVALUATING SIMILARITY MATRIX RESULTS

def getSimilarityPairs(matrix, print_n=None, id_lookup=None, sim_threshold=SIM_THRESHOLD): 
    """
    Collect (and, if `print_n` is set, print) non-zero similarity pairs, ignoring the diagonal.
    Optionally, show only the first n matches and then return.
    Optionally, replace ids with words via `id_lookup`.
    Optionally, only keep values at/above a threshold.
    """
    
    pairs = []
    
    ns = range(len(matrix))      
    c = 0
    for i in ns:
        for j in ns:
            v = matrix[i][j] 
            
            # handle sim_threshold
            met_threshold = True
            if sim_threshold and v < sim_threshold:
                met_threshold = False
            elif not v:
                met_threshold = False
                    
            if (i != j) and met_threshold:                
                if not print_n or c < print_n:
                    c += 1
                    s_i = i
                    s_j = j
                    if id_lookup:
                        s_i = id_lookup[i]
                        s_j = id_lookup[j]
                    if print_n:    
                        print "{},{} --> {}".format(s_i,s_j,v)
                    pairs.append((s_i,s_j))
                elif print_n:
                    return pairs
    return pairs
                
def countSimilarityPairs(matrix, sim_threshold=SIM_THRESHOLD):
    """
    count non zero similarities, ignoring diagonals.
    Optionally, only evaluate values at/above a threshold.    
    """
    c = 0
    ns = range(len(matrix))         
    for i in ns:
        for j in ns:
            v = matrix[i][j]
            
            # handle sim_threshold
            met_threshold = True
            if sim_threshold and v < sim_threshold:
                met_threshold = False
            elif not v:
                met_threshold = False
            
            if (i != j) and met_threshold:                
                c += 1                    
    return c

In [93]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 08 Dec 2015 03:54:57


In [94]:
%%time
# build adj similarity matrix
asimatrix, asyns = similarityMatrix(adjid2word, ADJ)

CPU times: user 26 s, sys: 29.7 ms, total: 26 s
Wall time: 26.1 s


In [95]:
# Count non-zero similarities for adjectives at/above SIM_THRESHOLD, ignoring diagonal
countSimilarityPairs(asimatrix)

334

In [96]:
# Check adj similarity results, are they any good?
getSimilarityPairs(asimatrix, print_n=10, id_lookup=adjid2word)

# build the actual (to be dumped) variables <-- NOTE: Hypernyms will be built from here!
asimpairs_words = getSimilarityPairs(asimatrix, id_lookup=adjid2word)
asimpairs_ids = getSimilarityPairs(asimatrix)

crimson,ruby --> 1.0
crimson,cherry --> 1.0
crimson,scarlet --> 1.0
crimson,red --> 1.0
magic,magical --> 1.0
aflame,ablaze --> 1.0
small,little --> 1.0
7th,seventh --> 1.0
blue,bluish --> 1.0
unsure,shy --> 1.0


In [97]:
len(asimpairs_words)

334

In [98]:
print "execution start --> {}".format(time.strftime('%a, %d %b %Y %H:%M:%S', time.localtime()))

execution start --> Tue, 08 Dec 2015 03:55:35


In [99]:
%%time
# build noun similarity matrix (can take 30+ minutes!!!)
nsimatrix, nsyns = similarityMatrix(nounid2word, NOUN)

CPU times: user 15min 43s, sys: 7.78 s, total: 15min 50s
Wall time: 15min 56s


In [100]:
# Count non-zero similarities for nouns at/above SIM_THRESHOLD, ignoring diagonal
countSimilarityPairs(nsimatrix)

586

In [101]:
# Check noun similarity results, are they any good?
getSimilarityPairs(nsimatrix, print_n = 10, id_lookup=nounid2word)

# build the actual (to be dumped) variables <-- NOTE: Hypernyms will be built from here!
nsimpairs_words = getSimilarityPairs(nsimatrix, id_lookup=nounid2word)
nsimpairs_ids = getSimilarityPairs(nsimatrix)

sleep,slumber --> 1.0
prick,motherfucker --> 1.0
prick,bastard --> 1.0
prick,asshole --> 1.0
chatter,yack --> 1.0
cavity,pit --> 1.0
topic,subject --> 1.0
tush,ass --> 1.0
tush,derriere --> 1.0
tush,fanny --> 1.0


In [102]:
len(nsimpairs_words)

586
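
The notebook stores the similarity pairs but does not show how to collapse the vocabulary with them. One possible approach, sketched here as an assumption rather than the project's method, is to treat the pairs as edges and map every word to a single representative per connected group using a small union-find. The function name `collapse_synonyms` is hypothetical; the sample pairs are taken from the adjective output above.

```python
def collapse_synonyms(pairs):
    """Map every word in `pairs` to one representative per connected group."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # keep the alphabetically smaller word as the representative
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra
    return {w: find(w) for w in list(parent)}

pairs = [("crimson", "ruby"), ("crimson", "scarlet"), ("small", "little")]
canonical = collapse_synonyms(pairs)
```

Picking the alphabetically smallest word is arbitrary; choosing the most frequent word from the n-gram counts would be another reasonable rule.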

###Save Synonym work
####Similarity Matrix and Synsets

In [103]:
# save asimatrix
pickle.dump( asimatrix, open(root_out+'asimatrix.p', "wb" ) )  

In [104]:
# flatten and save asyns
with open(root_out+'asyns.json', 'w') as fp:
    json.dump(flattenSynsetValues(asyns), fp)

In [105]:
# save nsimatrix
pickle.dump( nsimatrix, open(root_out+'nsimatrix.p', "wb" ) )


In [106]:
# flatten and save nsyns
with open(root_out+'nsyns.json', 'w') as fp:
    json.dump(flattenSynsetValues(nsyns), fp)

####Similarity Pairs

In [107]:
with open(root_out+'asimpairs_ids.json', 'w') as fp:
    json.dump(asimpairs_ids, fp)  

In [108]:
with open(root_out+'asimpairs_words.json', 'w') as fp:
    json.dump(asimpairs_words, fp)

In [109]:
with open(root_out+'nsimpairs_ids.json', 'w') as fp:
    json.dump(nsimpairs_ids, fp)

In [110]:
with open(root_out+'nsimpairs_words.json', 'w') as fp:
    json.dump(nsimpairs_words, fp)

##Hypernyms
Find the lowest common [hypernym](https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy) shared by each pair of similar words.

In [111]:
#Quick Test
Synset('dog.n.01').lowest_common_hypernyms(Synset('cat.n.01'))[0]

Synset('carnivore.n.01')
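
To make the idea concrete without WordNet, here is a minimal sketch on a hand-made taxonomy. The `PARENT` table and both helper functions are hypothetical and only illustrate the concept. WordNet itself allows a synset to have several hypernyms, which this single-parent toy ignores.

```python
# Toy taxonomy (hypothetical, not WordNet data): child -> parent
PARENT = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}

def ancestors(node):
    """Return the node followed by its ancestors, nearest first."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lowest_common_ancestor(a, b):
    """Return the deepest node shared by both ancestor chains, or None."""
    seen = set(ancestors(b))
    for node in ancestors(a):
        if node in seen:
            return node
    return None

lca_dog_cat = lowest_common_ancestor("dog", "cat")
```

Walking up from `dog` and stopping at the first node that is also above `cat` gives `carnivore`, mirroring the quick test above.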

In [112]:
# ## CORE FUNCTIONS FOR BUILDING HYPERNYM -- THIS USES SIMATRIX

# def makeOrderedTuple(idx1, idx2):
#     if idx1 > idx2:
#         return (idx2,idx1) 
#     return (idx1,idx2) 

# def cachedHypernymOrBuild(idx1, idx2, syn_lookup, hypes, hype_as_str=True):
#     """
#     Build Hypernym for given `idxtuple`, using the `syns_lookup`.
#     Facilitate O(n) computational complexity by caching results
#     Will internally manage hypernym keys as ordered tuple.
    
#     --- Input ---
#     idx: tuple of id to build and cache
#     syn_lookup: existing dictionary of synsets, with k: id, v: Synset or None    
#     hypes: dictionary for hypernyms with k: ordered tuple, v: hypernym.
#     hype_as_str: optional build map with string values, default = True
#     --- Return ---
#     a hypernym Synset or None
#     """
#     ituple = makeOrderedTuple(idx1,idx2)    
#     if ituple in hypes: 
#         return hypes[ituple] 
    
#     try:    
#         s1 = syn_lookup[ituple[0]]
#         s2 = syn_lookup[ituple[1]]
#         h = s1.lowest_common_hypernyms(s2)[0]
        
#         if hype_as_str:
#             h = synsetStr(h)
            
#         hypes[ituple] = h
#         return h
#     except Exception:
#         hypes[ituple] = None
#         return None

# def lowestCommonHypernyms(simatrix, syn_lookup, sim_threshold=SIM_THRESHOLD, hype_as_str=True):
#     """
#     Build a matrix with hypernym where found.
#     Optionally, only evaluate values at/above a threshold.
    
#     --- Input ---
#     simatrix: tuple of id to build and cache
#     syn_lookup: existing dictionary of synsets, with k: id, v: Synset or None    
#     sim_threshold: optional threshold to use for establishing hypernyms, default = SIM_THRESHOLD
#     hype_as_str: optional build map with string values, default = True
    
#     --- Return ---
#     dictionary for hypernyms with k: ordered tuple, v: Synset.    
#     """
    
#     hypes = {} # dictionary to build up.
    
#     n = len(simatrix)
#     ns = range(n)          
#     for i in ns:
#         for j in ns:
#             v = simatrix[i][j] 
            
#             # handle sim_threshold
#             met_threshold = True
#             if sim_threshold and v < sim_threshold:
#                 met_threshold = False
#             elif not v:
#                 met_threshold = False
                    
#             if (i != j) and met_threshold:                                
#                 cachedHypernymOrBuild(i,j, syn_lookup, hypes, hype_as_str)
                
#     return hypes


In [113]:
## CORE FUNCTIONS FOR BUILDING HYPERNYM -- THIS USES SIMPAIR

def makeOrderedTuple(idx1, idx2):
    if idx1 > idx2:
        return (idx2,idx1) 
    return (idx1,idx2) 

def lowestCommonHypernyms(simpair_words, syn_pos, hype_as_str=True):
    """
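    Build a dict with hypernym where found.
    Only the first sense (`.01`) of each word is looked up.
    
    --- Input ---
    simpair_words: iterable of similarity pairs whose first two items are the words
    syn_pos: WordNet part-of-speech tag used to build the synset name (e.g. NOUN or ADJ)
    hype_as_str: optional, store hypernyms as strings (via synsetStr) instead of Synsets, default = True
    
    --- Return ---
    dictionary for hypernyms with k: ordered tuple, v: Synset | String | None (None when the lookup fails).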
    """
    
    hypes = {} # dictionary to build up.
    
    for ts in simpair_words:          
        ituple = makeOrderedTuple(ts[0],ts[1])    
        if ituple not in hypes: 
            try:                   
                s1 = Synset("{}.{}.01".format(ituple[0],syn_pos))
                s2 = Synset("{}.{}.01".format(ituple[1],syn_pos))
                h = s1.lowest_common_hypernyms(s2)[0]
                
                if hype_as_str:
                    h = synsetStr(h)
                    
                hypes[ituple] = h
                
            except Exception:
                hypes[ituple] = None
                
    return hypes

In [114]:
## FUNCTIONS FOR EVALUATING HYPERNYMS

def countHypernyms(hypes, count_valid=True, count_invalid=True):
    """
    Count hypernym entries: valid (truthy) values, invalid (None) values, or both.
    """
    c = 0
    for k,v in hypes.iteritems():
        if count_valid and v:
            c += 1
        elif count_invalid and not v:
            c += 1        
    return c

###Adjective Hypernyms

In [115]:
# find adj hypernyms, defaulting to only the string value
ahypes = lowestCommonHypernyms(asimpairs_words, ADJ)

In [116]:
# check results
print "how many adj hypernyms? ", countHypernyms(ahypes)
print "how many valid adj hypernyms? ", countHypernyms(ahypes, count_valid=True, count_invalid=False)
print "how many invalid adj hypernyms? ", countHypernyms(ahypes, count_valid=False, count_invalid=True)
print "example key: {}, value: {}".format(ahypes.keys()[0],ahypes[ahypes.keys()[0]])

how many adj hypernyms?  167
how many valid adj hypernyms?  167
how many invalid adj hypernyms?  0
example key: (u'everyday', u'mundane'), value: everyday
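
The 334 adjective similarity pairs yield 167 hypernym entries, exactly half. One plausible explanation, assuming the pair list contains both orderings of each pair (as a symmetric similarity matrix would produce), is that `makeOrderedTuple` maps `(a, b)` and `(b, a)` to the same key. A small sketch with made-up pairs:

```python
def make_ordered_tuple(a, b):
    """Same idea as makeOrderedTuple: smaller item first."""
    return (b, a) if a > b else (a, b)

# Hypothetical pairs listing both directions of each match
pairs = [("aflame", "ablaze"), ("ablaze", "aflame"),
         ("small", "little"), ("little", "small")]

# Both directions collapse onto one ordered key per pair
unique_keys = {make_ordered_tuple(a, b) for a, b in pairs}
```

Four directed pairs collapse into two keys here, the same 2:1 ratio seen in the counts above.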


In [117]:
ahypes

{(u'18th', u'eighteenth'): u'eighteenth',
 (u'5th', u'fifth'): u'fifth',
 (u'6th', u'sixth'): u'sixth',
 (u'7th', u'seventh'): u'seventh',
 (u'8th', u'eighth'): u'eighth',
 (u'ablaze', u'aflame'): u'ablaze',
 (u'ageless', u'eternal'): u'ageless',
 (u'ageless', u'everlasting'): u'ageless',
 (u'ageless', u'perpetual'): u'ageless',
 (u'all-night', u'overnight'): u'nightlong',
 (u'amazing', u'astonishing'): u'amazing',
 (u'apparent', u'evident'): u'apparent',
 (u'apparent', u'plain'): u'apparent',
 (u'average', u'mean'): u'average',
 (u'bare', u'naked'): u'bare',
 (u'bare', u'nude'): u'bare',
 (u'barren', u'desolate'): u'bare',
 (u'big', u'large'): u'large',
 (u'bigger', u'larger'): u'bigger',
 (u'bitty', u'itty-bitty'): u'bitty',
 (u'bitty', u'wee'): u'bitty',
 (u'blamed', u'damned'): u'blasted',
 (u'blond', u'blonde'): u'blond',
 (u'blue', u'bluish'): u'blue',
 (u'broad', u'wide'): u'wide',
 (u'bronzed', u'tanned'): u'bronzed',
 (u'bushy', u'shaggy'): u'bushy',
 (u'calm', u'serene'): u'c

###Noun Hypernyms

In [118]:
# find noun hypernyms
nhypes = lowestCommonHypernyms(nsimpairs_words, NOUN)

In [119]:
# check results
print "how many noun hypernyms? ", countHypernyms(nhypes)
print "how many valid noun hypernyms? ", countHypernyms(nhypes, count_valid=True, count_invalid=False)
print "how many invalid noun hypernyms? ", countHypernyms(nhypes, count_valid=False, count_invalid=True)
print "example key: {}, value: {}".format(nhypes.keys()[0],nhypes[nhypes.keys()[0]])

how many noun hypernyms?  293
how many valid noun hypernyms?  293
how many invalid noun hypernyms?  0
example key: (u'material', u'stuff'), value: material


In [120]:
nhypes

{(u'adult', u'grownup'): u'adult',
 (u'affair', u'matter'): u'matter',
 (u'aim', u'intent'): u'purpose',
 (u'aim', u'intention'): u'purpose',
 (u'airplane', u'plane'): u'airplane',
 (u'anguish', u'torture'): u'anguish',
 (u'animal', u'beast'): u'animal',
 (u'animal', u'creature'): u'animal',
 (u'answer', u'reply'): u'answer',
 (u'arena', u'domain'): u'sphere',
 (u'ass', u'derriere'): u'buttocks',
 (u'ass', u'fanny'): u'buttocks',
 (u'ass', u'tush'): u'buttocks',
 (u'asshole', u'bastard'): u'asshole',
 (u'asshole', u'motherfucker'): u'asshole',
 (u'asshole', u'prick'): u'asshole',
 (u'athlete', u'jock'): u'athlete',
 (u'attempt', u'effort'): u'attempt',
 (u'attempt', u'try'): u'attempt',
 (u'automobile', u'car'): u'car',
 (u'autumn', u'fall'): u'fall',
 (u'babe', u'baby'): u'baby',
 (u'baggage', u'luggage'): u'baggage',
 (u'bait', u'come-on'): u'bait',
 (u'bandana', u'bandanna'): u'bandanna',
 (u'bang', u'smash'): u'knock',
 (u'bar', u'saloon'): u'barroom',
 (u'barkeeper', u'bartender')

##Save Hypernyms

In [121]:
# save adj hypernyms
with open(root_out+'ahypes.p', "wb") as fp:
    pickle.dump(ahypes, fp)

In [122]:
# save noun hypernyms
with open(root_out+'nhypes.p', "wb") as fp:
    pickle.dump(nhypes, fp)
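
The hypernym dictionaries are keyed by tuples, which `pickle` stores as-is. The `json` module does not accept tuple keys, so a JSON export has to re-key the data first. A minimal sketch with toy content in the same shape as `nhypes`:

```python
import json

hypes = {("ass", "tush"): "buttocks"}  # same shape as nhypes, toy content

# Tuple keys are rejected by the json encoder
try:
    json.dumps(hypes)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

# Re-keying by the hypernym string makes the data serializable
grouped = {"buttocks": ["ass", "tush"]}
encoded = json.dumps(grouped)
```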

In [123]:
# New: do some conversion for a json file

def saveHypesAsJson(hypes,json_name,root_out=root_out):
    h = {}  # k: hypernym, v: list of words sharing that hypernym
    
    for ts,v in hypes.iteritems():
        if v in h:
            s = h[v]
            if ts[0] not in s:
                s.append(ts[0])
            if ts[1] not in s:
                s.append(ts[1])
        else:
            h[v] = []
            h[v].append(ts[0])
            h[v].append(ts[1])
    
    # save h
    with open(root_out+ json_name + '.json', 'w') as fp:
        json.dump(h, fp)
        
    return h
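
The inversion step can be tried on its own, without writing a file. This sketch uses a hypothetical helper name, `group_by_hypernym`, and toy data. It is meant to show the same grouping idea, not to replace the function above.

```python
def group_by_hypernym(hypes):
    """Invert {(w1, w2): hypernym} into {hypernym: [words]} without duplicates."""
    grouped = {}
    for (w1, w2), hypernym in hypes.items():
        words = grouped.setdefault(hypernym, [])
        for w in (w1, w2):
            if w not in words:
                words.append(w)
    return grouped

toy = {
    ("animal", "beast"): "animal",
    ("animal", "creature"): "animal",
    ("aim", "intent"): "purpose",
}
grouped = group_by_hypernym(toy)
```

Pairs that share a hypernym end up in one list, so `animal` collects `animal`, `beast`, and `creature`.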

In [124]:
njhypes = saveHypesAsJson(nhypes,'noun_hype_syns_words')
njhypes

{u'ace': [u'superstar', u'whiz', u'wizard'],
 u'adieu': [u'cheerio', u'good-bye', u'goodbye'],
 u'adult': [u'adult', u'grownup'],
 u'ailment': [u'complaint', u'ill'],
 u'airplane': [u'airplane', u'plane'],
 u'amour_propre': [u'conceit', u'vanity'],
 u'anguish': [u'anguish', u'torture'],
 u'animal': [u'animal', u'creature', u'beast'],
 u'answer': [u'answer', u'reply'],
 u'aroma': [u'perfume', u'scent'],
 u'asshole': [u'motherfucker', u'prick', u'asshole', u'bastard'],
 u'athlete': [u'athlete', u'jock'],
 u'attempt': [u'attempt', u'effort', u'try'],
 u'baby': [u'babe', u'baby'],
 u'baggage': [u'baggage', u'luggage'],
 u'bait': [u'bait', u'come-on'],
 u'ball': [u'chunk', u'lump'],
 u'ballyhoo': [u'hoopla', u'hype'],
 u'bandanna': [u'bandana', u'bandanna'],
 u'barroom': [u'bar', u'saloon'],
 u'bartender': [u'barkeeper', u'bartender'],
 u'base': [u'pedestal', u'stand'],
 u'basement': [u'basement', u'cellar'],
 u'battle': [u'battle', u'fight'],
 u'bent': [u'hang', u'knack'],
 u'bit': [u'chip

In [125]:
ajhypes = saveHypesAsJson(ahypes,'adj_hype_syns_words')
ajhypes

{u'ablaze': [u'ablaze', u'aflame'],
 u'ace': [u'crack', u'super'],
 u'adolescent': [u'teen', u'teenage'],
 u'aged': [u'elderly', u'older'],
 u'ageless': [u'ageless', u'perpetual', u'everlasting', u'eternal'],
 u'all_right': [u'fine', u'okay', u'ok'],
 u'alone': [u'lone', u'lonely'],
 u'amazing': [u'amazing', u'astonishing'],
 u'amusing': [u'comical', u'funny'],
 u'apparent': [u'apparent', u'plain', u'evident'],
 u'aroused': [u'horny', u'randy'],
 u'aureate': [u'gilded', u'golden'],
 u'average': [u'average', u'mean'],
 u'awful': [u'frightening', u'terrible'],
 u'bally': [u'flaming', u'fucking'],
 u'bang-up': [u'groovy', u'smashing', u'peachy', u'nifty'],
 u'bantam': [u'petite', u'tiny'],
 u'barbarous': [u'cruel', u'savage', u'vicious'],
 u'bare': [u'bare', u'naked', u'nude', u'barren', u'desolate'],
 u'baronial': [u'noble', u'stately'],
 u'besotted': [u'fuddled', u'smashed'],
 u'bigger': [u'bigger', u'larger'],
 u'bigheaded': [u'snot-nosed', u'snotty'],
 u'bitty': [u'bitty', u'itty-bitt

###Compare Synonym and Hypernym Lists

In [126]:
def compareWordLists(alist,blist):
    same = []
    ina = []
    inb = []
    
    for a in alist:
        if a in blist:
            same.append(a)
        else:
            ina.append(a)
    
    for b in blist:
        if b not in same:
            inb.append(b)
    return sorted(same), sorted(ina), sorted(inb)
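
For comparison, a set-based variant of the same three-way split is sketched below. The name `compare_word_sets` is hypothetical. Unlike `compareWordLists`, it removes duplicates and requires hashable items, so its counts can differ when the inputs repeat words.

```python
def compare_word_sets(alist, blist):
    """Return (shared, only in a, only in b) as sorted lists of unique items."""
    a, b = set(alist), set(blist)
    return sorted(a & b), sorted(a - b), sorted(b - a)

same, only_a, only_b = compare_word_sets(
    ["animal", "beast", "car", "car"],
    ["animal", "purpose"],
)
```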

In [127]:
tncomp = compareWordLists(flattenSynsetValues(nsyns).values(),njhypes.values())
print "For FULL noun syn versus syn-hype words..."
print "\tHow many are same? ", len(tncomp[0])
print "\tHow many are only in syn? ", len(tncomp[1])
print "\tHow many are only in hype? ", len(tncomp[2])
print
tacomp = compareWordLists(flattenSynsetValues(asyns).values(),ajhypes.values())
print "For FULL adj syn versus syn-hype words..."
print "\tHow many are same? ", len(tacomp[0])
print "\tHow many are only in syn? ", len(tacomp[1])
print "\tHow many are only in hype? ", len(tacomp[2])

For FULL noun syn versus syn-hype words...
	How many are same?  0
	How many are only in syn?  3580
	How many are only in hype?  201

For FULL adj syn versus syn-hype words...
	How many are same?  0
	How many are only in syn?  1707
	How many are only in hype?  118
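
A zero overlap is what one would expect if the two arguments hold different kinds of items. `njhypes.values()` and `ajhypes.values()` are lists of word lists, and the "only in hype" counts (201 and 118) match the number of hypernym keys rather than the number of words. Assuming the synonym side holds plain strings (the return shape of `flattenSynsetValues` is not shown here), a string can never equal a list. A toy sketch of the effect:

```python
syn_words = ["animal", "beast", "creature"]             # flat strings (assumed shape)
hype_groups = [["animal", "beast"], ["aim", "intent"]]  # list of lists, like njhypes.values()

# Membership against the nested list compares strings with lists: never equal
overlap_nested = [w for w in syn_words if w in hype_groups]

# Flattening first compares strings with strings
flat = [w for group in hype_groups for w in group]
overlap_flat = [w for w in syn_words if w in flat]
```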


In [128]:
def flattenListOfLists(alist):
    v = []
    for a in alist:
        for x in a:
            v.append(x)
    return v

In [129]:
incomp = compareWordLists(flattenListOfLists(njhypes.values()),njhypes.keys())
print "For ONLY hypernym relevant nouns, syn versus hype words..."
print "\tHow many are same? ", len(incomp[0])
print "\tHow many are only in syn? ", len(incomp[1])
print "\tHow many are only in hype? ", len(incomp[2])
print
iacomp = compareWordLists(flattenListOfLists(ajhypes.values()),ajhypes.keys())
print "For ONLY hypernym relevant adj, syn versus hype words..."
print "\tHow many are same? ", len(iacomp[0])
print "\tHow many are only in syn? ", len(iacomp[1])
print "\tHow many are only in hype? ", len(iacomp[2])

For ONLY hypernym relevant nouns, syn versus hype words...
	How many are same?  153
	How many are only in syn?  287
	How many are only in hype?  48

For ONLY hypernym relevant adj, syn versus hype words...
	How many are same?  78
	How many are only in syn?  181
	How many are only in hype?  40
