# Coming up with Labelling functions

This document provides a short overview of ideas for labelling functions.
In general, I started from features described in past papers on the topic and looked at whether these features can be translated into labelling functions.

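## Using Google's pre-trained Word2Vec model
In the first part I look at a pre-trained word embedding model: Google's Word2Vec vectors trained on Google News text (the 300-dimensional `GoogleNews-vectors-negative300` file loaded below).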

In [1]:
import gensim

# Load Google's pre-trained Word2Vec model.
trained_model = gensim.models.KeyedVectors.load_word2vec_format('~/models/GoogleNews-vectors-negative300.bin', binary=True)

### word + 'negative' - 'positive'

My hypothesis was that if we take a positive word (e.g. 'enthusiastic') and perform the following operation on the word vectors, we might get something like this: 
'enthusiastic' - 'positive' + 'negative' = antonym of 'enthusiastic'  
This sometimes actually works, but it is not that straightforward: sometimes synonyms still rank higher on the list than antonyms.  
However, I did notice the following property when comparing the list of words most similar to *word* (1) with the list of words most similar to *word - 'positive' + 'negative'* (2): antonyms tend to rank higher on (2) than they did on (1), even if they don't reach the top rank.
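
To make the arithmetic concrete, here is a minimal sketch of roughly what a `most_similar(positive=..., negative=...)` query does: it adds the unit-normalised vectors of the positive words, subtracts those of the negative words, and ranks the remaining vocabulary by cosine similarity to the result. The function `toy_most_similar` and the 2-D vectors are made up for illustration (axis 0 stands for topic, axis 1 for sentiment); they are not taken from the Google model, where the sentiment direction is far less clean.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    # The query words themselves are excluded from the results.
    inputs = set(positive) | set(negative)
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in inputs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors: [topic, sentiment].
toy_vectors = {
    'positive': [0.0, 1.0],
    'negative': [0.0, -1.0],
    'enthusiastic': [1.0, 0.5],
    'enthused': [1.0, 0.6],
    'unenthusiastic': [1.0, -0.5],
}

plain = toy_most_similar(toy_vectors, ['enthusiastic'])
shifted = toy_most_similar(toy_vectors, ['enthusiastic', 'negative'], ['positive'])
print(plain[0][0])    # enthused
print(shifted[0][0])  # unenthusiastic
```

In this toy space the synonym wins for the plain query and the antonym wins after the shift. In the real model a word's vector encodes much more than sentiment, which is one plausible reason synonyms can still outrank antonyms.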

In [3]:
trained_model.similarity('proud', 'ashamed')

0.51042433387974262

In [16]:
# 'enthusiastic' - 'positive' + 'negative'
trained_model.most_similar(positive=['enthusiastic', 'negative'], negative=['positive'])

[(u'unenthusiastic', 0.5785017013549805),
 (u'enthused', 0.5230987071990967),
 (u'effusive', 0.5033376812934875),
 (u'enthusiatic', 0.49656665325164795),
 (u'Enthusiastic', 0.49166086316108704),
 (u'vociferous', 0.48619210720062256),
 (u'enthusiastically', 0.4839787483215332),
 (u'passionate', 0.4743078649044037),
 (u'apprehensive', 0.47088587284088135),
 (u'ardent', 0.46236997842788696)]

In [19]:
# 'enthusiastic'
trained_model.most_similar(positive=['enthusiastic'])

[(u'enthused', 0.7288056015968323),
 (u'appreciative', 0.6407038569450378),
 (u'passionate', 0.6311060786247253),
 (u'Enthusiastic', 0.6302011013031006),
 (u'enthusiatic', 0.6265905499458313),
 (u'enthusiastically', 0.6244474053382874),
 (u'excited', 0.6102643013000488),
 (u'unenthusiastic', 0.6077113151550293),
 (u'ecstatic', 0.6036043167114258),
 (u'effusive', 0.5879316329956055)]

In [17]:
# 'balanced' - 'positive' + 'negative'
trained_model.most_similar(positive=['balanced', 'negative'], negative=['positive'])

[(u'Balanced', 0.5382355451583862),
 (u'unbalanced', 0.5257672071456909),
 (u'imbalanced', 0.45854058861732483),
 (u'potent_reuptake_inhibitor', 0.45243775844573975),
 (u'balance', 0.42325860261917114),
 (u'balancing', 0.4219854176044464),
 (u'freshly_saut\xe9ed', 0.41497623920440674),
 (u'unbalance', 0.4121413230895996),
 (u'Bayswater_combines', 0.40477824211120605),
 (u'slanted', 0.3982120156288147)]

In [20]:
# 'balanced'
trained_model.most_similar(positive=['balanced'])

[(u'Balanced', 0.5727792978286743),
 (u'potent_reuptake_inhibitor', 0.5593193769454956),
 (u'consistent', 0.5022857189178467),
 (u'balance', 0.5000194907188416),
 (u'ChartPoppers.com_strives', 0.499819815158844),
 (u'unbalanced', 0.4863540232181549),
 (u'freshly_saut\xe9ed', 0.48340001702308655),
 (u'fiscally_responsible', 0.47322702407836914),
 (u'balancing', 0.466336190700531),
 (u'Bayswater_combines', 0.46510744094848633)]

In [21]:
# 'happy' - 'positive' + 'negative'
trained_model.most_similar(positive=['happy', 'negative'], negative=['positive'])

[(u'unhappy', 0.5862479209899902),
 (u'glad', 0.5782526731491089),
 (u'disappointed', 0.5319445729255676),
 (u'worried', 0.5297830104827881),
 (u'Said_Hirschbeck', 0.5221741199493408),
 (u'sorry', 0.5157001614570618),
 (u'overjoyed', 0.5088183879852295),
 (u'ecstatic', 0.5060049891471863),
 (u'numb_Gwen_Bacquet', 0.499439001083374),
 (u'annoyed', 0.4985652565956116)]

In [22]:
# 'happy'
trained_model.most_similar(positive=['happy'])

[(u'glad', 0.7408890128135681),
 (u'pleased', 0.6632171273231506),
 (u'ecstatic', 0.6626912355422974),
 (u'overjoyed', 0.6599286794662476),
 (u'thrilled', 0.6514049768447876),
 (u'satisfied', 0.6437950134277344),
 (u'proud', 0.636042058467865),
 (u'delighted', 0.627237856388092),
 (u'disappointed', 0.6269949674606323),
 (u'excited', 0.6247666478157043)]

In [23]:
# 'proud' - 'positive' + 'negative'
trained_model.most_similar(positive=['proud', 'negative'], negative=['positive'])

[(u'immensely_proud', 0.5935452580451965),
 (u'thrilled', 0.5332244634628296),
 (u'ashamed', 0.5059055089950562),
 (u'Lynn_Coeby', 0.49850228428840637),
 (u'grateful', 0.4978291392326355),
 (u'glad', 0.49502453207969666),
 (u'prouder', 0.4870137870311737),
 (u'justifiably_proud', 0.4836322069168091),
 (u'honored', 0.4793172776699066),
 (u'Ripton_alongside', 0.4778229892253876)]

In [24]:
# 'proud'
trained_model.most_similar(positive=['proud'])

[(u'immensely_proud', 0.7941136360168457),
 (u'thrilled', 0.7283560037612915),
 (u'pleased', 0.7123605608940125),
 (u'delighted', 0.7018399238586426),
 (u'grateful', 0.7007218599319458),
 (u'excited', 0.6833415627479553),
 (u'justifiably_proud', 0.6696771383285522),
 (u'glad', 0.657325029373169),
 (u'honored', 0.6445347666740417),
 (u'thankful', 0.6442625522613525)]
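
The rank shift described at the start of this section can be checked mechanically. The sketch below uses two hypothetical helpers, `rank_of` and `moved_up`, and reuses the 'enthusiastic' outputs recorded above (scores rounded to three decimals), so it runs without loading the model.

```python
def rank_of(word, neighbours):
    """Return the 1-based rank of `word` in a list of (word, score) pairs, or None."""
    for position, (candidate, _score) in enumerate(neighbours, start=1):
        if candidate == word:
            return position
    return None


def moved_up(word, plain, shifted):
    """True if `word` ranks higher in `shifted` than in `plain`.

    A word missing from `plain` counts as ranked below everything in it.
    """
    before = rank_of(word, plain)
    after = rank_of(word, shifted)
    if after is None:
        return False
    return before is None or after < before


# most_similar(positive=['enthusiastic']), copied from the output above.
plain_enthusiastic = [
    ('enthused', 0.729), ('appreciative', 0.641), ('passionate', 0.631),
    ('Enthusiastic', 0.630), ('enthusiatic', 0.627), ('enthusiastically', 0.624),
    ('excited', 0.610), ('unenthusiastic', 0.608), ('ecstatic', 0.604),
    ('effusive', 0.588),
]

# most_similar(positive=['enthusiastic', 'negative'], negative=['positive']).
shifted_enthusiastic = [
    ('unenthusiastic', 0.579), ('enthused', 0.523), ('effusive', 0.503),
    ('enthusiatic', 0.497), ('Enthusiastic', 0.492), ('vociferous', 0.486),
    ('enthusiastically', 0.484), ('passionate', 0.474), ('apprehensive', 0.471),
    ('ardent', 0.462),
]

print(rank_of('unenthusiastic', plain_enthusiastic))    # 8
print(rank_of('unenthusiastic', shifted_enthusiastic))  # 1
print(moved_up('unenthusiastic', plain_enthusiastic, shifted_enthusiastic))  # True
```

The same check applies to the other recorded pairs: for example, 'unhappy' and 'ashamed' are absent from the plain top-10 lists of 'happy' and 'proud' but appear at ranks 1 and 3 after the shift.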

### word + 'positive' - 'negative' 
This is analogous to the above, but here a negative word is taken as the base and we are looking for its positive antonym. The observations from above still hold.

In [27]:
# 'sad' - 'negative' + 'positive'
trained_model.most_similar(positive=['sad', 'positive'], negative=['negative'])

[(u'saddening', 0.5947166085243225),
 (u'bittersweet', 0.5927517414093018),
 (u'happy', 0.584730863571167),
 (u'saddened', 0.5550497770309448),
 (u'heartbreaking', 0.5526055693626404),
 (u'wonderful', 0.5431721210479736),
 (u'heartening', 0.5382798910140991),
 (u'disheartening', 0.5335900187492371),
 (u'Sad', 0.5243781805038452),
 (u'tragic', 0.5188040137290955)]

In [28]:
# 'sad'
trained_model.most_similar(positive=['sad'])

[(u'saddening', 0.7273085713386536),
 (u'Sad', 0.6610826253890991),
 (u'saddened', 0.6604382991790771),
 (u'heartbreaking', 0.6573507785797119),
 (u'disheartening', 0.6507317423820496),
 (u'Meny_Friedman', 0.6487058401107788),
 (u'parishioner_Pat_Patello', 0.6475859880447388),
 (u'saddens_me', 0.6407118439674377),
 (u'distressing', 0.6399092674255371),
 (u'reminders_bobbing', 0.6357713937759399)]

In [29]:
# 'ashamed' - 'negative' + 'positive'
trained_model.most_similar(positive=['ashamed', 'positive'], negative=['negative'])

[(u'proud', 0.5834987163543701),
 (u'sorry', 0.5710163116455078),
 (u'embarrassed', 0.5382394790649414),
 (u'grateful', 0.5158604383468628),
 (u'glad', 0.5080356597900391),
 (u'happy', 0.5078516006469727),
 (u'thankful', 0.49530261754989624),
 (u'embarassed', 0.4939783215522766),
 (u'disgusted', 0.49206268787384033),
 (u'remorseful', 0.4911426305770874)]

Just out of curiosity, I tried this for a word that is not strongly polar ('yellow'). As expected, this doesn't show anything interesting; it just reshuffles a list of colours. This shows that we need to be careful: in general, this method will not be good for determining whether a word is polar or non-polar, but for a word already known to be polar it can indicate whether it's positive or negative.

In [4]:
trained_model.most_similar(positive=['yellow', 'positive'], negative=['negative'])

[(u'bright_yellow', 0.6055178642272949),
 (u'red', 0.5892467498779297),
 (u'orange', 0.5437914133071899),
 (u'bright_orange', 0.5282124876976013),
 (u'pink', 0.5237631797790527),
 (u'blue', 0.5166007280349731),
 (u'yellows', 0.5036360025405884),
 (u'purple', 0.4984946548938751),
 (u'Yellow', 0.4921160638332367),
 (u'maroon', 0.47217172384262085)]

In [5]:
trained_model.most_similar(positive=['yellow'])

[(u'red', 0.751919150352478),
 (u'bright_yellow', 0.6869138479232788),
 (u'orange', 0.6421886682510376),
 (u'blue', 0.6376121044158936),
 (u'purple', 0.6272757053375244),
 (u'yellows', 0.612633228302002),
 (u'pink', 0.6098285913467407),
 (u'bright_orange', 0.5974606871604919),
 (u'Warplanes_streaked_overhead', 0.583052396774292),
 (u'participant_LOGIN', 0.5816755294799805)]

### word distance from 'positive' and 'negative'

I also looked at something simpler: comparing the similarity of a word to each member of a known, strongly polar word pair. Higher similarity to one of them might indicate the direction of the word's polarity. E.g.:
sim('happy', 'positive') > sim('happy', 'negative')  
hence 'happy' is a positive word.
This, of course, does not always give a correct result, but could potentially be good enough for a labelling function. 
As above, we would first need to know whether a word is polar or not.

In [32]:
trained_model.similarity('happy', 'positive')

0.35381865425934766

In [33]:
trained_model.similarity('happy', 'negative')

0.15686738467201589

In [34]:
trained_model.similarity('proud', 'positive')

0.26622150915217901

In [35]:
trained_model.similarity('proud', 'negative')

0.076199017451654236

In [37]:
trained_model.similarity('ashamed', 'positive')

0.1383159178191852

In [36]:
trained_model.similarity('ashamed', 'negative')

0.15910607381262118

In [42]:
trained_model.similarity('sad', 'good')

0.36027422016349692

In [43]:
trained_model.similarity('sad', 'bad')

0.42367145004619988

In [40]:
trained_model.similarity('bitter', 'positive')

0.134985924898754

In [41]:
trained_model.similarity('bitter', 'negative')

0.18035062511180683

In [67]:
trained_model.similarity('green', 'negative')

0.12414770358757789

In [68]:
trained_model.similarity('green', 'positive')

0.17292317890581094
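
The comparison above could be wrapped into a labelling function along these lines. This is only a sketch: the name `lf_anchor_similarity`, the label constants and the `margin` threshold are assumptions, not part of the notebook. The `similarity` argument is any callable with the shape of `trained_model.similarity`; here it is backed by the scores recorded above (rounded) so the example runs without the model.

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1  # hypothetical label values


def lf_anchor_similarity(word, similarity, pos_anchor='positive',
                         neg_anchor='negative', margin=0.05):
    """Vote for the anchor the word is clearly closer to, otherwise abstain."""
    try:
        pos_score = similarity(word, pos_anchor)
        neg_score = similarity(word, neg_anchor)
    except KeyError:
        # Assumed behaviour: the lookup raises KeyError for unknown words.
        return ABSTAIN
    if pos_score - neg_score > margin:
        return POSITIVE
    if neg_score - pos_score > margin:
        return NEGATIVE
    return ABSTAIN


# Similarity scores copied from the cells above, rounded to four decimals.
recorded = {
    ('happy', 'positive'): 0.3538, ('happy', 'negative'): 0.1569,
    ('proud', 'positive'): 0.2662, ('proud', 'negative'): 0.0762,
    ('ashamed', 'positive'): 0.1383, ('ashamed', 'negative'): 0.1591,
    ('bitter', 'positive'): 0.1350, ('bitter', 'negative'): 0.1804,
    ('green', 'positive'): 0.1729, ('green', 'negative'): 0.1241,
}


def recorded_similarity(word_a, word_b):
    """Stand-in for trained_model.similarity using the recorded scores."""
    return recorded[(word_a, word_b)]


for word in ['happy', 'proud', 'ashamed', 'bitter', 'green']:
    strict = lf_anchor_similarity(word, recorded_similarity)
    loose = lf_anchor_similarity(word, recorded_similarity, margin=0.0)
    print(word, strict, loose)
```

With `margin=0.0` the non-polar word 'green' is labelled positive, which illustrates the caveat above. The arbitrary margin of 0.05 makes the function abstain on 'green', but it also abstains on 'ashamed' and 'bitter', so the threshold trades coverage for precision and would need tuning on real data.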

## Using WordNet antonyms

Whether a word has an antonym or not might indicate that it has polarity (i.e. is either positive or negative). For example, words such as 'scientific' or 'green' tend not to have antonyms, while antonym pairs often reflect a positive-negative relation, e.g. 'happy'-'sad', 'proud'-'ashamed'.
The obvious limitations are:
1. Not every antonym pair represents a positive-negative relation (e.g. dry-wet, tall-short).
2. Some more obscure words have no antonym even though they are positive or negative (e.g. forlorn).
3. This only distinguishes between neutral and non-neutral words, not between neutral, positive and negative.

Given (3), I think this idea is better suited to a labelling function in a polar vs non-polar classification task.
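
As a rough sketch, the antonym check could be turned into a labelling function like the one below. The name `lf_has_antonym`, the label constants and the injected `lookup_synsets` callable are assumptions made for illustration; the function only relies on the `lemmas()` and `antonyms()` methods used in the cells that follow.

```python
POLAR, ABSTAIN = 1, -1  # hypothetical label values


def lf_has_antonym(word, lookup_synsets):
    """Vote POLAR if any sense of `word` has a lemma with an antonym.

    `lookup_synsets` is a callable mapping a word to a list of synset-like
    objects exposing lemmas(); each lemma must expose antonyms().
    """
    for synset in lookup_synsets(word):
        for lemma in synset.lemmas():
            if lemma.antonyms():
                return POLAR
    # A missing antonym is weak evidence (see limitation 2), so abstain
    # rather than vote non-polar.
    return ABSTAIN


# With NLTK's WordNet corpus installed, a lookup could be:
#   from nltk.corpus import wordnet as wn
#   lf_has_antonym('happy', lambda w: wn.synsets(w, pos=wn.ADJ))
```

Note that this checks every adjective sense of a word, which is broader than the single-synset cells below and may fire more often. Because of limitation 1, it will also vote POLAR for pairs such as tall-short.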

In [2]:
from nltk.corpus import wordnet as wn

In [46]:
word = wn.synset('scientific.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(),  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('scientific.a.01')
No antonyms


In [47]:
word = wn.synset('green.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(),  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('green.s.01')
No antonyms
No antonyms
No antonyms
No antonyms


In [57]:
word = wn.synset('happy.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(), " - ",  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('happy.a.01')
(u'happy', ' - ', u'unhappy')


In [58]:
word = wn.synset('sad.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(), " - ",  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('sad.a.01')
(u'sad', ' - ', u'glad')


In [49]:
word = wn.synset('elated.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(),  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('elated.a.01')
(u'elated', u'dejected')


In [53]:
word = wn.synset('proud.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(), " - ",  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('proud.a.01')
(u'proud', ' - ', u'humble')


In [51]:
word = wn.synset('ashamed.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(), " - ",  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('ashamed.a.01')
(u'ashamed', ' - ', u'unashamed')


In [54]:
word = wn.synset('tall.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(), " - ",  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('tall.a.01')
(u'tall', ' - ', u'short')


In [56]:
word = wn.synset('forlorn.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(), " - ",  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('forlorn.s.01')
No antonyms


In [4]:
word = wn.synset('other.a.01')
print (word)
for l in word.lemmas():
    if l.antonyms():
        print(l.name(), " - ",  l.antonyms()[0].name())
    else:
        print("No antonyms")

Synset('other.a.01')
(u'other', ' - ', u'same')


Based on 