This exercise tests whether logical laws (commutativity, association, transitivity) hold for word relations that a word2vec model learns from texts.

In [1]:
import warnings
import tqdm
warnings.filterwarnings('ignore')
from gensim.models import Word2Vec
from string import punctuation
from nltk.tokenize import sent_tokenize, word_tokenize
from datetime import datetime
exclude = set(punctuation + '0123456789[]—«»–')

In [2]:
sentences = []

The model is fitted on texts downloaded from Kaggle and saved locally. The same standard preprocessing and sentence splitting is applied to all of them.

In [3]:
with open('testData.tsv', 'r', encoding='utf-8') as f:
    text1 = f.read()
text1 = text1.replace('\n', ' ')
sents = sent_tokenize(text1)
for s in sents:
    buf = ''.join(ch for ch in s if ch not in exclude).lower()
    sentences.append(word_tokenize(buf))
print("%s sentences loaded" % len(sentences))

256980 sentences loaded


In [4]:
with open('unlabeledTrainData.tsv', 'r', encoding='utf-8') as f:
    text2 = f.read()
text2 = text2.replace('\n', ' ')
sents = sent_tokenize(text2)
for s in sents:
    buf = ''.join(ch for ch in s if ch not in exclude).lower()
    sentences.append(word_tokenize(buf))
print("%s sentences loaded" % len(sentences))

777122 sentences loaded
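
The two loading cells above repeat the same steps, so they could be folded into one helper. A minimal sketch follows. The name `tokenize_text` is hypothetical, and to stay dependency-free it uses a naive regex sentence splitter and `str.split` as stand-ins for NLTK's `sent_tokenize` and `word_tokenize`, so its token boundaries will differ from those produced by the cells above.

```python
import re
from string import punctuation

# Same exclusion set as in the first cell.
exclude = set(punctuation + '0123456789[]—«»–')

def tokenize_text(text):
    """Split raw text into lowercased, punctuation-free token lists, one per sentence."""
    text = text.replace('\n', ' ')
    # Naive stand-in for nltk's sent_tokenize: break after '.', '!' or '?' plus whitespace.
    raw_sentences = re.split(r'(?<=[.!?])\s+', text)
    result = []
    for s in raw_sentences:
        buf = ''.join(ch for ch in s if ch not in exclude).lower()
        tokens = buf.split()  # stand-in for nltk's word_tokenize
        if tokens:
            result.append(tokens)
    return result

print(tokenize_text("A well-made film. I loved it!\nWould watch again in 2020?"))
# [['a', 'wellmade', 'film'], ['i', 'loved', 'it'], ['would', 'watch', 'again', 'in']]
```

Deleting the hyphen fuses `well-made` into `wellmade`. The same effect is visible in tokens such as `goodlooking` and `wellcompressed` in the outputs further down.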


In [5]:
start_time = datetime.now()
model = Word2Vec(sentences, min_count=1)
print('%s elapsed' % (datetime.now() - start_time))

0:06:44.004894 elapsed
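
The training call above uses `min_count=1`, so every token is kept, including tokens that occur only once. The sketch below counts token frequencies on a made-up toy corpus and lists the tokens a higher threshold would discard. `rare_tokens` is a hypothetical helper, not part of gensim.

```python
from collections import Counter

def rare_tokens(sentences, min_count):
    """Return the sorted tokens whose corpus frequency is below min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return sorted(token for token, n in counts.items() if n < min_count)

# Toy corpus, invented for illustration only.
toy_sentences = [
    ['the', 'movie', 'was', 'good'],
    ['the', 'movie', 'was', 'wellcompressed'],
    ['good', 'movie'],
]
print(rare_tokens(toy_sentences, min_count=2))
# ['wellcompressed']
```

Running the same count on the real `sentences` list would show how much of the vocabulary consists of one-off tokens before deciding on a threshold.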


In [6]:
model.save('kaggleDataModel')

In [2]:
model = Word2Vec.load('kaggleDataModel')

## Lexically close words

In [7]:
words = ('man', 'woman', 'run', 'sleep', 'love', 'peace', 'good', 'bad')
for word in words:
    print('Similar to: ', word)
    print(model.most_similar(word))
    print()

Similar to:  man
[('woman', 0.7610129714012146), ('guy', 0.7300944328308105), ('soldier', 0.7272496819496155), ('person', 0.7112268805503845), ('boy', 0.6943151354789734), ('lady', 0.6704555153846741), ('lad', 0.6570690274238586), ('monk', 0.6562662720680237), ('gentleman', 0.649365246295929), ('doctor', 0.6390960812568665)]

Similar to:  woman
[('girl', 0.8390386700630188), ('lady', 0.7814967036247253), ('man', 0.7610130906105042), ('prostitute', 0.760490357875824), ('widow', 0.7088424563407898), ('person', 0.6998924612998962), ('teenager', 0.6989359855651855), ('nurse', 0.6923184394836426), ('housewife', 0.6889249682426453), ('lad', 0.6849828958511353)]

Similar to:  run
[('drive', 0.6993310451507568), ('walk', 0.6978382468223572), ('wander', 0.6943727135658264), ('go', 0.6800222396850586), ('move', 0.6617880463600159), ('fly', 0.6617198586463928), ('running', 0.6562157273292542), ('slip', 0.6523092985153198), ('bump', 0.6421452760696411), ('haul', 0.6290749907493591)]

Similar to:  

In general, the words found are reasonable. In each case, some of the 10 words marked as lexically close fall out of scope. This is related to the context of the texts used for fitting and to the use of idioms. Many of the words rated as very close (> 0.7) can be called synonyms, e.g. bad -> terrible, horrible; good -> decent; woman -> lady.
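
The scores returned by `most_similar` are cosine similarities between word vectors. Below is a minimal pure-Python sketch of that measure and of ranking by it. The three-dimensional vectors are made up for illustration (gensim's default vector size is 100), and `nearest` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only; not taken from the trained model.
toy_vectors = {
    'bad': [1.0, 0.2, 0.0],
    'terrible': [0.9, 0.3, 0.1],
    'peace': [0.0, 0.1, 1.0],
}

def nearest(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scored = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

for other, score in nearest('bad', toy_vectors):
    print(other, round(score, 3))
```

A score near 1 means the two vectors point in almost the same direction, which is why a threshold such as 0.7 is a workable cut-off for "very close" words.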

## Association rules

In [7]:
# According to example at https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/
# "boy" is to "father" as "girl" is to ...?
# model.most_similar(['girl', 'father'], ['boy'], topn=3)
relations = (
    (['girl', 'father'], ['boy']),
    (['hater', 'love'], ['lover']),
    (['smart', 'dumb'], ['stupid']),
    (['wife', 'man'], ['husband']),
    (['woman', 'girl'], ['guy']),
    (['bad', 'best'], ['good']),
)
for i, j in relations:
    print(model.most_similar(i, j, topn=3))

[('mother', 0.86660236120224), ('husband', 0.8353188037872314), ('daughter', 0.8319278955459595)]
[('wellcompressed', 0.5845099091529846), ('anythinggood', 0.5755310654640198), ('boycotting', 0.569663405418396)]
[('goodlooking', 0.6495766043663025), ('tough', 0.6393650770187378), ('handsome', 0.6369926929473877)]
[('woman', 0.7167847752571106), ('soldier', 0.6674887537956238), ('guy', 0.6461684107780457)]
[('lady', 0.6756771206855774), ('widow', 0.6608632206916809), ('prostitute', 0.6565677523612976)]
[('worst', 0.8059595227241516), ('funniest', 0.6849568486213684), ('finest', 0.6358859539031982)]


The associations for cases 1, 4 and 6 were found correctly. For case 3 the association is only roughly correct: the word found has a close meaning.
For case 5 the word found is simply close in meaning but does not satisfy the association rule. The worst case is 2, where no suitable word is found at all. This may be because the texts used for fitting do not contain enough examples to reproduce such complicated association rules.
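
An association query such as `most_similar(['girl', 'father'], ['boy'])` is, roughly, vector arithmetic: the normalised positive vectors are added, the normalised negative ones subtracted, and the remaining words are ranked by cosine similarity to the result. The sketch below reproduces that idea on made-up two-dimensional vectors. `analogy` and `unit` are hypothetical helper names, and the exact weighting inside gensim may differ.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to mean(positive) - mean(negative)."""
    dim = len(next(iter(vectors.values())))
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    mean = [0.0] * dim
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            mean[i] += sign * x / len(signed)
    mean = unit(mean)
    inputs = set(positive) | set(negative)  # query words are excluded from the answer
    scored = [(w, sum(a * b for a, b in zip(unit(v), mean)))
              for w, v in vectors.items() if w not in inputs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: first axis ~ gender, second axis ~ generation.
toy_vectors = {
    'boy': [1.0, -1.0],
    'girl': [-1.0, -1.0],
    'father': [1.0, 1.0],
    'mother': [-1.0, 1.0],
    'tree': [0.0, 1.0],
}
print(analogy(toy_vectors, ['girl', 'father'], ['boy'])[0][0])
# mother
```

The analogy only works when the relation corresponds to a consistent direction in the vector space, which is a plausible reason why a pair like hater/lover fails on this corpus.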

## Extra words

In [10]:
extra_words = (
    "man guy husband girl",
    "breakfast cereal dinner lunch",
    "good great pleasure bad",
    "smart clever sophisticated dumb",
    "run climb ski lay",
    "table chair car",
    "green yellow high black",
    "mother sister daughter brother"
)
for sentence in extra_words:
    print("{} is extra".format(model.doesnt_match(sentence.split()).upper()))

HUSBAND is extra
CEREAL is extra
PLEASURE is extra
DUMB is extra
SKI is extra
CAR is extra
HIGH is extra
BROTHER is extra


The extra words are found correctly for lines 2, 3, 4, 6, 7 and 8. For the other lines the choice depends on the point of view. For example, in line 1 the model chose 'husband', which denotes a married man, while the other words are more general. From another point of view, 'girl' could be picked as the extra word, because it is feminine while the other words are masculine.
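
The choice made by `doesnt_match` can be understood, roughly, as picking the word whose vector is least similar to the mean of all the given words' normalised vectors. A minimal sketch of that idea on made-up vectors follows. `odd_one_out` is a hypothetical name, and this is an approximation of the behaviour rather than gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word least similar (by cosine) to the mean of the group."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # The lowest dot product with the unit mean marks the outlier.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors, for illustration only.
toy_vectors = {
    'table': [1.0, 0.1],
    'chair': [0.9, 0.2],
    'car': [0.1, 1.0],
}
print(odd_one_out(toy_vectors, "table chair car".split()).upper(), "is extra")
# CAR is extra
```

Because the mean is pulled toward whichever meaning dominates the group, a set with two competing groupings (such as line 1, which mixes gender and marital status) has no single obvious outlier.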

## Commutative rule doesn't apply

In [105]:
# "a b" => "b a", i.e. 'a' relates to 'b' as 'b' relates to 'a'
words_commutations = (
    "father mother",
    "high highest",
    "dumb smart",
)
for words in words_commutations:
    a, b = words.split()
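    top3 = model.most_similar([b, b], [a], topn=3)
    # top3 holds (word, score) tuples, so compare against the words only
    top3_words = [word for word, prob in top3]
    print('{a} => {b}, as {b} to {a}'.format(a=a, b=b) 
          if (a in top3_words) else '{a} => {b}, {b} != {a}'.format(a=a, b=b))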
    print("{b} => {c}".format(b=b, c=[word for word, prob in top3]))
    print()

father => mother, mother != father
mother => ['daughter', 'grandmother', 'aunt']

high => highest, highest != high
highest => ['lowest', 'biggest', 'sole']

dumb => smart, smart != dumb
smart => ['intelligent', 'brave', 'charming']



## Transitive rule doesn't apply

In [8]:
words_transitives = (
    "father day", 
    "lunch bigger", 
    "going sun"
)
for words in words_transitives:
    a, b = words.split()
    predicted = model.most_similar([b, b], [a], topn=3)
    print("{a} => {b}, {b} => {c}".format(a=a, b=b, c=[word for word, prob in predicted]))
    trans = model.most_similar([a, a], [a], topn=3)
    print("{a} => {c}".format(a=a, c=[word for word, prob in trans]))
    is_transitive = bool(set([word for word, prob in predicted]) & set([word for word, prob in trans]))
    print('Transitivity holds' if is_transitive else 'Transitivity does not hold')
    print()

father => day, day => ['morning', 'week', 'night']
father => ['mother', 'son', 'husband']
Transitivity does not hold

lunch => bigger, bigger => ['larger', 'smaller', 'stronger']
lunch => ['meal', 'coffee', 'breakfast']
Transitivity does not hold

going => sun, sun => ['river', 'tower', 'sky']
going => ['supposed', 'ready', 'trying']
Transitivity does not hold



## Transitive rule applies

In [9]:
words_transitives = (
    "he his", 
    "big bigger", 
    "going went"
)
for words in words_transitives:
    a, b = words.split()
    predicted = model.most_similar([b, b], [a], topn=3)
    print("{a} => {b}, {b} => {c}".format(a=a, b=b, c=[word for word, prob in predicted]))
    trans = model.most_similar([a, b], [a], topn=3)
    print("{a} => {c}".format(a=a, c=[word for word, prob in trans]))
    is_transitive = bool(set([word for word, prob in predicted]) & set([word for word, prob in trans]))
    print('Transitivity holds' if is_transitive else 'Transitivity does not hold')
    print()

he => his, his => ['sams', 'jacks', 'bens']
he => ['her', 'jacks', 'sams']
Transitivity holds

big => bigger, bigger => ['larger', 'quicker', 'scarier']
big => ['smaller', 'larger', 'cheaper']
Transitivity holds

going => went, went => ['came', 'ran', 'took']
going => ['came', 'goes', 'ran']
Transitivity holds
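
The check used in the two transitivity cells amounts to asking whether two top-3 neighbour lists share at least one word. It could be factored into a helper that takes the lookup function as an argument. In the sketch below, `overlap_check` and `stub_most_similar` are hypothetical names. The stub replays the neighbour lists printed above for "going went", with scores set to 0.0 because the notebook does not print them. With the real model, `model.most_similar` would be passed instead.

```python
def overlap_check(most_similar, a, b, topn=3):
    """Return (holds, shared_words) for the neighbour-overlap test used above."""
    predicted = {w for w, _ in most_similar([b, b], [a], topn=topn)}
    trans = {w for w, _ in most_similar([a, b], [a], topn=topn)}
    shared = sorted(predicted & trans)
    return bool(shared), shared

# Canned results copied from the printed output for the pair "going went".
canned = {
    (('went', 'went'), ('going',)): ['came', 'ran', 'took'],
    (('going', 'went'), ('going',)): ['came', 'goes', 'ran'],
}

def stub_most_similar(positive, negative, topn=3):
    """Stand-in for model.most_similar; scores are placeholders."""
    words = canned[(tuple(positive), tuple(negative))]
    return [(w, 0.0) for w in words[:topn]]

print(overlap_check(stub_most_similar, 'going', 'went'))
# (True, ['came', 'ran'])
```

Returning the shared words as well as the boolean makes it easier to see how strong the overlap is. A single shared neighbour out of three is weak evidence.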



## Conclusion

In this exercise a word2vec model was trained on 777 thousand sentences. The completed tasks are:
- Words close in meaning were found
- Association rules were reproduced
- The extra word was found in each set
- Examples were shown where the transitive and commutative rules apply and where they do not.
<br><br>
A more general and precise model could have been built with a larger set of training sentences, but, as it turns out, that can take a lot of computational resources.