## Preparing to work with Wikipedia

Download 'enwiki-20200401-pages-articles.xml.bz2' from https://meta.wikimedia.org/wiki/Data_dump_torrents; the archive is about 16 GB.

Download 'wiki.corpus' from https://yadi.sk/d/TVo-KPUbgx4vPA; it is a saved WikiCorpus object for working with the non-lemmatized(!) Wikipedia.
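
Because the dump is so large, it can be worth checking that the download is a readable bz2 stream before starting a long run. The sketch below uses only the standard library and reads just the first few lines without unpacking the whole archive; the helper name `head_bz2` is made up for this example.

```python
import bz2
from itertools import islice

def head_bz2(path, n_lines=5):
    """Return the first n_lines text lines of a .bz2 file, read as a stream."""
    # bz2.open decompresses lazily, so only the start of the archive is read
    with bz2.open(path, mode='rt', encoding='utf-8') as stream:
        return [line.rstrip('\n') for line in islice(stream, n_lines)]
```

For the dump this would be called as `head_bz2('enwiki-20200401-pages-articles.xml.bz2')`; a corrupted or incomplete file typically raises an error at this point instead of hours later.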


In [1]:
from gensim.corpora.wikicorpus import WikiCorpus

In [2]:
# Enable INFO logging so that gensim reports progress while loading and training
import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)

In [3]:
wiki = WikiCorpus.load('wiki.corpus')

2020-11-23 04:38:51,654: INFO: loading WikiCorpus object from wiki.corpus
2020-11-23 04:38:53,799: INFO: loading dictionary recursively from wiki.corpus.dictionary.* with mmap=None
2020-11-23 04:38:53,800: INFO: loaded wiki.corpus


## Building word2vec by hand with gensim

The code is based on https://gist.github.com/maxbellec/85d90d3d7f2f96589f1517e5c4567dc3

In [None]:
import multiprocessing
from gensim.models.word2vec import Word2Vec

class MySentences(object):
    def __iter__(self):
        # Each Wikipedia article is yielded as one list of tokens
        for text in wiki.get_texts():
            yield text

sentences = MySentences()
# gensim 3.x parameter names; gensim 4.x renamed 'size' to 'vector_size'
params = {'size': 300, 'window': 10, 'min_count': 40,
          'workers': max(1, multiprocessing.cpu_count() - 1), 'sample': 1e-3}
word2vec = Word2Vec(sentences, **params)
word2vec.save('wiki.word2vec.model')

2020-11-20 19:38:17,061: INFO: collecting all words and their counts
2020-11-20 19:38:35,687: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
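
The corpus is wrapped in a class with `__iter__` and not passed as a plain generator because Word2Vec reads the data more than once: one pass to count words, then further passes for training. A generator would be empty after the first pass. The sketch below shows the difference on two toy documents; `RestartableCorpus` and `one_shot` are names invented for this example.

```python
class RestartableCorpus:
    """Iterable that starts a fresh pass every time it is iterated."""
    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        for tokens in self.documents:
            yield tokens

def one_shot(documents):
    """Plain generator: exhausted after the first pass."""
    for tokens in documents:
        yield tokens

# Toy documents standing in for tokenized articles
docs = [['anarchism', 'is', 'a', 'philosophy'],
        ['autism', 'is', 'a', 'disorder']]

corpus = RestartableCorpus(docs)
first_pass = sum(1 for _ in corpus)
second_pass = sum(1 for _ in corpus)   # iterates again from the start

gen = one_shot(docs)
gen_first = sum(1 for _ in gen)
gen_second = sum(1 for _ in gen)       # nothing left to iterate

print(first_pass, second_pass)  # 2 2
print(gen_first, gen_second)    # 2 0
```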


## The model is trained; time to evaluate its quality

A precomputed model can be downloaded from the links below; it consists of three files:

- 'wiki.word2vec.model' - https://yadi.sk/d/LTFU0Ukc2Bp2MA

- 'wiki.word2vec.model.trainables.syn1neg.npy' - https://yadi.sk/d/g7oWXFwga8l9OA

- 'wiki.word2vec.model.wv.vectors.npy' - https://yadi.sk/d/nGaMaQT_FkqnLQ

In [5]:
from gensim.models.word2vec import Word2Vec
word2vec = Word2Vec.load('wiki.word2vec.model')
len(word2vec.wv.vocab)

2020-11-23 04:39:18,217: INFO: loading Word2Vec object from wiki.word2vec.model
2020-11-23 04:39:20,465: INFO: loading wv recursively from wiki.word2vec.model.wv.* with mmap=None
2020-11-23 04:39:20,466: INFO: loading vectors from wiki.word2vec.model.wv.vectors.npy with mmap=None
2020-11-23 04:39:28,183: INFO: setting ignored attribute vectors_norm to None
2020-11-23 04:39:28,185: INFO: loading vocabulary recursively from wiki.word2vec.model.vocabulary.* with mmap=None
2020-11-23 04:39:28,186: INFO: loading trainables recursively from wiki.word2vec.model.trainables.* with mmap=None
2020-11-23 04:39:28,186: INFO: loading syn1neg from wiki.word2vec.model.trainables.syn1neg.npy with mmap=None
2020-11-23 04:39:34,573: INFO: setting ignored attribute cum_table to None
2020-11-23 04:39:34,575: INFO: loaded wiki.word2vec.model


642768

In [6]:
from scipy.stats import spearmanr

# SimLex-999 is tab-separated; the first row is a header,
# columns 0 and 1 hold the word pair, column 3 the human similarity score
with open("SimLex-999.txt", 'r') as simlex_file:
    f = simlex_file.readlines()

def rank(model):
    not_in_model = []   # SimLex words missing from the model vocabulary
    w2v_pairs = []      # model cosine distances for the covered pairs
    simlex_pairs = []   # human similarity scores for the same pairs
    for line in f[1:]:
        fields = line.split('\t')
        first_word, second_word = fields[0], fields[1]
        missing = [w for w in (first_word, second_word) if w not in model]
        not_in_model.extend(missing)
        if not missing:
            # distance = 1 - cosine similarity, so agreement with the human
            # similarity scores shows up as a negative correlation
            w2v_pairs.append(model.distance(first_word, second_word))
            simlex_pairs.append(float(fields[3]))
    print(len(w2v_pairs), not_in_model)
    print(len(simlex_pairs))

    return spearmanr(simlex_pairs, w2v_pairs)

In [7]:
rank(word2vec.wv)

999 []
999


SpearmanrResult(correlation=-0.36182557483841277, pvalue=2.9039845974509423e-32)
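
The correlation is negative because `rank` compares human similarity scores with `model.distance`, which in gensim is 1 minus the cosine similarity. Spearman's rho depends only on ranks, so replacing similarity with distance reverses the ranking and flips the sign without changing the magnitude. The sketch below illustrates this in pure Python with made-up numbers (they are not SimLex-999 values); `average_ranks` and `spearman` are helper names introduced here, not scipy functions.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up numbers for illustration only
human_scores = [9.2, 1.3, 6.5, 3.0, 7.8]
similarities = [0.71, 0.10, 0.35, 0.42, 0.66]
distances = [1.0 - s for s in similarities]

rho_similarity = spearman(human_scores, similarities)
rho_distance = spearman(human_scores, distances)
print(rho_similarity, rho_distance)  # same magnitude, opposite signs
```

Read this way, the result above corresponds to a correlation of about +0.36 between the model's cosine similarity and the SimLex-999 scores.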

## Now let's build w2v for the lemmatized Wikipedia

Download 'wiki.lem.corpus' from https://yadi.sk/d/AsaBf1j_oFBFHw; it is a saved WikiCorpus object for working with the lemmatized(!) Wikipedia.


In [1]:
from gensim.corpora.wikicorpus import WikiCorpus

import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)

wiki = WikiCorpus.load('wiki.lem.corpus')

2020-12-03 04:08:27,367: INFO: loading WikiCorpus object from wiki.lem.corpus
2020-12-03 04:08:28,501: INFO: loading dictionary recursively from wiki.lem.corpus.dictionary.* with mmap=None
2020-12-03 04:08:28,502: INFO: loaded wiki.lem.corpus


In [2]:
import multiprocessing
from gensim.models.word2vec import Word2Vec

class MySentences(object):
    def __iter__(self):
        for text in wiki.get_texts():
            # Lemmatized tokens are bytes of the form b'lemma/POS';
            # decode them and keep only the lemma
            yield [word.decode('utf-8').split('/')[0] for word in text]

sentences = MySentences()
# gensim 3.x parameter names; gensim 4.x renamed 'size' to 'vector_size'
params = {'size': 300, 'window': 10, 'min_count': 40,
          'workers': max(1, multiprocessing.cpu_count() - 1), 'sample': 1e-3}
word2vec = Word2Vec(sentences, **params)
word2vec.save('wiki.lem.word2vec.model')

2020-12-03 04:09:32,949: INFO: collecting all words and their counts
2020-12-03 04:09:33,704: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-12-03 04:26:34,729: INFO: PROGRESS: at sentence #10000, processed 21764086 words, keeping 559841 word types
2020-12-03 04:41:38,322: INFO: PROGRESS: at sentence #20000, processed 41004355 words, keeping 825048 word types
2020-12-03 04:54:06,670: INFO: PROGRESS: at sentence #30000, processed 56883281 words, keeping 1018881 word types
2020-12-03 05:05:08,049: INFO: PROGRESS: at sentence #40000, processed 70535350 words, keeping 1192259 word types
2020-12-03 05:12:18,420: INFO: PROGRESS: at sentence #50000, processed 80269014 words, keeping 1295042 word types
2020-12-03 05:15:45,241: INFO: PROGRESS: at sentence #60000, processed 85762124 words, keeping 1323832 word types
2020-12-03 05:18:52,957: INFO: PROGRESS: at sentence #70000, processed 90746794 words, keeping 1349156 word types
2020-12-03 05:21:40,263: INFO: PROGRES

2020-12-03 11:24:16,108: INFO: PROGRESS: at sentence #710000, processed 526697674 words, keeping 4996861 word types
2020-12-03 11:29:01,752: INFO: PROGRESS: at sentence #720000, processed 531771846 words, keeping 5030018 word types
2020-12-03 11:33:31,648: INFO: PROGRESS: at sentence #730000, processed 536650665 words, keeping 5064099 word types
2020-12-03 11:37:43,228: INFO: PROGRESS: at sentence #740000, processed 541334583 words, keeping 5096182 word types
2020-12-03 11:41:57,711: INFO: PROGRESS: at sentence #750000, processed 546145362 words, keeping 5130694 word types
2020-12-03 11:46:03,129: INFO: PROGRESS: at sentence #760000, processed 550741797 words, keeping 5160577 word types
2020-12-03 11:50:02,093: INFO: PROGRESS: at sentence #770000, processed 555287907 words, keeping 5191507 word types
2020-12-03 11:54:16,099: INFO: PROGRESS: at sentence #780000, processed 559959020 words, keeping 5226906 word types
2020-12-03 11:58:46,568: INFO: PROGRESS: at sentence #790000, processed 

2020-12-03 15:58:52,360: INFO: PROGRESS: at sentence #1420000, processed 820899447 words, keeping 7000515 word types
2020-12-03 16:02:36,418: INFO: PROGRESS: at sentence #1430000, processed 824745518 words, keeping 7023859 word types
2020-12-03 16:06:31,700: INFO: PROGRESS: at sentence #1440000, processed 828834463 words, keeping 7047410 word types
2020-12-03 16:10:00,806: INFO: PROGRESS: at sentence #1450000, processed 832388098 words, keeping 7069881 word types
2020-12-03 16:12:58,166: INFO: PROGRESS: at sentence #1460000, processed 835555006 words, keeping 7090399 word types
2020-12-03 16:16:51,773: INFO: PROGRESS: at sentence #1470000, processed 839319482 words, keeping 7116397 word types
2020-12-03 16:20:39,902: INFO: PROGRESS: at sentence #1480000, processed 843110594 words, keeping 7138047 word types
2020-12-03 16:24:06,322: INFO: PROGRESS: at sentence #1490000, processed 846694143 words, keeping 7158225 word types
2020-12-03 16:27:37,353: INFO: PROGRESS: at sentence #1500000, p

2020-12-03 20:12:41,982: INFO: PROGRESS: at sentence #2120000, processed 1066497854 words, keeping 8521686 word types
2020-12-03 20:16:03,605: INFO: PROGRESS: at sentence #2130000, processed 1069617386 words, keeping 8539087 word types
2020-12-03 20:19:14,419: INFO: PROGRESS: at sentence #2140000, processed 1072786068 words, keeping 8560127 word types
2020-12-03 20:22:57,105: INFO: PROGRESS: at sentence #2150000, processed 1076156779 words, keeping 8598772 word types
2020-12-03 20:26:41,399: INFO: PROGRESS: at sentence #2160000, processed 1079670145 words, keeping 8617509 word types
2020-12-03 20:30:17,151: INFO: PROGRESS: at sentence #2170000, processed 1083007820 words, keeping 8639769 word types
2020-12-03 20:33:46,107: INFO: PROGRESS: at sentence #2180000, processed 1086432359 words, keeping 8660851 word types
2020-12-03 20:37:12,309: INFO: PROGRESS: at sentence #2190000, processed 1089813764 words, keeping 8678689 word types
2020-12-03 20:40:38,661: INFO: PROGRESS: at sentence #22

2020-12-04 00:30:09,044: INFO: PROGRESS: at sentence #2820000, processed 1306388475 words, keeping 10026842 word types
2020-12-04 00:34:00,462: INFO: PROGRESS: at sentence #2830000, processed 1309861214 words, keeping 10050562 word types
2020-12-04 00:37:42,207: INFO: PROGRESS: at sentence #2840000, processed 1313209928 words, keeping 10070615 word types
2020-12-04 00:41:34,825: INFO: PROGRESS: at sentence #2850000, processed 1316675269 words, keeping 10092280 word types
2020-12-04 00:45:17,585: INFO: PROGRESS: at sentence #2860000, processed 1320058957 words, keeping 10113193 word types
2020-12-04 00:48:57,128: INFO: PROGRESS: at sentence #2870000, processed 1323477884 words, keeping 10132984 word types
2020-12-04 00:52:48,635: INFO: PROGRESS: at sentence #2880000, processed 1327037737 words, keeping 10153117 word types
2020-12-04 00:56:26,399: INFO: PROGRESS: at sentence #2890000, processed 1330395400 words, keeping 10171817 word types
2020-12-04 00:59:55,672: INFO: PROGRESS: at sent

2020-12-04 04:50:34,303: INFO: PROGRESS: at sentence #3510000, processed 1539434286 words, keeping 11379909 word types
2020-12-04 04:54:08,408: INFO: PROGRESS: at sentence #3520000, processed 1542658423 words, keeping 11397428 word types
2020-12-04 04:57:42,748: INFO: PROGRESS: at sentence #3530000, processed 1545957298 words, keeping 11413665 word types
2020-12-04 05:01:22,400: INFO: PROGRESS: at sentence #3540000, processed 1549260448 words, keeping 11430602 word types
2020-12-04 05:05:04,541: INFO: PROGRESS: at sentence #3550000, processed 1552577085 words, keeping 11448801 word types
2020-12-04 05:08:40,885: INFO: PROGRESS: at sentence #3560000, processed 1555885761 words, keeping 11466210 word types
2020-12-04 05:12:24,402: INFO: PROGRESS: at sentence #3570000, processed 1559355524 words, keeping 11481779 word types
2020-12-04 05:16:08,408: INFO: PROGRESS: at sentence #3580000, processed 1562637939 words, keeping 11497995 word types
2020-12-04 05:19:34,058: INFO: PROGRESS: at sent

2020-12-04 08:58:36,039: INFO: PROGRESS: at sentence #4210000, processed 1757556026 words, keeping 12643754 word types
2020-12-04 09:02:49,523: INFO: PROGRESS: at sentence #4220000, processed 1760922603 words, keeping 12659434 word types
2020-12-04 09:07:09,301: INFO: PROGRESS: at sentence #4230000, processed 1764400004 words, keeping 12676968 word types
2020-12-04 09:11:00,484: INFO: PROGRESS: at sentence #4240000, processed 1767529245 words, keeping 12693728 word types
2020-12-04 09:14:26,782: INFO: PROGRESS: at sentence #4250000, processed 1770527755 words, keeping 12709021 word types
2020-12-04 09:17:48,619: INFO: PROGRESS: at sentence #4260000, processed 1773491960 words, keeping 12723710 word types
2020-12-04 09:21:24,423: INFO: PROGRESS: at sentence #4270000, processed 1776554517 words, keeping 12738729 word types
2020-12-04 09:24:42,846: INFO: PROGRESS: at sentence #4280000, processed 1779547719 words, keeping 12754646 word types
2020-12-04 09:28:12,356: INFO: PROGRESS: at sent

TypeError: can't concat str to bytes
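
The run above ends with a bytes/str `TypeError`, and since the traceback is not shown, the failing call is unknown. If the cause turns out to be tokens arriving as a mix of `bytes` and `str` (an assumption; the log does not confirm it), one defensive option is to normalize every token before stripping the part-of-speech tag. `strip_pos` is a hypothetical helper written for this sketch.

```python
def strip_pos(token):
    """Return the lemma from a 'lemma/POS' token given as bytes or str."""
    if isinstance(token, bytes):
        token = token.decode('utf-8')
    # Keep only the part before the first '/', i.e. drop the POS tag
    return token.split('/')[0]

# Toy article mixing both token types (assumed scenario, see above)
mixed_article = [b'study/VB', 'cat/NN', b'quick/JJ']
lemmas = [strip_pos(token) for token in mixed_article]
print(lemmas)  # ['study', 'cat', 'quick']
```

In `MySentences.__iter__` this would replace the inline `word.decode('utf-8').split('/')[0]` expression.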

## The model is trained; time to evaluate its quality

A precomputed model can be downloaded from the links below; it consists of three files:

- 'wiki.lem.word2vec.model' -

- 'wiki.lem.word2vec.model.trainables.syn1neg.npy' -

- 'wiki.lem.word2vec.model.wv.vectors.npy' - 

In [None]:
from gensim.models.word2vec import Word2Vec
word2vec = Word2Vec.load('wiki.lem.word2vec.model')
len(word2vec.wv.vocab)

In [None]:
from scipy.stats import spearmanr

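# SimLex-999 is tab-separated; the first row is a header,
# columns 0 and 1 hold the word pair, column 3 the human similarity score
with open("SimLex-999.txt", 'r') as simlex_file:
    f = simlex_file.readlines()

def rank(model):
    w2v_pairs = []
    simlex_pairs = []
    for line in f[1:]:
        fields = line.split('\t')
        first_word, second_word = fields[0], fields[1]
        # Score only the pairs whose words are both in the vocabulary
        if first_word in model and second_word in model:
            w2v_pairs.append(model.distance(first_word, second_word))
            simlex_pairs.append(float(fields[3]))
    print(len(w2v_pairs))
    return spearmanr(simlex_pairs, w2v_pairs)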

In [None]:
rank(word2vec.wv)
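
Since lemmatization changes the vocabulary, it can help to report what share of the SimLex pairs the model actually covers alongside the correlation. The sketch below reads the file by column name instead of by position and computes that share. It assumes the header uses the names `word1`, `word2` and `SimLex999`, uses toy rows in place of the real file, and a plain set in place of the model vocabulary; `load_simlex` and `coverage` are names invented here.

```python
import csv
import io

def load_simlex(handle):
    """Yield (word1, word2, score) from a tab-separated SimLex-style file."""
    reader = csv.DictReader(handle, delimiter='\t')
    for row in reader:
        yield row['word1'], row['word2'], float(row['SimLex999'])

def coverage(pairs, vocabulary):
    """Return the pairs fully inside the vocabulary and their share."""
    covered = [p for p in pairs if p[0] in vocabulary and p[1] in vocabulary]
    return covered, len(covered) / len(pairs)

# Toy rows in the assumed layout; with the real data, pass an open file
sample = io.StringIO(
    "word1\tword2\tPOS\tSimLex999\n"
    "old\tnew\tA\t1.5\n"
    "smart\tclever\tA\t9.0\n"
)
pairs = list(load_simlex(sample))
covered, share = coverage(pairs, vocabulary={'old', 'new', 'smart'})
print(covered, share)
```

With the real model, `word2vec.wv` can be passed as `vocabulary`, since it supports the `in` operator in the same way the `rank` function relies on.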