
Word2Vec Training

Marcel Schwittlick edited this page Dec 6, 2016 · 6 revisions

Analogy accuracy of our own Word2Vec model trained on the Wikipedia data:

2016-12-05 14:24:05,276 : INFO : capital-common-countries: 93.3% (472/506)
2016-12-05 14:24:13,336 : INFO : capital-world: 93.6% (2260/2415)
2016-12-05 14:24:13,603 : INFO : currency: 9.3% (8/86)
2016-12-05 14:24:20,750 : INFO : city-in-state: 75.9% (1719/2265)
2016-12-05 14:24:21,966 : INFO : family: 94.2% (358/380)
2016-12-05 14:24:24,036 : INFO : gram1-adjective-to-adverb: 32.9% (214/650)
2016-12-05 14:24:24,778 : INFO : gram2-opposite: 50.4% (121/240)
2016-12-05 14:24:27,941 : INFO : gram3-comparative: 89.2% (885/992)
2016-12-05 14:24:29,552 : INFO : gram4-superlative: 69.0% (349/506)
2016-12-05 14:24:32,291 : INFO : gram5-present-participle: 61.8% (538/870)
2016-12-05 14:24:36,623 : INFO : gram6-nationality-adjective: 97.7% (1339/1371)
2016-12-05 14:24:41,062 : INFO : gram7-past-tense: 62.5% (879/1406)
2016-12-05 14:24:44,180 : INFO : gram8-plural: 88.2% (875/992)
2016-12-05 14:24:46,093 : INFO : gram9-plural-verbs: 68.8% (413/600)
2016-12-05 14:24:46,093 : INFO : total: 78.5% (10430/13279)

Analogy accuracy of the pretrained binary GoogleNews model:

/mnt/drive/models/GoogleNews-vectors-negative300.bin

2016-12-05 14:29:05,299 : INFO : capital-common-countries: 83.6% (423/506)
2016-12-05 14:29:09,778 : INFO : capital-world: 82.7% (1144/1383)
2016-12-05 14:29:10,172 : INFO : currency: 39.8% (51/128)
2016-12-05 14:29:17,550 : INFO : city-in-state: 74.6% (1739/2330)
2016-12-05 14:29:18,667 : INFO : family: 90.1% (308/342)
2016-12-05 14:29:21,235 : INFO : gram1-adjective-to-adverb: 32.3% (262/812)
2016-12-05 14:29:22,426 : INFO : gram2-opposite: 50.5% (192/380)
2016-12-05 14:29:26,659 : INFO : gram3-comparative: 91.9% (1224/1332)
2016-12-05 14:29:28,882 : INFO : gram4-superlative: 88.0% (618/702)
2016-12-05 14:29:31,607 : INFO : gram5-present-participle: 79.8% (694/870)
2016-12-05 14:29:35,517 : INFO : gram6-nationality-adjective: 97.1% (1193/1229)
2016-12-05 14:29:40,207 : INFO : gram7-past-tense: 66.5% (986/1482)
2016-12-05 14:29:43,327 : INFO : gram8-plural: 85.6% (849/992)
2016-12-05 14:29:45,620 : INFO : gram9-plural-verbs: 68.9% (484/702)
2016-12-05 14:29:45,621 : INFO : total: 77.1% (10167/13190)

The model that combines the enwiki corpus and our PDF corpus (including invalid sentences):

/mnt/drive/data/eco/word2vec_models/wiki_plus_v3_combined.txt_numpy.w2vmodel

2016-12-05 18:24:35,822 : INFO : capital-common-countries: 94.5% (478/506)
2016-12-05 18:24:44,503 : INFO : capital-world: 93.6% (2157/2304)
2016-12-05 18:24:44,745 : INFO : currency: 5.6% (3/54)
2016-12-05 18:24:52,773 : INFO : city-in-state: 73.5% (1619/2203)
2016-12-05 18:24:54,101 : INFO : family: 96.1% (365/380)
2016-12-05 18:24:56,781 : INFO : gram1-adjective-to-adverb: 32.0% (208/650)
2016-12-05 18:24:58,046 : INFO : gram2-opposite: 44.1% (135/306)
2016-12-05 18:25:01,769 : INFO : gram3-comparative: 91.4% (907/992)
2016-12-05 18:25:03,677 : INFO : gram4-superlative: 71.9% (364/506)
2016-12-05 18:25:07,128 : INFO : gram5-present-participle: 65.9% (613/930)
2016-12-05 18:25:13,206 : INFO : gram6-nationality-adjective: 97.6% (1338/1371)
2016-12-05 18:25:19,022 : INFO : gram7-past-tense: 66.6% (936/1406)
2016-12-05 18:25:23,023 : INFO : gram8-plural: 88.2% (875/992)
2016-12-05 18:25:25,505 : INFO : gram9-plural-verbs: 69.4% (451/650)
2016-12-05 18:25:25,505 : INFO : total: 78.9% (10449/13250)

The model that combines the enwiki corpus and our PDF corpus (only valid sentences):

/mnt/drive/data/eco/word2vec_models/wiki_plus_v3_valid_combined.txt_numpy.w2vmodel

2016-12-06 04:16:29,487 : INFO : capital-common-countries: 94.9% (480/506)
2016-12-06 04:16:38,029 : INFO : capital-world: 93.3% (2202/2359)
2016-12-06 04:16:38,202 : INFO : currency: 3.7% (2/54)
2016-12-06 04:16:46,969 : INFO : city-in-state: 74.2% (1680/2265)
2016-12-06 04:16:48,452 : INFO : family: 94.2% (358/380)
2016-12-06 04:16:50,742 : INFO : gram1-adjective-to-adverb: 32.8% (213/650)
2016-12-06 04:16:51,806 : INFO : gram2-opposite: 43.1% (132/306)
2016-12-06 04:16:55,665 : INFO : gram3-comparative: 89.5% (888/992)
2016-12-06 04:16:57,604 : INFO : gram4-superlative: 70.0% (354/506)
2016-12-06 04:17:01,111 : INFO : gram5-present-participle: 64.3% (559/870)
2016-12-06 04:17:07,295 : INFO : gram6-nationality-adjective: 97.4% (1336/1371)
2016-12-06 04:17:12,447 : INFO : gram7-past-tense: 62.2% (875/1406)
2016-12-06 04:17:16,129 : INFO : gram8-plural: 87.5% (868/992)
2016-12-06 04:17:18,373 : INFO : gram9-plural-verbs: 70.3% (422/600)
2016-12-06 04:17:18,374 : INFO : total: 78.2% (10369/13257)