# Wiki2Model

These scripts generate language models based on Word2Vec's CBoW architecture, trained on text documents extracted directly from Wikipedia in multiple languages. In addition to models trained on the original Wikipedia content, the scripts can also generate models trained on semantically enriched Wikipedia content, where the enrichment is based on named entity recognition (NER) and word sense disambiguation (WSD) procedures.
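
The kind of model these scripts produce corresponds to gensim's Word2Vec in CBoW mode (`sg=0`). The sketch below is a minimal illustration of that underlying training step only; the corpus file name and hyperparameters are hypothetical, not the values used by the scripts:

```python
# Minimal sketch of CBoW training with gensim (not the scripts' exact settings).
# Assumes gensim >= 4; older versions use `size` instead of `vector_size`.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical corpus: one preprocessed Wikipedia sentence per line.
sentences = LineSentence("in/db/wikipedia_sentences.txt")

model = Word2Vec(
    sentences=sentences,
    sg=0,            # sg=0 selects CBoW; sg=1 would select skip-gram
    vector_size=300, # embedding dimensionality (assumed value)
    window=5,
    min_count=5,
    workers=4,
)
model.save("in/models/wiki_cbow.model")
```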

Generating a CBoW-based model with original Wikipedia documents as the external knowledge source:

```bash
python3 Wiki2Model.py --language EN --download in/db/ --extractor tools/WikiExtractor.py --output in/models/
```

Generating a CBoW-based model with semantically enriched Wikipedia documents as the external knowledge source:

```bash
python3 Wiki2Model_S-Enrich.py --language EN --download in/db/ --extractor tools/WikiExtractor.py --s_enrich tools/S-Enrich_Bfy.jar --output in/models/
```
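
Once a model has been generated, it can be loaded and queried with gensim. The snippet below is a hedged usage example; the model file name is hypothetical and depends on what the scripts actually write to `in/models/`:

```python
# Loading and querying a generated model (file name is hypothetical).
from gensim.models import Word2Vec

model = Word2Vec.load("in/models/wiki_cbow.model")

# Nearest neighbours in the embedding space.
print(model.wv.most_similar("computer", topn=5))

# Raw embedding vector for a single word.
vector = model.wv["computer"]
```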

## Related scripts

## Requirements installation (Linux)

Python 3 + pip installation (as super user):

```bash
apt-get install python3 python3-pip
```

Gensim installation (as normal user):

```bash
pip3 install --upgrade gensim
```

NLTK + SciPy + NumPy installation (as normal user):

```bash
pip3 install -U nltk scipy numpy
```
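
A quick way to confirm the packages are importable after installation (a minimal check, not part of the original scripts):

```python
# Print installed versions to confirm the requirements are available.
import gensim, nltk, numpy, scipy

print("gensim", gensim.__version__)
print("nltk", nltk.__version__)
print("numpy", numpy.__version__)
print("scipy", scipy.__version__)
```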

## See more

Project page on the LABIC website: http://sites.labic.icmc.usp.br/MSc-Thesis_Antunes_2018
