A comparison of word embedding algorithms applied to the clustering of English Wikipedia. Written by Michael Wheeler, December 2020.
- `config.py`: Contains project-wide configurations, mainly paths to data assets and model hyperparameters as keyword arguments to be passed on initialization.
- `data/`: Contains the data assets for the project, such as the multistream bzip2 archives containing Wikipedia.
- `pipeline/`: Contains the actual procedure to execute the project steps from (almost) the beginning to the end. The pipeline dependencies can be traced back by looking at the `requires` method of each Task (see the sketch after this list); for more information see the Luigi docs.
- `report/`: Contains the LaTeX and LyX editor files I used to generate the final report, as well as the final report itself as a PDF.
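As a rough, hypothetical sketch (not the project's actual code), this is how a Luigi Task in `pipeline/` might declare its upstream dependency via `requires` and receive model hyperparameters from `config.py` as keyword arguments. The task names, file paths, and Word2Vec parameters below are all assumed for illustration.

```python
import luigi
from gensim.models import Word2Vec  # hypothetical choice of embedding model

# Hypothetical stand-ins for values that config.py would define
CORPUS_PATH = "data/corpus.txt"
WORD2VEC_KWARGS = {"vector_size": 100, "window": 5, "min_count": 5}  # gensim >= 4.0 names


class BuildCorpus(luigi.Task):
    """Placeholder upstream task assumed to produce the tokenized corpus."""

    def output(self):
        return luigi.LocalTarget(CORPUS_PATH)

    def run(self):
        ...  # extraction and tokenization of the Wikipedia dump would happen here


class TrainEmbeddings(luigi.Task):
    """Downstream task: its place in the pipeline is declared via requires()."""

    def requires(self):
        return BuildCorpus()

    def output(self):
        return luigi.LocalTarget("data/word2vec.model")

    def run(self):
        # Hyperparameters arrive as keyword arguments on model initialization,
        # as described for config.py above.
        model = Word2Vec(corpus_file=self.input().path, **WORD2VEC_KWARGS)
        model.save(self.output().path)
```

Tracing `requires` from the terminal task back through its dependencies reconstructs the whole workflow, which is how Luigi schedules the steps.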
- Install Python 3.7.9 or higher, along with the build dependencies: `sudo yum install gcc python3-devel`
- Create a virtual environment, activate it, and install the requirements: `python3 -m venv virtualenv`, `source virtualenv/bin/activate`, `pip install -r requirements.txt`
- Download the Wikipedia multistream archives to the project's data directory; for detailed instructions see this Wikipedia page
- To run the WikiExtractor itself (NOT included in the pipeline), follow the instructions here
- Finally, run the pipeline: `python3 -m pipeline` (a sketch of what this entry point might look like follows below)
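For context, here is a minimal sketch of what `pipeline/__main__.py` might contain so that `python3 -m pipeline` launches the workflow. The placeholder task and output path are assumptions, not the project's actual code.

```python
import luigi


class FinalTask(luigi.Task):
    """Placeholder for the project's terminal task."""

    def output(self):
        return luigi.LocalTarget("data/done.flag")

    def run(self):
        with self.output().open("w") as f:
            f.write("pipeline complete\n")


if __name__ == "__main__":
    # local_scheduler=True runs everything in-process, so no separate
    # luigid server is needed for a single-machine run.
    luigi.build([FinalTask()], local_scheduler=True)
```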