Various projects using Wikipedia data dumps.
Run the data setup script, which downloads the Wikipedia data dump via torrent and decompresses it. Expect it to take an hour or more, since the script also seeds the torrent for other downloaders.
sudo apt update && sudo apt upgrade
sudo apt install aria2
cd scripts
chmod +x ./setup.sh && ./setup.sh
cd ..
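If setup.sh is unavailable or fails partway, the commands below are a rough, hypothetical sketch of what it automates; the torrent link and intermediate paths are placeholders, and the script's exact behaviour may differ.
# placeholders: adjust <dump-date> and supply a torrent/magnet link for the pages-articles-multistream dump
aria2c --seed-time=0 --dir=./data/<dump-date>/ '<torrent or magnet link for enwiki-<dump-date-compressed>-pages-articles-multistream.xml.bz2>'
mkdir -p ./data/<dump-date>/unzipped/
bunzip2 -k ./data/<dump-date>/enwiki-<dump-date-compressed>-pages-articles-multistream.xml.bz2
mv ./data/<dump-date>/enwiki-<dump-date-compressed>-pages-articles-multistream.xml ./data/<dump-date>/unzipped/
Here --seed-time=0 makes aria2c exit as soon as the download completes; setup.sh instead keeps seeding for other downloaders, which is why it takes longer.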
Create a conda environment from the wikipedia.yml file:
conda env create -f wikipedia.yml
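Then activate the environment (assuming the environment name defined in wikipedia.yml is wikipedia):
conda activate wikipedia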
If required, run wikiextractor to extract plain text from the Wikipedia articles. Some notebooks use the dump in its compressed form, so this step may not be necessary:
wikiextractor \
--processes 96 \
--json \
-o ./data/<dump-date>/json/ \
./data/<dump-date>/unzipped/enwiki-<dump-date-compressed>-pages-articles-multistream.xml
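With --json, wikiextractor writes one JSON object per article (fields such as id, url, title, and text) into numbered files under lettered subdirectories. To spot-check the output, assuming wikiextractor's default AA/wiki_00 naming:
head -n 1 ./data/<dump-date>/json/AA/wiki_00 | python -m json.tool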
To save your Python environment (and overwrite the current environment file):
conda env export > wikipedia.yml
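If the environment file needs to work across machines, exporting without build strings is usually more portable:
conda env export --no-builds > wikipedia.yml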
To download the WIKIR datasets:
cd data
wget https://www.zenodo.org/record/3707606/files/enwikIR.zip?download=1
wget https://www.zenodo.org/record/3707238/files/enwikIRS.zip?download=1
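Alternatively, save the archives under clean names with wget -O, which avoids the ?download=1 suffixes (adjust the unzip and rm commands below accordingly):
wget -O enwikIR.zip 'https://www.zenodo.org/record/3707606/files/enwikIR.zip?download=1'
wget -O enwikIRS.zip 'https://www.zenodo.org/record/3707238/files/enwikIRS.zip?download=1'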
To unzip and organize:
unzip 'enwikIR.zip?download=1' && unzip 'enwikIRS.zip?download=1'
rm 'enwikIR.zip?download=1' 'enwikIRS.zip?download=1'
mkdir WIKIR
mv enwikIR WIKIR && mv enwikIRS WIKIR
To download the WikiPassageQA dataset:
cd data
wget https://ciir.cs.umass.edu/downloads/wikipassageqa/WikiPassageQA.zip
To unzip and organize:
unzip WikiPassageQA.zip -d WikiPassageQA && rm WikiPassageQA.zip
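To confirm the extraction (the exact contents may vary, but the archive is expected to include the question splits and a document passages file):
ls -lh WikiPassageQA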