wikipedia

Various projects using Wikipedia data dumps.

Getting Started

Run the data setup script. It downloads the Wikipedia data dump via torrent and then decompresses it. Expect this to take over an hour, in part because of seeding for other downloaders.

    sudo apt update && sudo apt upgrade
    sudo apt install aria2
    cd scripts
    chmod +x ./setup.sh && ./setup.sh
    cd ..

Create a conda environment from wikipedia.yml:

    conda env create -f wikipedia.yml

If needed, run wikiextractor to extract plain text from the Wikipedia articles. Some notebooks read the dump in its compressed form, so this step is optional. Adjust --processes to suit the number of cores on your machine:

    wikiextractor \
        --processes 96 \
        --json \
        -o ./data/<dump-date>/json/ \
        ./data/<dump-date>/unzipped/enwiki-<dump-date-compressed>-pages-articles-multistream.xml
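
As a rough sketch of how the extracted output might be consumed from a notebook: with --json, wikiextractor is expected to write files holding one JSON object per line. The helper below walks an output directory and yields each record. The name iter_articles is hypothetical, and the one-object-per-line layout and any field names you rely on (such as title or text) are assumptions to check against your own output.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_articles(json_dir: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per article from every file under json_dir.

    Assumes each file holds one JSON object per line (JSON Lines).
    """
    for path in sorted(Path(json_dir).rglob("*")):
        if not path.is_file():
            continue  # skip the intermediate subdirectories
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # ignore blank lines
                    yield json.loads(line)
```

Because the function is a generator, articles are streamed one at a time rather than loaded into memory all at once, which matters for a dump of this size.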

Python Environment

To save your Python environment (this overwrites the current environment file):

    conda env export > wikipedia.yml

WIKIR

To download the WIKIR datasets (from the repository root):

    cd data
    wget 'https://www.zenodo.org/record/3707606/files/enwikIR.zip?download=1'
    wget 'https://www.zenodo.org/record/3707238/files/enwikIRS.zip?download=1'

To unzip and organize:

    unzip 'enwikIR.zip?download=1' && unzip 'enwikIRS.zip?download=1'
    rm 'enwikIR.zip?download=1' 'enwikIRS.zip?download=1'
    mkdir -p WIKIR
    mv enwikIR WIKIR && mv enwikIRS WIKIR
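
Once the folders are in place, relevance judgments can be loaded for evaluation. The sketch below parses a TREC-style qrels file, where each line is whitespace-separated as query_id, iteration, doc_id, relevance. That WIKIR ships its judgments in this layout is an assumption to verify against the extracted files, and load_qrels is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict


def load_qrels(path: str) -> Dict[str, Dict[str, int]]:
    """Map query_id -> {doc_id: relevance} from a TREC-style qrels file."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 4:  # skip blank or malformed lines
                continue
            query_id, _iteration, doc_id, relevance = parts
            qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)
```

Identifiers are kept as strings so they can be matched against other files without worrying about integer parsing.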

WikiPassageQA

To download the WikiPassageQA dataset (from the repository root):

    cd data
    wget https://ciir.cs.umass.edu/downloads/wikipassageqa/WikiPassageQA.zip

To unzip and organize:

    unzip WikiPassageQA.zip -d WikiPassageQA && rm WikiPassageQA.zip
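
To give a feel for working with the unzipped files, here is a minimal sketch that reads a tab-separated file with Python's csv module. It assumes the archive contains TSV files with a header row; this README does not confirm the file names or columns, so the hypothetical helper read_tsv simply returns whatever columns it finds.

```python
import csv
from typing import Dict, List


def read_tsv(path: str) -> List[Dict[str, str]]:
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters inside a field as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)
```

Inspecting the keys of the first returned row is a quick way to confirm the actual column names before building on them.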
