EDA and pre-processing methods for Wikipedia data dumps.

martmichals/wikipedia

wikipedia

Various projects using Wikipedia data dumps.

Getting Started

Run the data setup script, which downloads the Wikipedia data dump via torrent and then decompresses it. Expect this to take over an hour, in part because of time spent seeding for other downloaders.

    sudo apt update && sudo apt upgrade
    sudo apt install aria2
    cd scripts
    chmod +x ./setup.sh && ./setup.sh
    cd ..

Create a conda environment from wikipedia.yml:

    conda env create -f wikipedia.yml

If required, run wikiextractor to extract plain text from the Wikipedia articles. This step is optional, since some notebooks read the dump in its compressed form. Adjust --processes to the number of CPU cores available on your machine:

    wikiextractor \
        --processes 96 \
        --json \
        -o ./data/<dump-date>/json/ \
        ./data/<dump-date>/unzipped/enwiki-<dump-date-compressed>-pages-articles-multistream.xml
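
Once extraction finishes, the JSON output can be read back in Python. The following is a minimal sketch and not part of this repository. It assumes wikiextractor's usual --json layout: subdirectories (such as AA/) containing files named wiki_00, wiki_01, and so on, each holding one JSON object per line with at least id, title, and text fields. The function name iter_articles and the example path are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_articles(json_root: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per article from a wikiextractor --json output tree.

    Assumption: files are named wiki_* somewhere under json_root and
    contain one JSON object per line.
    """
    for path in sorted(Path(json_root).rglob("wiki_*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical path: replace <dump-date> with the dump you downloaded.
    root = "./data/<dump-date>/json/"
    for i, article in enumerate(iter_articles(root)):
        print(article.get("id"), article.get("title"), len(article.get("text", "")))
        if i >= 4:  # show only the first five articles
            break
```

Because the function is a generator, articles are streamed one at a time rather than loaded into memory together, which matters for a dump of this size.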

Python Environment

To save your Python environment (this overwrites the current environment file):

    conda env export > wikipedia.yml

WIKIR

To download the WIKIR datasets:

    cd data
    wget 'https://www.zenodo.org/record/3707606/files/enwikIR.zip?download=1'
    wget 'https://www.zenodo.org/record/3707238/files/enwikIRS.zip?download=1'

To unzip and organize:

    unzip 'enwikIR.zip?download=1' && unzip 'enwikIRS.zip?download=1'
    rm 'enwikIR.zip?download=1' 'enwikIRS.zip?download=1'
    mkdir WIKIR
    mv enwikIR WIKIR && mv enwikIRS WIKIR

WikiPassageQA

To download the WikiPassageQA dataset:

    cd data
    wget https://ciir.cs.umass.edu/downloads/wikipassageqa/WikiPassageQA.zip

To unzip and organize:

    unzip WikiPassageQA.zip -d WikiPassageQA && rm WikiPassageQA.zip
