Skip to content

kaushal2161/Formulae_Extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 

Repository files navigation

Formulae_Extractor

##Extract the formulae from Wikipedia articles with the title

To extract the formulae from all pages that contain math from enwiki you can do the following

git clone https://kaushal2161/Formulae_Extractor/
cd Formulae_Extractor
mkdir wout
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
./formulae_extractor.py

All options of formula_extractor can be seen via

./formulae_extractor.py --help
usage: formulae_extractor.py [-h] [-f FILE] [-s SIZE] [-d DIR] [-t TAG]

Extract Formulae from wikipages.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --filename FILE
                        the bz2-file to be split and filtered (default:
                        enwiki-latest-pages-articles.xml.bz2)
  -s SIZE, --splitsize SIZE
                        the number of pages contained in each split (default:
                        1000000)
  -d DIR, --outputdir DIR
                        the directory name where the files go (default: wout)
  -t TAG, --tagname TAG
                        the tag to search for (default: math)
  

About

Extract the formulae from Wikipedia articles with title

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages