Processing stack exchange data dumps to a dataset of mathematical formulas
- Python 3.7
Install python packages by running: pip install -r requirements.txt
cd main
- execute:
python main.py --input ../input/ --dumps test_dumps --download yes --extract yes --output ../output/database.db
- execute (only after main.py has been executed):
python context.py --input ../input/ --dumps test_dumps --download yes --database ../output/database.db --context 10 --topn 3 --tablename FormulaContext
- execute (only after main.py has been executed):
python statistics.py --dumps test_dumps --database ../output/database.db --output ../output/
The data from the database can be accessed via SQL commands, e.g.:
sqlite3
.open output/database.db
SELECT * FROM FormulasPosts WHERE FormulaId="5783548";
SELECT Site, Tag FROM Tags WHERE Count>"1500";
SELECT AnswerText.Site, AnswerText.AnswerId, AnswerText.Body
FROM AnswerText INNER JOIN AnswerMeta ON AnswerText.Site=AnswerMeta.Site
AND AnswerText.AnswerId=AnswerMeta.AnswerId AND AnswerMeta.Score="100";
SELECT max(Score) FROM QuestionMeta;
- input: Input directory of stackexchange dump *.7z files. Where they are or where there should be downloaded to.
- default= "../input/"
- dumps: A text file containing a list of stackexchange dump sites names to be processed.
- format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps.
- default="test_dumps" in main directory
- download: Whether or not the program should download the dumps.
- options: yes or no
- default="yes"
- extract: Whether or not to extract the *.7z dump files.
- options: yes or no
- default="yes"
- output: database output
- default='../output/database.db'
- all: Force to process all dumps, even if they have previously been processed and already exist in the database.
- options: yes or no
- default="no"
- The Dataset will be saved in a database (
*.db
) as specified by the input parameter 'output'. - Analysis/Statistics will be saved in file
statistics.log
in same directory as the database.
main.py
must previously been run for the dumps before running context.py
- input: Input directory of stackexchange dump files and directories
- default= "../input/"
- dumps: A text file containing a list of stackexchange dump sites names to be processed.
- format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already been processed into the database by running ```main.py``
- default="test_dumps" in main directory
- database: database input and output
- default='../output/database.db'
- context: The number of words around formula to be reagarded as possible context.
- options: an integer
- default="10"
- topn: The number of top terms in context regarding their tf-idf scores to be retrieved as formula context.
- default='3'
- corpus: Whether the corpus for idf ratings should be calculated over all sites or individually for each site.
- options: all or individual
- default="all"
- stopwords: Whether or not to use stopwords and which ones to use
- Options: none, english, or filename containing list of stopwords (e.g. stopwords). 'english' is the english stopword list of the nltk toolkit
- default="stopwords"
- tfidf: Whether or not to show tf-idf ratings with top context words
- options: yes, no
- default="yes"
- tablename: Name of table to write topn contexts words of formulas in (will be overwritten if it exists)
- default="FormulaContext"
- all: Get all words as context. This will lead to ignoring the values of input parameters for context and topn.
- options: yes or no
- default="no"
- The Context of the specified sites formulas will be saved in a database (
*.db
) as specified by the input parameter 'database' inside of the Table specified by the input parameter 'tablename'. - Analysis/Statistics will be saved in file
statistics.log
in same directory as the database.
main.py
must previously been run for the dumps before running parse_formulas.py
.
parse_formulas.py
will only parse formulas that have not been parsed due to the long runtime of LaTeXML.
- dumps: A text file containing a list of stackexchange dump sites names to be processed.
- format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already been processed into the database by running ```main.py``
- default="test_dumps" in main directory
- database: database input and output
- default='../output/database.db'
- mode: parse formulas in ContentMathML (cmml), PresentationMathML (pmml), or both
- options: cmml, pmml, both
- tree: Whether or not to calculate slt trees from pmml and opt from cmml.
- options: yes, no
- default="yes"
- threads: Number of threads to run parallel. One thread used to convert a single formula in MathML.
- options: integer
- default="20"
- The ContentMathML and/or PresentationMathML of the specified sites formulas will be saved in the database (
*.db
) as specified by the input parameter 'database' inside of the Tables 'FormulasPostsMathML' and 'FormulasCommentsMathML'. - Analysis/Statistics will be saved in file
statistics.log
in same directory as the database.
main.py
must previously been run for the dumps before running statistics.py
- dumps: A text file containing a list of stackexchange dump sites names to be processed.
- format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already been processed into the database by running ```main.py``
- default='test_dumps' in main directory
- database: Database file for data input.
- default='../output/database.db'
- output: The output directory.
- default='../output/'
- Directory '/diagrams/' in the output directory (as specified by the user) filled with diagrams of formula distributions for the sites specified by the user.
- HTML files displaying the diagram and some statistical values for each of the sites.