Processing Stack Exchange data dumps into a dataset of mathematical formulas
- Python 3.7
Install python packages by running: pip install -r requirements.txt
cd main
- execute:
python --input ../input/ --dumps test_dumps --download yes --extract yes --output ../output/database.db
- execute (only after has been executed):
python --input ../input/ --dumps test_dumps --download yes --database ../output/database.db --context 10 --topn 3 --tablename FormulaContext
- execute (only after has been executed):
python --dumps test_dumps --database ../output/database.db --output ../output/
The data from the database can be accessed via SQL commands, e.g.:
.open output/database.db
SELECT * FROM FormulasPosts WHERE FormulaId='5783548';
SELECT Site, Tag FROM Tags WHERE Count>'1500';
SELECT AnswerText.Site, AnswerText.AnswerId, AnswerText.Body
FROM AnswerText INNER JOIN AnswerMeta ON AnswerText.Site=AnswerMeta.Site
AND AnswerText.AnswerId=AnswerMeta.AnswerId AND AnswerMeta.Score='100';
SELECT max(Score) FROM QuestionMeta;
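The same queries can also be run from Python with the standard `sqlite3` module. The following is a minimal sketch, not part of the project: the helper name `tags_above` is hypothetical, and the toy in-memory `Tags` table (including its column types) only stands in for the real database so that the example runs on its own.

```python
import sqlite3


def tags_above(conn, min_count):
    """Return (Site, Tag) pairs whose Count exceeds min_count."""
    cur = conn.execute(
        "SELECT Site, Tag FROM Tags WHERE Count > ? ORDER BY Site, Tag",
        (min_count,),
    )
    return cur.fetchall()


# Toy stand-in for the real database; the column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tags (Site TEXT, Tag TEXT, Count INTEGER)")
conn.executemany(
    "INSERT INTO Tags VALUES (?, ?, ?)",
    [
        ("physics", "quantum-mechanics", 2100),
        ("physics", "optics", 300),
        ("math", "calculus", 1800),
    ],
)

rows = tags_above(conn, 1500)
print(rows)
conn.close()
```

To query the real data, the connection would instead be opened with `sqlite3.connect("output/database.db")`, matching the path used in the `.open` command above.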
- input: Input directory for the Stack Exchange dump *.7z files: either where they are located or where they should be downloaded to.
- default= "../input/"
- dumps: A text file containing a list of Stack Exchange dump site names to be processed.
- format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps.
- default="test_dumps" in main directory
- download: Whether or not the program should download the dumps.
- options: yes or no
- default="yes"
- extract: Whether or not to extract the *.7z dump files.
- options: yes or no
- default="yes"
- output: database output
- default='../output/database.db'
- all: Force processing of all dumps, even if they have previously been processed and already exist in the database.
- options: yes or no
- default="no"
- The dataset will be saved in a database, as specified by the input parameter 'output'.
- Analysis/statistics will be saved in a file in the same directory as the database.
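The parameters and defaults listed above could be declared with `argparse` roughly as follows. This is only a sketch under assumptions: the function names `build_parser` and `read_site_names` are hypothetical, and the one-site-per-line layout of the dumps file is assumed rather than documented.

```python
import argparse


def build_parser():
    """Declare the documented parameters with their documented defaults."""
    parser = argparse.ArgumentParser(
        description="Process Stack Exchange dumps into a formula database."
    )
    parser.add_argument("--input", default="../input/",
                        help="directory of the *.7z dump files")
    parser.add_argument("--dumps", default="test_dumps",
                        help="text file listing the dump sites to process")
    parser.add_argument("--download", choices=["yes", "no"], default="yes")
    parser.add_argument("--extract", choices=["yes", "no"], default="yes")
    parser.add_argument("--output", default="../output/database.db",
                        help="database output file")
    parser.add_argument("--all", choices=["yes", "no"], default="no",
                        help="reprocess dumps already in the database")
    return parser


def read_site_names(text):
    """Parse the dumps list, assuming one site name per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Passing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--download", "no"])
sites = read_site_names("physics\n\nmath\n")
print(args.download, args.extract, args.output, sites)
```

Restricting the yes/no parameters with `choices` makes a mistyped value fail at startup instead of being silently treated as "no".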
must previously been run for the dumps before running
- input: Input directory of stackexchange dump files and directories
- default= "../input/"
- dumps: A text file containing a list of Stack Exchange dump site names to be processed.
- format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already have been processed into the database by running `````
- default="test_dumps" in main directory
- database: database input and output
- default='../output/database.db'
- context: The number of words around a formula to be regarded as possible context.
- options: an integer
- default="10"
- topn: The number of top-ranked context terms, ordered by tf-idf score, to be retrieved as formula context.
- default='3'
- corpus: Whether the corpus for idf ratings should be calculated over all sites or individually for each site.
- options: all or individual
- default="all"
- stopwords: Whether or not to use stopwords and which ones to use
- options: none, english, or the name of a file containing a list of stopwords (e.g. stopwords). 'english' is the English stopword list of the NLTK toolkit.
- default="stopwords"
- tfidf: Whether or not to show tf-idf ratings with top context words
- options: yes, no
- default="yes"
- tablename: Name of the table to write the topn context words of the formulas to (will be overwritten if it exists)
- default="FormulaContext"
- all: Get all words as context. This will lead to ignoring the values of input parameters for context and topn.
- options: yes or no
- default="no"
- The context of the specified sites' formulas will be saved in a database, as specified by the input parameter 'database', inside the table specified by the input parameter 'tablename'.
- Analysis/statistics will be saved in a file in the same directory as the database.
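How the 'context', 'topn', and 'stopwords' parameters interact can be illustrated with a small tf-idf sketch. It is not the project's implementation: the function names, the whitespace tokenization, the `FORMULA` placeholder token, the reading of 'context' as a window on each side of the formula, and the plain `log(N / df)` idf variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def context_window(tokens, formula_index, size):
    """Return up to `size` tokens on each side of the formula token."""
    left = tokens[max(0, formula_index - size):formula_index]
    right = tokens[formula_index + 1:formula_index + 1 + size]
    return left + right


def idf_table(documents):
    """Compute idf(t) = log(N / df(t)) over a list of token lists."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term once per document
    return {term: math.log(n / count) for term, count in df.items()}


def top_terms(context, idf, topn, stopwords=frozenset()):
    """Rank context terms by tf * idf; ties are broken alphabetically."""
    tf = Counter(t for t in context if t not in stopwords)
    scored = [(term, tf[term] * idf.get(term, 0.0)) for term in tf]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:topn]


# Toy corpus; FORMULA marks the position of a formula in each post.
docs = [
    "the energy of a photon is FORMULA where h is the planck constant".split(),
    "the area of a circle is FORMULA where r is the radius".split(),
    "the force is FORMULA by the second law of newton".split(),
]
idf = idf_table(docs)
ctx = context_window(docs[0], docs[0].index("FORMULA"), 3)
terms = [term for term, _ in top_terms(ctx, idf, topn=2)]
print(ctx, terms)
```

Words that occur in every document, such as "is", receive an idf of zero and drop to the bottom of the ranking. Computing `idf_table` over the posts of all sites or over one site at a time corresponds to the two values of the 'corpus' parameter.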
must previously been run for the dumps before running
will only parse formulas that have not yet been parsed, because of the long runtime of LaTeXML.
- dumps: A text file containing a list of Stack Exchange dump site names to be processed.
- format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already have been processed into the database by running `````
- default="test_dumps" in main directory
- database: database input and output
- default='../output/database.db'
- mode: parse formulas in ContentMathML (cmml), PresentationMathML (pmml), or both
- options: cmml, pmml, both
- tree: Whether or not to calculate symbol layout trees (SLT) from pmml and operator trees (OPT) from cmml.
- options: yes, no
- default="yes"
- threads: Number of threads to run in parallel. One thread is used to convert a single formula to MathML.
- options: integer
- default="20"
- The ContentMathML and/or PresentationMathML of the specified sites' formulas will be saved in the database, as specified by the input parameter 'database', inside the tables 'FormulasPostsMathML' and 'FormulasCommentsMathML'.
- Analysis/statistics will be saved in a file in the same directory as the database.
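The combination of skipping already parsed formulas and converting the rest with one thread per formula can be sketched with a thread pool. This is an illustration under assumptions, not the project's code: `parse_pending`, `latexml_pmml`, and `fake_convert` are hypothetical names, and the `latexmlmath --pmml=-` call assumes the LaTeXML command-line tools are installed; the flags the project actually passes are not stated here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def latexml_pmml(formula):
    """Convert one LaTeX formula to Presentation MathML.

    Assumes the latexmlmath executable from LaTeXML is on the PATH.
    """
    result = subprocess.run(
        ["latexmlmath", "--pmml=-", formula],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_pending(formulas, already_parsed, convert, threads=20):
    """Convert only the formulas whose id is not in already_parsed."""
    pending = {fid: tex for fid, tex in formulas.items()
               if fid not in already_parsed}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so ids and results line up.
        results = pool.map(convert, pending.values())
        return dict(zip(pending.keys(), results))


def fake_convert(tex):
    """Stand-in converter so the example runs without LaTeXML."""
    return "<math>" + tex + "</math>"


formulas = {1: "a+b", 2: "x^2", 3: "\\frac{1}{2}"}
parsed = parse_pending(formulas, already_parsed={2},
                       convert=fake_convert, threads=4)
print(parsed)
```

Threads are a reasonable fit in this sketch because each conversion spends its time waiting on an external process; passing `convert=latexml_pmml` would run the real conversion.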
must previously been run for the dumps before running
- dumps: A text file containing a list of Stack Exchange dump site names to be processed.
- format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already have been processed into the database by running `````
- default='test_dumps' in main directory
- database: Database file for data input.
- default='../output/database.db'
- output: The output directory.
- default='../output/'
- A directory '/diagrams/' inside the output directory (as specified by the user), containing diagrams of the formula distributions for the sites specified by the user.
- HTML files displaying the diagrams and some statistical values for each of the sites.