Stack Exchange Dataset of Mathematical Formulas

Processes Stack Exchange data dumps into a dataset of mathematical formulas.

Requirements

Python 3.7

Python Packages

Install the Python packages by running: pip install -r requirements.txt

Other tools

Dataset

Creating the Dataset

cd main
Run: python main.py --input ../input/ --dumps test_dumps --download yes --extract yes --output ../output/database.db
Then run (only after main.py has finished): python context.py --input ../input/ --dumps test_dumps --download yes --database ../output/database.db --context 10 --topn 3 --tablename FormulaContext
Then run (only after main.py has finished): python statistics.py --dumps test_dumps --database ../output/database.db --output ../output/
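
The three steps above can also be chained from a small Python script. This is only a sketch: the arguments are copied from the commands above, while the helper names `build_commands` and `run_pipeline` are illustrative and not part of the repository.

```python
import subprocess
import sys

# Shared arguments, copied from the commands above.
DUMPS = "test_dumps"
INPUT_DIR = "../input/"
DATABASE = "../output/database.db"


def build_commands(python=sys.executable):
    """Return the three pipeline commands in the order they must run."""
    return [
        [python, "main.py", "--input", INPUT_DIR, "--dumps", DUMPS,
         "--download", "yes", "--extract", "yes", "--output", DATABASE],
        [python, "context.py", "--input", INPUT_DIR, "--dumps", DUMPS,
         "--download", "yes", "--database", DATABASE,
         "--context", "10", "--topn", "3", "--tablename", "FormulaContext"],
        [python, "statistics.py", "--dumps", DUMPS,
         "--database", DATABASE, "--output", "../output/"],
    ]


def run_pipeline(cwd="main"):
    """Run each step inside `cwd`; check=True stops at the first failure."""
    for command in build_commands():
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `run_pipeline()` from the repository root runs main.py first, so context.py and statistics.py never start against a database that has not been built.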

Using the Dataset

The data from the database can be accessed via SQL commands, e.g.:

sqlite3
.open output/database.db
SELECT * FROM FormulasPosts WHERE FormulaId = 5783548;
SELECT Site, Tag FROM Tags WHERE Count > 1500;
SELECT AnswerText.Site, AnswerText.AnswerId, AnswerText.Body
   FROM AnswerText INNER JOIN AnswerMeta ON AnswerText.Site = AnswerMeta.Site
   AND AnswerText.AnswerId = AnswerMeta.AnswerId
   WHERE AnswerMeta.Score = 100;
SELECT max(Score) FROM QuestionMeta;
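
The same queries can be issued from Python with the standard `sqlite3` module. The sketch below wraps the Tags query in a function. The table and column names (Tags, Site, Tag, Count) come from the example above. The column types and the rows in `demo_tags_db` are invented for illustration and are not taken from the real dataset.

```python
import sqlite3


def frequent_tags(conn, min_count=1500):
    """Return (Site, Tag) rows whose Count exceeds min_count."""
    query = "SELECT Site, Tag FROM Tags WHERE Count > ? ORDER BY Site, Tag"
    return conn.execute(query, (min_count,)).fetchall()


def demo_tags_db():
    """Build an in-memory stand-in for the Tags table (assumed schema, made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Tags (Site TEXT, Tag TEXT, Count INTEGER)")
    conn.executemany(
        "INSERT INTO Tags VALUES (?, ?, ?)",
        [("mathematics", "calculus", 2000), ("physics", "optics", 40)],
    )
    return conn
```

Against the real database, the call would be `frequent_tags(sqlite3.connect("output/database.db"))`. Passing the threshold as a bound parameter (`?`) avoids quoting mistakes in the SQL text.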

`main.py`

Parameters of `main.py`

input: Input directory for the Stack Exchange dump *.7z files: where they are located, or where they should be downloaded to.
- default= "../input/"
dumps: A text file containing a list of Stack Exchange dump site names to be processed.
- format: a file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps.
- default="test_dumps" in main directory
download: Whether or not the program should download the dumps.
- options: yes or no
- default="yes"
extract: Whether or not to extract the *.7z dump files.
- options: yes or no
- default="yes"
output: Path of the output database file.
- default='../output/database.db'
all: Force processing of all dumps, even if they have been processed before and already exist in the database.
- options: yes or no
- default="no"
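
For orientation, the options listed above could be declared with `argparse` roughly as follows. This is a sketch built only from the parameter list in this README. The real parser in main.py may be written differently, and the flag spelling `--all` is an assumption (the other flags appear in the commands under "Creating the Dataset").

```python
import argparse


def build_parser():
    """Declare the main.py options with the defaults documented above."""
    parser = argparse.ArgumentParser(description="Sketch of the main.py options")
    parser.add_argument("--input", default="../input/",
                        help="directory of the *.7z dump files")
    parser.add_argument("--dumps", default="test_dumps",
                        help="text file listing the dump sites to process")
    parser.add_argument("--download", choices=["yes", "no"], default="yes",
                        help="download the dumps first")
    parser.add_argument("--extract", choices=["yes", "no"], default="yes",
                        help="extract the *.7z files")
    parser.add_argument("--output", default="../output/database.db",
                        help="path of the output database")
    # Assumed flag name for the 'all' parameter.
    parser.add_argument("--all", choices=["yes", "no"], default="no",
                        help="reprocess dumps already in the database")
    return parser
```

With `choices`, a value other than yes or no is rejected by the parser before any processing starts.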

Outputs of `main.py`

The dataset will be saved in a database (*.db) at the path given by the input parameter 'output'.
Analysis and statistics will be saved in the file statistics.log, in the same directory as the database.

`context.py`

main.py must have been run for the dumps before running context.py.

Parameters of `context.py`

input: Input directory of the Stack Exchange dump files and directories.
- default= "../input/"
dumps: A text file containing a list of Stack Exchange dump site names to be processed.
- format: a file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already have been processed into the database by running `main.py`.
- default="test_dumps" in main directory
database: Database file used for both input and output.
- default='../output/database.db'
context: The number of words around a formula to be regarded as possible context.
- options: an integer
- default="10"
topn: The number of context terms with the highest tf-idf scores to be retrieved as formula context.
- default='3'
corpus: Whether the corpus for idf ratings should be calculated over all sites or individually for each site.
- options: all or individual
- default="all"
stopwords: Whether or not to use stopwords, and which ones to use.
- options: none, english, or the name of a file containing a list of stopwords (e.g. stopwords). 'english' is the English stopword list of the NLTK toolkit.
- default="stopwords"
tfidf: Whether or not to show tf-idf ratings with top context words
- options: yes, no
- default="yes"
tablename: Name of the table to write the topn context words of the formulas to (will be overwritten if it exists).
- default="FormulaContext"
all: Get all words as context. When set to yes, the values of the input parameters context and topn are ignored.
- options: yes or no
- default="no"
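
To illustrate how the context and topn parameters interact, here is a minimal tf-idf sketch in plain Python. It is not the code of context.py. It assumes that context counts words on each side of the formula, and it uses a smoothed idf, log((1 + N) / (1 + df)) + 1, which may differ from the weighting the project uses. The function names are illustrative.

```python
import math
from collections import Counter


def context_window(tokens, formula_index, size):
    """Return up to `size` tokens on each side of the formula position."""
    start = max(0, formula_index - size)
    before = tokens[start:formula_index]
    after = tokens[formula_index + 1:formula_index + 1 + size]
    return before + after


def top_context_terms(window, corpus, topn, stopwords=frozenset()):
    """Rank the window's terms by tf-idf against `corpus` (a list of token lists)."""
    terms = [t.lower() for t in window if t.lower() not in stopwords]
    tf = Counter(terms)
    n_docs = len(corpus)
    doc_sets = [set(w.lower() for w in doc) for doc in corpus]
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in doc_sets if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (assumption)
        scores[term] = (count / len(terms)) * idf
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:topn]
```

A term that occurs in every document of the corpus gets the lowest idf, so frequent filler words drop out of the top terms even without a stopword list. Whether the corpus spans all sites or a single site corresponds to the corpus parameter.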

Outputs of `context.py`

The context of the specified sites' formulas will be saved in the database (*.db) given by the input parameter 'database', inside the table given by the input parameter 'tablename'.
Analysis and statistics will be saved in the file statistics.log, in the same directory as the database.

`parse_formulas.py`

main.py must have been run for the dumps before running parse_formulas.py.

Because LaTeXML has a long runtime, parse_formulas.py only parses formulas that have not been parsed yet.
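
One way to find the formulas that still need parsing is a LEFT JOIN between the formula table and the MathML table. The query below is a sketch, not the script's actual logic. The table names FormulasPosts and FormulasPostsMathML and the column FormulaId appear in this README. The assumption that both tables are keyed by (Site, FormulaId) and the demo rows are hypothetical.

```python
import sqlite3

# Formulas with no matching MathML row have not been parsed yet.
UNPARSED_QUERY = """
SELECT f.Site, f.FormulaId
FROM FormulasPosts AS f
LEFT JOIN FormulasPostsMathML AS m
       ON m.Site = f.Site AND m.FormulaId = f.FormulaId
WHERE m.FormulaId IS NULL
ORDER BY f.Site, f.FormulaId
"""


def unparsed_formulas(conn):
    """Return (Site, FormulaId) pairs that have no MathML entry."""
    return conn.execute(UNPARSED_QUERY).fetchall()


def demo_formulas_db():
    """In-memory stand-in with an assumed two-column schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE FormulasPosts (Site TEXT, FormulaId INTEGER);
        CREATE TABLE FormulasPostsMathML (Site TEXT, FormulaId INTEGER);
        INSERT INTO FormulasPosts VALUES ('demo', 1), ('demo', 2);
        INSERT INTO FormulasPostsMathML VALUES ('demo', 1);
    """)
    return conn
```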

Parameters of `parse_formulas.py`

dumps: A text file containing a list of Stack Exchange dump site names to be processed.
- format: a file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already have been processed into the database by running `main.py`.
- default="test_dumps" in main directory
database: Database file used for both input and output.
- default='../output/database.db'
mode: Parse formulas into Content MathML (cmml), Presentation MathML (pmml), or both.
- options: cmml, pmml, both
tree: Whether or not to calculate symbol layout trees (SLT) from pmml and operator trees (OPT) from cmml.
- options: yes, no
- default="yes"
threads: Number of threads to run in parallel. Each thread converts a single formula to MathML at a time.
- options: integer
- default="20"
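
The threads parameter describes a one-formula-per-task pattern, which can be sketched with `concurrent.futures`. The converter below is a hypothetical stub that only wraps the LaTeX string. In the real script each task would call LaTeXML, and how it does so is not shown in this README.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_stub(latex):
    """Hypothetical stand-in for a LaTeXML call; returns placeholder markup."""
    return "<math>" + latex + "</math>"


def convert_all(formulas, convert=convert_stub, threads=20):
    """Convert {formula_id: latex} in parallel; each task handles one formula."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        results = pool.map(convert, formulas.values())
        return dict(zip(formulas.keys(), results))
```

Threads are a reasonable fit when each conversion mostly waits on an external process. The default of 20 mirrors the documented default for threads.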

Outputs of `parse_formulas.py`

The Content MathML and/or Presentation MathML of the specified sites' formulas will be saved in the database (*.db) given by the input parameter 'database', inside the tables 'FormulasPostsMathML' and 'FormulasCommentsMathML'.
Analysis and statistics will be saved in the file statistics.log, in the same directory as the database.

`statistics.py`

main.py must have been run for the dumps before running statistics.py.

Parameters of `statistics.py`

dumps: A text file containing a list of Stack Exchange dump site names to be processed.
- format: a file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already have been processed into the database by running `main.py`.
- default='test_dumps' in main directory
database: Database file for data input.
- default='../output/database.db'
output: The output directory.
- default='../output/'

Outputs of `statistics.py`

A directory 'diagrams/' inside the output directory (as specified by the user), containing diagrams of the formula distributions for the specified sites.
HTML files displaying the diagrams and some statistical values for each of the sites.

noemiernst/StackExchangeMathDataset
