### **Semantic Math Types from Unstructured Text**

This notebook gives an overview of the Mathematics Stack Exchange (MSE) dataset and showcases semantic annotations of mathematical expressions.

In [22]:
from mse_db import MSE_DBS  # provides an interface for interaction with a local MongoDB instance
import funcs                # provides data processing functions
from sem_math import (
    PostThread,
    FormulaContextType,
    FormulaType,
    Comparer,
)

The PostThread class encapsulates one question post and all corresponding answers. The three other classes (FormulaContextType, FormulaType and Comparer) are used to determine a semantic type.

#### **1. Dataset Overview**

The basic unit of the processed MSE dataset is a PostThread instance, which holds one question post and all of its answer posts in the **posts** attribute.

![Post Thread Entry in DB](images/post_thread_entry.png)

A **post** entry includes general post metadata, a set of tags, and a **PostTypeId**, which is 1 for question posts and 2 for answer posts.

![Post Entry in DB](images/post_entry.png)
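
As a minimal sketch of how the **PostTypeId** convention can be used, the snippet below splits the posts of one thread into the question and its answers. The thread is modeled as a plain dictionary, and the value is assumed to be stored as an integer. Apart from `posts` and `PostTypeId`, every field name (`Id`, `Body`) and the helper `split_thread` are assumptions made for illustration; they are not the actual database schema or the `PostThread` API.

```python
QUESTION_TYPE_ID = 1  # PostTypeId of a question post
ANSWER_TYPE_ID = 2    # PostTypeId of an answer post


def split_thread(thread):
    """Return (question, answers) for a thread given as a plain dict.

    Hypothetical helper: assumes thread["posts"] is a list of dicts
    that each carry an integer PostTypeId.
    """
    posts = thread.get("posts", [])
    questions = [p for p in posts if p.get("PostTypeId") == QUESTION_TYPE_ID]
    answers = [p for p in posts if p.get("PostTypeId") == ANSWER_TYPE_ID]
    question = questions[0] if questions else None
    return question, answers


# Made-up example thread; "Id" and "Body" are illustrative field names.
example_thread = {
    "posts": [
        {"Id": 1, "PostTypeId": 1, "Body": "How do I solve $x^2 = 4$?"},
        {"Id": 2, "PostTypeId": 2, "Body": "Take square roots: $x = \\pm 2$."},
        {"Id": 3, "PostTypeId": 2, "Body": "Factor $x^2 - 4$."},
    ]
}

question, answers = split_thread(example_thread)
print(question["Id"], [a["Id"] for a in answers])
```

Returning `None` for a missing question keeps the helper usable on incomplete threads, should any exist in the data.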

In [23]:
# Raw strings keep the backslashes literal and avoid invalid-escape warnings.
log_file_name = r"conf\log.txt"                          # processing log
db_settings_file_name = r"conf\db_conf.json"             # settings file for the database connection (local)

data = MSE_DBS(db_settings_file_name, log_file_name) 
total_count = data.apply_once("threads", funcs.count_all_post_threads)      # counts all documents in threads collection -> ENTIRE DATASET
print("ENTIRE dataset has {} post-threads".format(total_count))
data.reset_count()

ENTIRE dataset has 1502850 post-threads


In [24]:
sel_coll_names = ["algebra-precalculus", "analytic-geometry", "elementary-functions", "elementary-number-theory",
                  "elementary-set-theory", "euclidean-geometry", "trigonometry"]

sel_data_size = 0
for coll in sel_coll_names:
    coll_size = data.apply_once(coll, funcs.count_all_post_threads)
    posts_len, posts_av = data.apply_once(coll, funcs.count_av_all_posts_once)  
    print("\"{}\" has {} post-threads. \t |  Total number of posts: {} with average of {:.2f} for each post-thread".format(coll,coll_size, posts_len, posts_av))
    sel_data_size += coll_size

print("\n")
print("selected data includes a total of {} post-threads".format(sel_data_size))
print("selected data is {:.2f} % of ENTIRE dataset".format(100* (sel_data_size/total_count)))

"algebra-precalculus" has 43604 post-threads. 	 |  Total number of posts: 126404 with average of 2.90 for each post-thread
"analytic-geometry" has 5934 post-threads. 	 |  Total number of posts: 14297 with average of 2.41 for each post-thread
"elementary-functions" has 515 post-threads. 	 |  Total number of posts: 1225 with average of 2.38 for each post-thread
"elementary-number-theory" has 34454 post-threads. 	 |  Total number of posts: 90098 with average of 2.62 for each post-thread
"elementary-set-theory" has 26535 post-threads. 	 |  Total number of posts: 66175 with average of 2.49 for each post-thread
"euclidean-geometry" has 8188 post-threads. 	 |  Total number of posts: 19052 with average of 2.33 for each post-thread
"trigonometry" has 27356 post-threads. 	 |  Total number of posts: 75653 with average of 2.77 for each post-thread


selected data includes a total of 146586 post-threads
selected data is 9.75 % of ENTIRE dataset


The project uses only this subset because it is large enough to demonstrate a working prototype and because questions and answers in the selected categories cover the most common and widely used mathematical types and notations.
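
The size of the subset can be double-checked without a database connection. The sketch below recomputes the totals from the counts printed above and additionally shows how the selected post-threads are distributed across the seven collections. The variable names are illustrative and not part of `mse_db` or `funcs`.

```python
# Post-thread counts per selected collection, copied from the output above.
thread_counts = {
    "algebra-precalculus": 43604,
    "analytic-geometry": 5934,
    "elementary-functions": 515,
    "elementary-number-theory": 34454,
    "elementary-set-theory": 26535,
    "euclidean-geometry": 8188,
    "trigonometry": 27356,
}
total_threads = 1502850  # size of the entire dataset, from the output above

selected_threads = sum(thread_counts.values())
share = 100 * selected_threads / total_threads
print("{} post-threads, {:.2f} % of the dataset".format(selected_threads, share))

# Share of each collection within the selected subset, largest first.
for name, count in sorted(thread_counts.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<26} {:6.2f} %".format(name, 100 * count / selected_threads))
```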

#### **2. Formula Occurrence**

The **Post Thread** entry also includes a **formulas** attribute: a list of all extracted formulas (mathematical expressions). Each formula entry has an **id** and a **latex string**, plus a **mathematical type**, a **relevant string** (if one was found) that describes the value, and a **decision string** that explains how the program determined the type.

![Formula Entry in DB](./images/formulas_entry.PNG)
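
To make the shape of a formula entry concrete, here is a small sketch that models the five fields as a Python dataclass and tallies entries by type. The class `FormulaEntry`, its field names, and the type labels `type-A` and `type-B` are placeholders chosen for illustration; the real key names and type vocabulary are defined by the database schema and `sem_math`, which are not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormulaEntry:
    """Illustrative stand-in for one formula entry; field names are assumed."""
    formula_id: int
    latex: str
    math_type: str
    relevant_string: Optional[str]  # None when no describing text was found
    decision: str


# Made-up entries; the type labels are placeholders, not the project's types.
formulas = [
    FormulaEntry(1, "x", "type-A", "a real number", "matched context text"),
    FormulaEntry(2, "f(x) = x^2", "type-B", None, "inferred from LaTeX"),
    FormulaEntry(3, "y", "type-A", None, "inferred from LaTeX"),
]

# How often each type occurs, and which formulas came with a relevant string.
type_counts = Counter(f.math_type for f in formulas)
with_context = [f.formula_id for f in formulas if f.relevant_string is not None]
print(type_counts, with_context)
```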


In [26]:
sel_coll_names = ["algebra-precalculus", "analytic-geometry", "elementary-functions", "elementary-number-theory",
                  "elementary-set-theory", "euclidean-geometry", "trigonometry"]

sel_formulas_total = 0
for coll in sel_coll_names:
    num_formulas = data.apply_once(coll, funcs.count_all_formulas_once)
    av_formulas = data.apply_once(coll, funcs.formulas_av_once)  
    print("\"{}\" has {} formulas with an average of {:.2f} formulas for each post-thread".format(coll,num_formulas, av_formulas))
    sel_formulas_total += num_formulas

print("\n")
print("selected data has a total of {} formulas".format(sel_formulas_total))

"algebra-precalculus" has 1014466 formulas with an average of 23.27 formulas for each post-thread
"analytic-geometry" has 145334 formulas with an average of 24.49 formulas for each post-thread
"elementary-functions" has 11125 formulas with an average of 21.60 formulas for each post-thread
"elementary-number-theory" has 1144251 formulas with an average of 33.21 formulas for each post-thread
"elementary-set-theory" has 715175 formulas with an average of 26.95 formulas for each post-thread
"euclidean-geometry" has 224931 formulas with an average of 27.47 formulas for each post-thread
"trigonometry" has 565353 formulas with an average of 20.67 formulas for each post-thread


selected data has a total of 3820635 formulas


#### **3. Semantic Types Extraction**

#### **4. Post Titles**