# **Root Diary**

## **The Question**

This project started with a simple question: how many pages on Wikipedia are related to mathematics?

As it turns out, the answer is not so simple.

After searching the Mathematics category page, I found this line:

But after reaching out to the category page editors, I was told that the number had not been updated since 2015, and that other numbers floating around from about the same time estimated the count at around 17,000 and 21,000. The editors quickly deleted the line.


## **The Problem**

Wikipedia's system of categorizing content is all over the place. 

#### _Wikipedia Methods of Categorizing:_

- [Portal page](https://en.wikipedia.org/wiki/Portal:Contents/Portals#Mathematics_and_logic)
- [Contents page](https://en.wikipedia.org/wiki/Portal:Contents/Mathematics_and_logic)
- [Category page](https://en.wikipedia.org/wiki/Category:Mathematics)
- [Outline page](https://en.wikipedia.org/wiki/Outline_of_mathematics)
- [Areas of Mathematics page](https://en.wikipedia.org/wiki/Areas_of_mathematics)
- [Indices](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Overviews](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Glossaries](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)

Of these, mathematics is organized using a subset of these structures:

- [Mathematics Portal page](https://en.wikipedia.org/wiki/Portal:Contents/Portals#Mathematics_and_logic)
- [Mathematics Contents page](https://en.wikipedia.org/wiki/Portal:Contents/Mathematics_and_logic)
- [Mathematics Category page](https://en.wikipedia.org/wiki/Category:Mathematics)
- [Mathematics Outline page](https://en.wikipedia.org/wiki/Outline_of_mathematics)
- [Areas of Mathematics page](https://en.wikipedia.org/wiki/Areas_of_mathematics)
- [Mathematics Lists](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [List of lists of lists](https://en.wikipedia.org/wiki/List_of_lists_of_lists#Mathematics_and_logic)

This also includes the [list of mathematics topics page](https://en.wikipedia.org/wiki/Lists_of_mathematics_topics)


#### _Wikidata Methods of Categorizing:_

Outside of the Wikipedia project, the Wikimedia Foundation also hosts the Wikidata project. This project is based on a graph database (queried using the SPARQL language) that is community-edited to organize data based on their interconnections.

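```sparql
# Q395 = mathematics; P361 = part of; P279 = subclass of;
# P31 = instance of; P921 = main subject
SELECT DISTINCT ?item ?article ?sitelink ?linkTo WHERE {
  # items that are (transitively) part of mathematics
  { ?item wdt:P361* wd:Q395 . }
  UNION
  # items that are part of mathematics or of one of its subclasses
  { ?item wdt:P361/wdt:P279* wd:Q395 . }
  UNION
  # instances of Q1936384 or of any of its subclasses
  { ?item wdt:P31/wdt:P279* wd:Q1936384 . }
  UNION
  # items whose main subject is mathematics or a subclass of it
  { ?item wdt:P921/wdt:P279* wd:Q395 . }
  # English Wikipedia article about the item, with its title as ?sitelink
  OPTIONAL {
    ?sitelink ^schema:name ?article .
    ?article schema:about ?item ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }
  OPTIONAL { ?item wdt:P361 ?linkTo . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```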
While graph databases are dope, the data is not entirely reliable when humans create the nodes and relationships by hand.

#### _Wikipedia Vital Articles Page_
Additionally, some Wikipedians have been hand-selecting Wikipedia articles to be given the designation of a "Vital article." The page ["Wikipedia:Vital_articles"](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles) includes a subcategory of [vital mathematics articles](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/5/Mathematics). With a stated goal of compiling 1,100 articles, its current count of only 489 articles reflects the difficulty of manually sorting articles into categories.

## **Data Wrangling**

### _Wikidata Search_

While the results from the Wikidata database search did not answer our original question, we can use the results to create a seed dataset.

The query returns about 800 articles. While that may be enough to train a Natural Language Processing model, more data is always better.

Even though the Wikidata query is complex and not likely something that an average Wikimedia user is capable of writing, we will use this number as our evaluation metric. This metric represents the ability of Wikimedia contributors to structure content around a major category.

### _Webscraping_

To accomplish this, we can turn to scraping Wikipedia category structures. For each category structure page, we can find all the internal wiki links and store them in a container. This represents a depth search of one. Next, scraping the pages in the container for more internal wiki links results in a depth search of two. The number of iterations we perform on the articles in our container determines the depth of our search. This can result in a large set of links, but it is a bit crude, since links can be arbitrarily added to a page.
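The depth search described here is a breadth-first traversal with a depth limit. Below is a minimal sketch of that idea, not the scraper used in this project. `crawl_links`, `get_links`, and the toy link graph are hypothetical names and made-up data; in a real scraper `get_links` would fetch a page and return its internal wiki links.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl_links(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
) -> Dict[str, int]:
    """Return {page: depth at which it was first reached}; seeds are depth 0."""
    depths = {seed: 0 for seed in seeds}
    queue = deque(depths)
    while queue:
        page = queue.popleft()
        if depths[page] >= max_depth:
            continue  # do not expand pages that sit at the depth limit
        for link in get_links(page):
            if link not in depths:  # the first visit is the shallowest one
                depths[link] = depths[page] + 1
                queue.append(link)
    return depths


# Toy link graph standing in for scraped pages (made-up data).
TOY_LINKS = {
    "Outline of mathematics": ["Algebra", "Geometry"],
    "Algebra": ["Group theory", "Geometry"],
    "Geometry": ["Topology"],
    "Group theory": ["Pokémon"],
}

found = crawl_links(
    ["Outline of mathematics"],
    lambda page: TOY_LINKS.get(page, []),
    max_depth=2,
)
```

Keeping the depth next to each page costs nothing during the crawl and makes it possible to filter or weight pages by depth later.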

While web scraping is fun, contributors to the Wikimedia projects have already built a tool that does this for us. It is called [PetScan](https://petscan.wmflabs.org/).



https://en.wikipedia.org/wiki/Special:CategoryTree

### _Node Weighting_
The deeper we traverse each tree, the more likely we are to find pages that are not highly correlated with our target category and that add noise to the machine learning model during training. Since we have tree depth data, we can give nodes close to the root a higher weighting than nodes farther away from it. During the scraping process, we keep track of the depth when saving each node to a file.
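One simple way to turn depth into a weight is geometric decay. This is only a sketch under assumptions: the function `depth_weight`, the decay rate of 0.5, and the sample rows are all hypothetical, and the project may end up using a different scheme.

```python
def depth_weight(depth: int, decay: float = 0.5) -> float:
    """Weight 1.0 at the root, multiplied by `decay` for each level below it."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return decay ** depth


# (title, depth) pairs as they might be saved during scraping (made-up rows).
nodes = [("Mathematics", 0), ("Algebra", 1), ("Group theory", 2), ("Pokémon", 6)]
weights = {title: depth_weight(depth) for title, depth in nodes}
```

Weights like these could then be handed to a classifier as per-sample weights, so that a stray depth-six page counts for far less than a page next to the root.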

ADD AN IMAGE OF TREES WITH A GRADIENT AWAY FROM ROOT

### _PetScan searching_

Contributors to the Wikimedia Foundation have built a tool for finding articles based on categories. It is a powerful tool, but it is too crude to answer our original question. Instead, we will use it to find our seed dataset. 

PetScan takes in a category parameter and a depth of search parameter. This roughly translates to the same process that we would have used if we were scraping Wikipedia ourselves.

Searching the Mathematics category with a depth of one yields 1,216 results. Kinda disappointing. A depth search of two yields 11,393 results, and a depth search of three yields 34,013. Well, that escalates quickly. However, with larger depths, we are also getting more noise in the results. We'll pick a depth search of one, and supplement this set by also searching mathematics subtopics using PetScan. After searching the Algebra, Arithmetic, Calculus, Combinatorics, Game Theory, Geometry, Linear Algebra, Probability, Statistics, and Topology categories, each with a depth of one, we have a total of about 10,000 results to add to our Wikidata results.
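The searches above were run through the PetScan web form, but the same queries could be expressed as URLs. The sketch below only builds the URLs and does not fetch anything. The query parameter names (`language`, `project`, `categories`, `depth`, `format`, `doit`) are my assumption about PetScan's URL interface and should be checked against the tool itself before use. `petscan_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

PETSCAN_URL = "https://petscan.wmflabs.org/"


def petscan_query_url(category: str, depth: int = 1) -> str:
    """Build a PetScan URL for one category (parameter names are assumptions)."""
    params = {
        "language": "en",
        "project": "wikipedia",
        "categories": category,
        "depth": depth,
        "format": "json",
        "doit": "1",
    }
    return PETSCAN_URL + "?" + urlencode(params)


SUBTOPICS = [
    "Algebra", "Arithmetic", "Calculus", "Combinatorics", "Game theory",
    "Geometry", "Linear algebra", "Probability", "Statistics", "Topology",
]
urls = [petscan_query_url(name, depth=1) for name in ["Mathematics"] + SUBTOPICS]
```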

### _Putting it All Together_

After combining all of the articles we gathered, we can delete any duplicates and make a count of our bounty.
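Combining and de-duplicating is a set union once titles are written the same way. Here is a minimal sketch, assuming the only difference between sources is underscores versus spaces and stray whitespace. `normalize_title`, `merge_titles`, and the sample titles are hypothetical.

```python
def normalize_title(title: str) -> str:
    """Treat underscores and spaces as the same and trim whitespace."""
    return title.replace("_", " ").strip()


def merge_titles(*sources):
    """Union several title lists, dropping duplicates after normalization."""
    seen = set()
    merged = []
    for source in sources:
        for title in source:
            key = normalize_title(title)
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Made-up sample rows standing in for the Wikidata and PetScan exports.
wikidata_titles = ["Group theory", "Topology"]
petscan_titles = ["Group_theory", "Banach space", " Topology "]
seed = merge_titles(wikidata_titles, petscan_titles)
```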

_Seed Dataset Count: 10,338_

If that was all we had to do, we'd be golden. But we want to be thorough and check that our seed dataset is purged of bad articles.

### _Manual Cleaning_

There is no way around it: each item in our seed dataset needs to be checked and purged if it is not related to mathematics, or is only tangentially related to mathematics. Common articles that were purged included people related to mathematics, articles about academic journals, and universities.

_Seed Dataset Count after Cleaning: 8,687_

_Hours of Manual Cleaning: 4.25_

## Noise Dataset

To train a model to recognize mathematics articles, we also need a dataset of non-mathematics articles. For this we will use Physics, Biology, Chemistry, and Computer Science articles. We will again use PetScan with a depth-one search.

Physics:

Biology:

Chemistry:

Computer Science:

## **Evaluation Metric**

As described above, the evaluation metric we will be using is 800 mathematics-related articles. This metric was derived from a search of the Wikidata graph database for mathematics-related nodes with associated English Wikipedia articles.

## **Let's See What We Can Do with Our Seed Dataset!**

## Traversing Wikipedia

['Wikipedia Data Science: Working with the World’s Largest Encyclopedia'](https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c)

Traversing the XML tree iteratively: `xml.sax`

Parsing Wikipedia content pages: `mwparserfromhell`
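To give a feel for the `xml.sax` half of this, here is a minimal sketch of a streaming handler that collects page titles and raw wikitext. It runs on a tiny inline sample rather than a real dump, and `PageHandler` and `SAMPLE` are hypothetical names. The element names `page`, `title`, and `text` are assumed to match the dump format.

```python
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Collect (title, text) pairs from <page> elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._tag = None     # element currently being read, if we care about it
        self._buffer = []    # text chunks for that element
        self._title = None

    def startElement(self, name, attrs):
        if name in ("title", "text"):
            self._tag = name
            self._buffer = []

    def characters(self, content):
        if self._tag:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._tag:
            value = "".join(self._buffer)
            if name == "title":
                self._title = value
            else:
                self.pages.append((self._title, value))
            self._tag = None


SAMPLE = """<mediawiki>
  <page><title>Group theory</title><revision><text>'''Group theory''' studies [[Group (mathematics)|groups]].</text></revision></page>
  <page><title>Topology</title><revision><text>'''Topology''' is a branch of [[mathematics]].</text></revision></page>
</mediawiki>"""

handler = PageHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
```

Because the handler only ever holds one page in its buffer, the same pattern should scale to a full dump. The raw wikitext in each pair is what would then be passed to `mwparserfromhell` to strip the markup.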

http://wiki.knoesis.org/index.php/File:HPCO-treemap.png

https://tools.wmflabs.org/vcat/render?wiki=enwiki&category=Biology&rel=subcategory

https://en.wikipedia.org/wiki/Special:CategoryTree?target=mathematics&mode=all&namespaces=&title=Special%3ACategoryTree

Interesting results from deep searches of Wikipedia trees:

Depth of 6 in mathematics:

- `Category:Pokémon`
- `Category:Pokémon_anime`
- `Category:Pokémon_characters`
- `Category:Pokémon_lists`
- `Category:Pokémon_manga`
- `Category:Pokémon_media`
- `Category:Pokémon_World_Championships`
- `Category:Pokémon_Trading_Card_Game`
- `Category:Pokémon_video_games`
- `Category:Pokémon_RPGs`
- `Category:People_with_Crohn's_Disease`
- `Category:Women's_Prisons_in_Australia`
- `Category:Sex_differences_in_psychology`
- `Category:Sex_hormones`
- `Category:Sex-determination_systems`
- `Category:Sexual_anatomy`
- `Category:Sexual_dimorphism`
- `Category:Sexual_selection`
- `Category:Testosterone`
- `Category:Comparative_education`
- `Category:Language_comparison`
- `Category:Comparative_law`
- `Category:Comparative_literature`
- `Category:Power_(physics)`
- `Category:Radioactivity`
- `Category:Velocity`

### Wikipedia Category Cluster Visualization 

<img src="images/wikipedia_network_graph_small.png">

### Wikipedia Subset Category of Mathematics Cluster Graph:

<img src="images/mathematics_parents_children_fdp_small.png">

### Mathematics Tree Graph:

<img src="images/mathematics_children.png">

4.3 Category hierarchy as isa graph

In many of the knowledge discovery tasks, Wikipedia category hierarchy is treated as an isa graph. Though it makes sense (in most of the cases) to treat the concept represented by a child category as a specific type of a broader concept represented by a parent category (e.g., Computer_science is a type of Applied_sciences), it is often the case that long distance isa relationship in category hierarchy does not make sense. For example, we can trace a path from the category “Optical_storage” to “Biology” as follows:

Optical_computer_storage is a descendant of Biology as per the following relation hierarchy: 

Optical_computer_storage → 
Computer_storage_devices → 
Recorders → 
Equipment → 
Technology → 
Intellectual_works → 
Creativity → 
Intelligence →
Neuroscience → 
Biology

Similarly, from “Automotive_testing_agencies” to “Algebra” as follows:
Automotive_testing_agencies is a descendant of Algebra as per the following relation hierarchy:

Automotive_testing_agencies → 
Automobiles → 
Private_transport → 
Transport → 
Industries → 
Economic_systems → 
Economics → 
Society → 
Structure → 
Dimension → 
Manifolds → 
Geometric_topology → 
Structures_on_manifolds → 
Algebraic_geometry → 
Abstract_algebra → 
Algebra

As explained in Section 3.1, our manual inspection of a few (39) randomly sampled documents show that, is-a relation does not make sense beyond 7 levels. We found that, between 3 to 5 levels, we can get a reasonably good isa relation.


Bairi R. B. & Carman M. & Ramakrishnan G. (2015). On the Evolution of Wikipedia: Dynamics of Categories and Articles. Papers from the 2015 ICWSM Workshop, pp. 9.
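The chains quoted above can be reproduced mechanically: follow parent-category links upward with a breadth-first search and stop once a chain gets too long. This is a minimal sketch, with `path_to_ancestor` as a hypothetical helper and the `PARENTS` table copied from the Optical_computer_storage example rather than pulled from Wikipedia.

```python
from collections import deque


def path_to_ancestor(start, target, parents, max_depth):
    """Shortest chain of parent categories from `start` up to `target`, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 >= max_depth:
            continue  # this chain is already `max_depth` steps long
        for parent in parents.get(path[-1], []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


# Parent links copied from the Optical_computer_storage example.
PARENTS = {
    "Optical_computer_storage": ["Computer_storage_devices"],
    "Computer_storage_devices": ["Recorders"],
    "Recorders": ["Equipment"],
    "Equipment": ["Technology"],
    "Technology": ["Intellectual_works"],
    "Intellectual_works": ["Creativity"],
    "Creativity": ["Intelligence"],
    "Intelligence": ["Neuroscience"],
    "Neuroscience": ["Biology"],
}

full = path_to_ancestor("Optical_computer_storage", "Biology", PARENTS, max_depth=10)
capped = path_to_ancestor("Optical_computer_storage", "Biology", PARENTS, max_depth=5)
```

With a generous limit the nine-step chain to Biology is found. With a limit of five, in line with the 3 to 5 levels the paper found reasonable, the search gives up and the storage category is no longer treated as a descendant of Biology.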

### Steps

<img src="images/graph_decomposition_small.png">

Add to the above image: extra edges leading away from the visible trees

<img src="images/primary_edge_tree_small.png">

## Model Creation

Vectorizer: 
Use count vectorizer. Use TF\*IDF. Use Word2Vec. Use Doc2Vec.

Models:
Naive Bayes, Logistic Regression, RNN.
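For reference, the TF\*IDF option weights a term by how often it appears in a document, discounted by how many documents contain it. Below is a plain-Python sketch of the textbook formula `tf * log(N / df)`, meant only to show the idea. The `tfidf` function and the two toy documents are hypothetical, and library vectorizers such as sklearn's use a smoothed variant, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (tf * log(N / df))."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Two made-up, already tokenized documents.
docs = [
    "the theorem has a proof".split(),
    "the engine has a circuit".split(),
]
vectors = tfidf(docs)
```

Words shared by every document, such as "the", get a weight of zero, while words that separate the two documents keep a positive weight.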

## Unsupervised Topic Searching
[Concept Search Project](http://mccormickml.com/2017/02/22/concept-search-on-wikipedia/)


#### Logistic Regression 100 most important words, least to most important:

['scienc',
 'econom',
 'john',
 'edit',
 'actuari',
 'puzzl',
 'cipher',
 'publish',
 '(mathematics)',
 'polit',
 'b',
 'book',
 '2',
 'space',
 'z',
 'born',
 'physic',
 'category:theorem',
 'order',
 'cube',
 'manifold',
 'prime',
 'prefix',
 'theorist',
 'univers',
 'random',
 'categori theori',
 'curv',
 'prize',
 'census',
 'p',
 'category:mathematician',
 'fractal',
 'signatur',
 'system engin',
 'languag',
 'integ',
 'olympiad',
 'n',
 'school',
 'notat',
 'symbol',
 'arithmet',
 'tile',
 'differenti',
 'theori',
 'coordin',
 'x',
 'field',
 'finit',
 'conjectur',
 'category:mathemat',
 'paradox',
 'variabl',
 'mathemat journal',
 'american mathemat',
 'calculus',
 'he',
 'mathemat educ',
 'lemma',
 'cryptographi',
 'numer',
 'equat',
 'distribut',
 'dynam',
 'complex',
 'neural network',
 'encrypt',
 'inequ',
 'probabl',
 'neural',
 'model',
 'game',
 'statist',
 'graph',
 'valid',
 'logic',
 'group',
 'calcul',
 'philosophi',
 'cambridg',
 'cryptograph',
 'topolog',
 'formal',
 'proof',
 'function',
 'problem',
 'appli mathemat',
 'algorithm',
 'mathematics.',
 'mathemat societi',
 'set',
 'uml',
 'math',
 'algebra',
 'geometri',
 'number',
 'theorem',
 'in mathemat',
 'mathematician']


#### 100 words most likely to indicate an article is not about mathematics, least to most important:

['light',
 'truck',
 'robot',
 'tool',
 'typic',
 'interact',
 'product',
 'dam',
 'organis',
 'air',
 'instrument',
 'liquid',
 'qualiti',
 'reliabl',
 'countri',
 'compani',
 'nois',
 'protect',
 '-',
 'control',
 'drill',
 'mine',
 'citi',
 'busi',
 '3d',
 'qualif',
 'aeronaut',
 'drum',
 'flow',
 'genet',
 'actor',
 'video',
 'mainten',
 'test',
 'chemic',
 'optic',
 'ship',
 'develop',
 'built',
 'combust',
 'failur',
 'load',
 'pipe',
 'beam',
 'sound',
 'this categori',
 'biotechnolog',
 'manag',
 'activ',
 'measur',
 'temperatur',
 'river',
 'speed',
 'interfac',
 'glass',
 'wall',
 'bridg',
 'user',
 'redirect',
 'cut',
 'land',
 'voltag',
 'architectur',
 'equip',
 'requir',
 'circuit',
 'station',
 'flight',
 'audio',
 'steel',
 'project',
 'water',
 'aircraft',
 'frequenc',
 'organ',
 'industri',
 'survey',
 'electr',
 'electron',
 'devic',
 'unit',
 'broadcast',
 'comic',
 'pressur',
 'metal',
 'genom',
 'safeti',
 'technolog',
 'vehicl',
 'telecommun',
 'tensor',
 'technic',
 'servic',
 'locat',
 'materi',
 'design',
 'manufactur',
 'use',
 'build',
 'standard',
 'engin']
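Lists like the two above come from sorting the model's coefficients: large positive values push toward "mathematics", large negative values push away from it. Here is a minimal sketch of that ranking. `rank_features` is a hypothetical helper and the coefficients are made-up numbers, not values from the trained model.

```python
def rank_features(names, coefs, top_n):
    """Return (most_positive, most_negative), each ordered least to most important."""
    ordered = sorted(zip(coefs, names))  # ascending by coefficient
    positive = [name for coef, name in ordered if coef > 0][-top_n:]
    negative = [name for coef, name in reversed(ordered) if coef < 0][-top_n:]
    return positive, negative


# Made-up coefficients for a handful of stems from the lists above.
names = ["mathematician", "theorem", "algebra", "book", "engin", "standard", "light"]
coefs = [2.1, 1.8, 1.2, 0.1, -2.4, -1.9, -0.2]
math_words, other_words = rank_features(names, coefs, top_n=3)
```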


#### Example words that were reduced to zero by Lasso Regression:

['0 1',
 '0 1 real',
 '1 real',
 'a',
 'a similar',
 'abelian',
 'abelian von',
 'abelian von neumann',
 'algebra bound',
 'algebra i.e.',
 'algebra isomorph',
 'algebras.',
 'atom',
 'banach',
 'banach space',
 'banach spaces.',
 'borel',
 'borel set',
 'bound function',
 'brief',
 'classic',
 'complet',
 'complet measur',
 'consid',
 'copi',
 'count',
 'count measur',
 'decompos',
 'decomposit',
 'deep',
 'deep result',
 'discret',
 'discret set.',
 'discret space.',
 'dorothi',
 'everi',
 'extend',
 'finit set.',
 'function discret',
 'function general',
 'general',
 'general measur',
 'given',
 'i.e.',
 'import',
 'import classic',
 'import role',
 'import role theori',
 'in brief',
 'integ finit',
 'interv',
 'interv 0',
 'interv 0 1',
 'irv',
 'irv segal',
 'isomorph',
 'it',
 'it extend',
 'kuratowski',
 'lp',
 'lp space',
 'maharam',
 'mathemat measur',
 'mathemat measur space',
 'measur discret',
 'measur function',
 'measur space',
 'neumann',
 'neumann algebra',
 'neumann algebras.',
 'non-atom',
 'part',
 'part use',
 'parts.',
 'play',
 'play import',
 'play import role',
 'polish',
 'polish space',
 'polish space borel',
 'product unit',
 'pure',
 'real',
 'real integ',
 'refer mathemat',
 'result',
 'result import',
 'role',
 'role theori',
 'segal',
 'set real',
 'set refer',
 'set.',
 'set. a',
 'set. refer',
 'similar',
 'similar theorem',
 'space abelian',
 'space banach',
 'space banach space',
 'space borel',
 'space borel set',
 'space complet',
 'space given',
 'space measur',
 'space measur function',
 'space play',
 'space state',
 'space suffici',
 'space theori',
 'space unit',
 'space unit interv',
 'space.',
 'space. the',
 'spaces.',
 'spaces. in',
 'spatial',
 'state',
 'state complet',
 'suffici',
 'tensor product',
 'term',
 'the result',
 'the theorem',
 'theorem given',
 'theorem in',
 'theorem in mathemat',
 'theori banach',
 'theori consid',
 'translat languag',
 'understand',
 'unit interv',
 'unit interv 0',
 'use count',
 'von',
 'von neumann',
 'von neumann algebra',
 'von neumann algebras.',
 'σ-finit',
 '. this',
 '1 6',
 '10',
 '10 number',
 '157',
 '19',
 '1968',
 '1968.',
 '1971',
 '1973',
 '1973 pp.',
 '1975.',
 '1976',
 '1977',
 '1980',
 '1984',
 '1989',
 '1989 .',
 '1994.',
 '1997',
 '2 6',
 '2008',
 '3',
 '38',
 '39',
 '3–66',
 '542',
 '59',
 '6',
 '62',
 '70',
 '72',
 '73',
 '79',
 '97',
 'a.',
 'actual',
 'advanc',
 'algebra method',
 'algebra topology.',
 'appli',
 'applic',
 'applic geometr',
 'approxim',
 'approxim mathemat',
 'arbitrari',
 'arc.',
 'associ',
 'asymptot',
 'background',
 'birkhäus',
 'borsuk',
 'branch',
 'branch topolog',
 'c*-algebra',
 'categor',
 'characterist',
 'circle.',
 'circle. it',
 'circle. this',
 'circumst',
 'close',
 'coher',
 'coin',
 'coincid',
 'compact',
 'compact space',
 'compact subset',
 'compacta',
 'complex.',
 'concern',
 'contents.',
 'continu',
 'cw',
 'cw complex.',
 'd.a.',
 'deform',
 'develop refer',
 'develop strong',
 'différentiell',
 'dimension',
 'doe',
 'doe appli',
 'domin',
 'dover',
 'due',
 'duke',
 'duke math.',
 'duke math. j.',
 'dynam system',
 'e.g.',
 'edward',
 'edward h.',
 'eilenberg',
 'elli',
 'equival',
 'exampl',
 'exampl area',
 'for',
 'for purpos',
 'found.',
 'foundat',
 'foundat algebra',
 'functor',
 'fund',
 'fund.',
 'fundament',
 'fundament exampl',
 'further',
 'further develop',
 'futur',
 'gain',
 'general arbitrari',
 'general categori',
 'geometr topolog',
 'geometri oper',
 'global',
 'great',
 'great popular',
 'grothendieck',
 'group group',
 'group group isomorph',
 'group isomorph',
 'h.',
 'h. m.',
 'hard',
 'hast',
 'hilton',
 'homolog',
 'homolog theori',
 'homolog theory.',
 'homotop',
 'homotopi',
 'homotopi equival',
 'homotopi group',
 'homotopi theori',
 'homotopi theory.',
 'homotopi theory. the',
 'i',
 'i ii',
 'ii',
 'in algebra',
 'instead',
 'invari',
 'j.',
 'jack',
 'jean-marc',
 'journal',
 'journal mathemat',
 'k.',
 'karol',
 'karol borsuk',
 'later',
 'lectur note',
 'like',
 'littl',
 'live',
 'live work',
 'london',
 'london math.',
 'london math. soc.',
 'lore',
 'm.',
 'marius',
 'math.',
 'math. j.',
 'math. soc.',
 'mathemat applic',
 'mathemat volum',
 'maths.',
 'method',
 'method approxim',
 'method oper',
 'michael',
 'monograph',
 'morphism',
 'mountain',
 'mountain journal',
 'mountain journal mathemat',
 'naiv',
 'no.',
 'no. 1',
 'noncommut',
 'noncommut geometri',
 'norman',
 'norman steenrod',
 'note',
 'notic',
 'number 3',
 'on',
 'oper',
 'oper algebra',
 'oper theori',
 'p.',
 'p.j.',
 'paper',
 'past',
 'past present',
 'past present futur',
 'pdf',
 'phil.',
 'plane',
 'point',
 'point homotopi',
 'polish mathematician',
 'polyhedra.',
 'popular',
 'porter',
 'pp.',
 'predict',
 'present',
 'present futur',
 'proc.',
 'promot',
 'properti',
 'provid',
 'publish littl',
 'purpos',
 'reflect',
 'reinvent',
 'render',
 'reprint',
 'reprint dover',
 'rocki',
 'rocki mountain',
 'rocki mountain journal',
 's.',
 'samuel',
 'samuel eilenberg',
 'shape',
 'shape.',
 'sine',
 'sine curv',
 'singular',
 'singular homolog',
 'soc.',
 'sophist',
 'space general',
 'space homotopi',
 'sphere',
 'sphere .',
 'springer-verlag.',
 'steenrod',
 'strong',
 'style',
 'subset',
 'subset plane',
 'summer',
 'system',
 'terri',
 'theorem doe',
 'theorem doe appli',
 'theori applic',
 'theori associ',
 'theori asymptot',
 'theori branch',
 'theori categor',
 'theori general',
 'theori homotopi',
 'theori homotopi theori',
 'theori oper',
 'theori oper algebra',
 'theori p.',
 'theori shape',
 'theory.',
 'theory. the',
 'this',
 'this compact',
 'this continu',
 'tim',
 'tom',
 'topolog homotopi',
 'topolog space',
 'topolog space homotopi',
 'topologi',
 'topologist',
 'topology.',
 'view',
 'volum',
 'volum 10',
 'warsaw',
 'warszawa',
 'whitehead',
 'work',
 'włodzimierz',
 'year',
 'čech',
 'absolut',
 'data',
 'data set',
 'impli',
 'in statist',
 'indic',
 'inher',
 'larson',
 'level',
 'level level',
 'level measur',
 'magnitud',
 'mathematician refer',
 'measur ratio',
 'nature.',
 'point use',
 'ratio',
 'ratio measur',
 'refer point',
 'ron',
 'ron larson',
 'set indic',
 'use data',
 'use ratio',
 'zero',
 'zero in',
 'zero use',
 '. there',
 '. there general',
 '...',
 '... k',
 '....',
 '1 ...',
 '1 integ',
 '1.',
 '1. from',
 '1972',
 '1987',
 '19th',
 '19th centuri',
 '2 z',
 '2.',
 '3-space',
 '3.',
 '3. if',
 '3. we',
 '8',
 '8.2',
 '= 1',
 '= 1 ...',
 '= n',
 'a b',
 'a similar argument',
 "a'i",
 "a'k",
 'a1',
 'a1 ...',
 'accordingly.',
 'act',
 'act set',
 'action',
 'action homeomorph',
 'action linear',
 'action set',
 'alexand',
 'alexand lubotzki',
 'altern',
 'altern form',
 'alternative.',
 'alternative. the',
 'amalgam',
 'american',
 'american mathemat societi',
 'and',
 'appear',
 'argument',
 'argument goe',
 'as',
 'assum',
 'assumpt',
 'attract',
 'attract fix',
 'attract fix point',
 'attribut',
 'automorph',
 'automorph group',
 'avail',
 'b element',
 'basi',
 'basic',
 'basic fact',
 'bernard',
 'bestvina',
 'boundari',
 'by',
 'by definit',
 'case',
 'case if',
 'centuri',
 'ch.',
 'chain',
 'check',
 'check given',
 'check use',
 'claim.',
 'claim. the',
 'class',
 'class group',
 'combin',
 'common',
 'commut',
 'conclud',
 'consid case',
 'consid standard',
 'contain',
 'contain free',
 'contain proof',
 'context',
 'context group',
 'corollari',
 'cyclic',
 'cyclic group',
 'cyclic group subgroup',
 'cyclic subgroup',
 'definit',
 'descript',
 'discontinu',
 'discret group',
 'disjoint',
 'distinct',
 'distinct point',
 'element',
 'element exact',
 'element finit',
 'element g',
 'element group',
 'element infinit',
 'element set',
 'element.',
 'element. to',
 'ensur',
 'essay',
 'exact',
 'exampl let',
 'exampl one',
 'exampl proof',
 'exampl special',
 'exist',
 'exist m',
 'explan',
 'explicit',
 'explicit descript',
 'extens',
 'extensions.',
 'fact',
 'famous',
 'famous result',
 'felix',
 'felix klein',
 'final',
 'finish',
 'finit generat',
 'finit generat group',
 'finit group',
 'finite.',
 'fix',
 'fix point',
 'fix point mathemat',
 'fix point.',
 'follow',
 'follow chain',
 'follow corollari',
 'follow hold',
 'follow properti',
 'follow statement',
 'form',
 'formal statement',
 'free',
 'free group',
 'free group.',
 'free product',
 'freeli',
 'from',
 'from.',
 'fulli',
 'function analysi',
 'g',
 'g act',
 'g generat',
 'g group',
 'g h',
 'g x1',
 'g ∈',
 "g'n",
 'general assum',
 'general context',
 'general group',
 'general use',
 'generat a',
 'generat free',
 'generat group',
 'generat linear',
 'generat set',
 'generat set group',
 'geometr finit',
 'geometr function',
 'geometr function analysi',
 'geometr group',
 'geometr group theori',
 'geometr group theory.',
 'geometri topolog',
 'gh',
 'goe',
 'greater',
 'greater 2.',
 'gromov.',
 'group act',
 'group act set',
 'group action',
 'group action homeomorph',
 'group action set',
 'group automorph',
 'group automorph group',
 'group context',
 'group cyclic',
 'group cyclic group',
 'group discret',
 'group exampl',
 'group free',
 'group free group',
 'group g',
 'group g act',
 'group generat',
 'group geometr',
 'group geometr group',
 'group group action',
 'group histori',
 'group hyperbol',
 'group impli',
 'group invention',
 'group invention mathematica',
 'group isometri',
 'group linear',
 'group mathemat',
 'group mathemat group',
 'group order',
 'group order group',
 'group proper',
 'group rank',
 'group riemann',
 'group riemann surfac',
 'group studi',
 'group subgroup',
 'group theori',
 'group theory.',
 'group tit',
 'group virtual',
 'group.',
 'groups.',
 'guarante',
 'h',
 'h =',
 'h free',
 'h x',
 'h ∈',
 "h'i",
 "h'k",
 'h1',
 'h2',
 'harp',
 'hi',
 'histori',
 'histori formal',
 'histori the',
 'hold',
 'hold for',
 'homeomorph',
 'homeomorphisms.',
 'hyperbol 3-space',
 'h∞',
 'i.',
 'idea',
 'ident element.',
 'if',
 'if let',
 'impli exist',
 'in case',
 'includ',
 'includ use',
 'inde',
 'inde let',
 'infinit',
 'infinit cyclic',
 'infinit cyclic group',
 'institut public',
 'integ n',
 'integ n ≥',
 'invention',
 'invention mathematica',
 'involv',
 'isometri',
 'isometri hyperbol',
 'it hard',
 'it known',
 'jacqu',
 'jacqu tit',
 'journal algebra',
 'just',
 'k',
 'k ≥',
 'k ≥ 2.',
 'k. then',
 'key',
 'key tool',
 'klein',
 'kleinian',
 'kleinian group',
 'kleinian groups.',
 'known',
 'la',
 'late',
 'late 19th',
 'late 19th centuri',
 'lemma .',
 'lemma guarante',
 'lemma impli',
 'lemma in',
 'lemma in mathemat',
 'lemma mathemat',
 'lemma proof',
 'lemma prove',
 'lemma refer',
 'lemma the',
 'lemma use',
 'let',
 'let a1',
 'let g',
 'let g group',
 'let w',
 'linear',
 'linear group',
 'linear group group',
 'linear groups.',
 'linear transform',
 'linear transformations.',
 'loss',
 'loss general',
 'loss general assum',
 'lubotzki',
 'lyndon',
 'm',
 'm m',
 'm ≥',
 'make',
 'make assumpt',
 'map class',
 'map class group',
 'mathemat free',
 'mathemat group',
 'mathemat group action',
 'mathemat scienc research',
 'mathemat statement',
 'mathematica',
 'matric',
 'mladen',
 'mladen bestvina',
 'modern',
 'modern version',
 'möbius',
 'möbius transform',
 'n ≥',
 'neighborhood',
 'neighbourhood',
 'neighbourhood mathemat',
 'new',
 'new york',
 'non-commut',
 'nonempti',
 'nonempti subset',
 'nontrivi',
 'nontrivial.',
 'one',
 'one famous',
 'one use',
 'order 3',
 'order 3.',
 'order greater',
 'order group',
 'order group theori',
 'order k',
 'order.',
 'others.',
 'others. refer',
 'outer',
 'outer automorph',
 'outer automorph group',
 'overview',
 'pairwis',
 'pairwis disjoint',
 'paper contain',
 'particular',
 'particular group',
 'particular proof',
 'point fix',
 'point fix point',
 'point g',
 'point mathemat',
 'point.',
 'product order',
 'product.',
 'product. the',
 'proof by',
 'proof consid',
 'proof exampl',
 'proof the',
 'proof this',
 'proper',
 'proper discontinu',
 'properti fix',
 'prove',
 'prove claim.',
 'public',
 'put',
 'r2',
 'rank',
 'reduc',
 'reduc word',
 'refer see',
 'repel',
 'research institut',
 'respectively.',
 'respectively. then',
 'result known',
 'result state',
 'riemann',
 'riemann sphere',
 'riemann sphere.',
 'riemann surfac',
 'roger',
 'roger lyndon',
 'schottki',
 'schottki group',
 'scienc research',
 'scienc research institut',
 'see',
 'see free',
 'semigroup',
 'semigroup group',
 'semigroup.',
 'set generat',
 'set group',
 'set map',
 'set specif',
 'set x',
 'set x.',
 'similar argument',
 'sinc',
 'sinc g',
 'sinc group',
 'sketch',
 'sketch proof',
 'sketch proof the',
 'sl',
 'sl 2',
 'sl 2 z',
 'so-cal',
 'societi',
 'solvabl',
 'solvabl group',
 'space. a',
 'special',
 'special linear',
 'special linear group',
 'specif type',
 'sphere.',
 'sphere. the',
 'springer',
 'springer new',
 'springer new york',
 'standard action',
 'state finit',
 'statement',
 'statement follow',
 'studi',
 'subgroup',
 'subgroup free',
 'subgroup g',
 'subgroup given',
 'subgroup group',
 'subgroup h',
 'subgroup order',
 'subgroups.',
 'subgroups. in',
 'such',
 'suffic',
 'suffic check',
 'suppos',
 'suppos exist',
 'surfac',
 'surfac set',
 'take',
 'teichmüller',
 'teichmüller space',
 'the follow',
 'the follow statement',
 'the group',
 'the group g',
 'then',
 'then exist',
 'then proof',
 'theori american',
 'theori american mathemat',
 'theori free',
 'theori free group',
 'theori particular',
 'theori roger',
 'there',
 'there general',
 'there version',
 'these',
 'these general',
 'this prove',
 'this statement',
 'this version',
 'thurston',
 'tit',
 'to',
 'tool use',
 'topolog free',
 'topolog geometr',
 'topolog geometr group',
 'torsion-fre',
 'transform',
 'transformations.',
 'tree',
 'trees.',
 'two.',
 'two. the',
 'type',
 'type action',
 'use altern',
 'use explicit',
 'use geometr',
 'use particular',
 'use studi',
 'util',
 'variat',
 'version',
 'version avail',
 'version general',
 'virtual',
 'w',
 'w.',
 'we',
 'we follow',
 'we let',
 'we make',
 'where',
 'wide',
 'wide use',
 'word',
 'word-hyperbol',
 'word-hyperbol group',
 'x follow',
 'x let',
 'x −',
 "x'i.",
 "x'k",
 'x.',
 'x. let',
 'x1',
 'x1 x2',
 'x2',
 'x2.']
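Why do so many weights land on exactly zero? An L1 (Lasso) penalty shrinks every weight toward zero by a fixed amount, and any weight smaller than that amount is clipped to zero. The sketch below shows the soft-thresholding step that appears in common Lasso solvers. It illustrates the mechanism only; this diary does not record which solver was used, and the weights and penalty are made-up numbers.

```python
def soft_threshold(value: float, penalty: float) -> float:
    """Shrink `value` toward zero by `penalty`; anything smaller becomes exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Made-up unpenalized weights for three stems.
raw = {"theorem": 1.8, "banach space": 0.05, "engin": -2.4}
shrunk = {term: soft_threshold(weight, penalty=0.1) for term, weight in raw.items()}
zeroed = [term for term, weight in shrunk.items() if weight == 0.0]
```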

## **Next Steps**

During this project, I played around with the gensim Python package for creating corpora from Wikipedia data. This package handles many, if not most, of the tasks we performed using xml.sax, mwparserfromhell, and sklearn. Not only is it a powerful tool for creating a natural language processing pipeline, but it is also optimized for speed with built-in multiprocessing options.

### STEPS FOR FUTURE DEVELOPMENT:

* Explore how well the model generalizes to arXiv.org papers.
* Train only on page metadata to explore the tradeoff between faster classification and reduced precision/recall.