# ROOT DIRECTORY PRESENTATION

### PROBLEM

Wikipedia has an organization problem:

<img src='images/31444.png'/>

While several categorization structures exist on Wikipedia, none truly solves the problem.

#### _Wikipedia Methods of Categorizing:_

- [Portal page](https://en.wikipedia.org/wiki/Portal:Contents/Portals#Mathematics_and_logic)
- [Contents page](https://en.wikipedia.org/wiki/Portal:Contents/Mathematics_and_logic)
- [Category page](https://en.wikipedia.org/wiki/Category:Mathematics)
- [Outline page](https://en.wikipedia.org/wiki/Outline_of_mathematics)
- [Areas of Mathematics page](https://en.wikipedia.org/wiki/Areas_of_mathematics)
- [Indices](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Overviews](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Glossaries](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)

#### _Wikidata Methods of Categorizing:_

Outside of the Wikipedia project, the Wikimedia Foundation also runs the Wikidata project. This project is based on a community-edited graph database (queried with the SPARQL language) that organizes data based on their interconnections.

```sparql
 SELECT distinct ?item ?article ?sitelink ?linkTo WHERE {
       { ?item wdt:P361* wd:Q395 .}
       union
       { ?item wdt:P361/wdt:P279* wd:Q395 .}
       union
       { ?item wdt:P31/wdt:P279* wd:Q1936384 .}
       union
       { ?item wdt:P921/wdt:P279* wd:Q395 .}
       optional 
       { ?sitelink ^schema:name ?article .
         ?article schema:about ?item ;
         schema:isPartOf <https://en.wikipedia.org/> .
       }
       OPTIONAL { ?item wdt:P361 ?linkTo. }
       SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
       }
```
Graph databases are dope, but when humans create the nodes and relationships by hand, the data is not entirely reliable.

### SOLUTION

One possible solution to this problem is to use graph analysis tools to detect communities of nodes in the network and hope the communities cut along real topic boundaries.

However, this is clumsy and wholly dependent on the quality of the categorization system.

<img src="images/louvain_modularity_3.png">
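Community detection methods such as Louvain score a partition of the graph by its modularity: the fraction of edges that fall inside communities, minus the fraction expected if edges were placed at random with the same node degrees. The sketch below only computes that score for a given partition. It is not the Louvain algorithm itself, and the toy graph and node names are made up for illustration.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping node -> community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than lumping everything together. If the links between pages are noisy, though, a high score says little about whether the communities are meaningful categories.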

A better option is to use the node's article content, since the article content is what Wikipedia is known for.

### METHODS USED

Instead of using Wikipedia directly, I am using the Wikipedia XML data dump. This way, I don

The biggest problem is how to efficiently traverse the Wikipedia XML data dump, which is about 15 GB compressed.

* Solution: generator-based parsing, so documents are read one at a time
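A minimal sketch of that idea, using only the standard library: a generator decompresses the dump as a stream and yields one page at a time, so the full file never has to sit in memory. The function names `iter_pages` and `local_name` are hypothetical and are not taken from `src.wiki_finder`, which may work differently.

```python
import bz2
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit('}', 1)[-1]

def iter_pages(dump_path):
    """Yield (title, wikitext) one page at a time from a .xml.bz2 dump."""
    with bz2.open(dump_path, 'rb') as stream:
        # iterparse reads the stream incrementally instead of loading it all
        for _event, elem in ET.iterparse(stream, events=('end',)):
            if local_name(elem.tag) != 'page':
                continue
            title, text = None, ''
            for child in elem.iter():
                name = local_name(child.tag)
                if name == 'title':
                    title = child.text
                elif name == 'text':
                    text = child.text or ''
            yield title, text
            elem.clear()  # release the finished page's contents
```

A caller would loop with `for title, text in iter_pages(path): ...` and stop early once enough pages have been found.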

The training corpus is huge, about 80,000 Wikipedia pages. This is too much to hold in memory to build a TF-IDF model.

* Solution: use Gensim to build the TF-IDF model while holding only one page in memory at a time.
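To illustrate the streaming idea, here is a standard-library sketch that builds TF-IDF in two passes over a page stream: the first pass counts document frequencies, and the second yields one sparse vector per page. This is a stand-in for the concept, not Gensim's API, and its weights may differ from what Gensim produces. The `stream_pages` function and its sample texts are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN.findall(text.lower())

def document_frequencies(pages):
    """First pass: count in how many pages each term occurs."""
    df = Counter()
    n_docs = 0
    for text in pages:
        df.update(set(tokenize(text)))
        n_docs += 1
    return df, n_docs

def tfidf_vectors(pages, df, n_docs):
    """Second pass: lazily yield one sparse {term: weight} dict per page."""
    for text in pages:
        counts = Counter(tokenize(text))
        yield {term: tf * math.log(n_docs / df[term])
               for term, tf in counts.items() if term in df}

def stream_pages():
    # Hypothetical stand-in for streaming page text out of the dump
    yield "group theory studies groups"
    yield "measure theory studies measures"
    yield "banach space"

# Each pass gets a fresh stream, so only one page is in memory at a time.
df, n_docs = document_frequencies(stream_pages())
vectors = list(tfidf_vectors(stream_pages(), df, n_docs))
```

Only the document-frequency table grows with the corpus. The pages themselves are read, weighted, and discarded.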

NLP models used:

* TF-IDF: produces 100,000 features

* Multinomial Naive Bayes
* Logistic Regression with Regularization
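As a rough illustration of the first classifier, the sketch below fits a multinomial Naive Bayes model with Laplace smoothing in plain Python. It assumes raw token counts rather than the TF-IDF features used in the project, and the function names, labels, and toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on token lists with Laplace smoothing."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n in class_docs.items():
        # Smoothed denominator: class token count plus alpha per vocab term
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            'log_prior': math.log(n / len(labels)),
            'log_likelihood': {
                t: math.log((term_counts[label][t] + alpha) / total)
                for t in vocab
            },
        }
    return model

def predict_nb(model, tokens):
    """Return the label with the highest log posterior.

    Terms never seen in training are skipped.
    """
    def score(label):
        params = model[label]
        return params['log_prior'] + sum(
            params['log_likelihood'][t]
            for t in tokens if t in params['log_likelihood'])
    return max(model, key=score)

# Toy training data (hypothetical)
docs = [['group', 'ring', 'field'], ['measure', 'integral'],
        ['goal', 'match', 'team'], ['team', 'league']]
labels = ['math', 'math', 'other', 'other']
model = train_nb(docs, labels)
```

Calling `predict_nb(model, ['ring', 'integral'])` picks the class whose training pages used those terms most often.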

```python
import glob

import src.wiki_finder as wf

# Compressed dump files to search
files = glob.glob('wiki_dump/*.bz2')

finder = wf.WikiFinder(titles_csv='seed/mathematics_d3.csv', page_limit=100)

# Read lines from the first compressed dump file
lines = finder._get_lines_bz2(files[0])

for line in lines:
    print(line)
```

```xml
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.32.0-wmf.12</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</name
```

```python
from pymongo import MongoClient

# Fetch the first five cached pages from the local MongoDB instance
documents = MongoClient()['wiki_cache']['all'].find().limit(5)

# Show the first cached document
documents.next()
```

```text
{'_id': ObjectId('5bfc8162dc335a4485b9b2e0'),
 'title': "Maharam's theorem",
 'full_raw_xml': '<title>Maharam\'s theorem</title>\n<ns>0</ns>\n<id>18759825</id>\n<revision>\n<id>616609368</id>\n<parentid>616609349</parentid>\n<timestamp>2014-07-12T03:17:45Z</timestamp>\n<contributor>\n<ip>2601:2:4D00:27B:30E0:3741:489E:18EA</ip>\n</contributor>\n<model>wikitext</model>\n<format>text/x-wiki</format>\n<text xml:space="preserve">In [[mathematics]], \'\'\'Maharam\'s theorem\'\'\' is a deep result about the decomposability of [[measure space]]s, which plays an important role in the theory of [[Banach space]]s.  In brief, it states that every [[complete measure space]] is decomposable into &quot;non-atomic parts&quot; (copies of products of the [[unit interval]] [0,1] on the reals), and &quot;purely atomic parts&quot;, using the [[counting measure]] on some discrete space.&lt;ref&gt;D. Maharam, &quot;On homogeneous measure algebras&quot;, \'\'Proceedings of the National Academy of Sciences US
```

### RESULTS

pass

### NEXT STEPS

* Convert the functions to use multiprocessing so XML pages can be searched in parallel: parse the XML, convert it to a TF-IDF sparse matrix, predict probabilities, and save positive predictions to a database for analysis
* Use GraphFrames on Spark to run Louvain modularity analysis and compare it to the NLP results
* Build website with table of results for high-level categories