# ROOT DIRECTORY PRESENTATION

### PROBLEM

Wikipedia has an organization problem:

<img src='images/31444.png'/>

Several categorization structures exist on Wikipedia, but none truly solves the problem. Wikipedia has the following organizing structures:

#### _Wikipedia Methods of Categorizing:_

- [Portal page](https://en.wikipedia.org/wiki/Portal:Contents/Portals#Mathematics_and_logic)
- [Contents page](https://en.wikipedia.org/wiki/Portal:Contents/Mathematics_and_logic)
- [Category page](https://en.wikipedia.org/wiki/Category:Mathematics)
- [Outline page](https://en.wikipedia.org/wiki/Outline_of_mathematics)
- [Areas of Mathematics page](https://en.wikipedia.org/wiki/Areas_of_mathematics)
- [Indices](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Overviews](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Glossaries](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)

#### _Wikidata Methods of Categorizing:_

Outside of the Wikipedia project, the Wikimedia Foundation also runs the Wikidata project. Wikidata is built on a graph database (queried with the SPARQL language) that the community edits to organize data based on their interconnections.

```sparql
 SELECT distinct ?item ?article ?sitelink ?linkTo WHERE {
       { ?item wdt:P361* wd:Q395 .}
       union
       { ?item wdt:P361/wdt:P279* wd:Q395 .}
       union
       { ?item wdt:P31/wdt:P279* wd:Q1936384 .}
       union
       { ?item wdt:P921/wdt:P279* wd:Q395 .}
       optional 
       { ?sitelink ^schema:name ?article .
         ?article schema:about ?item ;
         schema:isPartOf <https://en.wikipedia.org/> .
       }
       OPTIONAL { ?item wdt:P361 ?linkTo. }
       SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
       }
```
Graph databases are powerful, but when humans create the nodes and relationships by hand, the data is not entirely reliable.

### SOLUTION

One possible solution to this problem is to use graph analysis tools to detect communities of nodes in a network and cut off 

<img src="images/louvain_modularity_3.png">
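The figure's file name refers to Louvain modularity. Louvain-style community detection searches for a partition of the nodes that maximizes a score called modularity. The block below is only a minimal sketch of that score for an undirected, unweighted graph; it does not perform the search itself, and the function name `modularity` and the toy graph are assumptions made for illustration, not part of this project's code.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]],
               community: Dict[Hashable, Hashable]) -> float:
    """Return the modularity Q of a partition of an undirected graph.

    Q = sum over communities c of: L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = 0
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in the same community
    for u, v in edges:
        m += 1
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    if m == 0:
        return 0.0

    degree_sum = defaultdict(int)  # total degree per community
    for node, d in degree.items():
        degree_sum[community[node]] += d

    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (hypothetical): two triangles joined by a single bridge edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)   # one community per triangle
q_merged = modularity(edges, one_group)   # everything in one community
```

Splitting the toy graph at the bridge scores higher than lumping all nodes together, which is the signal a community detection method uses to decide where to cut.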

### METHODS USED

The biggest problem is how to efficiently traverse a Wikipedia XML data dump, which is about 15 GB compressed.

* Solution: parse the documents lazily with generators, one page at a time
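A generator-based pass over the dump can be sketched with the standard library alone. This is an illustration under assumptions, not the project's `WikiFinder` code: the function name `iter_pages` is hypothetical, and it assumes the usual dump layout where each `<page>` holds a `<title>` and a `<text>` element.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def iter_pages(dump_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) for each <page> in a bz2-compressed XML dump."""
    with bz2.open(dump_path, "rb") as stream:
        title, text = "", ""
        # iterparse reads the file incrementally instead of loading it whole.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Dump elements carry an XML namespace; keep only the local tag name.
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == "title":
                title = elem.text or ""
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = "", ""
                # Drop the finished page's children so their text can be freed.
                elem.clear()
```

Because `iter_pages` is a generator, a loop such as `for title, text in iter_pages(path): ...` only ever holds the page it is currently processing.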

The training corpus is large, about 80,000 Wikipedia pages. This is too much to hold in memory while building a TF-IDF model.

* Solution: use Gensim to build the TF-IDF model while holding only one page in memory at a time.
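The one-page-at-a-time idea can be illustrated without any third-party library. The sketch below is not Gensim's implementation (Gensim's `Dictionary` and `TfidfModel` fill this role in the project). It is a standard-library stand-in that makes two streaming passes over the pages, and the function names, the tokenizer, and the `tf * ln(N / df)` weighting are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

TOKEN = re.compile(r"[a-z]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN.findall(text.lower())


def document_frequencies(pages: Iterable[str]) -> Tuple[Dict[str, int], int]:
    """First pass: count how many pages contain each term."""
    df: Counter = Counter()
    n_pages = 0
    for text in pages:  # only the current page is held in memory
        df.update(set(tokenize(text)))
        n_pages += 1
    return df, n_pages


def tfidf_vectors(pages: Iterable[str], df: Dict[str, int],
                  n_pages: int) -> Iterator[Dict[str, float]]:
    """Second pass: yield one sparse TF-IDF vector (term -> weight) per page."""
    for text in pages:
        counts = Counter(tokenize(text))
        yield {term: count * math.log(n_pages / df[term])
               for term, count in counts.items()
               if df.get(term, 0) > 0}


# Tiny in-memory stand-in for a stream of page texts (hypothetical data).
pages = ["graph theory", "graph algebra"]
df, n_pages = document_frequencies(pages)
vectors = list(tfidf_vectors(pages, df, n_pages))
```

With a real dump, each pass would re-open the source and stream the pages again rather than keep them in a list. Only the term counts persist between passes.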

NLP models used:

* Multinomial Naive Bayes
* Logistic Regression with Regularization

```python
import glob

from pymongo import MongoClient

import src.wiki_finder as wf

# Collect the compressed Wikipedia dump files.
files = glob.glob('wiki_dump/*.bz2')

# Build a finder from the seed titles CSV, with a page limit of 100.
finder = wf.WikiFinder(titles_csv='seed/mathematics_d3.csv', page_limit=100)

# Get the lines of the first dump file and print them.
lines = finder._get_lines_bz2(files[0])
for line in lines:
    print(line)

# Fetch up to five cached documents from MongoDB and look at the first one.
documents = MongoClient()['wiki_cache']['all'].find().limit(5)
next(documents)
```

### RESULTS

### NEXT STEPS

* Tweak the code to search the XML pages with multiprocessing: parse the XML, convert each page to a TF-IDF sparse matrix, predict its probability, and save positive predictions to a database for analysis