# ROOT DIRECTORY
## _Organizing Wikipedia Content by Category_

### BUSINESS PROBLEM

#### READER PROBLEMS

It is difficult to explore Wikipedia in a structured way, and it is also difficult to know how to contribute to Wikipedia in a meaningful way:

<img src='images/rabbit_hole.png'>

#### EDITOR PROBLEMS

#### **Editor enthusiasm has been declining**

[MIT TECHNOLOGY REVIEW: The Decline of Wikipedia](https://www.technologyreview.com/s/520446/the-decline-of-wikipedia/)

#### **Women are underrepresented as article topics**

[EL PAIS: Raising the profile of female scientists, one Wikipedia article at a time](https://elpais.com/elpais/2018/07/10/inenglish/1531237118_130796.html)

#### **Articles are overly complex**

[MOTHERBOARD: Wikipedia's Science Articles are Elitist](https://motherboard.vice.com/en_us/article/ne7xzq/wikipedias-science-articles-are-elitist)

#### **Academics aren't fully engaged with Wikipedia**

[SCIENCE: Academics can help shape Wikipedia](http://science.sciencemag.org/content/357/6351/557.2)

## PROBLEM

Wikipedia is the 5th most visited site on the web, and it contains about 5 million English-language articles! Despite its popularity and success, Wikipedia has an organization problem:

<img src='images/31444.png'/>

While several categorization structures exist on Wikipedia, none truly solves the problem. Wikipedia possesses the following organizing structures:

#### _Wikipedia Methods of Categorizing:_

- [Portal page](https://en.wikipedia.org/wiki/Portal:Contents/Portals#Mathematics_and_logic)
- [Contents page](https://en.wikipedia.org/wiki/Portal:Contents/Mathematics_and_logic)
- [Category page](https://en.wikipedia.org/wiki/Category:Mathematics)
- [Outline page](https://en.wikipedia.org/wiki/Outline_of_mathematics)
- [Areas of page](https://en.wikipedia.org/wiki/Areas_of_mathematics)
- [Indices](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Overviews](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Glossaries](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Category: Lists](https://en.m.wikipedia.org/wiki/Category:Lists)
- [List of lists of lists](https://en.m.wikipedia.org/wiki/List_of_lists_of_lists)
- [Contents/Lists](https://en.m.wikipedia.org/wiki/Portal:Contents/Lists)


#### _Wikidata Methods of Categorizing:_

Outside of the Wikipedia project, the Wikimedia Foundation also hosts the Wikidata project. Wikidata is a community-edited graph database, queried with the SPARQL language, that organizes data based on their interconnections. For example, the following query collects items related to mathematics (`wd:Q395`):

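```sparql
# Items related to mathematics (wd:Q395), with their English Wikipedia articles.
SELECT DISTINCT ?item ?article ?sitelink ?linkTo WHERE {
  # Reachable from mathematics through a chain of "part of" (P361) links
  { ?item wdt:P361* wd:Q395 . }
  UNION
  # Part of mathematics or of one of its subclasses (P279)
  { ?item wdt:P361/wdt:P279* wd:Q395 . }
  UNION
  # Instance of (P31) wd:Q1936384 or of one of its subclasses
  { ?item wdt:P31/wdt:P279* wd:Q1936384 . }
  UNION
  # Main subject (P921) is mathematics or one of its subclasses
  { ?item wdt:P921/wdt:P279* wd:Q395 . }
  OPTIONAL {
    ?sitelink ^schema:name ?article .
    ?article schema:about ?item ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }
  OPTIONAL { ?item wdt:P361 ?linkTo . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```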
While graph databases are powerful, the nodes and relationships in Wikidata are created by human editors, so the data is not entirely reliable.

### SOLUTION

One possible solution to this problem is to use graph analysis tools to detect communities of nodes in the network and hope that the resulting communities line up with meaningful topics.

However, this is clumsy and wholly dependent on the quality of the categorization system, which is the problem in the first place.

<img src="images/louvain_modularity_3.png">
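The community-detection idea can be made concrete with the modularity score, the quantity that Louvain-style methods (suggested by the figure's filename) try to maximize. The sketch below only scores a given partition; it does not implement the Louvain algorithm itself, and the toy graph and the `modularity` helper are illustrative assumptions rather than the project's code.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the summed degree of the nodes in c.
    """
    m = len(edges)
    inside = defaultdict(int)      # L_c: edges with both ends in c
    degree_sum = defaultdict(int)  # d_c: total degree of nodes in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical toy link graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_split = modularity(edges, two_groups)
q_merged = modularity(edges, one_group)
print(f"two communities: {q_split:.3f}, one community: {q_merged:.3f}")
```

On this toy graph the two-triangle split scores higher than lumping every node together, which is the signal a community detector follows. The score depends only on the links, so it inherits whatever noise the link or category structure contains.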

A better option is to use the node's article content, since the article content is what Wikipedia is known for.

### DATA UNDERSTANDING



- [Category tree: Mathematics](https://en.wikipedia.org/wiki/Special:CategoryTree?target=Category%3AMathematics&mode=all&namespaces=&title=Special%3ACategoryTree)
- [Category tree: Machine learning](https://en.wikipedia.org/wiki/Special:CategoryTree?target=Category%3AMachine+learning&mode=all&namespaces=&title=Special%3ACategoryTree)

### MODELING

Instead of querying Wikipedia directly, I am using the Wikipedia XML data dump.

The biggest problem is how to efficiently traverse the Wikipedia XML data dump, which is about 15 GB compressed.

* Solution: generator-based (streaming) parsing of documents
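A minimal sketch of what generator-based parsing can look like, using only the Python standard library. The function names, the sample XML, and the export namespace version are assumptions for illustration; the project's actual parser may differ.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, text) for one <page> at a time from an XML stream."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = text = ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        # Drop this page's children; the emptied <page> element itself
        # stays attached to the root, which is acceptable for a sketch.
        elem.clear()


def iter_dump_pages(path):
    """Stream pages from a bz2-compressed dump (path is hypothetical)."""
    with bz2.open(path, "rb") as handle:
        yield from iter_pages(handle)


# Tiny in-memory stand-in for a dump, so the sketch runs without a download.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Graph theory</title>
    <revision><text>Study of graphs.</text></revision></page>
  <page><title>Topology</title>
    <revision><text>Study of spaces.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

Because `iter_pages` is a generator, downstream steps can consume pages one by one without the whole dump ever being loaded.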

The training corpus is large: about 80,000 Wikipedia pages. This is too much to hold in memory at once to build a TF-IDF model.

* Solution: use Gensim to build the TF-IDF model while holding only one page in memory at a time.
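The streaming idea can be sketched without Gensim: one pass over the pages counts document frequencies, and a second pass yields one sparse TF-IDF vector per page. This is a standard-library illustration under assumed names (`tokenize`, `document_frequencies`, `tfidf_vectors`, `stream_pages`); Gensim's own weighting and normalization are not reproduced here.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN.findall(text.lower())


def document_frequencies(pages):
    """First pass: count how many pages each term appears in."""
    df = Counter()
    n_docs = 0
    for text in pages:
        df.update(set(tokenize(text)))
        n_docs += 1
    return df, n_docs


def tfidf_vectors(pages, df, n_docs):
    """Second pass: yield one sparse {term: weight} dict per page."""
    for text in pages:
        counts = Counter(tokenize(text))
        yield {term: count * math.log(n_docs / df[term])
               for term, count in counts.items() if term in df}


def stream_pages():
    """Hypothetical page source; a real one would read the XML dump."""
    yield "Graph theory studies graphs"
    yield "Topology studies spaces"


df, n_docs = document_frequencies(stream_pages())
vectors = list(tfidf_vectors(stream_pages(), df, n_docs))
```

Only the term counts and the page currently being processed are held in memory; the page source is simply iterated twice. In this toy corpus the term "studies" appears in every page, so its weight is zero.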

NLP models used:

* TF-IDF: produces 100,000 features

* Multinomial Naive Bayes
* Logistic Regression with Regularization

### RESULTS

|Category          | C (regularization) | Log-Loss Score    | Threshold | TP   | TN    | FP   | FN  | Precision         | Recall            |
|------------------|:------------------:|:-----------------:|:---------:|:----:|:-----:|:----:|:---:|:-----------------:|:-----------------:|
| Aeronautics      |                    |                   |           |      |       |      |     |                   |                   |
| Arts             |                    |                   |           |      |       |      |     |                   |                   |
| Biology          |                    |                   |           |      |       |      |     |                   |                   |
| Chemistry        |                    |                   |           |      |       |      |     |                   |                   |
| Computer Science |          8         |       0.093       |    0.1    | 1034 | 14796 |  688 | 144 |       0.600       |       0.87        |
| Engineering      |                    |                   |           |      |       |      |     |                   |                   |
| Mathematics      |        0.05        |      0.27258      |    0.2    | 1813 | 11792 | 2780 | 277 |       0.394       |       0.867       |
| Philosophy       |                    |                   |           |      |       |      |     |                   |                   |
| Physics          |                    |                   |           |      |       |      |     |                   |                   |
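
The Precision and Recall columns follow from the confusion counts in the same row. The sketch below recomputes them from the TP, FP, and FN values reported above and shows how a probability threshold turns scores into counts. The helper names are illustrative, and treating a score at or above the threshold as a positive prediction is an assumption, not something the table states.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def confusion_counts(probabilities, labels, threshold):
    """Count TP, TN, FP, FN, predicting positive when p >= threshold."""
    tp = tn = fp = fn = 0
    for p, label in zip(probabilities, labels):
        predicted = p >= threshold  # assumed decision rule
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


# Counts copied from the results table.
cs_precision, cs_recall = precision_recall(tp=1034, fp=688, fn=144)
math_precision, math_recall = precision_recall(tp=1813, fp=2780, fn=277)

# Made-up scores, only to show the effect of a low threshold.
toy_counts = confusion_counts([0.05, 0.15, 0.6, 0.3], [0, 1, 1, 0], 0.1)

print(f"Computer Science: {cs_precision:.3f} / {cs_recall:.3f}")
print(f"Mathematics:      {math_precision:.3f} / {math_recall:.3f}")
```

The recomputed values agree with the table up to rounding. A low threshold such as 0.1 or 0.2 favors recall at the cost of more false positives, which matches the pattern in both filled-in rows.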

### FEATURE IMPORTANCES

### NEXT STEPS

* Deploy the model overnight and get predictions for all of Wikipedia for one category
* Create Feature Importance Visualization
* Improve metadata page extraction (regex that does not catastrophically fail)
* Make code fully PEP-8 Compliant
* Generate a requirements.txt file
* Programmatically delete most important features from dictionary?
* Convert notebook presentation to Google Slides


* Implement tools on top of listed categories
* Explore generalizability of the model to arXiv.org papers
* Train only on page metadata to explore tradeoff associated with classification speed increase and reduction in precision/recall
* Pre-clean positive class datasets (for example with Mechanical Turk) for more desirable classification

## CONCLUSION

This is a big problem; however, with simple ML methods it is becoming tractable.




# JAKUB SVEC

## www.github.com/jakubsvec001

## jakubsvec001@gmail.com

## www.linkedin.com/jakubsvec001/