# **ROOT DIRECTORY**
## _Single-Source Truth for Wikipedia Categories_

### BUSINESS PROBLEMS

#### READER PROBLEMS

It is difficult to explore Wikipedia in a structured way, and it is also difficult to know how to contribute to Wikipedia in a meaningful way:

<img src='images/rabbit_hole.png'>

#### EDITOR PROBLEMS

#### **Editor enthusiasm has been declining**

[MIT TECHNOLOGY REVIEW: The Decline of Wikipedia](https://www.technologyreview.com/s/520446/the-decline-of-wikipedia/)

#### **Women are underrepresented as article topics**

[EL PAIS: Raising the profile of female scientists, one Wikipedia article at a time](https://elpais.com/elpais/2018/07/10/inenglish/1531237118_130796.html)

#### **Articles are overly complex**

[MOTHERBOARD: Wikipedia's Science Articles are Elitist](https://motherboard.vice.com/en_us/article/ne7xzq/wikipedias-science-articles-are-elitist)

#### **Academics aren't fully engaged with Wikipedia**

[SCIENCE: Academics can help shape Wikipedia](http://science.sciencemag.org/content/357/6351/557.2)

## **TARGET PROBLEM**

### **SINGLE SOURCE OF TRUTH**

Wikipedia is the 5th most visited site on the web, and it contains about 5 million English-language articles! While Wikipedia's success has been amazing, its strength is primarily in tools for article-level collaboration. Article organization is a very different problem and requires very different tools.

Several categorization structures exist on Wikipedia, but none truly solves the problem:

#### _Wikipedia Methods of Categorizing:_

- [Portal page](https://en.wikipedia.org/wiki/Portal:Contents/Portals#Mathematics_and_logic)
- [Contents page](https://en.wikipedia.org/wiki/Portal:Contents/Mathematics_and_logic)
- [Category page](https://en.wikipedia.org/wiki/Category:Mathematics)
- [Outline page](https://en.wikipedia.org/wiki/Outline_of_mathematics)
- [Areas of page](https://en.wikipedia.org/wiki/Areas_of_mathematics)
- [Indices](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Overviews](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Glossaries](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Category: Lists](https://en.m.wikipedia.org/wiki/Category:Lists)
- [Lists of lists of lists](https://en.m.wikipedia.org/wiki/List_of_lists_of_lists)
- [Contents/Lists](https://en.m.wikipedia.org/wiki/Portal:Contents/Lists)


At the simplest level, Wikipedia has a hard time answering a simple question:

HOW MANY ARTICLES ON WIKIPEDIA ARE RELATED TO THE FIELD OF MATHEMATICS?

### PORTAL ANSWER FOR MATH COUNT

<img src='images/31444.png'/>

_WIKIPEDIA PORTAL ANSWER: **31,444**_

### **WIKIDATA ANSWER FOR MATH COUNT**

Outside of the Wikipedia project, the Wikimedia Foundation also hosts the Wikidata project. Wikidata is a community-edited knowledge graph, queried with SPARQL, that organizes data by their interconnections.

```sparql
 SELECT distinct ?item ?article ?sitelink ?linkTo WHERE {
       { ?item wdt:P361* wd:Q395 .}
       union
       { ?item wdt:P361/wdt:P279* wd:Q395 .}
       union
       { ?item wdt:P31/wdt:P279* wd:Q1936384 .}
       union
       { ?item wdt:P921/wdt:P279* wd:Q395 .}
       optional 
       { ?sitelink ^schema:name ?article .
         ?article schema:about ?item ;
         schema:isPartOf <https://en.wikipedia.org/> .
       }
       OPTIONAL { ?item wdt:P361 ?linkTo. }
       SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
       }
```
While graph databases are powerful, the data is not entirely reliable when humans create the nodes and relationships by hand.
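
For readers who want to run a query like this from a script, here is a minimal sketch of preparing a request to the public Wikidata Query Service and reading a count out of a SPARQL JSON response. The query is a simplified, hypothetical stand-in for the one above (it only follows `P361`), and the helper names `build_request_url` and `extract_count` are illustrative. The request itself is not sent, so the sketch runs offline.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Simplified, hypothetical query: count items reachable from mathematics
# (Q395) through the "part of" property (P361). Not the full query above.
COUNT_QUERY = """
SELECT (COUNT(DISTINCT ?item) AS ?n) WHERE {
  ?item wdt:P361* wd:Q395 .
}
"""


def build_request_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = urlencode({"query": query.strip(), "format": "json"})
    return WDQS_ENDPOINT + "?" + params


def extract_count(response):
    """Read the single COUNT value from a SPARQL JSON results document."""
    binding = response["results"]["bindings"][0]
    return int(binding["n"]["value"])


url = build_request_url(COUNT_QUERY)
# Fetching is left out on purpose; urllib.request.urlopen(url) with a
# descriptive User-Agent header would be one way to send it.
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without hitting the live service.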

_WIKIDATA ANSWER: **842 (23,048)**_

### **WIKIMEDIA PETSCAN ANSWER**
https://petscan.wmflabs.org/

Contributors to the Wikimedia Foundation have built a tool for finding articles based on categories. It is a powerful tool, but it is too crude to answer our original question. Instead, we will use it to find our seed dataset. 

PetScan takes in a category parameter and a depth of search parameter. This roughly translates to the same process that we would have used if we were scraping Wikipedia ourselves.

Searching the Mathematics category with a depth of one yields 1,216 results. Kinda disappointing. A depth of two yields 11,393 results, and a depth of three yields 34,013. Well, that escalates quickly. However, larger depths also bring more noise into the results. We'll pick a depth of one and supplement this set by also searching mathematics subtopics with PetScan. After searching the Algebra, Arithmetic, Calculus, Combinatorics, Game Theory, Geometry, Linear Algebra, Probability, Statistics, and Topology categories, each with a depth of one, we have a total of about 10,000 results to add to our Wikidata results.

_PETSCAN ANSWER AT DEPTH OF 0: **13**_

_PETSCAN ANSWER AT DEPTH OF <= 1: **1213**_

_PETSCAN ANSWER AT DEPTH OF <= 2: **11330**_

_PETSCAN ANSWER AT DEPTH OF <= 3: **34421**_ 

_PETSCAN ANSWER AT DEPTH OF <= 4: **59036**_

_PETSCAN ANSWER AT DEPTH OF > 4: $\infty$_
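
The depth parameter can be pictured as a breadth-first walk over the category graph. The sketch below is not PetScan's implementation. It uses a hypothetical toy graph (the `SUBCATEGORIES` and `ARTICLES` dictionaries are made up) to show how each extra level pulls in more articles, and why a `seen` set is needed once categories loop back on themselves.

```python
from collections import deque

# Hypothetical toy category graph: category -> direct subcategories.
SUBCATEGORIES = {
    "Mathematics": ["Algebra", "Geometry"],
    "Algebra": ["Linear algebra"],
    "Geometry": ["Topology"],
    "Linear algebra": ["Mathematics"],  # a cycle back to the root
    "Topology": [],
}

# Hypothetical articles filed directly under each category.
ARTICLES = {
    "Mathematics": {"Mathematics", "Number"},
    "Algebra": {"Group (mathematics)", "Ring (mathematics)"},
    "Geometry": {"Triangle"},
    "Linear algebra": {"Matrix (mathematics)", "Number"},
    "Topology": {"Knot theory"},
}


def articles_within_depth(root, max_depth):
    """Collect articles from categories at most max_depth levels below root."""
    seen = {root}
    queue = deque([(root, 0)])
    found = set()
    while queue:
        category, depth = queue.popleft()
        found |= ARTICLES.get(category, set())
        if depth == max_depth:
            continue  # do not descend past the requested depth
        for child in SUBCATEGORIES.get(category, []):
            if child not in seen:  # skip categories already visited
                seen.add(child)
                queue.append((child, depth + 1))
    return found


counts = [len(articles_within_depth("Mathematics", d)) for d in range(4)]
print(counts)
```

Without the `seen` set, the cycle in the toy graph would make an unbounded walk revisit the same categories indefinitely.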

### **WIKIPEDIA CATEGORY TREE ANSWER**

[https://en.wikipedia.org/wiki/Special:CategoryTree](https://en.wikipedia.org/wiki/Special:CategoryTree?target=Category%3AMathematics&mode=all&namespaces=&title=Special%3ACategoryTree)



_CATEGORY TREE ANSWER AT DEPTH OF 0:  **45**_

_CATEGORY TREE ANSWER AT DEPTH OF <= 1: **1377**_

_CATEGORY TREE ANSWER AT DEPTH OF <= 2: **10603**_

_CATEGORY TREE ANSWER AT DEPTH OF <= 3: **31127**_ 

_CATEGORY TREE ANSWER AT DEPTH OF <= 4: **51444**_

_CATEGORY TREE ANSWER AT DEPTH OF > 4: $\infty$_


## **OBJECTIVE**
### **TRUTH TABLE**

| TITLE | Wiki Article | Simple Wiki Article | Probability Score | Complexity Score | Quality Score |
|-------|:--------------:|:---------------------:|:-------------------:|:------------------:|:---------------:|
| Logistic Regression | https://en.wikipedia.org/wiki/Logistic_regression | https://simple.wikipedia.org/wiki/Logistic_Regression |   0.92   |   0.74   |   0.2        |

## **SOLUTION**

### OPTION #1 UNSUPERVISED MACHINE LEARNING

#### GRAPH ANALYSIS USING LOUVAIN MODULARITY

One possible solution to this problem is to use graph analysis tools to detect communities of nodes in a network and hope that the community boundaries line up with subject areas.

However, this is clumsy and wholly dependent on the quality of the categorization system, which is the problem in the first place.
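
To make "modularity" concrete: Louvain searches for a partition that maximizes the modularity score Q. The sketch below only computes Q for a given partition of a hypothetical toy graph. It does not implement the Louvain search itself, and the edge list and community labels are invented for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the community
    degree = defaultdict(int)    # summed node degrees per community
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge edge (hypothetical data).
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

q_two = modularity(EDGES, two_groups)
q_one = modularity(EDGES, one_group)
```

Splitting along the bridge scores higher than lumping everything together, which is the signal a modularity-based method follows. If the underlying links or categories are noisy, that signal is noisy too.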

<img src="images/louvain_modularity_3.png">

### OPTION #2 SUPERVISED MACHINE LEARNING: 

#### NATURAL LANGUAGE PROCESSING OF ARTICLE CONTENT

A better option is to use the node's article content, since the article content is what Wikipedia is known for.

## **DATA UNDERSTANDING**



<img src='images/tree.png'>

https://en.wikipedia.org/wiki/Special:CategoryTree?target=Category%3AMathematics&mode=all&namespaces=&title=Special%3ACategoryTree

https://en.wikipedia.org/wiki/Special:CategoryTree?target=Category%3AMachine+learning&mode=all&namespaces=&title=Special%3ACategoryTree

Instead of using Wikipedia directly, I am using the Wikipedia XML data dump. This way, I don

The biggest problem is how to efficiently traverse a Wikipedia XML data dump. This data dump is about 15 GB in size compressed.

* Solution: generator-based parsing of documents, so that pages are read one at a time
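
A minimal sketch of that idea, using only the standard library: read the compressed dump line by line and yield one page at a time. The function names `iter_lines_bz2` and `iter_pages` are hypothetical and are not the project's own helpers. The sketch assumes `<page>` and `</page>` each sit on their own line, and the sample data is a tiny in-memory stand-in for a real dump file.

```python
import bz2
import io


def iter_lines_bz2(source):
    """Yield decoded lines from a bz2 file (path or file object), lazily."""
    with bz2.open(source, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def iter_pages(lines, limit=None):
    """Group the lines between <page> and </page> into one string per page."""
    buffer, inside, emitted = [], False, 0
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            buffer, inside = [], True
        elif stripped == "</page>":
            inside = False
            yield "\n".join(buffer)
            emitted += 1
            if limit is not None and emitted >= limit:
                return
        elif inside:
            buffer.append(stripped)


# Tiny in-memory stand-in for a dump file (hypothetical content).
SAMPLE = (
    "<mediawiki>\n"
    "  <page>\n    <title>A</title>\n  </page>\n"
    "  <page>\n    <title>B</title>\n  </page>\n"
    "</mediawiki>\n"
)
compressed = io.BytesIO(bz2.compress(SAMPLE.encode("utf-8")))
pages = list(iter_pages(iter_lines_bz2(compressed)))
```

Because both functions are generators, only the lines of the page currently being assembled are held in memory, regardless of the size of the dump.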

The training corpus is huge, about 80,000 Wikipedia pages. This is too much to hold in memory to build a TF-IDF model.

* Solution: use Gensim to build the TF-IDF model while holding only one page in memory at a time.
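
The streaming idea can be sketched without any library. This is a pure-Python stand-in, not Gensim's API: one pass over the corpus collects document frequencies, and a second pass yields one sparse TF-IDF vector at a time using the weighting $(1 + \log tf) \cdot \log(N / df)$. The corpus, the whitespace tokenizer, and all function names are hypothetical.

```python
import math
from collections import Counter


def stream_corpus():
    """Hypothetical corpus; in practice this would stream pages from disk."""
    yield "group ring group"
    yield "ring field"
    yield "triangle"


def tokenize(text):
    """Naive whitespace tokenizer, used only for this illustration."""
    return text.lower().split()


def document_frequencies(docs):
    """First pass: count the documents each term occurs in."""
    df, n_docs = Counter(), 0
    for text in docs:
        df.update(set(tokenize(text)))
        n_docs += 1
    return df, n_docs


def tfidf_vectors(docs, df, n_docs):
    """Second pass: yield one sparse {term: weight} dict per document."""
    for text in docs:
        tf = Counter(tokenize(text))
        yield {
            term: (1 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }


df, n_docs = document_frequencies(stream_corpus())
vectors = list(tfidf_vectors(stream_corpus(), df, n_docs))
```

Only the document-frequency table persists between passes, so memory grows with the vocabulary rather than with the number of pages.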

## **MODELING**

#### **NLP model used:**

##### **TF-IDF: Produces 100,000 features**

$\text{tf-idf}_{t,d} = (1 + \log \text{tf}_{t,d}) \cdot \log \frac{N}{\text{df}_t}$

#### **Machine Learning Models used:**

##### **Multinomial Naive Bayes**

$p(f_1, \ldots, f_n \mid c) = \prod_{i=1}^{n} p(f_i \mid c)$
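
As a rough illustration of how this factorization is used for classification, here is a small multinomial Naive Bayes sketch with Laplace smoothing. The toy documents, the labels, and the `alpha=1.0` smoothing value are assumptions chosen for the example. It is not the project's training code.

```python
import math
from collections import Counter


def train(docs_by_class, alpha=1.0):
    """Estimate log priors and smoothed log p(f | c) from tokenized docs."""
    vocab = {tok for docs in docs_by_class.values() for doc in docs for tok in doc}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc)
        denom = sum(counts.values()) + alpha * len(vocab)
        log_like = {tok: math.log((counts[tok] + alpha) / denom) for tok in vocab}
        model[label] = (math.log(len(docs) / total_docs), log_like)
    return model


def predict(model, tokens):
    """Pick the class maximizing log p(c) + sum_i log p(f_i | c)."""
    def score(label):
        log_prior, log_like = model[label]
        # Tokens never seen in training are ignored.
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(model, key=score)


# Hypothetical training data.
TRAINING = {
    "math": [["theorem", "proof"], ["proof", "algebra"]],
    "other": [["film", "actor"], ["actor", "album"]],
}
model = train(TRAINING)
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when a page has thousands of tokens.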

##### **Logistic Regression with Regularization**

$p(y=1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$
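
The formula and the log-loss metric reported in the results can be sketched in a few lines. The penalized objective below follows the common convention in which `C` is an inverse regularization strength ($C \cdot \text{loss} + \tfrac{1}{2}\lVert\theta\rVert^2$). Whether the regularization values in the results use exactly this convention is an assumption, and the weights and data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """p(y=1 | x) for weight vector theta and feature vector x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def penalized_objective(theta, X, y, C):
    """Assumed objective: C * summed log-loss + 0.5 * ||theta||^2."""
    probs = [predict_proba(theta, x) for x in X]
    return C * log_loss(y, probs) * len(y) + 0.5 * sum(t * t for t in theta)


# Hypothetical two-feature examples and weights.
X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
theta = [0.0, 1.0]
probs = [predict_proba(theta, x) for x in X]
```

Under this convention a small `C` penalizes large weights more heavily, while a large `C` lets the model fit the training data more closely.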

## **RESULTS**

| Category         | Regularization (c) | Log-Loss Score | Threshold | TP   | TN    | FP   | FN  | Precision |
|------------------|--------------------|----------------|-----------|------|-------|------|-----|-----------|
| Aeronautics      |                    |                |           |      |       |      |     |           |
| Arts             |                    |                |           |      |       |      |     |           |
| Biology          |                    |                |           |      |       |      |     |           |
| Chemistry        | 6                  | 0.187          | 0.2       | 2428 | 12543 | 1354 | 337 | 0.642     |
| Computer Science | 8                  | 0.093          | 0.1       | 1034 | 14796 | 688  | 144 | 0.6       |
| Engineering      |                    |                |           |      |       |      |     |           |
| Mathematics      | 0.05               | 0.273          | 0.2       | 1813 | 11792 | 2780 | 277 | 0.394     |
| Philosophy       | 8                  | 0.108          | 0.2       | 733  | 15185 | 470  | 274 | 0.609     |
| Physics          |                    |                |           |      |       |      |     |           |
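
The precision column can be recomputed from the confusion counts, which is a quick consistency check on the table. The sketch below copies the counts from the four filled-in rows. Recall is not reported in the table and is added here only as a complementary view.

```python
# (TP, TN, FP, FN) copied from the filled-in rows of the results table.
RESULTS = {
    "Chemistry": (2428, 12543, 1354, 337),
    "Computer Science": (1034, 14796, 688, 144),
    "Mathematics": (1813, 11792, 2780, 277),
    "Philosophy": (733, 15185, 470, 274),
}


def precision(tp, fp):
    """Share of flagged pages that really belong to the category."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of pages in the category that the model flagged."""
    return tp / (tp + fn)


for name, (tp, tn, fp, fn) in RESULTS.items():
    total = tp + tn + fp + fn
    print(f"{name:<17} precision={precision(tp, fp):.3f} "
          f"recall={recall(tp, fn):.3f} n={total}")
```

All four rows sum to the same number of evaluated pages, which suggests each category was scored on the same test set.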

MATHEMATICS CONFUSION MATRIX


|  | Predicted negative | Predicted positive |
|---|---|---|
| **Actual negative** | TN: 11792 | FP: 2780 |
| **Actual positive** | FN: 277 | TP: 1813 |

COMPUTER SCIENCE CONFUSION MATRIX

|  | Predicted negative | Predicted positive |
|---|---|---|
| **Actual negative** | TN: 14796 | FP: 688 |
| **Actual positive** | FN: 144 | TP: 1034 |

### FEATURE IMPORTANCES

<img src='images/mathematics_word_import.png'>

## **EVALUATION**

```python
import pandas as pd

# False negatives of the Mathematics classifier.
columns = ['title', 'predicted', 'actual', 'threshold_pred', 'USE?']
FN_math = pd.read_csv('misclassified_results/math_sorted_FN.csv', sep='\t')[columns]
FN_math.shape
```

```
(278, 5)
```

```python
FN_math.iloc[:20,:]
```

|  | title | predicted | actual | threshold_pred | USE? |
|---|---|---|---|---|---|
| 0 | A Guide-Book to Mathematics for Technologists ... | 0.008715 | 1.0 | False |  |
| 1 | Nuria Juncosa | 0.011367 | 1.0 | False |  |
| 2 | Lin Hsin Hsin | 0.011919 | 1.0 | False |  |
| 3 | NinKi: Urgency of Proximate Drawing Photograph | 0.013497 | 1.0 | False |  |
| 4 | Hidden Figures | 0.015482 | 1.0 | False |  |
| 5 | Category:Cryptography organizations | 0.016806 | 1.0 | False |  |
| 6 | Homeokinetics | 0.023209 | 1.0 | False |  |
| 7 | Verification and validation | 0.024878 | 1.0 | False |  |
| 8 | David Suter | 0.025197 | 1.0 | False |  |
| 9 | Dunham expansion | 0.028177 | 1.0 | False | INTERESTING |

```python
FN_math.iloc[-20:,:]
```

|  | title | predicted | actual | threshold_pred | USE? |
|---|---|---|---|---|---|
| 258 | Gentleman's Diary | 0.19173 | 1.0 | False |  |
| 259 | American Regions Mathematics League | 0.191816 | 1.0 | False |  |
| 260 | Thomas Rawson Birks | 0.19182 | 1.0 | False | INTERESTING |
| 261 | Bruce Irons (engineer) | 0.192257 | 1.0 | False |  |
| 262 | Portal:Number theory | 0.193504 | 1.0 | False |  |
| 263 | Category:Set families | 0.193807 | 1.0 | False |  |
| 264 | Everything and More (book) | 0.19476 | 1.0 | False | INTERESTING |
| 265 | List of things named after Hermann Grassmann | 0.195324 | 1.0 | False |  |
| 266 | Category:Operations researchers | 0.195498 | 1.0 | False |  |
| 267 | Category:Coordinate systems by dimensions | 0.195721 | 1.0 | False |  |

https://en.wikipedia.org/wiki/Biconcave_disc

https://en.wikipedia.org/wiki/Hidden_Figures

https://en.wikipedia.org/wiki/Gifted_(film)

```python
# False positives of the Mathematics classifier.
FP_math = pd.read_csv('misclassified_results/math_sorted_FP.csv', sep='\t')
FP_math.shape
```

```
(2780, 4)
```

```python
FP_math.iloc[-20:,:]
```

|  | title | predicted | actual | threshold_pred |
|---|---|---|---|---|
| 2760 | Criss-cross algorithm | 0.952502 | 0.0 | True |
| 2761 | Vector logic | 0.953342 | 0.0 | True |
| 2762 | Category:Biological theorems | 0.95454 | 0.0 | True |
| 2763 | Chaitin's constant | 0.955203 | 0.0 | True |
| 2764 | Computational complexity | 0.958522 | 0.0 | True |
| 2765 | P versus NP problem | 0.959869 | 0.0 | True |
| 2766 | Non-constructive algorithm existence proofs | 0.960831 | 0.0 | True |
| 2767 | Generic-case complexity | 0.96255 | 0.0 | True |
| 2768 | Time complexity | 0.962793 | 0.0 | True |
| 2769 | Computing the permanent | 0.964047 | 0.0 | True |
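
How a page ends up in the false-negative or false-positive file follows from comparing its predicted probability with the threshold. The sketch below applies the 0.2 Mathematics threshold from the results table to a few rows taken from the listings above. Treating a score equal to the threshold as positive (`>=`) is an assumption, and the `outcome` helper is hypothetical.

```python
THRESHOLD = 0.2  # threshold listed for Mathematics in the results table

# (title, predicted probability, actual label) from the listings above.
ROWS = [
    ("Hidden Figures", 0.015482, 1),
    ("Dunham expansion", 0.028177, 1),
    ("P versus NP problem", 0.959869, 0),
    ("Time complexity", 0.962793, 0),
]


def outcome(predicted, actual, threshold=THRESHOLD):
    """Label one prediction as TP, FP, FN or TN at the given threshold."""
    flagged = predicted >= threshold  # assumption: ties count as positive
    if flagged:
        return "TP" if actual == 1 else "FP"
    return "FN" if actual == 1 else "TN"


labels = {title: outcome(p, y) for title, p, y in ROWS}
```

Several of the highest-scoring "false positives" are complexity-theory topics, so at least some of them may reflect gaps in the category labels rather than model errors.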


## **CONCLUSION**

## _SO..._
## _... how many mathematics articles are on Wikipedia ! ! ! ! ? ? ?_ 

~ `235,000` English-language mathematics-related articles

~ `5,800,000` total English-language articles

OR:

~ `4.05%`

## **NEXT STEPS**

* Pre-clean the positive-class datasets (for example, with Mechanical Turk) for more desirable classification
* Implement tools to populate article quality statistics
* Explore other vectorizing models (Word2Vec, Doc2Vec)
* Explore generalizability of model on to arXiv.org papers
* Train only on page metadata to explore tradeoff associated with classification speed increase and reduction in precision/recall

## CONCLUSION

This is a big problem; however, with simple ML methods it is becoming tractable.




# JAKUB SVEC

## www.github.com/jakubsvec001

## jakubsvec001@gmail.com

## www.linkedin.com/jakubsvec001/

```python
import glob

import src.wiki_finder as wf

# Locate the compressed dump files and lazily stream pages from the first one.
files = glob.glob('wiki_dump/*bz2')
lines = wf.get_lines_bz2(files[0])
page_gen = wf.page_generator(lines, limit=100)
next(page_gen)
```

```
'<title>Template:Minamiaso Railway Takamori Line</title>\n<ns>10</ns>\n<id>18754827</id>\n<revision>\n<id>780579013</id>\n<parentid>567182113</parentid>\n<timestamp>2017-05-16T00:07:44Z</timestamp>\n<contributor>\n<username>AnomieBOT</username>\n<id>7611264</id>\n</contributor>\n<minor />\n<comment>[[User:AnomieBOT/docs/TemplateSubster|Substing templates]]: {{ja-stalink}}. See [[User:AnomieBOT/docs/TemplateSubster]] for info.</comment>\n<model>wikitext</model>\n<format>text/x-wiki</format>\n<text xml:space="preserve">{{navbox\n| name = Minamiaso Railway Takamori Line\n| title = Stations of the [[Minamiaso Railway Takamori Line|Takamori Line]]\n| titlestyle = {{rail navbox titlestyle|#ff0000}}\n| listclass = hlist\n| list1 =\n* {{STN|Tateno|Kumamoto}}\n* {{STN|Chōyō}}\n* {{STN|Kase|Kumamoto}}\n* {{STN|Aso-Shimodajō-Fureai-Onsen}}\n* {{STN|Minamiaso Mizu-no-Umareru-Sato Hakusui-Kōgen}}\n* {{STN|Nakamatsu}}\n* {{STN|Aso-Shirakawa}}\n* {{STN|Minamiaso-Shirakawasuigen }}\n* {{STN|Miharashid
```

```python
from pymongo import MongoClient

# Connect to the local MongoDB database that caches pages and predictions.
wiki_cache = MongoClient()['wiki_cache']
wiki_cache.list_collection_names()
```

```
['working_collection',
 'edgelist2',
 'mathematics_logistic_predictions_5',
 'pages',
 'mathematics_logistic_predictions',
 'mathematics_logistic_predictions_3',
 'temp_mathematics',
 'edgelist1',
 'mathematics_logistic_predictions_4',
 'all',
 'small_edgelist',
 'mathematics_logistic_predictions_6',
 'mathematics_predictions',
 'mathematics_logistic_predictions_2',
 'new_col']
```

```python
# Peek at one stored prediction.
math_preds = wiki_cache['mathematics_predictions']
math_articles = math_preds.find()
next(math_articles)
```

```
{'_id': ObjectId('5c0d2af2dc335a1ae1c1c033'),
 'title': 'Portal:Video games/Featured topic/11',
 'prediction': 0}
```

```python
# Pages with a predicted probability above 0.18.
# count_documents() replaces the deprecated Cursor.count() used in the original run.
math_preds.count_documents({"prediction": {"$gt": .18}})
```

```
769832
```

```python
769832 / 19000000
```

```
0.04051747368421053
```

```python
# Pages with a predicted probability above 0.2.
math_preds.count_documents({"prediction": {"$gt": .2}})
```

```
486886
```

```python
486886 / 19000000
```

```
0.02562557894736842
```

```python
0.02562557894736842 * 5.8e6
```

```
148628.35789473684
```
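
The extrapolation in these last cells can be wrapped in one small function and applied to both thresholds. The counts are the ones shown above. Reading 19,000,000 as the number of pages scored in the dump is an interpretation of the division in the cells, and the `extrapolate` helper is hypothetical.

```python
PAGES_SCORED = 19_000_000     # denominator used in the cells above (assumed meaning)
ENGLISH_ARTICLES = 5_800_000  # article total quoted in the conclusion

# Pages above each probability threshold, as counted above.
FLAGGED = {0.18: 769_832, 0.20: 486_886}


def extrapolate(flagged, scored=PAGES_SCORED, articles=ENGLISH_ARTICLES):
    """Return (share of scored pages flagged, implied number of articles)."""
    share = flagged / scored
    return share, share * articles


for threshold, flagged in FLAGGED.items():
    share, estimate = extrapolate(flagged)
    print(f"threshold {threshold:.2f}: {share:.2%} of pages, "
          f"about {estimate:,.0f} articles")
```

The estimate is quite sensitive to the threshold: moving it from 0.18 to 0.20 changes the implied article count by tens of thousands.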