# ROOT DIRECTORY
## _Organizing Wikipedia Content by Category_

### BUSINESS PROBLEM

#### READER PROBLEMS

It is difficult to explore Wikipedia in a structured way, and it is also difficult to know how to contribute to Wikipedia in a meaningful way:

<img src='images/rabbit_hole.png'>

#### EDITOR PROBLEMS

#### **Editor enthusiasm has been declining**

<img src='images/editor_decline.png'>

#### **Women are underrepresented as article topics**

<img src='images/women_wikipedia.png'>

#### **Articles are overly complex**

<img src='images/complex_wikipedia.png'>

#### Simple wiki data dump size: ~150 MB

<img src='images/simple_wiki_size.png'>

#### Wikipedia data dump size: ~15 GB

<img src='images/wikipedia_dump_size.png'>

#### **Article Quality Assessment**

<img src='images/quality_wikipedia.png'>

## PROBLEM

Wikipedia is the 5th most visited site on the web, and it contains about 5M English-language articles.

Nevertheless, Wikipedia has an organization problem:

<img src='images/31444.png'/>

While several categorization structures exist on Wikipedia, none truly solves the problem. Wikipedia possesses the following organizing structures:

#### _Wikipedia Methods of Categorizing:_

- [Portal page](https://en.wikipedia.org/wiki/Portal:Contents/Portals#Mathematics_and_logic)
- [Contents page](https://en.wikipedia.org/wiki/Portal:Contents/Mathematics_and_logic)
- [Category page](https://en.wikipedia.org/wiki/Category:Mathematics)
- [Outline page](https://en.wikipedia.org/wiki/Outline_of_mathematics)
- [Areas of Mathematics page](https://en.wikipedia.org/wiki/Areas_of_mathematics)
- [Indices](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Overviews](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)
- [Glossaries](https://en.wikipedia.org/wiki/Category:Mathematics-related_lists)

#### _Wikidata Methods of Categorizing:_

Outside of the Wikipedia project, the Wikimedia Foundation also runs the Wikidata project. Wikidata is a community-edited graph database, queried with the SPARQL language, that organizes data based on their interconnections. For example, the following query collects items related to mathematics:

```sparql
# Collect items related to mathematics (wd:Q395) through several properties:
#   P361 = part of, P279 = subclass of, P31 = instance of, P921 = main subject
SELECT DISTINCT ?item ?article ?sitelink ?linkTo WHERE {
  { ?item wdt:P361* wd:Q395 . }
  UNION
  { ?item wdt:P361/wdt:P279* wd:Q395 . }
  UNION
  { ?item wdt:P31/wdt:P279* wd:Q1936384 . }
  UNION
  { ?item wdt:P921/wdt:P279* wd:Q395 . }
  # Attach the English Wikipedia article and its title when one exists
  OPTIONAL {
    ?sitelink ^schema:name ?article .
    ?article schema:about ?item ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }
  OPTIONAL { ?item wdt:P361 ?linkTo . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```
While graph databases are dope, the data is not entirely reliable when humans create the nodes and relationships by hand.

### SOLUTION

One possible solution to this problem is to use graph analysis tools to detect communities of nodes in a network and hope it cuts properly. 

However, this is clumsy and wholly dependent on the quality of the categorization system.
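As a rough illustration of the community-detection idea, the sketch below computes modularity, the score that methods such as Louvain try to maximize when cutting a network into communities. The toy graph, the page names, and the `modularity` helper are assumptions made for illustration; none of them come from the project code.

```python
from collections import defaultdict
from itertools import combinations


def modularity(edges, community_of):
    """Return the modularity Q of a partition of an undirected graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to a community label.
    """
    edges = list(edges)
    m = len(edges)
    degree = defaultdict(int)
    internal_edges = defaultdict(int)  # edges with both ends in one community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1

    degree_sum = defaultdict(int)  # total degree per community
    for node, deg in degree.items():
        degree_sum[community_of[node]] += deg

    # Q = sum over communities c of (L_c / m) - (d_c / 2m)^2
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical toy graph: two 4-page cliques joined by a single link.
math_pages = ["algebra", "topology", "calculus", "geometry"]
cs_pages = ["compiler", "database", "kernel", "parser"]
edges = list(combinations(math_pages, 2)) + list(combinations(cs_pages, 2))
edges.append(("algebra", "compiler"))

two_communities = {
    **{page: "math" for page in math_pages},
    **{page: "cs" for page in cs_pages},
}
one_community = {page: "all" for page in math_pages + cs_pages}

q_split = modularity(edges, two_communities)
q_single = modularity(edges, one_community)
print(q_split > q_single)
```

A partition that follows the real link structure scores higher than lumping every page together, which is why the result is only as good as the links and categories that humans created.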

<img src="images/louvain_modularity_3.png">

A better option is to use the node's article content, since the article content is what Wikipedia is known for.

### DATA UNDERSTANDING

- [Category tree: Mathematics](https://en.wikipedia.org/wiki/Special:CategoryTree?target=Category%3AMathematics&mode=all&namespaces=&title=Special%3ACategoryTree)
- [Category tree: Machine learning](https://en.wikipedia.org/wiki/Special:CategoryTree?target=Category%3AMachine+learning&mode=all&namespaces=&title=Special%3ACategoryTree)

### METHODS USED

Instead of using Wikipedia directly, I am using the Wikipedia XML data dump.

The biggest problem is how to efficiently traverse the dump, which is about 15 GB compressed.

* Solution: generator-based (streaming) parsing of documents
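One way to traverse a large dump without loading it into memory is a generator built on `xml.etree.ElementTree.iterparse`. The sketch below is a minimal version of that idea using only the standard library; the helper names (`local_name`, `iter_pages`, `iter_dump`) and the tiny sample document are assumptions, not the project's actual parser.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that MediaWiki exports put on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_file):
    """Yield (title, wikitext) one page at a time from an open XML file."""
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's subtree to keep memory low


def iter_dump(path):
    """Stream pages straight from a bz2-compressed dump file."""
    with bz2.open(path, "rb") as handle:
        yield from iter_pages(handle)


# Tiny in-memory stand-in for a dump, used only to show the call pattern.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Algebra</title><revision><text>Algebra is ...</text></revision></page>
<page><title>Kernel</title><revision><text>A kernel is ...</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(sample)))
print(pages)
```

On a real dump, `iter_dump` would be consumed in a `for` loop rather than wrapped in `list`, so that only one page is held in memory at a time.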

The training corpus is large, about 80,000 Wikipedia pages, which is too much to hold in memory while building a TF-IDF model.

* Solution: use Gensim to build the TF-IDF model while holding only one page in memory at a time.
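The streaming idea can be sketched without any third-party library. The code below is a standard-library stand-in, not Gensim's API: it makes one pass over a page stream to count document frequencies and a second pass to emit one sparse TF-IDF vector at a time. The whitespace tokenizer, the plain `log(N / df)` weighting, and the toy pages are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def document_frequencies(stream_factory):
    """First pass: count how many documents contain each token."""
    doc_freq = Counter()
    n_docs = 0
    for text in stream_factory():
        doc_freq.update(set(tokenize(text)))
        n_docs += 1
    return doc_freq, n_docs


def tfidf_vectors(stream_factory, doc_freq, n_docs):
    """Second pass: yield one sparse {token: weight} dict per document."""
    for text in stream_factory():
        counts = Counter(tokenize(text))
        yield {
            token: count * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        }


def toy_pages():
    """Hypothetical page stream; a real one would read from the dump."""
    yield "group theory studies groups"
    yield "compiler theory studies parsers"
    yield "a kernel schedules processes"


doc_freq, n_docs = document_frequencies(toy_pages)
vectors = list(tfidf_vectors(toy_pages, doc_freq, n_docs))
print(vectors[0])
```

`stream_factory` is a callable that returns a fresh iterator, because a generator can only be consumed once and the corpus has to be read twice.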

NLP models used:

* TF-IDF: produces 100,000 features

* Multinomial Naive Bayes
* Logistic Regression with Regularization
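To make the first classifier concrete, here is a minimal multinomial naive Bayes over token counts with add-one smoothing, written with the standard library only. It is an illustrative sketch, not the project's model: the class name `TinyMultinomialNB`, the training tokens, and the labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial naive Bayes over token counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.token_counts[label].update(tokens)
        self.vocabulary = {
            token for counts in self.token_counts.values() for token in counts
        }
        return self

    def log_scores(self, tokens):
        """Return the unnormalized log posterior for each class."""
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(doc_count / n_docs)  # class prior
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore tokens never seen in training
                count = self.token_counts[label][token]
                score += math.log((count + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical stemmed tokens and labels, for illustration only.
train_docs = [
    ["theorem", "proof", "algebra"],
    ["proof", "lemma", "theorem"],
    ["compiler", "softwar", "kernel"],
    ["softwar", "databas", "kernel"],
]
train_labels = ["math", "math", "other", "other"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["theorem", "lemma"]))
print(model.predict(["softwar", "kernel"]))
```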

### RESULTS

Log-loss = 0.106

**Threshold:**

* 15%

**Confusion Matrix**

* TN = 14990
* TP = 1024
* FP = 494
* FN = 154


**ROC / AUC**

* 0.98

**Precision score:**

* 0.676

**Recall score:**

* 0.851
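The metrics above can be tied together with a short sketch. It assumes the 15% threshold is the predicted-probability cutoff above which a page is labeled positive, and it uses made-up labels and probabilities rather than project data. The helpers show how the confusion-matrix counts, precision, recall, and log-loss are derived, and how a low cutoff trades precision for recall.

```python
import math


def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, FN, TN when predicting positive at prob >= threshold."""
    tp = fp = fn = tn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of true positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1 - eps)  # avoid log(0)
        total += -math.log(prob) if label else -math.log(1 - prob)
    return total / len(y_true)


# Hypothetical labels and predicted probabilities (not project data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.90, 0.40, 0.20, 0.30, 0.10, 0.05, 0.02, 0.01]

low = confusion_counts(y_true, y_prob, threshold=0.15)
high = confusion_counts(y_true, y_prob, threshold=0.50)

for name, (tp, fp, fn, tn) in (("0.15", low), ("0.50", high)):
    print(name, precision(tp, fp), recall(tp, fn))
print(log_loss(y_true, y_prob))
```

With a rare positive class, a cutoff well below 50% catches more of the true positives at the cost of extra false positives, which is the pattern the confusion matrix shows.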


<img src='images/final_mathematics_roc_cv_logistic_regression_100000.png'>

### FEATURE IMPORTANCE OF _COMPUTER SCIENCE_

#### 100 words most likely to indicate not 'computer science', most to least important:

('chemic', -7.851578360850692),
 ('mathemat', -7.758474907461552),
 ('chemistri', -7.669446723171006),
 ('theorem', -6.918966883271065),
 ('engin', -6.6406563203012),
 ('art', -6.239766768705818),
 ('physic', -5.994345330764655),
 ('molecul', -5.810642942875946),
 ('mechan', -5.179759753880696),
 ('standard', -5.164808156986166),
 ('artist', -5.153499832835458),
 ('broadcast', -4.9958207998841555),
 ('physicist', -4.8905381365296146),
 ('plant', -4.574029118138461),
 ('chemist', -4.393344046399455),
 ('philosophi', -4.348499226618056),
 ('station', -4.2326820824794495),
 ('in mathemat', -4.15840017963774),
 ('materi', -4.1350960288659895),
 ('energi', -4.111683984275497),
 ('mathematician', -4.094475856879512),
 ('book', -4.0499888315776005),
 ('compound', -4.025824975544282),
 ('neural network', -3.9039812197787285),
 ('centuri', -3.8971872722555765),
 ('decad', -3.8787479384432726),
 ('model languag', -3.8738909193297966),
 ('water', -3.8573826952508625),
 ('design', -3.80942866826711),
 ('oper research', -3.7807105998635855),
 ('integr', -3.7701622620380344),
 ('acid', -3.731267402079188),
 ('fiction', -3.707623008977981),
 ('descript', -3.7066722429161683),
 ('paint', -3.616244906860662),
 ('stochast', -3.6024292291427304),
 ('isol', -3.6018469760558944),
 ('cultur', -3.59274976177639),
 ('x-ray', -3.576927490310963),
 ('cipher', -3.5694417481949654),
 ('revert', -3.5608469932451214),
 ('safeti', -3.5187271613187345),
 ('concret', -3.5176821456173175),
 ('molecular', -3.498260604536983),
 ('oil', -3.4438906883998186),
 ('open', -3.425944877691879),
 ('manufactur', -3.4113904838951243),
 ('instrument', -3.3379635460447545),
 ('aerospac', -3.320821719453244),
 ('canon', -3.2616950814280363),
 ('bond', -3.23595956477185),
 ('marin', -3.2193800660682594),
 ('alan ture', -3.161921156868509),
 ('mathemat softwar', -3.1322970535672843),
 ('optic', -3.1178655102725488),
 ('deep', -3.1095446190294904),
 ('experi', -3.0855568588610947),
 ('ring', -3.064782091267034),
 ('tissu', -3.055563259863252),
 ('motion', -3.0435956130256696),
 ('style', -3.0381946304942673),
 ('softwar visual', -3.0139629999655346),
 ('comic', -3.01198289861977),
 ('ada', -3.0027478734658786),
 ('theori', -2.992578171524996),
 ('orbit', -2.9822645259937275),
 ('notat', -2.9549331507379777),
 ('road', -2.9459073764443597),
 ('industri', -2.94389202591929),
 ('combinator', -2.933160107876594),
 ('stream process', -2.912682947291176),
 ('uml', -2.904199672162106),
 ('concentr', -2.895381063503849),
 ('church', -2.868581654468579),
 ('georg', -2.856796015951508),
 ('binomi', -2.8486241053562806),
 ('number theori', -2.844888084141401),
 ('gravit', -2.8393673393682297),
 ('drug', -2.8174278083630093),
 ('vehicl', -2.8078551647054386),
 ('reaction', -2.807416462819754),
 ('protein', -2.8057896691000366),
 ('ann', -2.7895874131255094),
 ('elev', -2.7789608821085765),
 ('professor mathemat', -2.7763163013972223),
 ('magnet', -2.774430367485002),
 ('backpropag', -2.772972123916454),
 ('guidelin', -2.7728131875953195),
 ('spectroscopi', -2.761489635466199),
 ('work', -2.7458929403114505),
 ('greek', -2.7446226568314476),
 ('telephon', -2.736297556833914),
 ('matter', -2.7349926547211068),
 ('biolog', -2.7148538121878385),
 ('usabl', -2.7115433986143156),
 ('b', -2.7022794475222924),
 ('axiomat', -2.701689389968814),
 ('relev', -2.693875244865861),
 ('labor', -2.6828342262877194),
 ('weight', -2.678941268386128)...

#### 200 words with nearly zero betas:

('domain mathemat', -0.19998967997110387),
 ('project', -0.1999765403616043),
 ('one-to-on', -0.19997296835137557),
 ('virtual machines.', -0.19995764721266107),
 ('possibl defin', -0.19988406297924893),
 ('case consid', -0.19975609588153925),
 ('problem p', -0.1996774884310592),
 ('work digit', -0.19967115548835784),
 ('if need', -0.1996067848070963),
 ('help.', -0.1995900672484893),
 ('project plan', -0.1995882388945452),
 ('bnf', -0.19958447680482064),
 ('languag cobol', -0.1995756534458149),
 ('channels.', -0.19956928347049493),
 ('number peopl', -0.19955412096564773),
 ('signatures.', -0.19954826299912903),
 ('fellow royal societi', -0.1995440716645988),
 ('difficult task', -0.19952713872454256),
 ('licens java program', -0.19949453137091275),
 ('univers michigan univers', -0.1994885587900688),
 ('third intern', -0.19948222237908406),
 ('bug track', -0.19942146844105563),
 ('ieee symposium foundat', -0.19937587418109862),
 ('quantit approach', -0.1993534313601791),
 ('improv support', -0.1993131036689898),
 ('euler equat fluid', -0.199308465779406),
 ('program quadrat program', -0.19925649636272486),
 ('1965.', -0.19917693755856553),
 ('v6', -0.19916645937840702),
 ('graph g', -0.19914025218755527),
 ('ture award associ', -0.19909343100431498),
 ('fossil', -0.1990763729273989),
 ('produc correct', -0.19904431821666857),
 ('$1 000', -0.19901761147993957),
 ('quantifi logic', -0.19901205560670857),
 ('image.', -0.1990039050115631),
 ('use window', -0.19897967548530437),
 ('biochem', -0.1989662049006487),
 ('includ increas', -0.19891901100827145),
 ('agent multi-ag system', -0.19890784256192426),
 ('comput pc', -0.19890769876764783),
 ('use captur', -0.19889402710078397),
 ('affect comput', -0.19888261015549016),
 ('if s', -0.19887821685945875),
 ('high integr', -0.1988458201964609),
 ('telephon network', -0.19884237770052438),
 ('c++.', -0.19884026199276802),
 ('eigenvalu eigenvector', -0.19878193523218887),
 ('width.', -0.19873713991642714),
 ('mid-1980', -0.1986265402310612),
 ('visual display', -0.19861375030174286),
 ('harvard mark', -0.1985944816275466),
 ('aqua', -0.19858646966674937),
 ('dunde', -0.198550735494423),
 ('32 64', -0.19852040331960363),
 ('origin equip manufactur', -0.1984934772074005),
 ('n o p', -0.19846727223971347),
 ('window includ', -0.19837034641142542),
 ('complet project', -0.1983696338238587),
 ('human capabl', -0.19835283114842273),
 ('read .', -0.19830431571560553),
 ('learn learn', -0.19827280741722544),
 ('undertook', -0.19823789318440677),
 ('scienc open', -0.19821847120757116),
 ('univers univers pennsylvania', -0.19821507623200285),
 ('bit byte', -0.19817861116164776),
 ('privaci concern', -0.19814151909879552),
 ('use sound', -0.19813285310592857),
 ('function approxim', -0.1981304644391662),
 ('use evalu', -0.19807000802494107),
 ('step size', -0.19799466480980918),
 ('let consid', -0.19795737772563315),
 ('valu point', -0.19795617793148348),
 ('colleg dublin', -0.1978993850811406),
 ('primari reason', -0.19787441947772805),
 ('columns.', -0.19781555049923732),
 ('landscap', -0.1977419032213642),
 ('theorem the', -0.19771133054370232),
 ('trivia', -0.19766528084780868),
 ('comput introduc', -0.19761457188745835),
 ('secondari school', -0.19761397582263174),
 ('institut standard', -0.1975821023595345),
 ('circumst', -0.197551396491683),
 ('it organ', -0.19753973443272113),
 ('novemb 2006', -0.1974755477674394),
 ('lesser extent', -0.19747126479380467),
 ('data. a', -0.19746720544446833),
 ('languag cognit', -0.1974604282713564),
 ('holland', -0.19742301553082164),
 ('confer comput vision', -0.19739739228640008),
 ('deduc', -0.19739372782124862),
 ('comparison word', -0.1973771393971669),
 ('refer applic', -0.19735231129802627),
 ('tape. the', -0.19730491547091067),
 ('june 2', -0.19726322247378283),
 ('power tool', -0.19719828404933343),
 ('decision-making.', -0.19712236600072014),
 ('level languag', -0.1970699867875656),
 ('reuse.', -0.19702892027488517),
 ('depot', -0.1970066027378353)
 
 
 ('bandwidth signal process', 0.1983648361546687),
 ('for exampl object', 0.19839550987651938),
 ('processor program', 0.19839594327427373),
 ('librari creat', 0.19842238899384235),
 ('inconsist', 0.19845211279342825),
 ('complex imag', 0.19849358404174341),
 ('morgan kaufmann', 0.19849670979937362),
 ('number languag', 0.19851093419804644),
 ('function appli', 0.19852662386225087),
 ('john v.', 0.1985382284770658),
 ('bounds.', 0.198547476023647),
 ('3 data', 0.1985600152765519),
 ('2004 –', 0.19856529429566824),
 ('code need', 0.19858173845223365),
 ('system chip', 0.19858645605879074),
 ('broad rang', 0.19861945930621674),
 ('intel-bas', 0.19867274797706944),
 ('formal grammar', 0.19867413388768923),
 ('program languag refer', 0.19869190085917768),
 ('current best', 0.19872527456561745),
 ('simpl general', 0.19873329563036385),
 ('algorithm process', 0.19874783548529756),
 ('comput term', 0.19875167781967074),
 ('let x', 0.19876768254134636),
 ('the edit', 0.1987768147168433),
 ('product server', 0.19877899616491967),
 ('problem said', 0.19884845750338678),
 ('workshop approxim', 0.1988587090769727),
 ('. in practic', 0.19886278425867324),
 ('for problem', 0.19890286338324864),
 ('verg', 0.19892992840689822),
 ('feldman', 0.1989556741884135),
 ('base estim', 0.19896148371030076),
 ('mathemat techniqu', 0.19897766446256618),
 ('compar general', 0.19898569627032237),
 ('semant the', 0.198989052728307),
 ('addit charg', 0.19901453727148233),
 ('feder communic commiss', 0.19902057948569968),
 ('programm system', 0.19902331662958997),
 ('highlight import', 0.19904100672396247),
 ('the award present', 0.19904212404199237),
 ('g x', 0.19904665841553013),
 ('softwar cinema', 0.19906172360528007),
 ('softwar cinema 4d', 0.19906172360528007),
 ('given type', 0.1990627261224826),
 ('o mn', 0.19907187786530517),
 ('develop sourc', 0.19911366157201707),
 ('page document', 0.19912301477624056),
 ('simul entir', 0.19912704509754822),
 ('& associ', 0.19915058609226188),
 ('train phase', 0.19916385021895108),
 ('morphogenesi', 0.19917345943479575),
 ('the follow tabl', 0.19919413654682885),
 ('scienc univers massachusett', 0.19920427859018328),
 ('project coordin', 0.19923743410051992),
 ('see distribut', 0.19925437307187166),
 ('improv bound', 0.19931183780150208),
 ('extern link web', 0.19932193123565664),
 ('econom inform', 0.19935907671240058),
 ('eec', 0.19938060253937914),
 ('angeles.', 0.19942224426074479),
 ('2003 window server', 0.1994354240002631),
 ('los angeles.', 0.1994423332241421),
 ('implement evalu', 0.19944625479618358),
 ('--', 0.1994562202825539),
 ('strict contain', 0.1994748148463178),
 ('volumes. the', 0.19950962455449406),
 ('support virtual', 0.19951649193933624),
 ('index search engin', 0.1995213342165604),
 ('publish the', 0.19952383860810263),
 ('syntact sugar', 0.19952928035327117),
 ('data-intens', 0.19952965688023752),
 ('pros con', 0.19953886673313234),
 ('comput scienc symposium', 0.1995400533504028),
 ('window command', 0.19954392821564831),
 ('import step', 0.19955442184533814),
 ('space shuttl', 0.19956621628409382),
 ('number input', 0.1995778671547914),
 ('idea machin', 0.19958908084162677),
 ('reliability.', 0.19961260239728867),
 ('began research', 0.19961798713145473),
 ('use raster', 0.1996313648562136),
 ('advantag allow', 0.19965937745804768),
 ('basic applic', 0.19966870354934282),
 ('remov correspond', 0.19967371735359601),
 ('motion pictur', 0.19969015422723047),
 ('algorithm divid', 0.19970317817269725),
 ('phase group delay', 0.1997230200731397),
 ('tenex', 0.19973858987479498),
 ('took year', 0.19978180647367025),
 ('provid flexibl', 0.19980927099493404),
 ('unix-bas', 0.19989574381133746),
 ('recognit natur languag', 0.19991147025549205),
 ('digit filters.', 0.19994328128474415),
 ('compat microsoft', 0.1999515151087822),
 ('type new', 0.19996884397026715),
 ('disc author softwar', 0.19997977812873127),
 ('wheeler comput', 0.19999231641387927),
 ('express power', 0.1999979928766968),
 ('with increas', 0.19999806888751756)

#### Logistic Regression 100 most important words, least to most important:

('morph', 4.119365236338052),
 ('rmit', 4.128415639906076),
 ('recurs', 4.131237790448959),
 ('film record', 4.1363871469202484),
 ('processor', 4.136505444432527),
 ('os', 4.137692114896681),
 ('fingerprint', 4.14216059111616),
 ('render', 4.142311861619426),
 ('auditori', 4.154903871040167),
 ('comput fluid dynam', 4.167665271653621),
 ('suit', 4.170170884858311),
 ('comput fluid', 4.180597793074947),
 ('ai', 4.193128388470043),
 ('sta', 4.214921538582018),
 ('string', 4.221619195259182),
 ('he', 4.2339412212221506),
 ('softwar product', 4.242138577214748),
 ('category:softwar develop', 4.2535278435774355),
 ('attribut', 4.304227283964547),
 ('histogram', 4.30849617277669),
 ('pixel', 4.313212961274539),
 ('silico', 4.31945720774421),
 ('mandelbrot', 4.322514942257124),
 ('program languag', 4.324303082710079),
 ('softwar this', 4.356738182065399),
 ('facebook', 4.359344009717464),
 ('outsourc', 4.382432082111433),
 ('programm', 4.396209775998088),
 ('category:str', 4.397424056945821),
 ('javascript', 4.415983836794731),
 ('window', 4.428375837230501),
 ('problem redirect', 4.43046259775955),
 ('php', 4.440126219663056),
 ('ibm', 4.448151047033486),
 ('logic error', 4.467883258517321),
 ('list music', 4.472646392581549),
 ('microsoft', 4.4783514255039885),
 ('extractor', 4.511252662777549),
 ('search', 4.535790613323257),
 ('associ comput machineri', 4.535877442673982),
 ('user group', 4.5467000645112865),
 ('professor comput scienc', 4.584296775891473),
 ('automaton', 4.593175873619951),
 ('artifici life', 4.5951518388624315),
 ('geoscienc', 4.632398326418058),
 ('fair use', 4.643289405038582),
 ('comput science.', 4.6439486709306745),
 ('research', 4.664634318196702),
 ('hacker', 4.67618733754547),
 ('voronoi', 4.690139434404655),
 ('amiga', 4.793060050500764),
 ('optim', 4.815805847607563),
 ('nanci', 4.817618503692693),
 ('category:adob', 4.8251263930720265),
 ('logo', 4.831057552363667),
 ('refer comput', 4.851717162223743),
 ('comput machineri', 4.867049014059079),
 ('british comput', 4.871924755014749),
 ('proprietari', 4.912058898310471),
 ('category:comput scienc', 4.986494905009353),
 ('screenshot', 4.996007555552266),
 ('category:databas', 4.996938310042228),
 ('category:computer-rel', 4.997013818293584),
 ('category:computer-rel introduct', 4.997013818293584),
 ('adob', 5.037659146024428),
 ('symposium', 5.085072503056008),
 ('(software)', 5.144400512626735),
 ('inform system', 5.145772017753641),
 ('professor comput', 5.23905730939863),
 ('process valid', 5.317083821363695),
 ('comput architectur', 5.3308123363557245),
 ('oper system', 5.376299512490849),
 ('category:program', 5.390320188126921),
 ('softwar', 5.591492760817463),
 ('logo.png', 5.690172408177288),
 ('digit human', 5.7860981900219715),
 ('inform technolog', 5.873187140132582),
 ('artifici', 5.941616783107433),
 ('category:microsoft', 5.992778463203099),
 ('associ comput', 6.020310143472257),
 ('comput programm', 6.077404719174291),
 ('googl', 6.093967669568957),
 ('intellig', 6.281565472673802),
 ('category:inform technolog', 6.288972702562188),
 ('graphic', 6.794787690099601),
 ('summari licens', 6.897754489010812),
 ('cyber', 6.9047325562492645),
 ('acm', 6.922219835582864),
 ('comput societi', 6.929344153203035),
 ('data structur', 6.9294262336102355),
 ('informat', 7.0507309438298424),
 ('cfd', 7.180017444881313),
 ('comput magazin', 7.200889092642354),
 ('category:softwar use', 7.2962979806974735),
 ('comput graphic', 7.521076787727505),
 ('in comput', 7.764282905548294),
 ('digit', 8.037493479582334),
 ('secur', 8.30118471581034),
 ('comput scienc', 8.761620273873715),
 ('comput scientist', 9.148392384554072)
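
Lists like the ones above can be produced by sorting `(term, beta)` pairs from a fitted logistic regression. The sketch below uses a handful of pairs copied from the lists; the `rank_terms` helper is hypothetical and only illustrates the sorting.

```python
# A few (term, beta) pairs copied from the lists above.
betas = [
    ("chemic", -7.851578360850692),
    ("comput scientist", 9.148392384554072),
    ("project", -0.1999765403616043),
    ("theorem", -6.918966883271065),
    ("comput scienc", 8.761620273873715),
    ("express power", 0.1999979928766968),
]


def rank_terms(betas, top_n=2):
    """Split terms into strongest negative, strongest positive, and weakest."""
    ordered = sorted(betas, key=lambda pair: pair[1])
    strongest_negative = ordered[:top_n]
    strongest_positive = ordered[-top_n:][::-1]
    weakest = sorted(betas, key=lambda pair: abs(pair[1]))[:top_n]
    return strongest_negative, strongest_positive, weakest


negative, positive, weakest = rank_terms(betas)
print([term for term, _ in negative])
print([term for term, _ in positive])
print([term for term, _ in weakest])
```

Large positive betas push a page toward the 'computer science' class, large negative betas push it away, and betas near zero carry little signal either way.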

### NEXT STEPS

* Deploy the model overnight and get predictions for all of Wikipedia for one category
* Create a feature importance visualization
* Improve metadata page extraction (regex that does not catastrophically fail)
* Make code fully PEP 8 compliant
* Generate a requirements.txt file
* Programmatically delete most important features from dictionary?
* Convert notebook presentation to Google Slides


* Implement tools on top of listed categories
* Explore generalizability of the model to arXiv.org papers
* Train only on page metadata to explore the tradeoff between faster classification and reduced precision/recall
* Pre-clean / Mechanical Turk positive class datasets for more desirable classification

## CONCLUSION

This is a big problem; however, with simple ML methods it is becoming tractable.




# JAKUB SVEC

## www.github.com/jakubsvec001

## jakubsvec001@gmail.com

## www.linkedin.com/jakubsvec001/