# WikiD

## Overview

"WikiD" is a Python package for accessing Wikipedia locally and extracting information useful for Natural Language Processing in ways not possible with other Wikipedia packages.

Wikipedia is accessed locally (as opposed to through a web interface) primarily for speed: I wanted the API to iterate through 100,000 pages in less than a minute, for instance. Having Wikipedia locally also allows me to enrich it with information not stored in Wikipedia pages (such as 'reverse links', i.e. links from other Wikipedia pages that point to the page, rather than links that are in the page).
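
To make the idea of reverse links concrete, here is a minimal sketch of how they could be derived once every page's outgoing links are known. The function name `build_reverse_links` and the toy link graph are hypothetical and are not part of the wikid API; the package may compute reverse links differently.

```python
from collections import defaultdict


def build_reverse_links(forward_links):
    """Invert {title: [linked titles]} into {title: [titles linking to it]}."""
    reverse = defaultdict(list)
    for source, targets in forward_links.items():
        # set() so a page that links twice to the same target is counted once
        for target in set(targets):
            reverse[target].append(source)
    # Sort each list so the result is deterministic
    return {title: sorted(sources) for title, sources in reverse.items()}


# Toy link graph (made-up data, not real Wikipedia links)
toy_links = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "B"],
}
toy_reverse = build_reverse_links(toy_links)
print(toy_reverse["A"])
```

Building this mapping needs a full pass over all pages, which is one reason it is easier to do against a local copy than through a web interface.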

The wikid package has a compile module, which takes the raw Wikipedia download (the entire XML file) and compiles it into a file structure designed for efficient access. Compilation takes about 7 hours, but it needs to be done only once per Wikipedia download.
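
As an illustration of what a compile pass over the dump might involve, the sketch below streams `<page>` elements with the standard library's `xml.etree.ElementTree.iterparse` and extracts each page's title and wiki links. This is not the wikid compile module: `iter_pages`, the `WIKILINK` pattern and the hand-written sample are assumptions made for the example. The sample also omits the XML namespace that real dumps declare on the root element, which would require namespace-qualified tag names.

```python
import io
import re
import xml.etree.ElementTree as ET

# Captures the target of [[Target]] and [[Target|label]] wiki links.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")


def iter_pages(xml_stream):
    """Yield (title, links) per <page> without loading the whole file."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "page":  # assumes no XML namespace (see note above)
            title = elem.findtext("title")
            text = elem.findtext("revision/text") or ""
            links = [target.strip() for target in WIKILINK.findall(text)]
            yield title, links
            elem.clear()  # drop the page's subtree to limit memory use


# Hand-written sample in the shape of a dump, not real dump content
SAMPLE = b"""<mediawiki>
  <page>
    <title>Oberbaum Bridge</title>
    <revision>
      <text>Crosses the [[Spree]] and carries the [[Berlin U-Bahn|U-Bahn]].</text>
    </revision>
  </page>
  <page>
    <title>Spree</title>
    <revision>
      <text>A river that flows through [[Berlin]].</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, links in pages:
    print(title, links)
```

Streaming matters here because the full download is far too large to parse into a single in-memory tree.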

The runtime module is the most important one: it lets you access pages in a variety of ways, as described below.


In [1]:
from wikid import *

## Loading

Loads prepared pickles (~1.2 GB) into a "WikiD" object so that it can read the Wikipedia source XML files. Wikipedia itself is not loaded into memory (that would be way too large), but a number of indexes are.

In [2]:
wkd = WikiD()

LOADING TITLE TO INDEX
LOADING INDEX TO TITLE
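
The split between small in-memory indexes and large on-disk page data can be sketched as follows. The on-disk layout used by wikid is not described here, so the byte-offset index below is purely an assumption for illustration; `write_pages`, `read_page` and the toy pages are hypothetical names and data.

```python
import os
import tempfile


def write_pages(path, pages):
    """Write pages back to back and return {title: (offset, length)}."""
    index = {}
    with open(path, "wb") as f:
        for title, body in pages:
            data = body.encode("utf-8")
            index[title] = (f.tell(), len(data))  # where this page starts
            f.write(data)
    return index


def read_page(path, index, title):
    """Fetch a single page by seeking directly to its recorded offset."""
    offset, length = index[title]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8")


# Toy pages (made-up data)
toy_pages = [
    ("Alpha", "<page><title>Alpha</title></page>"),
    ("Beta", "<page><title>Beta</title></page>"),
]

with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "pages.xml")
    offsets = write_pages(store, toy_pages)  # small: kept in memory
    beta = read_page(store, offsets, "Beta")  # large data stays on disk

print(beta)
```

With this kind of layout, only the index needs to be pickled and loaded at startup, while each page lookup costs a single seek and read.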


## Page

"get_page" returns a "Page" object that contains information about a Wikipedia page, sourced from a locally stored XML file.

A Page's attributes include:
* Title
* The raw Wikipedia source XML
* Wikipedia links in the page
* Wikipedia pages pointing to the current page.*
* Whether or not it is a person

(*Note: As of the current build, some links are missing)

In [3]:
page_sj = wkd.get_page("Steve Jobs")
print(page_sj.title)
print(page_sj.links()[:50])

Steve Jobs
['Apple.com', 'Jerry Brown', 'Dow Jones and Company', 'Dock (Mac OS X)', 'Apple Store', 'Chrisann Brennan', 'Jackling House', 'Pancreaticoduodenectomy', 'Microsoft', 'Transistor–transistor logic', 'Backdating', 'Apple Macintosh', 'Dementia', 'William Shakespeare', 'Sun Microsystems', 'Dieter Rams', 'West Coast Computer Faire', 'IBM Personal Computer', 'IWoz', 'Pong', 'Reality distortion field', 'Computer platform', 'Glass ceiling', 'NeXT Introduction', 'Market capitalization', 'WALL-E', 'Haidakhan Babaji', 'Relapse', 'Regis McKenna', 'Audiobook', 'Howard Vollum Award', 'Cupertino_ California', 'Dylan Thomas', 'Toy Story 3', 'Zen', 'Bill Hewlett', 'Classic Mac OS', 'Palo Alto_ California', 'San Jose_ California', 'Wayne Gretzky', 'Sōtō', 'ITunes', 'PC Magazine', 'Pixar', 'Simon and Schuster', 'Circuit board', 'Macworld Conference & Expo#2007', 'Commodore 64', 'United States Coast Guard', 'DNA paternity testing']


## Bidirectional links

Because we have all page links, we can identify which of a page's links point to pages that link back to it. These 'bidirectional_links' are a subset of the 'links' and are often more interesting: they let us prune links to pages that are too generic. For instance, if the 'Bill Gates' page points to 'Technology', the 'Technology' page is very generic and unlikely to point back to 'Bill Gates'. Hence, in that instance, 'Technology' will be a link but not a bidirectional link.

In [4]:
page_bg = wkd.get_page("Bill Gates")

# is_person indicates whether the page describes a person
print(page_bg.is_person)

# bidirectional_links are links to pages that are pointing back at this page.
print(page_bg.bidirectional_links()[:50])


True
['BgC3', 'Lakeside School (Seattle_ Washington)', 'World Economic Forum', 'Bono', 'Reddit', 'Microsoft', 'Criticism of Microsoft', 'Altair 8800', 'Omni Processor', 'Anthony Michael Hall', "The World's Billionaires", 'David Boies', 'The Giving Pledge', 'Micro Instrumentation and Telemetry Systems', 'Leonardo da Vinci', 'MITS Altair 8800', 'Carlos Slim', "Bill Gates' house", 'Jefferson Awards for Public Service', 'Melinda Gates', 'DFBCS', 'Nerds 2.0.1', 'TerraPower', 'National Merit Scholarship Program', 'The Dating Game', 'BASIC', 'William H. Gates Sr.', 'Berkshire Hathaway', 'Superintelligence: Paths_ Dangers_ Strategies', 'Christos Papadimitriou', 'Lists of billionaires', 'New York Institute of Technology', 'Harvard College', 'The Road Ahead (Bill Gates book)', 'The Power of Half', 'Mark Zuckerberg', 'PC DOS', 'Washington (state)', 'Corbis', 'Steve Ballmer', 'Forbes 400', 'OS/2', 'ResearchGate', 'John D. Rockefeller', 'Giving Pledge', 'PDP-10', 'Pancake sorting', 'Harry R. Lewis'
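
The filtering rule described above (keep a link only if its target links back) can be sketched on a toy graph. The helper `bidirectional` and the graph contents are hypothetical and only illustrate the rule; they are not how `bidirectional_links()` is necessarily implemented.

```python
def bidirectional(title, forward_links):
    """Return the links of `title` whose target pages link back to `title`."""
    return [
        target
        for target in forward_links.get(title, [])
        if title in forward_links.get(target, [])
    ]


# Toy link graph (made-up data, not the real link sets)
toy_graph = {
    "Bill Gates": ["Microsoft", "Technology"],
    "Microsoft": ["Bill Gates", "Technology"],
    "Technology": ["Computer"],
}
mutual = bidirectional("Bill Gates", toy_graph)
print(mutual)
```

In this toy graph 'Technology' is dropped because it does not link back, while 'Microsoft' is kept.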

## Iterating through sections of Wikipedia

For testing purposes, it is often useful to iterate through a subset of Wikipedia. The example below gets the 10 consecutive pages starting at position 2,000 and prints the titles of those with more than 10 links.

In [5]:
for page in wkd.range(2000,2010):
    if len(page.links()) > 10:
        print(page.title)
    

Bursa
The Bahamas
Baker Island
Bangladesh
Barbados
Bassas da India
Belarus
Belize
Benin
Bermuda


In [6]:
bridge = wkd.get_page("Oberbaum Bridge")
print(bridge.reverse_links())
print(bridge.raw)

['List of sights in Berlin', 'Portal:Berlin/Topics', 'Culture in Berlin', 'Spree', 'Architecture in Berlin', 'Cengaver Katrancı', 'I Will Follow You', 'Timeline of Berlin', 'Template:Visitor attractions in Berlin', 'Template:Bridges of Berlin', 'Berlin', 'West Berlin', 'Friedrichshain-Kreuzberg', 'Fernsehturm Berlin', 'Berlin U-Bahn', 'Friedrichshain', 'Unknown (2011 film)', 'Template:Berlin Wall', '59th (2nd North Midland) Division']
  <page>
    <title>Oberbaum Bridge</title>
    <ns>0</ns>
    <id>2939238</id>
    <revision>
      <id>736276111</id>
      <parentid>735447604</parentid>
      <timestamp>2016-08-26T10:52:33Z</timestamp>
      <contributor>
        <username>Brewer Bob</username>
        <id>20461221</id>
      </contributor>
      <minor />
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">{{unreferenced|date=April 2016}}
[[Image:Oberbaumbrücke mit U-Bahn.jpg|thumb|An [[Berlin U-Bahn|U-Bahn]] train crosses the Oberbaum B