Playing with giant datasets. An attempt to map all of the knowledge in the world by drawing connections between Wikipedia topics. Will identify which topics are the most influential in general knowledge, as well as what each topic is related to by links. Hopefully I'll learn how to make a visualization too, that'd be awesome.
model
oldXmlExtractor
.gitignore
Indexing Progress.xls
README
articleLinkParser.py
articleLinkParserTest.py
miniwiki.xml
page_parser.py
wikiLinkExtractor.py

README

Author: Kevin Crane

Playing with giant datasets. An attempt to map all of the knowledge in the world by
drawing connections between Wikipedia topics. Will identify which topics are the
most influential in general knowledge, as well as what each topic is related to by
links. Hopefully I'll learn how to make a visualization too, that'd be awesome.


Files:
    - wikiLinkExtractor.py      : Reads a wiki XML file, iterates through each
                                    article, extracts every link from the article,
                                    then stores the page metadata in IndexModel,
                                    and the links in LinksModel.
    - page_parser.py            : Taken from https://github.com/gareth-lloyd/visualizing-events
                                    Creates an XML parser optimized for long files
                                    and designed to store wiki articles
    - articleLinkParser.py      : Receives a WikiPage object containing the article
                                    text and parses it, extracting all of the links
                                    inside (denoted by: [[link]]). It's not perfect,
                                    but it does a pretty great job at pulling out
                                    links.
    - model
        - IndexModel.py         : Contains all information and methods for the
                                    Index DB table
        - LinksModel.py         : Contains all information and methods for the
                                    Links DB table
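
To give a feel for what articleLinkParser.py is doing, here's a minimal sketch
of pulling [[link]] targets out of article text with a regular expression. This
is NOT the actual code from articleLinkParser.py. The names WIKILINK_RE,
SKIPPED_PREFIXES and extract_links are made up for illustration, and the list of
skipped prefixes is an assumption about which links you'd want to ignore.

```python
import re

# Matches [[target]], [[target|label]] and [[target#section]].
# Only the target part (before any '#' or '|') is captured.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Hypothetical list of prefixes that point at non-article pages.
SKIPPED_PREFIXES = ("File:", "Image:", "Category:")


def extract_links(wikitext):
    """Return the article titles linked from wikitext, in order of appearance."""
    links = []
    for match in WIKILINK_RE.finditer(wikitext):
        title = match.group(1).strip()
        if title and not title.startswith(SKIPPED_PREFIXES):
            links.append(title)
    return links


sample = "The [[cat]] is a [[Felidae|felid]] ([[Mammal#Traits]]). [[File:Cat.jpg|thumb]]"
print(extract_links(sample))
```

Like the real parser, this isn't perfect: nested markup and templates can still
trip it up.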


To Run:
1. First, make wikiLinkExtractor.py executable

$ chmod +x wikiLinkExtractor.py

2. Make sure you're all set up for this. There are some configurations
   you can do to suit yourself, or you can just leave everything as
   default and it'll work. The only thing you have to do if you use
   the default settings is make a new MySQL database called 'wikimap'.

> CREATE DATABASE wikimap;

   - If you want to make changes, go inside the 'main' method of
       wikiLinkExtractor.py and edit the arguments used when instantiating
       a new WikiLinkExtractor. Alternatively, you can change the defaults
       directly in IndexModel.py and LinksModel.py.

3. Next, make sure you know how to use the program so you don't blow
   anything up.

$ ./wikiLinkExtractor.py --help

4. Now comes the fun part. You're going to point the program to your
   giant Wikipedia XML file (~40GB), specify the names of the tables you
   want to store the metadata and links in, and then you just wait. For
   about 40 hours. Luckily I designed it so it can pick up where it
   left off, so feel free to Ctrl-C and restart it whenever you like.

$ ./wikiLinkExtractor.py -f enwiki-20130102-pages-articles.xml -i wiki_index -l wiki_links

   - This program will iterate through the Wiki XML file you pointed to,
       read each article, and extract the links from the text. It will then
       store the article metadata in the Index table you specified, and
       the links contained within that article in the Links table.
   - The first time you run it, you should add the '--reset' flag as well.
       This will create new tables for you and get them all set up. However,
       DO NOT use this flag after you've already started indexing everything.
       It will drop your tables on the floor, erasing all the time you spent
       waiting.
   - Try prepending the 'time' command above if you're curious about
       exactly how long you've spent waiting.
   - As you can see, you'll need to point at an XML file from Wikipedia
       containing all of the information in the world.
       http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
       (tested with data dump from January 2nd, 2013)
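
If you're wondering how a ~40GB XML file can be read without eating all your
RAM, here's a rough sketch of the general idea: stream through the file one
<page> at a time and throw each page away once it's been handled. This uses
xml.etree.ElementTree.iterparse from the standard library, which is a
different approach from page_parser.py and is only shown for illustration.
The function names and the tiny SAMPLE document are made up, and the sample
only imitates the general shape of a Wikipedia dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_id, title, text) for each <page> element in source."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "page":
            continue
        page_id = title = text = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "id" and page_id is None:
                # The first <id> inside a page is the page id;
                # later ones belong to the revision.
                page_id = int(child.text)
            elif name == "text":
                text = child.text or ""
        yield page_id, title, text
        root.clear()  # drop finished pages so memory use stays flat


# Made-up miniature dump, just to exercise the function.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Cat</title><id>6</id>
    <revision><id>100</id><text>A [[Felidae|felid]].</text></revision>
  </page>
  <page><title>Dog</title><id>7</id>
    <revision><id>101</id><text>A [[Canidae|canid]].</text></revision>
  </page>
</mediawiki>"""

for page_id, title, text in iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(page_id, title, text)
```

For a real dump you'd pass an open file (or the file name) instead of the
BytesIO object.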

5. One last step. There's a bunch of waiting in this too, but nowhere near
   as much as the first time. Just call this command and you're free to
   leave and go surf Facebook or Hacker News.

$ ./wikiLinkExtractor.py -i wiki_index -l wiki_links --linkto

   - This will start by spending a good amount of time grouping every single
       row in the Links table by destination article and counting how many
       different pages link to it. This is the most time-consuming operation
       of this whole step, as it needs to sort and group roughly 24,000,000
       Wikipedia article titles, each containing ~140 links.
   - Once this ordeal of sorting is done, it will store these link counts
       into the metadata of the Index table.
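
The group-and-count step above boils down to a GROUP BY over the Links table
followed by an UPDATE on the Index table. Here's a small sketch of that idea.
To keep it runnable anywhere it uses an in-memory SQLite database instead of
MySQL, and the column names (from_id, to_title, links_in, links_out) are my
own guesses for illustration, not the real schema from IndexModel.py or
LinksModel.py. Counting DISTINCT origin pages is also an assumption about
what "number of pages that linked to it" should mean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wiki_index (
        id        INTEGER PRIMARY KEY,
        title     TEXT UNIQUE,
        links_out INTEGER DEFAULT 0,
        links_in  INTEGER DEFAULT 0
    );
    CREATE TABLE wiki_links (from_id INTEGER, to_title TEXT);
""")
conn.executemany("INSERT INTO wiki_index (id, title) VALUES (?, ?)",
                 [(1, "Cat"), (2, "Dog"), (3, "Mammal")])
conn.executemany("INSERT INTO wiki_links (from_id, to_title) VALUES (?, ?)",
                 [(1, "Mammal"), (2, "Mammal"), (2, "Cat")])


def update_inbound_counts(conn):
    """Count how many distinct pages link to each title and store the result."""
    counts = conn.execute(
        "SELECT to_title, COUNT(DISTINCT from_id) FROM wiki_links GROUP BY to_title"
    ).fetchall()
    conn.executemany(
        "UPDATE wiki_index SET links_in = ? WHERE title = ?",
        [(count, title) for title, count in counts],
    )
    conn.commit()


update_inbound_counts(conn)
for row in conn.execute("SELECT title, links_in FROM wiki_index ORDER BY title"):
    print(row)
```

On the real data the GROUP BY is the slow part, which is why this step takes
a while.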

6. You did it, you're done! You should now have 2 MySQL database tables:
   1. The Index table, containing:
      - Article title
      - Wiki Id number
      - Number of pages it linked to
      - Number of pages that linked to it (how popular it is)
   2. The Links table, containing:
      - The Wiki Id of a link's origin
      - The article title of a link's destination

   - This will all be used to serve up a (hopefully) badass web app that
     will let you map the connections between topics of knowledge.
   - If it doesn't turn out THAT incredible, it will still end up being a
     hugely useful application for building up web dev expertise.
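
Once the two tables from step 6 exist, questions like "which topics are the
most linked-to?" or "what does this article link to?" turn into short
queries. The sketch below shows one possible way to ask them. It runs against
an in-memory SQLite stand-in rather than MySQL, and the table contents,
column names and function names are all hypothetical.

```python
import sqlite3


def most_linked(conn, limit=10):
    """Return (title, links_in) rows, most linked-to first."""
    return conn.execute(
        "SELECT title, links_in FROM wiki_index "
        "ORDER BY links_in DESC, title LIMIT ?",
        (limit,),
    ).fetchall()


def outgoing_titles(conn, title):
    """Return the titles that the given article links to, sorted by name."""
    rows = conn.execute(
        "SELECT l.to_title FROM wiki_links AS l "
        "JOIN wiki_index AS i ON i.id = l.from_id "
        "WHERE i.title = ? ORDER BY l.to_title",
        (title,),
    )
    return [row[0] for row in rows]


# Tiny made-up dataset standing in for the real Index and Links tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wiki_index (id INTEGER PRIMARY KEY, title TEXT,
                             links_out INTEGER, links_in INTEGER);
    CREATE TABLE wiki_links (from_id INTEGER, to_title TEXT);
""")
conn.executemany("INSERT INTO wiki_index VALUES (?, ?, ?, ?)",
                 [(1, "Cat", 1, 1), (2, "Dog", 2, 0), (3, "Mammal", 0, 2)])
conn.executemany("INSERT INTO wiki_links VALUES (?, ?)",
                 [(1, "Mammal"), (2, "Mammal"), (2, "Cat")])

print(most_linked(conn, limit=2))
print(outgoing_titles(conn, "Dog"))
```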


Developer Information:
- Name: Kevin Crane
- Contact: kevincrane@gmail.com
- OS: Ubuntu 12.04 LTS
- Language: Python 2.7
- IDE: PyCharm 2.7
- System: Dell Vostro v131