Skip to content

nikhilgk/wiki-pat

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The cathierarchy table gives a page_id and page_title and that page's categories in category_title. Some pages will also be categories, so connecting all links will result in a hierarchy of categories.

Schema for cathierarchy table/cathierarchy.csv:

Field:Type:
page_idint(8) unsigned
page_titlevarbinary(255)
category_titlevarbinary(255)
page_typeenum('page','subcat','file') -- defaults to 'page'
page_is_redirecttinyint(1) unsigned

Setup the datasource on AWS

The data set of interest is here : http://aws.amazon.com/datasets/2345 Documentation here : http://wiki.freebase.com/wiki/WEX/Documentation

To set this on EC2, log on to AWS and go to the EC2 page. Change your region to US East/N. Virgina. That is the only region where the the data set volume is available. Click on Volumes > Create Volume > Set size to 80 GB and search for the snapshot id "snap-1781757e". Accept the other defaults Note the zone where this is created.

Now create a new Instance. Since this is just for the data, t1.micro is sufficient. I used the following AMI image XLT-4.3.4-Java7u21-Ubuntu-13.04-64bit-IPv6 (ami-01061568) since it had Java and Python already on it. When launching make sure this is the same region and zone where the EBS volume was created. Also the default user to SSH intothis one is 'ubuntu' and not 'root'

Once the EC2 instance launcher, go back to the Volume tab and attach the Wikipedia volume to this instance. Subsequently you can SSH to the instance and mount the volume to the instance.

Here are some more details on what you could do with the data: https://github.com/utcompling/applied-nlp/wiki/Spark-AWS-Exercise2#create-ebs-volume-with-wikipedia-wex-data-and-attach-it-to-the-cluster

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published