How to Load RDF Data


Assumptions

We're assuming your RDF data is stored on an EBS volume (called DATA) in a specific availability zone (called ZONE). (Note: add link to a test snapshot.) We're also assuming that this volume has a partition table and that the first partition holds the data, so the device we mount will be /dev/xvdp1. We will call the machine you are doing this on HOST.

(An alternative strategy is to create a new DATA volume, attach, format and mount, then copy the data from some other location such as S3 before you schedule the load.)
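If you take that alternative route, the sequence might look roughly like the sketch below. The volume size, zone, volume and instance IDs, and the S3 bucket are placeholders you would substitute from your own account, and it assumes the AWS CLI is installed and configured. Note that a freshly created volume has no partition table, so here the raw device is formatted and mounted directly.

    # create a new volume in the same availability zone as HOST (placeholder size and zone)
    aws ec2 create-volume --size 200 --availability-zone us-east-1a

    # attach it to HOST as /dev/xvdp (placeholder volume and instance IDs)
    aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-xxxxxxxx --device /dev/xvdp

    # on HOST: format the raw device (no partition table) and mount it
    sudo mkfs -t ext4 /dev/xvdp
    sudo mkdir -p /mnt/data
    sudo mount /dev/xvdp /mnt/data
    sudo chown ubuntu:ubuntu /mnt/data

    # copy the RDF dump files down from S3 (placeholder bucket and prefix)
    aws s3 sync s3://my-rdf-bucket/dumps/ /mnt/data/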

System sizing

RDFeasy targets the r3 series of instances in AWS. Up to at least the r3.2xlarge, the size of the database file that fits on the instance SSD is the factor that limits how much you can load, and this is expected to remain true up to the r3.4xlarge without software changes.

Create the Machine

  1. Start up an instance of product B00KRI3DWW in the same ZONE as DATA; log in, but do not proceed until the database password (visible on login) has been assigned.

Instructions for logging in are in the Basic Usage instructions.
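If you prefer to launch from the command line rather than the AWS Console, the call might look roughly like the sketch below. The AMI ID corresponding to product B00KRI3DWW in your region, the key pair, and the zone are all placeholders to fill in from your own account, and it assumes the AWS CLI is installed and configured.

    # launch one r3.xlarge in the same availability zone as the DATA volume
    aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type r3.xlarge \
        --placement AvailabilityZone=us-east-1a \
        --key-name my-keypair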

Update operating system

Strictly speaking, this step is optional, but if you're going to burn a machine image, you might as well burn one that has the latest security fixes on it.

You DO NOT want to update the operating system during the first boot, before the new password has been installed in the database.

The following procedure is overkill but bulletproof.

  1. sudo service virtuoso stop
  2. sudo apt-get update
  3. sudo apt-get upgrade -y
  4. sudo reboot

Prepare Machine For Loading

In these steps, the var directory for Virtuoso is copied to an SSD and remounted, fulltext indexing is disabled to improve scalability, and the raw data is made available to the system.

  1. initialize_ssd
  2. wait_until virtuoso_ready (junk output from the curl command is normal when this script runs)
  3. disable_fulltext
  4. use the AWS Console or API to attach DATA to /dev/xvdp
  5. sudo mkdir /mnt/data ; sudo mount /dev/xvdp1 /mnt/data
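Before scheduling the load, it can be worth confirming that the partition is visible and that the data is actually where you expect it; a quick sanity check using standard Ubuntu tools:

    # confirm the block device and its first partition are visible
    lsblk /dev/xvdp

    # confirm the partition is mounted and see how full it is
    df -h /mnt/data

    # peek at the RDF files that will be loaded
    ls /mnt/data | head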

Schedule load

In the Virtuoso bulk loading process, it is necessary first to populate the database table db.dba.load_list with a list of files to be loaded. This is detailed in the Virtuoso Documentation.
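A minimal sketch of that step, assuming the data files sit in /mnt/data, the Virtuoso isql client is on your PATH as isql (on some installs it is named isql-v), and the server listens on the default port 1111. The file pattern and graph IRI are placeholders for your own data set, and $DBA_PASSWORD stands in for the database password assigned at first login.

    # register every matching file in db.dba.load_list (nothing is loaded yet)
    isql 1111 dba "$DBA_PASSWORD" exec="ld_dir('/mnt/data', '*.nt.gz', 'http://example.org/my-graph');"

    # inspect the list that the bulk loader will work through
    isql 1111 dba "$DBA_PASSWORD" exec="SELECT ll_file, ll_state FROM db.dba.load_list;"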

Pre-built configurations

Shell scripts are included to schedule the loading of certain data sets. If you use the snap-7dbc8eaf data set that includes :BaseKB Gold and :SubjectiveEye, there are two different loader scripts:

schedule_small_load -- loads the Compact Edition of :BaseKB on an r3.xlarge instance.
schedule_large_load -- loads the Complete Edition of :BaseKB on an r3.2xlarge instance.

Looking at the source code for these scripts may give you an idea of how to write scripts to load your own data sets. The RDFeasy directory is checked in with Git; feel free to fork it if you wish to write your own loading scripts.

Run Bulk Loader

A single instance of the RDF Bulk Loader can reach nearly 100% CPU usage on a machine with 4 cores or fewer (r3.large or r3.xlarge). On an r3.2xlarge, 100% CPU usage can be attained with 2 copies of the bulk loader running concurrently; presumably one would run 4 copies on an r3.4xlarge and 8 copies on an r3.8xlarge.

A single instance of the bulk loader is created by the command

rdf_loader_run

(which prints some trash to the console). Multiple RDF loaders can be run by running this command more than once, as sketched below.

The following script waits until the end of the load, which could take a few hours for a large data set.

wait_and_beep still_loading
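Putting the pieces together, a run on an r3.2xlarge might look like the sketch below; it assumes, as the text above implies, that each rdf_loader_run invocation returns after starting a loader, so the two loaders work through db.dba.load_list in parallel while wait_and_beep blocks until they finish.

    # start two bulk loaders; two are enough to saturate the CPUs on an r3.2xlarge
    rdf_loader_run
    rdf_loader_run

    # block until the load is finished (possibly several hours for a large data set)
    wait_and_beep still_loading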

Create machine image

By creating the machine image, you take a snapshot of the database state which can be restored later.

  1. Create an EBS volume large enough to hold the database snapshot (call it NEW). It is a conservative choice to create a volume as large as the SSD on the machine you are running on, but it is reasonable to create a volume which is 20% larger than the data file to allow for temporary files created by large queries. (A command-line sketch follows this list.)

  2. Attach NEW to /dev/xvdf on the host with the AWS Console or API

  3. copy_to_ebs

  4. add_ebs_database_to_fstab

  5. shred_evidence_and_halt
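For steps 1 and 2 above, one way to size and create NEW from the command line is sketched below. The path of the database file on the SSD, the volume size, zone, and IDs are placeholders; in particular, the location of the relocated var directory after initialize_ssd is an RDFeasy detail not shown here.

    # see how big the database actually is, then add roughly 20% headroom
    sudo du -sh /ssd/virtuoso    # placeholder path for the relocated var directory

    # create NEW in the same availability zone as HOST (placeholder size and zone)
    aws ec2 create-volume --size 100 --availability-zone us-east-1a

    # attach NEW to HOST as /dev/xvdf (placeholder volume and instance IDs)
    aws ec2 attach-volume --volume-id vol-yyyyyyyy --instance-id i-xxxxxxxx --device /dev/xvdf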

The shred_evidence_and_halt script removes cryptographic key information to make the AMI safe for general distribution. Your cryptographic keys will be installed when you create a new instance based on this AMI; however, the loss of key information on HOST means you will not be able to log into it if you reboot it. This condition can be repaired by mounting the root filesystem of HOST on another computer and editing /home/ubuntu/.ssh/authorized_keys, but it is best practice to terminate HOST once you've created an image from it.

Finally, you need to create the machine image. This can be done from the EC2 Management Console. You should make sure that exactly one EBS volume (the NEW volume) is attached to the machine and that this volume is marked with "Delete on Termination" set to true. (This way you can spin up and terminate many instances of this machine without accumulating large EBS volumes.)
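The same thing can be done from the command line; a sketch, with a placeholder instance ID and image name (check afterwards in the console that the NEW volume's block device mapping has "Delete on Termination" set to true):

    # create the AMI from HOST (which is stopped after shred_evidence_and_halt)
    aws ec2 create-image --instance-id i-xxxxxxxx --name "rdfeasy-image-$(date +%Y%m%d)"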

The time scale of image creation is 'an hour or so' for data sets that fill an r3.xlarge or r3.2xlarge. Terminate HOST when the image is complete.

Using the machine image

Launch the machine image on the same-sized instance as you used to create it. See the usage instructions for the new AMI.
