How to Load RDF Data
We're assuming your RDF data is stored on an EBS volume (called DATA) in some specific availability zone (called ZONE). (Note: Add link to a test snapshot) We're also assuming that this volume has a partition table and that the first partition holds the data, so we will mount the drive as /dev/xvdp1. We will call the machine you are doing this on the HOST.
(An alternative strategy is to create a new DATA volume; attach, format, and mount it, then copy the data from some other location such as S3 before you schedule the load.)
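Under that alternative, the volume preparation might look like the sketch below. The device name, bucket, and paths are placeholders (not part of RDFeasy), and we assume the volume was attached with a partition table already in place.

```shell
# Hedged sketch of the alternative: format and mount a fresh DATA
# volume, then pull the dumps from S3. Bucket and paths are
# placeholders chosen for illustration.
sudo mkfs -t ext4 /dev/xvdp1        # assumes the first partition exists
sudo mkdir -p /mnt/data
sudo mount /dev/xvdp1 /mnt/data
aws s3 cp s3://my-rdf-bucket/dumps/ /mnt/data/ --recursive
```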
RDFeasy targets the r3 series of instances in AWS. Up to at least the r3.2xlarge instance, the size of the database file that can be stored on the SSD is the factor that limits how much you can load, and this is expected to continue up to the r3.4xlarge without software changes.
Create the Machine
- Start up an instance of product B00KRI3DWW in the same ZONE as the DATA volume; log in, but do not proceed until the database password (visible on login) has been assigned.
Instructions for logging in are in the Basic Usage instructions
Update operating system
Strictly speaking, this step is optional, but if you're going to burn a machine image, you might as well burn one that has the latest security fixes on it.
You DO NOT want to update the operating system during the first boot, before the new password has been installed in the database.
The following procedure is overkill but bulletproof.
```
sudo service virtuoso stop
sudo apt-get update
sudo apt-get upgrade -y
```
Prepare Machine For Loading
In these steps, the var directory for Virtuoso is copied to an SSD and remounted. Fulltext indexing is disabled to improve scalability, and the raw data is added to the system.
```
wait_until virtuoso_ready
```
(Junk output from the curl command is normal when this script runs.)
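To see roughly what such a readiness check does, a minimal version can be written with curl. This is a sketch assuming Virtuoso's default SPARQL endpoint on port 8890; the bundled wait_until/virtuoso_ready scripts may differ in detail.

```shell
# Minimal readiness wait, assuming the default SPARQL endpoint
# at http://localhost:8890/sparql (an assumption, not a given).
until curl -sf -o /dev/null http://localhost:8890/sparql; do
  echo "waiting for virtuoso..." ; sleep 5
done
echo "virtuoso is up"
```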
- Use the AWS Console or API to attach the DATA volume to the HOST.
```
sudo mkdir /mnt/data ; sudo mount /dev/xvdp1 /mnt/data
```
In the Virtuoso bulk loading process, it is necessary first to populate the database table db.dba.load_list with a list of files to be loaded. This is detailed in the Virtuoso documentation.
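A minimal sketch of that step uses the stock ld_dir() function from the Virtuoso documentation. The graph URI below is a placeholder, the dba password is assumed to be in $PASSWORD, and /mnt/data must appear in DirsAllowed in virtuoso.ini.

```shell
# Register all gzipped files under /mnt/data for bulk loading.
# The target graph URI is a placeholder.
isql 1111 dba "$PASSWORD" EXEC="ld_dir('/mnt/data', '*.gz', 'http://example.com/graph');"
# Inspect what was scheduled:
isql 1111 dba "$PASSWORD" EXEC="SELECT ll_file, ll_state FROM db.dba.load_list;"
```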
Shell scripts are included to schedule the loading of certain data sets. If you use the snap-7dbc8eaf data set, which includes :BaseKB Gold and :SubjectiveEye, there are two different loader scripts.
Looking at the source code for these scripts may give you some idea as to how to write your own scripts to load your own data sets. The RDFeasy directory is checked into Git; feel free to fork it if you wish to write your own loading scripts.
Run Bulk Loader
A single instance of the RDF Bulk Loader can reach nearly 100% CPU usage on a 4-core or smaller machine (r3.xlarge). 100% CPU usage can be attained with 2 copies of the bulk loader running concurrently (r3.2xlarge); presumably one runs 4 copies on an r3.4xlarge and 8 copies on an r3.8xlarge.
A single instance of the bulk loader is created by the loader command (which prints some trash to the console). Multiple RDF loaders can be run by running this command more than once.
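If you prefer to bypass the bundled script, the stock Virtuoso way to start a loader is the rdf_loader_run() procedure. The sketch below starts two of them, as you might on an r3.2xlarge; the bundled RDFeasy command may wrap this differently.

```shell
# Hedged sketch: two concurrent bulk loaders via the stock
# rdf_loader_run() procedure, each in a background isql session.
isql 1111 dba "$PASSWORD" EXEC="rdf_loader_run();" &
isql 1111 dba "$PASSWORD" EXEC="rdf_loader_run();" &
wait
```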
The following script waits until the end of the load, which could take a few hours for a large data set.
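Such a wait script can be approximated by polling db.dba.load_list; per the Virtuoso documentation, ll_state = 2 marks a finished file. This is a sketch under that assumption, not the bundled script.

```shell
# Poll until every scheduled file reaches ll_state = 2 (done),
# then force a checkpoint so the loaded data is durable on disk.
while true; do
  remaining=$(isql 1111 dba "$PASSWORD" \
      EXEC="SELECT COUNT(*) FROM db.dba.load_list WHERE ll_state <> 2;" \
    | grep -Eo '^[0-9]+' | head -1)
  [ "$remaining" = "0" ] && break
  sleep 60
done
isql 1111 dba "$PASSWORD" EXEC="checkpoint;"
```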
Create machine image
By creating the machine image, you take a snapshot of the database state which can be restored later.
- Create an EBS volume large enough to hold the database snapshot (call it NEW). It is a conservative choice to create a volume as large as the SSD on the machine you are running on, but it is reasonable to create a volume which is 20% larger than the data file, to allow for temporary files created by large queries.
- Attach NEW to /dev/xvdf on the HOST with the AWS Console or API.
shred_evidence_and_halt removes cryptographic key information to make the AMI safe for general distribution. Your cryptographic keys will be installed when you create a new instance based on this AMI; however, the loss of key information on the HOST means you will not be able to log into it if you reboot it. This condition can be repaired by mounting the root filesystem of HOST on another computer and editing /home/ubuntu/.ssh/authorized_keys, but it is a best practice to terminate HOST once you've created an image from it.
Finally, you need to create the machine image. This can be done from the EC2 Management Console. You should make sure that exactly one EBS volume (the NEW volume) is attached to the machine and that this volume has "Delete on Termination" set to true. (This way you can spin up and terminate many instances of this machine without accumulating large EBS volumes.)
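The same step can also be done with the AWS CLI. In this sketch, INSTANCE_ID and the image name are placeholders; since shred_evidence_and_halt has already stopped the machine, no reboot flag is needed.

```shell
# Hedged sketch: create the AMI from the CLI instead of the
# console. INSTANCE_ID and the image name are placeholders.
aws ec2 create-image \
  --instance-id "$INSTANCE_ID" \
  --name "rdfeasy-loaded-$(date +%Y%m%d)" \
  --description "RDFeasy with data set loaded"
```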
The time scale of image creation is 'an hour or so' for data sets that fill the instance's SSD. Terminate HOST when the image is complete.
Using the machine image
Launch the machine image on the same-sized instance as you used to create it. See the usage instructions for the new AMI.