Getting Started

jeromatron edited this page Sep 15, 2012 · 13 revisions

Step 1: Get Cassandra Source Distribution

First, download the source distribution of Cassandra, found here. It contains CassandraStorage, the code that lets Pig talk to Cassandra by implementing a Pig LoadFunc and StoreFunc. Untar the archive. We'll call the directory where it is untarred $CASSANDRA_HOME.

Step 2: Build and start Cassandra

Build Cassandra from source. This requires ant 1.8+ and, preferably, Sun's latest JDK. On recent Ubuntu releases, you may have to do something like this to get Sun's JDK. Go to $CASSANDRA_HOME, the directory where you untarred Cassandra, and run ant to build Cassandra. Then, still in $CASSANDRA_HOME, start Cassandra by typing sudo bin/cassandra -f, which runs it in the foreground as the root user.
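The build-and-start sequence above can be sketched as a small shell function. This is only an illustration: the function name build_and_start_cassandra is made up for this example, and it assumes ant and a JDK are already on your PATH and that CASSANDRA_HOME is set.

```shell
# Hypothetical helper (not part of Cassandra): build from source, then start
# Cassandra in the foreground as root, as described in Step 2.
build_and_start_cassandra() {
    if [ -z "${CASSANDRA_HOME:-}" ] || [ ! -d "$CASSANDRA_HOME" ]; then
        echo "CASSANDRA_HOME is not set or is not a directory" >&2
        return 1
    fi
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$CASSANDRA_HOME" || exit 1
        ant || exit 1            # build Cassandra
        sudo bin/cassandra -f    # start in the foreground as the root user
    )
}
```

Because -f keeps Cassandra in the foreground, calling build_and_start_cassandra ties up that terminal. Run the remaining steps in a second one.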

Step 3: Get Pig

You'll need a recent version of Pig (the example path here uses 0.9.2). Untar it. We'll call the root directory of the expanded archive $PIG_HOME. You can set it with export PIG_HOME=/home/zaphod/pig-0.9.2.

Step 4: Build CassandraStorage

Once Cassandra is running and you have Pig downloaded and PIG_HOME set, you can build the integration code, called CassandraStorage. This step is not necessary in Cassandra 1.1+, where CassandraStorage is built with the rest of Cassandra. Go to $CASSANDRA_HOME/contrib/pig and run ant in that directory.
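As a rough sketch of this step, the function below runs ant in contrib/pig only when that directory exists. The name build_cassandra_storage is invented for this example. Treating a missing contrib/pig directory as "already built" is an assumption, based on the note that this step is not needed in 1.1+.

```shell
# Hypothetical helper (not part of Cassandra): build CassandraStorage for
# releases that still ship it under contrib/pig.
build_cassandra_storage() {
    if [ -z "${PIG_HOME:-}" ] || [ -z "${CASSANDRA_HOME:-}" ]; then
        echo "PIG_HOME and CASSANDRA_HOME must both be set" >&2
        return 1
    fi
    if [ ! -d "$CASSANDRA_HOME/contrib/pig" ]; then
        # Assumption: no contrib/pig means CassandraStorage was built
        # together with the rest of Cassandra.
        echo "no contrib/pig directory found; skipping"
        return 0
    fi
    ( cd "$CASSANDRA_HOME/contrib/pig" && ant )
}
```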

Step 5: Run Pig

Before running Pig against Cassandra, you need to tell Pig how to contact Cassandra. It needs three pieces of information: an initial address to reach Cassandra, a port on that address, and the partitioner you are using with Cassandra. Set these either as environment variables or as Hadoop variables. For example:

export PIG_INITIAL_ADDRESS=localhost
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner

Now run bin/pig_cassandra -x local (note: you may need to chmod +x bin/pig_cassandra). This is just a script that loads necessary dependencies including CassandraStorage, then starts the Pig Grunt shell.
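If you use the environment-variable route, a quick pre-flight check can catch a forgotten export before Pig starts. The sketch below is illustrative: pig_env_missing and run_pig_cassandra are made-up names, and the check does not cover settings supplied as Hadoop variables.

```shell
# Hypothetical helper: print the name of each required variable that is
# unset or empty in the environment.
pig_env_missing() {
    for name in PIG_INITIAL_ADDRESS PIG_RPC_PORT PIG_PARTITIONER; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

# Hypothetical wrapper: start the Grunt shell only when nothing is missing.
# Run it from the directory that contains bin/pig_cassandra.
run_pig_cassandra() {
    missing=$(pig_env_missing)
    if [ -n "$missing" ]; then
        echo "unset:" $missing >&2
        return 1
    fi
    bin/pig_cassandra -x local "$@"
}
```

Any arguments given to run_pig_cassandra, such as a script name, are passed through to bin/pig_cassandra.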

More information can be found in the README file in the examples/pig directory in Cassandra 1.1+ (or in contrib/pig prior to 1.1).

Step 6: Do something

Now that you are in the Grunt shell, you can run Pig commands interactively. Alternatively, you can run a script with bin/pig_cassandra -x local my_script.pig. A simple thing to do is to count the number of rows in a column family. The script for this is found here. You can either copy its statements into your Grunt shell or run the script directly. Just set the keyspace and column family appropriately.
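For illustration, a row count might look like the script generated below. This is a minimal sketch, not the script linked above: write_count_script is a made-up helper, the keyspace and column family are whatever you pass in, and it assumes CassandraStorage resolves by its short name, as it does once bin/pig_cassandra has loaded its dependencies.

```shell
# Hypothetical helper: write a small Pig script that counts the rows in one
# column family. Arguments: keyspace, column family, output file.
write_count_script() {
    keyspace=$1
    column_family=$2
    out=$3
    cat > "$out" <<EOF
-- Count the rows in one column family.
rows = LOAD 'cassandra://$keyspace/$column_family' USING CassandraStorage();
grouped = GROUP rows ALL;
counted = FOREACH grouped GENERATE COUNT(rows);
DUMP counted;
EOF
}
```

For example, write_count_script MyKeyspace MyColumnFamily count_rows.pig creates the script, and bin/pig_cassandra -x local count_rows.pig runs it.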

More resources

Pig resources:

Pig + Cassandra resources

  • See the source download of the latest version of Cassandra and check out the contrib/pig section (examples/pig in 1.1+).
  • See the Hadoop Support page in the Cassandra wiki.