Step 1: Get Cassandra Source Distribution
First, download the source distribution of Cassandra, found here. It contains the code that enables Pig to talk to Cassandra: CassandraStorage, which implements a Pig LoadFunc and StoreFunc. Untar Cassandra. We'll call the directory where it is untarred $CASSANDRA_HOME.
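The untar step can be sketched as a small shell helper. This is only an illustration: the function name unpack_cassandra and the tarball filename in the usage comment are placeholders, and it assumes a gzip-compressed tarball with a single top-level directory.

```shell
# Hypothetical helper: unpack a Cassandra source tarball into a destination
# directory and print the path it expands to (usable as CASSANDRA_HOME).
unpack_cassandra() {
    local tarball="$1" dest="${2:-.}"
    local top
    mkdir -p "$dest" || return 1
    # Take the first path component of the first archive entry as the
    # top-level directory (assumes the archive has exactly one).
    top="$(tar -tzf "$tarball" | head -n 1 | sed 's|^\./||' | cut -d/ -f1)"
    [ -n "$top" ] || return 1
    tar -xzf "$tarball" -C "$dest" || return 1
    (cd "$dest/$top" && pwd)
}

# Usage (the tarball name is a placeholder, not a real release name):
#   export CASSANDRA_HOME="$(unpack_cassandra apache-cassandra-X.Y.Z-src.tar.gz "$HOME/src")"
```

Printing the expanded path rather than exporting inside the function keeps the helper free of side effects, so the caller decides where CASSANDRA_HOME is set.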
Step 2: Build and start Cassandra
Build Cassandra from source. This requires ant 1.8+ and, preferably, Sun's latest Java JDK. On recent Ubuntu releases, you may have to do something like this to get Sun's JDK. Go to the directory where you untarred Cassandra ($CASSANDRA_HOME) and run ant to build Cassandra. Then start it from the same directory by running sudo bin/cassandra -f, which starts Cassandra in the foreground as the root user.
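A minimal sketch of the build-and-start step, with a check for the ant 1.8+ requirement. The function names are hypothetical, the version comparison assumes a sort that supports -V (as in GNU coreutils), and the parsing assumes that ant -version prints a dotted version number on its first line.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical wrapper: verify ant, build Cassandra, start it in the foreground.
# Assumes CASSANDRA_HOME points at the untarred source tree.
build_and_start_cassandra() {
    local ant_version
    # Extract the first dotted number from the output of `ant -version`.
    ant_version="$(ant -version 2>/dev/null | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)"
    if [ -z "$ant_version" ] || ! version_ge "$ant_version" 1.8; then
        echo "ant 1.8+ is required (found: ${ant_version:-none})" >&2
        return 1
    fi
    cd "$CASSANDRA_HOME" || return 1
    ant || return 1
    # -f keeps Cassandra in the foreground; sudo runs it as root, as in the text.
    sudo bin/cassandra -f
}

# Usage: build_and_start_cassandra
```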
Step 3: Get Pig
You'll need an updated version of Pig. Untar it; we'll call the root directory of the expanded archive $PIG_HOME. Set the PIG_HOME environment variable to that directory.
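For example, PIG_HOME might be set as follows. The path $HOME/pig is purely a placeholder; substitute the directory where you actually untarred Pig.

```shell
# Placeholder path: replace with the directory where Pig was untarred.
# An existing PIG_HOME value is kept if one is already set.
export PIG_HOME="${PIG_HOME:-$HOME/pig}"

# Optional sanity check: warn if the directory does not exist yet.
if [ ! -d "$PIG_HOME" ]; then
    echo "warning: PIG_HOME ($PIG_HOME) is not a directory" >&2
fi
```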
Step 4: Build CassandraStorage
Once Cassandra is running, Pig is downloaded, and PIG_HOME is set, you can build the integration code, called CassandraStorage. (This step is not necessary in Cassandra 1.1+, where CassandraStorage is built with the rest of Cassandra.) Go to $CASSANDRA_HOME/contrib/pig and run ant in that directory.
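This step could be wrapped in a guard like the one below. It is a sketch under one assumption not stated on this page: that a missing contrib/pig directory means you are on Cassandra 1.1+, where there is nothing separate to build. The function name is hypothetical.

```shell
# Hypothetical wrapper around the CassandraStorage build.
build_cassandra_storage() {
    if [ -z "${PIG_HOME:-}" ] || [ -z "${CASSANDRA_HOME:-}" ]; then
        echo "PIG_HOME and CASSANDRA_HOME must both be set" >&2
        return 1
    fi
    # Assumption: no contrib/pig directory implies Cassandra 1.1+,
    # where CassandraStorage is built with the rest of Cassandra.
    if [ ! -d "$CASSANDRA_HOME/contrib/pig" ]; then
        echo "no contrib/pig directory; assuming Cassandra 1.1+, nothing to build"
        return 0
    fi
    # Run ant in a subshell so the caller's working directory is unchanged.
    (cd "$CASSANDRA_HOME/contrib/pig" && ant)
}

# Usage: build_cassandra_storage
```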
Step 5: Run Pig
Before running with Pig and Cassandra, you need to inform Pig how to contact Cassandra. You'll need to give it three pieces of information: an initial address to reach Cassandra, a port on that address, and the partitioner you are using with Cassandra. You need to set these either as environment variables or Hadoop variables. For example:
export PIG_INITIAL_ADDRESS=localhost
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
Then start Pig by running bin/pig_cassandra -x local (note: you may need to chmod +x bin/pig_cassandra first). This is just a script that loads the necessary dependencies, including CassandraStorage, and then starts the Pig Grunt shell.
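Because the launcher depends on all three settings, it can help to verify them before starting Pig. The sketch below reuses the example values from this step as defaults; the function name check_pig_cassandra_env is hypothetical, and the check only covers the environment-variable route, not Hadoop variables.

```shell
# Defaults mirror the example values in this step; override them for a real cluster.
export PIG_INITIAL_ADDRESS="${PIG_INITIAL_ADDRESS:-localhost}"
export PIG_RPC_PORT="${PIG_RPC_PORT:-9160}"
export PIG_PARTITIONER="${PIG_PARTITIONER:-org.apache.cassandra.dht.RandomPartitioner}"

# Hypothetical check: report any of the three variables that is empty or unset.
check_pig_cassandra_env() {
    local name value missing
    missing=0
    for name in PIG_INITIAL_ADDRESS PIG_RPC_PORT PIG_PARTITIONER; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_pig_cassandra_env && bin/pig_cassandra -x local
```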
More information can be found in the README file in the examples/pig directory in Cassandra 1.1+ (or in contrib/pig prior to 1.1).
Step 6: Do something
Now that you are in the Grunt shell, you can run Pig commands interactively. You can also run a script by passing it to the launcher: bin/pig_cassandra -x local my_script.pig. A simple thing to do is to count the number of rows in a column family. The script for this is found here. You can either copy the statements into your Grunt shell or run the script directly. Just set the keyspace and column family appropriately.
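The linked script is not reproduced on this page, so the following is only a sketch of what a row count might look like, written out from the shell. It assumes the cassandra://keyspace/column_family location form and that the pig_cassandra launcher makes the short name CassandraStorage resolvable; MyKeyspace, MyColumnFamily, and count_rows.pig are placeholder names.

```shell
# Placeholder keyspace and column family names; set them to your own.
KEYSPACE="${KEYSPACE:-MyKeyspace}"
COLUMN_FAMILY="${COLUMN_FAMILY:-MyColumnFamily}"

# Write a small Pig script that loads every row, groups them all together,
# and counts the members of that single group.
cat > count_rows.pig <<EOF
rows = LOAD 'cassandra://${KEYSPACE}/${COLUMN_FAMILY}' USING CassandraStorage();
grouped = GROUP rows ALL;
counted = FOREACH grouped GENERATE COUNT(rows);
DUMP counted;
EOF

# Then run it from the directory that contains bin/pig_cassandra:
#   bin/pig_cassandra -x local count_rows.pig
```

The same four statements can also be typed one at a time at the Grunt prompt with the keyspace and column family filled in by hand.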
Pig resources
- Pig 0.9 docs
- Programming Pig - A great reference by Alan Gates of Hortonworks.
- Introduction to Pig video by Alan Gates (then) of Yahoo! It's a little older but good. Project Gutenberg has the Bible and Shakespeare texts. I just removed the headers from the UTF-8 versions to use them.
- Introduction to Pig from Data Day Austin by Jacob Perkins from Infochimps. The video is linked here. The airports project is on github. Also a similar blog post by him.
- elephant-bird - A set of Hadoop utilities created by Twitter. It includes a great JSON loader for Pig.
- tmbundle - A TextMate code highlighting bundle by Kevin Weil at Twitter.
- sublime-text-pig - A Sublime Text 2 package for Pig.
Pig + Cassandra resources
- See the source download of the latest version of Cassandra and check out the contrib/pig section (examples/pig in Cassandra 1.1+).
- See the Hadoop Support page in the Cassandra wiki.