Currently, CqlStorage has an apparent issue loading CQL3 table data. The problem has been observed with Cassandra 1.2.8 and Pig 0.11.1.
The following JIRAs are open for this issue:
- https://issues.apache.org/jira/browse/CASSANDRA-5941
- https://issues.apache.org/jira/browse/CASSANDRA-5867
Loading a data structure seems to work:
data = LOAD 'cql://bookdata/books' USING CqlStorage();
DESCRIBE data;
results in this:
data: {isbn: chararray,bookauthor: chararray,booktitle: chararray,publisher: chararray,yearofpublication: int}
However, DUMPing the data produces results like this:
((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
Clearly, Cassandra returns each column as a key/value pair, as would be expected. The schema generated by CqlStorage() does not match that structure: operating on data per the schema yields wrong results, and operating on data per the actual structure causes runtime errors.
This UDF is a temporary workaround until the issue is solved.
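To make the mismatch concrete, here is a small Python model of one row. It is only an illustration of the shape of the data, not the UDF's actual implementation: `unwrap_column` and `flatten_row` are hypothetical helper names, and the value types in `actual_row` are guessed from the DESCRIBE output above.

```python
# Field names as reported by DESCRIBE (each declared as a scalar).
declared_fields = ["isbn", "bookauthor", "booktitle", "publisher", "yearofpublication"]

# What DUMP actually shows: every field is a (name, value) pair.
# Value types are an assumption; the DUMP output does not show them.
actual_row = (
    ("isbn", "0425093387"),
    ("bookauthor", "Georgette Heyer"),
    ("booktitle", "Death in the Stocks"),
    ("publisher", "Berkley Pub Group"),
    ("yearofpublication", 1986),
)


def unwrap_column(column):
    """Return the value from a (name, value) pair (hypothetical helper)."""
    if isinstance(column, tuple) and len(column) == 2:
        return column[1]
    # Already a scalar, so pass it through unchanged.
    return column


def flatten_row(row):
    """Unwrap every column so the row matches the declared schema."""
    return tuple(unwrap_column(col) for col in row)


flat_row = flatten_row(actual_row)
record = dict(zip(declared_fields, flat_row))
print(record)
```

Once every column is unwrapped, the row lines up with the declared schema, which is the kind of conversion a workaround has to apply to each field.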
Run the Maven build (for example, mvn package) to generate the jar file. Place the jar somewhere your Pig script can access it, and modify your Pig script like this:
-- Register the UDF
REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT.jar;
-- FromCqlColumn will convert chararray, int, long, float, double
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
-- Load data as normal
data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();
-- Use the UDF
data = FOREACH data_raw GENERATE
    FromCqlColumn(isbn) AS ISBN,
    FromCqlColumn(bookauthor) AS BookAuthor,
    FromCqlColumn(booktitle) AS BookTitle,
    FromCqlColumn(publisher) AS Publisher,
    FromCqlColumn(yearofpublication) AS YearOfPublication;
-- Process data as desired