ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values required by the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for each type and builds an internal index as the file is written. Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.
This library allows C++ programs to read and write the Optimized Row Columnar (ORC) file format.
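As a quick illustration, a read loop with the C++ API might look like the sketch below. The file name example.orc is hypothetical, and the class and function names (orc::readLocalFile, orc::RowReader, orc::ColumnVectorBatch) assume a recent version of the library's public headers; older releases expose the row-reading methods directly on orc::Reader.

```cpp
#include <orc/OrcFile.hh>
#include <iostream>
#include <memory>

int main() {
  // "example.orc" is a placeholder path; this sketch assumes the ORC C++
  // headers and library are installed and linked.
  std::unique_ptr<orc::InputStream> in = orc::readLocalFile("example.orc");

  orc::ReaderOptions options;
  std::unique_ptr<orc::Reader> reader =
      orc::createReader(std::move(in), options);

  // A RowReader iterates over rows in batches of ColumnVectorBatch.
  orc::RowReaderOptions rowOptions;
  std::unique_ptr<orc::RowReader> rowReader =
      reader->createRowReader(rowOptions);
  std::unique_ptr<orc::ColumnVectorBatch> batch =
      rowReader->createRowBatch(1024);

  // Each call to next() fills the batch with up to 1024 rows.
  while (rowReader->next(*batch)) {
    std::cout << "read " << batch->numElements << " rows\n";
  }
  return 0;
}
```

Because the format is columnar, selecting a subset of columns (via orc::RowReaderOptions) means only those columns' streams are read and decompressed.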
To compile:
% export TZ=America/Los_Angeles
% mkdir build
% cd build
% cmake ..
% make
% make test-out
To build the Java side of the project, use Maven (3.0.x) from http://maven.apache.org/.
You'll need to install the protobuf compiler (2.4.x) from https://code.google.com/p/protobuf/.
You'll also need to install the jdo2 jar in your local Maven repository:
- download jdo2-api-2.3-ec.jar to your working directory
- mvn install:install-file -DgroupId=javax.jdo -DartifactId=jdo2-api -Dversion=2.3-ec -Dpackaging=jar -Dfile=jdo2-api-2.3-ec.jar
Building the jar and running the unit tests:
% mvn package