Skip to content
No description or website provided.
Java
Find file
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.
src
.gitignore
README.md
assembly-hadoop-job.xml
pom.xml

README.md

jcascalog-parquet-example

JCascalog Example with Parquet Data Format.

Run Examples

Make sure that the native lzop is installed on your hadoop cluster, for more details, please see https://github.com/twitter/hadoop-lzo.

Write Parquet from Json

This example M/R Job output produces parquet data from the input json data.

Edit the namenode and jobtracker configuration in jcascalog.parquet.WriteParquetFromJson.main():

    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://hadoop01:9000");
    conf.set("mapred.job.tracker", "hadoop01:9001");

To run M/R Job:

mvn -e -DskipTests=true clean install assembly:single ;
hadoop fs -rmr parquet/json;
hadoop fs -rmr parquet/out;
hadoop fs -rmr /tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;
hadoop fs -mkdir /tmp/lib;
hadoop fs -mkdir parquet/json;
hadoop fs -put target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar /tmp/lib;
hadoop fs -put src/test/resources/electricPowerUsageGenerator.json parquet/json/
hadoop jar target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar \
              jcascalog.parquet.WriteParquetFromJson \
              parquet/json/electricPowerUsageGenerator.json \
              parquet/out \
              /tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;

Read Specified Columns from Parquet data with raw M/R Job.

With this raw M/R Job, the specified column will be read from the parquet data.

Edit the namenode and jobtracker configuration in jcascalog.parquet.ReadSpecifiedColumns.main():

    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://hadoop01:9000");
    conf.set("mapred.job.tracker", "hadoop01:9001");

To run this M/R Job:

hadoop jar target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar \
          jcascalog.parquet.ReadSpecifiedColumns \
          parquet/out/  \
          parquet/specified-columns \
          /tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;

Read Specified Columns from Parquet data with JCascalog.

With JCascalog, the specified column will be read from the parquet data.

Edit the namenode and jobtracker configuration in jcascalog.parquet.ReadSpecifedColumnsWithParquetScheme.main():

    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://hadoop01:9000");
    conf.set("mapred.job.tracker", "hadoop01:9001");

To run this M/R Job:

hadoop fs -rmr parquet/specified-columns ;
hadoop jar target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar \
          jcascalog.parquet.ReadSpecifedColumnsWithParquetScheme \
          parquet/out/  \
          parquet/specified-columns \
          /tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;

Verify parquet data with JCascalog.

To verify parquet data, read the parquet data, and produce it as text data.

Edit the namenode and jobtracker configuration in jcascalog.parquet.WriteTextFromParquet.main():

    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://hadoop01:9000");
    conf.set("mapred.job.tracker", "hadoop01:9001");

To run this M/R Job:

hadoop jar target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar \
                  jcascalog.parquet.WriteTextFromParquet \
                  parquet/specified-columns  \
                  parquet/text-out \
                  /tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;
Something went wrong with that request. Please try again.