This repository has been archived by the owner on May 3, 2019. It is now read-only.

Added README
electrum committed Oct 13, 2011
1 parent bfa06e8 commit 561d749
Showing 1 changed file with 119 additions and 0 deletions.
119 changes: 119 additions & 0 deletions README.md
# JSON Serde for Hive

## Features

* Full support for arrays, maps and structures
* Automatic column to field mapping using table DDL
* Map keys are case-insensitive for convenience
* Optional ignoring of bad records

## Setup

Compile using `mvn clean package`, or download the release JAR:

    curl -L http://bit.ly/mRYaNB > hive-serde-1.0.jar

Register the JAR with Hive:

    add jar hive-serde-1.0.jar;

## Examples

### Simple Table

Create the table:

    CREATE EXTERNAL TABLE message (
        messageid string,
        messagesize int
    )
    ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
    LOCATION '/tmp/json';

Corresponding JSON record:

    {
        "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
        "messageSize": 12345
    }

Note that the JSON field names may contain uppercase characters: `messageId`
in the record maps to the `messageid` column of the table.
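
As a quick check of the mapping, a query along these lines should return the values from the record above. It is only a sketch against the `message` table defined in this section. It relies on Hive treating column names case-insensitively, so either spelling can be used in the query.

```sql
-- Hive column names are case-insensitive, so both spellings below
-- refer to columns of the `message` table created above.
SELECT messageid, messageSize
FROM message;
```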

### Ignoring Errors

Create a table and set the `errors.ignore` serde property:

    CREATE EXTERNAL TABLE message (
        messageid string,
        messagesize int
    )
    ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
    WITH SERDEPROPERTIES ('errors.ignore' = 'true')
    LOCATION '/tmp/json';

With the default `errors.ignore` value of `false`, an error in any record
will cause the entire query to fail.
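
For a table that already exists, it should be possible to change the property in place with Hive's standard `ALTER TABLE ... SET SERDEPROPERTIES` statement rather than recreating the table. This is a sketch that assumes the serde reads `errors.ignore` from the table metadata in the same way as when it is given in the `CREATE TABLE` statement.

```sql
-- Assumption: the serde picks up the changed property from the table
-- metadata the next time the table is queried.
ALTER TABLE message SET SERDEPROPERTIES ('errors.ignore' = 'true');
```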

When set to `true`, every column of a record that has errors will be
`NULL`. This is a limitation of the Hive serde API: the serde cannot
cause Hive to skip the record entirely. However, if you have a column
that is never `NULL` in valid records, such as the primary key, you can
use this column to filter out bad records:

    SELECT * FROM message WHERE messageid IS NOT NULL;

This logic can be encapsulated into a view:

    CREATE VIEW v_message AS
    SELECT * FROM message WHERE messageid IS NOT NULL;
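
The same `NULL` check can also be turned around to see how many records were rejected. The following is a minimal sketch, assuming `errors.ignore` is set to `true` and that `messageid` is never `NULL` in a valid record. The column aliases are only illustrative.

```sql
-- Bad records have every column NULL, so a NULL messageid marks a
-- record that the serde could not parse (under the assumption above).
SELECT
    COUNT(*) AS total_records,
    SUM(CASE WHEN messageid IS NULL THEN 1 ELSE 0 END) AS bad_records
FROM message;
```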

### Nested Structures

Create the table:

    CREATE EXTERNAL TABLE message (
        messageid string,
        messagesize int,
        sender string,
        recipients array<string>,
        messageparts array<struct<
            extension: string,
            size: int
        >>,
        headers map<string,string>
    )
    ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
    LOCATION '/tmp/json';

Corresponding JSON record:

    {
        "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
        "messageSize": 12345,
        "sender": "alice@example.com",
        "recipients": ["joe@example.com", "bob@example.com"],
        "messageParts": [
            {
                "extension": "pdf",
                "size": 4567
            },
            {
                "extension": "jpg",
                "size": 9451
            }
        ],
        "headers": {
            "Received-SPF": "pass",
            "X-Broadcast-Id": "9876"
        }
    }

Query the table:

    SELECT
        messageid,
        recipients[0],
        SIZE(recipients) AS recipient_count,
        messageParts[0].extension,
        headers['received-spf']
    FROM message;
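
The query above reads array elements by index. To get one output row per array element instead, Hive's built-in `LATERAL VIEW explode()` can be applied to the `messageparts` column. This is a sketch using standard Hive functionality rather than anything specific to this serde. The aliases `parts_table` and `message_part` are made up for the example.

```sql
-- Produces one row per element of the messageparts array, with the
-- struct fields of each element exposed as ordinary columns.
SELECT
    messageid,
    message_part.extension AS part_extension,
    message_part.size AS part_size
FROM message
LATERAL VIEW explode(messageparts) parts_table AS message_part;
```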
