Jsonpath parser plugin for Embulk

A parser plugin for Embulk that parses JSON using JSONPath.

Overview

  • Plugin type: parser
  • Guess supported: yes (guessing works on JSON data up to 32KB in size; see #476)

Configuration

  • type: Set to jsonpath to use this parser
  • columns: Name and type of each column; see the columns section below (array, required)
  • root: JSONPath to the data to parse. It must point to an array (string, default: '$')
  • stop_on_invalid_record: Stop the bulk load transaction if a file includes an invalid record, such as an invalid timestamp (boolean, default: false)
  • default_timezone: Default timezone of timestamps (string, default: UTC)
  • default_timestamp_format: Default format of timestamps (string, default: %Y-%m-%d %H:%M:%S.%N %z)
  • default_typecast: Whether to cast values automatically to the specified types (boolean, default: true)
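
As a rough illustration, the parser-level options above might be combined as in the following sketch. The values and the column list are only examples borrowed from the Basic Usage data later in this README; they are not recommended settings.

```yaml
in:
  type: file                      # any file input plugin type
  path_prefix: example/input      # illustrative path
  parser:
    type: jsonpath
    root: "$.results"             # must point to an array
    stop_on_invalid_record: true  # default is false
    default_timezone: "Asia/Tokyo"
    default_timestamp_format: "%Y-%m-%d %H:%M:%S"
    default_typecast: true        # this is already the default
    columns:
      - { name: "name",          type: string }
      - { name: "registered_at", type: timestamp }  # uses default_timestamp_format
```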

columns

  • name: Name of the column (string, required)
  • type: Type of the column (string, required)
  • timezone: Timezone of the timestamp if type is timestamp (string, default: default_timezone)
  • format: Format of the timestamp if type is timestamp (string, default: default_timestamp_format)
  • typecast: Whether to cast values to the column type (boolean, default: default_typecast)
  • path: JSONPath for this specific column (string, default: null)
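
A minimal sketch of per-column overrides, assuming the same field names as the examples in this README. Each column here overrides one of the parser-level defaults described above; the specific values are illustrative only.

```yaml
columns:
  # Override the default timezone and timestamp format for one column.
  - { name: "registered_at", type: timestamp, timezone: "Asia/Tokyo", format: "%Y-%m-%d %H:%M:%S" }
  # Turn off automatic casting for this column only.
  - { name: "zip_code", type: string, typecast: false }
  # Read the value from an explicit JSONPath rather than relying on the column name.
  - { name: "firstName", type: string, path: "names[0]" }
```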

Example

Basic Usage

{
  "count": 100,
  "page": 1,
  "results": [
    {
      "name": "Hugh Rutherford",
      "city": "Mitchellfurt",
      "street_name": "Ondricka Island",
      "zip_code": "75232",
      "registered_at": "2015-09-09 05:28:45",
      "vegetarian": true,
      "age": 44,
      "ratio": 79.092
    },
    {
      "name": "Miss Carmella Bashirian",
      "city": "Madilynchester",
      "street_name": "Rhea Walks",
      "zip_code": "44398",
      "registered_at": "2014-07-01 04:25:27",
      "vegetarian": true,
      "age": 73,
      "ratio": 50.608
    }]
}

in:
  type: any file input plugin type
  parser:
    type: jsonpath
    root: "$.results"
    default_timezone: "Asia/Tokyo"
    columns:
      - { name: "name",          type: string }
      - { name: "city",          type: string }
      - { name: "street_name",   type: string }
      - { name: "zip_code",      type: string }
      - { name: "registered_at", type: timestamp, format: "%Y-%m-%d %H:%M:%S" }
      - { name: "vegetarian",    type: boolean }
      - { name: "age",           type: long }
      - { name: "ratio",         type: double }

Preview results

*************************** 1 ***************************
         name (   string) : Hugh Rutherford
         city (   string) : Mitchellfurt
  street_name (   string) : Ondricka Island
     zip_code (   string) : 75232
registered_at (timestamp) : 2015-09-08 20:28:45 UTC
   vegetarian (  boolean) : true
          age (     long) : 44
        ratio (   double) : 79.092
*************************** 2 ***************************
         name (   string) : Miss Carmella Bashirian
         city (   string) : Madilynchester
  street_name (   string) : Rhea Walks
     zip_code (   string) : 44398
registered_at (timestamp) : 2014-06-30 19:25:27 UTC
   vegetarian (  boolean) : true
          age (     long) : 73
        ratio (   double) : 50.608

Handling more complicated JSON

To handle more complicated JSON, you can also specify a JSONPath in the path option of each entry in the columns section, as follows:

{
    "result" : "success",
    "students" : [
      { "names" : ["John", "Lennon"], "age" : 10 },
      { "names" : ["Paul", "Maccartney"], "age" : 10 }
    ]
}

root: $.students
columns:
  - {name: firstName, type: string, path: "names[0]"}
  - {name: lastName, type: string, path: "names[1]"}

In this case, names[0] becomes the firstName column of the schema and names[1] becomes the lastName column.
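
For context, the fragment above could sit inside a complete configuration along these lines. This is only a sketch: the path_prefix value is hypothetical, and the age column is added on the assumption that a column without path is looked up by its name, as in the Basic Usage example.

```yaml
in:
  type: file
  path_prefix: example/students   # hypothetical path, not part of this repository
  parser:
    type: jsonpath
    root: "$.students"
    columns:
      - { name: firstName, type: string, path: "names[0]" }
      - { name: lastName,  type: string, path: "names[1]" }
      - { name: age,       type: long }   # no path: assumed to match the "age" key
out:
  type: stdout
```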

Guess

This plugin supports a minimal guess feature, so you do not have to write the parser: section of the configuration file yourself. After writing the in: section, you can let Embulk guess the parser: section with the following commands:

$ embulk gem install embulk-parser-jsonpath
$ embulk guess -g jsonpath config.yml -o guessed.yml

Example

To guess the following JSON file, which starts with an array, you do not need a parser section.

[
  {
    "name": "Hugh Rutherford",
    "city": "Mitchellfurt",
    "street_name": "Ondricka Island",
    "zip_code": "75232",
    "registered_at": "2015-09-09 05:28:45",
    "vegetarian": true,
    "age": 44,
    "ratio": 79.092
  }
]

in:
  type: file
  path_prefix: example/hoge
out:
  type: stdout

However, if the JSON data does not start with an array, you have to specify the root parameter explicitly.

{
  "count": 100,
  "page": 1,
  "results": [
    {
      "name": "Hugh Rutherford",
      "city": "Mitchellfurt",
      "street_name": "Ondricka Island",
      "zip_code": "75232",
      "registered_at": "2015-09-09 05:28:45",
      "vegetarian": true,
      "age": 44,
      "ratio": 79.092
    }
  ]
}

in:
  type: file
  path_prefix: example/input
  parser:
    type: jsonpath
    root: "$.results"
out:
  type: stdout

Build

$ ./gradlew gem  # -t to watch for file changes and rebuild continuously

Acknowledgment

I would like to express my special thanks to the developers of the embulk-parser-jsonl and embulk-filter-typecast projects.

Most of the code was copied from these projects.