Support alternative representations of DocumentDb data types. #4461

Open
Tracked by #4490
dlvenable opened this issue Apr 24, 2024 · 0 comments
Labels
enhancement New feature or request

Problem/Background

Data Prepper has an upcoming documentdb source. Issue #4458 proposes a simple data type solution. However, sometimes we need all of the data that is available.

Solution

Provide options for complex and extended types coming out of DocumentDB.

Data Prepper can support the following mapping options for types.

If you use the relaxed mapping, Data Prepper will use the extended form for any type that does not support relaxed. The relaxed form is more closely related to extended than to the simple types.

Options

We will add a new type_mappings option group within the documentdb source. It will have the following options.

  • default - Can be simple, extended, or relaxed. All types will use this form.
  • object_id - Can be simple, extended, or complex. This configures how BSON ObjectIds are mapped. When configured, this overrides default for BSON ObjectIds.
  • bindata - Can be simple, extended, or complex. This configures how BSON BinData is mapped. When configured, this overrides default for BSON BinData fields.
  • timestamp - Can be simple, extended, or complex. This configures how BSON Timestamps are mapped. When configured, this overrides default for BSON Timestamps.
source:
  documentdb:
    host: 'https://my-docdb.docdb.amazonaws.com'
    type_mappings:
      default: relaxed
      bindata: extended
      timestamp: simple
      object_id: simple
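
The override rule described above (a per-type option wins over default, and relaxed falls back to extended where no relaxed form exists) could be sketched roughly as follows. This is only an illustration in Python, not Data Prepper's implementation: the function name `resolve_mapping`, the `supports_relaxed` flag, and the choice of `simple` when no default is configured are all assumptions.

```python
# Hypothetical sketch of resolving the effective mapping for one BSON type.
VALID_MODES = {"simple", "extended", "relaxed", "complex"}


def resolve_mapping(type_mappings, bson_type, supports_relaxed=True):
    """Return the mapping mode to use for bson_type.

    type_mappings mirrors the proposed type_mappings option group.
    supports_relaxed is an assumed flag saying whether this type has a relaxed form.
    """
    # A per-type option such as object_id overrides default when present.
    # Falling back to "simple" when default is absent is an assumption.
    mode = type_mappings.get(bson_type, type_mappings.get("default", "simple"))
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mapping mode: {mode!r}")
    # Per the proposal, relaxed falls back to extended for types without a relaxed form.
    if mode == "relaxed" and not supports_relaxed:
        return "extended"
    return mode


config = {
    "default": "relaxed",
    "bindata": "extended",
    "timestamp": "simple",
    "object_id": "simple",
}
print(resolve_mapping(config, "bindata"))    # extended
print(resolve_mapping(config, "object_id"))  # simple
```

Any type without its own entry, for example a hypothetical `decimal128` key, would resolve to the configured default.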

Complex types

source:
  documentdb:
    host: 'https://my-docdb.docdb.amazonaws.com'
    type_mappings:
      object_id: complex
      bindata: complex
      timestamp: complex

ObjectId

For BSON ObjectId, the complex form would include the timestamp.

Input:

{
  "name" : "Star Wars",
  "directorId" : ObjectId("fdd898945")
}

Output:

{
  "name" : "Star Wars",
  "directorId" : {
    "id" : "fdd898945",   # The Id string
    "timestamp" : 1713536423    # Unix epoch time in seconds, extracted from the ObjectId
  }
}
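
As a rough sketch of where the timestamp comes from: a BSON ObjectId is 12 bytes, and its first 4 bytes are a big-endian count of seconds since the Unix epoch. The function name `object_id_to_complex` and the sample ObjectId below are made up for illustration; the output keys follow the example above.

```python
def object_id_to_complex(object_id_hex):
    """Build the proposed complex form from an ObjectId's 24-character hex string."""
    raw = bytes.fromhex(object_id_hex)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex characters)")
    # The leading 4 bytes hold the creation time in seconds, big-endian.
    return {
        "id": object_id_hex,
        "timestamp": int.from_bytes(raw[:4], "big"),
    }


# Illustrative ObjectId whose first 4 bytes (0x66227da7) encode 1713536423.
print(object_id_to_complex("66227da70123456789abcdef"))
# {'id': '66227da70123456789abcdef', 'timestamp': 1713536423}
```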

BinData

The complex BinData form will include the subtype. It does this by making the field an object, which will translate into a nested field in OpenSearch.

Input:

{
  "filepath" : "/usr/share/doc1",
  "myBinary" : BinData(binary="[0x88]", subtype="MD5")   # Not actual format; just a conceptual representation
}

Output:

{
  "filepath" : "/usr/share/doc1",
  "myBinary" : {
    "binary" : "X7ah==",
    "subtype" : "MD5"
  }
}
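
A minimal sketch of this conversion, assuming the source hands over the raw bytes and the numeric BSON subtype. The function name and the small subtype-name table are hypothetical; the proposal does not say how subtype labels such as "MD5" are chosen.

```python
import base64

# Assumed labels for a few BSON binary subtypes; not taken from the proposal.
SUBTYPE_NAMES = {0x00: "generic", 0x04: "UUID", 0x05: "MD5"}


def bindata_to_complex(data, subtype):
    """Build the proposed complex form for a BinData value."""
    return {
        # Binary content is base64-encoded so it can be carried as a JSON string.
        "binary": base64.b64encode(data).decode("ascii"),
        # Unknown subtypes fall back to a two-digit hex code (an assumption).
        "subtype": SUBTYPE_NAMES.get(subtype, f"{subtype:02x}"),
    }


print(bindata_to_complex(b"\x88", 0x05))
# {'binary': 'iA==', 'subtype': 'MD5'}
```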

Timestamp

The complex Timestamp form will include the ordinal. It does this by making the field an object, which will translate into a nested field in OpenSearch.

Input:

{
  "name" : "Star Wars",
  "lastUpdatedAt" : Timestamp(time=1713536835, ordinal=12)   # Not actual format
}

Output:

{
  "_id" : "abcdef12345",
  "name" : "Star Wars",
  "lastUpdatedAt" : {
    "timestamp" : 1713536835,    # The time part
    "ordinal" : 12   # The ordinal value
  }
}
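
For reference, a BSON Timestamp is a 64-bit value whose high 32 bits are seconds since the Unix epoch and whose low 32 bits are the ordinal (increment). Assuming the value is available as a single integer, the split could look like this sketch; the function name is hypothetical.

```python
def timestamp_to_complex(bson_timestamp):
    """Split a 64-bit BSON Timestamp into the proposed complex form."""
    return {
        "timestamp": bson_timestamp >> 32,       # high 32 bits: the time part
        "ordinal": bson_timestamp & 0xFFFFFFFF,  # low 32 bits: the ordinal value
    }


# Pack the values from the example above into one 64-bit integer.
packed = (1713536835 << 32) | 12
print(timestamp_to_complex(packed))
# {'timestamp': 1713536835, 'ordinal': 12}
```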

Relaxed types

Configuring the relaxed types will also provide BSON type information. These mappings will look similar to the MongoDB relaxed format.

source:
  documentdb:
    host: 'https://my-docdb.docdb.amazonaws.com'
    type_mappings:
      object_id: relaxed
      bindata: relaxed
      timestamp: relaxed

BinData

Input:

{
  "filepath" : "/usr/share/doc1",
  "myBinary" : BinData(binary="[0x88]", subtype="MD5")   # Not actual format; just a conceptual representation
}

Output:

{
  "filepath" : "/usr/share/doc1",
  "myBinary": {
    "$binary": {
      "base64": "X7ah==",
      "subType": "05"
    }
  }
}
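
A sketch of producing this shape, assuming raw bytes and a numeric subtype as input. The function name is made up; the `$binary`, `base64`, and `subType` keys follow the output above, with the subtype rendered as a two-digit hex string.

```python
import base64


def bindata_to_relaxed(data, subtype):
    """Build the relaxed form for a BinData value."""
    return {
        "$binary": {
            "base64": base64.b64encode(data).decode("ascii"),
            # Subtype 5 (MD5) is rendered as "05".
            "subType": f"{subtype:02x}",
        }
    }


print(bindata_to_relaxed(b"\x88", 5))
# {'$binary': {'base64': 'iA==', 'subType': '05'}}
```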

Timestamp

Input:

{
  "name" : "Star Wars",
  "lastUpdatedAt" : Timestamp(time=1713536835, ordinal=12)   # Not actual format
}

Output:

{
  "_id" : "abcdef12345",
  "name" : "Star Wars",
  "lastUpdatedAt" : {
    "$timestamp": {
      "timestamp": 1713536835,
      "ordinal": 12
    }
  }
}
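
The relaxed Timestamp wraps the same two values under a `$timestamp` key. The sketch below uses the `timestamp` and `ordinal` key names from the output above; note that MongoDB's own Extended JSON uses the shorter keys `t` and `i` inside `$timestamp`, so this is similar to that format rather than identical. The function name is hypothetical.

```python
import json


def timestamp_to_relaxed(time_seconds, ordinal):
    """Build the relaxed form for a Timestamp, using the key names from this proposal."""
    return {"$timestamp": {"timestamp": time_seconds, "ordinal": ordinal}}


event = {"name": "Star Wars", "lastUpdatedAt": timestamp_to_relaxed(1713536835, 12)}
print(json.dumps(event, indent=2))
```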

Extended

Additionally, we can offer extended as an option that includes all type information.

Alternative

The original proposal had complex_ boolean fields.

source:
  documentdb:
    host: 'https://my-docdb.docdb.amazonaws.com'
    mappings:
      complex_object_id: true
      complex_bindata: true
      complex_timestamp: true

However, I've changed the proposal to use an enum option since we want to have three options: simple, complex, and relaxed.


dlvenable added the enhancement (New feature or request) label and removed the untriaged label on Apr 30, 2024
dlvenable changed the title from "Support complex representations of DocumentDb data types." to "Support complex & extended representations of DocumentDb data types." on May 3, 2024
dlvenable changed the title from "Support complex & extended representations of DocumentDb data types." to "Support complex & relaxed representations of DocumentDb data types." on May 3, 2024
dlvenable changed the title from "Support complex & relaxed representations of DocumentDb data types." to "Support alternative representations of DocumentDb data types." on May 3, 2024