Can't store entities with embedded entities which have properties > 1500 bytes #1916

Closed
bedeoverend opened this issue Jan 11, 2017 · 41 comments

@bedeoverend

It seems that excluding a top-level property doesn't exclude an embedded entity's properties from being indexed. On top of that, I don't believe there's a way to exclude an embedded entity's individual properties, which would also solve my problem.

Steps to reproduce

Just use this script (I'm running the Datastore emulator):

const Datastore = require('@google-cloud/datastore');
const datastore = new Datastore({ projectId: 'datastore-test' });

datastore.insert({
  key: datastore.key('User'),
  data: [
    {
      name: 'description',
      value: {
        text: Buffer.alloc(1501, 'a').toString(),
      },
      excludeFromIndexes: true
    }
  ]
}, (err, result) => {
  if (err) {
    console.error(err);
  } else {
    console.log(result.data);
  }
});

I'd expect there to be no error - the embedded entity ({ text: '...' }) shouldn't be indexed, I believe? But it throws an error saying:

Error: The value of property "text" is longer than 1500 bytes.

Environment

  • OS: OS X Sierra
  • Node.js: v6.7.0
  • npm: 3.10.3
  • @google-cloud/datastore: v0.6.0
@stephenplusplus
Contributor

Thanks for reporting.

@pcostell is it possible in Datastore to exclude individual properties within an Entity from indexing?

@stephenplusplus added the api: datastore label on Jan 11, 2017
@bedeoverend
Author

FWIW I've had a play with the datastore save method to support the explicit syntax on nested properties, e.g.:

datastore.insert({
  key: datastore.key('Stuff'),
  data: [{
    name: 'description',
    value: [{
      name: 'text',
      value: Buffer.alloc(1501, 'a').toString(),
      excludeFromIndexes: true
    }, {
      name: 'long',
      value: true
    }]
  }]
}, handler);

Obviously that wouldn't work as an interface, though, as it would break any properties that have array values; e.g. this would break:

datastore.insert({
  key: datastore.key('Stuff'),
  data: [{
    name: 'foo',
    value: [ 1, 2, 3 ]
  }]
}, handler);

For reference, this is the hacked reducer function (previously here) that makes this work:

// Assumes the helpers already used in the library's entity module are in
// scope (`is`, `entity.encodeValue`, and `propAssign`).
function encodeValues(acc, data) {
  var value;

  if (is.array(data.value)) {
    value = {
      entityValue: {
        properties: data.value.reduce(encodeValues, {})
      }
    };
  } else {
    value = entity.encodeValue(data.value);

    if (is.boolean(data.excludeFromIndexes)) {
      var excluded = data.excludeFromIndexes;
      var values = value.arrayValue && value.arrayValue.values;

      if (values) {
        values = values.map(propAssign('excludeFromIndexes', excluded));
      } else {
        value.excludeFromIndexes = data.excludeFromIndexes;
      }
    }
  }

  acc[data.name] = value;

  return acc;
}
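Presumably it would be wired in wherever the library builds the proto properties map from the verbose data array - a hypothetical call along these lines (names assumed, not the library's actual internals):

// Hypothetical wiring (names assumed): build the proto `properties` map
// from the verbose `data` array using the reducer above.
var properties = entityObject.data.reduce(encodeValues, {});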

@pcostell
Contributor

Excluding a top-level entity value will exclude it and all its sub-values from indexing. However, it still must be a valid entity. This means that properties that cannot be indexed should be marked accordingly.

In the Datastore proto API, this is done by passing excludeFromIndexes at the appropriate levels. This is important from an API perspective because it means an entity value is a valid entity (and could be moved to a top-level entity if desired).
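For illustration, here's roughly what that looks like at the proto/JSON level for the opening example - field names follow the Datastore v1 entity format, but the exact shape here is a sketch rather than a verbatim API payload:

// Rough sketch of the Datastore v1 entity JSON for the opening example, with
// excludeFromIndexes carried on the nested property itself (shape assumed):
{
  key: { path: [ { kind: 'User' } ] },
  properties: {
    description: {
      entityValue: {
        properties: {
          text: {
            stringValue: 'aaa...', // the > 1500-byte string
            excludeFromIndexes: true
          }
        }
      }
    }
  }
}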

@bedeoverend
Author

@pcostell so am I right in thinking that if a top-level value is marked as excluded from indexing, it would be safe for this SDK to mark all sub-values as excluded too? Even if they aren't explicitly excluded, they effectively will be anyway, since they're nested inside a value that was excluded.

@pcostell
Contributor

That would be safe, but it would represent a departure from the design of the API.

@bedeoverend
Author

I'm guessing it would be hard to introduce a syntax that would allow excludeFromIndexes on sub-properties while maintaining backwards compatibility, @stephenplusplus?

Until that's achieved, would passing excludeFromIndexes down to sub-values when it's set on the top-level property be viable? I've got the beginnings of it in a branch; happy to make a PR.

@stephenplusplus
Contributor

Yes, this would be breaking; however, we could try something like:

datastore.insert({
+ raw: true,
  key: datastore.key('Stuff'),
  data: [
    {
      name: 'description',
      value: [
        {
          name: 'text',
          value: Buffer.alloc(1501, 'a').toString(),
          excludeFromIndexes: true
        },
        {
          name: 'long',
          value: true
        }
      ]
    }
  ]
}, handler);

If raw is set, it would be required that all values inside of data are specified in the longhand/verbose syntax. Any chance I'm overlooking something that makes this unfeasible?

@bedeoverend
Author

bedeoverend commented Jan 23, 2017

@stephenplusplus I don't think that would work for storing array values?

What if the raw flag required the explicit encoded value returned by encodeValue? I think that's the proto value? You could also open up the encodeValue function to help users, e.g.:

datastore.insert({
  key: datastore.key('Stuff'),
  data: {
    description: {
      text: Buffer.alloc(1501, 'a').toString(),
      long: true
    },
    values: [ 1, 2, 3 ]
  }
})

the user could then transform into a raw value by:

datastore.insert({
  raw: true,
  key: datastore.key('Stuff'),
  data: Datastore.encodeValue({
    description: Datastore.encodeValue({
      text: Datastore.encodeValue(Buffer.alloc(1501, 'a').toString(), { exclude: true }),
      long: Datastore.encodeValue(true)
    }),
    values: Datastore.encodeValue([ 1, 2, 3 ])
  })
})

Not sure if I've got the syntax right, but that's the idea. Also more likely the user would do it programmatically.

@stephenplusplus
Contributor

@stephenplusplus I don't think that would work for storing array values?

Sorry, I've had my head in many APIs lately. Would you mind showing a bit of code that would highlight this problem?

@bedeoverend
Author

@stephenplusplus sure, so if I wanted to store the following JS Object:

var personEntity = {
  name: "Mary",
  metadata: {
    bio: "..." // Some long string > 1500 bytes
  },
  subjects: [{
    name: "history",
    credits: 2
  }, {
    name: "french",
    credits: 4
  }, {
    name: "maths",
    credits: 4
  }]
}

I need to have entity.metadata.bio excluded from indexes. Using the raw flag and syntax provided above, I could do this; however, I couldn't store the subjects prop as an array, e.g.

var rawPersonEntity = [{
  name: "name",
  value: "Mary"
}, {
  name: "metadata",
  value: [{
    name: "bio",
    value: "...", // Some long string > 1500 bytes
    excludeFromIndexes: true
  }]
}, {
  name: "subjects",
  value: [] // What do I do here? An array will indicate explicit syntax of { name, value }
}]

@stephenplusplus
Contributor

Got it, thank you! Would it make sense to recurse the data property?

var rawPersonEntity = {
  key: datastore.key('Person'),
  data: [
    {
      name: 'name',
      value: 'Mary'
    },
    {
      name: 'metadata',
      data: [
        {
          excludeFromIndexes: true,
          name: 'bio',
          value: '...'
        }
      ]
    },
    {
      name: 'subjects',
      value: [
        {
          name: 'history',
          credits: 2
        },
        // ...
      ]
    }
  ]
}

Otherwise, I think the raw flag on a nested level could have the same effect:

var rawPersonEntity = {
  key: datastore.key('Person'),
  data: [
    {
      name: 'name',
      value: 'Mary'
    },
    {
      name: 'metadata',
      raw: true,
      value: [
        {
          excludeFromIndexes: true,
          name: 'bio',
          value: '...'
        }
      ]
    },
    {
      name: 'subjects',
      raw: false, // (the default value, but explicitly set for demonstration)
      value: [
        {
          name: 'history',
          credits: 2
        },
        // ...
      ]
    }
  ]
}

Does either of these have any quirks, and if not, which is better? If neither, we can still keep trying :)

@bedeoverend
Author

bedeoverend commented Jan 25, 2017

@stephenplusplus thanks for the options - I think data makes the most sense to me, i.e. value is for more 'basic' types, and data is for embedded objects.

It's a pretty fringe edge case at this point - and I'm not sure if it's even possible to do, but if you had an array of objects, could you index them?

e.g.

var rawPersonEntity = {
  key: datastore.key('Person'),
  data: [
    {
      name: 'name',
      value: 'Mary'
    },
    {
      name: 'metadata',
      data: [
        {
          excludeFromIndexes: true,
          name: 'bio',
          value: '...'
        }
      ]
    },
    {
      name: 'subjects',
      value: [
        {
          excludeFromIndexes: true, // Is this something that can be done? Is there a reason why it would be done?
          name: 'history',
          credits: 2
        },
        // ...
      ]
    }
  ]
}

Other than that, I can't see any issue with them... Another option would be making a datastore.value-style factory - like datastore.key - where you can specify types and the data associated with them, e.g.

datastore.insert({
  key: datastore.key('Person'),
  data: datastore.value({
    type: datastore.Entity, // Defaults to 'auto'
    data: {
      name: datastore.value({
        type: String,
        value: 'Mary'
      }),
      metadata: datastore.value({
        type: datastore.Entity,
        value: {
          bio: datastore.value({
            type: String,
            excludeFromIndexes: true,
            value: '...'
          })
        }
      }),
      subjects: datastore.value({
        type: Array,
        value: [
          datastore.value({
            type: datastore.Entity,
            value: {
              name: datastore.value({
                type: String,
                value: 'history',
              }),
              credits: datastore.value({
                type: Number,
                value: 2
              })
            }
          }),
          datastore.value({
            type: datastore.Entity,
            value: {
              name: datastore.value({
                type: String,
                value: 'french',
              }),
              credits: datastore.value({
                type: Number,
                value: 4
              })
            }
          }),
          datastore.value({
            type: datastore.Entity,
            value: {
              name: datastore.value({
                type: String,
                value: 'maths',
              }),
              credits: datastore.value({
                type: Number,
                value: 4
              })
            }
          })
        ]
      })
    }
  })
});

Ha - after writing that all out, I realise how incredibly over the top that looks, but it feels robust. Perhaps not as a candidate for quickly writing out simple entities, but perhaps as a way that will work well with a helper utility? e.g. for my use case, I'm converting incoming JSON to an entity, and I want to recursively step through it and add excludeFromIndexes as needed...thoughts?

@danoscarmike added this to Cloud Datastore in First 4 (GA) on Feb 21, 2017
@bjwatson

bjwatson commented Mar 1, 2017

@stephenplusplus Is this a bug in the client, a bug in the service, or user support to workaround a documented limitation?

@bjwatson added the type: question and priority: p2 labels and removed the Type: Enhancement label on Mar 1, 2017
@kzahel

kzahel commented Mar 11, 2017

Looking forward to seeing an improvement on this front. I also noticed that a nested array type is forced to be indexed. I was also hoping the API would let me store nested JSON, but I can't figure out whether that's possible. Exposing something like a datastore.entry factory method would make sense.

@bedeoverend
Author

Any updates on this? Happy to make a PR if needed

Also @kzahel, it will allow nested JSON if you just give it a JSON object as a property; you just can't specify excludeFromIndexes on nested properties - that's where the factory function would help.

@bedeoverend
Author

Looking back over this, how about using a Symbol to specify excludeFromIndexes?

You could then use regular ol' syntax in the data prop, but add in a flag using a Symbol, e.g.:

datastore.insert({
  key: datastore.key('Stuff'),
  data: {
    description: {
      text: Buffer.alloc(1501, 'a').toString(),
      long: true,
      [ datastore.EXCLUDE_FROM_INDEXES ]: true
    },
    values: [ 1, 2, 3 ]
  }
})

much like how datastore.KEY can now be used.

I still think a factory method is more explicit and cleaner, but a Symbol might be a bit easier to integrate with the current setup.

@stephenplusplus
Contributor

I'm intrigued! And thank you for your offer to help. @callmehiphop what do you think about using a Symbol the way @bedeoverend demo'd above?

@kzahel

kzahel commented Apr 5, 2017

Using the symbol is clever and makes sense, I think. I already have to use the datastore.KEY symbol to get the actual key from a result.
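For anyone following along, a minimal sketch of that, assuming the current query API (kind name reused from the example above):

// Minimal sketch: the key of each result lives under the datastore.KEY
// symbol rather than as a regular data property.
const query = datastore.createQuery('User');

datastore.runQuery(query, (err, entities) => {
  if (err) {
    console.error(err);
    return;
  }

  entities.forEach((entity) => {
    console.log(entity[datastore.KEY]);
  });
});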

@bedeoverend
Author

bedeoverend commented Apr 5, 2017

On this point (and this might be diverging from the original issue), could you now just use the KEY symbol, rather than key for insert?

e.g.

datastore.insert({
  [datastore.KEY]: datastore.key('Stuff'),
  foo: 'bar'
});

That would flatten it out. I guess that'd be tricky for backwards compatibility? Technically you could do it, i.e.

if (entity[datastore.KEY]) {
  // Using new syntax
} else {
  // Using old syntax
}

But it's a bit magical...

@lukesneeringer
Contributor

Using a symbol seems fine to me. I prefer the more verbose syntax first proposed over the [datastore.KEY] syntax in the most recent comment.

@chadbr

chadbr commented Apr 11, 2017

We can't write string properties over 1500 chars. Is this expected? Is there some way to explicitly mark a property as 'text' instead of 'string'?

@rsemenov100

  1. Let's say we have an entity with a property of "text" type, i.e. not string.
  2. When the property is updated (with an unindexed string), the type of the property changes from "text" to "string".
  3. This breaks some other software, which expects a "text" property.

The question is: how do we write > 1500 chars as "text"?

@ggprod

ggprod commented Jun 5, 2017

Could I ask what the status of this issue is? I have some entities with embedded entities whose properties are > 1500 bytes. When I try setting excludeFromIndexes: true on the embedded entity, I still get an error thrown for its sub-properties that are > 1500 bytes. Is there some way I can store them?

@stephenplusplus added this to Not Started in State on Jun 12, 2017
@JayGoldberg

Just piping in that I'd opened an issue for this internally several days ago, but I'd routed it to the wrong place. Corrected, and looks like @stephenplusplus is aware.

@swcloud
Contributor

swcloud commented Jun 28, 2017

@stephenplusplus This issue may be related to a fixed issue in python:

googleapis/google-cloud-python#1206

@arliber

arliber commented Jun 29, 2017

For me, it happens in the web console as well - is that related or am I missing something?

@ShahNewazKhan

I followed these steps to exclude the indexed properties of an embedded entity from Datastore's index:

  1. Look up (get) the entity from Cloud Datastore.
  2. Write (update) the entity back to Cloud Datastore with the index exclusion set.

However, when I look at the Entities overview page, each of the properties of the embedded entity is still indexed even after the update. Is this the right behavior?

The docs for excluding indexes mention: "The index entries for any existing entities with that property will continue to exist until the entities are updated or deleted."
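For reference, a minimal sketch of those two steps with the verbose syntax available at the time (the key and property names here are made up):

// Minimal sketch of the lookup-then-update flow above (key/property names
// are made up). Note that, per the rest of this thread, the exclusion flag
// on the embedded-entity property does not currently reach its sub-properties.
const userKey = datastore.key(['User', userId]); // userId assumed known

datastore.get(userKey, (err, user) => {
  if (err) {
    console.error(err);
    return;
  }

  datastore.update({
    key: userKey,
    data: [
      {
        name: 'metadata',
        value: user.metadata,
        excludeFromIndexes: true
      }
    ]
  }, (err) => {
    if (err) {
      console.error(err);
    }
  });
});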

@jniemin

jniemin commented Jul 17, 2017

I just bumped into this issue. We have a data structure where, under the main entity, there is an embedded entity (actually an array of entities) that has a property longer than 1500 bytes. As we can't exclude this > 1500-byte property from indexing, we can't store even the main entity.

Is there any progress on this? Basically, Datastore is unusable for us at this point.

@arliber

arliber commented Jul 20, 2017

@jniemin I had the same issue and found two workarounds. One is utilizing the hierarchy supported in Datastore - one entity can be a child of another - so if the embedded entity is complex, I suggest you take it out into a new entity whose parent is the original one.

Another (sort of) workaround I used is just simplifying the embedded entity by moving some properties back to the parent itself - not ideal, but in my case it made sense.
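A minimal sketch of that first workaround, using an ancestor in the key path so the large embedded entity becomes its own child entity (kinds, names, and values here are made up):

// Minimal sketch of the parent/child workaround (kinds and names made up):
// the oversized metadata is stored as its own entity under the Person key.
const personKey = datastore.key(['Person', 'mary']);

datastore.save([
  {
    key: personKey,
    data: { name: 'Mary' }
  },
  {
    // Child entity - the key path includes the parent ['Person', 'mary'].
    key: datastore.key(['Person', 'mary', 'Metadata']),
    data: [
      {
        name: 'bio',
        value: longBio, // some string > 1500 bytes
        excludeFromIndexes: true
      }
    ]
  }
], (err) => {
  if (err) {
    console.error(err);
  }
});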

@stephenplusplus
Contributor

Thanks for your patience, everyone. There are a couple of solutions we've talked about, but I'd like to get feedback on one more that might be the simplest. Can you see any "gotchas" with using an excludeFromIndexes array at the top level to define where the unindexed properties are?

(cc @bedeoverend)

datastore.insert({
  key: datastore.key('Person'),
  excludeFromIndexes: [
    'description',
    'metadata.bio',
    'subjects.name'
  ],
  value: {
    description: '...',
    metadata: {
      bio: '...'
    },
    subjects: [
      {
        name: '...',
        credits: 2
      }
    ]
  }
})

This seems like it would also be simpler than the current "data"/verbose definition syntax for non-nested properties, i.e.

// before
datastore.insert({
  key: datastore.key('Person'),
  data: [
    {
      name: 'description',
      value: '...',
      excludeFromIndexes: true
    }
  ]
})

// after
datastore.insert({
  key: datastore.key('Person'),
  excludeFromIndexes: ['description'],

  // `data` remains a plain JavaScript object:
  data: {
    description: '...'
  }
})

@lukesneeringer
Contributor

Summoning @JayGoldberg for feedback also. We can implement this if there are no objections to @stephenplusplus's proposed solution.

@stephenplusplus
Contributor

Small amendment to the above suggestion, since subjects is an array:

datastore.insert({
  key: datastore.key('Person'),
  excludeFromIndexes: [
    'description',
    'metadata.bio',
-   'subjects.name'
+   'subjects[].name'
  ],
  value: {
    description: '...',
    metadata: {
      bio: '...'
    },
    subjects: [
      {
        name: '...',
        credits: 2
      }
    ]
  }
})

@lukesneeringer
Contributor

lukesneeringer commented Jul 28, 2017

Is that amendment necessary? It is unambiguous even in your previous construction.

It is not clear to me how this information gets sent up to Datastore. If that is the correct Datastore format, then obviously it makes sense. If we are recursively going through and adding a property, though, there's no reason we need the [].

@stephenplusplus
Contributor

stephenplusplus commented Jul 28, 2017

I think it follows standard embedded property notation, to differentiate between "property of an object" and "property of an object inside of an array". I've also seen this in API response documentation on the various Google API official docs sites. So no, not necessary, but should be more familiar. Willing to be downvoted on that assumption.

@stephenplusplus
Contributor

PR sent in #2497. Please take a look!

@bedeoverend
Author

bedeoverend commented Jul 29, 2017

@stephenplusplus thanks for this! Looks good to me, I like the general concept.

So two potential things:

  1. Does your proposed solution also allow excluding all nested properties? E.g. will excluding metadata exclude all properties / nested entities beneath it? Alternatively, could a metadata.* also be supported? Otherwise we'll still run into the current problem of having to explicitly exclude every property, no matter how large the entity.

  2. Suppose you hold an array of values which don't all follow the same schema:

{
  data: [{
    foo: '...'
  }, {
    bar: '...'
  }]
}

Should this solution also try to address not indexing one particular entity's property, e.g. a syntax like data.0.foo? I'm not sure if this is even supported by Datastore's indexing...? But a scenario where this might be appropriate is something like storing a series of metadata objects and searching for arbitrary values inside them. I'm not sure how valid a use case this is, but just flagging it.

Other than that, I think it looks good 👍

@lukesneeringer
Contributor

I am pretty sure that trying to do data.0.foo would not be great:

  1. It is ambiguous in JavaScript because data = {'0': { 'foo': ... }} is distinct from data = [{'foo': ... }].
  2. In the case of differing schemas, @stephenplusplus's current solution still works, since data[].bar just does not match.

I do agree that some concept of * and ** would be useful, but it is technically orthogonal -- we should get the current PR in, and then maybe do that.
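For illustration only, a hypothetical example of what such a catch-all might look like on top of the proposed syntax - this was not implemented at the time of this thread (see the separate feature request referenced further down):

// Hypothetical catch-all syntax - NOT implemented at the time of this thread;
// shown only to illustrate the * idea discussed above.
datastore.insert({
  key: datastore.key('Person'),
  excludeFromIndexes: [
    'metadata.*' // would exclude every property under `metadata`
  ],
  data: {
    metadata: {
      bio: '...' // some string > 1500 bytes
    }
  }
});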

@lukesneeringer
Contributor

I think it follows standard embedded property notation, to differentiate between "property of an object" and "property of an object inside of an array". I've also seen this in API response documentation on the various Google API official docs sites. So no, not necessary, but should be more familiar. Willing to be downvoted on that assumption.

Prior art, got it. I am convinced.

@bedeoverend
Author

It is ambiguous in JavaScript...
In the case of differing schemas, @stephenplusplus's current solution still works

Fair enough - I guess there could be the scenario of similar / same schemas but wanting to exclude only one item from being indexed. But I think that's a big edge case, and it would be a bad pattern regardless, so happy to ignore that.

we should get the current PR in, and then maybe do that.

Sounds good to me 👍

@mrksbnch

mrksbnch commented Nov 1, 2017

@lukesneeringer @stephenplusplus Thanks for the "fix", but can this please be updated in the documentation? Personally, I think it's weird that I have to exclude every single sub-property from indexing, but the bigger issue is that the documentation currently says that "If you exclude this value from indexing, then all subproperties are also excluded from indexing". This is very misleading (at least it was for me), and it took quite a while to find this issue.

EDIT: Another question that came up. How do you exclude properties that are not written in camel case?

data: {
  test: {
    'This is a key': 'This is a string > 1500 bytes ...'
  }
}

@stephenplusplus
Contributor

Here's the issue for the catch-all feature request: #2510

How do you exclude properties that are not written in camel case?

This worked for me:

datastore.save({
  key: ...,
  data: {
    test: {
      'This is a key': 'This is a string > 1500 bytes ...'
    }
  },
  excludeFromIndexes: [
    'test.This is a key'
  ]
}, ...)
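For the "arbitrary incoming JSON" use case mentioned earlier in the thread, here is a rough sketch of building that excludeFromIndexes list programmatically - the helper below and its 1500-byte check are not part of the library, and the path notation assumes the behavior added in #2497:

// Rough sketch (not part of the library): walk a plain object and collect
// excludeFromIndexes paths for every string over 1500 bytes, using the
// `prop.subProp` / `prop[].subProp` notation from #2497.
function collectLongStringPaths(value, path, paths) {
  paths = paths || [];

  if (typeof value === 'string') {
    if (Buffer.byteLength(value) > 1500) {
      paths.push(path);
    }
  } else if (Array.isArray(value)) {
    value.forEach((item) => collectLongStringPaths(item, path + '[]', paths));
  } else if (value !== null && typeof value === 'object') {
    Object.keys(value).forEach((key) => {
      collectLongStringPaths(value[key], path ? path + '.' + key : key, paths);
    });
  }

  return paths;
}

const data = {
  metadata: { bio: 'a'.repeat(2000) },
  subjects: [{ name: 'b'.repeat(2000), credits: 2 }]
};

datastore.save({
  key: datastore.key('Person'),
  excludeFromIndexes: collectLongStringPaths(data, ''), // ['metadata.bio', 'subjects[].name']
  data: data
}, (err) => {
  if (err) {
    console.error(err);
  }
});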
