[bigtable] InvalidChunk #2980

Closed
junghoahnsc opened this issue Feb 2, 2017 · 18 comments

@junghoahnsc

junghoahnsc commented Feb 2, 2017

  1. OS type and version
    Ubuntu 14.04

  2. Python version and virtual environment information (python --version)
    Python 2.7.6

  3. google-cloud-python version (pip show google-cloud, pip show google-<service>, or pip freeze)

Name: google-cloud-bigtable
Version: 0.22.0

  4. Stacktrace if available
Traceback (most recent call last):
  File "bt.py", line 13, in <module>
    row = bt_table.read_row('mykey')
  File "/home/jungho.ahn/tmp/test/local/lib/python2.7/site-packages/google/cloud/bigtable/table.py", line 234, in read_row
    rows_data.consume_all()
  File "/home/jungho.ahn/tmp/test/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py", line 323, in consume_all
    self.consume_next()
  File "/home/jungho.ahn/tmp/test/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py", line 275, in consume_next
    self._validate_chunk(chunk)
  File "/home/jungho.ahn/tmp/test/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py", line 390, in _validate_chunk
    self._validate_chunk_row_in_progress(chunk)
  File "/home/jungho.ahn/tmp/test/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py", line 370, in _validate_chunk_row_in_progress
    _raise_if(not chunk.timestamp_micros or not chunk.value)
  File "/home/jungho.ahn/tmp/test/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py", line 441, in _raise_if
    raise InvalidChunk(*args)
google.cloud.bigtable.row_data.InvalidChunk
  5. Steps to reproduce
    See the code example
  6. Code example
from google.cloud import bigtable

bt_client = bigtable.client.Client(project='myproject', read_only=True)
bt_instance = bt_client.instance(instance_id='myinstance')
bt_table = bt_instance.table(table_id='mytable')

row = bt_table.read_row('mykey')
print row.cells

This exception is raised every time. Am I using it incorrectly?

Thanks,

@dhermes dhermes added the api: bigtable Issues related to the Bigtable API. label Feb 2, 2017
@dhermes
Contributor

dhermes commented Feb 2, 2017

It's tough to say without actually seeing the failed chunk. Can you run this with a debugger? It may be easiest to just dump it in a script and run python -m pdb script_name.py or to run it in IPython and use the %debug magic after it fails.
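Roughly, a sketch of that pdb workflow (assuming the reproduction script above is saved as bt.py; these are standard pdb commands, nothing project-specific):

$ python -m pdb bt.py
(Pdb) c            # run until the uncaught InvalidChunk drops into post-mortem
(Pdb) up           # move up from _raise_if to the frame that has `chunk`
(Pdb) print chunk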

@junghoahnsc
Author

(Pdb) print chunk
timestamp_micros: 1485886500000000
(Pdb) print not chunk.value
True

Actually the value is an empty string.

@dhermes
Contributor

dhermes commented Feb 2, 2017

Gotcha, ISTM the chunk is equivalent to:

>>> from google.cloud.bigtable._generated import bigtable_pb2
>>> Chunk = bigtable_pb2.ReadRowsResponse.CellChunk
>>> chunk = Chunk(timestamp_micros=1485886500000000)
>>> chunk
timestamp_micros: 1485886500000000

>>> 

@dhermes
Contributor

dhermes commented Feb 2, 2017

@tseaver The culprit is exactly as described by @junghoahnsc. I don't quite recognize the validation since it was re-worked for v2 while I was on "paternity leave". Is there a reason we don't allow empty cells on read?

UPDATE: I just verified that writing an empty cell works:

>>> row = table.row(b'row-key')
>>> row.set_cell(u'col-fam-id', b'col-name', b'') 
>>> row.commit()

but then reading this row doesn't fail either (because the chunk type must be different):

>>> partial_result = table.read_row(b'row-key')
>>> cells = partial_result[u'col-fam-id'][b'col-name']
>>> cells[0].value   
''
>>> cells[0].timestamp
datetime.datetime(2017, 2, 2, 22, 5, 4, 333000, tzinfo=<UTC>)

@junghoahnsc The state of the chunks is also relevant here, so I need a little more info to reproduce.

@tseaver
Contributor

tseaver commented Feb 2, 2017

Too much water under the bridge since June :). I think we need to ask @garye, @mbrukman, or @sduskis. The context is:

  • The state machine is in the "row in progress" state (no current partial cell)
  • It is validating the chunk which would trigger a new cell (it doesn't reset or commit the row).
  • It is asserting that there must be both a timestamp and a value set on that chunk.

@yuanbaisc

Hi, I am trying to run this step by step:
The PartialRowData object's state is "row in progress".
The chunk it is currently processing has only "qualifier" and "timestamp_micros": "reset_row" is False, there is no "commit_row" field, and "value" is empty.

@yuanbaisc

The error is triggered in the _validate_chunk_row_in_progress method:

if not chunk.HasField('commit_row') and not chunk.reset_row:
    _raise_if(not chunk.timestamp_micros or not chunk.value)
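
As a quick sketch (reusing the protobuf classes @dhermes imported above), the chunk seen in the debugger trips exactly that assertion:

>>> from google.cloud.bigtable._generated import bigtable_pb2
>>> Chunk = bigtable_pb2.ReadRowsResponse.CellChunk
>>> chunk = Chunk(timestamp_micros=1485886500000000)
>>> not chunk.HasField('commit_row') and not chunk.reset_row   # the guard is entered
True
>>> not chunk.timestamp_micros or not chunk.value              # so _raise_if fires
True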

@garye
Contributor

garye commented Feb 3, 2017

Yeah, I think the timestamp and value check on the chunk there is invalid.

@junghoahnsc
Author

Is there any workaround until it is fixed?
@dhermes do you need any more information from our side? Actually, our Bigtable data was written by Go/Java apps, and now we're trying to read it in Python.

@garye
Contributor

garye commented Feb 3, 2017

Is there any way you can validate the same read using Go or Java? Just to gain confidence that the fix I suggested will work... And/or can you just remove those checks from your local copy and verify?
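
For what it's worth, a rough way to try that without editing the installed file is to monkey-patch the validator (a sketch only; it assumes the streaming reader class is google.cloud.bigtable.row_data.PartialRowsData, and it swallows every InvalidChunk raised by this one validator, so use it for local verification only):

from google.cloud.bigtable import row_data

_original = row_data.PartialRowsData._validate_chunk_row_in_progress

def _patched(self, chunk):
    # Ignore InvalidChunk raised by this validator (which currently includes
    # the suspect timestamp/value assertion); all other validators run as-is.
    try:
        _original(self, chunk)
    except row_data.InvalidChunk:
        pass

row_data.PartialRowsData._validate_chunk_row_in_progress = _patched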

@sduskis
Contributor

sduskis commented Feb 3, 2017

A timestamp of 0 is valid. I think that there needs to be a chunk.value.

@garye
Contributor

garye commented Feb 3, 2017

A timestamp of 0 is fine and an empty chunk.value is fine, but "not chunk.value" is true when the value is an empty string, which makes the check wrong.

@sduskis
Contributor

sduskis commented Feb 3, 2017

Empty values are not fine in these cases. An empty cell isn't a thing in Bigtable. If there's no reset, there needs to be a value, IIRC.

@dhermes
Contributor

dhermes commented Feb 3, 2017

@junghoahnsc @yuanbaisc Can you consume the response using the low-level API?

>>> # NOT: partial_result = table.read_row(b'row-key')
>>> from google.cloud.bigtable.table import _create_row_request
>>> request_pb = _create_row_request(table.name, row_key=b'row-key')
>>> response_iterator = client._data_stub.ReadRows(request_pb)
>>> all_responses = list(response_iterator)

UPDATE: If private data is returned, you can redact it, or feel free to email me (my email is in my GitHub profile).

@junghoahnsc
Author

@garye There is no issue reading it in Go/Java.
@dhermes Here is the output of all_responses. I trimmed it since it contains private data. Please let me know if you need the full output.

[chunks {
  row_key: "166657957cd9d730abe8f5f73bcfecda"
  family_name {
    value: "cluster"
  }
  qualifier {
    value: "concept"
  }
  timestamp_micros: 1485887340000000
}
chunks {
  timestamp_micros: 1485886500000000
}
...
chunks {
  qualifier {
    value: "debug"
  }
  timestamp_micros: 1485887340000000
  value: "\022\332\222\001x..."
}
chunks {
  timestamp_micros: 1485886500000000
  value: "\022\242\222\001x\234..."
}
....
chunks {
  timestamp_micros: 1485879798000000
  value: "\n2\n\006OTR517\022(\n\022\tj\274t\223..."
}
chunks {
  timestamp_micros: 1485877620000000
  value: "\n2\n\006OTR517\022(\n..."
}
chunks {
  qualifier {
    value: "metastory"
  }
  timestamp_micros: 1485887340000000
}
chunks {
  timestamp_micros: 1485886500000000
}
...
chunks {
  qualifier {
    value: "snapgraph"
  }
  timestamp_micros: 1485887340000000
}
chunks {
  timestamp_micros: 1485886500000000
}
...
chunks {
  timestamp_micros: 1485877620000000
  commit_row: true
}
]

@sduskis
Contributor

sduskis commented Feb 3, 2017

Regarding my earlier comment, @garye and I spoke offline. While there's no such thing as a Bigtable cell with an empty value, someone can set a filter that strips the value; that's useful in cases where all you want to do is check for the existence of a row.
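
For example, an existence-style read might look like this (a sketch; it assumes the filter class lives at google.cloud.bigtable.row_filters.StripValueTransformerFilter in this release):

>>> from google.cloud.bigtable.row_filters import StripValueTransformerFilter
>>> # Every cell's value is replaced with the empty string; only presence matters.
>>> row = table.read_row(b'row-key', filter_=StripValueTransformerFilter(True))

Cells returned by such a read legitimately come back with an empty value.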

The check for value should be removed as well, as @garye suggested. I authored an implementation guide, which may be incorrect here. Sorry about that.
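
To make that concrete, a hypothetical sketch of the relaxed check (not an actual patch; the surrounding lines are as quoted by @yuanbaisc above):

if not chunk.HasField('commit_row') and not chunk.reset_row:
    # A timestamp of 0 and an empty value are both legal here (e.g. when a
    # strip-value filter is applied), so the old assertion
    #     _raise_if(not chunk.timestamp_micros or not chunk.value)
    # is simply dropped.
    pass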

@junghoahnsc
Author

IIUC, the current validation check is not correct. Is it easy to remove it?
Or should we use the low-level API until it is fixed?

@garye
Contributor

garye commented Feb 9, 2017

@dhermes @tseaver I put a PR together for this; can you take a look? We've heard offline from other users who are being impacted by the bug.

richkadel pushed a commit to richkadel/google-cloud-python that referenced this issue May 6, 2017