
"Value out of range for 4 bytes" error reported during flow event import #16

Closed
rfk opened this issue Oct 26, 2016 · 14 comments

rfk commented Oct 26, 2016

Checking the mail spool for the ec2-user on our import box, I see a bunch of tracebacks like the following (this one is from today):

make: Entering directory `/home/ec2-user/fxa-activity-metrics'
./build/bin/python ./import_activity_events.py
LOADING 7 DAYS OF DATA
LOADING 2016-10-25
  MIN TIMESTAMP 1477353600
  MAX TIMESTAMP 1477439999
LOADING 2016-10-24
  MIN TIMESTAMP 1477267200
  MAX TIMESTAMP 1477353599
LOADING 2016-10-23
  MIN TIMESTAMP 1477180800
  MAX TIMESTAMP 1477267199
LOADING 2016-10-22
  MIN TIMESTAMP 1477094400
  MAX TIMESTAMP 1477180799
LOADING 2016-10-21
  MIN TIMESTAMP 1477008000
  MAX TIMESTAMP 1477094399
LOADING 2016-10-20
  MIN TIMESTAMP 1476921600
  MAX TIMESTAMP 1477007999
LOADING 2016-10-19
  MIN TIMESTAMP 1476835200
  MAX TIMESTAMP 1476921599
./build/bin/python ./import_flow_events.py
net-mozaws-prod-us-west-2-pipeline-analysis fxa-flow/data/
2016-09-02
2016-09-03
2016-09-04
2016-09-05
2016-09-06
2016-09-07
2016-09-08
2016-09-09
2016-09-10
2016-09-11
2016-09-12
2016-09-13
2016-09-14
2016-09-15
2016-09-16
2016-09-17
2016-09-18
2016-09-19
2016-09-20
2016-09-21
2016-09-22
2016-09-23
2016-09-24
2016-09-25
2016-09-26
2016-09-27
2016-09-28
2016-09-29
2016-09-30
2016-10-01
2016-10-02
2016-10-03
2016-10-04
2016-10-05
2016-10-06
2016-10-07
2016-10-08
2016-10-09
2016-10-10
2016-10-11
2016-10-12
2016-10-13
2016-10-14
2016-10-15
2016-10-16
2016-10-24
2016-10-25
2016-10-26
2016-11-14
LOADING 16 DAYS OF DATA
LOADING 2016-11-14
Traceback (most recent call last):
  File "./import_flow_events.py", line 303, in <module>
    import_events(False)
  File "./import_flow_events.py", line 290, in import_events
    db.run(Q_INSERT_EVENTS)
  File "/home/ec2-user/fxa-activity-metrics/build/lib/python2.7/site-packages/postgres/__init__.py", line 374, in run
    cursor.run(sql, parameters)
  File "/home/ec2-user/fxa-activity-metrics/build/lib/python2.7/site-packages/postgres/cursors.py", line 92, in run
    self.execute(sql, parameters)
  File "/home/ec2-user/fxa-activity-metrics/build/lib/python2.7/site-packages/psycopg2/extras.py", line 288, in execute
    return super(NamedTupleCursor, self).execute(query, vars)
psycopg2.InternalError: Value out of range for 4 bytes.
DETAIL:
  -----------------------------------------------
  error:  Value out of range for 4 bytes.
  code:      8001
  context:   Input:2592007602.
  query:     103745
  location:  funcs_int.hpp:91
  process:   query0_48 [pid=15830]
  -----------------------------------------------


make: *** [import] Error 1
make: Leaving directory `/home/ec2-user/fxa-activity-metrics'

Not sure what to make of it, especially the fact that it's trying to load a file from a date in the future. Recording it here for follow-up.

philbooth commented Oct 26, 2016

> ...it's trying to load a file from a date in the future.

Oh interesting. FWIW, those dates just come from whatever happens to be in S3.

rfk commented Oct 26, 2016

> db.run(Q_INSERT_EVENTS)

This is the query that copies from the temporary table into the main table, so AFAICT it can't be a problem with loading the data from S3. And the only column that's not a direct copy is the timestamp, where we convert it from a BIGINT to a TIMESTAMP. Perhaps this is caused by a busted timestamp value? Pretty sure those are 8 bytes though, not 4 as in the error message.
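
For what it's worth, the copy step presumably has roughly this shape (a sketch only, not the actual Q_INSERT_EVENTS; the table and column names here are illustrative):

INSERT INTO flow_events (ts, flow_id, flow_time, type)
SELECT
  TIMESTAMP 'epoch' + ts_raw * INTERVAL '1 second',  -- the one non-direct copy: BIGINT epoch value to TIMESTAMP
  flow_id,
  flow_time,
  type
FROM temporary_raw_flow_data;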

rfk commented Oct 26, 2016

Although, the columns in the temporary_raw_activity_data table are VARCHAR(40) while their equivalents in the activity_events table are VARCHAR(30)...
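
In DDL terms, the discrepancy I mean is something like this (a sketch; the column name is illustrative, the real schemas are in the import scripts):

CREATE TABLE temporary_raw_activity_data (uid VARCHAR(40));
CREATE TABLE activity_events (uid VARCHAR(30));

Although if an over-long value were the culprit, I'd expect a string-length error rather than a numeric one, so this may be a red herring.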

philbooth commented

If I run:

SELECT DISTINCT begin_time::DATE
FROM flow_metadata
ORDER BY 1;

I can see that we have no flow events for the following dates:

17th Sep, 18th Sep, 20th Sep, 26th Sep, 29th Sep, 2nd Oct, 5th Oct, 6th Oct, 8th Oct, 12th Oct or anything after 15th Oct.

philbooth commented

I notice the title of this issue refers to activity events, but the error is actually from the flow event import. Updating the title for that reason.

philbooth changed the title from "Tracebacks from activity event import" to '"Value out of range for 4 bytes" error reported during flow event import' on Oct 27, 2016
philbooth commented Oct 27, 2016

I've opened bug 1313357 to investigate the curious case of the weird dates in S3.

philbooth commented Oct 27, 2016

Getting back to the main problem discussed in this issue, the out-of-range error: 4 bytes is the size of the INTEGER type. There is only one INTEGER column referred to by the Q_INSERT_EVENTS query: flow_time.

If I then look at the flow_time column for the temporary table, I can see it is a BIGINT, so that's where our overflow comes from.

However, the very fact that we are suddenly getting a flow_time big enough to trigger this bug makes me suspicious. The hard upper limit on flow_time, in milliseconds, should be way, way smaller than that because of expiry times.
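
To make the overflow concrete: the maximum value of a 4-byte INTEGER is 2147483647, which the offending input comfortably exceeds. A cast reproduces the error directly (a quick sketch):

SELECT 2592007602::BIGINT;   -- fine as 8 bytes
SELECT 2592007602::INTEGER;  -- fails: "Value out of range for 4 bytes."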

I can think of three possibilities off the top of my head:

  1. Something has gone wrong with clock skew between a user's machine and the skew-correction code in the content server. This strikes me as the most probable cause; I'm going to play around locally to see how solid that code is.
  2. Something has gone wrong with the expiry of flow data in the auth server, either at the validateMetricsContext level or the memcached level. I think this is unlikely.
  3. Something funny has happened in the metrics pipeline or CSV export process. I presume this is very unlikely but don't know for certain.

Anyway, in the meantime I'll open a PR to address the schema discrepancy, which should fix the immediate data import bug. But we also need to determine whether there is an underlying data integrity problem that caused this error to show up all of a sudden.
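
The shape of that fix will be something like the following (a hypothetical sketch, not the actual PR; since Redshift can't change a column's type in place, in practice it means updating the CREATE TABLE statement in this repo and rebuilding the table, and the other columns shown are illustrative):

CREATE TABLE flow_events_new (
  ts TIMESTAMP,
  flow_id VARCHAR(64),
  flow_time BIGINT,  -- widened from INTEGER to match the staging table
  type VARCHAR(64)
);
INSERT INTO flow_events_new SELECT * FROM flow_events;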

philbooth commented

I may have just had a huge, huge, yuuuuuuge eureka moment on this.

The error message includes the value of flow_time that is causing the problem: 2592007602.

That equates to a duration of 30 days. And the file this value comes from is our mysterious future date, flow-2016-11-14.csv, which is dun dun dunnnnn... 30 days after the big gap in our CSV export.
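
Checking the arithmetic, with the value straight from the traceback:

SELECT 2592007602 / 1000.0 / 60 / 60 / 24 AS days;  -- ≈ 30.0001

So the rogue flow_time is almost exactly 30 days in milliseconds.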

Is it possible that the dates on our servers were screwed up for a while, thus causing both problems?

philbooth commented

> Is it possible that the dates on our servers were screwed up for a while...

I just realised the activity events would also be borked if that was the case, but they aren't.

philbooth commented Oct 31, 2016

The final outcome of bug 1313357 was that the two issues are indeed related and they're caused by bad timestamps on the content server flow events.

With my earlier hunch in mind:

> Something has gone wrong with clock skew between a user's machine and the skew-correction code in the content server.

I thought it should be quite easy to reproduce the problem by skewing my local clock and then connecting to an fxa-dev box to check the flow events in the log. I tried this with skews of hours, days and years, in both directions, but actually the content server skew-correction code seemed to handle all cases correctly.

Given this and the observations in Bugzilla, I've come to two fresh conclusions:

  1. We should remove the attempt to be tolerant of a missing flow-begin time in the content server, i.e. delete this code. It was a well-meant condition, but I think we're better off treating such requests as coming from irreparably broken clients and ignoring them, lest we corrupt good data. This is a simple change, but I don't think it will address the main problem.
  2. We should use server-side timestamps for the content server flow events so that we completely remove the possibility of the problem occurring.

I'll raise separate issues in the content server repo to handle these two and point them back to this bug as the source of discussion.

I'll try to get it all done as quickly as possible, but I fear it may be too late for train 73; we'll see.

philbooth commented

The other thing we need to do to fix the problem, as far as this repo is concerned, is to delete or ignore all the bad CSVs in S3 once the content server fix is deployed. That data is lost for good, sorry.

philbooth commented

Raised mozilla/fxa-content-server#4349 and mozilla/fxa-content-server#4350 to cover the two content server issues identified in #16 (comment).

rfk commented Nov 10, 2016

Is there anything else to do in this issue, or is it covered by the linked follow-ups?

philbooth commented

> Is there anything else to do in this issue, or is it covered by the linked follow-ups?

I think we're covered, good point.
