
"Value out of range for 4 bytes" error reported during flow event import #16

Closed
rfk opened this issue Oct 26, 2016 · 14 comments

rfk commented Oct 26, 2016

Checking the mail spool for the ec2-user on our import box, I see a bunch of tracebacks like the following (this one is from today):

make: Entering directory `/home/ec2-user/fxa-activity-metrics'
./build/bin/python ./import_activity_events.py
LOADING 7 DAYS OF DATA
LOADING 2016-10-25
  MIN TIMESTAMP 1477353600
  MAX TIMESTAMP 1477439999
LOADING 2016-10-24
  MIN TIMESTAMP 1477267200
  MAX TIMESTAMP 1477353599
LOADING 2016-10-23
  MIN TIMESTAMP 1477180800
  MAX TIMESTAMP 1477267199
LOADING 2016-10-22
  MIN TIMESTAMP 1477094400
  MAX TIMESTAMP 1477180799
LOADING 2016-10-21
  MIN TIMESTAMP 1477008000
  MAX TIMESTAMP 1477094399
LOADING 2016-10-20
  MIN TIMESTAMP 1476921600
  MAX TIMESTAMP 1477007999
LOADING 2016-10-19
  MIN TIMESTAMP 1476835200
  MAX TIMESTAMP 1476921599
./build/bin/python ./import_flow_events.py
net-mozaws-prod-us-west-2-pipeline-analysis fxa-flow/data/
2016-09-02
2016-09-03
2016-09-04
2016-09-05
2016-09-06
2016-09-07
2016-09-08
2016-09-09
2016-09-10
2016-09-11
2016-09-12
2016-09-13
2016-09-14
2016-09-15
2016-09-16
2016-09-17
2016-09-18
2016-09-19
2016-09-20
2016-09-21
2016-09-22
2016-09-23
2016-09-24
2016-09-25
2016-09-26
2016-09-27
2016-09-28
2016-09-29
2016-09-30
2016-10-01
2016-10-02
2016-10-03
2016-10-04
2016-10-05
2016-10-06
2016-10-07
2016-10-08
2016-10-09
2016-10-10
2016-10-11
2016-10-12
2016-10-13
2016-10-14
2016-10-15
2016-10-16
2016-10-24
2016-10-25
2016-10-26
2016-11-14
LOADING 16 DAYS OF DATA
LOADING 2016-11-14
Traceback (most recent call last):
  File "./import_flow_events.py", line 303, in <module>
    import_events(False)
  File "./import_flow_events.py", line 290, in import_events
    db.run(Q_INSERT_EVENTS)
  File "/home/ec2-user/fxa-activity-metrics/build/lib/python2.7/site-packages/postgres/__init__.py", line 374, in run
    cursor.run(sql, parameters)
  File "/home/ec2-user/fxa-activity-metrics/build/lib/python2.7/site-packages/postgres/cursors.py", line 92, in run
    self.execute(sql, parameters)
  File "/home/ec2-user/fxa-activity-metrics/build/lib/python2.7/site-packages/psycopg2/extras.py", line 288, in execute
    return super(NamedTupleCursor, self).execute(query, vars)
psycopg2.InternalError: Value out of range for 4 bytes.
DETAIL:
  -----------------------------------------------
  error:  Value out of range for 4 bytes.
  code:      8001
  context:   Input:2592007602.
  query:     103745
  location:  funcs_int.hpp:91
  process:   query0_48 [pid=15830]
  -----------------------------------------------


make: *** [import] Error 1
make: Leaving directory `/home/ec2-user/fxa-activity-metrics'

Not sure what to make of it, especially the fact that it's trying to load a file from a date in the future. Recording it here for follow-up.

philbooth commented Oct 26, 2016

> ...it's trying to load a file from a date in the future.

Oh interesting. FWIW, those dates just come from whatever happens to be in S3.

rfk commented Oct 26, 2016

> db.run(Q_INSERT_EVENTS)

This is the query that copies from the temporary table into the main table, so AFAICT it can't be a problem with loading the data from S3. And the only column that's not a direct copy is the timestamp, where we convert it from a BIGINT to a TIMESTAMP. Perhaps this is caused by a busted timestamp value? Pretty sure those are 8 bytes though, not 4 as in the error message.
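
For what it's worth, the copy step presumably has roughly this shape (a sketch only, not the actual Q_INSERT_EVENTS; the table and column names here are illustrative):

INSERT INTO flow_events (ts, flow_id, flow_time, type)
SELECT
  TIMESTAMP 'epoch' + ts_raw * INTERVAL '1 second',  -- the one non-direct copy: BIGINT epoch value to TIMESTAMP
  flow_id,
  flow_time,
  type
FROM temporary_raw_flow_data;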

rfk commented Oct 26, 2016

Although, the columns in the temporary_raw_activity_data table are VARCHAR(40) while their equivalents in the activity_events table are VARCHAR(30)...
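
In DDL terms, the discrepancy I mean is something like this (a sketch; the column name is illustrative, the real schemas are in the import scripts):

CREATE TABLE temporary_raw_activity_data (uid VARCHAR(40));
CREATE TABLE activity_events (uid VARCHAR(30));

Although if an over-long value were the culprit, I'd expect a string-length error rather than a numeric one, so this may be a red herring.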

philbooth commented

If I run:

SELECT DISTINCT begin_time::DATE
FROM flow_metadata
ORDER BY 1;

I can see that we have no flow events for the following dates:

17th Sep, 18th Sep, 20th Sep, 26th Sep, 29th Sep, 2nd Oct, 5th Oct, 6th Oct, 8th Oct, 12th Oct or anything after 15th Oct.

philbooth commented

I notice the title of this issue refers to activity events, but the error is actually from the flow event import. Updating the title for that reason.

philbooth changed the title from "Tracebacks from activity event import" to '"Value out of range for 4 bytes" error reported during flow event import' on Oct 27, 2016
philbooth commented Oct 27, 2016

I've opened bug 1313357 to investigate the curious case of the weird dates in S3.

philbooth commented Oct 27, 2016

Getting back to the main problem discussed in this issue, the out-of-range error: 4 bytes is the size of the INTEGER type. There is only one INTEGER column referred to by the Q_INSERT_EVENTS query: flow_time.

If I then look at the flow_time column for the temporary table, I can see it is a BIGINT, so that's where our overflow comes from.

However, the very fact that we are suddenly getting a flow_time big enough to trigger this bug makes me suspicious. The hard upper limit on flow_time, in milliseconds, should be way, way smaller than that because of expiry times.
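
To make the overflow concrete: the maximum value of a 4-byte INTEGER is 2147483647, which the offending input comfortably exceeds. A cast reproduces the error directly (a quick sketch):

SELECT 2592007602::BIGINT;   -- fine as 8 bytes
SELECT 2592007602::INTEGER;  -- fails: "Value out of range for 4 bytes."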

I can think of three possibilities off the top of my head:

  1. Something has gone wrong with clock skew between a user's machine and the skew-correction code in the content server. This strikes me as the most probable cause; I'm going to play around locally to see how solid that code is.
  2. Something has gone wrong with the expiry of flow data in the auth server, either at the validateMetricsContext level or the memcached level. I think this is unlikely.
  3. Something funny has happened in the metrics pipeline or CSV export process. I presume this is very unlikely but don't know for certain.

Anyway, in the meantime I'll open a PR to address the schema discrepancy, which should fix the immediate data import bug. But we also need to determine whether there is an underlying data integrity problem that caused this error to show up all of a sudden.
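
The shape of that fix will be something like the following (a hypothetical sketch, not the actual PR; since Redshift can't change a column's type in place, in practice it means updating the CREATE TABLE statement in this repo and rebuilding the table, and the other columns shown are illustrative):

CREATE TABLE flow_events_new (
  ts TIMESTAMP,
  flow_id VARCHAR(64),
  flow_time BIGINT,  -- widened from INTEGER to match the staging table
  type VARCHAR(64)
);
INSERT INTO flow_events_new SELECT * FROM flow_events;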

philbooth commented

I may have just had a huge, huge, yuuuuuuge eureka moment on this.

The error message includes the value of flow_time that is causing the problem: 2592007602.

That equates to a duration of 30 days. And the file this value comes from is our mysterious future date, flow-2016-11-14.csv, which is dun dun dunnnnn... 30 days after the big gap in our CSV export.
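
Checking the arithmetic, with the value straight from the traceback:

SELECT 2592007602 / 1000.0 / 60 / 60 / 24 AS days;  -- ≈ 30.0001

So the rogue flow_time is almost exactly 30 days in milliseconds.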

Is it possible that the dates on our servers were screwed up for a while, thus causing both problems?

philbooth commented

> Is it possible that the dates on our servers were screwed up for a while...

I just realised the activity events would also be borked if that was the case, but they aren't.

philbooth commented Oct 31, 2016

The final outcome of bug 1313357 was that the two issues are indeed related and they're caused by bad timestamps on the content server flow events.

With my earlier hunch in mind:

> Something has gone wrong with clock skew between a user's machine and the skew-correction code in the content server.

I thought it should be quite easy to reproduce the problem by skewing my local clock and then connecting to an fxa-dev box to check the flow events in the log. I tried this with skews of hours, days and years, in both directions, but actually the content server skew-correction code seemed to handle all cases correctly.

Given this and the observations in Bugzilla, I've come to two fresh conclusions:

  1. We should remove the attempt to be tolerant of a missing flow-begin time in the content server, i.e. delete this code. It was a well-meant condition, but I think we're better off treating such requests as coming from irreparably broken clients and ignoring them, lest we corrupt good data. This is a simple change, but I don't think it will address the main problem.
  2. We should use server-side timestamps for the content server flow events so that we completely remove the possibility of the problem occurring.

I'll raise separate issues in the content server repo to handle these two and point them back to this bug as the source of discussion.

I'll try to get it all done as quickly as possible, but I fear it may be too late for train 73; we'll see.

philbooth commented

The other thing we need to do to fix the problem, as far as this repo is concerned, is to delete or ignore all the bad CSVs in S3 once the content server fix is deployed. That data is lost for good, sorry.

philbooth commented

Raised mozilla/fxa-content-server#4349 and mozilla/fxa-content-server#4350 to cover the two content server issues identified in #16 (comment).

rfk commented Nov 10, 2016

Is there anything else to do in this issue, or is it covered by the linked follow-ups?

philbooth commented

> Is there anything else to do in this issue, or is it covered by the linked follow-ups?

I think we're covered, good point.
