Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

duplicate tweets insertion in the database #4

Closed
modsaid opened this Issue Mar 28, 2013 · 7 comments

Comments

Projects
None yet
3 participants

modsaid commented Mar 28, 2013

when running the politwoops-worker.py to read the tweets from beanstalkd and insert them into mysql
PYTHONPATH=$PYTHONPATH:pwd/lib ./bin/politwoops-worker.py

the first tweet gets inserted, but it seems no deleted from beanstalkd. so shortly the scripts tries to insert it again leading to duplicate errors

$ PYTHONPATH=$PYTHONPATH:`pwd`/lib ./bin/politwoops-worker.py 

[2013-03-28 15:25] NOTICE: politwoops-worker.py: Log level NOTICE
[2013-03-28 15:25] NOTICE: politwoops-worker.py: New tweet 317294908039901186 from user 14830552/modsaid
./bin/politwoops-worker.py:197: Warning: Out of range value for column 'id' at row 1
  cursor.execute("""INSERT INTO `tweets` (`id`, `user_name`, `politician_id`, `content`, `created`, `modified`, `tweet`) VALUES(%s, %s, %s, %s, NOW(), NOW(), %s)""", (tweet['id'], tweet['user']['screen_name'], self.users[tweet['user']['id']], tweet['text'], anyjson.serialize(tweet)))

[2013-03-28 15:27] NOTICE: politwoops-worker.py: New tweet 317296871301324800 from user 14830552/modsaid
[2013-03-28 15:27] ERROR: Generic: Unhandled exception of type <class '_mysql_exceptions.IntegrityError'>: (1062, "Duplicate entry '2147483647' for key 'PRIMARY'")
Traceback (most recent call last):
  File "./bin/politwoops-worker.py", line 308, in <module>
    sys.exit(main(args))
  File "./bin/politwoops-worker.py", line 280, in main
    return app.run()
  File "./bin/politwoops-worker.py", line 120, in run
    self.handle_tweet(job.body)
  File "./bin/politwoops-worker.py", line 148, in handle_tweet
    self.handle_new(tweet)
  File "./bin/politwoops-worker.py", line 197, in handle_new
    cursor.execute("""INSERT INTO `tweets` (`id`, `user_name`, `politician_id`, `content`, `created`, `modified`, `tweet`) VALUES(%s, %s, %s, %s, NOW(), NOW(), %s)""", (tweet['id'], tweet['user']['screen_name'], self.users[tweet['user']['id']], tweet['text'], anyjson.serialize(tweet)))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.IntegrityError: (1062, "Duplicate entry '2147483647' for key 'PRIMARY'")

modsaid commented Mar 28, 2013

related to c7c1c8d

modsaid commented Mar 28, 2013

my bad... it is not related to that... it is actually related to tweet ids being larger than the max possible for mysql integer columns.. and get silently truncated

We quickly realized the issue and ALTERed the impacted tables, but the MySQL behavior of silently truncating all values larger than 2147483647 with the only fail-safe being the primary key constraint was worrying. Any columns that lacked a similar constraint would be experiencing silent data corruption.

2^31 − 1 = 2,147,483,647
The number 2,147,483,647 is ... the maximum value for a 32-bit signed integer in computing

modsaid commented Mar 28, 2013

this is related to the database creation flow... i'll fix the migrations at https://github.com/sunlightlabs/politwoops to create the data that matches the schema define in this repo

modsaid commented Mar 28, 2013

Can you post an actual active schema dump from the system?

Contributor

dvogel commented Mar 28, 2013

Yes, you do need to alter tweets.id to bigint. It's the only schema change required. I meant to fix that in the ruby migrations. A patch would be welcome, else I will fix it soon.

modsaid commented Mar 28, 2013

I got it running, i'll prepare a patch, a basic one to get the thing up for anyone.. but we'll need to add some missing migrations too... Will keep you posted

modsaid commented Mar 28, 2013

sent you a pull request sunlightlabs/politwoops#4

@dwillis dwillis closed this Sep 27, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment