Commit
Remove old status IDs when adding new ones.
mihaip committed Jan 20, 2012
1 parent 38b777f commit 96bf460
Showing 2 changed files with 23 additions and 5 deletions.
4 changes: 2 additions & 2 deletions app/birdfeeder/TODO
@@ -1,11 +1,9 @@
 todo:
-- encoding bug in https://twitter.com/#!/smcbride/status/158634839388590080
 - trigger a Reader crawlondemand when first generating a feed (or resetting the ID)
 - add timestamps to birdpinger
 - see why Ann's feed doesn't use PSHB in Reader
 - cache user data so that feed fetches involve no twitter RPCs
 - reduce update cron job frequency
-- drop status ID/timestamps older than 24 hours from StreamData
 - catch exceptions when fetching timeline tweets
 - reply and retweet links
 - use user timezone to format timestamps (instead of GMT)
@@ -64,3 +62,5 @@ done:
 - youtube.com/youtu.be
 - check if @ replies missing in birdpinger is Twitter's fault or tweetstream's
 - convert newlines to html
+- encoding bug in https://twitter.com/#!/smcbride/status/158634839388590080 - hub dropped the < when giving it to schedule_crawler?
+- drop status ID/timestamps older than 24 hours from StreamData
24 changes: 21 additions & 3 deletions app/birdfeeder/handlers/update.py
@@ -1,3 +1,4 @@
+import itertools
 import logging
 import time
 import urllib
@@ -13,6 +14,11 @@
 
 RECENT_STATUS_INTERVAL_SEC = 10 * 60
 
+# Statuses older than this interval will be removed from the ID list that is
+# persisted, to avoid it growing uncontrollably. This value should be at least
+# as big as FEED_STATUS_INTERVAL_SEC from feeds.py.
+OLD_STATUS_INTERVAL_SEC = 24 * 60 * 60  # One day
+
 HUB_URL_BATCH_SIZE = 100
 
 # When we get a ping, we don't start updates right away, since the timeline REST
@@ -121,9 +127,21 @@ def update_timeline(session):
 
     logging.info(' %d new status IDs for this stream' % len(new_status_ids))
 
-    stream.status_ids = new_status_ids + stream.status_ids
-    stream.status_timestamps_sec = \
-        new_status_timestamps_sec + stream.status_timestamps_sec
+    dropped_status_ids = 0
+    combined_status_ids = list(new_status_ids)
+    combined_status_timestamps_sec = list(new_status_timestamps_sec)
+    threshold_time = time.time() - OLD_STATUS_INTERVAL_SEC
+    for status_id, timestamp_sec in itertools.izip(stream.status_ids, stream.status_timestamps_sec):
+        if timestamp_sec >= threshold_time:
+            combined_status_ids.append(status_id)
+            combined_status_timestamps_sec.append(timestamp_sec)
+        else:
+            dropped_status_ids += 1
+
+    logging.info(' Dropped %d old status IDs' % dropped_status_ids)
+
+    stream.status_ids = combined_status_ids
+    stream.status_timestamps_sec = combined_status_timestamps_sec
 
     unknown_status_ids = data.StatusData.get_unknown_status_ids(new_status_ids)
 
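The pruning loop above can be tried out in isolation. Below is a minimal, self-contained sketch of the same idea in Python 3 (`zip` instead of the Python 2 `itertools.izip` used by the commit); the function and constant names here are illustrative, not part of the repository:

```python
import time

# Mirrors OLD_STATUS_INTERVAL_SEC from the commit: one day, in seconds.
ONE_DAY_SEC = 24 * 60 * 60

def prune_old_statuses(status_ids, timestamps_sec, now=None, max_age_sec=ONE_DAY_SEC):
    """Return (ids, timestamps) with entries older than max_age_sec dropped.

    status_ids and timestamps_sec are parallel lists, as in the stream data
    the commit operates on.
    """
    now = time.time() if now is None else now
    threshold = now - max_age_sec
    kept = [(sid, ts) for sid, ts in zip(status_ids, timestamps_sec)
            if ts >= threshold]
    if not kept:
        return [], []
    ids, timestamps = zip(*kept)
    return list(ids), list(timestamps)
```

Prepending new IDs and then pruning the tail keeps the persisted lists bounded without a separate cleanup job, which is the point of the change.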
