This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Downloads from the past have primary keys from the future #532

Closed
pypt opened this issue Dec 20, 2018 · 3 comments

@pypt
Contributor

pypt commented Dec 20, 2018

doc-brown-back-to-the-future

By sheer accident I have stumbled upon a download that has a very recent downloads_id but was fetched in ~2015:

mediacloud=# select * from downloads where downloads_id = 1810972751;
 downloads_id | feeds_id | stories_id | parent |                                               url                                               |      host       |       download_time        |  type   |      state      |      path       |                                                  error_message                                                  | priority | sequence | extracted 
--------------+----------+------------+--------+-------------------------------------------------------------------------------------------------+-----------------+----------------------------+---------+-----------------+-----------------+-----------------------------------------------------------------------------------------------------------------+----------+----------+-----------
   1810972751 |   435674 | 1119893954 |        | http://radaronline.com/features/2006/10/americas_dumbest_congressmen_a_radar_special_report.php | radaronline.com | 2015-04-06 02:08:43.046751 | content | extractor_error | content:pending | Died at /space/mediacloud/mediacloud_RELEASE_20140325/script/../lib/MediaWords/Util/ThriftExtractor.pm line 26.+|        1 |        1 | f
              |          |            |        |                                                                                                 |                 |                            |         |                 |                 |                                                                                                                 |          |          | 
(1 row)

By comparison, downloads "around" it (with downloads_id -1 or +1) were fetched just two days ago:

mediacloud=# select * from downloads where downloads_id in (1810972750, 1810972752) order by downloads_id;
 downloads_id | feeds_id | stories_id |   parent   |                      url                       |       host        |    download_time    |  type   |  state  |          path           | error_message | priority | sequence | extracted 
--------------+----------+------------+------------+------------------------------------------------+-------------------+---------------------+---------+---------+-------------------------+---------------+----------+----------+-----------
   1810972750 |  1101803 | 1119893956 | 1810906084 | https://www.al-madina.com/article/604281?rss=1 | www.al-madina.com | 2018-12-16 09:52:32 | content | success | s3:downloads/1810972750 |               |        0 |        1 | t
   1810972752 |  1101803 | 1119893957 | 1810906084 | https://www.al-madina.com/article/604280?rss=1 | www.al-madina.com | 2018-12-16 09:52:44 | content | success | s3:downloads/1810972752 |               |        0 |        1 | t
(2 rows)
  • Where could those rows be coming from?
  • What happened to those INSERTs that tried to re-insert downloads that already existed?
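
To check whether other rows show the same pattern, one option is to scan downloads in downloads_id order and flag any row whose download_time is far older than the newest download_time seen so far. The following is only a rough sketch: the function name find_backdated, the one-year threshold, and the in-memory list of rows are assumptions made for illustration, and the sample data is just the three rows shown above.

```python
from datetime import datetime, timedelta


def find_backdated(rows, threshold=timedelta(days=365)):
    """Return downloads_ids whose download_time lags the newest one seen so far.

    `rows` is an iterable of (downloads_id, download_time) pairs sorted by
    downloads_id. Both the helper and the threshold are hypothetical.
    """
    newest = None
    flagged = []
    for downloads_id, download_time in rows:
        # Flag rows that are much older than anything with a smaller id.
        if newest is not None and newest - download_time > threshold:
            flagged.append(downloads_id)
        if newest is None or download_time > newest:
            newest = download_time
    return flagged


# The three rows from the queries above (fractional seconds dropped).
rows = [
    (1810972750, datetime(2018, 12, 16, 9, 52, 32)),
    (1810972751, datetime(2015, 4, 6, 2, 8, 43)),
    (1810972752, datetime(2018, 12, 16, 9, 52, 44)),
]

flagged = find_backdated(rows)
print(flagged)
```

On a table of this size the same comparison would more likely be done in SQL over a bounded downloads_id range rather than in Python.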
@hroberts
Contributor

These are not old downloads getting created with future downloads_ids. They are new downloads being created with download_time in the past.

These new downloads with old download_time values are created in mediawords.tm.stories.copy_story_to_new_medium(), which is used to copy stories from a duplicate medium to the parent medium. As part of that process, it copies any downloads associated with the story, including copying the download_time value for the downloads row. I think it is more accurate to keep the old download_time than to assign a more current one, since the content associated with the download is the content originally downloaded (so for the case above, the content associated with both the old and the new download was in fact downloaded in 2015).
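
The effect described above can be reproduced with a small stand-in. This is a minimal sketch and not the actual copy_story_to_new_medium() code: it uses an in-memory SQLite table with a cut-down, assumed schema, and the helper copy_download_to_story and the example.com URLs are hypothetical. It only shows that copying a row while keeping its download_time yields a fresh, larger primary key paired with the old timestamp.

```python
import sqlite3


def copy_download_to_story(db, download, new_stories_id):
    """Insert a copy of `download` for another story, keeping download_time.

    Hypothetical helper; the real function copies more columns.
    """
    cur = db.execute(
        "INSERT INTO downloads (stories_id, url, download_time) VALUES (?, ?, ?)",
        (new_stories_id, download["url"], download["download_time"]),
    )
    return cur.lastrowid


db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
# Assumed, simplified schema: only the columns needed for the illustration.
db.execute(
    """
    CREATE TABLE downloads (
        downloads_id INTEGER PRIMARY KEY AUTOINCREMENT,
        stories_id INTEGER NOT NULL,
        url TEXT NOT NULL,
        download_time TEXT NOT NULL
    )
    """
)

# An old download on the duplicate medium, then an unrelated recent one.
db.execute(
    "INSERT INTO downloads (stories_id, url, download_time) VALUES (?, ?, ?)",
    (1, "http://example.com/old-story", "2015-04-06 02:08:43"),
)
db.execute(
    "INSERT INTO downloads (stories_id, url, download_time) VALUES (?, ?, ?)",
    (2, "http://example.com/recent-story", "2018-12-16 09:52:32"),
)

old = db.execute("SELECT * FROM downloads WHERE downloads_id = 1").fetchone()
new_id = copy_download_to_story(db, old, new_stories_id=3)
copied = db.execute(
    "SELECT * FROM downloads WHERE downloads_id = ?", (new_id,)
).fetchone()

# The copy gets the next id in sequence but carries the 2015 timestamp.
print(new_id, copied["download_time"])
```

The copied row sorts after the recent download by downloads_id while still carrying the original fetch time, which matches what the queries in the first comment show.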

@hroberts
Contributor

(but bonus points for the BTF meme!)

@pypt
Contributor Author

pypt commented Jan 17, 2019

Thanks for the clarification! Just thought it was a bug in disguise.


2 participants