Support tweet threads on the assignments screen #15

Closed
34 of 54 tasks
keithamoss opened this issue Sep 25, 2018 · 0 comments
keithamoss commented Sep 25, 2018

Tweet threads are hard - there's no API concept for them. We're getting replies to our replies streamed in by virtue of including @DemSausage as one of our search terms - BUT Twitter's API has removed all of the streaming endpoints for tweets from a given account. As such, we can't get our replies to people, and this stops us from following the tweet chain down from the assigned tweet through to all replies.

Threads will need to be updated to include replies. Here is one approach. To conserve rate limits we may have to require the user to take an action to see updates - e.g. clicking refresh, or only seeing one assigned tweet at a time.

Approach

  • On receiving a tweet, walk back up the chain using in_reply_to_id (from local, then from remote) to find the ultimate parent id. If the parent is not part of an assignment, save all the tweets we collected and issue the standard new tweet event. If the parent is part of an assignment, pass all the tweets we collected to the refresh_assignment logic (which saves all tweets - including the new one - updates the assignment, and returns an updated assignment).
  • To refresh an assignment we'll issue requests to find all replies to all users in the thread from the lowest possible tweetId (depending on when they came into the conversation). This will uncover "hidden" threads that we didn't receive via streaming. We'll use those results to build/refresh the relationships, save all of those tweets, update the assignment, and return the updated assignment.
  • Background refreshing of assignments is a nice-to-have and will help us capture useful sub-threads that we aren't included on (and so aren't getting via streaming) - e.g. someone replying to a tweet to say they also found sausage. This could also happen as a result of a user interaction like "Check for updates on this assignment/my assignments".
  • A refreshed assignment should throw a notification for users. It should also update a timestamp so we can sort the queue by either creation date or update date.
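The walk-up step in the approach above could look something like this rough sketch. resolve_tweet is a hypothetical helper that checks the local DB first and falls back to the Twitter API; tweets are plain dicts here rather than whatever objects we actually store:

```python
# Sketch: walk back up the reply chain to find the root of a thread.
# resolve_tweet(tweet_id) is a hypothetical lookup (local DB, then API)
# that returns a tweet dict or None.

def resolve_thread_parents(tweet, resolve_tweet):
    """Return the chain of parent tweets, oldest (root) first."""
    chain = []
    current = tweet
    while current.get("in_reply_to_id") is not None:
        parent = resolve_tweet(current["in_reply_to_id"])
        if parent is None:
            break  # e.g. a deleted or locked-account tweet breaks the chain
        chain.append(parent)
        current = parent
    chain.reverse()  # oldest tweet first
    return chain
```

The caller then checks whether `chain[0]` (the ultimate parent) is part of an assignment and branches into the new-tweet or refresh_assignment logic accordingly.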

Backend Implementation

  • When tweets are backfilled we resolve all of these relationships en masse after the tweets are collected, and then we save the whole lot together in one transaction. We then send the relevant events - decide if this happens en masse or individually.
  • resolve_tweet_parents(tweetId) Finds all of the parent tweet objects (from local, from remote) for a given tweet. Returns tweets[].
  • resolve_tweet_children(tweetId) Given a parent tweet (a tweet that has no in_reply_to) find all children of all tweets (except to @DemSausage). Returns tweets[].
  • build_relationship(tweets[]) Given a set of tweets that represent a complete relationship build a new relationship object (per below).
  • create_assignment(tweets[]) Given a set of tweets that represent a complete relationship, call build_relationship() and save a new assignment row. Always saves new tweet objects. Sets the created_on and last_updated_on dates.
  • update_assignment(assignmentId) Given an assignment, refresh all of its tweets via resolve_tweet_children() and, if needed, update the assignment in the database. If an assignment was marked as DONE, change it back to PENDING. Always saves new tweet objects. Updates the last_updated_on date.
  • On receiving a new tweet: if it doesn't have an in_reply_to, the current logic applies. If it does have an in_reply_to, pass the tweet object to the new Celery queue and do nothing else.
  • Celery queue for saving threaded tweets given a tweet object: If the tweet is part of an assignment already, do nothing but save the tweet (handles tweets arriving or being processed out of order). If it's not part of an assignment, call resolve_tweet_parents(). If the parent is NOT part of an assignment, save the tweets we found and issue a NEW_TWEET event for the original tweet. If the parent IS part of an assignment, call resolve_tweet_children() to refresh the thread, pass it to update_assignment(), and then issue NEW_TWEET, UPDATED_ASSIGNMENT, and if necessary COMPLETED_ASSIGNMENT_WAS_UPDATED events that send all of the new tweet objects along. Show the assigned user a notification that one of their assignments has been updated, highlight that on the queue UI in some fashion, and show the tweet as already part of an assignment in the triage UI.
  • On assigning a tweet. Call resolve_tweet_parents() if necessary - pass that or the tweet itself to resolve_tweet_children(), pass that to create_assignment(), and then issue NEW_ASSIGNMENT events that send all of the new tweet objects along to update the queue and triage UIs.
  • Functions that use API calls will log how many they used (directly or via Tweepy?) for post-election analysis. Part of Add a page to see our current Twitter API consumption state #22. Logging should give us a chain of logs to see what specific parts of tweet handling cost in API calls (new tweets, backfilling, assigning).
  • Use a dirty flag to distinguish tweets we saved but couldn't fully resolve relationships for, and thus didn't send to the clients.
  • If we distinguish the source of a tweet (for use with sinceId/maxId logic) should we start caching tweets from users to limit the number of API calls used in get_tweets_from_user_since_tweet_from_api()?
  • Use limits around the API getters to prevent consuming a lot of API calls for non-@DemSausage users if an old tweet is replied to.
  • Think about how tweets flow in: if most tweets are part of an assignment, that's API calls up and down the chain. Even if they're not, replies are always resolved upwards. How's that going to play out?
  • We can reduce the need for replies from us to be requested by an API call if we cache replies sent from within Scremsong, but can't rely on all replies going out from within Scremsong. What's the app rate limit on posting statuses?
  • Allow only sending notifications for a particular set of userIds
  • flake8
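A minimal sketch of the update_assignment() behaviour described above, using a plain dict in place of the real Django model; the field names (tweet_ids, status, last_updated_on) are assumptions about the schema, not the actual one:

```python
from datetime import datetime, timezone

def update_assignment(assignment, thread_tweet_ids):
    """Merge a freshly resolved thread into an assignment dict.

    Returns (assignment, changed) so the caller knows whether to emit
    UPDATED_ASSIGNMENT / COMPLETED_ASSIGNMENT_WAS_UPDATED events.
    """
    changed = set(thread_tweet_ids) != set(assignment["tweet_ids"])
    if changed:
        assignment["tweet_ids"] = sorted(thread_tweet_ids, key=int)
        if assignment["status"] == "DONE":
            assignment["status"] = "PENDING"  # reopen updated assignments
        assignment["last_updated_on"] = datetime.now(timezone.utc)
    return assignment, changed
```

Returning a changed flag keeps event emission (and the "assignment you closed got reopened" notification) in the caller, rather than buried in the save path.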

UI Implementation

  • [Queue] Display tweet threads based on the relationships data sorted chronologically.
  • [Queue] Show tweet action buttons for all tweets in the thread.
  • [Queue] Visually show when an assignment that you closed has been set back to pending by an update.
  • [Queue] Allow the user to sort by the creation date or update date of their assignments.
  • [Queue] Allow the user to only show recently updated or "unread" assignments.
  • [Queue] Visually show the user when an assigned thread has been updated (e.g. highlighting the thread and individual tweets as "unread" until the user takes an action).
  • [Triage] All tweets in a thread that's assigned to someone will show as assigned.
  • [Notifications] When an assignment is updated.
  • [Notifications] When an assignment that you marked as done has been updated.

Thoughts

  • What effect will bouncing and replacing the Django server have on the Celery queue? I think the queue remains unaffected (it's on the database droplet), but what happens to any tasks in the queue that are currently being run?
  • Maybe we need to do https://github.com/keithamoss/scremsong/issues/44 sooner since we're using Celery queues more now?
  • With locally caching replies we're not accounting for later tweets arriving by other means and bumping the sinceId higher i.e. leaving gaps between the last lot we cached and the new tweet. This isn't an issue as long as we're only caching things from/to @DemSausage (I think?), but would be if we're doing it for other accounts. Have a think about this.
  • Perhaps we need to track the source of tweets if we're relying on sinceId/maxId? e.g. From stream, From backfill, From resolving parents/children.
  • Don't send new tweet notifications until after assignment events
  • For backfill: Send new tweet events as a single event after all of the assignment resolution/notifications.
  • For backfill: Make en masse assigning work backwards from the highest id to ensure maximum use of the local db.
  • Locked accounts replying to tweets break our thread fetching implementation
  • Walk through all of the scenarios of tweets arriving and think about the API usage. Are we going to burn a lot on replies?
  • Rebuild Celery container
  • Document the logic of how tweets flow in and handling logic in the Enhancements issue

Notes

Edit: Ooh, a more elegant approach to the data fetching would be to have tweet streaming take care of filling in missing in-reply-to tweets before it writes the received tweet to the database. Then we just need to handle the UI side (which can be separate Tweet components with some of our own CSS applied).

Do a backend POC of the structure for one thread, wire it up to the frontend loosely, then build the frontend around that and make sure it works end-to-end. Then we can build out the rest of the backend logic (Celery queues, et cetera).

  • Create a never-ending Celery task to ping statuses/user_timeline for all tweets after since_id.
  • Store tweets in the usual tweets table and think of a way to easily grab the highest for comparison with since_id. Don't show tweets from us in the triage columns.
  • Update the assignments API to return an array of tweet ids for each assignment, not a single id. This array should be a chronologically sorted list of tweets.
  • The assignments API should also return all assignments chronologically sorted by the most recent tweet for each assignment (so new stuff gets bumped to the top)
  • The changes to the assignments API and GUI should accommodate showing notifications to the user like "Your assigned tweets have some new replies".
  • The assignments GUI can then just display separate <Tweet /> elements for each without worrying about doing fancy threading.
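The polling task above could be sketched roughly like this; fetch_timeline stands in for e.g. Tweepy's user_timeline call with since_id, and the Celery scheduling itself is omitted:

```python
# Sketch: one iteration of the never-ending since_id polling loop.
# fetch_timeline(since_id=...) and save_tweet(...) are stand-ins for the
# real API client and the usual tweets-table save.

def poll_our_timeline(since_id, fetch_timeline, save_tweet):
    """Fetch our tweets newer than since_id; return the new high-water mark."""
    new_tweets = fetch_timeline(since_id=since_id)
    for tweet in new_tweets:
        save_tweet(tweet)  # stored in the usual tweets table, hidden from triage
    if new_tweets:
        since_id = max(int(t["id"]) for t in new_tweets)
    return since_id
```

Keeping the high-water mark as the return value makes "easily grab the highest for comparison with since_id" trivial: the task just feeds its own return value back in on the next iteration.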

Further thinking

  • We'll need to handle tweets being assigned that are not the first tweet in a thread. e.g. Someone tweets about sausage (but has none of our search terms), someone responds (with our search terms), we assign that second tweet, but don't have that first one. Gotta fill in those gaps.
  • A tweet can only be assigned if its parent or child tweets are unassigned
  • When a tweet is assigned all of its parent and child tweets are part of the same assignment
  • When a tweet is received from the stream we resolve whether it's part of an assignment
  • How is the one-to-many relationship of assignments-to-tweets going to work? Right now it's built around a one-to-one relationship and deleting an assignment reflects that.
  • Do we need to handle tweets coming in out of order or being processed out of order? Maybe we need to walk the thread chain in both directions to be sure?
  • We're going to have to be careful about when stuff gets saved to the database. When a client connects/resyncs we don't want the tweet being in there, but the thread info isn't yet, so they only get partial information. But would that only be for a little window until we send thread info + assignment info? In any event, it may be good to wrap the tweet, thread, and assignment update logic up in a transaction.
  • Do we need to always maintain tweet relationships or do we ONLY build it when a tweet is assigned to someone?
  • We'll have to deal with backfilling tweets in the right order, so that parents of younger tweets are resolved before their children are added and the whole thread chain exists.
  • Data Structure 1: Django table for tweet threads with one row per tweet per thread with parent_id, thread_id, and tweet_id
  • Data Structure 2: Django table for tweet threads with one row per thread with parent_id, thread_id and thread_data (JSON field with tweet_ids: string[] and relationships: <some json>)
  • Data Structures: It's a question of (1) easier in Django and indexable vs (2) fewer rows, an existing JSON data structure to use on the frontend, no Django convenience methods, and MAYBE not easily indexable (but we can deal with that by maintaining tweet_ids as a flat array of all tweets in the relationship that we CAN index).
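The two structures are convertible either way, so we're not locked in. A rough sketch (plain tuples standing in for Django model rows, not the real schema) showing Data Structure 1's flat rows rebuilt into Data Structure 2's nested relationships shape:

```python
# Sketch: Data Structure 1 rows are (tweet_id, parent_id, thread_id);
# build_tree() reconstructs Data Structure 2's nested relationships shape.
# Root tweets have parent_id None.

def build_tree(rows, thread_id):
    children = {}
    for tweet_id, parent_id, tid in rows:
        if tid == thread_id:
            children.setdefault(parent_id, []).append(tweet_id)

    def node(tweet_id):
        kids = sorted(children.get(tweet_id, []), key=int)
        # Leaf tweets stay as bare ids, matching the relationships example
        return tweet_id if not kids else {
            "tweet_id": tweet_id,
            "children": [node(k) for k in kids],
        }

    return [node(r) for r in sorted(children.get(None, []), key=int)]
```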

Relationships example:

[
    "1",
    "2",
    {
        "tweet_id": 3,
        "children": [
            "10",
            {
                "tweet_id": 11,
                "children": ["20", "21"]
            },
            "12"
        ]
    },
    "4"
]
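For the indexability caveat above, a hypothetical helper that flattens this nested structure into the flat tweet_ids array we'd maintain alongside it:

```python
# Sketch: flatten the nested relationships structure into the flat,
# indexable tweet_ids array. Ids are normalised to strings since the
# example mixes bare string ids and integer tweet_id fields.

def flatten_tweet_ids(relationships):
    ids = []
    for item in relationships:
        if isinstance(item, dict):
            ids.append(str(item["tweet_id"]))
            ids.extend(flatten_tweet_ids(item["children"]))
        else:
            ids.append(str(item))
    return ids
```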

Reading
