Support for indexing video (and fix indexing URLs) #401
Merged
SaptakS merged 4 commits into 354-index-media on Feb 17, 2025
…to use camelCase for indexStart and indexEnd for consistency
I also added type definitions for URLs, wrote a test, and actually also found a bug, so now indexing URLs works: ca55615
SaptakS approved these changes on Feb 17, 2025
Thanks for the detailed description! Looks good to me.
Hey @SaptakS this fixes the "Figure out issue with downloading of videos" step in #393. Here's how I did it:
In a test X account, I posted a tweet with an image, and another tweet with a video. Then using Burp Suite, I logged into my test X account and loaded my test account's replies feed. In Burp Suite, I found the `UserTweetsAndReplies` API request.

I copied the JSON into a text editor and formatted it so I could see exactly what's in each `tweets['legacy']['media']` object. I also saved this as `testdata/XAPIUserTweetsAndRepliesMedia.json`, so I could later write a test using it.

Based on what I found, I started with updating the TypeScript interfaces to support the media fields we need. In `src/account_x/types.ts`, I edited `XAPILegacyTweet` to support the `media` field. I made sure to include all fields that I could find in both the video and photo tweets, making sure to mark some of them as optional. In the end I expanded the media part into its own sub-types `XAPILegacyTweetMedia` and `XAPILegacyTweetMediaVideoVariant`.

Here's what my example video tweet JSON object looks like:
```json
{
  "bookmark_count": 0,
  "bookmarked": false,
  "created_at": "Fri Feb 14 21:30:00 +0000 2025",
  "conversation_id_str": "1890513848811090236",
  "display_text_range": [0, 28],
  "entities": {
    "hashtags": [],
    "media": [
      {
        "display_url": "pic.x.com/MMfXeoZEdi",
        "expanded_url": "https://x.com/nexamind91326/status/1890513848811090236/video/1",
        "id_str": "1890513743144185859",
        "indices": [29, 52],
        "media_key": "7_1890513743144185859",
        "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/1890513743144185859/pu/img/jZqLegack-BV-8TT.jpg",
        "type": "video",
        "url": "https://t.co/MMfXeoZEdi",
        "additional_media_info": { "monetizable": false },
        "ext_media_availability": { "status": "Available" },
        "sizes": {
          "large": { "h": 1920, "w": 1080, "resize": "fit" },
          "medium": { "h": 1200, "w": 675, "resize": "fit" },
          "small": { "h": 680, "w": 383, "resize": "fit" },
          "thumb": { "h": 150, "w": 150, "resize": "crop" }
        },
        "original_info": { "height": 1920, "width": 1080, "focus_rects": [] },
        "allow_download_status": { "allow_download": true },
        "video_info": {
          "aspect_ratio": [9, 16],
          "duration_millis": 53111,
          "variants": [
            { "content_type": "application/x-mpegURL", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/pl/Iu8GfeKfzuWBOu_l.m3u8?tag=12" },
            { "bitrate": 632000, "content_type": "video/mp4", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/320x568/Ph1QyGrCB8rJnAx5.mp4?tag=12" },
            { "bitrate": 950000, "content_type": "video/mp4", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/480x852/UCfVhahPJPM6LH3w.mp4?tag=12" },
            { "bitrate": 2176000, "content_type": "video/mp4", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/720x1280/WTi2T6_vWXhqOaxi.mp4?tag=12" }
          ]
        },
        "media_results": { "result": { "media_key": "7_1890513743144185859" } }
      }
    ],
    "symbols": [],
    "timestamps": [],
    "urls": [],
    "user_mentions": []
  },
  "extended_entities": {
    "media": [
      {
        "display_url": "pic.x.com/MMfXeoZEdi",
        "expanded_url": "https://x.com/nexamind91326/status/1890513848811090236/video/1",
        "id_str": "1890513743144185859",
        "indices": [29, 52],
        "media_key": "7_1890513743144185859",
        "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/1890513743144185859/pu/img/jZqLegack-BV-8TT.jpg",
        "type": "video",
        "url": "https://t.co/MMfXeoZEdi",
        "additional_media_info": { "monetizable": false },
        "ext_media_availability": { "status": "Available" },
        "sizes": {
          "large": { "h": 1920, "w": 1080, "resize": "fit" },
          "medium": { "h": 1200, "w": 675, "resize": "fit" },
          "small": { "h": 680, "w": 383, "resize": "fit" },
          "thumb": { "h": 150, "w": 150, "resize": "crop" }
        },
        "original_info": { "height": 1920, "width": 1080, "focus_rects": [] },
        "allow_download_status": { "allow_download": true },
        "video_info": {
          "aspect_ratio": [9, 16],
          "duration_millis": 53111,
          "variants": [
            { "content_type": "application/x-mpegURL", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/pl/Iu8GfeKfzuWBOu_l.m3u8?tag=12" },
            { "bitrate": 632000, "content_type": "video/mp4", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/320x568/Ph1QyGrCB8rJnAx5.mp4?tag=12" },
            { "bitrate": 950000, "content_type": "video/mp4", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/480x852/UCfVhahPJPM6LH3w.mp4?tag=12" },
            { "bitrate": 2176000, "content_type": "video/mp4", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/720x1280/WTi2T6_vWXhqOaxi.mp4?tag=12" }
          ]
        },
        "media_results": { "result": { "media_key": "7_1890513743144185859" } }
      }
    ]
  },
  "favorite_count": 0,
  "favorited": false,
  "full_text": "check out this video i found https://t.co/MMfXeoZEdi",
  "is_quote_status": false,
  "lang": "en",
  "possibly_sensitive": false,
  "possibly_sensitive_editable": true,
  "quote_count": 0,
  "reply_count": 0,
  "retweet_count": 0,
  "retweeted": false,
  "user_id_str": "1769426369526771712",
  "id_str": "1890513848811090236"
}
```

If you look at `tweet['entities']['media'][0]['video_info']['variants']`, you can see that it lists the actual URLs of MP4s to download:

```json
[
  { "content_type": "application/x-mpegURL", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/pl/Iu8GfeKfzuWBOu_l.m3u8?tag=12" },
  { "bitrate": 632000, "content_type": "video/mp4", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/320x568/Ph1QyGrCB8rJnAx5.mp4?tag=12" },
  { "bitrate": 950000, "content_type": "video/mp4", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/480x852/UCfVhahPJPM6LH3w.mp4?tag=12" },
  { "bitrate": 2176000, "content_type": "video/mp4", "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/720x1280/WTi2T6_vWXhqOaxi.mp4?tag=12" }
]
```

The first one that's
`application/x-mpegURL` is an HLS streaming playlist that you can't just download as a single file, but the rest are MP4s. I tried downloading them all and, strangely, I discovered that when I keep the `?tag=12` at the end I get a 404, but if I strip it, it downloads the video.

So, I updated `XAccountController.indexTweetMedia` to check if the media is a video, and if so to download the variant of the video with the highest bitrate (which is the best quality/largest file).

And it works! I also added a test to `src/account_x.test.ts` that confirms it works. It parses the `XAPIUserTweetsAndRepliesMedia.json` file and then runs SQL queries to confirm that it successfully imported the tweet and the media, and that the media filename is correct. (I also tried stubbing `fetch` so that the tests won't actually try downloading the media... I'm unsure whether it worked or whether I'm still just downloading the media when I run the tests.)

Also, I updated the `XTweetMediaRow` to include all of the fields that it uses now, and I realized that your migration had used the fields `start_index` and `end_index` while the convention for the rest of the database tables is to use camelCase, so I updated them to `startIndex` and `endIndex`, which means you'll just need to delete the database and start over, because I didn't add a new migration for it or anything.
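
As a rough illustration of the video handling described above (pick the MP4 variant with the highest bitrate, skip the `application/x-mpegURL` entry, and strip the `?tag=12` query string), here is a minimal sketch. It is not the actual code in `XAccountController.indexTweetMedia`. The names `VideoVariant`, `TweetMedia`, `pickBestMp4Variant`, `stripQueryString`, and `mediaDownloadUrl` are hypothetical, the interfaces are trimmed-down guesses based on the JSON above, and falling back to `media_url_https` for non-video media is an assumption.

```typescript
// Hypothetical, trimmed-down shapes based on the example JSON; the real
// XAPILegacyTweetMedia types in src/account_x/types.ts may differ.
interface VideoVariant {
  bitrate?: number; // absent on the application/x-mpegURL (HLS) entry
  content_type: string;
  url: string;
}

interface TweetMedia {
  type: string; // "video" in the example tweet
  media_url_https: string;
  video_info?: { variants: VideoVariant[] };
}

// Return the MP4 variant with the highest bitrate, or undefined if the
// list contains no MP4 variants at all.
function pickBestMp4Variant(variants: VideoVariant[]): VideoVariant | undefined {
  let best: VideoVariant | undefined;
  for (const v of variants) {
    if (v.content_type !== "video/mp4") continue; // skip the HLS playlist
    if (best === undefined || (v.bitrate ?? 0) > (best.bitrate ?? 0)) {
      best = v;
    }
  }
  return best;
}

// Drop everything from the first "?" onward (e.g. "?tag=12"), since the
// description reports a 404 when the query string is kept.
function stripQueryString(rawUrl: string): string {
  const i = rawUrl.indexOf("?");
  return i === -1 ? rawUrl : rawUrl.slice(0, i);
}

// Decide which URL to download for a media item. Assumption: anything that
// is not a video with at least one MP4 variant falls back to media_url_https.
function mediaDownloadUrl(media: TweetMedia): string {
  if (media.type === "video" && media.video_info) {
    const best = pickBestMp4Variant(media.video_info.variants);
    if (best) return stripQueryString(best.url);
  }
  return media.media_url_https;
}
```

For the example tweet above, this sketch would select the 720x1280 variant (bitrate 2176000) and return its URL without `?tag=12`.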