Support for indexing video (and fix indexing URLs) #401

Merged
SaptakS merged 4 commits into 354-index-media from 354-index-media-video on Feb 17, 2025

@micahflee
Member

Hey @SaptakS this fixes the "Figure out issue with downloading of videos" step in #393. Here's how I did it:

In a test X account, I posted a tweet with an image, and another tweet with a video. Then using Burp Suite, I logged into my test X account and loaded my test account's replies feed. In Burp Suite, I found the UserTweetsAndReplies API request:

[Screenshot: the UserTweetsAndReplies API request in Burp Suite]

I copied the JSON into a text editor and formatted it so I could see exactly what's in each tweets['legacy']['media'] object. I also saved this as testdata/XAPIUserTweetsAndRepliesMedia.json, so I could later write a test using it.

Based on what I found, I started by updating the TypeScript interfaces to support the media fields we need. In src/account_x/types.ts, I edited XAPILegacyTweet to support the media field. I included all the fields I could find in both the video and photo tweets, marking some of them as optional. In the end I split the media part out into its own sub-types, XAPILegacyTweetMedia and XAPILegacyTweetMediaVideoVariant.
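For readers who want a feel for the shape of those sub-types, here is a minimal sketch. The field names are taken from the example video tweet in this description; which fields are optional, and the `isVideoMedia` helper, are assumptions for illustration, so the real definitions in src/account_x/types.ts may differ.

```typescript
// Sketch only: field names come from the example JSON in this description.
// The actual interfaces in src/account_x/types.ts may differ.
interface XAPILegacyTweetMediaVideoVariant {
    bitrate?: number; // absent on the application/x-mpegURL variant
    content_type: string;
    url: string;
}

interface XAPILegacyTweetMedia {
    display_url: string;
    expanded_url: string;
    id_str: string;
    indices: number[];
    media_key: string;
    media_url_https: string;
    type: string; // "video" in the example tweet
    url: string;
    // Assumed optional, since only video media carries it
    video_info?: {
        aspect_ratio: number[];
        duration_millis: number;
        variants: XAPILegacyTweetMediaVideoVariant[];
    };
}

// Hypothetical helper (not part of the PR): true when the media entry
// is a video that actually lists its variants.
function isVideoMedia(media: XAPILegacyTweetMedia): boolean {
    return media.type === "video" && media.video_info !== undefined;
}
```

Marking `bitrate` and `video_info` as optional lets the same interface describe both photo and video entries without casts.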

Here's what my example video tweet JSON object looks like:

{
    "bookmark_count": 0,
    "bookmarked": false,
    "created_at": "Fri Feb 14 21:30:00 +0000 2025",
    "conversation_id_str": "1890513848811090236",
    "display_text_range": [
        0,
        28
    ],
    "entities": {
        "hashtags": [],
        "media": [
            {
                "display_url": "pic.x.com/MMfXeoZEdi",
                "expanded_url": "https://x.com/nexamind91326/status/1890513848811090236/video/1",
                "id_str": "1890513743144185859",
                "indices": [
                    29,
                    52
                ],
                "media_key": "7_1890513743144185859",
                "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/1890513743144185859/pu/img/jZqLegack-BV-8TT.jpg",
                "type": "video",
                "url": "https://t.co/MMfXeoZEdi",
                "additional_media_info": {
                    "monetizable": false
                },
                "ext_media_availability": {
                    "status": "Available"
                },
                "sizes": {
                    "large": {
                        "h": 1920,
                        "w": 1080,
                        "resize": "fit"
                    },
                    "medium": {
                        "h": 1200,
                        "w": 675,
                        "resize": "fit"
                    },
                    "small": {
                        "h": 680,
                        "w": 383,
                        "resize": "fit"
                    },
                    "thumb": {
                        "h": 150,
                        "w": 150,
                        "resize": "crop"
                    }
                },
                "original_info": {
                    "height": 1920,
                    "width": 1080,
                    "focus_rects": []
                },
                "allow_download_status": {
                    "allow_download": true
                },
                "video_info": {
                    "aspect_ratio": [
                        9,
                        16
                    ],
                    "duration_millis": 53111,
                    "variants": [
                        {
                            "content_type": "application/x-mpegURL",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/pl/Iu8GfeKfzuWBOu_l.m3u8?tag=12"
                        },
                        {
                            "bitrate": 632000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/320x568/Ph1QyGrCB8rJnAx5.mp4?tag=12"
                        },
                        {
                            "bitrate": 950000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/480x852/UCfVhahPJPM6LH3w.mp4?tag=12"
                        },
                        {
                            "bitrate": 2176000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/720x1280/WTi2T6_vWXhqOaxi.mp4?tag=12"
                        }
                    ]
                },
                "media_results": {
                    "result": {
                        "media_key": "7_1890513743144185859"
                    }
                }
            }
        ],
        "symbols": [],
        "timestamps": [],
        "urls": [],
        "user_mentions": []
    },
    "extended_entities": {
        "media": [
            {
                "display_url": "pic.x.com/MMfXeoZEdi",
                "expanded_url": "https://x.com/nexamind91326/status/1890513848811090236/video/1",
                "id_str": "1890513743144185859",
                "indices": [
                    29,
                    52
                ],
                "media_key": "7_1890513743144185859",
                "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/1890513743144185859/pu/img/jZqLegack-BV-8TT.jpg",
                "type": "video",
                "url": "https://t.co/MMfXeoZEdi",
                "additional_media_info": {
                    "monetizable": false
                },
                "ext_media_availability": {
                    "status": "Available"
                },
                "sizes": {
                    "large": {
                        "h": 1920,
                        "w": 1080,
                        "resize": "fit"
                    },
                    "medium": {
                        "h": 1200,
                        "w": 675,
                        "resize": "fit"
                    },
                    "small": {
                        "h": 680,
                        "w": 383,
                        "resize": "fit"
                    },
                    "thumb": {
                        "h": 150,
                        "w": 150,
                        "resize": "crop"
                    }
                },
                "original_info": {
                    "height": 1920,
                    "width": 1080,
                    "focus_rects": []
                },
                "allow_download_status": {
                    "allow_download": true
                },
                "video_info": {
                    "aspect_ratio": [
                        9,
                        16
                    ],
                    "duration_millis": 53111,
                    "variants": [
                        {
                            "content_type": "application/x-mpegURL",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/pl/Iu8GfeKfzuWBOu_l.m3u8?tag=12"
                        },
                        {
                            "bitrate": 632000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/320x568/Ph1QyGrCB8rJnAx5.mp4?tag=12"
                        },
                        {
                            "bitrate": 950000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/480x852/UCfVhahPJPM6LH3w.mp4?tag=12"
                        },
                        {
                            "bitrate": 2176000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/720x1280/WTi2T6_vWXhqOaxi.mp4?tag=12"
                        }
                    ]
                },
                "media_results": {
                    "result": {
                        "media_key": "7_1890513743144185859"
                    }
                }
            }
        ]
    },
    "favorite_count": 0,
    "favorited": false,
    "full_text": "check out this video i found https://t.co/MMfXeoZEdi",
    "is_quote_status": false,
    "lang": "en",
    "possibly_sensitive": false,
    "possibly_sensitive_editable": true,
    "quote_count": 0,
    "reply_count": 0,
    "retweet_count": 0,
    "retweeted": false,
    "user_id_str": "1769426369526771712",
    "id_str": "1890513848811090236"
}

If you look at tweet['entities']['media'][0]['video_info']['variants'], you can see that it lists the actual URLs of MP4s to download:

[
    {
        "content_type": "application/x-mpegURL",
        "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/pl/Iu8GfeKfzuWBOu_l.m3u8?tag=12"
    },
    {
        "bitrate": 632000,
        "content_type": "video/mp4",
        "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/320x568/Ph1QyGrCB8rJnAx5.mp4?tag=12"
    },
    {
        "bitrate": 950000,
        "content_type": "video/mp4",
        "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/480x852/UCfVhahPJPM6LH3w.mp4?tag=12"
    },
    {
        "bitrate": 2176000,
        "content_type": "video/mp4",
        "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/720x1280/WTi2T6_vWXhqOaxi.mp4?tag=12"
    }
]

The first one, with content type application/x-mpegURL, is an HLS playlist (a streaming format) that you can't just download as a single file, but the rest are MP4s. I tried downloading them all and, strangely, found that keeping the ?tag=12 at the end gives a 404, while stripping it downloads the video.
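The observation above can be sketched as a small filter: keep only the `video/mp4` variants and drop everything from the `?` onward. The `VideoVariant` interface and the `mp4DownloadURLs` name are hypothetical, introduced only for this illustration.

```typescript
// Hypothetical shape of one entry in video_info.variants
interface VideoVariant {
    bitrate?: number;
    content_type: string;
    url: string;
}

// Hypothetical helper: returns the MP4 URLs with any query string removed,
// skipping the application/x-mpegURL playlist entry.
function mp4DownloadURLs(variants: VideoVariant[]): string[] {
    return variants
        .filter((variant) => variant.content_type === "video/mp4")
        .map((variant) => {
            const queryIndex = variant.url.indexOf("?");
            return queryIndex === -1 ? variant.url : variant.url.slice(0, queryIndex);
        });
}
```

Applied to the four variants listed above, this would yield the three MP4 URLs without the trailing `?tag=12`.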

So, I updated XAccountController.indexTweetMedia to check if the media is a video, and if so to download the variant of the video with the highest bitrate (which is the best quality/largest file):

// Get the HTTPS URL of the media -- this works for photos
let mediaURL = media["media_url_https"];

// If it's a video, set mediaURL to the video variant with the highest bitrate
if (media["type"] == "video") {
    let highestBitrate = 0;
    if (media["video_info"] && media["video_info"]["variants"]) {
        media["video_info"]["variants"].forEach((variant: XAPILegacyTweetMediaVideoVariant) => {
            if (variant["bitrate"] && variant["bitrate"] > highestBitrate) {
                highestBitrate = variant["bitrate"];
                mediaURL = variant["url"];

                // Strip query parameters from the URL.
                // For some reason video variants end with `?tag=12`, and when we try downloading with that
                // it responds with 404.
                const queryIndex = mediaURL.indexOf("?");
                if (queryIndex > -1) {
                    mediaURL = mediaURL.substring(0, queryIndex);
                }
            }
        });
    };
}
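The same selection logic could also be pulled out into a pure function, which makes it easy to test without a database or network. This is only a sketch: `VideoVariant` and `highestBitrateURL` are hypothetical names, and unlike the snippet above it strips the query string once, after the best variant has been chosen.

```typescript
// Hypothetical shape of one entry in video_info.variants
interface VideoVariant {
    bitrate?: number;
    content_type: string;
    url: string;
}

// Hypothetical standalone version of the selection logic.
// Returns null when no variant carries a bitrate (e.g. only an m3u8 playlist).
function highestBitrateURL(variants: VideoVariant[]): string | null {
    let best: VideoVariant | null = null;
    for (const variant of variants) {
        // Variants without a bitrate (the HLS playlist) are skipped
        if (variant.bitrate === undefined) {
            continue;
        }
        if (best === null || variant.bitrate > (best.bitrate ?? 0)) {
            best = variant;
        }
    }
    if (best === null) {
        return null;
    }
    // Strip the query string, since `?tag=12` led to a 404 in testing
    const queryIndex = best.url.indexOf("?");
    return queryIndex > -1 ? best.url.substring(0, queryIndex) : best.url;
}
```

One behavioral difference to be aware of: this sketch would accept a variant whose bitrate is exactly 0, whereas the truthiness check in the snippet above ignores it.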

And it works! I also added a test to src/account_x.test.ts that confirms it works. It parses the XAPIUserTweetsAndRepliesMedia.json file and then runs SQL queries to confirm that it successfully imported the tweet and the media, and that the media filename is correct. (I also tried stubbing fetch so that the tests won't actually try downloading the media... I'm not sure whether that worked, or whether the tests still download the media when I run them.)
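One way to check whether a fetch stub is really being hit is to have it record every URL it receives, then assert on that list after the test runs. The sketch below is not from the PR: `stubFetch` is a hypothetical helper, and it assumes the code under test calls the global `fetch` in the same process rather than an imported HTTP client.

```typescript
// Hypothetical helper (not part of the PR): swaps the global fetch for a
// stub that records requested URLs and never touches the network.
function stubFetch(): { calls: string[]; restore: () => void } {
    const globalObject = globalThis as any;
    const original = globalObject.fetch;
    const calls: string[] = [];

    globalObject.fetch = (input: unknown): Promise<unknown> => {
        calls.push(String(input));
        // Minimal fake response; extend it with whatever the code under test reads
        return Promise.resolve({
            ok: true,
            status: 200,
            arrayBuffer: () => Promise.resolve(new ArrayBuffer(0)),
        });
    };

    return {
        calls,
        restore: () => {
            globalObject.fetch = original;
        },
    };
}
```

If `calls` is still empty after indexing a tweet with media, the stub was bypassed and the test is probably hitting the network.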

Also, I updated XTweetMediaRow to include all of the fields that it uses now. I realized that your migration used the fields start_index and end_index, while the convention for the rest of the database tables is camelCase, so I renamed them to startIndex and endIndex. That means you'll need to delete the database and start over, because I didn't add a new migration for it.

@micahflee micahflee requested a review from SaptakS February 14, 2025 23:01
@micahflee micahflee changed the title Support for indexing video Support for indexing video (and fix indexing URLs) Feb 16, 2025
@micahflee
Member Author

I also added type definitions for URLs, wrote a test, and actually also found a bug, so now indexing URLs works: ca55615

Contributor

@SaptakS SaptakS left a comment

Thanks for the detailed description! Looks good to me.

@SaptakS SaptakS merged commit d53a26b into 354-index-media Feb 17, 2025
@SaptakS SaptakS deleted the 354-index-media-video branch February 17, 2025 18:22