Support for indexing video (and fix indexing URLs) by micahflee · Pull Request #401 · lockdown-systems/cyd

micahflee · 2025-02-14T23:00:51Z

Hey @SaptakS this fixes the "Figure out issue with downloading of videos" step in #393. Here's how I did it:

In a test X account, I posted a tweet with an image, and another tweet with a video. Then using Burp Suite, I logged into my test X account and loaded my test account's replies feed. In Burp Suite, I found the UserTweetsAndReplies API request:

I copied the JSON into a text editor and formatted it so I could see exactly what's in each tweets['legacy']['media'] object. I also saved this as testdata/XAPIUserTweetsAndRepliesMedia.json, so I could later write a test using it.

Based on what I found, I started by updating the TypeScript interfaces to support the media fields we need. In src/account_x/types.ts, I edited XAPILegacyTweet to support the media field. I included every field I could find in both the video and photo tweets, and marked the ones that don't appear in both as optional. In the end I split the media part out into its own sub-types, XAPILegacyTweetMedia and XAPILegacyTweetMediaVideoVariant.
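For orientation, here is a rough sketch of what those two sub-types could look like. The field names come from the example tweet JSON in this description; which fields are marked optional is an assumption, and the real definitions in src/account_x/types.ts may differ. The hasVideoVariants helper is hypothetical and only there to show the types in use.

```typescript
// Sketch only: field names are taken from the example tweet JSON in this
// description; the choice of optional fields is an assumption.
interface XAPILegacyTweetMediaVideoVariant {
    bitrate?: number; // missing on the application/x-mpegURL variant
    content_type: string;
    url: string;
}

interface XAPILegacyTweetMedia {
    display_url: string;
    expanded_url: string;
    id_str: string;
    indices: number[];
    media_key: string;
    media_url_https: string;
    type: string; // "video" in the example tweet
    url: string;
    // Assumed optional because only video media carries it
    video_info?: {
        aspect_ratio: number[];
        duration_millis: number;
        variants: XAPILegacyTweetMediaVideoVariant[];
    };
}

// Hypothetical helper (not from the PR): true when the media is a video
// that lists at least one variant.
function hasVideoVariants(media: XAPILegacyTweetMedia): boolean {
    return media.type === "video" && (media.video_info?.variants.length ?? 0) > 0;
}
```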

Here's what my example video tweet JSON object looks like:

{
    "bookmark_count": 0,
    "bookmarked": false,
    "created_at": "Fri Feb 14 21:30:00 +0000 2025",
    "conversation_id_str": "1890513848811090236",
    "display_text_range": [
        0,
        28
    ],
    "entities": {
        "hashtags": [],
        "media": [
            {
                "display_url": "pic.x.com/MMfXeoZEdi",
                "expanded_url": "https://x.com/nexamind91326/status/1890513848811090236/video/1",
                "id_str": "1890513743144185859",
                "indices": [
                    29,
                    52
                ],
                "media_key": "7_1890513743144185859",
                "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/1890513743144185859/pu/img/jZqLegack-BV-8TT.jpg",
                "type": "video",
                "url": "https://t.co/MMfXeoZEdi",
                "additional_media_info": {
                    "monetizable": false
                },
                "ext_media_availability": {
                    "status": "Available"
                },
                "sizes": {
                    "large": {
                        "h": 1920,
                        "w": 1080,
                        "resize": "fit"
                    },
                    "medium": {
                        "h": 1200,
                        "w": 675,
                        "resize": "fit"
                    },
                    "small": {
                        "h": 680,
                        "w": 383,
                        "resize": "fit"
                    },
                    "thumb": {
                        "h": 150,
                        "w": 150,
                        "resize": "crop"
                    }
                },
                "original_info": {
                    "height": 1920,
                    "width": 1080,
                    "focus_rects": []
                },
                "allow_download_status": {
                    "allow_download": true
                },
                "video_info": {
                    "aspect_ratio": [
                        9,
                        16
                    ],
                    "duration_millis": 53111,
                    "variants": [
                        {
                            "content_type": "application/x-mpegURL",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/pl/Iu8GfeKfzuWBOu_l.m3u8?tag=12"
                        },
                        {
                            "bitrate": 632000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/320x568/Ph1QyGrCB8rJnAx5.mp4?tag=12"
                        },
                        {
                            "bitrate": 950000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/480x852/UCfVhahPJPM6LH3w.mp4?tag=12"
                        },
                        {
                            "bitrate": 2176000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/720x1280/WTi2T6_vWXhqOaxi.mp4?tag=12"
                        }
                    ]
                },
                "media_results": {
                    "result": {
                        "media_key": "7_1890513743144185859"
                    }
                }
            }
        ],
        "symbols": [],
        "timestamps": [],
        "urls": [],
        "user_mentions": []
    },
    "extended_entities": {
        "media": [
            {
                "display_url": "pic.x.com/MMfXeoZEdi",
                "expanded_url": "https://x.com/nexamind91326/status/1890513848811090236/video/1",
                "id_str": "1890513743144185859",
                "indices": [
                    29,
                    52
                ],
                "media_key": "7_1890513743144185859",
                "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/1890513743144185859/pu/img/jZqLegack-BV-8TT.jpg",
                "type": "video",
                "url": "https://t.co/MMfXeoZEdi",
                "additional_media_info": {
                    "monetizable": false
                },
                "ext_media_availability": {
                    "status": "Available"
                },
                "sizes": {
                    "large": {
                        "h": 1920,
                        "w": 1080,
                        "resize": "fit"
                    },
                    "medium": {
                        "h": 1200,
                        "w": 675,
                        "resize": "fit"
                    },
                    "small": {
                        "h": 680,
                        "w": 383,
                        "resize": "fit"
                    },
                    "thumb": {
                        "h": 150,
                        "w": 150,
                        "resize": "crop"
                    }
                },
                "original_info": {
                    "height": 1920,
                    "width": 1080,
                    "focus_rects": []
                },
                "allow_download_status": {
                    "allow_download": true
                },
                "video_info": {
                    "aspect_ratio": [
                        9,
                        16
                    ],
                    "duration_millis": 53111,
                    "variants": [
                        {
                            "content_type": "application/x-mpegURL",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/pl/Iu8GfeKfzuWBOu_l.m3u8?tag=12"
                        },
                        {
                            "bitrate": 632000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/320x568/Ph1QyGrCB8rJnAx5.mp4?tag=12"
                        },
                        {
                            "bitrate": 950000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/480x852/UCfVhahPJPM6LH3w.mp4?tag=12"
                        },
                        {
                            "bitrate": 2176000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/720x1280/WTi2T6_vWXhqOaxi.mp4?tag=12"
                        }
                    ]
                },
                "media_results": {
                    "result": {
                        "media_key": "7_1890513743144185859"
                    }
                }
            }
        ]
    },
    "favorite_count": 0,
    "favorited": false,
    "full_text": "check out this video i found https://t.co/MMfXeoZEdi",
    "is_quote_status": false,
    "lang": "en",
    "possibly_sensitive": false,
    "possibly_sensitive_editable": true,
    "quote_count": 0,
    "reply_count": 0,
    "retweet_count": 0,
    "retweeted": false,
    "user_id_str": "1769426369526771712",
    "id_str": "1890513848811090236"
}

If you look at tweet['entities']['media'][0]['video_info']['variants'], you can see that it lists the actual URLs of MP4s to download:

[
    {
        "content_type": "application/x-mpegURL",
        "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/pl/Iu8GfeKfzuWBOu_l.m3u8?tag=12"
    },
    {
        "bitrate": 632000,
        "content_type": "video/mp4",
        "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/320x568/Ph1QyGrCB8rJnAx5.mp4?tag=12"
    },
    {
        "bitrate": 950000,
        "content_type": "video/mp4",
        "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/480x852/UCfVhahPJPM6LH3w.mp4?tag=12"
    },
    {
        "bitrate": 2176000,
        "content_type": "video/mp4",
        "url": "https://video.twimg.com/ext_tw_video/1890513743144185859/pu/vid/avc1/720x1280/WTi2T6_vWXhqOaxi.mp4?tag=12"
    }
]

The first one, with content type application/x-mpegURL, is an HLS streaming playlist (.m3u8) that you can't just download as a single file, but the rest are MP4s. I tried downloading them all and, strangely, discovered that when I keep the ?tag=12 at the end I get a 404, but if I strip it, the video downloads.
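To make that concrete, here's a minimal sketch of turning a variants array into a list of URLs that can be fetched directly. The names VideoVariant, stripQuery and downloadableMP4URLs are made up for illustration and aren't part of the PR.

```typescript
// Illustrative shape of one entry in video_info.variants
interface VideoVariant {
    bitrate?: number;
    content_type: string;
    url: string;
}

// Drop everything from the first "?" onward (e.g. the trailing ?tag=12)
function stripQuery(url: string): string {
    const queryIndex = url.indexOf("?");
    return queryIndex === -1 ? url : url.substring(0, queryIndex);
}

// Keep only the MP4 variants (skipping the m3u8 playlist) and strip
// the query string from each URL
function downloadableMP4URLs(variants: VideoVariant[]): string[] {
    return variants
        .filter((variant) => variant.content_type === "video/mp4")
        .map((variant) => stripQuery(variant.url));
}
```

With the four variants shown above, this would return the three .mp4 URLs without ?tag=12 and leave out the .m3u8 entry.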

So, I updated XAccountController.indexTweetMedia to check if the media is a video, and if so to download the variant of the video with the highest bitrate (which is the best quality/largest file):

// Get the HTTPS URL of the media -- this works for photos
let mediaURL = media["media_url_https"];

// If it's a video, set mediaURL to the video variant with the highest bitrate
if (media["type"] === "video") {
    let highestBitrate = 0;
    if (media["video_info"] && media["video_info"]["variants"]) {
        media["video_info"]["variants"].forEach((variant: XAPILegacyTweetMediaVideoVariant) => {
            // Variants without a bitrate (like the m3u8 playlist) are skipped
            if (variant["bitrate"] && variant["bitrate"] > highestBitrate) {
                highestBitrate = variant["bitrate"];
                mediaURL = variant["url"];

                // Strip query parameters from the URL.
                // For some reason video variants end with `?tag=12`, and when we try downloading with that
                // it responds with 404.
                const queryIndex = mediaURL.indexOf("?");
                if (queryIndex > -1) {
                    mediaURL = mediaURL.substring(0, queryIndex);
                }
            }
        });
    }
}
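The selection step could also be pulled out into a small pure function, which makes it easy to test without a full tweet object. This is just a sketch of that idea, not what the PR does; VideoVariant and highestBitrateVariant are hypothetical names.

```typescript
// Illustrative shape of one entry in video_info.variants
interface VideoVariant {
    bitrate?: number;
    content_type: string;
    url: string;
}

// Return the variant with the largest bitrate, or undefined when no
// variant has a bitrate at all
function highestBitrateVariant(variants: VideoVariant[]): VideoVariant | undefined {
    let best: VideoVariant | undefined;
    for (const variant of variants) {
        // Skip entries with no bitrate, e.g. the m3u8 playlist
        if (variant.bitrate === undefined) {
            continue;
        }
        if (best === undefined || variant.bitrate > (best.bitrate ?? 0)) {
            best = variant;
        }
    }
    return best;
}
```

One difference from the truthiness check in the snippet above: this version also accepts a variant whose bitrate is exactly 0, since it only skips entries where bitrate is missing.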

And it works! I also added a test to src/account_x.test.ts that confirms it works. It parses the XAPIUserTweetsAndRepliesMedia.json file and then runs SQL queries to confirm that it successfully imported the tweet and the media, and that the media filename is correct. (I also tried stubbing fetch so that the tests won't actually download the media... I'm unsure whether that worked, or whether the tests are still downloading the media when I run them.)
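One way to tell whether a stub is really being used is to have it record every URL it's called with and then assert on that list in the test. Here's a sketch of that idea, assuming the download code can take its fetch implementation as a parameter, which this PR doesn't show. FetchLike, makeRecordingFetch and downloadMedia are all hypothetical names.

```typescript
// Minimal response shape the hypothetical downloader needs
interface FetchLikeResponse {
    ok: boolean;
    status: number;
}

type FetchLike = (url: string) => Promise<FetchLikeResponse>;

// Build a fake fetch that never touches the network and records each URL
function makeRecordingFetch(): { fetchStub: FetchLike; calls: string[] } {
    const calls: string[] = [];
    const fetchStub: FetchLike = async (url) => {
        calls.push(url);
        return { ok: true, status: 200 };
    };
    return { fetchStub, calls };
}

// Hypothetical downloader that receives its fetch implementation as an
// argument instead of reaching for the global one
async function downloadMedia(url: string, fetchImpl: FetchLike): Promise<boolean> {
    const response = await fetchImpl(url);
    return response.ok;
}
```

In a test, passing fetchStub to downloadMedia and then checking calls shows exactly which URLs were requested; if calls stays empty, the code under test is still using the real fetch.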

Also, I updated the XTweetMediaRow to include all of the fields that it uses now, and I realized that your migration had used the fields start_index and end_index while the convention for the rest of the database tables is to use camelCase, so I updated them to startIndex and endIndex -- which means you'll just need to delete the database and start over, because I didn't add a new migration for it or anything.

micahflee · 2025-02-16T18:34:30Z

I also added type definitions for URLs, wrote a test, and actually also found a bug, so now indexing URLs works: ca55615

SaptakS

Thanks for the detailed description! Looks good to me.

micahflee added 3 commits February 14, 2025 14:14

Add new types for tweet media, and support downloading videos

870f333

Add more fields to the XTweetMediaRow type, and update the migration to use camelCase for indexStart and indexEnd for consistency

e4c7a5c

Add a test that ensures tweets with videos and photos are indexed correctly

5f49465

micahflee requested a review from SaptakS February 14, 2025 23:01

Fix indexing URLs, and write test for it

ca55615

micahflee changed the title ~~Support for indexing video~~ Support for indexing video (and fix indexing URLs) Feb 16, 2025

SaptakS approved these changes Feb 17, 2025

SaptakS merged commit d53a26b into 354-index-media Feb 17, 2025

SaptakS deleted the 354-index-media-video branch February 17, 2025 18:22
