First party YouTube API client - with video info #1082

Closed
jon-betts wants to merge 16 commits

Conversation

@jon-betts (Contributor) commented Jul 13, 2023

For:

This is a modification of the approach taken by https://pypi.org/project/youtube-transcript-api/

The basic structure of that approach is still here (roughly sketched after this list):

  • Get the HTML page for the video (dealing with cookie popups)
  • Parse parts of the video config JSON
  • Marshal parts of the parsed JSON into structured info about the video centered on transcripts
  • Check other parts of the video config JSON as raw strings to detect error situations
  • Pick a transcript from those listed
  • Use the URL provided by YouTube to get the XML transcript
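
Roughly, that flow looks something like this. This is only an illustration, not the code in this PR: the function names are made up, cookie-consent handling is skipped, and it assumes the player config is embedded in the page as ytInitialPlayerResponse.

import json
import re

import requests


def fetch_player_config(video_id):
    # Get the HTML watch page for the video (cookie consent handling omitted)
    html = requests.get(
        "https://www.youtube.com/watch", params={"v": video_id}, timeout=10
    ).text

    # Find the embedded player config and parse the whole JSON blob
    match = re.search(
        r"ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;\s*(?:var\s|</script>)",
        html,
        re.DOTALL,
    )
    if not match:
        raise ValueError("Could not find the player config in the page")
    return json.loads(match.group(1))


def caption_tracks(config):
    # The list of available transcripts lives under the captions renderer
    return (
        config.get("captions", {})
        .get("playerCaptionsTracklistRenderer", {})
        .get("captionTracks", [])
    )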

How this approach differs:

  • We parse the whole JSON config instead of just parts
  • We marshal more information into structured info
  • We use that structured info to detect more error situations (which should be less error prone than string searching)
  • We make no assumptions about the uniqueness of language codes
  • We use a discovered vssId to identify a transcript instead of its language code (sketched below)
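
To make the "structured info" and vssId points concrete, here's a sketch of the sort of marshalling involved. The class and function names are invented for this example and aren't the ones in this PR.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaptionTrack:
    vss_id: str
    language_code: str
    name: str
    base_url: str
    is_translatable: bool


def parse_caption_tracks(config):
    # Marshal each raw captionTracks entry into a structured object
    tracks = (
        config.get("captions", {})
        .get("playerCaptionsTracklistRenderer", {})
        .get("captionTracks", [])
    )
    return [
        CaptionTrack(
            vss_id=track["vssId"],
            language_code=track["languageCode"],
            name=track.get("name", {}).get("simpleText", ""),
            base_url=track["baseUrl"],
            is_translatable=track.get("isTranslatable", False),
        )
        for track in tracks
    ]


def pick_track(tracks: List[CaptionTrack], vss_id: str) -> Optional[CaptionTrack]:
    # A vssId identifies a single track, so we don't rely on language codes being unique
    return next((track for track in tracks if track.vss_id == vss_id), None)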

Side effects:

  • While we are marshalling the transcript info, we can also gather lots of video info
  • This includes things like the name, thumbnails, description etc.

This is a demo! We don't want to use it like this!

At the time of writing this is integrated into the view which loads up a YouTube video, which means we are doing the lookup twice:

  • Once during the initial page load
  • Once during the transcript lookup

As it's currently written, even with caching for transcripts we'd continue to hit YouTube all the time.

We don't want this. I'm not proposing this. This is just a demo showing we can get all of the info.

As/when we get an official API we could replace the first lookup. However:

  • It's important to realise that parsing this info from scratch is how this approach works
  • The transcript id is not enough; we need a short-lived URL from YouTube to get it (see the sketch after this list)
  • The only extra parsing that's done here (for show) is the video details part
  • Even if we get an official thing which replicates a lot of this, we'll still need to do almost all of this anyway
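
To illustrate the short-lived URL point above: the transcript is fetched from the baseUrl that comes out of the freshly parsed config, so the transcript id on its own is never enough. Again this is a sketch, not the PR's code; track here is a raw captionTracks entry.

from xml.etree import ElementTree

import requests


def fetch_transcript(track):
    # track["baseUrl"] is the short-lived URL YouTube hands out in the config;
    # it can't be reconstructed from the transcript id alone
    xml = requests.get(track["baseUrl"], timeout=10).text

    # The transcript is XML: a series of <text start="..." dur="..."> elements
    return [
        {
            "text": element.text or "",
            "start": float(element.attrib["start"]),
            "duration": float(element.attrib.get("dur", "0")),
        }
        for element in ElementTree.fromstring(xml)
    ]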

Jon Betts added 5 commits July 13, 2023 17:27
This involves parsing the whole video JSON and consulting parsed data
rather than relying so heavily on checking for the presence of strings
in the text.
@jon-betts jon-betts added the wip, Spike and technical enabler (Work which only serves to enable other work) labels Jul 13, 2023
@jon-betts jon-betts self-assigned this Jul 13, 2023
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@jon-betts (Contributor, Author) commented:

I don't know at what point we've transformed this out of being a derivative work?

@jon-betts (Contributor, Author) commented:

Once we aren't scraping HTML we have to be pretty close...

associated with.
"""
escaped_id = quote_plus(video_id)
return f"https://www.youtube.com/watch?v={escaped_id}"
@jon-betts (Contributor, Author) commented:

Robbed from the other service. Maybe the YT parsing thing should be here too?

    short_description=data["shortDescription"],
    author=data["author"],
    thumbnails=data["thumbnail"]["thumbnails"],
)
@jon-betts (Contributor, Author) commented:

This part is the only part which isn't strictly necessary for transcripts. I put it here to show we can grab this at the same time easily.

We could use this type of info to give a very rich interface in the file picker.

@seanh (Contributor) commented Jul 14, 2023:

FYI we already have code for fetching these kinds of video details (name, thumbnail) from the official V3 API; the file picker in LMS already does this and presents a rich picker interface. But that's in LMS; it's separate from the code in Via that needs to get the transcripts themselves.

Via also needs to get the video title as well as the transcript, but that's quite easily done from the v3 API:

def get_video_title(self, video_id):
    """Call the YouTube API and return the title for the given video_id."""
    # https://developers.google.com/youtube/v3/docs/videos/list
    return self._http_service.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={
            "id": video_id,
            "key": self._api_key,
            "part": "snippet",
            "maxResults": "1",
        },
    ).json()["items"][0]["snippet"]["title"]

translation_languages=[
    {"code": language["languageCode"], "name": language["languageName"]}
    for language in data.get("translationLanguages", [])
],
@jon-betts (Contributor, Author) commented Jul 13, 2023:

We're not using it, but if:

  • A transcript is_translatable
  • And you have languages here
  • Then you can request that a transcript be translated by appending &tlang={language_code} to the end of the URL (sketched below)

This list of languages does not belong to any specific caption track.
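
For the record, a sketch of what requesting a translation would look like. This isn't code from this PR, and track here is a raw captionTracks entry rather than one of the parsed objects.

import requests


def fetch_translated_transcript(track, language_code):
    # Only tracks flagged as translatable accept the tlang parameter
    if not track.get("isTranslatable"):
        raise ValueError("This caption track can't be translated")

    # Append the requested language to the track's short-lived transcript URL
    url = f"{track['baseUrl']}&tlang={language_code}"
    return requests.get(url, timeout=10).text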



@dataclass
class VideoDetails:
@jon-betts (Contributor, Author) commented:

All of the names here are chosen to need minimal mapping from the YouTube JSON names. I followed that format very closely.

    title: str
    short_description: str
    author: str
    thumbnails: List[dict]
@jon-betts (Contributor, Author) commented:

There are other things in here which might be of interest to us. For example:

  • isPrivate - Might this be an indicator a user has messed up? We might want to catch that
  • isLiveContent - Doesn't seem annotatable?

@seanh (Contributor) commented Jul 20, 2023:

This is no longer needed: it's been replaced by a series of non-WIP PRs
