First party YouTube API client - with video info #1082

Closed
jon-betts wants to merge 16 commits

Conversation

@jon-betts (Contributor) commented Jul 13, 2023

For:

This is a modification of the approach taken by https://pypi.org/project/youtube-transcript-api/

The basic structure of that approach is still here (roughly sketched after this list):

  • Get the HTML page for the video (dealing with cookie popups)
  • Parse parts of the video config JSON
  • Marshal parts of the parsed JSON into structured info about the video centered on transcripts
  • Check other parts of the video config JSON as raw strings to detect error situations
  • Pick a transcript from those listed
  • Use the URL provided by YouTube to get the XML transcript
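
Roughly, that flow looks something like this. This is only an illustration, not the code in this PR: the function names are made up, cookie-consent handling is skipped, and it assumes the player config is embedded in the page as ytInitialPlayerResponse.

import json
import re

import requests


def fetch_player_config(video_id):
    # Get the HTML watch page for the video (cookie consent handling omitted)
    html = requests.get(
        "https://www.youtube.com/watch", params={"v": video_id}, timeout=10
    ).text

    # Find the embedded player config and parse the whole JSON blob
    match = re.search(
        r"ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;\s*(?:var\s|</script>)",
        html,
        re.DOTALL,
    )
    if not match:
        raise ValueError("Could not find the player config in the page")
    return json.loads(match.group(1))


def caption_tracks(config):
    # The list of available transcripts lives under the captions renderer
    return (
        config.get("captions", {})
        .get("playerCaptionsTracklistRenderer", {})
        .get("captionTracks", [])
    )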

How this approach differs:

  • We parse the whole JSON config instead of just parts
  • We marshal more information into structured info
  • We use that structured info to detect more error situations (which should be less error prone than string searching)
  • We make no assumptions about the uniqueness of language codes
  • We use a discovered vssId to identify a transcript instead of its language code (sketched below)
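
To make the "structured info" and vssId points concrete, here's a sketch of the sort of marshalling involved. The class and function names are invented for this example and aren't the ones in this PR.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaptionTrack:
    vss_id: str
    language_code: str
    name: str
    base_url: str
    is_translatable: bool


def parse_caption_tracks(config):
    # Marshal each raw captionTracks entry into a structured object
    tracks = (
        config.get("captions", {})
        .get("playerCaptionsTracklistRenderer", {})
        .get("captionTracks", [])
    )
    return [
        CaptionTrack(
            vss_id=track["vssId"],
            language_code=track["languageCode"],
            name=track.get("name", {}).get("simpleText", ""),
            base_url=track["baseUrl"],
            is_translatable=track.get("isTranslatable", False),
        )
        for track in tracks
    ]


def pick_track(tracks: List[CaptionTrack], vss_id: str) -> Optional[CaptionTrack]:
    # A vssId identifies a single track, so we don't rely on language codes being unique
    return next((track for track in tracks if track.vss_id == vss_id), None)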

Side effects:

  • While we are marshalling the transcript info, we can also gather lots of video info
  • This includes things like the name, thumbnails, description etc.

This is a demo! We don't want to use it like this!

At the time of writing this is integrated into the view which loads up a YouTube video, which means we are doing the lookup twice:

  • Once during the initial page load
  • Once during the transcript lookup

As it's currently written, even with caching for transcripts we'd continue to hit YouTube all the time.

We don't want this. I'm not proposing this. This is just a demo showing we can get all of the info.

As/when we get an official API we could replace the first lookup. However:

  • It's important to realise that parsing this info from scratch is how this approach works
  • The transcript id is not enough; we need a short-lived URL from YouTube to get it (see the sketch after this list)
  • The only extra parsing that's done here (for show) is the video details part
  • Even if we get an official thing which replicates a lot of this, we'll still need to do almost all of this anyway
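
To illustrate the short-lived URL point above: the transcript is fetched from the baseUrl that comes out of the freshly parsed config, so the transcript id on its own is never enough. Again this is a sketch, not the PR's code; track here is a raw captionTracks entry.

from xml.etree import ElementTree

import requests


def fetch_transcript(track):
    # track["baseUrl"] is the short-lived URL YouTube hands out in the config;
    # it can't be reconstructed from the transcript id alone
    xml = requests.get(track["baseUrl"], timeout=10).text

    # The transcript is XML: a series of <text start="..." dur="..."> elements
    return [
        {
            "text": element.text or "",
            "start": float(element.attrib["start"]),
            "duration": float(element.attrib.get("dur", "0")),
        }
        for element in ElementTree.fromstring(xml)
    ]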

Jon Betts added 5 commits July 13, 2023 17:27
This involves parsing the whole video JSON and consulting parsed data
rather than relying so heavily on checking for the presence of strings
in the text.
@jon-betts jon-betts added the wip, Spike and technical enabler (Work which only serves to enable other work) labels Jul 13, 2023
@jon-betts jon-betts self-assigned this Jul 13, 2023
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@jon-betts (Contributor, Author) commented:

I don't know at what point we've transformed this out of being a derivative work?

@jon-betts (Contributor, Author) commented:

Once we aren't scraping HTML we have to be pretty close...

associated with.
"""
escaped_id = quote_plus(video_id)
return f"https://www.youtube.com/watch?v={escaped_id}"
@jon-betts (Contributor, Author) commented:

Robbed from the other service. Maybe the YT parsing thing should be here too?

    short_description=data["shortDescription"],
    author=data["author"],
    thumbnails=data["thumbnail"]["thumbnails"],
)
@jon-betts (Contributor, Author) commented:

This part is the only part which isn't strictly necessary for transcripts. I put it here to show we can grab this at the same time easily.

We could use this type of info to give a very rich interface in the file picker.

@seanh (Contributor) commented Jul 14, 2023:

FYI we already have code for fetching these kinds of video details (name, thumbnail) from the official V3 API; the file picker in LMS already does this and presents a rich picker interface. But that's in LMS; it's separate from the code in Via that needs to get the transcripts themselves.

Via also needs to get the video title as well as the transcript, but that's quite easily done from the v3 API:

def get_video_title(self, video_id):
    """Call the YouTube API and return the title for the given video_id."""
    # https://developers.google.com/youtube/v3/docs/videos/list
    return self._http_service.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={
            "id": video_id,
            "key": self._api_key,
            "part": "snippet",
            "maxResults": "1",
        },
    ).json()["items"][0]["snippet"]["title"]

translation_languages=[
    {"code": language["languageCode"], "name": language["languageName"]}
    for language in data.get("translationLanguages", [])
],
@jon-betts (Contributor, Author) commented Jul 13, 2023:

We're not using it, but if:

  • A transcript is_translatable
  • And you have languages here
  • Then you can request that a transcript be translated by appending &tlang={language_code} to the end of the URL (sketched below)

This list of languages does not belong to any specific caption track.
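
For the record, a sketch of what requesting a translation would look like. This isn't code from this PR, and track here is a raw captionTracks entry rather than one of the parsed objects.

import requests


def fetch_translated_transcript(track, language_code):
    # Only tracks flagged as translatable accept the tlang parameter
    if not track.get("isTranslatable"):
        raise ValueError("This caption track can't be translated")

    # Append the requested language to the track's short-lived transcript URL
    url = f"{track['baseUrl']}&tlang={language_code}"
    return requests.get(url, timeout=10).text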



@dataclass
class VideoDetails:
@jon-betts (Contributor, Author) commented:

All of the names here are chosen to need minimal mapping from the YouTube JSON names. I followed that format very closely.

    title: str
    short_description: str
    author: str
    thumbnails: List[dict]
@jon-betts (Contributor, Author) commented:

There are other things in here which might be of interest to us. For example:

  • isPrivate - Might this be an indicator a user has messed up? We might want to catch that
  • isLiveContent - Doesn't seem annotatable?

@seanh (Contributor) commented Jul 20, 2023:

This is no longer needed: it's been replaced by a series of non-WIP PRs
