Youtube: only favicon gets extracted #67

Closed
trieloff opened this issue Jan 4, 2021 · 2 comments

trieloff commented Jan 4, 2021

YouTube changed its HTML a month ago, and since then our tests (adobe/helix-embed#345) have been failing when verifying the output for YouTube.

The underlying issue is a combination of unfurl making the reasonable assumption that all metadata is in the head, here:

unfurl/src/index.ts

Lines 270 to 273 in db57429

// We want to parse as little as possible so finish once we see </head>
if (tag === 'head') {
  parser.reset()
}

and YouTube being above convention, standards, and reason:

<!DOCTYPE html>
<html
  style="font-size: 10px; font-family: Roboto, Arial, sans-serif"
  lang="de-DE"
>
  <head>
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <link
      rel="shortcut icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon.ico"
      type="image/x-icon"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_32.png"
      sizes="32x32"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_48.png"
      sizes="48x48"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_96.png"
      sizes="96x96"
    />
    <link
      rel="icon"
      href="https://www.youtube.com/s/desktop/d743f786/img/favicon_144.png"
      sizes="144x144"
    />
    <link
      rel="stylesheet"
      href="//fonts.googleapis.com/css?family=Roboto:500,300,700,400"
      name="www-roboto"
    />
    <script name="www-roboto" nonce="26OMsP9eT4h+T5PS9iXDRQ">
      if (document.fonts && document.fonts.load) {
        document.fonts.load("400 10pt Roboto", "");
        document.fonts.load("500 10pt Roboto", "");
      }
    </script>
    <link
      rel="stylesheet"
      href="//fonts.googleapis.com/css?family=YT%20Sans%3A300%2C500%2C700"
      name="www-webfont-yt-sans"
    />
    <link rel="stylesheet" href="/s/player/5dd3f3b2/www-player.css" />
    <link
      rel="stylesheet"
      href="https://www.youtube.com/s/desktop/d743f786/cssbin/www-main-desktop-watch-page-skeleton.css"
    />
    <link
      rel="stylesheet"
      href="https://www.youtube.com/s/desktop/d743f786/cssbin/www-main-desktop-player-skeleton.css"
    />
    <link
      rel="stylesheet"
      href="https://www.youtube.com/s/desktop/d743f786/cssbin/www-onepick.css"
    />
    <meta name="theme-color" content="rgba(255, 255, 255, 0.98)" />
    <link
      rel="search"
      type="application/opensearchdescription+xml"
      href="https://www.youtube.com/opensearch?locale=de_DE"
      title="YouTube"
    />
    <link
      rel="manifest"
      href="/s/notifications/manifest/manifest.json"
      crossorigin="use-credentials"
    />
  </head> <!-- END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE END OF HEAD HERE  --->
  <body dir="ltr" no-y-overflow>
    <link
      rel="canonical"
      href="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      media="handheld"
      href="https://m.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      media="only screen and (max-width: 640px)"
      href="https://m.youtube.com/watch?v=ccYpEv4APec"
    /><title>
      Google Translate Sings: &quot;The Sound of Silence&quot; (Simon &amp;
      Garfunkel) - YouTube</title
    ><meta
      name="title"
      content='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><meta
      name="description"
      content="SUBSCRIBE: http://bit.ly/sub2MalindaCHECK OUT MY MUSIC CHANNEL: https://bit.ly/2GsRyrqPATREON: http://bit.ly/MKRsupportMERCH: http://shopmalinda.com/Follow m..."
    /><meta
      name="keywords"
      content="sound of silence, parody, google translate, google translate sings, disturbed, pentatonix, performance, the sound of silence, simon and garfunkel, translator fails, translation, fail, comedy, 1960s, paul simon, official video"
    /><link rel="shortlinkUrl" href="https://youtu.be/ccYpEv4APec" /><link
      rel="alternate"
      href="android-app://com.google.android.youtube/http/www.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      href="ios-app://544007664/vnd.youtube/www.youtube.com/watch?v=ccYpEv4APec"
    /><link
      rel="alternate"
      type="application/json+oembed"
      href="http://www.youtube.com/oembed?format=json&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DccYpEv4APec"
      title='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><link
      rel="alternate"
      type="text/xml+oembed"
      href="http://www.youtube.com/oembed?format=xml&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DccYpEv4APec"
      title='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><link
      rel="image_src"
      href="https://i.ytimg.com/vi/ccYpEv4APec/maxresdefault.jpg"
    /><meta property="og:site_name" content="YouTube" /><meta
      property="og:url"
      content="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><meta
      property="og:title"
      content='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><meta
      property="og:image"
      content="https://i.ytimg.com/vi/ccYpEv4APec/maxresdefault.jpg"
    /><meta property="og:image:width" content="1280" /><meta
      property="og:image:height"
      content="720"
    /><meta
      property="og:description"
      content="SUBSCRIBE: http://bit.ly/sub2MalindaCHECK OUT MY MUSIC CHANNEL: https://bit.ly/2GsRyrqPATREON: http://bit.ly/MKRsupportMERCH: http://shopmalinda.com/Follow m..."
    /><meta property="al:ios:app_store_id" content="544007664" /><meta
      property="al:ios:app_name"
      content="YouTube"
    /><meta
      property="al:ios:url"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta
      property="al:android:url"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta
      property="al:web:url"
      content="http://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta property="og:type" content="video.other" /><meta
      property="og:video:url"
      content="https://www.youtube.com/embed/ccYpEv4APec"
    /><meta
      property="og:video:secure_url"
      content="https://www.youtube.com/embed/ccYpEv4APec"
    /><meta property="og:video:type" content="text/html" /><meta
      property="og:video:width"
      content="1280"
    /><meta property="og:video:height" content="720" /><meta
      property="al:android:app_name"
      content="YouTube"
    /><meta
      property="al:android:package"
      content="com.google.android.youtube"
    /><meta property="og:video:tag" content="sound of silence" /><meta
      property="og:video:tag"
      content="parody"
    /><meta property="og:video:tag" content="google translate" /><meta
      property="og:video:tag"
      content="google translate sings"
    /><meta property="og:video:tag" content="disturbed" /><meta
      property="og:video:tag"
      content="pentatonix"
    /><meta property="og:video:tag" content="performance" /><meta
      property="og:video:tag"
      content="the sound of silence"
    /><meta property="og:video:tag" content="simon and garfunkel" /><meta
      property="og:video:tag"
      content="translator fails"
    /><meta property="og:video:tag" content="translation" /><meta
      property="og:video:tag"
      content="fail"
    /><meta property="og:video:tag" content="comedy" /><meta
      property="og:video:tag"
      content="1960s"
    /><meta property="og:video:tag" content="paul simon" /><meta
      property="og:video:tag"
      content="official video"
    /><meta property="fb:app_id" content="87741124305" /><meta
      name="twitter:card"
      content="player"
    /><meta name="twitter:site" content="@youtube" /><meta
      name="twitter:url"
      content="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><meta
      name="twitter:title"
      content='Google Translate Sings: "The Sound of Silence" (Simon &amp; Garfunkel)'
    /><meta
      name="twitter:description"
      content="SUBSCRIBE: http://bit.ly/sub2MalindaCHECK OUT MY MUSIC CHANNEL: https://bit.ly/2GsRyrqPATREON: http://bit.ly/MKRsupportMERCH: http://shopmalinda.com/Follow m..."
    /><meta
      name="twitter:image"
      content="https://i.ytimg.com/vi/ccYpEv4APec/maxresdefault.jpg"
    /><meta name="twitter:app:name:iphone" content="YouTube" /><meta
      name="twitter:app:id:iphone"
      content="544007664"
    /><meta name="twitter:app:name:ipad" content="YouTube" /><meta
      name="twitter:app:id:ipad"
      content="544007664"
    /><meta
      name="twitter:app:url:iphone"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta
      name="twitter:app:url:ipad"
      content="vnd.youtube://www.youtube.com/watch?v=ccYpEv4APec&amp;feature=applinks"
    /><meta name="twitter:app:name:googleplay" content="YouTube" /><meta
      name="twitter:app:id:googleplay"
      content="com.google.android.youtube"
    /><meta
      name="twitter:app:url:googleplay"
      content="https://www.youtube.com/watch?v=ccYpEv4APec"
    /><meta
      name="twitter:player"
      content="https://www.youtube.com/embed/ccYpEv4APec"
    /><meta name="twitter:player:width" content="1280" /><meta
      name="twitter:player:height"
      content="720"
    />

(HTML reformatted and all script and style tags removed)

As you can see, most of the interesting metadata (even the title) is outside the head.

I will submit a PR to address that.

trieloff added a commit to trieloff/unfurl that referenced this issue Jan 4, 2021
this is a test for metadata in the body as exposed by youtube since the december 2020 update

test for jacktuck#67
trieloff added a commit to trieloff/unfurl that referenced this issue Jan 4, 2021
this change defers the early termination of the parser only if a title tag has been found in the head of the html

fixes jacktuck#67
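
The behaviour this commit message describes (keep the early exit at `</head>` only when a title was already found there) can be sketched roughly as below. This is a minimal, self-contained illustration and not unfurl's actual code: `scanMeta`, `ScanResult`, and the regex-based scanner are hypothetical stand-ins for the streaming parser callbacks, and the sketch only handles `<title>`, double-quoted `<meta>` attributes, and `</head>`.

```typescript
// Hypothetical sketch: stop at </head> only when a <title> was already seen.
interface ScanResult {
  title?: string
  meta: Record<string, string>
  stoppedAtHead: boolean
}

function scanMeta(html: string): ScanResult {
  const result: ScanResult = { meta: {}, stoppedAtHead: false }
  // Matches <title>...</title>, <meta ...>, and </head>, in document order.
  const token = /<title[^>]*>([\s\S]*?)<\/title\s*>|<meta\b([^>]*)>|<\/head\s*>/gi
  let m: RegExpExecArray | null
  while ((m = token.exec(html)) !== null) {
    if (m[1] !== undefined) {
      // A <title> element, wherever it appears.
      result.title = m[1].trim()
    } else if (m[2] !== undefined) {
      // A <meta> tag: record it if it has a name/property and a content.
      const key = /(?:name|property)="([^"]*)"/i.exec(m[2])
      const content = /content="([^"]*)"/i.exec(m[2])
      if (key && content) result.meta[key[1]] = content[1]
    } else if (result.title !== undefined) {
      // </head> reached and a title is known: keep the early exit.
      result.stoppedAtHead = true
      break
    }
    // </head> without a title: fall through and keep scanning into <body>.
  }
  return result
}
```

With a page shaped like the YouTube sample above (no title in the head), the scan continues into the body and picks up `og:title`; with a conventional page, it still stops at `</head>`.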

github-actions bot commented Jan 5, 2021

🎉 This issue has been resolved in version 5.2.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

jacktuck commented Jan 5, 2021

If it turns out that the title is often in the head but other metadata is in the body, we could in the future just remove this optimisation altogether, or default to not having it and add an option flag for it.
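
One possible shape for such a flag is sketched below. Everything in it is an assumption made for illustration: the `stopAtHead` option, its three values, and the `shouldStopAtHead` helper are hypothetical and not part of unfurl's API. The default here is `'never'`, following the "default to not having it" idea.

```typescript
// Hypothetical option, not part of unfurl's API.
type StopAtHead = 'always' | 'ifTitleFound' | 'never'

interface ParseOptions {
  stopAtHead?: StopAtHead
}

// Decide, when a closing tag is seen, whether parsing may end early.
function shouldStopAtHead(
  closedTag: string,
  titleSeen: boolean,
  opts: ParseOptions = {}
): boolean {
  // Only </head> can trigger the early exit.
  if (closedTag.toLowerCase() !== 'head') return false
  // Assumed default: the optimisation is off unless requested.
  const mode: StopAtHead = opts.stopAtHead ?? 'never'
  if (mode === 'always') return true
  if (mode === 'never') return false
  // 'ifTitleFound': stop only when the head already yielded a title.
  return titleSeen
}
```

A parser's close-tag callback could call this helper and reset the parser when it returns `true`, which would keep the existing behaviour available as an opt-in.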
