Threw an error on ingestion #21

Closed
Truncated opened this issue May 8, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@Truncated

Truncated commented May 8, 2024

<Edited to focus on the relevant error, rather than the entire log contents which had slurps that were fine even if the title was missing>

1715205555277 | DEBUG | onValidate called
  • Caller: HTMLDivElement.<anonymous> (app://obsidian.md/app.js:1:2170951)
[
  {
    "enabled": true,
    "custom": false,
    "_key": "link",
    "_idx": 0,
    "id": "link",
    "metaFields": [
      "url",
      "og:url",
      "parsely-link",
      "twitter:url"
    ],
    "defaultIdx": 0,
    "defaultKey": "link",
    "description": "Page URL provided or a permalink discovered in metadata."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "byline",
    "_idx": 1,
    "id": "byline",
    "metaFields": [
      "author",
      "article:author",
      "parsely-author",
      "cXenseParse:author"
    ],
    "defaultIdx": 1,
    "defaultKey": "byline",
    "description": "Name of the primary author or the first author detected."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "site",
    "_idx": 2,
    "id": "siteName",
    "metaFields": [
      "og:site_name",
      "page.content.source",
      "application-name",
      "apple-mobile-web-app-title",
      "twitter:site"
    ],
    "defaultIdx": 2,
    "defaultKey": "site",
    "description": "Website or publication name."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "date",
    "_idx": 3,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "publishedTime",
    "metaFields": [
      "article:published_time",
      "parsely-pub-date",
      "datePublished",
      "article.published"
    ],
    "defaultIdx": 3,
    "defaultKey": "date",
    "description": "Date/time that the page was initially published.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "updated",
    "_idx": 4,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "modifiedTime",
    "metaFields": [
      "article:modified_time",
      "dateModified",
      "dateLastPubbed"
    ],
    "defaultIdx": 4,
    "defaultKey": "updated",
    "description": "Date/time that the page was last modified, if available.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "type",
    "_idx": 5,
    "id": "type",
    "metaFields": [
      "og:type",
      "parsely-type",
      "medium",
      "page.content.type"
    ],
    "defaultIdx": 5,
    "defaultKey": "type",
    "description": "Type of publication, eg: \"page\", \"post\", \"article\"."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "excerpt",
    "_idx": 6,
    "id": "excerpt",
    "metaFields": [
      "description",
      "og:description",
      "twitter:description"
    ],
    "defaultIdx": 6,
    "defaultKey": "excerpt",
    "description": "Often used for subtitles, excerpts, descriptions, and abstracts."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "twitter",
    "_idx": 7,
    "_format": "s|https://twitter.com/{s}",
    "id": "twitter",
    "metaFields": [
      "twitter:creator",
      "twitter:site"
    ],
    "defaultIdx": 7,
    "defaultKey": "twitter",
    "description": "Twitter/X link for the author or site.",
    "defaultFormat": "s|https://twitter.com/{s}"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "tags",
    "_idx": 8,
    "_format": "S|{prefix}/{tag}",
    "id": "tags",
    "metaFields": [
      "tags",
      "keywords",
      "article:tag",
      "parsely-tags",
      "news_keywords"
    ],
    "defaultIdx": 8,
    "defaultKey": "tags",
    "description": "Tags and keywords present in the page's metadata.",
    "defaultFormat": "S|{prefix}/{tag}"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "onion",
    "_idx": 9,
    "id": "onion",
    "metaFields": [
      "onion-location"
    ],
    "defaultIdx": 9,
    "defaultKey": "onion",
    "description": "Link to a mirror of the content on Tor."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "slurped",
    "_idx": 10,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "slurped",
    "defaultIdx": 10,
    "defaultKey": "slurped",
    "description": "Date/time that the page was accessed by Slurp.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "title",
    "_idx": 11,
    "id": "title",
    "metaFields": [
      "og:title",
      "twitter:title"
    ],
    "defaultIdx": 11,
    "defaultKey": "title",
    "description": "Page title as seen in the browser, falling back to the title presented in metadata."
  }
]
@inhumantsar
Owner

hey thanks for the report! looks like neither readability nor slurp were able to find a title for this page. i'll probably have to submit a patch upstream of slurp for this one.

can you share the URL? i'm not seeing it in the logs

inhumantsar added the bug label May 9, 2024
inhumantsar self-assigned this May 9, 2024
@Truncated
Author

These were from www.fastcompany.com.
Any link from that site does this. If I'm reading the log above correctly, it covers several different links, but I honestly didn't think to record the URLs alongside the auto-generated bug log. I will in the future (I've got a few more to submit).

@inhumantsar
Owner

The logs were quoting a product page. I looked it up and found this: https://sparksoftcorp.com/dev-sec-ops-delivery

The site doesn't have any meta tags or even a title tag so there's not much that Slurp can do on its own. Filenames are sourced from the title. I could set it up to just call it Untitled Page or something but this feels like a pretty rare edge case.

I will be adding more options to the Slurp New Note dialog soon though. That will be the best place to manually give it a title to use.
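
For reference, the "Untitled Page" fallback mentioned above could look roughly like this. It is only a minimal sketch: the helper name `noteFilenameFromTitle`, the placeholder string, and the set of characters being stripped are assumptions for illustration, not Slurp's actual code.

```typescript
// Hypothetical helper: derive a note filename from a page title,
// falling back to a placeholder when the page has no usable title.
const FALLBACK_TITLE = "Untitled Page"; // assumed placeholder name

function noteFilenameFromTitle(title: string | null | undefined): string {
  // Replace characters that are commonly unsafe in filenames or note links,
  // then collapse runs of whitespace.
  const cleaned = (title ?? "")
    .replace(/[\\\/:*?"<>|#^\[\]]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  // An empty or whitespace-only title falls back to the placeholder.
  return `${cleaned.length > 0 ? cleaned : FALLBACK_TITLE}.md`;
}
```

With something like this, a page with no title tag and no title metadata would still produce a note instead of an error, although every such page would end up with the same filename unless a suffix is added.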

@Truncated
Author

Truncated commented May 10, 2024

That's a red herring - the sparksoft pages were ones I had ingested earlier; yes, there wasn't much to pull, but I was most concerned with the text and didn't care about the metadata.

It's the links from the Fast Company site that throw the error. The log output in settings didn't give me a reliable way to isolate just the error message, so you got both ingestions.

Literally any link from Fastcompany.com throws an error. Here's a clean example from https://www.fastcompany.com/91122708/heres-how-california-state-agencies-plan-use-generative-ai

1715349697499 | DEBUG | onValidate called
  • Caller: HTMLDivElement.<anonymous> (app://obsidian.md/app.js:1:2170951)
[
  {
    "enabled": true,
    "custom": false,
    "_key": "Source",
    "_idx": 0,
    "id": "link",
    "metaFields": [
      "url",
      "og:url",
      "parsely-link",
      "twitter:url"
    ],
    "defaultIdx": 0,
    "defaultKey": "link",
    "description": "Page URL provided or a permalink discovered in metadata."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "byline",
    "_idx": 1,
    "id": "byline",
    "metaFields": [
      "author",
      "article:author",
      "parsely-author",
      "cXenseParse:author"
    ],
    "defaultIdx": 1,
    "defaultKey": "byline",
    "description": "Name of the primary author or the first author detected."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "site",
    "_idx": 2,
    "id": "siteName",
    "metaFields": [
      "og:site_name",
      "page.content.source",
      "application-name",
      "apple-mobile-web-app-title",
      "twitter:site"
    ],
    "defaultIdx": 2,
    "defaultKey": "site",
    "description": "Website or publication name."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "date",
    "_idx": 3,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "publishedTime",
    "metaFields": [
      "article:published_time",
      "parsely-pub-date",
      "datePublished",
      "article.published"
    ],
    "defaultIdx": 3,
    "defaultKey": "date",
    "description": "Date/time that the page was initially published.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "updated",
    "_idx": 4,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "modifiedTime",
    "metaFields": [
      "article:modified_time",
      "dateModified",
      "dateLastPubbed"
    ],
    "defaultIdx": 4,
    "defaultKey": "updated",
    "description": "Date/time that the page was last modified, if available.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "type",
    "_idx": 5,
    "id": "type",
    "metaFields": [
      "og:type",
      "parsely-type",
      "medium",
      "page.content.type"
    ],
    "defaultIdx": 5,
    "defaultKey": "type",
    "description": "Type of publication, eg: \"page\", \"post\", \"article\"."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "excerpt",
    "_idx": 6,
    "id": "excerpt",
    "metaFields": [
      "description",
      "og:description",
      "twitter:description"
    ],
    "defaultIdx": 6,
    "defaultKey": "excerpt",
    "description": "Often used for subtitles, excerpts, descriptions, and abstracts."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "twitter",
    "_idx": 7,
    "_format": "s|https://twitter.com/{s}",
    "id": "twitter",
    "metaFields": [
      "twitter:creator",
      "twitter:site"
    ],
    "defaultIdx": 7,
    "defaultKey": "twitter",
    "description": "Twitter/X link for the author or site.",
    "defaultFormat": "s|https://twitter.com/{s}"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "tags",
    "_idx": 8,
    "_format": "S|{prefix}/{tag}",
    "id": "tags",
    "metaFields": [
      "tags",
      "keywords",
      "article:tag",
      "parsely-tags",
      "news_keywords"
    ],
    "defaultIdx": 8,
    "defaultKey": "tags",
    "description": "Tags and keywords present in the page's metadata.",
    "defaultFormat": "S|{prefix}/{tag}"
  },
  {
    "enabled": false,
    "custom": false,
    "_key": "onion",
    "_idx": 9,
    "id": "onion",
    "metaFields": [
      "onion-location"
    ],
    "defaultIdx": 9,
    "defaultKey": "onion",
    "description": "Link to a mirror of the content on Tor."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "slurped",
    "_idx": 10,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "slurped",
    "defaultIdx": 10,
    "defaultKey": "slurped",
    "description": "Date/time that the page was accessed by Slurp.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "title",
    "_idx": 11,
    "id": "title",
    "metaFields": [
      "og:title",
      "twitter:title"
    ],
    "defaultIdx": 11,
    "defaultKey": "title",
    "description": "Page title as seen in the browser, falling back to the title presented in metadata."
  }
]

@inhumantsar
Owner

ah ok, yeah the error message slurp displays says that it got a 403 back from fast company, so I'm guessing that they block non-browsers from accessing their pages. I'll have a look but there's likely not much we can do about that
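
a rough sketch of how a 403 like this could be turned into a clearer message for the user. the function name `explainFetchFailure` and the message wording are made up for illustration; this is not how slurp reports errors today.

```typescript
// Hypothetical helper: map an HTTP status code to a user-facing explanation.
// Returns null when the response was a success (2xx).
function explainFetchFailure(status: number, url: string): string | null {
  if (status >= 200 && status < 300) return null;
  if (status === 401 || status === 403) {
    // 401/403 mean the server understood the request but refused it,
    // which is what sites that block non-browser clients typically return.
    return `${url} refused the request (HTTP ${status}). The site may block non-browser clients.`;
  }
  if (status === 404) return `${url} was not found (HTTP 404).`;
  return `${url} could not be fetched (HTTP ${status}).`;
}
```

the point is just to separate "the site refused us" from "the page doesn't exist" so the notice tells the user whether retrying is worthwhile.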

inhumantsar added a commit that referenced this issue May 11, 2024
- fix: refactor new note modal, add validation (#21)
- fix: remove broken github link and useless log refresh button (#22)
- fix: avoid saving settings if no changes are detected
@inhumantsar
Owner

fast company does seem to block application access entirely, so i've added a validation step to new note creation which will complain if a fast company link is used. did the same for that product site too.

let me know if you find any other sites which just refuse to be slurped!
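
for anyone curious, a validation step of this kind can be sketched as below. this is an illustration only: `validateSlurpUrl`, `BLOCKED_HOSTS`, and the result shape are assumed names, and the actual check in the commit may work differently. the two hostnames are the ones discussed in this thread.

```typescript
// Assumed blocklist for illustration; hostnames taken from this thread.
const BLOCKED_HOSTS = ["fastcompany.com", "sparksoftcorp.com"];

type ValidationResult = { ok: true } | { ok: false; reason: string };

// Hypothetical validator: rejects malformed URLs, non-HTTP(S) schemes,
// and any host on the blocklist (including its subdomains).
function validateSlurpUrl(
  input: string,
  blockedHosts: string[] = BLOCKED_HOSTS
): ValidationResult {
  let parsed: URL;
  try {
    parsed = new URL(input.trim());
  } catch {
    return { ok: false, reason: "Not a valid URL." };
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return { ok: false, reason: "Only http and https links are supported." };
  }
  const host = parsed.hostname.toLowerCase();
  // Match the bare domain or any subdomain, e.g. www.fastcompany.com.
  const blocked = blockedHosts.find((b) => host === b || host.endsWith(`.${b}`));
  return blocked
    ? { ok: false, reason: `${blocked} cannot currently be slurped.` }
    : { ok: true };
}
```

matching on the hostname suffix rather than the full string means `www.fastcompany.com` is caught while an unrelated domain such as `notfastcompany.com` is not.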

2 participants