
(Also) parsing structured data while you're at it #2

Closed
westurner opened this issue Feb 28, 2019 · 5 comments
@westurner

One might as well extract structured data from each element of such a dataset.

Linked data.
https://5stardata.info/

Useful features:

... from chiphuyen/lazynlp#1

@westurner
Author

https://github.com/scrapinghub/extruct

extruct is a library for extracting embedded metadata from HTML markup.

It also has a built-in HTTP server to test its output as JSON.

Currently, extruct supports:

  • W3C's HTML Microdata
  • embedded JSON-LD
  • Microformats via mf2py
  • Facebook's Open Graph
  • (experimental) RDFa via rdflib
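To make the "embedded metadata" idea concrete, here is a minimal stdlib-only sketch of the JSON-LD case: pulling `<script type="application/ld+json">` blocks out of an HTML page with Python's `html.parser`. This is an illustration only, not extruct's implementation — extruct itself handles all of the syntaxes listed above (roughly via `extruct.extract(html)`), and the `JSONLDExtractor` class name and sample markup here are invented for the example.

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Enter JSON-LD mode only for scripts with the right MIME type.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        # Script bodies arrive as text data; parse non-empty ones as JSON.
        if self._in_jsonld and data.strip():
            self.items.append(json.loads(data))

# Hypothetical sample page with one embedded JSON-LD item.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Example"}
</script>
</head><body></body></html>
"""

parser = JSONLDExtractor()
parser.feed(html)
print(parser.items[0]["@type"])  # prints "Article"
```

A real crawler would run an extractor like this (or extruct, which also covers Microdata, Open Graph, Microformats, and RDFa) over each fetched page, alongside the text-cleaning pass.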

@jcpeterson
Owner

@westurner can you say a bit more about the motivations/applications here?

@westurner
Author

How useful is a trained synthetic language model (with 'transformers' in this case) without reading comprehension?

I think maybe people are expecting more out of this approach (from Google, OpenAI) to NLP than _______.

Can these models learn and do reasoning and inference (and synthesize new ideas rather than rehash old ones) from lots of noisy information? If so, extracting reusable, shareable structured data for more energy-efficient narrow ML applications is a most useful task.

More structured data from all of that noise would be great; might it be more efficient to extract structured data from HTML that's already paged into RAM, rather than in a separate pass?

Perhaps an ironic gesture of opportunism

Repository owner deleted a comment from westurner Feb 28, 2019
@jcpeterson
Owner

Deleted the duplicate reply. I think this goes beyond the scope of a replication but might be something to look into after the main goals have been reached. Otherwise, a fork is also an option.

@westurner
Author

I had thought that this thread held a reference to state-of-the-art language comprehension metrics, but I was wrong. I don't remember where that is, though this is a fantabulous resource regarding the topic: "Better Language Models and Their Implications" https://news.ycombinator.com/item?id=19163522

"Oooh, I neeed this"
