
(Also) parsing structured data while you're at it #2

Closed
westurner opened this issue Feb 28, 2019 · 5 comments
@westurner

One might as well extract structured data from each element of such a dataset.

Linked data.
https://5stardata.info/

Useful features:

... from chiphuyen/lazynlp#1

@westurner
Author

https://github.com/scrapinghub/extruct

extruct is a library for extracting embedded metadata from HTML markup.

It also has a built-in HTTP server to test its output as JSON.

Currently, extruct supports:

  • W3C's HTML Microdata
  • embedded JSON-LD
  • Microformats via mf2py
  • Facebook's Open Graph
  • (experimental) RDFa via rdflib
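To make the "embedded metadata" idea concrete, here is a minimal stdlib-only sketch of the JSON-LD case: pulling `<script type="application/ld+json">` blocks out of an HTML page with Python's `html.parser`. This is an illustration only, not extruct's implementation — extruct itself handles all of the syntaxes listed above (roughly via `extruct.extract(html)`), and the `JSONLDExtractor` class name and sample markup here are invented for the example.

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Enter JSON-LD mode only for scripts with the right MIME type.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        # Script bodies arrive as text data; parse non-empty ones as JSON.
        if self._in_jsonld and data.strip():
            self.items.append(json.loads(data))

# Hypothetical sample page with one embedded JSON-LD item.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Example"}
</script>
</head><body></body></html>
"""

parser = JSONLDExtractor()
parser.feed(html)
print(parser.items[0]["@type"])  # prints "Article"
```

A real crawler would run an extractor like this (or extruct, which also covers Microdata, Open Graph, Microformats, and RDFa) over each fetched page, alongside the text-cleaning pass.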

@jcpeterson
Owner

@westurner can you say a bit more about the motivations/applications here?

@westurner
Author

How useful is a trained synthetic language model (with 'transformers' in this case) without reading comprehension?

I think maybe people are expecting more out of this approach (from Google, OpenAI) to NLP than _______.

Can these models learn and do reasoning and inference (and synthesize new ideas rather than rehash old ones) from lots of noisy information? If so, extracting reusable, shareable structured data for more energy-efficient narrow ML applications is a most useful task.

More structured data from all of that noise would be great; might it be more efficient to extract structured data from HTML that's already paged into RAM, rather than in a separate pass?

Perhaps an ironic gesture of opportunism

Repository owner deleted a comment from westurner Feb 28, 2019
@jcpeterson
Owner

Deleted the duplicate reply. I think this goes beyond the scope of a replication but might be something to look into after the main goals have been reached. Otherwise, a fork is also an option.

@westurner
Author

I had thought that this thread held a reference to state-of-the-art language comprehension metrics, but I was wrong. I don't remember where that is, though this is a fantabulous resource regarding the topic: "Better Language Models and Their Implications" https://news.ycombinator.com/item?id=19163522

"Oooh, I neeed this"
