arXiv #1
To be more precise, s/arXiv dataset/CC arXiv dataset/.
Anything that is !pdf (well there is
And e.g. https://blog.lateral.io/2015/07/harvesting-research-arxiv/ uses the abstract dataset to create a recommender system, though I find arXiv's own search results far more relevant (test case: piron+lattice). (to be continued...)
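A minimal sketch of the kind of abstract-based recommender mentioned above, using plain TF-IDF and cosine similarity (the toy "abstracts", tokenization, and function names here are illustrative assumptions, not the linked blog's actual method):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy "abstracts": the first two share vocabulary, the third does not.
abstracts = [
    "quantum lattice models and piron lattices".split(),
    "orthomodular lattice structures in quantum logic".split(),
    "deep learning for image segmentation".split(),
]
vecs = tfidf_vectors(abstracts)
# The two quantum/lattice abstracts should score as more similar to each
# other than either does to the deep-learning one.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

A real system would swap in a proper tokenizer and a library vectorizer, but the recommend-by-abstract-similarity idea is the same.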
I wonder if you can use statistical models on such 'raw' data. Established scientific terms are definitely more regular than terms in the rest of the language [1]. There are definition environments in maths papers that can be extracted, but they often define localized notions rather than global terms for a dictionary.
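Extracting those definition environments is mechanically straightforward when they use the standard `\begin{definition}...\end{definition}` form; a rough sketch (the sample TeX and function name are made up for illustration, and a real extractor would need to handle nested environments and macro-renamed theorem styles):

```python
import re

# Hypothetical LaTeX source containing definition environments.
tex = r"""
\begin{definition}[Piron lattice]
A complete orthomodular lattice satisfying the covering law.
\end{definition}
Some surrounding prose.
\begin{definition}
A term defined without a title.
\end{definition}
"""

DEF_RE = re.compile(
    r"\\begin\{definition\}(?:\[(?P<title>[^\]]*)\])?"
    r"(?P<body>.*?)\\end\{definition\}",
    re.DOTALL,
)

def extract_definitions(source):
    """Return (title, body) pairs for each definition environment found."""
    return [(m.group("title"), m.group("body").strip())
            for m in DEF_RE.finditer(source)]

for title, body in extract_definitions(tex):
    print(title, "->", body[:40])
```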
THIS. [1] Putnam suggested a "division of linguistic labor" (done by scientists from each field) to define meaning http://libgen.io/get.php?md5=C931B36BCE8C21DA613AC02C40F634DC (what to do with this type of link?)
It's already terse, but the tl;dr is whether to approach the data using ML or GOFAI.
AWESOME! 👍
@rht glad to see I'm not the only one who's been thinking about this :)
See http://dlmf.nist.gov/LaTeXML and the (now defunct, I suspect) https://trac.kwarc.info/arXMLiv
That would be cool. I also want to port https://github.com/davidar/bib to IPFS, which currently uses a half-baked content-addressable-storage for fulltext.
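The "content-addressable storage" idea mentioned above boils down to keying each blob by a hash of its bytes; a minimal in-memory sketch (the class and method names are hypothetical, and the real `bib` project and IPFS each have their own formats):

```python
import hashlib

class ContentStore:
    """Minimal in-memory content-addressable store: each blob is keyed
    by the SHA-256 of its bytes, so identical fulltexts deduplicate."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = ContentStore()
key = store.put(b"example fulltext")
assert store.get(key) == b"example fulltext"
# Storing the same bytes twice yields the same address (deduplication).
assert store.put(b"example fulltext") == key
```

Porting to IPFS would mean replacing the dict with IPFS blocks, whose addresses are likewise derived from content hashes.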
Yeah, it wouldn't be trivial, but it's a field that has been studied, e.g.:
That would be nice (see http://www.texmacs.org/joris/semedit/semedit.html ), but probably not going to happen on a large scale.
My area is probabilistic (Bayesian) machine learning, but simpler approaches may well be Good Enough for some of this.
- tex -> html:
- tex -> xml (for parsing):

(to be continued...)
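For the tex -> xml -> html route, LaTeXML splits the work into a conversion step and a post-processing step. A sketch of driving that pipeline from Python (the filenames are placeholders, and the exact flags may vary by LaTeXML version, so treat the command lines as an assumption to check against `latexml --help`):

```python
import subprocess

# Hypothetical two-step LaTeXML pipeline for a paper "paper.tex":
# latexml expands the TeX into LaTeXML's XML schema, then latexmlpost
# converts that XML into HTML5 (with MathML for the mathematics).
latexml_cmd = ["latexml", "--destination=paper.xml", "paper.tex"]
latexmlpost_cmd = ["latexmlpost", "--format=html5",
                   "--destination=paper.html", "paper.xml"]

def run_pipeline(dry_run=True):
    """Run the tex -> xml -> html pipeline; dry_run just returns the commands."""
    if dry_run:
        return [latexml_cmd, latexmlpost_cmd]
    for cmd in (latexml_cmd, latexmlpost_cmd):
        subprocess.run(cmd, check=True)

print(run_pipeline())
```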
Keep in mind that, once you take care of the top 7 macros in that list, the remaining macros are used in less than 1% of papers, so it's not too bad. Even as it is, it's still less than 3%.
@rht said:
Which foundation is that? I happen to be biased on the subject, but when it comes to dealing with arXiv, pandoc's coverage can only be described as "basic". When it comes to evaluating TeX programs (which LaTeX papers are), it's pandoc's TeX reader that you could qualify as "Haskell scripts", while LaTeXML has a fully fleshed out implementation of a TeX engine as its foundation. Pandoc indeed has a very powerful model that allows connecting readers and writers of concrete syntaxes via an internal abstract syntax. In fact, if that abstract model evolves to meet the coverage of LaTeXML's XML schema, it could be a wonderful post-processor for the LaTeXML ecosystem. And vice versa, LaTeXML could be a wonderful "TeX reader" for pandoc. For me personally it would be quite interesting to see the two projects interoperate, as they focus on, and excel at, different problems.
@dginev what needs to be added to pandoc's abstract model to support LaTeXML? |
Keep in mind I am not an expert in the Pandoc model. But I have seen the occasional comment suggesting that Pandoc wants to "stay pure" and be restricted in coverage in certain respects. To quote one closed issue:
That is a very noble sentiment, and LaTeXML mostly shares it, but remains open to eventually covering all of TeX's pathological typesetting acrobatics. Here is a relatively simple example of what I have in mind. The restrictions pandoc imposes on itself keep it quite elegant and make tying in different syntaxes manageable, but they also limit the depth of the support. In order to meaningfully handle arXiv, compromising on elegance in order to gain enough expressivity is a necessity; otherwise you end up losing half (or more) of the content. But even when it comes to document structure, I am unsure how far pandoc has gone in supporting the "advanced" structures out there: indexes, glossaries, bibliographies, wacky constructions such as inline blocks, math in TeX graphics (e.g. TikZ images that convert to SVG+MathML), etc. In this respect I find tex4ht to be a much more suitable comparison for LaTeXML, and both dedicated TeX converters strive to get better at covering the full spectrum of structure and style that TeX allows for, and that authors use on a regular basis.
On a more pragmatic note, I am sure a LaTeXML-reader integration for pandoc is already possible today, by just mapping whatever structure currently overlaps. It may actually be a rather nice and simple project: it's just a matter of matching an XML DOM in LaTeXML's schema to pandoc's internal data structures. I will think about spending a few hours on that some weekend.
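A toy illustration of that mapping idea, walking a LaTeXML-style XML fragment and emitting Pandoc JSON AST blocks (the element names here are simplified stand-ins for LaTeXML's actual schema, and only `Header` and `Para` nodes are handled; a real reader would cover inlines, math, attributes, and far more):

```python
import json
import xml.etree.ElementTree as ET

# A fragment in the spirit of LaTeXML's schema (element names illustrative).
latexml_xml = """
<document>
  <section>
    <title>Introduction</title>
    <para>Hello arXiv.</para>
    <para>Second paragraph.</para>
  </section>
</document>
"""

def to_pandoc(node):
    """Map a few LaTeXML-style elements onto Pandoc JSON AST blocks."""
    blocks = []
    for child in node:
        if child.tag == "section":
            blocks.extend(to_pandoc(child))
        elif child.tag == "title":
            # Pandoc Header: [level, [id, classes, key-vals], inlines]
            blocks.append({"t": "Header",
                           "c": [1, ["", [], []],
                                 [{"t": "Str", "c": child.text}]]})
        elif child.tag == "para":
            blocks.append({"t": "Para",
                           "c": [{"t": "Str", "c": child.text}]})
    return blocks

doc = ET.fromstring(latexml_xml)
blocks = to_pandoc(doc)
print(json.dumps(blocks, indent=2))
```

Wrapped in pandoc's top-level JSON envelope, output like this can be piped into `pandoc -f json` to reach any of its writers.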
@dginev It would be really cool to have a "universal" document model, but as you say, it's a tricky problem. I've spent a little bit of time thinking about it in the past, but can't say I came up with any solutions :(
Awesome, let me know how you go :) I can help with things on the Haskell side, if necessary. I'll leave the Perl stuff to you though :p
Now that the first Creative Commons complete arXiv dataset has been published (ipfs-inactive/archives#2), it's time to build some cool apps on top of it!
Some ideas (please add more in the comments):
CC: @jbenet @rht