
arXiv #1

Open
davidar opened this issue Sep 30, 2015 · 13 comments

davidar (Member) commented Sep 30, 2015

Now that the first Creative Commons complete arXiv dataset has been published (ipfs-inactive/archives#2), it's time to build some cool apps on top of it!

Some ideas (please add more in the comments):

  • search engine for LaTeX equations, like http://latexsearch.com/
  • publishing papers in alternative formats (html, etc)
  • building a citation graph
  • training a topic model (latent Dirichlet allocation / hierarchical Dirichlet processes); a rough sketch follows this list
  • automatically extracting definitions to build a dictionary of terminology
  • semantic markup of math (connecting variables in equations to their textual descriptions, disambiguating notation, etc)
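
To make the topic-model item a little more concrete, here is a minimal sketch. It assumes the abstracts have already been pulled out of the dataset into a plain-text file with one abstract per line (the file name and layout are assumptions) and that gensim is installed:

```python
# Minimal LDA sketch over arXiv abstracts, assuming one abstract per line.
from gensim import corpora, models

with open("arxiv_abstracts.txt") as f:
    docs = [line.lower().split() for line in f if line.strip()]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop rare/ubiquitous tokens
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100, passes=3)
for topic_id, words in lda.print_topics(num_topics=10):
    print(topic_id, words)
```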

CC: @jbenet @rht

jbenet (Member) commented Sep 30, 2015

rht commented Sep 30, 2015

To be more precise, s/arXiv dataset/CC arXiv dataset/.

publishing papers in alternative formats (html, etc)

Anything that is !pdf (well there is \usepackage{hyperref}, but extracting semantic data is a pain) and machine parseable.

building a citation graph

And e.g. apt-get install /doi/$doi --depth 2, citation ("vendor/") auto-suggest.
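
A rough sketch of the shape that could take, purely for illustration; fetch_paper and fetch_references below are hypothetical placeholders for whatever resolver ends up backing /doi/, not real APIs:

```python
# Hypothetical sketch of "apt-get install /doi/$doi --depth 2": fetch a paper
# and, recursively, everything it cites up to a fixed depth.
def fetch_paper(doi):
    print("fetching", doi)  # placeholder: would download/pin the paper itself

def fetch_references(doi):
    return []  # placeholder: would return the DOIs this paper cites

def install(doi, depth=2, seen=None):
    seen = set() if seen is None else seen
    if depth < 0 or doi in seen:
        return
    seen.add(doi)
    fetch_paper(doi)
    for cited in fetch_references(doi):
        install(cited, depth - 1, seen)
```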

training a topic model

https://blog.lateral.io/2015/07/harvesting-research-arxiv/ uses the abstract dataset to create a recommender system. Though I find arXiv's own search results to be far more relevant (test case: piron+lattice).
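
In the same spirit, a minimal abstract-based recommender can be sketched with TF-IDF vectors and cosine similarity (scikit-learn; the input file name and layout are assumptions, and this is not what lateral.io does internally):

```python
# Nearest-neighbour recommendations over abstracts: TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

with open("arxiv_abstracts.txt") as f:
    abstracts = [line.strip() for line in f if line.strip()]

tfidf = TfidfVectorizer(stop_words="english", max_features=50000)
X = tfidf.fit_transform(abstracts)

query = 0  # index of the paper to recommend for
scores = cosine_similarity(X[query], X).ravel()
for i in scores.argsort()[::-1][1:6]:  # top 5, skipping the paper itself
    print(round(scores[i], 3), abstracts[i][:80])
```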

(to be continued...)

rht commented Sep 30, 2015

automatically extracting definitions to build a dictionary of terminology

I wonder if you can use statistical models on such 'raw' data.
Even though terms in scientific papers are expected to be more 'regular' than those of literature, demarcating 'real' papers is not something a mere mortal can do, never mind machines, e.g. arXiv vs snarXiv. This is made worse when the authors tend to sound like the latter (e.g. http://arxiv.org/abs/hep-th/0003075, or a report on a certain Greek island).

Established scientific terms are definitely more regular (more regular than terms in the rest of language) [1].

There are definition environments in maths papers that can be extracted, but they usually introduce localized notions rather than the global terms a dictionary needs.
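
Mechanically, pulling those environments out of the LaTeX source is easy enough; a rough sketch for the plain \begin{definition} case (papers that set up custom names via \newtheorem would need extra handling, and the file name is hypothetical):

```python
# Rough sketch: extract \begin{definition}...\end{definition} blocks from LaTeX source.
import re

DEF_RE = re.compile(r"\\begin\{definition\}(.*?)\\end\{definition\}", re.DOTALL)

with open("paper.tex") as f:
    source = f.read()

for i, body in enumerate(DEF_RE.findall(source), 1):
    print(f"--- definition {i} ---")
    print(body.strip())
```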

semantic markup of math (connecting variables in equations to their textual descriptions, disambiguating notation, etc)

THIS.
(On disambiguating notation: using NLP to reverse-engineer the semantics back out of the source is tough. Another path is to tell people to actually use unambiguous notation, as in e.g. "Calculus on Manifolds", SICM, "Functional Differential Geometry" [2].)

[1] Putnam suggested a "division of linguistic labor" (done by scientists from each field) to define meaning: http://libgen.io/get.php?md5=C931B36BCE8C21DA613AC02C40F634DC (what to do with this type of link?)
[2] http://mitpress.mit.edu/sites/default/files/titles/content/sicm/book-Z-H-79.html#%_chap_8

rht commented Sep 30, 2015

It's already terse, but the tl;dr is whether to approach the data using ML or GOFAI.

zignig commented Oct 1, 2015

AWESOME! 👍

davidar (Member Author) commented Oct 1, 2015

@rht glad to see I'm not the only one who's been thinking about this :)

Anything that is !pdf (well there is \usepackage{hyperref}, but extracting semantic data is a pain) and machine parseable.

See http://dlmf.nist.gov/LaTeXML and the (now defunct I suspect) https://trac.kwarc.info/arXMLiv

And e.g. apt-get install /doi/$doi --depth 2, citation ("vendor/") auto-suggest.

That would be cool. I also want to port https://github.com/davidar/bib to IPFS; it currently uses a half-baked content-addressable store for fulltext.
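
The content-addressable part is conceptually just this (a generic sketch, not the actual bib code): name each fulltext file by the hash of its bytes, which is essentially what IPFS gives you natively:

```python
# Generic content-addressable store for fulltexts: file name = SHA-256 of its bytes.
import hashlib, pathlib, shutil

STORE = pathlib.Path("fulltext-store")

def add(path):
    data = pathlib.Path(path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    STORE.mkdir(exist_ok=True)
    dest = STORE / digest
    if not dest.exists():
        shutil.copyfile(path, dest)
    return digest  # the content address to record in the bibliography entry
```

With IPFS the digest would simply be replaced by the hash that `ipfs add` reports.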

automatically extracting definitions to build a dictionary of terminology

I wonder if you can use statistical models on such 'raw' data.
[...] on disambiguating notation: using NLP to reverse-engineer [...]

Yeah, it wouldn't be trivial, but it's a field that has been studied, e.g.:

Another path is to tell people to actually use unambiguous notations

That would be nice (see http://www.texmacs.org/joris/semedit/semedit.html), but it's probably not going to happen at a large scale.

tl;dr is whether to approach the data using ML or GOFAI.

My area is probabilistic (Bayesian) machine learning, but simpler approaches may well be Good Enough for some of this.

rht commented Oct 1, 2015

tex -> html:
In the past, people have been using tex4ht / tex2page.

tex -> xml (for parsing):
Someone has to do it eventually...
Pandoc (the "LLVM" of markup languages) currently has less LaTeX coverage than LaTeXML, but it has a better foundation (especially compared with Perl scripts) and it connects to the other markup languages.
Looking at http://arxmliv.kwarc.info/top_macros.php, I think this task is about the scale of vim -> neovim refactor (then what about the scale of reengineering the web? ...don't ask which leg moves after which).
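
Both existing paths can at least be scripted today. A sketch assuming pandoc and LaTeXML's latexmlc are on the PATH; the flags follow their documented usage, but treat them as assumptions to be checked against the installed versions:

```python
# Drive the existing tex -> html converters from a script, for comparison.
import subprocess

paper = "paper.tex"  # hypothetical input

# pandoc: fast, but with limited coverage of arXiv-style LaTeX
subprocess.run(["pandoc", "-s", paper, "-o", "paper-pandoc.html"], check=True)

# LaTeXML's one-step wrapper: slower, but much deeper TeX coverage
subprocess.run(["latexmlc", "--dest=paper-latexml.html", paper], check=True)
```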

(to be continued...)

davidar (Member Author) commented Oct 3, 2015

Looking at http://arxmliv.kwarc.info/top_macros.php, I think this task is about the scale of vim -> neovim refactor

Keep in mind that if you took care of the top 7 macros in that list, each of the remaining macros is used in less than 1% of papers, so it's not too bad. As it is, each is already under 3%.

dginev commented Dec 3, 2015

@rht said:

Pandoc (the "llvm" of markup lang) currently has less latex coverage than latexml, but has better foundation (especially when vs perl scripts) and it connects with other markup langs.

Which foundation is that? I happen to be biased on the subject, but when it comes to dealing with arXiv, pandoc's coverage can only be described as "basic". When it comes to evaluating TeX programs (which LaTeX papers are), it's pandoc's TeX reader that you could qualify as "Haskell scripts", while LaTeXML has a fully fleshed-out implementation of a TeX engine as its foundation.

Pandoc indeed has a very powerful model that allows one to connect readers and writers of concrete syntaxes via an internal abstract syntax. In fact, if the abstract model evolves to meet the coverage of LaTeXML's XML schema, it could be a wonderful post-processor for the LaTeXML ecosystem. And vice versa, LaTeXML could be a wonderful "TeX reader" for pandoc. Personally, I would be quite curious to see the two projects interoperate, as they focus on, and excel at, different problems.

davidar (Member Author) commented Dec 3, 2015

@dginev what needs to be added to pandoc's abstract model to support LaTeXML?

dginev commented Dec 3, 2015

Keep in mind I am not an expert in the Pandoc model. But I have seen the occasional comment suggesting that Pandoc wants to "stay pure" and be restricted in coverage in certain respects. To quote one closed issue:

Remember, pandoc is about document structure. CSS is about details of presentation. To change the size of headers in tex output, you could use a custom latex template (see the documentation).

That is a very noble sentiment, and LaTeXML mostly shares it, but remains open to eventually covering all of TeX's pathological typesetting acrobatics. Here is a relatively simple example of what I have in mind.

The restrictions pandoc imposes on itself keep it quite elegant and make tying in different syntaxes manageable, but they also limit the depth of the support. To handle arXiv meaningfully, compromising on elegance in order to gain enough expressivity is a necessity, or you end up losing half (or more) of the content.

But even when it comes to document structure, I am unsure how far pandoc has gone in supporting the "advanced" structures out there: indexes, glossaries, bibliographies, wacky constructions such as inline blocks, math in TeX graphics (e.g. TikZ images that convert to SVG+MathML), etc. In this respect I find tex4ht to be a much more impressive and suitable comparison for LaTeXML, and both dedicated TeX converters strive to get better at covering the full spectrum of structure and style that TeX allows for and that authors use on a regular basis.

dginev commented Dec 3, 2015

On a more pragmatic note, I am sure a LaTeXML-reader integration for pandoc is already possible today, by just mapping whatever structure currently overlaps. It may actually be a rather nice and simple project; it's just a matter of matching an XML DOM in LaTeXML's schema to Pandoc's internal data structures. I will think of spending a few hours and doing that on a weekend.
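
A hedged sketch of the minimal version of that mapping: parse LaTeXML's XML output and emit Pandoc's JSON AST for just titles and paragraphs, then hand it to pandoc. The LaTeXML element names and the pandoc-api-version below are assumptions that have to match the installed latexml and pandoc versions:

```python
# Minimal LaTeXML -> Pandoc bridge: map ltx:title and ltx:para elements from
# LaTeXML's XML output into Pandoc's JSON AST, then pipe it through pandoc.
import json, subprocess
import xml.etree.ElementTree as ET

def inlines(s):
    # Pandoc represents words and spaces as separate inline elements
    out = []
    for i, word in enumerate(s.split()):
        if i:
            out.append({"t": "Space"})
        out.append({"t": "Str", "c": word})
    return out

tree = ET.parse("paper.xml")  # e.g. produced by: latexml --dest=paper.xml paper.tex
blocks = []
for el in tree.iter():
    tag = el.tag.split("}")[-1]  # strip the namespace
    text = "".join(el.itertext()).strip()
    if not text:
        continue
    if tag == "title":
        blocks.append({"t": "Header", "c": [1, ["", [], []], inlines(text)]})
    elif tag == "para":
        blocks.append({"t": "Para", "c": inlines(text)})

# The api version must match the installed pandoc (see `pandoc --version`).
doc = {"pandoc-api-version": [1, 22], "meta": {}, "blocks": blocks}
subprocess.run(["pandoc", "-f", "json", "-t", "markdown", "-o", "paper.md"],
               input=json.dumps(doc).encode(), check=True)
```

From there pandoc's writers take over, which is exactly the interoperation described above.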

davidar (Member Author) commented Dec 3, 2015

@dginev It would be really cool to have a "universal" document model, but as you say, it's a tricky problem. I've spent a little bit of time thinking about it in the past, but can't say I came up with any solutions :(

I will think of spending a few hours and doing that on a weekend.

Awesome, let me know how you go :)

I can help with things on the Haskell side, if necessary. I'll leave the Perl stuff to you though :p
