
arXiv #1

Open
davidar opened this issue Sep 30, 2015 · 13 comments

davidar (Member) commented Sep 30, 2015

Now that the first Creative Commons complete arXiv dataset has been published (ipfs-inactive/archives#2), it's time to build some cool apps on top of it!

Some ideas (please add more in the comments):

  • search engine for LaTeX equations, like http://latexsearch.com/
  • publishing papers in alternative formats (html, etc)
  • building a citation graph
  • training a topic model (latent Dirichlet allocation / hierarchical Dirichlet processes); a rough sketch follows this list
  • automatically extracting definitions to build a dictionary of terminology
  • semantic markup of math (connecting variables in equations to their textual descriptions, disambiguating notation, etc)
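
To make the topic-model item a little more concrete, here is a minimal sketch. It assumes the abstracts have already been pulled out of the dataset into a plain-text file with one abstract per line (the file name and layout are assumptions) and that gensim is installed:

```python
# Minimal LDA sketch over arXiv abstracts, assuming one abstract per line.
from gensim import corpora, models

with open("arxiv_abstracts.txt") as f:
    docs = [line.lower().split() for line in f if line.strip()]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop rare/ubiquitous tokens
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100, passes=3)
for topic_id, words in lda.print_topics(num_topics=10):
    print(topic_id, words)
```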

CC: @jbenet @rht

jbenet (Member) commented Sep 30, 2015

rht commented Sep 30, 2015

To be more precise, s/arXiv dataset/CC arXiv dataset/.

publishing papers in alternative formats (html, etc)

Anything that is !pdf (well there is \usepackage{hyperref}, but extracting semantic data is a pain) and machine parseable.

building a citation graph

And e.g. apt-get install /doi/$doi --depth 2, citation ("vendor/") auto-suggest.
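
A rough sketch of the shape that could take, purely for illustration; fetch_paper and fetch_references below are hypothetical placeholders for whatever resolver ends up backing /doi/, not real APIs:

```python
# Hypothetical sketch of "apt-get install /doi/$doi --depth 2": fetch a paper
# and, recursively, everything it cites up to a fixed depth.
def fetch_paper(doi):
    print("fetching", doi)  # placeholder: would download/pin the paper itself

def fetch_references(doi):
    return []  # placeholder: would return the DOIs this paper cites

def install(doi, depth=2, seen=None):
    seen = set() if seen is None else seen
    if depth < 0 or doi in seen:
        return
    seen.add(doi)
    fetch_paper(doi)
    for cited in fetch_references(doi):
        install(cited, depth - 1, seen)
```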

training a topic model

https://blog.lateral.io/2015/07/harvesting-research-arxiv/ uses the abstract dataset to create a recommender system. Though I find arXiv's own search results to be far more relevant (test case: piron+lattice).
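
In the same spirit, a minimal abstract-based recommender can be sketched with TF-IDF vectors and cosine similarity (scikit-learn; the input file name and layout are assumptions, and this is not what lateral.io does internally):

```python
# Nearest-neighbour recommendations over abstracts: TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

with open("arxiv_abstracts.txt") as f:
    abstracts = [line.strip() for line in f if line.strip()]

tfidf = TfidfVectorizer(stop_words="english", max_features=50000)
X = tfidf.fit_transform(abstracts)

query = 0  # index of the paper to recommend for
scores = cosine_similarity(X[query], X).ravel()
for i in scores.argsort()[::-1][1:6]:  # top 5, skipping the paper itself
    print(round(scores[i], 3), abstracts[i][:80])
```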

(to be continued...)

rht commented Sep 30, 2015

automatically extracting definitions to build a dictionary of terminology

I wonder if you can use statistical models on such 'raw' data.
Even though terms in scientific papers are expected to be more 'regular' than those of literature, demarcating 'real' papers is not something a mere mortal can do, never mind machines, e.g. arXiv vs snarXiv. This is made worse when the authors tend to sound like the latter (e.g. http://arxiv.org/abs/hep-th/0003075, or a report on a certain Greek island).

Established scientific terms are definitely more regular (more regular than terms in the rest of language) [1].

There are definition environments in maths papers that can be extracted, but they usually introduce localized notions rather than the global terms a dictionary needs.
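
Mechanically, pulling those environments out of the LaTeX source is easy enough; a rough sketch for the plain \begin{definition} case (papers that set up custom names via \newtheorem would need extra handling, and the file name is hypothetical):

```python
# Rough sketch: extract \begin{definition}...\end{definition} blocks from LaTeX source.
import re

DEF_RE = re.compile(r"\\begin\{definition\}(.*?)\\end\{definition\}", re.DOTALL)

with open("paper.tex") as f:
    source = f.read()

for i, body in enumerate(DEF_RE.findall(source), 1):
    print(f"--- definition {i} ---")
    print(body.strip())
```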

semantic markup of math (connecting variables in equations to their textual descriptions, disambiguating notation, etc)

THIS.
(On disambiguating notation: using NLP to reverse-engineer the semantics back out of the source is tough. Another path is to tell people to actually use unambiguous notation, as in e.g. "Calculus on Manifolds", SICM, "Functional Differential Geometry" [2].)

[1] Putnam suggested a "division of linguistic labor" (done by scientists from each field) to define meaning: http://libgen.io/get.php?md5=C931B36BCE8C21DA613AC02C40F634DC (what to do with this type of link?)
[2] http://mitpress.mit.edu/sites/default/files/titles/content/sicm/book-Z-H-79.html#%_chap_8

rht commented Sep 30, 2015

It's already terse, but the tl;dr is whether to approach the data using ML or GOFAI.

zignig commented Oct 1, 2015

AWESOME! 👍

davidar (Member Author) commented Oct 1, 2015

@rht glad to see I'm not the only one who's been thinking about this :)

Anything that is !pdf (well there is \usepackage{hyperref}, but extracting semantic data is a pain) and machine parseable.

See http://dlmf.nist.gov/LaTeXML and the (now defunct I suspect) https://trac.kwarc.info/arXMLiv

And e.g. apt-get install /doi/$doi --depth 2, citation ("vendor/") auto-suggest.

That would be cool. I also want to port https://github.com/davidar/bib to IPFS; it currently uses a half-baked content-addressable store for fulltext.
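
The content-addressable part is conceptually just this (a generic sketch, not the actual bib code): name each fulltext file by the hash of its bytes, which is essentially what IPFS gives you natively:

```python
# Generic content-addressable store for fulltexts: file name = SHA-256 of its bytes.
import hashlib, pathlib, shutil

STORE = pathlib.Path("fulltext-store")

def add(path):
    data = pathlib.Path(path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    STORE.mkdir(exist_ok=True)
    dest = STORE / digest
    if not dest.exists():
        shutil.copyfile(path, dest)
    return digest  # the content address to record in the bibliography entry
```

With IPFS the digest would simply be replaced by the hash that `ipfs add` reports.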

automatically extracting definitions to build a dictionary of terminology

I wonder if you can use statistical models on such 'raw' data.
[...] on disambiguating notation: using NLP to reverse-engineer [...]

Yeah, it wouldn't be trivial, but it's a field that has been studied, e.g.:

Another path is to tell people to actually use unambiguous notations

That would be nice (see http://www.texmacs.org/joris/semedit/semedit.html), but it's probably not going to happen at a large scale.

tl;dr is whether to approach the data using ML or GOFAI.

My area is probabilistic (Bayesian) machine learning, but simpler approaches may well be Good Enough for some of this.

rht commented Oct 1, 2015

tex -> html:
In the past, people have been using tex4ht / tex2page.

tex -> xml (for parsing):
Someone has to do it eventually...
Pandoc (the "LLVM" of markup languages) currently has less LaTeX coverage than LaTeXML, but it has a better foundation (especially compared with Perl scripts) and it connects to the other markup languages.
Looking at http://arxmliv.kwarc.info/top_macros.php, I think this task is about the scale of vim -> neovim refactor (then what about the scale of reengineering the web? ...don't ask which leg moves after which).
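
Both existing paths can at least be scripted today. A sketch assuming pandoc and LaTeXML's latexmlc are on the PATH; the flags follow their documented usage, but treat them as assumptions to be checked against the installed versions:

```python
# Drive the existing tex -> html converters from a script, for comparison.
import subprocess

paper = "paper.tex"  # hypothetical input

# pandoc: fast, but with limited coverage of arXiv-style LaTeX
subprocess.run(["pandoc", "-s", paper, "-o", "paper-pandoc.html"], check=True)

# LaTeXML's one-step wrapper: slower, but much deeper TeX coverage
subprocess.run(["latexmlc", "--dest=paper-latexml.html", paper], check=True)
```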

(to be continued...)

davidar (Member Author) commented Oct 3, 2015

Looking at http://arxmliv.kwarc.info/top_macros.php, I think this task is about the scale of vim -> neovim refactor

Keep in mind that if you took care of the top 7 macros in that list, each of the remaining macros is used in less than 1% of papers, so it's not too bad. As it is, each is already under 3%.

dginev commented Dec 3, 2015

@rht said:

Pandoc (the "llvm" of markup lang) currently has less latex coverage than latexml, but has better foundation (especially when vs perl scripts) and it connects with other markup langs.

Which foundation is that? I happen to be biased on the subject, but when it comes to dealing with arXiv, pandoc's coverage can only be described as "basic". When it comes to evaluating TeX programs (which LaTeX papers are), it's pandoc's TeX reader that you could qualify as "Haskell scripts", while LaTeXML has a fully fleshed-out implementation of a TeX engine as its foundation.

Pandoc indeed has a very powerful model that allows one to connect readers and writers of concrete syntaxes via an internal abstract syntax. In fact, if the abstract model evolves to meet the coverage of LaTeXML's XML schema, it could be a wonderful post-processor for the LaTeXML ecosystem. And vice versa, LaTeXML could be a wonderful "TeX reader" for pandoc. Personally, I would be quite curious to see the two projects interoperate, as they focus on, and excel at, different problems.

davidar (Member Author) commented Dec 3, 2015

@dginev what needs to be added to pandoc's abstract model to support LaTeXML?

dginev commented Dec 3, 2015

Keep in mind I am not an expert in the Pandoc model. But I have seen the occasional comment suggesting that Pandoc wants to "stay pure" and be restricted in coverage in certain respects. To quote one closed issue:

Remember, pandoc is about document structure. CSS is about details of presentation. To change the size of headers in tex output, you could use a custom latex template (see the documentation).

That is a very noble sentiment, and LaTeXML mostly shares it, but remains open to eventually covering all of TeX's pathological typesetting acrobatics. Here is a relatively simple example of what I have in mind.

The restrictions pandoc imposes on itself keep it quite elegant and make tying in different syntaxes manageable, but they also limit the depth of the support. To handle arXiv meaningfully, compromising on elegance in order to gain enough expressivity is a necessity, or you end up losing half (or more) of the content.

But even when it comes to document structure, I am unsure how far pandoc has gone in supporting the "advanced" structures out there: indexes, glossaries, bibliographies, wacky constructions such as inline blocks, math in TeX graphics (e.g. TikZ images that convert to SVG+MathML), etc. In this respect I find tex4ht to be a much more impressive and suitable comparison for LaTeXML, and both dedicated TeX converters strive to get better at covering the full spectrum of structure and style that TeX allows for and that authors use on a regular basis.

dginev commented Dec 3, 2015

On a more pragmatic note, I am sure a LaTeXML-reader integration for pandoc is already possible today, by just mapping whatever structure currently overlaps. It may actually be a rather nice and simple project; it's just a matter of matching an XML DOM in LaTeXML's schema to Pandoc's internal data structures. I will think of spending a few hours and doing that on a weekend.
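
A hedged sketch of the minimal version of that mapping: parse LaTeXML's XML output and emit Pandoc's JSON AST for just titles and paragraphs, then hand it to pandoc. The LaTeXML element names and the pandoc-api-version below are assumptions that have to match the installed latexml and pandoc versions:

```python
# Minimal LaTeXML -> Pandoc bridge: map ltx:title and ltx:para elements from
# LaTeXML's XML output into Pandoc's JSON AST, then pipe it through pandoc.
import json, subprocess
import xml.etree.ElementTree as ET

def inlines(s):
    # Pandoc represents words and spaces as separate inline elements
    out = []
    for i, word in enumerate(s.split()):
        if i:
            out.append({"t": "Space"})
        out.append({"t": "Str", "c": word})
    return out

tree = ET.parse("paper.xml")  # e.g. produced by: latexml --dest=paper.xml paper.tex
blocks = []
for el in tree.iter():
    tag = el.tag.split("}")[-1]  # strip the namespace
    text = "".join(el.itertext()).strip()
    if not text:
        continue
    if tag == "title":
        blocks.append({"t": "Header", "c": [1, ["", [], []], inlines(text)]})
    elif tag == "para":
        blocks.append({"t": "Para", "c": inlines(text)})

# The api version must match the installed pandoc (see `pandoc --version`).
doc = {"pandoc-api-version": [1, 22], "meta": {}, "blocks": blocks}
subprocess.run(["pandoc", "-f", "json", "-t", "markdown", "-o", "paper.md"],
               input=json.dumps(doc).encode(), check=True)
```

From there pandoc's writers take over, which is exactly the interoperation described above.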

davidar (Member Author) commented Dec 3, 2015

@dginev It would be really cool to have a "universal" document model, but as you say, it's a tricky problem. I've spent a little bit of time thinking about it in the past, but can't say I came up with any solutions :(

I will think of spending a few hours and doing that on a weekend.

Awesome, let me know how you go :)

I can help with things on the Haskell side, if necessary. I'll leave the Perl stuff to you though :p
