Parsing Wikipedia markup is not an easy task. Some tags and formatting signs are only valid at the start of a line, while others can appear anywhere in the text; some formatting can span several lines, while some is force-closed at the end of a line; and there can be tons and tons of markup nested inside image captions, templates, and `<ref>`'s, so... Here's what I've come up with:

  1. The entire page text is split into lines (after `<!-- -->` comments are stripped -- they go nowhere).
  2. At first, we are in paragraph context: we look at the next line in the list and guess what it is (a list item, a heading, and so on).
  3. Then we are in inline context for the paragraph's text (unless it is a table, which is a different story, or a heading, which is also different, or of course preformatted text... you get the idea). We scan the text until we meet any inline formatting (or the end of the line).
  4. When we meet some formatting, we push the current context and scan inside it. The inline scanning is tricky:
     * Simple formatting like `''` (italic) is implicitly closed at the end of the line (this is called a "short inline scan" inside Infoboxer's parser).
     * Long formatting like templates can span several lines, so we continue scanning through the following lines until the template ends (which means we are still in the same paragraph!); this is a "normal inline scan", or just "inline scan".
     * Some inline formatting (like `<ref>`'s) and special formatting (like table cells) can contain other paragraphs inside! (It is still "inline" formatting, because when it ends, the same paragraph continues -- when Wikipedia displays the page, a ref leaves a small footnote mark in the paragraph, and its contents appear below.) We call such cases a "long inline scan".
  5. So, the parser tries to do everything in one forward scan, without returning to previous positions or tricks like "scan all characters till the end of the template, then parse them as a separate sub-document" (the latter is the simplest way to parse MediaWiki markup; that's how Infoboxer worked at first, and it was neither very fast nor memory-efficient at all).
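
The steps above can be sketched in Ruby (Infoboxer's own language). This is a minimal illustration under my own assumptions, not Infoboxer's actual code: every class and method name here is hypothetical, and it only covers step 1 plus the "short inline scan" from step 4 (italics implicitly closed at end of line).

```ruby
require 'strscan'

# Minimal sketch of the one-pass scanning approach described above.
# NOT Infoboxer's real parser; all names are hypothetical.
class MiniParser
  def initialize(text)
    # Step 1: drop <!-- --> comments, then split into lines.
    @lines = text.gsub(/<!--.*?-->/m, '').split("\n")
  end

  def parse
    @lines.map { |line| parse_block(line) }
  end

  private

  # Step 2: paragraph context -- guess what each line is.
  def parse_block(line)
    case line
    when /^(={2,6})\s*(.+?)\s*\1\s*$/ then [:heading, $1.length, parse_inline($2)]
    when /^([*#:;]+)\s*(.*)$/         then [:list_item, $1, parse_inline($2)]
    when /^ (.+)$/                    then [:pre, $1]
    when ''                           then [:empty]
    else [:paragraph, parse_inline(line)]
    end
  end

  # Steps 3-4: the "short inline scan" -- simple markup like ''italic''
  # is implicitly closed at the end of the line.
  def parse_inline(text)
    scanner = StringScanner.new(text)
    out = []
    until scanner.eos?
      if scanner.scan(/''/)
        # Scan until the closing '' -- or take the rest of the line
        # if there is none (implicit close at end of line).
        inner = scanner.scan_until(/''/) || scanner.rest.tap { scanner.terminate }
        out << [:italic, inner.sub(/''\z/, '')]
      else
        out << scanner.scan(/[^']+|'/)
      end
    end
    out
  end
end
```

A real parser following steps 4-5 would also keep a context stack and let "normal" and "long" inline scans consume the following lines (for templates and `<ref>`'s); this sketch keeps everything line-local to stay short.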