Wikipedia Corpus Builder

The Leibniz-Institut für Deutsche Sprache (IDS) develops a corpus builder for Wikipedia, that converts Wikipedia pages from its native text format, Wikitext, into our target corpus format, I5. I5 is the IDS text model used in Das Deutsche Referenzkorpus (DeReKo). It is a customized TEI format based on XCES, enriched with metadata information on different corpus structure levels (Lüngen and Sperberg-McQueen, 2012). As part of DeReKo, Wikipedia corpora built using this tool, are accessible through Corpus Search, Management and Analysis System II (COSMAS II) and Corpus Analysis Platform (KorAP).

The corpus builder works in two stages of conversion (Margaretha and Lüngen, 2014). In the first stage, WikiXMLConverter converts Wikitext into WikiXML by using Sweble Parser (Dohrn and Riehle, 2011) and generates a WikiXML file for each wikipage within a Wikipedia namespace, for instance articles. In the second stage, WikiI5Converter converts each WikiXML file into I5 using XSLT Stylesheets and assemble them altogether as a single corpus file as required for DeReKo.

The corpus builder is also designed for building Computer Mediated Communication (CMC) corpora from Wikipedia talk or discussion pages, such as in the Talk and User talk namespaces. A talk corpus is structured by postings and threads following the TEI scheme for CMC corpus (Beißwenger, et al., 2012). Our posting segmentation is done heuristically in WikiXMLConverter.

The corpus builder supports parsing Wikipedia of multiple languages. It has been tested for the following languages: english, french, hungarian, norwegian, spanish, croatian, italian, polish and rumanian. We also provide Wikipedia corpora of these languages in WikiXML and I5 formats for download.

References

Beißwenger, M., Ermakova, M., Geyken, A., Lemnitzer, L., and Storrer, A. (2012). A TEI Schema for the Representation of Computer-mediated Communication. Journal of the Text Encoding Initiative [Online], 3. URL : https://doi.org/10.4000/jtei.476 ; DOI : 10.4000/jtei.476

Dohrn, H., Riehle, D. (2011). Design and implementation of the Sweble Wikitext parser: unlocking the structured data of Wikipedia. Proceedings of the 7th International Symposium on Wikis and Open Collaboration, DOI : 10.1145/2038558.2038571

Margaretha, E., and Lüngen,H. (2014). Building linguistic corpora from Wikipedia articles and discussions. Journal for Language Technologie and Computational Linguistics (JLCL), 2/2014. URL : http://www.jlcl.org/2014_Heft2/3MargarethaLuengen.pdf ;

Lüngen, H., and Sperberg-McQueen, C. M. (2012). A TEI P5 Document Grammar for the IDS Text Model. Journal of the Text Encoding Initiative [Online], 3. URL : http://jtei.revues.org/508 ; DOI : 10.4000/jtei.508

Name		Name	Last commit message	Last commit date
Latest commit History 333 Commits
.github/workflows		.github/workflows
WikiI5Converter		WikiI5Converter
WikiXMLConverter		WikiXMLConverter
WikitextSampleBuilder		WikitextSampleBuilder
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github/workflows

.github/workflows

WikiI5Converter

WikiI5Converter

WikiXMLConverter

WikiXMLConverter

WikitextSampleBuilder

WikitextSampleBuilder

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Repository files navigation

Wikipedia Corpus Builder

References

About

Releases

Packages

Contributors 2

Languages

License

IDS-Mannheim/Wikipedia-Corpus-Builder

Folders and files

Latest commit

History

Repository files navigation

Wikipedia Corpus Builder

References

About

Topics

Resources

License

Stars

Watchers

Forks

Languages