marcmiquel/WDO

WDO
==========

The Wikipedia Diversity Observatory (WDO) is a research project whose purpose is to raise awareness of the current state of content diversity on Wikipedia by (1) providing datasets, (2) offering sites with visualizations and statistics, and (3) pointing out solutions to improve knowledge coverage and reduce knowledge inequalities among languages and categories relevant to overall diversity (e.g. culture, gender, geography, ethnic groups, sexual orientation).

You can learn more about the project in this video or by visiting the website with visualizations and tools.

Data: Wikipedia Diversity and Stats Databases

We created a dataset/database table for each Wikipedia language edition in which each article is characterized according to features that can determine whether it belongs to a relevant category for diversity (culture, gender, place, etc.). Categories like gender, sexual orientation, religion or ethnic origin are straightforward, as they can be traced to Wikidata semantic relations structured as properties and items.
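As an illustration of how such categories can be read from Wikidata, the following is a minimal sketch, not the project's actual code. The property IDs are real Wikidata properties (P21 sex or gender, P91 sexual orientation, P140 religion, P172 ethnic group); the function name, the category labels and the simplified claims dictionary are assumptions made for this example.

```python
# Real Wikidata property IDs mapped to diversity categories.
# The category labels on the right are hypothetical names for this sketch.
DIVERSITY_PROPERTIES = {
    "P21": "gender",
    "P91": "sexual_orientation",
    "P140": "religion",
    "P172": "ethnic_group",
}


def diversity_features(claims):
    """Return the diversity categories found in an item's claims.

    `claims` is a simplified, hypothetical structure: a dict mapping a
    property ID to the list of item IDs (QIDs) it points to.
    """
    features = {}
    for prop, category in DIVERSITY_PROPERTIES.items():
        values = claims.get(prop, [])
        if values:
            features[category] = list(values)
    return features


# A biography: instance of (P31) human (Q5), sex or gender (P21) female (Q6581072).
example_claims = {"P31": ["Q5"], "P21": ["Q6581072"]}
print(diversity_features(example_claims))
```

Properties that are not in the mapping (here P31) are ignored, so an article with no diversity-related claims simply yields an empty dictionary.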

In contrast, determining whether an article belongs to the topics related to a language requires a more sophisticated method. In this case, we use a variety of features based on the article title, categories and link graph structure, among others, to label each article according to its possible relationship with the territories where the language is spoken and with the peoples that inhabit them. Then, we feed all of these features into a machine learning classifier to obtain the final selection of articles belonging to a language's cultural and geographical context. This collection of articles is called Cultural Context Content (CCC), and it is the group of articles in a Wikipedia language edition that relates to the editors' geographical and cultural context (places, traditions, language, politics, agriculture, biographies, events, etc.).
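The idea of feeding article features into a classifier can be sketched as follows. This README does not say which classifier or which exact features are used, so everything here is an assumption: the three features (keyword in title, geolocated in a territory where the language is spoken, share of inlinks coming from known CCC articles), the toy training data, and a small logistic regression written with the standard library as a stand-in for whatever model the project actually uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression with full-batch gradient descent."""
    n = len(rows[0])
    weights, bias = [0.0] * n, 0.0
    m = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(n):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def is_ccc(weights, bias, x):
    """Predict whether an article belongs to CCC (probability >= 0.5)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Hypothetical features per article:
# (keyword_in_title, geolocated_in_territory, share_of_inlinks_from_ccc)
train_rows = [
    (1, 1, 0.9), (1, 0, 0.8), (0, 1, 0.7),   # labelled as CCC
    (0, 0, 0.1), (0, 0, 0.2), (0, 0, 0.0),   # labelled as not CCC
]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(train_rows, train_labels)
predictions = [is_ccc(weights, bias, x) for x in train_rows]
print(predictions)
```

In practice the training labels would come from the articles whose relationship to the language's territories is unambiguous, and the trained model would then be applied to the remaining, uncertain articles.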

The method is built with:

The datasets/database tables are generated on a monthly basis at wcdo.wmflabs.org in CSV and SQLite3 formats. You can download the latest version in datasets or databases.
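Once downloaded, either format can be queried with the Python standard library. The sketch below loads a small CSV into an in-memory SQLite database and counts the CCC articles. The table name, the column names (`qitem`, `page_title`, `ccc_binary`, `gender`) and the sample rows are made up for illustration; check the real files for their actual schema.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """qitem,page_title,ccc_binary,gender
Qx1,Example place,1,
Qx2,Example person,1,female
Qx3,Unrelated topic,0,
"""


def load_csv_into_sqlite(csv_text, conn, table="articles"):
    """Create `table` and fill it with the rows of a CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute(
        f"CREATE TABLE {table} ("
        "qitem TEXT PRIMARY KEY, page_title TEXT, "
        "ccc_binary INTEGER, gender TEXT)"
    )
    conn.executemany(
        f"INSERT INTO {table} VALUES (:qitem, :page_title, :ccc_binary, :gender)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_csv_into_sqlite(SAMPLE_CSV, conn)
ccc_count = conn.execute(
    "SELECT COUNT(*) FROM articles WHERE ccc_binary = 1"
).fetchone()[0]
print(loaded, ccc_count)
```

For a downloaded SQLite3 file, replacing `":memory:"` with the path to the file and skipping the load step would be enough to start querying.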

The following scripts generate the database wikipedia_diversity.db:

  • wikipedia_diversity.py, content_retrieval.py and content_selection.py retrieve the data from Wikimedia dumps and databases, process it according to some criteria, and insert it into the database.

To answer questions on Wikipedia content diversity, it is necessary to compute several statistics based on CCC and other groups of articles. This is the script we used to calculate them:

  • stats_generation.py computes these statistics and ranks the articles in order to create valuable lists of articles for each Wikipedia language edition. It stores the results in stats.db on a monthly basis so they can be used to create tables and graphs. The list of all the diversity categories and groups of articles is in the Excel file sets_intersections.xls.
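The kind of statistics described above can be thought of as intersections between sets of articles. The following sketch is only an illustration under stated assumptions, not the logic of stats_generation.py: the language codes, the article identifiers, the inlink counts and the choice of inlinks as ranking criterion are all hypothetical. Articles are identified by a shared ID across languages so that sets from different editions can be compared.

```python
def share(part, whole):
    """Percentage of `whole` that is also in `part` (0.0 if `whole` is empty)."""
    return 100.0 * len(part & whole) / len(whole) if whole else 0.0


# Hypothetical article sets, keyed by a cross-language identifier.
ca_articles = {"a", "b", "c", "d"}   # all articles in one language edition
ca_ccc = {"a", "b", "c"}             # its Cultural Context Content
es_articles = {"a", "e"}             # all articles in another edition

# Share of the first edition that is CCC.
ccc_share = share(ca_ccc, ca_articles)

# How much of the first edition's CCC is covered by the second edition.
coverage = share(es_articles, ca_ccc)

# Rank the CCC articles missing from the second edition, here by a
# hypothetical inlink count, to suggest which ones to create first.
inlinks = {"a": 30, "b": 5, "c": 12}
missing = ca_ccc - es_articles
ranked_missing = sorted(missing, key=lambda art: inlinks[art], reverse=True)

print(ccc_share, round(coverage, 2), ranked_missing)
```

Storing one row per pair of sets (for example edition, set A, set B, percentage) would make such results easy to reuse for tables and graphs.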

Site(s): Observatory website (WDO) and Meta page (WDO home)

These are the scripts that create the tables and visualizations for the WDO, both the website visualizations and tools and the meta page.

Research: Main papers and presentations

Several papers and talks have been published to explain the usefulness of the Diversity Observatory and the importance of exchanging content across language editions in order to reduce knowledge inequalities.

  • Miquel-Ribé, M. & Laniado, D. (2020). The Wikipedia Diversity Observatory: A Project to Identify and Bridge Content Gaps in Wikipedia. In Proceedings of the International Symposium on Open Collaboration (OpenSym 2020). ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3412569.3412866 (pdf).

  • Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media. ICWSM. ACM. 2334-0770

  • Miquel-Ribé, M., & Laniado, D. (2018). Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions. Frontiers in Physics (pdf).

  • Miquel-Ribé, M. (2017). Identity-based motivation in digital engagement: the influence of community and cultural identity on participation in Wikipedia (Doctoral dissertation, Universitat Pompeu Fabra).

  • Presentation Wikipedia Diversity Observatory (WDO) (OpenSym Conference, 2020) (Video Youtube) (Slides pdf).

Community

Get involved in WDO development and find tasks to do on the Get involved page, or get in touch at tools.wdo@tools.wmflabs.org.

Copyright

All data, charts, and other content are available under the Creative Commons CC0 dedication.