<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Wikipedia_category_to_article_list.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wikipedia category to article list

**Input:** the name of a Wikipedia category.

**Output:** a list of Wikipedia articles (CSV).

This script queries Wikipedia for the list of articles in a category, including its subcategories, recursively down to a configurable depth (two levels by default).


## How to use

1. Edit the settings
1. Run all the cells
1. Retrieve the output file (`wikipedia-articles.csv` by default) from the notebook folder


# SETTINGS

In [19]:
# Wikipedia category to extract
# Note: use the name exactly as it is displayed on the category's page.
#       For instance, this page: https://en.wikipedia.org/wiki/Category%3AComputer_ethics
#       gives you this category name: "Category:Computer ethics"
category_to_extract = "Category:Computer ethics"

# Output file
output_file = "wikipedia-articles.csv"
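
If you are starting from a category page URL rather than from the displayed name, the conversion described in the comments above can be sketched as follows. The helper `category_name_from_url` is hypothetical and not part of the notebook; it assumes a standard `/wiki/<title>` URL.

```python
from urllib.parse import unquote, urlparse

def category_name_from_url(url):
    """Turn a Wikipedia category URL into the displayed category name."""
    # Keep only the part of the path that follows "/wiki/"
    path = urlparse(url).path
    title = path.split("/wiki/", 1)[1]
    # Decode percent-escapes (e.g. %3A becomes ":") and turn underscores into spaces
    return unquote(title).replace("_", " ")

example_url = "https://en.wikipedia.org/wiki/Category%3AComputer_ethics"
print(category_name_from_url(example_url))  # Category:Computer ethics
```

The returned string can then be assigned to `category_to_extract`.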

# SCRIPT

### Install and import libraries
This notebook relies on existing libraries (`wikipedia-api`, `wikipedia`, and `pandas`).
You can ignore the output of this cell.

In [11]:
# Install (if needed)
!pip install wikipedia-api
!pip install wikipedia
!pip install pandas

# Import
import wikipediaapi
import wikipedia
import pandas as pd

print("Done.")

Done.


### Harvest the category

In [18]:
# Create empty set of articles to fill later on
article_set = set()

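# This is the object we use to connect to the API.
# Note that we configure it to use the English Wikipedia.
# Recent releases of wikipedia-api require a user agent; the string below
# is a placeholder to replace with your own project name and contact.
wiki_wiki = wikipediaapi.Wikipedia(
    user_agent='wikipedia-category-to-article-list (you@example.com)',
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)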

# Create the category object (stuff specific to the API library)
cat = wiki_wiki.page(category_to_extract)

# Recursively build the list of pages (because there are sub-categories).
# For the recursion, we create a function that may call itself.
# max_level caps how many sub-category levels are followed, which also
# keeps the recursion from running forever if categories loop.
def parse_categorymembers(categorymembers, level=0, max_level=2):
    for c in categorymembers.values():
        if c.ns == wikipediaapi.Namespace.MAIN:
            # This element is an article
            article_set.add(c.title)
        elif c.ns == wikipediaapi.Namespace.CATEGORY and level < max_level:
            # This element is a sub-category: harvest it one level deeper
            parse_categorymembers(c.categorymembers,
                                  level=level + 1,
                                  max_level=max_level)

parse_categorymembers(cat.categorymembers)

# Transform the set into a data frame for convenience
article_df = pd.DataFrame(article_set, columns=["Article"])

# Output the data frame to check if it works
print("Preview of the article list:")
article_df

Preview of the article list:


|     | Article |
| --- | --- |
| 0 | Search engine privacy |
| 1 | Member Berries |
| 2 | Real-name system |
| 3 | CSipSimple |
| 4 | Spam blog |
| ... | ... |
| 539 | Helix Kitten |
| 540 | Alternative Informatics Association |
| 541 | CyberSource |
| 542 | Flyposting |
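
To see how the depth limit behaves without calling Wikipedia, here is a minimal offline sketch of the same recursion on a toy category tree. The dictionary `toy_tree` and the function `collect_articles` are made up for illustration; only the traversal logic mirrors the cell above.

```python
# Toy category tree: each category lists its articles and its sub-categories.
toy_tree = {
    "Category:Root": {"articles": ["A"], "subcats": ["Category:Child"]},
    "Category:Child": {"articles": ["B"], "subcats": ["Category:Grandchild"]},
    "Category:Grandchild": {"articles": ["C"], "subcats": ["Category:Deep"]},
    "Category:Deep": {"articles": ["D"], "subcats": []},
}

def collect_articles(tree, category, level=0, max_level=2):
    """Return the articles reachable within max_level sub-category hops."""
    node = tree[category]
    found = set(node["articles"])
    # Only descend while the depth limit has not been reached
    if level < max_level:
        for sub in node["subcats"]:
            found |= collect_articles(tree, sub, level + 1, max_level)
    return found

print(sorted(collect_articles(toy_tree, "Category:Root")))               # ['A', 'B', 'C']
print(sorted(collect_articles(toy_tree, "Category:Root", max_level=3)))  # ['A', 'B', 'C', 'D']
```

With the default `max_level=2`, articles that sit more than two sub-category levels below the starting category are skipped, so a deep category may need a higher value.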


### Save list as CSV

In [None]:
try:
    article_df.to_csv(output_file, index=False, encoding='utf-8')
    print("Done.")
except IOError:
    # The backslash is doubled so that Python does not read it as an escape
    print("/!\\ Error while writing the output file")
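
Once the file is written, a quick check is to read it back and count the rows. The sketch below uses only the standard library and writes a small sample file first so that it runs on its own. The helper `read_article_list`, the sample titles, and the temporary path are placeholders; in the notebook you would pass `output_file` instead.

```python
import csv
import os
import tempfile

def read_article_list(path):
    """Read a one-column CSV with an "Article" header into a list of titles."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["Article"] for row in csv.DictReader(f)]

# Write a small sample file in the same layout as the notebook's output
sample_titles = ["Search engine privacy", "Real-name system"]
sample_path = os.path.join(tempfile.mkdtemp(), "wikipedia-articles.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Article"])
    writer.writerows([title] for title in sample_titles)

titles = read_article_list(sample_path)
print(len(titles), "articles, first:", titles[0])
```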