# Explore the Wikipedia categories

This notebook includes some exploratory data analysis (EDA) to get a feel for the category dataset, and introduces functionality for finding categories related to specific topics.

This notebook requires the category index dataset to have been built previously by running `build_category_index.ipynb`.

In [1]:
from pathlib import Path

import pandas as pd

from wikipedia_utils.category import CategoryIndex
from wikipedia_utils.utils import display_pd

`DATA_DIR` should be set to the directory where the category index files were written.

In [2]:
DATA_DIR = "category_data"

## Load the data

Load the category index DataFrame.

In [3]:
# CategoryIndex wraps the category index DataFrame and provides some exploration functions.
ci = CategoryIndex(data_dir=DATA_DIR)

In [4]:
ci.cat_df.sample(5)

key,hidden,name,num_pages,num_subcats,parents,parents_visible,subcats_visible
Mass_media_in_Bijeljina,False,Mass media in Bijeljina,15,0,"[Bijeljina, Mass_media_in_Bosnia_and_Herzegovi...","[Bijeljina, Mass media in Bosnia and Herzegovi...",[]
Russian_soups,False,Russian soups,9,0,"[Commons_category_link_is_on_Wikidata, Russian...","[Soups by country, Russian cuisine]",[]
Venezuelan_islands_of_the_Leeward_Antilles,False,Venezuelan islands of the Leeward Antilles,13,2,"[Caribbean_islands_of_Venezuela, Leeward_Antil...","[Leeward Antilles, Caribbean islands of Venezu...","[Los Roques Archipelago, Margarita Island]"
IB_Parks_%26_Entertainment,False,IB Parks & Entertainment,3,0,"[Amusement_park_companies, Companies_based_in_...","[Entertainment companies of the United States,...",[]
1960_establishments_in_Southern_Rhodesia,False,1960 establishments in Southern Rhodesia,4,0,"[1960_establishments_by_country, 1960_establis...","[Establishments in Southern Rhodesia by year, ...",[]


Total number of categories:

In [5]:
print(f"{len(ci.cat_df):,}")

2,243,956


## Hidden & visible categories

Some categories are "hidden", meaning that they don't show up on the Wikipedia article page. These are mainly used by editors for maintenance purposes.

In [6]:
ci.cat_df["hidden"].value_counts()

False    2210582
True       33374
Name: hidden, dtype: int64

What are the top visible & hidden categories?

In [7]:
ci.cat_df.query("~hidden").sort_values("num_pages", ascending=False).head(5)

key,hidden,name,num_pages,num_subcats,parents,parents_visible,subcats_visible
WikiProject_Biography_articles,False,WikiProject Biography articles,1952673,9,"[Articles_by_WikiProject, CatAutoTOC_generates...","[Articles by WikiProject, WikiProject Biography]","[Automatically assessed biography articles, Bi..."
Biography_articles_of_living_people,False,Biography articles of living people,1112095,1,"[CatAutoTOC_generates_Large_category_TOC, Temp...",[WikiProject Biography articles],[Biography articles of living people who have ...
Living_people,False,Living people,1059265,2,"[CatAutoTOC_generates_Large_category_TOC, Peop...",[People by status],[Lists of living people]
Stub-Class_biography_articles,False,Stub-Class biography articles,1035116,10,"[Biography_articles_by_quality, CatAutoTOC_gen...","[Biography articles by quality, Stub-Class art...",[Stub-Class biography (actors and filmmakers) ...
Start-Class_biography_articles,False,Start-Class biography articles,688111,10,"[Biography_articles_by_quality, CatAutoTOC_gen...","[Biography articles by quality, Start-Class ar...",[Start-Class biography (actors and filmmakers)...


In [8]:
ci.cat_df.query("hidden").sort_values("num_pages", ascending=False).head(5)

key,hidden,name,num_pages,num_subcats,parents,parents_visible,subcats_visible
Articles_with_short_description,True,Articles with short description,4861387,4,"[Article_namespace_categories, CatAutoTOC_gene...",[Article namespace categories],[]
Short_description_is_different_from_Wikidata,True,Short description is different from Wikidata,3429523,0,"[Articles_with_short_description, CatAutoTOC_g...",[Wikipedia categories tracking Wikidata differ...,[]
All_stub_articles,True,All stub articles,2340123,0,"[CatAutoTOC_generates_Large_category_TOC, Hidd...","[Stub categories, Top-level stub categories]",[]
Noindexed_pages,True,Noindexed pages,2317388,51676,"[CatAutoTOC_generates_Large_category_TOC, Comm...",[],"[13th century of Danish law, 15th-century Moro..."
Redirects_from_moves,True,Redirects from moves,2173319,2,"[All_redirect_categories, CatAutoTOC_generates...",[],[German female writers]


Some categories are "empty", meaning that they currently have no pages or subcategories. How many of these are there, counting hidden separately from visible?

In [9]:
(
    ci.cat_df.query("num_pages + num_subcats == 0")["hidden"].value_counts()
    .to_frame(name="n_empty")
    .rename_axis(index="hidden")
    .assign(p_empty=lambda d: d["n_empty"] / ci.cat_df["hidden"].value_counts())
)

hidden,n_empty,p_empty
False,122704,0.055508
True,3056,0.091568
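
The proportion above relies on pandas aligning the two `value_counts()` results on their shared `hidden` index. An equivalent way to get the same two columns, sketched here on a made-up toy frame, is to group a boolean "is empty" flag by `hidden`: the sum gives the count and the mean gives the share. The names `toy_df` and `empty_summary` are illustrative and not part of `wikipedia_utils`.

```python
import pandas as pd

# Toy stand-in for ci.cat_df; the values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "hidden": [False, False, False, True, True],
        "num_pages": [3, 0, 0, 0, 7],
        "num_subcats": [1, 0, 2, 0, 0],
    }
)


def empty_summary(cat_df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of empty categories, split by the `hidden` flag."""
    # A category is empty when it has neither pages nor subcategories.
    is_empty = (cat_df["num_pages"] + cat_df["num_subcats"]) == 0
    # Summing booleans counts the True values; their mean is the proportion.
    return is_empty.groupby(cat_df["hidden"]).agg(n_empty="sum", p_empty="mean")


summary = empty_summary(toy_df)
print(summary)
```

This avoids computing the denominator separately, at the cost of also listing groups that contain no empty categories (with `n_empty` equal to 0).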


How many visible categories have no visible parents (sources), no visible subcategories (sinks/leaves), or neither?

In [10]:
n_parents = ci.cat_df.query("~hidden")["parents_visible"].map(len)
n_subcats = ci.cat_df.query("~hidden")["subcats_visible"].map(len)

print(f"No visible parents: {(n_parents == 0).sum():,}")
print(f"No visible subcategories: {(n_subcats == 0).sum():,}")
print(f"No visible linkage: {(n_parents + n_subcats == 0).sum():,}")

No visible parents: 109,162
No visible subcategories: 1,203,132
No visible linkage: 106,846


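For the categories with no visible linkage, their (hidden) parents suggest that the vast majority are soft redirects: in the counts below, 103,658 of the 106,846 have `Wikipedia_soft_redirected_categories` as a parent.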

In [11]:
(
    ci.cat_df.query("~hidden")
    [n_parents + n_subcats == 0]
    ["parents"].explode()
    .value_counts()
    .head()
)

Wikipedia_soft_redirected_categories    103658
Disambiguation_categories                 2329
CatAutoTOC_generates_no_TOC                253
Commons_category_link_is_on_Wikidata       208
All_redirect_categories                    139
Name: parents, dtype: int64

## Searching categories

Our goal is to build a list of categories that represent NSFW topics.
Here we show how to explore categories, along with their parents and subcategories, in order to build a category list for a topic at the right level of generality.

To keep manual curation manageable, we choose a few broad categories to use as seeds and then pull in all of their subcategories automatically, so that the topic area is well covered. A seed can be either an exact category name that is a good match for the topic, or a regex that matches categories related to the topic.

This approach is used to build seed lists for the NSFW categories we wish to block. Arriving at an acceptable final list takes some manual experimentation. To find candidate categories to search for, we can explore [Wikipedia's category tree](https://en.wikipedia.org/wiki/Special:CategoryTree). We can also look at the pages in Wikipedia's [objectionable content category](https://en.wikipedia.org/wiki/Category:Wikipedia_objectionable_content) to see which categories these pages commonly belong to. Once we have some categories in mind, we can use the tools below to explore their parents and subcategories and settle on the right level of generality.
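
To make the two kinds of seed concrete, here is a minimal sketch in plain pandas that selects names matching either an exact seed or a seed regex. The helper `select_seeds` and the toy names are hypothetical and only illustrate the idea; they are not how `CategoryIndex` is implemented.

```python
import pandas as pd

# Toy category names; a hypothetical stand-in for the `name` column of the index.
names = pd.Series(
    ["Coffee", "Coffee culture", "Coffee County, Alabama", "Tea", "Espresso machines"]
)


def select_seeds(names: pd.Series, exact=(), patterns=()) -> list:
    """Return the names that equal an exact seed or match any seed regex."""
    # Exact seeds must match the whole category name.
    mask = names.isin(list(exact))
    # Regex seeds may match anywhere in the name (case-insensitive here).
    for pattern in patterns:
        mask = mask | names.str.contains(pattern, case=False, regex=True)
    return names[mask].tolist()


seeds = select_seeds(names, exact=["Coffee"], patterns=[r"\bespresso\b"])
print(seeds)
```

Exact names give tight control over what is included, while a regex is convenient when many related categories share a distinctive word.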

### Finding matching categories

First, let us look for categories relating to a general topic. We use the topic "coffee" as an example. To start with, find all categories whose name includes the word "coffee".

In [12]:
matching_cats = ci.find_matching_categories(regex="\\bcoffee\\b")

In [13]:
len(matching_cats)

158
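
The `\b` anchors in the regex are word boundaries, so the pattern matches "coffee" as a whole word only. The sketch below illustrates this with Python's standard `re` module on a few made-up names. Case-insensitive matching is an assumption made here for the illustration, and a raw string is used so that the backslashes do not need doubling.

```python
import re

# A raw string avoids doubling the backslashes: r"\bcoffee\b" == "\\bcoffee\\b".
# Case-insensitive matching is an assumption of this sketch.
pattern = re.compile(r"\bcoffee\b", flags=re.IGNORECASE)

names = [
    "Coffee",
    "Types of coffee",
    "Coffeehouses",  # "coffee" is followed by a letter, so there is no boundary
    "Coffee County, Alabama",
    "Tea",
]

matches = [name for name in names if pattern.search(name)]
print(matches)
```

Note that "Coffeehouses" is not matched, while "Coffee County, Alabama" is, which is why unrelated place names can appear in a whole-word search.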

Take a look at some of the results. We can see that some categories are related to coffee the drink, but a number of others are unrelated, e.g. "Coffee County".

In [14]:
matching_cats_sample = matching_cats.sample(20)

In [15]:
# Drop index to make display more compact.
display_pd(matching_cats_sample.reset_index(drop=True))

,name,num_pages,num_subcats,parents_visible,subcats_visible,seed,has_parent_in_list
0,Coffee user templates,46,0,"[Drink user templates, Drug user templates, Coffee and tea templates, Coffee and Tea Taskforce templates]",[],Coffee,True
1,"Transportation in Coffee County, Georgia",18,0,"[Coffee County, Georgia, Transportation in Georgia (U.S. state) by county]",[],Coffee,True
2,"Populated places in Coffee County, Alabama",0,3,"[Populated places in Alabama by county, Geography of Coffee County, Alabama]","[Cities in Coffee County, Alabama, Towns in Coffee County, Alabama, Unincorporated communities in Coffee County, Alabama]",Coffee,True
3,Coffee portal,90,0,[Beverage portals],[],Coffee,False
4,Coffee & Tea taskforce members,8,0,"[Wikipedians by WikiProject, Beverages Task Force participants, Coffee and Tea Taskforce]",[],Coffee,True
5,People associated with coffee,0,2,"[People associated with drinks, Coffee]","[Baristas, Businesspeople in coffee]",coffee,True
6,Coffee in North America,5,1,"[North American drinks, Coffee by continent]",[Coffee in the United States],Coffee,True
7,Single-serving coffee,1,2,[Coffee preparation],"[Single-serving coffee containers, Single-serving coffee makers]",coffee,True
8,Coffee in Hawaii,2,0,"[Coffee in Oceania, Hawaiian drinks]",[],Coffee,True
9,Alcoholic coffee drinks,14,2,"[Caffeinated alcoholic drinks, Cocktails by ingredient, Coffee drinks]","[Coffee liqueurs, Liqueur coffee]",coffee,True


Notice also that in many cases both a category and its parent include the word "coffee", e.g. "Coffee in the United States" is a subcategory of "Coffee in North America". The `has_parent_in_list` column indicates that a category has a parent which is also in this result set, i.e. one that also matches our regex. Since we are more interested in higher-level categories, we can ignore those categories for now.
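
As a rough illustration of what such a flag means, the sketch below recomputes it on a toy result set by checking each category's visible parents against the set of names in the results. The function `flag_parent_in_list` is hypothetical; the actual column is produced by `find_matching_categories`, whose internals are not shown in this notebook.

```python
import pandas as pd

# Toy result set; `parents_visible` holds parent names, as in the tables above.
toy_matches = pd.DataFrame(
    {
        "name": ["Coffee", "Coffee by continent", "Coffee in North America"],
        "parents_visible": [
            ["Drinks"],
            ["Coffee"],
            ["North American drinks", "Coffee by continent"],
        ],
    }
)


def flag_parent_in_list(matches: pd.DataFrame) -> pd.Series:
    """True where at least one visible parent is itself in the result set."""
    listed = set(matches["name"])
    return (
        matches["parents_visible"]
        .map(lambda parents: any(p in listed for p in parents))
        .astype(bool)
    )


toy_matches["has_parent_in_list"] = flag_parent_in_list(toy_matches)
# Categories without a listed parent are the top-level candidates.
top_level = toy_matches.loc[~toy_matches["has_parent_in_list"], "name"].tolist()
print(top_level)
```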

In [16]:
matching_cats["has_parent_in_list"].value_counts()

True     136
False     22
Name: has_parent_in_list, dtype: int64

This reduces the space of categories to explore from 158 to 22.

In [17]:
matching_cats_noparent = matching_cats.query("~has_parent_in_list")

In [18]:
display_pd(matching_cats_noparent.reset_index(drop=True))

,name,num_pages,num_subcats,parents_visible,subcats_visible,seed,has_parent_in_list
0,Coffee,14,13,"[Individual psychoactive drugs, Drinks, Arab inventions]","[Coffea, Coffee by continent, Coffee by country, Coffee chemistry, Coffee culture, Coffee industry, Coffee organizations, Coffee preparation, Coffee stubs, History of coffee, People associated with coffee, Types of coffee, Works about coffee]",Coffee,False
1,"Coffee County, Alabama",9,6,[Alabama counties],"[Buildings and structures in Coffee County, Alabama, Education in Coffee County, Alabama, Geography of Coffee County, Alabama, National Register of Historic Places in Coffee County, Alabama, People from Coffee County, Alabama, Transportation in Coffee County, Alabama]",Coffee,False
2,"Coffee County, Georgia",8,6,[Georgia (U.S. state) counties],"[Buildings and structures in Coffee County, Georgia, Education in Coffee County, Georgia, Geography of Coffee County, Georgia, People from Coffee County, Georgia, Tourist attractions in Coffee County, Georgia, Transportation in Coffee County, Georgia]",Coffee,False
3,"Coffee County, Tennessee",8,7,"[Counties of Appalachia, Tennessee counties, Middle Tennessee]","[Bonnaroo Music Festival, Buildings and structures in Coffee County, Tennessee, Education in Coffee County, Tennessee, Geography of Coffee County, Tennessee, People from Coffee County, Tennessee, Tourist attractions in Coffee County, Tennessee, Transportation in Coffee County, Tennessee]",Coffee,False
4,Coffee House Press books,12,0,"[Books by publisher, Books by publishing company of the United States]",[],Coffee,False
5,Coffee Stain Studios games,9,0,[Video games developed in Sweden],[],Coffee,False
6,Coffee and Tea Taskforce,2,5,[Beverages Task Force],"[Coffee & Tea taskforce members, Coffee and Tea Taskforce templates, Coffee and tea logos, Coffee stubs, Tea stubs]",Coffee,False
7,Coffee and tea templates,1,5,[Drink templates],"[Coffee and Tea Taskforce templates, Coffee and tea navigational boxes, Coffee and tea stub templates, Coffee user templates, Tea user templates]",Coffee,False
8,Coffee portal,90,0,[Beverage portals],[],Coffee,False
9,Coffee varieties,18,0,"[Food plant cultivars, Coffea]",[],Coffee,False


From here, it looks like the easiest way to build a category list for the topic "coffee" is to use a list of exact categories:
`["Coffee", "Coffee portal", "Coffee varieties"]`.

### Alternative view: flatten parents

Another view that can be useful when exploring categories is to flatten out the list of parents, with one row per category-parent pair. The parent categories often provide context when the meaning of a category is not immediately clear. This view also shows exactly which of the parents are themselves in the result set.
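
The flattening itself can be sketched with `DataFrame.explode`, which turns each list element into its own row while keeping the original index. This is only an illustration on toy data: `flatten_parents` is a hypothetical helper, not the implementation of `find_parents_listed`.

```python
import pandas as pd

# Toy result set with list-valued parents, as in the tables above.
toy_matches = pd.DataFrame(
    {
        "name": ["Coffee", "Coffee portal", "Coffee by continent"],
        "parents_visible": [
            ["Drinks", "Arab inventions"],
            ["Beverage portals"],
            ["Coffee"],
        ],
    }
)


def flatten_parents(matches: pd.DataFrame) -> pd.DataFrame:
    """One row per (category, parent) pair, flagged if the parent is listed."""
    listed = set(matches["name"])
    return (
        matches[["name", "parents_visible"]]
        .explode("parents_visible")  # one row per list element, index preserved
        .rename(columns={"parents_visible": "parent"})
        .assign(parent_in_list=lambda d: d["parent"].isin(listed))
    )


flat = flatten_parents(toy_matches)
print(flat.reset_index(drop=True))
```

Because the original index is preserved (and repeated) by `explode`, rows for a given set of categories can still be selected with `.loc` on the original index values.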

In [19]:
# Pass in the result set found above
flat_parents = ci.find_parents_listed(matching_cats)

In [20]:
# Restrict to the categories shown in the sample above:
display_pd(flat_parents.loc[matching_cats_sample.index].reset_index(drop=True))

,name,parent,parent_in_list
0,Coffee user templates,Drink user templates,False
1,Coffee user templates,Drug user templates,False
2,Coffee user templates,Coffee and tea templates,True
3,Coffee user templates,Coffee and Tea Taskforce templates,True
4,"Transportation in Coffee County, Georgia","Coffee County, Georgia",True
5,"Transportation in Coffee County, Georgia",Transportation in Georgia (U.S. state) by county,False
6,"Populated places in Coffee County, Alabama",Populated places in Alabama by county,False
7,"Populated places in Coffee County, Alabama","Geography of Coffee County, Alabama",True
8,Coffee portal,Beverage portals,False
9,Coffee & Tea taskforce members,Wikipedians by WikiProject,False


For those without a parent also in the result set:

In [21]:
# Restrict to the categories without a parent in the result set:
display_pd(flat_parents.loc[matching_cats_noparent.index].reset_index(drop=True))

,name,parent,parent_in_list
0,Coffee,Individual psychoactive drugs,False
1,Coffee,Drinks,False
2,Coffee,Arab inventions,False
3,"Coffee County, Alabama",Alabama counties,False
4,"Coffee County, Georgia",Georgia (U.S. state) counties,False
5,"Coffee County, Tennessee",Counties of Appalachia,False
6,"Coffee County, Tennessee",Tennessee counties,False
7,"Coffee County, Tennessee",Middle Tennessee,False
8,Coffee House Press books,Books by publisher,False
9,Coffee House Press books,Books by publishing company of the United States,False


### Find subcategories

Now that we have a seed list of higher-level categories, we can search the category index for subcategories at all levels. We will use this as our final list of categories relevant to the topic of "coffee".

In [22]:
seed_list = ["Coffee", "Coffee portal", "Coffee varieties"]

In [23]:
# Find the category listings matching this seed list.
matching_seed_cats = ci.find_matching_categories(catlist=seed_list)

In [24]:
display_pd(matching_seed_cats)

key,name,num_pages,num_subcats,parents_visible,subcats_visible,seed,has_parent_in_list
Coffee,Coffee,14,13,"[Individual psychoactive drugs, Drinks, Arab inventions]","[Coffea, Coffee by continent, Coffee by country, Coffee chemistry, Coffee culture, Coffee industry, Coffee organizations, Coffee preparation, Coffee stubs, History of coffee, People associated with coffee, Types of coffee, Works about coffee]",Coffee,False
Coffee_portal,Coffee portal,90,0,[Beverage portals],[],Coffee portal,False
Coffee_varieties,Coffee varieties,18,0,"[Food plant cultivars, Coffea]",[],Coffee varieties,False


Run breadth-first search across the directed graph defined by category -> subcategory linkage to find all subcategories.
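
For readers unfamiliar with the traversal, here is a minimal sketch of breadth-first search over a toy category graph. It records the same kind of columns as the output below (`name`, `seed`, `parent`, `level`), but `bfs_subcategories` and the `SUBCATS` mapping are hypothetical and are not the implementation of `category_bfs`. A `seen` set is used so that categories reachable by several paths, or through cycles, are visited only once.

```python
from collections import deque

# Toy category -> subcategories mapping (made-up data).
SUBCATS = {
    "Coffee": ["Coffee culture", "Coffee preparation"],
    "Coffee culture": ["Coffeehouses"],
    "Coffee preparation": ["Coffeehouses", "Espresso"],
    "Coffeehouses": [],
    "Espresso": ["Coffee"],  # a cycle back to the seed
}


def bfs_subcategories(seeds, subcats):
    """Return (name, seed, parent, level) rows in breadth-first order."""
    rows = [(seed, seed, None, 0) for seed in seeds]
    seen = set(seeds)
    queue = deque(rows)
    while queue:
        name, seed, _parent, level = queue.popleft()
        for child in subcats.get(name, []):
            if child in seen:
                continue  # already reached via a shorter or equal path
            seen.add(child)
            row = (child, seed, name, level + 1)
            rows.append(row)
            queue.append(row)
    return rows


rows = bfs_subcategories(["Coffee"], SUBCATS)
for row in rows:
    print(row)
```

Since the queue is processed in first-in, first-out order, each category is recorded at the smallest depth at which it can be reached from a seed.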

In [25]:
%%time

all_cats = ci.category_bfs(matching_seed_cats)

CPU times: user 4.07 s, sys: 512 ms, total: 4.58 s
Wall time: 4.69 s


How many categories were found?

In [26]:
print(f"Num categories found: {len(all_cats):,}")
print(f"Max depth reached: {all_cats['level'].max()}")

Num categories found: 154
Max depth reached: 6


In [27]:
display_pd(all_cats)

,name,seed,parent,level
0,Coffee,Coffee,,0
1,Coffee portal,Coffee portal,,0
2,Coffee varieties,Coffee varieties,,0
3,Coffea,Coffee,Coffee,1
4,Coffee by continent,Coffee,Coffee,1
5,Coffee by country,Coffee,Coffee,1
6,Coffee chemistry,Coffee,Coffee,1
7,Coffee culture,Coffee,Coffee,1
8,Coffee industry,Coffee,Coffee,1
9,Coffee organizations,Coffee,Coffee,1


We may decide that certain topic directions pulled in by the search are not relevant for us, such as `"Coffee culture"` and its subcategories, or all categories beyond a certain depth level. Such cases can be excluded in the BFS algorithm.

It may take some trial and error to tune the list of categories as desired.
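
The effect of such exclusions can be sketched on a toy graph as follows. The parameter names mirror those used in the next cell, but their exact semantics in `category_bfs` are an assumption here: in this sketch an ignored category is neither recorded nor expanded, the regex is matched case-insensitively, and `max_level` is an inclusive depth cap. `bfs_pruned` and `SUBCATS` are hypothetical.

```python
import re
from collections import deque

# Toy category -> subcategories mapping (made-up data).
SUBCATS = {
    "Coffee": ["Coffee culture", "Coffee industry"],
    "Coffee culture": ["Coffee festivals"],
    "Coffee industry": ["Coffeehouses by country", "Coffee companies"],
    "Coffee companies": ["Coffee brands"],
}


def bfs_pruned(seeds, subcats, ignore_cats=(), ignore_re=None, max_level=None):
    """Breadth-first search that skips ignored categories and caps the depth."""
    ignored = set(ignore_cats)
    pattern = re.compile(ignore_re, re.IGNORECASE) if ignore_re else None
    found = {seed: 0 for seed in seeds}  # category name -> level
    queue = deque(seeds)
    while queue:
        name = queue.popleft()
        level = found[name]
        if max_level is not None and level >= max_level:
            continue  # do not expand categories at the depth cap
        for child in subcats.get(name, []):
            if child in found or child in ignored:
                continue
            if pattern is not None and pattern.search(child):
                continue
            found[child] = level + 1
            queue.append(child)
    return found


found = bfs_pruned(
    ["Coffee"],
    SUBCATS,
    ignore_cats=["Coffee culture"],
    ignore_re="coffeehouse",
    max_level=2,
)
print(found)
```

Under these assumptions, the subcategories of an ignored category are dropped only if they cannot be reached through another, non-ignored path.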

In [28]:
all_cats_restricted = ci.category_bfs(
    matching_seed_cats,
    ignore_cats=["Coffee culture"],
    ignore_re="coffeehouse",
    max_level=5
)

Level: 5

For this example, this will be the final list of categories relevant to the topic of "coffee":

In [29]:
display_pd(all_cats_restricted)

,name,seed,parent,level
0,Coffee,Coffee,,0
1,Coffee portal,Coffee portal,,0
2,Coffee varieties,Coffee varieties,,0
3,Coffea,Coffee,Coffee,1
4,Coffee by continent,Coffee,Coffee,1
5,Coffee by country,Coffee,Coffee,1
6,Coffee chemistry,Coffee,Coffee,1
7,Coffee industry,Coffee,Coffee,1
8,Coffee organizations,Coffee,Coffee,1
9,Coffee preparation,Coffee,Coffee,1
