# Exploring Wikipedia content moderation

The goal is to develop a blocklist of NSFW articles on English Wikipedia that we can apply to the Dynamic Wikipedia search index.
This notebook explores some of the Wikipedia resources available that we can use to accomplish this.

In [8]:
import json
import random
import gzip
import codecs
import re
from time import sleep
from pathlib import Path

import pandas as pd
import requests
from tqdm import tqdm

from wikipedia_utils.utils import display_pd
from wikipedia_utils.search_index import IndexStream
from wikipedia_utils.category import CategoryIndex

pd.set_option("display.max_columns", None)
pd.set_option("display.show_dimensions", True)

tqdm.pandas()

We will make use of some assets available in the Wikimedia dumps. These should be downloaded ahead of time using the functionality presented in the other notebooks.

- The full ElasticSearch index for English Wikipedia:
    * <https://dumps.wikimedia.org/other/cirrussearch/20230123/enwiki-20230130-cirrussearch-content.json.gz>
    * __Note: this is a huge file, 35GB compressed__
- The list of English Wikipedia categories and their linkages in RDF (Turtle) format:
    * <https://dumps.wikimedia.org/other/categoriesrdf/20230128/enwiki-20230128-categories.ttl.gz>
    * 82 MB
    * Uses ontology defined at <https://www.mediawiki.org/ontology/ontology.owl>

In [7]:
OUTPUT_DIR = Path("data")
SEARCH_INDEX_DIR = Path("es_data")
CAT_DATA_DIR = Path("category_data")

## Full ES index

The full index file is 35 GB gzipped.
In the ElasticSearch index format, each entry is composed of two lines:
```
{"index": {...}}
{field1: val1, ...}
```
The page information we are interested in is on the second line of each entry.

More details and processing code are presented in `search_index.ipynb`. Run the download portion of that notebook before running the processing steps further down in this one.
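
As a rough illustration of the two-line layout, the sketch below pairs consecutive lines of a gzipped dump and yields only the page records. The function name `iter_page_records` is hypothetical, and this is not the `IndexStream` implementation from `wikipedia_utils` (see `search_index.ipynb` for that).

```python
import gzip
import json


def iter_page_records(path):
    """Yield the page record (second line) of each two-line entry."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # zip(f, f) pulls two consecutive lines per iteration:
        # the {"index": ...} header and the page record that follows it.
        for header_line, record_line in zip(f, f):
            header = json.loads(header_line)
            if "index" not in header:
                # Assumption: every entry starts with an "index" header line.
                raise ValueError(f"Unexpected header line: {header_line!r}")
            yield json.loads(record_line)
```

Because it is a generator, the file is streamed rather than loaded into memory, which matters for a dump of this size.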

## Category listing

The majority of Wikipedia pages belong to one or more categories. Wikipedia categories form a complex taxonomy of labels used to organize content by topic and also to flag page characteristics, eg. pages in need of maintenance.

- Pages are assigned categories by editors adding the category labels in the page source.
- Each category may contain pages or other subcategories. These can be viewed on the category's page, eg. <https://en.wikipedia.org/wiki/Category:Coffee>.
- Parent/child category relationships define a directed graph. As categories are assigned to pages by manual labeling, this graph can't be assumed to be acyclic.
- Some categories are defined to be "hidden" - these are generally related to maintenance (eg. "pages with missing information") rather than topic. Visible categories that a page belongs to are listed at the bottom of the page. Both hidden and visible categories are included in the search index (and are not distinguished).
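
Since the category graph may contain cycles, any traversal has to remember which categories it has already visited. A minimal sketch, assuming a hypothetical `children` dict that maps each category name to a list of its subcategories (this is not the structure used by `CategoryIndex`):

```python
from collections import deque


def descendants(children, root):
    """Return all categories reachable from `root`, excluding `root` itself."""
    seen = {root}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for child in children.get(cat, []):
            # Skipping already-seen categories guarantees termination
            # even when the graph contains cycles.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(root)
    return seen
```

For example, with `children = {"A": ["B", "C"], "C": ["A"]}` the traversal from `"A"` visits `"B"` and `"C"` once each and then stops, despite the `C -> A` back edge.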

Among the Wikipedia dumps is an RDF-formatted list of all categories. This includes counts of pages and subcategories, an indicator for hidden categories, and a listing of categories containing each one as a subcategory (ie a list of parents).

To work with the category data dump, please run `build_category_index.ipynb` to convert to a pandas DataFrame.
Tools for exploration are presented in `explore_categories.ipynb`.
Having previously built the category index as a DF, we can load it here.

In [9]:
ci = CategoryIndex(data_dir=CAT_DATA_DIR)

The DF is available as `ci.cat_df`.

In [12]:
ci.cat_df.sample(5)

key,hidden,name,num_pages,num_subcats,parents,parents_visible,subcats_visible
Steve_Irwin,False,Steve Irwin,32,1,[Wikipedia_categories_named_after_Australian_p...,[],[Irwin family]
Ufa_neighbourhoods,False,Ufa neighbourhoods,4,0,[Ufa],[Ufa],[]
Air_shows_in_China,False,Air shows in China,1,0,"[Air_shows_by_country, Aviation_in_China, Comm...","[Aviation in China, Air shows by country, Even...",[]
Sierra_Leonean_expatriate_sportspeople_in_Kazakhstan,False,Sierra Leonean expatriate sportspeople in Kaza...,1,0,"[CatAutoTOC_generates_no_TOC, Expatriate_sport...",[Expatriate sportspeople in Kazakhstan by nati...,[]
Template-Class_Delhi_articles,False,Template-Class Delhi articles,33,0,"[CatAutoTOC_generates_no_TOC, Delhi_articles_b...","[Delhi articles by quality, Template-Class art...",[]


## Download Wikipedia pages to build filters from

Wikipedia offers an API (the MediaWiki Action API) which can be used to programmatically query various types of page content, such as all subcategories and pages belonging to a category, or the content of a page.
We set up some tooling to query the API.

In [132]:
class WikiQuery:
    """Query Wikipedia's API."""
    # Documentation: https://www.mediawiki.org/wiki/API:Main_page
    WIKIPEDIA_API_URL = "https://en.wikipedia.org/w/api.php"
    should_continue = True
    
    def _query(self, params):
        """Issue a query specified by the parameters to the API.
        
        `self._handle_response_page()` is called on the JSON result.
        If the result set spans multiple pages, this iterates through the pages
        and calls `self._handle_response_page()` on each page.
        """
        while True:
            rj = requests.get(self.WIKIPEDIA_API_URL, params=params).json()
            self._handle_response_page(rj, params)
            
            if self.should_continue and ("continue" in rj):
                params.update(rj["continue"])
                # Simple rate limiting
                sleep(0.5)
            else:
                break
    
    def _handle_response_page(self, response_json, params):
        """Action to take for each page of JSON results.
        
        Eg. accumulate results into a collection member variable.
        """
        pass
    

class CatQuery(WikiQuery):
    """Query Wikipedia's API for category members, including subcategories."""
    # Documentation: https://www.mediawiki.org/wiki/API:Categorymembers
    BASE_PARAMS = {
        "action": "query",
        "list": "categorymembers",
        "cmlimit": 500,
        "format": "json",
        "cmprop": "title|ids|type",
    }
    
    def __init__(self):
        # Mapping of pageid to page info dict
        self.pages = {}
        self.categories = {}
        # List of remaining category names to query
        self.subcat_queue = []
        
    def _handle_response_page(self, response_json, params):
        for x in response_json["query"]["categorymembers"]:
            i = x["pageid"]
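            if x["type"] == "subcat":
                # Only queue subcategories we have not seen yet. Repeated
                # subcategories are skipped rather than recorded as pages.
                if i not in self.categories:
                    # New subcategory. Add to list of known cats and query queue
                    self.categories[i] = x
                    self.subcat_queue.append(x["title"])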
            elif i in self.pages:
                # Previously seen page, also belongs to subcategory.
                # Record subcategory.
                self.pages[i]["categories"].append(params["cmtitle"])
            else:
                # New page
                x["categories"] = [params["cmtitle"]]
                self.pages[i] = x
    
    def get_members(self, category):
        """Get a list of pages included in a category and its subcategories."""
        self.categories[-1] = {"title": category}
        self.subcat_queue.append(category)
        
        while len(self.subcat_queue) > 0:
            cat = self.subcat_queue.pop(0)
            params = dict(self.BASE_PARAMS)
            params["cmtitle"] = cat
            
            # print("cat", params)
            self._query(params)


class TemplateQuery(WikiQuery):
    """Query Wikipedia's API for pages containing a template."""
    # Documentation: https://www.mediawiki.org/wiki/API:Transcludedin
    BASE_PARAMS = {
        "action": "query",
        "prop": "transcludedin",
        "tilimit": 500,
        "format": "json"
    }
    
    def __init__(self):
        # List of page info dicts for pages that transclude the template
        self.pages = []

    def _handle_response_page(self, response_json, params):
        _, results = response_json["query"]["pages"].popitem()
        # "transcludedin" may be absent if no page uses the template
        self.pages.extend(results.get("transcludedin", []))
    
    def get_pages(self, template_title):
        """Get a list of pages using the given template."""
        params = dict(self.BASE_PARAMS)
        params["titles"] = template_title

        self._query(params)


class LinksParse(WikiQuery):
    """Query Wikipedia's API for links on a page."""
    # Documentation: https://www.mediawiki.org/wiki/API:Parsing_wikitext#parse
    BASE_PARAMS = {
        "action": "parse",
        "prop": "links",
        "format": "json"
    }
    
    def __init__(self):
        # List of {"page": ..., "ns": ...}
        self.links = []
    
    def _handle_response_page(self, response_json, params):
        r = response_json["parse"]["links"]
        self.links = [{"page": x["*"], "ns": x["ns"]} for x in r]
    
    def get_page(self, page):
        params = dict(self.BASE_PARAMS)
        params["page"] = page

        self._query(params)


class Namespaces(WikiQuery):
    """Query Wikipedia's API for namespace names & IDs."""
    # Documentation: https://www.mediawiki.org/wiki/API:Siteinfo
    BASE_PARAMS = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "format": "json",
        "formatversion": "2",
    }
    
    def __init__(self):
        # Mapping of <id>: <name>
        self.ns = {}
    
    def _handle_response_page(self, response_json, params):
        r = response_json["query"]["namespaces"]
        self.ns = {int(k): v["name"] for k, v in r.items()}
    
    def get_ns(self):
        params = dict(self.BASE_PARAMS)
        self._query(params)
        # Add label for main pages.
        self.ns[0] = "Article"
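
The continuation logic in `WikiQuery._query()` can be looked at in isolation. The sketch below is a simplified, network-free variant: `paginate` and its `fetch` argument are hypothetical names, where `fetch` stands for any callable that takes a params dict and returns the decoded JSON response. It omits the rate limiting and the `should_continue` flag.

```python
def paginate(fetch, params):
    """Yield each page of results, following "continue" tokens."""
    params = dict(params)  # copy so the caller's dict is not mutated
    while True:
        response = fetch(params)
        yield response
        if "continue" not in response:
            break
        # Merge the continuation tokens into the next request.
        params.update(response["continue"])
```

Unlike `_query()`, this copies `params` before updating it. The classes above get away with in-place updates because each `get_*` method builds a fresh dict from `BASE_PARAMS` for every query.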

### Namespaces

Pull the mapping of [namespace](https://en.wikipedia.org/wiki/Wikipedia:Namespace) ID to namespace name.

In [121]:
nslist = Namespaces()
nslist.get_ns()

In [123]:
NAMESPACES = nslist.ns

### Pull the list of bad images

MediaWiki hosts a [list of offensive images](https://en.wikipedia.org/wiki/MediaWiki:Bad_image_list) that are used across the project. We will use them for filtering Wikipedia pages.

Download the list of image links using the API.
This gives a list of filenames of the form `"File:<filename>"`, which we store as a JSON list.

In [62]:
BAD_IMG_JSON = OUTPUT_DIR / "bad_image_list.json"

In [35]:
%%time

lp = LinksParse()
lp.get_page("MediaWiki:Bad_image_list")

CPU times: user 30.1 ms, sys: 12.3 ms, total: 42.3 ms
Wall time: 263 ms


In [38]:
badimg = [x["page"] for x in lp.links if x["ns"] == 6]

In [63]:
len(badimg)

936

In [64]:
with open(BAD_IMG_JSON, "w") as f:
    json.dump(badimg, f)

In [None]:
# with open(BAD_IMG_JSON) as f:
#     badimg = json.load(f)

### List of pages & subcategories related to controversial topics

The Wikipedia category
[Wikipedia controversial topics](https://en.wikipedia.org/wiki/Category:Wikipedia_controversial_topics)
contains a categorization of pages that are controversial.
These include:

- [__Controversial topics__](https://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues): topics which are disputed, see a lot of circular editing, or are subject to bias.
    * identified by templates like [Template:Controversial](https://en.wikipedia.org/wiki/Template:Controversial) on their Talk page
    * listed in [Category:Wikipedia controversial topics](https://en.wikipedia.org/wiki/Category:Wikipedia_controversial_topics).
- [__Contentious topics__](https://en.wikipedia.org/wiki/Wikipedia:Contentious_topics): specially-designated topics that have attracted persistent disruptive editing. Administrators are allowed to impose additional editing restrictions on these pages.
    * identified by templates or editnotices such as those belonging to [Category:Standardised Wikipedia arbitration enforcement templates](https://en.wikipedia.org/wiki/Category:Standardised_Wikipedia_arbitration_enforcement_templates)
    * listed in [Category:Wikipedia pages about contentious topics](https://en.wikipedia.org/wiki/Category:Wikipedia_pages_about_contentious_topics), a subcategory of Controversial topics.
- __Objectionable content__: content that may be graphically sexual or otherwise objectionable.
    * identified by the use of templates like [Template:Censor](https://en.wikipedia.org/wiki/Template:Censor) on their Talk page
    * listed in [Category:Wikipedia objectionable content](https://en.wikipedia.org/wiki/Category:Wikipedia_objectionable_content), a subcategory of Controversial topics.

These pages contain examples of content that we may wish to block or downweight.

We use the API to pull a full list of pages included in this category or any of its subcategories. For many of these pages, it is the Talk page that belongs to the category rather than the article page, so searching the category list in the ES index will not surface them.

In [66]:
CONTROVERSIAL_SUBCATS_PKL = OUTPUT_DIR / "controversial_subcats.pkl"
CONTROVERSIAL_PAGES_PKL = OUTPUT_DIR / "controversial_pages.pkl"

Took 7 min.

In [133]:
%%time

cq = CatQuery()
cq.get_members("Category:Wikipedia controversial topics")

CPU times: user 1min 14s, sys: 5.48 s, total: 1min 19s
Wall time: 6min 47s


#### Subcategories

Pull the list of subcategories that were discovered nested under the top-level category.

In [134]:
cont_subcats = pd.DataFrame(cq.categories.values())

In [135]:
len(cont_subcats)

1375

In [136]:
display_pd(cont_subcats.sample(5))

Unnamed: 0,title,pageid,ns,type
527,Category:Autobiographical articles from April 2013,38968382.0,14.0,subcat
393,Category:Articles with minor POV problems from October 2015,47971935.0,14.0,subcat
765,Category:Wikipedia articles with possible conflicts of interest from July 2019,61175827.0,14.0,subcat
290,Category:Articles with a promotional tone from May 2019,60621791.0,14.0,subcat
1087,Category:Articles with weasel words from June 2018,57552494.0,14.0,subcat


In [137]:
# All are categories
assert cont_subcats["ns"].dropna().unique() == 14
assert cont_subcats["type"].dropna().unique() == "subcat"

Many of the subcategories relate to specific dates. What are the general category areas?

In [138]:
display_pd(cont_subcats["title"].str.replace(r" from \w+ \d{4}", "", regex=True).drop_duplicates().sort_values())



5                      Category:All Wikipedia neutral point of view disputes
336                            Category:All articles with a promotional tone
470                            Category:All articles with minor POV problems
968                                 Category:All articles with peacock terms
1306    Category:All articles with specifically marked weasel-worded phrases
163                                Category:Articles with a promotional tone
164                                Category:Articles with minor POV problems
471                                     Category:Articles with peacock terms
1143        Category:Articles with specifically marked weasel-worded phrases
472                                      Category:Articles with weasel words
473                                       Category:Articles with wikipuffery
165                                       Category:Autobiographical articles
167       Category:Pseudoscience articles under contentious topics procedure
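
To make the date-stripping step concrete, here is the same pattern applied to single titles with the standard `re` module. The helper name `strip_date_suffix` is only for illustration.

```python
import re

# Matches a suffix such as " from June 2018"
DATE_SUFFIX = re.compile(r" from \w+ \d{4}")


def strip_date_suffix(title):
    """Remove a ' from <Month> <year>' qualifier from a category title."""
    return DATE_SUFFIX.sub("", title)
```

Titles without a dated qualifier, such as `Category:Autobiographical articles`, pass through unchanged, which is why dropping duplicates afterwards collapses each monthly series into its general category.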

Map page titles to canonical URI form (eg. spaces converted to `_`).

In [None]:
# cont_subcats["name"] = cont_subcats["title"].str.replace("^Category:", "", regex=True).str.replace("_", " ")

# cont_subcats = pd.merge(cont_subcats, all_category_info[["name", "key"]], how="left", on="name")

# # Handle the root category explicitly (different format)
# cont_subcats.iloc[0, -1] = cont_subcats.iloc[0]["title"]

In [139]:
cont_subcats.to_pickle(CONTROVERSIAL_SUBCATS_PKL)

#### Pages

Pull the list of pages belonging to any subcategory under the root.

In [140]:
cont_pages = pd.DataFrame(cq.pages.values())

In [141]:
len(cont_pages)

87694

In [142]:
display_pd(cont_pages.sample(5))

Unnamed: 0,pageid,ns,title,type,categories
4256,5243787,11,Template talk:Politics of Syria,page,[Category:Wikipedia articles under general sanctions]
24161,38742369,0,Lisa Giobbi,page,"[Category:Articles with a promotional tone from March 2013, Category:All articles with a promotional tone]"
58071,53747273,0,Shrirang Godbole,page,[Category:Wikipedia articles with possible conflicts of interest from September 2021]
39421,32049685,0,Cold Rock Ice Creamery,page,"[Category:Articles with a promotional tone from April 2021, Category:All articles with a promotional tone]"
10259,68563686,1,Talk:List of anti-vaccination groups,page,[Category:Wikipedia pages about contentious topics]


The pages belong to multiple [namespaces](https://en.wikipedia.org/wiki/Wikipedia:Namespace).
In many cases, the category is applied to the Talk page (odd numbered namespace) rather than the main article page (even numbered namespace).

- The majority of these are articles or article Talk pages.
- There are some templates & categories that are also included.
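
The odd/even convention can be expressed directly: in MediaWiki, a talk namespace has the ID of its subject namespace plus one. A small sketch (the function name is hypothetical):

```python
def subject_namespace(ns):
    """Map a talk namespace ID to its subject namespace ID.

    Subject namespaces (even IDs) are returned unchanged. Negative IDs
    are virtual namespaces (eg. Special) without talk pages, so they are
    also returned unchanged.
    """
    if ns >= 0 and ns % 2 == 1:
        return ns - 1
    return ns
```

For instance, `Talk` (1) maps to the article namespace (0) and `Template talk` (11) maps to `Template` (10).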

In [143]:
cont_pages["namespace"] = cont_pages["ns"].map(NAMESPACES)

In [170]:
cont_pages["namespace"].value_counts()

Article           74086
Talk              13138
Template            128
Template talk       104
Category talk        76
Wikipedia talk       36
User talk            36
Wikipedia            24
Draft talk           17
Module               16
User                 14
Module talk           8
File talk             6
Portal talk           4
Help talk             1
Name: namespace, Length: 15, dtype: int64

Deduce the main page title from the corresponding Talk page title.

In [163]:
cont_pages["main_title"] = (
    cont_pages["title"]
    .str.replace("^([^:]+) talk:", "\\1:", n=1, regex=True)
    .str.removeprefix("Talk:")
)
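
For reference, the same title mapping written for a single string with the standard `re` module, which makes it easy to check on a few titles. The function name `main_title` is chosen here for illustration.

```python
import re


def main_title(title):
    """Deduce the subject page title from a Talk page title."""
    # "Template talk:Foo" -> "Template:Foo"
    title = re.sub(r"^([^:]+) talk:", r"\1:", title, count=1)
    # "Talk:Foo" -> "Foo" (article Talk pages have no subject prefix)
    if title.startswith("Talk:"):
        title = title[len("Talk:"):]
    return title
```

Titles that are already subject pages, such as `Lisa Giobbi`, are returned unchanged.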

In [171]:
cont_pages.to_pickle(CONTROVERSIAL_PAGES_PKL)

In [None]:
# cont_pages = pd.read_pickle(CONTROVERSIAL_PAGES_PKL)

In [173]:
cont_page_catcounts = cont_pages["categories"].explode().value_counts()

In [174]:
cont_page_catcounts.head(10)

Category:All articles with a promotional tone                                         25206
Category:All articles with specifically marked weasel-worded phrases                  17770
Category:Wikipedia pages about contentious topics                                      9326
Category:All Wikipedia neutral point of view disputes                                  7656
Category:All articles with peacock terms                                               3575
Category:Wikipedia controversial topics                                                3561
Category:All articles with minor POV problems                                          1022
Category:Wikipedia articles under general sanctions                                    1010
Category:Wikipedia objectionable content                                                615
Category:Articles with specifically marked weasel-worded phrases from January 2023      465
Name: categories, Length: 10, dtype: int64

In [175]:
cont_page_catcounts.loc[["Category:Wikipedia objectionable content", "Category:Wikipedia pages about contentious topics"]]

Category:Wikipedia objectionable content              615
Category:Wikipedia pages about contentious topics    9326
Name: categories, Length: 2, dtype: int64

## Pull records for potentially controversial pages from search index

__Note:__ Please first run the Download section of `search_index.ipynb` in order to prepare the full index file.

We run through the full index and pull out records for pages:

- belonging to one of the controversial categories
- containing a bad image

In [176]:
# ES index listing for potentially controversial pages
CONTROVERSIAL_INDEX = OUTPUT_DIR / "cirrussearch-controversial-content.json.gz"

In [183]:
CONTROVERSIAL_TITLES = set(cont_pages.query("namespace in ('Article', 'Talk')")["main_title"])
BAD_IMAGES = set([x.removeprefix("File:") for x in badimg])

In [184]:
len(CONTROVERSIAL_TITLES), len(BAD_IMAGES)

(86252, 936)

In [4]:
class ControversialIndexing(IndexStream):
    def _is_controversial_record(self, r):
        j = json.loads(r)
        if j["title"] in CONTROVERSIAL_TITLES:
            return True
        for img in BAD_IMAGES:
            if img in j["source_text"]:
                return True
        return False

    def _process_record(self, line, i):
        if self._is_controversial_record(line):
            self._write_to_output(line)

In [5]:
contind = ControversialIndexing(SEARCH_INDEX_DIR)

Took about 4 hours 20 min. Wrote a gzipped JSON file of 1.3 GB.

In [187]:
%%time

contind.run(output_file=CONTROVERSIAL_INDEX)

13220896it [4:19:10, 850.16it/s]                                                                                                                         

CPU times: user 4h 15min 50s, sys: 1min 41s, total: 4h 17min 31s
Wall time: 4h 19min 11s





In [188]:
print(f"Records kept: {contind.n_kept:,}")

Records kept: 85,740


## Explore potentially controversial pages

Our goal is to develop a strategy for recognizing pages we may want to block or downweight.
We look into options for accomplishing this by exploring the records for the subset of potentially controversial pages pulled above.

First, trim down the records by removing long or irrelevant fields to facilitate loading into memory.

In [189]:
CONTROVERSIAL_INDEX_SHORT = OUTPUT_DIR / "cirrussearch-controversial-content_reduced.json.gz"
CONTROVERSIAL_DF_PKL = OUTPUT_DIR / "controversial_records.pkl"

In [209]:
OBJECTIONABLE_TITLES = (
    cont_pages[
        cont_pages["categories"].map(lambda x: "Category:Wikipedia objectionable content" in x)
    ]["main_title"].to_list()
)
CONTENTIOUS_TITLES = (
    cont_pages[
        cont_pages["categories"].map(lambda x: "Category:Wikipedia pages about contentious topics" in x)
    ]["main_title"].to_list()
)
SANCTIONS_TITLES = (
    cont_pages[
        cont_pages["categories"].map(lambda x: "Category:Wikipedia articles under general sanctions" in x)
    ]["main_title"].to_list()
)

In [210]:
len(OBJECTIONABLE_TITLES), len(CONTENTIOUS_TITLES), len(SANCTIONS_TITLES)

(615, 9326, 1010)

In [211]:
def reduce_record(r):
    FIELDS_KEPT = ["title", "opening_text", "auxiliary_text", "category", "page_id"]

    j = json.loads(r)
    result = {k: j.get(k, "") for k in FIELDS_KEPT}
    # Remove modules (Lua snippets)
    result["template"] = [x for x in j["template"] if x.startswith("Template:")]
    result["bad_img"] = False
    for img in BAD_IMAGES:
        if img in j["source_text"]:
            result["bad_img"] = True
            break
    result["controversial"] = j["title"] in CONTROVERSIAL_TITLES
    result["contentious"] = j["title"] in CONTENTIOUS_TITLES
    result["sanctions"] = j["title"] in SANCTIONS_TITLES
    result["objectionable"] = j["title"] in OBJECTIONABLE_TITLES
    
    return json.dumps(result)


def process_controversial_records(full_index, short_index):
    with gzip.open(short_index, "wt") as fw:
        with gzip.open(full_index, "rt") as fr:
            for i, line in tqdm(enumerate(fr), total=86_000):
                fw.write(reduce_record(line) + "\n")

Took 10 min. Wrote a gzipped JSON file of 125 MB.

In [212]:
%%time

process_controversial_records(CONTROVERSIAL_INDEX, CONTROVERSIAL_INDEX_SHORT)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 85740/86000 [10:02<00:01, 142.29it/s]

CPU times: user 9min 48s, sys: 7.7 s, total: 9min 56s
Wall time: 10min 2s





In [213]:
df_cont = pd.read_json(CONTROVERSIAL_INDEX_SHORT, lines=True)

In [217]:
df_cont.sample(5)

Unnamed: 0,title,opening_text,auxiliary_text,category,page_id,template,bad_img,controversial,contentious,sanctions,objectionable
65671,Barkan Industrial Park,The Barkan Industrial Park (Hebrew: איזור התעש...,[This article may be unbalanced towards certai...,"[All articles with bare URLs for citations, Ar...",5604590,"[Template:Short description, Template:Pagetype...",False,True,True,False,False
5028,Trans Misja,Trans Misja is the fourth studio album by Poli...,[This article does not cite any sources. Pleas...,"[Articles lacking sources from August 2010, Al...",12795102,"[Template:Unreferenced, Template:Ambox, Templa...",False,True,False,False,False
62669,East West (band),East West was an American Christian rock band ...,[This article includes a list of general refer...,"[Articles with short description, Short descri...",5336083,"[Template:Short description, Template:Pagetype...",False,True,False,False,False
34627,Sokikom,Sokikom (so-kee-kom) is a math program where e...,[This article has multiple issues. Please help...,[Articles with a promotional tone from June 20...,31626507,"[Template:Multiple issues, Template:Ambox, Tem...",False,True,False,False,False
71253,Ray Grainger,"Raymond ""Ray"" Grainger is the co-founder and C...",[This article may contain wording that promote...,"[CS1 maint: url-status, Articles with wikipuff...",61814373,"[Template:Puffery, Template:Ambox, Template:In...",False,True,False,False,False


In [224]:
(
    df_cont
    .groupby(["controversial", "contentious", "sanctions", "objectionable", "bad_img"])
    .size()
    .to_frame(name="count")
)

controversial,contentious,sanctions,objectionable,bad_img,count
False,False,False,False,True,196
True,False,False,False,False,75530
True,False,False,False,True,7
True,False,False,True,False,465
True,False,False,True,True,76
True,False,True,False,False,790
True,False,True,True,False,20
True,True,False,False,False,8546
True,True,False,False,True,1
True,True,False,True,False,21


In [226]:
for c in ["controversial", "contentious", "sanctions", "objectionable", "bad_img"]:
    print(f"{c} count: {df_cont[c].sum():,}")

controversial count: 85,544
contentious count: 8,656
sanctions count: 897
objectionable count: 584
bad_img count: 282


Writes ~410 MB.

In [227]:
df_cont.to_pickle(CONTROVERSIAL_DF_PKL)

### Pages about objectionable topics or containing a bad image

These pages are candidates for being blocked outright, as many contain graphic sexual or violent content.

In [228]:
df_obj = df_cont.query("objectionable or bad_img")

In [229]:
len(df_obj)

789

Take a look at the top (visible) categories these pages belong to.

- Majority are related to sexuality

In [254]:
obj_cats = (
    df_obj["category"].explode()
    .value_counts()
    .reset_index(name="count")
    .rename(columns={"index": "name"})
    .assign(prop=lambda d: d["count"] / len(df_obj))
    .merge(ci.cat_df, on="name", how="left")
)
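
The `count` and `prop` columns are the number and share of pages carrying each category. A plain-Python sketch of the same computation on a list of per-page category lists (the function name and input shape are assumptions for illustration, not part of the notebook's tooling):

```python
from collections import Counter


def category_shares(pages):
    """Return {category: share of pages carrying it}, most common first."""
    n = len(pages)
    if n == 0:
        return {}
    # set(cats) counts a category at most once per page
    counts = Counter(cat for cats in pages for cat in set(cats))
    return {cat: count / n for cat, count in counts.most_common()}
```

This differs from `explode().value_counts()` only if a page lists the same category more than once, in which case the pandas version would count it twice.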

In [256]:
# All categories are in the full category list
assert obj_cats["hidden"].isna().sum() == 0

How many unique categories (hidden or visible) are there?

In [269]:
len(obj_cats)

6157

In [310]:
display_pd(obj_cats.query("~hidden")[["name", "count", "prop"]].head(20))

Unnamed: 0,name,count,prop
35,Sexual acts,40,0.050697
56,Living people,25,0.031686
77,Sex positions,20,0.025349
93,English profanity,16,0.020279
102,English words,14,0.017744
105,Penis,13,0.016477
107,Sexual fetishism,13,0.016477
111,Sexual slang,12,0.015209
118,Pornography terminology,12,0.015209
126,Human sexuality,11,0.013942


For all these categories, find the visible parent categories they belong to.
Look at that distribution for a higher-level view.

- Along with sexuality, we see some categories related to political ideology and violence.

In [276]:
display_pd(
    obj_cats["parents_visible"].explode()
    .value_counts()
    .reset_index(name="count")
    .rename(columns={"index": "name"})
    .assign(prop=lambda d: d["count"] / len(df_obj))
    .head(30)
)

Unnamed: 0,name,count,prop
0,Articles with authority control information,62,0.07858
1,Births by year,58,0.073511
2,Stub categories,44,0.055767
3,Deaths by year,42,0.053232
4,Human sexuality,26,0.032953
5,Organizations designated as terrorist by designator,24,0.030418
6,Sexuality and society,23,0.029151
7,Films by year,22,0.027883
8,Songs by songwriter,21,0.026616
9,Songs by artist,17,0.021546


What does the distribution of templates look like?

- Looking across the full list of templates, these look less informative in helping us identify objectionable content.

In [312]:
df_obj["template"].explode().nunique()

3399

In [311]:
display_pd(
    df_obj["template"].explode()
    .value_counts()
    .reset_index(name="count")
    .rename(columns={"index": "name"})
    .assign(prop=lambda d: d["count"] / len(df_obj))
    .head(20)
)

Unnamed: 0,name,count,prop
0,Template:Main other,788,0.998733
1,Template:Reflist/styles.css,762,0.965779
2,Template:Reflist,762,0.965779
3,Template:Short description,715,0.90621
4,Template:Short description/lowercasecheck,715,0.90621
5,Template:SDcat,715,0.90621
6,Template:Pagetype,710,0.899873
7,Template:Cite web,675,0.855513
8,Template:Hlist/styles.css,658,0.833967
9,Template:Navbox,571,0.723701


### Pages about contentious topics

These pages are subject to disruptive editing and stricter editorial rules or restrictions.

- The set of topics is broad, making it difficult to select representative categories.

In [262]:
df_contentious = df_cont.query("contentious")

In [263]:
len(df_contentious)

8656

Take a look at the top (visible) categories these pages belong to.

In [264]:
contentious_cats = (
    df_contentious["category"].explode()
    .value_counts()
    .reset_index(name="count")
    .rename(columns={"index": "name"})
    .assign(prop=lambda d: d["count"] / len(df_contentious))
    .merge(ci.cat_df, on="name", how="left")
)

In [270]:
len(contentious_cats)

41064

There are a few categories not appearing in the full category list. We treat these as visible categories.

In [271]:
print(f"Unknown category: {contentious_cats['hidden'].isna().sum()}")

Unknown category: 15


In [None]:
# contentious_cats.query("hidden.isna()")[["name", "count"]]

In [272]:
contentious_cats["hidden"] = contentious_cats["hidden"].fillna(False).astype(bool)

Topics that emerge are:

- Conflicts, especially in the Middle East
- COVID-19
- Influential figures, such as politicians & writers
- Members of the LGBTQ community

In [314]:
display_pd(contentious_cats.query("~hidden")[["name", "count", "prop"]].head(20))

Unnamed: 0,name,count,prop
9,Living people,1616,0.186691
45,Municipalities of the State of Palestine,377,0.043554
47,Arab villages depopulated during the 1948 Arab–Israeli War,342,0.03951
56,Villages in the West Bank,278,0.032116
71,COVID-19 pandemic by country,213,0.024607
72,Transgender women,213,0.024607
78,21st-century American politicians,197,0.022759
93,21st-century LGBT people,164,0.018946
126,Israeli settlements in the West Bank,100,0.011553
138,American conspiracy theorists,95,0.010975


For all these categories, find the visible parent categories they belong to.
Look at that distribution for a higher-level view.

In [278]:
display_pd(
    contentious_cats["parents_visible"].explode()
    .value_counts()
    .reset_index(name="count")
    .rename(columns={"index": "name"})
    .assign(prop=lambda d: d["count"] / len(df_contentious))
    .head(30)
)

Unnamed: 0,name,count,prop
0,Stub categories,267,0.030846
1,2020 by country,242,0.027957
2,2021 by country,235,0.027149
3,Treaties by country,199,0.02299
4,Births by year,194,0.022412
5,COVID-19 pandemic by country,189,0.021835
6,Disease outbreaks by country,160,0.018484
7,Deaths by year,149,0.017213
8,Wars by country,88,0.010166
9,2022 by country,88,0.010166


Similarly, the distribution of templates is not very informative for our purposes.

In [315]:
df_contentious["template"].explode().nunique()

13698

In [281]:
display_pd(df_contentious["template"].explode().value_counts()[:20])

Template:Main other                          8639
Template:Reflist                             8422
Template:Reflist/styles.css                  8422
Template:Short description                   7703
Template:Short description/lowercasecheck    7703
Template:SDcat                               7703
Template:Pagetype                            7652
Template:Cite web                            7463
Template:Hlist/styles.css                    7409
Template:Cite news                           6430
Template:Navbox                              4961
Template:Ns has subpages                     4923
Template:FULLROOTPAGENAME                    4923
Template:Dated maintenance category          4923
Template:DMCA                                4917
Template:Cite book                           4689
Template:Yesno                               4579
Template:Template other                      4112
Template:Category handler                    4072
Template:Plainlist/styles.css                4048


Take a look at top-level categories in the list.

In [381]:
contentious_cats_info = find_matching_categories(catlist=contentious_cats["name"], nonempty=True)

In [391]:
display_pd(
    contentious_cats_info.query("~parent_matches").sample(20)
    .reset_index(drop=True)[["name", "subcats_visible"]]
)

| | name | subcats_visible |
|---|---|---|
| 0 | Cuyahoga Community College alumni | [Tri-C Triceratops baseball players] |
| 1 | Nicki Minaj | [Nicki Minaj album covers, Nicki Minaj albums, Nicki Minaj audio samples, Nicki Minaj concert tours, Nicki Minaj songs, Songs written by Nicki Minaj] |
| 2 | Universiade gold medalists in athletics (track and field) | [] |
| 3 | Opposition Platform — For Life politicians | [] |
| 4 | 21st century in Rochester, New York | [] |
| 5 | Canadian social commentators | [] |
| 6 | 2020 in North Dakota | [2020 North Dakota elections, 2020 disestablishments in North Dakota, 2020 in sports in North Dakota] |
| 7 | Colorado law | [Cannabis law in Colorado, Capital punishment in Colorado, Colorado General Assembly, Colorado ballot measures, Colorado state case law, Colorado state courts, Colorado statutes, Constitution of Colorado, Courthouses in Colorado, Crime in Colorado, Criminals from Colorado, LGBT rights in Colorado, Law enforcement in Colorado, Law firms based in Colorado, Law schools in Colorado, Legal history of Colorado] |
| 8 | Argentine television personalities | [Argentine television chefs, Argentine television journalists, Argentine television presenters, Argentine television talk show hosts, Participants in Argentine reality television series] |
| 9 | Members of the Assembly of Experts | [Speakers of the Assembly of Experts] |


## Develop list of categories to block

Here we put together a list of categories to block by inspecting the categories associated with objectionable pages.

Looking through the categories, we observe some high-level topics emerging.
For these categories, we explored subcategories and parent categories to get an idea of the level of generality to target.

Based on these observations, we identify the following topics to consider blocking.
For each topic, we build a list of Wikipedia categories, and we plan to block pages that belong to any of these categories.

In order to make the category listing comprehensive and reproducible, the list is built as follows.
For each topic, we identify a collection of seed terms.
The list will contain all categories that match the seed terms, as well as all of their subcategories, subject to the possible exclusions described below. (In some cases, we may cut off subcategories at a certain depth to avoid branching off into related topics that arguably should not be blocked.)

More details, as well as the code to build the list of blocked categories, are given in `build_blocklist.ipynb`.
The current list of category seeds is specified in `moderation_category_seeds.yml`.
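
The construction described above can be sketched as a breadth-first traversal of the category graph. This is only an illustration under stated assumptions, not the code in `build_blocklist.ipynb`: the `subcats` mapping, the helper name `expand_seeds`, the substring matching rule, and the toy category names are all hypothetical.

```python
from collections import deque


def expand_seeds(subcats, seed_terms, max_depth=None, exclude=frozenset()):
    """Return categories matching any seed term, plus their subcategories.

    subcats: hypothetical dict mapping a category name to its direct subcategories.
    seed_terms: lowercase substrings matched against category names (assumed rule).
    max_depth: optional limit on how many levels below a match to descend.
    exclude: categories to skip, together with everything reachable only through them.
    """
    # Every category name that appears anywhere in the graph.
    all_cats = set(subcats) | {c for subs in subcats.values() for c in subs}
    matched = {
        c for c in all_cats
        if c not in exclude and any(term in c.lower() for term in seed_terms)
    }

    blocked = set()
    # FIFO queue, so each category is first reached at its smallest depth.
    queue = deque((c, 0) for c in sorted(matched))
    while queue:
        cat, depth = queue.popleft()
        if cat in blocked or cat in exclude:
            continue
        blocked.add(cat)
        if max_depth is not None and depth >= max_depth:
            continue  # depth cutoff reached: keep the category, skip its children
        for sub in subcats.get(cat, []):
            queue.append((sub, depth + 1))
    return blocked


# Toy graph with made-up category names.
toy_subcats = {
    "Torture": ["Torture devices", "Anti-torture activism"],
    "Torture devices": ["Medieval instruments"],
    "Medieval instruments": ["Museums"],
}
blocked = expand_seeds(
    toy_subcats, ["torture"], max_depth=1, exclude={"Anti-torture activism"}
)
print(sorted(blocked))
```

With `max_depth=1` the traversal stops one level below each matched category, so `Museums` is left out; passing `max_depth=None` would include it.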


__Topics to consider blocking:__

- Sexuality
    * sex acts, crime or violence-related
    * not health-related or societal
- Pornography/erotica
    * not anti-pornography
- Profanity
- Pejoratives & slurs
- Cruelty
    * torture, child abuse
    * not anti-abuse, works of fiction
- Hateful ideology

The following topics may also be considered objectionable, but may not be good candidates for blocking due to their breadth or historical context. We can consider downweighting them instead:

- Prejudice/discrimination
- Violence
    * including terrorism, genocide, mass shootings, targeted
- Abuse
    * including bullying, harassment

## Pull records matching against the blocklist

Given a list of blocked categories, we extract the subset of pages belonging to these categories from the full ES index.
This can be done by running the corresponding section of `search_index.ipynb`.
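
As a rough illustration of this filtering step (the actual implementation lives in `search_index.ipynb`), the sketch below streams the two-line entry format and keeps pages whose categories intersect the blocklist. It assumes each page record stores its category names as a list under a `category` key; the helper name `iter_blocked_pages` and the sample records are made up for illustration.

```python
import json


def iter_blocked_pages(lines, blocked_categories):
    """Yield page records that belong to at least one blocked category.

    lines: iterable of JSON strings, alternating between an {"index": ...}
        action line and the page record itself.
    blocked_categories: set of category names to match against.
    """
    for line in lines:
        record = json.loads(line)
        # Skip the action line that precedes each page record.
        if "index" in record and len(record) == 1:
            continue
        # Assumption: categories are listed under the "category" key.
        if blocked_categories.intersection(record.get("category", [])):
            yield record


# Made-up records in the same two-line layout.
sample_lines = [
    '{"index": {"_id": "1"}}',
    '{"title": "Page A", "category": ["Cat X", "Cat Y"]}',
    '{"index": {"_id": "2"}}',
    '{"title": "Page B", "category": ["Cat Z"]}',
]
blocked_titles = [p["title"] for p in iter_blocked_pages(sample_lines, {"Cat Y"})]
print(blocked_titles)
```

For the real dump, `lines` could be a file handle opened with `gzip.open(path, "rt")`, so the 35 GB file is never loaded into memory at once.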

# Appendix

## Generate a sample from the full index

Pull a 1% sample that is more manageable for exploratory analysis, ignoring the initial `index` objects for each entry.

In [582]:
SAMPLE_INDEX = OUTPUT_DIR / "enwiki-20230123-cirrussearch-sample.json.gz"

In [6]:
class IndexSampler(IndexStream):
    def _process_record(self, line, i):
        if random.random() < 0.01:
            self._write_to_output(line)

Sampling took about 10 minutes and wrote a gzipped JSON file of roughly 400 MB.

In [587]:
%%time

idxs = IndexSampler()
idxs.run(output_file=SAMPLE_INDEX)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████▊| 13220896/13250000 [09:11<00:01, 23953.17it/s]

CPU times: user 8min 23s, sys: 19.6 s, total: 8min 42s
Wall time: 9min 11s





In [588]:
print(f"Records kept: {idxs.n_kept:,}")

Records kept: 66,621
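
For readers without `IndexStream` at hand, the same sampling idea can be sketched on plain lines. The helper `sample_records` is hypothetical, and it assumes the action line always comes first in each two-line entry. Unlike the cell above, which uses the unseeded global generator, it accepts a seed so that a sample can be reproduced.

```python
import random


def sample_records(lines, rate=0.01, seed=None):
    """Yield roughly a `rate` fraction of page records from two-line entries.

    Assumes lines alternate between an action line and a page record,
    with the action line first; action lines are always dropped.
    """
    rng = random.Random(seed)
    for i, line in enumerate(lines):
        if i % 2 == 0:
            continue  # action line: {"index": {...}}
        if rng.random() < rate:
            yield line


# Synthetic input: even positions play the role of action lines.
fake_lines = [f"line {i}" for i in range(20000)]
kept = list(sample_records(fake_lines, rate=0.01, seed=0))
print(f"Records kept: {len(kept):,} of {len(fake_lines) // 2:,}")
```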


## Download bad word lists to consider for filtering

We will try out using some lists of "bad words" to detect objectionable Wikipedia content.

These lists are quite broad and contain a number of terms that are either generally not "bad" or whose "badness" depends on context. If we use this approach, we would need to curate a much more targeted list.

In [589]:
WORDLISTS_JSON = OUTPUT_DIR / "word_lists.json"

# Older word list, pretty widely used for filtering online comments
LDNOOBW_LIST_SRC = "https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
# Word list from CMU research
CMU_LIST_SRC = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
# Word list used by GitHub Copilot
COPILOT_LIST_SRC = "https://moyix.net/~moyix/copilot_slurs_rot13.txt"

In [590]:
ldnoobw_list = requests.get(LDNOOBW_LIST_SRC).text.splitlines()

In [591]:
cmu_list = requests.get(CMU_LIST_SRC).text.strip().splitlines()

In [607]:
copilot_list = requests.get(COPILOT_LIST_SRC).text.split("===")[0]
copilot_list = codecs.decode(copilot_list, "rot_13").splitlines()
copilot_list = [x for x in sorted(copilot_list) if not x.startswith("<")]

In [608]:
wordlists = {
    "ldnoobw": ldnoobw_list,
    "cmu": cmu_list,
    "copilot": copilot_list,
}

In [609]:
with open(WORDLISTS_JSON, "w") as f:
    json.dump(wordlists, f)

In [610]:
{k: len(v) for k, v in wordlists.items()}

{'ldnoobw': 403, 'cmu': 1383, 'copilot': 1023}

Intersection & union sizes:

In [611]:
len(set(wordlists["ldnoobw"]) & set(wordlists["cmu"]) & set(wordlists["copilot"]))

50

In [612]:
print(len(set(wordlists["ldnoobw"]) & set(wordlists["cmu"])))
print(len(set(wordlists["ldnoobw"]) & set(wordlists["copilot"])))
print(len(set(wordlists["cmu"]) & set(wordlists["copilot"])))

133
66
383


In [613]:
len(set(wordlists["ldnoobw"]) | set(wordlists["cmu"]) | set(wordlists["copilot"]))

2277

In [614]:
ALL_BADWORDS = set(wordlists["ldnoobw"]) | set(wordlists["cmu"]) | set(wordlists["copilot"])

In [615]:
ALL_BADWORDS_RE = re.compile("(" + "|".join([fr"\b{re.escape(w)}\b" for w in ALL_BADWORDS]) + ")")
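
To show how a whole-word pattern like `ALL_BADWORDS_RE` behaves, here is a small sketch using harmless placeholder words. The helper name `build_word_regex`, the placeholder terms, and the choices to drop empty entries, order alternatives longest-first, and ignore case are assumptions for illustration, not part of the cells above.

```python
import re


def build_word_regex(words, ignore_case=True):
    """Compile one alternation that matches any of `words` as a whole word."""
    # Drop empty or whitespace-only entries: an empty alternative would
    # match at every word boundary.
    cleaned = {w.strip() for w in words if w.strip()}
    # Longest first, so multi-word phrases win over their own prefixes.
    ordered = sorted(cleaned, key=lambda w: (-len(w), w))
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE if ignore_case else 0)


placeholder_re = build_word_regex(["darn", "heck", "darn it", ""])
matches = placeholder_re.findall("Darn it, what the heck is a darning needle?")
print(matches)
```

The word boundaries keep `darn` from matching inside `darning`, which is the main reason to prefer this over plain substring checks when scanning article text.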