# Build the Wikipedia category index for processing

Fetch the latest English Wikipedia category RDF index file and convert it to a pandas DataFrame.
We will use this DataFrame to explore the categories and build the blocklist.

In [1]:
from pathlib import Path

from wikipedia_utils import category

Data files will be written to the following subdirectory under predefined filenames.

In [2]:
DATA_DIR = Path("category_data")
DATA_DIR.mkdir(parents=True, exist_ok=True)

## Download the index file

Wikipedia dumps include an RDF-formatted list of all categories. For each category, it gives the counts of pages and subcategories, an indicator for hidden categories, and the list of categories that contain it as a subcategory (i.e., its parents).

Download the latest version of this file. It is usually around 80-85 MB.

In [3]:
latest_dump = category.get_latest_category_dump_date()
print(f"Latest dump date: {latest_dump}")

Latest dump date: 20230325


In [4]:
%%time

category.fetch_category_index(latest_dump, data_dir=DATA_DIR)

CPU times: user 308 ms, sys: 402 ms, total: 710 ms
Wall time: 19.6 s


In [5]:
rdf_file = DATA_DIR / category.CATEGORY_RDF_FILE

In [6]:
print(f"Index file size: {rdf_file.stat().st_size / (1024*1024):.1f} MB")

Index file size: 81.8 MB


In [7]:
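# IPython shell capture: count the lines in the gzipped file without unpacking it to disk.
nl = ! zgrep -c '^' $rdf_file
# The captured output is a list of strings; the first entry holds the count.
nl = int(nl[0])
print(f"Num lines: {nl:,}")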

Num lines: 21,592,188
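
The `zgrep` call above relies on IPython's shell escape and on `zgrep` being installed. Where that is not available, a rough pure-Python equivalent is sketched below. The helper name `count_gzip_lines` is hypothetical and is not part of `wikipedia_utils`.

```python
import gzip
from pathlib import Path


def count_gzip_lines(path: Path, chunk_size: int = 1 << 20) -> int:
    """Count newline characters in a gzip-compressed file, reading in chunks.

    Hypothetical helper, not part of wikipedia_utils. Unlike `zgrep -c '^'`,
    a final line that lacks a trailing newline is not counted.
    """
    n_lines = 0
    with gzip.open(path, "rb") as f:
        # Read fixed-size chunks of decompressed bytes to keep memory use flat.
        while chunk := f.read(chunk_size):
            n_lines += chunk.count(b"\n")
    return n_lines
```

Calling `count_gzip_lines(rdf_file)` should then give the same figure as the shell command, provided the file ends with a newline.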


## Convert to JSON

The index file lists categories and their linkages in RDF (Turtle) format.
It contains two types of records, with the properties defined in [this ontology](https://www.mediawiki.org/ontology/ontology.owl) (view the page source to read the definitions):
- category definitions with label (human-readable name) and counts of pages & subcategories

```
<https://en.wikipedia.org/wiki/Category:Coffee_preparation> a mediawiki:Category ;
    rdfs:label "Coffee preparation" ;
    mediawiki:pages "54"^^xsd:integer ;
    mediawiki:subcategories "3"^^xsd:integer .
```

- category linkages with a list of parents

```
<https://en.wikipedia.org/wiki/Category:Coffee_preparation> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:Coffee>,
    <https://en.wikipedia.org/wiki/Category:Commons_category_link_is_on_Wikidata>,
    <https://en.wikipedia.org/wiki/Category:Food_and_drink_preparation> .
```

We use some light parsing to convert the dump to a more manageable, pandas-friendly format. Because the file is so large (over 21 million lines), this is easier than loading it as a graph with an RDF library. The two record types are written to separate JSON files:
- category info: page URI (canonical form), name (plain text), number of pages, number of subcategories, and whether the category is hidden
- linkage list: page URI and the list of page URIs of its parent categories

In [8]:
%%time

category.parse_category_index(data_dir=DATA_DIR)

CPU times: user 45.4 s, sys: 517 ms, total: 45.9 s
Wall time: 46.4 s
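
To give a feel for what "light parsing" can mean here, the sketch below groups lines into Turtle statements and pulls out fields with regular expressions, using the two sample records shown earlier. It is only an illustration under stated assumptions and not the actual `parse_category_index` implementation: the function names, the output keys, and the reliance on each statement ending in ` .` are all assumptions, and the hidden-category indicator is left out because the samples do not show how it is encoded.

```python
import re

CATEGORY_PREFIX = "https://en.wikipedia.org/wiki/Category:"
URI_RE = re.compile(r"<([^>]+)>")
LABEL_RE = re.compile(r'rdfs:label "((?:[^"\\]|\\.)*)"')
PAGES_RE = re.compile(r'mediawiki:pages "(\d+)"')
SUBCATS_RE = re.compile(r'mediawiki:subcategories "(\d+)"')


def iter_statements(lines):
    """Group physical lines into statements (assumed to end with ' .')."""
    buffer = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("@prefix"):
            continue
        buffer.append(line)
        if line.endswith(" ."):
            yield " ".join(buffer)
            buffer = []


def parse_statement(stmt):
    """Return ('info', dict), ('linkage', dict), or None for anything else."""
    uris = URI_RE.findall(stmt)
    if not uris or not uris[0].startswith(CATEGORY_PREFIX):
        return None
    subject = uris[0]
    if " a mediawiki:Category" in stmt:
        label = LABEL_RE.search(stmt)
        pages = PAGES_RE.search(stmt)
        subcats = SUBCATS_RE.search(stmt)
        return "info", {
            "uri": subject,
            "name": label.group(1) if label else None,
            "pages": int(pages.group(1)) if pages else None,
            "subcategories": int(subcats.group(1)) if subcats else None,
        }
    if "mediawiki:isInCategory" in stmt:
        # Every URI after the subject is a parent category.
        return "linkage", {"uri": subject, "parents": uris[1:]}
    return None


SAMPLE = """\
<https://en.wikipedia.org/wiki/Category:Coffee_preparation> a mediawiki:Category ;
    rdfs:label "Coffee preparation" ;
    mediawiki:pages "54"^^xsd:integer ;
    mediawiki:subcategories "3"^^xsd:integer .
<https://en.wikipedia.org/wiki/Category:Coffee_preparation> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:Coffee>,
    <https://en.wikipedia.org/wiki/Category:Commons_category_link_is_on_Wikidata>,
    <https://en.wikipedia.org/wiki/Category:Food_and_drink_preparation> .
""".splitlines()

records = [parse_statement(s) for s in iter_statements(SAMPLE)]
for kind, record in records:
    print(kind, record)
```

Processing one statement at a time keeps memory use flat, which is the main appeal over building a full RDF graph for a file of this size.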


## Convert to DataFrame

We can now read these JSON files into pandas DataFrames and recombine them into a single DataFrame. The result is written to a pickle file, usually around 500 MB.

In [9]:
%%time

category.create_combined_df(data_dir=DATA_DIR)

Unknown parents: 4
CPU times: user 1min 4s, sys: 4.79 s, total: 1min 9s
Wall time: 1min 11s
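
The recombination step can be pictured as a left join of the linkage records onto the category info, followed by a check for parents that have no info record of their own, which is presumably what the "Unknown parents" count reports. The sketch below uses toy records with made-up URIs and counts. The column names and the join logic are assumptions and not necessarily what `create_combined_df` does.

```python
import pandas as pd

# Toy category info records (hypothetical URIs and counts).
info = pd.DataFrame(
    [
        {"uri": "Category:A", "name": "A", "pages": 1, "subcategories": 2},
        {"uri": "Category:B", "name": "B", "pages": 5, "subcategories": 0},
        {"uri": "Category:C", "name": "C", "pages": 7, "subcategories": 0},
    ]
)

# Toy linkage records: each category with its list of parents.
linkage = pd.DataFrame(
    [
        {"uri": "Category:B", "parents": ["Category:A", "Category:Missing"]},
        {"uri": "Category:C", "parents": ["Category:A"]},
    ]
)

# Attach parent lists to the info table, keeping every category.
combined = info.merge(linkage, on="uri", how="left")

# Categories without a linkage record get an empty parent list.
combined["parents"] = combined["parents"].apply(
    lambda p: p if isinstance(p, list) else []
)

# Parents that never appear as a category in the info table.
known = set(combined["uri"])
all_parents = set(combined["parents"].explode().dropna())
unknown_parents = all_parents - known
print(f"Unknown parents: {len(unknown_parents)}")
```

The combined table could then be saved with `combined.to_pickle(...)`, which preserves the list-valued `parents` column without any extra serialization step.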


In [10]:
! ls -lh $DATA_DIR

total 701M
-rw-r--r-- 1 dzeber staff 525M Mar 31 20:08 category_df.pkl
-rw-r--r-- 1 dzeber staff  82M Mar 31 20:06 category_index.ttl.gz
-rw-r--r-- 1 dzeber staff  30M Mar 31 20:06 category_info.json.gz
-rw-r--r-- 1 dzeber staff  40M Mar 31 20:06 category_linkage.json.gz
