# Download and process the Wikipedia CirrusSearch index

This notebook works with the actual search index dump from Wikipedia, which backs the Dynamic Wikipedia feature.
We download the index locally and stream through it to extract subsets.

## Warning! Do not run this full notebook unless really necessary

__This will download the full ES index dump and save it to disk.
The file is ~35 GB. Make sure you have enough free disk space.
It may take several hours to run.__
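
A quick way to check free disk space before starting is sketched below using only the standard library. The helper name `has_free_space` and the 40 GB threshold are illustrative assumptions and are not part of `wikipedia_utils`.

```python
import shutil
from pathlib import Path


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    # shutil.disk_usage needs an existing path, so walk up to the nearest existing parent.
    p = Path(path).resolve()
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free >= required_bytes


# Hypothetical threshold: the ~35 GB dump plus some headroom.
REQUIRED_BYTES = 40 * 1024**3
if not has_free_space("es_data", REQUIRED_BYTES):
    print("Warning: less than 40 GB free; the download may not fit.")
```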

In [81]:
from pathlib import Path
import json

import pandas as pd

from wikipedia_utils import search_index

Data files will be written to the following subdirectory using predefined filenames.

In [3]:
SEARCH_INDEX_DIR = Path("es_data")
SEARCH_INDEX_DIR.mkdir(parents=True, exist_ok=True)

In [75]:
BLOCKED_CATS_CSV = Path("category_data") / "blocklist_cats.csv"
BLOCKED_PAGES_INDEX = SEARCH_INDEX_DIR / "blocked_cirrussearch.json.gz"

## Download the search index dump

The full index file is about 35 GB gzipped.
In the Elasticsearch index format, each entry is composed of two lines:
```
{"index": {...}}
{field1: val1, ...}
```
The page information we are interested in is on the second line of each entry.
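
To illustrate the two-line layout, here is a minimal sketch of reading such a file in pairs. It assumes the lines strictly alternate between header and content, and `iter_page_records` is a hypothetical name, not a function from `wikipedia_utils`.

```python
import gzip
import json


def iter_page_records(path):
    """Yield (index_header, page_record) dict pairs from a gzipped bulk-format dump."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # Lines alternate: an {"index": ...} header, then the page content.
        # zip(f, f) pulls two consecutive lines from the same file iterator per step.
        for header_line, content_line in zip(f, f):
            yield json.loads(header_line), json.loads(content_line)
```

Because the file is read lazily line by line, memory use stays small regardless of the size of the dump.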

Download the latest version of the file. It is usually around 35 GB, so the download may take a long time (about 2.5 hours in the run recorded here).

In [58]:
latest_dump_date, latest_dump_size = search_index.get_latest_search_dump_date()

In [78]:
print(f"Latest dump date: {latest_dump_date}")
print(f"Index file size: {latest_dump_size / (1024*1024*1024):.2f} GB")

Latest dump date: 20230403
Index file size: 34.45 GB


In [62]:
%%time

search_index.fetch_search_index(latest_dump_date, data_dir=SEARCH_INDEX_DIR)

CPU times: user 1min 57s, sys: 3min 3s, total: 5min 1s
Wall time: 2h 31min 55s


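Read through the file to count records and get an idea of timing. This took 7-8 min. Note that the count covers page records only: the `index` lines are skipped, so the result is half the number of lines streamed.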

Processing the full ES index can be done by extending `search_index.IndexStream`. This handles streaming through the file, skipping the `index` lines, and optionally outputting a subset of the records to a separate file (`.json.gz`).

In [72]:
class LineCount(search_index.IndexStream):
    def _process_record(self, line, i):
        self.n_kept += 1

In [73]:
lc = LineCount(SEARCH_INDEX_DIR)

In [74]:
%%time

lc.run()

13277592it [07:52, 28071.37it/s]

CPU times: user 7min 26s, sys: 18.9 s, total: 7min 45s
Wall time: 7min 53s





In [79]:
print(f"Num lines: {lc.n_kept:,}")

Num lines: 6,638,796


## Pull Wikipedia records matching the blocklist

Having built the list of categories whose Wikipedia articles should be blocked from Dynamic Wikipedia, we can now extract the matching pages from the full ES index file.
This gives us the number of affected pages and a subset on which we can run further validation.

First load the list of blocked categories.

In [82]:
bl_df = pd.read_csv(BLOCKED_CATS_CSV)

In [85]:
class BlockIndex(search_index.IndexStream):
    def __init__(self, data_dir, blocked_cats):
        super().__init__(data_dir)
        self.blocked_cats = pd.Series(blocked_cats)

    def _process_record(self, line, i):
        # Parse each index record and check if it has one of the blocked categories.
        j = json.loads(line)
        if self.blocked_cats.isin(j["category"]).any():
            # If so, write to the file containing the subset of blocked pages.
            self._write_to_output(line)

Running this takes ~30 min and produces a `.json.gz` file of 300-400 MB.

The number of blocked pages is ~30,000, or ~0.5% of English Wikipedia.
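
The category test in `_process_record` could also be written with a plain Python set, which avoids calling into pandas for every record and may reduce per-record overhead (not benchmarked here). A minimal sketch, assuming `category` holds a list of strings and may be absent from some records; `has_blocked_category` and the category names are hypothetical:

```python
import json


def has_blocked_category(line, blocked_cats):
    """Return True if the JSON record in `line` lists any category in `blocked_cats`."""
    record = json.loads(line)
    # .get() tolerates records without a "category" field (an assumption about the data).
    return not blocked_cats.isdisjoint(record.get("category", []))


blocked = frozenset(["Category A", "Category B"])  # placeholder names
print(has_blocked_category('{"title": "T", "category": ["Category B", "Other"]}', blocked))  # True
print(has_blocked_category('{"title": "U", "category": ["Other"]}', blocked))  # False
```

With the notebook's data, the set could be built once from the blocklist column, for example `frozenset(bl_df["name"])`.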

In [86]:
bi = BlockIndex(SEARCH_INDEX_DIR, bl_df["name"])

In [89]:
%%time

bi.run(output_file=BLOCKED_PAGES_INDEX)

13277592it [28:53, 7661.51it/s]

CPU times: user 27min 51s, sys: 40.7 s, total: 28min 32s
Wall time: 28min 53s





In [90]:
print(f"Num pages that will be blocked: {bi.n_kept:,}")

Num pages that will be blocked: 29,601


In [91]:
print(f"Size of blocked page subset: {BLOCKED_PAGES_INDEX.stat().st_size / (1024*1024):.1f} MB")

Size of blocked page subset: 326.0 MB
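
As a starting point for validating the extracted subset, the sketch below tallies the most frequent categories in a gzipped file of records. It assumes the output holds one JSON object per line; the exact layout written by `_write_to_output` is not shown here, so treat this as illustrative. `count_categories` is a hypothetical helper.

```python
import gzip
import json
from collections import Counter


def count_categories(path, top_n=10):
    """Return the `top_n` most common categories across records in a gzipped JSON-lines file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            # Lines without a "category" field (e.g. any retained index headers) add nothing.
            counts.update(record.get("category", []))
    return counts.most_common(top_n)
```

In this notebook it could be called as `count_categories(BLOCKED_PAGES_INDEX)` to see which blocked categories contribute the most pages.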
