# To be defined...

Wikipedia uses templates to standardize articles in the same category and to define how specific types of information are displayed (e.g., infoboxes). According to [Wikipedia](https://en.wikipedia.org/wiki/Help:Template):

> A template is a Wikipedia page created to be included in other pages. Templates usually contain repetitive material that might need to show up on any number of articles or pages. They are commonly used for boilerplate messages, standard warnings or notices, infoboxes, navigational boxes, and similar purposes.

This makes the information more uniform and easier to read. However, to extract the information from an article, we have to parse the template. Fortunately, there is a Python package that does the extraction magic for us: [wikitextparser](https://pypi.org/project/wikitextparser).

In [2]:
import dask.dataframe as dd

In [8]:
from dask.distributed import Client
client = Client()

In [6]:
ddf = dd.read_parquet("wikipedia.parquet/*.parquet", engine="pyarrow")
ddf = ddf.dropna(subset=['article'])

In [45]:
WIKI_CAT = ['category:', 'wikipedia:', 'file:',
            'template:', 'portal:', 'draft:',
            'module:', 'book:', 'mediawiki:',
            'timedtext:', 'help:', '#redirect']

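def _filter_non_articles(title):
    """Return False for titles in a non-article namespace, True otherwise."""
    # Compare case-insensitively, since WIKI_CAT is all lowercase
    title = title.lower()
    for cat in WIKI_CAT:
        if title.startswith(cat):
            return False
    return True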
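
A quick way to sanity-check the filter is to run it over a handful of titles. The sketch below repeats the prefix list so that it runs on its own and uses `str.startswith` with a tuple, which should behave the same as the loop above. The sample titles and the names `WIKI_PREFIXES` and `is_article` are made up for illustration.

```python
# Same prefixes as WIKI_CAT, stored as a tuple so startswith() accepts them directly
WIKI_PREFIXES = ('category:', 'wikipedia:', 'file:',
                 'template:', 'portal:', 'draft:',
                 'module:', 'book:', 'mediawiki:',
                 'timedtext:', 'help:', '#redirect')

def is_article(title):
    """Return True unless the lowercased title starts with a known prefix."""
    return not title.lower().startswith(WIKI_PREFIXES)

# Hypothetical titles, chosen to cover both outcomes
sample_titles = ["Ada Lovelace", "Category:Mathematicians", "File:Ada.jpg",
                 "Help:Template", "Star Wars: Episode I"]

kept = [t for t in sample_titles if is_article(t)]
print(kept)
```

Note that a regular title containing a colon is kept, because only the listed prefixes are rejected.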

In [7]:
%%time
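# Drop rows whose article text starts with '#redirect' or ':' (case-insensitive),
# then bring the remaining titles into memory as a pandas Series
s_title = (
    ddf
    .loc[~(ddf['article'].str.lower().str.startswith('#redirect') |
           ddf['article'].str.lower().str.startswith(':')), 'title']
    .compute()
)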

In [9]:
client

Client: Scheduler: tcp://127.0.0.1:37801
Cluster: Workers: 4, Cores: 12, Memory: 33.28 GB


In [17]:
from functools import reduce

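|     | 0    | 1                                | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----|------|----------------------------------|---|---|---|---|---|---|---|---|
| 355 | Book | Amateur Radio (Vol. 1)           |   |   |   |   |   |   |   |   |
| 353 | Book | Creation of the Great Lakes      |   |   |   |   |   |   |   |   |
| 126 | Book | Hugo Awards                      |   |   |   |   |   |   |   |   |
| 900 | Book | Auto racing in North America     |   |   |   |   |   |   |   |   |
| 322 | Book | Valaquia, Hungría y Transilvania |   |   |   |   |   |   |   |   |
| ... | ...  | ...                              | ... | ... | ... | ... | ... | ... | ... | ... |
| 285 | Book | IT Service Mgmt                  |   |   |   |   |   |   |   |   |
| 308 | Book | Rick Rubin                       |   |   |   |   |   |   |   |   |
| 541 | Book | San Diego                        |   |   |   |   |   |   |   |   |
| 557 | Book | San Diego Chargers               |   |   |   |   |   |   |   |   |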


In [43]:
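# Titles containing ':' may carry a namespace prefix such as 'Category:'
_ = s_title[s_title.str.contains(':')]
# Split on ':' so that column 0 holds the candidate prefix
__ = _.str.split(':', expand=True)
# List the 11 most frequent prefixes
__[0].value_counts().head(11).index.tolist()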

['Category',
 'Wikipedia',
 'File',
 'Template',
 'Portal',
 'Draft',
 'Module',
 'Book',
 'MediaWiki',
 'TimedText',
 'Help']
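
To see why the prefix is better checked against a known list than inferred from the mere presence of a colon, here is a small sketch of the same split on a hand-made Series. The titles are invented for illustration, and `n=1` (split only at the first colon) is an assumption that differs from the cell above.

```python
import pandas as pd

# Hypothetical titles; the last one is a regular article with ':' in its name
titles = pd.Series([
    "Category:Physics",
    "Category:Mathematics",
    "File:Logo.png",
    "Star Wars: Episode I",
])

# Keep titles with a colon and split each one at the first colon only
with_colon = titles[titles.str.contains(":", regex=False)]
parts = with_colon.str.split(":", n=1, expand=True)

# Column 0 holds the candidate prefix; count how often each one occurs
prefix_counts = parts[0].value_counts()
print(prefix_counts)
```

Here "Star Wars" shows up as a candidate prefix even though it is not a namespace, which is why rare prefixes need a manual look before they are filtered out.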

In [None]:
__