## mwparserfromhell on Spark

A short description of how to use mwparserfromhell with Spark, for example to parse the [wikitext dump](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Mediawiki_wikitext_current).


This currently only works on stat1008.

mwparserfromhell is installed on the workers of the cluster as part of the [conda base environment](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Anaconda). You therefore only have to make sure that the notebook kernel uses that environment. I set this up as follows:

- log in to stat1008: `ssh stat1008.eqiad.wmnet`

- activate the base environment: `source /usr/lib/anaconda-wmf/bin/activate`

- add the environment to jupyter-kernels: `ipython kernel install --user --name=venv_anaconda-wmf` (you can give it a different name than venv_anaconda-wmf)

- connect to JupyterHub: open an SSH tunnel with `ssh -N stat1008.eqiad.wmnet -L 8880:127.0.0.1:8880`, then open `http://localhost:8880/` in your browser

- you should now be able to select the kernel `venv_anaconda-wmf` to run your notebook.
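
Once the notebook is running, a quick check can confirm that the kernel really uses an interpreter that can import mwparserfromhell. The following is a minimal sketch using only the standard library; the helper name `module_available` is hypothetical and not part of any setup described above.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


# The interpreter path should point into the conda environment activated above.
print("interpreter:", sys.executable)
print("mwparserfromhell available:", module_available("mwparserfromhell"))
```

If the second line prints `False`, the notebook is probably running on a different kernel than the one registered in the steps above.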

In [3]:
import os, sys
import datetime
import calendar
import time
import string
import random

import findspark
findspark.init('/usr/lib/spark2')
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T, Window
import wmfdata.spark as wmfspark

## defining the spark session
spark_config = {}
spark = wmfspark.get_session(
    app_name='Pyspark notebook', 
    type='regular'
#     extra_settings=spark_config
)
spark

In [4]:
wiki = 'simplewiki'
lang = wiki.replace('wiki','')
snapshot='2020-07'


In [5]:
## Get some articles from the wikitext-dump
## only articles (no redirect title)
articles = (
    ## select table
    spark.read.table('wmf.mediawiki_wikitext_current')
    ## select wiki project
    .where( F.col('wiki_db') == wiki )
    .where( F.col('snapshot') == snapshot )
    ## main namespace
    .where(F.col('page_namespace') == 0 )
    ## no redirect-pages
    .where(F.col('page_redirect_title')=='')
    .where(F.col('revision_text').isNotNull())
    .where(F.length(F.col('revision_text'))>0)
    .select(
        F.col('page_id').alias('pid'),
        F.col('page_title').alias('title'),
        F.col('revision_text').alias('wikitext')
    )
    .limit(100)
)
articles.show()

+------+--------------------+--------------------+
|   pid|               title|            wikitext|
+------+--------------------+--------------------+
|606975|Somerset MRT station|'''Somerset MRT S...|
|  9523|       Welfare state|A '''welfare stat...|
|765801|   Audrey Wasilewski|{{refimprove|date...|
|463793|Star Trek Into Da...|'''Star Trek Into...|
|638276|       Antaeus Group|{{Infobox company...|
|301275|      Steve Montador|{{Infobox ice hoc...|
| 18907|Providence, Rhode...|{{Infobox settlem...|
|660137|  Knox County, Texas|'''Knox County'''...|
|455365|       Shashi Kapoor|{{Infobox person
...|
|498956|     Diahann Carroll|[[Image:Diahannca...|
|138161|Roquefort, Lot-et...|{{Infobox French ...|
|746807|          Serena Liu|{{Chinese name|[[...|
|656456|       Waucoma, Iowa|'''Waucoma''' is ...|
|409789|Penelope (given n...|'''Penelope''' is...|
|151772|              Angaïs|'''Angaïs''' is a...|
|441945|     Chandragupta II|{{no sources|date...|
| 60415|      George Adamski|''

In [6]:
import mwparserfromhell
import urllib.parse
import re
links_regex = re.compile(r"\[\[(?P<link>[^\n\|\]\[\<\>\{\}]{0,256})(?:\|(?P<anchor>[^\[]*?))?\]\]")
references_regex = re.compile(r"<ref[^>]*>[^<]+<\/ref>")
def get_plain_text_without_links(row):
    """ Replace the links with a dot to interrupt the sentence and get the plain text """
    wikicode = row.wikitext
    wikicode_without_links = re.sub(links_regex, '.', wikicode)
    wikicode_without_links = re.sub(references_regex, '.', wikicode_without_links)
    ## fall back to the regex-cleaned wikitext if the mwparserfromhell call fails
    try:
        text = mwparserfromhell.parse(wikicode_without_links).strip_code()
    except Exception:
        text = wikicode_without_links
    return T.Row(pid=row.pid, title=normalise_title(row.title), text=text.lower())
def normalise_title(title):
    """ URL-decode, capitalize the first letter, replace _ with space, remove anchor """
    title = urllib.parse.unquote(title)
    title = title.strip()
    if len(title) > 0:
        title = title[0].upper() + title[1:]
    n_title = title.replace("_", " ")
    if '#' in n_title:
        n_title = n_title.split('#')[0]
    return n_title
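
To see what the two regular expressions do before mwparserfromhell is involved, the sketch below applies them to a small made-up string. The patterns are copied from the cell above; the sample text and the names `replace_links_and_refs`, `sample`, `cleaned` and `link_targets` are assumptions for illustration only.

```python
import re

# Same patterns as in the cell above.
LINKS_REGEX = re.compile(
    r"\[\[(?P<link>[^\n\|\]\[\<\>\{\}]{0,256})(?:\|(?P<anchor>[^\[]*?))?\]\]"
)
REFERENCES_REGEX = re.compile(r"<ref[^>]*>[^<]+<\/ref>")


def replace_links_and_refs(wikitext):
    """Replace every [[link]] and every <ref>...</ref> with a single dot."""
    without_links = LINKS_REGEX.sub('.', wikitext)
    return REFERENCES_REGEX.sub('.', without_links)


# Hypothetical wikitext snippet, not taken from the dump.
sample = (
    "'''Paris''' is the capital of [[France]]."
    '<ref name="a">Some source</ref>'
    " It lies on the [[Seine|river Seine]]."
)

cleaned = replace_links_and_refs(sample)
link_targets = [m.group('link') for m in LINKS_REGEX.finditer(sample)]

print(cleaned)
print(link_targets)
```

Each link and reference collapses to a dot, which is why the plain text produced further down contains runs such as `. of the .` where links used to be.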

In [7]:
articles_plain_text =  (
    spark.createDataFrame(
        articles
        .rdd
        .map(get_plain_text_without_links)
    )
)
df = articles_plain_text.toPandas()
df.head()

Unnamed: 0,pid,text,title
0,502075,"matthew carle (born september 25, 1984) is an ...",Matt Carle
1,63261,zanzibar is the name of an . in the . 25–50 km...,Zanzibar
2,215017,".\nthe quebec nordiques (, pronounced in ., ...",Quebec Nordiques
3,534095,".\naubrey kerr mcclendon (july 14, 1959 – marc...",Aubrey McClendon
4,469807,"glen albert larson (january 3, 1937 - november...",Glen A. Larson


In [8]:
df['text'].iloc[0]

"matthew carle (born september 25, 1984) is an . professional . .. he currently plays for the . of the . (nhl). he has also played for the . and ..\n\ncareer\nbefore playing in the nhl, carle played parts of 2 seasons with the ., 1 season with the . of the . (ushl), and 3 years of college hockey with the . pioneers. during his time with the pioneers, carle won the . in 2006 for being the top . men's ice hockey player. he was also the only junior defenseman in history to win the award..\n\nhe was drafted 47th overall by the . in the .. on march 25, 2006, carle made his nhl debut and scored his first nhl goal against . of the . in a 5-1 win.. on november 21, 2007, carle was signed to a four-year, $13.75 million contract extension to stay with san jose..\n\non july 4, 2008, the sharks traded carle along with . and a first round pick in the 2009 nhl entry draft and a fourth round pick in 2010, to the . in exchange for . and ...\n\nafter 12 games with the lightning, they traded him along wi