# Resolving redirects in Wikipedia

A common problem when working with Wikipedia pages is resolving redirects.
For example, in pageview data (e.g. in webrequests), the page title of a request can be the title of a page that redirects to a different page.

Here, we show a procedure for resolving redirects using the page and redirect tables.

For enwiki, redirect pages (9.7M) outnumber non-redirect pages (6.4M) in the resolved table.
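
To make the motivation concrete, here is a minimal plain-Python sketch (no Spark needed) of what resolving redirects buys us for pageview data: views of redirect titles are folded into the target page. The titles, the view counts, and the `redirect_to` mapping are all made up for illustration; in practice the mapping would come from a resolved table like the one built in this notebook.

```python
from collections import Counter

# Hypothetical mapping: redirect title -> target title (illustration only).
redirect_to = {
    "NYC": "New_York_City",
    "New_York,_New_York": "New_York_City",
}

# Hypothetical pageview counts keyed by the requested title.
pageviews = [
    ("New_York_City", 100),
    ("NYC", 40),
    ("New_York,_New_York", 10),
    ("Berlin", 70),
]


def aggregate_resolved(pageviews, redirect_to):
    """Sum views per target page; titles without a redirect keep their own title."""
    totals = Counter()
    for title, views in pageviews:
        totals[redirect_to.get(title, title)] += views
    return dict(totals)


resolved_views = aggregate_resolved(pageviews, redirect_to)
print(resolved_views)
```

Without the mapping, the 50 views that arrived via the two redirect titles would be counted as separate pages.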

## Starting a session in Spark

In [1]:
import os, sys
import datetime
import calendar
import time
import string
import random

import findspark
findspark.init('/usr/lib/spark2')
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T, Window
import wmfdata.spark as wmfspark

## defining the spark session
spark_config = {}
spark = wmfspark.get_session(
    app_name='Pyspark notebook', 
    type='regular'
#     extra_settings=spark_config
)
spark

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


## Loading the tables we will need


In [2]:
## define a snapshot
snapshot = "2021-09"

In [3]:
## get a list of all wikipedia-projects (e.g. not wikidata)
df_projects = (
    spark.read.table('wmf_raw.mediawiki_project_namespace_map')
    .where(F.col("snapshot") == snapshot)
    .where(F.col("hostname").contains("wikipedia"))
    .select(F.col("dbname").alias("wiki_db"))
    .distinct()
)
df_projects.orderBy("wiki_db").show()

+-------+
|wiki_db|
+-------+
| abwiki|
|acewiki|
|adywiki|
| afwiki|
| akwiki|
|alswiki|
|altwiki|
| amwiki|
|angwiki|
| anwiki|
|arcwiki|
| arwiki|
|arywiki|
|arzwiki|
|astwiki|
| aswiki|
|atjwiki|
|avkwiki|
| avwiki|
|awawiki|
+-------+
only showing top 20 rows



In [4]:
## all wikipedia pages in the main namespace: 
## wiki_db, page_id, page_title, page_is_redirect
df_pages = (
    ## select table
    spark.read.table('wmf_raw.mediawiki_page')
    ## select snapshot
    .where( F.col('snapshot') == snapshot )
    ## filter wikipedias
    .join(
        df_projects,
        on = "wiki_db",
        how = "inner"
    )
    ## main namespace
    .where(F.col('page_namespace') == 0 )
    .select(
        "wiki_db",
        'page_id',
        'page_title',
        'page_is_redirect',
    )
)
# show some examples
df_pages.where(F.col("wiki_db")=="enwiki").show()

+-------+-------+--------------------+----------------+
|wiki_db|page_id|          page_title|page_is_redirect|
+-------+-------+--------------------+----------------+
| enwiki|8607168|    Ettenheimmunster|            true|
| enwiki|8607178|     Fungus_Among_Us|           false|
| enwiki|8607180|   Ettenheimmuenster|            true|
| enwiki|8607184|  Chouseishin_series|            true|
| enwiki|8607185|   The_Eagle_(album)|           false|
| enwiki|8607188|      Rottenmuenster|            true|
| enwiki|8607190|Child's_Play_(Sta...|           false|
| enwiki|8607196|       Rottenmunster|            true|
| enwiki|8607201|       Rottenmünster|            true|
| enwiki|8607209|Child’s_Play_(Voy...|            true|
| enwiki|8607212|           Saltfjell|            true|
| enwiki|8607215|  Menachem_Elimelech|           false|
| enwiki|8607226|       Trevor_Womble|           false|
| enwiki|8607230|         Roger_Depue|           false|
| enwiki|8607231|Adventures_in_Goo...|          

In [5]:
## the redirect table containing all the redirects.
## wiki_db, page_id_from, page_title_to
df_redirect = (
    ## select table
    spark.read.table('wmf_raw.mediawiki_redirect')
    ## select snapshot
    .where( F.col('snapshot') == snapshot )
    ## filter wikipedias
    .join(
        df_projects,
        on = "wiki_db",
        how = "inner"
    )
    ## main namespace
    .where(F.col('rd_namespace') == 0 )
    .select(
        F.col("wiki_db"),
        F.col('rd_from').alias('page_id_from'),
        F.col('rd_title').alias('page_title_to')
    )
)
df_redirect.where(F.col("wiki_db")=="enwiki").show()

+-------+------------+--------------------+
|wiki_db|page_id_from|       page_title_to|
+-------+------------+--------------------+
| enwiki|    43035845|    Australian_Party|
| enwiki|    43035846|William_&_Mary_La...|
| enwiki|    43035856|              I-Tree|
| enwiki|    43035858|Nehanda_Charwe_Ny...|
| enwiki|    43035863|Nehanda_Charwe_Ny...|
| enwiki|    43035910|              Kirkuk|
| enwiki|    43035919|   Chollima_Movement|
| enwiki|    43035928|            Kalinago|
| enwiki|    43035938|French_Internatio...|
| enwiki|    43035970|          Alan_Moore|
| enwiki|    43036017|International_League|
| enwiki|    43036034|      Bianca_(grape)|
| enwiki|    43036042|Allegheny_Center_...|
| enwiki|    43036061|         Đuro_Kurepa|
| enwiki|    43036062|         Đuro_Kurepa|
| enwiki|    43036063|Christian_Democra...|
| enwiki|    43036065|         Viticulture|
| enwiki|    43036107|Maghrebi_communit...|
| enwiki|    43036171|Parshuram_Temple,...|
| enwiki|    43036174|Hengyang–L

## Resolving the page-ids and page-titles

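We build a table of all Wikipedia pages with two additional columns, "page_id_resolved" and "page_title_resolved", which contain the page-id and page-title of the redirect target. For pages that are not redirects, these columns contain the page's own id and title.

Some redirects point to pages that are themselves marked as "page_is_redirect"=True (double redirects). Redirects are resolved for one hop only, so those rows are dropped from the table, as are redirects whose target cannot be found among the main-namespace pages.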
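
The join logic can be tried out on a toy example before running it on the full tables. The following plain-Python sketch mimics the two left joins and the final filter for a single wiki; all ids and titles are invented, and the helper name `resolve_one_hop` is hypothetical.

```python
# Toy stand-ins for the page and redirect tables of one wiki (made-up data).
pages = [
    # (page_id, page_title, page_is_redirect)
    (1, "Target", False),
    (2, "Alias", True),            # redirects to an article
    (3, "Alias_of_alias", True),   # redirects to another redirect
    (4, "Dangling", True),         # redirects to a title with no page
]
# page_id_from -> page_title_to
redirects = {2: "Target", 3: "Alias", 4: "Missing_page"}


def resolve_one_hop(pages, redirects):
    """Resolve each page by following at most one redirect."""
    by_title = {title: (pid, is_redirect) for pid, title, is_redirect in pages}
    resolved = []
    for pid, title, is_redirect in pages:
        # Like the first left join plus coalesce: use the redirect target if there is one.
        title_resolved = redirects.get(pid, title)
        # Like the second left join: look up the resolved title in the page table.
        target = by_title.get(title_resolved)
        # Like the final filter: keep the row only if the target exists
        # and is not itself a redirect.
        if target is not None and target[1] is False:
            resolved.append((pid, title, is_redirect, target[0], title_resolved))
    return resolved


rows = resolve_one_hop(pages, redirects)
for row in rows:
    print(row)
```

Only pages 1 and 2 survive: page 3 is a double redirect and page 4 points to a title that is not in the page table, so both are dropped, just as in the Spark version.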

In [6]:
## Resolving the pages table:
## wiki_db, page_id, page_title, page_is_redirect, page_id_resolved, page_title_resolved
df_pages_resolved = (
    df_pages
    ## join the redirect-table: pid_from --> page_title_to
    ## this adds a new column "page_title_to" for all pages that are a redirect
    .join(
        df_redirect.withColumnRenamed("page_id_from","page_id"),
        on = ["wiki_db","page_id"],
        how = "left"
    )
    ## create a new column page_title_resolved
    .withColumn('page_title_resolved', F.coalesce(F.col('page_title_to'),F.col('page_title')) )
    ## join the page-table to get page-ids from titles (not resolved)
    .join(
        (df_pages
         .withColumnRenamed("page_title","page_title_resolved")
         .withColumnRenamed("page_id","page_id_resolved")
         .withColumnRenamed("page_is_redirect","page_is_redirect_resolved")
        ),
        on = ["wiki_db","page_title_resolved"],
        how = "left"
    )
    ## only keep rows whose resolved page is not itself a redirect
    .where(F.col("page_is_redirect_resolved")==False)
    .select(
        "wiki_db",
        "page_id",
        "page_title",
        "page_is_redirect",
        "page_id_resolved",
        "page_title_resolved",
    )
)


## Some examples

### all wikis

In [7]:
## number of pages across all Wikipedias
df_pages_resolved.count()

105554713

In [8]:
## number of redirects in all wikis
df_pages_resolved.where(F.col("page_is_redirect")==True).count()

48193836

### enwiki

In [11]:
## number of pages in enwiki
df_pages_resolved.where(F.col("wiki_db")=="enwiki").count()

16093400

In [9]:
## number of redirect-pages in enwiki
df_pages_resolved.where(F.col("wiki_db")=="enwiki").where(F.col("page_is_redirect")==True).count()

9707804
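
As a quick sanity check on these two numbers, the non-redirect count and the redirect share for enwiki can be derived directly. The counts are copied from the cell outputs above; the variable names are just for illustration.

```python
# Counts copied from the cell outputs above (snapshot 2021-09).
enwiki_total = 16_093_400
enwiki_redirects = 9_707_804

enwiki_non_redirects = enwiki_total - enwiki_redirects
redirect_share = enwiki_redirects / enwiki_total

print(enwiki_non_redirects)     # 6385596
print(f"{redirect_share:.1%}")  # 60.3%
```

So roughly 6.4M of the resolved enwiki rows are non-redirect pages, and about three in five rows are redirects.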

In [10]:
## some example of the redirects in enwiki
df_pages_resolved.where(F.col("wiki_db")=="enwiki").where(F.col("page_is_redirect")==True).show()

+-------+--------+--------------------+----------------+----------------+--------------------+
|wiki_db| page_id|          page_title|page_is_redirect|page_id_resolved| page_title_resolved|
+-------+--------+--------------------+----------------+----------------+--------------------+
| enwiki|38059831|"Ridgeriders"_In_...|            true|        35511155|"Ridgeriders"_in_...|
| enwiki|44367915|          +1_records|            true|        34834673|          +1_Records|
| enwiki|13586449|Allow_us_to_be_frank|            true|         6929308|...Allow_Us_to_Be...|
| enwiki|52197614|Allow_Us_to_Be_Frank|            true|         6929308|...Allow_Us_to_Be...|
| enwiki| 6929327|Allow_Us_To_Be_Frank|            true|         6929308|...Allow_Us_to_Be...|
| enwiki|59584119|          ..._So_Far|            true|        59583941|           ...So_Far|
| enwiki|32590257|1,1,1,-Trichloroe...|            true|          302493|1,1,1-Trichloroet...|
| enwiki| 7689431|1,1,1-trichloroet...|           