# Mapping Wikipedia articles to Wikidata items

We use the `wmf.wikidata_item_page_link` table, documented here:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Wikidata_item_page_link

We want to better understand when the mapping does not work. There are two possible reasons:
- there is no Wikidata item associated with an article
- there is more than one Wikidata item associated with an article, so the mapping is ambiguous

We would then like to:
- filter those articles out of the mapping (the ambiguous case is harder to detect)
- if possible, resolve the ambiguous mappings
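
The two failure modes above can be told apart by counting the distinct Wikidata items per article. The following is a minimal plain-Python sketch of that idea, not the Spark pipeline used in this notebook. The function name `classify_pages` and the labels are hypothetical, and apart from the Kalakada rows (which appear later in this notebook) the input rows are made up for illustration.

```python
from collections import defaultdict

def classify_pages(pages, links):
    """Label each page as 'missing', 'unique', or 'ambiguous'.

    pages: iterable of (wiki_db, page_id) tuples.
    links: iterable of (wiki_db, page_id, qid) tuples.
    """
    # Collect the distinct qids seen for each page
    qids_by_page = defaultdict(set)
    for wiki_db, page_id, qid in links:
        qids_by_page[(wiki_db, page_id)].add(qid)

    labels = {}
    for key in pages:
        n_qids = len(qids_by_page.get(key, ()))
        if n_qids == 0:
            labels[key] = "missing"
        elif n_qids == 1:
            labels[key] = "unique"
        else:
            labels[key] = "ambiguous"
    return labels

# Toy input: page ids 1 and 2 and the qid "Q1" are made up
pages = [("enwiki", 10174134), ("enwiki", 1), ("enwiki", 2)]
links = [
    ("enwiki", 10174134, "Q6350231"),
    ("enwiki", 10174134, "Q57280627"),
    ("enwiki", 1, "Q1"),
]
labels = classify_pages(pages, links)
```

Filtering then amounts to keeping only the pages labelled `unique`.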



Example case for non-existent mapping:
- [en:Independent_Republicans_Group_(French_Senate)](https://en.wikipedia.org/wiki/Independent_Republicans_Group_(French_Senate)) is listed with two Wikidata items
   - [Q60846844](https://www.wikidata.org/wiki/Q60846844)
   - [Q6017112](https://www.wikidata.org/w/index.php?title=Q6017112&redirect=no), which appears to be a redirect
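
If a list of redirected items were available, a case like this one might be resolved by dropping the redirects and checking whether exactly one item remains. The sketch below assumes such a list exists. The helper `resolve_by_dropping_redirects` and the `redirect_qids` set are hypothetical, and treating Q6017112 as a redirect is an assumption based only on the observation above.

```python
def resolve_by_dropping_redirects(qids, redirect_qids):
    """Return the single qid left after dropping redirects, else None."""
    remaining = [q for q in qids if q not in redirect_qids]
    return remaining[0] if len(remaining) == 1 else None

# Assumption: Q6017112 is a redirect (it only "appears" to be one)
redirect_qids = {"Q6017112"}
candidates = ["Q60846844", "Q6017112"]
resolved = resolve_by_dropping_redirects(candidates, redirect_qids)
```

When more than one non-redirect item remains, the function returns `None` and the article stays ambiguous.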


Example case for ambiguous mapping

In [1]:
%load_ext autoreload
%autoreload 2
import json
import os
import sys

In [2]:
import findspark
findspark.init('/usr/lib/spark2')
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T, Window
import wmfdata.spark as wmfspark

# Define the Spark session
spark = wmfspark.get_session(
    app_name='Pyspark notebook',
    type='regular',
)
# Use Arrow to speed up conversions between Spark and pandas (e.g. toPandas())
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [3]:
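# Select snapshots
# Snapshot of wmf.wikidata_item_page_link; list the available ones with:
#   hdfs dfs -ls /wmf/data/wmf/wikidata/entity
wd_snapshot = '2022-04-04'
# Snapshot of the wmf_raw.mediawiki_* tables
snapshot = "2022-04"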

In [4]:
# all wikipedias
df_wikis = (
    spark.read.table("wmf_raw.mediawiki_project_namespace_map")
    .where(F.col("snapshot")==snapshot)
    .where(F.col("hostname").like("%wikipedia%"))
    .select(F.col("dbname").alias("wiki_db"))
    .distinct()
)

In [5]:
# all pages from all wikis (namespace=0, no redirects)
df_pages = (
    spark.read.table("wmf_raw.mediawiki_page")
    ## snapshot: this is a partition!
    .where(F.col("snapshot") == snapshot)
    .where(F.col("page_namespace") == 0)
    .where(F.col("page_is_redirect")==False)
    .join(
        df_wikis,
        on="wiki_db",
        how="inner"
    )
    .select(
        "wiki_db",
        "page_id",
        "page_title"
    )
)

In [6]:
# mapping table between qid and pid for namespace 0
df_qid_pid = (
    spark.read.table("wmf.wikidata_item_page_link")
    ## snapshot: this is a partition!
    .where(F.col("snapshot") == wd_snapshot)
    .where(F.col("page_namespace") == 0)
    .join(
        df_wikis,
        on="wiki_db",
        how="inner"
    )
    .select(
        F.col("item_id").alias("qid"),
        "wiki_db",
        "page_id",
        "page_title"
    )   
)
# df_qid_pid.show()

In [7]:
# join the qids to the pids
df_pages_qids = (
    df_pages
    .join(
        df_qid_pid.select("wiki_db","page_id","qid"),
        on=["wiki_db","page_id"],
        how="left"
    )

)

In [8]:
# count how many distinct qids are assigned to each article
df_pages_qids_filtered = (
    df_pages_qids
    .groupBy("wiki_db","page_id","page_title")
    .agg(
        # collect_list skips nulls, so articles without a qid get an empty array
        F.array_distinct(F.collect_list(F.col("qid"))).alias("qids"),
    )
    .withColumn("n_qids",F.size(F.col("qids")))
    # keep the qid only when it is unique; otherwise set it to null
    .withColumn(
        "qid",
        F.when(F.col("n_qids")==1,F.col("qids")[0]).otherwise(None)
    )
    .select(
        "wiki_db",
        "page_id",
        "page_title",
        "qid",
        "n_qids"
    )
    .orderBy("n_qids",ascending=False)
)
df_pages_qids_filtered.show(truncate=False)

+-------+--------+-----------------------------------+----+------+
|wiki_db|page_id |page_title                         |qid |n_qids|
+-------+--------+-----------------------------------+----+------+
|plwiki |4910213 |Tomasz_Zawisza_Trzebicki           |null|4     |
|azwiki |679282  |Kənan_Məmmədov_(leytenant)         |null|4     |
|plwiki |4909435 |Tadeusz_Feliś                      |null|4     |
|plwiki |4887989 |Jakub_z_Gostynina                  |null|3     |
|azwiki |683128  |Zülfüqar_Qubadov                   |null|3     |
|azwiki |687526  |İmamverdi_İsmayılov_(kapitan)      |null|3     |
|azwiki |679980  |Rəşad_Sadıqov_(polkovnik-leytenant)|null|3     |
|plwiki |4883914 |Eugeniusz_Kłosek                   |null|3     |
|azwiki |683826  |Saatmirzə_Əzimov                   |null|2     |
|zhwiki |1631851 |北朝鲜共产党                       |null|2     |
|eowiki |375551  |Despeñaperros                      |null|2     |
|azwiki |680790  |Nicat_Mənəfov                      |null|2     |

In [9]:
df_c = (
    df_pages_qids_filtered
    .groupBy("wiki_db")
    .pivot("n_qids")
    .count()
).cache()
df_c.show()

+------------+-----+-------+----+----+----+
|     wiki_db|    0|      1|   2|   3|   4|
+------------+-----+-------+----+----+----+
|      iawiki|   14|  23797|null|null|null|
|      viwiki| 3074|1268757|null|null|null|
|      sewiki|   10|   7987|null|null|null|
|      mswiki| 2761| 355043|   1|null|null|
|     acewiki|   13|  12631|null|null|null|
| map_bmswiki|    5|  13961|null|null|null|
|      kwwiki|   70|   5297|null|null|null|
|      hawiki| 1571|  14605|   1|null|null|
|      mywiki|49969|  55635|null|null|null|
|     amiwiki|  989|    518|null|null|null|
|     mwlwiki|   15|   4147|null|null|null|
|     adywiki|    6|    576|null|null|null|
|      gnwiki|   26|   4770|null|null|null|
|      bhwiki|  111|   7856|null|null|null|
|     hifwiki|   97|  10212|null|null|null|
|      sawiki|  109|  11644|null|null|null|
|      gdwiki|  115|  15737|null|null|null|
|      scwiki|   56|   7397|null|null|null|
|     hawwiki|   16|   2498|null|null|null|
|roa_tarawiki|    4|   9318|null

In [10]:
pd_c = df_c.toPandas()
pd_c = pd_c.set_index("wiki_db")
pd_c["N"]=pd_c.sum(axis=1)
pd_c=pd_c.fillna(0)
for col in pd_c.columns:
    pd_c[col]=pd_c[col].astype(int)

pd_c = pd_c.sort_values("N",ascending=False)




In [11]:
print(pd_c.to_string())

                       0        1   2  3  4        N
wiki_db                                             
enwiki             28090  6463922  40  0  0  6492052
cebwiki           838801  5287833   6  0  0  6126640
dewiki             12184  2674608  18  0  0  2686810
svwiki              1928  2559992   1  0  0  2561921
frwiki              9272  2410396   4  0  0  2419672
nlwiki              3681  2085507   2  0  0  2089190
ruwiki             22750  1793009   5  0  0  1815764
itwiki              9188  1743413   5  0  0  1752606
eswiki              9167  1703783   2  0  0  1712952
arzwiki             8345  1565969   0  0  0  1574314
plwiki              6393  1514006  46  2  2  1520449
jawiki             18081  1305606  11  0  0  1323698
viwiki              3074  1268757   0  0  0  1271831
zhwiki             26771  1244685   4  0  0  1271460
warwiki               61  1265730   0  0  0  1265791
arwiki              4629  1161173   7  0  0  1165809
ukwiki             14832  1136296   2  0  0  1
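
To compare wikis of very different sizes, the raw counts can be turned into shares. The sketch below does this for three rows copied from the table above. The function name `mapping_shares` is hypothetical, and "missing" and "ambiguous" are labels introduced here for `n_qids == 0` and `n_qids >= 2`.

```python
# Counts copied from the printed table, in column order n_qids = 0, 1, 2, 3, 4
counts = {
    "enwiki": (28090, 6463922, 40, 0, 0),
    "cebwiki": (838801, 5287833, 6, 0, 0),
    "plwiki": (6393, 1514006, 46, 2, 2),
}

def mapping_shares(row):
    """Return (share without a qid, share with two or more qids)."""
    total = sum(row)
    missing = row[0] / total
    ambiguous = sum(row[2:]) / total
    return missing, ambiguous

for wiki, row in counts.items():
    missing, ambiguous = mapping_shares(row)
    print(f"{wiki}: missing={missing:.2%}, ambiguous={ambiguous:.4%}")
```

The same calculation could be applied to every row of `pd_c` by dividing the count columns by `N`.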

### Check some example cases

##### Example in plwiki

article: https://pl.wikipedia.org/wiki/Tomasz_Zawisza_Trzebicki
- https://www.wikidata.org/wiki/Q100592017
- https://www.wikidata.org/wiki/Q100592019
- https://www.wikidata.org/wiki/Q100592020
- https://www.wikidata.org/wiki/Q100592021

In [12]:
# there are 4 different qids associated with this article
df_qid_pid.where(F.col("wiki_db")=="plwiki").where(F.col("page_id")==4910213).show(truncate=False)

+----------+-------+-------+------------------------+
|qid       |wiki_db|page_id|page_title              |
+----------+-------+-------+------------------------+
|Q100592020|plwiki |4910213|Tomasz_Zawisza_Trzebicki|
|Q100592019|plwiki |4910213|Tomasz_Zawisza_Trzebicki|
|Q100592017|plwiki |4910213|Tomasz_Zawisza_Trzebicki|
|Q100592021|plwiki |4910213|Tomasz_Zawisza_Trzebicki|
+----------+-------+-------+------------------------+



##### Example in enwiki
article: https://en.wikipedia.org/wiki/Kalakada
-  https://www.wikidata.org/wiki/Q6350231
   - the enwiki sitelink was changed only after the snapshot date: https://www.wikidata.org/w/index.php?title=Q6350231&diff=1616980172&oldid=1616414565
- https://www.wikidata.org/wiki/Q57280627

In [13]:
# there are 2 different qids associated with this article
df_qid_pid.where(F.col("wiki_db")=="enwiki").where(F.col("page_id")==10174134).show(truncate=False)

+---------+-------+--------+----------+
|qid      |wiki_db|page_id |page_title|
+---------+-------+--------+----------+
|Q6350231 |enwiki |10174134|Kalakada  |
|Q57280627|enwiki |10174134|Kalakada  |
+---------+-------+--------+----------+

