# Web2cit: up-to-date Citoid gap figures

This notebook describes how to retrieve the wikitext (source code) and the citation templates of the references in a sample of Wikipedia articles, in order to compare them with the results that the Citoid API returns for the same references.

To follow along, we recommend running the code cells one at a time, in order.

__Author:__

* Nidia Hernández, [nidiahernandez@conicet.gov.ar](mailto:nidiahy@gmail.com), CAICYT-CONICET


## Table of Contents

0. Setting Up 
1. Retrieving data from Wikipedia articles
    - 1.1. ...
    - 1.2. ...
2. Querying Citoid API
3. Tidying up the queried data 
4. Evaluating Citoid results
5. Visualizing results
    - 5.1. Tabular representation
    - 5.2. Other representations ...
6. More advanced explorations
    - 6.1. ...
    - 6.2. ...
    - 6.3. ...



## 0. Setting Up

Before we get started, let's import the libraries that we will need. The third-party ones (`requests`, `pandas`, `tqdm` and `wikitextparser`) must be installed beforehand.

In [34]:
import re
import sys
import os
import requests
from datetime import datetime, timedelta
import pandas as pd
from operator import itemgetter
from json import dump, load
from pprint import pprint
from tqdm import tqdm
import urllib.parse
import wikitextparser as wtp
# partial scripts

Here we set the global parameters:

In [3]:
LANG_OPS = ['en', 'es']
lang = 'en'
HEADER={'User-Agent': 'http://caicyt-conicet.gov.ar/; mailto:nidiahernandez@conicet.gov.ar'}
MEDIAWIKI_API_URL = f'https://{lang}.wikipedia.org/w/api.php?'
CITOID_API_URL = f'https://{lang}.wikipedia.org/api/rest_v1/data/citation/mediawiki/'

## 1. Retrieving data from Wikipedia articles

In this section, we inspect a curated list of Wikipedia articles in order to get:

- the source code of the page
- the citation templates used on the page
- the URLs of the citations


First, we get the list of Wikipedia article URLs from a Google Sheets spreadsheet:

In [4]:
tab = '2094364128' # gid of the English tab
sheet_url = f'https://docs.google.com/spreadsheets/d/1ZgCFtTj8DvAt-OJKPB1d0ShxYjSnN-eT9zxk0EUzMIA/export?gid={tab}&format=csv'
df = pd.read_csv(sheet_url)
articles = df['URL'].drop_duplicates()

In [5]:
articles

0     https://en.wikipedia.org/wiki/List_of_dramatic...
4         https://en.wikipedia.org/wiki/Editio_princeps
8           https://en.wikipedia.org/wiki/Elvis_Presley
12     https://en.wikipedia.org/wiki/Lumumba_Government
14         https://en.wikipedia.org/wiki/Britney_Spears
18            https://en.wikipedia.org/wiki/Airbus_A380
Name: URL, dtype: object

To obtain all the references cited in each article, we fetch the article's source code (wikitext) through the MediaWiki API. The request needs the title of the article, which is the last part of its URL. See the MediaWiki API documentation for more details: https://www.mediawiki.org/wiki/API:Main_page.

In [36]:
rows = []

for article_url in tqdm(articles, desc='Retrieving data from articles'):

    # The article title is the last segment of the URL, percent-decoded
    title = urllib.parse.unquote(article_url.split('/')[-1])

    PARAMS = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json"
    }

    response = requests.get(
        url=MEDIAWIKI_API_URL,
        params=PARAMS,
        headers=HEADER)

    # Skip failed requests before trying to decode the body
    if response.status_code != 200:
        continue

    DATA = response.json()

    # The API can answer 200 with an "error" object instead of "parse" (e.g. missing page)
    if "parse" in DATA:
        pageid = DATA["parse"]["pageid"]
        wikitext = DATA["parse"]["wikitext"]["*"]
        #revid?
        row = {"article": article_url, "pageid": pageid, "wikitext": wikitext}
        rows.append(row)

df = pd.DataFrame(rows)

Retrieving data from articles: 100%|██████████| 6/6 [00:05<00:00,  1.02it/s]
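
The title-extraction step used in the loop above can be isolated into a small helper, which makes it easy to check on its own. This is only a minimal sketch: the name `title_from_url` is hypothetical and not part of the notebook, and it assumes the usual `/wiki/<title>` URL layout. Unlike `split('/')[-1]`, it keeps titles that contain a slash whole.

```python
from urllib.parse import unquote, urlparse

def title_from_url(article_url):
    """Return the page title encoded in a Wikipedia article URL (hypothetical helper)."""
    # Assumes the usual /wiki/<title> path: everything after /wiki/ is the title,
    # so titles containing a slash (e.g. "AC/DC") are kept whole.
    path = urlparse(article_url).path
    title = path.split('/wiki/', 1)[-1]
    # Decode percent-escapes such as %C3%A1 -> á
    return unquote(title)

print(title_from_url('https://en.wikipedia.org/wiki/Editio_princeps'))
```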


In [37]:
df

                                         article_url    pageid  wikitext
0  https://en.wikipedia.org/wiki/List_of_dramatic...  61839913  {{hatnote|This article is about [[live action]...
1      https://en.wikipedia.org/wiki/Editio_princeps   4193686  {{italic title}}In [[classical scholarship]], ...
2        https://en.wikipedia.org/wiki/Elvis_Presley      9288  {{short description|American singer and actor}...
3   https://en.wikipedia.org/wiki/Lumumba_Government  54092590  {{very long|rps=117|date=September 2018}}\n{{I...
4       https://en.wikipedia.org/wiki/Britney_Spears      3382  {{short description|American singer, songwrite...
5          https://en.wikipedia.org/wiki/Airbus_A380    181173  {{short description|Wide-body double deck airc...


## Citation template metadata extraction

Now that we have the wikitext of the articles, we parse it to retrieve the references that were entered using a citation template. In other words, we are not interested in:

1. Manually entered references (i.e., references that do not use citation templates)

`<ref name=Briggs>Briggs, Asa & Burke, Peter (2002) ''A Social History of the Media: from Gutenberg to the Internet'', Cambridge: Polity, pp. 15–23, 61–73.</ref>`

2. Unlinked references

```<ref name=Neeham>{{cite book |title=Paper and Printing |author=[[Tsien Tsuen-Hsuin]] |author2=[[Joseph Needham]] |series=Science and Civilisation in China|volume=5 part 1|publisher=Cambridge University Press|pages=158, 201|year=1985}}</ref> ```

Instead, we are interested in:

1. References having a URL

```<ref name="VB1992">{{cite journal|last1=Osmond|first1=Patricia J.|last2=Ulery|first2=Robert W.|date=2003|title=Sallustius|url=http://catalogustranslationum.org/PDFs/volume08/v08_sallustius.pdf#page=17|journal=[[Catalogus Translationum et Commentariorum]]|volume=8|page=199|access-date=27 August 2015}}</ref>```

2. References having a DOI (to do: check whether there are cases with a DOI but no URL)

```<ref name="NH99-2362">{{cite journal|last=Holzberg|first=Niklas|date=1999|title=The Fabulist, the Scholars, and the Discourse: Aesop Studies Today|url=https://www.jstor.org/stable/30222546|journal=International Journal of the Classical Tradition|volume=6|pages=236–242|doi=10.1007/s12138-999-0004-y|jstor=30222546|access-date=31 January 2021|number=2|s2cid=195318862}}</ref>```

Therefore, we parse the page content looking only for the citation templates that include a URL:

In [70]:
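#rows = []

for i, row in df.iterrows():

    parsed = wtp.parse(row["wikitext"])

    for template in parsed.templates:
        # Keep only citation templates ({{cite web}}, {{cite journal}}, ...) that have a url argument
        if template.name.strip().lower().startswith('cite ') and template.has_arg('url'):
            print(template.arguments)
            # Exploratory run: stop at the first matching template
            break

    # Exploratory run: inspect only the first article
    break

# row = {"article": article_url, "pageid": pageid, "template": citation_template, "url": url, "authors": authors, "title": title, "date": date, "publication": publication}
# rows.append(row)
# df = pd.DataFrame(rows)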

[Argument('|last1=Hartinger '), Argument('|first1=Brent '), Argument('|title=You Should TOTALLY be Watching the Hugh Dancy Storyline on "The Big C" Right Now! '), Argument('|url=http://www.newnownext.com/you-should-totally-be-watching-the-hugh-dancy-storyline-on-the-big-c-right-now/08/2011/ '), Argument('|website=[[NewNowNext]] '), Argument('|date=August 8, 2011')]


In [80]:
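# Map argument names to values, stripping the surrounding whitespace
args = {arg.name.strip(): arg.value.strip() for arg in template.arguments}

# Fails if the template repeats an argument name
assert len(args) == len(template.arguments)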
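
From a dictionary like `args`, the fields sketched in the commented-out `row` above (URL, authors, title, date, publication) could be collected along the following lines. This is a hedged sketch rather than the notebook's method: `basic_metadata` is a hypothetical name, and the parameter names it looks up (`lastN`/`firstN`, `author`, `journal`, `newspaper`, `website`, `work`) are an assumed subset of what citation templates accept.

```python
def basic_metadata(args):
    """Collect basic metadata from a dict of citation template arguments (hypothetical helper)."""
    authors = []
    n = 1
    while True:
        # Numbered authors: last1/first1, last2/first2, ...
        # Unnumbered "last"/"first" are treated as the first author (assumption).
        last = args.get(f'last{n}') or (args.get('last') if n == 1 else None)
        first = args.get(f'first{n}') or (args.get('first') if n == 1 else None)
        if not last:
            break
        authors.append(f'{first} {last}' if first else last)
        n += 1
    # Fall back to a single "author" argument when no last names were found
    if not authors and args.get('author'):
        authors.append(args['author'])

    # The first non-empty publication-like parameter wins (assumed priority order)
    publication = next(
        (args[key] for key in ('journal', 'newspaper', 'website', 'work') if args.get(key)),
        None,
    )
    return {
        'url': args.get('url'),
        'title': args.get('title'),
        'authors': authors,
        'date': args.get('date'),
        'publication': publication,
    }

# Arguments of the template printed above, after stripping whitespace
example = {
    'last1': 'Hartinger',
    'first1': 'Brent',
    'title': 'You Should TOTALLY be Watching the Hugh Dancy Storyline on "The Big C" Right Now!',
    'url': 'http://www.newnownext.com/you-should-totally-be-watching-the-hugh-dancy-storyline-on-the-big-c-right-now/08/2011/',
    'website': '[[NewNowNext]]',
    'date': 'August 8, 2011',
}
print(basic_metadata(example))
```

Wikilink markup such as `[[NewNowNext]]` is left untouched here; it would need to be stripped before comparing values with Citoid's output.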

## 2. Querying Citoid API

In this section, we query the [Citoid API](https://en.wikipedia.org/api/rest_v1/#/Citation/getCitation) for the citation metadata of the URLs found in the previous step.

In [None]:
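cache = {}

def get_and_cache_response(reference_url):
    """Return the Citoid response for reference_url, or None if the request fails."""

    if reference_url not in cache:

        # Percent-encode the whole URL, slashes included, so that it fits in one path segment
        encoded_url = urllib.parse.quote(reference_url, safe='')
        response = requests.get(
            url=CITOID_API_URL + encoded_url,
            headers=HEADER
        )

        # Failed lookups are not cached, so they are retried on the next call
        if response.status_code != 200:
            return None

        cache[reference_url] = response.json()

    return cache[reference_url]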
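
To see what the request looks like without hitting the network, the URL construction can be tried in isolation. A minimal sketch, assuming the same endpoint pattern as `CITOID_API_URL`; the function name `citoid_request_url` is hypothetical.

```python
from urllib.parse import quote

def citoid_request_url(reference_url, lang='en'):
    """Build the Citoid request URL for a reference URL (hypothetical helper)."""
    base = f'https://{lang}.wikipedia.org/api/rest_v1/data/citation/mediawiki/'
    # safe='' makes quote() escape '/' and ':' too, so the whole reference URL
    # travels as a single path segment
    return base + quote(reference_url, safe='')

print(citoid_request_url('https://www.jstor.org/stable/30222546'))
```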

## 3. Tidying up the queried data

Output: a table of the cited URLs, including
+ Number of times cited in the input corpus
+ Source type (scientific article, book chapter, etc.), taken from the citation template
+ 4 basic metadata fields, as recorded in the article and as returned by the Citoid API
    - Title
    - Authors
    - Publication date
    - Publication (name of the newspaper, journal or website)
    - Source type
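
The "number of times cited" column could be obtained with a simple counter. The sketch below assumes the references are available as `(article_url, reference_url)` pairs, which is a hypothetical layout; the sample pairs are made up for illustration only.

```python
from collections import Counter

def count_cited_urls(references):
    """Count how many times each reference URL is cited across the corpus.

    `references` is assumed to be an iterable of (article_url, reference_url) pairs.
    """
    return Counter(reference_url for _, reference_url in references)

# Made-up pairs, for illustration only
sample = [
    ('https://en.wikipedia.org/wiki/Editio_princeps', 'https://www.jstor.org/stable/30222546'),
    ('https://en.wikipedia.org/wiki/Elvis_Presley', 'https://www.jstor.org/stable/30222546'),
    ('https://en.wikipedia.org/wiki/Elvis_Presley', 'https://example.org/a'),
]
counts = count_cited_urls(sample)
print(counts.most_common())
```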


In [None]:
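def extract_field(field, obj):
    # Return the value of `field`, or None when Citoid did not return it
    return obj.get(field)

rows = []

# url_list: the reference URLs collected from the citation templates
# (only a slice is queried here)
for reference_url in tqdm(url_list[11:21]):

    response_json = get_and_cache_response(reference_url)

    if response_json:

        # The response is expected to be a list holding a single citation object
        assert len(response_json) == 1
        obj = response_json[0]

        fields_to_extract = [
            "title",
            "author",
            "accessDate", # date (retrieval date; the publication date, when available, comes under "date")
            "publicationTitle", # publication (newspaper or journal name)
            "itemType", # source type (scientific article, book chapter, etc.)
        ]

        row = {field: extract_field(field, obj) for field in fields_to_extract}
        rows.append(row)


df = pd.DataFrame(rows)
df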

+ Be able to resume the operation where it left off, for very large datasets
+ Validate the methodology by reproducing the data published by Andrew Lih in 2017 (https://docs.google.com/spreadsheets/d/1kNHSVKq5qZr6_3-05UFtXF2RA5sn3HdUUgL0nC57XPk/edit#gid=543588472)
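
One possible way to make the querying step resumable is to persist the `cache` dictionary to disk between runs. The following is a minimal sketch under that assumption; the file name and the three helper names are hypothetical.

```python
import os
from json import dump, load

CACHE_PATH = 'citoid_cache.json'  # hypothetical file name

def load_cache(path=CACHE_PATH):
    """Return the cache saved by a previous run, or an empty dict if there is none."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    """Write the cache to disk so that an interrupted run can be resumed."""
    with open(path, 'w', encoding='utf-8') as f:
        dump(cache, f, ensure_ascii=False)

def pending_urls(url_list, cache):
    """Return the URLs that have not been answered yet, in their original order."""
    return [url for url in url_list if url not in cache]
```

A run would then start with `cache = load_cache()`, loop over `pending_urls(url_list, cache)`, and call `save_cache(cache)` every so often.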

## 4. Evaluating Citoid results

+ Verdict: “supported by Citoid”
+ Match score, perhaps using edit distance? We assume that the template metadata is correct (that is, curated)
+ Optional: status in the Zotero translators dashboard (or “no translator”)
+ Optional: JSON-LD (whether the URL offers metadata in JSON-LD, yes or no)
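
If edit distance is chosen for the match score, one option is a normalized Levenshtein similarity between the template value and the Citoid value. The sketch below is one possible formulation, not a settled design: the normalization (lowercasing, collapsing whitespace, dividing by the longer length) and the names `levenshtein` and `match_score` are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free when characters match)
            ))
        previous = current
    return previous[-1]

def match_score(template_value, citoid_value):
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    a = ' '.join(template_value.lower().split())
    b = ' '.join(citoid_value.lower().split())
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest

print(match_score('Sallustius', 'Sallustius.'))
```

A threshold on this score could then back the “supported by Citoid” verdict, although where to set it is an open question.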

## 5. Visualizing results

- Obtain the Citoid coverage table for each URL and basic metadata field
- Show coverage percentages (Citoid coverage gap) by
    + Source type
    + Wikipedia language edition
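
The coverage percentages could be computed by grouping the evaluated rows, as in this plain-Python sketch (in the notebook itself a pandas `groupby` would probably do the same job). The row layout, with a boolean `supported` entry plus `source_type` and `lang`, is hypothetical, and the sample rows are made up for illustration only.

```python
from collections import defaultdict

def coverage_by(rows, key):
    """Percentage of rows supported by Citoid, grouped by `key`.

    Each row is assumed to be a dict with a boolean 'supported' entry
    plus grouping entries such as 'source_type' or 'lang' (hypothetical names).
    """
    totals = defaultdict(int)
    supported = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        supported[row[key]] += bool(row['supported'])
    return {group: 100 * supported[group] / totals[group] for group in totals}

# Made-up rows, for illustration only
sample_rows = [
    {'source_type': 'journal', 'lang': 'en', 'supported': True},
    {'source_type': 'journal', 'lang': 'es', 'supported': False},
    {'source_type': 'web', 'lang': 'en', 'supported': True},
]
print(coverage_by(sample_rows, 'source_type'))
print(coverage_by(sample_rows, 'lang'))
```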
    