# Web2cit: up-to-date Citoid gap figures

This notebook describes how to retrieve the wikitext (source code) of a sample of Wikipedia articles, extract the references that use [citation templates](https://en.wikipedia.org/wiki/Wikipedia:Citation_templates), and compare their metadata with the results of the [Citoid](https://www.mediawiki.org/wiki/Citoid) API for the same references.

To follow along, we recommend running the code cells one at a time, in order.

__Author:__

* Nidia Hernández, [nidiahernandez@conicet.gov.ar](mailto:nidiahy@gmail.com), CAICYT-CONICET


## Table of Contents

0. Setting Up 
1. Retrieving data from Wikipedia articles
    - 1.1. ...
    - 1.2. ...
2. Querying Citoid API
3. Tidying up the queried data 
4. Evaluating Citoid results
5. Visualizing results
    - 5.1. Tabular representation
    - 5.2. Other representations ...
6. More advanced explorations
    - 6.1. ...
    - 6.2. ...
    - 6.3. ...



## 0. Setting Up

Before we get started, let's import the libraries that we will need (install any that are missing first).

In [1]:
import re
import sys
import os
import requests
from datetime import datetime, timedelta
import pandas as pd
from operator import itemgetter
from json import dump, load
from pprint import pprint
from tqdm import tqdm
import urllib.parse  # import the submodule explicitly; urllib.parse.unquote is used below
import mwparserfromhell
from wikiciteparser.parser import parse_citation_template
# partial scripts

Here we set the global parameters:

In [2]:
LANG_OPS = ['en', 'es']
lang = 'en'
HEADER={'User-Agent': 'http://caicyt-conicet.gov.ar/; mailto:nidiahernandez@conicet.gov.ar'}
MEDIAWIKI_API_URL = f'https://{lang}.wikipedia.org/w/api.php?'
CITOID_API_URL = f'https://{lang}.wikipedia.org/api/rest_v1/data/citation/mediawiki/'

## Retrieving data from Wikipedia articles

In this section, we inspect a curated list of Wikipedia articles in order to get:

- the source code of the page
- the citation templates used on the page
- the URLs of the citations


First, we get the list of Wikipedia article URLs from a Google Sheets spreadsheet:

In [62]:
tab = {'en': '2094364128', 'es': '0'}[lang]
        
sheet_url = f'https://docs.google.com/spreadsheets/d/1ZgCFtTj8DvAt-OJKPB1d0ShxYjSnN-eT9zxk0EUzMIA/export?gid={tab}&format=csv'
df = pd.read_csv(sheet_url)
articles_urls = df['URL'].drop_duplicates()

In [63]:
articles_urls

0     https://en.wikipedia.org/wiki/List_of_dramatic...
4         https://en.wikipedia.org/wiki/Editio_princeps
8           https://en.wikipedia.org/wiki/Elvis_Presley
12     https://en.wikipedia.org/wiki/Lumumba_Government
14         https://en.wikipedia.org/wiki/Britney_Spears
18            https://en.wikipedia.org/wiki/Airbus_A380
Name: URL, dtype: object

To obtain all the references cited in each article, we fetch the wikitext of each article through the MediaWiki API. The request needs the title of the article, which is the last segment of its URL. See the MediaWiki API documentation for details: https://www.mediawiki.org/wiki/API:Main_page.

In [76]:
rows = []

for article_url in tqdm(articles_urls, desc='Retrieving data from articles'):

    title = article_url.split('/')[-1]
    title = urllib.parse.unquote(title)
    #print(title)
    
    PARAMS = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json"
    }

    response = requests.get(
        url=MEDIAWIKI_API_URL, 
        params=PARAMS,
        headers=HEADER
    )

    if response.status_code == 200:
        data = response.json()

        # The API can answer 200 with an "error" key (e.g. missing page), so check for "parse"
        if "parse" in data:
            pageid = data["parse"]["pageid"]
            wikitext = data["parse"]["wikitext"]["*"]
            #revid?
            row = {"title": title, "article_url": article_url, "pageid": pageid, "wikitext": wikitext}
            rows.append(row)

articles_df = pd.DataFrame(rows)

Retrieving data from articles: 100%|██████████| 6/6 [00:11<00:00,  1.95s/it]


In [77]:
articles_df

Unnamed: 0,title,article_url,pageid,wikitext
0,List_of_dramatic_television_series_with_LGBT_c...,https://en.wikipedia.org/wiki/List_of_dramatic...,61839913,{{hatnote|This article is about [[live action]...
1,Editio_princeps,https://en.wikipedia.org/wiki/Editio_princeps,4193686,"{{italic title}}In [[classical scholarship]], ..."
2,Elvis_Presley,https://en.wikipedia.org/wiki/Elvis_Presley,9288,{{short description|American singer and actor}...
3,Lumumba_Government,https://en.wikipedia.org/wiki/Lumumba_Government,54092590,{{very long|rps=117|date=September 2018}}\n{{I...
4,Britney_Spears,https://en.wikipedia.org/wiki/Britney_Spears,3382,"{{short description|American singer, songwrite..."
5,Airbus_A380,https://en.wikipedia.org/wiki/Airbus_A380,181173,{{short description|Wide-body double deck airc...


## Citation template metadata extraction

Now that we have the wikitext of the articles, we parse it to retrieve the references that were introduced using a citation template. In other words, we are not interested in:

1. Manually entered references (i.e., references that do not use citation templates)

`<ref name=Briggs>Briggs, Asa & Burke, Peter (2002) ''A Social History of the Media: from Gutenberg to the Internet'', Cambridge: Polity, pp. 15–23, 61–73.</ref>`

2. Unlinked references

```<ref name=Neeham>{{cite book |title=Paper and Printing |author=[[Tsien Tsuen-Hsuin]] |author2=[[Joseph Needham]] |series=Science and Civilisation in China|volume=5 part 1|publisher=Cambridge University Press|pages=158, 201|year=1985}}</ref> ```

Instead, we are interested in:

1. References having a URL

```<ref name="VB1992">{{cite journal|last1=Osmond|first1=Patricia J.|last2=Ulery|first2=Robert W.|date=2003|title=Sallustius|url=http://catalogustranslationum.org/PDFs/volume08/v08_sallustius.pdf#page=17|journal=[[Catalogus Translationum et Commentariorum]]|volume=8|page=199|access-date=27 August 2015}}</ref>```

2. References having a DOI (to do: check whether there are cases with a DOI but no URL)

```<ref name="NH99-2362">{{cite journal|last=Holzberg|first=Niklas|date=1999|title=The Fabulist, the Scholars, and the Discourse: Aesop Studies Today|url=https://www.jstor.org/stable/30222546|journal=International Journal of the Classical Tradition|volume=6|pages=236–242|doi=10.1007/s12138-999-0004-y|jstor=30222546|access-date=31 January 2021|number=2|s2cid=195318862}}</ref>```

Therefore, we parse the page content looking only for citation templates that include a URL.
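
As a rough illustration of what "including a URL" can mean in practice, here is a minimal sketch that checks the raw text of a template for a non-empty `url` parameter. The helper name `has_url_param` and the regular expression are assumptions made for this example and are not part of the notebook's code; the sketch is meant to show that parameters such as `archive-url` and empty `url=` values can be told apart from a real `url` value.

```python
import re

# Matches a "url" parameter with a non-empty value. The parameter name must
# directly follow the pipe (ignoring whitespace), so "archive-url" is skipped.
URL_PARAM = re.compile(r'\|\s*url\s*=\s*[^\s|}]', re.IGNORECASE)


def has_url_param(template_text):
    """Return True if the raw template text carries a non-empty url parameter.

    Hypothetical helper for illustration only.
    """
    return URL_PARAM.search(template_text) is not None
```

For example, `has_url_param('{{cite web | url = http://x.org}}')` is true, while a template that only has `archive-url=` or an empty `url=` is rejected.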

We parse the wikicode using [mwparserfromhell](https://github.com/earwig/mwparserfromhell) and the citation templates with [wikiciteparser](https://github.com/dissemin/wikiciteparser).

We are interested in the following metadata: 
1. The URL of the source
2. The author(s)
3. The title of the referenced source
4. The publishing date
5. The publishing source

The name of each field may vary between templates. For instance, the publishing source is under "Periodical" for news and under "PublisherName" for books.

In [7]:
def extract_field(field, obj):
    if field in obj:
        return obj[field]

In [119]:
rows = []
citation_nr = 0
citation_url_nr = 0
not_parsed = 0
not_parsed_list = []

for i, article in articles_df.iterrows():
    parsed_wiki = mwparserfromhell.parse(article["wikitext"])

    title = article['title']
    for template in tqdm(parsed_wiki.filter_templates(), desc=f"Finding citations and extracting metadata in {title}"):
        template_name = str(template.name).strip().lower()  # names may carry trailing whitespace or newlines
        
        if not re.match(r'cite|citation', template_name):
            continue

        citation_nr += 1
        citation_template = template_name

        if re.search('url=', str(template)): # other identifiers? DOI? ISBN?
            citation_url_nr += 1
            parsed_template = parse_citation_template(template)

            if parsed_template is None: # cite magazine
                not_parsed+=1
                not_parsed_list.append(template)
                continue

            fields_to_extract = {
                "URL": "URL",
                "Authors": "Authors",
                "Title": "Title",
                "Publishing_date": "Date"
            }

            if citation_template in ['cite web', 'cite news', 'cite podcast']: ##
                fields_to_extract["Published_in"] = "Periodical"  
            else:
                fields_to_extract["Published_in"] = "PublisherName"

            row = {colname: parsed_template.get(template_key) for colname, template_key in fields_to_extract.items()}
            row["Article"] = article["article_url"]
            row["Source_type"] = citation_template.replace('cite', '').strip()  # e.g. 'cite web' -> 'web'
            rows.append(row)


citations_metadata = pd.DataFrame(rows)

Finding citations and extracting metadata in List_of_dramatic_television_series_with_LGBT_characters:_2010–2015: 100%|██████████| 694/694 [00:04<00:00, 154.10it/s]
Finding citations and extracting metadata in Editio_princeps: 100%|██████████| 54/54 [00:00<00:00, 1008.31it/s]
Finding citations and extracting metadata in Elvis_Presley: 100%|██████████| 1059/1059 [00:00<00:00, 1206.95it/s]
Finding citations and extracting metadata in Lumumba_Government: 100%|██████████| 780/780 [00:00<00:00, 1656.56it/s]
Finding citations and extracting metadata in Britney_Spears: 100%|██████████| 594/594 [00:03<00:00, 188.36it/s]
Finding citations and extracting metadata in Airbus_A380: 100%|██████████| 547/547 [00:02<00:00, 196.75it/s]


Let's take a look at the results:

In [121]:
print(f"{citation_nr} citations found in {articles_urls.shape[0]} articles")
print(f"{citation_url_nr} citations having a URL")
print(f"Unable to parse {not_parsed} citations")

1848 citations found in 6 articles
1680 citations having a URL
Unable to parse 46 citations


In [122]:
citations_metadata

Unnamed: 0,URL,Authors,Title,Publishing_date,Published_in,Article,Source_type
0,http://www.newnownext.com/you-should-totally-b...,"[{'first': 'Brent', 'last': 'Hartinger'}]",You Should TOTALLY be Watching the Hugh Dancy ...,2011-08-08,,https://en.wikipedia.org/wiki/List_of_dramatic...,web
1,https://lezwatchtv.com/show/boardwalk-empire/,,Boardwalk Empire,,,https://en.wikipedia.org/wiki/List_of_dramatic...,web
2,https://www.gaystarnews.com/article/flash-star...,"[{'first': 'James', 'last': 'Besanvalle'}]",Flash star Keiynan Lonsdale comes out in beaut...,2017-05-13,,https://en.wikipedia.org/wiki/List_of_dramatic...,web
3,https://thewest.com.au/entertainment/art/reluc...,"[{'first': 'Tiffany', 'last': 'Fox'}]",Reluctant heart-throb,2013-07-03,,https://en.wikipedia.org/wiki/List_of_dramatic...,web
4,https://tvtonight.com.au/2010/08/a-kiss-is-jus...,"[{'first': 'David', 'last': 'Knox'}]","A kiss is just a kiss, even in teen TV. TV To...",2010-08-21,,https://en.wikipedia.org/wiki/List_of_dramatic...,web
...,...,...,...,...,...,...,...
1629,http://www.markpower.co.uk/projects/A-380,"[{'last': 'Mark Power', 'link': 'Mark Power'}]","Project "" A380: Photographs / Audio Visual",2003/2006,,https://en.wikipedia.org/wiki/Airbus_A380,web
1630,https://www.flightglobal.com/news/articles/eve...,,Airbus A380 Aircraft Profile,2007-02-27,,https://en.wikipedia.org/wiki/Airbus_A380,web
1631,https://www.flightglobal.com/news/articles/dub...,[{'last': 'Max Kingsley-Jones'}],The path to an A380 century at Emirates,2017-11-09,,https://en.wikipedia.org/wiki/Airbus_A380,news
1632,https://www.flightglobal.com/news/articles/ana...,[{'last': 'David Kaminski-Morrow'}],Analysis: A380 scrapes along in hope of revival,2018-07-09,,https://en.wikipedia.org/wiki/Airbus_A380,news


## Tidying up the data

We rearrange the columns and convert the 'Authors' field from a list of dictionaries to a single string, with the names separated by `|`.

In [123]:
citations_metadata = citations_metadata[['Article', 'Source_type', 'URL', 'Authors', 'Title', 'Publishing_date', 'Published_in']]
citations_metadata = citations_metadata.reset_index(drop=True)

In [124]:
def join_authors(authors):
    if authors is None:
        return None
    
    full_names = []

    for author in authors:
        # Assume the last name is always present
        # Formats we have seen so far:
        # 1. {'first': ..., 'last': ...}
        # 2. {'last': ...}
        # 3. {'last': "LAST, FIRST"}
        # 4. {'last': "FIRST LAST"}
        # 5. {'last': "FIRST1 LAST1 and FIRST2 LAST2"}

        full_name = author['last']

        if 'first' in author:
            first_name = author['first']
            full_name = first_name + ' ' + full_name

        full_names.append(full_name)

    return "|".join(full_names)

citations_metadata["Authors"] = citations_metadata["Authors"].map(join_authors)
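
The comments in `join_authors` list formats 3 and 5 ("LAST, FIRST" and several authors joined by "and"), which the function passes through unchanged. A minimal sketch of how such values might be normalized follows; the helper `split_author_name` is hypothetical and assumes that a comma always separates last name from first name and that " and " always separates people, which will not hold for corporate authors such as "Johnson and Johnson".

```python
import re


def split_author_name(raw):
    """Return a list of 'FIRST LAST' strings from one raw 'last' value.

    Hypothetical helper; the splitting rules are assumptions, not the
    notebook's method.
    """
    names = []
    # Format 5: several authors joined by " and "
    for part in re.split(r'\s+and\s+', raw.strip()):
        # Format 3: "LAST, FIRST" -> "FIRST LAST"
        if ',' in part:
            last, first = [p.strip() for p in part.split(',', 1)]
            part = f'{first} {last}'.strip()
        names.append(part)
    return names
```

Each returned name could then be appended to `full_names` before the final `"|".join(...)`.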

In [143]:
citations_metadata.iloc[131]

Article            https://en.wikipedia.org/wiki/List_of_dramatic...
Source_type                                                      web
URL                https://www.pinknews.co.uk/2018/10/24/arrow-ga...
Authors                                                   Nick Duffy
Title                   Arrow just revealed a major character is gay
Publishing_date                                           2018-10-24
Published_in                                               Pink News
Name: 131, dtype: object

We save this partial result to a CSV file.

In [135]:
filepath = f'citations_metadata_{lang}.csv'
citations_metadata.to_csv(filepath, index=False)

In [136]:
citations_metadata

Unnamed: 0,Article,Source_type,URL,Authors,Title,Publishing_date,Published_in
0,https://en.wikipedia.org/wiki/List_of_dramatic...,web,http://www.newnownext.com/you-should-totally-b...,Brent Hartinger,You Should TOTALLY be Watching the Hugh Dancy ...,2011-08-08,
1,https://en.wikipedia.org/wiki/List_of_dramatic...,web,https://lezwatchtv.com/show/boardwalk-empire/,,Boardwalk Empire,,
2,https://en.wikipedia.org/wiki/List_of_dramatic...,web,https://www.gaystarnews.com/article/flash-star...,James Besanvalle,Flash star Keiynan Lonsdale comes out in beaut...,2017-05-13,
3,https://en.wikipedia.org/wiki/List_of_dramatic...,web,https://thewest.com.au/entertainment/art/reluc...,Tiffany Fox,Reluctant heart-throb,2013-07-03,
4,https://en.wikipedia.org/wiki/List_of_dramatic...,web,https://tvtonight.com.au/2010/08/a-kiss-is-jus...,David Knox,"A kiss is just a kiss, even in teen TV. TV To...",2010-08-21,
...,...,...,...,...,...,...,...
1629,https://en.wikipedia.org/wiki/Airbus_A380,web,http://www.markpower.co.uk/projects/A-380,Mark Power,"Project "" A380: Photographs / Audio Visual",2003/2006,
1630,https://en.wikipedia.org/wiki/Airbus_A380,web,https://www.flightglobal.com/news/articles/eve...,,Airbus A380 Aircraft Profile,2007-02-27,
1631,https://en.wikipedia.org/wiki/Airbus_A380,news,https://www.flightglobal.com/news/articles/dub...,Max Kingsley-Jones,The path to an A380 century at Emirates,2017-11-09,
1632,https://en.wikipedia.org/wiki/Airbus_A380,news,https://www.flightglobal.com/news/articles/ana...,David Kaminski-Morrow,Analysis: A380 scrapes along in hope of revival,2018-07-09,


## Querying Citoid API

In this section, we query the [Citoid API](https://en.wikipedia.org/api/rest_v1/#/Citation/getCitation) to retrieve the citation metadata for the URLs found in the previous step.

In [None]:
cache = {}
    
def get_and_cache_response(reference_url):
    
    if reference_url not in cache:
    
        parsed_url = urllib.parse.quote(reference_url, safe='')  # percent-encode everything, including '/'
        response = requests.get(
            url=CITOID_API_URL+parsed_url,
            headers = HEADER
        )

        if response.status_code == 200:
            cache[reference_url] = response.json()
        else:
            return

    return cache[reference_url]

## Tidying up the queried data 

Output: a table of cited URLs, including
+ Number of times each URL is cited in the input corpus
+ Source type (journal article, book chapter, etc.), taken from the citation template
+ 4 basic metadata fields, as recorded in the article and as returned by the Citoid API
    - Title
    - Authors
    - Publication date
    - Publication (name of the newspaper, journal or website)
    - Source type
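
The first item in this list, the number of times each URL is cited, can be sketched with the standard library alone. The function name `count_url_citations` is hypothetical, and the sketch assumes that missing URLs are represented as `None` or empty strings (pandas `NaN` values would need to be dropped beforehand).

```python
from collections import Counter


def count_url_citations(urls):
    """Count how often each non-empty URL appears in the corpus.

    Hypothetical helper; assumes missing URLs are None or ''.
    """
    return Counter(u for u in urls if u)
```

The resulting counter maps each URL to its number of occurrences, so `count_url_citations(urls).most_common(10)` would list the ten most cited URLs.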


In [None]:
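rows = []

# Assumption: url_list holds the unique cited URLs collected in the previous step
url_list = citations_metadata["URL"].dropna().unique().tolist()

for reference_url in tqdm(url_list[11:21]):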
    
    response_json = get_and_cache_response(reference_url)

    if response_json:
        
        assert len(response_json) == 1
        obj = response_json[0]

        fields_to_extract = [
            "title",
            "author",
            "accessDate", # date
            "publicationTitle", # (name of the newspaper or journal)
            "itemType", # (journal article, book chapter, etc.)
        ]

        row = {field: extract_field(field, obj) for field in fields_to_extract}
        rows.append(row)


df = pd.DataFrame(rows)
df

+ Be able to resume the operation where it left off, for very large datasets
+ Validate the methodology by reproducing the data published by Andrew Lih in 2017 (https://docs.google.com/spreadsheets/d/1kNHSVKq5qZr6_3-05UFtXF2RA5sn3HdUUgL0nC57XPk/edit#gid=543588472)
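
One possible way to resume an interrupted run is to persist the in-memory `cache` dictionary to disk and reload it at start-up. The sketch below is only an assumption about how this could be done: the helper names `load_cache` and `save_cache` and the file name are hypothetical, and the cache is assumed to contain only JSON-serializable values, which is the case for responses obtained with `response.json()`.

```python
import json
import os


def load_cache(path):
    """Load a previously saved cache, or return an empty one (hypothetical helper)."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    return {}


def save_cache(cache, path):
    """Write the cache to a temporary file, then swap it into place.

    Replacing the file in one step avoids leaving a half-written cache
    behind if the run is interrupted during the write.
    """
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w', encoding='utf-8') as f:
        json.dump(cache, f)
    os.replace(tmp_path, path)
```

With this in place, one could start with `cache = load_cache('citoid_cache.json')` and call `save_cache(cache, 'citoid_cache.json')` every few requests, so that URLs already answered are not queried again after a restart.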

## Evaluation

+ Verdict: "supported by Citoid"
+ Match score, possibly using edit distance? We assume that the template metadata is correct (i.e., curated)
+ Optional: status in the Zotero translators dashboard (or "no translator")
+ Optional: JSON-LD (whether the URL offers metadata in JSON-LD, yes or no)
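
If edit distance is chosen for the match score, it could look like the following sketch. It uses the classic Levenshtein distance, normalized by the length of the longer string so that 1.0 means identical and 0.0 means nothing in common. The function names, the case-insensitive comparison and the normalization are all assumptions made for illustration; the notebook does not fix a scoring method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def match_score(template_value, citoid_value):
    """Similarity in [0, 1] between a template field and a Citoid field.

    Hypothetical scoring: values are compared case-insensitively after
    trimming, and None is treated as an empty string.
    """
    a = (template_value or '').strip().lower()
    b = (citoid_value or '').strip().lower()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

A threshold on this score (to be chosen empirically) could then feed the "supported by Citoid" verdict for each field.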

## Results

- Build the Citoid coverage table for each URL and basic metadata field
- Show coverage percentages (the Citoid coverage gap) by
    + Source type
    + Wikipedia language edition
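
The coverage percentages per group could be computed along these lines. This is a sketch under assumptions: each record is a dictionary with a grouping field (such as `Source_type`) and a boolean field, here called `supported`, holding the evaluation verdict. Both the `supported` field name and the function `coverage_by_group` are hypothetical.

```python
from collections import defaultdict


def coverage_by_group(records, group_key, supported_key='supported'):
    """Return the percentage of supported records for each group.

    Hypothetical helper; 'supported' is an assumed boolean verdict field.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec[supported_key]:
            hits[group] += 1
    return {group: 100.0 * hits[group] / totals[group] for group in totals}
```

The coverage gap for a group is then simply 100 minus its coverage percentage, and the same function can be reused with a language field as `group_key`.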
    