# Web2cit: up-to-date Citoid gap figures

This notebook describes how to retrieve the source code and the [Citation templates](https://en.wikipedia.org/wiki/Wikipedia:Citation_templates) of the references of a sample of Wikipedia articles to compare them with the results of the [Citoid](https://www.mediawiki.org/wiki/Citoid) API for the same references.

To follow along, we recommend running the script portions piecemeal, in order.

__Author:__

* Nidia Hernández, [nidiahernandez@conicet.gov.ar](mailto:nidiahy@gmail.com), CAICYT-CONICET


## Table of Contents

0. Setting Up 
1. Retrieving data from Wikipedia articles
    - 1.1. Fetching featured articles using Mediawiki's action API
    - 1.2. Retrieve featured articles data
2. Citation template metadata extraction
    - 2.1 Parameter name mapping
    - 2.2 Inspecting all featured articles
    - 2.3 Citation template extraction summary
3. Querying Citoid API
    - 3.1 Validate URLs before calling Citoid
    - 3.2 Query function and parallel requests
4. Evaluating Citoid results
5. Visualizing results
    - 5.1. Tabular representation
    - 5.2. Other representations ...
6. ...
    - 6.1. ...
    - 6.2. ...

## 0. Setting Up

Before we get started, let's install and import the libraries that we will need.

In [20]:
import re
import os
import requests
import urllib
import pandas as pd
from operator import itemgetter
import json
import gzip
import glob
from pprint import pprint
from tqdm import tqdm
import mwparserfromhell
import validators
from datetime import datetime
import concurrent.futures
from time import sleep
from more_itertools import chunked

Here we set the global parameters:

In [24]:
HEADER={'User-Agent': 'https://phabricator.wikimedia.org/tag/web2cit-research/; mailto:nidiahernandez@conicet.gov.ar'}

## 1. Retrieving data from Wikipedia articles

In this section, we will fetch all the featured articles from a selection of Wikipedias in order to find:

- the citation templates used on each article
- the URLs of the citations

The selected Wikipedias are the following:

- [English Wikipedia](https://en.wikipedia.org/wiki/Category:Featured_articles): ~6k featured articles
- [Spanish Wikipedia](https://es.wikipedia.org/wiki/Categor%C3%ADa:Wikipedia:Art%C3%ADculos_destacados): ~1.2k artículos destacados
- [French Wikipedia](https://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Article_de_qualit%C3%A9): ~2k articles de qualité
- [Portuguese Wikipedia](https://pt.wikipedia.org/wiki/Categoria:!Artigos_destacados): ~1.3k artigos destacados

Some general information about featured articles in all Wikipedias: https://meta.wikimedia.org/wiki/Wikipedia_featured_articles (out of date).

#### Featured content endpoint in Mediawiki's API

There is a specific endpoint for featured content in Mediawiki's API:

https://api.wikimedia.org/feed/v1/wikipedia/{lang}/featured/

For example, the request https://api.wikimedia.org/feed/v1/wikipedia/en/featured/2021/12/06 returns today's featured article (tfa) for December 6th 2021 in English Wikipedia.

Unfortunately, this is not available for all the Wikipedias we are interested in, because "not all Wikipedias are integrated into the Feed endpoint" ([Feed endpoint doc](https://api.wikimedia.org/wiki/API_reference/Feed/Featured_content)). For instance, the request https://api.wikimedia.org/feed/v1/wikipedia/fr/featured/2021/12/06 does not return tfa info, and the same holds for Spanish and Portuguese. For this reason, we do not use this endpoint.

### 1.1 Fetching featured articles using Mediawiki's action API

We will use each [Wikipedia's action API](https://www.mediawiki.org/wiki/API:Main_page) to retrieve:

1. the list of all featured articles for each language
2. the wikicode and some metadata of each featured article

The action API has the following URL:

`https://{lang}.wikipedia.org/w/api.php?`

The following request, for example, retrieves the first 15 featured articles for the category "Articles de qualité" in French Wikipedia:

In [3]:
title = urllib.parse.unquote("Cat%C3%A9gorie:Article_de_qualit%C3%A9") # category name for featured articles in French Wikipedia

PARAMS = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": title,
    "cmlimit": 15,
    "format": "json"
}

response = requests.get(
    url='https://fr.wikipedia.org/w/api.php?', 
    params=PARAMS
)

response_data = response.json()

Each request returns at most 500 category members. To retrieve all the members of the category, we take the value of `cmcontinue` from the response to one request and pass it as the `cmcontinue` parameter of the next request, repeating until the response no longer includes a `continue` entry. See the API documentation for more details on listing category members: https://www.mediawiki.org/wiki/API:Categorymembers.

In [4]:
response_data['continue']['cmcontinue'] # here is the reference of the starting point for the next request

'page|2a2e38324232443a30324e01428844880901dc10|52519'
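The continuation loop described above can be sketched as a small generator. This is only a minimal sketch, not the code used in the rest of this notebook: `iter_category_members` and its `fetch` argument are hypothetical names, and `fetch` is injectable so that the loop can be tried with fake responses instead of real network requests.

```python
def iter_category_members(api_url, category_title, fetch=None):
    """Yield every member of a category, following `cmcontinue` until it disappears."""
    if fetch is None:
        import requests  # imported lazily; only needed for real requests

        def fetch(params):
            return requests.get(api_url, params=params).json()

    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category_title,
        "cmlimit": "max",
        "format": "json",
    }
    while True:
        data = fetch(dict(params))
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break  # last page reached
        # resume the listing where the previous response stopped
        params["cmcontinue"] = data["continue"]["cmcontinue"]


# Offline demonstration: two fake pages stand in for API responses
fake_pages = {
    None: {
        "query": {"categorymembers": [{"pageid": 1, "ns": 0, "title": "A"}]},
        "continue": {"cmcontinue": "page|abc|2"},
    },
    "page|abc|2": {
        "query": {"categorymembers": [{"pageid": 2, "ns": 0, "title": "B"}]},
    },
}
members = list(iter_category_members(
    "https://fr.wikipedia.org/w/api.php",
    "Catégorie:Article de qualité",
    fetch=lambda params: fake_pages[params.get("cmcontinue")],
))
print([m["title"] for m in members])
```

With a real `fetch`, the same loop would issue one request per page of results.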

In the response, we obtain a list showing the pageid and the title of each article:

In [5]:
response_data['query']['categorymembers']

[{'pageid': 507899, 'ns': 4, 'title': 'Wikipédia:Articles de qualité'},
 {'pageid': 9253139, 'ns': 0, 'title': '1 000 kilomètres de Spa 2009'},
 {'pageid': 10267261, 'ns': 0, 'title': '6 Heures du Castellet 2011'},
 {'pageid': 91996, 'ns': 0, 'title': '32X'},
 {'pageid': 605584, 'ns': 0, 'title': 'Les 101 Dalmatiens (film, 1961)'},
 {'pageid': 5008651, 'ns': 0, 'title': 'A Different Corner'},
 {'pageid': 639219, 'ns': 0, 'title': "A Hard Day's Night (album)"},
 {'pageid': 6672844, 'ns': 0, 'title': "À l'Olympia (album d'Alan Stivell)"},
 {'pageid': 124634, 'ns': 0, 'title': 'À la croisée des mondes'},
 {'pageid': 1486250, 'ns': 0, 'title': "L'Abbaye de Northanger"},
 {'pageid': 11085052, 'ns': 0, 'title': 'Abbaye Saint-Paul de Cormery'},
 {'pageid': 1210952, 'ns': 0, 'title': 'Abbaye Saint-Victor de Marseille'},
 {'pageid': 4293975, 'ns': 0, 'title': "Parc national d'Abisko"},
 {'pageid': 1415801, 'ns': 0, 'title': 'Acanthaster planci'},
 {'pageid': 3906668,
  'ns': 0,
  'title': 'Acci

If the item is a featured article, `ns` (namespace) is `0`. To avoid retrieving other items (meta-pages about the category, for example), elements with any other value for this key must be discarded (cf. *Retrieve featured articles data* below).
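As a quick illustration of the namespace filter, here is a minimal sketch applied to three of the items listed above. The variable names are only illustrative; the notebook itself does this filtering later with pandas.

```python
# Three entries copied from the response shown above
members = [
    {'pageid': 507899, 'ns': 4, 'title': 'Wikipédia:Articles de qualité'},
    {'pageid': 9253139, 'ns': 0, 'title': '1 000 kilomètres de Spa 2009'},
    {'pageid': 91996, 'ns': 0, 'title': '32X'},
]

# Keep only the items in the main (article) namespace
articles = [member for member in members if member['ns'] == 0]
print([article['title'] for article in articles])
```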

Let's request all the featured articles for each language and dump the response to a json:

In [6]:
# the name of the category in each language
catname_bylang = {
    'en': 'Category:Featured_articles',
    'es': 'Categor%C3%ADa:Wikipedia:Art%C3%ADculos_destacados',
    'fr': 'Cat%C3%A9gorie:Article_de_qualit%C3%A9',
    'pt': 'Categoria:!Artigos_destacados',
}

In [7]:
os.makedirs('featured-articles', exist_ok=True) # make sure the output directory exists

for lang in tqdm(catname_bylang, desc='Retrieving featured articles'):
    
    MEDIAWIKI_API_URL = f'https://{lang}.wikipedia.org/w/api.php?'
    
    category_title = urllib.parse.unquote(catname_bylang[lang])
    cmcont = "start"
    i = 0
    
    while (cmcont == "start") or cmcont.startswith('page'):
        i += 1
        filename = f'featured-articles/{lang}-{i:02}.json'
        
        if os.path.isfile(filename):
            with open(filename) as fi:
                featured_art_json = json.load(fi)
                if 'continue' in featured_art_json:
                    cmcont = featured_art_json['continue']['cmcontinue']
                else:
                    cmcont = "end"

            continue
        
        PARAMS = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": category_title,
            "cmlimit": "max", 
            "format": "json",
            "curtimestamp": True
        }
            
        if cmcont.startswith('page'):
            PARAMS["cmcontinue"]=cmcont

        response = requests.get(
            url=MEDIAWIKI_API_URL, 
            params=PARAMS,
            headers=HEADER
        )

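        if response.status_code == 200:
            data = response.json()

            if 'continue' in data:
                cmcont = data['continue']['cmcontinue']
            else:
                cmcont = 'categoryend'

            with open(filename, 'w') as fo:
                fo.write(json.dumps(data))

        else:
            # stop paginating this language instead of looping forever on a failing request
            print(f'Request for {lang} failed with status {response.status_code}')
            break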

Retrieving featured articles: 100%|██████████| 4/4 [00:00<00:00, 176.40it/s]


### 1.2 Retrieve featured articles data

Now that we have the pageids of all the featured articles, we can use them to retrieve the URL, the content and the id of the latest revision (`revid`). First, we load the information from the json files and build a dataframe where we will store the data for each featured article:

In [49]:
dfs = []

for fname in sorted(os.listdir('featured-articles')):
    if fname.endswith('.json'):
        
        wikilang = fname.split('-')[0]
        
        with open(f'featured-articles/{fname}') as fi:
            featured_arts = json.load(fi)
            
        df = pd.DataFrame(featured_arts['query']['categorymembers'])
        df['wiki_lang'] = wikilang
        df = df[df['ns'] == 0] # discard items that are not featured articles
        df.drop('ns', axis='columns', inplace = True)

        dfs.append(df)
    
articles_data = pd.concat(dfs, ignore_index= True)
articles_data.rename(columns={'title':'article_title'}, inplace = True)
articles_data.drop_duplicates(inplace = True)

In [50]:
articles_data

Unnamed: 0,pageid,article_title,wiki_lang
0,33653136,? (film),en
1,1849799,0.999...,en
2,9702578,1 − 2 + 3 − 4 + ⋯,en
3,48723612,1st Cavalry Division (Kingdom of Yugoslavia),en
4,64662351,1st Missouri Field Battery,en
...,...,...,...
15065,132185,Yorkshire terrier,pt
15066,2269483,You Belong with Me,pt
15067,4779921,You Don't Know What to Do,pt
15068,1585284,You Know I'm No Good,pt


We add a new column containing the URL of each article using the information from the article title and the language of the wikipedia:

In [51]:
articles_data['article_url'] = "https://"+articles_data['wiki_lang']+".wikipedia.org/wiki/"+articles_data['article_title'].map(urllib.parse.quote)

In [52]:
articles_data

Unnamed: 0,pageid,article_title,wiki_lang,article_url
0,33653136,? (film),en,https://en.wikipedia.org/wiki/%3F%20%28film%29
1,1849799,0.999...,en,https://en.wikipedia.org/wiki/0.999...
2,9702578,1 − 2 + 3 − 4 + ⋯,en,https://en.wikipedia.org/wiki/1%20%E2%88%92%202%20%2B%203%20%E2%88%92%204%20%2B%20%E2%8B%AF
3,48723612,1st Cavalry Division (Kingdom of Yugoslavia),en,https://en.wikipedia.org/wiki/1st%20Cavalry%20Division%20%28Kingdom%20of%20Yugoslavia%29
4,64662351,1st Missouri Field Battery,en,https://en.wikipedia.org/wiki/1st%20Missouri%20Field%20Battery
...,...,...,...,...
15065,132185,Yorkshire terrier,pt,https://pt.wikipedia.org/wiki/Yorkshire%20terrier
15066,2269483,You Belong with Me,pt,https://pt.wikipedia.org/wiki/You%20Belong%20with%20Me
15067,4779921,You Don't Know What to Do,pt,https://pt.wikipedia.org/wiki/You%20Don%27t%20Know%20What%20to%20Do
15068,1585284,You Know I'm No Good,pt,https://pt.wikipedia.org/wiki/You%20Know%20I%27m%20No%20Good
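The `article_url` values above are produced by `urllib.parse.quote`, which percent-encodes spaces as `%20`. The sketch below reproduces one URL from the table. `build_article_url` is a hypothetical helper, and the underscore variant rests on the assumption that MediaWiki treats underscores and spaces in titles as equivalent; it is not what this notebook stores.

```python
from urllib.parse import quote


def build_article_url(lang, title, underscores=False):
    """Build the URL of an article from its language code and title."""
    if underscores:
        # assumption: underscores and spaces are interchangeable in titles
        title = title.replace(" ", "_")
    return f"https://{lang}.wikipedia.org/wiki/{quote(title)}"


url_with_spaces = build_article_url("en", "? (film)")
url_with_underscores = build_article_url("en", "? (film)", underscores=True)
print(url_with_spaces)
print(url_with_underscores)
```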


We save this information to a csv file:

In [35]:
articles_data.to_csv('featured_articles.csv', index=False)

Now we are ready to query Mediawiki's action API again to get the content of each featured article. 

For example, to retrieve the wikitext and the id of the latest revision for the article #3906668 'Accident sur la base de Fairchild en 1994' from French Wikipedia, we can make the following request:

In [36]:
pageid = 3906668

PARAMS = {
    "action": "parse",
    "pageid": pageid,
    "prop": "wikitext|revid",
    "format": "json"
}

response = requests.get(
    url='https://fr.wikipedia.org/w/api.php?', 
    params=PARAMS,
)

page_data = response.json()

In [37]:
page_data

{'parse': {'title': 'Accident sur la base de Fairchild en 1994',
  'pageid': 3906668,
  'revid': 186075124,
  'wikitext': {'*': '{{En-tête label|AdQ}}\n{{Infobox Accident de transport\n  | nom                  = Accident du B-52 sur la base de Fairchild en 1994\n  | image                = FairchildB52Crash.jpg\n  | légende              = Le B-52 sur la tranche une seconde avant de toucher le sol.\n  | date                 = {{date|24|juin|1994|en aéronautique}}\n  | phase                = Vol d\'entraînement\n  | type                 = Erreur de pilotage\n  | site                 = [[Fairchild Air Force Base]], [[Washington (État)|Washington]], [[États-Unis]]\n  | passagers            = \n  | équipage             = 4 militaires\n  | morts                = 4 militaires\n  | blessés              = \n  | survivants           = \n  | appareil             = [[Boeing B-52 Stratofortress|Boeing B-52H \'\'Stratofortress\'\']]\n  | compagnie            = [[United States Air Force]]\n  | numéro_
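The fields we need sit under the `parse` key of the response, with the wikitext itself under `wikitext` and then `*`. A minimal sketch of how they could be pulled out follows. `extract_revision` is a hypothetical helper, and `sample` is a shortened stand-in for the response shown above, not real API output.

```python
def extract_revision(page_data):
    """Return title, pageid, revid and wikitext from an `action=parse` response."""
    if "error" in page_data:
        # the action API reports problems inside the JSON body
        raise ValueError(page_data["error"].get("info", "unknown API error"))
    parsed = page_data["parse"]
    return {
        "title": parsed["title"],
        "pageid": parsed["pageid"],
        "revid": parsed["revid"],
        "wikitext": parsed["wikitext"]["*"],
    }


# Shortened stand-in for the response shown above
sample = {
    "parse": {
        "title": "Accident sur la base de Fairchild en 1994",
        "pageid": 3906668,
        "revid": 186075124,
        "wikitext": {"*": "{{En-tête label|AdQ}}\n"},
    }
}
revision = extract_revision(sample)
print(revision["revid"], len(revision["wikitext"]))
```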

Let's fetch the wikitext and the id of the latest revision for each featured article and save the response to a compressed file:

In [53]:
os.makedirs('articles-content', exist_ok=True) # make sure the output directory exists

for index, row in tqdm(articles_data.iterrows(), desc='Retrieving data from articles'):

    pageid = row['pageid']
    lang = row['wiki_lang']
    filename = f'articles-content/{pageid}-{lang}.json.gz'

    # skip the articles that were already downloaded
    if os.path.isfile(filename):
        continue

    MEDIAWIKI_API_URL = f'https://{lang}.wikipedia.org/w/api.php?'

    PARAMS = {
        "action": "parse",
        "pageid": pageid,
        "prop": "wikitext|revid",
        "format": "json"
    }

    response = requests.get(
        url=MEDIAWIKI_API_URL, 
        params=PARAMS,
        headers=HEADER
    )

    # only parse and store successful responses
    if response.status_code == 200:
        page_data = response.json()
        with gzip.open(filename, 'w') as fo:
            fo.write(json.dumps(page_data).encode('utf-8'))


Retrieving data from articles: 10570it [00:06, 1640.30it/s]


## 2. Citation template metadata extraction

Now that we have the wikitext of the articles, we will parse it to retrieve the references that were introduced using a citation template. In other words, we are not interested in:

1. Manually entered references (i.e., which do not use citation templates)

`<ref name=Briggs>Briggs, Asa & Burke, Peter (2002) ''A Social History of the Media: from Gutenberg to the Internet'', Cambridge: Polity, pp. 15–23, 61–73.</ref>`

2. Unlinked references (citation templates that do not include a URL)

```<ref name=Neeham>{{cite book |title=Paper and Printing |author=[[Tsien Tsuen-Hsuin]] |author2=[[Joseph Needham]] |series=Science and Civilisation in China|volume=5 part 1|publisher=Cambridge University Press|pages=158, 201|year=1985}}</ref> ```

An example of the type of references that we want to keep:

```<ref name="VB1992">{{cite journal|last1=Osmond|first1=Patricia J.|last2=Ulery|first2=Robert W.|date=2003|title=Sallustius|url=http://catalogustranslationum.org/PDFs/volume08/v08_sallustius.pdf#page=17|journal=[[Catalogus Translationum et Commentariorum]]|volume=8|page=199|access-date=27 August 2015}}</ref>```

Therefore, we parse the page content looking only for the citation templates including a URL.

### 2.1 Parameter name mapping

From each reference, we are interested in extracting the following metadata: 
1. The source type (journal, book, website, etc.)
2. The URL of the source
3. The author(s)
4. The title
5. The publishing date
6. The publishing source (publisher, location, etc.)

The name of each parameter may vary between templates. For instance, the publishing source is under "periodical" for the news template and under "publisher" for the books template, and the publishing date may be called "date" or "year" in the maps template. The [following spreadsheet](https://docs.google.com/spreadsheets/d/1xbc3FKE0m4JQHa6WCXtBbzeJ9in8P0EQ2NF_VNsaBaM/edit#gid=0) maps the parameter names of the citation templates to our fieldnames:

In [59]:
mapping_sheet = f'https://docs.google.com/spreadsheets/d/1xbc3FKE0m4JQHa6WCXtBbzeJ9in8P0EQ2NF_VNsaBaM/export?gid=0&format=csv'

fieldnames_df = pd.read_csv(mapping_sheet)
## rename columns
fieldnames_df.rename(columns={
    "Template": "template",
    "authorLast": "author_last",
    "authorFirst": "author_first",
    "pubDate": "pub_date",
    "source (published In + published By)": "pub_source"
}, inplace=True)

fieldnames_df

Unnamed: 0,wiki_lang,template,title,author_last,author_first,pub_date,pub_source
0,en,Cite web,title,"last\d*, author\d*",first\d*,date,"website, publisher"
1,es,Cita web,título,"apellidos?\d*, autor\d*",nombre\d*,fecha,"sitioweb, obra, publicación, editorial"
2,pt,Citar web,titulo,"ultimo\d*, autor\d*",primeiro\d*,data,"obra, publicado"
3,fr,Lien web,titre,"nom\d*, auteur\d*",prénom\d*,date,site
4,en,Cite book,title,"last\d*, author\d*",first\d*,date,"publication-place, location, publisher"
...,...,...,...,...,...,...,...
78,fr,Lien arXiv,titre,nom\d*,prénom\d*,date,eprint
79,pt,Citar arXiv,titulo,"ultimo\d*, autor\d*",primeiro\d*,data,eprint
80,en,Citation,title,last\d*,first\d*,date,"place, publisher"
81,es,Obra citada,título,apellidos?\d*,nombre\d*,fecha,editorial


The spreadsheet accepts several parameter names for the same field (`sitioweb, obra, publicación, editorial`) and it also accepts regular expressions (`apellidos?\d*`).
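To make the matching concrete, here is a minimal sketch of how one spreadsheet cell can be applied to the parameter names of a template. `match_params` is a hypothetical helper and the parameter list is made up for illustration. It relies on `re.fullmatch`, so that a pattern such as `autor\d*` matches `autor` or `autor2` but not `autores`.

```python
import re


def match_params(patterns_cell, template_params):
    """Return the template parameter names matched by a comma-separated cell of patterns."""
    matched = []
    for pattern in patterns_cell.split(", "):
        for name in template_params:
            # fullmatch: the whole parameter name must match the pattern
            if re.fullmatch(pattern, name):
                matched.append(name)
    return matched


# Made-up parameter names for a Spanish citation template
params = ["apellido", "apellidos2", "autor", "autores", "nombre", "título"]
author_params = match_params(r"apellidos?\d*, autor\d*", params)
print(author_params)
```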

In [58]:
for lang in fieldnames_df['wiki_lang'].unique():
    print(f"We have {len(fieldnames_df[fieldnames_df['wiki_lang'] == lang])} templates in {lang}")

We have 26 templates in en
We have 18 templates in es
We have 24 templates in pt
We have 12 templates in fr


### 2.2 Inspecting all featured articles

Taking this spreadsheet into account, we define some methods to extract the citation metadata from the articles:

In [64]:
def read_article(filename):
    with gzip.open(filename) as fi:
        cont = json.loads(fi.read())
    return cont
    
def get_article_data(filename, cont):
    lang = os.path.basename(filename).split('-')[1].replace('.json.gz', '')
    pageid = cont['parse']['pageid'] 
    revid = cont['parse']['revid']
    article_title = cont['parse']['title']
    article_url = f'https://{lang}.wikipedia.org/wiki/{urllib.parse.quote(article_title)}'
    
    article_data = {
        "article_title": article_title,
        "article_url": article_url,
        "page_id": pageid,
        "revid": revid,
        "wiki_lang": lang,
    }
    return article_data

def get_citation_data(fieldname, template, article_data):
    '''
    Receives a row of the mapping dataframe (w2c fieldnames to citation template parameter names),
    a citation template parsed by mwparserfromhell and the data of the article it belongs to.
    Matches w2c fieldnames with citation template parameter names and extracts the values for those parameters.
    Returns a dictionary with the extracted data.
    '''
    ## Build result dict
    citation_data = article_data.copy()
    citation_data.update({
        "url": None,
        "source_type": template.name.strip(),
        "title": [],
        "author_last": [],
        "author_first": [],
        "pub_date": [],
        "pub_source": [],
    })

    ## Find parameters in wikicode
    wikicode_params = [param.name.strip() for param in template.params]

    ## Match wikicode parameters with the fieldnames of our spreadsheet
    for key in ["title", "author_first", "author_last", "pub_date", "pub_source"]:
        params_found_in_wikicode = []
        
        if pd.isna(fieldname[key]):
            continue
        for param_re in fieldname[key].split(", "):
            for wikicode_param in wikicode_params:
                if re.fullmatch(param_re, wikicode_param):
                    params_found_in_wikicode.append(wikicode_param)

        for param_found in params_found_in_wikicode:
             citation_data[key].append(template.get(param_found).value.strip())
                
    url_fieldname = "url" if "url" in wikicode_params else "URL"
    citation_data["url"] = template.get(url_fieldname).value.strip()

    return citation_data

And now we apply these methods to the ~10.5k featured articles that we obtained in the previous step. In the next cell, we inspect the json having the content of each article, we parse the wikicode using [mwparserfromhell](https://github.com/earwig/mwparserfromhell) and for each template:

1. we verify if it appears in our spreadsheet
2. we verify if it contains a URL
3. we extract the values for URL, title, author_last, author_first, pub_date, pub_source
4. we store the information in a dataframe containing all the citations of the article that satisfy conditions 1 and 2

In [None]:
## Extract citation data for each citation in all featured articles
os.makedirs('citations-metadata', exist_ok = True)

for filename in tqdm(glob.glob('articles-content/*json.gz'), desc=f"Finding citations and extracting metadata"):
    new_filename = f"citations-metadata/{os.path.basename(filename).replace('json.gz', 'csv.gz')}"
    
    if os.path.isfile(new_filename):
        continue

    article_cont = read_article(filename)
    wikitext = article_cont['parse']['wikitext']['*']
    parsed_wiki = mwparserfromhell.parse(wikitext)
    article_data = get_article_data(filename, article_cont)
    lang = article_data['wiki_lang']

    article_citations = []
    for template in parsed_wiki.filter_templates():        
        for index, fieldname in fieldnames_df[fieldnames_df['wiki_lang'] == lang].iterrows():
            
            ## Inspect templates listed in our spreadsheet and extract data for each citation    
            if template.name.matches(fieldname['template']):
                if template.has_param('url', ignore_empty=True) or template.has_param('URL', ignore_empty=True):

                    citation_data = get_citation_data(fieldname, template, article_data)
                    article_citations.append(citation_data)
                    
    article_citations_df = pd.DataFrame(article_citations)
    if not article_citations_df.empty:
        article_citations_df.to_csv(new_filename, index=False)


### 2.3 Citation template extraction summary

Let's take a look at the results. First, we build a dataframe of the citations:

In [None]:
dfs = []

for filename in tqdm(sorted(os.listdir('citations-metadata'))): 

    with gzip.open(f'citations-metadata/{filename}') as fi:
        article_citations = pd.read_csv(fi)

    dfs.append(article_citations)
    
all_citations = pd.concat(dfs, ignore_index= True)
all_citations

We found ~461k citations in ~9.9k articles. This means that there are 629 articles for which no reference was extracted. There are several possible reasons for this:

1. the references in those articles do not use citation templates
2. the references use citation templates but they do not include a URL
3. the citation templates are not listed in our spreadsheet

Let's summarize the names of the citation templates that were identified, and how many instances of each were extracted.

In [None]:
template_counts = all_citations[['wiki_lang', 'source_type']].value_counts()
template_counts.name = 'freq'
template_counts = template_counts.reset_index()
template_counts = template_counts.sort_values(['wiki_lang', 'freq'], ascending=[True, False])
template_counts = template_counts.reset_index(drop = True)

freq_total = template_counts.groupby('wiki_lang')['freq'].sum().to_dict()
template_nr = template_counts['wiki_lang'].map(freq_total)
template_counts['freq_rel'] = round(template_counts['freq']/template_nr, 2)
template_counts

In [None]:
template_counts.to_csv('citation_templates_freq.csv', index=False)

## 3. Querying Citoid API

In this section, we query the [Citoid API](https://en.wikipedia.org/api/rest_v1/#/Citation/getCitation) for the citation metadata of the URLs found in the previous step.

### 3.1 Validate URLs before calling Citoid

To optimize this step and to avoid loading the Citoid service unnecessarily, we check that each URL meets the following requirements:

- it is well formed
- it is public
- it uses the http or https URL scheme
- it does not point to a pdf (Citoid does not support pdfs)

In [8]:
def validate_url(url):
    is_pdf = url.endswith((".pdf", ".PDF"))
    is_ftp = url.startswith("ftp") # validators excludes schemes other than http, https, ftp
    is_valid = validators.url(url, public=True)

    # validators.url returns a falsy object when the validation fails
    return bool(is_valid) and not (is_pdf or is_ftp)
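For comparison, the requirements listed above can be approximated with the standard library alone. This is only a sketch under stated assumptions and not a replacement for the `validators` check: `looks_requestable` is a hypothetical name, "public" is reduced to rejecting `localhost` and non-global IP literals, and the pdf test looks at the URL path, so it also catches pdf links that carry a query string.

```python
import ipaddress
from urllib.parse import urlparse


def looks_requestable(url):
    """Rough stdlib check: http(s) scheme, a host, not a pdf path, not a local address."""
    try:
        parts = urlparse(url)
        host = parts.hostname
    except ValueError:
        return False  # malformed URL
    if parts.scheme not in ("http", "https") or not host:
        return False
    if parts.path.lower().endswith(".pdf"):
        return False
    try:
        # only applies when the host is an IP literal
        if not ipaddress.ip_address(host).is_global:
            return False
    except ValueError:
        pass  # the host is a domain name, not an IP address
    return host != "localhost"


for candidate in [
    "https://www.mediawiki.org/wiki/Citoid",
    "http://www.lume.ufrgs.br/bitstream/handle/10183/33270/000114367.pdf?sequence=1",
    "ftp://example.org/file.txt",
    "http://127.0.0.1/admin",
]:
    print(candidate, looks_requestable(candidate))
```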

In [9]:
dfs = []

for filename in tqdm(sorted(os.listdir('citations-metadata')), desc='Validating URLs'):
    with gzip.open(f'citations-metadata/{filename}') as fi:
        article_citations = pd.read_csv(fi)
        article_citations = article_citations.dropna(subset=['url']) ## TODO: review

    valid_urls = article_citations[article_citations['url'].map(validate_url)]
    dfs.append(valid_urls)
    
all_valid_urls = pd.concat(dfs, ignore_index= True)

Validating URLs: 100%|██████████| 9978/9978 [01:01<00:00, 163.08it/s]


In [10]:
all_valid_urls.to_csv('citations_metadata_valid_urls.csv', index=False)

We also remove duplicated URLs to avoid requesting the same information twice:

In [13]:
valid_urls = all_valid_urls.drop_duplicates(subset=['url'])

In [14]:
print(f'Valid urls: {len(all_valid_urls)}')
print(f'Valid urls without duplicates: {len(valid_urls)}')

Valid urls: 431929
Valid urls without duplicates: 382984


### 3.2 Query function and parallel requests

We set up the result directory and the function for requesting Citoid's data for one URL:

In [17]:
citoid_cache_dir = 'citoid-cache/'
os.makedirs(citoid_cache_dir, exist_ok = True)

In [18]:
def get_and_cache_response(reference_url, outfname, HEADER):
    
    if os.path.isfile(outfname):
        return f"Citoid response for {reference_url} already in cache"
    
    parsed_url = urllib.parse.quote(reference_url).replace('/', '%2F')
    response = requests.get(
        url = 'https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/'+parsed_url,
        headers = HEADER,
    )

    citoid_data = response.json()

    # Add request timestamp in Zulu format (UTC time, so that the 'Z' suffix is accurate)
    tstamp = f"{datetime.utcnow().isoformat(timespec='seconds')}Z"
    if isinstance(citoid_data, dict): 
        citoid_data["requestTimestamp"] = tstamp
    else:
        citoid_data[0]["requestTimestamp"] = tstamp

    with gzip.open(outfname, 'w') as fo:
        fo.write(json.dumps(citoid_data).encode('utf-8'))
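The reference URL is sent to Citoid as a single path segment, so every reserved character, including `/`, has to be percent-encoded. The sketch below shows what the request URL looks like for one of the references. `build_citoid_request_url` is a hypothetical helper; passing `safe=""` to `urllib.parse.quote` should give the same result as the `quote(...).replace('/', '%2F')` combination used in the function above.

```python
from urllib.parse import quote

CITOID_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/"


def build_citoid_request_url(reference_url):
    """Percent-encode the whole reference URL, slashes included, and append it to the endpoint."""
    return CITOID_ENDPOINT + quote(reference_url, safe="")


example = "http://news.bbc.co.uk/1/hi/england/tyne/7579943.stm"
request_url = build_citoid_request_url(example)
print(request_url)
```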


In [21]:
urls_fnames = []

for index, row in valid_urls.iterrows():

    pageid = row['page_id']
    outfname = f"{citoid_cache_dir}page_{pageid:08}-valid_url_{index:06}.json.gz"
    url = row['url']
    
    urls_fnames.append((url, outfname))

In [None]:
for batch_of_ufs in tqdm(chunked(urls_fnames, 500), total=len(urls_fnames)//500):
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        
        future_to_url = {}
        
        for uf in batch_of_ufs:
            url, outfname = uf
            future = executor.submit(get_and_cache_response, url, outfname, HEADER)
            future_to_url[future] = uf
            
        for future in concurrent.futures.as_completed(future_to_url):
            uf = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print( f'{uf} generated an exception: {exc}')
    
    sleep(10)

  0%|          | 0/765 [00:00<?, ?it/s]

('http://www.telegraph.co.uk/news/worldnews/australiaandthepacific/newzealand/7846625/Most-expensive-feather-ever-fetches-4000-at-auction.html', 'citoid-cache/page_00100018-valid_url_000068.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('http://www.lillemetropole.fr/mel/institution/competences/dechets-menagers.html', 'citoid-cache/page_00100154-valid_url_000395.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('https://recordsearch.naa.gov.au/SearchNRetrieve/NAAMedia/ShowImage.aspx?B=1492204&T=PDF', 'citoid-cache/page_01001558-valid_url_000427.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0


  0%|          | 1/765 [02:17<29:10:44, 137.49s/it]

('https://www.jstage.jst.go.jp/article/kikaic1979/72/714/72_714_471/_pdf', 'citoid-cache/page_00100340-valid_url_000666.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0


  0%|          | 2/765 [03:37<21:55:42, 103.46s/it]

('http://www.timeanddate.com/worldclock/city.html?n=265', 'citoid-cache/page_00100730-valid_url_001415.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('http://mdotnetpublic.state.mi.us/tmispublic/', 'citoid-cache/page_01006150-valid_url_001182.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0


  0%|          | 3/765 [05:51<24:54:48, 117.70s/it]

('http://myride.winnipegtransit.com/en/inside-transit/interestingtransitfacts/', 'citoid-cache/page_00100730-valid_url_001579.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('https://www.amazon.fr/Je-suis-n%C3%A9-jour-bleu/dp/2290011436', 'citoid-cache/page_10101628-valid_url_002004.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0


  1%|          | 4/765 [07:24<22:49:47, 108.00s/it]

('http://news.bbc.co.uk/1/hi/england/tyne/7579943.stm', 'citoid-cache/page_01011219-valid_url_002143.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('https://acervo.folha.com.br/leitor.do?numero=17133&keyword=SANDY&anchor=5625013&origem=busca&pd=d66444d2be912980e04eef9fe075feee', 'citoid-cache/page_00101409-valid_url_002526.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0


  1%|          | 5/765 [08:56<21:30:58, 101.92s/it]

('https://f5.folha.uol.com.br/celebridades/carnaval/2019/02/fa-de-sandy-lorena-queiroz-febre-de-carinha-de-anjo-tera-bloquinho-proprio-em-sp.shtml', 'citoid-cache/page_00101409-valid_url_002773.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('http://www1.folha.uol.com.br/folha/ilustrada/ult90u7934.shtml', 'citoid-cache/page_00101409-valid_url_002851.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0


  1%|          | 7/765 [12:27<22:23:08, 106.32s/it]

('http://seer.ufrgs.br/index.php/bgg/article/download/40018/25538', 'citoid-cache/page_00010193-valid_url_004145.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('http://www.lume.ufrgs.br/bitstream/handle/10183/33270/000114367.pdf?sequence=1', 'citoid-cache/page_00010193-valid_url_004139.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0


  1%|          | 8/765 [14:04<21:40:56, 103.11s/it]

('https://periodicos.furg.br/momento/article/download/4408/2766', 'citoid-cache/page_00010193-valid_url_004251.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('http://www.les-lettres-francaises.fr/2010/09/jack-kerouac-et-le-jazz/', 'citoid-cache/page_01019581-valid_url_004311.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0


  1%|          | 9/765 [15:24<20:10:25, 96.07s/it] 

('http://www.amazon.com/Lives-Pillars-Orthodoxy-Dormition-Skete/dp/0944359043', 'citoid-cache/page_01022186-valid_url_004683.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0


  1%|▏         | 10/765 [16:51<19:35:01, 93.38s/it]

('http://www.olympic.org/fr/saint-moritz-1948-olympiques-hiver', 'citoid-cache/page_00102878-valid_url_005499.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('http://www.cliffordmeth.com/alanmooretalkstocliff-pt2.htm', 'citoid-cache/page_00102915-valid_url_005550.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('http://www.portlandonline.com/bes/watershedapp/index.cfm?action=DisplayContent&SubWaterShedID=26&SubjectID=3&TopicID=26&SectionID=1', 'citoid-cache/page_10298609-valid_url_005566.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('http://www.cbc.ca/beta/news/politics/sir-john-a-macdonald-toonie-to-celebrate-1st-pm-s-200th-birthday-1.2879467', 'citoid-cache/page_00103110-valid_url_005629.json.gz') generated an exception: [Errno Expecting value] upstream request timeout: 0
('http://www.cbc.ca/news/canada/ottawa/ottawa-river-parkway-renamed-after-sir-john-a-macdonald-1.
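
Most of the exceptions logged above are upstream timeouts. One possible way to handle them, shown here only as a sketch, is to retry a failed call a few times with an increasing delay. `call_with_retries` is a hypothetical helper that is not part of this notebook, and whether retrying actually recovers Citoid timeouts is an assumption that would need to be checked.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); after an exception, wait base_delay * 2**attempt and try again."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == retries:
                raise  # give up after the last allowed attempt
            sleep(base_delay * 2 ** attempt)


# Offline demonstration: a fake request that fails twice before succeeding
attempts = []


def flaky_request(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise TimeoutError("upstream request timeout")
    return {"url": url}


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_request, "http://example.org/", sleep=delays.append)
print(result, delays)
```

In the thread pool above, the same idea would amount to submitting `call_with_retries` with `get_and_cache_response, url, outfname, HEADER` as arguments instead of submitting the function directly.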

## Tidying up the queried data 

## Evaluation

## Results
    