# Create configuration files for Web2Cit

This notebook performs automatic creation of [Web2Cit](https://meta.wikimedia.org/wiki/Web2Cit) configuration files for the web domains evaluated in "understand-citoid-coverage.ipynb".

__Author:__

* Nidia Hernández, [nidiahernandez@conicet.gov.ar](mailto:nidiahy@gmail.com), CAICYT-CONICET


In [1]:
import os
import pandas as pd
from operator import itemgetter
import json
import gzip
from glob import glob
from pprint import pprint
from tqdm import tqdm
tqdm.pandas()
from urllib.parse import urlparse
from pprint import pprint

  from pandas import Panel


## Web2Cit's configuration files

Doc: https://meta.wikimedia.org/wiki/Web2Cit/Early_adopters#Domain_configuration_files

Example config file: https://meta.wikimedia.org/wiki/Web2Cit/data/com/nbcnews/www/tests.json

This is how a config file looks like:

```
[
 {
   "path": "/business/economy/july-inflation-numbers-consumer-prices-rose-85-year-year-summer-inflat-rcna42393",
   "fields": [
     {
       "fieldname": "itemType",
       "goal": [
         "newspaperArticle"
       ]
     },
     {
       "fieldname": "title",
       "goal": [
         "Consumer prices rose by 8.5% year over year in July as the summer of inflation wears on"
       ]
     },
     {
       "fieldname": "authorLast",
       "goal": [
         "Rob Wile"
       ]
     },
     {
       "fieldname": "date",
       "goal": [
         "2022-08-10"
       ]
     },
     {
       "fieldname": "publishedIn",
       "goal": [
         "NBC News"
       ]
     },
     {
       "fieldname": "language",
       "goal": [
         "en"
       ]
     }
   ]
 }
] 
```




Currently, there are 25 config files available at https://meta.wikimedia.org/wiki/Special:PrefixIndex/Web2Cit/data:

Web2Cit/data/ar/com/lavoz/www/tests.json

Web2Cit/data/ar/com/pagina12/www/tests.json

Web2Cit/data/au/gov/qld/qagoma/blog/tests.json

Web2Cit/data/br/com/uol/folha/www1/tests.json

Web2Cit/data/com/3rionoticias/tests.json

Web2Cit/data/com/aljazeera/www/tests.json

Web2Cit/data/com/cnn/edition/tests.json

Web2Cit/data/com/elespectador/www/tests.json

Web2Cit/data/com/eltiempo/www/tests.json

Web2Cit/data/com/go/abcnews/tests.json

Web2Cit/data/com/nbcnews/www/tests.json

Web2Cit/data/com/newyorker/www/tests.json

Web2Cit/data/com/revistaanfibia/tests.json

Web2Cit/data/com/thequietus/tests.json

Web2Cit/data/com/yahoo/news/tests.json

Web2Cit/data/ie/independent/www/tests.json

Web2Cit/data/ie/rte/www/tests.json

Web2Cit/data/org/adl/www/tests.json

Web2Cit/data/org/amnesty/www/tests.json

Web2Cit/data/org/cjr/www/tests.json

Web2Cit/data/uk/gov/www/tests.json

Web2Cit/data/uy/com/elobservador/www/tests.json

Web2Cit/data/uy/com/elpais/www/tests.json

Web2Cit/data/uy/com/ladiaria/tests.json

Web2Cit/data/uy/com/montevideo/www/tests.json


Our goal is to automatically produce new configuration files for the domains previously evaluated.

https://meta.wikimedia.org/wiki/Web2Cit/Early_adopters#Translation_fields

Fieldnames:
* "itemType": mandatory. Single value
* "title": mandatory. Single value, empty string not allowed.
* authorLast: author first y last se mapean directamente al archivo de config
* authorFirst: author first y last se mapean directamente al archivo de config
* date: 
* publishedBy: undefined (can't be recovered)
* publishedIn: undefined (can't be recovered)
* language: undefined (can't be recovered)

In [2]:
## Read evaluation results df
citoid_eval = pd.read_csv("eval/eval_all_citations.csv.gz")
cols = ['author_first_citoid', 'author_last_citoid', 'source_type_manual', 'title_manual',
       'author_last_manual', 'author_first_manual', 'pub_date_manual']
# no leemos 'pub_source_manual' porque no la vamos a usar para el archivo de config, no se puede mapear

# read data as lists
for col in cols:
    citoid_eval[col] = citoid_eval[col].fillna('None').map(eval)

In [3]:
citoid_eval.shape

(91338, 25)

In [4]:
citoid_eval.columns

Index(['url', 'url_citoid', 'source_type_citoid', 'title_citoid',
       'author_first_citoid', 'author_last_citoid', 'pub_date_citoid',
       'pub_source_citoid', 'citation_template', 'source_type_manual',
       'title_manual', 'author_last_manual', 'author_first_manual',
       'pub_date_manual', 'pub_source_manual', 'article_url', 'page_id',
       'revid', 'wiki_lang', 'comp_title', 'comp_source_type',
       'comp_pub_source', 'comp_pub_date', 'comp_author_first',
       'comp_author_last'],
      dtype='object')

In [5]:
## caculate scores
comp_fields = ['comp_author_first', 'comp_author_last', 'comp_title',
          'comp_source_type', 'comp_pub_source', 'comp_pub_date']

citoid_eval["citoid_success"] = (citoid_eval[comp_fields] >= 0.75).sum(axis=1)

In [6]:
## find domains
citoid_eval['domain'] = citoid_eval['url'].map(lambda x : urlparse(x).netloc)

In [7]:
## calculate domain popularity (count domain freq)
citoid_eval["domain_counts"] = citoid_eval.groupby('domain')["domain"].transform("count")

In [8]:
## calculate domain score
citoid_eval["domain_score"] = citoid_eval.groupby("domain")["citoid_success"].transform("mean")

## Select citations for config files

To be elegible for generating a confinguration file, a citation must:

- have information for at least 3 of the following 4 fields: authorFirst, authorLast, Title, and Date
- belong to a domain cited at least X times in our corpus of evaluated URLs


In [84]:
## filter by required metadata
required_fields = ['author_first_manual', 'author_last_manual', 'title_manual', 'pub_date_manual']
mask = citoid_eval[required_fields].applymap(len).sum(axis=1) > 2
citations4conf = citoid_eval[mask]

In [32]:
## filter by domain frequency
popularity_threshold = 99
citations4conf = citations4conf[citations4conf["domain_counts"] > popularity_threshold]

In [33]:
print(f"We have {len(citations4conf)} citations left")

We have 25249 citations left


In [34]:
## Sort by domain score, from lower to higher
citations4conf = citations4conf.sort_values(by=["domain_score"], ascending=True)

In [35]:
print(f"We have {len(citations4conf['domain'].drop_duplicates())} domains left")

We have 91 domains left


In [40]:
citations4conf.to_csv("data/citations_for_config_files.csv.gz", index=False)

## Old selection of citations for config files -----------------

In [5]:
## drop duplicated URLs
citoid_eval = citoid_eval.dropna(subset=["url"])
## find domains
citoid_eval['domain'] = citoid_eval['url'].map(lambda x : urlparse(x).netloc)

## calculate Citoid success
citoid_eval["citoid_success"] = (citoid_eval[comp_fields] >= 0.75).sum(axis=1)

100%|██████████| 321179/321179 [00:17<00:00, 18366.25it/s]


In [42]:
success_by_domain = citoid_eval.groupby("domain")["citoid_success"].mean().sort_values().reset_index()
success_by_domain

Unnamed: 0,domain,citoid_success
0,www.standardnewswire.com,0.0
1,www.victorianlondon.org,0.0
2,etymolog.ruslang.ru,0.0
3,www.newspakistan.pk,0.0
4,espn.go.com,0.0
...,...,...
15496,nscripter.com,6.0
15497,www.thesmackdownhotel.com,6.0
15498,www.mairie-orthez.fr,6.0
15499,www.dargaud.com,6.0


In [43]:
## Find top 1k URLs with low score and high domain frequency
## We drop domain duplicates, otherwise it's all books.google.com
top_cited_low_score = citoid_eval.sort_values(by=["citoid_success", "domain_counts"], ascending=[True, False]).drop_duplicates("domain")
top_cited_low_score = top_cited_low_score.head(1000)

In [47]:
top_cited_low_score.to_csv("data/top_cited_low_score.csv", index=False)

In [None]:
config_sample  = pd.read_csv("data/top_cited_low_score.csv")

cols = ['title_manual','author_last_manual', 'author_first_manual', 'pub_date_manual']

for col in cols:
    config_sample[col] = config_sample[col].fillna('None').map(eval)

config_sample.head()

## -----------------

## Recover metadata without preprocessing

### Manual metadata without preprocessing

In [53]:
## This file contains the manual citations metadata before preprocessing
manual_before_preproc = pd.read_csv("data/citations_metadata_only_year.csv.gz")

cols = ['title','author_last', 'author_first', 'pub_date'] # 'pub_source' will not be used

manual_before_preproc = manual_before_preproc.drop_duplicates(subset=["url"], keep="first")

for col in cols:
    manual_before_preproc[col] = manual_before_preproc[col].fillna('None').map(eval)

In [54]:
manual_before_preproc.columns

Index(['article_title', 'article_url', 'page_id', 'revid', 'wiki_lang', 'url',
       'source_type', 'source_type_map', 'title', 'title_clean', 'author_last',
       'author_last_clean', 'author_first', 'author_first_clean', 'pub_date',
       'pub_date_clean', 'pub_date_control', 'pub_date_only_year',
       'pub_source', 'pub_source_clean'],
      dtype='object')

In [55]:
manual_before_preproc = manual_before_preproc.set_index("url").reindex(columns=cols)
manual_before_preproc.shape

(382983, 5)

In [None]:
### A partir de acá cambiar top_cited_low_score por citations4conf
### incluir datos de citoid para casos en que hay que buscar el exact match (itemType y listas de más de un elemento)

In [64]:
## Replace data in dataframe of top cited domains with low score with data 
## before preprocessing for 'title','author_last', 'author_first'
## 'pub_source' is not necessary because it will not be mapped to the config file
## 'pub_date' is not necessary because we will keep data transformed to YYYY-MM-DD

cols = ['title','author_last', 'author_first'] ## do not overwrite pub_date

for col in cols:
    citations4conf[f"{col}_manual"] = manual_before_preproc.loc[ citations4conf["url"], col ].values

In [72]:
citations4conf[['source_type_citoid','title_manual','author_first_manual','author_last_manual', 'pub_date_manual']].sample(6)

Unnamed: 0,source_type_citoid,title_manual,author_first_manual,author_last_manual,pub_date_manual
6068,webpage,[History of U.S. Marine Corps Operations in Wo...,"[Henry I., Bernard C., Edwin T.]","[Shaw, Jr., Nalty, Turnbladh]",[1966]
74685,webpage,[Zedd and Katy Perry Share New Song and Video ...,[Madison],[Bloom],[2019-02-14]
85363,webpage,[Uruguay - List of Champions],[],"[STOKKERMANS, Karel]",[2010-08-10]
70119,book,[Makedonika],[Eugene N.],[Borza],[1995]
52151,newspaperArticle,[The presidents' guide to science],[],[James van der Pool],[2008-09-16]
70459,book,[Histoire naturelle des animaux sans vertèbres...,[Jean-Baptiste],[de Lamarck],[1819]


In [59]:
# config_data = citations4conf[['url', 'domain', 'source_type_citoid', 'source_type_manual','title_manual',
#               'author_last_manual', 'author_first_manual', 'pub_date_manual',
#               'comp_author_first', 'comp_author_last', 'comp_title', 'comp_source_type',
#                                    'comp_pub_date', 'citoid_success']]

In [60]:
#config_data[['source_type_citoid','title_manual','author_first_manual','author_last_manual', 'pub_date_manual']].sample(6)

Unnamed: 0,source_type_citoid,title_manual,author_first_manual,author_last_manual,pub_date_manual,pub_date_only_year
22470,webpage,[New Controlling Body Formed at C.A.H.A. Meet],[Robert],[Clarke],[1940-04-16],[]
4201,webpage,"[How Lady Gaga Conquered Music, Fashion and Fi...",[Leena],[Tailor],[2018-09-04],[]
37233,webpage,[Novo navio de assistência hospitalar será bat...,[],[],[2021-09-15],[]
47552,newspaperArticle,[Steven Wilson],[Dave],[Baird],[2012-06-25],[]
37510,webpage,[Ghettos in German-Occupied Eastern Europe],[Laura],[Crago],[10],[10]
4824,webpage,"[Britannia, or, A Chorographical Description o...",[William],[Camden],[1789],[1789]


In [129]:
## Test rows: complete data, missing data, author_last only and several authors
test_rows = [22208, 24204, 47332, 88461, 6068]
citations4conf.loc[test_rows]

Unnamed: 0,url,url_citoid,source_type_citoid,title_citoid,author_first_citoid,author_last_citoid,pub_date_citoid,pub_source_citoid,citation_template,source_type_manual,...,comp_title,comp_source_type,comp_pub_source,comp_pub_date,comp_author_first,comp_author_last,citoid_success,domain,domain_counts,domain_score
22208,https://www.telegraph.co.uk/culture/tvandradio...,https://www.telegraph.co.uk/culture/tvandradio...,webpage,imo and ben: a new radio drama that shows the ...,,,,www.telegraph.co.uk,cite news,"[magazineArticle, newspaperArticle]",...,1.0,0,0.0,0.0,0.0,0.0,1,www.telegraph.co.uk,1116,2.226703
24204,https://www.eurogamer.net/articles/2016-11-11-...,https://www.eurogamer.net/articles/2016-11-11-...,newspaperArticle,dishonored 2 developer's pro-tips are really u...,,,2016-11-11,eurogamer.net,citar web,"[blogPost, email, forumPost, webpage]",...,1.0,0,0.69,1.0,1.0,0.0,3,www.eurogamer.net,785,2.155414
47332,https://www.npr.org/templates/story/story.php?...,https://www.npr.org/templates/story/story.php?...,newspaperArticle,passengers treated for hypothermia,,,,npr.org,Lien web,"[blogPost, email, forumPost, podcast, videoRec...",...,1.0,0,0.64,0.0,1.0,0.0,2,www.npr.org,248,2.572581
88461,http://www.rsssf.com/tablesa/arg75.html,http://www.rsssf.com/tablesa/arg75.html,webpage,argentina 1975,,,,www.rsssf.com,Cita web,"[blogPost, email, forumPost, webpage]",...,0.36,1,0.38,1.0,0.0,0.0,2,www.rsssf.com,852,3.150235
6068,http://www.ibiblio.org/hyperwar/USMC/III/index...,http://www.ibiblio.org/hyperwar/USMC/III/index...,webpage,hyperwar: usmc operations in wwii: vol iii--ce...,,,,www.ibiblio.org,Cite book,"[book, manuscript]",...,0.61,0,0.13,0.0,0.0,0.0,0,www.ibiblio.org,166,1.493976


In [63]:
#citations4conf.loc[3915].title_manual

## Build config files

* If we do not have metadata for a field in our selected citations ("[]" or None), we exclude the field from the configuration file
* We only include itemType metadata if there is an exact match between Citoid's response and the manual citation
* For title and date fields, only one value is admitted in the configuration file. If we have more than one value in our selection, include the exact match
* For authorLast and authorFirst, if we have metada, we map it directly

In [95]:
## map evaluation fieldnames to w2c fieldnames
eval_w2c = {
    'source_type_manual':'itemType',
    'title_manual': 'title',
    'author_first_manual': 'authorFirst',
    'author_last_manual': 'authorLast', 
    'pub_date_manual': 'date',
}

def add_field(d, fieldname, row):
    el = {"fieldname": eval_w2c[fieldname]}
    el["goal"] = row[fieldname]
    d["fields"].append(el)
    return d


def row_to_json(row):
    '''
    Transforms row from evaluation df to w2c configuration file
    input: row
    output: json
    '''
    d = {}
    d["path"] = urlparse(row["url"]).path
    d["fields"] = []
            
    for fieldname in ['source_type_manual','title_manual','author_first_manual','author_last_manual', 'pub_date_manual']:
        if row[fieldname]:
            if fieldname == 'author_first_manual' or fieldname == 'author_last_manual':
                add_field(d, fieldname, row)
            else:
                if len(row[fieldname]) > 1:
                    name = fieldname.replace("_manual", "")
                    if row[f"comp_{name}"] == 1:
                        ## compare with citoid data and keep match
                        ## source_type tomarlo cuando el mapeo es a un solo valor (thesis y cite patent)
                else:
                    add_field(d, fieldname, row)
        ## else:
        ## do not add fieldname to config file
    return d

IndentationError: expected an indented block (<ipython-input-95-0342de29b1b2>, line 36)

In [64]:
for ix, row in citations4conf.iterrows():
    el = row_to_json_element(row)
    domain = urlparse(row["url"]).netloc
    revdomain = "/".join(domain.split(".")[::-1])
    fp = f"data/config_files/{revdomain}.tests.json"
    
    if not os.path.isfile(fp):
        fp = f"data/config_files/{domain}.tests.json"
        ## Add 2nd config
        
    with open(fp, "w") as fo:
        json.dump(el, fo, indent=4)

In [63]:
## Prueba con filas seleccionadas

for ix, row in config_data.loc[test_rows].iterrows():
    el = row_to_json_element(row)
    pprint(el)     

{'fields': [{'fieldname': 'title',
             'goal': ['Why Thatgamecompany nearly fell apart after releasing '
                      'Journey – and what’s next for the studio']},
            {'fieldname': 'authorFirst', 'goal': ['Neil']},
            {'fieldname': 'authorLast', 'goal': ['Long']},
            {'fieldname': 'date', 'goal': ['2013-05-30']}],
 'path': '/features/why-thatgamecompany-nearly-fell-apart-after-releasing-journey-and-whats-next-for-the-studio/'}
{'fields': [{'fieldname': 'title',
             'goal': ['Kids See Ghosts by Kids See Ghosts Kanye West Kid '
                      'Cudi']}],
 'path': '/KIDS-GHOSTS-Kanye-West-Cudi/dp/B07F2QSSLR'}
{'fields': [{'fieldname': 'title',
             'goal': ['Les guerres parthiques de Démétrios II et Antiochos VII '
                      'dans les sources gréco-romaines, de Posidonios à '
                      'Trogue/Justin']},
            {'fieldname': 'authorFirst', 'goal': ['Charlotte']},
            {'fieldname': 'aut

In [111]:
## Check results
#! cat data/config_files/www.wired.co.uk.tests.json
#! cat data/config_files/www.vogue.com.tests.json ## Not found
#! cat data/config_files/www.washingtonpost.com.tests.json
# ! cat data/config_files/www.theguardian.com.tests.json
#! cat data/config_files/www.scielo.br.tests.json
#! cat data/config_files/www.lefigaro.fr.tests.json
#! cat data/config_files/www.japantimes.co.jp.tests.json
#! cat data/config_files/www.elespectador.com.tests.json
#! cat data/config_files/www.cervantesvirtual.com.tests.json
! cat data/config_files/oglobo.globo.com.tests.json

{
    "path": "/mundo/eua-ira-seis-decadas-de-relacoes-instaveis-3381014",
    "fields": [
        {
            "fieldname": "title",
            "goal": [
                "EUA e Ir\u00e3: seis d\u00e9cadas de rela\u00e7\u00f5es inst\u00e1veis"
            ]
        },
        {
            "fieldname": "date",
            "goal": [
                "2011-12-04"
            ]
        }
    ]
}

In [115]:
def find_row_in_manual_data(col, title):
    if title in col:
        return True
    else:
        return False
    
#find_title(['ASAMBLEA GENERAL CONSTITUYENTE - Sesión del 12 de marzo de 1813', 'Final Fantasy'], "ASAMBLEA GENERAL CONSTITUYENTE - Sesión del 12 de marzo de 1813")

field = "Roland-Garros : Seles et Wilander à l'honneur"

mask = manual_before_preproc.apply(
    lambda row: find_row_in_manual_data(row['title'], field), 
    axis=1)                         

manual_before_preproc[mask]

Unnamed: 0_level_0,title,author_last,author_first,pub_date,pub_date_only_year
url,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
https://www.lefigaro.fr/flash-actu/2012/05/21/97001-20120521FILWWW00759-r-garros-seles-et-wilander-a-l-honneur.php,[Roland-Garros : Seles et Wilander à l'honneur],[],[],[2012],[2012]


Pruebas en monitor

1. https://www.wired.co.uk/article/final-fantasy-vii-comes-to-ios
2. http://www.vogue.com/magazine/article/michelle-williams-my-week-with-michelle/ ## URL not found
3. https://www.washingtonpost.com/wp-srv/local/longterm/library/dc/mismanage/manage20.htm
4. https://www.theguardian.com/uk-news/2014/jul/22/richard-iii-visitor-centre-leicester ## prueba bastante lenta
5. http://www.scielo.br/scielo.php?pid=S0021-75572003000700013&script=sci_arttext ## mi detección del path está incompleta
6. https://www.lefigaro.fr/flash-actu/2012/05/21/97001-20120521FILWWW00759-r-garros-seles-et-wilander-a-l-honneur.php ## el config es más incompleto que el default
7. http://www.japantimes.co.jp/news/2014/08/12/national/japan-chooses-boeing-777-300er-governments-official-jet/ ## article expired
8. https://www.elespectador.com/entretenimiento/cine-y-tv/el-espectador-gana-premio-internacional-miguel-hernandez-127852/
9. http://www.cervantesvirtual.com/servlet/SirveObras/34693958761247208143679/index.htm
10. http://oglobo.globo.com/mundo/eua-ira-seis-decadas-de-relacoes-instaveis-3381014 ## 