# A methodology for annotating the in-text citations toward a retracted article
Starting from a seed retracted article, we present a step-by-step methodology for collecting and annotating the in-text citations of the entities that cite it. Two main services are used during this process: (a) the API of COCI, the OpenCitations Index of Crossref open DOI-to-DOI references (http://opencitations.net/index/coci), and (b) the Retraction Watch database, a collection of retracted articles from across the academic world (http://retractiondatabase.org/).
The methodology is divided into five steps: (1) identifying and retrieving the resources, (2) annotating the citing entities' characteristics, (3) classifying the citing entities into subjects of study, (4) extracting textual values from the citing entities, and (5) annotating the in-text citations' characteristics.


|Phase|Description|Input|Output (new dataset attributes)|
|---|---|---|---|
|1) Identifying and retrieving the resources|Identifying the list of entities citing the retracted article and annotating their main attributes|Retracted article DOI|1.1) DOI 1.2) year of publication 1.3) title 1.4) source id (ISSN/ISBN) 1.5) source title|
|2) Annotating the citing entities' characteristics|Annotating whether each citing entity is itself retracted|Citing entities' DOIs|2.1) is retracted?|
|3) Classifying the citing entities into subjects of study|Classifying the citing entities into macro subject areas and specific categories of study following the SCImago classification|Citing entities' ISSN/ISBN values|3.1) area 3.2) category|
|4) Extracting textual values from the citing entities|Extracting the citing entities' abstracts and, for each in-text citation, its pointer, section of occurrence, and context|Citing entities' DOIs|4.1) abstract 4.2) in-text citation/s section 4.3) in-text citation/s context 4.4) in-text citation/s pointer|
|5) Annotating the in-text citations' characteristics|Annotating the intent and sentiment of each captured in-text citation, and specifying whether at least one of the in-text citation/s mentions the retraction notice|Citing entities' in-text citation contexts|5.1) citation/s intent 5.2) sentiment/s 5.3) retraction mentioned?|



In [2]:
#Imports
import csv
import re
import requests
import os
from datetime import datetime
import pandas as pd 
import numpy as np
from collections import defaultdict
import pprint
import util

OUT_PATH = "data_test/"
CITS_DATASET = OUT_PATH + "cits_dataset.csv"
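
The `util` module imported above is not included in this notebook. As a rough guide to what the later cells expect, here is a minimal sketch of two helpers with the same names and call signatures as `util.df_to_dict_list` and `util.write_list`. The bodies are assumptions inferred from how the functions are called, not the original implementation.

```python
import csv


def df_to_dict_list(df, new_fields, keep_fields):
    """Return one dict per row: the kept columns plus the new default fields.

    Assumed behavior: `keep_fields` lists the existing columns to copy and
    `new_fields` maps each new column name to its placeholder value.
    """
    rows = []
    for record in df.to_dict("records"):
        row = {name: record[name] for name in keep_fields}
        row.update(new_fields)
        rows.append(row)
    return rows


def write_list(rows, path, header=True):
    """Write a list of dicts to a CSV file, optionally with a header row.

    Assumed behavior: the column order follows the keys of the first dict.
    """
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        if header:
            writer.writeheader()
        writer.writerows(rows)
```

Under these assumptions, each step that adds `"todo"` placeholders rewrites the whole CSV with the extended set of columns, which is why every call lists the columns produced by the previous steps.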

---
## Step-1) Identifying and retrieving the resources
#### Input: the DOI value of a retracted article

In [4]:
RET_ART_DOI = "10.1016/S0140-6736(97)11096-0"

#### Output: Creates a dataset having the following variables/columns: "doi","title","year","source_id", and "source_title" 

### Step-1-1) 
No script needed

### Step-1-2)

In [None]:
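def call_api_coci(operation, vals, fields, params=""):
    """Query the COCI REST API once for each value in `vals`.

    Returns a dict mapping each value (a DOI) to the requested `fields` of the
    first record in the response, or to the whole record when `fields == "*"`.
    Values with an empty response are mapped to an empty dict.
    """
    COCI_API = "https://opencitations.net/index/coci/api/v1/"
    items = {}
    # Loop over the values rather than popping and recursing, so the caller's
    # list is left untouched and long citation lists do not hit the recursion limit
    for val_key in vals:
        items[val_key] = {}
        r = requests.get(COCI_API + str(operation) + "/" + str(val_key) + str(params))
        records = r.json()  # parse the response body only once
        if len(records) == 0:
            continue
        if fields == "*":
            items[val_key] = records[0]
        else:
            for f in fields:
                # Fields missing from the record are set to None
                items[val_key][f] = records[0].get(f)

    return items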

# All the citations in COCI
ret_meta = call_api_coci("metadata", [RET_ART_DOI],["citation"],'?json=array("; ",citation,doi)')
coci_cits = ret_meta[RET_ART_DOI]["citation"]
# ---- <TEST> ----- COMMENT  
# coci_cits = coci_cits[0:10]
# ---- </TEST> ----- COMMENT  

# Get the metadata of citing document
coci_cits_meta = call_api_coci("metadata", coci_cits, "*")

#write the partial results of this step
step_a_data = []
for c in coci_cits_meta:
    # Skip DOIs for which COCI returned no metadata record
    if not coci_cits_meta[c]:
        continue
    step_a_data.append({
        "doi": coci_cits_meta[c]["doi"],
        "title": coci_cits_meta[c]["title"],
        "year": coci_cits_meta[c]["year"],
        "source_id": coci_cits_meta[c]["source_id"],
        "source_title": coci_cits_meta[c]["source_title"]
    })

util.write_list(step_a_data, CITS_DATASET, header= True)
# Verify and add "retracted" field to each citing document using/querying RetractionWatch database (http://retractiondatabase.org/) as source

---
## Step-2) Annotating the citing entities' characteristics
#### Input: the citing entities' DOIs

In [7]:
CITS_DATASET = "example_data/cits_dataset.csv"

#### Output: extends the CitsDataset with the new variable "is_retracted"

### Step-2-1) 

In [12]:
cits_df = pd.read_csv(CITS_DATASET)
step_2_1_data = util.df_to_dict_list(cits_df,{"is_retracted":"todo"},["doi","title","year","source_id","source_title"])
util.write_list(step_2_1_data, CITS_DATASET, header= True)

### Step-2-2) 
No script needed
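
Step-2-2 is a manual check against the Retraction Watch database. If a list of retracted DOIs has been exported locally, the `"todo"` placeholders could be filled in along these lines. This is only a sketch: the exported list, the `yes`/`no` encoding, and the helper names `normalize_doi` and `annotate_retracted` are assumptions, not part of the original workflow.

```python
def normalize_doi(doi):
    """Lower-case a DOI and drop a leading resolver or `doi:` prefix, if any."""
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def annotate_retracted(rows, retracted_dois):
    """Set `is_retracted` on each row by looking its DOI up in a known list.

    `rows` is a list of dicts with a "doi" key, in the shape written to the
    dataset CSV; `retracted_dois` is any iterable of DOI strings.
    """
    retracted = {normalize_doi(d) for d in retracted_dois}
    for row in rows:
        is_ret = normalize_doi(row["doi"]) in retracted
        row["is_retracted"] = "yes" if is_ret else "no"
    return rows
```

DOIs are compared case-insensitively here because the same DOI is often written with different capitalization across sources.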

---
## Step-3) Classifying the citing entities into subjects of study
#### Input: the citing entities' ISSN/ISBN values
#### Output: extends the CitsDataset with the new variables: "area" and "category"

### Step-3-1) 

In [67]:
cits_df = pd.read_csv(CITS_DATASET)
ISSN_DATASET = CITS_DATASET.replace("/cits_dataset.csv","")+"/issn_dataset.csv"
ISBN_DATASET = CITS_DATASET.replace("/cits_dataset.csv","")+"/isbn_dataset.csv"

# ISSNs: citations having an issn value in the source id
# na=False treats rows with a missing source id as non-matching instead of raising
cits_df_issn = cits_df[cits_df["source_id"].str.contains('^issn', na=False)]
cits_df_issn = cits_df_issn[["source_id","source_title"]].drop_duplicates(subset ="source_id", keep = 'first')
step_3_1_data = util.df_to_dict_list(cits_df_issn,{"scimago_area":"todo","scimago_category":"todo"},["source_id","source_title"])
util.write_list(step_3_1_data, ISSN_DATASET, header= True)

# ISBNs: citations having an isbn value in the source id
# na=False treats rows with a missing source id as non-matching instead of raising
cits_df_isbn = cits_df[cits_df["source_id"].str.contains('^isbn', na=False)]
cits_df_isbn = cits_df_isbn[["source_id","source_title"]].drop_duplicates(subset ="source_id", keep = 'first')
step_3_1_data = util.df_to_dict_list(cits_df_isbn,{"lcc":"todo","scimago_area":"todo","scimago_category":"todo"},["source_id","source_title"])
util.write_list(step_3_1_data, ISBN_DATASET, header= True)

### Step-3-2) 
No script needed

### Step-3-3) 
No script needed

### Step-3-4) 

In [70]:
ISBN_DATASET = CITS_DATASET.replace("/cits_dataset.csv","")+"/isbn_dataset.csv"
lcc_lookup_df = pd.read_csv("lcc_lookup.csv")
scimago_lookup_df = pd.read_csv("scimago_lookup.csv")
isbn_df = pd.read_csv(ISBN_DATASET)

step_3_4_data = []
for index, row in isbn_df.iterrows():
    
    step_3_4_data.append(row.to_dict())
    
    #1. Consider only the alphabetic part of the LCC code
    # (str() guards against empty cells, which pandas reads as NaN floats)
    alphabetic_code = re.findall('[a-zA-Z]+', str(row['lcc']))
    if len(alphabetic_code) == 0:
        continue
    alphabetic_code = alphabetic_code[0].upper()   
    query_df = lcc_lookup_df.loc[lcc_lookup_df['lcc_code'] == alphabetic_code]
    lcc_subject = None
    if len(query_df) > 0:
        lcc_subject = query_df["lcc_subject"].values[0]
    else:
        continue
    
    area = "todo_manual"
    category = "todo_manual"
    #2. Checks whether the value of the LCC subject is also a Scimago subject area 
    query_df = scimago_lookup_df.loc[scimago_lookup_df['area'].str.lower() == lcc_subject.lower()]
    if len(query_df) > 0:
        area = query_df["area"].values[0]
        category = area + " (miscellaneous)"
        
    #3. Checks whether the value of the LCC subject is also a Scimago subject category 
    else:
        query_df = scimago_lookup_df.loc[scimago_lookup_df['category'].str.lower() == lcc_subject.lower()]
        if len(query_df) > 0:
            area = query_df["area"].values[0]
            category = query_df["category"].values[0]
    
    # Update the record appended for the current row (the last one), not the first
    step_3_4_data[-1]["scimago_area"] = area
    step_3_4_data[-1]["scimago_category"] = category
    
util.write_list(step_3_4_data, ISBN_DATASET, header= True)
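
The three-stage lookup in the cell above (LCC letters, then SCImago area, then SCImago category) can be easier to follow as a standalone function. The sketch below restates it over plain Python structures. The function name `lcc_to_scimago`, the dict and list inputs, and the toy lookup entries in the usage example are illustrative assumptions; they stand in for `lcc_lookup.csv` and `scimago_lookup.csv`, whose contents are not shown here.

```python
import re


def lcc_to_scimago(lcc, lcc_subjects, scimago_pairs):
    """Map an LCC call number to a SCImago (area, category) pair.

    lcc_subjects: dict from the alphabetic LCC class to its subject name.
    scimago_pairs: list of (area, category) tuples.
    Returns None when the LCC class is missing or unknown, and
    ("todo_manual", "todo_manual") when the subject has no SCImago match.
    """
    # 1. Keep only the leading alphabetic part of the LCC code
    letters = re.findall(r"[a-zA-Z]+", str(lcc))
    if not letters:
        return None
    subject = lcc_subjects.get(letters[0].upper())
    if subject is None:
        return None
    key = subject.lower()

    # 2. The LCC subject is also a SCImago area
    for area, _category in scimago_pairs:
        if area.lower() == key:
            return area, area + " (miscellaneous)"

    # 3. Otherwise, the LCC subject is a SCImago category
    for area, category in scimago_pairs:
        if category.lower() == key:
            return area, category

    return "todo_manual", "todo_manual"


# Usage with toy lookup data
lcc_subjects = {"R": "Medicine", "QR": "Microbiology", "N": "Fine Arts"}
scimago_pairs = [
    ("Medicine", "Cardiology and Cardiovascular Medicine"),
    ("Immunology and Microbiology", "Microbiology"),
]
print(lcc_to_scimago("R853.C55", lcc_subjects, scimago_pairs))
print(lcc_to_scimago("QR46", lcc_subjects, scimago_pairs))
```

Records that come back as `todo_manual` are the ones left for manual classification.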

### Step-3-5) 

In [90]:
ISSN_DATASET = CITS_DATASET.replace("/cits_dataset.csv","")+"/issn_dataset.csv"
ISBN_DATASET = CITS_DATASET.replace("/cits_dataset.csv","")+"/isbn_dataset.csv"

cits_df = pd.read_csv(CITS_DATASET)

issn_df = pd.read_csv(ISSN_DATASET)
for index, row in issn_df.iterrows():
    query_df = cits_df.loc[cits_df.source_id.str.lower() == row["source_id"].lower()]
    if(len(query_df) > 0):
        cits_df.loc[cits_df.source_id.str.lower() == row["source_id"].lower(), 'area'] = row["scimago_area"]
        cits_df.loc[cits_df.source_id.str.lower() == row["source_id"].lower(), 'category'] = row["scimago_category"]

isbn_df = pd.read_csv(ISBN_DATASET)
for index, row in isbn_df.iterrows():
    query_df = cits_df.loc[cits_df.source_id.str.lower() == row["source_id"].lower()]
    if(len(query_df) > 0):
        cits_df.loc[cits_df.source_id.str.lower() == row["source_id"].lower(), 'area'] = row["scimago_area"]
        cits_df.loc[cits_df.source_id.str.lower() == row["source_id"].lower(), 'category'] = row["scimago_category"]
    
    
step_3_5_data = cits_df.to_dict("records")
util.write_list(step_3_5_data, CITS_DATASET, header= True)  
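
The cell above copies the classification of each source back onto every citing entity that shares its source id. As a hedged alternative, the same propagation can be written as a single pass over plain records (for example the output of `cits_df.to_dict("records")`) using a dictionary keyed on the lower-cased source id. The function name `propagate_classification` is hypothetical.

```python
def propagate_classification(cit_rows, source_rows):
    """Copy SCImago area and category from source rows onto citation rows.

    cit_rows: list of dicts with a "source_id" key (one per citing entity).
    source_rows: list of dicts with "source_id", "scimago_area" and
    "scimago_category" keys (one per distinct ISSN or ISBN).
    Rows whose source id has no match are left unchanged.
    """
    # Build the lookup once, matching source ids case-insensitively
    lookup = {
        str(s["source_id"]).lower(): (s["scimago_area"], s["scimago_category"])
        for s in source_rows
    }
    for row in cit_rows:
        match = lookup.get(str(row.get("source_id", "")).lower())
        if match is not None:
            row["area"], row["category"] = match
    return cit_rows
```

Passing the ISSN and ISBN records together as `source_rows` covers both loops of the original cell in one call.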

---
## Step-4) Extracting textual values from the citing entities
#### Input: the citing entities' DOIs
#### Output: extends the CitsDataset with the new variables: "abstract", "intext_citation.section", "intext_citation.context" and "intext_citation.pointer"

### Step-4-1) 

In [None]:
cits_df = pd.read_csv(CITS_DATASET)
step_4_1_data = util.df_to_dict_list(cits_df,{"abstract":"todo","intext_citation.section":"todo","intext_citation.context":"todo","intext_citation.pointer":"todo"},["doi","title","year","source_id","source_title","is_retracted","area","category"])
util.write_list(step_4_1_data, CITS_DATASET, header= True)

### Step-4-2) 
No script needed

---
## Step-5) Annotating the in-text citations' characteristics
#### Input: the citing entities' in-text citations context
#### Output: extends the CitsDataset with the new variables: "intext_citation.intent", "intext_citation.sentiment", and "intext_citation.ret_mention"

### Step-5-1)

In [None]:
cits_df = pd.read_csv(CITS_DATASET)
step_5_1_data = util.df_to_dict_list(
    cits_df,
    {"intext_citation.intent":"todo","intext_citation.sentiment":"todo","intext_citation.ret_mention":"todo"},
    ["doi","title","year","source_id","source_title","is_retracted","area","category","abstract","intext_citation.section","intext_citation.context","intext_citation.pointer"])
util.write_list(step_5_1_data, CITS_DATASET, header= True)

### Step-5-2)
No script needed