# Scite_ + Lens
This notebook consists of two parts (main sections): 
1. Downloading data from LENS API 
2. Analysis 

## Fetching data from LENS API

To make working with the LENS API easier, I wrote and tested a small 'lensorg' package that you can install in Colab. 
This way, 
* you don't have to read the source code (but if you want to, it's open-sourced [here](https://github.com/hcss-utils/lensorg/blob/master/lensorg/__init__.py))
* you can't break the code by accident 
* by installing from GitHub, you always get the latest version (if we update the source code, nothing in the Colab notebook has to change) 

### Preparations
* Install lensorg package
* Install dependencies 
* Get LENS API token

In [None]:
# installing the dependencies that 'lensorg' relies on
!pip -q --no-cache-dir install ratelimit python-dotenv
# installing 'lensorg' from git
!pip -q --no-cache-dir install git+https://github.com/hcss-utils/lensorg.git

  Building wheel for ratelimit (setup.py) ... done
  Building wheel for lensorg (setup.py) ... done


In [None]:
# mounting to google drive to access data
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
import pandas as pd
from lensorg import chunks, remote_call, process, token

In [None]:
# if running from google colab, paste token after 'or'

# to get a token, go to API & Data section on the Lens site
# then press the button 'Create a token' 
# after creating a token, copy and paste it

token = token or ""
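
If you run this outside Colab, pasting the token into the notebook makes it easy to leak. A minimal sketch of reading it from an environment variable instead; the helper `resolve_token` and the variable name `LENS_API_TOKEN` are hypothetical and are not part of 'lensorg':

```python
import os


def resolve_token(pasted="", env_var="LENS_API_TOKEN"):
    """Return the pasted token if given, else the value of env_var.

    Both this helper and the variable name are illustrative assumptions.
    """
    value = pasted or os.environ.get(env_var, "")
    if not value:
        raise ValueError(f"no token pasted and {env_var} is not set")
    return value.strip()


# Example with a dummy value; a real token comes from the Lens site.
os.environ["LENS_API_TOKEN"] = "dummy-token"
example_token = resolve_token()
```

A pasted value still takes priority, so the Colab workflow described above keeps working.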

### Loading *scite_* data

The idea behind combining *scite_* and *LENS* was to use *LENS* to identify 'field of study' for *scite_* publications. 

So the first step is to read the CSV file with the *scite_* data. 

---

*Note*: you need to have access to 'RuBase' folder!

In [None]:
# loading scite_ data with DOIs
df = pd.read_csv("/content/drive/MyDrive/RuBase/Bibliometrics/Deterrence/December 2020/Scite/deterrence-broad-scite.csv")

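# selecting valid DOIs (raw string so that '\.' and '\d' reach the regex engine unchanged;
# na=False treats missing DOIs as invalid)
DOIs = df.loc[df["doi"].str.contains(r"^10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+$", na=False), "doi"].tolist()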

In [None]:
# original scite_ table from which we're taking DOIs
df.head()

Unnamed: 0,doi,title,pmid,authors,year,issns,supporting_cites,disputing_cites,mentioning_cites,total_cites,scite_report_link
0,10.1016/0304-405x(94)00823-j,Poison or placebo? Evidence on the deterrence ...,,"['Robert Comment', 'G.William Schwert']",1995.0,['0304-405X'],45.0,7.0,339.0,391.0,https://scite.ai/reports/poison-or-placebo-evi...
1,10.1017/s0033291714000129,What is the impact of mental health-related st...,,"['S. Clement', 'O. Schauman', 'T. Graham', 'F....",2014.0,"['0033-2917', '1469-8978']",44.0,2.0,712.0,758.0,https://scite.ai/reports/what-is-the-impact-of...
2,10.1126/science.1163732,A Glucosinolate Metabolism Pathway in Living P...,,"['P. Bednarek', 'M. Pislewska-Bednarek', 'A. S...",2009.0,"['0036-8075', '1095-9203']",38.0,0.0,831.0,869.0,https://scite.ai/reports/a-glucosinolate-metab...
3,10.1046/j.1365-2125.2001.01306.x,Attitudes and knowledge of hospital pharmacist...,11167664.0,"['Christopher F. Green', 'David R. Mottram', '...",2001.0,['0306-5251'],37.0,2.0,117.0,156.0,https://scite.ai/reports/attitudes-and-knowled...
4,10.1017/cbo9780511897948,The Neural Crest,,"['Nicole Le Douarin', 'Chaya Kalcheim']",1999.0,[],37.0,0.0,1472.0,1509.0,https://scite.ai/reports/the-neural-crest-xYYxGD


In [None]:
# first 5 DOIs
DOIs[:5]

['10.1016/0304-405x(94)00823-j',
 '10.1017/s0033291714000129',
 '10.1126/science.1163732',
 '10.1046/j.1365-2125.2001.01306.x',
 '10.1017/cbo9780511897948']

In the cells above I defined and printed a list of DOIs. 
* at this stage, we're only interested in the DOIs that we need to query, so it's okay to ignore the other columns for now
* we also want our requests to be as efficient as possible (because we have a limit of 5 000 requests per month), so we don't want to query malformed DOIs - *that's why we're selecting only valid DOIs using CrossRef's regex approach.* 

---
*Note*: what the regex essentially checks is whether the DOI starts with '10', has a '.' followed by 4 to 9 digits, then a '/', and then a suffix made of letters, digits, and the characters '-._;()/:', e.g. *10.1017/s0033291714000129*
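
To make the note above concrete, here is a small sketch that applies the same pattern to a few made-up strings with the standard `re` module. The helper name `is_valid_doi` and the invalid examples are mine, only for illustration:

```python
import re

# Same pattern as in the cell above, written as a raw string.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+$")


def is_valid_doi(doi):
    """Return True if doi matches the CrossRef-style pattern."""
    return bool(DOI_PATTERN.match(doi))


examples = [
    "10.1017/s0033291714000129",    # valid: '10.', 4 digits, '/', suffix
    "10.1016/0304-405x(94)00823-j", # valid: '(' ')' and '-' are allowed
    "10.12/abc",                    # invalid: only 2 digits after '10.'
    "doi:10.1126/science.1163732",  # invalid: does not start with '10.'
    "10.1126/science 1163732",      # invalid: a space is not allowed
]
checks = {doi: is_valid_doi(doi) for doi in examples}
```

Note that the pattern only checks the shape of the string; a well-formed DOI can still be unknown to LENS.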

In [None]:
print(f"we have {len(DOIs)} DOIs, but each request can only take 1000 DOIs")
# ceiling division: the last, partially filled chunk also needs its own request
print(f"this is why we're splitting the list of DOIs into chunks of 999 items, resulting in {-(-len(DOIs) // 999)} requests")

we have 18406 DOIs, but each request can only take 1000 DOIs
this is why we're splitting the list of DOIs into chunks of 999 items, resulting in 19 requests
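
The source of the `chunks` helper is not shown here, so the sketch below uses a hypothetical stand-in, `split_into_chunks`, on dummy DOIs. It only illustrates how 18406 items split into batches of at most 999: the number of batches is the ceiling of 18406 / 999, because the last, smaller batch still costs one request.

```python
import math


def split_into_chunks(items, n):
    """Yield consecutive slices of items with at most n elements each."""
    for start in range(0, len(items), n):
        yield items[start:start + n]


# Dummy identifiers, only used to count batches.
dummy_dois = [f"10.0000/example-{i}" for i in range(18406)]
batches = list(split_into_chunks(dummy_dois, n=999))

n_requests = len(batches)            # 19
expected = math.ceil(18406 / 999)    # 19, one more than 18406 // 999
last_batch_size = len(batches[-1])   # 424 = 18406 - 18 * 999
```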


If you're trying to reproduce the analysis, go straight to the 'Analysis' section: the code below downloads the data and saves it to GDrive, which I have already done, so you don't have to download everything again. 

In [None]:
## uncomment this cell if you actually want to download everything!
## by removing '#' at the beginning of the line

# container = []
# for chunk in chunks(DOIs, n=999):
#     response = remote_call(dois=chunk, token=token)
#     data = process(response)
#     container.append(
#         pd.DataFrame(data)
#     )

# result = pd.concat(container, ignore_index=True)

Running the cells below without first downloading the data (that is, without running the cell above uncommented) will throw an error, so either go straight to the 'Analysis' section or run all of the cells, including the uncommented cell above. 

In [None]:
result.shape

(18393, 2)

In [None]:
result.head()

Unnamed: 0,doi,fields
0,10.1007/s004420050815,Nutrient_Food choice_Adenostyles alliariae_Pet...
1,10.1111/j.1540-5915.2012.00361.x,Business_Marketing_Employee research_Organisat...
2,10.1046/j.1365-2125.1997.00616.x,Public health_Pediatrics_Alternative medicine_...
3,10.1016/j.socscimed.2008.12.031,Psychology_Health care_Internal audit_Internal...
4,10.1080/03235408.2013.858879,Horticulture_Lepidoptera genitalia_PEST analys...


### Saving LENS response
To avoid downloading the data again, we save the result to our GDrive:

1. Save raw response - the 'result' table with only two columns that you see above
2. Save joined dataset - we join the 'result' table to our original *scite_* table

In [None]:
joined_dataset = pd.merge(
    df, result,
    how="left",
    on="doi"
)

In [None]:
print(f"Joined dataset has {joined_dataset.shape[0]} rows, {joined_dataset.shape[1]} columns")

Joined dataset has 18468 rows, 12 columns
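
A left join keeps every row of the left table, and it can even add rows if the right table contains the same DOI more than once. The sketch below, on made-up toy data, shows two optional `pd.merge` arguments that help check this: `validate` and `indicator`. It is an illustration, not a step of the original pipeline.

```python
import pandas as pd

left = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/b", "10.1000/c"],
    "title": ["A", "B", "C"],
})
right = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/b"],
    "fields": ["Economics_Law", "No data."],
})

merged = pd.merge(
    left, right,
    how="left",
    on="doi",
    validate="m:1",   # raises MergeError if a DOI repeats in the right table
    indicator=True,   # adds a '_merge' column: 'both' or 'left_only'
)

n_rows = len(merged)
n_unmatched = int((merged["_merge"] == "left_only").sum())
```

Rows marked 'left_only' are the ones for which no response was found, so their 'fields' value is missing.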


In [None]:
# note that we now have 'fields' column as the last column of the dataset 
joined_dataset.head()

Unnamed: 0,doi,title,pmid,authors,year,issns,supporting_cites,disputing_cites,mentioning_cites,total_cites,scite_report_link,fields
0,10.1016/0304-405x(94)00823-j,Poison or placebo? Evidence on the deterrence ...,,"['Robert Comment', 'G.William Schwert']",1995.0,['0304-405X'],45.0,7.0,339.0,391.0,https://scite.ai/reports/poison-or-placebo-evi...,Shareholder_Event study_Business_Market for co...
1,10.1017/s0033291714000129,What is the impact of mental health-related st...,,"['S. Clement', 'O. Schauman', 'T. Graham', 'F....",2014.0,"['0033-2917', '1469-8978']",44.0,2.0,712.0,758.0,https://scite.ai/reports/what-is-the-impact-of...,Psychiatry_Mental health_Ethnic group_Stigma (...
2,10.1126/science.1163732,A Glucosinolate Metabolism Pathway in Living P...,,"['P. Bednarek', 'M. Pislewska-Bednarek', 'A. S...",2009.0,"['0036-8075', '1095-9203']",38.0,0.0,831.0,869.0,https://scite.ai/reports/a-glucosinolate-metab...,Gene_Myrosinase_Plant cell_Enzyme_Metabolic pa...
3,10.1046/j.1365-2125.2001.01306.x,Attitudes and knowledge of hospital pharmacist...,11167664.0,"['Christopher F. Green', 'David R. Mottram', '...",2001.0,['0306-5251'],37.0,2.0,117.0,156.0,https://scite.ai/reports/attitudes-and-knowled...,Epidemiology_Public health_Pediatrics_Hospital...
4,10.1017/cbo9780511897948,The Neural Crest,,"['Nicole Le Douarin', 'Chaya Kalcheim']",1999.0,[],37.0,0.0,1472.0,1509.0,https://scite.ai/reports/the-neural-crest-xYYxGD,Neural fold_Neural crest_Neuroscience_Neural p...


In [None]:
# save raw response table
result.to_csv("/content/drive/MyDrive/RuBase/Bibliometrics/Deterrence/December 2020/Scite/lens-raw-response.csv", index=False)
# save joined dataset table
joined_dataset.to_csv("/content/drive/MyDrive/RuBase/Bibliometrics/Deterrence/December 2020/Scite/scite_lens_joined.csv", index=False)

## Analysis


In [None]:
joined_dataset = pd.read_csv("/content/drive/MyDrive/RuBase/Bibliometrics/Deterrence/December 2020/Scite/scite_lens_joined.csv")
joined_dataset.shape

(18468, 12)

### Which publications are relevant to our research, based on the 'fields of study' info?

We came up with a couple of approaches to check whether a publication is relevant to our field. 

*If 'fields' column contains the following keywords, the publication is relevant according to one of the approaches:*

1. Political science or International relations or International security
2. International relations or International security
3. ((Political science or International relations or International security) AND NOT Criminal)
4. Political science

#### Defining relevant publications

In [None]:
# the same patterns but as a python equivalent
patterns = {
    "first_approach": "Political science|International relations|International security",
    "second_approach": "International relations|International security",
    "third_approach": [
        "Political science|International relations|International security",
        "criminal"
    ],
    "forth_approach": "political science"
}

In [None]:
# actually checking if 'fields' contains patterns we defined above
joined_dataset["first_approach"] = joined_dataset["fields"].str.contains(patterns.get("first_approach"), case=False, na=False).astype(int)
joined_dataset["second_approach"] = joined_dataset["fields"].str.contains(patterns.get("second_approach"), case=False, na=False).astype(int)
joined_dataset["third_approach"] = (
    (joined_dataset["fields"].str.contains(patterns.get("third_approach")[0], case=False, na=False)) &
    (~joined_dataset["fields"].str.contains(patterns.get("third_approach")[1], case=False, na=False))
).astype(int)
joined_dataset["forth_approach"] = joined_dataset["fields"].str.contains(patterns.get("forth_approach"), case=False, na=False).astype(int)

The code above results in 4 additional columns (which you can see below) consisting of '0's and '1's
* 0 means the publication is irrelevant 
* 1 means the publication is relevant

In [None]:
joined_dataset.loc[:, joined_dataset.columns.str.contains("doi|approach")]

Unnamed: 0,doi,first_approach,second_approach,third_approach,forth_approach
0,10.1016/0304-405x(94)00823-j,0,0,0,0
1,10.1017/s0033291714000129,0,0,0,0
2,10.1126/science.1163732,0,0,0,0
3,10.1046/j.1365-2125.2001.01306.x,0,0,0,0
4,10.1017/cbo9780511897948,0,0,0,0
...,...,...,...,...,...
18463,10.1016/j.cropro.2020.105351,0,0,0,0
18464,10.4018/978-1-7998-3476-2.ch009,0,0,0,0
18465,10.4018/978-1-7998-6618-3.ch021,0,0,0,0
18466,10.1016/j.aspen.2020.10.002,0,0,0,0


To make it clearer: *each column represents a different approach.* 

The table *below* shows the first five publications matched by the second approach. All of them are also relevant according to the *first and third approaches*, but only one of them - id **1116** - is also relevant according to the *fourth approach* (column 'forth_approach'). 

So the takeaway is that a publication might be relevant according to one approach but irrelevant according to others.

In [None]:
joined_dataset.loc[joined_dataset["second_approach"].eq(1), joined_dataset.columns.str.contains("doi|approach")].head()

Unnamed: 0,doi,first_approach,second_approach,third_approach,forth_approach
1076,10.1017/s0003055405051865,1,1,1,0
1116,10.1162/002081802760199917,1,1,1,1
1166,10.1146/annurev.polisci.2.1.25,1,1,1,0
1192,10.1111/0020-8833.00102,1,1,1,0
1423,10.1017/s0020818314000393,1,1,1,0
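
The same takeaway can be reproduced on a tiny, made-up table. The sketch below applies the four patterns with the same `str.contains` options as above; the 'fields' values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "fields": [
        "Political science_Law_Criminal court",
        "International relations_Economics",
        "Public health_Pediatrics",
        None,  # publication without a LENS response
    ]
})

broad = "Political science|International relations|International security"
narrow = "International relations|International security"

has_broad = toy["fields"].str.contains(broad, case=False, na=False)
has_criminal = toy["fields"].str.contains("criminal", case=False, na=False)

toy["first_approach"] = has_broad.astype(int)
toy["second_approach"] = toy["fields"].str.contains(narrow, case=False, na=False).astype(int)
toy["third_approach"] = (has_broad & ~has_criminal).astype(int)
toy["forth_approach"] = toy["fields"].str.contains("political science", case=False, na=False).astype(int)
```

The first row is relevant under the first and fourth approaches but not under the third, because 'Criminal court' contains 'criminal'; the row without a response gets 0 everywhere thanks to `na=False`.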


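##### Basic stats based on all dataset
To give us a better understanding of the *scite_* dataset, we count how many publications each approach marks as relevant and what share of the full dataset that is.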

In [None]:
pre_stats = (
    joined_dataset.melt(
        id_vars=["doi"], 
        var_name="approach",
        value_vars=["first_approach", "second_approach", "third_approach", "forth_approach"],
        value_name="relevant_publications"
    )
    .groupby("approach", as_index=False)["relevant_publications"].sum()
)
pre_stats["percent_of_total"] = (pre_stats["relevant_publications"] / joined_dataset.shape[0] * 100).round(2)
pre_stats

Unnamed: 0,approach,relevant_publications,percent_of_total
0,first_approach,2859,15.48
1,forth_approach,2782,15.06
2,second_approach,235,1.27
3,third_approach,2644,14.32


What we also have to keep in mind is that the LENS API did not return complete info for all the DOIs we passed:
1. not all of our DOIs from *scite_* were correct
2. some info is missing from *LENS*

*It might also be fair to calculate the percentage based on the number of valid 'fields' responses we actually received, not on the number of rows in our original dataset.*

In [None]:
valid_responses = joined_dataset.loc[~(joined_dataset["fields"].isna() | joined_dataset["fields"].eq("No data."))]
print(f'Of {joined_dataset.shape[0]}, we have "fields" info on {valid_responses.shape[0]} publications ({round(valid_responses.shape[0]/joined_dataset.shape[0]*100, 2)}%)')

Of 18468, we have "fields" info on 15100 publications (81.76%)


In [None]:
pre_stats["percent_of_fields"] = (pre_stats["relevant_publications"] / valid_responses.shape[0] * 100).round(2)
pre_stats

Unnamed: 0,approach,relevant_publications,percent_of_total,percent_of_fields
0,first_approach,2859,15.48,18.93
1,forth_approach,2782,15.06,18.42
2,second_approach,235,1.27,1.56
3,third_approach,2644,14.32,17.51
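
The two percentage columns differ only in the denominator. As a quick check, the sketch below recomputes both from the counts reported in the outputs above, in plain Python:

```python
# Relevant publications per approach, taken from the table above.
relevant = {
    "first_approach": 2859,
    "second_approach": 235,
    "third_approach": 2644,
    "forth_approach": 2782,
}
total_rows = 18468        # rows in the joined dataset
rows_with_fields = 15100  # rows with a usable 'fields' value

# For each approach: (percent_of_total, percent_of_fields)
shares = {
    name: (
        round(count / total_rows * 100, 2),
        round(count / rows_with_fields * 100, 2),
    )
    for name, count in relevant.items()
}
```

For the first approach, for example, this gives 15.48% of all rows and 18.93% of the rows that have 'fields' info.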


---

#### Defining 'verifiable claims'

As we're only interested in a subset of publications (specifically, those that have supporting or disputing cites), we need to define what a 'verifiable claim' is in practice:
* supporting_cites >= 1 OR disputing_cites >= 1
* 'fields of study' was returned by the LENS API (either a valid value or a 'No data.' response)



In [None]:
verifiable_claims = joined_dataset.loc[joined_dataset["supporting_cites"].ge(1) | joined_dataset["disputing_cites"].ge(1)].copy()
print("supporting/disputing criteria only", verifiable_claims.shape)

verifiable_claims = verifiable_claims.loc[verifiable_claims["fields"].notnull()].copy()
print("supporting/disputing AND fields of study present", verifiable_claims.shape)
print(f"Our working sample, then, is {verifiable_claims.shape[0]} publications, which is {round(verifiable_claims.shape[0] / joined_dataset.shape[0] * 100, 2)}% of our full dataset")

supporting/disputing criteria only (2980, 16)
supporting/disputing AND fields of study present (2925, 16)
Our working sample, then, is 2925 publications, which is 15.84% of our full dataset
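
One detail of the filter above is easy to miss: `notnull()` keeps rows whose 'fields' value is 'No data.', and drops only rows with no response at all. A small sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/b", "10.1000/c", "10.1000/d"],
    "supporting_cites": [2.0, 0.0, 0.0, 1.0],
    "disputing_cites": [0.0, 1.0, 0.0, 0.0],
    "fields": ["Economics_Law", "No data.", "Sociology", None],
})

# At least one supporting or disputing cite.
has_claims = toy["supporting_cites"].ge(1) | toy["disputing_cites"].ge(1)
# A response exists; this is True for 'No data.' as well.
was_queried = toy["fields"].notnull()

kept = toy.loc[has_claims & was_queried, "doi"].tolist()
```

Here 'a' and 'b' are kept, 'c' is dropped because it has no supporting or disputing cites, and 'd' is dropped because there is no LENS response for it.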


In [None]:
verifiable_claims.head()

Unnamed: 0,doi,title,pmid,authors,year,issns,supporting_cites,disputing_cites,mentioning_cites,total_cites,scite_report_link,fields,first_approach,second_approach,third_approach,forth_approach
0,10.1016/0304-405x(94)00823-j,Poison or placebo? Evidence on the deterrence ...,,"['Robert Comment', 'G.William Schwert']",1995.0,['0304-405X'],45.0,7.0,339.0,391.0,https://scite.ai/reports/poison-or-placebo-evi...,Shareholder_Event study_Business_Market for co...,0,0,0,0
1,10.1017/s0033291714000129,What is the impact of mental health-related st...,,"['S. Clement', 'O. Schauman', 'T. Graham', 'F....",2014.0,"['0033-2917', '1469-8978']",44.0,2.0,712.0,758.0,https://scite.ai/reports/what-is-the-impact-of...,Psychiatry_Mental health_Ethnic group_Stigma (...,0,0,0,0
2,10.1126/science.1163732,A Glucosinolate Metabolism Pathway in Living P...,,"['P. Bednarek', 'M. Pislewska-Bednarek', 'A. S...",2009.0,"['0036-8075', '1095-9203']",38.0,0.0,831.0,869.0,https://scite.ai/reports/a-glucosinolate-metab...,Gene_Myrosinase_Plant cell_Enzyme_Metabolic pa...,0,0,0,0
3,10.1046/j.1365-2125.2001.01306.x,Attitudes and knowledge of hospital pharmacist...,11167664.0,"['Christopher F. Green', 'David R. Mottram', '...",2001.0,['0306-5251'],37.0,2.0,117.0,156.0,https://scite.ai/reports/attitudes-and-knowled...,Epidemiology_Public health_Pediatrics_Hospital...,0,0,0,0
4,10.1017/cbo9780511897948,The Neural Crest,,"['Nicole Le Douarin', 'Chaya Kalcheim']",1999.0,[],37.0,0.0,1472.0,1509.0,https://scite.ai/reports/the-neural-crest-xYYxGD,Neural fold_Neural crest_Neuroscience_Neural p...,0,0,0,0


#### Stats



In [None]:
stats = verifiable_claims.melt(
    id_vars=["doi", "supporting_cites", "disputing_cites"], 
    var_name="approach",
    value_vars=["first_approach", "second_approach", "third_approach", "forth_approach"],
    value_name="relevant_publications"
)
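stats = (stats
    .loc[stats["relevant_publications"].ge(1)]
    # sum only the numeric columns; 'doi' is a string column and should not be summed
    .groupby("approach", as_index=False)[["supporting_cites", "disputing_cites", "relevant_publications"]].sum()
)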
stats

Unnamed: 0,approach,supporting_cites,disputing_cites,relevant_publications
0,first_approach,156.0,23.0,84
1,forth_approach,147.0,22.0,79
2,second_approach,12.0,2.0,8
3,third_approach,122.0,22.0,70


In [None]:
# comparing with all relevant publications, not only verifiable ones
final_result = pd.merge(
    stats.rename(columns={"relevant_publications": "verifiable_publications"}), 
    pre_stats.loc[:, ["approach", "relevant_publications"]],
    on="approach",
    how="left"
)
final_result["% of all total verifiable publications"] = (final_result["verifiable_publications"] / verifiable_claims.shape[0] * 100).round(2)
final_result["% of verifiable claims"] = (final_result["verifiable_publications"] / final_result["relevant_publications"] * 100).round(2)
final_result["label"] = final_result["approach"].map({
    "first_approach": "PS or IR or IS", #"Political science or International relations or International security",
    "second_approach": "IR or IS", #"International relations or International security",
    "third_approach": "(PS or IR or IS) and (not Criminal)", #"(Political science or International relations or International security) and (not Criminal)",
    "forth_approach": "PS", #Political Science
})
final_result[["approach", "label", "relevant_publications", "verifiable_publications", "% of all total verifiable publications", "% of verifiable claims", "supporting_cites", "disputing_cites"]]

Unnamed: 0,approach,label,relevant_publications,verifiable_publications,% of all total verifiable publications,% of verifiable claims,supporting_cites,disputing_cites
0,first_approach,PS or IR or IS,2859,84,2.87,2.94,156.0,23.0
1,forth_approach,PS,2782,79,2.7,2.84,147.0,22.0
2,second_approach,IR or IS,235,8,0.27,3.4,12.0,2.0
3,third_approach,(PS or IR or IS) and (not Criminal),2644,70,2.39,2.65,122.0,22.0


**Results:**
1. of all 2859 publications relevant under *'political science or international relations or international security'*, **84** (2.94%) have verifiable claims, with 156 supporting and 23 disputing cites; they make up 2.87% of the 2925 publications in our sample. 
2. of all 235 publications relevant under *'international relations or international security'*, **8** (3.4%) have verifiable claims, with 12 supporting and 2 disputing cites; they make up 0.27% of the 2925 publications in our sample.
3. of all 2644 publications relevant under *'(political science or international relations or international security) and (not criminal)'*, **70** (2.65%) have verifiable claims, with 122 supporting and 22 disputing cites; they make up 2.39% of the 2925 publications in our sample.
4. of all 2782 publications relevant under *'political science'*, **79** (2.84%) have verifiable claims, with 147 supporting and 22 disputing cites; they make up 2.7% of the 2925 publications in our sample.
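
Each result line combines two ratios: verifiable publications over all relevant publications of that approach, and verifiable publications over the whole working sample. A short sketch that recomputes both from the counts in the table above:

```python
# (relevant, verifiable) publication counts per approach, from the table above.
counts = {
    "PS or IR or IS": (2859, 84),
    "IR or IS": (235, 8),
    "(PS or IR or IS) and (not Criminal)": (2644, 70),
    "PS": (2782, 79),
}
sample_size = 2925  # publications in the verifiable-claims sample

# For each label: (% of relevant publications, % of the working sample)
percentages = {}
for label, (relevant, verifiable) in counts.items():
    pct_of_relevant = round(verifiable / relevant * 100, 2)
    pct_of_sample = round(verifiable / sample_size * 100, 2)
    percentages[label] = (pct_of_relevant, pct_of_sample)
    print(f"{label}: {pct_of_relevant}% of relevant, {pct_of_sample}% of sample")
```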

In [None]:
# e.g. 8 IR or IS publications
verifiable_claims.loc[verifiable_claims["second_approach"].eq(1)]

Unnamed: 0,doi,title,pmid,authors,year,issns,supporting_cites,disputing_cites,mentioning_cites,total_cites,scite_report_link,fields,first_approach,second_approach,third_approach,forth_approach
1076,10.1017/s0003055405051865,Military Coercion in Interstate Crises,,['Branislav L. Slantchev'],2005.0,"['0003-0554', '1537-5943']",2.0,0.0,61.0,63.0,https://scite.ai/reports/military-coercion-in-...,Economics_Valuation (finance)_Law and economic...,1,1,1,0
1116,10.1162/002081802760199917,Ethnic Bargaining in the Shadow of Third-Party...,,['Rupen Cetinyan'],2002.0,"['0020-8183', '1531-5088']",2.0,0.0,49.0,51.0,https://scite.ai/reports/ethnic-bargaining-in-...,Empirical research_Ethnic group_Political econ...,1,1,1,1
1166,10.1146/annurev.polisci.2.1.25,DETERRENCE AND INTERNATIONAL CONFLICT: Empiric...,,['Paul K. Huth'],1999.0,"['1094-2939', '1545-1577']",2.0,0.0,36.0,38.0,https://scite.ai/reports/deterrence-and-intern...,Public economics_Positive economics_Empirical ...,1,1,1,0
1192,10.1111/0020-8833.00102,"Rigor Mortis or Rigor, More Tests: Necessity, ...",,['Frank P. Harvey'],1998.0,"['0020-8833', '1468-2478']",2.0,0.0,13.0,15.0,https://scite.ai/reports/rigor-mortis-or-rigor...,Scientific theory_Positive economics_Economics...,1,1,1,0
1423,10.1017/s0020818314000393,Revisiting Reputation: How Past Actions Matter...,,"['Alex Weisiger', 'Keren Yarhi-Milo']",2015.0,"['0020-8183', '1531-5088']",2.0,0.0,14.0,16.0,https://scite.ai/reports/revisiting-reputation...,Positive economics_Sociology_International con...,1,1,1,0
1478,10.2139/ssrn.2552820,Can the International Criminal Court Deter Atr...,,"['Hyeran Jo', 'Beth A. Simmons']",2014.0,['1556-5068'],1.0,0.0,8.0,9.0,https://scite.ai/reports/can-the-international...,Political science_War crime_Law_Criminal court...,1,1,0,1
1719,10.1111/j.1468-2478.2010.00621.x,Explaining the Deterrence Effect of Human Righ...,,"['Hunjoon Kim', 'Kathryn Sikkink']",2010.0,['0020-8833'],1.0,1.0,77.0,79.0,https://scite.ai/reports/explaining-the-deterr...,Economics_Social work_Human rights_Law_Sanctio...,1,1,1,0
15876,10.3844/jssp.2018.124.128,Theoretical Context of the Nuclear Posture Review,,"['Timothy Sands', 'Richard Mihalik', 'Harold C...",2018.0,['1549-3652'],0.0,1.0,5.0,6.0,https://scite.ai/reports/theoretical-context-o...,Rational choice theory_Political science_Law a...,1,1,1,1
