# Davide Brembilla's Notebook


## 28 March 2022 week
1. Research about the DOAJ

In order to write the abstract about our research question, I researched the DOAJ and Crossref.
I found the [DOAJ's documentation](https://doaj.org/docs). We could exploit the API to get the article metadata in XML format. It also employs [OpenURLs](https://doaj.org/docs/openurl/). 

2. Research about Crossref

Crossref also has an [API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/a-non-technical-introduction-to-our-api/) that we can query, and it also provides [data dumps](https://www.crossref.org/blog/new-public-data-file-120-million-metadata-records/) containing JSON files with the metadata. The dump is massive, so we should consider what the best strategy is.
If we decide to go with the API, we should keep in mind that we can query it in anonymous, Polite or Plus mode.
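
A minimal sketch of what a Polite request could look like, assuming the public REST endpoint `https://api.crossref.org` and the `mailto` parameter that Crossref documents for identifying the caller. The function name `journal_works_url` and the e-mail address are made up for illustration; `rows=0` is used here on the assumption that we only want the summary of the result set, not the records themselves.

```python
from urllib.parse import urlencode

CROSSREF_API = "https://api.crossref.org"

def journal_works_url(issn, mailto=None, rows=0):
    """Build the URL that lists the works of a journal, identified by its ISSN."""
    params = {"rows": rows}
    if mailto:
        # A contact address is what distinguishes a Polite request from an anonymous one
        params["mailto"] = mailto
    return f"{CROSSREF_API}/journals/{issn}/works?{urlencode(params)}"

# Hypothetical contact address, for illustration only
print(journal_works_url("1471-2466", mailto="name@example.org"))
```

The URL can then be passed to any HTTP client; Plus mode additionally requires a token, which is not covered here.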

3. Producing a draft of the abstract

I used this information to create a draft of the abstract, which will be completed with the other group members:

*"This research paper inquires how much overlap there is between the articles from Open Access journals in DOAJ and Crossref. Furthermore, we scouted the availability of these articles’ reference lists, the presence of IDs such as DOIs and the entities responsible for their specification. This analysis was carried out by querying the DOAJ and Crossref’s APIs, creating a dataset combining the two. The results reveal that only a small portion(?) of the articles on the DOAJ have relevant information in Crossref, but that most of them have a relevant id (?). Also, most of the information was given by Crossref itself(?). This information is relevant because previous studies were not able to organise this kind of information in an optimal way."*

After talking with my group mates, we improved the abstract:

<i>With the spread of data in the public domain, the relative advantage of holding data in closed databases has diminished. On the other hand, the data on open citations have innumerable advantages. They can improve the transparency and robustness of science portfolio analysis, improve science policy decision making, stimulate downstream commercial activity, and increase the discoverability of scientific articles. Thus, once sparsely populated, public domain citation databases passed the 1 billion citation mark in February 2021. This inquiry is therefore aimed at studying how data on open citations are treated by different aggregators and authorities, to evaluate the management of data linked to articles (references) and to have an idea of how much the aggregators communicate with each other in terms of information sharing. Our research involves the articles from Open Access journals in DOAJ and their data on Crossref. We scouted the availability of these articles’ reference lists, the presence of IDs such as DOIs and the entities responsible for their specification. This analysis was carried out by querying the DOAJ and Crossref’s APIs, creating a dataset combining the two. The results reveal that only a small portion of the articles on the DOAJ have relevant information in Crossref, but that most of them have a relevant id. Also, most of the information was given by Crossref itself. This is relevant information because it demonstrates the limits of the aggregators and defines possible solutions to evaluate, and therefore improve, the quality of information shared.</i>

After the teacher's comments, we rewrote the abstract, dividing it into sections.

## 4 March week
I looked for papers linked to our research question and added them to our shared Zotero and ResearchRabbit. In particular, I added research about the DOAJ, whose literature seems more 'compact' than the wider literature about Crossref.
With my group we created the first version of the DMP.


## 11 March week
I studied the API on Crossref to understand how to properly query it.
I reviewed the DMP of the group Don't Lock Up, comparing it against the requirements and the guidelines given.
We refined the research question and the workflow, updating the abstract and creating the workflow.

## 18 March week
This week we tried to use the API with the Python package [crossrefapi](https://github.com/fabiobatalha/crossrefapi). We realised that backfile DOIs are included in the count; do we need to keep them?  
I tested a small script that queries the Crossref API for the journals in the DOAJ. This can be useful to check the overlap between the two.

In [None]:
#!pip install crossrefapi
from crossref.restful import Journals, Works
journal = Journals()
journal.works(issn = '1471-2466').count()

In [None]:
import pandas as pd
from os import sep, listdir 
journal_data = pd.read_csv('..%c..%c..%c..%cjournals_doaj.csv' % (sep,sep,sep,sep), encoding='utf8')

journal_data = journal_data[['Journal title','Journal ISSN (print version)','Journal EISSN (online version)','Number of Article Records']]
print(journal_data.head())

In [None]:
journal_data=journal_data.rename(columns ={'Journal ISSN (print version)':'pissn', 'Journal EISSN (online version)':'eissn', 'Number of Article Records':'count'})

In [None]:
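pissn = journal_data['pissn'].dropna()
eissn = journal_data['eissn'].dropna()
count_pissn = dict()
for issn in pissn:
    try:
        # Query Crossref only once per ISSN and keep the count for later use
        count_pissn[issn] = journal.works(issn=issn).count()
        # Number of article records that DOAJ reports for the same journal
        doaj_count = int(journal_data.loc[journal_data['pissn'] == issn, 'count'].iloc[0])
        print(issn, count_pissn[issn], doaj_count == count_pissn[issn])
    except Exception as e:
        print(e)
        print(issn)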

## 26 April
I checked the licences that could fit our project well, studying the [MIT](https://opensource.org/licenses/MIT), [MIT-0](https://opensource.org/licenses/MIT-0), [Boost](https://opensource.org/licenses/BSL-1.0) and [ISC](https://opensource.org/licenses/ISC). I realised that there is only a small difference between the licences that require no attribution and the others. Personally, I'd choose either the MIT or ISC.

I also started testing the difference between the articles in DOAJ and in Crossref, and encountered a problem: the DOAJ API does not give access to more than 1000 results.
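
To make the comparison concrete, here is a small sketch of how the two sets of DOIs could be compared once they are collected. It assumes that DOIs should be compared case-insensitively and that some of them may come with a resolver prefix; `normalise_doi` and `overlap` are hypothetical helpers, and the DOIs are made up.

```python
def normalise_doi(doi):
    """Lower-case a DOI and strip common resolver prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    prefixes = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi

def overlap(doaj_dois, crossref_dois):
    """Split the DOIs into those found in both sources and those found in only one."""
    doaj = {normalise_doi(d) for d in doaj_dois}
    crossref = {normalise_doi(d) for d in crossref_dois}
    return {
        "both": doaj & crossref,
        "only_doaj": doaj - crossref,
        "only_crossref": crossref - doaj,
    }

# Made-up DOIs, for illustration only
result = overlap(
    ["10.1234/ABC.1", "https://doi.org/10.1234/abc.2"],
    ["10.1234/abc.1", "10.1234/abc.3"],
)
print(result)
```

Using sets keeps the comparison cheap even for journals with many articles, since intersection and difference do not require nested loops.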


In [None]:
#!pip install crossrefapi
from crossref.restful import Journals, Works
journal = Journals()
import pandas as pd
from os import sep, listdir 
journal_data = pd.read_csv('..%c..%c..%c..%cjournals_doaj.csv' % (sep,sep,sep,sep), encoding='utf8')
journal_data=journal_data.rename(columns ={'Journal ISSN (print version)':'pissn', 'Journal EISSN (online version)':'eissn', 'Number of Article Records':'count'})

In [None]:
pissn = journal_data['pissn'].dropna()
eissn = journal_data['eissn'].dropna()

In [47]:
articles_doaj = set()
articles_crossref= set()

In [23]:
for article in journal.works(issn = pissn[0]):
    articles_crossref.add(article['DOI'])

In [50]:
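import requests
results = list()
page_num = 1
while True:
    api_url = "https://doaj.org/api/search/articles/issn:" + pissn[0] + '?page=%s&pageSize=100' % str(page_num)
    response = requests.get(api_url).json()
    # Store every response, so that the error message stays available in results[-1]
    results.append(response)
    # .get() avoids a KeyError in case a successful page carries no 'status' key
    if response.get('status') == 'bad_request':
        break
    page_num += 1

#print(response['results'])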

1
2
3
...
157


KeyboardInterrupt: 

In [51]:
results[-1]

{'status': 'bad_request',
 'error': 'You cannot access results beyond 1000 records via this API.\n    If you would like to see more results, you can download all of our data from\n    https://doaj.org/docs/public-data-dump/. You can also harvest from our OAI-PMH endpoints; articles: https://doaj.org/oai.article, journals: https://doaj.org/oai (ref: 816af2d1-c21d-11ec-87bf-47a42d843b88)'}

What this means is, we need to use the dump from DOAJ. We will need to simplify the files, in order to remove unnecessary data and make our future work lighter. We need to discuss together what to keep and what to remove.
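
As a starting point for that discussion, this is a rough sketch of how a single article record could be reduced to its identifiers. It relies on the `bibjson` / `identifier` structure returned by the API; the assumption that the dump uses the same structure, and that the ISSNs appear there with the types `pissn` and `eissn`, still has to be checked. The function `slim_article` and the record are made up.

```python
def slim_article(record):
    """Keep only the identifiers of a DOAJ article record, dropping everything else."""
    slim = {"doi": None, "pissn": None, "eissn": None}
    for identifier in record.get("bibjson", {}).get("identifier", []):
        id_type = str(identifier.get("type", "")).lower()
        # Keep the first value found for each identifier type we care about
        if id_type in slim and slim[id_type] is None:
            slim[id_type] = identifier.get("id")
    return slim

# Made-up record, shaped like the API results
record = {
    "id": "hypothetical-id",
    "bibjson": {
        "title": "An example article",
        "abstract": "Long text that we would not need to keep.",
        "identifier": [
            {"type": "doi", "id": "10.1234/abc.1"},
            {"type": "eissn", "id": "1234-5678"},
        ],
    },
}
print(slim_article(record))
```

Any other field we decide to keep (for example the year) could be added to `slim` in the same way.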



## 2 May
This week I developed the first version of the script we could use to query Crossref efficiently. It employs multithreading, making our queries approximately 10x faster. [Here](https://github.com/open-sci/2021-2022-la-chouffe-code/blob/main/main.py) is the main file, which can be launched from the command line by using <code>py -m main < path ></code>. It launches the **multithread_populating** script, which is the one doing the work.
I also calculated that 74.13% of articles have DOIs.
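
The idea behind the multithreaded querying can be illustrated with the sketch below. It is not the code of **multithread_populating**: it is a simplified example built on the standard `concurrent.futures` module, where `query_all` and `fake_fetch` are hypothetical names and the network call is replaced by a stand-in so that the example runs offline. Threads help here because most of the time is spent waiting for the API to answer.

```python
from concurrent.futures import ThreadPoolExecutor

def query_all(issns, fetch, max_workers=10):
    """Run fetch(issn) for every ISSN on a pool of threads and collect the results.

    A failure is stored as the raised exception instead of stopping the whole batch.
    """
    def safe_fetch(issn):
        try:
            return issn, fetch(issn)
        except Exception as error:
            return issn, error

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe_fetch, issns))

# Stand-in for a real network call, so that the sketch runs offline
def fake_fetch(issn):
    if issn == "0000-0000":
        raise ValueError("unknown ISSN")
    return len(issn)

batch_results = query_all(["1471-2466", "1932-6203", "0000-0000"], fake_fetch)
print(batch_results)
```

The number of workers is a trade-off: more threads mean more requests in flight, but the API's rate limits still apply.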

We started the query; it takes approximately 15500 seconds (4h18m) to query a single batch.
I also wrote the letter answering the review and started writing down the measures needed to answer the research questions.

In [None]:
# Collect the DOIs of the DOAJ articles returned in the first page of results
for item in results[0]['results']:
    for identifier in item['bibjson']['identifier']:
        if identifier['type'] == 'doi':
            articles_doaj.add(identifier['id'])

## 9 May
This week I created the script that produces CSV files, which are easier to manipulate and to extract statistics from; it is the stats.py file.
Moreover, I created the scripts to manipulate the files for the statistics.

In [2]:
import os

import pandas as pd
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)         # initiate notebook for offline plot

def get_all_in_dir(dir, format = 'json'):
    # Yield the path of every file in dir whose name ends with the given format
    for filename in os.listdir(dir):
        f = os.path.join(dir, filename)

        if os.path.isfile(f) and f.endswith(format):
            yield f

a = list(get_all_in_dir('results', 'csv'))
# DataFrame.append was removed in pandas 2.0, so all the files are concatenated at once
df = pd.concat([pd.read_csv(file) for file in a])
df = df.set_index('issn')
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))

df['perc_cr'] = (df['on_crossref']/df['doi-num'])*100 # etc. NB: in the reference statistics it makes sense to remove ref_nd from the total, because you don't see the percentage
df['perc_ref'] = (df['reference']/df['on_crossref'])*100
top_journals = df.nlargest(100,'doi-num')
top_journals.drop([col for col in top_journals.columns if col not in ['perc_cr','perc_ref']],axis=1,inplace=True)
fig = px.scatter(top_journals)
iplot(fig)

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'results'

In [None]:
df.describe() #this gives you the percentages

In general, it appears that the biggest journals have almost all of their articles on Crossref with a reference list. To investigate further, I plotted all the data: there seems to be an inverse relationship between the number of DOIs and both the share of articles on Crossref and the share with references. The clear outlier is PLOS ONE (ISSN: 1932-6203). The role of Crossref's algorithm in asserting the DOIs seems to be widespread.

In [None]:
fig = px.scatter(df,x = 'doi-num', y = ['perc_cr','perc_ref'], size = 'on_crossref', color='asserted-by-cr')
fig.show()

In [None]:
df.drop('1932-6203', inplace=True)  # remove the PLOS ONE outlier
df['perc_asserted_cr'] = (df['asserted-by-cr']/(df['ref-num']-df['ref-undefined']))
# The colour column must match the name of the column created above
fig = px.scatter(df, x='doi-num', y=['perc_cr','perc_ref'], size='on_crossref', color='perc_asserted_cr')
fig.show()


Could this be a good sample for the statistics?

In [None]:
df = df[(df['doi-num']>10) & (df['on_crossref']==1)]

Find information by year:

In [None]:
df = df.set_index('issn')
df =df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df = df[df.year > 1850]
df.describe()


By using multiple constraints, such as reference == 1 to check whether there is a reference list, and by manipulating the years, I can now check multiple elements.
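
A toy example of this kind of filtering, with made-up rows (one per article) and only three of the columns used in this notebook (`year`, `on_crossref`, `reference`). It is a sketch of the technique, not of the real data.

```python
import pandas as pd

# Made-up rows, one per article
articles = pd.DataFrame({
    "year": [1800, 2019, 2019, 2020, 2020, 2023],
    "on_crossref": [1, 1, 0, 1, 1, 1],
    "reference": [0, 1, 0, 1, 0, 1],
})

# Constraints are combined with & (and) or | (or); each one needs its own parentheses
in_range = (articles["year"] >= 1850) & (articles["year"] < 2022)
with_refs = in_range & (articles["on_crossref"] == 1) & (articles["reference"] == 1)

print(len(articles[in_range]))    # articles inside the year range
print(len(articles[with_refs]))   # of those, on Crossref and with a reference list

# Per year: percentage of the articles on Crossref that have a reference list
per_year = articles[in_range].groupby("year")[["on_crossref", "reference"]].sum()
per_year["perc_ref"] = per_year["reference"] / per_year["on_crossref"] * 100
print(per_year)
```

Keeping each mask in its own variable makes it easy to reuse the same constraint for several statistics.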

In [None]:
import plotly.graph_objects as go
df =df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df = df.drop(['issn','doi'],axis=1)
df = df[(df.year >= 1850)&(df.year < 2022)]
df = df.groupby('year').sum()
df['perc_cr'] = (df['on_crossref']/df['doi-num'])*100 # etc. NB: in the reference statistics it makes sense to remove ref_nd from the total,
df['perc_ref'] = (df['reference']/df['on_crossref'])*100 # because you don't see the percentage
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc='avg',x=df.index,y = df.perc_cr, name='percentage on crossref'))
fig.add_trace(go.Histogram(histfunc='avg',x=df.index,y = df.perc_ref, name='percentage with reference'))

In [None]:
df['perc_asserted_cr'] = (df['asserted-by-cr']/df['ref-num'])*100
df['perc_asserted_pub'] = (df['asserted-by-pub']/df['ref-num'])*100
df['perc_ref_nodoi'] = (df['ref-undefined']/df['ref-num'])*100
# The y columns must match the names of the columns created above
fig2 = px.histogram(df, x=df.index, y=['perc_asserted_cr','perc_asserted_pub','perc_ref_nodoi'], histfunc='avg')