# Laboratory Notebook

## Table of contents

1. [24/03/2021 Preliminary research](#preliminary)
2. [28/03/2021 Writing the abstract](#abstract)
3. [29/03/2021 Full reading of the DOI Handbook](#handbook)
4. [03/04/2021 Data Management Plan Version 1.0](#dmp1)
5. [09/04/2021 Literature review](#literature)
6. [11/04/2021 Computational workflow Version 1.0](#workflow1)
7. [17/04/2021 Open peer review](#peerReview)
8. [17/04/2021 First version of the regular expression to clean DOI's wrong prefixes](#regex1)

## 24/03/2021 Preliminary research <a name="preliminary"></a>

<p>Today I started working on the problem of wrong DOI names reported by Crossref. The research question driving the entire work is: which classes of errors characterize invalid DOI names, and which of these classes can be addressed through automatic processes in order to obtain the correct DOI names?</p>
<p>Having never used the DOI API, I explored the subject by reading its documentation<sup><a href="#preliminary_ref_01">[1]</a></sup>. In particular, chapter <em>3.8.3 Proxy Server REST API</em> explains that a GET request to https://doi.org/api/handles/&lt;doi&gt; returns useful information. The response code is particularly interesting, as it can take one of four values:
<ul>
    <li>1: Success. (HTTP 200 OK)</li>
    <li>2: Error. Something unexpected went wrong during handle resolution. (HTTP 500 Internal Server Error)</li>
    <li>100: Handle Not Found. (HTTP 404 Not Found)</li>
    <li>200: Values Not Found. The handle exists but has no values (or no values according to the types and indices specified). (HTTP 200 OK)</li>
</ul>
This is extremely useful for identifying those initially invalid DOI names that have become valid in the meantime.</p>
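
<p>A minimal sketch of how this check could be automated is given here. The function names <code>describe_response_code</code> and <code>fetch_response_code</code> are hypothetical, and the sketch assumes that the JSON body returned by the proxy carries a <code>responseCode</code> field, also when the HTTP status is 404 or 500.</p>

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Response codes as listed in chapter 3.8.3 of the DOI Handbook.
RESPONSE_CODES = {
    1: "Success",
    2: "Error",
    100: "Handle Not Found",
    200: "Values Not Found",
}


def describe_response_code(code):
    """Map a proxy response code to its meaning ("Unknown" if unlisted)."""
    return RESPONSE_CODES.get(code, "Unknown")


def fetch_response_code(doi, timeout=10):
    """Query the proxy REST API and return the response code for `doi`.

    Assumption: the JSON body contains a `responseCode` field.
    """
    url = "https://doi.org/api/handles/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
    except urllib.error.HTTPError as error:
        # Assumption: 404 and 500 answers still carry a JSON body.
        body = error.read()
    return json.loads(body)["responseCode"]
```

<p>Under these assumptions, <code>describe_response_code(fetch_response_code(doi))</code> would tell whether a DOI name that was once reported as invalid now resolves.</p>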
   
### References
<ol>
    <li id="preliminary_ref_01">International DOI Foundation. (2019). DOI® Handbook. <a href="https://doi.org/10.1000/182" target="_blank">https://doi.org/10.1000/182</a>.</li>
</ol>

## 28/03/2021 Writing the abstract <a name="abstract"></a>


<p>Together with my colleagues Cristian Santini (orcid: 0000-0001-7363-6737), Ricarda Boente (orcid: 0000-0002-2133-8735) and Deniz Tural (orcid: 0000-0002-6391-4198) I wrote the first version of the abstract.</p>
<p>Starting from the initial hypothesis that there are two classes of errors, namely factual errors and DOI names that are not yet valid at the time of processing, we tried to further break down the first class into various subclasses: exploring the input dataset, we hypothesized that a DOI name may be factually wrong because it contains forbidden characters, because it contains excess strings at the beginning or at the end or due to a human error in the transcription.</p>
<p>Moreover, I formulated a first hypothesis on how to deal with both classes of errors. As for the first class, I speculated that it would be possible to obtain the correct cited DOI names starting from the valid citing DOI names, using the REST API for COCI: the COCI references operation would return all the cited articles, and a word-embedding algorithm would identify the DOI name most similar to the wrong one. However, my colleagues rightly pointed out that wrong DOI names are not accepted by COCI and that COCI is built from Crossref, so it would not be possible to get the correct DOI names from COCI. Another hypothesis, proposed by Cristian, was to exploit other metadata provided by Crossref about that particular wrong DOI name, to check whether there are other DOI names connected to it that may refer to the correct one. In the end, we decided to remain as vague as possible in the abstract and to summarize the various hypotheses in the expression "rule-based methods", with the idea of defining this point after further research.</p>
<p>There was much more consensus on the strategy for dealing with the second class of errors, namely DOI names that are not yet valid: use the DOI API and interpret the response status code.</p>

## 29/03/2021 Full reading of the DOI Handbook <a name="handbook"></a>

<p>Writing the abstract raised some questions about the specifics of a DOI name. Therefore, I decided to read the entire documentation with the aim of answering three questions:</p>
<ol>
    <li>Do DOI names use a limited character set?</li>
    <li>Do DOI names have semantics?</li>
    <li>In which ways can I query the DOI System Proxy Server?</li>
</ol>
<p>Chapter <em>2.5.1 Encoding principle</em> answers the first question:</p>
<blockquote>"DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO / IEC 10646, which is the character set defined by Unicode. The character set encompasses most characters used in every major language written today. However, because of specific uses made of certain characters by some Internet technologies (the use of pointed brackets in xml for example), there may some effective restrictions in day-to-day use."</blockquote>
<p>Chapter <em>2.2.1 General characteristics</em> answers the second question:</p>
<blockquote>The DOI name is an opaque string for the purposes of the DOI system. No definitive information may be inferred from the specific character string of a DOI name. In particular, the inclusion in a DOI name of any registrant code allocated to a specific registrant does not provide evidence of the ownership of rights or current management responsibility of any intellectual property in the referent. Such information may be asserted in the associated metadata.</blockquote>
<p>As for the third question, chapter <em>3.8.4.2 Parameter Passing</em> describes a series of query parameters that can be used to specify the desired output. This led me to hypothesize that some of the wrong DOI names had not been stripped of their query portion during the extraction phase. The hypothesis was confirmed. The code used to perform this check is shown below.</p>

```python
import csv
import re
import urllib.request

url = 'https://zenodo.org/record/4625300/files/invalid_dois.csv'
with urllib.request.urlopen(url) as response:
    lines = [line.decode('utf-8') for line in response.readlines()]
reader = csv.reader(lines)

rows_number = 0
proxy_server_occurrences = []  # one entry per row whose DOI name contains a query
proxy_server_queries = set()   # distinct query parameters found
for row in reader:
    rows_number += 1
    # First "?name=" fragment in the second column (the wrong DOI name)
    query = re.search(r"\?.+?=", row[1])
    if query is not None:
        query = query.group(0)
        proxy_server_occurrences.append(query)
        proxy_server_queries.add(query)
print(f"The wrong DOI names that contain queries are {len(proxy_server_occurrences)} out of {rows_number}")
print(f"The number of queries found is equal to {len(proxy_server_queries)}. They are:")
print(proxy_server_queries)
```

```
The wrong DOI names that contain queries are 228 out of 1223297
The number of queries found is equal to 39. They are:
{'?genre=', '?download=', '?slug=', '?uid=', '?arnumber=', '?abstract_id=', '?page=', '?print=', '?term=', '?select-row=', '?crawler=', '?seq=', '?sequence=', '?site=', '?doid=', '?title=', '?origin=', '?v=', '?locale=', '?artid=', '?prd=', '?doi=', '?nosfx=', '?articleid=', '?id=', '?code=', '?ref=', '?ver=', '?sid=', '?rss=', '?refreqid=', '?scroll=', '?accountid=', '?src=', '?goto=', '?no-access=', '?rskey=', '?error=', '?report='}
```


## 03/04/2021 Data Management Plan Version 1.0 <a name="dmp1"></a>

<p>Together with my colleagues I compiled version 1.0 of the Data Management Plan, using the ARGOS platform by OpenAIRE and EUDAT. Two datasets were considered: one for the output and the other for the code. Particular care was taken in filling in as many fields as possible, but some doubts arose about which metadata to use in order to comply with the FAIR principles<sup><a href="#dmp1_ref_01">[1]</a><a href="#dmp1_ref_02">[2]</a><a href="#dmp1_ref_03">[3]</a></sup>, an aspect that will be further explored later.</p>

### References
<ol>
    <li id="dmp1_ref_01">Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18.</li>
    <li id="dmp1_ref_02">GO FAIR. (2018). FAIR Principles. https://www.go-fair.org/fair-principles/.</li>
    <li id="dmp1_ref_03">Michener, W. K. (2015). Ten Simple Rules for Creating a Good Data Management Plan. PLOS Computational Biology, 11(10), e1004525. https://doi.org/10.1371/journal.pcbi.1004525.</li>
</ol>

## 09/04/2021 Literature review <a name="literature"></a>
<p>Taking a cue from Cristian's laboratory notebook, I read the articles he reported on 09/04/2021 about past studies on our research topic. I found several interesting ideas in <em>Types of DOI errors of cited references in Web of Science with a cleaning method</em><sup><a href="#literature_ref_01">[1]</a></sup> by Shuo Xu et al.:</p>
<ul>
    <li>The study suggests that there are three types of DOI errors, namely prefix-, suffix- and other-type errors. The other-type errors are further divided into three subgroups: (a) those containing special characters (such as 10.1034/j.1600-0404.2000.101004262x./), (b) incoherently described DOIs (such as 10.1038/), and (c) those with incomplete suffix but with correct DOI prefix (such as 10.1007/3-540-48194-X_).</li>
    <li>Three regular expressions are proposed to clean up the respective types of errors.
        <img src="https://media.springernature.com/full/springer-static/image/art%3A10.1007%2Fs11192-019-03162-4/MediaObjects/11192_2019_3162_Fig2_HTML.png?as=webp"/>
It is worth mentioning that the aforementioned regular expressions are applied to strings already cleaned of double underscores, double periods, XML tags, spaces and forward slashes.
    </li>
    <li>Finally, the article mentions some problems that cannot be solved by this approach: (a) similar characters confused with each other, such as “O” versus “0”, “b” versus “6” and “O” versus “Q”; (b) multiple DOI names assigned to the same cited reference, among which the correct one must be distinguished; (c) a DOI name assigned to some cited reference that cannot be resolved by the DOI system; (d) a DOI name that is resolvable, but points to some knowledge unit within the cited reference of interest.</li>
</ul>
<p>Also, the article <em>DOI errors and possible solutions for Web of Science</em> by Junwen Zhu et al.<sup><a href="#literature_ref_02">[2]</a></sup> suggests that DOI names often contain wrong characters that resemble the right ones, such as "O" instead of "0", "b" instead of "6", "O" instead of "Q". A possible solution could then be to apply these replacements and verify whether the new DOI name is resolved by the DOI System Proxy.</p>
<p>Finally, the paper <em>Errors in DOI indexing by bibliometric databases</em> by Fiorenzo Franceschini et al.<sup><a href="#literature_ref_03">[3]</a></sup> reports a more generic classification of bibliographic database errors: it distinguishes between authors' errors in creating the list of cited resources and database mapping errors, such as transcription errors. Our work focuses only on the second category. The article then analyzes a further category of error, namely single DOI names associated with different papers. However, the problem is only highlighted and no solution is proposed.</p>
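
<p>The replacement idea can be sketched as follows. This is only an illustration: the <code>CONFUSABLE</code> table contains just the three pairs mentioned above, and the function name <code>candidate_dois</code> and the <code>max_candidates</code> cap are assumptions of mine, not something proposed in the article.</p>

```python
from itertools import product

# Wrong character -> characters it may stand for (pairs quoted above).
CONFUSABLE = {"O": ("0", "Q"), "b": ("6",)}


def candidate_dois(doi, max_candidates=64):
    """Return variants of `doi` with confusable characters replaced.

    Every combination of replacements is generated, so the number of
    variants grows quickly; `max_candidates` is a hypothetical cap.
    """
    # For each position: the original character plus its possible fixes.
    options = [(ch,) + CONFUSABLE.get(ch, ()) for ch in doi]
    candidates = []
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate != doi:
            candidates.append(candidate)
        if len(candidates) >= max_candidates:
            break
    return candidates
```

<p>Each candidate would then have to be checked against the DOI System Proxy, and only the ones that resolve could be considered as possible corrections.</p>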

### References
<ol>
    <li id="literature_ref_01">Xu, S., Hao, L., An, X. et al. Types of DOI errors of cited references in Web of Science with a cleaning method. Scientometrics 120, 1427–1437 (2019). https://doi.org/10.1007/s11192-019-03162-4.</li>
    <li id="literature_ref_02">Zhu, J., Hu, G. &amp; Liu, W. DOI errors and possible solutions for Web of Science. Scientometrics 118, 709–718 (2019). https://doi.org/10.1007/s11192-018-2980-7.</li>
    <li id="literature_ref_03">Franceschini, F., Maisano, D., &amp; Mastrogiacomo, L. (2015). Errors in DOI indexing by bibliometric databases. Scientometrics, 102(3), 2181–2186. https://doi.org/10.1007/s11192-014-1503-4.</li>
</ol>

## 11/04/2021 Computational workflow Version 1.0 <a name="workflow1"></a>

<p>The first version of the computational workflow, that is, the protocol, was written collectively by all members of the group from beginning to end.</p>
<p>The platform chosen for creating, editing and publishing the protocol was Protocols.io (<a href="https://www.protocols.io/welcome" alt="Protocols.io home page" target="_blank">https://www.protocols.io/welcome</a>).</p>
<p>Having found valid ideas in the existing literature, for the first version we decided to structure the protocol by reusing existing methods, in particular for the classification of possible DOI errors (Buchanan, 2006) and for the cleaning of strings using regular expressions (Xu et al., 2019).</p>

## 17/04/2021 Open peer review <a name="peerReview"></a>
<p>In order to better understand how to carry out and structure a peer review, I read the article <em>How to write a thorough peer review</em> (Stiller-Reeve, 2018), which suggested a <em>modus operandi</em> and clarified the spirit in which a review should be drafted. Since I had to review a computational workflow, I also studied in depth the <em>Guidelines for Reviewers</em> provided by PLOS ONE, which concern not only protocols but the review process in general.</p>
<p>With these premises, I tried to stick to both guidelines, reading the protocol three times, taking notes and focusing on different aspects each time. I also tried to answer four questions:</p>
<ul>
    <li>Does the manuscript provide valid rationale for the planned or ongoing study, with clearly identified and justified research questions?</li>
    <li>Is the protocol technically sound and planned in a manner that will lead to a meaningful outcome and allow testing of the stated hypotheses?</li>
    <li>Have the authors described where all data underlying the findings will be made available when the study is complete?</li>
    <li>Is the methodology feasible and does the description provide sufficient methodological detail for the protocol to be reproduced and replicated?</li>
</ul>
<p>Therefore, I organized the review into four chapters:</p>
<ul>
    <li><strong>The premises. About the study's rationale and impact</strong></li>
    <li><strong>The methodology. About the protocol's technical soundness</strong></li>
    <li><strong>The reproducibility. About the input and output</strong></li>
    <li><strong>Conclusions</strong></li>
</ul>

### References
<ol>
    <li>Stiller-Reeve, M. (2018). How to write a thorough peer review. Nature. https://doi.org/10.1038/d41586-018-06991-0.</li>
    <li>Guidelines for Reviewers. PLOS ONE. https://journals.plos.org/plosone/s/reviewer-guidelines.</li>
</ol>

## 17/04/2021 First version of the regular expression to clean DOI's wrong prefixes <a name="regex1"></a>
<p>Trying to use the regular expression to clean up wrong DOI prefixes proposed in <em>Types of DOI errors of cited references in Web of Science with a cleaning method</em> (Xu et al., 2019), I quickly realized that it could never match. The regular expression is the following:</p>
<p><code>^(?:D[0|O]I\/?HTTP:\/\/DX.D[0|O]I.[0|O]RG\/[0|O]RG\/[:|\/]\\\\d+\\\\.HTTP:\/\/DX.D[0|O]I.[0|O]RG\/?)+(.*)</code></p>
<p>The reasons why it cannot match are the following:</p>
<ol>
    <li>The various protocols and domain names are all mandatory and none is optional, while a match is more plausible if they are marked as optional with the question mark.</li>
    <li>The match does not necessarily occur at the beginning of the string. For example, in the following incorrect DOI taken from the input dataset, the protocol is in the middle of the string and not at the beginning: <code>10.2478/s11696-009-0027-5,10.1016/j.aca.2006.07.086.http://dx.doi.org/10.1016/j.aca.2006.07.086"</code>.</li>
</ol>
<p>Therefore, I wrote a new version of the regular expression that takes these two aspects into account:</p>
<p><code>(?:http:\/\/dx.d[0|o]i.[0|o]rg\/)+(.*)</code></p>
<p>I expect to refine it further shortly.</p>
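
<p>As a quick check, the new regular expression can be applied to the incorrect DOI quoted above. The sketch uses Python's <code>re</code> module; the names <code>PREFIX_PATTERN</code> and <code>clean_prefix</code> are hypothetical, and returning the input unchanged when nothing matches is my own assumption.</p>

```python
import re

# New version of the regular expression, copied verbatim from above.
PREFIX_PATTERN = re.compile(r"(?:http:\/\/dx.d[0|o]i.[0|o]rg\/)+(.*)")


def clean_prefix(doi):
    """Return what follows the resolver URL, or `doi` itself if absent."""
    match = PREFIX_PATTERN.search(doi)
    return match.group(1) if match else doi


wrong_doi = ('10.2478/s11696-009-0027-5,10.1016/j.aca.2006.07.086.'
             'http://dx.doi.org/10.1016/j.aca.2006.07.086"')
print(clean_prefix(wrong_doi))  # 10.1016/j.aca.2006.07.086"
```

<p>The trailing quotation mark survives, so suffix cleaning is still needed. Two points may be worth considering in the next refinement: the dots are not escaped, so they match any character, and the class <code>[0|o]</code> also matches a literal <code>|</code>.</p>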