# 28 - 03 - 2021


## First draft of our abstract

In our first meeting with the group, we discussed the topic and created a first draft of our abstract. To do this, everyone wrote a draft and we combined the results at the end.

# 30 - 03 - 2021






## Preliminary Analysis: exploring the classes of errors
I started to carry out some preliminary analysis in order to better understand the problem that we need to tackle in our research question. The dataset that we need to use is provided by Silvio Peroni under a public license at http://doi.org/10.5281/zenodo.4625300. It consists of a two-column CSV file, where the first column ("Valid_citing_DOI") contains the DOI of a citing entity retrieved in Crossref, while the second column ("Invalid_cited_DOI") contains the invalid DOI of a cited entity identified by looking at the field "reference" in the JSON document returned by querying the [Crossref API](https://www.crossref.org/education/retrieve-metadata/rest-api/) with the citing DOI. <br />
In the column containing invalid cited DOIs, I noticed some first general classes of errors that invalidate the identifier:<br />
* **DOIs containing additional URLs:**
  1. 10.1016/j.aca.2006.07.086.http://dx.doi.org/10.1016/j.aca.2006.07.086 → 10.1016/j.aca.2006.07.086
  2. "10.1186/1735-2746-10-21,http://www.ijehse.com/content/10/1/21" → 10.1186/1735-2746-10-21
* **Extra characters at the end:**
  1. 10.1061/9780784480502.018] →  10.1061/9780784480502.018
  2. 10.1044/1092-4388(2012/11-0316)a →  10.1044/1092-4388(2012/11-0316)
* **Extra strings at the end:** 
  1. 10.1111/j.1099-0860.1997.tb00004.x/abstract> → 10.1111/j.1099-0860.1997.tb00004.x
  2. 10.4103/0975-�-7406.163460>accessed8 →  10.4103/0975-7406.163460
  3. 10.1007/s10706-019-01181-9(0123456789 → 10.1007/s10706-019-01181-9
* **DOIs with wrongly encoded HTML entities:**
  1. 10.1379/1466-1268(1997)002lt;0162:tmethi>2.3.co;2 → 10.1379/1466-1268(1997)002<0162:tmethi>2.3.co;2

Arcangelo noticed that some **DOIs contain queries to proxy servers or characters forbidden in URLs:**
  1. 10.2307/2491102?uid=37380728uid=20uid=40sid=4102564553863 → 10.2307/2491102
  2. 10.1016/j.envexpbot.2013.10.018#doilink → 10.1016/j.envexpbot.2013.10.018
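
The classes listed above can be turned into a first, rough cleaning function. What follows is only a minimal sketch: the function name `clean_doi` and the individual rules are my own assumptions, written to cover the examples in this entry, and they are not a complete or validated cleaning method.

```python
import re


def clean_doi(raw: str) -> str:
    """Apply a few illustrative clean-up rules to one invalid DOI string."""
    # Remove surrounding whitespace and straight or curly double quotes.
    doi = raw.strip().strip('"“”')
    # Cut a query string or fragment ("?uid=...", "#doilink").
    doi = re.split(r"[?#]", doi, maxsplit=1)[0]
    # Cut a URL glued to the DOI, together with a "." or "," separator.
    doi = re.sub(r"[.,]?https?://.*$", "", doi, flags=re.IGNORECASE)
    # Cut a trailing ">", optionally preceded by "/abstract"
    # or followed by "accessed<digits>".
    doi = re.sub(r"(?:/abstract)?>(?:accessed\d+)?$", "", doi, flags=re.IGNORECASE)
    # Cut one stray closing square bracket at the very end.
    doi = re.sub(r"\]$", "", doi)
    # Restore "lt;" / "gt;" entities that lost their leading "&"
    # (assumption: these letter pairs never occur legitimately before ";").
    doi = re.sub(r"(?<!&)(lt|gt);", lambda m: "<" if m.group(1) == "lt" else ">", doi)
    return doi


# Examples copied from the lists above.
examples = [
    "10.1016/j.aca.2006.07.086.http://dx.doi.org/10.1016/j.aca.2006.07.086",
    "10.1061/9780784480502.018]",
    "10.1111/j.1099-0860.1997.tb00004.x/abstract>",
    "10.1379/1466-1268(1997)002lt;0162:tmethi>2.3.co;2",
    "10.2307/2491102?uid=37380728uid=20uid=40sid=4102564553863",
    "10.1016/j.envexpbot.2013.10.018#doilink",
]
for raw in examples:
    print(raw, "->", clean_doi(raw))
```

The rules are applied in a fixed order, so a DOI that combines several error classes is cleaned step by step. Cases such as the extra "a" in `10.1044/1092-4388(2012/11-0316)a` are deliberately left out, because a trailing letter can also be a legitimate part of a DOI.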

# 03 - 04 - 2021

## First draft of Data Management Plan (DMP)
In a group meeting, we created a DMP on the Argos platform (https://argos.openaire.eu/) and split the remaining work on it (the description of the two datasets) between us.
For the DMP, I focused on parts **1: Data Summary** and **2: Reusable Data** for the dataset containing the source code used to clean a CSV list of invalid DOI names.

# 07 - 04 - 2021


## First bites of code: Counting the invalid DOI-to-DOI citations in the CSV

I started to carry out some preliminary computations on our data by counting how many pairs of invalid DOI-to-DOI citations are in the CSV that we use as input data.

In [None]:
!wget https://zenodo.org/record/4625300/files/invalid_dois.csv

# if you are running this notebook locally on Windows 10 use:
# pip install wget
# !python -m wget https://zenodo.org/record/4625300/files/invalid_dois.csv

--2021-04-09 13:11:48--  https://zenodo.org/record/4625300/files/invalid_dois.csv
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64233224 (61M) [text/plain]
Saving to: ‘invalid_dois.csv’


2021-04-09 13:11:54 (12.9 MB/s) - ‘invalid_dois.csv’ saved [64233224/64233224]



In [None]:
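from csv import reader

# Read the CSV, skip the header row and keep the (citing DOI, invalid cited DOI) pairs.
# The "with" block closes the file as soon as the list has been built.
with open('invalid_dois.csv', newline='', encoding='utf-8') as csv_file:
    f = reader(csv_file)
    header = next(f)
    input_list = [(doi1, doi2) for (doi1, doi2) in f]
print(input_list[:50])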

[('10.14778/1920841.1920954', '10.5555/646836.708343'), ('10.5406/ethnomusicology.59.2.0202', '10.2307/20184517'), ('10.1161/01.cir.63.6.1391', '10.1161/circ.37.4.509'), ('10.1177/1179546820918903', '10.3748/wjg.v10.i5.707.'), ('10.1080/10410236.2020.1731937', '10.1070/10810730903528033'), ('10.1161/strokeaha.112.652065', '10.1161/str.24.7.8322400'), ('10.1177/1049732310393747', '10.1111/j.545-5300.2003.42208.x'), ('10.1155/2017/1491405', '10.3760/cma.j.issn.0366-6999.20131202'), ('10.1161/01.res.68.6.1549', '10.1161/res.35.2.159'), ('10.4018/978-1-5225-2650-6.ch006', '10.1002/per'), ('10.1145/2525314.2594229', '10.5555/1873601.1873616'), ('10.1007/s10619-020-07320-z', '10.3390/sym11070911www.mdpi.com/journal/symmetry'), ('10.1007/s11771-020-4410-2', '10.13745/j.esf.2016.02.011'), ('10.1161/01.cir.102.5.591', '10.1161/circ.85.3.1537115'), ('10.1007/s40617-018-00299-1', '10.1901/jaba.2012.45-657'), ('10.1074/jbc.m508416200', '10.1059/0003-4819-100-4-483'), ('10.1177/2054358119836124', '

In [None]:
print(len(input_list))

1223296


In [None]:
invalid_dois = list(zip(*input_list))[1]
print(invalid_dois[:50])

('10.5555/646836.708343', '10.2307/20184517', '10.1161/circ.37.4.509', '10.3748/wjg.v10.i5.707.', '10.1070/10810730903528033', '10.1161/str.24.7.8322400', '10.1111/j.545-5300.2003.42208.x', '10.3760/cma.j.issn.0366-6999.20131202', '10.1161/res.35.2.159', '10.1002/per', '10.5555/1873601.1873616', '10.3390/sym11070911www.mdpi.com/journal/symmetry', '10.13745/j.esf.2016.02.011', '10.1161/circ.85.3.1537115', '10.1901/jaba.2012.45-657', '10.1059/0003-4819-100-4-483', '10.1016/j.amepre.2015.07.017.', '10.1016/j.atmosenv.2008.0305', '10.1161/circ.66.1.7083497', '10.1080/1364253032000157166', '10.1161/str.23.11.1440702', '10.1186/s12933-015-0183-6.', '10.1037/0021–9010.93.3.602', '10.1177/0312896216656720.', '10.5555/2133036.2133123', '10.5555/2442626.2442634', '10.5555/1785594.1785635', '10.1161/res.59.2.2874900', '10.1002/smj', '10.1161/res.89.12.1216', '10.3390/brainsci7120164.', '10.5555/944919.944937', '10.1145/2838344.2856460', '10.1177/0890117116661982.', '10.1161/circ.39.1.48', '10.110
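
A quick tally of how each invalid DOI ends can give a first impression of which suffix errors are most frequent. The snippet below is only a sketch: `sample_invalid_dois` holds a few values copied from the output above, and the helper `last_char_kind` is a name I made up for illustration. In the notebook, the full `invalid_dois` tuple would take the place of the sample.

```python
from collections import Counter

# A handful of values copied from the printed output above.
sample_invalid_dois = (
    '10.5555/646836.708343', '10.3748/wjg.v10.i5.707.', '10.1002/per',
    '10.3390/sym11070911www.mdpi.com/journal/symmetry',
    '10.1016/j.amepre.2015.07.017.', '10.1186/s12933-015-0183-6.',
    '10.1002/smj', '10.3390/brainsci7120164.',
)


def last_char_kind(doi: str) -> str:
    """Classify the last character as 'digit', 'letter' or the character itself."""
    if not doi:
        return "empty"
    last = doi[-1]
    if last.isdigit():
        return "digit"
    if last.isalpha():
        return "letter"
    return last


# Count how many DOIs end with a digit, a letter or a given punctuation mark.
trailing = Counter(last_char_kind(doi) for doi in sample_invalid_dois)
print(trailing.most_common())
print(len(set(sample_invalid_dois)), "distinct values in the sample")
```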

# 09 - 04 - 2021

## Useful References for literature review

Here I will list the useful references that I found in order to carry out a literature review:

1. Xu, S., Hao, L., An, X. et al. Types of DOI errors of cited references in Web of Science with a cleaning method. Scientometrics 120, 1427–1437 (2019). https://doi.org/10.1007/s11192-019-03162-4
2. Zhu, J., Hu, G. & Liu, W. DOI errors and possible solutions for Web of Science. Scientometrics 118, 709–718 (2019). https://doi.org/10.1007/s11192-018-2980-7
3. Franceschini, F., Maisano, D. & Mastrogiacomo, L. Errors in DOI indexing by bibliometric databases. Scientometrics 102, 2181–2186 (2015). https://doi.org/10.1007/s11192-014-1503-4





# 10 - 04 - 2021



## Notes on: Xu, S., Hao, L., An, X. et al. (2019). Types of DOI errors of cited references in Web of Science with a cleaning method. 
(article: https://doi.org/10.1007/s11192-019-03162-4)<br/>
This paper addresses the problem of DOI errors in cited references contained in the Web of Science (WoS) database. The authors collected a set of bibliographic references in the gene editing field and analysed them in depth in order to understand which classes of errors were present in cited DOIs and which solutions could be attempted to correct them automatically. <br/>
After their analysis, they found that **many cited DOIs were duplicates** and that they **contained various errors**, which they grouped into three classes: 
  1.	prefix-type errors: DOIs starting with extra characters, e.g. “http://dx.doi.org/10.XXXX/XXXXXXXXXX”
  2.	suffix-type errors: DOIs containing extra characters at the end, e.g. “10.XXXX/XXXXXXXXXX(EPUB)”
  3.	other types of errors: double underscores, white spaces, forward slashes and XML tags<br />

After this generalization, they proposed an algorithmic solution to automatically clean duplicate DOIs and join them. This solution consists of 4 steps: (1) cleaning of prefix-type errors, (2) cleaning of suffix-type errors, (3) removal of incompatible characters and other errors and finally (4) joining of compatible DOIs. The first two steps were carried out using **regular expressions**.<br/>
The authors state that the **vast majority of DOI errors belonged to the first category**, prefix-type errors (92.39%). After applying their algorithm, the authors managed to drastically reduce the number of cited references containing two and three DOI names, from 9,704 to 1,990 and from 45 to 33, respectively. However, the authors acknowledge that their solution was not able to deal with the following issues: (a) correcting similar characters confused with each other, such as “O” versus “0”, “b” versus “6” etc.; (b) distinguishing the correct DOI name from multiple DOI names assigned to the same cited reference; (c) identifying DOIs that cannot be resolved by the DOI system; and (d) identifying DOIs that are resolvable but point to some knowledge unit which is related to the Digital Object but is different from it.
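
To make the four steps more concrete for myself, here is a minimal sketch of such a pipeline. The regular expressions are my own simplified assumptions based on the examples quoted above (they are not the ones published in the paper), the DOIs are made-up placeholders, and the names `clean` and `join_duplicates` are hypothetical.

```python
import re
from collections import defaultdict

# Step 1 (assumed pattern): a resolver URL or "doi:" label in front of the DOI.
PREFIX_RE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
# Step 2 (assumed pattern): "(EPUB)" or one punctuation mark at the end.
SUFFIX_RE = re.compile(r"(?:\(EPUB\)|[.,;])$", re.IGNORECASE)
# Step 3 (assumed pattern): white spaces and XML tags anywhere in the string.
OTHER_RE = re.compile(r"\s+|</?[A-Za-z]+>")


def clean(doi: str) -> str:
    """Run steps 1 to 3 on one DOI string and lower-case the result."""
    doi = PREFIX_RE.sub("", doi.strip())  # step 1: prefix-type errors
    doi = SUFFIX_RE.sub("", doi)          # step 2: suffix-type errors
    doi = OTHER_RE.sub("", doi)           # step 3: other errors
    return doi.lower()                    # DOI names are case-insensitive


def join_duplicates(dois):
    """Step 4: group the original strings that clean to the same DOI."""
    groups = defaultdict(list)
    for doi in dois:
        groups[clean(doi)].append(doi)
    return dict(groups)


# Made-up variants of one placeholder DOI, plus one unrelated DOI.
variants = [
    "http://dx.doi.org/10.1234/abc.5678",
    "10.1234/ABC.5678(EPUB)",
    "10.1234/abc. 5678",
    "10.1234/<i>abc</i>.5678",
    "10.9999/xyz",
]
for cleaned, originals in join_duplicates(variants).items():
    print(cleaned, "<-", originals)
```

The sketch does not touch the issues (a) to (d) that the authors list as open, since those need more than string rewriting.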


## Notes on: Franceschini, F., Maisano, D., & Mastrogiacomo, L. (2015). Errors in DOI indexing by bibliometric databases.
(article: https://doi.org/10.1007/s11192-014-1503-4)<br/>
In this paper, Franceschini, Maisano and Mastrogiacomo provide a taxonomy of errors for two bibliometric databases: Scopus and Web of Science (WoS). They found that errors in these databases fall within two general categories: (1) **author-made errors**, due to a lack of care by the author when providing the list of cited articles, and (2) **database mapping errors**, due to poor communication between database administrator and data provider. Moreover, they found (a) that citations obtained from certain publishers are more likely to be omitted than those from others, and (b) that the same DOIs are often mistakenly attached to multiple publications.

## Notes on: Zhu, J., Hu, G. & Liu, W. (2019). DOI errors and possible solutions for Web of Science.
(article: https://doi.org/10.1007/s11192-018-2980-7)<br/>
In this paper, Zhu, Hu and Liu start from the hypothesis that some DOIs assigned to the papers indexed in Web of Science (WoS) are wrongly mapped in the database because similar characters are mistyped, e.g. the number "0" confused with the letter "O". To carry out their research, they queried the WoS database with special strings where the letter "O" appeared between two digits from 0 to 9 (e.g. "0O1"). Among the 310 records returned by the system, 119 DOIs could not be resolved within the DOI system. In addition, they found that many DOIs were invalid due to other types of character mistyping (e.g. "b" versus "6" and "Q" versus "O") and that one article had 2 DOIs attached when only one was correct. However, despite these findings, the paper only suggests possible solutions for WoS and does not describe any concrete process to carry out and evaluate data cleaning on incorrect DOIs.
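
The "O between two digits" query described above could be reproduced locally with a regular expression. This is only a sketch under my own assumptions: the DOIs are made-up placeholders, the function names are hypothetical, and a replaced DOI is just a candidate that would still have to be checked against the DOI system.

```python
import re

# An upper-case letter "O" with a digit directly before and after it, e.g. "0O1".
O_BETWEEN_DIGITS = re.compile(r"(?<=\d)O(?=\d)")


def has_suspicious_o(doi: str) -> bool:
    """Tell whether the DOI contains an 'O' squeezed between two digits."""
    return O_BETWEEN_DIGITS.search(doi) is not None


def candidate_fix(doi: str) -> str:
    """Replace every such 'O' with '0'; the result is only a candidate DOI."""
    return O_BETWEEN_DIGITS.sub("0", doi)


# Made-up placeholder DOIs, not taken from the dataset.
for doi in ["10.1234/J.TEST.2O15.001", "10.1234/JOURNAL.2015"]:
    if has_suspicious_o(doi):
        print(doi, "-> candidate:", candidate_fix(doi))
    else:
        print(doi, "-> nothing to change")
```

Two adjacent letters, as in "1OO2", are not caught by this pattern, because each "O" must touch a digit on both sides.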

# 19 - 04 - 2021


## Classes of errors: suffix-type errors

After a manual review, I found these recurrent suffix errors:



1.   /doi/ + "www.website.com/etc/etc"
2.   /doi/ + "http://dx.doi.org/etc."
3.   /doi/ + "...........-.-.403420(2001)"
4.   /doi/ + ",pp.2206-2222"
5.   /doi/ + ",http://etc."
6.   /doi/ + "."
7.   /doi/ + ","
8.   /doi/ + ">accessed8"
9.   /doi/ + ">"
10.  /doi/ + "suppinfo"
11.  /doi/ + "...........32,63(2006)"
12.  "10.1007/s11199-012-0130-x" + ",1-16"
13.  "10.1088/2053-1583/3/4/045006" + "/meta"
14.  "10.1111/j.0735-2751.2004.00237.x" + "/abstract"
15.  "10.1002/smi.1053" + ")."
16.  "10.1038/208365a0" + ".......208365(1965)"
17.  "10.1016/j.expneurol.2009.06.012" + "uncitedrefs"
18.  "10.3389/fnins.2016.00584" + ".5186786.pmid28082858.author"
19.  "10.1111/j.1365-2486.1999.00252.x" + "/full"
20.  "10.1111/j.1467-9671.2009.01180.x" + "/pdf"
21.  "/doi/" + ",/doi/"
22.  "10.1101/gr.229202" + "articlepublishedonlinebeforemarch2002"
23.  "10.1177/2043820617738836" + "journals.sagepub.com/home/dhg"
24.  "10.1177/1468794112468475" + "qri.sagepub.com"






In [81]:
# These are the patterns used to match suffix errors.
# Raw strings pass the backslashes unchanged to the regex engine.
# Note: inside a character class, "|" is a literal pipe and not an alternation.
regex1 = r"\/-\/DCSUPPLEMENTAL"
regex2 = r"SUPPINF[0|O](\.)?"
regex3 = r"[\.|(|,|;]?PMID:\d+.*?"
regex4 = r"[\.|(|,|;]?PMCID:PMC\d+.*?"
regex5 = r"[(|\[]EPUBAHEADOFPRINT[)\]]"
regex6 = r"[\.|(|,|;]?ARTICLEPUBLISHEDONLINE.*?\d{4}"
regex7 = r"[\.|(|,|;]*HTTP:\/\/.*?"
regex8 = r"[\.\/](META|ABSTRACT|FULL|EPDF|PDF|SUMMARY)>?"
regex9 = r"([\/\.](META|ABSTRACT|FULL|EPDF|PDF|SUMMARY))?[>|)](LAST)?ACCESSED\d+"
regex10 = r"[\.|(|,|;]?[A-Z]*\.?SAGEPUB.*?"
regex11 = r"<[A-Z\/]+>"
regex12 = r"\.{5}.*?"
regex13 = r"[\.|,|<|>|&|(|;]"
regex14 = r"[\.;,]PP.\d+-\d+"
regex15 = r"[\.|(|,|;]10.\d{4}\/.*?"

regex_lst = [regex1, regex2, regex3, regex4, regex5, regex6, regex7, regex8, regex9, regex10, regex11, regex12, regex13, regex14, regex15]

# regex1 to regex7 are from the paper
# regex8 to regex15 were added by me



In [82]:
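import csv, re, urllib.request

url = 'https://zenodo.org/record/4625300/files/invalid_dois.csv'
response = urllib.request.urlopen(url)
lines = [l.decode('utf-8') for l in response.readlines()]
reader = csv.reader(lines)
next(reader)  # skip the header row so that it is not counted as a DOI pair
# Compile the combined pattern once: a lazy group for the DOI,
# followed by any of the suffix errors up to the end of the string
suffix_pattern = re.compile("(.*?)(?:" + "|".join(regex_lst) + ")$")
rows_number = 0
occurrences = list()
for row in reader:
    rows_number += 1
    match = suffix_pattern.search(row[1].upper())
    if match is not None:
        # keep the original string together with its cleaned (upper-case) version
        occurrences.append((row[1], match.group(1)))
print(f"The wrong DOI names are {len(occurrences)} out of {rows_number}")

The wrong DOI names are 148264 out of 1223296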
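
To check how the combined pattern behaves without downloading the whole file, it can be tried on a few strings. This is only a sketch: it reuses three of the patterns from the cell above (regex7, regex8 and regex13), and the helper name `strip_suffix` is my own.

```python
import re

# Three of the suffix patterns defined above, copied as raw strings.
subset = [
    r"[\.|(|,|;]*HTTP:\/\/.*?",                         # a URL glued to the DOI
    r"[\.\/](META|ABSTRACT|FULL|EPDF|PDF|SUMMARY)>?",   # landing-page suffixes
    r"[\.|,|<|>|&|(|;]",                                # one trailing punctuation mark
]
# The lazy group (.*?) keeps the DOI as short as possible, while "$" forces
# the chosen suffix pattern to reach the end of the string.
combined = re.compile("(.*?)(?:" + "|".join(subset) + ")$")


def strip_suffix(doi: str):
    """Return the upper-cased DOI without its suffix, or None if nothing matches."""
    match = combined.search(doi.upper())
    return match.group(1) if match is not None else None


for doi in [
    "10.3382/ps/pev188.",
    "10.1088/2053-1583/3/4/045006/meta",
    "10.4161/viru.1.3.12072.http://dx.doi.org/10.4161/viru.1.3.12072",
    "10.1002/per",
]:
    print(doi, "->", strip_suffix(doi))
```

Only one suffix is removed per string, so a DOI that carries two errors in a row (for example ")." at the end) still keeps the first one after this pass.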


In [83]:
print(occurrences[74532:77000])

## TO DO:
## 1. Deal with closing parenthesis at the end
## 2. test on DOIs checked for incorrectness to see if results vary
## 3. print output csv

[('10.3382/ps/pev188.', '10.3382/PS/PEV188'), ('10.1177/0149206313475815.', '10.1177/0149206313475815'), ('10.1115/1.2969803.', '10.1115/1.2969803'), ('10.1044/jslhr.4103.618.', '10.1044/JSLHR.4103.618'), ('10.1086/704105.', '10.1086/704105'), ('10.17226/11019.', '10.17226/11019'), ('10.4161/viru.1.3.12072.http://dx.doi.org/10.4161/viru.1.3.12072', '10.4161/VIRU.1.3.12072'), ('10.1002/jhet.544.http://dx.doi.org/10.1002/jhet.544', '10.1002/JHET.544'), ('10.1139/cjz-79-9-1559.', '10.1139/CJZ-79-9-1559'), ('10.1016/j.apsoil.2008.03.007http://dx.doi.org/10.1016/j.apsoil.2008.03.007', '10.1016/J.APSOIL.2008.03.007'), ('10.1073/pnas.0308600100.', '10.1073/PNAS.0308600100'), ('10.1038/mtna.2016.105.', '10.1038/MTNA.2016.105'), ('10.1001/jama.2014.14601.', '10.1001/JAMA.2014.14601'), ('10.1007/s00441-016-2404-z.', '10.1007/S00441-016-2404-Z'), ('10.1016/0090-2616(95)90034-9.', '10.1016/0090-2616(95)90034-9'), ('10.1191/1358863x04vm552xx.', '10.1191/1358863X04VM552XX'), ('10.1186/s12872-016-028
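
For TO DO items 1 and 3, a possible starting point is sketched below. Everything in it is an assumption of mine: the function names, the rule of removing a final ")" only when the parentheses are unbalanced, and the output column name `Cleaned_DOI`.

```python
import csv
import io


def strip_unbalanced_closing(doi: str) -> str:
    """Remove trailing ')' characters that have no matching '(' in the DOI."""
    while doi.endswith(")") and doi.count(")") > doi.count("("):
        doi = doi[:-1]
    return doi


def write_pairs(pairs, out_file):
    """Write (invalid DOI, cleaned DOI) pairs as CSV to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["Invalid_cited_DOI", "Cleaned_DOI"])  # hypothetical header
    writer.writerows(pairs)


# A balanced DOI is left alone, an unbalanced one loses its last ")".
print(strip_unbalanced_closing("10.1044/1092-4388(2012/11-0316)"))
print(strip_unbalanced_closing("10.1002/smi.1053)"))

# An in-memory buffer stands in for the output file in this sketch.
buffer = io.StringIO()
write_pairs([("10.1002/smi.1053).", "10.1002/smi.1053")], buffer)
print(buffer.getvalue())
```

With a real file, `write_pairs` would be called inside `with open(path, "w", newline="", encoding="utf-8") as out_file:`, where `path` is whatever output name we agree on.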

## Review of The Leftovers 2.0 DMP
https://doi.org/10.32388/DIA06O