# Analyzing the validity of paper DOIs in VisPubData

**REMEMBER THAT THERE ARE AT LEAST TWO PAPERS THAT CANNOT BE ACCESSED ON IEEE!**

Vispubdata:
  - [paper](https://ieeexplore.ieee.org/abstract/document/7583708)
  - [dataset website](https://sites.google.com/site/vispubdata/home)
  - [raw data in Google Spreadsheet](https://docs.google.com/spreadsheets/d/1xgoOPu28dQSSGPIp_HHQs0uvvcyLNdkMF9XtRajhhxU/edit#gid=939249534)
  
I downloaded the raw VisPubData spreadsheet as a CSV file, `vispubdata.csv`, which this notebook loads from `../../data/raw/vispubdata.csv`.

## Log:

- 2022-01-09: Created this notebook.

- 2022-01-15: Updated this notebook. I added '10.1109/VIS.1999.10000' to `good_urls`. Among duplicates, I deleted the DOIs that are not accessible on OpenAlex (two DOIs in total). The final output is a `txt` file containing 3390 DOIs, which I renamed to `vispd_good_dois.txt`.

- 2022-01-29: I removed papers of type 'M' and redid the analysis.

In [1]:
# loading packages
import pandas as pd
import numpy as np
import re
from functools import reduce
import random

In [2]:
# import raw data
vispd = pd.read_csv("../../data/raw/vispubdata.csv")
vispd

Unnamed: 0,Conference,Year,Title,DOI,Link,FirstPage,LastPage,PaperType,Abstract,AuthorNames-Deduped,AuthorNames,AuthorAffiliation,InternalReferences,AuthorKeywords,AminerCitationCount_04-2020,XploreCitationCount - 2021-02,PubsCited,Award
0,InfoVis,2011,D³ Data-Driven Documents,10.1109/TVCG.2011.185,http://dx.doi.org/10.1109/TVCG.2011.185,2301,2309,J,Data-Driven Documents (D3) is a novel represen...,Michael Bostock;Vadim Ogievetsky;Jeffrey Heer,Michael Bostock;Vadim Ogievetsky;Jeffrey Heer,,10.1109/INFVIS.2000.885091;10.1109/INFVIS.2000...,"Information visualization, user interfaces, to...",1537.0,1197.0,41.0,
1,Vis,1991,Tree-maps: a space-filling approach to the vis...,10.1109/VISUAL.1991.175815,http://dx.doi.org/10.1109/VISUAL.1991.175815,284,291,C,A method for visualizing hierarchically struct...,Brian Johnson;Ben Shneiderman,B. Johnson;B. Shneiderman,"Dept. of Comput. Sci., Maryland Univ., College...",,,1132.0,423.0,23.0,
2,Vis,1990,Parallel coordinates: a tool for visualizing m...,10.1109/VISUAL.1990.146402,http://dx.doi.org/10.1109/VISUAL.1990.146402,361,378,C,A methodology for visualizing analytic and syn...,Alfred Inselberg;Bernard Dimsdale,A. Inselberg;B. Dimsdale,"IBM Sci. Center, Los Angeles, CA, USA;IBM Sci....",,,963.0,373.0,47.0,
3,InfoVis,2006,Hierarchical Edge Bundles: Visualization of Ad...,10.1109/TVCG.2006.147,http://dx.doi.org/10.1109/TVCG.2006.147,741,748,J,A compound graph is a frequently encountered t...,Danny Holten,Danny Holten,,10.1109/INFVIS.2004.1;10.1109/INFVIS.2003.1249...,"Network visualization, edge bundling, edge agg...",700.0,507.0,33.0,TT;BP
4,Vis,1997,ROAMing terrain: Real-time Optimally Adapting ...,10.1109/VISUAL.1997.663860,http://dx.doi.org/10.1109/VISUAL.1997.663860,81,88,C,Terrain visualization is a difficult problem f...,Mark A. Duchaineau;Murray Wolinsky;David E. Si...,M. Duchaineau;M. Wolinsky;D.E. Sigeti;M.C. Mil...,"Los Alamos Nat. Lab., NM, USA",10.1109/VISUAL.1996.567600;10.1109/VISUAL.1996...,"triangle bintree, view-dependent mesh, frame-t...",615.0,207.0,19.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3389,VAST,2020,Visual Analytics of Multivariate Event Sequenc...,10.1109/VAST50239.2020.00009,http://dx.doi.org/10.1109/VAST50239.2020.00009,36,47,C,"In this work, we propose a generic visual anal...",Jiang Wu;Ziyang Guo;Zuobin Wang;Qingyang Xu;Yi...,Jiang Wu;Ziyang Guo;Zuobin Wang;Qingyang Xu;Yi...,"Zhejiang University,The State Key Lab of CAD&C...","10.1109/TVCG.2019.2934209,10.1109/TVCG.2017.27...","Sports Analytics,Event Sequence,Multivariate D...",,0.0,55.0,
3390,VAST,2020,Visual Causality Analysis of Event Sequence Data,10.1109/TVCG.2020.3030465,http://dx.doi.org/10.1109/TVCG.2020.3030465,1343,1352,J,Causality is crucial to understanding the mech...,Zhuochen Jin;Shunan Guo;Nan Chen;Daniel Weisko...,Zhuochen Jin;Shunan Guo;Nan Chen;Daniel Weisko...,iDVx Lab at Tongji University;iDVx Lab at Tong...,"10.1109/INFVIS.2003.1249025,10.1109/TVCG.2014....","Event sequence data,causality analysis,visual ...",,0.0,53.0,
3391,VAST,2020,Visual cohort comparison for spatial single-ce...,10.1109/TVCG.2020.3030336,http://dx.doi.org/10.1109/TVCG.2020.3030336,733,743,J,Spatially-resolved omics-data enable researche...,Antonios Somarakis;Marieke E. Ijsselsteijn;Sie...,Antonios Somarakis;Marieke E. Ijsselsteijn;Sie...,"Department of Radiology, Division of Image Pro...","10.1109/TVCG.2013.124,10.1109/TVCG.2018.286490...","Visual analytics,Imaging Mass Cytometry,Vectra...",,0.0,58.0,
3392,VAST,2020,Visual Neural Decomposition to Explain Multiva...,10.1109/TVCG.2020.3030420,http://dx.doi.org/10.1109/TVCG.2020.3030420,1374,1384,J,Investigating relationships between variables ...,Johannes Knittel;Andres Lalama;Steffen Koch;Th...,Johannes Knittel;Andres Lalama;Steffen Koch;Th...,University of Stuttgart;University of Stuttgar...,"10.1109/INFVIS.2004.68,10.1109/VAST.2008.46773...","Visual Analytics,Multivariate Data Analysis,Ma...",,0.0,64.0,


In [3]:
vispd.shape

(3394, 18)

In [4]:
# paper types I want
jc = ['J', 'C']

In [5]:
# only include papers with the paper type of 'J' or 'C'
vispd = vispd[vispd['PaperType'].isin(jc)]

## Number of papers

In [6]:
num_pub = vispd.shape[0]
print(f'The dataset contains {num_pub} publications.')

The dataset contains 3073 publications.


## Title duplicates checking

In [9]:
## This result indicates that there are no duplicates in titles
len(list(set(vispd.Title))) == vispd.shape[0]

True

## Extract the DOI

In [9]:
dois = vispd.loc[:, "DOI"].tolist()
random.sample(dois, 10)

['10.1109/VAST.2011.6102448',
 '10.1109/TVCG.2007.70604',
 '10.1109/INFVIS.2002.1173154',
 '10.1109/TVCG.2006.197',
 '10.1109/TVCG.2011.214',
 '10.1109/VAST.2014.7042494',
 '10.1109/INFVIS.2002.1173162',
 '10.1109/TVCG.2010.186',
 '10.1109/INFVIS.1997.636789',
 '10.1109/TVCG.2015.2467413']

We can see that most DOIs follow the same pattern: the prefix '10.1109', then a code such as 'TVCG', 'VAST', 'VISUAL', 'SciVis', or 'INFVIS', and then a numeric identifier (often a year followed by an article number).
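
The pattern described above can be written down as a regular expression. The sketch below only encodes the shape inferred from the sample, so it is an assumption rather than a rule about IEEE DOIs. The names `DOI_PATTERN` and `matches_expected_pattern` are illustrative and not part of this notebook, and the last sample entry is the irregular DOI that shows up later in the analysis.

```python
import re

# Assumed shape: IEEE prefix, a code made of letters (optionally followed by digits),
# then one or more dot-separated groups of digits.
DOI_PATTERN = re.compile(r"10\.1109/[A-Za-z]+\d*(?:\.\d+)+")


def matches_expected_pattern(doi):
    """Return True when the whole DOI has the shape seen in the sample above."""
    return DOI_PATTERN.fullmatch(doi) is not None


# Illustrative sample; in the notebook this would be the full `dois` list.
sample = [
    "10.1109/VAST.2011.6102448",
    "10.1109/TVCG.2007.70604",
    "10.1109/INFVIS.1997.636789",
    "10.0000/00000001",
]
unexpected = [doi for doi in sample if not matches_expected_pattern(doi)]
print(unexpected)
```

A DOI that fails this check is not necessarily invalid; it only deviates from the common shape and deserves a closer look.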

### Identifying invalid DOI 

I know that there are several invalid DOIs in the dataset, and I want to find out which ones they are. Most DOIs start with `10.1109`, which is the DOI prefix registered to IEEE. DOIs that do not start with this prefix are therefore at least unusual, if not invalid.

To find out whether every DOI starts with `10.1109`, I first extract the string before the first `/` in each DOI using a regular expression, collect the results in a list, and then find the unique elements of that list.

In [10]:
# first_num is the DOI prefix: everything before the first `/` in each DOI
first_num_list = [re.sub(r'/.*', '', i) for i in dois]
# credit of the above code goes to: https://stackoverflow.com/a/4419021
first_num_list[1]

'10.1109'

In [11]:
# Find out unique strings in first_num_list
# Method 1: `dict.fromkeys()`
dict.fromkeys(first_num_list)

{'10.1109': None, '10.0000': None}

In [12]:
unique = list(dict.fromkeys(first_num_list))
unique

['10.1109', '10.0000']

In [13]:
# Method 2: set
unique = list(set(first_num_list))
unique

['10.0000', '10.1109']

I now know there are only two unique strings. I want to know the count of the "irregular" string: '10.0000'

In [15]:
first_num_list.count('10.0000')

1

## False paper's information

Now I know that only one paper contains the irregular string '10.0000'. I want to know which paper it is:

In [16]:
# Full DOI of the paper with the irregular prefix:
for doi in dois:
    if '10.0000' in doi:
        print(doi)

10.0000/00000001


In [17]:
# Here is this paper's full information (regex=False so that "." is matched literally)
false_paper = vispd[vispd['DOI'].str.contains("10.0000", regex=False)]
false_paper

Unnamed: 0,Conference,Year,Title,DOI,Link,FirstPage,LastPage,PaperType,Abstract,AuthorNames-Deduped,AuthorNames,AuthorAffiliation,InternalReferences,AuthorKeywords,AminerCitationCount_04-2020,XploreCitationCount - 2021-02,PubsCited,Award
1493,InfoVis,2000,Using Visualization to Detect Plagiarism in Co...,10.0000/00000001,http://dl.acm.org/citation.cfm?id=857699,173,,C,,Randy L. Ribler;Marc Abrams,,,10.1109/VISUAL.1993.398883;10.1109/VISUAL.1995...,,23.0,,,


Its title is *Using Visualization to Detect Plagiarism in Computer Science Classes*.

The paper exists and is available here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.9516&rep=rep1&type=pdf. 

## Analyzing Journal Code

By "Journal Code", I mean strings like 'TVCG', 'VAST', 'VISUAL', 'SciVis', or 'INFVIS' that follow '10.1109/'.

In [18]:
# I first strip the prefix ('10.1109/' or '10.0000/') off each DOI; dots are escaped so they match literally
doi_main = [re.sub(r'^10\.(?:1109|0000)/', '', i) for i in dois]
random.sample(doi_main, 10)
# len(doi_main) == len(dois) == 3073

['VISUAL.1996.567609',
 'TVCG.2017.2743958',
 'VISUAL.1990.146379',
 'TVCG.2009.141',
 'TVCG.2018.2864827',
 'INFVIS.1997.636778',
 'TVCG.2006.154',
 'VISUAL.2005.1532812',
 'INFVIS.2001.963274',
 'TVCG.2020.3030452']

In [20]:
# I am interested in the outcome of the irregular DOI
for i in doi_main:
    if '00000002' in i or '00000001' in i:
        print(i)

00000001


In [21]:
# Then strip off everything from the first dot onward, leaving only the journal code
journal_code_list = [re.sub(r'\..*', '', i) for i in doi_main]
random.sample(journal_code_list, 10)

['VISUAL',
 'VISUAL',
 'TVCG',
 'TVCG',
 'VISUAL',
 'VISUAL',
 'TVCG',
 'TVCG',
 'INFVIS',
 'VISUAL']

In [22]:
# getting the list of unique journal code
journal_code_unique = list(set(journal_code_list))
journal_code_unique

['TVCG',
 'VAST47406',
 '00000001',
 'SciVis',
 'VAST50239',
 'VIS',
 'VISUAL',
 'VAST',
 'INFVIS']

In [23]:
# initialize a list of dictionaries, each holding a journal code and its count
journal_code_dict_list = []

In [24]:
journal_code_df = pd.DataFrame(columns = ["Journal Code", "Count"])
journal_code_df

Unnamed: 0,Journal Code,Count


In [25]:
# for each unique journal code, get the name and the count, format it as a dict and add this dict to 
# journal_code_dict_list
for i in journal_code_unique: 
    journal_code_dict = {'Journal Code': i, 'Count': journal_code_list.count(i)} 
    journal_code_dict_list.append(journal_code_dict)

In [26]:
journal_code_dict_list

[{'Journal Code': 'TVCG', 'Count': 1514},
 {'Journal Code': 'VAST47406', 'Count': 9},
 {'Journal Code': '00000001', 'Count': 1},
 {'Journal Code': 'SciVis', 'Count': 9},
 {'Journal Code': 'VAST50239', 'Count': 10},
 {'Journal Code': 'VIS', 'Count': 1},
 {'Journal Code': 'VISUAL', 'Count': 1054},
 {'Journal Code': 'VAST', 'Count': 241},
 {'Journal Code': 'INFVIS', 'Count': 234}]

In [27]:
# DataFrame.append was removed in pandas 2.0, so build the frame directly from the list of dicts
journal_code_df = pd.DataFrame(journal_code_dict_list, columns=["Journal Code", "Count"])

In [28]:
journal_code_df

Unnamed: 0,Journal Code,Count
0,TVCG,1514
1,VAST47406,9
2,00000001,1
3,SciVis,9
4,VAST50239,10
5,VIS,1
6,VISUAL,1054
7,VAST,241
8,INFVIS,234
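
The table above is built in several steps (unique codes, a `count` call per code, then a DataFrame). As a more compact alternative, the sketch below tallies the codes in one pass with `collections.Counter`. It is not the code used in this notebook: the helper name `journal_code` and the small sample list are assumptions made for illustration.

```python
from collections import Counter

# Illustrative sample; in the notebook this would be the full `dois` list.
sample_dois = [
    "10.1109/TVCG.2011.185",
    "10.1109/TVCG.2006.147",
    "10.1109/VISUAL.1991.175815",
    "10.1109/VAST50239.2020.00009",
    "10.0000/00000001",
]


def journal_code(doi):
    """Return the text between the first "/" and the next "." (or the end of the DOI)."""
    suffix = doi.partition("/")[2]
    return suffix.partition(".")[0]


# Counter tallies every code in one pass; most_common() sorts by descending count.
code_counts = Counter(journal_code(doi) for doi in sample_dois)
rows = [
    {"Journal Code": code, "Count": count}
    for code, count in code_counts.most_common()
]
print(rows)
```

Passing `rows` to `pd.DataFrame` would give a two-column table with the same headers as the one above.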


The above dataframe displays each journal code and its count. `00000001` is the false paper. I am also interested in `VAST50239`, `VIS`, `SciVis` and `VAST47406`, and I want to see whether they are false papers as well.

In [29]:
for i in dois:
    if 'SciVis' in i:
        print(i)

10.1109/SciVis.2015.7429492
10.1109/SciVis.2015.7429487
10.1109/SciVis.2015.7429485
10.1109/SciVis.2015.7429489
10.1109/SciVis.2015.7429491
10.1109/SciVis.2015.7429493
10.1109/SciVis.2015.7429488
10.1109/SciVis.2015.7429486
10.1109/SciVis.2015.7429490


In [30]:
for i in dois:
    if 'VAST50239' in i:
        print(i)

10.1109/VAST50239.2020.00008
10.1109/VAST50239.2020.00010
10.1109/VAST50239.2020.00006
10.1109/VAST50239.2020.00007
10.1109/VAST50239.2020.00013
10.1109/VAST50239.2020.00014
10.1109/VAST50239.2020.00015
10.1109/VAST50239.2020.00012
10.1109/VAST50239.2020.00011
10.1109/VAST50239.2020.00009


In [31]:
for i in dois:
    if 'VAST47406' in i:
        print(i)

10.1109/VAST47406.2019.8986948
10.1109/VAST47406.2019.8986934
10.1109/VAST47406.2019.8986917
10.1109/VAST47406.2019.8986918
10.1109/VAST47406.2019.8986940
10.1109/VAST47406.2019.8986923
10.1109/VAST47406.2019.8986943
10.1109/VAST47406.2019.8986909
10.1109/VAST47406.2019.8986922


2022-01-29: I manually checked all of these DOIs. All of them work: they direct me to the IEEE website.

In [32]:
dois[journal_code_list.index('VIS')]
# This one directs me to computer.org rather than ieee.org

'10.1109/VIS.1999.10000'
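
The manual check of where each DOI leads could be partly scripted. The following is a minimal sketch, not what was run for this notebook: it assumes the public `https://doi.org/` resolver, uses only the standard library, and the helper names `doi_to_url` and `landing_host` are made up for illustration. Publisher sites may reject automated requests, so a `None` result does not prove that a DOI is invalid.

```python
from urllib.parse import quote, urlparse
from urllib.request import Request, urlopen


def doi_to_url(doi):
    """Build the doi.org resolver URL for a DOI string."""
    return "https://doi.org/" + quote(doi, safe="/")


def landing_host(doi, opener=urlopen, timeout=10):
    """Follow the resolver's redirects and return the host of the landing page.

    Returns None when the request fails (unknown DOI, blocked request, network error).
    `opener` is injectable so the function can be exercised without network access.
    """
    # The User-Agent value is an arbitrary placeholder.
    request = Request(doi_to_url(doi), headers={"User-Agent": "doi-check-sketch"})
    try:
        with opener(request, timeout=timeout) as response:
            return urlparse(response.geturl()).netloc
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None
```

Calling `landing_host(doi)` for each DOI and grouping the results by host would show at a glance which DOIs land on ieee.org, which land elsewhere (such as computer.org), and which could not be resolved.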

## Conclusion

There are two "irregular" DOIs:

- '10.1109/VIS.1999.10000': this is a **VALID** DOI, but it [directs us to computer.org](https://www.computer.org/csdl/proceedings-article/ieee-vis/1999/58970011/12OmNCgrDbG) rather than ieee.org. It is a real paper and we [can access it](https://www.tau.ac.il/~levin/vis99-dco.pdf). It has citations on [Google Scholar](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C50&q=%2710.1109%2FVIS.1999.10000%27&btnG=) and [Web of Science](https://www.webofscience.com/wos/woscc/summary/61385074-41fd-4374-9268-a7aa4f43aa92-1dbf4526/date-descending/1). 

- '10.0000/00000001' (*Using Visualization to Detect Plagiarism in Computer Science Classes*): this is an **INVALID** DOI. The paper [does exist](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.9516&rep=rep1&type=pdf) and has citations on [Google Scholar](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C50&q=Using+Visualization+to+Detect+Plagiarism+in+Computer+Science+Classes&btnG=) and [Web of Science](https://www.webofscience.com/wos/woscc/summary/2bbca800-6664-4f81-a13f-5f30eb2df28e-1dbf5333/date-descending/1). 

<!-- If we include these two papers, we have to collect the information about this paper, like its authors and citation manually.  -->

I checked the availability of these two papers on OpenAlex. Both can be identified via title search. For the first paper, the affiliation info of the first two authors is missing, while the last author's info is complete. I can fill in the missing part with the last author's info because they are from the same department. For the second paper, the author info is complete.

In [33]:
bad_dois = ['10.0000/00000001']

In [34]:
vispd[vispd.DOI.isin(bad_dois)]

Unnamed: 0,Conference,Year,Title,DOI,Link,FirstPage,LastPage,PaperType,Abstract,AuthorNames-Deduped,AuthorNames,AuthorAffiliation,InternalReferences,AuthorKeywords,AminerCitationCount_04-2020,XploreCitationCount - 2021-02,PubsCited,Award
1493,InfoVis,2000,Using Visualization to Detect Plagiarism in Co...,10.0000/00000001,http://dl.acm.org/citation.cfm?id=857699,173,,C,,Randy L. Ribler;Marc Abrams,,,10.1109/VISUAL.1993.398883;10.1109/VISUAL.1995...,,23.0,,,


In [35]:
vispd_good_dois = [doi for doi in dois if doi not in bad_dois]

In [36]:
len(vispd_good_dois)

3072
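
The log at the top mentions that the good DOIs end up in a `txt` file named `vispd_good_dois.txt`, but the cell that writes it is not shown here. A minimal sketch of that step, assuming one DOI per line and UTF-8 encoding; the function name `save_dois` and the output location (the current directory) are hypothetical.

```python
from pathlib import Path


def save_dois(dois, path):
    """Write one DOI per line and return the number of DOIs written."""
    path = Path(path)
    path.write_text("".join(f"{doi}\n" for doi in dois), encoding="utf-8")
    return len(dois)


# Illustrative sample; in the notebook this would be `vispd_good_dois`.
sample_good_dois = ["10.1109/TVCG.2011.185", "10.1109/VIS.1999.10000"]
written = save_dois(sample_good_dois, "vispd_good_dois.txt")
print(f"Wrote {written} DOIs.")
```

Reading the file back with `Path(...).read_text().splitlines()` recovers the same list, which makes it easy to confirm that the count matches `len(vispd_good_dois)`.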

## Checking title duplicates

I checked whether there are duplicate papers.

In [47]:
vispd_titles = vispd.Title.tolist()

In [48]:
len(vispd_titles)

3073

In [49]:
# it turns out there are no duplicate papers. 
vispd_titles_duplicates = list(set([x for x in vispd_titles if vispd_titles.count(x) > 1]))
vispd_titles_duplicates

[]