# Analyzing the validity of paper DOIs of 2021 papers

**REMEMBER THAT THERE ARE AT LEAST TWO PAPERS THAT CANNOT BE ACCESSED ON IEEE!**

Vispubdata:
  - [paper](https://ieeexplore.ieee.org/abstract/document/7583708)
  - [dataset website](https://sites.google.com/site/vispubdata/home)
  - [raw data in Google Spreadsheet](https://docs.google.com/spreadsheets/d/1xgoOPu28dQSSGPIp_HHQs0uvvcyLNdkMF9XtRajhhxU/edit#gid=939249534)
  
I've downloaded the raw data of VisPubData into a csv file: `vispubdata`, located in `Data/Raw/Vispubdata.csv`.

## Log:

- 2022-01-09: I created this notebook.

- 2022-01-15: I updated this notebook. I added '10.1109/VIS.1999.10000' to good_urls. For duplicates, I deleted DOIs that are not accessible on OpenAlex; specifically, I deleted two DOIs. The final output is a `txt` file containing 3390 DOIs. I renamed the txt file to `vispd_good_dois.txt`.

- 2022-01-29: I removed papers with type of 'M' and redid the analysis. 

- 2022-02-02: I identified DOIs for 2021 papers and am analyzing them here. 

In [1]:
# load packages
import pandas as pd
import numpy as np
import re
from functools import reduce
import random

In [2]:
# import raw data
dois_2021 = pd.read_csv('../../data/raw/dois_2021.csv')
dois_2021

Unnamed: 0,Conference,Year,Title,DOI
0,VIS,2021,Simultaneous Matrix Orderings for Graph Collec...,10.1109/tvcg.2021.3114773
1,VIS,2021,IRVINE: A Design Study on Analyzing Correlatio...,10.1109/tvcg.2021.3114797
2,VIS,2021,Perception! Immersion! Empowerment! Superpower...,10.1109/tvcg.2021.3114844
3,VIS,2021,Feature Curves and Surfaces of 3D Asymmetric T...,10.1109/tvcg.2021.3114808
4,VIS,2021,AffectiveTDA: Using Topological Data Analysis ...,10.1109/tvcg.2021.3114784
...,...,...,...,...
165,VIS,2021,Probabilistic Data-Driven Sampling via Multi-C...,10.1109/TVCG.2020.3006426
166,VIS,2021,Visual Evaluation for Autonomous Driving,10.1109/TVCG.2021.3114777
167,VIS,2021,Visual Cascade Analytics of Large-scale Spatio...,10.1109/TVCG.2021.3071387
168,VIS,2021,Understanding Missing Links in Bipartite Netwo...,10.1109/TVCG.2020.3032984


## Extract the DOI

In [3]:
dois = dois_2021.loc[:, "DOI"].tolist()
random.sample(dois, 10)

['10.1109/MCG.2020.3024146',
 '10.1109/tvcg.2021.3114880',
 '10.1109/tvcg.2021.3114796',
 '10.1109/tvcg.2021.3114850',
 '10.1109/tvcg.2021.3114854',
 '10.1109/TVCG.2020.3006426',
 '10.1109/tvcg.2021.3114870',
 '10.1109/tvcg.2021.3114790',
 '10.1109/TVCG.2021.3114766',
 '10.1109/tvcg.2021.3114863']

### Identifying invalid DOI 

I know that there are several invalid paper DOIs. I want to find out which ones they are and where they occur. Most DOIs start with `10.1109`, the DOI prefix registered to IEEE. A DOI that does not start with this prefix must therefore come from a different publisher, if it is not invalid. 

To find out whether every DOI starts with `10.1109`, I first extract the string before the first `/` in each DOI using a regular expression and collect the results in a list, and then find the unique elements of that list. 

In [4]:
# first_num is the DOI prefix, i.e. everything before the first `/` in each doi.
# A raw string avoids the invalid-escape warning that '\/' triggers in a normal string.
first_num_list = [re.sub(r'/.*', '', i) for i in dois]
# credit of the above code goes to: https://stackoverflow.com/a/4419021
first_num_list[1]

'10.1109'

In [5]:
# Find out unique strings in first_num_list
# Method 1: `dict.fromkeys()`
dict.fromkeys(first_num_list)

{'10.1109': None}

In [6]:
unique = list(dict.fromkeys(first_num_list))
unique

['10.1109']

In [7]:
# Method 2: set
unique = list(set(first_num_list))
unique

['10.1109']

I now know there is only one unique prefix.
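
The same prefix check can be written without a regular expression. The sketch below is only an illustration, assuming every DOI has the form `prefix/suffix`; `split_doi` is a hypothetical helper that is not part of this notebook, and the sample DOIs are copied from the output above.

```python
def split_doi(doi):
    """Split a DOI into (prefix, suffix) at the first '/'."""
    prefix, sep, suffix = doi.partition("/")
    # Reject strings with no '/', an empty prefix, or an empty suffix.
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix DOI: {doi!r}")
    return prefix, suffix


# Sample DOIs taken from the random sample printed earlier.
sample_dois = [
    "10.1109/MCG.2020.3024146",
    "10.1109/tvcg.2021.3114880",
    "10.1109/TVCG.2020.3006426",
]

# Collect the distinct prefixes; a set removes repeats.
prefixes = sorted({split_doi(d)[0] for d in sample_dois})
print(prefixes)
```

Unlike the `re.sub` version, this raises an error for a string that has no `/` at all, so a malformed entry is flagged instead of being passed through unchanged.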

## Analyzing Journal Code

By "Journal Code", I mean strings like 'TVCG', 'VAST', 'VISUAL', 'SciVis', or 'INFVIS' that follow '10.1109/'. 

In [8]:
# I first strip the leading '10.1109/' or '10.0000/' off each doi
# (the dots are escaped so that they match a literal '.')
doi_main = [re.sub(r'^10\.(?:1109|0000)/', '', i) for i in dois]
random.sample(doi_main, 10)
# doi_main has one entry per DOI in dois

['tvcg.2021.3114864',
 'tvcg.2021.3114780',
 'TVCG.2021.3114777',
 'tvcg.2021.3074010',
 'TVCG.2021.3071387',
 'tvcg.2021.3114845',
 'TVCG.2020.3032984',
 'tvcg.2021.3114770',
 'tvcg.2021.3114808',
 'tvcg.2021.3114795']

In [9]:
# Then strip off everything from the first dot onward, leaving only the journal code
journal_code_list = [re.sub(r'\..*', '', i) for i in doi_main] 
random.sample(journal_code_list, 10)

['tvcg',
 'tvcg',
 'TVCG',
 'tvcg',
 'tvcg',
 'TVCG',
 'TVCG',
 'tvcg',
 'tvcg',
 'tvcg']

In [10]:
# getting the list of unique journal code
journal_code_unique = list(set(journal_code_list))
journal_code_unique

['MCG', 'mcg', 'TVCG', 'tvcg']

In [11]:
# initialize a list of dictionaries holding each journal code and its count
journal_code_dict_list = []

In [12]:
journal_code_df = pd.DataFrame(columns = ["Journal Code", "Count"])
journal_code_df

Unnamed: 0,Journal Code,Count


In [13]:
# for each unique journal code, get the name and the count, format it as a dict and add this dict to 
# journal_code_dict_list
for i in journal_code_unique: 
    journal_code_dict = {'Journal Code': i, 'Count': journal_code_list.count(i)} 
    journal_code_dict_list.append(journal_code_dict)

In [14]:
journal_code_dict_list

[{'Journal Code': 'MCG', 'Count': 1},
 {'Journal Code': 'mcg', 'Count': 10},
 {'Journal Code': 'TVCG', 'Count': 26},
 {'Journal Code': 'tvcg', 'Count': 133}]

In [15]:
# build the table in one step; DataFrame.append was removed in pandas 2.0
journal_code_df = pd.DataFrame(journal_code_dict_list, columns=["Journal Code", "Count"])

In [16]:
journal_code_df

Unnamed: 0,Journal Code,Count
0,MCG,1
1,mcg,10
2,TVCG,26
3,tvcg,133
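
DOI names are case-insensitive, so `tvcg` and `TVCG` (and `mcg` and `MCG`) refer to the same journal code. A minimal sketch of counting codes after normalizing case is shown below; `journal_code` is a hypothetical helper, and the sample DOIs are a small subset copied from the outputs above, not the full list.

```python
from collections import Counter


def journal_code(doi):
    """Return the upper-cased journal code, e.g. 'TVCG' for '10.1109/tvcg.2021.3114773'."""
    suffix = doi.partition("/")[2]          # text after the first '/'
    return suffix.partition(".")[0].upper()  # text before the first '.', upper-cased


# Small illustrative subset of the DOIs shown in this notebook.
sample_dois = [
    "10.1109/MCG.2020.3024146",
    "10.1109/mcg.2020.3025504",
    "10.1109/tvcg.2021.3114773",
    "10.1109/TVCG.2020.3006426",
    "10.1109/TVCG.2021.3114777",
]

code_counts = Counter(journal_code(d) for d in sample_dois)
print(code_counts.most_common())
```

Applied to the counts in the table above, merging by case would give 11 DOIs for MCG (1 + 10) and 159 for TVCG (26 + 133).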


In [17]:
# str.contains is case-sensitive by default, so the one uppercase 'MCG' DOI is not listed here
dois_2021[dois_2021.DOI.str.contains('mcg')]

Unnamed: 0,Conference,Year,Title,DOI
41,VIS,2021,Data Badges: Making an Academic Profile throug...,10.1109/mcg.2020.3025504
42,VIS,2021,Move&Find: The value of kinesthetic experience...,10.1109/mcg.2020.3025385
43,VIS,2021,Slave Voyages:reflections on data sculptures,10.1109/mcg.2020.3025183
44,VIS,2021,Narrative Physicalisation: Supporting Interact...,10.1109/mcg.2020.3025078
45,VIS,2021,Data Clothing and BigBarChart: designing physi...,10.1109/mcg.2020.3025322
64,VIS,2021,Dynamic 3D Visualization of Climate Model Deve...,10.1109/mcg.2020.3042587
65,VIS,2021,Exploring the Design Space of Sankey Diagrams ...,10.1109/mcg.2019.2927556
66,VIS,2021,QuteVis: Visually Studying Transportation Patt...,10.1109/mcg.2019.2911230
67,VIS,2021,Many Views Are Not Enough: Designing for Synop...,10.1109/mcg.2020.2985368
68,VIS,2021,CLEVis: A Semantic Driven Visual Analytics Sys...,10.1109/mcg.2020.2973939


## Checking title duplicates

I checked whether any paper title appears more than once.

In [18]:
titles_2021 = dois_2021.Title.tolist()

In [19]:
len(titles_2021)

170

In [20]:
# it turns out there are no duplicate papers. 
titles_2021_duplicates = list(set([x for x in titles_2021 if titles_2021.count(x) > 1]))
titles_2021_duplicates

[]
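
The check above compares titles as exact strings, so two entries that differ only in letter case or spacing would not be reported. The sketch below shows one possible stricter check, assuming that such differences should be ignored; `normalize_title` is a hypothetical helper, and the second title is an invented variant added only to demonstrate a match.

```python
from collections import Counter


def normalize_title(title):
    """Lower-case the title and collapse runs of whitespace to single spaces."""
    return " ".join(title.casefold().split())


titles = [
    "Visual Evaluation for Autonomous Driving",
    "visual  evaluation for autonomous driving ",  # hypothetical variant, not in the data
    "Slave Voyages:reflections on data sculptures",
]

# Count each normalized title once per occurrence, then keep those seen more than once.
title_counts = Counter(normalize_title(t) for t in titles)
duplicates = sorted(t for t, n in title_counts.items() if n > 1)
print(duplicates)
```

Counting with `Counter` also makes a single pass over the list, whereas calling `list.count` inside a comprehension rescans the list for every title.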