## Aim

In this notebook, I aim to see:

- How many papers in `data/interim/title_query_empty_doi_query_404.txt` can be identified if I modify the queried title slightly
- How many papers that are successfully identified (those in `data/interim/vispd_openalex_match_1.csv`) do not have the same title and are not the same paper. 

## Conclusion

- 11 papers have no results from either the title search or the DOI search. Among those, 4 can be identified by title search if I modify the title query string slightly.

- For papers that are successfully matched via title search, 56 papers have an OpenAlex title AND an OpenAlex DOI that are not exactly the same as the information on VISPUBDATA. I manually checked these 56 papers. Among them, 17 are indeed different papers. Among these 17, only two, namely HyperLIC and Escape Maps, are not searchable. The remaining 15 can be found at a different index in the title query results, through a simple DOI search, or by slightly changing the queried title (removing the "#" character from the title).

- The conclusion is that only (11-4) + 2 = 9 papers cannot be identified on OpenAlex.

In [1]:
import requests
import pandas as pd
import csv
import re
pd.options.display.max_colwidth = 200
pd.set_option('display.max_rows', 500)
import numpy as np

In [2]:
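def read_txt(INPUT):
    """Read a txt file and return the first field of each line as a list."""
    # use a context manager so the file handle is closed after reading
    with open(INPUT, "r", newline="") as raw:
        reader = csv.reader(raw)
        # skip blank lines, which csv.reader yields as empty rows
        return [row[0] for row in reader if row]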

In [3]:
no_matching = read_txt("../../data/interim/checking/title_query_404_1.txt")

In [4]:
no_result = read_txt("../../data/interim/checking/title_query_empty_doi_query_404_1.txt")

In [5]:
# create dicts to convert doi and title, and vice versa
vispd_plus = pd.read_csv("../../data/processed/vispubdata_plus.csv")
dois = vispd_plus.loc[:, "DOI"].tolist()
titles = vispd_plus.loc[:, "Title"].tolist()
doi_title_dict = dict(zip(dois, titles))
title_doi_dict = dict(zip(titles, dois))

In [6]:
# checking no matching.
no_matching_titles = [doi_title_dict[doi] for doi in no_matching]

In [7]:
no_matching_titles_transformed = [re.sub(r'\:|\?|\&|\,', '', title) for title in no_matching_titles]

In [8]:
# empty, which means that every paper can be matched.
no_matching_titles

[]

In [9]:
no_matching_titles_transformed

[]

In [10]:
## checking no result titles
## By `no_result`, I mean that for these papers, title queries were successful
## However, the query shows empty results. For example:
## https://api.openalex.org/works?filter=title.search:Generation of Transfer Functions with Stochastic Search Technique
no_result_titles = [doi_title_dict[doi] for doi in no_result]
no_result_titles
no_result_titles_transformed = [re.sub(r'\:|\?|\&|\,', '', title) for title in no_result_titles]
no_result_dois = [title_doi_dict[title] for title in no_result_titles]

In [11]:
no_result_titles_transformed

['Generation of Transfer Functions with Stochastic Search Technique',
 'Automatic alignment of high-resolution multi-projector displays using an uncalibrated camera',
 'Integrated control of distributed volume visualization through the World-Wide-Web',
 'Color change and control of quantitative data display',
 'MixMatch a construction kit for visualization',
 'Visualizing causal effects in 4D space-time vector fields',
 'Visualization for nonlinear engineering FEM analysis in manufacturing',
 'Visualisation tools for semiconductor modelling software',
 'Visualization of neutron scattering data using AVS',
 "Fast analytical computation of Richard's smooth molecular surface",
 'SSR-TVD Spatial Super-Resolution for Time-Varying Data Analysis and Visualization']
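
The query URL shown in the comment above can be built programmatically. The following is a minimal sketch, not the code used in this notebook: the helper name `build_title_search_url` is hypothetical, and percent-encoding the title with `urllib.parse.quote` is my assumption about how the string should be made URL-safe.

```python
from urllib.parse import quote

# Base endpoint taken from the example URL in the notebook comment
OPENALEX_WORKS = "https://api.openalex.org/works"


def build_title_search_url(title: str) -> str:
    """Return a title.search filter URL with the title percent-encoded.

    Hypothetical helper; the encoding step is an assumption.
    """
    return f"{OPENALEX_WORKS}?filter=title.search:{quote(title, safe='')}"


url = build_title_search_url(
    "Generation of Transfer Functions with Stochastic Search Technique"
)
print(url)
```

Building the URL separately from sending the request makes it easy to inspect exactly which string was queried when a title returns no results.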

In [12]:
# build a df to show no result doi and title side by side, this is clearer
no_result_df = pd.DataFrame(columns = ['DOI', 'Title'])

In [13]:
no_result_df['DOI'] = no_result_dois
no_result_df['Title'] = no_result_titles

In [14]:
no_result_df

Unnamed: 0,DOI,Title
0,10.1109/VISUAL.1996.568113,Generation of Transfer Functions with Stochastic Search Technique
1,10.1109/VISUAL.2000.885685,Automatic alignment of high-resolution multi-projector displays using an uncalibrated camera
2,10.1109/VISUAL.1994.346342,Integrated control of distributed volume visualization through the World-Wide-Web
3,10.1109/VISUAL.1992.235201,"Color, change, and control of quantitative data display"
4,10.1109/VISUAL.1994.346305,Mix&Match: a construction kit for visualization
5,10.1109/VISUAL.1991.175770,Visualizing causal effects in 4D space-time vector fields
6,10.1109/VISUAL.1990.146412,Visualization for nonlinear engineering FEM analysis in manufacturing
7,10.1109/VISUAL.1991.175830,Visualisation tools for semiconductor modelling software
8,10.1109/VISUAL.1992.235180,Visualization of neutron scattering data using AVS
9,10.1109/VISUAL.1993.398882,Fast analytical computation of Richard's smooth molecular surface


### Among the 11 papers that cannot be identified by title and DOI search, four can be identified with manual adjustment

#### Title is wrong:

Generation of Transfer Functions with Stochastic Search Technique -> Generation of transfer functions with stochastic search techniques, 10.1109/VISUAL.1996.568113

#### Title difficult to search

<!-- - Interactive data analysis with nSpace2® -> Interactive data analysis with nSpace2
 -->
- Automatic alignment of high-resolution multi-projector displays using an uncalibrated camera -> Automatic alignment of high-resolution multi-projector display using an un-calibrated camera (https://openalex.org/W2170549385)
- Mix&Match: a construction kit for visualization. The query succeeds if I replace `&` with `' '` rather than `''`.
<!-- -  PhylloTrees: Harnessing Nature's Phyllotactic Patterns for Tree Layout -> PhylloTrees: Harnessing Nature Phyllotactic Patterns for Tree Layout (*This is a little bit special. The title after -> is not that is shown on OpenAlex, but the query string that can get me to the target paper.*) -->
- Fast analytical computation of Richard's smooth molecular surface -> Fast analytical computation of Richards's smooth molecular surface
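
The `&` case above can be illustrated with a small sketch. The function `strip_punctuation` is a hypothetical helper; it removes the same four characters as the regex used earlier in this notebook, and the whitespace-collapsing step is my addition, not something the notebook does.

```python
import re

# Same characters the notebook strips from titles: colon, question mark, ampersand, comma
PUNCTUATION = re.compile(r"[:?&,]")


def strip_punctuation(title, replacement=""):
    """Replace punctuation with `replacement`, then collapse repeated spaces."""
    cleaned = PUNCTUATION.sub(replacement, title)
    # collapsing whitespace is an assumption added for this sketch
    return re.sub(r"\s+", " ", cleaned).strip()


title = "Mix&Match: a construction kit for visualization"
joined = strip_punctuation(title)       # '&' removed, words fused
spaced = strip_punctuation(title, " ")  # '&' becomes a space
print(joined)
print(spaced)
```

With an empty replacement the two words fuse into `MixMatch`, which is the form that returned no results; with a space they stay separate tokens.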


## Checking `vispd_openalex_match.csv`

In [15]:
df = pd.read_csv("../../data/interim/vispd_openalex_match_1.csv")
# df.Title != df['OpenAlex Title']) & 

In [16]:
df.DOI[1]
df['OpenAlex DOI'][1]
df

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue,OpenAlex Journal,OpenAlex Publisher,OpenAlex First Page,OpenAlex Last Page
0,2011,10.1109/TVCG.2011.185,D³ Data-Driven Documents,2011.0,https://openalex.org/W2135415614,D³ Data-Driven Documents,https://doi.org/10.1109/tvcg.2011.185,https://doi.org/10.1109/tvcg.2011.185,https://openalex.org/V84775595,IEEE Transactions on Visualization and Computer Graphics,Institute of Electrical and Electronics Engineers,2301.0,2309.0
1,1991,10.1109/VISUAL.1991.175815,Tree-maps: a space-filling approach to the visualization of hierarchical information structures,1991.0,https://openalex.org/W2146872957,Tree-maps: a space-filling approach to the visualization of hierarchical information structures,https://doi.org/10.5555/949607.949654,https://ieeexplore.ieee.org/document/175815/,,ieee visualization,IEEE Computer Society Press,284.0,291.0
2,1990,10.1109/VISUAL.1990.146402,Parallel coordinates: a tool for visualizing multi-dimensional geometry,1990.0,https://openalex.org/W2034694694,Parallel coordinates: a tool for visualizing multi-dimensional geometry,https://doi.org/10.5555/949531.949588,https://link.springer.com/chapter/10.1007/978-4-431-68057-4_3,,ieee visualization,IEEE Computer Society Press,361.0,378.0
3,2006,10.1109/TVCG.2006.147,Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data,2006.0,https://openalex.org/W2145640629,Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data,https://doi.org/10.1109/tvcg.2006.147,https://doi.org/10.1109/tvcg.2006.147,https://openalex.org/V84775595,IEEE Transactions on Visualization and Computer Graphics,Institute of Electrical and Electronics Engineers,741.0,748.0
4,1997,10.1109/VISUAL.1997.663860,ROAMing terrain: Real-time Optimally Adapting Meshes,1997.0,https://openalex.org/W2532506824,ROAMing terrain: real-time optimally adapting meshes,https://doi.org/10.5555/266989.267028,https://ieeexplore.ieee.org/document/663860/,,ieee visualization,IEEE Computer Society Press,81.0,88.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3237,2021,10.1109/TVCG.2021.3084694,Interactive Graph Construction for Graph-Based Semi-Supervised Learning,2021.0,https://openalex.org/W3164124914,Interactive Graph Construction for Graph-Based Semi-Supervised Learning,https://doi.org/10.1109/tvcg.2021.3084694,https://doi.org/10.1109/tvcg.2021.3084694,https://openalex.org/V84775595,IEEE Transactions on Visualization and Computer Graphics,Institute of Electrical and Electronics Engineers,3701.0,3716.0
3238,2021,10.1109/TVCG.2021.3114777,Visual Evaluation for Autonomous Driving,2021.0,https://openalex.org/W3211221870,Visual Evaluation for Autonomous Driving,https://doi.org/10.1109/tvcg.2021.3114777,https://ieeexplore.ieee.org/document/9597616/media,https://openalex.org/V84775595,IEEE Transactions on Visualization and Computer Graphics,Institute of Electrical and Electronics Engineers,1.0,1.0
3239,2021,10.1109/TVCG.2021.3071387,Visual Cascade Analytics of Large-scale Spatiotemporal Data,2021.0,https://openalex.org/W3152147466,Visual Cascade Analytics of Large-scale Spatiotemporal Data,https://doi.org/10.1109/tvcg.2021.3071387,https://doi.org/10.1109/tvcg.2021.3071387,https://openalex.org/V84775595,IEEE Transactions on Visualization and Computer Graphics,Institute of Electrical and Electronics Engineers,1.0,1.0
3240,2021,10.1109/TVCG.2020.3032984,Understanding Missing Links in Bipartite Networks with MissBiN,2020.0,https://openalex.org/W3094173950,Understanding Missing Links in Bipartite Networks with MissBiN.,https://doi.org/10.1109/tvcg.2020.3032984,https://doi.org/10.1109/tvcg.2020.3032984,https://openalex.org/V84775595,IEEE Transactions on Visualization and Computer Graphics,Institute of Electrical and Electronics Engineers,1.0,1.0


## Looking into df, to prepare for string replacing

I wanted to see which parts of the titles I need to modify. The goal is to replace only part of a title rather than the whole title (no special reason, it just makes the script look cleaner). To do that, I need to make sure that only one paper contains the specific substring I want to replace.

In [17]:
df[df.Title.str.contains("Richard's")]

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue,OpenAlex Journal,OpenAlex Publisher,OpenAlex First Page,OpenAlex Last Page
2875,1993,10.1109/VISUAL.1993.398882,Fast analytical computation of Richard's smooth molecular surface,,,,,,,,,,


In [18]:
df[df.Title.str.contains('Stochastic Search Technique')]

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue,OpenAlex Journal,OpenAlex Publisher,OpenAlex First Page,OpenAlex Last Page
93,1996,10.1109/VISUAL.1996.568113,Generation of Transfer Functions with Stochastic Search Technique,,,,,,,,,,


In [19]:
df[df.Title.str.contains('displays using an uncalibrated camera')]

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue,OpenAlex Journal,OpenAlex Publisher,OpenAlex First Page,OpenAlex Last Page
262,2000,10.1109/VISUAL.2000.885685,Automatic alignment of high-resolution multi-projector displays using an uncalibrated camera,,,,,,,,,,
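
The uniqueness check done by eye in the cells above could also be automated. This is a minimal sketch in plain Python on a small sample of titles; `titles_containing` and `is_unique_fragment` are hypothetical names, and the matching is a literal, case-sensitive substring test.

```python
def titles_containing(titles, fragment):
    """Return every title that contains the literal substring `fragment`."""
    return [t for t in titles if fragment in t]


def is_unique_fragment(titles, fragment):
    """True when exactly one title contains `fragment`, so a partial replace is safe."""
    return len(titles_containing(titles, fragment)) == 1


sample_titles = [
    "Fast analytical computation of Richard's smooth molecular surface",
    "Generation of Transfer Functions with Stochastic Search Technique",
    "Visualization of neutron scattering data using AVS",
]

print(is_unique_fragment(sample_titles, "Richard's"))  # one title matches
print(is_unique_fragment(sample_titles, "of"))         # all three titles match
```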


### Replace all "'" or not

I am thinking of two methods:

- `.replace("Nature's", "Nature’s").replace("Richard's", "Richards's")` to replace only these two instances

- `.replace("'", "’")` to replace all

First of all, I need to find out how many titles contain the character "'":

In [20]:
titles_to_check = df[df.Title.str.contains("'")].Title.tolist()
len(titles_to_check)

32
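
To compare the two methods, here is a minimal sketch. The function names are hypothetical, and the title "A user's guide" is an invented example used only to show the side effect of the blanket replacement.

```python
# Targeted replacements, copied from the first method described above
TARGETED = {"Nature's": "Nature’s", "Richard's": "Richards's"}


def replace_targeted(title):
    """Apply only the two known fixes."""
    for old, new in TARGETED.items():
        title = title.replace(old, new)
    return title


def replace_all(title):
    """Swap every straight apostrophe for a typographic one."""
    return title.replace("'", "’")


richards = "Fast analytical computation of Richard's smooth molecular surface"
other = "A user's guide"  # hypothetical title, not from the dataset

print(replace_targeted(richards))
print(replace_all(richards))
print(replace_targeted(other), "|", replace_all(other))
```

The blanket replacement changes every title with an apostrophe, and it does not turn "Richard's" into "Richards's", so that title would still need its own fix.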

## Checking the results where title queries were successful

In [21]:
# checking how many papers do not have openalex id; that is, how many failed. 
# I want to see whether the number matches the total of no_result and no_matching

df[df['OpenAlex ID'].isnull()].shape[0]

# Yes, it is correct

11

In [22]:
# first, filter out those without openalex id
no_nan_df = df.dropna(subset=['OpenAlex ID'])
# create a new column based on DOI; the purpose is to compare it with OpenAlex DOI
no_nan_df = no_nan_df.assign(DOI_URL='https://doi.org/' + no_nan_df.DOI)
# then, show rows where both title and doi do not match
no_nan_df[(no_nan_df.Title.str.lower() != no_nan_df['OpenAlex Title'].str.lower()) & (
    no_nan_df['OpenAlex DOI'].str.lower() != no_nan_df['DOI_URL'].str.lower()
)]
# shape: (52, 14). There are 52 papers whose title AND DOI do not match with those on OpenAlex

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue,OpenAlex Journal,OpenAlex Publisher,OpenAlex First Page,OpenAlex Last Page,DOI_URL
147,1999,10.1109/VISUAL.1999.809896,"The ""Parallel Vectors"" operator-a vector field visualization primitive",1999.0,https://openalex.org/W2095105262,The “parallel vectors” operator: a vector field visualization primitive,https://doi.org/10.5555/319351.319420,https://dblp.uni-trier.de/db/conf/visualization/visualization1999.html#PeikertR99,,ieee visualization,IEEE Computer Society Press,263.0,270.0,https://doi.org/10.1109/VISUAL.1999.809896
158,1991,10.1109/VISUAL.1991.175771,The virtual windtunnel: An environment for the exploration of three-dimensional unsteady flows,1991.0,https://openalex.org/W2114975239,The virtual windtunnel-an environment for the exploration of three-dimensional unsteady flows,https://doi.org/10.5555/949607.949612,http://ieeexplore.ieee.org/document/175771/,,ieee visualization,IEEE Computer Society Press,17.0,24.0,https://doi.org/10.1109/VISUAL.1991.175771
175,1998,10.1109/VISUAL.1998.745302,TOPIC ISLANDS TM - a wavelet-based text visualization system,1998.0,https://openalex.org/W2136452290,TOPIC ISLANDS/sup TM/-a wavelet-based text visualization system,https://doi.org/10.5555/288216.288247,https://doi.org/10.1109/VISUAL.1998.745302,,ieee visualization,IEEE Computer Society Press,189.0,196.0,https://doi.org/10.1109/VISUAL.1998.745302
195,1993,10.1109/VISUAL.1993.398868,Geometric optimization,1988.0,https://openalex.org/W2100440346,Geometric Algorithms and Combinatorial Optimization,,https://ci.nii.ac.jp/ncid/BA03567922,,,,,,https://doi.org/10.1109/VISUAL.1993.398868
261,1993,10.1109/VISUAL.1993.398859,HyperSlice - Visualization of Scalar Functions of Many Variables,1993.0,https://openalex.org/W2103111128,HyperSlice: visualization of scalar functions of many variables,https://doi.org/10.5555/949845.949871,https://doi.org/10.1109/VISUAL.1993.398859,,ieee visualization,IEEE Computer Society,119.0,125.0,https://doi.org/10.1109/VISUAL.1993.398859
350,1991,10.1109/VISUAL.1991.175795,Color icons: merging color and texture perception for integrated visualization of multiple parameters,1991.0,https://openalex.org/W1991928089,Color icons-merging color and texture perception for integrated visualization of multiple parameters,https://doi.org/10.5555/949607.949634,https://dblp.uni-trier.de/db/conf/visualization/visualization1991.html#Levkowitz91,,ieee visualization,IEEE Computer Society Press,164.0,170.0,https://doi.org/10.1109/VISUAL.1991.175795
354,2003,10.1109/VISUAL.2003.1250401,Video visualization,1997.0,https://openalex.org/W2104113200,Video visualization for compact presentation and fast browsing of pictorial content,https://doi.org/10.1109/76.633496,https://doi.org/10.1109/76.633496,https://openalex.org/V115173108,IEEE Transactions on Circuits and Systems for Video Technology,Institute of Electrical and Electronics Engineers,771.0,785.0,https://doi.org/10.1109/VISUAL.2003.1250401
389,1991,10.1109/VISUAL.1991.175789,The stream polygon: A technique for 3D vector field visualization,1991.0,https://openalex.org/W2077014479,The stream polygon-a technique for 3D vector field visualization,https://doi.org/10.5555/949607.949628,https://dl.acm.org/doi/10.5555/949607.949628,,ieee visualization,IEEE Computer Society Press,126.0,132.0,https://doi.org/10.1109/VISUAL.1991.175789
414,2000,10.1109/VISUAL.2000.885739,WEAVE: a system for visually linking 3-D and statistical visualizations applied to cardiac simulation and measurement data,2000.0,https://openalex.org/W2089875636,"WEAVE: a system for visually linking 3-D and statistical visualizations, applied to cardiac simulation and measurement data",https://doi.org/10.5555/375213.375311,https://dblp.uni-trier.de/db/conf/visualization/visualization2000.html#GreshRWSY00,,ieee visualization,IEEE Computer Society Press,489.0,492.0,https://doi.org/10.1109/VISUAL.2000.885739
523,1999,10.1109/VISUAL.1999.809871,Image graphs-a novel approach to visual data exploration,1999.0,https://openalex.org/W2018246367,Image graphs—a novel approach to visual data exploration,https://doi.org/10.5555/319351.319360,http://dblp.uni-trier.de/db/conf/visualization/visualization1999.html#Ma99,,ieee visualization,IEEE Computer Society Press,81.0,88.0,https://doi.org/10.1109/VISUAL.1999.809871
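
The filter above keeps a row only when the title AND the DOI both differ, ignoring case. A minimal plain-Python sketch of the same logic, on three rows copied from the outputs in this notebook, is shown below; `doi_to_url` and `both_differ` are hypothetical names, and treating a missing OpenAlex DOI as an empty string is my assumption.

```python
def doi_to_url(doi):
    """Prefix a bare DOI so it is comparable with the OpenAlex DOI field."""
    return "https://doi.org/" + doi


def both_differ(row):
    """True when title and DOI both differ, compared case-insensitively."""
    title_differs = row["Title"].lower() != row["OpenAlex Title"].lower()
    # a missing OpenAlex DOI is treated as "" (assumption for this sketch)
    openalex_doi = (row["OpenAlex DOI"] or "").lower()
    doi_differs = doi_to_url(row["DOI"]).lower() != openalex_doi
    return title_differs and doi_differs


rows = [
    {  # same title, same DOI up to case
        "DOI": "10.1109/TVCG.2006.147",
        "Title": "Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data",
        "OpenAlex Title": "Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data",
        "OpenAlex DOI": "https://doi.org/10.1109/tvcg.2006.147",
    },
    {  # same title up to case, different DOI
        "DOI": "10.1109/VISUAL.1997.663860",
        "Title": "ROAMing terrain: Real-time Optimally Adapting Meshes",
        "OpenAlex Title": "ROAMing terrain: real-time optimally adapting meshes",
        "OpenAlex DOI": "https://doi.org/10.5555/266989.267028",
    },
    {  # different title and different DOI
        "DOI": "10.1109/VISUAL.2003.1250401",
        "Title": "Video visualization",
        "OpenAlex Title": "Video visualization for compact presentation and fast browsing of pictorial content",
        "OpenAlex DOI": "https://doi.org/10.1109/76.633496",
    },
]

flagged = [row["DOI"] for row in rows if both_differ(row)]
print(flagged)
```

Only the third row is flagged. The second row shows why the condition uses AND: older papers often carry a different DOI on OpenAlex while the title still matches.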


In [43]:
# I manually checked the above results, and found that 17 papers are wrong matches. Among these 17,
# only two cannot be identified on OpenAlex; the remaining 15 can be identified via
# a different index in title query or simply using doi query
wrong_match_dois = [
    '10.1109/VISUAL.1993.398868', # Geometric optimization, INDEX: [5]
    '10.1109/VISUAL.2003.1250401', # video visualization, DOI searchable
#     '10.1109/TVCG.2014.2346922', # #fluxflow, fixable by deleting #
    '10.1109/VISUAL.1996.567807', # Volume tracking, INDEX: [6]
    '10.1109/VISUAL.1998.745315', # simplification of tetrahedral meshes, DOI searchable
    '10.1109/VISUAL.2001.964504', # connectivity shapes, INDEX: [2]
    '10.1109/VISUAL.2005.1532812', # High dynamic range volume visualization, INDEX: [1]
    '10.1109/INFVIS.2001.963282', #graph sketches, INDEX: [1]
    '10.1109/VISUAL.1995.480804', # space walking, DOI searchable
    '10.1109/VISUAL.1992.235194', # volume warping, INDEX: [3]
    '10.1109/VISUAL.1993.398866', # performance visualization of parallel programs, INDEX: [3]
    '10.1109/VISUAL.1992.235181', # Visualizing the Universe, INDEX: [1]
    '10.1109/VISUAL.1992.235195', # Network video device control, INDEX: [1]
    '10.1109/VISUAL.2000.885719', # Polyhedral modeling, DOI searchable
    '10.1109/VISUAL.2003.1250379', #hyperlic
    '10.1109/VISUAL.2003.1250404', # Hierarchical splatting of scattered data, doi searchable
    '10.1109/TVCG.2014.2346442', #escape maps
#     '10.1109/TVCG.2021.3114849': #interactive data comics, doi searchable
    '10.1109/TVCG.2021.3114842', #Visualization Equilibrium, doi searchable
]

Another point worth mentioning is that "#FluxFlow: Visual Analysis of Anomalous Information Spreading on Social Media" can be easily searched on OpenAlex if I remove the "#" at the beginning.

56 papers have different title AND DOI. Among these, 17 are indeed different. Among these 17, only two, namely, HyperLIC and Escape Maps are not searchable. The remaining 15 can be found by a different index in title query results, or simply through DOI search. 

In [31]:
to_query_by_doi = [
        '10.1109/VISUAL.2003.1250401',
        '10.1109/VISUAL.1998.745315',
        '10.1109/VISUAL.1995.480804',
        '10.1109/VISUAL.2000.885719',
        '10.1109/TVCG.2021.3114842',
        '10.1109/TVCG.2021.3114849',
        '10.1109/VISUAL.2003.1250404',
    ]

In [32]:
special_result_index_dict = {
	    '10.1109/VISUAL.1993.398868': 5,
	    '10.1109/VISUAL.1996.567807': 6,
	    '10.1109/VISUAL.2001.964504': 2,
	    '10.1109/VISUAL.2005.1532812': 1,
	    '10.1109/INFVIS.2001.963282': 1,
	    '10.1109/VISUAL.1992.235194': 3,
	    '10.1109/VISUAL.1993.398866': 3,
	    '10.1109/VISUAL.1992.235181': 1,
	    '10.1109/VISUAL.1992.235195': 1,
	}

In [35]:
list(special_result_index_dict.keys())

['10.1109/VISUAL.1993.398868',
 '10.1109/VISUAL.1996.567807',
 '10.1109/VISUAL.2001.964504',
 '10.1109/VISUAL.2005.1532812',
 '10.1109/INFVIS.2001.963282',
 '10.1109/VISUAL.1992.235194',
 '10.1109/VISUAL.1993.398866',
 '10.1109/VISUAL.1992.235181',
 '10.1109/VISUAL.1992.235195']

In [36]:
special_result_index_dict['10.1109/VISUAL.1996.567807']

6
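
One way the two lookups above might be combined when re-querying is sketched below. This is an assumption about how they could be used, not the code from the actual query script: `plan_query` is a hypothetical helper, only a subset of the DOIs is repeated here, and the default result index of 0 (the first title-search result) is also assumed.

```python
# Subsets of the structures defined above, repeated so the sketch is self-contained
TO_QUERY_BY_DOI = {
    "10.1109/VISUAL.2003.1250401",
    "10.1109/VISUAL.1995.480804",
}
SPECIAL_RESULT_INDEX = {
    "10.1109/VISUAL.1993.398868": 5,
    "10.1109/VISUAL.1996.567807": 6,
}


def plan_query(doi):
    """Return (query_type, result_index) for a paper.

    Assumption: papers outside both lookups use the first title-search result.
    """
    if doi in TO_QUERY_BY_DOI:
        return ("doi", None)
    if doi in SPECIAL_RESULT_INDEX:
        return ("title", SPECIAL_RESULT_INDEX[doi])
    return ("title", 0)


print(plan_query("10.1109/VISUAL.1995.480804"))
print(plan_query("10.1109/VISUAL.1996.567807"))
print(plan_query("10.1109/TVCG.2011.185"))
```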

### Checking '#'

Later, I will replace `'#'` with `''`. To do that, I need to check how many papers have "#" and whether they can still be identified on OpenAlex once "#" is removed. 

The answer is yes. I manually tried, and they can be identified. 

In [27]:
df[df.Title.str.contains('#')]

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue,OpenAlex Journal,OpenAlex Publisher,OpenAlex First Page,OpenAlex Last Page
518,2014,10.1109/TVCG.2014.2346922,#FluxFlow: Visual Analysis of Anomalous Information Spreading on Social Media,1959.0,https://openalex.org/W2398842109,Action of diphenesenic acid on the intestinal absorption of fats as determined in vivo in the rat,,https://www.ncbi.nlm.nih.gov/pubmed/13808286,https://openalex.org/V2753293397,Bollettino della Società italiana di biologia sperimentale,Boll Soc Ital Biol Sper,1783.0,1785.0


<!-- ## Experimenting methods to remove results for those four papers whose results are false -->