## Aim

In this notebook, I aim to see:

- How many papers are in `data/interim/title_query_empty_doi_query_404.txt` and whether they can be identified via other means, for example, slight modification of the title
- How many papers that are successfully identified (those in `data/interim/vispd_openalex_match_1.csv`) do not have the same title and are not the same paper. 

## Conclusion

- 1 paper has an empty title query result and an unsuccessful DOI query. This paper is indeed not searchable in OpenAlex.
- 72 successfully matched papers differ from their OpenAlex record in both title and DOI. Among these 72 papers, all but two can be identified via a DOI query; of the remaining two, one can be found by using a different index in the title query results, and the other indeed does not exist in OpenAlex.
- Therefore, only 2 papers cannot be found in OpenAlex.

In [1]:
import csv
import re

import numpy as np
import pandas as pd
import requests

pd.options.display.max_colwidth = 200
pd.set_option('display.max_rows', 500)

In [2]:
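def read_txt(path):
    """Read a txt file and return the first comma-separated field of each line as a list."""
    # the context manager closes the file even if parsing fails
    with open(path, "r", newline="") as raw:
        reader = csv.reader(raw)
        # skip blank lines, which csv.reader yields as empty rows
        return [row[0] for row in reader if row]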

In [3]:
no_matching = read_txt("../../data/interim/checking/title_query_404_1.txt")

In [4]:
no_result = read_txt("../../data/interim/checking/title_query_empty_doi_query_404_1.txt")

In [5]:
# create dicts to convert doi and title, and vice versa
vispd_plus = pd.read_csv("../../data/processed/vispubdata_plus.csv")
dois = vispd_plus.loc[:, "DOI"].tolist()
titles = vispd_plus.loc[:, "Title"].tolist()
doi_title_dict = dict(zip(dois, titles))
title_doi_dict = dict(zip(titles, dois))
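
One caveat with `dict(zip(...))`: if two rows share a title (or a DOI), the later row silently overwrites the earlier one. The following is a minimal sketch of how that could be checked before trusting the lookup. The helper `find_duplicates` and the toy values are hypothetical and are not taken from `vispubdata_plus.csv`.

```python
from collections import Counter

def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value in counts if counts[value] > 1]

# Hypothetical toy data standing in for the DOI and Title columns
toy_dois = ["10.0000/a", "10.0000/b", "10.0000/c"]
toy_titles = ["Title A", "Title B", "Title A"]

duplicate_titles = find_duplicates(toy_titles)
toy_title_doi = dict(zip(toy_titles, toy_dois))

print(duplicate_titles)          # titles that appear more than once
print(toy_title_doi["Title A"])  # the DOI of the last row with that title
```

If `find_duplicates(titles)` is non-empty on the real data, the title-to-DOI lookup would need a different key.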

In [6]:
# checking no matching.
no_matching_titles = [doi_title_dict[doi] for doi in no_matching]

In [7]:
no_matching_titles_transformed = [re.sub(r'[:?&,]', '', title) for title in no_matching_titles]

In [8]:
# empty, which means that every paper can be matched.
no_matching_titles

[]

In [9]:
no_matching_titles_transformed

[]
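
Removing `:`, `?`, `&` and `,` outright can fuse neighbouring words: `Mix&Match` becomes `MixMatch`. A possible alternative, sketched below, replaces those characters with a space and then collapses repeated whitespace. The name `normalize_title` is my own, and whether this variant matches more titles in OpenAlex is an assumption that would have to be checked against the API.

```python
import re

def normalize_title(title):
    """Replace : ? & , with spaces, then collapse runs of whitespace."""
    spaced = re.sub(r"[:?&,]", " ", title)
    return re.sub(r"\s+", " ", spaced).strip()

print(normalize_title("Mix&Match a construction kit for visualization"))
```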

In [10]:
## Checking the "no result" titles.
## By `no_result`, I mean papers for which the title query itself was successful,
## but the response contained no results and the follow-up DOI query failed (404). For example:
## https://api.openalex.org/works?filter=title.search:Generation of Transfer Functions with Stochastic Search Technique
no_result_titles = [doi_title_dict[doi] for doi in no_result]
no_result_titles_transformed = [re.sub(r'[:?&,]', '', title) for title in no_result_titles]
no_result_dois = [title_doi_dict[title] for title in no_result_titles]

In [11]:
no_result_titles_transformed

['Visualization for nonlinear engineering FEM analysis in manufacturing']
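
The example URL in the cell above contains raw spaces. A small sketch of building the same `title.search` filter URL with the title percent-encoded is shown here. The helper name `title_search_url` is hypothetical, and only the URL is constructed, so no request is sent.

```python
from urllib.parse import quote

OPENALEX_WORKS = "https://api.openalex.org/works"

def title_search_url(title):
    """Build a title.search filter URL with the title percent-encoded."""
    return f"{OPENALEX_WORKS}?filter=title.search:{quote(title, safe='')}"

url = title_search_url("Visualization for nonlinear engineering FEM analysis in manufacturing")
print(url)
```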

In [12]:
# build a df to show no result doi and title side by side, this is clearer
no_result_df = pd.DataFrame(columns = ['DOI', 'Title'])

In [13]:
no_result_df['DOI'] = no_result_dois
no_result_df['Title'] = no_result_titles

In [14]:
no_result_df

Unnamed: 0,DOI,Title
0,10.1109/VISUAL.1990.146412,Visualization for nonlinear engineering FEM analysis in manufacturing


One paper does not exist in OpenAlex's database. 

<!-- ### Among the 11 papers that cannot be identified by title & DOI search, four can be identified with manual adjustment

#### Title is wrong:

Generation of Transfer Functions with Stochastic Search Technique -> Generation of transfer functions with stochastic search techniques, 10.1109/VISUAL.1996.568113

#### Title difficult to search

- Automatic alignment of high-resolution multi-projector displays using an uncalibrated camera -> Automatic alignment of high-resolution multi-projector display using an un-calibrated camera (https://openalex.org/W2170549385)
- Mix&Match a construction kit for visualization, if I replace `&` with ' ' rather than '', it will be successful.

- Fast analytical computation of Richard's smooth molecular surface -> Fast analytical computation of Richards's smooth molecular surface

 -->

## Checking `vispd_openalex_match_1.csv`

In [15]:
df = pd.read_csv("../../data/interim/vispd_openalex_match_1.csv")
# df.Title != df['OpenAlex Title']) & 

In [16]:
df.DOI[1]
df['OpenAlex DOI'][1]
df.head()

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue,OpenAlex Journal,OpenAlex Publisher,OpenAlex First Page,OpenAlex Last Page
0,2011,10.1109/TVCG.2011.185,D³ Data-Driven Documents,2011.0,https://openalex.org/W2135415614,D³ Data-Driven Documents,https://doi.org/10.1109/tvcg.2011.185,https://doi.org/10.1109/tvcg.2011.185,https://openalex.org/V84775595,IEEE Transactions on Visualization and Computer Graphics,Institute of Electrical and Electronics Engineers,2301.0,2309.0
1,1991,10.1109/VISUAL.1991.175815,Tree-maps: a space-filling approach to the visualization of hierarchical information structures,1991.0,https://openalex.org/W2146872957,Tree-maps: a space-filling approach to the visualization of hierarchical information structures,https://doi.org/10.5555/949607.949654,http://dx.doi.org/10.1109/VISUAL.1991.175815,,ieee visualization,IEEE Computer Society Press,284.0,291.0
2,1990,10.1109/VISUAL.1990.146402,Parallel coordinates: a tool for visualizing multi-dimensional geometry,1990.0,https://openalex.org/W2034694694,Parallel coordinates: a tool for visualizing multi-dimensional geometry,https://doi.org/10.5555/949531.949588,http://dx.doi.org/10.1109/VISUAL.1990.146402,,ieee visualization,IEEE Computer Society Press,361.0,378.0
3,2006,10.1109/TVCG.2006.147,Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data,2006.0,https://openalex.org/W2145640629,Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data,https://doi.org/10.1109/tvcg.2006.147,https://doi.org/10.1109/tvcg.2006.147,https://openalex.org/V84775595,IEEE Transactions on Visualization and Computer Graphics,Institute of Electrical and Electronics Engineers,741.0,748.0
4,1997,10.1109/VISUAL.1997.663860,ROAMing terrain: Real-time Optimally Adapting Meshes,1997.0,https://openalex.org/W2532506824,ROAMing terrain: real-time optimally adapting meshes,https://doi.org/10.5555/266989.267028,http://kucg.korea.ac.kr/seminar/2001/src/PA-01-45.pdf,,ieee visualization,IEEE Computer Society Press,81.0,88.0


## Checking the results where title queries were successful

In [17]:
# Check how many papers do not have an OpenAlex ID; that is, how many failed.
# I want to see whether this number matches the combined total of no_result and no_matching.

df[df['OpenAlex ID'].isnull()].shape[0]

# Yes, it matches: 1 paper in no_result plus 0 in no_matching

1
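
The manual comparison above can also be written as an explicit check. This is only a sketch: `counts_reconcile` is a hypothetical helper, and the inputs are the values already shown in this notebook (one missing OpenAlex ID, an empty `no_matching` list, and one DOI in `no_result`).

```python
def counts_reconcile(n_missing_ids, no_matching, no_result):
    """True when the number of rows without an OpenAlex ID equals the two failure lists combined."""
    return n_missing_ids == len(no_matching) + len(no_result)

# Values observed earlier in this notebook
ok = counts_reconcile(1, [], ["10.1109/VISUAL.1990.146412"])
print(ok)
```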

In [18]:
# first, filter out rows without an OpenAlex ID
no_nan_df = df.dropna(subset=['OpenAlex ID'])
# add a DOI_URL column in the same form as OpenAlex DOI, so that the two can be compared
no_nan_df = no_nan_df.assign(DOI_URL='https://doi.org/' + no_nan_df.DOI)
# then, keep rows where both title and DOI differ (case-insensitive comparison)
no_nan_df = no_nan_df[(no_nan_df.Title.str.lower() != no_nan_df['OpenAlex Title'].str.lower()) & (
    no_nan_df['OpenAlex DOI'].str.lower() != no_nan_df['DOI_URL'].str.lower()
)]
# shape: (72, 14). There are 72 papers whose title AND DOI both differ from those on OpenAlex
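
To make the filter's logic easier to verify, here is a minimal sketch of the same idea on a toy frame. The function name `both_mismatch` and all toy values are hypothetical. A row is kept only when the title and the DOI both differ after lowercasing.

```python
import pandas as pd

def both_mismatch(frame):
    """Return rows where title AND DOI both differ (case-insensitively) from OpenAlex."""
    doi_url = "https://doi.org/" + frame["DOI"]
    title_differs = frame["Title"].str.lower() != frame["OpenAlex Title"].str.lower()
    doi_differs = doi_url.str.lower() != frame["OpenAlex DOI"].str.lower()
    return frame[title_differs & doi_differs]

# Hypothetical toy rows: full match, DOI-only mismatch, mismatch on both
toy = pd.DataFrame({
    "DOI": ["10.0000/A", "10.0000/B", "10.0000/C"],
    "Title": ["Same Title", "Same title", "Title one"],
    "OpenAlex Title": ["same title", "Same title", "Title two"],
    "OpenAlex DOI": [
        "https://doi.org/10.0000/a",
        "https://doi.org/10.5555/x",
        "https://doi.org/10.5555/y",
    ],
})

flagged = both_mismatch(toy)["DOI"].tolist()
print(flagged)
```

Only the third row is flagged. The second row has a different DOI but an identical title, so it is treated as a correct match.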

In [26]:
to_query_by_doi = no_nan_df.DOI
no_nan_df.shape

(72, 14)

In [20]:
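doi_query_title_dic = {}
for doi in to_query_by_doi:
    # query OpenAlex directly by DOI; record NaN if the request fails,
    # the response is not JSON, or it has no display_name
    try:
        response = requests.get("https://api.openalex.org/works/doi:" + doi, timeout=30)
        title = response.json()['display_name']
    except (requests.RequestException, ValueError, KeyError):
        title = np.nan
    doi_query_title_dic[doi] = title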
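
The lookup loop can be wrapped in a function that takes the HTTP call as a parameter, which makes it possible to test the failure handling without touching the network. This is a sketch under assumptions: `titles_by_doi`, `fetch_json`, and `fake_fetch` are hypothetical names, and the fake response contains only the `display_name` field that the loop reads.

```python
import math

def titles_by_doi(dois, fetch_json):
    """Map each DOI to its display_name, or NaN when the lookup fails.

    fetch_json is a hypothetical callable that takes a URL and returns parsed JSON.
    """
    results = {}
    for doi in dois:
        url = "https://api.openalex.org/works/doi:" + doi
        try:
            results[doi] = fetch_json(url)["display_name"]
        except (KeyError, ValueError, OSError):
            results[doi] = math.nan
    return results

def fake_fetch(url):
    """Offline stand-in for the API: knows one DOI, fails for everything else."""
    known = {
        "https://api.openalex.org/works/doi:10.1109/VISUAL.2001.964489":
            {"display_name": "Point set surfaces"},
    }
    if url not in known:
        raise ValueError("no JSON for " + url)
    return known[url]

found = titles_by_doi(["10.1109/VISUAL.2001.964489", "10.0000/missing"], fake_fetch)
print(found)
```

With `requests`, the real fetcher could be `lambda url: requests.get(url, timeout=30).json()`. Its network and JSON-decoding errors are subclasses of `OSError` and `ValueError`, so the same `except` clause would apply.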

In [21]:
to_compare_df = pd.DataFrame(doi_query_title_dic.items(), columns = ['DOI', 'OpenAlex Title'])
# look up the VisPubData title by DOI rather than relying on row order
to_compare_df['Title'] = to_compare_df['DOI'].map(doi_title_dict)
to_compare_df

Unnamed: 0,DOI,OpenAlex Title,Title
0,10.1109/VISUAL.2001.964489,Point set surfaces,Point set surfaces
1,10.1109/VISUAL.1996.568113,Generation of transfer functions with stochastic search techniques,Generation of Transfer Functions with Stochastic Search Technique
2,10.1109/VISUAL.1999.809896,"The ""Parallel Vectors"" operator-a vector field visualization primitive","The ""Parallel Vectors"" operator-a vector field visualization primitive"
3,10.1109/VISUAL.1991.175771,The virtual windtunnel-an environment for the exploration of three-dimensional unsteady flows,The virtual windtunnel: An environment for the exploration of three-dimensional unsteady flows
4,10.1109/VISUAL.1998.745302,TOPIC ISLANDS/sup TM/-a wavelet-based text visualization system,TOPIC ISLANDS TM - a wavelet-based text visualization system
5,10.1109/VISUAL.1993.398868,Geometric optimization,Geometric optimization
6,10.1109/INFVIS.2005.1532128,Voronoi Treemaps,Voronoi treemaps
7,10.1109/VISUAL.1993.398859,HyperSlice,HyperSlice - Visualization of Scalar Functions of Many Variables
8,10.1109/VISUAL.1991.175795,Color icons-merging color and texture perception for integrated visualization of multiple parameters,Color icons: merging color and texture perception for integrated visualization of multiple parameters
9,10.1109/VISUAL.2003.1250401,Video visualization,Video visualization
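
Most of the differences in the table above are only in case or punctuation. One way to separate those from rows that need a manual look is to compare titles after lowercasing and dropping everything except letters and digits. This is a sketch and not the method used in this notebook. `title_key` and `same_modulo_formatting` are hypothetical helpers.

```python
import re

def title_key(title):
    """Lowercase a title and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

def same_modulo_formatting(a, b):
    """True when two titles differ only in case, spacing, or punctuation."""
    return title_key(a) == title_key(b)

# Title pairs taken from the comparison table above
pairs = [
    ("Voronoi Treemaps", "Voronoi treemaps"),
    ("The virtual windtunnel-an environment for the exploration of three-dimensional unsteady flows",
     "The virtual windtunnel: An environment for the exploration of three-dimensional unsteady flows"),
    ("HyperSlice", "HyperSlice - Visualization of Scalar Functions of Many Variables"),
]
checks = [same_modulo_formatting(a, b) for a, b in pairs]
print(checks)
```

Pairs that come back `False`, such as the HyperSlice row, are not necessarily wrong matches. They are just the ones worth checking by hand.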


In [22]:
wrong_match_dois = [
#     '10.1109/VISUAL.1993.398859', # HyperSlice: this is in fact a correct match
#     '10.1109/VISUAL.1995.480804', # space walking: this is in fact a correct match
    '10.1109/VISUAL.1992.235194', # volume warping
    '10.1109/VISUAL.2003.1250379', # HyperLIC
]

# volume warping can be found by using a different index in the title query results: [4]
# HyperLIC does not exist in OpenAlex

In [23]:
wrong_match_dois

['10.1109/VISUAL.1992.235194', '10.1109/VISUAL.2003.1250379']

In [24]:
to_query_by_doi = [doi for doi in to_query_by_doi if doi not in wrong_match_dois]

In [25]:
to_query_by_doi

['10.1109/VISUAL.2001.964489',
 '10.1109/VISUAL.1996.568113',
 '10.1109/VISUAL.1999.809896',
 '10.1109/VISUAL.1991.175771',
 '10.1109/VISUAL.1998.745302',
 '10.1109/VISUAL.1993.398868',
 '10.1109/INFVIS.2005.1532128',
 '10.1109/VISUAL.1993.398859',
 '10.1109/VISUAL.1991.175795',
 '10.1109/VISUAL.2003.1250401',
 '10.1109/VISUAL.1991.175789',
 '10.1109/VISUAL.2000.885739',
 '10.1109/TVCG.2014.2346922',
 '10.1109/VISUAL.1999.809871',
 '10.1109/VISUAL.1996.567807',
 '10.1109/VISUAL.2000.885692',
 '10.1109/VISUAL.1991.175777',
 '10.1109/VISUAL.1998.745315',
 '10.1109/VISUAL.1997.663909',
 '10.1109/VISUAL.2000.885697',
 '10.1109/VISUAL.2001.964504',
 '10.1109/TVCG.2006.168',
 '10.1109/TVCG.2007.70617',
 '10.1109/VISUAL.1997.663910',
 '10.1109/VISUAL.1997.663931',
 '10.1109/VISUAL.2002.1183792',
 '10.1109/VISUAL.1992.235201',
 '10.1109/VISUAL.1996.568128',
 '10.1109/VISUAL.1997.663923',
 '10.1109/VAST.2011.6102441',
 '10.1109/VISUAL.2000.885732',
 '10.1109/VISUAL.2001.964522',
 '10.1109/VISUA

<!-- ## Experimenting methods to remove results for those four papers whose results are false -->