## Aim

In this notebook, I check:

- How many papers in `data/interim/title_query_empty_doi_query_404_2.txt` can be identified if I slightly modify the queried title
- How many of the successfully identified papers (those in `data/interim/vispd_openalex_match_2.csv`) were matched to a record whose title differs and that is not actually the same paper

## Conclusion

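I conclude that the strategy I used in `get_vispd_openalex_match_2.py` works: only one paper returns no OpenAlex result, and manual checking finds only one wrong match (HyperLIC).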

In [1]:
import requests
import pandas as pd
import csv
import re
pd.options.display.max_colwidth = 200
pd.set_option('display.max_rows', 500)
import numpy as np

In [2]:
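def read_txt(INPUT):
    """Read a text file and return the first field of each line as a list."""
    # newline="" is what the csv module recommends for files passed to csv.reader
    with open(INPUT, "r", newline="") as raw:
        reader = csv.reader(raw)
        # skip blank lines, which csv.reader yields as empty rows
        return [row[0] for row in reader if row]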

In [3]:
no_matching = read_txt("../../data/interim/checking/title_query_404_2.txt")

In [4]:
no_result = read_txt("../../data/interim/checking/title_query_empty_doi_query_404_2.txt")

In [5]:
failed_doi = read_txt("../../data/interim/checking/doi_query_404_2.txt")

In [6]:
# failed_doi holds the DOIs in `to_query_by_doi` whose DOI query failed.
failed_doi

[]

In [7]:
# create dicts to convert doi and title, and vice versa
vispd_plus = pd.read_csv("../../data/processed/vispubdata_plus.csv")
dois = vispd_plus.loc[:, "DOI"].tolist()
titles = vispd_plus.loc[:, "Title"].tolist()
doi_title_dict = dict(zip(dois, titles))
title_doi_dict = dict(zip(titles, dois))
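
`dict(zip(...))` keeps only the last value for a repeated key, so `title_doi_dict` is a safe inverse of `doi_title_dict` only if no two papers share a title. A minimal sketch of how that could be checked, assuming plain lists of DOIs and titles; the helper name and the toy data are hypothetical and not taken from `vispubdata_plus.csv`:

```python
from collections import defaultdict

def find_ambiguous_titles(dois, titles):
    """Return {title: [doi, ...]} for titles shared by more than one DOI."""
    dois_by_title = defaultdict(list)
    for doi, title in zip(dois, titles):
        dois_by_title[title].append(doi)
    return {t: d for t, d in dois_by_title.items() if len(d) > 1}

# Hypothetical toy data, for illustration only
example_dois = ["10.0000/a", "10.0000/b", "10.0000/c"]
example_titles = ["Same title", "Same title", "Unique title"]
ambiguous = find_ambiguous_titles(example_dois, example_titles)
```

If `ambiguous` is non-empty for the real data, looking a DOI up through the title would silently return the wrong paper for those entries.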

In [8]:
# checking no matching.
no_matching_titles = [doi_title_dict[doi] for doi in no_matching]

In [9]:
no_matching_titles_transformed = [re.sub(r'[:?,#&]', '', title) for title in no_matching_titles]

In [10]:
# empty, which means that every paper can be matched.
no_matching_titles

[]

In [11]:
no_matching_titles_transformed

[]
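
The title modification used in this notebook simply deletes five punctuation characters. A small sketch of that step as a reusable function, assuming the same character set as the cells above; the helper name and the sample title are hypothetical:

```python
import re

# Same character set as the pattern used in the cells above
QUERY_PUNCTUATION = re.compile(r"[:?,#&]")

def strip_query_punctuation(title):
    """Remove ':', '?', ',', '#' and '&' from a title (hypothetical helper)."""
    return QUERY_PUNCTUATION.sub("", title)

# Hypothetical title, for illustration only
cleaned = strip_query_punctuation("Who, what & why: a #1 question?")
```

Note that the characters are deleted rather than replaced by a space, so removing `&` from `what & why` leaves a double space behind.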

In [12]:
## checking no result titles
## By `no_result`, I mean papers whose title query succeeded
## but returned no results. For example:
## https://api.openalex.org/works?filter=title.search:Generation of Transfer Functions with Stochastic Search Technique
no_result_titles = [doi_title_dict[doi] for doi in no_result]
no_result_titles_transformed = [re.sub(r'[:?,#&]', '', title) for title in no_result_titles]
no_result_dois = [title_doi_dict[title] for title in no_result_titles]

In [13]:
no_result_titles_transformed

['Visualization for nonlinear engineering FEM analysis in manufacturing']
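
To retry a title by hand, the query URL can be built in the same `title.search` form as the example URL above. This is only a sketch: the helper name is hypothetical, percent-encoding the title is my assumption, and `get_vispd_openalex_match_2.py` may construct its requests differently.

```python
from urllib.parse import quote

OPENALEX_WORKS = "https://api.openalex.org/works"

def build_title_search_url(title):
    """Return a title.search URL for the works endpoint (hypothetical helper)."""
    # quote() percent-encodes spaces and other characters unsafe in a URL
    return f"{OPENALEX_WORKS}?filter=title.search:{quote(title)}"

url = build_title_search_url(
    "Visualization for nonlinear engineering FEM analysis in manufacturing"
)
```

The resulting `url` can be pasted into a browser or passed to `requests.get` to confirm that the query still returns no result.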

In [14]:
# build a df that shows the no-result DOI and title side by side, which is easier to read
no_result_df = pd.DataFrame(columns = ['DOI', 'Title'])

In [15]:
no_result_df['DOI'] = no_result_dois
no_result_df['Title'] = no_result_titles

In [16]:
## These are the papers that I cannot find on OpenAlex
no_result_df

Unnamed: 0,DOI,Title
0,10.1109/VISUAL.1990.146412,Visualization for nonlinear engineering FEM analysis in manufacturing


## Checking `vispd_openalex_match_2.csv`

In [17]:
df = pd.read_csv("../../data/interim/vispd_openalex_match_2.csv")
# df.Title != df['OpenAlex Title']) & 

In [18]:
df.DOI[1]
df['OpenAlex DOI'][1]

'https://doi.org/10.5555/949607.949654'

In [19]:
df.shape

(3242, 13)

In [20]:
df[df['DOI'] == '10.1109/VISUAL.1992.235194']

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue,OpenAlex Journal,OpenAlex Publisher,OpenAlex First Page,OpenAlex Last Page
2275,1992,10.1109/VISUAL.1992.235194,Volume warping,1992.0,https://openalex.org/W2293612704,Volume warping,https://doi.org/10.5555/949685.949740,https://dl.acm.org/doi/10.5555/949685.949740,,ieee visualization,,308.0,315.0


## Checking the results where title queries were successful

In [21]:
# checking how many papers do not have an OpenAlex ID; that is, how many failed.
# I want to see whether the number matches the total of no_result and no_matching

df[df['OpenAlex ID'].isnull()].shape[0]

# Yes, it is correct

1
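
The consistency check above can also be written as an explicit assertion, so that a mismatch fails loudly instead of relying on a visual comparison. A sketch using the values observed in this notebook; the variable `n_missing_openalex_id` is a hypothetical stand-in for the count computed from `df`:

```python
# Values as observed in the cells above
no_matching = []
no_result = ["10.1109/VISUAL.1990.146412"]  # DOI shown in the no-result table
n_missing_openalex_id = 1  # stand-in for the count of rows without an OpenAlex ID

expected_failures = len(no_matching) + len(no_result)
assert n_missing_openalex_id == expected_failures, (
    f"{n_missing_openalex_id} rows lack an OpenAlex ID, "
    f"but {expected_failures} failures were logged"
)
```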

In [22]:
# first, filter out those without openalex id
no_nan_df = df.dropna(subset=['OpenAlex ID'])
# create a new column based on DOI; the purpose is to compare it with OpenAlex DOI
no_nan_df = no_nan_df.assign(DOI_URL='https://doi.org/' + no_nan_df.DOI)
# then, show rows where neither the title nor the DOI matches (case-insensitive)
no_nan_df[(no_nan_df.Title.str.lower() != no_nan_df['OpenAlex Title'].str.lower()) & (
    no_nan_df['OpenAlex DOI'].str.lower() != no_nan_df['DOI_URL'].str.lower()
)]

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue,OpenAlex Journal,OpenAlex Publisher,OpenAlex First Page,OpenAlex Last Page,DOI_URL
2908,2003,10.1109/VISUAL.2003.1250379,HyperLIC,2006.0,https://openalex.org/W2365704339,Blow-up of solutions to initial boundary value problem for a class of nonlinear hyperlic equation,,https://en.cnki.com.cn/Article_en/CJFDTOTAL-ZKSG200602005.htm,https://openalex.org/V2764756576,Journal of Zhoukou Normal University,,,,https://doi.org/10.1109/VISUAL.2003.1250379
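
The filter requires both the title and the DOI to disagree because a DOI mismatch alone is not evidence of a wrong match: "Volume warping" was matched to an ACM DOI (`10.5555/...`) even though it is the same paper. A plain-Python sketch of the same rule, applied to two rows copied from the outputs above; the helper names are hypothetical:

```python
def normalize_doi(doi):
    """Lowercase a DOI and drop a leading https://doi.org/ prefix, if any."""
    if not doi:
        return ""
    doi = doi.strip().lower()
    prefix = "https://doi.org/"
    return doi[len(prefix):] if doi.startswith(prefix) else doi

def is_suspect(row):
    """True when neither the title nor the DOI agrees with OpenAlex."""
    same_title = row["Title"].lower() == (row["OpenAlex Title"] or "").lower()
    same_doi = normalize_doi(row["DOI"]) == normalize_doi(row["OpenAlex DOI"])
    return not same_title and not same_doi

rows = [
    {"DOI": "10.1109/VISUAL.1992.235194", "Title": "Volume warping",
     "OpenAlex Title": "Volume warping",
     "OpenAlex DOI": "https://doi.org/10.5555/949685.949740"},
    {"DOI": "10.1109/VISUAL.2003.1250379", "Title": "HyperLIC",
     "OpenAlex Title": "Blow-up of solutions to initial boundary value problem "
                       "for a class of nonlinear hyperlic equation",
     "OpenAlex DOI": None},
]
suspects = [row["DOI"] for row in rows if is_suspect(row)]
```

Only the HyperLIC row ends up in `suspects`; the "Volume warping" row passes because its titles agree.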


In [41]:
# I manually checked the above results and found that only one paper is a wrong match.
# This means the strategy I used in `get_vispd_openalex_match_2.py` succeeds.
# Then I can run `get_openalex_dfs.py`.
wrong_match_dois = [
    '10.1109/VISUAL.2003.1250379', #hyperlic
]
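
One possible use of `wrong_match_dois` is to exclude the known wrong match before the matches are used further. Whether `get_openalex_dfs.py` does this is not shown here, so the following is only a sketch on two rows copied from the outputs above, reduced to the relevant columns:

```python
wrong_match_dois = {"10.1109/VISUAL.2003.1250379"}  # HyperLIC

# Two rows from the outputs above, reduced to DOI and OpenAlex ID
matches = [
    {"DOI": "10.1109/VISUAL.1992.235194",
     "OpenAlex ID": "https://openalex.org/W2293612704"},
    {"DOI": "10.1109/VISUAL.2003.1250379",
     "OpenAlex ID": "https://openalex.org/W2365704339"},
]

# Keep only the matches that were not flagged during the manual check
trusted = [m for m in matches if m["DOI"] not in wrong_match_dois]
```

On the full DataFrame, the equivalent filter would be `df[~df['DOI'].isin(wrong_match_dois)]`.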