## Aim

In this notebook, I check two things:

- How many papers in `data/interim/title_query_empty_doi_query_404_dfs.txt` can be identified if I modify the queried title slightly.
- How many of the successfully identified papers (those in `data/processed/openalex_paper_df.csv`) were matched to an OpenAlex record with a different title that is not actually the same paper.

## Conclusion

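I conclude that the strategy I used in `get_openalex_dfs.py` works: the failure lists are empty, every one of the 3,233 papers has an OpenAlex ID, and the rows whose title and DOI both differ from the OpenAlex record turned out to be correct matches.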

In [1]:
import csv
import re
import numpy as np
import pandas as pd
import requests
pd.options.display.max_colwidth = 200
pd.set_option('display.max_rows', 500)

In [2]:
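def read_txt(INPUT):
    """Read a text file and return the first field of each line as a list."""
    # use a context manager so the file handle is closed after reading
    with open(INPUT, "r", newline="") as raw:
        reader = csv.reader(raw)
        # skip blank lines, which csv.reader yields as empty rows
        data = [row[0] for row in reader if row]
    return data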

In [3]:
# where title query fails
no_matching = read_txt("../../data/interim/checking/title_query_404_dfs.txt")

In [4]:
# title query has empty results and doi query fails
no_result = read_txt("../../data/interim/checking/title_query_empty_doi_query_404_dfs.txt")

In [5]:
# doi query fails
failed_doi = read_txt("../../data/interim/checking/doi_query_404_dfs.txt")

In [6]:
# `failed_doi` holds the DOIs in `to_query_by_doi` whose DOI query failed
failed_doi

[]

In [7]:
# create dicts to convert doi and title, and vice versa
vispd_plus = pd.read_csv("../../data/processed/vispubdata_plus.csv")
dois = vispd_plus.loc[:, "DOI"].tolist()
titles = vispd_plus.loc[:, "Title"].tolist()
doi_title_dict = dict(zip(dois, titles))
title_doi_dict = dict(zip(titles, dois))
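
A caveat on the two lookup dicts: `dict(zip(...))` keeps only the last value for a repeated key, so if two papers shared a title, `title_doi_dict` would silently map that title to a single DOI. Whether `vispubdata_plus.csv` contains such repeats is not checked in this notebook. The sketch below shows one way to look for them; the helper name `find_duplicates` is hypothetical and the sample lists are made up for illustration.

```python
from collections import Counter

def find_duplicates(values):
    """Return a dict of the values that occur more than once, with their counts."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}

# made-up sample data, only to illustrate the check
sample_titles = ["Paper A", "Paper B", "Paper A"]
print(find_duplicates(sample_titles))
```

In the notebook this would be called as `find_duplicates(titles)` and `find_duplicates(dois)`; an empty result for both means the two dicts are safe to use for round trips.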

In [8]:
# checking no matching.
no_matching_titles = [doi_title_dict[doi] for doi in no_matching]

In [9]:
# strip the characters : ? , # & from each title
no_matching_titles_transformed = [re.sub(r'[:?,#&]', '', title) for title in no_matching_titles]

In [10]:
# empty, which means that every paper can be matched
no_matching_titles

[]

In [11]:
no_matching_titles_transformed

[]

In [12]:
# Checking no-result titles.
# By `no_result`, I mean papers for which the title query succeeded
# but returned no results. For example:
# https://api.openalex.org/works?filter=title.search:Generation of Transfer Functions with Stochastic Search Technique
no_result_titles = [doi_title_dict[doi] for doi in no_result]
# strip the characters : ? , # & from each title
no_result_titles_transformed = [re.sub(r'[:?,#&]', '', title) for title in no_result_titles]
no_result_dois = [title_doi_dict[title] for title in no_result_titles]

In [13]:
no_result_titles_transformed

[]
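
The title clean-up and the `title.search` filter used in the cells above can be combined into a small helper. The following is a minimal sketch and not necessarily what `get_openalex_dfs.py` does: the names `clean_title` and `title_search_url` are hypothetical, and percent-encoding the title with `urllib.parse.quote` is an assumption added here so that spaces and other characters are safe in a URL. The helper only builds the URL; it does not call the API.

```python
import re
from urllib.parse import quote

# endpoint taken from the example URL in the notebook
OPENALEX_WORKS = "https://api.openalex.org/works"

def clean_title(title):
    """Remove the characters : ? , # & from a title (same set as in the notebook)."""
    return re.sub(r"[:?,#&]", "", title)

def title_search_url(title):
    """Build a title.search query URL for a cleaned, percent-encoded title."""
    # percent-encoding is an assumption of this sketch, not taken from the notebook
    return f"{OPENALEX_WORKS}?filter=title.search:{quote(clean_title(title))}"

print(title_search_url("HyperSlice: Visualization of Scalar Functions"))
```

The resulting URL could then be fetched with `requests.get`, which the notebook already imports.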

In [14]:
# build a df to show no result doi and title side by side, this is clearer
no_result_df = pd.DataFrame(columns = ['DOI', 'Title'])

In [15]:
no_result_df['DOI'] = no_result_dois
no_result_df['Title'] = no_result_titles

In [16]:
no_result_df

Unnamed: 0,DOI,Title


## Checking `openalex_paper_df.csv`

In [17]:
df = pd.read_csv("../../data/processed/openalex_paper_df.csv")
# the comparison of Title with 'OpenAlex Title' is done further below

In [18]:
df.DOI[1]
df['OpenAlex DOI'][1]

'https://doi.org/10.5555/949607.949654'

## Checking the results where title queries were successful

In [19]:
# Check how many papers do not have an OpenAlex ID; that is, how many queries failed.
# I want to see whether this number matches the total of no_result and no_matching.

df[df['OpenAlex ID'].isnull()].shape[0]

# Yes, it matches: both are 0.

0

In [20]:
# first, filter out those without an OpenAlex ID
no_nan_df = df.dropna(subset=['OpenAlex ID'])
# create a new column by prepending the resolver prefix to DOI, so it can be compared with OpenAlex DOI
no_nan_df = no_nan_df.assign(DOI_URL = 'https://doi.org/' + no_nan_df.DOI)
# then, show rows where neither the title nor the DOI matches (case-insensitive)
no_nan_df[(no_nan_df.Title.str.lower() != no_nan_df['OpenAlex Title'].str.lower()) & (
    no_nan_df['OpenAlex DOI'].str.lower() != no_nan_df['DOI_URL'].str.lower()
)]
# shape: (36, 23)

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex Publication Date,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue ID,...,OpenAlex First Page,OpenAlex Last Page,Number of Pages,Number of References,Number of Authors,Number of Concepts,Number of Citations,Citation API URL,Number of Citation API URLs,DOI_URL
93,1996,10.1109/VISUAL.1996.568113,Generation of Transfer Functions with Stochastic Search Technique,1996,1996-10-27,W1964910730,Generation of transfer functions with stochastic search techniques,https://doi.org/10.5555/244979.245572,http://dblp.uni-trier.de/db/conf/visualization/visualization1996.html#HeHKP96,,...,227.0,234.0,8.0,16,4,5,186,https://api.openalex.org/works?filter=cites:W1964910730,1,https://doi.org/10.1109/VISUAL.1996.568113
147,1999,10.1109/VISUAL.1999.809896,"The ""Parallel Vectors"" operator-a vector field visualization primitive",1999,1999-10-24,W2095105262,The “parallel vectors” operator: a vector field visualization primitive,https://doi.org/10.5555/319351.319420,https://dblp.uni-trier.de/db/conf/visualization/visualization1999.html#PeikertR99,,...,263.0,270.0,8.0,26,2,7,147,https://api.openalex.org/works?filter=cites:W2095105262,1,https://doi.org/10.1109/VISUAL.1999.809896
158,1991,10.1109/VISUAL.1991.175771,The virtual windtunnel: An environment for the exploration of three-dimensional unsteady flows,1991,1991-10-22,W2114975239,The virtual windtunnel-an environment for the exploration of three-dimensional unsteady flows,https://doi.org/10.5555/949607.949612,http://ieeexplore.ieee.org/document/175771/,,...,17.0,24.0,8.0,7,2,4,143,https://api.openalex.org/works?filter=cites:W2114975239,1,https://doi.org/10.1109/VISUAL.1991.175771
175,1998,10.1109/VISUAL.1998.745302,TOPIC ISLANDS TM - a wavelet-based text visualization system,1998,1998-10-18,W2136452290,TOPIC ISLANDS/sup TM/-a wavelet-based text visualization system,https://doi.org/10.5555/288216.288247,https://doi.org/10.1109/VISUAL.1998.745302,,...,189.0,196.0,8.0,24,4,10,77,https://api.openalex.org/works?filter=cites:W2136452290,1,https://doi.org/10.1109/VISUAL.1998.745302
261,1993,10.1109/VISUAL.1993.398859,HyperSlice - Visualization of Scalar Functions of Many Variables,1993,1993-10-25,W2103111128,HyperSlice: visualization of scalar functions of many variables,https://doi.org/10.5555/949845.949871,https://doi.org/10.1109/VISUAL.1993.398859,,...,119.0,125.0,7.0,7,2,5,108,https://api.openalex.org/works?filter=cites:W2103111128,1,https://doi.org/10.1109/VISUAL.1993.398859
262,2000,10.1109/VISUAL.2000.885685,Automatic alignment of high-resolution multi-projector displays using an uncalibrated camera,2000,2000-10-01,W2170549385,Automatic alignment of high-resolution multi-projector display using an un-calibrated camera,https://doi.org/10.5555/375213.375230,https://www.ics.uci.edu/~majumder/vispercep/vis2000_chen.pdf,,...,125.0,130.0,6.0,15,5,11,90,https://api.openalex.org/works?filter=cites:W2170549385,1,https://doi.org/10.1109/VISUAL.2000.885685
350,1991,10.1109/VISUAL.1991.175795,Color icons: merging color and texture perception for integrated visualization of multiple parameters,1991,1991-10-22,W1991928089,Color icons-merging color and texture perception for integrated visualization of multiple parameters,https://doi.org/10.5555/949607.949634,https://dblp.uni-trier.de/db/conf/visualization/visualization1991.html#Levkowitz91,,...,164.0,170.0,7.0,9,1,16,85,https://api.openalex.org/works?filter=cites:W1991928089,1,https://doi.org/10.1109/VISUAL.1991.175795
389,1991,10.1109/VISUAL.1991.175789,The stream polygon: A technique for 3D vector field visualization,1991,1991-10-22,W2077014479,The stream polygon-a technique for 3D vector field visualization,https://doi.org/10.5555/949607.949628,https://dl.acm.org/doi/10.5555/949607.949628,,...,126.0,132.0,7.0,6,3,12,77,https://api.openalex.org/works?filter=cites:W2077014479,1,https://doi.org/10.1109/VISUAL.1991.175789
414,2000,10.1109/VISUAL.2000.885739,WEAVE: a system for visually linking 3-D and statistical visualizations applied to cardiac simulation and measurement data,2000,2000-10-01,W2089875636,"WEAVE: a system for visually linking 3-D and statistical visualizations, applied to cardiac simulation and measurement data",https://doi.org/10.5555/375213.375311,https://dblp.uni-trier.de/db/conf/visualization/visualization2000.html#GreshRWSY00,,...,489.0,492.0,4.0,17,5,6,75,https://api.openalex.org/works?filter=cites:W2089875636,1,https://doi.org/10.1109/VISUAL.2000.885739
523,1999,10.1109/VISUAL.1999.809871,Image graphs-a novel approach to visual data exploration,1999,1999-10-24,W2018246367,Image graphs—a novel approach to visual data exploration,https://doi.org/10.5555/319351.319360,http://dblp.uni-trier.de/db/conf/visualization/visualization1999.html#Ma99,,...,81.0,88.0,8.0,13,1,8,73,https://api.openalex.org/works?filter=cites:W2018246367,1,https://doi.org/10.1109/VISUAL.1999.809871


In [24]:
# all papers above are matched correctly,
# so the list of wrongly matched DOIs stays empty
wrong_match_dois = [
]
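
The rows listed above differ from their OpenAlex titles mostly in punctuation, case, or a plural ending. As an optional aid to the manual check, one could score each title pair after normalizing it. This is only a sketch using the standard library's `difflib.SequenceMatcher`; it is not part of the original workflow, the helper names `normalize_title` and `title_similarity` are hypothetical, and no particular cut-off score is implied.

```python
import re
from difflib import SequenceMatcher

def normalize_title(title):
    """Lowercase a title and replace every run of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def title_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()

# title pairs taken from the table above
same = title_similarity(
    "The stream polygon: A technique for 3D vector field visualization",
    "The stream polygon-a technique for 3D vector field visualization",
)
near = title_similarity(
    "Generation of Transfer Functions with Stochastic Search Technique",
    "Generation of transfer functions with stochastic search techniques",
)
print(same, near)
```

Pairs that differ only in punctuation or case score exactly 1.0 after normalization, so sorting the mismatched rows by this score would put the pairs most in need of a manual look first.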

In [25]:
# there are a total of 3,233 papers, which is correct
df.shape

(3233, 22)