## Aim

In this notebook, I aim to see:

- How many papers in `data/interim/checking/title_query_empty_doi_query_404_dfs.txt` can be identified if I modify the queried title slightly
- How many of the successfully identified papers (those in `data/processed/openalex_paper_df.csv`) were matched to the wrong paper, that is, where neither the title nor the DOI agrees with the original record
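
As a rough illustration of the first point, the sketch below strips the same punctuation this notebook removes from titles and builds a `title.search` URL of the form used in the example query further down. The helper names and the sample title are hypothetical and are not taken from `get_openalex_dfs.py`; no request is sent.

```python
import re
from urllib.parse import quote

# Endpoint and filter name follow the example URL shown later in this notebook.
OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def normalize_title(title):
    """Remove the characters : ? , # & from a title (hypothetical helper)."""
    return re.sub(r"[:?,#&]", "", title)


def build_title_query_url(title):
    """Build a title.search URL for the normalized title (hypothetical helper)."""
    return f"{OPENALEX_WORKS_URL}?filter=title.search:{quote(normalize_title(title))}"


# Made-up title, used only to show the transformation.
example_title = "R&D Dashboards: A Survey?"
normalized = normalize_title(example_title)
url = build_title_query_url(example_title)
print(normalized)
print(url)
```

Percent-encoding the title with `quote` is an assumption of this sketch; it keeps spaces and other reserved characters from breaking the URL.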

## Conclusion

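I conclude that the strategy I used in `get_openalex_dfs.py` works: all three failure lists are empty, every one of the 3,240 papers has an OpenAlex ID, and no matched paper differs from the original record in both title and DOI.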

In [1]:
import requests
import pandas as pd
import csv
import re
pd.options.display.max_colwidth = 200
pd.set_option('display.max_rows', 500)
import numpy as np

In [2]:
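def read_txt(input_path):
    """Read a text file and return the first field of each non-empty line as a list."""
    # the context manager closes the file even if parsing fails
    with open(input_path, "r", newline="") as raw:
        reader = csv.reader(raw)
        # csv.reader yields blank lines as empty rows; skip them to avoid an IndexError
        return [row[0] for row in reader if row]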

In [3]:
# where title query fails
no_matching = read_txt("../../data/interim/checking/title_query_404_dfs.txt")

In [4]:
# title query has empty results and doi query fails
no_result = read_txt("../../data/interim/checking/title_query_empty_doi_query_404_dfs.txt")

In [5]:
# doi query fails
failed_doi = read_txt("../../data/interim/checking/doi_query_404_dfs.txt")

In [6]:
# `failed_doi` lists the DOIs from `to_query_by_doi` whose DOI query failed
failed_doi

[]

In [7]:
# create dicts to convert doi and title, and vice versa
vispd_plus = pd.read_csv("../../data/processed/vispubdata_plus.csv")
dois = vispd_plus.loc[:, "DOI"].tolist()
titles = vispd_plus.loc[:, "Title"].tolist()
doi_title_dict = dict(zip(dois, titles))
title_doi_dict = dict(zip(titles, dois))

In [8]:
# look up the titles of the papers whose title query failed
no_matching_titles = [doi_title_dict[doi] for doi in no_matching]

In [9]:
# strip the characters : ? , # & from each title
no_matching_titles_transformed = [re.sub(r'[:?,#&]', '', title) for title in no_matching_titles]

In [10]:
# empty, which means that every paper can be matched
no_matching_titles

[]

In [11]:
no_matching_titles_transformed

[]

In [12]:
# Checking `no_result` titles.
# For the papers in `no_result`, the title query succeeded but returned no results
# (and the follow-up DOI query failed). For example:
# https://api.openalex.org/works?filter=title.search:Generation of Transfer Functions with Stochastic Search Technique
no_result_titles = [doi_title_dict[doi] for doi in no_result]
no_result_titles_transformed = [re.sub(r'[:?,#&]', '', title) for title in no_result_titles]
no_result_dois = [title_doi_dict[title] for title in no_result_titles]

In [13]:
no_result_titles_transformed

[]

In [14]:
# build a DataFrame that shows each no-result DOI next to its title, which is easier to read
no_result_df = pd.DataFrame(columns = ['DOI', 'Title'])

In [15]:
no_result_df['DOI'] = no_result_dois
no_result_df['Title'] = no_result_titles

In [16]:
no_result_df

Unnamed: 0,DOI,Title


## Checking `openalex_paper_df.csv`

In [17]:
df = pd.read_csv("../../data/processed/openalex_paper_df.csv")
# df.Title != df['OpenAlex Title']) & 

In [18]:
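# compare the original DOI with the OpenAlex DOI for one row
# (only the last expression in a cell is displayed)
df.DOI[1]
df['OpenAlex DOI'][1]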

'https://doi.org/10.5555/949607.949654'

## Checking the results where title queries were successful

In [19]:
# Check how many papers have no OpenAlex ID, that is, how many lookups failed.
# This number should equal len(no_result) + len(no_matching).

df[df['OpenAlex ID'].isnull()].shape[0]

# Yes, it matches: both are 0

0

In [20]:
# first, filter out those without openalex id
no_nan_df = df.dropna(subset=['OpenAlex ID'])
# build a DOI_URL column from DOI so it can be compared with OpenAlex DOI
no_nan_df = no_nan_df.assign(DOI_URL='https://doi.org/' + no_nan_df.DOI)
# then show rows where neither the title nor the DOI matches (case-insensitive)
no_nan_df[(no_nan_df.Title.str.lower() != no_nan_df['OpenAlex Title'].str.lower()) & (
    no_nan_df['OpenAlex DOI'].str.lower() != no_nan_df['DOI_URL'].str.lower()
)]

Unnamed: 0,Year,DOI,Title,OpenAlex Year,OpenAlex Publication Date,OpenAlex ID,OpenAlex Title,OpenAlex DOI,OpenAlex URL,OpenAlex Venue ID,...,OpenAlex First Page,OpenAlex Last Page,Number of Pages,Number of References,Number of Authors,Number of Concepts,Number of Citations,Citation API URL,Number of Citation API URLs,DOI_URL
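
To make the logic of this filter easier to follow, here is a minimal sketch on a toy DataFrame. The rows and DOIs are made up for illustration; only the column names follow `openalex_paper_df.csv`. A row is flagged only when both the title and the DOI disagree after lowercasing.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "DOI": ["10.1000/a", "10.1000/b", "10.1000/c"],
    "Title": ["Same Title", "Case Differs", "Original Title"],
    "OpenAlex Title": ["Same Title", "case differs", "Another Paper"],
    "OpenAlex DOI": [
        "https://doi.org/10.1000/a",
        "https://doi.org/10.1000/B",
        "https://doi.org/10.1000/zzz",
    ],
})

# Turn the bare DOI into a URL so it is comparable with the OpenAlex DOI.
doi_url = "https://doi.org/" + toy["DOI"]

# Case-insensitive comparisons, evaluated row by row.
title_differs = toy["Title"].str.lower() != toy["OpenAlex Title"].str.lower()
doi_differs = toy["OpenAlex DOI"].str.lower() != doi_url.str.lower()

# Only rows that disagree on both fields are candidates for a wrong match.
suspects = toy[title_differs & doi_differs]
print(suspects[["DOI", "Title", "OpenAlex Title"]])
```

Only the third toy row is flagged. Note that missing values compare as unequal in pandas, so a row with a missing `OpenAlex Title` and a missing `OpenAlex DOI` would also be flagged by this kind of filter.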


In [21]:
# all papers listed above are matched correctly,
# so there are no wrong matches to record
wrong_match_dois = [
]

In [22]:
# there are a total of 3,240 papers, which is correct
df.shape

(3240, 22)