# Chapter 3 Part 2:The shared corpus study
author: <span style="color:magenta">Poppy Riddle</span><br>
date: Mar 31, 2025

## Data collection
- [ ] create sample collection from Crossref from part 1
- [ ] create shared corpus with works also found in OpenAlex matched on DOI
    - take DOI from Crossref sample df_collated2 export
    - send API call to OpenAlex for single work to select relevant elements or all and then refine down to elements needed for analysis
        - example API: https://api.openalex.org/works?filter=doi:10.7717/peerj.4375&select=doi,title,id,publication_year,language,abstract_inverted_index
    - reconstruct abstract from inverted 
        - [x] reconstruction code
- [ ] Crossref schema and OpenAlex schema comparison
    - Crossref schema: https://data.crossref.org/reports/help/schema_doc/5.3.1/index.html
    - OpenAlex schema: https://docs.openalex.org/api-entities/works/work-object
    - create diagram of this map
    - create dictionary to build later
- [ ] create mapping of metadata element from Crossref and its respective element in OpenAlex
    - cr_title -> openalex_title
    - cr_citedby_count -> openalex_citedby
    - etc
- [ ] quantify differences
    - [ ] exact match for numerical or absolute str values
        - cited_by
        - language
        - URL
        - doc_type
        - license 
    - [ ] Levenshtein ratio for str values that can accept some variation without changing meaning: https://rapidfuzz.github.io/Levenshtein/
        - title (may want to use Levenshtein.seqratio())
        - abstract (may want to use Levenshtein.seqratio())
- [ ] identify changes from publisher deposited data (Crossref) to OpenAlex
    - DOI, title, abstract, license type, license, cited-by, language, and document type.
- [ ] identify which error types occur: incorrect values, missing info, inconsistent values
- [ ] visualize: other pubs have used Sankey diagram to show change - other ways to do this? Or improve upon the Sankey approach?

## helpers:
### Python
- the Pyalex library: https://pypi.org/project/pyalex/#get-abstract
- how to uninvert: https://stackoverflow.com/questions/72093757/running-python-loop-to-iterate-and-undo-inverted-index
- 


In [78]:
# import libraries
import pandas as pd
import os
import requests
import pickle
import json
from colorama import Fore,Back,Style
import time
import csv
import xmltodict #probably not needed here
from tqdm import tqdm

In [56]:
# get samples from part 1
file = "data/part_1_sample.txt"
df = pd.read_csv(file, sep='\t', encoding='utf-8', header=0)
print(df.columns)
df.drop(['Unnamed: 0','level_0'], axis=1, inplace=True)
df.head(1)


Index(['Unnamed: 0', 'level_0', 'index', 'doi', 'doi_type', 'title', 'abstract', 'citedby_count', 'doi_url', 'abstract_keys_count', 'abstract_type', 'license_x',
       'license_y'],
      dtype='object')


Unnamed: 0,index,doi,doi_type,title,abstract,citedby_count,doi_url,abstract_keys_count,abstract_type,license_x,license_y
0,0,10.3390/su152215683,journal_article,Underpinning Quality Assurance: Identifying Co...,"{'language': None, 'text': 'The Internet of Th...",1,https://www.mdpi.com/2071-1050/15/22/15683,2,dict,https://creativecommons.org/licenses/by/4.0/,https://creativecommons.org/licenses/by/4.0/


In [59]:
# reconstruct abtract - from https://stackoverflow.com/questions/72093757/running-python-loop-to-iterate-and-undo-inverted-index


def reconstruct_abstract(abstract:dict)-> str:

    """
    This takes a dictionary of the inverted abstract
    and returns a string of the reconstructed abstract.

    Args:
    abstract should be in the form of a dictionary. 
    Example:
    abstract_inverted_index = {
    'Despite':[0],
    'growing':[1],
   'interest': [2],
    'in': [3],
    'Open': [4],
    'Access': [5],
    '...': [6]

    Returns:
    String 
    """

    # Create a list of (word, index) pairs
    word_index = []
    for k, v in abstract.items():
        for index in v:
            word_index.append([k, index])

    #print(word_index) # uncomment to see the sublists
    # Sort the list based on index
    word_index = sorted(word_index, key=lambda x: x[1]) # this sorts based on the second item in the sublist

    # Join the words with a space
    abstract = ' '.join([word for word, index in word_index])
    return abstract

In [66]:
# collect OpenAlex data

# send call to OpenAelx API
def get_openalex_data(doi:str)->dict:
    """
    Arg: takes a DOI as a string without the resolver.
    Return: A dictionary of values.

    Note: oa_abstract is reconstructed from the function reconstruct_abstract()
    """
    URL = f"https://api.openalex.org/works?filter=doi:{doi}&select=doi,title,id,type,type_crossref,language,abstract_inverted_index,cited_by_count,is_paratext,primary_location"
    result = requests.get(URL)
    
    if result.status_code == 200:
        data = result.json()

        #parse json data into each element:
        oa_doi = data['results'][0]['doi'].lstrip('https://doi.org/')
        oa_title = data['results'][0]['title']
        oa_id = data['results'][0]['id']
        oa_type = data['results'][0]['type']
        oa_type_crossref = data['results'][0]['type_crossref']
        oa_language = data['results'][0]['language']
        oa_abstract_inverted_index = data['results'][0]['abstract_inverted_index']
        oa_cited_by_count = data['results'][0]['cited_by_count']
        oa_is_paratext = data['results'][0]['is_paratext']
        oa_primary_location_pdf_url = data['results'][0]['primary_location']['pdf_url']
        oa_license = data['results'][0]['primary_location']['license']
        oa_version = data['results'][0]['primary_location']['version']

    oa_abstract = reconstruct_abstract(oa_abstract_inverted_index)

    return {'oa_doi': oa_doi,
            'oa_title':oa_title,
            'oa_id':oa_id,
            'oa_type':oa_type,
            'oa_type_crossref':oa_type_crossref,
            'oa_language':oa_language,
            'oa_abstract_inverted_index':oa_abstract_inverted_index,
            'oa_abstract':oa_abstract,
            'oa_cited_by_count':oa_cited_by_count,
            'oa_is_paratext':oa_is_paratext,
            'oa_primary_location_pdf_url':oa_primary_location_pdf_url,
            'oa_license':oa_license,
            'oa_version':oa_version
            }

    time.sleep(1)



In [67]:
# test on single doi
# 10.3390/su152215683
get_openalex_data("10.3390/su152215683")

{'oa_doi': '10.3390/su152215683',
 'oa_title': 'Underpinning Quality Assurance: Identifying Core Testing Strategies for Multiple Layers of Internet-of-Things-Based Applications',
 'oa_id': 'https://openalex.org/W4388446692',
 'oa_type': 'article',
 'oa_type_crossref': 'journal-article',
 'oa_language': 'en',
 'oa_abstract_inverted_index': {'The': [0],
  'Internet': [1],
  'of': [2, 10, 26, 48, 75, 84, 142, 153, 166],
  'Things': [3],
  '(IoT)': [4],
  'constitutes': [5],
  'a': [6, 24, 85, 94, 100, 140],
  'digitally': [7],
  'integrated': [8],
  'network': [9],
  'intelligent': [11],
  'devices': [12],
  'equipped': [13],
  'with': [14],
  'sensors,': [15],
  'software,': [16],
  'and': [17, 51, 164],
  'communication': [18],
  'capabilities,': [19],
  'facilitating': [20],
  'data': [21],
  'exchange': [22],
  'among': [23],
  'multitude': [25],
  'digital': [27],
  'systems': [28],
  'via': [29],
  'the': [30, 37, 82, 119, 137, 150, 154, 162],
  'Internet.': [31],
  'Despite': [32],

In [73]:
from tqdm import tqdm

openalex_data = []

for doi in tqdm(df['doi'],colour="MAGENTA"):
    result = get_openalex_data(doi)
    openalex_data.append(result)

df_openalex = pd.DataFrame(openalex_data)

print(Fore.MAGENTA + df_openalex.columns)

df_openalex.head(2)

100%|[35m██████████[0m| 20/20 [00:04<00:00,  4.88it/s]

Index(['[35moa_doi', '[35moa_title', '[35moa_id', '[35moa_type', '[35moa_type_crossref', '[35moa_language', '[35moa_abstract_inverted_index',
       '[35moa_abstract', '[35moa_cited_by_count', '[35moa_is_paratext', '[35moa_primary_location_pdf_url', '[35moa_license', '[35moa_version'],
      dtype='object')





Unnamed: 0,oa_doi,oa_title,oa_id,oa_type,oa_type_crossref,oa_language,oa_abstract_inverted_index,oa_abstract,oa_cited_by_count,oa_is_paratext,oa_primary_location_pdf_url,oa_license,oa_version
0,10.3390/su152215683,Underpinning Quality Assurance: Identifying Co...,https://openalex.org/W4388446692,article,journal-article,en,"{'The': [0], 'Internet': [1], 'of': [2, 10, 26...",The Internet of Things (IoT) constitutes a dig...,0,False,https://www.mdpi.com/2071-1050/15/22/15683/pdf...,,publishedVersion
1,10.25139/jkp.v6i6.5294,Proses Pengambilan Keputusan Adopsi Inovasi Ap...,https://openalex.org/W4362648852,article,journal-article,id,"{'This': [0], 'study': [1], 'aims': [2], 'to':...",This study aims to determine how the process o...,0,False,https://ejournal.unitomo.ac.id/index.php/jkp/a...,,publishedVersion


In [77]:
# Compare Crossref df and oa df match on DOI
# new df with a boolean value if they share: doi, match_on_doi,...
# this will expand out for other boolean values

# match on df['doi'] and df_openalex['oa_doi']
match_on_doi_df = df[['doi']].merge(df_openalex[['oa_doi']], left_on='doi', right_on='oa_doi', how='outer')

match_on_doi_df['match_on_doi'] = match_on_doi_df['doi'] == match_on_doi_df['oa_doi']

match_on_doi_df = match_on_doi_df.drop(['oa_doi'], axis=1)

print(Fore.CYAN + f"percent matched from Crossref: {len(match_on_doi_df)/len(df)*100:.1f}%")

match_on_doi_df


[36mpercent matched from Crossref: 100.0%


Unnamed: 0,doi,match_on_doi
0,10.1039/d3nr03946c,True
1,10.1051/e3sconf/202448001017,True
2,10.1063/5.0208102,True
3,10.1088/1674-1056/ac16cd,True
4,10.1088/1755-1315/899/1/012022,True
5,10.1093/eurheartjsupp/suac121.504,True
6,10.1177/1357034x231201950,True
7,10.18203/2320-6012.ijrms20230875,True
8,10.19181/socjour.2021.27.3.8426,True
9,10.24191/mij.v1i1.14172,True


- [ ] Crossref schema and OpenAlex schema comparison
    - Crossref schema: https://data.crossref.org/reports/help/schema_doc/5.3.1/index.html
    - OpenAlex schema: https://docs.openalex.org/api-entities/works/work-object
    - create diagram of this map
    - create dictionary to build later

- [ ] create mapping of metadata element from Crossref and its respective element in OpenAlex
    - cr_title -> openalex_title
    - cr_citedby_count -> openalex_citedby
    - etc
    

# Analysis

- [ ] quantify differences
    - [ ] exact match for numerical or absolute str values
        - cited_by
        - language
        - URL
        - doc_type
        - license 
    - [ ] Levenshtein ratio for str values that can accept some variation without changing meaning: https://rapidfuzz.github.io/Levenshtein/
        - title (may want to use Levenshtein.seqratio())
        - abstract (may want to use Levenshtein.seqratio())

### overall changes
- [ ] identify changes from publisher deposited data (Crossref) to OpenAlex
    - DOI, title, abstract, license type, license, cited-by, language, and document type -> 0 for missing, 1 for present
    - create table


### DOI specific metadata
- [ ] Change in DOI URL from Crossref to OpenAlex, (0,1)
- [ ] count of those that have https vs http (as an indicator of link rot)
- [ ] count of HTTP status code on all URLs
- [ ] count of those not working (such as 400)
- [ ] create table

### publication type
- [ ] count of each type
- [ ] change from Crossref to Openalex, 0,1?
- [ ] % distribution 
- [ ] maybe a good place for a sankey diagram showing changes

### title
- [ ] change between Crossref and OpenAlex 0,1?
- [ ] count of tokens in each
- [ ] count of stopwords
- [ ] count of punctuation
- [ ] count of special char, formating char
- [ ] count of numerals
- [ ] count of tags or other non-text elements
- [ ] visualize distribution of these across both databases

### abstract
- [ ] change between Crossref and OpenAlex?
- [ ] count of tokens in each
- [ ] count of stopwords
- [ ] count of punctuation
- [ ] count of special char, formating char
- [ ] count of numerals
- [ ] count of tags or other non-text elements
- [ ] visualize distribution of these across both databases

## cited by count
- [ ] std dev of differences between two samples
- [ ] n with change
- [ ] % affected
- [ ] visualize to see if one database favors more than the other

### license
- [ ] change between Crossref and OpenAlex?
- [ ] count of types for each
- [ ] count of those with licenses vs without
- [ ] % of those with 
- [ ] count of common or proprietary licenses
- [ ] visualization of distribution of license types

### languages
- [ ] change between Crossref and OpenAlex
- [ ] count of types
- [ ] % declared in abstract
    - found in XML API
- [ ] % declared in journal title level
    - found in REST API
- [ ] visualization of distribution of language types

## Qualitative Analysis

### License
IF there are differences, an examination of changes between two sources. This may require a subset based on filtering from above. 
- [ ] comparison of license in each source
- [ ] apply an error classification: incorrect, missing, inconsistent

### title and abstract
- [ ] filter df from above for those with differences 
    - [ ] use subset if needed due to quantity
- [ ] compare title from each source based on Levenshtein seqratio
- [ ] compare abstract from each source based on Levenshtein seqratio
- [ ] apply classification
- [ ] identify error types 
