# Chapter 3 Part 1: the random sample
author: <span style="color:magenta">Poppy Riddle</span><br>
date: Mar 31, 2025

## Data collection
This notebook collects a random sample from Crossref. Inclusion criteria include:
- date range: 2020-2025
- doc_type:journal-article, proceedings-article, book-chapter
- filter: has-abstract=1
- sample: sample size is limited to 100 per call. Multiple calls will be used. 

Total records needed:
- journal articles = 9550 (95.5%),
- proceedings-article=150 (1.5%), 
- and book-chapter=290 (2.9%


<span style="color:magenta">## to do</span><br>
Pulling data from the XML API is a pain and is inconsistent particularly for abstract and license. 
- [ ] Pull all data from REST API with full metadata (otherwise language is not part of select option)
IF NO DATA EXISTS FOR LANGUAGE:
- [ ] language attributes from XML API using DOIs from above

This makes getting the data for proceedings-article and book-chapter much easier and consistent. Otherwise, coding all the locations to get from the XML is too 
likely to have errors or false negatives. 


In [2]:
import pandas as pd
import os
import requests
import pickle
import json
from colorama import Fore,Back,Style
import time
import csv
import xmltodict


In [5]:
# doc_type is a dictionary that includes the type (key) to be inserted into the API_URL and the quantity (value) of times the API_URL needs to be called.
doc_type = {"journal-article":0, "proceedings-article":0, "book-chapter":2}

#API_URL = f"https://api.crossref.org/works?mailto=pnriddle@dal.ca&filter=from-pub-date:2020-01-01,has-abstract:1,type:{doc_type}&select=DOI&sample=10"

# URL and params - the requests library concats with &
API_URL = "https://api.crossref.org/works?"
params = {
    "mailto": "pnriddle@dal.ca",
    "filter": "from-pub-date:2020-01-01,has-abstract:1",
    "select": "DOI",
    "sample": 10
}

# dictionary of dataframes to store the results for each document type
dfs = {}

# Send API calls for each document type and collect the results
for doc_type, num_samples in doc_type.items():
    params["filter"] += f",type:{doc_type}"
    results = []
    for i in range(num_samples):
        response = requests.get(API_URL, params=params)
        # Print the URL and parameters for the API call
        print(f"API call {i+1}: {response.url}")
        print(f"Params: {params}")
        data = response.json()
        print(data)
        for item in data['message']['items']:
            results.append({'DOI': item['DOI']})
        # for rate limiting
        time.sleep(1)
    dfs[doc_type] = pd.DataFrame(results)

# Collate the results into a single dataframe
df_collated = pd.concat(dfs.values(), keys=dfs.keys())

df_collated



API call 1: https://api.crossref.org/works?mailto=pnriddle%40dal.ca&filter=from-pub-date%3A2020-01-01%2Chas-abstract%3A1%2Ctype%3Ajournal-article%2Ctype%3Aproceedings-article%2Ctype%3Abook-chapter&select=DOI&sample=10
Params: {'mailto': 'pnriddle@dal.ca', 'filter': 'from-pub-date:2020-01-01,has-abstract:1,type:journal-article,type:proceedings-article,type:book-chapter', 'select': 'DOI', 'sample': 10}
{'status': 'ok', 'message-type': 'work-list', 'message-version': '1.0.0', 'message': {'facets': {}, 'total-results': 15020144, 'items': [{'DOI': '10.3389/fendo.2022.1097165'}, {'DOI': '10.1088/1361-6528/ab6ab5'}, {'DOI': '10.55606/kreatif.v3i2.1355'}, {'DOI': '10.3390/rs16091590'}, {'DOI': '10.3390/jrfm17050183'}, {'DOI': '10.3390/foods10061309'}, {'DOI': '10.1051/matecconf/202134603004'}, {'DOI': '10.1093/ofid/ofac492.512'}, {'DOI': '10.5327/1516-3180.141s2.9921'}, {'DOI': '10.29303/jppipa.v10i3.6877'}], 'items-per-page': 20, 'query': {'start-index': 0, 'search-terms': None}}}
API call 2:

Unnamed: 0,Unnamed: 1,DOI
book-chapter,0,10.3389/fendo.2022.1097165
book-chapter,1,10.1088/1361-6528/ab6ab5
book-chapter,2,10.55606/kreatif.v3i2.1355
book-chapter,3,10.3390/rs16091590
book-chapter,4,10.3390/jrfm17050183
book-chapter,5,10.3390/foods10061309
book-chapter,6,10.1051/matecconf/202134603004
book-chapter,7,10.1093/ofid/ofac492.512
book-chapter,8,10.5327/1516-3180.141s2.9921
book-chapter,9,10.29303/jppipa.v10i3.6877


In [7]:
# Save the data as a pickle for future use
with open('DOI_data.pkl', 'wb') as f:
    pickle.dump(df_collated, f)

# Save the data as a CSV file
df_collated.to_csv('DOI_data.csv', sep='\t')

In [326]:
"""
sends an API call to the Crossref XML API
returns document type, citedby_count, title, abstract, url, and license
"""

def get_xml_data(doi):
    XML_API = f"https://doi.crossref.org/search/doi?pid=pnriddle@dal.ca&format=unixsd&doi={doi}"
    #make API call 
    response = requests.get(XML_API)
    xml_data = response.content
    output = xmltodict.parse(xml_data)
    print(output)

    doi_type = output['crossref_result']['query_result']['body']['query']['doi']['@type']
    print(Fore.MAGENTA + f"doi_type: {doi_type}")

    doi_xml = output['crossref_result']['query_result']['body']['query']['doi']['#text']
    print(Fore.CYAN + f"doi_xml: {doi_xml}")

    citedby_count = output['crossref_result']['query_result']['body']['query']['crm-item'][9]['#text']
    print(Fore.YELLOW + f"citedby_count: {citedby_count}")

    # journal_article title
    try:
        title = output['crossref_result']['query_result']['body']['query']['doi_record']['crossref']['journal']['journal_article']['titles']['title']
        if not title:
            # can stack other locations for other doc_types
            title = " no title"
    except (KeyError,TypeError) as e:
        print(f"ah nuts, an error: {e}")
        title = "no title"
    print(Fore.MAGENTA + f"title: {title}")
    # look into .flatten() to flatten the lists if they exist
    
    #this is just for journal_article lists
    try:
        abstract_element = output['crossref_result']['query_result']['body']['query']['doi_record']['crossref']['journal']['journal_article']['jats:abstract']
        if isinstance(abstract_element, list):
            abstract = []
            for elem in abstract_element:
                language = elem.get('@xml:lang')
                text = elem.get('jats:sep')
                abstract.append({'language':language,'text':text})
        else:
            language = abstract_element.get('@xml:lang')
            text = abstract_element.get('jats:p')
            abstract = {'language':language,'text':text}
    except:
        abstract = None
    print(Fore.CYAN + f"abstract: {abstract}")

    # URL retrieval
    try:
        doi_url = output['crossref_result']['query_result']['body']['query']['doi_record']['crossref']['journal']['journal_article']['doi_data']['resource'].format('#text')
        if not doi_url:
            doi_url = output['crossref_result']['query_result']['body']['query']['doi_record']['crossref']['journal']['journal_article']['doi_data']['resource']
            if not doi_url:
                print(Fore.YELLOW + f" das ist nicht so gut")
                doi_url = "no resolution url"
    except Exception as e:
        print(Fore.YELLOW + f" ah boo: {e}")
        doi_url = "no resolution url"            
    print(Fore.YELLOW + f"doi_url: {doi_url}")

    #license
    """
    this part is very complex as there are multiple locations where license can be supplied. May want to pull this field from the REST API
    just to keep it simple. See locations here: https://data.crossref.org/reports/help/schema_doc/5.3.1/index.html

    tried - license is too messy - get license from REST API call
    """



    data = {'doi':doi,
            'doi_type':doi_type,
            'title':title,
            'abstract':abstract,
            'citedby_count':citedby_count,
            'doi_url':doi_url,
            }

    # time delay for rate limiting
    time.sleep(1)

    return data



In [327]:
get_xml_data("10.1088/1755-1315/899/1/012022")

{'crossref_result': {'@xmlns': 'http://www.crossref.org/qrschema/3.0', '@version': '3.0', '@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance', '@xsi:schemaLocation': 'http://www.crossref.org/qrschema/3.0 http://www.crossref.org/schemas/crossref_query_output3.0.xsd', 'query_result': {'head': {'doi_batch_id': 'none'}, 'body': {'query': {'@status': 'resolved', 'doi': {'@type': 'journal_article', '#text': '10.1088/1755-1315/899/1/012022'}, 'crm-item': [{'@name': 'publisher-name', '@type': 'string', '#text': 'IOP Publishing'}, {'@name': 'prefix-name', '@type': 'string', '#text': 'IOP Publishing'}, {'@name': 'member-id', '@type': 'number', '#text': '266'}, {'@name': 'citation-id', '@type': 'number', '#text': '132521094'}, {'@name': 'journal-id', '@type': 'number', '#text': '75384'}, {'@name': 'deposit-timestamp', '@type': 'number', '#text': '2022030117155000'}, {'@name': 'owner-prefix', '@type': 'string', '#text': '10.1088'}, {'@name': 'last-update', '@type': 'date', '#text': '2022-03-

{'doi': '10.1088/1755-1315/899/1/012022',
 'doi_type': 'journal_article',
 'title': 'The gap of cultural heritage protection with climate change adaptation in the context of spatial planning. The case of Greece',
 'abstract': {'language': None,
  'text': 'The case of cultural resources, and in particular of archaeological sites, is one of the key elements of the anthropogenic environment that is affected by climate change and needs protection. At the same time, it is a field of analysis allowing the understanding of the interactions and interconnections of natural and socio-economic systems in time and in different spatial scales, thus providing useful information on the phenomenon of climate change and on how to respond and adapt to it [1]. However, the related scientific research, policies and actions are still limited, as only in the last decade [2] there has been an (albeit ever-increasing) interest in this field. The main objective of this paper is to codify protection policies an

In [330]:
# apply function to df_collated['DOI']
#df_collated['XML_data'] = df_collated['DOI'].apply(get_xml_data)

#df_collated(

xml_data = df_collated['DOI'].apply(get_xml_data)

df_collated2 = pd.DataFrame(xml_data.to_list())

df_collated2

[35mdoi_type: journal_article
[36mdoi_xml: 10.3390/su152215683
[33mcitedby_count: 1
[35mtitle: Underpinning Quality Assurance: Identifying Core Testing Strategies for Multiple Layers of Internet-of-Things-Based Applications
[36mabstract: {'language': None, 'text': 'The Internet of Things (IoT) constitutes a digitally integrated network of intelligent devices equipped with sensors, software, and communication capabilities, facilitating data exchange among a multitude of digital systems via the Internet. Despite its pivotal role in the software development life-cycle (SDLC) for ensuring software quality in terms of both functional and non-functional aspects, testing within this intricate software–hardware ecosystem has been somewhat overlooked. To address this, various testing techniques are applied for real-time minimization of failure rates in IoT applications. However, the execution of a comprehensive test suite for specific IoT software remains a complex undertaking. This paper 

Unnamed: 0,doi,doi_type,title,abstract,citedby_count,doi_url
0,10.3390/su152215683,journal_article,Underpinning Quality Assurance: Identifying Co...,"{'language': None, 'text': 'The Internet of Th...",1,https://www.mdpi.com/2071-1050/15/22/15683
1,10.25139/jkp.v6i6.5294,journal_article,Proses Pengambilan Keputusan Adopsi Inovasi Ap...,"{'language': None, 'text': 'This study aims to...",0,https://ejournal.unitomo.ac.id/index.php/jkp/a...
2,10.1093/eurheartjsupp/suac121.504,journal_article,1134 IN-HOSPITAL ARRHYTHMIC BURDEN REDUCTION I...,"{'language': None, 'text': None}",1,https://academic.oup.com/eurheartjsupp/article...
3,10.3390/plants12223844,journal_article,Applications and Market of Micro-Organism-Base...,"{'language': None, 'text': 'The use of plant-b...",0,https://www.mdpi.com/2223-7747/12/22/3844
4,10.24911/ijmdc.51-1696257618,journal_article,"Comparative study of pharmacokinetics, pharmac...","{'language': None, 'text': 'Angiotensin-conver...",0,https://www.ejmanager.com/fulltextpdf.php?mno=...
5,10.30574/wjarr.2024.21.1.0037,journal_article,How COVID-19 and malaria are strikingly alike:...,"{'language': 'en', 'text': 'COVID-19 is a seve...",0,https://wjarr.com/content/how-covid-19-and-mal...
6,10.1088/1674-1056/ac16cd,journal_article,Landau damping of electrons with bouncing moti...,"{'language': None, 'text': {'jats:italic': ['n...",0,no resolution url
7,10.46502/issn.1856-7576/2022.16.03.4,journal_article,Information technologies as a means of overcom...,"{'language': None, 'text': 'The intensificatio...",0,https://revistaeduweb.org/check/16-3/4-55-66.pdf
8,10.1039/d3nr03946c,journal_article,An innovative method for controlled synthesis ...,"[{'language': None, 'text': None}, {'language'...",1,https://xlink.rsc.org/?DOI=D3NR03946C
9,10.5913/pala.13.2020.a012,journal_article,De illa quae dicitur C . Cornelii Galli papyro...,"{'language': None, 'text': 'Several critical o...",0,https://lockwoodonlinejournals.com/index.php/p...


In [331]:
# get some more info on the 'abstract' column
df_collated2['abstract_keys_count'] = df_collated2['abstract'].apply(lambda x: len(x) if isinstance(x,dict) else 0)

# or see if its a list
df_collated2['abstract_type'] = df_collated2['abstract'].apply(lambda x:'list' if isinstance(x, list) else 'dict')


In [344]:
## GET LICENSE DATA FROM REST API
def get_rest_license(doi):
    URL = f"https://api.crossref.org/works/{doi}"
    result = requests.get(URL)
    # return JSON result
    if result.status_code == 200:
        data = result.json()
        # retrieve the license if the content-version = "vor", return the URL value
        if 'message' in data and 'license' in data['message']:
            licenses = data['message']['license']
            for license in licenses:
                if 'content-version' in license and license['content-version'] == 'vor':
                    print(Fore.MAGENTA + f"license: {license['URL']}")
                    return {'license': license['URL']}
    return {'license': 'no license'}
    time.sleep(1)
    print(Style.RESET)



In [345]:
license_data = df_collated['DOI'].apply(get_rest_license)

df_license = pd.DataFrame(license_data.to_list())

df_collated2 = df_collated2.reset_index().merge(df_license.reset_index(), on='index')


[35mlicense: https://creativecommons.org/licenses/by/4.0/
[35mlicense: https://academic.oup.com/pages/standard-publication-reuse-rights
[35mlicense: https://creativecommons.org/licenses/by/4.0/
[35mlicense: https://iopscience.iop.org/page/copyright
[35mlicense: https://creativecommons.org/licenses/by/4.0/
[35mlicense: https://creativecommons.org/licenses/by/4.0/
[35mlicense: https://creativecommons.org/licenses/by/4.0/
[35mlicense: https://creativecommons.org/licenses/by/4.0/
[35mlicense: http://creativecommons.org/licenses/by/3.0/


In [350]:
# save out the goods
folder_to_be_saved = 'data'
if not os.path.exists(folder_to_be_saved):
    os.makedirs(folder_to_be_saved)
#export as .csv but tab separated
file_to_be_saved = os.path.join(folder_to_be_saved, "part_1_sample.csv")

df_collated2.to_csv(file_to_be_saved, sep='\t', encoding='utf-8',na_rep='NA')

# also save out as pickle to preserve data types
pkl_to_be_saved = os.path.join(folder_to_be_saved, "part_1_sample.pkl")
df_collated2.to_pickle(pkl_to_be_saved)



# Analysis
Schema 5.4.0:https://gitlab.com/crossref/schema/-/blob/master/schemas/common5.4.0.xsd?ref_type=heads

and Schema definitions: https://data.crossref.org/reports/help/schema_doc/5.3.1/index.html for qualitative analysis

info on abstracts: https://www.crossref.org/documentation/schema-library/markup-guide-metadata-segments/abstracts/

## Outcomes of the analysis
### Quantitative analysis:
- [ ] boolean values for presence of each metadata element
    - DOI and publication type
    - title
    - abstract
    - abstract keys count
    - citedby
    - resolution URL
    - license
    - language
- [ ] DOI:
    - http or https count
    - https status code
    - working or not (boolean value)
- [ ] publication type:
    - count and % of each type (this may not be necessary becaue I controlled this in the sampling)
- [ ] title:
    - count of tokens, stop words, punctuation, special char, formatting char, numerals, non-text elements
    - descriptive analysis
- [ ] abstract:
    - count of tokens, stop words, punctuation, special char, formatting char, numerals, non-text elements
    - descriptive analysis
- [ ] citedby_count
    - line chart or histogram
- [ ] license
    - type ,count, %, common or proprietary - may need much cleaning to get this info and may have to do after the qual evaluation
- [ ] language
    - type, count of each type, % in abstract, % in journal level attribution


### Qualitative analysis
- [ ] License - identification of locations, difference between REST and XML API
    - types:
        - [ ] errors in consistency, conventions such as with CC-BY, etc. 
        - [ ] coded for incorrect values, missing info, and inconsistent value rep
        - [ ] sample of types that are non-CC
- [ ] Title and abstract
    - subset used for screening error types:
        - contains both languages  
        - language not consistent with language attribute 
        - duplicate characters  
        - NA for title  
        - all caps  
        - includes web address  
        - includes conference location name, date  
        - inclusion of HTML text formatting codes/face markup  
        - inclusion of numbers or characters not in title  
        - includes full citation  
        - includes isbn  
        - nonsense title/placeholder  
        - includes author 
    - applied to rest of sample for counts
    - coded for inconsistent value rep    

