NB! 
- Requirements: 
    - pip install openalexapi  
    - pip install pdfminer OR pypdf 

# Libraries and packages 

In [36]:
# To access, use, and request OpenAlex 
import requests 

# To handle data 
import pandas as pd 
import numpy as np
import csv 

#To filter invalid pdf
import pdfminer 
import pypdf 

#To handle files
import glob
import sys 
import os

# Queries through OpenAlex 

First, I need to see if NeuroImage is among the sources available in OpenAlex's database. 

In [37]:
# There are 2 results when I search for sources with name 'NeuroImage'
response_neuroimage = requests.get('https://api.openalex.org/sources?filter=display_name:NeuroImage').json()

# For each result, I see which has issn 1095-9572 and issn_l 1053-8119, as that is the correct journal. 
# The target ISSN numbers to match
target_issn = ['1095-9572', '1053-8119']

# Variable to store the matching ID
matching_id = None

# Iterate through the results to find a match
for result in response_neuroimage['results']:
    # Check if the result's ISSN values match the target ISSN
    if all(issn in result['issn'] for issn in target_issn):
        # If both ISSN values are found, store the ID and break the loop
        matching_id = result['id']
        break

# Print the matching ID (or None if no match was found)
print("Matching ID:", matching_id)

Matching ID: https://openalex.org/S103225281


In [38]:
# This is the metadata about NeuroImage in Open Alex's database 
neuroimage_oa = requests.get(f'https://api.openalex.org/sources?filter=ids.openalex:{matching_id}').json()

Going through all of the 2022 articles manually and clicking 'Select all articles'(https://www-sciencedirect-com.ep.ituproxy.kb.dk/journal/neuroimage/issues), I get the following article counts: 
- Vol 264: 94 
- Vol 263: 77
- Vol 262: 33 
- Vol 261: 25 
- Vol 260: 55 
- Vol 259: 33 
- Vol 258: 50 
- Vol 257: 62 
- Vol 256: 43 
- Vol 255: 41 
- Vol 254: 42 
- Vol 253: 37 
- Vol 252: 30 
- Vol 251: 44
- Vol 250: 39 
- Vol 249: 35 
- Vol 248: 14 
- Vol 247: 55 
- Vol 246: 25 
<br />
<br />
which is a total of 19 volumes containing 834 articles. The object identified by OpenAlex, with ID S103225281, has 826 articles published in 2022, which is a difference of 8 articles. 
<br />

In [39]:
# Common filter criteria for NeuroImage articles published in 2022
neuroimage2022_filter = [
    "primary_location.source.id:"+matching_id,
    "publication_year:2022",  # Filter by the year 2022
    "is_paratext:false",  # Exclude paratext
]

# Build the URL for the current source
def build_neuroimage_works_url(filters):
    base_url = 'https://api.openalex.org/works'
    filters = ','.join(filters)
    return f'{base_url}?filter={filters}'

# Store the URL for the NeuroImage articles, filtered by 2022 and by being non-paratext articles
neuroimage2022_url = build_neuroimage_works_url(neuroimage2022_filter)

# Make a request to the API and store the JSON response
response = requests.get(neuroimage2022_url).json()

# Get the count of articles captured in the response
print("Articles in the response for ", neuroimage2022_url, ": ", response['meta']['count'])

Articles in the response for  https://api.openalex.org/works?filter=primary_location.source.id:https://openalex.org/S103225281,publication_year:2022,is_paratext:false :  812


Looking at the journal with OpenAlex ID S103225281, there are 812 articles, excluding paratext articles (defined in their documentation as: 

    In our context, paratext is stuff that's in scholarly venue (like a journal) but is about the venue rather than a scholarly work properly speaking. Some examples and nonexamples: 
    
    - yep it's paratext: front cover, back cover, tabel of contents, editorial board listing, issue information, masthead 
    - no, not paratext: research paper, dataset, lettors to the editor, figures. 
    
Reference: https://docs.openalex.org/api-entities/works/work-object#is_paratext
   <br> </br>  
Looking at the different volumes of NeuroImage on Elsevier's website, each has an article titled 'Editorial board' as the first article in the journal. If these are excluded, due to the paratext attribute, that would leave 815 research articles (when we start at 834 total, from our manual count). 

In [41]:
# Initialize an empty list to store all articles
all_articles = []

# Set the initial page number to 1
page = 1

# Loop until all articles are retrieved
while True:
    # Make a request to the API with the current page number
    response = requests.get(neuroimage2022_url, params={"page": page})

    # Check if the response is successful
    if response.status_code == 200:
        data = response.json()

        # Extract articles from the current page and append to the list
        articles_on_page = data.get("results", [])
        all_articles.extend(articles_on_page)

        # Check if there are more pages to fetch
        if len(articles_on_page) == 0 or page * data['meta']['per_page'] >= data['meta']['count']:
            break

        # Increment the page number for the next request
        page += 1
    else:
        print("Error fetching data. Status code:", response.status_code)
        break

# Create a DataFrame from all articles
articles_df = pd.DataFrame(all_articles)

In [51]:
# Create a 'fulltext_oa_url' column: 
articles_df['fulltext_oa_url'] = articles_df['open_access'].apply(lambda x: x.get('oa_url') if isinstance(x, dict) else None)


In [59]:
articles_df['doi']

0      https://doi.org/10.1016/j.neuroimage.2021.118870
1      https://doi.org/10.1016/j.neuroimage.2021.118788
2      https://doi.org/10.1016/j.neuroimage.2022.119027
3      https://doi.org/10.1016/j.neuroimage.2021.118774
4      https://doi.org/10.1016/j.neuroimage.2021.118789
                             ...                       
807    https://doi.org/10.1016/j.neuroimage.2022.119731
808    https://doi.org/10.1016/j.neuroimage.2022.119749
809    https://doi.org/10.1016/j.neuroimage.2022.119747
810    https://doi.org/10.1016/j.neuroimage.2022.119752
811    https://doi.org/10.1016/j.neuroimage.2022.119766
Name: doi, Length: 812, dtype: object

# Get PDF's

I want to download the fulltext of all the articles in OpenAlex from NeuroImage. 
I used a part of code written by Théo Sourget to get the PDF's. The code in the file with the following breadcrumb: code/other/download_fulltext.ipynb

Reference: 
Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [93]:
import os
import glob
from pypdf import PdfFileReader, PdfFileWriter

In [94]:
test = articles_df[:5]

At this point we have access to the pdf of the article. We first want to check whether the name of the dataset appears in the tables. The hypothesis is that if a dataset name appears in a table, it must be a table of results. And since we have removed the review papers from the list, this should indicate that the dataset is really used.


In [95]:
# Define the folder where PDFs will be downloaded
pdf_folder = "../Thesis/Code/OpenAlex/downloaded_pdf/"
downloaded_doi = []
title_to_doi = {}

# Create the download folder if it doesn't exist
if not os.path.exists(pdf_folder):
    os.makedirs(pdf_folder)
    
# Iterate through the articles in articles_df
for index, row in test.iterrows():
    title = row['title'].replace('/', '')
    doi = row['doi']
    fulltext_url = row['fulltext_oa_url']
    print("Title: ", title)
    print("DOI: ", doi)
    print("Fulltext_url: ", fulltext_url)
    file_path = f"../OpenAlex/downloaded_pdf/{title}.pdf"   
    
    if title not in title_to_doi: 
        title_to_doi[title] = doi
    if not fulltext_url:
        continue
    if file_path in pdf_folder:
        continue 

    # Check if the DOI has already been downloaded
    if doi not in downloaded_doi:
        try: 
            r_fulltext = requests.get(fulltext_url, allow_redirects=True, timeout=10)
            print(r_fulltext)
            pdf_content = r_fulltext.content
            if r_fulltext.status_code == 200:
                # Save the PDF to the download folder
                open(file_path, "wb").write(pdf_content)
                #with open(file_path, "wb") as pdf_file:
                    #pdf_file.write(pdf_content)
                downloaded_doi.append(doi)
            else:
                continue
        except requests.exceptions.RequestException as ce: 
            continue 

Title:  Quantitative mapping of the brain’s structural connectivity using diffusion MRI tractography: A review
DOI:  https://doi.org/10.1016/j.neuroimage.2021.118870
Fulltext_url:  https://doi.org/10.1016/j.neuroimage.2021.118870
<Response [200]>


FileNotFoundError: [Errno 2] No such file or directory: '../OpenAlex/downloaded_pdf/Quantitative mapping of the brain’s structural connectivity using diffusion MRI tractography: A review.pdf'

In [88]:
print(f"Number of downloaded fulltext: {len(downloaded_doi)}")

Number of downloaded fulltext: 5


In [None]:
 """
        try:
            # Try to read the PDF (raise an error if the file is an invalid PDF)
            PdfReader(file_path,strict=True)
        except PdfReadError: 
            # If there's an error reading the PDF, remove it from the downloaded list and delete the file
            downloaded_doi.remove(doi)
            continue"""

WHAT I WANT: 
- CSV/DATABASE where I can 
    - see how many articles are published in each volume (so I can check compared to Elsevier) 
    - get all the metadata and the url's for all the articles 
        - FROM THE URL I need to be able to search the texts for their databases. 

In [7]:
# Initialize an empty list to store all articles
all_articles = []

# Set the initial page number to 1
page = 1

# Loop until all articles are retrieved
while True:
    # Make a request to the API with the current page number
    response = requests.get(neuroimage2022_url, params={"page": page})

    # Check if the response is successful
    if response.status_code == 200:
        data = response.json()

        # Extract articles from the current page and append to the list
        articles_on_page = data.get("results", [])
        all_articles.extend(articles_on_page)

        # Check if there are more pages to fetch
        if len(articles_on_page) == 0 or page * data['meta']['per_page'] >= data['meta']['count']:
            break

        # Increment the page number for the next request
        page += 1
    else:
        print("Error fetching data. Status code:", response.status_code)
        break

# Create a DataFrame from all articles
df = pd.DataFrame(all_articles)

# Export to CSV if needed
# df.to_csv('articles.csv', index=False)

# Print the total number of articles retrieved
# print("Total articles retrieved:", len(all_articles))

In [8]:
# open_access is an object, that contains information about the access status of this work, as an  object
# https://docs.openalex.org/api-entities/works/work-object#open_access 
# Initialize counters
true_count = 0
false_count = 0
nan_count = 0

# Iterate through rows and count True, False, and NaN
for index, row in df.iterrows():
    value = row['open_access'].get('any_repository_has_fulltext', np.nan)
    if pd.isna(value):
        nan_count += 1
    elif value == True:
        true_count += 1
    elif value == False:
        false_count += 1

print("Count of 'any_repository_has_fulltext' = True:", true_count)
print("Count of 'any_repository_has_fulltext' = False:", false_count)
print("Count of 'any_repository_has_fulltext' = NaN:", nan_count)

Count of 'any_repository_has_fulltext' = True: 547
Count of 'any_repository_has_fulltext' = False: 265
Count of 'any_repository_has_fulltext' = NaN: 0


In [13]:
# Filter rows where 'any_repository_has_fulltext' is False
filtered_df = df[df['open_access'].apply(lambda x: x.get('any_repository_has_fulltext', False) == False)]
filtered_df

Unnamed: 0,id,doi,title,display_name,publication_year,publication_date,ids,language,primary_location,type,...,grants,referenced_works_count,referenced_works,related_works,ngrams_url,abstract_inverted_index,cited_by_api_url,counts_by_year,updated_date,created_date
3,https://openalex.org/W3216528674,https://doi.org/10.1016/j.neuroimage.2021.118774,A dynamic graph convolutional neural network f...,A dynamic graph convolutional neural network f...,2022,2022-02-01,{'openalex': 'https://openalex.org/W3216528674...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[{'funder': 'https://openalex.org/F4320306076'...,54,"[https://openalex.org/W1560723556, https://ope...","[https://openalex.org/W2087328730, https://ope...",https://api.openalex.org/works/W3216528674/ngrams,"{'The': [0, 200], 'pathological': [1], 'mechan...",https://api.openalex.org/works?filter=cites:W3...,"[{'year': 2023, 'cited_by_count': 15}, {'year'...",2023-09-13T20:23:41.636399,2021-12-06
17,https://openalex.org/W4200619774,https://doi.org/10.1016/j.neuroimage.2021.118746,Ongoing neural oscillations influence behavior...,Ongoing neural oscillations influence behavior...,2022,2022-02-01,{'openalex': 'https://openalex.org/W4200619774...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[],142,"[https://openalex.org/W1449283962, https://ope...","[https://openalex.org/W176629368, https://open...",https://api.openalex.org/works/W4200619774/ngrams,"{'The': [0], 'ability': [1], 'to': [2, 6, 38, ...",https://api.openalex.org/works?filter=cites:W4...,"[{'year': 2023, 'cited_by_count': 14}, {'year'...",2023-09-02T13:08:24.356239,2021-12-31
32,https://openalex.org/W4210518162,https://doi.org/10.1016/j.neuroimage.2022.118970,Brain structure-function coupling provides sig...,Brain structure-function coupling provides sig...,2022,2022-04-01,{'openalex': 'https://openalex.org/W4210518162...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[{'funder': 'https://openalex.org/F4320320924'...,70,"[https://openalex.org/W795339718, https://open...","[https://openalex.org/W2295675667, https://ope...",https://api.openalex.org/works/W4210518162/ngrams,"{'Brain': [0], 'signatures': [1, 32, 219], 'of...",https://api.openalex.org/works?filter=cites:W4...,"[{'year': 2023, 'cited_by_count': 11}, {'year'...",2023-09-11T16:48:03.504543,2022-02-08
37,https://openalex.org/W4210709080,https://doi.org/10.1016/j.neuroimage.2022.118974,Periodic/Aperiodic parameterization of transie...,Periodic/Aperiodic parameterization of transie...,2022,2022-05-01,{'openalex': 'https://openalex.org/W4210709080...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[],67,"[https://openalex.org/W155683135, https://open...","[https://openalex.org/W592157012, https://open...",https://api.openalex.org/works/W4210709080/ngrams,"{'Two': [0], 'techniques': [1], 'for': [2, 83]...",https://api.openalex.org/works?filter=cites:W4...,"[{'year': 2023, 'cited_by_count': 8}, {'year':...",2023-08-29T20:40:42.907402,2022-02-08
39,https://openalex.org/W4213387819,https://doi.org/10.1016/j.neuroimage.2022.119009,Patterns of a structural covariance network as...,Patterns of a structural covariance network as...,2022,2022-05-01,{'openalex': 'https://openalex.org/W4213387819...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[],118,"[https://openalex.org/W964942278, https://open...","[https://openalex.org/W1973509935, https://ope...",https://api.openalex.org/works/W4213387819/ngrams,"{'Dispositional': [0], 'optimism': [1, 31, 140...",https://api.openalex.org/works?filter=cites:W4...,"[{'year': 2023, 'cited_by_count': 10}, {'year'...",2023-09-13T13:36:11.114564,2022-02-24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
805,https://openalex.org/W4308346205,https://doi.org/10.1016/j.neuroimage.2022.119704,Ventral tegmental area integrity measured with...,Ventral tegmental area integrity measured with...,2022,2022-12-01,{'openalex': 'https://openalex.org/W4308346205...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[],74,"[https://openalex.org/W1219461777, https://ope...","[https://openalex.org/W1983138582, https://ope...",https://api.openalex.org/works/W4308346205/ngrams,"{'The': [0], 'ventral': [1], 'tegmental': [2],...",https://api.openalex.org/works?filter=cites:W4...,[],2023-09-11T10:36:33.410750,2022-11-11
806,https://openalex.org/W4308432229,https://doi.org/10.1016/j.neuroimage.2022.119739,Group polarization calls for group-level brain...,Group polarization calls for group-level brain...,2022,2022-12-01,{'openalex': 'https://openalex.org/W4308432229...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[{'funder': 'https://openalex.org/F4320321001'...,85,"[https://openalex.org/W1526866867, https://ope...","[https://openalex.org/W348899774, https://open...",https://api.openalex.org/works/W4308432229/ngrams,"{'Group': [0], 'of': [1, 8, 20, 32, 54, 120, 1...",https://api.openalex.org/works?filter=cites:W4...,[],2023-09-01T00:21:26.910969,2022-11-11
807,https://openalex.org/W4308479179,https://doi.org/10.1016/j.neuroimage.2022.119731,Behavioral and neural representation of expect...,Behavioral and neural representation of expect...,2022,2022-12-01,{'openalex': 'https://openalex.org/W4308479179...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[],65,"[https://openalex.org/W1499836032, https://ope...","[https://openalex.org/W1972742627, https://ope...",https://api.openalex.org/works/W4308479179/ngrams,"{'When': [0], 'faced': [1], 'with': [2, 187, 2...",https://api.openalex.org/works?filter=cites:W4...,[],2023-08-30T14:14:24.484562,2022-11-12
809,https://openalex.org/W4309294048,https://doi.org/10.1016/j.neuroimage.2022.119747,Optimising the sensing volume of OPM sensors f...,Optimising the sensing volume of OPM sensors f...,2022,2022-12-01,{'openalex': 'https://openalex.org/W4309294048...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",article,...,[],48,"[https://openalex.org/W1976755501, https://ope...","[https://openalex.org/W1999107116, https://ope...",https://api.openalex.org/works/W4309294048/ngrams,"{'Magnetoencephalography': [0], '(MEG)': [1], ...",https://api.openalex.org/works?filter=cites:W4...,[],2023-09-13T06:30:04.653841,2022-11-25


In [34]:
id1 = df['id'].iloc[0]
response = requests.get(f'https://api.openalex.org/works/{id1}/ngrams')
response

<Response [200]>

In [None]:
# For each ID in filtered_df['id'], do a fulltext.search for the words 'Data and code availability'
# fulltext.search 
# ids.openalex: filtered_df['id'] 

# Define the OpenAlex API base URL
openalex_base_url = 'https://api.openalex.org/works?filter='

# Initialize a list to store the results
results = []

# Iterate through the 'id' column in filtered_df
for article_id in df['id'].iloc[0]:
    # Construct the URL for the fulltext.search endpoint
    fulltext_search_url = openalex_base_url + f'ids.openalex:{article_id}?fulltext.search:Data%and%code%availability'
    #print(fulltext_search_url)
    
    # Make a request to OpenAlex
    response = requests.get(fulltext_search_url)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response and append it to the results list
        results.append(response.json())
    else:
        # Handle any errors or exceptions here
        print(f"Error fetching data for article ID {article_id}")

# results now contains the responses for each article
print(results)

In [27]:
# For each ID in filtered_df['id'], do a fulltext.search for the words 'Data and code availability'
# fulltext.search 
# ids.openalex: filtered_df['id'] 

# Define the OpenAlex API base URL
openalex_base_url = 'https://api.openalex.org/works?filter='

# Initialize a list to store the results
results = []

# Iterate through the 'id' column in filtered_df
for article_id in df['id'].iloc[0]:
    # Construct the URL for the fulltext.search endpoint
    fulltext_search_url = openalex_base_url + f'ids.openalex:{article_id}?fulltext.search:Data%and%code%availability'
    #print(fulltext_search_url)
    
    # Make a request to OpenAlex
    response = requests.get(fulltext_search_url)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response and append it to the results list
        results.append(response.json())
    else:
        # Handle any errors or exceptions here
        print(f"Error fetching data for article ID {article_id}")

# results now contains the responses for each article
print(results)

Error fetching data for article ID h
Error fetching data for article ID t
Error fetching data for article ID t
Error fetching data for article ID p
Error fetching data for article ID s
Error fetching data for article ID :
Error fetching data for article ID /
Error fetching data for article ID /
Error fetching data for article ID o
Error fetching data for article ID p
Error fetching data for article ID e
Error fetching data for article ID n
Error fetching data for article ID a
Error fetching data for article ID l
Error fetching data for article ID e


KeyboardInterrupt: 

# References 

Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
