# Table of contents 
- [Setup](#setup) 
    - [Purpose](#purpose)
    - [Libraries](#libraries)
- [Ground truth URLs and sentences](#groundtruthURLsandsentences)
- [URLs and sentences](#URLsandsentences)
    - [Process URLs](#processURLs)
    - [URLs in NeuroImage 2022 articles](#URLsinNeuroImage2022articles)
- [Text classification using word embeddings](#textclassificationusingwordembeddings) 
    - [SciBERT](#scibert)
    - [Class concepts and class labels](#classconceptsandclasslabels)
        - [Word2Vec](#word2vec)
        - [FastText](#fasttext)
        - [GloVe](#glove)
    - [Sentence classification](#sentenceclassification) 
- [Validate](#validate) 
- [Datasets](#datasets)
- [References](#references) 

<a name='setup'></a>
# 0. Setup 

This notebook contains the code to extract the datasets used in the articles published in NeuroImage in 2022. 
<br>
<br>

<a name='purpose'></a> 
## 0.1. Purpose
The purpose of this notebook is to locate and extract publicly available datasets used for analysis in the research articles published in NeuroImage in 2022. 

The overall steps are: 

- URLs and sentences: Use pypdf to read the PDFs and urlextract to locate and extract the URLs and the sentences containing them
- Text classification using word embeddings: Use a fine-tuned SciBERT to identify the URLs linked to datasets.


<br>
<br>

<a name='libraries'></a>
## 0.2. Libraries 

In [1]:
import pandas as pd
import numpy as np

import json 
import os 
import re 
import io

# Read PDFs
import pypdf 
# Extract URLs from text 
import urlextract 
# Sentence - and thus URL - classification, and related imports 
import gensim.downloader as api
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_numeric, stem_text
from gensim.similarities import WordEmbeddingSimilarityIndex
import nltk 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import string
from transformers import AutoModel, AutoTokenizer 
import torch
# Random 
import random

<a name='groundtruthURLsandsentences'></a>
# 1. Ground truth URLs and sentences 

Based on my exploration of ten randomly picked articles, 75% of the articles contained URLs. Of the articles that did not contain any URLs, the majority either used self-collected data or no datasets at all. This suggests that extracting the URLs will also capture the datasets used for analysis in the articles. 


I will test the functions using the groundtruth texts as my validation set. Manually extracting the URLs from the ten groundtruth texts should yield the following URLs. (NB! Currently, I have not distinguished between links that lead the reader to data and links that lead the reader to code; this will come later.) 


As in my automated processing, I do the following manually: 
- For each article, save only unique URLs (i.e., if the same URL is mentioned more than once, save all of its sentences in one list) 
- Remove the following URLs: 
    - 'www.elsevier.com/locate/neuroimage'
    - URLs containing the DOI of the article
    - Creative Commons licenses
- Columns: 
    - DOI: The article's DOI
    - URL: The URL
    - Sentence: The sentence(s) in which the URL appears. The only cleaning I did was removing extra spaces, e.g., 'under- lay' is changed to 'under-lay'. 
    - Label: The type of resource that the URL and sentence(s) point to and describe, e.g., 'Dataset', 'Software', or 'Code'. 
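The manual rules above can be sketched in a few lines of Python. This is only an illustration: the helper names `keep_url` and `group_sentences_by_url` are hypothetical, the sample pairs are made up, and matching Creative Commons licenses on the substring 'creativecommons.org' is my assumption about what those URLs look like.

```python
# Hypothetical helpers illustrating the manual rules; not part of the pipeline.
# Assumption: Creative Commons license URLs contain 'creativecommons.org'.
EXCLUDED_SUBSTRINGS = ('www.elsevier.com/locate/neuroimage', 'creativecommons.org')


def keep_url(url, article_doi):
    """Return False for URLs that the manual rules exclude."""
    lowered = url.lower()
    # Drop URLs that contain the article's own DOI
    if article_doi.lower() in lowered:
        return False
    # Drop the journal homepage and license URLs
    return not any(part in lowered for part in EXCLUDED_SUBSTRINGS)


def group_sentences_by_url(pairs, article_doi):
    """Collect all sentences for each unique URL, in first-seen order."""
    grouped = {}
    for url, sentence in pairs:
        if keep_url(url, article_doi):
            grouped.setdefault(url, []).append(sentence)
    return grouped


# Made-up (URL, sentence) pairs for one article
pairs = [
    ('osf.io/gazx2/', 'Data are available at osf.io/gazx2/.'),
    ('www.elsevier.com/locate/neuroimage', 'journal homepage: www.elsevier.com/locate/neuroimage'),
    ('https://doi.org/10.1016/j.neuroimage.2022.119443', 'https://doi.org/10.1016/j.neuroimage.2022.119443'),
    ('osf.io/gazx2/', 'See osf.io/gazx2/ for the raw files.'),
    ('http://creativecommons.org/licenses/by/4.0/', 'This is an open access article under the CC BY license.'),
]
grouped = group_sentences_by_url(pairs, '10.1016/j.neuroimage.2022.119443')
print(grouped)
```

Only the OSF URL survives, with both of its sentences stored in one list, which mirrors the one-row-per-URL layout of the groundtruth table.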

In [2]:
# List of groundtruth DOI values to filter 
groundtruth_dois = [
    '10.1016/j.neuroimage.2021.118839',
    '10.1016/j.neuroimage.2021.118854',
    '10.1016/j.neuroimage.2022.119030',
    '10.1016/j.neuroimage.2022.119050',
    '10.1016/j.neuroimage.2022.119240',
    '10.1016/j.neuroimage.2022.119443',
    '10.1016/j.neuroimage.2022.119526',
    '10.1016/j.neuroimage.2022.119549',
    '10.1016/j.neuroimage.2022.119646',
    '10.1016/j.neuroimage.2022.119676',
] 

In [3]:
groundtruth_urls = [
    {
        'DOI': '10.1016/j.neuroimage.2021.118839',
        'URL': 'http://neuroimage.usc.edu/brainstorm',
        'Sentence': ['Subsequently the results were loaded in a Matlab Tool Box, Brainstorm (Tadel et al. 2011), an accredited software freely available for download online under the GNU general public license (http://neuroimage.usc.edu/brainstorm).'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2021.118854',
        'URL': 'https://www.humanconnectome.org/study/hcp-young-adult/data-releases',
        'Sentence': ['We applied our GFA extension to the publicly available resting-state functional MRI (rs-fMRI) and non-imaging measures (e.g., demograph-ics, psychometrics and other behavioural measures) obtained from 1003 subjects (only these had rs-fMRI data available) of the 1200-subject data release of the HCP (https://www.humanconnectome.org/study/hcp-young-adult/data-releases).'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2021.118854',
        'URL': 'https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation',
        'Sentence': ['The data used in this study was downloaded from the Human Connectome Project website (https://www.humanconnectome.org/study/hcp-young-adult/document/extensively-processed-fmri-data-documentation).'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2021.118854',
        'URL': 'https://github.com/ferreirafabio80/gfa',
        'Sentence': ['The GFA models and experiments were implemented in Python 3.9.1 and are available here: https://github.com/ferreirafabio80/gfa.'],
        'Label': 'Model'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'marmosetbrainconnectome.org',
        'Sentence': ['To accelerate such progress, we present the Marmoset Functional Brain Connectivity Resource (marmosetbrainconnectome.org), currently consisting of over 70 h of resting-state fMRI (RS-fMRI) data acquired at 500 μm isotropic resolution from 31 fully awake marmosets in a common stereotactic space.', 'To promote progress in understanding the functional organization of the marmoset brain, we present a resource that allows for online viewing and download of three-dimensional functional connectivity (FC) maps from over 70 h of RS-fMRI collected at ultra-high field from 31 fully awake adult marmosets: marmosetbrainconnectome.org.', 'A resampled ver-sion of this atlas (at 100 μm) allows for additional anatomical detail over the in vivo template but will still load sufficiently fast as an under-lay image on marmosetbrainconnectome.org.', 'Features of the web portal: marmosetbrainconnectome.org.', 'The Marmoset Functional Connectivity Resource is publicly accessi-ble at marmosetbrainconnectome.org.', 'This resource allows users to instantaneously view and use FC topologies from any gray matter voxel in the marmoset brain online (marmosetbrainconnectome.org; Fig. 1), offering a fine-grained (500 μm) insight into how the marmoset brain is functionally con-nected in any given region, utterly agnostic to structural nomenclature.', 'Tracer maps (B & E) were downloaded from marmosetbrain.org, and FC maps (A through H) were generated from marmosetbrainconnectome.org.', 'With all the publicly available data demo-graphics, users can also download specific demographics (e.g., heavy males in late life) to address their research questions. Individual-level topologies can be loaded via the marmosetbrainconnectome.org viewer without any analysis.'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'https://www.marmosetbrainconnectome.org/download.html',
        'Sentence': ['(B) The data download page (https://www.marmosetbrainconnectome.org/download.html) allows the user to download all raw (BIDS standard formated) (Gorgolewski et al., 2016) and pre-processed data.', 'Directing to https://www.marmosetbrainconnectome.org/download.html allows for download of the “raw” structural and functional images (3D Neu-roimaging Informatics Technology Initiative (NIfTI) format) contribut-ing to the FC maps shown in the resource – for convenience, these data are in a standard format (BIDS) (Gorgolewski et al., 2016).', 'All raw and preprocessed data are openly available for download at: https://www.marmosetbrainconnectome.org/download.html'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'https://rii-mango.github.io/Papaya/',
        'Sentence': ['The resource makes use of the Papaya viewer (https://rii-mango.github.io/Papaya/), with several additional features (illustrated in Fig. 1C & D), including (1) calculation of surface over-lay maps on-demand based on the threshold chosen in volume space, (2) the ability to display atlas borders in surface space, (3) support for rotating the underlying volume, overlaying functional connectiv-ity map, and atlas boundaries together – such obliquing of the images can be of utility for presurgical planning, and (4) the ability to choose between group- and subject-level topologies.'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'https://gitlab.com/cfmm/marmoset',
        'Sentence': ['The development of the Marmoset Functional Connectivity Resource is described in full detail at https://gitlab.com/cfmm/marmoset.', 'All code for the online viewer is available at: https://gitlab.com/cfmm/marmoset'],
        'Label': 'Code'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'https://gitlab.com/cfmm/marmoset-connectivity',
        'Sentence': ['Users can also download all code used to generate the functional connectivity maps from https://gitlab.com/cfmm/marmoset-connectivity.', 'A 3D printed model is shown in stereotactic position (with skull cut away to expose the cortical surface) to demonstrate targeting based on the resource coordinates. (D) in method 2, the user employs the supplied code (downloaded from https://gitlab.com/cfmm/marmoset-connectivity) to transform the FC map to their native animals’ anatomical MRI space.', 'All code used for processing data is openly available at: https://gitlab.com/cfmm/marmoset-connectivity'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119030',
        'URL': 'marmosetbrain.org',
        'Sentence': ['Explicitly, we focused on a tracer map from an area 46 injection (left; CJ801-DY; marmosetbrain.org for notes on these injections).', 'To demonstrate the additional information offered by our resource, we systematically plotted connectivity across (within) area TE3 for com-parison with available tracer injections within that region (left; CJ180- CTBr and CJ180-DY; marmosetbrain.org for notes on these injections) (Majka et al., 2020).', 'As shown in Fig. 9, we systematically plotted connectivity across (within) area TE3 and compared available tracer injections within that region (left; CJ180-CTBr and CJ180-DY; marmosetbrain.org for notes on these injections) (Majka et al., 2020).', 'Green labeled ROIs indicate FC data, whereas purple labeled ROIs show where tracer data is publicly available within area TE3 from marmosetbrain.org', 'Accordingly, these re-sources can readily perform similar comparisons in any circuitry of inter-est Fig. 9. shows an example of how our brain-wide functional connec-tivity data can complement existing resources (e.g., marmosetbrain.org) (Majka et al., 2020), demonstrating a gradient of connectivity between cortical tracer injection sites.'],
        'Label': 'Atlas/map'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119050',
        'URL': 'http://audition.ens.fr/adc/NoiseTools/',
        'Sentence': ['EEG analysis used FieldTrip (Oostenveld et al., 2011), Noise-Tools (De Cheveigne and Parra, 2014; http://audition.ens.fr/adc/NoiseTools/), and custom-written scripts in Matlab.'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119050',
        'URL': 'zenodo.org',
        'Sentence': ['Raw EEG data from all healthy individuals, as well as Matlab code, are publicly available on zenodo.org (doi:10.5281/zenodo.6110595).'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'www.cni.stanford.edu',
        'Sentence': ['MRI data were acquired on a 3T Discovery MR750 scanner (Gen-eral Electric Healthcare, Milwaukee, WI, USA) equipped with a 32-channel head coil (Nova Medical, Wilmington, MA, USA) at the Cen-ter for Cognitive and Neurobiological Imaging at Stanford Univer-sity (www.cni.stanford.edu).'],
        'Label': 'Resource'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'http://github.com/vistalab/vistasoft/mrDiffusion',
        'Sentence': ['Diffusion weighted images were pre-processed with Vistasoft (http://github.com/vistalab/vistasoft/mrDiffusion), an open-source software package implemented in MATLAB R2012a (Mathworks, Natick, MA).'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'http://www.fil.ion.ucl.ac.uk/spm/',
        'Sentence': ['Each diffusion weighted image was registered to the mean of the b=0 images and the mean b=0 image was registered automatically to the participant’s T1w image, using a rigid body transformation (imple-mented in SPM8, http://www.fil.ion.ucl.ac.uk/spm/; no warping was applied).'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'https://github.com/mezera/mrQ',
        'Sentence': ['Quantitative T1 (relaxation time, seconds) maps were calculated us-ing mrQ, (https://github.com/mezera/mrQ), an open-source software package implemented in MATLAB R2012a (Mathworks, Natick, MA).'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119240',
        'URL': 'https://github.jyeatman/AFQ',
        'Sentence': ['Automated Fiber Quantification (AFQ; https://github.jyeatman/ AFQ; (Yeatman, Dougherty, Myall, et al., 2012)), a software package implemented in MATLAB R2012a (Mathworks, Natick, MA), was used to isolate and characterize white matter metrics from three dorsal tracts (Arc-L and bilateral SLF) and four ventral white matter tracts (bilat-eral ILF and bilateral UF).'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/gazx2/',
        'Sentence': ['EEG datasets used to create the figure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/ and osf.io/bndjg/.'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/eucqf/',
        'Sentence': ['EEG datasets used to create the figure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/ and osf.io/bndjg/.'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/thsqg/',
        'Sentence': ['EEG datasets used to create the figure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/ and osf.io/bndjg/.'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/bndjg/',
        'Sentence': ['EEG datasets used to create the figure in this commentary are freely available at osf.io/gazx2/, osf.io/eucqf/, osf.io/thsqg/ and osf.io/bndjg/.'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119443',
        'URL': 'osf.io/guwnm/',
        'Sentence': ['Code used to reproduce the plots in Fig. 1, as well as averaged ERP data, is available from osf.io/guwnm/.'],
        'Label': 'Processed dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://db.humanconnectome.org/data/projects/HCP_1200',
        'Sentence': ['200 unrelated subjects were selected from the Human Con-nectome Project (HCP) 1200 Subjects Data Release with avail-able resting (task-free) and task fMRI data from a 3T MRI scan-ner (https://db.humanconnectome.org/data/projects/HCP_1200).', 'Preprocessed task fMRI data for the four tasks from the HCP were analyzed (working memory, motor, language, emotion) (https://db.humanconnectome.org/data/projects/HCP_1200).'],
        'Label': 'Dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://www.humanconnectome.org/study/hcp-young-adult/document/wu-minn-hcp-consortium-open-access-data-use-terms',
        'Sentence': ['This study agreed to the Open Access Data Use Terms (https://www.humanconnectome.org/study/hcp-young-adult/document/wu-minn-hcp-consortium-open-access-data-use-terms) and was exempt from the UCSF IRB because investigators could not readily ascertain the identities of the individuals to whom the data belonged.'],
        'Label': 'Resource'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/',
        'Sentence': ['We used FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/) and AFNI (https://afni.nimh.nih.gov/) for additional fMRI preprocessing.'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://afni.nimh.nih.gov/',
        'Sentence': ['We used FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/) and AFNI (https://afni.nimh.nih.gov/) for additional fMRI preprocessing.'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'http://www.brainnetome.org/',
        'Sentence': ['Maps were averaged within 273 regions of interest by combining a parcella-tion of 210 cortical regions and 36 subcortical regions from the Brainnetome atlas (Fan et al., 2016) (http://www.brainnetome.org/) and 27 cerebellar regions from the SUIT atlas (Diedrichsen, 2006) (http://www.diedrichsenlab.org/imaging/suit.htm).'],
        'Label': 'Atlas/map'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'http://www.diedrichsenlab.org/imaging/suit.htm',
        'Sentence': ['Maps were averaged within 273 regions of interest by combining a parcella-tion of 210 cortical regions and 36 subcortical regions from the Brainnetome atlas (Fan et al., 2016) (http://www.brainnetome.org/) and 27 cerebellar regions from the SUIT atlas (Diedrichsen, 2006) (http://www.diedrichsenlab.org/imaging/suit.htm).'],
        'Label': 'Atlas/map'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://www.fil.ion.ucl.ac.uk/spm/software/spm12/',
        'Sentence': ['Task condition block regressors were convolved with a hemo-dynamic response function using the ‘spm_get_bf’ function in SPM12 (https://www.fil.ion.ucl.ac.uk/spm/software/spm12/).'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://github.com/rmarkello/abagen',
        'Sentence': ['We compared each gradient map to Allen Human Brain spatial gene expression patterns using the ‘abagen’ package (https://github.com/rmarkello/abagen) (Arnatkevici ̆ūtė et al., 2019; Hawrylycz et al., 2012).'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://brainsmash.readthedocs.io/en/latest/',
        'Sentence': ['These surrogate gradient maps were estimated using BrainSMASH (https://brainsmash.readthedocs.io/en/latest/).'],
        'Label': 'Atlas/map'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://sites.google.com/site/bctnet/',
        'Sentence': ['Graph the-ory analyses were run using the Brain Connectivity Toolbox (BCT; https://sites.google.com/site/bctnet/).'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'http://human.brain-map.org/',
        'Sentence': ['Original data was obtained from the Human Connectome Project (1U54MH091657, PIs Van Essen and Ugurbil) and the Allen Hu-man Brain Atlas (http://human.brain-map.org/).'],
        'Label': 'Atlas/map'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119526',
        'URL': 'https://github.com/jbrown81/gradients',
        'Sentence': ['All code (latent space derivation, dynamical system modeling, and gene expression corre-lation) and processed data (gradient maps/region weights, gradient timeseries, and region gene expression values) are available at https://github.com/jbrown81/gradients.', 'All code and processed data are available at https://github.com/jbrown81/gradients.'],
        'Label': 'Processed dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119549',
        'URL': np.nan,
        'Sentence': np.nan,
        'Label': np.nan
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119646',
        'URL': np.nan,
        'Sentence': np.nan,
        'Label': np.nan
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://www.shutterstock.com',
        'Sentence': ['Both experiments employed static images (modified from Shutterstock, https://www.shutterstock.com).'],
        'Label': 'Resource'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://clippingmagic.com',
        'Sentence': ['All image transformations were done with Clipping Magic (https://clippingmagic.com), ImageMagick, GIMP, Microsoft Paint, the MATLAB SHINE toolbox, and custom MATLAB code.'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'http://www.nitrc.org/projects/jip',
        'Sentence': ['Functional volumes were realigned and motion-corrected with the Statistical Parametric Mapping software (SPM12, RRID: SCR_007037), followed by non-rigid co-registration (using JIP, http://www.nitrc.org/projects/jip, RRID: SCR_009588) to the high-resolution anatomical template of the skull-stripped brain of each monkey.'],
        'Label': 'Software'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://caffe.berkeleyvision.org/model_zoo.html',
        'Sentence': ['Another version of pre-trained AlexNet was im-ported from Caffe Model Zoo (https://caffe.berkeleyvision.org/model_zoo.html).'],
        'Label': 'Model'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://osf.io/b8pfa/?view_only=b6dbb5dd6a044989a7eecdc99facb43c',
        'Sentence': ['Preprocessed fMRI data are available at https://osf.io/b8pfa/?view_only=b6dbb5dd6a044989a7eecdc99facb43c.'],
        'Label': 'Preprocessed dataset'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://github.com/Yozafirova/monkey-fMRI-codes',
        'Sentence': ['Codes for the fMRI data analysis at https://github.com/Yozafirova/monkey-fMRI-codes and for the CNN data analysis at https://github.com/RajaniRaman/face_body_integration.'],
        'Label': 'Analysis'
    },
    {
        'DOI': '10.1016/j.neuroimage.2022.119676',
        'URL': 'https://github.com/RajaniRaman/face_body_integration',
        'Sentence': ['Codes for the fMRI data analysis at https://github.com/Yozafirova/monkey-fMRI-codes and for the CNN data analysis at https://github.com/RajaniRaman/face_body_integration.'],
        'Label': 'Analysis'
    },
]

# Convert the list of dictionaries to a DataFrame
manual_groundtruth_urls = pd.DataFrame(groundtruth_urls)

In [4]:
# Path to the 'Data' directory
data_dir = os.path.join(os.pardir, 'Data')

# File path
file_path = os.path.join(data_dir, 'articles_groundtruth_urls_and_sentences.csv')

# Save the DataFrame to CSV, overwriting the file if it exists
manual_groundtruth_urls.to_csv(file_path, index=False, mode='w')

In [5]:
manual_groundtruth_urls.count()

DOI         43
URL         41
Sentence    41
Label       41
dtype: int64

In [6]:
# Check for NaN values in the 'URL' column
manual_groundtruth_urls[manual_groundtruth_urls['URL'].isna()]

                                 DOI  URL Sentence Label
34  10.1016/j.neuroimage.2022.119549  NaN      NaN   NaN
35  10.1016/j.neuroimage.2022.119646  NaN      NaN   NaN


In [7]:
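# 'Label' holds category strings rather than booleans, so comparing it to True always gives 0.
# Count the data-related categories instead (assumed grouping; it reproduces the 20 links reported in the summary).
data_labels = ['Dataset', 'Processed dataset', 'Preprocessed dataset', 'Model', 'Atlas/map']
len(manual_groundtruth_urls[manual_groundtruth_urls['Label'].isin(data_labels)])

20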

In [8]:
# Group by 'DOI' and count the URLs; count() skips NaN, so articles without URLs get 0
url_counts = manual_groundtruth_urls.groupby('DOI')['URL'].count()

# Sort the counts in descending order
url_counts = url_counts.sort_values(ascending=False)

url_counts

DOI
10.1016/j.neuroimage.2022.119526    12
10.1016/j.neuroimage.2022.119676     7
10.1016/j.neuroimage.2022.119030     6
10.1016/j.neuroimage.2022.119240     5
10.1016/j.neuroimage.2022.119443     5
10.1016/j.neuroimage.2021.118854     3
10.1016/j.neuroimage.2022.119050     2
10.1016/j.neuroimage.2021.118839     1
10.1016/j.neuroimage.2022.119549     0
10.1016/j.neuroimage.2022.119646     0
Name: URL, dtype: int64

A total of 41 links were extracted manually from the groundtruth articles. The articles that contain URLs have between one and twelve each; two of the articles did not contain any URLs. Twenty of the 41 links point to data or datasets, or to models pretrained with data. 

<a name='URLsandsentences'></a>
# 2. URLs and sentences
I build on the work of Sourget (2023) to search the PDFs for their datasets. 

I use the Python library *urlextract* by Lipovský (2022) to extract the URLs. 

I perform some initial cleaning of the text extracted from the PDFs: replacing multiple spaces with a single space, removing all \n characters, and removing stray spaces next to a number of special characters, including -, (, ), /, ., and _, as well as the space between : and /.
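To make the effect of this cleaning concrete, here is a minimal sketch that applies a subset of the replacements to a made-up fragment of PDF text. The name `clean_fragment` and the sample string are hypothetical and only serve as an illustration.

```python
# A subset of the replacements described above, applied in order.
REPLACEMENTS = [
    ('\n', ''),     # drop newline characters
    ('- ', '-'),    # rejoin words hyphenated across a line break
    ('( ', '('),    # no space after an opening parenthesis
    (' )', ')'),    # no space before a closing parenthesis
    ('/ ', '/'),    # no space after a slash
    (': /', ':/'),  # repair 'https: //' into 'https://'
]


def clean_fragment(text):
    """Apply each (old, new) replacement in turn."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


# Made-up fragment with typical PDF extraction damage
raw = 'an under- lay image ( https: //osf.io/ abc )'
print(clean_fragment(raw))
```

One trade-off of plain string replacement is that it cannot tell a hyphen left over from a line break apart from a genuine one, so 'under- lay' becomes 'under-lay' rather than 'underlay'.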

The functions: 
- *get_content* is loosely adapted from Sourget (2023), using the following path in the GitHub repository: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
- *clean_text* performs the initial cleaning described above. 
- *get_sentences* splits the cleaned text into sentences. 
- *get_urls* uses the urlextract library (Lipovský 2022) to extract the unique URLs. 
- *get_sentences_with_urls* keeps the sentences that contain a URL. 
- *get_urls_and_sentences* calls *get_urls* and gets both the URLs and the sentences containing each URL. 
- *extract_and_transform_urls_from_dataframe* 
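
As a rough illustration of how sentence splitting and URL matching fit together, the sketch below uses only the standard library. The regular expression is the one used for sentence splitting in this notebook; the sample text is made up, and the hand-written `urls` list stands in for the output of urlextract.

```python
import re

# Split after '.', '!' or '?' followed by whitespace, unless a number
# and then whitespace follow (so 'Fig. 1 and' stays in one sentence).
SENTENCE_PATTERN = r'(?<=[.!?])\s+(?![0-9]+\s)'

# Made-up sample text
text = ('The viewer is shown in Fig. 1 and described below. '
        'Raw data are available at osf.io/gazx2/. '
        'Is the code public? Yes, see osf.io/guwnm/.')

sentences = re.split(SENTENCE_PATTERN, text)

# Stand-in for the URLs that urlextract would return
urls = ['osf.io/gazx2/', 'osf.io/guwnm/']

# Pair each URL with every sentence that contains it (substring match)
matches = {url: [s for s in sentences if url in s] for url in urls}

for url, found in matches.items():
    print(url, '->', found)
```

Because the pairing is a substring match, a URL that is a prefix of another one (for example https://gitlab.com/cfmm/marmoset and https://gitlab.com/cfmm/marmoset-connectivity) also collects the sentences of the longer URL.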
    

<br>

References: 
- Lipovský, J. (2022). urlextract: Collects and extracts URLs from given text. (1.8.0) [Python]. https://github.com/lipoja/URLExtract
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

In [9]:
def get_content(pdf_path, alt_pdf_path):
    """Extract the URLs in a PDF together with the sentences that contain them. 
    This function is loosely adapted from Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
    specifically: DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures'.
    
    Parameters: 
    :param pdf_path (str): Path to the PDF file.
    :param alt_pdf_path (str): Directory to check if the PDF is not found at pdf_path. 
    
    Returns: 
    :return: DataFrame with the columns 'url' and 'sentences'. It holds a single NaN row 
             if the PDF is not found at pdf_path, and is empty if no URLs were found or 
             the PDF could not be read.
    """
    try:
        # The context manager closes the file even if parsing fails
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = pypdf.PdfReader(pdf_file)
            pdf_text = " ".join(page.extract_text() for page in pdf_reader.pages)

        # Extract the URLs and the sentences containing them
        df = get_urls_and_sentences(pdf_text)
        if df is not None:  # Check if a DataFrame is returned
            return df

    except FileNotFoundError:
        try:
            alternative_pdf_path = os.path.join(alt_pdf_path, os.path.basename(pdf_path))
            # Files in the alternative directory are not parsed; opening only confirms that the file exists
            with open(alternative_pdf_path, 'rb'):
                return pd.DataFrame({"url": [np.nan], "sentences": [np.nan]})
        except FileNotFoundError:
            return pd.DataFrame({"url": [np.nan], "sentences": [np.nan]})
        except Exception as e:
            print(f"Error reading PDF: {e}")
    
    except Exception as e:
        print(f"Error reading PDF: {e}")
    
    # If the PDF could not be read, return an empty DataFrame
    return pd.DataFrame(columns=["url", "sentences"])


############### SENTENCES ################################################
def clean_text(text): 
    """This function performs a very simple initial cleaning of the extracted text. 
    This includes replacing multiple spaces with a single space, removing all newline characters, 
    and removing stray spaces next to a number of special characters, 
    incl. -, (, ), /, ., _ , and between : and /. 
    """
    # (old, new) pairs, applied in this order
    replacements = [
        ('   ', ' '), ('  ', ' '), ('\n', ''), ('- ', '-'), ('( ', '('), (' )', ')'),
        ('/ ', '/'), (' /', '/'), (' .', '.'), (': /', ':/'), (' _ ', '_'), (' _', '_'), ('_ ', '_'),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text

def get_sentences(text):
    """This function splits a given text into sentences based on a regular expression pattern. 
    It uses re.split() to identify sentence boundaries, considering common sentence-ending 
    punctuation like ".", "!", or "?". It avoids splitting if the punctuation is followed by 
    a number and then whitespace, e.g., 'Fig. 1 shows'. 
    
    Parameters: 
    :param text (str): The text to split. 
    
    Returns: 
    :return: List of sentences (str). 
    """
    sentence_pattern = r'(?<=[.!?])\s+(?![0-9]+\s)'
    sentences = re.split(sentence_pattern, text)
    return sentences


############### LINKS ################################################
def get_urls(text):
    """This function returns all unique URLs in a text after stripping trailing punctuation 
    ('.', ',', and ')') from the end of each URL. It uses the Python library URLExtract (Lipovský 2022).
    
    Parameters: 
    :param text (str): The text to search for URLs. 
    
    Returns: 
    :return: List of unique URLs (in no particular order). 
    """
    # Instance of the URLExtract class
    extractor = urlextract.URLExtract()
    
    # Create a set to store unique URLs
    unique_urls = set()
    
    for url in extractor.gen_urls(text):
        # Strip any run of trailing '.', ',' and ')' characters, regardless of their order
        processed_url = url.rstrip('.,)')
        unique_urls.add(processed_url)
    
    # Convert the set back to a list
    return list(unique_urls)

def get_sentences_with_urls(sentences):
    """This function returns the sentences that contain URLs, stopping after the first sentence 
    that contains "References" so that the reference list is skipped. 
    It uses the Python library URLExtract (Lipovský 2022).
    """
    # Instance of the URLExtract class
    extractor = urlextract.URLExtract()
    # Extract all sentences with URLs
    sentences_with_urls = []

    for sentence in sentences:
        if extractor.has_urls(sentence):
            sentences_with_urls.append(sentence)
        if "References" in sentence:
            break  # Stop once "References" is found; this sentence itself has already been checked

    return sentences_with_urls
    
def get_urls_and_sentences(text):
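    """This function cleans a text and returns a DataFrame with one row per URL: the column 
    'url' holds the URL and 'sentences' holds the list of sentences that contain it. 
    URLs without a matching sentence (for example, URLs that only appear after "References") 
    are dropped.
    """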
    # Lists to store the extracted URLs and their corresponding sentences
    url_list = []
    sentence_list = []
    # Clean sentences 
    cleaned_text = clean_text(text)
    # Extract links
    links = get_urls(cleaned_text)
    # Extract sentences 
    sentences = get_sentences(cleaned_text)
    # Extract sentences with links 
    sentences_w_links = get_sentences_with_urls(sentences)

    # Process each URL 
    for link in links:
        sentences_for_url = [sentence for sentence in sentences_w_links if link in sentence]
        if sentences_for_url:  # Only add the URL if there are associated sentences
            url_list.append(link)
            sentence_list.append(sentences_for_url)
    
    # If no URLs were found, return an empty DataFrame
    if not url_list:
        return pd.DataFrame(columns=["url", "sentences"])

    return pd.DataFrame({"url": url_list, "sentences": sentence_list})


############### DATAFRAME WITH DOI, URL, SENTENCES ################################################
def process_groundtruth_DOIs(groundtruth_dois, articles_directory, editorialboard_directory, json_file_path):
    """This function processes a list of DOIs, extracts URLs and sentences from the PDFs, 
    and creates a DataFrame.

    Parameters:
    :param groundtruth_dois (list): List of DOIs to process.
    :param articles_directory (str): Path to the directory containing PDF articles.
    :param editorialboard_directory (str): Path to the directory with editorial board articles.
    :param json_file_path (str): Path to a JSON file.

    Returns:
    :return pd.DataFrame: A DataFrame containing processed information from the DOIs.
    """
    results_list = []

    # Note: json_file_path is not read here; the DOIs come from groundtruth_dois.
    for doi in groundtruth_dois:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(articles_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        url_df = get_content(pdf_path, editorialboard_directory)

        # Append the DOI to the URL DataFrame
        url_df['DOI'] = doi
        results_list.append(url_df)

    # To process just one DOI, pass a single-element list, e.g.
    # groundtruth_dois = ['10.1016/j.neuroimage.2022.119030']
            
    # Concatenate the list of DataFrames into a single DataFrame
    results_df = pd.concat(results_list, ignore_index=True)

    # Rename the columns as needed
    results_df.rename(columns={'url': 'URL', 'sentences': 'Sentences'}, inplace=True)

    return results_df


def process_DOIs(articles_directory, editorialboard_directory, json_file_path):
    """This function reads the DOIs from a JSON file, extracts URLs and sentences from the PDFs, 
    and creates a DataFrame.

    Parameters:
    :param articles_directory (str): Path to the directory containing PDF articles.
    :param editorialboard_directory (str): Path to the directory with editorial board articles.
    :param json_file_path (str): Path to a JSON file.

    Returns:
    :return pd.DataFrame: A DataFrame containing processed information from the DOIs.
    """
    results_list = []

    with open(json_file_path, 'r') as json_file:
        doi_data = json.load(json_file)
        for doi in doi_data['DOIs']:
            doi_replaced = doi.replace('/', '.')
            pdf_path = os.path.join(articles_directory, f"{doi_replaced}.pdf")

            # Call the get_content function for each DOI
            url_df = get_content(pdf_path, editorialboard_directory)

            # Append the DOI to the URL DataFrame
            url_df['DOI'] = doi
            results_list.append(url_df)
            
    # Concatenate the list of DataFrames into a single DataFrame
    results_df = pd.concat(results_list, ignore_index=True)

    # Rename the columns as needed
    results_df.rename(columns={'url': 'URL', 'sentences': 'Sentences'}, inplace=True)

    return results_df

In [10]:
# Path to the directory containing PDFs
articles_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'
editorialboard_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

In [11]:
automatic_groundtruth_df = process_groundtruth_DOIs(groundtruth_dois, articles_directory, editorialboard_directory, json_file_path)

In [12]:
automatic_groundtruth_df

Unnamed: 0,URL,Sentences,DOI
0,https://doi.org/10.1016/j.neuroimage.2021.118839,[Corticospinal projections has been shown also...,10.1016/j.neuroimage.2021.118839
1,www.elsevier.com/locate/neuroimage,[NeuroImage 248 (2022) 118839 Contents lists a...,10.1016/j.neuroimage.2021.118839
2,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2021.118839
3,http://neuroimage.usc.edu/brainstorm,"[2011), an accredited software freely availabl...",10.1016/j.neuroimage.2021.118839
4,http://creativecommons.org/licenses/by/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2021.118854
...,...,...,...
65,https://github.com/Yozaﬁrova/monkey-fMRI-codes,[Codes for the fMRI data analysis at https://g...,10.1016/j.neuroimage.2022.119676
66,https://github.com/RajaniRaman/face_body_integ...,[Codes for the fMRI data analysis at https://g...,10.1016/j.neuroimage.2022.119676
67,https://clippingmagic.com,[NeuroImage 264 (2022) 119676 All image transf...,10.1016/j.neuroimage.2022.119676
68,https://caﬀe.berkeleyvision.org/model_zoo.html,[Another version of pre-trained AlexNet was im...,10.1016/j.neuroimage.2022.119676


Brief exploration of the URLs extracted from the groundtruth articles. 

In [13]:
automatic_groundtruth_df.count()

URL          70
Sentences    70
DOI          70
dtype: int64

In [14]:
len(automatic_groundtruth_df)

70

In [15]:
# How many URLs are saved per DOI
doi_counts = automatic_groundtruth_df['DOI'].value_counts()

# Print the number of rows for each unique DOI
print(doi_counts)

10.1016/j.neuroimage.2022.119526    15
10.1016/j.neuroimage.2022.119676    10
10.1016/j.neuroimage.2022.119030     9
10.1016/j.neuroimage.2022.119443     8
10.1016/j.neuroimage.2022.119240     7
10.1016/j.neuroimage.2021.118854     6
10.1016/j.neuroimage.2022.119050     5
10.1016/j.neuroimage.2021.118839     4
10.1016/j.neuroimage.2022.119549     3
10.1016/j.neuroimage.2022.119646     3
Name: DOI, dtype: int64


In [16]:
# Count the rows with URLs containing 'creativecommons.org'
cc_license_count = len(automatic_groundtruth_df[automatic_groundtruth_df['URL'].str.contains('creativecommons.org', regex=False)])

# Print the count
print(f"Number of rows with links to Creative Commons license: {cc_license_count}")

Number of rows with links to Creative Commons license: 10


<a name='processURLs'></a>
## 2.1. Process URLs

Before this point, I had already applied two preprocessing steps to the URLs: 
- In *get_urls*, I returned only unique URLs. 
- In *process_url*, I stripped the characters '.', ')' and ',' from the end of each link. 

At this point, I have only unique URLs for each DOI. I also want to remove URLs that I know do not point to datasets. As such, this processing step is: 
* Remove URLs: 
    * 'www.elsevier.com/locate/neuroimage' - this link is placed outside of the article's text. 
    * URLs containing the DOI of the article - this link is also placed outside of the article's text. In the code, this is implemented as removing any URL that contains the string 'doi'. 
    * Creative Commons license links. 
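
The removal rules listed above can be sketched as a single predicate. This is only an illustration under stated assumptions: `should_remove` and `BOILERPLATE_URLS` are hypothetical names that are not part of the notebook's pipeline, and the DOI rule here matches only the article's own DOI, which is narrower than a substring match on 'doi'. The sample rows are taken from the first article in the table above.

```python
# Hypothetical helper illustrating the three removal rules; not part of the pipeline.
BOILERPLATE_URLS = {"www.elsevier.com/locate/neuroimage"}

def should_remove(url, doi):
    """Return True if the URL is journal boilerplate rather than a link from the article text."""
    if url in BOILERPLATE_URLS:
        return True
    if "creativecommons.org" in url:
        return True
    # The article's own DOI link, e.g. https://doi.org/<doi>
    return doi.lower() in url.lower()

doi = "10.1016/j.neuroimage.2021.118839"
urls = [
    "https://doi.org/10.1016/j.neuroimage.2021.118839",
    "www.elsevier.com/locate/neuroimage",
    "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "http://neuroimage.usc.edu/brainstorm",
]
kept = [url for url in urls if not should_remove(url, doi)]
print(kept)
```

Only the Brainstorm link survives, which agrees with the single URL left for this article after filtering.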

I also want to check which links are shared between articles, to see whether there are further NeuroImage- or Elsevier-specific links that can be removed. 

In [17]:
# Find the rows whose URL occurs more than once, indicating which DOIs share the same URL. 
# subset='URL' checks only the 'URL' column for duplicates, and keep=False keeps all occurrences of the duplicates.
duplicate_urls = automatic_groundtruth_df[automatic_groundtruth_df.duplicated(subset='URL', keep=False)]

# Print the rows with duplicate URLs
duplicate_urls

Unnamed: 0,URL,Sentences,DOI
1,www.elsevier.com/locate/neuroimage,[NeuroImage 248 (2022) 118839 Contents lists a...,10.1016/j.neuroimage.2021.118839
2,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2021.118839
4,http://creativecommons.org/licenses/by/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2021.118854
7,www.elsevier.com/locate/neuroimage,[NeuroImage 249 (2022) 118854 Contents lists a...,10.1016/j.neuroimage.2021.118854
15,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2022.119030
18,www.elsevier.com/locate/neuroimage,[NeuroImage 252 (2022) 119030 Contents lists a...,10.1016/j.neuroimage.2022.119030
21,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2022.119050
23,www.elsevier.com/locate/neuroimage,[NeuroImage 253 (2022) 119050 Contents lists a...,10.1016/j.neuroimage.2022.119050
26,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2022.119240
28,www.elsevier.com/locate/neuroimage,[NeuroImage 256 (2022) 119240 Contents lists a...,10.1016/j.neuroimage.2022.119240


Remove links: 

In [18]:
def filter_urls(df, urls_to_remove):
    """Remove URLs that are known not to point to datasets.

    Drops the URLs listed in urls_to_remove, Creative Commons license links, and
    URLs containing the string 'doi'. DOIs that are left without any URL are kept
    as a single row with NaN in 'URL' and 'Sentences'.
    """
    filtered_df = df.copy()
    # Collect unique DOIs 
    unique_dois = filtered_df['DOI'].unique()
    # Ensure 'URL' and 'DOI' columns are of string type
    filtered_df['URL'] = filtered_df['URL'].astype(str)
    filtered_df['DOI'] = filtered_df['DOI'].astype(str)
    
    # Remove URLs from list of URLs to remove 
    filtered_df = filtered_df[~filtered_df['URL'].isin(urls_to_remove)]

    # Remove URLs referring to the Creative Commons license 
    filtered_df = filtered_df[~filtered_df['URL'].str.contains('creativecommons.org', regex=False)]
    
    # Remove URLs containing 'doi' (this covers the article's own DOI link,
    # but also any other URL that contains the string 'doi')
    filtered_df = filtered_df[~filtered_df['URL'].str.contains('doi', regex=False)]
    
    # Check if all unique DOIs are still present; if not, add them back with NaN values.
    # pd.concat replaces DataFrame.append, which is deprecated and removed in pandas 2.0.
    missing_dois = set(unique_dois) - set(filtered_df['DOI'].unique())
    if missing_dois:
        missing_rows = pd.DataFrame({'DOI': list(missing_dois), 'URL': np.nan, 'Sentences': np.nan})
        filtered_df = pd.concat([filtered_df, missing_rows], ignore_index=True)
    
    return filtered_df

In [19]:
# URLs to remove
urls_to_remove = [
    'www.elsevier.com/locate/neuroimage',  # URL to remove
]

automatic_groundtruth_urls = filter_urls(automatic_groundtruth_df, urls_to_remove)



In [20]:
automatic_groundtruth_urls.count()

URL          40
Sentences    40
DOI          42
dtype: int64

In [21]:
len(automatic_groundtruth_urls)

42

In [22]:
# How many URLs are saved per DOI
doi_counts = automatic_groundtruth_urls.groupby('DOI')['URL'].count().fillna(0) 

doi_counts = doi_counts.sort_values(ascending=False)

doi_counts

DOI
10.1016/j.neuroimage.2022.119526    12
10.1016/j.neuroimage.2022.119676     7
10.1016/j.neuroimage.2022.119030     6
10.1016/j.neuroimage.2022.119443     5
10.1016/j.neuroimage.2022.119240     4
10.1016/j.neuroimage.2021.118854     3
10.1016/j.neuroimage.2022.119050     2
10.1016/j.neuroimage.2021.118839     1
10.1016/j.neuroimage.2022.119549     0
10.1016/j.neuroimage.2022.119646     0
Name: URL, dtype: int64

The automatic URL extraction found a total of 40 links across the ten articles. 

In [23]:
# Calculate the average number of URLs per DOI
average_urls_per_doi = doi_counts.mean()
print("Average URLs per DOI:", average_urls_per_doi)

Average URLs per DOI: 4.0


The counts do not fully match my manual exploration. 
- 10.1016/j.neuroimage.2022.119240 has one URL fewer than my manual count; the missing link is 'https://github.jyeatman/AFQ'. 

In [24]:
manual_groundtruth_urls[manual_groundtruth_urls['DOI'] == '10.1016/j.neuroimage.2022.119240']

Unnamed: 0,DOI,URL,Sentence,Label
12,10.1016/j.neuroimage.2022.119240,www.cni.stanford.edu,[MRI data were acquired on a 3T Discovery MR75...,Resource
13,10.1016/j.neuroimage.2022.119240,http://github.com/vistalab/vistasoft/mrDiffusion,[Diffusion weighted images were pre-processed ...,Software
14,10.1016/j.neuroimage.2022.119240,http://www.fil.ion.ucl.ac.uk/spm/,[Each diffusion weighted image was registered ...,Software
15,10.1016/j.neuroimage.2022.119240,https://github.com/mezera/mrQ,"[Quantitative T1 (relaxation time, seconds) ma...",Software
16,10.1016/j.neuroimage.2022.119240,https://github.jyeatman/AFQ,[Automated Fiber Quantification (AFQ; https://...,Software


In [25]:
automatic_groundtruth_urls[automatic_groundtruth_urls['DOI'] == '10.1016/j.neuroimage.2022.119240']

Unnamed: 0,URL,Sentences,DOI
12,http://github.com/vistalab/vistasoft/mrDiﬀusion,[Diﬀusion weighted images were pre-processed w...,10.1016/j.neuroimage.2022.119240
13,https://github.com/mezera/mrQ,"[Quantitative T1 (relaxation time, seconds) ma...",10.1016/j.neuroimage.2022.119240
14,http://www.ﬁl.ion.ucl.ac.uk/spm/,[Each diﬀusion weighted image was registered t...,10.1016/j.neuroimage.2022.119240
15,www.cni.stanford.edu,[MRI data acquisition and processing MRI data ...,10.1016/j.neuroimage.2022.119240


Looking at the links before filtering shows that this link was not caught by the extraction in the first place: 

In [26]:
automatic_groundtruth_df[automatic_groundtruth_df['DOI'] == '10.1016/j.neuroimage.2022.119240']

Unnamed: 0,URL,Sentences,DOI
24,http://github.com/vistalab/vistasoft/mrDiﬀusion,[Diﬀusion weighted images were pre-processed w...,10.1016/j.neuroimage.2022.119240
25,https://github.com/mezera/mrQ,"[Quantitative T1 (relaxation time, seconds) ma...",10.1016/j.neuroimage.2022.119240
26,http://creativecommons.org/licenses/by-nc-nd/4.0/,[This is an open access article under the CC B...,10.1016/j.neuroimage.2022.119240
27,http://www.ﬁl.ion.ucl.ac.uk/spm/,[Each diﬀusion weighted image was registered t...,10.1016/j.neuroimage.2022.119240
28,www.elsevier.com/locate/neuroimage,[NeuroImage 256 (2022) 119240 Contents lists a...,10.1016/j.neuroimage.2022.119240
29,www.cni.stanford.edu,[MRI data acquisition and processing MRI data ...,10.1016/j.neuroimage.2022.119240
30,https://doi.org/10.1016/j.neuroimage.2022.119240,[The sample included FT and PT children across...,10.1016/j.neuroimage.2022.119240


Compared to the manual extraction, the code picked up 39 of the 40 links. I will move on with the current code and extract all URLs from the entire corpus of NeuroImage 2022 articles. 
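
A comparison like this can be made reproducible by diffing the two URL lists for a DOI. The sketch below is an illustration under assumptions: `normalize_url` and `missing_urls` are hypothetical helpers, not part of the notebook. The NFKC normalization is included because the PDF extraction left typographic ligatures (U+FB00 and U+FB01) in some automatically extracted URLs, so a raw string comparison would report false mismatches.

```python
import unicodedata

def normalize_url(url):
    """Fold PDF ligatures (e.g. U+FB01 -> 'fi'), case and trailing slashes before comparing."""
    return unicodedata.normalize("NFKC", url).strip().rstrip("/").lower()

def missing_urls(manual, automatic):
    """Return the URLs present in the manual list but absent from the automatic one."""
    auto = {normalize_url(u) for u in automatic}
    return {u for u in manual if normalize_url(u) not in auto}

# URLs for 10.1016/j.neuroimage.2022.119240, copied from the two tables above.
manual = [
    "www.cni.stanford.edu",
    "http://github.com/vistalab/vistasoft/mrDiffusion",
    "http://www.fil.ion.ucl.ac.uk/spm/",
    "https://github.com/mezera/mrQ",
    "https://github.jyeatman/AFQ",
]
automatic = [
    "http://github.com/vistalab/vistasoft/mrDi\ufb00usion",  # contains the 'ff' ligature
    "https://github.com/mezera/mrQ",
    "http://www.\ufb01l.ion.ucl.ac.uk/spm/",                 # contains the 'fi' ligature
    "www.cni.stanford.edu",
]
print(missing_urls(manual, automatic))
```

With the ligatures folded, the AFQ link is the only manual URL without an automatic counterpart.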

<a name='URLsinNeuroImage2022articles'></a>
## 2.2. URLs in NeuroImage 2022 articles 

Some expectations based on the ground truth sample: 
- 4 URLs per article on average after filtering (ranging from 0 to 12). 
- Two out of ten articles will not contain any URLs (after filtering). 
    - 20% of the articles seems quite high. 
    - 20% of 834 is 166.8, i.e., about 167 articles. 
- With 834 articles in total, based on the average: 
    - Before filtering: seven URLs per article on average (the average of four plus at least three URLs that I filter out), i.e., 5,838 URLs. 
    - After filtering: 3,336 URLs. 
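
The arithmetic behind these expectations can be written out as a short calculation. The variable names are illustrative; the inputs (834 articles, an average of four URLs after filtering, three boilerplate URLs per article, and two of ten articles without URLs) are the figures stated above.

```python
n_articles = 834                # articles in the corpus
avg_after_filter = 4            # ground-truth average URLs per article after filtering
boilerplate_per_article = 3     # Elsevier link, Creative Commons license, article DOI
share_without_urls = 2 / 10     # two of the ten ground-truth articles had no URLs left

expected_without_urls = share_without_urls * n_articles
expected_before = (avg_after_filter + boilerplate_per_article) * n_articles
expected_after = avg_after_filter * n_articles

print(f"Articles without URLs: {expected_without_urls:.1f}")
print(f"URLs before filtering: {expected_before:,}")
print(f"URLs after filtering:  {expected_after:,}")
```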

In [27]:
# Path to the directory containing PDFs
articles_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_articles_doi/'
editorialboard_directory = '../Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi/'

# Path to the JSON file containing DOI values
json_file_path = '../Data/ElsevierAPI/downloadedPDFs_info.json'

# URLs to remove
urls_to_remove = [
    'www.elsevier.com/locate/neuroimage',  # URL to remove
]

In [28]:
# Last run on October 29th
# Extract all URLs in NeuroImage 2022 articles 
#df = process_DOIs(articles_directory, editorialboard_directory, json_file_path)

In [29]:
# Filter the extracted URLs
#filtered_df = filter_urls(df, urls_to_remove)

In [30]:
# Save the URLs (filtered and unfiltered) to csv 
# The file path
#path_all_urls = os.path.join(os.pardir, 'Data/articles_all_urls.csv')
#path_filtered_urls = os.path.join(os.pardir, 'Data/articles_filtered_urls.csv')

# Save the DataFrame to CSV, overwriting the file if it exists
#df.to_csv(path_all_urls, index=False, mode='w')
#filtered_df.to_csv(path_filtered_urls, index=False, mode='w')

In [31]:
path_all_urls = os.path.join(os.pardir, 'Data/articles_all_urls.csv')
path_filtered_urls = os.path.join(os.pardir, 'Data/articles_filtered_urls.csv')
all_urls = pd.read_csv(path_all_urls)
filtered_urls = pd.read_csv(path_filtered_urls)

In [32]:
# List all the Editorial Board DOIs to exclude them 
path = os.path.join(os.pardir,'Data/ElsevierAPI/downloaded_pdfs/fulltext_editorialboard_doi')
files = os.listdir(path)

# Extract the DOIs from the file names
editorial_dois = [file.split('_')[0] for file in files]
# Turn the file-name form back into a DOI: drop the extension and restore the '/' before the suffix
editorial_dois = [doi.replace('.pdf', '').replace('.S', '/S') for doi in editorial_dois]

# Exclude the Editorial Board articles 
all_urls = all_urls[~all_urls['DOI'].isin(editorial_dois)]
filtered_urls = filtered_urls[~filtered_urls['DOI'].isin(editorial_dois)]

In [33]:
print("Unique DOIs: ", len(all_urls['DOI'].unique()))
print("Extracted URLs (excl. NaN values): ", len(all_urls[~all_urls['URL'].isna()]))
count_of_nan_dois = all_urls['URL'].isna().sum()
print("Count of DOIs with NaN values in the 'URL' column:", count_of_nan_dois)

urls_elsevier = len(all_urls[all_urls['URL'].str.contains('www.elsevier.com/locate/neuroimage')])
urls_creativecommons = len(all_urls[all_urls['URL'].str.contains('creativecommons.org')])
urls_doi = len(all_urls[all_urls['URL'].str.contains('doi')])
print("'www.elsevier.com/locate/neuroimage': ", urls_elsevier)
print('creativecommons.org: ', urls_creativecommons)
print('doi: ', urls_doi)
print("URLs to be filtered out: ", urls_elsevier+urls_creativecommons+urls_doi)

Unique DOIs:  815
Extracted URLs (excl. NaN values):  5382
Count of DOIs with NaN values in the 'URL' column: 0
'www.elsevier.com/locate/neuroimage':  815
creativecommons.org:  815
doi:  891
URLs to be filtered out:  2521


In [34]:
all_urls[all_urls['URL'].isna()]

Unnamed: 0,URL,Sentences,DOI


In [35]:
print("Unique DOIs: ", len(filtered_urls['DOI'].unique()))
print("Extracted URLs (excl. NaN values): ", len(filtered_urls[~filtered_urls['URL'].isna()]))
count_of_nan_dois = filtered_urls['URL'].isna().sum()
print("Count of DOIs with NaN values in the 'URL' column:", count_of_nan_dois)

Unique DOIs:  815
Extracted URLs (excl. NaN values):  2861
Count of DOIs with NaN values in the 'URL' column: 122


In [36]:
filtered_urls[filtered_urls['URL'].isna()]

Unnamed: 0,URL,Sentences,DOI
2880,,,10.1016/j.neuroimage.2022.119207
2881,,,10.1016/j.neuroimage.2022.119328
2882,,,10.1016/j.neuroimage.2022.119643
2883,,,10.1016/j.neuroimage.2021.118840
2884,,,10.1016/j.neuroimage.2022.118982
...,...,...,...
2997,,,10.1016/j.neuroimage.2022.119406
2998,,,10.1016/j.neuroimage.2022.119058
2999,,,10.1016/j.neuroimage.2022.119633
3000,,,10.1016/j.neuroimage.2022.119137


There are 122 articles that do not contain any URLs after filtering (excluding the 'Editorial Board' articles). 

<a name='textclassificationusingwordembeddings'></a>
# 3. Text classification using word embeddings

As my overarching goal is to identify and extract the datasets, I can classify the sentences that contain a URL based on the assumption that the sentences will somehow reflect what the URL contains. 

Inspired by the work of Halford (2020), I perform text classification using word embeddings (Birunda & Devi (2021), Kosar et al. (2022), Haj-Yahia et al. (2019)), but I use the pre-trained SciBERT model (Beltagy et al. 2019) and fine-tune it using my own manually labeled data. 

<a name='scibert'></a>
## 3.1. Fine-tuning SciBERT 

SciBERT (Beltagy et al. 2019) is a BERT model (Devlin et al. 2019) that is trained on the full text of scientific papers from semanticscholar.org. I use Hugging Face's Transformers (Wolf et al. 2020) and Datasets (Lhoest et al. 2021) libraries to download and fine-tune SciBERT and to classify my sentences. 

Together with three annotators, I manually labeled 129 sentences, which are used to fine-tune SciBERT. I use a function from scikit-learn (Pedregosa et al. 2011) to split my data into training, validation, and test sets. After fine-tuning the model, I can then classify the rest of the sentences to get a list of all sentences - and thereby URLs - that link to datasets. 

I follow one guide in particular: 
- Fine-tuning BERT (and friends) for multi-label text classification. (n.d.). Retrieved November 9, 2023, from https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb#scrollTo=hiloh9eMK91o

I also read and found the following useful: 
- Choudhary, R. (2021, December 29). Fine-Tuning Bert for Tweets Classification ft. Hugging Face. MLearning.Ai. https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf
- Text classification. (n.d.). 🤗 HuggingFace - Transformers. Retrieved October 29, 2023, from https://huggingface.co/docs/transformers/tasks/sequence_classification

In [37]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding, EvalPrediction
from datasets import Dataset, DatasetDict 
import evaluate
import torch



In [38]:
directory_path = '../Data/QA_manually_labeled_data'
file_path = os.path.join(directory_path, 'labeled_data.csv')
labeled_data = pd.read_csv(file_path)

In [39]:
labeled_data = labeled_data.rename(columns={"Original_sentence": "sentence", "True_label": "label"})

In [40]:
# Add an ID column (assuming 'labeled_data' is a DataFrame)
labeled_data['ID'] = labeled_data.index

In [41]:
# Map the full label names to the short codes used as column names
label_mapping = {
    "Analysis": 'ana',
    "Atlas/map": 'at_map',
    "Dataset": 'data',
    "Model": 'mod',
    "Not a URL": 'no_url',
    "Not enough information": 'no_info',
    "Person or institution": 'pers_inst',
    "Processed dataset": 'pro_data',
    "Resource": 'res',
    "Software, incl. plugins, toolbox, packages, and functions": 'soft'
}

I need to one-hot encode the labels so that the model can read the multi-label targets: each sentence gets one 0/1 column per label. 
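
As a minimal sketch of this encoding, the function below turns a label string into a list of 0/1 flags. It assumes that each label string holds one or more short codes separated by commas; the stored format is not shown in this notebook, so `multi_hot` and its `sep` parameter are hypothetical. Matching on exact tokens avoids collisions between codes where one is a substring of another, such as 'data' inside 'pro_data'.

```python
# Short codes, in the same order as in label_mapping.
LABEL_CODES = ["ana", "at_map", "data", "mod", "no_url", "no_info",
               "pers_inst", "pro_data", "res", "soft"]

def multi_hot(label_string, codes=LABEL_CODES, sep=","):
    """Encode a delimited label string as a list of 0/1 flags, one per code.

    Assumes labels are stored as short codes separated by `sep` (hypothetical format).
    """
    present = {token.strip() for token in label_string.split(sep)}
    return [int(code in present) for code in codes]

print(multi_hot("pro_data"))       # only the 'pro_data' position is set
print(multi_hot("at_map, soft"))   # two positions are set
```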

In [42]:
# One-hot encode the labels based on the codes in 'label_mapping'.
# Note: this is a substring test, so a code that is contained in another code
# (e.g. 'data' in 'pro_data') is flagged as well.
for label_code in label_mapping.values():
    labeled_data[label_code] = labeled_data['label'].apply(lambda x: int(label_code in x))

# Drop the original 'label' column
labeled_data = labeled_data.drop(columns=['label'])

In [43]:
labeled_data

Unnamed: 0,sentence,ID,ana,at_map,data,mod,no_url,no_info,pers_inst,pro_data,res,soft
0,"In addition, as introduced in Section 3.1.3.1 ...",0,0,0,0,0,1,0,0,0,0,0
1,"Vries, I.E.J.de, Driel, J.van, Olivers, C.N.L....",1,0,0,0,0,1,0,0,0,0,0
2,All electrode coordinates and labels were save...,2,0,0,0,0,0,0,0,0,1,0
3,"Wen, J., Thibeau-Sutre, E., Diaz-Melo, M., Sam...",3,0,0,0,0,1,0,0,0,0,0
4,∗ Data used in preparation of this article wer...,4,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
117,Funding JD was funded by the Rennes Clinical N...,117,0,0,0,0,0,0,1,0,0,0
118,We used MRIcroGL (www.mccauslandcenter.sc.edu/...,118,0,0,0,0,0,0,0,0,0,1
119,The 3D ﬁgure was realized using BrainNet viewe...,119,0,0,0,0,0,0,0,0,0,1
120,Seed-based d mapping The SDM-PSI (www.sdmproje...,120,0,0,0,0,0,0,0,0,0,1


In [44]:
# Features (ID and sentence) and one-hot encoded label columns
X = labeled_data[['ID', 'sentence']]
y = labeled_data[['ana', 'at_map', 'data', 'mod', 'no_url', 'no_info', 'pers_inst', 'pro_data', 'res', 'soft']]

# Split the data into training (80%), validation (10%), and test (10%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Reset the index before concatenation
X_train.reset_index(drop=True, inplace=True)
X_val.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_val.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

# Create Datasets
train_dataset = Dataset.from_pandas(pd.concat([X_train, y_train], axis=1))
val_dataset = Dataset.from_pandas(pd.concat([X_val, y_val], axis=1))
test_dataset = Dataset.from_pandas(pd.concat([X_test, y_test], axis=1))

# Combine the datasets into a DatasetDict
dataset = DatasetDict({'train': train_dataset, 'test': test_dataset, 'validation': val_dataset})

In [45]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'sentence', 'ana', 'at_map', 'data', 'mod', 'no_url', 'no_info', 'pers_inst', 'pro_data', 'res', 'soft'],
        num_rows: 97
    })
    test: Dataset({
        features: ['ID', 'sentence', 'ana', 'at_map', 'data', 'mod', 'no_url', 'no_info', 'pers_inst', 'pro_data', 'res', 'soft'],
        num_rows: 13
    })
    validation: Dataset({
        features: ['ID', 'sentence', 'ana', 'at_map', 'data', 'mod', 'no_url', 'no_info', 'pers_inst', 'pro_data', 'res', 'soft'],
        num_rows: 12
    })
})
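
The split sizes above follow from how a fractional test size is turned into whole samples. The sketch below assumes that scikit-learn rounds the test share up and gives the remainder to the training side, and that the labeled data has 122 rows (the sum of the three splits shown above). `split_sizes` is a hypothetical helper for illustration only.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up to whole samples."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_temp = split_sizes(122, 0.2)     # first split: training vs. held-out rows
n_val, n_test = split_sizes(n_temp, 0.5)    # second split: validation vs. test rows
print(n_train, n_val, n_test)
```

Under these assumptions the sizes come out as 97, 12 and 13, matching the `DatasetDict` shown above.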

In [46]:
labels = [label for label in dataset['train'].features.keys() if label not in ['ID', 'sentence', 'label']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['ana',
 'at_map',
 'data',
 'mod',
 'no_url',
 'no_info',
 'pers_inst',
 'pro_data',
 'res',
 'soft']

In [47]:
# Load SciBERT with uncased scivocab 
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')

In [48]:
def preprocess_data(examples):
  # take a batch of texts
  text = examples["sentence"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=128)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  return encoding

In [49]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)


In [50]:
example = encoded_dataset['train'][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [51]:
tokenizer.decode(example['input_ids'])

'[CLS] they were based on several image processing pipelines, us - ing brainvisa ( riviere et al., 2011, https : / / brainvisa. info / web / ) and freesurfer ( http : / / surfer. nmr. mgh. harvard. edu / ), that were built to : 1 ) compute anatomical models from the structural mri preoperative se - quence, 2 ) normalize this sequence on mni template, 3 ) coregister pre - and post - operative sequences in the patient native space with the struc - tural preoperative mri as reference, using a block matching algorithm, 4 [SEP]'

In [52]:
example['labels']

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

In [53]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['soft']

In [54]:
encoded_dataset.set_format("torch")

In [55]:
model = AutoModelForSequenceClassification.from_pretrained("allenai/scibert_scivocab_uncased", 
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(labels), 
                                                           id2label=id2label, 
                                                           label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [56]:
batch_size = 8
metric_name = "f1"

In [57]:
# Use the default training arguments (three epochs, batch size of 8 per device).
# batch_size and metric_name are only used in the commented-out alternative below.
training_args = TrainingArguments(output_dir="SciBERT_finetuned")

#args = TrainingArguments(
#    f"bert-finetuned-sem_eval-english",
#    evaluation_strategy = "epoch",
#    save_strategy = "epoch",
#    learning_rate=2e-5,
#    per_device_train_batch_size=batch_size,
#    per_device_eval_batch_size=batch_size,
#    num_train_epochs=5,
#    weight_decay=0.01,
#    load_best_model_at_end=True,
#    metric_for_best_model=metric_name,
#    #push_to_hub=True,
#)



In [58]:
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result

In [59]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [60]:
encoded_dataset['train']['input_ids'][0]

tensor([  102,   698,   267,   791,   191,  1323,  1572,  2307, 12714, 30113,
          422,   227,   579,  5520,  2216,  5332, 30110,   145, 21176, 29187,
          365,   186,   205,   422,  5228,   422,  6558,   862,  1352,  1352,
         2216,  5332, 30110,   205, 25321,  1352,  2987,  1352,   546,   137,
         2159,  9383,   815,   145,  2081,   862,  1352,  1352, 21614,   114,
          205,  5744,   205,  1529, 30117,   205, 16048,   205, 27457,  1352,
          546,   422,   198,   267,  5896,   147,   862,   158,   546,  4677,
        10951,  1262,   263,   111,  3276,  6410, 10014,   262,   579, 22360,
        30107,   422,   170,   546, 22585,   238,  1733,   191,  5060, 30109,
         7475,   422,   239,   546,  3077, 25445,   192,   382,   579,   137,
         1422,   579, 12741,  2789,   121,   111,  1454,  6227,  1630,   190,
          111,   785,   579,  4011,   120, 10014,  6410,   188,  2470,   422,
          487,   106,  1984,  4740,  1172,   422,   286,   103])

In [61]:
#forward pass
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.6893, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.2792, -0.5600, -0.1132, -0.2784, -0.0494,  0.4248, -0.0650,  0.8157,
         -0.2457,  0.0703]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
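
The loss in this output can be recomputed by hand, which is a useful check that the multi-label setup is wired as intended. The `grad_fn` shows a binary cross-entropy with logits; the sketch below assumes it is averaged over the ten labels and recomputes it in plain Python from the printed (rounded) logits and the label vector of the same example. The helper names are illustrative.

```python
import math

labels = ["ana", "at_map", "data", "mod", "no_url", "no_info",
          "pers_inst", "pro_data", "res", "soft"]
# Logits printed above (rounded to four decimals) and the target of the same example.
logits = [-0.2792, -0.5600, -0.1132, -0.2784, -0.0494,
          0.4248, -0.0650, 0.8157, -0.2457, 0.0703]
target = [0.0] * 9 + [1.0]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, target):
    """Mean binary cross-entropy over all labels."""
    losses = [-(t * math.log(sigmoid(x)) + (1 - t) * math.log(1 - sigmoid(x)))
              for x, t in zip(logits, target)]
    return sum(losses) / len(losses)

loss = bce_with_logits(logits, target)
# A probability >= 0.5 is equivalent to a logit >= 0.
predicted = [name for name, x in zip(labels, logits) if sigmoid(x) >= 0.5]
print(round(loss, 4), predicted)
```

The recomputed loss agrees with the reported 0.6893. At the 0.5 threshold the model would predict 'no_info', 'pro_data' and 'soft' for this sentence, which is expected to be arbitrary at this stage because the classification head is newly initialized.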

In [63]:
trainer = Trainer(
    model, 
    args = training_args, 
    train_dataset=encoded_dataset["train"], 
    eval_dataset=encoded_dataset["validation"], 
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [64]:
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode


TrainOutput(global_step=39, training_loss=0.3313069954896585, metrics={'train_runtime': 578.3507, 'train_samples_per_second': 0.503, 'train_steps_per_second': 0.067, 'total_flos': 19142704175616.0, 'train_loss': 0.3313069954896585, 'epoch': 3.0})

### Evaluation
After training, we evaluate the model on the validation set.

In [65]:
trainer.evaluate()

{'eval_loss': 0.26600751280784607,
 'eval_f1': 0.14285714285714288,
 'eval_roc_auc': 0.5384615384615384,
 'eval_accuracy': 0.0,
 'eval_runtime': 5.8823,
 'eval_samples_per_second': 2.04,
 'eval_steps_per_second': 0.34,
 'epoch': 3.0}
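
An `eval_accuracy` of 0.0 next to a non-zero `eval_f1` can look contradictory. One possible explanation, assuming accuracy is computed as exact-match (subset) accuracy over the whole label vector, as scikit-learn's `accuracy_score` does for multi-label input, is that no validation sentence had every one of its labels predicted correctly. The sketch below uses made-up label vectors (not taken from this notebook) and two hypothetical helper functions to show how micro-averaged F1 and subset accuracy can diverge.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: count TP/FP/FN over every (sample, label) pair."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Illustrative data (assumption): 3 sentences, 3 possible labels each
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 0, 0], [1, 0, 0]]

print(round(micro_f1(y_true, y_pred), 3))
print(subset_accuracy(y_true, y_pred))
```

Here some individual labels are predicted correctly, so F1 is above zero, but every sentence misses at least one label, so the exact-match accuracy is zero.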

In [66]:
trainer.save_model("./SciBERT_finetuned")

In [67]:
## Load the fine-tuned model 

In [68]:
tokenizer = AutoTokenizer.from_pretrained("SciBERT_finetuned")

These maps are openly available in the multi-modal atlas of the Human Brain Project at the EBRAINS platform (https://ebrains.eu/service/human-brain-atlas/), together with a surface map in the FreeSurfer reference space (https://ebrains.eu/news/new-maps-features-ebrains-multilevel-human-brain-atlas/).

Labels: 
- at_map
- soft 

In [69]:
text = 'These maps are openly available in the multi-modal atlas of the Human Brain Project at the EBRAINS platform (https://ebrains.eu/service/human-brain-atlas/), together with a surface map in the FreeSurfer reference space (https://ebrains.eu/news/new-maps-features-ebrains-multilevel-human-brain-atlas/).'

In [70]:
inputs = tokenizer(text, return_tensors="pt")

In [71]:
model_finetuned = AutoModelForSequenceClassification.from_pretrained("SciBERT_finetuned")

# Run inference with the reloaded fine-tuned model (not the in-memory `model`)
with torch.no_grad():
    logits = model_finetuned(**inputs).logits

In [72]:
predicted_class_id = logits.argmax().item()

model_finetuned.config.id2label[predicted_class_id]

'soft'
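
The forward pass earlier in this section reports a `BinaryCrossEntropyWithLogits` loss, which suggests a multi-label setup where each label gets its own independent probability. In that case `argmax` returns only the single highest-scoring label, whereas the example sentence carries two labels (`at_map` and `soft`). A minimal sketch of a common alternative, applying a sigmoid to each logit and keeping every label above a threshold, is shown below. The logits, the 0.5 threshold, the helper name `predict_labels`, and the third label name are illustrative assumptions, not values from this notebook.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, id2label, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    probabilities = [sigmoid(value) for value in logits]
    return [id2label[i] for i, prob in enumerate(probabilities) if prob >= threshold]


# Hypothetical label mapping and logits for one sentence
id2label = {0: "at_map", 1: "soft", 2: "other"}  # "other" is a placeholder label
logits = [0.4, 1.2, -2.0]

print(predict_labels(logits, id2label))
```

With the real model output, `torch.sigmoid(logits)` would take the place of the helper; the threshold is a tuning choice rather than a fixed rule.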

---

In [None]:
data_dir = os.path.join(os.pardir, 'Data')
file_path = os.path.join(data_dir, 'articles_groundtruth_urls_and_sentences.csv')
df = pd.read_csv(file_path)

In [None]:
# This is my labeled data 
# Remove rows with NaN values
df = df.dropna()
# Remove square brackets and single quotes from the entire 'Sentence' column
df['Sentence'] = df['Sentence'].str.replace(r"[\[\]']", '', regex=True)
# Convert the boolean labels to integers (True -> 1, False -> 0) so that they work with the model 
df['Data'] = df['Data'].astype(str).map({"True": 1, "False": 0})

In [None]:
df = df[['Sentence', 'Data']]
# Rename columns
df = df.rename(columns={'Sentence': 'text', 'Data': 'label'})

In [None]:
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2)

#test_dataset = Dataset.from_pandas(test)

# Alternative, if the data are in CSV files 
#dataset = load_dataset('csv', data_files={'train': 'Corona_NLP_train.csv', 'test': 'Corona_NLP_test.csv'}, encoding = "ISO-8859-1")

In [None]:
# Load SciBERT with uncased scivocab 
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')

# Define a preprocessing function
def tokenize_data(data):
    """Function from: https://huggingface.co/docs/transformers/tasks/sequence_classification
    """
    tokenized_text = tokenizer(data["text"], truncation=True)
    return tokenized_text

In [None]:
dataset = dataset.map(tokenize_data, batched=True)

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
# Map the expected ids to their labels with id2label and label2id:
id2label = {0: "FALSE", 1: "TRUE"}
label2id = {"FALSE": 0, "TRUE": 1}

In [None]:
# Load SciBERT with uncased scivocab 
#model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

In [None]:
accuracy = evaluate.load("accuracy")

In [None]:
def compute_metrics(eval_pred):
    """From: https://huggingface.co/docs/transformers/tasks/sequence_classification
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("allenai/scibert_scivocab_uncased", num_labels=2, id2label=id2label, label2id=label2id)

In [None]:
model_path = os.path.join(os.pardir, "SciBERT-finetuned")

In [None]:
training_args = TrainingArguments(output_dir="SciBERT_finetuned")

In [None]:
train_dataset = dataset['train'].shuffle(seed=10).select(range(26))
eval_dataset = dataset['train'].shuffle(seed=10).select(range(26, 32))

In [None]:
trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=train_dataset, 
    eval_dataset=eval_dataset, 
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

In [None]:
trainer.save_model("./SciBERT_finetuned")

In [None]:
text = 'These maps are openly available in the multi-modal atlas of the Human Brain Project at the EBRAINS platform (https://ebrains.eu/service/human-brain-atlas/), together with a surface map in the FreeSurfer reference space (https://ebrains.eu/news/new-maps-features-ebrains-multilevel-human-brain-atlas/).'

In [None]:
tokenizer = AutoTokenizer.from_pretrained("SciBERT_finetuned")

In [None]:
inputs = tokenizer(text, return_tensors="pt")

In [None]:
model_finetuned = AutoModelForSequenceClassification.from_pretrained("SciBERT_finetuned")

# Run inference with the reloaded fine-tuned model (not the in-memory `model`)
with torch.no_grad():
    logits = model_finetuned(**inputs).logits

In [None]:
predicted_class_id = logits.argmax().item()

model_finetuned.config.id2label[predicted_class_id]

<a name='classify'></a>
## 3.2 Classify data

<a name='getdatasets'></a>
# 4. Get datasets 
Now that the sentences have been classified, I will continue working with those that were labelled as 'Dataset'. 

<a name='processsentencesforclassification'></a>
## 3.2. Process sentences for classification 
I combine the cleaning steps mentioned in Gasparetto et al. (2022), Haj-Yahia et al. (2019), and Halford (2020), using functions from both Gensim (Řehůřek and Sojka 2010) and NLTK (Bird et al. 2009): 
- Uninformative tokens are removed, including: 
    - Numeric characters (Gensim)
    - Tags (Gensim)
    - Punctuation (Gensim)
    - Extra whitespace (Gensim)
    - Stopwords based on stopword list from NLTK.
    - Isolated characters (len = 1)
- The URLs are removed 
- Stem text (Gensim)
- The text is converted to lowercase.
- The text is tokenized using SciBERT's tokenizer. 

In [None]:
# Define the path to the 'Data' directory
data_dir = os.path.join(os.pardir, 'Data')

# Define the file path for 'articles_filtered_urls.csv'
path = os.path.join(data_dir, 'articles_filtered_urls.csv')

# Load the CSV file into a DataFrame
df = pd.read_csv(path)

# Now, the 'df' variable contains the data from 'articles_filtered_urls.csv'

In [None]:
# Remove rows with NaN values
df = df.dropna()
# Remove square brackets and single quotes from the entire 'Sentences' column
df['Sentences'] = df['Sentences'].str.replace(r"[\[\]']", '', regex=True)

In [None]:
df['Sentences'].loc[10]

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

In [None]:
nltk.download('punkt')

In [None]:
def remove_urls(text):
    """This function removes URLs from a text using the URLExtract library (Lipovský 2022) to
    identify the URLs. 
    
    Parameters: 
    :param text (str): text that may contain URLs. 
    
    Returns: 
    :return: text (str) without URLs. 
    """
    extractor = urlextract.URLExtract()
    urls = list(extractor.find_urls(text))
    for url in urls:
        text = text.replace(url, '')
    return text

def tokenize_text(text):
    """This function cleans the text using multiple of gensim's preprocessing functions (Řehůřek and Sojka 2010), 
    NLTK's stopwords and tokenize modules (Bird et al. 2009), and it removes the URLs using 
    the URLExtract library (Lipovský 2022). 
    
    Parameters: 
    :param text (str): the sentence(s) to clean. 
    
    Returns: 
    :return: cleaned text (str): the remaining tokens joined by single spaces. 
    """
    # Remove URLs from the text
    text = remove_urls(text)
    
    # Combine multiple spaces, strip tags, punctuation, numerics, stem the text, and convert to lowercase
    # POSSIBLE: remove_stopwords
    text = ' '.join(preprocess_string(text, filters=[strip_tags, strip_punctuation, strip_numeric, strip_multiple_whitespaces, stem_text]))
    
    # Lowercase the text
    text = text.lower()
    
    # Tokenization (split text into words)
    tokens = nltk.word_tokenize(text)

    # Remove uninformative tokens
    cleaned_tokens = []
    for token in tokens:
        if token.isalpha() and len(token) > 1:  # Check if token is a word and not a single character
            if token.lower() not in stop_words:  # Remove stopwords
                cleaned_tokens.append(token)

    tokenized_text = ' '.join(cleaned_tokens)
    
    return tokenized_text

In [None]:
#test = df['Sentences'].loc[1]
#test_clean = tokenize_text(test)
#test_clean

In [None]:
df_cleaned = df.copy()

In [None]:
# Apply the cleaning function to the 'Sentences' column
df_cleaned['Tokens'] = df_cleaned['Sentences'].apply(tokenize_text)

In [None]:
df_cleaned

<a name='references'></a>
# References

- Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A Pretrained Language Model for Scientific Text (arXiv:1903.10676). arXiv. http://arxiv.org/abs/1903.10676
- Bird, S., Loper, E., & Klein, E. (2009). Natural Language Processing with Python. O’Reilly Media, Inc.
- Choudhary, R. (2021, December 29). Fine-Tuning Bert for Tweets Classification ft. Hugging Face. MLearning.Ai. https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805). arXiv. https://doi.org/10.48550/arXiv.1810.04805
- Fine-tuning BERT (and friends) for multi-label text classification. (n.d.). Retrieved November 9, 2023, from https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb#scrollTo=hiloh9eMK91o
- Gasparetto, A., Marcuzzo, M., Zangari, A., & Albarelli, A. (2022). A Survey on Text Classification Algorithms: From Text to Predictions. _Information_, _13_(2), Article 2. https://doi.org/10.3390/info13020083
- Haj-Yahia, Z., Sieg, A., & Deleris, L. A. (2019). Towards Unsupervised Text Classification Leveraging Experts and Word Embeddings. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 371–379. https://doi.org/10.18653/v1/P19-1036
- Halford, M. (2020, October 3). Unsupervised text classification with word embeddings. https://maxhalford.github.io/blog/unsupervised-text-classification/
- Kosar, A., Pauw, G. D., & Daelemans, W. (2022). Unsupervised Text Classification with Neural Word Embeddings. _Computational Linguistics in the Netherlands Journal_, _12_, 165–181.
- Lipovský, J. (2022). urlextract: Collects and extracts URLs from given text. (1.8.0) [Python]. https://github.com/lipoja/URLExtract
- Lhoest, Q., Villanova del Moral, A., von Platen, P., Wolf, T., Šaško, M., Jernite, Y., Thakur, A., Tunstall, L., Patil, S., Drame, M., Chaumond, J., Plu, J., Davison, J., Brandeis, S., Sanh, V., Le Scao, T., Canwen Xu, K., Patry, N., Liu, S., … Delangue, C. (2021). Datasets: A Community Library for Natural Language Processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 175–184) [Python]. Association for Computational Linguistics. https://aclanthology.org/2021.emnlp-demo.21 (Original work published 2020)
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830.
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget
- Text classification. (n.d.). 🤗 HuggingFace - Transformers. Retrieved October 29, 2023, from https://huggingface.co/docs/transformers/tasks/sequence_classification
- Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., … Rush, A. M. (2020). HuggingFace’s Transformers: State-of-the-art Natural Language Processing (arXiv:1910.03771). arXiv. https://doi.org/10.48550/arXiv.1910.03771



Do I still use these? 
- Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning Word Vectors for 157 Languages (arXiv:1802.06893). arXiv. https://doi.org/10.48550/arXiv.1802.06893
- Kalai, A. T. & Brown University (Directors). (2019, April 18). An ICERM Public Lecture - Bias in bios: Fairness in a high-stakes machine-learning setting. https://www.youtube.com/watch?v=IDNXZitcQng
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient Estimation of Word Representations in Vector Space (arXiv:1301.3781). arXiv. https://doi.org/10.48550/arXiv.1301.3781
- Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. https://doi.org/10.3115/v1/D14-1162
- Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. 45–50. https://doi.org/10.13140/2.1.2393.1847
- Řehůřek, R. (2023). What is Gensim-data for? [Python]. https://github.com/piskvorky/gensim-data (Original work published 2017)

In [None]:
section_patterns_v2 = [
    (["Data and Code Availability", "Data Availability", "Data/code availability"], ["3. ", "CRediT authorship contribution statement", "Acknowledgements", "References", "Declaration of Competing Interests", "Credit authorship contribution statement", "\n\n"]),
    (["2.1."], ["2.2."]),
    (["Resource", "3.1."], ["3.2."]),
    # Raw strings keep the regex escapes (\d, \s, \w) from being read as Python string escapes
    ([r"Fig.\d+", r"Fig.\d+\.?", r"Figure \d+"], [r"https?://[^\s]+"]),
    ([r"Tab.\d+", r"Table \d+\.?"], [r"https?://[^\s]+", r"[\w\s-]+\d{4}"]),
    (["Introduction", "1. "], ["2. "]),
    (["Abstract"], ["1. ", "Introduction"])
]

In [None]:
# Empty list to store individual results
results_list_v2 = []

# Read DOI values from the JSON file
with open(json_file_path, 'r') as json_file:
    doi_data = json.load(json_file)

    # Get a batch of 10 DOIs (list indices 11 to 20)
    doi_batch = doi_data['DOIs'][11:21]

    for doi in doi_batch:
        doi_replaced = doi.replace('/', '.')
        pdf_path = os.path.join(pdf_directory, f"{doi_replaced}.pdf")

        # Call the get_content function for each DOI
        section_content = get_content(pdf_path, alternative_pdf_directory, section_patterns_v2)

        # Create a dictionary for each result and add it to the list
        results_list_v2.append({"DOI": doi, "Section": section_content})

# Convert the list of dictionaries to a DataFrame
results_df2 = pd.DataFrame(results_list_v2)

In [None]:
results_df2

# Old notes 

## 1.1. Get text sections 

I use the work of Akkoç (2023) and Sourget (2023) to search the PDFs for their datasets. The two functions presented in this section are inspired by code from two separate Git repositories: 
- *get_section* is loosely adapted from Akkoç (2023); see PublicDatasets/ArticleAnalyser.ipynb, section '2.1 Get section function', in the GitHub repository.
- *get_content* is loosely adapted from Sourget (2023); see DDSA_Sourget/code/other/download_fulltext.ipynb, section '3. Check for dataset's organ in figures', in the GitHub repository.
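
A start/end pattern pair such as those in `section_patterns_v2` can be applied by finding the first match of a start pattern and returning the text up to the earliest match of any end pattern. The sketch below is a simplified, hypothetical illustration of that idea; the name `extract_between` and the sample text are assumptions, and it is not the code from either repository.

```python
import re


def extract_between(text, start_patterns, end_patterns):
    """Return the text between the first matching start pattern and the
    earliest end-pattern match after it, or None if no start pattern matches."""
    for start in start_patterns:
        start_match = re.search(start, text)
        if start_match is None:
            continue
        rest = text[start_match.end():]
        # Collect the position of each end pattern that occurs after the start
        end_positions = []
        for end in end_patterns:
            end_match = re.search(end, rest)
            if end_match is not None:
                end_positions.append(end_match.start())
        cut = min(end_positions) if end_positions else len(rest)
        return rest[:cut].strip()
    return None


# Made-up sample text for illustration
sample = "Abstract We study X. 1. Introduction Prior work ... 2. Methods ..."
print(extract_between(sample, [r"Abstract"], [r"1\. ", r"Introduction"]))
```

If no end pattern is found, the sketch returns everything after the start match, which may be too greedy for real PDFs.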

<br>

References: 
- Akkoç, A. (2023). PublicDatasets [Jupyter Notebook]. https://github.com/madprogramer/PublicDatasets (Original work published 2022)
- Sourget, T. (2023). TheoSourget/DDSA_Sourget: Repository used during my travel at the ITU of Copenhagen in March 2023 [Computer software]. https://github.com/TheoSourget/DDSA_Sourget

## Stuff 
- URLs do not necessarily link to the data.
- A git repository can contain both data and code, but not always.
- The dataset might only be mentioned by name and not linked (so far, I've only seen the names in camel case).
- QUESTION: How do we treat reviews that summarize data but do not contain new data? Is reusing a dataset not also the same as not containing new data?

## More stuff

- Worries 
    - How to get the name of the dataset itself and the url
        - The URL can be broken up by spaces (due to line breaks in the PDF). Can I find a way to determine what the entire URL is? 
            - Are there any slashes in the text ahead? A parenthesis, dot, comma, or another symbol might end the URL. 
    - If someone uses e.g., HCP, do they use all of the data? Do I need to catch more text sections to learn this? (This relates to the discussion of significance testing: if they use different parts of the dataset, they are not testing on the same data.) 
        - "Due to HCP and dHCP privacy policies, the preprocessed resting-state images of human adults and neonates (with their IDs) can only be shared upon request with qualified investigators who agree to the Restricted Data Use Terms of these two datasets." (from 10.1016/j.neuroimage.2022.119339)
    - What if the article does not analyse any data? (E.g., 10.1016/j.neuroimage.2022.119295 presents a software package for the execution of RT-fMRI experiments.) 
- FUNCTION GET_CONTENT: Add a comment about trying the "Editorial board" texts in the other file, just so I don't get an "Error reading PDF:" message. 
    - Make an addition to 'get_section' so that it says 'Editorial board' instead of None for the section text. 


### Clean text 
I will do a very simple initial cleaning of the extracted text sections: 
- Replace multiple spaces with a single space
    - [.replace('   ', ' ').replace('  ', ' ')]
- Remove all \n characters 
    - [.replace('\n', '')]
- Remove the spaces next to the following characters: -, (, ), /, ., and _, as well as the space between : and / 
    - [.replace('- ', '-').replace('( ', '(').replace(' )', ')').replace('/ ', '/').replace(' /', '/').replace(' .', '.').replace(': /', ':/').replace(' _ ', '_').replace(' _', '_').replace('_ ', '_')] 
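
The three steps above can be collected into one helper. This is a minimal sketch: the function name `clean_section_text` and the sample string are assumptions, and the whitespace collapse uses a regular expression instead of chained `.replace` calls so that runs of spaces of any length are handled.

```python
import re

# Replacement pairs in the same order as listed in the notes above
SPACING_FIXES = [
    ('- ', '-'), ('( ', '('), (' )', ')'), ('/ ', '/'), (' /', '/'),
    (' .', '.'), (': /', ':/'), (' _ ', '_'), (' _', '_'), ('_ ', '_'),
]


def clean_section_text(text):
    """Apply the simple cleaning steps to an extracted text section."""
    # 1. Replace multiple spaces with a single space
    text = re.sub(r' {2,}', ' ', text)
    # 2. Remove all newline characters
    text = text.replace('\n', '')
    # 3. Remove stray spaces around -, (, ), /, ., _ and between : and /
    for old, new in SPACING_FIXES:
        text = text.replace(old, new)
    return text


# Made-up example of a URL broken up by PDF extraction
raw = "See  https: //example.org/ data_ set ( v1 ) .\n"
print(clean_section_text(raw))
```

These replacements are deliberately aggressive: they also join hyphenated words split by a space and remove the space before any full stop, which is acceptable for locating URLs but may alter ordinary prose.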
