In [1]:
import subprocess
import downloader
from reader import Reader
from ontology import export_ontology

## How does it work?
---
This notebook takes a list of papers (DOIs and titles) from `data/lists/{named_list}.txt` and processes them as follows:
1. Download and store the papers in `data/data/papers/{named_list}/`
2. Instantiate a `Reader(named_list)` and load the PDF files along with any information previously stored in the cache `data/reader/cache/{named_list}.json`
3. Extract metadata using the DOIs and calling external sources
4. Extract topics using different classifiers. At the moment "acm" and "dbpedia" are available.
5. Dump the results to a cached `json` file. This allows you to load them later instead of having to process everything again.
6. Export the data to a TSV file in the format that Klink-2 uses as input to build the ontology.
7. Run the Klink-2 algorithm (R scripts). **Note:** if klink2 is already set up with the correct parameters, the R scripts can be called from Python in this notebook; otherwise, it is recommended to run the `input.R` and `klink-2.R` scripts individually. I find it easier to use RStudio in this case.
8. Export the ontology to a `.ttl` file, so it can be easily loaded and visualised at https://service.tib.eu/webvowl/


## Extract - Fetching papers
---
If the papers are already available in the data directory, you can skip this step.

**Note:**

You must be connected to a VPN or directly to a network with full access to the papers so that the code can download them.

> Improvement: fetch the user credentials/token from Google Scholar and pass them with the request. This would eliminate the need to be connected to a network with access.

In [2]:
named_list = "test"

In [3]:
downloader.download(named_list=named_list, overwrite=False)

  0%|          | 0/5 [00:00<?, ?it/s]

10.1145/3593013.3594049: This file already exists! To overwrite it, use `overwrite=True`.


 40%|████      | 2/5 [00:01<00:01,  1.99it/s]

10.1145/3593013.3594039: This file already exists! To overwrite it, use `overwrite=True`.


 60%|██████    | 3/5 [00:02<00:01,  1.41it/s]

10.1145/3593013.3594048: This file already exists! To overwrite it, use `overwrite=True`.


 80%|████████  | 4/5 [00:03<00:00,  1.22it/s]

10.1145/3442188.3445876: This file already exists! To overwrite it, use `overwrite=True`.


100%|██████████| 5/5 [00:04<00:00,  1.24it/s]

10.1145/3531146.3533226: This file already exists! To overwrite it, use `overwrite=True`.





## Load
---
Instantiate the `Reader` and load the paper content, along with any cached data if you have worked on this list before and stored a checkpoint.

In [3]:
reader = Reader(named_list)
help(reader.load)

Help on method load in module reader.reader:

load(filename_has_doi: bool = True, pattern_to_replace: dict = {}, from_inc: int | None = None, to_exc: int | None = None) method of reader.reader.Reader instance
    Load papers from the specified directory.
    
    Args:
        filename_has_doi (bool, optional): Whether the file is named with DOI pattern. Defaults to True.
            When set to `True` the DOI will be extract from the name using the
            `pattern_to_replace`.
        pattern_to_replace (dict): Pattern to replace. Defaults to {}.
            e.g.: filename = file_10_1145-3351095_00000000.txt
                  pattern_to_replace = {'_': '.', '-':'/'}
        from_inc (int | None, optional): Index to start loading papers (inclusive). Defaults to None.
        to_exc (int | None, optional): Index to stop loading papers (exclusive). Defaults to None.



In [4]:
reader.load(from_inc=1, filename_has_doi=True, pattern_to_replace={"_": ".", "-": "/"})


Processing 10_1109-ICSE48619_2023_00133.pdf: 100%|██████████| 8/8 [00:00<00:00, 10.52it/s]
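
The `pattern_to_replace` argument used above maps characters in the file name back to the characters of the DOI. A minimal sketch of that idea follows; `doi_from_filename` is a hypothetical helper written for illustration, not the `Reader` implementation, and it assumes the file name contains nothing but the encoded DOI plus an extension.

```python
from pathlib import Path


def doi_from_filename(filename: str, pattern_to_replace: dict[str, str]) -> str:
    """Hypothetical helper: rebuild a DOI from a file name.

    Assumes the stem of the file name is the DOI with some characters
    substituted (e.g. "." stored as "_" and "/" stored as "-").
    """
    stem = Path(filename).stem  # drop the ".pdf" extension
    for old, new in pattern_to_replace.items():
        stem = stem.replace(old, new)
    return stem


print(doi_from_filename("10_1145-3593013_3594048.pdf", {"_": ".", "-": "/"}))
# prints 10.1145/3593013.3594048
```

A file name with an extra prefix, such as the `file_10_1145-...` example in the docstring, would need that prefix stripped first; the sketch does not handle this.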



Fetch metadata from an external source via API. This is only needed if the metadata is not yet stored in the cache; otherwise, the previous cell has already loaded it into this namespace.

Keywords and abstracts are extracted directly from the PDF content using regular expressions. In some cases, they can be extracted from the PDF metadata or from external sources, if available.

**Note**: There is a deliberate delay here to avoid hitting rate limits.

In [8]:
reader.extract_metadata()

Processing 10.1109/ICSE48619.2023.00133: 100%|██████████| 8/8 [01:04<00:00,  8.09s/it]
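
The deliberate delay mentioned in the note can be pictured as a fixed pause between consecutive requests. The sketch below is an assumption about the general technique, not the code behind `extract_metadata()`; `fetch_with_delay`, `fake_fetch` and the `delay_s` parameter are hypothetical names.

```python
import time
from typing import Callable, Iterable


def fetch_with_delay(
    dois: Iterable[str],
    fetch: Callable[[str], dict],
    delay_s: float = 1.0,
) -> dict[str, dict]:
    """Call fetch(doi) for each DOI, pausing delay_s seconds between calls."""
    results: dict[str, dict] = {}
    for index, doi in enumerate(dois):
        if index > 0:
            time.sleep(delay_s)  # no pause before the first request
        results[doi] = fetch(doi)
    return results


def fake_fetch(doi: str) -> dict:
    """Stand-in for a real metadata lookup, so the sketch runs offline."""
    return {"doi": doi}


metadata = fetch_with_delay(
    ["10.1145/3593013.3594048", "10.1145/3442188.3445876"], fake_fetch, delay_s=0.1
)
print(len(metadata))
```

A fixed pause is the simplest option; a real client might instead back off only when the service signals that the limit was reached.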


## Classify - Finding the topics which the papers are related to
---
ACM provides a classification on the webpage of each paper. This classification is scraped, so it only works for papers available at ACM.

DBpedia, on the other hand, provides an API that "highlights" the main keywords in a piece of text. For this purpose, the abstracts are extracted from the PDF files and passed to DBpedia via its API. In principle, this approach can be used for any paper, regardless of its source. The disadvantage is that the abstract extraction is based on regular expressions and has only been tested on papers from ACM. The expressions match on the word "ABSTRACT", so abstracts that are not introduced by a section title (i.e. ABSTRACT) are not extracted properly.

The same kind of topic extraction can be done using the CSO classifier, which can be installed locally, although this is not yet implemented.

In [6]:
classifiers = ["acm", "dbpedia"]
reader.extract_classification(*classifiers)

Processing 10.1145/3593013.3594048:  12%|█▎        | 1/8 [00:03<00:24,  3.53s/it]     

10.1109/ICSE43902.2021.00129: The title '"ignorance and prejudice" in software fairness' in the topics extracted from ACM didn't match with ''.Ignoring this classification.


Processing 10.1109/ICSE48619.2023.00133:  88%|████████▊ | 7/8 [00:22<00:02,  2.98s/it]

10.1145/3531146.3533226: 400 Client Error: Bad Request for url: https://api.dbpedia-spotlight.org/en/annotate?text=


Processing 10.1109/ICSE48619.2023.00133: 100%|██████████| 8/8 [00:25<00:00,  3.21s/it]


In [9]:
reader.metadata_collection("dataframe")

Unnamed: 0,file_name,doi,title,authors,journal,publisher,year,keywords,topics_dbpedia,topics_acm
0,10_1109-ICSE43902_2021_00129.pdf,10.1109/icse43902.2021.00129,“Ignorance and Prejudice” in Software Fair...,"Zhang, Jie M.;Harman, Mark",,IEEE,2021,software fairness;machine learning fairness,data set;machine learning,
1,10_1145-3593013_3594048.pdf,10.1145/3593013.3594048,Maximal fairness,"Defrance, Marybeth;De Bie, Tijl",,ACM,2023,fairness;fairness in ai;fairness definitions;c...,parity;ai;theorem,computing methodologies;artificial intelligence
2,10_1145-3593013_3594049.pdf,10.1145/3593013.3594049,Augmented Datasheets for Speech Datasets and E...,"Papakyriakopoulos, Orestis;Choi, Anna Seo Gyeo...",,ACM,2023,datasets;speech;datasheets;ethics;transparency,web-crawling;intersectionality;bear;acm;dialec...,applied computing;arts and humanities;sound an...
3,10_1145-1005686_1005704.pdf,10.1145/1005686.1005704,A resource-allocation queueing fairness measure,"Raz, David;Levy, Hanoch;Avi-Itzhak, Benjamin",,ACM,2004,fairness;fcfs;job scheduling;m/m/1;processor s...,ros;fairness measure;accounting,general and reference;cross-computing tools an...
4,10_1109-90_993299.pdf,10.1109/90.993299,The impact of multicast layering on network fa...,"Rubenstein, D.;Kurose, J.;Towsley, D.",IEEE/ACM Transactions on Networking,Institute of Electrical and Electronics Engine...,2002,congestion control;fairness;multicast.,bandwidth;protocol;unicast;multicast;congestio...,general and reference;cross-computing tools an...
5,10_1145-3442188_3445876.pdf,10.1145/3442188.3445876,Group Fairness: Independence Revisited,"Räz, Tim",,ACM,2021,fairness;independence;statistical parity;demog...,parity;computer science,applied computing;arts and humanities;computin...
6,10_1145-3531146_3533226.pdf,10.1145/3531146.3533226,Demographic-Reliant Algorithmic Fairness: Char...,"Andrus, McKane;Villeneuve, Sarah",,ACM,2022,demographic data;sensitive data;categorization...,,"applied computing;law, social and behavioral s..."
7,10_1109-ICSE48619_2023_00133.pdf,10.1109/icse48619.2023.00133,Towards Understanding Fairness and its Composi...,"Gohar, Usman;Biswas, Sumon;Rajan, Hridesh",,IEEE,2023,fairness;ensemble;machine learning;models,hyperparameters;random forest;interplay;algori...,


In [10]:
len([1 for paper in reader.paper_list if paper.topics.get("acm") or paper.topics.get("dbpedia")])

8

## Save the progress to a JSON file

In [11]:
reader.dump(overwrite=False)
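
The checkpoint idea behind `reader.dump(overwrite=False)` can be sketched as a small JSON cache that refuses to replace an existing file unless asked. `dump_cache` and `load_cache` are hypothetical helpers for illustration; the real cache layout in `data/reader/cache/` may differ.

```python
import json
import tempfile
from pathlib import Path


def dump_cache(records: list[dict], cache_path: Path, overwrite: bool = False) -> bool:
    """Write records to cache_path; keep an existing file unless overwrite=True."""
    if cache_path.exists() and not overwrite:
        print(f"{cache_path.name} already exists. To overwrite it, use `overwrite=True`.")
        return False
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return True


def load_cache(cache_path: Path) -> list[dict]:
    """Return the cached records, or an empty list when no cache exists yet."""
    if not cache_path.exists():
        return []
    return json.loads(cache_path.read_text(encoding="utf-8"))


# Demonstration in a temporary directory, so no project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "test.json"
    records = [{"doi": "10.1145/3593013.3594048", "title": "Maximal fairness"}]
    first = dump_cache(records, path)   # writes the file
    second = dump_cache([], path)       # refused: the file already exists
    restored = load_cache(path)
```

Defaulting to `overwrite=False` protects a checkpoint that took minutes of API calls to build from being replaced by accident.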

## Export to Klink file

In [12]:
reader.export_as_klink_input(classification_source="acm")

Unnamed: 0,DE,TI,AU,SO,SC,PY
1,fairness;fairness in ai;fairness definitions;c...,Maximal fairness,"Defrance, Marybeth;De Bie, Tijl",ACM,computing methodologies;artificial intelligence,2023
2,datasets;speech;datasheets;ethics;transparency,Augmented Datasheets for Speech Datasets and E...,"Papakyriakopoulos, Orestis;Choi, Anna Seo Gyeo...",ACM,applied computing;arts and humanities;sound an...,2023
3,fairness;fcfs;job scheduling;m/m/1;processor s...,A resource-allocation queueing fairness measure,"Raz, David;Levy, Hanoch;Avi-Itzhak, Benjamin",ACM,general and reference;cross-computing tools an...,2004
4,congestion control;fairness;multicast.,The impact of multicast layering on network fa...,"Rubenstein, D.;Kurose, J.;Towsley, D.",Institute of Electrical and Electronics Engine...,general and reference;cross-computing tools an...,2002
5,fairness;independence;statistical parity;demog...,Group Fairness: Independence Revisited,"Räz, Tim",ACM,applied computing;arts and humanities;computin...,2021
6,demographic data;sensitive data;categorization...,Demographic-Reliant Algorithmic Fairness: Char...,"Andrus, McKane;Villeneuve, Sarah",ACM,"applied computing;law, social and behavioral s...",2022
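
Comparing the two tables above, the Klink input columns appear to correspond to the metadata fields as `DE` = keywords, `TI` = title, `AU` = authors, `SO` = publisher, `SC` = topics from the chosen classifier and `PY` = year, and papers without topics from that classifier (rows 0 and 7) are left out. The sketch below reproduces that mapping under those assumptions; `to_klink_rows` and `write_tsv` are hypothetical helpers and the two records are placeholders, not real data.

```python
import csv
import io

# Assumed mapping, inferred from the two tables above.
COLUMN_MAP = {
    "DE": "keywords",
    "TI": "title",
    "AU": "authors",
    "SO": "publisher",
    "SC": "topics_acm",
    "PY": "year",
}


def to_klink_rows(papers: list[dict], topic_field: str = "topics_acm") -> list[dict]:
    """Keep only papers that have topics and rename their fields."""
    rows = []
    for paper in papers:
        if not paper.get(topic_field):
            continue  # no classification from this source: skip the paper
        rows.append({column: paper.get(field, "") for column, field in COLUMN_MAP.items()})
    return rows


def write_tsv(rows: list[dict], handle) -> None:
    """Write the rows as tab-separated values with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=list(COLUMN_MAP), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Placeholder records for illustration only.
papers = [
    {"keywords": "a;b", "title": "Example paper A", "authors": "Doe, Jane",
     "publisher": "ACM", "topics_acm": "applied computing", "year": 2023},
    {"keywords": "c", "title": "Example paper B", "authors": "Roe, Sam",
     "publisher": "IEEE", "topics_acm": "", "year": 2021},
]
buffer = io.StringIO()
write_tsv(to_klink_rows(papers), buffer)
print(buffer.getvalue())
```

Only "Example paper A" is written, mirroring how the papers without ACM topics are missing from the exported table.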


## Run the Klink-2 Scripts

In [7]:

%cd "./klink2/"
subprocess.run(["Rscript", "main.R", named_list])
%cd "../"

/Users/jmatias/Documents/develop/aiod-ontology/src/klink2



Attaching package: ‘igraph’

The following objects are masked from ‘package:stats’:

    decompose, spectrum

The following object is masked from ‘package:base’:

    union

hash-2.2.6.3 provided by Decision Patterns


Attaching package: ‘fastcluster’

The following object is masked from ‘package:stats’:

    hclust


Attaching package: ‘dplyr’

The following objects are masked from ‘package:igraph’:

    as_data_frame, groups, union

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



process article  1  out of  6 
process article  2  out of  6 
process article  3  out of  6 
process article  4  out of  6 
process article  5  out of  6 
process article  6  out of  6 
process keyword  1  out of  36 
process keyword  2  out of  36 
process keyword  3  out of  36 
process keyword  4  out of  36 
process keyword  5  out of  36 
process keyword  6  out of  36 
process keyword  7  out of  36 
process keyword  8  out of  36 
process keyword  9  out of  36 
process keyword  10  out of  36 
process keyword  11  out of  36 
process keyword  12  out of  36 
process keyword  13  out of  36 
process keyword  14  out of  36 
process keyword  15  out of  36 
process keyword  16  out of  36 
process keyword  17  out of  36 
process keyword  18  out of  36 
process keyword  19  out of  36 
process keyword  20  out of  36 
process keyword  21  out of  36 
process keyword  22  out of  36 
process keyword  23  out of  36 
process keyword  24  out of  36 
process keyword  25  out of  36
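
As an alternative to changing directory with `%cd` before and after the call, `subprocess.run` accepts a `cwd` argument that applies only to the child process. The sketch below shows the pattern with a hypothetical `run_script` helper; it uses the Python interpreter as a stand-in command so that it runs without R installed.

```python
import subprocess
import sys
from pathlib import Path


def run_script(command: list[str], workdir: Path) -> subprocess.CompletedProcess:
    """Run command inside workdir and raise CalledProcessError if it fails."""
    return subprocess.run(
        command,
        cwd=workdir,          # working directory of the child process only
        check=True,           # fail loudly on a non-zero exit code
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,
    )


# Stand-in command: print the child's working directory.
result = run_script(
    [sys.executable, "-c", "import os; print(os.getcwd())"], Path.cwd()
)
print(result.stdout.strip())
```

For this notebook, the equivalent call would be `run_script(["Rscript", "main.R", named_list], Path("klink2"))`, which leaves the notebook's own working directory unchanged even if the R script fails.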


## Export the ontology to a `.ttl` file

In [8]:
export_ontology(named_list)

File exported to /Users/jmatias/Documents/develop/aiod-ontology/data/ontology/test.ttl


### For visualisation, go to https://service.tib.eu/webvowl/ and import the file through its interface.