## Literature Study Preprocessing
This notebook preprocesses the literature collected on the topic of interest. The goal is to reduce the data set to the most relevant papers by computing a relative citation index (citations per year since publication) for each one. The top-ranked papers are then selected for further analysis.
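
As a minimal sketch of the relative citation index, assuming it means the citation count divided by the number of years since publication (which is how the cells below compute it). The helper name `relative_citation_index` is hypothetical, and treating a paper from the current year as one year old is an assumption made here to avoid dividing by zero.

```python
from datetime import datetime


def relative_citation_index(citation_count, pub_year, current_year=None):
    """Citations per year since publication (hypothetical helper)."""
    if current_year is None:
        current_year = datetime.now().year
    # Assumption: a paper published in the current year counts as one year old,
    # so the division never hits zero.
    age = max(current_year - pub_year, 1)
    return citation_count / age


# 120 citations over 6 years
print(relative_citation_index(120, 2018, current_year=2024))  # 20.0
```

Normalizing by age keeps older papers from dominating the ranking only because they have had more time to collect citations.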

## Clean the data


In [15]:
import sys
# !{sys.executable} -m pip install crossrefapi
!{sys.executable} -m pip install crossref-commons

Collecting crossref-commons
  Downloading crossref_commons-0.0.7-py3-none-any.whl.metadata (3.2 kB)
Collecting ratelimit>=2.2.1 (from crossref-commons)
  Downloading ratelimit-2.2.1.tar.gz (5.3 kB)
  Preparing metadata (setup.py) ... done
Downloading crossref_commons-0.0.7-py3-none-any.whl (14 kB)
Building wheels for collected packages: ratelimit
  Building wheel for ratelimit (setup.py) ... done
  Created wheel for ratelimit: filename=ratelimit-2.2.1-py3-none-any.whl size=5939 sha256=a8c416ab172118e103ac7107519a9dc27a5473d07849aa23243a2d1193272e41
  Stored in directory: /home/jovyan/.cache/pip/wheels/69/bd/e0/4a5dee2a1bfbc8e258f543f92940e2b494d63b5be8144ec8c4
Successfully built ratelimit
Installing collected packages: ratelimit, crossref-commons
Successfully installed crossref-commons-0.0.7 ratelimit-2.2.1


In [32]:
from datetime import datetime
import pandas as pd
import crossref_commons
# from crossref.restful import Works
from crossref_commons.iteration import iterate_publications_as_json
from crossref_commons.retrieval import get_entity
from crossref_commons.types import EntityType, OutputType


In [58]:
df = pd.read_csv('data/pre-selected-data.csv', delimiter='|')
df['relative_citation_index'] = df['relative_citation_index'].astype(float)
df.head(1)

Unnamed: 0,#,title,abstract,venue,year,doi,authors,relative_citation_index,relevance_score
0,1,Accel-Sim: An Extensible Simulation Framework ...,"In computer architecture, significant innovati...",International Symposium on Computer Architecture,2018,10.1109/ISCA45697.2020.00047,"Mahmoud Khairy, Timothy G. Rogers, Tor M. Aamo...",0.0,11


In [62]:
df.shape

(2318, 9)

In [59]:
errors = []
df['relative_citation_index'] = pd.to_numeric(df['relative_citation_index'])
current_year = datetime.now().year
for index, row in df.iterrows():
    # Age in years, clamped to 1 so papers from the current year do not divide by zero (assumed convention).
    age = max(current_year - row['year'], 1)
    if pd.isna(row['doi']):
        print(f"No DOI available for paper {row['title']}")
        try:
            # Fall back to a Crossref title/author search restricted to the publication year.
            filters = {'from-pub-date': row['year'], 'until-pub-date': row['year']}
            queries = {'query.title': row['title'], 'query.author': row['authors']}
            pub = list(iterate_publications_as_json(max_results=1, queries=queries, filter=filters))[0]
            df.at[index, 'doi'] = pub['DOI']
            df.at[index, 'relative_citation_index'] = pub['is-referenced-by-count'] / age
        except Exception as e:
            print(f"Could not find paper {row['title']}, Error: {e}")
            errors.append(row['title'])
    else:
        try:
            # Look the paper up directly by its DOI.
            pub = get_entity(row['doi'], EntityType.PUBLICATION, OutputType.JSON)
            df.at[index, 'relative_citation_index'] = pub['is-referenced-by-count'] / age
        except ValueError as e:
            print(f"Could not find paper {row['title']}, Error: {e}")
            errors.append(row['title'])
    # `row` is a copy taken before the update, so this prints the pre-update values.
    print(row['title'], row['relative_citation_index'], row['doi'])

Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling 0.0 10.1109/ISCA45697.2020.00047
PPT-GPU: Scalable GPU Performance Modeling 0.0 10.1109/LCA.2019.2904497
Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads 0.0 10.1145/3613424.3614277
Balar: A SST GPU Component for Performance Modeling and Profiling. 0.0 10.2172/1560919
NaviSim: A Highly Accurate GPU Simulator for AMD RDNA GPUs 0.0 10.1145/3559009.3569666
A Detailed Model for Contemporary GPU Memory Systems 0.0 10.1109/ISPASS.2019.00023
Efficient L2 Cache Management to Boost GPGPU Performance 0.0 10.4995/thesis/10251/125477
No DOI available for paper Exploring Modern GPU Memory System Design Challenges through Accurate Modeling
Exploring Modern GPU Memory System Design Challenges through Accurate Modeling 0.0 nan
Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis 0.0 10.1109/HPEC55821.2022.9926299
No DOI available for 

In [64]:
print(len(errors))
print(errors)

179
['Techniques for Managing Irregular Control Flow on GPUs', 'Towards Detailed Real-Time Simulations of Cardiac Arrhythmia', 'A comparison of Algebraic Multigrid Bidomain solvers on hybrid CPU-GPU architectures', 'Chrono DEM-Engine: A Discrete Element Method dual-GPU simulator with customizable contact forces and element shape', 'A Low-Latency Communication Design for Brain Simulations', 'Performance Portable Solid Mechanics via Matrix-Free p-Multigrid', 'CPU-GPU Heterogeneous Code Acceleration of a Finite Volume Computational Fluid Dynamics Solver', 'Exploiting Nested Parallelism on Heterogeneous Processors', 'Parallel Q-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation', 'Decentralized Training of Foundation Models in Heterogeneous Environments', 'Fundamentals of a numerical cloud computing for applied sciences', 'Power modeling and architectural techniques for energy-efficient GPUs', 'Modelling angiogenesis in three dimensions', 'Shader-based 
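
With 179 failed lookups, it may help to set the affected rows aside for manual review instead of leaving them ranked with an index of 0.0. The following is a minimal sketch on a toy DataFrame, assuming titles are unique enough to identify rows. The helper name `split_failed` and the sample values are hypothetical.

```python
import pandas as pd


def split_failed(df, failed_titles):
    """Split df into (resolved, failed) rows by title (hypothetical helper)."""
    mask = df['title'].isin(failed_titles)
    return df[~mask], df[mask]


# Toy data for illustration only.
toy = pd.DataFrame({
    'title': ['Paper A', 'Paper B', 'Paper C'],
    'relative_citation_index': [12.5, 0.0, 3.0],
})
resolved, failed = split_failed(toy, ['Paper B'])
print(len(resolved), len(failed))  # 2 1
```

In the notebook this would be called as `split_failed(df, errors)`, using the `errors` list collected by the lookup loop.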

In [65]:
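# Rank by relevance first, then by citations per year (both descending).
df.sort_values(by=['relevance_score', 'relative_citation_index'], ascending=False, inplace=True)
df.to_csv('data/selected-and-sorted-data-all.csv', index=False, sep='|')
# Write only the 200 highest-ranked rows to the top-200 file.
df.head(200).to_csv('data/selected-and-sorted-data-top-200.csv', index=False, sep='|')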
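
The ranking step can be illustrated on a toy DataFrame. This sketch assumes the intent is to sort by relevance score first and relative citation index second, both descending, and then keep the first N rows. `top_n` is a hypothetical helper and the sample values are made up for illustration.

```python
import pandas as pd


def top_n(df, n):
    """Return the n highest-ranked rows without modifying df (hypothetical helper)."""
    ranked = df.sort_values(
        by=['relevance_score', 'relative_citation_index'],
        ascending=False,
    )
    return ranked.head(n)


# Toy data for illustration only.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'relevance_score': [11, 9, 11, 10],
    'relative_citation_index': [2.0, 50.0, 7.5, 1.0],
})
selected = top_n(toy, 2)
print(selected['title'].tolist())  # ['C', 'A']
```

Note that paper B has by far the highest citation index but is not selected, because relevance score is the primary sort key and the citation index only breaks ties.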