## Literature Study Preprocessing
This notebook preprocesses the literature collected on the topic of interest. The goal is to reduce the data set to the most relevant papers by adding a relative citation index to each entry: the citation count reported by Crossref divided by the number of years since publication. The top-ranked papers are then selected for further analysis.
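
As a minimal sketch of the relative citation index used in the cells below, assuming it is the citation count divided by the paper's age in years: the helper name `relative_citation_index`, the one-year floor on the age, and the numbers in the example are illustrative assumptions, not part of the original notebook.

```python
from datetime import datetime


def relative_citation_index(citation_count, pub_year, current_year=None):
    """Return citations per year since publication.

    Hypothetical helper. The age is floored at one year (an assumption)
    so that papers published in the current year do not divide by zero.
    """
    if current_year is None:
        current_year = datetime.now().year
    age = max(current_year - pub_year, 1)
    return citation_count / age


# Illustrative numbers: a 2017 paper with 3 citations, evaluated in 2025,
# is 8 years old.
print(relative_citation_index(3, 2017, current_year=2025))  # 0.375
```

Normalizing by age keeps older papers from dominating the ranking only because they have had more time to collect citations.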

## Clean the data


In [21]:
import sys
# Install the Crossref client into the environment of the running kernel.
# !{sys.executable} -m pip install crossrefapi
!{sys.executable} -m pip install crossref-commons



In [22]:
from datetime import datetime
import pandas as pd
import crossref_commons
# from crossref.restful import Works
from crossref_commons.iteration import iterate_publications_as_json
from crossref_commons.retrieval import get_entity
from crossref_commons.types import EntityType, OutputType
from time import sleep


In [23]:
df = pd.read_csv('data/pre-selected-data.csv', delimiter='|')
df['relative_citation_index'] = df['relative_citation_index'].astype(float)
df.head(1)

Unnamed: 0,#,title,abstract,venue,year,doi,authors,relative_citation_index,relevance_score
0,1,A cloud simulation based environment for multi...,Multi-disciplinary virtual prototype (MDVP) ba...,International Conference on Computer Supported...,2017,10.1109/CSCWD.2017.8066735,"Mei Wang, Tingyu Lin, Lichao Wei, Chao Ruan, L...",0.0,3


In [24]:
df.shape

(283, 9)

In [26]:
errors = []
current_year = datetime.now().year
df['relative_citation_index'] = pd.to_numeric(df['relative_citation_index'])
for index, row in df.iterrows():
    # if pd.isna(row['doi']):
    #     print(f"No DOI available for paper {row['title']}")
    try:
        # Restrict the search to the publication year and take the first hit.
        filters = {'from-pub-date': row['year'], 'until-pub-date': row['year']}
        queries = {'query.title': row['title'], 'query.author': row['authors']}
        pub = list(iterate_publications_as_json(max_results=1, queries=queries, filter=filters))[0]
        # Citations per year since publication. The age is floored at one year
        # (an assumption) so that papers from the current year do not divide by zero.
        age = max(current_year - row['year'], 1)
        df.at[index, 'doi'] = pub['DOI']
        df.at[index, 'relative_citation_index'] = pub['is-referenced-by-count'] / age
        # if pd.isna(row['abstract']):
        #     df.at[index, 'abstract'] = pub['abstract']
    except Exception as e:
        print(f"Could not find paper {row['title']}, Error: {e}")
        errors.append(row['title'])
    # else:
    #     try:
    #         pub = get_entity(row['doi'], EntityType.PUBLICATION, OutputType.JSON)
    #         df.at[index, 'relative_citation_index'] = pub['is-referenced-by-count'] / (datetime.now().year - row['year'])
    #     except ValueError as e:
    #         print(f"Could not find paper {row['title']}, Error: {e}")
    #         errors.append(row['title'])
    # Print the stored values: `row` is a copy and does not reflect the updates above.
    print(row['title'], df.at[index, 'relative_citation_index'], df.at[index, 'doi'])

Could not find paper A cloud simulation based environment for multi-disciplinary collaborative simulation and optimization, Error: API returned code 504
A cloud simulation based environment for multi-disciplinary collaborative simulation and optimization 0.375 10.1109/CSCWD.2017.8066735
Optimization of performance and scheduling of HPC applications in cloud using cloudsim and scheduling approach 0.625 10.1109/ICIOTA.2017.8073634
Could not find paper GPUCloudSim: an extension of CloudSim for modeling and simulation of GPUs in cloud data centers, Error: 'abstract'
GPUCloudSim: an extension of CloudSim for modeling and simulation of GPUs in cloud data centers 2.7142857142857144 10.1007/s11227-018-2636-7
Study on fundamental usage of CloudSim simulator and algorithms of resource allocation in cloud computing 1.625 10.1109/ICCCNT.2017.8203998
Adaptive cloud simulation using position based fluids 1.2 10.1002/cav.1657
Modifying CloudSim to accurately simulate interactive services for cloud au

KeyboardInterrupt: 
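
The lookup cell accepts the first Crossref hit without checking that it is actually the queried paper. A minimal sketch of such a check follows, assuming a fuzzy title comparison with the standard library's `difflib`. The function name `titles_match` and the 0.9 threshold are hypothetical choices, and the returned title would first have to be extracted from the Crossref record (as far as I know, Crossref work records hold `title` as a list of strings, so something like `pub['title'][0]`).

```python
from difflib import SequenceMatcher


def titles_match(queried, returned, threshold=0.9):
    """Return True if two titles are similar enough to be the same paper.

    Hypothetical helper; the threshold is an assumption and may need tuning.
    """
    # Normalize case and whitespace before comparing.
    a = " ".join(queried.lower().split())
    b = " ".join(returned.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(titles_match(
    "Adaptive cloud simulation using position based fluids",
    "Adaptive Cloud Simulation Using Position Based Fluids",
))  # True
print(titles_match(
    "Adaptive cloud simulation using position based fluids",
    "Visualization of HPC simulation data",
))  # False
```

A row whose best hit fails this check could be appended to `errors` instead of overwriting the stored DOI and citation index.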

In [6]:
print(len(errors))
print(errors)

19
['Visualization of HPC simulation data', 'Assessing Effects of Asymmetries, Dynamics, and Failures on a Cloud Simulator', 'Climbing Up Cloud Nine: Performance Enhancement Techniques for Cloud Computing Environments', 'On the elastic optimisation of cloud IaaS environments', 'A Scheme for Improving the Performance of User Authenticity Through Client Validation Process Using Fuzzy Associative Memory (FAM) in Cloud Computing', 'Comparison Study of The Most Common Virtual Machine Load Balancing Algorithms in Large-Scale Cloud Environment Using Cloud Simulator', 'Προσομοίωση υπολογιστικών νέφων', 'Προσομοίωση υπολογιστικών νεφών', 'Simulators for Cloud Computing_ A Survey', 'Architecture and Quality of Cloud Simulators', 'Scalability assurance process in replication and migration using cloud simulator', 'Cluster-based Analysis of Multi-Parameter Distributions in Cloud Simulation Ensembles', 'SecNetworkCloudSim: An Extensible Simulation Tool for Secure Distributed Mobile Applications', 'E
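
Some of the failures collected in `errors` may be transient, such as the `API returned code 504` message in the lookup output. A minimal sketch of retrying a call with exponential backoff is shown here. The `retry` helper, its parameters, the `flaky_lookup` stand-in, and the placeholder DOI are all hypothetical; in the notebook, the wrapped call would be the Crossref query itself.

```python
import time


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or `attempts` calls have failed.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for a Crossref lookup that fails twice and then succeeds.
calls = {"n": 0}


def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("API returned code 504")
    return {"DOI": "10.0000/example"}  # placeholder, not a real DOI


waits = []
# Record the waits instead of actually sleeping, to keep the demo fast.
result = retry(flaky_lookup, sleep=waits.append)
print(result["DOI"], waits)  # 10.0000/example [1.0, 2.0]
```

Catching every `Exception` mirrors the lookup cell. Retrying only on known transient errors would avoid repeating queries that cannot succeed, such as a paper that Crossref does not index.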

In [7]:
df.sort_values(by=['relevance_score', 'relative_citation_index'], ascending=False, inplace=True)
# df.sort_values(by=['relative_citation_index'], ascending=False, inplace=True)
df.to_csv('data/selected-and-sorted-data-all.csv', index=False, sep='|')
df.head(100).to_csv('data/selected-and-sorted-data-top-100.csv', index=False, sep='|')