# CORD-19-collect-scopus-data

This Jupyter notebook collects additional data from Scopus to enrich the CORD-19 dataset:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncs0

First, the required packages are imported into the notebook.

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz
import datetime
import re
from urllib.parse import urlparse
from collections import Counter

from elsapy.elsclient import ElsClient
from elsapy.elsdoc import FullDoc, AbsDoc
from elsapy.elssearch import ElsSearch

import time  # for sleep
from pybtex.database import parse_file, BibliographyData, Entry
import json

Get the data and save it to a variable.

In [2]:
CORD19_CSV = pd.read_csv('../data/cord-19/CORD19_software_mentions.csv')

Check the length of the column containing the DOIs.

In [3]:
len(CORD19_CSV['doi'])

77448

Display the doi column to check for inconsistencies such as NaN values.

In [4]:
doi = CORD19_CSV['doi']
doi

0                                 NaN
1          10.1016/j.regg.2021.01.002
2           10.1016/j.rec.2020.08.002
3        10.1016/j.vetmic.2006.11.026
4                   10.3390/v12080849
                     ...             
77443      10.1007/s11229-020-02869-9
77444                             NaN
77445     10.1101/2020.05.13.20100206
77446      10.1007/s42991-020-00052-8
77447     10.1101/2020.09.14.20194670
Name: doi, Length: 77448, dtype: object

Create a series that contains only the unique DOIs; value_counts() drops NaN values by default. It is important to sort the unique values by index. Otherwise, the order can differ after each restart of the notebook, and the cells below address the DOIs by position.

In [5]:
doi_counted = doi.value_counts().sort_index(ascending=True)
doi_counted

10.1001/jamainternmed.2020.1369       1
10.1001/jamanetworkopen.2020.16382    1
10.1001/jamanetworkopen.2020.17521    1
10.1001/jamanetworkopen.2020.20485    1
10.1001/jamanetworkopen.2020.24984    1
                                     ..
10.9745/ghsp-d-20-00115               1
10.9745/ghsp-d-20-00171               1
10.9745/ghsp-d-20-00218               1
10.9758/cpn.2020.18.4.607             1
10.9781/ijimai.2020.02.002            1
Name: doi, Length: 74302, dtype: int64
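
The effect of this step can also be illustrated without pandas. The following is a minimal sketch on a small made-up sample (two DOIs borrowed from the output above, one of them repeated on purpose); the helper name unique_sorted_dois is hypothetical and not part of the notebook. It shows that missing values have to be dropped before counting and that sorting gives every DOI a reproducible position.

```python
import math
from collections import Counter

def unique_sorted_dois(values):
    """Count each DOI, skip missing values, and return (doi, count) pairs sorted by DOI."""
    # NaN is a float that never equals itself, so it is filtered explicitly
    present = [v for v in values if not (isinstance(v, float) and math.isnan(v))]
    counts = Counter(present)
    # sorting by DOI fixes the position of every entry across runs
    return sorted(counts.items())

# made-up sample for illustration only
sample = [float("nan"), "10.3390/v12080849", "10.1016/j.rec.2020.08.002",
          "10.3390/v12080849", float("nan")]
pairs = unique_sorted_dois(sample)
print(pairs)
```

Because the result is sorted, position i refers to the same DOI in every run, which is what allows an interrupted fetch to resume from a stored index.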

The following function retrieves the requested information from the Scopus API. Example request: (https://api.elsevier.com/content/search/scopus?query=DOI(10.1109/MCOM.2016.7509373)&apiKey=6d485ef1fe1408712f37e8a783a285a4)

In [6]:
# Adapted from https://github.com/ElsevierDev/elsapy/blob/master/exampleProg.py
def fetch_scopus_api(client, doi):
    """Obtain additional paper information from Scopus by DOI.

    Returns the data of the abstract document, or None if the lookup fails.
    """
    doc_srch = ElsSearch("DOI(" + doi + ")", 'scopus')
    doc_srch.execute(client, get_all=True)
    try:
        # the Scopus ID is the part after the colon in "dc:identifier"
        scopus_id = doc_srch.results[0]["dc:identifier"].split(":")[1]
        scp_doc = AbsDoc(scp_id=scopus_id)
        if scp_doc.read(client):
            scp_doc.write()
        else:
            print("Read document failed.")
        return scp_doc.data
    except Exception:
        # e.g. the search returned no usable result for this DOI;
        # unlike a bare except, this still lets Ctrl+C interrupt a long run
        return None
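
The function relies on elsapy to issue the request. To make the underlying call more tangible, here is a minimal sketch that only builds the search URL (endpoint and parameter names are taken from the example request quoted above) and extracts the ID from an identifier string. The helper names, the placeholder key and the assumed identifier form "SCOPUS_ID:<number>" are illustrative assumptions; no request is sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs

SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"

def build_doi_search_url(doi, api_key):
    """Build the search URL for one DOI; the key is passed in, never hard-coded."""
    query = urlencode({"query": "DOI(" + doi + ")", "apiKey": api_key})
    return SCOPUS_SEARCH_URL + "?" + query

def parse_scopus_id(identifier):
    """Extract the numeric part of an identifier such as 'SCOPUS_ID:85083171050'."""
    prefix, _, scopus_id = identifier.partition(":")
    if prefix != "SCOPUS_ID" or not scopus_id:
        raise ValueError("unexpected identifier: " + identifier)
    return scopus_id

# YOUR_API_KEY is a placeholder, not a real key
url = build_doi_search_url("10.1109/MCOM.2016.7509373", "YOUR_API_KEY")
params = parse_qs(urlparse(url).query)
print(params["query"][0])
print(parse_scopus_id("SCOPUS_ID:85083171050"))
```

Using urlencode keeps special characters in a DOI (slashes, parentheses) from breaking the query string.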

The configuration file config.json must be set up beforehand and contain an API key. Further information: https://github.com/ElsevierDev/elsapy/blob/master/CONFIG.md

In [7]:
with open("config.json") as con_file:
    config = json.load(con_file)
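
A missing or empty key only shows up later as failed API calls, so it can help to validate the configuration right after loading it. The following is a minimal sketch under the assumption that the file is a JSON object with an "apikey" entry, as used by this notebook; the function name load_api_key and the placeholder key are hypothetical.

```python
import json
import os
import tempfile

def load_api_key(path):
    """Read the config file and return the value stored under 'apikey'."""
    with open(path) as con_file:
        config = json.load(con_file)
    api_key = config.get("apikey")
    if not api_key:
        raise KeyError("config file has no 'apikey' entry: " + path)
    return api_key

# demonstration with a temporary file and a placeholder key
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w") as demo_file:
        json.dump({"apikey": "YOUR_API_KEY"}, demo_file)
    demo_key = load_api_key(demo_path)
print(demo_key)
```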

Next, the client is initialized with the API key.

In [8]:
client = ElsClient(config['apikey'])

For demonstration purposes, the following cell shows which data is returned by the Scopus API.

In [9]:
return_example = fetch_scopus_api(client, '10.1016/j.dsx.2020.04.012')
print(json.dumps(return_example, indent=2))

{
  "affiliation": [
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Hamdard",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Millia Islamia",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Indraprastha Apollo Hospitals",
      "affiliation-country": "India"
    }
  ],
  "coredata": {
    "srctype": "j",
    "eid": "2-s2.0-85083171050",
    "pubmed-id": "32305024",
    "prism:coverDate": "2020-07-01",
    "prism:aggregationType": "Journal",
    "prism:url": "https://api.elsevier.com/content/abstract/scopus_id/85083171050",
    "dc:creator": {
      "author": [
        {
          "ce:given-name": "Raju",
          "preferred-name": {
            "ce:given-name": "Raju",
            "ce:initials": "R.",
            "ce:surname": "Vaishya",
            "ce:indexed-name": "Vaishya R."
          },
          "@seq": "1",
          "ce:init

Based on the returned data, further analysis can be conducted. For this purpose, two notebooks were created to analyse the data linked to:
<ul>
  <li>affiliation</li>
  <li>coredata</li>
</ul>    
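
As a small illustration of what such an analysis can start from, the following sketch reads two fields from a record shaped like the output above. The function names are hypothetical, the record is a shortened excerpt of that output, and the handling of a single affiliation stored as a dict rather than a list is an assumption, not something confirmed by the data shown here.

```python
from collections import Counter

def affiliation_countries(data):
    """Return the countries listed under 'affiliation' in one returned record."""
    affiliation = data.get("affiliation") or []
    # assumption: a record with a single affiliation may hold a dict instead of a list
    if isinstance(affiliation, dict):
        affiliation = [affiliation]
    return [entry.get("affiliation-country") for entry in affiliation]

def cover_year(data):
    """Return the year of 'prism:coverDate' (format YYYY-MM-DD), or None if absent."""
    cover_date = data.get("coredata", {}).get("prism:coverDate")
    return int(cover_date[:4]) if cover_date else None

# shortened excerpt of the record shown above
record = {
    "affiliation": [
        {"affiliation-city": "New Delhi", "affilname": "Jamia Hamdard", "affiliation-country": "India"},
        {"affiliation-city": "New Delhi", "affilname": "Jamia Millia Islamia", "affiliation-country": "India"},
        {"affiliation-city": "New Delhi", "affilname": "Indraprastha Apollo Hospitals", "affiliation-country": "India"},
    ],
    "coredata": {"srctype": "j", "prism:coverDate": "2020-07-01"},
}
country_counts = Counter(affiliation_countries(record))
print(country_counts, cover_year(record))
```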

Next, the coredata and affiliation data fetched so far are read and combined into a DataFrame for further processing.

In [10]:
df_current_extra_info = pd.DataFrame()
try:
    read_affiliation = pd.read_pickle('extra_info_affiliation_CS.pkl')
    read_coredata = pd.read_pickle('extra_info_coredata_CS.pkl')
    df_current_extra_info['affiliation'] = read_affiliation
    df_current_extra_info['coredata'] = read_coredata
except FileNotFoundError:
    # nothing has been stored yet, so the DataFrame stays empty
    print("The DataFrame is empty")

The length of the DataFrame holding the current information is assigned to a variable.
It serves as the starting index of the while loop below, so that fetching resumes where the previous run stopped.

In [11]:
len_df_current_extra_info = len(df_current_extra_info)
len_df_current_extra_info

46129

In [12]:
df_current_extra_info

Unnamed: 0,affiliation,coredata
0,"[{'affiliation-city': 'Palo Alto', 'affilname'...","{'srctype': 'j', 'eid': '2-s2.0-85083266658', ..."
1,"[{'affiliation-city': 'Seattle', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '7',..."
2,"[{'affiliation-city': 'Cambridge', 'affilname'...","{'srctype': 'j', 'prism:issueIdentifier': '8',..."
3,"[{'affiliation-city': 'Madison', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '9',..."
4,"[{'affiliation-city': 'Los Angeles', 'affilnam...","{'srctype': 'j', 'prism:issueIdentifier': '11'..."
...,...,...
46124,,
46125,,
46126,,
46127,,


In [13]:
def contains_only_None(dic):
    """
    Inspect a dictionary and return True if it contains only None values.
    """
    # use the parameter dic, not the global dict_new_extra_info
    return all(value is None for value in dic.values())
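
The intended behaviour of this check can be shown on small inputs. The following is a minimal stand-alone sketch; the name only_none_values is hypothetical and chosen so that it does not shadow the notebook's function.

```python
def only_none_values(dic):
    """Return True if every value in the dictionary is None."""
    return all(value is None for value in dic.values())

print(only_none_values({46129: None, 46130: None}))          # all lookups failed
print(only_none_values({0: {"affiliation": []}, 1: None}))   # one lookup succeeded
print(only_none_values({}))                                  # empty dictionary
```

Note that an empty dictionary also counts as "only None", since there is no value that contradicts the condition.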

In [14]:
def append_fetched_data_to_df(df_current_extra_info, dic):
    """
    Append or insert newly fetched data into the DataFrame containing the Scopus data.
    """
    # df_current_extra_info -> holds the latest data; new data is appended to it
    # df_newly_fetched_transposed -> holds the newly fetched data, which is inserted or appended

    if contains_only_None(dic):
        # nothing was found: create placeholder rows filled with None
        placeholder_entries = pd.DataFrame(np.empty((len(dic), 2), dtype=object), columns=['affiliation', 'coredata'], index=dic.keys())
        df_newly_fetched_transposed = placeholder_entries
        print(placeholder_entries)
    else:
        # before appending, the dictionary is transformed into a DataFrame
        df_newly_fetched = pd.DataFrame(dic)
        # transpose so that each fetched entry becomes one row
        df_newly_fetched_transposed = df_newly_fetched.T
        print(df_newly_fetched_transposed)

    for index, row in df_newly_fetched_transposed.iterrows():
        if index in df_current_extra_info.index and row.affiliation is not None:
            # the row already exists: replace it with the newly fetched data
            df_current_extra_info.loc[index] = row
        elif index not in df_current_extra_info.index:
            # the row is new: append it
            # (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
            df_current_extra_info = pd.concat([df_current_extra_info, row.to_frame().T], ignore_index=True)

    # return the DataFrame with inserted and appended rows
    return df_current_extra_info
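
The merge rule of this function (replace an existing row only if the new fetch returned data, append rows for new positions) may be easier to follow without the DataFrame machinery. The following is a minimal sketch of the same rule on plain dictionaries; merge_rows and the sample rows are hypothetical.

```python
def merge_rows(current, fetched):
    """Merge fetched rows into current rows, both given as {position: row-or-None}."""
    merged = dict(current)
    for position, row in fetched.items():
        if position not in merged:
            # new position: append, even if nothing was found (row is None)
            merged[position] = row
        elif row is not None:
            # known position: only overwrite when the new fetch returned data
            merged[position] = row
    return merged

current = {0: {"affilname": "A"}, 1: None}
fetched = {0: None, 1: {"affilname": "B"}, 2: None}
merged = merge_rows(current, fetched)
print(merged)
```

Position 0 keeps its stored data because the new fetch returned nothing, position 1 is filled in, and position 2 is added as a placeholder.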

The two DataFrame columns are each stored as a Series object. Each Series is written to its own pkl file so that no single file exceeds the 100 MB size limit for GitHub uploads.

In [15]:
def store_df_columns(df):
    ser_affiliation = df['affiliation']
    ser_coredata = df['coredata']
    ser_affiliation.to_pickle('extra_info_affiliation_CS.pkl')
    ser_coredata.to_pickle('extra_info_coredata_CS.pkl')
    return ser_affiliation, ser_coredata
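
Whether the split actually keeps every file below the limit can be checked after writing. The following is a minimal sketch using the standard pickle module instead of pandas; the function store_columns is hypothetical, the file names follow the pattern used above, and reading "100 MB" as 10**8 bytes is an assumption.

```python
import os
import pickle
import tempfile

GITHUB_FILE_LIMIT = 100 * 1000 * 1000  # assumption: 100 MB read as 10**8 bytes

def store_columns(columns, directory):
    """Pickle each column to its own file and return {file name: size in bytes}."""
    sizes = {}
    for name, values in columns.items():
        path = os.path.join(directory, "extra_info_" + name + "_CS.pkl")
        with open(path, "wb") as handle:
            pickle.dump(values, handle)
        sizes[os.path.basename(path)] = os.path.getsize(path)
    return sizes

# demonstration in a temporary directory with tiny placeholder columns
with tempfile.TemporaryDirectory() as tmp_dir:
    sizes = store_columns({"affiliation": [None, None], "coredata": [None, None]}, tmp_dir)
too_large = [name for name, size in sizes.items() if size > GITHUB_FILE_LIMIT]
print(sizes, too_large)
```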

In [16]:
# placeholder_entries = pd.DataFrame(np.empty((4,2),dtype=object),columns=['affiliation','coredata'])

In [17]:
# placeholder_entries

Subsequently, the fetched Scopus data is stored in a dictionary. The print function shows the progress of the process by displaying the most recently fetched position and DOI.

In [None]:
%%time
dict_new_extra_info = dict()
len_dois = len(doi_counted)
def trigger_fetching():
    threshold = 0
    i = len_df_current_extra_info
    while i < len_dois:  # upper bound; normally len_dois
        dict_new_extra_info[i] = fetch_scopus_api(client, doi_counted.index[i])
        print("Position fetched: " + str(i) + " -> " + doi_counted.index[i])
        i = i + 1
        threshold = threshold + 1
        # save after every 100 fetched entries and once more at the very end,
        # so that a trailing partial batch is not lost
        if threshold > 99 or i == len_dois:
            df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
            stored_series = store_df_columns(df_combined_extra_info)
            threshold = 0
            print("batch saved")
trigger_fetching()

Position fetched: 46129 -> 10.1101/2020.09.27.20202754
Position fetched: 46130 -> 10.1101/2020.09.27.315762
Position fetched: 46131 -> 10.1101/2020.09.27.316018
Position fetched: 46132 -> 10.1101/2020.09.27.316158
Position fetched: 46133 -> 10.1101/2020.09.28.20164798
Position fetched: 46134 -> 10.1101/2020.09.28.20190009
Position fetched: 46135 -> 10.1101/2020.09.28.20201475
Position fetched: 46136 -> 10.1101/2020.09.28.20201947
Position fetched: 46137 -> 10.1101/2020.09.28.20202028
Position fetched: 46138 -> 10.1101/2020.09.28.20202804
Position fetched: 46139 -> 10.1101/2020.09.28.20202911
Position fetched: 46140 -> 10.1101/2020.09.28.20202929
Position fetched: 46141 -> 10.1101/2020.09.28.20202937
Position fetched: 46142 -> 10.1101/2020.09.28.20202945
Position fetched: 46143 -> 10.1101/2020.09.28.20202952
Position fetched: 46144 -> 10.1101/2020.09.28.20203109
Position fetched: 46145 -> 10.1101/2020.09.28.20203166
Position fetched: 46146 -> 10.1101/2020.09.28.20203174
Position fetched

Position fetched: 46273 -> 10.1101/2020.10.05.20207001
Position fetched: 46274 -> 10.1101/2020.10.05.20207118
Position fetched: 46275 -> 10.1101/2020.10.05.20207217
Position fetched: 46276 -> 10.1101/2020.10.05.20207423
Position fetched: 46277 -> 10.1101/2020.10.05.325290
Position fetched: 46278 -> 10.1101/2020.10.05.326850
Position fetched: 46279 -> 10.1101/2020.10.06.20204487
Position fetched: 46280 -> 10.1101/2020.10.06.20205864
Position fetched: 46281 -> 10.1101/2020.10.06.20207571
Position fetched: 46282 -> 10.1101/2020.10.06.20207761
Position fetched: 46283 -> 10.1101/2020.10.06.20208033
Position fetched: 46284 -> 10.1101/2020.10.06.20208132
Position fetched: 46285 -> 10.1101/2020.10.06.323634
Position fetched: 46286 -> 10.1101/2020.10.06.327080
Position fetched: 46287 -> 10.1101/2020.10.06.327445
Position fetched: 46288 -> 10.1101/2020.10.06.327452
Position fetched: 46289 -> 10.1101/2020.10.06.327742
Position fetched: 46290 -> 10.1101/2020.10.06.328112
Position fetched: 46291 ->

Position fetched: 46418 -> 10.1101/2020.10.13.20211425
Position fetched: 46419 -> 10.1101/2020.10.13.20211664
Position fetched: 46420 -> 10.1101/2020.10.13.20211763
Position fetched: 46421 -> 10.1101/2020.10.13.20211771
Position fetched: 46422 -> 10.1101/2020.10.13.20211821
Position fetched: 46423 -> 10.1101/2020.10.13.20211854
Position fetched: 46424 -> 10.1101/2020.10.13.20211888
Position fetched: 46425 -> 10.1101/2020.10.13.20211912
Position fetched: 46426 -> 10.1101/2020.10.13.20211953
Position fetched: 46427 -> 10.1101/2020.10.13.20212035
Position fetched: 46428 -> 10.1101/2020.10.13.20212092
      affiliation coredata
46129        None     None
46130        None     None
46131        None     None
46132        None     None
46133        None     None
...           ...      ...
46424        None     None
46425        None     None
46426        None     None
46427        None     None
46428        None     None

[300 rows x 2 columns]
batch saved
Position fetched: 46429 -> 10.1101/

Position fetched: 46556 -> 10.1101/2020.10.19.346031
Position fetched: 46557 -> 10.1101/2020.10.20.20210195
Position fetched: 46558 -> 10.1101/2020.10.20.20213116
Position fetched: 46559 -> 10.1101/2020.10.20.20213793
Position fetched: 46560 -> 10.1101/2020.10.20.20215541
Position fetched: 46561 -> 10.1101/2020.10.20.20215608
Position fetched: 46562 -> 10.1101/2020.10.20.20215616
Position fetched: 46563 -> 10.1101/2020.10.20.20215715
Position fetched: 46564 -> 10.1101/2020.10.20.20215756
Position fetched: 46565 -> 10.1101/2020.10.20.20215814
Position fetched: 46566 -> 10.1101/2020.10.20.20215863
Position fetched: 46567 -> 10.1101/2020.10.20.20215905
Position fetched: 46568 -> 10.1101/2020.10.20.20215962
Position fetched: 46569 -> 10.1101/2020.10.20.20215970
Position fetched: 46570 -> 10.1101/2020.10.20.20216150
Position fetched: 46571 -> 10.1101/2020.10.20.20216283
Position fetched: 46572 -> 10.1101/2020.10.20.20216291
Position fetched: 46573 -> 10.1101/2020.10.20.20216309
Position fet

Position fetched: 46700 -> 10.1101/2020.10.27.20211631
Position fetched: 46701 -> 10.1101/2020.10.27.20215566
Position fetched: 46702 -> 10.1101/2020.10.27.20216366
Position fetched: 46703 -> 10.1101/2020.10.27.20219196
Position fetched: 46704 -> 10.1101/2020.10.27.20219717
Position fetched: 46705 -> 10.1101/2020.10.27.20220061
Position fetched: 46706 -> 10.1101/2020.10.27.20220400
Position fetched: 46707 -> 10.1101/2020.10.27.20220442
Position fetched: 46708 -> 10.1101/2020.10.27.20220541
Position fetched: 46709 -> 10.1101/2020.10.27.20220640
Position fetched: 46710 -> 10.1101/2020.10.27.20220665
Position fetched: 46711 -> 10.1101/2020.10.27.20220715
Position fetched: 46712 -> 10.1101/2020.10.27.20220723
Position fetched: 46713 -> 10.1101/2020.10.27.20220830
Position fetched: 46714 -> 10.1101/2020.10.27.20220863
Position fetched: 46715 -> 10.1101/2020.10.27.20220897
Position fetched: 46716 -> 10.1101/2020.10.27.20220905
Position fetched: 46717 -> 10.1101/2020.10.27.354563
Position fet

Position fetched: 46838 -> 10.1101/2020.11.01.363499
Position fetched: 46839 -> 10.1101/2020.11.01.363812
Position fetched: 46840 -> 10.1101/2020.11.01.364224
Position fetched: 46841 -> 10.1101/2020.11.02.20183236
Position fetched: 46842 -> 10.1101/2020.11.02.20215657
Position fetched: 46843 -> 10.1101/2020.11.02.20221309
Position fetched: 46844 -> 10.1101/2020.11.02.20221622
Position fetched: 46845 -> 10.1101/2020.11.02.20222778
Position fetched: 46846 -> 10.1101/2020.11.02.20223404
Position fetched: 46847 -> 10.1101/2020.11.02.20223560
Position fetched: 46848 -> 10.1101/2020.11.02.20223636
Position fetched: 46849 -> 10.1101/2020.11.02.20224204
Position fetched: 46850 -> 10.1101/2020.11.02.20224212
Position fetched: 46851 -> 10.1101/2020.11.02.20224303
Position fetched: 46852 -> 10.1101/2020.11.02.20224352
Position fetched: 46853 -> 10.1101/2020.11.02.20224402
Position fetched: 46854 -> 10.1101/2020.11.02.20224485
Position fetched: 46855 -> 10.1101/2020.11.02.20224550
Position fetched

Position fetched: 46983 -> 10.1101/2020.11.07.20227447
Position fetched: 46984 -> 10.1101/2020.11.07.20227504
Position fetched: 46985 -> 10.1101/2020.11.07.20227512
Position fetched: 46986 -> 10.1101/2020.11.07.20227520
Position fetched: 46987 -> 10.1101/2020.11.07.20227603
Position fetched: 46988 -> 10.1101/2020.11.07.365726
Position fetched: 46989 -> 10.1101/2020.11.07.372938
Position fetched: 46990 -> 10.1101/2020.11.08.20184663
Position fetched: 46991 -> 10.1101/2020.11.08.20217653
Position fetched: 46992 -> 10.1101/2020.11.08.20222638
Position fetched: 46993 -> 10.1101/2020.11.08.20224790
Position fetched: 46994 -> 10.1101/2020.11.08.20227470
Position fetched: 46995 -> 10.1101/2020.11.08.20227702
Position fetched: 46996 -> 10.1101/2020.11.08.20227819
Position fetched: 46997 -> 10.1101/2020.11.08.20227876
Position fetched: 46998 -> 10.1101/2020.11.08.20227884
Position fetched: 46999 -> 10.1101/2020.11.08.20227892
Position fetched: 47000 -> 10.1101/2020.11.08.20227975
Position fetch

Position fetched: 47128 -> 10.1101/2020.11.13.381343
      affiliation coredata
46129        None     None
46130        None     None
46131        None     None
46132        None     None
46133        None     None
...           ...      ...
47124        None     None
47125        None     None
47126        None     None
47127        None     None
47128        None     None

[1000 rows x 2 columns]
batch saved
Position fetched: 47129 -> 10.1101/2020.11.14.20230938
Position fetched: 47130 -> 10.1101/2020.11.14.20231142
Position fetched: 47131 -> 10.1101/2020.11.14.20231704
Position fetched: 47132 -> 10.1101/2020.11.14.20231878
Position fetched: 47133 -> 10.1101/2020.11.14.20231886
Position fetched: 47134 -> 10.1101/2020.11.14.382416
Position fetched: 47135 -> 10.1101/2020.11.14.382572
Position fetched: 47136 -> 10.1101/2020.11.14.382697
Position fetched: 47137 -> 10.1101/2020.11.14.383075
Position fetched: 47138 -> 10.1101/2020.11.15.20229971
Position fetched: 47139 -> 10.1101/2020.11.1

Position fetched: 47266 -> 10.1101/2020.11.20.20235341
Position fetched: 47267 -> 10.1101/2020.11.20.20235390
Position fetched: 47268 -> 10.1101/2020.11.20.20235630
Position fetched: 47269 -> 10.1101/2020.11.20.20235648
Position fetched: 47270 -> 10.1101/2020.11.20.20235697
Position fetched: 47271 -> 10.1101/2020.11.20.20235705
Position fetched: 47272 -> 10.1101/2020.11.20.20235895
Position fetched: 47273 -> 10.1101/2020.11.20.20235978
Position fetched: 47274 -> 10.1101/2020.11.20.390625
Position fetched: 47275 -> 10.1101/2020.11.20.390690
Position fetched: 47276 -> 10.1101/2020.11.20.391318
Position fetched: 47277 -> 10.1101/2020.11.20.391532
Position fetched: 47278 -> 10.1101/2020.11.20.392126
Position fetched: 47279 -> 10.1101/2020.11.21.20235283
Position fetched: 47280 -> 10.1101/2020.11.21.20235853
Position fetched: 47281 -> 10.1101/2020.11.21.20236018
Position fetched: 47282 -> 10.1101/2020.11.21.20236034
Position fetched: 47283 -> 10.1101/2020.11.21.20236042
Position fetched: 47

Position fetched: 47410 -> 10.1101/2020.11.29.20240416
Position fetched: 47411 -> 10.1101/2020.11.29.20240481
Position fetched: 47412 -> 10.1101/2020.11.29.20240499
Position fetched: 47413 -> 10.1101/2020.11.29.20240515
Position fetched: 47414 -> 10.1101/2020.11.29.20240564
Position fetched: 47415 -> 10.1101/2020.11.29.20240580
Position fetched: 47416 -> 10.1101/2020.11.29.20240606
Position fetched: 47417 -> 10.1101/2020.11.29.20240614
Position fetched: 47418 -> 10.1101/2020.11.29.402339
Position fetched: 47419 -> 10.1101/2020.11.29.402404
Position fetched: 47420 -> 10.1101/2020.11.29.402669
Position fetched: 47421 -> 10.1101/2020.11.29.402677
Position fetched: 47422 -> 10.1101/2020.11.30.20239566
Position fetched: 47423 -> 10.1101/2020.11.30.20239806
Position fetched: 47424 -> 10.1101/2020.11.30.20239947
Position fetched: 47425 -> 10.1101/2020.11.30.20240671
Position fetched: 47426 -> 10.1101/2020.11.30.20240721
Position fetched: 47427 -> 10.1101/2020.11.30.20240739
Position fetched: 

Position fetched: 47548 -> 10.1101/2020.12.04.20244087
Position fetched: 47549 -> 10.1101/2020.12.04.20244129
Position fetched: 47550 -> 10.1101/2020.12.04.20244137
Position fetched: 47551 -> 10.1101/2020.12.04.20244145
Position fetched: 47552 -> 10.1101/2020.12.04.20244194
Position fetched: 47553 -> 10.1101/2020.12.04.406421
Position fetched: 47554 -> 10.1101/2020.12.04.408260
Position fetched: 47555 -> 10.1101/2020.12.04.409144
Position fetched: 47556 -> 10.1101/2020.12.04.410589
Position fetched: 47557 -> 10.1101/2020.12.04.411660
Position fetched: 47558 -> 10.1101/2020.12.04.411744
Position fetched: 47559 -> 10.1101/2020.12.04.412155
Position fetched: 47560 -> 10.1101/2020.12.04.412494
Position fetched: 47561 -> 10.1101/2020.12.05.20222968
Position fetched: 47562 -> 10.1101/2020.12.05.20241927
Position fetched: 47563 -> 10.1101/2020.12.05.20244376
Position fetched: 47564 -> 10.1101/2020.12.05.20244426
Position fetched: 47565 -> 10.1101/2020.12.05.20244442
Position fetched: 47566 ->

Position fetched: 47693 -> 10.1101/2020.12.11.421008
Position fetched: 47694 -> 10.1101/2020.12.11.421057
Position fetched: 47695 -> 10.1101/2020.12.11.422055
Position fetched: 47696 -> 10.1101/2020.12.11.422139
Position fetched: 47697 -> 10.1101/2020.12.12.20246934
Position fetched: 47698 -> 10.1101/2020.12.12.20248070
Position fetched: 47699 -> 10.1101/2020.12.12.20248103
Position fetched: 47700 -> 10.1101/2020.12.12.422477
Position fetched: 47701 -> 10.1101/2020.12.12.422516
Position fetched: 47702 -> 10.1101/2020.12.12.422532
Position fetched: 47703 -> 10.1101/2020.12.13.20247254
Position fetched: 47704 -> 10.1101/2020.12.13.20248120
Position fetched: 47705 -> 10.1101/2020.12.13.20248122
Position fetched: 47706 -> 10.1101/2020.12.13.20248123
Position fetched: 47707 -> 10.1101/2020.12.13.20248129
Position fetched: 47708 -> 10.1101/2020.12.13.20248133
Position fetched: 47709 -> 10.1101/2020.12.13.20248141
Position fetched: 47710 -> 10.1101/2020.12.13.20248142
Position fetched: 47711 

Position fetched: 47831 -> 10.1101/2020.12.18.20248434
Position fetched: 47832 -> 10.1101/2020.12.18.20248439
Position fetched: 47833 -> 10.1101/2020.12.18.20248452
Position fetched: 47834 -> 10.1101/2020.12.18.20248454
Position fetched: 47835 -> 10.1101/2020.12.18.20248461
Position fetched: 47836 -> 10.1101/2020.12.18.20248466
Position fetched: 47837 -> 10.1101/2020.12.18.20248470
Position fetched: 47838 -> 10.1101/2020.12.18.20248479
Position fetched: 47839 -> 10.1101/2020.12.18.20248480
Position fetched: 47840 -> 10.1101/2020.12.18.20248483
Position fetched: 47841 -> 10.1101/2020.12.18.20248498
Position fetched: 47842 -> 10.1101/2020.12.18.20248499
Position fetched: 47843 -> 10.1101/2020.12.18.20248506
Position fetched: 47844 -> 10.1101/2020.12.18.20248509
Position fetched: 47845 -> 10.1101/2020.12.18.20248518
Position fetched: 47846 -> 10.1101/2020.12.18.422865
Position fetched: 47847 -> 10.1101/2020.12.18.423358
Position fetched: 47848 -> 10.1101/2020.12.18.423363
Position fetched

Position fetched: 47976 -> 10.1101/2020.12.23.424169
Position fetched: 47977 -> 10.1101/2020.12.23.424171
Position fetched: 47978 -> 10.1101/2020.12.23.424172
Position fetched: 47979 -> 10.1101/2020.12.23.424177
Position fetched: 47980 -> 10.1101/2020.12.23.424189
Position fetched: 47981 -> 10.1101/2020.12.23.424194
Position fetched: 47982 -> 10.1101/2020.12.23.424199
Position fetched: 47983 -> 10.1101/2020.12.23.424229
Position fetched: 47984 -> 10.1101/2020.12.23.424283
Position fetched: 47985 -> 10.1101/2020.12.24.20248633
Position fetched: 47986 -> 10.1101/2020.12.24.20248672
Position fetched: 47987 -> 10.1101/2020.12.24.20248802
Position fetched: 47988 -> 10.1101/2020.12.24.20248813
Position fetched: 47989 -> 10.1101/2020.12.24.20248822
Position fetched: 47990 -> 10.1101/2020.12.24.20248825
Position fetched: 47991 -> 10.1101/2020.12.24.20248826
Position fetched: 47992 -> 10.1101/2020.12.24.20248830
Position fetched: 47993 -> 10.1101/2020.12.24.20248834
Position fetched: 47994 -> 1

Position fetched: 48121 -> 10.1101/2021.01.03.21249166
Position fetched: 48122 -> 10.1101/2021.01.03.21249168
Position fetched: 48123 -> 10.1101/2021.01.03.21249175
Position fetched: 48124 -> 10.1101/2021.01.03.21249182
Position fetched: 48125 -> 10.1101/2021.01.03.21249183
Position fetched: 48126 -> 10.1101/2021.01.03.424883
Position fetched: 48127 -> 10.1101/2021.01.03.425115
Position fetched: 48128 -> 10.1101/2021.01.03.425139
      affiliation coredata
46129        None     None
46130        None     None
46131        None     None
46132        None     None
46133        None     None
...           ...      ...
48124        None     None
48125        None     None
48126        None     None
48127        None     None
48128        None     None

[2000 rows x 2 columns]
batch saved
Position fetched: 48129 -> 10.1101/2021.01.03.425167
Position fetched: 48130 -> 10.1101/2021.01.04.20232520
Position fetched: 48131 -> 10.1101/2021.01.04.20237578
Position fetched: 48132 -> 10.1101/2021.01

Position fetched: 48259 -> 10.1101/2021.01.10.426143
Position fetched: 48260 -> 10.1101/2021.01.11.20248606
Position fetched: 48261 -> 10.1101/2021.01.11.20248765
Position fetched: 48262 -> 10.1101/2021.01.11.20248947
Position fetched: 48263 -> 10.1101/2021.01.11.21249265
Position fetched: 48264 -> 10.1101/2021.01.11.21249276
Position fetched: 48265 -> 10.1101/2021.01.11.21249435
Position fetched: 48266 -> 10.1101/2021.01.11.21249461
Position fetched: 48267 -> 10.1101/2021.01.11.21249509
Position fetched: 48268 -> 10.1101/2021.01.11.21249561
Position fetched: 48269 -> 10.1101/2021.01.11.21249562
Position fetched: 48270 -> 10.1101/2021.01.11.21249564
Position fetched: 48271 -> 10.1101/2021.01.11.21249565
Position fetched: 48272 -> 10.1101/2021.01.11.21249605
Position fetched: 48273 -> 10.1101/2021.01.11.21249610
Position fetched: 48274 -> 10.1101/2021.01.11.21249622
Position fetched: 48275 -> 10.1101/2021.01.11.21249626
Position fetched: 48276 -> 10.1101/2021.01.11.21249630
Position fet

Position fetched: 48403 -> 10.1101/2021.01.18.427109
Position fetched: 48404 -> 10.1101/2021.01.18.427113
Position fetched: 48405 -> 10.1101/2021.01.18.427121
Position fetched: 48406 -> 10.1101/2021.01.18.427173
Position fetched: 48407 -> 10.1101/2021.01.18.427189
Position fetched: 48408 -> 10.1101/2021.01.18.427191
Position fetched: 48409 -> 10.1101/2021.01.18.427217
Position fetched: 48410 -> 10.1101/2021.01.19.21249222
Position fetched: 48411 -> 10.1101/2021.01.19.21249592
Position fetched: 48412 -> 10.1101/2021.01.19.21249604
Position fetched: 48413 -> 10.1101/2021.01.19.21249678
Position fetched: 48414 -> 10.1101/2021.01.19.21249790
Position fetched: 48415 -> 10.1101/2021.01.19.21249898
Position fetched: 48416 -> 10.1101/2021.01.19.21249921
Position fetched: 48417 -> 10.1101/2021.01.19.21249936
Position fetched: 48418 -> 10.1101/2021.01.19.21250046
Position fetched: 48419 -> 10.1101/2021.01.19.21250064
Position fetched: 48420 -> 10.1101/2021.01.19.21250079
Position fetched: 48421 

Position fetched: 48541 -> 10.1101/2021.01.25.21250452
Position fetched: 48542 -> 10.1101/2021.01.25.21250454
Position fetched: 48543 -> 10.1101/2021.01.25.21250468
Position fetched: 48544 -> 10.1101/2021.01.25.21250489
Position fetched: 48545 -> 10.1101/2021.01.25.427846
Position fetched: 48546 -> 10.1101/2021.01.25.427910
Position fetched: 48547 -> 10.1101/2021.01.25.427948
Position fetched: 48548 -> 10.1101/2021.01.25.428025
Position fetched: 48549 -> 10.1101/2021.01.25.428042
Position fetched: 48550 -> 10.1101/2021.01.25.428049
Position fetched: 48551 -> 10.1101/2021.01.25.428055
Position fetched: 48552 -> 10.1101/2021.01.25.428097
Position fetched: 48553 -> 10.1101/2021.01.25.428122
Position fetched: 48554 -> 10.1101/2021.01.25.428125
Position fetched: 48555 -> 10.1101/2021.01.25.428136
Position fetched: 48556 -> 10.1101/2021.01.25.428149
Position fetched: 48557 -> 10.1101/2021.01.25.428190
Position fetched: 48558 -> 10.1101/2021.01.25.428191
Position fetched: 48559 -> 10.1101/202

Position fetched: 48686 -> 10.1101/2021.01.30.21250705
Position fetched: 48687 -> 10.1101/2021.01.30.21250708
Position fetched: 48688 -> 10.1101/2021.01.30.21250785
Position fetched: 48689 -> 10.1101/2021.01.30.21250827
Position fetched: 48690 -> 10.1101/2021.01.30.21250830
Position fetched: 48691 -> 10.1101/2021.01.30.21250844
Position fetched: 48692 -> 10.1101/2021.01.31.21250863
Position fetched: 48693 -> 10.1101/2021.01.31.21250866
Position fetched: 48694 -> 10.1101/2021.01.31.21250867
Position fetched: 48695 -> 10.1101/2021.01.31.21250868
Position fetched: 48696 -> 10.1101/2021.01.31.21250870
Position fetched: 48697 -> 10.1101/2021.01.31.21250872
Position fetched: 48698 -> 10.1101/2021.01.31.428824
Position fetched: 48699 -> 10.1101/2021.01.31.428851
Position fetched: 48700 -> 10.1101/2021.01.31.429007
Position fetched: 48701 -> 10.1101/2021.01.31.429010
Position fetched: 48702 -> 10.1101/2021.01.31.429023
Position fetched: 48703 -> 10.1101/2021.02.01.21249903
Position fetched: 48

Position fetched: 48842 -> 10.1101/564112
Position fetched: 48843 -> 10.1101/565218
Position fetched: 48844 -> 10.1101/568386
Position fetched: 48845 -> 10.1101/571455
Position fetched: 48846 -> 10.1101/572768
Position fetched: 48847 -> 10.1101/582957
Position fetched: 48848 -> 10.1101/585331
Position fetched: 48849 -> 10.1101/585984
Position fetched: 48850 -> 10.1101/587261
Position fetched: 48851 -> 10.1101/590307
Position fetched: 48852 -> 10.1101/597963
Position fetched: 48853 -> 10.1101/606442
Position fetched: 48854 -> 10.1101/612614
Position fetched: 48855 -> 10.1101/612945
Position fetched: 48856 -> 10.1101/614958
Position fetched: 48857 -> 10.1101/617910
Position fetched: 48858 -> 10.1101/618249
Position fetched: 48859 -> 10.1101/619924
Position fetched: 48860 -> 10.1101/623132
Position fetched: 48861 -> 10.1101/623819
Position fetched: 48862 -> 10.1101/629881
Position fetched: 48863 -> 10.1101/632372
Position fetched: 48864 -> 10.1101/634600
Position fetched: 48865 -> 10.1101

Position fetched: 49012 -> 10.1107/s1744309106009407
Position fetched: 49013 -> 10.1107/s1744309106021567
Position fetched: 49014 -> 10.1107/s1744309106052341
Position fetched: 49015 -> 10.1107/s1744309107033234
Position fetched: 49016 -> 10.1107/s1744309108012396
Position fetched: 49017 -> 10.1107/s1744309108024391
Position fetched: 49018 -> 10.1107/s1744309109014055
Position fetched: 49019 -> 10.1107/s1744309109024749
Position fetched: 49020 -> 10.1107/s174430911001417x
Position fetched: 49021 -> 10.1107/s1744309110017616
Position fetched: 49022 -> 10.1107/s1744309111002867
Position fetched: 49023 -> 10.1107/s1744309111017829
Position fetched: 49024 -> 10.1107/s1744309112018623
Position fetched: 49025 -> 10.1107/s205225251402003x
Position fetched: 49026 -> 10.1107/s205225252000562x
Position fetched: 49027 -> 10.1107/s2052252520009653
Position fetched: 49028 -> 10.1107/s2052252520012634
batch saved
Position fetched: 49029 -> 10.1107/s2052252520012725
Position fetched: 49030 -> 10.1107

Position fetched: 49169 -> 10.1111/1471-0528.16403
Position fetched: 49170 -> 10.1111/1471-0528.16470
Position fetched: 49171 -> 10.1111/1471-0528.16597
Position fetched: 49172 -> 10.1111/1475-5890.12226
Position fetched: 49173 -> 10.1111/1475-5890.12232
Position fetched: 49174 -> 10.1111/1475-5890.12240
Position fetched: 49175 -> 10.1111/1475-5890.12243
Position fetched: 49176 -> 10.1111/1475-5890.12245
Position fetched: 49177 -> 10.1111/1475-6765.12425
Position fetched: 49178 -> 10.1111/1541-4329.12204
Position fetched: 49179 -> 10.1111/1541-4337.12358
Position fetched: 49180 -> 10.1111/1556-4029.13415
Position fetched: 49181 -> 10.1111/1574-6968.12063
Position fetched: 49182 -> 10.1111/1574-6976.12067
Position fetched: 49183 -> 10.1111/1740-9713.01399
Position fetched: 49184 -> 10.1111/1740-9713.01400
Position fetched: 49185 -> 10.1111/1742-6723.13513
Position fetched: 49186 -> 10.1111/1742-6723.13546
Position fetched: 49187 -> 10.1111/1742-6723.13547
Position fetched: 49188 -> 10.1
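
The checkpointing pattern behind this log (resume at a stored position, save at regular intervals) can be sketched independently of the API. The following is a minimal sketch with stand-ins for the API call and the pickle step; fetch_in_batches and its parameters are hypothetical, and the final save for a trailing partial batch is an addition of this sketch.

```python
def fetch_in_batches(items, start, fetch, save, batch_size=100):
    """Fetch items[start:] one by one and call save(results) after every full batch
    and once more at the end, so a trailing partial batch is not lost."""
    results = {}
    pending = 0
    for position in range(start, len(items)):
        results[position] = fetch(items[position])
        pending += 1
        if pending == batch_size:
            save(dict(results))
            pending = 0
    if pending:
        save(dict(results))
    return results

# stand-ins for the API call and the pickle step
saved_sizes = []
dois = ["doi-" + str(n) for n in range(250)]
results = fetch_in_batches(dois, start=20, fetch=lambda doi: None,
                           save=lambda snapshot: saved_sizes.append(len(snapshot)))
print(saved_sizes)
```

As in the log above, each save contains everything fetched in the current run, so the saved row count grows from batch to batch.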

The following cell is useful when the process above is interrupted. It narrows the dictionary of fetched information down to the useful entries.

In [None]:
# def save_new_extra_info(len_df_current_extra_info, upto):
#     """
#     This function separates successful API calls from API calls that failed due to an invalid API key.
#     It returns the entries from the starting index up to (but not including) the given parameter upto.
#     """
#     dict_new_extra_info_saver = dict()
#     i = len_df_current_extra_info
#     while i < upto:
#         #print("Position: " + str(i) + " -> " +  doi_counted.index[i])
#         dict_new_extra_info_saver[i] = dict_new_extra_info[i]
#         i = i + 1 
#     return dict_new_extra_info_saver
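
The same narrowing can be written as a dictionary comprehension. The following is a minimal sketch; entries_up_to and the sample dictionary are hypothetical. Unlike the loop above, it skips positions that are missing from the dictionary instead of raising a KeyError.

```python
def entries_up_to(fetched, start, upto):
    """Return the entries of fetched with start <= position < upto."""
    return {pos: fetched[pos] for pos in range(start, upto) if pos in fetched}

# made-up sample: position -> fetched record (or None)
fetched = {46129: {"coredata": {}}, 46130: None, 46131: None, 46132: None}
kept = entries_up_to(fetched, 46129, 46131)
print(sorted(kept))
```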

The existing and newly fetched information are combined into one DataFrame. 

In [None]:
# df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
df_combined_extra_info

In [None]:
# too big for GitHub
#df_combined_extra_info.to_csv('extra_info_CS5099.csv', sep='\t')

The two DataFrame columns are each stored as a Series object. Each Series is written to its own pkl file so that no single file exceeds the 100 MB size limit for GitHub uploads.

In [None]:
# stored_series = store_df_columns(df_combined_extra_info)
# stored_series[0]

In [None]:
# stored_series[1]

Verify that the returned None values are due to nonexistent data and not to an invalid API key.

In [None]:
# len_data = len(stored_series[0])
# len_data 

In [None]:
# ser_doi = pd.Series(doi_counted.index[:len_data])
# ser_doi

In [None]:
# df_current_extra_info_checker = df_combined_extra_info
# df_current_extra_info_checker['doi'] = ser_doi

In [None]:
# %%time
# len_df_current_extra_info_checker = len(df_current_extra_info_checker)
# dict_new_extra_info_checker = dict()
# i = 0 
# while i < len_df_current_extra_info_checker:
#     if df_current_extra_info_checker['affiliation'][i] is None:
#         dict_new_extra_info_checker[i] = fetch_scopus_api(client, ser_doi[i])
#         print("Position fetched again: " + str(i) + " -> " +  ser_doi[i])
#     i = i + 1    
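
The idea of this check is to repeat the lookup for every position that still holds None: if all repeated lookups return None again, the data is most likely absent from Scopus rather than blocked by an invalid key. The following is a minimal sketch with a dictionary standing in for the API; refetch_missing, the sample rows and the DOIs are made up for illustration.

```python
def refetch_missing(rows, dois, fetch):
    """Call fetch(doi) again for every position whose stored row is None."""
    refetched = {}
    for position, row in enumerate(rows):
        if row is None:
            refetched[position] = fetch(dois[position])
    return refetched

rows = [{"affilname": "A"}, None, None]
dois = ["10.1000/a", "10.1000/b", "10.1000/c"]  # made-up DOIs
# stand-in for the API: pretend only the second DOI is available now
fake_api = {"10.1000/b": {"affilname": "B"}}
refetched = refetch_missing(rows, dois, fake_api.get)
still_missing = all(value is None for value in refetched.values())
print(refetched, still_missing)
```

If still_missing is True, nothing new was found and the process can stop here; otherwise the refetched entries are worth merging back into the stored data.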

In [None]:
# dict_new_extra_info_checker
# -> check if at least one value is not None -> otherwise the process is finished here

In [None]:
# len(dict_new_extra_info_checker)

In [None]:
# df_combined_extra_info_fetched_again  = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info_checker)
# df_combined_extra_info_fetched_again

In [None]:
# store_df_columns(df_combined_extra_info_fetched_again)

In [None]:
# df_combined_extra_info_fetched_again['check_doi'] = ser_doi
# df_combined_extra_info_fetched_again.head(30)