# CORD-19-collect-scopus-data

This Jupyter notebook collects additional data from Scopus to extend the CORD-19 dataset: 
https://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncs0

First, the required packages are imported into the notebook.

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz
import datetime
import re
from urllib.parse import urlparse
from collections import Counter

from elsapy.elsclient import ElsClient
from elsapy.elsdoc import FullDoc, AbsDoc
from elsapy.elssearch import ElsSearch

import time # for sleep
from pybtex.database import parse_file, BibliographyData, Entry
import json

Get the data and save it to a variable.

In [2]:
CORD19_CSV = pd.read_csv('../data/cord-19/CORD19_software_mentions.csv')

Check the length of the column containing the DOIs.

In [3]:
len(CORD19_CSV['doi'])

77448

Display the doi column to check for inconsistencies such as NaN values.

In [4]:
doi = CORD19_CSV['doi']
doi

0                                 NaN
1          10.1016/j.regg.2021.01.002
2           10.1016/j.rec.2020.08.002
3        10.1016/j.vetmic.2006.11.026
4                   10.3390/v12080849
                     ...             
77443      10.1007/s11229-020-02869-9
77444                             NaN
77445     10.1101/2020.05.13.20100206
77446      10.1007/s42991-020-00052-8
77447     10.1101/2020.09.14.20194670
Name: doi, Length: 77448, dtype: object

Create a Series indexed by the unique DOIs; NaN values are dropped. It is important to sort the unique values by index. Otherwise, the method yields a different order after each restart of the notebook. 

In [5]:
doi_counted = doi.value_counts().sort_index(ascending=True)
doi_counted

10.1001/jamainternmed.2020.1369       1
10.1001/jamanetworkopen.2020.16382    1
10.1001/jamanetworkopen.2020.17521    1
10.1001/jamanetworkopen.2020.20485    1
10.1001/jamanetworkopen.2020.24984    1
                                     ..
10.9745/ghsp-d-20-00115               1
10.9745/ghsp-d-20-00171               1
10.9745/ghsp-d-20-00218               1
10.9758/cpn.2020.18.4.607             1
10.9781/ijimai.2020.02.002            1
Name: doi, Length: 74302, dtype: int64
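
As a small illustration of the step above, the following sketch applies the same calls to a toy Series instead of the real column. The DOIs are taken from the output shown earlier; the repeated entry and the missing value are added only for demonstration.

```python
import pandas as pd

# Toy stand-in for CORD19_CSV['doi']; the repeated DOI is illustrative only.
doi = pd.Series([
    None,
    "10.3390/v12080849",
    "10.1016/j.rec.2020.08.002",
    "10.3390/v12080849",
])

# value_counts() skips missing values by default; sort_index() fixes the order.
doi_counted = doi.value_counts().sort_index(ascending=True)

# Integer positions now refer to the same DOI on every run.
first_doi = doi_counted.index[0]
print(first_doi, len(doi_counted))
```

Because the index is sorted lexicographically, a position such as `doi_counted.index[0]` can be used as a stable reference point when fetching is resumed later.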

The following function retrieves the requested information from the Scopus API. (https://api.elsevier.com/content/search/scopus?query=DOI(10.1109/MCOM.2016.7509373)&apiKey=6d485ef1fe1408712f37e8a783a285a4)

In [6]:
# Adapted from https://github.com/ElsevierDev/elsapy/blob/master/exampleProg.py
def fetch_scopus_api(client, doi):
    """Obtain additional paper information from Scopus by DOI.

    Returns the data of the abstract document, or None if no record could be obtained.
    """
    doc_srch = ElsSearch("DOI(" + doi + ")", 'scopus')
    doc_srch.execute(client, get_all=True)
    #print ("doc_srch has", len(doc_srch.results), "results.")
    #print(doc_srch.results)
    try:
        # Take the part after the colon of the first hit's identifier as the Scopus ID
        scopus_id = doc_srch.results[0]["dc:identifier"].split(":")[1]
        scp_doc = AbsDoc(scp_id=scopus_id)
        if scp_doc.read(client):
            # print ("scp_doc.title: ", scp_doc.title)
            scp_doc.write()
        else:
            print("Read document failed.")
        # print(scp_doc.data["affiliation"])
        return scp_doc.data
    except Exception:
        # No usable search hit or no readable document: report it as not found
        return None
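
The identifier handling inside `fetch_scopus_api` can be looked at in isolation. The sketch below is not part of the notebook: `extract_scopus_id` is a hypothetical helper, and both the `SCOPUS_ID:` prefix and the shape of the "no hit" entry are assumptions about what the search result looks like. The ID itself is the one that appears in the example output further down.

```python
def extract_scopus_id(results):
    """Return the Scopus ID of the first search hit, or None if there is none."""
    if not results:
        return None
    identifier = results[0].get("dc:identifier")
    if not identifier or ":" not in identifier:
        return None
    # Keep everything after the first colon, e.g. "SCOPUS_ID:123" -> "123".
    return identifier.split(":", 1)[1]


# Assumed result shapes, for illustration only.
hit = [{"dc:identifier": "SCOPUS_ID:85083171050"}]
no_hit = [{"error": "Result set was empty"}]

print(extract_scopus_id(hit))
print(extract_scopus_id(no_hit))
```

Checking the result explicitly makes it easier to tell "no record found" apart from other failures than a broad `except` clause does.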

Next, the configuration file is loaded; it contains the API key. Further information: https://github.com/ElsevierDev/elsapy/blob/master/CONFIG.md

In [7]:
# The context manager closes the file even if parsing fails
with open("config.json") as con_file:
    config = json.load(con_file)
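
A missing or empty key only shows up later as failed API calls, so it can help to validate the configuration right after loading. The following is a minimal sketch under that assumption; `load_api_key` is a hypothetical helper, and the demonstration uses a throwaway file with a dummy key rather than the real `config.json`.

```python
import json
import os
import tempfile


def load_api_key(path):
    """Read the config file and return its 'apikey' entry."""
    with open(path) as con_file:
        config = json.load(con_file)
    api_key = config.get("apikey", "")
    if not api_key:
        raise ValueError(f"No 'apikey' entry found in {path}")
    return api_key


# Demonstration with a temporary file and a dummy key.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w") as demo_file:
        json.dump({"apikey": "DUMMY-KEY"}, demo_file)
    demo_key = load_api_key(demo_path)

print(demo_key)
```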

Then the client is initialized with the API key.

In [8]:
client = ElsClient(config['apikey'])

For demonstration purposes, the following cell shows which data is returned by the Scopus API. 

In [9]:
return_example = fetch_scopus_api(client, '10.1016/j.dsx.2020.04.012')
print(json.dumps(return_example, indent=2))

{
  "affiliation": [
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Hamdard",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Millia Islamia",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Indraprastha Apollo Hospitals",
      "affiliation-country": "India"
    }
  ],
  "coredata": {
    "srctype": "j",
    "eid": "2-s2.0-85083171050",
    "pubmed-id": "32305024",
    "prism:coverDate": "2020-07-01",
    "prism:aggregationType": "Journal",
    "prism:url": "https://api.elsevier.com/content/abstract/scopus_id/85083171050",
    "dc:creator": {
      "author": [
        {
          "ce:given-name": "Raju",
          "preferred-name": {
            "ce:given-name": "Raju",
            "ce:initials": "R.",
            "ce:surname": "Vaishya",
            "ce:indexed-name": "Vaishya R."
          },
          "@seq": "1",
          "ce:init

Based on the returned data, further analysis can be conducted. For this purpose, two notebooks are created to analyse the data linked to: 
<ul>
  <li>affiliation</li>
  <li>coredata</li>
</ul>    
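
To give an idea of how these two parts of a record can be accessed, the sketch below works on a hand-written dictionary that copies a few fields from the example output above. `affiliation_countries` is a hypothetical helper, and the handling of a single affiliation delivered as a dict rather than a list is an assumption, not something the output above confirms.

```python
from collections import Counter


def affiliation_countries(record):
    """Count the countries listed under a record's 'affiliation' entry."""
    affiliations = record.get("affiliation") or []
    # Assumption: a single affiliation may arrive as a dict instead of a list.
    if isinstance(affiliations, dict):
        affiliations = [affiliations]
    return Counter(a.get("affiliation-country") for a in affiliations)


# Excerpt of the example record shown above.
record = {
    "affiliation": [
        {"affiliation-city": "New Delhi", "affilname": "Jamia Hamdard",
         "affiliation-country": "India"},
        {"affiliation-city": "New Delhi", "affilname": "Jamia Millia Islamia",
         "affiliation-country": "India"},
        {"affiliation-city": "New Delhi", "affilname": "Indraprastha Apollo Hospitals",
         "affiliation-country": "India"},
    ],
    "coredata": {"prism:coverDate": "2020-07-01", "prism:aggregationType": "Journal"},
}

countries = affiliation_countries(record)
cover_year = int(record["coredata"]["prism:coverDate"][:4])
print(countries, cover_year)
```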

The already fetched coredata and affiliation data are read and combined into a DataFrame for further processing.

In [10]:
df_current_extra_info = pd.DataFrame()
try:
    read_affiliation = pd.read_pickle('extra_info_affiliation_CS.pkl')
    read_coredata = pd.read_pickle('extra_info_coredata_CS.pkl')
    df_current_extra_info['affiliation'] = read_affiliation
    df_current_extra_info['coredata'] = read_coredata
except FileNotFoundError:
    # No pickle files yet, so nothing has been fetched so far
    print("The DataFrame is empty")

The length of the DataFrame containing the current information is assigned to a variable. 
It is used later as the starting index of the while loop that fetches the remaining DOIs. 

In [11]:
len_df_current_extra_info = len(df_current_extra_info)
len_df_current_extra_info

60429

In [12]:
df_current_extra_info

Unnamed: 0,affiliation,coredata
0,"[{'affiliation-city': 'Palo Alto', 'affilname'...","{'srctype': 'j', 'eid': '2-s2.0-85083266658', ..."
1,"[{'affiliation-city': 'Seattle', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '7',..."
2,"[{'affiliation-city': 'Cambridge', 'affilname'...","{'srctype': 'j', 'prism:issueIdentifier': '8',..."
3,"[{'affiliation-city': 'Madison', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '9',..."
4,"[{'affiliation-city': 'Los Angeles', 'affilnam...","{'srctype': 'j', 'prism:issueIdentifier': '11'..."
...,...,...
60424,"[{'affiliation-city': 'Sao Paulo', 'affilname'...","{'srctype': 'j', 'prism:issueIdentifier': '6',..."
60425,"[{'affiliation-city': 'London', 'affilname': '...","{'srctype': 'j', 'prism:issueIdentifier': '6',..."
60426,"[{'affiliation-city': 'San Diego', 'affilname'...","{'srctype': 'j', 'prism:issueIdentifier': '8',..."
60427,"[{'affiliation-city': 'Bondy', 'affilname': 'U...","{'srctype': 'j', 'prism:issueIdentifier': '8',..."


In [13]:
def contains_only_None(dic):
    """
    Inspect a dictionary and return True if it contains only None values.
    """
    # Use the parameter rather than the global dictionary
    return all(value is None for value in dic.values())

In [14]:
def append_fetched_data_to_df(df_current_extra_info, dic):
    """
    Append or insert newly fetched data into the DataFrame containing Scopus data.
    """
    # df_current_extra_info -> holds the latest data; new data is appended to it
    # df_newly_fetched_transposed -> holds newly fetched data, to be inserted or appended

    if contains_only_None(dic):
        # Nothing was found: create placeholder rows filled with None
        placeholder_entries = pd.DataFrame(np.empty((len(dic), 2), dtype=object), columns=['affiliation', 'coredata'], index=list(dic))
        df_newly_fetched_transposed = placeholder_entries
        print(placeholder_entries)
    else:
        # Before appending, the dictionary is transformed into a DataFrame
        df_newly_fetched = pd.DataFrame(dic)
        # Transpose so that each fetched position becomes a row
        df_newly_fetched_transposed = df_newly_fetched.T
        print(df_newly_fetched_transposed)

    # Insert newly fetched rows that could not be filled previously
    for index, row in df_newly_fetched_transposed.iterrows():
        # Overwrite the current extra info DataFrame because the row already exists
        if index in df_current_extra_info.index and row.affiliation is not None:
            df_current_extra_info.loc[index] = row
        # Append to the current extra info DataFrame because the row is new
        if index not in df_current_extra_info.index:
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            df_current_extra_info = pd.concat([df_current_extra_info, row.to_frame().T], ignore_index=True)

    # Return the DataFrame with inserted and replaced rows
    return df_current_extra_info

Each of the two DataFrame columns is stored as a separate Series. Each Series is written to its own pkl file so that no single file exceeds 100 MB, which allows GitHub uploads.

In [15]:
def store_df_columns(df):
    ser_affiliation = df['affiliation']
    ser_coredata = df['coredata']
    ser_affiliation.to_pickle('extra_info_affiliation_CS.pkl')
    ser_coredata.to_pickle('extra_info_coredata_CS.pkl')
    return ser_affiliation, ser_coredata
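
Whether a pickle file actually stays below the upload limit can be checked after writing it. The sketch below is an illustration only: `fits_size_limit` is a hypothetical helper, the limit is assumed to be 100 MiB per file, and a small temporary pickle stands in for the real `.pkl` files.

```python
import os
import pickle
import tempfile

GITHUB_LIMIT_BYTES = 100 * 1024 * 1024  # assumed per-file limit of 100 MiB


def fits_size_limit(path, limit=GITHUB_LIMIT_BYTES):
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit


# Demonstration with a small temporary pickle file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_column.pkl")
    with open(demo_path, "wb") as handle:
        pickle.dump([{"affiliation-city": "New Delhi"}] * 1000, handle)
    demo_ok = fits_size_limit(demo_path)
    demo_too_big = fits_size_limit(demo_path, limit=10)

print(demo_ok, demo_too_big)
```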

In [16]:
# placeholder_entries = pd.DataFrame(np.empty((4,2),dtype=object),columns=['affiliation','coredata'])

In [17]:
# placeholder_entries

Subsequently, the fetched Scopus data is stored in a dictionary. In addition, the print function shows the progress of the process by displaying the most recently fetched position. 

In [None]:
%%time
dict_new_extra_info = dict()
len_dois = len(doi_counted)
def trigger_fetching():
    threshold = 0 
    i = len_df_current_extra_info
    while i < len_dois: #-> upto modified, normally len_dois
        dict_new_extra_info[i] = fetch_scopus_api(client, doi_counted.index[i])
        print("Position fetched: " + str(i) + " -> " +  doi_counted.index[i])
        i = i + 1 
        threshold = threshold + 1
        if threshold > 99:
            df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
            stored_series = store_df_columns(df_combined_extra_info)
            threshold = 0
            print("batch saved")
trigger_fetching()

Position fetched: 60429 -> 10.1371/journal.pntd.0005859
Position fetched: 60430 -> 10.1371/journal.pntd.0006070
Position fetched: 60431 -> 10.1371/journal.pntd.0006076
Position fetched: 60432 -> 10.1371/journal.pntd.0006143
Position fetched: 60433 -> 10.1371/journal.pntd.0006183
Position fetched: 60434 -> 10.1371/journal.pntd.0006257
Position fetched: 60435 -> 10.1371/journal.pntd.0006275
Position fetched: 60436 -> 10.1371/journal.pntd.0006295
Position fetched: 60437 -> 10.1371/journal.pntd.0006342
Position fetched: 60438 -> 10.1371/journal.pntd.0006343
Position fetched: 60439 -> 10.1371/journal.pntd.0006348
Position fetched: 60440 -> 10.1371/journal.pntd.0006505
Position fetched: 60441 -> 10.1371/journal.pntd.0006526
Position fetched: 60442 -> 10.1371/journal.pntd.0006539
Position fetched: 60443 -> 10.1371/journal.pntd.0006573
Position fetched: 60444 -> 10.1371/journal.pntd.0006628
Position fetched: 60445 -> 10.1371/journal.pntd.0006642
Position fetched: 60446 -> 10.1371/journal.pntd.

Position fetched: 60550 -> 10.1371/journal.pone.0002876
Position fetched: 60551 -> 10.1371/journal.pone.0003154
Position fetched: 60552 -> 10.1371/journal.pone.0003181
Position fetched: 60553 -> 10.1371/journal.pone.0003454
Position fetched: 60554 -> 10.1371/journal.pone.0003500
Position fetched: 60555 -> 10.1371/journal.pone.0003803
Position fetched: 60556 -> 10.1371/journal.pone.0004118
Position fetched: 60557 -> 10.1371/journal.pone.0004171
Position fetched: 60558 -> 10.1371/journal.pone.0004176
Position fetched: 60559 -> 10.1371/journal.pone.0004219
Position fetched: 60560 -> 10.1371/journal.pone.0004261
Position fetched: 60561 -> 10.1371/journal.pone.0004596
Position fetched: 60562 -> 10.1371/journal.pone.0004744
Position fetched: 60563 -> 10.1371/journal.pone.0005156
Position fetched: 60564 -> 10.1371/journal.pone.0005466
Position fetched: 60565 -> 10.1371/journal.pone.0005807
Position fetched: 60566 -> 10.1371/journal.pone.0005819
Position fetched: 60567 -> 10.1371/journal.pone.

Position fetched: 60671 -> 10.1371/journal.pone.0018543
Position fetched: 60672 -> 10.1371/journal.pone.0018558
Position fetched: 60673 -> 10.1371/journal.pone.0018687
Position fetched: 60674 -> 10.1371/journal.pone.0018890
Position fetched: 60675 -> 10.1371/journal.pone.0018928
Position fetched: 60676 -> 10.1371/journal.pone.0019056
Position fetched: 60677 -> 10.1371/journal.pone.0019156
Position fetched: 60678 -> 10.1371/journal.pone.0019232
Position fetched: 60679 -> 10.1371/journal.pone.0019245
Position fetched: 60680 -> 10.1371/journal.pone.0019311
Position fetched: 60681 -> 10.1371/journal.pone.0019330
Position fetched: 60682 -> 10.1371/journal.pone.0019417
Position fetched: 60683 -> 10.1371/journal.pone.0019436
Position fetched: 60684 -> 10.1371/journal.pone.0019496
Position fetched: 60685 -> 10.1371/journal.pone.0019510
Position fetched: 60686 -> 10.1371/journal.pone.0019738
Position fetched: 60687 -> 10.1371/journal.pone.0019750
Position fetched: 60688 -> 10.1371/journal.pone.

Position fetched: 60792 -> 10.1371/journal.pone.0031800
Position fetched: 60793 -> 10.1371/journal.pone.0031886
Position fetched: 60794 -> 10.1371/journal.pone.0031961
Position fetched: 60795 -> 10.1371/journal.pone.0031981
Position fetched: 60796 -> 10.1371/journal.pone.0032157
Position fetched: 60797 -> 10.1371/journal.pone.0032160
Position fetched: 60798 -> 10.1371/journal.pone.0032273
Position fetched: 60799 -> 10.1371/journal.pone.0032469
Position fetched: 60800 -> 10.1371/journal.pone.0032486
Position fetched: 60801 -> 10.1371/journal.pone.0032582
Position fetched: 60802 -> 10.1371/journal.pone.0032731
Position fetched: 60803 -> 10.1371/journal.pone.0032739
Position fetched: 60804 -> 10.1371/journal.pone.0032845
Position fetched: 60805 -> 10.1371/journal.pone.0033174
Position fetched: 60806 -> 10.1371/journal.pone.0033389
Position fetched: 60807 -> 10.1371/journal.pone.0033392
Position fetched: 60808 -> 10.1371/journal.pone.0033428
Position fetched: 60809 -> 10.1371/journal.pone.

Position fetched: 60913 -> 10.1371/journal.pone.0045730
Position fetched: 60914 -> 10.1371/journal.pone.0045842
Position fetched: 60915 -> 10.1371/journal.pone.0045957
Position fetched: 60916 -> 10.1371/journal.pone.0046113
Position fetched: 60917 -> 10.1371/journal.pone.0046241
Position fetched: 60918 -> 10.1371/journal.pone.0046378
Position fetched: 60919 -> 10.1371/journal.pone.0046393
Position fetched: 60920 -> 10.1371/journal.pone.0046516
Position fetched: 60921 -> 10.1371/journal.pone.0047403
Position fetched: 60922 -> 10.1371/journal.pone.0047492
Position fetched: 60923 -> 10.1371/journal.pone.0047529
Position fetched: 60924 -> 10.1371/journal.pone.0047711
Position fetched: 60925 -> 10.1371/journal.pone.0047737
Position fetched: 60926 -> 10.1371/journal.pone.0047740
Position fetched: 60927 -> 10.1371/journal.pone.0047912
Position fetched: 60928 -> 10.1371/journal.pone.0048053
                                             affiliation  \
60429  [{'affiliation-city': 'Atlanta', 'aff

batch saved
Position fetched: 61029 -> 10.1371/journal.pone.0068056
Position fetched: 61030 -> 10.1371/journal.pone.0068081
Position fetched: 61031 -> 10.1371/journal.pone.0068558
Position fetched: 61032 -> 10.1371/journal.pone.0068759
Position fetched: 61033 -> 10.1371/journal.pone.0068777
Position fetched: 61034 -> 10.1371/journal.pone.0069305
Position fetched: 61035 -> 10.1371/journal.pone.0069374
Position fetched: 61036 -> 10.1371/journal.pone.0069387
Position fetched: 61037 -> 10.1371/journal.pone.0069804
Position fetched: 61038 -> 10.1371/journal.pone.0069825
Position fetched: 61039 -> 10.1371/journal.pone.0069858
Position fetched: 61040 -> 10.1371/journal.pone.0069941
Position fetched: 61041 -> 10.1371/journal.pone.0069982
Position fetched: 61042 -> 10.1371/journal.pone.0070129
Position fetched: 61043 -> 10.1371/journal.pone.0070190
Position fetched: 61044 -> 10.1371/journal.pone.0070854
Position fetched: 61045 -> 10.1371/journal.pone.0070944
Position fetched: 61046 -> 10.1371/j

Position fetched: 61149 -> 10.1371/journal.pone.0090905
Position fetched: 61150 -> 10.1371/journal.pone.0090957
Position fetched: 61151 -> 10.1371/journal.pone.0091103
Position fetched: 61152 -> 10.1371/journal.pone.0091433
Position fetched: 61153 -> 10.1371/journal.pone.0091516
Position fetched: 61154 -> 10.1371/journal.pone.0091679
Position fetched: 61155 -> 10.1371/journal.pone.0091996
Position fetched: 61156 -> 10.1371/journal.pone.0092154
Position fetched: 61157 -> 10.1371/journal.pone.0092199
Position fetched: 61158 -> 10.1371/journal.pone.0092777
Position fetched: 61159 -> 10.1371/journal.pone.0092884
Position fetched: 61160 -> 10.1371/journal.pone.0093001
Position fetched: 61161 -> 10.1371/journal.pone.0093227
Position fetched: 61162 -> 10.1371/journal.pone.0093269
Position fetched: 61163 -> 10.1371/journal.pone.0093390
Position fetched: 61164 -> 10.1371/journal.pone.0093395
Position fetched: 61165 -> 10.1371/journal.pone.0093541
Position fetched: 61166 -> 10.1371/journal.pone.

Position fetched: 61270 -> 10.1371/journal.pone.0112602
Position fetched: 61271 -> 10.1371/journal.pone.0112617
Position fetched: 61272 -> 10.1371/journal.pone.0112983
Position fetched: 61273 -> 10.1371/journal.pone.0112986
Position fetched: 61274 -> 10.1371/journal.pone.0113078
Position fetched: 61275 -> 10.1371/journal.pone.0113113
Position fetched: 61276 -> 10.1371/journal.pone.0113234
Position fetched: 61277 -> 10.1371/journal.pone.0113570
Position fetched: 61278 -> 10.1371/journal.pone.0113711
Position fetched: 61279 -> 10.1371/journal.pone.0114021
Position fetched: 61280 -> 10.1371/journal.pone.0114652
Position fetched: 61281 -> 10.1371/journal.pone.0114710
Position fetched: 61282 -> 10.1371/journal.pone.0114871
Position fetched: 61283 -> 10.1371/journal.pone.0114931
Position fetched: 61284 -> 10.1371/journal.pone.0115180
Position fetched: 61285 -> 10.1371/journal.pone.0115475
Position fetched: 61286 -> 10.1371/journal.pone.0115588
Position fetched: 61287 -> 10.1371/journal.pone.

Position fetched: 61391 -> 10.1371/journal.pone.0134943
Position fetched: 61392 -> 10.1371/journal.pone.0135573
Position fetched: 61393 -> 10.1371/journal.pone.0135640
Position fetched: 61394 -> 10.1371/journal.pone.0135675
Position fetched: 61395 -> 10.1371/journal.pone.0135767
Position fetched: 61396 -> 10.1371/journal.pone.0135828
Position fetched: 61397 -> 10.1371/journal.pone.0135850
Position fetched: 61398 -> 10.1371/journal.pone.0135940
Position fetched: 61399 -> 10.1371/journal.pone.0136253
Position fetched: 61400 -> 10.1371/journal.pone.0136888
Position fetched: 61401 -> 10.1371/journal.pone.0136927
Position fetched: 61402 -> 10.1371/journal.pone.0137018
Position fetched: 61403 -> 10.1371/journal.pone.0137108
Position fetched: 61404 -> 10.1371/journal.pone.0137212
Position fetched: 61405 -> 10.1371/journal.pone.0137288
Position fetched: 61406 -> 10.1371/journal.pone.0137378
Position fetched: 61407 -> 10.1371/journal.pone.0137679
Position fetched: 61408 -> 10.1371/journal.pone.

Position fetched: 61512 -> 10.1371/journal.pone.0155044
Position fetched: 61513 -> 10.1371/journal.pone.0155134
Position fetched: 61514 -> 10.1371/journal.pone.0155341
Position fetched: 61515 -> 10.1371/journal.pone.0155484
Position fetched: 61516 -> 10.1371/journal.pone.0155555
Position fetched: 61517 -> 10.1371/journal.pone.0155589
Position fetched: 61518 -> 10.1371/journal.pone.0156019
Position fetched: 61519 -> 10.1371/journal.pone.0156518
Position fetched: 61520 -> 10.1371/journal.pone.0156552
Position fetched: 61521 -> 10.1371/journal.pone.0156603
Position fetched: 61522 -> 10.1371/journal.pone.0156739
Position fetched: 61523 -> 10.1371/journal.pone.0157034
Position fetched: 61524 -> 10.1371/journal.pone.0157287
Position fetched: 61525 -> 10.1371/journal.pone.0157398
Position fetched: 61526 -> 10.1371/journal.pone.0157450
Position fetched: 61527 -> 10.1371/journal.pone.0157620
Position fetched: 61528 -> 10.1371/journal.pone.0158128
                                             aff

batch saved
Position fetched: 61629 -> 10.1371/journal.pone.0176947
Position fetched: 61630 -> 10.1371/journal.pone.0177340
Position fetched: 61631 -> 10.1371/journal.pone.0178007
Position fetched: 61632 -> 10.1371/journal.pone.0178094
Position fetched: 61633 -> 10.1371/journal.pone.0178146
Position fetched: 61634 -> 10.1371/journal.pone.0178241
Position fetched: 61635 -> 10.1371/journal.pone.0178336
Position fetched: 61636 -> 10.1371/journal.pone.0178408
Position fetched: 61637 -> 10.1371/journal.pone.0178433
Position fetched: 61638 -> 10.1371/journal.pone.0178569
Position fetched: 61639 -> 10.1371/journal.pone.0178732
Position fetched: 61640 -> 10.1371/journal.pone.0178781
Position fetched: 61641 -> 10.1371/journal.pone.0178926
Position fetched: 61642 -> 10.1371/journal.pone.0179177
Position fetched: 61643 -> 10.1371/journal.pone.0179356
Position fetched: 61644 -> 10.1371/journal.pone.0179391
Position fetched: 61645 -> 10.1371/journal.pone.0179863
Position fetched: 61646 -> 10.1371/j

Position fetched: 61749 -> 10.1371/journal.pone.0199067
Position fetched: 61750 -> 10.1371/journal.pone.0199298
Position fetched: 61751 -> 10.1371/journal.pone.0199388
Position fetched: 61752 -> 10.1371/journal.pone.0199656
Position fetched: 61753 -> 10.1371/journal.pone.0199869
Position fetched: 61754 -> 10.1371/journal.pone.0200095
Position fetched: 61755 -> 10.1371/journal.pone.0200200
Position fetched: 61756 -> 10.1371/journal.pone.0200428
Position fetched: 61757 -> 10.1371/journal.pone.0200531
Position fetched: 61758 -> 10.1371/journal.pone.0200726
Position fetched: 61759 -> 10.1371/journal.pone.0200858
Position fetched: 61760 -> 10.1371/journal.pone.0200919
Position fetched: 61761 -> 10.1371/journal.pone.0201207
Position fetched: 61762 -> 10.1371/journal.pone.0201250
Position fetched: 61763 -> 10.1371/journal.pone.0201281
Position fetched: 61764 -> 10.1371/journal.pone.0201295
Position fetched: 61765 -> 10.1371/journal.pone.0201497
Position fetched: 61766 -> 10.1371/journal.pone.

Position fetched: 61870 -> 10.1371/journal.pone.0226483
Position fetched: 61871 -> 10.1371/journal.pone.0226489
Position fetched: 61872 -> 10.1371/journal.pone.0226952
Position fetched: 61873 -> 10.1371/journal.pone.0227104
Position fetched: 61874 -> 10.1371/journal.pone.0228068
Position fetched: 61875 -> 10.1371/journal.pone.0228329
Position fetched: 61876 -> 10.1371/journal.pone.0228544
Position fetched: 61877 -> 10.1371/journal.pone.0228983
Position fetched: 61878 -> 10.1371/journal.pone.0229467
Position fetched: 61879 -> 10.1371/journal.pone.0229558
Position fetched: 61880 -> 10.1371/journal.pone.0229790
Position fetched: 61881 -> 10.1371/journal.pone.0229911
Position fetched: 61882 -> 10.1371/journal.pone.0230067
Position fetched: 61883 -> 10.1371/journal.pone.0230148
Position fetched: 61884 -> 10.1371/journal.pone.0230183
Position fetched: 61885 -> 10.1371/journal.pone.0230295
Position fetched: 61886 -> 10.1371/journal.pone.0230405
Position fetched: 61887 -> 10.1371/journal.pone.
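
The fetching cell above resumes at a given position and saves after every 100 fetched entries. The same checkpointing pattern can be sketched without any API access. Everything in the block below is hypothetical: `fetch_in_batches` is not part of the notebook, `str.upper` stands in for the Scopus call, and a list append stands in for writing the pickle files. Unlike the cell above, the sketch also saves whatever remains after the last full batch.

```python
def fetch_in_batches(items, start, fetch, save, batch_size=100):
    """Fetch items[start:], calling save(results) after each full batch and at the end."""
    results = {}
    unsaved = 0
    for position in range(start, len(items)):
        results[position] = fetch(items[position])
        unsaved += 1
        if unsaved >= batch_size:
            save(results)
            unsaved = 0
    if unsaved:
        # Flush the final, partially filled batch.
        save(results)
    return results


# Stand-ins for the Scopus call and the pickle step.
saved_sizes = []
demo_items = [f"doi-{n}" for n in range(7)]
demo_results = fetch_in_batches(
    demo_items,
    start=2,
    fetch=str.upper,
    save=lambda res: saved_sizes.append(len(res)),
    batch_size=2,
)
print(saved_sizes)
```

Passing the fetch and save steps in as functions keeps the loop testable without a valid API key.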

The following cell is useful when the process above is interrupted. It narrows the dictionary of fetched information down to the valid entries. 

In [None]:
# def save_new_extra_info(len_df_current_extra_info, upto):
#     """
#     This function is used to separate successful API calls from API calls which were prevented due to an invalid API key.
#     As a result, this function returns a range of valid entries up to the given parameter. 
#     """
#     dict_new_extra_info_saver = dict()
#     i = len_df_current_extra_info
#     while i < upto:
#         #print("Position: " + str(i) + " -> " +  doi_counted.index[i])
#         dict_new_extra_info_saver[i] = dict_new_extra_info[i]
#         i = i + 1 
#     return dict_new_extra_info_saver
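
The same narrowing step can be written as a dictionary comprehension. This is only a sketch with made-up entries: `keep_range` is a hypothetical name, and, unlike the commented function above, it skips positions that are missing from the dictionary instead of raising a `KeyError`.

```python
def keep_range(fetched, start, upto):
    """Return the entries of `fetched` whose keys lie in the range start..upto-1."""
    return {i: fetched[i] for i in range(start, upto) if i in fetched}


# Made-up example: positions from 60431 onward are assumed to be invalid.
demo_fetched = {60429: {"coredata": {}}, 60430: None, 60431: None, 60432: None}
demo_valid = keep_range(demo_fetched, start=60429, upto=60431)
print(sorted(demo_valid))
```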

The existing and newly fetched information are combined into one DataFrame. 

In [None]:
# df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
df_combined_extra_info

In [None]:
# too big for GitHub
#df_combined_extra_info.to_csv('extra_info_CS5099.csv', sep='\t')

Each of the two DataFrame columns is stored as a separate Series. Each Series is written to its own pkl file so that no single file exceeds 100 MB, which allows GitHub uploads.

In [None]:
# stored_series = store_df_columns(df_combined_extra_info)
# stored_series[0]

In [None]:
# stored_series[1]

Verify that the returned None values are due to nonexistent data and not to an invalid API key.

In [None]:
# len_data = len(stored_series[0])
# len_data 

In [None]:
# ser_doi = pd.Series(doi_counted.index[:len_data])
# ser_doi

In [None]:
# df_current_extra_info_checker = df_combined_extra_info
# df_current_extra_info_checker['doi'] = ser_doi

In [None]:
# %%time
# len_df_current_extra_info_checker = len(df_current_extra_info_checker)
# dict_new_extra_info_checker = dict()
# i = 0 
# while i < len_df_current_extra_info_checker: ###################################################### 
#     if df_current_extra_info_checker['affiliation'][i] == None:
#         dict_new_extra_info_checker[i] = fetch_scopus_api(client, ser_doi[i])
#         print("Position fetched again: " + str(i) + " -> " +  ser_doi[i])
#     i = i + 1    
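
The re-fetch idea from the commented cell above can be sketched with plain lists. All names and data here are hypothetical: `refetch_missing` is not part of the notebook, and a dictionary lookup stands in for `fetch_scopus_api`. If every retried position comes back as None again, the gaps are presumably due to missing Scopus records rather than to a rejected key.

```python
def refetch_missing(records, dois, fetch):
    """Call fetch(doi) again for every position whose record is None."""
    retried = {}
    for position, record in enumerate(records):
        if record is None:
            retried[position] = fetch(dois[position])
    return retried


demo_records = [{"eid": "a"}, None, {"eid": "c"}, None]
demo_dois = ["doi-0", "doi-1", "doi-2", "doi-3"]
# Hypothetical stub: pretend only doi-3 is known to the API.
demo_known = {"doi-3": {"eid": "d"}}

demo_retried = refetch_missing(demo_records, demo_dois, demo_known.get)
still_missing = [pos for pos, rec in demo_retried.items() if rec is None]
print(demo_retried, still_missing)
```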

In [None]:
# dict_new_extra_info_checker
# -> check if at least one value is not None -> otherwise the process is finished here

In [None]:
# len(dict_new_extra_info_checker)

In [None]:
# df_combined_extra_info_fetched_again  = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info_checker)
# df_combined_extra_info_fetched_again

In [None]:
# store_df_columns(df_combined_extra_info_fetched_again)

In [None]:
# df_combined_extra_info_fetched_again['check_doi'] = ser_doi
# df_combined_extra_info_fetched_again.head(30)