# CORD-19-collect-scopus-data

This Jupyter notebook collects additional data from Scopus to extend the CORD-19 dataset:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncs0

First, the required packages are imported into the notebook.

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz 
import datetime
import re
from urllib.parse import urlparse
from collections import Counter

from elsapy.elsclient import ElsClient
from elsapy.elsdoc import FullDoc, AbsDoc
from elsapy.elssearch import ElsSearch

import time # for sleep
from pybtex.database import parse_file, BibliographyData, Entry
import json

Get the data and save it to a variable.

In [2]:
CORD19_CSV = pd.read_csv('../data/cord-19/CORD19_software_mentions.csv')

Check the length of the column containing the DOIs.

In [3]:
len(CORD19_CSV['doi'])

77448

Display the column `doi` to check for inconsistencies such as NaN values.

In [4]:
doi = CORD19_CSV['doi']
doi

0                                 NaN
1          10.1016/j.regg.2021.01.002
2           10.1016/j.rec.2020.08.002
3        10.1016/j.vetmic.2006.11.026
4                   10.3390/v12080849
                     ...             
77443      10.1007/s11229-020-02869-9
77444                             NaN
77445     10.1101/2020.05.13.20100206
77446      10.1007/s42991-020-00052-8
77447     10.1101/2020.09.14.20194670
Name: doi, Length: 77448, dtype: object

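Create a Series that contains only the unique DOIs; `value_counts()` ignores NaN values by default. It is important to sort the unique values by index. Otherwise, their order can differ after a restart of the notebook, and the positions used to resume fetching would point to different DOIs.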

In [5]:
doi_counted = doi.value_counts().sort_index(ascending=True)
doi_counted

10.1001/jamainternmed.2020.1369       1
10.1001/jamanetworkopen.2020.16382    1
10.1001/jamanetworkopen.2020.17521    1
10.1001/jamanetworkopen.2020.20485    1
10.1001/jamanetworkopen.2020.24984    1
                                     ..
10.9745/ghsp-d-20-00115               1
10.9745/ghsp-d-20-00171               1
10.9745/ghsp-d-20-00218               1
10.9758/cpn.2020.18.4.607             1
10.9781/ijimai.2020.02.002            1
Name: doi, Length: 74302, dtype: int64
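
To illustrate why the sorted index matters, here is a minimal sketch on a toy Series. The DOIs are placeholders and not taken from the dataset. It shows that NaN values are dropped and that every unique DOI receives a reproducible position.

```python
import numpy as np
import pandas as pd

# Toy data: placeholder DOIs with one duplicate and one missing value.
toy_doi = pd.Series(["10.2/b", np.nan, "10.1/a", "10.2/b", "10.3/c"])

# value_counts() ignores NaN by default; sort_index() orders the unique DOIs
# lexicographically, so position i always refers to the same DOI.
toy_counted = toy_doi.value_counts().sort_index(ascending=True)

print(list(toy_counted.index))
print(toy_counted["10.2/b"])
```

Because the fetching loop later addresses DOIs by position (`doi_counted.index[i]`), a stable order is what makes it possible to resume from a stored length.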

The following function requests additional information for a DOI from the Scopus API. (Example request: https://api.elsevier.com/content/search/scopus?query=DOI(10.1109/MCOM.2016.7509373)&apiKey=6d485ef1fe1408712f37e8a783a285a4)

In [6]:
# Adapted from https://github.com/ElsevierDev/elsapy/blob/master/exampleProg.py
def fetch_scopus_api(client, doi):
    """Obtain additional paper information from Scopus by DOI.

    Returns the abstract document data as a dict, or None if the lookup fails.
    """
    # Search Scopus for the DOI
    doc_srch = ElsSearch("DOI(" + doi + ")", 'scopus')
    doc_srch.execute(client, get_all=True)
    try:
        # "dc:identifier" is split at ":" to obtain the Scopus ID
        scopus_id = doc_srch.results[0]["dc:identifier"].split(":")[1]
        scp_doc = AbsDoc(scp_id=scopus_id)
        if scp_doc.read(client):
            scp_doc.write()
        else:
            print("Read document failed.")
        return scp_doc.data
    except Exception:
        # No usable search result (e.g. the DOI is not indexed in Scopus)
        return None
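
The function relies on the `dc:identifier` field of the first search hit. The following is a minimal sketch of that extraction step in isolation. It assumes the identifier has the form `SCOPUS_ID:<digits>` and that an empty search yields a single entry without that field; `extract_scopus_id` is a hypothetical helper and not part of elsapy.

```python
from typing import Optional


def extract_scopus_id(results: list) -> Optional[str]:
    """Return the numeric Scopus ID of the first search hit, or None."""
    if not results:
        return None
    identifier = results[0].get("dc:identifier")
    if not identifier or ":" not in identifier:
        return None
    # Keep everything after the first colon.
    return identifier.split(":", 1)[1]


# Assumed shapes of a successful and an empty search result.
hit = [{"dc:identifier": "SCOPUS_ID:85083171050"}]
miss = [{"error": "Result set was empty"}]

print(extract_scopus_id(hit))
print(extract_scopus_id(miss))
```

Checking the shape explicitly, instead of catching every exception, makes it easier to tell a missing record apart from an unrelated failure.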

Next, the configuration file is loaded; it contains the API key. Further information: https://github.com/ElsevierDev/elsapy/blob/master/CONFIG.md

In [7]:
# The context manager closes the file even if parsing fails
with open("config.json") as con_file:
    config = json.load(con_file)
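
As a sketch of what such a configuration file can look like, the following example writes a throwaway `config.json` with a placeholder key and reads it back. Only the `apikey` entry is taken from the notebook; `load_api_key` and the placeholder value are assumptions for illustration.

```python
import json
import os
import tempfile


def load_api_key(path: str) -> str:
    """Read the 'apikey' entry from a JSON config file."""
    with open(path, encoding="utf-8") as con_file:
        config = json.load(con_file)
    api_key = config.get("apikey", "")
    if not api_key:
        raise ValueError(f"No 'apikey' entry found in {path}")
    return api_key


# Demo with a temporary file and a placeholder key (not a real credential).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump({"apikey": "YOUR_API_KEY"}, handle)
    demo_key = load_api_key(demo_path)

print(demo_key)
```

Keeping the key in a separate file that is excluded from version control avoids publishing it together with the notebook.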

Next, the client is initialized with the API key.

In [8]:
client = ElsClient(config['apikey'])

For demonstration purposes, the following cell shows which data is returned by the Scopus API.

In [9]:
return_example = fetch_scopus_api(client, '10.1016/j.dsx.2020.04.012')
print(json.dumps(return_example, indent=2))

{
  "affiliation": [
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Hamdard",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Millia Islamia",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Indraprastha Apollo Hospitals",
      "affiliation-country": "India"
    }
  ],
  "coredata": {
    "srctype": "j",
    "eid": "2-s2.0-85083171050",
    "pubmed-id": "32305024",
    "prism:coverDate": "2020-07-01",
    "prism:aggregationType": "Journal",
    "prism:url": "https://api.elsevier.com/content/abstract/scopus_id/85083171050",
    "dc:creator": {
      "author": [
        {
          "ce:given-name": "Raju",
          "preferred-name": {
            "ce:given-name": "Raju",
            "ce:initials": "R.",
            "ce:surname": "Vaishya",
            "ce:indexed-name": "Vaishya R."
          },
          "@seq": "1",
          "ce:init

The returned data allows further analysis. Two separate notebooks analyse the data linked to:
<ul>
  <li>affiliation</li>
  <li>coredata</li>
</ul>    
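
As a minimal sketch of how the two parts can be accessed, the following example works on an excerpt shaped like the response shown above, with values copied from it. `summarise` is a hypothetical helper, and the handling of a single affiliation arriving as a dict is an assumption.

```python
from collections import Counter

# Excerpt shaped like the example response (values copied from it).
record = {
    "affiliation": [
        {"affiliation-city": "New Delhi", "affilname": "Jamia Hamdard",
         "affiliation-country": "India"},
        {"affiliation-city": "New Delhi", "affilname": "Jamia Millia Islamia",
         "affiliation-country": "India"},
        {"affiliation-city": "New Delhi", "affilname": "Indraprastha Apollo Hospitals",
         "affiliation-country": "India"},
    ],
    "coredata": {"srctype": "j", "prism:coverDate": "2020-07-01"},
}


def summarise(record):
    """Count affiliation countries and extract the publication year."""
    if record is None:  # fetch_scopus_api returns None when nothing was found
        return Counter(), None
    affiliations = record.get("affiliation") or []
    if isinstance(affiliations, dict):  # assumption: one affiliation may be a dict
        affiliations = [affiliations]
    countries = Counter(a.get("affiliation-country") for a in affiliations)
    cover_date = record.get("coredata", {}).get("prism:coverDate")
    year = int(cover_date[:4]) if cover_date else None
    return countries, year


countries, year = summarise(record)
print(countries, year)
```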

The coredata and affiliation data fetched so far are read and combined into a DataFrame for further processing.

In [10]:
df_current_extra_info = pd.DataFrame()
try:
    read_affiliation = pd.read_pickle('extra_info_affiliation_CS.pkl')
    read_coredata = pd.read_pickle('extra_info_coredata_CS.pkl')
    df_current_extra_info['affiliation'] = read_affiliation
    df_current_extra_info['coredata'] = read_coredata
except FileNotFoundError:
    # No data has been stored yet, so fetching starts from scratch
    print("The DataFrame is empty")

The length of the DataFrame holding the current information is assigned to a variable.
It serves as the starting index of the while loop that fetches the remaining DOIs.

In [11]:
len_df_current_extra_info = len(df_current_extra_info)
len_df_current_extra_info

41929

In [12]:
df_current_extra_info

Unnamed: 0,affiliation,coredata
0,"[{'affiliation-city': 'Palo Alto', 'affilname'...","{'srctype': 'j', 'eid': '2-s2.0-85083266658', ..."
1,"[{'affiliation-city': 'Seattle', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '7',..."
2,"[{'affiliation-city': 'Cambridge', 'affilname'...","{'srctype': 'j', 'prism:issueIdentifier': '8',..."
3,"[{'affiliation-city': 'Madison', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '9',..."
4,"[{'affiliation-city': 'Los Angeles', 'affilnam...","{'srctype': 'j', 'prism:issueIdentifier': '11'..."
...,...,...
41924,,
41925,,
41926,,
41927,,


In [13]:
def contains_only_None(dic):
    """
    This function inspects a dictionary and returns True if it solely contains None values.
    """
    # Use the parameter instead of the global dict_new_extra_info
    return all(value is None for value in dic.values())
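
A short usage sketch of this check, written as a self-contained variant (`contains_only_none` is a renamed copy for illustration, and the dictionaries are toy data). Note that an empty dictionary also counts as containing only None values.

```python
def contains_only_none(dic):
    """Return True if every value in the dictionary is None."""
    return all(value is None for value in dic.values())


# Toy dictionaries keyed by position, as in the fetching loop.
all_failed = {41929: None, 41930: None}
partly_ok = {41929: None, 41930: {"coredata": {"srctype": "j"}}}

print(contains_only_none(all_failed))
print(contains_only_none(partly_ok))
print(contains_only_none({}))
```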

In [14]:
def append_fetched_data_to_df(df_current_extra_info, dic):
    """
    This function appends or inserts newly fetched data to the DataFrame containing Scopus data.
    """
    # df_current_extra_info -> holds the latest data; new data is appended to it
    # df_newly_fetched_transposed -> holds the newly fetched data that is inserted or appended

    if contains_only_None(dic):
        # Nothing was fetched successfully: create placeholder rows filled with None
        placeholder_entries = pd.DataFrame(np.empty((len(dic), 2), dtype=object), columns=['affiliation', 'coredata'], index=dic.keys())
        df_newly_fetched_transposed = placeholder_entries
        print(placeholder_entries)
    else:
        # Before appending, the dictionary is transformed to a DataFrame
        df_newly_fetched = pd.DataFrame(dic)
        # Transposed so that each fetched entry becomes one row
        df_newly_fetched_transposed = df_newly_fetched.T

    # Insert newly fetched rows that could not be appended successfully before
    for index, row in df_newly_fetched_transposed.iterrows():
        # Overwrite the row in the current extra info DataFrame because it already exists
        if index in df_current_extra_info.index and row.affiliation is not None:
            df_current_extra_info.loc[index] = row
        # Append to the current extra info DataFrame because the row is new
        if index not in df_current_extra_info.index:
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            df_current_extra_info = pd.concat([df_current_extra_info, row.to_frame().T], ignore_index=True)

    # Return the DataFrame with inserted and replaced rows
    return df_current_extra_info
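
The insert-or-append logic can also be expressed without a row-by-row loop. The following is a minimal sketch on toy data, not the notebook's implementation: `upsert_rows` is a hypothetical helper that replaces existing rows only when the new row carries data and appends rows with unseen positions.

```python
import pandas as pd


def upsert_rows(current: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Replace rows that now carry data and append rows with unseen positions."""
    # A new row is usable if it is unseen or if its affiliation is not None.
    usable = [i for i in new.index
              if i not in current.index or new.at[i, "affiliation"] is not None]
    new_usable = new.loc[usable]
    # Drop the rows that will be replaced, then append and restore the order.
    kept = current.drop(index=new_usable.index.intersection(current.index))
    return pd.concat([kept, new_usable]).sort_index()


current = pd.DataFrame({
    "affiliation": [[{"affilname": "A"}], None, [{"affilname": "C"}]],
    "coredata": [{"eid": "a"}, None, {"eid": "c"}],
})
new = pd.DataFrame(
    {
        "affiliation": [[{"affilname": "B"}], None, [{"affilname": "D"}]],
        "coredata": [{"eid": "b"}, None, {"eid": "d"}],
    },
    index=[1, 2, 3],
)

merged = upsert_rows(current, new)
print(merged)
```

In this toy run, row 1 is filled in, row 2 keeps its existing data because the new entry is empty, and row 3 is appended.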

Each of the two DataFrame columns is stored as a Series object. Each Series is written to its own pkl file, which stays below 100 MB and can therefore be uploaded to GitHub.

In [15]:
def store_df_columns(df):
    ser_affiliation = df['affiliation']
    ser_coredata = df['coredata']
    ser_affiliation.to_pickle('extra_info_affiliation_CS.pkl')
    ser_coredata.to_pickle('extra_info_coredata_CS.pkl')
    return ser_affiliation, ser_coredata
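
A minimal sketch of the same idea on toy data: each column is pickled separately, the file sizes are compared with the limit, and the DataFrame is rebuilt from the files. The 100 MB constant mirrors the limit mentioned above; the file names and the demo data are assumptions.

```python
import os
import tempfile

import pandas as pd

SIZE_LIMIT_BYTES = 100 * 1024 * 1024  # limit mentioned in the text

demo = pd.DataFrame({
    "affiliation": [[{"affilname": "Jamia Hamdard"}], None],
    "coredata": [{"srctype": "j"}, None],
})

sizes = {}
restored = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for column in ["affiliation", "coredata"]:
        path = os.path.join(tmp_dir, f"demo_{column}.pkl")
        demo[column].to_pickle(path)           # one file per column
        sizes[column] = os.path.getsize(path)  # size in bytes
        restored[column] = pd.read_pickle(path)

small_enough = all(size < SIZE_LIMIT_BYTES for size in sizes.values())
rebuilt = pd.DataFrame(restored)
print(small_enough, rebuilt.shape)
```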

In [16]:
# placeholder_entries = pd.DataFrame(np.empty((4,2),dtype=object),columns=['affiliation','coredata'])

In [17]:
# placeholder_entries

Subsequently, the fetched Scopus data is stored in a dictionary. The print function shows the progress by displaying the most recently fetched DOI.

In [None]:
%%time
dict_new_extra_info = dict()
len_dois = len(doi_counted)
def trigger_fetching():
    threshold = 0  # number of DOIs fetched since the last save
    i = len_df_current_extra_info  # resume after the entries that are already stored
    while i < len_dois:  # lower the upper bound to fetch only part of the DOIs
        dict_new_extra_info[i] = fetch_scopus_api(client, doi_counted.index[i])
        print("Position fetched: " + str(i) + " -> " + doi_counted.index[i])
        i = i + 1
        threshold = threshold + 1
        # Save the data after every 100 fetched DOIs
        if threshold > 99:
            df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
            stored_series = store_df_columns(df_combined_extra_info)
            threshold = 0
            print("batch saved")
trigger_fetching()

Position fetched: 41929 -> 10.1101/2020.04.21.20072637
Position fetched: 41930 -> 10.1101/2020.04.21.20073114
Position fetched: 41931 -> 10.1101/2020.04.21.20073262
Position fetched: 41932 -> 10.1101/2020.04.21.20073536
Position fetched: 41933 -> 10.1101/2020.04.21.20073734
Position fetched: 41934 -> 10.1101/2020.04.21.20073833
Position fetched: 41935 -> 10.1101/2020.04.21.20073890
Position fetched: 41936 -> 10.1101/2020.04.21.20073916
Position fetched: 41937 -> 10.1101/2020.04.21.20074054
Position fetched: 41938 -> 10.1101/2020.04.21.20074138
Position fetched: 41939 -> 10.1101/2020.04.21.20074211
Position fetched: 41940 -> 10.1101/2020.04.21.20074443
Position fetched: 41941 -> 10.1101/2020.04.21.20074450
Position fetched: 41942 -> 10.1101/2020.04.21.20074468
Position fetched: 41943 -> 10.1101/2020.04.21.20074492
Position fetched: 41944 -> 10.1101/2020.04.21.20074591
Position fetched: 41945 -> 10.1101/2020.04.21.20074633
Position fetched: 41946 -> 10.1101/2020.04.22.044404
Position fet

Position fetched: 42073 -> 10.1101/2020.04.25.20077396
Position fetched: 42074 -> 10.1101/2020.04.25.20077842
Position fetched: 42075 -> 10.1101/2020.04.25.20079079
Position fetched: 42076 -> 10.1101/2020.04.25.20079095
Position fetched: 42077 -> 10.1101/2020.04.25.20079103
Position fetched: 42078 -> 10.1101/2020.04.25.20079111
Position fetched: 42079 -> 10.1101/2020.04.25.20079129
Position fetched: 42080 -> 10.1101/2020.04.25.20079251
Position fetched: 42081 -> 10.1101/2020.04.25.20079343
Position fetched: 42082 -> 10.1101/2020.04.25.20079400
Position fetched: 42083 -> 10.1101/2020.04.25.20079426
Position fetched: 42084 -> 10.1101/2020.04.25.20079467
Position fetched: 42085 -> 10.1101/2020.04.25.20079475
Position fetched: 42086 -> 10.1101/2020.04.25.20079491
Position fetched: 42087 -> 10.1101/2020.04.25.20079517
Position fetched: 42088 -> 10.1101/2020.04.25.20079624
Position fetched: 42089 -> 10.1101/2020.04.25.20079640
Position fetched: 42090 -> 10.1101/2020.04.25.20079848
Position f

Position fetched: 42217 -> 10.1101/2020.04.29.20081174
Position fetched: 42218 -> 10.1101/2020.04.29.20082065
Position fetched: 42219 -> 10.1101/2020.04.29.20082263
Position fetched: 42220 -> 10.1101/2020.04.29.20082867
Position fetched: 42221 -> 10.1101/2020.04.29.20083485
Position fetched: 42222 -> 10.1101/2020.04.29.20083709
Position fetched: 42223 -> 10.1101/2020.04.29.20083717
Position fetched: 42224 -> 10.1101/2020.04.29.20084111
Position fetched: 42225 -> 10.1101/2020.04.29.20084236
Position fetched: 42226 -> 10.1101/2020.04.29.20084285
Position fetched: 42227 -> 10.1101/2020.04.29.20084335
Position fetched: 42228 -> 10.1101/2020.04.29.20084376
      affiliation coredata
41929        None     None
41930        None     None
41931        None     None
41932        None     None
41933        None     None
...           ...      ...
42224        None     None
42225        None     None
42226        None     None
42227        None     None
42228        None     None

[300 rows x 2 c

Position fetched: 42354 -> 10.1101/2020.05.03.20089318
Position fetched: 42355 -> 10.1101/2020.05.03.20089417
Position fetched: 42356 -> 10.1101/2020.05.03.20089508
Position fetched: 42357 -> 10.1101/2020.05.03.20089557
Position fetched: 42358 -> 10.1101/2020.05.03.20089755
Position fetched: 42359 -> 10.1101/2020.05.03.20089813
Position fetched: 42360 -> 10.1101/2020.05.03.20089839
Position fetched: 42361 -> 10.1101/2020.05.03.20089854
Position fetched: 42362 -> 10.1101/2020.05.03.20089938
Position fetched: 42363 -> 10.1101/2020.05.03.20089961
Position fetched: 42364 -> 10.1101/2020.05.04.070177
Position fetched: 42365 -> 10.1101/2020.05.04.074989
Position fetched: 42366 -> 10.1101/2020.05.04.075291
Position fetched: 42367 -> 10.1101/2020.05.04.075945
Position fetched: 42368 -> 10.1101/2020.05.04.077826
Position fetched: 42369 -> 10.1101/2020.05.04.20072447
Position fetched: 42370 -> 10.1101/2020.05.04.20076349
Position fetched: 42371 -> 10.1101/2020.05.04.20079301
Position fetched: 42

Position fetched: 42498 -> 10.1101/2020.05.07.083212
Position fetched: 42499 -> 10.1101/2020.05.07.083410
Position fetched: 42500 -> 10.1101/2020.05.07.20073817
Position fetched: 42501 -> 10.1101/2020.05.07.20083386
Position fetched: 42502 -> 10.1101/2020.05.07.20084087
Position fetched: 42503 -> 10.1101/2020.05.07.20085365
Position fetched: 42504 -> 10.1101/2020.05.07.20089243
Position fetched: 42505 -> 10.1101/2020.05.07.20090225
Position fetched: 42506 -> 10.1101/2020.05.07.20091652
Position fetched: 42507 -> 10.1101/2020.05.07.20092353
Position fetched: 42508 -> 10.1101/2020.05.07.20092882
Position fetched: 42509 -> 10.1101/2020.05.07.20093286
Position fetched: 42510 -> 10.1101/2020.05.07.20093674
Position fetched: 42511 -> 10.1101/2020.05.07.20093807
Position fetched: 42512 -> 10.1101/2020.05.07.20093831
Position fetched: 42513 -> 10.1101/2020.05.07.20093849
Position fetched: 42514 -> 10.1101/2020.05.07.20093864
Position fetched: 42515 -> 10.1101/2020.05.07.20093872
Position fetch

Position fetched: 42635 -> 10.1101/2020.05.11.20095158
Position fetched: 42636 -> 10.1101/2020.05.11.20095851
Position fetched: 42637 -> 10.1101/2020.05.11.20096362
Position fetched: 42638 -> 10.1101/2020.05.11.20096727
Position fetched: 42639 -> 10.1101/2020.05.11.20097709
Position fetched: 42640 -> 10.1101/2020.05.11.20097725
Position fetched: 42641 -> 10.1101/2020.05.11.20097907
Position fetched: 42642 -> 10.1101/2020.05.11.20097923
Position fetched: 42643 -> 10.1101/2020.05.11.20097980
Position fetched: 42644 -> 10.1101/2020.05.11.20098004
Position fetched: 42645 -> 10.1101/2020.05.11.20098053
Position fetched: 42646 -> 10.1101/2020.05.11.20098061
Position fetched: 42647 -> 10.1101/2020.05.11.20098087
Position fetched: 42648 -> 10.1101/2020.05.11.20098111
Position fetched: 42649 -> 10.1101/2020.05.11.20098145
Position fetched: 42650 -> 10.1101/2020.05.11.20098202
Position fetched: 42651 -> 10.1101/2020.05.11.20098228
Position fetched: 42652 -> 10.1101/2020.05.11.20098335
Position f

Position fetched: 42780 -> 10.1101/2020.05.14.20101675
Position fetched: 42781 -> 10.1101/2020.05.14.20101691
Position fetched: 42782 -> 10.1101/2020.05.14.20101717
Position fetched: 42783 -> 10.1101/2020.05.14.20101774
Position fetched: 42784 -> 10.1101/2020.05.14.20101808
Position fetched: 42785 -> 10.1101/2020.05.14.20101873
Position fetched: 42786 -> 10.1101/2020.05.14.20101972
Position fetched: 42787 -> 10.1101/2020.05.14.20101998
Position fetched: 42788 -> 10.1101/2020.05.14.20102012
Position fetched: 42789 -> 10.1101/2020.05.14.20102038
Position fetched: 42790 -> 10.1101/2020.05.14.20102087
Position fetched: 42791 -> 10.1101/2020.05.14.20102343
Position fetched: 42792 -> 10.1101/2020.05.14.20102475
Position fetched: 42793 -> 10.1101/2020.05.14.20102483
Position fetched: 42794 -> 10.1101/2020.05.14.20102491
Position fetched: 42795 -> 10.1101/2020.05.14.20102517
Position fetched: 42796 -> 10.1101/2020.05.14.20102533
Position fetched: 42797 -> 10.1101/2020.05.14.20102541
Position f

Position fetched: 42925 -> 10.1101/2020.05.19.20101832
Position fetched: 42926 -> 10.1101/2020.05.19.20106336
Position fetched: 42927 -> 10.1101/2020.05.19.20106427
Position fetched: 42928 -> 10.1101/2020.05.19.20106492
      affiliation coredata
41929        None     None
41930        None     None
41931        None     None
41932        None     None
41933        None     None
...           ...      ...
42924        None     None
42925        None     None
42926        None     None
42927        None     None
42928        None     None

[1000 rows x 2 columns]
batch saved
Position fetched: 42929 -> 10.1101/2020.05.19.20106575
Position fetched: 42930 -> 10.1101/2020.05.19.20106641
Position fetched: 42931 -> 10.1101/2020.05.19.20106658
Position fetched: 42932 -> 10.1101/2020.05.19.20106781
Position fetched: 42933 -> 10.1101/2020.05.19.20106799
Position fetched: 42934 -> 10.1101/2020.05.19.20106856
Position fetched: 42935 -> 10.1101/2020.05.19.20106914
Position fetched: 42936 -> 10.1101

Position fetched: 43063 -> 10.1101/2020.05.22.20110627
Position fetched: 43064 -> 10.1101/2020.05.22.20110700
Position fetched: 43065 -> 10.1101/2020.05.22.20110718
Position fetched: 43066 -> 10.1101/2020.05.22.20110726
Position fetched: 43067 -> 10.1101/2020.05.22.20110742
Position fetched: 43068 -> 10.1101/2020.05.22.20110791
Position fetched: 43069 -> 10.1101/2020.05.22.20110809
Position fetched: 43070 -> 10.1101/2020.05.22.20110817
Position fetched: 43071 -> 10.1101/2020.05.22.20110825
Position fetched: 43072 -> 10.1101/2020.05.23.107334
Position fetched: 43073 -> 10.1101/2020.05.23.111385
Position fetched: 43074 -> 10.1101/2020.05.23.111971
Position fetched: 43075 -> 10.1101/2020.05.23.112235
Position fetched: 43076 -> 10.1101/2020.05.23.112284
Position fetched: 43077 -> 10.1101/2020.05.23.112797
Position fetched: 43078 -> 10.1101/2020.05.23.20101741
Position fetched: 43079 -> 10.1101/2020.05.23.20109496
Position fetched: 43080 -> 10.1101/2020.05.23.20110189
Position fetched: 4308

Position fetched: 43207 -> 10.1101/2020.05.27.20112987
Position fetched: 43208 -> 10.1101/2020.05.27.20113001
Position fetched: 43209 -> 10.1101/2020.05.27.20113803
Position fetched: 43210 -> 10.1101/2020.05.27.20114017
Position fetched: 43211 -> 10.1101/2020.05.27.20114066
Position fetched: 43212 -> 10.1101/2020.05.27.20114298
Position fetched: 43213 -> 10.1101/2020.05.27.20114371
Position fetched: 43214 -> 10.1101/2020.05.27.20114447
Position fetched: 43215 -> 10.1101/2020.05.27.20114470
Position fetched: 43216 -> 10.1101/2020.05.27.20114512
Position fetched: 43217 -> 10.1101/2020.05.27.20114538
Position fetched: 43218 -> 10.1101/2020.05.27.20114546
Position fetched: 43219 -> 10.1101/2020.05.27.20114652
Position fetched: 43220 -> 10.1101/2020.05.27.20114728
Position fetched: 43221 -> 10.1101/2020.05.27.20114744
Position fetched: 43222 -> 10.1101/2020.05.27.20114983
Position fetched: 43223 -> 10.1101/2020.05.27.20115048
Position fetched: 43224 -> 10.1101/2020.05.27.20115113
Position f

Position fetched: 43345 -> 10.1101/2020.05.31.20118802
Position fetched: 43346 -> 10.1101/2020.06.01.126821
Position fetched: 43347 -> 10.1101/2020.06.01.127019
Position fetched: 43348 -> 10.1101/2020.06.01.127381
Position fetched: 43349 -> 10.1101/2020.06.01.127589
Position fetched: 43350 -> 10.1101/2020.06.01.127829
Position fetched: 43351 -> 10.1101/2020.06.01.128355
Position fetched: 43352 -> 10.1101/2020.06.01.20086025
Position fetched: 43353 -> 10.1101/2020.06.01.20100461
Position fetched: 43354 -> 10.1101/2020.06.01.20112334
Position fetched: 43355 -> 10.1101/2020.06.01.20114884
Position fetched: 43356 -> 10.1101/2020.06.01.20116590
Position fetched: 43357 -> 10.1101/2020.06.01.20118018
Position fetched: 43358 -> 10.1101/2020.06.01.20118505
Position fetched: 43359 -> 10.1101/2020.06.01.20118877
Position fetched: 43360 -> 10.1101/2020.06.01.20118893
Position fetched: 43361 -> 10.1101/2020.06.01.20118927
Position fetched: 43362 -> 10.1101/2020.06.01.20118935
Position fetched: 4336
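
The batch-wise saving used in the loop above can be sketched independently of Scopus. In this hypothetical variant, `fetch_in_batches`, `fetch` and `save` are illustrative names; unlike the notebook's loop, the sketch also saves the last, incomplete batch.

```python
def fetch_in_batches(ids, fetch, save, start=0, batch_size=100):
    """Fetch ids[start:] one by one and call save() after every full batch."""
    results = {}
    pending = 0
    for position in range(start, len(ids)):
        results[position] = fetch(ids[position])
        pending += 1
        if pending >= batch_size:
            save(results)
            pending = 0
    if pending:  # flush the last, incomplete batch
        save(results)
    return results


saved_sizes = []
toy_ids = [f"10.0000/placeholder-{n}" for n in range(7)]
toy_results = fetch_in_batches(
    toy_ids,
    fetch=lambda doi: {"doi": doi},                 # stand-in for the API call
    save=lambda res: saved_sizes.append(len(res)),  # stand-in for pickling
    start=2,
    batch_size=2,
)
print(saved_sizes)
```

Passing `fetch` and `save` as parameters keeps the loop testable without network access or an API key.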

The following cell is useful when the process above is interrupted. It narrows the dictionary of fetched information down to the useful entries.

In [None]:
def save_new_extra_info(len_df_current_extra_info, upto):
    """
    This function separates successful API calls from API calls that failed due to an invalid API key.
    It returns the entries of dict_new_extra_info from the starting index up to, but excluding, the given position.
    """
    dict_new_extra_info_saver = dict()
    i = len_df_current_extra_info
    while i < upto:
        #print("Position: " + str(i) + " -> " +  doi_counted.index[i])
        dict_new_extra_info_saver[i] = dict_new_extra_info[i]
        i = i + 1 
    return dict_new_extra_info_saver
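
The same narrowing can be written as a dictionary comprehension. This is a sketch with toy data; `slice_results` is a hypothetical name, and unlike the function above it skips missing positions instead of raising a `KeyError`.

```python
def slice_results(results, start, stop):
    """Keep only the entries whose position lies in [start, stop)."""
    return {pos: results[pos] for pos in range(start, stop) if pos in results}


# Toy data: positions 7 and 8 stand for calls made with an invalid key.
fetched = {5: {"ok": 1}, 6: {"ok": 2}, 7: None, 8: None}
valid_part = slice_results(fetched, start=5, stop=7)
print(valid_part)
```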

The existing and newly fetched information are combined into one DataFrame. 

In [None]:
# Combine the stored data with the newly fetched entries
df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
df_combined_extra_info

In [None]:
# too big for GitHub
#df_combined_extra_info.to_csv('extra_info_CS5099.csv', sep='\t')

Each of the two DataFrame columns is stored as a Series object. Each Series is written to its own pkl file, which stays below 100 MB and can therefore be uploaded to GitHub.

In [None]:
# stored_series = store_df_columns(df_combined_extra_info)
# stored_series[0]

In [None]:
# stored_series[1]

Verify that the returned None values are due to non-existent data and not to an invalid API key.

In [None]:
# len_data = len(stored_series[0])
# len_data 

In [None]:
# ser_doi = pd.Series(doi_counted.index[:len_data])
# ser_doi

In [None]:
# df_current_extra_info_checker = df_combined_extra_info
# df_current_extra_info_checker['doi'] = ser_doi

In [None]:
# %%time
# len_df_current_extra_info_checker = len(df_current_extra_info_checker)
# dict_new_extra_info_checker = dict()
# i = 0 
# while i < len_df_current_extra_info_checker: ###################################################### 
#     if df_current_extra_info_checker['affiliation'][i] == None:
#         dict_new_extra_info_checker[i] = fetch_scopus_api(client, ser_doi[i])
#         print("Position fetched again: " + str(i) + " -> " +  ser_doi[i])
#     i = i + 1    
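
The verification idea can be sketched without API access: every position that is still None is requested again, and only the positions that remain None are treated as genuinely missing. `refetch_missing` and the toy lookup table are assumptions for illustration.

```python
def refetch_missing(stored, ids, fetch):
    """Call fetch() again for every position whose stored value is None."""
    retried = {}
    for position, value in enumerate(stored):
        if value is None:
            retried[position] = fetch(ids[position])
    return retried


stored = [{"affilname": "X"}, None, None]
ids = ["10.0000/a", "10.0000/b", "10.0000/c"]
# Stand-in for Scopus: only one of the missing DOIs can be resolved.
known = {"10.0000/b": {"affilname": "Y"}}

retried = refetch_missing(stored, ids, fetch=known.get)
still_missing = [pos for pos, value in retried.items() if value is None]
print(retried, still_missing)
```

If a second pass with a valid key returns None for every retried position, the gaps most likely reflect records that are not available in Scopus.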

In [None]:
# dict_new_extra_info_checker
# -> check if at least one value is not None -> otherwise the process is finished here

In [None]:
# len(dict_new_extra_info_checker)

In [None]:
# df_combined_extra_info_fetched_again  = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info_checker)
# df_combined_extra_info_fetched_again

In [None]:
# store_df_columns(df_combined_extra_info_fetched_again)

In [None]:
# df_combined_extra_info_fetched_again['check_doi'] = ser_doi
# df_combined_extra_info_fetched_again.head(30)