# CORD-19-collect-scopus-data

This Jupyter notebook collects additional data from Scopus to enrich the CORD-19 dataset:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncs0

First, the required packages are imported into the notebook.

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import datetime
import json
import re
import time  # for sleep
from collections import Counter
from urllib.parse import urlparse

import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz
from pybtex.database import parse_file, BibliographyData, Entry

from elsapy.elsclient import ElsClient
from elsapy.elsdoc import FullDoc, AbsDoc
from elsapy.elssearch import ElsSearch

Load the data and store it in a DataFrame.

In [2]:
CORD19_CSV = pd.read_csv('../data/cord-19/CORD19_software_mentions.csv')

Check the length of the column containing the DOIs.

In [3]:
len(CORD19_CSV['doi'])

77448

Display the doi column to check for inconsistencies such as NaN values.

In [4]:
doi = CORD19_CSV['doi']
doi

0                                 NaN
1          10.1016/j.regg.2021.01.002
2           10.1016/j.rec.2020.08.002
3        10.1016/j.vetmic.2006.11.026
4                   10.3390/v12080849
                     ...             
77443      10.1007/s11229-020-02869-9
77444                             NaN
77445     10.1101/2020.05.13.20100206
77446      10.1007/s42991-020-00052-8
77447     10.1101/2020.09.14.20194670
Name: doi, Length: 77448, dtype: object

Create a series that contains each unique DOI once and ignores NaN values. It is important to sort the unique values by DOI. Otherwise, the order of the entries can differ after each restart of the notebook, and the positions used for fetching would no longer be stable.

In [5]:
doi_counted = doi.value_counts().sort_index(ascending=True)
doi_counted

10.1001/jamainternmed.2020.1369       1
10.1001/jamanetworkopen.2020.16382    1
10.1001/jamanetworkopen.2020.17521    1
10.1001/jamanetworkopen.2020.20485    1
10.1001/jamanetworkopen.2020.24984    1
                                     ..
10.9745/ghsp-d-20-00115               1
10.9745/ghsp-d-20-00171               1
10.9745/ghsp-d-20-00218               1
10.9758/cpn.2020.18.4.607             1
10.9781/ijimai.2020.02.002            1
Name: doi, Length: 74302, dtype: int64
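
The counting and sorting step can be illustrated without pandas. The following is a minimal sketch using only the standard library; the helper name `unique_sorted_dois` is hypothetical, and the sample list reuses two DOIs from the output above plus missing values for illustration.

```python
import math
from collections import Counter


def unique_sorted_dois(dois):
    """Count each non-missing DOI and return (doi, count) pairs sorted by DOI."""

    def is_missing(value):
        # Treat both None and float NaN as missing values
        return value is None or (isinstance(value, float) and math.isnan(value))

    counts = Counter(d for d in dois if not is_missing(d))
    # Sorting by DOI gives every DOI the same position on every run
    return sorted(counts.items())


sample = [
    float("nan"),
    "10.3390/v12080849",
    "10.1016/j.rec.2020.08.002",
    "10.3390/v12080849",
    None,
]
result = unique_sorted_dois(sample)
print(result)
```

Because the pairs are sorted by DOI, a position such as 54529 refers to the same DOI in every session, which is what makes resuming an interrupted run possible.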

The following function retrieves the requested information from the Scopus API. An example request: (https://api.elsevier.com/content/search/scopus?query=DOI(10.1109/MCOM.2016.7509373)&apiKey=6d485ef1fe1408712f37e8a783a285a4)

In [6]:
#Adapted from https://github.com/ElsevierDev/elsapy/blob/master/exampleProg.py
def fetch_scopus_api(client, doi):
    """Obtain additional paper information from Scopus by DOI.

    Returns the data of the Scopus abstract document, or None if the lookup fails.
    """
    doc_srch = ElsSearch("DOI(" + doi + ")", 'scopus')
    doc_srch.execute(client, get_all=True)
    try:
        # Take the Scopus ID from the first search result (the part after the colon)
        scopus_id = doc_srch.results[0]["dc:identifier"].split(":")[1]
        scp_doc = AbsDoc(scp_id=scopus_id)
        if scp_doc.read(client):
            scp_doc.write()
        else:
            print("Read document failed.")
        return scp_doc.data
    except Exception:
        # No usable search result (for example, the DOI is not indexed in Scopus)
        return None
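
The ID extraction inside the function can be isolated and tested without calling the API. This is a minimal sketch; the helper name `extract_scopus_id` is hypothetical, and the assumed identifier format `SCOPUS_ID:<number>` as well as the shape of an empty result entry are assumptions, not confirmed by this notebook.

```python
def extract_scopus_id(results):
    """Return the Scopus ID of the first search result, or None if there is none."""
    if not results:
        return None
    identifier = results[0].get("dc:identifier")
    # Assumed format: "SCOPUS_ID:85083171050"
    if not identifier or ":" not in identifier:
        return None
    return identifier.split(":", 1)[1]


found = extract_scopus_id([{"dc:identifier": "SCOPUS_ID:85083171050"}])
# Assumed shape of a search result that carries an error instead of a document
missing = extract_scopus_id([{"error": "Result set was empty"}])
print(found, missing)
```

Returning None explicitly for the empty case avoids relying on an exception to signal that a DOI was not found.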

Next, the configuration file, which contains the API key, is loaded. Further information: https://github.com/ElsevierDev/elsapy/blob/master/CONFIG.md

In [7]:
with open("config.json") as con_file:
    config = json.load(con_file)
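
A missing or empty key in the configuration file only surfaces later as failed API calls, so it can help to validate it while loading. The following is a minimal sketch; the function name `load_api_key` is hypothetical, and the demo writes a throwaway file with a dummy key rather than touching the real config.json.

```python
import json
import os
import tempfile


def load_api_key(path):
    """Read the 'apikey' entry from a JSON config file and check that it is set."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    key = str(config.get("apikey", "")).strip()
    if not key:
        raise ValueError("No 'apikey' entry found in " + path)
    return key


# Demonstration with a temporary file and a dummy key
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump({"apikey": "dummy-key-for-illustration"}, handle)
    demo_key = load_api_key(demo_path)

print(demo_key)
```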

The client is then initialized with the API key.

In [8]:
client = ElsClient(config['apikey'])

For demonstration purposes, the following cell shows which data is returned by the Scopus API.

In [9]:
return_example = fetch_scopus_api(client, '10.1016/j.dsx.2020.04.012')
print(json.dumps(return_example, indent=2))

{
  "affiliation": [
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Hamdard",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Millia Islamia",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Indraprastha Apollo Hospitals",
      "affiliation-country": "India"
    }
  ],
  "coredata": {
    "srctype": "j",
    "eid": "2-s2.0-85083171050",
    "pubmed-id": "32305024",
    "prism:coverDate": "2020-07-01",
    "prism:aggregationType": "Journal",
    "prism:url": "https://api.elsevier.com/content/abstract/scopus_id/85083171050",
    "dc:creator": {
      "author": [
        {
          "ce:given-name": "Raju",
          "preferred-name": {
            "ce:given-name": "Raju",
            "ce:initials": "R.",
            "ce:surname": "Vaishya",
            "ce:indexed-name": "Vaishya R."
          },
          "@seq": "1",
          "ce:init

Further analysis can be conducted on the returned data. Two separate notebooks analyse the data linked to:
<ul>
  <li>affiliation</li>
  <li>coredata</li>
</ul>    
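
The shape of the affiliation field can vary: the example response above holds a list of affiliation dictionaries, while some rows in the batch output of this notebook show a single dictionary. A minimal sketch of a normalizing helper follows. The function names are hypothetical, only keys visible in the example response are used, and the single-dictionary sample contains made-up placeholder values.

```python
def normalize_affiliations(affiliation):
    """Return the affiliation field as a list of dictionaries."""
    if affiliation is None:
        return []
    if isinstance(affiliation, dict):
        # A single affiliation is wrapped so that callers can always iterate
        return [affiliation]
    return list(affiliation)


def affiliation_countries(record):
    """Collect the distinct affiliation countries of one fetched record."""
    if not record:
        return []
    entries = normalize_affiliations(record.get("affiliation"))
    countries = {entry.get("affiliation-country") for entry in entries}
    countries.discard(None)
    return sorted(countries)


example = {
    "affiliation": [
        {"affiliation-city": "New Delhi", "affilname": "Jamia Hamdard",
         "affiliation-country": "India"},
        {"affiliation-city": "New Delhi", "affilname": "Jamia Millia Islamia",
         "affiliation-country": "India"},
    ]
}
# Made-up placeholder record with a single affiliation dictionary
single = {"affiliation": {"affiliation-city": "Example City",
                          "affiliation-country": "Example Country"}}

print(affiliation_countries(example))
print(affiliation_countries(single))
```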

The previously fetched coredata and affiliation are read and combined into a DataFrame for further processing.

In [10]:
df_current_extra_info = pd.DataFrame()
try:
    read_affiliation = pd.read_pickle('extra_info_affiliation_CS.pkl')
    read_coredata = pd.read_pickle('extra_info_coredata_CS.pkl')
    df_current_extra_info['affiliation'] = read_affiliation
    df_current_extra_info['coredata'] = read_coredata
except FileNotFoundError:
    # No pickle files from a previous run exist yet, so the DataFrame stays empty
    print("The DataFrame is empty")

The length of the DataFrame holding the current information is stored in a variable.
It serves as the starting index of the while loop that fetches the remaining DOIs.

In [11]:
len_df_current_extra_info = len(df_current_extra_info)
len_df_current_extra_info

54529

In [12]:
df_current_extra_info

Unnamed: 0,affiliation,coredata
0,"[{'affiliation-city': 'Palo Alto', 'affilname'...","{'srctype': 'j', 'eid': '2-s2.0-85083266658', ..."
1,"[{'affiliation-city': 'Seattle', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '7',..."
2,"[{'affiliation-city': 'Cambridge', 'affilname'...","{'srctype': 'j', 'prism:issueIdentifier': '8',..."
3,"[{'affiliation-city': 'Madison', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '9',..."
4,"[{'affiliation-city': 'Los Angeles', 'affilnam...","{'srctype': 'j', 'prism:issueIdentifier': '11'..."
...,...,...
54524,"[{'affiliation-city': 'Oslo', 'affilname': 'Ul...","{'srctype': 'j', 'prism:issueIdentifier': '4',..."
54525,"[{'affiliation-city': 'Innsbruck', 'affilname'...","{'srctype': 'j', 'prism:issueIdentifier': '4',..."
54526,"[{'affiliation-city': 'London', 'affilname': '...","{'srctype': 'j', 'prism:issueIdentifier': '6',..."
54527,"[{'affiliation-city': 'Cambridge', 'affilname'...","{'srctype': 'j', 'prism:issueIdentifier': '1',..."


In [13]:
def contains_only_None(dic):
    """
    This function inspects a dictionary and returns True if it solely contains None values.
    """
    # Use the dictionary passed as argument, not the global dict_new_extra_info
    return all(value is None for value in dic.values())
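
The behaviour of this check on a few inputs, including the edge case of an empty dictionary, can be sketched as follows. The function name `only_none_values` is hypothetical and stands in for the helper above so that the snippet runs on its own.

```python
def only_none_values(dic):
    """Return True if every value in the dictionary is None."""
    return all(value is None for value in dic.values())


all_failed = only_none_values({54529: None, 54530: None})
partly_fetched = only_none_values({54529: None, 54530: {"coredata": {}}})
nothing_fetched = only_none_values({})
print(all_failed, partly_fetched, nothing_fetched)
```

An empty dictionary also counts as containing only None values, so a caller that wants to distinguish "no requests made" from "all requests failed" has to check the length separately.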

In [14]:
def append_fetched_data_to_df(df_current_extra_info, dic):
    """
    This function appends or inserts newly fetched data to the DataFrame containing Scopus data.
    """
    # df_current_extra_info -> holds the latest data; new data needs to be appended to it
    # df_newly_fetched_transposed -> holds newly fetched data that needs to be inserted or appended

    if contains_only_None(dic):
        # Nothing was fetched successfully: create placeholder rows filled with None
        placeholder_entries = pd.DataFrame(np.empty((len(dic), 2), dtype=object), columns=['affiliation', 'coredata'], index=list(dic.keys()))
        df_newly_fetched_transposed = placeholder_entries
        print(placeholder_entries)
    else:
        # Prior to appending, the dictionary is transformed to a DataFrame
        df_newly_fetched = pd.DataFrame(dic)
        # Transpose so that each fetched position becomes one row
        df_newly_fetched_transposed = df_newly_fetched.T
        print(df_newly_fetched_transposed)

    # Insert newly fetched rows that could not be fetched successfully before
    for index, row in df_newly_fetched_transposed.iterrows():
        if index in df_current_extra_info.index:
            # The row already exists: overwrite it only if real data was fetched
            if row.affiliation is not None:
                df_current_extra_info.loc[index] = row
        else:
            # The row is new: append it (DataFrame.append was removed in pandas 2.0)
            df_current_extra_info = pd.concat([df_current_extra_info, row.to_frame().T], ignore_index=True)

    # Return the DataFrame with inserted and replaced rows
    return df_current_extra_info
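
The insert-or-append rule of this function can be reduced to plain dictionaries keyed by position. The following is a simplified sketch, not the notebook's implementation: the name `merge_fetched` is hypothetical, and it checks the whole record for None instead of only the affiliation column.

```python
def merge_fetched(current, fetched):
    """Merge newly fetched records into the current ones, keyed by position."""
    merged = dict(current)
    for position, record in fetched.items():
        if position not in merged:
            # Unknown position: append, even if the fetch returned None
            merged[position] = record
        elif record is not None:
            # Known position: overwrite only when real data was fetched
            merged[position] = record
    return merged


current = {0: {"affiliation": "A"}, 1: None}
fetched = {0: None, 1: {"affiliation": "B"}, 2: None}
merged = merge_fetched(current, fetched)
print(merged)
```

Position 0 keeps its existing data, position 1 is filled in by the retry, and position 2 is appended as a placeholder that a later run can try again.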

Each of the two DataFrame columns is stored as a separate Series object. Each Series is written to its own pkl file, which keeps every file below 100 MB and thus allows uploading to GitHub.

In [15]:
def store_df_columns(df):
    ser_affiliation = df['affiliation']
    ser_coredata = df['coredata']
    ser_affiliation.to_pickle('extra_info_affiliation_CS.pkl')
    ser_coredata.to_pickle('extra_info_coredata_CS.pkl')
    return ser_affiliation, ser_coredata
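
Whether a written file actually stays below the size limit can be checked right after pickling. This is a minimal sketch with the standard library's pickle module instead of pandas; the function name `pickle_with_size_check` and the constant are hypothetical, and the limit of 100 * 1024 * 1024 bytes is an assumption based on the 100 MB figure mentioned above.

```python
import os
import pickle
import tempfile

# Assumed limit, derived from the 100 MB figure stated in the text
SIZE_LIMIT_BYTES = 100 * 1024 * 1024


def pickle_with_size_check(obj, path, limit=SIZE_LIMIT_BYTES):
    """Pickle obj to path and report the file size and whether it is below the limit."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    size = os.path.getsize(path)
    return size, size < limit


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pkl")
    demo_data = [{"affiliation-city": "New Delhi"}] * 1000
    size, below_limit = pickle_with_size_check(demo_data, demo_path)
    with open(demo_path, "rb") as handle:
        restored = pickle.load(handle)

print(size, below_limit)
```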

In [16]:
# placeholder_entries = pd.DataFrame(np.empty((4,2),dtype=object),columns=['affiliation','coredata'])

In [17]:
# placeholder_entries

Subsequently, the fetched Scopus data is stored in a dictionary. The print function shows the progress of the process by displaying the most recently fetched position and DOI.

In [None]:
%%time
dict_new_extra_info = dict()
len_dois = len(doi_counted)
def trigger_fetching():
    threshold = 0
    # Resume after the last position that is already stored
    i = len_df_current_extra_info
    while i < len_dois:  # lower this bound to fetch only part of the remaining DOIs
        dict_new_extra_info[i] = fetch_scopus_api(client, doi_counted.index[i])
        print("Position fetched: " + str(i) + " -> " + doi_counted.index[i])
        i = i + 1
        threshold = threshold + 1
        if threshold > 99:
            # Save a checkpoint after every 100 fetched DOIs
            df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
            stored_series = store_df_columns(df_combined_extra_info)
            threshold = 0
            print("batch saved")
trigger_fetching()

Position fetched: 54529 -> 10.1183/13993003.03730-2020
Position fetched: 54530 -> 10.1183/13993003.03961-2020
Position fetched: 54531 -> 10.1183/13993003.04116-2020
Position fetched: 54532 -> 10.1183/16000617.0060-2020
Position fetched: 54533 -> 10.1183/16000617.0181-2020
Position fetched: 54534 -> 10.1183/16000617.0240-2020
Position fetched: 54535 -> 10.1183/16000617.0287-2020
Position fetched: 54536 -> 10.1183/16000617.0310-2020
Position fetched: 54537 -> 10.1183/20734735.019317
Position fetched: 54538 -> 10.1183/23120541.00015-2015
Position fetched: 54539 -> 10.1183/23120541.00066-2018
Position fetched: 54540 -> 10.1183/23120541.00096-2018
Position fetched: 54541 -> 10.1183/23120541.00142-2020
Position fetched: 54542 -> 10.1183/23120541.00156-2017
Position fetched: 54543 -> 10.1183/23120541.00174-2020
Position fetched: 54544 -> 10.1183/23120541.00238-2020
Position fetched: 54545 -> 10.1183/23120541.00260-2020
Position fetched: 54546 -> 10.1183/23120541.00304-2020
Position fetched: 5

Position fetched: 54658 -> 10.1186/1471-2105-5-65
Position fetched: 54659 -> 10.1186/1471-2105-5-72
Position fetched: 54660 -> 10.1186/1471-2105-5-96
Position fetched: 54661 -> 10.1186/1471-2105-6-113
Position fetched: 54662 -> 10.1186/1471-2105-6-190
Position fetched: 54663 -> 10.1186/1471-2105-6-23
Position fetched: 54664 -> 10.1186/1471-2105-6-90
Position fetched: 54665 -> 10.1186/1471-2105-7-182
Position fetched: 54666 -> 10.1186/1471-2105-7-232
Position fetched: 54667 -> 10.1186/1471-2105-7-451
Position fetched: 54668 -> 10.1186/1471-2105-7-9
Position fetched: 54669 -> 10.1186/1471-2105-7-s5-s12
Position fetched: 54670 -> 10.1186/1471-2105-8-194
Position fetched: 54671 -> 10.1186/1471-2105-8-21
Position fetched: 54672 -> 10.1186/1471-2105-9-155
Position fetched: 54673 -> 10.1186/1471-2105-9-159
Position fetched: 54674 -> 10.1186/1471-2105-9-304
Position fetched: 54675 -> 10.1186/1471-2105-9-368
Position fetched: 54676 -> 10.1186/1471-2105-9-s1-s19
Position fetched: 54677 -> 10.118

Position fetched: 54790 -> 10.1186/1471-2334-1-6
Position fetched: 54791 -> 10.1186/1471-2334-10-106
Position fetched: 54792 -> 10.1186/1471-2334-10-113
Position fetched: 54793 -> 10.1186/1471-2334-10-136
Position fetched: 54794 -> 10.1186/1471-2334-10-139
Position fetched: 54795 -> 10.1186/1471-2334-10-145
Position fetched: 54796 -> 10.1186/1471-2334-10-215
Position fetched: 54797 -> 10.1186/1471-2334-10-222
Position fetched: 54798 -> 10.1186/1471-2334-10-234
Position fetched: 54799 -> 10.1186/1471-2334-10-256
Position fetched: 54800 -> 10.1186/1471-2334-10-286
Position fetched: 54801 -> 10.1186/1471-2334-10-288
Position fetched: 54802 -> 10.1186/1471-2334-10-296
Position fetched: 54803 -> 10.1186/1471-2334-10-322
Position fetched: 54804 -> 10.1186/1471-2334-10-330
Position fetched: 54805 -> 10.1186/1471-2334-10-35
Position fetched: 54806 -> 10.1186/1471-2334-10-42
Position fetched: 54807 -> 10.1186/1471-2334-10-54
Position fetched: 54808 -> 10.1186/1471-2334-10-62
Position fetched: 5

Position fetched: 54922 -> 10.1186/1471-2431-12-32
Position fetched: 54923 -> 10.1186/1471-2431-12-83
Position fetched: 54924 -> 10.1186/1471-2431-14-37
Position fetched: 54925 -> 10.1186/1471-244x-14-81
Position fetched: 54926 -> 10.1186/1471-2458-10-130
Position fetched: 54927 -> 10.1186/1471-2458-10-138
Position fetched: 54928 -> 10.1186/1471-2458-10-191
                                             affiliation  \
54529  [{'affiliation-city': 'Melbourne', 'affilname'...   
54530  [{'affiliation-city': 'Berlin', 'affilname': '...   
54531  {'affiliation-city': 'Dundee', 'affilname': 'N...   
54532  {'affiliation-city': 'Chicago', 'affilname': '...   
54533  [{'affiliation-city': 'Irvine', 'affilname': '...   
...                                                  ...   
54924  [{'affiliation-city': 'Toronto', 'affilname': ...   
54925  [{'affiliation-city': 'Jinan', 'affilname': 'S...   
54926  [{'affiliation-city': 'Sydney', 'affilname': '...   
54927  [{'affiliation-city': 'Townsville

batch saved
Position fetched: 55029 -> 10.1186/1472-6807-10-6
Position fetched: 55030 -> 10.1186/1472-6807-13-s1-s3
Position fetched: 55031 -> 10.1186/1472-6807-3-8
Position fetched: 55032 -> 10.1186/1472-6807-5-5
Position fetched: 55033 -> 10.1186/1472-6882-12-10
Position fetched: 55034 -> 10.1186/1472-6882-13-367
Position fetched: 55035 -> 10.1186/1472-6882-14-171
Position fetched: 55036 -> 10.1186/1472-6882-14-273
Position fetched: 55037 -> 10.1186/1472-6882-14-90
Position fetched: 55038 -> 10.1186/1472-6920-14-158
Position fetched: 55039 -> 10.1186/1472-6939-7-12
Position fetched: 55040 -> 10.1186/1472-6947-12-147
Position fetched: 55041 -> 10.1186/1472-6947-12-37
Position fetched: 55042 -> 10.1186/1472-6947-5-17
Position fetched: 55043 -> 10.1186/1472-6947-7-15
Position fetched: 55044 -> 10.1186/1472-6947-7-28
Position fetched: 55045 -> 10.1186/1472-6947-8-35
Position fetched: 55046 -> 10.1186/1472-6955-10-4
Position fetched: 55047 -> 10.1186/1472-6963-12-21
Position fetched: 5504

Position fetched: 55162 -> 10.1186/1742-4690-3-19
Position fetched: 55163 -> 10.1186/1742-4690-3-s1-p62
Position fetched: 55164 -> 10.1186/1742-4690-5-112
Position fetched: 55165 -> 10.1186/1742-4690-7-47
Position fetched: 55166 -> 10.1186/1742-4690-8-102
Position fetched: 55167 -> 10.1186/1742-4690-8-32
Position fetched: 55168 -> 10.1186/1742-4690-8-66
Position fetched: 55169 -> 10.1186/1742-4690-8-75
Position fetched: 55170 -> 10.1186/1742-4690-9-97
Position fetched: 55171 -> 10.1186/1742-4933-11-4
Position fetched: 55172 -> 10.1186/1742-4933-5-2
Position fetched: 55173 -> 10.1186/1742-6405-11-28
Position fetched: 55174 -> 10.1186/1742-6405-7-21
Position fetched: 55175 -> 10.1186/1742-6405-9-24
Position fetched: 55176 -> 10.1186/1742-7622-10-3
Position fetched: 55177 -> 10.1186/1742-7622-11-4
Position fetched: 55178 -> 10.1186/1742-7622-3-10
Position fetched: 55179 -> 10.1186/1742-7622-5-20
Position fetched: 55180 -> 10.1186/1743-422x-1-1
Position fetched: 55181 -> 10.1186/1743-422x-

Position fetched: 55295 -> 10.1186/1743-422x-8-184
Position fetched: 55296 -> 10.1186/1743-422x-8-215
Position fetched: 55297 -> 10.1186/1743-422x-8-23
Position fetched: 55298 -> 10.1186/1743-422x-8-262
Position fetched: 55299 -> 10.1186/1743-422x-8-263
Position fetched: 55300 -> 10.1186/1743-422x-8-3
Position fetched: 55301 -> 10.1186/1743-422x-8-308
Position fetched: 55302 -> 10.1186/1743-422x-8-319
Position fetched: 55303 -> 10.1186/1743-422x-8-325
Position fetched: 55304 -> 10.1186/1743-422x-8-332
Position fetched: 55305 -> 10.1186/1743-422x-8-348
Position fetched: 55306 -> 10.1186/1743-422x-8-380
Position fetched: 55307 -> 10.1186/1743-422x-8-405
Position fetched: 55308 -> 10.1186/1743-422x-8-432
Position fetched: 55309 -> 10.1186/1743-422x-8-434
Position fetched: 55310 -> 10.1186/1743-422x-8-455
Position fetched: 55311 -> 10.1186/1743-422x-8-483
Position fetched: 55312 -> 10.1186/1743-422x-8-494
Position fetched: 55313 -> 10.1186/1743-422x-8-501
Position fetched: 55314 -> 10.1186

                                             affiliation  \
54529  [{'affiliation-city': 'Melbourne', 'affilname'...   
54530  [{'affiliation-city': 'Berlin', 'affilname': '...   
54531  {'affiliation-city': 'Dundee', 'affilname': 'N...   
54532  {'affiliation-city': 'Chicago', 'affilname': '...   
54533  [{'affiliation-city': 'Irvine', 'affilname': '...   
...                                                  ...   
55424  {'affiliation-city': 'Rio de Janeiro', 'affiln...   
55425  [{'affiliation-city': 'Carlow', 'affilname': '...   
55426  [{'affiliation-city': 'Seattle', 'affilname': ...   
55427  [{'affiliation-city': 'Seattle', 'affilname': ...   
55428  {'affiliation-city': 'Guiyang', 'affilname': '...   

                                                coredata  
54529  {'srctype': 'j', 'prism:issueIdentifier': '5',...  
54530  {'srctype': 'j', 'prism:issueIdentifier': '4',...  
54531  {'srctype': 'j', 'prism:issueIdentifier': '6',...  
54532  {'srctype': 'j', 'eid': '2-s2.0-8508

Position fetched: 55538 -> 10.1186/cc13934
Position fetched: 55539 -> 10.1186/cc14184
Position fetched: 55540 -> 10.1186/cc1795
Position fetched: 55541 -> 10.1186/cc1860
Position fetched: 55542 -> 10.1186/cc2687
Position fetched: 55543 -> 10.1186/cc2967
Position fetched: 55544 -> 10.1186/cc3046
Position fetched: 55545 -> 10.1186/cc3518
Position fetched: 55546 -> 10.1186/cc3904
Position fetched: 55547 -> 10.1186/cc3916
Position fetched: 55548 -> 10.1186/cc5059
Position fetched: 55549 -> 10.1186/cc5116
Position fetched: 55550 -> 10.1186/cc5604
Position fetched: 55551 -> 10.1186/cc5723
Position fetched: 55552 -> 10.1186/cc5732
Position fetched: 55553 -> 10.1186/cc5944
Position fetched: 55554 -> 10.1186/cc6894
Position fetched: 55555 -> 10.1186/cc7098
Position fetched: 55556 -> 10.1186/cc7880
Position fetched: 55557 -> 10.1186/cc8208
Position fetched: 55558 -> 10.1186/cc8222
Position fetched: 55559 -> 10.1186/cc8514
Position fetched: 55560 -> 10.1186/cc9068
Position fetched: 55561 -> 10.11

Position fetched: 55670 -> 10.1186/s12859-020-03763-4
Position fetched: 55671 -> 10.1186/s12859-020-03776-z
Position fetched: 55672 -> 10.1186/s12859-020-03782-1
Position fetched: 55673 -> 10.1186/s12859-020-03838-2
Position fetched: 55674 -> 10.1186/s12859-020-03845-3
Position fetched: 55675 -> 10.1186/s12859-020-03869-9
Position fetched: 55676 -> 10.1186/s12859-020-03872-0
Position fetched: 55677 -> 10.1186/s12859-020-03881-z
Position fetched: 55678 -> 10.1186/s12859-020-03883-x
Position fetched: 55679 -> 10.1186/s12859-020-03890-y
Position fetched: 55680 -> 10.1186/s12859-020-03894-8
Position fetched: 55681 -> 10.1186/s12859-020-03897-5
Position fetched: 55682 -> 10.1186/s12859-020-03915-6
Position fetched: 55683 -> 10.1186/s12859-020-03931-6
Position fetched: 55684 -> 10.1186/s12859-020-03935-2
Position fetched: 55685 -> 10.1186/s12859-020-03946-z
Position fetched: 55686 -> 10.1186/s12859-020-03950-3
Position fetched: 55687 -> 10.1186/s12859-020-3527-5
Position fetched: 55688 -> 10

Position fetched: 55796 -> 10.1186/s12871-020-01084-w
Position fetched: 55797 -> 10.1186/s12871-020-01098-4
Position fetched: 55798 -> 10.1186/s12871-020-01108-5
Position fetched: 55799 -> 10.1186/s12871-020-01127-2
Position fetched: 55800 -> 10.1186/s12871-020-01149-w
Position fetched: 55801 -> 10.1186/s12871-020-01162-z
Position fetched: 55802 -> 10.1186/s12871-020-01193-6
Position fetched: 55803 -> 10.1186/s12871-020-01197-2
Position fetched: 55804 -> 10.1186/s12871-020-01202-8
Position fetched: 55805 -> 10.1186/s12871-020-01207-3
Position fetched: 55806 -> 10.1186/s12871-020-01209-1
Position fetched: 55807 -> 10.1186/s12871-020-01227-z
Position fetched: 55808 -> 10.1186/s12871-020-0933-1
Position fetched: 55809 -> 10.1186/s12871-020-0936-y
Position fetched: 55810 -> 10.1186/s12871-020-0942-0
Position fetched: 55811 -> 10.1186/s12871-020-0944-y
Position fetched: 55812 -> 10.1186/s12871-021-01233-9
Position fetched: 55813 -> 10.1186/s12871-021-01236-6
Position fetched: 55814 -> 10.11

Position fetched: 55921 -> 10.1186/s12877-021-02013-3
Position fetched: 55922 -> 10.1186/s12878-016-0069-1
Position fetched: 55923 -> 10.1186/s12879-014-0583-3
Position fetched: 55924 -> 10.1186/s12879-014-0617-x
Position fetched: 55925 -> 10.1186/s12879-014-0635-8
Position fetched: 55926 -> 10.1186/s12879-014-0690-1
Position fetched: 55927 -> 10.1186/s12879-014-0691-0
Position fetched: 55928 -> 10.1186/s12879-014-0698-6
                                             affiliation  \
54529  [{'affiliation-city': 'Melbourne', 'affilname'...   
54530  [{'affiliation-city': 'Berlin', 'affilname': '...   
54531  {'affiliation-city': 'Dundee', 'affilname': 'N...   
54532  {'affiliation-city': 'Chicago', 'affilname': '...   
54533  [{'affiliation-city': 'Irvine', 'affilname': '...   
...                                                  ...   
55924  [{'affiliation-city': 'Milan', 'affilname': 'I...   
55925  [{'affiliation-city': 'Pune', 'affilname': 'KE...   
55926  [{'affiliation-city': 'Canbe

batch saved
Position fetched: 56029 -> 10.1186/s12879-018-3662-z
Position fetched: 56030 -> 10.1186/s12879-018-3668-6
Position fetched: 56031 -> 10.1186/s12879-019-3671-6
Position fetched: 56032 -> 10.1186/s12879-019-3729-5
Position fetched: 56033 -> 10.1186/s12879-019-3735-7
Position fetched: 56034 -> 10.1186/s12879-019-3898-2
Position fetched: 56035 -> 10.1186/s12879-019-3981-8
Position fetched: 56036 -> 10.1186/s12879-019-3987-2
Position fetched: 56037 -> 10.1186/s12879-019-4096-y
Position fetched: 56038 -> 10.1186/s12879-019-4109-x
Position fetched: 56039 -> 10.1186/s12879-019-4150-9
Position fetched: 56040 -> 10.1186/s12879-019-4181-2
Position fetched: 56041 -> 10.1186/s12879-019-4266-y
Position fetched: 56042 -> 10.1186/s12879-019-4277-8
Position fetched: 56043 -> 10.1186/s12879-019-4380-x
Position fetched: 56044 -> 10.1186/s12879-019-4385-5
Position fetched: 56045 -> 10.1186/s12879-019-4400-x
Position fetched: 56046 -> 10.1186/s12879-019-4426-0
Position fetched: 56047 -> 10.1186

Position fetched: 56154 -> 10.1186/s12879-020-05612-4
Position fetched: 56155 -> 10.1186/s12879-020-05614-2
Position fetched: 56156 -> 10.1186/s12879-020-05619-x
Position fetched: 56157 -> 10.1186/s12879-020-05620-4
Position fetched: 56158 -> 10.1186/s12879-020-05622-2
Position fetched: 56159 -> 10.1186/s12879-020-05637-9
Position fetched: 56160 -> 10.1186/s12879-020-05643-x
Position fetched: 56161 -> 10.1186/s12879-020-05645-9
Position fetched: 56162 -> 10.1186/s12879-020-05647-7
Position fetched: 56163 -> 10.1186/s12879-020-05654-8
Position fetched: 56164 -> 10.1186/s12879-020-05662-8
Position fetched: 56165 -> 10.1186/s12879-020-05665-5
Position fetched: 56166 -> 10.1186/s12879-020-05666-4
Position fetched: 56167 -> 10.1186/s12879-020-05670-8
Position fetched: 56168 -> 10.1186/s12879-020-05671-7
Position fetched: 56169 -> 10.1186/s12879-020-05674-4
Position fetched: 56170 -> 10.1186/s12879-020-05677-1
Position fetched: 56171 -> 10.1186/s12879-020-05678-0
Position fetched: 56172 -> 1

Position fetched: 56280 -> 10.1186/s12884-020-03481-y
Position fetched: 56281 -> 10.1186/s12884-020-03502-w
Position fetched: 56282 -> 10.1186/s12884-020-03510-w
Position fetched: 56283 -> 10.1186/s12884-020-03524-4
Position fetched: 56284 -> 10.1186/s12884-020-03532-4
Position fetched: 56285 -> 10.1186/s12884-021-03557-3
Position fetched: 56286 -> 10.1186/s12884-021-03561-7
Position fetched: 56287 -> 10.1186/s12884-021-03572-4
Position fetched: 56288 -> 10.1186/s12885-016-2197-1
Position fetched: 56289 -> 10.1186/s12885-016-2506-8
Position fetched: 56290 -> 10.1186/s12885-018-4292-y
Position fetched: 56291 -> 10.1186/s12885-020-06948-5
Position fetched: 56292 -> 10.1186/s12885-020-07215-3
Position fetched: 56293 -> 10.1186/s12885-020-07264-8
Position fetched: 56294 -> 10.1186/s12885-020-07270-w
Position fetched: 56295 -> 10.1186/s12885-020-07416-w
Position fetched: 56296 -> 10.1186/s12885-020-07544-3
Position fetched: 56297 -> 10.1186/s12885-020-07558-x
Position fetched: 56298 -> 10.1

Position fetched: 56406 -> 10.1186/s12889-019-6770-9
Position fetched: 56407 -> 10.1186/s12889-019-6772-7
Position fetched: 56408 -> 10.1186/s12889-019-6777-2
Position fetched: 56409 -> 10.1186/s12889-019-6778-1
Position fetched: 56410 -> 10.1186/s12889-019-6969-9
Position fetched: 56411 -> 10.1186/s12889-019-7035-3
Position fetched: 56412 -> 10.1186/s12889-019-7067-8
Position fetched: 56413 -> 10.1186/s12889-019-7317-9
Position fetched: 56414 -> 10.1186/s12889-019-7369-x
Position fetched: 56415 -> 10.1186/s12889-019-7707-z
Position fetched: 56416 -> 10.1186/s12889-019-7899-2
Position fetched: 56417 -> 10.1186/s12889-019-8008-2
Position fetched: 56418 -> 10.1186/s12889-019-8009-1
Position fetched: 56419 -> 10.1186/s12889-019-8077-2
Position fetched: 56420 -> 10.1186/s12889-020-08570-3
Position fetched: 56421 -> 10.1186/s12889-020-08697-3
Position fetched: 56422 -> 10.1186/s12889-020-08726-1
Position fetched: 56423 -> 10.1186/s12889-020-08773-8
Position fetched: 56424 -> 10.1186/s12889-

batch saved
Position fetched: 56529 -> 10.1186/s12889-020-10103-x
Position fetched: 56530 -> 10.1186/s12889-020-10105-9
Position fetched: 56531 -> 10.1186/s12889-020-10109-5
Position fetched: 56532 -> 10.1186/s12889-020-10116-6
Position fetched: 56533 -> 10.1186/s12889-020-10125-5
Position fetched: 56534 -> 10.1186/s12889-020-10126-4
Position fetched: 56535 -> 10.1186/s12889-020-10131-7
Position fetched: 56536 -> 10.1186/s12889-020-10136-2
Position fetched: 56537 -> 10.1186/s12889-020-10145-1
Position fetched: 56538 -> 10.1186/s12889-020-10147-z
Position fetched: 56539 -> 10.1186/s12889-020-10153-1
Position fetched: 56540 -> 10.1186/s12889-020-8239-2
Position fetched: 56541 -> 10.1186/s12889-020-8279-7
Position fetched: 56542 -> 10.1186/s12889-020-8359-8
Position fetched: 56543 -> 10.1186/s12889-020-8388-3
Position fetched: 56544 -> 10.1186/s12889-020-8455-9
Position fetched: 56545 -> 10.1186/s12889-021-10165-5
Position fetched: 56546 -> 10.1186/s12889-021-10166-4
Position fetched: 565

Position fetched: 56655 -> 10.1186/s12904-020-00616-y
Position fetched: 56656 -> 10.1186/s12904-020-00617-x
Position fetched: 56657 -> 10.1186/s12904-020-00636-8
Position fetched: 56658 -> 10.1186/s12904-020-00644-8
Position fetched: 56659 -> 10.1186/s12904-020-00652-8
Position fetched: 56660 -> 10.1186/s12904-020-00689-9
Position fetched: 56661 -> 10.1186/s12904-020-00691-1
Position fetched: 56662 -> 10.1186/s12904-020-00692-0
Position fetched: 56663 -> 10.1186/s12904-020-00695-x
Position fetched: 56664 -> 10.1186/s12904-020-00704-z
Position fetched: 56665 -> 10.1186/s12904-021-00711-8
Position fetched: 56666 -> 10.1186/s12905-019-0875-2
Position fetched: 56667 -> 10.1186/s12905-020-01086-3
Position fetched: 56668 -> 10.1186/s12905-020-01115-1
Position fetched: 56669 -> 10.1186/s12905-020-01151-x
Position fetched: 56670 -> 10.1186/s12905-020-01163-7
Position fetched: 56671 -> 10.1186/s12905-021-01177-9
Position fetched: 56672 -> 10.1186/s12906-015-0792-3
Position fetched: 56673 -> 10.

Position fetched: 56781 -> 10.1186/s12911-020-01266-z
Position fetched: 56782 -> 10.1186/s12911-020-01275-y
Position fetched: 56783 -> 10.1186/s12911-020-01281-0
Position fetched: 56784 -> 10.1186/s12911-020-01316-6
Position fetched: 56785 -> 10.1186/s12911-020-01321-9
Position fetched: 56786 -> 10.1186/s12911-020-01322-8
Position fetched: 56787 -> 10.1186/s12911-020-01336-2
Position fetched: 56788 -> 10.1186/s12911-020-01338-0
Position fetched: 56789 -> 10.1186/s12911-020-01344-2
Position fetched: 56790 -> 10.1186/s12911-020-01353-1
Position fetched: 56791 -> 10.1186/s12911-020-01373-x
Position fetched: 56792 -> 10.1186/s12911-020-01374-w
Position fetched: 56793 -> 10.1186/s12911-020-1108-1
Position fetched: 56794 -> 10.1186/s12911-020-1114-3
Position fetched: 56795 -> 10.1186/s12911-021-01394-0
Position fetched: 56796 -> 10.1186/s12912-015-0065-y
Position fetched: 56797 -> 10.1186/s12912-020-00457-3
Position fetched: 56798 -> 10.1186/s12912-020-00481-3
Position fetched: 56799 -> 10.1

Position fetched: 56907 -> 10.1186/s12916-020-01663-1
Position fetched: 56908 -> 10.1186/s12916-020-01670-2
Position fetched: 56909 -> 10.1186/s12916-020-01673-z
Position fetched: 56910 -> 10.1186/s12916-020-01682-y
Position fetched: 56911 -> 10.1186/s12916-020-01685-9
Position fetched: 56912 -> 10.1186/s12916-020-01687-7
Position fetched: 56913 -> 10.1186/s12916-020-01691-x
Position fetched: 56914 -> 10.1186/s12916-020-01692-w
Position fetched: 56915 -> 10.1186/s12916-020-01696-6
Position fetched: 56916 -> 10.1186/s12916-020-01705-8
Position fetched: 56917 -> 10.1186/s12916-020-01712-9
Position fetched: 56918 -> 10.1186/s12916-020-01719-2
Position fetched: 56919 -> 10.1186/s12916-020-01726-3
Position fetched: 56920 -> 10.1186/s12916-020-01731-6
Position fetched: 56921 -> 10.1186/s12916-020-01732-5
Position fetched: 56922 -> 10.1186/s12916-020-01735-2
Position fetched: 56923 -> 10.1186/s12916-020-01739-y
Position fetched: 56924 -> 10.1186/s12916-020-01747-y
Position fetched: 56925 -> 1

batch saved
Position fetched: 57029 -> 10.1186/s12917-018-1640-8
Position fetched: 57030 -> 10.1186/s12917-018-1720-9
Position fetched: 57031 -> 10.1186/s12917-018-1736-1
Position fetched: 57032 -> 10.1186/s12917-018-1756-x
Position fetched: 57033 -> 10.1186/s12917-019-1773-4
Position fetched: 57034 -> 10.1186/s12917-019-1774-3
Position fetched: 57035 -> 10.1186/s12917-019-1796-x
Position fetched: 57036 -> 10.1186/s12917-019-1802-3
Position fetched: 57037 -> 10.1186/s12917-019-1848-2
Position fetched: 57038 -> 10.1186/s12917-019-1851-7
Position fetched: 57039 -> 10.1186/s12917-019-1862-4
Position fetched: 57040 -> 10.1186/s12917-019-1877-x
Position fetched: 57041 -> 10.1186/s12917-019-1887-8
Position fetched: 57042 -> 10.1186/s12917-019-1898-5
Position fetched: 57043 -> 10.1186/s12917-019-1909-6
Position fetched: 57044 -> 10.1186/s12917-019-1911-z
Position fetched: 57045 -> 10.1186/s12917-019-1925-6
Position fetched: 57046 -> 10.1186/s12917-019-1927-4
Position fetched: 57047 -> 10.1186
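
The fetching loop above resumes at a stored position and saves a checkpoint after every 100 fetched DOIs. The pattern can be sketched independently of Scopus as follows. All names are hypothetical, the fetch and save functions are stand-ins, and unlike the loop above this sketch also saves the final, incomplete batch.

```python
def fetch_in_batches(items, start, fetch, save, batch_size=100):
    """Fetch items[start:] one by one and call save() after every full batch."""
    results = {}
    pending = 0
    for position in range(start, len(items)):
        results[position] = fetch(items[position])
        pending += 1
        if pending >= batch_size:
            # Checkpoint: hand over a copy of everything fetched so far
            save(dict(results))
            pending = 0
    if pending:
        # Save the remainder that did not fill a whole batch
        save(dict(results))
    return results


# Stand-ins for the API call and the pickle storage
items = ["doi-" + str(n) for n in range(7)]
saved_snapshots = []
results = fetch_in_batches(
    items,
    start=2,
    fetch=lambda doi: {"doi": doi},
    save=saved_snapshots.append,
    batch_size=2,
)
print([len(snapshot) for snapshot in saved_snapshots])
```

Each snapshot contains all results fetched so far, which mirrors how the loop above passes the growing dictionary to the append function on every save.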

The following cell is useful when the process above is interrupted: it narrows the dictionary of fetched information down to the entries that were fetched successfully.

In [None]:
# def save_new_extra_info(len_df_current_extra_info, upto):
#     """
#     This function is used to separate successful API calls from API calls which were prevented due to an invalid API key.
#     As a result, this function returns a range of valid entries up to the given parameter. 
#     """
#     dict_new_extra_info_saver = dict()
#     i = len_df_current_extra_info
#     while i < upto:
#         #print("Position: " + str(i) + " -> " +  doi_counted.index[i])
#         dict_new_extra_info_saver[i] = dict_new_extra_info[i]
#         i = i + 1 
#     return dict_new_extra_info_saver

The existing and newly fetched information are combined into one DataFrame. 

In [None]:
# df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
df_combined_extra_info

In [None]:
# too big for GitHub
#df_combined_extra_info.to_csv('extra_info_CS5099.csv', sep='\t')

Each of the two DataFrame columns is stored as a separate Series object. Each Series is written to its own pkl file, which keeps every file below 100 MB and thus allows uploading to GitHub.

In [None]:
# stored_series = store_df_columns(df_combined_extra_info)
# stored_series[0]

In [None]:
# stored_series[1]

Verify that the returned None values are due to nonexistent data and not to an invalid API key.

In [None]:
# len_data = len(stored_series[0])
# len_data 

In [None]:
# ser_doi = pd.Series(doi_counted.index[:len_data])
# ser_doi

In [None]:
# df_current_extra_info_checker = df_combined_extra_info
# df_current_extra_info_checker['doi'] = ser_doi

In [None]:
# %%time
# len_df_current_extra_info_checker = len(df_current_extra_info_checker)
# dict_new_extra_info_checker = dict()
# i = 0 
# while i < len_df_current_extra_info_checker: ###################################################### 
#     if df_current_extra_info_checker['affiliation'][i] == None:
#         dict_new_extra_info_checker[i] = fetch_scopus_api(client, ser_doi[i])
#         print("Position fetched again: " + str(i) + " -> " +  ser_doi[i])
#     i = i + 1    
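
The re-fetching step in the commented cell above can be sketched on plain lists. This is an illustration under assumptions: the function names are hypothetical, and the fetch function is a stand-in that succeeds for one made-up DOI only.

```python
def positions_to_retry(affiliations):
    """Return the positions whose affiliation entry is still None."""
    return [i for i, value in enumerate(affiliations) if value is None]


def refetch_missing(affiliations, dois, fetch):
    """Call fetch again for every position that has no data yet."""
    return {i: fetch(dois[i]) for i in positions_to_retry(affiliations)}


# Made-up sample data: positions 1 and 3 have no data yet
affiliations = [{"affilname": "X"}, None, {"affilname": "Y"}, None]
dois = ["doi-a", "doi-b", "doi-c", "doi-d"]


def fake_fetch(doi):
    # Stand-in for the API call: only "doi-d" returns data on the second attempt
    return {"affiliation": "Z"} if doi == "doi-d" else None


retried = refetch_missing(affiliations, dois, fake_fetch)
print(retried)
```

If every value in the retried dictionary is still None, the missing entries are most likely absent from Scopus rather than caused by an invalid API key.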

In [None]:
# dict_new_extra_info_checker
# -> check if at least one value is not None -> otherwise the process is finished here

In [None]:
# len(dict_new_extra_info_checker)

In [None]:
# df_combined_extra_info_fetched_again  = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info_checker)
# df_combined_extra_info_fetched_again

In [None]:
# store_df_columns(df_combined_extra_info_fetched_again)

In [None]:
# df_combined_extra_info_fetched_again['check_doi'] = ser_doi
# df_combined_extra_info_fetched_again.head(30)