# CORD-19-collect-scopus-data

This Jupyter notebook collects additional data from Scopus to extend the CORD-19 dataset: 
https://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncs0

First, the required packages are imported into the notebook.

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz
import datetime
import re
from urllib.parse import urlparse
from collections import Counter

import time # for sleep
from pybtex.database import parse_file, BibliographyData, Entry
import json
from elsapy.elsclient import ElsClient
from elsapy.elsdoc import FullDoc, AbsDoc
from elsapy.elssearch import ElsSearch

Get the data and save it to a variable.

In [2]:
CORD19_CSV = pd.read_csv('../data/cord-19/CORD19_software_mentions.csv')

Check the length of the column containing the DOIs.

In [3]:
len(CORD19_CSV['doi'])

77448

Display the doi column to check for inconsistencies such as NaN values.

In [4]:
doi = CORD19_CSV['doi']
doi

0                                 NaN
1          10.1016/j.regg.2021.01.002
2           10.1016/j.rec.2020.08.002
3        10.1016/j.vetmic.2006.11.026
4                   10.3390/v12080849
                     ...             
77443      10.1007/s11229-020-02869-9
77444                             NaN
77445     10.1101/2020.05.13.20100206
77446      10.1007/s42991-020-00052-8
77447     10.1101/2020.09.14.20194670
Name: doi, Length: 77448, dtype: object

Create a Series of the unique DOIs; value_counts() drops NaN values by default. Sorting the unique values is important: otherwise, the order can differ after each restart of the notebook, and the positional index used later to resume fetching would point to different DOIs.

In [5]:
doi_counted = doi.value_counts().sort_index(ascending=True)
doi_counted

10.1001/jamainternmed.2020.1369       1
10.1001/jamanetworkopen.2020.16382    1
10.1001/jamanetworkopen.2020.17521    1
10.1001/jamanetworkopen.2020.20485    1
10.1001/jamanetworkopen.2020.24984    1
                                     ..
10.9745/ghsp-d-20-00115               1
10.9745/ghsp-d-20-00171               1
10.9745/ghsp-d-20-00218               1
10.9758/cpn.2020.18.4.607             1
10.9781/ijimai.2020.02.002            1
Name: doi, Length: 74302, dtype: int64
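
The effect of counting, dropping missing values and sorting can be illustrated without pandas. The following is only a minimal sketch using the standard library; `unique_sorted_dois` is a hypothetical helper that is not part of the notebook, and the sample DOIs are taken from the output above.

```python
import math
from collections import Counter

def unique_sorted_dois(dois):
    """Count DOIs, skip missing values, and return them in a stable order."""
    counts = Counter(
        d for d in dois
        if d is not None and not (isinstance(d, float) and math.isnan(d))
    )
    # Sorting makes position i refer to the same DOI on every run.
    return sorted(counts.items())

sample = [
    float("nan"),
    "10.3390/v12080849",
    "10.1016/j.rec.2020.08.002",
    "10.3390/v12080849",
    None,
]
result = unique_sorted_dois(sample)
print(result)
```

Because the result is sorted, a position stored in one session can be reused safely as a starting index in the next one.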

The following function retrieves the requested information from the Scopus API. (https://api.elsevier.com/content/search/scopus?query=DOI(10.1109/MCOM.2016.7509373)&apiKey=6d485ef1fe1408712f37e8a783a285a4)

In [6]:
#Adapted from https://github.com/ElsevierDev/elsapy/blob/master/exampleProg.py
def fetch_scopus_api(client, doi):
    """Obtain additional paper information from Scopus by DOI.

    Returns the data of the abstract document, or None if no Scopus
    record could be retrieved for the DOI.
    """
    doc_srch = ElsSearch("DOI(" + doi + ")", 'scopus')
    doc_srch.execute(client, get_all=True)
    try:
        # the Scopus ID is the part after the colon in "dc:identifier"
        scopus_id = doc_srch.results[0]["dc:identifier"].split(":")[1]
        scp_doc = AbsDoc(scp_id=scopus_id)
        if scp_doc.read(client):
            scp_doc.write()
        else:
            print("Read document failed.")
        return scp_doc.data
    except Exception:
        # no usable search result for this DOI
        return None
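
To make the request behind `ElsSearch` more tangible, the search URL quoted above can be assembled by hand. This is only a sketch: the endpoint and the `query` and `apiKey` parameter names are copied from that URL, `build_scopus_search_url` is a hypothetical helper, and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the example URL above
SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"

def build_scopus_search_url(doi, api_key):
    """Build the Scopus search URL for a DOI query (no request is sent)."""
    params = {"query": f"DOI({doi})", "apiKey": api_key}
    # urlencode percent-encodes the parentheses and the slash in the DOI
    return f"{SCOPUS_SEARCH_URL}?{urlencode(params)}"

url = build_scopus_search_url("10.1016/j.dsx.2020.04.012", "YOUR_API_KEY")
print(url)
```

In the notebook itself, elsapy builds and sends this request, so the helper is only useful for inspecting a single DOI in a browser or for debugging.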

The configuration file is set up beforehand and contains an API key. Further information: https://github.com/ElsevierDev/elsapy/blob/master/CONFIG.md

In [7]:
with open("config.json") as con_file:
    config = json.load(con_file)
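
A missing or empty key only shows up later as failed API calls, so it can help to validate the configuration right after loading it. The sketch below assumes the file holds a JSON object with an `apikey` entry, as used in this notebook; `load_api_key` is a hypothetical helper and the key in the demonstration is a dummy value.

```python
import json
import os
import tempfile

def load_api_key(path):
    """Read the config file and return its 'apikey' entry."""
    with open(path, encoding="utf-8") as con_file:
        config = json.load(con_file)
    api_key = config.get("apikey", "")
    if not api_key:
        raise ValueError(f"No 'apikey' entry found in {path}")
    return api_key

# Demonstration with a throwaway file; the key below is a dummy value.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"apikey": "dummy-key"}, f)
    demo_key = load_api_key(demo_path)

print(demo_key)
```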

Next, the client is initialized with the API key.

In [8]:
client = ElsClient(config['apikey'])

For demonstration purposes, the following cell shows which data is returned by the Scopus API. 

In [9]:
return_example = fetch_scopus_api(client, '10.1016/j.dsx.2020.04.012')
print(json.dumps(return_example, indent=2))

{
  "affiliation": [
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Hamdard",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Jamia Millia Islamia",
      "affiliation-country": "India"
    },
    {
      "affiliation-city": "New Delhi",
      "affilname": "Indraprastha Apollo Hospitals",
      "affiliation-country": "India"
    }
  ],
  "coredata": {
    "srctype": "j",
    "eid": "2-s2.0-85083171050",
    "pubmed-id": "32305024",
    "prism:coverDate": "2020-07-01",
    "prism:aggregationType": "Journal",
    "prism:url": "https://api.elsevier.com/content/abstract/scopus_id/85083171050",
    "dc:creator": {
      "author": [
        {
          "ce:given-name": "Raju",
          "preferred-name": {
            "ce:given-name": "Raju",
            "ce:initials": "R.",
            "ce:surname": "Vaishya",
            "ce:indexed-name": "Vaishya R."
          },
          "@seq": "1",
          "ce:init

Based on the returned data, further analysis can be conducted. For this purpose, two notebooks were created to analyse the data linked to: 
<ul>
  <li>affiliation</li>
  <li>coredata</li>
</ul>    
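
As an illustration of how these two parts of the response can be read, the following sketch pulls a few fields out of a dictionary shaped like the output above. `summarize_record` is a hypothetical helper; the handling of a single affiliation delivered as a dict instead of a list is a defensive assumption, not something the output above confirms.

```python
def summarize_record(record):
    """Extract a few fields from a dict shaped like the Scopus output above."""
    if record is None:
        return None
    affiliations = record.get("affiliation") or []
    # Assumption: a single affiliation may arrive as a dict instead of a list.
    if isinstance(affiliations, dict):
        affiliations = [affiliations]
    coredata = record.get("coredata") or {}
    countries = {
        a.get("affiliation-country")
        for a in affiliations
        if a.get("affiliation-country")
    }
    return {
        "institutions": [a.get("affilname") for a in affiliations],
        "countries": sorted(countries),
        "cover_date": coredata.get("prism:coverDate"),
        "pubmed_id": coredata.get("pubmed-id"),
    }

# Shortened copy of the example response shown above
sample_record = {
    "affiliation": [
        {"affiliation-city": "New Delhi", "affilname": "Jamia Hamdard",
         "affiliation-country": "India"},
        {"affiliation-city": "New Delhi", "affilname": "Jamia Millia Islamia",
         "affiliation-country": "India"},
    ],
    "coredata": {"prism:coverDate": "2020-07-01", "pubmed-id": "32305024"},
}
summary = summarize_record(sample_record)
print(summary)
```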

The coredata and affiliation data fetched so far are read and combined into a DataFrame for further processing.

In [10]:
df_current_extra_info = pd.DataFrame()
try:
    read_affiliation = pd.read_pickle('extra_info_affiliation_CS.pkl')
    read_coredata = pd.read_pickle('extra_info_coredata_CS.pkl')
    df_current_extra_info['affiliation'] = read_affiliation
    df_current_extra_info['coredata'] = read_coredata
except FileNotFoundError:
    #no data has been fetched yet, so the DataFrame stays empty
    print("The DataFrame is empty")

The length of the DataFrame holding the current information is assigned to a variable. 
It serves as the starting index of the while loop further below, so that fetching resumes where the previous run stopped. 

In [11]:
len_df_current_extra_info = len(df_current_extra_info)
len_df_current_extra_info

43329
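
The idea of resuming from a starting index and saving in batches can be sketched independently of Scopus. In the example below, `fetch_in_batches`, the `fetch` and `save` callables and the DOIs are all made up for illustration; they stand in for the API call and the pickle files used in this notebook. Unlike the fetching cell in this notebook, the sketch also saves a final, partial batch.

```python
def fetch_in_batches(dois, start, fetch, save, batch_size=100):
    """Fetch dois[start:] and call save(results) after every full batch."""
    results = {}
    for i in range(start, len(dois)):
        results[i] = fetch(dois[i])
        if (i - start + 1) % batch_size == 0:
            save(dict(results))
    # Save the final, possibly partial, batch as well.
    if results and (len(dois) - start) % batch_size != 0:
        save(dict(results))
    return results

# Stand-ins for the real API call and the pickle writer.
saved_sizes = []
demo_dois = [f"10.0000/demo.{n}" for n in range(7)]
fetched = fetch_in_batches(
    demo_dois,
    start=2,  # two DOIs were already processed in an earlier session
    fetch=lambda doi: {"doi": doi},
    save=lambda snapshot: saved_sizes.append(len(snapshot)),
    batch_size=2,
)
print(sorted(fetched), saved_sizes)
```

Each save receives everything fetched so far in the current session, so an interruption loses at most one batch of work.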

In [12]:
df_current_extra_info

Unnamed: 0,affiliation,coredata
0,"[{'affiliation-city': 'Palo Alto', 'affilname'...","{'srctype': 'j', 'eid': '2-s2.0-85083266658', ..."
1,"[{'affiliation-city': 'Seattle', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '7',..."
2,"[{'affiliation-city': 'Cambridge', 'affilname'...","{'srctype': 'j', 'prism:issueIdentifier': '8',..."
3,"[{'affiliation-city': 'Madison', 'affilname': ...","{'srctype': 'j', 'prism:issueIdentifier': '9',..."
4,"[{'affiliation-city': 'Los Angeles', 'affilnam...","{'srctype': 'j', 'prism:issueIdentifier': '11'..."
...,...,...
43324,,
43325,,
43326,,
43327,,


In [13]:
def contains_only_None(dic):
    """
    This function inspects a dictionary and returns True if it solely contains None values
    """
    return all(value is None for value in dic.values())

In [14]:
def append_fetched_data_to_df(df_current_extra_info, dic):
    """
    This function appends or inserts newly fetched data to the DataFrame containing Scopus data.
    """
    #df_current_extra_info -> holds the latest data, new data needs to be appended to it
    #df_newly_fetched_transposed -> holds newly fetched data, needs to be inserted or appended
    
    if contains_only_None(dic):
        placeholder_entries = pd.DataFrame(np.empty((len(dic),2),dtype=object),columns=['affiliation','coredata'], index=dic.keys())
        df_newly_fetched_transposed = placeholder_entries
        print(placeholder_entries)
    else:
        #Prior to appending, the dictionary is transformed to a DataFrame
        df_newly_fetched = pd.DataFrame(dic)
        #The DataFrame is transposed so that each fetched entry becomes a row
        df_newly_fetched_transposed = df_newly_fetched.T
    
    #Insert newly fetched rows which were previously not successfully appended
    for index, row in df_newly_fetched_transposed.iterrows():
        #replace the row in the current extra info DataFrame because it already exists
        if index in df_current_extra_info.index and row.affiliation is not None:
            df_current_extra_info.loc[index] = row
        #append to current extra info DataFrame because the row is new
        if index not in df_current_extra_info.index:
            #DataFrame.append was removed in pandas 2.0, pd.concat replaces it
            df_current_extra_info = pd.concat([df_current_extra_info, row.to_frame().T], ignore_index=True)
            
    #returning DataFrame with inserted and replaced rows.
    return df_current_extra_info
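
The merge rule used in this function can be stated compactly: a new position is always added, while an existing position is only overwritten when the new record is not None. The following sketch expresses that rule with plain dictionaries; `merge_fetched` and the sample values are hypothetical and simplify the DataFrame handling above.

```python
def merge_fetched(current, fetched):
    """Merge newly fetched records into the current ones, keyed by position."""
    merged = dict(current)
    for position, record in fetched.items():
        if position not in merged:
            # new position: always add, even if the record is None
            merged[position] = record
        elif record is not None:
            # existing position: replace only with real data
            merged[position] = record
    return merged

current_records = {0: {"eid": "a"}, 1: None}
new_records = {1: {"eid": "b"}, 0: None, 2: None}
merged = merge_fetched(current_records, new_records)
print(merged)
```

Keeping None for positions without data preserves the alignment between positions and the sorted DOI list.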

Each of the two DataFrame columns is stored as a separate Series object. Each Series is written to its own pkl file, which keeps every file below the 100 MB size limit for GitHub uploads.

In [15]:
def store_df_columns(df):
    ser_affiliation = df['affiliation']
    ser_coredata = df['coredata']
    ser_affiliation.to_pickle('extra_info_affiliation_CS.pkl')
    ser_coredata.to_pickle('extra_info_coredata_CS.pkl')
    return ser_affiliation, ser_coredata
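
Whether a stored file really stays below the upload limit can be checked right after writing it. The sketch below uses the standard `pickle` module instead of pandas to stay self-contained; `pickle_and_check` is a hypothetical helper, and interpreting the 100 MB limit as 100 * 1024 * 1024 bytes is an assumption.

```python
import os
import pickle
import tempfile

# Assumption: the 100 MB limit is interpreted as 100 * 1024 * 1024 bytes
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024

def pickle_and_check(obj, path, limit=GITHUB_FILE_LIMIT_BYTES):
    """Pickle obj to path and report whether the file stays under the limit."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    size = os.path.getsize(path)
    return size, size < limit

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_column.pkl")
    size, ok = pickle_and_check([{"affilname": "Jamia Hamdard"}, None], demo_path)

print(size, ok)
```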

In [16]:
# placeholder_entries = pd.DataFrame(np.empty((4,2),dtype=object),columns=['affiliation','coredata'])

In [17]:
# placeholder_entries

Subsequently, the fetched Scopus data is stored in a dictionary. In addition, the print function shows the progress of the process by displaying the most recently fetched entry. 

In [None]:
%%time
dict_new_extra_info = dict()
len_dois = len(doi_counted)
def trigger_fetching():
    threshold = 0
    i = len_df_current_extra_info #resume after the entries fetched so far
    while i < len_dois: #normally len_dois, the bound can be lowered for partial runs
        dict_new_extra_info[i] = fetch_scopus_api(client, doi_counted.index[i])
        print("Position fetched: " + str(i) + " -> " +  doi_counted.index[i])
        i = i + 1
        threshold = threshold + 1
        #save a checkpoint after every 100 fetched entries
        if threshold > 99:
            df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
            stored_series = store_df_columns(df_combined_extra_info)
            threshold = 0
            print("batch saved")
trigger_fetching()

Position fetched: 43329 -> 10.1101/2020.05.31.126524
Position fetched: 43330 -> 10.1101/2020.05.31.126615
Position fetched: 43331 -> 10.1101/2020.05.31.126813
Position fetched: 43332 -> 10.1101/2020.05.31.20112979
Position fetched: 43333 -> 10.1101/2020.05.31.20114520
Position fetched: 43334 -> 10.1101/2020.05.31.20114991
Position fetched: 43335 -> 10.1101/2020.05.31.20117168
Position fetched: 43336 -> 10.1101/2020.05.31.20118059
Position fetched: 43337 -> 10.1101/2020.05.31.20118182
Position fetched: 43338 -> 10.1101/2020.05.31.20118273
Position fetched: 43339 -> 10.1101/2020.05.31.20118315
Position fetched: 43340 -> 10.1101/2020.05.31.20118380
Position fetched: 43341 -> 10.1101/2020.05.31.20118448
Position fetched: 43342 -> 10.1101/2020.05.31.20118554
Position fetched: 43343 -> 10.1101/2020.05.31.20118653
Position fetched: 43344 -> 10.1101/2020.05.31.20118679
Position fetched: 43345 -> 10.1101/2020.05.31.20118802
Position fetched: 43346 -> 10.1101/2020.06.01.126821
Position fetched: 

Position fetched: 43473 -> 10.1101/2020.06.04.20122457
Position fetched: 43474 -> 10.1101/2020.06.04.20122473
Position fetched: 43475 -> 10.1101/2020.06.04.20122481
Position fetched: 43476 -> 10.1101/2020.06.04.20122507
Position fetched: 43477 -> 10.1101/2020.06.04.20122564
Position fetched: 43478 -> 10.1101/2020.06.04.20122747
Position fetched: 43479 -> 10.1101/2020.06.04.20122754
Position fetched: 43480 -> 10.1101/2020.06.04.20122812
Position fetched: 43481 -> 10.1101/2020.06.04.20122838
Position fetched: 43482 -> 10.1101/2020.06.04.20122879
Position fetched: 43483 -> 10.1101/2020.06.05.131748
Position fetched: 43484 -> 10.1101/2020.06.05.134114
Position fetched: 43485 -> 10.1101/2020.06.05.134551
Position fetched: 43486 -> 10.1101/2020.06.05.135194
Position fetched: 43487 -> 10.1101/2020.06.05.135699
Position fetched: 43488 -> 10.1101/2020.06.05.135749
Position fetched: 43489 -> 10.1101/2020.06.05.135806
Position fetched: 43490 -> 10.1101/2020.06.05.135921
Position fetched: 43491 ->

Position fetched: 43618 -> 10.1101/2020.06.09.20127092
Position fetched: 43619 -> 10.1101/2020.06.09.20127118
Position fetched: 43620 -> 10.1101/2020.06.09.20127142
Position fetched: 43621 -> 10.1101/2020.06.10.135533
Position fetched: 43622 -> 10.1101/2020.06.10.135632
Position fetched: 43623 -> 10.1101/2020.06.10.141325
Position fetched: 43624 -> 10.1101/2020.06.10.142281
Position fetched: 43625 -> 10.1101/2020.06.10.143545
Position fetched: 43626 -> 10.1101/2020.06.10.143990
Position fetched: 43627 -> 10.1101/2020.06.10.144196
Position fetched: 43628 -> 10.1101/2020.06.10.144212
      affiliation coredata
43329        None     None
43330        None     None
43331        None     None
43332        None     None
43333        None     None
...           ...      ...
43624        None     None
43625        None     None
43626        None     None
43627        None     None
43628        None     None

[300 rows x 2 columns]
batch saved
Position fetched: 43629 -> 10.1101/2020.06.10.14478

Position fetched: 43756 -> 10.1101/2020.06.14.20131128
Position fetched: 43757 -> 10.1101/2020.06.14.20131177
Position fetched: 43758 -> 10.1101/2020.06.14.20131268
Position fetched: 43759 -> 10.1101/2020.06.15.134403
Position fetched: 43760 -> 10.1101/2020.06.15.145656
Position fetched: 43761 -> 10.1101/2020.06.15.147470
Position fetched: 43762 -> 10.1101/2020.06.15.150912
Position fetched: 43763 -> 10.1101/2020.06.15.151647
Position fetched: 43764 -> 10.1101/2020.06.15.151738
Position fetched: 43765 -> 10.1101/2020.06.15.151761
Position fetched: 43766 -> 10.1101/2020.06.15.151845
Position fetched: 43767 -> 10.1101/2020.06.15.152587
Position fetched: 43768 -> 10.1101/2020.06.15.152835
Position fetched: 43769 -> 10.1101/2020.06.15.152892
Position fetched: 43770 -> 10.1101/2020.06.15.153064
Position fetched: 43771 -> 10.1101/2020.06.15.153239
Position fetched: 43772 -> 10.1101/2020.06.15.153478
Position fetched: 43773 -> 10.1101/2020.06.15.153643
Position fetched: 43774 -> 10.1101/2020.

Position fetched: 43901 -> 10.1101/2020.06.18.20135012
Position fetched: 43902 -> 10.1101/2020.06.18.20135046
Position fetched: 43903 -> 10.1101/2020.06.18.20135103
Position fetched: 43904 -> 10.1101/2020.06.18.20135111
Position fetched: 43905 -> 10.1101/2020.06.18.20135145
Position fetched: 43906 -> 10.1101/2020.06.18.20135152
Position fetched: 43907 -> 10.1101/2020.06.18.20135210
Position fetched: 43908 -> 10.1101/2020.06.19.159970
Position fetched: 43909 -> 10.1101/2020.06.19.160606
Position fetched: 43910 -> 10.1101/2020.06.19.160630
Position fetched: 43911 -> 10.1101/2020.06.19.160747
Position fetched: 43912 -> 10.1101/2020.06.19.161000
Position fetched: 43913 -> 10.1101/2020.06.19.161042
Position fetched: 43914 -> 10.1101/2020.06.19.161802
Position fetched: 43915 -> 10.1101/2020.06.19.162248
Position fetched: 43916 -> 10.1101/2020.06.19.20128207
Position fetched: 43917 -> 10.1101/2020.06.19.20132969
Position fetched: 43918 -> 10.1101/2020.06.19.20133991
Position fetched: 43919 ->

Position fetched: 44039 -> 10.1101/2020.06.24.20138867
Position fetched: 44040 -> 10.1101/2020.06.24.20138891
Position fetched: 44041 -> 10.1101/2020.06.24.20138925
Position fetched: 44042 -> 10.1101/2020.06.24.20138941
Position fetched: 44043 -> 10.1101/2020.06.24.20138982
Position fetched: 44044 -> 10.1101/2020.06.24.20139006
Position fetched: 44045 -> 10.1101/2020.06.24.20139048
Position fetched: 44046 -> 10.1101/2020.06.24.20139121
Position fetched: 44047 -> 10.1101/2020.06.24.20139204
Position fetched: 44048 -> 10.1101/2020.06.24.20139212
Position fetched: 44049 -> 10.1101/2020.06.24.20139238
Position fetched: 44050 -> 10.1101/2020.06.24.20139295
Position fetched: 44051 -> 10.1101/2020.06.24.20139303
Position fetched: 44052 -> 10.1101/2020.06.24.20139410
Position fetched: 44053 -> 10.1101/2020.06.24.20139436
Position fetched: 44054 -> 10.1101/2020.06.24.20139444
Position fetched: 44055 -> 10.1101/2020.06.24.20139451
Position fetched: 44056 -> 10.1101/2020.06.24.20139469
Position f

Position fetched: 44184 -> 10.1101/2020.06.29.20142562
Position fetched: 44185 -> 10.1101/2020.06.29.20142596
Position fetched: 44186 -> 10.1101/2020.06.29.20142638
Position fetched: 44187 -> 10.1101/2020.06.29.20142646
Position fetched: 44188 -> 10.1101/2020.06.29.20142703
Position fetched: 44189 -> 10.1101/2020.06.29.20142836
Position fetched: 44190 -> 10.1101/2020.06.29.20142851
Position fetched: 44191 -> 10.1101/2020.06.29.20143156
Position fetched: 44192 -> 10.1101/2020.06.29.20143180
Position fetched: 44193 -> 10.1101/2020.06.30.166207
Position fetched: 44194 -> 10.1101/2020.06.30.175695
Position fetched: 44195 -> 10.1101/2020.06.30.176537
Position fetched: 44196 -> 10.1101/2020.06.30.177006
Position fetched: 44197 -> 10.1101/2020.06.30.177097
Position fetched: 44198 -> 10.1101/2020.06.30.178897
Position fetched: 44199 -> 10.1101/2020.06.30.179523
Position fetched: 44200 -> 10.1101/2020.06.30.179606
Position fetched: 44201 -> 10.1101/2020.06.30.179663
Position fetched: 44202 -> 1

batch saved
Position fetched: 44329 -> 10.1101/2020.07.06.20140285
Position fetched: 44330 -> 10.1101/2020.07.06.20141333
Position fetched: 44331 -> 10.1101/2020.07.06.20145938
Position fetched: 44332 -> 10.1101/2020.07.06.20146233
Position fetched: 44333 -> 10.1101/2020.07.06.20146712
Position fetched: 44334 -> 10.1101/2020.07.06.20147009
Position fetched: 44335 -> 10.1101/2020.07.06.20147033
Position fetched: 44336 -> 10.1101/2020.07.06.20147066
Position fetched: 44337 -> 10.1101/2020.07.06.20147082
Position fetched: 44338 -> 10.1101/2020.07.06.20147124
Position fetched: 44339 -> 10.1101/2020.07.06.20147140
Position fetched: 44340 -> 10.1101/2020.07.06.20147199
Position fetched: 44341 -> 10.1101/2020.07.06.20147223
Position fetched: 44342 -> 10.1101/2020.07.06.20147256
Position fetched: 44343 -> 10.1101/2020.07.06.20147264
Position fetched: 44344 -> 10.1101/2020.07.06.20147421
Position fetched: 44345 -> 10.1101/2020.07.06.20147512
Position fetched: 44346 -> 10.1101/2020.07.06.2014762

Position fetched: 44473 -> 10.1101/2020.07.12.199687
Position fetched: 44474 -> 10.1101/2020.07.12.20100941
Position fetched: 44475 -> 10.1101/2020.07.12.20150474
Position fetched: 44476 -> 10.1101/2020.07.12.20151068
Position fetched: 44477 -> 10.1101/2020.07.12.20151316
Position fetched: 44478 -> 10.1101/2020.07.12.20151696
Position fetched: 44479 -> 10.1101/2020.07.12.20151753
Position fetched: 44480 -> 10.1101/2020.07.12.20151936
Position fetched: 44481 -> 10.1101/2020.07.12.20152017
Position fetched: 44482 -> 10.1101/2020.07.12.20152074
Position fetched: 44483 -> 10.1101/2020.07.12.20152124
Position fetched: 44484 -> 10.1101/2020.07.12.20152140
Position fetched: 44485 -> 10.1101/2020.07.12.20152157
Position fetched: 44486 -> 10.1101/2020.07.12.20152165
Position fetched: 44487 -> 10.1101/2020.07.13.190140
Position fetched: 44488 -> 10.1101/2020.07.13.199562
Position fetched: 44489 -> 10.1101/2020.07.13.200188
Position fetched: 44490 -> 10.1101/2020.07.13.200386
Position fetched: 44

Position fetched: 44617 -> 10.1101/2020.07.16.206920
Position fetched: 44618 -> 10.1101/2020.07.16.207100
Position fetched: 44619 -> 10.1101/2020.07.16.207308
Position fetched: 44620 -> 10.1101/2020.07.16.207951
Position fetched: 44621 -> 10.1101/2020.07.17.20152389
Position fetched: 44622 -> 10.1101/2020.07.17.20155150
Position fetched: 44623 -> 10.1101/2020.07.17.20155929
Position fetched: 44624 -> 10.1101/2020.07.17.20155960
Position fetched: 44625 -> 10.1101/2020.07.17.20155978
Position fetched: 44626 -> 10.1101/2020.07.17.20155986
Position fetched: 44627 -> 10.1101/2020.07.17.20155994
Position fetched: 44628 -> 10.1101/2020.07.17.20156000
      affiliation coredata
43329        None     None
43330        None     None
43331        None     None
43332        None     None
43333        None     None
...           ...      ...
44624        None     None
44625        None     None
44626        None     None
44627        None     None
44628        None     None

[1300 rows x 2 columns]

The following cell is useful when the process above is interrupted. It narrows the dictionary of fetched information down to the valid entries. 

In [None]:
# def save_new_extra_info(len_df_current_extra_info, upto):
#     """
#     This function is used to separate successful API calls from API calls which were prevented due to an invalid API key.
#     As a result, this function returns a range of valid entries up to the given parameter. 
#     """
#     dict_new_extra_info_saver = dict()
#     i = len_df_current_extra_info
#     while i < upto:
#         #print("Position: " + str(i) + " -> " +  doi_counted.index[i])
#         dict_new_extra_info_saver[i] = dict_new_extra_info[i]
#         i = i + 1 
#     return dict_new_extra_info_saver

The existing and newly fetched information are combined into one DataFrame. 

In [None]:
# df_combined_extra_info = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info)
df_combined_extra_info

In [None]:
#too big for GitHub
#df_combined_extra_info.to_csv('extra_info_CS5099.csv', sep='\t')

Each of the two DataFrame columns is stored as a separate Series object. Each Series is written to its own pkl file, which keeps every file below the 100 MB size limit for GitHub uploads.

In [None]:
# stored_series = store_df_columns(df_combined_extra_info)
# stored_series[0]

In [None]:
# stored_series[1]

Verify that the returned None values are due to non-existent data and not to an invalid API key.

In [None]:
# len_data = len(stored_series[0])
# len_data 

In [None]:
# ser_doi = pd.Series(doi_counted.index[:len_data])
# ser_doi

In [None]:
# df_current_extra_info_checker = df_combined_extra_info
# df_current_extra_info_checker['doi'] = ser_doi

In [None]:
# %%time
# len_df_current_extra_info_checker = len(df_current_extra_info_checker)
# dict_new_extra_info_checker = dict()
# i = 0 
# while i < len_df_current_extra_info_checker:
#     if df_current_extra_info_checker['affiliation'][i] == None:
#         dict_new_extra_info_checker[i] = fetch_scopus_api(client, ser_doi[i])
#         print("Position fetched again: " + str(i) + " -> " +  ser_doi[i])
#     i = i + 1    
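
The re-check performed by this loop can be sketched without a live API. In the example below, `refetch_missing` and `still_all_missing` are hypothetical helpers, the DOIs are made up, and `known.get` stands in for `fetch_scopus_api`. If the second pass returns only None values again, the missing entries are most likely due to non-existent Scopus records rather than an invalid key.

```python
def refetch_missing(records, dois, fetch):
    """Call fetch again for every position whose stored record is None."""
    refetched = {}
    for position, record in enumerate(records):
        if record is None:
            refetched[position] = fetch(dois[position])
    return refetched

def still_all_missing(refetched):
    """Return True if the second pass found nothing new."""
    return all(value is None for value in refetched.values())

# Stand-in fetcher: pretends that only one DOI has a Scopus record.
known = {"10.0000/demo.b": {"affiliation": [], "coredata": {}}}
demo_records = [{"coredata": {}}, None, None]
demo_dois = ["10.0000/demo.a", "10.0000/demo.b", "10.0000/demo.c"]
second_pass = refetch_missing(demo_records, demo_dois, known.get)
print(second_pass, still_all_missing(second_pass))
```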

In [None]:
# dict_new_extra_info_checker
# -> check if at least one value is not None -> otherwise the process is finished here

In [None]:
# len(dict_new_extra_info_checker)

In [None]:
# df_combined_extra_info_fetched_again  = append_fetched_data_to_df(df_current_extra_info, dict_new_extra_info_checker)
# df_combined_extra_info_fetched_again

In [None]:
# store_df_columns(df_combined_extra_info_fetched_again)

In [None]:
# df_combined_extra_info_fetched_again['check_doi'] = ser_doi
# df_combined_extra_info_fetched_again.head(30)