# CORD-19 Analyse fetched coredata

This Jupyter notebook analyses the coredata fetched from the Scopus Search API. An example API call for a single DOI looks as follows:
https://api.elsevier.com/content/search/scopus?query=DOI(10.1109/MCOM.2016.7509373)&apiKey=6d485ef1fe1408712f37e8a783a285a4
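
A request URL of this form can also be assembled programmatically. The following is a minimal sketch using only the standard library; the helper name `build_scopus_doi_url` and the placeholder key `YOUR_API_KEY` are illustrative assumptions and not part of the original notebook. No request is sent.

```python
from urllib.parse import urlencode

SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"


def build_scopus_doi_url(doi, api_key):
    """Return the Scopus Search URL that looks up a single DOI (hypothetical helper)."""
    # urlencode percent-encodes the parentheses and the slash inside the DOI query.
    params = {"query": f"DOI({doi})", "apiKey": api_key}
    return f"{SCOPUS_SEARCH_URL}?{urlencode(params)}"


url = build_scopus_doi_url("10.1109/MCOM.2016.7509373", "YOUR_API_KEY")
print(url)
```

Keeping the key in a variable rather than in the URL literal makes it easier to avoid publishing it together with the notebook.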

First, the required packages are imported into the notebook.

In [2]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import datetime
import re
import time 
from urllib.parse import urlparse
from collections import Counter
from elsapy.elsclient import ElsClient
from elsapy.elsdoc import FullDoc, AbsDoc
from elsapy.elssearch import ElsSearch
from pybtex.database import parse_file, BibliographyData, Entry

Next, the fetched coredata is read from the pkl file and placed in a DataFrame for further processing.

In [3]:
read_coredata = pd.read_pickle('extra_info_CS5099.pkl')
df_current_extra_info = pd.DataFrame()
df_current_extra_info['coredata'] = read_coredata['coredata']
df_current_extra_info

Unnamed: 0,coredata
0,"{'srctype': 'j', 'eid': '2-s2.0-85083266658', ..."
1,"{'srctype': 'j', 'prism:issueIdentifier': '7',..."
2,"{'srctype': 'j', 'prism:issueIdentifier': '8',..."
3,"{'srctype': 'j', 'prism:issueIdentifier': '9',..."
4,"{'srctype': 'j', 'prism:issueIdentifier': '11'..."
...,...
74297,"{'srctype': 'j', 'eid': '2-s2.0-85092678139', ..."
74298,"{'srctype': 'j', 'eid': '2-s2.0-85087468210', ..."
74299,"{'srctype': 'j', 'eid': '2-s2.0-85092677974', ..."
74300,"{'srctype': 'j', 'prism:issueIdentifier': '4',..."


In [4]:
df_current_extra_info.isnull().sum()

coredata    14098
dtype: int64

In [5]:
df_current_extra_info.isnull().sum() / len(df_current_extra_info)

coredata    0.189739
dtype: float64

About 19% of the rows (14,098 of 74,301) contain no coredata, i.e. the Scopus lookup returned nothing for almost a fifth of the dataset.
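
The same share can be computed in one step, because the mean of a boolean mask equals the fraction of `True` values. A small sketch on made-up data (the frame `df_toy` is an assumption for illustration only):

```python
import pandas as pd

# Toy stand-in for df_current_extra_info: two of five lookups returned nothing.
df_toy = pd.DataFrame(
    {"coredata": [{"srctype": "j"}, None, {"srctype": "j"}, None, {"srctype": "b"}]}
)

# isnull() yields a boolean mask; its mean is the share of missing rows.
missing_share = df_toy["coredata"].isnull().mean()
print(f"{missing_share:.1%} of the rows have no coredata")
```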

Therefore, all rows that contain missing values are dropped and the DataFrame index is reset.

In [6]:
df_combined = df_current_extra_info.dropna()
df_combined = df_combined.reset_index(drop=True)
df_combined

Unnamed: 0,coredata
0,"{'srctype': 'j', 'eid': '2-s2.0-85083266658', ..."
1,"{'srctype': 'j', 'prism:issueIdentifier': '7',..."
2,"{'srctype': 'j', 'prism:issueIdentifier': '8',..."
3,"{'srctype': 'j', 'prism:issueIdentifier': '9',..."
4,"{'srctype': 'j', 'prism:issueIdentifier': '11'..."
...,...
60199,"{'srctype': 'j', 'eid': '2-s2.0-85092679086', ..."
60200,"{'srctype': 'j', 'eid': '2-s2.0-85092678139', ..."
60201,"{'srctype': 'j', 'eid': '2-s2.0-85087468210', ..."
60202,"{'srctype': 'j', 'eid': '2-s2.0-85092677974', ..."


The following functions support creating DataFrames from the affiliation and coredata columns: `get_one_entry` handles a single dictionary, and `get_various_entries` handles a list of dictionaries.

In [None]:
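def get_one_entry(dic):
    """Turn a single dict into a one-row DataFrame whose columns are the dict keys."""
    # After transposing, row 0 holds the keys and row 1 holds the values.
    df_affiliation_holder = pd.DataFrame(list(dic.items())).T
    # Promote the key row to column names, then drop it so only the values remain.
    df_affiliation_holder.columns = df_affiliation_holder.iloc[0]
    df_affiliation_holder = df_affiliation_holder.drop(df_affiliation_holder.index[0])
    return df_affiliation_holder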

In [None]:
def get_various_entries(dic):
    """Turn a list of dicts into a DataFrame with one row per dict."""
    df_affiliation_holder = pd.DataFrame.from_dict(dic, orient='columns')
    return df_affiliation_holder
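
To illustrate the shape of the frames these helpers produce, the sketch below builds comparable frames from made-up records with the plain `pd.DataFrame` constructor. This is an illustration under assumptions, not the helpers themselves: for a flat dict the result has the same one-row layout as `get_one_entry`, apart from the index label and the inferred dtypes.

```python
import pandas as pd

# Made-up record with the same kind of keys as a Scopus coredata dict.
record = {"srctype": "j", "prism:aggregationType": "Journal", "citedby-count": "12"}

# A single dict becomes one row whose columns are the dict keys.
one_row = pd.DataFrame([record])

# A list of dicts becomes one row per dict; keys missing in a record become NaN.
several_rows = pd.DataFrame(
    [record, {"srctype": "b", "prism:aggregationType": "Book"}]
)

print(one_row.shape, several_rows.shape)
```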

The next cell builds the DataFrame that holds the coredata fields, with one column per field.

In [None]:
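%%time
# Collect the per-entry frames in a list and concatenate once at the end;
# calling pd.concat inside the loop copies the growing DataFrame on every pass.
frames = []

for entry in df_combined['coredata']:
    if str(entry).startswith("["):
        # A list of records yields one row per record.
        frames.append(get_various_entries(entry))
    else:
        # A single dict yields exactly one row.
        frames.append(get_one_entry(entry))

# Rows keep the order of df_combined.
df_coredata = pd.concat(frames, ignore_index=True)
print(len(df_coredata))
df_coredata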
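
As an alternative sketch, when every entry is a plain dict, `pd.json_normalize` can flatten the whole column in a single call. The entries below are made up, and the behaviour is not identical to the loop above: nested dicts are expanded into dotted column names instead of staying in one cell.

```python
import pandas as pd

# Made-up entries standing in for df_combined['coredata'].
entries = [
    {"srctype": "j", "citedby-count": "3", "dc:creator": {"author": "A"}},
    {"srctype": "b", "citedby-count": "0"},
]

# One row per entry; the nested dict becomes the column "dc:creator.author".
df_flat = pd.json_normalize(entries)
print(df_flat.columns.tolist())
```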

In [None]:
ser_type = df_coredata['prism:aggregationType']
ser_type_counted = ser_type.value_counts()
ser_type_counted

In [None]:
# Build one label per literature type in the form "Type (count)".
labelling = [f"{name} ({count})" for name, count in ser_type_counted.items()]

y = ser_type_counted.values
# Offset of each wedge from the centre; plt.pie requires exactly one value per wedge.
shift = [-0.1, 0.2, 0.4, 0.6, 0.8]

plt.pie(y, labels=labelling, startangle=70, explode=shift)
plt.title("Literature types", loc="left")
plt.show()

In [None]:
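# Convert the counts to numbers so that they sort numerically rather than as text.
ser_citedcount = pd.to_numeric(df_coredata['citedby-count'], errors='coerce')
# Publications per citation count, ordered by citation count instead of by frequency.
ser_citedcount_counted = ser_citedcount.value_counts().sort_index()

plt.plot(ser_citedcount_counted.index, ser_citedcount_counted.values, linestyle='dotted')
plt.xlabel("Number of citations")
plt.ylabel("Number of publications")
plt.title("Citation data of academic literature")
plt.show()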

In [None]:
ser_citedcount_counted 
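
Beyond the frequency table, a few summary figures can describe the citation distribution. The sketch below uses made-up values and assumes that `citedby-count` is stored as text, so it is converted to numbers first; the names `ser_toy`, `citations` and `uncited_share` are illustrative.

```python
import pandas as pd

# Made-up citation counts as strings; None mimics a missing value.
ser_toy = pd.Series(["0", "0", "3", "12", None, "3", "0"])

# Convert to numbers; entries that cannot be parsed become NaN and are dropped.
citations = pd.to_numeric(ser_toy, errors="coerce").dropna().astype(int)

# Share of publications that have not been cited at all.
uncited_share = (citations == 0).mean()

print("median citations:", citations.median())
print(f"uncited publications: {uncited_share:.0%}")
```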