## SCOPUS API WITH PYBLIOMETRICS LIBRARY
Notebook to extract useful information from Scopus. The starting dataset contains the NAME and SURNAME of each author. From this information, the notebook extracts:
* number of citations 
* number of publications/documents
* H index
* H index in the last 15 years (from 2005)

**IMPORTANT** To access Scopus, you must be connected to the network of an institution with Scopus access (e.g. a university). When you send a request, Scopus checks whether your IP address belongs to a known network.
A **VPN** connection does not work.



### An API KEY is necessary to access scopus information. 
Request your key at https://dev.elsevier.com/

*key="xxxxxxxxxxxxxxxxxxxx"*

Then add your key to the config.ini file. pybliometrics creates this file automatically, probably in the following directory:
*C:/Users/<username\>/.scopus/config.ini*
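Before running the rest of the notebook, it can help to check that the key really is in the file. The sketch below is only an illustration: the helper `has_api_key` is hypothetical (not part of pybliometrics), and it assumes the key is stored as an `APIKey` entry under an `[Authentication]` section, which may differ between pybliometrics versions.

```python
import configparser
from pathlib import Path


def has_api_key(config_path):
    """Return True if the file exists and has a non-empty APIKey entry.

    Assumption: the key sits under an [Authentication] section.
    """
    path = Path(config_path)
    if not path.is_file():
        return False
    parser = configparser.ConfigParser()
    parser.read(path)
    # fallback="" covers a missing section or a missing entry
    return bool(parser.get("Authentication", "APIKey", fallback="").strip())


# Example call; adjust the path to wherever your config file lives
print(has_api_key(Path.home() / ".scopus" / "config.ini"))
```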


In [None]:
import requests
import json
import pandas as pd
import re
import numpy as np

import pybliometrics 
from pybliometrics.scopus import AuthorSearch
from pybliometrics.scopus import AuthorRetrieval
from pybliometrics.scopus import ScopusSearch

pybliometrics reads config.ini automatically when it is called.
If this does not work, you can set explicitly the path where the program looks for the config.ini file:

import os
*os.environ['PYB_CONFIG_FILE'] = "C:/Users/<username\>/.scopus/config.ini"*


In [None]:
# Read the Excel file
file_path = 'path/to/file.xlsx'
df = pd.read_excel(file_path, engine='openpyxl')
print(df)


In [None]:
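# If your dataset has name and surname in the same cell (separated by a
# non-breaking space), split them into two columns. You'll need both later.
# Whole-column assignment avoids the chained indexing df["Name"][i] = ...,
# which pandas may apply to a copy instead of to df itself.
parts = df["Name"].str.split("\xa0", expand=True)
df["Surname"] = parts[0].str.title()  # the surname comes first in the cell
df["Name"] = parts[1]
print(df)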

In [None]:
#reorder dataset
df=df[["Name", "Surname", "Gender", "Affil", "Documents", "Citations", "H", "H15"]]

In [None]:
print(df.iloc[0:5,:])

## AUTHOR SEARCH API
AuthorSearch accepts different fields in its query, e.g. AUTHLAST (surname), AUTHFIRST (name) and AFFIL (university; it can also be part of the name, such as the city).

This step is necessary if you don't have the AUTH_ID or the EID of the researcher.
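As a minimal sketch of how such a query string can be put together, the hypothetical helper `build_author_query` below (not part of pybliometrics) joins whichever fields are available with AND. The resulting string is what would be passed to `AuthorSearch`.

```python
def build_author_query(surname, name=None, affil=None):
    """Combine the available fields into one author-search query string."""
    parts = ["AUTHLAST({})".format(surname)]
    if name:
        parts.append("AUTHFIRST({})".format(name))
    if affil:
        parts.append("AFFIL({})".format(affil))
    return " AND ".join(parts)


print(build_author_query("Kitchin", "John"))
# AUTHLAST(Kitchin) AND AUTHFIRST(John)
```

Leaving out the affiliation widens the search, so more than one author may come back and the first match is not necessarily the right one.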

## AUTHOR RETRIEVAL API
This is the main API to access author data. It requires the EID or the AUTH ID as input.
It returns several attributes. We are interested in:

1. citations: au.citation_count
2. documents: au.document_count
3. H index: au.h_index
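The EID and the AUTH ID are closely related. Assuming author EIDs have the form `9-s2.0-<author id>`, the ID is simply the last dash-separated part. The helper name `author_id_from_eid` is made up for this sketch.

```python
def author_id_from_eid(eid):
    """Return the numeric author ID at the end of an author EID.

    Assumption: the EID looks like '9-s2.0-<author id>'.
    """
    return eid.rsplit("-", 1)[-1]


print(author_id_from_eid("9-s2.0-22988279600"))
# 22988279600
```

Splitting on the last dash avoids relying on the ID having a fixed number of digits.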

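We are also interested in the H index over the last 15 years. Scopus does not return it directly, so we compute it by hand: search all documents published in the last 15 years, read their citation counts and compute the H index from them (the largest number h such that h documents have at least h citations each).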
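The computation itself needs only the list of citation counts. The function below is a small stand-alone sketch (the name `compute_h_index` is chosen here for illustration): it ranks the documents from most to least cited and keeps the last rank at which the citation count is still at least the rank.

```python
def compute_h_index(citations):
    """Largest h such that h documents have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


print(compute_h_index([10, 8, 5, 4, 3]))  # 4
print(compute_h_index([5, 5, 5]))         # 3
print(compute_h_index([]))                # 0
```

The last two calls are the edge cases worth checking: every document cited at least as often as there are documents, and an author with no documents in the period.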




In [None]:
for i in range(df.shape[0]):

    # AUTHOR SEARCH: look the author up by surname, name and affiliation
    api = "AUTHLAST({}) AND AUTHFIRST({}) AND AFFIL({})".format(
        df.loc[i, "Surname"], df.loc[i, "Name"], df.loc[i, "Affil"])
    s = AuthorSearch(api)
    if not s.authors:
        # no match: leave this row empty instead of crashing
        print("no author found for", api)
        continue
    df_auth = pd.DataFrame(s.authors)
    author_eid = df_auth["eid"][0]  # first match; check by hand if the name is ambiguous

    # AUTHOR RETRIEVAL
    au = AuthorRetrieval(author_eid)
    # df.loc writes into df itself; df["Citations"][i] = ... may write to a copy
    df.loc[i, "Citations"] = int(au.citation_count)
    df.loc[i, "Documents"] = int(au.document_count)
    df.loc[i, "H"] = int(au.h_index)

    # search for all works of this author published from 2005 onwards
    author_id = author_eid.split("-")[-1]  # the author ID is the last part of the EID
    s = ScopusSearch("AU-ID({}) AND PUBYEAR > 2004".format(author_id))
    # each search result already carries its citation count,
    # so no separate query per document is needed
    cit_list = [int(doc.citedby_count or 0) for doc in (s.results or [])]

    # compute H index: largest h such that h documents have at least h citations.
    # This also covers an empty list and the case where every document qualifies.
    ranked = sorted(cit_list, reverse=True)
    h_index = sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)

    print("final h index", h_index)

    df.loc[i, "H15"] = h_index

In [None]:
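# Save the results to Excel (this overwrites the input file).
# index=False keeps the row index from being written as an extra column.
df.to_excel(file_path, index=False)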

In [None]:
print(df)

In [None]:
# Debug cell: trace the H-index computation step by step for the last author
# processed above, this time restricted to documents published after 2014.
author_id = df_auth["eid"][0].split("-")[-1]  # author ID is the last part of the EID
s = ScopusSearch("AU-ID({}) AND PUBYEAR > 2014".format(author_id))
eid_list = []
cit_list = []
for doc in (s.results or []):
    print(doc.eid, doc.citedby_count)
    eid_list.append(doc.eid)
    cit_list.append(int(doc.citedby_count or 0))

# walk down the ranking (most cited first) and print every comparison
h_index = 0
for rank, cites in enumerate(sorted(cit_list, reverse=True), start=1):
    print("rank", rank, "citations", cites)
    if cites < rank:
        break
    h_index = rank

print("final", h_index)

In [None]:
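# Direct call to the Author Retrieval endpoint with requests (no pybliometrics)
api_key = "xxxxxxxxxxxxxxxxxxxx"  # replace with your own API key; never share a real key
# Earlier attempt against the search endpoint, kept for reference (not used):
# api = 'https://api.elsevier.com/content/search/index:SCOPUS?query=AUTHLASTNAME(Chiti)&AUTHFIRST(Artur)&apikey=' + api_key
api = "https://api.elsevier.com/content/author/author_id/22988279600?apiKey=" + api_key
# ask for JSON explicitly so that response.json() can parse the reply
response = requests.get(api, headers={"Accept": "application/json"})
print(response.status_code)
result = response.json()
print(result)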

In [None]:
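# Author search through the REST endpoint, again without pybliometrics
api_key = "xxxxxxxxxxxxxxxxxxxx"  # replace with your own API key; never share a real key
# Alternative: retrieve a single author by ID (not used):
# api = "https://api.elsevier.com/content/author/author_id/22988279600?apiKey=" + api_key
api = ("https://api.elsevier.com/content/search/author?apikey=" + api_key
       + "&query=AUTHFIRST%28John%29+AND+AUTHLASTNAME%28Kitchin%29+AND+SUBJAREA%28COMP%29")
response = requests.get(api)
print(response.status_code)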
