# Entities available in the CORE API v3

### Data providers
Gives access to the set of entities that supply data to CORE. It contains repositories (institutional and disciplinary), preprint servers, journals and publishers.

### Journals
This dataset contains all journal titles included in the CORE collection. You can also search for and retrieve any journal, even if it is not a CORE data provider.

### Works
These are the entities that represent a piece of research, for example research articles, theses, etc. Overall, they are a deduplicated and enriched version of records.

### Outputs
Outputs are a representation of a Work in a data provider. The data is not enriched and reflects exactly the content harvested from the data provider.

### Search
Search across the collection of CORE entities.

### Recommender
A leading recommendation solution for research articles.

### Discovery
One-click access to free copies of research papers whenever you hit a paywall.

### Querying datasets above 10,000 records
The API is well suited to small queries and quick access to CORE data. For larger queries, we recommend that you use the CORE dataset. The maximum size of a result set returned by the API is 10,000 results.
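To make the 10,000-result ceiling concrete, here is a minimal sketch of how offset-based pages could be planned for a regular (non-scroll) query. The helper name `plan_pages` is hypothetical, and the sketch assumes the search endpoints accept an `offset` parameter alongside `limit`, as the `offset` field in the API responses suggests.

```python
MAX_WINDOW = 10_000  # maximum result-set size stated above


def plan_pages(total_hits, page_size=100, max_window=MAX_WINDOW):
    """Return (offset, limit) pairs covering the reachable part of a result set."""
    # Anything beyond max_window cannot be reached with plain pagination
    reachable = min(total_hits, max_window)
    return [(offset, min(page_size, reachable - offset))
            for offset in range(0, reachable, page_size)]


pages = plan_pages(total_hits=71_172, page_size=100)
print(len(pages), pages[0], pages[-1])
```

With 71,172 hits only the first 10,000 are reachable this way, which is why larger result sets need the scroll mechanism.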

To obtain larger result sets, the search endpoints support scroll queries. When you append a parameter named `scroll`, the system returns an ID that you can use to continue the query beyond that limit. Our example shows how to use it. Keep in mind that these queries have a larger impact on performance, so we rate-limit them more strictly than regular queries.
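The scroll mechanism can be sketched independently of the HTTP layer. In the sketch below, `scroll_all` and `fetch_page` are hypothetical names: `fetch_page` stands for any callable that sends the parameters to a search endpoint and returns the decoded JSON. The parameter names `scroll` and `scrollId` follow the ones used in the cells of this notebook.

```python
def scroll_all(fetch_page, query, page_size=100):
    """Collect every hit for `query` by following scroll IDs.

    `fetch_page(params)` is a hypothetical callable that performs the HTTP
    request and returns the decoded JSON response as a dict.
    """
    results = []
    scroll_id = None
    while True:
        params = {"q": query, "limit": page_size}
        if scroll_id is None:
            params["scroll"] = "true"       # first request opens the scroll
        else:
            params["scrollId"] = scroll_id  # later requests continue it
        page = fetch_page(params)
        hits = page.get("results") or []
        if not hits:
            break                           # an empty page ends the scroll
        results.extend(hits)
        scroll_id = page.get("scrollId")
        if scroll_id is None:
            break                           # no ID to continue with
    return results


# Offline stand-in for the API: three pages of made-up hits, then an empty one
fake_pages = iter([[1, 2], [3, 4], [5], []])


def fake_fetch(params):
    return {"scrollId": "abc", "results": next(fake_pages)}


print(scroll_all(fake_fetch, "dataProviders:14"))
```

Passing the fetch function in as an argument keeps the loop testable without network access; stopping on a missing scroll ID avoids retrying forever when the API returns an error payload.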

## Running queries against the CORE API v3

### 1. Register for the CORE API https://core.ac.uk/services/api


In [1]:
import os
import json
import pandas
import hashlib
import requests
import itertools
import networkx as nx
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from operator import itemgetter
from matplotlib import rcParams
rcParams['figure.figsize'] = 11.7,8.27

def pretty_json(json_object):
    print(json.dumps(json_object, indent=2))

In [2]:
# Read the API keys stored on the local system
def apikeys(api):
    # Absolute path to the user's home folder
    home_dir = os.path.expanduser("~")

    # Full path to the secrets.json file
    secrets_file_path = os.path.join(home_dir, "secrets.json")

    # Check that the file exists
    if not os.path.exists(secrets_file_path):
        print("The secrets.json file was not found in the home folder.")
        return None

    # Open secrets.json for reading
    with open(secrets_file_path, 'r') as secrets_file:
        secrets = json.load(secrets_file)
    try:
        # Look up the API key
        return secrets[api+"_api_key"]
    except KeyError:
        print(f"No key for API '{api}' is registered in secrets.json")
        return None
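The lookup above expects `secrets.json` to be a JSON object whose keys follow the pattern `<api>_api_key`. A minimal sketch of the same idea with the file path passed in explicitly, so that it can be tried on a throwaway file; `load_api_key` is a hypothetical helper and the key value is made up.

```python
import json
import os
import tempfile


def load_api_key(api, secrets_path):
    """Return the '<api>_api_key' entry of a JSON secrets file, or None."""
    try:
        with open(secrets_path, "r", encoding="utf-8") as secrets_file:
            secrets = json.load(secrets_file)
    except (FileNotFoundError, json.JSONDecodeError):
        # Missing or malformed file: behave as if no key were registered
        return None
    return secrets.get(f"{api}_api_key")


# Demonstration with a temporary file and a made-up key
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "secrets.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"core_api_key": "example-key"}, f)
    print(load_api_key("core", path))   # example-key
    print(load_api_key("other", path))  # None
```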

Download all the identifiers of a data provider: use a scan-and-scroll search to build a dataset containing the identifiers of every record held by that provider.

In [11]:
# headers={"Authorization":"Bearer "+apikey}
# query=f"?q=_exists_:doi&limit=1"

headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30"}

endpoint = f"https://api.core.ac.uk/v3/"
entity   = f"search/works"
query    = f"?q=_exists_:doi&limit=1"
# apikey   = f"&api_key={apikeys('core')}"

req = f"{endpoint}{entity}{query}"
print(req)

singlework = requests.get(req,
                          headers=headers,
                          )

singlework.json()

https://api.core.ac.uk/v3/search/works?q=_exists_:doi&limit=1


{'totalHits': 60410043,
 'limit': '1',
 'offset': 0,
 'scrollId': None,
 'results': [{'acceptedDate': '2003-03-11T00:00:00',
   'arxivId': None,
   'authors': [{'name': 'Fuller, Duncan'}, {'name': 'Jonas, Andrew'}],
   'citationCount': None,
   'contributors': ['Duncan'],
   'outputs': ['https://api.core.ac.uk/v3/outputs/353601649',
    'https://api.core.ac.uk/v3/outputs/4147554'],
   'createdDate': '2012-06-01T18:32:03',
   'dataProviders': [{'id': 162,
     'name': '',
     'url': 'https://api.core.ac.uk/v3/data-providers/162',
     'logo': 'https://api.core.ac.uk/data-providers/162/logo'},
    {'id': 4786,
     'name': '',
     'url': 'https://api.core.ac.uk/v3/data-providers/4786',
     'logo': 'https://api.core.ac.uk/data-providers/4786/logo'}],
   'depositedDate': '2008-05-22T13:01:00',
   'abstract': 'This paper provides a critical overview of recent developments in British credit union development, and contributed to the broader analysis of alternative financial/economic spaces
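The response above links each work to its outputs and data providers. As a small illustration, the numeric identifiers can be pulled out of those fields without another API call. The helper names `output_ids` and `provider_ids` are hypothetical; the sample values are copied from the response shown above.

```python
def output_ids(hit):
    """Return the numeric CORE output IDs listed in a work's `outputs` URLs."""
    return [int(url.rstrip("/").rsplit("/", 1)[-1])
            for url in hit.get("outputs", [])]


def provider_ids(hit):
    """Return the IDs of the data providers that hold the work."""
    return [dp["id"] for dp in hit.get("dataProviders", [])]


# Fields copied from the response shown above
hit = {
    "outputs": ["https://api.core.ac.uk/v3/outputs/353601649",
                "https://api.core.ac.uk/v3/outputs/4147554"],
    "dataProviders": [{"id": 162}, {"id": 4786}],
}
print(output_ids(hit))    # [353601649, 4147554]
print(provider_ids(hit))  # [162, 4786]
```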

In [55]:
headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30"}

endpoint = f"https://api.core.ac.uk/v3/"
entity   = f"search/works"
provider = f"https://api.core.ac.uk/v3/data-providers/131"
query    = f"?q=dataProviders:{provider}&limit=1"
apikey   = f"&api_key={apikeys('core')}"

req = f"{endpoint}{entity}{query}{apikey}"
print(req)

data_providers = requests.get(req,headers=headers)
print(data_providers.json().get('totalHits'))
result_list = data_providers.json().get('results')

https://api.core.ac.uk/v3/search/works?q=dataProviders:https://api.core.ac.uk/v3/data-providers/131&limit=1&api_key=<API_KEY>
71172


In [40]:
type(data_providers)

requests.models.Response

In [41]:
type(json.dumps(data_providers.json()))

str

In [54]:
import pandas as pd
df_results = pd.DataFrame(result_list)
print(df_results['publisher'].values)
print([x['name'] for x in df_results['authors'][0]])
df_results

['DERlab e.V. – European Distributed Energy Resources Laboratories']
['Crolla, Paul', 'de Graff, Roald', 'de Jong, Erik', 'Gafaro, Francisco', 'Kotsampopoulos, Panos', 'Lauss, Georg', 'Lefuss, Felix', 'Roscoe, Andrew', 'Vassen, Peter']


Unnamed: 0,acceptedDate,arxivId,authors,citationCount,contributors,outputs,createdDate,dataProviders,depositedDate,abstract,...,oaiIds,publishedDate,publisher,pubmedId,references,sourceFulltextUrls,updatedDate,yearPublished,journals,links
0,,,"[{'name': 'Crolla, Paul'}, {'name': 'de Graff,...",,[],"[https://api.core.ac.uk/v3/outputs/59383248, h...",2013-05-02T16:08:41,"[{'id': 131, 'name': '', 'url': 'https://api.c...",2013-04-10T10:06:00,The European White Book on Real-Time-Powerhard...,...,[oai:strathprints.strath.ac.uk:43450],2012-03-01T00:00:00,DERlab e.V. – European Distributed Energy Reso...,,"[{'id': 1749855, 'title': '50160 Voltage chara...",[http://strathprints.strath.ac.uk/43450/1/noe_...,2022-02-20T07:52:51,2012,[],"[{'type': 'download', 'url': 'https://core.ac...."


In [59]:
def query_api(search_url, query, scrollId=None):
    # headers={"Authorization":"Bearer "+apikeys('core')}
    headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30"}
    
    # apikey (defined in an earlier cell) already starts with "&api_key="
    if not scrollId:
        response = requests.get(f"{search_url}?q={query}&limit=100&scroll=true{apikey}",headers=headers)
    else:
        response = requests.get(f"{search_url}?q={query}&limit=100&scrollId={scrollId}{apikey}",headers=headers)        
    return response.json(), response.elapsed.total_seconds()

def scroll(search_url, query, extract_info_callback):
    allresults = []
    count = 0
    scrollId=None
    while True:
        result, elapsed = query_api(search_url, query, scrollId)
        try:
            scrollId=result["scrollId"]
            totalhits = result["totalHits"]
            result_size = len(result["results"])
        except KeyError:
            # An error payload has none of these fields; stop instead of retrying forever
            print('Result has no scroll id')
            break
        if result_size==0:
            break
        for hit in result["results"]:
            if extract_info_callback:
                allresults.append(extract_info_callback(hit))
            else:
                allresults.append(extract_info(hit))
        count+=result_size
        print(f"{count}/{totalhits} {elapsed}s")
    return allresults
        
def extract_info(hit):
    return {"id":hit["id"], "name": hit["name"], "url":hit["oaiPmhUrl"]}

def get_ids(hit):
  return {
      "id":hit["id"],
      "arxivId":hit["arxivId"],
      "doi":hit["doi"],
      "oaiIds":",".join(hit["oaiIds"]),
      "magId":hit["magId"],
      "coreIds":",".join(hit["outputs"]),
      "pubmedId":hit["pubmedId"]
  }

provider = f"https://api.core.ac.uk/v3/data-providers/14"
# query_api adds "?q=" and "limit" itself, so pass only the query expression
query    = f"dataProviders:{provider}"
response = scroll("https://api.core.ac.uk/v3/search/works", 
                  query, 
                  get_ids)

In [None]:
def query_api(search_url, query, scrollId=None):
    # headers={"Authorization":"Bearer "+apikey}
    headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30"}
    
    if not scrollId:
        response = requests.get(f"{search_url}?q={query}&limit=100&scroll=true",headers=headers)
    else:
        response = requests.get(f"{search_url}?q={query}&limit=100&scrollId={scrollId}",headers=headers)
    print(response.content)        
    return response.json(), response.elapsed.total_seconds()

In [None]:
from pprint import pprint
import urllib.request, urllib.parse, urllib.error

query = '"machine learning" AND graph AND innovation AND ontology'

def get_entity(url_fragment, query=None):
    api_endpoint = "https://api.core.ac.uk/v3/"
    # headers={"Authorization":"Bearer "+api_key}
    headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30"}
    
    api_key = apikeys('core')
    url = f"{api_endpoint}{url_fragment}?api_key={api_key}"
    if query is not None:
        # Only search endpoints take a query; single-entity lookups do not
        url += f"&q={urllib.parse.quote(query)}"
    print(url)
    response = requests.get(url, headers=headers)
    
    if response.status_code == 200:
        return response.json(), response.elapsed.total_seconds()
    else:
        print(f"Error code {response.status_code}")
        # pprint(response.content, width=120)
        return None, None

In [None]:
data_provider, elapsed = get_entity("search/works/", query)
pretty_json(data_provider)

In [None]:
def query_api(url_fragment, query,limit=100):
    api_endpoint = "https://api.core.ac.uk/v3/"
    encoded_query=urllib.parse.quote(query)
    headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30"}
    
    response = requests.get(f"{api_endpoint}{url_fragment}?api_key={apikeys('core')}&q={encoded_query}&limit={limit}", headers=headers)
    
    if response.status_code ==200:
        return response.json(), response.elapsed.total_seconds()
    else:
        print(f"Error code {response.status_code}, {response.content}")
        # Return a pair so that callers can always unpack the result
        return None, None


In [None]:
query = '"machine learning" AND graph AND innovation AND ontology'

response_object , elapsed = query_api("search/works", query)

pretty_json(response_object)

In [None]:
import urllib
limit = 100
query = "location.countryCode:gb"
params = {"q":query, "limit":limit}
encoded_query = urllib.parse.quote(json.dumps(params))
encoded_query

In [None]:
params = {"q":query, "limit":limit}
params.get('q')

In [None]:
params = {"q":query, "limit":limit}
json.dumps(params)
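The cells above quote the JSON dump of the parameter dictionary, which yields an encoded JSON document rather than a query string. A minimal sketch of the difference, using only the standard library; the URL assembled at the end is illustrative and is not sent anywhere.

```python
import json
from urllib.parse import quote, urlencode

params = {"q": "location.countryCode:gb", "limit": 100}

# Quoting the JSON dump also encodes braces and quotes: not a query string
print(quote(json.dumps(params)))

# urlencode produces key=value pairs joined by '&'
query_string = urlencode(params)
print(query_string)  # q=location.countryCode%3Agb&limit=100

url = f"https://api.core.ac.uk/v3/search/data-providers?{query_string}"
print(url)
```

Passing the dictionary to `requests.get(..., params=params)` performs the same encoding internally.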

In [None]:
def query_api(url_fragment, query, is_scroll=False, limit=100, scrollId=None):
    endpoint = "https://api.core.ac.uk/v3/"
    # Authenticate with the key stored in ~/secrets.json
    headers = {"Authorization":"Bearer "+apikeys('core')}
    # requests URL-encodes the parameters, so the query is passed unencoded
    params = {"q":query, "limit":limit}

    if is_scroll and not scrollId:
        # First scroll request: ask the API to open a scroll
        params["scroll"]="true"
    elif is_scroll:
        # Later requests: continue from the scroll ID returned previously
        params["scrollId"]=scrollId

    response = requests.get(f"{endpoint}{url_fragment}", params=params, headers=headers)
    print(response.url)
    if response.status_code ==200:
        return response.json(), response.elapsed.total_seconds()
    else:
        print(f"Error code {response.status_code}, {response.content}")
        return None, None

def scroll(search_url, query, extract_info_callback=None):
    allresults = []
    count = 0
    scrollId=None
    while True:
        result, elapsed = query_api(search_url, 
                                    query, 
                                    is_scroll=True, 
                                    scrollId=scrollId)
        if result is None:
            # The request failed; return what has been collected so far
            break
        scrollId=result["scrollId"]
        totalhits = result["totalHits"]
        result_size = len(result["results"])
        if result_size==0:
            break
        for hit in result["results"]:
            if extract_info_callback:
              allresults.append(extract_info_callback(hit))
            else:
              allresults.append(hit)
        count+=result_size
        print(f"{count}/{totalhits} {elapsed}s")
    return allresults

In [None]:
def get_data_providers_id(hit):
    return {"id":hit["id"], "name":hit["name"]}

uk_data_providers_raw = scroll("search/data-providers/", 
                               "location.countryCode:gb", 
                               get_data_providers_id)

uk_data_providers = pandas.DataFrame(uk_data_providers_raw)
uk_data_providers

In [None]:
work, elapsed = get_entity("works/58886742")
pretty_json(work)

In [None]:
work, elapsed = get_entity("works/10.3389/fmicb.2018.01845/full")
pretty_json(work)



In [None]:
work, elapsed = get_entity("works/oai:strathprints.strath.ac.uk:4509")
pretty_json(work)

In [None]:
work, elapsed = get_entity("works/core:277236443")
pretty_json(work)

In [None]:
query = f"covid AND yearPublished>=2010 AND yearPublished<=2021"
results, elapsed = query_api("search/works", query, limit=1)
pretty_json(results)

In [None]:
def aggregations(query, aggregation_fields,entity_type="works", limit=20, cache=True):
    api_endpoint = "https://api.core.ac.uk/v3/"
    headers={"Authorization":"Bearer "+apikeys('core')}

    query = {"q":query,"aggregations":aggregation_fields, "limit":limit}
    querystring = json.dumps(query).encode('utf-8')
    # The cache file holds the JSON response, keyed by a hash of the request body
    os.makedirs("cache", exist_ok=True)
    filename = f"cache/{hashlib.md5(querystring).hexdigest()}.csv"
    if cache and os.path.exists(filename):
        with open (filename, "r") as cached:
            responseObject=json.load(cached)
    else:
        response = requests.post(f"{api_endpoint}search/{entity_type}/aggregate",data = querystring, headers=headers)
        responseObject = response.json()
        with open (filename, "w") as cached:
            json.dump(responseObject, cached)
        
    return responseObject
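The caching step derives a file name from a hash of the request body. A minimal sketch of that idea in isolation; `cache_filename` is a hypothetical helper, and both `sort_keys=True` and the `.json` suffix are choices made for this sketch rather than something the notebook does.

```python
import hashlib
import json


def cache_filename(query, cache_dir="cache"):
    """Build a deterministic cache path from the request body."""
    # sort_keys makes the hash independent of dictionary insertion order
    body = json.dumps(query, sort_keys=True).encode("utf-8")
    return f"{cache_dir}/{hashlib.md5(body).hexdigest()}.json"


a = cache_filename({"q": "covid", "aggregations": ["yearPublished"], "limit": 20})
b = cache_filename({"limit": 20, "q": "covid", "aggregations": ["yearPublished"]})
print(a == b)  # True
```

MD5 is used here only to name files, not for security.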




In [None]:
query = f"covid AND yearPublished>=2010 AND yearPublished<=2021"
aggregation_response = aggregations(query, aggregation_fields=["yearPublished"],entity_type="works", limit=20)
pretty_json(aggregation_response)
year_data = aggregation_response["aggregations"]["yearPublished"]
years = pandas.DataFrame(list(year_data.items()), columns=["year", "records"]) 
years = years.sort_values("year", ascending=True)
ax = sns.barplot(x="year", y="records", data=years)
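The cell above reads the per-year counts from `aggregation_response["aggregations"]["yearPublished"]`. A minimal sketch of that reshaping step without pandas; `aggregation_rows` is a hypothetical helper, the counts are made up, and the assumption is that each aggregation maps a bucket key to a record count, as the `.items()` call above implies.

```python
def aggregation_rows(aggregation_response, field):
    """Turn {"aggregations": {field: {key: count}}} into rows sorted by key."""
    buckets = aggregation_response.get("aggregations", {}).get(field, {})
    # Convert keys to int so that years sort numerically
    return sorted((int(key), count) for key, count in buckets.items())


# Made-up counts, only to show the shape of the data
sample = {"aggregations": {"yearPublished": {"2021": 7, "2019": 2, "2020": 5}}}
print(aggregation_rows(sample, "yearPublished"))
# [(2019, 2), (2020, 5), (2021, 7)]
```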

In [None]:
def get_dataprovider_aggregation(query, year):
    connected_ids = " OR ".join(f"dataProviders:{id}" for id in uk_data_providers.id)
    query = f"{query} AND ({connected_ids}) AND yearPublished:{year}"
    aggregation_response = aggregations(query, aggregation_fields=["dataProviders"],entity_type="works")
    dp_data = aggregation_response["aggregations"]["dataProviders"]
    dps = pandas.DataFrame(list(dp_data.items()), columns=["dp_id", "records"]) 
    dps = dps[dps.dp_id.astype(int).isin(uk_data_providers.id)]
    dps["dp_id"]=dps["dp_id"].astype(int) 
    return dps.set_index("dp_id").join(uk_data_providers.set_index("id")).reset_index().sort_values("records", ascending=False)

In [None]:
def plot_query(query, year, top_n=10):
    aggs = get_dataprovider_aggregation(query, year)
    ax = sns.barplot(x="records", y="name", data=aggs[:top_n])
    ax.set(xlabel="# records")
    ax.set(ylabel=None)
    plt.xticks(rotation=90)
    plt.title(f"\"{query}\" publications in {year}")
    plt.show()

plot_query("unprecedented times", 2018)
plot_query("unprecedented times", 2019)
plot_query("unprecedented times", 2020)
plot_query("unprecedented times", 2021)

In [None]:
import itertools
russel_group = [119,286,27,83,33,39,504,14443,105,635,140,129,35,252,80,88,619,289,140,36,118,136, 42, 34]

target_dps =uk_data_providers.id.to_list()
def get_repo_name(url):
    # Accepts a data provider URL or a bare ID
    id_repo = str(url).split("/")[-1]
    if uk_data_providers[uk_data_providers.id==int(id_repo)].any()["name"]:
        return uk_data_providers[uk_data_providers.id==int(id_repo)]["name"].values[0]
    return "Other data providers"

def get_repo_name_if_needed(dp):
    # Search results describe each data provider as a dict with "id" and "url",
    # as in the response shown earlier; a plain URL is accepted as well
    id_repo = dp["id"] if isinstance(dp, dict) else dp.split("/")[-1]
    if int(id_repo) in target_dps:
        dp_name=get_repo_name(id_repo)
    else: 
        dp_name="Other data providers"
    return dp_name

def get_arcs(hit):
    results = []
    # A work needs at least two data providers to form a co-deposit edge
    if len(hit["dataProviders"])<2:
        return results
    for dpA, dpB in itertools.combinations(hit["dataProviders"],2):
        dpA_name = get_repo_name_if_needed(dpA)
        dpB_name = get_repo_name_if_needed(dpB)        
        results.append({"source":dpA_name, "target":dpB_name, "edge":"co_deposit"})
    return results
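The edge construction above pairs up every two data providers that hold the same work. A minimal sketch of the same counting idea without the API or pandas; `co_deposit_counts` is a hypothetical helper and the provider names are made up.

```python
from collections import Counter
from itertools import combinations


def co_deposit_counts(works):
    """Count how often each unordered pair of providers holds the same work."""
    counts = Counter()
    for providers in works:
        # sorted() gives (A, B) and (B, A) the same key; set() drops repeats
        for pair in combinations(sorted(set(providers)), 2):
            counts[pair] += 1
    return counts


# Made-up provider names, one list per work
works = [["A", "B"], ["B", "A", "C"], ["C"]]
print(co_deposit_counts(works))
```

Normalising the pair order matters for an undirected graph: without it, the same pair of providers can end up split across two rows with separate counts.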

def get_collaboration_network(query, cache=True):
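    # Make sure the cache folder exists before reading from or writing to it
    os.makedirs("cache", exist_ok=True)
    filename = f"cache/{hashlib.md5(query.encode('utf-8')).hexdigest()}.csv"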
    edges_df = []
    if cache:
        if os.path.exists(filename):
            edges_df = pandas.read_csv(filename)
    if len(edges_df)==0:
        covid_works = scroll("search/works", query, get_arcs)
        works= []

        for c in covid_works:
            works.extend(c)

        edges_df = pandas.DataFrame(works)
        edges_df.to_csv(filename)
        
        
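    # Combine both conditions in one mask; chained boolean indexing misaligns the second filter
    known = (edges_df.target!="Other data providers") & (edges_df.source!="Other data providers")
    edges_df = edges_df[known].groupby(['source', "target"]).count().reset_index()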

    G=nx.from_pandas_edgelist(edges_df,  "source", "target", edge_attr=True, create_using=nx.Graph())
    plt.figure(figsize=(40,40))
    M = G.number_of_edges()
    edge_colors = range(2, M + 2)
    cmap = sns.color_palette("viridis", as_cmap=True)
    pos = nx.circular_layout(G)
    widths = 15 * (edges_df["edge"]/edges_df["edge"].max()) +1
    nodes = nx.draw_networkx_nodes(G, pos, node_color="indigo" )
    edges = nx.draw_networkx_edges(
        G,
        pos,
        edge_color=edge_colors,
        edge_cmap=cmap,
        width=widths,
        
    )
    label_options = {"fc": "white"}
    nx.draw_networkx_labels(G, pos, font_size=14, bbox=label_options)
    plt.show()
    #return edges_df.count()

In [None]:
def get_ego_network(query, cache=True):
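    # Make sure the cache folder exists before reading from or writing to it
    os.makedirs("cache", exist_ok=True)
    filename = f"cache/{hashlib.md5(query.encode('utf-8')).hexdigest()}.csv"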
    edges_df = []
    if cache:
        if os.path.exists(filename):
            edges_df = pandas.read_csv(filename)
    if len(edges_df)==0:
        covid_works = scroll("search/works", query, get_arcs)
        works= []

        for c in covid_works:
            works.extend(c)

        edges_df = pandas.DataFrame(works)
        edges_df.to_csv(filename)
        
        
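    # Combine both conditions in one mask; chained boolean indexing misaligns the second filter
    known = (edges_df.target!="Other data providers") & (edges_df.source!="Other data providers")
    edges_df = edges_df[known].groupby(['source', "target"]).count().reset_index()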
    
    
    G=nx.from_pandas_edgelist(edges_df,  "source", "target", edge_attr=True, create_using=nx.Graph())
    plt.figure(figsize=(40,40))
    M = G.number_of_edges()
    edge_colors = range(2, M + 2)
    cmap = sns.color_palette("viridis", as_cmap=True)
    pos = nx.circular_layout(G)
    widths = 15 * (edges_df["edge"]/edges_df["edge"].max()) +1
    # Fixed seed so that the spring layout below is reproducible
    seed = 20532

    # find node with largest degree
    node_and_degree = G.degree()
    (largest_hub, degree) = sorted(node_and_degree, key=itemgetter(1))[-1]

    # Create ego graph of main hub
    hub_ego = nx.ego_graph(G, largest_hub)

    # Draw graph
    pos = nx.spring_layout(hub_ego, seed=seed)  # Seed layout for reproducibility
    nx.draw(hub_ego, pos, node_color="b", node_size=50, with_labels=False)

    # Draw ego as large and red
    options = {"node_size": 300, "node_color": "r"}
    label_options = {"fc": "white"}
    nx.draw_networkx_nodes(hub_ego, pos, nodelist=[largest_hub], **options)
    nx.draw_networkx_labels(hub_ego, pos, font_size=24, bbox=label_options)
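    # Colour and size only the edges of the ego graph, which can be fewer than those of G
    ego_counts = [data["edge"] for _, _, data in hub_ego.edges(data=True)]
    edges = nx.draw_networkx_edges(
        hub_ego,
        pos,
        edge_color=range(2, len(ego_counts) + 2),
        edge_cmap=cmap,
        width=[15 * count / max(ego_counts) + 1 for count in ego_counts],
    )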
    plt.show()
    #return edges_df.count()

In [None]:
query = f"covid AND yearPublished:2021 AND dataProviders:140"
get_ego_network(query)

In [None]:
ids_to_focus = " OR ".join(f"dataProviders:{id}" for id in russel_group)
query = f"covid AND yearPublished:2021 AND ({ids_to_focus})"
get_collaboration_network(query, cache=True)