## What is a third-party API?


Third-party APIs are interfaces provided by external organizations, generally companies such as Facebook, Twitter, or Google, that let you access their functionality and data programmatically.

## Sending an API request in Python 

Let's say we want to find scientific papers about "ChatGPT" published in 2023.
Several APIs let you retrieve research articles and their metadata.
Some examples are:

- [Semantic Scholar Academic Graph API](https://api.semanticscholar.org/api-docs/graph)
- [OpenAlex API](https://docs.openalex.org/)
- [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/)

In this example, I am going to use the Semantic Scholar Academic Graph API. 

In [4]:
import requests


In [5]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "paper/search"


params = {"query": "ChatGPT",
          "year": 2023,
          "offset": 0,
          "limit": 100,
          "fields": "title,year,authors"}

In [6]:
my_url = BASE_URL + VERSION + RESOURCE

In [7]:
r = requests.get(my_url, params=params)
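
Passing `params` to `requests.get` appends the dictionary to the URL as a query string. The sketch below uses only the standard library to show roughly what that URL looks like. `requests` performs its own encoding, so the exact escaping it produces may differ slightly.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "paper/search"

params = {"query": "ChatGPT",
          "year": 2023,
          "offset": 0,
          "limit": 100,
          "fields": "title,year,authors"}

# urlencode turns the dictionary into "key=value" pairs joined by "&"
query_string = urlencode(params)
full_url = BASE_URL + VERSION + RESOURCE + "?" + query_string
print(full_url)
```

Printing the assembled URL is a quick way to check a request before sending it.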

In [8]:
r.json()["data"]

[{'paperId': '0d25a53184a9c56084416b292de9a8fef4b27347',
  'title': 'Tools such as ChatGPT threaten transparent science; here are our ground rules for their use',
  'year': 2023,
  'authors': []},
 {'paperId': '8f64f4633d9c482bb826b7a9fe9c1493837d7112',
  'title': 'Is ChatGPT A Good Translator? A Preliminary Study',
  'year': 2023,
  'authors': [{'authorId': '12386833', 'name': 'Wenxiang Jiao'},
   {'authorId': '2144328160', 'name': 'Wenxuan Wang'},
   {'authorId': '2161306685', 'name': 'Jen-tse Huang'},
   {'authorId': '2144800839', 'name': 'Xing Wang'},
   {'authorId': '2909321', 'name': 'Zhaopeng Tu'}]},
 {'paperId': '1f22de83d912176cb8857efa1c6d65b14d6a2f5c',
  'title': 'ChatGPT is not all you need. A State of the Art Review of large Generative AI models',
  'year': 2023,
  'authors': [{'authorId': '2200162353', 'name': 'Roberto Gozalo-Brizuela'},
   {'authorId': '1398645148', 'name': 'E.C. Garrido-Merchán'}]},
 {'paperId': 'eddfb9be78cfe94193766e3722eb0e56c3d24cef',
  'title': 'Ch
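
The `offset` and `limit` parameters page through the results: each request returns at most `limit` papers, starting at position `offset`. A minimal sketch of collecting several pages follows. `fetch_page` is a hypothetical stand-in for the `requests.get(...).json()` call, and stopping when a page comes back shorter than `limit` is an assumption about how the end of the results shows up.

```python
def collect_pages(fetch_page, limit=100, max_pages=5):
    """Gather results page by page until a short page signals the end."""
    results = []
    for page in range(max_pages):
        offset = page * limit
        data = fetch_page(offset, limit).get("data", [])
        results.extend(data)
        if len(data) < limit:  # fewer items than requested: no more pages
            break
    return results


# Hypothetical stand-in for the API: 250 fake papers served in slices
FAKE_PAPERS = [{"paperId": str(i)} for i in range(250)]

def fake_fetch(offset, limit):
    return {"data": FAKE_PAPERS[offset:offset + limit]}

papers = collect_pages(fake_fetch, limit=100)
print(len(papers))
```

In real use, `fetch_page` would wrap the `requests.get` call shown earlier, with `offset` and `limit` filled into `params`.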

# Week 2

In [11]:
import pandas as pd
import numpy as np
df2019 = pd.read_csv("2019_poster.csv")
df2020 = pd.read_csv("2020.csv")
df2021 = pd.read_csv("2021.csv")

names2019 = np.array(df2019.author)
names2020 = np.array(df2020.author)
names2021 = np.array(df2021.author)

names = np.concatenate([names2019, names2020, names2021])
names_final = np.unique(names)
df = pd.DataFrame(names_final, columns=['author'])
df.head()


Unnamed: 0,author
0,A. Gül Gökay Emel
1,Aaron Clauset
2,Aaron Cluaset
3,Aaron Halfaker
4,Aaron Schecter


In [19]:
df.to_csv("authors.csv", index=False)

In [18]:
names_final[0]

'A. Gül Gökay Emel'

In [154]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/search"

params = {'query':names_final[813],
          'fields': "papers.authors"
}

my_url = BASE_URL + VERSION + RESOURCE

params

{'query': 'Jennifer Pan', 'limit': 1, 'fields': 'papers.authors'}

In [136]:
import time
from tqdm import tqdm

BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/search"

my_url = BASE_URL + VERSION + RESOURCE

ids = []

for name in names_final:
    params = {'query':name,
          'limit': 1,
          'fields': "papers.authors"
          }

    r = requests.get(my_url, params=params)
    if len(r.json()['data']) == 0:
        continue
    if r.status_code == 200 and len(r.json()['data']) > 0:
        temp = r.json()['data'][0]['papers'][0]['authors']
        temp_ids = [author['authorId'] for author in temp]
        ids.extend(temp_ids)
    else:
        status = False
        while status == False:
            print("Sleepin'")
            time.sleep(30)
            r = requests.get(my_url, params=params)
            if len(r.json()['data']) == 0:
                continue
            if r.status_code == 200 and len(r.json()['data']) > 0:
                temp = r.json()['data'][0]['papers'][0]['authors']
                temp_ids = [int(author['authorId']) for author in temp]
                ids.extend(temp_ids)
                status = True
                
            
ids


KeyError: 'data'

In [141]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/search"

my_url = BASE_URL + VERSION + RESOURCE

ids = []

    

In [162]:
def get_request(name):
    params = {'query':name,
              'limit': 1,
              'fields': "papers.authors"
              }
    r = requests.get(my_url, params=params)
    if r.status_code != 200:
        # Wait, then retry and return the retried result (not the failed response)
        time.sleep(30)
        return get_request(name)

    return r.json()
    
def get_ids(r):
    if 'data' not in r.keys():
        return None
    
    if len(r['data']) == 0:
        return None

    if len(r['data'][0]['papers']) == 0:
        # append, not extend: extend would add each character of the ID string
        ids.append(r['data'][0]['authorId'])
        return None
        
    temp = r['data'][0]['papers'][0]['authors']
    temp_ids = [author['authorId'] for author in temp]
    ids.extend(temp_ids)
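
The retry-on-failure idea in `get_request` can also be written as a loop with a capped number of attempts, which avoids unbounded recursion when the API keeps failing. This is a sketch under assumptions, not the notebook's method: `fetch` is a hypothetical callable returning a `(status_code, payload)` pair, and `max_attempts` and `wait_seconds` are illustrative values.

```python
import time

def request_with_retry(fetch, max_attempts=3, wait_seconds=30, sleep=time.sleep):
    """Call fetch() until it reports status 200 or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status_code, payload = fetch()
        if status_code == 200:
            return payload
        if attempt < max_attempts:
            sleep(wait_seconds)  # pause before the next try
    return {}  # give up: an empty dict has no 'data' key


# Hypothetical fetch that fails twice (rate limited) and then succeeds
responses = iter([(429, None), (429, None), (200, {"data": [{"authorId": "1"}]})])
waits = []  # record the pauses instead of actually sleeping
result = request_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Returning an empty dictionary on failure lets `get_ids` skip that name, since it already returns early when `'data'` is missing.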

In [None]:
for name in tqdm(names_final):
    get_ids(get_request(name))

In [163]:
for name in tqdm(names_final[812:]):
    get_ids(get_request(name))

100%|██████████| 1095/1095 [36:48<00:00,  2.02s/it]  


In [168]:
#len(ids)
ids_short = pd.DataFrame(ids, columns=['authorId'])
ids_short.to_csv('ids_short.csv', index=False)


In [171]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/search"

params = {'query':names_final[100],
          'fields': "papers.authors"
}

my_url = BASE_URL + VERSION + RESOURCE

In [184]:
r = requests.get(my_url, params=params)
ost = []
for paper in r.json()['data'][0]['papers']:
    aut = paper['authors']
    aut_ids = [author['authorId'] for author in aut]
    ost.extend(aut_ids)
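
The loop above walks a nested response: the first search hit, then each of its papers, then each paper's authors. A self-contained sketch on a hand-written response with the same shape (the IDs are made up) shows the traversal, plus deduplication with a set:

```python
# Hand-written response mimicking the shape used above; IDs are made up
response = {
    "data": [
        {
            "authorId": "1",
            "papers": [
                {"authors": [{"authorId": "1"}, {"authorId": "2"}]},
                {"authors": [{"authorId": "1"}, {"authorId": "3"}]},
            ],
        }
    ]
}

# Flatten: every author of every paper of the first search hit
coauthor_ids = [
    author["authorId"]
    for paper in response["data"][0]["papers"]
    for author in paper["authors"]
]
unique_ids = sorted(set(coauthor_ids))
print(coauthor_ids)
print(unique_ids)
```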


## Get all IDs

In [None]:
all_ids = []

In [224]:
def get_request(name):
    params = {'query':name,
              'fields': "papers.authors"
              }
    r = requests.get(my_url, params=params)

    if r.status_code == 500:
        return {}
        
    if r.status_code != 200:
        # Wait, then retry and return the retried result (not the failed response)
        time.sleep(30)
        return get_request(name)

    return r.json()
    
def get_all_ids(r):
    if 'data' not in r.keys():
        return None
    
    if len(r['data']) == 0:
        return None

    if len(r['data'][0]['papers']) == 0:
        # No papers: keep the author's own ID (append the whole string, into all_ids)
        all_ids.append(r['data'][0]['authorId'])
        return None
        
    for paper in r['data'][0]['papers']:
        aut = paper['authors']
        aut_ids = [author['authorId'] for author in aut]
        all_ids.extend(aut_ids)


In [238]:
for name in tqdm(names_final):
    get_all_ids(get_request(name))


In [227]:
authorIds = list(set(all_ids))

In [237]:
all_authors = pd.DataFrame(authorIds, columns=['authorId'])
all_authors.to_csv("all_authors.csv", index=False)


## Data set construction

We now have all the author IDs we want, so we can construct the data sets.

In [272]:
all_authors.head()

Unnamed: 0,authorId
0,2773799
1,1399712184
2,50197963
3,3530609
4,84242025


In [395]:
BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/batch"
FIELDS = "?fields=name,aliases,papers.title,papers.abstract,papers.year,papers.externalIds,papers.s2FieldsOfStudy,papers.citationCount,papers.authors"

params = {"ids": ['2773799','1399712184']}

my_url = BASE_URL + VERSION + RESOURCE + FIELDS



In [422]:
r = requests.post(my_url, json=params)
r.json()[0]['papers'][0]


{'paperId': '825f85375bba977cd3ad78ac1ba22c7fae5609fb',
 'externalIds': {'PubMedCentral': '9045324',
  'DOI': '10.1128/spectrum.02434-21',
  'CorpusId': 247938320,
  'PubMed': '35377231'},
 'title': 'Reference-Grade Genome and Large Linear Plasmid of Streptomyces rimosus: Pushing the Limits of Nanopore Sequencing',
 'abstract': 'The genomes of Streptomyces species are difficult to assemble due to long repeats, extrachromosomal elements (giant linear plasmids [GLPs]), rearrangements, and high GC content. To improve the quality of the S. rimosus ATCC 10970 genome, producer of oxytetracycline, we validated the assembly of GLPs by applying a new approach to combine pulsed-field gel electrophoresis separation and GLP isolation and sequenced the isolated GLP with Oxford Nanopore technology. ABSTRACT Streptomyces rimosus ATCC 10970 is the parental strain of industrial strains used for the commercial production of the important antibiotic oxytetracycline. As an actinobacterium with a large lin

In [613]:
author_data = {
    "ids": [],
    "names": [],
    "aliases": [],
    "citationCount": [],
    "field": []
    }

paper_data = {
    "paperId": [],
    "title": [],
    "year": [],
    "DOI": [],
    "citationCount": [],
    "fields": [],
    "authorIds": []
}

abstract_data = {
    "paperId": [],
    "abstract": []
}

In [614]:
from collections import Counter

BASE_URL = "https://api.semanticscholar.org/graph/"
VERSION = "v1/"
RESOURCE = "author/batch"
FIELDS = "?fields=name,aliases,citationCount,papers.title,papers.abstract,papers.year,papers.externalIds,papers.s2FieldsOfStudy,papers.citationCount,papers.authors"

params = {"ids": ['2773799','1399712184']}

my_url = BASE_URL + VERSION + RESOURCE + FIELDS


def most_frequent(items):
    # Return the most common element of a non-empty list
    occurrence_count = Counter(items)
    return occurrence_count.most_common(1)[0][0]

def get_data(idx, url):
    ids = list(all_authors["authorId"][idx:idx+30])
    r = requests.post(url, json={"ids": ids})

    if r.status_code == 500 or r.status_code == 504:
        print(f'Failed on {idx} with status code {r.status_code}')
        return None
        
    if r.status_code != 200:
        print(f"Got status code: {r.status_code}. Trying again!")
        time.sleep(30)
        # Return the retried result rather than the failed response
        return get_data(idx, url)

    return r.json()

def make_data(r):
    if r is None:
        return None
    for author in r:
        author_data['ids'].append(author['authorId'])
        author_data['names'].append(author['name'])
        author_data['aliases'].append(str(author['aliases']))
        author_data['citationCount'].append(int(author['citationCount']))
        temp_fields = []
        for paper in author['papers']:
            if 'DOI' in paper['externalIds'].keys():
                paper_data["paperId"].append(paper['paperId'])
                paper_data["title"].append(paper['title'])
                paper_data["year"].append(int(paper['year']) if paper['year'] is not None else 0)
                paper_data["citationCount"].append(int(paper['citationCount']))
                paper_data["fields"].append(str([field['category'] for field in paper['s2FieldsOfStudy']]))
                temp_fields.extend([field['category'] for field in paper['s2FieldsOfStudy']])
                aut = paper['authors']
                paper_data["authorIds"].append([author['authorId'] for author in aut])
                paper_data["DOI"].append(paper['externalIds']['DOI'])
                abstract_data['paperId'].append(paper['paperId'])
                abstract_data["abstract"].append(paper['abstract'])
        author_data['field'].append(most_frequent(temp_fields) if len(temp_fields) > 0 else "None")
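
`make_data` assigns each author the field that occurs most often across their papers. The sketch below isolates that step on a made-up list of categories (the values are illustrative, not taken from the API) and folds the empty-list fallback into the helper.

```python
from collections import Counter

def most_frequent(items):
    """Return the most common element, or "None" for an empty list."""
    if not items:
        return "None"
    return Counter(items).most_common(1)[0][0]

# Illustrative categories gathered from one author's papers
temp_fields = ["Psychology", "Economics", "Psychology", "Art"]
print(most_frequent(temp_fields))
print(most_frequent([]))
```

When two fields tie, `Counter.most_common` returns the one encountered first, so the result depends on paper order.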
        

In [615]:
for idx in tqdm(range(0, len(all_authors), 30)):
    make_data(get_data(idx, my_url))
    

  0%|          | 2/3523 [00:08<4:24:23,  4.51s/it]

Failed on 30 with status code 500


  1%|          | 22/3523 [01:40<5:38:33,  5.80s/it]

Failed on 630 with status code 500


  1%|          | 30/3523 [02:31<5:04:49,  5.24s/it]

Failed on 870 with status code 500


  1%|          | 35/3523 [03:16<11:17:44, 11.66s/it]

Failed on 1020 with status code 504


  1%|          | 38/3523 [03:32<7:32:03,  7.78s/it] 

Failed on 1110 with status code 500


  1%|          | 39/3523 [03:41<7:41:18,  7.94s/it]

Failed on 1140 with status code 500


  1%|▏         | 45/3523 [04:13<7:03:29,  7.31s/it]

Failed on 1320 with status code 500


  1%|▏         | 51/3523 [04:44<6:37:24,  6.87s/it]

Failed on 1500 with status code 500


  2%|▏         | 53/3523 [04:53<5:36:49,  5.82s/it]

Failed on 1560 with status code 500


  2%|▏         | 56/3523 [05:22<9:20:12,  9.70s/it]

Failed on 1650 with status code 500


  2%|▏         | 58/3523 [05:34<7:40:30,  7.97s/it]

Failed on 1710 with status code 500


  2%|▏         | 64/3523 [06:19<11:13:42, 11.69s/it]

Failed on 1890 with status code 504


  2%|▏         | 66/3523 [06:27<7:35:14,  7.90s/it] 

Failed on 1950 with status code 500


  2%|▏         | 71/3523 [06:54<6:28:30,  6.75s/it]

Failed on 2100 with status code 500


  2%|▏         | 78/3523 [07:28<5:00:36,  5.24s/it]

Failed on 2310 with status code 500


  2%|▏         | 81/3523 [07:40<4:24:08,  4.60s/it]

Failed on 2400 with status code 500


  2%|▏         | 87/3523 [08:08<5:23:27,  5.65s/it]

Failed on 2580 with status code 500


  3%|▎         | 89/3523 [08:17<4:50:45,  5.08s/it]

Failed on 2640 with status code 500


  3%|▎         | 92/3523 [08:46<8:32:18,  8.96s/it]

Failed on 2730 with status code 500


  3%|▎         | 94/3523 [08:57<7:10:31,  7.53s/it]

Failed on 2790 with status code 500


  3%|▎         | 106/3523 [09:53<5:55:05,  6.24s/it]

Failed on 3150 with status code 500


  3%|▎         | 109/3523 [10:09<5:39:29,  5.97s/it]

Failed on 3240 with status code 500


  4%|▎         | 126/3523 [11:42<8:24:50,  8.92s/it]

Failed on 3750 with status code 500


  4%|▎         | 129/3523 [11:54<5:31:42,  5.86s/it]

Failed on 3840 with status code 500


  4%|▍         | 138/3523 [12:44<6:11:26,  6.58s/it]

Failed on 4110 with status code 500


  4%|▍         | 142/3523 [13:06<5:53:03,  6.27s/it]

Failed on 4230 with status code 500


  4%|▍         | 148/3523 [13:44<6:55:21,  7.38s/it]

Failed on 4410 with status code 500


  4%|▍         | 149/3523 [13:54<7:31:59,  8.04s/it]

Failed on 4440 with status code 500


  4%|▍         | 151/3523 [14:07<7:01:06,  7.49s/it]

Failed on 4500 with status code 500


  4%|▍         | 157/3523 [14:34<4:39:28,  4.98s/it]

Failed on 4680 with status code 500


  5%|▍         | 163/3523 [15:06<5:26:50,  5.84s/it]

Failed on 4860 with status code 500


  5%|▍         | 171/3523 [15:47<5:42:13,  6.13s/it]

Failed on 5100 with status code 500


  5%|▍         | 173/3523 [16:03<6:45:52,  7.27s/it]

Failed on 5160 with status code 500


  5%|▍         | 175/3523 [16:14<5:59:59,  6.45s/it]

Failed on 5220 with status code 500


  5%|▌         | 179/3523 [16:59<11:45:31, 12.66s/it]

Failed on 5340 with status code 504


  5%|▌         | 184/3523 [17:33<7:20:05,  7.91s/it] 

Failed on 5490 with status code 500


  5%|▌         | 186/3523 [17:46<6:47:02,  7.32s/it]

Failed on 5550 with status code 500


  6%|▌         | 199/3523 [18:48<4:59:36,  5.41s/it]

Failed on 5940 with status code 500


  6%|▌         | 215/3523 [20:07<7:44:38,  8.43s/it]

Failed on 6420 with status code 500


  6%|▌         | 215/3523 [2:48:51<43:18:09, 47.13s/it]


KeyboardInterrupt: 
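
The log above shows many batches failing with status 500 or 504, and the authors in those batches are skipped. One option is to record the offsets of failed batches so they can be retried later, perhaps with a smaller batch size. A minimal sketch, where `post_batch` is a hypothetical stand-in for the `requests.post` call and returns a `(status_code, payload)` pair:

```python
def run_batches(author_ids, post_batch, batch_size=30):
    """Process IDs in fixed-size batches and remember which offsets failed."""
    succeeded, failed_offsets = [], []
    for idx in range(0, len(author_ids), batch_size):
        batch = author_ids[idx:idx + batch_size]
        status_code, payload = post_batch(batch)
        if status_code == 200:
            succeeded.extend(payload)
        else:
            failed_offsets.append(idx)  # keep the offset for a later retry
    return succeeded, failed_offsets


# Hypothetical server: fails whenever a batch contains the ID "30"
def fake_post(batch):
    if "30" in batch:
        return 500, None
    return 200, [{"authorId": author_id} for author_id in batch]

demo_ids = [str(i) for i in range(70)]
authors, failed = run_batches(demo_ids, fake_post)
print(len(authors), failed)
```

The list of failed offsets can then be fed back through the same function once the main pass has finished.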

In [623]:
adf = pd.DataFrame(author_data,  columns=author_data.keys())
adf.head()
adf.to_csv('author_df.csv', index=False)



In [624]:
pdf = pd.DataFrame(paper_data, columns=paper_data.keys())
pdf.describe()
pdf.to_csv('paper_df.csv', index=False)

In [621]:
pdf[pdf['year'] == 0]

Unnamed: 0,paperId,title,year,DOI,citationCount,fields,authorIds
517,18b329ac2e91458b0fed24103fb4d8b3a8b00548,quity and Fairness of Single- vs. Double-Blind...,0,10.31234/osf.io/q2tkw,0,['Psychology'],"['2294049', '4421835', '3691024', '3376613']"
1753,2250f3dde8786515473bc0b03280420ac10d8256,A sustainability assessment framework for geno...,0,10.1016/j.aquaculture.2022.738803,0,['Economics'],"['2120789929', '4139040']"
2439,37c960d56b5f3ef16c16883f48dac3179f05fd61,"Review of ""The Weirdest People in the World""",0,10.31219/osf.io/ufjya,1,"['Psychology', 'Art']",['2352183']
11755,09ead28addb05e98c8adf1a58739d34e72eddd8c,Functional architecture of the aging brain,0,10.1101/2021.03.31.437922,5,['Psychology'],"['6939215', '1404691578', '6589755', '8335807'..."
13416,1fdb8fd4a54e8cc65d8c6c4bfe1d991a26817892,Introduction,0,10.4324/9781003041566-1,0,[],"['116605617', '30210596']"
...,...,...,...,...,...,...,...
310368,2d0d7b40cded3f3bf3aa46d53cef64f44dd3e20d,Procapra Przewalskii Dataset Collecting from 7...,0,10.3974/geodb.2021.10.02.v1,0,['Environmental Science'],"['2111280100', '153552141', '1962660', '898836..."
310369,64a69a422e014139d1a377d7acabfaf298f79232,Waterfowl Habitat and Migration Dataset Collec...,0,10.3974/geodb.2021.10.01.v1,0,['Environmental Science'],"['2111280100', '153552141', '1962660', '898836..."
310370,8f61e6078db7d82db08ac658b5679c75f3e65afc,In situ Dataset on Vegetation from 28 Sample S...,0,10.3974/geodb.2021.10.04.v1,0,['Environmental Science'],"['2115554316', '153552141', '89883663', '21112..."
312641,a2ee6a61917fa6c08952ca4a6ca25b78e8a44093,Are media exposure and self-control related?,0,10.32920/ryerson.14648913.v1,0,['Psychology'],['152130182']


In [625]:
abdf = pd.DataFrame(abstract_data, columns=abstract_data.keys())
abdf.head()
abdf.to_csv('abstract_df.csv', index=False)