# Papers With Code API

# Setup

## Instructions

This notebook uses the Papers With Code API. Follow these steps to obtain the necessary credentials:
1. Create a Papers With Code account at https://paperswithcode.com/accounts/register?next=/
2. After logging in, click on the user account icon in the top right corner, and click on 'Get API token'
3. Click on 'Generate API Token'
4. Load API key:
    - For repeated use, follow the ```pickle_tutorial.ipynb``` instructions to create a ```./credentials.pkl``` file that holds a dictionary containing the entry ```{'PAPERSWITHCODE_TOKEN': MYKEY}```, with MYKEY being your API key.
    - For occasional use, run the credentials cell and paste your API key when prompted.
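
The first option in step 4 can be sketched as follows. This is a minimal illustration, not the contents of `pickle_tutorial.ipynb`; the helper name `save_credentials`, the temporary path, and the placeholder key `'MYKEY'` are assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_credentials(path, token):
    """Store the API key in a pickled dictionary at `path` (hypothetical helper).

    Entries already present in the file are preserved, so keys for other
    APIs stored in the same file are not overwritten.
    """
    credentials = {}
    if os.path.exists(path):
        with open(path, 'rb') as f:
            credentials = pickle.load(f)

    credentials['PAPERSWITHCODE_TOKEN'] = token

    with open(path, 'wb') as f:
        pickle.dump(credentials, f)


# Demonstrate with a temporary file and a placeholder key.
# In the notebook, the path would be './credentials.pkl' and the token your own key.
demo_path = os.path.join(tempfile.mkdtemp(), 'credentials.pkl')
save_credentials(demo_path, 'MYKEY')

with open(demo_path, 'rb') as f:
    loaded = pickle.load(f)
```

Loading a pickle file can execute arbitrary code, so only load a `credentials.pkl` that you created yourself.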

## Additional Information

Documentation Guide:
- Papers With Code API ([Papers With Code](https://paperswithcode.com/api/v1/docs/))
- Papers With Code API ([readthedocs](https://paperswithcode-client.readthedocs.io/en/latest/))

## Overview of workflow

<img src="../images/PapersWithCode_workflow.jpg" width=600 height=600 align="left"/>

## Imports

In [1]:
# Import libraries
import requests
import pandas as pd
import pickle
from flatten_json import flatten
from collections import OrderedDict
from tqdm import tqdm

In [2]:
# Load credentials from the pickle file; prompt for the key if the file or entry is missing
try:
    with open('credentials.pkl', 'rb') as credentials:
        PWC_TOKEN = pickle.load(credentials)['PAPERSWITHCODE_TOKEN']
except (FileNotFoundError, KeyError):
    PWC_TOKEN = input('Please enter your Papers With Code API Key: ')

## Setup

In [3]:
BASE_URL = 'https://paperswithcode.com/api/v1'

In [5]:
search_params = {'page': 1, 'items_per_page': 100}
r = requests.get(f'{BASE_URL}/papers', params=search_params)

In [6]:
r.json()

{'count': 804583,
 'next': 'https://paperswithcode.com/api/v1/papers/?items_per_page=100&page=2',
 'previous': None,
 'results': [{'id': '007-democratically-finding-the-cause-of',
   'arxiv_id': '1802.07222',
   'nips_id': None,
   'url_abs': 'http://arxiv.org/abs/1802.07222v1',
   'url_pdf': 'http://arxiv.org/pdf/1802.07222v1.pdf',
   'title': '007: Democratically Finding The Cause of Packet Drops',
   'abstract': 'Network failures continue to plague datacenter operators as their symptoms\nmay not have direct correlation with where or why they occur. We introduce 007,\na lightweight, always-on diagnosis application that can find problematic links\nand also pinpoint problems for each TCP connection. 007 is completely contained\nwithin the end host. During its two month deployment in a tier-1 datacenter, it\ndetected every problem found by previously deployed monitoring tools while also\nfinding the sources of other problems previously undetected.',
   'authors': ['Arzani Behnaz',
    '

## Query #1: query API based on search types

Function `get_all_search_outputs` queries the Papers With Code API for every search type specified and returns the results as a dictionary of DataFrames, one per search type, keyed by the tuple `(search_type,)`.
- Calls function `get_individual_search_output`

In [4]:
def get_all_search_outputs(search_types, flatten_output=False):
    """Call the Papers With Code API for each search type.
    Results are returned in results[(search_type,)] = df.
    
    Parameters
    ----------
    search_types : list-like 
        Collection of search types to query over.
    flatten_output : boolean, optional (default=False)
        Flag for flattening nested columns of output.
    
    Returns
    -------
    results : OrderedDict
        Dictionary consisting of returned DataFrames from get_individual_search_output for each query.
    """
    
    results = OrderedDict()

    for search_type in search_types:
        results[(search_type,)] = get_individual_search_output(search_type, flatten_output)
        
    return results

Function `_conduct_search_over_pages` is a helper function that iterates over the pages of search results and accumulates them in a single DataFrame.

In [5]:
def _conduct_search_over_pages(search_url, search_params, flatten_output=False):
    search_df = pd.DataFrame()
    
    # Conduct the initial search and extract the JSON results
    response = requests.get(url=search_url, params=search_params)
    output = response.json()

    # Loop while the current page has results.
    # The `page < 2` cap means only the first page is retrieved.
    while output.get('results') and search_params['page'] < 2:
        # Flatten nested json
        if flatten_output:
            output = [flatten(result) for result in output['results']]
        else:
            output = output['results']

        # Add results to cumulative DataFrame
        output_df = pd.DataFrame(output)
        output_df['page'] = search_params['page']

        search_df = pd.concat([search_df, output_df]).reset_index(drop=True)

        # Increment page for search
        search_params['page'] += 1
        
        # Request the next page
        response = requests.get(url=search_url, params=search_params)
        
        # 200: OK, 404: page not found.
        # On any other status, report the error and skip ahead to the following page.
        while response.status_code not in [200, 404]:
            print(f'Search error {response.status_code} on page {search_params["page"]}')
            search_params['page'] += 1
            response = requests.get(url=search_url, params=search_params)
            
        # Extract json results
        output = response.json()
    
    return search_df
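
The response shown earlier also carries a `next` field with the URL of the following page. An alternative to incrementing `page` by hand is to follow that field until it is empty. The sketch below is a minimal illustration of that idea, not part of the notebook's workflow: `collect_pages`, the `fetch` callable, and the stubbed pages are hypothetical, and it assumes `next` is `None` on the last page.

```python
def collect_pages(first_url, fetch, max_pages=None):
    """Follow 'next' links until they run out or max_pages is reached.

    fetch : callable taking a URL and returning the decoded JSON dict
            (hypothetical; in the notebook it could wrap requests.get).
    """
    results = []
    url = first_url
    pages_read = 0

    while url is not None and (max_pages is None or pages_read < max_pages):
        payload = fetch(url)
        results.extend(payload.get('results', []))
        url = payload.get('next')  # None ends the loop
        pages_read += 1

    return results


# Stub standing in for the API so the sketch runs offline
FAKE_PAGES = {
    'page-1': {'results': [{'id': 'a'}, {'id': 'b'}], 'next': 'page-2'},
    'page-2': {'results': [{'id': 'c'}], 'next': None},
}

all_rows = collect_pages('page-1', FAKE_PAGES.__getitem__)
first_page_only = collect_pages('page-1', FAKE_PAGES.__getitem__, max_pages=1)
```

Passing `max_pages=1` plays the same role as the `page < 2` cap in the helper above.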

Function `get_individual_search_output` queries the Papers With Code API for a single search type (`'conferences'`, `'datasets'`, `'evaluations'`, `'papers'`, or `'tasks'`).
- Calls function `_conduct_search_over_pages`
- Result is a dataframe (one dataframe per search type)

In [6]:
def get_individual_search_output(search_type, flatten_output=False):
    """Calls the Papers With Code API with the specified search type and returns the search output results.
    
    Parameters
    ----------
    search_type : str
        Must be in ('conferences', 'datasets', 'evaluations', 'papers', 'tasks').
    flatten_output : boolean, optional (default=False)
        Flag for flattening nested columns of output.
   
    Returns
    -------
    DataFrame
        DataFrame containing the output of the search query.
    """
    
    # Make sure our input is valid
    assert search_type in ('conferences', 'datasets', 'evaluations', 'papers', 'tasks'), \
        f'Invalid search type "{search_type}"'
    
    # Set search variables
    start_page = 1
    page_size = 500  # Seems to be the maximum page size
    search_url = f'{BASE_URL}/{search_type}'
    
    search_params = {
        'page': start_page,
        'items_per_page': page_size
        }
    
    return _conduct_search_over_pages(search_url, search_params, flatten_output)

#### Run query #1 functions - example

In [7]:
search_types = ['papers']

In [8]:
search_output_dict = get_all_search_outputs(search_types, flatten_output=True)

In [9]:
search_output_dict[('papers',)]

Unnamed: 0,id,arxiv_id,nips_id,url_abs,url_pdf,title,abstract,authors_0,authors_1,authors_2,...,authors_134,authors_135,authors_136,authors_137,authors_138,authors_139,authors_140,authors_141,authors_142,page
0,007-democratically-finding-the-cause-of,1802.07222,,http://arxiv.org/abs/1802.07222v1,http://arxiv.org/pdf/1802.07222v1.pdf,007: Democratically Finding The Cause of Packe...,Network failures continue to plague datacenter...,Arzani Behnaz,Ciraci Selim,Chamon Luiz,...,,,,,,,,,,1
1,0-1-phase-transitions-in-sparse-spiked-matrix,1911.05030,,https://arxiv.org/abs/1911.05030v1,https://arxiv.org/pdf/1911.05030v1.pdf,0-1 phase transitions in sparse spiked matrix ...,We consider statistical models of estimation o...,Jean Barbier,Nicolas Macris,,...,,,,,,,,,,1
2,02-dualities-and-the-4-simplex,1905.05173,,http://arxiv.org/abs/1905.05173v2,http://arxiv.org/pdf/1905.05173v2.pdf,"(0,2) Dualities and the 4-Simplex","We propose that a simple, Lagrangian 2d $\math...",Dimofte Tudor,Paquette Natalie M.,,...,,,,,,,,,,1
3,02-hybrid-models,1712.04976,,http://arxiv.org/abs/1712.04976v2,http://arxiv.org/pdf/1712.04976v2.pdf,"(0,2) hybrid models","We introduce a class of (0,2) superconformal f...",Bertolini Marco,Plesser M. Ronen,,...,,,,,,,,,,1
4,02-mirror-symmetry-on-homogeneous-hopf,2012.01851,,https://arxiv.org/abs/2012.01851v2,https://arxiv.org/pdf/2012.01851v2.pdf,"(0,2) Mirror Symmetry on homogeneous Hopf surf...","In this work we find the first examples of (0,...",Luis Álvarez-Cónsul,Andoni De Arriba de La Hera,Mario Garcia-Fernandez,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,2-5d-root-of-trust-secure-system-level,2009.02412,,http://arxiv.org/abs/2009.02412v2,http://arxiv.org/pdf/2009.02412v2.pdf,2.5D Root of Trust: Secure System-Level Integr...,"Dedicated, after acceptance and publication, i...",Nabeel Mohammed,Ashraf Mohammed,Patnaik Satwik,...,,,,,,,,,,1
496,2-5d-visual-relationship-detection,2104.12727,,https://arxiv.org/abs/2104.12727v1,https://arxiv.org/pdf/2104.12727v1.pdf,2.5D Visual Relationship Detection,Visual 2.5D perception involves understanding ...,Yu-Chuan Su,Soravit Changpinyo,Xiangning Chen,...,,,,,,,,,,1
497,25d-visual-sound,1812.04204,,http://arxiv.org/abs/1812.04204v4,http://arxiv.org/pdf/1812.04204v4.pdf,2.5D Visual Sound,Binaural audio provides a listener with 3D sou...,Ruohan Gao,Kristen Grauman,,...,,,,,,,,,,1
498,25-mev-solar-proton-events-in-cycle-24-and,1604.07873,,http://arxiv.org/abs/1604.07873v2,http://arxiv.org/pdf/1604.07873v2.pdf,25 MeV Solar Proton Events in Cycle 24 and Pre...,We summarize observations of around a thousand...,Richardson Ian G.,von Rosenvinge Tycho T.,Cane Hilary V.,...,,,,,,,,,,1
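
With `flatten_output=True`, each record is passed through `flatten` from `flatten_json`, which is why the `authors` list appears as the columns `authors_0`, `authors_1`, and so on in the table above. The function below is a simplified stand-in written for illustration, not the library's implementation; it only shows how nested keys and list positions are joined with underscores.

```python
def flatten_record(record, parent_key='', sep='_'):
    """Simplified stand-in for flatten_json.flatten (illustrative only)."""
    if isinstance(record, dict):
        items = record.items()
    elif isinstance(record, list):
        items = enumerate(record)
    else:
        # Scalar at the top level: nothing to flatten
        return {parent_key: record}

    flat = {}
    for key, value in items:
        new_key = f'{parent_key}{sep}{key}' if parent_key else str(key)
        if isinstance(value, (dict, list)):
            # Recurse into nested containers, extending the key prefix
            flat.update(flatten_record(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Record abbreviated from the second row of the table above
record = {
    'id': '0-1-phase-transitions-in-sparse-spiked-matrix',
    'authors': ['Jean Barbier', 'Nicolas Macris'],
}
flat_record = flatten_record(record)
```

Each paper contributes one `authors_N` column per author, so the paper with the most authors determines how many columns the DataFrame has; here the columns run up to `authors_142`.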


## Query #2: query API for full metadata for hits from initial query

Function `get_query_metadata` retrieves the metadata (`methods`, `repositories`, `results`, and `tasks`) for each object path and collects it in DataFrames. The request URL is built under `/papers/`, so the object paths are expected to be paper IDs.
- Calls function `_conduct_search_over_pages`
- Output is a dictionary with one DataFrame per metadata type

In [10]:
def get_query_metadata(object_paths, flatten_output=False):
    """Retrieves the metadata for the paper or papers listed in object_paths.
    
    Parameters
    ----------
    object_paths : str/list-like
        String or list of strings containing the paper IDs used as URL paths.
    flatten_output : boolean, optional (default=False)
        Flag for flattening nested columns of output.
    
    Returns
    -------
    metadata_dict : dict
        Dictionary of DataFrames, keyed by (metadata_type,), for the requested papers.
    """
    
    # If a single object path is provided as a string, wrap it in a list
    if isinstance(object_paths, str):
        object_paths = [object_paths]
    
    # Make sure our input is valid
    assert len(object_paths) > 0, 'Please enter at least one object id'
    
    metadata_types = ('methods', 'repositories', 'results', 'tasks')
    
    start_page = 1
    metadata_dict = dict()
    
    # Query each metadata type in turn
    for metadata_type in metadata_types:
        search_df = pd.DataFrame()
        print(f'Querying {metadata_type}')
        
        # Searches over each object
        for object_path in tqdm(object_paths):
            search_url = f'{BASE_URL}/papers/{object_path}/{metadata_type}'
            search_params = {'page': start_page}

            # Conduct the search and tag each row with the object it belongs to.
            # The helper already records each row's page in the 'page' column,
            # so it is not overwritten here with the final page counter.
            object_df = _conduct_search_over_pages(search_url, search_params, flatten_output)
            object_df['id'] = object_path
            
            # Merge with the cumulative search DataFrame
            search_df = pd.concat([search_df, object_df]).reset_index(drop=True)
            
        metadata_dict[(metadata_type, )] = search_df

    return metadata_dict

Function `get_all_metadata` loops over the search results from query #1 and stores the metadata for each in an `OrderedDict` whose keys and order match `search_output_dict`.
- Calls function `get_query_metadata`

In [11]:
def get_all_metadata(search_output_dict, flatten_output=False):
    """Retrieves all of the metadata that relates to the provided DataFrames.
    
    Parameters
    ----------
    search_output_dict : dict
        Dictionary of DataFrames from get_all_search_outputs.
    flatten_output : boolean, optional (default=False)
        Flag for flattening nested columns of output.
      
    Returns
    -------
    metadata_dict : OrderedDict
        OrderedDict of DataFrames with metadata for each query.
        Order matches the order of search_output_dict.
    """
    metadata_dict = OrderedDict()
    for query, df in search_output_dict.items():
        print(f'Retrieving {query} metadata')
        # Create object paths
        object_paths = df.id.values

        metadata_dict[query] = get_query_metadata(object_paths, flatten_output)
    
    return metadata_dict

#### Run query #2 functions - example

In [12]:
metadata_dict = get_all_metadata(search_output_dict, flatten_output=True)

  0%|          | 0/500 [00:00<?, ?it/s]

Retrieving ('papers',) metadata
Querying methods


100%|██████████| 500/500 [05:39<00:00,  1.47it/s]
  0%|          | 0/500 [00:00<?, ?it/s]

Querying repositories


100%|██████████| 500/500 [06:19<00:00,  1.32it/s]
  0%|          | 0/500 [00:00<?, ?it/s]

Querying results


100%|██████████| 500/500 [05:00<00:00,  1.67it/s]
  0%|          | 0/500 [00:00<?, ?it/s]

Querying tasks


100%|██████████| 500/500 [06:01<00:00,  1.38it/s]
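
The progress bars above give a sense of what this step costs, since every paper is queried once per metadata type. The arithmetic below only totals the durations logged in this particular run; timings will vary with network conditions and API load.

```python
# Durations copied from the progress bars above, as (minutes, seconds)
logged = {
    'methods': (5, 39),
    'repositories': (6, 19),
    'results': (5, 0),
    'tasks': (6, 1),
}
papers_per_type = 500

total_seconds = sum(m * 60 + s for m, s in logged.values())

# Lower bound: the helper issues at least one request per paper and type
min_requests = papers_per_type * len(logged)

minutes, seconds = divmod(total_seconds, 60)
print(f'At least {min_requests} requests in {minutes} min {seconds} s')
```

To try the workflow on a smaller sample, one option is to trim each DataFrame in `search_output_dict` (for example with `df.head(20)`) before passing the dictionary to `get_all_metadata`.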


### Take a look at the results

Since we stored the metadata and DataFrames in our dictionaries via tuple keys, we index `metadata_dict` as:

```metadata_dict[('SEARCH_TYPE',)][('METADATA_TYPE', )]```

Note that the tuple keys each have a comma after the sole value in order to preserve the tuple structure and relate in form to the other notebooks used in this project.
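
The trailing comma matters because parentheses alone do not create a tuple in Python. The short sketch below illustrates this with plain strings standing in for the DataFrames; `metadata_example` and its values are made up for the example.

```python
# Plain strings stand in for the DataFrames (illustrative values only)
metadata_example = {
    ('papers',): {
        ('methods',): 'methods DataFrame',
        ('results',): 'results DataFrame',
    }
}

# With the trailing comma, the key is a one-element tuple and the lookup succeeds
results_entry = metadata_example[('papers',)][('results',)]

# Without it, the parentheses are only grouping and the key is a plain string
key_without_comma = ('papers')
lookup_fails = key_without_comma not in metadata_example
```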

In [13]:
# Check which metadata options we have access to
for key, dict_ in metadata_dict.items():
    print(f'{key[0]}: {[item[0] for item in dict_.keys()]}')

papers: ['methods', 'repositories', 'results', 'tasks']


In [14]:
metadata_dict[('papers',)][('results',)]

Unnamed: 0,id,best_rank,metrics_10%,methodology,uses_additional_data,paper,best_metric,evaluated_on,evaluation,metrics_Top-1 Accuracy,...,metrics_10 way 1~2 shot,metrics_Mean AP @ 0.5,metrics_F1,metrics_MOTA,metrics_mask AP,metrics_AP50,metrics_AP75,metrics_AR1,metrics_AR10,metrics_Track mAP
0,007-democratically-finding-the-cause-of,1.0,"test'""><svg/onload=alert(9)>","test'""><svg/onload=alert(9)>",False,007-democratically-finding-the-cause-of,10%,2021-08-19,testp-on-100doh,,...,,,,,,,,,,
1,007-democratically-finding-the-cause-of,,,MMV,False,007-democratically-finding-the-cause-of,,2018-02-20,audio-classification-on-esc-50,88.9,...,,,,,,,,,,
2,007-democratically-finding-the-cause-of,5.0,,TResNet-L,False,007-democratically-finding-the-cause-of,mAP,2018-02-20,multi-label-classification-on-ms-coco,,...,,,,,,,,,,
3,007-democratically-finding-the-cause-of,2.0,,WideResNet-28-10,False,007-democratically-finding-the-cause-of,Percentage error,2018-02-20,image-classification-on-svhn,,...,,,,,,,,,,
4,007-democratically-finding-the-cause-of,3.0,,AutoAugment,False,007-democratically-finding-the-cause-of,Percentage error,2018-02-20,image-classification-on-svhn,,...,,,,,,,,,,
5,02-hybrid-models,,,v,False,02-hybrid-models,,2018-07-24,3d-face-animation-on-2d-3d-s,,...,,,,,,,,,,
6,02-hybrid-models,5.0,,DeepDeblur-PyTorch,True,02-hybrid-models,PSNR,2018-07-24,deblurring-on-gopro,,...,,,,,,,,,,
7,04-brane-box-models,,,hi,True,04-brane-box-models,,2021-06-10,image-to-image-translation-on-2017-test-set,,...,457.0,,,,,,,,,
8,0-4-dualities,1.0,,HAIS,False,0-4-dualities,Mean AP @ 0.5,2021-03-17,3d-instance-segmentation-on-scannetv2,,...,,69.9,,,,,,,,
9,0-step-capturability-motion-decomposition-and,1.0,12,Ashok,False,0-step-capturability-motion-decomposition-and,10%,2020-05-30,talking-head-generation-on-100-sleep-nights,,...,,,,,,,,,,
