# Request data from PLOS API and save to CSV file
***

## Background

The aim of this project is to determine the degree of consensus in contentious academic fields.

Collect the title, publication date and summary of scholarly articles that contain one or more keywords. Apply NLP models to this data to identify and categorise concepts in the field and to test for statistically significant differences between opposing 'truths', if any. Ranking these groups by weighted influence should indicate the degree of consensus among the various approaches in a given academic field.

To this end, the academic_consensus model will search the abstracts of academic papers that contain the keyword "nutrition".

## Approach

Of the APIs considered, this model uses the Public Library of Science (PLOS) API because it supports searching within abstracts. According to the [PLOS API Documentation](http://alm.plos.org/users/me), no API key is required to access almost all works; the exception is sources that do not allow redistribution of data. An API key is otherwise only required to add or update works.

Search fields to include:
- publication_date
- title
- conclusions

## Request
### Packages and setup

In [5]:
import pandas as pd
import requests
import time

from IPython.core.interactiveshell import InteractiveShell

In [6]:
# Set workspace

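# Set output width to 110 characters (not 79)
pd.options.display.width = 110
# Control which results a cell displays: 'last' shows only the final statement's
# result; use 'all' to display the result of every expression in the cell
InteractiveShell.ast_node_interactivity = 'last'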

### Input parameters

In [7]:
keyword = 'nutrition'
API_KEY = 'YrBVAstBzZkfnsD_WzpX'
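
A key written directly into a notebook is visible to anyone who can read the file. As a minimal sketch, the key could instead be read from an environment variable. The variable name `PLOS_API_KEY` and the helper `load_api_key` are assumptions made for illustration and are not part of the original notebook.

```python
import os


def load_api_key(env_var='PLOS_API_KEY'):
    '''Return the API key stored in the environment variable `env_var`.

    `env_var` is a hypothetical name; use whatever variable your shell defines.
    '''
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable first')
    return key


# Usage (after `export PLOS_API_KEY=...` in the shell that starts Jupyter):
# API_KEY = load_api_key()
```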

In [25]:
# Build the query string required by the PLOS API
def query_str(keyword, start_row, batch_size, API_KEY):
    '''Return the PLOS API search URL for `keyword`, requesting `batch_size`
    documents starting at row `start_row`.'''
    return (
        'http://api.plos.org/search'
        f'?q=conclusions:"{keyword}"'
        '&fl=publication_date,title,conclusions&wt=json'
        f'&start={start_row}&rows={batch_size}&api_key={API_KEY}'
    )

In [30]:
# Example of query string
query_str(keyword, 1, 100, API_KEY)

'http://api.plos.org/search?q=conclusions:"nutrition"&fl=publication_date,title,conclusions&wt=json&start=1&rows=100&api_key=YrBVAstBzZkfnsD_WzpX'
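
The query string above contains raw double quotes, and a multi-word keyword would add raw spaces. The sketch below is one possible variant that percent-encodes the parameters with the standard library. The name `query_url` and the optional `api_key` argument are assumptions for illustration; the endpoint and field names are taken from the query string above.

```python
from urllib.parse import urlencode

BASE_URL = 'http://api.plos.org/search'


def query_url(keyword, start_row, batch_size, api_key=None):
    '''Return a percent-encoded search URL (hypothetical variant of query_str).'''
    params = {
        'q': f'conclusions:"{keyword}"',             # search the conclusions field
        'fl': 'publication_date,title,conclusions',  # fields to return
        'wt': 'json',                                # response format
        'start': start_row,
        'rows': batch_size,
    }
    if api_key:
        # Only send the key when one is supplied
        params['api_key'] = api_key
    return f'{BASE_URL}?{urlencode(params)}'
```

`requests.get(BASE_URL, params=params)` performs the same encoding, so the dictionary could also be passed to `requests` directly instead of building the URL by hand.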

### Fetch from PLOS API

In [28]:
# Request the documents from the API in batches of 100 (the maximum allowed) and
# collect them in a dataframe called 'corpus'

# Note: a 10-second delay between requests keeps within the PLOS API rate limit
# Note: the Solr-style 'start' parameter is a zero-based offset, so starting at 1
# skips the first match; start at 0 to include it

# initialise parameters
start_row = 1
batch_size = 100
docs = []

while True:
    r = requests.get(query_str(keyword, start_row, batch_size, API_KEY), timeout=20)
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    batch = r.json()['response']['docs']
    if not batch:
        break  # an empty batch means every matching document has been fetched
    docs.extend(batch)
    start_row += batch_size
    time.sleep(10)

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
corpus = pd.DataFrame(docs)

# reset index
corpus = corpus.reset_index(drop=True)

# show when complete
print('Complete')
print('Number of documents: {}'.format(corpus.shape[0]))
corpus.head()

Complete
Number of documents: 794


|   | publication_date | title | conclusions |
|---|---|---|---|
| 0 | 2014-10-21T00:00:00Z | Psychological Determinants of Consumer Accepta... | [To the authors’ knowledge, this is the first ... |
| 1 | 2015-03-13T00:00:00Z | Uncovering the Nutritional Landscape of Food | [In this study, we have developed a unique com... |
| 2 | 2017-06-27T00:00:00Z | Developing and validating a scale to measure F... | [Food and nutrition literacy scale is a valid ... |
| 3 | 2017-05-18T00:00:00Z | Quality of nutrition services in primary healt... | [The aim of the NNS, integrating nutrition ser... |
| 4 | 2015-10-21T00:00:00Z | To See or Not to See: Do Front of Pack Nutriti... | [Our work strongly supports the idea that FOP ... |
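
The batching logic can also be written as a generator that is independent of the network call, which makes it easy to check offline. This is a sketch under stated assumptions: `iter_docs`, `fetch_page` and `fake_fetch` are hypothetical names, and the fake fetcher only imitates the `response` / `docs` layout used in the fetch cell. The offset starts at 0 here because a Solr-style `start` parameter is a zero-based offset.

```python
def iter_docs(fetch_page, batch_size=100, start=0):
    '''Yield documents batch by batch until an empty batch is returned.

    `fetch_page(start, rows)` is a hypothetical callable that returns the
    parsed JSON of one request.
    '''
    while True:
        docs = fetch_page(start, batch_size)['response']['docs']
        if not docs:
            break
        yield from docs
        start += batch_size


def fake_fetch(start, rows, total=250):
    '''Stand-in for the API: serves `total` made-up documents in pages.'''
    docs = [{'title': f'doc {n}'} for n in range(start, min(start + rows, total))]
    return {'response': {'numFound': total, 'docs': docs}}


docs = list(iter_docs(fake_fetch))
print(len(docs))  # 250
```

With the real API, `fetch_page` would wrap the `requests.get(...)` call and the 10-second pause, leaving the loop itself unchanged.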


### Save to file corpus.csv

In [29]:
# Save the completed dataframe to CSV
corpus.to_csv('../data/interim/corpus.csv', index=False)
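
CSV stores each list-valued cell, such as `conclusions`, as its string representation, so the lists have to be parsed again after reloading. Where keeping them as lists matters, JSON is one alternative. The following is a minimal sketch using only the standard library; `save_docs` and `load_docs` are hypothetical helper names, and any output path passed to them is an assumption rather than part of the original notebook.

```python
import json
from pathlib import Path


def save_docs(docs, path):
    '''Write a list of document dicts to `path` as UTF-8 JSON.'''
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    with path.open('w', encoding='utf-8') as fh:
        json.dump(docs, fh, ensure_ascii=False, indent=2)


def load_docs(path):
    '''Read the list of document dicts back from `path`.'''
    with Path(path).open(encoding='utf-8') as fh:
        return json.load(fh)


# Usage with the dataframe built above (not executed here):
# save_docs(corpus.to_dict('records'), '../data/interim/corpus.json')
```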