# Request data from PLOS API and save to JSON file
***

## Background

Determine the degree of consensus in contentious academic fields. 

Collect title, publication date and summaries from scholarly articles containing a certain keyword or keywords. Apply NLP models to this data to identify and categorise concepts in this field and determine statistical significance between opposing 'truths', if any. Ranking these groups according to weighted influence will prove the degree of consensus of various approaches in a given academic field.

To this end the academic_consensus model will search the abstracts of academic papers that contain the keyword "nutrition". 

## Approach

In considering various APIs, this model will use the Public Library of Science (PLOS) API since it also includes searches by Abstract. According to the  
[PLOS API Documentation](http://alm.plos.org/users/me) the API also does not require an API Key for access to almost all works, with the exception of sources that don't allow redistribution of data. An API Key would otherwise only be required to add or update works.

Search fields to include:
- publication_date
- title
- conclusions

## Request
### Packages and setup

In [119]:
import pandas as pd
import requests
import time

from IPython.core.interactiveshell import InteractiveShell

In [120]:
# Set workspace

# Set output charackters to 110 (not 79)
pd.options.display.width = 110
# To give multiple cell output. Not just the last command.
InteractiveShell.ast_node_interactivity = 'last'

### Input parameters

In [153]:
# Define keywords
keywords = ['nutrition', 'diet']
API_KEY = 'YrBVAstBzZkfnsD_WzpX'

In [154]:
# Show keywords together for query
keyword = '" "'.join(keywords)
keyword

'nutrition" "diet'

In [155]:
# Function to generate the required query string for the PLOS API
def query_str(keyword, start_row, batch_size, API_KEY):
    '''Compiles query string for relevant keyword for use in the PLOS API, 
    from the starting row and the batch size'''
    address = 'http://api.plos.org/search?q={! q.op=OR df=conclusions}"' \
            + keyword \
            + '"&fl=publication_date,title,conclusions&wt=json&start=' \
            + str(start_row) \
            + '&rows=' \
            + str(batch_size) \
            + '&api_key=' \
            + str(API_KEY)
    return address

In [156]:
# Example of query string
query_str(keyword, 1, 100, API_KEY)

'http://api.plos.org/search?q={! q.op=OR df=conclusions}"nutrition" "diet"&fl=publication_date,title,conclusions&wt=json&start=1&rows=100&api_key=YrBVAstBzZkfnsD_WzpX'

### Fetch from PLOS API

In [157]:
# First check how many documents contain the keywords
r = requests.get(query_str(keyword, 1, 1, API_KEY))
check_data = r.json()
n_articles = check_data['response']['numFound']
print('Number of articles found: ', n_articles)

Number of articles found:  1531


In [158]:
print('Time to download data: {:.2f} min'.format(n_articles / 100 * 10 / 60 ))

Time to download data: 2.55 min


In [165]:
%%time
# Request the documents from the API in batches of 100 (maximum allowed) and append 
# it to a dataframe called 'corpus'

# Note: A 10 second delay is built in between requests to comply with PLOS API rate limit

# initialise paramters 
start_row = 1
batch_size = 100 
i = 1
corpus = pd.DataFrame()

# First request to initialise while loop
r = requests.get(query_str(keyword, start_row, batch_size, API_KEY))
time.sleep(10)
json_data = r.json()

print('Starting...')

while json_data['response']['docs']:
    corpus = corpus.append(json_data['response']['docs'])
    
    start_row = (i * batch_size) + 1
    
    r = requests.get(query_str(keyword, start_row, batch_size, API_KEY), timeout=20)
    time.sleep(10)
    json_data = r.json()
    
    if i % 5 == 0:
        print('Articles downloaded: {} of {}'.format(i*100, n_articles))
    
    i += 1

# reset index
corpus = corpus.reset_index(drop=True)

# show when complete
print('Complete')
print('Number of documents: {}'.format(corpus.shape[0]))
corpus.head()

Starting...
Articles downloaded: 500 of 1531
Articles downloaded: 1000 of 1531
Articles downloaded: 1500 of 1531
Complete
Number of documents: 1530
Wall time: 3min 7s


Unnamed: 0,publication_date,title,conclusions
0,2016-03-09T00:00:00Z,Pregnancy Requires Major Changes in the Qualit...,[Pregnancy tends to markedly widen the nutriti...
1,2016-08-23T00:00:00Z,Continental-Scale Patterns Reveal Potential fo...,"[In all, given the geographic patterns in diet..."
2,2015-06-17T00:00:00Z,Assessing Nutritional Parameters of Brown Bear...,[Previous studies have illustrated the differe...
3,2015-04-17T00:00:00Z,The Self-Reported Clinical Practice Behaviors ...,[The present study provides a valuable insight...
4,2017-10-09T00:00:00Z,The impact of nutritional supplement intake on...,[Our study shows that the propensity to consum...


### Save to file corpus.csv

In [166]:
# save to completed dataframe
corpus.to_csv('../data/interim/corpus_raw.csv', index=False)