# Calling the SHARE API
----
Here are some working examples of how to query the current scrAPI database for metrics of results coming through the SHARE Notification Service.

*these slides and example notebooks*:  https://osf.io/bygau/

slides also here: https://github.com/erinspace/share_tutorials

## Setup

- All of the examples I use here will be using python and some basic libraries
    - [Here's a basic guide to getting started with python](https://wiki.python.org/moin/BeginnersGuide)
- Code and setup instructions on github at:
     - https://github.com/erinspace/share_tutorials

- To run these examples on your machine, you'll need to install some basic python packages
    - Make sure to use a virtual environment to install python packages:
        - https://virtualenv.readthedocs.org/en/latest/
    - Using your terminal, run ```pip install -r requirements.txt``` inside your virtual environment
- Run the Jupyter notebook server from the command line:
    ```jupyter notebook```

### Get a List of the Current SHARE Providers
----
We'll make an API call to find:
- The official name of each SHARE Provider
- The URL for the home page of each SHARE Provider
- The shortname, or nickname of the SHARE provider for internal use
    - We'll use this name when querying for documents from this source

In [None]:
import requests

data = requests.get('https://osf.io/api/v1/share/providers/').json()

In [None]:
data

#### This is a lot of information!

Let's display this in a way that looks a little nicer.

In [None]:
from IPython.display import Image, display

for source in data['providerMap'].keys():
    display(Image(url=data['providerMap'][source]['favicon']))
    print(
        '{}\n{}\n{}\n'.format(
            data['providerMap'][source]['long_name'].encode('utf-8'),
            data['providerMap'][source]['url'],
            data['providerMap'][source]['short_name']
        )
    )

## SHARE Schema

Required fields:
- title
- contributors
- uris
- providerUpdatedDateTime

We add some information after each document is harvested inside the field shareProperties, including:
- source (where the document was originally harvested)
- docID  (a unique identifier for that object from that source)

These two fields can be combined to make a unique document identifier.

See more details about the SHARE Schema, including examples of documents with all of the fields, here:
https://osf.io/wunk7/wiki/home/?view
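
Here's a minimal sketch of how the required fields and the combined identifier described above could be handled in code. The sample document, the helper names, and the `|` separator are all made up for illustration; they are not part of the SHARE API.

```python
REQUIRED_FIELDS = ("title", "contributors", "uris", "providerUpdatedDateTime")


def missing_required_fields(document):
    """Return the required SHARE fields that are absent from a document."""
    return [field for field in REQUIRED_FIELDS if field not in document]


def unique_identifier(document, separator="|"):
    """Combine source and docID from shareProperties into one string."""
    properties = document["shareProperties"]
    return "{}{}{}".format(properties["source"], separator, properties["docID"])


# A made-up document for illustration only; real documents carry more fields.
sample_document = {
    "title": "An Example Title",
    "contributors": [{"name": "Susan Jones"}],
    "uris": {"canonicalUri": "http://example.com/1234"},
    "providerUpdatedDateTime": "2015-06-03T14:25:07+00:00",
    "shareProperties": {"source": "mit", "docID": "oai:example.com:1234"},
}

print(missing_required_fields(sample_document))
print(unique_identifier(sample_document))
```

Any separator works as long as it cannot appear inside a source name.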

## Simple Queries

- We need a URL to use to access the SHARE API.
- We will add arguments to this URL to shape our request
    - size: how many results we'll return
    - sort: how we want the results to be sorted
    - from: where to start in the results returned

In [None]:
OSF_APP_URL = 'https://osf.io/api/v1/share/search/'

In [None]:
import furl

search_url = furl.furl(OSF_APP_URL)
search_url.args['size'] = 3
search_url.args['sort'] = 'providerUpdatedDateTime'
search_url.args['from'] = 5

### Our Query URL So far

In [None]:
print('The request URL is {}'.format(search_url.url))
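
To page through results, `from` has to advance by `size` each time. Here's a small sketch that builds the same kind of URL with only the Python 3 standard library. The `build_search_url` helper and its `page` argument are hypothetical, and it assumes `from` is an offset counted in documents, as it is in elasticsearch.

```python
from urllib.parse import urlencode

OSF_APP_URL = 'https://osf.io/api/v1/share/search/'


def build_search_url(base_url, page=0, size=10, sort=None):
    """Build a search URL for the given zero-based page of results."""
    params = {'size': size}
    if sort is not None:
        params['sort'] = sort
    # 'from' is an offset counted in documents, not in pages
    params['from'] = page * size
    return '{}?{}'.format(base_url, urlencode(params))


# Print the URLs for the first three pages, three results per page
for page in range(3):
    print(build_search_url(OSF_APP_URL, page=page, size=3, sort='providerUpdatedDateTime'))
```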

### Our results

In [None]:
from datetime import datetime

recent_results = requests.get(search_url.url).json()

for result in recent_results['results']:
    print(
        '{} -- from {} -- updated on {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            datetime.strftime(datetime.strptime(result['providerUpdatedDateTime'], "%Y-%m-%dT%H:%M:%S+00:00"), '%B %d %Y')
        )
    )
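
The date formatting above fails with a `ValueError` if a timestamp does not match the expected pattern exactly. Here's a more forgiving sketch. The `format_updated_date` helper is hypothetical, and the second format (fractional seconds) is only a guess at what some providers might send.

```python
from datetime import datetime

# Formats to try, in order. The second one (fractional seconds) is an
# assumption for illustration, not something the API is known to return.
KNOWN_FORMATS = (
    '%Y-%m-%dT%H:%M:%S+00:00',
    '%Y-%m-%dT%H:%M:%S.%f+00:00',
)


def format_updated_date(timestamp, output_format='%B %d %Y'):
    """Return a readable date, or the raw timestamp if no format matches."""
    for known_format in KNOWN_FORMATS:
        try:
            return datetime.strptime(timestamp, known_format).strftime(output_format)
        except ValueError:
            continue
    # Fall back to the unparsed value instead of raising
    return timestamp


print(format_updated_date('2015-06-03T14:25:07+00:00'))
print(format_updated_date('not a date'))
```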

### Narrowing Results by Source

In [None]:
search_url.args['q'] = 'shareProperties.source:mit'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated on {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            datetime.strftime(datetime.strptime(result['providerUpdatedDateTime'], "%Y-%m-%dT%H:%M:%S+00:00"), '%B %d %Y')
        )
    )

## Complex Queries
- The SHARE Search API runs on elasticsearch
- More information on how to format elasticsearch queries: 
    - https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

### Defining a Helper Function
- We'll use this helper function in later examples.

In [None]:
import json

def query_share(url, query):
    # A helper function that will use the requests library,
    # pass along the correct headers,
    # and make the query we want
    headers = {'Content-Type': 'application/json'}
    data = json.dumps(query)
    return requests.post(url, headers=headers, data=data, verify=False).json()

### Building a Query

In [None]:
sponsorship_query = {
    "size": 5,
    "query": {
        "filtered": {
            "filter": {
                "exists": {
                    "field": "sponsorships"
                }
            }
        }
    }
}
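
Since the same shape of query is useful for any field, here's a sketch of a small builder. The `exists_query` name is hypothetical. It mirrors the `filtered` form used above; newer elasticsearch versions replace `filtered` with a `bool` query, so this is tied to the version the SHARE API runs on.

```python
def exists_query(field, size=5):
    """Build a query body matching documents where `field` is present."""
    return {
        "size": size,
        "query": {
            "filtered": {
                "filter": {
                    "exists": {"field": field}
                }
            }
        }
    }


# The same body as the hand-written sponsorship query
sponsorship_query = exists_query("sponsorships")

# The same shape, pointed at a different field
languages_query = exists_query("languages", size=10)
print(languages_query)
```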

### Running the Query and Printing Results

In [None]:
results = query_share(search_url.url, sponsorship_query)

for item in results['results']:
    print('{} -- from source {} -- sponsored by {}'.format(
            item['title'].encode('utf-8'),
            item['shareProperties']['source'].encode('utf-8'),
            ' '.join(
                [sponsor['sponsor']['sponsorName'] for sponsor in item['sponsorships']]
            )
        )
    )
    print('-------------------')

### New Query

How many results *do not* have subjects?

In [None]:
no_subjects_query = {
    "query": {
        "query_string": {
            "analyze_wildcard": True, 
            "query": "NOT subjects:*"
        }
    }
}
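
Note that the query is written as a Python dictionary, so it uses Python's `True`. A quick sketch of what `json.dumps` (the call our helper function makes) turns it into before it is sent:

```python
import json

no_subjects_query = {
    "query": {
        "query_string": {
            "analyze_wildcard": True,
            "query": "NOT subjects:*"
        }
    }
}

# json.dumps converts Python's True into JSON's lowercase true
body = json.dumps(no_subjects_query, indent=2, sort_keys=True)
print(body)
```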

In [None]:
results_with_no_subjects = query_share(search_url.url, no_subjects_query)
total_results = requests.get(OSF_APP_URL).json()['count']
results_percent = (float(results_with_no_subjects['count'])/total_results)*100

In [None]:
print(
    '{} results out of {}, or {}%, do not have subjects.'.format(
        results_with_no_subjects['count'],
        total_results,
        format(results_percent, '.2f')
    )
)

## Using sharepa for SHARE Parsing and Analysis

- sharepa - short for SHARE Parsing and Analysis
    - https://github.com/CenterForOpenScience/sharepa#sharepa

### Basic Actions

A basic search will provide access to all documents in SHARE in 10 document slices.

#### Count
You can use sharepa and the basic search to get the total number of documents in SHARE

In [None]:
from sharepa import basic_search

basic_search.count()

### Iterating Through Results

A basic iteration through results will yield 10 at a time, starting from the first documents collected.

Let's do a basic search and iterate through the results

In [None]:
results = basic_search.execute()

for hit in results:
    print(hit.title)

#### Slicing Results

You can use slices to access a different set of results.

Let's print out 5 results, starting from the 20th and going until the 25th.

In [None]:
results = basic_search[20:25].execute()
for hit in results:
    print(hit.title)
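
A slice lines up with the `from` and `size` arguments we used in the simple queries. Here's a sketch of that idea. The `slice_to_paging` helper is hypothetical and is not sharepa's own code; it only shows the arithmetic.

```python
def slice_to_paging(result_slice):
    """Translate a Python slice into 'from' and 'size' paging arguments."""
    start = result_slice.start or 0
    if result_slice.stop is None:
        raise ValueError('An open-ended slice has no fixed size')
    return {'from': start, 'size': result_slice.stop - start}


# The slice [20:25] starts at offset 20 and covers 5 documents
print(slice_to_paging(slice(20, 25)))
```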

#### Sorting Results

By default, the oldest results are returned first.

You can instead sort results by descending ```providerUpdatedDateTime``` (note the leading ```-```) to get the most recent items in the SHARE dataset

In [None]:
results = basic_search.sort('-providerUpdatedDateTime').execute()

for hit in results:
    print('{} - Last updated on {}'.format(
            hit.title.encode('utf-8'), 
            datetime.strftime(datetime.strptime(hit.providerUpdatedDateTime, "%Y-%m-%dT%H:%M:%S+00:00"), '%B %d %Y')
        )
    )

## Advanced Search with sharepa

Queries are formed using lucene query syntax 
    - https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax

In [None]:
from sharepa import ShareSearch
from sharepa.helpers import pretty_print

my_search = ShareSearch()

my_search = my_search.query(
    'query_string',
    query='subjects:*',
    analyze_wildcard=True
)

pretty_print(my_search.to_dict())

In [None]:
new_results = my_search.sort('-providerUpdatedDateTime').execute()

for hit in new_results:
    print(
        '{} - with subjects {}\n\n'.format(
            hit.title.encode('utf-8'),
            [sub.encode('utf-8') for sub in hit.subjects]
        )
    )

## Debugging and Problem Solving

Not everything always goes as planned when querying an unfamiliar API.

Here are some debugging and problem solving strategies when you're querying the SHARE API.

### Start forming a search we're not too sure about

We are interested in seeing how many results are specified as being in a language other than English

In [None]:
language_search = ShareSearch()

language_search = language_search.query(
    'query_string',
    query='NOT languages=english'
)

In [None]:
results = language_search.execute()

for hit in results:
    print(hit.languages)

### That didn't look right.

Let's look at our error:

```AttributeError: 'Result' object has no attribute 'languages' ```
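
One way to keep a loop from crashing on optional fields is to ask for the attribute with a default. Here's a sketch using stand-in objects so that it runs without a network connection; the `languages_or_default` helper and the sample hits are made up.

```python
from types import SimpleNamespace

# Stand-ins for search hits; SimpleNamespace is only used here so the
# example runs offline. Real hits come from a sharepa search.
hits = [
    SimpleNamespace(title='First document', languages=['eng']),
    SimpleNamespace(title='Second document'),  # no languages attribute
]


def languages_or_default(hit, default=None):
    """Return hit.languages, or `default` when the attribute is missing."""
    return getattr(hit, 'languages', default)


for hit in hits:
    print(hit.title, languages_or_default(hit, default=[]))
```

This hides the missing data rather than explaining it, so it is still worth working out why the field is absent.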

### Building up the correct query

Let's try narrowing our query to only results that have a language attribute.

(Language is not required, so many results won't have this information.)

In [None]:
language_search = ShareSearch()

language_search = language_search.filter(
    'exists',
    field="languages"
)

In [None]:
results = language_search.execute()
results_percent = (float(language_search.count())/basic_search.count())*100

print('There are {}/{} - or {}% documents with languages specified'.format(
        language_search.count(),
        basic_search.count(),
        format(results_percent, '.2f')
    )
)

In [None]:
print('Here are the languages for the first 10 results:')

for hit in results:
    print(hit.languages)

### Referencing the SHARE Schema

Simplified form here: https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml

#### Section on languages:
        languages:
            description: |-
                The primary languages in which the content of the resource is presented. Values used for this element MUST conform to ISO 639-3. This offers three letter tags e.g. "eng" for English.
            type: array
            items:
                type: string
                pattern: "[a-z][a-z][a-z]"
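
Here's a sketch of checking values against that pattern before using them in a query. It treats the pattern as covering the whole value, which is an assumption about the schema's intent, and it only checks the shape of a code, not whether the code really exists in ISO 639-3.

```python
import re

# The pattern from the schema section above
LANGUAGE_PATTERN = re.compile(r'[a-z][a-z][a-z]')


def is_valid_language_code(code):
    """Return True if `code` is exactly three lowercase letters."""
    # fullmatch requires the whole string to match (an assumption about intent)
    return LANGUAGE_PATTERN.fullmatch(code) is not None


for candidate in ['eng', 'spa', 'english', 'EN']:
    print(candidate, is_valid_language_code(candidate))
```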

### Continuing to Refine our Query

In [None]:
from elasticsearch_dsl import Q

language_search = language_search.query(~Q("term", languages="eng"))

results = language_search.execute()

In [None]:
print(
    'There are {} documents that do not have "eng" listed.'.format(
        language_search.count()
    )
)

print('Here are the languages for the first 10 results:')

for hit in results:
    print(hit.languages)

# Complex Queries and Basic Visualization

- How to use both basic HTTP requests and sharepa
- Aggregations, or queries that will return summary statistics about the whole dataset.
- simple data visualizations using pandas and matplotlib

## Aggregations

Aggregations let you quickly get summary statistics for all of SHARE results in one query.

### Documents Per Source Missing Descriptions

In [None]:
missing_descriptions_aggregation = {
    "query": {
        "query_string": {
            "analyze_wildcard": True, 
            "query": "NOT description:*"
        }
    },
    "aggs": {
        "sources": {
            "terms": {
                "field": "_type", # A field where the SHARE source is stored                
                "min_doc_count": 0, 
                "size": 0  # Will return all sources, regardless if there are results
            }
        }
    }
}

In [None]:
results_without_descriptions = query_share(OSF_APP_URL, missing_descriptions_aggregation)

missing_descriptions_counts = results_without_descriptions['aggregations']['sources']['buckets']

for source in missing_descriptions_counts:
    print('{} has {} documents without descriptions'.format(source['key'], source['doc_count']))
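
The buckets come back as a list of dictionaries, so it can be handy to reshape them. Here's a sketch that runs on a made-up response; the source names, the counts, and the `counts_by_source` helper are illustrative only.

```python
# A made-up response with the same nesting as a real aggregation result
sample_response = {
    'aggregations': {
        'sources': {
            'buckets': [
                {'key': 'source_a', 'doc_count': 12},
                {'key': 'source_b', 'doc_count': 0},
                {'key': 'source_c', 'doc_count': 340},
            ]
        }
    }
}


def counts_by_source(response, aggregation_name='sources'):
    """Turn aggregation buckets into a {source: doc_count} dictionary."""
    buckets = response['aggregations'][aggregation_name]['buckets']
    return {bucket['key']: bucket['doc_count'] for bucket in buckets}


counts = counts_by_source(sample_response)

# Sort so the sources with the most matching documents come first
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for source, count in ranked:
    print('{} has {} documents without descriptions'.format(source, count))
```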

### Making Results More Useful

Let's do that same query, but this time find the percentages of documents from each source instead of the numbers alone.

We'll also leave out sources that have all of their descriptions to make the list more manageable.

In [None]:
sig_terms_agg = {
    "query": {
        "query_string": {
            "analyze_wildcard": True, 
            "query": "NOT description:*"
        }
    },
    "aggs": {
        "sources":{
            "significant_terms":{
                "field": "_type", # A field where the SHARE source is stored
                "min_doc_count": 1, # Only sources with at least one matching document
                "percentage": {} # This will make the "score" parameter a percentage
            }
        }
    }
}

In [None]:
docs_with_no_description_results = query_share(OSF_APP_URL, sig_terms_agg)
docs_with_no_description = docs_with_no_description_results['aggregations']['sources']['buckets']

In [None]:
for source in docs_with_no_description:
    print(
        '{}% (or {}/{}) of documents from {} have no description'.format(
            format(source['score']*100, '.2f'),
            source['doc_count'],
            source['bg_count'],
            source['key']
        )
    )
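
With the `percentage` option, the score should equal the matching document count divided by the source's total count. Here's a sketch that recomputes it from one made-up bucket; the numbers and the `percent_from_bucket` helper are illustrative.

```python
def percent_from_bucket(bucket):
    """Recompute the percentage from doc_count (matching) and bg_count (total)."""
    if bucket['bg_count'] == 0:
        return 0.0  # avoid dividing by zero for an empty source
    return 100.0 * bucket['doc_count'] / bucket['bg_count']


# A made-up bucket: 50 of the source's 200 documents match the query
sample_bucket = {'key': 'source_a', 'doc_count': 50, 'bg_count': 200, 'score': 0.25}

print('{}% of documents from {} have no description'.format(
    format(percent_from_bucket(sample_bucket), '.2f'),
    sample_bucket['key']
))
```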

### Aggregations with sharepa

Let's use sharepa to find out what percentage of documents from each source do not have subjects

In [None]:
no_subjects_search = ShareSearch()

no_subjects_search = no_subjects_search.query(
    'query_string',
    query='NOT subjects:*',
    analyze_wildcard=True  # This will make elasticsearch pay attention to the asterisk (which matches anything)
)

no_subjects_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'significant_terms',  # There are many kinds of aggregations
    field='_type',  # We store the source of a document in its type, so this will aggregate by source
    min_doc_count=1,
    percentage={},
    size=0
)

#### Examining the query

Let's take a look at the query that sharepa generated, and we'll see that it looks a lot like the query we made by hand

In [None]:
pretty_print(no_subjects_search.to_dict())

#### Executing the query

Run the query and check out the results

In [None]:
aggregated_results = no_subjects_search.execute()

for source in aggregated_results.aggregations['sources']['buckets']:
    print(
        '{}% of documents from {} do not have subjects'.format(
            format(source['score']*100, '.2f'),
            source['key'] 
        )
    )

### Top Subjects Aggregation

Let's do an elasticsearch query to find out which subjects are used most often across all sources in the dataset.

In [None]:
top_subjects_search = ShareSearch()

top_subjects_search.aggs.bucket(
    'subjectsTermFilter',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations
    field='subjects',  # Aggregate on the subjects field, so each bucket is one subject term
    min_doc_count=1,
    exclude= "of|and|or",
    size=10
)

In [None]:
top_subjects_results_executed = top_subjects_search.execute()
top_subjects_results = top_subjects_results_executed.aggregations.subjectsTermFilter.to_dict()['buckets']

pretty_print(top_subjects_results)
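
To see what a `terms` aggregation is counting, here's a sketch that does the same kind of tally locally on a few made-up documents. The `top_terms` helper is hypothetical, and treating `exclude` as a pattern that must match the whole term is my assumption about how elasticsearch applies it.

```python
import re
from collections import Counter


def top_terms(documents, field, size=10, exclude=None):
    """Count how many documents contain each term, most common first."""
    counts = Counter()
    for document in documents:
        # set() so a term counts once per document, like doc_count
        for term in set(document.get(field, [])):
            if exclude and re.fullmatch(exclude, term):
                continue  # skip terms such as "of", "and", "or"
            counts[term] += 1
    return counts.most_common(size)


# Made-up documents; the last one has no subjects at all
sample_documents = [
    {'subjects': ['biology', 'of', 'cells']},
    {'subjects': ['biology', 'physics']},
    {'subjects': ['physics', 'and', 'biology']},
    {},
]

print(top_terms(sample_documents, 'subjects', size=2, exclude='of|and|or'))
```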

## Plotting

Here are some simple plots using pandas and matplotlib



### Creating a Dataframe

To create a plot, first we need to get the data into an appropriate format.

Pandas, a Python data analysis library, has the DataFrame format, which is a lot like a spreadsheet.

In [None]:
import pandas as pd

top_subjects_dataframe = pd.DataFrame(top_subjects_results)
top_subjects_dataframe

### Plotting the Dataframe

In [None]:
from matplotlib import pyplot
%matplotlib inline

top_subjects_dataframe.plot(kind='bar', x='key', y='doc_count')
pyplot.show()

### Complex Queries and Dataframes

Let's make a new search, for all documents updated in the years 2012 to 2015 that contain the subject "science."

In [None]:
science_search = ShareSearch() #create search object
science_search = science_search.filter( #apply filter to search
    "range", #applied a range type filter
    providerUpdatedDateTime={ #the field in the data we compare
        'gte':'2012-01-01', #hits must be greater than or equal to this date and...
        'lte':'2015-12-31' #hits must be less than or equal to this date
    }
)

In [None]:
science_search = science_search.filter(
     "prefix",
     subjects="science"
)

science_search.aggs.bucket(
    'sources',
    'significant_terms',
    field='_type',
    min_doc_count=1,
    percentage={},
    size=0
)

### Take a look at the query we've built

In [None]:
pretty_print(science_search.to_dict())

### Make the query, and graph the result

In [None]:
import pandas as pd
from matplotlib import pyplot

%matplotlib inline

science_search_results = science_search.execute()

science_results = science_search_results.aggregations.sources.to_dict()  
science_data_frame = pd.DataFrame(science_results['buckets']) 

science_data_frame['percents'] = (science_data_frame['score'] * 100)

science_data_frame[:30].plot(kind='bar', x='key', y='percents') # Limit to the first 30 results for readability

pyplot.show()

### Plot Number of Documents by Source

We'll limit it to the top 30 sources to make sure that the graph is readable.

In [None]:
from sharepa import bucket_to_dataframe

all_results = ShareSearch()

all_results = all_results.query(
    'query_string',
    query='*',
    analyze_wildcard=True
)

all_results.aggs.bucket(
    'sources',
    'terms',
    field='_type',
    size=0,
    min_doc_count=0
)

In [None]:
all_results = all_results.execute()

all_results_frame = bucket_to_dataframe(
    '# documents by source',
    all_results.aggregations.sources.buckets
)

# sort_values replaces DataFrame.sort, which newer pandas releases no longer provide
all_results_frame_sorted = all_results_frame.sort_values(
    by='# documents by source',
    ascending=False
)

all_results_frame_sorted[:30].plot(kind='bar')

### Different Kinds of Charts

Let's make a pie chart

- Limited to 10 sources

In [None]:
all_results_frame_sorted[:10].plot(kind='pie', y="# documents by source", legend=False)

## SHARE Data in the Wide World

Here are some examples of how to get SHARE data into different formats

### Exporting a DataFrame to csv and Excel

Let's do a query and then export the results to different formats.

We're interested in the number of documents from each source that have a description.

In [None]:
description_search = ShareSearch()

description_search = description_search.query(
    'query_string', 
    query='description:*',
    analyze_wildcard=True
)

description_search.aggs.bucket(
    'sources',
    'significant_terms',
    field='_type',
    min_doc_count=0,
    percentage={},
    size=0
)

description_results = description_search.execute()

### Cleaning up our dataframe

In [None]:
description_dataframe = pd.DataFrame(description_results.aggregations.sources.to_dict()['buckets'])

# We will add our own "percent" column to make things clearer
description_dataframe['percent'] = (description_dataframe['score'] * 100)
# And, drop the old score column
description_dataframe = description_dataframe.drop('score', axis=1)

# Let's set the source name as the index, and then drop the old column
description_dataframe = description_dataframe.set_index(description_dataframe['key'])
description_dataframe = description_dataframe.drop('key', axis=1)

In [None]:
# Finally, we'll show the results!
description_dataframe

### Exporting to CSV and Excel formats

Pandas has handy built-in tools that make converting a dataframe very easy

In [None]:
description_dataframe.to_csv('exported_data/SHARE_Counts_with_Descriptions.csv')
description_dataframe.to_excel('exported_data/SHARE_Counts_with_Descriptions.xlsx')
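
The export calls above write into an ```exported_data``` folder, which has to exist first. Here's a sketch of the same kind of export using only the Python 3 standard library, creating the folder when it is missing. The `export_buckets` helper, the sample buckets, and the temporary output location are all made up for illustration.

```python
import csv
import os
import tempfile


def export_buckets(buckets, path, fieldnames=('key', 'doc_count')):
    """Write aggregation buckets to a CSV file, creating the folder if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, 'w', newline='') as csv_file:
        # extrasaction='ignore' drops bucket keys we did not ask for
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(buckets)
    return path


# Made-up buckets shaped like an aggregation result
sample_buckets = [
    {'key': 'source_a', 'doc_count': 12, 'score': 0.5},
    {'key': 'source_b', 'doc_count': 340, 'score': 0.9},
]

# Write under a temporary folder so the example leaves the project untouched
output_path = os.path.join(tempfile.mkdtemp(), 'exported_data', 'sample_counts.csv')
export_buckets(sample_buckets, output_path)

with open(output_path) as csv_file:
    print(csv_file.read())
```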

### Working with Outside Data

Here's a quick example of how you could work with a list of names, and use them to see what information is in SHARE

In [None]:
names = ["Susan Jones", "Ravi Patel"]

In [None]:
name_search = ShareSearch()

for name in names:
    name_search = name_search.query(
        {
            "bool": {
                "should": [
                    {
                        "match": {
                            "contributors.name": {
                                "query": name, 
                                "operator": "and",
                                "type" : "phrase"
                            }
                        }
                    }
                ]
            }
        }
    )


name_results = name_search.execute()
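
In elasticsearch, a `bool` query that has only `should` clauses matches documents that satisfy at least one of them. Here's a sketch of building one such query body with a clause per name. The `any_phrase_query` helper is hypothetical, and each clause copies the `match` options used above.

```python
def any_phrase_query(field, phrases):
    """Build one bool query with a should clause for each phrase."""
    return {
        "bool": {
            "should": [
                {
                    "match": {
                        field: {
                            "query": phrase,
                            "operator": "and",
                            "type": "phrase"
                        }
                    }
                }
                for phrase in phrases
            ]
        }
    }


names = ["Susan Jones", "Ravi Patel"]
name_query = any_phrase_query("contributors.name", names)

# One should clause per name
print(len(name_query["bool"]["should"]))
```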

In [None]:
print(
    'There are {} documents with contributors who have any of those names.'.format(
        name_search.count()
    )
)

print('Here are the first 10:')
print('---------')
for result in name_results:
    print(
        '{} -- with contributors {}'.format(
            result.title.encode('utf-8'),
            ', '.join([contributor.name.encode('utf-8') for contributor in result.contributors])
        )
    )

### Where did these results come from?

We can add an aggregation!

In [None]:
name_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations, terms is a pretty useful one though
    field='_type',  # We store the source of a document in its type, so this will aggregate by source
    size=0,  # These are just to make sure we get numbers for all the sources, to make it easier to combine graphs
    min_doc_count=1
)

name_results = name_search.execute()

pd.DataFrame(name_results.aggregations.sources.to_dict()['buckets'])

### Searching by ORCID 

In [None]:
orcids = [
    'http://orcid.org/0000-0003-1942-4543',
    'http://orcid.org/0000-0003-4875-1447',
    'http://orcid.org/0000-0002-6085-4433',
    'http://orcid.org/0000-0002-7995-9948',
    'http://orcid.org/0000-0002-2170-853X',
    'http://orcid.org/0000-0002-8899-9087'
]
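
Before searching, it can help to catch mistyped ORCIDs. The last character of an ORCID is a check digit (ISO 7064 MOD 11-2), so here's a sketch of a validator. The helper names are made up, and it accepts either the bare identifier or the full URL form used in the list above.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return 'X' if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the identifier has 16 characters and a correct check digit."""
    # Keep only the part after the last slash, then drop the hyphens
    identifier = orcid.rsplit('/', 1)[-1].replace('-', '')
    if len(identifier) != 16 or not identifier[:15].isdigit():
        return False
    return orcid_check_digit(identifier[:15]) == identifier[15].upper()


for candidate in [
    'http://orcid.org/0000-0002-2170-853X',
    'http://orcid.org/0000-0002-2170-8530',  # last digit changed on purpose
]:
    print(candidate, is_valid_orcid(candidate))
```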

In [None]:
orcid_search = ShareSearch()

for orcid in orcids:
    orcid_search = orcid_search.query(
        {
            "bool": {
                "should": [
                    {
                        "match": {
                            "contributors.sameAs": {
                                "query": orcid, 
                                "operator": "and",
                                "type" : "phrase"
                            }
                        }
                    }
                ]
            }
        }
    )

In [None]:
orcid_search.aggs.bucket(
    'sources',
    'terms',
    field='_type',
    size=0,
    min_doc_count=1
)

orcid_results = orcid_search.execute()

In [None]:
print(
    'There are {} documents with contributors who have any of those orcids.'.format(
        orcid_search.count()
    )
)

all_agg_df = pd.DataFrame()
all_agg_df['title'] = [result.title for result in orcid_results]
all_agg_df['docID'] = [result.shareProperties.docID for result in orcid_results]
all_agg_df['source'] = [result.shareProperties.source for result in orcid_results]
all_agg_df

## This is just the surface!

The SHARE API has the potential to answer many questions about our data

Data curation and enhancement will only make these analyses more interesting.

# Thank you!

## Questions?

**email**: erin@cos.io

*SHARE Technical Documentation and Information*: https://osf.io/t3j94/

*these slides and example notebooks*:  https://osf.io/bygau/