# Purpose

This notebook is set up to query the [OSTI.gov](https://www.OSTI.gov) API for project records. The goals for the code herein are:

1. Determine what fields are available for different records in OSTI
2. Design a DOE Solar Energy Technologies Office (SETO) query that only pulls that technology office's data
3. Build the query to work with an arbitrarily large list of formatted project IDs, assuming the Solar Information Management System (SIMS) project code syntax as the input.
    * **Note**: SIMS is an internal DOE system


# To Do

1. Get query text from Strategic Support that searches only for SETO projects
    * This won't be critical long-term though, as SIMS should be able to generate a list of all Active projects that can be fed into the query
2. Figure out exactly what params are most useful using the [OSTI API docs](https://www.osti.gov/api/v1/docs).

In [68]:
#Query the API, mimicking the pre-made SS search URL as closely as possible
import requests

URL = "https://www.osti.gov/api/v1/records"

# Sort by publication date, with the most recent dates first (these can be future values),
# and only return records for things sponsored by the solar office, EE-4S
params = {'sort': 'publication_date desc', 'sponsor_org': 'EE-4S'}

r = requests.get(URL, params=params)

query_date = r.headers["Date"]
results_count = r.headers['X-Total-Count']

print(f"Query was successful: {r.status_code == requests.codes.ok}")
print(f"Query made on {query_date} returned {results_count} hits")
print(f"URL used was {r.url}")


Query was successful: True
Query made on Wed, 27 Feb 2019 03:35:18 GMT returned 16252 hits
URL used was https://www.osti.gov/api/v1/records?sort=publication_date+desc&sponsor_org=EE-4S
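
The response above reports 16252 hits, but a single response appears to carry only one page of records (the DataFrame built from it later in this notebook has 20 rows). Below is a minimal sketch of walking through every page. It assumes the API accepts `page` and `rows` query parameters, which should be confirmed against the OSTI API docs. `page_count` and `fetch_all_records` are hypothetical helper names, and the `get` argument exists only so that the loop can be exercised with a stub instead of a live request.

```python
import math

URL = "https://www.osti.gov/api/v1/records"


def page_count(total_hits, rows_per_page):
    """Number of pages needed to cover total_hits records."""
    return math.ceil(total_hits / rows_per_page)


def fetch_all_records(params, rows_per_page=20, max_pages=None, get=None):
    """Collect records page by page.

    ASSUMPTION: the API honors `page` (1-based) and `rows` parameters and
    reports the total hit count in the X-Total-Count header.
    """
    if get is None:
        import requests  # imported lazily so a stub `get` can be used offline
        get = requests.get

    records = []
    page = 1
    while True:
        page_params = dict(params, page=page, rows=rows_per_page)
        response = get(URL, params=page_params, timeout=30)
        response.raise_for_status()
        batch = response.json()
        records.extend(batch)

        # Stop on an empty or short page, which signals the end of the results
        if not batch or len(batch) < rows_per_page:
            break
        # Stop once the advertised total has been collected, if it is known
        total_header = response.headers.get("X-Total-Count")
        if total_header is not None and len(records) >= int(total_header):
            break
        # Optional safety cap to avoid pulling thousands of pages by accident
        if max_pages is not None and page >= max_pages:
            break
        page += 1
    return records
```

With 20 rows per page, `page_count(16252, 20)` gives 813 requests for the full result set, so passing a small `max_pages` while experimenting is probably wise.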


# Problems

1. Using [this search URL](https://www.osti.gov/search/sort:publication_date%20desc/sponsor-org:EE-4S#), I'm able to get results that are specific to the solar office (SETO = EE-4S) just fine. *But if I try to use the same parameters for the API call, I get nonsensical results*. 
    * If I sort by publication date without specifying a sort order, I don't see any obvious default ordering, so I may as well not have sorted in the first place.
    * If I specify the sort order with a separate parameter (`order: desc`), it fails. The API documentation is misleading here, because what the API really wants is for you to **specify the sort order as part of the sort field specification**. In other words, instead of `{'sort': 'publication_date', 'order': 'desc'}`, it only works if you use `{'sort': 'publication_date desc'}`, even though the documentation says otherwise.
2. Regardless of the sorting issues, filtering with `sponsor_org = EE-4S` doesn't seem to be working as intended: the first page of results includes sponsor entries for several other EERE offices.
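
Both problems can be worked around on the client side. The sketch below folds the sort direction into the `sort` value, as described in the first problem, and then re-checks each returned record's `sponsor_orgs` list for the office code instead of trusting the server-side filter. The helper names (`build_params`, `has_sponsor_code`, `filter_by_sponsor`) are hypothetical, and matching on the literal text `(EE-4S)` is an assumption based on how sponsor names appear in the results in this notebook.

```python
def build_params(sort_field, descending=True, sponsor_org=None):
    """Build query parameters with the sort direction folded into `sort`."""
    # ASSUMPTION: "asc" is accepted the same way "desc" is; only "desc"
    # has been confirmed to work in this notebook.
    direction = "desc" if descending else "asc"
    params = {"sort": f"{sort_field} {direction}"}
    if sponsor_org:
        params["sponsor_org"] = sponsor_org
    return params


def has_sponsor_code(record, code="EE-4S"):
    """True if any sponsor_orgs entry contains the office code in parentheses."""
    tag = f"({code})"
    sponsors = record.get("sponsor_orgs") or []
    return any(tag in sponsor for sponsor in sponsors)


def filter_by_sponsor(records, code="EE-4S"):
    """Keep only the records that list the given office code as a sponsor."""
    return [record for record in records if has_sponsor_code(record, code)]
```

For example, `filter_by_sponsor(r.json())` would keep only the records on the current page that actually name the solar office, whatever the API returned.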

In [69]:
#Import the JSON query response into a DataFrame for cleaning
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict(r.json())
df

Unnamed: 0,article_type,authors,availability,contributing_org,country_publication,description,doe_contract_number,doi,entry_date,format,...,links,osti_id,product_type,publication_date,publisher,report_number,research_orgs,sponsor_orgs,subjects,title
0,,"[Dong, Changgui, Sigrin, Benjamin]",,,United States,"Distributed energy resources, such as rooftop ...",AC36-08GO28308,10.1016/j.enpol.2019.02.017,2019-02-25T05:00:00Z,Medium: X; Size: p. 100-110,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1494980,Journal Article,2019-06-01T04:00:00Z,Elsevier,NREL/JA-6A20-66020,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 29 ENERGY PLANNING, POLICY, ...",Using Willingness to Pay to Forecast the Adopt...
1,,"[Skoryunov, R. V., Babanova, O. A., Soloninin,...",,,United States,In order to study the dynamical properties of ...,AC36-08GO28308,10.1016/j.jallcom.2018.12.162,2019-01-23T05:00:00Z,Medium: X; Size: p. 913-918,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1491139,Journal Article,2019-04-01T04:00:00Z,Elsevier,NREL/JA-5900-73081,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[36 MATERIALS SCIENCE, energy storage material...",Nuclear Magnetic Resonance Study of Anion and ...
2,,"[Sulas, Dana B., Johnston, Steve (ORCID:000000...",,,United States,We investigate the implications of using parti...,AC36-08GO28308,10.1016/j.solmat.2018.12.022,2019-01-23T05:00:00Z,Medium: X; Size: p. 81-87,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1491141,Journal Article,2019-04-01T04:00:00Z,Elsevier,NREL/JA-5K00-71930,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 36 MATERIALS SCIENCE, silico...",Comparison of Photovoltaic Module Luminescence...
3,,"[Cai, Can, Miller, David C., Tappan, Ian A., D...",,,United States,We developed a framework to predict and model ...,AC36-08GO28308,10.1016/j.solmat.2018.11.024,2019-01-08T05:00:00Z,Medium: X; Size: p. 486-492,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1489188,Journal Article,2019-03-01T05:00:00Z,Elsevier,NREL/JA-5K00-73005,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 36 MATERIALS SCIENCE, accele...",Framework for Predicting the Photodegradation ...
4,,"[Monroe, Eric, Gladden, John, Albrecht, Karl O...",,,United States,This work describes the first documented case ...,AC36-08GO28308,10.1016/j.fuel.2018.11.046,2019-02-07T05:00:00Z,Medium: X; Size: p. 1143-1148,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1492507,Journal Article,2019-03-01T05:00:00Z,Elsevier,NREL/JA-5400-73186,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[09 BIOMASS FUELS, 37 INORGANIC, ORGANIC, PHYS...",Discovery of novel octane hyperboosting phenom...
5,,"[Neises, Ty, Turchi, Craig]",,,United States,"This analysis investigates the design, cost, a...",AC36-08GO28308,10.1016/j.solener.2019.01.078,2019-02-25T05:00:00Z,Medium: X; Size: p. 27-36,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1494976,Journal Article,2019-03-01T05:00:00Z,Elsevier,NREL/JA-5500-72674,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 47 OTHER INSTRUMENTATION, co...",Supercritical Carbon Dioxide Power Cycle Desig...
6,Published Article,"[Padgett, Elliot (ORCID:0000000190342335), Yar...",,,United States,,EE0007271; AC02-06CH11357,10.1149/2.0371904jes,2019-02-25T05:00:00Z,Medium: X; Size: p. F198-F207,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1495620,Journal Article,2019-02-21T05:00:00Z,The Electrochemical Society,,[],[USDOE Office of Energy Efficiency and Renewab...,[],Mitigation of PEM Fuel Cell Catalyst Degradati...
7,Published Article,"[Schuler, Tobias, Chowdhury, Anamika, Freiberg...",,,United States,,AC02-05CH11231,10.1149/2.0031907jes,2019-02-22T05:00:00Z,Medium: X; Size: p. F3020-F3031,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1495252,Journal Article,2019-02-20T05:00:00Z,The Electrochemical Society,,[],[USDOE Office of Energy Efficiency and Renewab...,[],Fuel-Cell Catalyst-Layer Resistance via Hydrog...
8,,"[Jain, Akshay Kumar [National Renewable Energy...",,,United States,Distributed photovoltaic systems (DPV) can cau...,AC36-08GO28308,,2019-02-26T05:00:00Z,Medium: ED; Size: 1.4 MB,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1495718,Conference,2019-02-15T05:00:00Z,,NREL/CP-5D00-72284,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 24 POWER TRANSMISSION AND DI...",Quasi-Static Times Series PV Hosting Capacity ...
9,,"[Woodhouse, Michael A [National Renewable Ener...",,,United States,In this paper we provide an overview of the ac...,AC36-08GO28308,,2019-02-26T05:00:00Z,Medium: ED; Size: 4.5 MB,...,"[{'rel': 'citation', 'href': 'https://www.osti...",1495719,Technical Report,2019-02-15T05:00:00Z,,NREL/TP-6A20-72134,"[National Renewable Energy Lab. (NREL), Golden...",[USDOE Office of Energy Efficiency and Renewab...,"[14 SOLAR ENERGY, 29 ENERGY PLANNING, POLICY, ...",Crystalline Silicon Photovoltaic Module Manufa...


In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 24 columns):
article_type           5 non-null object
authors                20 non-null object
availability           20 non-null object
contributing_org       20 non-null object
country_publication    20 non-null object
description            20 non-null object
doe_contract_number    20 non-null object
doi                    20 non-null object
entry_date             20 non-null object
format                 20 non-null object
journal_issue          20 non-null object
journal_name           20 non-null object
journal_volume         20 non-null object
language               20 non-null object
links                  20 non-null object
osti_id                20 non-null object
product_type           20 non-null object
publication_date       20 non-null object
publisher              20 non-null object
report_number          20 non-null object
research_orgs          20 non-null object
sponsor_orgs    

In [71]:
#Provide some basic info about missing values
missing = pd.DataFrame(df.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(df),2)
missing.sort_values('total missing', ascending = False)

Unnamed: 0,total missing,percent missing
article_type,15,0.75
authors,0,0.0
subjects,0,0.0
sponsor_orgs,0,0.0
research_orgs,0,0.0
report_number,0,0.0
publisher,0,0.0
publication_date,0,0.0
product_type,0,0.0
osti_id,0,0.0
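
The counts above only catch `None`/`NaN`. The record preview earlier shows empty `doi` and `description` cells and empty `research_orgs` lists, yet those columns are reported as fully populated, which suggests the API returns empty strings and empty lists rather than nulls. Here is a minimal sketch that treats those as blank too, working directly on the list of record dicts from `r.json()`. `is_blank` and `blank_counts` are hypothetical helper names.

```python
import math


def is_blank(value):
    """True for None, NaN, empty/whitespace-only strings, and empty containers."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


def blank_counts(records):
    """Count blank values per field across a list of record dicts.

    A field that is absent from a record is counted as blank for that record.
    """
    fields = set()
    for record in records:
        fields.update(record)

    counts = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            if is_blank(record.get(field)):
                counts[field] += 1
    return counts
```

The resulting dict can be wrapped in `pd.Series(blank_counts(r.json()))` and sorted to get a table comparable to the one above.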


In [88]:
# Melt a column (sponsor_orgs) whose cells hold lists into a database-like layout,
# with one row per list element, then count how often each sponsor appears.
# (On pandas >= 0.25, df['sponsor_orgs'].explode().value_counts() is a shorter equivalent.)

df['sponsor_orgs'].apply(pd.Series)\
    .merge(df, right_index = True, left_index = True)\
    .drop(["sponsor_orgs"], axis = 1) \
    .melt(id_vars = df.drop('sponsor_orgs', axis = 1).columns.values, value_name = "sponsor_org") \
    .drop("variable", axis = 1) \
    .dropna(subset = ['sponsor_org'])['sponsor_org'].value_counts()

USDOE Office of Energy Efficiency and Renewable Energy (EERE), Solar Energy Technologies Office (EE-4S)                        8
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Fuel Cell Technologies Office (EE-3F)                           4
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Vehicle Technologies Office (EE-3V)                             3
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Wind and Water Technologies Office (EE-4W)                      2
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Solar Energy Technologies Office (EE-4S), SunShot Initiative    1
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Weatherization and Intergovernmental Programs Office (EE-5W)    1
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Building Technologies Office (EE-5B)                            1
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Bioenergy Technologies Office (EE-