Processing and displaying unpublished datasets in sparc-app: a full guide
This issue is where the documentation for the data staging environment will be stored until the data staging environment has its own repo.
An example of this running can be found at:
https://context-cards-demo-stage.herokuapp.com/maps
(note that only the maps page is implemented currently)
Because of this, we will first focus on the changes needed to display an unpublished or updated dataset on the /maps page. Much of this can be applied to the /data page, but a decent amount of work is needed to modify the data pulled from scicrunch, since unpublished datasets return less data than published ones.
Overview of steps
Part A: Running the current implementation
A1. Setting up environment variables
A2. Dataset processing from scicrunch
Part B: Understanding and modifying the current implementation
B1. Modifying sparc-api requests to use pennsieveId as opposed to DOI
B2. Downloading files from the pennsieve python client as opposed to from s3
B3. Downloading dataset thumbnails from pennsieve (as opposed to discover)
B4. Front end changes
Part A: Staging datasets and running the staging site
Step 1: Setting up environment variables
The following will be needed to stage and retrieve the datasets:
There are two categories: ones that can be kept the same as in normal development, and ones that need to be changed.
Same as normal:
The pennsieve keys must have access to the desired datasets for staging to see them:
And these are set to the curation index:
**Note that ALGOLIA_INDEX is front end. It is set in sparc-app.**
Feel free to Slack or email me if you are working on this and need any of these keys.
Step 2: Dataset Processing
Datasets can be put through the scicrunch elastic search processing via a url.
2a: Check if staging is ready to run
https://sparc.scicrunch.io/sparc/stage?api_key=<KNOWLEDGEBASE_KEY>
where <KNOWLEDGEBASE_KEY> is the value of your KNOWLEDGEBASE_KEY environment variable.
There is no processing queue and datasets can only be processed one at a time, so use this status check to confirm the server is available and ready before submitting.
2b: Submit dataset for staging
Use this url:
https://sparc.scicrunch.io/sparc/stage?api_key=<KNOWLEDGEBASE_KEY>&datasetID=<pennsieve-id>
where <pennsieve-id> is the pennsieve dataset identifier, e.g. 5c0a31f6-4926-4091-8876-3b11af7846ed.
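Both steps can be driven from a small script. This is a minimal sketch assuming the endpoint accepts plain GET requests with the query parameters shown above; the exact shape of the status response isn't documented here, so it is just printed for manual inspection:

```python
import requests

SCICRUNCH_STAGE_URL = 'https://sparc.scicrunch.io/sparc/stage'
KNOWLEDGEBASE_KEY = 'your-knowledgebase-key'

def stage_dataset(dataset_id):
    # 2a: check whether the staging server is free before submitting, since
    # there is no queue and only one dataset can be processed at a time
    status = requests.get(SCICRUNCH_STAGE_URL, params={'api_key': KNOWLEDGEBASE_KEY})
    print(status.text)  # inspect the reported status manually

    # 2b: submit the dataset for staging
    resp = requests.get(
        SCICRUNCH_STAGE_URL,
        params={'api_key': KNOWLEDGEBASE_KEY, 'datasetID': dataset_id},
    )
    return resp.text

stage_dataset('5c0a31f6-4926-4091-8876-3b11af7846ed')
```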
Step 3: Running the site
3a: Retrieve the staging repos
Use the staging branch of sparc-api:
#157
And this branch of sparc-app:
https://github.com/Tehsurfer/sparc-app/tree/new-staging
Set the sparc-api endpoint on sparc-app to where you are running it. (Often http://localhost:5000/)
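For example, sparc-app's environment could look like this (the variable name here is an assumption — check sparc-app's nuxt.config.js for the key it actually reads):

```
# Hypothetical variable name — confirm against sparc-app's nuxt.config.js
PORTAL_API_HOST=http://localhost:5000
```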
That should be everything needed to view staged datasets on the /maps page
Part B: Understanding and modifying the current implementation
Next we will go into how it works and how to develop it further.
I currently don't know of any tickets to develop this further, but I imagine there will soon be a ticket to allow staging datasets across all of sparc.science with a logged-in user's pennsieve keys.
B1. Modifying sparc-api requests to use pennsieveId as opposed to DOI
This is as simple as modifying the elastic search query to match on pennsieveId as opposed to DOIs.
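As a rough sketch (the field names here are illustrative, not taken from the real scicrunch index schema), the change amounts to swapping which field the term query matches on:

```python
# Hypothetical sketch of the query change — field names are illustrative.
def create_doi_query(doi):
    return {"query": {"term": {"item.curie": doi}}}

def create_pennsieve_query(pennsieve_id):
    # Same structure, but match on the pennsieve identifier instead of the DOI
    return {"query": {"term": {"pennsieve.identifier": pennsieve_id}}}
```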
The results from here can be processed with app/scicrunch_process_results.py; the function used is _prepare_results(results).
B2. Downloading files from the pennsieve python client as opposed to from s3
We first check which download method to use based on the length of the id (pennsieve ids are longer than discover ids), as the route below shows. This check is necessary because scicrunch does not tell us which type of id it returns for unpublished datasets.
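The staging branch of sparc-api handles this with a second s3-resource route:

```python
# This version of s3-resource is used for accessing files on staging that
# have never been published
@app.route("/s3-resource/<path:path>")
def direct_download_url2(path):
    filePath = path.split('files/')[-1]
    pennsieveId = path.split('/')[0]
    # If the id is short, it is a pennsieve discover id; hand it off to the
    # normal s3-resource route
    if len(pennsieveId) <= 4:
        return direct_download_url(path)
    if 'N:package:' not in pennsieveId:
        pennsieveId = 'N:dataset:' + pennsieveId
    url = bfWorker.getURLFromDatasetIdAndFilePath(pennsieveId, filePath)
    if url is not None:
        resp = requests.get(url)
        return resp.content
    return jsonify({'error': 'error with the provided ID'}), 502
```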
Note that url = bfWorker.getURLFromDatasetIdAndFilePath(pennsieveId, filePath) retrieves a temporary download url from the pennsieve python client for a given pennsieve id and file path. If the dataset does have a discover id, we need to retrieve the corresponding pennsieve id to use with the pennsieve python client.
You could attempt to avoid the call to scicrunch that translates between the discover id and the pennsieve id, but I did it this way to keep the downloads consistent, I believe.
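A minimal sketch of that lookup, assuming the staged index exposes both identifiers (the helper and field names here are hypothetical):

```python
def get_pennsieve_id_from_discover_id(discover_id):
    # Hypothetical field names — check the actual scicrunch index schema.
    query = {"query": {"term": {"pennsieve.discover_id": discover_id}}}
    results = query_scicrunch(query)  # assumed helper that posts the query to scicrunch
    hits = results['hits']['hits']
    return hits[0]['_source']['pennsieve']['identifier'] if hits else None
```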
B3. Downloading dataset thumbnails from pennsieve (as opposed to discover)
The banners unfortunately cannot be retrieved from the pennsieve python client or discover, so we must use the pennsieve REST api. The endpoint used for this is /get_banner.
Note that in order to use the pennsieve REST api you must log in via the s3 auth system.
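The route on the staging branch looks like this (pennsieve_login() obtains a temporary session key that authorises the banner request):

```python
@app.route("/get_banner/<datasetId>")
def get_banner_pen(datasetId):
    # Log in to the pennsieve REST api, then request the dataset banner
    p_temp_key = pennsieve_login()
    ban = get_banner(p_temp_key, datasetId)
    return ban
```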
B4. Front end changes
Since unpublished datasets return less content, checks likely need to be added to keep the site from accessing properties of undefined and crashing. I did this by adding more logic to check that fields exist before they are used, but it may be a bit more complicated to implement on the /datasets page, where a lot of processing is done in one big async data block.
The code to get this running on the /maps page is available here:
https://github.com/Tehsurfer/map-sidebar-curation
If you have any questions about the implementation or have ideas on how to do this better, feel free to chat here or DM me.