# PSS SciCatLive 1: populating
## PaNOSC Search Scoring Workshop, Part 1
## SciCatLive integration between SciCat backend and PaNOSC Search Scoring

This notebook shows how to extract items from the local catalogue system and populate PaNOSC Search Scoring (PSS) with the items to be scored.  
It assumes that SciCatLive is running on your machine.  

Two groups of elements are extracted and imported into PSS: 
- datasets 
- documents (retrieved here from SciCat proposals and stored in the `proposals` group).  

These groups match the two types of items that need to be scored and that the PaNOSC Search API provides to the PaNOSC Federated Search.

**Important**: all the current items and weights already present in the database will be deleted.

**Disclaimer**:  
This notebook has been prepared within the context of the PaNOSC Scoring Workshop.  
It is provided as is, although you are free to reuse it for other purposes and modify it as you need.   
By using this notebook, you are releasing ESS and its team from any responsibility.

In [None]:
%run PSS-SciCatLive-common.ipynb

## Retrieve datasets and documents from SciCat backend running in SciCatLive

Log in to the SciCat backend.  
Send a request to the login endpoint with username and password.  
Retrieve the JWT token to be used as the authentication token in each request.

This is the login endpoint for the SciCat backend in SciCatLive

In [None]:
sc_login_url

In [None]:
res = requests.post(
    sc_login_url,
    json={
        'username' : username,
        'password' : password
    }
)

A successful response should report a status code of 200.

In [None]:
res

Extract user id and access token from response

In [None]:
json_res = res.json()

In [None]:
access_token = json_res['access_token'] if 'access_token' in json_res.keys() else json_res['id']
user_id = json_res['userId']

In [None]:
print("User id : {}".format(user_id))
print("Token   : {}".format(access_token))
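
The token handling above can be wrapped in a small helper that fails loudly when the login did not return a token. This is only a sketch: `extract_credentials` is a hypothetical name and the payloads at the bottom are made up for illustration. The only assumption taken from the cell above is that the token sits under `access_token` or, failing that, under `id`.

```python
def extract_credentials(login_response):
    """Return (access_token, user_id) from a decoded login response."""
    # Same lookup order as the cell above: 'access_token' first, then 'id'
    token = login_response.get('access_token') or login_response.get('id')
    if not token:
        raise ValueError("login response does not contain an access token")
    return token, login_response.get('userId')


# Made-up payloads, for illustration only
payload_with_id = {'id': 'token-from-id', 'userId': 'user-1'}
payload_with_access_token = {
    'access_token': 'token-from-access-token',
    'id': 'session-1',
    'userId': 'user-1',
}

print(extract_credentials(payload_with_id))
print(extract_credentials(payload_with_access_token))
```

In the notebook, this would be called as `extract_credentials(res.json())` right after the login request.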

#### Retrieve all datasets available and refactor them to be inserted in the scoring system

In [None]:
res = requests.get(
    sc_datasets_url,
    headers={
        'Authorization' : 'Bearer ' + access_token
    }
)

In [None]:
raw_datasets = res.json()

Number of datasets retrieved from SciCat backend

In [None]:
len(raw_datasets)

List of fields in the first item

In [None]:
list(raw_datasets[0].keys())

Prepare dataset items to be inserted in the scoring service.  
Each dataset item contains the scoring information under the `fields` key.

In [None]:
datasets_items = [
    {
        'id' : item['pid'],
        'group' : 'datasets',
        'fields' : prepFields(item,'datasets')
    }
    for item 
    in raw_datasets
]

Number of items in group Datasets to be inserted in scoring system

In [None]:
len(datasets_items)

In [None]:
datasets_items[0]
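
The same list comprehension is used for both groups, so it can be expressed as one function. The sketch below is an illustration under stated assumptions: `build_items` and `fake_prep_fields` are hypothetical names, and `fake_prep_fields` only stands in for the `prepFields` helper loaded from `PSS-SciCatLive-common.ipynb`, whose real behaviour is not shown in this notebook. Records lacking the id key are set aside instead of raising a `KeyError`.

```python
def build_items(raw_items, group, id_key, prep_fields):
    """Convert catalogue records into scoring items; return (items, skipped)."""
    items, skipped = [], []
    for raw in raw_items:
        if id_key not in raw:
            # Keep records without an id aside so they can be inspected later
            skipped.append(raw)
            continue
        items.append({
            'id': raw[id_key],
            'group': group,
            'fields': prep_fields(raw, group),
        })
    return items, skipped


def fake_prep_fields(item, group):
    # Stand-in for prepFields: simply copies a single, made-up field
    return {'title': item.get('title', '')}


# Made-up records, for illustration only
raw = [{'pid': 'pid-1', 'title': 'first'}, {'title': 'no pid'}]
items, skipped = build_items(raw, 'datasets', 'pid', fake_prep_fields)
print(items)
print(len(skipped))
```

With the real helper, the dataset cell above would correspond to `build_items(raw_datasets, 'datasets', 'pid', prepFields)`.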

#### Retrieve all proposals available and refactor them to be inserted in the scoring system

In [None]:
res = requests.get(
    sc_proposals_url + '?access_token=' + access_token
)

In [None]:
res

In [None]:
raw_proposals = res.json()
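
This notebook passes the token in two ways: as a `Bearer` header for the datasets request and as an `access_token` query parameter for the proposals request. A minimal sketch of both follows, using hypothetical helper names and a made-up URL and token. `urlencode` takes care of escaping characters that are not safe in a query string.

```python
from urllib.parse import urlencode


def auth_headers(token):
    """Headers carrying the token, as in the datasets request."""
    return {'Authorization': 'Bearer ' + token}


def url_with_token(url, token):
    """URL carrying the token as a query parameter, as in the proposals request."""
    return url + '?' + urlencode({'access_token': token})


# Made-up values, for illustration only
print(auth_headers('abc123'))
print(url_with_token('http://example.test/proposals', 'abc/123'))
```

With `requests`, the query-parameter form can also be written as `requests.get(sc_proposals_url, params={'access_token': access_token})`, which lets the library do the encoding.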

List of fields in the first item

In [None]:
list(raw_proposals[0].keys())

Prepare proposal items to be inserted in the scoring service

In [None]:
proposals_items = [
    {
        'id' : item['proposalId'],
        'group' : 'proposals',
        'fields' : prepFields(item,'proposals')
    }
    for item 
    in raw_proposals
]

Number of items in group Proposals to be inserted in scoring system

In [None]:
len(proposals_items)

#### Delete all the current items in the scoring system
During normal operation, we would not know if there are any items in the scoring system.  
Given that the scoring uses ids from the catalogue, instead of checking and updating each item individually, it is faster to delete everything and insert them once more.

In the context of the PaNOSC Scoring Workshop, if you are running this for the first time, the scoring system should be empty, so no item should be deleted.

At the moment there is no endpoint for deleting all the items or all the items belonging to a single group.   
We need to retrieve all the items and delete them one by one.

In [None]:
res = requests.get(pss_items_url + "/count")

In [None]:
count = res.json()['count']
count

In [None]:
res = requests.get(
    pss_items_url,
    params={
        'limit': count
    }
)

In [None]:
current_items = res.json() if count else []

In [None]:
len(current_items)

In [None]:
delete_res = []
for item in current_items:
    res = requests.delete(
        '/'.join([
            pss_items_url,
            item['id']
        ])
    )
    delete_res.append(res.status_code)


Make sure that all the deletes have been successful.  
We should see only one value, matching status code 200 (or an empty set if there was nothing to delete).

In [None]:
set(delete_res)
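
A set of status codes tells whether something went wrong, but not which item failed. A possible refinement, sketched here with a hypothetical `summarize_deletes` helper and made-up results, is to record the item id next to each status code.

```python
from collections import Counter


def summarize_deletes(results, ok_status=200):
    """Take (item_id, status_code) pairs; return (counts per status, failed ids)."""
    results = list(results)
    counts = Counter(status for _, status in results)
    failed = [item_id for item_id, status in results if status != ok_status]
    return counts, failed


# Made-up results, for illustration only
counts, failed = summarize_deletes([('a', 200), ('b', 200), ('c', 404)])
print(dict(counts))
print(failed)
```

In the loop above this would mean appending `(item['id'], res.status_code)` instead of `res.status_code` alone.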

### Populate items in scoring service
We are inserting both dataset and proposal items.  

First we check if we have any items in the system right now.

In [None]:
res = requests.get(pss_items_url + '/count')

We should have zero items in the scoring system, given that the system has just been deployed or that we have just deleted all its items

In [None]:
count = res.json()['count']
count

Insert datasets items.  
Status code returned should be 201 for successful operation

In [None]:
res = requests.post(
    pss_items_url,
    json=datasets_items
)

In [None]:
res

Insert proposal items.
Same as for datasets, returned code should be 201.

In [None]:
res = requests.post(
    pss_items_url,
    json=proposals_items
)

In [None]:
res
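
The two inserts above send each group in a single request. For a large catalogue it may be preferable to send the items in batches. The sketch below only shows the splitting: `batches` is a hypothetical helper, the batch size is arbitrary, and whether the scoring service imposes any limit on the request size is not stated in this notebook.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up items, for illustration only
sample = [{'id': str(n)} for n in range(5)]
for batch in batches(sample, 2):
    print([item['id'] for item in batch])
```

Each batch would then be posted as before, for example with `requests.post(pss_items_url, json=batch)`, checking that every response reports 201.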

Let's verify that all our items have been created.  
First we request a count of the items, then we verify that we retrieve all the items.  

In [None]:
res = requests.get(pss_items_url + '/count')

In [None]:
count = res.json()['count']
print("There are {} items in the scoring service".format(count))

Now we retrieve all the items and check that their number is right and that they fall into the two groups:   
`datasets` and `proposals` (the latter holding the documents)

In [None]:
res = requests.get(pss_items_url + "?limit=" + str(count+100))

In [None]:
items = res.json()

Here is the first item retrieved

In [None]:
items[0]

In [None]:
print("All items are grouped in the following groups: {}".format(set([item['group'] for item in items])))
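
To go one step further than listing the groups, the number of items per group can be compared with what was sent. A minimal sketch, with a hypothetical `group_mismatches` helper and made-up items:

```python
from collections import Counter


def group_mismatches(items, expected):
    """Return {group: (expected_count, actual_count)} for groups whose counts differ."""
    actual = Counter(item['group'] for item in items)
    return {
        group: (expected.get(group, 0), actual.get(group, 0))
        for group in set(expected) | set(actual)
        if expected.get(group, 0) != actual.get(group, 0)
    }


# Made-up items, for illustration only
stored = [{'group': 'datasets'}, {'group': 'datasets'}, {'group': 'proposals'}]
print(group_mismatches(stored, {'datasets': 2, 'proposals': 1}))
print(group_mismatches(stored, {'datasets': 2, 'proposals': 3}))
```

Here the expected counts would be `{'datasets': len(datasets_items), 'proposals': len(proposals_items)}`; an empty result means every group has the expected size.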