# PSS ESS SciCat - PaNOSC
## ESS SciCat integration with PaNOSC Search Scoring for PaNOSC Federated Search

This notebook shows how to extract items from the local catalogue system (SciCat at ESS) and populate the PaNOSC Search Scoring service (PSS, ESS implementation) with the items to be scored.  
Two groups of items are extracted and imported into PSS: datasets and documents.  
They match the two types of items that need to be scored.

Once we have verified that the items to be scored are in the scoring system, we trigger the weight computation and confirm that the weights have been computed.

**Important**: all the current items and weights already present in the database will be deleted.

**Disclaimer**: this notebook is just a proof of concept. Use it as is. By using this notebook, you are releasing ESS and its team from any responsibility.

In [None]:
%run PSS-SciCat-for-PaNOSC-common.ipynb

## Retrieve datasets and documents from SciCat

Log in to the SciCat backend.  
Post the username and password to the login URL and retrieve the JWT token to be used as the authentication token in each request.

In [None]:
res = requests.post(
    sc_functional_login_url,
    json={
        'username' : username,
        'password' : password
    }
)

A successful response should report a status code of 200.

In [None]:
res

Extract user id and access token from response

In [None]:
json_res = res.json()

In [None]:
json_res

In [None]:
access_token = json_res['id']
user_id = json_res['userId']

In [None]:
user_id, access_token
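
The extraction step above can be wrapped in small helpers that fail early when the login response is not what we expect. This is a minimal sketch, assuming the response body carries the `id` and `userId` keys used above; `parse_login_response` and `auth_headers` are hypothetical names and are not part of SciCat or of the common notebook.

```python
def parse_login_response(json_res):
    """Return (user_id, access_token) from a login response body."""
    # Assumption: the token is stored under 'id' and the user under 'userId',
    # as in the cells above.
    missing = [k for k in ('id', 'userId') if k not in json_res]
    if missing:
        raise KeyError(f"login response is missing fields: {missing}")
    return json_res['userId'], json_res['id']


def auth_headers(access_token):
    """Build the Authorization header used in the dataset requests."""
    return {'Authorization': 'Bearer ' + access_token}


# Example with a made-up response body (placeholder values, not real credentials).
example_user_id, example_token = parse_login_response({'id': 'abc123', 'userId': 'u-1'})
example_headers = auth_headers(example_token)
```

In the notebook, the same helpers would be called as `parse_login_response(res.json())`.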

#### Retrieve all datasets available, retain only the public ones and refactor them to be inserted in the scoring system

In [None]:
res = requests.get(
    sc_datasets_url,
    headers={
        'Authorization' : 'Bearer ' + access_token
    }
)

In [None]:
raw_datasets = res.json()

List of fields in the first item

In [None]:
list(raw_datasets[0].keys())

Extract public datasets

In [None]:
raw_published_datasets = {d['pid']: d for d in raw_datasets if d['isPublished']}

In [None]:
len(raw_published_datasets)
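
The filter above raises a `KeyError` if a dataset has no `isPublished` field. A minimal sketch of a more tolerant variant, assuming that a missing flag should be treated as "not published" (that policy is our assumption, and the sample records are made up):

```python
# Made-up records standing in for raw_datasets.
sample_datasets = [
    {'pid': 'pid/1', 'isPublished': True},
    {'pid': 'pid/2', 'isPublished': False},
    {'pid': 'pid/3'},  # flag absent: treated as not published
]

# dict.get returns the default instead of raising when the key is missing.
sample_published = {
    d['pid']: d
    for d in sample_datasets
    if d.get('isPublished', False)
}
```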

Prepare datasets to be inserted in the scoring service

In [None]:
def prepFields(item,group):
    return {
        k: item[v]
        for k,v
        in meaningful_fields[group].items()
    }
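
`prepFields` relies on `meaningful_fields`, which is defined in the common notebook and maps each scoring field name to the catalogue field that provides its value. The sketch below illustrates the renaming with a hypothetical mapping and a made-up item; the field names are assumptions chosen for illustration, not the actual configuration.

```python
# Hypothetical mapping: scoring field name -> catalogue field name.
example_meaningful_fields = {
    'datasets': {
        'title': 'datasetName',
        'description': 'description',
    }
}


def prep_fields_example(item, group, fields=example_meaningful_fields):
    """Same logic as prepFields, with the mapping passed in explicitly."""
    return {k: item[v] for k, v in fields[group].items()}


# Made-up catalogue item; 'owner' is not in the mapping and is dropped.
example_item = {
    'pid': 'pid/1',
    'datasetName': 'Sample run',
    'description': 'Test measurement',
    'owner': 'someone',
}
example_fields = prep_fields_example(example_item, 'datasets')
```

Only the mapped fields survive, renamed to the keys of the mapping.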

In [None]:
scoring_datasets = [
    {
        'id' : item['pid'],
        'group' : 'datasets',
        'fields' : prepFields(item,'datasets')
    }
    for item 
    in raw_published_datasets.values()
]

Number of items in group Datasets to be inserted in scoring system

In [None]:
len(scoring_datasets)

#### Retrieve all published data available (which are mapped to PaNOSC documents) and refactor them to be inserted in the scoring system

In [None]:
res = requests.get(
    sc_published_data_url + '?access_token=' + access_token
)

# this call to the endpoint does not work
# apparently it does not accept the authorization in the header
#res = requests.get(
#    sc_proposals_url,
#    headers={
#        'Authorization' : 'Bearer ' + access_token
#    }
#)

In [None]:
res

In [None]:
raw_published_data = res.json()

List of fields in the first item

In [None]:
list(raw_published_data[0].keys())

In [None]:
len(raw_published_data)

In [None]:
raw_published_data[0]

In [None]:
def get_dataset(pid):
    encoded_pid = urllib.parse.quote_plus(pid)
    res = requests.get(
        sc_datasets_url + '/' + encoded_pid,
        headers={
            'Authorization' : 'Bearer ' + access_token
        }
    )
    return res.json()

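Now attach to each published data record the public datasets listed in its `pidArray`, looking them up in the datasets already retrieved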

In [None]:
for pd in raw_published_data:
    pd['datasets'] = [raw_published_datasets[pid] for pid in pd['pidArray'] if pid in raw_published_datasets.keys()]

In [None]:
def extractFieldValue(dk,sk,item):
    #print('extractFieldValue ----------')
    #print(dk)
    #print(sk)
    #print(item)
    output = ""
    if isinstance(sk, dict):
        if isinstance(item[dk], list):
            output = [
                prepNestedFields(i,sk)
                for i
                in item[dk]
            ]
        else:
            output =  prepNestedFields(item[dk],sk)
    elif sk in item.keys():
        output = item[sk]

    return output

In [None]:
def prepNestedFields(item,fields_list):
    #print('prepNestedFields ----------')
    return {
        dk : extractFieldValue(dk,sk,item)
        for dk,sk
        in fields_list.items()
    }
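
The two functions above call each other recursively. A string value in the mapping names the source field to copy, while a dict value describes a nested structure; in the nested case the destination key is also used as the source key. The sketch below restates the same logic under hypothetical names and runs it on a made-up mapping and record; the field names are assumptions, not the real `meaningful_fields['documents']`.

```python
def extract_field_value(dk, sk, item):
    """Resolve one destination key dk from item, given its source spec sk."""
    if isinstance(sk, dict):
        # Nested spec: item[dk] holds either a list of sub-items or one sub-item.
        if isinstance(item[dk], list):
            return [prep_nested_fields(i, sk) for i in item[dk]]
        return prep_nested_fields(item[dk], sk)
    # Flat spec: copy the source field, or fall back to an empty string.
    return item[sk] if sk in item else ""


def prep_nested_fields(item, fields_list):
    """Build the scoring fields of an item from a (possibly nested) mapping."""
    return {dk: extract_field_value(dk, sk, item) for dk, sk in fields_list.items()}


# Hypothetical mapping and record, for illustration only.
example_mapping = {
    'title': 'title',
    'summary': 'abstract',
    'datasets': {'name': 'datasetName'},
}
example_record = {
    'title': 'Example publication',
    'datasets': [
        {'datasetName': 'Run A', 'pid': 'pid/1'},
        {'datasetName': 'Run B', 'pid': 'pid/2'},
    ],
}
example_document_fields = prep_nested_fields(example_record, example_mapping)
```

The record has no `abstract`, so `summary` comes back as an empty string, and each nested dataset is reduced to its mapped `name`.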

Prepare documents (published data) to be inserted in the scoring service

In [None]:
scoring_documents = [
    {
        'id' : item['doi'],
        'group' : 'documents',
        'fields' : prepNestedFields(item,meaningful_fields['documents'])
    }
    for item 
    in raw_published_data
]

Number of items in group Documents to be inserted in scoring system

In [None]:
len(scoring_documents)

#### Delete all the current items in the scoring system
We do not know if there are any items in the scoring system.  
Given that the scoring service uses ids from the catalogue, instead of checking and updating each item individually, it is faster to delete everything and insert the items once more.

At the moment there is no endpoint for deleting all the items or all the items belonging to a single group.  
We need to retrieve all the items and delete them one by one.

In [None]:
res = requests.get(pss_items_url + "/count")

In [None]:
count = res.json()['count']
count

In [None]:
res = requests.get(
    pss_items_url,
    params={
        'limit': count
    }
)

In [None]:
current_items = res.json() if count else []

In [None]:
len(current_items)

In [None]:
delete_res = []
for item in current_items:
    res = requests.delete(
        '/'.join([
            pss_items_url,
            item['id']
        ])
    )
    delete_res.append(res.status_code)


Make sure that all the deletes have been successful.  
We should see only one value, matching status code 200.

In [None]:
set(delete_res)
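
If the set above contains anything other than 200, it helps to know which items failed. A minimal sketch, assuming the item ids and the status codes are kept in the same order (as they are when the codes are appended inside the delete loop); the ids and codes below are made up.

```python
# Made-up ids and status codes, standing in for current_items and delete_res.
example_ids = ['pid/1', 'pid/2', 'doi/3']
example_codes = [200, 404, 200]

# Pair each id with its status code and keep only the unsuccessful deletes.
failed_deletes = [
    (item_id, code)
    for item_id, code in zip(example_ids, example_codes)
    if code != 200
]
```

In the notebook the ids would come from `[item['id'] for item in current_items]`.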

### Populate items in scoring service
We are inserting both datasets and documents

In [None]:
res = requests.get(pss_items_url + '/count')

We should have zero items in the scoring system

In [None]:
count = res.json()['count']
count

Insert datasets in items.  
Status code returned should be 201 for successful operation

In [None]:
res = requests.post(
    pss_items_url,
    json=scoring_datasets
)

In [None]:
res

Insert documents

In [None]:
res = requests.post(
    pss_items_url,
    json=scoring_documents
)

In [None]:
res

Let's verify that all our items have been created.  
First we request a count of the items, then we verify that we retrieve all the items.  
Finally, we are going to check if we get the two groups.

In [None]:
res = requests.get(pss_items_url + '/count')
count = res.json()['count']
count

In [None]:
res = requests.get(pss_items_url + "?limit=" + str(count+100))

In [None]:
res

In [None]:
items = res.json()

In [None]:
len(items)

In [None]:
set([item['group'] for item in items])
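
Beyond checking that both groups are present, the number of items per group can be compared with what we inserted. A minimal sketch with made-up items; `count_by_group` is a hypothetical helper.

```python
from collections import Counter


def count_by_group(items):
    """Count scoring items per group name."""
    return Counter(item['group'] for item in items)


# Made-up items standing in for the list returned by the items endpoint.
example_items = [
    {'id': 'pid/1', 'group': 'datasets'},
    {'id': 'pid/2', 'group': 'datasets'},
    {'id': 'doi/1', 'group': 'documents'},
]
group_counts = count_by_group(example_items)
```

In the notebook, `count_by_group(items)['datasets']` should equal `len(scoring_datasets)` and `count_by_group(items)['documents']` should equal `len(scoring_documents)`.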

### Weight Computation

Trigger weight computations with a post on the compute endpoint

In [None]:
res = requests.post(pss_compute_url)

In [None]:
res

In [None]:
res.json()

The response received from the scoring system informs us that the request has been submitted and received, but not yet started.

We suggest waiting a little and then placing the request below.  
A GET request to the compute endpoint returns the computation status.  
Re-run the following 3 cells until the computation is done.

In [None]:
res = requests.get(pss_compute_url)

In [None]:
res

In [None]:
status = res.json()
status

Computation is done when all three timestamp fields are assigned and progress is set to 1.0.  
It should look something like the following info:  
`    {`  
`      'requested': '2021-09-28T15:56:19.451171',`  
`      'started': '2021-09-28T15:56:24.468000',`  
`      'ended': '2021-09-28T15:57:00.753000',`  
`      'progressPercent': 1.0,`  
`      'progressDescription': 'Done',`  
`      'inProgress': False`  
`    }`  


In [None]:
while (status['progressPercent'] < 1.0):
    print("Weight computation not done yet")
    time.sleep(10)
    res = requests.get(pss_compute_url)
    status = res.json()

In [None]:
status
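
The polling loop above runs forever if the computation never completes. A minimal sketch of the same polling with a timeout, assuming the status dictionary exposes `progressPercent` as shown above; `wait_for_computation` and its parameters are hypothetical, and the status source is passed in as a function so the sketch can run without the scoring service.

```python
import time


def wait_for_computation(fetch_status, interval=10, timeout=600, sleep=time.sleep):
    """Poll fetch_status() until progressPercent reaches 1.0.

    Raises TimeoutError once roughly `timeout` seconds have been spent sleeping.
    """
    waited = 0
    status = fetch_status()
    while status['progressPercent'] < 1.0:
        if waited >= timeout:
            raise TimeoutError("weight computation did not finish in time")
        sleep(interval)
        waited += interval
        status = fetch_status()
    return status


# Fake status source for illustration: three successive answers.
fake_statuses = iter([
    {'progressPercent': 0.0},
    {'progressPercent': 0.5},
    {'progressPercent': 1.0},
])
final_status = wait_for_computation(
    lambda: next(fake_statuses),
    interval=0,
    sleep=lambda seconds: None,  # skip real sleeping in the example
)
```

In the notebook the status source would be `lambda: requests.get(pss_compute_url).json()`.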

### Retrieve all weights, count them and check one

In [None]:
res = requests.get(pss_weights_url + '/count')

In [None]:
res.json()

In [None]:
res = requests.get(pss_weights_url)

In [None]:
weights = res.json()

In [None]:
len(weights)

In [None]:
weights[0]

### Retrieve all terms, count them and check one

In [None]:
res = requests.get(pss_terms_url + '/count')

In [None]:
res.json()

In [None]:
res = requests.get(pss_terms_url)

In [None]:
terms = res.json()

In [None]:
len(terms)

In [None]:
terms[0]