## Install prerequisites

The only prerequisite at this time is the Python Sparc Client library (sparc.client).

Optionally, to upload files, the Pennsieve Agent needs to be installed.

For details, please follow the instructions at https://docs.pennsieve.io/docs/uploading-files-programmatically.


In [1]:
!pip install sparc.client
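
To confirm that the installation succeeded, the installed version can be read back. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and is not part of sparc.client.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# 'sparc.client' is the distribution name used in the pip command above.
found = installed_version("sparc.client")
print("sparc.client:", found if found else "not installed")
```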



## Load modules


sparc.client has a modular structure. Modules can be loaded either automatically (without the 'connect' flag), or manually.

In the following example we load the Pennsieve2 module; connecting to a running Pennsieve Agent is shown in the commented-out lines.

In [7]:
from sparc.client import SparcClient
client = SparcClient(connect=False, config_file='config.ini')

#Connect to a specific module - REQUIRES PENNSIEVE AGENT RUNNING
#module = client.pennsieve.connect()
#module.user.whoami() #execute internal functions of the module

# alternatively connect all the services available
#client.connect()  #connect to all services
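
The cell above reads its settings from `config.ini`, whose contents are not shown in this guide. As a rough sketch, such a file could be generated with `configparser`. The section names and the `default_profile` key below are assumptions made for illustration; only `pennsieve_profile_name` appears elsewhere in this guide. The file is written to a temporary directory so that an existing `config.ini` is not overwritten.

```python
import configparser
import tempfile
from pathlib import Path

# ASSUMPTION: section names and 'default_profile' are illustrative only;
# 'pennsieve_profile_name' is the one key that appears in this guide.
config = configparser.ConfigParser()
config["global"] = {"default_profile": "ci"}
config["ci"] = {"pennsieve_profile_name": "ci"}

# Write to a temporary directory so a real config.ini is left untouched.
config_path = Path(tempfile.mkdtemp()) / "config.ini"
with config_path.open("w") as handle:
    config.write(handle)

# Read the file back to confirm that it parses as expected.
loaded = configparser.ConfigParser()
loaded.read(config_path)
profile = loaded["global"]["default_profile"]
print(profile, dict(loaded[profile]))
```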

Modules can also be loaded from other locations by providing a configuration dictionary and the path to the module.


In [3]:
# Modules can also be added later by passing a config with env variables and the path to the module
client.add_module(config={'pennsieve_profile_name': 'ci'},
                  path='sparc.client.services.pennsieve',
                  connect=False)

## Pennsieve Module API 

The Pennsieve module allows users to interact with the Pennsieve platform.

Without connecting to the agent, the user can query the Discover service of the Pennsieve platform for datasets, specific files and records, and can download publicly available files or datasets.



#### Listing datasets

Datasets that match a specific word, e.g. the last name of the PI or a medical term, can be listed as follows:

In [4]:
response=client.pennsieve.list_datasets(query='Wagenaar', limit=2)
response

{'limit': 2,
 'offset': 0,
 'totalCount': 5,
 'datasets': [{'id': 3,
   'sourceDatasetId': 13,
   'name': 'Canine Epilepsy Dataset',
   'description': 'Intracranial EEG recordings from three dogs with naturally-occurring focal epilepsy.',
   'ownerId': 97,
   'ownerFirstName': 'Jacqueline',
   'ownerLastName': 'Boccanfuso',
   'ownerOrcid': '',
   'organizationName': 'Mayo',
   'organizationId': 6,
   'license': 'Creative Commons Zero 1.0 Universal',
   'tags': ['canine', 'epilepsy', 'intracranial', 'continuous', 'eeg'],
   'version': 1,
   'revision': None,
   'size': 263615741197,
   'modelCount': [{'modelName': 'Channel', 'count': 48},
    {'modelName': 'Recording', 'count': 3},
    {'modelName': 'Subject', 'count': 3},
    {'modelName': 'Annotation', 'count': 219}],
   'fileCount': 61,
   'recordCount': 273,
   'uri': 's3://pennsieve-prod-discover-publish-use1/3/1/',
   'arn': 'arn:aws:s3:::pennsieve-prod-discover-publish-use1/3/1/',
   'status': 'PUBLISH_SUCCEEDED',
   'doi': '10.
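
The response is a plain dictionary, so a short summary can be built from its `datasets` list. The sketch below works on an abridged copy of the first entry shown above rather than on a live response; the helper name `summarize_datasets` is hypothetical.

```python
def summarize_datasets(response: dict) -> list:
    """Return (id, name, size in GB) for every dataset in a list_datasets response."""
    return [
        (d["id"], d["name"], round(d["size"] / 1e9, 1))
        for d in response.get("datasets", [])
    ]


# Abridged copy of the first entry shown above, used as stand-in data.
sample = {
    "limit": 2,
    "offset": 0,
    "totalCount": 5,
    "datasets": [
        {"id": 3, "name": "Canine Epilepsy Dataset", "size": 263615741197},
    ],
}

for dataset_id, name, size_gb in summarize_datasets(sample):
    print(f"{dataset_id:>4}  {name}  ({size_gb} GB)")
```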

We can query the Discover service with different options, e.g. restrict the search to a certain organization, return only embargoed datasets, and order the results by name, date or size in ascending or descending order.

In [5]:
response=client.pennsieve.list_datasets(organization='Sparc', embargo=True, order_by='date', order_direction='asc')
response['datasets'][0]

{'id': 238,
 'sourceDatasetId': 1560,
 'name': 'Mapping colon and bladder innervating sensory neurons in CLARITY cleared ganglia in mouse',
 'description': 'Imaging of colon and bladder retrograde labelled sensory neurons in whole CLARITY cleared dorsal root ganglia and nodose/jugular ganglia complex.',
 'ownerId': 1205,
 'ownerFirstName': 'Stuart',
 'ownerLastName': 'Brierley',
 'ownerOrcid': '0000-0002-2527-2905',
 'organizationName': 'SPARC Consortium',
 'organizationId': 367,
 'license': 'Creative Commons Attribution',
 'tags': ['visceral sensory neurons', 'dorsal root ganglia'],
 'version': 1,
 'revision': None,
 'size': 52139108574,
 'modelCount': [{'modelName': 'researcher', 'count': 2},
  {'modelName': 'human_subject', 'count': 0},
  {'modelName': 'term', 'count': 2},
  {'modelName': 'award', 'count': 0},
  {'modelName': 'animal_subject', 'count': 21},
  {'modelName': 'sample', 'count': 21},
  {'modelName': 'protocol', 'count': 2},
  {'modelName': 'summary', 'count': 4}],
 'fil
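
Each response reports `limit`, `offset` and `totalCount`, which suggests that larger result sets are retrieved page by page. The following sketch shows one way to walk through all pages. It is written against a stand-in `fake_fetch` function so that it runs without network access; the helper `iterate_pages` is hypothetical, and passing an `offset` argument to `list_datasets` is an assumption that this guide does not demonstrate.

```python
from typing import Callable, Iterator


def iterate_pages(fetch: Callable[[int, int], dict], key: str,
                  limit: int = 10) -> Iterator[dict]:
    """Yield every item by requesting pages of `limit` items until totalCount is reached."""
    offset = 0
    while True:
        page = fetch(limit, offset)
        items = page.get(key, [])
        yield from items
        offset += limit
        # Stop on an empty page or once the reported total has been covered.
        if not items or offset >= page.get("totalCount", 0):
            break


# Stand-in for the real service: five fake datasets served page by page.
FAKE = [{"id": i} for i in range(1, 6)]


def fake_fetch(limit: int, offset: int) -> dict:
    return {"limit": limit, "offset": offset, "totalCount": len(FAKE),
            "datasets": FAKE[offset:offset + limit]}


print([d["id"] for d in iterate_pages(fake_fetch, "datasets", limit=2)])
```

With a real client, `fetch` could be a small lambda that forwards `limit` and `offset` to `client.pennsieve.list_datasets`, provided the installed version accepts both arguments.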

#### Listing records

Apart from listing datasets, we can also zoom into the records of a specific model, for example to explore the researchers within the Sparc project.

In [6]:
response=client.pennsieve.list_records(model='researcher', organization='Sparc')
response

{'limit': 10,
 'offset': 0,
 'totalCount': 1141,
 'records': [{'datasetId': 282,
   'version': 1,
   'model': 'researcher',
   'properties': {'hasORCIDId': 'https://orcid.org/0000-0002-0067-510X',
    'hasAffiliation': 'Univeristy of California, Los Angeles;University of California, Los Angeles;https://ror.org/046rm7j60',
    'middleName': '',
    'hasRole': '',
    'lastName': 'Yuan',
    'recordHash': '82329d634a673e2f45a1bfe90930e5fc',
    'firstName': 'Pu-Qing',
    'id': '3136648b-e83d-4e30-a20c-d6af747712d5'}},
  {'datasetId': 287,
   'version': 1,
   'model': 'researcher',
   'properties': {'hasORCIDId': 'https://orcid.org/0000-0002-4153-9614',
    'hasAffiliation': 'University College London',
    'middleName': '',
    'hasRole': '',
    'lastName': 'Thompson',
    'recordHash': '77fbf8f43a9adb0180b8d1c8d289e6f6',
    'firstName': 'Nicole',
    'id': 'ea50a52a-54a3-4180-ace6-0b6b59beeb2a'}},
  {'datasetId': 290,
   'version': 1,
   'model': 'researcher',
   'properties': {'hasO
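
Each record keeps its fields under `properties`, so a readable list of researchers can be derived from the response. The sketch below uses an abridged copy of the first two records shown above; the helper name `researcher_names` is hypothetical.

```python
def researcher_names(response: dict) -> list:
    """Return 'First Last (ORCID)' strings from a list_records response."""
    names = []
    for record in response.get("records", []):
        props = record.get("properties", {})
        full_name = f"{props.get('firstName', '')} {props.get('lastName', '')}".strip()
        orcid = props.get("hasORCIDId", "")
        names.append(f"{full_name} ({orcid})" if orcid else full_name)
    return names


# Abridged copy of the first two records shown above.
sample = {
    "records": [
        {"datasetId": 282, "model": "researcher",
         "properties": {"firstName": "Pu-Qing", "lastName": "Yuan",
                        "hasORCIDId": "https://orcid.org/0000-0002-0067-510X"}},
        {"datasetId": 287, "model": "researcher",
         "properties": {"firstName": "Nicole", "lastName": "Thompson",
                        "hasORCIDId": "https://orcid.org/0000-0002-4153-9614"}},
    ]
}

for line in researcher_names(sample):
    print(line)
```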

#### Listing files

Similarly, we can query for files by name or extension, e.g. files that are included in a specific dataset.

In [7]:
response=client.pennsieve.list_files(dataset_id=90, query='manifest', file_type='json')
response

[{'name': 'manifest.json',
  'datasetId': 90,
  'datasetVersion': 1,
  'size': 19928,
  'fileType': 'Json',
  'packageType': 'Unsupported',
  'icon': 'JSON',
  'uri': 's3://pennsieve-prod-discover-publish-use1/90/1/manifest.json',
  'createdAt': None,
  'sourcePackageId': None},
 {'name': 'manifest.json',
  'datasetId': 90,
  'datasetVersion': 1,
  'size': 1660,
  'fileType': 'Json',
  'packageType': 'Unsupported',
  'icon': 'JSON',
  'uri': 's3://pennsieve-prod-discover-publish-use1/90/1/revisions/1/manifest.json',
  'createdAt': None,
  'sourcePackageId': None}]

If we are only interested in the relative paths of the files, we can use the list_filenames() function for convenience.

In [8]:
response = client.pennsieve.list_files(dataset_id=90, query='manifest')
# keep everything after the bucket, dataset id and version in each S3 URI
["/".join(x["uri"].split("/")[5:]) for x in response]

# response = client.pennsieve.list_filenames(dataset_id=90, query='manifest')
# response

['files/primary/sub-898/manifest.xlsx',
 'manifest.json',
 'files/primary/sub-897/manifest.xlsx',
 'files/primary/sub-896/manifest.xlsx',
 'revisions/1/manifest.json',
 'files/derivative/manifest.xlsx']
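
The index `5` in the cell above follows from the layout of the URIs: splitting `s3://<bucket>/<dataset id>/<version>/<path>` on `/` yields `s3:`, an empty string, the bucket, the dataset id and the version before the relative path starts. The same result can be obtained with `urllib.parse`, as in this small sketch (the helper name `relative_path` is hypothetical):

```python
from urllib.parse import urlparse


def relative_path(uri: str) -> str:
    """Strip the bucket, dataset id and version from a Discover S3 URI."""
    # For 's3://bucket/90/1/a/b.txt' the parsed path is '/90/1/a/b.txt';
    # its first two components are the dataset id and the version.
    parts = urlparse(uri).path.lstrip("/").split("/")
    return "/".join(parts[2:])


uri = "s3://pennsieve-prod-discover-publish-use1/90/1/revisions/1/manifest.json"
print(relative_path(uri))
# Same result as the slicing used in the cell above.
print("/".join(uri.split("/")[5:]))
```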

### Downloading files

Downloading files is also very simple. All we need to do is list the file(s) to be downloaded and pass them to the download_file function.

The function will either download the file with its original extension (if output_name is not specified) or pack the files and download them in gzip format to the specified directory.

In [9]:
!dir

 Volume in drive C has no label.
 Volume Serial Number is DC62-3214

 Directory of C:\Users\patryk\projects\test1

01/30/2023  04:25 PM    <DIR>          .
01/30/2023  04:25 PM    <DIR>          ..
01/30/2023  02:27 PM    <DIR>          .ipynb_checkpoints
01/30/2023  04:25 PM            67,436 Beginners guide.ipynb
01/30/2023  02:27 PM               130 config.ini
01/30/2023  04:01 PM            19,928 download
01/30/2023  04:03 PM            19,928 manifest.json
01/30/2023  04:03 PM            21,870 myfile.gz
01/30/2023  01:26 PM                 7 README.md
               6 File(s)        129,299 bytes
               3 Dir(s)  410,390,056,960 bytes free


In [10]:
response=client.pennsieve.list_files(dataset_id=90, query='manifest', file_type='json')
client.pennsieve.download_file(file_list=response, output_name='myfile.gz')
!dir 

 Volume in drive C has no label.
 Volume Serial Number is DC62-3214

 Directory of C:\Users\patryk\projects\test1

01/30/2023  04:25 PM    <DIR>          .
01/30/2023  04:25 PM    <DIR>          ..
01/30/2023  02:27 PM    <DIR>          .ipynb_checkpoints
01/30/2023  04:25 PM            67,436 Beginners guide.ipynb
01/30/2023  02:27 PM               130 config.ini
01/30/2023  04:01 PM            19,928 download
01/30/2023  04:03 PM            19,928 manifest.json
01/30/2023  04:25 PM            21,870 myfile.gz
01/30/2023  01:26 PM                 7 README.md
               6 File(s)        129,299 bytes
               3 Dir(s)  410,657,992,704 bytes free
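
The directory listing only shows that `myfile.gz` received a new timestamp. To check that a downloaded file really is a gzip stream, its first two bytes can be compared with the gzip magic number. The sketch below demonstrates the check on temporary files instead of a real download; the helper name `is_gzip` is hypothetical.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def is_gzip(path) -> bool:
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Self-contained demo with temporary files instead of a real download.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "myfile.gz"
with gzip.open(archive, "wb") as handle:
    handle.write(b'{"demo": true}')
plain = workdir / "manifest.json"
plain.write_bytes(b'{"demo": true}')

print(is_gzip(archive), is_gzip(plain))
```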


We can also download a single file. In that case, the original file will be downloaded (not a gzip archive).


In [11]:
response[1]

{'name': 'manifest.json',
 'datasetId': 90,
 'datasetVersion': 1,
 'size': 1660,
 'fileType': 'Json',
 'packageType': 'Unsupported',
 'icon': 'JSON',
 'uri': 's3://pennsieve-prod-discover-publish-use1/90/1/revisions/1/manifest.json',
 'createdAt': None,
 'sourcePackageId': None}

In [34]:
response=client.pennsieve.list_files(dataset_id=90, query='manifest', file_type='json')
client.pennsieve.download_file(file_list=response[1]) #, output_name='manifest.json')
!dir 

TypeError: expected str, bytes or os.PathLike object, not dict
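
The run recorded above ended with a TypeError when a single dictionary was passed without `output_name`. The guide does not state the cause. One defensive pattern, shown here only as a sketch, is to normalize the argument so that a list of file dictionaries is always passed. Whether this avoids the error depends on the installed sparc.client version and has not been verified; the helper name `as_file_list` is hypothetical.

```python
def as_file_list(files) -> list:
    """Wrap a single file dictionary in a list; pass other iterables through as a list."""
    if isinstance(files, dict):
        return [files]
    return list(files)


# Abridged copy of the file entry shown above.
single = {
    "name": "manifest.json",
    "datasetId": 90,
    "datasetVersion": 1,
    "uri": "s3://pennsieve-prod-discover-publish-use1/90/1/revisions/1/manifest.json",
}

print(len(as_file_list(single)), len(as_file_list([single, single])))
```

The normalized list could then be passed as `file_list`, optionally together with the `output_name='manifest.json'` argument that is commented out in the cell above.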

### Interacting with Pennsieve API

Sparc Client can also interact with the Pennsieve API and submit HTTP requests, e.g. GET or POST.

This functionality is limited to browsing public datasets available through the Discover service.

More advanced functions may require user authentication, which we will cover in the next section.

For reference, please visit the documentation website:
- https://docs.pennsieve.io/reference

In [13]:
#e.g. calling GET: https://docs.pennsieve.io/reference/browsefiles-1 for version 1 of dataset 90 with additional parameters
client.pennsieve.get('https://api.pennsieve.io/discover/datasets/90/versions/1/files/browse', params={'limit':2})

{'limit': 2,
 'offset': 0,
 'totalCount': 6,
 'files': [{'name': 'files',
   'path': 'files',
   'size': 1140301577,
   'type': 'Directory'},
  {'name': 'metadata',
   'path': 'metadata',
   'size': 48774,
   'type': 'Directory'}]}
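
The examples in this section address endpoints both by full URL and by a path without a host. The sketch below illustrates how such a path and its query parameters could be combined into the final request URL. It is an illustration under assumptions, not the library's implementation; `build_url` is a hypothetical helper and `BASE_URL` is taken from the GET call above.

```python
from typing import Optional
from urllib.parse import urlencode, urljoin

BASE_URL = "https://api.pennsieve.io"  # host taken from the GET example above


def build_url(path_or_url: str, params: Optional[dict] = None) -> str:
    """Resolve a path against BASE_URL (full URLs pass through) and append the query string."""
    url = urljoin(BASE_URL + "/", path_or_url.lstrip("/"))
    if params:
        url += "?" + urlencode(params)
    return url


print(build_url("/discover/datasets/90/versions/1/files/browse", {"limit": 2}))
```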

In [21]:
#e.g. calling POST: https://docs.pennsieve.io/reference/downloadmanifest-1 for version 1 of dataset 90 and browsing /metadata/records folder
client.pennsieve.post('/discover/datasets/90/versions/1/files/download-manifest', json={"paths":["/metadata/records"]})

{'header': {'count': 8, 'size': 24898},
 'data': [{'fileName': 'animal_subject.csv',
   'path': ['metadata', 'records'],
   'url': 'https://pennsieve-prod-discover-publish-use1.s3.amazonaws.com/90/1/metadata/records/animal_subject.csv?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEBwaCXVzLWVhc3QtMSJIMEYCIQDKsy6Mb%2FeqhzE9TxSDcmwa603fdu5bFphNERcPuoJS%2FwIhAMtngHVYQCFyYox6ZuQYw4zUqcbaW3U0RaowFgfDDS1IKooECJX%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEQAhoMNzQwNDYzMzM3MTc3Igwv6Q6FO2RaFayPmCcq3gOQp79okOc8vl%2BLG1dQa93L9UmCHRRP%2B%2FbNAixR%2BGKe8PQ8ALcEvHlPCAzRET3RdhRETnVPhtWSRMww2e9kYNJlRKEsUWX82jCnNtyB4pmFioVYJa1Iz0mjQrVO4yTG2qwk2%2B8IVx%2BgLUv%2FgWOcxEIW2fQoHMCTi5lO79DhSy1AnWd4BDyX6bcHtCt0PWhYZIohxV2VpLJgWTPuJT8iq%2BnWx2wPAXVm5%2FpOP2bOFnUKbrNNcoWtqW2M0D6u2Ed9ayDXJ4FNbjb4%2BgbFiHw2feFsdkDZWnbeYM6GIAPlND%2Fykfn%2BoLMuGdJ8%2FfEgJreamyOCt13Sp7%2BDlE929mL6Eii1WD%2BbNtmdcSvyXxjhsXn7UO%2Bv4k8Wtm4rhOYNKclZPswSJs8C3VT4Zmzw8H4JBxzIpDI5DpKKpJlJS1umrpQjBYfvpd%2BDcBED3pznvGIb0nI3Tgcl9HXqH%2Fo39MdwFovs8WtJysz6lTLJBIFeTJ1r
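
Each entry of the manifest carries a `fileName`, a `path` given as a list of folders, and a pre-signed `url`. The following sketch rebuilds the relative path of every listed file from an abridged copy of the response above, with the URL replaced by a placeholder. The helper name `target_paths` is hypothetical.

```python
import posixpath


def target_paths(manifest: dict) -> list:
    """Return the relative path of each file listed in a download-manifest response."""
    return [
        posixpath.join(*entry["path"], entry["fileName"])
        for entry in manifest.get("data", [])
    ]


# Abridged copy of the response above; the pre-signed URL is a placeholder.
sample = {
    "header": {"count": 8, "size": 24898},
    "data": [
        {"fileName": "animal_subject.csv",
         "path": ["metadata", "records"],
         "url": "<pre-signed URL elided>"},
    ],
}

print(target_paths(sample))
```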

## Uploading files

In this section we will cover how to programmatically upload files from a local drive to the Pennsieve server with sparc.client.

First, we need to download and configure the Pennsieve Agent, which is required for uploading files.

Instructions for configuring the agent can be found here:
https://docs.pennsieve.io/docs/uploading-files-programmatically

In [5]:
!pennsieve agent

Pennsieve Agent is already running on port: 9000
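
Before connecting, it can be useful to verify from Python that something is listening on the agent port. The sketch below performs a plain TCP connection test; it is demonstrated against a throwaway local listener rather than a real agent. The helper name `is_port_open` is hypothetical, and a successful connection only shows that the port is open, not that the Pennsieve Agent is the process behind it.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo against a throwaway local listener rather than a real agent.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
reachable = is_port_open("127.0.0.1", demo_port)
listener.close()
print(reachable)
```

For the agent reported above, the call would be `is_port_open("127.0.0.1", 9000)`.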


In [10]:
p=client.pennsieve.connect()

TypeError: Pennsieve.connect() got an unexpected keyword argument 'timeout'

In [28]:
p.use_dataset('xyz')

In [None]:
p.manifest.create('manifest.json')

In [None]:
p.manifest.upload()