# *pyopencga* Catalog: Clinical Data and other Metadata. 

------
**[NOTE]** The server methods used by the *pyopencga* client are defined at the following Swagger URL:
- https://ws.opencb.org/opencga-prod/webservices/   

**[NOTE]** Currently implemented methods are registered in the following spreadsheet:
- https://docs.google.com/spreadsheets/d/1QpU9yl3UTneqwRqFX_WAqCiCfZBk5eU-4E3K-WVvuoc/edit?usp=sharing

## Overview

This notebook provides guidance for querying an OpenCGA server through *pyopencga* to explore the studies the user has access to, the clinical data provided in a study (samples, individuals, genotypes, etc.) and other types of metadata, such as permissions.

A good first step when starting to work with OpenCGA is to retrieve information about our user and the projects and studies we are allowed to see.<br>
It is also recommended to get a feel for the clinical data in the study: How many samples and individuals does the study have? Are there any defined cohorts? Can we get some statistics about the genotypes of the samples in the study?

For guidance on how to log in and get started with *opencga*, refer to [001-pyopencga_first_steps.ipynb](https://github.com/opencb/opencga/tree/develop/opencga-client/src/main/python/notebooks/user-training).
 

## 1. Set Up the Client and Log In to *pyopencga* 

**Configuration and Credentials** 

Let's assume we already have *pyopencga* installed in our Python setup (all the steps are described in [001-pyopencga_first_steps.ipynb](https://github.com/opencb/opencga/tree/develop/opencga-client/src/main/python/notebooks/user-training)).

You need to provide **at least** a host server URL in the standard configuration format for OpenCGA, either as a Python dictionary or in a JSON file.
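As a minimal sketch of the JSON option, the snippet below writes the same `{'rest': {'host': ...}}` structure to a file and reads it back with the standard library. The file name `opencga_client_config.json` is an assumption made for this example only; the resulting dictionary can then be handed to `ClientConfiguration` in the same way as the dictionary built in this notebook.

```python
import json
import os
import tempfile

# Same structure as the configuration dictionary used in this notebook.
config_dict = {'rest': {'host': 'https://ws.opencb.org/opencga-prod'}}

# Hypothetical file name and location, chosen only for this example.
config_path = os.path.join(tempfile.gettempdir(), 'opencga_client_config.json')

# Write the configuration to disk as JSON ...
with open(config_path, 'w') as handle:
    json.dump(config_dict, handle, indent=2)

# ... and read it back into a plain Python dictionary.
with open(config_path) as handle:
    loaded_config = json.load(handle)

print('Host read from file:', loaded_config['rest']['host'])
```

Keeping the host in a file makes it easier to point the same notebook at a different server without editing the code cells.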


In [1]:
from pyopencga.opencga_config import ClientConfiguration # import configuration module
from pyopencga.opencga_client import OpencgaClient # import client module
from pprint import pprint
import json

####################################
## Configuration parameters  #######
####################################
# OpenCGA host
host = 'https://ws.opencb.org/opencga-prod'

# User credentials
user = 'demouser'
passwd = 'demouser' ## you can skip this, see below.
study = 'demo@family:platinum'
####################################

# Creating ClientConfiguration dict
config_dict = {'rest': {
                       'host': host 
                    }
               }
print('Config information:\n',config_dict)

# Pass the config_dict dictionary to the ClientConfiguration method
config = ClientConfiguration(config_dict)

# Create the client
oc = OpencgaClient(config)

# Pass the credentials to the client
# (here we put only the user in order to be asked for the password interactively)
# oc.login(user)

# or you can pass the user and passwd
oc.login(user, passwd)



Config information:
 {'rest': {'host': 'https://ws.opencb.org/opencga-prod'}}


Once we have defined a variable with the client configuration and credentials, we can access all the methods defined for the client. These methods implement calls to query different data models in *OpenCGA*. 

In the use cases addressed in this notebook we will query the **users, projects, studies, samples, individuals, files and cohorts** *OpenCGA* data models.

In [3]:
## Define variables to query different data models through the web services

user_client = oc.users
project_client = oc.projects
study_client = oc.studies
sample_client = oc.samples
individual_client = oc.individuals
file_client = oc.files
cohort_client = oc.cohorts


# 2. Use Cases 

In this section we show how to work with some of the most common scenarios.<br>
The use cases addressed here constitute a high-level introduction that gives users a basis for their own explorations. Each example can be adapted to an individual use case.



## 2.1 Exploring User Account: Permissions, Projects and Studies

In this use case we cover retrieving information for our user.

**In OpenCGA, all the user permissions are established at a study level**. One project contains **at least** one study, although it may contain several.

#### Full Qualified Name (fqn) of Studies
It is also very important to understand that in OpenCGA, the projects and studies have a full qualified name (**fqn**) with the format:<br>
`[[owner]@[project]]:[study]`

A study name is not guaranteed to be unique: there might be **other studies** with the same name contained in **other projects**.<br> (E.g.: the study *platinum* might be defined in two different projects: *GRch37_project and GRch38_project*)

Because of that, it is recommended to use the **fqn** when referencing studies.
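To make the format concrete, here is a small sketch that splits and rebuilds a study fqn using plain string handling. The helper names `parse_study_fqn` and `build_study_fqn` are hypothetical and are not part of *pyopencga*; the only thing taken from the notebook is the `owner@project:study` layout.

```python
def parse_study_fqn(fqn):
    """Split 'owner@project:study' into its three parts.

    Raises ValueError when any of the three parts is missing.
    """
    owner_project, colon, study = fqn.partition(':')
    owner, at_sign, project = owner_project.partition('@')
    if not (colon and at_sign and owner and project and study):
        raise ValueError('Not a fully qualified study name: {!r}'.format(fqn))
    return {'owner': owner, 'project': project, 'study': study}


def build_study_fqn(owner, project, study):
    """Assemble the fqn string from its parts."""
    return '{}@{}:{}'.format(owner, project, study)


parts = parse_study_fqn('demo@family:platinum')
print(parts)
print(build_study_fqn(**parts))
```

A short form such as `family:platinum` is rejected by this parser on purpose, since it does not say who owns the project.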


### 2.1.1 Exploring Projects and Studies

Depending on the permissions granted, a user can be the owner of a study or just have access to some studies owned by other users.<br>We can retrieve information about our user and its permissions by:

In [92]:
## Getting user information
## [NOTE] The user client takes the query_id string directly --> (user)
user_info = user_client.info(user).get_result(0)

print('User info:')
print('id:{}\taccount_type: {}\t projects_owned: {}'.format(user_info['id'], user_info['account']['type'], len(user_info['projects'])))


User info:
id:demouser	account_type: GUEST	 projects_owned: 0


We can see that our user (demouser) does **not** own any projects.<br> However, the user might have been granted access to projects owned by other users. Let's see how to find this out:

#### Projects:
Now, we can list the projects our user can see using the **project client** `search()` function.

In [93]:
## Getting user projects
## [NOTE] Client specific methods have the query_id as a key:value (i.e (user=user_id)) 

projects_info = project_client.search()

projects_info.print_results(fields='id,name,organism.scientificName,organism.assembly,fqn', metadata=False)


#id	name	organism.scientificName	organism.assembly	fqn
family	Family Studies GRCh37	Homo sapiens	GRCh37	demo@family
population	Population Studies GRCh38	Homo sapiens	GRCh38	demo@population


As we can see, our user (*demouser*) has access to 2 different projects:
- Project: **family**
- Project: **population**

The **fqn** values `demo@population` and `demo@family` show that those projects are owned by the user *demo*, who has granted us permission to see them.


#### Studies
Now, let's see which studies we have access to within the **family** project.

In [94]:
project_id = 'family'  # The project we want to retrieve info

studies = study_client.search(project_id)

## Print the studies using the result_iterator() method

for study in studies.result_iterator():
    print("project:{}\t study_id:{}\t study_fqn:{} ".format(project_id, study['id'], study['fqn']))

project:family	 study_id:platinum	 study_fqn:demo@family:platinum 
project:family	 study_id:corpasome	 study_fqn:demo@family:corpasome 


Our user (*demouser*) has access to 2 different studies within the **family** project:
- study: **platinum**
- study: **corpasome**

### 2.1.2 Checking Groups and Permissions

Now let's assume that we want to check which groups our user belongs to and which permissions our user has been granted for the study (remember that all the permissions are established at the study level).<br>
The first step might be to check which groups exist within the study **platinum**:

In [95]:
# Define the fqn of the study

study_fqn = 'demo@family:platinum'


# Query to the study web service

groups = study_client.groups(study_fqn)

study_groups = []  # Define an empty list for the groups


## This will give us the whole list of groups existing in the study

for group in groups.result_iterator():
    study_groups.append(group['id'])
    print("group_id: {}".format(group['id']))
    
print('\nThere are {} groups in the study {}: {}'.format(len(study_groups), study_fqn, study_groups))


group_id: @members
group_id: @admins
group_id: @opencb-team

There are 3 groups in the study demo@family:platinum: ['@members', '@admins', '@opencb-team']


The study **platinum** has 3 different groups of users defined: **members, admins and opencb-team**


Now we want to check which groups include our user *demouser*:

In [96]:
user_groups = [] # Define an empty list 

## This will give us only the groups our user belongs to

for group in groups.result_iterator():
    if 'demouser' in group['userIds']:
        user_groups.append(group['id'])
        print("group_id: {}".format(group['id']))
        
print('\nOur user {} belongs to group/s: {}'.format(user, user_groups))


group_id: @members

Our user demouser belongs to group/s: ['@members']


Although there are 3 groups defined for the study, our user (*demouser*) only belongs to group **members**, which is one of the default groups in *OpenCGA*.
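The membership check can also be written as a small reusable function. The sketch below works on plain dictionaries shaped like the group records used above (an `id` and a list of `userIds`); the member lists in `example_groups` are made up for illustration, except for *demouser* belonging to `@members`.

```python
def groups_for_user(groups, user_id):
    """Return the ids of the groups whose 'userIds' list contains user_id."""
    return [group['id'] for group in groups
            if user_id in group.get('userIds', [])]


# Illustrative records only; the real member lists come from the server.
example_groups = [
    {'id': '@members', 'userIds': ['demo', 'demouser']},
    {'id': '@admins', 'userIds': ['demo']},
    {'id': '@opencb-team', 'userIds': []},
]

print(groups_for_user(example_groups, 'demouser'))
```

With real data, the `groups` argument could be built as `list(groups.result_iterator())` from the response obtained earlier.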

Now you might be wondering which specific permissions our user has. We can check this using the `client.acl()` method in the next call (**acl** = access control list):

In [100]:
# Permissions granted directly to user:

acls = study_client.acl(study_fqn, member='demouser')
acl = acls.get_result(0)  # first result: a dictionary keyed by member id

print('The user demouser has the following permissions:\n', acl['demouser'])
    

The user demouser has the following permissions:
 ['VIEW_PANELS', 'VIEW_FAMILIES', 'VIEW_JOBS', 'VIEW_FILES', 'VIEW_FILE_ANNOTATIONS', 'VIEW_COHORTS', 'VIEW_SAMPLE_VARIANTS', 'VIEW_FAMILY_ANNOTATIONS', 'VIEW_FILE_HEADER', 'VIEW_FILE_CONTENT', 'VIEW_AGGREGATED_VARIANTS', 'VIEW_INDIVIDUALS', 'VIEW_COHORT_ANNOTATIONS', 'VIEW_SAMPLES', 'VIEW_SAMPLE_ANNOTATIONS', 'VIEW_CLINICAL_ANALYSIS', 'EXECUTE_JOBS', 'VIEW_INDIVIDUAL_ANNOTATIONS']
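The permission names in this output follow an `ACTION_RESOURCE` pattern, so a long list is easier to read once grouped by action. The sketch below is plain Python over a list of strings; the helper names are hypothetical, `example_permissions` is a short subset of the list above, and `WRITE_SAMPLES` is used only as an illustrative name for a permission that is absent from it.

```python
from collections import defaultdict


def group_permissions_by_action(permissions):
    """Group 'ACTION_RESOURCE' strings by their ACTION prefix."""
    grouped = defaultdict(list)
    for permission in permissions:
        action, _, resource = permission.partition('_')
        grouped[action].append(resource)
    return {action: sorted(resources) for action, resources in grouped.items()}


def has_permission(permissions, permission):
    """Return True when the exact permission string is in the list."""
    return permission in set(permissions)


# Subset of the permissions printed above.
example_permissions = ['VIEW_SAMPLES', 'VIEW_FILES', 'VIEW_COHORTS', 'EXECUTE_JOBS']

print(group_permissions_by_action(example_permissions))
print(has_permission(example_permissions, 'VIEW_SAMPLES'))
print(has_permission(example_permissions, 'WRITE_SAMPLES'))
```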


We've stated before that the group **members** is one of the default groups in *OpenCGA* (along with the group **admins**). We can also check the permissions granted to this group directly:

In [101]:
# perm granted to the groups our user belongs to
for group in user_groups:
    acls = study_client.acl(study_fqn, member=group)
    if acls.get_num_results() == 0:
        print('group_id: {}\t group_acls: []'.format(group))
    else:
        for acl in acls.result_iterator():
            print('group_id: {}\t group_acls: {}'.format(group, acl[group]))
            

group_id: @members	 group_acls: []


As the output shows, the group **members** is the basic group and does not have any default permissions. On the other hand, users in the group **admins** have permission to see and edit the study information.

For more information about user and group permissions, check the official *OpenCGA* documentation: **[Catalog and Security - Users and Permissions](http://docs.opencb.org/display/opencga/Sharing+and+Permissions)**

## 2.2 Exploring Catalog Clinical Metadata

A genomic data analysis platform needs to keep track of different resources such as:

- Clinical Data: information about individuals, samples from those individuals etc.
- Files Metadata: information about files contained in the platform, such as VCFs and BAMs.

*OpenCGA Catalog* is the component that assumes this role by storing this kind of information.



### Aggregation Stats

In [103]:
## We can aggregate information for many projects in a simple call

project_client.aggregation_stats('family,proyecto2')


In [44]:

family_stats = project_client.aggregation_stats(project_id)
pprint(family_stats.get_result(0))  # print the first result


{'family': {'corpasome': {'cohort': {'events': [],
                                     'numDeleted': 0,
                                     'numInserted': 0,
                                     'numMatches': 3,
                                     'numResults': 3,
                                     'numTotalResults': 3,
                                     'numUpdated': 0,
                                     'resultType': 'org.opencb.commons.datastore.core.FacetField',
                                     'results': [{'buckets': [{'count': 1,
                                                               'facetFields': [{'buckets': [{'count': 1,
                                                                                             'facetFields': [],
                                                                                             'value': 'JUNE'}],
                                                                                'count': 1,
                        

### 2.2.1 Exploring Samples and Individuals
Ideas: count samples, count individuals, and check how many individuals have samples.


Once we know the studies our user *demouser* has access to, we can explore the samples that a study contains.<br>
To fetch samples you need to use the sample client built into pyopencga. Remember that it is recommended to use the **[fqn](#Full-Qualified-Name-(fqn)-of-Studies )** when referencing studies.<br>
Let's imagine we want to know how many samples are in the study **platinum**, and list information about the first two samples: 

In [4]:
# Define the study we want to query (project:study form; the full fqn is demo@family:platinum)

study_id = 'family:platinum' 


## Call to the sample web endpoint

samples = sample_client.search(study=study_id, count=True, limit = 2) ## other possible params, count=False, id='NA12880,NA12881'
samples.print_results()


#Time: 45
#Num matches: 17
#Num results: 2
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	annotationSets	uuid	qualityControl	release	version	creationDate	modificationDate	description	somatic	phenotypes	individualId	fileIds	status	internal	attributes
NA12877	.	eba106b2-0172-0004-0001-0090f938ae01	{'fileIds': [], 'comments': [], 'alignmentMetrics': [], 'variantMetrics': {'variantStats': [], 'signatures': [], 'vcfFileIds': []}}	1	1	20200625131818	20201117012312		False	.	NA12877	data:platinum-genomes-vcf-NA12877_S1.genome.vcf.gz	{'name': '', 'description': '', 'date': ''}	{'status': {'name': 'READY', 'date': '20200625131818', 'description': ''}}	{}
NA12878	.	eba10c89-0172-0004-0001-8c90462fc396	{'fileIds': [], 'comments': [], 'alignmentMetrics': [], 'variantMetrics': {'variantStats': [], 'signatures': [], 'vcfFileIds': []}}	1	1	20200625131819	20201117015700		False	.	NA12878	data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz	{'name': '', 'description': '', 'date': ''}	{'status': {'nam

We can see that the study *platinum* has **17 samples** (given by #Num matches). The count is returned because we have set the parameter `count=True`.

However, only information about **2 samples** is returned, because we have set the parameter `limit=2`. We can get all the sample ids by:

In [138]:
sample_ids = [] # Define an empty list


# Define a new sample query without limit

samples = sample_client.search(study=study_id, count=True) 

for sample in samples.result_iterator():
    sample_ids.append(sample['id'])

print('There are {} samples with ids:\n {}\n'.format(len(sample_ids), sample_ids))


There are 17 samples with ids:
 ['NA12877', 'NA12878', 'NA12879', 'NA12880', 'NA12881', 'NA12882', 'NA12883', 'NA12884', 'NA12885', 'NA12886', 'NA12887', 'NA12888', 'NA12889', 'NA12890', 'NA12891', 'NA12892', 'NA12893']
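For a study with only 17 samples a single unlimited query is fine, but larger studies are often easier to walk through page by page. The following is a sketch of that pattern only: it assumes the search endpoint accepts a `skip` parameter alongside `limit` (not shown in this notebook), and `fake_fetch_page` is a stand-in for a real call so that the example runs without a server.

```python
def iterate_in_pages(fetch_page, page_size):
    """Yield records page by page until a short or empty page is returned."""
    skip = 0
    while True:
        page = fetch_page(limit=page_size, skip=skip)
        if not page:
            break
        for record in page:
            yield record
        if len(page) < page_size:
            break  # last page reached
        skip += page_size


# Stand-in data: the 17 sample ids listed above (NA12877 to NA12893).
ALL_IDS = ['NA128{}'.format(number) for number in range(77, 94)]


def fake_fetch_page(limit, skip):
    """Hypothetical replacement for a server query, slicing a local list."""
    return [{'id': sample_id} for sample_id in ALL_IDS[skip:skip + limit]]


paged_ids = [record['id'] for record in iterate_in_pages(fake_fetch_page, page_size=5)]
print(len(paged_ids))  # 17
```

Under the stated assumption, a real `fetch_page` would wrap `sample_client.search(study=study_id, limit=limit, skip=skip)` and return the list of results.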



Now we can repeat the same process to check the number of individuals in the **platinum** study. The difference is that now we will be making a call to the **individual** web service:

In [108]:
## Call to the individual web service

individuals = individual_client.search(study=study_id, count=True, limit=2) ## other possible params, count=False, id='NA12880,NA12881'
individuals.print_results('')


#Time: 69
#Num matches: 17
#Num results: 2
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	annotationSets	name	uuid	father	mother	location	sex	karyotypicSex	ethnicity	population	release	version	creationDate	modificationDate	lifeStatus	phenotypes	disorders	samples	parentalConsanguinity	status	internal	attributes
NA12877	.	NA12877	eba0f035-0172-0006-0001-f2aeb4168df1	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{}	UNKNOWN	UNKNOWN		{}	1	1	20200625131812	20201002113644	UNKNOWN	.	.	.	False	{'name': '', 'description': '', 'date': ''}	{'status': {'name': 'READY', 'date': '20200625131812', 'description': ''}}	{}
NA12878	.	NA12878	eba0f2b4-0172-0006-0001-6af0ded43009	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{}	UNKNOWN	UNKNOWN		{}	1	1	20200625131813	20201002113649	UNKNOWN	.	.	.	False	{'name': '', 'description': '', 'date': ''}	{'status

In the **platinum** study there are the same number of individuals and samples (17 matches each), so it is likely that there is one sample per individual.
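Equal counts only suggest a one-to-one mapping. One way to check it, sketched below, is to count samples per individual using the `individualId` field that appears in the sample records above. The helper names are hypothetical, the first two records mirror the ones printed earlier, and `NA00000` is an invented id used to show the negative case.

```python
from collections import Counter


def samples_per_individual(samples):
    """Count how many sample records point at each individual id."""
    return Counter(sample['individualId'] for sample in samples
                   if sample.get('individualId'))


def one_sample_each(samples, individual_ids):
    """True when every listed individual has exactly one sample."""
    counts = samples_per_individual(samples)
    return all(counts.get(individual_id, 0) == 1 for individual_id in individual_ids)


example_samples = [
    {'id': 'NA12877', 'individualId': 'NA12877'},
    {'id': 'NA12878', 'individualId': 'NA12878'},
]

print(one_sample_each(example_samples, ['NA12877', 'NA12878']))
# 'NA00000' is a made-up individual without any sample.
print(one_sample_each(example_samples, ['NA12877', 'NA12878', 'NA00000']))
```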

We might be interested in knowing when the individuals were added to *OpenCGA*:

In [116]:
## New call to the individual web service

individuals = individual_client.search(study=study_id) ## other possible params, count=False, id='NA12880,NA12881'


## Print the date each individual was created 

date_individuals = {} # Define an empty dictionary

for individual in individuals.result_iterator():
    date_individuals[individual['id']] = individual['creationDate']

individual_ids = list(date_individuals.keys())

print('There are {} individuals with ids:\n {}\n'.format(len(individual_ids), individual_ids))

for ind in date_individuals:
    print('The individual {} was created on {}'.format(ind, date_individuals[ind]))
    

There are 17 individuals with ids:
 ['NA12877', 'NA12878', 'NA12879', 'NA12880', 'NA12881', 'NA12882', 'NA12883', 'NA12884', 'NA12885', 'NA12886', 'NA12887', 'NA12888', 'NA12889', 'NA12890', 'NA12891', 'NA12892', 'NA12893']

The individual NA12877 was created on 20200625131812
The individual NA12878 was created on 20200625131813
The individual NA12879 was created on 20200625131813
The individual NA12880 was created on 20200625131813
The individual NA12881 was created on 20200625131814
The individual NA12882 was created on 20200625131814
The individual NA12883 was created on 20200625131814
The individual NA12884 was created on 20200625131814
The individual NA12885 was created on 20200625131815
The individual NA12886 was created on 20200625131815
The individual NA12887 was created on 20200625131815
The individual NA12888 was created on 20200625131815
The individual NA12889 was created on 20200625131815
The individual NA12890 was created on 20200625131816
The individual NA12891 was create
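The `creationDate` values printed above are 14-digit strings in year, month, day, hour, minute, second order. A minimal sketch for turning them into `datetime` objects follows; the function name is hypothetical, and since the output does not state a time zone, the result is left as a naive datetime.

```python
from datetime import datetime


def parse_opencga_date(value):
    """Parse a 'YYYYMMDDHHMMSS' string into a naive datetime."""
    return datetime.strptime(value, '%Y%m%d%H%M%S')


created = parse_opencga_date('20200625131812')
print(created.isoformat())  # 2020-06-25T13:18:12
```

Once parsed, the dates can be sorted or subtracted, for example to see how long it took to load all the individuals.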

### 2.2.2 Exploring Files

We can start by exploring the number of files in the study, and retrieving information about one file as an example of the kind of data stored in the **file** data model of *OpenCGA*.

In [134]:
## Call to the file web service

files = file_client.search(study=study_id, count=True, type='FILE', limit=3, exclude='attributes') ## other possible params, count=False, id='NA12880,NA12881'
files.print_results('id')

pprint(files.get_result(1))  # Print information for the second file (index 1)


#Time: 87
#Num matches: 3766
#Num results: 3
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id
data:platinum-genomes-vcf-NA12877_S1.genome.vcf.gz
data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz
data:platinum-genomes-vcf-NA12879_S1.genome.vcf.gz
{'annotationSets': [],
 'bioformat': 'VARIANT',
 'creationDate': '20200625131819',
 'experiment': {},
 'external': True,
 'format': 'VCF',
 'id': 'data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz',
 'internal': {'index': {'attributes': {},
                        'creationDate': '',
                        'jobId': -1,
                        'release': 1,
                        'status': {'date': '20200625135127',
                                   'description': 'Job finished. File index '
                                                  'ready',
                                   'name': 'READY'},
                        'userId': ''},
              'sampleMap': {},
              'status': {'date': '20200625131819',
                

As we can see, there are 3,766 files in the study.<br> The file data model contains plenty of useful information, such as the file format, the stats and the size of the file. If we want more specific information about one particular file:

In [136]:
my_vcf = files.get_result(1)

print('The study {} contains a {} file with id: {},\ncreated on: {}'.format(study_fqn, my_vcf['format'], 
                                                                            my_vcf['id'], my_vcf['creationDate']))



The study demo@family:platinum contains a VCF file with id: data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz,
created on: 20200625131819
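With thousands of files, a quick tally by `format` or `bioformat` gives a better overview than reading records one at a time. The sketch below counts values of one field across a list of file dictionaries; the helper name is hypothetical, the first two records mirror the VCF files listed above, and the BAM record is invented for illustration.

```python
from collections import Counter


def count_by_field(records, field, missing='UNKNOWN'):
    """Tally the values of one field, using a placeholder when it is absent or empty."""
    return Counter(record.get(field) or missing for record in records)


example_files = [
    {'id': 'data:platinum-genomes-vcf-NA12877_S1.genome.vcf.gz',
     'format': 'VCF', 'bioformat': 'VARIANT'},
    {'id': 'data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz',
     'format': 'VCF', 'bioformat': 'VARIANT'},
    # Hypothetical record, not taken from the study.
    {'id': 'hypothetical.bam', 'format': 'BAM', 'bioformat': 'ALIGNMENT'},
]

print(count_by_field(example_files, 'format'))
print(count_by_field(example_files, 'bioformat'))
```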


Now, we might be interested in knowing the number of files for a specific sample:

In [149]:
# Select a sample id

sample_of_interest = sample_ids[0]


## List the files for a concrete sample

sample = sample_client.info(study=study_id, samples= sample_of_interest)## other possible params, count=False, id='NA12880,NA12881'

sample_files = sample.get_result(0)['fileIds']

print('The sample {} has file/s: {}'.format(sample_of_interest, sample_files))


The sample NA12877 has file/s: ['data:platinum-genomes-vcf-NA12877_S1.genome.vcf.gz']
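The same `fileIds` field can be used in the opposite direction, to find which samples share a given file. This is a sketch over plain dictionaries; the helper name is hypothetical, the first two records mirror the ones shown in this notebook, and the multi-sample file id is made up.

```python
from collections import defaultdict


def files_to_samples(samples):
    """Build a {file id: [sample ids]} index from each sample's 'fileIds' list."""
    index = defaultdict(list)
    for sample in samples:
        for file_id in sample.get('fileIds', []):
            index[file_id].append(sample['id'])
    return dict(index)


example_sample_records = [
    {'id': 'NA12877',
     'fileIds': ['data:platinum-genomes-vcf-NA12877_S1.genome.vcf.gz',
                 'hypothetical-multisample.vcf.gz']},  # second id is invented
    {'id': 'NA12878',
     'fileIds': ['data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz',
                 'hypothetical-multisample.vcf.gz']},
]

for file_id, sample_list in files_to_samples(example_sample_records).items():
    print(file_id, '->', sample_list)
```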


### 2.2.3 Exploring Cohorts

One powerful feature of *OpenCGA* is the possibility of defining **cohorts** that group samples from individuals with common traits of interest, like a phenotype, nationality etc.<br>
The **cohorts** are defined at the study level. *OpenCGA* creates a default cohort *ALL*, which includes all the samples of the study.

We can explore which cohorts are defined in the study **platinum** by:

In [153]:
## Call to the cohort web service

cohorts = cohort_client.search(study=study_id, count=True, limit=2, exclude='samples') ## other possible params, count=False, id='NA12880,NA12881'
cohorts.print_results('id')

pprint(cohorts.get_result(0))  # Print information for the first cohort


#Time: 42
#Num matches: 1
#Num results: 1
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id
ALL
{'annotationSets': [],
 'attributes': {},
 'creationDate': '20200625131829',
 'description': 'Default cohort with almost all indexed samples',
 'id': 'ALL',
 'internal': {'status': {'date': '20200702090536',
                         'description': '',
                         'name': 'READY'}},
 'modificationDate': '20200702090536',
 'numSamples': 17,
 'release': 1,
 'status': {'date': '', 'description': '', 'name': ''},
 'type': 'COLLECTION',
 'uuid': 'eba13322-0172-0005-0001-1de37fca9efe'}


For the study **platinum** there is only 1 cohort: the default cohort *ALL*.

As we can see in the description field of the cohort record, *ALL* is the default cohort with almost all indexed samples.
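Since the description says "almost all", it can be worth comparing a cohort's `numSamples` with the number of samples in the study. The sketch below does this for a list of cohort dictionaries; the helper name is hypothetical, and the total of 17 is the sample count obtained earlier in this notebook.

```python
def cohort_coverage(cohorts, total_samples):
    """Return (cohort id, numSamples, fraction of the study's samples) tuples."""
    summary = []
    for cohort in cohorts:
        num_samples = cohort.get('numSamples', 0)
        fraction = num_samples / total_samples if total_samples else 0.0
        summary.append((cohort['id'], num_samples, fraction))
    return summary


# Fields copied from the cohort record printed above.
example_cohorts = [{'id': 'ALL', 'numSamples': 17}]

for cohort_id, num_samples, fraction in cohort_coverage(example_cohorts, total_samples=17):
    print('{}: {} samples ({:.0%} of the study)'.format(cohort_id, num_samples, fraction))
```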

## 2.4 Filtering by Custom Annotations
You can easily filter samples, individuals, ... using your custom annotation ...