# *pyopencga* Catalog: Clinical Data and Other Metadata

------
**[NOTE]** The server methods used by the *pyopencga* client are defined at the following Swagger URL:
- https://ws.opencb.org/opencga-prod/webservices/ (note: this link was not working at the time of writing)

**[NOTE]** The currently implemented methods are listed in the following spreadsheet:
- https://docs.google.com/spreadsheets/d/1QpU9yl3UTneqwRqFX_WAqCiCfZBk5eU-4E3K-WVvuoc/edit?usp=sharing

## Overview

This notebook provides guidance for querying an OpenCGA server through *pyopencga* to explore the studies the user has access to, the clinical data provided in a study (samples, individuals, genotypes, etc.) and other types of metadata, such as permissions.

A good first step when starting to work with OpenCGA is to retrieve information about our user: which projects and studies are we allowed to see?<br>
It is also recommended to get a feel for the clinical data in the study: How many samples and individuals does the study have? Are there any defined cohorts? Can we get some statistics about the genotypes of the samples in the study?

For guidance on how to log in and get started with *opencga*, refer to: [001-pyopencga_first_steps.ipynb](https://github.com/opencb/opencga/tree/develop/opencga-client/src/main/python/notebooks/user-training)
 

## 1. Set Up the Client and Log In with *pyopencga*

**Configuration and Credentials** 

Let's assume we already have *pyopencga* installed in our Python setup (all the steps are described in [001-pyopencga_first_steps.ipynb](https://github.com/opencb/opencga/tree/develop/opencga-client/src/main/python/notebooks/user-training)).

You need to provide **at least** a host server URL in the standard configuration format for OpenCGA, either as a Python dictionary or in a JSON file.


In [36]:
from pyopencga.opencga_config import ClientConfiguration # import configuration module
from pyopencga.opencga_client import OpencgaClient # import client module
from pprint import pprint
import json

####################################
## Configuration parameters  #######
####################################
# OpenCGA host
host = 'https://ws.opencb.org/opencga-prod'

# User credentials
user = 'demouser'
passwd = 'demouser' ## you can skip this, see below.
study = 'demo@family:platinum'
####################################

# Creating ClientConfiguration dict
config_dict = {'rest': {
                       'host': host 
                    }
               }
print('Config information:\n',config_dict)

# Pass the config_dict dictionary to the ClientConfiguration method
config = ClientConfiguration(config_dict)

# Create the client
oc = OpencgaClient(config)

# Pass the credentials to the client
# (here we put only the user in order to be asked for the password interactively)
# oc.login(user)

# or you can pass the user and passwd
oc.login(user, passwd)



Config information:
 {'rest': {'host': 'https://ws.opencb.org/opencga-prod'}}
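
As noted earlier, the same configuration can be kept in a JSON file instead of a Python dictionary. The following is a minimal sketch that only uses the standard `json` module; the file name `opencga_config.json` and the temporary directory are assumptions made for illustration. The dictionary read back from disk has the same shape as `config_dict`.

```python
import json
import os
import tempfile

# Same structure as config_dict in the cell above
config_dict = {'rest': {'host': 'https://ws.opencb.org/opencga-prod'}}

# Hypothetical file name, placed in a temporary directory for illustration
config_path = os.path.join(tempfile.mkdtemp(), 'opencga_config.json')

# Write the configuration to disk
with open(config_path, 'w') as handle:
    json.dump(config_dict, handle, indent=2)

# Read it back; the result has the same shape as config_dict
with open(config_path) as handle:
    loaded_config = json.load(handle)

print(loaded_config['rest']['host'])
```

Keeping the host in a file makes it easier to share one notebook across several OpenCGA installations without editing the code.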


Once we have defined a variable with the client configuration and credentials, we can access all the methods defined for the client.

These methods perform calls to query the different data models in OpenCGA.

Over the use cases addressed in this notebook we will be querying the **users, projects, studies, samples, individuals and cohorts** data models.

In [37]:
## Define variables to query different data models through the web services

user_client = oc.users
project_client = oc.projects
study_client = oc.studies
sample_client = oc.samples
individual_client = oc.individuals
file_client = oc.files
cohort_client = oc.cohorts


# 2. Use Cases 

In this section we are going to show how to work with some of the most common scenarios.<br>
The use cases addressed here constitute a high-level introduction that gives users a basis for their own explorations. These examples can be adapted to each individual use case.



## 2.1 Exploring User Account: Permissions, Projects and Studies
In this use case we cover retrieving information for our user.
In OpenCGA, all the user permissions are established at a study level. One project may contain different studies.

#### Full Qualified Name (fqn) of Studies 


It is also important to understand that in OpenCGA, projects and studies have a fully qualified name (**fqn**). For a study the format is `[owner]@[project]:[study]`; for a project it is `[owner]@[project]`.
For example, the study *platinum* in the project *family*, owned by the user *demo*, has the fqn `demo@family:platinum`.
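
The fqn format described above can be split into its parts with plain string operations. This is a minimal sketch; `parse_fqn` is a hypothetical helper written for this notebook and is not part of *pyopencga*.

```python
def parse_fqn(fqn):
    """Split '[owner]@[project]:[study]' into its parts.

    The 'study' entry is None when only '[owner]@[project]' is given.
    """
    owner, _, rest = fqn.partition('@')
    project, _, study = rest.partition(':')
    if not owner or not project:
        raise ValueError('not a valid fqn: {!r}'.format(fqn))
    return {'owner': owner, 'project': project, 'study': study or None}


print(parse_fqn('demo@family:platinum'))
print(parse_fqn('demo@family'))
```

Splitting the fqn this way makes it easy to tell at a glance which user owns a project or study.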


### 2.1.1 Exploring Projects and Studies

Depending on the permissions granted, a user can be the owner of a study or just have access to some studies owned by other users.<br>We can retrieve information about our user and its permissions by:

In [42]:
## Getting user information
## [NOTE] The user client takes the query_id string directly --> (user)
user_info = user_client.info(user).get_result(0)

print('User info:')
print('id:{}\taccount_type: {}\t projects_owned: {}'.format(user_info['id'], user_info['account']['type'], len(user_info['projects'])))


User info:
id:demouser	account_type: GUEST	 projects_owned: 0


We can see that our user (*demouser*) does **not** own any projects.<br> However, the user might have been granted access to projects owned by other users. Let's see how to find this out:

#### Projects:
Now, we can list our user's projects using the **project client** `search()` function.

In [10]:
## Getting user projects
## [NOTE] Client specific methods have the query_id as a key:value (i.e (user=user_id)) 
projects_info = project_client.search()

projects_info.print_results(fields='id,name,organism.scientificName,organism.assembly,fqn', metadata=False)

#id	name	organism.scientificName	organism.assembly	fqn
family	Family Studies GRCh37	Homo sapiens	GRCh37	demo@family
population	Population Studies GRCh38	Homo sapiens	GRCh38	demo@population


As we can see, our user (*demouser*) has access to 2 different projects:
- Project: **family**
- Project: **population**

The **fqn** shows that those projects are owned by the user *demo*, who shares the permission to see them with us.

**Studies:** Now, let's see which studies we have access to within the **family** project.

In [13]:
project_id = 'family'  # The project we want to retrieve info

studies = study_client.search(project_id)

## Print the studies using the result_iterator() method

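## (the loop variable is named study_info so it does not overwrite the `study` string defined earlier)
for study_info in studies.result_iterator():
    print("project:{}\t study_id:{}\t study_fqn:{} ".format(project_id, study_info['id'], study_info['fqn']))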

project:family	 study:platinum
project:family	 study:corpasome


Our user (*demouser*) has access to 2 different studies within the *family* project:

Project: **family**
- study: **platinum**
- study: **corpasome**

### 2.1.2 Checking Groups and Permissions

Now let's check which groups our user belongs to, and which permissions it has been assigned, in the *platinum* study.

In [21]:
study_fqn = 'demo@family:platinum'
groups = study_client.groups(study_fqn)

## This will give us the whole list of groups existing in the study
for group in groups.result_iterator():
    print("group_id: {}".format(group['id']))


group_id: @members
group_id: @admins
group_id: @opencb-team


In [22]:
user_groups = []

## This will give us only the groups our user belongs to
for group in groups.result_iterator():
    if 'demouser' in group['userIds']:
        user_groups.append(group['id'])
        print("group_id: {}".format(group['id']))

group_id: @members


In [24]:
## Look at the permissions our user has
# Permissions granted directly to the user (ACL = access control list)
acls = study_client.acl(study_fqn, member='demouser')

for acl in acls.result_iterator():
    print(acl['demouser'])


['VIEW_PANELS', 'VIEW_FAMILIES', 'VIEW_JOBS', 'VIEW_FILES', 'VIEW_FILE_ANNOTATIONS', 'VIEW_COHORTS', 'VIEW_SAMPLE_VARIANTS', 'VIEW_FAMILY_ANNOTATIONS', 'VIEW_FILE_HEADER', 'VIEW_FILE_CONTENT', 'VIEW_AGGREGATED_VARIANTS', 'VIEW_INDIVIDUALS', 'VIEW_COHORT_ANNOTATIONS', 'VIEW_SAMPLES', 'VIEW_SAMPLE_ANNOTATIONS', 'VIEW_CLINICAL_ANALYSIS', 'EXECUTE_JOBS', 'VIEW_INDIVIDUAL_ANNOTATIONS']


In [27]:
# Permissions granted to the groups our user belongs to
for group in user_groups:
    acls = study_client.acl(study_fqn, member=group)
    if acls.get_num_results() == 0:
        print('group_id: {}\t group_acls: []'.format(group))
    else:
        for acl in acls.result_iterator():
            print('group_id: {}\t group_acls: {}'.format(group, acl[group]))

group_id: @members	 group_acls: []
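
Before running a later step it can be handy to check whether an ACL contains the permissions that step needs. The sketch below works on a plain list; `missing_permissions` is a hypothetical helper, `user_acl` is a subset of the direct ACL printed above, and `WRITE_SAMPLES` is used only as an illustrative permission name that does not appear in that list.

```python
def missing_permissions(acl, required):
    """Return the required permissions that are absent from the ACL, sorted."""
    return sorted(set(required) - set(acl))


# Subset of the direct ACL of demouser, copied from the output above
user_acl = ['VIEW_FILES', 'VIEW_COHORTS', 'VIEW_SAMPLE_VARIANTS',
            'VIEW_INDIVIDUALS', 'VIEW_SAMPLES', 'EXECUTE_JOBS']

# Everything needed is present, so nothing is missing
print(missing_permissions(user_acl, ['VIEW_SAMPLES', 'VIEW_SAMPLE_VARIANTS']))

# 'WRITE_SAMPLES' is an illustrative name that is not in the list
print(missing_permissions(user_acl, ['VIEW_SAMPLES', 'WRITE_SAMPLES']))
```

An empty list means every required permission is present in the ACL that was checked.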


## 2.2 Exploring Catalog Metadata

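Aggregation stats give a quick summary of the Catalog metadata in a project. The result is nested by project, study and entity type (for example, `cohort`), and each entry holds facet counts organised in buckets.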

In [None]:
## In a single call we can aggregate over a list of projects ('proyecto2' is a placeholder id)
project_client.aggregation_stats('family,proyecto2')

In [44]:
family_stats = project_client.aggregation_stats(project_id)
pprint(family_stats.get_result(0))  # print the first result

#### NOTE: aggregations are not working at the moment; double-check later with the Genentech installation


{'family': {'corpasome': {'cohort': {'events': [],
                                     'numDeleted': 0,
                                     'numInserted': 0,
                                     'numMatches': 3,
                                     'numResults': 3,
                                     'numTotalResults': 3,
                                     'numUpdated': 0,
                                     'resultType': 'org.opencb.commons.datastore.core.FacetField',
                                     'results': [{'buckets': [{'count': 1,
                                                               'facetFields': [{'buckets': [{'count': 1,
                                                                                             'facetFields': [],
                                                                                             'value': 'JUNE'}],
                                                                                'count': 1,
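
The nested facet output above can be hard to read. The sketch below flattens such a structure into one line per leaf bucket, using only the keys visible in the output (`buckets`, `facetFields`, `value`, `count`). `flatten_buckets` is a hypothetical helper, and the sample facet is made up to mirror the truncated output (the value `'2020'` and the `'JULY'` bucket are assumptions).

```python
def flatten_buckets(facet_field, prefix=()):
    """Yield (path_of_values, count) for every leaf bucket of a nested facet."""
    for bucket in facet_field.get('buckets', []):
        path = prefix + (bucket.get('value'),)
        nested = bucket.get('facetFields', [])
        if not nested:
            # Leaf bucket: report its full path and count
            yield path, bucket['count']
        for child in nested:
            # Descend into the nested facet, carrying the path so far
            yield from flatten_buckets(child, path)


# Hypothetical facet, shaped like the truncated output above
facet = {
    'count': 3,
    'buckets': [
        {'value': '2020', 'count': 3, 'facetFields': [
            {'count': 3, 'buckets': [
                {'value': 'JUNE', 'count': 1, 'facetFields': []},
                {'value': 'JULY', 'count': 2, 'facetFields': []},
            ]},
        ]},
    ],
}

for path, count in flatten_buckets(facet):
    print(' / '.join(path), count)
```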
                        

### 2.2.1 Exploring Samples and Individuals
Ideas: count samples, count individuals, list the files of a sample...


Once we know the studies our user *demouser* has access to, we can explore the samples that a study contains.<br>
To fetch samples you need to use the sample client built into *pyopencga*.<br>

Remember that it is recommended to use the **[fqn](#Full-Qualified-Name-(fqn)-of-Studies )** when referencing studies, since there might be **other studies** with the same name in **other projects**.<br> (E.g. the study *platinum* might be defined in two different projects: *GRch37_project* and *GRch38_project*)

Let's imagine we want to know how many samples are in the study **platinum**, and list information about the first two samples:

In [46]:
# Define the fqn of the study we want to query

study_id = 'demo@family:platinum'


## Call to the sample web endpoint

samples = sample_client.search(study=study_id, count=True, limit = 2) ## other possible params, count=False, id='NA12880,NA12881'
samples.print_results()

#Time: 92
#Num matches: 17
#Num results: 2
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	annotationSets	uuid	qualityControl	release	version	creationDate	modificationDate	description	somatic	phenotypes	individualId	fileIds	status	internal	attributes
NA12877	.	eba106b2-0172-0004-0001-0090f938ae01	{'fileIds': [], 'comments': [], 'alignmentMetrics': [], 'variantMetrics': {'variantStats': [], 'signatures': [], 'vcfFileIds': []}}	1	1	20200625131818	20201117012312		False	.	NA12877	data:platinum-genomes-vcf-NA12877_S1.genome.vcf.gz	{'name': '', 'description': '', 'date': ''}	{'status': {'name': 'READY', 'date': '20200625131818', 'description': ''}}	{}
NA12878	.	eba10c89-0172-0004-0001-8c90462fc396	{'fileIds': [], 'comments': [], 'alignmentMetrics': [], 'variantMetrics': {'variantStats': [], 'signatures': [], 'vcfFileIds': []}}	1	1	20200625131819	20201117015700		False	.	NA12878	data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz	{'name': '', 'description': '', 'date': ''}	{'status': {'nam

We can see that the study *platinum* has **17 samples** (given by `#Num matches`). The count is returned because we have set the parameter `count=True`.

However, only information about **2 samples** is returned, because we have set the parameter `limit=2`.
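
When all the matches are needed rather than just the first few, results can be fetched page by page. The sketch below shows the idea with a generic helper. `iterate_pages` and `fake_search` are hypothetical names, the 17 fake ids only mimic the 17 matches above, and using a `skip` parameter alongside `limit` is an assumption about the endpoint being wrapped.

```python
def iterate_pages(search, page_size, **query):
    """Yield results page by page, calling search(limit=..., skip=..., **query)."""
    skip = 0
    while True:
        page = search(limit=page_size, skip=skip, **query)
        if not page:
            break
        yield from page
        if len(page) < page_size:
            # A short page means there is nothing left to fetch
            break
        skip += page_size


# Hypothetical stand-in for a search call: 17 fake sample ids
FAKE_SAMPLES = ['S{:02d}'.format(i) for i in range(17)]


def fake_search(limit, skip, **query):
    return FAKE_SAMPLES[skip:skip + limit]


all_samples = list(iterate_pages(fake_search, page_size=5))
print(len(all_samples))
```

To use this against a server, `fake_search` would be replaced by a small function that calls the sample client and returns its list of results.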

In [47]:
## We can do the same for individuals

individuals = individual_client.search(study=study_id, count=True, limit = 2) ## other possible params, count=False, id='NA12880,NA12881'
individuals.print_results()

#Time: 215
#Num matches: 17
#Num results: 2
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	annotationSets	name	uuid	father	mother	location	sex	karyotypicSex	ethnicity	population	release	version	creationDate	modificationDate	lifeStatus	phenotypes	disorders	samples	parentalConsanguinity	status	internal	attributes
NA12877	.	NA12877	eba0f035-0172-0006-0001-f2aeb4168df1	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{}	UNKNOWN	UNKNOWN		{}	1	1	20200625131812	20201002113644	UNKNOWN	.	.	.	False	{'name': '', 'description': '', 'date': ''}	{'status': {'name': 'READY', 'date': '20200625131812', 'description': ''}}	{}
NA12878	.	NA12878	eba0f2b4-0172-0006-0001-6af0ded43009	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{}	UNKNOWN	UNKNOWN		{}	1	1	20200625131813	20201002113649	UNKNOWN	.	.	.	False	{'name': '', 'description': '', 'date': ''}	{'statu

### 2.2.2 Exploring Files

In [None]:
## List the files of ONE specific sample
## (search() is used here because info() expects the sample ids to be given explicitly)
sample = oc.samples.search(study=study_id, limit=1)  ## other possible params: count=False, id='NA12880,NA12881'

## The field we need is fileIds
sample.get_result(0)['fileIds']

sample_concreta = sample.get_result(0)['id']

## Full list of files with all their metadata
oc.files.search(study=study_id, sampleIds=sample_concreta)


### 2.2.3 Exploring Cohorts

In [None]:
## Cohorts hold a list of samples grouped by some specific characteristics
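
Since a cohort is essentially a named list of samples, a first useful summary is the number of samples per cohort. This is a minimal sketch over plain dictionaries: `cohort_sizes` is a hypothetical helper, the cohort ids and memberships are made up, and the `id` and `samples` keys are assumed to match the records returned by the cohort client.

```python
def cohort_sizes(cohorts):
    """Map each cohort id to the number of samples it lists."""
    return {cohort['id']: len(cohort.get('samples', [])) for cohort in cohorts}


# Hypothetical cohort records; ids and membership are made up for illustration
cohorts = [
    {'id': 'ALL', 'samples': [{'id': 'NA12877'}, {'id': 'NA12878'}]},
    {'id': 'EMPTY', 'samples': []},
]

print(cohort_sizes(cohorts))
```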



## 2.3 Filtering by Custom Annotations
You can easily filter samples, individuals, ... using your custom annotation ...