# Catalog Notebook: Overview
------
This notebook provides guidance for querying an OpenCGA server through *pyopencga* to explore the studies the user has access to, the clinical data provided in each study (samples, individuals, genotypes, etc.) and other types of metadata, such as permissions.

A good first step when starting to work with OpenCGA is to retrieve information about our user: which projects and studies are we allowed to see?<br>
It is also recommended to get a feel for the clinical data in the study: How many samples and individuals does the study have? Are there any defined cohorts? Can we get some statistics about the genotypes of the samples in the study?

For guidance on how to log in and get started with *opencga*, you can refer to: [pyopencga_first_steps.ipynb](https://github.com/opencb/opencga/tree/develop/opencga-client/src/main/python/notebooks/user-training)
 
 **[NOTE]** The server methods used by the *pyopencga* client are defined at the following Swagger URL:
- https://ws.opencb.org/opencga-prod/webservices/   

## Set Up the Client and Log In with *pyopencga*

**Configuration and Credentials** 

Let's assume we already have *pyopencga* installed in our Python setup (all the steps are described in [pyopencga_first_steps.ipynb](https://github.com/opencb/opencga/tree/develop/opencga-client/src/main/python/notebooks/user-training)).

You need to provide **at least** a host server URL in the standard OpenCGA configuration format, either as a Python dictionary or in a JSON file.
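
For the JSON-file option, the following is a minimal sketch that uses only the standard library to save the configuration and read it back. The file name `opencga_client_config.json` and the temporary directory are hypothetical choices for illustration; the host is the one used in this notebook.

```python
import json
import os
import tempfile

# Same structure as the configuration dictionary used in this notebook
config_dict = {'rest': {'host': 'https://ws.opencb.org/opencga-prod'}}

# Write the configuration to a JSON file (hypothetical file name and location)
config_path = os.path.join(tempfile.mkdtemp(), 'opencga_client_config.json')
with open(config_path, 'w') as handle:
    json.dump(config_dict, handle, indent=2)

# Read the file back to check that it holds the same configuration
with open(config_path) as handle:
    loaded_config = json.load(handle)

print(loaded_config['rest']['host'])

# Assumption: the loaded dictionary can then be passed to ClientConfiguration
# in the same way as config_dict, e.g.
# config = ClientConfiguration(loaded_config)
```

Keeping the host in a file rather than in the notebook makes it easier to switch between servers without editing code.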


In [1]:
## Step 1. Import pyopencga dependencies
from pyopencga.opencga_config import ClientConfiguration # import configuration module
from pyopencga.opencga_client import OpencgaClient # import client module
from pprint import pprint
from IPython.display import JSON
import matplotlib.pyplot as plt
import seaborn as sns

## Step 2. OpenCGA host
host = 'https://ws.opencb.org/opencga-prod'
# host = 'http://localhost:1234/opencga'

## Step 3. User credentials
user = 'demouser'
passwd = 'demouser' ## you can skip this, see below.
####################################

## Step 4. Create the ClientConfiguration dict
config_dict = {'rest': {
                       'host': host 
                    }
               }

## Step 5. Create the ClientConfiguration and OpenCGA client
config = ClientConfiguration(config_dict)
oc = OpencgaClient(config)

## Step 6. Login to OpenCGA using the OpenCGA client 
# Pass the credentials to the client
# (here we put only the user in order to be asked for the password interactively)
# oc.login(user)

# or you can pass the user and passwd
oc.login(user, passwd)

print('Logged in successfully to {}, your token is: {} well done!'.format(host, oc.token))


[INFO]: Client version (2.0.0) is lower than server version (2.1.0).
Logged in successfully to https://ws.opencb.org/opencga-prod, your token is: eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJkZW1vdXNlciIsImF1ZCI6Ik9wZW5DR0EgdXNlcnMiLCJpYXQiOjE2MjY5NTY1NDIsImV4cCI6MTYyNjk2MDE0Mn0.kwu6_yOKK0bZT6lHb4YC8zBwhrTcbf5M9QzI1Frxx_I well done!


## Setup OpenCGA Variables

Once we have defined a variable with the client configuration and credentials, we can access all the methods defined for the client. These methods implement calls to query different data models in *OpenCGA*.

For the use cases addressed in this notebook, we will query the **users, projects, studies, samples, individuals and cohorts** *OpenCGA* data models.

#  Use Cases 
------

In this section we show how to work with some of the most common scenarios.<br>
- The use cases addressed here constitute a high-level introduction that gives users a basis for their own explorations.
- The examples can be adapted to each individual use case.


#  Exploring User Account: Permissions, Projects and Studies
------------

In this use case we cover retrieving information for our user.

**In OpenCGA, all the user permissions are established at a study level**. One project contains **at least** one study, although it may contain several.

#### Full Qualified Name (fqn) of Studies
It is also very important to understand that in OpenCGA, projects and studies have a fully qualified name (**fqn**) with the format:<br>
`[[owner]@[project]]:[study]`

We cannot be sure whether **other studies** with the same name exist in **other projects**.<br> (E.g. the study *platinum* might be defined in two different projects: *GRCh37_project* and *GRCh38_project*.)

Because of that, it is recommended to use the **fqn** when referencing studies.
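
To make the fqn format concrete, here is a small sketch that splits a study fqn into its parts. The helper `split_study_fqn` is a hypothetical name introduced for illustration and is not part of *pyopencga*; the example value is the fqn of the study used later in this notebook.

```python
def split_study_fqn(fqn):
    """Split 'owner@project:study' into (owner, project, study)."""
    owner_project, colon, study = fqn.partition(':')
    owner, at_sign, project = owner_project.partition('@')
    # Require all three parts, so a bare study id is rejected
    if not (colon and at_sign and owner and project and study):
        raise ValueError('Not a complete study fqn: {!r}'.format(fqn))
    return owner, project, study


owner, project, study = split_study_fqn('demo@family:platinum')
print('owner={} project={} study={}'.format(owner, project, study))
```

Passing a bare study id such as `platinum` raises a `ValueError`, which is a cheap way to catch ambiguous references before they reach the server.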


## 1. Exploring Projects and Studies with our user

### Users: owner and members 
Depending on the permissions granted, a user can be the owner of a study or just have access to some studies owned by other users.<br>We can retrieve information about our user and its permissions by:
- **Using the `print_results()` function**

In [19]:
## Getting user information
## [NOTE] User needs the query_id string directly --> (user)
#Print using the print_results() function:
user_info = oc.users.info(user)
user_info.print_results( title='User info with print_results() function:') # metadata=False

## Uncomment next line to display an interactive JSON viewer
#JSON(user_info.get_results())

User info with print_results() function:
---------------------------------------------
#Time: 330
#Num matches: -1
#Num results: 1
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	name	email	organization	account	internal	quota	projects	sharedProjects	configs	filters	attributes
demouser	OpenCGA Demo User	demouser@opencb.org		{'type': 'GUEST', 'creationDate': '', 'expirationDate': '', 'authentication': {'id': 'internal', 'application': False}}	{'status': {'name': 'READY', 'date': '20200625130136', 'description': ''}}	{'diskUsage': -1, 'cpuUsage': -1, 'maxDisk': 0, 'maxCpu': -1}	.	.	{'IVA': {'lastStudy': 'demo@population:1000g', 'lastAccess': 1626714109935}}	.	{}


- **Using the REST response API** 

In [21]:
# Using REST response API:
print("\nUser info using REST response API:")
user_info = oc.users.info(user).get_result(0)
user_projects = user_info['projects']  # Define projects owned by our user
print('id:{}\taccount_type: {}\t projects_owned: {}'.format(user, user_info['account']['type'], len(user_projects)))

print('\nWe can appreciate that our user: {} has {} projects from its own: {}'.format(user, len(user_projects), user_projects))


User info using REST response API:
id:demouser	account_type: GUEST	 projects_owned: 0

We can appreciate that our user: demouser has 0 projects from its own: []


### User Projects:
Although a user may not own any project, they might have been granted access to projects created by other users. Let's see how to find this out.

We can list our user's projects using **project client** `search()` function.

In [5]:
## Getting user projects
## [NOTE] Client specific methods have the query_id as a key:value (i.e (user=user_id)) 

projects_info = oc.projects.search()
projects_info.print_results(fields='id,name,organism.scientificName,organism.assembly,fqn', title='Projects our user ({}) has access to:'.format(user), metadata=False)

Projects our user (demouser) has access to:
------------------------------------------------
#id	name	organism.scientificName	organism.assembly	fqn
family	Family Studies GRCh37	Homo sapiens	GRCh37	demo@family
population	Population Studies GRCh38	Homo sapiens	GRCh38	demo@population


The **fqn** `owner@project` shows the owner of each project; this owner has granted our user permission to access the projects above.


### User Studies:
Let's see which studies we have access to within the project.

In [6]:
# First we define one projectId
project_info = oc.projects.search().get_result(0)
project_id = project_info['id']

print('For this user-case, we can use project:{}'.format(project_id))

For this user-case, we can use project:family


In [10]:
studies = oc.studies.search(project_id)

## Print the studies using the result_iterator() method
print('Our user [{}] has access to {} different studies within the [{}] project\n'.format(user, len(studies.get_results()), project_id))
for study in studies.result_iterator():
    print("project:{}\t study_id:{}\t study_fqn:{} ".format(project_id, study['id'], study['fqn']))
    

Our user [demouser] has access to 2 different studies within the [family] project

project:family	 study_id:platinum	 study_fqn:demo@family:platinum 
project:family	 study_id:corpasome	 study_fqn:demo@family:corpasome 


- For the rest of the notebook, we will use a specific study to query catalog information:

In [12]:
# Define the study we are going to work with
study_info = oc.studies.search(project_id).get_result(0)
study_id = study_info['id']
study_fqn = study_info['fqn']
print("Let's use the study: [{}] with fqn: [{}]".format(study_id, study_fqn))

Let's use the study: [platinum] with fqn: [demo@family:platinum]


## 2. Checking Groups and Permissions

Now let's assume we want to check which groups our user belongs to and which permissions our user has been granted for the study (remember that all permissions are established at the study level).

### Groups in the Study
OpenCGA defines the permissions (for both groups and users) at the **Study** level. The first step might be to check which groups exist within the **study**. 
**[NOTE]**: This can ONLY be done by an `admin` or the `owner`. If your user is neither of these, skip the next two cells.

In [14]:
# # Query to the study web service
# groups = oc.studies.groups(study_fqn)
# study_groups = []  # Define an empty list for the groups

# ## This will give us the whole list of groups existing in the study
# for group in groups.result_iterator():
#     study_groups.append(group['id'])
#     print("group_id: {}".format(group['id']))
    
# print('\nThere are {} groups in the study {}: {}'.format(len(study_groups), study_fqn, study_groups))

### User Groups
To check which groups our user belongs to:

In [17]:
# user_groups = [] # Define an empty list 

# ## This will give us only the groups our user belongs to
# for group in groups.result_iterator():
#     if user in group['userIds']:
#         user_groups.append(group['id'])
#         print("group_id: {}".format(group['id']))
        
# print('\nOur user {} belongs to group/s: {}'.format(user, user_groups))


Independently of the groups defined for a study, our user always belongs to the group **members**, which is one of the default groups in *OpenCGA*.
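
Because the group cells above need admin or owner rights, the filtering logic can be tried offline. The sketch below runs it on made-up group records: the `id` and `userIds` keys mirror the fields used in the commented cells, while the user `analyst1`, the group contents and the helper `groups_for_user` are hypothetical.

```python
# Hypothetical group records, shaped like the 'id' / 'userIds' fields used above
example_groups = [
    {'id': 'members', 'userIds': ['demouser', 'analyst1']},
    {'id': 'admins', 'userIds': ['analyst1']},
]


def groups_for_user(groups, user):
    """Return the ids of the groups whose 'userIds' list contains the user."""
    return [group['id'] for group in groups if user in group.get('userIds', [])]


demo_groups = groups_for_user(example_groups, 'demouser')
print('demouser belongs to: {}'.format(demo_groups))
```

With real data, `example_groups` would be replaced by the results of the study groups query.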

### User Permissions
We might be wondering which specific permissions our user has. We can check this using the `client.acl()` method (**acl** = access control list):

In [24]:
# Permissions granted directly to user:
acls = oc.studies.acl(study_id, member=user).get_result(0)
print('The user', user, ' has the following permissions:\n\n', acls[user])
    

The user demouser  has the following permissions:

 ['VIEW_PANELS', 'VIEW_FAMILIES', 'VIEW_JOBS', 'VIEW_FILES', 'VIEW_FILE_ANNOTATIONS', 'VIEW_COHORTS', 'VIEW_SAMPLE_VARIANTS', 'VIEW_FAMILY_ANNOTATIONS', 'VIEW_FILE_HEADER', 'VIEW_FILE_CONTENT', 'VIEW_AGGREGATED_VARIANTS', 'VIEW_INDIVIDUALS', 'VIEW_COHORT_ANNOTATIONS', 'VIEW_SAMPLES', 'VIEW_SAMPLE_ANNOTATIONS', 'VIEW_CLINICAL_ANALYSIS', 'EXECUTE_JOBS', 'VIEW_INDIVIDUAL_ANNOTATIONS']
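
A long permission list is easier to read when grouped by its prefix. The following sketch works on the list copied from the output above; treating the text before the first underscore as the "action" is an assumption based on how the names look, not a documented rule.

```python
# Permissions copied from the output above
permissions = [
    'VIEW_PANELS', 'VIEW_FAMILIES', 'VIEW_JOBS', 'VIEW_FILES',
    'VIEW_FILE_ANNOTATIONS', 'VIEW_COHORTS', 'VIEW_SAMPLE_VARIANTS',
    'VIEW_FAMILY_ANNOTATIONS', 'VIEW_FILE_HEADER', 'VIEW_FILE_CONTENT',
    'VIEW_AGGREGATED_VARIANTS', 'VIEW_INDIVIDUALS', 'VIEW_COHORT_ANNOTATIONS',
    'VIEW_SAMPLES', 'VIEW_SAMPLE_ANNOTATIONS', 'VIEW_CLINICAL_ANALYSIS',
    'EXECUTE_JOBS', 'VIEW_INDIVIDUAL_ANNOTATIONS',
]

# Group by the text before the first underscore (assumed to be the action)
by_action = {}
for permission in permissions:
    action, _, resource = permission.partition('_')
    by_action.setdefault(action, []).append(resource)

for action in sorted(by_action):
    print('{} ({}): {}'.format(action, len(by_action[action]),
                               ', '.join(sorted(by_action[action]))))

# Anything that is not a VIEW_* permission
non_view = [p for p in permissions if not p.startswith('VIEW_')]
print('Non-view permissions: {}'.format(non_view))
```

For this user, everything except `EXECUTE_JOBS` is a `VIEW_*` permission.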


### Default Groups in OpenCGA

The default groups in *OpenCGA* are: **members** and **admins**.

Intuitively, the group **members** is the basic group and holds any default permissions. On the other hand, users in the group **admins** have permission to see and edit the study information.

For more information about user and group permissions, check the official *OpenCGA* documentation: **[Catalog and Security - Users and Permissions](http://docs.opencb.org/display/opencga/Sharing+and+Permissions)**

# Exploring Catalog Clinical Metadata
-----------------------
A genomic data analysis platform needs to keep track of different resources such as:

- Clinical Data: information about individuals, samples from those individuals etc.
- Files Metadata: information about files contained in the platform, such as VCFs and BAMs.

*OpenCGA Catalog* is the component that assumes this role by storing this kind of information.

## 1. Exploring Samples and Individuals

Once we know the studies our user has access to, we can explore the samples within the study.<br>
To fetch samples you need to use the sample client built into **pyopencga**. Remember that it is recommended to use the **[fqn](#Full-Qualified-Name-(fqn)-of-Studies )** when referencing studies.<br>

### Samples:
Let's imagine we want to know how many samples are in the **study** stored in the `study_fqn` variable, and list information about the first five samples: 

In [143]:
## Call to the sample web endpoint
samples = oc.samples.search(study=study_fqn, includeIndividual=True, count=True, limit = 5) ## other possible params, count=False, id='NA12880,NA12881'
samples.print_results(fields='id,creationDate,somatic,phenotypes.id,phenotypes.name,individualId', title='Info from 5 samples from study {}'.format(study_fqn))

## Uncomment next line to display an interactive JSON viewer
#JSON(samples.get_results())

Info from 5 samples from study demo@family:platinum
--------------------------------------------------------
#Time: 68
#Num matches: 17
#Num results: 5
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	creationDate	somatic	phenotypes.id	phenotypes.name	individualId
NA12877	20200625131818	False	.	.	NA12877
NA12878	20200625131819	False	.	.	NA12878
NA12879	20200625131820	False	.	.	NA12879
NA12880	20200625131821	False	.	.	NA12880
NA12881	20200625131822	False	.	.	NA12881


Because we passed `count=True`, the total number of samples in the study is reported in **#Num matches** (17 here), while **#Num results** reflects the `limit`.

- **How to get all the sample ids?** 

Above, we used the parameter `limit` to restrict the number of samples the query returns. We can get all the sample ids by:

In [144]:
sample_ids = [] # Define an empty list

# Define a new sample query without limit
samples = oc.samples.search(study=study_fqn, count=True) 

for sample in samples.result_iterator():
    sample_ids.append(sample['id'])

print('There are {} samples with ids:\n {}\n'.format(len(sample_ids), sample_ids))


There are 17 samples with ids:
 ['NA12877', 'NA12878', 'NA12879', 'NA12880', 'NA12881', 'NA12882', 'NA12883', 'NA12884', 'NA12885', 'NA12886', 'NA12887', 'NA12888', 'NA12889', 'NA12890', 'NA12891', 'NA12892', 'NA12893']
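
For larger studies it may be preferable to fetch results page by page instead of in one call. The sketch below shows the paging logic only, run against a local stand-in: `fake_search` and `fetch_all` are hypothetical helpers, and the `skip` parameter is an assumption about what the server accepts alongside `limit`.

```python
def fetch_all(fetch_page, page_size=5):
    """Collect every result by requesting consecutive pages."""
    results = []
    skip = 0
    while True:
        page = fetch_page(limit=page_size, skip=skip)
        results.extend(page)
        # A short (or empty) page means there is nothing left to fetch
        if len(page) < page_size:
            return results
        skip += page_size


# Stand-in for the server: the 17 sample ids listed in the output above
ALL_SAMPLE_IDS = ['NA{}'.format(number) for number in range(12877, 12894)]


def fake_search(limit, skip):
    """Hypothetical replacement for a real search call; slices a local list."""
    return ALL_SAMPLE_IDS[skip:skip + limit]


paged_sample_ids = fetch_all(fake_search, page_size=5)
print('Fetched {} sample ids in pages of 5'.format(len(paged_sample_ids)))
```

In a real session, `fetch_page` could wrap the samples search call and return its list of results, provided the server honours `skip`.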



### Individuals:
Now we can repeat the same process to check the number of individuals in the **study**. The difference is that now we will be calling the **individuals** web service:

In [32]:
## Using the individuals search web service
individuals = oc.individuals.search(study=study_fqn, count=True, limit=5) ## other possible params, count=False, id='NA12880,NA12881'
individuals.print_results(title='Information about 5 individuals in the study {}'.format(study_fqn))

## Uncomment next line to display an interactive JSON viewer
#JSON(individuals.get_results())

Information about 5 individuals in the study demo@family:platinum
----------------------------------------------------------------------
#Time: 67
#Num matches: 17
#Num results: 5
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	annotationSets	name	uuid	father	mother	location	sex	karyotypicSex	ethnicity	population	release	version	creationDate	modificationDate	lifeStatus	phenotypes	disorders	samples	parentalConsanguinity	status	internal	attributes
NA12877	.	NA12877	eba0f035-0172-0006-0001-f2aeb4168df1	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{}	UNKNOWN	UNKNOWN		{}	1	1	20200625131812	20201002113644	UNKNOWN	.	.	.	False	{'name': '', 'description': '', 'date': ''}	{'status': {'name': 'READY', 'date': '20200625131812', 'description': ''}}	{}
NA12878	.	NA12878	eba0f2b4-0172-0006-0001-6af0ded43009	{'release': 0, 'version': 0, 'parentalConsanguinity': False}	{'release': 0, 'version': 0, 'parentalConsanguinity': F

- We might be interested in knowing when the individuals were added to *OpenCGA*, or the individuals' sex. Since **pyopencga 2.0.1.1** it is possible to export the results to a *pandas DataFrame* object with the function `to_data_frame()`:

In [33]:
## Using the individuals search web service without limit param
individuals = oc.individuals.search(study=study_fqn) 
## Using the new function to_data_frame()
individuals_df = individuals.to_data_frame()
print(individuals_df[['id', 'sex', 'uuid', 'creationDate']].head())

        id      sex                                  uuid    creationDate
0  NA12877  UNKNOWN  eba0f035-0172-0006-0001-f2aeb4168df1  20200625131812
1  NA12878  UNKNOWN  eba0f2b4-0172-0006-0001-6af0ded43009  20200625131813
2  NA12879  UNKNOWN  eba0f467-0172-0006-0001-d1e44969fcc3  20200625131813
3  NA12880  UNKNOWN  eba0f56e-0172-0006-0001-95f703aba3e6  20200625131813
4  NA12881  UNKNOWN  eba0f66c-0172-0006-0001-d023fa4e791a  20200625131814
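
The `creationDate` values above are plain strings. Assuming they follow a 14-digit `YYYYMMDDHHMMSS` layout (inferred from the values shown, not from the OpenCGA documentation), they can be turned into `datetime` objects with the standard library. The helper `parse_catalog_date` is a hypothetical name.

```python
from datetime import datetime


def parse_catalog_date(value):
    """Parse a 14-digit 'YYYYMMDDHHMMSS' string into a datetime (assumed layout)."""
    return datetime.strptime(value, '%Y%m%d%H%M%S')


# creationDate values copied from the DataFrame output above
creation_dates = ['20200625131812', '20200625131813', '20200625131814']
parsed_dates = [parse_catalog_date(value) for value in creation_dates]

# Time elapsed between the first and last individual in this small sample
elapsed = parsed_dates[-1] - parsed_dates[0]
print(parsed_dates[0].isoformat(), elapsed.total_seconds())
```

Once parsed, the dates can be sorted, compared or grouped by day like any other `datetime` value.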


### Custom Annotations

## 2. Exploring Files

### Files in a study
We can start by exploring the number of files in the study and retrieving information about one file, as an example of the kind of data stored in the **file** data model of *OpenCGA*.

In [34]:
## Using the files web service
files = oc.files.search(study=study_fqn, count=True, type='FILE', limit=5, exclude='attributes') ## other possible params, count=False, id='NA12880,NA12881'
files.print_results(fields='id,format,size,software', title='Information about files in study {}'.format(study_fqn))

## Uncomment next line to display an interactive JSON viewer
#JSON(files.get_results())

Information about files in study demo@family:platinum
----------------------------------------------------------
#Time: 110
#Num matches: 4072
#Num results: 5
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	format	size	software
data:platinum-genomes-vcf-NA12877_S1.genome.vcf.gz	VCF	887890738	{}
data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz	VCF	883195909	{}
data:platinum-genomes-vcf-NA12879_S1.genome.vcf.gz	VCF	889974818	{}
data:platinum-genomes-vcf-NA12880_S1.genome.vcf.gz	VCF	899309868	{}
data:platinum-genomes-vcf-NA12881_S1.genome.vcf.gz	VCF	918334187	{}
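
The `size` column is hard to read as a raw number. Assuming it is expressed in bytes (an assumption; the output does not state the unit), a small formatter makes the values easier to compare. `human_size` is a hypothetical helper, and the sizes are copied from the output above.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    size = float(num_bytes)
    for unit in ('B', 'KiB', 'MiB', 'GiB', 'TiB'):
        # Stop at the first unit that keeps the number below 1024
        if size < 1024 or unit == 'TiB':
            return '{:.1f} {}'.format(size, unit)
        size /= 1024


# File ids and sizes copied from the output above (sizes assumed to be bytes)
file_sizes = {
    'data:platinum-genomes-vcf-NA12877_S1.genome.vcf.gz': 887890738,
    'data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz': 883195909,
    'data:platinum-genomes-vcf-NA12879_S1.genome.vcf.gz': 889974818,
    'data:platinum-genomes-vcf-NA12880_S1.genome.vcf.gz': 899309868,
    'data:platinum-genomes-vcf-NA12881_S1.genome.vcf.gz': 918334187,
}

for file_id, size in file_sizes.items():
    print('{}\t{}'.format(file_id, human_size(size)))

total_bytes = sum(file_sizes.values())
print('Total for these 5 files: {}'.format(human_size(total_bytes)))
```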


### File Specific Info
The file data model contains plenty of useful information, such as the file format, stats and size. To look at more specific information about one file:

In [29]:
my_vcf = files.get_result(1)
print('The study {} contains a {} file with id: {},\ncreated on: {}'.format(study_fqn, my_vcf['format'], 
                                                                            my_vcf['id'], my_vcf['creationDate']))

The study demo@family:platinum contains a VCF file with id: data:platinum-genomes-vcf-NA12878_S1.genome.vcf.gz,
created on: 20200625131819


### Files with a specific sample

We might also be interested in knowing which files are associated with a specific sample:

In [31]:
## Using the samples info web service
sample_of_interest = sample_ids[0]

## List the files for a concrete sample
sample = oc.samples.info(study=study_fqn, samples=sample_of_interest)
sample_files = sample.get_result(0)['fileIds']

print('The sample {} has file/s: {}'.format(sample_of_interest, sample_files))

## 3. Exploring Cohorts

One powerful feature of *OpenCGA* is the possibility of defining **cohorts** that include individuals with common traits of interest, such as a phenotype or nationality.
The **cohorts** are defined at the study level. *OpenCGA* creates a default cohort *ALL*, which includes all the individuals of the study.

We can explore which cohorts are defined in the **study** by:

In [148]:
## Using the cohorts search web service
cohorts = oc.cohorts.search(study=study_fqn, count=True, exclude='samples') ## other possible params, count=False, id='NA12880,NA12881'
cohorts.print_results(fields='id,type,description,numSamples', title='Information about cohorts in study {}'.format(study_fqn))

## Uncomment next line to display an interactive JSON viewer
#JSON(cohorts.get_results())

Information about cohorts in study demo@family:platinum
------------------------------------------------------------
#Time: 42
#Num matches: 1
#Num results: 1
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	type	description	numSamples
ALL	COLLECTION	Default cohort with almost all indexed samples	17


**[NOTE]**: For any **study** in *OpenCGA* the default cohort **ALL** is always present. As we can see in the description of the cohort data model, "**ALL** is the default cohort with almost all indexed samples".

# Aggregations
-------
You can easily filter samples, individuals, ... using your custom annotation ...