# First Steps with *pyopencga*: the Python client of OpenCGA
------

This notebook provides guidance for getting started with the *pyopencga* library, the Python client of OpenCGA. _pyopencga_ is a **REST client** that fully implements the OpenCGA REST API.

These notebooks use a demo installation available at the University of Cambridge. Feel free to change the OpenCGA host and credentials to use any other OpenCGA server.

We assume that your workstation (Linux, Mac, Windows) is connected to the internet and you have Python 3 and the *pip* package manager installed. We then show you how to:

- Install *pyopencga*.
- Connect to an OpenCGA instance.
- Execute OpenCGA calls and work with responses.
- Launch asynchronous jobs and retrieve results.

Walk-through guides of some **common use cases** are provided in three further notebooks:
- pyopencga_catalog.ipynb
- pyopencga_variant_query.ipynb
- pyopencga_variant_analysis.ipynb
    
You can explore the OpenCGA REST web service API using the following public OpenCGA installation:
- https://ws.opencb.org/opencga-prod/webservices


# Installing and importing the *pyopencga* library
-------

## 1. Install *pyopencga* with *pip*

_pyopencga_ is the OpenCGA Python client, available at PyPI (https://pypi.org/project/pyopencga/). You can easily install it by executing:

`$ pip install pyopencga`

_pyopencga_ relies on some additional packages; make sure you have installed Pandas, IPython and Matplotlib.
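
Before running the import cell, it can help to confirm that these packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and is not part of *pyopencga*.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Import names used by the import cell of this notebook.
required = ['pyopencga', 'pandas', 'IPython', 'matplotlib']
missing = missing_packages(required)

if missing:
    print('Missing packages, try: pip install ' + ' '.join(missing))
else:
    print('All packages are installed.')
```

The check only looks for the modules; it does not import them, so it is quick and has no side effects.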

## 2. Import *pyopencga* library

Here is the import section with all the dependencies required to use _pyopencga_:

In [34]:
from pyopencga.opencga_config import ClientConfiguration # import configuration module
from pyopencga.opencga_client import OpencgaClient # import client module
from pprint import pprint
from IPython.display import JSON
import matplotlib.pyplot as plt
import datetime

## 3. Setup OpenCGA Client 

**HOST:** You need to provide **at least** an OpenCGA host server URL in the standard OpenCGA configuration format, either as a Python dictionary or in a JSON file.
 
**CREDENTIALS:** You can set both user and password as two variables in the script. If you prefer not to show the password, you will be asked for it interactively, without echo.
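
If you would rather not type the password into a notebook cell at all, one possible approach is to read it from an environment variable and fall back to a prompt without echo. This is only a sketch: the function name `read_password` and the variable name `OPENCGA_PASSWORD` are assumptions made for this example, not something *pyopencga* defines.

```python
import os
from getpass import getpass


def read_password(env_var='OPENCGA_PASSWORD', prompt=getpass):
    """Return the password from an environment variable, else ask without echo.

    'OPENCGA_PASSWORD' is a hypothetical variable name chosen for this sketch.
    """
    value = os.environ.get(env_var)
    if value:
        return value
    # getpass() reads from the terminal (or notebook prompt) without echoing.
    return prompt('OpenCGA password: ')


# Usage (commented out so that the cell does not block waiting for input):
# passwd = read_password()
```

The returned value can then be used in place of a hardcoded `passwd` variable.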


### Set variables for server host, user credentials and project owner

In [35]:
# Server host
host = 'http://bioinfo.hpc.cam.ac.uk/opencga-prod'

# User credentials
user = 'demouser'
passwd = 'demouser' ## You can skip this, see below.

### Creating the ClientConfiguration dictionary for the server connection

In [36]:
# Creating ClientConfiguration dict
config_dict = {'rest': {
                       'host': host 
                    }
               }

print('Config information:\n',config_dict)
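
Since the configuration can also live in a JSON file, the sketch below shows how the same dictionary could be written to disk and read back with the standard `json` module. The file name `client-configuration.json` and the temporary directory are assumptions made for this example; the loaded dictionary can then be used in the same way as `config_dict`.

```python
import json
import os
import tempfile

host = 'http://bioinfo.hpc.cam.ac.uk/opencga-prod'
config_dict = {'rest': {'host': host}}

# Write the configuration to a JSON file (a temporary path for this example).
config_path = os.path.join(tempfile.mkdtemp(), 'client-configuration.json')
with open(config_path, 'w') as handle:
    json.dump(config_dict, handle, indent=2)

# Read it back; the loaded dictionary is identical to the one built in code.
with open(config_path) as handle:
    loaded_config = json.load(handle)

print(loaded_config['rest']['host'])
```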

## 4. Initialize the Client 

Now we need to pass the *config_dict* dictionary to the **ClientConfiguration** class.<br>
Once we have the configuration defined as *config* (see below), we can initialize the client. This is the **most important step**.

#### OpencgaClient: what is it and why is it so important?

The `OpencgaClient` (see the *oc* variable below) implements all the methods to query the REST API of OpenCGA. All the available web services can be accessed directly through the client.

In [37]:
## Create the configuration
config = ClientConfiguration(config_dict)

## Define the client
oc = OpencgaClient(config)

Once we have created the client from the configuration, we can access all the methods defined for the client.

These methods implement calls to the OpenCGA **[web service endpoints](https://ws.opencb.org/opencga-prod/webservices/)** used by pyopencga.

## 5. Import the credentials and log in to OpenCGA

**Option 1**: pass the user and be asked for the password interactively. This option is more secure if you don't want your password hardcoded or if you will run the notebook in public. Uncomment the cell below to try it:

In [6]:
# ## Option 1: pass only the user in order to be asked for the password interactively
# oc.login(user)
# print('Logged in successfully to {}, your token is: {}. Well done!'.format(host, oc.token))



**Option 2**: pass the user and the password as variables. Be careful with this option, as the password can be publicly exposed.

In [38]:
# Option 2: you can pass the user and passwd
oc.login(user, passwd)
print('Logged in successfully to {}, your token is: {}. Well done!'.format(host, oc.token))

#### ✅  Congrats! You should now be connected to your OpenCGA installation

# Understanding REST Response
--------

*pyopencga* queries web services that return a RESTResponse object, which might be difficult to interpret. The RESTResponse type provides the data in a manner that is not as intuitive as a Python list or dictionary. Because of this, we have developed some useful functionality that retrieves the data in a simpler format.

[OpenCGA Client Libraries](http://docs.opencb.org/display/opencga/Using+OpenCGA), including *pyopencga*, implement a **RESTResponse wrapper** to make it even easier to work with REST web service responses. <br>REST responses include metadata, and OpenCGA 2.0.1 has been designed to work in a federation mode (more information about OpenCGA federations can be found **[here](http://docs.opencb.org/display/opencga/Roadmapg)**).

All of this can make a first-time user struggle when starting to work with the responses. Please read this brief documentation about **[OpenCGA RESTful Web Services](http://docs.opencb.org/display/opencga/RESTful+Web+Services#RESTfulWebServices-OpenCGA2.x)**.

Let's see a quick example of how to use the RESTResponse wrapper in *pyopencga*. 
You can get some extra information [here](http://docs.opencb.org/display/opencga/Python#Python-WorkingwiththeRestResponse). Let's execute a first simple query to fetch all projects for the user **demouser**, who is already logged in (see **[Installing and importing the *pyopencga* library](#Installing-and-importing-the-*pyopencga*-library)**).

#### Example of a for loop: 
Although you can iterate through the projects in the response by executing the next chunk of code, this approach is **not recommended**. The next query iterates over all the projects retrieved from `projects.search()`:

In [39]:
## Let's fetch the available projects.
## First let's get the project client and execute the search() function
projects = oc.projects.search(include='id,name')

## Loop through all the different projects
for project in projects.responses[0]['results']:
    print(project['id'], project['name'])
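
To illustrate why indexing `responses[0]` directly is fragile, the sketch below uses a simplified, made-up dictionary that mimics the nesting seen in the loop above (a list of responses, each with its own `results`). It assumes that a response may carry more than one entry, for example in a federated setup; the helper `iterate_results` is hypothetical and is not the *pyopencga* implementation.

```python
# Simplified, made-up stand-in for the data behind a REST response:
# a list of responses, each carrying its own list of results.
fake_response = {
    'responses': [
        {'results': [{'id': 'project1', 'name': 'Project One'}]},
        {'results': [{'id': 'project2', 'name': 'Project Two'}]},
    ]
}


def iterate_results(response):
    """Yield every result from every entry in 'responses'."""
    for entry in response['responses']:
        for result in entry['results']:
            yield result


# Indexing the first response only sees part of the data...
first_only = [p['id'] for p in fake_response['responses'][0]['results']]
# ...while walking every response sees all of it.
all_ids = [p['id'] for p in iterate_results(fake_response)]

print(first_only)
print(all_ids)
```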

## RestResponse API

Note: Table with the API functions and their descriptions

### 1. Using the `get_results()` function 

Using the functions that *pyopencga* implements for the RestResponse object makes things much easier! <br> Let's dig into an example using the same query as above:

In [40]:
## Let's fetch the available projects.
projects = oc.projects.search()

## Uncomment next line to display an interactive JSON viewer
# JSON(projects.get_results())

### 2. Using the `result_iterator()` function to iterate over the Rest results

You can also iterate over the results; this is especially useful when fetching many results from the server.

In [41]:
## Let's fetch the available projects.
projects = oc.projects.search()

## Iterate through all the different projects
for project in projects.result_iterator():
    print(project['id'], project['name'], project['creationDate'], project['organism'])

### 3. Using the `print_results()` function to print the REST results

**IMPORTANT**: This function can be configured to exclude metadata, change the separator or even select the fields! It then collects all the requested results and prints them directly in the terminal.<br>In this way, the `RESTResponse` object implements a very powerful custom function to print results 😎

**[NOTE]**: From *pyopencga 2.0.1.2* you can use the `title` parameter in the function to add a header to the results printed.

In [42]:
## This function iterates over all the results, it can be configured to exclude metadata, change separator or even select the fields!
## Set a title to display the results
user_defined_title = 'These are the projects you can access with your user'

projects.print_results(title=user_defined_title, separator=',', fields='id,name,creationDate,organism')

#### Exercise:
- Let's try to customize the results so that only the portion of the data we are interested in gets printed.
- The `metadata=False` parameter allows you to skip the header with the rest response information in the printed results.

In [43]:
## Let's exclude metadata and print only a few fields; use dot notation for nested fields
user_defined_title = 'Display selected fields from the projects data model'
selected_fields='id,name,organism.scientificName,organism.assembly'

projects.print_results(fields=selected_fields, metadata=False, title=user_defined_title)
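
The dot notation used in `selected_fields` names a path through nested objects. As a rough illustration of the idea, the hypothetical helper below resolves such a path in a plain dictionary; the project record is made up for this example and the code is not how *pyopencga* implements field selection.

```python
def get_nested(record, dotted_path, default=None):
    """Follow a dot-separated path through nested dictionaries."""
    current = record
    for key in dotted_path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up project record, shaped like the fields selected above.
project = {
    'id': 'p1',
    'name': 'Example project',
    'organism': {'scientificName': 'Homo sapiens', 'assembly': 'GRCh38'},
}

for field in 'id,name,organism.scientificName,organism.assembly'.split(','):
    print(field, '=', get_nested(project, field))
```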

#### Exercise:
- A very useful parameter is the `separator`. It allows the user to decide the format in which the data is printed. For example, it's possible to print a CSV-like style:

In [44]:
## You can change separator
print('Print the projects with a header and a different separator:\n')
projects.print_results(fields='id,name,organism.scientificName,organism.assembly', separator=',', metadata=False)
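
If the goal is a real CSV file rather than CSV-like text on screen, values that themselves contain the separator need quoting. The sketch below shows how the standard `csv` module handles this for a couple of made-up rows; it makes no claim about how `print_results()` treats such values.

```python
import csv
import io

# Made-up rows; note the comma inside the first name.
rows = [
    {'id': 'p1', 'name': 'Project, one'},
    {'id': 'p2', 'name': 'Project two'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['id', 'name'], lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

# The value containing a comma is wrapped in double quotes by the writer.
csv_text = buffer.getvalue()
print(csv_text)
```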

### 4. Using Pandas DataFrame: `to_data_frame()` 

Pandas provides a very useful functionality for data science. You can convert `RestResponse` objects to Pandas DataFrames using the following function:

`rest_response.to_data_frame()`

In [45]:
## Convert REST response object 'projects' to Pandas dataframe
df = projects.to_data_frame()

## Select some specific columns from the data frame
formatted_df = df[['id', 'name', 'fqn', 'creationDate', 'studies', 'organism.scientificName']]
print('The results can be stored and printed as a pandas DF:\n\n', formatted_df)

# Working with Jobs [UNDER ACTIVE DEVELOPMENT]
------------------

### NOTE: this section is under construction. 

- Please check the latest version of this notebook at https://github.com/opencb/opencga/blob/develop/opencga-client/src/main/python/notebooks/user-training/pyopencga_first_steps.ipynb

OpenCGA implements a powerful interactive API to query data and also allows users to run more demanding analyses by executing jobs. A number of analyses and operations are executed as jobs, such as Variant Export or GWAS analysis.


## 1. Job Info

The Job data model contains all the information about a single execution: date, id, tool, status, files, ... You can filter jobs by any of these parameters and even get some stats.

## 2. Executing Jobs

Job execution involves different lifecycle stages: pending, queued, running, done, rejected.

OpenCGA takes care of executing jobs and notifying job status changes.

Executing a job is as simple as the following code:

In [9]:
## Execute GWAS analysis
#rest_response = oc.variant().gwas()

## wait for the job to finish
#oc.wait_for_job(rest_response)

#rest_response.print_results()

## 3. Wait for a job to finish `wait_for_job()`

This will block Python execution until the job has been completed.
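
Conceptually, waiting for a job amounts to polling its status until it reaches a final stage. The sketch below simulates this without a server. It is not the *pyopencga* implementation of `wait_for_job()`: the function `wait_until_finished` is hypothetical, and treating `done` and `rejected` as the final stages is an assumption based on the lifecycle list in the "Executing Jobs" section.

```python
import time

# Assumed final stages, taken from the lifecycle list in "Executing Jobs".
FINISHED = {'done', 'rejected'}


def wait_until_finished(get_status, retry_seconds=10, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a final stage."""
    while True:
        status = get_status()
        if status in FINISHED:
            return status
        # Pause between checks to avoid querying in a tight loop.
        sleep(retry_seconds)


# Simulated job that moves through the lifecycle, so no server is needed.
stages = iter(['pending', 'queued', 'running', 'done'])
final_status = wait_until_finished(lambda: next(stages), retry_seconds=0)
print(final_status)
```

In a real setting, `get_status` would query the server for the job status and `retry_seconds` would be set to a few seconds.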