## Start using Gaia data in GAVIP

### Background
The Gaia satellite will create a catalog of more than 1 billion stars in our galaxy, eventually forming an archive of more than 1 petabyte.
The volume of the Gaia data products will be too large for most users to download.
GAVIP allows users to create and submit reusable tools that run close to the data, so analysis can be performed without downloading the data to the user's machine.
These tools are known as Added Value Interfaces or **AVIs**.

### Objective
In this tutorial we will perform some analysis of data retrieved from the Gaia archive (Gaia Data Release 1).
We will first perform this analysis in a Jupyter notebook, then create a simple AVI which others can use in GAVIP.

In this notebook, we will:
* Use TAP+ to retrieve data from GACS
* Temporarily store it as a VOTable
* Parse the VOTable into a Pandas dataframe
* Create a pandas_profiling report from the dataframe

We will later package this up for others to use in GAVIP.

**Note:** This notebook was made using Python 2. It is compatible with Python 3.

In [2]:
from astropy.io.votable import parse_single_table    
from astropy.table import Table

import os
import numpy as np

import pandas_profiling
import pandas as pd

import tempfile

### The asynchronous TAP+ module
GAVIP AVIs are built using the AVI framework.
The framework handles authentication and asynchronous job processing, among other tasks.
It also includes "connectors" and "services".
Connectors provide implementations of protocols useful to AVIs, such as TAP+. Services provide reusable blocks of code built on connectors; these will be shown later in the tutorial.

The Jupyter notebooks provided by GAVIP include the AVI framework, so we will use the **TAP+ connector** rather than writing the requests manually. We do this by importing the `AsyncJob` class from the `connectors.tapquery` module.

In [3]:
# Asynchronous TAP+ class
from connectors.tapquery import AsyncJob  

Using the `AsyncJob` class, we create a function that accepts an ADQL query and a target TAP+ server.

This function performs the following:
1. Submit the ADQL query to the target TAP server
2. Wait for the job to complete
3. Store the result in a temporary file as a VOTable
4. Parse the VOTable using the votable library from astropy
5. Convert the table to a Pandas dataframe and return it

**Note:** The `AsyncJob` class can be initialized with `username` and `password` parameters. If you provide your Cosmos login credentials, your jobs will be recorded under your account. In this demo, we make the TAP+ request anonymously.
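
Step 2 above, waiting for the job, amounts to polling the server until the job reaches a terminal state. The sketch below illustrates that pattern in general terms; it is not the `AsyncJob` implementation. `wait_for_job`, `get_phase`, and the simulated phase sequence are hypothetical names used only for illustration, and the phase names follow the UWS convention used by TAP services.

```python
import time

# Terminal phases as defined by the UWS standard that TAP async jobs follow
TERMINAL_PHASES = {"COMPLETED", "ERROR", "ABORTED"}


def wait_for_job(get_phase, poll_interval=1.0, timeout=60.0):
    """Poll get_phase() until the job reaches a terminal phase.

    get_phase is a hypothetical callable returning the current phase
    string; a real client would fetch it from the job's phase URL.
    """
    deadline = time.time() + timeout
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if time.time() >= deadline:
            raise RuntimeError("job did not finish within %s seconds" % timeout)
        time.sleep(poll_interval)


# Simulated job for illustration: two non-terminal phases, then completion
phases = iter(["QUEUED", "EXECUTING", "COMPLETED"])
final_phase = wait_for_job(lambda: next(phases), poll_interval=0.01)
print(final_phase)
```

A shorter poll interval returns results sooner at the cost of more requests to the server; the function below uses an interval of 1.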


In [5]:
def get_gaia_data(query, target):
    """
    Run an ADQL query asynchronously against a TAP service.

    query  -- the ADQL query string
    target -- the URL of the TAP endpoint

    The VOTable result is parsed into an AstroPy Table, which is
    then converted to a Pandas dataframe and returned.
    """
    async_check_interval = 1
    gacs_tap_conn = AsyncJob(target, query, poll_interval=async_check_interval)

    # Run the job (start + wait + raise_exception)
    gacs_tap_conn.run()

    # Fetch the response
    result = gacs_tap_conn.open_result()

    # Write the raw VOTable bytes to a temporary file; binary mode avoids
    # encoding problems on both Python 2 and Python 3
    with tempfile.NamedTemporaryFile(delete=False) as tmp_vot:
        tmp_vot.write(result.content)

    try:
        table = parse_single_table(tmp_vot.name).to_table()
    finally:
        # Always delete the temporary file, even if parsing fails
        os.unlink(tmp_vot.name)

    # Fill masked values, then build and return the Pandas dataframe
    return pd.DataFrame(np.ma.filled(table.as_array()), columns=table.colnames)
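
One detail of the final conversion is worth noting: `np.ma.filled()` called without a fill value replaces masked (missing) entries with the array's default fill value, which for floating-point data is `1e20`. Such placeholder values can distort summary statistics. The sketch below uses a small hand-made masked array, not Gaia data, to show the default behaviour and one possible alternative, filling with NaN. Whether that alternative suits a given analysis is an assumption to check, and for a table with mixed column types a per-column approach may be needed.

```python
import numpy as np

# Hand-made example: the second measurement is masked (missing)
mags = np.ma.masked_array([15.2, 0.0, 17.8], mask=[False, True, False])

# Default behaviour: masked entries become the default fill value
default_filled = np.ma.filled(mags)

# Alternative: fill with NaN so that missing values stay recognisable
nan_filled = np.ma.filled(mags, np.nan)

print(default_filled)
print(nan_filled)
```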

Here we specify the address of the target TAP server and the ADQL query (taken from the ADQL Help page on the Gaia Archive website).

Once both are specified, we use the `get_gaia_data()` function defined above to obtain a Pandas dataframe of the resulting VOTable from the TAP server.

In [6]:
target = "http://gea.esac.esa.int/tap-server/tap"

# sample query from https://gea.esac.esa.int/archive/
query = """
        SELECT source_id, ra, dec, phot_g_mean_flux, phot_g_mean_mag,
        DISTANCE(POINT('ICRS',ra,dec), POINT('ICRS',266.41683,-29.00781)) 
        AS dist FROM gaiadr1.gaia_source WHERE 1=CONTAINS(POINT('ICRS',ra,dec),
        CIRCLE('ICRS',266.41683,-29.00781, 0.08333333))
        """

df = get_gaia_data(query, target)
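
The sample query is a cone search: it selects sources within a circle of radius 0.08333333 degrees (5 arcminutes) around the position (266.41683, -29.00781) and computes each source's distance from that centre. As a sketch, the same query can be generated from parameters. `build_cone_query` is a hypothetical helper written for this illustration; it is not part of GAVIP or the Gaia archive tools.

```python
def build_cone_query(ra, dec, radius_deg, table="gaiadr1.gaia_source"):
    """Return an ADQL cone-search query string (hypothetical helper).

    ra, dec and radius_deg are in degrees, in the ICRS frame.
    """
    return (
        "SELECT source_id, ra, dec, phot_g_mean_flux, phot_g_mean_mag, "
        "DISTANCE(POINT('ICRS',ra,dec), POINT('ICRS',{ra},{dec})) AS dist "
        "FROM {table} "
        "WHERE 1=CONTAINS(POINT('ICRS',ra,dec), "
        "CIRCLE('ICRS',{ra},{dec},{radius}))"
    ).format(ra=ra, dec=dec, radius=radius_deg, table=table)


# 5 arcminutes expressed in degrees
radius = 5.0 / 60.0
cone_query = build_cone_query(266.41683, -29.00781, radius)
print(cone_query)
```

The resulting string could be passed to `get_gaia_data()` in place of `query`.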



Finally, we list the column names of interest and use that list to select them from the dataframe. We then pass the resulting dataframe to the `ProfileReport` class provided by pandas_profiling.

In [None]:
gaiamagcols=['dec', 'dist', 'phot_g_mean_flux', 'phot_g_mean_mag', 'ra', 'source_id']
gaiadf = df[gaiamagcols]
pandas_profiling.ProfileReport(gaiadf)
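
Before generating a full profile report, it can help to check the column selection on a small frame. The sketch below uses made-up values, not archive results, to show selection with a list of names and a quick summary via `describe()`, a standard pandas method that is lighter than a full report. The column `extra_column` is invented for the example.

```python
import pandas as pd

# Made-up values standing in for a query result
sample = pd.DataFrame({
    "source_id": [1, 2, 3],
    "ra": [266.40, 266.42, 266.45],
    "dec": [-29.01, -29.00, -28.99],
    "phot_g_mean_mag": [15.2, 16.9, 17.8],
    "extra_column": ["a", "b", "c"],
})

# Selecting with a list of names returns a new dataframe
cols = ["ra", "dec", "phot_g_mean_mag"]
subset = sample[cols]

# Quick numeric summary (count, mean, std, min, quartiles, max)
summary = subset.describe()
print(summary)
```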