# Working with Pandas and XArray

This notebook demonstrates how Pandas and XArray can be used to work with the QCoDeS Dataset. It is not meant as a general introduction to Pandas and XArray. We refer to the official documentation for [Pandas](https://pandas.pydata.org/) and [XArray](http://xarray.pydata.org/en/stable/) for this. This notebook requires that both Pandas and XArray are installed.

## Setup

First we borrow an example from the measurement notebook to have some data to work with. We split the measurement in two so we can try merging it with Pandas.

In [1]:
%matplotlib notebook
import pandas as pd
from functools import partial
import numpy as np
import matplotlib.pyplot as plt

import qcodes as qc
from qcodes.dataset.experiment_container import load_or_create_experiment
from qcodes.dataset.database import initialise_database
from qcodes.tests.instrument_mocks import DummyInstrument
from qcodes.dataset.measurements import Measurement

qc.logger.start_all_logging()

Logging hadn't been started.
Activating auto-logging. Current session state plus future input saved.
Filename       : C:\Users\wihpniel\.qcodes\logs\command_history.log
Mode           : append
Output logging : True
Raw input log  : False
Timestamping   : True
State          : active


In [2]:
# preparatory mocking of physical setup
dac = DummyInstrument('dac', gates=['ch1', 'ch2'])
dmm = DummyInstrument('dmm', gates=['v1', 'v2'])
station = qc.Station(dmm, dac)

In [3]:
initialise_database()
load_or_create_experiment(experiment_name='working_with_pandas',
                          sample_name="no sample")

working_with_pandas#no sample#7@C:\Users\wihpniel\src\Qcodes\docs\examples\DataSet/db_files/mvmhqlmnfs.db
---------------------------------------------------------------------------------------------------------
86-results-1-dac_ch1,dac_ch2,dmm_v1-40200
87-results-2-dac_ch1,dac_ch2,dmm_v1-40401

In [4]:
meas = Measurement()
meas.register_parameter(dac.ch1)  # register the first independent parameter
meas.register_parameter(dac.ch2)  # register the second independent parameter
meas.register_parameter(dmm.v1, setpoints=(dac.ch1, dac.ch2))  # register the dependent one

<qcodes.dataset.measurements.Measurement at 0x17e85a7c828>

In [5]:
# and we'll make a 2D gaussian to sample from/measure
def gauss_model(x0: float, y0: float, sigma: float, noise: float=0.0005):
    """
    Returns a generator sampling a gaussian. The gaussian is
    normalised such that its maximal value is simply 1
    """
    while True:
        (x, y) = yield
        model = np.exp(-((x0-x)**2+(y0-y)**2)/2/sigma**2)*np.exp(2*sigma**2)
        noise = np.random.randn()*noise
        yield model + noise

In [6]:
# and finally wire up the dmm v1 to "measure" the gaussian

gauss = gauss_model(0.1, 0.2, 0.25)
next(gauss)

def measure_gauss(dac):
    val = gauss.send((dac.ch1.get(), dac.ch2.get()))
    next(gauss)
    return val

dmm.v1.get = partial(measure_gauss, dac)

We then perform a very basic experiment. To be able to demonstrate merging of datasets in Pandas we will perform the measurement in two parts.

In [7]:
# run a 2D sweep

with meas.run() as datasaver:

    for v1 in np.linspace(-1, 0, 200, endpoint=False):
        for v2 in np.linspace(-1, 1, 201):
            dac.ch1(v1)
            dac.ch2(v2)
            val = dmm.v1.get()
            datasaver.add_result((dac.ch1, v1),
                                 (dac.ch2, v2),
                                 (dmm.v1, val))
            
    dataid = datasaver.run_id
df1 = datasaver.dataset.get_data_as_pandas_dataframe()['dmm_v1']

Starting experimental run with id: 88


In [8]:
# run a 2D sweep

with meas.run() as datasaver:

    for v1 in np.linspace(0, 1, 201):
        for v2 in np.linspace(-1, 1, 201):
            dac.ch1(v1)
            dac.ch2(v2)
            val = dmm.v1.get()
            datasaver.add_result((dac.ch1, v1),
                                 (dac.ch2, v2),
                                 (dmm.v1, val))
            
    dataid = datasaver.run_id
df2 = datasaver.dataset.get_data_as_pandas_dataframe()['dmm_v1']

Starting experimental run with id: 89


`get_data_as_pandas_dataframe` returns the data as a dict from measured (dependent) parameters to DataFrames. Here we are only interested in the dataframe of a single parameter, so we select that from the dict.

## Working with Pandas

Lets first inspect the Pandas DataFrame. Note how both dependent variables are used for the index. Pandas refers to this as a [MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html). For visual clarity, we just look at the first N points of the dataset.

In [9]:
N = 10

In [10]:
df1[:N]

Unnamed: 0_level_0,Unnamed: 1_level_0,dmm_v1
dac_ch1,dac_ch2,Unnamed: 2_level_1
-1.0,-1.0,-0.000271
-1.0,-0.99,-8.1e-05
-1.0,-0.98,0.000111
-1.0,-0.97,0.00014
-1.0,-0.96,-0.00024
-1.0,-0.95,-0.000176
-1.0,-0.94,8.1e-05
-1.0,-0.93,-4e-06
-1.0,-0.92,8e-06
-1.0,-0.91,-2e-06


We can also reset the index to return a simpler view where all data points are simply indexed by a running counter. As we shall see below this can be needed in some situations. Note that calling `reset_index` leaves the original dataframe untouched.

In [11]:
df1.reset_index()[0:N]

Unnamed: 0,dac_ch1,dac_ch2,dmm_v1
0,-1.0,-1.0,-0.000271
1,-1.0,-0.99,-8.1e-05
2,-1.0,-0.98,0.000111
3,-1.0,-0.97,0.00014
4,-1.0,-0.96,-0.00024
5,-1.0,-0.95,-0.000176
6,-1.0,-0.94,8.1e-05
7,-1.0,-0.93,-4e-06
8,-1.0,-0.92,8e-06
9,-1.0,-0.91,-2e-06


Pandas has built-in support for various forms of plotting. This does not, however, support MultiIndex at the moment so we use `reset_index` to make the data available for plotting.

In [12]:
df1.reset_index().plot.scatter('dac_ch1', 'dac_ch2', c='dmm_v1')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x17e85f41d30>

Similarly, for the other dataframe:

In [13]:
df2.reset_index().plot.scatter('dac_ch1', 'dac_ch2', c='dmm_v1')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x17e86fe1da0>

Merging two dataframes with the same labels is fairly simple.

In [14]:
df = pd.concat([df1, df2], sort=True)

In [15]:
df.reset_index().plot.scatter('dac_ch1', 'dac_ch2', c='dmm_v1')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x17e863db780>

It is also possible to select a subset of data from the datframe based on the x and y values.

In [16]:
df.loc[(slice(-1, -0.95), slice(-1, -0.97)), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,dmm_v1
dac_ch1,dac_ch2,Unnamed: 2_level_1
-1.0,-1.0,-0.0002714239
-1.0,-0.99,-8.128461e-05
-1.0,-0.98,0.0001112536
-1.0,-0.97,0.0001397232
-0.995,-1.0,7.680241e-10
-0.995,-0.99,9.29848e-10
-0.995,-0.98,1.123969e-09
-0.995,-0.97,1.356443e-09
-0.99,-1.0,8.381701e-10
-0.99,-0.99,1.014774e-09


## Working with XArray

In many cases when working with data on a rectangular grids it may be more convenient to export the data to a [XArray](http://xarray.pydata.org) Dataset or DataArray

The Pandas DataSet can be directly converted to a XArray [Dataset](http://xarray.pydata.org/en/stable/data-structures.html?#dataset):

In [17]:
xaDataSet = df.to_xarray()

In [18]:
xaDataSet

<xarray.Dataset>
Dimensions:  (dac_ch1: 401, dac_ch2: 201)
Coordinates:
  * dac_ch1  (dac_ch1) float64 -1.0 -0.995 -0.99 -0.985 ... 0.985 0.99 0.995 1.0
  * dac_ch2  (dac_ch2) float64 -1.0 -0.99 -0.98 -0.97 ... 0.97 0.98 0.99 1.0
Data variables:
    dmm_v1   (dac_ch1, dac_ch2) float64 -0.0002714 -8.128e-05 ... 1.039e-05

However, in many cases it is more convenient to work with a XArray [DataArray](http://xarray.pydata.org/en/stable/data-structures.html?#dataarray). The DataArray can only contain a single dependent variable and can be obtained from the Dataset by indexing using the parameter name.

In [19]:
xaDataArray = xaDataSet['dmm_v1']

In [20]:
xaDataArray

<xarray.DataArray 'dmm_v1' (dac_ch1: 401, dac_ch2: 201)>
array([[-2.714239e-04, -8.128461e-05,  1.112536e-04, ...,  5.451526e-07,
         4.808069e-07,  4.233782e-07],
       [ 7.680241e-10,  9.298480e-10,  1.123969e-09, ...,  5.951812e-07,
         5.249305e-07,  4.622315e-07],
       [ 8.381701e-10,  1.014774e-09,  1.226624e-09, ...,  6.495409e-07,
         5.728740e-07,  5.044485e-07],
       ...,
       [ 1.991485e-08,  2.411094e-08,  2.914449e-08, ...,  1.543304e-05,
         1.361144e-05,  1.198566e-05],
       [ 1.854251e-08,  2.244944e-08,  2.713612e-08, ...,  1.436954e-05,
         1.267347e-05,  1.115972e-05],
       [ 1.725783e-08,  2.089408e-08,  2.525605e-08, ...,  1.337397e-05,
         1.179541e-05,  1.038654e-05]])
Coordinates:
  * dac_ch1  (dac_ch1) float64 -1.0 -0.995 -0.99 -0.985 ... 0.985 0.99 0.995 1.0
  * dac_ch2  (dac_ch2) float64 -1.0 -0.99 -0.98 -0.97 ... 0.97 0.98 0.99 1.0

In [21]:
fig, ax = plt.subplots(2,2)
xaDataArray.plot(ax=ax[0,0])
xaDataArray.mean(dim='dac_ch1').plot(ax=ax[1,0])
xaDataArray.mean(dim='dac_ch2').plot(ax=ax[0,1])
xaDataArray[200,:].plot(ax=ax[1,1])
fig.tight_layout()

<IPython.core.display.Javascript object>

Above we demonstrate a few ways to index the data from a DataArray. For instance the DataArray can be directly plotted, the mean extracted or a specific row/column selected.