![Image](./img/dataset-diagram-logo.png)

**Goal**: the power of pandas in N dimensions, with labels! <br>
-> **pandas-compatible tools for multidimensional arrays**

<img src="./img/intro_xarray.png">

In [2]:
import numpy as np
import pandas as pd
import xarray as xr

In [3]:
# 0-dimensional DataArray (randint(2, 3) can only return 2)
truc = xr.DataArray(np.random.randint(2, 3))
print(truc)

<xarray.DataArray ()>
array(2)


In [4]:
data = xr.DataArray(np.random.randint(6, size=(2, 3)), coords={'x': ['a', 'b']}, dims=('x', 'y'))
print(data)

<xarray.DataArray (x: 2, y: 3)>
array([[3, 5, 5],
       [1, 5, 4]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y


In [5]:
# A pandas object can be passed in directly
xr.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))

<xarray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
  * dim_0    (dim_0) object 'a' 'b' 'c'
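
The dimension is called `dim_0` above because the Series index has no name. A minimal sketch, assuming the index is given a name before conversion (the name `letter` is purely illustrative), shows how that name can carry over to the dimension:

```python
import pandas as pd
import xarray as xr

# Name the index so xarray has something better than 'dim_0' to use
s = pd.Series(
    range(3),
    index=pd.Index(list("abc"), name="letter"),  # 'letter' is a made-up name
    name="foo",
)
arr = xr.DataArray(s)

print(arr.name)  # name of the Series
print(arr.dims)  # dimension named after the index
```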

## DataArray properties

In [6]:
print('values : ', data.values)
print('dims : ', data.dims)
print('coords : ', data.coords)
data.attrs  # for metadata

values :  [[3 5 5]
 [1 5 4]]
dims :  ('x', 'y')
coords :  Coordinates:
  * x        (x) <U1 'a' 'b'


OrderedDict()
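
`attrs` starts out empty and behaves like an ordinary dictionary for free-form metadata. A small sketch, where the `units` key and the array name are hypothetical choices made only for illustration:

```python
import numpy as np
import xarray as xr

temps = xr.DataArray(
    np.zeros((2, 3)),
    coords={"x": ["a", "b"]},
    dims=("x", "y"),
)

# Attach metadata; xarray does not interpret these keys itself
temps.attrs["units"] = "kelvin"
temps.name = "temperature"

print(temps.attrs)
print(temps.name)
```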

## Indexing

In [7]:
data[[0]]   # as in numpy, with integer positions
data.loc['a':'b']  # as in pandas, with labels
data.isel(x=slice(1))  # by dimension name + integer position
data.sel(x=['a', 'b'])  # by dimension name + label

print(data)
print(data.isel(x=slice(3)))

<xarray.DataArray (x: 2, y: 3)>
array([[3, 5, 5],
       [1, 5, 4]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y
<xarray.DataArray (x: 2, y: 3)>
array([[3, 5, 5],
       [1, 5, 4]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y
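
The four indexing styles differ mainly in whether they take positions or labels. The sketch below uses fixed values rather than the random `data` above, so the results are reproducible. It checks that `isel` and `sel` agree and that label slices include their end point while integer slices do not:

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={"x": ["a", "b"]},
    dims=("x", "y"),
)

# Position 0 along x and label 'a' refer to the same row
by_position = data.isel(x=0)
by_label = data.sel(x="a")
print(by_position.equals(by_label))  # True

# Label-based slices include the end label, as in pandas
both_rows = data.loc["a":"b"]
print(both_rows.sizes["x"])  # 2

# Integer slices exclude the stop position, as in numpy
first_row = data.isel(x=slice(1))
print(first_row.sizes["x"])  # 1
```

This also explains the output of `data.isel(x=slice(3))` above: a slice that runs past the end of the dimension simply returns everything available.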


## Computation

In [8]:
# very similar to numpy
data + 10
np.sin(data)
data.T
data.sum()

# but axes can be referred to by name instead of by number
data.mean(dim='x')

<xarray.DataArray (y: 3)>
array([2. , 5. , 4.5])
Dimensions without coordinates: y
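
One practical benefit of named dimensions is that a reduction keeps its meaning when the axis order changes. A short sketch with fixed values, as an illustration only:

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.arange(6).reshape(2, 3), dims=("x", "y"))

by_name = data.mean(dim="x")   # reduce over the dimension called 'x'
by_axis = data.mean(axis=0)    # numpy style: 'x' happens to be axis 0 here

# After a transpose 'x' is axis 1, but dim='x' still means the same thing
transposed = data.T
by_name_transposed = transposed.mean(dim="x")

print(by_name.values)  # [1.5 2.5 3.5]
print(by_name.equals(by_axis), by_name.equals(by_name_transposed))  # True True
```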

In [12]:
# No need to insert new axes by hand (np.newaxis): broadcasting follows dimension names
a = xr.DataArray(np.random.randint(3, size=(3)), [data.coords['y']])
b = xr.DataArray(np.random.randint(4, size=(4)), dims='z')

print('a : ', a)
print('b : ', b)
print('a+b : ', a+b)

a :  <xarray.DataArray (y: 3)>
array([0, 0, 2])
Coordinates:
  * y        (y) int64 0 1 2
b :  <xarray.DataArray (z: 4)>
array([1, 0, 2, 3])
Dimensions without coordinates: z
a+b :  <xarray.DataArray (y: 3, z: 4)>
array([[1, 0, 2, 3],
       [1, 0, 2, 3],
       [3, 2, 4, 5]])
Coordinates:
  * y        (y) int64 0 1 2
Dimensions without coordinates: z
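
For comparison, here is roughly what the same sum requires in plain numpy. The sketch reuses the values printed above for `a` and `b` and checks that both approaches give the same result:

```python
import numpy as np
import xarray as xr

a_np = np.array([0, 0, 2])
b_np = np.array([1, 0, 2, 3])

# numpy: shapes (3,) and (4,) do not broadcast, so axes are added by hand
manual = a_np[:, np.newaxis] + b_np[np.newaxis, :]

# xarray: distinct dimension names are enough to build the (y, z) result
a = xr.DataArray(a_np, dims="y")
b = xr.DataArray(b_np, dims="z")
auto = a + b

print(auto.dims)  # ('y', 'z')
print(np.array_equal(auto.values, manual))  # True
```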


In [18]:
# Operands are matched by dimension name, so the transpose changes nothing
data - data.T

<xarray.DataArray (x: 2, y: 3)>
array([[0, 0, 0],
       [0, 0, 0]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

In [19]:
# Both slices keep only x='a', so the labels align and the difference is zero
data[:-1] - data[:1]

<xarray.DataArray (x: 1, y: 3)>
array([[0, 0, 0]])
Coordinates:
  * x        (x) <U1 'a'
Dimensions without coordinates: y
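
The two cells above rely on automatic alignment: operands are matched by dimension name and by coordinate label, not by position. A sketch with fixed values, assuming xarray's default behaviour of keeping only the labels shared by both operands in arithmetic:

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={"x": ["a", "b"]},
    dims=("x", "y"),
)

# Same dimension names on both sides: the transpose is undone by name matching
diff = data - data.T
print(bool((diff == 0).all()))  # True

# Labels 'a' and 'b' do not overlap, so nothing is left along x
disjoint = data[:-1] - data[1:]
print(disjoint.sizes["x"])  # 0
```

With positional numpy arrays the second subtraction would silently subtract row `b` from row `a`. Here the empty result signals that the labels did not match.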

## GroupBy

In [22]:
display(data)
labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')

data1 = data.groupby(labels).mean('y')

data2 = data.groupby(labels).apply(lambda x: x - x.min())

print('data : \n', data)
print('\n')
print('labels : \n', labels)
print('\n')
print('data1 : \n', data1)
print('\n')
print('data2 : \n', data2)


<xarray.DataArray (x: 2, y: 3)>
array([[3, 5, 5],
       [1, 5, 4]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

data : 
 <xarray.DataArray (x: 2, y: 3)>
array([[3, 5, 5],
       [1, 5, 4]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y


labels : 
 <xarray.DataArray 'labels' (y: 3)>
array(['E', 'F', 'E'], dtype='<U1')
Coordinates:
  * y        (y) int64 0 1 2


data1 : 
 <xarray.DataArray (x: 2, labels: 2)>
array([[4. , 5. ],
       [2.5, 5. ]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * labels   (labels) object 'E' 'F'


data2 : 
 <xarray.DataArray (x: 2, y: 3)>
array([[2, 0, 4],
       [0, 0, 3]])
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 0 1 2
    labels   (y) <U1 'E' 'F' 'E'
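
The same grouping can be reproduced with the fixed values shown in the output above. This sketch uses `map` rather than `apply`. To my knowledge, recent xarray releases treat `apply` on a GroupBy as a deprecated alias of `map`, so treat that substitution as an assumption about your installed version:

```python
import xarray as xr

data = xr.DataArray(
    [[3, 5, 5], [1, 5, 4]],
    coords={"x": ["a", "b"], "y": [0, 1, 2]},
    dims=("x", "y"),
)
labels = xr.DataArray(["E", "F", "E"], coords={"y": [0, 1, 2]}, dims="y", name="labels")

# Reduce each group: columns 0 and 2 form group 'E', column 1 forms group 'F'
means = data.groupby(labels).mean()

# Transform each group without reducing: subtract the group's minimum
centered = data.groupby(labels).map(lambda group: group - group.min())

print(means.sel(labels="E").values)           # [4.  2.5]
print(centered.transpose("x", "y").values)    # [[2 0 4] [0 0 3]]
```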


## Pandas

In [12]:
series = data.to_series()
series.to_xarray()

<xarray.DataArray (x: 2, y: 3)>
array([[0, 2, 2],
       [0, 2, 1]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1 2
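
In the round trip above, `to_series` flattens the array into a Series indexed by a MultiIndex with one level per dimension, and `to_xarray` unstacks it again. A sketch with fixed values (the name `foo` is arbitrary):

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={"x": ["a", "b"], "y": [0, 1, 2]},
    dims=("x", "y"),
    name="foo",
)

series = data.to_series()
print(list(series.index.names))  # ['x', 'y']
print(len(series))               # 6, one row per (x, y) pair

# Back to xarray: index levels become dimensions again
back = series.to_xarray()
print(back.dims)  # ('x', 'y')
```

As the output above shows, the `x` coordinate may come back with a different dtype (`object` instead of `<U1`), so the round trip preserves values but not necessarily dtypes.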

## Dataset

In [13]:
ds = xr.Dataset({'foo': data, 'bar': ('x', [1, 2]), 'baz': np.pi})
ds

<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y
Data variables:
    foo      (x, y) int64 0 2 2 0 2 1
    bar      (x) int64 1 2
    baz      float64 3.142

In [14]:
ds['foo']

<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[0, 2, 2],
       [0, 2, 1]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y
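
A Dataset behaves like a dictionary of DataArrays that share coordinates, so a single selection applies to every variable that has the dimension in question. A sketch with fixed values, reusing the variable names from the cell above:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "foo": (("x", "y"), np.arange(6).reshape(2, 3)),
        "bar": ("x", [1, 2]),
        "baz": np.pi,  # scalar variable with no dimensions
    },
    coords={"x": ["a", "b"]},
)

print(list(ds.data_vars))  # ['foo', 'bar', 'baz']

# One selection on x slices both foo and bar; baz has no x and is unchanged
row = ds.sel(x="b")
print(row["foo"].values)   # [3 4 5]
print(int(row["bar"]))     # 2
```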

## NetCDF

In [15]:
ds.to_netcdf('example.nc')

In [16]:
xr.open_dataset('example.nc')

<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) object 'a' 'b'
Dimensions without coordinates: y
Data variables:
    foo      (x, y) int32 ...
    bar      (x) int32 ...
    baz      float64 ...