# Basic concepts

This notebook explains the basic concepts of the `helio` framework:
* [`FilesIndex`](#FilesIndex)
* [`BatchSampler`](#BatchSampler)
* [`Batch`](#Batch)

and gives more examples of use.

## FilesIndex

Datasets are usually sets of files. Each dataset item can be represented by a single file (e.g. an image) or by multiple files (e.g. an image and the corresponding binary mask). `FilesIndex` provides a convenient way to organize such datasets as well as to sort, split and filter dataset items. In particular, `FilesIndex` is based on the `pandas.DataFrame` class and supports all features and methods available with pandas dataframes.
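
To make the idea concrete, here is a minimal sketch of a glob-based file index built with plain pandas. It is not the `helio` implementation: the helper `build_file_index` and the placeholder file names are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_file_index(directory, pattern, name):
    """Hypothetical helper: index files matching `pattern` under column `name`."""
    paths = sorted(Path(directory).glob(pattern))
    # File names without extension become the row labels,
    # full paths go into a column with the reference name.
    return pd.DataFrame(
        {name: [str(p) for p in paths]},
        index=pd.Index([p.stem for p in paths], name="FilesIndex"),
    )


# Demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ["b.fits", "a.fits", "notes.txt"]:
        (Path(tmp) / fname).touch()
    demo_index = build_file_index(tmp, "*.fits", "images")

print(demo_index.index.tolist())
```

Only the two `.fits` files end up in the index, and because the result is an ordinary dataframe, every pandas method is available on it.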

To create a `FilesIndex`, provide a mask (glob pattern) for the files to be indexed and a reference name for the file set. For example, here we index all files with the `.fits` extension in the directory `aia193_fits`. The reference name for this set is `images`:

In [1]:
import sys
sys.path.append("..")

from helio import FilesIndex

index = FilesIndex(images='../../aia193_fits/*.fits')
index.head()

FilesIndex,images
aia.lev1_euv_12s.2018-12-20T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-20T...
aia.lev1_euv_12s.2018-12-21T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-21T...
aia.lev1_euv_12s.2018-12-22T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-22T...
aia.lev1_euv_12s.2018-12-23T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-23T...
aia.lev1_euv_12s.2018-12-24T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-24T...


Since `index` is a pandas dataframe, we can use `head` to see the first items. Other dataframe methods work in the same way, e.g. to sort the data:

In [2]:
index.sort_values(by='images').head()

FilesIndex,images
aia.lev1_euv_12s.2018-12-20T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-20T...
aia.lev1_euv_12s.2018-12-21T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-21T...
aia.lev1_euv_12s.2018-12-22T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-22T...
aia.lev1_euv_12s.2018-12-23T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-23T...
aia.lev1_euv_12s.2018-12-24T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-24T...


Get dataset size:

In [3]:
len(index)

55

Select a subset:

In [4]:
index.iloc[10:15]

FilesIndex,images
aia.lev1_euv_12s.2018-12-30T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-30T...
aia.lev1_euv_12s.2018-12-31T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2018-12-31T...
aia.lev1_euv_12s.2019-01-01T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2019-01-01T...
aia.lev1_euv_12s.2019-01-02T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2019-01-02T...
aia.lev1_euv_12s.2019-01-03T234430Z.193.image_lev1,../../aia193_fits\aia.lev1_euv_12s.2019-01-03T...


Reset the index:

In [5]:
index.reset_index(drop=True).head()

,images
0,../../aia193_fits\aia.lev1_euv_12s.2018-12-20T...
1,../../aia193_fits\aia.lev1_euv_12s.2018-12-21T...
2,../../aia193_fits\aia.lev1_euv_12s.2018-12-22T...
3,../../aia193_fits\aia.lev1_euv_12s.2018-12-23T...
4,../../aia193_fits\aia.lev1_euv_12s.2018-12-24T...


Modify indices:

In [6]:
# Keep only the timestamp part of each file name (characters 17 to 33)
index.index = index.index.map(lambda x: x[17:34])
index.head()

FilesIndex,images
2018-12-20T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-20T...
2018-12-21T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-21T...
2018-12-22T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-22T...
2018-12-23T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-23T...
2018-12-24T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-24T...


One more useful feature of the `FilesIndex` is a datetime parser for fuzzy formats:

In [7]:
index.parse_datetime().head()

FilesIndex,images,DateTime
2018-12-20T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-20T...,2018-12-20 23:44:30
2018-12-21T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-21T...,2018-12-21 23:44:30
2018-12-22T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-22T...,2018-12-22 23:44:30
2018-12-23T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-23T...,2018-12-23 23:44:30
2018-12-24T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-24T...,2018-12-24 23:44:30
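
For comparison, the same kind of `DateTime` column can be produced with plain pandas when the timestamp format is known in advance. This is only a sketch: it uses an explicit format string rather than fuzzy parsing, it says nothing about how `parse_datetime` works internally, and the file names are placeholders.

```python
import pandas as pd

# Index labels in the same shape as the ones shown above; paths are placeholders.
ids = ["2018-12-20T234430", "2018-12-21T234430"]
frame = pd.DataFrame(
    {"images": ["a.fits", "b.fits"]},
    index=pd.Index(ids, name="FilesIndex"),
)

# Parse the labels with an explicit format and store the result as a new column.
frame["DateTime"] = pd.to_datetime(frame.index, format="%Y-%m-%dT%H%M%S")
print(frame)
```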


The `DateTime` column can be used to merge multiple observations that correspond to the same time. Another application is to get basic parameters of the Sun:

In [8]:
index.get_sun_params().head()

FilesIndex,images,DateTime,L0,B0,CR
2018-12-20T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-20T...,2018-12-20 23:44:30,353.448898,-1.631964,2212
2018-12-21T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-21T...,2018-12-21 23:44:30,340.275175,-1.756973,2212
2018-12-22T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-22T...,2018-12-22 23:44:30,327.101705,-1.881456,2212
2018-12-23T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-23T...,2018-12-23 23:44:30,313.928509,-2.005373,2212
2018-12-24T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-24T...,2018-12-24 23:44:30,300.755608,-2.128685,2212


Now we can e.g. set the Carrington rotation number as an index and select specific numbers:

In [9]:
new_index = index.reset_index().set_index('CR').loc[[2213]]
new_index.head()

CR,FilesIndex,images,DateTime,L0,B0
2213,2019-01-16T234430,../../aia193_fits\aia.lev1_euv_12s.2019-01-16T...,2019-01-16 23:44:30,357.856648,-4.707531
2213,2019-01-17T234430,../../aia193_fits\aia.lev1_euv_12s.2019-01-17T...,2019-01-17 23:44:30,344.689145,-4.804793
2213,2019-01-18T234430,../../aia193_fits\aia.lev1_euv_12s.2019-01-18T...,2019-01-18 23:44:30,331.521721,-4.900536
2213,2019-01-19T234430,../../aia193_fits\aia.lev1_euv_12s.2019-01-19T...,2019-01-19 23:44:30,318.35438,-4.994727
2213,2019-01-20T234430,../../aia193_fits\aia.lev1_euv_12s.2019-01-20T...,2019-01-20 23:44:30,305.187132,-5.087337


One more thing to be mentioned is a train/test split of the index:

In [10]:
train, test = index.train_test_split(train_ratio=0.8, shuffle=True)
print("Train size:", len(train), "Test size:", len(test))
train.head()

Train size: 44 Test size: 11


FilesIndex,images,DateTime,L0,B0,CR
2018-12-27T234430,../../aia193_fits\aia.lev1_euv_12s.2018-12-27T...,2018-12-27 23:44:30,261.23885,-2.494584,2212
2019-02-05T234430,../../aia193_fits\aia.lev1_euv_12s.2019-02-05T...,2019-02-05 23:44:30,94.525899,-6.331311,2213
2019-01-27T234430,../../aia193_fits\aia.lev1_euv_12s.2019-01-27T...,2019-01-27 23:44:30,213.019959,-5.688857,2213
2019-01-26T234430,../../aia193_fits\aia.lev1_euv_12s.2019-01-26T...,2019-01-26 23:44:30,226.186285,-5.608132,2213
2019-01-31T234430,../../aia193_fits\aia.lev1_euv_12s.2019-01-31T...,2019-01-31 23:44:30,160.355668,-5.993455,2213
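
A shuffled split of this kind can be sketched with plain pandas as follows. The helper `split_frame` and its `seed` parameter are assumptions for illustration and do not describe how `train_test_split` is implemented in `helio`.

```python
import pandas as pd


def split_frame(frame, train_ratio=0.8, shuffle=True, seed=None):
    """Hypothetical helper: split a dataframe into train and test parts."""
    if shuffle:
        # Sampling the whole frame without replacement is a row shuffle.
        frame = frame.sample(frac=1, random_state=seed)
    n_train = int(round(train_ratio * len(frame)))
    return frame.iloc[:n_train], frame.iloc[n_train:]


# 55 placeholder items, the same size as the dataset above.
demo = pd.DataFrame({"images": [f"file_{i}.fits" for i in range(55)]})
train_part, test_part = split_frame(demo, train_ratio=0.8, seed=0)
print("Train size:", len(train_part), "Test size:", len(test_part))
```

The two parts are disjoint and together cover the whole dataset.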


So far we have used a single set of files. Two or more index instances can be joined on a common column (ids). For example, let's create a dataset of (image, binary mask) pairs. First we index the images:

In [17]:
images = FilesIndex(img='../../catalogue_data/aia*.fits')
images.head()

FilesIndex,img
aia193_synmap_cr2098,../../catalogue_data\aia193_synmap_cr2098.fits
aia193_synmap_cr2099,../../catalogue_data\aia193_synmap_cr2099.fits
aia193_synmap_cr2100,../../catalogue_data\aia193_synmap_cr2100.fits
aia193_synmap_cr2101,../../catalogue_data\aia193_synmap_cr2101.fits
aia193_synmap_cr2102,../../catalogue_data\aia193_synmap_cr2102.fits


Then we index files with binary masks:

In [18]:
masks = FilesIndex(mask='../../catalogue_data/chs*.fits')
masks.head()

FilesIndex,mask
chs_synmap_cr2098,../../catalogue_data\chs_synmap_cr2098.fits
chs_synmap_cr2099,../../catalogue_data\chs_synmap_cr2099.fits
chs_synmap_cr2100,../../catalogue_data\chs_synmap_cr2100.fits
chs_synmap_cr2101,../../catalogue_data\chs_synmap_cr2101.fits
chs_synmap_cr2102,../../catalogue_data\chs_synmap_cr2102.fits


Now we need a common column to join on. This column can be a Carrington rotation number:

In [19]:
images['CR'] = images.index.map(lambda x: x[-4:])
images = images.reset_index(drop=True).set_index('CR')
images.head()

CR,img
2098,../../catalogue_data\aia193_synmap_cr2098.fits
2099,../../catalogue_data\aia193_synmap_cr2099.fits
2100,../../catalogue_data\aia193_synmap_cr2100.fits
2101,../../catalogue_data\aia193_synmap_cr2101.fits
2102,../../catalogue_data\aia193_synmap_cr2102.fits


Prepare the second index to join:

In [20]:
masks['CR'] = masks.index.map(lambda x: x[-4:])
masks = masks.reset_index(drop=True).set_index('CR')
masks.head()

CR,mask
2098,../../catalogue_data\chs_synmap_cr2098.fits
2099,../../catalogue_data\chs_synmap_cr2099.fits
2100,../../catalogue_data\chs_synmap_cr2100.fits
2101,../../catalogue_data\chs_synmap_cr2101.fits
2102,../../catalogue_data\chs_synmap_cr2102.fits


Now we can join datasets of images and masks using Carrington rotation number as a primary key:

In [21]:
index = images.index_merge(masks)
index.head()

CR,img,mask
2098,../../catalogue_data\aia193_synmap_cr2098.fits,../../catalogue_data\chs_synmap_cr2098.fits
2099,../../catalogue_data\aia193_synmap_cr2099.fits,../../catalogue_data\chs_synmap_cr2099.fits
2100,../../catalogue_data\aia193_synmap_cr2100.fits,../../catalogue_data\chs_synmap_cr2100.fits
2101,../../catalogue_data\aia193_synmap_cr2101.fits,../../catalogue_data\chs_synmap_cr2101.fits
2102,../../catalogue_data\aia193_synmap_cr2102.fits,../../catalogue_data\chs_synmap_cr2102.fits


In the same way one can add more datasets into a joint index.
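
The join itself can be illustrated with plain pandas. The sketch below uses `DataFrame.join` with an inner join on the shared `CR` index. Whether `index_merge` uses the same join type is an assumption, and the file names are placeholders.

```python
import pandas as pd

# Two small frames keyed by Carrington rotation number (placeholder file names).
imgs = pd.DataFrame(
    {"img": ["aia_cr2098.fits", "aia_cr2099.fits", "aia_cr2100.fits"]},
    index=pd.Index(["2098", "2099", "2100"], name="CR"),
)
msks = pd.DataFrame(
    {"mask": ["chs_cr2099.fits", "chs_cr2100.fits", "chs_cr2101.fits"]},
    index=pd.Index(["2099", "2100", "2101"], name="CR"),
)

# An inner join keeps only the rotations present in both frames.
pairs = imgs.join(msks, how="inner")
print(pairs)
```

Rotations 2098 and 2101 appear in only one of the frames, so they are absent from the joined result.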

## BatchSampler

We use `BatchSampler` to iterate over the dataset in chunks. There are several ways to organize the iterations. For demonstration, let's create a small dataset (of size 10):

In [22]:
index = FilesIndex(img='../../aia193_images/*.jpg').iloc[:10].reset_index(drop=True)
index.head()

,img
0,../../aia193_images\20200202_234500.jpg
1,../../aia193_images\20200203_234500.jpg
2,../../aia193_images\20200204_234500.jpg
3,../../aia193_images\20200205_234500.jpg
4,../../aia193_images\20200206_234500.jpg


The simplest case is to iterate over the dataset in chunks (batches) of a fixed size, e.g. 2. At each iteration we obtain an index for a subset of size 2. Here we show the indices in each chunk:

In [23]:
from helio import BatchSampler

sampler = BatchSampler(index, batch_size=2)

for ids in sampler:
    print(ids.indices)

[0 1]
[2 3]
[4 5]
[6 7]
[8 9]


The sampled index is an instance of `FilesIndex` as well:

In [24]:
ids

,img
8,../../aia193_images\20200210_234500.jpg
9,../../aia193_images\20200211_234500.jpg


If `batch_size` does not divide the dataset size evenly, we can either drop the incomplete last chunk or return it as well:

In [25]:
print('Only complete batches are sampled:')
sampler = BatchSampler(index, batch_size=4, drop_incomplete=True)
for ids in sampler:
    print(ids.indices)
print('Keep incomplete batches as well:')
sampler = BatchSampler(index, batch_size=4, drop_incomplete=False)
for ids in sampler:
    print(ids.indices)

Only complete batches are sampled:
[0 1 2 3]
[4 5 6 7]
Keep incomplete batches as well:
[0 1 2 3]
[4 5 6 7]
[8 9]


There is an option to shuffle the data before iterating:

In [26]:
sampler = BatchSampler(index, batch_size=4, shuffle=True, drop_incomplete=False)
for ids in sampler:
    print(ids.indices)

[3 7 4 2]
[9 8 1 5]
[9 0]


Finally, we can iterate over the dataset several times (epochs) and reshuffle the data at each new epoch:

In [27]:
sampler = BatchSampler(index, n_epochs=3, batch_size=4, shuffle=True, drop_incomplete=True)
for ids in sampler:
    print("Epoch", sampler._on_epoch, "Indices:", ids.indices)

Epoch 0 Indices: [0 2 3 1]
Epoch 0 Indices: [5 8 9 7]
Epoch 1 Indices: [0 6 2 8]
Epoch 1 Indices: [4 5 9 3]
Epoch 2 Indices: [3 7 1 4]
Epoch 2 Indices: [5 8 9 0]
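
The options described in this section (`batch_size`, `shuffle`, `drop_incomplete`, `n_epochs`) can be sketched as a small generator. This is an illustration only: the function `iter_batches` and its `seed` parameter are assumptions, and it yields plain lists of positions rather than `FilesIndex` subsets.

```python
import random


def iter_batches(n_items, batch_size, n_epochs=1, shuffle=False,
                 drop_incomplete=False, seed=None):
    """Hypothetical sketch: yield (epoch, positions) pairs in batches."""
    rng = random.Random(seed)
    for epoch in range(n_epochs):
        order = list(range(n_items))
        if shuffle:
            # Reshuffle at the start of every epoch.
            rng.shuffle(order)
        for start in range(0, n_items, batch_size):
            batch = order[start:start + batch_size]
            if drop_incomplete and len(batch) < batch_size:
                # Only the last batch of an epoch can be incomplete.
                break
            yield epoch, batch


for epoch, batch in iter_batches(10, batch_size=4, n_epochs=2, shuffle=True, seed=0):
    print("Epoch", epoch, "Indices:", batch)
```

With 10 items and `batch_size=4`, each epoch produces three batches (the last one of size 2) unless `drop_incomplete=True`, in which case it produces two.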


## Batch 

`HelioBatch` provides a unified framework for storing and processing data. First we use an index to specify the subset of data to be processed:

In [28]:
index = FilesIndex(img='../../aia193_images/*.jpg') 
index.head()

FilesIndex,img
20200202_234500,../../aia193_images\20200202_234500.jpg
20200203_234500,../../aia193_images\20200203_234500.jpg
20200204_234500,../../aia193_images\20200204_234500.jpg
20200205_234500,../../aia193_images\20200205_234500.jpg
20200206_234500,../../aia193_images\20200206_234500.jpg


Then we define a batch:

In [29]:
from helio import HelioBatch

batch = HelioBatch(index)

At this point the batch is empty. Use `load` to read the data specified by the index:

In [30]:
batch.load(src='img')

<helio.core.batch.HelioBatch at 0x192fe5f44a8>

Now dataset items can be accessed via the `img` attribute:

In [31]:
batch.img[2].shape

(1024, 1024, 3)

More precisely, a batch has two main attributes: `data` and `meta`. `data` contains the main content, while `meta` keeps additional information. So the same data can also be accessed in a second way:

In [32]:
batch.data['img'][2].shape

(1024, 1024, 3)

However, meta information for `.jpg` data is empty:

In [33]:
batch.meta['img'][2]

{}

In contrast, when loading data from e.g. `.fits` files, we can read the FITS header into the batch meta:

In [34]:
index = FilesIndex(img='../../aia193_fits/*fits').iloc[:5]
batch = HelioBatch(index).load(src='img', unit=1, meta='img')


and access some headers:

In [35]:
batch.meta['img'][2]['T_OBS'], batch.meta['img'][2]['R_SUN']

('2018-12-22T23:44:29.84Z', 1624.316162)

Some batch methods add meta information. For example, estimating the solar disk radius:

In [36]:
import numpy as np

index = FilesIndex(img='../../aia193_images/*jpg').iloc[:5]
batch = HelioBatch(index).load(src='img', as_gray=True)
batch.get_radius(src='img', hough_radii=np.arange(390, 420))
batch.meta['img']

array([{'i_cen': 512, 'j_cen': 512, 'r': 408},
       {'i_cen': 511, 'j_cen': 512, 'r': 408},
       {'i_cen': 512, 'j_cen': 511, 'r': 408},
       {'i_cen': 512, 'j_cen': 512, 'r': 407},
       {'i_cen': 512, 'j_cen': 512, 'r': 407}], dtype=object)

Now consider working with batch methods. Most methods have `src` and `dst` parameters, which specify where to get the data and where to write the results. If the `dst` attribute does not exist, it will be created. If `dst` is not given, it is assumed that `dst`=`src`. One can also explicitly set `src` and `dst` to the same value for an inplace operation. For example, consider image resizing. The results will be written to a new attribute:

In [37]:
batch.resize(src='img', dst='img_small', output_shape=(256, 256), preserve_range=True)

<helio.core.batch.HelioBatch at 0x192fe6c74e0>

New attribute has been created:

In [38]:
batch.attributes

['img', 'img_small']

It contains the resized data, while the original data is kept:

In [39]:
batch.img[0].shape, batch.img_small[0].shape

((1024, 1024), (256, 256))

For inplace operations we do not specify `dst`:

In [40]:
batch.resize(src='img', output_shape=(64, 64), preserve_range=True)
batch.img[0].shape

(64, 64)

Each batch method returns the batch instance. This means that methods can be chained:

In [41]:
(HelioBatch(index).load(src='img', as_gray=True)
 .get_radius(src='img', hough_radii=np.arange(390, 420))
 .resize(src='img', output_shape=(256, 256), preserve_range=True))

<helio.core.batch.HelioBatch at 0x192fe6c7400>
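
The `src`/`dst` convention and method chaining can be illustrated with a toy container. The class `TinyBatch` and its `apply` method are hypothetical and only mimic the behaviour described above. They are not part of `helio`.

```python
class TinyBatch:
    """Toy container: named attributes hold lists of items."""

    def __init__(self, **data):
        self.data = dict(data)

    def apply(self, func, src, dst=None):
        # If dst is not given, write the result back to src (inplace).
        dst = src if dst is None else dst
        self.data[dst] = [func(item) for item in self.data[src]]
        # Returning self is what makes method chaining possible.
        return self


toy = TinyBatch(img=[1, 2, 3])
(toy.apply(lambda x: x * 10, src="img", dst="img_scaled")  # creates a new attribute
    .apply(lambda x: x + 1, src="img"))                    # overwrites img inplace
print(toy.data)
```

The first call creates `img_scaled` from the original values. The second call has no `dst`, so it overwrites `img` and leaves `img_scaled` untouched.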

In combination with `BatchSampler`, one can easily organize iterative dataset processing:

In [42]:
from tqdm import tqdm

index = FilesIndex(img='../../aia193_images/*jpg').iloc[:10]
sampler = BatchSampler(index, n_epochs=1, batch_size=4, shuffle=False, drop_incomplete=False)

for ids in tqdm(sampler):
    (HelioBatch(ids).load(src='img', as_gray=True)
     .get_radius(src='img', hough_radii=np.arange(390, 420))
     .resize(src='img', output_shape=(256, 256), preserve_range=True)
     .dump(src='img', path='./tmp/', format='npz'))

100%|████████████████████████████████████████████████████████████████| 3/3 [01:26<00:00, 28.92s/it]


To learn more about `HelioBatch` methods see the [documentation](http://observethesun.github.io/helio/).

Done!