# Working with index

## Content
* [Index types](#Index-types)
* [How to create an index](#How-to-create-an-index)
    * [from SEGY files](#from-SEGY-files)
    * [from SPS files](#from-SPS-files)
* [Conversion between types](#Conversion-between-types)
* [Merge](#Merge)

## Index types

There are 5 basic types of indices:
* ```TraceIndex``` enumerates individual traces
* ```FieldIndex``` enumerates field records
* ```SegyFilesIndex``` enumerates SEGY files
* ```BinsIndex``` enumerates bins of a regular grid
* ```KNNIndex``` enumerates groups of k nearest traces

In addition, ```CustomIndex``` enables enumeration by any ```segyio.TraceField``` attribute, e.g. ```INLINE_3D``` or ```ShotPoint```.

Conversion between index types is straightforward. Let ```index``` be an instance of some index type. Then ```FieldIndex(index)``` is an instance of type ```FieldIndex```, ```TraceIndex(index)``` is an instance of type ```TraceIndex```, and so on.
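Conceptually, conversion can be thought of as re-keying the same table of trace headers. The sketch below illustrates this with a plain ```pandas.DataFrame```. It is only an analogy under the assumption that an index behaves like a header table, not the library's implementation, and the variable names are hypothetical:

```python
import pandas as pd

# Toy header table; the values are copied from the TraceIndex output shown later on.
headers = pd.DataFrame({
    "FieldRecord": [111906, 111906, 111906],
    "TraceNumber": [1656, 1657, 1655],
    "ShotPoint": [42000, 42000, 42000],
})

# "TraceIndex"-like view: one key per trace (a plain positional index).
by_trace = headers.reset_index(drop=True)

# "FieldIndex"-like view: the same rows, keyed by FieldRecord.
by_field = by_trace.set_index("FieldRecord")

# Converting back restores the per-trace key without losing any rows.
back_to_trace = by_field.reset_index()

print(by_field.index.unique().tolist())
print(len(back_to_trace) == len(by_trace))
```

Only the key changes between the views, while the set of traces stays the same.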

An index can be created from a single SEGY file, from multiple SEGY files or from SPS files. An index can also be merged with another index. Below we illustrate these options.

## How to create an index

### from SEGY files

We start with a single SEGY file and create a ```TraceIndex```. It requires the path to the file and a name that will be associated with the traces:

In [1]:
import sys
import pandas as pd
import numpy as np

sys.path.append('..')
from geolog.src import (FieldIndex, TraceIndex, SegyFilesIndex, BinsIndex,
                        CustomIndex, KNNIndex, SeismicBatch)
from geolog.batchflow import Dataset

path_raw = '/notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy'
index_trace = TraceIndex(name='raw', path=path_raw)

```head()``` shows the first 5 traces (similar to pandas):

In [2]:
index_trace.head()

| | TraceNumber | FieldRecord | TRACE_SEQUENCE_FILE (raw) | file_id (raw) |
|---|---|---|---|---|
| 0 | 1656 | 111906 | 1 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 1 | 1657 | 111906 | 2 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 2 | 1655 | 111906 | 3 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 3 | 1658 | 111906 | 4 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 4 | 1654 | 111906 | 5 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |


Note that the index contains the columns ```TraceNumber``` and ```FieldRecord```, which uniquely define seismic traces. ```TRACE_SEQUENCE_FILE``` gives the trace location within the file. To include more columns, use the ```extra_headers``` argument (or set ```extra_headers='all'``` to include all available headers):

In [3]:
index_trace = TraceIndex(name='raw', path=path_raw, extra_headers=['ShotPoint', 'offset'])
index_trace.head()

| | offset | TraceNumber | FieldRecord | ShotPoint | TRACE_SEQUENCE_FILE (raw) | file_id (raw) |
|---|---|---|---|---|---|---|
| 0 | 35 | 1656 | 111906 | 42000 | 1 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 1 | 36 | 1657 | 111906 | 42000 | 2 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 2 | 78 | 1655 | 111906 | 42000 | 3 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 3 | 79 | 1658 | 111906 | 42000 | 4 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 4 | 127 | 1654 | 111906 | 42000 | 5 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |


In the next example we will create an index of field records from a set of SEGY files:

In [4]:
index_ffid = FieldIndex(name='raw', path='/notebooks/egor/2D_Valyton/prof_37/segy/*.sgy')         
index_ffid.head()

| FieldRecord | TraceNumber | TRACE_SEQUENCE_FILE (raw) | file_id (raw) |
|---|---|---|---|
| 354 | 1 | 1 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |
| 354 | 2 | 2 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |
| 354 | 1 | 3 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |
| 354 | 2 | 4 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |
| 354 | 3 | 5 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |


Note that the SEGY files contain auxiliary traces, so the index contains duplicated (FieldRecord, TraceNumber) pairs. This can be checked with the ```duplicated``` method:

In [5]:
np.any(index_ffid.duplicated())

True

The ```drop_duplicates``` method helps to remove auxiliary traces:

In [6]:
index_ffid = index_ffid.drop_duplicates(keep='last')
index_ffid.head()

| FieldRecord | TraceNumber | TRACE_SEQUENCE_FILE (raw) | file_id (raw) |
|---|---|---|---|
| 354 | 1 | 3 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |
| 354 | 2 | 4 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |
| 354 | 3 | 5 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |
| 354 | 4 | 6 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |
| 354 | 5 | 7 | /notebooks/egor/2D_Valyton/prof_37/segy/000003... |
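
The effect of ```keep='last'``` can be reproduced on a plain ```pandas.DataFrame``` holding the first five rows of the ```FieldIndex``` output above. This is a minimal sketch; treating (FieldRecord, TraceNumber) as the key via ```subset``` is an assumption based on the text, not a confirmed detail of the library:

```python
import pandas as pd

# FieldRecord, TraceNumber and TRACE_SEQUENCE_FILE copied from the first FieldIndex output above.
df = pd.DataFrame({
    "FieldRecord": [354] * 5,
    "TraceNumber": [1, 2, 1, 2, 3],
    "TRACE_SEQUENCE_FILE": [1, 2, 3, 4, 5],
})

# Assumed key: a trace is identified by its (FieldRecord, TraceNumber) pair.
key = ["FieldRecord", "TraceNumber"]

# Flag rows whose key pair already occurred earlier in the table.
has_duplicates = df.duplicated(subset=key).any()

# keep='last' retains the final occurrence of every key pair.
deduplicated = df.drop_duplicates(subset=key, keep="last")

print(has_duplicates)
print(deduplicated["TRACE_SEQUENCE_FILE"].tolist())
```

The rows that survive are those at file positions 3, 4 and 5, which agrees with the first three rows of the de-duplicated index.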


To iterate over files instead, we create a ```SegyFilesIndex``` in a similar way:

In [7]:
index_files = SegyFilesIndex(name='raw', path='/notebooks/egor/2D_Valyton/prof_37/segy/*.sgy')          
index_files.head()

| file_id (raw) | TraceNumber | FieldRecord | TRACE_SEQUENCE_FILE (raw) |
|---|---|---|---|
| /notebooks/egor/2D_Valyton/prof_37/segy/00000354_f1.r354.sgy | 1 | 354 | 1 |
| /notebooks/egor/2D_Valyton/prof_37/segy/00000354_f1.r354.sgy | 2 | 354 | 2 |
| /notebooks/egor/2D_Valyton/prof_37/segy/00000354_f1.r354.sgy | 1 | 354 | 3 |
| /notebooks/egor/2D_Valyton/prof_37/segy/00000354_f1.r354.sgy | 2 | 354 | 4 |
| /notebooks/egor/2D_Valyton/prof_37/segy/00000354_f1.r354.sgy | 3 | 354 | 5 |


To make things more flexible, there is a ```CustomIndex``` that allows iteration by any ```segyio.TraceField``` attribute, e.g. ```INLINE_3D```, ```ShotPoint```, ```CDP``` etc. For example, let's create an index of shot points:

In [8]:
index_shot = CustomIndex(name='raw', index_name='ShotPoint', path=path_raw)
index_shot.head()

| ShotPoint | TraceNumber | FieldRecord | TRACE_SEQUENCE_FILE (raw) | file_id (raw) |
|---|---|---|---|---|
| 42000 | 1656 | 111906 | 1 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 42000 | 1657 | 111906 | 2 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 42000 | 1655 | 111906 | 3 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 42000 | 1658 | 111906 | 4 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 42000 | 1654 | 111906 | 5 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |


Finally, there is a ```KNNIndex``` that enumerates groups of the k nearest traces based on their ```CDP_X``` and ```CDP_Y``` attributes:

In [9]:
index_knn = KNNIndex(name='raw', n_neighbors=3, path=path_raw)
index_knn.head(9)

| KNN | FieldRecord | TraceNumber | CDP_Y | CDP_X | TRACE_SEQUENCE_FILE (raw) | file_id (raw) |
|---|---|---|---|---|---|---|
| 0 | 111906 | 1656 | 6639805 | 499279 | 1 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 0 | 111906 | 1657 | 6639805 | 499304 | 2 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 0 | 111906 | 1655 | 6639805 | 499254 | 3 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 1 | 111906 | 1657 | 6639805 | 499304 | 2 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 1 | 111906 | 1656 | 6639805 | 499279 | 1 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 1 | 111906 | 1658 | 6639805 | 499329 | 4 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 2 | 111906 | 1655 | 6639805 | 499254 | 3 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 2 | 111906 | 1656 | 6639805 | 499279 | 1 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
| 2 | 111906 | 1654 | 6639805 | 499229 | 5 | /notebooks/egor/noise_data/DN02A_LIFT_AMPSCAL.sgy |
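
The grouping can be illustrated with a brute-force nearest-neighbour search in NumPy over the five traces listed above. This is a sketch of the idea only. How ```KNNIndex``` actually searches for neighbours and breaks distance ties is not stated here, so the stable sort below is an assumption:

```python
import numpy as np

# CDP_X, CDP_Y and TraceNumber of the five distinct traces in the output above.
cdp_xy = np.array([
    [499279, 6639805],
    [499304, 6639805],
    [499254, 6639805],
    [499329, 6639805],
    [499229, 6639805],
], dtype=np.int64)
trace_number = np.array([1656, 1657, 1655, 1658, 1654])

n_neighbors = 3

# Pairwise squared distances via broadcasting, shape (n_traces, n_traces).
diff = cdp_xy[:, None, :] - cdp_xy[None, :, :]
dist2 = (diff ** 2).sum(axis=-1)

# A stable sort breaks distance ties by trace order (assumed tie-breaking rule);
# each trace comes first in its own group because its distance to itself is zero.
neighbors = np.argsort(dist2, axis=1, kind="stable")[:, :n_neighbors]

# Each row holds the trace numbers of one group of n_neighbors traces.
groups = trace_number[neighbors]
print(groups[:3])
```

On this subset the first three groups contain the same trace numbers, in the same order, as groups 0, 1 and 2 in the output above.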


### from SPS files

```TraceIndex``` and ```FieldIndex``` can alternatively be constructed from SPS files. As a by-product, the index will include offsets, azimuths and a number of other metadata fields:

In [10]:
dfx = pd.read_csv('/notebooks/egor/2D_Valyton/sps/ALL_VALUNT0910_X37.csv')
dfr = pd.read_csv('/notebooks/egor/2D_Valyton/sps/ALL_VALUNT0910_R_utm.csv')
dfs = pd.read_csv('/notebooks/egor/2D_Valyton/sps/ALL_VALUNT0910_S_utm.csv')

index_sps = FieldIndex(dfx=dfx, dfr=dfr, dfs=dfs)
index_sps.head()

| FieldRecord | sline | sid | rline | rid | TraceNumber | point_index | sht_depth | uphole | SourceX | SourceY | z_s | x_r | y_r | z_r | CDP_X | CDP_Y | azimuth | offset |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 354 | S37 | 3591 | R37 | 219 | 1 | 1 | 18.0 | 14.0 | 338886.78 | 7033413.0 | 71.5 | 338560.0 | 7029929.5 | 76.0 | 338723.39 | 7031671.25 | -1.664331 | 1749.396855 |
| 354 | S37 | 3591 | R37 | 220 | 2 | 1 | 18.0 | 14.0 | 338886.78 | 7033413.0 | 71.5 | 338562.4 | 7029954.5 | 76.1 | 338724.59 | 7031683.75 | -1.664315 | 1736.839416 |
| 354 | S37 | 3591 | R37 | 221 | 3 | 1 | 18.0 | 14.0 | 338886.78 | 7033413.0 | 71.5 | 338564.7 | 7029979.5 | 76.2 | 338725.74 | 7031696.25 | -1.664328 | 1724.286648 |
| 354 | S37 | 3591 | R37 | 222 | 4 | 1 | 18.0 | 14.0 | 338886.78 | 7033413.0 | 71.5 | 338567.0 | 7030004.5 | 76.8 | 338726.89 | 7031708.75 | -1.664341 | 1711.73388 |
| 354 | S37 | 3591 | R37 | 223 | 5 | 1 | 18.0 | 14.0 | 338886.78 | 7033413.0 | 71.5 | 338569.38 | 7030029.0 | 77.4 | 338728.08 | 7031721.0 | -1.664317 | 1699.426283 |
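
The derived columns in the first row above are numerically consistent with the relations sketched below. These relations are inferred from the printed values, not taken from the library source, so treat them as an assumption. In particular, the ```offset``` value matches the distance from the source to the midpoint, i.e. half the source-receiver distance:

```python
import math

# Source and receiver coordinates of the first row in the table above.
source_x, source_y = 338886.78, 7033413.0
receiver_x, receiver_y = 338560.0, 7029929.5

# Midpoint between source and receiver (matches CDP_X, CDP_Y in the table).
cdp_x = (source_x + receiver_x) / 2
cdp_y = (source_y + receiver_y) / 2

# Direction from source to receiver, in radians.
azimuth = math.atan2(receiver_y - source_y, receiver_x - source_x)

# Distance from the source to the midpoint, i.e. half the source-receiver distance.
offset = math.hypot(receiver_x - source_x, receiver_y - source_y) / 2

print(round(cdp_x, 2), round(cdp_y, 2), round(azimuth, 4), round(offset, 2))
```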


To create a ```BinsIndex```, one should specify ```bin_size```. If the grid position is not provided, it will be optimized during index construction:

In [11]:
dfx = pd.read_csv('/notebooks/egor/Xfield/Xfield_X.csv')
dfr = pd.read_csv('/notebooks/egor/Xfield/Xfield_R.csv')
dfs = pd.read_csv('/notebooks/egor/Xfield/Xfield_S.csv')

bin_size = 1000

index_bin = BinsIndex(dfr=dfr, dfs=dfs, dfx=dfx, bin_size=(bin_size, bin_size), iters=10)
index_bin.head()

NameError: name 'ppx' is not defined

The heatmap shows the distribution of traces within bins:

In [None]:
index_bin.show_heatmap()
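
Independently of the library, assigning traces to bins of a regular grid and counting traces per bin can be sketched in NumPy. The points are made-up values rather than survey data, and the grid origin is simply taken as the minimum coordinate, whereas ```BinsIndex``` optimizes the grid position when it is not provided:

```python
import numpy as np

# Hypothetical trace positions (x, y); not taken from the survey.
points = np.array([
    [100.0, 200.0],
    [900.0, 950.0],
    [1500.0, 300.0],
    [2500.0, 2100.0],
    [2999.0, 2001.0],
])

bin_size = np.array([1000.0, 1000.0])

# Assumed grid origin: the lower-left corner of the point cloud.
origin = points.min(axis=0)

# Integer bin coordinates of every point along x and y.
bin_ids = np.floor((points - origin) / bin_size).astype(int)

# Number of traces per occupied bin, the quantity a heatmap would display.
unique_bins, counts = np.unique(bin_ids, axis=0, return_counts=True)
for b, c in zip(unique_bins.tolist(), counts.tolist()):
    print(b, c)
```

Shifting ```origin``` changes how the points are distributed among bins, which is why the grid position is worth optimizing.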

## Conversion between types

An index can easily be converted to another index type. For example, ```FieldIndex``` to ```TraceIndex```:

In [None]:
TraceIndex(index_ffid).head()

or vice versa:

In [None]:
FieldIndex(index_trace).head()

or ```BinsIndex``` to ```FieldIndex```:

In [None]:
FieldIndex(index_bin).head()

or custom shot index to ```FieldIndex```:

In [None]:
FieldIndex(index_shot).head()

or ```KNNIndex``` to ```FieldIndex```:

In [None]:
FieldIndex(index_knn).head(9)

Note that the resulting index contains traces duplicated 3 times. To remove the duplicates, use ```drop_duplicates```:

In [None]:
FieldIndex(index_knn).drop_duplicates().head(9)

## Merge

Two index instances can be merged on common headers. For example, ```index_ffid``` does not contain offsets. However, we can merge it with ```index_sps```, which includes offsets:

In [None]:
index_ffid = index_ffid.merge(index_sps)
index_ffid.head()
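
Merging on common headers works like a pandas join on shared columns. The sketch below uses two small ```pandas.DataFrame``` objects with values copied from the outputs above. Whether the library's ```merge``` delegates to pandas is an assumption:

```python
import pandas as pd

# SEGY-derived headers without offsets (values from the de-duplicated FieldIndex output).
segy = pd.DataFrame({
    "FieldRecord": [354, 354, 354],
    "TraceNumber": [1, 2, 3],
    "TRACE_SEQUENCE_FILE": [3, 4, 5],
})

# SPS-derived headers with offsets (values from the SPS table).
sps = pd.DataFrame({
    "FieldRecord": [354, 354, 354],
    "TraceNumber": [1, 2, 3],
    "offset": [1749.396855, 1736.839416, 1724.286648],
})

# Without an explicit `on`, pandas joins on the columns present in both frames,
# here FieldRecord and TraceNumber.
merged = segy.merge(sps)

print(merged.columns.tolist())
print(len(merged))
```

Each trace keeps its position in the SEGY file and gains the offset from the SPS data.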

Enjoy!