# Preprocessing pipeline

The objective of this notebook is to aggregate seismic processing actions into a neat pipeline.

* [Dataset](#Dataset)
* [Pipeline](#Pipeline)
* [Conclusion](#Conclusion)
* [Suggestions for improvement](#Suggestions-for-improvement)

In [1]:
import os
import sys
sys.path.append('..')
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from time import time

In [2]:
from seismicpro import FieldIndex, TraceIndex, SeismicBatch, SeismicDataset
from seismicpro.batchflow import Dataset, Pipeline, V, B, D, L, I

from seismicpro.src import (seismic_plot,
                            gain_plot, calculate_sdc_quality,
                            measure_gain_amplitude, merge_segy_files
                           )

## Dataset

For this task we use Dataset 1 to test the processing. It contains several records with different amplitude ranges:

In [3]:
path_raw = '/data/preproc/1_input_4_PREP_full_pipeln.sgy'
field_index = FieldIndex(name='raw', path=path_raw, extra_headers='all')
field_set = SeismicDataset(field_index, SeismicBatch)

field_index.head()

| FieldRecord | TimeBaseCode | TotalStaticApplied | CDP_TRACE | MuteTimeEND | ShotPointScalar | TaperType | SourceWaterDepth | TraceWeightingFactor | NSummedTraces | CoordinateUnits | ... | INLINE_3D | SourceUpholeTime | SweepLength | LowCutFrequency | GroupWaterDepth | GainType | SourceMeasurementUnit | TransductionConstantPower | TRACE_SEQUENCE_FILE (raw) | file_id (raw) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8834 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 504 | 0 | 0 | 30 | 0 | 1 | 0 | 0 | 1 | /data/preproc/1_input_4_PREP_full_pipeln.sgy |
| 8834 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 504 | 0 | 0 | 30 | 0 | 1 | 0 | 0 | 2 | /data/preproc/1_input_4_PREP_full_pipeln.sgy |
| 8834 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 504 | 0 | 0 | 30 | 0 | 1 | 0 | 0 | 3 | /data/preproc/1_input_4_PREP_full_pipeln.sgy |
| 8834 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 504 | 0 | 0 | 30 | 0 | 1 | 0 | 0 | 4 | /data/preproc/1_input_4_PREP_full_pipeln.sgy |
| 8834 | 2 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 504 | 0 | 0 | 30 | 0 | 1 | 0 | 0 | 5 | /data/preproc/1_input_4_PREP_full_pipeln.sgy |


## Pipeline

For spherical divergence correction we'll use predefined speed values:

In [4]:
speed = np.array([1524]*700 + [1924.5]*300 + [2184.0]*400 +  [2339.6]*400 + 
                 [2676]*150 + [2889.5]*2250 + [3566]*2800 + [4785.3]*1000)
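The array above is a piecewise-constant profile with one value per time sample. As a minimal sketch, the same profile can be built from (value, sample count) pairs with `np.repeat`, which also makes the total length easy to check. The names `layers` and `speed_profile` are introduced here for illustration only.

```python
import numpy as np

# (speed value, number of samples) pairs copied from the cell above
layers = [(1524, 700), (1924.5, 300), (2184.0, 400), (2339.6, 400),
          (2676, 150), (2889.5, 2250), (3566, 2800), (4785.3, 1000)]

values, counts = zip(*layers)
speed_profile = np.repeat(values, counts)  # one speed value per sample

print(speed_profile.shape)  # (8000,)
```

The profile therefore covers 8000 samples, which presumably matches the trace length of the input file.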

In this pipeline we perform spherical divergence correction (SDC), remove traces that contain more than 50 consecutive zero values, and save the results. The SDC parameters are calculated by an optimization procedure only once for the whole dataset, in the `before` part of the pipeline.
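To make the idea of SDC concrete, here is a minimal NumPy sketch. It assumes a common form of the geometric spreading gain, `speed**v_pow * time**t_pow`, with the two exponents playing the role of the optimized parameters (the `initial_point=(2, 1)` used below is consistent with that reading, but this is an assumption, not a statement about the library's internals). The function `sdc_gain` is hypothetical.

```python
import numpy as np

def sdc_gain(time, speed, v_pow=2.0, t_pow=1.0):
    """Assumed gain form; dividing by speed[0] is only a normalization choice."""
    return speed ** v_pow * time ** t_pow / speed[0]

# Synthetic check: amplitudes decaying as 1 / (v^2 * t) become flat after correction.
n_samples = 1000
time = np.arange(1, n_samples + 1) * 0.002           # seconds, starting above zero
speed = np.repeat([1500.0, 2500.0], n_samples // 2)  # two-layer toy profile
traces = np.ones((3, n_samples)) / (speed ** 2 * time)

corrected = traces * sdc_gain(time, speed)
print(np.allclose(corrected, 1 / speed[0]))  # True
```

On real data the decay does not follow this law exactly, which is why the exponents are fitted against a quality measure rather than fixed.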

In [5]:
tmp_dump_path = '/data/preproc/tmp/'
output_path = '/data/preproc/processed/merged.sgy'

first_preproc_ppl = (
    field_set.pipeline()
    .init_variable('sdc_params')
    .load(fmt='sgy', components='raw')
    .correct_spherical_divergence(src='raw', dst='raw', speed=speed, params=V('sdc_params'))
    .drop_zero_traces(src='raw', dst='raw', num_zero=50)
    .dump(path=L(lambda x: os.path.join(tmp_dump_path, str(x) + '.sgy'))(I()),
          src='raw', fmt='segy', split=False)
    .run_later(batch_size=1, n_epochs=1, shuffle=False, drop_last=False, bar=True)
    )

(first_preproc_ppl.before
                  .find_sdc_params(components='raw', speed=speed, loss=calculate_sdc_quality, initial_point=(2, 1), save_to=V('sdc_params')))
(first_preproc_ppl.after
                  .merge_segy_files(output_path=output_path, extra_headers='all', path=os.path.join(tmp_dump_path, '*.sgy')))

<seismicpro.batchflow.batchflow.once_pipeline.OncePipeline at 0x7f4b837cd9b0>
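The zero-trace filter in the pipeline above can be illustrated with plain NumPy. This is a sketch of the rule as described in the text (drop a trace if it contains a run of more than `num_zero` consecutive zeros), not the library's implementation; `longest_zero_run` and `keep_live_traces` are hypothetical helpers.

```python
import numpy as np

def longest_zero_run(trace):
    """Length of the longest run of consecutive exact zeros in a 1D trace."""
    # Pad with non-zero markers so every run has a start and an end edge.
    is_zero = np.concatenate(([0], (trace == 0).astype(int), [0]))
    edges = np.diff(is_zero)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return int((ends - starts).max()) if starts.size else 0

def keep_live_traces(traces, num_zero=50):
    """Keep traces whose longest zero run does not exceed num_zero."""
    keep = np.array([longest_zero_run(t) <= num_zero for t in traces])
    return traces[keep]

traces = np.ones((3, 200))
traces[1, 10:61] = 0   # 51 consecutive zeros: dropped
traces[2, 10:60] = 0   # 50 consecutive zeros: kept
print(keep_live_traces(traces).shape)  # (2, 200)
```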

Let's run it:

In [6]:
first_preproc_ppl.run()

100%|██████████| 30/30 [01:48<00:00,  1.85s/it]
100%|██████████| 30/30 [00:22<00:00,  2.38it/s]


<seismicpro.batchflow.batchflow.pipeline.Pipeline at 0x7f4b837cd8d0>

We'll use the results of the previous pipeline as input for the next one:

In [8]:
field_index = FieldIndex(name='raw', path=output_path, extra_headers='all')
field_set = SeismicDataset(field_index, SeismicBatch)

This pipeline calculates the equalization parameters in its `before` part and applies them to each batch in the main pipeline:
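As a rough illustration of the two-stage idea (estimate parameters once, then apply them per batch), the sketch below assumes that equalization means dividing each group of traces by a robust amplitude scale, here the 0.95 quantile of absolute amplitudes. The quantile choice and the helpers `find_scales` and `apply_scales` are assumptions made for this example; `group_ids` stands in for the header passed as `record_id`.

```python
import numpy as np

def find_scales(traces, group_ids, q=0.95):
    """Estimate one amplitude scale per group (the 'before' stage)."""
    return {g: np.quantile(np.abs(traces[group_ids == g]), q)
            for g in np.unique(group_ids)}

def apply_scales(traces, group_ids, scales):
    """Divide every trace by the scale of its group (the per-batch stage)."""
    factors = np.array([scales[g] for g in group_ids])
    return traces / factors[:, None]

rng = np.random.default_rng(0)
group_ids = np.array([2001] * 4 + [2005] * 4)
traces = rng.normal(size=(8, 500))
traces[group_ids == 2005] *= 100.0   # second group recorded at a much higher level

scales = find_scales(traces, group_ids)
equalized = apply_scales(traces, group_ids, scales)
```

After this step both groups have the same 0.95 quantile of absolute amplitude, so records with very different original levels become comparable.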

In [9]:
output_path = '/data/preproc/processed/final.sgy'

second_preproc_ppl = (
    field_set.pipeline()
    .load(fmt='sgy', components='raw')
    .equalize(src='raw', dst='raw', params=V('equal_params'))
    .dump(path=L(lambda x: os.path.join(tmp_dump_path, str(x) + '.sgy'))(I()),
          src='raw', fmt='segy', split=False)
    .run_later(batch_size=1, n_epochs=1, shuffle=False, drop_last=False, bar=True)
)

(second_preproc_ppl.before
                   .find_equalization_params(component='raw', record_id='YearDataRecorded', save_to=V('equal_params')))
(second_preproc_ppl.after
                   .merge_segy_files(output_path=output_path, extra_headers='all', path=os.path.join(tmp_dump_path, '*.sgy')))

<seismicpro.batchflow.batchflow.once_pipeline.OncePipeline at 0x7f4b837b0e10>

Run equalization:

In [10]:
second_preproc_ppl.run()

100%|██████████| 30/30 [00:45<00:00,  1.12s/it]
100%|██████████| 30/30 [00:21<00:00,  1.88it/s]


<seismicpro.batchflow.batchflow.pipeline.Pipeline at 0x7f4b837b0d30>

## Conclusion

The current version of the notebook contains a preprocessing pipeline with three actions: correction for spherical divergence, removal of traces with more than 50 consecutive zeros, and equalization of amplitudes.

## Suggestions for improvement

The next step in developing the pipeline is to add two more actions: correction of inverted traces and attenuation of 50/60 Hz noise.
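For the noise attenuation step, one possible starting point is a simple frequency-domain notch. The sketch below is an illustration under stated assumptions, not a proposed implementation: the sampling rate, the notch width, and the function `notch_fft` are all hypothetical, and a brick-wall notch like this can cause ringing on real data, where a properly designed notch filter would be preferable.

```python
import numpy as np

def notch_fft(trace, sample_rate, freq, half_width=1.0):
    """Zero all spectral bins within half_width Hz of freq and transform back."""
    spectrum = np.fft.rfft(trace)
    freqs = np.fft.rfftfreq(trace.size, d=1.0 / sample_rate)
    spectrum[np.abs(freqs - freq) <= half_width] = 0
    return np.fft.irfft(spectrum, n=trace.size)

sample_rate = 500.0                        # Hz, assumed for the example
t = np.arange(1000) / sample_rate          # 2 seconds of samples
signal = np.sin(2 * np.pi * 10 * t)        # useful 10 Hz component
hum = 0.5 * np.sin(2 * np.pi * 50 * t)     # 50 Hz power-line noise

filtered = notch_fft(signal + hum, sample_rate, freq=50.0)
print(np.allclose(filtered, signal, atol=1e-8))  # True
```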

Another direction for improvement is to unify all actions into a single pipeline.