# Noise Acquisition

In this notebook we shall explore how to acquire real interferometer output data and use that data as a TensorFlow dataset. In GravyFlow, datasets are built by composition: by combining different elements (e.g., noise, injections, conditioning) we can build custom datasets to suit our specifications.

First we shall start by performing the necessary imports.

In [1]:
# Built-in imports
import os
import sys
from typing import Iterator, List
from pathlib import Path
from itertools import islice

# Dependency imports:
import tensorflow as tf
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot

# GravyFlow import, again adding the grandparent directory to the path:

# Get the absolute path of the grandparent directory
parent_dir = os.path.abspath('../../')
print(parent_dir)

# Add that directory to sys.path
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

# To import gravyflow simply use:
import gravyflow as gf

2024-01-09 03:20:13.605378: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


/home/michael.norman/data_ad_infinitum/chapter_05_dragonn/dragonn


## Setup Environment:

As described in notebook 1, we should set up the environment with `gf.env` to ensure we work on an available GPU.

In [2]:
# Set up the environment and return a tf.distribute strategy object.
env = gf.env()

INFO:root:TensorFlow version: 2.12.1, CUDA version: 11.8
2024-01-09 03:20:38.008033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2000 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0
INFO:root:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Set GravyFlow Global Defaults

Often, when we are working in a single notebook or Python script, some variables will remain constant throughout our analysis. To accommodate these scenarios, GravyFlow allows us to set a number of defaults in a global defaults class. The values that can be set in this way are as follows:

- `seed` : int = 1000
	> This is the default random seed that GravyFlow will use to initialise the TensorFlow and NumPy random operations used for tasks such as dataset shuffling, random noise generation, and waveform parameter randomisation. Setting a consistent seed should result in consistent, repeatable outputs. By default this is set to 1000.
- `num_examples_per_generation_batch` : int = 2048
	> When acquiring real interferometer data, GravyFlow downloads data in batches. This is done for efficiency, to reduce the overall number of download requests and the overhead that comes with them. This parameter sets the number of training examples generated for each of those batches. Waveforms are also generated in batches of this size. By default this is set to 2048.
- `num_examples_per_batch` : int = 32
	> This parameter determines the number of examples that will be output by each iteration of the GravyFlow generators. When GravyFlow is used to train a machine learning model, which is its primary design goal, this number should be set to the same value as your desired training batch size. By default this is set to 32.
- `sample_rate_hertz` : float = 2048.0
	> The default sample rate of the data input to and output by GravyFlow, in Hertz. The default is 2048.0 Hz.
- `onsource_duration_seconds` : float = 1.0
    > The default duration of onsource data provided by GravyFlow iterators, in seconds. In GravyFlow, the onsource data is defined as the data being analysed by your method, which may contain a gravitational-wave signal. This is in contrast to the offsource data, which is assumed not to contain any significant features and can be used as an example of uncontaminated noise for data conditioning purposes such as whitening. The default is 1.0 s.
- `offsource_duration_seconds` : float = 16.0
    > The default duration of offsource data provided by GravyFlow iterators, in seconds. Offsource data is data that is assumed not to contain any significant features, and can be used as an example of uncontaminated noise for data conditioning purposes such as whitening. The default is 16.0 s.
- `crop_duration_seconds` : float = 0.5
    > Some data conditioning operations (currently only whitening) create edge effects, which must be cropped before data analysis is performed. GravyFlow does this automatically. `crop_duration_seconds` defines how much data is cropped from each side of the onsource segment, in seconds. Because data is cropped from both sides, the total cropped duration is 2 $\times$ `crop_duration_seconds`. The default is 0.5 s.
- `scale_factor` : float = 1.0E21
    > When gathering data for use in machine learning applications, we want our values to be close to one, as activation functions such as ReLU, SoftMax, and Sigmoid are designed around this assumption. For that reason, we often want to scale our input data, which can be very small in the case of gravitational-wave data. This value is used to scale both approximants and noise. By default this is 1.0E21.
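
To make the relationships between these defaults concrete, the following is a minimal sketch in plain Python using the default values listed above. It does not call GravyFlow: the variable names simply mirror the parameter names, and `duration_to_samples` is a hypothetical helper introduced for illustration. Treating the uncropped window as the onsource duration plus the crop padding is an assumption about how the cropping is applied, not something confirmed by the library.

```python
# Plain-Python arithmetic on the default values listed above.
# Nothing here calls GravyFlow; duration_to_samples is a hypothetical helper.

sample_rate_hertz = 2048.0
onsource_duration_seconds = 1.0
offsource_duration_seconds = 16.0
crop_duration_seconds = 0.5
num_examples_per_generation_batch = 2048
num_examples_per_batch = 32
scale_factor = 1.0e21


def duration_to_samples(duration_seconds: float, sample_rate_hertz: float) -> int:
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(duration_seconds * sample_rate_hertz))


onsource_samples = duration_to_samples(onsource_duration_seconds, sample_rate_hertz)
offsource_samples = duration_to_samples(offsource_duration_seconds, sample_rate_hertz)

# Cropping removes crop_duration_seconds from each side, so 2 x crop in total.
total_crop_samples = duration_to_samples(2.0 * crop_duration_seconds, sample_rate_hertz)

# Assumption: the uncropped window is the onsource duration plus the padding.
padded_onsource_samples = onsource_samples + total_crop_samples

# Number of output batches that one generation batch can supply.
batches_per_generation_batch = num_examples_per_generation_batch // num_examples_per_batch

# A strain value of order 1e-21 is brought to order one by the scale factor.
example_strain = 3.0e-21
scaled_strain = example_strain * scale_factor

print(onsource_samples, offsource_samples, total_crop_samples)
print(padded_onsource_samples, batches_per_generation_batch)
```

With the defaults, one second of onsource data is 2048 samples, the 16 s offsource window is 32768 samples, and a single generation batch of 2048 examples supplies 64 output batches of 32.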

## Important Consideration

Setting global variables like this can be a bad idea if you are going to use different values within the same Python script or notebook, as you may forget to set variables for some functions, which may lead to errors. If you plan on working with data that varies in any of these parameters, it is recommended that you pass them as arguments to the corresponding function rather than relying on the default values.

Below we set some of these values to illustrate how they can be defined:

In [3]:

# Here we set the default GravyFlow values: 
gf.Defaults.set(
    sample_rate_hertz=1024.0,
    onsource_duration_seconds=1.0,
    offsource_duration_seconds=16.0,
    crop_duration_seconds=0.5,
    scale_factor=1.0E21
)
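
The trade-off described under Important Consideration can be illustrated with a small, self-contained sketch of the "global default, explicit override" pattern. This is not GravyFlow's implementation: `ExampleDefaults` and `window_length` are hypothetical names invented here purely to show why an explicit argument is safer when a value varies within one script.

```python
# Hypothetical illustration of "global default, explicit override".
# ExampleDefaults and window_length are NOT part of GravyFlow.
from typing import Optional


class ExampleDefaults:
    """Holds a process-wide default value as a class attribute."""

    sample_rate_hertz: float = 2048.0

    @classmethod
    def set(cls, **kwargs) -> None:
        # Reject unknown names so that typos fail loudly instead of silently.
        for name, value in kwargs.items():
            if not hasattr(cls, name):
                raise AttributeError(f"Unknown default: {name}")
            setattr(cls, name, value)


def window_length(
    duration_seconds: float,
    sample_rate_hertz: Optional[float] = None
) -> int:
    """Return the number of samples in a window of the given duration."""
    # Fall back to the global default only when no explicit value is passed.
    if sample_rate_hertz is None:
        sample_rate_hertz = ExampleDefaults.sample_rate_hertz
    return int(round(duration_seconds * sample_rate_hertz))


ExampleDefaults.set(sample_rate_hertz=1024.0)

# Relies on whatever the global default happens to be at call time.
implicit_length = window_length(1.0)

# Unaffected by the global state, so it cannot be broken by a forgotten set().
explicit_length = window_length(1.0, sample_rate_hertz=4096.0)

print(implicit_length, explicit_length)
```

The implicit call silently changes its result whenever the global value changes, whereas the explicit call documents its own sample rate at the call site.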

In [4]:
# Setup ifo data acquisition object:
ifo_data_obtainer : gf.IFODataObtainer = gf.IFODataObtainer(
    gf.ObservingRun.O3, 
    gf.DataQuality.BEST, 
    [
        gf.DataLabel.NOISE, 
        gf.DataLabel.GLITCHES
    ],
    gf.SegmentOrder.RANDOM,
    force_acquisition = True,
    cache_segments = False
)

In [5]:
# Initialise noise generator wrapper:
noise : gf.NoiseObtainer = gf.NoiseObtainer(
    ifo_data_obtainer = ifo_data_obtainer,
    noise_type = gf.NoiseType.REAL,
    ifos = gf.IFO.L1
)

In [6]:
with env:
    onsource, offsource, gps_times = next(noise(num_examples_per_batch=1))

(11159, 1, 2)


NameError: name 'quit' is not defined

In [24]:
onsource_strain_plot = gf.generate_strain_plot(
        {"Onsource Noise" : onsource[0]},
        title = f"Onsource Background noise at {gps_times[0]}"
    )
    
offsource_strain_plot = gf.generate_strain_plot(
        {"Offsource Noise" : offsource[0]},
        title = f"Offsource Background noise at {gps_times[0]}"
    )

layout : List = [[onsource_strain_plot, offsource_strain_plot]]

# Arrange the plots in a grid. 
grid = gridplot(layout)
output_notebook()
show(grid)

In [8]:
num_iterations : int = 16

with env: 
    for onsource, offsource, gps_times in islice(noise(), num_iterations):
        print(f"Got {onsource.shape[0]} more noise examples! Should probably do something useful with them!")

(11080, 1, 2)


NameError: name 'quit' is not defined
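
The loop above uses `itertools.islice` to take a fixed number of batches from the noise iterator. The sketch below shows the same idiom in isolation, assuming a generator that would otherwise keep yielding indefinitely. `endless_batches` is a hypothetical stand-in and does not reproduce the behaviour of `gf.NoiseObtainer`.

```python
from itertools import islice
from typing import Iterator, Tuple


def endless_batches(num_examples_per_batch: int = 32) -> Iterator[Tuple[int, int]]:
    """Hypothetical stand-in for a generator that never stops yielding batches."""
    batch_index = 0
    while True:
        # Yield the batch index and the number of examples in that batch.
        yield batch_index, num_examples_per_batch
        batch_index += 1


num_iterations = 16

# islice stops after num_iterations items without exhausting the generator.
collected = list(islice(endless_batches(), num_iterations))
total_examples = sum(size for _, size in collected)

print(len(collected), total_examples)
```

Because `islice` is lazy, only the requested number of batches is ever produced, which matters when each batch involves a download or an expensive computation.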

In [15]:
# Initialise IFO data obtainer and multi-detector noise generator wrapper:
ifo_data_obtainer : gf.IFODataObtainer = gf.IFODataObtainer(
    gf.ObservingRun.O3, 
    gf.DataQuality.BEST, 
    [
        gf.DataLabel.NOISE, 
        gf.DataLabel.GLITCHES
    ],
    gf.SegmentOrder.RANDOM,
    force_acquisition = True,
    cache_segments = False
)


multi_ifo_noise : gf.NoiseObtainer = gf.NoiseObtainer(
        ifo_data_obtainer = ifo_data_obtainer,
        noise_type = gf.NoiseType.REAL,
        ifos = [gf.IFO.L1, gf.IFO.H1]
    )

In [16]:
multi_onsource, multi_offsource, multi_gps_times = next(multi_ifo_noise(num_examples_per_batch=1))

(9256, 2, 2)


NameError: name 'quit' is not defined

In [27]:
multi_onsource_strain_plot = gf.generate_strain_plot(
        {"Onsource Noise" : multi_onsource[0]},
        title = f"Onsource Background noise at {multi_gps_times[0]}"
    )
    
multi_offsource_strain_plot = gf.generate_strain_plot(
        {"Offsource Noise" : multi_offsource[0]},
        title = f"Offsource Background noise at {multi_gps_times[0]}"
    )

multi_layout : List = [[multi_onsource_strain_plot, multi_offsource_strain_plot]]

# Arrange the plots in a grid. 
multi_grid = gridplot(multi_layout)
output_notebook()
show(multi_grid)