# Noise Acquisition

In this notebook we shall explore how to acquire real interferometer output data, and use that data as a TensorFlow dataset. In GravyFlow, datasets are built by composition, so by combining different elements (i.e., noise, injections, conditioning) we can build custom datasets to suit our specifications.

First we shall start by performing the necessary imports.

In [1]:
# Built-in imports
import os
import sys
from typing import Iterator, List
from pathlib import Path
from itertools import islice

# Dependency imports:
import tensorflow as tf
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot

# GravyFlow import, again adding the grandparent directory to the path:

# Get the absolute path of the grandparent directory
parent_dir = os.path.abspath('../../')

# Add the grandparent directory to sys.path
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

# To import gravyflow simply use:
import gravyflow as gf

2024-01-10 00:30:47.262330: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Setup Environment:

As described in notebook 1, we should set up the environment with `gf.env` to ensure we work on an available GPU.

In [2]:
# Set up the environment and return a tf.distribute strategy object.
env = gf.env()

INFO:root:TensorFlow version: 2.12.1, CUDA version: 11.8
2024-01-10 00:31:23.637172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2000 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:89:00.0, compute capability: 7.0
INFO:root:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Set GravyFlow Global Defaults

Often, when we are working in a single notebook or Python script, there are some variables which will remain constant throughout our analysis. In order to accommodate these scenarios, GravyFlow allows us to set a number of defaults in a global defaults class. The values which can be set in this way are as follows:

- `seed` : int = 1000
	> This is the default random seed that GravyFlow will use to initialise the TensorFlow and NumPy random operations behind dataset shuffling, random noise generation, and waveform parameter randomisation. Setting a consistent seed should result in consistent, repeatable outputs. By default this is set to 1000.
- `num_examples_per_generation_batch` : int = 2048
	> When acquiring real interferometer data, GravyFlow downloads data in batches. This is done for efficiency, to reduce the overall number of download requests and the overhead that comes with them. This parameter sets the number of training examples generated for each of those batches. Waveforms are also generated in batches of this size. By default this is set to 2048.
- `num_examples_per_batch` : int = 32
	> This parameter determines the number of examples that will be output by each iteration of the GravyFlow generators. When GravyFlow is used to train a machine learning model, which is its primary design goal, this number should be set to the same value as your desired training batch size.
- `sample_rate_hertz` : float = 2048.0
	> The default sample rate of the data input and output by GravyFlow in Hertz. Default is 2048.0 Hz
- `onsource_duration_seconds` : float = 1.0
    > The default duration of onsource data provided by GravyFlow iterators, in seconds. In GravyFlow, the onsource data is defined as the data being analysed by your method, which may contain a gravitational wave signal. This is in contrast to the offsource data, which is assumed not to contain any significant data features and can be used as an example of uncontaminated noise for data conditioning purposes such as whitening. Default is 1.0 s.
- `offsource_duration_seconds` : float = 16.0
    > The default duration of offsource data provided by GravyFlow iterators, in seconds. Offsource data is data that is assumed not to contain any significant features, and can be used as an example of uncontaminated noise for data conditioning purposes such as whitening. Default is 16.0 s.
- `crop_duration_seconds` : float = 0.5
    > Some data conditioning operations (currently only whitening) create edge effects, which need to be cropped before data analysis is performed. GravyFlow does this automatically. `crop_duration_seconds` defines how much data is cropped from each side of the onsource segment, in seconds, so the total cropped duration is 2 $\times$ `crop_duration_seconds`. The default is 0.5 s.
- `scale_factor` : float = 1.0E21
    > When gathering data for use in machine learning applications, we want our values to be close to one, as activation functions such as ReLU, SoftMax, and Sigmoid are designed around this assumption. For that reason, we often want to scale our input data, which can be very small in the case of gravitational wave data. This value is used to scale both approximants and noise. By default this is 1.0E21.
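
To make the default durations concrete, the short calculation below converts them into sample counts. It is plain arithmetic on the default values listed above, not a call into GravyFlow. The variable names simply mirror the parameter names, and the strain amplitude of `1.0e-21` is an assumed order of magnitude chosen to show why a `scale_factor` of `1.0E21` brings values close to one.

```python
# Default values quoted in the list above.
sample_rate_hertz = 2048.0
onsource_duration_seconds = 1.0
offsource_duration_seconds = 16.0
crop_duration_seconds = 0.5
scale_factor = 1.0e21

# The onsource segment carries crop padding on both sides until it is cropped.
padded_onsource_seconds = onsource_duration_seconds + 2.0 * crop_duration_seconds

num_onsource_samples = int(onsource_duration_seconds * sample_rate_hertz)
num_padded_onsource_samples = int(padded_onsource_seconds * sample_rate_hertz)
num_offsource_samples = int(offsource_duration_seconds * sample_rate_hertz)

# Assumed, illustrative strain amplitude before and after scaling.
typical_strain = 1.0e-21
scaled_strain = typical_strain * scale_factor

print(num_onsource_samples, num_padded_onsource_samples, num_offsource_samples)
print(scaled_strain)
```

With the defaults, one second of onsource data is 2048 samples, the padded onsource segment is 4096 samples, and the offsource segment is 32768 samples.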

### Important Consideration

Setting global variables like this can be a bad idea if you are going to use different values within the same Python script or notebook, as you may forget to set variables for some functions, which may lead to errors. If you plan on working with data that varies in any of these parameters, it is recommended that you pass them as arguments to the corresponding function rather than relying on the default values.

Below we set some of these values to illustrate how they can be defined:

In [3]:
# Here we set the default GravyFlow values. They are not changed from the defaults,
# but this illustrates how you can set them.
gf.Defaults.set(
    sample_rate_hertz=2048.0,
    onsource_duration_seconds=1.0,
    offsource_duration_seconds=16.0,
    crop_duration_seconds=0.5,
    scale_factor=1.0E21
)

## Types of Noise

GravyFlow currently supports four types of noise. Each type of noise has an associated GravyFlow ENUM. These are:

1. White Gaussian Noise : `gf.NoiseType.WHITE`
> Simple white noise with a Gaussian distribution.

2. Coloured Gaussian Noise : `gf.NoiseType.COLORED`
> White Gaussian Noise coloured by the specified detector's design PSD.

3. Pseudo-Real Noise : `gf.NoiseType.PSEUDO_REAL`
> White Gaussian Noise coloured by a PSD of data drawn from the detector. 
> This kind of noise can simulate the variance and non-stationary nature of
> real detector noise without including as many non-linearities.

4. Real Noise : `gf.NoiseType.REAL`
> Real Noise acquired from the given interferometer.
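
The difference between white and coloured Gaussian noise can be sketched in a few lines of standard-library Python. This is a toy illustration only, not GravyFlow's implementation: the naive DFT is used purely to keep the example dependency-free, and `toy_asd` is a hypothetical spectral shape standing in for a detector PSD.

```python
import cmath
import math
import random


def white_noise(num_samples: int, rng: random.Random) -> list:
    """Draw independent zero-mean, unit-variance Gaussian samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def dft(samples: list) -> list:
    """Naive O(N^2) discrete Fourier transform (fine for small N)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def inverse_dft(spectrum: list) -> list:
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)) / n).real
        for t in range(n)
    ]


def colour_noise(samples: list, asd) -> list:
    """Shape white noise by an amplitude spectral density, asd(bin index)."""
    n = len(samples)
    spectrum = dft(samples)
    # Weight by distance to the zero-frequency bin so positive and negative
    # frequencies are treated symmetrically, which keeps the output real.
    shaped = [c * asd(min(k, n - k)) for k, c in enumerate(spectrum)]
    return inverse_dft(shaped)


def toy_asd(frequency_bin: int) -> float:
    """Hypothetical spectral shape with more power at low frequencies."""
    return 1.0 / (1.0 + frequency_bin)


rng = random.Random(1000)
white = white_noise(128, rng)
coloured = colour_noise(white, toy_asd)
```

Pseudo-real noise follows the same recipe in principle, with the spectral shape estimated from detector data rather than taken from a design curve.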

## Generating Simulated Noise

The first two types of noise, White Gaussian Noise and Coloured Gaussian Noise, can be generated using only a `gf.NoiseObtainer` object. `gf.NoiseObtainer` takes a number of arguments:

- `data_directory_path` : Path = Path("./generator_data")
> A path to the directory where the NoiseObtainer will cache downloaded noise if using Real or Pseudo-Real noise. Its default value is Path("./generator_data")
- `ifo_data_obtainer` : Union[None, gf.IFODataObtainer] = None
> If using real or pseudo-real noise, a gf.IFODataObtainer object is required; this will be discussed in more detail later in this notebook. Its default value is None.
- `ifos` : List[gf.IFO] = gf.IFO.L1
> A list of the interferometers you wish to simulate or acquire noise from. Currently GravyFlow supports 3 IFOs represented by an ENUM: LIGO Livingston, `gf.IFO.L1`, LIGO Hanford, `gf.IFO.H1`, and Virgo, `gf.IFO.V1`.
- `noise_type` : NoiseType = gf.NoiseType.REAL
> The type of noise you wish to simulate. As discussed above, this can be one of `gf.NoiseType.WHITE`, `gf.NoiseType.COLORED`, `gf.NoiseType.PSEUDO_REAL`, or `gf.NoiseType.REAL`.
- `groups` : Union[dict, None] = {"train" : 0.98, "validate" : 0.01, "test" : 0.01}
> This parameter enables you to create distinct groups within real data segments to draw from. The most common use case is to separate training and testing data, ensuring that no real noise segments used in model training are used to validate the model. By default, this parameter is set up with a train, validate, and test group. When generating noise, you can choose which group to draw from. As long as you keep the groups dictionary the same, you will never draw from another group, regardless of the seed. If you add new groups to the dictionary, or change the percentage of data within each group, the groups will no longer be consistent with those used previously. It is therefore important to decide on a groups split and keep to it for all analyses. This parameter does nothing when supplied for `gf.NoiseType.WHITE` or `gf.NoiseType.COLORED` noise.
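
One way a group split can be made independent of the random seed is to derive each segment's group from a hash of the segment itself. The sketch below illustrates that idea under stated assumptions: `assign_group` is a hypothetical helper, the segment start times are made up, and nothing here is taken from GravyFlow's internals.

```python
import hashlib


def assign_group(segment_start_gps: int, groups: dict) -> str:
    """Map a segment to a group using only its start time and the group fractions."""
    digest = hashlib.sha256(str(segment_start_gps).encode("utf-8")).digest()
    # Convert the first 8 bytes of the hash into a number in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64

    cumulative = 0.0
    for name, fraction in groups.items():
        cumulative += fraction
        if position < cumulative:
            return name
    # Guard against floating-point rounding in the cumulative sum.
    return name


groups = {"train": 0.98, "validate": 0.01, "test": 0.01}

# Hypothetical segment start times, spaced 2048 s apart.
segment_starts = [1_000_000_000 + 2048 * index for index in range(1000)]
assignments = {start: assign_group(start, groups) for start in segment_starts}
```

Because the assignment depends on the group fractions, changing the dictionary moves the boundaries between groups. This is why a split should be chosen once and then kept fixed.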

We can initialise a `gf.NoiseObtainer` object like so:

In [4]:
# Initialise white noise generator:
white_noise : gf.NoiseObtainer = gf.NoiseObtainer(
    noise_type=gf.NoiseType.WHITE # For white noise, the only parameter we need to set is the noise type.
)

From this `white_noise` object, we can then create a noise generator by calling it. When we call an initialised `gf.NoiseObtainer`, it takes the following arguments:

- `sample_rate_hertz` : Union[float, None] = None
	> The sample rate of the output noise. If None, which is default, this value reverts to the default set in gf.Defaults.
- `onsource_duration_seconds` : Union[float, None] = None
    > The duration of the onsource noise, in seconds. If None, which is default, this value reverts to the default set in gf.Defaults.
- `crop_duration_seconds` : Union[float, None] = None
    > A crop duration can also be added, for consistency with the rest of the pipeline. This adds extra noise to the onsource, equivalent to 2 $\times$ `crop_duration_seconds`, which can be cropped after data conditioning. If None, which is default, this value reverts to the default set in gf.Defaults.
- `offsource_duration_seconds` : Union[float, None] = None
    > The duration of offsource noise, in seconds. If None, which is default, this value reverts to the default set in gf.Defaults.
- `num_examples_per_batch` : Union[int, None] = None
	> The number of noise examples provided each time this iterator is called. If None, which is default, this value reverts to the default set in gf.Defaults.
- `scale_factor` : float = 1.0
    > The scale factor to multiply our noise. Unlike the other values, this is 1.0 by default, as we usually scale at another point in our pipeline.
- `group` : str = "train"
	> This parameter designates which group to draw real data from in the real or pseudo-real case. See the description of the groups parameter above. This parameter will do nothing when supplied for `gf.NoiseType.WHITE` or `gf.NoiseType.COLORED` noise.
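
The "None reverts to the default" behaviour described in this list is a common Python pattern, sketched below with hypothetical names. `ToyDefaults`, `resolve`, and `describe_batch` are illustrative stand-ins, not part of GravyFlow, and the returned shape ignores details such as crop padding and the detector dimension.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDefaults:
    """Hypothetical stand-in for a global defaults holder."""
    sample_rate_hertz: float = 2048.0
    onsource_duration_seconds: float = 1.0
    num_examples_per_batch: int = 32


DEFAULTS = ToyDefaults()


def resolve(value, default):
    """Return the explicit argument if given, otherwise the global default."""
    return default if value is None else value


def describe_batch(
    sample_rate_hertz: Optional[float] = None,
    onsource_duration_seconds: Optional[float] = None,
    num_examples_per_batch: Optional[int] = None,
) -> tuple:
    """Compute a toy (examples, samples) shape for one batch of onsource data."""
    rate = resolve(sample_rate_hertz, DEFAULTS.sample_rate_hertz)
    duration = resolve(onsource_duration_seconds, DEFAULTS.onsource_duration_seconds)
    batch = resolve(num_examples_per_batch, DEFAULTS.num_examples_per_batch)
    return (batch, int(rate * duration))


default_shape = describe_batch()
single_example_shape = describe_batch(num_examples_per_batch=1)
```

Passing an argument explicitly overrides only that value. Everything left as `None` still follows the global defaults.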

Next, let's create examples of GravyFlow generating some noise.

### White Noise Example

Since we have previously defined all our parameters, we can generate white noise without setting any additional parameters. However, since we only want to generate one example, we set `num_examples_per_batch=1`.

A call to a `gf.NoiseObtainer` returns a Python iterator, as it is primarily designed for use in a loop, so we cannot use the returned object as is or index into it. Instead, we can use Python's built-in `next` function to get the next item from the generator.
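
For readers less familiar with iterators, the pattern can be tried without any data acquisition. In the sketch below, `toy_noise_generator` is a hypothetical stand-in that yields tuples shaped like the `(onsource, offsource, gps_times)` output used in this notebook. Only `next` and `itertools.islice` are the real tools being demonstrated.

```python
from itertools import islice
from typing import Iterator, Tuple


def toy_noise_generator(num_examples_per_batch: int = 32) -> Iterator[Tuple[list, list, None]]:
    """Hypothetical stand-in that yields (onsource, offsource, gps_times) forever."""
    batch_index = 0
    while True:
        onsource = [[float(batch_index)] * 4 for _ in range(num_examples_per_batch)]
        offsource = [[float(batch_index)] * 8 for _ in range(num_examples_per_batch)]
        yield onsource, offsource, None
        batch_index += 1


generator = toy_noise_generator(num_examples_per_batch=1)

# next() pulls exactly one batch; indexing the generator itself would raise TypeError.
first_onsource, first_offsource, first_gps_times = next(generator)

# islice() bounds an otherwise endless generator, which is handy in a loop.
batch_sizes = [len(onsource) for onsource, _, _ in islice(generator, 3)]
```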

In [5]:
with env:
	white_onsource, white_offsource, _ = next(white_noise(num_examples_per_batch=1))
	# In real and pseudo-real noise, the third element that is returned are the GPS times 
	# of the real noise segment. For simulated noise, this simply returns None.

We can then plot these results:

In [6]:
white_onsource_strain_plot = gf.generate_strain_plot(
        {"Onsource Noise" : white_onsource[0]},
        title = "Onsource Background noise."
    )
    
white_offsource_strain_plot = gf.generate_strain_plot(
        {"Offsource Noise" : white_offsource[0]},
        title = "Offsource Background noise."
    )

white_plot_layout : List = [[white_onsource_strain_plot, white_offsource_strain_plot]]

# Arrange the plots in a grid. 
grid = gridplot(white_plot_layout)
output_notebook()
show(grid)

### Coloured Noise Example

Generating coloured noise is very similar. This time, however, we must specify which interferometer we wish to simulate. GravyFlow currently uses the O3 detector PSD specification to colour the noise based on the choice of detector.

In [8]:
# Initialise colored noise generator:
colored_noise : gf.NoiseObtainer = gf.NoiseObtainer(
    noise_type=gf.NoiseType.COLORED,
    ifos=gf.IFO.L1 # For coloured noise, we must specify an interferometer.
)

with env:
	colored_onsource, colored_offsource, _ = next(colored_noise(num_examples_per_batch=1))
	# In real and pseudo-real noise, the third element that is returned are the GPS times 
	# of the real noise segment. For simulated noise, this simply returns None.

And again we can plot these results.

In [9]:
colored_onsource_strain_plot = gf.generate_strain_plot(
        {"Onsource Noise" : colored_onsource[0]},
        title = "Onsource Background noise."
    )
    
colored_offsource_strain_plot = gf.generate_strain_plot(
        {"Offsource Noise" : colored_offsource[0]},
        title = "Offsource Background noise."
    )

colored_plot_layout : List = [[colored_onsource_strain_plot, colored_offsource_strain_plot]]

# Arrange the plots in a grid. 
grid = gridplot(colored_plot_layout)
output_notebook()
show(grid)

## Obtaining Real Noise

If we want to generate pseudo-real noise from real data, or obtain samples of real data, we must create an additional object to pass to the noise generator: an instance of `gf.IFODataObtainer`. This object contains various parameters which specify which data to collect:

- `observing_runs` : Union[gf.ObservingRun, List[gf.ObservingRun]]
> Specify which observing run you would like to sample data from (multiple observing runs are not yet supported). By default, the data obtainer will pull a random sample from a random time during the chosen observing run that satisfies the other selection criteria given by data quality and data labels. If you want to retrieve data from a custom GPS range, this can be achieved using the overrides dictionary.
- `data_quality` : gf.DataQuality
> Specify what kind of data to acquire. Currently, only gf.DataQuality.BEST is supported, which gets the cleaned output channels with lines removed.
- `data_labels` : Union[gf.DataLabel, List[gf.DataLabel]]
> The data labels parameter specifies which features we want to include in or exclude from our sample pool. The three types of data label are gf.DataLabel.NOISE, gf.DataLabel.EVENT, and gf.DataLabel.GLITCHES. Glitches are mapped using the GravitySpy glitch database, and events are any event or candidate event listed in a GWTC catalogue. If we wanted to return only noise and glitches, we would input this list: [gf.DataLabel.NOISE, gf.DataLabel.GLITCHES]; this would exclude known and possible event times from the returned data. If we also wanted to exclude glitches, we would use just gf.DataLabel.NOISE. Note that excluding glitches will increase preprocessing time slightly, as they are numerous. Currently, the noise obtainer does not have a mode for extracting features only, such as only events or only glitches.
- `segment_order` : SegmentOrder = gf.SegmentOrder.RANDOM
> This parameter specifies the order in which segments are retrieved by the iterator. The options are gf.SegmentOrder.RANDOM, in which the order is randomised deterministically based on the current seed; gf.SegmentOrder.CHRONOLOGICAL, in which the segments are returned in order of their GPS times; or gf.SegmentOrder.SHORTEST_FIRST, in which the shortest segment is returned first. This is primarily used for debugging. gf.SegmentOrder.RANDOM is recommended for most use cases. The default value is gf.SegmentOrder.RANDOM.
- `max_segment_duration_seconds` : float = 2048.0
> This parameter determines the maximum length of downloaded data segments. GravyFlow does not download each retrieved example individually, as that would introduce a large amount of overhead. Nor does it download all data simultaneously, as that would be impossible. Instead it downloads data in segments, and then divides each segment into smaller examples until that segment is exhausted, at which point it downloads the next segment. This means that a number of consecutive examples will be drawn from similar GPS times, no further apart than this value, which reduces mixing. If a greater mix of data from across your input range is desired, use a lower value for this number. Note that this will increase data acquisition overhead. If less mixing is necessary, experiment with larger values. Note that larger values will result in higher memory usage. The default value is 2048.0 s.
- `saturation` : float = 1.0
> This parameter determines how many examples to create from every downloaded segment. If this is one, then one second of example data will be generated for every second of segment data. The default value is 1.0.
- `force_acquisition` : bool = False
> If true, this parameter forces the data_obtainer to acquire and save new segment data even if it finds cached segment data. Default is False.
- `cache_segments` : bool = True
> If true, this parameter will cause the IFODataObtainer object to save downloaded segments to an HDF5 file in a location specified by its containing NoiseObtainer. When running the iterator for a second time with the same parameters, the IFODataObtainer will load the saved data rather than downloading it again, unless force_acquisition is True. Default is True.
- `overrides` : dict = None
> This parameter lets you set more specific GPS time ranges by overriding the parameters given by the supplied gf.ObservingRun Enum. For example, this override dictionary could be passed to restrict the GPS times to a specific range: {"start_gps_times" : }

- `logging_level` : int = logging.WARNING
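
The interplay between `segment_order`, `max_segment_duration_seconds`, and `saturation` can be sketched with a few small functions. This is a hedged illustration using made-up segment times. The helper names are hypothetical, and the reading of `saturation` as "seconds of example data per second of segment data" is an assumption based on the description above.

```python
import random
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_gps_seconds, end_gps_seconds)


def order_segments(segments: List[Segment], order: str, seed: int) -> List[Segment]:
    """Return segments in random (seeded), chronological, or shortest-first order."""
    if order == "random":
        shuffled = list(segments)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if order == "chronological":
        return sorted(segments, key=lambda segment: segment[0])
    if order == "shortest_first":
        return sorted(segments, key=lambda segment: segment[1] - segment[0])
    raise ValueError(f"Unknown order: {order}")


def split_segment(segment: Segment, max_duration_seconds: float) -> List[Segment]:
    """Cut one long segment into pieces no longer than max_duration_seconds."""
    start, end = segment
    pieces = []
    while start < end:
        piece_end = min(start + max_duration_seconds, end)
        pieces.append((start, piece_end))
        start = piece_end
    return pieces


def examples_per_segment(segment: Segment, example_duration_seconds: float, saturation: float) -> int:
    """Assumed reading of saturation: example-seconds drawn per segment-second."""
    duration = segment[1] - segment[0]
    return int(saturation * duration / example_duration_seconds)


segments = [(5000.0, 9096.0), (0.0, 100.0), (20000.0, 20512.0)]
chronological = order_segments(segments, "chronological", seed=1000)
shortest_first = order_segments(segments, "shortest_first", seed=1000)
pieces = split_segment((5000.0, 9096.0), max_duration_seconds=2048.0)
num_examples = examples_per_segment(pieces[0], example_duration_seconds=1.0, saturation=1.0)
```

Under these assumptions, a 4096 s stretch of data is split into two 2048 s pieces, and each piece yields 2048 one-second examples at a saturation of 1.0.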

In [None]:
# Setup ifo_data_obtainer object:
ifo_data_obtainer : gf.IFODataObtainer = gf.IFODataObtainer(
    observing_runs=gf.ObservingRun.O3, 
    data_quality=gf.DataQuality.BEST, 
    data_labels=[
        gf.DataLabel.NOISE, 
        gf.DataLabel.GLITCHES
    ],
    segment_order=gf.SegmentOrder.RANDOM,
    force_acquisition=True,
    cache_segments=False
)
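
Excluding labelled features such as glitches amounts to subtracting a list of vetoed time intervals from the list of valid segments. The sketch below shows one straightforward way to do that. It is an assumed approach with made-up times and a hypothetical helper name, not GravyFlow's own routine.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_gps_seconds, end_gps_seconds)


def remove_vetoed_times(valid: List[Interval], vetoes: List[Interval]) -> List[Interval]:
    """Subtract every veto interval from every valid interval."""
    remaining = sorted(valid)
    for veto_start, veto_end in sorted(vetoes):
        next_remaining = []
        for start, end in remaining:
            if veto_end <= start or veto_start >= end:
                # No overlap: keep the interval unchanged.
                next_remaining.append((start, end))
                continue
            if start < veto_start:
                # Keep the part before the veto.
                next_remaining.append((start, veto_start))
            if veto_end < end:
                # Keep the part after the veto.
                next_remaining.append((veto_end, end))
        remaining = next_remaining
    return remaining


valid_segments = [(0.0, 100.0), (200.0, 300.0)]
glitch_times = [(10.0, 12.0), (250.0, 251.0), (295.0, 310.0)]
noise_only = remove_vetoed_times(valid_segments, glitch_times)
```

Each veto can split a valid segment in two, so a long list of glitch times produces many short segments. This is consistent with the note above that excluding glitches adds preprocessing time.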

In [42]:
# Initialise noise generator wrapper:
noise : gf.NoiseObtainer = gf.NoiseObtainer(
    ifo_data_obtainer=ifo_data_obtainer,
    noise_type=gf.NoiseType.REAL,
    ifos=gf.IFO.L1
)

In [6]:
with env:
	onsource, offsource, gps_times = next(noise(num_examples_per_batch=1))

2024-01-09 04:11:39.622486: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x5593c2644080 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-01-09 04:11:39.622526: I tensorflow/compiler/xla/service/service.cc:177]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2024-01-09 04:11:39.653334: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-01-09 04:11:39.701906: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8600
2024-01-09 04:11:39.724502: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-01-09 04:11:39.782652: I ./tensorflow/compiler/jit/device_compiler.h:180] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


In [43]:
onsource_strain_plot = gf.generate_strain_plot(
        {"Onsource Noise" : onsource[0]},
        title = f"Onsource Background noise at {gps_times[0]}"
    )
    
offsource_strain_plot = gf.generate_strain_plot(
        {"Offsource Noise" : offsource[0]},
        title = f"Offsource Background noise at {gps_times[0]}"
    )

layout : List = [[onsource_strain_plot, offsource_strain_plot]]

# Arrange the plots in a grid. 
grid = gridplot(layout)
output_notebook()
show(grid)

In [44]:
num_iterations : int = 16

with env: 
    for onsource, offsource, gps_times in islice(noise(), num_iterations):
        print(f"Got {onsource.shape[0]} more noise Examples! Should probably do something usefull with them!")

2024-01-09 07:42:43.660868: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x55f8a96ae860 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-01-09 07:42:43.660915: I tensorflow/compiler/xla/service/service.cc:177]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2024-01-09 07:42:43.668087: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-01-09 07:42:43.688900: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8600
2024-01-09 07:42:43.712875: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-01-09 07:42:43.772522: I ./tensorflow/compiler/jit/device_compiler.h:180] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more noise Examples! Should probably do something usefull with them!
Got 32 more 

In [10]:
multi_ifo_noise : gf.NoiseObtainer = gf.NoiseObtainer(
        ifo_data_obtainer = ifo_data_obtainer,
        noise_type = gf.NoiseType.REAL,
        ifos = [gf.IFO.L1, gf.IFO.H1]
    )

In [11]:
# Run inside the environment scope, as in the single-detector examples above.
with env:
    multi_onsource, multi_offsource, multi_gps_times = next(multi_ifo_noise(num_examples_per_batch=1))

In [15]:
multi_onsource_strain_plot = gf.generate_strain_plot(
        {"Onsource Noise" : multi_onsource[0]},
        title = f"Onsource Background noise at {multi_gps_times[0]}"
    )
    
multi_offsource_strain_plot = gf.generate_strain_plot(
        {"Offsource Noise" : multi_offsource[0]},
        title = f"Offsource Background noise at {multi_gps_times[0]}"
    )

multi_layout : List = [[multi_onsource_strain_plot, multi_offsource_strain_plot]]

# Arrange the plots in a grid. 
multi_grid = gridplot(multi_layout)
output_notebook()
show(multi_grid)