# Read Phoenix data into MTH5

This example demonstrates how to read Phoenix data into an MTH5 file.  The data come from the example data in [PhoenixGeoPy](https://github.com/torresolmx/PhoenixGeoPy). I downloaded those data into a local folder on my computer by forking the main branch.

## Imports

In [1]:
from pathlib import Path

from mth5.mth5 import MTH5
from mth5 import read_file
from mth5.io.phoenix import ReceiverMetadataJSON, PhoenixCollection

2022-08-23 14:33:37,157 [line 135] mth5.setup_logger - INFO: Logging file can be found C:\Users\jpeacock\OneDrive - DOI\Documents\GitHub\mth5\logs\mth5_debug.log


## Data Directory

Specify the station directory.  Phoenix files place each channel in a folder under the station directory, named by the channel number.  There is also a `recmeta.json` file containing metadata output by the receiver, which can be useful.  In `PhoenixGeoPy/sample_data` there are 2 folders.  One holds native data: `.bin` files containing the raw data in counts, sampled at 24k samples per second.  The other holds segmented files, which are calibrated to millivolts and decimated or segmented according to the recording configuration.  Most of the time you would use the segmented files.
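To check that a station directory has the layout described above, a short sketch like the following lists each channel folder and counts the files of each extension inside it. The helper name `summarize_station_dir` is hypothetical and not part of `mth5`; it assumes only that channel folders are named by their channel number.

```python
from collections import Counter
from pathlib import Path


def summarize_station_dir(station_dir):
    """Map each channel folder name to a count of the file extensions in it.

    Hypothetical helper, not part of mth5. Assumes channel folders are
    named by channel number (for example "0", "1", "2").
    """
    summary = {}
    for ch_dir in sorted(Path(station_dir).iterdir()):
        # skip recmeta.json and any folder that is not a channel number
        if not ch_dir.is_dir() or not ch_dir.name.isdigit():
            continue
        summary[ch_dir.name] = dict(
            Counter(fn.suffix for fn in ch_dir.iterdir() if fn.is_file())
        )
    return summary
```

Calling `summarize_station_dir(station_dir)` before building the collection is a quick way to confirm that every channel folder holds the file types you expect.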

In [2]:
station_dir = Path(r"c:\Users\jpeacock\OneDrive - DOI\mt\phoenix_example_data\10291_2019-09-06-015630")

## File Collection

We've developed a collection dataframe to help sort out which files are which and which files can be grouped together into runs.  Continuous runs will be given a single run name and segmented data will have sequential run names.  Both will have the pattern `sr{sample_rate}_####` for the `run.id`.
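As a small illustration of the `sr{sample_rate}_####` pattern, the sketch below formats a run name from a sample-rate label and a sequence number. The function `make_run_id` is hypothetical and only mimics the pattern; how the collection itself labels each sample rate is not shown here, so the label is passed in as-is.

```python
def make_run_id(rate_label, index, zeros=4):
    """Format a run name such as "sr150_0001".

    Hypothetical helper, not the mth5 implementation. `rate_label` is
    whatever label should follow "sr", and `zeros` is the width of the
    zero-padded sequence number.
    """
    return f"sr{rate_label}_{index:0{zeros}d}"


# sequential run names for three segments sampled at 150 samples per second
run_ids = [make_run_id(150, i) for i in range(1, 4)]
```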

#### Receiver Metadata

The data logger or receiver will output a `JSON` file that contains useful metadata that is missing from the data files.  The `recmeta.json` file can be read into an object with methods to translate to `mt_metadata` objects. This is read in by `PhoenixCollection` and is in the attribute `receiver_metadata`. 

Here `PhoenixCollection.get_runs` returns a list of dataframes, each containing just the first block within the sequence, as this is all the reader needs.  For continuous data the reader will read in all sequence blocks; for discontinuous data it will only read in the burst.
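The idea of keeping only the first block of each run can be sketched with plain Python records. This is an illustration of the selection logic under simple assumptions, not the code `PhoenixCollection` uses; the function `first_block_per_run` is hypothetical, and the keys `run`, `start`, and `fn` mirror the dataframe columns used later in this example.

```python
from collections import defaultdict


def first_block_per_run(rows):
    """Group file records by run and keep the earliest block of each run.

    Hypothetical helper. `rows` is an iterable of dicts with the keys
    "run", "start" and "fn"; "start" values must be comparable, for
    example ISO 8601 strings or datetime objects.
    """
    by_run = defaultdict(list)
    for row in rows:
        by_run[row["run"]].append(row)

    first_blocks = {}
    for run_id, run_rows in by_run.items():
        earliest = min(r["start"] for r in run_rows)
        # one record per channel shares the earliest start time
        first_blocks[run_id] = [r for r in run_rows if r["start"] == earliest]
    return first_blocks
```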

In [3]:
phx_collection = PhoenixCollection(file_path=station_dir)
run_list = phx_collection.get_runs(sample_rates=[150, 24000])

In [4]:
receiver_metadata = ReceiverMetadataJSON(station_dir.joinpath(r"recmeta.json"))

## Initiate MTH5

First initiate an MTH5 file.  The receiver metadata can be used to fill in some `Survey` metadata.

In [5]:
m = MTH5()
m.open_mth5(station_dir.joinpath("mth5_from_phoenix.h5"), "w")

2022-08-23 14:33:52,913 [line 663] mth5.mth5.MTH5._initialize_file - INFO: Initialized MTH5 0.2.0 file c:\Users\jpeacock\OneDrive - DOI\mt\phoenix_example_data\10291_2019-09-06-015630\mth5_from_phoenix.h5 in mode w


### Add Survey

In [6]:
survey_metadata = phx_collection.receiver_metadata.survey_metadata
survey_group = m.add_survey(survey_metadata.id)

### Add Station

Add a station and its station metadata.

In [7]:
station_metadata = phx_collection.receiver_metadata.station_metadata
station_group = survey_group.stations_group.add_station(
    station_metadata.id, 
    station_metadata=station_metadata
)

## Loop through runs

Using the `run_list` output by `PhoenixCollection.get_runs`, we can simply loop through the runs without knowing whether the data are continuous or discontinuous; `read_file` takes care of that.

Users should note the Phoenix file structure.  Inside each channel folder are files with extensions of `.td_24k` and `.td_150`.

- `.td_24k` files are usually bursts of a few seconds of data sampled at 24k samples per second to capture high-frequency information.  The returned object is a `mth5.timeseries.ChannelTS`.
- `.td_150` files are data continuously sampled at 150 samples per second.  These files usually have a set length, commonly an hour. The returned object is a `mth5.timeseries.ChannelTS`.
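The two extensions above can be tied to their sample rates with a small lookup. This is a minimal sketch for sorting files by hand; the names `SAMPLE_RATES` and `sample_rate_from_filename` are hypothetical, and only the two extensions named in this example are covered.

```python
from pathlib import Path

# extensions named in this example, mapped to samples per second
SAMPLE_RATES = {".td_24k": 24000, ".td_150": 150}


def sample_rate_from_filename(fn):
    """Return the sample rate implied by a file's extension, or None.

    Hypothetical helper, not part of mth5.
    """
    return SAMPLE_RATES.get(Path(fn).suffix.lower())
```

Files with any other extension return `None`, so they can be skipped explicitly rather than being read by mistake.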



In [None]:
%%time
for run_df in run_list:
    run_metadata = phx_collection.receiver_metadata.run_metadata
    run_metadata.id = run_df.run.unique()[0]
    run_metadata.sample_rate = float(run_df.sample_rate.unique()[0])
    
    run_group = station_group.add_run(run_metadata.id, run_metadata=run_metadata)
    for row in run_df.itertuples():
        ch_ts = read_file(row.fn, channel_map=phx_collection.receiver_metadata.channel_map)
        ch_metadata = phx_collection.receiver_metadata.get_ch_metadata(
            ch_ts.channel_metadata.channel_number
        )
        # need to update the time period as estimated from the data not the metadata
        ch_metadata.time_period.update(ch_ts.channel_metadata.time_period)
        ch_ts.channel_metadata.update(ch_metadata)
        ch_dataset = run_group.from_channel_ts(ch_ts)
        

2022-08-23 14:36:15,506 [line 784] mth5.groups.base.Station.add_run - INFO: run sr150_0001 already exists, returning existing group.
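The loop above takes each channel's time period from the data rather than from the receiver metadata. As a rough illustration of what a data-derived time period looks like, the sketch below computes the time of the last sample from a start time, a sample count, and a sample rate. The function `estimate_time_period` is hypothetical and reflects the general idea only, not how `mth5` computes it.

```python
from datetime import datetime, timedelta


def estimate_time_period(start, n_samples, sample_rate):
    """Return (start, end), where end is the time of the last sample.

    Hypothetical helper. The first sample is taken at `start`, so the
    last of `n_samples` samples falls (n_samples - 1) / sample_rate
    seconds later.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    end = start + timedelta(seconds=(n_samples - 1) / sample_rate)
    return start, end
```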


#### Add a Run for continuous data

Here we will add a run for the continuous data labelled `sr150_0001`.  This is just a suggestion; you could name it whatever makes sense to you. We will use the collection to make continuous runs.

In [9]:
import pandas as pd

# `df` must be defined first; here it is built from the first-block
# dataframes in `run_list` returned by `PhoenixCollection.get_runs`
df = pd.concat(run_list, ignore_index=True)
df_150 = df.loc[df.sample_rate == 150]
runs_150 = df_150.run.unique()
for run_id in runs_150:
    run_df_150 = df_150.loc[df_150.run == run_id]
    # only the first block of each run is needed by the reader
    first_block = run_df_150.loc[run_df_150.start == run_df_150.start.min()]

    run_metadata = phx_collection.receiver_metadata.run_metadata
    run_metadata.id = run_id
    run_metadata.sample_rate = 150.
    continuous_run = station_group.add_run(run_metadata.id, run_metadata=run_metadata)
    # loop over the files of this run only, not every 150 samples per second file
    for fn in first_block.fn:
        ch_150_ts = read_file(fn, channel_map=phx_collection.receiver_metadata.channel_map)
        ch_metadata = phx_collection.receiver_metadata.get_ch_metadata(
            ch_150_ts.channel_metadata.channel_number
        )
        # need to update the time period as estimated from the data not the metadata
        ch_metadata.time_period.update(ch_150_ts.channel_metadata.time_period)
        ch_150_ts.channel_metadata.update(ch_metadata)
        ch_dataset = continuous_run.from_channel_ts(ch_150_ts)

In [None]:
run_metadata = phx_collection.receiver_metadata.run_metadata
run_metadata.id = "sr150_0001"
run_metadata.sample_rate = 150.
continuous_run = station_group.add_run(run_metadata.id, run_metadata=run_metadata)

In [None]:
%%time
for ch_dir in station_dir.iterdir():
    if ch_dir.is_dir():
        ch_metadata = phx_collection.receiver_metadata.get_ch_metadata(int(ch_dir.stem))
        # need to set sample rate to 0 so it does not override existing value
        ch_metadata.sample_rate = 0
        ch_150 = read_file(
            # first file in the sequence; `sorted` accepts the glob generator directly
            sorted(ch_dir.glob("*.td_150"))[0],
            channel_map=phx_collection.receiver_metadata.channel_map,
        )
        # need to update the time period as estimated from the data not the metadata
        ch_metadata.time_period.update(ch_150.channel_metadata.time_period)
        ch_150.channel_metadata.update(ch_metadata)
        ch_dataset = continuous_run.from_channel_ts(ch_150)
        
continuous_run.validate_run_metadata()
continuous_run.write_metadata()

#### Update metadata before closing

The metadata need to be updated to account for the added stations, runs, and channels.

In [None]:

station_group.validate_station_metadata()
station_group.write_metadata()

survey_group.update_survey_metadata()
survey_group.write_metadata()

In [None]:
m.channel_summary.summarize()
m.channel_summary.to_dataframe()

In [None]:
m.close_mth5()