This script uses pandas to backfill Tugboat-related metadata for all surveys that exist on NCEI.
The metadata is expected to be extracted from NCEI into an Excel file using the query below:

`
select * from cruise.wcsd_all_survey_summary_msql
where source_group like '%NMFS%'
order by source_name, dataset_name
`

The Excel file is then parsed in this script, and the resulting processed dataframe is uploaded to BigQuery.
The Tugboat metadata format is outlined below:

Fields:
* Cruise ID - string
* Segment ID - string
* Master (default) release date - date, when these data should be released
* Ship name - string (controlled vocabulary)
* Departure port - string
* Arrival port - string
* Departure date - date
* Arrival date - date
* Sea area - string (controlled vocabulary)
* Cruise title - string
* Cruise purpose - string
* Cruise description - string
* Sponsors - list of organization names (strings) (controlled vocabulary)
* Funders - list of organization names (strings) (controlled vocabulary)
* Scientists - list of person objects (controlled vocabulary)
* Projects - list of project names (strings)
* Metadata author - person object (controlled vocabulary)
* Instruments - list of instrument objects (controlled vocabulary)
* Documents URI - URI to documents files (bucket)

Instrument Object:
* instrument - instrument name (string) (controlled vocabulary)
* release date - date, when these data should be released (overrides master release date if specified)
* status - enum representing processing type (Raw, Processed, or Products)
* calibration state - enum representing calibration performed on instrument (Calibrated w/ calibration data, Calibrated w/o calibration data, Uncalibrated, Uncalibrated w/ calibration data, Unknown)
* calibration date - date, when instrument calibration was performed
* calibration reports URI - URI to calibration reports (bucket)
* calibration data / support URI - URI to calibration data / support files (bucket)
* Data details - string
* Data URI - URI to instrument data (bucket)
* Ancillary data details - string
* Ancillary data URI - URI to ancillary data (bucket)

Person Object:
* name - person name (string)
* organization - organization person is associated with (string)
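
The format above can be sketched as a few Python dataclasses. This is only an illustration of how the objects nest, not the actual Tugboat schema: the class names, the attribute names, and the `release_date_for` helper are assumptions, and only a subset of the listed fields is shown. The dates in the usage lines are made-up placeholders.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """Processing type of an instrument's data (values from the list above)."""
    RAW = "Raw"
    PROCESSED = "Processed"
    PRODUCTS = "Products"


@dataclass
class Person:
    name: str
    organization: str = ""


@dataclass
class Instrument:
    instrument: str
    status: Status
    release_date: Optional[date] = None  # overrides the master release date if set
    data_uri: str = ""


@dataclass
class Cruise:
    cruise_id: str
    segment_id: str
    ship_name: str
    master_release_date: Optional[date] = None
    scientists: List[Person] = field(default_factory=list)
    instruments: List[Instrument] = field(default_factory=list)

    def release_date_for(self, inst: Instrument) -> Optional[date]:
        """Hypothetical helper: the instrument date wins over the master date."""
        return inst.release_date or self.master_release_date


# Usage with placeholder dates (not real release dates).
es60 = Instrument(instrument="ES60", status=Status.RAW)
cruise = Cruise(
    cruise_id="AI04GL",
    segment_id="ES60",
    ship_name="Gladiator",
    master_release_date=date(2024, 1, 1),
    instruments=[es60],
)
print(cruise.release_date_for(es60))  # falls back to the master release date
```

Keeping the instrument-level release date optional mirrors the rule stated above that it overrides the master release date only when specified.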

In [1]:
import pandas as pd

In [2]:
file_path = r"C:\Users\hannan.khan\Downloads\WCSD_DB_ALL_SUMMARY.xlsx"
df = pd.read_excel(file_path)
df.head()

  warn("Workbook contains no default style, apply openpyxl's default")


Unnamed: 0,OBJECTID,WCS_ID,DATASET_NAME,INSTRUMENT_NAME,PROJECT_NAME,SCIENTIST_NAME,SOURCE_NAME,SOURCE_GROUP,CRUISE_NAME,PLATFORM_NAME,...,MIN_PULSE_LENGTH,MAX_PULSE_LENGTH,ANCILLARY,SHAPE,GEOM_TYPE,CLOUD_PATH,FILE_COUNT,DATASET_SIZE,INGEST_TIME,ARCHIVE_DATE
0,10047,10047,AI04GL_ES60,ES60,|Bottom trawl survey of groundfish resources i...,|Groundfish Assessment Program|,|AFSC|,|NMFS|,AI04GL,Gladiator,...,,,,,line,https://noaa-wcsd-pds.s3.amazonaws.com/index.h...,50,23819849116,,2022-11-30 16:59:22
1,9902,9902,AI04SS_ES60,ES60,|Bottom trawl survey of groundfish resources i...,|Groundfish Assessment Program|,|AFSC|,|NMFS|,AI04SS,Sea Storm,...,,,,"MDSYS.SDO_GEOMETRY(2006, 8307, NULL, MDSYS.SDO...",line,https://noaa-wcsd-pds.s3.amazonaws.com/index.h...,118,67346349156,,2022-09-13 18:01:15
2,9903,9903,AI06GL_ES60,ES60,|Bottom trawl survey of groundfish resources i...,|Groundfish Assessment Program|,|AFSC|,|NMFS|,AI06GL,Sea Storm,...,,,,,,https://noaa-wcsd-pds.s3.amazonaws.com/index.h...,324,16165358520,,2022-09-13 18:02:57
3,9904,9904,AI06SS_ES60,ES60,|Bottom trawl survey of groundfish resources i...,|Groundfish Assessment Program|,|AFSC|,|NMFS|,AI06SS,Sea Storm,...,,,,,,https://noaa-wcsd-pds.s3.amazonaws.com/index.h...,1431,74291968696,,2022-09-13 18:20:32
4,9905,9905,AI10OE_ES60,ES60,|Bottom trawl survey of groundfish resources i...,|Groundfish Assessment Program|,|AFSC|,|NMFS|,AI10OE,Ocean Explorer,...,,,,,line,https://noaa-wcsd-pds.s3.amazonaws.com/index.h...,1082,90764737141,,2022-09-13 19:30:55


In [None]:
sorted(list(df.columns))

In [None]:
# Find the columns in which at least one value contains the '|' delimiter.
# NOTE: a single wrapped value such as `|NEFSC|` also matches. For wrapped
# strings, testing `value.count('|') > 2` instead would flag only the
# strings that hold two or more values.
columns_w_multiple_values_in_row = set()
for col in df.columns:
    for value in df[col].tolist():
        # Only strings can carry the delimiter; NaN and numeric cells are skipped.
        if isinstance(value, str) and '|' in value:
            columns_w_multiple_values_in_row.add(str(col))
            break  # One match is enough to flag this column.
print("COLUMNS WITH MULTIPLE VALUES")
print(columns_w_multiple_values_in_row)

In [None]:
def parse_multiple_values(s: str = ""):
    """Used for parsing through multiple values in a string using the '|' delimiter.
    NOTE: Some strings will begin and end with the delimiter, such as `|NEFSC|`."""
    ...
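
The stub above leaves the body open. One possible body is sketched below under a separate, hypothetical name (`parse_multiple_values_sketch`) so it is not mistaken for the final implementation. It assumes the function should return a list of stripped, non-empty values and that empty cells (which pandas reads from Excel as NaN) should yield an empty list. The second input string is an illustrative example, not a value taken from the file.

```python
from typing import Any, List


def parse_multiple_values_sketch(s: Any = "") -> List[str]:
    """Split a '|'-delimited string into its non-empty, stripped values."""
    if not isinstance(s, str):
        # Empty Excel cells arrive as NaN (a float), not as a string.
        return []
    # Leading and trailing delimiters produce empty parts, which are dropped.
    return [part.strip() for part in s.split("|") if part.strip()]


print(parse_multiple_values_sketch("|NEFSC|"))       # ['NEFSC']
print(parse_multiple_values_sketch("|AFSC|NWFSC|"))  # ['AFSC', 'NWFSC']
print(parse_multiple_values_sketch("ES60"))          # ['ES60']
```

Returning a list in every case, even for a single value, would let the list-typed fields (sponsors, funders, scientists, projects) be handled uniformly.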

In [None]:
for idx, row in df.iterrows():
    cruise_id = row["CRUISE_NAME"]
    # the echosounder used
    # TODO: has multiple names sometimes that need to be parsed.
    segment_id = row["INSTRUMENT_NAME"]
    # This is the cruise_id and the segment_id concatenated with an underscore. Used as a prefix for file names in NCEI.
    package_id = row["DATASET_NAME"]
    # TODO: verify that we can use "PUBLISH_DATE" as the master_release_date
    master_release_date = row["PUBLISH_DATE"]
    ship = row["PLATFORM_NAME"]
    # TODO: not available
    ship_uuid = ""
    departure_port = row["DEPARTURE_PORT"]
    # TODO: verify that we can use "START_DATE" as the departure_date
    departure_date = row["START_DATE"]
    arrival_port = row["ARRIVAL_PORT"]
    # TODO: verify that we can use "END_DATE" as the arrival_date
    arrival_date = row["END_DATE"]
    # TODO: not available
    sea_area = ""
    # TODO: has multiple names sometimes that need to be parsed.
    cruise_title = row["PROJECT_NAME"]
    # TODO: not available
    cruise_purpose = ""
    # TODO: not available
    cruise_description = ""
    # TODO: not available
    metadata_author = ""

    # TODO: these have multiple names sometimes that need to be parsed.
    sponsors = row["SOURCE_NAME"]
    funders = row["SOURCE_NAME"]
    scientists = row["SCIENTIST_NAME"]
    projects = row["PROJECT_NAME"]
    instruments = row["INSTRUMENT_NAME"]
    package_instruments = row["INSTRUMENT_NAME"]
    # TODO: validate that we can just use the s3 cloud path for the calibration file paths.
    calibration_file_path = row["CLOUD_PATH"]
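
The loop above assigns each field to a local variable that is overwritten on the next iteration. A minimal sketch of one way to collect the values per row follows. The function names `row_to_record` and `_split` and the output key names are assumptions for illustration, and only columns visible in the `df.head()` output are used. The sample `PROJECT_NAME` is a placeholder because the real value is truncated in that output.

```python
from typing import Any, Dict, List, Mapping


def _split(value: Any) -> List[str]:
    """Hypothetical helper: split a '|'-delimited cell; non-strings give []."""
    if not isinstance(value, str):
        return []
    return [v.strip() for v in value.split("|") if v.strip()]


def row_to_record(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Map one source row (a dict or a pandas Series) to a flat record."""
    return {
        "cruise_id": row["CRUISE_NAME"],
        "package_id": row["DATASET_NAME"],
        "ship": row["PLATFORM_NAME"],
        # Multi-valued columns become lists.
        "sponsors": _split(row["SOURCE_NAME"]),
        "scientists": _split(row["SCIENTIST_NAME"]),
        "projects": _split(row["PROJECT_NAME"]),
        "instruments": _split(row["INSTRUMENT_NAME"]),
    }


# Usage with a plain dict standing in for one dataframe row.
sample_row = {
    "CRUISE_NAME": "AI04GL",
    "DATASET_NAME": "AI04GL_ES60",
    "PLATFORM_NAME": "Gladiator",
    "SOURCE_NAME": "|AFSC|",
    "SCIENTIST_NAME": "|Groundfish Assessment Program|",
    "PROJECT_NAME": "|Example project|",  # placeholder value
    "INSTRUMENT_NAME": "ES60",
}
record = row_to_record(sample_row)
print(record["sponsors"], record["instruments"])  # ['AFSC'] ['ES60']
```

With the real dataframe, a list such as `[row_to_record(row) for _, row in df.iterrows()]` could then be passed to `pd.DataFrame(...)` to build the processed dataframe.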