# **This notebook aims to extract data from a correctly formatted CSV file and adapt it to the pangeo-fish format**

### **Necessary imports**
___

In [None]:
import pandas as pd
import json
import numpy as np
from data_conversion import (
    compat_checking,
    convert_to_utc_with_formatting,
    create_metadata_file,
    extract_DST,
    extract_name,
    extract_tagging_events,
    format_date,
    show_data_csv,
)
import pytz
import csv
from datetime import datetime
import os
from tqdm import tqdm
import statsmodels.api as sm

In [None]:
### Test with the tag NO_A12667
### These two paths will be used as an example to see if the full data extraction works correctly

csv_path = "../../all_raw/NO_A12667.CSV"  # Path to the raw CSV file to extract data from; replace with your own
destination = "../../all_cleaned/NO_A12667/"  # Folder where the output files are written; replace with your own
os.makedirs(destination, exist_ok=True)

___
### 1. **Extracting the tagging events**
This section extracts the information needed for the tagging events: the time and position of the release and of the presumed fish death (or recapture).
___

The `extract_tagging_events` function performs the following steps:

- **Purpose**:
  - Extracts the release date, the presumed fish death date, and the fish release/recapture positions from a CSV file.

- **Initialization**:
  - Initializes variables for storing dates (`release_date`, `fish_death`) and coordinates (`lon`, `lat`).

- **Processing CSV**:
  - Opens the CSV file and iterates through each line.

- **Data Extraction**:
  - Extracts the release date and the presumed fish death date.
  - Formats latitude and longitude coordinates for fish release/recapture positions.

- **DataFrame Creation**:
  - Constructs a DataFrame with event names, dates, longitude, and latitude.
  - Returns the DataFrame containing tagging events data.
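
As a rough illustration of the DataFrame creation step, the sketch below assembles a tagging-events table from values that have already been parsed. The helper name `build_tagging_events`, the column names, and the sample dates and coordinates are assumptions made for this example; the actual logic lives in `extract_tagging_events` in data_conversion.py.

```python
import pandas as pd


def build_tagging_events(release_date, fish_death, release_pos, recapture_pos):
    """Hypothetical helper: assemble a tagging-events table.

    Positions are (longitude, latitude) tuples; the column names are assumed.
    """
    events = {
        "event_name": ["release", "fish_death"],
        "time": [release_date, fish_death],
        "longitude": [release_pos[0], recapture_pos[0]],
        "latitude": [release_pos[1], recapture_pos[1]],
    }
    return pd.DataFrame(events)


# Made-up values, for illustration only
example_events = build_tagging_events(
    "2014-05-21T10:00:00Z", "2014-08-03T06:30:00Z", (-4.5, 48.4), (-3.9, 47.8)
)
print(example_events)
```

Each row describes one event, so the release and the presumed death share the same four columns.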

In [None]:
### See the function extract_tagging_events in the file data_conversion.py for further information
tagging_events = extract_tagging_events(csv_path)
te_save_path = os.path.join(destination, "tagging_events.csv")
tagging_events.to_csv(te_save_path, index=False)
tagging_events

___
### 2. **Creating the metadata JSON file**
This section builds the metadata.json file that describes the tag. The `create_metadata_file` function performs the following steps:  
___
- **Purpose**:
  - Creates a metadata JSON file based on provided data path and destination path.

- **Initialization**:
  - Retrieves tag name from the provided file path using a helper function.
  - Initializes metadata with tag ID, scientific name, common name, and project information.

- **Metadata Construction**:
  - Constructs a dictionary (`metadata`) containing tag ID, scientific name ("Dicentrarchus labrax"), common name ("European seabass"), and project name ("BARGIP").

- **File Writing**:
  - Specifies the filename as "metadata.json" and constructs the full destination path.
  - Writes the metadata dictionary to a JSON file at the destination path.

- **Result**:
  - No return value; a metadata JSON file is created at the specified destination path.
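
The steps above can be sketched as follows. This is a minimal stand-in, not the code of `create_metadata_file`: the function name `write_metadata` and the JSON key names are assumptions, and unlike the function described above it returns the written path so the example can read the file back.

```python
import json
import os
import tempfile


def write_metadata(tag_id, destination):
    """Hypothetical stand-in for create_metadata_file; key names are assumed."""
    metadata = {
        "pit_tag_id": tag_id,
        "scientific_name": "Dicentrarchus labrax",
        "common_name": "European seabass",
        "project": "BARGIP",
    }
    path = os.path.join(destination, "metadata.json")
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return path


# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp:
    metadata_path = write_metadata("NO_A12667", tmp)
    with open(metadata_path) as f:
        loaded = json.load(f)
print(loaded)
```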

In [None]:
### See data_conversion.py for more information about create_metadata_file function
create_metadata_file(csv_path, destination)

___
### 3. **Creating the dst.csv file**
In this section, we create the dst.csv file, which contains the time, pressure, and temperature data.  
The `extract_DST` function performs the following steps:  
___

- **Opening the CSV File**:
  - Takes a file path to a CSV file containing time series data.
  - Opens the CSV file using the `csv.reader` object.
  
- **Iterating Through CSV Rows**:
  - Iterates through each row of the CSV file.
  
- **Extracting Tag ID**:
  - Extracts the tag ID from the file path using the `extract_name` function from data_conversion.py.
  
- **Finding Target Line**:
  - Searches for the line that contains the headers for the data of interest ("Date/Time Stamp", "Pressure", "Temp").
  
- **Reading Data**:
  - Once the target line is found, starts reading data rows.
  
- **Formatting Date and Time**:
  - Formats the date and time column using the `convert_to_utc_with_formatting` function.
  - Converts the local time to UTC time based on the specified time zone.
  
- **Converting Data Types**:
  - Converts the pressure and temperature data from strings to `numpy.float64` for numerical analysis.
  
- **Storing Data**:
  - Stores the formatted data into a list for further processing.
  
- **Creating DataFrame**:
  - After reading all data rows, converts the list of data into a Pandas DataFrame with columns ['time', 'pressure', 'temperature'].
  
- **Completeness Check**:
  - Checks if the number of data points extracted matches the expected length.
  - If they match, indicates completion; otherwise, suggests potential incompleteness.
  
- **Returning DataFrame**:
  - Finally, returns the DataFrame containing the extracted data.
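
To make the header search and row parsing concrete, here is a small sketch on an in-memory file. The file content, the helper name `read_dst_rows`, and the rule used to stop reading are all assumptions for illustration; the real raw files and `extract_DST` may differ.

```python
import csv
import io

import numpy as np
import pandas as pd

# Made-up file content: a few metadata lines, then the header and data rows
raw_text = """Tag ID,A12667
Some other setting,42
Date/Time Stamp,Pressure,Temp
01/06/2014 12:00:00,10.5,14.2
01/06/2014 12:01:30,11.0,14.1
"""


def read_dst_rows(handle):
    """Hypothetical reader: skip lines until the header, then collect data rows."""
    rows = []
    header_found = False
    for row in csv.reader(handle):
        if not header_found:
            # The header line marks the start of the time series
            header_found = row[:3] == ["Date/Time Stamp", "Pressure", "Temp"]
            continue
        if len(row) < 3:
            break  # assumption: the block ends at the first short or empty line
        rows.append([row[0], np.float64(row[1]), np.float64(row[2])])
    return pd.DataFrame(rows, columns=["time", "pressure", "temperature"])


sample_dst = read_dst_rows(io.StringIO(raw_text))
print(sample_dst)
```

In this sketch the time column is still a raw string; the conversion to UTC is a separate step.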

This function primarily focuses on extracting time, pressure, and temperature data from a CSV file, converting the date and time to UTC time, and formatting the data for analysis. Check the function in the file data_conversion.py for further information.
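
The local-to-UTC conversion can be sketched with pandas as shown below. The helper name `local_to_utc`, the input format, and the output format are assumptions; `convert_to_utc_with_formatting` in data_conversion.py may handle these details differently.

```python
import pandas as pd


def local_to_utc(timestamps, time_zone, fmt="%d/%m/%Y %H:%M:%S"):
    """Hypothetical helper: parse local timestamps and express them in UTC.

    The input format `fmt` is an assumption about the raw file.
    """
    local = pd.to_datetime(pd.Series(timestamps), format=fmt)
    utc = local.dt.tz_localize(time_zone).dt.tz_convert("UTC")
    return utc.dt.strftime("%Y-%m-%dT%H:%M:%SZ")


# Summer time in Paris is UTC+2, winter time is UTC+1
converted = local_to_utc(
    ["01/07/2014 12:00:00", "01/01/2014 12:00:00"], "Europe/Paris"
)
print(converted.tolist())
```

By default `tz_localize` raises an error for local times that are ambiguous or do not exist around a daylight saving time change; its `ambiguous` and `nonexistent` parameters control that behavior.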

In [None]:
time_zone = "Europe/Paris"
dst = extract_DST(csv_path, time_zone)
dst

In [None]:
dst_save_path = os.path.join(destination, "dst.csv")
dst.to_csv(dst_save_path, index=False)  # index=False, as in the batch conversion in section 4

___
### 4. **Convert and format everything under the raw_test folder to the cleaned folder**
This section is a test: it checks that the conversion runs correctly for every tag in the **raw_test** folder.  
The same operation can then be applied to the **raw** folder.
___
#### Explanation of the code below:
- **Folders and Time Zone Setup**:
  - Defines folders (`raw_folder`, `destination_folder`) and time zone (`time_zone`).

- **Destination Folder Creation**:
  - Checks if the destination folder exists; if not, creates it.

- **Processing Raw Data**:
  - Iterates through raw files in the raw folder.
  - Extracts tag ID and constructs destination paths.
  - Creates tag-specific folders if they don't exist.
  - Extracts tagging events and DST data from raw files.
  - Saves extracted data to CSV files in respective tag folders.
  - Creates metadata files for each raw file.

- **Handling Incorrect Raw Folder**:
  - Prints a message if the raw folder doesn't exist.

In [None]:
%%time
raw_folder = "all_raw/"  # Folder name to explore
destination_folder = "all_cleaned/"
time_zone = "Europe/Paris"

# Create the destination folder if it does not exist yet
os.makedirs(destination_folder, exist_ok=True)

# Check if the folder exists
if os.path.exists(raw_folder):

    # Get list of files to iterate through
    files = [
        f for f in os.listdir(raw_folder) if os.path.isfile(os.path.join(raw_folder, f))
    ]

    # Wrap files list with tqdm for progress bar
    for file_name in tqdm(files, desc="Processing files"):
        raw_file = os.path.join(raw_folder, file_name)

        # Extract filename without extension
        tag_id = extract_name(raw_file)
        destination_path = os.path.join(destination_folder, tag_id)

        # Check if the folder for the tag exists, if not, create it
        if not os.path.exists(destination_path):
            print("Creating folder for tag:", tag_id)
            os.mkdir(destination_path)

        ### Extracting tagging events from raw file
        tag_events = extract_tagging_events(raw_file)
        tagging_events_path = os.path.join(destination_path, "tagging_events.csv")
        tag_events.to_csv(
            tagging_events_path, index=False
        )  ### Saving them at the right path

        ### Extracting DST from raw file
        tag_dst = extract_DST(raw_file, time_zone)
        dst_path = os.path.join(destination_path, "dst.csv")
        tag_dst.to_csv(dst_path, index=False)  ### Saving them at the right path

        ###Creating metadata files
        print("creating_metadata")
        create_metadata_file(raw_file, destination_path)

else:
    print("Wrong folder for raw files")

### 5. **Warm plume detections**

In [None]:
tag_name = "DK_A10531"

In [None]:
data_path = f"../../all_cleaned/{tag_name}/"

In [None]:
storage_path = f"../../all_cleaned/{tag_name}"

In [None]:
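def get_wp_timestamps(data_path, storage_path):
    """Flag the days that fall in the warmer of two temperature regimes.

    Reads dst.csv from data_path and writes detection.csv to storage_path.
    """
    # read tag data
    df = pd.read_csv(f"{data_path}dst.csv", delimiter=",")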

    df["time"] = pd.to_datetime(df["time"])

    df = df.set_index("time")

    # format
    df_d = df.loc[
        df.groupby(pd.Grouper(freq="D"))["temperature"].idxmax()
    ]  # nan introduced because of daylight saving time change
    # df_d = df.groupby(pd.Grouper(freq='D')).mean() # nan introduced because of daylight saving time change
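    # Positions of the daily rows with a missing temperature
    idx = np.where(np.isnan(df_d["temperature"]))[0]
    pressure_col = df_d.columns.get_loc("pressure")
    temperature_col = df_d.columns.get_loc("temperature")
    for i in idx:
        # Fill the gap with the values of the previous daily row
        df_d.iloc[i, pressure_col] = df_d.iloc[i - 1, pressure_col]
        df_d.iloc[i, temperature_col] = df_d.iloc[i - 1, temperature_col]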
    df_d["dTemp"] = np.append(np.diff(df_d["temperature"]), 0.0)
    df_d["dPressure"] = np.append(np.diff(df_d["pressure"]), 0.0)
    df_d.index = pd.DatetimeIndex(df_d.index).to_period("D")

    clf_model = sm.tsa.MarkovAutoregression(
        df_d["temperature"], k_regimes=2, order=1, switching_ar=False
    )
    res_clf_model = clf_model.fit(method="bfgs")
    res_clf_model.summary()

    predicted_label = res_clf_model.smoothed_marginal_probabilities[0] > 0.50
    df_d["predicted_label"] = np.append(0.0, predicted_label)
    if (
        df_d["temperature"][df_d["predicted_label"] == 1.0].mean()
        < df_d["temperature"][df_d["predicted_label"] == 0.0].mean()
    ):
        # Swap the labels so that 1.0 always marks the warmer regime
        df_d["predicted_label"] = 1.0 - df_d["predicted_label"]

    os.makedirs(storage_path, exist_ok=True)

    df_d["predicted_label"].to_csv(f"{storage_path}/detection.csv")

___
### Running for multiple tags 

In [None]:
detections = pd.read_csv("bar_flag_warm_plume.txt", sep="\t")

In [None]:
detections_tag = detections.loc[detections["warm_plume"] == True, "tag_name"].tolist()

In [None]:
for tag_name in tqdm(detections_tag, desc=""):
    data_path = f"../../all_cleaned/{tag_name}/"
    storage_path = f"../../all_cleaned/{tag_name}"
    get_wp_timestamps(data_path, storage_path)

In [None]:
# Run this cell to upload every cleaned tag folder to the S3 bucket

import s3fs

s3 = s3fs.S3FileSystem(
    anon=False,
    client_kwargs={
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
    },
)
s3.put("../../all_cleaned/", "gfts-ifremer/tags/bargip/cleaned", recursive=True)

In [None]:
# Run this cell to upload only the folders of the tags that have detections

import s3fs

s3 = s3fs.S3FileSystem(
    anon=False,
    client_kwargs={
        "endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
    },
)

for tag in detections_tag:
    s3.put(
        f"../../all_cleaned/{tag}",
        f"gfts-ifremer/tags/bargip/cleaned/{tag}",
        recursive=True,
    )