# Case Study: Bring Your Own Data

Do you have a dataset that needs to be gap-filled?  

In this notebook we repeat the analysis for user-supplied data.

## Download and prepare the dataset

::: {note}
This section depends on your own dataset. You need to wrangle your data into a single `.csv` file with the same layout as `dataset.csv`.

The first column should be a date or timestamp field.

The remaining columns are numerical values of physical measurements. The multivariate imputation methods KNN, MICE, and MissForest are applicable when you have multiple dependent variables measured at the same time.
:::
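
If your measurements arrive in long format (one row per date and station), a pivot can produce the wide layout described in the note. The following is a minimal sketch under stated assumptions: the column names `date`, `station`, and `temperature`, the site names, and the sample values are hypothetical placeholders, not part of the original dataset.

```python
import io
import pandas as pd

# Hypothetical long-format records: one row per (date, station) observation.
long_csv = io.StringIO(
    "date,station,temperature\n"
    "2018-05-15,SiteA_2m,8.6\n"
    "2018-05-15,SiteB_2m,7.7\n"
    "2018-05-16,SiteA_2m,9.0\n"
    "2018-05-17,SiteA_2m,8.1\n"
    "2018-05-17,SiteB_2m,7.6\n"
)
long_df = pd.read_csv(long_csv, parse_dates=["date"])

# Pivot so each station becomes a column and the date becomes the index.
wide_df = long_df.pivot(index="date", columns="station", values="temperature")
wide_df.columns.name = None

# Reindex onto a regular daily grid so missing days become explicit NaN rows.
full_index = pd.date_range(
    wide_df.index.min(), wide_df.index.max(), freq="D", name="date"
)
wide_df = wide_df.reindex(full_index)

# Serialize in the expected layout: date first, then one numeric column per series.
# Pass a file path to to_csv() instead to write the result to disk.
csv_text = wide_df.to_csv()
```

Missing observations stay as empty fields in the CSV, which `pd.read_csv` reads back as `NaN`.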

In [2]:
from erddapy import ERDDAP
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import panel as pn
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
pn.extension()

In [3]:
df = pd.read_csv('dataset.csv', parse_dates=True, index_col=0)

In [4]:
df

date,BlueIsland_2m,BlueIsland_5m,BlueIsland_10m,Ingomar_2m,Ingomar_5m,Ingomar_10m,Ingomar_15m,McNuttsIsland_2m,McNuttsIsland_5m,McNuttsIsland_10m,McNuttsIsland_15m,McNuttsIsland_20m,TaylorsRock_2m,TaylorsRock_5m,TaylorsRock_10m,TaylorsRock_15m,TaylorsRock_20m
2018-05-15,8.634,7.762,6.811,7.664,6.993,6.220,5.637,6.504,6.295,5.991,5.737,4.492,7.141,6.778,6.217,5.922,5.871
2018-05-16,9.009,7.564,6.215,7.347,6.636,5.912,5.390,7.222,6.786,6.184,5.737,4.420,6.980,6.595,6.050,5.732,5.449
2018-05-17,8.074,7.188,6.466,7.621,7.072,6.544,6.034,7.828,7.398,6.245,5.606,4.322,7.346,6.994,6.574,6.328,6.092
2018-05-18,8.441,7.328,6.099,7.993,7.554,7.025,6.501,8.065,7.444,6.599,5.952,4.644,7.312,7.033,6.523,6.207,5.757
2018-05-19,7.649,6.877,6.142,8.180,7.809,7.421,6.901,7.883,7.127,5.962,5.392,4.141,7.667,7.398,7.056,6.856,6.507
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-05-10,6.555,6.554,6.692,6.617,6.709,6.589,5.733,,,6.604,6.651,6.642,6.400,,6.358,6.396,6.521
2022-05-11,6.754,6.753,6.837,6.816,6.895,6.738,5.896,,,6.720,6.667,6.529,6.442,,6.393,6.424,6.525
2022-05-12,7.026,6.947,7.000,6.809,6.761,6.466,5.617,,,6.645,6.217,6.000,6.523,,6.252,6.139,6.142
2022-05-13,7.610,7.248,7.075,6.691,6.347,5.974,5.079,,,6.017,5.804,5.725,6.465,,5.565,5.470,5.600


## Explore the data

In [5]:
def plot_all_sites(df, cmap='Viridis'):
    image_data = df.astype('float32').T.values
    
    x_labels = df.index.strftime('%Y-%m-%d')  # dates → x-axis
    y_labels = list(df.columns)               # station-depths → y-axis
    
    x_coords = np.arange(len(x_labels))
    y_coords = np.arange(len(y_labels))
    
    heatmap = hv.Image((x_coords, y_coords, image_data)).opts(
        xaxis='bottom',
        xlabel='Date',
        ylabel='Station @ Depth',
        xticks=list(zip(x_coords[::30], x_labels[::30])),  # every 30th date
        yticks=list(zip(y_coords, y_labels)),
        xrotation=45,
        cmap=cmap,
        colorbar=True,
        width=1000,
        height=800,
        tools=['hover']
    )
    return heatmap
    
plot_all_sites(df)
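
The heatmap shows where the gaps are; a per-column count puts numbers on them. The sketch below uses a small hypothetical frame named `demo` in place of `df`, with placeholder column names and values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`: a date index with numeric columns containing gaps.
dates = pd.date_range("2022-05-10", periods=4, freq="D", name="date")
demo = pd.DataFrame(
    {
        "SiteA_2m": [6.5, 6.7, 7.0, 7.6],
        "SiteB_2m": [np.nan, np.nan, np.nan, np.nan],
        "SiteB_5m": [6.4, np.nan, 6.5, 6.4],
    },
    index=dates,
)

# Count and percentage of missing values per column, worst column first.
missing = pd.DataFrame(
    {
        "n_missing": demo.isna().sum(),
        "pct_missing": demo.isna().mean() * 100,
    }
).sort_values("n_missing", ascending=False)
```

Columns that are missing over long stretches rely most heavily on the other columns during imputation.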

### Visualize the series data

In [11]:
# Create a dropdown selector
site_selector = pn.widgets.Select(name='Site', options=list(df.columns))

def highlight_nan_regions(label):

    series = df[label]
    
    # Identify NaN regions
    is_nan = series.isna()
    nan_ranges = []
    current_start = None

    for date, missing in is_nan.items():
        if missing and current_start is None:
            current_start = date
        elif not missing and current_start is not None:
            nan_ranges.append((current_start, date))
            current_start = None
    if current_start is not None:
        nan_ranges.append((current_start, series.index[-1]))

    # Create shaded regions
    spans = [
        hv.VSpan(start, end).opts(color='red', alpha=0.2)
        for start, end in nan_ranges
    ]

    curve = hv.Curve(series, label=label).opts(
        width=900, height=250, tools=['hover', 'box_zoom', 'pan', 'wheel_zoom'],
        show_grid=True, title=label
    )

    return curve * hv.Overlay(spans)
    
interactive_plot = hv.DynamicMap(pn.bind(highlight_nan_regions, site_selector))

pn.Column(site_selector, interactive_plot, 'Highlighted regions are gaps that need to be imputed.')
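
The loop in `highlight_nan_regions` walks the series one value at a time. As an alternative sketch, the same runs of missing values can be found by grouping on a run identifier, which also gives the length of each gap. The helper name `nan_runs` and the sample series are hypothetical. One difference to note: here each run ends at its last missing timestamp, whereas the loop above ends it at the first valid timestamp after the gap.

```python
import numpy as np
import pandas as pd

def nan_runs(series):
    """Return (start, end, length) for each run of consecutive NaNs in `series`."""
    is_nan = series.isna()
    # The run id increases whenever the missing/present state flips.
    run_id = (is_nan != is_nan.shift()).cumsum()
    runs = []
    # Keep only the missing positions and group them by run id.
    for _, block in is_nan[is_nan].groupby(run_id[is_nan]):
        runs.append((block.index[0], block.index[-1], len(block)))
    return runs

# Hypothetical daily series with three gaps of lengths 2, 1, and 1.
idx = pd.date_range("2022-01-01", periods=8, freq="D")
s = pd.Series([1.0, np.nan, np.nan, 2.0, 3.0, np.nan, 4.0, np.nan], index=idx)
runs = nan_runs(s)
```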

## Impute the gaps

We have determined that `MissForest` works reasonably well when imputing artificially large gaps.

We use it to gap-fill the missing data in this dataset.

In [12]:
from imputeMF import imputeMF

In [13]:
df_imputed = pd.DataFrame(imputeMF(df.values, 10, print_stats=True), columns=df.columns, index=df.index)

Statistics:
iteration 1, gamma = 0.03431894409323924
Statistics:
iteration 2, gamma = 0.0005762632212411448
Statistics:
iteration 3, gamma = 8.236875752282678e-05
Statistics:
iteration 4, gamma = 2.5255770971817778e-05
Statistics:
iteration 5, gamma = 1.3630642714695858e-05
Statistics:
iteration 6, gamma = 1.1119706955897746e-05
Statistics:
iteration 7, gamma = 9.008172512535662e-06
Statistics:
iteration 8, gamma = 8.351600838941382e-06
Statistics:
iteration 9, gamma = 7.730331223170641e-06
Statistics:
iteration 10, gamma = 7.148441593724086e-06
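
The printed `gamma` values shrink with each iteration. Assuming `imputeMF` follows the stopping criterion of the MissForest algorithm (an assumption; check the package source to confirm), `gamma` is the squared difference between successive imputed matrices, normalized by the squared magnitude of the newer matrix. A minimal sketch with hypothetical arrays and a hypothetical helper name:

```python
import numpy as np

def normalized_change(x_old, x_new):
    """Squared change between two imputed matrices, relative to the newer one."""
    return np.sum((x_new - x_old) ** 2) / np.sum(x_new ** 2)

# Hypothetical imputed matrices from two consecutive iterations.
x_prev = np.array([[1.0, 2.0], [3.0, 4.0]])
x_next = np.array([[1.0, 2.0], [3.0, 4.5]])

# Only one entry changed (4.0 -> 4.5), so the relative change is small.
gamma = normalized_change(x_prev, x_next)
```

Under this reading, a small `gamma` means the imputed values have stopped changing between iterations. It does not by itself say how close they are to the true measurements.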


In [14]:
df_imputed

date,BlueIsland_2m,BlueIsland_5m,BlueIsland_10m,Ingomar_2m,Ingomar_5m,Ingomar_10m,Ingomar_15m,McNuttsIsland_2m,McNuttsIsland_5m,McNuttsIsland_10m,McNuttsIsland_15m,McNuttsIsland_20m,TaylorsRock_2m,TaylorsRock_5m,TaylorsRock_10m,TaylorsRock_15m,TaylorsRock_20m
2018-05-15,8.634,7.76200,6.811,7.66400,6.99300,6.220,5.637,6.50400,6.29500,5.991,5.737,4.492,7.141,6.77800,6.217,5.922,5.871
2018-05-16,9.009,7.56400,6.215,7.34700,6.63600,5.912,5.390,7.22200,6.78600,6.184,5.737,4.420,6.980,6.59500,6.050,5.732,5.449
2018-05-17,8.074,7.18800,6.466,7.62100,7.07200,6.544,6.034,7.82800,7.39800,6.245,5.606,4.322,7.346,6.99400,6.574,6.328,6.092
2018-05-18,8.441,7.32800,6.099,7.99300,7.55400,7.025,6.501,8.06500,7.44400,6.599,5.952,4.644,7.312,7.03300,6.523,6.207,5.757
2018-05-19,7.649,6.87700,6.142,8.18000,7.80900,7.421,6.901,7.88300,7.12700,5.962,5.392,4.141,7.667,7.39800,7.056,6.856,6.507
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-05-10,6.555,6.55400,6.692,6.61700,6.70900,6.589,5.733,6.58291,6.61944,6.604,6.651,6.642,6.400,6.43280,6.358,6.396,6.521
2022-05-11,6.754,6.75300,6.837,6.81600,6.89500,6.738,5.896,6.67078,6.70837,6.720,6.667,6.529,6.442,6.56042,6.393,6.424,6.525
2022-05-12,7.026,6.94700,7.000,6.80900,6.76100,6.466,5.617,6.65660,6.60378,6.645,6.217,6.000,6.523,6.63074,6.252,6.139,6.142
2022-05-13,7.610,7.24800,7.075,6.69100,6.34700,5.974,5.079,6.50291,6.49319,6.017,5.804,5.725,6.465,6.51844,5.565,5.470,5.600


In [15]:
def highlight_imputed_regions(label):

    series = df[label]
    series_imputed = df_imputed[label]
    
    # Identify NaN regions
    is_nan = series.isna()
    nan_ranges = []
    current_start = None

    for date, missing in is_nan.items():
        if missing and current_start is None:
            current_start = date
        elif not missing and current_start is not None:
            nan_ranges.append((current_start, date))
            current_start = None
    if current_start is not None:
        nan_ranges.append((current_start, series.index[-1]))

    # Create shaded regions
    spans = [
        hv.VSpan(start, end).opts(color='red', alpha=0.2)
        for start, end in nan_ranges
    ]

    curve = hv.Curve(series_imputed, label=label).opts(
        width=900, height=250, tools=['hover', 'box_zoom', 'pan', 'wheel_zoom'],
        show_grid=True, title=label
    )

    return curve * hv.Overlay(spans)
    
interactive_plot = hv.DynamicMap(pn.bind(highlight_imputed_regions, site_selector))

pn.Column(site_selector, interactive_plot)

Highlighted regions show where the gaps have been imputed.

Notice that the imputation algorithm fills gaps even in time intervals where there is very limited information from any other site. Take care when interpreting the imputed data.
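
One way to gauge how much support an imputed value had is to count how many columns held a real measurement on that date. The sketch below uses a hypothetical frame named `original` in place of `df`; the 50% threshold is an arbitrary illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`, with one date where most columns are missing.
dates = pd.date_range("2022-05-10", periods=4, freq="D", name="date")
original = pd.DataFrame(
    {
        "SiteA_2m": [6.5, np.nan, 7.0, 7.6],
        "SiteA_5m": [6.4, np.nan, np.nan, 7.2],
        "SiteB_2m": [6.6, np.nan, 6.8, 6.7],
        "SiteB_5m": [6.7, 6.9, 6.8, 6.3],
    },
    index=dates,
)

# Fraction of columns with a real measurement on each date.
observed_fraction = original.notna().mean(axis=1)

# Dates where fewer than half of the columns were observed (arbitrary threshold).
poorly_supported = observed_fraction[observed_fraction < 0.5].index
```

Imputed values on the flagged dates rest on few simultaneous measurements and deserve extra scrutiny.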

In [16]:
plot_all_sites(df_imputed)

```{warning}
Apply caution when using these imputed datasets in subsequent analysis steps. While the imputed regions appear reasonable, they are not true measurements.
```
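
To keep imputed values distinguishable in later analysis, a boolean mask can be carried alongside the filled data. This is a minimal sketch with hypothetical frames named `original` and `filled` standing in for `df` and `df_imputed`; the constant fill is only a placeholder for real imputation output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `df` and `df_imputed`.
dates = pd.date_range("2022-05-10", periods=3, freq="D", name="date")
original = pd.DataFrame(
    {"SiteA_2m": [6.5, np.nan, 7.0], "SiteB_2m": [6.6, 6.8, np.nan]},
    index=dates,
)
filled = original.fillna(6.5)  # placeholder values, not a real imputation

# True where a value was missing in the original and therefore imputed.
was_imputed = original.isna()

# Masking the imputed cells recovers the original observations exactly.
observed_only = filled.mask(was_imputed)
```

Saving `was_imputed` with the filled data lets later steps exclude or down-weight the imputed cells.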