# Time Machine Surf Cam

This is a primitive initial version that simply implements a nearest-neighbour approach on a few key features. The features were chosen somewhat arbitrarily (an informed judgement rather than a systematic selection) and are all given equal weight.

## Dealing with the Data

We first load the data from the database.

### Data Loading

In this notebook, we are working from a copy of the 4 October 2023 (04/10/23) version of the [surf-forecast-videos](https://github.com/jreaso/surf-forecast-videos) dataset.

In [1]:
import sqlite3
import pandas as pd
import numpy as np

In [2]:
# Connect to database
conn = sqlite3.connect('SurfForecastDB')

# SQL queries
forecasts_query = '''
SELECT 
    spot_id, 
    forecast_timestamp, 
    surf_human_relation, 
    surf_raw_min, 
    surf_raw_max, 
    wind_speed, 
    wind_direction_type, 
    tide_type, 
    tide_height 
FROM 
    forecasts 
WHERE 
    forecast_probability > 60
'''

cam_videos_query = '''
SELECT 
    spot_id, 
    cam_number, 
    footage_timestamp, 
    video_storage_location 
FROM 
    cam_videos 
WHERE 
    download_status = 'Downloaded'
'''

# Load data from the database into pandas DataFrames
forecasts_df = pd.read_sql_query(forecasts_query, conn)
cam_videos_df = pd.read_sql_query(cam_videos_query, conn)

# Close db connection
conn.close()

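This extracts only a basic subset of the forecast columns, restricted to forecasts with `forecast_probability` above 60, along with the location of every video marked as downloaded.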

In [3]:
forecasts_df.head()

,spot_id,forecast_timestamp,surf_human_relation,surf_raw_min,surf_raw_max,wind_speed,wind_direction_type,tide_type,tide_height
0,5842041f4e65fad6a7708bc3,2023-09-01 00:00:00,Shin to knee,0.0,0.55774,6.65738,Offshore,NORMAL,0.82
1,5842041f4e65fad6a7708bc3,2023-09-01 01:00:00,Shin to knee,0.0,0.52493,6.05135,Offshore,NORMAL,1.56
2,5842041f4e65fad6a7708bc3,2023-09-01 02:00:00,Shin to knee,0.0,0.52493,5.406,Offshore,NORMAL,2.36
3,5842041f4e65fad6a7708bc3,2023-09-01 03:00:00,Flat,0.0,0.49213,3.76856,Offshore,NORMAL,3.0
4,5842041f4e65fad6a7708bc3,2023-09-01 04:00:00,Flat,0.0,0.49213,4.3848,Offshore,NORMAL,3.31


In [4]:
cam_videos_df.head()

,spot_id,cam_number,footage_timestamp,video_storage_location
0,584204204e65fad6a7709084,1,2023-09-16 16:04:37.585000,../surf_cam_videos/584204204e65fad6a7709084_1_...
1,584204204e65fad6a7709084,1,2023-09-16 15:04:33.989000,../surf_cam_videos/584204204e65fad6a7709084_1_...
2,584204204e65fad6a7709084,1,2023-09-16 14:04:28.231000,../surf_cam_videos/584204204e65fad6a7709084_1_...
3,584204204e65fad6a7709084,1,2023-09-16 13:04:26.214000,../surf_cam_videos/584204204e65fad6a7709084_1_...
4,584204204e65fad6a7709084,1,2023-09-16 12:04:24.295000,../surf_cam_videos/584204204e65fad6a7709084_1_...


### Basic Data Cleaning

In [5]:
forecasts_df['forecast_timestamp'] = pd.to_datetime(forecasts_df['forecast_timestamp'])
cam_videos_df['footage_timestamp'] = pd.to_datetime(cam_videos_df['footage_timestamp'])

We only want forecast instances that have an associated video, meaning footage from the same spot recorded within 10 minutes of the forecast timestamp.

In [6]:
# Inner join on spot_id (pairs every forecast with every video from the same spot)
merged_df = pd.merge(forecasts_df, cam_videos_df, on='spot_id', how='inner')

# Filter rows based on timestamp difference condition
filtered_df = merged_df[
    (merged_df['forecast_timestamp'] - merged_df['footage_timestamp']).abs() <= pd.Timedelta(minutes=10)
]

# Recover forecasts df filtered to rows with an associated video (within 10 mins)
exclusive_columns = set(cam_videos_df.columns) - set(forecasts_df.columns)
forecasts_filtered_df = filtered_df.drop(columns=exclusive_columns).drop_duplicates()
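
The join above pairs every forecast with every video from the same spot before filtering, which can get large as the dataset grows. As a possible alternative, here is a minimal sketch using `pd.merge_asof` with a tolerance. The toy frames and their values are hypothetical and only mirror the column names used in this notebook; this is not how the notebook itself performs the match.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
forecasts = pd.DataFrame({
    "spot_id": ["a", "a", "b"],
    "forecast_timestamp": pd.to_datetime(
        ["2023-09-01 10:00", "2023-09-01 11:00", "2023-09-01 10:00"]),
})
videos = pd.DataFrame({
    "spot_id": ["a", "b"],
    "footage_timestamp": pd.to_datetime(
        ["2023-09-01 10:04", "2023-09-01 10:30"]),
})

# merge_asof requires both frames to be sorted by their time keys
matched = pd.merge_asof(
    forecasts.sort_values("forecast_timestamp"),
    videos.sort_values("footage_timestamp"),
    left_on="forecast_timestamp",
    right_on="footage_timestamp",
    by="spot_id",                        # only match within the same spot
    direction="nearest",                 # closest video before or after
    tolerance=pd.Timedelta(minutes=10),  # ignore videos further away than this
)

# Forecasts with no video inside the tolerance get NaT and are dropped
forecasts_with_video = matched.dropna(subset=["footage_timestamp"])
print(forecasts_with_video)
```

Unlike the join-then-filter approach, this keeps at most one video per forecast row, so no `drop_duplicates()` step is needed afterwards.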

### Training and Evaluation Split

Since this is an unsupervised task with no defined success metric, we set aside a selection of instances for manual comparison. We arbitrarily choose to hold out the first day of every week (Monday, `dayofweek == 0`). Holding out random instances would not be a fair comparison, as the data is highly temporally dependent: a randomly held-out instance would likely find a near-identical neighbour in an adjacent hour. Holding out a single contiguous block of time would also be unfair, since there may be no similar forecast in the rest of the period we have to work with.

In [7]:
evaluation = forecasts_filtered_df[forecasts_filtered_df['forecast_timestamp'].dt.dayofweek == 0]
training = forecasts_filtered_df[forecasts_filtered_df['forecast_timestamp'].dt.dayofweek != 0]

In [8]:
training.shape, evaluation.shape

((443, 9), (80, 9))
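
To make the split rule concrete, here is a small sketch on a hypothetical two-week range of daily timestamps (not the real data). It relies on the pandas convention that `dt.dayofweek` numbers Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Hypothetical toy series: one timestamp per day for two weeks
ts = pd.Series(pd.date_range("2023-09-01", periods=14, freq="D"))

is_monday = ts.dt.dayofweek == 0   # pandas convention: Monday=0, Sunday=6
held_out = ts[is_monday]           # plays the role of `evaluation`
kept = ts[~is_monday]              # plays the role of `training`

print(held_out.dt.day_name().unique())
print(len(kept), len(held_out))
```

With hourly forecasts, every held-out Monday still has training rows from the Sunday before and the Tuesday after, which is what keeps the comparison from being trivially easy or impossible.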

## Implementing Nearest Neighbor Model

We choose our features to be: `surf_human_relation`, `wind_direction_type`, `tide_type`, `surf_raw_max`.

In [9]:
from sklearn.metrics import pairwise_distances
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [10]:
categorical_features = ['surf_human_relation', 'wind_direction_type', 'tide_type']
numerical_features = ['surf_raw_max']

In [11]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features), # scale the surf_raw_max numeric feature
        ('cat', OneHotEncoder(), categorical_features)
    ])

In [12]:
X_train = preprocessor.fit_transform(training)
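
As an illustration of what "equal weight" means once the features are encoded, the sketch below runs the same kind of `ColumnTransformer` on a hypothetical three-row frame. The values are made up; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy rows (not from the dataset)
toy = pd.DataFrame({
    "surf_raw_max": [1.0, 2.0, 3.0],
    "wind_direction_type": ["Offshore", "Onshore", "Offshore"],
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["surf_raw_max"]),
    ("cat", OneHotEncoder(), ["wind_direction_type"]),
])
X = pre.fit_transform(toy)
# The transformer may return a sparse matrix; densify for inspection
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)

# Columns are: scaled surf_raw_max, one-hot Offshore, one-hot Onshore
print(X.shape)

# Rows 0 and 1 have different categories, so the one-hot columns add
# (1 - 0)^2 + (0 - 1)^2 to the squared Euclidean distance between them
onehot_sq = np.sum((X[0, 1:] - X[1, 1:]) ** 2)
print(onehot_sq)
```

Each mismatched categorical feature therefore adds a fixed amount to the squared distance, while the scaled numeric feature contributes in units of standard deviations. Any future re-weighting of features would amount to rescaling these columns.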

In [13]:
evaluation = evaluation.copy()
evaluation.loc[:, 'nearest_neighbor_timestamp'] = None

In [14]:
# Set 'forecast_timestamp' as the index for both dataframes
# (reassign rather than modify in place, since `training` is a filtered slice)
training = training.set_index('forecast_timestamp')
evaluation = evaluation.set_index('forecast_timestamp')

In [15]:
# Process independently for each location
for spot in evaluation['spot_id'].unique():

    # Extract rows of the current spot from training and evaluation
    training_spot = training[training['spot_id'] == spot]
    eval_mask = (evaluation['spot_id'] == spot).to_numpy()
    evaluation_spot = evaluation[eval_mask]

    # Skip spots with no training rows to compare against
    if training_spot.empty:
        continue

    # Transform the data
    X_train_spot = preprocessor.transform(training_spot)
    X_eval_spot = preprocessor.transform(evaluation_spot)

    # Convert transformed data to dense arrays if they are sparse
    if hasattr(X_train_spot, "toarray"):
        X_train_spot = X_train_spot.toarray()
    if hasattr(X_eval_spot, "toarray"):
        X_eval_spot = X_eval_spot.toarray()

    # Distance from every evaluation row to every training row of this spot
    distances = pairwise_distances(X_eval_spot, X_train_spot)
    nearest_times = training_spot.index[distances.argmin(axis=1)]

    # Assign via the spot mask: forecast timestamps can repeat across spots,
    # so a label-based `.at[eval_time, ...]` would also overwrite other spots' rows
    evaluation.loc[eval_mask, 'nearest_neighbor_timestamp'] = list(nearest_times)
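
For reference, the same per-spot lookup could also be expressed with scikit-learn's `NearestNeighbors` estimator, which returns the distance alongside the index of the match. The matrices below are hypothetical stand-ins for the already-transformed features of a single spot; this is a sketch of an alternative, not the method used above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical, already-transformed feature matrices for one spot
X_train_toy = np.array([[0.0, 1.0, 0.0],
                        [1.0, 0.0, 1.0],
                        [2.0, 1.0, 0.0]])
X_eval_toy = np.array([[0.1, 1.0, 0.0],
                       [1.8, 1.0, 0.0]])

# Fit on the training rows, then query with the evaluation rows
nn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(X_train_toy)
distances, indices = nn.kneighbors(X_eval_toy)

nearest_rows = indices[:, 0]     # position of the closest training row
nearest_dist = distances[:, 0]   # how close that match actually is
print(nearest_rows, nearest_dist)
```

Keeping the distance is useful for manual evaluation, since a large value flags cases where even the best match in the training period is not very similar.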


In [16]:
evaluation.head()

forecast_timestamp,spot_id,surf_human_relation,surf_raw_min,surf_raw_max,wind_speed,wind_direction_type,tide_type,tide_height,nearest_neighbor_timestamp
2023-09-18 08:00:00,5842041f4e65fad6a7708bc3,Waist to shoulder,3.37927,4.03543,4.4922,Onshore,,,2023-09-24 18:00:00
2023-09-18 09:00:00,5842041f4e65fad6a7708bc3,Waist to chest,3.31365,3.96982,5.05388,Onshore,,,2023-09-24 18:00:00
2023-09-18 10:00:00,5842041f4e65fad6a7708bc3,Waist to chest,3.31365,3.96982,6.5993,Onshore,,,2023-09-20 15:00:00
2023-09-18 11:00:00,5842041f4e65fad6a7708bc3,Waist to chest,3.18241,3.80577,7.08996,Onshore,,,2023-09-20 15:00:00
2023-09-18 12:00:00,5842041f4e65fad6a7708bc3,Waist to chest,3.08399,3.67454,8.20111,Onshore,,,2023-09-20 13:00:00


## Relating to Footage

In [17]:
cam_videos_df.set_index('footage_timestamp', inplace=True)

In [18]:
cam_videos_df.head()

footage_timestamp,spot_id,cam_number,video_storage_location
2023-09-16 16:04:37.585,584204204e65fad6a7709084,1,../surf_cam_videos/584204204e65fad6a7709084_1_...
2023-09-16 15:04:33.989,584204204e65fad6a7709084,1,../surf_cam_videos/584204204e65fad6a7709084_1_...
2023-09-16 14:04:28.231,584204204e65fad6a7709084,1,../surf_cam_videos/584204204e65fad6a7709084_1_...
2023-09-16 13:04:26.214,584204204e65fad6a7709084,1,../surf_cam_videos/584204204e65fad6a7709084_1_...
2023-09-16 12:04:24.295,584204204e65fad6a7709084,1,../surf_cam_videos/584204204e65fad6a7709084_1_...


In [23]:
def nearest_video_location(spot_id, timestamp):
    # Get all videos for the given spot_id
    videos_for_spot = cam_videos_df[cam_videos_df['spot_id'] == spot_id]

    # Compute the absolute time difference to each video timestamp
    differences = abs(videos_for_spot.index - pd.Timestamp(timestamp))

    # Select by position rather than by label, so that duplicate footage
    # timestamps (e.g. several cams) still return a single storage location
    return videos_for_spot['video_storage_location'].iloc[differences.argmin()]

evaluation['nearest_neighbor_file'] = evaluation.apply(lambda row: nearest_video_location(row['spot_id'], row['nearest_neighbor_timestamp']), axis=1)
evaluation['file'] = evaluation.apply(lambda row: nearest_video_location(row['spot_id'], row.name), axis=1)
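
The lookup above always returns some video, even if the closest one is hours away. A minimal sketch of a variant that also reports the time gap is shown below, using `Index.get_indexer` with `method="nearest"`. The timestamps and file names are hypothetical, and this form assumes the footage index for a spot is sorted and has no duplicates.

```python
import pandas as pd

# Hypothetical footage timestamps for one spot (sorted, unique)
video_index = pd.DatetimeIndex([
    "2023-09-16 12:04:24", "2023-09-16 13:04:26", "2023-09-16 14:04:28",
])
# Hypothetical file names standing in for video_storage_location
paths = pd.Series(["v12.mp4", "v13.mp4", "v14.mp4"], index=video_index)

query = pd.Timestamp("2023-09-16 13:00:00")

# Position of the footage timestamp closest to the query
pos = video_index.get_indexer([query], method="nearest")[0]
nearest_path = paths.iloc[pos]

# Time gap between the query and the matched footage
gap = abs(video_index[pos] - query)
print(nearest_path, gap)
```

Comparing `gap` against the same 10-minute window used during cleaning would make it explicit when a returned file does not really correspond to the requested timestamp.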


In [24]:
evaluation.head()

forecast_timestamp,spot_id,surf_human_relation,surf_raw_min,surf_raw_max,wind_speed,wind_direction_type,tide_type,tide_height,nearest_neighbor_timestamp,nearest_neighbor_file,file
2023-09-18 08:00:00,5842041f4e65fad6a7708bc3,Waist to shoulder,3.37927,4.03543,4.4922,Onshore,,,2023-09-24 18:00:00,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...
2023-09-18 09:00:00,5842041f4e65fad6a7708bc3,Waist to chest,3.31365,3.96982,5.05388,Onshore,,,2023-09-24 18:00:00,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...
2023-09-18 10:00:00,5842041f4e65fad6a7708bc3,Waist to chest,3.31365,3.96982,6.5993,Onshore,,,2023-09-20 15:00:00,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...
2023-09-18 11:00:00,5842041f4e65fad6a7708bc3,Waist to chest,3.18241,3.80577,7.08996,Onshore,,,2023-09-20 15:00:00,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...
2023-09-18 12:00:00,5842041f4e65fad6a7708bc3,Waist to chest,3.08399,3.67454,8.20111,Onshore,,,2023-09-20 13:00:00,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...,../surf_cam_videos/5842041f4e65fad6a7708bc3_1_...


## Evaluation

In [26]:
# Take 5 random rows from `evaluation`
for _, row in evaluation.sample(5).iterrows():
    print(f"File: {row['file']}\nNN File: {row['nearest_neighbor_file']}\n")

File: ../surf_cam_videos/584204204e65fad6a7709084_1_20231002T140430972.mp4
NN File: ../surf_cam_videos/584204204e65fad6a7709084_1_20230929T170050105.mp4

File: ../surf_cam_videos/584204204e65fad6a7709084_1_20231002T090422588.mp4
NN File: ../surf_cam_videos/584204204e65fad6a7709084_1_20231001T170351625.mp4

File: ../surf_cam_videos/584204204e65fad6a7709084_1_20230918T080442664.mp4
NN File: ../surf_cam_videos/584204204e65fad6a7709084_1_20230924T180245419.mp4

File: ../surf_cam_videos/5842041f4e65fad6a7708bc3_1_20230918T180155861.mp4
NN File: ../surf_cam_videos/5842041f4e65fad6a7708bc3_1_20230924T110907183.mp4

File: ../surf_cam_videos/5842041f4e65fad6a7708bc3_1_20230918T100025666.mp4
NN File: ../surf_cam_videos/5842041f4e65fad6a7708bc3_1_20230920T150256428.mp4



These files have been included in `evaluation_files/`, with an `evaluation.txt` file recording these results. Because the sample is drawn at random without a fixed seed, a different set of files will be selected on each run.
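
If a repeatable selection is preferred, `DataFrame.sample` accepts a `random_state` argument. The sketch below uses a hypothetical frame of file names to show that the same seed yields the same rows; the seed value itself is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for the `evaluation` dataframe
toy_eval = pd.DataFrame({"file": [f"clip_{i}.mp4" for i in range(20)]})

# Sampling twice with the same seed selects the same rows in the same order
first = toy_eval.sample(5, random_state=0)
second = toy_eval.sample(5, random_state=0)

print(first.index.equals(second.index))
```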