### Split the data into training, validation and test sets

In [1]:
cd("$(homedir())/Documents/Repos/enso_project.jl")
using Pkg
Pkg.activate(".")

[32m[1m  Activating[22m[39m project at `C:\Users\lisah\Documents\Repos\enso_project.jl`


In [2]:
using CSV, DataFrames, enso_project

[33m[1m│ [22m[39m- Run `import Pkg; Pkg.add("cuDNN")` to install the cuDNN package, then restart julia.
[33m[1m│ [22m[39m- If cuDNN is not installed, some Flux functionalities will not be available when running on the GPU.
[33m[1m└ [22m[39m[90m@ FluxCUDAExt C:\Users\lisah\.julia\packages\Flux\9PibT\ext\FluxCUDAExt\FluxCUDAExt.jl:10[39m


In [3]:
# read embedded data.
df_embed = CSV.read("data/sst_data/sst_34_anomaly_embedded.txt", DataFrame; delim=',', ignorerepeated=true, header=0)

# read non-embedded data
df_1D = CSV.read("data/sst_data/sst_34_format.csv", DataFrame)

| Row | DATE (Date) | NINO3.4 (Float64) | ANOM_3 (Float64) |
|---|---|---|---|
| 1 | 1982-01-01 | 26.65 | 0.08 |
| 2 | 1982-02-01 | 26.54 | -0.2 |
| 3 | 1982-03-01 | 27.09 | -0.14 |
| 4 | 1982-04-01 | 27.83 | 0.02 |
| 5 | 1982-05-01 | 28.37 | 0.49 |
| 6 | 1982-06-01 | 28.35 | 0.65 |
| 7 | 1982-07-01 | 27.57 | 0.27 |
| 8 | 1982-08-01 | 27.76 | 0.86 |
| 9 | 1982-09-01 | 28.01 | 1.24 |
| 10 | 1982-10-01 | 28.5 | 1.73 |


We create several data splits so that we can compare training success as a function of the size of the training set.

The test set is fixed at 10% of the data. This corresponds to more than 4 years, whereas the literature suggests that ENSO prediction is not reliable more than 1 year ahead.

The sizes of the training and validation sets vary accordingly, which yields the following splits (training | validation | test, in %):
- 20 | 70 | 10
- 40 | 50 | 10
- 50 | 40 | 10
- 60 | 30 | 10
- 70 | 20 | 10
- 80 | 10 | 10
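To make the percentages above concrete, the following is a minimal sketch that converts them into row counts. The helper `split_sizes` and the value `n_rows = 500` are hypothetical and chosen only for illustration; they are not part of `enso_project`, and the real number of rows depends on the data file.

```julia
# Hypothetical helper: row counts for a training | validation | test split.
# Not part of enso_project; shown only to illustrate the arithmetic.
function split_sizes(n_rows::Integer, train_frac::Real; test_frac::Real=0.1)
    n_test = round(Int, test_frac * n_rows)
    n_train = round(Int, train_frac * n_rows)
    n_val = n_rows - n_train - n_test   # validation takes the remainder
    return (train=n_train, val=n_val, test=n_test)
end

n_rows = 500  # assumed number of monthly samples, for illustration only
for p in [0.2, 0.4, 0.5, 0.6, 0.7, 0.8]
    s = split_sizes(n_rows, p)
    # With monthly data, dividing the test rows by 12 gives the test span in years.
    println(round(Int, 100p), "% training: ", s, ", test span in years: ", s.test / 12)
end
```

Letting the validation set take the remainder guarantees that the three parts always add up to the full number of rows, even when rounding is involved.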

In [4]:
# embedded data

percentages = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8]  # training fractions

for p in percentages

    # create splits (validation takes what training and test leave over)
    train, val, test = enso_project.train_val_test_split(df_embed, val_percent=0.9-p , test_percent=0.1)

    # store; round avoids an InexactError if 100*p is not exactly an integer
    pct = round(Int, 100*p)
    outdir = "data/sst_34_data_split_$pct"
    mkpath(outdir)  # make sure the target directory exists before writing
    CSV.write("$outdir/train_sst_34_anomaly_embedded_$pct.txt", train)
    CSV.write("$outdir/val_sst_34_anomaly_embedded_$pct.txt", val)
    CSV.write("$outdir/test_sst_34_anomaly_embedded_$pct.txt", test)
end
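The internals of `enso_project.train_val_test_split` are not shown in this notebook. As a rough illustration of what such a split could look like for time series, here is a sketch of a chronological split without shuffling. The function `chronological_split` is hypothetical; it only mirrors the `val_percent` and `test_percent` keywords used in the call above and assumes the rows are ordered in time.

```julia
# Sketch of a chronological split; NOT the enso_project implementation.
# Assumes the data is ordered in time, so nothing is shuffled.
function chronological_split(data::AbstractVector; val_percent::Real, test_percent::Real)
    n = length(data)
    n_test = round(Int, test_percent * n)
    n_val = round(Int, val_percent * n)
    n_train = n - n_val - n_test          # training takes the remainder
    train = data[1:n_train]               # earliest samples
    val = data[n_train+1:n_train+n_val]   # samples after the training block
    test = data[n_train+n_val+1:end]      # most recent samples
    return train, val, test
end

series = collect(1:100)  # stand-in for a monthly time series
tr, va, te = chronological_split(series; val_percent=0.1, test_percent=0.1)
println(length(tr), " | ", length(va), " | ", length(te))
```

Keeping the test block at the end of the series means the model is evaluated on data that lies strictly after everything it was trained and validated on.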

In [6]:
# 1D data (only 80% split for now)

# create splits
train, val, test = enso_project.train_val_test_split(df_1D, val_percent=0.1 , test_percent=0.1)

# store
p = Int64(100*0.8)
CSV.write("data/sst_34_data_split_$p/train_sst_34_anomaly_$p.txt", train)
CSV.write("data/sst_34_data_split_$p/val_sst_34_anomaly_$p.txt", val)
CSV.write("data/sst_34_data_split_$p/test_sst_34_anomaly_$p.txt", test)

"data/sst_34_data_split_80/test_sst_34_anomaly_80.txt"