In [112]:
using Plots
using Random
using StatsBase
using JLD2
using DelimitedFiles

# Italy Power Demand (Again)

In [113]:
# Each row is one sample: the first column is the class label, the rest is the time series.
train_data = readdlm("ItalyPowerDemand_TRAIN.txt")
test_data = readdlm("ItalyPowerDemand_TEST.txt")
X_train = train_data[:, 2:end]
y_train = Int.(train_data[:, 1]);
X_test = test_data[:, 2:end]
y_test = Int.(test_data[:, 1])

1029-element Vector{Int64}:
 2
 2
 2
 2
 2
 1
 2
 2
 2
 2
 ⋮
 1
 1
 2
 1
 2
 1
 2
 2
 2
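
As a small illustration of the loading step, here is a sketch that parses two made-up rows in the same layout the cell above assumes (whitespace-delimited, label in the first column). The string `raw` and its values are invented for the example and are not taken from the dataset.

```julia
using DelimitedFiles

# Two invented rows: a class label followed by three measurements.
raw = "1 0.5 0.6 0.7\n2 0.1 0.2 0.3\n"

data = readdlm(IOBuffer(raw))   # all-numeric input is parsed into a Float64 matrix
X = data[:, 2:end]              # every column except the first: the series values
y = Int.(data[:, 1])            # first column: labels, converted from Float64 to Int
```

`Int.` throws an `InexactError` if a label is not a whole number, which acts as a cheap sanity check on the label column.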

The labels are 1 and 2 at this point. They are relabelled as 0 and 1 further down, after the split.

Let's concatenate the train and test sets to get a larger pool of samples for generative modelling. Classification is not our goal here, so merging the original splits is fine.

In [36]:
samps_combined = vcat(X_train, X_test);
labels_combined = vcat(y_train, y_test);

In [38]:
countmap(labels_combined)

Dict{Int64, Int64} with 2 entries:
  2 => 549
  1 => 547

Nice and balanced. Now let's make a subset of 400 samples per class for training:

In [62]:
rng = MersenneTwister(42)

MersenneTwister(42)

In [63]:
class_1_all_idxs = findall(==(1), labels_combined);
class_2_all_idxs = findall(==(2), labels_combined);

In [64]:
train_idxs_c1 = sample(rng, class_1_all_idxs, 400; replace=false);
train_idxs_c2 = sample(rng, class_2_all_idxs, 400; replace=false);

In [68]:
test_idxs_c1 = setdiff(class_1_all_idxs, train_idxs_c1)
test_idxs_c2 = setdiff(class_2_all_idxs, train_idxs_c2)

149-element Vector{Int64}:
    4
   36
   46
   48
   49
   52
   68
   69
   74
   76
    ⋮
 1048
 1061
 1067
 1076
 1077
 1082
 1083
 1085
 1092

Make the training and test subsets:

In [81]:
X_train = vcat(samps_combined[train_idxs_c1, :], samps_combined[train_idxs_c2, :])
y_train = vcat(labels_combined[train_idxs_c1], labels_combined[train_idxs_c2])
X_test = vcat(samps_combined[test_idxs_c1, :], samps_combined[test_idxs_c2, :])
y_test = vcat(labels_combined[test_idxs_c1], labels_combined[test_idxs_c2])

296-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 2
 2
 2
 2
 2
 2
 2
 2
 2

Double-check the class distributions:

In [82]:
countmap(y_train)

Dict{Int64, Int64} with 2 entries:
  2 => 400
  1 => 400

In [83]:
countmap(y_test)

Dict{Int64, Int64} with 2 entries:
  2 => 149
  1 => 147
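
The per-class sampling and `setdiff` steps above could be wrapped in one function. The following is only a sketch: `stratified_split` is a hypothetical helper, not something the notebook defines, and it draws indices with `shuffle` from `Random` rather than `sample` from StatsBase, so it will not reproduce the exact indices chosen above.

```julia
using Random

# Hypothetical helper: choose `n_per_class` indices of each label for training
# and keep the remaining indices of that label for testing.
function stratified_split(rng::AbstractRNG, labels::AbstractVector, n_per_class::Integer)
    train_idxs = Int[]
    test_idxs = Int[]
    for c in sort(unique(labels))
        class_idxs = findall(==(c), labels)
        # Shuffling and taking a prefix is sampling without replacement.
        picked = shuffle(rng, class_idxs)[1:n_per_class]
        append!(train_idxs, picked)
        append!(test_idxs, setdiff(class_idxs, picked))
    end
    return train_idxs, test_idxs
end

# Toy labels (invented): six samples of class 1 and four of class 2.
toy_labels = [1, 2, 1, 2, 1, 2, 1, 2, 1, 1]
train_idxs, test_idxs = stratified_split(MersenneTwister(0), toy_labels, 3)
```

With three samples per class requested, the toy split has six training indices and four test indices, and the two sets do not overlap.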

Shuffle the data:

In [86]:
shuffled_train_idxs = randperm(rng, length(y_train));
shuffled_test_idxs = randperm(rng, length(y_test));

In [95]:
X_train_final = X_train[shuffled_train_idxs, :]
y_train_final = y_train[shuffled_train_idxs]

800-element Vector{Int64}:
 1
 1
 1
 1
 1
 2
 1
 1
 1
 1
 ⋮
 2
 2
 1
 2
 1
 2
 1
 1
 1

In [96]:
X_test_final = X_test[shuffled_test_idxs, :]
y_test_final = y_test[shuffled_test_idxs]

296-element Vector{Int64}:
 1
 2
 2
 2
 2
 1
 2
 2
 2
 1
 ⋮
 2
 1
 1
 1
 2
 1
 1
 2
 2
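
Indexing the rows of `X` and the entries of `y` with the same permutation is what keeps each sample attached to its label. A minimal check on invented data, where row `i` starts with `10i` so that a mismatch would be easy to spot (the seed and the arrays are assumptions made for the sketch):

```julia
using Random

rng = MersenneTwister(1)                             # arbitrary seed for the sketch
X = [10.0 11.0; 20.0 21.0; 30.0 31.0; 40.0 41.0]     # row i starts with 10i
y = [1, 2, 3, 4]                                     # label i belongs to row i

perm = randperm(rng, length(y))   # one permutation, reused for both arrays
X_shuffled = X[perm, :]
y_shuffled = y[perm]

# True when every shuffled row still sits next to its own label.
aligned = all(X_shuffled[i, 1] == 10 * y_shuffled[i] for i in eachindex(y_shuffled))
```

Drawing a separate permutation for `X` and for `y` would break this pairing.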

Relabel the classes as 0 or 1:

In [98]:
y_train_final .-= 1

800-element Vector{Int64}:
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
 ⋮
 1
 1
 0
 1
 0
 1
 0
 0
 0

In [99]:
y_test_final .-= 1

296-element Vector{Int64}:
 0
 1
 1
 1
 1
 0
 1
 1
 1
 0
 ⋮
 1
 0
 0
 0
 1
 0
 0
 1
 1
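
One detail that makes the in-place relabelling safe: indexing with a vector of indices allocates a new array, so `.-= 1` on the shuffled copy leaves the array it was built from untouched. Plain assignment, by contrast, only adds a second name for the same array. A small sketch with invented values:

```julia
y = [1, 2, 2, 1]
idxs = [3, 1, 4, 2]

y_final = y[idxs]   # indexing with a vector allocates a new array
y_final .-= 1       # modifies y_final in place; y keeps its values

y_alias = y         # plain assignment: another name for the same array
```

Here `y` is still `[1, 2, 2, 1]` after the update, while `y_final` holds the shifted labels.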

Save the final train and test sets:

In [106]:
@save "./train.jld2" X_train_final y_train_final

In [107]:
@save "./test.jld2" X_test_final y_test_final

In [108]:
X_train = X_train_final
y_train = y_train_final

800-element Vector{Int64}:
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
 ⋮
 1
 1
 0
 1
 0
 1
 0
 0
 0

In [109]:
X_test = X_test_final
y_test = y_test_final

296-element Vector{Int64}:
 0
 1
 1
 1
 1
 0
 1
 1
 1
 0
 ⋮
 1
 0
 0
 0
 1
 0
 0
 1
 1

In [111]:
# Overwrites the earlier test.jld2; the arrays are now stored under the names X_test and y_test.
@save "./test.jld2" X_test y_test
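
`@save` stores each array under the name of the variable passed to it, so the names are needed again when reading the file back. The sketch below is a round trip on invented stand-in arrays written to a temporary directory, so it does not touch the files saved above; the path and values are assumptions for illustration only.

```julia
using JLD2

# Toy stand-ins for the real arrays.
X_test = [0.1 0.2; 0.3 0.4]
y_test = [0, 1]

path = joinpath(mktempdir(), "test.jld2")   # temporary file, not the notebook's ./test.jld2
@save path X_test y_test

# Open read-only, list the stored names, and read both arrays back.
stored_names, X_loaded, y_loaded = jldopen(path, "r") do f
    (sort(collect(keys(f))), f["X_test"], f["y_test"])
end
```

Listing `keys(f)` first is a quick way to see which names a file holds before loading anything from it.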