## How to load the different datasets in ML4AlgComb

This notebook shows how to load the datasets and how to create dataloaders that you can use for model training. 

In [9]:
%load_ext autoreload
%autoreload 2



In [10]:
from load_datasets import get_dataset
from dataloaders import CombDataModule, OneHotDataModule

Set FOLDER to the path of the folder that contains the datasets. It is left empty here.

In [4]:
FOLDER = ""

## Grassmannian cluster algebras

In [26]:
dataset_name = "grassmannian_cluster_algebras"
N = 12
X_train, y_train, X_test, y_test, input_size, output_size, num_tokens = get_dataset(data = dataset_name, n = N, folder = FOLDER)

Train set has 148658 examples
Test set has 37164 examples
Inputs are sequences of length 12, with 13 tokens, which represent 3x4 SSYT
There are 2 classes. SSYT that index a valid cluster variable are labeled 1 and SSYT that do not are labeled 0.
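
To make the input encoding concrete, the sketch below checks the semistandard condition on one flattened tableau. It assumes the 12 entries are the rows of the 3x4 tableau read left to right and top to bottom, which is consistent with the sample batch printed later in this section but is not confirmed by the loader. `is_ssyt` is a hypothetical helper, not part of `load_datasets`. Both classes consist of SSYT, so this check validates the encoding only and does not predict the label.

```python
def is_ssyt(seq, n_rows=3, n_cols=4):
    """Return True if a row-major flattened tableau is semistandard."""
    # Assumption: the sequence lists the tableau row by row.
    rows = [seq[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
    # Rows must be weakly increasing from left to right.
    rows_ok = all(a <= b for row in rows for a, b in zip(row, row[1:]))
    # Columns must be strictly increasing from top to bottom.
    cols_ok = all(
        rows[r][c] < rows[r + 1][c]
        for r in range(n_rows - 1)
        for c in range(n_cols)
    )
    return rows_ok and cols_ok


# First row of the sample training batch printed in this section.
example = [1, 2, 5, 7, 4, 4, 8, 10, 5, 11, 12, 12]
print(is_ssyt(example))
```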


In [7]:
# Number of examples labeled 1 across the train and test sets
sum(int(y == 1) for y in y_train) + sum(int(y == 1) for y in y_test)

92911

In [8]:
# Number of examples labeled 0 across the train and test sets
sum(int(y == 0) for y in y_train) + sum(int(y == 0) for y in y_test)

92911
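
The two counts above match, so this dataset is balanced. A single pass can produce all class counts at once. The sketch below uses `collections.Counter` on toy labels that stand in for `y_train` and `y_test`. `class_counts` is a hypothetical helper, and the `int(...)` conversion assumes each label is a scalar (a Python int, a NumPy scalar, or a zero-dimensional tensor).

```python
from collections import Counter


def class_counts(*label_arrays):
    """Count how often each label occurs across one or more label sequences."""
    counts = Counter()
    for labels in label_arrays:
        # Convert to plain ints so equal labels hash to the same key.
        counts.update(int(y) for y in labels)
    return counts


# Toy labels standing in for y_train and y_test.
toy_train = [1, 0, 1, 1, 0]
toy_test = [0, 0, 1]
print(class_counts(toy_train, toy_test))
```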

In [117]:
batch_size_choice = 32
data_module = CombDataModule(X_train, y_train, X_test, y_test, batch_size=batch_size_choice)
data_module.setup()
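
As a rough mental model of what a data module provides, the sketch below yields mini-batches of inputs and labels from two parallel lists. It is not the implementation of `CombDataModule`; `iterate_batches` is a hypothetical helper, and whether the real training loader shuffles is not stated in this notebook.

```python
import random


def iterate_batches(inputs, labels, batch_size, shuffle=True, seed=0):
    """Yield (input_batch, label_batch) pairs with at most batch_size examples."""
    order = list(range(len(inputs)))
    if shuffle:
        # A seeded generator keeps the order reproducible between runs.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [inputs[i] for i in idx], [labels[i] for i in idx]


# Toy data: 10 examples, so a batch size of 4 gives batches of 4, 4 and 2.
toy_X = [[i, i + 1] for i in range(10)]
toy_y = [i % 2 for i in range(10)]
batches = list(iterate_batches(toy_X, toy_y, batch_size=4))
print([len(xb) for xb, _ in batches])
```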

In [118]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[ 1.,  2.,  5.,  7.,  4.,  4.,  8., 10.,  5., 11., 12., 12.],
        [ 1.,  2.,  5.,  6.,  5.,  6.,  7.,  7.,  8.,  8.,  9., 11.],
        [ 2.,  4.,  6.,  7.,  3.,  5., 10., 11.,  7.,  9., 11., 12.],
        [ 1.,  2.,  2.,  2.,  2.,  6.,  6.,  9.,  7.,  9., 11., 11.],
        [ 1.,  1.,  6.,  9.,  4.,  7.,  7., 10.,  9., 10., 10., 12.],
        [ 6.,  6.,  6.,  6.,  8.,  9., 10., 11., 11., 11., 12., 12.],
        [ 1.,  2.,  3.,  4.,  2.,  3.,  7.,  8.,  5.,  6.,  8., 10.],
        [ 1.,  2.,  3.,  6.,  4.,  4.,  8.,  9.,  7.,  7., 10., 11.],
        [ 2.,  2.,  3.,  9.,  5.,  6.,  8., 11.,  9., 10., 11., 12.],
        [ 1.,  3.,  4.,  5.,  5.,  8.,  8., 11.,  6., 11., 12., 12.],
        [ 1.,  2.,  2.,  7.,  2.,  6.,  8., 10.,  7.,  8., 12., 12.],
        [ 1.,  4.,  6.,  7.,  6.,  6.,  9.,  9.,  7.,  8., 10., 12.],
        [ 1.,  3.,  4.,  4.,  3.,  6.,  7.,  9.,  5.,  9., 10., 12.],
        [ 1.,  2.,  6.,  7.,  3.,  4.,  9.,  9.,  5.,  5., 10., 10.],
        [ 1.,  2.,  

## KL polynomial coefficients

In [19]:
dataset_name = "kl_polynomial"
N = 8 #N = 8, 9 are supported
X_train, y_train, X_test, y_test, input_size, output_size, num_tokens = get_dataset(data = dataset_name, n = N, folder = FOLDER)

Train set has 67699 examples
Test set has 16925 examples
Inputs are sequences of length 16, representing two permutations on the letters 0 through 7
There are 10 classes, which each represent the fifth coefficient in the polynomial.
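
Each input concatenates two permutations, so a row can be split back into the pair. The sketch below assumes the first n entries form one permutation and the last n entries the other, which agrees with the sample batch printed later in this section. `split_permutations` is a hypothetical helper, not part of the repository.

```python
def split_permutations(seq, n):
    """Split a length-2n sequence into two permutations of 0..n-1."""
    if len(seq) != 2 * n:
        raise ValueError(f"expected length {2 * n}, got {len(seq)}")
    first = [int(v) for v in seq[:n]]
    second = [int(v) for v in seq[n:]]
    for perm in (first, second):
        # A permutation of 0..n-1 contains every letter exactly once.
        if sorted(perm) != list(range(n)):
            raise ValueError(f"{perm} is not a permutation of 0..{n - 1}")
    return first, second


# First row of the sample training batch printed in this section.
row = [2, 0, 7, 5, 4, 1, 6, 3, 7, 5, 6, 0, 4, 2, 3, 1]
perm_a, perm_b = split_permutations(row, n=8)
print(perm_a, perm_b)
```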


In [110]:
batch_size_choice = 32
data_module = CombDataModule(X_train, y_train, X_test, y_test, batch_size=batch_size_choice)
data_module.setup()

In [111]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[2., 0., 7., 5., 4., 1., 6., 3., 7., 5., 6., 0., 4., 2., 3., 1.],
        [1., 0., 3., 2., 5., 7., 6., 4., 7., 5., 6., 0., 3., 4., 1., 2.],
        [2., 5., 0., 1., 3., 7., 4., 6., 5., 7., 2., 3., 4., 6., 0., 1.],
        [3., 0., 2., 5., 1., 6., 7., 4., 5., 3., 6., 7., 0., 2., 4., 1.],
        [1., 0., 3., 5., 2., 4., 7., 6., 3., 1., 2., 7., 0., 5., 6., 4.],
        [1., 4., 2., 6., 3., 0., 7., 5., 6., 7., 4., 5., 1., 2., 3., 0.],
        [1., 0., 5., 4., 3., 2., 7., 6., 5., 1., 4., 7., 3., 2., 6., 0.],
        [1., 0., 3., 6., 5., 4., 2., 7., 3., 1., 6., 7., 4., 2., 0., 5.],
        [1., 4., 0., 3., 7., 2., 6., 5., 4., 7., 1., 5., 6., 2., 3., 0.],
        [0., 2., 3., 1., 4., 6., 5., 7., 2., 6., 7., 3., 4., 5., 0., 1.],
        [1., 0., 3., 2., 5., 6., 7., 4., 5., 6., 7., 1., 2., 3., 4., 0.],
        [2., 5., 4., 3., 0., 7., 1., 6., 4., 7., 5., 2., 3., 6., 0., 1.],
        [1., 6., 5., 0., 3., 2., 7., 4., 6., 7., 3., 1., 5., 2., 4., 0.],
        [0., 5., 1., 4., 3., 2., 7., 6

## Lattice path posets

In [25]:
dataset_name = "lattice_path"
N = 13 #N = 10, 11, 12, 13 supported
X_train, y_train, X_test, y_test, input_size, output_size, num_tokens = get_dataset(data = dataset_name, n = N, folder = FOLDER)

Train set has 497369 examples
Test set has 124369 examples
Inputs are two concatenated binary sequences representing a lattice path and its cover. The input for n=13 is length 75.
There are 2 classes. Lagrange covers are labeled 0, matching covers are labeled 1.


In [21]:
# Fraction of examples labeled 0; the dataset is not balanced
n_zero = sum(int(y == 0) for y in y_train) + sum(int(y == 0) for y in y_test)
n_zero / (len(y_train) + len(y_test))

0.6656405109547752
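
Because about two thirds of the labels are 0, a model that always predicts 0 already reaches roughly that accuracy on the full dataset, so trained models are best compared against this baseline. The sketch below computes a majority-class baseline on toy labels. `majority_baseline` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Return the most common training label and its accuracy on the test labels."""
    train = [int(y) for y in train_labels]
    test = [int(y) for y in test_labels]
    majority = Counter(train).most_common(1)[0][0]
    hits = sum(1 for y in test if y == majority)
    return majority, hits / len(test)


# Toy labels standing in for y_train and y_test.
toy_train = [0, 0, 1, 0, 1, 0]
toy_test = [0, 1, 0, 0]
print(majority_baseline(toy_train, toy_test))  # (0, 0.75)
```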

In [107]:
batch_size_choice = 32
data_module = CombDataModule(X_train, y_train, X_test, y_test, batch_size=batch_size_choice)
data_module.setup()

In [108]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 1., 0., 0.],
        ...,
        [1., 0., 0.,  ..., 1., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 1.],
        [0., 0., 1.,  ..., 1., 0., 0.]])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])


## mHeight

In [21]:
dataset_name = "mheight"
N = 11 #N = 10, 11, 12 are supported
X_train, y_train, X_test, y_test, input_size, output_size, num_tokens = get_dataset(data = dataset_name, n = N, folder = FOLDER)

Train set has 2627172 examples
Test set has 656791 examples
Input sequences are permutations represented by their inversion sequence, which is a binary sequence of length (11 choose 2)= 55.
There are 5 classes; classes that contained less than 0.01% of the data were filtered.
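
The input length follows from the encoding: one bit per pair of positions gives (n choose 2) bits. The sketch below builds such a sequence for a permutation. It assumes the bit for a pair i < j is 1 when the entry at position i exceeds the entry at position j, and that pairs are listed in lexicographic order. The dataset may order the pairs differently, so treat `inversion_sequence` as a hypothetical illustration only.

```python
from itertools import combinations


def inversion_sequence(perm):
    """Return one bit per pair i < j: 1 if perm[i] > perm[j], else 0."""
    # Assumption: pairs (i, j) are enumerated in lexicographic order.
    return [
        1 if perm[i] > perm[j] else 0
        for i, j in combinations(range(len(perm)), 2)
    ]


print(inversion_sequence([2, 0, 1]))  # [1, 1, 0]
print(len(inversion_sequence(list(range(11)))))  # 55
```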


In [27]:
batch_size_choice = 32
data_module = CombDataModule(X_train, y_train, X_test, y_test, batch_size=batch_size_choice)
data_module.setup()

In [29]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 1.,  ..., 0., 0., 1.],
        ...,
        [0., 0., 0.,  ..., 1., 1., 0.],
        [0., 1., 1.,  ..., 1., 0., 0.],
        [1., 1., 1.,  ..., 1., 0., 0.]])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])


## Quiver mutation equivalence

In [9]:
dataset_name = "quiver"
N = 11 #This is the only value of N supported
X_train, y_train, X_test, y_test, input_size, output_size, num_tokens = get_dataset(data = dataset_name, n = N, folder = FOLDER)

Train set has 163795 examples
Test set has 40944 examples
Input sequences of length 120 are flattened adjacency matrices with entries 0 through 5
There are 7 classes: A_11: 0, BD_11: 1, D_11: 2, BE_11: 3, BB_11: 4, E_11: 5, DE_11: 6
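
When inspecting predictions it can help to map the integer labels back to the class names listed above. The sketch below hard-codes that mapping from the loader's printed message. `QUIVER_CLASSES` and `decode_labels` are hypothetical names, not part of the repository.

```python
# Mapping copied from the message printed by get_dataset for the quiver dataset.
QUIVER_CLASSES = {
    0: "A_11",
    1: "BD_11",
    2: "D_11",
    3: "BE_11",
    4: "BB_11",
    5: "E_11",
    6: "DE_11",
}


def decode_labels(labels):
    """Translate integer class labels into mutation class names."""
    return [QUIVER_CLASSES[int(y)] for y in labels]


print(decode_labels([1, 1, 4, 4, 5]))
# ['BD_11', 'BD_11', 'BB_11', 'BB_11', 'E_11']
```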


In [11]:
# Fraction of examples in each class; the dataset is not balanced
total = len(y_train) + len(y_test)
for i in range(7):
    count = sum(int(y == i) for y in y_train) + sum(int(y == i) for y in y_test)
    print(count / total)

0.07289280498586005
0.14439359379502684
0.15661891481349424
0.13806846765882416
0.16734476577496227
0.1770400363389486
0.14364141663288382
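
One common way to compensate for this imbalance is to weight each class inversely to its frequency. The sketch below turns the fractions printed above into such weights, scaled so that the frequency-weighted average weight is 1. This is one possible choice and not something the notebook prescribes. `inverse_frequency_weights` is a hypothetical helper, and the resulting list could, for example, be passed to a weighted loss.

```python
def inverse_frequency_weights(fractions):
    """Return one weight per class, proportional to 1 / class frequency."""
    if any(f <= 0 for f in fractions):
        raise ValueError("every class needs a positive frequency")
    k = len(fractions)
    # Dividing by k makes the frequency-weighted mean of the weights equal 1.
    return [1.0 / (k * f) for f in fractions]


# Class fractions as printed above for the quiver dataset.
quiver_fractions = [
    0.07289280498586005,
    0.14439359379502684,
    0.15661891481349424,
    0.13806846765882416,
    0.16734476577496227,
    0.1770400363389486,
    0.14364141663288382,
]
weights = inverse_frequency_weights(quiver_fractions)
print([round(w, 3) for w in weights])
```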


In [104]:
batch_size_choice = 32
data_module = CombDataModule(X_train, y_train, X_test, y_test, batch_size=batch_size_choice)
data_module.setup()

In [105]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[2., 2., 2.,  ..., 2., 3., 3.],
        [2., 2., 2.,  ..., 2., 2., 2.],
        [2., 2., 2.,  ..., 2., 2., 2.],
        ...,
        [2., 2., 2.,  ..., 1., 2., 2.],
        [2., 2., 2.,  ..., 2., 3., 2.],
        [2., 2., 2.,  ..., 2., 3., 2.]])
tensor([1, 1, 4, 4, 5, 1, 3, 4, 3, 3, 6, 5, 4, 5, 4, 1, 3, 2, 4, 6, 6, 4, 2, 3,
        1, 1, 5, 3, 6, 5, 0, 5])


## RSK

In [12]:
N = 10
dataset_name = "rsk"
X_train, y_train, X_test, y_test, input_size, output_size, num_tokens = get_dataset(data = dataset_name, n = N, folder = FOLDER)

Train set has 2903040 examples
Test set has 725760 examples
Input sequence is length 66 with entries 0 through 12, representing two concatenated SSYT, padded so that all inputs have the same length.
Outputs are binary sequences of length 45. Output is one permutation represented by its inversion sequence.
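
A predicted output can be turned back into a permutation by counting, for each position, how many other positions hold a smaller value. The sketch below assumes one bit per pair i < j in lexicographic order, with 1 meaning that the entry at position i is larger. The dataset's actual pair ordering is not stated here, so `permutation_from_inversions` is a hypothetical illustration. Model outputs need not be consistent, so the function rejects bit patterns that do not correspond to a permutation.

```python
from itertools import combinations


def permutation_from_inversions(bits, n):
    """Rebuild a permutation of 0..n-1 from its pairwise inversion bits."""
    pairs = list(combinations(range(n), 2))
    if len(bits) != len(pairs):
        raise ValueError(f"expected {len(pairs)} bits, got {len(bits)}")
    # smaller[i] counts positions whose value is below the value at position i.
    smaller = [0] * n
    for (i, j), bit in zip(pairs, bits):
        if int(bit):
            smaller[i] += 1  # perm[i] > perm[j]
        else:
            smaller[j] += 1  # perm[i] < perm[j]
    if sorted(smaller) != list(range(n)):
        raise ValueError("bits are not consistent with any permutation")
    return smaller


print(permutation_from_inversions([1, 1, 0], n=3))  # [2, 0, 1]
```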


In [None]:
len(X_train) + len(X_test)

In [100]:
batch_size_choice = 32
data_module = CombDataModule(X_train, y_train, X_test, y_test, batch_size=batch_size_choice)
data_module.setup()

In [101]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[ 0.,  0.,  0.,  ..., 10., 10., 10.],
        [ 0.,  0.,  0.,  ..., 10., 10., 10.],
        [ 0.,  0.,  0.,  ..., 10., 10., 10.],
        ...,
        [ 0.,  0.,  0.,  ..., 10., 10., 10.],
        [ 0.,  0.,  0.,  ..., 10., 10., 10.],
        [ 0.,  0.,  0.,  ..., 10., 10., 10.]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
         1, 0, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 1, 1, 1],
        [0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 1, 1],
        [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
         0, 1, 1, 0],
        [0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
         1, 0, 1, 1],
        [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1,
         1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,
         0,

## Schubert polynomials

In [31]:
dataset_name = "schubert"
N = 5 #N = 4, 5, 6 are supported
X_train, y_train, X_test, y_test, input_size, output_size, num_tokens = get_dataset(data = dataset_name, n = N, folder = FOLDER )

Train set has 85620 examples
Test set has 21405 examples
Inputs are sequences of length 19, which represent three concatenated permutations on the letters 0 through 9.
There are 3 classes, which give the structure constant for the input permutations.


In [32]:
#N=6 has 6 classes? 

In [33]:
batch_size_choice = 32
data_module = CombDataModule(X_train, y_train, X_test, y_test, batch_size=batch_size_choice)
data_module.setup()

In [34]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[1., 4., 5., 3., 2., 5., 3., 2., 4., 1., 6., 5., 3., 4., 1., 2., 7., 8.,
         9.],
        [4., 3., 2., 5., 1., 2., 3., 5., 4., 1., 5., 6., 3., 4., 1., 2., 7., 8.,
         9.],
        [2., 1., 4., 5., 3., 5., 1., 2., 3., 4., 7., 4., 2., 1., 3., 5., 6., 8.,
         9.],
        [2., 1., 4., 5., 3., 1., 2., 4., 3., 5., 3., 2., 4., 5., 1., 6., 7., 8.,
         9.],
        [3., 2., 5., 1., 4., 1., 4., 2., 3., 5., 5., 3., 4., 1., 2., 6., 7., 8.,
         9.],
        [1., 5., 2., 4., 3., 5., 2., 3., 1., 4., 7., 4., 3., 1., 2., 5., 6., 8.,
         9.],
        [3., 1., 2., 4., 5., 4., 5., 1., 2., 3., 6., 4., 5., 2., 3., 1., 7., 8.,
         9.],
        [5., 3., 1., 4., 2., 1., 4., 5., 3., 2., 6., 4., 5., 2., 1., 3., 7., 8.,
         9.],
        [2., 1., 3., 5., 4., 2., 1., 4., 5., 3., 4., 1., 3., 5., 2., 6., 7., 8.,
         9.],
        [5., 3., 4., 2., 1., 1., 5., 4., 3., 2., 1., 7., 6., 3., 5., 2., 4., 8.,
         9.],
        [2., 1., 5., 4., 3., 1., 3., 5., 4., 2., 3

## Symmetric group character

In [24]:
dataset_name = "symmetric_group_char"
N = 18 #N = 18, 20, 22 supported
X_train, y_train, X_test, y_test, input_size, output_size, num_tokens = get_dataset(data = dataset_name, n = N, folder = FOLDER )

Train set has 112630 examples
Test set has 28216 examples
Inputs are sequences of length 36 with entries 0 through 18, which represent two concatenated integer partitions of n=18.
There are 589 classes for n=18.
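
The input length of 36 is consistent with each partition of 18 occupying 18 slots. The sketch below builds such an input under the assumption that each partition is written in weakly decreasing order and right-padded with zeros to length n. The loader's actual padding scheme is not confirmed in this notebook, and `encode_partition_pair` is a hypothetical helper.

```python
def encode_partition_pair(lam, mu, n):
    """Concatenate two partitions of n, each zero-padded to length n."""

    def pad(partition):
        if sum(partition) != n:
            raise ValueError(f"{partition} does not sum to {n}")
        if any(a < b for a, b in zip(partition, partition[1:])):
            raise ValueError(f"{partition} is not weakly decreasing")
        # A partition of n has at most n parts, so padding to n always fits.
        return list(partition) + [0] * (n - len(partition))

    return pad(lam) + pad(mu)


encoded = encode_partition_pair([13, 1, 1, 1, 1, 1], [18], n=18)
print(len(encoded))  # 36
```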


In [95]:
batch_size_choice = 32
data_module = CombDataModule(X_train, y_train, X_test, y_test, batch_size=batch_size_choice)
data_module.setup()

In [96]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[ 6.,  5.,  5.,  ...,  0.,  0.,  0.],
        [ 5.,  2.,  2.,  ...,  0.,  0.,  0.],
        [ 5.,  3.,  2.,  ...,  0.,  0.,  0.],
        ...,
        [ 5.,  5.,  3.,  ...,  0.,  0.,  0.],
        [ 8.,  7.,  2.,  ...,  0.,  0.,  0.],
        [13.,  1.,  1.,  ...,  0.,  0.,  0.]])
tensor([293, 294, 299, 295, 294, 294, 292, 294, 318, 348, 290, 293, 296, 249,
        294, 294, 294, 304, 302, 292, 294, 294, 294, 294, 294, 300, 297, 299,
        434, 294, 294, 280])


In [97]:
#The inputs can also be one-hot encoded; the input size then grows by a factor of num_tokens
data_module = OneHotDataModule(X_train, y_train, X_test, y_test, num_tokens, batch_size=batch_size_choice)
data_module.setup()
input_size = input_size*num_tokens
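
The multiplication by `num_tokens` reflects that each position expands into a vector with one slot per token. The sketch below shows that expansion for a single sequence, assuming the one-hot vectors are flattened position by position. The layout that `OneHotDataModule` actually uses is not shown in this notebook, and `one_hot_flatten` is a hypothetical helper.

```python
def one_hot_flatten(seq, num_tokens):
    """One-hot encode each token and flatten to length len(seq) * num_tokens."""
    out = [0.0] * (len(seq) * num_tokens)
    for pos, token in enumerate(seq):
        token = int(token)
        if not 0 <= token < num_tokens:
            raise ValueError(f"token {token} outside 0..{num_tokens - 1}")
        # Slot pos * num_tokens + token marks which token sits at position pos.
        out[pos * num_tokens + token] = 1.0
    return out


encoded = one_hot_flatten([2, 0, 1], num_tokens=3)
print(len(encoded))  # 9
print(encoded)
```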

In [98]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
tensor([294, 294, 294, 294, 295, 293, 293, 296, 294, 295, 295, 299, 294, 294,
        293, 294, 295, 293, 291, 294, 291, 294, 296, 294, 294, 290, 294, 294,
        294, 297, 292, 293])


## Weaving patterns

In [12]:
dataset_name = "weaving"
N = 6 #N = 6, 7, 8 supported
X_train, y_train, X_test, y_test, input_size, output_size, num_tokens = get_dataset(data = dataset_name, n = N, folder = FOLDER )

Train set has 1750 examples
Test set has 751 examples
Inputs are sequences of length 16 with entries between 0 and 5, representing weaving patterns.
There are 2 classes. Weaving patterns are labeled 1, non-weaving patterns are labeled 0.


In [143]:
batch_size_choice = 32
data_module = CombDataModule(X_train, y_train, X_test, y_test, batch_size=batch_size_choice)
data_module.setup()

In [144]:
for seq, labs in data_module.train_dataloader():
    print(seq)
    print(labs)
    break

tensor([[0., 1., 2., 3., 3., 2., 3., 4., 4., 3., 2., 3., 3., 2., 3., 2.],
        [2., 3., 4., 5., 3., 2., 1., 2., 2., 1., 2., 1., 3., 2., 3., 2.],
        [2., 3., 4., 3., 3., 2., 3., 2., 2., 3., 2., 1., 5., 4., 3., 2.],
        [2., 3., 2., 3., 3., 2., 1., 2., 2., 1., 2., 3., 3., 4., 3., 2.],
        [0., 1., 2., 3., 3., 2., 3., 2., 4., 3., 2., 3., 5., 4., 3., 2.],
        [2., 3., 2., 3., 1., 0., 1., 2., 2., 1., 2., 1., 3., 2., 1., 0.],
        [0., 1., 2., 3., 3., 4., 3., 4., 2., 3., 4., 3., 3., 4., 3., 2.],
        [2., 3., 4., 5., 1., 0., 1., 2., 4., 3., 4., 3., 3., 2., 3., 2.],
        [2., 3., 2., 3., 3., 4., 5., 4., 2., 3., 2., 3., 3., 2., 3., 2.],
        [2., 3., 2., 3., 3., 2., 1., 2., 2., 1., 2., 1., 3., 2., 3., 2.],
        [2., 1., 2., 3., 1., 2., 1., 2., 2., 1., 0., 1., 3., 2., 1., 0.],
        [2., 1., 2., 3., 3., 2., 3., 4., 4., 3., 4., 3., 3., 2., 1., 2.],
        [2., 3., 2., 3., 1., 2., 3., 2., 2., 3., 4., 3., 5., 4., 3., 2.],
        [0., 1., 2., 3., 1., 2., 3., 4