# What is SETDataset?
`SETDataset` is a Python class for building SET data in which patterns can easily be grokked and corrupted. To "grok" a pattern is to remove the pattern from the training set, and to "corrupt" a pattern is to flip its label. After performing the required grok and corrupt operations, we can generate TensorFlow datasets with a single call.
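
To make the two operations concrete, here is a minimal sketch in plain Python. It is not the `SETDataset` implementation: the `labels` dictionary and the `corrupt` and `grok` helpers are hypothetical stand-ins that only illustrate the definitions above.

```python
# Hypothetical stand-in for the bookkeeping; not the real SETDataset internals.
labels = {"GGG": 1, "GGP": -1, "GPR": 1, "PPP": 1}


def corrupt(labels, pattern):
    """Return a copy of `labels` with the label of `pattern` flipped (1 <-> -1)."""
    flipped = dict(labels)
    flipped[pattern] = -flipped[pattern]
    return flipped


def grok(training_patterns, pattern):
    """Return the training patterns with `pattern` held out."""
    return [p for p in training_patterns if p != pattern]


corrupted_labels = corrupt(labels, "GGG")
training_patterns = grok(list(labels), "GPR")

print(corrupted_labels["GGG"])  # -1
print(training_patterns)        # ['GGG', 'GGP', 'PPP']
```

Both helpers return new objects rather than mutating their inputs, which keeps the original labels available for comparison.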

## Imports

In [9]:
from jax import random
from src.task import SETDataset

## Instantiating SETDataset

In [10]:
key = random.PRNGKey(0)
set_dataset = SETDataset(key, 30, 15, 108)
set_dataset.print_training_testing()


TESTING DATA

Accepting Grid:
SET_combinations | Number of Trials | Status
GGG | 30 | 
GPR | 30 | 
GRP | 30 | 
PGR | 30 | 
PPP | 30 | 
PRG | 30 | 
RGP | 30 | 
RPG | 30 | 
RRR | 30 | 

Rejecting Grid:
SET_combinations | Number of Trials | Status
GGP | 15 | 
GGR | 15 | 
GPG | 15 | 
GPP | 15 | 
GRG | 15 | 
GRR | 15 | 
PGG | 15 | 
PGP | 15 | 
PPG | 15 | 
PPR | 15 | 
PRP | 15 | 
PRR | 15 | 
RGG | 15 | 
RGR | 15 | 
RPP | 15 | 
RPR | 15 | 
RRG | 15 | 
RRP | 15 | 

----------

TRAINING DATA

Accepting Grid:
SET_combinations | Number of Trials | Status
GGG | 1 | 
GPR | 1 | 
GRP | 1 | 
PGR | 1 | 
PPP | 1 | 
PRG | 1 | 
RGP | 1 | 
RPG | 1 | 
RRR | 1 | 

Rejecting Grid:
SET_combinations | Number of Trials | Status
GGP | 1 | 
GGR | 1 | 
GPG | 1 | 
GPP | 1 | 
GRG | 1 | 
GRR | 1 | 
PGG | 1 | 
PGP | 1 | 
PPG | 1 | 
PPR | 1 | 
PRP | 1 | 
PRR | 1 | 
RGG | 1 | 
RGR | 1 | 
RPP | 1 | 
RPR | 1 | 
RRG | 1 | 
RRP | 1 | 


Above, we can see what a standard single-attribute SET dataset looks like. All the patterns are in their respective accepting and rejecting categories.
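
The split into 9 accepting and 18 rejecting patterns is consistent with the usual SET rule: a triple is accepted when its three values are all the same or all different. The sketch below reproduces the grids under that assumption. `VALUES`, `is_set`, and `set_dict` are illustrative names, not part of `SETDataset`.

```python
from itertools import product

VALUES = "GPR"  # the three attribute values that appear in the tables above


def is_set(triple):
    """Accept a triple when its values are all the same or all different."""
    return len(set(triple)) in (1, 3)


# Map every 3-letter combination to a label: 1 (accepting) or -1 (rejecting).
set_dict = {
    "".join(triple): (1 if is_set(triple) else -1)
    for triple in product(VALUES, repeat=3)
}

accepting = sorted(name for name, label in set_dict.items() if label == 1)
rejecting = sorted(name for name, label in set_dict.items() if label == -1)

print(accepting)
# ['GGG', 'GPR', 'GRP', 'PGR', 'PPP', 'PRG', 'RGP', 'RPG', 'RRR']
print(len(accepting), len(rejecting))  # 9 18
```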

The `print_SET_dict` method below offers another view, listing the label and status of each individual SET.

In [11]:
set_dataset.print_SET_dict()

Contents SET_dict:
SET_combination | Label | Status
GGG | 1 | 
GGP | -1 | 
GGR | -1 | 
GPG | -1 | 
GPP | -1 | 
GPR | 1 | 
GRG | -1 | 
GRP | 1 | 
GRR | -1 | 
PGG | -1 | 
PGP | -1 | 
PGR | 1 | 
PPG | -1 | 
PPP | 1 | 
PPR | -1 | 
PRG | 1 | 
PRP | -1 | 
PRR | -1 | 
RGG | -1 | 
RGP | 1 | 
RGR | -1 | 
RPG | 1 | 
RPP | -1 | 
RPR | -1 | 
RRG | -1 | 
RRP | -1 | 
RRR | 1 | 


## Grokking and corrupting patterns

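Let's see how to grok and corrupt SET patterns. Below, `grok_SET(2)` and `corrupt_SET(3)` grok two patterns and corrupt three, as the Status column in the output shows.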

In [12]:
set_dataset.grok_SET(2)
set_dataset.corrupt_SET(3)
set_dataset.print_training_testing()


TESTING DATA

Accepting Grid:
SET_combinations | Number of Trials | Status
GPG | 30 | Corrupted
PGR | 30 | 
PRG | 30 | 
RGP | 30 | 
RPG | 30 | 
RRR | 30 | 

Rejecting Grid:
SET_combinations | Number of Trials | Status
GGG | 15 | Corrupted
GGP | 15 | 
GGR | 15 | 
GPP | 15 | 
GRG | 15 | 
GRR | 15 | 
PGG | 15 | 
PGP | 15 | 
PPG | 15 | 
PPP | 15 | Corrupted
PPR | 15 | 
PRP | 15 | 
PRR | 15 | 
RGG | 15 | 
RGR | 15 | 
RPP | 15 | 
RPR | 15 | 
RRG | 15 | 
RRP | 15 | 

----------

TRAINING DATA

Accepting Grid:
SET_combinations | Number of Trials | Status
GPG | 1 | Corrupted
GPR | 1 | Grokked
GRP | 1 | Grokked
PGR | 1 | 
PRG | 1 | 
RGP | 1 | 
RPG | 1 | 
RRR | 1 | 

Rejecting Grid:
SET_combinations | Number of Trials | Status
GGG | 1 | Corrupted
GGP | 1 | 
GGR | 1 | 
GPP | 1 | 
GRG | 1 | 
GRR | 1 | 
PGG | 1 | 
PGP | 1 | 
PPG | 1 | 
PPP | 1 | Corrupted
PPR | 1 | 
PRP | 1 | 
PRR | 1 | 
RGG | 1 | 
RGR | 1 | 
RPP | 1 | 
RPR | 1 | 
RRG | 1 | 
RRP | 1 | 


In [13]:
set_dataset.print_SET_dict()

Contents SET_dict:
SET_combination | Label | Status
GGG | -1 | Corrupted
GGP | -1 | 
GGR | -1 | 
GPG | 1 | Corrupted
GPP | -1 | 
GPR | 1 | Grokked
GRG | -1 | 
GRP | 1 | Grokked
GRR | -1 | 
PGG | -1 | 
PGP | -1 | 
PGR | 1 | 
PPG | -1 | 
PPP | -1 | Corrupted
PPR | -1 | 
PRG | 1 | 
PRP | -1 | 
PRR | -1 | 
RGG | -1 | 
RGP | 1 | 
RGR | -1 | 
RPG | 1 | 
RPP | -1 | 
RPR | -1 | 
RRG | -1 | 
RRP | -1 | 
RRR | 1 | 


As we can see, it's quite easy to grok and corrupt random SET patterns. One issue to keep an eye on is that accepting SET patterns could be altered disproportionately often compared to rejecting patterns.
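
One way to check for this imbalance is to tally the altered patterns by their original label. The sketch below does so for the run above, reading the grokked and corrupted patterns off the Status column. It assumes the original labels follow the same-or-all-different rule seen in the first `SET_dict` listing, and `altered_fraction` is a hypothetical helper, not a `SETDataset` method.

```python
from itertools import product

# Original labels: 1 when the three values are all the same or all different.
original = {
    "".join(triple): (1 if len(set(triple)) in (1, 3) else -1)
    for triple in product("GPR", repeat=3)
}

# Patterns altered in the run above, read off the Status column.
grokked = {"GPR", "GRP"}
corrupted = {"GGG", "GPG", "PPP"}
altered = grokked | corrupted


def altered_fraction(label):
    """Fraction of patterns with the given original label that were altered."""
    group = [name for name, lab in original.items() if lab == label]
    return sum(name in altered for name in group) / len(group)


accepting_fraction = altered_fraction(1)
rejecting_fraction = altered_fraction(-1)

print(f"accepting altered: {accepting_fraction:.2f}")  # accepting altered: 0.44
print(f"rejecting altered: {rejecting_fraction:.2f}")  # rejecting altered: 0.06
```

In this run, four of the nine originally accepting patterns were altered, against one of the eighteen rejecting patterns.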

## Generating TensorFlow datasets

In [14]:
training_tf_dataset, testing_tf_dataset = set_dataset.tf_datasets()

In [15]:
print('Training data in TensorFlow')
for feature, label in training_tf_dataset.as_numpy_iterator():
    print('----------')
    print(feature.shape)
    print(label.shape)

Training data in TensorFlow
----------
(108, 50, 100)
(108, 5, 1)
----------
(108, 50, 100)
(108, 5, 1)
----------
(108, 50, 100)
(108, 5, 1)
----------
(108, 50, 100)
(108, 5, 1)


In [16]:
print('Testing data in TensorFlow')
for feature, label in testing_tf_dataset.as_numpy_iterator():
    print('----------')
    print(feature.shape)
    print(label.shape)

Testing data in TensorFlow
----------
(27, 50, 100)
(27, 5, 1)


Pretty easy, right?