# TAYSIR competition - Track 1 Starter Kit

### Welcome!

This notebook shows how to structure the code needed to participate in the competition.

You can also check the baseline notebook (available in the same archive) for more details about the TAYSIR models and how to use them.

## Prepare your environment

In [1]:
%pip install --upgrade mlflow torch transformers



Note: you may need to restart the kernel to use updated packages.


In [2]:
import torch
import mlflow
from utils import predict, PytorchInference
#print('PyTorch version :', torch.__version__)
#print('MLflow version :', mlflow.__version__)
import sys
import pandas as pd
#print("Your python version:", sys.version)

This notebook was tested with:
* Torch version: 1.11.0+cu102
* MLFlow version: 1.25.1
* Python version: 3.8.10 [GCC 9.4.0]

Python versions starting at 3.7 are expected to work (but have not been tested).

## Choosing the phase

First, select one of the phases/datasets we provide.

In [3]:
TRACK = 1  # always 1 for this track
DATASET = 1

In [4]:
from transformers import AutoTokenizer, DistilBertForSequenceClassification

model_name = f"models/1.{DATASET}.taysir.model"

model = mlflow.pytorch.load_model(model_name)

model.eval()



TNetwork(
  7, 2, patience=5, bidirectional=True
  (mach[0]): RNN(6, 32, batch_first=True, bidirectional=True)
  (dense): Sequential(
    (0): Linear(in_features=64, out_features=2, bias=True)
    (1): Sigmoid()
    (2): Softmax(dim=-1)
  )
)

In [5]:
try:  # RNN
    nb_letters = model.input_size - 1
    cell_type = model.cell_type

    print("The alphabet contains", nb_letters, "symbols.")
    print("The type of the recurrent cells is", cell_type.__name__)
except AttributeError:  # no input_size/cell_type attribute: treat the model as a transformer
    nb_letters = model.distilbert.config.vocab_size - 2
    print("The alphabet contains", nb_letters, "symbols.")
    print("The model is a transformer (DistilBertForSequenceClassification)")

The alphabet contains 6 symbols.
The type of the recurrent cells is RNN


## Load the data

The input data is in the following format :

```
[Number of sequences] [Alphabet size]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
...
[Length of sequence] [List of symbols]
```

For example the following data :

```
5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
```

contains 5 sequences over an alphabet of size 10, so symbols range from 0 to 9. The first sequence line starts with the length field 6, followed by the symbols 8 6 5 1 6 7 4 9. Note that 8 is the start symbol and 9 is the end symbol.
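
As an illustration of the format, the example above can be parsed from memory with the standard library only. This is a minimal sketch: the names `EXAMPLE` and `parse_words` are hypothetical, and the declared length field is dropped without being checked.

```python
import io

# The example data from above, held in memory instead of a file.
EXAMPLE = """5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
"""


def parse_words(stream):
    """Return (alphabet_size, sequences) read from a stream in the format above."""
    header = stream.readline().split()
    n_sequences, alphabet_size = int(header[0]), int(header[1])
    sequences = []
    for line in stream:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        # The first field is the declared length; the rest are the symbols.
        sequences.append([int(token) for token in fields[1:]])
    if len(sequences) != n_sequences:
        raise ValueError("header announces a different number of sequences")
    return alphabet_size, sequences


alphabet_size, example_sequences = parse_words(io.StringIO(EXAMPLE))
# The two largest symbols are reserved as start and end markers.
start_symbol, end_symbol = alphabet_size - 2, alphabet_size - 1
print(alphabet_size, len(example_sequences), start_symbol, end_symbol)
```

Every parsed sequence begins with `start_symbol` and ends with `end_symbol`.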

In [6]:
from pythautomata.base_types.alphabet import Alphabet

file = f"datasets/1.{DATASET}.taysir.valid.words"

alphabet = None
sequences = []

# In the competition the empty sequence is defined as [alphabet_size - 2, alphabet_size - 1]
# For example, with an alphabet of size 22 the empty sequence is [20, 21]
empty_sequence_len = 2

with open(file) as f:
    a = f.readline() # Read the first line (number of sequences, alphabet size)
    headline = a.split(' ')
    alphabet_size = int(headline[1].strip())
    alphabet = Alphabet.from_strings([str(x) for x in range(alphabet_size - empty_sequence_len)])
    
    for line in f:
        line = line.strip()
        seq = line.split(' ')
        seq = [int(i) for i in seq[1:]] #Remove first value (length of sequence) and cast to int
        sequences.append(seq)

The variable *sequences* is thus **a list of lists** of integers, each one including its start and end symbols.
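
Since every sequence carries its start and end symbols, it can be handy to build the empty sequence or to strip the markers. The helpers below are a hypothetical sketch (they are not part of the starter kit) and assume the convention stated in the loading cell: the start symbol is `alphabet_size - 2` and the end symbol is `alphabet_size - 1`.

```python
def empty_sequence(alphabet_size):
    """Return the competition's empty sequence: start symbol then end symbol."""
    return [alphabet_size - 2, alphabet_size - 1]


def strip_markers(sequence, alphabet_size):
    """Return the sequence without its leading start and trailing end symbol."""
    start, end = alphabet_size - 2, alphabet_size - 1
    if len(sequence) < 2 or sequence[0] != start or sequence[-1] != end:
        raise ValueError("sequence is not wrapped in start/end symbols")
    return sequence[1:-1]


print(empty_sequence(22))              # the example from the loading cell
print(strip_markers([4, 2, 1, 5], 6))  # inner symbols only
```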

In [7]:
print('Number of sequences:', len(sequences))
print('10 first sequences:')
for i in range(10):
    print(sequences[i])

Number of sequences: 10000
10 first sequences:
[4, 2, 1, 3, 1, 3, 0, 1, 2, 2, 3, 2, 2, 3, 2, 1, 3, 1, 2, 0, 2, 5]
[4, 0, 0, 0, 0, 0, 3, 3, 1, 2, 1, 3, 3, 1, 1, 0, 0, 2, 2, 0, 3, 5]
[4, 0, 1, 2, 3, 0, 3, 2, 0, 3, 2, 1, 0, 3, 2, 2, 2, 1, 0, 1, 2, 5]
[4, 0, 3, 0, 1, 3, 3, 3, 0, 1, 3, 2, 0, 2, 0, 2, 2, 3, 3, 3, 1, 5]
[4, 3, 1, 0, 1, 0, 3, 2, 2, 0, 1, 1, 0, 0, 0, 1, 3, 0, 1, 2, 2, 5]
[4, 2, 2, 3, 2, 0, 0, 2, 2, 2, 2, 1, 3, 0, 0, 3, 2, 2, 3, 3, 1, 5]
[4, 3, 1, 2, 0, 1, 3, 3, 0, 0, 2, 1, 1, 3, 0, 2, 0, 1, 0, 3, 2, 5]
[4, 1, 0, 1, 0, 0, 0, 0, 0, 3, 0, 2, 2, 3, 0, 2, 1, 1, 2, 3, 2, 5]
[4, 3, 2, 3, 0, 2, 0, 3, 0, 3, 2, 1, 2, 1, 1, 2, 2, 2, 3, 1, 0, 5]
[4, 3, 1, 1, 1, 2, 0, 3, 3, 1, 2, 3, 1, 1, 0, 1, 0, 3, 0, 1, 2, 5]


## Model extraction

This is where you will extract your own simple model.

### Define a Learner, a Comparator, and a Teacher

In [8]:
from pymodelextractor.teachers.pac_comparison_strategy import PACComparisonStrategy
from pymodelextractor.teachers.general_teacher import GeneralTeacher
from pymodelextractor.factories.lstar_factory import LStarFactory
from pythautomata.utilities.uniform_length_sequence_generator import UniformLengthSequenceGenerator
from pymodelextractor.learners.observation_tree_learners.kearns_vazirani_learner import KearnsVaziraniLearner

name = "Track: " + str(TRACK) + " - DataSet: " + str(DATASET)

target_model = PytorchInference(alphabet, model, name)

sequence_generator = UniformLengthSequenceGenerator(alphabet, max_seq_length=1000, min_seq_length=900)

comparator = PACComparisonStrategy(target_model_alphabet = alphabet, epsilon = 0.01, delta = 0.01, 
                                   sequence_generator = sequence_generator)

teacher = GeneralTeacher(target_model, comparator)

# Choose algorithm (LStar, KV)
algorithm = 'LStar'

if algorithm == 'LStar':
    # LStar learner
    learner = LStarFactory.get_dfa_lstar_learner(max_time=5)
elif algorithm == 'KV':
    # Kearns Vazirani learner
    learner = KearnsVaziraniLearner()
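
To give an intuition for the `epsilon` and `delta` parameters of the comparator, here is a sketch of a sampling-based equivalence check. It uses the classical bound from Angluin's PAC variant of L*; whether `PACComparisonStrategy` uses exactly this formula is an assumption, and all names below are hypothetical.

```python
import math
import random


def pac_sample_size(epsilon, delta, query_index=1):
    """Classical bound: samples needed for the query_index-th equivalence query."""
    return math.ceil((math.log(1 / delta) + query_index * math.log(2)) / epsilon)


def find_counterexample(target, hypothesis, draw_sequence, n_samples):
    """Return a sampled sequence on which the two models disagree, else None."""
    for _ in range(n_samples):
        sequence = draw_sequence()
        if target(sequence) != hypothesis(sequence):
            return sequence
    return None


rng = random.Random(0)


def draw_sequence():
    # Toy generator: 1 to 5 symbols drawn uniformly from {0, 1, 2, 3}.
    return [rng.randrange(4) for _ in range(rng.randint(1, 5))]


def toy_target(sequence):
    return sequence[-1] % 2 == 0  # accepts when the last symbol is even


def toy_hypothesis(sequence):
    return True  # deliberately wrong: accepts everything


n_samples = pac_sample_size(epsilon=0.01, delta=0.01)
counterexample = find_counterexample(toy_target, toy_hypothesis, draw_sequence, n_samples)
print(n_samples, counterexample)
```

If no counterexample is found among the samples, the hypothesis is accepted as probably approximately correct with respect to the distribution of the sequence generator, which is why the choice of generator matters.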

### Learn and extract

In [9]:
if algorithm == 'LStar':
    res = learner.learn(teacher, log_hierachy=1)
elif algorithm == 'KV':
    res = learner.learn(teacher)

**** Started lstar learning ****


### Save Observation Table in pickle (only for LStar)

In [11]:
import pickle

if(algorithm == 'LStar'):
    obs_table = res.info['observation_table']

    with open('predicted_models/observation_table.pickle', 'wb') as handle:
        pickle.dump(obs_table, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load data example

with open('predicted_models/observation_table.pickle', 'rb') as handle:
    unserialized_data = pickle.load(handle)

### Test extracted model

In [12]:
from utils import test_model_w_data, test_model
import numpy as np

result = test_model_w_data(target_model, res.model, sequences)
np.mean(result)

1.0

In [13]:
result = test_model(target_model, res.model, max_seq_len=1000, min_seq_len=50, sequence_amount=1000)
np.mean(result)

0.612

In [14]:
result = test_model(target_model, res.model, max_seq_len=1000, min_seq_len=900, sequence_amount=1000)
np.mean(result)

0.59
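
The scores above are means over per-sequence results. As a sketch of the idea, and assuming the `utils` helpers return one agreement flag per sequence (their implementation is not shown here), the computation could look like this; `agreement` and `accuracy` are hypothetical names.

```python
def agreement(model_a, model_b, sequences):
    """Return one boolean per sequence: True where both models give the same answer."""
    return [model_a(sequence) == model_b(sequence) for sequence in sequences]


def accuracy(flags):
    """Fraction of sequences on which the two models agree."""
    return sum(flags) / len(flags)


def even_length(sequence):
    return len(sequence) % 2 == 0  # toy target model


def accept_all(sequence):
    return True  # toy extracted model


flags = agreement(even_length, accept_all, [[1], [1, 2], [1, 2, 3], []])
print(flags, accuracy(flags))
```

The score depends on the sequences used: the same pair of models can agree on one sample and disagree often on another drawn from a different length range.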

### Final response info

In [15]:
print("Final model:")
print("- State amount -> " + str(len(res.model.states)))
print("- Membership queries -> " + str(res.info["membership_queries_count"]))
print("- Equivalence queries -> " + str(res.info["equivalence_queries_count"]))

Final model:
- State amount -> 2
- Membership queries -> 9
- Equivalence queries -> 1


In [16]:
max_exp = 0
for k in res.info['observation_table'].exp:
    if len(k) > max_exp:
        max_exp = len(k)
        
max_blue = 0
for k in res.info['observation_table'].blue:
    if len(k) > max_blue:
        max_blue = len(k)
        
print("Max sequence len: " + str(max_blue + max_exp))
print("Table rows: " + str(len(res.info['observation_table'].blue)+len(res.info['observation_table'].red)))
print("Table columns: " + str(len(res.info['observation_table'].exp)))

Max sequence len: 2
Table rows: 9
Table columns: 1


# Submission
Once you are satisfied with your model's performance, write a predict function that takes a sequence (a list of integers) as input. For the binary classification track it must return 0 or 1 (integer type); for the language modeling track it must return a probability (float type).

Your model is **NOT** a parameter of this function. You do NOT need to handle MLflow saving here.

*For instance, if you want to submit the original model:*
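
A minimal sketch of the required shape for the binary classification track is given below. `StandInModel` and `stand_in` are hypothetical placeholders so that the snippet runs without the competition files; in practice the function would call the model loaded earlier in the notebook.

```python
class StandInModel:
    """Hypothetical placeholder for a model loaded elsewhere in the notebook."""

    def accepts(self, sequence):
        # Toy rule for illustration only: accept sequences of even length.
        return len(sequence) % 2 == 0


# The model lives at module level; it is NOT a parameter of predict.
stand_in = StandInModel()


def predict(sequence):
    """Take a list of integers and return 0 or 1 as a plain Python int."""
    return int(stand_in.accepts(sequence))


print(predict([4, 0, 5]), predict([4, 5]))
```

Note that defining `predict` in the notebook shadows the `predict` imported from `utils` at the top.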

## Save and submit 
This cell creates the model archive needed for submission to the competition.

Run the cell below to save the predicted model.

In [17]:
from wrapper import MlflowDFA
from submit_tools_fix import save_function

mlflow_dfa = MlflowDFA(res.model)
save_function(mlflow_dfa, len(res.model.alphabet), target_model.name)

FileNotFoundError: [Errno 2] No such file or directory: 'predicted_models/Track: 1 - DataSet: 1.zip'

### Export model

In [None]:
from pythautomata.model_exporters.dot_exporters.dfa_dot_exporting_strategy import DfaDotExportingStrategy
res.model._exporting_strategies = [DfaDotExportingStrategy()]
#res.model.export()

In [1]:
import pickle
table_0, table_1, table_2 = None, None, None
with open('predicted_models/observation_table_0.pickle', 'rb') as handle:
    table_0 = pickle.load(handle)
    
with open('predicted_models/observation_table_1.pickle', 'rb') as handle:
    table_1 = pickle.load(handle)
    
with open('predicted_models/observation_table_2.pickle', 'rb') as handle:
    table_2 = pickle.load(handle)
    
print(table_0)
print(table_1)
print(table_2)


Observation Table:
RED: {0, ϵ}
BLUE: {0,3, 3, 1, 0,1, 0,2, 2, 0,0}
EXP: [ϵ]
OBSERVATIONS: {ϵ: [False], 0: [True], 1: [False], 3: [True], 2: [True], 0,0: [True], 0,1: [False], 0,3: [True], 0,2: [True]}


Observation Table:
RED: {0, ϵ}
BLUE: {0,3, 3, 1, 0,1, 0,2, 2, 0,0}
EXP: [ϵ]
OBSERVATIONS: {ϵ: [False], 0: [True], 1: [False], 3: [True], 2: [True], 0,0: [True], 0,1: [False], 0,3: [True], 0,2: [True]}


Observation Table:
RED: {0, ϵ}
BLUE: {0,3, 3, 1, 0,1, 0,2, 2, 0,0}
EXP: [ϵ]
OBSERVATIONS: {ϵ: [False], 0: [True], 1: [False], 3: [True], 2: [True], 0,0: [True], 0,1: [False], 0,3: [True], 0,2: [True]}
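
To see how such a table maps to an automaton, the sketch below applies the textbook L* hypothesis construction to the values printed above. It is not the library's code: words are encoded as tuples, the alphabet {0, 1, 2, 3} is read off the table, and the table is assumed to be closed with ϵ as its first experiment.

```python
EPS = ()  # the empty word ϵ

RED = [EPS, (0,)]
ALPHABET = [0, 1, 2, 3]
OBSERVATIONS = {
    EPS: [False], (0,): [True], (1,): [False], (3,): [True], (2,): [True],
    (0, 0): [True], (0, 1): [False], (0, 3): [True], (0, 2): [True],
}


def build_dfa(red, observations, alphabet):
    """Return (initial_state, transitions, accepting_states) for a closed table."""
    # One state per distinct row among the RED prefixes.
    state_of_row = {}
    for prefix in red:
        state_of_row.setdefault(tuple(observations[prefix]), prefix)
    transitions = {}
    for prefix in state_of_row.values():
        for symbol in alphabet:
            successor_row = tuple(observations[prefix + (symbol,)])
            transitions[(prefix, symbol)] = state_of_row[successor_row]
    # The first experiment is ϵ, so row[0] tells whether the state accepts.
    accepting = {prefix for row, prefix in state_of_row.items() if row[0]}
    initial = state_of_row[tuple(observations[EPS])]
    return initial, transitions, accepting


def accepts(dfa, word):
    """Run the word through the DFA and report whether it ends in an accepting state."""
    state, transitions, accepting = dfa
    for symbol in word:
        state = transitions[(state, symbol)]
    return state in accepting


dfa = build_dfa(RED, OBSERVATIONS, ALPHABET)
print(len(set(dfa[1].values())), accepts(dfa, (1, 0)), accepts(dfa, (0, 1)))
```

With these observations the construction yields two states, and a word is accepted exactly when its last symbol is 0, 2, or 3.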



In [51]:
import mlflow
import pickle
from utils import predict, PytorchInference
import numpy as np
from wrapper import MlflowDFA
from submit_tools_fix import save_function
from pythautomata.utilities.uniform_word_sequence_generator import UniformWordSequenceGenerator
from pythautomata.model_exporters.dot_exporters.dfa_dot_exporting_strategy import DfaDotExportingStrategy
from pymodelextractor.teachers.pac_comparison_strategy import PACComparisonStrategy
from pymodelextractor.teachers.general_teacher import GeneralTeacher
from pymodelextractor.factories.lstar_factory import LStarFactory
from pythautomata.base_types.alphabet import Alphabet
from utils import test_model
from pymodelextractor.learners.observation_table_learners.translators.partial_dfa_translator import PartialDFATranslator

TRACK = 1  # always 1 for this track
DATASET = 11

max_extraction_time = 60  # seconds; use 2 * 60 * 60 for a 2-hour run
max_sequence_len = 80
min_sequence_len = 10

counter = 0
observation_table = None

model_name = f"models/1.{DATASET}.taysir.model"
model = mlflow.pytorch.load_model(model_name)
model.eval()

file = f"datasets/1.{DATASET}.taysir.valid.words"

empty_sequence_len = 2
with open(file) as f:
    a = f.readline() # Read the first line (number of sequences, alphabet size)
    headline = a.split(' ')
    alphabet_size = int(headline[1].strip())
    alphabet = Alphabet.from_strings([str(x) for x in range(alphabet_size - empty_sequence_len)])

name = "Track: " + str(TRACK) + " - DataSet: " + str(DATASET) + "-  partial n° " + str(counter)
target_model = PytorchInference(alphabet, model, name)

sequence_generator = UniformWordSequenceGenerator(alphabet, max_seq_length=max_sequence_len,
                                                        min_seq_length=min_sequence_len)

comparator = PACComparisonStrategy(target_model_alphabet = alphabet, epsilon = 0.01, delta = 0.01,
                                   sequence_generator = sequence_generator)

teacher = GeneralTeacher(target_model, comparator)

learner = LStarFactory.get_partial_dfa_lstar_learner(max_time=max_extraction_time)

res = learner.learn(teacher, observation_table)

#print(observation_table)



In [52]:
from wrapper import MlflowDFA
from submit_tools_fix import save_function

mlflow_dfa = MlflowDFA(res.model)
save_function(mlflow_dfa, len(res.model.alphabet), target_model.name)

Submission created at predicted_models/Track: 1 - DataSet: 11-  partial n° 0.zip.


In [57]:
print(res.info)

{'equivalence_queries_count': 1, 'membership_queries_count': 26, 'observation_table': <pymodelextractor.learners.observation_table_learners.general_observation_table.GeneralObservationTable object at 0x7f2440c0e4f0>, 'duration': 60}


In [54]:
print(len(res.model.states))

1


In [55]:
res.model.name = "Dataset11-1Acc"
res.model.export()