# TAYSIR competition - Track 2 Example Extraction

### Welcome!

This notebook shows the structure of the code needed to participate in the competition.

You can also check the baseline notebook (available in the same archive) for more details about the TAYSIR models and how to use them.

## Prepare your environment

In [None]:
#!pip install mlflow torch

In [None]:
!pip install --upgrade pymodelextractor

In [1]:
import torch
import mlflow
print('PyTorch version :', torch.__version__)
print('MLflow version :', mlflow.__version__)
import sys
print("Your python version:", sys.version)

PyTorch version : 2.1.0.dev20230403+cpu
MLflow version : 2.2.2
Your python version: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) 
[GCC 9.4.0]


In [2]:
torch.set_num_threads(4)

This notebook was tested with:
* Torch version: 1.11.0+cu102
* MLFlow version: 1.25.1
* Python version: 3.8.10 [GCC 9.4.0]

Python versions starting at 3.7 are expected to work (but have not been tested).

## Choosing the phase

First, you must select one of the phases/datasets we provide.

In [3]:
dataset_amount = 10
for ds in range(1,dataset_amount+1):
    DATASET = ds
    model_name = f"models/2.{DATASET}.taysir.model"
    model = mlflow.pytorch.load_model(model_name)
    
    print("\n")
    print("Model:", ds)
    print(model.eval())
    try:#RNN
        nb_letters = model.input_size -1
        cell_type = model.cell_type

        print("The alphabet contains", nb_letters, "symbols.")
        print("The type of the recurrent cells is", cell_type.__name__)
    except AttributeError:  # the model has no RNN attributes, so treat it as the transformer
        nb_letters = model.distilbert.config.vocab_size
        print("The alphabet contains", nb_letters, "symbols.")
        print("The model is a transformer (DistilBertForTokenClassification)")





Model: 1
TNetwork(
  23, 22, n_layers=2, neurons_per_layer=64, batch_size=64, patience=5, split_dense=True, task=lm
  (mach[0]): RNN(22, 64, batch_first=True)
  (mach[1]): RNN(64, 64, batch_first=True)
  (dense): Sequential(
    (0): Linear(in_features=64, out_features=32, bias=True)
    (1): Linear(in_features=32, out_features=22, bias=True)
    (2): Sigmoid()
    (3): Softmax(dim=-1)
  )
)
The alphabet contains 22 symbols.
The type of the recurrent cells is RNN


Model: 2
TNetwork(
  10, 9, n_layers=2, neurons_per_layer=256, cell_type=lstmx.LSTMx, batch_size=64, patience=5, split_dense=True, task=lm
  (mach[0]): LSTMx(
    9, 256, batch_first=True
    (drop_layer): Dropout(p=0, inplace=False)
    (forward_layers[0]): LSTMCell(9, 256)
  )
  (mach[1]): LSTMx(
    256, 256, batch_first=True
    (drop_layer): Dropout(p=0, inplace=False)
    (forward_layers[0]): LSTMCell(256, 256)
  )
  (dense): Sequential(
    (0): Linear(in_features=256, out_features=128, bias=True)
    (1): Linear(in



symbols.
The type of the recurrent cells is RNN


Model: 9
TNetwork(
  23, 22, cell_type=torch.nn.modules.rnn.GRU, batch_size=64, patience=5, bidirectional=True, task=lm
  (mach[0]): GRU(22, 32, batch_first=True, bidirectional=True)
  (dense): Sequential(
    (0): Linear(in_features=64, out_features=22, bias=True)
    (1): Sigmoid()
    (2): Softmax(dim=-1)
  )
)
The alphabet contains 22 symbols.
The type of the recurrent cells is GRU


Model: 10
DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(64, 256, padding_idx=0)
      (position_embeddings): Embedding(512, 256)
      (LayerNorm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-7): 8 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): L

In [4]:
TRACK = 2 #always for this track
DATASET = 1

In [5]:
model_name = f"models/{TRACK}.{DATASET}.taysir.model"

model = mlflow.pytorch.load_model(model_name)
model.eval()



TNetwork(
  23, 22, n_layers=2, neurons_per_layer=64, batch_size=64, patience=5, split_dense=True, task=lm
  (mach[0]): RNN(22, 64, batch_first=True)
  (mach[1]): RNN(64, 64, batch_first=True)
  (dense): Sequential(
    (0): Linear(in_features=64, out_features=32, bias=True)
    (1): Linear(in_features=32, out_features=22, bias=True)
    (2): Sigmoid()
    (3): Softmax(dim=-1)
  )
)

In [6]:
if not hasattr(model, 'distilbert'):#RNN
    nb_letters = model.input_size -1
    cell_type = model.cell_type

    print("The alphabet contains", nb_letters, "symbols.")
    print("The type of the recurrent cells is", cell_type.__name__)
else:
    nb_letters = model.distilbert.config.vocab_size
    print("The alphabet contains", nb_letters, "symbols.")
    print("The model is a transformer (DistilBertForTokenClassification)")

The alphabet contains 22 symbols.
The type of the recurrent cells is RNN


## Load the data

The input data is in the following format:

```
[Number of sequences] [Alphabet size]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
...
[Length of sequence] [List of symbols]
```

For example, the following data:

```
5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
```

is composed of 5 sequences and has an alphabet size of 10 (so symbols are between 0 and 9). The first sequence is declared with length 6 and reads 8 6 5 1 6 7 4 9; note that 8 is the start symbol and 9 is the end symbol.
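
As a minimal sketch of the format just described, the snippet below parses the five-sequence example from an in-memory string instead of a file. The function name `parse_taysir_words` is an illustrative choice, not part of the competition toolkit; it records the symbols listed on each line and ignores the declared length.

```python
import io

EXAMPLE = """5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
"""

def parse_taysir_words(stream):
    """Return (alphabet_size, sequences) read from a TAYSIR-style word file."""
    header = stream.readline().split()
    n_sequences, alphabet_size = int(header[0]), int(header[1])
    sequences = []
    for line in stream:
        fields = line.split()
        if not fields:  # tolerate blank lines
            continue
        # fields[0] is the declared length; the remaining fields are the symbols
        sequences.append([int(tok) for tok in fields[1:]])
    if len(sequences) != n_sequences:
        raise ValueError(
            f"header announces {n_sequences} sequences, found {len(sequences)}"
        )
    return alphabet_size, sequences

example_alphabet_size, parsed = parse_taysir_words(io.StringIO(EXAMPLE))
print(example_alphabet_size, len(parsed), parsed[3])
```

Every parsed sequence starts with 8 and ends with 9, the start and end symbols of this example.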

In [7]:
from pythautomata.base_types.alphabet import Alphabet

file = f"datasets/{TRACK}.{DATASET}.taysir.valid.words"

alphabet = None
sequences = []

#In the competition the empty sequence is defined as [alphabet_size - 2, alphabet_size - 1]
#For example, with an alphabet of size 22, the empty sequence is [20, 21]
empty_sequence_len = 2

with open(file) as f:
    a = f.readline() # Read the header line (number of sequences, alphabet size)
    headline = a.split(' ')
    alphabet_size = int(headline[1].strip())
    alphabet = Alphabet.from_strings([str(x) for x in range(alphabet_size - empty_sequence_len)])
    
    for line in f:
        line = line.strip()
        seq = line.split(' ')
        seq = [int(i) for i in seq[1:]] #Remove first value (length of sequence) and cast to int
        sequences.append(seq)

The variable *sequences* is thus **a list of lists**.
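
Since every sequence carries a start and an end marker, it can help to separate them from the payload. The sketch below assumes, as the comment in the loading cell states, that the start symbol is `alphabet_size - 2` and the end symbol is `alphabet_size - 1`; `strip_markers` is a hypothetical helper, not part of the provided code.

```python
def strip_markers(seq, alphabet_size):
    """Drop the start/end markers and return the inner symbols.

    Assumes start == alphabet_size - 2 and end == alphabet_size - 1.
    """
    start, end = alphabet_size - 2, alphabet_size - 1
    if len(seq) < 2 or seq[0] != start or seq[-1] != end:
        raise ValueError(f"sequence {seq} is not wrapped in [{start}, ..., {end}]")
    return seq[1:-1]

# With an alphabet of size 22, the empty sequence is [20, 21].
print(strip_markers([20, 21], 22))             # []
print(strip_markers([20, 13, 6, 15, 21], 22))  # [13, 6, 15]
```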

In [8]:
print('Number of sequences:', len(sequences))
print('10 first sequences:')
for i in range(10):
    print(sequences[i])

Number of sequences: 9090
10 first sequences:
[20, 13, 14, 6, 0, 15, 4, 3, 5, 12, 13, 13, 14, 4, 12, 17, 21]
[20, 3, 13, 3, 16, 6, 4, 13, 1, 21]
[20, 13, 6, 15, 21]
[20, 13, 10, 3, 21]
[20, 13, 10, 3, 16, 6, 4, 13, 13, 12, 17, 4, 13, 14, 10, 0, 10, 13, 14, 4, 15, 12, 17, 21]
[20, 3, 5, 0, 1, 4, 13, 6, 14, 4, 14, 4, 14, 13, 10, 12, 1, 5, 10, 3, 14, 5, 12, 14, 1, 12, 11, 12, 17, 18, 8, 21]
[20, 3, 13, 3, 19, 1, 4, 3, 5, 10, 3, 19, 8, 21]
[20, 13, 0, 1, 3, 1, 13, 3, 16, 6, 4, 13, 1, 12, 8, 0, 5, 10, 14, 12, 10, 3, 14, 1, 21]
[20, 13, 14, 14, 6, 3, 21]
[20, 13, 12, 13, 3, 16, 3, 16, 21]


## Model extraction

This is where you will extract your own simple model.

In [9]:
from pythautomata.model_comparators.wfa_partition_comparison_strategy import WFAPartitionComparator
from pythautomata.utilities.probability_partitioner import QuantizationProbabilityPartitioner
#from pythautomata.model_exporters.wfa_image_exporter_with_partition_mapper import WFAImageExporterWithPartitionMapper
from pythautomata.base_types.symbol import SymbolStr
from pythautomata.utilities.uniform_length_sequence_generator import UniformLengthSequenceGenerator

from pymodelextractor.learners.observation_tree_learners.bounded_pdfa_quantization_n_ary_tree_learner import BoundedPDFAQuantizationNAryTreeLearner
from pymodelextractor.teachers.pac_batch_probabilistic_teacher import PACBatchProbabilisticTeacher
from pymodelextractor.teachers.pac_probabilistic_teacher import PACProbabilisticTeacher
from pymodelextractor.utils.pickle_data_loader import PickleDataLoader

from utils import predict
from pytorch_language_model import PytorchLanguageModel

name = "track_" + str(TRACK) + "_dataset_" + str(DATASET)

target_model = PytorchLanguageModel(alphabet, model, name)


In [10]:
import numpy as np
import utils

In [11]:
utils.full_next_symbols_probas([20,1], model)

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 1.        , 0.        ],
       [0.        , 0.03931347, 0.03931397, 0.03931346, 0.10686506,
        0.03931346, 0.03931346, 0.03931346, 0.03931346, 0.03931346,
        0.03931346, 0.03931353, 0.03931346, 0.03931359, 0.10686506,
        0.03931346, 0.03931346, 0.03931346, 0.03931346, 0.03931346,
        0.03931346, 0.03931346, 0.03931346]], dtype=float32)

In [12]:
np.argmax(utils.full_next_symbols_probas([20,0], model)[0])

21

In [13]:
utils_seq_proba = utils.sequence_probability([20,0,1,0,21], model)
utils_seq_proba

4.528450062935008e-06
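
The probability returned above can be read as a chain-rule product of next-symbol probabilities. The sketch below illustrates that idea on a toy matrix. It assumes the layout suggested by the `full_next_symbols_probas` output above, where row `i` scores the `i`-th symbol of the sequence and symbol `s` sits in column `s + 1`. It is not the implementation in `utils`.

```python
import math

def chain_rule_probability(step_probas, seq):
    """Multiply, for each position i, the probability assigned to seq[i].

    step_probas[i] is the distribution over the symbol at position i;
    column s + 1 holds symbol s (column 0 is assumed to be unused).
    Summing logarithms avoids underflow on long sequences.
    """
    log_p = 0.0
    for i, symbol in enumerate(seq):
        p = step_probas[i][symbol + 1]
        if p == 0.0:
            return 0.0
        log_p += math.log(p)
    return math.exp(log_p)

# Toy example with 3 symbols (0, 1, 2), so each row has 4 columns.
toy = [
    [0.0, 0.0, 0.0, 1.0],    # position 0: symbol 2 (the start marker) is certain
    [0.0, 0.5, 0.25, 0.25],  # position 1
    [0.0, 0.2, 0.8, 0.0],    # position 2
]
toy_probability = chain_rule_probability(toy, [2, 0, 1])  # 1.0 * 0.5 * 0.8
print(toy_probability)
```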

In [14]:
from pythautomata.base_types.alphabet import Alphabet
from pythautomata.base_types.symbol import SymbolStr
from pythautomata.base_types.sequence import Sequence

def get_alphabet_and_validation_sequences(ds):
    file = f"datasets/2.{ds}.taysir.valid.words"
    alphabet = None
    sequences = []

    empty_sequence_len = 2
    with open(file) as f:
        a = f.readline()
        headline = a.split(' ')
        alphabet_size = int(headline[1].strip())
        alphabet = Alphabet.from_strings([str(x) for x in range(alphabet_size - empty_sequence_len)])
        for line in f:
            line = line.strip()
            seq = line.split(' ')[1:]  # drop the leading length field
            seq = [SymbolStr(i) for i in seq[1:]]  # also drop the start symbol; the end symbol is kept
            sequences.append(Sequence(seq))
    return alphabet, sequences

In [15]:
_, pythaut_sequences = get_alphabet_and_validation_sequences(DATASET)

In [16]:
suffixes = [target_model.terminal_symbol]
for symbol in target_model.alphabet.symbols:
    suffixes.append(Sequence((symbol,)))

In [17]:
seq_0 = Sequence([SymbolStr('0')])
seq_010 = Sequence([SymbolStr('0'), SymbolStr('1'), SymbolStr('0')])
seq_eps = Sequence([])

In [18]:
symbols = list(target_model.alphabet.symbols)
symbols.sort()
symbols = [target_model.terminal_symbol] + symbols
target_model.get_last_token_weights(seq_0, symbols)

[0.039313461631536484,
 0.03931346535682678,
 0.03931397199630737,
 0.039313528686761856,
 0.039313461631536484,
 0.03931359201669693,
 0.1068650633096695,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.1068650633096695,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484]

In [19]:
target_model.sequence_probability(seq_010)

4.528450062935008e-06

In [20]:
target_model.get_last_token_weights(seq_0, symbols)

[0.039313461631536484,
 0.03931346535682678,
 0.03931397199630737,
 0.039313528686761856,
 0.039313461631536484,
 0.03931359201669693,
 0.1068650633096695,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.1068650633096695,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484,
 0.039313461631536484]

In [41]:
utils.sequence_probability([0,0], model)

0.02963464893400669

In [22]:
utils.next_symbols_probas([0,1,0], model)

array([0.        , 0.04060161, 0.04060161, 0.04060161, 0.04060161,
       0.04060161, 0.08435925, 0.04060161, 0.04060161, 0.04060161,
       0.04060161, 0.04432814, 0.04060161, 0.06651058, 0.05050921,
       0.04060161, 0.04060161, 0.04060161, 0.04060161, 0.04060161,
       0.04060161, 0.04060161, 0.06406554], dtype=float32)

In [23]:
symbols

[21, 0, 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9]
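
The order printed above is what sorting symbols by their string value produces: `'10'` sorts before `'2'`. The plain-Python sketch below reproduces it with ordinary strings, assuming `SymbolStr` compares like the string it wraps, and shows a numeric sort key for comparison. This matters when a list of weights is matched to symbols by position.

```python
labels = [str(i) for i in range(20)]

lexicographic = sorted(labels)     # string comparison: '10' < '2'
numeric = sorted(labels, key=int)  # integer comparison: 2 < 10

print(lexicographic)
print(numeric)
```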

In [24]:
suffixes

[21, 2, 18, 6, 5, 12, 17, 4, 0, 13, 7, 3, 9, 16, 8, 15, 19, 10, 14, 1, 11]

In [25]:
target_model.get_last_token_weights_batch([seq_0], suffixes)

[[0.03931346,
  0.03931346,
  0.03931346,
  0.03931346,
  0.03931346,
  0.039313592,
  0.03931346,
  0.03931346,
  0.039313465,
  0.10686506,
  0.03931346,
  0.10686506,
  0.03931346,
  0.03931346,
  0.03931346,
  0.03931346,
  0.03931346,
  0.03931353,
  0.03931346,
  0.039313972,
  0.03931346]]

In [26]:
#from last_token_weights_pickle_dataset_generator import LastTokenWeightsPickleDataSetGenerator
#LastTokenWeightsPickleDataSetGenerator().genearte_dataset(target_model, 1000, "./test",10)

In [27]:
print(type(alphabet))

<class 'pythautomata.base_types.alphabet.Alphabet'>


In [28]:
from pythautomata.automata.wheighted_automaton_definition.probabilistic_deterministic_finite_automaton import \
    ProbabilisticDeterministicFiniteAutomaton as PDFA
from pythautomata.automata.wheighted_automaton_definition.weighted_state import WeightedState
from pythautomata.model_comparators.wfa_tolerance_comparison_strategy import WFAToleranceComparator

from pymodelextractor.teachers.sample_batch_probabilistic_teacher import SampleBatchProbabilisticTeacher

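weight = 1.0/(len(alphabet)+1)  # uniform weight over all symbols plus termination
weight = 0  # overrides the uniform weight: every sequence then gets probability 0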
q0 = WeightedState("q0", 1, weight)
for symbol in alphabet.symbols:
    q0.add_transition(symbol, q0, weight)
states = {q0}
comparator = WFAToleranceComparator()
testPDFA = PDFA(alphabet, states, target_model.terminal_symbol, comparator, "Test", check_is_probabilistic = False)


In [29]:
testPDFA.sequence_probability(seq_0)

0.0
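
To see why the probability is `0.0`, consider a single-state automaton in which every symbol transition and the stop event share the same weight `w`: a sequence of length `n` then gets probability `w ** (n + 1)`. The sketch below is a simplified stand-in for `testPDFA` that ignores the library's data structures and assumes the third argument of `WeightedState` is the stopping weight. With `w = 0`, as set in the cell that builds `testPDFA`, every sequence has probability zero, while `w = 1 / (len(alphabet) + 1)` gives a total mass that approaches 1.

```python
def single_state_probability(weight, length):
    """Probability of any sequence of `length` symbols when each transition
    and the stop event all carry the same `weight`."""
    return weight ** (length + 1)

def truncated_total_mass(n_symbols, weight, max_length):
    """Sum the probabilities of all sequences up to `max_length`.

    There are n_symbols ** n sequences of length n, each with
    probability weight ** (n + 1).
    """
    return sum(weight * (n_symbols * weight) ** n for n in range(max_length + 1))

n_symbols = 20
print(single_state_probability(0.0, 1))
print(single_state_probability(1.0 / (n_symbols + 1), 1))
print(truncated_total_mass(n_symbols, 1.0 / (n_symbols + 1), 2000))
```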

In [32]:
epsilon = 0.5
delta = 0.5
max_states = 1000000
max_query_length= 1000000
max_secs = 120
sequence_generator = UniformLengthSequenceGenerator(alphabet, max_seq_length=3, min_seq_length=2)
dataloader = None
#dataloader = PickleDataLoader("./test")

partitioner = QuantizationProbabilityPartitioner(10)
comparator = WFAPartitionComparator(partitioner)
#teacher = SampleBatchProbabilisticTeacher(model = target_model, comparator = comparator, sequence_generator=sequence_generator, max_seq_length=2, full_prefix_set=True,  cache_from_dataloader=dataloader)
teacher  = PACProbabilisticTeacher(target_model, epsilon = epsilon, delta = delta, max_seq_length = None, comparator = comparator, sequence_generator=sequence_generator, compute_epsilon_star=False, cache_from_dataloader=dataloader)
learner = BoundedPDFAQuantizationNAryTreeLearner(partitioner, max_states, max_query_length, max_secs, generate_partial_hipothesis = False, pre_cache_queries_for_building_hipothesis = False,  check_probabilistic_hipothesis = False)
learning_result = learner.learn(teacher)     
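
For intuition about `epsilon` and `delta`: in the classic PAC treatment of equivalence queries (Angluin's sampling argument), the i-th equivalence query is approximated by drawing about `(ln(1/delta) + i * ln 2) / epsilon` random sequences. The sketch below evaluates that textbook bound. Whether `PACProbabilisticTeacher` uses exactly this formula is an assumption that has not been checked here. With `epsilon = delta = 0.5`, only a handful of samples are needed per query, which is fast but gives weak guarantees.

```python
import math

def pac_sample_size(epsilon, delta, query_index):
    """Textbook sample size for the query_index-th simulated equivalence query."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    if query_index < 1:
        raise ValueError("query_index starts at 1")
    return math.ceil((math.log(1 / delta) + query_index * math.log(2)) / epsilon)

for eps, dlt in [(0.5, 0.5), (0.1, 0.1), (0.01, 0.01)]:
    print(eps, dlt, [pac_sample_size(eps, dlt, i) for i in (1, 2, 3)])
```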

In [None]:
#teacher2  = PACBatchProbabilisticTeacher(target_model, epsilon = epsilon, delta = delta, max_seq_length = None, comparator = comparator, sequence_generator=sequence_generator, compute_epsilon_star=False, cache_from_dataloader=dataloader)
#learner2 = BoundedPDFAQuantizationNAryTreeLearner(partitioner, max_states, max_query_length, max_secs, generate_partial_hipothesis = False, pre_cache_queries_for_building_hipothesis = False,  check_probabilistic_hipothesis = False)
#learning_result = learner2.learn(teacher2)     

In [34]:
learned_pdfa = learning_result.model
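
The learner groups next-symbol distributions through the partitioner. One way to picture a quantization partition with `k` buckets is to map each probability `p` to the interval index `floor(p * k)` and to call two distributions equivalent when all their indices agree. The sketch below follows that reading. It is an assumption about the general idea, not a copy of `QuantizationProbabilityPartitioner`.

```python
import math

def quantize(distribution, k):
    """Map each probability to one of k equal-width intervals of [0, 1]."""
    # min(...) places p == 1.0 in the last interval instead of index k.
    return tuple(min(int(math.floor(p * k)), k - 1) for p in distribution)

def same_partition(dist_a, dist_b, k):
    """Two distributions are equivalent if all their interval indices agree."""
    return quantize(dist_a, k) == quantize(dist_b, k)

a = [0.04, 0.11, 0.85]
b = [0.07, 0.12, 0.81]  # close to a: same intervals
c = [0.04, 0.21, 0.75]  # second and third entries fall in other intervals
print(quantize(a, 10), same_partition(a, b, 10), same_partition(a, c, 10))
```

A larger `k` separates more distributions and therefore tends to produce more states in the learned automaton.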

In [42]:
import metrics

In [152]:
test_sequences = sequence_generator.generate_words(100)
validation_sequences = pythaut_sequences[0:100]

In [153]:
results = metrics.compute_stats(target_model, learned_pdfa, partitioner, test_sequences, validation_sequences)

In [199]:
from pythautomata.abstract.probabilistic_model import ProbabilisticModel
from pythautomata.base_types.sequence import Sequence
from pythautomata.base_types.alphabet import Alphabet
from pythautomata.base_types.symbol import SymbolStr

from typing import List
from collections import defaultdict
import utils

import numpy as np
class QuickWrapper():
    def __init__(self, original_model,alphabet, epsilon):
        self._model = original_model
        self._alphabet_len = len(alphabet) + 1
        self._alphabet = alphabet
        self._terminal_symbol = SymbolStr(str(self._alphabet_len))
        self._epsilon = epsilon
        self._name = type(self).__name__  # backing field for the `name` property
    
    def sequence_probability(self, sequence: Sequence, debug = False):
        adapted_sequence = self._adapt_sequence(sequence,  add_terminal=True)
        # Query the wrapped model rather than the notebook-level `model` global
        probs = utils.full_next_symbols_probas(adapted_sequence, self._model)
        # Row i scores the i-th symbol; symbol a sits in column a+1
        probas_for_word = [probs[i, a+1] for i, a in enumerate(adapted_sequence)]
        return np.array(probas_for_word).prod()
    
    @property
    def name(self) -> str:
        return self._name    
    @property
    def alphabet(self) -> Alphabet:
        return self._alphabet
    @property
    def terminal_symbol(self):
        return self._terminal_symbol
    
    def process_query(self, sequence):        
        adapted_sequence = self._adapt_sequence(sequence, add_terminal=len(sequence) == 0)       
        if len(sequence)==0:
            return utils.full_next_symbols_probas(adapted_sequence, self._model)[1]
        else:
            return utils.next_symbols_probas(adapted_sequence, self._model)
    
    def _adapt_sequence(self, sequence, add_terminal = False):
        """
        Method that converts the sequence to a list of ints, adds the starting token at the beginning
        and, depending on the variable 'add_terminal', the terminal token at the end
        """
        adapted_seq = [self._alphabet_len-1]
        for symbol in sequence.value:
            adapted_seq.append(int(symbol.value))

        if add_terminal:
            adapted_seq.append(self._alphabet_len)
        
        return adapted_seq
    
    def raw_next_symbol_probas(self, sequence: Sequence):
        result = self.process_query(sequence)        
        return result 

    def _get_symbol_index(self, symbol):
        return int(symbol.value)+1

    def next_symbol_probas(self, sequence: Sequence):
        """
        Function that returns a dictionary with the probability of next symbols (not including padding_symbol)
        Quickly implemented, depends on raw_next_symbol_probas(sequence) 
        """                
        next_probas = self.raw_next_symbol_probas(sequence)

        symbols = list(self.alphabet.symbols) + [self._terminal_symbol]
        intermediate_dict = {}
        probas = np.zeros(len(symbols))
        for idx, symbol in enumerate(symbols):
            proba = next_probas[self._get_symbol_index(symbol)]
            intermediate_dict[symbol] = (proba, idx)
            probas[idx] = proba       

        dict_result = {}
        for symbol in intermediate_dict.keys():
            dict_result[symbol] = probas[intermediate_dict[symbol][1]]

        return dict_result      
    
    def last_token_probabilities_batch(self, sequences: List[Sequence], required_suffixes: List[Sequence]) -> List[List[float]]:
        return self.get_last_token_weights_batch(sequences, required_suffixes)

    
    def get_last_token_weights(self, sequence, required_suffixes):
        weights = list()
        alphabet_symbols_weights = self.next_symbol_probas(sequence)
        alphabet_symbols_weights = {Sequence() + k: alphabet_symbols_weights[k] for k in alphabet_symbols_weights.keys()}
        for suffix in required_suffixes:
            if suffix in alphabet_symbols_weights:
                weights.append(alphabet_symbols_weights[suffix])
            else:
                new_sequence = sequence + suffix
                new_prefix = Sequence(new_sequence[:-1])
                new_suffix = new_sequence[-1]
                next_symbol_weights = self.next_symbol_probas(new_prefix)
                weights.append(next_symbol_weights[new_suffix])
        return weights
    
    def get_last_token_weights_batch(self, sequences, required_suffixes):
        seqs_to_query = set()
        symbols = list(self.alphabet.symbols) + [self._terminal_symbol]
        for seq in sequences:
            for required_suffix in required_suffixes:
                if required_suffix not in symbols and len(required_suffix)>1:
                    seqs_to_query.add(seq+required_suffix[:-1])
                else:
                    seqs_to_query.add(seq)

        result_dict = self.raw_eval_batch(list(seqs_to_query))
        #result_dict = dict(zip(seqs_to_query, query_results))
        results = []
        for seq in sequences:
            seq_result = []
            for required_suffix in required_suffixes:
                if required_suffix not in symbols and len(required_suffix)>1:
                    seq_result.append(result_dict[seq+required_suffix[:-1]][required_suffix[-1]])
                else:
                    if required_suffix not in symbols:
                        required_suffix = SymbolStr(required_suffix.value[0].value)
                    seq_result.append(result_dict[seq][self._get_symbol_index(required_suffix)])
            results.append(seq_result)
        
        return results
    
    def raw_eval_batch(self, sequences: List[Sequence]):
        if not hasattr(self, '_model'):
            raise AttributeError

        sequences_by_length = defaultdict(list)
        for seq in sequences:
            sequences_by_length[len(seq)].append(seq)            
        query_results = []
        seqs_to_query = []
        for length in sequences_by_length:            
            seqs =  sequences_by_length[length]
            adapted_sequences = list(map(lambda x: self._adapt_sequence(x), seqs))     
            adapted_sequences_np = np.asarray(adapted_sequences)

            #if length == 1:
            #    adapted_sequences_np = adapted_sequences_np.reshape((-1, 1, len(adapted_sequences_np[0]))) 
            if length == 0:                
                seq = Sequence()
                #adapted_sequence = self._adapt_sequence(seqs[0], add_terminal=True)    
                #adapted_sequence_np = np.asarray(adapted_sequence)
                #result = utils.full_next_symbols_probas(adapted_sequence_np, self._model)
                #model_evaluation = [result[0]]
                result = self.process_query(seq)
                model_evaluation =[result]
            else:
                model_evaluation = utils.full_next_symbols_probas_batch(adapted_sequences_np, self._model)[:,-1]
            seqs_to_query.extend(seqs)
            query_results.extend(model_evaluation)

        result_dict = dict(zip(seqs_to_query, query_results))            
        return result_dict
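
`raw_eval_batch` groups sequences by length before evaluating them, presumably so that each batch forms a rectangular array. The standalone sketch below shows that bucketing step on plain lists; `bucket_by_length` is an illustrative name, not part of the provided code.

```python
from collections import defaultdict

def bucket_by_length(sequences):
    """Group sequences so that every bucket contains equal-length lists."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    return dict(buckets)

batches = bucket_by_length([[20, 1], [20, 3], [20], [20, 1, 0], [20, 7]])
for length, batch in sorted(batches.items()):
    print(length, batch)
```

Each bucket can then be converted to a 2-D array and sent to the model in a single call.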

In [200]:
model_b = QuickWrapper(model, alphabet, 0.0)
model_b._epsilon == 0.0

True

In [201]:
results = metrics.compute_stats(target_model, model_b, partitioner, test_sequences, validation_sequences)
model_b._epsilon == 0.0

True

In [202]:
results

{'Test Accuracy': 1.0, 'Test Taysir MSE': 0.0, 'Validation Taysir MSE': 0.0}

In [None]:
a = target_model.sequence_probability(seq_0)

In [None]:
a

In [None]:
b = model_b.sequence_probability(seq_0)

In [None]:
b

In [None]:
(a-b)*10**6

In [107]:
results

{'Test Accuracy': 1.0,
 'Test Taysir MSE': 1.5372453982012312e-17,
 'Validation Taysir MSE': 1.4419992677867512e-21}

In [49]:
results

{'Test Accuracy': 0.61,
 'Test Taysir MSE': 2.6053120983876092,
 'Validation Taysir MSE': 8.057360028618274e-06}

## Save and submit 
This cell creates the model needed for submission to the competition: you just have to run it. It creates an **archive** in your current directory that you can then submit on the competition website.

**You should NOT modify this part, just run it**

In [None]:
from fast_pdfa_wrapper import MlflowFastPDFA
from submit_tools_fix import save_function
from fast_pdfa_converter import FastProbabilisticDeterministicFiniteAutomatonConverter as FastPDFAConverter

fast_pdfa = FastPDFAConverter().to_fast_pdfa(learning_result.model)
mlflow_fast_pdfa = MlflowFastPDFA(fast_pdfa)
save_function(mlflow_fast_pdfa, len(learning_result.model.alphabet), target_model.name)

In [None]:
from fast_pdfa_wrapper import MlflowFastPDFA
from submit_tools_fix import save_function
from fast_pdfa_converter import FastProbabilisticDeterministicFiniteAutomatonConverter as FastPDFAConverter

fast_pdfa = FastPDFAConverter().to_fast_pdfa(testPDFA)
mlflow_fast_pdfa = MlflowFastPDFA(fast_pdfa)
save_function(mlflow_fast_pdfa, len(testPDFA.alphabet), "TEST_PDFA_0")

In [None]:
save_function(mlflow_fast_pdfa, len(testPDFA.alphabet), "TEST_PDFA")

In [None]:
from load_func import load_function

#zip_path = f"predicted_models/{target_model.name}.zip"
zip_path = "predicted_models/TEST_PDFA.zip"
print(sequences[0:10])
load_function(zip_path, sequences[0:10])