# TAYSIR competition - Track 2 Example Extraction

### Welcome!

This notebook shows how to structure your code to participate in the competition.

You can also check the baseline notebook (available in the same archive) for more details about the TAYSIR models and how to use them.

## Prepare your environment

In [None]:
#!pip install mlflow torch

In [None]:
!pip install --upgrade pymodelextractor

In [4]:
import torch
import mlflow
print('PyTorch version :', torch.__version__)
print('MLflow version :', mlflow.__version__)
import sys
print("Your python version:", sys.version)

PyTorch version : 2.0.0+cpu
MLflow version : 2.2.2
Your python version: 3.9.1 (default, Dec 11 2020, 09:29:25) [MSC v.1916 64 bit (AMD64)]


In [2]:
torch.set_num_threads(4)

This notebook was tested with:
* Torch version: 1.11.0+cu102
* MLFlow version: 1.25.1
* Python version: 3.8.10 [GCC 9.4.0]

Python versions starting at 3.7 are supposed to work (but have not been tested).

## Choosing the phase

First, you must select one of the phases/datasets we provide.

In [7]:
dataset_amount = 10
for ds in range(1,dataset_amount+1):
    DATASET = ds
    model_name = f"models/2.{DATASET}.taysir.model"
    model = mlflow.pytorch.load_model(model_name)
    
    print("\n")
    print("Model:", ds)
    print(model.eval())
    try:  # RNN models expose input_size and cell_type
        nb_letters = model.input_size - 1
        cell_type = model.cell_type

        print("The alphabet contains", nb_letters, "symbols.")
        print("The type of the recurrent cells is", cell_type.__name__)
    except AttributeError:  # Otherwise, treat the model as a transformer
        nb_letters = model.distilbert.config.vocab_size
        print("The alphabet contains", nb_letters, "symbols.")
        print("The model is a transformer (DistilBertForSequenceClassification)")





Model: 1
TNetwork(
  23, 22, n_layers=2, neurons_per_layer=64, batch_size=64, patience=5, split_dense=True, task=lm
  (mach[0]): RNN(22, 64, batch_first=True)
  (mach[1]): RNN(64, 64, batch_first=True)
  (dense): Sequential(
    (0): Linear(in_features=64, out_features=32, bias=True)
    (1): Linear(in_features=32, out_features=22, bias=True)
    (2): Sigmoid()
    (3): Softmax(dim=-1)
  )
)
The alphabet contains 22 symbols.
The type of the recurrent cells is RNN


Model: 2
TNetwork(
  10, 9, n_layers=2, neurons_per_layer=256, cell_type=lstmx.LSTMx, batch_size=64, patience=5, split_dense=True, task=lm
  (mach[0]): LSTMx(
    9, 256, batch_first=True
    (drop_layer): Dropout(p=0, inplace=False)
    (forward_layers[0]): LSTMCell(9, 256)
  )
  (mach[1]): LSTMx(
    256, 256, batch_first=True
    (drop_layer): Dropout(p=0, inplace=False)
    (forward_layers[0]): LSTMCell(256, 256)
  )
  (dense): Sequential(
    (0): Linear(in_features=256, out_features=128, bias=True)
    (1): Linear(in

In [8]:
TRACK = 2 #always for this track
DATASET = 1

In [5]:
model_name = f"models/2.{DATASET}.taysir.model"

model = mlflow.pytorch.load_model(model_name)
model.eval()



DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(64, 256, padding_idx=0)
      (position_embeddings): Embedding(512, 256)
      (LayerNorm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-7): 8 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=256, out_features=256, bias=True)
            (k_lin): Linear(in_features=256, out_features=256, bias=True)
            (v_lin): Linear(in_features=256, out_features=256, bias=True)
            (out_lin): Linear(in_features=256, out_features=256, bias=True)
          )
          (sa_layer_norm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [6]:
if not hasattr(model, 'distilbert'):#RNN
    nb_letters = model.input_size -1
    cell_type = model.cell_type

    print("The alphabet contains", nb_letters, "symbols.")
    print("The type of the recurrent cells is", cell_type.__name__)
else:
    nb_letters = model.distilbert.config.vocab_size
    print("The alphabet contains", nb_letters, "symbols.")
    print("The model is a transformer (DistilBertForSequenceClassification)")

The alphabet contains 35 symbols.
The model is a transformer (DistilBertForSequenceClassification)


## Load the data

The input data is in the following format :

```
[Number of sequences] [Alphabet size]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
...
[Length of sequence] [List of symbols]
```

For example the following data :

```
5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
```

is composed of 5 sequences and has an alphabet size of 10 (so symbols are between 0 and 9). The first sequence line starts with the length field 6, followed by the symbols 8 6 5 1 6 7 4 9. Notice that 8 is the start symbol and 9 is the end symbol.
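As a minimal sketch, the format above can be parsed from an in-memory string as follows. The function name `parse_taysir_words` and the variable names are hypothetical and only serve this illustration. The sketch ignores the length field and keeps every symbol that follows it.

```python
from typing import List, Tuple


def parse_taysir_words(text: str) -> Tuple[int, int, List[List[int]]]:
    """Parse the text format described above (hypothetical helper)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    # Header line: number of sequences, then alphabet size.
    n_sequences, alphabet_size = (int(x) for x in lines[0].split())
    parsed = []
    for line in lines[1:]:
        fields = [int(x) for x in line.split()]
        # fields[0] is the length field; the remaining fields are the symbols.
        parsed.append(fields[1:])
    return n_sequences, alphabet_size, parsed


example_text = """5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
"""

n_sequences, example_alphabet_size, example_sequences = parse_taysir_words(example_text)
print(n_sequences, example_alphabet_size)
print(example_sequences[0])
```

Every parsed sequence in this example starts with 8 and ends with 9, which matches the start and end symbols mentioned above.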

In [9]:
from pythautomata.base_types.alphabet import Alphabet

file = f"datasets/2.{DATASET}.taysir.valid.words"

alphabet = None
sequences = []

# In the competition, the empty sequence is defined as [alphabet_size - 2, alphabet_size - 1]
# For example, with an alphabet of size 22, the empty sequence is [20, 21]
empty_sequence_len = 2

with open(file) as f:
    # Read the header line (number of sequences, alphabet size)
    headline = f.readline().split(' ')
    alphabet_size = int(headline[1].strip())
    alphabet = Alphabet.from_strings([str(x) for x in range(alphabet_size - empty_sequence_len)])
    
    for line in f:
        line = line.strip()
        seq = line.split(' ')
        seq = [int(i) for i in seq[1:]] #Remove first value (length of sequence) and cast to int
        sequences.append(seq)

The variable *sequences* is thus **a list of lists** of integers, where each inner list includes the start and end symbols.
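Since every sequence carries the start symbol `alphabet_size - 2` and the end symbol `alphabet_size - 1`, it can be handy to strip them when inspecting the data. The helpers below are a hypothetical sketch and are not part of the competition tooling.

```python
from typing import List


def empty_sequence(alphabet_size: int) -> List[int]:
    """Return the competition's empty sequence: [start symbol, end symbol]."""
    return [alphabet_size - 2, alphabet_size - 1]


def strip_markers(seq: List[int], alphabet_size: int) -> List[int]:
    """Remove the start and end symbols, checking that they are present."""
    start, end = alphabet_size - 2, alphabet_size - 1
    if len(seq) < 2 or seq[0] != start or seq[-1] != end:
        raise ValueError("sequence is not wrapped in start/end symbols")
    return seq[1:-1]


# With an alphabet of size 35, the start symbol is 33 and the end symbol is 34.
print(strip_markers([33, 16, 4, 24, 13, 8, 34], 35))
print(strip_markers(empty_sequence(22), 22))
```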

In [8]:
print('Number of sequences:', len(sequences))
print('10 first sequences:')
for i in range(10):
    print(sequences[i])

Number of sequences: 489
10 first sequences:
[33, 28, 32, 26, 8, 32, 16, 4, 8, 28, 34]
[33, 10, 18, 26, 1, 30, 20, 34]
[33, 26, 8, 28, 8, 15, 10, 9, 1, 30, 20, 34]
[33, 16, 24, 30, 32, 26, 18, 17, 1, 16, 32, 1, 30, 20, 34]
[33, 3, 1, 9, 32, 8, 26, 1, 30, 20, 34]
[33, 16, 24, 26, 26, 8, 16, 32, 28, 34]
[33, 16, 4, 24, 13, 8, 34]
[33, 26, 8, 28, 4, 31, 3, 3, 9, 8, 17, 34]
[33, 28, 4, 26, 24, 31, 17, 8, 17, 34]
[33, 28, 30, 18, 13, 8, 28, 34]


## Model extraction

This is where you extract your own simple model.

In [9]:
from pythautomata.model_comparators.wfa_partition_comparison_strategy import WFAPartitionComparator
from pythautomata.utilities.probability_partitioner import QuantizationProbabilityPartitioner
#from pythautomata.model_exporters.wfa_image_exporter_with_partition_mapper import WFAImageExporterWithPartitionMapper
from pythautomata.base_types.symbol import SymbolStr
from pythautomata.utilities.uniform_length_sequence_generator import UniformLengthSequenceGenerator

from pymodelextractor.learners.observation_tree_learners.bounded_pdfa_quantization_n_ary_tree_learner import BoundedPDFAQuantizationNAryTreeLearner
from pymodelextractor.teachers.pac_batch_probabilistic_teacher import PACBatchProbabilisticTeacher
from pymodelextractor.teachers.pac_probabilistic_teacher import PACProbabilisticTeacher
from pymodelextractor.utils.pickle_data_loader import PickleDataLoader

from utils import predict
from pytorch_language_model import PytorchLanguageModel

name = "track_" + str(TRACK) + "_dataset_" + str(DATASET)

target_model = PytorchLanguageModel(alphabet, model, name)


In [10]:
#from last_token_weights_pickle_dataset_generator import LastTokenWeightsPickleDataSetGenerator
#LastTokenWeightsPickleDataSetGenerator().genearte_dataset(target_model, 1000, "./test",10)

In [11]:
epsilon = 0.1
delta = 0.1
max_states = 1000000
max_query_length= 1000000
max_secs = 30
sequence_generator = UniformLengthSequenceGenerator(alphabet, max_seq_length=100, min_seq_length=20)
#dataloader = PickleDataLoader("./test")

partitioner = QuantizationProbabilityPartitioner(10)
comparator = WFAPartitionComparator(partitioner)
teacher1  = PACBatchProbabilisticTeacher(target_model, epsilon = epsilon, delta = delta, max_seq_length = None, comparator = comparator, sequence_generator=sequence_generator, compute_epsilon_star=False)
learner = BoundedPDFAQuantizationNAryTreeLearner(partitioner, max_states, max_query_length, max_secs, generate_partial_hipothesis = False, pre_cache_queries_for_building_hipothesis = False,  check_probabilistic_hipothesis = False)
learning_result = learner.learn(teacher1)     
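To give an intuition for `epsilon` and `delta`: in PAC learning, an equivalence query is commonly approximated by testing the hypothesis on a random sample whose size grows with each query. The sketch below computes the textbook sample-size bound from Angluin's PAC setting. Whether `PACBatchProbabilisticTeacher` uses exactly this formula is an assumption, and `pac_sample_size` is a hypothetical name.

```python
import math


def pac_sample_size(epsilon: float, delta: float, query_index: int) -> int:
    """Textbook sample size for the i-th simulated equivalence query.

    Uses ceil((1 / epsilon) * (ln(1 / delta) + i * ln 2)); this is the
    classical bound, not necessarily what pymodelextractor computes.
    """
    return math.ceil(
        (1.0 / epsilon) * (math.log(1.0 / delta) + query_index * math.log(2.0))
    )


# Sample sizes for the first three queries with the values used above.
for i in range(1, 4):
    print(i, pac_sample_size(0.1, 0.1, i))
```

Smaller values of `epsilon` or `delta` increase the sample size, so they trade extraction time for a stronger guarantee.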

In [12]:
#teacher2  = PACBatchProbabilisticTeacher(target_model, epsilon = epsilon, delta = delta, max_seq_length = None, comparator = comparator, sequence_generator=sequence_generator, compute_epsilon_star=False, cache_from_dataloader=dataloader)
#learner2 = BoundedPDFAQuantizationNAryTreeLearner(partitioner, max_states, max_query_length, max_secs, generate_partial_hipothesis = False, pre_cache_queries_for_building_hipothesis = False,  check_probabilistic_hipothesis = False)
#learning_result = learner2.learn(teacher2)     
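The partitioner passed to the comparator and the learner decides when two next-symbol distributions count as equivalent. The sketch below assumes that quantization with parameter 10 means splitting [0, 1] into 10 equal-width intervals and comparing interval indices symbol by symbol. This is an illustration of the idea, not the `QuantizationProbabilityPartitioner` implementation, and the function names are hypothetical.

```python
from typing import List


def quantize(p: float, n_intervals: int) -> int:
    """Index of the equal-width interval of [0, 1] that contains p."""
    # p == 1.0 is placed in the last interval.
    return min(int(p * n_intervals), n_intervals - 1)


def same_partition(dist_a: List[float], dist_b: List[float], n_intervals: int) -> bool:
    """True if both distributions fall in the same interval for every symbol."""
    return [quantize(p, n_intervals) for p in dist_a] == [
        quantize(p, n_intervals) for p in dist_b
    ]


print(same_partition([0.52, 0.41, 0.07], [0.58, 0.40, 0.02], 10))
print(same_partition([0.52, 0.41, 0.07], [0.62, 0.31, 0.07], 10))
```

Under this assumption, a larger number of intervals distinguishes more distributions and tends to produce automata with more states.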

In [13]:
learning_result.model

<pythautomata.automata.wheighted_automaton_definition.probabilistic_deterministic_finite_automaton.ProbabilisticDeterministicFiniteAutomaton at 0x7f992741dc70>
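The extracted model is a probabilistic deterministic finite automaton (PDFA). As a rough illustration of how such a model scores a sequence, the toy class below multiplies the next-symbol probabilities along the unique path and then the termination probability of the final state. `ToyPDFA` and its attributes are hypothetical and do not mirror the pythautomata API.

```python
from typing import Dict, Hashable, List, Tuple

END = "$"  # hypothetical termination symbol for this toy example


class ToyPDFA:
    def __init__(
        self,
        initial: int,
        transitions: Dict[Tuple[int, Hashable], int],
        probabilities: Dict[int, Dict[Hashable, float]],
    ) -> None:
        self.initial = initial
        self.transitions = transitions      # (state, symbol) -> next state
        self.probabilities = probabilities  # state -> {symbol or END: probability}

    def sequence_probability(self, seq: List[Hashable]) -> float:
        state, prob = self.initial, 1.0
        for symbol in seq:
            prob *= self.probabilities[state].get(symbol, 0.0)
            if prob == 0.0:
                return 0.0  # the sequence is impossible under this model
            state = self.transitions[(state, symbol)]
        # Multiply by the probability of stopping in the final state.
        return prob * self.probabilities[state].get(END, 0.0)


toy = ToyPDFA(
    initial=0,
    transitions={(0, 0): 0, (0, 1): 1, (1, 0): 0},
    probabilities={
        0: {0: 0.5, 1: 0.25, END: 0.25},
        1: {0: 0.5, END: 0.5},
    },
)
print(toy.sequence_probability([0, 1]))  # 0.5 * 0.25 * 0.5
```

Because each symbol multiplies the result by a value at most 1, longer sequences tend to receive very small probabilities.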

## Save and submit

This cell creates the model needed for your submission to the competition: you just have to run it. It will create, under your current directory, an **archive** that you can then submit on the competition website.

**You should NOT modify this part, just run it**

In [14]:
from fast_pdfa_wrapper import MlflowFastPDFA
from submit_tools_fix import save_function
from fast_pdfa_converter import FastProbabilisticDeterministicFiniteAutomatonConverter as FastPDFAConverter

fast_pdfa = FastPDFAConverter().to_fast_pdfa(learning_result.model)
mlflow_fast_pdfa = MlflowFastPDFA(fast_pdfa)
save_function(mlflow_fast_pdfa, len(learning_result.model.alphabet), target_model.name)

Submission created at predicted_models/track_2_dataset_10.zip.




In [10]:
#zip_path = f"predicted_models/{target_model.name}.zip"
zip_path = "predicted_models/track_2_dataset_1_TEST.zip"
from load_func import load_function
print(sequences[0:10])
load_function(zip_path, sequences[0:10])

[[20, 13, 14, 6, 0, 15, 4, 3, 5, 12, 13, 13, 14, 4, 12, 17, 21], [20, 3, 13, 3, 16, 6, 4, 13, 1, 21], [20, 13, 6, 15, 21], [20, 13, 10, 3, 21], [20, 13, 10, 3, 16, 6, 4, 13, 13, 12, 17, 4, 13, 14, 10, 0, 10, 13, 14, 4, 15, 12, 17, 21], [20, 3, 5, 0, 1, 4, 13, 6, 14, 4, 14, 4, 14, 13, 10, 12, 1, 5, 10, 3, 14, 5, 12, 14, 1, 12, 11, 12, 17, 18, 8, 21], [20, 3, 13, 3, 19, 1, 4, 3, 5, 10, 3, 19, 8, 21], [20, 13, 0, 1, 3, 1, 13, 3, 16, 6, 4, 13, 1, 12, 8, 0, 5, 10, 14, 12, 10, 3, 14, 1, 21], [20, 13, 14, 14, 6, 3, 21], [20, 13, 12, 13, 3, 16, 3, 16, 21]]
Model loaded, testing it on 10 sequences
1.777629888553824e-21
1.2247713436020192e-11
6.493212342068826e-06
1.7650410387975307e-05
1.9064327541014065e-30
1.4722507587839755e-42
7.952713969056707e-17
7.49522137622759e-32
2.727951463023517e-08
3.115369573423187e-10
