# TAYSIR competition - Track 1 Starter Kit

### Welcome!

This notebook shows how to structure your code to participate in the competition.

You can also check the baseline notebook (available in the same archive) for more details about the TAYSIR models and how to use them.

## Prepare your environment

In [1]:
#%pip install -q mlflow torch transformers

In [2]:
import torch
import mlflow
from utils import predict, PytorchInference
#print('PyTorch version :', torch.__version__)
#print('MLflow version :', mlflow.__version__)
import sys
import pandas as pd
#print("Your python version:", sys.version)

This notebook was tested with:
* Torch version: 1.11.0+cu102
* MLflow version: 1.25.1
* Python version: 3.8.10 [GCC 9.4.0]

Python versions from 3.7 onward should work, but have not been tested.

## Choosing the phase

First, you must select one of the phases/datasets we provide.

In [3]:
TRACK = 1  # always 1 for this track
DATASET = 7

In [4]:
from transformers import AutoTokenizer, DistilBertForSequenceClassification

model_name = f"models/1.{DATASET}.taysir.model"

model = mlflow.pytorch.load_model(model_name)

sequence = [4, 2, 1, 3, 1, 3, 0, 1, 2, 2, 3, 2, 2, 3, 2, 1, 3, 1, 2, 0, 2, 5]

#pred = predict(sequence, model)
#pred
model.eval()



DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(68, 12, padding_idx=0)
      (position_embeddings): Embedding(512, 12)
      (LayerNorm): LayerNorm((12,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=12, out_features=12, bias=True)
            (k_lin): Linear(in_features=12, out_features=12, bias=True)
            (v_lin): Linear(in_features=12, out_features=12, bias=True)
            (out_lin): Linear(in_features=12, out_features=12, bias=True)
          )
          (sa_layer_norm): LayerNorm((12,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Li

In [5]:
try:  # RNN models expose input_size and cell_type
    nb_letters = model.input_size - 1
    cell_type = model.cell_type

    print("The alphabet contains", nb_letters, "symbols.")
    print("The type of the recurrent cells is", cell_type.__name__)
except AttributeError:  # Transformer models do not have these attributes
    nb_letters = model.distilbert.config.vocab_size - 2
    print("The alphabet contains", nb_letters, "symbols.")
    print("The model is a transformer (DistilBertForSequenceClassification)")

The alphabet contains 66 symbols.
The model is a transformer (DistilBertForSequenceClassification)


## Load the data

The input data is in the following format:

```
[Number of sequences] [Alphabet size]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
...
[Length of sequence] [List of symbols]
```

For example, the following data:

```
5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
```

is composed of 5 sequences and has an alphabet size of 10, so symbols range from 0 to 9. The first sequence has a declared length of 6 and reads 8 6 5 1 6 7 4 9, where 8 is the start symbol and 9 is the end symbol.
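
As a small worked example, the format described above can be parsed from an in-memory string. This is only a sketch: `parse_words` and `EXAMPLE` are names invented for this illustration, and the declared length at the start of each line is skipped rather than checked.

```python
from typing import List, Tuple


def parse_words(text: str) -> Tuple[int, int, List[List[int]]]:
    """Parse the word format: a header line, then one sequence per line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header: [Number of sequences] [Alphabet size]
    n_sequences, alphabet_size = (int(tok) for tok in lines[0].split())
    parsed = []
    for ln in lines[1:]:
        tokens = ln.split()
        # The first token is the declared length; the rest are the symbols.
        parsed.append([int(tok) for tok in tokens[1:]])
    return n_sequences, alphabet_size, parsed


# Hypothetical in-memory copy of the example data shown above.
EXAMPLE = """\
5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
"""

n_sequences, alphabet_size, example_sequences = parse_words(EXAMPLE)
print(n_sequences, alphabet_size)
print(example_sequences[0])
```

Every parsed sequence starts with the start symbol 8 and ends with the end symbol 9, which is a quick sanity check worth running on real data as well.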

In [6]:
from pythautomata.base_types.alphabet import Alphabet

file = f"datasets/1.{DATASET}.taysir.valid.words"

alphabet = None
sequences = []

with open(file) as f:
    a = f.readline()  # Read the header line (number of sequences, alphabet size)
    headline = a.split(' ')
    alphabet_size = int(headline[1].strip())
    alphabet = Alphabet.from_strings([str(x) for x in range(alphabet_size - 1)])
    print(alphabet)
    for line in f:
        line = line.strip()
        seq = line.split(' ')
        seq = [int(i) for i in seq[1:]] #Remove first value (length of sequence) and cast to int
        sequences.append(seq)

frozenset({8, 12, 38, 36, 58, 33, 0, 3, 37, 51, 23, 45, 32, 19, 4, 30, 16, 42, 25, 17, 26, 22, 49, 9, 13, 39, 1, 44, 61, 27, 57, 53, 60, 50, 31, 18, 7, 2, 24, 20, 28, 29, 55, 48, 47, 11, 54, 56, 59, 5, 6, 43, 52, 35, 15, 14, 40, 64, 62, 10, 46, 21, 41, 63, 34})


The variable *sequences* is thus **a list of lists**.

In [7]:
print('Number of sequences:', len(sequences))
print('10 first sequences:')
for i in range(10):
    print(sequences[i])
    

Number of sequences: 10000
10 first sequences:
[64, 34, 11, 17, 9, 50, 40, 45, 55, 8, 58, 24, 34, 3, 62, 36, 60, 55, 53, 9, 55, 65]
[64, 39, 15, 4, 38, 5, 38, 63, 0, 51, 58, 14, 5, 32, 49, 61, 23, 24, 63, 57, 39, 65]
[64, 40, 0, 39, 17, 61, 59, 50, 40, 5, 24, 48, 0, 31, 21, 38, 19, 19, 59, 9, 28, 65]
[64, 48, 34, 44, 34, 52, 21, 37, 18, 21, 37, 14, 17, 9, 31, 11, 61, 24, 3, 9, 56, 65]
[64, 8, 19, 13, 61, 25, 13, 2, 42, 17, 9, 29, 44, 28, 41, 1, 32, 18, 54, 34, 18, 65]
[64, 15, 28, 18, 40, 3, 31, 21, 34, 4, 16, 50, 20, 14, 51, 46, 29, 36, 6, 0, 2, 65]
[64, 38, 33, 60, 8, 6, 34, 57, 37, 60, 55, 35, 52, 50, 63, 45, 53, 40, 7, 45, 20, 65]
[64, 41, 38, 37, 58, 32, 27, 17, 0, 21, 14, 50, 59, 25, 5, 7, 54, 53, 36, 11, 6, 65]
[64, 60, 59, 4, 48, 17, 27, 55, 32, 45, 29, 60, 53, 12, 34, 62, 48, 28, 3, 23, 24, 65]
[64, 17, 36, 60, 61, 24, 28, 60, 3, 39, 44, 41, 32, 4, 1, 41, 3, 3, 51, 50, 45, 65]


## Model extraction

This is where you will extract your own simple model.

### Define a Learner, a Comparator, and a Teacher

In [8]:
from pymodelextractor.teachers.pac_comparison_strategy import PACComparisonStrategy
from pymodelextractor.teachers.general_teacher import GeneralTeacher
from pymodelextractor.factories.lstar_factory import LStarFactory

name = "Track: " + str(TRACK) + " - DataSet: " + str(DATASET)

target_model = PytorchInference(alphabet, model, name)

comparator = PACComparisonStrategy(target_model_alphabet = alphabet, epsilon = 0.01, delta = 0.01)

teacher = GeneralTeacher(target_model, comparator)

learner = LStarFactory.get_dfa_lstar_learner()
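
The `epsilon` and `delta` arguments are the usual PAC parameters: with probability at least `1 - delta`, the extracted model should disagree with the target on at most an `epsilon` fraction of sequences. To give a feel for what these values imply, the sketch below computes the sample size from the classical PAC variant of L*, where the i-th equivalence query is replaced by testing on a growing number of random sequences. This is an illustration of the textbook bound only; this notebook does not confirm that `PACComparisonStrategy` uses exactly this schedule, and `pac_sample_size` is a name made up for the example.

```python
import math


def pac_sample_size(epsilon: float, delta: float, query_index: int) -> int:
    """Classical bound: ceil((1/epsilon) * (ln(1/delta) + i * ln 2)).

    Assumption: this is the textbook PAC-L* schedule, not necessarily
    what pymodelextractor computes internally.
    """
    return math.ceil(
        (1.0 / epsilon) * (math.log(1.0 / delta) + query_index * math.log(2))
    )


# Sample sizes for the first few equivalence queries with the values used above.
for i in range(1, 4):
    print(i, pac_sample_size(0.01, 0.01, i))
```

Smaller values of `epsilon` or `delta` make each equivalence check more expensive, so they trade extraction time for a tighter guarantee.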

### Learn and extract

In [9]:
res = learner.learn(teacher, log_hierachy=1)

**** Started lstar learning ****
**** Learning finished in 5.132320165634155s using 2 counterexamples & final model ended with 9 states ****



# Submission
Once you are satisfied with your model's performance, you must write a predict function that takes a sequence (a list of integers) as input. It must return 0 or 1 (integer type) for the binary classification track, or a probability (float type) for the language modeling track.

Your model is **NOT** a parameter of this function. You do NOT need to handle MLflow saving here.
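
One way to satisfy the "model is not a parameter" requirement is to capture the model in a closure. The sketch below does this for a toy automaton stored as a plain dictionary; `make_predict`, `toy_transitions`, and `toy_predict` are hypothetical names for illustration and are not part of the competition tooling or of pythautomata.

```python
from typing import Callable, Dict, List, Set, Tuple


def make_predict(
    transitions: Dict[Tuple[int, int], int],
    initial_state: int,
    accepting_states: Set[int],
) -> Callable[[List[int]], int]:
    """Build a predict function that closes over a dictionary-based DFA."""

    def predict(sequence: List[int]) -> int:
        state = initial_state
        for symbol in sequence:
            # In this sketch, a missing transition means the sequence is rejected.
            if (state, symbol) not in transitions:
                return 0
            state = transitions[(state, symbol)]
        # Return an int (0 or 1), as required for binary classification.
        return int(state in accepting_states)

    return predict


# Toy automaton over {0, 1}: accepts sequences with an even number of 1s.
toy_transitions = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
toy_predict = make_predict(toy_transitions, initial_state=0, accepting_states={0})

print(toy_predict([1, 0, 1]))
print(toy_predict([1, 0, 0]))
```

The returned function takes only the sequence, so the automaton travels with it without appearing in its signature.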

*For instance, if you want to submit the original model:*

## Save and submit
This creates the model needed for submission to the competition.

In [10]:
res

<pymodelextractor.learners.learning_result.LearningResult at 0x7fea6c69df10>

In [11]:
res.info

{'equivalence_queries_count': 3,
 'membership_queries_count': 3584,
 'observation_table': <pymodelextractor.learners.observation_table_learners.general_observation_table.GeneralObservationTable at 0x7feac0909bb0>,
 'duration': 5.132308483123779}

In [12]:
from pythautomata.model_exporters.dot_exporters.dfa_dot_exporting_strategy import DfaDotExportingStrategy
res.model._exporting_strategies = [DfaDotExportingStrategy()]
res.model.export()

In [13]:
from utils import test_model

result = test_model(target_model, res.model, sequence_length=100, sequence_amount=1000)

In [14]:
import numpy as np
np.mean(result)

0.999
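
The value above reads naturally as an agreement rate between the target model and the extracted automaton. Since `test_model` lives in `utils` and is not shown here, the following is only a guess at the idea, not its actual implementation: draw random sequences, ask both models, and average the matches. `agreement_rate` and the two toy classifiers are hypothetical.

```python
import random
from typing import Callable, List


def agreement_rate(
    model_a: Callable[[List[int]], int],
    model_b: Callable[[List[int]], int],
    alphabet_size: int,
    max_length: int,
    n_samples: int,
    seed: int = 0,
) -> float:
    """Fraction of random sequences on which the two classifiers agree."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    agreements = 0
    for _ in range(n_samples):
        length = rng.randint(0, max_length)
        seq = [rng.randrange(alphabet_size) for _ in range(length)]
        agreements += int(model_a(seq) == model_b(seq))
    return agreements / n_samples


def even_ones(seq: List[int]) -> int:
    """Toy classifier: 1 if the sequence has an even number of 1s."""
    return int(seq.count(1) % 2 == 0)


def always_accept(seq: List[int]) -> int:
    """Toy classifier: accepts everything."""
    return 1


print(agreement_rate(even_ones, even_ones, 2, 100, 1000))
print(agreement_rate(even_ones, always_accept, 2, 100, 1000))
```

Note that uniform random sequences may not reflect the distribution used for scoring, so an estimate like this is best treated as a sanity check.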

Uncomment to save the extracted model.

In [15]:
'''from mlflow_exporter.wrapper import MlflowDFA
from mlflow_exporter.submit_tools_fix import save_function

mlflow_dfa = MlflowDFA(res.model)
save_function(mlflow_dfa, len(res.model.alphabet), target_model.name)'''

'from mlflow_exporter.wrapper import MlflowDFA\nfrom mlflow_exporter.submit_tools_fix import save_function\n\nmlflow_dfa = MlflowDFA(res.model)\nsave_function(mlflow_dfa, len(res.model.alphabet), target_model.name)'