# TAYSIR competition - Track 1 Starter Kit

### Welcome!

This notebook shows how to structure your code to participate in the competition.

You can also check the baseline notebook (available in the same archive) for more details about the TAYSIR models and how to use them.

## Prepare your environment

In [1]:
#%pip install -q mlflow torch transformers

In [2]:
import torch
import mlflow
print('PyTorch version :', torch.__version__)
print('MLflow version :', mlflow.__version__)
import sys
import pandas as pd
print("Your python version:", sys.version)

PyTorch version : 1.10.0+cu102
MLflow version : 2.2.1
Your python version: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) 
[GCC 9.4.0]


This notebook was tested with:
* Torch version: 1.11.0+cu102
* MLFlow version: 1.25.1
* Python version: 3.8.10 [GCC 9.4.0]

Python versions from 3.7 onward should work, but have not been tested.

## Choosing the phase

First, select one of the phases/datasets we provide.

In [3]:
TRACK = 1 #always for this track
DATASET = 7

In [10]:
from transformers import AutoTokenizer, DistilBertForSequenceClassification

model_name = f"models/1.{DATASET}.taysir.model"

model = mlflow.pytorch.load_model(model_name)

sequence = [64, 34, 11, 17, 9, 50, 40, 45, 55, 8, 58, 24, 34, 3, 62, 36, 60, 55, 53, 9, 55, 65]

# predict() is defined in the "Submission" section below; run that cell first.
pred = predict(sequence, model)
pred
#model.eval()



0

In [None]:
try:#RNN
    nb_letters = model.input_size - 1
    cell_type = model.cell_type

    print("The alphabet contains", nb_letters, "symbols.")
    print("The type of the recurrent cells is", cell_type.__name__)
except AttributeError: #Transformer (no input_size / cell_type attributes)
    nb_letters = model.distilbert.config.vocab_size - 2
    print("The alphabet contains", nb_letters, "symbols.")
    print("The model is a transformer (DistilBertForSequenceClassification)")

## Load the data

The input data is in the following format:

```
[Number of sequences] [Alphabet size]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
[Length of sequence] [List of symbols]
...
[Length of sequence] [List of symbols]
```

For example, the following data:

```
5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
```

contains 5 sequences over an alphabet of size 10, so symbols are between 0 and 9. The first sequence line declares a length of 6 and lists the symbols 8 6 5 1 6 7 4 9. Notice that 8 is the start symbol and 9 is the end symbol.
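
As a minimal sketch of how this format can be read, the snippet below parses the example above from an in-memory string. The helper name `parse_taysir_words` is hypothetical (it is not part of the starter kit), and the code assumes whitespace-separated integers exactly as shown.

```python
from typing import List, Tuple

# The example data from the text above, kept verbatim.
EXAMPLE = """\
5 10
6 8 6 5 1 6 7 4 9
12 8 6 9 4 6 8 2 1 0 6 5 9
7 8 9 4 3 0 4 9
4 8 0 4 9
8 8 1 5 2 6 0 5 3 9
"""


def parse_taysir_words(text: str) -> Tuple[int, int, List[List[int]]]:
    """Hypothetical helper: return (number of sequences, alphabet size, sequences)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header line: number of sequences, then alphabet size.
    n_sequences, alphabet_size = (int(x) for x in lines[0].split())
    sequences = []
    for ln in lines[1:]:
        values = [int(x) for x in ln.split()]
        sequences.append(values[1:])  # drop the leading declared length
    return n_sequences, alphabet_size, sequences


n, size, seqs = parse_taysir_words(EXAMPLE)
print(n, size, len(seqs))
print(seqs[0])
```

The declared length is dropped rather than checked, which matches what the loading cell below does.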

In [None]:
#!pip install pythautomata
#!pip install pymodelextractor

from pythautomata.base_types.alphabet import Alphabet

file = f"datasets/1.{DATASET}.taysir.valid.words"

alphabet = None
sequences = []

with open(file) as f:
    a = f.readline() #Read the header line (number of sequences, alphabet size)
    headline = a.split(' ')
    alphabet_size = int(headline[1].strip())
    alphabet = Alphabet.from_strings([str(x) for x in range(alphabet_size - 1)])
    print(alphabet)
    for line in f:
        line = line.strip()
        seq = line.split(' ')
        seq = [int(i) for i in seq[1:]] #Remove first value (length of sequence) and cast to int
        sequences.append(seq)

The variable *sequences* is thus **a list of lists** of integers.

In [None]:
print('Number of sequences:', len(sequences))
print('First 10 sequences:')
for seq in sequences[:10]: #Slicing avoids an IndexError if there are fewer than 10 sequences
    print(seq)

## Model extraction

This is where you will extract your own simple model.

### Model inference adapter for PyTorch
We define a class that exposes the input alphabet, the output alphabet, and a method to process queries against the PyTorch model. An instance of this class is then passed to the generic L* teacher.

In [None]:
from pythautomata.base_types.sequence import Sequence
from pymodelextractor.teachers.teacher import Teacher

# SEE ALPHABET TYPE
class PytorchInference():
    
    def __init__(self, alphabet, model):
        self._alphabet = alphabet
        self._model = model
    
    def get_alphabet(self) -> Alphabet:
        return self._alphabet
    
    def get_output_alphabet(self) -> Alphabet:
        return Alphabet.from_strings(['1', '0'])
    
    def process_query(self, sequence):
        return predict(sequence, self._model) == 1

### Define a Learner, a Comparator, and a Teacher

In [None]:
from pymodelextractor.teachers.pac_comparison_strategy import PACComparisonStrategy
from pymodelextractor.teachers.general_teacher import GeneralTeacher
from pymodelextractor.factories.lstar_factory import LStarFactory

inference = PytorchInference(alphabet, model)

comparator = PACComparisonStrategy(target_model_alphabet = alphabet, epsilon = 1, delta = 1)

teacher = GeneralTeacher(inference, comparator)

learner = LStarFactory.get_dfa_lstar_learner(max_states = 0, max_query_length = 0, max_time = 0)
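
The `epsilon` and `delta` arguments control how thorough the approximate equivalence check is. As an illustration only, Angluin's classical PAC bound draws ceil((1/epsilon) * (ln(1/delta) + i * ln 2)) random samples at the i-th equivalence query (numbered from 1). Whether `PACComparisonStrategy` uses exactly this formula is an assumption that this notebook does not confirm, and the function name below is hypothetical.

```python
import math


def pac_sample_size(epsilon: float, delta: float, query_index: int) -> int:
    """Hypothetical helper: Angluin-style sample size for the i-th equivalence query."""
    if not (0 < epsilon <= 1 and 0 < delta <= 1):
        raise ValueError("epsilon and delta must lie in (0, 1]")
    if query_index < 1:
        raise ValueError("query_index starts at 1")
    return math.ceil((math.log(1 / delta) + query_index * math.log(2)) / epsilon)


# Compare the placeholder values used in the cell above with stricter settings.
for eps, dlt in [(1, 1), (0.1, 0.1), (0.01, 0.01)]:
    print(eps, dlt, pac_sample_size(eps, dlt, 1))
```

Under this bound, `epsilon = 1` and `delta = 1` ask for a single sample at the first query, so smaller values are likely needed for a meaningful check.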

### Learn and extract

In [None]:
learner.learn(teacher) #The learner needs the teacher to answer its queries
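
Before submitting, it can help to measure how often an extracted model agrees with the original one on held-out sequences. The sketch below uses toy stand-in functions; `agreement_rate`, `original` and `extracted` are hypothetical names, and in practice the two callables would wrap the original network and your extracted model.

```python
from typing import Callable, List


def agreement_rate(
    predict_a: Callable[[List[int]], int],
    predict_b: Callable[[List[int]], int],
    sequences: List[List[int]],
) -> float:
    """Fraction of sequences on which both predictors return the same label."""
    if not sequences:
        raise ValueError("need at least one sequence")
    matches = sum(1 for seq in sequences if predict_a(seq) == predict_b(seq))
    return matches / len(sequences)


# Toy stand-ins (assumptions for illustration, not real TAYSIR models).
def original(seq: List[int]) -> int:
    return int(len(seq) % 2 == 0)


def extracted(seq: List[int]) -> int:
    return int(len(seq) % 2 == 0 and len(seq) < 6)


toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6]]
print(agreement_rate(original, extracted, toy_sequences))
```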

# Submission
Once you are satisfied with your model's performance, you must write a predict function that takes a sequence (a list of integers) as input. It returns 0 or 1 (integer type) for the binary classification track, or a probability (float type) for the language modeling track.

Your model is **NOT** a parameter of this function. You do NOT need to handle MLflow saving here.

*For instance, if you want to submit the original model:*

In [7]:
def predict(sequence, model):
    """Take a sequence as a list of integers and return the decision of the model.

    The submitted function should not need the model as an argument; here `model` is
    passed explicitly so that the cells above can call this function. See the baseline
    notebooks for examples."""
    
    #For instance, if you want to submit the original model:
    if hasattr(model, 'predict'): #RNN
        value = model.predict(model.one_hot_encode(sequence))
        return value
    else: #Transformer
        # Add 2 to each integer in the sequence before feeding it to the model,
        # since ids 0 and 1 are reserved for special tokens:
        #     0 : padding id
        #     1 : classification token id
        word = [ [1] + [ a+2 for a in sequence ] ]
        word = torch.IntTensor(word)
        with torch.no_grad():
            out = model(word)
            return out.logits.argmax().item()
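
To make the transformer input encoding concrete, here is a small sketch of the id shift on its own, without PyTorch. The constant and function names are hypothetical; only the rule itself (prepend id 1, add 2 to every symbol) comes from the cell above.

```python
from typing import List

PAD_ID = 0  # padding id, reserved (unused for a single unpadded sequence)
CLS_ID = 1  # classification token id, prepended to every sequence
OFFSET = 2  # shift applied to each symbol so it does not collide with ids 0 and 1


def encode_for_transformer(sequence: List[int]) -> List[List[int]]:
    """Return a batch of one encoded sequence, ready to be turned into a tensor."""
    return [[CLS_ID] + [symbol + OFFSET for symbol in sequence]]


print(encode_for_transformer([0, 3, 7]))  # [[1, 2, 5, 9]]
```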

## Save and submit 
This step creates the model needed for submission to the competition.

In [None]:
from submit_tools import save_function

save_function(predict, alphabet_size=nb_letters, prefix=f'dataset_{TRACK}.{DATASET}_')
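
The Submission section states that the model should not be a parameter of the submitted function, while the example `predict` takes it as a second argument. One possible way to reconcile the two, sketched here with toy stand-ins, is to bind the model once and obtain a single-argument function. `ToyModel`, `toy_predict` and `bind_model` are hypothetical names, and whether `save_function` accepts such a wrapper is an assumption not confirmed here.

```python
from functools import partial
from typing import Callable, List


class ToyModel:
    """Stand-in for a loaded model (assumption for illustration only)."""

    def predict(self, sequence: List[int]) -> int:
        return int(sum(sequence) % 2 == 0)


def toy_predict(sequence: List[int], model: ToyModel) -> int:
    """Two-argument function with the same shape as predict(sequence, model)."""
    return model.predict(sequence)


def bind_model(predict_fn: Callable, model) -> Callable[[List[int]], int]:
    """Fix the `model` keyword so the result takes only the sequence."""
    return partial(predict_fn, model=model)


single_arg_predict = bind_model(toy_predict, ToyModel())
print(single_arg_predict([1, 2, 3]))  # sum is 6 (even), so the toy model returns 1
```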