### Illustrating token masking 

This notebook builds a small example of event data and illustrates how token masking works on it.

In [1]:
import pandas as pd
import numpy as np
from typing import Tuple
from pop2vec.llm.src.new_code.custom_vocab import CustomVocabulary
from pop2vec.llm.src.data_new.types import PersonDocument, Background
from pop2vec.llm.src.tasks.mlm import MLM




#### First, we create some input data

In [47]:
data = [
    {
        "person_id": 1, 
        "segment": [2, 4, 6, 7, 8],
        "age": [18, 24, 29, 33, 56],
        "abspos": [200, 404, 500, 600, 805],
        "background": {"birth_year": "year_99.0", "birth_month": "month_12.0", "gender": "gender_2", "origin": "municipality_54.0"},
        "sentence": [ 
            ["educSim_2.0"], 
            ["_4_D"], 
            ["contractType2_1.0", "sicknessInsurance2_1.0", "wage_50.0"],
            ["contractType4_1.0", "sicknessInsurance3_1.0", "wage_10.0"],
            ["contractType1_1.0", "sicknessInsurance1_1.0", "wage_20.0"],
        ]
    },
    {
        "person_id": 2, 
        "segment": [6, 3, 10, 10, 10, 11, 12, 13, 14, 15],
        "age": [19, 22, 40, 42, 55, 80, 99, 101, 204, 206],
        "abspos": [99, 103, 501, 708, 890, 899, 901, 910, 915, 930],
        "background": {"birth_year": "year_95.0", "birth_month": "month_5.0", "gender": "gender_1", "origin": "municipality_15.0"},
        "sentence": [
            ["educSim_1.0"], 
            ["contractType2_1.0", "_4_D"],
            ["contractType2_1.0", "wage_50.0", "sicknessInsurance2_2.0"] ,
            ["contractType2_1.0", "sicknessInsurance1_1.0", "wage_50.0"],
            ["contractType4_1.0", "sicknessInsurance2_0.0", "wage_10.0"],
            ["contractType4_1.0", "sicknessInsurance2_0.0", "wage_10.0"],
            ["contractType4_1.0", "sicknessInsurance2_0.0", "wage_10.0"],
            ["contractType4_1.0", "sicknessInsurance2_0.0", "wage_10.0"],
            ["contractType4_1.0", "sicknessInsurance2_0.0", "wage_10.0"],
            ["contractType4_1.0", "sicknessInsurance2_0.0", "wage_10.0"],
        ]
    }
]

person_df = pd.DataFrame(data)

We need to create the vocabulary from these data. Normally, the vocabulary is built from the files that define sequence data like the above.

The vocabulary should look as follows:

```txt
TOKEN,CATEGORY,ID
[PAD],GENERAL,0
[CLS],GENERAL,1
[SEP],GENERAL,2
[MASK],GENERAL,3
[UNK],GENERAL,4
gender_1,BACKGROUND,5
gender_2,BACKGROUND,6
gender_MISSING,BACKGROUND,7
year_1958,background_shuffled_year,8
year_1962,background_shuffled_year,9
year_2003,background_shuffled_year,10
year_1961,background_shuffled_year,11
year_1952,background_shuffled_year,12
year_1971,background_shuffled_year,13
year_1991,background_shuffled_year,14
year_2016,background_shuffled_year,15
year_1965,background_shuffled_year,16
year_2008,background_shuffled_year,17
year_1963,background_shuffled_year,18
```

- `CATEGORY` refers to `filename_colname`; `TOKEN` consists of `colname_content`
- `abspos` is an increasing sequence of integers
- `age`, `segment` and `abspos` are constant for all tokens in the same event (?)
- `sentence`
	- an array of arrays; each inner array is an event, and each event holds string tokens such as `['INPAINV3400P_93', 'INPEMEZ_Others', 'INPKKGEM_Others', 'INPPH770UP_93']`
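
To make the third point concrete, here is a minimal sketch of how event-level values could be repeated for every token in an event. It is an illustration under that assumption, not the encoder's actual code; `expand_per_token` and `toy_person` are hypothetical names.

```python
def expand_per_token(person):
    """Repeat each event-level value once per token in that event."""
    n_events = len(person["sentence"])
    for key in ("segment", "age", "abspos"):
        if len(person[key]) != n_events:
            raise ValueError(f"{key} must have one entry per event")

    rows = []
    for event, segment, age, abspos in zip(
        person["sentence"], person["segment"], person["age"], person["abspos"]
    ):
        for token in event:
            # Every token in the event shares the event's abspos, age and segment.
            rows.append((token, abspos, age, segment))
    return rows


# Toy input with two events (values chosen for illustration only).
toy_person = {
    "sentence": [["educSim_2.0"], ["contractType2_1.0", "wage_50.0"]],
    "segment": [2, 4],
    "age": [18, 24],
    "abspos": [200, 404],
}

token_rows = expand_per_token(toy_person)
for row in token_rows:
    print(row)
# ('educSim_2.0', 200, 18, 2)
# ('contractType2_1.0', 404, 24, 4)
# ('wage_50.0', 404, 24, 4)
```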


In [48]:
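def get_vocab(person_list):
    """Build a vocabulary DataFrame with columns TOKEN, CATEGORY and ID."""
    # Collect the unique event tokens across all persons.
    # Set iteration order is arbitrary, so the IDs of these tokens can differ between runs.
    sentence_tokens = set()
    for person in person_list:
        tokens = [token for event in person["sentence"] for token in event]
        sentence_tokens |= set(tokens)

    # One row per person, one column per background attribute.
    background = [person["background"] for person in person_list]
    background_tokens = pd.DataFrame(background)

    general_tokens = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
    vocab = []

    # Special tokens come first so that they receive IDs 0 to 4.
    for token in general_tokens:
        item = {"TOKEN": token, "CATEGORY": "GENERAL"}
        vocab.append(item)

    for token in sentence_tokens:
        item = {"TOKEN": token, "CATEGORY": "income_file"}
        vocab.append(item)

    # Add every distinct value of every background attribute.
    for col in background_tokens.columns:
        for x in background_tokens[col].unique():
            item = {"TOKEN": x, "CATEGORY": "background"}
            vocab.append(item)

    # The ID of a token is its row index in the vocabulary.
    vocab_df = pd.DataFrame(vocab)
    vocab_df["ID"] = vocab_df.index
    return vocab_df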


In [49]:
vocab_df = get_vocab(data)
myvocab = CustomVocabulary(name="test", data_files=["a.csv", "b.csv"])
myvocab.vocab_df = vocab_df
myvocab.general_tokens

['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]']

In [50]:
vocab_df.head(10)  # the [MASK] token has ID 3

Unnamed: 0,TOKEN,CATEGORY,ID
0,[PAD],GENERAL,0
1,[CLS],GENERAL,1
2,[SEP],GENERAL,2
3,[MASK],GENERAL,3
4,[UNK],GENERAL,4
5,sicknessInsurance2_2.0,income_file,5
6,sicknessInsurance2_1.0,income_file,6
7,sicknessInsurance3_1.0,income_file,7
8,contractType4_1.0,income_file,8
9,educSim_1.0,income_file,9
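
A vocabulary table like this is used to map tokens to integer IDs. The following is a minimal sketch of such a lookup with an `[UNK]` fallback, built on a toy copy of the rows shown above. The names `toy_vocab`, `token_to_id`, `id_to_token` and `encode` are hypothetical and are not part of `CustomVocabulary`.

```python
import pandas as pd

# Toy copy of the first ten vocabulary rows shown above.
tokens = [
    "[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]",
    "sicknessInsurance2_2.0", "sicknessInsurance2_1.0",
    "sicknessInsurance3_1.0", "contractType4_1.0", "educSim_1.0",
]
toy_vocab = pd.DataFrame({"TOKEN": tokens})
toy_vocab["ID"] = toy_vocab.index

# Hypothetical helper mappings for lookups in both directions.
token_to_id = dict(zip(toy_vocab["TOKEN"], toy_vocab["ID"]))
id_to_token = {i: t for t, i in token_to_id.items()}


def encode(event_tokens):
    """Map tokens to IDs; tokens missing from the vocabulary become [UNK]."""
    unk_id = token_to_id["[UNK]"]
    return [token_to_id.get(token, unk_id) for token in event_tokens]


# "wage_50.0" is not in the toy vocabulary, so it maps to [UNK] (ID 4).
encoded = encode(["contractType4_1.0", "educSim_1.0", "wage_50.0"])
print(encoded)  # [8, 9, 4]
```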


In [51]:
person_df

Unnamed: 0,person_id,segment,age,abspos,background,sentence
0,1,"[2, 4, 6, 7, 8]","[18, 24, 29, 33, 56]","[200, 404, 500, 600, 805]","{'birth_year': 'year_99.0', 'birth_month': 'mo...","[[educSim_2.0], [_4_D], [contractType2_1.0, si..."
1,2,"[6, 3, 10, 10, 10, 11, 12, 13, 14, 15]","[19, 22, 40, 42, 55, 80, 99, 101, 204, 206]","[99, 103, 501, 708, 890, 899, 901, 910, 915, 930]","{'birth_year': 'year_95.0', 'birth_month': 'mo...","[[educSim_1.0], [contractType2_1.0, _4_D], [co..."


### Create a person document

The document is similar to the data that are passed into `pipeline.py`.

In [82]:
row = next(person_df.itertuples())  # first person in the table
person_id = row.person_id
sentences = row.sentence
person_document = PersonDocument(
    person_id=person_id,
    sentences=sentences,  # note: this may differ from the original code
    abspos=[int(float(x)) for x in row.abspos],
    age=[int(float(x)) for x in row.age],
    segment=[int(x) for x in row.segment],
    background=Background(**row.background),
)


In [83]:
len(sentences)

5

#### Set the MLM encoder and run it


In [84]:
mlm = MLM("mytest", max_length = 16, masking="random")
mlm.set_vocabulary(myvocab)

In [85]:
mlm

MLM(name='mytest', max_length=16, no_sep=False, p_sequence_timecut=0.0, p_sequence_resample=0.0, p_sequence_abspos_noise=0.0, p_sequence_hide_background=0.0, p_sentence_drop_tokens=0.0, shuffle_within_sentences=True, mask_ratio=0.3, smart_masking=False, masking='random')

In [86]:
output = mlm.encode_document(person_document, do_mlm=True)

In [87]:
mlm

MLM(name='mytest', max_length=16, no_sep=False, p_sequence_timecut=0.0, p_sequence_resample=0.0, p_sequence_abspos_noise=0.0, p_sequence_hide_background=0.0, p_sentence_drop_tokens=0.0, shuffle_within_sentences=True, mask_ratio=0.3, smart_masking=False, masking='random')

In [88]:
output is None

False

#### The main result is `input_ids`

In [89]:
print(len(output.input_ids))
output.input_ids[0][:40]  # the token IDs (first row); input_ids holds the model predictors
output.input_ids.shape

4


(4, 16)

For each sample, `input_ids` is an array of shape `[4, sequence_length]`:
- the first row holds the tokens, i.e. the sequence with masked and non-masked positions
- the second row is the absolute position (calendar time?)
- the third row is the age
- the fourth row is the segment
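
As a small illustration of this layout, the sketch below splits a toy array with the same four rows into named parts. The values in `toy_input_ids` are made up, and `split_input_ids` is a hypothetical helper, not part of the `MLM` class.

```python
import numpy as np

# Toy array in the layout described above (values are illustrative only).
toy_input_ids = np.array(
    [
        [1, 5, 3, 2, 0],      # token IDs (3 = [MASK], 0 = [PAD])
        [0, 0, 200, 200, 0],  # absolute position
        [0, 0, 18, 18, 0],    # age
        [0, 0, 2, 2, 0],      # segment
    ],
    dtype=float,
)


def split_input_ids(input_ids):
    """Return the four rows of input_ids as a dict of named 1-D arrays."""
    if input_ids.shape[0] != 4:
        raise ValueError("expected an array of shape [4, sequence_length]")
    names = ("tokens", "abspos", "age", "segment")
    return dict(zip(names, input_ids))


parts = split_input_ids(toy_input_ids)

# Positions whose token ID equals the [MASK] ID.
masked_positions = np.flatnonzero(parts["tokens"] == 3)
print(masked_positions)  # [2]
```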

In [90]:
output.original_sequence[10]
# original_sequence holds the token IDs before masking;
# in input_ids, the target tokens are masked

15

In [91]:
output.target_pos  # indices of the positions in the sequence that are masked

array([ 4, 12,  6,  7])

In [92]:
output.target_pos.min()

4

In [93]:
output.target_tokens  # original IDs of the masked tokens, which the model has to predict

array([19., 18.,  8.,  7.])

In [94]:
def print_target_position(idx, encoded_document):
    print(f"position {idx} in the original sequence with value {encoded_document.original_sequence[idx]} is a target token")

In [95]:
for idx in output.target_pos:
    print_target_position(idx, output)


position 4 in the original sequence with value 19 is a target token
position 12 in the original sequence with value 18 is a target token
position 6 in the original sequence with value 8 is a target token
position 7 in the original sequence with value 7 is a target token
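
The relationship between `original_sequence`, `target_pos` and `target_tokens` can be reproduced with a simplified sketch of random masking. This is not the implementation in the `MLM` class: it always substitutes `[MASK]`, whereas BERT-style masking also leaves some targets unchanged or swaps in random tokens. The function `random_mask`, the constant `N_SPECIAL` and the toy sequence are assumptions made for illustration.

```python
import numpy as np

MASK_ID = 3    # ID of [MASK] in the vocabulary above
N_SPECIAL = 5  # assumption: IDs 0-4 are special tokens and are never masked


def random_mask(sequence, mask_ratio, rng):
    """Replace a random share of the non-special tokens with [MASK]."""
    sequence = np.asarray(sequence)
    candidates = np.flatnonzero(sequence >= N_SPECIAL)
    if len(candidates) == 0:
        empty = np.array([], dtype=int)
        return sequence.copy(), empty, empty

    n_targets = max(1, int(round(mask_ratio * len(candidates))))
    target_pos = rng.choice(candidates, size=n_targets, replace=False)
    target_tokens = sequence[target_pos]  # original IDs, to be predicted

    masked = sequence.copy()
    masked[target_pos] = MASK_ID
    return masked, target_pos, target_tokens


rng = np.random.default_rng(0)
original = np.array([1, 19, 21, 2, 8, 7, 15, 2, 0, 0])  # toy sequence
masked, target_pos, target_tokens = random_mask(original, 0.3, rng)

# The original sequence at the target positions equals the target tokens.
for pos, tok in zip(target_pos, target_tokens):
    print(f"position {pos}: original {original[pos]}, target {tok}, input {masked[pos]}")
```

As in the output above, the value of the original sequence at each target position matches the corresponding entry of the target tokens.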


In [96]:
output.input_ids[0, 1]  # token ID at position 1
output.input_ids[1, 1]  # absolute position of the token at position 1
output.input_ids[2, 1]  # age of the token at position 1
output.input_ids[3, 1]  # segment of the token at position 1


0.0

Checking a single token:
- Background tokens have 0 for age, abspos, etc., which makes sense.

In [19]:
single_token = output.input_ids[:, 10]  # all four rows at position 10

In [20]:
single_token

array([  5., 805.,  56.,   8.])
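
A column like this can be made easier to read by labelling its four entries and decoding the token ID. The helper below is a hypothetical sketch; the `toy_id_to_token` mapping is invented for the example and does not reflect the vocabulary of the run above.

```python
import numpy as np


def describe_position(input_ids, idx, id_to_token):
    """Label the four entries of one column of input_ids."""
    token_id, abspos, age, segment = input_ids[:, idx]
    return {
        "token": id_to_token.get(int(token_id), "[UNK]"),
        "abspos": int(abspos),
        "age": int(age),
        "segment": int(segment),
    }


# Toy data: one background-like column of zeros and one event column.
toy_ids = np.array(
    [
        [6.0, 5.0],
        [0.0, 805.0],
        [0.0, 56.0],
        [0.0, 8.0],
    ]
)
toy_id_to_token = {5: "wage_20.0", 6: "gender_2"}  # hypothetical mapping

described = describe_position(toy_ids, 1, toy_id_to_token)
print(described)
# {'token': 'wage_20.0', 'abspos': 805, 'age': 56, 'segment': 8}
```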