A note to myself and others:
I get confused easily by the cased/uncased naming of transformer models (it sounds the other way round to me), so I'll use two other terms:

* **lowercased** model/tokenizer/string - transformed to lower case (Mom -> mom), 'uncased' in BERT
* **propercased** model/tokenizer/string - with unchanged casing (Mom -> Mom), 'cased' in BERT

# Data examination
**UPDATE** A problem:
There is at least one target that is part of a longer string (it is glued to other characters). This is not visible in the target column, and BERT tokenizes it with a '##' prefix.

5. **how to deal with it? How to examine all possible variants of target representations in BERT?**
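A minimal sketch of why the '##' shows up, assuming BERT's WordPiece step does greedy longest-match-first segmentation and marks every non-initial piece with `##`. The vocabulary is a toy one made up for illustration (not BERT's real vocab), and `wordpiece` is a hypothetical helper, not a `transformers` function.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of one word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes UNK
        pieces.append(match)
        start = end
    return pieces


# toy vocabulary (an assumption for this example)
toy_vocab = {"N", "A", "##N", "##A"}

alone = wordpiece("N", toy_vocab)    # the target on its own
glued = wordpiece("AN", toy_vocab)   # the same target glued to another character
```

Here `alone` contains the token `N`, while `glued` contains `##N` instead, so looking for the plain target string among the sentence tokens fails for the glued case.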

## INITIAL QUESTIONS:
1. **how many words are there in each subcorpus?** (single words vs mwe / bible vs biomed vs europarl)
2. **are there multiple target word instances in a sentence?**

**Approaches 3-4 are kinda outdated after the '##N' problem was encountered, but I keep them just for reference. I'd go to Q5 for the up-to-date answers**

3. **are there any words not in BERT vocab? (are there words that are segmented further?).** here I was just checking whether a target word type is present in BERT's vocab (was a bad idea)
4. **are there any oov subwords in oov words?**

## SUMMARY:
* there are no duplicates inside a subcorpus or across train/trial/test when looking at **ids**
* **case** is important for identifying the target token position in a sentence; it is straightforward with propercased sentences (one exception in test: row 584, where the casing in 'token' differs from the casing in 'sentence').
* with propercased sentences only rows 6144 and 6200 in single_train have two occurrences of the target word
* not all targets are present as a whole in the BERT vocabulary (represent them as an average of their subwords?)
* there are no UNKs when target words are segmented, but there are some UNKs in the sentences of single_train
* lowercased BERT leaves more targets unsegmented than propercased BERT (maybe case can be used to identify the target word in a sentence, which is then lowercased and represented?)
* There should be several tests to find the BERT tokens that represent targets (target as a whole, then target subwords as a sublist of sentence tokens, then target as a whole but as a substring of a token in the tokenized sentence)
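The "average of their subwords" idea from the list could look roughly like this. It is only a sketch: `average_span` is a made-up helper, and the vectors are dummy numbers standing in for BERT hidden states.

```python
def average_span(token_vectors, start, end):
    """Average the vectors of tokens start..end-1 (one vector per subword)."""
    span = token_vectors[start:end]
    if not span:
        raise ValueError("empty span")
    # zip(*span) groups the values dimension by dimension
    return [sum(dim) / len(span) for dim in zip(*span)]


# dummy 2-d "embeddings" for a tokenized sentence with 4 tokens
vectors = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
# pretend the target was split into the subwords at positions 1 and 2
target_vector = average_span(vectors, 1, 3)
```

A target that is a single token is just the special case where the span has length 1.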

# Table of Contents
* [Data loading](#load)
* [Question 1: how many words are there in each subcorpus](#q1)
* [Question 2: are there multiple target word instances in a sentence](#q2)
    * [Q2.1: single words](#q2_singles)
    * [Q2.2: mwes](#q2_mwes)
* [Question 3: are there targets that are not in BERT vocab as a whole? (**OUTDATED approach**)](#q3)
* [Question 4: are there targets that are completely OOVs for BERT? (**OUTDATED approach**)](#q4)
* [Question 5: what are the target representations in BERT? (**UP TO DATE**)](#q5)

In [1]:
import pandas as pd

def read_tsv(file_name):
    # sep must be passed by keyword in recent pandas; quoting=3 is csv.QUOTE_NONE
    df = pd.read_csv(file_name, sep='\t', quoting=3, na_filter=False)
    return df
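
Why `quoting=3`: it is the value of `csv.QUOTE_NONE`, so quote characters inside sentences are kept as plain text instead of being treated as field quoting. A small stdlib sketch of the same idea (the rows are invented for illustration, not taken from the data):

```python
import csv
import io

# two invented rows; the first sentence starts with an unbalanced quote
raw = 'id\tsentence\ttoken\n1\t"Behold, he said\tnull\n2\tSecond row\trow\n'

reader = csv.reader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)
header, first, second = rows
```

With `QUOTE_NONE` the rows are split on tabs only, so the stray quote stays inside the sentence field. `na_filter=False` plays a similar protective role in pandas: strings such as `null` or `NA` stay strings instead of being turned into NaN.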

# Data loading <a class="anchor" id="load"></a>

In [2]:
single_train = read_tsv('data/train/lcp_single_train.tsv')
multi_train = read_tsv('data/train/lcp_multi_train.tsv')

single_trial = read_tsv('data/trial/lcp_single_trial.tsv')
multi_trial = read_tsv('data/trial/lcp_multi_trial.tsv')

single_test = read_tsv('data/test/lcp_single_test_labels.tsv')
multi_test = read_tsv('data/test/lcp_multi_test_labels.tsv')

In [4]:
single_train.head(3)

Unnamed: 0,id,corpus,sentence,token,complexity
0,3ZLW647WALVGE8EBR50EGUBPU4P32A,bible,"Behold, there came up out of the river seven c...",river,0.0
1,34R0BODSP1ZBN3DVY8J8XSIY551E5C,bible,I am a fellow bondservant with you and with yo...,brothers,0.0
2,3S1WOPCJFGTJU2SGNAN2Y213N6WJE3,bible,"The man, the lord of the land, said to us, 'By...",brothers,0.05


In [5]:
multi_train.head(3)

Unnamed: 0,id,corpus,sentence,token,complexity
0,3S37Y8CWI80N8KVM53U4E6JKCDC4WE,bible,but the seventh day is a Sabbath to Yahweh you...,seventh day,0.027778
1,3WGCNLZJKF877FYC1Q6COKNWTDWD11,bible,"But let each man test his own work, and then h...",own work,0.05
2,3UOMW19E6D6WQ5TH2HDD74IVKTP5CB,bible,To him who by understanding made the heavens; ...,loving kindness,0.05


In [6]:
single_trial.head(3)

Unnamed: 0,id,subcorpus,sentence,token,complexity
0,3QI9WAYOGQB8GQIR4MDIEF0D2RLS67,bible,They will not hurt nor destroy in all my holy ...,sea,0.0
1,3T8DUCXY0N6WD9X4RTLK8UN1U929TF,bible,"that sends ambassadors by the sea, even in ves...",sea,0.102941
2,3I7KR83SNADXAQ7HXK7S7305BYB9KD,bible,"and they entered into the boat, and were going...",sea,0.109375


In [7]:
multi_trial.head(3)

Unnamed: 0,id,subcorpus,sentence,token,complexity
0,31HLTCK4BLVQ5BO1AUR91TX9V9IVGH,bible,"The name of one son was Gershom, for Moses sai...",foreign land,0.0
1,389A2A304OIXVY7G5B71Q9M43LE0CL,bible,"unleavened bread, unleavened cakes mixed with ...",wheat flour,0.157895
2,31N9JPQXIPIRX2A3S9N0CCFXO6TNHR,bible,However the high places were not taken away; t...,burnt incense,0.2


In [8]:
single_test.head(3)

Unnamed: 0,id,corpus,sentence,token,complexity
0,3K8CQCU3KE19US5SN890DFPK3SANWR,bible,"But he, beckoning to them with his hand to be ...",hand,0.0
1,3Q2T3FD0ON86LCI41NJYV3PN0BW3MV,bible,"If I forget you, Jerusalem, let my right hand ...",hand,0.197368
2,3ULIZ0H1VA5C32JJMKOTQ8Z4GUS51B,bible,"the ten sons of Haman the son of Hammedatha, t...",hand,0.2


In [9]:
multi_test.head(3)

Unnamed: 0,id,corpus,sentence,token,complexity
0,3UXQ63NLAAMRIP4WG4XPD98AOYOBLX,bible,"for he had an only daughter, about twelve year...",only daughter,0.025
1,3FJ2RVH25Z62TA3R8E1O77EBUYU92W,bible,All these were cities fortified with high wall...,high walls,0.0999999999999999
2,3YO4AH2FPDK1PZHZAT8WAEBL70EQ0F,bible,"In the morning, 'It will be foul weather today...",weather today,0.125


In [10]:
# renaming a column
single_trial = single_trial.rename(columns={'subcorpus':'corpus'})
multi_trial = multi_trial.rename(columns={'subcorpus':'corpus'})

### checking that there are no duplicates inside a subcorpus
is the length of a list with unique ids the same as just a list of ids?

In [11]:
# checking that there are no duplicates inside a subcorpus
print(len(single_train['id'].unique()) == len(single_train['id']))
print(len(single_trial['id'].unique()) == len(single_trial['id']))
print(len(single_test['id'].unique()) == len(single_test['id']))

print(len(multi_train['id'].unique()) == len(multi_train['id']))
print(len(multi_trial['id'].unique()) == len(multi_trial['id']))
print(len(multi_test['id'].unique()) == len(multi_test['id']))

True
True
True
True
True
True


### checking that there are no duplicates across subcorpora
checking if any ids are shared between train, trial and test

In [12]:
# isin() tests membership against the other split's ids in one pass
print(single_trial['id'].isin(single_train['id']).any())
print(single_test['id'].isin(single_train['id']).any())
print(single_test['id'].isin(single_trial['id']).any())


print(multi_trial['id'].isin(multi_train['id']).any())
print(multi_test['id'].isin(multi_train['id']).any())
print(multi_test['id'].isin(multi_trial['id']).any())

False
False
False
False
False
False
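
If one of these checks ever came back positive, it would be handy to see which ids are affected, not just a boolean. A possible sketch with plain Python containers (both helpers and the ids are invented for the example):

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the ids that occur more than once in one list."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


def shared_ids(ids_a, ids_b):
    """Return the ids that are present in both lists."""
    return sorted(set(ids_a) & set(ids_b))


# invented ids, just to show the behaviour
train_ids = ["a1", "b2", "c3", "b2"]
trial_ids = ["c3", "d4"]

dups = duplicated_ids(train_ids)
overlap = shared_ids(train_ids, trial_ids)
```

With the dataframes above one would pass columns such as `single_train['id']` and `single_trial['id']`; empty results correspond to the `True`/`False` outputs printed in this section.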


## Question 1 <a class="anchor" id="q1"></a>
## how many words are there in each subcorpus? (single words vs mwe / bible vs biomed vs europarl)
According to the github [description](https://github.com/MMU-TDMLab/CompLex),
*the training set includes 1,517 MWEs from 3 domains: 505 bible, 514 biomedical, and 498 Europarl and 7,662 single words: 2,574 bible, 2,576 biomedical, and 2,512 Europarl.*

*The trial set includes 99 MWEs from 3 domains: 29 bible, 33 biomedical, and 37 Europarl, and 421 single words: 143 bible, 135 biomedical, and 143 Europarl.*

*The test data comprises of 184 MWEs (66 bible, 53 biomed and 65 europarl) and 917 single word instances (283 bible, 289 biomed and 345 europarl).*

**NOTE**: there can be several sentences for the same target word

**NOTE2**: the size of the corpus differs from what is reported in the [paper](https://arxiv.org/pdf/2003.07008.pdf), or maybe I don't get what is being reported in Table 1.

In [13]:
dfs = [single_train, multi_train, single_trial, multi_trial, single_test, multi_test]
names = ['single_train', 'multi_train', 'single_trial', 'multi_trial', 'single test', 'multi test']
bible = []
europarl = []
biomed = []
total_set = []
for df in dfs:
    bible.append(len(df[df['corpus'] == 'bible']))
    europarl.append(len(df[df['corpus'] == 'europarl']))
    biomed.append(len(df[df['corpus'] == 'biomed']))
    total_set.append(len(df))

In [14]:
data = {'DATA_NAME':names,'BIBLE_LEN':bible,'EUROPARL_LEN':europarl,'BIOMED_LEN':biomed,'TOTAL_LEN':total_set}
corpora_lens = pd.DataFrame(data)
corpora_lens.loc['TOTAL_CORPUS_LEN'] = corpora_lens.sum(numeric_only=True,axis=0)

corpora_lens.astype(int, errors='ignore')

Unnamed: 0,DATA_NAME,BIBLE_LEN,EUROPARL_LEN,BIOMED_LEN,TOTAL_LEN
0,single_train,2574,2512,2576,7662
1,multi_train,505,498,514,1517
2,single_trial,143,143,135,421
3,multi_trial,29,37,33,99
4,single test,283,345,289,917
5,multi test,66,65,53,184
TOTAL_CORPUS_LEN,,3600,3600,3600,10800


## Question 2 <a class="anchor" id="q2"></a>
## are there multiple target word instances in a sentence? (Yes and No)
with propercased sentences, only rows 6144 and 6200 in single_train contain multiple instances

### single words <a class="anchor" id="q2_singles"></a>
### single words train
* when lowercased, a token substring can occur 1, 2, 3, 4, 5, 9 or 11 times in a sentence string (plain substring counting, no tokenization). The highest counts are for chemical elements (since they are just letters). There are 87 sentences with multiple target tokens this way. For example, in sentence 6756 it is not really clear which instance of the target word should be evaluated.
* when propercased, a token substring occurs 1 or 2 times in a sentence string. Only two sentences contain more than 1 occurrence.

**SUMMARY**: casing is important for choosing the right target token in a sentence
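
The counts in this section come from plain substring counting, which is why one-letter targets explode when lowercased. A hedged sketch that contrasts it with whole-word counting (both function names are made up here; the first example sentence is the *Rod* sentence from single_trial, the second is invented):

```python
import re


def count_substring(sentence, token, lowercased=False):
    """Plain substring counting, as done with str.count in this notebook."""
    if lowercased:
        sentence, token = sentence.lower(), token.lower()
    return sentence.count(token)


def count_whole_word(sentence, token, lowercased=False):
    """Count only matches that are not glued to other letters or digits."""
    pattern = r'(?<!\w)' + re.escape(token) + r'(?!\w)'
    flags = re.IGNORECASE if lowercased else 0
    return len(re.findall(pattern, sentence, flags))


rod_sentence = ('Rod spherules establish an invaginating synapse with rod bipolar '
                'dendrites and axonal endings of horizontal cells.')
letter_sentence = 'N and nitrogen'  # invented example for a one-letter target

rod_counts = (count_substring(rod_sentence, 'Rod'),
              count_substring(rod_sentence, 'Rod', lowercased=True))
letter_counts = (count_substring(letter_sentence, 'N', lowercased=True),
                 count_whole_word(letter_sentence, 'N', lowercased=True))
```

Whole-word counting would miss targets that are glued to other characters, so it is a complement to the substring count, not a replacement.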

In [15]:
# counting the number of token strings in a lowercased sentence
single_train['occurance_number_lowercased'] = single_train.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

# counting the number of token strings in a propercased sentence
single_train['occurance_number_propercased'] = single_train.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

In [18]:
# the distinct occurrence counts that were observed
print(single_train['occurance_number_lowercased'].unique())
print(single_train['occurance_number_propercased'].unique())

[ 1  2  3  4 11  9  5]
[1 2]


In [20]:
# how many targets appear more than once when lowercased?
print(len(single_train[single_train['occurance_number_lowercased']>1]))

# show everything that appears more than 3 times when lowercased
single_train[single_train['occurance_number_lowercased']>3]

87


Unnamed: 0,id,corpus,sentence,token,complexity,occurance_number_lowercased,occurance_number_propercased
4203,3X4Q1O9UBHMCMY43GF110OQ80EE7O2,biomed,"CorA, B, C and D belong to a protein family in...",B,0.4,4,1
4814,3A3KKYU7P3H3CAKSB7U0000KY4FWMJ,biomed,"The mice used in the present study, (NFR/N × B...",MA,0.59375,4,1
4927,3T5ZXGO9DEOYRKNPENLOGDE7P89QZL,biomed,Because synapsis occurs in TRIP13-deficient sp...,CO,0.526316,4,1
5079,3538U0YQ1FU0F2QNF0FL0D5E3B1F3J,biomed,Superficial and deep anterior cortical stainin...,N,0.602941,11,1
5080,3TTPFEFXCTKJQH4BTS1JA1TBTGIH6P,biomed,Peptide Aβ is released from APP by the action ...,N,0.75,9,1
7579,36MUZ9VAE626RGSODE1RV46QINFED2,europarl,"A4-0124/97 by Mr Wynn, on behalf of the Commit...",VI,0.485294,5,1


In [51]:
# show duplicates when propercased
for s in single_train[single_train['occurance_number_propercased']>1]['sentence']:
    print(s)

Approval of the minutes of the previous sitting: see minutes
Approval of Minutes of previous sitting: see Minutes


### single words trial
* always exactly 1 target occurrence when propercased
* 6 sentences with 2 occurrences when lowercased

In [21]:
single_trial['occurance_number_lowercased'] = single_trial.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

single_trial['occurance_number_propercased'] = single_trial.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

In [22]:
print(single_trial['occurance_number_lowercased'].unique())
print(single_trial['occurance_number_propercased'].unique())

[1 2]
[1]


In [23]:
print(len(single_trial[single_trial['occurance_number_lowercased']>1]))
single_trial[single_trial['occurance_number_lowercased']>1]

6


Unnamed: 0,id,corpus,sentence,token,complexity,occurance_number_lowercased,occurance_number_propercased
171,3EFNPKWBMSO9IYBXCIW0X6IAX8E030,biomed,Lung development in Dhcr7-/- embryos at the ea...,Lung,0.175,2,1
252,379OL9DBSSESUVWY1Z8JGBFG9E19YR,biomed,Rod spherules establish an invaginating synaps...,Rod,0.4,2,1
280,3P0I4CQYVY7RCD54ON9DS4PPT5QOWO,europarl,"We have simply confirmed, in accordance with o...",Rules,0.178571,2,1
318,3TFJJUELSHP4R8AUKYBF9XFJ0LWC2J,europarl,Proposal for a Council Decision establishing f...,Justice,0.203125,2,1
417,31GECDVA9JM3TSKUX9AFDA4LK3466H,europarl,The proposal to amend Regulation (EC) No 539/2...,EC,0.5,2,1
418,3OQQD2WO8I6KPTSDG8L63AI6J4E3IL,europarl,"the report by Mr Albertini, on behalf of the C...",EC,0.605263,2,1


In [24]:
single_trial['sentence'][252]

'Rod spherules establish an invaginating synapse with rod bipolar dendrites and axonal endings of horizontal cells.'

### single words test
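* when propercased, each target occurs at most once; row 584 has 0 occurrences because the casing of 'Group' in the 'token' column differs from the casing in the sentence
* when lowercased, a target substring occurs up to 7 times in a sentence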

In [30]:
single_test['occurance_number_lowercased'] = single_test.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

single_test['occurance_number_propercased'] = single_test.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

In [31]:
print(single_test['occurance_number_lowercased'].unique())
print(single_test['occurance_number_propercased'].unique())

[1 3 2 7 5 4]
[1 0]


In [32]:
single_test[single_test['occurance_number_propercased']==0]

Unnamed: 0,id,corpus,sentence,token,complexity,occurance_number_lowercased,occurance_number_propercased
584,38LRF35D5LWPYKNDAPAKMD6HD1M3UI,europarl,"(DE) Mr President, ladies and gentlemen, furth...",Group,0.075,1,0
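
One way to handle a row like 584 is to look for the exact casing first and fall back to a case-insensitive search only if that fails. A minimal sketch, assuming character offsets are what is needed later; `find_target_span` and the example sentences are invented:

```python
def find_target_span(sentence, token):
    """Return (start, end, exact_case) for the first match, or None."""
    start = sentence.find(token)
    exact_case = True
    if start == -1:
        # fall back to a case-insensitive search
        # (assumes lower() keeps the string length, which holds for plain ASCII text)
        start = sentence.lower().find(token.lower())
        exact_case = False
    if start == -1:
        return None
    return start, start + len(token), exact_case


exact = find_target_span('The Group decided', 'Group')
fallback = find_target_span('the group decided', 'Group')
missing = find_target_span('nothing here', 'Group')
```

The `exact_case` flag makes it easy to list the rows that needed the fallback and inspect them by hand.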


### MWE <a class="anchor" id="q2_mwes"></a>
### mwe train
* 5 sentences with 2 occurrences of the target MWE when lowercased; always exactly 1 when propercased

In [33]:
multi_train['occurance_number_lowercased'] = multi_train.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

multi_train['occurance_number_propercased'] = multi_train.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

In [34]:
print(multi_train['occurance_number_lowercased'].unique())
print(multi_train['occurance_number_propercased'].unique())

[1 2]
[1]


In [35]:
print(len(multi_train[multi_train['occurance_number_lowercased']>1]))

5


In [36]:
multi_train[multi_train['occurance_number_lowercased']>1]

Unnamed: 0,id,corpus,sentence,token,complexity,occurance_number_lowercased,occurance_number_propercased
1110,3O4VWC1GEW6GK4CJYQ66FBX6WJ5J3F,europarl,Proposal for a Council Decision establishing f...,Fundamental Rights,0.27381,2,1
1185,3O4VWC1GEW6GK4CJYQ66FBX6WJ53JZ,europarl,"- Madam President, the World Food Summit last ...",food security,0.315789,2,1
1236,30UZJB2POHC8D5XY9O2CE1E1EIA35D,europarl,Support for rural development by the European ...,rural development,0.329545,2,1
1390,3AA88CN98P3CBRFP5WZ86KTWK90YKZ,europarl,Revision of the Treaties - Transitional measur...,transitional measures,0.444444,2,1
1406,3G5RUKN2EC3YIWSKUXZ8ZVH95VKN94,europarl,"The next item is the report by Tanja Fajon, on...",Council regulation,0.4625,2,1


In [37]:
multi_train['sentence'][1236]

'Support for rural development by the European Agricultural Fund for Rural Development (EAFRD) ('

### mwe trial
every target MWE occurs exactly once, both lowercased and propercased

In [61]:
multi_trial['occurance_number_lowercased'] = multi_trial.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

multi_trial['occurance_number_propercased'] = multi_trial.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

print(multi_trial['occurance_number_lowercased'].unique())
print(multi_trial['occurance_number_propercased'].unique())

[1]
[1]


### mwe test
every target MWE occurs exactly once, both lowercased and propercased

In [38]:
multi_test['occurance_number_lowercased'] = multi_test.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

multi_test['occurance_number_propercased'] = multi_test.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

print(multi_test['occurance_number_lowercased'].unique())
print(multi_test['occurance_number_propercased'].unique())

[1]
[1]


## Question 3 <a class="anchor" id="q3"></a>
## are there any words not in BERT vocab? (are there words that are segmented further?) (YES)

There are lots of targets that are not identical to any token in BERT's vocabulary. They are segmented further into subwords, so their representation has to be built from subwords. Good news: it seems that there are no UNKs in those subwords. The lowercased model knows more targets as a whole than the propercased model.

**BAD NEWS:** the BERT tokenizer can tokenize a target word on its own differently than in context, so this method should be augmented.

In [39]:
from transformers import BertTokenizer

tokenizer_propercased = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer_lowercased = BertTokenizer.from_pretrained('bert-base-uncased')


In [45]:
def check_vocab_singles(vocab, words, get_not_in_vocab=True, lowercased=True):
    """Checks for target words in a model's vocabulary
    
    Prints out the number of target words absent from the vocabulary.
    
    WHEN get_not_in_vocab=True, returns a list of words absent from the vocabulary
    WHEN lowercased=True, the target words are lowercased to match the lowercased vocabulary given
    
    Parameters
    ----------
    vocab : iterable object of strings
        a model's vocabulary (a set of all known tokens)
    words : iterable object of strings
        a set of target word tokens that we hope to find in the vocabulary
    get_not_in_vocab : bool
        indicates if a list of words absent from a vocabulary should be returned (True for yes)
    lowercased : bool
        indicates if vocab contains only lowercased words, so targets should be lowercased to match that
    
    Returns
    -------
    not_in_vocab : list of strings
        a list of words absent from a vocabulary
        
    """
    if lowercased:
        words = {word.lower() for word in words}
    not_in_vocab = [word for word in words if word not in vocab]
    if len(not_in_vocab) == 0:
        print("All word types are in the model's vocabulary")
    else:
        print("There are", len(not_in_vocab), 'word types absent out of', len(words))
    # always hand back a list (possibly empty) so callers can iterate over it
    if get_not_in_vocab:
        return not_in_vocab

print('-----LOWERCASED-----')
print('####train####')
not_in_vocab_train_lowercased = check_vocab_singles(tokenizer_lowercased.vocab, set(single_train['token']))
print('####trial####')
not_in_vocab_trial_lowercased = check_vocab_singles(tokenizer_lowercased.vocab, set(single_trial['token']))
print('####test####')
not_in_vocab_test_lowercased = check_vocab_singles(tokenizer_lowercased.vocab, set(single_test['token']))
print('-----PROPER CASED-----')
print('####train####')
not_in_vocab_train_propercased = check_vocab_singles(tokenizer_propercased.vocab, set(single_train['token']), lowercased=False)
print('####trial####')
not_in_vocab_trial_propercased = check_vocab_singles(tokenizer_propercased.vocab, set(single_trial['token']), lowercased=False)
print('####test####')
not_in_vocab_test_propercased = check_vocab_singles(tokenizer_propercased.vocab, set(single_test['token']), lowercased=False)



-----LOWERCASED-----
####train####
There are 929 word types absent out of 3298
####trial####
There are 50 word types absent out of 212
####test####
There are 122 word types absent out of 423
-----PROPER CASED-----
####train####
There are 1310 word types absent out of 3487
####trial####
There are 73 word types absent out of 213
####test####
There are 164 word types absent out of 429
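
To compare the two tokenizers it may help to turn these counts into shares. The numbers below are copied from the printed output; the dictionary layout is just one possible way to organise them.

```python
# (absent, total) word types copied from the printed output
absent_counts = {
    'lowercased':  {'train': (929, 3298), 'trial': (50, 212), 'test': (122, 423)},
    'propercased': {'train': (1310, 3487), 'trial': (73, 213), 'test': (164, 429)},
}

# share of target word types that are not a single vocabulary entry
absent_share = {
    model: {part: absent / total for part, (absent, total) in parts.items()}
    for model, parts in absent_counts.items()
}

for model, parts in absent_share.items():
    for part, share in parts.items():
        print(f'{model:12s} {part:5s} {share:.1%}')
```

For train this gives roughly 28% absent word types for the lowercased vocabulary against roughly 38% for the propercased one, and the lowercased share is lower in every split.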


In [47]:
def check_vocab_mwe(tokenizer, mwes, get_not_in_vocab=True, lowercased=True):
    """Checks for both words in a pair being in a model's vocabulary
    
    Prints out the number of target pairs, where a pair tokenized by a given tokenizer doesn't match this pair separated by whitespace
    
    WHEN get_not_in_vocab=True, returns a list of target pairs where one or both words are absent from the vocabulary
    WHEN lowercased=True, the target pairs are lowercased to match the lowercased vocabulary used by a tokenizer
    
    
    
    Parameters
    ----------
    tokenizer : a tokenizer object
        takes a string and returns a list of string tokens
    mwes : iterable object of strings
        target word pairs that we hope to find in the vocabulary 
    get_not_in_vocab : bool
        indicates if a list of target pairs where one or both words are absent from the vocabulary should be returned (True for yes)
    lowercased : bool
        indicates if vocab contains only lowercased words, so targets should be lowercased to match that
    
    Returns
    -------
    not_in_vocab : list of strings
        a list of target pairs where one or both words are absent from the vocabulary
        
    """
    if lowercased:
        mwes = [word_pair.lower() for word_pair in mwes]
    # a pair counts as "not in vocab" if tokenization gives anything other than its two words
    not_in_vocab = {word_pair for word_pair in mwes if tokenizer.tokenize(word_pair) != word_pair.split(" ")}

    if len(not_in_vocab) == 0:
        print("All mwe tokens are in the model's vocabulary")
    else:
        print("There are", len(not_in_vocab), 'MWEs not segmented into just two words out of', len(set(mwes)))
    # always hand back a set (possibly empty) so callers can iterate over it
    if get_not_in_vocab:
        return not_in_vocab

print('-----LOWERCASED-----')
print('####train####')
not_in_vocab_train_mwe_lowercased = check_vocab_mwe(tokenizer_lowercased, multi_train['token'])
print('####trial####')
not_in_vocab_trial_mwe_lowercased = check_vocab_mwe(tokenizer_lowercased, multi_trial['token'])
print('####test####')
not_in_vocab_test_mwe_lowercased = check_vocab_mwe(tokenizer_lowercased, multi_test['token'])
print('-----PROPERCASED-----')
print('####train####')
not_in_vocab_train_mwe_propercased = check_vocab_mwe(tokenizer_propercased, multi_train['token'], lowercased=False)
print('####trial####')
not_in_vocab_trial_mwe_propercased = check_vocab_mwe(tokenizer_propercased, multi_trial['token'], lowercased=False)
print('####test####')
not_in_vocab_test_mwe_propercased = check_vocab_mwe(tokenizer_propercased, multi_test['token'], lowercased=False)

-----LOWERCASED-----
####train####
There are 473 MWEs not segmented into just two words out of 1263
####trial####
There are 34 MWEs not segmented into just two words out of 76
####test####
There are 50 MWEs not segmented into just two words out of 142
-----PROPERCASED-----
####train####
There are 576 MWEs not segmented into just two words out of 1270
####trial####
There are 40 MWEs not segmented into just two words out of 76
####test####
There are 59 MWEs not segmented into just two words out of 142


## Question 4 <a class="anchor" id="q4"></a>
## are there any oov subwords in oov words? (NO)
Good news: it seems that there are no unks in subwords of targets

There are a couple of UNKs in sentences for single targets. I don't really care about those for now.

In [49]:
def check_oovs(not_in_vocab, tokenizer, unk_token = '[UNK]'):
    """Checks if words missing from a vocabulary contain UNK subwords after tokenization
    
    Prints out the number of strings (target words, MWEs or sentences) whose tokenization contains at least one UNK token.
    
    Parameters
    ----------
    not_in_vocab : iterable object of strings
        strings to tokenize (targets missing from the vocabulary, or whole sentences)
    tokenizer : a tokenizer object
        takes a string and returns a list of string tokens
    unk_token : str
        the token a tokenizer uses to mark an OOV string, defaults to BERT's '[UNK]'
        
    """
    contains_unk_subs = 0
    for string in not_in_vocab:
        subwords = tokenizer.tokenize(string)
        if unk_token in subwords:
            contains_unk_subs+=1
    if contains_unk_subs>0:
        print('There are', contains_unk_subs, 'UNKs')
    else:
        print('All subwords are in vocab')

check_oovs(not_in_vocab_train_propercased, tokenizer_propercased)
check_oovs(not_in_vocab_trial_propercased, tokenizer_propercased)
check_oovs(not_in_vocab_test_propercased, tokenizer_propercased)
check_oovs(not_in_vocab_train_lowercased, tokenizer_lowercased)
check_oovs(not_in_vocab_trial_lowercased, tokenizer_lowercased)
check_oovs(not_in_vocab_test_lowercased, tokenizer_lowercased)

All subwords are in vocab
All subwords are in vocab
All subwords are in vocab
All subwords are in vocab
All subwords are in vocab
All subwords are in vocab


In [50]:
check_oovs(not_in_vocab_train_mwe_lowercased, tokenizer_lowercased)
check_oovs(not_in_vocab_trial_mwe_lowercased, tokenizer_lowercased)
check_oovs(not_in_vocab_test_mwe_lowercased, tokenizer_lowercased)
check_oovs(not_in_vocab_train_mwe_propercased, tokenizer_propercased)
check_oovs(not_in_vocab_trial_mwe_propercased, tokenizer_propercased)
check_oovs(not_in_vocab_test_mwe_propercased, tokenizer_propercased)

All subwords are in vocab
All subwords are in vocab
All subwords are in vocab
All subwords are in vocab
All subwords are in vocab
All subwords are in vocab


In [52]:
print('----Singles train lowercased----')
check_oovs(single_train['sentence'], tokenizer_lowercased)
print('----Singles train propercased----')
check_oovs(single_train['sentence'], tokenizer_propercased)
print('----Singles trial lowercased----')
check_oovs(single_trial['sentence'], tokenizer_lowercased)
print('----Singles trial propercased----')
check_oovs(single_trial['sentence'], tokenizer_propercased)
print('----Singles test lowercased----')
check_oovs(single_test['sentence'], tokenizer_lowercased)
print('----Singles test propercased----')
check_oovs(single_test['sentence'], tokenizer_propercased)


print('----MWE train lowercased----')
check_oovs(multi_train['sentence'], tokenizer_lowercased)
print('----MWE train propercased----')
check_oovs(multi_train['sentence'], tokenizer_propercased)
print('----MWE trial lowercased----')
check_oovs(multi_trial['sentence'], tokenizer_lowercased)
print('----MWE trial propercased----')
check_oovs(multi_trial['sentence'], tokenizer_propercased)
print('----MWE test lowercased----')
check_oovs(multi_test['sentence'], tokenizer_lowercased)
print('----MWE test propercased----')
check_oovs(multi_test['sentence'], tokenizer_propercased)

----Singles train lowercased----
There are 1 UNKs
----Singles train propercased----
There are 3 UNKs
----Singles trial lowercased----
All subwords are in vocab
----Singles trial propercased----
All subwords are in vocab
----Singles test lowercased----
All subwords are in vocab
----Singles test propercased----
All subwords are in vocab
----MWE train lowercased----
All subwords are in vocab
----MWE train propercased----
All subwords are in vocab
----MWE trial lowercased----
All subwords are in vocab
----MWE trial propercased----
All subwords are in vocab
----MWE test lowercased----
All subwords are in vocab
----MWE test propercased----
All subwords are in vocab
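
To see which sentences produce the UNKs (rather than only how many), a variant that returns the offending strings could be used. This is a sketch: `find_unk_strings` is a hypothetical helper, and `toy_tokenize` is a stand-in tokenizer invented for the example, not BERT's behaviour.

```python
def find_unk_strings(strings, tokenize, unk_token='[UNK]'):
    """Return (position, string) pairs whose tokenization contains unk_token."""
    hits = []
    for position, string in enumerate(strings):
        if unk_token in tokenize(string):
            hits.append((position, string))
    return hits


# stand-in tokenizer for the example: any non-ASCII word becomes [UNK]
def toy_tokenize(text):
    return [word if word.isascii() else '[UNK]' for word in text.split()]


examples = ['plain ascii sentence', 'peptide Aβ is released']
hits = find_unk_strings(examples, toy_tokenize)
```

With the real data one would pass something like `single_train['sentence']` and `tokenizer_propercased.tokenize`. Note that `position` is the positional index from `enumerate`, not necessarily the dataframe index.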


# Question 5 <a class="anchor" id="q5"></a>
How to deal with the fact that BERT can tokenize a target word differently when it's alone and when it's in some context? I.e., how to make sure that we find the right BERT tokens for targets?

for a single word:
* 1. check if a target word token is in sentence tokens (and only once)
* 2. if not, tokenize the target word on its own and check that
  * A it is split into several subwords
  * B those subwords are a sublist of sentence tokens (once)
* 3. if not, check if a target string is a substring of any sentence token (and only once)
* 4. give out what is completely missing


for a MWE:
* 1. check if a tokenized pair is a sublist of sentence tokens (and only once)
* 2. give out what is completely missing
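
The cascade for single words can be sketched on plain token lists, without loading a tokenizer. The function names and the example tokens are invented; the order of the checks follows the list above.

```python
def find_sublist(tokens, sub):
    """Start indices where `sub` occurs as a contiguous sublist of `tokens`."""
    n = len(sub)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == sub]


def locate_single_target(sentence_tokens, target, target_tokens):
    """Return (check_name, positions); positions index into sentence_tokens."""
    # check 1: the target is a standalone sentence token
    whole = [i for i, tok in enumerate(sentence_tokens) if tok == target]
    if whole:
        return 'whole', whole
    # check 2: the target splits into subwords that form a sublist of the sentence
    if len(target_tokens) > 1:
        starts = find_sublist(sentence_tokens, target_tokens)
        return ('subwords', starts) if starts else ('missing', [])
    # check 3: the target is a substring of some sentence token
    inside = [i for i, tok in enumerate(sentence_tokens) if target in tok]
    return ('substring', inside) if inside else ('missing', [])


# invented token lists for illustration
r_whole = locate_single_target(['the', 'river', 'rose'], 'river', ['river'])
r_sub = locate_single_target(['a', 'bond', '##ser', '##vant'], 'bondservant',
                             ['bond', '##ser', '##vant'])
r_glued = locate_single_target(['A', '##N'], 'N', ['N'])
r_none = locate_single_target(['x'], 'y', ['y'])
```

More than one position in the result signals a multiple occurrence, which still has to be resolved separately. For an MWE, `find_sublist` on the tokenized pair alone covers step 1.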

SUMMARY:
So, for the propercased model, everything is found in the end. *Minutes* is still the only multiple occurrence in the singles dfs.

The only thing that is not as it should be is in single_test: in row 38LRF35D5LWPYKNDAPAKMD6HD1M3UI the casing in the sentence is different from the casing in the 'token' column.

P.S. sorry for the lazy prin

In [56]:
def check_tokenized_sentence_singles(tokenizer, df, get_not_in_vocab=True, lowercased=True):
    """Checks for sentence tokens that represent target words
    
    Prints out:
    
    The number of targets that were not found at all
    The number of targets that are not represented as a standalone whole in a tokenized sentence (targets with a '##' prefix are counted here)
    The number of targets that were tokenized into subwords and those subwords match a tokenized sentence sublist
    The number of targets that were not segmented further, but instead were undersegmented in a sentence, 
        and thus were a substring of some sentence token.
    The number of times a target was found represented several times in a tokenized sentence.
    
    WHEN get_not_in_vocab=True, returns a tuple of lists of targets that constitute the statistics described above
    WHEN lowercased=True, the target words are lowercased to match the lowercased vocabulary of the tokenizer
    
    
    Parameters
    ----------
    tokenizer : a tokenizer object
        takes a string and returns a list of string tokens
    df : pandas dataframe
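        a dataframe containing a sentence (df.sentence) and a target word (df.token) in the same row
    get_not_in_vocab : bool
        indicates if a tuple of lists of targets that constitute the printed statistics should be returned (True for yes)
    lowercased : bool
        indicates if the tokenizer lowercases its input, so targets should be lowercased to match that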
    
    Returns
    -------
    if get_not_in_vocab=True
    
    completely_missing : a list of tuples (str, list of strings)
        a list of tuples for targets that were not found at all and their tokenized sentences
    not_in_vocab : a list of strings
        target words that are not represented as a standalone whole in a tokenized sentence 
    subwords_in_vocab : a list of tuples (str, list of strings)
        a list of tuples for targets (and their tokenized sentences) that were tokenized into subwords 
        and those subwords match a tokenized sentence sublist
    partly_in_vocab : a list of tuples (str, list of strings)
        a list of tuples for targets (and their tokenized sentences) that were not segmented further, 
        but instead were undersegmented in a sentence, and thus were a substring of some sentence token 
    multiple_occurances : a list of tuples (str, list of strings)
        a list of tuples for targets (and their tokenized sentences) that were matched several times in a tokenized sentence
    """
    
    not_in_vocab = []
    subwords_in_vocab = []
    partly_in_vocab = []
    completely_missing = []
    multiple_occurances = []
    
    
    for i, row in df.iterrows():
        tokenized_sentence = tokenizer.tokenize(row.sentence)
        if lowercased:
            target_word = row.token.lower()
        else:
            target_word = row.token
        
        ##########################################
        num_target = tokenized_sentence.count(target_word)
        
        ##############
        # CHECK 1
        ##############
        if num_target > 1:
            multiple_occurances.append((target_word, tokenized_sentence))
        
        if num_target == 0:
            not_in_vocab.append(target_word)
            # CHECK 2: tokenize the target and look for its subword list
            # as a contiguous sublist of the sentence tokens
            tokenized_word = tokenizer.tokenize(target_word)
            # 2.A: the target was split into several subwords
            if len(tokenized_word) > 1:
                parts_len = len(tokenized_word)
                sentence_len = len(tokenized_sentence)
                
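                # 2.B: slide a window over the sentence and count exact sublist matches
                # (the window index is named `start` so it does not shadow the row index `i`)
                current_match = 0
                for start in range(sentence_len-parts_len+1):
                    if tokenized_sentence[start:start+parts_len] == tokenized_word:
                        current_match += 1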
                if current_match > 1:
                    multiple_occurances.append((target_word, tokenized_sentence))
                    subwords_in_vocab.append((target_word, tokenized_sentence))
                if current_match == 1:
                    subwords_in_vocab.append((target_word, tokenized_sentence))
                # if nothing found
                if current_match == 0:
                    completely_missing.append((target_word, tokenized_sentence))
                        
            # CHECK 3: the target is a single vocabulary item,
            # so look for sentence tokens that contain it as a substring
            else:
                partly_tokenized = [word for word in tokenized_sentence if target_word in word]
                
                if len(partly_tokenized)>1:
                    multiple_occurances.append((target_word, tokenized_sentence))
        
                if len(partly_tokenized)==1:
                    partly_in_vocab.append((target_word, tokenized_sentence))
                
                if len(partly_tokenized)==0:
                    print('Completely missing', row['id'], row['token'])
                    completely_missing.append((target_word, tokenized_sentence))
        
    print('Completely missing', len(completely_missing))
    print('Initially not found', len(not_in_vocab), 'out of', len(df))
    print('Segmented and found', len(subwords_in_vocab))
    print('Part of a sentence token', len(partly_in_vocab))
    print('Multiple occurances', len(multiple_occurances))
    
    return completely_missing, not_in_vocab, subwords_in_vocab, partly_in_vocab, multiple_occurances
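
The sliding-window comparison in CHECK 2 only counts matches. Below is a minimal sketch of the same idea that also reports where each match starts, which is what is needed later to pick out the target's subwords in a sentence. `find_sublist_positions` is a hypothetical helper, not part of this notebook, and the token list is hand-written for illustration, not real tokenizer output.

```python
def find_sublist_positions(tokens, sub):
    """Return the start index of every occurrence of `sub` as a contiguous sublist of `tokens`."""
    if not sub:
        return []
    n = len(sub)
    return [start for start in range(len(tokens) - n + 1) if tokens[start:start + n] == sub]


# hand-written toy tokens (an assumption, not actual BERT output)
toy_sentence = ['the', 'un', '##believ', '##able', 'story', 'was', 'un', '##believ', '##able']
toy_target = ['un', '##believ', '##able']

positions = find_sublist_positions(toy_sentence, toy_target)
print(positions)  # [1, 6]
```

Here `len(positions)` plays the role of `current_match`: 0 means nothing was found, 1 means a unique match, and more than 1 means multiple occurrences.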

In [57]:
#print('LOWERCASED')
#print('------------')
#print('train')
#print('------------')
#results_lowercased_single_train = check_tokenized_sentence_singles(tokenizer_lowercased, single_train)
#print('------------')
#print('trial')
#print('------------')
#results_lowercased_single_trial = check_tokenized_sentence_singles(tokenizer_lowercased, single_trial)
print('--------------')
print('PROPERCASED')
print('------------')
print('train')
print('------------')
results_propercased_single_train = check_tokenized_sentence_singles(tokenizer_propercased, single_train, lowercased=False)
print('------------')
print('trial')
print('------------')
results_propercased_single_trial = check_tokenized_sentence_singles(tokenizer_propercased, single_trial, lowercased=False)
print('------------')
print('test')
print('------------')
results_propercased_single_test = check_tokenized_sentence_singles(tokenizer_propercased, single_test, lowercased=False)

--------------
PROPERCASED
------------
train
------------
Completely missing 0
Initially not found 2470 out of 7662
Segmented and found 2469
Part of a sentence token 1
Multiple occurances 2
------------
trial
------------
Completely missing 0
Initially not found 135 out of 421
Segmented and found 135
Part of a sentence token 0
Multiple occurances 0
------------
test
------------
Completely missing 38LRF35D5LWPYKNDAPAKMD6HD1M3UI Group
Completely missing 1
Initially not found 291 out of 917
Segmented and found 289
Part of a sentence token 1
Multiple occurances 0
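
The one 'Completely missing' row in test is probably the casing mismatch between 'token' and 'sentence' noted in the summary. A possible fallback, sketched under that assumption, is to locate the target at the character level while ignoring case, before any tokenization. `find_target_spans` is a hypothetical helper and the sentence is made up, not a row from the dataset.

```python
import re


def find_target_spans(sentence, target):
    """Return (start, end) character spans of `target` in `sentence`, ignoring case."""
    if not target:
        return []
    pattern = re.compile(re.escape(target), flags=re.IGNORECASE)
    return [match.span() for match in pattern.finditer(sentence)]


# made-up sentence (an assumption, not a row from the dataset)
toy_sentence = 'The GROUP met again on Monday.'
spans = find_target_spans(toy_sentence, 'Group')
print(spans)  # [(4, 9)]
```

This also matches a target that is glued to other characters, so any hit should still be checked by eye before it is trusted.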


In [61]:
def check_tokenized_sentence_mwes(tokenizer, df, get_not_in_vocab=True):
    """Checks for sentence tokens that represent parts of target pairs
    
    Prints out:
    The number of target pairs where their subword list was not found as a sublist of sentence tokens
    The number of times a target pair was found represented several times in a tokenized sentence

    When get_not_in_vocab=True, returns a tuple of the target lists that make up the statistics described above
    
    
    Parameters
    ----------
    tokenizer : a tokenizer object
        takes a string and returns a list of string tokens
    df : pandas dataframe
        a dataframe containing a sentence (df.sentence) and its target pair (df.token) in the same row
    get_not_in_vocab : bool
        indicates whether the tuple of target lists behind the printed statistics should be returned (True for yes)
    
    Returns
    -------
    if get_not_in_vocab=True
    
    completely_missing : a list of tuples (list of strings, list of strings)
        tokenized target pairs (with their tokenized sentences) that were not found at all in the sentence
    multiple_occurances : a list of tuples (list of strings, list of strings)
        tokenized target pairs (with their tokenized sentences) that were matched more than once in the tokenized sentence
    """
    
    completely_missing = []
    multiple_occurances = []
    
    
    for i, row in df.iterrows():
        tokenized_sentence = tokenizer.tokenize(row.sentence)
        tokenized_target_pair = tokenizer.tokenize(row.token)
        
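        # CHECK 1: slide a window over the sentence tokens and count
        # exact matches of the target pair's subword list
        parts_len = len(tokenized_target_pair)
        sentence_len = len(tokenized_sentence)

        current_match = 0
        # the window index is named `start` so it does not shadow the row index `i`
        for start in range(sentence_len-parts_len+1):
            if tokenized_sentence[start:start+parts_len] == tokenized_target_pair:
                current_match += 1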
        if current_match > 1:
            multiple_occurances.append((tokenized_target_pair, tokenized_sentence))
        # if nothing found
        if current_match == 0:
            completely_missing.append((tokenized_target_pair, tokenized_sentence))

        
    print('Completely missing', len(completely_missing), 'out of', len(df))
    print('Multiple occurances', len(multiple_occurances))
    
    # return the lists only when asked to, as the docstring describes
    if get_not_in_vocab:
        return completely_missing, multiple_occurances

In [60]:
#print('LOWERCASED')
#print('------------')
#print('train')
#print('------------')
#results_lowercased_mwe_train = check_tokenized_sentence_mwes(tokenizer_lowercased, multi_train)
#print('------------')
#print('trial')
#print('------------')
#results_lowercased_mwe_trial = check_tokenized_sentence_mwes(tokenizer_lowercased, multi_trial)
print('--------------')
print('PROPERCASED')
print('------------')
print('train')
print('------------')
results_propercased_mwe_train = check_tokenized_sentence_mwes(tokenizer_propercased, multi_train)
print('------------')
print('trial')
print('------------')
results_propercased_mwe_trial = check_tokenized_sentence_mwes(tokenizer_propercased, multi_trial)
print('------------')
print('test')
print('------------')
results_propercased_mwe_test = check_tokenized_sentence_mwes(tokenizer_propercased, multi_test)

--------------
PROPERCASED
------------
train
------------
Completely missing 0 out of 1517
Multiple occurances 0
------------
trial
------------
Completely missing 0 out of 99
Multiple occurances 0
------------
test
------------
Completely missing 0 out of 184
Multiple occurances 0
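
One way to examine target representations independently of how a word was segmented (Q5) is to merge continuation pieces back into words while remembering which token indices each word covers. This is a minimal sketch that assumes the '##' prefix marks a continuation piece, as in BERT's WordPiece output. `merge_wordpieces` is a hypothetical helper and the tokens are hand-written, not real tokenizer output.

```python
def merge_wordpieces(tokens, prefix='##'):
    """Group subword tokens into words; return a list of (word, token_indices) pairs."""
    words = []
    for index, token in enumerate(tokens):
        if token.startswith(prefix) and words:
            # continuation piece: glue it onto the previous word
            word, indices = words[-1]
            words[-1] = (word + token[len(prefix):], indices + [index])
        else:
            # a new word starts here
            words.append((token, [index]))
    return words


# hand-written toy tokens (an assumption, not actual BERT output)
toy_tokens = ['the', 'un', '##believ', '##able', 'story']
merged = merge_wordpieces(toy_tokens)
print(merged)
# [('the', [0]), ('unbelievable', [1, 2, 3]), ('story', [4])]
```

A target can then be compared against the merged words (exact match, or substring match for glued targets), and the stored indices say which subword positions to average.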
