# Data examination
Questions:
* 1. How many words are there in each subcorpus? (single words vs. MWEs / bible vs. biomed vs. europarl)
* 2. Are there multiple target word instances in a sentence?
* 3. Are there any words not in the BERT vocabulary? (Are there words that are segmented further?)
* 4. Are there any OOV subwords in OOV words?

SUMMARY:
* Case is important for identifying the target token's position in a sentence.
* Not all target tokens are present in the BERT vocabulary (should they be represented as the average of their subwords?).
* Cased BERT leaves fewer target tokens unsegmented than uncased BERT (case can be used to locate the target word in a sentence, which can then be lowercased and represented).

In [1]:
import pandas as pd

In [2]:
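def read_tsv(file_name):
    # quoting=3 is csv.QUOTE_NONE: quote characters inside sentences are kept literally.
    # na_filter=False keeps strings such as "null" or "NA" as text instead of NaN.
    df = pd.read_csv(file_name, sep='\t', quoting=3, na_filter=False)
    return df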

In [3]:
single_train = read_tsv('data/train/lcp_single_train.tsv')
multi_train = read_tsv('data/train/lcp_multi_train.tsv')

single_trial = read_tsv('data/trial/lcp_single_trial.tsv')
multi_trial = read_tsv('data/trial/lcp_multi_trial.tsv')

In [4]:
single_train.head(3)

Unnamed: 0,id,corpus,sentence,token,complexity
0,3ZLW647WALVGE8EBR50EGUBPU4P32A,bible,"Behold, there came up out of the river seven c...",river,0.0
1,34R0BODSP1ZBN3DVY8J8XSIY551E5C,bible,I am a fellow bondservant with you and with yo...,brothers,0.0
2,3S1WOPCJFGTJU2SGNAN2Y213N6WJE3,bible,"The man, the lord of the land, said to us, 'By...",brothers,0.05


In [5]:
multi_train.head(3)

Unnamed: 0,id,corpus,sentence,token,complexity
0,3S37Y8CWI80N8KVM53U4E6JKCDC4WE,bible,but the seventh day is a Sabbath to Yahweh you...,seventh day,0.027778
1,3WGCNLZJKF877FYC1Q6COKNWTDWD11,bible,"But let each man test his own work, and then h...",own work,0.05
2,3UOMW19E6D6WQ5TH2HDD74IVKTP5CB,bible,To him who by understanding made the heavens; ...,loving kindness,0.05


In [6]:
single_trial.head(3)

Unnamed: 0,id,subcorpus,sentence,token,complexity
0,3QI9WAYOGQB8GQIR4MDIEF0D2RLS67,bible,They will not hurt nor destroy in all my holy ...,sea,0.0
1,3T8DUCXY0N6WD9X4RTLK8UN1U929TF,bible,"that sends ambassadors by the sea, even in ves...",sea,0.102941
2,3I7KR83SNADXAQ7HXK7S7305BYB9KD,bible,"and they entered into the boat, and were going...",sea,0.109375


In [7]:
multi_trial.head(3)

Unnamed: 0,id,subcorpus,sentence,token,complexity
0,31HLTCK4BLVQ5BO1AUR91TX9V9IVGH,bible,"The name of one son was Gershom, for Moses sai...",foreign land,0.0
1,389A2A304OIXVY7G5B71Q9M43LE0CL,bible,"unleavened bread, unleavened cakes mixed with ...",wheat flour,0.157895
2,31N9JPQXIPIRX2A3S9N0CCFXO6TNHR,bible,However the high places were not taken away; t...,burnt incense,0.2


In [8]:
single_trial = single_trial.rename(columns={'subcorpus':'corpus'})
multi_trial = multi_trial.rename(columns={'subcorpus':'corpus'})

## Question 1
## how many words are there in each subcorpus? (single words vs mwe / bible vs biomed vs europarl)
NOTE: there can be several sentences for the same target word, so the counts below are numbers of rows (sentence/target pairs), not numbers of distinct words.

In [9]:
dfs = [single_train, multi_train, single_trial, multi_trial]
names = ['single_train', 'multi_train', 'single_trial', 'multi_trial']
bible = []
europarl = []
biomed = []
total = []
for df in dfs:
    bible.append(len(df[df['corpus'] == 'bible']))
    europarl.append(len(df[df['corpus'] == 'europarl']))
    biomed.append(len(df[df['corpus'] == 'biomed']))
    total.append(len(df))

In [10]:
data = {'DATA_NAME':names,'BIBLE_LEN':bible,'EUROPARL_LEN':europarl,'BIOMED_LEN':biomed,'TOTAL_LEN':total}
corpora_lens = pd.DataFrame(data)
corpora_lens

Unnamed: 0,DATA_NAME,BIBLE_LEN,EUROPARL_LEN,BIOMED_LEN,TOTAL_LEN
0,single_train,2574,2512,2576,7662
1,multi_train,505,498,514,1517
2,single_trial,143,143,135,421
3,multi_trial,29,37,33,99
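Since the same target word can appear in several sentences, it may help to report the number of distinct targets next to the row counts. The following is a minimal sketch using only the standard library; `count_unique_targets` and `toy_rows` are hypothetical names introduced here, and the toy rows are copied from the `single_train.head(3)` preview.

```python
from collections import defaultdict

def count_unique_targets(rows):
    """rows: iterable of (corpus, token) pairs.

    Returns {corpus: (number_of_rows, number_of_distinct_tokens)}.
    """
    n_rows = defaultdict(int)
    unique = defaultdict(set)
    for corpus, token in rows:
        n_rows[corpus] += 1
        unique[corpus].add(token)
    return {corpus: (n_rows[corpus], len(unique[corpus])) for corpus in n_rows}

# Toy rows taken from the head() preview: "brothers" appears in two sentences.
toy_rows = [("bible", "river"), ("bible", "brothers"), ("bible", "brothers")]
print(count_unique_targets(toy_rows))  # {'bible': (3, 2)}
```

On the real data the pairs could be supplied as `zip(single_train['corpus'], single_train['token'])`.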


## Question 2
## are there multiple target word instances in a sentence? (Yes and No)

### single words
### train
* Uncased: a token substring can occur 1, 2, 3, 4, 5, 9 or 11 times in the sentence string (plain substring matching, no tokenization). 87 sentences contain more than one match this way. For example, in sentence 6756 it is not clear which instance of the target word should be evaluated.
* Cased: a token substring occurs 1 or 2 times in the sentence string. Only two sentences contain more than one occurrence.

Summary: casing is important for choosing the right target token in a sentence.
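Plain substring counting also matches letters inside other words (a lowercased single-letter target such as `N` matches every `n`). One possible refinement is to match whole words only and return character spans, which also gives the target position directly. This is a sketch under that assumption, not the method used in the cells below; `find_target_spans` is a hypothetical helper.

```python
import re

def find_target_spans(sentence, token, cased=True):
    """Return (start, end) character spans of whole-word matches of token."""
    # Lookarounds reject matches glued to another word character;
    # re.escape keeps punctuation inside the token literal.
    pattern = r"(?<!\w)" + re.escape(token) + r"(?!\w)"
    flags = 0 if cased else re.IGNORECASE
    return [m.span() for m in re.finditer(pattern, sentence, flags)]

sentence = ("Rod spherules establish an invaginating synapse with rod bipolar "
            "dendrites and axonal endings of horizontal cells.")

cased_spans = find_target_spans(sentence, "Rod")
uncased_spans = find_target_spans(sentence, "Rod", cased=False)
print(cased_spans)    # one match, at the start of the sentence
print(uncased_spans)  # two matches: "Rod" and "rod"

# Substring counting finds many "n" characters, whole-word matching finds none.
print(sentence.lower().count("n"), len(find_target_spans(sentence, "N", cased=False)))
```

With the cased span in hand, the sentence can still be lowercased afterwards, because lowercasing ASCII text does not shift character offsets.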

In [11]:
single_train['occurance_number_uncased'] = single_train.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

single_train['occurance_number_cased'] = single_train.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

In [12]:
print(single_train['occurance_number_uncased'].unique())
print(single_train['occurance_number_cased'].unique())

[ 1  2  3  4 11  9  5]
[1 2]


In [13]:
len(single_train[single_train['occurance_number_uncased']>1])

87

In [14]:
single_train[single_train['occurance_number_uncased']>3]

Unnamed: 0,id,corpus,sentence,token,complexity,occurance_number_uncased,occurance_number_cased
4203,3X4Q1O9UBHMCMY43GF110OQ80EE7O2,biomed,"CorA, B, C and D belong to a protein family in...",B,0.4,4,1
4814,3A3KKYU7P3H3CAKSB7U0000KY4FWMJ,biomed,"The mice used in the present study, (NFR/N × B...",MA,0.59375,4,1
4927,3T5ZXGO9DEOYRKNPENLOGDE7P89QZL,biomed,Because synapsis occurs in TRIP13-deficient sp...,CO,0.526316,4,1
5079,3538U0YQ1FU0F2QNF0FL0D5E3B1F3J,biomed,Superficial and deep anterior cortical stainin...,N,0.602941,11,1
5080,3TTPFEFXCTKJQH4BTS1JA1TBTGIH6P,biomed,Peptide Aβ is released from APP by the action ...,N,0.75,9,1
7579,36MUZ9VAE626RGSODE1RV46QINFED2,europarl,"A4-0124/97 by Mr Wynn, on behalf of the Commit...",VI,0.485294,5,1


In [15]:
for s in single_train[single_train['occurance_number_cased']>1]['sentence']:
    print(s)

Approval of the minutes of the previous sitting: see minutes
Approval of Minutes of previous sitting: see Minutes


### single words
### trial
* cased: only 1 target occurrence per sentence
* uncased: 6 sentences with 2 occurrences

In [16]:
single_trial['occurance_number_uncased'] = single_trial.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

single_trial['occurance_number_cased'] = single_trial.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

In [17]:
print(single_trial['occurance_number_uncased'].unique())
print(single_trial['occurance_number_cased'].unique())

[1 2]
[1]


In [18]:
print(len(single_trial[single_trial['occurance_number_uncased']>1]))
single_trial[single_trial['occurance_number_uncased']>1]

6


Unnamed: 0,id,corpus,sentence,token,complexity,occurance_number_uncased,occurance_number_cased
171,3EFNPKWBMSO9IYBXCIW0X6IAX8E030,biomed,Lung development in Dhcr7-/- embryos at the ea...,Lung,0.175,2,1
252,379OL9DBSSESUVWY1Z8JGBFG9E19YR,biomed,Rod spherules establish an invaginating synaps...,Rod,0.4,2,1
280,3P0I4CQYVY7RCD54ON9DS4PPT5QOWO,europarl,"We have simply confirmed, in accordance with o...",Rules,0.178571,2,1
318,3TFJJUELSHP4R8AUKYBF9XFJ0LWC2J,europarl,Proposal for a Council Decision establishing f...,Justice,0.203125,2,1
417,31GECDVA9JM3TSKUX9AFDA4LK3466H,europarl,The proposal to amend Regulation (EC) No 539/2...,EC,0.5,2,1
418,3OQQD2WO8I6KPTSDG8L63AI6J4E3IL,europarl,"the report by Mr Albertini, on behalf of the C...",EC,0.605263,2,1


In [19]:
single_trial['sentence'][252]

'Rod spherules establish an invaginating synapse with rod bipolar dendrites and axonal endings of horizontal cells.'

### MWE
### train
* uncased: 5 sentences with 2 occurrences of the target MWE (cased: always 1)

In [20]:
multi_train['occurance_number_uncased'] = multi_train.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

multi_train['occurance_number_cased'] = multi_train.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

In [21]:
print(multi_train['occurance_number_uncased'].unique())
print(multi_train['occurance_number_cased'].unique())

[1 2]
[1]


In [22]:
print(len(multi_train[multi_train['occurance_number_uncased']>1]))

5


In [23]:
multi_train[multi_train['occurance_number_uncased']>1]

Unnamed: 0,id,corpus,sentence,token,complexity,occurance_number_uncased,occurance_number_cased
1110,3O4VWC1GEW6GK4CJYQ66FBX6WJ5J3F,europarl,Proposal for a Council Decision establishing f...,Fundamental Rights,0.27381,2,1
1185,3O4VWC1GEW6GK4CJYQ66FBX6WJ53JZ,europarl,"- Madam President, the World Food Summit last ...",food security,0.315789,2,1
1236,30UZJB2POHC8D5XY9O2CE1E1EIA35D,europarl,Support for rural development by the European ...,rural development,0.329545,2,1
1390,3AA88CN98P3CBRFP5WZ86KTWK90YKZ,europarl,Revision of the Treaties - Transitional measur...,transitional measures,0.444444,2,1
1406,3G5RUKN2EC3YIWSKUXZ8ZVH95VKN94,europarl,"The next item is the report by Tanja Fajon, on...",Council regulation,0.4625,2,1


In [24]:
multi_train['sentence'][1236]

'Support for rural development by the European Agricultural Fund for Rural Development (EAFRD) ('

### MWE
### trial
* every target MWE occurs exactly once in its sentence, cased or uncased

In [25]:
multi_trial['occurance_number_uncased'] = multi_trial.apply(lambda row:
    row['sentence'].lower().count(row['token'].lower()), axis=1
)

multi_trial['occurance_number_cased'] = multi_trial.apply(lambda row:
    row['sentence'].count(row['token']), axis=1
)

print(multi_trial['occurance_number_uncased'].unique())
print(multi_trial['occurance_number_cased'].unique())

[1]
[1]


## Question 3
## are there any words not in BERT vocab? (are there words that are segmented further?) (YES)

Many target tokens are not in BERT's vocabulary as whole tokens, so the tokenizer segments them into subwords. Their representation therefore has to be built from subwords. Good news: none of the resulting subwords is `[UNK]`. The uncased vocabulary covers more target tokens than the cased one (in train, 929 of 3298 are absent uncased vs. 1310 of 3487 cased).
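One way to build a word representation from subwords is to average the subword vectors that belong to the word. The sketch below shows only the bookkeeping: `align_subwords`, `mean_pool`, `toy_tokenize` and the toy vectors are all hypothetical, and the tokenizer is passed in as a plain callable. In the notebook that could be `tokenizer_cased.tokenize`, assuming that tokenizing word by word yields the same pieces as tokenizing the whole sentence.

```python
def align_subwords(words, tokenize):
    """Return one (start, end) subword span per word."""
    spans, cursor = [], 0
    for word in words:
        n_pieces = len(tokenize(word))
        spans.append((cursor, cursor + n_pieces))
        cursor += n_pieces
    return spans

def mean_pool(vectors, span):
    """Average the vectors inside span, dimension by dimension."""
    start, end = span
    chunk = vectors[start:end]
    dim = len(chunk[0])
    return [sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)]

# Hypothetical toy segmentation standing in for a real WordPiece tokenizer.
toy_pieces = {"rod": ["rod"], "spherules": ["sp", "##her", "##ules"]}

def toy_tokenize(word):
    return toy_pieces[word]

words = ["rod", "spherules"]
spans = align_subwords(words, toy_tokenize)

# One made-up 2-dimensional vector per subword: rod, sp, ##her, ##ules.
vectors = [[1.0, 0.0], [0.0, 3.0], [3.0, 3.0], [6.0, 0.0]]
target = mean_pool(vectors, spans[1])
print(spans)   # [(0, 1), (1, 4)]
print(target)  # [3.0, 2.0]
```

When the vectors come from a BERT model, the spans would have to be shifted by one to skip the leading `[CLS]` position.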

In [27]:
from transformers import BertTokenizer, BertModel

In [28]:
tokenizer_cased = BertTokenizer.from_pretrained('bert-base-cased')
model_cased = BertModel.from_pretrained("bert-base-cased")
#config = model.config

tokenizer_uncased = BertTokenizer.from_pretrained('bert-base-uncased')
model_uncased = BertModel.from_pretrained("bert-base-uncased")

In [29]:
def check_vocab_words(vocab, words, get_not_in_vocab=True, cased=True):
    # Lowercase the targets first when checking against an uncased vocabulary.
    if not cased:
        words = [word.lower() for word in words]
    unique_words = set(words)
    not_in_vocab = {word for word in unique_words if word not in vocab}

    if len(not_in_vocab) == 0:
        print("All tokens are in the model's vocabulary")
    else:
        print("There are", len(not_in_vocab), 'words absent out of', len(unique_words))
    # Always return a set (possibly empty) so that check_oovs can iterate over it.
    if get_not_in_vocab:
        return not_in_vocab

not_in_vocab_train_cased = check_vocab_words(tokenizer_cased.vocab, single_train['token'])
not_in_vocab_trial_cased = check_vocab_words(tokenizer_cased.vocab, single_trial['token'])

not_in_vocab_train_uncased = check_vocab_words(tokenizer_uncased.vocab, single_train['token'], cased=False)
not_in_vocab_trial_uncased = check_vocab_words(tokenizer_uncased.vocab, single_trial['token'], cased=False)

There are 1310 words absent out of 3487
There are 73 words absent out of 213
There are 929 words absent out of 3298
There are 50 words absent out of 212


In [30]:
def check_vocab_mwe(tokenizer, mwes, get_not_in_vocab=True, cased=True):
    # Lowercase the MWEs first when checking against the uncased tokenizer.
    if not cased:
        mwes = [word_pair.lower() for word_pair in mwes]
    unique_mwes = set(mwes)
    # An MWE is flagged when tokenizing it yields anything other than its
    # whitespace-separated words, e.g. because a word is split into subwords.
    not_in_vocab = {word_pair for word_pair in unique_mwes
                    if tokenizer.tokenize(word_pair) != word_pair.split(" ")}

    if len(not_in_vocab) == 0:
        print("All tokens are in the model's vocabulary")
    else:
        print("There are", len(not_in_vocab), 'MWEs not segmented into two words out of', len(unique_mwes))
    # Always return a set (possibly empty) so that check_oovs can iterate over it.
    if get_not_in_vocab:
        return not_in_vocab

not_in_vocab_train_mwe_cased = check_vocab_mwe(tokenizer_cased, multi_train['token'])
not_in_vocab_trial_mwe_cased = check_vocab_mwe(tokenizer_cased, multi_trial['token'])

not_in_vocab_train_mwe_uncased = check_vocab_mwe(tokenizer_uncased, multi_train['token'], cased=False)
not_in_vocab_trial_mwe_uncased = check_vocab_mwe(tokenizer_uncased, multi_trial['token'], cased=False)

There are 576 MWEs not segmented into two words out of 1270
There are 40 MWEs not segmented into two words out of 76
There are 473 MWEs not segmented into two words out of 1263
There are 34 MWEs not segmented into two words out of 76


## Question 4
## are there any oov subwords in oov words? (NO)
Good news: none of the out-of-vocabulary words or MWEs produces an `[UNK]` subword, for either the cased or the uncased tokenizer.
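For intuition about when `[UNK]` can appear at all: WordPiece segmentation is, as far as I understand it, a greedy longest-match-first search over the vocabulary, and a word collapses to `[UNK]` only when some part of it cannot be matched by any vocabulary piece. The following is a simplified sketch with a hypothetical toy vocabulary. It is not the library's implementation and it leaves out details such as the maximum word length and the preceding punctuation splitting.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not BERT's real one.
toy_vocab = {"rod", "sp", "##her", "##ules"}

print(wordpiece("spherules", toy_vocab))  # ['sp', '##her', '##ules']
print(wordpiece("rodβ", toy_vocab))       # ['[UNK]']
```

Under this scheme a word whose every character is available as a single-character piece never becomes `[UNK]`, which is consistent with the checks below finding none.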

In [31]:
def check_oovs(not_in_vocab, tokenizer):
    # Collect the segmentations that contain the unknown-token placeholder.
    contains_unk_subs = []
    for word in not_in_vocab:
        subwords = tokenizer.tokenize(word)
        if '[UNK]' in subwords:
            contains_unk_subs.append("_".join(subwords))
    if len(contains_unk_subs) > 0:
        print('There are some UNKs')
    else:
        print('All subwords are in vocab')

check_oovs(not_in_vocab_train_cased, tokenizer_cased)
check_oovs(not_in_vocab_trial_cased, tokenizer_cased)
check_oovs(not_in_vocab_train_uncased, tokenizer_uncased)
check_oovs(not_in_vocab_trial_uncased, tokenizer_uncased)

All subwords are in vocab
All subwords are in vocab
All subwords are in vocab
All subwords are in vocab


In [32]:
check_oovs(not_in_vocab_train_mwe_cased, tokenizer_cased)
check_oovs(not_in_vocab_trial_mwe_cased, tokenizer_cased)
check_oovs(not_in_vocab_train_mwe_uncased, tokenizer_uncased)
check_oovs(not_in_vocab_trial_mwe_uncased, tokenizer_uncased)

All subwords are in vocab
All subwords are in vocab
All subwords are in vocab
All subwords are in vocab
