# Here I will show that there are overlapping sentences in the train, validation, and test sets

In [1]:
cd ..

C:\Personal_Data\Machine_Learning_Project\Medical_Information_Extraction\mrec


In [23]:
import pandas as pd
from mrec.data.dataset import load_data

In [4]:
csv_fnames = {'train': 'dataset/raw/train.csv', 'validation': 'dataset/raw/validation.csv', 'test': 'dataset/raw/test.csv'}
dataset = load_data(csv_fnames)

[2020-12-22 19:48:58,245] [DEBUG] [mrec.data.dataset::load_data::48] Loaded dataset (train:dataset/raw/train.csv)
[2020-12-22 19:48:58,267] [DEBUG] [mrec.data.dataset::load_data::48] Loaded dataset (validation:dataset/raw/validation.csv)
[2020-12-22 19:48:58,288] [DEBUG] [mrec.data.dataset::load_data::48] Loaded dataset (test:dataset/raw/test.csv)


In [25]:
cols = ['_unit_id', 'relation', 'sentence', 'direction', 'term1', 'term2']
train = dataset.train[cols]
validation = dataset.validation[cols]
test = dataset.test[cols]

### Here I will show the inconsistency in the `relation` labels on sentences. I will group by `_unit_id`, `relation`, `sentence`, `term1`, and `term2` and take a majority vote on `direction` to remove duplicates. Then I will show that the same sentence can have different relations.

In [91]:
group_cols = ['_unit_id', 'relation', 'sentence', 'term1', 'term2']

train_no_dup = train.groupby(group_cols)['direction'].agg(pd.Series.mode).reset_index()
val_no_dup = validation.groupby(group_cols)['direction'].agg(pd.Series.mode).reset_index()
test_no_dup = test.groupby(group_cols)['direction'].agg(pd.Series.mode).reset_index()

In [104]:
grouped_df = train_no_dup.groupby(['sentence']).size().reset_index(name='show-up counts')
grouped_df[grouped_df['show-up counts'] > 1].head()

Unnamed: 0,sentence,show-up counts
14,"1 , 14 , 15 , 17 , 18 , 41 Administ...",2
44,118 Plague Has been used for treatment of PL...,2
60,164 Babesiosis Treatment of BABESIOSIS + ...,3
73,"1] Henzl MR, Corson SL, Moghissi K, Buttram V...",2
75,"1] Knauf H, Mutschler E. Diuretic effectivene...",3


__We see that we still have duplicated sentences. Let's look closely at a sentence that appears 3 times after the majority vote__

In [100]:
sentence = '164  Babesiosis  Treatment of BABESIOSIS   +    caused by  BABESIA MICROTI.'
train_no_dup[train_no_dup['sentence'] == sentence]

Unnamed: 0,_unit_id,relation,sentence,term1,term2,direction
1111,724534033,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIOSIS,BABESIA MICROTI,BABESIA MICROTI causes BABESIOSIS
1138,724919815,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIOSIS,BABESIA MICROTI,BABESIA MICROTI causes BABESIOSIS
1608,789214508,treats,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIOSIS,BABESIA MICROTI,"[BABESIOSIS treats BABESIA MICROTI, no_relation]"
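
The last row above holds two values in `direction` because `pd.Series.mode` returns every tied value when a group has no single winner. The following is a minimal sketch of one way to force a single label per group; the toy data, the `majority_label` helper, and the tie-break rule (first label in sorted order) are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy annotations (made up): unit 1 has a clear majority,
# unit 2 is a one-to-one tie between two labels.
toy = pd.DataFrame({
    '_unit_id': [1, 1, 1, 2, 2],
    'direction': ['A causes B', 'A causes B', 'no_relation',
                  'B treats A', 'no_relation'],
})


def majority_label(labels: pd.Series) -> str:
    """Return one label per group; ties go to the first label in sorted order."""
    # Series.mode() returns all tied values sorted, so iloc[0] is deterministic.
    return labels.mode().iloc[0]


voted = toy.groupby('_unit_id')['direction'].agg(majority_label).reset_index()
print(voted)
```

Whether a tie should be resolved this way, or the tied unit dropped instead, is a judgment call that depends on how the labels will be used.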


__This sentence is duplicated because its rows have different `_unit_id` and `relation` values. Even if we do the majority vote without grouping by `_unit_id`, the sentence is still duplicated and still has different relations. Hence the train dataset is inconsistent in labeling the relation for each unique sentence__
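
One way to list the sentences that carry conflicting labels is to count the distinct `relation` values per sentence. This is a sketch on made-up data; the toy frame and the variable names are assumptions, and the same two lines could be pointed at `train_no_dup`.

```python
import pandas as pd

# Toy rows (made up): sentence 's1' is labeled both 'causes' and 'treats'.
toy = pd.DataFrame({
    '_unit_id': [10, 11, 12, 13],
    'relation': ['causes', 'causes', 'treats', 'treats'],
    'sentence': ['s1', 's1', 's1', 's2'],
})

# Number of distinct relation labels attached to each sentence.
relations_per_sentence = toy.groupby('sentence')['relation'].nunique()

# Keep only the sentences with more than one distinct label.
conflicting = relations_per_sentence[relations_per_sentence > 1]
print(conflicting)
```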

In [113]:
duplicates = train_no_dup['sentence'].duplicated().sum()
dset_size = train_no_dup.shape[0]
print('Number of rows after majority vote:', dset_size)
print('Number of duplicate sentences:', duplicates)
print('Duplicate rate: {:.2f}%'.format(duplicates / dset_size * 100))

Number of rows after majority vote: 1868
Number of duplicate sentences: 250
Duplicate rate: 13.38%


__Here is what that sentence looks like in the raw train set__

In [114]:
train[train['sentence'] == sentence]

Unnamed: 0,_unit_id,relation,sentence,direction,term1,term2
8041,724534033,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIA MICROTI causes BABESIOSIS,BABESIOSIS,BABESIA MICROTI
8042,724534033,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIA MICROTI causes BABESIOSIS,BABESIOSIS,BABESIA MICROTI
8043,724534033,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIA MICROTI causes BABESIOSIS,BABESIOSIS,BABESIA MICROTI
8044,724534033,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIOSIS causes BABESIA MICROTI,BABESIOSIS,BABESIA MICROTI
8045,724534033,causes,164 Babesiosis Treatment of BABESIOSIS + ...,no_relation,BABESIOSIS,BABESIA MICROTI
8046,724534033,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIA MICROTI causes BABESIOSIS,BABESIOSIS,BABESIA MICROTI
8047,724534033,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIA MICROTI causes BABESIOSIS,BABESIOSIS,BABESIA MICROTI
8230,724919815,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIA MICROTI causes BABESIOSIS,BABESIOSIS,BABESIA MICROTI
8231,724919815,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIA MICROTI causes BABESIOSIS,BABESIOSIS,BABESIA MICROTI
8232,724919815,causes,164 Babesiosis Treatment of BABESIOSIS + ...,BABESIA MICROTI causes BABESIOSIS,BABESIOSIS,BABESIA MICROTI


__Here is how severe this problem is in the validation set__

In [115]:
duplicates = val_no_dup['sentence'].duplicated().sum()
dset_size = val_no_dup.shape[0]
print('Number of rows after majority vote:', dset_size)
print('Number of duplicate sentences:', duplicates)
print('Duplicate rate: {:.2f}%'.format(duplicates / dset_size * 100))

Number of rows after majority vote: 623
Number of duplicate sentences: 39
Duplicate rate: 6.26%


__Here is how severe this problem is in the test set__

In [116]:
duplicates = test_no_dup['sentence'].duplicated().sum()
dset_size = test_no_dup.shape[0]
print('Number of rows after majority vote:', dset_size)
print('Number of duplicate sentences:', duplicates)
print('Duplicate rate: {:.2f}%'.format(duplicates / dset_size * 100))

Number of rows after majority vote: 623
Number of duplicate sentences: 33
Duplicate rate: 5.30%
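
The same three statistics are computed for each split, so the repeated cells could be folded into a small helper. This is only a sketch: `duplicate_stats` is a hypothetical name and the toy frame is made up.

```python
import pandas as pd


def duplicate_stats(df: pd.DataFrame, col: str = 'sentence'):
    """Return (row count, duplicate count, duplicate rate in percent) for one column."""
    n_rows = len(df)
    # duplicated() flags every occurrence after the first one.
    n_dup = int(df[col].duplicated().sum())
    pct = n_dup / n_rows * 100 if n_rows else 0.0
    return n_rows, n_dup, pct


# Toy frame (made up): 'a' appears twice and 'c' three times.
toy = pd.DataFrame({'sentence': ['a', 'a', 'b', 'c', 'c', 'c']})
rows, dups, pct = duplicate_stats(toy)
print('rows: {}, duplicates: {}, rate: {:.2f}%'.format(rows, dups, pct))
```

Calling it once per split, for example `duplicate_stats(val_no_dup)`, would give the same kind of numbers as the cells above.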


#### In order to show that there are overlapping sentences in the train, validation, and test sets, I will do a majority vote on `direction` in each set to remove duplicates. Then I will concatenate the train and validation sets and check for duplicate sentences. I will also concatenate the train and test sets and check for duplicate sentences.

In [55]:
train_and_val_dfs = [train_no_dup, val_no_dup]
train_concat_val = pd.concat(train_and_val_dfs)
train_concat_val['sentence'].duplicated().sum()

460
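
Note that `duplicated()` on a concatenated frame also counts sentences repeated inside a single split, so the number above mixes within-split and cross-split repeats. Assuming the quantity of interest is the number of distinct sentences shared by two splits, a set intersection isolates it. The sketch below uses made-up frames, and `shared_sentences` is a hypothetical helper.

```python
import pandas as pd


def shared_sentences(a: pd.DataFrame, b: pd.DataFrame, col: str = 'sentence') -> set:
    """Return the distinct values of `col` that occur in both frames."""
    return set(a[col]) & set(b[col])


# Toy splits (made up): 's1' repeats inside train, 's4' repeats inside
# validation, and only 's3' is shared between the two.
toy_train = pd.DataFrame({'sentence': ['s1', 's1', 's2', 's3']})
toy_val = pd.DataFrame({'sentence': ['s3', 's4', 's4']})

overlap = shared_sentences(toy_train, toy_val)

# The concatenate-then-duplicated count, for comparison.
mixed = int(pd.concat([toy_train, toy_val])['sentence'].duplicated().sum())

print('shared between splits:', len(overlap))
print('duplicated() on the concatenation:', mixed)
```

In the toy data only one sentence is shared, while the concatenated count is 3 because it also picks up the repeats inside each split.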

In [69]:
data = [['tom'], ['tom'], ['tom']]
df = pd.DataFrame(data, columns=['Name'])
df

Unnamed: 0,Name
0,tom
1,tom
2,tom


In [70]:
df.duplicated().sum()

2
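
The check above confirms that `duplicated()` flags every occurrence after the first, which is why three identical rows give 2. If every copy should be counted, including the first, pandas accepts `keep=False`. A short sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['tom', 'tom', 'tom']})

# Default keep='first': the first occurrence is not flagged.
after_first = int(df.duplicated().sum())

# keep=False: every row that has a copy elsewhere is flagged.
all_copies = int(df.duplicated(keep=False).sum())

print(after_first, all_copies)
```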