# Cantemist corpus: eCIE-O-3.1

This notebook gives a statistical description of the Cantemist corpus, as a first step towards the oncology clinical coding subtask (CANTEMIST-CODING track) of the shared task described [here](https://temu.bsc.es/cantemist/). The task consists of assigning ICD-O codes for the histological type of neoplasms (eCIE-O-3.1 in Spanish) to Spanish clinical cases written in natural language. The [Corpus summary](#Corpus-summary) section collects the most important figures extracted throughout the notebook.

In [1]:
import pandas as pd
import numpy as np

# Train & dev1 & dev2 & test

We perform an exploratory analysis of the eCIE-O-3.1 codes present in the training, development and test corpora.

## Valid codes

In [2]:
%%time
# https://github.com/TeMU-BSC/cantemist-evaluation-library/blob/master/valid-codes.tsv
cantemist_valid = pd.read_table("../resources/cantemist-evaluation-library/valid-codes.tsv", sep='\t', 
                                header=None)

CPU times: user 40.1 ms, sys: 4.02 ms, total: 44.1 ms
Wall time: 43.6 ms


In [3]:
cantemist_valid.shape

(58062, 3)

In [4]:
cantemist_valid.columns = ["code", "desc", "abr-desc"]

In [5]:
cantemist_valid.head()

,code,desc,abr-desc
0,8000/0,Neoplasia benigna,Neoplasia benigna
1,8000/1,Neoplasia de benignidad o malignidad incierta,"Neoplasia, benignidad o malignidad incierta"
2,8000/3,Neoplasia maligna,Neoplasia maligna
3,8000/31,"Neoplasia maligna - grado I, bien diferenciado",Neoplasia maligna - G1
4,8000/32,"Neoplasia maligna - grado II, moderadamente di...",Neoplasia maligna - G2


In [6]:
valid_codes = set(cantemist_valid["code"])

In [7]:
len(valid_codes)

58062

## Train corpus

In [9]:
%%time

cantemist_train = pd.read_table("../datasets/cantemist_v6/train-set/cantemist-coding/train-coding.tsv", sep='\t', header=0) 

CPU times: user 0 ns, sys: 3 ms, total: 3 ms
Wall time: 2.36 ms


In [10]:
cantemist_train.shape

(2757, 2)

In [11]:
cantemist_train.tail()

,file,code
2752,cc_onco644,8000/6
2753,cc_onco644,8000/3
2754,cc_onco644,8523/33
2755,cc_onco644,8520/34
2756,cc_onco644,8010/3/H
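
The codes displayed so far share a visible layout: four digits, a slash, one or two further digits, and in some cases a trailing `/H`. The sketch below splits a code string into those parts. The pattern is an assumption inferred only from the samples shown in this notebook, the helper `split_code` and the group names are hypothetical, and the meaning of the `/H` suffix is deliberately left uninterpreted.

```python
import re

# Assumed layout, inferred from samples such as "8000/6", "8523/33", "8010/3/H":
# 4 digits, "/", 1 digit, an optional extra digit, an optional "/H" suffix.
CODE_PATTERN = re.compile(
    r"^(?P<histology>\d{4})/(?P<behaviour>\d)(?P<grade>\d)?(?P<suffix>/H)?$"
)


def split_code(code):
    """Return the parts of a code as a dict, or None if it does not fit the assumed layout."""
    match = CODE_PATTERN.match(code)
    return match.groupdict() if match else None


for example in ["8000/6", "8523/33", "8010/3/H"]:
    print(example, split_code(example))
```

Running `split_code` over the full list of valid codes and counting the `None` results would show how well the assumed layout actually covers the terminology.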


In [12]:
# 501 expected
len(set(cantemist_train["file"]))

501

In [13]:
cantemist_train.isnull().values.any()

False

We check if there are duplicated document-code associations:

In [14]:
sum(cantemist_train.duplicated())

1

In [15]:
cantemist_train[cantemist_train.duplicated(keep=False)]

,file,code
1862,cc_onco768,9080/1
1863,cc_onco768,9080/1


We remove duplicated document-code assignments:

In [16]:
cantemist_train = cantemist_train[~cantemist_train.duplicated(keep='first')]

In [17]:
sum(cantemist_train.duplicated())

0

In [18]:
cantemist_train.shape

(2756, 2)
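
The boolean-mask filter used above has a built-in equivalent, `DataFrame.drop_duplicates`. A minimal sketch on a toy table (the document names and rows are made up for illustration) shows that both forms keep the same rows:

```python
import pandas as pd

# Toy document-code table with one repeated pair (illustrative values only).
toy = pd.DataFrame({
    "file": ["doc_a", "doc_a", "doc_a", "doc_b"],
    "code": ["8000/3", "8000/3", "8000/6", "8000/3"],
})

# Boolean-mask form, as used in the notebook.
masked = toy[~toy.duplicated(keep="first")]

# Built-in form.
deduped = toy.drop_duplicates(keep="first")

print(masked.equals(deduped))
print(deduped)
```

Both forms preserve the original row index, so a later `reset_index(drop=True)` is only needed if contiguous row labels matter.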

We check the number of distinct oncology training codes:

In [19]:
train_codes = set(cantemist_train["code"])

In [20]:
len(train_codes)

493

As a sanity check, we verify that all training codes are valid (the set difference must be empty):

In [21]:
len(train_codes - valid_codes)

0
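
When such a check does not return 0, the set difference alone does not say which documents are affected. A small sketch on toy data (the `valid` set and the rows are invented for illustration) locates the offending rows with `Series.isin`:

```python
import pandas as pd

# Toy set of valid codes and toy annotations (illustrative values only).
valid = {"8000/3", "8000/6", "8010/3"}
toy = pd.DataFrame({
    "file": ["doc_a", "doc_a", "doc_b"],
    "code": ["8000/3", "9999/9", "8010/3"],
})

# Codes that are annotated but not in the valid set.
unknown_codes = set(toy["code"]) - valid

# Rows (and therefore documents) that carry an unknown code.
invalid_rows = toy[~toy["code"].isin(valid)]

print(unknown_codes)
print(invalid_rows)
```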

## Dev1 corpus

In [22]:
%%time

cantemist_dev1 = pd.read_table("../datasets/cantemist_v6/dev-set1/cantemist-coding/dev1-coding.tsv", sep='\t', header=0) 

CPU times: user 2.16 ms, sys: 260 µs, total: 2.42 ms
Wall time: 1.94 ms


In [23]:
cantemist_dev1.shape

(1385, 2)

In [24]:
cantemist_dev1.tail()

,file,code
1380,cc_onco976,8500/3
1381,cc_onco976,8000/3
1382,cc_onco976,8000/1
1383,cc_onco976,8000/6
1384,cc_onco976,8841/1


In [25]:
# 250 expected
len(set(cantemist_dev1["file"]))

249
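
The count above is one short of the 250 documents expected. One possible explanation, not verified here, is that a document has no code annotated and therefore never appears in the TSV file. The sketch below shows how such documents could be identified, assuming a list of expected document identifiers is available (`expected_files` is hypothetical, for instance built from the names of the text files of the split):

```python
import pandas as pd

# Hypothetical identifiers of all documents in a split (illustrative values only).
expected_files = {"doc_a", "doc_b", "doc_c"}

# Toy coding table: doc_b has no annotated code, so it has no row.
coding = pd.DataFrame({
    "file": ["doc_a", "doc_a", "doc_c"],
    "code": ["8000/3", "8000/6", "8010/3"],
})

# Documents that are expected but absent from the coding table.
missing = sorted(expected_files - set(coding["file"]))
print(missing)
```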

In [26]:
cantemist_dev1.isnull().values.any()

False

We check if there are duplicated document-code associations:

In [27]:
cantemist_dev1.duplicated().any()

False

We check the number of distinct oncology codes in the dev1 corpus:

In [28]:
dev1_codes = set(cantemist_dev1["code"])

In [29]:
len(dev1_codes)

338

As a sanity check, we verify that all dev1 codes are valid (the set difference must be empty):

In [30]:
len(dev1_codes - valid_codes)

0

## Dev2 corpus

In [31]:
%%time

cantemist_dev2 = pd.read_table("../datasets/cantemist_v6/dev-set2/cantemist-coding/dev2-coding.tsv", sep='\t', header=0) 

CPU times: user 4.07 ms, sys: 1 µs, total: 4.08 ms
Wall time: 3.63 ms


In [32]:
cantemist_dev2.shape

(1279, 2)

In [33]:
cantemist_dev2.tail()

,file,code
1274,cc_onco1353,8000/1
1275,cc_onco1444,9120/3
1276,cc_onco1444,8000/6
1277,cc_onco1444,8000/1
1278,cc_onco1444,9120/34


In [34]:
# 250 expected
len(set(cantemist_dev2["file"]))

250

In [35]:
cantemist_dev2.isnull().values.any()

False

We check if there are duplicated document-code associations:

In [36]:
cantemist_dev2.duplicated().any()

False

We check the number of distinct oncology codes in the dev2 corpus:

In [37]:
dev2_codes = set(cantemist_dev2["code"])

In [38]:
len(dev2_codes)

334

As a sanity check, we verify that all dev2 codes are valid (the set difference must be empty):

In [39]:
len(dev2_codes - valid_codes)

0

## Test corpus

In [41]:
%%time

cantemist_test = pd.read_table("../datasets/cantemist_v6/test-set/cantemist-coding/test-coding.tsv", sep='\t', header=0) 

CPU times: user 4.95 ms, sys: 0 ns, total: 4.95 ms
Wall time: 4.25 ms


In [42]:
cantemist_test.shape

(1599, 2)

In [43]:
cantemist_test.tail()

,file,code
1594,cc_onco248,8010/33
1595,cc_onco248,8001/3
1596,cc_onco248,8000/6
1597,cc_onco248,8000/1
1598,cc_onco248,8020/6


In [44]:
# 300 expected
len(set(cantemist_test["file"]))

300

In [45]:
cantemist_test.isnull().values.any()

False

We check if there are duplicated document-code associations:

In [46]:
cantemist_test.duplicated().any()

False

We check the number of distinct oncology codes in the test corpus:

In [47]:
test_codes = set(cantemist_test["code"])

In [48]:
len(test_codes)

386

As a sanity check, we verify that all test codes are valid (the set difference must be empty):

In [49]:
len(test_codes - valid_codes)

0

To sum up, we generate the following tables:

In [50]:
col_names = ['Training', 'Development-1', 'Development-2', 'Test']
row_names = ['Documents', 'Total ICD-O codes', 'Avg ICD-O codes per doc.', 'Unique ICD-O codes', 'Avg docs. per ICD-O code', 
             'Unique unseen ICD-O codes']

In [56]:
# Train & dev & test codes table
cantemist_tab = pd.DataFrame({col_names[0]: [len(set(cantemist_train.file)), 
                                       cantemist_train.shape[0], 
                                       cantemist_train.shape[0]/len(set(cantemist_train.file)), 
                                       len(train_codes), 
                                       cantemist_train.shape[0]/len(train_codes), 
                                       "-"], 
              col_names[1]: [len(set(cantemist_dev1.file)), 
                             cantemist_dev1.shape[0], 
                             cantemist_dev1.shape[0]/len(set(cantemist_dev1.file)), 
                             len(dev1_codes), 
                             cantemist_dev1.shape[0]/len(dev1_codes), 
                             len(dev1_codes - train_codes)],
              col_names[2]: [len(set(cantemist_dev2.file)), 
                             cantemist_dev2.shape[0], 
                             cantemist_dev2.shape[0]/len(set(cantemist_dev2.file)), 
                             len(dev2_codes), 
                             cantemist_dev2.shape[0]/len(dev2_codes), 
                             len(dev2_codes - dev1_codes - train_codes)],
              col_names[3]: [len(set(cantemist_test.file)), 
                             cantemist_test.shape[0], 
                             cantemist_test.shape[0]/len(set(cantemist_test.file)), 
                             len(test_codes), 
                             cantemist_test.shape[0]/len(test_codes), 
                             len(test_codes - dev2_codes - dev1_codes - train_codes)]}, 
                       index=row_names)
cantemist_tab = cantemist_tab.reindex(columns=col_names)
cantemist_tab

,Training,Development-1,Development-2,Test
Documents,501,249.0,250.0,300.0
Total ICD-O codes,2756,1385.0,1279.0,1599.0
Avg ICD-O codes per doc.,5.501,5.562249,5.116,5.33
Unique ICD-O codes,493,338.0,334.0,386.0
Avg docs. per ICD-O code,5.59026,4.097633,3.829341,4.142487
Unique unseen ICD-O codes,-,130.0,120.0,107.0
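
The table above repeats the same six computations for each split. As an alternative, they can be factored into a helper, sketched below on toy data. `summarise_split`, `ROW_NAMES` and the toy frames are hypothetical names introduced for illustration. As in the notebook, "unseen" means absent from all previous splits, and "Avg docs. per ICD-O code" assumes duplicated document-code rows have already been removed.

```python
import pandas as pd

ROW_NAMES = ["Documents", "Total ICD-O codes", "Avg ICD-O codes per doc.",
             "Unique ICD-O codes", "Avg docs. per ICD-O code",
             "Unique unseen ICD-O codes"]


def summarise_split(df, seen_codes=None):
    """Compute the six summary rows for one split (columns: file, code)."""
    n_docs = df["file"].nunique()
    n_rows = len(df)
    codes = set(df["code"])
    unseen = "-" if seen_codes is None else len(codes - seen_codes)
    return dict(zip(ROW_NAMES, [n_docs, n_rows, n_rows / n_docs,
                                len(codes), n_rows / len(codes), unseen]))


# Toy splits (illustrative values only).
toy_train = pd.DataFrame({"file": ["doc_a", "doc_a", "doc_b"],
                          "code": ["8000/3", "8000/6", "8000/3"]})
toy_dev = pd.DataFrame({"file": ["doc_c", "doc_c"],
                        "code": ["8000/3", "8010/3"]})

seen = None
columns = {}
for name, df in {"Training": toy_train, "Development-1": toy_dev}.items():
    columns[name] = summarise_split(df, seen)
    # Accumulate the codes of every split processed so far.
    seen = set(df["code"]) | (seen or set())

summary = pd.DataFrame(columns, index=ROW_NAMES)
print(summary)
```

Adding a split then only requires one more entry in the dictionary, and the cumulative `seen` set keeps the "unseen" definition consistent across columns.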


In [57]:
# Train & dev & test codes table,
# excluding the 8000/6 eCIE-O code
exc_code = "8000/6"
cantemist_train_exc = cantemist_train[cantemist_train["code"] != exc_code]
train_codes_exc = set(cantemist_train_exc["code"])

cantemist_dev1_exc = cantemist_dev1[cantemist_dev1["code"] != exc_code]
dev1_codes_exc = set(cantemist_dev1_exc["code"])

cantemist_dev2_exc = cantemist_dev2[cantemist_dev2["code"] != exc_code]
dev2_codes_exc = set(cantemist_dev2_exc["code"])

cantemist_test_exc = cantemist_test[cantemist_test["code"] != exc_code]
test_codes_exc = set(cantemist_test_exc["code"])

exc_cantemist_tab = pd.DataFrame({col_names[0]: [len(set(cantemist_train_exc.file)), 
                                       cantemist_train_exc.shape[0], 
                                       cantemist_train_exc.shape[0]/len(set(cantemist_train_exc.file)), 
                                       len(train_codes_exc), 
                                       cantemist_train_exc.shape[0]/len(train_codes_exc), 
                                       "-"], 
              col_names[1]: [len(set(cantemist_dev1_exc.file)), 
                             cantemist_dev1_exc.shape[0], 
                             cantemist_dev1_exc.shape[0]/len(set(cantemist_dev1_exc.file)), 
                             len(dev1_codes_exc), 
                             cantemist_dev1_exc.shape[0]/len(dev1_codes_exc), 
                             len(dev1_codes_exc - train_codes_exc)],
              col_names[2]: [len(set(cantemist_dev2_exc.file)), 
                             cantemist_dev2_exc.shape[0], 
                             cantemist_dev2_exc.shape[0]/len(set(cantemist_dev2_exc.file)), 
                             len(dev2_codes_exc), 
                             cantemist_dev2_exc.shape[0]/len(dev2_codes_exc), 
                             len(dev2_codes_exc - dev1_codes_exc - train_codes_exc)],
              col_names[3]: [len(set(cantemist_test_exc.file)), 
                             cantemist_test_exc.shape[0], 
                             cantemist_test_exc.shape[0]/len(set(cantemist_test_exc.file)), 
                             len(test_codes_exc), 
                             cantemist_test_exc.shape[0]/len(test_codes_exc), 
                             len(test_codes_exc - dev2_codes_exc - dev1_codes_exc - train_codes_exc)]}, 
                       index=row_names)
exc_cantemist_tab = exc_cantemist_tab.reindex(columns=col_names)
exc_cantemist_tab

,Training,Development-1,Development-2,Test
Documents,500,248.0,249.0,300.0
Total ICD-O codes,2299,1158.0,1076.0,1349.0
Avg ICD-O codes per doc.,4.598,4.669355,4.321285,4.496667
Unique ICD-O codes,492,337.0,333.0,385.0
Avg docs. per ICD-O code,4.67276,3.436202,3.231231,3.503896
Unique unseen ICD-O codes,-,130.0,120.0,107.0
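
Compared with the table that includes every code, the number of documents drops by one in the training, dev1 and dev2 corpora once 8000/6 is excluded (501 to 500, 249 to 248, 250 to 249). This is what we would expect if each of those splits contains a document whose only code is 8000/6. A sketch of how to list such documents, shown on toy data (the rows are invented for illustration):

```python
import pandas as pd

exc_code = "8000/6"

# Toy coding table: doc_b is annotated with the excluded code only.
toy = pd.DataFrame({
    "file": ["doc_a", "doc_a", "doc_b", "doc_c", "doc_c"],
    "code": ["8000/6", "8000/3", "8000/6", "8010/3", "8000/6"],
})

# For each document, True when every one of its codes equals exc_code.
only_exc = toy["code"].eq(exc_code).groupby(toy["file"]).all()

# Documents that disappear when exc_code is filtered out.
lost_docs = sorted(only_exc[only_exc].index)
print(lost_docs)
```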


In [58]:
# Distribution of the number of samples (documents) per eCIE-O-3.1 code
sample_code = pd.DataFrame({col_names[0]: cantemist_train["code"].value_counts().describe(),
                            col_names[1]: cantemist_dev1["code"].value_counts().describe(),
                            col_names[2]: cantemist_dev2["code"].value_counts().describe(),
                            col_names[3]: cantemist_test["code"].value_counts().describe()}) 
sample_code = sample_code.reindex(columns=col_names)
sample_code

,Training,Development-1,Development-2,Test
count,493.0,338.0,334.0,386.0
mean,5.590264,4.097633,3.829341,4.142487
std,29.587174,17.885539,15.759574,18.503153
min,1.0,1.0,1.0,1.0
25%,1.0,1.0,1.0,1.0
50%,1.0,1.0,1.0,1.0
75%,3.0,2.0,2.0,2.0
max,457.0,227.0,203.0,250.0
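
In every split the median is 1 while the maximum is in the hundreds, so the distribution is heavily skewed: at least half of the codes occur in a single document. Two simple indicators of this skew are the share of codes that occur only once and the share of all annotations taken by the most frequent code. A minimal sketch on toy data (values invented for illustration):

```python
import pandas as pd

# Toy code column: one dominant code and a tail of rare ones.
toy_codes = pd.Series(["8000/6"] * 4 + ["8000/3"] * 2 + ["8010/3", "8500/3"])

# Number of rows per code, sorted from most to least frequent.
counts = toy_codes.value_counts()

# Fraction of distinct codes that appear exactly once.
singleton_share = (counts == 1).mean()

# Fraction of all rows that belong to the most frequent code.
top_code = counts.index[0]
top_share = counts.iloc[0] / counts.sum()

print(singleton_share, top_code, top_share)
```

Applied to the real splits, `counts` would be `cantemist_train["code"].value_counts()` and its counterparts for the other corpora.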


In [59]:
# Distribution of the number of samples (documents) per eCIE-O-3.1 code,
# excluding 8000/6
exc_sample_code = pd.DataFrame({col_names[0]: cantemist_train_exc["code"].value_counts().describe(),
                            col_names[1]: cantemist_dev1_exc["code"].value_counts().describe(),
                            col_names[2]: cantemist_dev2_exc["code"].value_counts().describe(),
                            col_names[3]: cantemist_test_exc["code"].value_counts().describe()}) 
exc_sample_code = exc_sample_code.reindex(columns=col_names)
exc_sample_code

,Training,Development-1,Development-2,Test
count,492.0,337.0,333.0,385.0
mean,4.672764,3.436202,3.231231,3.503896
std,21.47856,13.135158,11.369638,13.617567
min,1.0,1.0,1.0,1.0
25%,1.0,1.0,1.0,1.0
50%,1.0,1.0,1.0,1.0
75%,3.0,2.0,2.0,2.0
max,386.0,199.0,178.0,222.0


### Corpus summary

Description of the Cantemist training, development (dev1 & dev2) and test corpora:

In [60]:
cantemist_tab

,Training,Development-1,Development-2,Test
Documents,501,249.0,250.0,300.0
Total ICD-O codes,2756,1385.0,1279.0,1599.0
Avg ICD-O codes per doc.,5.501,5.562249,5.116,5.33
Unique ICD-O codes,493,338.0,334.0,386.0
Avg docs. per ICD-O code,5.59026,4.097633,3.829341,4.142487
Unique unseen ICD-O codes,-,130.0,120.0,107.0


Same description, excluding the 8000/6 code:

In [61]:
exc_cantemist_tab

,Training,Development-1,Development-2,Test
Documents,500,248.0,249.0,300.0
Total ICD-O codes,2299,1158.0,1076.0,1349.0
Avg ICD-O codes per doc.,4.598,4.669355,4.321285,4.496667
Unique ICD-O codes,492,337.0,333.0,385.0
Avg docs. per ICD-O code,4.67276,3.436202,3.231231,3.503896
Unique unseen ICD-O codes,-,130.0,120.0,107.0


Distribution of the number of documents per code in each corpus:

In [62]:
sample_code

,Training,Development-1,Development-2,Test
count,493.0,338.0,334.0,386.0
mean,5.590264,4.097633,3.829341,4.142487
std,29.587174,17.885539,15.759574,18.503153
min,1.0,1.0,1.0,1.0
25%,1.0,1.0,1.0,1.0
50%,1.0,1.0,1.0,1.0
75%,3.0,2.0,2.0,2.0
max,457.0,227.0,203.0,250.0


Same distribution, excluding the 8000/6 code:

In [63]:
exc_sample_code

,Training,Development-1,Development-2,Test
count,492.0,337.0,333.0,385.0
mean,4.672764,3.436202,3.231231,3.503896
std,21.47856,13.135158,11.369638,13.617567
min,1.0,1.0,1.0,1.0
25%,1.0,1.0,1.0,1.0
50%,1.0,1.0,1.0,1.0
75%,3.0,2.0,2.0,2.0
max,386.0,199.0,178.0,222.0
