# CodiEsp corpus: CIE codes

This notebook contains a statistical description of the CodiEsp corpus, with the aim of tackling the text classification subtasks (1 and 2) of the shared task described [here](https://temu.bsc.es/codiesp/). The first subtask addresses automatic CIE-Diagnóstico coding, while the second addresses automatic CIE-Procedimiento coding. Sections [Corpus summary](#Corpus-summary) and [Expanded corpus summary](#Expanded-corpus-summary) summarize the most important information extracted throughout the notebook.

In [1]:
import pandas as pd
import numpy as np

## Train, dev, test

Firstly, we perform an exploratory analysis of the CIE-10 codes present in the train, development and test corpora.

### CodiEsp Diagnosis Coding

In [2]:
%%time

codiesp_d_train = pd.read_table("../datasets/final_dataset_v3_to_publish/train/trainD.tsv", sep='\t', header=None)

CPU times: user 3.68 ms, sys: 2.16 ms, total: 5.84 ms
Wall time: 5.15 ms


In [3]:
codiesp_d_train.columns = ["document", "code"]

In [4]:
codiesp_d_train.tail()

Unnamed: 0,document,code
5634,S2340-98942015000100005-1,r69
5635,S2340-98942015000100005-1,r06.00
5636,S2340-98942015000100005-1,c56.2
5637,S2340-98942015000100005-1,r97.1
5638,S2340-98942015000100005-1,r55


In [5]:
codiesp_d_train.shape

(5639, 2)

We check if there are duplicated document-code associations:

In [6]:
codiesp_d_train.duplicated().any()

False
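
The loading, column naming and duplicate check above can be folded into a single helper. The following is only a minimal sketch: `load_annotations` is a hypothetical name that does not appear in the notebook, and an in-memory buffer with made-up rows stands in for the real TSV file so that the example runs without the dataset.

```python
import io

import pandas as pd


def load_annotations(source):
    """Read a header-less, two-column TSV of (document, code) pairs."""
    df = pd.read_csv(
        source, sep="\t", header=None, names=["document", "code"], dtype=str
    )
    # Fail early if the same document-code pair appears more than once.
    if df.duplicated().any():
        raise ValueError("duplicated document-code associations found")
    return df


# Toy stand-in for a file such as trainD.tsv (illustrative values only).
toy_tsv = io.StringIO("doc-1\tr69\ndoc-1\tr55\ndoc-2\tr69\n")
toy_train = load_annotations(toy_tsv)
print(toy_train.shape)
```

Passing a file path instead of the buffer would apply the same checks to the actual train and dev files.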

We check the number of distinct CIE10-Diagnóstico training codes:

In [7]:
train_d_codes = set(codiesp_d_train["code"])

In [8]:
len(train_d_codes)

1767

Now, development set:

In [9]:
%%time

codiesp_d_dev = pd.read_table("../datasets/final_dataset_v3_to_publish/dev/devD.tsv", sep='\t', header=None)

CPU times: user 111 µs, sys: 3.45 ms, total: 3.56 ms
Wall time: 2.87 ms


In [10]:
codiesp_d_dev.columns = ["document", "code"]

In [11]:
codiesp_d_dev.tail()

Unnamed: 0,document,code
2672,S2254-28842013000300009-1,i11.9
2673,S2254-28842013000300009-1,e78.00
2674,S2254-28842013000300009-1,e11.9
2675,S2254-28842013000300009-1,i51.9
2676,S2254-28842013000300009-1,k65.9


In [12]:
codiesp_d_dev.shape

(2677, 2)

We check if there are duplicated document-code associations:

In [13]:
codiesp_d_dev.duplicated().any()

False

We check the number of distinct CIE10-Diagnóstico development codes:

In [14]:
dev_d_codes = set(codiesp_d_dev["code"])

In [15]:
len(dev_d_codes)

1158

Check if all dev codes are present in train subset:

In [16]:
not_common_codes = dev_d_codes - train_d_codes

In [17]:
print(len(not_common_codes))
print(len(not_common_codes)/len(dev_d_codes))

427
0.3687392055267703


~37% of the codes contained in the development set are NOT contained in the train set.
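
The unseen-code computation is a plain set difference, so it can be wrapped in a small function and reused for any pair of splits. This is a sketch under the assumption that codes are compared as exact strings; `unseen_fraction` and the toy code sets are hypothetical and not part of the notebook.

```python
def unseen_fraction(eval_codes, train_codes):
    """Return (count, fraction) of evaluation codes that never occur in train."""
    eval_codes, train_codes = set(eval_codes), set(train_codes)
    unseen = eval_codes - train_codes
    return len(unseen), len(unseen) / len(eval_codes)


# Toy code sets (illustrative values, not the corpus contents).
toy_train_codes = {"r69", "r55", "i11.9"}
toy_dev_codes = {"r69", "e11.9", "k65.9", "i11.9"}
n_unseen, frac_unseen = unseen_fraction(toy_dev_codes, toy_train_codes)
print(n_unseen, frac_unseen)
```

Called as `unseen_fraction(dev_d_codes, train_d_codes)`, it should reproduce the two numbers printed above.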

The number of total distinct codes present in train and dev subsets is:

In [18]:
d_codes = dev_d_codes.union(train_d_codes)

In [19]:
len(d_codes)

2194

### CodiEsp Procedure Coding

In [20]:
%%time

codiesp_p_train = pd.read_table("../datasets/final_dataset_v3_to_publish/train/trainP.tsv", sep='\t', header=None)

CPU times: user 2.79 ms, sys: 375 µs, total: 3.17 ms
Wall time: 19.4 ms


In [21]:
codiesp_p_train.columns = ["document", "code"]

In [22]:
codiesp_p_train.tail()

Unnamed: 0,document,code
1545,S2254-28842014000200009-1,5a1d
1546,S2254-28842014000200009-1,06hn
1547,S2254-28842014000200009-1,3e1m39z
1548,S2340-98942015000100005-1,0dtp
1549,S2340-98942015000100005-1,0fb0


In [23]:
codiesp_p_train.shape

(1550, 2)

We check if there are duplicated document-code associations:

In [24]:
codiesp_p_train.duplicated().any()

False

We check the number of distinct CIE-10-Procedimiento training codes:

In [25]:
train_p_codes = set(codiesp_p_train["code"])

In [26]:
len(train_p_codes)

563

Now, development set:

In [28]:
%%time

codiesp_p_dev = pd.read_table("../datasets/final_dataset_v3_to_publish/dev/devP.tsv", sep='\t', header=None)

CPU times: user 4.66 ms, sys: 86 µs, total: 4.75 ms
Wall time: 15.1 ms


In [29]:
codiesp_p_dev.columns = ["document", "code"]

In [30]:
codiesp_p_dev.tail()

Unnamed: 0,document,code
812,S2254-28842013000300009-1,5a1d
813,S2254-28842013000300009-1,3e1m39z
814,S2254-28842013000300009-1,bw03zzz
815,S2254-28842013000300009-1,bw00zzz
816,S2254-28842013000300009-1,bw20


In [31]:
codiesp_p_dev.shape

(817, 2)

We check if there are duplicated document-code associations:

In [32]:
codiesp_p_dev.duplicated().any()

False

We check the number of distinct CIE10-Procedimiento development codes:

In [33]:
dev_p_codes = set(codiesp_p_dev["code"])

In [34]:
len(dev_p_codes)

375

Check if all dev codes are present in train subset:

In [35]:
not_common_codes = dev_p_codes - train_p_codes

In [36]:
print(len(not_common_codes))
print(len(not_common_codes)/len(dev_p_codes))

164
0.43733333333333335


~44% of the codes contained in the development set are NOT contained in the train set.

The number of total distinct codes present in train and dev subsets is:

In [37]:
p_codes = dev_p_codes.union(train_p_codes)

In [38]:
len(p_codes)

727

Total number of distinct CIE10-Diagnóstico and CIE10-Procedimiento codes contained in train and dev subsets:

In [39]:
len(d_codes) + len(p_codes)

2921

In [40]:
# Empty set expected
d_codes & p_codes

set()

To sum up, we generate the following tables, using the same format as the one used to describe the corpora generated for the previous [CLEF shared tasks](https://pdfs.semanticscholar.org/7726/a6eff024adee59c1bf21d88f5a0f75239f29.pdf):

In [41]:
# Information extracted from: https://temu.bsc.es/codiesp/index.php/category/data/
total_uniq_codes = 3427

All codes table:

In [42]:
# CodiEsp corpus table
col_names = ['Training', 'Development', 'Test']
row_names = ['Documents', 'Total CIE codes', 'Avg CIE codes per doc.', 'Unique CIE codes', 'Avg samples (docs) per CIE code', 
             'Unique unseen CIE codes']
all_tab = pd.DataFrame({col_names[0]: [len(set(codiesp_d_train.document).union(set(codiesp_p_train.document))), 
                                       codiesp_d_train.shape[0] + codiesp_p_train.shape[0], 
                                       (codiesp_d_train.shape[0] + codiesp_p_train.shape[0])/len(set(codiesp_d_train.document).union(set(codiesp_p_train.document))), 
                                       len(train_d_codes) + len(train_p_codes), 
                                       (codiesp_d_train.shape[0] + codiesp_p_train.shape[0])/(len(train_d_codes) + len(train_p_codes)), 
                                       "-"], 
              col_names[1]: [len(set(codiesp_d_dev.document).union(set(codiesp_p_dev.document))), 
                             codiesp_d_dev.shape[0] + codiesp_p_dev.shape[0], 
                             (codiesp_d_dev.shape[0] + codiesp_p_dev.shape[0])/len(set(codiesp_d_dev.document).union(set(codiesp_p_dev.document))), 
                             len(dev_d_codes) + len(dev_p_codes), 
                             (codiesp_d_dev.shape[0] + codiesp_p_dev.shape[0])/(len(dev_d_codes) + len(dev_p_codes)), 
                             len(dev_d_codes - train_d_codes) + len(dev_p_codes - train_p_codes)],
              col_names[2]: [250, None, None, None, None, total_uniq_codes - len(d_codes) - len(p_codes)]}, 
                       index=row_names)
all_tab = all_tab.reindex(columns=col_names)
all_tab

Unnamed: 0,Training,Development,Test
Documents,500,250.0,250.0
Total CIE codes,7189,3494.0,
Avg CIE codes per doc.,14.378,13.976,
Unique CIE codes,2330,1533.0,
Avg samples (docs) per CIE code,3.08541,2.279191,
Unique unseen CIE codes,-,591.0,506.0
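
Most rows of this table follow the same pattern for every split, so they could be derived by one function instead of being spelled out per column. The sketch below assumes an annotation table with `document` and `code` columns; `split_stats` and the toy rows are hypothetical.

```python
import pandas as pd


def split_stats(df):
    """Summarise one split given a (document, code) annotation table."""
    n_docs = df["document"].nunique()
    n_rows = len(df)
    n_codes = df["code"].nunique()
    return pd.Series(
        {
            "Documents": n_docs,
            "Total CIE codes": n_rows,
            "Avg CIE codes per doc.": n_rows / n_docs,
            "Unique CIE codes": n_codes,
            "Avg samples (docs) per CIE code": n_rows / n_codes,
        }
    )


# Toy annotations (illustrative values only).
toy = pd.DataFrame(
    {
        "document": ["d1", "d1", "d2", "d3"],
        "code": ["r69", "r55", "r69", "i11.9"],
    }
)
stats = split_stats(toy)
print(stats)
```

For the combined table, the diagnosis and procedure frames would first be stacked with `pd.concat`; counting unique codes on the stacked frame matches the sum used above only because the two code sets were found to be disjoint.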


CIE-Diagnóstico codes table:

In [43]:
# CIE-Diagnóstico codes fraction in Train
codiesp_d_train.shape[0]/(codiesp_d_train.shape[0] + codiesp_p_train.shape[0])

0.784392822367506

In [44]:
# CIE-Diagnóstico codes fraction in Dev
codiesp_d_dev.shape[0]/(codiesp_d_dev.shape[0] + codiesp_p_dev.shape[0])

0.766170578133944

In [45]:
# Unique CIE-Diagnóstico codes fraction in Train
len(train_d_codes)/(len(train_d_codes) + len(train_p_codes))

0.7583690987124464

In [46]:
# Unique CIE-Diagnóstico codes fraction in Dev
len(dev_d_codes)/(len(dev_d_codes) + len(dev_p_codes))

0.7553816046966731

In [47]:
# Unique CIE-Diagnóstico codes fraction in Train+Dev
len(d_codes)/(len(d_codes) + len(p_codes))

0.7511126326600479

In [48]:
# Unique unseen CIE-Diagnóstico codes fraction in Dev
len(dev_d_codes - train_d_codes)/(len(dev_d_codes - train_d_codes) + len(dev_p_codes - train_p_codes))

0.7225042301184433

In [50]:
# Approximate Unique unseen CIE-Diagnóstico codes fraction in Test
d_approx = 0.75

In [51]:
# CodiEsp-Diagnóstico codes table
d_tab = pd.DataFrame({col_names[0]: [len(set(codiesp_d_train.document)), 
                                       codiesp_d_train.shape[0], 
                                       codiesp_d_train.shape[0]/len(set(codiesp_d_train.document)), 
                                       len(train_d_codes), 
                                     codiesp_d_train.shape[0]/len(train_d_codes), 
                                       "-"], 
              col_names[1]: [len(set(codiesp_d_dev.document)), 
                             codiesp_d_dev.shape[0], 
                             codiesp_d_dev.shape[0]/len(set(codiesp_d_dev.document)), 
                             len(dev_d_codes), 
                             codiesp_d_dev.shape[0]/len(dev_d_codes), 
                             len(dev_d_codes - train_d_codes)],
              col_names[2]: ["~250", None, None, None, None, "~" + str((total_uniq_codes - len(d_codes) - len(p_codes))*d_approx)]}, 
                       index=row_names)
d_tab = d_tab.reindex(columns=col_names)
d_tab

Unnamed: 0,Training,Development,Test
Documents,500,250.0,~250
Total CIE codes,5639,2677.0,
Avg CIE codes per doc.,11.278,10.708,
Unique CIE codes,1767,1158.0,
Avg samples (docs) per CIE code,3.19128,2.311744,
Unique unseen CIE codes,-,427.0,~379.5


In [52]:
# CIE-Procedimiento documents fraction in Train
len(set(codiesp_p_train.document))/len(set(codiesp_d_train.document).union(set(codiesp_p_train.document)))

0.87

In [53]:
# CIE-Procedimiento documents fraction in Dev
len(set(codiesp_p_dev.document))/len(set(codiesp_d_dev.document).union(set(codiesp_p_dev.document)))

0.888

In [54]:
# Approximate CIE-Procedimiento documents fraction in Test
c_approx = 0.88

In [55]:
# CodiEsp-Procedimiento codes table
p_tab = pd.DataFrame({col_names[0]: [len(set(codiesp_p_train.document)), 
                                       codiesp_p_train.shape[0], 
                                       codiesp_p_train.shape[0]/len(set(codiesp_p_train.document)), 
                                       len(train_p_codes), 
                                     codiesp_p_train.shape[0]/len(train_p_codes),
                                       "-"], 
              col_names[1]: [len(set(codiesp_p_dev.document)), 
                             codiesp_p_dev.shape[0], 
                             codiesp_p_dev.shape[0]/len(set(codiesp_p_dev.document)), 
                             len(dev_p_codes), 
                             codiesp_p_dev.shape[0]/len(dev_p_codes),
                             len(dev_p_codes - train_p_codes)],
              col_names[2]: ["~" + str(c_approx*250), None, None, None, None,
                             "~" + str((total_uniq_codes - len(d_codes) - len(p_codes))*(1-d_approx))]},
                       index=row_names)
p_tab = p_tab.reindex(columns=col_names)
p_tab

Unnamed: 0,Training,Development,Test
Documents,435,222.0,~220.0
Total CIE codes,1550,817.0,
Avg CIE codes per doc.,3.56322,3.68018,
Unique CIE codes,563,375.0,
Avg samples (docs) per CIE code,2.75311,2.178667,
Unique unseen CIE codes,-,164.0,~126.5


As can be observed from the last table, some documents have no CIE-Procedimiento codes associated with them.

### Corpus summary

Description of CodiEsp (train, development and test) corpus:

In [56]:
# Diagnóstico+Procedimiento
all_tab

Unnamed: 0,Training,Development,Test
Documents,500,250.0,250.0
Total CIE codes,7189,3494.0,
Avg CIE codes per doc.,14.378,13.976,
Unique CIE codes,2330,1533.0,
Avg samples (docs) per CIE code,3.08541,2.279191,
Unique unseen CIE codes,-,591.0,506.0


In [57]:
# Diagnóstico
d_tab

Unnamed: 0,Training,Development,Test
Documents,500,250.0,~250
Total CIE codes,5639,2677.0,
Avg CIE codes per doc.,11.278,10.708,
Unique CIE codes,1767,1158.0,
Avg samples (docs) per CIE code,3.19128,2.311744,
Unique unseen CIE codes,-,427.0,~379.5


In [58]:
# Samples per CIE-Diagnóstico code distribution
col_names_sample = ["Train", "Dev", "Train+Dev"]
d_sample_code = pd.DataFrame({col_names_sample[0]: codiesp_d_train["code"].value_counts().describe(),
                              col_names_sample[1]: codiesp_d_dev["code"].value_counts().describe(),
                              # pd.concat is used because Series.append was removed in pandas 2.0
                              col_names_sample[2]: pd.concat([codiesp_d_train["code"], codiesp_d_dev["code"]]).value_counts().describe()})
d_sample_code = d_sample_code.reindex(columns=col_names_sample)
d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,1767.0,1158.0,2194.0
mean,3.191285,2.311744,3.790337
std,6.921389,3.770245,9.123652
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,3.0,2.0,3.0
max,112.0,51.0,163.0
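
A median of 1 in every column means that at least half of the distinct codes are annotated only once. One way to quantify that long tail directly is to measure the share of codes that occur exactly once. This is a sketch on toy data; `singleton_share` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def singleton_share(codes):
    """Fraction of distinct codes that are annotated exactly once."""
    counts = pd.Series(codes).value_counts()
    return float((counts == 1).mean())


# Toy code columns (illustrative values only).
toy_train = pd.Series(["r69", "r69", "r55", "i11.9"])
toy_dev = pd.Series(["r69", "e11.9"])
# Pool both splits before counting, as in the Train+Dev column.
toy_all = pd.concat([toy_train, toy_dev], ignore_index=True)
print(singleton_share(toy_train), singleton_share(toy_all))
```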


In [59]:
# Procedimiento
p_tab

Unnamed: 0,Training,Development,Test
Documents,435,222.0,~220.0
Total CIE codes,1550,817.0,
Avg CIE codes per doc.,3.56322,3.68018,
Unique CIE codes,563,375.0,
Avg samples (docs) per CIE code,2.75311,2.178667,
Unique unseen CIE codes,-,164.0,~126.5


In [60]:
# Samples per CIE-Procedimiento code distribution
p_sample_code = pd.DataFrame({col_names_sample[0]: codiesp_p_train["code"].value_counts().describe(),
                              col_names_sample[1]: codiesp_p_dev["code"].value_counts().describe(),
                              # pd.concat is used because Series.append was removed in pandas 2.0
                              col_names_sample[2]: pd.concat([codiesp_p_train["code"], codiesp_p_dev["code"]]).value_counts().describe()})
p_sample_code = p_sample_code.reindex(columns=col_names_sample)
p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,563.0,375.0,727.0
mean,2.753108,2.178667,3.255846
std,5.603234,3.415531,7.580642
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,67.0,40.0,107.0


## CIE tree-shape structure

As described in the [official documentation](https://eciemaps.mscbs.gob.es/ecieMaps/documentation/documentation.html#), the tree-shape structure of CIE-10 Diagnóstico codes corresponds to:

* Capítulo (first char)
    * Categoría (3-char)
        * Sub-categoría (4-char)
            * Sub-clasificación (5-7 -char)

In addition, a group of related categories can be referred to with a more generic code that links two categories with the '-' character, such as 'O03-O06'.

In the case of CIE-10 Procedimiento codes, all codes have a 7-character length (see [official documentation](https://www.mscbs.gob.es/estadEstudios/estadisticas/normalizacion/CIE10/M_Ref_USA_2016.pdf)), having no hierarchical structure.

We explore the hierarchical levels of the CIE-10 codes annotated in CodiEsp corpus.
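
To make the diagnosis hierarchy concrete, the sketch below derives the ancestor levels of a dotted code as written in the corpus (for example `z20.818`). It rests on two assumptions that the documentation excerpt above does not confirm: the character counts in the list exclude the '.' separator, and the sub-category is written as the category followed by '.' and one more character. `diagnosis_levels` is a hypothetical helper.

```python
def diagnosis_levels(code):
    """Split a dotted CIE-10 Diagnóstico code into its ancestor levels."""
    code = code.lower()
    compact = code.replace(".", "")  # characters without the separator
    category = compact[:3]
    # Assumed convention: sub-category = category + "." + fourth character.
    subcategory = f"{category}.{compact[3]}" if len(compact) > 3 else None
    return {
        "chapter": compact[0],
        "category": category,
        "subcategory": subcategory,
        "code": code,
    }


print(diagnosis_levels("z20.818"))
print(diagnosis_levels("r52"))
```

Codes that are exactly three characters long have no sub-category, so that level is returned as `None`.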

### CIE10-Diagnóstico

In the [Additional Guidelines](https://temu.bsc.es/codiesp/index.php/2019/09/19/annotation-guidelines/), annotated CIE10-Diagnóstico codes are reported to be at least 3 characters long.

In [61]:
codiesp_d_train.head()

Unnamed: 0,document,code
0,S0004-06142005000700014-1,n44.8
1,S0004-06142005000700014-1,z20.818
2,S0004-06142005000700014-1,r60.9
3,S0004-06142005000700014-1,r52
4,S0004-06142005000700014-1,a23.9


In [62]:
codiesp_d_train_codes = pd.DataFrame({'code': codiesp_d_train["code"].drop_duplicates()})

In [63]:
codiesp_d_train_codes.shape

(1767, 1)

In [64]:
codiesp_d_dev_codes = pd.DataFrame({'code': codiesp_d_dev["code"].drop_duplicates()})

In [65]:
codiesp_d_dev_codes.shape

(1158, 1)

In [66]:
codiesp_d_train_codes["n_char"] = codiesp_d_train_codes["code"].apply(lambda x: len(x))

In [67]:
codiesp_d_train_codes.head()

Unnamed: 0,code,n_char
0,n44.8,5
1,z20.818,7
2,r60.9,5
3,r52,3
4,a23.9,5


In [68]:
codiesp_d_dev_codes["n_char"] = codiesp_d_dev_codes["code"].apply(lambda x: len(x))

In [69]:
codiesp_d_train_codes["n_char"].value_counts(normalize=False)

5    890
6    520
7    272
3     79
8      6
Name: n_char, dtype: int64

In [70]:
codiesp_d_train_codes["n_char"].value_counts(normalize=True)

5    0.503679
6    0.294284
7    0.153933
3    0.044709
8    0.003396
Name: n_char, dtype: float64

In [71]:
codiesp_d_dev_codes["n_char"].value_counts(normalize=False)

5    623
6    323
7    160
3     48
8      4
Name: n_char, dtype: int64

In [72]:
codiesp_d_dev_codes["n_char"].value_counts(normalize=True)

5    0.537997
6    0.278929
7    0.138169
3    0.041451
8    0.003454
Name: n_char, dtype: float64

In [73]:
# Train: Number of generic codes
sum(codiesp_d_train_codes["code"].apply(lambda x: '-' in x))

0

In [74]:
# Dev: Number of generic codes
sum(codiesp_d_dev_codes["code"].apply(lambda x: '-' in x))

0

In [75]:
# Train: Codes 8-char long
codiesp_d_train_codes[codiesp_d_train_codes["n_char"] == 8]

Unnamed: 0,code,n_char
1084,r40.2421,8
4461,r40.2410,8
4513,r40.2430,8
4588,o41.1290,8
5112,r40.2420,8
5153,o40.9xx0,8


In [76]:
# Dev: Codes 8-char long
codiesp_d_dev_codes[codiesp_d_dev_codes["n_char"] == 8]

Unnamed: 0,code,n_char
83,s39.94xa,8
624,r40.2412,8
1437,s42.322b,8
2203,r40.2430,8


In [77]:
# Train: Only 3-long codes do not contain '.' character
codiesp_d_train_codes[codiesp_d_train_codes["code"].apply(lambda x: '.' not in x)]["n_char"].value_counts()

3    79
Name: n_char, dtype: int64

In [78]:
# Dev: Only 3-long codes do not contain '.' character
codiesp_d_dev_codes[codiesp_d_dev_codes["code"].apply(lambda x: '.' not in x)]["n_char"].value_counts()

3    48
Name: n_char, dtype: int64

To sum up, in both the Train and Dev subsets the character length of CIE-Diagnóstico codes (counting the '.' separator) varies from 3 to 8, and the distribution of code lengths is nearly the same in both subsets.
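
The train/dev comparison is easier to read when both length distributions sit side by side in one table. The following sketch shows one way to build such a table on toy data; `length_distribution` is a hypothetical helper, and lengths include the '.' character, as in the cells above.

```python
import pandas as pd


def length_distribution(codes):
    """Relative frequency of each code length among distinct codes."""
    unique_codes = pd.Series(sorted(set(codes)))
    return unique_codes.str.len().value_counts(normalize=True).sort_index()


# Toy code lists (illustrative values only).
toy_train = ["n44.8", "z20.818", "r60.9", "r52"]
toy_dev = ["i11.9", "e78.00", "r52", "r55"]
comparison = pd.DataFrame(
    {
        "Train": length_distribution(toy_train),
        "Dev": length_distribution(toy_dev),
    }
).fillna(0.0)  # a length absent from one split gets frequency 0
print(comparison)
```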

### CIE10-Procedimiento

In the [Additional Guidelines](https://temu.bsc.es/codiesp/index.php/2019/09/19/annotation-guidelines/), annotated CIE10-Procedimiento codes are reported to be either 4 or 7 characters long.

In [79]:
codiesp_p_train.head()

Unnamed: 0,document,code
0,S0004-06142005000700014-1,bw03zzz
1,S0004-06142005000700014-1,3e02329
2,S0004-06142005000700014-1,bw40zzz
3,S0004-06142005000700014-1,bv44zzz
4,S0004-06142005000700014-1,bn20


In [80]:
codiesp_p_train_codes = pd.DataFrame({'code': codiesp_p_train["code"].drop_duplicates()})

In [81]:
codiesp_p_train_codes.shape

(563, 1)

In [82]:
codiesp_p_dev_codes = pd.DataFrame({'code': codiesp_p_dev["code"].drop_duplicates()})

In [83]:
codiesp_p_dev_codes.shape

(375, 1)

In [84]:
codiesp_p_train_codes["n_char"] = codiesp_p_train_codes["code"].apply(lambda x: len(x))

In [85]:
codiesp_p_train_codes.head()

Unnamed: 0,code,n_char
0,bw03zzz,7
1,3e02329,7
2,bw40zzz,7
3,bv44zzz,7
4,bn20,4


In [86]:
codiesp_p_dev_codes["n_char"] = codiesp_p_dev_codes["code"].apply(lambda x: len(x))

In [87]:
codiesp_p_train_codes["n_char"].value_counts(normalize=False)

4    338
7    225
Name: n_char, dtype: int64

In [88]:
codiesp_p_train_codes["n_char"].value_counts(normalize=True)

4    0.600355
7    0.399645
Name: n_char, dtype: float64

In [89]:
codiesp_p_dev_codes["n_char"].value_counts(normalize=False)

4    228
7    147
Name: n_char, dtype: int64

In [90]:
codiesp_p_dev_codes["n_char"].value_counts(normalize=True)

4    0.608
7    0.392
Name: n_char, dtype: float64

To sum up, in both the Train and Dev subsets the character length of CIE-Procedimiento codes is either 4 or 7, and the distribution of code lengths is nearly the same in both subsets.

## Valid codes

Organizers also provide a [set of valid CIE-10 codes for the task](https://zenodo.org/record/3706838#.XnIi45-YU8o). As expected, all CIE-10 codes contained in CodiEsp (train and dev) corpus are valid codes.

### CIE10-Diagnóstico

In [91]:
%%time

codiesp_d_valid = pd.read_table("../resources/CodiEsp-Evaluation-Script/codiesp_codes/codiesp-D_codes.tsv", sep='\t', 
                                header=None)

CPU times: user 147 ms, sys: 8.14 ms, total: 155 ms
Wall time: 155 ms


In [92]:
codiesp_d_valid.columns = ["code", "es-desc", "en-desc"]

In [93]:
codiesp_d_valid.head()

Unnamed: 0,code,es-desc,en-desc
0,A00.0,"Cólera debido a Vibrio cholerae 01, biotipo ch...","Cholera due to Vibrio cholerae 01, biovar chol..."
1,A00.1,"Cólera debido a Vibrio cholerae 01, biotipo El...","Cholera due to Vibrio cholerae 01, biovar eltor"
2,A00.9,"Cólera, no especificado","Cholera, unspecified"
3,A01,Fiebres tifoidea y paratifoidea,
4,A01.0,Fiebre tifoidea,


In [94]:
codiesp_d_valid.shape

(98288, 3)

As stated on [the Zenodo record](https://zenodo.org/record/3632523#.XmQOh5-YU8p), a limited number of codes do not have an English description:

In [95]:
codiesp_d_valid.isnull().values.sum()

26944

In [96]:
codiesp_d_valid["en-desc"].isnull().sum()

26944

In [97]:
codiesp_d_valid["es-desc"].isnull().sum()

0

In [98]:
codiesp_d_valid["code"].isnull().sum()

0

Total number of valid codes:

In [99]:
valid_d_codes = set(codiesp_d_valid["code"])

In [100]:
len(valid_d_codes)

98288

ALL codes contained in train and dev sets are present in the valid set:

In [101]:
lower_valid_d_codes = {s.lower() for s in valid_d_codes}

In [102]:
len(d_codes - lower_valid_d_codes)

0
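
The check above depends on lower-casing the valid codes, because the corpus stores codes in lower case while the valid list uses upper case. A sketch of the same check as a reusable function, which also returns the offending codes rather than only their count, is given below. `invalid_codes` and the toy sets are hypothetical.

```python
def invalid_codes(corpus_codes, valid_codes):
    """Return corpus codes with no case-insensitive match in the valid list."""
    valid_lower = {code.lower() for code in valid_codes}
    return {code for code in corpus_codes if code.lower() not in valid_lower}


# Toy sets (illustrative values; "x99.9" is deliberately absent from the valid list).
toy_valid = {"A00.0", "A01", "R52"}
toy_corpus = {"a00.0", "r52", "x99.9"}
print(invalid_codes(toy_corpus, toy_valid))
```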

We also explore the character-length distribution of the valid codes:

In [103]:
df_lower_valid_d_codes = pd.DataFrame({'code': list(lower_valid_d_codes)})

In [104]:
df_lower_valid_d_codes["n_char"] = df_lower_valid_d_codes["code"].apply(lambda x: len(x))

In [105]:
df_lower_valid_d_codes.shape

(98288, 2)

In [106]:
df_lower_valid_d_codes.head()

Unnamed: 0,code,n_char
0,m12.0,5
1,t21.49,6
2,z04.4,5
3,s04.40xs,8
4,s46.329s,8


In [107]:
df_lower_valid_d_codes["n_char"].value_counts(normalize=False)

8    49920
7    22003
6    14474
5     9982
3     1909
Name: n_char, dtype: int64

In [108]:
df_lower_valid_d_codes["n_char"].value_counts(normalize=True)

8    0.507895
7    0.223863
6    0.147261
5    0.101559
3    0.019423
Name: n_char, dtype: float64

In [109]:
# Number of generic codes
sum(df_lower_valid_d_codes["code"].apply(lambda x: '-' in x))

0

In [110]:
# Only 3-long codes do not contain '.' character
df_lower_valid_d_codes[df_lower_valid_d_codes["code"].apply(lambda x: '.' not in x)]["n_char"].value_counts()

3    1909
Name: n_char, dtype: int64

As previously reported for the codes in CodiEsp, the character length of valid CIE-Diagnóstico codes varies from 3 to 8.

### CIE10-Procedimiento

In [112]:
%%time

codiesp_p_valid = pd.read_table("../resources/CodiEsp-Evaluation-Script/codiesp_codes/codiesp-P_codes.tsv", sep='\t', header=None)

CPU times: user 130 ms, sys: 10.5 ms, total: 141 ms
Wall time: 140 ms


In [113]:
codiesp_p_valid.columns = ["code", "es-desc", "en-desc"]

In [114]:
codiesp_p_valid.head()

Unnamed: 0,code,es-desc,en-desc
0,16070,Derivación de ventrículo cerebral a nasofaring...,Bypass Cerebral Ventricle to Nasopharynx with ...
1,16071,Derivación de ventrículo cerebral a seno masto...,Bypass Cerebral Ventricle to Mastoid Sinus wit...
2,16072,"Derivación de ventrículo cerebral a aurícula, ...",Bypass Cerebral Ventricle to Atrium with Autol...
3,16073,Derivación de ventrículo cerebral a vaso sangu...,Bypass Cerebral Ventricle to Blood Vessel with...
4,16074,Derivación de ventrículo cerebral a cavidad pl...,Bypass Cerebral Ventricle to Pleural Cavity wi...


In [115]:
codiesp_p_valid.shape

(87170, 3)

As stated on [the Zenodo record](https://zenodo.org/record/3632523#.XmQOh5-YU8p), a limited number of codes have neither an English nor a Spanish description, and others lack only the English description:

In [116]:
codiesp_p_valid.isnull().values.sum()

23408

In [117]:
codiesp_p_valid["en-desc"].isnull().sum()

12027

In [118]:
codiesp_p_valid["es-desc"].isnull().sum()

11381

In [119]:
# Codes having neither an English nor a Spanish description
sum(codiesp_p_valid["es-desc"].isnull() & codiesp_p_valid["en-desc"].isnull())

11381

In [120]:
# Codes lacking only the English description
codiesp_p_valid["en-desc"].isnull().sum() - sum(codiesp_p_valid["es-desc"].isnull() & codiesp_p_valid["en-desc"].isnull())

646

In [121]:
codiesp_p_valid["code"].isnull().sum()

0
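
The separate null counts above can be gathered into one summary of missing-description patterns. This is a minimal sketch, assuming a table with `es-desc` and `en-desc` columns as named in this notebook; `missing_description_summary` and the toy rows are hypothetical.

```python
import pandas as pd


def missing_description_summary(df):
    """Count codes by which of the two descriptions is missing."""
    es_missing = df["es-desc"].isnull()
    en_missing = df["en-desc"].isnull()
    return {
        "both_missing": int((es_missing & en_missing).sum()),
        "only_en_missing": int((~es_missing & en_missing).sum()),
        "only_es_missing": int((es_missing & ~en_missing).sum()),
    }


# Toy table (illustrative values only).
toy_valid = pd.DataFrame(
    {
        "code": ["c1", "c2", "c3", "c4"],
        "es-desc": ["desc", None, "desc", "desc"],
        "en-desc": ["desc", None, None, "desc"],
    }
)
summary = missing_description_summary(toy_valid)
print(summary)
```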

Total number of valid codes:

In [122]:
valid_p_codes = set(codiesp_p_valid["code"])

In [120]:
len(valid_p_codes)

87170

ALL codes contained in train and dev sets are present in the valid set:

In [123]:
lower_valid_p_codes = {s.lower() for s in valid_p_codes}

In [124]:
len(p_codes - lower_valid_p_codes)

0

We also explore the character-length distribution of the valid codes:

In [125]:
df_lower_valid_p_codes = pd.DataFrame({'code': list(lower_valid_p_codes)})

In [126]:
df_lower_valid_p_codes["n_char"] = df_lower_valid_p_codes["code"].apply(lambda x: len(x))

In [127]:
df_lower_valid_p_codes.shape

(87170, 2)

In [128]:
df_lower_valid_p_codes.head()

Unnamed: 0,code,n_char
0,3e09729,7
1,f07d1yz,7
2,dhy6fzz,7
3,0dpw47z,7
4,bp2a1zz,7


In [129]:
df_lower_valid_p_codes["n_char"].value_counts(normalize=False)

7    75789
4    11381
Name: n_char, dtype: int64

In [130]:
df_lower_valid_p_codes["n_char"].value_counts(normalize=True)

7    0.869439
4    0.130561
Name: n_char, dtype: float64

As previously reported for the codes in CodiEsp, the character length of valid CIE-Procedimiento codes is either 4 or 7.

## Abstracts with CIE codes

Now, we perform an exploratory analysis of an additional dataset [available from CodiEsp](https://temu.bsc.es/codiesp/index.php/category/data/): a set of abstracts labeled with CIE-10 Diagnóstico and Procedimiento codes. Our main goal is to expand the CodiEsp train and development corpora using this additional corpus.

In [132]:
%%time

additional_ank = pd.read_table("../datasets/abstractsWithCIE10_v2/table_codes_ankushset_v2.tsv", sep='\t', header=None)

CPU times: user 226 ms, sys: 24.2 ms, total: 251 ms
Wall time: 250 ms


In [133]:
additional_ank.columns = ["doc", "type", "code", "word"]

In [134]:
additional_ank.head()

Unnamed: 0,doc,type,code,word
0,biblio-994981,DIAGNOSTICO,r68.82,Libido
1,biblio-994981,PROCEDIMIENTO,gzfzzzz,Hipnosis
2,biblio-1008268,DIAGNOSTICO,f91.9,Trastornos Mentales
3,biblio-1008268,DIAGNOSTICO,f99,Trastornos Mentales
4,biblio-1008268,DIAGNOSTICO,f99-f99,Trastornos Mentales


In [135]:
additional_ank.shape

(442419, 4)

In [136]:
# Samples type distribution
additional_ank["type"].value_counts()

DIAGNOSTICO      440068
PROCEDIMIENTO      2351
Name: type, dtype: int64

In [137]:
additional_ank["type"].value_counts(normalize=True)

DIAGNOSTICO      0.994686
PROCEDIMIENTO    0.005314
Name: type, dtype: float64

### CIE-10 Diagnóstico

We select the abstracts annotated with CIE-Diagnóstico codes:

In [138]:
additional_ank_d = additional_ank[additional_ank["type"]=="DIAGNOSTICO"]

In [139]:
additional_ank_d.shape

(440068, 4)

As previously performed for the CodiEsp corpus, we start by analyzing the char-length of the CIE codes contained in the additional abstracts:

In [140]:
additional_ank_d_codes = pd.DataFrame({'code': list(set(additional_ank_d["code"]))})

In [141]:
additional_ank_d_codes.shape

(3093, 1)

In [142]:
additional_ank_d_codes["n_char"] = additional_ank_d_codes["code"].apply(lambda x: len(x))

In [143]:
additional_ank_d_codes.head()

Unnamed: 0,code,n_char
0,k00.5,5
1,q92.6,5
2,f68.a,5
3,n17.0,5
4,m89.30,6


In [144]:
additional_ank_d_codes["n_char"].value_counts(normalize=False)

5    1819
6     668
3     437
7     144
1      24
8       1
Name: n_char, dtype: int64

In [145]:
# Number of generic codes
sum(additional_ank_d_codes["code"].apply(lambda x: '-' in x))

50

In [146]:
# All generic codes are 7 characters long
all(additional_ank_d_codes[additional_ank_d_codes["code"].apply(lambda x: '-' in x)]["n_char"] == 7)

True

In [147]:
# Generic codes
additional_ank_d_codes[additional_ank_d_codes["code"].apply(lambda x: '-' in x)]

Unnamed: 0,code,n_char
382,b35-b49,7
432,f30-f39,7
569,g00-g99,7
587,q65-q79,7
596,j00-j99,7
608,n70-n77,7
620,s00-s09,7
630,b65-b83,7
632,x71-x83,7
656,e08-e13,7


In [148]:
# Codes 8-char long
additional_ank_d_codes[additional_ank_d_codes["n_char"] == 8]

Unnamed: 0,code,n_char
1350,o41.1290,8


In [149]:
# Codes 1-char long
additional_ank_d_codes[additional_ank_d_codes["n_char"] == 1]

Unnamed: 0,code,n_char
60,1,1
202,2,1
313,e,1
412,g,1
513,5,1
675,.,1
770,9,1
929,a,1
944,f,1
1008,p,1


In [150]:
# Only 3/1-long and generic codes do not contain the '.' character
additional_ank_d_codes[additional_ank_d_codes["code"].apply(lambda x: '.' not in x)]["n_char"].value_counts()

3    437
7     50
1     23
Name: n_char, dtype: int64

In [151]:
additional_ank_d_codes[additional_ank_d_codes["code"].apply(lambda x: '.' not in x) & 
    additional_ank_d_codes["code"].apply(lambda x: '-' in x)]["n_char"].value_counts()

7    50
Name: n_char, dtype: int64
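
The unusual entries found so far fall into two groups: generic range codes and single-character fragments. The sketch below labels a code accordingly. The regular expression is an assumption based only on the examples shown above (a letter and two digits on each side of '-'), and `code_kind` is a hypothetical helper. A "regular" label only means the code matches neither pattern, not that it is a valid CIE-10 code.

```python
import re

# Assumed shape of a generic range code, e.g. "b35-b49".
GENERIC_RANGE = re.compile(r"^[a-z]\d{2}-[a-z]\d{2}$")


def code_kind(code):
    """Label a code as a generic range, a one-character fragment, or regular."""
    if GENERIC_RANGE.match(code):
        return "generic_range"
    if len(code) == 1:
        return "fragment"
    return "regular"


# Toy codes modelled on the outputs above.
toy_codes = ["b35-b49", "f99", ".", "e", "r68.82", "f99-f99"]
kinds = {code: code_kind(code) for code in toy_codes}
print(kinds)
```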

To make the additional abstracts corpus compatible with CodiEsp-Diagnóstico corpus, samples with invalid codes should be removed from the dataset.

In [156]:
valid_filter = additional_ank_d["code"].isin(set(additional_ank_d_codes["code"]) & lower_valid_d_codes)

In [157]:
# Number of samples with invalid codes
additional_ank_d.shape[0] - sum(valid_filter)

36212

In [165]:
comp_additional_ank_d = additional_ank_d[valid_filter][["doc", "code", "word"]] 

In [166]:
comp_additional_ank_d.shape

(403856, 3)

In [167]:
# No code is expected to be associated with the same document multiple times
sum(comp_additional_ank_d[["doc", "code"]].duplicated())

0

#### Expand CodiEsp with additional abstracts

We propose three ways of expanding the CodiEsp train and development corpora using this additional corpus. The first adds all additional abstracts to the train corpus. The second adds only the abstracts labeled with CIE codes present in the CodiEsp train or development corpus. The third adds only the abstracts labeled with CIE codes present in the CodiEsp train corpus. Each of these alternatives is analyzed in the next subsections.
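
The three alternatives differ only in which rows of the additional corpus are kept, which the sketch below expresses as a single function. The strategy names (`"all"`, `"train_dev_codes"`, `"train_codes"`) and `expand_train` are hypothetical labels chosen for illustration, and the frames are toy data with `document` and `code` columns.

```python
import pandas as pd


def expand_train(train, dev, extra, strategy):
    """Return train plus the rows of `extra` selected by `strategy`."""
    if strategy == "all":
        keep = extra
    elif strategy == "train_dev_codes":
        allowed = set(train["code"]) | set(dev["code"])
        keep = extra[extra["code"].isin(allowed)]
    elif strategy == "train_codes":
        keep = extra[extra["code"].isin(set(train["code"]))]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return pd.concat([train, keep[["document", "code"]]], ignore_index=True)


# Toy frames (illustrative values only).
toy_train = pd.DataFrame({"document": ["d1", "d1"], "code": ["r69", "r55"]})
toy_dev = pd.DataFrame({"document": ["d2"], "code": ["e11.9"]})
toy_extra = pd.DataFrame(
    {"document": ["a1", "a2", "a3"], "code": ["r69", "e11.9", "k65.9"]}
)
for name in ("all", "train_dev_codes", "train_codes"):
    print(name, len(expand_train(toy_train, toy_dev, toy_extra, name)))
```

Each stricter strategy adds fewer rows, so the expanded train size shrinks from the first alternative to the third.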

In [168]:
comp_additional_ank_d.columns

Index(['doc', 'code', 'word'], dtype='object')

In [169]:
comp_additional_ank_d.columns=["document", "code", "word"]

In [170]:
# Unique CIE-Diagnóstico codes in Additional abstracts
comp_additional_ank_d_codes = set(comp_additional_ank_d["code"])
len(comp_additional_ank_d_codes)

2984

In [171]:
# Number of unique CIE-Diagnóstico codes in train corpus that are also contained in additional abstracts
len(train_d_codes & comp_additional_ank_d_codes)

615

In [172]:
# Number of unique unseen CIE-Diagnóstico codes in dev set that are also contained in additional abstracts
len(set(dev_d_codes - train_d_codes) & comp_additional_ank_d_codes)

118

In [173]:
# Fraction of unique CIE-Diagnóstico codes in additional abstracts that correspond to CodiEsp (train and dev) codes
len(d_codes & comp_additional_ank_d_codes)/len(comp_additional_ank_d_codes)

0.24564343163538874

In [174]:
# Fraction of CIE-Diagnóstico codes samples in additional abstracts that correspond to CodiEsp (train and dev) codes
sum(comp_additional_ank_d["code"].isin(d_codes & comp_additional_ank_d_codes))/comp_additional_ank_d.shape[0]

0.39779525375381325

##### All abstracts

All additional abstracts are added to the CodiEsp train corpus.

We first save the table of all additional abstracts for later use:

In [175]:
comp_additional_ank_d.shape

(403856, 3)

In [176]:
comp_additional_ank_d.head()

Unnamed: 0,document,code,word
0,biblio-994981,r68.82,Libido
2,biblio-1008268,f91.9,Trastornos Mentales
3,biblio-1008268,f99,Trastornos Mentales
5,biblio-1008288,r09.02,Hipoxia
6,biblio-1010254,h91.9,Sordera


In [177]:
comp_additional_ank_d[["document", "code"]].isnull().values.any()

False

In [187]:
comp_additional_ank_d[["document", "code"]].to_csv(path_or_buf="../datasets/abstractsWithCIE10_v2/all_abstracts_table_valid_codes_D.tsv", 
                                                   sep="\t", header=False, index=False)

In [178]:
codiesp_d_train_all_add = pd.concat([codiesp_d_train, comp_additional_ank_d[["document", "code"]]])

In [179]:
codiesp_d_train_all_add.head()

Unnamed: 0,document,code
0,S0004-06142005000700014-1,n44.8
1,S0004-06142005000700014-1,z20.818
2,S0004-06142005000700014-1,r60.9
3,S0004-06142005000700014-1,r52
4,S0004-06142005000700014-1,a23.9


In [180]:
codiesp_d_train_all_add.tail()

Unnamed: 0,document,code
442414,biblio-1012794,c80
442415,biblio-1012794,c80.1
442416,biblio-1012794,d36.9
442417,biblio-1012794,d49
442418,biblio-977844,r09.02


In [181]:
codiesp_d_train_all_add.shape

(409495, 2)

In [182]:
all_add_train_d_codes = set(codiesp_d_train_all_add["code"])

The following tables describe the resulting expanded CodiEsp corpus:

In [183]:
# Ratio between new (after additional abstracts inclusion) and original unseen unique CIE-Diagnóstico codes in Dev
len(dev_d_codes - all_add_train_d_codes)/len(dev_d_codes - train_d_codes)

0.7236533957845434
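
This ratio can be reproduced from the counts reported in the tables of this section: 427 unseen dev codes with the original train corpus and 309 after adding all abstracts. The sketch below redoes the arithmetic and applies the rounded ratio to the approximate number of unseen test codes (379.5) from the original table; the variable names are illustrative only.

```python
orig_unseen_dev = 427  # dev codes absent from the original train corpus
new_unseen_dev = 309   # dev codes still absent after adding all abstracts

ratio = new_unseen_dev / orig_unseen_dev
rounded_ratio = round(ratio, 2)

orig_test_estimate = 379.5  # approximate unseen test codes in the original table
new_test_estimate = orig_test_estimate * rounded_ratio

print(ratio, rounded_ratio, new_test_estimate)
```

The estimate assumes that the test set benefits from the additional abstracts in the same proportion as the development set.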

In [184]:
# Approximate new ratio of unseen unique CIE-Diagnóstico codes in Test
new_unseen_ratio = 0.72

In [185]:
# Abstracts-CodiEsp-Diagnóstico codes table
all_d_tab = pd.DataFrame({col_names[0]: [len(set(codiesp_d_train_all_add.document)), 
                                       codiesp_d_train_all_add.shape[0], 
                                       codiesp_d_train_all_add.shape[0]/len(set(codiesp_d_train_all_add.document)), 
                                       len(all_add_train_d_codes), 
                                     codiesp_d_train_all_add.shape[0]/len(all_add_train_d_codes), 
                                       "-"], 
              col_names[1]: [len(set(codiesp_d_dev.document)), 
                             codiesp_d_dev.shape[0], 
                             codiesp_d_dev.shape[0]/len(set(codiesp_d_dev.document)), 
                             len(dev_d_codes), 
                             codiesp_d_dev.shape[0]/len(dev_d_codes), 
                             len(dev_d_codes - all_add_train_d_codes)],
              col_names[2]: ["~250", None, None, None, None, 
                             "~" + str((total_uniq_codes - len(d_codes) - len(p_codes))*d_approx*new_unseen_ratio)]}, 
                       index=row_names)
all_d_tab = all_d_tab.reindex(columns=col_names)
all_d_tab

Unnamed: 0,Training,Development,Test
Documents,170620,250.0,~250
Total CIE codes,409495,2677.0,
Avg CIE codes per doc.,2.40004,10.708,
Unique CIE codes,4136,1158.0,
Avg samples (docs) per CIE code,99.0075,2.311744,
Unique unseen CIE codes,-,309.0,~273.24
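
The first five rows of the table are simple counts and ratios over the document-code pairs. For the training column above, 409495 / 170620 ≈ 2.40004 codes per document and 409495 / 4136 ≈ 99.0075 samples per code. A minimal pure-Python sketch of the same computation follows; the helper `corpus_stats` and the toy pairs are hypothetical.

```python
def corpus_stats(pairs):
    """Summarise a list of (document, code) annotation pairs."""
    documents = {doc for doc, _ in pairs}
    codes = {code for _, code in pairs}
    total = len(pairs)
    return {
        "Documents": len(documents),
        "Total CIE codes": total,
        "Avg CIE codes per doc.": total / len(documents),
        "Unique CIE codes": len(codes),
        "Avg samples (docs) per CIE code": total / len(codes),
    }


# Hypothetical toy data
pairs = [("x1", "r52"), ("x1", "f99"), ("x2", "r52"), ("x3", "r52")]
stats = corpus_stats(pairs)
for name, value in stats.items():
    print(name, value)
```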


In [186]:
# Original table, when no abstract is added to the train corpus
d_tab

Unnamed: 0,Training,Development,Test
Documents,500,250.0,~250
Total CIE codes,5639,2677.0,
Avg CIE codes per doc.,11.278,10.708,
Unique CIE codes,1767,1158.0,
Avg samples (docs) per CIE code,3.19128,2.311744,
Unique unseen CIE codes,-,427.0,~379.5


In [187]:
# Samples per CIE-Diagnóstico code distribution
all_d_sample_code = pd.DataFrame({col_names_sample[0]: codiesp_d_train_all_add["code"].value_counts().describe(),
                              col_names_sample[1]: codiesp_d_dev["code"].value_counts().describe(),
                              col_names_sample[2]: pd.concat([codiesp_d_train_all_add["code"], codiesp_d_dev["code"]]).value_counts().describe()})
all_d_sample_code = all_d_sample_code.reindex(columns=col_names_sample)
all_d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,4136.0,1158.0,4445.0
mean,99.007495,2.311744,92.727109
std,259.122685,3.770245,251.729374
min,1.0,1.0,1.0
25%,3.0,1.0,2.0
50%,22.0,1.0,18.0
75%,86.0,2.0,79.0
max,4768.0,51.0,4807.0
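
Each column summarises how many samples every code has. The `count` row is the number of unique codes, so the Train+Dev count equals the train count plus the unseen dev codes (4136 + 309 = 4445 here, and 1767 + 427 = 2194 for the original corpus). The sketch below reproduces a few of these rows with the standard library only; `samples_per_code_summary` and the toy code lists are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def samples_per_code_summary(codes):
    """Summarise the number of samples per code for a list of code annotations."""
    counts = list(Counter(codes).values())
    return {
        "count": len(counts),  # number of unique codes
        "mean": mean(counts),
        "min": min(counts),
        "50%": median(counts),
        "max": max(counts),
    }


# Hypothetical toy data: one entry per annotation
train_col = ["r52", "r52", "r52", "f99", "a90"]
dev_col = ["r52", "c80"]

train_summary = samples_per_code_summary(train_col)
train_dev_summary = samples_per_code_summary(train_col + dev_col)
print(train_summary)
print(train_dev_summary)
```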


In [188]:
# Original samples per CIE-Diagnóstico code distribution
d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,1767.0,1158.0,2194.0
mean,3.191285,2.311744,3.790337
std,6.921389,3.770245,9.123652
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,3.0,2.0,3.0
max,112.0,51.0,163.0


##### Abstracts with CodiEsp train or dev codes

Only the additional abstracts labeled with CIE codes present in CodiEsp train or development corpus are added.

We first save the table of additional abstracts labeled with CodiEsp train or dev codes for later use:

In [200]:
comp_additional_ank_d_train_dev = comp_additional_ank_d[comp_additional_ank_d["code"].isin(d_codes)][["document", "code"]]

In [201]:
comp_additional_ank_d_train_dev.shape

(160652, 2)

In [202]:
comp_additional_ank_d_train_dev.head()

Unnamed: 0,document,code
3,biblio-1008268,f99
5,biblio-1008288,r09.02
8,biblio-1008344,a90
10,biblio-1008411,e55.9
12,biblio-1008711,i49.9


In [203]:
comp_additional_ank_d_train_dev.isnull().values.any()

False

In [204]:
comp_additional_ank_d_train_dev.to_csv(path_or_buf="../datasets/abstractsWithCIE10_v2/train_dev_abstracts_table_valid_codes_D.tsv", 
                                                   sep="\t", header=False, index=False)

In [205]:
codiesp_d_train_add_train_dev = pd.concat([codiesp_d_train, 
    comp_additional_ank_d[comp_additional_ank_d["code"].isin(d_codes & comp_additional_ank_d_codes)][["document", "code"]]]) 

In [206]:
codiesp_d_train_add_train_dev.head()

Unnamed: 0,document,code
0,S0004-06142005000700014-1,n44.8
1,S0004-06142005000700014-1,z20.818
2,S0004-06142005000700014-1,r60.9
3,S0004-06142005000700014-1,r52
4,S0004-06142005000700014-1,a23.9


In [207]:
codiesp_d_train_add_train_dev.tail()

Unnamed: 0,document,code
442403,biblio-1015950,r13.10
442409,biblio-1015955,g54.6
442412,biblio-1012795,q90.9
442415,biblio-1012794,c80.1
442418,biblio-977844,r09.02


In [208]:
codiesp_d_train_add_train_dev.shape

(166291, 2)

The following tables describe the resulting expanded CodiEsp corpus:

In [209]:
train_dev_add_train_d_codes = set(codiesp_d_train_add_train_dev["code"])

In [210]:
# Abstracts-CodiEsp-Diagnóstico codes table
train_dev_add_d_tab = pd.DataFrame({col_names[0]: [len(set(codiesp_d_train_add_train_dev.document)), 
                                       codiesp_d_train_add_train_dev.shape[0], 
                                       codiesp_d_train_add_train_dev.shape[0]/len(set(codiesp_d_train_add_train_dev.document)), 
                                       len(train_dev_add_train_d_codes), 
                                     codiesp_d_train_add_train_dev.shape[0]/len(train_dev_add_train_d_codes), 
                                       "-"], 
              col_names[1]: [len(set(codiesp_d_dev.document)), 
                             codiesp_d_dev.shape[0], 
                             codiesp_d_dev.shape[0]/len(set(codiesp_d_dev.document)), 
                             len(dev_d_codes), 
                             codiesp_d_dev.shape[0]/len(dev_d_codes), 
                             len(dev_d_codes - train_dev_add_train_d_codes)],
              col_names[2]: ["~250", None, None, None, None, 
                             "~" + str((total_uniq_codes - len(d_codes) - len(p_codes))*d_approx)]}, 
                       index=row_names)
train_dev_add_d_tab = train_dev_add_d_tab.reindex(columns=col_names)
train_dev_add_d_tab

Unnamed: 0,Training,Development,Test
Documents,115957,250.0,~250
Total CIE codes,166291,2677.0,
Avg CIE codes per doc.,1.43407,10.708,
Unique CIE codes,1885,1158.0,
Avg samples (docs) per CIE code,88.218,2.311744,
Unique unseen CIE codes,-,309.0,~379.5


In [211]:
# Original table, when no abstract is added to the train corpus
d_tab

Unnamed: 0,Training,Development,Test
Documents,500,250.0,~250
Total CIE codes,5639,2677.0,
Avg CIE codes per doc.,11.278,10.708,
Unique CIE codes,1767,1158.0,
Avg samples (docs) per CIE code,3.19128,2.311744,
Unique unseen CIE codes,-,427.0,~379.5


In [212]:
# Samples per CIE-Diagnóstico code distribution
train_dev_add_d_sample_code = pd.DataFrame({col_names_sample[0]: codiesp_d_train_add_train_dev["code"].value_counts().describe(),
                              col_names_sample[1]: codiesp_d_dev["code"].value_counts().describe(),
                              col_names_sample[2]: pd.concat([codiesp_d_train_add_train_dev["code"], codiesp_d_dev["code"]]).value_counts().describe()})
train_dev_add_d_sample_code = train_dev_add_d_sample_code.reindex(columns=col_names_sample)
train_dev_add_d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,1885.0,1158.0,2194.0
mean,88.218037,2.311744,77.013674
std,272.060588,3.770245,255.129413
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,3.0,1.0,2.0
75%,64.0,2.0,44.0
max,4768.0,51.0,4807.0


In [213]:
# Original samples per CIE-Diagnóstico code distribution
d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,1767.0,1158.0,2194.0
mean,3.191285,2.311744,3.790337
std,6.921389,3.770245,9.123652
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,3.0,2.0,3.0
max,112.0,51.0,163.0


##### Abstracts with CodiEsp train codes

Only the additional abstracts labeled with CIE codes present in the CodiEsp train corpus are added. This option is only meaningful during model development, when the CodiEsp development corpus is used to evaluate the model; in the final evaluation phase, the train and development corpora will both be used to train the final model.

We first save the table of additional abstracts labeled with CodiEsp train codes for later use:

In [214]:
comp_additional_ank_d_train = comp_additional_ank_d[comp_additional_ank_d["code"].isin(train_d_codes)][["document", "code"]]

In [215]:
comp_additional_ank_d_train.shape

(148197, 2)

In [216]:
comp_additional_ank_d_train.head()

Unnamed: 0,document,code
3,biblio-1008268,f99
5,biblio-1008288,r09.02
8,biblio-1008344,a90
12,biblio-1008711,i49.9
13,biblio-1008711,i48.0


In [217]:
comp_additional_ank_d_train.isnull().values.any()

False

In [219]:
comp_additional_ank_d_train.to_csv(path_or_buf="../datasets/abstractsWithCIE10_v2/train_abstracts_table_valid_codes_D.tsv", 
                                                   sep="\t", header=False, index=False)

In [220]:
codiesp_d_train_add_train = pd.concat([codiesp_d_train, 
    comp_additional_ank_d[comp_additional_ank_d["code"].isin(train_d_codes & comp_additional_ank_d_codes)][["document", "code"]]]) 

In [221]:
codiesp_d_train_add_train.head()

Unnamed: 0,document,code
0,S0004-06142005000700014-1,n44.8
1,S0004-06142005000700014-1,z20.818
2,S0004-06142005000700014-1,r60.9
3,S0004-06142005000700014-1,r52
4,S0004-06142005000700014-1,a23.9


In [222]:
codiesp_d_train_add_train.tail()

Unnamed: 0,document,code
442390,biblio-975189,a90
442403,biblio-1015950,r13.10
442409,biblio-1015955,g54.6
442415,biblio-1012794,c80.1
442418,biblio-977844,r09.02


In [223]:
codiesp_d_train_add_train.shape

(153836, 2)

The following tables describe the resulting expanded CodiEsp corpus:

In [224]:
# Sanity check, True expected
set(codiesp_d_train_add_train["code"]) == train_d_codes

True
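
The check holds by construction: filtering the additional abstracts by the train codes cannot introduce a new code, so the set of unique train codes, and therefore the set of unseen dev codes, is unchanged. A small set-based sketch with hypothetical toy codes:

```python
# Hypothetical toy code sets
train_codes = {"r52", "a90"}
dev_codes = {"r52", "f99", "c80"}
extra_codes = {"r52", "f99", "h91.9"}

# Only codes already present in train survive the filter
added_codes = extra_codes & train_codes
expanded_train_codes = train_codes | added_codes

unseen_before = dev_codes - train_codes
unseen_after = dev_codes - expanded_train_codes

print(expanded_train_codes == train_codes, unseen_before == unseen_after)
```

This option therefore adds samples for codes that are already known, without widening the label set.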

In [225]:
# Abstracts-CodiEsp-Diagnóstico codes table
train_add_d_tab = pd.DataFrame({col_names[0]: [len(set(codiesp_d_train_add_train.document)), 
                                       codiesp_d_train_add_train.shape[0], 
                                       codiesp_d_train_add_train.shape[0]/len(set(codiesp_d_train_add_train.document)), 
                                       len(train_d_codes), 
                                     codiesp_d_train_add_train.shape[0]/len(train_d_codes), 
                                       "-"], 
              col_names[1]: [len(set(codiesp_d_dev.document)), 
                             codiesp_d_dev.shape[0], 
                             codiesp_d_dev.shape[0]/len(set(codiesp_d_dev.document)), 
                             len(dev_d_codes), 
                             codiesp_d_dev.shape[0]/len(dev_d_codes), 
                             len(dev_d_codes - train_d_codes)],
              col_names[2]: ["~250", None, None, None, None, 
                             "~" + str((total_uniq_codes - len(d_codes) - len(p_codes))*d_approx)]}, 
                       index=row_names)
train_add_d_tab = train_add_d_tab.reindex(columns=col_names)
train_add_d_tab

Unnamed: 0,Training,Development,Test
Documents,108529,250.0,~250
Total CIE codes,153836,2677.0,
Avg CIE codes per doc.,1.41746,10.708,
Unique CIE codes,1767,1158.0,
Avg samples (docs) per CIE code,87.0606,2.311744,
Unique unseen CIE codes,-,427.0,~379.5


In [226]:
# Original table, when no abstract is added to the train corpus
d_tab

Unnamed: 0,Training,Development,Test
Documents,500,250.0,~250
Total CIE codes,5639,2677.0,
Avg CIE codes per doc.,11.278,10.708,
Unique CIE codes,1767,1158.0,
Avg samples (docs) per CIE code,3.19128,2.311744,
Unique unseen CIE codes,-,427.0,~379.5


In [227]:
# Samples per CIE-Diagnóstico code distribution
train_add_d_sample_code = pd.DataFrame({col_names_sample[0]: codiesp_d_train_add_train["code"].value_counts().describe(),
                              col_names_sample[1]: codiesp_d_dev["code"].value_counts().describe(),
                              col_names_sample[2]: pd.concat([codiesp_d_train_add_train["code"], codiesp_d_dev["code"]]).value_counts().describe()})
train_add_d_sample_code = train_add_d_sample_code.reindex(columns=col_names_sample)
train_add_d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,1767.0,1158.0,2194.0
mean,87.060555,2.311744,71.336828
std,278.515095,3.770245,253.410353
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,2.0,1.0,2.0
75%,53.5,2.0,24.0
max,4768.0,51.0,4807.0


In [228]:
# Original samples per CIE-Diagnóstico code distribution
d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,1767.0,1158.0,2194.0
mean,3.191285,2.311744,3.790337
std,6.921389,3.770245,9.123652
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,3.0,2.0,3.0
max,112.0,51.0,163.0


### CIE-10 Procedimiento

We select the abstracts annotated with CIE-Procedimiento codes:

In [229]:
additional_ank_p = additional_ank[additional_ank["type"]=="PROCEDIMIENTO"]

In [230]:
additional_ank_p.shape

(2351, 4)

As previously done for the CodiEsp corpus, we start by analyzing the character length of the CIE codes contained in the additional abstracts:

In [231]:
additional_ank_p_codes = pd.DataFrame({'code': list(set(additional_ank_p["code"]))})

In [232]:
additional_ank_p_codes.shape

(12, 1)

In [233]:
additional_ank_p_codes["n_char"] = additional_ank_p_codes["code"].str.len()

In [234]:
additional_ank_p_codes.head()

Unnamed: 0,code,n_char
0,f02zfzz,7
1,gz2zzzz,7
2,gz61zzz,7
3,8e0zxy4,7
4,gzc9zzz,7


In [235]:
additional_ank_p_codes["n_char"].value_counts(normalize=False)

7    12
Name: n_char, dtype: int64

The additional abstracts corpus is already compatible with the CodiEsp-Procedimiento corpus. Moreover, all codes contained in the additional abstracts are present in the valid set, so none of them needs to be removed:

In [236]:
len(set(additional_ank_p_codes["code"]) - lower_valid_p_codes)

0
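
The two checks above (code length and membership in the valid set) can be combined into one helper. This is a pure-Python sketch under stated assumptions: `check_codes`, the 7-character default, and the toy `valid` set are hypothetical stand-ins for the notebook's `lower_valid_p_codes`.

```python
def check_codes(codes, valid_codes, expected_length=7):
    """Return the unique codes with an unexpected length and those not in the valid set."""
    unique = set(codes)
    wrong_length = {code for code in unique if len(code) != expected_length}
    not_valid = unique - set(valid_codes)
    return wrong_length, not_valid


# Hypothetical toy data
valid = {"f02zfzz", "gz2zzzz", "gzhzzzz"}
observed = ["f02zfzz", "gzhzzzz", "gzhzzzz", "bn20"]

wrong_length, not_valid = check_codes(observed, valid)
print(wrong_length, not_valid)
```

Two empty sets correspond to the situation reported above, where no code has to be dropped.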

In [237]:
# Remove type column
additional_ank_p = additional_ank_p[['doc', 'code', 'word']]

In [238]:
# No code is expected to be associated with the same document multiple times
sum(additional_ank_p[["doc", "code"]].duplicated())

0
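
`sum(df.duplicated())` counts every repeated row beyond its first occurrence. The same quantity can be obtained with a `Counter` over the document-code pairs, which also shows which pairs are repeated; the helper `duplicated_pairs` and the toy data are hypothetical.

```python
from collections import Counter


def duplicated_pairs(pairs):
    """Return the (document, code) pairs that occur more than once, with their counts."""
    counts = Counter(pairs)
    return {pair: n for pair, n in counts.items() if n > 1}


# Hypothetical toy data with one repeated association
pairs = [("lil-1", "gzhzzzz"), ("lil-1", "gzhzzzz"), ("lil-2", "gzhzzzz")]

dups = duplicated_pairs(pairs)
# Number of rows beyond the first occurrence of each pair
n_extra_rows = sum(n - 1 for n in dups.values())
print(dups, n_extra_rows)
```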

#### Expand CodiEsp with additional abstracts

We propose three ways of expanding the CodiEsp train and development corpora with this additional corpus. In the first, all additional abstracts are added to the train corpus. In the second, only the abstracts labeled with CIE codes present in the CodiEsp train or development corpus are added. In the third, only the abstracts labeled with CIE codes present in the CodiEsp train corpus are added. Each alternative is analyzed in the following subsections.

In [239]:
additional_ank_p.columns

Index(['doc', 'code', 'word'], dtype='object')

In [240]:
additional_ank_p.columns=["document", "code", "word"]

In [241]:
# Unique CIE-Procedimiento codes in Additional abstracts
additional_ank_p_codes = set(additional_ank_p["code"])
len(additional_ank_p_codes)

12

In [242]:
# Number of unique CIE-Procedimiento codes in train corpus that are also contained in additional abstracts
len(train_p_codes & additional_ank_p_codes)

1

In [243]:
# Number of unique unseen CIE-Procedimiento codes in dev set that are also contained in additional abstracts
len(set(dev_p_codes - train_p_codes) & additional_ank_p_codes)

1

In [244]:
# Fraction of unique CIE-Procedimiento codes in additional abstracts that correspond to CodiEsp (train and dev) codes
len(p_codes & additional_ank_p_codes)/len(additional_ank_p_codes)

0.16666666666666666

In [245]:
# Fraction of CIE-Procedimiento codes samples in additional abstracts that correspond to CodiEsp (train and dev) codes
sum(additional_ank_p["code"].isin(p_codes & additional_ank_p_codes))/additional_ank_p.shape[0]

0.295618885580604

##### All abstracts

All additional abstracts are added to the CodiEsp train corpus.

In [246]:
codiesp_p_train_all_add = pd.concat([codiesp_p_train, additional_ank_p[["document", "code"]]])

In [247]:
codiesp_p_train_all_add.head()

Unnamed: 0,document,code
0,S0004-06142005000700014-1,bw03zzz
1,S0004-06142005000700014-1,3e02329
2,S0004-06142005000700014-1,bw40zzz
3,S0004-06142005000700014-1,bv44zzz
4,S0004-06142005000700014-1,bn20


In [248]:
codiesp_p_train_all_add.tail()

Unnamed: 0,document,code
441971,lil-472601,f02zfzz
442187,lil-499421,gzjzzzz
442228,lil-492262,gzhzzzz
442230,lil-492263,gzhzzzz
442235,lil-492265,gzhzzzz


In [249]:
codiesp_p_train_all_add.shape

(3901, 2)

In [250]:
all_add_train_p_codes = set(codiesp_p_train_all_add["code"])

The following tables describe the resulting expanded CodiEsp corpus:

In [251]:
# Ratio between new (after additional abstracts inclusion) and original unseen unique CIE-Procedimiento codes in Dev
len(dev_p_codes - all_add_train_p_codes)/len(dev_p_codes - train_p_codes)

0.9939024390243902

In [252]:
# Approximate new ratio of unseen unique CIE-Procedimiento codes in Test
new_unseen_ratio = 0.99

In [253]:
# Abstracts-CodiEsp-Procedimiento codes table
all_p_tab = pd.DataFrame({col_names[0]: [len(set(codiesp_p_train_all_add.document)), 
                                       codiesp_p_train_all_add.shape[0], 
                                       codiesp_p_train_all_add.shape[0]/len(set(codiesp_p_train_all_add.document)), 
                                       len(all_add_train_p_codes), 
                                     codiesp_p_train_all_add.shape[0]/len(all_add_train_p_codes), 
                                       "-"], 
              col_names[1]: [len(set(codiesp_p_dev.document)), 
                             codiesp_p_dev.shape[0], 
                             codiesp_p_dev.shape[0]/len(set(codiesp_p_dev.document)), 
                             len(dev_p_codes), 
                             codiesp_p_dev.shape[0]/len(dev_p_codes), 
                             len(dev_p_codes - all_add_train_p_codes)],
              col_names[2]: ["~" + str(c_approx*250), None, None, None, None, 
                             "~" + str((total_uniq_codes - len(d_codes) - len(p_codes))*(1-d_approx)*new_unseen_ratio)]}, 
                       index=row_names)
all_p_tab = all_p_tab.reindex(columns=col_names)
all_p_tab

Unnamed: 0,Training,Development,Test
Documents,2755,222.0,~220.0
Total CIE codes,3901,817.0,
Avg CIE codes per doc.,1.41597,3.68018,
Unique CIE codes,574,375.0,
Avg samples (docs) per CIE code,6.79617,2.178667,
Unique unseen CIE codes,-,163.0,~125.235


In [254]:
# Original table, when no abstract is added to the train corpus
p_tab

Unnamed: 0,Training,Development,Test
Documents,435,222.0,~220.0
Total CIE codes,1550,817.0,
Avg CIE codes per doc.,3.56322,3.68018,
Unique CIE codes,563,375.0,
Avg samples (docs) per CIE code,2.75311,2.178667,
Unique unseen CIE codes,-,164.0,~126.5


In [255]:
# Samples per CIE-Procedimiento code distribution
all_p_sample_code = pd.DataFrame({col_names_sample[0]: codiesp_p_train_all_add["code"].value_counts().describe(),
                              col_names_sample[1]: codiesp_p_dev["code"].value_counts().describe(),
                              col_names_sample[2]: pd.concat([codiesp_p_train_all_add["code"], codiesp_p_dev["code"]]).value_counts().describe()})
all_p_sample_code = all_p_sample_code.reindex(columns=col_names_sample)
all_p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,574.0,375.0,737.0
mean,6.796167,2.178667,6.401628
std,36.959984,3.415531,33.116008
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,596.0,40.0,596.0


In [256]:
# Original samples per CIE-Procedimiento code distribution
p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,563.0,375.0,727.0
mean,2.753108,2.178667,3.255846
std,5.603234,3.415531,7.580642
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,67.0,40.0,107.0


##### Abstracts with CodiEsp train or dev codes

Only the additional abstracts labeled with CIE codes present in CodiEsp train or development corpus are added.

In [257]:
codiesp_p_train_add_train_dev = pd.concat([codiesp_p_train, 
    additional_ank_p[additional_ank_p["code"].isin(p_codes & additional_ank_p_codes)][["document", "code"]]]) 

In [258]:
codiesp_p_train_add_train_dev.head()

Unnamed: 0,document,code
0,S0004-06142005000700014-1,bw03zzz
1,S0004-06142005000700014-1,3e02329
2,S0004-06142005000700014-1,bw40zzz
3,S0004-06142005000700014-1,bv44zzz
4,S0004-06142005000700014-1,bn20


In [259]:
codiesp_p_train_add_train_dev.tail()

Unnamed: 0,document,code
440750,ibc-95104,gzhzzzz
440898,ibc-62375,8e0zxy1
442228,lil-492262,gzhzzzz
442230,lil-492263,gzhzzzz
442235,lil-492265,gzhzzzz


In [260]:
codiesp_p_train_add_train_dev.shape

(2245, 2)

The following tables describe the resulting expanded CodiEsp corpus:

In [261]:
train_dev_add_train_p_codes = set(codiesp_p_train_add_train_dev["code"])

In [262]:
# Abstracts-CodiEsp-Procedimiento codes table
train_dev_add_p_tab = pd.DataFrame({col_names[0]: [len(set(codiesp_p_train_add_train_dev.document)), 
                                       codiesp_p_train_add_train_dev.shape[0], 
                                       codiesp_p_train_add_train_dev.shape[0]/len(set(codiesp_p_train_add_train_dev.document)), 
                                       len(train_dev_add_train_p_codes), 
                                     codiesp_p_train_add_train_dev.shape[0]/len(train_dev_add_train_p_codes), 
                                       "-"], 
              col_names[1]: [len(set(codiesp_p_dev.document)), 
                             codiesp_p_dev.shape[0], 
                             codiesp_p_dev.shape[0]/len(set(codiesp_p_dev.document)), 
                             len(dev_p_codes), 
                             codiesp_p_dev.shape[0]/len(dev_p_codes), 
                             len(dev_p_codes - train_dev_add_train_p_codes)],
              col_names[2]: ["~" + str(c_approx*250), None, None, None, None, 
                             "~" + str((total_uniq_codes - len(d_codes) - len(p_codes))*(1-d_approx))]}, 
                       index=row_names)
train_dev_add_p_tab = train_dev_add_p_tab.reindex(columns=col_names)
train_dev_add_p_tab

Unnamed: 0,Training,Development,Test
Documents,1130,222.0,~220.0
Total CIE codes,2245,817.0,
Avg CIE codes per doc.,1.98673,3.68018,
Unique CIE codes,564,375.0,
Avg samples (docs) per CIE code,3.9805,2.178667,
Unique unseen CIE codes,-,163.0,~126.5


In [263]:
# Original table, when no abstract is added to the train corpus
p_tab

Unnamed: 0,Training,Development,Test
Documents,435,222.0,~220.0
Total CIE codes,1550,817.0,
Avg CIE codes per doc.,3.56322,3.68018,
Unique CIE codes,563,375.0,
Avg samples (docs) per CIE code,2.75311,2.178667,
Unique unseen CIE codes,-,164.0,~126.5


In [264]:
# Samples per CIE-Procedimiento code distribution
train_dev_add_p_sample_code = pd.DataFrame({col_names_sample[0]: codiesp_p_train_add_train_dev["code"].value_counts().describe(),
                              col_names_sample[1]: codiesp_p_dev["code"].value_counts().describe(),
                              col_names_sample[2]: pd.concat([codiesp_p_train_add_train_dev["code"], codiesp_p_dev["code"]]).value_counts().describe()})
train_dev_add_p_sample_code = train_dev_add_p_sample_code.reindex(columns=col_names_sample)
train_dev_add_p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,564.0,375.0,727.0
mean,3.980496,2.178667,4.211829
std,21.549842,3.415531,19.833792
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,404.0,40.0,404.0


In [265]:
# Original samples per CIE-Procedimiento code distribution
p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,563.0,375.0,727.0
mean,2.753108,2.178667,3.255846
std,5.603234,3.415531,7.580642
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,67.0,40.0,107.0


##### Abstracts with CodiEsp train codes

Only the additional abstracts labeled with CIE codes present in the CodiEsp train corpus are added. This option is only meaningful during model development, when the CodiEsp development corpus is used to evaluate the model; in the final evaluation phase, the train and development corpora will both be used to train the final model.

In [266]:
codiesp_p_train_add_train = pd.concat([codiesp_p_train, 
    additional_ank_p[additional_ank_p["code"].isin(train_p_codes & additional_ank_p_codes)][["document", "code"]]]) 

In [267]:
codiesp_p_train_add_train.head()

Unnamed: 0,document,code
0,S0004-06142005000700014-1,bw03zzz
1,S0004-06142005000700014-1,3e02329
2,S0004-06142005000700014-1,bw40zzz
3,S0004-06142005000700014-1,bv44zzz
4,S0004-06142005000700014-1,bn20


In [268]:
codiesp_p_train_add_train.tail()

Unnamed: 0,document,code
440547,biblio-963399,gzhzzzz
440750,ibc-95104,gzhzzzz
442228,lil-492262,gzhzzzz
442230,lil-492263,gzhzzzz
442235,lil-492265,gzhzzzz


In [269]:
codiesp_p_train_add_train.shape

(1953, 2)

The following tables describe the resulting expanded CodiEsp corpus:

In [270]:
# Sanity check, True expected
set(codiesp_p_train_add_train["code"]) == train_p_codes

True

In [271]:
# Abstracts-CodiEsp-Procedimiento codes table
train_add_p_tab = pd.DataFrame({col_names[0]: [len(set(codiesp_p_train_add_train.document)), 
                                       codiesp_p_train_add_train.shape[0], 
                                       codiesp_p_train_add_train.shape[0]/len(set(codiesp_p_train_add_train.document)), 
                                       len(train_p_codes), 
                                     codiesp_p_train_add_train.shape[0]/len(train_p_codes), 
                                       "-"], 
              col_names[1]: [len(set(codiesp_p_dev.document)), 
                             codiesp_p_dev.shape[0], 
                             codiesp_p_dev.shape[0]/len(set(codiesp_p_dev.document)), 
                             len(dev_p_codes), 
                             codiesp_p_dev.shape[0]/len(dev_p_codes), 
                             len(dev_p_codes - train_p_codes)],
              col_names[2]: ["~" + str(c_approx*250), None, None, None, None, 
                             "~" + str((total_uniq_codes - len(d_codes) - len(p_codes))*(1-d_approx))]}, 
                       index=row_names)
train_add_p_tab = train_add_p_tab.reindex(columns=col_names)
train_add_p_tab

Unnamed: 0,Training,Development,Test
Documents,838,222.0,~220.0
Total CIE codes,1953,817.0,
Avg CIE codes per doc.,2.33055,3.68018,
Unique CIE codes,563,375.0,
Avg samples (docs) per CIE code,3.46892,2.178667,
Unique unseen CIE codes,-,164.0,~126.5


In [272]:
# Original table, when no abstract is added to the train corpus
p_tab

Unnamed: 0,Training,Development,Test
Documents,435,222.0,~220.0
Total CIE codes,1550,817.0,
Avg CIE codes per doc.,3.56322,3.68018,
Unique CIE codes,563,375.0,
Avg samples (docs) per CIE code,2.75311,2.178667,
Unique unseen CIE codes,-,164.0,~126.5


In [273]:
# Samples per CIE-Procedimiento code distribution
train_add_p_sample_code = pd.DataFrame({col_names_sample[0]: codiesp_p_train_add_train["code"].value_counts().describe(),
                              col_names_sample[1]: codiesp_p_dev["code"].value_counts().describe(),
                              col_names_sample[2]: pd.concat([codiesp_p_train_add_train["code"], codiesp_p_dev["code"]]).value_counts().describe()})
train_add_p_sample_code = train_add_p_sample_code.reindex(columns=col_names_sample)
train_add_p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,563.0,375.0,727.0
mean,3.468917,2.178667,3.810179
std,17.814395,3.415531,16.68406
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,404.0,40.0,404.0


In [274]:
# Original samples per CIE-Procedimiento code distribution
p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,563.0,375.0,727.0
mean,2.753108,2.178667,3.255846
std,5.603234,3.415531,7.580642
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,67.0,40.0,107.0


### Expanded corpus summary

This section describes the expanded CodiEsp corpora that result from each way of adding the additional abstracts to the train corpus.

#### Diagnóstico

Corpus description tables:

In [275]:
# All abstracts are added
all_d_tab

Unnamed: 0,Training,Development,Test
Documents,170620,250.0,~250
Total CIE codes,409495,2677.0,
Avg CIE codes per doc.,2.40004,10.708,
Unique CIE codes,4136,1158.0,
Avg samples (docs) per CIE code,99.0075,2.311744,
Unique unseen CIE codes,-,309.0,~273.24


In [277]:
# Abstracts with CodiEsp train or dev codes are added
train_dev_add_d_tab

Unnamed: 0,Training,Development,Test
Documents,115957,250.0,~250
Total CIE codes,166291,2677.0,
Avg CIE codes per doc.,1.43407,10.708,
Unique CIE codes,1885,1158.0,
Avg samples (docs) per CIE code,88.218,2.311744,
Unique unseen CIE codes,-,309.0,~379.5


In [278]:
# Abstracts with CodiEsp train codes are added
train_add_d_tab

Unnamed: 0,Training,Development,Test
Documents,108529,250.0,~250
Total CIE codes,153836,2677.0,
Avg CIE codes per doc.,1.41746,10.708,
Unique CIE codes,1767,1158.0,
Avg samples (docs) per CIE code,87.0606,2.311744,
Unique unseen CIE codes,-,427.0,~379.5


In [279]:
# Original table, when no abstract is added to the train corpus
d_tab

Unnamed: 0,Training,Development,Test
Documents,500,250.0,~250
Total CIE codes,5639,2677.0,
Avg CIE codes per doc.,11.278,10.708,
Unique CIE codes,1767,1158.0,
Avg samples (docs) per CIE code,3.19128,2.311744,
Unique unseen CIE codes,-,427.0,~379.5


Samples-per-code distributions:

In [281]:
# All abstracts are added
all_d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,4136.0,1158.0,4445.0
mean,99.007495,2.311744,92.727109
std,259.122685,3.770245,251.729374
min,1.0,1.0,1.0
25%,3.0,1.0,2.0
50%,22.0,1.0,18.0
75%,86.0,2.0,79.0
max,4768.0,51.0,4807.0


In [282]:
# Abstracts with CodiEsp train or dev codes are added
train_dev_add_d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,1885.0,1158.0,2194.0
mean,88.218037,2.311744,77.013674
std,272.060588,3.770245,255.129413
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,3.0,1.0,2.0
75%,64.0,2.0,44.0
max,4768.0,51.0,4807.0


In [283]:
# Abstracts with CodiEsp train codes are added
train_add_d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,1767.0,1158.0,2194.0
mean,87.060555,2.311744,71.336828
std,278.515095,3.770245,253.410353
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,2.0,1.0,2.0
75%,53.5,2.0,24.0
max,4768.0,51.0,4807.0


In [284]:
# Original samples per CIE-Diagnóstico code distribution
d_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,1767.0,1158.0,2194.0
mean,3.191285,2.311744,3.790337
std,6.921389,3.770245,9.123652
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,3.0,2.0,3.0
max,112.0,51.0,163.0


Expanding the CodiEsp-Diagnóstico train corpus with additional abstracts seems to be beneficial. Both the "all abstracts" approach (which yields a much greater number of unique CIE codes) and the "abstracts with CodiEsp codes" approaches should be applied and compared.

#### Procedimiento

Corpus description tables:

In [285]:
# All abstracts are added
all_p_tab

Unnamed: 0,Training,Development,Test
Documents,2755,222.0,~220.0
Total CIE codes,3901,817.0,
Avg CIE codes per doc.,1.41597,3.68018,
Unique CIE codes,574,375.0,
Avg samples (docs) per CIE code,6.79617,2.178667,
Unique unseen CIE codes,-,163.0,~125.235


In [286]:
# Abstracts with CodiEsp train or dev codes are added
train_dev_add_p_tab

Unnamed: 0,Training,Development,Test
Documents,1130,222.0,~220.0
Total CIE codes,2245,817.0,
Avg CIE codes per doc.,1.98673,3.68018,
Unique CIE codes,564,375.0,
Avg samples (docs) per CIE code,3.9805,2.178667,
Unique unseen CIE codes,-,163.0,~126.5


In [287]:
# Abstracts with CodiEsp train codes are added
train_add_p_tab

Unnamed: 0,Training,Development,Test
Documents,838,222.0,~220.0
Total CIE codes,1953,817.0,
Avg CIE codes per doc.,2.33055,3.68018,
Unique CIE codes,563,375.0,
Avg samples (docs) per CIE code,3.46892,2.178667,
Unique unseen CIE codes,-,164.0,~126.5


In [288]:
# Original table, when no abstract is added to the train corpus
p_tab

Unnamed: 0,Training,Development,Test
Documents,435,222.0,~220.0
Total CIE codes,1550,817.0,
Avg CIE codes per doc.,3.56322,3.68018,
Unique CIE codes,563,375.0,
Avg samples (docs) per CIE code,2.75311,2.178667,
Unique unseen CIE codes,-,164.0,~126.5
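
The rows of these description tables are related by simple arithmetic: for the original train corpus, 1550 / 435 ≈ 3.563 codes per document and 1550 / 563 ≈ 2.753 documents per code. A minimal sketch of how such rows could be derived from a frame of unique document-code pairs follows; the helper `corpus_summary` and the toy data are hypothetical, and the unseen-code count is assumed to be the set of development codes absent from train.

```python
import pandas as pd


def corpus_summary(df: pd.DataFrame) -> pd.Series:
    """Summary rows for a frame of unique (document, code) pairs."""
    n_docs = df["document"].nunique()
    n_pairs = len(df)  # one row per document-code association
    n_codes = df["code"].nunique()
    return pd.Series(
        {
            "Documents": n_docs,
            "Total CIE codes": n_pairs,
            "Avg CIE codes per doc.": n_pairs / n_docs,
            "Unique CIE codes": n_codes,
            "Avg samples (docs) per CIE code": n_pairs / n_codes,
        }
    )


# Toy data for illustration only (not taken from the corpus).
toy_train = pd.DataFrame({"document": ["d1", "d1", "d2"], "code": ["a", "b", "a"]})
toy_dev = pd.DataFrame({"document": ["d3", "d3"], "code": ["a", "c"]})

toy_tab = pd.DataFrame(
    {"Training": corpus_summary(toy_train), "Development": corpus_summary(toy_dev)}
)

# Assumption: "unseen" means development codes that never occur in train.
unseen_dev_codes = set(toy_dev["code"]) - set(toy_train["code"])

print(toy_tab)
print("Unique unseen CIE codes (dev):", len(unseen_dev_codes))
```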


Samples-per-code distributions:

In [289]:
# All abstracts are added
all_p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,574.0,375.0,737.0
mean,6.796167,2.178667,6.401628
std,36.959984,3.415531,33.116008
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,596.0,40.0,596.0


In [290]:
# Abstracts with CodiEsp train or dev codes are added
train_dev_add_p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,564.0,375.0,727.0
mean,3.980496,2.178667,4.211829
std,21.549842,3.415531,19.833792
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,404.0,40.0,404.0


In [291]:
# Abstracts with CodiEsp train codes are added
train_add_p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,563.0,375.0,727.0
mean,3.468917,2.178667,3.810179
std,17.814395,3.415531,16.68406
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,404.0,40.0,404.0


In [292]:
# Original samples per CIE-Procedimiento code distribution
p_sample_code

Unnamed: 0,Train,Dev,Train+Dev
count,563.0,375.0,727.0
mean,2.753108,2.178667,3.255846
std,5.603234,3.415531,7.580642
min,1.0,1.0,1.0
25%,1.0,1.0,1.0
50%,1.0,1.0,1.0
75%,2.0,2.0,2.0
max,67.0,40.0,107.0


In contrast with CodiEsp-Diagnóstico, expanding the CodiEsp-Procedimiento train corpus with additional abstracts does not seem to contribute significantly: the median number of training samples per code stays at 1 under every strategy, and the number of unique training codes barely changes (from 563 to at most 574). Nevertheless, both the "all abstracts" and the "abstracts with CodiEsp codes" approaches could still be applied and compared. Also, given the scarce number of samples, a transfer-learning approach, or any other strategy that uses the expanded CodiEsp-Diagnóstico corpus to also expand the CodiEsp-Procedimiento corpus, should be considered.
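
The contrast between the two subtasks can be made concrete with the Train-column statistics reported in the tables above. The short sketch below only re-uses those printed values (original corpus versus the "all abstracts" expansion); the dictionary layout and variable names are illustrative.

```python
# Train-column mean and median samples per code, copied from the tables above.
stats = {
    "Diagnóstico": {
        "original": {"mean": 3.191285, "median": 1.0},
        "all_abstracts": {"mean": 99.007495, "median": 22.0},
    },
    "Procedimiento": {
        "original": {"mean": 2.753108, "median": 1.0},
        "all_abstracts": {"mean": 6.796167, "median": 1.0},
    },
}

# Ratio of expanded to original mean, and the change in the median.
mean_gain = {}
median_change = {}
for task, s in stats.items():
    mean_gain[task] = s["all_abstracts"]["mean"] / s["original"]["mean"]
    median_change[task] = s["all_abstracts"]["median"] - s["original"]["median"]
    print(f"{task}: mean x{mean_gain[task]:.1f}, median {median_change[task]:+.0f}")
```

For Diagnóstico the mean grows roughly 31-fold and the median moves from 1 to 22, whereas for Procedimiento the mean grows about 2.5-fold and the median does not move, so most procedure codes remain represented by a single document even after expansion.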