## Evidence Inference Dataset Introduction

#### First, we import the preprocessor module, which provides the functions for loading the data

In [3]:
import os

from evidence_inference.preprocess.preprocessor import get_Xy, train_document_ids, test_document_ids, validation_document_ids, get_train_Xy

tr_ids, val_ids, te_ids = train_document_ids(), validation_document_ids(), test_document_ids()

#### We can inspect the type of the ids and what they contain (validation and test ids follow the same format). Note that each id maps to its article's XML file: the filename is 'PMC' + str(id) + '.nxml'.

In [4]:
print("The object is of type {}".format(type(tr_ids)))
print("The first 10 elements are: {}".format(list(tr_ids)[:10]))

The object is of type <class 'odict_keys'>
The first 10 elements are: [5741844, 3794450, 4759860, 5460737, 5054596, 4786378, 2430617, 5103135, 3410988, 4464926]
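To make the id-to-file mapping concrete, here is a minimal sketch. The helper names `article_filename` and `article_path`, and the `xml_dir` argument, are hypothetical and not part of the preprocessor; only the 'PMC' + str(id) + '.nxml' pattern comes from the note above.

```python
import os

def article_filename(article_id):
    # Pattern described above: 'PMC' + str(id) + '.nxml'
    return "PMC" + str(article_id) + ".nxml"

def article_path(article_id, xml_dir):
    # xml_dir is a hypothetical folder holding the .nxml files;
    # point it at wherever the articles live on your machine.
    return os.path.join(xml_dir, article_filename(article_id))

# The first training id printed above
print(article_filename(5741844))
```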


### Loading the training data

#### [Optional] We can either use a preset vocab list or no vocab list at all. Here, we will use a preset one.

In [5]:
vocab_f = os.path.join("./annotations", "vocab.txt")

#### Here we will grab the training data. To get the full article, we pass sections_of_interest=None. We also do not need to know where the sentences are split, so include_sentence_span_splits is False.

In [6]:
## NOTE: This may take a little while to load.
train_Xy, inference_vectorizer = get_train_Xy(tr_ids, sections_of_interest=None, vocabulary_file=vocab_f, include_sentence_span_splits = False)

Loaded 2351706 words from vocab file ./annotations\vocab.txt


#### Let's look at the data

In [7]:
print("The types of our outputs are: {} (trainXy), and {} (inference_vectorizer)\n".format(type(train_Xy), type(inference_vectorizer)))
print("trainXy's inner dimension is of type: {}".format(type(train_Xy[0])))

The types of our outputs are: <class 'list'> (trainXy), and <class 'evidence_inference.preprocess.preprocessor.SimpleInferenceVectorizer'> (inference_vectorizer)

trainXy's inner dimension is of type: <class 'dict'>


#### This is what the keys of a single instance in train_Xy look like

In [8]:
print("Let's look at trainXy's first element's keys:\n {}".format(train_Xy[0].keys()))

Let's look at trainXy's first element's keys:
 dict_keys(['article', 'I', 'C', 'O', 'a_id', 'p_id', 'y', 'sentence_span', 'evidence_spans'])


### Loading the validation and test sets

In [None]:
# Let's get the validation data, and the test data. They are of a similar format to the training data.
val_Xy  = get_Xy(val_ids, inference_vectorizer, sections_of_interest=None, include_sentence_span_splits = False)
test_Xy = get_Xy(te_ids, inference_vectorizer, sections_of_interest=None, include_sentence_span_splits = False)

### Extracting the X's from a (train/validation/test) data set 

#### This isn't exactly what we want for training data though... So, let's pull out some useful bits.

In [None]:
tr_data = [[inner[i] for inner in train_Xy] for i in ['article', 'I', 'C', 'O']]
print("Training is an array of length {}, w/ inner length {}, such that the ith element of all of the inner arrays are from the same prompt".format(len(tr_data), len(tr_data[0])))
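The reshaping in the cell above turns a list of per-prompt dicts into one list per field. Here is a small sketch of the same idea on toy data; the integer values in `toy_Xy` are made up for illustration and are not real token ids.

```python
# Toy stand-in for train_Xy: one dict per prompt (values are made up).
toy_Xy = [
    {"article": [11, 12, 13], "I": [1], "C": [2], "O": [3]},
    {"article": [21, 22], "I": [4], "C": [5], "O": [6]},
]

fields = ["article", "I", "C", "O"]

# Same comprehension as above: outer index = field, inner index = prompt.
columns = [[inst[f] for inst in toy_Xy] for f in fields]
articles, interventions, comparators, outcomes = columns

# All pieces of the first prompt sit at position 0 of each inner list.
first_prompt = [col[0] for col in columns]

# zip(*columns) regroups the columns back into one tuple per prompt.
regrouped = list(zip(*columns))

print(len(columns), len(columns[0]))
print(first_prompt)
```

Note that the field order in `fields` fixes the index of each inner list: 0 is the article, 1 the intervention, 2 the comparator, and 3 the outcome.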

#### Let's look at some training data... 

In [None]:
print("This is what an article looks like: {} ...".format(tr_data[0][0][:10]))
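# tr_data was built in the order ['article', 'I', 'C', 'O'],
# so index 1 is the intervention, 2 the comparator, and 3 the outcome.
print("This is what an intervention looks like: {}".format(tr_data[1][0]))
print("This is what a comparator looks like: {}".format(tr_data[2][0][:10]))
print("This is what an outcome looks like: {}".format(tr_data[3][0][:10]))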

### Extracting the Y's from a (train/validation/test) data set 

#### This is what the y looks like for 1 prompt.

In [None]:
print(train_Xy[5000]['y'])

#### Let's define this helper to get us the proper labels.

In [None]:
import random
import numpy as np
from scipy import stats

def _get_y_vec(y_dict):
    # +1 because raw labels are -1, 0, 1 -> 0, 1, 2
    # for indexing reasons that appear in the loss function
    # (cross-entropy loss wants the index of the highest value, and we index at 0)
    all_labels = [y_j[0] + 1 for y_j in y_dict]
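    # Majority vote across annotators. np.atleast_1d keeps this working whether
    # stats.mode returns an array (older SciPy) or a scalar (newer SciPy).
    y_collapsed = int(np.atleast_1d(stats.mode(all_labels)[0])[0])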
    y_vec = np.zeros(3)
    y_vec[y_collapsed] = 1.0
    return y_vec
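To see what the helper does without loading the dataset, here is a dependency-free sketch of the same idea: shift the raw labels from -1, 0, 1 to 0, 1, 2, take the majority vote, and one-hot encode the winner. The name `majority_one_hot` is hypothetical, and it returns a plain list rather than a NumPy array. Ties go to the smallest label, which is how `stats.mode` breaks ties as well.

```python
from collections import Counter

def majority_one_hot(raw_labels, num_classes=3):
    # Shift raw labels -1, 0, 1 to class indices 0, 1, 2.
    shifted = [label + 1 for label in raw_labels]
    counts = Counter(shifted)
    top = max(counts.values())
    # On a tie, pick the smallest class index.
    winner = min(lbl for lbl, c in counts.items() if c == top)
    one_hot = [0.0] * num_classes
    one_hot[winner] = 1.0
    return one_hot

# Three annotators, two of whom chose the raw label 1.
print(majority_one_hot([1, 1, -1]))
```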

#### Using our helper function to inspect the labels.

In [None]:
tr_labels = [_get_y_vec(inst['y']) for inst in train_Xy]

In [None]:
print("This is what our labels will look like: {}".format(tr_labels[0]))

### Extracting evidence spans from the (train/validation/test) data set 

#### It might also be helpful to work with the evidence spans (i.e. from above: 'Furthermore, the results ...') 

In [None]:
def _get_evidence_spans(y_dict, inference_vectorizer):
    res = []
    for arr in y_dict:
        res.append(inference_vectorizer.string_to_seq(arr[-1]))
    return res

tr_spans = [_get_evidence_spans(inst['y'], inference_vectorizer) for inst in train_Xy]
    

In [None]:
print("This is what an evidence span for a prompt will look like:\n{}".format(tr_spans[5000][0]))

### We have finished going over the essentials, but here are some extra features that may be helpful:

#### Looking at WHERE in the article the evidence span is given...

In [None]:
where_ev_spans = [inst['evidence_spans'] for inst in train_Xy]
print("For an arbitrary prompt, here is where the evidence spans are: {}".format(where_ev_spans[5000]))

#### Comparing our spans with the ground truth (at the time of this notebook, the match is only 87 percent accurate, ignoring some encoding issues).

In [None]:
(span_st, span_end) = list(where_ev_spans[5000])[0]
# Now let's compare
print("Here is what it looks like in the article:\n{}".format(tr_data[0][5000][span_st:span_end]))
print("Here is what the evidence span looks like:\n{}".format(tr_spans[5000][0]))
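If we want to check this agreement over many prompts instead of eyeballing one, a rough sketch could look like the following. It assumes the article and the evidence span are both token sequences and that (start, end) are token offsets with an exclusive end, as the slicing above suggests. The helper names and the toy data are hypothetical, and this is not necessarily how the 87 percent figure was measured.

```python
def span_matches(article_tokens, span, evidence_tokens):
    # Compare the article slice at (start, end) against the evidence tokens.
    start, end = span
    return list(article_tokens[start:end]) == list(evidence_tokens)

def agreement_rate(examples):
    # examples: iterable of (article_tokens, (start, end), evidence_tokens)
    examples = list(examples)
    if not examples:
        return 0.0
    hits = sum(span_matches(a, s, e) for a, s, e in examples)
    return hits / len(examples)

# Toy data (made up): the first span lines up, the second does not.
toy_examples = [
    ([4, 8, 15, 16, 23, 42], (1, 4), [8, 15, 16]),
    ([4, 8, 15, 16, 23, 42], (2, 5), [15, 16, 99]),
]
rate = agreement_rate(toy_examples)
print("Agreement rate: {:.2f}".format(rate))
```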