# Look at a sample

In [33]:
from datasets import load_dataset

dataset = load_dataset("covid_qa_deepset")

In [34]:
dataset

DatasetDict({
    train: Dataset({
        features: ['document_id', 'context', 'question', 'is_impossible', 'id', 'answers'],
        num_rows: 2019
    })
})

In [39]:
print(dataset['train'][0]['context'][:500])

Functional Genetic Variants in DC-SIGNR Are Associated with Mother-to-Child Transmission of HIV-1

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752805/

Boily-Larouche, Geneviève; Iscache, Anne-Laure; Zijenah, Lynn S.; Humphrey, Jean H.; Mouland, Andrew J.; Ward, Brian J.; Roger, Michel
2009-10-07
DOI:10.1371/journal.pone.0007211
License:cc-by

Abstract: BACKGROUND: Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide. Given that the C-type lectin recep


In [40]:
print(dataset['train'][0]['question'])

What is the main cause of HIV-1 infection in children?


In [41]:
print(dataset['train'][0]['answers'])

{'text': ['Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide.'], 'answer_start': [370]}
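
The `answers` field stores each answer as its text plus a character offset into `context`. A minimal sketch of a consistency check for that convention follows. The helper `answer_span_matches` and the toy record are hypothetical and not part of the dataset or the `datasets` library; they only illustrate that slicing the context at `answer_start` should reproduce the answer text.

```python
def answer_span_matches(context: str, answer_text: str, answer_start: int) -> bool:
    """Return True if `answer_text` sits exactly at `answer_start` in `context`."""
    answer_end = answer_start + len(answer_text)
    return context[answer_start:answer_end] == answer_text


# Made-up record in the same shape as the `answers` field printed above.
toy_context = "Title line. Abstract: Transmission is the main cause."
toy_text = "Transmission is the main cause."
toy_answers = {"text": [toy_text], "answer_start": [toy_context.index(toy_text)]}

ok = answer_span_matches(
    toy_context, toy_answers["text"][0], toy_answers["answer_start"][0]
)
print(ok)
```

The same check could be run on `dataset['train'][0]` by passing its `context` together with the first entries of `answers['text']` and `answers['answer_start']`.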


In [69]:
dataset['train'][:5]['id']

[262, 276, 278, 316, 305]

# Tokenizer

In [48]:
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('UFNLP/gatortronS')

# Tokenize the first four examples
question = dataset['train'][:4]['question']
context = dataset['train'][:4]['context']
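inputs = tokenizer(
    question,
    context,
    max_length=512,                  # maximum tokens per window (question + context + special tokens)
    truncation="only_second",        # truncate only the context, never the question
    stride=250,                      # number of tokens shared by consecutive windows
    return_overflowing_tokens=True,  # keep the truncated remainder as additional windows
    return_offsets_mapping=True,     # character offsets of each token in its source text
)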

for ids in inputs["input_ids"][:3]:
    print(tokenizer.decode(ids)[:300])

[CLS] what is the main cause of hiv - 1 infection in children? [SEP] functional genetic variants in dc - signr are associated with mother - to - child transmission of hiv - 1 https : / / www. ncbi. nlm. nih. gov / pmc / articles / pmc2752805 / boily - larouche, genevieve ; iscache, anne - laure ; zi
[CLS] what is the main cause of hiv - 1 infection in children? [SEP]ct of hiv - 1. methods and findings : to investigate the potential role of dc - signr in mtct of hiv - 1, we carried out a genetic association study of dc - signr in a well - characterized cohort of 197 hiv - infected mothers and th
[CLS] what is the main cause of hiv - 1 infection in children? [SEP] - 180a ) hiv - 1 infection. the promoter variant reduced transcriptional activity in vitro. in homozygous h1 infants bearing both the p - 198a and int2 - 180a mutations, we observed a 4 - fold decrease in the level of placental dc 


In [49]:
print(inputs["overflow_to_sample_mapping"])


[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
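
Each long context is split into several overlapping windows, which is why one sample index repeats many times in the mapping above. A rough pure-Python sketch of the idea is shown below. The function `sliding_windows` is hypothetical and is not how the tokenizer is implemented; in particular, the real tokenizer also reserves room for the question and special tokens, and its handling of the final window may differ. Here `overlap` plays the role of `stride`: the number of tokens shared by consecutive windows.

```python
def sliding_windows(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items that share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


toy_tokens = list(range(10))
chunks = sliding_windows(toy_tokens, window=4, overlap=2)
print(chunks)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With a large overlap relative to the window size, the number of windows per document grows quickly, which is consistent with the dozens of windows per sample seen in the mapping.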


In [51]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [58]:
answers = dataset["train"][:4]["answers"]
start_positions = []
end_positions = []

for i, offset in enumerate(inputs["offset_mapping"]):
    sample_idx = inputs["overflow_to_sample_mapping"][i]
    answer = answers[sample_idx]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)

    # Find the start and end of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # If the answer is not fully inside the context, label is (0, 0)
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        start_positions.append(0)
        end_positions.append(0)
    else:
        # Otherwise it's the start and end token positions
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)

        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)

print(start_positions)
print(end_positions)

[154, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 361, 123, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 423, 181, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 454, 227, 0, 0]
[177, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 390, 152, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 452, 210, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 499, 272, 0, 0]
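
Most windows are labeled `(0, 0)` because the answer lies outside them; only windows that fully contain the answer get real token positions. The mapping from a character span to a token span can be sketched on toy data as follows. The helper `char_span_to_token_span` and the whitespace "tokenizer" are assumptions made for illustration; the helper returns `None` where the cell above uses `(0, 0)`.

```python
import re


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices, or None if not fully covered.

    `offsets` is an ordered list of (start, end) character offsets, one per token.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Last token that starts at or before the answer's first character.
    first = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the answer's last character.
    last = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return first, last


text = "HIV spreads from mother to child"
answer = "mother to child"
# Whitespace tokens stand in for real subword tokens here.
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
start_char = text.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)  # (3, 5)
```

Passing only the first four offsets (`offsets[:4]`) mimics a window that ends before the answer does, and the helper then returns `None`.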


In [62]:
idx = 0
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(inputs["input_ids"][idx][start : end + 1])

print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")

Theoretical answer: Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide., labels give: mother - to - child transmission ( mtct ) is the main cause of hiv - 1 infection in children worldwide.


In [63]:
idx = 4
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

decoded_example = tokenizer.decode(inputs["input_ids"][idx])
print(f"Theoretical answer: {answer}, decoded example: {decoded_example}")

Theoretical answer: Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide., decoded example: [CLS] what is the main cause of hiv - 1 infection in children? [SEP] ]. it has been proposed that interaction between dc - signr and hiv - 1 might enhance viral transfer to other susceptible cell types [ 2 ] but dc - signr can also internalize and mediate proteasome - dependant degradation of viruses [ 4 ] that may differently affect the outcome of infection. given the presence of dc - signr at the maternal - fetal interface and its interaction with hiv - 1, we hypothesized that it could influence mtct of hiv - 1. to investigate the potential role of dc - signr in mtct of hiv - 1, we carried out a genetic association study of dc - signr in a well - characterized cohort of hiv - infected mothers and their infants recruited in zimbabwe, and identified specific dc - signr variants associated with increased risks of hiv transmission. we further characterized

# End-to-end pipeline

In [1]:
from src.data.data_loader import DatasetLoader

# Load the data
data_loader = DatasetLoader(dataset_name='covid_qa_deepset', model_name='UFNLP/gatortronS', max_length=512, doc_stride=250)
train_dataset, validation_dataset, validation_dataset_raw = data_loader.load_dataset()

Loading and preprocessing the dataset ...
covid_qa_deepset


Map:   0%|          | 0/1817 [00:00<?, ? examples/s]

Map:   0%|          | 0/202 [00:00<?, ? examples/s]

In [2]:
train_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 50269
})

In [3]:
validation_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'example_id'],
    num_rows: 5451
})

In [4]:
validation_dataset_raw

Dataset({
    features: ['document_id', 'context', 'question', 'is_impossible', 'id', 'answers'],
    num_rows: 202
})