<a href="https://colab.research.google.com/github/krishnarevi/Attention_model_from_scratch/blob/main/w5_Capstone_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we show how we can take advantage of these recent advances to train a long form question answering system which takes in a question, fetches relevant passages from a document corpus, and writes a multi-sentence answer based on the question and retrieved passages.In the last few months have seen some significant progress in cases where direct supervision is available, or with extensive task-specific pretraining. Here, we show how our custom dataset allows us to train a dense retrieval system without access to either, making dense retrieval models more accessible.

### 1- Preliminaries
The implementation presented here relies on the Hugging Face 🤗transformers and 🤗nlp libraries. Wikipedia indexing relies on faiss for the dense version. You can get all of these by running:

<!-- pip install elasticsearch -->
pip install faiss_gpu
pip install nlp
pip install transformers
<!-- 
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.1-linux-x86_64.tar.gz
tar -xzvf elasticsearch-7.7.1-linux-x86_64.tar.gz -->

In [1]:
! nvidia-smi

Sat Sep 18 06:32:35 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install faiss_gpu nlp transformers

Collecting faiss_gpu
  Downloading faiss_gpu-1.7.1.post2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (89.7 MB)
[K     |████████████████████████████████| 89.7 MB 9.5 kB/s 
[?25hCollecting nlp
  Downloading nlp-0.4.0-py3-none-any.whl (1.7 MB)
[K     |████████████████████████████████| 1.7 MB 36.5 MB/s 
[?25hCollecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
[K     |████████████████████████████████| 2.8 MB 51.3 MB/s 
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
[K     |████████████████████████████████| 243 kB 36.2 MB/s 
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
[K     |████████████████████████████████| 52 kB 1.3 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 

In [3]:
import functools
import math
import os  # noqa: F401
from random import choice, randint
from time import time

import numpy as np
import torch
import torch.utils.checkpoint as checkpoint
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from tqdm import tqdm

import faiss  # noqa: F401
import nlp  # noqa: F401
import pandas as pd
from transformers import AdamW, AutoModel, AutoModelForSeq2SeqLM, AutoTokenizer, get_linear_schedule_with_warmup
import json

In [12]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
import sys
sys.path.append('/content/drive/MyDrive/END')

In [14]:
# from lfqa_utils import *

In [16]:
import os
print(os.getcwd())
path = '/content/drive/MyDrive/END'
os.chdir(path)

/content


In [17]:
folder = "retriever_models"
# os.chdir(path)
print("current dir is: %s" % (os.getcwd()))

if os.path.isdir(folder):
    print("retriever_models directory exists")
else:
    print("retriever_models directory Doesn't exists, creating one")
    os.mkdir(folder)

current dir is: /content/drive/MyDrive/END
retriever_models directory Doesn't exists, creating one


In [18]:
folder = "seq2seq_models"
# os.chdir(path)
print("current dir is: %s" % (os.getcwd()))

if os.path.isdir(folder):
    print("seq2seq_models directory exists")
else:
    print("seq2seq_models directory Doesn't exists, creating one")
    os.mkdir(folder)

current dir is: /content/drive/MyDrive/END
seq2seq_models directory Doesn't exists, creating one


## 2. Data Description
Let's recap: we are interested in the task of Long Form Question Answering. As in other Question Answering tasks, the model is presented with a question, and is required to generate a natural language answer. Whereas a majority of QA datasets contain mostly factoid questions, where the answer, such as a date or the name of a single entity, can be expressed in a few words or single sentence, Long Form QA focuses on questions which call for an explanation consisting of a few sentences or a few paragraphs.

In order to teach a model to answer such questions, we use questions and answers written by Reddit users. Note that the nlp.load_dataset command above actually downloaded questions and their associated answers from the r/explainlikeimfive, r/askscience, and r/AskHistorians subreddits. We focus here on the ELI5/explainlikeimfive part to train the system, as these examples tend to be a little simpler.

Let's look at one item from the test set:

In [19]:
with open(path+'/train_data.json') as f:
        train = json.load(f)
with open(path+'/test_data.json') as f:
        test = json.load(f)
with open(path+'/context_master.json') as f:
        passage_snippets = json.load(f)

In [20]:
train[0]

{'id': 1,
 'x': 'Maxout Layer',
 'y': 'The Maxout layer can be implemented as follows \npython\nclass Maxout(nn.Module):\n def __init__(self, d_in, d_out, pool_size):\n super().__init__()\n self.d_in, self.d_out, self.pool_size = d_in, d_out, pool_size\n self.lin = nn.Linear(d_in, d_out * pool_size)\n def forward(self, inputs):\n shape = list(inputs.size())\n shape[-1] = self.d_out\n shape.append(self.pool_size)\n max_dim = len(shape) - 1\n out = self.lin(inputs)\n m, i = out.view(*shape).max(max_dim)\n return m\n',
 'z': 'For ones who need Maxout, I changed the above code to make it work. \npython\nclass Maxout(nn.Module):\n def __init__(self, d_in, d_out, pool_size):\n super().__init__()\n self.d_in, self.d_out, self.pool_size = d_in, d_out, pool_size\n self.lin = nn.Linear(d_in, d_out * pool_size)\n def forward(self, inputs):\n shape = list(inputs.size())\n shape[-1] = self.d_out\n shape.append(self.pool_size)\n max_dim = len(shape) - 1\n out = self.lin(inputs)\n m, i = out.view(*sh

In [21]:
len(train)

9140

In [22]:
test[100]

{'id': 101,
 'x': 'What do Variable(tensor, requires_grad) return instead of Variables?',
 'y': 'Tensors',
 'z': 'The Variable API has been deprecated: Variables are no longer necessary to use autograd with tensors. Autograd automatically supports Tensors with requires_grad set to True. Below please find a quick guide on what has changed: Variable(tensor) and Variable(tensor, requires_grad) still work as expected, but they return Tensors instead of Variables. var.data is the same thing as tensor.data. Methods such as var.backward(), var.detach(), var.register_hook() now work on tensors with the same method names.'}

In [23]:
len(test)

2286

### 2 Retrieving Support Documents with an Dense retriever Model

The sparse retriever works by finding passages which feature the words from the query. However, it has no way to know a priori which of these words are more important in context, and seems to struggle with understanding the central theme of the query.

Thankfully, some recent works have taken advantage of advances in pre-trained contextual word representations to solve this problem. Models such as DPR or REALM for example learn to compute a vector representation of the query, as well as vector representations of Wikipedia passages in such a way that the passages that best answers a question maximize the dot product between the two representations. Retrieval is then reduced to a Maximum Inner Product Search, which can be executed efficiently using systems like FAISS.

These successes are very encouraging for our Open-Domain Long Form QA application. However, our task and setup do not quite meet the requirements of either of either of these approaches. On the one hand, the DPR system is trained using gold passage annotations: most major QA dataset tell the system which Wikipedia passage contains the answer. Unfortunately, we do not have such annotations for our data. On the other hand, while REALM is trained without passage supervision, it requires a pretty expensive pre-training step with an Inverse Cloze Task (100,000 steps with batch size 4096), and the ability to re-compute the embeddings of all Wikipedia passages regularly during training.

In order to train a similar dense retrieval system at reduced cost without having access to gold passage annotation, we will have to take advantage of another unique feature of our dataset, namely the fact that the long form answers are quite similar in style to thesupport documents we want to index. Our hypothesis then is that if we train a system to embed the questions and answers in our dataset in a way that allows us to easily match questions to answers, then using the answer embedder on support documents should allow us to similarly match questions to supporting evidence from document corpus.

4.a - Contrastive Training with ELI5 In-Batch Negatives
As mentioned above, we want to train a system to produce question and answer embeddings, such that the dot product between the representation of a question and any of its answers is greater than between it and answers of all of the other questions in the dataset.

Unfortunately, actually comparing all questions to all answers before taking every single gradient step is computationally prohibitive: instead, we follow previous work in simply processing medium to large batches of question-answer pairs, and making sure that the dot product of a question with its answer is larger than with all other answers in the batch, and vice versa.

We use a cross-entropy loss for the multinomial distribution over all of the answers (or questions) in a batch, and make use of PyTorch gradient checkpointing to be able to use large batches with limited GPU memory: you can find all implementation details in the RetrievalQAEmbedder class in eli5_utils.py.

We use a single BERT-style pre-trained model to embed the questions and answers, and learn different projection matrices to bring both representations down to dimension 128: the projection matrices are trained from scratch as the sentence embedding model is fine-tuned. We found that the 8-layer distilled version of BERT from the Well-Read Students Learn Better paper performed as well or better as full BERT for a notable gain in computation speed: if you want an even faster model, that work provides pre-trained models spanning the full range of computation/accuracy trade-offs.

The retriever model can than be trained with the following code: with batch size 32/512 on a single Tesla T4 GPU, one can run 10 training epochs in under 1 hour.

In [24]:
###############
# retriever training
###############
class ELI5DatasetQARetriver(Dataset):
    def __init__(self, examples_array, num_rows, extra_answer_threshold=2, min_answer_length=1, training=True, n_samples=None):
        self.data = examples_array
        self.answer_thres = extra_answer_threshold
        self.min_length = min_answer_length
        self.training = training
        self.n_samples = num_rows if n_samples is None else n_samples
        self.num_rows = num_rows

    def __len__(self):
        return self.n_samples

    def make_example(self, idx):
        example = self.data[idx]
        question = example["x"]
        answer = example["y"]
        return (question, answer)

    def __getitem__(self, idx):
        return self.make_example(idx % self.num_rows)


class RetrievalQAEmbedder(torch.nn.Module):
    def __init__(self, sent_encoder, dim):
        super(RetrievalQAEmbedder, self).__init__()
        self.sent_encoder = sent_encoder
        self.output_dim = 128
        self.project_q = torch.nn.Linear(dim, self.output_dim, bias=False)
        self.project_a = torch.nn.Linear(dim, self.output_dim, bias=False)
        self.ce_loss = torch.nn.CrossEntropyLoss(reduction="mean")

    def embed_sentences_checkpointed(self, input_ids, attention_mask, checkpoint_batch_size=-1):
        # reproduces BERT forward pass with checkpointing
        if checkpoint_batch_size < 0 or input_ids.shape[0] < checkpoint_batch_size:
            return self.sent_encoder(input_ids, attention_mask=attention_mask)[1]
        else:
            # prepare implicit variables
            device = input_ids.device
            input_shape = input_ids.size()
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
            head_mask = [None] * self.sent_encoder.config.num_hidden_layers
            extended_attention_mask: torch.Tensor = self.sent_encoder.get_extended_attention_mask(
                attention_mask, input_shape, device
            )

            # define function for checkpointing
            def partial_encode(*inputs):
                encoder_outputs = self.sent_encoder.encoder(inputs[0], attention_mask=inputs[1], head_mask=head_mask,)
                sequence_output = encoder_outputs[0]
                pooled_output = self.sent_encoder.pooler(sequence_output)
                return pooled_output

            # run embedding layer on everything at once
            embedding_output = self.sent_encoder.embeddings(
                input_ids=input_ids, position_ids=None, token_type_ids=token_type_ids, inputs_embeds=None
            )
            # run encoding and pooling on one mini-batch at a time
            pooled_output_list = []
            for b in range(math.ceil(input_ids.shape[0] / checkpoint_batch_size)):
                b_embedding_output = embedding_output[b * checkpoint_batch_size : (b + 1) * checkpoint_batch_size]
                b_attention_mask = extended_attention_mask[b * checkpoint_batch_size : (b + 1) * checkpoint_batch_size]
                pooled_output = checkpoint.checkpoint(partial_encode, b_embedding_output, b_attention_mask)
                pooled_output_list.append(pooled_output)
            return torch.cat(pooled_output_list, dim=0)

    def embed_questions(self, q_ids, q_mask, checkpoint_batch_size=-1):
        q_reps = self.embed_sentences_checkpointed(q_ids, q_mask, checkpoint_batch_size)
        return self.project_q(q_reps)

    def embed_answers(self, a_ids, a_mask, checkpoint_batch_size=-1):
        a_reps = self.embed_sentences_checkpointed(a_ids, a_mask, checkpoint_batch_size)
        return self.project_a(a_reps)

    def forward(self, q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=-1):
        device = q_ids.device
        q_reps = self.embed_questions(q_ids, q_mask, checkpoint_batch_size)
        a_reps = self.embed_answers(a_ids, a_mask, checkpoint_batch_size)
        compare_scores = torch.mm(q_reps, a_reps.t())#cosine similarity
        loss_qa = self.ce_loss(compare_scores, torch.arange(compare_scores.shape[1]).to(device))#cross entrophy loss
        loss_aq = self.ce_loss(compare_scores.t(), torch.arange(compare_scores.shape[0]).to(device))
        loss = (loss_qa + loss_aq) / 2
        return loss


def make_qa_retriever_model(model_name="google/bert_uncased_L-8_H-512_A-8", from_file=None, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    bert_model = AutoModel.from_pretrained(model_name).to(device)
    # run bert_model on a dummy batch to get output dimension
    d_ids = torch.LongTensor(
        [[bert_model.config.bos_token_id if bert_model.config.bos_token_id is not None else 1]]
    ).to(device)
    d_mask = torch.LongTensor([[1]]).to(device)
    sent_dim = bert_model(d_ids, attention_mask=d_mask)[1].shape[-1]
    qa_embedder = RetrievalQAEmbedder(bert_model, sent_dim).to(device)
    if from_file is not None:
        param_dict = torch.load(from_file)  # has model weights, optimizer, and scheduler states
        qa_embedder.load_state_dict(param_dict["model"])
    return tokenizer, qa_embedder


def make_qa_retriever_batch(qa_list, tokenizer, max_len=128, device="cuda"):
    q_ls = [q for q, a in qa_list]
    a_ls = [a for q, a in qa_list]
 
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
    
    q_ids, q_mask = (torch.LongTensor(q_toks["input_ids"]).to(device),torch.LongTensor(q_toks["attention_mask"]).to(device),)
    # print(len(a_ls))

    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=max_len, pad_to_max_length=True)
    # TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
    # print(a_toks)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )

    return (q_ids, q_mask, a_ids, a_mask)


def train_qa_retriever_epoch(model, dataset, tokenizer, optimizer, scheduler, args, e=0):
    model.train()
    # make iterator
    train_sampler = RandomSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    # print(next(iter(data_loader)).shape)
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    for step, batch in enumerate(epoch_iterator):
        # print("q_ids",q_ids.shape)
        # print(" q_mask,", q_mask.shape)
        # print("A_id", a_ids.shape)
        q_ids, q_mask, a_ids, a_mask = batch
        pre_loss = model(q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=args.checkpoint_batch_size)
        loss = pre_loss.sum()
        # optimizer
        loss.backward()
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        # some printing within the epoch
        loc_loss += loss.item()
        loc_steps += 1
        if step % args.print_freq == 0 or step == 1:
            print(
                "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                    e, step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                )
            )
            loc_loss = 0
            loc_steps = 0


def evaluate_qa_retriever(model, dataset, tokenizer, args):
    model.eval()
    # make iterator
    eval_sampler = SequentialSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=eval_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    tot_loss = 0.0
    with torch.no_grad():
        for step, batch in enumerate(epoch_iterator):
            q_ids, q_mask, a_ids, a_mask = batch
            loss = model(q_ids, q_mask, a_ids, a_mask)
            tot_loss += loss.item()
        return tot_loss / (step + 1)


def train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args):
    qar_optimizer = AdamW(qar_model.parameters(), lr=qar_args.learning_rate, eps=1e-8)
    qar_scheduler = get_linear_schedule_with_warmup(
        qar_optimizer,
        num_warmup_steps=100,
        num_training_steps=(qar_args.num_epochs + 1) * math.ceil(len(qar_train_dset) / qar_args.batch_size),
    )
    for e in range(qar_args.num_epochs):
        train_qa_retriever_epoch(qar_model, qar_train_dset, qar_tokenizer, qar_optimizer, qar_scheduler, qar_args, e)
        m_save_dict = {
            "model": qar_model.state_dict(),
            "optimizer": qar_optimizer.state_dict(),
            "scheduler": qar_scheduler.state_dict(),
        }
        print("Saving model {}".format(qar_args.model_save_name))
        # torch.save(m_save_dict, "{}_{}.pth".format(qar_args.model_save_name, e))
        eval_loss = evaluate_qa_retriever(qar_model, qar_valid_dset, qar_tokenizer, qar_args)
        print("Evaluation loss epoch {:4d}: {:.3f}".format(e, eval_loss))

In [25]:
# training arguments
class ArgumentsQAR():
    def __init__(self):
        self.batch_size = 512
        self.max_length = 128
        self.checkpoint_batch_size = 32
        self.print_freq = 100
        self.pretrained_model_name = "google/bert_uncased_L-8_H-768_A-12"
        self.model_save_name = "retriever_model_l-8_h-768_b-512-512"
        self.learning_rate = 2e-4
        self.num_epochs =10

qar_args = ArgumentsQAR()

# prepare torch Dataset objects
qar_train_dset = ELI5DatasetQARetriver(train,num_rows=len(train), training=True)
qar_valid_dset = ELI5DatasetQARetriver(test,num_rows=len(test), training=False)

# load pre-trained BERT and make model
qar_tokenizer, qar_model = make_qa_retriever_model(
        model_name=qar_args.pretrained_model_name,
        from_file=None,
        device="cuda"
)

# train the model
train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args)

Downloading:   0%|          | 0.00/384 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/327M [00:00<?, ?B/s]

Some weights of the model checkpoint at google/bert_uncased_L-8_H-768_A-12 were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncat

 0     0 of    17 	 L: 6.504 	 -- 10.617
 0     1 of    17 	 L: 6.502 	 -- 21.007
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    0: 5.879
 1     0 of    17 	 L: 6.031 	 -- 10.398
 1     1 of    17 	 L: 6.080 	 -- 20.783
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    1: 5.535
 2     0 of    17 	 L: 5.596 	 -- 10.395
 2     1 of    17 	 L: 5.560 	 -- 20.792
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    2: 5.126
 3     0 of    17 	 L: 5.112 	 -- 10.401
 3     1 of    17 	 L: 5.044 	 -- 20.807
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    3: 4.739
 4     0 of    17 	 L: 4.427 	 -- 10.394
 4     1 of    17 	 L: 4.414 	 -- 20.801
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    4: 4.565
 5     0 of    17 	 L: 3.567 	 -- 10.379
 5     1 of    17 	 L: 3.595 	 -- 20.756
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    5: 4.659
 6     0 o

In [26]:
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

Once the model is trained, it can be used to compute passage embeddings for all document corpus. The make_qa_dense_index method takes advantage of numpy memory-mapping, so embeddings are written directly to disk. Again with a single GPU, computing the full set of passage embeddings should take about1 hour.

In [27]:
# type(qar_model)

In [28]:
# qar_model.save_pretrained('/content/drive/MyDrive/TSAI/Capstone_1/retriever_models/')

In [29]:
# qar_model = AutoModel.from_pretrained('/content/').to('cuda')

In [30]:
# type(qar)

In [31]:
# type(qar_tokenizer)

In [32]:
# qar_tokenizer.save_pretrained('/content/drive/MyDrive/TSAI/Capstone_1/ret_tokenizer/')

In [33]:
# qar_tokenizer = AutoTokenizer.from_pretrained('/content/drive/MyDrive/TSAI/Capstone_1qa_s2s_tokenizer/')

In [34]:

###############
# ELI5-trained retrieval model usage
###############
def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=128, device="cuda"):
    a_toks = tokenizer.batch_encode_plus(passages, max_length=max_length, pad_to_max_length=True)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )
    with torch.no_grad():
        a_reps = qa_embedder.embed_answers(a_ids, a_mask).cpu().type(torch.float)
    return a_reps.numpy()
def embed_questions_for_retrieval(q_ls, tokenizer, qa_embedder, device="cuda"):
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=128, pad_to_max_length=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )
    with torch.no_grad():
        q_reps = qa_embedder.embed_questions(q_ids, q_mask).cpu().type(torch.float)
    return q_reps.numpy()
def make_qa_dense_index(
    qa_embedder,
    tokenizer,
    passages_dset,
    batch_size=128,
    max_length=128,
    index_name="kilt_passages_reps.dat",
    dtype="float32",
    device="cuda",
):
    st_time = time()
    fp = np.memmap(index_name, dtype=dtype, mode="w+", shape=(len(passages_dset),128))
    n_batches = math.ceil(len(passages_dset) / batch_size)
    for i in range(n_batches):
        passages = [p["z"] for p in passages_dset[i * batch_size : (i + 1) * batch_size]]
        reps = embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length, device)
        fp[i * batch_size : (i + 1) * batch_size] = reps
        if i % 50 == 0:
            print(i, time() - st_time)

In [35]:
# os.chdir(r'/content/drive/MyDrive/TSAI/Capstone_1')

In [36]:
if not os.path.isfile('passages_reps_32_l-8_h-768_b-512-512.dat'):
  print("hi")

  make_qa_dense_index(
          qar_model, qar_tokenizer, passage_snippets, device='cuda',
          index_name='passages_reps_32_l-8_h-768_b-512-512.dat' )

hi




0 0.40659284591674805
50 17.19633650779724


### 4.b - Using the Trained Dense Retriever and document Index
Now that we have trained our model to compute query and answer embeddings and used it to compute passage embeddings for all our document snippets, let's see whether it can actually find supporting evidence for a new question. Recalling the the two steps to using the dense retriever: we first compute an embedding for a new question, then do Max Inner Product Search with the pre-computed passage representations.

The MIPS part can be executed efficiently with the faiss library. Additionally, since we computed 128-dimensional passage embeddings, the whole of the representations fits on a GPU, making retrieval even faster. We can create the faiss_gpu index with the following code:

In [37]:
n_ret = 5

In [38]:
faiss_res = faiss.StandardGpuResources()
passage_reps = np.memmap(
            'passages_reps_32_l-8_h-768_b-512-512.dat',
            dtype='float32', mode='r',
            # shape=(wiki40b_snippets.num_rows, 128)
            # wiki40b_snippets.num_rows = 11378343,english sections from wiki40B dataset
            shape=(len(passage_snippets), 128)
)

doc_index_flat = faiss.IndexFlatIP(128)
doc_gpu_index = faiss.index_cpu_to_gpu(faiss_res, 0, doc_index_flat)
doc_gpu_index.add(passage_reps)

In [39]:

def query_qa_dense_index(
    question, qa_embedder, tokenizer, doc_passages, doc_index, n_results=n_ret, min_length=1, device="cuda"
):
    q_rep = embed_questions_for_retrieval([question], tokenizer, qa_embedder, device=device)
    D, I = doc_index.search(q_rep, 2 * n_results)
    res_passages = [doc_passages[int(i)] for i in I[0]]
    support_doc = "<P> " + " <P> ".join([p["z"] for p in res_passages])
    res_list = [dict([(k, p[k]) for k in ["z"]]) for p in res_passages]
    res_list = [res for res in res_list if len(res["z"].split()) > min_length][:n_results]
    for r, sc in zip(res_list, D[0]):
        r["score"] = float(sc)
    return support_doc, res_list

Now we can use the query_qa_dense_index function to query the dense index for our running example question :

In [40]:
question = test[12]['x']
question

'In what platform do the modules Conv2d() and Linear() run?'

In [41]:
doc, res_list = query_qa_dense_index(question, qar_model, qar_tokenizer, passage_snippets, doc_gpu_index, device='cuda')
print(res_list)
df = pd.DataFrame({
    
    'Text': ['--- ' + question] + [res['z'] for res in res_list],
})
df.style.set_properties(**{'text-align': 'left'})

[{'z': "pytorch ships it's own cudnn", 'score': 20.960006713867188}, {'z': 'logic and:\n\na * b\n\n\nlogic or:\n\na + b\n\n', 'score': 20.245590209960938}, {'z': 'To verify that pytorch uses cudnn:\n\n>>> torch.backends.cudnn.version()\n6021\n', 'score': 19.688329696655273}, {'z': 'cc malfet', 'score': 19.112272262573242}, {'z': 'cudnn 9.0 is no longer supported. Supported cuda versions are 9.2, 10.1, 10.2', 'score': 18.883193969726562}]




Unnamed: 0,Text
0,--- In what platform do the modules Conv2d() and Linear() run?
1,pytorch ships it's own cudnn
2,logic and: a * b logic or: a + b
3,To verify that pytorch uses cudnn: >>> torch.backends.cudnn.version() 6021
4,cc malfet
5,"cudnn 9.0 is no longer supported. Supported cuda versions are 9.2, 10.1, 10.2"


### 4.c - Retriever Model Evaluation
We have trained a retrieval model that seems to be working a little better than the traditional word-matching based approach, at least on our running example. Before we use it to actually answer questions, however, we would like to be able to get some quantitative evaluation of the performances of both approaches.

For the retriever, we want to favor recall over precision: our first priority is to make sure that all of the information needed to write the answers is present in the support document. If there is unrelated information, the generation model can learn to sort it out. We measure this by computing the proportion of words in the high-scoring answers which are present in the retrieved support document. To focus on important words, we also weigh answer words by their Inverse Document Frequency. This gives us the following IDF-recall scoring function:

## 5. Generating Answers with a Sequence-to-Sequence Model

In [42]:
# ELI5 seq2seq model training
###############
class ELI5DatasetS2S(Dataset):
    def __init__(
        self, examples_array,num_rows, make_doc_fun=None, doc_cache=None, training=True
    ):
        self.training = training
        self.data = examples_array
        self.make_doc_function = make_doc_fun
        self.doc_cache = {} if doc_cache is None else doc_cache
        self.num_rows = num_rows
        assert not (make_doc_fun is None and doc_cache is None)
        # make index of specific question-answer pairs from multi-answers
        if self.training:
            self.qa_id_list = [(i, 0) for i in range(self.num_rows)]

        else:
            self.qa_id_list = [(i, 0) for i in range(self.num_rows)]

    def __len__(self):
        return len(self.qa_id_list)

    def make_example(self, idx):
        i, j = self.qa_id_list[idx]
        example = self.data[i]
        question = example["x"] 
        answer = example["y"]
        q_id = example["id"]
        if self.make_doc_function is not None:
            self.doc_cache[q_id] = self.doc_cache.get(q_id, self.make_doc_function(example["x"]))
        document = self.doc_cache[q_id]
        in_st = "question: {} context: {}".format(
            question.lower().strip(), document.lower().strip(),
        )
        out_st = answer
        return (in_st, out_st)

    def __getitem__(self, idx):
        return self.make_example(idx)


def make_qa_s2s_model(model_name="facebook/bart-large", from_file=None, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    if from_file is not None:
        param_dict = torch.load(from_file)  # has model weights, optimizer, and scheduler states
        model.load_state_dict(param_dict["model"])
    return tokenizer, model


def make_qa_s2s_batch(qa_list, tokenizer, max_len=64, max_a_len=128, device="cuda"):
    q_ls = [q for q, a in qa_list]
    a_ls = [a for q, a in qa_list]
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )
    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=min(max_len, max_a_len), pad_to_max_length=True)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )
    labels = a_ids[:, 1:].contiguous().clone()
    labels[a_mask[:, 1:].contiguous() == 0] = -100
    # print(labels)
    model_inputs = {
        "input_ids": q_ids,
        "attention_mask": q_mask,
        "decoder_input_ids": a_ids[:, :-1].contiguous(),
        "labels": labels,
    }
    # print("it'sme",model_inputs)
    return model_inputs


def train_qa_s2s_epoch(model, dataset, tokenizer, optimizer, scheduler, args, e=0, curriculum=True):
    model.train()
    # make iterator
    if curriculum:
        train_sampler = SequentialSampler(dataset)
    else:
        train_sampler = RandomSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_s2s_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)

  
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    for step, batch_inputs in enumerate(epoch_iterator):
        # print(type(step))
        pre_loss = model(**batch_inputs)[0].unsqueeze(dim=0)
        # print(pre_loss,'pre_loss')
        # print(pre_loss.shape)
        # print(pre_loss.sum(),'sum')
        loss = pre_loss.sum() / pre_loss.shape[0]
        loss.backward()
        # optimizer
        if step % args.backward_freq == 0:
            optimizer.step()
            scheduler.step()
            model.zero_grad()
        # some printing within the epoch
        loc_loss += loss.item()
        loc_steps += 1
        if step % args.print_freq == 0 or step == 1:
            print(
                "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                    e, step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                )
            )
            loc_loss = 0
            loc_steps = 0


def eval_qa_s2s_epoch(model, dataset, tokenizer, args):
    model.eval()
    # make iterator
    train_sampler = SequentialSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_s2s_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    with torch.no_grad():
        for step, batch_inputs in enumerate(epoch_iterator):
            pre_loss = model(**batch_inputs)[0].unsqueeze(dim=0)
            # print(pre_loss,'pre_loss')
            # print(pre_loss.shape)
            # print(pre_loss.sum(),'sum')
            loss = pre_loss.sum() / pre_loss.shape[0]
            loc_loss += loss.item()
            # print("loc loss here",loc_loss)
            loc_steps += 1
            if step % args.print_freq == 0:
                print(
                    "{:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                        step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                    )
                )
    print("Total \t L: {:.3f} \t -- {:.3f}".format(loc_loss / loc_steps, time() - st_time,))


def train_qa_s2s(qa_s2s_model, qa_s2s_tokenizer, s2s_train_dset, s2s_valid_dset, s2s_args):
    s2s_optimizer = AdamW(qa_s2s_model.parameters(), lr=s2s_args.learning_rate, eps=1e-8)
    s2s_scheduler = get_linear_schedule_with_warmup(
        s2s_optimizer,
        num_warmup_steps=400,
        num_training_steps=(s2s_args.num_epochs + 1) * math.ceil(len(s2s_train_dset) / s2s_args.batch_size),
    )
    for e in range(s2s_args.num_epochs):
        # print((e == 0))

        train_qa_s2s_epoch(
            qa_s2s_model,
            s2s_train_dset,
            qa_s2s_tokenizer,
            s2s_optimizer,
            s2s_scheduler,
            s2s_args,
            e,
            curriculum=True,
        )
        m_save_dict = {
            "model": qa_s2s_model.state_dict(),
            "optimizer": s2s_optimizer.state_dict(),
            "scheduler": s2s_scheduler.state_dict(),
        }
        print("Saving model {}".format(s2s_args.model_save_name))
        eval_qa_s2s_epoch(qa_s2s_model, s2s_valid_dset, qa_s2s_tokenizer, s2s_args)
        # torch.save(m_save_dict, "\{}_{}.pth".format(s2s_args.model_save_name, e))



In [43]:
n_ret = 2

In [None]:
# pre-computing support documents
eli5_train_docs = []
for example in train:
    support_doc, dense_res_list = query_qa_dense_index(
        example['x'], qar_model, qar_tokenizer,passage_snippets, doc_gpu_index, n_results=n_ret
    )
    eli5_train_docs += [(example['id'], support_doc, dense_res_list)]

eli5_valid_docs = []
for example in test:
    support_doc, dense_res_list = query_qa_dense_index(
        example['x'], qar_model, qar_tokenizer, passage_snippets, doc_gpu_index, n_results=n_ret
    )
    eli5_valid_docs += [(example['id'], support_doc, dense_res_list)]

# training loop proper
class ArgumentsS2S():
    def __init__(self):
        self.batch_size = 2
        self.backward_freq = 16
        self.max_length = 512
        self.print_freq = 100
        self.model_save_name = "eli5_bart_model"
        self.learning_rate = 2e-4
        self.num_epochs =3

s2s_args = ArgumentsS2S()

# eli5_train_docs = json.load(open('precomputed/eli5_train_precomputed_dense_docs.json'))
# eli5_valid_docs = json.load(open('precomputed/eli5_valid_precomputed_dense_docs.json'))
s2s_train_dset = ELI5DatasetS2S(train,num_rows =len(train), doc_cache=dict([(k, d) for k, d, src_ls in eli5_train_docs]))
s2s_valid_dset = ELI5DatasetS2S(test,num_rows =len(test), doc_cache=dict([(k, d) for k, d, src_ls in eli5_valid_docs]), training=False)

qa_s2s_tokenizer, pre_model = make_qa_s2s_model(
    model_name="facebook/bart-large",
    from_file=None,
    device="cuda"
)
# qa_s2s_model = torch.nn.DataParallel(pre_model)
qa_s2s_model =pre_model
train_qa_s2s(qa_s2s_model, qa_s2s_tokenizer, s2s_train_dset, s2s_valid_dset, s2s_args)



Downloading:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.60k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.02G [00:00<?, ?B/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


 0     0 of  4570 	 L: 5.819 	 -- 0.416
 0     1 of  4570 	 L: 15.740 	 -- 0.733
 0   100 of  4570 	 L: 5.276 	 -- 32.972
 0   200 of  4570 	 L: 4.891 	 -- 65.283
 0   300 of  4570 	 L: 4.272 	 -- 97.957
 0   400 of  4570 	 L: 3.781 	 -- 130.289
 0   500 of  4570 	 L: 3.553 	 -- 162.859
 0   600 of  4570 	 L: 3.406 	 -- 195.469
 0   700 of  4570 	 L: 3.168 	 -- 227.510
 0   800 of  4570 	 L: 2.922 	 -- 260.009
 0   900 of  4570 	 L: 3.137 	 -- 292.243
 0  1000 of  4570 	 L: 2.709 	 -- 324.783
 0  1100 of  4570 	 L: 3.201 	 -- 357.386
 0  1200 of  4570 	 L: 3.112 	 -- 390.004
 0  1300 of  4570 	 L: 2.826 	 -- 422.442
 0  1400 of  4570 	 L: 2.437 	 -- 455.058
 0  1500 of  4570 	 L: 2.806 	 -- 487.680
 0  1600 of  4570 	 L: 2.891 	 -- 520.037
 0  1700 of  4570 	 L: 2.626 	 -- 552.583
 0  1800 of  4570 	 L: 2.548 	 -- 584.690
 0  1900 of  4570 	 L: 2.661 	 -- 616.380
 0  2000 of  4570 	 L: 2.582 	 -- 648.372
 0  2100 of  4570 	 L: 2.425 	 -- 680.207
 0  2200 of  4570 	 L: 2.571 	 -- 712.77

We now have everything we need to answer any question! Now let's try the full system on our running example along with the first four questions of the test set:

In [49]:
import psutil
def get_size(bytes, suffix="B"):
    factor = 1024
    for unit in ["", "K", "M", "G", "T", "P"]:
        if bytes < factor:
            return f"{bytes:.2f}{unit}{suffix}"
        bytes /= factor
print("="*40, "Memory Information", "="*40)
svmem = psutil.virtual_memory()
print(f"Total: {get_size(svmem.total)}") ; print(f"Available: {get_size(svmem.available)}")
print(f"Used: {get_size(svmem.used)}") ; print(f"Percentage: {svmem.percent}%")

Total: 25.46GB
Available: 22.30GB
Used: 5.54GB
Percentage: 12.4%


In [None]:
    import torch
    torch.cuda.empty_cache()

In [50]:
# generate answer from input "question: ... context: <p> ..."
def qa_s2s_generate(
    question_doc,
    qa_s2s_model,
    qa_s2s_tokenizer,
    num_answers=1,
    num_beams=None,
    min_len=64,
    max_len=512,
    do_sample=False,
    temp=1.0,
    top_p=None,
    top_k=None,
    max_input_length=1024,
    device="cuda:0",
):
    model_inputs = make_qa_s2s_batch([(question_doc, "A")], qa_s2s_tokenizer, max_input_length, device=device,)
    n_beams = num_answers if num_beams is None else max(num_beams, num_answers)
    generated_ids = qa_s2s_model.generate(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        min_length=min_len,
        max_length=max_len,
        do_sample=do_sample,
        early_stopping=True,
        num_beams=1 if do_sample else n_beams,
        temperature=temp,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=qa_s2s_tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        num_return_sequences=num_answers,
        decoder_start_token_id=qa_s2s_tokenizer.bos_token_id,
    )
    return [qa_s2s_tokenizer.decode(ans_ids, skip_special_tokens=True).strip() for ans_ids in generated_ids]

In [55]:
questions = []
answers = []
docs = []

for i in [10] + [j for j in range(4)]:
    # create support document with the dense index
    question = test[i]['x']
    doc, res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer,
        passage_snippets, doc_gpu_index, device='cuda'
    )
    # concatenate question and support document into BART input
    question_doc = "question: {} context: {}".format(question, doc)
    # generate an answer with beam search
    answer = qa_s2s_generate(
            question_doc, qa_s2s_model, qa_s2s_tokenizer,
            num_answers=1,
            num_beams=8,
            min_len=64,
            max_len=256,
            max_input_length=1024,
            device="cuda:0"
    )[0]
    questions += [question]
    answers += [answer]
    docs += [doc]

df = pd.DataFrame({
    'Question': questions,
    'Answer': answers,
    'Documents' :docs
})
df.style.set_properties(**{'text-align': 'left'})



Unnamed: 0,Question,Answer,Documents
0,`with torch.enable_grad` also works outside a `no_grad` context,"It seems there is no difference. Just that one takes a bool as input and not the other. Also torch.set_grad_enabled(False) can be used as just a function to set the grad mode forever. Great, thanks a lot. Very convenient to use the same piece of code for training and evaluating.","It seems there is no difference: https://stackoverflow.com/a/53447634/6888630 There is no difference. Just that one takes a bool as input and not the other. Also torch.set_grad_enabled(False) can be used as just a function to set the grad mode forever. Great, thanks a lot. Very convenient to use the same piece of code for training and evaluating. Yeah that makes sense. There's really nothing that prevents people from setting `requires_grad` to `False` after they construct the optimizer, and it won't even complain. We should just remove the check. This API is in beta. Even though the function signatures are very unlikely to change, major improvements to performances are planned before we consider this stable. This section contains the higher level API for the autograd that builds on the basic API above and allows you to compute jacobians, hessians, etc. ""Enables gradient calculation, if it has been disabled via `no_grad` or `torch.set_grad_enabled`"" <- I think something like that would make it clearer. But yes, we would accept a PR, thank you! Keep in mind that set_grad_enabled is a state of the program (“do you want to keep track of gradients for outputs when the inputs require gradients”) and applies to new tensors, while someparam.requires_grad_(False) (which is the suggested form to disable gradients of tensors and parameters) says “this thing, when used as input, doesn’t require gradients”. To then decide whether a given operation’s output requires gradients, the autograd engine checks if both gradient-mode is enabled and any inputs require gradients. As such these are two distinct knobs you can operate independently, but they have a combined effect.  eprox: with torch.set_grad_enabled(mode = True):  self.model.lin2.requires_grad = False # Will this work?  for m in self.model.mlp_f: m.set_grad_enabled = False # Or this? This seems to try to work on Modules, which isn’t a thing (and using requires_grad_ would tell you so, which is why it is preferred). If you wanted something like that, you should go for for p in self.model.mlp_f.parameters(): p.requires_grad_(False) which certainly works. Ok that is useful. 1. because it's so much more common than rescaling gradients by a constant that doesn't depend on the iteration 2. it needs to be performed on weights and not gradients, and this requires an extra `no_grad` block 3. ideally it wouldn't be part of the optimizer, but we're stuck with it for now because of backward compatibility Having both zero_grad() and reset_grad() might be kind of confusing to the end users. Are zero-ing all gradients always equivalent to setting all gradients to None? Wonder if there's any case where zero_grad() is preferred over reset_grad(). If the two are not equivalent, and have pros and cons, we probably should mention those pros and cons in the documentations to articulate the difference in order to prevent confusion. We should probably have some default recommendation (e.g. do we think reset_grad() will be better in most cases, which looks like the case, because most references I found are just calling .backward() right after calling zero_grad()). torch.is_grad_enabled Returns True if grad mode is currently enabled. The torch.autograd.enable_grad documentation says:  Enables gradient calculation inside a no_grad context. This has no effect outside of no_grad. Given this wording, the following is expected: torch.set_grad_enabled(False) with torch.enable_grad:  # Gradient tracking will NOT be enabled here. torch.set_grad_enabled(True) vs: with torch.no_grad():  with torch.enable_grad:  # Gradient tracking IS enabled here. But as blue-phoenix shows, this is not the case. I raised an issue here.  > One question: do we need to do special handling when the `create_graph=True` flag is set?  Other than setting create_graph=True when we call autograd.grad, no, I don't think so."
1,"""exp_cuda"" not implemented for 'ComplexDouble'","We will take a look at this. I can reproduce this issue on my skylake machine. Thank you for reporting this issue, @shmsong. The team agrees it's a high priority. CC @driazati for Final constant handling It's a constructor for a Tensor. It's generated by C macros, what you probably want to look for is `THCTensor_(new)` or `THCudaTensor_(New)`","@albanD Its unrelated to `nn.Parameter` and `torch.cuda.FloatTensor(torch.ones(3))` alreay crashes. @carefree0910 argmin returns the index of the minimum value in the dimension. It doesn't have a guarantee to return the index of the **first** minimum. The GPU result is also correct from what `argmin` is supposed to return I don't think the GPU behavior is completely wrong, for a mathematical reason: it should be the case that `torch.max(torch.cat(xs, ys)) == torch.max(torch.max(xs), torch.max(ys))` for any xs and ys, including empty. In that case, `-Inf` is the correct neutral element to pick.  The way Numpy gets out of this situation, is they ask you for an initial element to handle the case explicitly, and error if you don't provide it. So in the end I agree with your suggested course of action. thanks, we'll look into this! @Markus-Goetz repasting my comment from #17738 (comment) the only determinism we aim to have is hashed on device, and for CPU, single-threaded. Even across different kinds of GPUs, all bets are off. We cannot guarantee cross-device, cross-CPU-type determinism due to severe performance cliffs that'll result from such a constraint. For example, to guarantee that we pick the first argmax (or argmin) element, we have to add an additional pass in our CUDA kernel to sort or order the results, which costs performance. This is inconsistent with numpy's, Eigen's, C++ STL etc. If you see, you have given examples of CPU kernels, where this is easy to guarantee without significant performance regression. Hi, I am from @quansight team. I would like to work on this issue. We will take a look at this. I can reproduce this issue on my skylake machine. Thank you for reporting this issue, @shmsong. The team agrees it's a high priority. cc @driazati for Final constant handling @ThisIsIsaac it's a constructor for a Tensor. It's generated by C macros, what you probably want to look for is `THCTensor_(new)` or `THCudaTensor_(new)`"
2,How to correctly use CTC Loss with GRU in pytorch?,"The negative log likelihood loss (NLLLoss) is suitable for classification problems, where the output is one out of C classes. Since the classes are discrete, your labels need to be of the long type. In your case, in a comment, you say: I want to create a network that simulates a quadratic function with x as input and sth similar to x**2 as output. However, I don't think this is a very good benchmark. It does many small non-vectorized 1D transforms which is not what tensor libraries are best at. It's also","Your NumPy is using Accererate, but PyTorch uses MKL. It is natural that they behave slightly differently. Given that the singular value is so small. I would just classify this as precision problem. criterion is defined as torch.nn.CrossEntropyLoss() in your notebook. As mentioned in documentation of CrossEntropyLoss, it expects probability values returned by model for each of the 'K' classes and corresponding value for ground-truth label as input. Now, probability values are float tensors, while ground-truth label should be a long tensor representing a class (class can not be a float, e.g. 2.3 can not represent a class). hence: loss = criterion(predictions, batch.label.long()) should work.  Timing with gh-43011 still shows a performance drop with pytorch:  zero elapsed time: 3.07e-05 seconds  rot elapsed time: 5.59e-05 seconds  loop elapsed time: 0.001341 seconds  NUMPY TIME elapsed time: 0.0015816 seconds  zero elapsed time: 8.81e-05 seconds  rot elapsed time: 0.0002193 seconds  loop elapsed time: 0.0037979 seconds  PYTORCH TIME elapsed time: 0.00421 seconds  However, I don't think this is a very good benchmark. It does many small non-vectorized 1D transforms which is not what tensor libraries are best at. It's also conflating `diag`, `fft`, `abs` and copy-assign into one single benchmark.  If I use `timeit` to isolate each line:  * diag is ~3 us for numpy and ~5-6 us for pytorch  * fft is ~10 us for numpy and 20 us for pytorch  * abs is ~9 us for numpy and 12 us for pytorch  * assignment is ~1us in numpy and 11 us for pytorch.  Every operator has at least 3 us overhead with pytorch. That's pretty bad, but for large tensors I'm sure it balances out a bit better. FFT is about 2x slower for this small 1D tensor, but for a `256 X 512` FFT pytorch is 2x faster than NumPy, even single threaded.  The biggest standout is copying from one slice to another which is an order of magnitude slower in pytorch. Take care that Pytorch CTC loss takes log softmax probabilities as input while CTC loss by Baidu does not mention that.  The negative log likelihood loss (NLLLoss) is suitable for classification problems, where the output is one out of C classes. Since the classes are discrete, your labels need to be of the long type. In your case, in a comment, you say: I want to create a network that simulates a quadratic function with x as input and sth similar to x**2 as output. This is a regression problem, where the output can have a real, continuous value. For this, you should use a suitable loss function such as the mean squared error loss (MSELoss). So, one way to fix would be changing F.nll_loss in your code to F.mse_loss.  Since both outputs the value between 0 and 1. I thought there shouldn’t be an issue. the output of log sigmoid isn’t between 0 and 1, pls refer to docs with link ""https://pytorch.org/docs/stable/nn.html?highlight=logsigmoid#torch.nn.LogSigmoid"" Sorry that was a mistake. I knew they are not. I meant to say sigmod and softmax both output the value between 0 and 1. post is updated To me, the logsigmoid+NLLLoss combination hardly makes any sense, because the objective function only tries to promote the gt_class, but no suppression on the negative ones. Maybe you wanna try sigmoid+bceloss.  so much for your advice. the model trains with bce even with sigmoid softmax works as well with bce but only up to a certain point and the training collapse. I am not sure why but I guess it has something to do with the dependency among classes… Do you know if there is a loss function which will be good for both activation function. a naive answer: If you really want to test with a single loss function for both activation functions, what about L2 loss with one-hot vectors as target? I’m not sure if it will give good performance though. Softmax is actually not an activation function… logsigmoid+nllloss doesn’t make sense mathematically (if you derive the gradients, you’ll find it.) PyTroch documentation says NLLLoss expects log probability though and that is logsigmoid + nllloss is how CrossEntropyLoss is constructed for PyTorch nn.CrossEntropyLoss uses F.log_softmax and nn.NLLLoss internally as shown here with link ""https://github.com/pytorch/pytorch/blob/2e97c82470966df6942f364102690460ea58403e/torch/nn/functional.py#L2028"". Look at the description of F.nll_loss. It expects to get as input not the argmax of the prediction (type torch.long), but rather the full 64x50x43 prediction vectors (of type torch.float). Note that indeed the prediction you provide to F.nll_loss has an extra dimension more than the ground truth targets you provide. In your case, simply remove the argmax: loss = F.nll_loss(output, targets)  nn.CrossEntropyLoss() expects target tensors of type Long, but what you're passing is of type Double. Try to change this line from: single_loss = loss_function(y_pred, train_op) to: single_loss = loss_function(y_pred, train_op.long())  I was not able to reproduce non-deterministic results as long as bn weights are initialized to a fixed value. You are right that in your case the expected results is 0, as expected results is (x-mean)/var*weight[+bias], and, as long as mean is exactly equal to tensor values (as it should be), result is 0. However, due to some quirks of how cudnn computes output, a small (on the order of 1 ulp) error creeps in. With the default epsilon 1/var is approx 300, weight is on the order of 1, to get the result you are getting x-mean should be on the order of 1e-5. Given that x and mean are 100, it's enough for mean to have a 1e-7 relative error (which is approx 1 ulp) to produce the results you are seeing, which may happen due to fp arithmetic being inexact. For anyone else with a similar issue, I got it to work. I removed the log_softmax calculation, so this: tag_space = self.hidden2tag(outputs) tag_scores = F.log_softmax(tag_space, dim=0) return tag_scores[-1] becomes this: tag_space = self.hidden2tag(outputs) return tag_space[-1] I also changed NLLLoss to CrossEntropyLoss, (not shown above), and initialized CrossEntropyLoss with no parameters (aka no ignore_index). I am not certain why these changes were necessary (the docs even say that NLLLoss should be run after a log_softmax layer), but they got my model working and brought my loss back to a reasonable range (~0.5)."
3,glibc error while importing torch,"Try this if you have added the Anaconda directory to your bash shell PATH environment variable: python -c ""import struct; print(struct.calcsize('P') * 8)"" Checking Wheel Version: 64 bit wheels typically contain amd64 or similar in their title 32 bit wheels usually contain win32or similar in theirs title Switching to a 64 bit Python Installaion: The default Windows Python installer does not have a 64bit option. To acquire 64 bitPython, navigate to python.org/downloads/windows/, and select a version that specifies x86-64","It looks as though you may have the 32-bit installation of Python, in which case you're issue is this: #16633. Just be aware, that pyTorch doesn't work on 32-bit systems. Please use Windows and Python 64-bit version.  Try this if you have added the Anaconda directory to your bash shell PATH environment variable:  shell  CC=clang CXX=clang++ python setup.py install  The wheel I was trying to install required 32 bit Python, I had 64 bit Python installed. Therefore, the wheel I was trying to install was not compatible with my Python version. Checking Python Version: I confirmed my Python version using the following command: python -c ""import struct; print(struct.calcsize('P') * 8)"" Checking Wheel Version: 64 bit wheels typically contain amd64 or similar in their title 32 bit wheels typically contain win32or similar in their title Switching to a 64 bit Python Installaion: The default Windows Python installer does not have a 64 bit option. To acquire 64 bit Python, navigate to python.org/downloads/windows/, and select a version that specifies x86-64 (the other versions are all 32 bit). Credit to phd for the comment that led to this solution. A Redditor had the same problem here.  Definitely a conda 3.6 thing, fresh install into a 3.5 environment works perfectly :+1: I believe this should now be resolved for `macOS`: pytorch37 ❯ pip install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html Looking in links: https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html Collecting torch  Downloading https://download.pytorch.org/whl/nightly/cpu/torch-1.5.0.dev20200110-cp37-none-macosx_10_9_x86_64.whl (82.4MB) Not entirely sure about where we are on windows nightly builds though, cc @peterjc123 Get the Microsoft Visual C++ Redistributable installer from the link in the error, in this case is this. Run the installer and launch again your shell with conda configured when finished  Update conda first with `conda update conda` and try again Problem solved. I found what was wrong and I fixed it. The whole problem lies in the fact that Anaconda distribution comes with its own ld linker that is located in /opt/anaconda/compiler_compat/ and it overshadows system ld residing at /usr/bin. To fix my error I ran python setup.py clean and then I temporarily renamed Anaconda's ld linker to ld-old to make it invisible during PyTorch installation.  @pjh5 Updating conda to 4.5.4 and then running `conda create -n pytorch python=3` instead of `conda create -n pytorch anaconda` fixed it! I'm able to properly install pytorch now by running `conda install pytorch torchvision -c pytorch`. Thanks for the help. I fixed the problem by importing opencv before torch."
4,"TypeError: add(): argument 'other' (position 1) must be Tensor, not numpy.ndarray","To solve your problem you may use: temp = list() for key, b in reader:  temp.append(torch.from_numpy(b)) labels = torch.cat(temp) For more, you can check the manual here Cheers","As the error message explains, c is a tensor. To use torch.cat() you must pass a group of tensors or a list. To solve your problem you may use: temp = list() for key, b in reader:  temp.append(torch.from_numpy(b)) labels = torch.cat(temp) For more, you can check the manual here Cheers  The pytorch LSTM returns a tuple. So you get this error as your second LSTM layer self.seq2 can not handle this tuple. So, change prefix1=self.seq1(input1) suffix1=self.seq1(input2) to something like this: prefix1_out, prefix1_states = self.seq1(input1) suffix1_out, suffix1_states = self.seq1(input2) and then pass prefix1_out and suffix1_out tensors to the next LSTM layers as prefix2_out, prefix2_states = self.seq2(prefix1_out) suffix2_out, suffix2_states = self.seq2(suffix1_out) And, concat prefix1_out and suffix1_out tensors like this result = torch.cat([out1,out2],1) Also, change r1=F.sigmoid(self.fc1(result)) r2=self.fc2(r1) to something like this: out_ll = self.fc1(result) r1 = nn.Sigmoid() r2 = self.fc2(r1(out_ll))  It looks like your input is a numpy array, not torch tensor. You need to convert it first, like input = torch.Tensor(input).  Your issue is here: signals = np.stack([src.signal for src in pst_sources], axis=1) # signals shape: [Len, n_signals] It looks like pst_sources is empty, and so you are trying to stack an empty list.  As stated by user8426627 you want to change the tensor type, not the data type. Therefore the solution was to add .type(torch.LongTensor) to convert it to a LongTensor. Final code: Ytrain_ = torch.from_numpy(Y_train.values).view(1, -1)[0].type(torch.LongTensor) Test tensor type: Ytrain_.type() 'torch.LongTensor'  I can't confirm but I believe your problem will be solved by changing: train_y = np.array(train_labels) == 'fake' test_y = np.array(test_labels) == 'fake' to: train_y = (np.array(train_labels) == 'fake').astype(int) test_y = (np.array(test_labels) == 'fake').astype(int) The train_y data is currently an array of type Bool (True or False) and the tensor needs and int (0 or 1).  I am not sure the rest of your code is alright, but in order to fix this error, you can convert your rnn_out list to a torch tensor by adding the following line after the ending of your for loop: rnn_out = torch.stack(rnn_out)  I had same issue resolved Before converting to Tensor, try this X_train = X_train.astype(np.float32)  I would suggest you to check the input type I had the same issue which solved by converting the input type from int32 to int64.(running on win10) ex: x = torch.tensor(train).to(torch.int64)  The error message says it all. The tensors involved contain elements of different data types. By default, w and b have elements of type torch.float32, while data_input is a NumPy array with the Python default floating point type, i.e. double. That datatype will be preserved when you convert with from_numpy. Try using dtype=np.float32 in your np.genfromtxt call."
