Industries are increasingly turning to automated chat assistants to handle customer support interactions. However, these tools can only answer the questions they were trained on, which exposes a growing challenge for enterprise question answering (QA) techniques. To address this, we build an intelligent question answering system that does not just return documents related to the question, but extracts the relevant information from those documents and composes a detailed answer, much like one a human would write.

In this notebook, we implement a domain-specific question-answering engine that answers questions related to PyTorch. The model is aimed at students and working professionals who are new to PyTorch. It is developed with datasets collected from Stack Overflow, PyTorch GitHub issues, the PyTorch documentation, the PyTorch discussion forum, and YouTube videos.

### 1- Preliminaries
The implementation presented here relies on the Hugging Face 🤗transformers and 🤗nlp libraries. Indexing the document corpus for dense retrieval relies on faiss. You can get all of these by running:

<!-- pip install elasticsearch -->
pip install faiss_gpu
pip install nlp
pip install transformers
<!-- 
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.1-linux-x86_64.tar.gz
tar -xzvf elasticsearch-7.7.1-linux-x86_64.tar.gz -->

In [None]:
! nvidia-smi

Sun Sep 19 03:00:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install faiss_gpu nlp transformers

Collecting faiss_gpu
  Downloading faiss_gpu-1.7.1.post2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (89.7 MB)
Collecting nlp
  Downloading nlp-0.4.0-py3-none-any.whl (1.7 MB)
Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-n

In [None]:
import functools
import math
import os  # noqa: F401
from random import choice, randint
from time import time

import numpy as np
import torch
import torch.utils.checkpoint as checkpoint
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from tqdm import tqdm

import faiss  # noqa: F401
import nlp  # noqa: F401
import pandas as pd
from transformers import AdamW, AutoModel, AutoModelForSeq2SeqLM, AutoTokenizer, get_linear_schedule_with_warmup
import json

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import sys
sys.path.append('/content/drive/MyDrive/TSAI/Capstone_1/data')

In [None]:
# from lfqa_utils import *

In [None]:
import os
print(os.getcwd())
path = '/content/drive/MyDrive/TSAI/Capstone_1/data'
os.chdir(path)

/content


In [None]:
folder = "retriever_models"
# os.chdir(path)
print("current dir is: %s" % (os.getcwd()))

if os.path.isdir(folder):
    print("retriever_models directory exists")
else:
    print("retriever_models directory Doesn't exists, creating one")
    os.mkdir(folder)

current dir is: /content/drive/MyDrive/TSAI/Capstone_1/data
retriever_models directory Doesn't exists, creating one


In [None]:
folder = "seq2seq_models"
# os.chdir(path)
print("current dir is: %s" % (os.getcwd()))

if os.path.isdir(folder):
    print("seq2seq_models directory exists")
else:
    print("seq2seq_models directory Doesn't exists, creating one")
    os.mkdir(folder)

current dir is: /content/drive/MyDrive/TSAI/Capstone_1/data
seq2seq_models directory Doesn't exists, creating one


## Data Set Preparation
One of the major tasks after collecting the data is to clean the datasets and bring all five sources into a common JSON format.

An example from the final question-answer training set looks as follows:

- Example:
  - *x: Question: Why is PyTorch better than Keras?*
  - *z: Answer Document: In my opinion, this is a very personal question. Every developer/person would have his/her comfort level while deciding on which platform suits them. I will share my opinion. I think PyTorch is better because it is pythonic, (literally like python and you do not need to learn anything new), has higher performance, strong community support, most new papers are written using PyTorch, and has dynamic computation graphs. But again, I might be biased towards it, and maybe you may like something that is better in Keras than Pytorch and might decide to go ahead with it.*
  - *y: Exact Answer: PyTorch is better because it is pythonic, has higher performance, has strong community support, most new papers are written using PyTorch, and has dynamic computation graphs.*

We save all the supporting documents (answer documents) from the train and test sets separately in another JSON file for ease of training. This is because a question might also be addressed in supporting documents other than its own obvious answer document. Storing all supporting documents separately lets us draw on every relevant document when answering a particular question (we will see how this works in the following sections).
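As a rough illustration of that step, the sketch below pools the `z` fields of several splits into one list of passages. The helper name `build_context_master`, the toy records, and the de-duplication are assumptions made for this example; the notebook does not show how `context_master.json` was actually produced.

```python
import json


def build_context_master(*splits):
    """Collect the supporting documents ("z") of every split into one list.

    Hypothetical helper: duplicates are dropped here, which is an assumption
    and not something the original pipeline is known to do.
    """
    seen = set()
    passages = []
    for split in splits:
        for example in split:
            doc = example["z"]
            if doc not in seen:
                seen.add(doc)
                passages.append({"z": doc})
    return passages


# Toy records that only mimic the x / y / z layout of the real data.
toy_train = [{"id": 1, "x": "q1", "y": "a1", "z": "doc A"}]
toy_test = [
    {"id": 1, "x": "q2", "y": "a2", "z": "doc A"},
    {"id": 2, "x": "q3", "y": "a3", "z": "doc B"},
]

context_master = build_context_master(toy_train, toy_test)
serialized = json.dumps(context_master)  # what would be written to disk
```

Keeping each passage as a small dictionary with a `z` key matches how the passages are read later in the notebook (`p["z"]`).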

In [None]:
with open(path + '/train_data.json') as f:
    train = json.load(f)
with open(path + '/test_data.json') as f:
    test = json.load(f)
with open(path + '/context_master.json') as f:
    passage_snippets = json.load(f)

In [None]:
train[0]

{'id': 1,
 'x': 'Maxout Layer',
 'y': 'The Maxout layer can be implemented as follows \npython\nclass Maxout(nn.Module):\n def __init__(self, d_in, d_out, pool_size):\n super().__init__()\n self.d_in, self.d_out, self.pool_size = d_in, d_out, pool_size\n self.lin = nn.Linear(d_in, d_out * pool_size)\n def forward(self, inputs):\n shape = list(inputs.size())\n shape[-1] = self.d_out\n shape.append(self.pool_size)\n max_dim = len(shape) - 1\n out = self.lin(inputs)\n m, i = out.view(*shape).max(max_dim)\n return m\n',
 'z': 'For ones who need Maxout, I changed the above code to make it work. \npython\nclass Maxout(nn.Module):\n def __init__(self, d_in, d_out, pool_size):\n super().__init__()\n self.d_in, self.d_out, self.pool_size = d_in, d_out, pool_size\n self.lin = nn.Linear(d_in, d_out * pool_size)\n def forward(self, inputs):\n shape = list(inputs.size())\n shape[-1] = self.d_out\n shape.append(self.pool_size)\n max_dim = len(shape) - 1\n out = self.lin(inputs)\n m, i = out.view(*sh

In [None]:
len(train)

9140

In [None]:
test[100]

{'id': 101,
 'x': 'What do Variable(tensor, requires_grad) return instead of Variables?',
 'y': 'Tensors',
 'z': 'The Variable API has been deprecated: Variables are no longer necessary to use autograd with tensors. Autograd automatically supports Tensors with requires_grad set to True. Below please find a quick guide on what has changed: Variable(tensor) and Variable(tensor, requires_grad) still work as expected, but they return Tensors instead of Variables. var.data is the same thing as tensor.data. Methods such as var.backward(), var.detach(), var.register_hook() now work on tensors with the same method names.'}

In [None]:
len(test)

2286

## Retrieving Support Documents with a Dense Retriever Model

Before generating answers for the given query, our QA system needs to find supporting documents. The Retriever module’s job is to find the best candidate documents by calculating the similarity between query and document vectors. We can use either a sparse retriever or a dense retriever to automatically find relevant document snippets for a question.

The sparse retriever works by finding passages that feature the words from the query. However, it has no way to know *a priori* which of these words are more important in context and seems to struggle with understanding the central theme of the query.
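To make the limitation concrete, here is a deliberately naive sketch of word-overlap scoring. It is an assumption-laden toy, not the scoring used by real sparse retrievers (which typically weight terms, for example with TF-IDF or BM25); the function name and the sentences are invented for illustration.

```python
def sparse_overlap_score(query, passage):
    """Count how many passage tokens also occur in the query (toy scoring)."""
    query_words = set(query.lower().split())
    return sum(1 for w in passage.lower().split() if w in query_words)


query = "how to move a tensor to the gpu"
off_topic = "the answer to that is to go to a store"
on_topic = "use cuda to move the tensor"

off_topic_score = sparse_overlap_score(query, off_topic)
on_topic_score = sparse_overlap_score(query, on_topic)
```

In this toy case the off-topic passage scores higher simply because it repeats frequent words such as "to", which is the kind of failure a learned dense representation is meant to avoid.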

Thankfully, some recent works have taken advantage of advances in pre-trained contextual word representations to solve this problem. Models such as [DPR](https://arxiv.org/abs/2004.04906) or [REALM](https://arxiv.org/abs/2002.08909) for example learn to compute a vector representation of the query, as well as vector representations of passages in such a way that the passages that best answer a question maximize the dot product between the two representations. Retrieval is then reduced to a Maximum Inner Product Search, which can be executed efficiently using systems like [FAISS](https://github.com/facebookresearch/faiss).
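The idea of Maximum Inner Product Search can be sketched with a brute-force loop in plain Python. This is only an illustration with made-up 2-dimensional vectors and a hypothetical `mips_top_k` helper; libraries such as FAISS perform the same search far more efficiently.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mips_top_k(query, passages, k):
    """Return the (index, score) pairs of the k passages with the largest dot product."""
    scored = [(inner_product(query, p), idx) for idx, p in enumerate(passages)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy embeddings, invented for the example.
toy_passages = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
toy_query = [0.5, 1.0]

top2 = mips_top_k(toy_query, toy_passages, k=2)
```

Here passage 1 ranks first and passage 0 second, because their vectors point most nearly in the direction of the query.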

These successes are very encouraging for our Open-Domain QA application. However, our setup does not quite meet the requirements of either of these approaches. On the one hand, the [DPR](https://arxiv.org/abs/2004.04906) system is trained using gold passage annotations. Unfortunately, we do not have such annotations for our data set. On the other hand, while [REALM](https://arxiv.org/abs/2002.08909) is trained without passage supervision, it requires a pretty expensive pre-training step with an [Inverse Cloze Task](https://arxiv.org/abs/1906.00300) (100,000 steps with batch size 4096), and the ability to re-compute the embeddings of all passages regularly during training.

To train a similar dense retrieval system at reduced cost without having access to gold passage annotation, we will have to **take advantage of another unique feature of our dataset**, **namely the fact that our answers are quite similar in style to the document snippets we want to index. Our answers are summarized, cleaned versions of corresponding document snippets.** Our hypothesis then is that if we train a system to embed the questions and answers in our dataset in a way that allows us to easily match questions to answers, then using the answer embedder on document snippets should allow us to similarly match questions to supporting evidence from document corpus.
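One way to picture this training signal is the in-batch contrastive loss sketched below in plain Python: every question in a batch should score its own answer higher than the answers of the other questions, and vice versa. The function names and toy vectors are assumptions for illustration; the PyTorch model in the next cell computes the same kind of symmetric cross-entropy on real embeddings.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def in_batch_contrastive_loss(q_reps, a_reps):
    """Symmetric cross-entropy where answer i is the only positive for question i."""
    scores = [[sum(qi * ai for qi, ai in zip(q, a)) for a in a_reps] for q in q_reps]
    n = len(scores)
    # question -> answer direction
    loss_qa = -sum(log_softmax_row(scores[i])[i] for i in range(n)) / n
    # answer -> question direction (transpose the score matrix)
    scores_t = [list(col) for col in zip(*scores)]
    loss_aq = -sum(log_softmax_row(scores_t[i])[i] for i in range(n)) / n
    return (loss_qa + loss_aq) / 2


questions = [[4.0, 0.0], [0.0, 4.0]]
aligned = in_batch_contrastive_loss(questions, [[4.0, 0.0], [0.0, 4.0]])
shuffled = in_batch_contrastive_loss(questions, [[0.0, 4.0], [4.0, 0.0]])
```

The loss is close to zero when each question matches its own answer and large when the answers are swapped, which is the behavior the retriever is trained to produce.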

In [None]:
###############
# retriever training
###############
class DatasetQARetriver(Dataset):
    def __init__(self, examples_array, num_rows, extra_answer_threshold=2, min_answer_length=1, training=True, n_samples=None):
        self.data = examples_array
        self.answer_thres = extra_answer_threshold
        self.min_length = min_answer_length
        self.training = training
        self.n_samples = num_rows if n_samples is None else n_samples
        self.num_rows = num_rows

    def __len__(self):
        return self.n_samples

    def make_example(self, idx):
        example = self.data[idx]
        question = example["x"]
        answer = example["y"]
        return (question, answer)

    def __getitem__(self, idx):
        return self.make_example(idx % self.num_rows)


class RetrievalQAEmbedder(torch.nn.Module):
    def __init__(self, sent_encoder, dim):
        super(RetrievalQAEmbedder, self).__init__()
        self.sent_encoder = sent_encoder
        self.output_dim = 128
        self.project_q = torch.nn.Linear(dim, self.output_dim, bias=False)
        self.project_a = torch.nn.Linear(dim, self.output_dim, bias=False)
        self.ce_loss = torch.nn.CrossEntropyLoss(reduction="mean")

    def embed_sentences_checkpointed(self, input_ids, attention_mask, checkpoint_batch_size=-1):
        # reproduces BERT forward pass with checkpointing
        if checkpoint_batch_size < 0 or input_ids.shape[0] < checkpoint_batch_size:
            return self.sent_encoder(input_ids, attention_mask=attention_mask)[1]
        else:
            # prepare implicit variables
            device = input_ids.device
            input_shape = input_ids.size()
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
            head_mask = [None] * self.sent_encoder.config.num_hidden_layers
            extended_attention_mask: torch.Tensor = self.sent_encoder.get_extended_attention_mask(
                attention_mask, input_shape, device
            )

            # define function for checkpointing
            def partial_encode(*inputs):
                encoder_outputs = self.sent_encoder.encoder(inputs[0], attention_mask=inputs[1], head_mask=head_mask,)
                sequence_output = encoder_outputs[0]
                pooled_output = self.sent_encoder.pooler(sequence_output)
                return pooled_output

            # run embedding layer on everything at once
            embedding_output = self.sent_encoder.embeddings(
                input_ids=input_ids, position_ids=None, token_type_ids=token_type_ids, inputs_embeds=None
            )
            # run encoding and pooling on one mini-batch at a time
            pooled_output_list = []
            for b in range(math.ceil(input_ids.shape[0] / checkpoint_batch_size)):
                b_embedding_output = embedding_output[b * checkpoint_batch_size : (b + 1) * checkpoint_batch_size]
                b_attention_mask = extended_attention_mask[b * checkpoint_batch_size : (b + 1) * checkpoint_batch_size]
                pooled_output = checkpoint.checkpoint(partial_encode, b_embedding_output, b_attention_mask)
                pooled_output_list.append(pooled_output)
            return torch.cat(pooled_output_list, dim=0)

    def embed_questions(self, q_ids, q_mask, checkpoint_batch_size=-1):
        q_reps = self.embed_sentences_checkpointed(q_ids, q_mask, checkpoint_batch_size)
        return self.project_q(q_reps)

    def embed_answers(self, a_ids, a_mask, checkpoint_batch_size=-1):
        a_reps = self.embed_sentences_checkpointed(a_ids, a_mask, checkpoint_batch_size)
        return self.project_a(a_reps)

    def forward(self, q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=-1):
        device = q_ids.device
        q_reps = self.embed_questions(q_ids, q_mask, checkpoint_batch_size)
        a_reps = self.embed_answers(a_ids, a_mask, checkpoint_batch_size)
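        # pairwise dot-product scores between every question and every answer in the batch
        # (the vectors are not normalized, so this is an inner product rather than cosine similarity)
        compare_scores = torch.mm(q_reps, a_reps.t())
        # cross-entropy loss: each question should rank its own answer first ...
        loss_qa = self.ce_loss(compare_scores, torch.arange(compare_scores.shape[1]).to(device))
        # ... and each answer should rank its own question first
        loss_aq = self.ce_loss(compare_scores.t(), torch.arange(compare_scores.shape[0]).to(device))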
        loss = (loss_qa + loss_aq) / 2
        return loss


def make_qa_retriever_model(model_name="google/bert_uncased_L-8_H-512_A-8", from_file=None, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    bert_model = AutoModel.from_pretrained(model_name).to(device)
    # run bert_model on a dummy batch to get output dimension
    d_ids = torch.LongTensor(
        [[bert_model.config.bos_token_id if bert_model.config.bos_token_id is not None else 1]]
    ).to(device)
    d_mask = torch.LongTensor([[1]]).to(device)
    sent_dim = bert_model(d_ids, attention_mask=d_mask)[1].shape[-1]
    qa_embedder = RetrievalQAEmbedder(bert_model, sent_dim).to(device)
    if from_file is not None:
        param_dict = torch.load(from_file)  # has model weights, optimizer, and scheduler states
        qa_embedder.load_state_dict(param_dict["model"])
    return tokenizer, qa_embedder


def make_qa_retriever_batch(qa_list, tokenizer, max_len=128, device="cuda"):
    q_ls = [q for q, a in qa_list]
    a_ls = [a for q, a in qa_list]

    # pad or truncate every question to exactly max_len tokens
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, padding="max_length", truncation=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )

    # same treatment for the answers
    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=max_len, padding="max_length", truncation=True)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )

    return (q_ids, q_mask, a_ids, a_mask)


def train_qa_retriever_epoch(model, dataset, tokenizer, optimizer, scheduler, args, e=0):
    model.train()
    # make iterator
    train_sampler = RandomSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    # print(next(iter(data_loader)).shape)
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    for step, batch in enumerate(epoch_iterator):
        # print("q_ids",q_ids.shape)
        # print(" q_mask,", q_mask.shape)
        # print("A_id", a_ids.shape)
        q_ids, q_mask, a_ids, a_mask = batch
        pre_loss = model(q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=args.checkpoint_batch_size)
        loss = pre_loss.sum()
        # optimizer
        loss.backward()
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        # some printing within the epoch
        loc_loss += loss.item()
        loc_steps += 1
        if step % args.print_freq == 0 or step == 1:
            print(
                "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                    e, step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                )
            )
            loc_loss = 0
            loc_steps = 0


def evaluate_qa_retriever(model, dataset, tokenizer, args):
    model.eval()
    # make iterator
    eval_sampler = SequentialSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=eval_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    tot_loss = 0.0
    with torch.no_grad():
        for step, batch in enumerate(epoch_iterator):
            q_ids, q_mask, a_ids, a_mask = batch
            loss = model(q_ids, q_mask, a_ids, a_mask)
            tot_loss += loss.item()
        return tot_loss / (step + 1)


def train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args):
    qar_optimizer = AdamW(qar_model.parameters(), lr=qar_args.learning_rate, eps=1e-8)
    qar_scheduler = get_linear_schedule_with_warmup(
        qar_optimizer,
        num_warmup_steps=100,
        num_training_steps=(qar_args.num_epochs + 1) * math.ceil(len(qar_train_dset) / qar_args.batch_size),
    )
    for e in range(qar_args.num_epochs):
        train_qa_retriever_epoch(qar_model, qar_train_dset, qar_tokenizer, qar_optimizer, qar_scheduler, qar_args, e)
        m_save_dict = {
            "model": qar_model.state_dict(),
            "optimizer": qar_optimizer.state_dict(),
            "scheduler": qar_scheduler.state_dict(),
        }
        print("Saving model {}".format(qar_args.model_save_name))
        # torch.save(m_save_dict, "{}_{}.pth".format(qar_args.model_save_name, e))
        eval_loss = evaluate_qa_retriever(qar_model, qar_valid_dset, qar_tokenizer, qar_args)
        print("Evaluation loss epoch {:4d}: {:.3f}".format(e, eval_loss))

In [None]:
# training arguments
class ArgumentsQAR():
    def __init__(self):
        self.batch_size = 512
        self.max_length = 128
        self.checkpoint_batch_size = 32
        self.print_freq = 100
        self.pretrained_model_name = "google/bert_uncased_L-8_H-768_A-12"
        self.model_save_name = "retriever_model_l-8_h-768_b-512-512"
        self.learning_rate = 2e-4
        self.num_epochs = 10

qar_args = ArgumentsQAR()

# prepare torch Dataset objects
qar_train_dset = DatasetQARetriver(train,num_rows=len(train), training=True)
qar_valid_dset = DatasetQARetriver(test,num_rows=len(test), training=False)

# load pre-trained BERT and make model
qar_tokenizer, qar_model = make_qa_retriever_model(
        model_name=qar_args.pretrained_model_name,
        from_file=None,
        device="cuda"
)

# train the model
train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args)

Downloading:   0%|          | 0.00/384 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/327M [00:00<?, ?B/s]

Some weights of the model checkpoint at google/bert_uncased_L-8_H-768_A-12 were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncat

 0     0 of    17 	 L: 6.387 	 -- 10.484
 0     1 of    17 	 L: 6.375 	 -- 20.875
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    0: 5.888
 1     0 of    17 	 L: 5.996 	 -- 10.389
 1     1 of    17 	 L: 5.982 	 -- 20.901
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    1: 5.510
 2     0 of    17 	 L: 5.566 	 -- 10.400
 2     1 of    17 	 L: 5.530 	 -- 20.779
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    2: 5.087
 3     0 of    17 	 L: 5.060 	 -- 10.392
 3     1 of    17 	 L: 5.112 	 -- 20.781
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    3: 4.851
 4     0 of    17 	 L: 4.527 	 -- 10.381
 4     1 of    17 	 L: 4.361 	 -- 20.765
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    4: 4.590
 5     0 of    17 	 L: 3.759 	 -- 10.395
 5     1 of    17 	 L: 3.641 	 -- 20.770
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    5: 4.770
 6     0 o

In [None]:
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

Once the model is trained, it can be used to compute passage embeddings for the entire document corpus. The make_qa_dense_index function takes advantage of numpy memory-mapping, so embeddings are written directly to disk. Again with a single GPU, computing the full set of passage embeddings should take about 1 hour.

In [None]:

###############
# trained retrieval model usage
###############
def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=512, device="cuda"):
    # pad or truncate each passage to max_length tokens, then embed it with the answer projection
    a_toks = tokenizer.batch_encode_plus(passages, max_length=max_length, padding="max_length", truncation=True)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )
    with torch.no_grad():
        a_reps = qa_embedder.embed_answers(a_ids, a_mask).cpu().type(torch.float)
    return a_reps.numpy()


def embed_questions_for_retrieval(q_ls, tokenizer, qa_embedder, device="cuda"):
    # questions are embedded with the question projection
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=128, padding="max_length", truncation=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )
    with torch.no_grad():
        q_reps = qa_embedder.embed_questions(q_ids, q_mask).cpu().type(torch.float)
    return q_reps.numpy()


def make_qa_dense_index(
    qa_embedder,
    tokenizer,
    passages_dset,
    batch_size=512,
    max_length=128,
    index_name="kilt_passages_reps.dat",
    dtype="float32",
    device="cuda",
):
    st_time = time()
    # memory-mapped array: one 128-dimensional row per passage, written straight to disk
    fp = np.memmap(index_name, dtype=dtype, mode="w+", shape=(len(passages_dset), 128))
    n_batches = math.ceil(len(passages_dset) / batch_size)
    for i in range(n_batches):
        passages = [p["z"] for p in passages_dset[i * batch_size : (i + 1) * batch_size]]
        reps = embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length, device)
        fp[i * batch_size : (i + 1) * batch_size] = reps
        if i % 50 == 0:
            print(i, time() - st_time)
    fp.flush()  # make sure every embedding has been written to disk
In [None]:
# os.chdir(r'/content/drive/MyDrive/TSAI/Capstone_1')

In [None]:
if not os.path.isfile('passages_reps_32_l-8_h-768_b-512-512.dat'):
  print("hi")

  make_qa_dense_index(
          qar_model, qar_tokenizer, passage_snippets, device='cuda',
          index_name='passages_reps_32_l-8_h-768_b-512-512.dat' )

hi




0 1.4317922592163086


## Using the Trained Dense Retriever and Document Index
Now that we have trained our model to compute query and answer embeddings and used it to compute passage embeddings for all our document snippets, let's see whether it can actually find supporting evidence for a new question. Recall the two steps to using the dense retriever: we first compute an embedding for a new question, then do Max Inner Product Search against the pre-computed passage representations.

The MIPS part can be executed efficiently with the faiss library. Additionally, since we computed 128-dimensional passage embeddings, the whole of the representations fits on a GPU, making retrieval even faster. We can create the faiss_gpu index with the following code:
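A quick back-of-the-envelope check shows why the index fits in GPU memory. The passage count below is a hypothetical placeholder (the real value is `len(passage_snippets)`), and the helper name is invented for this sketch; the 128 dimensions and 4-byte `float32` values come from the indexing code above.

```python
def index_size_bytes(num_passages, dim=128, bytes_per_value=4):
    """Memory needed for a flat float32 index with one dim-sized row per passage."""
    return num_passages * dim * bytes_per_value


hypothetical_num_passages = 100_000  # assumption, not the real corpus size
size_mib = index_size_bytes(hypothetical_num_passages) / (1024 ** 2)
print(f"{size_mib:.1f} MiB")  # prints "48.8 MiB"
```

Even at that size, the index would use only a small fraction of the 16280 MiB reported by `nvidia-smi` at the top of the notebook.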

In [None]:
n_ret = 5

In [None]:
faiss_res = faiss.StandardGpuResources()
passage_reps = np.memmap(
    'passages_reps_32_l-8_h-768_b-512-512.dat',
    dtype='float32',
    mode='r',
    shape=(len(passage_snippets), 128),
)

doc_index_flat = faiss.IndexFlatIP(128)
doc_gpu_index = faiss.index_cpu_to_gpu(faiss_res, 0, doc_index_flat)
doc_gpu_index.add(passage_reps)

In [None]:

def query_qa_dense_index(
    question, qa_embedder, tokenizer, doc_passages, doc_index, n_results=n_ret, min_length=1, device="cuda"
):
    q_rep = embed_questions_for_retrieval([question], tokenizer, qa_embedder, device=device)
    # retrieve twice as many candidates as needed so that very short passages can be filtered out
    D, I = doc_index.search(q_rep, 2 * n_results)
    res_passages = [doc_passages[int(i)] for i in I[0]]
    support_doc = "<P> " + " <P> ".join([p["z"] for p in res_passages])
    # pair each passage with its own score before filtering, so scores stay aligned with passages
    res_list = [
        {"z": p["z"], "score": float(sc)}
        for p, sc in zip(res_passages, D[0])
        if len(p["z"].split()) > min_length
    ][:n_results]
    return support_doc, res_list

Now we can use the query_qa_dense_index function to query the dense index for our running example question:

In [None]:
question = test[90]['x']
question

'Parameter not registering if .to(device) is used'

In [None]:
doc, res_list = query_qa_dense_index(question, qar_model, qar_tokenizer, passage_snippets, doc_gpu_index, device='cuda')
print(res_list)
df = pd.DataFrame({
    
    'Text': ['--- ' + question] + [res['z'] for res in res_list],
})
df.style.set_properties(**{'text-align': 'left'})

[{'z': "It's supposed to be `torch.device`", 'score': 27.264524459838867}, {'z': 'your input is on gpu 1 but your net work is on gpu 0.', 'score': 25.201988220214844}, {'z': 'As a data point, in the current C++ API for CUDAStreamGuard, if you pass a stream that lives on a different device, we change *both* the device and the stream.', 'score': 24.98475456237793}, {'z': "DataParallel requires every input tensor be provided on the first device in its device_ids list.\n\nIt basically uses that device as a staging area before scattering to the other GPUs and it's the device where final outputs are gathered before returning from forward. If you want device 2 to be the primary device then you just need to put it at the front of the list as follows\n\nmodel = nn.DataParallel(model, device_ids = [2, 0, 1, 3])\nmodel.to(f'cuda:{model.device_ids[0]}')\n\n\nAfter which all tensors provided to model should be on the first device as well.\n\nx = ... # input tensor\nx = x.to(f'cuda:{model.device_ids



|   | Text |
|---|---|
| 0 | --- Parameter not registering if .to(device) is used |
| 1 | It's supposed to be `torch.device` |
| 2 | your input is on gpu 1 but your net work is on gpu 0. |
| 3 | As a data point, in the current C++ API for CUDAStreamGuard, if you pass a stream that lives on a different device, we change *both* the device and the stream. |
| 4 | DataParallel requires every input tensor be provided on the first device in its device_ids list. It basically uses that device as a staging area before scattering to the other GPUs and it's the device where final outputs are gathered before returning from forward. If you want device 2 to be the primary device then you just need to put it at the front of the list as follows model = nn.DataParallel(model, device_ids = [2, 0, 1, 3]) model.to(f'cuda:{model.device_ids[0]}') After which all tensors provided to model should be on the first device as well. x = ... # input tensor x = x.to(f'cuda:{model.device_ids[0]}') y = model(x) |
| 5 | According to the documentation for torch.cuda.device device (torch.device or int) – device index to select. It’s a no-op if this argument is a negative integer or None. Based on that we could use something like with torch.cuda.device(self.device if self.device.type == 'cuda' else None):  # do a bunch of stuff which would simply be a no-op when self.device isn't a CUDA device. |


## Retriever Model Evaluation
We have trained a retrieval model that *seems* to be working fine, at least on our running example. Before we use it to answer questions, however, we would like to be able to get some quantitative evaluation of the performance of our dense retriever model.

For the retriever, we favor recall over precision: our priority is to make sure that all of the information needed to write the answer is present in the support document. If there is unrelated information, the generation model can learn to sort it out. We measure this by computing the proportion of words in the reference answers that are present in the retrieved support document. To focus on important words, we also weight answer words by their *Inverse Document Frequency*. This gives us the following IDF-recall scoring function:
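Written as a formula, and assuming the notation $f(w)$ (introduced here, not part of the code) for the number of times word $w$ occurs across all test-set answers, the score for a support document $d$ and an answer $a$ can be sketched as:

$$
\text{IDF-recall}(d, a) \;=\; \frac{\displaystyle\sum_{w \in a,\; w \in d} \frac{1}{\log\bigl(1 + f(w)\bigr)}}{\displaystyle\sum_{w \in a} \frac{1}{\log\bigl(1 + f(w)\bigr)}}
$$

Both sums run over the lowercased, whitespace-separated tokens of $a$, so a rare answer word that appears in the support document raises the score more than a common one. The cell that follows implements this computation.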

In [None]:

# We first collect the (question, reference answer) pairs from the test set
test_qa_list = [(exple['x'], exple['y']) for exple in test]

# We then compute word frequencies in answer text
answer_doc_freq = {}
for q, a in test_qa_list:
    for w in a.lower().split():
        answer_doc_freq[w] = answer_doc_freq.get(w, 0) + 1

# The IDF-recall function is then:
def da_idf_recall(doc, answer):
    d_words = dict([(w, True) for w in doc.lower().split()])
    a_words = answer.lower().split()   
    recall = sum([1. / math.log(1 + answer_doc_freq.get(w, 1)) for w in a_words if w in d_words]) / \
                sum([1. / math.log(1 + answer_doc_freq.get(w, 1)) for w in a_words])
    return recall
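
To see what this score rewards, here is a minimal, self-contained sketch on toy data. The frequency table and the sentences are made up for illustration, and `idf_recall` is a hypothetical stand-alone variant of `da_idf_recall` that takes the frequency table as an argument. Retrieving a rare answer word should score higher than retrieving a common one.

```python
import math

def idf_recall(doc, answer, doc_freq):
    """Weighted share of answer words that also occur in doc."""
    doc_words = set(doc.lower().split())
    answer_words = answer.lower().split()

    def weight(word):
        # rare words (low frequency) receive a larger weight
        return 1.0 / math.log(1 + doc_freq.get(word, 1))

    total = sum(weight(w) for w in answer_words)
    found = sum(weight(w) for w in answer_words if w in doc_words)
    return found / total

# Toy frequencies (assumed): "the" is common, "detach" is rare.
toy_freq = {"the": 50, "tensor": 5, "detach": 1}
answer = "detach the tensor"

hit_rare = idf_recall("call detach first", answer, toy_freq)
hit_common = idf_recall("the quick fox", answer, toy_freq)
print(hit_rare > hit_common)
```

A document that contains every answer word scores 1.0, and one that contains none scores 0.0.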

In [None]:
def evaluate_retriever(qa_list, retriever_func, scoring_func, n_ret, verbose=False):
    # average scoring_func and the retrieval time over all (question, answer) pairs
    total_retriever_time = 0.0
    total_retriever_score = 0.0
    st_time = time()
    for i, (question, answer) in enumerate(qa_list):
        r_time = time()
        retrieved_passages = retriever_func(question, n_ret)
        total_retriever_time += time() - r_time
        total_retriever_score += scoring_func(retrieved_passages, answer)
        if verbose and ((i + 1) % 500 == 0 or i <= 1):
            print(
                "{:03d}: S-{:.4f} T-{:.4f} | {:.2f}".format(
                    i + 1, total_retriever_score / (i + 1), total_retriever_time / (i + 1), time() - st_time
                )
            )
    return {"idf_recall": total_retriever_score / (i + 1), "retrieval_time": total_retriever_time / (i + 1)}

In [None]:
def dense_ret_for_eval(question, n_ret):
    # retrieve n_ret passages and join them into a single support document
    _, dense_res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer, passage_snippets, doc_gpu_index, n_results=n_ret, device='cuda'
    )
    dense_doc = " ".join([p["z"] for p in dense_res_list])
    return dense_doc


dense_score = evaluate_retriever(test_qa_list, dense_ret_for_eval, da_idf_recall, n_ret)

df = pd.DataFrame({
    'IDF-Recall': [dense_score['idf_recall']],
    'Time/Query': [dense_score['retrieval_time']],
}, index=[ 'Dense'])
df.style.format({'IDF-Recall': "{:.4f}", 'Time/Query': "{:.4f}"})




Unnamed: 0,IDF-Recall,Time/Query
Sparse,0.254,0.0093


## Generating Answers with a Sequence-to-Sequence Model

In [None]:
# generator seq2seq model training
###############
class DatasetS2S(Dataset):
    def __init__(
        self, examples_array,num_rows, make_doc_fun=None, doc_cache=None, training=True
    ):
        self.training = training
        self.data = examples_array
        self.make_doc_function = make_doc_fun
        self.doc_cache = {} if doc_cache is None else doc_cache
        self.num_rows = num_rows
        assert not (make_doc_fun is None and doc_cache is None)
        # each example has a single answer, so every index maps to answer 0
        # (the same index list is used for training and evaluation)
        self.qa_id_list = [(i, 0) for i in range(self.num_rows)]

    def __len__(self):
        return len(self.qa_id_list)

    def make_example(self, idx):
        i, j = self.qa_id_list[idx]
        example = self.data[i]
        question = example["x"] 
        answer = example["y"]
        q_id = example["id"]
        if self.make_doc_function is not None and q_id not in self.doc_cache:
            # build the support document only on a cache miss
            self.doc_cache[q_id] = self.make_doc_function(example["x"])
        document = self.doc_cache[q_id]
        in_st = "question: {} context: {}".format(
            question.lower().strip(), document.lower().strip(),
        )
        out_st = answer
        return (in_st, out_st)

    def __getitem__(self, idx):
        return self.make_example(idx)


def make_qa_s2s_model(model_name="facebook/bart-large", from_file=None, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    if from_file is not None:
        param_dict = torch.load(from_file)  # has model weights, optimizer, and scheduler states
        model.load_state_dict(param_dict["model"])
    return tokenizer, model


def make_qa_s2s_batch(qa_list, tokenizer, max_len=64, max_a_len=128, device="cuda"):
    q_ls = [q for q, a in qa_list]
    a_ls = [a for q, a in qa_list]
    # pad and truncate explicitly (pad_to_max_length is deprecated in recent transformers releases)
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, padding="max_length", truncation=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )
    a_toks = tokenizer.batch_encode_plus(
        a_ls, max_length=min(max_len, max_a_len), padding="max_length", truncation=True
    )
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )
    # labels are the answer tokens shifted left by one; padded positions are set to -100 so the loss ignores them
    labels = a_ids[:, 1:].contiguous().clone()
    labels[a_mask[:, 1:].contiguous() == 0] = -100
    model_inputs = {
        "input_ids": q_ids,
        "attention_mask": q_mask,
        "decoder_input_ids": a_ids[:, :-1].contiguous(),
        "labels": labels,
    }
    return model_inputs


def train_qa_s2s_epoch(model, dataset, tokenizer, optimizer, scheduler, args, e=0, curriculum=True):
    model.train()
    # make iterator
    if curriculum:
        train_sampler = SequentialSampler(dataset)
    else:
        train_sampler = RandomSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_s2s_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)

  
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    for step, batch_inputs in enumerate(epoch_iterator):
        # the model returns a scalar loss; unsqueeze gives it a leading dimension of 1,
        # so the division below leaves the value unchanged
        pre_loss = model(**batch_inputs)[0].unsqueeze(dim=0)
        loss = pre_loss.sum() / pre_loss.shape[0]
        loss.backward()
        # gradient accumulation: update the weights only every backward_freq batches
        if step % args.backward_freq == 0:
            optimizer.step()
            scheduler.step()
            model.zero_grad()
        # some printing within the epoch
        loc_loss += loss.item()
        loc_steps += 1
        if step % args.print_freq == 0 or step == 1:
            print(
                "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                    e, step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                )
            )
            loc_loss = 0
            loc_steps = 0


def eval_qa_s2s_epoch(model, dataset, tokenizer, args):
    model.eval()
    # make iterator
    eval_sampler = SequentialSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_s2s_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=eval_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    # accumulate loss over the whole evaluation set
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    with torch.no_grad():
        for step, batch_inputs in enumerate(epoch_iterator):
            pre_loss = model(**batch_inputs)[0].unsqueeze(dim=0)
            loss = pre_loss.sum() / pre_loss.shape[0]
            loc_loss += loss.item()
            loc_steps += 1
            if step % args.print_freq == 0:
                print(
                    "{:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                        step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                    )
                )
    print("Total \t L: {:.3f} \t -- {:.3f}".format(loc_loss / loc_steps, time() - st_time,))


def train_qa_s2s(qa_s2s_model, qa_s2s_tokenizer, s2s_train_dset, s2s_valid_dset, s2s_args):
    s2s_optimizer = AdamW(qa_s2s_model.parameters(), lr=s2s_args.learning_rate, eps=1e-8)
    s2s_scheduler = get_linear_schedule_with_warmup(
        s2s_optimizer,
        num_warmup_steps=400,
        num_training_steps=(s2s_args.num_epochs + 1) * math.ceil(len(s2s_train_dset) / s2s_args.batch_size),
    )
    for e in range(s2s_args.num_epochs):
        # print((e == 0))

        train_qa_s2s_epoch(
            qa_s2s_model,
            s2s_train_dset,
            qa_s2s_tokenizer,
            s2s_optimizer,
            s2s_scheduler,
            s2s_args,
            e,
            curriculum=True,
        )
        m_save_dict = {
            "model": qa_s2s_model.state_dict(),
            "optimizer": s2s_optimizer.state_dict(),
            "scheduler": s2s_scheduler.state_dict(),
        }
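        # NOTE: checkpoint saving is disabled; uncomment torch.save below to write m_save_dict to disk
        print("Finished epoch {} for model {}".format(e, s2s_args.model_save_name))
        eval_qa_s2s_epoch(qa_s2s_model, s2s_valid_dset, qa_s2s_tokenizer, s2s_args)
        # torch.save(m_save_dict, "{}_{}.pth".format(s2s_args.model_save_name, e))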
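
The label construction in `make_qa_s2s_batch` above sets up teacher forcing: the decoder is fed answer tokens up to the second-to-last position and is trained to predict the tokens one position ahead, while padded targets are replaced by -100, the default `ignore_index` of PyTorch's cross-entropy loss. The following is a minimal sketch with plain Python lists rather than tensors; `shift_and_mask` is a hypothetical helper and the token ids are assumed for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def shift_and_mask(token_ids, attention_mask):
    """Return (decoder_input_ids, labels) for one padded answer sequence."""
    decoder_input_ids = token_ids[:-1]          # drop the last position
    labels = [
        tok if keep == 1 else IGNORE_INDEX      # mask out padding targets
        for tok, keep in zip(token_ids[1:], attention_mask[1:])
    ]
    return decoder_input_ids, labels

# Assumed ids: 0 = <s>, 2 = </s>, 1 = <pad>; 11 and 12 are ordinary tokens.
ids = [0, 11, 12, 2, 1, 1]
mask = [1, 1, 1, 1, 0, 0]
dec_in, labels = shift_and_mask(ids, mask)
print(dec_in)   # [0, 11, 12, 2, 1]
print(labels)   # [11, 12, 2, -100, -100]
```

Both outputs have the same length, so position `t` of the decoder input lines up with the token it should predict at position `t` of the labels.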



In [None]:
n_ret = 2

In [None]:
# pre-computing support documents
qna_train_docs = []
for example in train:
    support_doc, dense_res_list = query_qa_dense_index(
        example['x'], qar_model, qar_tokenizer,passage_snippets, doc_gpu_index, n_results=n_ret
    )
    qna_train_docs += [(example['id'], support_doc, dense_res_list)]

qna_valid_docs = []
for example in test:
    support_doc, dense_res_list = query_qa_dense_index(
        example['x'], qar_model, qar_tokenizer, passage_snippets, doc_gpu_index, n_results=n_ret
    )
    qna_valid_docs += [(example['id'], support_doc, dense_res_list)]

# training loop proper
class ArgumentsS2S():
    def __init__(self):
        self.batch_size = 2
        self.backward_freq = 16
        self.max_length = 512
        self.print_freq = 100
        self.model_save_name = "s2s_bart_model"
        self.learning_rate = 2e-4
        self.num_epochs = 5

s2s_args = ArgumentsS2S()

# qna_train_docs = json.load(open('precomputed/qna_train_precomputed_dense_docs.json'))
# qna_valid_docs = json.load(open('precomputed/qna_valid_precomputed_dense_docs.json'))
s2s_train_dset = DatasetS2S(train, num_rows=len(train), doc_cache=dict([(k, d) for k, d, src_ls in qna_train_docs]))
s2s_valid_dset = DatasetS2S(test, num_rows=len(test), doc_cache=dict([(k, d) for k, d, src_ls in qna_valid_docs]), training=False)

qa_s2s_tokenizer, pre_model = make_qa_s2s_model(
    model_name="facebook/bart-large",
    from_file=None,
    device="cuda"
)
# qa_s2s_model = torch.nn.DataParallel(pre_model)
qa_s2s_model = pre_model
train_qa_s2s(qa_s2s_model, qa_s2s_tokenizer, s2s_train_dset, s2s_valid_dset, s2s_args)




 0     0 of  4570 	 L: 10.090 	 -- 0.436
 0     1 of  4570 	 L: 19.756 	 -- 0.749
 0   100 of  4570 	 L: 5.388 	 -- 33.072
 0   200 of  4570 	 L: 4.342 	 -- 65.793
 0   300 of  4570 	 L: 4.027 	 -- 98.213
 0   400 of  4570 	 L: 3.936 	 -- 130.459
 0   500 of  4570 	 L: 3.783 	 -- 162.676
 0   600 of  4570 	 L: 3.532 	 -- 195.352
 0   700 of  4570 	 L: 3.236 	 -- 227.447
 0   800 of  4570 	 L: 2.926 	 -- 259.430
 0   900 of  4570 	 L: 3.050 	 -- 291.820
 0  1000 of  4570 	 L: 2.854 	 -- 324.104
 0  1100 of  4570 	 L: 3.297 	 -- 356.097
 0  1200 of  4570 	 L: 3.502 	 -- 387.916
 0  1300 of  4570 	 L: 3.090 	 -- 419.629
 0  1400 of  4570 	 L: 2.913 	 -- 452.050
 0  1500 of  4570 	 L: 2.895 	 -- 484.833
 0  1600 of  4570 	 L: 2.804 	 -- 517.658
 0  1700 of  4570 	 L: 2.789 	 -- 550.460
 0  1800 of  4570 	 L: 3.015 	 -- 583.275
 0  1900 of  4570 	 L: 2.946 	 -- 615.972
 0  2000 of  4570 	 L: 2.554 	 -- 648.777
 0  2100 of  4570 	 L: 2.753 	 -- 681.486
 0  2200 of  4570 	 L: 2.639 	 -- 714.2
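
For reference, the learning-rate multiplier of a linear schedule with warmup can be sketched in a few lines. This is a re-implementation for illustration, not the library code, and `linear_warmup_factor` is a hypothetical name. It assumes the settings shown above: 400 warmup steps, `backward_freq = 16`, 5 epochs, and the 4570 batches per epoch reported in the log. Because the scheduler is stepped only on every 16th batch, warmup is still in progress at the end of the first epoch.

```python
import math

def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given scheduler step."""
    if step < warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # linear decay from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

batches_per_epoch = 4570      # taken from the training log above
backward_freq = 16            # scheduler.step() runs on every 16th batch
steps_per_epoch = math.ceil(batches_per_epoch / backward_freq)
total_steps = (5 + 1) * batches_per_epoch   # value passed as num_training_steps

factor_after_epoch = linear_warmup_factor(steps_per_epoch, 400, total_steps)
print(steps_per_epoch)        # 286
print(factor_after_epoch)     # 0.715
```

Under these assumptions the scheduler takes far fewer steps than `num_training_steps`, so the learning rate decays only slightly after warmup ends.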

In [None]:
import psutil

def get_size(num_bytes, suffix="B"):
    # scale a byte count to the largest fitting unit (K, M, G, ...)
    factor = 1024
    for unit in ["", "K", "M", "G", "T", "P"]:
        if num_bytes < factor:
            return f"{num_bytes:.2f}{unit}{suffix}"
        num_bytes /= factor

print("="*40, "Memory Information", "="*40)
svmem = psutil.virtual_memory()
print(f"Total: {get_size(svmem.total)}")
print(f"Available: {get_size(svmem.available)}")
print(f"Used: {get_size(svmem.used)}")
print(f"Percentage: {svmem.percent}%")

Total: 25.46GB
Available: 22.28GB
Used: 5.57GB
Percentage: 12.5%


In [None]:
import torch
torch.cuda.empty_cache()

We now have everything we need to answer questions. Let's try the full system on the first 100 questions of the test set:

In [None]:
# generate answer from input "question: ... context: <p> ..."
def qa_s2s_generate(
    question_doc,
    qa_s2s_model,
    qa_s2s_tokenizer,
    num_answers=1,
    num_beams=None,
    min_len=64,
    max_len=512,
    do_sample=False,
    temp=1.0,
    top_p=None,
    top_k=None,
    max_input_length=1024,
    device="cuda:0",
):
    model_inputs = make_qa_s2s_batch([(question_doc, "A")], qa_s2s_tokenizer, max_input_length, device=device,)
    n_beams = num_answers if num_beams is None else max(num_beams, num_answers)
    generated_ids = qa_s2s_model.generate(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        min_length=min_len,
        max_length=max_len,
        do_sample=do_sample,
        early_stopping=True,
        num_beams=1 if do_sample else n_beams,
        temperature=temp,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=qa_s2s_tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        num_return_sequences=num_answers,
        decoder_start_token_id=qa_s2s_tokenizer.bos_token_id,
    )
    return [qa_s2s_tokenizer.decode(ans_ids, skip_special_tokens=True).strip() for ans_ids in generated_ids]

In [None]:
questions = []
pred_answers = []
act_answers = []
docs = []

for i in range(100):
    # create support document with the dense index
    question = test[i]['x']
    act_answer = test[i]['y']
    doc, res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer,
        passage_snippets, doc_gpu_index, device='cuda'
    )
    # concatenate question and support document into BART input
    question_doc = "question: {} context: {}".format(question, doc)
    # generate an answer with beam search
    answer = qa_s2s_generate(
            question_doc, qa_s2s_model, qa_s2s_tokenizer,
            num_answers=1,
            num_beams=8,
            min_len=64,
            max_len=256,
            max_input_length=1024,
            device="cuda:0"
    )[0]
    questions += [question]
    pred_answers += [answer]
    act_answers += [act_answer]
    docs += [doc]

df = pd.DataFrame({
    'Question': questions,
    'Predicted Answer': pred_answers,
    'Actual Answer' :  act_answers,
    'Documents' :docs
})
df.style.set_properties(**{'text-align': 'left'})



Unnamed: 0,Question,Predicted Answer,Actual Answer,Documents
0,"""exp_cuda"" not implemented for 'ComplexDouble'","You need to overwrite the reference with a new CUDA tensor: python input = input.cuda() model(Variable(input)) Let's investigate supporting this once #6688 is merged. If you want this get prioritized, please just say so. There is no need to blame devs on working on other changes that, while may be BC breaking, are also and may be more important.",It is fixed. Works on latest master.  python  >>> import torch  >>> torch.set_default_tensor_type(torch.cuda.DoubleTensor)  >>> tt = torch.Tensor([1])  >>> torch.exp(1j*tt)  tensor([0.5403+0.8415j]),"Thanks for catching this! We would accept a PR to fix this. OK, it is indeed a race. This diff, for example, ""fixes"" it:  diff --git a/torch/cuda/__init__.py b/torch/cuda/__init__.py  index 411cfb7315..5abb645fc5 100644  --- a/torch/cuda/__init__.py  +++ b/torch/cuda/__init__.py  @@ -165,6 +165,7 @@ def _lazy_init():  global _initialized, _cudart, _original_pid, _queued_calls  if _initialized:  return  + _initialized = True  if _in_bad_fork:  from sys import version_info  if version_info < (3, 4):  @@ -181,7 +182,6 @@ def _lazy_init():  _cudart.cudaGetErrorName.restype = ctypes.c_char_p  _cudart.cudaGetErrorString.restype = ctypes.c_char_p  _original_pid = os.getpid()  - _initialized = True  # Important to do this after _initialized, since some queued calls  # may themselves call _lazy_init()  for queued_call, orig_traceback in _queued_calls:  If you look at the relevant segment:  torch._C._cuda_init()  _cudart = _load_cudart()  _cudart.cudaGetErrorName.restype = ctypes.c_char_p  _cudart.cudaGetErrorString.restype = ctypes.c_char_p  _original_pid = os.getpid()  _initialized = True  The problem is that `_cuda_init` release the GIL at some point, which means that another Python thread can come in and trigger the same initialization (we aren't protected against the lock until we set `_initialized = True`. 
If you look into the list of types in the `got (...)` part, you'll find a mix of CPU and CUDA tensors, with `input` and `output` being on CPU, while `weight` and `bias` is on the GPU. You probably forgot to send the input to the GPU. Alternatively, keep in mind that `.cuda()` is an out of place operation i.e. python input.cuda() model(Variable(input)) will fail. You need to overwrite the reference with a new CUDA tensor: python input = input.cuda() model(Variable(input))  Let's investigate supporting this once #6688 is merged.  @mario98 histogram on GPU is really difficult. If you want this get prioritized, please just say so. There is no need to blame devs on working on other changes that, while may be BC breaking, are also and may be more important. @soumith Thanks a lot! `cudaDeviceReset()` can release the resource associated with the current process. @albanD Its unrelated to `nn.Parameter` and `torch.cuda.FloatTensor(torch.ones(3))` alreay crashes. repeating a message from Olexa on slack, closing this as wontfix:: @Amir, when I wrote in #general that CUDA is totally incompatible with forking, I meant this in the most serious and literal sense of those words possible. This simply *cannot* be fixed in CUDA, PyTorch or anyone else. The problems inherent in `fork()`'ing *any* _multithreaded_ program are fundamentally unsolvable, and simply beyond the power of anyone to fix, at least not until a revolution in OS design happens. I can only plead that you accept the general difficulty of safely forking within a multithreaded program and try to do things another way. There's several blogs out there discussing why it's dangerous to fork in a multithreaded program, such as https://thorstenball.com/blog/2014/10/13/why-threads-cant-fork/ The gist of it is that when a thread `fork()`'s, *in the child process, all other threads _die instantly._* It doesn't matter what they were doing, they're _gone_. - If they had locked a mutex, that mutex will never be unlocked again. 
- If they were modifying a data structure, that data structure might be invalid. - If they `malloc()`'ed some memory, that memory might never be deallocated, and `malloc()` may use locks and data structures anyways. - If a thread was doing useful work, that work will never be complete, because the threads no longer exist. - You can't join these non-existent threads. - `pthread_atfork()` is a function meant to solve the problems above, but it's simply incapable of doing what it was meant to do safely, and that's why the POSIX.1 standard explicitly says that this function may be formally deprecated in the next version of the standard. It was a mistake. So almost the only safe thing to do if you `fork()`'ed from a multi-threaded process is to call `exec()`. That's what `spawn` does. Because the CUDA runtime uses threads to implement its runtime and asynchronous streams, once the CUDA runtime is initialized, it's insanely dangerous to `fork()`. Don't do it.  Is anyone working on this, I would be happy take this up with some guidance. Hi, I am from @quansight team. I would like to work on this issue. Issue is still present on `pytorch==1.3.1` To fix, replace the following: `X = X.to(device)` With this: `X = X.to(device=device)` Provided that ""X"" is a packed sequence."
1,How to correctly use CTC Loss with GRU in pytorch?,"I suspect your issue has to do with your outputs / data[1] (it would help if you show examples of your train_set). Running the following piece of code gives no NaN, but I forced shape of output by hand before calling the loss_fn(pred, outputs) : class BaselineModel(nn.Module):  def __init__(self, feature_dim=5, hidden_size=3, num_layers=2, batch size=32):  super(BaselineModel, self).__init__() self.num_l","Your model predicts 28 classes, therefore the output of the model has size [batch_size, seq_len, 28] (or [seq_len, batch_size, 28] for the log probabilities that are given to the CTC loss). In the nn.CTCLoss you set blank=28, which means that the blank label is the class with index 28. To get the log probabilities for the blank label you would index it as output[:, :, 28], but that doesn't work, because that index is out of range, as the valid indices are 0 to 27. The last class in your output is at index 27, hence it should be blank=27: criterion = nn.CTCLoss(blank=27, zero_infinity=False)","I suspect your issue has to do with your outputs / data[1] (it would help if you show examples of your train_set). 
Running the following piece of code gives no nan, but I forced shape of output by hand before calling the loss_fn(pred, outputs) : class BaselineModel(nn.Module):  def __init__(self, feature_dim=5, hidden_size=5, num_layers=2, batch_size=32):  super(BaselineModel, self).__init__()  self.num_layers = num_layers  self.hidden_size = hidden_size  self.lstm = nn.LSTM(input_size=feature_dim,  hidden_size=hidden_size, num_layers=num_layers)  def forward(self, x, hidden):  lstm_out, hidden = self.lstm(x, hidden)  return lstm_out, hidden  def init_hidden(self, batch_size):  hidden = Variable(next(self.parameters()).data.new(  self.num_layers, batch_size, self.hidden_size))  cell = Variable(next(self.parameters()).data.new(  self.num_layers, batch_size, self.hidden_size))  return (hidden, cell) model = BaselineModel(batch_size=32) optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=0.0001) loss_fn = torch.nn.MSELoss(reduction='sum') hidden = model.init_hidden(10) model.zero_grad() pred, hidden = model(torch.randn(2,10,5), hidden) pred.size() #torch.Size([2, 10, 5]) outputs = torch.zeros(2,10,5) loss = loss_fn(pred, outputs) loss loss.backward() torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) optimizer.step() print(loss) Please note a common reason for nan values can be related to numerical stability of your learning phase, but usually you have values for the first steps before you see the divergence happening, which is apparently not the case here.  This exception is raised because the objetive function from your study must return a float. In your case, the problem is in this line: study.optimize(autotune, n_trials=1) The autotune function you defined before does not return a value and cannot be used for optimization. How to fix? For hyperparameter search, the autotune function must return the either some metric you can get after some training - like the loss or cross-entropy. 
A quick fix on your code could be something like this: def autotune():  cfg= { 'device' : ""cuda"" if torch.cuda.is_available() else ""cpu""  ...etc...  }  best_loss = 1e100; # or larger  # Train the model  for _ in range(epochs):  for i, (data, labels) in enumerate(trainloader):  ... (train the model) ...  # compute loss/error by comparing predicted out vs actual labels  loss = criterion(y_pred, labels)  best_loss = min(loss,best_loss)  return best_loss There is a good example with Pythorch in the Optuna repo that uses a pythoch callback to retrieve the accuracy (but can be changed easily to use the RMSE if needed). It also uses more than one experiment and takes the median for hyperparameters.  I belive the error lies in the following line: y_pred = (y_pred 0).float().requires_grad_() You try to binarize the model prediction in a weird way, I suggest do the following instead: y_pred = torch.sigmoid(y_pred) And pass this to the loss function. Explanation The output of the model can be any value, but we want to normalize that values to reside in the [0,1] range. This is exactly what the sigmoid function does. Once we have the values in the range of [0,1] the comparison with the binary labels will make sense, closer to 1 will be ""1"" and the opposite. You can refer to the following link: https://www.youtube.com/watch?v=WsFasV46KgQ  This problem seams to me like instead you define F: import torch.nn.functional as F You in accident have set F to some tuple F=(1,2) And then when you call F.log_softmax you get exactly this error.  
> See also: #2209 > > BCELoss accepts only inputs that have all elements in range [0; 1] but this condition doesn't hold in your case I got [the same error](https://discuss.pytorch.org/t/cuda-out-of-memory-when-optimizer-step/55942?u=shirui-japina) and I tried to use `nn.BCELoss()` like: optimizer = optim.SGD(model.parameters(), lr=0.0001) criterion = nn.BCELoss() _loop epoch train part:_ prediction = model(batch_input) loss = criterion(torch.sigmoid(prediction), label) optimizer.zero_grad() loss.backward() optimizer.step() Then I solved the problem. Thank you for your comment! (But I don't know why optim.Adam() can't work well. It still errors: CUDA out of memory.)  This is something that happens when you use PyTorch inside fastai (I believe this should be fixed). Just create custom loss_func. For example: def loss_func(output, target): return CrossEntropyLossFlat()(out, targ.long()) and pass it when creating the DataBlock: dblock = DataBlock(... , loss_func=loss_func, ...)  I was also very surprised of this issue. Although I have never used the library I went down and did some debugging and found out that the issue is coming from the library transformers. The problem is comming from from this line : encoder_states = tuple(hidden_state.transpose(0, 1) for hidden_state in encoder_states) If you comment it out, you will get the gradient just with some dimensions transposed. This issue is related to the fact that Pytorch Autograd does not do very well on inplace operations as mentioned here. So to recap the solution is to comment line 382 in modeling_bart.py. You will get the gradient with this shape T x B x C instead of B x T x C, but you can reshape it as you want later.  Change the criterion call to: age_loss, gender_loss, race_loss = criterion(output, age.float(), gender, race) If you look at your error we can trace it to: frame #3: at::native::smooth_l1_loss_backward_out In the MultiLoss Class, the smooth_l1_loss works with age. 
So I changed it's type to float (as the expected dtype is Float) while passing it to the criterion. You can check that age is torch.int64 (i.e. torch.long) by printing age.dtype I am not getting the error after doing this. Hope it helps.  The negative log likelihood loss (NLLLoss) is suitable for classification problems, where the output is one out of C classes. Since the classes are discrete, your labels need to be of the long type. In your case, in a comment, you say: I want to create a network that simulates a quadratic function with x as input and sth similar to x**2 as output. This is a regression problem, where the output can have a real, continuous value. For this, you should use a suitable loss function such as the mean squared error loss (MSELoss). So, one way to fix would be changing F.nll_loss in your code to F.mse_loss.  Defake:loss = criterion_test(dec_outs.view(-1, vocab_size, batch_size), targets.view(-1, batch_size)) ,I think you need to do criterion_test(dec_outs.view(-1, vocab_size),targets.view(-1)) In your case, ( C )- vocab_size and (N)- (batch_size*seq_length). I am assuming all the batches have the same sequence length. If not, youll have to use pack_padded_sequence and also mask the loss for the pad token."
2,glibc error while importing torch,"For the record, the problem was:  - In a conda environment, I installed pytorch with `conda install`(as described on PyTorch web site) and fastText with `pip install.` from their git clone.  - That resulted in a segfault when doing `import fastText` and `import torch` Reason:  - Pytorch is compiled with gcc 4.9.2  - Conda's default gcc is 4.8.5 Fix: - install gcc-4.9 in conda (e.g. `",I am also having this issue. I know that the version of GLIBC in my conda environment is up to date (version 2.55) but pytorch seems to be using the global GLIBC on my machine (version 2.12). how can I configure pytorch so that it uses the correct GLIBC installation?,"For the record, the problem was:  - in a conda environment, I installed pytorch with `conda install`(as described on pytorch web site) and fastText with `pip install .` from their git clone.  - that resulted in a segfault when doing `import fastText` and `import torch` Reason:  - pytorch is compiled with gcc 4.9.2  - conda's default gcc is 4.8.5 Fix:  - install gcc-4.9 in conda (e.g. `conda install -c serge-sans-paille gcc_49`)  - install pytorch with `conda install` (in my case, `conda install pytorch torchvision cuda90 -c pytorch`)  - install fastText with gcc-4.9 compiler: `CC=gcc-4.9 pip install .` in the fastText git clone That's it! Thanks a lot @weiyangfb and @SsnL for your help! Update conda first with `conda update conda` and try again `brew install libomp` solves the problem. i think $ brew install libomp can help u, cause i solve the same problem by it. according github-issue-""libomp.dylib can't be loaded""  I encounter the same error, and I solve it by running command 'conda install mkl' in my activated conda env. 
> File \""/usr/local/lib/python3.5/site-packages/torch/__init__.py\"", line 45, in > from torch._C import * > ImportError: dlopen(/usr/local/lib/python3.5/site-packages/torch/_C.cpython-35m-darwin.so, 10): Library not loaded: @rpath/libmkl_intel_lp64.dylib > Referenced from: /usr/local/lib/python3.5/site-packages/torch/lib/libTH.1.dylib > Reason: image not found We had the same issue. Problem was the version of MKL which was not good. Just running: conda remove mkl mkl-include conda install numpy pyyaml mkl=2019.3 mkl-include setuptools cmake cffi typing before the installation fixed the issue. Probably duplicate of #20030 try `brew install libomp ` I had the same error. I fixed it using: system('env -i /usr/bin/python3 -c ""import torch""').  Some environment variable set by matlab is messing with pytorch, but I don't know which. Hi @malfet,  I've just been giving the PyTorch 1.8 whl a go. I `pip install`ed `torch-1.8.0-cp36-cp36m-manylinux2014_aarch64.whl` on a CentOS 8.3.2011 machine, but I'm getting the following error:  OSError: /lib64/python3.6/site-packages/torch/lib/libtorch_global_deps.so: ELF load command alignment not page-aligned  This appears to be an issue previously encountered with NumPy ManyLinux builds (https://github.com/numpy/numpy/issues/16677) where the whl was built with the wrong pagesize.  The same issue appears to affect the nightly .whls from https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html too.  Ubuntu 20.10 with Python 3.8.6 appears to work as expected. If you are not using conda, pip install mkl-devel gets you the headers. I haven't looked at your CI script. @SsnL can we add this to the error message?"
3,"TypeError: add(): argument 'other' (position 1) must be Tensor, not numpy.ndarray","This is not a cmake-related error, it's just how the library was implemented. I do not know why, but it appears that the specialization of T* at::Tensor::dataT const with T = long long was forgotten/omitted. If you want to get your signed 64-bits pointer, you can still get it with int64_t: auto data = T.DataT(); It's good practice to use these types for which the size is explicit in general, in order to avoid compatibility issues.","I have had the same issue with this Kaggle kernel. My workarounds are the following: 1st option: In the F1 __call__ method convert preds and targs from pytorch tensors to numpy arrays; 2nd option: Initialise TP/FP/FN with pytorch tensors instead of numpy arrays, i.e. replace np.zeros(self.n) with torch.zeros(1, self.n). Basically, the main idea - all variables should be of the same type.","The error message says it all. The tensors involved contain elements of different data types. By default, w and b have elements of type torch.float32, while data_input is a NumPy array with the Python default floating point type, i.e. double. That datatype will be preserved when you convert with from_numpy. Try using dtype=np.float32 in your np.genfromtxt call.  If I understood your question correctly, you have a python list and want to convert to a pytorch tensor. python_list = [[1,2,3], [4, 5, 6]] torch_list = torch.tensor(python_list) Can you point out where you are getting the error in the code? Yes, that’s right. In #3, X = torch.FloatTensor(data) This is where I’m getting the error. I have also tried X = torch.Tensor(data) , but ended up with the same error. Hi! I have mistaken list of tensor as a simple python list. (So I have also changed the title of question.) The variable ‘data’ was actually a list of tensors with only one item. And I can’t create a tensor from a list of tensors using torch.Tensor() method. Hence the error. 
I used the below method to turn the list of tensor into a single tensor with link ""https://discuss.pytorch.org/t/how-to-turn-a-list-of-tensor-to-tensor/8868"" : X = torch.stack(data) And it works now. Thanks! I would suggest you to check the input type I had the same issue which solved by converting the input type from int32 to int64.(running on win10) ex: x = torch.tensor(train).to(torch.int64)  I encounter the same error. Here's the solution. You should change the type of input from float64 to float32, which means you should type: input_seq = input_seq.float()  This is not a cmake-related error, it's just how the library was implemented. I do not know why, but it appears that the specialization of T* at::Tensor::dataT const with T = long long was forgotten/omitted. If you want to get your signed 64-bits pointer, you can still get it with int64_t: auto data = T.dataint64_t(); It's good practice to use these types for which the size is explicit in general, in order to avoid compatibility issues.  The error is because k tensor is of dtype torch.int32 while other tensors P and d_k are of dtype torch.float32. But the cat operation requires all the input tensors to be of same type. From the documentation  torch.cat(tensors, dim=0, out=None) → Tensor  tensors (sequence of Tensors) – any python sequence of tensors of the  same type. One of the solutions is to convert k to float dtype as follows: k = torch.from_numpy(k).float()  As the error message explains, c is a tensor. To use torch.cat() you must pass a group of tensors or a list. To solve your problem you may use: temp = list() for key, b in reader:  temp.append(torch.from_numpy(b)) labels = torch.cat(temp) For more, you can check the manual here Cheers  I had same issue resolved Before converting to Tensor, try this X_train = X_train.astype(np.float32)  AFAIK, the tensor is basically a NumPy array bound to the device. 
If not too expensive for your application and you can afford to do it on CPU, you can simply convert it to NumPy and do what you need with the comparison.  It is converting; torch.float64 and torch.double are the same thing"
4,Get a 10x10 patch from a 100x100 pytorch tensor with torus style wrap around the boundaries,"In pytorch, the nn.Conv2d module needs the data to be in float. You can just make a simple edit: x = torch.randint(500,(256,), dtype=torch.float32) Alternatively you can also do: x = print(x) x[0] = x[0].view(-1) X = x.float()","torch.roll def random_patch(a, size) -> Tensor:  shifts = np.random.randint(low = 0, high = a.size())  return torch.roll(a, shifts=shifts, dims=(0, 1))[:size[0], :size[1]]","In pytorch, the nn.Conv2d module needs the data to be in float. You can just make a simple edit: x = torch.randint(500,(256,), dtype=torch.float32) Alternatively you can also do: x = torch.randint(500,(256,)) x = x.float()  i got it!Instead of using the for loop for the whole matrix, i just have to use it for the block_matrix size####a=torch.ones(8, 8)####print(a)####block_size=2####for i in range(block_size):####a[i: : 4]=2*a[i: : 4] #### print(a) It is actually very simple : torch.nn.functional.conv2d !  Check out the example on the tensorflow doc page (link): im1 = tf.decode_png('path/to/im1.png') im2 = tf.decode_png('path/to/im2.png') print(tf.image.ssim(im1, im2, max_val=255)) This should work on latest version of tensorflow. If you use older versions tf.image.ssim will return a tensor (print will not give you a value), but you can call .run() to evaluate it.  contiguous is a torch method, it has to be a torch tensor not a numpy array Got it. 
Just create a tensor using : - torch.zeros(batch_size,100,y,dtype = embedded.dtype,device = embedded.device)  Try looking into this, if you have to edit package called onnx-caffe2 to add the mapping b/w Unsqueeze to ExpandDims https://github.com/onnx/onnx/issues/1481 Look for the answer: I found that the Caffe2 equivalence for Unsqueeze in ONNX is ExpandDims, and there is a special mapping in onnx_caffe2/backend.py around line 121 for those operators that are different only in their names and attribute names, but somehow Unsqueeze isn't presented there (have no idea why). So I manually added the mapping rules for it in the _renamed_operators and _per_op_renamed_attrs dicts and the code would look like: _renamed_operators = {  'Caffe2ConvTranspose': 'ConvTranspose',  'GlobalMaxPool': 'MaxPool',  'GlobalAveragePool': 'AveragePool',  'Pad': 'PadImage',  'Neg': 'Negative',  'BatchNormalization': 'SpatialBN',  'InstanceNormalization': 'InstanceNorm',  'MatMul': 'BatchMatMul',  'Upsample': 'ResizeNearest',  'Equal': 'EQ',  'Unsqueeze': 'ExpandDims', # add this line } _global_renamed_attrs = {'kernel_shape': 'kernels'} _per_op_renamed_attrs = {  'Squeeze': {'axes': 'dims'},  'Transpose': {'perm': 'axes'},  'Upsample': {'mode': ''},  'Unsqueeze': {'axes': 'dims'}, # add this line } And everything works as expected. I am not the OP, thanks to OP though.  You may just add .to(torch.float32) to your train_x and val_x tensors  Static linking libtorch doesn’t work well yet, please see, with link https://github.com/pytorch/pytorch/issues/21737 As you can see matplotlib works fine even without conversion to numpy array. 
But PyTorch Tensors (""Image tensors"") are channel first, so to use them with matplotlib you need to permute the dimensions: Code: from scipy.misc import face import matplotlib.pyplot as plt import torch np_image = face() print(type(np_image), np_image.shape) tensor_image = torch.from_numpy(np_image) print(type(tensor_image), tensor_image.shape) # permute to channel first (view would only reinterpret memory, not move the channel axis): tensor_image = tensor_image.permute(2, 0, 1) print(type(tensor_image), tensor_image.shape) # If you try to plot image with shape (C, H, W) # You will get TypeError: # plt.imshow(tensor_image) # So we need to permute it back to (H, W, C): tensor_image = tensor_image.permute(1, 2, 0) print(type(tensor_image), tensor_image.shape) plt.imshow(tensor_image) plt.show() Output: <class 'numpy.ndarray'> (768, 1024, 3) <class 'torch.Tensor'> torch.Size([768, 1024, 3]) <class 'torch.Tensor'> torch.Size([3, 768, 1024]) <class 'torch.Tensor'> torch.Size([768, 1024, 3])"
5,"`F.logsigmoid(input, out=blah)` crashes","This issue looks addressed, so I'm going to close it. Feel free to reopen if I'm mistaken. This was fixed at some point, we forgot to close the issue. Oh, sorry, I read the repro completely wrong, there is a PackedSequence and it's being created with a CUDA tensor. Thank you for fact checking me, @ailzhang! If we're confident that the above example doesn't segfault without PackedSence, then we should close this.",`log_softmax` and `log_sigmoid` used underscores consistently,"Please shout if you have this problem, so I can prioritize it accordingly. Thank you for reporting this issue, @shmsong. The team agrees it's a high priority. This looks addressed, so I'm going to close it. Feel free to reopen if I'm mistaken. this was fixed at some point, we forgot to close the issue. Oh, sorry, I read the repro completely wrong, there is a PackedSequence and it's being created with a CUDA tensor. Thank you for fact checking me, @ailzhang! If we're confident that the above example doesn't segfault without PackedSequence, then we should close this. Oh sorry I see it has been fixed. thanks for the bug report, we'll fix this. That makes sense. I'll send a fix. I think this has been fixed on `master`. Thanks for catching this! We would accept a PR to fix this."
6,How to understand creating leaf tensors in PyTorch?,"This is a bug in PyTorch, because after this a.requires_grad is still True. a = torch.tensor([1.]) a requires_grad = True a.reshape(a.shape) A.retain_grad() b = torch.(a.T.) c = a.cat([a,b]) d = torch(c) d.backward() Old answer: If you move a.Requires_grad >= True after the reshape, it works: a =torch.Tensor([","When a tensor is first created, it becomes a leaf node. Basically, all inputs and weights of a neural network are leaf nodes of the computational graph.","Edit: Here is a detailed explanation of what's going on (""this isn't a bug per se, but it is definitely a source of confusion""): https://github.com/pytorch/pytorch/issues/19778 So one solution is to specifically ask to retain grad for now non-leaf a: a = torch.tensor([1.]) a.requires_grad = True a = a.reshape(a.shape) a.retain_grad() b = torch.tensor([1.]) c = torch.cat([a,b]) d = torch.sum(c) d.backward() Old answer: If you move a.requires_grad = True after the reshape, it works: a = torch.tensor([1.]) a = a.reshape(a.shape) a.requires_grad = True b = torch.tensor([1.]) c = torch.cat([a,b]) d = torch.sum(c) d.backward() Seems like a bug in PyTorch, because after this a.requires_grad is still true. a = torch.tensor([1.]) a.requires_grad = True a = a.reshape(a.shape) This seems to be related to the fact the a is no longer a leaf in your ""Doesn't work"" example, but still a leaf in other cases (print a.is_leaf to check).  detach() One example without detach(): from torchviz import make_dot x=torch.ones(2, requires_grad=True) y=2*x z=3+x r=(y+z).sum() make_dot(r) The end result in green r is a root of the AD computational graph and in blue is the leaf tensor. 
Another example with detach(): from torchviz import make_dot x=torch.ones(2, requires_grad=True) y=2*x z=3+x.detach() r=(y+z).sum() make_dot(r) This is the same as: from torchviz import make_dot x=torch.ones(2, requires_grad=True) y=2*x z=3+x.data r=(y+z).sum() make_dot(r) But, x.data is the old way (notation), and x.detach() is the new way. What is the difference with x.detach() print(x) print(x.detach()) Out: tensor([1., 1.], requires_grad=True) tensor([1., 1.]) So x.detach() is a way to remove requires_grad and what you get is a new detached tensor (detached from AD computational graph). torch.no_grad torch.no_grad is actually a class. x=torch.ones(2, requires_grad=True) with torch.no_grad():  y = x * 2 print(y.requires_grad) Out: False From help(torch.no_grad): Disabling gradient calculation is useful for inference, when you are sure | that you will not call :meth:Tensor.backward(). It will reduce memory | consumption for computations that would otherwise have requires_grad=True. | | In this mode, the result of every computation will have | requires_grad=False, even when the inputs have requires_grad=True.  There is no ""one shot instruction"" to switch .requires_grad for all tensors in graph. Usually parameters are kept in torch.nn.Module instances but in case they are elsewhere, you can always add them to some list and iterate over it, I'd do something like this: import torch class Leafs:  def __init__(self):  self.leafs = []  def add(self, tensor):  self.leafs.append(tensor)  return tensor  def clear(self):  for leaf in self.leafs:  leaf.requires_grad_(False) keeper = Leafs() x = keeper.add(torch.tensor([1.2], requires_grad=True)) y = keeper.add(torch.tensor([1.3], requires_grad=True)) print(x.requires_grad, y.requires_grad) keeper.clear() print(x.requires_grad, y.requires_grad) Usually there is no need for that, also if you don't want gradient for some part of computation you can always use with torch.no_grad() context manager.  
features = torch.rand(1, 5) weights = torch.Tensor([1, 2, 3, 4, 5]) print(features) print(weights) # Element-wise multiplication of shape (1 x 5) # out = [f1*w1, f2*w2, f3*w3, f4*w4, f5*w5] print(features*weights) # weights has been reshaped to (5, 1) # Element-wise multiplication of shape (5 x 5) # out = [f1*w1, f2*w1, f3*w1, f4*w1, f5*w1] # [f1*w2, f2*w2, f3*w2, f4*w2, f5*w2] # [f1*w3, f2*w3, f3*w3, f4*w3, f5*w3] # [f1*w4, f2*w4, f3*w4, f4*w4, f5*w4] # [f1*w5, f2*w5, f3*w5, f4*w5, f5*w5] print(features*weights.view(5, 1)) # Matrix-multiplication # (1, 5) * (5, 1) - (1, 1) # out = [f1*w1 + f2*w2 + f3*w3 + f4*w4 + f5*w5] print(torch.mm(features, weights.view(5, 1))) output tensor([[0.1467, 0.6925, 0.0987, 0.5244, 0.6491]]) # features tensor([1., 2., 3., 4., 5.]) # weights tensor([[0.1467, 1.3851, 0.2961, 2.0976, 3.2455]]) # features*weights tensor([[0.1467, 0.6925, 0.0987, 0.5244, 0.6491],  [0.2934, 1.3851, 0.1974, 1.0488, 1.2982],  [0.4400, 2.0776, 0.2961, 1.5732, 1.9473],  [0.5867, 2.7701, 0.3947, 2.0976, 2.5964],  [0.7334, 3.4627, 0.4934, 2.6220, 3.2455]]) # features*weights.view(5,1) tensor([[7.1709]]) # torch.mm(features, weights.view(5, 1))  Though I don't suggest to do that, if you want, then In [18]: torch.set_printoptions(edgeitems=1) In [19]: a Out[19]: tensor([[-0.7698, ..., -0.1949],  ...,  [-0.7321, ..., 0.8537]]) In [20]: torch.set_printoptions(edgeitems=3) In [21]: a Out[21]: tensor([[-0.7698, 1.3383, 0.5649, ..., 1.3567, 0.6896, -0.1949],  [-0.5761, -0.9789, -0.2058, ..., -0.5843, 2.6311, -0.0008],  [ 1.3152, 1.8851, -0.9761, ..., 0.8639, -0.6237, 0.5646],  ...,  [ 0.2851, 0.5504, -0.9471, ..., 0.0688, -0.7777, 0.1661],  [ 2.9616, -0.8685, -1.5467, ..., -1.4646, 1.1098, -1.0873],  [-0.7321, 0.7610, 0.3182, ..., 2.5859, -0.9709, 0.8537]])  As alternative to [https://stackoverflow.com/users/6210807/kharshit] suggestion, you can define network functional way: class MyCell(torch.nn.Module):  def __init__(self):  super(MyCell, self).__init__()  self.w = 
[]  for i in range(5):  self.w.append( torch.Tensor( 1, 1, 2*i+3 ) )  # init w[i] here, maybe make it ""requires grad"" def forward(self, x):  for i in range(5):  x = torch.nn.functional.conv1d( x, self.w[i] )  x = torch.nn.functional.relu( x )  return x  It does not really make much sense to have a single tensor which requires_grad for only part of its entries. Why not have two separate tensors, one that is updated (requires_grad=True) and another one fixed (requires_grad=False)? You can then merge them for computational ease: fixed = torch.rand([2, 3], requires_grad=False) upd = torch.rand([2, 3], requires_grad=True) mask = torch.tensor([[0, 1, 0], [1, 0, 1]], requires_grad=False) # how to combine the two # combine them using fixed ""mask"": z = mask * fixed + (1-mask) * upd You can obviously have other methods of combining fixed and upd other than using a binary mask. For example, if upd occupies the first two columns of z and fixed the rest, then: fixed = torch.rand([2, 1], requires_grad=False) upd = torch.rand([2, 2], requires_grad=True) # combine them using concatenation z = torch.cat((upd, fixed),dim=1) Or, if you know the indices fidx = torch.tensor([0, 2], dtype=torch.long) uidx = torch.tensor([1, 3, 4, 5], dtype=torch.long) fixed = torch.rand([2,], requires_grad=False) upd = torch.rand([4,], requires_grad=True) z = torch.empty([2, 3]) z[fidx] = fixed z[uidx] = upd  Intro First, definition of what a leaf variable in PyTorch is, you can check official documentation for tensor.is_leaf (emphasis mine): All Tensors that have requires_grad which is False will be leaf Tensors by convention. For Tensors that have requires_grad which is True, they will be leaf Tensors if they were created by the user. This means that they are not the result of an operation and so grad_fn is None. So let's see how this looks for outi variable in original code. 
Immediately after creation, running this snippet: outi = torch.empty(2, requires_grad=True) print(outi.is_leaf, outi.grad_fn, outi.requires_grad) gives: True, None, True as it was created by user and there is no previous operation creating it so it should be the second bolded case from the above citation. Now this line: outi[0] = out1 outi[1] = out2 Uses two nodes which are not leafs and are part of the graph which goes back to x (which is the only leaf in it). By doing this outi is also part of the original x graph and would have to be backpropagated through, yet you specified it as a leaf (more on that later), which cannot be backpropagated through (by the definition they either don't require gradient or are created by user). Version of outi as leaf was already put on graph, after above assignment, this snippet: print(outi.is_leaf, outi.grad_fn, outi.requires_grad) changes to: False <CopySlices object at 0x7f2dfa83a3d0> True Error Now, I agree it's a pretty uninformative error given that changing requires_grad=False does not make it non-leaf variable (requires_grad=False is implicit): outi = torch.empty(2) print(outi.is_leaf, outi.grad_fn, outi.requires_grad) # True None False But this tensor could be ""upgraded"" to non-leaf tensor if you use assignment as you did without breaking the expected behaviour. Why? Because you implicitly (or explicitly in case of your code) said you don't need gradient for this variable and PyTorch retains gradient only for leaf variables (unless you specify .retain_grad for specific tensor) due to memory optimization. So the only change here would be it will no longer be a leaf, but this would not break promises as .grad would be None anyway. If you were to have requires_grad=True as you originally did you could, reasonably, according to PyTorch semantics, think that this: outi.grad Will give you a tensor with gradient. 
But if this requires_grad=True tensor were to be changed to non-leaf tensor, then, by definition it wouldn't have this field (as non-leaf tensors have .grad=None). To me it seems like a design decision on their part to avoid confusion with requires_grad=True and breaking expected user experience. BTW. If they were to disallow leaf variables inside graph then operation which works fine now (requires_grad=False) should be disallowed as well. But as requires_grad=False is implicit and often used (creating tensors or something like you did) it seems not to be to much of a stretch to allow it. Disallowing it would be much more severe. On the other hand if you specify requires_grad=True it could be assumed you know better what you are doing and really need that gradient here. BTW2. This explanation might be a stretch but hopefully will shed some light. I haven't found anything official regarding this error (admittedly though I didn't dig too deep). Some resources here, here (this one is important, someone was asking for justification of some design decisions though didn't get one AFAIK). Comments Comment 1 I think the requires_grad is getting inherited from the slice and also .grad is available. Yes, it has requires_grad as True also as it's part of the graph now, BUT grad is not available as it is no longer a leaf. Printing outi.grad after backward gives you None and the following warning: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. So the .grad attribute is None anyway as user would expect giving requires_grad=False as creation argument. 
User could expect gradient to be not None if he was to set requires_grad=True and that's when PyTorch raises the error, IMO due to possible inconsistency with user expectation in that case. Comment 2 For example: a = torch.ones(2,requires_grad=False) b = 2*a b.requires_grad=True print(b.is_leaf) #True I have changed your code a little to go through it step by step: a = torch.ones(2, requires_grad=False) print(a.is_leaf) # True We should start with a here, a is a leaf according to docs as: All Tensors that have requires_grad which is False will be leaf Tensors by convention. b = a * 2 print(b.is_leaf) Now b is leaf as it does not require gradient (because a does not need a gradient it doesn't have to be backpropagated through this branch). Manipulating tensors with requires_grad=False creates tensors which do not require_grad otherwise it would be wasteful and non-sensical to turn it on. b.requires_grad = True print(b.is_leaf) Now this one returns True as well. Once again, docs wording might not be the best here (as I've stated before), but (my additions in bold): For Tensors that have requires_grad which is True (our case now) they will be leaf Tensors if they were created by the user (debatable about creation here as you have modified the existing one, indeed). This means that they are not the result of an operation and so grad_fn is None (this one IMO clarifies the previous point) About clarification- as this tensor is not a result of any mathematical operation, you simply said you want this b tensor to require_grad. IMO it is a user created tensor as it was placed (created) on graph for the first time (before there was no need for that as it didn't require gradient). And it does have it's requires_grad set to True, you did it explicitly here. Comment 3 4 Everything with requires_grad=True is on the graph Yes, but something with requires_grad=False can be on a graph as well if it is a leaf. 
Actually every PyTorch operation is created and added dynamically onto computational graph, here we use simplification: it's on graph if it takes part in backpropagation. For example neural network parameters are leafs yet they are on graph as the last part during backpropagation and they have their gradients (as they are optimized they need to be in graph in order to backprop through it). Everything not on the graph is a leaf Yes, essentially everything on the graph that is not the product of operation on graph tensors is a leaf Yes, if you add some tensor to it (e.g. created by torch.randn or such) is a leaf every leaf on the graph and non-leaf where I set retain_grad=True manually will get .grad attribute populated. Yes, it if it is part of backpropagation which is almost always the case in our ""mental graph"" case (I think at least). Unless it already has requires_grad=True, in this case it will be populated with gradient. Basically, except for creation, you shouldn't tinker with setting requires_grad=True as it is prone to fail (as you saw) and will definitely raise some eyebrows for other people reading your code. every non-leaf on the graph has a grad_fn associated with it Yes, that follows as some operation had to create it (and if it was created by some operation and this operation is differentiable, grad_fn is registered to be used during backward() call).  
How about (tens == 101).nonzero()[:, 1] In [20]: import torch In [21]: tens = torch.tensor([[ 101, 146, 1176, 21806, 1116, 1105, 18621, 119, 102, 0, ...: 0, 0, 0], ...: [ 101, 1192, 1132, 1136, 1184, 146, 1354, 1128, 1127, 117, ...: 1463, 119, 102], ...: [ 101, 6816, 1905, 1132, 14918, 119, 102, 0, 0, 0, ...: 0, 0, 0]]) In [22]: (tens == 101).nonzero()[:, 1] Out[22]: tensor([0, 0, 0])  Instead of constructing W_mat_directly from the elements of w, try assigning w into W: W_mat_directly = torch.zeros((3, 3), dtype=w.dtype) W_mat_directly[(0, 0, 1, 1, 2, 2), (1, 2, 0, 2, 0, 1)] = w You'll get tensor([[ 0., 10., 11.],  [12., 0., 13.],  [14., 15., 0.]], grad_fn=<IndexPutBackward>)"
7,Multiprocessping-distributed ERROR,"This is kind of expected. There are two main problems at play here: 1. RNG calls are not functional - they mutate the state of the global PRNG, and therefore we probably shouldn't rearrange them. This however requires us to add the ""world token"" to inhibit optimizations. 2. We never trace constructors. If we want to do this, we probably need to make the tracing a thread-local property, instead of auto-detecting it as a per-Variable property.","After almost one year later, now, I know why.  If we are using multi-GPU to train models, it would like to start multi-thread for different GPUs. And each thread has to rerun the script. If we change the code during this time, the other threads may load the modified code, which caused this problem.","This is kind of expected. There are two main problems at play here: 1. RNG calls are not functional - they mutate the state of the global PRNG, and therefore we probably shouldn't rearrange them. This however requires us to add the ""world token"" to inhibit optimizations. 2. We never trace constructors. If we want to do this, we probably need to make the tracing a thread-local property, instead of auto-detecting it as a per-Variable property. Hi there! Thank you for the detailed report. You're right and this is a bug. We do properly set the completed flag for collectives but not for the send/recv operations. It's not that the operation hasn't completed yet (it has), but the `completed_` flag in `ProcessGroup::Work` isn't updated accordingly. Fixing this is a bit more complex than simply fixing the boolean, because Gloo's send/recv doesn't have a non-blocking way to check if the operation completed or not. To fix this properly, we need to: 1) Update Gloo to allow non-blocking check for send/recv completion. 2) Update the bindings to call this function when `is_completed()` is called. ## Trying Solution 2 #21534 seems addressed the problem but in quite a dirty way. 
A better solution might need to satisfy the following requirements: 1. As mentioned by @pietern, the hook deletion function should be implemented in `torch/csrc/autograd/function.h`, as it owns the data. 2. We should not slow down existing use cases of `add_post_hook` and `post_hooks()`. I initially thought about using an `OrderedDict` to store named hooks, as what we did for params, buffers, and children in `nn/Module.h`, but that would violate the second requirement. ~Another possibility is that, instead of using the default deleter, we create a special deleter for the hook unique ptr in DDP, e.g., `ReducerHookDeleter`, that wraps the default deleter. The `add_post_hook` and `post_hooks()` APIs would stay the same, then we add one `delete_post_hook()` API to `torch/csrc/autograd/function.h`, which loops over all post hooks, and only delete the ones with matching deleter type, i.e., `ReducerHookDeleter`. This would be slow, but is OK, as we only need this on failures, where timeout delay will dominate. Any comments?~ Should be able to directly check hook pointer type. [here you go](https://gist.githubusercontent.com/Wulfsta/5d8b2b902e1068433aaf737657453a25/raw/18028b37ea909905d23b60805399df4ea0c6172b/gistfile1.txt). Edit: I obviously haven't parsed this whole file, but the problems appear to be limited to gradients and it looks like a lot of equality checks on NaNs are happening? Can you provide a minimum example that reproduces this behavior and that indicate that this is a PyTorch problem? From your description, it might imply that `pickle.dumps(tensor)` is not always the same. But I couldn't reproduce this locally. It could also be that the algorithm that Redis uses to cache large objects is not perfect and is subject to missing it a few times. Hard to say without a repro. I can see two solutions to this issue: 1. Having `encode` and `decode` functions on class Variable. Every time the variable is computed/stored these functions will be called. 2. 
Allowing `register_hook` to accept a keyword argument `check`, which will default to `True`. if `check == True`, [python_hook.cpp#L147] will be run. [python_hook.cpp#L147]:https://github.com/pytorch/pytorch/blob/1290e586fbc3d6266423f3417723d6620267054b/torch/csrc/autograd/python_hook.cpp#L147 Just to clarify before I start putting in pull requests, the way forward is:  1. ASAP remove the `torch_function_dispatch` decorator from everything in `torch.functional`  2. Look into either:  A) Speed up the decorator so the overhead is smaller. I don't think we can ever get the overhead to zero but we can probably at least get it much lower by doing C++-level checks for if parameters are `torch.tensor` instances or not  B) Rewrite the operators in `torch.functional` to be in C++ @dawnwch The equivalent to `with torch.no_grad()` in C++ is `torch::NoGradGuard no_grad`. For example:  Python:  python  with torch.no_grad():  module.weight += 1  C++:  cpp  {  torch::NoGradGuard no_grad;  module->weight += 1;  } // Note that anything out of this scope will still record gradients  Please try it out and let me know if it resolves the memory usage issue. Hi @decodyng, I think the best we can guarantee in the `torch.distributions` library is to correctly catch errors *when validation is enabled*. I believe this error would have been caught earlier if you had initially called py torch.distributions.Distribution.set_default_validate_args(True) In fact we recently [enabled validation by default](https://github.com/pyro-ppl/pyro/pull/2701) in Pyro (a downstream library). If this seems useful we could consider enabling validation by default also in PyTorch. What's your opinion? Once you have an ATen context object, you can just call `lazyInitCUDA()` and get a `THCState*`. Then, you can pass that into the THCS functions.  Also, please remember that we're using GitHub issues for bug reports only, and all questions should be posted on [our forums](https://discuss.pytorch.org)."
8,"Differences between F.relu(X) and torch.max(X, 0)","Yes, tf.math.reduce_max does allow gradients to flow. It is easy to check (this is TensorFlow 2.x but it is the same result in 1.x): import tensorflow as tf with tf.GradientTape() as tape: x = tf.linspace(0., 2. * 3.1416, 10) >>> tape.watch(x)  # A sequence of operations involving reduce_max  y = TF.matmul(x, dim=1) Sorted [[11.",torch.max is not differentiable according to this discussion. A loss function needs to be continuous and differentiable to do backprop. relu is differentiable as it can be approximated and hence the use of it in a loss function.,"Yes, tf.math.reduce_max does allow gradients to flow. It is easy to check (this is TensorFlow 2.x but it is the same result in 1.x): import tensorflow as tf with tf.GradientTape() as tape:  x = tf.linspace(0., 2. * 3.1416, 10)  tape.watch(x)  # A sequence of operations involving reduce_max  y = tf.math.square(tf.math.reduce_max(tf.math.sin(x))) # Check gradients g = tape.gradient(y, x) print(g.numpy()) # [ 0. 0. 0.3420142 -0. -0. -0. # -0. 0. 0. 0. ] As you can see, there is a valid gradient for y with respect to x. Only one of the values is not zero, because it is the value that then resulted in the maximum value, so it is the only value in x that affects the value of y. This is the correct gradient for the operation.  One must admit the unique function can sometimes be very confusing without given proper examples and explanations. The dim parameter specifies which dimension on the matrix tensor you want to apply on. For instance, in a 2D matrix, dim=0 will let operation perform vertically where dim=1 means horizontally. Example, let's consider a 4x4 matrix with dim=1. As you can see from my code below, the unique operation is applied row by row. You notice the double occurrence of the number 11 in the first and last row. Numpy and Torch does this to preserve the shape of the final matrix. 
However, if you do not specify any dimension, torch will automatically flatten your matrix and then apply unique to it and you will get a 1D array that contains unique data. import torch m = torch.Tensor([  [11, 11, 12,11], [13, 11, 12,11], [16, 11, 12, 11], [11, 11, 12, 11] ]) output, indices = torch.unique(m, sorted=True, return_inverse=True, dim=1) print(""Ori \n{}"".format(m.numpy())) print(""Sorted \n{}"".format(output.numpy())) print(""Indices \n{}"".format(indices.numpy())) # without specifying dimension output, indices = torch.unique(m, sorted=True, return_inverse=True) print(""Sorted (no dim) \n{}"".format(output.numpy())) Result (dim=1) Ori [[11. 11. 12. 11.]  [13. 11. 12. 11.]  [16. 11. 12. 11.]  [11. 11. 12. 11.]] Sorted [[11. 11. 12.]  [11. 13. 12.]  [11. 16. 12.]  [11. 11. 12.]] Indices [1 0 2 0] Result (no dimension) Sorted (no dim) [11. 12. 13. 16.]  Actually, this is what you are looking for: Case 1: when z = 2*x**3 + x import torch from torch.autograd import Variable x = Variable(2*torch.ones(2, 2), requires_grad=True) z = x*x*x*2+x z.backward(torch.ones_like(z)) print(x.grad) output: tensor([[25., 25.],  [25., 25.]]) Case 2: when z = x*x x = Variable(2*torch.ones(2, 2), requires_grad=True) z = x*x z.backward(torch.ones_like(z)) print(x.grad) output: tensor([[4., 4.],  [4., 4.]]) Case 3: when z = x (your case) x = Variable(2*torch.ones(2, 2), requires_grad=True) z = x z.backward(torch.ones_like(z)) print(x.grad) output: tensor([[1., 1.],  [1., 1.]]) To learn more how to calculate gradient in pytorch, check this.  Alias for torch.acosh(). Adds the scalar other to each element of the input input and returns a new resulting tensor. Performs the element-wise division of tensor1 by tensor2, multiply the result by the scalar value and add it to input. Performs the element-wise multiplication of tensor1 by tensor2, multiply the result by the scalar value and add it to input. Computes the element-wise angle (in radians) of the given input tensor. 
Returns a new tensor with the arcsine of the elements of input. Alias for torch.asin(). Returns a new tensor with the inverse hyperbolic sine of the elements of input. Alias for torch.asinh(). Returns a new tensor with the arctangent of the elements of input. Alias for torch.atan(). Returns a new tensor with the inverse hyperbolic tangent of the elements of input. Alias for torch.atanh(). Element-wise arctangent of inputi/otheri\text{input}_{i} / \text{other}_{i}inputi​/otheri​ with consideration of the quadrant. Computes the bitwise NOT of the given input tensor. Computes the bitwise AND of input and other. Computes the bitwise OR of input and other. Computes the bitwise XOR of input and other. Returns a new tensor with the ceil of the elements of input, the smallest integer greater than or equal to each element. Clamps all elements in input into the range [ min, max ]. Alias for torch.clamp(). Computes the element-wise conjugate of the given input tensor. Create a new floating-point tensor with the magnitude of input and the sign of other, elementwise. Returns a new tensor with the cosine of the elements of input. Returns a new tensor with the hyperbolic cosine of the elements of input. Returns a new tensor with each of the elements of input converted from angles in degrees to radians. Divides each element of the input input by the corresponding element of other. Alias for torch.div(). Computes the logarithmic derivative of the gamma function on input. Alias for torch.special.erf(). Alias for torch.special.erfc(). Yes, they are the same! 
import tensorflow as tf tensor = [[1., 2.], [4., 5.], [3., 6.], [7., 8.], [5., 2.]] tensor = tf.convert_to_tensor(tensor, dtype=tf.float32) t_norm = tf.norm(tensor, ord=2, axis=1) print(t_norm) Output tf.Tensor([ 2.236068 6.4031243 6.708204 10.630146 5.3851647], shape=(5,), dtype=float32) import torch tensor = [[1., 2.], [4., 5.], [3., 6.], [7., 8.], [5., 2.]] tensor = torch.tensor(tensor, dtype=torch.float32) t_norm = torch.norm(tensor, p=2, dim=1) print(t_norm) Output tensor([ 2.2361, 6.4031, 6.7082, 10.6301, 5.3852])  When you use *, the multiplication is elementwise, when you use torch.mm it is matrix multiplication. Example: a = torch.rand(2,5) b = torch.rand(2,5) result = a*b result will be shaped the same as a or b i.e (2,5) whereas considering operation result = torch.mm(a,b) It will give a size mismatch error, as this is proper matrix multiplication (as we study in linear algebra) and a.shape[1] != b.shape[0]. When you apply the view operation in torch.mm you are trying to match the dimensions. In the special case of the shape in some particular dimension being 1, it becomes a dot product and hence sum (a*b) is same as mm(a, b.view(5,1))  For quick understanding, layout=torch.strided corresponds to dense tensors while layout=torch.sparse_coo corresponds to sparse tensors. From another perspective, we can understand it together with torch.tensor.view. A tensor can be viewed indicates it is contiguous. If we change the view of a tensor, the strides will change accordingly, but the data will keep the same. More specifically, view returns a new tensor with the same data but different shape, and strides is compatible with the view to indicate how to access the data in the memory. 
For example In [1]: import torch In [2]: a = torch.arange(15) In [3]: a.data_ptr() Out[3]: 94270437164688 In [4]: a.stride() Out[4]: (1,) In [5]: a = a.view(3, 5) In [6]: a.data_ptr() # share the same data pointer Out[6]: 94270437164688 In [7]: a.stride() # the stride changes as the view changes Out[7]: (5, 1) In addition, the idea of torch.strided is basically the same as strides in numpy. View this question for more detailed understanding. How to understand numpy strides for layman?  See torch.igammac() and torch.lgamma() for related functions. Supports broadcasting to a common shape and float inputs. Note The backward pass with respect to input is not yet supported. Please open an issue on PyTorch’s Github to request it. input (Tensor) – the first non-negative input tensor other (Tensor) – the second non-negative input tensor out (Tensor, optional) – the output tensor. Example: Looks like the [`torch.amax()`]() docs are aware of this behavior: amax/amin evenly distributes gradient between equal values, while max(dim)/min(dim) propagates gradient only to a single index in the source tensor.  The minus essentially means you go backwards through the dimensions. Let A be a n-dimensional matrix. Then dim=n-1=-1, dim=n-2=-2, ..., dim=1=-(n-1), dim=0=-n. See the numpy doc for more information, as pytorch is heavily based on numpy."
9,quantization.fuse_modules fails with Conv1d and BatchNorm1d,"The issue is that we don't have the direct test coverage for the `from_float` function with `bias = NULL`. The current unit test with ""indirect"" test coverage is to test `Convert` and `Quantize` function from here: Previously we have fixed the general `Bias = nullptr` here:",Here ( is a list of modules which can be fused. LeakyRELU fusion does not work because we don't have a fusion implemented for it.,"cudnn does not support backprop in evaluate mode, we should add a fallback to nn for this case The issue is that we don't have the direct test coverage for the `from_float` function with `bias = NULL`. The current unit test with ""indirect"" test coverage is to test `convert` and `quantize` function from here: https://github.com/pytorch/pytorch/blob/master/test/test_quantization.py#L21 Previously we have fixed the general `bias = nullptr` here: https://github.com/pytorch/pytorch/pull/22403 This sounds like a silent correctness issue so I marked it as high-priority but someone with more context behind the normalization layers should take a look at this. I have not yet fully analyzed the problem and I am not sure if the observed effect is a bug or by design.  It seems that in the self-attention variant with the added zero k & v sequence-entries (produced internally by `add_zero_attn=True`) the gradient is composed of a component coming from the query and one from values and keys. The gradient from the query is causing the undesired effect. 
I modified the repro script a bit in order to get the effect without `add_zero_attn=True` (initial in-projection bias should be 0) and to split the contribution of query and key & value: import torch  embedding_dim = 1  batch_size = 1  num_heads = 1  seq_len = 4  net = torch.nn.MultiheadAttention(embedding_dim, num_heads, add_zero_attn=False)  mask = torch.cat([torch.ones(seq_len, seq_len).triu(), torch.zeros(seq_len, 1)], dim=1)  mask[mask==1]=float('-inf')  print(mask)  for i in range(seq_len):  x = torch.ones(seq_len, batch_size, embedding_dim, requires_grad=True)  y = torch.ones(seq_len, batch_size, embedding_dim, requires_grad=True)  z = torch.cat([y, torch.zeros(1, 1, embedding_dim)]) # add zero sequence element  o, w = net(x, z, z, attn_mask=mask)  #print(w)  # o.shape is (seq_len, batch_size, embedding_dim)  o.mean([1, 2])[i].backward()  print(i, 'x:', x.grad.abs().sum([1, 2]).view(-1))  print(i, 'y:', y.grad.abs().sum([1, 2]).view(-1))  Output is:  tensor([[-inf, -inf, -inf, -inf, 0.],  [0., -inf, -inf, -inf, 0.],  [0., 0., -inf, -inf, 0.],  [0., 0., 0., -inf, 0.]])  0 x: tensor([0., 0., 0., 0.])  0 y: tensor([0., 0., 0., 0.])  1 x: tensor([0.0000, 0.0148, 0.0000, 0.0000])  1 y: tensor([0.2801, 0.0000, 0.0000, 0.0000])  2 x: tensor([0.0000, 0.0000, 0.0127, 0.0000])  2 y: tensor([0.1798, 0.1798, 0.0000, 0.0000])  3 x: tensor([0.0000, 0.0000, 0.0000, 0.0105])  3 y: tensor([0.1323, 0.1323, 0.1323, 0.0000])  I think adding noise to gradients is simple. A simple `apply_` call should be able to iterate through all model parameters and add the noise to the gradients, after which `step` is called. I initially proposed adding Noisy SGD, but the proposal was rejected considering its triviality. Ref: https://github.com/pytorch/pytorch/pull/4332 The same happens with `index_copy` iirc. It comes from the fact that the current tests are very thorough. They have cases for all cases of contiguous / non-contiguous inputs with all the possible flags. 
Making the input sizes smaller would alleviate but not completely solve this problem QR batching PR was done after 1.1, please allow me to investigate. correct. The model would still be in fp32, it's just the forward pass which is run in fp16 to speed up training. This should be independent from any post training quantization strategies. @albanD @ezyang Thanks for your suggestions.  We understood the risk of varargs in custom Function, and will follow your suggestion.  Thanks. this is now fixed in master: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/functions/batch_normalization.cpp#L119"


In [None]:
predicted = []
reference = []

# Generate answers for the full test set
for i in range(len(test)):
    # create support document with the dense index
    question = test[i]['x']

    doc, res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer,
        passage_snippets, doc_gpu_index, device='cuda'
    )
    # concatenate question and support document into BART input
    question_doc = "question: {} context: {}".format(question, doc)
    # generate an answer with beam search
    answer = qa_s2s_generate(
            question_doc, qa_s2s_model, qa_s2s_tokenizer,
            num_answers=1,
            num_beams=8,
            min_len=96,
            max_len=256,
            max_input_length=1024,
            device="cuda:0"
    )[0]
    predicted.append(answer)
    reference.append(test[i]['y'])



## Generator model evaluation 

The last step is a quantitative evaluation of the generator. We compare each generated answer with the reference answer from the test set using the ROUGE implementation provided in the nlp library, which requires the rouge_score package.
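To make the reported numbers easier to read, here is a minimal sketch of what a ROUGE-N score measures: clipped n-gram overlap between a generated answer and its reference, reported as precision, recall, and F-measure. The function name `rouge_n` and the example strings are hypothetical, and plain whitespace tokenization is a simplifying assumption. The rouge_score package tokenizes differently and can also stem and aggregate over the whole test set, so this sketch will not reproduce its values exactly.

```python
from collections import Counter


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) from clipped n-gram overlap.

    Illustrative only: lowercases and splits on whitespace, which is a
    simplification of what the rouge_score package does.
    """
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical generated answer and reference answer
p, r, f = rouge_n(
    "use torch.mm for matrix multiplication",
    "torch.mm performs matrix multiplication",
)
print("P={:.4f} R={:.4f} F={:.4f}".format(p, r, f))
```

Precision penalizes a long generated answer that contains many words absent from the reference, while recall penalizes a generated answer that misses words from the reference. This is worth keeping in mind because the generation step above forces answers of at least 96 tokens.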

In [None]:
pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4


In [None]:
# Compare each generated answer to the reference answer from the dataset
nlp_rouge = nlp.load_metric('rouge')

scores = nlp_rouge.compute(
    predicted, reference,
    rouge_types=['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],
    use_agregator=True, use_stemmer=False
)
df = pd.DataFrame({
    'rouge1': [scores['rouge1'].mid.precision, scores['rouge1'].mid.recall, scores['rouge1'].mid.fmeasure],
    'rouge2': [scores['rouge2'].mid.precision, scores['rouge2'].mid.recall, scores['rouge2'].mid.fmeasure],
    'rougeL': [scores['rougeL'].mid.precision, scores['rougeL'].mid.recall, scores['rougeL'].mid.fmeasure],
}, index=[ 'P', 'R', 'F'])
df.style.format({'rouge1': "{:.4f}", 'rouge2': "{:.4f}", 'rougeL': "{:.4f}"})

,rouge1,rouge2,rougeL
P,0.1382,0.0231,0.0868
R,0.2680,0.0591,0.2083
F,0.1313,0.0226,0.0864
