# SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

This notebook gives a minimal usage example of SPLADE.

* We provide models via Hugging Face (https://huggingface.co/naver)
* See [Naver Labs Europe website](https://europe.naverlabs.com/research/machine-learning-and-optimization/splade-models/) for other intermediate models.

| model | MRR@10 (MS MARCO dev) | recall@1000 (MS MARCO dev) | expected FLOPS | ~ avg query length | ~ avg doc length |
| --- | --- | --- | --- | --- | --- |
| `naver/splade_v2_max` (**v2** [HF](https://huggingface.co/naver/splade_v2_max)) | 34.0 | 96.5 | 1.32 | 18 | 92 |
| `naver/splade_v2_distil` (**v2** [HF](https://huggingface.co/naver/splade_v2_distil)) | 36.8 | 97.9 | 3.82 | 25 | 232 |
| `naver/splade-cocondenser-selfdistil` (**v2bis**, [HF](https://huggingface.co/naver/splade-cocondenser-selfdistil))| 37.6 | 98.4 | 2.32 | 56 | 134 |
| `naver/splade-cocondenser-ensembledistil` (**v2bis**, [HF](https://huggingface.co/naver/splade-cocondenser-ensembledistil)) | 38.3 | 98.3  | 1.85 | 44 | 120 |

In [1]:
import torch, os, string
from transformers import AutoModelForMaskedLM, AutoTokenizer
from splade.models.transformer_rep import SpladeMaxSim, Splade
from collections import Counter



In [2]:
os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [3]:
# set the dir for trained weights

##### v2
# model_type_or_dir = "naver/splade_v2_max"
# model_type_or_dir = "naver/splade_v2_distil"

### v2bis, directly download from Hugging Face
# model_type_or_dir = "naver/splade-cocondenser-selfdistil"
# model_type_or_dir = "naver/splade-cocondenser-ensembledistil"
model_type_or_dir = "/scratch/lamdo/phrase_splade_checkpoints/phrase_splade_31/debug/checkpoint/model"
# model_type_or_dir = "/scratch/lamdo/splade_maxsim_ckpts/splade_maxsim_150k_lowregv3/debug/checkpoint/model"
# model_type_or_dir = 'lamdo/distilbert-base-uncased-phrase-16kaddedphrasesfroms2orc-mlm-150000steps-multiwords'
# model_type_or_dir = "/scratch/lamdo/splade_checkpoints/experiments_combined_references_v8-1/debug/checkpoint/model"
# model_type_or_dir = "lamdo/distilbert-base-uncased-phrase-60kaddedphrasesfroms2orc-mlm-150000steps"

In [4]:
# loading model and tokenizer

model = Splade(model_type_or_dir, agg="max")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_type_or_dir)
reverse_voc = {v: k for k, v in tokenizer.vocab.items()}

len(reverse_voc)

79577

In [5]:
model

Splade(
  (transformer_rep): TransformerRep(
    (transformer): DistilBertForMaskedLM(
      (activation): GELUActivation()
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(79577, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): DistilBertSdpaAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): Linear(in_features=768, out_features=768, bias=True)
                (k_lin): Linear(in_features=768, out_features=768, bias=True)
                (v_lin): Linear(in_features=768, out_features=768, bias=True)
                (out_lin): Linear(in_features=768, out_features=768, bias=True)
              )
 

In [6]:
def encode_custom(tokens, model, is_q=False):
    """Per-token SPLADE weights: each vocabulary term keeps its weight only
    at the input position that attains the maximum over the sequence."""
    out = model.encode_(tokens, is_q)["logits"]  # shape (bs, pad_len, voc_size)
    # log-saturated activation; padded positions are zeroed out
    out = torch.log(1 + torch.relu(out)) * tokens["attention_mask"].unsqueeze(-1)

    res = torch.zeros_like(out)  # same shape, dtype and device as `out`

    # max over the token axis: values and argmax positions, both (bs, voc_size)
    out, token_indices = torch.max(out, dim=1)

    # write each pooled weight back at the position that produced it
    res.scatter_(1, token_indices.unsqueeze(1), out.unsqueeze(1))
    return res  # shape (bs, pad_len, voc_size)


# vocabulary ids of the ASCII punctuation tokens
PUNCID = torch.tensor([tokenizer.vocab[punc] for punc in string.punctuation])


def encode_custom_mask_punc(tokens, model, is_q=False):
    """Pooled SPLADE representation that ignores punctuation input tokens."""
    out = model.encode_(tokens, is_q)["logits"]  # shape (bs, pad_len, voc_size)
    out = torch.log(1 + torch.relu(out)) * tokens["attention_mask"].unsqueeze(-1)

    # drop the contribution of punctuation positions before pooling
    mask = ~torch.isin(tokens["input_ids"], PUNCID)
    out = out * mask.unsqueeze(-1)

    out, _ = torch.max(out, dim=1)
    return out  # shape (bs, voc_size)
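
The two helpers above apply SPLADE's log-saturated activation, log(1 + ReLU(logit)), to each token's vocabulary logits and then max-pool over the token axis; `encode_custom` also records which token position produced each pooled weight. The sketch below reproduces that computation on a toy 3-token, 4-term logit matrix. It uses plain Python lists instead of tensors so that it runs without a model, and `toy_logits`, `toy_mask`, and `pool_with_attribution` are hypothetical names introduced only for illustration.

```python
import math


def pool_with_attribution(logits, attention_mask):
    """Max-pool log(1 + relu(logit)) over tokens and record the argmax token.

    logits: list of rows, one per input token, each of length vocab_size.
    attention_mask: list of 0/1 flags, one per input token.
    Returns (weights, argmax_token), both of length vocab_size.
    """
    vocab_size = len(logits[0])
    weights = [0.0] * vocab_size
    argmax_token = [0] * vocab_size
    for t, (row, keep) in enumerate(zip(logits, attention_mask)):
        for j, logit in enumerate(row):
            # log-saturated activation, zeroed for padded positions
            act = math.log1p(max(logit, 0.0)) * keep
            if act > weights[j]:
                weights[j] = act
                argmax_token[j] = t
    return weights, argmax_token


# toy values, not model outputs; the third token is padding
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, 3.0, -2.0, 0.5],
    [9.0, 9.0, 9.0, 9.0],
]
toy_mask = [1, 1, 0]

weights, argmax_token = pool_with_attribution(toy_logits, toy_mask)
print([round(w, 3) for w in weights])
print(argmax_token)
```

The padded third token never contributes, however large its logits are, because its activation is multiplied by a zero mask before pooling. On ties this sketch keeps the first position; the position `torch.max` reports for tied values may differ.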

In [7]:
# example document: title and abstract of a scientific paper
# (alternative documents are commented out; uncomment one to try it)

# doc = """ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. Neural information retrieval (IR) has greatly advanced search and other knowledge-intensive language tasks. While many neural IR methods encode queries and documents into single-vector representations, late interaction models produce multi-vector representations at the granularity of each token and decompose relevance modeling into scalable token-level computations. This decomposition has been shown to make late interaction more effective, but it inflates the space footprint of these models by an order of magnitude. In this work, we introduce ColBERTv2, a retriever that couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction. We evaluate ColBERTv2 across a wide range of benchmarks, establishing state-of-the-art quality within and outside the training domain while reducing the space footprint of late interaction models by 6--10×."""

doc = """Supplementing Remote Sensing of Ice: Deep Learning-Based Image Segmentation System for Automatic Detection and Localization of Sea-ice Formations From Close-Range Optical Images. This paper presents a three-stage approach for the automated analysis of close-range optical images containing ice objects. The proposed system is based on an ensemble of deep learning models and conditional random field postprocessing. The following surface ice formations were considered: Icebergs, Deformed ice, Level ice, Broken ice, Ice floes, Floebergs, Floebits, Pancake ice, and Brash ice. Additionally, five non-surface ice categories were considered: Sky, Open water, Shore, Underwater ice, and Melt ponds. To find input parameters for the approach, the performance of 12 different neural network architectures was explored and evaluated using a 5-fold cross-validation scheme. The best performance was achieved using an ensemble of models having pyramid pooling layers (PSPNet, PSPDenseNet, DeepLabV3+, and UPerNet) and convolutional conditional random field postprocessing with a mean intersection over union score of 0.799, and this outperformed the best single-model approach. The results of this study show that when per-class performance was considered, the Sky was the easiest class to predict, followed by Deformed ice and Open water. Melt pond was the most challenging class to predict. Furthermore, we have extensively explored the strengths and weaknesses of our approach and, in the process, discovered the types of scenes that pose a more significant challenge to the underlying neural networks. When coupled with optical sensors and AIS, the proposed approach can serve as a supplementary source of large-scale ‘ground truth’ data for validation of satellite-based sea-ice products. We have provided an implementation of the approach at https://github.com/panchinabil/sea_ice_segmentation ."""


# doc = """A comprehensive survey of graph embedding: Problems, techniques, and applications. Graph is an important data representation which appears in a wide diversity of real-world scenarios. Effective graph analytics provides users a deeper understanding of what is behind the data, and thus can benefit a lot of useful applications such as node classification, node recommendation, link prediction, etc. However, most graph analytics methods suffer the high computation and space cost. Graph embedding is an effective yet efficient way to solve the graph analytics problem. It converts the graph data into a low dimensional space in which the graph structural information and graph properties are maximumly preserved. In this survey, we conduct a comprehensive review of the literature in graph embedding. We first introduce the formal definition of graph embedding as well as the related concepts. After that, we propose two taxonomies of graph embedding which correspond to what challenges exist in different [MASK] [MASK]"""

# doc = """Attention Is All You Need. The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."""

# doc = "ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase Generation"

# doc = """ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase Generation. Unsupervised keyphrase prediction has gained growing interest in recent years. However, existing methods typically rely on heuristically defined importance scores, which may lead to inaccurate informativeness estimation. In addition, they lack consideration for time efficiency. To solve these problems, we propose ERU-KG, an unsupervised keyphrase generation (UKG) model that consists of a phraseness and an informativeness module. The former generate candidates, while the latter estimate their relevance. The informativeness module innovates by learning to model informativeness through references (e.g., queries, citation contexts, and titles) and at the term-level, thereby 1) capturing how the key concepts of the document are perceived in different contexts and 2) estimate informativeness of phrases more efficiently by aggregating term informativeness, removing the need for explicit modeling of the candidates. ERU-KG demonstrates its effectiveness on keyphrase generation benchmarks by outperforming unsupervised baselines and achieving on average 89% of the performance of a supervised baseline for top 10 predictions. Additionally, to highlight its practical utility, we evaluate the model on text retrieval tasks and show that keyphrases generated by ERU-KG are effective when employed as query and document expansions. Finally, inference speed tests reveal that ERU-KG is the fastest among baselines of similar model sizes."""

# doc = """SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. In neural Information Retrieval (IR), ongoing research is directed towards improving the first retriever in ranking pipelines. Learning dense embeddings to conduct retrieval using efficient approximate nearest neighbors methods has proven to work well. Meanwhile, there has been a growing interest in learning \emph{sparse} representations for documents and queries, that could inherit from the desirable properties of bag-of-words models such as the exact matching of terms and the efficiency of inverted indexes. Introduced recently, the SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches. In this paper, we build on SPLADE and propose several significant improvements in terms of effectiveness and/or efficiency. More specifically, we modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation. We also report results on the BEIR benchmark. Overall, SPLADE is considerably improved with more than 9\% gains on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark."""

# doc = """The author uses 3 096 sample households in 15 counties from the year 1995 to 2006 to analyze the impact of PFPs on rural households' income inequality by income inequality decomposition.The research indicates that:(1) the percentage of subsidy income generated from PFPs has increased 8.03% during the period from 1995 to 2006;(2) the contribution of subsidy income generated from PFPs has been up from 0.330 7% in 1995 to 3.794 1% in 2006;(3) the policy-caused subsidy income inequality is more prominent than that caused by the planned regions of PFPs.Therefore a rational policy adjustment of PFPs will contribute more to poverty reduction in China's rural areas."""

# doc = " | But much of the responsibility of the social inequity that leads to different health outcomes lies elsewhere. Health is affected by policies in other sectors, such as education, taxation, transport, and agriculture too."


# doc = "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily."

# doc = "Generative Image Dynamics. We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics such as trees, flowers, candles, and clothes swaying in the wind. We model this dense, long-term motion prior in the Fourier domain:given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to realistically interact with objects in real pictures by interpreting the spectral volumes as image-space modal bases, which approximate object dynamics."

# doc = "Rich Human Feedback for Text-to-Image Generation. Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants)"

# doc = "MedYOLO: A Medical Image Object Detection Framework. Artificial intelligence-enhanced identification of organs, lesions, and other structures in medical imaging is typically done using convolutional neural networks (CNNs) designed to make voxel-accurate segmentations of the region of interest. However, the labels required to train these CNNs are time-consuming to generate and require attention from subject matter experts to ensure quality. For tasks where voxel-level precision is not required, object detection models offer a viable alternative that can reduce annotation effort. Despite this potential application, there are few options for general purpose object detection frameworks available for 3-D medical imaging. We report on MedYOLO, a 3-D object detection framework using the one-shot detection method of the YOLO family of models and designed for use with medical imaging. We tested this model on four different datasets: BRaTS, LIDC, an abdominal organ Computed Tomography (CT) dataset, and an ECG-gated heart CT dataset. We found our models achieve high performance on commonly present medium and large-sized structures such as the heart, liver, and pancreas even without hyperparameter tuning. However, the models struggle with very small or rarely present structures."

# doc = "A study of smoothing methods for language models applied to ad hoc information retrieval. Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collection."

# doc = "Big data: astronomical or genomical? Genomics is a Big Data science and is going to get much bigger, very soon, but it is not known whether the needs of genomics will exceed other Big Data domains. Projecting to the year 2025, we compared genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. Our estimates show that genomics is a “four-headed beast”—it is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis. We discuss aspects of new technologies that will need to be developed to rise up and meet the computational challenges that genomics poses for the near future. Now is the time for concerted, community-wide planning for the “genomical” challenges of the next decade."

# doc = "Topic sentiment mixture: modeling facets and opinions in weblogs. In this paper, we define the problem of topic-sentiment analysis on Weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model can reveal the latent topical facets in a Weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. It could also provide general sentiment models that are applicable to any ad hoc topics. With a specifically designed HMM structure, the sentiment models and topic models estimated with TSM can be utilized to extract topic life cycles and sentiment dynamics. Empirical experiments on different Weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from Weblog collections."


# doc = "Deep Residual Learning for Image Recognition. Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers."

# doc = "Fairness in Dead-Reckoning based Distributed Multi-Player Games. In a distributed multi-player game that uses dead-reckoning vectors to exchange movement information among players, there is inaccuracy in rendering the objects at the receiver due to network delay between the sender and the receiver. The object is placed at the receiver at the position indicated by the dead-reckoning vector, but by that time, the real position could have changed considerably at the sender. This inaccuracy would be tolerable if it is consistent among all players; that is, at the same physical time, all players see inaccurate (with respect to the real position of the object) but the same position and trajectory for an object. But due to varying network delays between the sender and different receivers, the inaccuracy is different at different players as well. This leads to unfairness in game playing. In this paper, we first introduce an error measure for estimating this inaccuracy. Then we develop an algorithm for scheduling the sending of dead-reckoning vectors at a sender that strives to make this error equal at different receivers over time. This algorithm makes the game very fair at the expense of increasing the overall mean error of all players. To mitigate this effect, we propose a budget based algorithm that provides improved fairness without increasing the mean error thereby maintaining the accuracy of game playing. We have implemented both the scheduling algorithm and the budget based algorithm as part of BZFlag, a popular distributed multi-player game. We show through experiments that these algorithms provide fairness among players in spite of widely varying network delays. An additional property of the proposed algorithms is that they require less number of DRs to be exchanged (compared to the current implementation of BZflag) to achieve the same level of accuracy in game playing."

# doc = "Evaluating Adaptive Resource Management for Distributed Real-Time Embedded Systems. A challenging problem faced by researchers and developers of distributed real-time and embedded (DRE) systems is devising and implementing effective adaptive resource management strategies that can meet end-to-end quality of service (QoS) requirements in varying operational conditions. This paper presents two contributions to research in adaptive resource management for DRE systems. First, we describe the structure and functionality of the Hybrid Adaptive Resourcemanagement Middleware (HyARM), which provides adaptive resource management using hybrid control techniques for adapting to workload fluctuations and resource availability. Second, we evaluate the adaptive behavior of HyARM via experiments on a DRE multimedia system that distributes video in real-time. Our results indicate that HyARM yields predictable, stable, and high system performance, even in the face of fluctuating workload and resource availability."

# doc = "Real World BCI: Cross-Domain Learning and Practical Applications"

# doc = "keyphrase generation"

In [8]:
# now compute the document representation
# for punc in string.punctuation:
#     doc = doc.replace(punc, " ")
    
doc_tokens = tokenizer(doc, max_length=256, truncation=True, return_tensors="pt")
with torch.no_grad():
    doc_rep = model(d_kwargs=doc_tokens)["d_rep"].squeeze()  # (sparse) doc rep in voc space, shape (voc_size,)
    print(torch.sum(doc_rep))
    # doc_rep = encode_custom_mask_punc(doc_tokens, model).squeeze()
print(doc_rep.shape)
# get the number of non-zero dimensions in the rep:
col = torch.nonzero(doc_rep).squeeze().cpu().tolist()
print("number of actual dimensions: ", len(col))

# now let's inspect the bow representation:
weights = doc_rep[col].cpu().tolist()
d = {k: v for k, v in zip(col, weights)}
sorted_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
bow_rep = []
for k, v in sorted_d.items():
    print((reverse_voc[k], round(v, 2)))
    bow_rep.append((reverse_voc[k], round(v, 2)))
# print("SPLADE BOW rep:\n", bow_rep)



tensor(147.4938)
torch.Size([79577])
number of actual dimensions:  327
('pyramid', 1.79)
('ice', 1.77)
('iceberg', 1.75)
('union', 1.68)
('segmentation', 1.67)
('pooling', 1.67)
('fl', 1.6)
('ice formation', 1.59)
('random field', 1.57)
('intersection', 1.49)
('psp', 1.49)
('supplement', 1.49)
('sky', 1.49)
('validation', 1.43)
('ebit', 1.41)
('deep learning', 1.4)
('close', 1.39)
('range', 1.38)
('post', 1.37)
('remote sensing', 1.36)
('melt', 1.34)
('sea', 1.33)
('automated analysis', 1.32)
('##et', 1.31)
('sea ice', 1.31)
('surface', 1.3)
('localization', 1.29)
('detection', 1.28)
('##cake', 1.27)
('optical image', 1.27)
('pond', 1.26)
('bra', 1.26)
('##oe', 1.24)
('##ern', 1.19)
('ea', 1.17)
('ponds', 1.15)
('anybody', 1.13)
('ice cover', 1.09)
('dense', 1.09)
('automatic detection', 1.08)
('##sh', 1.07)
('segment', 1.06)
('fold', 1.05)
('conditional', 1.04)
('ensemble', 1.03)
('near', 1.01)
('up', 0.99)
('layers', 0.99)
('class', 0.98)
('images', 0.97)
('cross', 0.97)
('per', 0.96
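
The inspection loop above keeps only the non-zero dimensions of the representation, sorts them by weight, and maps vocabulary ids back to strings. The same steps are sketched below on a toy vector. `toy_rep`, `toy_vocab`, and `top_terms` are made-up names and values for illustration, not model outputs.

```python
def top_terms(rep, id_to_token, k=None):
    """Return (token, weight) pairs for the non-zero dimensions, heaviest first.

    rep: list of weights indexed by vocabulary id.
    id_to_token: mapping from vocabulary id to token string.
    k: optional cap on the number of returned terms (None keeps all).
    """
    nonzero = [(i, w) for i, w in enumerate(rep) if w != 0]
    nonzero.sort(key=lambda item: item[1], reverse=True)
    return [(id_to_token[i], round(w, 2)) for i, w in nonzero[:k]]


# toy values chosen for illustration
toy_vocab = {0: "ice", 1: "sea", 2: "the", 3: "sea ice"}
toy_rep = [1.77, 1.33, 0.0, 1.31]

all_terms = top_terms(toy_rep, toy_vocab)
top_two = top_terms(toy_rep, toy_vocab, k=2)
print(all_terms)
print(top_two)
```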

In [9]:
len(tokenizer.tokenize(doc))

10

In [10]:
tokens = tokenizer(doc, return_tensors="pt")
out = encode_custom(tokens, model = model, is_q = False)
out.shape

torch.Size([1, 235, 46327])

In [11]:
tokens_str = [reverse_voc[int(idx)] for idx in tokens["input_ids"][0]]

In [161]:
row, col = torch.nonzero(out[0][:], as_tuple = True)

In [28]:
# for every input position, collect the vocabulary terms it activates and their weights
token_mapper = [[tokens_str[j], Counter()] for j in range(int(row.max()) + 1)]
for r, c in zip(row.tolist(), col.tolist()):
    c_token_str = reverse_voc[c]
    token_mapper[r][1][c_token_str] += float(out[0][r, c])

In [29]:
token_mapper

[['[CLS]', Counter()],
 ['sp', Counter({'sp': 1.6556336879730225})],
 ['##lad', Counter({'##lad': 0.5018502473831177})],
 ['##e', Counter({'##e': 0.886421799659729, '##a': 0.2372046709060669})],
 ['v', Counter({'v': 1.2498834133148193})],
 ['##2', Counter({'##2': 0.9780962467193604})],
 [':', Counter()],
 ['sparse',
  Counter({'sparse': 1.1829668283462524,
           'low rank': 0.7515316605567932,
           'matrix': 0.4643259346485138,
           'heavy': 0.3268167972564697,
           'filter': 0.25119322538375854,
           'high order': 0.23822885751724243,
           'mixed': 0.1988535225391388,
           'composite': 0.16261816024780273})],
 ['lexi',
  Counter({'lexi': 0.8966094255447388,
           'word embedding': 0.875156581401825,
           'speech synthesis': 0.5308637619018555,
           'facial expression': 0.1530182659626007,
           'peer assessment': 0.028486358001828194})],
 ['##cal',
  Counter({'##cal': 0.5345647931098938,
           'linear algebra': 0.1577

In [12]:
start_index = 24
end_index = 33
print([item[0] for item in token_mapper[start_index:end_index]])
test = Counter()
for item in token_mapper[start_index:end_index]:
    test.update(item[1])

print(test.keys())
test

['.', 'these', 'rb', '##a', 'subsidiaries', 'were', 'involved', 'in', 'br']
dict_keys(['by', '...', 'albert', 'this', 'martin', 'was', 'radio', ')', '"'])


Counter({'...': 1.4506860971450806,
         'was': 1.3149594068527222,
         'radio': 1.3119200468063354,
         '"': 1.0273680686950684,
         'by': 0.9747397899627686,
         'this': 0.7900874018669128,
         ')': 0.694354772567749,
         'albert': 0.03176310285925865,
         'martin': 0.013809965923428535})
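
Merging the per-token counters with `Counter.update` sums the weights of a term that several tokens expand to, rather than keeping the largest one. The sketch below shows the difference on made-up weights. `span` and `merge_max` are hypothetical names, and taking the maximum is shown only as an alternative that mirrors SPLADE's max pooling.

```python
from collections import Counter

# (input token, expansion weights) pairs; values are made up for illustration
span = [
    ("sea", Counter({"sea": 1.0, "ocean": 0.5})),
    ("ice", Counter({"ice": 1.5, "ocean": 0.25})),
]

# Counter.update with a mapping adds the values of shared keys
summed = Counter()
for _, expansions in span:
    summed.update(expansions)


def merge_max(items):
    """Merge expansion weights by keeping the largest weight per term."""
    merged = {}
    for _, expansions in items:
        for term, weight in expansions.items():
            merged[term] = max(weight, merged.get(term, 0.0))
    return merged


maxed = merge_max(span)
print(summed["ocean"])  # 0.75
print(maxed["ocean"])   # 0.5
```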

In [13]:
# inverse mapping: vocabulary term -> weights it receives across input positions

token_mapper = {}
for r, c in zip(row.tolist(), col.tolist()):
    c_token_str = reverse_voc[c]
    token_mapper.setdefault(c_token_str, []).append(float(out[0][r, c]))

In [13]:
for k in token_mapper:
    scores = list(sorted(token_mapper[k], reverse=True))
    print(k, [round(item, 2) for item in scores[:10]])

model [1.86, 1.85, 1.8, 1.7, 1.61, 1.48, 0.84, 0.78, 0.57, 0.41]
efficiency [2.18, 2.14, 1.99, 1.94, 1.88, 1.88, 1.61, 1.38, 1.31, 1.17]
efficient [1.5, 0.6, 0.55, 0.46, 0.38, 0.38, 0.16]
research [0.58, 0.09, 0.01]
study [1.2]
test [0.91, 0.57]
assessment [0.22, 0.04]
sp [1.97, 1.93, 1.91, 1.89]
##lad [1.38, 1.31, 1.29, 1.23]
##e [1.03, 0.99, 0.86, 0.78]
late [1.76, 1.65, 1.57]
##nce [0.37, 0.26, 0.11]
##ncy [1.6, 1.54, 1.41]
issue [0.29]
problem [0.33]
important [1.06, 0.17]
considered [0.15, 0.07]
overlooked [1.09, 1.07]
evaluate [0.39]
ir [1.52]
based [0.59]
pre [0.96]
##train [0.99]
##ed [0.64]
language [0.89]
pl [1.39]
##m [0.93, 0.68, 0.37]
##ms [1.02, 0.77]
reason [0.01]
multiple [0.42]
hardware [0.6]
software [0.51]
part [0.07]
system [0.84, 0.75]
paper [0.0]
good [0.19, 0.04]
better [0.59, 0.14]
improve [1.4, 1.2, 0.86, 0.53]
improvement [0.72, 0.42]
achieved [0.2]
state [0.49, 0.07]
zero [0.95]
shot [0.52]
performance [0.66, 0.57, 0.53]
competitive [1.09]
result [0.34]
tre [

In [13]:
tokens

{'input_ids': tensor([[  101, 15756, 12850,  2869,  2241,  2006,  9742, 15066,  4117,  2007,
         15796,  7205, 10638,  3945,  2031,  3728,  2363,  1037,  2843,  1997,
          3086,  1010, 11427,  2037,  3112,  2000,  4487, 16643, 20382,  1998,
          1013,  2030,  2488, 16227,  1997,  4973,  2005,  2731,  1011,  1011,
          2096,  2145, 18345,  2006,  1996,  2168, 21505,  4294,  1012,  1999,
          1996, 12507,  1010, 20288,  6630,  4083, 17999,  2011,  3151, 20037,
          5950,  2075,  5461,  2038,  2464,  1037,  3652,  3037,  1010, 22490,
          2075,  2013, 16166, 20868,  3188,  2015,  2107,  2004, 13216, 16105,
          9289,  9844,  1012,  2096,  2070,  6549, 10176,  2031,  2042,  3818,
          1010,  1037,  8276,  3947,  2038,  2042,  2404,  1999,  1996,  2731,
          1997,  2107,  4275,  1012,  1999,  2023,  2147,  1010,  2057,  3857,
          2006, 11867, 27266,  2063,  1011,  1011,  1037, 20288,  4935,  1011,
          2241, 12850,  2099,  1011,  

In [None]:
with torch.no_grad():
    # encode() returns two values for this model; unpacking three raised
    # "ValueError: not enough values to unpack (expected 3, got 2)".
    # Assumption: the second value is the token-index tensor.
    # Both are batched, i.e. of shape (batch, vocab) rather than (30522,).
    batch_doc_rep, batch_doc_token_indices = model.encode(tokenizer([doc, doc], return_tensors="pt"), is_q=False)

for i in range(batch_doc_rep.size(0)):
    doc_rep = batch_doc_rep[i]
    doc_token_indices = batch_doc_token_indices[i]

    # get the number of non-zero dimensions in the rep
    # (squeeze(-1) keeps a list even when there is a single non-zero dimension):
    col = torch.nonzero(doc_rep).squeeze(-1).cpu().tolist()
    print("number of actual dimensions: ", len(col))

    # now let's inspect the bow representation:
    weights = doc_rep[col].cpu().tolist()
    _indices = doc_token_indices[col].cpu().tolist()
    d = {k: v for k, v in zip(col, weights)}
    d_indices = {reverse_voc[k]: v for k, v in zip(col, _indices)}
    sorted_d = {reverse_voc[k]: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
    print(d_indices, "\n", sorted_d)

In [23]:
temp[0].shape, temp[1].shape

(torch.Size([2, 30522]), torch.Size([2, 30522]))
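
Both tensors have shape (batch size, vocabulary size), so each row can be read as a sparse bag of words. Below is a minimal sketch of that read-out on a toy vector, with plain lists instead of tensors. `toy_rep` and `toy_reverse_voc` are hypothetical stand-ins for one row of the representation and for `reverse_voc`.

```python
def to_bow(rep, id_to_token):
    """Return {token: weight} for the non-zero dimensions, highest weight first."""
    nonzero = [(i, w) for i, w in enumerate(rep) if w != 0]
    nonzero.sort(key=lambda item: item[1], reverse=True)
    return {id_to_token[i]: w for i, w in nonzero}

# Hypothetical 4-term vocabulary and one toy representation row.
toy_reverse_voc = {0: "the", 1: "key", 2: "phrase", 3: "model"}
toy_rep = [0.0, 1.2, 2.5, 0.3]
toy_bow = to_bow(toy_rep, toy_reverse_voc)
```

The number of entries in the resulting dictionary corresponds to the "number of actual dimensions" of the representation.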

In [30]:
tokenizer.tokenize(doc)

['er',
 '##u',
 '-',
 'kg',
 ':',
 'efficient',
 'reference',
 '-',
 'aligned',
 'un',
 '##su',
 '##per',
 '##vis',
 '##ed',
 'key',
 '##ph',
 '##rase',
 'generation',
 '.',
 'un',
 '##su',
 '##per',
 '##vis',
 '##ed',
 'key',
 '##ph',
 '##rase',
 'prediction',
 'has',
 'gained',
 'growing',
 'interest',
 'in',
 'recent',
 'years',
 '.',
 'however',
 ',',
 'existing',
 'methods',
 'typically',
 'rely',
 'on',
 'he',
 '##uri',
 '##stic',
 '##ally',
 'defined',
 'importance',
 'scores',
 ',',
 'which',
 'may',
 'lead',
 'to',
 'inaccurate',
 'inform',
 '##ative',
 '##ness',
 'estimation',
 '.',
 'in',
 'addition',
 ',',
 'the',
 'y la',
 'ck',
 'consideration',
 'for',
 'time',
 'efficiency',
 '.',
 'to',
 'solve',
 'these',
 'problems',
 ',',
 'we',
 'propose',
 'er',
 '##u',
 '-',
 'kg',
 ',',
 'an',
 'un',
 '##su',
 '##per',
 '##vis',
 '##ed',
 'key',
 '##ph',
 '##rase',
 'generation',
 '(',
 'uk',
 '##g',
 ')',
 'model',
 'that',
 'consists',
 'of',
 'a',
 'phrase',
 '##ness',
 'and'

In [32]:
original_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [37]:
print(tokenizer.tokenize(doc))
print()
print(original_tokenizer.tokenize(doc))

['attention', 'is', 'all', 'you', 'need', '.', 'the', 'dominant', 'sequence', 'trans', '##duction', 'models', 'are', 'based', 'on', 'complex', 'recurrent', 'or', 'convolution', 'al', 'neural networks', 'in', 'an', 'en', '##code', '##r', '-', 'decoder', 'configuration', '.', 'the', 'best', 'performing', 'models', 'also', 'connect', 'the', 'en', '##code', '##r', 'and', 'decoder', 'through', 'an', 'attention mechanism', '.', 'we', 'propose', 'a', 'new', 'simple', 'network architecture', ',', 'the', 'transform', '##er', ',', 'based', 'solely', 'on', 'attention mechanism', 's', ',', 'di', '##sp', '##ens', '##ing', 'with', 'recurrence', 'and', 'convolution', 's', 'entirely', '.', 'experiments', 'on', 'two', 'machine translation', 'tasks', 'show', 'these', 'models', 'to', 'be', 'superior', 'in', 'quality', 'while', 'being', 'more', 'parallel', '##iza', '##ble', 'and', 'requiring', 'significantly', 'less', 'time', 'to', 'train', '.', 'our', 'model', 'achieve', '##s', '28', '.', '4', 'b', '##le

In [39]:
original_tokenizer.tokenize("asdbaisbd")

['as', '##db', '##ais', '##b', '##d']
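
The split above is what greedy longest-match-first WordPiece matching produces for a string that is not in the vocabulary as a whole word. Below is a simplified sketch of that matching, assuming a tiny hand-picked vocabulary. `toy_wordpiece_vocab` is hypothetical and is not the real `distilbert-base-uncased` vocabulary; the real tokenizer also normalizes the text and splits on punctuation first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_wordpiece_vocab = {"as", "##db", "##ais", "##b", "##d"}
toy_pieces = wordpiece("asdbaisbd", toy_wordpiece_vocab)
```

With this toy vocabulary, `toy_pieces` is `['as', '##db', '##ais', '##b', '##d']`, the same pieces as in the output above.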

In [1]:
import json

In [2]:
with open("/scratch/lamdo/doris-mae/DORIS-MAE_dataset_v1.json") as f:
    ds = json.load(f)

In [5]:
ds["Corpus"][10]

{'masked_abstract': "Machine learning systems often experience a distribution shift between training and testing . In this paper , we introduce a simple variational objective whose optima are exactly the set of all representations on which risk minimizers are guaranteed to be robust to any distribution shift that preserves the Bayes predictor , * , covariate shifts . Our objective has two components . First , a representation must remain discriminative for the task , i.e. , some predictor must be able to simultaneously minimize the source and target risk . Second , the representation 's marginal support needs to be the same across source and target . We make this practical by designing self-supervised objectives that only use unlabelled data and augmentations to train robust representations . Our objectives give insights into the robustness of * , and further improve * 's representations to achieve * results on * .",
 'original_abstract': "Machine learning systems often experience a di

In [None]:
# machine learning is fun
# -> ["machine", "learning", "machine learning", "is", "fun"]
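
The note above suggests emitting a multi-word phrase as an extra token right after its constituent words. Below is a minimal sketch of that idea, assuming whitespace tokenization and a hypothetical phrase set `toy_phrases`. It is not the tokenizer used in this notebook, only an illustration of the intended output.

```python
def tokenize_with_phrases(text, phrases, max_len=3):
    """Split on whitespace and append each known phrase after its last word."""
    words = text.lower().split()
    out = []
    for i, word in enumerate(words):
        out.append(word)
        # Emit any known phrase of 2..max_len words that ends at this word.
        for n in range(2, max_len + 1):
            if i + 1 >= n:
                candidate = " ".join(words[i + 1 - n : i + 1])
                if candidate in phrases:
                    out.append(candidate)
    return out

toy_phrases = {"machine learning"}  # hypothetical phrase vocabulary
toy_tokens = tokenize_with_phrases("machine learning is fun", toy_phrases)
```

Here `toy_tokens` is `["machine", "learning", "machine learning", "is", "fun"]`, so the single-word tokens are kept and the phrase is added on top of them.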