<a href="https://colab.research.google.com/github/mzkhan2000/NLP/blob/main/semantic_search_publications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Search in Publications

This notebook demonstrates how [sentence-transformers](https://www.sbert.net) and the [SPECTER](https://github.com/allenai/specter) model can be used to find similar publications.

As the corpus, we use all EMNLP publications from 2016 to 2018.

We then query this corpus with papers presented at EMNLP 2019 and 2020 to find similar earlier publications.


In [None]:
!pip install sentence-transformers


In [None]:
import json
import os
from sentence_transformers import SentenceTransformer, util

#First, we load the papers dataset (with title and abstract information)
dataset_file = 'emnlp2016-2018.json'

if not os.path.exists(dataset_file):
  util.http_get("https://sbert.net/datasets/emnlp2016-2018.json", dataset_file)

with open(dataset_file) as fIn:
  papers = json.load(fIn)

print(len(papers), "papers loaded")

974 papers loaded
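
The later cells read four fields from each record: `title`, `abstract`, `venue` and `year`. A minimal sketch for checking that every loaded record carries them, assuming the file is a JSON list of objects (as the `len(papers)` call suggests). The `missing_fields` helper and the sample records are hypothetical and only stand in for the real dataset:

```python
import json

# Fields that the rest of the notebook reads from each paper record.
REQUIRED_FIELDS = ("title", "abstract", "venue", "year")

def missing_fields(paper):
    """Return the required fields that are absent or empty in one record."""
    return [field for field in REQUIRED_FIELDS if not paper.get(field)]

# Hypothetical sample records standing in for the downloaded dataset.
sample_json = """[
  {"title": "Paper A", "abstract": "Some abstract.", "venue": "EMNLP", "year": "2018"},
  {"title": "Paper B", "abstract": "", "venue": "EMNLP", "year": "2017"}
]"""
sample_papers = json.loads(sample_json)

for index, paper in enumerate(sample_papers):
    problems = missing_fields(paper)
    if problems:
        print("Record", index, "is missing:", ", ".join(problems))
```

Running the same loop over `papers` before encoding would surface records with an empty abstract, which would otherwise be embedded from the title alone.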


In [None]:
#We then load the allenai-specter model with SentenceTransformers
model = SentenceTransformer('allenai-specter')

#To encode the papers, we combine each title and abstract into a single string
paper_texts = [paper['title'] + '[SEP]' + paper['abstract'] for paper in papers]

#Compute embeddings for all papers
corpus_embeddings = model.encode(paper_texts, convert_to_tensor=True)


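
The concatenation used above can be wrapped in a small helper so that corpus papers and queries are formatted the same way. This is only a sketch: `build_paper_text` is a hypothetical name, and falling back to the title alone when the abstract is missing is an assumption, not something the notebook does.

```python
SEP_TOKEN = "[SEP]"  # separator string the notebook places between title and abstract

def build_paper_text(title, abstract=None):
    """Join title and abstract the same way for corpus papers and queries."""
    if not abstract:
        # Assumption: fall back to the title alone when no abstract is available.
        return title
    return title + SEP_TOKEN + abstract

print(build_paper_text("A Title", "An abstract."))
print(build_paper_text("Title only"))
```

With such a helper, the list comprehension over `papers` and the query string in the search function could both call the same code path.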



In [None]:
#We define a function that, given a title & abstract, searches our corpus for relevant (similar) papers
def search_papers(title, abstract):
  query_embedding = model.encode(title+'[SEP]'+abstract, convert_to_tensor=True)

  search_hits = util.semantic_search(query_embedding, corpus_embeddings)
  search_hits = search_hits[0]  #Get the hits for the first query

  print("Paper:", title)
  print("Most similar papers:")
  for hit in search_hits:
    related_paper = papers[hit['corpus_id']]
    print("{:.2f}\t{}\t{} {}".format(hit['score'], related_paper['title'], related_paper['venue'], related_paper['year']))
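
`util.semantic_search` scores each corpus embedding against the query embedding and returns the best matches as dicts with `corpus_id` and `score` keys, sorted by decreasing score. By default it uses cosine similarity and keeps the top 10 hits. The following is a rough pure-Python sketch of that idea on toy vectors, not the library's implementation. `cosine_similarity`, `top_k_hits` and the toy data are hypothetical names introduced for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_hits(query_vector, corpus_vectors, top_k=10):
    """Score every corpus vector against the query and keep the best top_k."""
    hits = [
        {"corpus_id": idx, "score": cosine_similarity(query_vector, vec)}
        for idx, vec in enumerate(corpus_vectors)
    ]
    # Highest similarity first, mirroring the order in which results are printed.
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]

# Toy 2-dimensional "embeddings"; real SPECTER embeddings have far more dimensions.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]

for hit in top_k_hits(toy_query, toy_corpus, top_k=2):
    print("{:.2f}\tcorpus_id={}".format(hit["score"], hit["corpus_id"]))
```

The library version works on batched tensors and processes queries and corpus in chunks, which matters for large corpora. For the 974 papers here, a brute-force comparison like this is conceptually all that happens.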

## Search
Now we use some papers presented at EMNLP 2019 and 2020 as queries and search the corpus for similar publications.

In [None]:
# This paper was the EMNLP 2019 Best Paper
search_papers(title='Specializing Word Embeddings (for Parsing) by Information Bottleneck', 
              abstract='Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.')


Paper: Specializing Word Embeddings (for Parsing) by Information Bottleneck
Most similar papers:
0.88	An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing	EMNLP 2018
0.87	NORMA: Neighborhood Sensitive Maps for Multilingual Word Embeddings	EMNLP 2018
0.87	Generalizing Word Embeddings using Bag of Subwords	EMNLP 2018
0.87	Word Embeddings for Code-Mixed Language Processing	EMNLP 2018
0.87	LAMB: A Good Shepherd of Morphologically Rich Languages	EMNLP 2016
0.87	Word Mover's Embedding: From Word2Vec to Document Embedding	EMNLP 2018
0.87	Charagram: Embedding Words and Sentences via Character n-grams	EMNLP 2016
0.87	Segmentation-Free Word Embedding for Unsegmented Languages	EMNLP 2017
0.86	Addressing Troublesome Words in Neural Machine Translation	EMNLP 2018
0.86	Conditional Word Embedding and Hypothesis Testing via Bayes-by-Backprop	EMNLP 2018




In [None]:
# This paper was the EMNLP 2020 Best Paper
search_papers(title='Digital Voicing of Silent Speech',
              abstract='In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to train from EMG collected during silently articulated speech. We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals. Our method greatly improves intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data, decreasing transcription word error rate from 64% to 4% in one data condition and 88% to 68% in another. To spur further development on this task, we share our new dataset of silent and vocalized facial EMG measurements.')




Paper: Digital Voicing of Silent Speech
Most similar papers:
0.82	Session-level Language Modeling for Conversational Speech	EMNLP 2018
0.79	Neural Multitask Learning for Simile Recognition	EMNLP 2018
0.78	Speech segmentation with a neural encoder model of working memory	EMNLP 2017
0.77	MSMO: Multimodal Summarization with Multimodal Output	EMNLP 2018
0.77	Estimating Marginal Probabilities of n-grams for Recurrent Neural Language Models	EMNLP 2018
0.76	A Co-Attention Neural Network Model for Emotion Cause Analysis with Emotional Context Awareness	EMNLP 2018
0.76	Learning Unsupervised Word Translations Without Adversaries	EMNLP 2018
0.75	Large Margin Neural Language Model	EMNLP 2018
0.75	Phrase-Based & Neural Unsupervised Machine Translation	EMNLP 2018
0.75	Multimodal Language Analysis with Recurrent Multistage Fusion	EMNLP 2018
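
The top score for this query (0.82) is lower than for the previous one (0.88), and the scores fall off faster. One way to act on that, sketched here under assumptions, is to drop hits below a minimum score before printing them. The hits are assumed to be dicts with `corpus_id` and `score` keys, as used in `search_papers`. The `filter_hits` helper, the 0.80 cut-off and the example `corpus_id` values are hypothetical and not tuned on this data.

```python
def filter_hits(hits, min_score):
    """Keep only hits whose similarity score reaches min_score."""
    return [hit for hit in hits if hit["score"] >= min_score]

# Scores taken from the first three results printed above; the ids are made up.
example_hits = [
    {"corpus_id": 0, "score": 0.82},
    {"corpus_id": 1, "score": 0.79},
    {"corpus_id": 2, "score": 0.78},
]

for hit in filter_hits(example_hits, min_score=0.80):
    print("{:.2f}\tcorpus_id={}".format(hit["score"], hit["corpus_id"]))
```

A suitable threshold depends on the model and the corpus, so any fixed value should be checked against a few queries with known related papers.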


In [None]:
# This paper was an EMNLP 2020 Honourable Mention paper
search_papers(title='If beam search is the answer, what was the question?',
              abstract='Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural language generators frequently leads to low-quality results. Rather, most state-of-the-art results on language generation tasks are attained using beam search despite its overwhelmingly high search error rate. This implies that the MAP objective alone does not express the properties we desire in text, which merits the question: if beam search is the answer, what was the question? We frame beam search as the exact solution to a different decoding objective in order to gain insights into why high probability under a model alone may not indicate adequacy. We find that beam search enforces uniform information density in text, a property motivated by cognitive science. We suggest a set of decoding objectives that explicitly enforce this property and find that exact decoding with these objectives alleviates the problems encountered when decoding poorly calibrated language generation models. Additionally, we analyze the text produced using various decoding strategies and see that, in our neural machine translation experiments, the extent to which this property is adhered to strongly correlates with BLEU.')


Paper: If beam search is the answer, what was the question?
Most similar papers:
0.91	A Stable and Effective Learning Strategy for Trainable Greedy Decoding	EMNLP 2018
0.90	Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation	EMNLP 2018
0.90	Why Neural Translations are the Right Length	EMNLP 2016
0.88	Learning Neural Templates for Text Generation	EMNLP 2018
0.87	Towards Decoding as Continuous Optimisation in Neural Machine Translation	EMNLP 2017
0.86	A Tree-based Decoder for Neural Machine Translation	EMNLP 2018
0.86	Memory-enhanced Decoder for Neural Machine Translation	EMNLP 2016
0.86	Trainable Greedy Decoding for Neural Machine Translation	EMNLP 2017
0.86	Multi-Reference Training with Pseudo-References for Neural Translation and Text Generation	EMNLP 2018
0.86	Addressing Troublesome Words in Neural Machine Translation	EMNLP 2018




In [None]:
# This paper was an EMNLP 2020 Honourable Mention paper
search_papers(title='Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems',
              abstract='The lack of time efficient and reliable evaluation methods is hampering the development of conversational dialogue systems (chat bots). Evaluations that require humans to converse with chat bots are time and cost intensive, put high cognitive demands on the human judges, and tend to yield low quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are humans participants in these conversations). These annotations then allow us to rank chat bots regarding their ability to mimic conversational behaviour of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chat bot is able to uphold human-like behavior the longest, i.e. Survival Analysis. This metric has the ability to correlate a bot’s performance to certain of its characteristics (e.g. fluency or sensibleness), yielding interpretable results. The comparably low cost of our framework allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chat bots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.')




Paper: Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems
Most similar papers:
0.86	Multi-view Response Selection for Human-Computer Conversation	EMNLP 2016
0.84	Patterns of Argumentation Strategies across Topics	EMNLP 2017
0.84	Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog	EMNLP 2017
0.83	Towards Exploiting Background Knowledge for Building Conversation Systems	EMNLP 2018
0.83	AirDialogue: An Environment for Goal-Oriented Dialogue Research	EMNLP 2018
0.82	WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community	EMNLP 2018
0.82	Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task	EMNLP 2018
0.82	The Teams Corpus and Entrainment in Multi-Party Spoken Dialogues	EMNLP 2016
0.81	Deal or No Deal? End-to-End Learning of Negotiation Dialogues	EMNLP 2017
0.81	MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Tas

In [None]:
# EMNLP 2020 paper on making Sentence-BERT multilingual
search_papers(title='Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation',
              abstract='We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows to create multilingual versions from previously monolingual models. The training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence. We use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model. Compared to other methods for training multilingual sentence embeddings, this approach has several advantages: It is easy to extend existing models with relatively few samples to new languages, it is easier to ensure desired properties for the vector space, and the hardware requirements for training is lower. We demonstrate the effectiveness of our approach for 50+ languages from various language families. Code to extend sentence embeddings models to more than 400 languages is publicly available.')




Paper: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
Most similar papers:
0.90	Sentence Compression for Arbitrary Languages via Multilingual Pivoting	EMNLP 2018
0.90	Learning Crosslingual Word Embeddings without Bilingual Corpora	EMNLP 2016
0.89	Unsupervised Multilingual Word Embeddings	EMNLP 2018
0.89	InferLite: Simple Universal Sentence Representations from Natural Language Inference Data	EMNLP 2018
0.88	Improving Cross-Lingual Word Embeddings by Meeting in the Middle	EMNLP 2018
0.88	Dynamic Meta-Embeddings for Improved Sentence Representations	EMNLP 2018
0.88	Porting an Open Information Extraction System from English to German	EMNLP 2016
0.88	Unsupervised Statistical Machine Translation	EMNLP 2018
0.87	Contextual Parameter Generation for Universal Neural Machine Translation	EMNLP 2018
0.87	Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations	EMNLP 2018
