<a href="https://colab.research.google.com/github/m-tari/arxiv_interface/blob/master/notebooks/04_semantic_search_publications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Search in Publications

This notebook demonstrates how to use [sentence-transformers](https://www.sbert.net) to find similar publications ([source](https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing)).

As the corpus, we use roughly 100k articles from the arXiv dataset that were published after 2021.


In [2]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import json
import os
from sentence_transformers import SentenceTransformer, util
import pandas as pd

sample_df_2021 = pd.read_csv('/content/drive/MyDrive/ML/sample_df_2021.csv')

print(len(sample_df_2021), "papers loaded")

100670 papers loaded


In [5]:
sample_df_2021.head()

| Unnamed: 0 | id | title | category | abstract | general_category |
|---|---|---|---|---|---|
| 0 | 2102.09833 | Demonstrating change from a drop-in space soun... | ['physics.ed-ph', 'physics.pop-ph'] | Impact evaluation in public engagement neces... | ['physics', 'physics'] |
| 1 | 2109.14384 | Challenges for variational reduced-density-mat... | ['physics.chem-ph'] | The direct variational optimization of the t... | ['physics'] |
| 2 | 2103.01407 | On the maximum number of maximum dissociation ... | ['math.CO'] | In a graph $G$, a subset of vertices is a di... | ['math'] |
| 3 | 2105.02669 | How to split the costs among travellers sharin... | ['cs.GT', 'math.OC'] | How to form groups in a mobility system that... | ['cs', 'math'] |
| 4 | 2012.11201 | Normalization and electronic circuit correctio... | ['physics.app-ph'] | In this manuscript we propose a theoretical ... | ['physics'] |


In [6]:
sample_df_2021_papers = sample_df_2021.loc[:, ['title', 'abstract']]


In [12]:
# Load the pretrained model with SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')


In [13]:
# To encode the papers, combine each title and abstract into a single string
paper_texts_concat = sample_df_2021_papers['title'] + '[SEP]' + sample_df_2021_papers['abstract']
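The titles and abstracts in this dataset contain hard line breaks and leading spaces. If you want to normalize them before joining, a minimal sketch could look like the following. The helper name `build_paper_text` is hypothetical and not part of the original notebook, and whether this cleanup changes retrieval quality is not tested here.

```python
import re

SEP = "[SEP]"  # same separator string the notebook uses


def build_paper_text(title, abstract, sep=SEP):
    """Join title and abstract into one string, collapsing runs of whitespace.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    def clean(text):
        # Treat missing values (None, or NaN floats coming from pandas) as empty.
        if not isinstance(text, str):
            return ""
        # Replace any run of whitespace (including newlines) with one space.
        return re.sub(r"\s+", " ", text).strip()

    return clean(title) + sep + clean(abstract)


example = build_paper_text(
    "Demonstrating change\n  from an exhibit",
    "  Impact evaluation\nrequires measuring change.",
)
print(example)
```

With a pandas DataFrame, such a helper could be applied row by row, for example with a list comprehension over `zip(df['title'], df['abstract'])`.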

In [14]:
paper_texts_concat.to_list()[:10]

["Demonstrating change from a drop-in space soundscape exhibit by using\n  graffiti walls both before and after[SEP]  Impact evaluation in public engagement necessarily requires measuring change.\nHowever, this is extremely challenging for drop-in activities due to their very\nnature. We present a novel method of impact evaluation which integrates\ngraffiti walls into the experience both before and after the main drop-in\nactivity. The activity in question was a soundscape exhibit, where young\nfamilies experienced the usually inaudible sounds of near-Earth space in an\nimmersive and accessible way. We apply two analysis techniques to the captured\nbefore and after data - quantitative linguistics and thematic analysis. These\nanalyses reveal significant changes in participants' responses after the\nactivity compared to before, namely an increased diversity in language used to\ndescribe space and altered conceptions of what space is like. The results\ndemonstrate that the soundscape was

In [15]:
paper_texts = paper_texts_concat.to_list()

#Compute embeddings for all papers
corpus_embeddings = model.encode(paper_texts, convert_to_tensor=True)

In [17]:
import pickle
#Saving corpus embeddings
with open('/content/drive/MyDrive/ML/sample_df_2021_embeddings.pkl', "wb") as fOut:
    pickle.dump(corpus_embeddings, fOut, protocol=pickle.HIGHEST_PROTOCOL)
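Saving the embeddings only pays off if a later session reads them back instead of re-encoding the corpus. The sketch below shows the round trip with `pickle`, using a small list of lists as a stand-in for the real tensor and a temporary file instead of the Drive path. The helper names `save_embeddings` and `load_embeddings` are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_embeddings(embeddings, path):
    """Write embeddings to disk with the highest pickle protocol."""
    with open(path, "wb") as f_out:
        pickle.dump(embeddings, f_out, protocol=pickle.HIGHEST_PROTOCOL)


def load_embeddings(path):
    """Read embeddings back; only unpickle files you created or trust."""
    with open(path, "rb") as f_in:
        return pickle.load(f_in)


# Stand-in data: two tiny "embeddings" instead of the real corpus tensor.
demo_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.pkl")

save_embeddings(demo_embeddings, demo_path)
restored = load_embeddings(demo_path)
print(restored == demo_embeddings)
```

In the notebook, the same `load_embeddings` call would point at the `.pkl` path used above. Note that a tensor pickled on a GPU runtime may need a GPU runtime to load again.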

In [18]:
sample_df_2021_papers.iloc[0]['title']

'Demonstrating change from a drop-in space soundscape exhibit by using\n  graffiti walls both before and after'

In [19]:
# Define a function that, given a title and an abstract, searches the corpus for relevant (similar) papers
def search_papers(title, abstract):
  query_embedding = model.encode(title+'[SEP]'+abstract, convert_to_tensor=True)

  search_hits = util.semantic_search(query_embedding, corpus_embeddings)
  search_hits = search_hits[0]  #Get the hits for the first query

  print("Paper:", title)
  print("Most similar papers:")
  for hit in search_hits:
    related_paper = sample_df_2021_papers.iloc[hit['corpus_id']]
    print("{:.2f}\t{}".format(hit['score'], related_paper['title']))
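By default, `util.semantic_search` scores every corpus embedding against the query with cosine similarity and returns the 10 best hits per query, each as a dict with `corpus_id` and `score`. To make that ranking step concrete, here is a plain-Python sketch of the same idea on toy 2-D vectors. The functions `cosine_similarity` and `top_k_hits` are illustrative stand-ins, not the library's implementation, which works on batched tensors.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_hits(query_vec, corpus_vecs, top_k=10):
    """Return the top_k corpus entries by cosine similarity, best first."""
    scored = (
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    )
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])


# Toy corpus of three 2-D "embeddings" and one query vector.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_hits([1.0, 0.1], toy_corpus, top_k=2)
for hit in hits:
    print("{:.2f}\t{}".format(hit["score"], hit["corpus_id"]))
```

The query points almost along the first axis, so corpus entry 0 ranks first and the diagonal entry 2 ranks second.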

## Search

In [22]:
sample_title = '''
 Holomorphy of normalized intertwining operators for certain induced representations I: a toy example 
'''
sample_abstract = '''
The theory of intertwining operators plays an important role in the development of the
Langlands program. This, in some sense, is a very sophisticated theory, but the basic question of
its singularity, in general, is quite unknown. Motivated by its deep connection with the longstand-
ing pursuit of constructing automorphic L-functions via the method of integral representations,
we prove the holomorphy of normalized local intertwining operators, normalized in the sense of
Casselman–Shahidi, for a family of induced representations of quasi-split classical groups as an
exercise. Our argument is the outcome of an observation of an intrinsic non-symmetry property
of normalization factors appearing in different reduced decompositions of intertwining operators.
Such an approach bears the potential to work in general.
'''
search_papers(title=sample_title, abstract=sample_abstract)


Paper: 
 Holomorphy of normalized intertwining operators for certain induced representations I: a toy example 

Most similar papers:
0.51	Representations of closed quadratic forms associated with Stieltjes and
  inverse Stieltjes holomorphic families of linear relations
0.50	Mutually Normalizing Regular Permutation Groups and Zappa-Szep
  Extensions of the Holomorph
0.50	An arithmetic property of intertwining operators for p-adic groups
0.49	Hausdorff Operators on Some Spaces of Holomorphic Functions on the Unit
  Disc
0.49	Inversion of a Class of Singular Integral Operators on Entire Functions
0.49	Holomorphic family of Dirac-Coulomb Hamiltonians in arbitrary dimension
0.49	Actions of Cusp Forms on Holomorphic Discrete Series and Von Neumann
  Algebras
0.48	The Krein-von Neumann Extension of a Regular Even Order
  Quasi-Differential Operator
0.48	Essential Commutants on Strongly Pseudo-convex Domains
0.47	Multiplication by a finite Blaschke product on weighted Bergman spaces:
  commut