In [5]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
from pylatexenc.latex2text import LatexNodes2Text
from sentence_transformers import SentenceTransformer

import warnings
warnings.filterwarnings('ignore')

In [6]:
model = SentenceTransformer('allenai-specter')

In [4]:
# loading the data and embeddings for the first 10 chunks
embeddings1 = np.load(r'N:\arxiv_dataset\github_121420\chunks_1000\embeddings\0.npy')
data1 = pd.read_csv(r'N:\arxiv_dataset\github_121420\chunks_1000\data\0.csv')
for i in range(1, 10):
    embeddings = np.load(rf'N:\arxiv_dataset\github_121420\chunks_1000\embeddings\{i}.npy')
    data = pd.read_csv(rf'N:\arxiv_dataset\github_121420\chunks_1000\data\{i}.csv')
    embeddings1 = np.concatenate((embeddings1, embeddings), axis=0)
    data1 = pd.concat([data1, data])
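
Concatenating inside the loop copies the accumulated array on every iteration. One possible alternative, sketched here under assumptions, collects the chunks in lists and concatenates once at the end. The `load_chunks` helper and its `base_dir` and `n_chunks` parameters are hypothetical names, and the directory layout (`embeddings/{i}.npy`, `data/{i}.csv`) is assumed to mirror the paths used above.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def load_chunks(base_dir, n_chunks):
    """Load chunk files 0..n_chunks-1 and concatenate them once.

    Hypothetical helper: assumes base_dir contains embeddings/{i}.npy
    and data/{i}.csv, mirroring the layout used above.
    """
    base_dir = Path(base_dir)
    embedding_parts = []
    data_parts = []
    for i in range(n_chunks):
        embedding_parts.append(np.load(base_dir / 'embeddings' / f'{i}.npy'))
        data_parts.append(pd.read_csv(base_dir / 'data' / f'{i}.csv'))
    # a single concatenation avoids re-copying the growing array each iteration
    embeddings = np.concatenate(embedding_parts, axis=0)
    data = pd.concat(data_parts)
    return embeddings, data
```

The rows of the returned array and DataFrame stay aligned as long as every chunk's `.npy` and `.csv` files describe the same papers in the same order.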

In [7]:
# defining a function to preprocess and generate embeddings for user input
def preprocess_input(user_input, model):
    # cleaning the user input: str.replace returns a new string, so assign it back
    user_input = user_input.replace('\n', ' ')
    # converting any LaTeX markup to plain text
    latex_converter = LatexNodes2Text()
    user_input = latex_converter.latex_to_text(user_input)

    # generating embeddings for the user input
    user_embeddings = model.encode(user_input)

    return user_embeddings
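
Python strings are immutable, so `str.replace` returns a new string and the result has to be assigned back for the cleaning step to take effect. A minimal sketch of the newline cleaning on its own; the helper name `normalize_whitespace` is hypothetical and not part of the notebook.

```python
def normalize_whitespace(text):
    """Replace newlines with spaces and return the new string."""
    # str.replace does not modify text in place; it returns a new string
    return text.replace('\n', ' ')


raw = 'first line\nsecond line'
raw.replace('\n', ' ')               # result discarded, raw is unchanged
cleaned = normalize_whitespace(raw)  # result kept

print('\n' in raw)      # True
print('\n' in cleaned)  # False
```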

    

In [8]:
sample_user_input = '''Kiosk machines have gained good popularity among
the general public as they are easy to operate and provide a good
interactive interface. As a result, multiple users use the kiosk
machine throughout the day to find the information they are
looking for. Users interact with the kiosk machine by the means of
touching its screen or using the buttons. Due to this, it is observed
that throughout the day hundreds or even thousands of people
end up touching the surface of the kiosk machine. Because of this
hygiene cannot be maintained as it is not possible to sanitize the
kiosk machine after each use. This has become a serious issue
considering the effects that the Covid-19 pandemic had on the
world. Multiple people touching the same surface is one of the
most common ways through which the virus can spread. To help
deal with this problem we have designed a gesture control system
using deep learning techniques through which kiosk machines can
be operated in a touch-less way.'''

In [9]:
user_embeddings = preprocess_input(sample_user_input, model)

In [10]:
# defining a function to find cosine similarity
def cosine_sim(user_embeddings, data_embeddings):
    # cosine_similarity expects 2D inputs, so the single query vector is wrapped in a list
    similarity = cosine_similarity([user_embeddings], data_embeddings)
    return similarity
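
For reference, the cosine similarity between a query vector u and a row v is the dot product u·v divided by the product of the two Euclidean norms. A minimal NumPy sketch of the same computation follows. The function name `cosine_sim_numpy` is hypothetical, and the sketch assumes no vector has zero norm.

```python
import numpy as np


def cosine_sim_numpy(query, matrix):
    """Cosine similarity between a 1-D query and every row of a 2-D matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # dot product of the query with each row
    dots = matrix @ query
    # product of the row norms and the query norm (assumed non-zero)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms


query = np.array([1.0, 0.0])
rows = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
print(cosine_sim_numpy(query, rows))
```

Unlike scikit-learn's `cosine_similarity`, which returns a 2-D array of shape `(1, n)` for a single query, this sketch returns a 1-D array of length `n`.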

In [11]:
similarity = cosine_sim(user_embeddings, embeddings1)

In [14]:
data1['score'] = similarity[0]

In [17]:
pd.set_option('display.max_colwidth', None)

In [18]:
# top 10 recommendations based on similarity score
data1.sort_values('score', ascending=False).iloc[:10]

Unnamed: 0,id,submitter,title,doi,abstract,update_date,score
256,2248,Jegor Uglov Mr,Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition,10.1155/2008/468693,"Noise, corruptions and variations in face images can seriously hurt the performance of face recognition systems. To make such systems robust, multiclass neuralnetwork classifiers capable of learning from noisy data have been suggested. However on large face data sets such systems cannot provide the robustness at a high level. In this paper we explore a pairwise neural-network system as an alternative approach to improving the robustness of face recognition. In our experiments this approach is shown to outperform the multiclass neural-network system in terms of the predictive accuracy on the face images corrupted by noise.",2016-02-17,0.700318
713,2705,Nathalie Villa,Un résultat de consistance pour des SVM fonctionnels par interpolation spline,10.1016/j.crma.2006.09.025,"This Note proposes a new methodology for function classification with Support Vector Machine (SVM). Rather than relying on projection on a truncated Hilbert basis as in our previous work, we use an implicit spline interpolation that allows us to compute SVM on the derivatives of the studied functions. To that end, we propose a kernel defined directly on the discretizations of the observed functions. We show that this method is universally consistent.",2007-05-23,0.69933
760,5740,Sophie Frisch,Parametrization of Pythagorean triples by a single triple of polynomials,10.1016/j.jpaa.2007.05.019,"It is well known that Pythagorean triples can be parametrized by two triples of polynomials with integer coefficients. We show that no single triple of polynomials with integer coefficients in any number of variables is sufficient, but that there exists a parametrization of Pythagorean triples by a single triple of integer-valued polynomials.",2011-06-29,0.674661
712,2704,Nathalie Villa,Support vector machine for functional data classification,10.1016/j.neucom.2005.12.010,"In many applications, input data are sampled functions taking their values in infinite dimensional spaces rather than standard vectors. This fact has complex consequences on data analysis algorithms that motivate modifications of them. In fact most of the traditional data analysis tools for regression, classification and clustering have been adapted to functional inputs under the general name of functional Data Analysis (FDA). In this paper, we investigate the use of Support Vector Machines (SVMs) for functional data analysis and we focus on the problem of curves discrimination. SVMs are large margin classifier tools based on implicit non linear mappings of the considered data into high dimensional spaces thanks to kernels. We show how to define simple kernels that take into account the unctional nature of the data and lead to consistent classification. Experiments conducted on real world data emphasize the benefit of taking into account some functional aspects of the problems.",2007-05-23,0.668528
668,9632,Yanxia Zhang,Two novel approaches for photometric redshift estimation based on SDSS and 2MASS databases,10.1088/1009-9271/8/1/13,"We investigate two training-set methods: support vector machines (SVMs) and Kernel Regression (KR) for photometric redshift estimation with the data from the Sloan Digital Sky Survey Data Release 5 and Two Micron All Sky Survey databases. We probe the performances of SVMs and KR for different input patterns. Our experiments show that the more parameters considered, the accuracy doesn't always increase, and only when appropriate parameters chosen, the accuracy can improve. Moreover for different approaches, the best input pattern is different. With different parameters as input, the optimal bandwidth is dissimilar for KR. The rms errors of photometric redshifts based on SVM and KR methods are less than 0.03 and 0.02, respectively. Finally the strengths and weaknesses of the two approaches are summarized. Compared to other methods of estimating photometric redshifts, they show their superiorities, especially KR, in terms of accuracy.",2009-11-13,0.656348
148,5128,Kenichiro Aoki,A small tabletop experiment for a direct measurement of the speed of light,10.1119/1.2919743,"A small tabletop experiment for a direct measurement of the speed of light to an accuracy of few percent is described. The experiment is accessible to a wide spectrum of undergraduate students, in particular to students not majoring in science or engineering. The experiment may further include a measurement of the index of refraction of a sample. Details of the setup and equipment are given. Results and limitations of the experiment are analyzed, partly based on our experience in employing the experiment in our student laboratories. Safety considerations are also discussed.",2009-11-13,0.652193
385,2377,Jo\~ao Bastos,A multivariate approach to heavy flavour tagging with cascade training,10.1088/1748-0221/2/11/P11007,"This paper compares the performance of artificial neural networks and boosted decision trees, with and without cascade training, for tagging b-jets in a collider experiment. It is shown, using a Monte Carlo simulation of WH → lν qq̅ events, that for a b-tagging efficiency of 50",2011-01-27,0.649132
109,9073,Riccardo Zecchina,Efficient supervised learning in networks with binary synapses,10.1073/pnas.0700324104,"Recent experimental studies indicate that synaptic changes induced by neuronal activity are discrete jumps between a small number of stable states. Learning in systems with discrete synapses is known to be a computationally hard problem. Here, we study a neurobiologically plausible on-line learning algorithm that derives from Belief Propagation algorithms. We show that it performs remarkably well in a model neuron with binary synapses, and a finite number of `hidden' states per synapse, that has to learn a random classification task. Such system is able to learn a number of associations close to the theoretical limit, in time which is sublinear in system size. This is to our knowledge the first on-line algorithm that is able to achieve efficiently a finite number of patterns learned per binary synapse. Furthermore, we show that performance is optimal for a finite number of hidden states which becomes very small for sparse coding. The algorithm is similar to the standard `perceptron' learning algorithm, with an additional rule for synaptic transitions which occur only if a currently presented pattern is `barely correct'. In this case, the synaptic changes are meta-plastic only (change in hidden states and not in actual synaptic state), stabilizing the synapse in its current state. Finally, we show that a system with two visible states and K hidden states is much more robust to noise than a system with K visible states. We suggest this rule is sufficiently simple to be easily implemented by neurobiological systems or in hardware.",2009-11-13,0.648897
130,4114,Niels Ubbelohde,Bimodal Counting Statistics in Single Electron Tunneling through a Quantum Dot,10.1103/PhysRevB.76.155307,We explore the full counting statistics of single electron tunneling through a quantum dot using a quantum point contact as non-invasive high bandwidth charge detector. The distribution of counted tunneling events is measured as a function of gate and source-drain-voltage for several consecutive electron numbers on the quantum dot. For certain configurations we observe super-Poissonian statistics for bias voltages at which excited states become accessible. The associated counting distributions interestingly show a bimodal characteristic. Analyzing the time dependence of the number of electron counts we relate this to a slow switching between different electron configurations on the quantum dot.,2009-11-13,0.646445
521,8489,Marian Anghel,Consistency of support vector machines for forecasting the evolution of an unknown ergodic dynamical system from observations with unknown noise,10.1214/07-AOS562,"We consider the problem of forecasting the next (observable) state of an unknown ergodic dynamical system from a noisy observation of the present state. Our main result shows, for example, that support vector machines (SVMs) using Gaussian RBF kernels can learn the best forecaster from a sequence of noisy observations if (a) the unknown observational noise process is bounded and has a summable α-mixing rate and (b) the unknown ergodic dynamical system is defined by a Lipschitz continuous function on some compact subset of ℝ^d and has a summable decay of correlations for Lipschitz continuous functions. In order to prove this result we first establish a general consistency result for SVMs and all stochastic processes that satisfy a mixing notion that is substantially weaker than α-mixing.",2009-04-07,0.64236
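
Sorting the whole frame is fine at this size. For larger collections, one option is to pick the top k scores with `np.argpartition` and sort only those. The sketch below assumes a 1-D array of scores; `top_k_indices` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def top_k_indices(scores, k):
    """Return the positions of the k largest scores, highest first."""
    scores = np.asarray(scores)
    k = min(k, scores.size)
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition places the k largest values in the last k slots, unordered
    candidates = np.argpartition(scores, -k)[-k:]
    # order only those k candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]


print(top_k_indices([0.1, 0.9, 0.5, 0.7], 2))
```

With the variables above, `data1.iloc[top_k_indices(similarity[0], 10)]` would select the same ten rows, although rows with tied scores may come back in a different order than with `sort_values`.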
