
# reSEARCH

Information Retrieval System for Research Papers using Python.

Course assignment for CS F469 - Information Retrieval @ BITS Pilani, Hyderabad Campus.

Done under the guidance of Dr. Aruna Malapati, Assistant Professor, BITS Pilani, Hyderabad Campus.

## Table of Contents

- [Setup](#setup)
- [Introduction](#introduction)
- [Data](#data)
- [Text Preprocessing](#text-preprocessing)
- [Data Structures used](#data-structures-used)
- [Tf-Idf formulation](#tf-idf-formulation)
- [Results](#results)
- [Screenshots](#screenshots)
- [Members](#members)

Table of contents generated with [markdown-toc](https://github.com/jonschlinkert/markdown-toc)

## Setup

For setup, run the following commands in order:

```bash
# Run the scraper
python scraper.py

# Text pre-processing
python process_files.py

# Create a trie (pick one option)
python create_trie.py l   # lemmatized tokens
python create_trie.py s   # stemmed tokens
python create_trie.py n   # no stemming/lemmatization

# Build the document-to-index mapping
python doc2idx.py

# Start the web server
python run.py
```

## Introduction

A tf-idf based search engine for research papers on Arxiv. The main purpose of this project is to understand how vector space based retrieval models work. More on [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

## Data

The data has been scraped from Arxiv. The scraper, scraper.py, can be found in the scraper directory.

We use the following fields of papers from all categories present on Arxiv:

1. Title
2. Abstract
3. Authors
4. Subjects

Total terms in vocabulary = 38773. Total documents in corpus = 15686. Note: only the abstract data has been used for searching.

The data is organized into directories as follows:

```
Data/
├── abstracts   (text files containing the abstract)
├── authors     (text files containing the authors)
├── link        (text files containing the link to the pdf of the paper)
├── subject     (text files containing the subjects)
└── title       (text files containing the title)
```
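A minimal sketch of reading one paper's record back from this layout (illustrative only; the per-document file naming below is an assumption, not something the repository specifies):

```python
from pathlib import Path

DATA_DIR = Path("Data")
FIELDS = ["abstracts", "authors", "link", "subject", "title"]

def load_record(doc_id):
    """Collect the parallel text files that make up one paper's record."""
    record = {}
    for field in FIELDS:
        path = DATA_DIR / field / f"{doc_id}.txt"   # file naming is assumed
        record[field] = path.read_text().strip() if path.exists() else None
    return record

print(load_record(0))
```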

## Text Preprocessing

We processed the raw text scraped from Arxiv by applying the following operations (see the sketch after this list):

1. Tokenization
2. Stemming
3. Lemmatization
4. Stopwords removal
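A minimal sketch of such a pipeline using NLTK (the library choice and settings are assumptions; the repository's process_files.py may differ):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text, mode="l"):
    """Tokenize, drop stopwords, then stem ('s'), lemmatize ('l'), or neither ('n')."""
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    if mode == "s":
        return [stemmer.stem(t) for t in tokens]
    if mode == "l":
        return [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

print(preprocess("Neural networks are trained on large corpora.", mode="l"))
```

The `l`/`s`/`n` modes mirror the options accepted by create_trie.py in the setup steps above.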

## Data Structures used

We used a trie to store terms along with their term frequency in each document and their document frequency. The trie is also used to generate typing suggestions while querying.

### Time complexity of inserting and querying

The time complexity of inserting a term into the trie, or querying one, is O(n), where n is the number of characters in the term.
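A minimal sketch of such a trie (illustrative only; the class and method names are assumptions, not the repository's actual implementation):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.postings = {}   # doc_id -> term frequency in that document
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, doc_id):
        """Record one occurrence of `term` in `doc_id`."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
        node.postings[doc_id] = node.postings.get(doc_id, 0) + 1

    def lookup(self, term):
        """Return {doc_id: tf}; the document frequency is len(result)."""
        node = self.root
        for ch in term:
            if ch not in node.children:
                return {}
            node = node.children[ch]
        return node.postings if node.is_word else {}

    def suggestions(self, prefix, limit=5):
        """List up to `limit` stored terms starting with `prefix` (typeahead)."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        found, stack = [], [(node, prefix)]
        while stack and len(found) < limit:
            cur, word = stack.pop()
            if cur.is_word:
                found.append(word)
            stack.extend((child, word + ch) for ch, child in cur.children.items())
        return found

trie = Trie()
for doc_id, text in enumerate(["neural networks", "neural ranking"]):
    for term in text.split():
        trie.insert(term, doc_id)
print(trie.lookup("neural"))     # {0: 1, 1: 1} -> df = 2
print(trie.suggestions("ne"))    # ['neural', 'networks'] (order may vary)
```

Both `insert` and `lookup` visit one node per character of the term, which is where the O(n) bound above comes from.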

## Tf-Idf formulation

Tf-idf score = tf × log(N / df)

where:

- tf → term frequency of the term in the current document
- df → document frequency of the term
- N → total number of documents in the corpus
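As a quick worked example of the formula (the counts below are hypothetical, and the log base, natural log here, is a design choice the README does not specify):

```python
import math

def tf_idf(tf, df, N):
    """Score for a term occurring tf times in a doc and in df of N docs."""
    return tf * math.log(N / df)

# Hypothetical counts: a term appearing 3 times in one abstract,
# and present in 10 of the corpus's 15686 documents.
print(tf_idf(tf=3, df=10, N=15686))   # ≈ 22.07 with the natural log
```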

Machine specs:

- Processor: i7
- RAM: 8 GB DDR4
- OS: Ubuntu 16.04 LTS

## Results

Index building time:

- No stemming/lemmatization: 41.67 s
- Lemmatized text: 76.97 s
- Stemmed text: 146.13 s

Memory usage: around 410 MB.

## Screenshots

Screenshots in the repository cover:

- Retrieval time statistics
- Search results
- Search suggestions
- Document view

## Members

- Shubham Jha
- Praneet Mehta
- Abhinav Jain
- Saurabh Khandelwal