
# reSEARCH

Information Retrieval System for Research Papers using Python.

Course assignment for CS F469 - Information Retrieval @ BITS Pilani, Hyderabad Campus.

Done under the guidance of Dr. Aruna Malapati, Assistant Professor, BITS Pilani, Hyderabad Campus.

## Table of Contents

- [Setup](#setup)
- [Introduction](#introduction)
- [Data](#data)
- [Text Preprocessing](#text-preprocessing)
- [Data Structures used](#data-structures-used)
- [Tf-Idf formulation](#tf-idf-formulation)
- [Results](#results)
- [Screenshots](#screenshots)
- [Members](#members)

Table of contents generated with [markdown-toc](https://github.com/jonschlinkert/markdown-toc)

## Setup

For setup, run the following commands in order:

```bash
# Run the scraper
python scraper.py

# Text pre-processing
python process_files.py

# Create a trie (pick one option)
python create_trie.py l   # lemmatized tokens
python create_trie.py s   # stemmed tokens
python create_trie.py n   # no stemming/lemmatization

# Build the document-to-index mapping
python doc2idx.py

# Start the web server
python run.py
```

## Introduction

A tf-idf based search engine for research papers on Arxiv. The main purpose of this project is to understand how vector space based retrieval models work. More on [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

## Data

The data has been scraped from Arxiv. The scraper, scraper.py, can be found in the scraper directory.

We use the following fields of papers from all categories present on Arxiv:

1. Title
2. Abstract
3. Authors
4. Subjects

Total terms in vocabulary = 38773. Total documents in corpus = 15686. Note: only the abstract data has been used for searching.

The data is organized into directories as follows:

```
Data/
├── abstracts   (text files containing the abstract)
├── authors     (text files containing the authors)
├── link        (text files containing the link to the pdf of the paper)
├── subject     (text files containing the subjects)
└── title       (text files containing the title)
```
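A minimal sketch of reading one paper's record back from this layout (illustrative only; the per-document file naming below is an assumption, not something the repository specifies):

```python
from pathlib import Path

DATA_DIR = Path("Data")
FIELDS = ["abstracts", "authors", "link", "subject", "title"]

def load_record(doc_id):
    """Collect the parallel text files that make up one paper's record."""
    record = {}
    for field in FIELDS:
        path = DATA_DIR / field / f"{doc_id}.txt"   # file naming is assumed
        record[field] = path.read_text().strip() if path.exists() else None
    return record

print(load_record(0))
```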

## Text Preprocessing

We processed the raw text scraped from Arxiv by applying the following operations (see the sketch after this list):

1. Tokenization
2. Stemming
3. Lemmatization
4. Stopwords removal
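A minimal sketch of such a pipeline using NLTK (the library choice and settings are assumptions; the repository's process_files.py may differ):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text, mode="l"):
    """Tokenize, drop stopwords, then stem ('s'), lemmatize ('l'), or neither ('n')."""
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    if mode == "s":
        return [stemmer.stem(t) for t in tokens]
    if mode == "l":
        return [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

print(preprocess("Neural networks are trained on large corpora.", mode="l"))
```

The `l`/`s`/`n` modes mirror the options accepted by create_trie.py in the setup steps above.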

## Data Structures used

We used a trie to store terms along with their term frequency in each document and their document frequency. The trie is also used to generate typing suggestions while querying.

### Time complexity of inserting and querying

The time complexity of inserting a term into the trie, or querying one, is O(n), where n is the number of characters in the term.
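A minimal sketch of such a trie (illustrative only; the class and method names are assumptions, not the repository's actual implementation):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.postings = {}   # doc_id -> term frequency in that document
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, doc_id):
        """Record one occurrence of `term` in `doc_id`."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
        node.postings[doc_id] = node.postings.get(doc_id, 0) + 1

    def lookup(self, term):
        """Return {doc_id: tf}; the document frequency is len(result)."""
        node = self.root
        for ch in term:
            if ch not in node.children:
                return {}
            node = node.children[ch]
        return node.postings if node.is_word else {}

    def suggestions(self, prefix, limit=5):
        """List up to `limit` stored terms starting with `prefix` (typeahead)."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        found, stack = [], [(node, prefix)]
        while stack and len(found) < limit:
            cur, word = stack.pop()
            if cur.is_word:
                found.append(word)
            stack.extend((child, word + ch) for ch, child in cur.children.items())
        return found

trie = Trie()
for doc_id, text in enumerate(["neural networks", "neural ranking"]):
    for term in text.split():
        trie.insert(term, doc_id)
print(trie.lookup("neural"))     # {0: 1, 1: 1} -> df = 2
print(trie.suggestions("ne"))    # ['neural', 'networks'] (order may vary)
```

Both `insert` and `lookup` visit one node per character of the term, which is where the O(n) bound above comes from.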

## Tf-Idf formulation

Tf-idf score = tf × log(N / df)

where:

- tf → term frequency of the term in the current document
- df → document frequency of the term
- N → total number of documents in the corpus
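As a quick worked example of the formula (the counts below are hypothetical, and the log base, natural log here, is a design choice the README does not specify):

```python
import math

def tf_idf(tf, df, N):
    """Score for a term occurring tf times in a doc and in df of N docs."""
    return tf * math.log(N / df)

# Hypothetical counts: a term appearing 3 times in one abstract,
# and present in 10 of the corpus's 15686 documents.
print(tf_idf(tf=3, df=10, N=15686))   # ≈ 22.07 with the natural log
```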

Machine specs:

- Processor: i7
- RAM: 8 GB DDR4
- OS: Ubuntu 16.04 LTS

## Results

Index building time:

- No stemming/lemmatization: 41.67 s
- Lemmatized text: 76.97 s
- Stemmed text: 146.13 s

Memory usage: around 410 MB.

## Screenshots

Screenshots in the repository cover:

- Retrieval time statistics
- Search results
- Search suggestions
- Document view

## Members

- Shubham Jha
- Praneet Mehta
- Abhinav Jain
- Saurabh Khandelwal