
A search engine based on the course Information Retrieval at BML Munjal University. It includes features such as relevance feedback, pseudo relevance feedback, PageRank, HITS analysis, and document clustering.


mohit155/SearchEngine


SearchEngine

A search engine based on the course "Information Retrieval" at BML Munjal University.

The main parts of the project are:

Crawler

Collects information by crawling websites and stores it as a token-to-document inverted index. The project includes the crawled data for 500 pages, starting from the seed URL https://en.wikipedia.org/wiki/Main_Page. The tokens and URLs are stored as well.
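To make the idea of a token-to-document inverted index concrete, here is a minimal sketch. It is not the project's crawler code: the `tokenize` and `build_inverted_index` helpers, the in-memory `pages` dictionary, and the example.org URLs are all assumptions made for illustration, and the real crawler fetches pages over HTTP rather than reading them from a dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each token to {url: term frequency} for the given pages.

    `pages` is a hypothetical {url: page_text} dictionary standing in
    for whatever the crawler actually downloaded.
    """
    index = defaultdict(dict)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token][url] = index[token].get(url, 0) + 1
    return index


# Illustrative data only, not taken from the crawled Wikipedia pages.
pages = {
    "https://example.org/a": "Search engines crawl pages",
    "https://example.org/b": "Crawl the pages, then index the pages",
}
index = build_inverted_index(pages)
print(index["pages"])
```

Looking up a query term in `index` gives the URLs that contain it together with a term count, which is the raw material for TF-IDF style ranking.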

Relevance and Pseudo Relevance feedback

Relevance feedback is a technique search engines use for query expansion. The user marks retrieved documents as relevant or irrelevant, and the IR (information retrieval) system then recommends an expanded query that is likely to return more relevant results. Pseudo relevance feedback, on the other hand, assumes that the top two retrieved documents are relevant and suggests an expanded query without asking the user.
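The README does not say how the expanded query is computed. One common choice is the Rocchio formula, so the sketch below uses it as an assumption. The function names, the `alpha`, `beta`, and `gamma` weights, and the toy term-weight dictionaries are all hypothetical.

```python
from collections import Counter


def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return Rocchio-adjusted term weights (assumed method, not confirmed).

    `query` and each document are {term: weight} dictionaries.
    """
    scores = Counter()
    for term, weight in query.items():
        scores[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            scores[term] += beta * weight / len(relevant)
    for doc in irrelevant:
        for term, weight in doc.items():
            scores[term] -= gamma * weight / len(irrelevant)
    # Terms pushed to zero or below are dropped from the new query.
    return {term: s for term, s in scores.items() if s > 0}


def expand_query(query, relevant, irrelevant, extra_terms=2):
    """Append the highest-weighted new terms to the original query terms."""
    weights = rocchio(query, relevant, irrelevant)
    new_terms = sorted(
        (t for t in weights if t not in query),
        key=lambda t: (-weights[t], t),
    )
    return list(query) + new_terms[:extra_terms]


def pseudo_expand(query, ranked_docs, top_n=2):
    """Pseudo relevance feedback: treat the top-ranked documents as relevant."""
    return expand_query(query, ranked_docs[:top_n], [])


query = {"jaguar": 1.0}
doc_a = {"jaguar": 1.0, "cat": 2.0}
doc_b = {"jaguar": 1.0, "cat": 1.0, "jungle": 1.0}
doc_c = {"jaguar": 1.0, "car": 3.0}

print(expand_query(query, [doc_a, doc_b], [doc_c]))
print(pseudo_expand(query, [doc_a, doc_b, doc_c]))
```

With explicit feedback the user supplies both lists; with pseudo relevance feedback only the ranking is needed, which is why `pseudo_expand` passes an empty irrelevant list.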

Relevance feedback form (screenshot)

Relevance feedback result (screenshot)

Document Clustering

Clustering means grouping similar things together. I used k-means clustering to group the documents. The documents were clustered into 50 clusters (k = 50) after 25 iterations. The user is recommended the cluster that contains the system's most relevant result.

Clusters (screenshots): Cluster 1, Cluster 2
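The clustering and recommendation steps can be sketched as follows. This is an illustrative version under stated assumptions, not the project's code: documents are represented as short numeric vectors, the first k points serve as initial centroids so the run is deterministic, and the `kmeans` and `recommend` names are hypothetical. The example uses k = 2 on five toy points rather than k = 50 on 500 documents.

```python
import math


def kmeans(points, k, iterations=25):
    """Plain k-means: return (cluster index per point, final centroids)."""
    # Assumption: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, point in enumerate(points):
            assignment[i] = min(
                range(k), key=lambda c: math.dist(point, centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def recommend(assignment, top_doc):
    """Return the other documents in the same cluster as the top result."""
    cluster = assignment[top_doc]
    return [i for i, a in enumerate(assignment) if a == cluster and i != top_doc]


# Toy 2-D "document vectors"; real ones would be TF-IDF vectors.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(recommend(assignment, top_doc=0))
```

Here `recommend` mirrors the behavior described above: once the best-matching document is known, the rest of its cluster is offered to the user.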

Page Rank Algorithm

The PageRank (PR) value of every crawled document was calculated using the algebraic method described at https://en.wikipedia.org/wiki/PageRank#Algebraic. The project includes methods to sort results either naively (by TF-IDF score) or by the PageRank values of the pages.
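In the algebraic formulation, the rank vector R solves (I - dM) R = ((1 - d) / N) 1, where M is the column-normalized link matrix, d is the damping factor, and N is the number of pages. The sketch below solves that linear system with a small Gaussian elimination routine so it needs no external libraries. The `solve` and `pagerank` helpers, the damping factor of 0.85, the handling of dangling pages, and the three-page link graph are assumptions for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def pagerank(links, d=0.85):
    """Algebraic PageRank for a {page: [linked pages]} graph.

    Assumes every link target is also a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    pos = {page: i for i, page in enumerate(pages)}
    # M[i][j] is the share of page j's rank that flows to page i.
    M = [[0.0] * n for _ in range(n)]
    for src, outs in links.items():
        # Assumption: a page with no out-links spreads its rank evenly.
        targets = outs or pages
        for dst in targets:
            M[pos[dst]][pos[src]] += 1.0 / len(targets)
    A = [
        [(1.0 if i == j else 0.0) - d * M[i][j] for j in range(n)]
        for i in range(n)
    ]
    b = [(1.0 - d) / n] * n
    return dict(zip(pages, solve(A, b)))


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Sorting search results by `ranks` instead of by TF-IDF score corresponds to the second of the two orderings mentioned above.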

HITS analysis

The hub and authority scores of the retrieved pages are calculated by diagonalizing the product of the adjacency matrix and its transpose. The scores are then displayed alongside the links and can help identify hubs and authorities.
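For an adjacency matrix A, authority scores correspond to the principal eigenvector of AᵀA and hub scores to the principal eigenvector of AAᵀ. The sketch below does not diagonalize those products as the project does; it swaps in power iteration, which typically converges to the same principal eigenvectors and needs only the standard library. The `hits` function, the iteration count, and the three-page graph are assumptions for illustration.

```python
import math


def hits(adj, iterations=50):
    """Return (hub scores, authority scores) via power iteration.

    adj[i][j] is 1 if page i links to page j, else 0.
    """
    n = len(adj)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores linking to it.
        auths = [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # A page's hub score is the sum of the authorities it links to.
        hubs = [sum(adj[i][j] * auths[j] for j in range(n)) for i in range(n)]
        # Normalize to unit length so the values stay bounded.
        auth_norm = math.sqrt(sum(x * x for x in auths)) or 1.0
        hub_norm = math.sqrt(sum(x * x for x in hubs)) or 1.0
        auths = [x / auth_norm for x in auths]
        hubs = [x / hub_norm for x in hubs]
    return hubs, auths


# Page 0 links to pages 1 and 2; page 1 links to page 2.
adj = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
hubs, auths = hits(adj)
print(hubs.index(max(hubs)), auths.index(max(auths)))
```

In this toy graph page 0 comes out as the strongest hub because it links to both other pages, and page 2 as the strongest authority because both other pages link to it.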

HITS scores (screenshot)
