Search Engine on the Kaggle Environmental News NLP archive

hritvikpatel4/YAG

Yet Another Google (YAG) [Algorithms for Information Retrieval]

Search Engine Implementation in Python

This project was implemented as part of a course.

The problem statement:

  1. Build a search engine for the Environmental News NLP archive.
  2. Build a corpus from the archive with at least 418 documents.

Our search engine supports the following query types:

  • Simple Boolean Query (e.g. good deed, which is interpreted as "good AND deed")
  • Phrase Query (e.g. prince charles)
  • Wildcard Query (e.g. nat* , *til , nat*nal)
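
The Boolean and phrase query types above can be illustrated with a minimal sketch over a positional inverted index. This is not the project's code: the function names (`build_positional_index`, `boolean_and`, `phrase_query`), the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, query):
    """Return the doc IDs that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings)


def phrase_query(index, query):
    """Return the doc IDs where the query terms occur consecutively."""
    terms = query.lower().split()
    hits = set()
    # Only documents containing all terms can contain the phrase.
    for doc_id in boolean_and(index, query):
        starts = index[terms[0]][doc_id]
        # A match starts at p if term i appears at position p + i.
        if any(
            all(p + i in index[t][doc_id] for i, t in enumerate(terms))
            for p in starts
        ):
            hits.add(doc_id)
    return hits


# Hypothetical sample documents, not taken from the archive.
docs = {
    "d1": "Prince Charles praised a good deed",
    "d2": "Charles met the prince for a good cause",
    "d3": "one good deed by prince charles",
}
index = build_positional_index(docs)
print(sorted(boolean_and(index, "good deed")))        # ['d1', 'd3']
print(sorted(phrase_query(index, "prince charles")))  # ['d1', 'd3']
```

Document d2 contains both "prince" and "charles" but not next to each other in that order, so it satisfies the Boolean query "prince charles" and fails the phrase query.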

Features include:

  • Corpus and Query Preprocessing
  • Inverted Index
  • Parallelized Index Construction
  • Ranked Results (for top K documents retrieval)
  • Searching on a single index (e.g. republicans and democrats | CNN.201710.csv)
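
As a rough illustration of how preprocessing, concurrent index construction, and top-K ranking can fit together, here is a minimal sketch using only the standard library. It rests on several assumptions that are not from the project: the stop-word list, the TF-IDF scoring formula, the function names, and the use of a thread pool (chosen to keep the example self-contained; a process pool is the more usual choice for CPU-bound tokenization).

```python
import heapq
import math
import re
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

# Tiny illustrative stop-word list (assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def preprocess(text):
    """Lowercase, tokenize on letters and digits, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def index_one(item):
    """Build the term counts for one (doc_id, text) pair."""
    doc_id, text = item
    return doc_id, Counter(preprocess(text))


def build_index(docs, workers=4):
    """Index documents concurrently, then merge the partial results."""
    index = defaultdict(dict)  # term -> {doc_id: term frequency}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for doc_id, counts in pool.map(index_one, docs.items()):
            for term, tf in counts.items():
                index[term][doc_id] = tf
    return index


def top_k(index, n_docs, query, k=2):
    """Score documents by summed TF-IDF and return the k best doc IDs."""
    scores = defaultdict(float)
    for term in preprocess(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rarer terms get a higher weight; +1 keeps the weight positive.
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical sample documents, not taken from the archive.
docs = {
    "d1": "Republicans and Democrats debate the climate bill",
    "d2": "Climate change and climate policy in the news",
    "d3": "Democrats propose a new energy plan",
}
index = build_index(docs)
print(top_k(index, len(docs), "climate policy"))  # ['d2', 'd1']
```

Using `heapq.nlargest` avoids sorting every scored document when only the top K are needed.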

Getting Started

The following steps will help you set up and run the project.

Prerequisites

Install the external libraries listed in requirements.txt:

python -m pip install -r requirements.txt

Executing Code

  • Windows
      python main.py
  • Linux
      python3 main.py

Built With

  • NLTK - For natural language processing and corpus preprocessing
  • pandas - For reading and interpreting CSV files in the dataset
  • bidict - For the bidirectional dictionary
  • pygtrie - For index construction
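
One plausible reason to build the index on a trie is prefix lookup for wildcard queries such as nat*, *til, and nat*nal. The sketch below shows the idea with a small hand-written trie rather than pygtrie, so it makes no claim about pygtrie's API or about how the project uses it. The `Trie` class, `expand_wildcard`, the "$" end-of-word marker (which assumes terms never contain "$"), and the sample vocabulary are all hypothetical.

```python
class Trie:
    """A minimal character trie that supports prefix lookup."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def with_prefix(self, prefix):
        """Return every stored word that starts with the prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        found, stack = [], [(node, prefix)]
        while stack:
            current, word = stack.pop()
            for key, child in current.items():
                if key == "$":
                    found.append(word)
                else:
                    stack.append((child, word + key))
        return found


def expand_wildcard(pattern, forward, backward):
    """Expand a pattern with one '*' into the matching vocabulary terms."""
    prefix, _, suffix = pattern.partition("*")
    starts = set(forward.with_prefix(prefix))
    # The backward trie stores reversed words, so a suffix becomes a prefix.
    ends = {w[::-1] for w in backward.with_prefix(suffix[::-1])}
    # The length check stops the prefix and suffix from overlapping.
    return sorted(w for w in starts & ends if len(w) >= len(prefix) + len(suffix))


# Hypothetical vocabulary, not taken from the archive.
vocabulary = ["nation", "national", "natural", "until", "lentil", "utility"]
forward, backward = Trie(), Trie()
for term in vocabulary:
    forward.add(term)
    backward.add(term[::-1])

print(expand_wildcard("nat*", forward, backward))     # ['nation', 'national', 'natural']
print(expand_wildcard("*til", forward, backward))     # ['lentil', 'until']
print(expand_wildcard("nat*nal", forward, backward))  # ['national']
```

Each expanded term can then be looked up in the inverted index and the resulting postings combined.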

Authors

License

This project is licensed under the MIT License; see the LICENSE file for details.
