Search Engine Implementation in Python
This project was implemented as part of a course.
The problem statement:
- Build a search engine for the Environmental News NLP archive.
- Build a corpus for the archive with at least 418 documents.
Our search engine supports the following query types:
- Simple Boolean Query (e.g. good deed, which is interpreted as "good AND deed")
- Phrase Query (e.g. prince charles)
- Wildcard Query (e.g. nat*, *til, nat*nal)
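As a rough illustration of how these three query types can be evaluated, the sketch below runs them against a tiny hand-made positional index. The `INDEX` data and all function names are hypothetical and are not taken from this project's code. Wildcard expansion uses the standard library's `fnmatch` as a stand-in rather than the pygtrie-based index listed among the project's libraries.

```python
import fnmatch

# Hypothetical toy positional index: term -> {doc_id: [token positions]}.
INDEX = {
    "good": {1: [0], 2: [3]},
    "deed": {1: [1], 3: [0]},
    "prince": {2: [0], 3: [4]},
    "charles": {2: [1], 3: [7]},
    "national": {1: [5]},
    "nature": {3: [2]},
    "until": {2: [6]},
}


def boolean_and(terms):
    """Return the doc ids that contain every term (implicit AND)."""
    postings = [set(INDEX.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()


def phrase(terms):
    """Return the doc ids where the terms occur at consecutive positions."""
    hits = set()
    for doc in boolean_and(terms):
        # Try every position of the first term as the start of the phrase.
        for start in INDEX[terms[0]][doc]:
            if all(start + i in INDEX[term][doc] for i, term in enumerate(terms)):
                hits.add(doc)
                break
    return hits


def expand_wildcard(pattern):
    """Return the vocabulary terms matching a pattern such as nat*, *til or nat*nal."""
    return sorted(term for term in INDEX if fnmatch.fnmatchcase(term, pattern))


def wildcard(pattern):
    """Return the doc ids containing any term that matches the pattern."""
    docs = set()
    for term in expand_wildcard(pattern):
        docs |= set(INDEX[term])
    return docs
```

With this toy index, `boolean_and(["good", "deed"])` returns `{1}`, `phrase(["prince", "charles"])` returns `{2}` (document 3 contains both words, but not next to each other), and `expand_wildcard("nat*")` returns `["national", "nature"]`. Note that `fnmatch` also treats `?` and `[...]` as special characters, which a real wildcard parser may not want.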
Some features include:
- Corpus and Query Preprocessing
- Inverted Index
- Parallelized Index Construction
- Ranked Results (top-K document retrieval)
- Searching on a single index (e.g. republicans and democrats | CNN.201710.csv)
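The features above can be sketched end to end in a few lines: preprocess each document, build an inverted index concurrently, and rank matches to return the top K. This is a minimal sketch under stated assumptions, not the project's implementation. The documents, the stop list, the tf-idf scoring, and every function name are hypothetical. A thread pool is used only to keep the example short; CPU-bound indexing in CPython would more likely use `ProcessPoolExecutor`.

```python
import heapq
import math
import re
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mini-corpus: doc_id -> raw text.
DOCS = {
    "d1": "Republicans debate the climate bill.",
    "d2": "Democrats and Republicans clash over climate policy.",
    "d3": "Flooding hits the coast.",
}
STOPWORDS = {"the", "and", "over"}  # assumed stop list, for illustration only


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOPWORDS]


def index_one(item):
    """Compute term frequencies for a single (doc_id, text) pair."""
    doc_id, text = item
    return doc_id, Counter(preprocess(text))


def build_index(docs, workers=4):
    """Process documents concurrently, then merge into term -> {doc_id: tf}."""
    index = defaultdict(dict)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for doc_id, counts in pool.map(index_one, docs.items()):
            for term, tf in counts.items():
                index[term][doc_id] = tf
    return dict(index)


def top_k(query, index, n_docs, k=2):
    """Score documents by summed tf-idf and return the k best doc ids."""
    scores = Counter()
    for term in preprocess(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    best = heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])
    return [doc_id for doc_id, _ in best]
```

For example, after `index = build_index(DOCS)`, the call `top_k("republicans and democrats", index, len(DOCS))` returns `["d2", "d1"]`: the rarer term "democrats" carries a higher idf weight, so the document containing both terms ranks first.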
The following steps will help you set up and run the project.
Install the external libraries using requirements.txt:
python -m pip install -r requirements.txt
Run the project:
- Windows
python main.py
- Linux
python3 main.py
External libraries used:
- NLTK - For natural language processing and corpus preprocessing
- pandas - For reading and interpreting CSV files in the dataset
- bidict - For the bidirectional dictionary
- pygtrie - For index construction
This project is licensed under the MIT License - see the LICENSE file for details.