Information Retrieval

This Python module is an educational, intuitive introduction to the fundamentals of Information Retrieval (IR). It provides simple implementations of basic IR techniques with minimal dependencies.

The module is intended for educational purposes and is not suitable for production environments.

Features

Simple Search and Ranking: Provides basic search and ranking using the Vector Space Model.
Automated Testing: Includes a suite of automated tests to check the reliability and correctness of the implementation.
Minimal Dependencies: Relies on only a few dependencies (for word stemming and tokenisation) to keep the module lightweight and easy to understand.
Progress Tracking: Uses the tqdm library to show real-time progress during lengthy operations.
Detailed Logging: Uses a logging system to track operations, which helps with debugging and keeps the process transparent.
Object-Oriented Design: The module focuses on modularity and extensibility while keeping abstraction light.
- easy to extend with new features and new weighting models
Optimisation Techniques: A few optimisations speed up computation.
- cache inverse document frequencies
- use matrix multiplication for vector operations
- process larger documents first (so the documents remaining to be processed get progressively smaller)
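
To make the ranking idea and the cached inverse document frequency concrete, here is a minimal sketch using only the standard library. It is not the module's implementation: the class name TinyVectorSpace, its methods, the whitespace tokenisation, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


class TinyVectorSpace:
    """Toy TF-IDF index. Hypothetical names, not the module's API."""

    def __init__(self, documents):
        # documents: mapping of document id -> raw text
        self.doc_terms = {
            doc_id: Counter(text.lower().split())
            for doc_id, text in documents.items()
        }
        self.n_docs = len(self.doc_terms)
        # Document frequency: number of documents containing each term.
        self.doc_freq = Counter()
        for counts in self.doc_terms.values():
            self.doc_freq.update(counts.keys())
        self._idf_cache = {}  # term -> IDF, filled lazily

    def idf(self, term):
        # Compute each term's IDF once and reuse it afterwards.
        if term not in self._idf_cache:
            df = self.doc_freq.get(term, 0)
            # Smoothed IDF (an assumed variant) so unseen terms do not divide by zero.
            self._idf_cache[term] = math.log((1 + self.n_docs) / (1 + df)) + 1
        return self._idf_cache[term]

    def vector(self, counts):
        # Sparse TF-IDF vector: term -> weight.
        return {term: tf * self.idf(term) for term, tf in counts.items()}

    def search(self, query, top_k=3):
        q = self.vector(Counter(query.lower().split()))
        q_norm = math.sqrt(sum(w * w for w in q.values()))
        scored = []
        for doc_id, counts in self.doc_terms.items():
            d = self.vector(counts)
            d_norm = math.sqrt(sum(w * w for w in d.values()))
            dot = sum(w * d.get(term, 0.0) for term, w in q.items())
            norm = q_norm * d_norm
            # Cosine similarity; 0.0 when either vector is empty.
            scored.append((dot / norm if norm else 0.0, doc_id))
        scored.sort(reverse=True)
        return [(doc_id, round(score, 3)) for score, doc_id in scored[:top_k]]
```

With a dense term-document matrix, the per-document loop in search could be replaced by a single matrix-vector product, which is the idea behind the matrix multiplication item in the list above.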

Installation

Install all the dependencies using pip within a virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# install dev dependencies
pip install -r requirements-dev.txt

Usage

Data Preparation

Example of the data directory structure:

./data/
└── EnglishNews
   ├── News6.txt
   ├── News78.txt
   ├── News84.txt
   ├── News2994.txt
   └── News3000.txt

The vector space model is built from the collection of documents in the EnglishNews directory.
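
As a rough illustration of reading such a directory, the sketch below loads every .txt file and orders the files largest first. The function name load_documents and its behaviour are assumptions for this example, not the module's own loader.

```python
from pathlib import Path


def load_documents(directory):
    """Read all *.txt files into {file stem: text}, largest file first.

    Hypothetical helper for illustration only.
    """
    paths = sorted(
        Path(directory).glob("*.txt"),
        key=lambda p: p.stat().st_size,
        reverse=True,  # larger documents come first
    )
    # Dicts preserve insertion order, so iteration follows the size ordering.
    return {p.stem: p.read_text(encoding="utf-8", errors="ignore") for p in paths}
```

Calling load_documents("./data/EnglishNews") would then return a mapping keyed by names such as News6 and News78.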

Model Construction

Build the vector space model in main.py, then run it:

python main.py
# or with arguments
python main.py --sample-size 1000 --query "London BBC breaking news" --logging-level CRITICAL
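
The three flags shown above could be parsed with argparse roughly as follows. This is a sketch under assumptions: the defaults, help texts, and the parse_args helper are hypothetical, and main.py may handle its arguments differently.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse the command-line flags from the example (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Build a vector space model and run a query."
    )
    parser.add_argument("--sample-size", type=int, default=None,
                        help="number of documents to use (assumed: all if omitted)")
    parser.add_argument("--query", type=str, default="London BBC breaking news",
                        help="free-text search query")
    parser.add_argument("--logging-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="name of a standard logging level")
    return parser.parse_args(argv)


args = parse_args([
    "--sample-size", "1000",
    "--query", "London BBC breaking news",
    "--logging-level", "CRITICAL",
])
# Map the level name to the numeric constant used by the logging module.
level = getattr(logging, args.logging_level)
```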

Save and Load Model

The joblib library is particularly efficient for objects that carry large NumPy arrays, which may be the case for a vector space model. The following snippet saves and loads the model with joblib:

import os

import joblib

# VectorSpace, BM25, Parser and SnowballStemmer must be imported first (see main.py).
# logging_level, files_path and sample_size are supplied by the caller.
vs = VectorSpace(
    weighting_model=BM25(),
    parser=Parser(stemmer=SnowballStemmer(language="english")),
    logging_level=logging_level,
)
vs.build(documents_directory=files_path, sample_size=sample_size)

# save the model to disk
joblib.dump(vs, os.path.join("vsm", "bm25_snwball_vs.joblib"))

# load the model from disk and query it
vs_loaded = joblib.load(os.path.join("vsm", "bm25_snwball_vs.joblib"))
vs_loaded.search("London BBC breaking news")

Check main.py for examples.
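
The snippet above selects BM25 as the weighting model. For readers unfamiliar with it, here is a minimal sketch of the standard Okapi BM25 scoring formula. The function bm25_score, its parameters, and the defaults k1=1.5 and b=0.75 are assumptions for illustration; the module's BM25 class may differ in its details.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenised document for a tokenised query.

    Hypothetical helper: doc_freq maps term -> number of documents containing it.
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freq.get(term, 0)
        # IDF variant with +1 inside the log, which keeps the value positive.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length normalisation: longer-than-average documents are penalised.
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score
```

Compared with plain TF-IDF, the term frequency saturates (controlled by k1) and document length is normalised (controlled by b), so a short document mentioning a term once can outrank a long one that mentions it once.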

Testing

A Makefile is provided, so you can run the tests and coverage with:

# run all tests
make test
# run coverage
make cov

Acknowledgments

This project began as a part of a course on Web Search and Mining, taught by Professor Tsai at National Chengchi University (NCCU). I extend my heartfelt gratitude to Professor Tsai for his invaluable guidance and the insights that sparked the development of this module.

A special acknowledgment goes to the adage that reminds us that software does not merely get built; it grows.

UML

TODO: UML diagram

hzionn/Information-Retrieval
