Reasona is a modular AI/ML pipeline framework for streaming, preprocessing, embedding, indexing, retrieval, reranking, and inference on large-scale datasets. It is built around streaming-first workflows for efficient end-to-end processing, combines vector-based retrieval with transformer-based inference, and ships a Flask web app for interactive use.
- Streaming Data Ingestion: Stream large datasets directly from [Hugging Face Datasets] or Wikimedia without downloading entire files.
- Data Cleaning & Transformation: Remove duplicates, handle missing values, and convert to instruction-based JSON for embedding.
- Chunking & Embedding: Split long documents into configurable chunks with optional overlap; embed using [SentenceTransformers].
- Vector Indexing: Store embeddings in [FAISS] for fast similarity search.
- Retrieval & Reranking: Retrieve top-k results via FAISS and optionally rerank results using transformer-based models.
- Inference: Generate answers using pretrained models with configurable parameters.
- Flask Web App: Interactive interface for querying Reasona pipelines, managing chats, and visualizing results.
- Scalable & Configurable: Centralized YAML configuration for all pipelines.
- Logging & Monitoring: JSON logs and runtime metrics for all stages, including checkpoints and progress tracking.
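The chunking-with-overlap step above can be sketched as follows. This is an illustrative word-based splitter, not Reasona's actual implementation; the function name and parameters are assumptions:

```python
def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into word-based chunks of up to chunk_size tokens,
    repeating `overlap` tokens between consecutive chunks so that
    context is not lost at chunk boundaries."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# 10 words, chunks of 4 with overlap of 1 -> step of 3
print(chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1))
# -> ['a b c d', 'd e f g', 'g h i j']
```

Each chunk would then be embedded independently, with the overlap keeping sentences that straddle a boundary retrievable from at least one chunk.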
Reasona/
│
├── src/
│ └── Reasona/
│ ├── config/
│ │ ├── config_manager.py
│ │ └── params.yaml
│ ├── data/
│ │ ├── loader.py
│ │ ├── cleaner.py
│ │ ├── formatter.py
│ │ ├── chunker.py
│ │ └── embedder.py
│ ├── pipeline/
│ │ ├── preprocess_pipeline.py # Streaming + preprocessing
│ │ ├── indexing_pipeline.py # Chunking + embedding + FAISS
│ │ ├── reranking_pipeline.py # Transformer reranking
│ │ ├── inference_pipeline.py # Inference & retrieval
│ │ └── training_pipeline.py # Qwen/Qwen2.5-1.5B-Instruct
│ ├── vectorstore/
│ │ └── faiss_store.py
│ ├── services/
│ │ └── reasona_service.py
│ └── utils/
│ ├── logger.py
│ └── helpers.py
├── config/
│ ├── config.yaml
│ └── params.yaml
├── artifacts/
├── logs/
├── main.py
├── app.py
└── README.md
- Clone the repository:
git clone https://github.com/louayamor/Reasona.git
cd Reasona
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
- Install dependencies:
pip install -r requirements.txt
Control pipelines via config/config.yaml and config/params.yaml. Key sections:
- preprocess: dataset, split, max samples, batch size, shuffle, prefetch buffer.
- indexing: embedding model, chunk size/overlap, batch size, queue size, vector store directory, checkpoint frequency.
- reranking: reranker model, tokenizer, top-k reranking.
- retrieval: top-k results, embedding model, vector store path.
- inference: model path, tokenizer path, engine type, max tokens, temperature.
- flask: host, port, debug mode.
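A configuration following the sections above might look like this. This is a sketch of the shape only; the actual key names and defaults in config/config.yaml may differ:

```yaml
# Illustrative shape only -- key names may differ from the shipped config.
preprocess:
  dataset: wikimedia/wikipedia
  split: train
  max_samples: 100000
  batch_size: 64
indexing:
  embedding_model: sentence-transformers/all-MiniLM-L6-v2
  chunk_size: 256
  chunk_overlap: 32
  vectorstore_dir: artifacts/faiss
retrieval:
  top_k: 5
inference:
  max_tokens: 512
  temperature: 0.7
flask:
  host: 0.0.0.0
  port: 5000
  debug: false
```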
python main.py
Pipeline flow:
- Preprocessing Pipeline (Producer)
  - Streams data from Hugging Face or Wikimedia.
  - Cleans, formats, and converts to instruction-based JSON.
  - Stops at max_samples.
- Indexing Pipeline (Consumer)
  - Chunks data.
  - Embeds chunks using SentenceTransformers.
  - Stores vectors in FAISS with checkpoints.
Pipelines communicate via queues to handle large datasets efficiently.
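The producer/consumer pattern described above can be sketched with Python's standard library. The cleaning and indexing steps are stand-ins, and the sentinel convention is an assumption, not Reasona's actual wiring:

```python
import queue
import threading

SENTINEL = None  # signals the end of the stream

def producer(q, samples, max_samples):
    """Preprocessing side: push cleaned samples, stop at max_samples."""
    for i, sample in enumerate(samples):
        if i >= max_samples:
            break
        q.put(sample.strip().lower())  # stand-in for real cleaning/formatting
    q.put(SENTINEL)

def consumer(q, out):
    """Indexing side: consume until the sentinel arrives."""
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        out.append(item)  # stand-in for chunk + embed + index

q = queue.Queue(maxsize=8)  # bounded queue applies backpressure to the producer
results = []
t = threading.Thread(target=consumer, args=(q, results))
t.start()
producer(q, ["  Alpha ", " Beta ", " Gamma "], max_samples=2)
t.join()
print(results)  # -> ['alpha', 'beta']
```

A bounded queue is the key design point: when indexing falls behind, `q.put` blocks, so the producer never buffers the whole dataset in memory.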
python src/Reasona/pipeline/reranking_pipeline.py
python src/Reasona/pipeline/inference_pipeline.py
- Retrieve top-k results from FAISS.
- Optionally rerank using transformer models.
- Generate answers or code snippets.
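The top-k retrieval step can be illustrated without FAISS at all: FAISS accelerates exactly this brute-force similarity search. The function and index layout below are assumptions for illustration only:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_top_k(query_vec, index, k=2):
    """Brute-force stand-in for a FAISS similarity search:
    score every stored vector and return the k best document ids."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(retrieve_top_k([1.0, 0.0, 0.0], index, k=2))  # -> ['doc_a', 'doc_b']
```

The optional reranking stage would then re-score just these k candidates with a heavier transformer model, which is why top-k retrieval comes first.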
python app.py
- Access the interactive web interface at http://localhost:5000.
- Query datasets, retrieve top-k results, and perform inference.
- Hugging Face Datasets (streamable)
- wikimedia/wikipedia
- JSON-based logs:
  - logs/pipeline/preprocess_pipeline.json
  - logs/pipeline/indexing_pipeline.json
  - logs/pipeline/reranking_pipeline.json
  - logs/pipeline/inference_pipeline.json
- Logs include progress, checkpoints, runtime, and embedding statistics.
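JSON logs can be summarized with a few lines of Python. The field names below (stage, processed, runtime_s) are hypothetical, not Reasona's actual log schema:

```python
import json

# Hypothetical entries -- the real schema under logs/pipeline/ may differ.
entries = [
    {"stage": "preprocess", "processed": 1000, "runtime_s": 12.4},
    {"stage": "indexing", "processed": 1000, "runtime_s": 48.9},
]

for entry in entries:
    print(f"{entry['stage']}: {entry['processed']} samples in {entry['runtime_s']}s")

total = sum(e["runtime_s"] for e in entries)
print(f"total runtime: {total:.1f}s")  # -> total runtime: 61.3s

# In practice you would read the entries from a log file instead:
# with open("logs/pipeline/preprocess_pipeline.json") as f:
#     entries = json.load(f)
```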