Problem Statement:
Given the URL of a GitHub repository, the proposed solution lets the user gain a preliminary understanding of the repository by asking the system questions in a conversational setup.
The solution consists of 3 main modules, each of which functions as a standalone dockerized microservice:
- API: Captures the main tasks of downloading the repository, processing it, embedding the content, and so on. More details can be found here.
- UI: A simple interface to see the solution in action.
- VecDB: A vector database that supports CRUD operations on vector embeddings along with some metadata.
See demo here
- Include a `.env` file with the key `OPENAI_API_KEY=""` in the `API` module (i.e., in this path).
- Download the embedding model from here and place it in this location.
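As a minimal sketch of what loading that `.env` file might look like inside the API module (the parser below is a hypothetical stand-in for a library such as python-dotenv, not the repository's actual loader):

```python
import os
from pathlib import Path


def load_dotenv_minimal(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file into os.environ.

    A minimal stand-in for python-dotenv's load_dotenv(): skips blank
    lines and comments, strips surrounding quotes, and never overrides
    variables that are already set in the environment.
    """
    loaded = {}
    env_file = Path(path)
    if not env_file.exists():
        return loaded
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

With `OPENAI_API_KEY="sk-..."` written to the API module's `.env`, a call like `load_dotenv_minimal("API/.env")` would make the key available via `os.environ`.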
- The microservices are encapsulated as composable Docker services. Hence, run `docker-compose up --build` at the root location.
- You can find each of the modules at the following URLs:
- UI: localhost:8000
- API: localhost:8001/docs
- Qdrant Dashboard: localhost:6333/dashboard
- UI: Streamlit is used for building a basic interactive app.
Screenshot:
- API: FastAPI is used for implementing the RESTful services. More details about the supported APIs can be found here.
- VectorDB: A self-hosted Qdrant instance is used as the vector database.
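To illustrate the role the vector database plays, here is a toy in-memory store with the same CRUD-plus-search shape. This is a pure-Python sketch of the concept, not the Qdrant client API:

```python
import math
from typing import Dict, List, Tuple


class TinyVectorStore:
    """A toy in-memory vector store: CRUD on (vector, metadata) points
    plus cosine-similarity search, mirroring what Qdrant provides."""

    def __init__(self) -> None:
        self._points: Dict[str, Tuple[List[float], dict]] = {}

    def upsert(self, point_id: str, vector: List[float], payload: dict) -> None:
        self._points[point_id] = (vector, payload)   # create or update

    def get(self, point_id: str) -> Tuple[List[float], dict]:
        return self._points[point_id]                # read

    def delete(self, point_id: str) -> None:
        self._points.pop(point_id, None)             # delete

    def search(self, query: List[float], top_k: int = 3) -> List[Tuple[str, float]]:
        """Return the top_k point ids ranked by cosine similarity."""
        def cosine(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        scored = [(pid, cosine(query, vec)) for pid, (vec, _) in self._points.items()]
        return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]
```

In the real system, the embedded repository chunks would be upserted with payloads (file path, chunk text, etc.), and the user's question embedding would be the search query.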
- Improvements are possible in several areas, including (but not limited to):
- Filtering strategies while downloading the repo
- Detecting programming language of the scripts and performing appropriate cleaning strategies
- Creating richer metadata for the scripts at both document and chunk level such as summaries of functions, comments, function names etc.
- More advanced strategies while embedding the scripts, to bring about "Contextual RAG".
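The "richer metadata" idea above can be sketched for Python scripts using the standard-library `ast` module (the function below is a hypothetical illustration, not code from the repository; other languages would need their own parsers):

```python
import ast


def extract_python_metadata(source: str) -> dict:
    """Collect document-level metadata from a Python script: function
    and class names plus docstrings and line numbers. Chunk-level
    metadata could additionally track line ranges and comments."""
    tree = ast.parse(source)
    functions, classes = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append({
                "name": node.name,
                "docstring": ast.get_docstring(node),
                "lineno": node.lineno,
            })
        elif isinstance(node, ast.ClassDef):
            classes.append({"name": node.name, "lineno": node.lineno})
    return {"functions": functions, "classes": classes}
```

Such metadata could be stored alongside each embedding in the vector database and used for filtering or for building more informative prompts.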
- I have tried to add comments (`@TODO`) in appropriate places in the scripts.