This project is a FastAPI and Streamlit-based document exploration application that lets clients securely access, explore, and analyze research publications from the CFA Institute Research Foundation. It supports data ingestion, document interaction, and multi-modal querying, with summarization and research note generation.
- Architecture Diagram
- Components
- Technologies Used
- Install dependencies using Poetry
- Set up environment variables
- Deployment
- License
- Support
- Acknowledgments
- Web Scraping and Data Ingestion Pipeline
  - Web scraping for CFA Institute publications using Selenium
  - Airflow DAGs for the data pipeline
  - Data storage in S3 and Snowflake
- FastAPI Backend
  - Document exploration API
  - Integration with NVIDIA services
  - Multi-modal RAG implementation
  - Q&A processing system
- Streamlit Frontend
  - Document grid/list view
  - Summary generation and preview interface
  - Q&A interaction chatbot
  - Research notes management
- Multi-Modal RAG and Research Notes link
  - Pinecone vector database integration
  - Appending the research notes
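
As a rough illustration of how the stages of the ingestion pipeline could be ordered, the sketch below models the scrape, store, and load steps as a small dependency graph using Python's standard-library `graphlib`. It is not Airflow code. The task names and the ordering between the S3 and Snowflake steps are assumptions made for illustration, and the project's actual DAGs may be structured differently.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical task names. Each task maps to the set of tasks it depends on.
PIPELINE: Dict[str, Set[str]] = {
    "scrape_publications": set(),
    "upload_files_to_s3": {"scrape_publications"},
    "load_metadata_to_snowflake": {"upload_files_to_s3"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the task names so that every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
) -> List[str]:
    """Call each task in dependency order and return the names that ran."""
    completed: List[str] = []
    for name in execution_order(dependencies):
        tasks[name]()  # placeholder for the real scraping or loading work
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Stand-in tasks that only print their own name.
    demo_tasks = {name: (lambda n=name: print("running", n)) for name in PIPELINE}
    run_pipeline(demo_tasks, PIPELINE)
```

In an Airflow deployment the same dependencies would be declared between operators inside a DAG; the plain-Python version here only shows the ordering idea.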
FastAPI: Backend framework for user authentication, document retrieval, and summarization APIs
Streamlit: Frontend application framework for document exploration and interaction
CFA Institute Research Foundation Publications: Source of data for the research documents
Airflow & Selenium: Tools for automating data ingestion and web scraping
AWS S3: Storage for images and PDFs associated with research documents
Snowflake: Data warehouse to store metadata, research notes, and user data
NVIDIA meta llama-3.1-8b-instruct: Instruction-tuned language model used for language understanding, reasoning, and text generation.
NVIDIA DePlot: Converts graphs and plots from documents into descriptive text, making visual data accessible for text-based search and analysis.
NVIDIA NeVA 22B: Transforms images within documents into text representations, allowing comprehensive querying across visual data types.
NVIDIA embedqa-v5-v6 Model: Generates detailed text embeddings, supporting high-precision, contextually relevant retrieval of document content in the query system.
NVIDIA API Key: Authenticates requests to the NVIDIA services used by the summarization and querying features.
Pinecone: Vector database for storing and retrieving embeddings for context-based search
Docker: Containerization tool for packaging the FastAPI and Streamlit applications to streamline deployment on cloud platforms
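
To make the idea of context-based search over embeddings more concrete, here is a minimal in-memory sketch of cosine-similarity retrieval. It only stands in for what a vector database such as Pinecone does at scale. The vectors are made-up toy values rather than output of the NVIDIA embedding model, and the function names `cosine_similarity` and `top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k (document id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy two-dimensional "embeddings" keyed by hypothetical document ids.
    toy_index = {
        "doc_a": [1.0, 0.0],
        "doc_b": [0.0, 1.0],
        "doc_c": [1.0, 1.0],
    }
    for doc_id, score in top_k([1.0, 0.2], toy_index, k=2):
        print(doc_id, round(score, 3))
```

A real deployment would compute embeddings with the embedding model and delegate the nearest-neighbor search to the vector database; the ranking principle is the same.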
- Clone the repository:

      git clone https://github.com/kratipaliwal/InsightDoc.git
      cd InsightDoc

- Install dependencies using Poetry:

      poetry install

- Set up environment variables:

      export S3_KEY="path/to/service-account.json"
      export SNOWFLAKE_ACCOUNT="your-account"
      export PINECONE_API_KEY="your-api-key"
      export NVIDIA_API_KEY="your-api-key"

- Build Docker images:

      docker-compose build

- Deploy to AWS:
  * Make sure your AWS credentials are set correctly to access the S3 bucket containing the task files as well as the RDS database containing user data.

This project is licensed under the MIT License - see the LICENSE file for details.
- CFA Institute Research Foundation
- NVIDIA for AI services
- Open source community
