InsightDoc

Project Overview

InsightDoc is a FastAPI- and Streamlit-based document exploration application that lets clients securely access, explore, and analyze research publications from the CFA Institute Research Foundation. It covers data ingestion, document interaction, and multi-modal querying, with support for summarization and research note generation.

Architecture Diagram

Components

Web Scraping and Data Ingestion Pipeline
- Web scraping for CFA Institute publications using Selenium
- Airflow DAGs for the data pipeline
- Data storage in S3 and Snowflake
FastAPI Backend
- Document exploration API
- Integration with NVIDIA services
- Multi-modal RAG implementation
- Q&A processing system
Streamlit Frontend
- Document grid/list view
- Summary generation & previewing interface
- Q&A interaction chatbot
- Research notes management
Multi-Modal RAG and Research Notes link
- Pinecone vector database integration
- Appending the research notes
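The research notes flow in the last component can be pictured as an append-only list of notes per document. The following is a minimal sketch under assumptions: the `ResearchNote` fields, the `NotesStore` class, and the in-memory dictionary are hypothetical stand-ins for whatever the application actually persists in Snowflake and indexes in Pinecone.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ResearchNote:
    """A single note attached to a document (hypothetical schema)."""
    document_id: str
    text: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class NotesStore:
    """In-memory stand-in for the real notes storage (assumption)."""

    def __init__(self) -> None:
        self._notes: dict[str, list[ResearchNote]] = {}

    def append(self, document_id: str, text: str) -> ResearchNote:
        # Reject blank notes so the store never holds empty entries.
        cleaned = text.strip()
        if not cleaned:
            raise ValueError("note text must not be empty")
        note = ResearchNote(document_id=document_id, text=cleaned)
        self._notes.setdefault(document_id, []).append(note)
        return note

    def for_document(self, document_id: str) -> list[ResearchNote]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._notes.get(document_id, []))
```

Keeping notes append-only per document preserves the order in which they were written, which is usually what a reader expects when reviewing them later.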

Technologies Used

FastAPI: Backend framework for user authentication, document retrieval, and summarization APIs

Streamlit: Frontend application framework for document exploration and interaction

CFA Institute Research Foundation Publications: Source of data for the research documents

Airflow & Selenium: Tools for automating data ingestion and web scraping
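To illustrate how a DAG orders the ingestion steps, here is a minimal sketch that uses only Python's standard-library `graphlib`. It is not Airflow code; the task names (`scrape_publications`, `upload_to_s3`, `load_metadata_to_snowflake`) and their dependencies are hypothetical, and the real DAG may split the work differently.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of tasks it depends on (hypothetical task names).
PIPELINE = {
    "scrape_publications": set(),
    "upload_to_s3": {"scrape_publications"},
    "load_metadata_to_snowflake": {"upload_to_s3"},
}


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())
```

Calling `execution_order(PIPELINE)` yields the scrape step first, then the upload, then the metadata load, which mirrors how a scheduler only starts a task once its upstream tasks have finished.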

AWS S3: Storage for images and PDFs associated with research documents

Snowflake: Data warehouse to store metadata, research notes, and user data

NVIDIA meta/llama-3.1-8b-instruct: Instruction-tuned large language model used for language understanding, reasoning, and text generation.

NVIDIA DePlot: Converts graphs and plots from documents into descriptive text, making visual data accessible for text-based search and analysis.

NVIDIA NeVA 22B: Transforms images within documents into text representations, allowing comprehensive querying across visual data types.
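Because DePlot and NeVA both turn visual content into text, every element of a document can end up in one text-only index. Below is a minimal sketch of that normalization step, assuming the model outputs are already available as strings. The `Element` type, the `kind` labels, and the prefix format are hypothetical, and no NVIDIA API is called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One extracted piece of a document (hypothetical schema)."""
    kind: str      # "text", "chart", or "image"
    page: int
    content: str   # raw text, or the model-generated description


# Prefixes record where each passage came from (illustrative format).
PREFIXES = {"text": "", "chart": "[Chart description] ", "image": "[Image description] "}


def to_index_records(doc_id: str, elements: list[Element]) -> list[dict]:
    """Flatten mixed elements into text records ready for embedding."""
    records = []
    for position, element in enumerate(elements):
        if element.kind not in PREFIXES:
            raise ValueError(f"unknown element kind: {element.kind!r}")
        records.append({
            "id": f"{doc_id}-{position}",
            "text": PREFIXES[element.kind] + element.content.strip(),
            "metadata": {"kind": element.kind, "page": element.page},
        })
    return records
```

Storing the element kind and page in the metadata lets a query result point back to the original chart or image rather than only to its text description.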

NVIDIA embedqa-v5-v6 Model: Generates detailed text embeddings, supporting high-precision, contextually relevant retrieval of document content in the query system.

NVIDIA API Key: Authenticates requests to the NVIDIA services used by the summarization and querying features.

Pinecone: Vector database for storing and retrieving embeddings for context-based search
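Context-based search in a vector database comes down to ranking stored embeddings by their similarity to the query embedding. The sketch below shows that idea with a toy in-memory index and cosine similarity. It does not use the Pinecone client, and the function names and the dictionary-based index are illustrative assumptions (real embeddings have far more dimensions than any small test vector).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat its similarity as 0.
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k ids most similar to the query, best match first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A managed vector database performs the same ranking with approximate nearest-neighbor indexes, so it stays fast when the number of stored embeddings grows large.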

Docker: Containerization tool for packaging the FastAPI and Streamlit applications to streamline deployment on cloud platforms

Environment Setup

Clone the repository:

git clone https://github.com/kratipaliwal/InsightDoc.git
cd InsightDoc

Install dependencies using Poetry:

poetry install

Set up environment variables:

export S3_KEY="path/to/service-account.json"
export SNOWFLAKE_ACCOUNT="your-account"
export PINECONE_API_KEY="your-api-key"
export NVIDIA_API_KEY="your-api-key"
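
A quick way to catch a missing variable before starting the services is a small startup check. This is a sketch, assuming the four variable names above are the complete set the application reads; the `missing_settings` helper is hypothetical and not part of the repository.

```python
import os
from typing import Mapping

# Names taken from the export commands above.
REQUIRED_VARS = ("S3_KEY", "SNOWFLAKE_ACCOUNT", "PINECONE_API_KEY", "NVIDIA_API_KEY")


def missing_settings(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not environ.get(name, "").strip()]
```

At startup, the application could call `missing_settings()` and exit with a clear message if the returned list is non-empty, instead of failing later on the first S3, Snowflake, Pinecone, or NVIDIA request.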

Deployment

Build Docker images:

docker-compose build

Deploy to AWS:

- Make sure your AWS credentials are set correctly to access the S3 bucket containing the task files as well as the RDS database containing user data.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

- CFA Institute Research Foundation
- NVIDIA for AI services
- Open source community
