Skip to content

kratipaliwal/InsightDoc

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

InsightDoc

Project Overview

This project involves developing a FastAPI and Streamlit-based document exploration application for clients to securely access, explore, and analyze research publications from the CFA Institute Research Foundation. The application facilitates efficient data ingestion, document interaction, and multi-modal querying with capabilities for summarization and research note generation.

Table of Contents

  1. Architecture Diagram
  2. Components
  3. Technologies Used
  4. Install dependencies using Poetry
  5. Set up environment variables
  6. Deployment
  7. License
  8. Support
  9. Acknowledgments

Architecture Diagram

image

Components

  1. Web Scraping and Data Ingestion Pipeline
    • Web scraping for CFA Institute publications using Selenium
    • Airflow DAGs for data pipelne
    • Data storage in S3 and Snowflake
  2. FastAPI Backend
    • Document exploration API
    • Integration with NVIDIA services
    • Multi-modal RAG implementation
    • Q&A processing system
  3. Streamlit Frontend
    • Document grid/list view
    • Summary generation & previewing interface
    • Q&A interaction chatbot
    • Research notes management
  4. Multi-Modal RAG and Research Notes link
    • Pinecone vector database integration
    • Appending the research notes

Technologies Used

FastAPI: Backend framework for user authentication, document retrieval, and summarization APIs

Streamlit: Frontend application framework for document exploration and interaction

CFA Institute Research Foundation Publications: Source of data for the research documents

Airflow & Selenium: Tools for automating data ingestion and web scraping

AWS S3: Storage for images and PDFs associated with research documents

Snowflake: Data warehouse to store metadata, research notes, and user data

NVIDIA meta llama-3.1-8b-instruct : Advanced state-of-the-art model with language understanding, superior reasoning, and text generation.

NVIDIA DePlot: Converts graphs and plots from documents into descriptive text, making visual data accessible for text-based search and analysis.

NVIDIA NeVA 22B: Transforms images within documents into text representations, allowing comprehensive querying across visual data types.

NVIDIA embedqa-v5-v6 Model: Generates detailed text embeddings, supporting high-precision, contextually relevant retrieval of document content in the query system.

NVIDIA API Key: Secures and manages access to all NVIDIA functionalities, ensuring authenticated, controlled usage across summarization and querying features.

Pinecone: Vector database for storing and retrieving embeddings for context-based search

Docker: Containerization tool for packaging the FastAPI and Streamlit applications to streamline deployment on cloud platforms

Environment Setup

  1. Clone the repository:
git clone https://github.com/kratipaliwal/InsightDoc.git
cd InsightDoc

Install dependencies using Poetry:

poetry install

Set up environment variables:

export S3_KEY="path/to/service-account.json"
export SNOWFLAKE_ACCOUNT="your-account"
export PINECONE_API_KEY="your-api-key"
export NVIDIA_API_KEY="your-api-key"

Deployment

  • Build Docker images:
docker-compose build
  • Deploy to AWS:
* Make sure your AWS credentials are set correctly to access the S3 bucket containing the task files as well as the RDS database containing user data.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • CFA Institute Research Foundation
  • NVIDIA for AI services
  • Open source community

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors