mayankjonwal02/CiteFlow

🔗 CITEFLOW

AI-powered document suggestions with structured citation metadata

Python FastAPI LangGraph OpenAI Docker Qdrant


📖 Overview

CITEFLOW is an intelligent document writing assistant that provides real-time content suggestions backed by verified, structured academic citations. Every suggestion comes with full metadata (authors, year, title, DOI, abstract, publication venue), enabling in-text citations, hover previews, reference list generation, and APA/MLA formatting.


🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • OpenAI API Key

Setup

git clone https://github.com/yourusername/citeflow.git
cd citeflow

cp env.template .env
# Edit .env and set your OPENAI_API_KEY

docker compose up -d

Verify

docker compose ps

| Service | Port | Description |
|---|---|---|
| recommendation-agent | 8000 | Main API & WebSocket |
| qdrant-digitrix | 6333 | Vector Database |
| searxng-digitrix | 8080 | Meta Search Engine |
| firecrawl-api-digitrix | 3002 | Web Scraper |
| playwright-digitrix | 3000 | Browser Rendering |

💡 WebSocket API

Endpoint

ws://localhost:8000/suggest/{document_id}

document_id: Any unique string identifying the document being edited. Each document gets its own isolated knowledge base.


Connection

On connecting, the server sends a confirmation message:

{
  "status": "connected",
  "message": "Connected to CITEFLOW for document: my-doc-123",
  "document_id": "my-doc-123"
}

Request Format

Send a JSON message with the document context:

{
  "title": "The History of Qutub Minar",
  "heading": "Introduction",
  "content": "The Qutub Minar, a UNESCO World Heritage Site, stands as a remarkable testament to the architectural brilliance of the era."
}

| Field | Type | Required | Description |
|---|---|---|---|
| title | string | No | Document title |
| heading | string | No | Current section heading |
| content | string | Yes | Recent content from the document (last few sentences) |
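
As a minimal sketch of the request rules in the table above, the helper below builds the JSON message on the client side. The function name `build_request` is hypothetical, and rejecting blank content before sending is an assumption about sensible client behavior, not something the server is documented to require.

```python
import json
from typing import Optional


def build_request(content: str, title: Optional[str] = None,
                  heading: Optional[str] = None) -> str:
    """Serialize a suggestion request; only `content` is required."""
    # Assumption: blank content is rejected client-side before sending.
    if not content or not content.strip():
        raise ValueError("content is required and must not be empty")
    payload = {"content": content}
    # Optional fields are simply omitted when they are not provided.
    if title:
        payload["title"] = title
    if heading:
        payload["heading"] = heading
    return json.dumps(payload)
```

The returned string can be passed directly to the WebSocket `send` call.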

Response Format

The server returns a suggestion with structured citation metadata:

{
  "suggestion": "Constructed in 1193 by Qutb ud-Din Aibak, the tower was later completed by his successor Iltutmish, reaching a height of 72.5 meters.",
  "citations": [
    {
      "id": "cite_1",
      "inText": "Asher, 2020",
      "type": "Article",
      "articleType": "Journal",
      "title": "The Qutb Complex: Architecture and History of the Delhi Sultanate",
      "shortTitle": "",
      "abstract": "This paper examines the architectural evolution of the Qutb complex...",
      "publication": "Journal of Islamic Architecture",
      "year": 2020,
      "month": 6,
      "day": 15,
      "authors": [
        { "family": "Asher", "given": "Catherine B." }
      ],
      "identifiers": {
        "doi": "10.1234/jia.2020.0042",
        "url": "https://example.com/article"
      }
    },
    {
      "id": "cite_2",
      "inText": "Unknown, n.d.",
      "type": "Webpage",
      "articleType": "",
      "title": "Qutb Minar",
      "shortTitle": "",
      "abstract": "Qutb Minar is a minaret that forms part of the Qutb complex...",
      "publication": "Wikipedia",
      "year": null,
      "month": null,
      "day": null,
      "authors": [],
      "identifiers": {
        "doi": "",
        "url": "https://en.wikipedia.org/wiki/Qutb_Minar"
      }
    }
  ]
}

Citation Object Schema

Each citation object in the citations array follows this schema:

| Field | Type | Description |
|---|---|---|
| id | string | Unique citation ID within the response (cite_1, cite_2, ...) |
| inText | string | Pre-formatted in-text citation ("Shen et al., 2025", "Author & Author, 2020", "Unknown, n.d.") |
| type | string | Citation type: Article, Book, Webpage, ConferencePaper, Thesis, Dataset, Report |
| articleType | string | Sub-type: Journal, Preprint, Conference, BookChapter, Book, Thesis, Dataset, Report, or "" |
| title | string | Full title of the source |
| shortTitle | string | Abbreviated title (reserved for future use) |
| abstract | string | Abstract or summary of the source (up to 1000 chars) |
| publication | string | Journal, publisher, or venue name |
| year | int or null | Publication year |
| month | int or null | Publication month (1-12) |
| day | int or null | Publication day (1-31) |
| authors | array | List of author objects |
| authors[].family | string | Author's family/last name |
| authors[].given | string | Author's given/first name |
| identifiers | object | Identifier URLs |
| identifiers.doi | string | DOI string (e.g. "10.1234/example") or "" |
| identifiers.url | string | Source URL |
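
To illustrate how a client might consume this schema for reference list generation, here is a rough sketch that turns one citation object into an APA-like reference string. The function `format_reference` is hypothetical and not part of CITEFLOW. The output is only an approximation of APA style, with no italics, volume, or page handling.

```python
from typing import Any, Dict


def format_reference(citation: Dict[str, Any]) -> str:
    """Build a rough APA-like reference string from one citation object."""
    names = []
    for author in citation.get("authors") or []:
        family = (author.get("family") or "").strip()
        given = (author.get("given") or "").strip()
        # "Catherine B." becomes the initials "C. B."
        initials = " ".join(f"{part[0]}." for part in given.split())
        if family:
            names.append(f"{family}, {initials}" if initials else family)

    if not names:
        author_part = "Unknown"
    elif len(names) == 1:
        author_part = names[0]
    else:
        author_part = ", ".join(names[:-1]) + ", & " + names[-1]

    year = citation.get("year") or "n.d."
    title = (citation.get("title") or "").strip()
    parts = [f"{author_part} ({year}).", f"{title}."]

    publication = (citation.get("publication") or "").strip()
    if publication:
        parts.append(f"{publication}.")

    # Prefer a DOI link when a DOI is present, else fall back to the URL.
    identifiers = citation.get("identifiers") or {}
    doi = (identifiers.get("doi") or "").strip()
    link = f"https://doi.org/{doi}" if doi else (identifiers.get("url") or "").strip()
    if link:
        parts.append(link)
    return " ".join(parts)
```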

In-Text Citation Format

The inText field is auto-generated following academic conventions:

| Authors | Format | Example |
|---|---|---|
| 1 author | Family, Year | Asher, 2020 |
| 2 authors | Family & Family, Year | Asher & Koch, 2020 |
| 3+ authors | Family et al., Year | Shen et al., 2025 |
| No authors | Unknown, Year | Unknown, 2023 |
| No year | Family, n.d. | Asher, n.d. |
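
The rules in the table can be expressed as a short function. This is a sketch of the documented convention, not the backend's actual code. The name `in_text` is hypothetical.

```python
from typing import Dict, List, Optional


def in_text(authors: List[Dict[str, str]], year: Optional[int]) -> str:
    """Format an in-text citation from author objects and a year."""
    # Only the family names matter for the in-text form.
    families = [
        (a.get("family") or "").strip()
        for a in authors
        if (a.get("family") or "").strip()
    ]
    if not families:
        who = "Unknown"
    elif len(families) == 1:
        who = families[0]
    elif len(families) == 2:
        who = f"{families[0]} & {families[1]}"
    else:
        who = f"{families[0]} et al."
    when = str(year) if year is not None else "n.d."
    return f"{who}, {when}"
```

Clients can use the same logic to re-derive `inText` after a user edits the author list locally.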

⚡ Session Lifecycle

CONNECT  ws://host:8000/suggest/{doc_id}
   │
   ▼
MESSAGE #1 ─── RESEARCH PATH (~15-30s) ───────────────▶
   │   Web Search → Scrape → Store in Qdrant → Query
   │   → Generate Suggestion → Enrich Citations via
   │     CrossRef / arXiv / OpenAlex / GPT-4o-mini fallback
   │
   │   ✅ doc_id marked as "initialized"
   ▼
MESSAGE #2+ ─── FAST PATH (~3-5s) ────────────────────▶
   │   Query Qdrant → Generate Suggestion
   │   → Enrich Citations (cached URLs resolve instantly)
   ▼
DISCONNECT
   │
   ▼
CLEANUP ── Delete doc_id vectors from Qdrant ─────────▶

  • 1st message: Full research pipeline. Searches the web, scrapes pages, stores the results in the vector DB, then generates a suggestion with enriched citations. Takes ~15-30 seconds.
  • 2nd+ messages: Fast path. Only queries the existing knowledge base. Previously enriched citation metadata is cached. Takes ~3-5 seconds.
  • On disconnect: The document's Qdrant collection and citation cache are deleted.
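
The lifecycle above amounts to a small amount of per-document state. The sketch below models that state in memory. The class `SessionRegistry` and its method names are hypothetical and do not reflect how `main.py` is actually written. Deleting the Qdrant collection is left out, since that depends on the real client code.

```python
from typing import Dict, Set


class SessionRegistry:
    """Tracks which document IDs have completed the research path."""

    def __init__(self) -> None:
        self.initialized: Set[str] = set()
        # Per-document cache: doc_id -> {url: enriched citation metadata}
        self.citation_cache: Dict[str, Dict[str, dict]] = {}

    def path_for(self, doc_id: str) -> str:
        """Return 'research' for the first message, 'fast' afterwards."""
        return "fast" if doc_id in self.initialized else "research"

    def mark_initialized(self, doc_id: str) -> None:
        """Call once the research path has populated the knowledge base."""
        self.initialized.add(doc_id)
        self.citation_cache.setdefault(doc_id, {})

    def cleanup(self, doc_id: str) -> None:
        """Call on disconnect; drops the session state for this document."""
        self.initialized.discard(doc_id)
        self.citation_cache.pop(doc_id, None)
```

A handler would call `path_for` on each incoming message, `mark_initialized` after the first successful research run, and `cleanup` when the socket closes.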

🔬 Citation Enrichment Pipeline

For each raw URL returned by the AI agent, the backend resolves structured metadata using this priority chain:

| Priority | Condition | Source | What It Returns |
|---|---|---|---|
| 1 | DOI found in URL | CrossRef API | Title, authors, year, journal, abstract, DOI |
| 2 | arXiv URL detected | arXiv API | Title, authors, year, abstract, arXiv DOI |
| 3 | Any URL | OpenAlex API | Title, authors, year, publication, abstract |
| 4 | All above fail | GPT-4o-mini | Extracts metadata from page HTML |
| 5 | Everything fails | Minimal | URL-only Webpage citation |

All enrichment results are cached per session, so the fast path never re-fetches the same URL.
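
A priority chain with a per-session cache can be sketched as follows. The resolvers are passed in as plain callables so the example makes no real API calls. The names `enrich`, `has_doi`, `is_arxiv`, and the DOI regular expression are illustrative assumptions, not the contents of `citation_metadata.py`.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a loose pattern for DOIs embedded in a URL.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#]+")

Condition = Callable[[str], bool]
Resolver = Callable[[str], Optional[dict]]


def has_doi(url: str) -> bool:
    return DOI_PATTERN.search(url) is not None


def is_arxiv(url: str) -> bool:
    return "arxiv.org" in url.lower()


def enrich(url: str, chain: List[Tuple[Condition, Resolver]],
           cache: Dict[str, dict]) -> dict:
    """Try each applicable resolver in priority order; cache the result."""
    if url in cache:
        return cache[url]
    result: Optional[dict] = None
    for applies, resolve in chain:
        if not applies(url):
            continue
        try:
            result = resolve(url)
        except Exception:
            # A failing source just means we fall through to the next one.
            result = None
        if result:
            break
    if not result:
        # Last resort: a minimal URL-only Webpage citation.
        result = {"type": "Webpage", "title": "", "authors": [],
                  "identifiers": {"doi": "", "url": url}}
    cache[url] = result
    return result
```

In this sketch the chain would be ordered as in the table, for example `[(has_doi, crossref), (is_arxiv, arxiv), (lambda u: True, openalex), (lambda u: True, llm_extract)]`, where the four resolver functions are placeholders for real lookups.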


πŸ“ Usage Examples

Python

import asyncio
import websockets
import json

async def get_suggestion():
    uri = "ws://localhost:8000/suggest/my-doc-123"

    async with websockets.connect(uri) as ws:
        # Wait for connection confirmation
        print(await ws.recv())

        # Send document context
        await ws.send(json.dumps({
            "title": "The History of Qutub Minar",
            "heading": "Introduction",
            "content": "The Qutub Minar stands as one of India's most iconic monuments."
        }))

        # Receive structured suggestion + citations
        response = json.loads(await ws.recv())

        print(f"Suggestion: {response['suggestion']}")
        for cite in response['citations']:
            print(f"  [{cite['id']}] {cite['inText']} - {cite['title']}")
            print(f"         Type: {cite['type']} | DOI: {cite['identifiers']['doi']}")
            print(f"         Authors: {', '.join(a['given'] + ' ' + a['family'] for a in cite['authors'])}")

asyncio.run(get_suggestion())

JavaScript

const ws = new WebSocket('ws://localhost:8000/suggest/my-doc-123');

ws.onopen = () => {
  ws.send(JSON.stringify({
    title: "The History of Qutub Minar",
    heading: "Introduction",
    content: "The Qutub Minar stands as one of India's most iconic monuments."
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  // Skip connection confirmation
  if (data.status === 'connected') return;

  console.log('Suggestion:', data.suggestion);

  data.citations.forEach(cite => {
    console.log(`[${cite.id}] ${cite.inText}`);
    console.log(`  Title: ${cite.title}`);
    console.log(`  Type: ${cite.type} (${cite.articleType})`);
    console.log(`  Year: ${cite.year}`);
    console.log(`  DOI: ${cite.identifiers.doi}`);
    console.log(`  URL: ${cite.identifiers.url}`);
    console.log(`  Authors:`, cite.authors.map(a => `${a.given} ${a.family}`).join(', '));
    console.log(`  Abstract: ${cite.abstract?.substring(0, 100)}...`);
  });
};

πŸ› οΈ REST Endpoints

| Method | Path | Description |
|---|---|---|
| GET | / | Service info |
| GET | /health | Health check with active/initialized session counts |

Health Check Response

{
  "status": "healthy",
  "active_sessions": 2,
  "initialized_sessions": 1,
  "services": {
    "qdrant": "http://qdrant-digitrix:6333",
    "searxng": "http://searxng-digitrix:8080",
    "firecrawl": "http://firecrawl-api-digitrix:3002"
  }
}
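
A small readiness probe can be built on this response using only the standard library. This is a sketch. The helper names `fetch_health` and `is_ready` are hypothetical, and the base URL assumes the default port from the Quick Start table.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url: str = "http://localhost:8000",
                 timeout: float = 5.0) -> dict:
    """GET /health and decode the JSON body."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def is_ready(health: dict) -> bool:
    """Treat the service as ready when it reports status 'healthy'."""
    return health.get("status") == "healthy"


def describe(health: dict) -> str:
    """One-line summary of the session counters."""
    active = health.get("active_sessions", 0)
    initialized = health.get("initialized_sessions", 0)
    return f"{health.get('status', 'unknown')}: {active} active, {initialized} initialized"
```

Calling `is_ready(fetch_health())` in a loop after `docker compose up -d` is one way to wait for the API before opening a WebSocket.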

πŸ—οΈ Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                                CITEFLOW                                │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                  Recommendation Agent (FastAPI)                  │  │
│  │   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌───────────────┐    │  │
│  │   │WebSocket │→ │LangGraph │→ │ GPT-4o   │→ │  Citation     │    │  │
│  │   │Handler   │  │Agent     │  │ mini     │  │  Enricher     │    │  │
│  │   └──────────┘  └─────┬────┘  └──────────┘  └───────┬───────┘    │  │
│  └───────────────────────┼─────────────────────────────┼────────────┘  │
│                          │                             │               │
│       ┌───────────┬──────┴───┐           ┌──────────┬──┴───────┐       │
│       ▼           ▼          ▼           ▼          ▼          ▼       │
│  ┌─────────┐ ┌─────────┐ ┌────────┐  ┌────────┐ ┌───────┐ ┌────────┐   │
│  │ SearXNG │ │Firecrawl│ │ Qdrant │  │CrossRef│ │ arXiv │ │OpenAlex│   │
│  │  :8080  │ │  :3002  │ │ :6333  │  │  API   │ │  API  │ │  API   │   │
│  └─────────┘ └─────────┘ └────────┘  └────────┘ └───────┘ └────────┘   │
└────────────────────────────────────────────────────────────────────────┘

πŸ“ Project Structure

citeflow/
├── docker-compose.yml              # All 9 services orchestration
├── env.template                    # Environment variable template
├── README.md                       # This file
├── test_ws.html                    # Browser-based WebSocket test UI
│
├── Recommendation Agent/           # Main Python application
│   ├── Dockerfile
│   ├── main.py                     # FastAPI app, WebSocket handler, session management
│   ├── requirements.txt
│   └── utils/
│       ├── agent.py                # LangGraph agent (research + fast paths)
│       ├── citation_metadata.py    # Citation enrichment (CrossRef/arXiv/OpenAlex/LLM)
│       ├── crawl_ops.py            # Firecrawl web scraping
│       ├── embeddings.py           # OpenAI text-embedding-3-small
│       ├── qdrant_ops.py           # Qdrant vector DB operations
│       └── search_ops.py           # SearXNG meta-search
│
├── firecrawl/                      # Firecrawl source (built from source)
│   └── apps/
│       ├── api/
│       ├── nuq-postgres/
│       └── playwright-service-ts/
│
└── searxng-docker/                 # SearXNG configuration
    └── searxng/
        ├── settings.yml
        └── limiter.toml

βš™οΈ Environment Variables

| Variable | Description | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key for GPT-4o-mini & embeddings | Yes |
| USE_DB_AUTHENTICATION | Firecrawl auth toggle (default: false) | No |

🛑 Stop / Cleanup

# Stop all services
docker compose down

# Stop and remove all data (Qdrant vectors, caches)
docker compose down -v

📄 License

MIT License

About

CiteFlow is an agent-powered, research-aware suggestion engine that delivers contextually relevant writing suggestions with citations. It uses a FastAPI WebSocket backend to process content, searches the web via SearXNG, scrapes sources with Firecrawl, builds a Qdrant vector database, and queries it using an AI agent for fact-grounded results.
