manziosee/DocParse

DocParse - AI-Powered Document Parser API

Professional Django REST API for extracting structured information from documents (PDF, Word, Images) using AI and natural language prompts - Simple Upload & Extract!

License: MIT Python Django Docker

🚀 Live Demo

Production API: https://docparse.onrender.com

🚀 Features

  • 📄 Multi-format Support: PDF, Word (.docx), Text (.txt), and image files (JPG, PNG, etc.)
  • 🤖 AI-Powered Extraction: Uses OpenAI GPT with natural language prompts
  • ⚡ Instant Results: Upload document with optional prompt and get immediate extraction
  • 📋 No Complex Workflows: Single endpoint - upload and extract in one step
  • 🔄 RESTful API: Built with Django REST Framework
  • 📚 Interactive Documentation: Swagger UI and ReDoc
  • 🐳 Docker Ready: Complete containerization support
  • 🔍 OCR Support: Extract text from images using Tesseract

🚀 Quick Start

Using Docker (Recommended)

  1. Clone and configure:
git clone <repository-url>
cd DocParse
cp .env.example .env
# Edit .env file with your OpenAI API key
  2. Build and run:
docker-compose up --build
  3. Access the application at http://localhost:8000

Manual Installation

  1. Install dependencies:
pip install -r requirements.txt
sudo apt-get install tesseract-ocr  # For OCR support
  2. Configure environment:
cp .env.example .env
# Edit .env file with your OpenAI API key
  3. Set up the database:
python manage.py makemigrations
python manage.py migrate
  4. Run the server:
python manage.py runserver

⚡ Simple Usage

Production API (Live)

With specific prompt:

curl -X POST https://docparse.onrender.com/api/documents/ \
  -F "file=@your_document.pdf" \
  -F "prompt=What is the total amount?"

Without prompt (extracts everything):

curl -X POST https://docparse.onrender.com/api/documents/ \
  -F "file=@your_document.pdf"

Local Development

Upload Document & Extract Information:

With specific prompt:

curl -X POST http://localhost:8000/api/documents/ \
  -F "file=@your_document.pdf" \
  -F "prompt=What is the total amount?"

Without prompt (extracts everything):

curl -X POST http://localhost:8000/api/documents/ \
  -F "file=@your_document.pdf"

Example Response:

{
  "seller": {
    "company_name": "BrightLine Traders Ltd",
    "address": "78 Innovation Road, Kigali, Rwanda"
  },
  "buyer": {
    "company_name": "TechNova Solutions",
    "address": "902 Enterprise Drive, Kigali, Rwanda"
  },
  "invoice_number": "PRO-2024-014",
  "date": "2024-02-10",
  "subtotal": "RWF 555,000",
  "tax": "18%",
  "total": "RWF 654,900"
}
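
The amounts in the example response are formatted strings rather than numbers, so a client that wants to do arithmetic has to parse them first. The sketch below shows one way to do that and checks that the sample figures agree (555,000 plus 18% tax is 654,900). The helper names `parse_amount` and `parse_percent` are hypothetical, and the field names come from the sample response only; actual responses may differ depending on the document and prompt.

```python
from decimal import Decimal


def parse_amount(text):
    """Turn a string such as 'RWF 555,000' into (currency, Decimal amount)."""
    currency, _, number = text.strip().partition(" ")
    return currency, Decimal(number.replace(",", ""))


def parse_percent(text):
    """Turn a string such as '18%' into a Decimal fraction (0.18)."""
    return Decimal(text.strip().rstrip("%")) / Decimal(100)


# Fields copied from the example response; not a guaranteed schema.
sample = {"subtotal": "RWF 555,000", "tax": "18%", "total": "RWF 654,900"}

currency, subtotal = parse_amount(sample["subtotal"])
_, total = parse_amount(sample["total"])
rate = parse_percent(sample["tax"])

# Recompute the total from subtotal and tax rate and compare.
expected_total = subtotal * (1 + rate)
print(currency, expected_total == total)
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact.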

🔗 API Endpoint

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/documents/ | Upload document & extract with optional prompt |

Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | File | ✅ | Document file (PDF, DOCX, TXT, JPG, PNG, etc.) |
| prompt | String | ❌ | Natural language question about the document |
| document_type | String | ❌ | One of: invoice, proforma, receipt, other |
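
As an illustration of the parameters above, the following sketch assembles the optional text fields on the client side before upload. The function `build_form_fields` and the constant `ALLOWED_DOCUMENT_TYPES` are hypothetical names, and rejecting unknown `document_type` values locally is an assumption; the server may validate differently.

```python
# Values listed for document_type in the parameter table.
ALLOWED_DOCUMENT_TYPES = {"invoice", "proforma", "receipt", "other"}


def build_form_fields(prompt=None, document_type=None):
    """Return the optional text fields to send alongside the file."""
    fields = {}
    if prompt:
        # An empty or missing prompt is simply omitted (extract everything).
        fields["prompt"] = prompt
    if document_type is not None:
        if document_type not in ALLOWED_DOCUMENT_TYPES:
            raise ValueError(f"unsupported document_type: {document_type!r}")
        fields["document_type"] = document_type
    return fields


print(build_form_fields(prompt="What is the total amount?", document_type="invoice"))
```

The returned dictionary can be passed as `data=` to `requests.post`, next to `files={'file': f}`.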

💡 Example Extractions

Specific Information

# Extract total amount only
curl -X POST http://localhost:8000/api/documents/ \
  -F "file=@invoice.pdf" \
  -F "prompt=What is the total amount?"

# Extract names only
curl -X POST http://localhost:8000/api/documents/ \
  -F "file=@invoice.pdf" \
  -F "prompt=Get me the names of people in this document"

# Extract invoice date
curl -X POST http://localhost:8000/api/documents/ \
  -F "file=@invoice.pdf" \
  -F "prompt=What is the invoice date?"

All Information

# Extract everything (no prompt)
curl -X POST http://localhost:8000/api/documents/ \
  -F "file=@invoice.pdf"

Business Cards

# Extract contact info
curl -X POST http://localhost:8000/api/documents/ \
  -F "file=@business_card.jpg" \
  -F "prompt=Extract name, phone number, email, and company"

📚 Documentation

Live API Documentation

Local API Documentation

Testing Tools

  • Postman Collection: Import DocParse_API.postman_collection.json
    • Production URL: https://docparse.onrender.com
    • Local URL: http://localhost:8000

🐳 Docker Commands

# Start services
docker-compose up -d

# View logs
docker-compose logs -f web

# Stop services
docker-compose down

# Clean restart (if having issues)
docker-compose down -v
docker system prune -f
docker-compose up --build

⚙️ Environment Setup

Required Environment Variables

Create a .env file in the project root (copy from .env.example):

# Django Configuration
SECRET_KEY=your-secret-key-here-change-in-production
DEBUG=False
ALLOWED_HOSTS=localhost,127.0.0.1

# OpenAI Configuration (required for AI extraction)
OPENAI_API_KEY=sk-proj-your-openai-api-key-here
OPENAI_MODEL=gpt-3.5-turbo
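
How DocParse itself reads these variables is not shown here. As a rough sketch, a settings module might interpret them along the following lines; the function `load_settings` is a hypothetical name, and the parsing rules (comma-separated hosts, `true`/`1`/`yes` for `DEBUG`) are assumptions rather than the project's confirmed behavior.

```python
import os


def load_settings(env=os.environ):
    """Read the variables listed above from a mapping (defaults to os.environ)."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        # Fail early: AI extraction cannot work without a key.
        raise RuntimeError("OPENAI_API_KEY is not set")
    hosts = env.get("ALLOWED_HOSTS", "")
    return {
        "SECRET_KEY": env.get("SECRET_KEY", ""),
        "DEBUG": env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"},
        "ALLOWED_HOSTS": [h.strip() for h in hosts.split(",") if h.strip()],
        "OPENAI_API_KEY": api_key,
        "OPENAI_MODEL": env.get("OPENAI_MODEL", "gpt-3.5-turbo"),
    }


# Placeholder values mirroring the .env example; not real credentials.
example_env = {
    "SECRET_KEY": "your-secret-key-here-change-in-production",
    "DEBUG": "False",
    "ALLOWED_HOSTS": "localhost,127.0.0.1",
    "OPENAI_API_KEY": "sk-proj-your-openai-api-key-here",
    "OPENAI_MODEL": "gpt-3.5-turbo",
}
settings = load_settings(example_env)
print(settings["DEBUG"], settings["ALLOWED_HOSTS"])
```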

Supported File Formats

Input Files:

  • PDF: .pdf
  • Word: .docx, .doc
  • Text: .txt
  • Images: .jpg, .jpeg, .png, .gif, .bmp, .tiff
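
A client can check the file extension against the list above before uploading, which avoids a round trip for files that would be rejected anyway. This is a minimal sketch; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and the server may apply additional checks (for example on file content or size) that are not modeled here.

```python
from pathlib import Path

# Extensions copied from the supported input formats listed above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt",
    ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff",
}


def is_supported(filename):
    """Return True if the file extension is in the supported list (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ("invoice.PDF", "business_card.jpg", "archive.zip"):
    print(name, is_supported(name))
```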

🔧 Advanced Usage

Python Integration

import requests

# Production API
url = 'https://docparse.onrender.com/api/documents/'

# Upload and extract with specific prompt
with open('invoice.pdf', 'rb') as f:
    response = requests.post(
        url,
        files={'file': f},
        data={'prompt': 'What is the total amount?'}
    )

result = response.json()
print(f"Extracted data: {result}")

# Upload and extract everything (no prompt)
with open('invoice.pdf', 'rb') as f:
    response = requests.post(
        url,
        files={'file': f}
    )

result = response.json()
print(f"All data: {result}")

JavaScript/Node.js Integration

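// Assumes node-fetch v2 (npm install node-fetch@2 form-data), which accepts a
// form-data body and sets the multipart headers automatically.
const fetch = require('node-fetch');
const FormData = require('form-data');
const fs = require('fs');

// Build the multipart form: the document plus an optional prompt
const form = new FormData();
form.append('file', fs.createReadStream('invoice.pdf'));
form.append('prompt', 'What is the total amount?');

fetch('https://docparse.onrender.com/api/documents/', {
    method: 'POST',
    body: form
})
.then(response => response.json())
.then(data => console.log(data))
.catch(err => console.error('Request failed:', err));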

🚨 Troubleshooting

Common Issues

Docker ContainerConfig error:

# Clean up Docker containers and volumes
docker-compose down -v
docker system prune -f
docker-compose up --build

OpenAI API errors:

  • Check API key validity and quota
  • Ensure OPENAI_API_KEY is set in .env file
  • Verify you have credits in your OpenAI account

Performance Tips

  • Use specific prompts for faster, more accurate results
  • Optimize image quality for better OCR results
  • Use PDF format when possible for best accuracy

📊 What DocParse Can Extract

With Specific Prompts:

  • Any information you ask for in natural language
  • Financial data (amounts, totals, line items)
  • Contact information (names, emails, phones)
  • Dates and reference numbers
  • Custom business data

Without Prompt (Extracts All):

  • All dates found in document
  • All monetary amounts
  • All person names
  • All company names
  • All email addresses

📞 Support

🎯 Example Prompts

Get inspired with these example prompts:

  • "What is the total amount?"
  • "Get me the names of people in this document"
  • "What is the invoice date?"
  • "Extract all contact information"
  • "What are the line items?"
  • "Find all monetary amounts"
  • "Who are the companies mentioned?"
  • "What are the key dates?"
  • "Extract email addresses"

Or leave prompt empty to extract everything automatically!

Simple and powerful - just upload your document and get instant results!
