Enhance /api/status with processing breakdown, recent files, and aggregate stats #48
…nd statistics

Add new fields to the status API response:

- `processing_details`: breakdown of processing documents by status
- `currently_processing`: list of documents currently being processed
- `recent_files`: last 5 completed/failed files with timestamps
- `total_pdfs`: total document count
- `failed_pdfs`: failed document count
- `avg_processing_seconds`: average processing time

All existing fields are preserved for backward compatibility. Add comprehensive unit tests for the enhanced status endpoint.

Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/51a488f1-8693-4d02-894c-931f85fc9432
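For illustration, the `avg_processing_seconds` aggregate can be computed in SQLite along these lines. The schema below is a simplified stand-in for the real `scanneddata` table, and the actual query in the codebase may differ; it assumes `status_code = 5` means completed, as in the example response further down.

```python
import sqlite3

# Simplified stand-in for the scanneddata table; the real schema and the
# query in web_service/src/routes/api.py may differ.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE scanneddata (created TEXT, modified TEXT, status_code INTEGER)"
)
con.executemany(
    "INSERT INTO scanneddata VALUES (?, ?, ?)",
    [
        ("2024-06-01 10:00:00", "2024-06-01 10:01:00", 5),  # completed in 60 s
        ("2024-06-01 11:00:00", "2024-06-01 11:00:30", 5),  # completed in 30 s
        ("2024-06-01 12:00:00", "2024-06-01 12:05:00", 2),  # still processing, excluded
    ],
)

# julianday() yields fractional days, so multiply the delta by 86400 for seconds.
(avg_seconds,) = con.execute(
    "SELECT AVG((julianday(modified) - julianday(created)) * 86400.0) "
    "FROM scanneddata WHERE status_code = 5"
).fetchone()
```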
@copilot please add those changes to the readme
Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/bd224ccb-e72f-459a-81b6-091b19b464d3
Added the …
@copilot can we combine those new queries and reduce the overall query count? If the current form is more efficient than having everything combined into one query, keep it as it is. I want to reduce the stress and workload for SQL as much as possible.
Reduced from 4 queries to 3 in 5253604. The processing_details GROUP BY query was scanning the exact same rows (status_code BETWEEN 0 AND 4) as the currently_processing query, so I eliminated it and derive the grouped breakdown in Python from the individual items instead. The remaining 3 queries (summary scalars, in-progress items, recent files) each target distinct subsets with different result shapes, so combining them further would add complexity without reducing actual DB work.
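The Python-side grouping described above can be sketched like this. The rows are hypothetical dicts shaped like the `currently_processing` query results; the real code may differ in naming.

```python
from collections import Counter

# Hypothetical rows shaped like the currently_processing query results.
rows = [
    {"id": 12, "file_name": "scan1.pdf", "status": "OCR Processing", "status_code": 2},
    {"id": 13, "file_name": "scan2.pdf", "status": "OCR Processing", "status_code": 2},
    {"id": 14, "file_name": "scan3.pdf", "status": "Reading Metadata", "status_code": 1},
]

# One pass over the already-fetched rows replaces the separate GROUP BY query,
# since both queries scanned the same status_code BETWEEN 0 AND 4 subset.
counts = Counter((row["status"], row["status_code"]) for row in rows)
processing_details = [
    {"status": status, "status_code": code, "count": n}
    for (status, code), n in counts.items()
]
```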
…rrently_processing in Python Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/4abb078b-a869-4988-be63-1ebc9ce473d5
@copilot There is an already existing bug:

```
sqlite> SELECT file_name, created, file_status, status_code FROM scanneddata WHERE status_code < 5 ORDER BY created DESC;
+----------------------------------------------+---------------------+-------------+-------------+
| file_name                                    | created             | file_status | status_code |
+----------------------------------------------+---------------------+-------------+-------------+
| neue_kreditkarte.pdf                         | 2025-07-24 18:51:48 | Deleted     | -1          |
| doc20250708092425026129.pdf                  | 2025-07-08 10:24:36 | Deleted     | -1          |
| temp_scan_data_622bcd_7af0a4b8               | 2025-05-14 20:57:52 | Deleted     | -1          |
+----------------------------------------------+---------------------+-------------+-------------+
```
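A minimal reproduction of this filter bug and its fix, seeded with the deleted rows from the report above plus one hypothetical in-progress row (`scan1.pdf`); the schema is simplified:

```python
import sqlite3

# Simplified scanneddata schema, enough to show the filter behavior.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE scanneddata (file_name TEXT, created TEXT, status_code INTEGER)"
)
con.executemany(
    "INSERT INTO scanneddata VALUES (?, ?, ?)",
    [
        ("neue_kreditkarte.pdf", "2025-07-24 18:51:48", -1),       # deleted
        ("doc20250708092425026129.pdf", "2025-07-08 10:24:36", -1),  # deleted
        ("temp_scan_data_622bcd_7af0a4b8", "2025-05-14 20:57:52", -1),  # deleted
        ("scan1.pdf", "2025-07-25 09:00:00", 2),  # hypothetical: actively processing
    ],
)

# Buggy filter: status_code < 5 also matches deleted/failed rows (-1).
buggy = con.execute(
    "SELECT file_name FROM scanneddata WHERE status_code < 5"
).fetchall()

# Fixed filter: only actively processing documents (0..4) match.
fixed = con.execute(
    "SELECT file_name FROM scanneddata WHERE status_code BETWEEN 0 AND 4"
).fetchall()
```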
…s_code < 0) Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/992614c5-ed42-428c-96b2-256ccf0585c4
Fixed in
@copilot pytests are failing here on GitHub Actions:
…service Dockerfile Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/ad59c16f-0584-4752-afa2-c81172389beb
... Fixed in
@copilot unfortunately that did not fix it, I am still getting the following error message on GitHub:
…t conflict Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/5e7d9707-f37f-4978-a7ad-066038fcc966
... Fixed in
@copilot pytests for the webpage are now failing. Please fix those errors and run the .run-tests.sh file to verify before committing.
…rom connect_rabbitmq() Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/a640e2b5-ce42-4e6e-85f4-5790011e0046
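A minimal sketch of the guard this commit adds, assuming the real `connect_rabbitmq()` in `web_service/src/main.py` returns a `(connection, channel)` tuple on success and `None` when RabbitMQ is unreachable; the stub below simulates the unavailable case:

```python
import logging

logger = logging.getLogger(__name__)


def connect_rabbitmq():
    """Stand-in for the real helper; assumed to return a (connection,
    channel) tuple, or None when RabbitMQ is unavailable."""
    return None  # simulate RabbitMQ being down


def rabbitmq_listener():
    result = connect_rabbitmq()
    if result is None:
        # Before the fix, `connection, channel = connect_rabbitmq()` raised
        # TypeError: cannot unpack non-iterable NoneType object, which with
        # gevent workers killed the greenlet and froze the whole web service.
        logger.warning("RabbitMQ unavailable; listener exiting gracefully")
        return
    connection, channel = result
    channel.start_consuming()
```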
... Fixed in
The status API lacked visibility into per-status document counts and recent completion history—critical when running replicated services (e.g., 2x OCR, 2x metadata).
### New response fields (backward compatible)
- `processing_details`: documents in flight grouped by status (`[{status, status_code, count}]`)
- `currently_processing`: individual in-progress documents with `id`, `file_name`, `status`, `created`, `pdf_pages`
- `recent_files`: last 5 completed/failed documents with completion timestamps
- `total_pdfs` / `failed_pdfs`: aggregate counts
- `avg_processing_seconds`: mean time from `created` to `modified` for completed docs

### Bug fixes
- `latest_processing_timestamp`: fixed a pre-existing bug where the query used `WHERE status_code < 5`, which incorrectly included deleted/failed documents (`status_code = -1`) when no active jobs were running. Changed to `WHERE status_code BETWEEN 0 AND 4` so that only actively processing documents match.
- `rabbitmq_listener()` crash: fixed a pre-existing bug in `web_service/src/main.py` where `connect_rabbitmq()` returning `None` (when RabbitMQ is unavailable) caused an unhandled `TypeError: cannot unpack non-iterable NoneType object`. With gevent workers, this crashed the greenlet and made the entire web service unresponsive. Added a `None` check so the listener exits gracefully, logging a warning instead.

### CI fix
- `test_service/Dockerfile`: the `seleniarm/standalone-chromium:latest` base image ships with an older `libssl3t64` (3.2.1-3), which conflicts with the newer `openssl-provider-legacy` (3.6.1-3) during a Debian package split. Added `-o Dpkg::Options::="--force-overwrite"` to the `apt-get install` command to handle the file ownership transfer cleanly.

### Example response
```json
{
  "processed_pdfs": 10,
  "processing_pdfs": 3,
  "latest_processing_timestamp": "2024-06-01 12:00:00",
  "latest_completed_timestamp": "2024-06-01 11:30:00",
  "latest_created_name": "invoice.pdf",
  "latest_created_status": 2,
  "total_pdfs": 15,
  "failed_pdfs": 2,
  "avg_processing_seconds": 45.68,
  "processing_details": [
    {"status": "OCR Processing", "status_code": 2, "count": 2},
    {"status": "Reading Metadata", "status_code": 1, "count": 1}
  ],
  "currently_processing": [
    {"id": 12, "file_name": "scan1.pdf", "status": "OCR Processing", "status_code": 2, "created": "2024-06-01 12:00:00", "pdf_pages": 3}
  ],
  "recent_files": [
    {"id": 11, "file_name": "doc1.pdf", "status": "Completed", "status_code": 5, "created": "2024-06-01 10:00:00", "completed": "2024-06-01 10:01:00", "pdf_pages": 2}
  ]
}
```

### Changes
- `web_service/src/routes/api.py`: replaced the single monolithic query with 3 focused queries: summary stats, active items list, and recent completions (`LIMIT 5`). The `processing_details` breakdown is derived in Python from the `currently_processing` result set, avoiding a redundant `GROUP BY` query over the same rows. The original six fields are preserved. Fixed the `latest_processing_timestamp` filter to exclude deleted/failed documents.
- `web_service/src/main.py`: added a `None` check for the `connect_rabbitmq()` return value in `rabbitmq_listener()` so the web service remains responsive when RabbitMQ is unavailable (e.g., in the test environment).
- `test_service/Dockerfile`: added the `--force-overwrite` dpkg option to resolve the `openssl-provider-legacy`/`libssl3t64` package split conflict in the `seleniarm/standalone-chromium` base image.
- `tests/test_status_api.py`: 10 unit tests covering backward compatibility, new fields, empty states, null averages, error handling, and the `None` fallback for list queries.
- `README.md`: added an API documentation section with a response fields table and a collapsible example response for `GET /api/status`.
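As an illustration of the backward-compatibility coverage, a check along these lines could be used. The field names come from the example response above, but the helper itself is hypothetical, not the actual code in `tests/test_status_api.py`:

```python
# Field names taken from the example /api/status response; the helper below
# is an illustrative sketch, not the real test code.
LEGACY_FIELDS = {
    "processed_pdfs", "processing_pdfs", "latest_processing_timestamp",
    "latest_completed_timestamp", "latest_created_name", "latest_created_status",
}
NEW_FIELDS = {
    "processing_details", "currently_processing", "recent_files",
    "total_pdfs", "failed_pdfs", "avg_processing_seconds",
}


def check_status_payload(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload is well-formed."""
    problems = []
    for field in sorted(LEGACY_FIELDS - payload.keys()):
        problems.append(f"backward-compat break: missing {field}")
    for field in sorted(NEW_FIELDS - payload.keys()):
        problems.append(f"missing new field: {field}")
    return problems
```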