Generate a static web page that serves as a simple yet powerful document search and retrieval system for browsing your documents as PDFs in your web browser.

marctrommen/docarchivebrowser

Document Archive Browser as Static Website With Keyword Search

The following description is PART TWO of a two-part description of a complete document archive system. PART TWO deals with the generation of a static web page that serves as a simple yet powerful document search and retrieval system, letting you browse your documents as PDFs in your web browser.

Because the Document Archive Browser is a static web page, it can reside on any file system (e.g. a USB stick), on a simple web server, or with a web hosting provider.

MIT License Linux Python 3.x HTML 5 CSS 3

PART ONE deals with scanning documents, collecting meta information for later use, enhancing the quality of the scans, drastically reducing their file size, and extracting plain text with the help of OCR tools (Tesseract). It then puts the scanned images and extracted text into one PDF together with the previously collected meta information and finally organizes all files in a simple tree structure on your file system. The two workflows are fully decoupled: the only communication interface between PART ONE and PART TWO is the file tree structure, consisting of Document_IDs, metadata as JSON files, and the PDF files.

Content

Feature List

  • custom, simple template system
  • responsive web design (mobile first)
  • static web pages
  • as few dependencies as possible
  • search via a keyword catalog
  • generation works like a build process, including initialization, cleanup, etc.
  • document metadata is provided as JSON files
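
As an illustration of what a small, dependency-free template system can look like, here is a minimal sketch built on Python's standard `string.Template`. It is not the project's `templatehandler.py`: the template text, the `$title` and `$items` placeholders, and the `render_page` function are assumptions made for this example.

```python
from html import escape
from string import Template

# Hypothetical page template; the real templates live in the project's
# templates/ directory and may use a different placeholder syntax.
PAGE_TEMPLATE = Template(
    "<!DOCTYPE html>\n"
    "<html><head><title>$title</title>"
    '<link rel="stylesheet" href="pages.css"></head>\n'
    "<body><h1>$title</h1>\n<ul>\n$items\n</ul></body></html>\n"
)


def render_page(title, links):
    """Render a list page from (href, label) pairs."""
    items = "\n".join(
        f'<li><a href="{escape(href, quote=True)}">{escape(label)}</a></li>'
        for href, label in links
    )
    return PAGE_TEMPLATE.substitute(title=escape(title), items=items)
```

Calling `render_page("2024", [("archive/20240105_01.pdf", "Invoice")])` returns one complete HTML page as a string, which a build script could then write to a file such as `archive_2024.html`.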

SiteMap of Website

The rough site map of the static document archive looks as follows:

doc_archive_root
├── index.html (list of all documents for the current year <YYYY>)
├── pages.css
├── keyword_catalog.html (list of all keywords)
├── keyword_<xxx>.html (list of all documents for one keyword <xxx>)
├── archive.html (list of all yearly archives)
├── archive_<YYYY>.html (list of all documents for one year <YYYY>)
└── archive
    ├── <yyyymmdd_xx>.* (linked document, e.g. PDF, PNG, JPG)
    └── ...
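
The layout above implies two file-system steps: clearing the previous output and filling the `archive` folder with the linked documents. The following is a minimal sketch of those steps, assuming a scan archive with one folder per document. The function name `reset_and_copy` and the default list of suffixes are hypothetical and not taken from `build.py`.

```python
import shutil
from pathlib import Path


def reset_and_copy(scan_archive_root, doc_archive_root,
                   suffixes=(".pdf", ".png", ".jpg")):
    """Recreate the output tree and copy linked documents into archive/."""
    target_root = Path(doc_archive_root)
    # Cleanup of the last build: this deletes the whole output tree,
    # so doc_archive_root must never point at the scan archive itself.
    if target_root.exists():
        shutil.rmtree(target_root)
    archive_dir = target_root / "archive"
    archive_dir.mkdir(parents=True)

    copied = []
    # Assumed layout: <scan_archive_root>/<document_id>/<document_id>.<ext>
    for source in sorted(Path(scan_archive_root).glob("*/*")):
        if source.is_file() and source.suffix.lower() in suffixes:
            shutil.copy2(source, archive_dir / source.name)
            copied.append(source.name)
    return copied
```

JSON metadata files are skipped here on purpose, since the generated pages only link to the documents themselves.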

SiteMap of Build

The rough site map of the generator's build environment looks as follows:

project_root
├── source
│   ├── config_template.py
│   ├── build.py
│   ├── templatehandler.py
│   ├── oneyear.py
│   ├── onekeyword.py
│   ├── allkeywords.py
│   ├── allyears.py
│   ├── jsontreewalker.py
│   └── dirtreewalker.py
├── doc
│   ├── 
│   └── 
├── pages.css
├── requirements.txt
├── README.md
├── LICENSE
├── .gitignore
└── templates
    ├── 
    ├── 
    └── 

SiteMap of Scan Archive

scan_archive_root
├── YYYYMMDD_01
│   ├── YYYYMMDD_01.json
│   ├── YYYYMMDD_01.pdf
│   └── ...
├── YYYYMMDD_02
│   ├── YYYYMMDD_02.json
...
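
Reading this tree can be sketched in a few lines of standard-library Python. The example below is not the project's `jsontreewalker.py`. It assumes that each document folder contains a JSON file named after the folder, as shown above, and the function name `load_metadata` is hypothetical.

```python
import json
from pathlib import Path


def load_metadata(scan_archive_root):
    """Return {document_id: metadata dict} for every document folder."""
    documents = {}
    for doc_dir in sorted(Path(scan_archive_root).iterdir()):
        if not doc_dir.is_dir():
            continue
        # Assumption: the metadata file is named <document_id>.json.
        json_file = doc_dir / f"{doc_dir.name}.json"
        if not json_file.is_file():
            continue  # skip folders that carry no metadata
        with json_file.open(encoding="utf-8") as handle:
            documents[doc_dir.name] = json.load(handle)
    return documents
```

The result is a plain dictionary keyed by document ID, which keeps the later page generation independent of the file system.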

Process of Build

Rough outline of the build process:

  • initialization
  • cleanup of the last build (delete the directory tree of the document archive)
  • create the target directories
  • walk the scan archive tree, read the JSON files in all subdirectories, and add their metadata to the global data structure
  • build the data structure for the yearly archives (year --> document ID)
  • build the data structure for the keyword indexes (keyword --> document ID)
  • generate index.html
  • generate archive.html
  • generate .html (optional)
  • generate keyword_catalog.html
  • copy linked files (images, PDFs, etc.) from the scan archive to the document archive
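
The two lookup structures named in the steps above (year --> document ID and keyword --> document ID) can be sketched as follows. This is a minimal example and not the project's implementation. It assumes that the year is the first four characters of a document ID of the form `YYYYMMDD_xx`, and that each metadata dictionary stores its keywords under a `"keywords"` key. That key name is an assumption, since the JSON schema is not described here.

```python
from collections import defaultdict


def build_indexes(documents):
    """Build year -> [document IDs] and keyword -> [document IDs] mappings.

    `documents` maps a document ID (YYYYMMDD_xx) to its metadata dict.
    """
    by_year = defaultdict(list)
    by_keyword = defaultdict(list)
    for doc_id in sorted(documents):
        # The year is taken from the document ID prefix.
        by_year[doc_id[:4]].append(doc_id)
        # "keywords" is a hypothetical field name for this sketch.
        for keyword in documents[doc_id].get("keywords", []):
            by_keyword[keyword].append(doc_id)
    return dict(by_year), dict(by_keyword)
```

Each entry in the first mapping then corresponds to one `archive_<YYYY>.html` page, and each entry in the second to one `keyword_<xxx>.html` page.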

Links on CSS

Links on Python
