Web application offering browsing, search, retrieval, addition, and deletion of documents in a repository, with user registration, authentication, and directory-specific authorization.
https://github.com/jrpool/docsearch
This application is a web server that manages, and provides selective access to, a repository of documents.
The intended use case is a person or organization that holds a collection of documents in various formats on its own server and wants to make parts of the collection accessible, for different actions, to different categories of users via web browsers.
This project was initially developed at Learners Guild in the course of an apprenticeship in full-stack web development. The learning objectives served by the project included:
- Encrypted server-client communication
- Authentication
- Cookie-based session persistence
- Role-based authorization
- Web-database integration
- Web-email integration
- User administration
- Web-filesystem integration
- File-access permission management
- Cross-format document relevance discovery
- Document display and delivery
- Controlled distributed document repository modification
- Security of administrative and user secrets
- Internationalization/localization
- Protection of customizations from deletion by updates
- Accessibility
- Usability
The tools used in the implementation include HTML, CSS, JavaScript, bcrypt, body-parser, dotenv, ejs, express, express-session, session-file-store, PostgreSQL, pg (node-postgres), the SendGrid Web API, and PM2.
Originally Apache Solr was included in the planned tool set for indexing and searching. Subsequently, Elasticsearch replaced it in the plan. After that, the approach to search was made incremental, starting with operations available in Node and JavaScript plus `pdftotext`, a utility available in the `poppler-utils` package. Searching is currently based on those operations.
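The following is a minimal sketch of that kind of search, not the project's actual search code. It assumes that `pdftotext` is on the path and that non-PDF files can be read as plain text. The function names `textOf`, `matches`, and `search`, and the case-insensitive substring rule, are hypothetical.

```javascript
const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

// Get the text of a file: pdftotext for PDFs, a direct read otherwise.
// Passing '-' as the output argument makes pdftotext write to standard output.
const textOf = filePath =>
  path.extname(filePath).toLowerCase() === '.pdf'
    ? execFileSync('pdftotext', [filePath, '-'], { encoding: 'utf8' })
    : fs.readFileSync(filePath, 'utf8');

// Case-insensitive substring test (an assumed matching rule).
const matches = (text, query) =>
  text.toLowerCase().includes(query.toLowerCase());

// Return those of the given files whose text contains the query.
const search = (filePaths, query) =>
  filePaths.filter(filePath => matches(textOf(filePath), query));
```

A call such as `search(['public/demodocs/report.pdf'], 'budget')` would return the subset of the listed paths whose extracted text contains the query. The path in that call is illustrative only.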
The application is a work in progress. Its intended functionalities include the following (“*” = not yet implemented):
- User identity capabilities:
  - Registration.
  - Login with temporary username (“UID”) issued on registration.
  - Login.
  - Logout.
  - Deregistration.
- User document capabilities:
  - Browse through the directory tree.
  - Browser-based return to previous tree nodes.
  - *Breadcrumb-based return to previous tree nodes.
  - Display and download specific documents.
  - Search with query strings for documents a user is authorized to see.
  - Filesystem-based document addition and deletion.
  - *Browser-based document addition and deletion.
- Role-based document access:
  - Distinct permissions for reading, adding, and deleting.
  - Directory-specific permissions.
  - Propagation of permissions to subdirectories.
  - Multi-role users having the union of their role permissions.
  - Pruning of redundant entries in displayed directory trees.
- Administrator (“curator”) capabilities:
  - File-based customization of the application configuration (see below).
  - Registration as a curator with a secret code.
  - *Web-based definition of user roles (“categories”).
  - *Web-based assignment of permissions to categories.
  - Assignment of users to categories.
  - Assignment of permanent UIDs to users.
  - Editing of user registration records.
  - Deregistration of users.
- Email notices:
  - Events triggering notices:
    - User registration.
    - User deregistration.
    - Curator editing of a user registration record.
    - Curator deregistration of a user.
  - Parties receiving notices:
    - Affected user.
    - Performing curator.
    - Application administrator.
- Localization:
  - File-based whole-application language localization.
  - *User-based dynamic localization.
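The role-based access rules in the list above (directory-specific permissions, propagation to subdirectories, and the union of permissions for multi-role users) can be illustrated with a short sketch. This is not the application's implementation. The record layout, the category names, the directory names, and the function names are all hypothetical.

```javascript
// Hypothetical permission records: a category, an action, and a directory.
const permissions = [
  { cat: 'public', act: 'read', dir: 'docs/public' },
  { cat: 'staff', act: 'read', dir: 'docs/internal' },
  { cat: 'staff', act: 'add', dir: 'docs/internal/drafts' }
];

// True if target is the directory itself or lies anywhere beneath it.
const isWithin = (dir, target) =>
  target === dir || target.startsWith(`${dir}/`);

// A user may perform an action on a path if any of the user's categories
// has that permission on the path or on one of its ancestor directories.
const isPermitted = (userCats, act, target) =>
  permissions.some(p =>
    userCats.includes(p.cat) && p.act === act && isWithin(p.dir, target)
  );
```

Under these sample records, a user in both `public` and `staff` can read anything under either `docs/public` or `docs/internal`, while a user in `staff` alone can add documents only under `docs/internal/drafts`.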
Suggestions on priorities for the further development of the project, and of course bug reports, are welcome. Feel free to file issues at the repository.
Efforts have been made, and are continuing, to make the user interface comply with level AAA of WCAG 2.1 so that the application is reasonably accessible to persons with disabilities.
Accessibility features include:
- Natural navigation order
- Explicit main and sectional region structure
- Visible focus
- Mouse-free operability
- Semantic headings
- Descriptive titling
- Contrastive colors
- Purpose-labeled controls
- Appearance of form-error messages after the offending elements
- Declared page language
The application is designed so that the texts in its interface can exist in multiple, linguistically distinct, versions, and choices among the versions can be made. This feature is described in the configuration instructions.
As distributed for installation, the application is configured to allow you to replicate the demonstration cited above, including the sample documents.
To navigate back up the document tree when browsing, use the browser’s back button.
- These instructions presuppose that (1) npm, PostgreSQL, and pdftotext are installed, (2) there is a PostgreSQL database cluster, (3) PostgreSQL is running, (4) when you connect to the cluster you are a PostgreSQL superuser, and (5) your PostgreSQL configuration permits trusted local IPv4 connections from you and from the PostgreSQL user that this application will create. If you get authentication errors running the `revive_db` script described below, you can edit your `pg_hba.conf` file, which may be located in `/etc/postgresql/«version»/main` or `/usr/local/var/postgres`. Insert the following lines above the existing similar line of type `host`, then restart PostgreSQL with the applicable command on your server, such as `sudo service postgresql restart` or `pg_ctl restart`. You will replace «docsearchowner» with the value of `PGUSER` that you choose (see below).

      host all «you» 127.0.0.1/32 trust
      host all «docsearchowner» 127.0.0.1/32 trust
- Your copy of this project will be located in its own directory, inside some other directory that you may choose or create. For example, to create that parent directory inside your own home directory’s `Documents` subdirectory and call it `projects`, you can execute:

      mkdir ~/Documents/projects

  Make that parent directory your working directory, by executing, for example:

      cd ~/Documents/projects
- Clone this project’s repository into it, thereby creating the project directory, named `docsearch`, by executing:

      git clone https://github.com/jrpool/docsearch.git docsearch
- Make the project directory your working directory by executing:

      cd docsearch
- Create a directory named `sessions` by executing:

      mkdir sessions
- If there is no `access.log` file in the `logs` directory, rename the `access-init.log` file there to `access.log`.
- Obtain an account at SendGrid. For development or light production use, the free plan with a limit of 100 messages per day will suffice. (Each complete user registration entails sending 4 messages.) Note the API key that SendGrid issues to you.
- Create a file named `.env` at the root of your project directory and populate it with the following content, amended as you wish. This file will be protected from modification by any updates of the application. Details:
  - `CURATOR_CAT` and `PUBLIC_CAT` are the categories whose users are to have the access rights of curators (maximum rights) and of the general public (minimum rights), respectively.
  - `DAEMON` can be left as is, but, if you install two or more instances of this application on the same server, each must have a distinct value of `DAEMON`.
  - `DOC_DIR`, `SEED_DIR`, and `MSGS` should have the values `demodocs`, `demoseed`, and `demomsgs` while you are running the demonstration. When you add your own data and configuration, change these to match the names you give to your directories in the `public` and `src/db` directories and the file containing your messages. Updates of the application may update `demodocs`, `demoseed`, and `demomsgs`, but will not interfere with your own customizations of these, as long as you give them different names.
  - `LINK_PREFIX` is equal to any application prefix you use with a reverse proxy server, or `''` if none. For example, if requests to `https://yourdomain.org/docs/…` are passed to the application, the value should be `/docs`.
  - If you are doing development on the application, change the value of `NODE_ENV` from `production` to `development`.
  - See below for information about the `LANG` variable, and above for information about the `SENDGRID_API_KEY` variable.
  - `PGDATABASE` and `PGUSER` must be unique to this installation if you have multiple installations on the same host. They both are deleted and recreated in the course of installation, so `PGUSER` should exist only for this installation. `PGUSER` is a PostgreSQL user, but not necessarily an operating-system user.
  - `PORT` is the port the application will listen for requests on. If users will connect via a reverse proxy server, make it a port that the host’s firewall does not permit incoming traffic to address. (Letting users connect directly to the port is considered secure only if user clients are on the same host as the application, because otherwise unencrypted transmission of all content, including passwords and confidential documents, will occur.)
  - `STYLESHEET` is the base of the name of your stylesheet file in `public`. You can leave it as `demostyle`. If you want to customize any styles, copy `demostyle.css` to a differently named file, customize the copy, and reference its filename in `STYLESHEET`.
  - The `TEMP_UID_MAX` value is the largest number of registrants you expect to still have temporary UIDs at the same time, before curators assign permanent UIDs to them.
  - `URL` is the URL the application will tell users to use in reaching the application. Whether it specifies `http` or `https` depends on the user’s required behavior, not on the protocol used by the application itself (see the next paragraph).
  - Decide whether to make the application require the `https` protocol. You may have it use `http` and still require users to connect with `https`, by passing all requests through a reverse proxy server that communicates with users via `https` but with the application via `http`. The deployed live demonstration does this. It uses Nginx as a reverse proxy server, with credentials obtained from `certbot` and `letsencrypt`.
    - If users connect with `https`:
      - Set `HTTPS_CERT` to the path to your SSL/TLS certificate.
      - Set `HTTPS_KEY` to the path to your SSL/TLS private key.
      - Set `PROTOCOL` to `https`.
    - If users connect with `http`:
      - Set `HTTPS_CERT` to `''`.
      - Set `HTTPS_KEY` to `''`.
      - Set `PROTOCOL` to `http`.
      COOKIE_EXPIRE_DAYS=7
      CURATOR_CAT=0
      CURATOR_KEY=ASecretKey
      DAEMON=demodocsearch
      DOC_DIR=docs
      DOMAIN=yourdomain.org
      FROM_EMAIL=noreply@yourdomain.org
      FROM_NAME='Documents from Your Organization'
      HTTPS_CERT=/etc/letsencrypt/live/yourdomain.org/fullchain.pem
      HTTPS_KEY=/etc/letsencrypt/live/yourdomain.org/privkey.pem
      LANG=eng
      LINK_PREFIX=/ds
      MSGS=msgs
      NODE_ENV=production
      PGDATABASE=demodocs
      PGHOST=localhost
      PGPASSWORD=null
      PGPORT=5432
      PGUSER=demodocmaster
      # PORT must be 1024 or greater to allow a non-root process owner.
      PORT=3000
      PROTOCOL=https
      PUBLIC_CAT=1
      REG_EMAIL=admin@yourdomain.org
      REG_NAME='Your Administrator'
      SECRET=AnAuthenticationSecret
      SEED_DIR=seed
      SENDGRID_API_KEY=wHaTeVer.SenDGriDgIvEs.YoU
      STYLESHEET=demostyle
      TEMP_UID_MAX=3
      URL=https://www.yourdomain.org/ds/
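Because `PROTOCOL`, `HTTPS_CERT`, and `HTTPS_KEY` must agree with one another, a quick consistency check can catch a mismatch before the application starts. The sketch below is not part of the application. The function name `protocolProblems` and the wording of its messages are hypothetical, and it assumes that the variables have already been loaded into an object such as `process.env`.

```javascript
// Report inconsistencies among PROTOCOL, HTTPS_CERT, and HTTPS_KEY
// in an environment object. An empty array means no problem was found.
const protocolProblems = env => {
  const problems = [];
  const { PROTOCOL, HTTPS_CERT = '', HTTPS_KEY = '' } = env;
  if (PROTOCOL === 'https') {
    // https requires both a certificate path and a private-key path.
    if (!HTTPS_CERT) problems.push('HTTPS_CERT must be a certificate path');
    if (!HTTPS_KEY) problems.push('HTTPS_KEY must be a private-key path');
  } else if (PROTOCOL === 'http') {
    // http expects both values to be empty strings.
    if (HTTPS_CERT) problems.push('HTTPS_CERT should be empty');
    if (HTTPS_KEY) problems.push('HTTPS_KEY should be empty');
  } else {
    problems.push('PROTOCOL must be http or https');
  }
  return problems;
};
```

For example, `protocolProblems(process.env)` could be logged once at startup, after the `.env` file has been read.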
- Install required dependencies (you can see them listed in `package.json`) by executing `npm i`. The dependencies that this installs will depend on whether you defined the Node environment (`NODE_ENV`) as `development` or `production` in your `.env` file.
- Create your document directory (named in `.env` as `DOC_DIR`) inside `public`, as the root of your repository. Populate it with subdirectories and files. You may include symbolic links in it, and users with access to those links will also have access to the files and directories that they reference. This feature lets you grant multiple categories of users access to a particular file or directory without making copies of it. But the feature requires care, because it is possible to mistakenly include a symbolic link to directories and files, anywhere in your file system, that you intend not to disclose.
- Create your seed directory (named in `.env` as `SEED_DIR`) inside `src/db`. Copy the `demoseed` files into it. Edit them to define the categories of users you want to have and their access rights to directories in your repository. The user access rights must conform to this application’s fundamental principle that permission to do something to a directory implies permission to do the same thing to all of its descendants. The names of categories in `seedcat.sql` are internal to the database, so they should each begin with a letter or `_` and contain only letters, digits, and `_` (thus, no spaces).
- Copy `src/server/demomsgs.js` into the same `src/server` directory, giving your copy the name you specified as `MSGS`. In your copy, modify the values of the properties in the `eng` object to conform to your requirements. Among the properties that you will probably need to redefine are `accessText`, `cats`, `docsTitle`, `footText`, `introText`, and `usrEtc`.
- If you wish to add an additional language, add an object like `eng` to your message file, replacing the English values of the properties with strings in the other language. Name the new object with the ISO 639-3 alpha-3 code of that language. Add it to the export list at the end of the file. To make that language the language of the application’s user interface, replace `eng` with that code as the value of the `LANG` environment variable in your `.env` file. This version of the application does not yet support on-the-fly localization per user or browser preferences.
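The structure of a message file with an added language might look like the following sketch. It is an illustration, not a copy of `demomsgs.js`. The property values, the `fra` object, the `msgs` container, the `messagesFor` function, and the fallback to English are all assumptions. Only the property names `docsTitle` and `introText` and the `eng` object name come from the instructions above.

```javascript
// Hypothetical message objects, each named with an ISO 639-3 code.
const eng = { docsTitle: 'Documents', introText: 'Welcome.' };
const fra = { docsTitle: 'Documents', introText: 'Bienvenue.' };

// Stand-in for the export list at the end of the message file.
const msgs = { eng, fra };

// Choose the message object named by a language code such as the value
// of LANG, falling back to English if the code is not defined (assumed).
const messagesFor = lang =>
  Object.prototype.hasOwnProperty.call(msgs, lang) ? msgs[lang] : msgs.eng;
```

With these sample objects, `messagesFor('fra').introText` yields the French greeting, and an undefined code yields the English object.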
- Once the application is installed, create and populate the database by executing `npm run revive_db`.
- There are 3 ways to start the application. In each case, make the project directory your working directory first.
  - If you have chosen to install a development environment, execute `npm run start_dev`. This will run the application under `nodemon`, automatically restarting the application when you change files or their content, to ensure that the changes are live.
  - If you have installed a production environment and want to test it, execute `npm start`.
  - If you have installed a production environment and want to launch it as a daemon, so it is detached from your command-line environment and it restarts when the server reboots, execute `npm run start_daemon`. If you want to stop the application after that, execute `npm run stop_daemon`. (On some systems it is necessary to execute these commands as a superuser, namely as `sudo npm run start_daemon` and `sudo npm run stop_daemon`.)
  - In a production environment, neither start method can be relied on to adapt to any changes you make in the code. So, if you have made changes and want to test them, stop the application with `CONTROL-c` or `npm run stop_daemon` and then start it again.
- To access the application while it is running, use a web browser to request the application’s port on your server, such as:

      http://localhost:3000
      https://www.yourserver.org/ds
- When you access the application with your browser, register yourself as a curator. To obtain curator status, enter the `CURATOR_KEY` value into the “For administrative use” text field. Then, when you log in, you will be a curator.