# Document Automation

## Introduction
This reference use case is an end-to-end reference solution for building an AI-augmented multi-modal semantic search system for document images (for example, scanned documents). This solution can help enterprises gain more insights from their document archives more quickly and easily using natural language queries. 

## Table of Contents
- [Overview](#overview)
- [Validated Hardware Details](#validated-hardware-details)
- [How It Works](#how-it-works)
- [Run Using Jupyter Lab](#run-using-jupyter-lab)
- [Expected Output](#expected-output)

## Overview
Enterprises are accumulating a vast quantity of documents, a large portion of which are in image formats such as scanned documents. These documents contain a large amount of valuable information, but it is a challenge for enterprises to index, search, and gain insights from the document images for the reasons below:
* The end-to-end (e2e) solution involves many components that are not easy to integrate together.
* Indexing a large document image collection is very time consuming. Distributed indexing capability is needed, but there is no open-source solution that is ready to use.
* Querying document image collections with natural language requires multi-modality AI that understands both images and languages. Building such multi-modality AI models requires deep expertise in machine learning.
* Deploying multi-modality AI models together with databases and a user interface is not an easy task.
* The majority of multi-modality AI models can only comprehend English. Developing non-English models takes time and requires ML experience.


In this reference use case, we implement and demonstrate a complete end-to-end solution that helps enterprises tackle these challenges and jump-start the customization of the reference solution for their own document archives. The architecture of the reference use case is shown in the figure below. It is composed of 3 pipelines, which are described in detail in the [How It Works](#how-it-works) section:
* Single-node Dense Passage Retriever (DPR) fine tuning pipeline
* Image-to-document indexing pipeline (can be run on either single node or distributed on multiple nodes)
* Single-node deployment pipeline

![usecase-architecture](assets/usecase-architecture.PNG)


## Validated Hardware Details
Please note that indexing the entire Dureader-vis dataset can take days, depending on the type and number of CPU nodes used for the indexing pipeline. This reference use case provides a multi-node distributed indexing pipeline that accelerates the indexing process. It is recommended to use at least 2 nodes with the hardware specifications listed in the table below. A network file system (NFS) is needed for the distributed indexing pipeline.

To try out this reference use case in a shorter time frame, you can download only one part of the Dureader-vis dataset (the entire dataset has 10 parts) and follow the instructions below. 

| Supported Hardware           | Specifications  |
| ---------------------------- | ---------- |
| Intel® 1st, 2nd, 3rd, and 4th Gen Xeon® Scalable Performance processors| FP32 |
|Memory|larger is better, recommend >376 GB|
|Storage|>250 GB|


## How It Works
We present some technical background on the three pipelines of this use case. We recommend running our reference solution first and then customizing the reference solution to your own use case by following the [Customize the Reference Solution to Your Own Use Case](#customize-the-reference-solution-to-your-own-use-case) section.
### Dense passage retriever (DPR) fine tuning
Dense Passage Retriever (DPR) is a dual-encoder retriever based on transformers. Please refer to [the original DPR paper](https://arxiv.org/abs/2004.04906) for an in-depth description of the DPR model architecture and the fine-tuning algorithms. Briefly, DPR consists of two encoders, one for the query and one for the documents. DPR encoders can be fine-tuned with custom datasets using the in-batch negative method, where the answer documents of the other queries in the mini-batch serve as the negative samples. Hard negatives can be added to further improve retrieval performance, measured by recall and mean reciprocal rank (MRR).
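
To make the in-batch negative idea concrete, here is a minimal pure-Python sketch of the loss for one mini-batch. It assumes dot-product similarity and a softmax cross-entropy objective, as described in the DPR paper; the function names and toy vectors are illustrative and are not taken from this repository's training code.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_negative_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood of each query's positive passage.

    passage_vecs[i] is the positive passage for query_vecs[i]; every other
    passage in the mini-batch acts as a negative for that query.
    """
    assert len(query_vecs) == len(passage_vecs)
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(query_vecs)


# Toy batch of two queries with 2-dimensional embeddings (made-up values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[4.0, 0.0], [0.0, 4.0]]   # each positive matches its query
swapped = [[0.0, 4.0], [4.0, 0.0]]   # positives paired with the wrong query

loss_aligned = in_batch_negative_loss(queries, aligned)
loss_swapped = in_batch_negative_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each query scores its own passage higher than the other passages in the batch, and large otherwise, which is the signal that drives the encoders during fine-tuning.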

In this reference use case, we used a pretrained cross-lingual language model open-sourced on the Hugging Face model hub, namely the [infoxlm-base model pretrained by Microsoft](https://aclanthology.org/2021.naacl-main.280/), as the starting point for both the query encoder and the document encoder. We fine-tuned the encoders with in-batch negatives but did not include hard negatives in our fine-tuning pipeline; this may be addressed in later releases. We showcase that ensembling our fine-tuned DPR with a BM25 retriever (a widely used non-AI retriever) can improve the retrieval recall and MRR compared to BM25 only.

The stock haystack library only supports BERT-based DPR models. We have modified the haystack APIs to allow any encoder architecture (e.g., RoBERTa, xlm-RoBERTa, etc.) that you can load via the ```from_pretrained``` method of the Hugging Face transformers library. By using our containers, you can fine-tune a diverse array of custom DPR models by setting the ```xlm_roberta``` flag to true when initializing the ```DensePassageRetriever``` object. (Note: although the flag is called "xlm_roberta", it supports any model architecture that can be loaded with the ```from_pretrained``` method.)

### Image-to-document indexing
To retrieve documents in response to queries, we first need to index the documents: the raw document images are converted into text passages and stored in databases with indices. In this reference use case, we demonstrate that using an ensemble retrieval method (BM25 + DPR) improves the retrieval recall and MRR over the BM25-only and DPR-only retrieval methods. To conduct the ensemble retrieval, we need to build two databases: 1) an ElasticSearch database for BM25 retrieval, and 2) a PostgreSQL database plus a FAISS index file for DPR retrieval.

The architecture of the indexing pipeline is shown in the diagram below. There are 3 tasks in the indexing pipeline:
1. Preprocessing task: this task consists of 3 steps: image preprocessing, text extraction with OCR (optical character recognition), and post-processing of the OCR outputs. This task converts images into text passages.
2. Indexing task: this task writes the text passages produced by the Preprocessing task into one or two databases, depending on the retrieval method that the user specified.
3. Embedding task: this task generates dense vector representations of text passages using the DPR document encoder and then generates a FAISS index file with the vectors. [FAISS](https://github.com/facebookresearch/faiss) is a similarity search engine for dense vectors. When it comes to retrieval, the query will be turned into its vector representation by the DPR query encoder, and the query vector will be used to search against the FAISS index to find the vectors of text passages with the highest similarity to the query vector. The embedding task is required for the DPR retrieval or the ensemble retrieval method, but is not required for BM25 retrieval.


![indexing-architecture](assets/indexing-architecture.PNG)
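
The retrieval step that the Embedding task prepares for can be sketched as an exhaustive inner-product search. This is only an illustration of the underlying idea: the similarity metric (dot product), the function name, and the toy vectors are assumptions, and FAISS replaces this linear scan with an index structure that scales to large collections.

```python
def top_k_by_inner_product(query_vec, passage_vecs, k):
    """Return the k (passage_index, score) pairs with the highest dot product."""
    scored = []
    for idx, vec in enumerate(passage_vecs):
        score = sum(q * p for q, p in zip(query_vec, vec))
        scored.append((score, idx))
    # Highest score first; ties keep their original passage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy 2-dimensional embeddings standing in for DPR encoder outputs.
passage_vectors = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
query_vector = [1.0, 0.0]

results = top_k_by_inner_product(query_vector, passage_vectors, k=2)
print(results)
```

In the actual pipeline the passage vectors come from the DPR document encoder and the query vector from the DPR query encoder.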


### Deployment
After the DPR encoders are fine-tuned and the document images are indexed into databases as text passages, we can deploy a retrieval system on a server with Docker containers and retrieve documents in response to user queries. Once the deployment pipeline is successfully launched, users can interact with the retrieval system through a web user interface (UI) and submit queries in natural language. The retrievers search for the most relevant text passages in the databases and return those passages to be displayed on the web UI. The diagram below shows how the BM25 and DPR retrievers retrieve the top-K passages and how the ensembler reranks the passages with weighting factors to improve on the recall and MRR of the individual retrievers.

![retrieval-architecture](assets/retrieval-architecture.png)
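
One common way to combine two retrievers with a weighting factor is weighted score fusion, sketched below. This is a hypothetical illustration, not the ensembler shipped with this use case: the min-max normalization, the choice to apply the weight to the DPR scores, and all names and scores are assumptions. The default of 1.5 simply mirrors the `weight=1.5` value that appears in the evaluation log.

```python
def min_max_normalize(scores):
    """Scale a {passage_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 1.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def ensemble_rerank(bm25_scores, dpr_scores, dpr_weight=1.5, top_k=10):
    """Fuse two score mappings and return the top_k (passage_id, score) pairs."""
    bm25 = min_max_normalize(bm25_scores)
    dpr = min_max_normalize(dpr_scores)
    combined = {}
    for pid in set(bm25) | set(dpr):
        # A passage missing from one retriever's top-K contributes 0 from it.
        combined[pid] = bm25.get(pid, 0.0) + dpr_weight * dpr.get(pid, 0.0)
    # Sort by descending fused score, breaking ties by passage id.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Made-up top-K results from each retriever.
bm25_top = {"p1": 12.0, "p2": 8.0, "p3": 4.0}
dpr_top = {"p2": 80.0, "p3": 70.0, "p4": 60.0}

ranked = ensemble_rerank(bm25_top, dpr_top)
print(ranked)
```

Normalizing before fusing matters because BM25 and dense-retriever scores live on different scales; without it, one retriever would dominate regardless of the weight.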


## Run Using Jupyter Lab

### Step 1. Set environment variables

```
%env HEAD_IP=sr608
%env MODEL_NAME=my_dpr_model
%env WORKSPACE=/root/work
%env DB_DIR=/root/work/output/databases
```

Output:

```
env: HEAD_IP=sr608
env: MODEL_NAME=my_dpr_model
env: WORKSPACE=/root/work
env: DB_DIR=/root/work/output/databases
```


### Step 2. Create working directory

```
%%bash
## Clean up output from previous runs (aborts if WORKSPACE is unset or empty)
rm -rf "${WORKSPACE:?}"

## Create the working directories
mkdir -p "$WORKSPACE/dataset" "$WORKSPACE/output"
mkdir -p "$WORKSPACE/output/dpr_models" "$WORKSPACE/output/index_files" "$WORKSPACE/output/processed_data" "$WORKSPACE/output/databases"
```

### Step 3. Download the repository for the document automation reference kit

```
%%bash
cd $WORKSPACE
git clone https://github.com/intel/document-automation.git
```

Output:

```
Cloning into 'document-automation'...
```


### Step 4. Download dataset

```
%%bash
## For a quick demo, reuse a pre-downloaded copy under ~/work_bak if present; otherwise download the dataset.
cd $WORKSPACE/dataset
[[ -f ~/work_bak/dataset/dureader_vis_images_part_2.tar.gz ]] && cp ~/work_bak/dataset/dureader_vis_images_part_2.tar.gz ./ || wget https://dataset-bj.cdn.bcebos.com/qianyan/dureader_vis_images_part_2.tar.gz
tar -xzf dureader_vis_images_part_2.tar.gz
cd $WORKSPACE
[[ -f ~/work_bak/dureader_vis_docvqa.tar.gz ]] && cp ~/work_bak/dureader_vis_docvqa.tar.gz ./ || wget https://dataset-bj.cdn.bcebos.com/qianyan/dureader_vis_docvqa.tar.gz
tar -xzf dureader_vis_docvqa.tar.gz
```

### Step 5. Build Docker images

```
%%bash
cd $WORKSPACE/document-automation/docker
docker compose build
```

Output (truncated):

```
#1 [intel/ai-workflows:beta-doc-automation-fine-tuning internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#2 [intel/ai-workflows:beta-doc-automation-fine-tuning internal] load build definition from Dockerfile.fine-tuning
#2 transferring dockerfile: 307B done
#2 DONE 0.0s

#3 [intel/ai-workflows:beta-doc-automation-indexing internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.0s

#4 [intel/ai-workflows:beta-doc-automation-indexing internal] load build definition from Dockerfile.indexing
#4 transferring dockerfile: 422B done
#4 DONE 0.0s

#5 [intel/ai-workflows:beta-doc-automation-fine-tuning internal] load metadata for docker.io/intel/ai-workflows:odqa-haystack-api
#5 DONE 2.3s

#6 [intel/ai-workflows:beta-doc-automation-fine-tuning internal] load build context
#6 transferring context: 1.63MB 0.0s done
#6 DONE 0.0s

#7 [intel/ai-workflows:beta-doc-automation-indexing internal] load build context
#7 transferring context: 1.63MB 0.0s done
#7 DON
```

### Step 6. Run dataset preprocessing

```
%%bash
cd $WORKSPACE/document-automation/docker
docker compose run pre-process
```

Output:

```
Processing data, this may take a while....
Namespace(cluster_doc=False, cluster_model='microsoft/infoxlm-base', crop_image=False, data_dir='/home/user/docvqa/', dev_file='docvqa_dev.json', encoder='dpr', folder_prefix='/home/user/dataset/dureader_vis_images_part_', hard_neg=False, host='localhost', index_name='faiss', max_seq_len_passage=500, method='v0', min_chars=5, n_component=2, neg_ratio=128, num_retrieve=200, ocr_lang='chi_sim', overlap=10, port=9205, process_dev=True, retrieval_method=None, save_to='/home/user/output/processed_data/', split_doc=False, train_file='docvqa_train.json')
process training file: docvqa_train.json
Reading data...
Read complete!
Start processing data....

100%|██████████| 11109/11109 [00:24<00:00, 450.39it/s]

Completed processing data!
Saving processed data...
Save complete!
process dev file: docvqa_dev.json
Reading data...
Read complete!
Start processing data....
Completed processing data!
Saving processed data...
Save complete!

100%|██████████| 1512/1512 [00:03<00:00, 499.74it/s]
```


### Step 7. Run DPR model fine-tuning

```
%%bash
cd $WORKSPACE/document-automation/docker
## For a quick demo, reuse a previously fine-tuned DPR model under ~/work_bak if present; otherwise run fine-tuning.
[[ -d ~/work_bak/output/dpr_models/${MODEL_NAME} ]] && cp -r ~/work_bak/output/dpr_models/${MODEL_NAME} $WORKSPACE/output/dpr_models/${MODEL_NAME} || docker compose run fine-tuning
```

### Step 8. Run indexing pipeline

```
%%bash
## enable --toy-example
cp ~/document-automation/scripts/run_distributed_indexing.sh $WORKSPACE/document-automation/scripts/run_distributed_indexing.sh

cd $DB_DIR
mkdir -p esdb && chmod -R 777 esdb
cd $WORKSPACE/document-automation/docker
docker compose up postgresql -d
docker compose up elasticsearch -d
docker compose run indexing
```

Output (truncated):

```
Container docker-postgresql-1  Creating
Container docker-postgresql-1  Created
Container docker-postgresql-1  Starting
Container docker-postgresql-1  Started
Container docker-elasticsearch-1  Creating
Container docker-elasticsearch-1  Created
Container docker-elasticsearch-1  Starting
Container docker-elasticsearch-1  Started

2023-06-29 08:54:20,302	INFO usage_lib.py:490 -- Usage stats collection is disabled.
2023-06-29 08:54:20,303	INFO scripts.py:702 -- Local node IP: sr608
2023-06-29 08:54:22,676	SUCC scripts.py:739 -- --------------------
2023-06-29 08:54:22,676	SUCC scripts.py:740 -- Ray runtime started.
2023-06-29 08:54:22,677	SUCC scripts.py:741 -- --------------------
2023-06-29 08:54:22,677	INFO scripts.py:743 -- Next steps
2023-06-29 08:54:22,677	INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2023-06-29 08:54:22,677	INFO scripts.py:747 --   ray start --address='sr608:6379'
2023-06-29 08:54:22,677	INFO scripts.py:763 -- Alternatively, use the following Python code:
2023-06-29 08:54:22,677	INFO scripts.py:765 -- import ray
2023-06-29 08:54:22,677	INFO scripts.py:769 -- ray.init(address='auto', _node_ip_address='sr608')
2023-06-29 08:54:22,678	INFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to
2023-06-29 08:54:22,678	INFO sc

100%|██████████| 3.83M/3.83M [00:28<00:00, 137kiB/s]
100%|██████████| 11.9M/11.9M [00:39<00:00, 298kiB/s]

download https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar to /root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer/ch_ppocr_mobile_v2.0_cls_infer.tar
[2023/06/29 08:56:53] ppocr DEBUG: Namespace(alpha=1.0, benchmark=False, beta=1.0, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir='/root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer', cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_box_type='quad', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_limit_side_len=960, det_limit_type='max', det_model_dir='/root/.paddleocr/whl/det/ch/ch_PP-OCRv3_det_infer', det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', 

100%|██████████| 2.19M/2.19M [01:15<00:00, 28.9kiB/s]

Completed downloading paddleocr models!
postgresql://postgres:postgres@localhost:5432/haystack
localhost

2023-06-29 08:57:05,086	INFO worker.py:1352 -- Connecting to existing Ray cluster at address: sr608:6379...
[2023-06-29 08:57:05,099 I 1214 1214] global_state_accessor.cc:357: This node has an IP address of 10.0.2.208, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
2023-06-29 08:57:05,101	INFO worker.py:1538 -- Connected to Ray cluster.

Namespace(add_doc=True, cluster_doc=False, cluster_model='microsoft/infoxlm-base', crop_image=False, db='postgresql://postgres:postgres@localhost:5432/haystack', doc_encoder='/home/user/output/dpr_models/my_dpr_model/passage_encoder', embed_doc=True, embedding_bs=50, embedding_cpus_per_actor=20, embedding_max_actors=8, embedding_min_actors=4, esdb='localhost', faiss_efconstruct=200, faiss_efsearch=128, faiss_nlinks=512, force_num_cluster=False, index_file='/home/user/output/index_files/faiss-indexfile.faiss', index_name='dureadervis-documents', max_seq_len_passage=500, max_seq_len_query=128, min_chars=5, n_components=2, ocr_cfg=None, ocr_engine='paddleocr', ocr_lang='chi_sim', overlap=10, preprocess='grayscale', preprocess_cpus_per_actor=4, preprocess_max_actors=20, preprocess_min_actors=8, query_encoder='/home/user/output/dpr_models/my_dpr_model/query_encoder', retrieval_method='all', split_doc=True, toy_example=True, writing_bs=10000, writing_cpus_per_actor=4)
dir_path=/home/user/dat

Read progress: 100%|██████████| 1/1 [00:00<00:00, 499.80it/s]

Dataset(num_blocks=1, num_rows=10, schema=<class 'tuple'>)

Read progress: 100%|██████████| 1/1 [00:00<00:00, 1398.57it/s]
(pid=1879) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=1875) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=1874) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=1876) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=1877) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=1872) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=1873) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=1878) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=4566) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=4565) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=5309) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=5310) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=5313) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=6320) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=6319) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=6323) PLEASE USE OMP_NUM_THREADS W

(BlockWorker pid=1874) path = 00004a59edab633b2c39be53af0f651090346bbfc090b0c0fa79a811.png
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) Image 00004a59edab633b2c39be53af0f651090346bbfc090b0c0fa79a811.png is split into 4 passages
(BlockWorker pid=1874) path = 0005209b3f630d2f0ea690be10a1781e7ba63d3d614218374207f86c.png
(scheduler +52s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) Image 0005209b

(pid=8073) PLEASE USE OMP_NUM_THREADS WISELY.
(pid=8072) PLEASE USE OMP_NUM_THREADS WISELY.
Writing Documents:   0%|          | 0/41 [00:00<?, ?it/s]00:09<?, ?it/s](BlockWorker pid=8074)
Writing Documents: 10000it [00:00, 26730.94it/s]
Map Progress (1 actors 0 pending): 100%|██████████| 1/1 [00:11<00:00, 11.58s/it]

(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) contain ad
(BlockWorker pid=1874) Image 002e203cf04cf0d02b3add5c3e1220eb02948004b906570f9a2ac4ea.png is split into 4 passages
(BlockWorker pid=8074) write to es and postgresql
preprocess time= 333.5598497390747
write doc time= 11.583030462265015
41
Dataset(num_blocks=41, num_rows=41, schema=<class 'haystack.schema.Document'>)
(BlockWorker pid=9420) num of passages to be embeded 41
(BlockWorker pid=9420) shape of passage embeddings in this batch: (41, 768)
41
embedding time= 29.502660036087036
save time= 0.03675341606140137

Map Progress (3 actors 1 pending): 100%|██████████| 1/1 [00:29<00:00, 29.37s/it]
```


### Step 9. Run retrieval performance evaluation

```
%%bash
cd $WORKSPACE/document-automation/docker
docker compose run performance-retrieval
```

Output:

```
Container docker-indexing-1  Creating
Container docker-indexing-1  Created
Container docker-indexing-1  Starting
Container docker-indexing-1  Started

Namespace(json_file='/home/user/docvqa/docvqa_dev.json', save_to='/home/user/output/processed_data/docvqa_dev.csv')
Converting /home/user/docvqa/docvqa_dev.json to csv....
Completed conversion!
Time to convert: 8.91 sec
Evaluating retrieval performance...
Namespace(bs=16, datapath='/home/user/output/processed_data/docvqa_dev.csv', doc_encoder='/home/user/output/dpr_models/my_dpr_model/passage_encoder', error_analysis=False, eval_subset=False, host='localhost', hpo=False, index_file='/home/user/output/index_files/faiss-indexfile.faiss', index_name='dureadervis-documents', max_seq_len_passage=500, max_seq_len_query=64, num_query=100, port=9200, query_encoder='/home/user/output/dpr_models/my_dpr_model/query_encoder', ranker_path=None, rerank_topk=10, retrieval_method='ensemble', save_path=None, simple_test=False, topk=100, weight=1.5)
# of questions to be tested:  1512
hit at top 10: 0
Recall at top 10: 0.0000
MRR at top 10: 0.0001

100%|██████████| 1512/1512 [02:08<00:00, 11.80it/s]
```
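
For readers unfamiliar with the two metrics reported above, the following sketch shows one common way to compute recall and MRR at a cutoff `k`. It assumes exactly one relevant passage per question; the evaluation script in this repository may define hits differently, and the function name and toy data are illustrative only.

```python
def recall_and_mrr_at_k(ranked_ids_per_query, gold_id_per_query, k=10):
    """Return (recall@k, MRR@k) assuming one gold passage per query."""
    n = len(gold_id_per_query)
    if n == 0:
        raise ValueError("at least one query is required")
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked_ids, gold in zip(ranked_ids_per_query, gold_id_per_query):
        top = ranked_ids[:k]
        if gold in top:
            hits += 1
            # Ranks are 1-based: a gold passage in first place scores 1.0.
            reciprocal_rank_sum += 1.0 / (top.index(gold) + 1)
    return hits / n, reciprocal_rank_sum / n


# Three toy queries: gold passage ranked 1st, ranked 2nd, and not retrieved.
ranked_lists = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
gold_ids = ["a", "e", "z"]

recall, mrr = recall_and_mrr_at_k(ranked_lists, gold_ids, k=10)
print(f"Recall at top 10: {recall:.4f}")
print(f"MRR at top 10: {mrr:.4f}")
```

Recall counts how often the relevant passage appears anywhere in the top `k`, while MRR also rewards placing it near the top of the list.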


### Step 10. Stop all Docker containers

```
%%bash
cd $WORKSPACE/document-automation/docker
docker compose down
```

Output (truncated):

```
Container docker-postgresql-1  Stopping
Container docker-postgresql-1  Stopping
Container docker_performance-retrieval_run_13d5d1f16732  Stopping
Container docker_pre-process_run_d0ee767fe3ed  Stopping
Container docker_performance-retrieval_run_13d5d1f16732  Stopping
Container docker_pre-process_run_d0ee767fe3ed  Stopping
Container docker-elasticsearch-1  Stopping
Container docker-elasticsearch-1  Stopping
Container docker_performance-retrieval_run_13d5d1f16732  Stopped
Container docker_performance-retrieval_run_13d5d1f16732  Removing
Container docker_pre-process_run_d0ee767fe3ed  Stopped
Container docker_pre-process_run_d0ee767fe3ed  Removing
Container docker_pre-process_run_d0ee767fe3ed  Removed
Container docker_performance-retrieval_run_13d5d1f16732  Removed
Container docker_indexing_run_2b05ce7066bc  Stopping
Container docker-indexing-1  Stopping
Container docker_indexing_run_2b05ce7066bc  Stopping
Container docker-indexing-1  Stopping
Container docker_indexing_run_2b05ce7066bc  St
```

### Step 11. Deploy document automation

```
%%bash
cd $WORKSPACE/document-automation/docker
docker compose --env-file env.ensemble -f docker-compose-ensemble.yml up
```

Output (truncated):

```
haystack-api Pulling 
ui Pulling 
ca1778b69356 Pulling fs layer 
05ab8fbb7b6d Pulling fs layer 
6ef052e60b8d Pulling fs layer 
ba80a9ebc2bb Pulling fs layer 
5c5e0cc2f157 Pulling fs layer 
ba80a9ebc2bb Waiting 
5544ebdc0c7b Already exists 
16f91c5e2a06 Already exists 
6aec83223701 Already exists 
6850614a7123 Already exists 
cb144268a237 Already exists 
3cc303c898a3 Already exists 
c686c2e885ba Already exists 
9916d9ba4b74 Already exists 
93161ca4a4bb Already exists 
25ff14257ff7 Already exists 
ebecc16d64a4 Already exists 
6f86d869061d Already exists 
bbdfaae91b09 Already exists 
e0b571e6e34d Already exists 
3ae585920383 Already exists 
235b7ec3bebf Already exists 
5e93672ee9a1 Already exists 
05ab8fbb7b6d Downloading [==>                                                ]     732B/12.44kB
05ab8fbb7b6d Download complete 
haystack-api Pulled 
ca1778b69356 Downloading [>                                                  ]  278.5kB/27.5MB
ca1778b69356 Downloading [=>                        

Attaching to docker-elasticsearch-1, docker-haystack-api-1, docker-postsql-db-1, docker-ui-1
docker-postsql-db-1     | 
docker-postsql-db-1     | PostgreSQL Database directory appears to contain a database; Skipping initialization
docker-postsql-db-1     | 
docker-postsql-db-1     | 2023-06-29 09:09:24.549 UTC [1] LOG:  starting PostgreSQL 14.1 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
docker-postsql-db-1     | 2023-06-29 09:09:24.549 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
docker-postsql-db-1     | 2023-06-29 09:09:24.549 UTC [1] LOG:  listening on IPv6 address "::", port 5432
docker-postsql-db-1     | 2023-06-29 09:09:24.549 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
docker-postsql-db-1     | 2023-06-29 09:09:24.552 UTC [21] LOG:  database system was shut down at 2023-06-29 09:08:22 UTC
docker-postsql-db-1     | 2023-06-29 09:09:24.557 UTC [1] LOG:  database system is ready to acce

Container docker-ui-1  Stopping
Container docker-haystack-api-1  Stopping

Error while terminating subprocess (pid=903): 

Container docker-ui-1  Stopped
Container docker-haystack-api-1  Stopped
Container docker-postsql-db-1  Stopping
Container docker-elasticsearch-1  Stopping
Container docker-postsql-db-1  Stopped
Container docker-elasticsearch-1  Stopped
canceled
```


## Expected Output
Once the containers are launched successfully, you can open up a browser (Chrome recommended) and type in the following address:
```
<head node ip>:8501
```
You should see a webpage that looks like the one below.

![demo](assets/demo.PNG)
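
If the page does not load, a quick reachability check from another machine can help tell a network problem apart from a container problem. The helper below is a hypothetical convenience, not part of this repository; it assumes the web UI is served over plain HTTP on port 8501 of the head node.

```python
import socket


def ui_address(head_node_ip, port=8501):
    """Build the address to open in the browser (HTTP scheme is an assumption)."""
    return f"http://{head_node_ip}:{port}"


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# "sr608" is the HEAD_IP value used in Step 1; replace it with your head node.
print(ui_address("sr608"))
```

Calling `is_port_open("sr608", 8501)` while the deployment from Step 11 is running should return `True` if the UI port is reachable from where the check is run.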