@eero-t
Contributor

@eero-t eero-t commented Jan 8, 2025

Description

The current application Dockerfiles are largely copies of each other. I've gone through all of them (currently 19, initially 20) and found that, except for the "EdgeCraftRAG" Dockerfile.server file, the only unique part in each container is the small application Python file (plus the FFmpeg addition for "DocSum").

This changes all the other 18 Dockerfiles (for 15 apps) to use a common base image: opea-project/GenAIComps#1127

That should clearly speed up building them, and greatly reduce the resulting disk usage, as the new base image is shared between the container images. IMHO the main benefit of removing the duplicated content will be the drastic simplification of the Dockerfiles, though.

Note: the image sizes shown by docker & crictl also include the sizes of the underlying images, but disk space is actually used only once per shared base image, so the real disk cost of each additional application image will be just a few tens of KBs.
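To illustrate the accounting described in the note above, here is a minimal sketch of the arithmetic. All sizes are made-up placeholders (the PR does not state the base image size), and the function names are hypothetical; only the count of 18 Dockerfiles and the "few tens of KBs" order of magnitude come from the description.

```python
# Hypothetical sizes in KB; none of these figures are measurements from the PR.
BASE_IMAGE_KB = 1_500_000      # assumed size of the shared base image
APP_LAYER_KB = [50] * 18       # assumed app-specific layer for each of the 18 Dockerfiles


def reported_total_kb(base_kb, app_layers_kb):
    """Sum of per-image sizes as listed by tooling: the base is counted for every image."""
    return sum(base_kb + layer for layer in app_layers_kb)


def actual_total_kb(base_kb, app_layers_kb):
    """Disk space really used: the base is stored once, plus each image's own layer."""
    return base_kb + sum(app_layers_kb)


reported = reported_total_kb(BASE_IMAGE_KB, APP_LAYER_KB)
actual = actual_total_kb(BASE_IMAGE_KB, APP_LAYER_KB)
print(f"sum of listed sizes: {reported} KB")
print(f"actual disk usage:   {actual} KB")
```

With these placeholder numbers, the listed sizes add up to roughly 18 times the base image, while the space actually consumed is one base image plus the small per-application layers.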

EDIT: a subset of these changes has been split into 3 other PRs (#1612 + #1638 + #1671), for apps which no longer have CI issues. This PR will handle the last (AvatarChatbot) app once it is fixed.

Issues

There are several tickets about large disk usage, for both the GenAIExamples and GenAIComps repos.

Type of change

  • Others (enhancement, documentation, validation, etc.)

Dependencies

No new 3rd party dependencies.

"GenAIComps" content that was earlier included directly in the Dockerfiles now comes from the "GenAIComps" base image.

This PR will fail until the AvatarChatbot application is fixed: #1607

Tests

Manually tested that the ChatQnA & DocSum images build when the GenAIComps image is available, and that those 2 applications still work fine.

For the other applications, this relies on the CI tests in these PRs, and on the earlier staged-builds PR (which changed the Dockerfiles to be constructed similarly to base-image use, just with all content being built in the same Dockerfile): #1031

Future work

This solves only the application image size aspect. There are many other images whose disk usage can and should still be optimized (app UIs, GenAIComps backend service images, etc.).

@github-actions

github-actions bot commented Jan 8, 2025

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

None

@eero-t
Contributor Author

eero-t commented Jan 8, 2025

Setting to draft state until base image is available: opea-project/GenAIComps#1127

@mkbhanda
Collaborator

mkbhanda commented Jan 8, 2025

Delightfully clean!

@eero-t
Contributor Author

eero-t commented Jan 20, 2025

Rebased to main and resolved all conflicts. The only difference from the earlier Dockerfile contents is that "DocSum" now includes FFmpeg. Updated the description accordingly.

@eero-t
Contributor Author

eero-t commented Feb 5, 2025

The base container PR opea-project/GenAIComps#1127 is merged.

Before this PR can be merged, the resulting base container image needs to be:

  • added to the nightly build images,
  • pushed to the registry used by CI (in this / GenAIComps repo), and
  • pushed also to the DockerHub OPEA project.

@chensuyue
Collaborator

@eero-t you can resolve the conflict and run the test.

@eero-t eero-t marked this pull request as ready for review March 4, 2025 10:23
@eero-t eero-t requested a review from rbrugaro as a code owner March 4, 2025 10:23
@eero-t
Contributor Author

eero-t commented Mar 4, 2025

Rebased to main and resolved the conflict. Dropped the draft status as the base image is now available (thanks @chensuyue!): https://hub.docker.com/r/opea/comps-base

@eero-t
Contributor Author

eero-t commented Mar 4, 2025

The majority of the CI checks (64) pass, including application tests, but the CI tests take ~5 hours to run, and there are also a lot (26) of test failures...

"AvatarChatbot" tests fail due to timeouts.

"ChatQnA" tests fail because:

  • chatqna.py is newer than the comps-base image. The former has the async revert, the latter doesn't
  • An invalid type is used in the llama_guard health check for vllm-llm:
    [2025-03-04 12:34:47,339] [ ERROR] - opea_llama_guard - Health check failed due to an exception: Invalid input type <class 'dict'>. Must be a PromptValue, str, or list of BaseMessages.
  • Trace endpoint timeouts:
    urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=4318): Max retries exceeded with url: /v1/traces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc3d2775710>: Failed to establish a new connection: [Errno 111] Connection refused'))

Content checks fail (CONTENT='') for "CodeGen", "CodeTrans", "DocSum" and the Xeon version of "Translation", maybe due to the latest comps-base image missing the async fix.

The rocm version of the "Translation" test, in contrast, does not show why it fails:

Container translation-tgi-service  Started
Container translation-tgi-service  Waiting
Container translation-tgi-service  Error
dependency failed to start: container translation-tgi-service exited (1)
Error: Process completed with exit code 1.

"GraphRAG" services show several connection errors:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=4318): Max retries exceeded with url: /v1/traces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe350700b10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Retrying llama_index.llms.openai.base.OpenAI._acomplete in 1.0 seconds as it raised APIConnectionError: Connection error..
Retrying llama_index.llms.openai.base.OpenAI._acomplete in 1.73633567763621 seconds as it raised APIConnectionError: Connection error..
[2025-03-04 14:09:53,458] [    INFO] - neo4j_retrievers - [ check health ] Failed to connect to Neo4j: Connection error.
[2025-03-04 14:09:53,458] [   ERROR] - neo4j_retrievers - OpeaNeo4jRetriever health check failed.
...
 [2025-03-04 14:15:18,358] [   ERROR] - opea_retrievers_microservice - [ retrieval ] Error during retrieval invocation: Cannot resolve address 100.83.111.229:None
....
socket.gaierror: [Errno -8] Servname not supported for ai_socktype

The "SearchQnA" Xeon version got a timeout, and the rocm version got an Internal Server Error.

The "MultiModalQnA" (rocm) test returns some base64 binary data which cannot really be viewed in a browser. The output of that test should be fixed.

The "VisualQnA" rocm version fails due to the device being unavailable:

+ echo '[ lvm ] HTTP status is not 200. Received status was 500'
+ docker logs visualqna-tgi-service
Error: ShardCannotStart
+ exit 1
Error: Process completed with exit code 1.

@eero-t
Contributor Author

eero-t commented Mar 4, 2025

I split the changes for a subset of (5) apps[1] that currently do not have any CI issues into a separate PR (#1612).

[1] "AudioQnA", "DocIndexRetriever", "EdgeCraftRAG", "FaqGen", "VideoQnA".

@eero-t
Contributor Author

eero-t commented Mar 14, 2025

The second set of Dockerfile updates was merged, so I rebased this to main to see whether the remaining 4 apps would now pass the 7 CI tests they failed earlier:

  • AvatarChatbot (3/3 fails)
  • CodeGen (1/2 fails)
  • CodeTrans (1/3 fails)
  • MultimodalQnA (2/3 fails)

@chensuyue
Collaborator

#1607 AvatarChatbot failed with known issue.

@eero-t
Contributor Author

eero-t commented Mar 17, 2025

#1607 AvatarChatbot failed with known issue.

Thanks @chensuyue! I split the 3 currently working apps into a separate PR, #1671.

@eero-t
Contributor Author

eero-t commented Mar 24, 2025

With #1671 merged, I rebased this to main and renamed the commit changing the last (AvatarChatbot) app, so that it is the last part in the series.

@eero-t
Contributor Author

eero-t commented Mar 28, 2025

Rebased to main to check the current situation with AvatarChatbot.

@xiguiw
Collaborator

xiguiw commented Apr 1, 2025

@eero-t

Any ideas about this new issue:
opea-project/GenAIComps#1465

Update the last remaining application (megaservice) image of the 15
apps that use GenAIComps repo code as base.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t
Contributor Author

eero-t commented Apr 4, 2025

Rebased to main to see whether AvatarChatbot finally passes CI, but it still fails.

@eero-t
Contributor Author

eero-t commented Apr 4, 2025

@chensuyue Now the CI failures are due to container cleanup:

Stop and remove all containers used by the services in ./docker_compose/intel/cpu/xeon/compose.yaml ...
312cfa11f69a
....
34b467b308e2
Error: Process completed with exit code 1.

@joshuayao joshuayao added this to the v1.3 milestone Apr 8, 2025
@chensuyue
Collaborator

@chensuyue Now the CI failures are due to container cleanup:

Stop and remove all containers used by the services in ./docker_compose/intel/cpu/xeon/compose.yaml ...
312cfa11f69a
....
34b467b308e2
Error: Process completed with exit code 1.

No, that's not the main issue; the main issue is that the functionality test failed.

@joshuayao joshuayao linked an issue Apr 8, 2025 that may be closed by this pull request
8 tasks
@joshuayao joshuayao added this to OPEA Apr 9, 2025
@joshuayao joshuayao moved this to In review in OPEA Apr 9, 2025
@joshuayao joshuayao self-requested a review April 9, 2025 06:50
@joshuayao joshuayao merged commit 8b7cb35 into opea-project:main Apr 9, 2025
23 of 24 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in OPEA Apr 9, 2025
@eero-t eero-t deleted the use-base-image branch April 9, 2025 12:31
Mahathi-Vatsal pushed a commit to Mahathi-Vatsal/GenAIExamples that referenced this pull request Apr 9, 2025
…opea-project#1369)

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Co-authored-by: chen, suyue <suyue.chen@intel.com>
Signed-off-by: Mahathi Vatsal <mahathi.vatsal.salopanthula@intel.com>
cwlacewe pushed a commit to cwlacewe/GenAIExamples that referenced this pull request Apr 11, 2025
…opea-project#1369)

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Co-authored-by: chen, suyue <suyue.chen@intel.com>
Signed-off-by: Lacewell, Chaunte W <chaunte.w.lacewell@intel.com>
chyundunovDatamonsters pushed a commit to chyundunovDatamonsters/OPEA-GenAIExamples that referenced this pull request May 16, 2025
…opea-project#1369)

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Co-authored-by: chen, suyue <suyue.chen@intel.com>
Signed-off-by: Chingis Yundunov <c.yundunov@datamonsters.com>

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug] AvatarChatbot test fail
[Feature] Dockerfile Optimization for OPEA v1.3

5 participants