
Fix for the high memory usage #935

Merged: 4 commits merged into google:master on Feb 13, 2024

Conversation

@chandrashekar-s (Collaborator) commented Feb 5, 2024

Description of what I changed

Fixes #900

The incremental run used to fail when a large number of FHIR resources was already present in the system. This was because, during the incremental run, the delta records had to be merged with the existing records in the Parquet files, which required reading all of the Parquet records.

It turned out that the minimum amount of memory needed to read from all the Parquet shards in parallel is determined by the formula below:

memoryNeeded = Misc memory for JVM stack + (#Parallel Pipeline Threads * #Parallel Pipelines * Parquet Row Group Size)

This was causing failures with the default Parquet row group size of 128 MB, because loading row groups of that size into memory across all the pipeline threads required a large heap.
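As a purely illustrative calculation (the thread and pipeline counts here are assumed, not this project's actual defaults): a single pipeline running 32 parallel threads with the default 128 MB row groups needs

32 threads * 1 pipeline * 128 MB = 4 GB

of heap for the Parquet buffers alone, on top of the miscellaneous JVM memory; with 32 MB row groups the same setup needs only 1 GB.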

Changes have been made to reduce the Parquet row group size to 32 MB so that the pipelines can run even in a low-resource environment, and the size has also been made configurable. In addition, the application now fails fast if the configured memory is not sufficient for the chosen parameters.
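A minimal sketch of what such a fail-fast check could look like is shown below; the class name, method name, and the assumed 512 MB miscellaneous overhead are illustrative, not the actual code added in this PR.

public final class MemoryGuard {
  // Assumed fixed overhead for JVM stacks, metaspace, etc.; not a value taken from this PR.
  private static final long MISC_JVM_BYTES = 512L * 1024 * 1024;

  /**
   * Fails fast when the configured heap cannot hold one row group per pipeline thread,
   * following: memoryNeeded = misc + (threads * pipelines * rowGroupSize).
   */
  static void checkMemory(int pipelineThreads, int parallelPipelines, long rowGroupSizeBytes) {
    long needed = MISC_JVM_BYTES + (long) pipelineThreads * parallelPipelines * rowGroupSizeBytes;
    long maxHeap = Runtime.getRuntime().maxMemory(); // approximately the -Xmx value
    if (maxHeap < needed) {
      throw new IllegalStateException(
          String.format(
              "Max heap %d MB is below the estimated requirement of %d MB; "
                  + "increase -Xmx or reduce the Parquet row group size.",
              maxHeap / (1024 * 1024), needed / (1024 * 1024)));
    }
  }
}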

E2E test

Ran the pipelines incremental runs and verified the following

  • Tested with an initial load of 80K patients already present, then processed a delta of 100 patient records. The pipeline did not fail and successfully processed the records.

Checklist: I completed these to help reviewers :)

  • I have read and will follow the review process.

  • I am familiar with Google Style Guides for the language I have coded in.

    No? Please take some time and review Java and Python style guides.

  • My IDE is configured to follow the Google code styles.

    No? Unsure? -> configure your IDE.

  • I have added tests to cover my changes. (If you refactored existing code that was well tested you do not have to add tests)

  • I ran mvn clean package right before creating this pull request and added all formatting changes to my commit.

  • All new and existing tests passed.

  • My pull request is based on the latest changes of the master branch.

    No? Unsure? -> execute command git pull --rebase upstream master

@chandrashekar-s (Collaborator, Author) commented Feb 5, 2024

Hi @bashir2, the changes are complete. However, the application is failing to launch in the e2e test cases because the machine used in the e2e runs is of type n1-highcpu-32, which I assume is a 32-core machine, while the default values currently only support machines with up to 8 cores.

I just wanted to get your opinion: should the default application configuration be increased to support the 32-core machine, which would mean defaulting the Xmx value to more than 6 GB, or should we just add a WARN log instead of failing the application launch?

@chandrashekar-s (Collaborator, Author) commented

The other option I am considering is to automatically generate the JVM memory parameters based on the thread count and Parquet row group settings, which might be a little non-trivial.
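For illustration only, such a derivation could look roughly like the sketch below; the method name and the 512 MB overhead constant are hypothetical and not from this PR.

// Hypothetical sketch: turns the memory formula above into an -Xmx suggestion.
static String suggestXmx(int pipelineThreads, int parallelPipelines, long rowGroupSizeBytes) {
  long miscJvmBytes = 512L * 1024 * 1024; // assumed overhead for stacks, metaspace, etc.
  long neededBytes = miscJvmBytes + (long) pipelineThreads * parallelPipelines * rowGroupSizeBytes;
  long neededMb = (neededBytes + 1024 * 1024 - 1) / (1024 * 1024); // round up to whole megabytes
  return "-Xmx" + neededMb + "m";
}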

@chandrashekar-s (Collaborator, Author) commented Feb 7, 2024

In the latest commit, the default JVM memory values have been increased to support machines with up to 32 cores. Also, note that the Parquet row group size has been reduced from the default of 128 MB to 32 MB. This is to support running the pipelines without failure even in a low-resource environment. The downsides of reducing this value have been documented in the PR.
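For context, the row group size is a standard setting on the Parquet writer itself; a generic sketch using the Avro Parquet API (not this project's actual writer code, and the class name is made up) looks roughly like this:

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public final class ExampleParquetWriterFactory {
  // Creates a writer whose row groups are capped at the given size, e.g. 32 * 1024 * 1024.
  static ParquetWriter<GenericRecord> create(Schema schema, String file, int rowGroupSizeBytes)
      throws IOException {
    return AvroParquetWriter.<GenericRecord>builder(new Path(file))
        .withSchema(schema)
        .withRowGroupSize(rowGroupSizeBytes)
        .build();
  }
}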

@bashir2 (Collaborator) left a comment

Thanks @chandrashekar-s for the investigation and the fix. All of my comments are minor suggestions or questions. Once addressed, please feel free to merge this.

@chandrashekar-s merged commit 8854f23 into google:master on Feb 13, 2024
5 checks passed
mozzy11 added a commit to mozzy11/openmrs-fhir-analytics that referenced this pull request Feb 14, 2024
remove redundant space

Fix for the high memory usage (google#935)

* Fix for the high memory usage

* Increased the default JVM memory values

* Review comments

minor format

increase time for reading parquet files

increase time

restore initial time for reading parquet files

Fixed jackson issue for hapi upgrade (google#940)

Changed e2e test to use the latest version of HAPI (google#941)

Bump org.slf4j:slf4j-api from 2.0.11 to 2.0.12 (google#943)

Bump commons-codec:commons-codec from 1.16.0 to 1.16.1 (google#942)

try increasing time for parquet reader

test with default parquet row size

increase time

increase more time

increase row size

revert parquet runtime time to 5 minutes

Bump com.google.apis:google-api-services-healthcare (google#945)

reset rowGroupSizeForParquetFiles
Development

Successfully merging this pull request may close these issues.

The pipeline fails with Out of Heap Space error for high number of FHIR resources