Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate flink-conf.yaml file automatically to set optimum conf values #874

Merged
merged 7 commits into from
Nov 29, 2023

Conversation

chandrashekar-s
Copy link
Collaborator

@chandrashekar-s chandrashekar-s commented Nov 8, 2023

Description of what I changed

Fixes #823

  • Enabled a feature flag to automate the generation of flink-conf.yaml based on the numThreads in application.yaml file and the number of cores available in the machine.
  • If the flag is disabled, then the flink-conf.yaml directed by the env. var FLINK_CONF_DIR will be used. If var is empty, then default conf will be used.
  • JAVA_OPTS can be set to override the default JVM heap and other parameter values.
  • The application will fail to launch if the JVM Off Heap memory is insufficient for the Flink cluster to launch.
  • Fixes the regression introduced by this change where the maxWorkers was defaulted to 1 and in turn the num of shards for parquet files was also 1.

E2E test

Tested Full Run and Incremental Run by enabling and disabling the flags. Also tested for different numThread configurations.

TESTED:

Load tested the application on a 48 core machine. Given below are the results

Run # Run Type # of Patient Records Time in secs (Before Change) Time in secs (After Change) Remarks
1 Full 9K 400 358 Full run timing remains the same before and after the changes
2 Full (Repeat) 9K 360 256  
3 Incremental 9K Upfront + 100 additional 140 30 The improvement in time after change is because of fixing the regression (increased num of shards from 1 to N)
4 Incremental 9K Upfront + 500 additional 180 32  

After fixing the regression the time taken for incremental run has significantly reduced

Checklist: I completed these to help reviewers :)

  • I have read and will follow the review process.

  • I am familiar with Google Style Guides for the language I have coded in.

    No? Please take some time and review Java and Python style guides.

  • My IDE is configured to follow the Google code styles.

    No? Unsure? -> configure your IDE.

  • I have added tests to cover my changes. (If you refactored existing code that was well tested you do not have to add tests)

  • I ran mvn clean package right before creating this pull request and added all formatting changes to my commit.

  • All new and existing tests passed.

  • My pull request is based on the latest changes of the master branch.

    No? Unsure? -> execute command git pull --rebase upstream master

@chandrashekar-s
Copy link
Collaborator Author

@bashir2 The build is failing currently, I am looking into it.

@chandrashekar-s chandrashekar-s force-pushed the automate-flink-params branch 3 times, most recently from 0f173f4 to ab8d8f5 Compare November 20, 2023 12:40
@chandrashekar-s
Copy link
Collaborator Author

@bashir2 The PR is ready for review. Since changes were made to create target jars compatible with jdk 17, the docker images had to be updated for streaming module.

@bashir2
Copy link
Collaborator

bashir2 commented Nov 20, 2023

@bashir2 The PR is ready for review. Since changes were made to create target jars compatible with jdk 17, the docker images had to be updated for streaming module.

Thanks @chandrashekar-s for the updates. I'll review this by tomorrow.

Copy link
Collaborator

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chandrashekar-s for this change.

@chandrashekar-s
Copy link
Collaborator Author

@bashir2 Thanks for reviewing the changes. I have addressed/responded to the comments. Can you please have a look. Also, for few cases where unit test cases could not be written, I feel we need few more tests for e2e or performance to avoid regressions.

Copy link
Collaborator

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chandrashekar-s the remaining comments are all questions or minor suggestions. Please feel free to merge after addressing them.

@chandrashekar-s
Copy link
Collaborator Author

chandrashekar-s commented Nov 28, 2023

Thanks @bashir2 for the reviewing the changes. I have addressed some of the comments in the latest commit. Also I have created these 2 issues #893 and #891 for Validating Flink in non-local mode and Investigating/fixing the reshuffle operations for writing to parquet files respectively.

@chandrashekar-s
Copy link
Collaborator Author

The performance results have been attached in the PR description. No noticeable changes for the Full run, but for Incremental run the timing has improved after fixing the number of Shards to be N (Flink parallelism)

@chandrashekar-s chandrashekar-s merged commit 7dd6df2 into google:master Nov 29, 2023
5 checks passed
@bashir2 bashir2 mentioned this pull request Nov 29, 2023
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Default memory configurations fail in a low resource environment
2 participants