Gangams/logs 50k eps per node #1235

Merged: 88 commits into ci_prod on May 9, 2024


@ganga1980 ganga1980 commented Apr 29, 2024

This PR has the following changes:

  1. AMA Core Agent integration in high log scale mode
  2. Telemetry for AMA Core Agent traces and high log scale mode
  3. ConfigMap extension for high log scale mode
  4. High log scale configuration for fluent-bit, such as threading and storage type
  5. Liveness probe extension for AMA Core Agent when high log scale mode is enabled
  6. ARM template updates for high log scale mode
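
For readers unfamiliar with the readiness polling behind item 5 (the commit list mentions "add poll to check amaca port up and running"), the idea can be sketched as below. This is an illustration only and not the code in this PR: the function name wait_for_port, the timeout values, and any host or port passed to it are assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses.

    Hypothetical helper: the real agent may check readiness differently.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A caller would invoke something like wait_for_port("127.0.0.1", some_port) before treating the agent as ready, where some_port stands in for whatever port the agent is configured to listen on.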

Onboarding steps

  1. Set enabled = true under [agent_settings.high_log_scale] in the ConfigMap.
  2. Specify Microsoft-ContainerLogV2-HighScale instead of Microsoft-ContainerLogV2 in the ARM template parameter file.

Scale tests were performed on the Standard_D16s_v5 VM SKU.

For 50K eps, a minimum of 2 GB of memory and 3 cores is needed.

Validated the following scenarios with high log scale mode:

  1. ContainerLogV2 with Microsoft-ContainerLogV2-HighScale stream on both Linux and Windows
  2. Memory usage of MDSD and AMA Core Agent (AMACA) at various log scales:
    - No scale: MDSD (90 MB) and AMACA (90 MB)
    - 3K eps: MDSD (120 MB) and AMACA (300 MB)
    - 50K eps: MDSD (120 MB) and AMACA (400 MB)
  3. HTTP proxy
  4. AMPLS
  5. Both Windows & Linux
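
To make the memory figures in item 2 easier to compare, the combined MDSD + AMACA footprint at each scale can be added up. The snippet below is a small sketch that only uses the numbers reported above; the names MEMORY_MB and combined_memory_mb are illustrative.

```python
# Memory figures (MB) as reported in item 2 of the list above.
MEMORY_MB = {
    "No scale": {"MDSD": 90, "AMACA": 90},
    "3K eps": {"MDSD": 120, "AMACA": 300},
    "50K eps": {"MDSD": 120, "AMACA": 400},
}


def combined_memory_mb(usage: dict[str, int]) -> int:
    """Sum the per-process memory figures for one log scale."""
    return sum(usage.values())


if __name__ == "__main__":
    for scale, usage in MEMORY_MB.items():
        print(f"{scale}: {combined_memory_mb(usage)} MB")
```

These totals cover only the two processes listed (MDSD and AMACA), not any other process in the agent container.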

Without high log scale mode (i.e. the default):

  1. Validated on ARM64, Azure Linux and Windows nodes

Scenarios not supported in high log scale mode:

  1. ARM64
  2. HTTP proxy with CA cert

Resource usage at 50K eps: CPU is around 4 cores, memory is 2.5 GB, and disk IOPS is 570.

@ganga1980

Disk IO charts at 50K logs/sec/node.

Total reads and writes peak at 570/sec. At peak, reads are 329/sec and writes are 242/sec.
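
As a quick cross-check of the disk IO figures, the read and write peaks can be added and related to the log rate. This is a rough sketch that assumes the read and write peaks coincide; the constant and function names are illustrative, and the events-per-IO ratio is derived here rather than measured.

```python
# Peak figures reported in this comment (per node, at 50K logs/sec).
PEAK_READS_PER_SEC = 329
PEAK_WRITES_PER_SEC = 242
LOG_RATE_EPS = 50_000


def peak_iops(reads_per_sec: int, writes_per_sec: int) -> int:
    """Combined disk operations per second, assuming both peaks coincide."""
    return reads_per_sec + writes_per_sec


def events_per_io(eps: int, iops: int) -> float:
    """Average number of log events handled per disk operation."""
    return eps / iops


if __name__ == "__main__":
    total = peak_iops(PEAK_READS_PER_SEC, PEAK_WRITES_PER_SEC)
    print(f"peak IOPS: {total}")
    print(f"events per IO: {events_per_io(LOG_RATE_EPS, total):.1f}")
```

The sum of the two peaks is 571, consistent with the rounded figure of 570 quoted above.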

image

@ganga1980

Latest test run with the MDSD and AMACA RC images:
image
image

@ganga1980

ganga1980 commented May 9, 2024

Default scenario (without high log scale mode) in MSI mode on x64, ARM64, and Windows nodes:

Daemonset Memory Usage (with & without AMA version update)
image

Daemonset CPU Usage (with & without AMA version update)
image

Replicaset CPU Usage
image

Replicaset Memory Usage
image

Default scenario (without high log scale mode) in Legacy mode on x64, ARM64, and Windows nodes:

Daemonset Memory Usage (with & without AMA version update)
image

Daemonset CPU Usage (with & without AMA version update)

image

Replicaset CPU Usage (with & without AMA version update)
image

Replicaset Memory Usage (with & without AMA version update)

image

@ganga1980 ganga1980 enabled auto-merge (squash) May 9, 2024 23:37
@ganga1980 ganga1980 merged commit 84fb709 into ci_prod May 9, 2024
15 checks passed
jatakiajanvi12 added a commit that referenced this pull request May 10, 2024
* fix version in Geneva config xml (#1227)

* fix bugs (#1230)

* fix bugs

* fix comment

* update dcr optimization error messages (#1228)

* update dcr optimization error messages

* add additional check for geneva

* redirect dcr parser stderr and stdout to traces file

---------

Co-authored-by: Amol Agrawal <amagraw@microsoft.com>

* update fluent-bit to 2.2.2 in linux (#1229)

* update fluent-bit to 2.2.2 in linux

---------

Co-authored-by: Amol Agrawal <amagraw@microsoft.com>

* update charts, yaml and release notes for 3.1.20 (#1234)

Co-authored-by: Amol Agrawal <amagraw@microsoft.com>

* Geneva -send windows container inventory and perf with RS (#1233)

* Update the geneva feature flag for RS

---------

Co-authored-by: Janvi Jatakia (from Dev Box) <jajataki@microsoft.com>

* Add scan tools to the build pipeline (#1237)

* Add the missing tools to the build pipeline

* update policheck similar to prom metrics

* update binskim

* update trivyignore

* add policheck in windows section

---------

Co-authored-by: Janvi Jatakia (from Dev Box) <jajataki@microsoft.com>

* streamline input plugin code. (#1238)

* streamline input plugin code

---------

Co-authored-by: Amol Agrawal <amagraw@microsoft.com>

* Telemetry optimization: adding addon token adapter traces as metrics (#1231)

* Add token adapter traces as metrics

* update trivyignore

* updating name of mdsd function

* Updating the addon token adapter to discard unnecessary logs

* Update trivyignore

---------

Co-authored-by: Janvi Jatakia (from Dev Box) <jajataki@microsoft.com>

* Update ai instrumentation key for USNAT/USSEC (#1239)

* update ai instrumentation key

* address comments

* resolve comments

* syntax error

---------

Co-authored-by: Janvi Jatakia (from Dev Box) <jajataki@microsoft.com>

* Gangams/logs 50k eps per node (#1235)

* mdsd version 50k changes

* amacore agent integration

* update liveness probe

* handle non-existent file

* refactor code

* fix bugs in mdsd install

* add poll to check amaca port up and running

* fix bug

* configure amaca configport

* try released mdsd version 1.30.3

* fix bug in logs and events profile

* test latest version of mdsd in GIG mode for both arm and x64

* try with build 50k eps changes

* update templates for high log scale mode

* remove libc.so copying

* revert logrotate conf for amaca log

* update mdsd version which has crash fix

* add proxy support for amacore agent

* update mdsd build with amaca gig la changes

* update mdsd build with gig la fixes

* update windows ama build

* mdsd version with 25k buffer size in mdsd

* update mdsd build

* add telemetry and configmap option

* fix bugs

* windows ama build with resource id bug fix

* update mdsd version with qos fixes

* update to use working templates

* add frequency to control amaca log

* mdsd build with qos updates

* trivy ignore update

* log amaca agent version

* improve comments

* add default fluent-bit config for high log scale

* add threding on tail plugin when high log scale enabled

* fix bugs

* fix bug

* fix bugs

* some improvements

* improve comments

* improve code

* update trivyignore

* fix bug

* update trivyignore

* pick GIGLA stream from config when highlogscale enabled

* fix bug

* template updates for high log scale mode

* fix bug

* clean up

* set envvar for ishighlogscale

* set envvar for ishighlogscale

* fix bug

* add log message to troubleshoot duplicate logs

* add log message to troubleshoot duplicate logs

* handle ama bug until fixed

* add storage total limit size

* rename for better reading

* fix pr feedback

* fix pr feedback

* fix pr feedback

* mdsd version update

* fix proxy bug

* fix proxy bug

* update trivy ignore

* clean up the code

* refactor code

* increase storage limit size to 2GB

* increase storage limit size to 10GB

* official mdsd and windows ama versions

* code cleanup

* code cleanup

* mdsd version annotation update

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

---------

Co-authored-by: Ganga Mahesh Siddem <gangams@microsoft.com>
Co-authored-by: Amol Agrawal <pfrcks@gmail.com>
Co-authored-by: Amol Agrawal <amagraw@microsoft.com>
Co-authored-by: Janvi Jatakia (from Dev Box) <jajataki@microsoft.com>
jatakiajanvi12 pushed a commit that referenced this pull request May 10, 2024
jatakiajanvi12 added a commit that referenced this pull request May 10, 2024