-
Notifications
You must be signed in to change notification settings - Fork 73
[WIP] TRT-292: Add disruption test documentation #258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,3 +4,4 @@ | |
| # Hugo | ||
| public/ | ||
| resources/_gen | ||
| .hugo_build.lock | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,257 @@ | ||
| --- | ||
| title: "Disruption Tests" | ||
| description: A look at how disruption tests are architected and implemented. | ||
| --- | ||
|
|
||
| ## Summary | ||
|
|
||
| Disruption tests are a set of tests that measure an endpoint's resilience or an allowed alert duration. As the `e2e tests` are run the results are matched against historical data to determine if the disruption intervals are within expectations. | ||
|
|
||
| ### Resources | ||
|
|
||
| {{% alert title="⚠️ Note!" color="warning" %}} | ||
| You'll need access to the appropriate groups to work with disruption data, please reach out to the TRT team for access. | ||
| {{% /alert %}} | ||
|
|
||
| - [Periodic Jobs](https://github.com/openshift/release/tree/master/ci-operator/jobs/openshift/release) | ||
| - [BigQuery](https://console.cloud.google.com/bigquery?project=openshift-ci-data-analysis) | ||
| - [DPCR Job Aggregation Configs](https://github.com/openshift/continuous-release-jobs/tree/master/config/clusters/dpcr/services/dpcr-ci-job-aggregation) | ||
| - [Origin Synthetic Backend Tests](https://github.com/openshift/origin/tree/master/pkg/synthetictests/allowedbackenddisruption) | ||
|
|
||
| ## Disruption Data Architecture | ||
|
|
||
| {{% alert color="info" %}} | ||
| The below diagram presents a high level overview on how we use our `periodic jobs`, `job aggregation` and `BigQuery` to generate the disruption historical data. | ||
| It does not cover how the tests themselves are run against a cluster. | ||
| {{% /alert %}} | ||
|
|
||
| ### High Level Diagram | ||
|
|
||
| {{< inlineSVG file="/static/disruption_test_diagram.svg" >}} | ||
|
|
||
| ### How The Data Flows | ||
|
|
||
| 1. Disruption data is gathered from periodic jobs that run e2e tests. When e2e tests are done the results are uploaded to `GCS` and the results can be viewed in the artifacts folder for a particular job run. | ||
|
|
||
| Clicking the artifact link on the top right of a prow job and navigating to the `openshift-e2e-test` folder will show you the disruption results. (ex. `.../openshift-e2e-test/artifacts/junit/backend-disruption_[0-9]+-[0-9]+.json`). | ||
|
|
||
| 1. Job aggregation jobs are run in the DPCR cluster, the current configuration can be found in the [openshift/continuous-release-jobs](https://github.com/openshift/continuous-release-jobs/tree/master/config/clusters/dpcr/services/dpcr-ci-job-aggregation). | ||
|
|
||
| These jobs fetch and parse out the results from the e2e runs. They are then pushed to the [openshift-ci-data-analysis](https://console.cloud.google.com/bigquery?project=openshift-ci-data-analysis) table in BigQuery. | ||
|
|
||
| 1. Currently, backend disruption data is queried from BigQuery and downloaded in `json` format. The resulting `json` file is then committed to [origin/pkg/synthetictests/allowedbackenddisruption/query_results.json](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedbackenddisruption/query_results.json) for **backend disruption** or [origin/pkg/synthetictests/allowedalerts/query_results.json](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedalerts/query_results.json) for **alert data** (see [how to query](#how-to-query-the-data)) | ||
|
|
||
| ### How To Query The Data | ||
|
|
||
| Once you have access to BigQuery in the `openshift-ci-data-analysis` project, you can run the below query to fetch the latest results. | ||
|
|
||
| #### Query | ||
|
|
||
| {{% card-code header="[origin/pkg/synthetictests/allowedbackenddisruption/types.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedbackenddisruption/types.go#L13-L43)" %}} | ||
|
|
||
| ```sql | ||
| SELECT | ||
| BackendName, | ||
| Release, | ||
| FromRelease, | ||
| Platform, | ||
| Architecture, | ||
| Network, | ||
| Topology, | ||
| ANY_VALUE(P95) AS P95, | ||
| ANY_VALUE(P99) AS P99, | ||
| FROM ( | ||
| SELECT | ||
| Jobs.Release, | ||
| Jobs.FromRelease, | ||
| Jobs.Platform, | ||
| Jobs.Architecture, | ||
| Jobs.Network, | ||
| Jobs.Topology, | ||
| BackendName, | ||
| PERCENTILE_CONT(BackendDisruption.DisruptionSeconds, 0.95) OVER(PARTITION BY BackendDisruption.BackendName, Jobs.Network, Jobs.Platform, Jobs.Release, Jobs.FromRelease, Jobs.Topology) AS P95, | ||
| PERCENTILE_CONT(BackendDisruption.DisruptionSeconds, 0.99) OVER(PARTITION BY BackendDisruption.BackendName, Jobs.Network, Jobs.Platform, Jobs.Release, Jobs.FromRelease, Jobs.Topology) AS P99, | ||
| FROM | ||
| openshift-ci-data-analysis.ci_data.BackendDisruption as BackendDisruption | ||
| INNER JOIN | ||
| openshift-ci-data-analysis.ci_data.BackendDisruption_JobRuns as JobRuns on JobRuns.Name = BackendDisruption.JobRunName | ||
| INNER JOIN | ||
| openshift-ci-data-analysis.ci_data.Jobs as Jobs on Jobs.JobName = JobRuns.JobName | ||
| WHERE | ||
| JobRuns.StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 21 DAY) | ||
| ) | ||
| GROUP BY | ||
| BackendName, Release, FromRelease, Platform, Architecture, Network, Topology | ||
| ``` | ||
|
|
||
| {{% /card-code %}} | ||
|
|
||
| {{% card-code header="[origin/pkg/synthetictests/allowedalerts/types.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedalerts/types.go#L17-L35)" %}} | ||
|
|
||
| ```sql | ||
| SELECT * FROM openshift-ci-data-analysis.ci_data.Alerts_Unified_LastWeek_P95 | ||
| where | ||
| alertName = "etcdMembersDown" or | ||
| alertName = "etcdGRPCRequestsSlow" or | ||
| alertName = "etcdHighNumberOfFailedGRPCRequests" or | ||
| alertName = "etcdMemberCommunicationSlow" or | ||
| alertName = "etcdNoLeader" or | ||
| alertName = "etcdHighFsyncDurations" or | ||
| alertName = "etcdHighCommitDurations" or | ||
| alertName = "etcdInsufficientMembers" or | ||
| alertName = "etcdHighNumberOfLeaderChanges" or | ||
| alertName = "KubeAPIErrorBudgetBurn" or | ||
| alertName = "KubeClientErrors" or | ||
| alertName = "KubePersistentVolumeErrors" or | ||
| alertName = "MCDDrainError" or | ||
| alertName = "PrometheusOperatorWatchErrors" or | ||
| alertName = "VSphereOpenshiftNodeHealthFail" | ||
| order by | ||
| AlertName, Release, FromRelease, Topology, Platform, Network | ||
| ``` | ||
|
|
||
| {{% /card-code %}} | ||
|
|
||
| #### Downloading | ||
|
|
||
| Once the query is run, you can download the data locally. | ||
|
|
||
|  | ||
|
|
||
| ## Code Implementation | ||
|
|
||
| > Note! In the examples below we use the `Backend Disruption` tests, but the same will hold true for the alerts durations, | ||
|
|
||
| Now that we have a better understanding of how the disruption test data is generated and updated, let's discuss how the code makes use of it. | ||
|
|
||
| ### Best Matcher | ||
|
|
||
| The [origin/pkg/synthetictests/allowedbackenddisruption/query_results.json](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedbackenddisruption/query_results.json) file that we updated previously is embedded into the `openshift-tests` binary. At runtime, we ingest the raw data and create a `historicaldata.NewMatcher()` object which implements the `BestMatcher` interface. | ||
|
|
||
| {{% card-code header="[origin/pkg/synthetictests/allowedbackenddisruption/types.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedbackenddisruption/types.go#L53-L77)" %}} | ||
|
|
||
| ```go | ||
| //go:embed query_results.json | ||
| var queryResults []byte | ||
|
|
||
| var ( | ||
| readResults sync.Once | ||
| historicalData historicaldata.BestMatcher | ||
| ) | ||
|
|
||
| const defaultReturn = 2.718 | ||
|
|
||
| func getCurrentResults() historicaldata.BestMatcher { | ||
| readResults.Do( | ||
| func() { | ||
| var err error | ||
| genericBytes := bytes.ReplaceAll(queryResults, []byte(` "BackendName": "`), []byte(` "Name": "`)) | ||
| historicalData, err = historicaldata.NewMatcher(genericBytes, defaultReturn) | ||
| if err != nil { | ||
| panic(err) | ||
| } | ||
| }) | ||
|
|
||
| return historicalData | ||
| } | ||
| ``` | ||
|
|
||
| {{% /card-code %}} | ||
|
|
||
| ### Best Guesser | ||
|
|
||
| The core logic of the current best matcher will check if we have an exact match in the historical data. An exact match is one that contains the same `Backend Name` and [JobType](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/platformidentification/types.go#L16-L23). When we don't have an exact match, we make a best guess effort by doing a fuzzy match for data we don't have. Fuzzy matching is done by iterating through all the `nextBestGuessers` and stopping at the first match that fits our criteria and checking if it's contained in the data set. | ||
|
|
||
| {{% card-code header="[origin/pkg/synthetictests/historicaldata/types.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/types.go#L89-L111)" %}} | ||
|
|
||
| ```go | ||
| exactMatchKey := DataKey{ | ||
| Name: name, | ||
| JobType: jobType, | ||
| } | ||
|
|
||
| if percentiles, ok := b.historicalData[exactMatchKey]; ok { | ||
| return percentiles, "", nil | ||
| } | ||
|
|
||
| for _, nextBestGuesser := range nextBestGuessers { | ||
| nextBestJobType, ok := nextBestGuesser(jobType) | ||
| if !ok { | ||
| continue | ||
| } | ||
| nextBestMatchKey := DataKey{ | ||
| Name: name, | ||
| JobType: nextBestJobType, | ||
| } | ||
| if percentiles, ok := b.historicalData[nextBestMatchKey]; ok { | ||
| return percentiles, fmt.Sprintf("(no exact match for %#v, fell back to %#v)", exactMatchKey, nextBestMatchKey), nil | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| {{% /card-code %}} | ||
|
|
||
| ### Default Next Best Guessers | ||
|
|
||
| [Next Best Guessers](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L13-L53) are functions that can be chained together and will return either a `true` or `false` if the current `JobType` matches the desired logic. In the code snippet below, we check if `MicroReleaseUpgrade` matches the current `JobType`, if false, we continue down the list. The [combine](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L179-L191) helper function gives you the option to chain and compose a more sophisticated check. In the example below, if we can do a [PreviousReleaseUpgrade](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L100-L113) the result of that will be fed into [MicroReleaseUpgrade](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L80-L98) and if no function returns `false` during this chain, we have successfully fuzzy matched and can now check the historical data has information for this match. | ||
|
|
||
| {{% card-code header="Ex: `nextBestGuessers` [origin/pkg/synthetictests/historicaldata/next_best_guess.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L13-L53)" %}} | ||
|
|
||
eggfoobar marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| ```go | ||
| var nextBestGuessers = []NextBestKey{ | ||
| MicroReleaseUpgrade, | ||
| PreviousReleaseUpgrade, | ||
| ... | ||
| combine(PreviousReleaseUpgrade, MicroReleaseUpgrade), | ||
| ... | ||
| } | ||
| ``` | ||
|
|
||
| {{% /card-code %}} | ||
|
|
||
| {{% card-code header="Ex: `PreviousReleaseUpgrade` [origin/pkg/synthetictests/historicaldata/next_best_guess.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L100-L113)" %}} | ||
|
|
||
| ```go | ||
| // PreviousReleaseUpgrade if we don't have data for the current toRelease, perhaps we have data for the congruent test | ||
| // on the prior release. A 4.11 to 4.11 upgrade will attempt a 4.10 to 4.10 upgrade. A 4.11 no upgrade, will attempt a 4.10 no upgrade. | ||
| func PreviousReleaseUpgrade(in platformidentification.JobType) (platformidentification.JobType, bool) { | ||
| toReleaseMajor := getMajor(in.Release) | ||
| toReleaseMinor := getMinor(in.Release) | ||
|
|
||
| ret := platformidentification.CloneJobType(in) | ||
| ret.Release = fmt.Sprintf("%d.%d", toReleaseMajor, toReleaseMinor-1) | ||
| if len(in.FromRelease) > 0 { | ||
| fromReleaseMinor := getMinor(in.FromRelease) | ||
| ret.FromRelease = fmt.Sprintf("%d.%d", toReleaseMajor, fromReleaseMinor-1) | ||
| } | ||
| return ret, true | ||
| } | ||
| ``` | ||
|
|
||
| {{% /card-code %}} | ||
|
|
||
| {{% card-code header="Ex: `MicroReleaseUpgrade` [origin/pkg/synthetictests/historicaldata/next_best_guess.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L80-L98)" %}} | ||
|
|
||
| ```go | ||
| // MicroReleaseUpgrade if we don't have data for the current fromRelease and it's a minor upgrade, perhaps we have data | ||
| // for a micro upgrade. A 4.10 to 4.11 upgrade will attempt a 4.11 to 4.11 upgrade. | ||
| func MicroReleaseUpgrade(in platformidentification.JobType) (platformidentification.JobType, bool) { | ||
| if len(in.FromRelease) == 0 { | ||
| return platformidentification.JobType{}, false | ||
| } | ||
|
|
||
| fromReleaseMinor := getMinor(in.FromRelease) | ||
| toReleaseMajor := getMajor(in.Release) | ||
| toReleaseMinor := getMinor(in.Release) | ||
| // if we're already a micro upgrade, this doesn't apply | ||
| if fromReleaseMinor == toReleaseMinor { | ||
| return platformidentification.JobType{}, false | ||
| } | ||
|
|
||
| ret := platformidentification.CloneJobType(in) | ||
| ret.FromRelease = fmt.Sprintf("%d.%d", toReleaseMajor, toReleaseMinor) | ||
| return ret, true | ||
| } | ||
| ``` | ||
|
|
||
| {{% /card-code %}} | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| <div class="card" style="margin-top:0.5em"> | ||
| {{- with $.Get "header" -}} | ||
| <div class="card-header"> | ||
| {{- $.Get "header" | markdownify -}} | ||
| </div> | ||
| {{end}} | ||
| <div class="card-body code" style="padding:0.0em 0.85em;"> | ||
| {{ $.Inner }} | ||
| </div> | ||
| </div> |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| <div style="width: 100%; display: inline-block"> | ||
| {{ .Get "file" | readFile | safeHTML }} | ||
| </div> |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.